Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.
Using a cleaned up version of penguins
data from {palmerpenguins}
:
rename()
inside select()
You can rename a column inside select()
by assigning a new name on the left hand side:
##### Long Form #####
# penguins %>%
# select(species, island) %>%
# rename(penguin_species = species)
penguins %>%
select(penguin_species = species,
island)
# A tibble: 333 x 2
penguin_species island
<fct> <fct>
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
# ... with 323 more rows
This also works with {tidyselect}
helpers like starts_with()
, ends_with()
, contains()
, and matches()
:
##### Long Form #####
# penguins %>%
# select(species, island) %>%
# rename(penguin_species = species,
# weight = body_weight_g)
penguins %>%
select(penguin_species = species,
island,
weight = contains("mass"))
# A tibble: 333 x 3
penguin_species island weight
<fct> <fct> <int>
1 Adelie Torgersen 3750
2 Adelie Torgersen 3800
3 Adelie Torgersen 3250
4 Adelie Torgersen 3450
5 Adelie Torgersen 3650
6 Adelie Torgersen 3625
7 Adelie Torgersen 4675
8 Adelie Torgersen 3200
9 Adelie Torgersen 3800
10 Adelie Torgersen 4400
# ... with 323 more rows
rename()
inside count()
You can rename the new column of counts (n
by default) using the name
argument:
##### Long Form #####
# penguins %>%
# count(species) %>%
# rename(total = n)
penguins %>%
count(species, name = "total")
# A tibble: 3 x 2
species total
<fct> <int>
1 Adelie 146
2 Chinstrap 68
3 Gentoo 119
You can also rename the column(s) that are selected for counting in the same way as shown in the select()
examples above:
##### Long Form #####
# penguins %>%
# count(species) %>%
# rename(total = n,
# penguin_species = species)
penguins %>%
count(penguin_species = species, name = "total")
# A tibble: 3 x 2
penguin_species total
<fct> <int>
1 Adelie 146
2 Chinstrap 68
3 Gentoo 119
Note that the new name passed into the name
argument must be quoted, but the new name for selected column needs not to be unquoted:
identical(
# Method 1: new column name UNQUOTED
penguins %>%
count(penguin_species = species, name = "total"),
# Method 2: new column name QUOTED
penguins %>%
count("penguin_species" = species, name = "total")
)
[1] TRUE
I prefer to unquote the new column names to keep it consistent with the recommended style for rename()
This feature of select()
may seem weird and hackish (and I guess it sort of is in this demonstration) but it’s explicitly documented here if you want to read more on it.
mutate()
inside count()
You can also create a new column to count by inside count()
. This works very similarly to the above, but I think it’s worth its own mention.
It’s pretty simple - you just do what you’d do for mutate()
inside count()
:
##### Long Form #####
# penguins %>%
# mutate(long_beak = bill_length_mm > 50) %>%
# count(long_beak)
penguins %>%
count(long_beak = bill_length_mm > 50)
# A tibble: 2 x 2
long_beak n
<lgl> <int>
1 FALSE 281
2 TRUE 52
And of course, this also works when specifying multiple variables to count by:
##### Long Form #####
# penguins %>%
# mutate(long_beak = bill_length_mm > 50,
# is_adelie = species == "Adelie") %>%
# count(is_adelie, long_beak)
penguins %>%
count(long_beak = bill_length_mm > 50,
is_adelie = species == "Adelie")
# A tibble: 3 x 3
long_beak is_adelie n
<lgl> <lgl> <int>
1 FALSE FALSE 135
2 FALSE TRUE 146
3 TRUE FALSE 52
transmute()
+ select()
transmute()
is a function that mutates columns and returns only those columns:
##### Long Form #####
# penguins %>%
# mutate(body_mass_kg = body_mass_g/1000) %>%
# select(body_mass_kg)
penguins %>%
transmute(body_mass_kg = body_mass_g/1000)
# A tibble: 333 x 1
body_mass_kg
<dbl>
1 3.75
2 3.8
3 3.25
4 3.45
5 3.65
6 3.62
7 4.68
8 3.2
9 3.8
10 4.4
# ... with 323 more rows
I’ve rarely used transmute()
in the past because I thought it could only return modified columns, which would be very limiting (like in the above example, what good is a single column of penguin body mass in kilograms?)
But actually you can just name the columns you want to include in transmute()
like you would in select()
to carry over columns that you aren’t modifying. And of course, you can “rename” them as you do it1:
##### Long Form #####
# penguins %>%
# mutate(body_mass_kg = body_mass_g/1000) %>%
# select(species, island, body_mass_kg) %>%
# rename(penguin_species = species)
penguins %>%
transmute(penguin_species = species,
island,
body_mass_kg = body_mass_g/1000)
# A tibble: 333 x 3
penguin_species island body_mass_kg
<fct> <fct> <dbl>
1 Adelie Torgersen 3.75
2 Adelie Torgersen 3.8
3 Adelie Torgersen 3.25
4 Adelie Torgersen 3.45
5 Adelie Torgersen 3.65
6 Adelie Torgersen 3.62
7 Adelie Torgersen 4.68
8 Adelie Torgersen 3.2
9 Adelie Torgersen 3.8
10 Adelie Torgersen 4.4
# ... with 323 more rows
ungroup()
inside summarize()
I always found using ungroup()
after summarize()
to be extremely ugly, but I found myself using it a lot to remove left-over groupings after a summarize
call:
penguins %>%
group_by(island, species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
ungroup()
# A tibble: 5 x 3
island species mean_mass
<fct> <fct> <dbl>
1 Biscoe Adelie 3710.
2 Biscoe Gentoo 5092.
3 Dream Adelie 3701.
4 Dream Chinstrap 3733.
5 Torgersen Adelie 3709.
… because summarize()
only drops the last grouping variable by defaut, meaning that the output is still grouped by the island
variable if ungroup()
isn’t called:
# Without ungroup()
penguins %>%
group_by(island, species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
group_vars()
[1] "island"
# With ungroup()
penguins %>%
group_by(island, species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
ungroup() %>%
group_vars()
character(0)
Since {dplyr}
1.0.0, you can simply set the .groups
argument inside summarize()
to 'drop'
to achieve the same:
penguins %>%
group_by(island, species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE), .groups = 'drop')
# A tibble: 5 x 3
island species mean_mass
<fct> <fct> <dbl>
1 Biscoe Adelie 3710.
2 Biscoe Gentoo 5092.
3 Dream Adelie 3701.
4 Dream Chinstrap 3733.
5 Torgersen Adelie 3709.
But ungroup()
still remains relevant as you can now selectively remove grouping variables in {dplyr} 1.0.0.
arrange()
+ other features inside slice()
In past versions of {dplyr}
, if you wanted to grab the top n rows sorted by a column, you’d use top_n()
, which provides a simpler way of doing slice()
+ arrange()
:
##### Long Form #####
# penguins %>%
# arrange(desc(body_mass_g)) %>%
# slice(1:5)
penguins %>%
top_n(5, wt = body_mass_g)
# A tibble: 6 x 8
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Gentoo Biscoe 49.2 15.2 221 6300 male
2 Gentoo Biscoe 59.6 17 230 6050 male
3 Gentoo Biscoe 51.1 16.3 220 6000 male
4 Gentoo Biscoe 45.2 16.4 223 5950 male
5 Gentoo Biscoe 49.8 15.9 229 5950 male
6 Gentoo Biscoe 48.8 16.2 222 6000 male
# ... with 1 more variable: year <int>
But the recent {dplyr}
1.0.0 augmented slice()
with variants like slice_min()
and slice_max()
that now supresede top_n()
:
##### Pre-1.0.0 #####
# penguins %>%
# top_n(5, wt = body_mass_g)
penguins %>%
slice_max(order_by = body_mass_g, n = 5)
# A tibble: 6 x 8
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Gentoo Biscoe 49.2 15.2 221 6300 male
2 Gentoo Biscoe 59.6 17 230 6050 male
3 Gentoo Biscoe 51.1 16.3 220 6000 male
4 Gentoo Biscoe 48.8 16.2 222 6000 male
5 Gentoo Biscoe 45.2 16.4 223 5950 male
6 Gentoo Biscoe 49.8 15.9 229 5950 male
# ... with 1 more variable: year <int>
Note that the order of arguments is different for slice_min/max()
- the first argument after piping is where you specify the variable for ordering rather than the number of rows to return, like in top_n()
.
This is because slice_min/max()
gives you an option to either specify a certain number of rows n
or a proportion of rows prop
:
penguins %>%
slice_max(body_mass_g, prop = .01)
# A tibble: 4 x 8
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Gentoo Biscoe 49.2 15.2 221 6300 male
2 Gentoo Biscoe 59.6 17 230 6050 male
3 Gentoo Biscoe 51.1 16.3 220 6000 male
4 Gentoo Biscoe 48.8 16.2 222 6000 male
# ... with 1 more variable: year <int>
And actually, the most significant change with the new slice_*()
functions is from adding appropriate behavior for grouped dataframes.
So for example, this example below returns the top 5% of penguins by weight for each species:
penguins %>%
group_by(species) %>%
slice_max(body_mass_g, prop = .05)
# A tibble: 16 x 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 43.2 19 197 4775
2 Adelie Biscoe 41 20 203 4725
3 Adelie Torge~ 42.9 17.6 196 4700
4 Adelie Torge~ 39.2 19.6 195 4675
5 Adelie Dream 39.8 19.1 184 4650
6 Adelie Dream 39.6 18.8 190 4600
7 Adelie Biscoe 45.6 20.3 191 4600
8 Chinst~ Dream 52 20.7 210 4800
9 Chinst~ Dream 52.8 20 205 4550
10 Chinst~ Dream 53.5 19.9 205 4500
11 Gentoo Biscoe 49.2 15.2 221 6300
12 Gentoo Biscoe 59.6 17 230 6050
13 Gentoo Biscoe 51.1 16.3 220 6000
14 Gentoo Biscoe 48.8 16.2 222 6000
15 Gentoo Biscoe 45.2 16.4 223 5950
16 Gentoo Biscoe 49.8 15.9 229 5950
# ... with 2 more variables: sex <fct>, year <int>
But note that slice_*()
functions do not modify groups in the result if the input is a grouped dataframe, so you need to explicitly add a call to ungroup()
if you want to drop groups after slicing.
add_count()
Saving my favorite lesser-known {dplyr}
function for last!
add_count()
adds a column with the counts of each group (or combination of groups):
##### Long Form #####
# penguins %>%
# group_by(species) %>%
# mutate(count_by_species = n()) %>%
# ungroup()
penguins %>%
add_count(species, name = "count_by_species") %>%
# cutting down some columns to show the new column
select(-contains("mm"))
# A tibble: 333 x 6
species island body_mass_g sex year count_by_species
<fct> <fct> <int> <fct> <int> <int>
1 Adelie Torgersen 3750 male 2007 146
2 Adelie Torgersen 3800 female 2007 146
3 Adelie Torgersen 3250 female 2007 146
4 Adelie Torgersen 3450 female 2007 146
5 Adelie Torgersen 3650 male 2007 146
6 Adelie Torgersen 3625 female 2007 146
7 Adelie Torgersen 4675 male 2007 146
8 Adelie Torgersen 3200 female 2007 146
9 Adelie Torgersen 3800 male 2007 146
10 Adelie Torgersen 4400 male 2007 146
# ... with 323 more rows
You can use the wt
to effectively get sums by group (perhaps hackish but very very useful):
##### Long Form #####
# penguins %>%
# group_by(species) %>%
# mutate(total_weight_by_species = sum(body_mass_g)) %>%
# ungroup()
penguins %>%
add_count(species, wt = body_mass_g, name = "total_weight_by_species") %>%
# cutting down some columns to show the new column
select(-contains("mm"))
# A tibble: 333 x 6
species island body_mass_g sex year total_weight_by_species
<fct> <fct> <int> <fct> <int> <int>
1 Adelie Torgersen 3750 male 2007 541100
2 Adelie Torgersen 3800 female 2007 541100
3 Adelie Torgersen 3250 female 2007 541100
4 Adelie Torgersen 3450 female 2007 541100
5 Adelie Torgersen 3650 male 2007 541100
6 Adelie Torgersen 3625 female 2007 541100
7 Adelie Torgersen 4675 male 2007 541100
8 Adelie Torgersen 3200 female 2007 541100
9 Adelie Torgersen 3800 male 2007 541100
10 Adelie Torgersen 4400 male 2007 541100
# ... with 323 more rows
Also check out its more primitive version add_tally()
.
By default, add_tally()
adds a count of rows, which you can already do with mutate(n = n())
, but it shines when you make use of its wt
argument:
penguins %>%
add_count(species, wt = body_mass_g, name = "total_weight_by_species") %>%
add_tally(wt = body_mass_g, name = "total_weight_of_all_species") %>%
select(1:2, last_col(0):last_col(1))
# A tibble: 333 x 4
species island total_weight_of_all_species total_weight_by_species
<fct> <fct> <int> <int>
1 Adelie Torgersen 1400950 541100
2 Adelie Torgersen 1400950 541100
3 Adelie Torgersen 1400950 541100
4 Adelie Torgersen 1400950 541100
5 Adelie Torgersen 1400950 541100
6 Adelie Torgersen 1400950 541100
7 Adelie Torgersen 1400950 541100
8 Adelie Torgersen 1400950 541100
9 Adelie Torgersen 1400950 541100
10 Adelie Torgersen 1400950 541100
# ... with 323 more rows
What happens under the hood is actually copying of a sort, so this is probably not the best approach if you care about efficiency. As a case in point, you can’t use {tidyselect}
helpers in transmute
because you’re creating a new dataframe↩︎