Saving a line of piping

data wrangling dplyr tutorial

Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.

June Choe (University of Pennsylvania Linguistics)https://live-sas-www-ling.pantheon.sas.upenn.edu/
08-07-2020

Using a cleaned up version of penguins data from {palmerpenguins}:

data("penguins", package = "palmerpenguins")

penguins <- na.omit(penguins)

1. rename() inside select()

You can rename a column inside select() by assigning a new name on the left hand side:

##### Long Form #####
# penguins %>% 
#   select(species, island) %>% 
#   rename(penguin_species = species)

penguins %>% 
  select(penguin_species = species,
         island)
  # A tibble: 333 x 2
     penguin_species island   
     <fct>           <fct>    
   1 Adelie          Torgersen
   2 Adelie          Torgersen
   3 Adelie          Torgersen
   4 Adelie          Torgersen
   5 Adelie          Torgersen
   6 Adelie          Torgersen
   7 Adelie          Torgersen
   8 Adelie          Torgersen
   9 Adelie          Torgersen
  10 Adelie          Torgersen
  # ... with 323 more rows

This also works with {tidyselect} helpers like starts_with(), ends_with(), contains(), and matches():

##### Long Form #####
# penguins %>% 
#   select(species, island) %>% 
#   rename(penguin_species = species,
#          weight = body_weight_g)

penguins %>% 
  select(penguin_species = species,
         island,
         weight = contains("mass"))
  # A tibble: 333 x 3
     penguin_species island    weight
     <fct>           <fct>      <int>
   1 Adelie          Torgersen   3750
   2 Adelie          Torgersen   3800
   3 Adelie          Torgersen   3250
   4 Adelie          Torgersen   3450
   5 Adelie          Torgersen   3650
   6 Adelie          Torgersen   3625
   7 Adelie          Torgersen   4675
   8 Adelie          Torgersen   3200
   9 Adelie          Torgersen   3800
  10 Adelie          Torgersen   4400
  # ... with 323 more rows

2. rename() inside count()

You can rename the new column of counts (n by default) using the name argument:

##### Long Form #####
# penguins %>% 
#   count(species) %>% 
#   rename(total = n)

penguins %>% 
  count(species, name = "total")
  # A tibble: 3 x 2
    species   total
    <fct>     <int>
  1 Adelie      146
  2 Chinstrap    68
  3 Gentoo      119

You can also rename the column(s) that are selected for counting in the same way as shown in the select() examples above:

##### Long Form #####
# penguins %>% 
#   count(species) %>% 
#   rename(total = n,
#          penguin_species = species)

penguins %>% 
  count(penguin_species = species, name = "total")
  # A tibble: 3 x 2
    penguin_species total
    <fct>           <int>
  1 Adelie            146
  2 Chinstrap          68
  3 Gentoo            119

Note that the new name passed into the name argument must be quoted, but the new name for selected column needs not to be unquoted:

identical(
  # Method 1: new column name UNQUOTED
  penguins %>% 
    count(penguin_species = species, name = "total"),
  # Method 2: new column name QUOTED
  penguins %>% 
    count("penguin_species" = species, name = "total") 
)
  [1] TRUE

I prefer to unquote the new column names to keep it consistent with the recommended style for rename()

This feature of select() may seem weird and hackish (and I guess it sort of is in this demonstration) but it’s explicitly documented here if you want to read more on it.

3. mutate() inside count()

You can also create a new column to count by inside count(). This works very similarly to the above, but I think it’s worth its own mention.

It’s pretty simple - you just do what you’d do for mutate() inside count():

##### Long Form #####
# penguins %>% 
#   mutate(long_beak = bill_length_mm > 50) %>% 
#   count(long_beak)

penguins %>% 
  count(long_beak = bill_length_mm > 50)
  # A tibble: 2 x 2
    long_beak     n
    <lgl>     <int>
  1 FALSE       281
  2 TRUE         52

And of course, this also works when specifying multiple variables to count by:

##### Long Form #####
# penguins %>% 
#   mutate(long_beak = bill_length_mm > 50,
#          is_adelie = species == "Adelie") %>% 
#   count(is_adelie, long_beak)

penguins %>% 
  count(long_beak = bill_length_mm > 50,
        is_adelie = species == "Adelie")
  # A tibble: 3 x 3
    long_beak is_adelie     n
    <lgl>     <lgl>     <int>
  1 FALSE     FALSE       135
  2 FALSE     TRUE        146
  3 TRUE      FALSE        52

4. transmute() + select()

transmute() is a function that mutates columns and returns only those columns:

##### Long Form #####
# penguins %>% 
#   mutate(body_mass_kg = body_mass_g/1000) %>% 
#   select(body_mass_kg)

penguins %>% 
  transmute(body_mass_kg = body_mass_g/1000)
  # A tibble: 333 x 1
     body_mass_kg
            <dbl>
   1         3.75
   2         3.8 
   3         3.25
   4         3.45
   5         3.65
   6         3.62
   7         4.68
   8         3.2 
   9         3.8 
  10         4.4 
  # ... with 323 more rows

I’ve rarely used transmute() in the past because I thought it could only return modified columns, which would be very limiting (like in the above example, what good is a single column of penguin body mass in kilograms?)

But actually you can just name the columns you want to include in transmute() like you would in select() to carry over columns that you aren’t modifying. And of course, you can “rename” them as you do it1:

##### Long Form #####
# penguins %>% 
#   mutate(body_mass_kg = body_mass_g/1000) %>% 
#   select(species, island, body_mass_kg) %>% 
#   rename(penguin_species = species)

penguins %>% 
  transmute(penguin_species = species,
            island,
            body_mass_kg = body_mass_g/1000)
  # A tibble: 333 x 3
     penguin_species island    body_mass_kg
     <fct>           <fct>            <dbl>
   1 Adelie          Torgersen         3.75
   2 Adelie          Torgersen         3.8 
   3 Adelie          Torgersen         3.25
   4 Adelie          Torgersen         3.45
   5 Adelie          Torgersen         3.65
   6 Adelie          Torgersen         3.62
   7 Adelie          Torgersen         4.68
   8 Adelie          Torgersen         3.2 
   9 Adelie          Torgersen         3.8 
  10 Adelie          Torgersen         4.4 
  # ... with 323 more rows

5. ungroup() inside summarize()

I always found using ungroup() after summarize() to be extremely ugly, but I found myself using it a lot to remove left-over groupings after a summarize call:

penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% 
  ungroup()
  # A tibble: 5 x 3
    island    species   mean_mass
    <fct>     <fct>         <dbl>
  1 Biscoe    Adelie        3710.
  2 Biscoe    Gentoo        5092.
  3 Dream     Adelie        3701.
  4 Dream     Chinstrap     3733.
  5 Torgersen Adelie        3709.

… because summarize() only drops the last grouping variable by defaut, meaning that the output is still grouped by the island variable if ungroup() isn’t called:

# Without ungroup()
penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% 
  group_vars()
  [1] "island"
# With ungroup()
penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% 
  ungroup() %>% 
  group_vars()
  character(0)

Since {dplyr} 1.0.0, you can simply set the .groups argument inside summarize() to 'drop' to achieve the same:

penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE), .groups = 'drop')
  # A tibble: 5 x 3
    island    species   mean_mass
    <fct>     <fct>         <dbl>
  1 Biscoe    Adelie        3710.
  2 Biscoe    Gentoo        5092.
  3 Dream     Adelie        3701.
  4 Dream     Chinstrap     3733.
  5 Torgersen Adelie        3709.

But ungroup() still remains relevant as you can now selectively remove grouping variables in {dplyr} 1.0.0.

6. arrange() + other features inside slice()

In past versions of {dplyr}, if you wanted to grab the top n rows sorted by a column, you’d use top_n(), which provides a simpler way of doing slice() + arrange():

##### Long Form #####
# penguins %>% 
#   arrange(desc(body_mass_g)) %>% 
#   slice(1:5)

penguins %>% 
  top_n(5, wt = body_mass_g)
  # A tibble: 6 x 8
    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
    <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
  1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
  2 Gentoo  Biscoe           59.6          17                230        6050 male 
  3 Gentoo  Biscoe           51.1          16.3              220        6000 male 
  4 Gentoo  Biscoe           45.2          16.4              223        5950 male 
  5 Gentoo  Biscoe           49.8          15.9              229        5950 male 
  6 Gentoo  Biscoe           48.8          16.2              222        6000 male 
  # ... with 1 more variable: year <int>

But the recent {dplyr} 1.0.0 augmented slice() with variants like slice_min() and slice_max() that now supresede top_n():

##### Pre-1.0.0 #####
# penguins %>% 
#   top_n(5, wt = body_mass_g)

penguins %>% 
  slice_max(order_by = body_mass_g, n = 5)
  # A tibble: 6 x 8
    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
    <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
  1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
  2 Gentoo  Biscoe           59.6          17                230        6050 male 
  3 Gentoo  Biscoe           51.1          16.3              220        6000 male 
  4 Gentoo  Biscoe           48.8          16.2              222        6000 male 
  5 Gentoo  Biscoe           45.2          16.4              223        5950 male 
  6 Gentoo  Biscoe           49.8          15.9              229        5950 male 
  # ... with 1 more variable: year <int>

Note that the order of arguments is different for slice_min/max() - the first argument after piping is where you specify the variable for ordering rather than the number of rows to return, like in top_n().

This is because slice_min/max() gives you an option to either specify a certain number of rows n or a proportion of rows prop:

penguins %>% 
  slice_max(body_mass_g, prop = .01)
  # A tibble: 4 x 8
    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
    <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
  1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
  2 Gentoo  Biscoe           59.6          17                230        6050 male 
  3 Gentoo  Biscoe           51.1          16.3              220        6000 male 
  4 Gentoo  Biscoe           48.8          16.2              222        6000 male 
  # ... with 1 more variable: year <int>

And actually, the most significant change with the new slice_*() functions is from adding appropriate behavior for grouped dataframes.

So for example, this example below returns the top 5% of penguins by weight for each species:

penguins %>% 
  group_by(species) %>% 
  slice_max(body_mass_g, prop = .05)
  # A tibble: 16 x 8
  # Groups:   species [3]
     species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
     <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
   1 Adelie  Biscoe           43.2          19                197        4775
   2 Adelie  Biscoe           41            20                203        4725
   3 Adelie  Torge~           42.9          17.6              196        4700
   4 Adelie  Torge~           39.2          19.6              195        4675
   5 Adelie  Dream            39.8          19.1              184        4650
   6 Adelie  Dream            39.6          18.8              190        4600
   7 Adelie  Biscoe           45.6          20.3              191        4600
   8 Chinst~ Dream            52            20.7              210        4800
   9 Chinst~ Dream            52.8          20                205        4550
  10 Chinst~ Dream            53.5          19.9              205        4500
  11 Gentoo  Biscoe           49.2          15.2              221        6300
  12 Gentoo  Biscoe           59.6          17                230        6050
  13 Gentoo  Biscoe           51.1          16.3              220        6000
  14 Gentoo  Biscoe           48.8          16.2              222        6000
  15 Gentoo  Biscoe           45.2          16.4              223        5950
  16 Gentoo  Biscoe           49.8          15.9              229        5950
  # ... with 2 more variables: sex <fct>, year <int>

But note that slice_*() functions do not modify groups in the result if the input is a grouped dataframe, so you need to explicitly add a call to ungroup() if you want to drop groups after slicing.

7. count and sum by group with add_count()

Saving my favorite lesser-known {dplyr} function for last!

add_count() adds a column with the counts of each group (or combination of groups):

##### Long Form #####
# penguins %>% 
#   group_by(species) %>% 
#   mutate(count_by_species = n()) %>% 
#   ungroup()

penguins %>% 
  add_count(species, name = "count_by_species") %>% 
  # cutting down some columns to show the new column
  select(-contains("mm"))
  # A tibble: 333 x 6
     species island    body_mass_g sex     year count_by_species
     <fct>   <fct>           <int> <fct>  <int>            <int>
   1 Adelie  Torgersen        3750 male    2007              146
   2 Adelie  Torgersen        3800 female  2007              146
   3 Adelie  Torgersen        3250 female  2007              146
   4 Adelie  Torgersen        3450 female  2007              146
   5 Adelie  Torgersen        3650 male    2007              146
   6 Adelie  Torgersen        3625 female  2007              146
   7 Adelie  Torgersen        4675 male    2007              146
   8 Adelie  Torgersen        3200 female  2007              146
   9 Adelie  Torgersen        3800 male    2007              146
  10 Adelie  Torgersen        4400 male    2007              146
  # ... with 323 more rows

You can use the wt to effectively get sums by group (perhaps hackish but very very useful):

##### Long Form #####
# penguins %>% 
#   group_by(species) %>% 
#   mutate(total_weight_by_species = sum(body_mass_g)) %>% 
#   ungroup()
  

penguins %>% 
  add_count(species, wt = body_mass_g, name = "total_weight_by_species") %>% 
    # cutting down some columns to show the new column
  select(-contains("mm"))
  # A tibble: 333 x 6
     species island    body_mass_g sex     year total_weight_by_species
     <fct>   <fct>           <int> <fct>  <int>                   <int>
   1 Adelie  Torgersen        3750 male    2007                  541100
   2 Adelie  Torgersen        3800 female  2007                  541100
   3 Adelie  Torgersen        3250 female  2007                  541100
   4 Adelie  Torgersen        3450 female  2007                  541100
   5 Adelie  Torgersen        3650 male    2007                  541100
   6 Adelie  Torgersen        3625 female  2007                  541100
   7 Adelie  Torgersen        4675 male    2007                  541100
   8 Adelie  Torgersen        3200 female  2007                  541100
   9 Adelie  Torgersen        3800 male    2007                  541100
  10 Adelie  Torgersen        4400 male    2007                  541100
  # ... with 323 more rows

Also check out its more primitive version add_tally().

By default, add_tally() adds a count of rows, which you can already do with mutate(n = n()), but it shines when you make use of its wt argument:

penguins %>% 
  add_count(species, wt = body_mass_g, name = "total_weight_by_species") %>% 
  add_tally(wt = body_mass_g, name = "total_weight_of_all_species") %>% 
  select(1:2, last_col(0):last_col(1))
  # A tibble: 333 x 4
     species island    total_weight_of_all_species total_weight_by_species
     <fct>   <fct>                           <int>                   <int>
   1 Adelie  Torgersen                     1400950                  541100
   2 Adelie  Torgersen                     1400950                  541100
   3 Adelie  Torgersen                     1400950                  541100
   4 Adelie  Torgersen                     1400950                  541100
   5 Adelie  Torgersen                     1400950                  541100
   6 Adelie  Torgersen                     1400950                  541100
   7 Adelie  Torgersen                     1400950                  541100
   8 Adelie  Torgersen                     1400950                  541100
   9 Adelie  Torgersen                     1400950                  541100
  10 Adelie  Torgersen                     1400950                  541100
  # ... with 323 more rows

  1. What happens under the hood is actually copying of a sort, so this is probably not the best approach if you care about efficiency. As a case in point, you can’t use {tidyselect} helpers in transmute because you’re creating a new dataframe↩︎