ave() for the average {dplyr} user

dplyr

tidyverse 🤝 base R

June Choe (University of Pennsylvania Linguistics) https://live-sas-www-ling.pantheon.sas.upenn.edu/
2024-06-23

I think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse 🤝 base R.

ave()

ave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code.1

ave
  function (x, ..., FUN = mean) 
  {
      if (missing(...)) 
          x[] <- FUN(x)
      else {
          g <- interaction(...)
          split(x, g) <- lapply(split(x, g), FUN)
      }
      x
  }
  <bytecode: 0x0000029931326f80>
  <environment: namespace:stats>
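The heavy lifting there is done by the replacement form of split(). Here is a minimal sketch of that mechanism with a toy vector (not from the post), mirroring what ave() does in its else branch:

```r
x <- c(10, 20, 30, 40)
g <- c("a", "b", "a", "b")

# Split x by group, aggregate each group, then assign the results
# back into the original positions (recycled within each group)
split(x, g) <- lapply(split(x, g), sum)
x
#> [1] 40 60 40 60
```

Each element of x has been replaced by its group’s sum, while the length and order of x are preserved.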

Despite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().

A unique feature of ave() is the invariant that it returns a vector of the same length as the input. So if you use an aggregating function like sum() or mean(), it simply repeats each group’s value over all observations in that group.

For example, whereas tapply() can be used to summarize the average mpg by cyl:

tapply(mtcars$mpg, mtcars$cyl, FUN = mean)
         4        6        8 
  26.66364 19.74286 15.10000

The same syntax with ave() will repeat those values over each element of the input vector:

ave(mtcars$mpg, mtcars$cyl, FUN = mean)
   [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364
   [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000
  [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000
  [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364

You can also get to this output from tapply() with an extra step of vectorized indexing:

tapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]
         6        6        4        6        8        6        8        4 
  19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 
         4        6        6        8        8        8        8        8 
  26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 
         8        4        4        4        4        8        8        8 
  15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 
         8        4        4        4        8        6        8        4 
  15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364

The problem

Nothing sparks more joy than when a base R function helps you write more “tidy” code. I’ve talked about this at length before with outer() in a prior blog post on dplyr::slice(), and here I want to show a cool ave() + dplyr::mutate() combo.

This example is adapted from a reprex by Cédric Scherer2 on the DSLC (previously R4DS) slack.

Given an input of multiple discrete columns and the frequencies of these values:

input <- data.frame(
  a = c("A", "A", "A", "B"), 
  b = c("X", "Y", "Y", "Z"), 
  c = c("M", "N", "O", "O"), 
  freq = c(5, 12, 3, 7)
)
input
    a b c freq
  1 A X M    5
  2 A Y N   12
  3 A Y O    3
  4 B Z O    7

The task is to add new columns named freq_* that show the total frequency of the values in each column:

output <- data.frame(
  a = c("A", "A", "A", "B"), 
  freq_a = c(20, 20, 20, 7),
  b = c("X", "Y", "Y", "Z"),
  freq_b = c(5, 15, 15, 7), 
  c = c("M", "N", "O", "O"), 
  freq_c = c(5, 12, 10, 10), 
  freq = c(5, 12, 3, 7)
)
output
    a freq_a b freq_b c freq_c freq
  1 A     20 X      5 M      5    5
  2 A     20 Y     15 N     12   12
  3 A     20 Y     15 O     10    3
  4 B      7 Z      7 O     10    7

For example, in column a the value “A” is associated with the values 5, 12, and 3 in the freq column, so a new freq_a column should record their total frequency 5 + 12 + 3 and associate that value (20) with all occurrences of “A” in the a column.

Some {tidyverse} solutions

The gut feeling is that this problem lacks a straightforward “tidy” solution. I mean, the input isn’t even tidy3 in the first place!

So maybe we’d be better off starting from a pivoted, tidy version of the data to construct a tidy solution:
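One way to get there, using tidyr::pivot_longer() (input is repeated here so the chunk is self-contained):

```r
library(tidyr)

# `input` as defined earlier in the post
input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)

# Pivot every column except `freq` into name-value pairs
pivot_longer(input, -freq)
```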

  # A tibble: 12 × 3
      freq name  value
     <dbl> <chr> <chr>
   1     5 a     A    
   2     5 b     X    
   3     5 c     M    
   4    12 a     A    
   5    12 b     Y    
   6    12 c     N    
   7     3 a     A    
   8     3 b     Y    
   9     3 c     O    
  10     7 a     B    
  11     7 b     Z    
  12     7 c     O

But recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:

input %>% 
  pivot_longer(-freq) %>% 
  ... %>% 
  pivot_wider(...)

Or maybe you’d rather tackle this with some left_join()s, like:

input %>% 
  left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% 
  ...
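Spelled out fully, that join-based approach does work - here is a sketch (with explicit by= keys added, and input repeated for self-containment), though the summarize() + left_join() pair has to be written once per column:

```r
library(dplyr)

# `input` as defined earlier in the post
input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)

# One summarize() + left_join() per column: repetitive, but correct
input %>% 
  left_join(summarize(input, freq_a = sum(freq), .by = a), by = "a") %>% 
  left_join(summarize(input, freq_b = sum(freq), .by = b), by = "b") %>% 
  left_join(summarize(input, freq_c = sum(freq), .by = c), by = "c")
```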

I’ll note that there’s actually also an idiomatic {dplyr} solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because add_count() doesn’t vectorize over its first argument - it has to be called once per column:

input %>% 
  add_count(a, wt = freq, name = "freq_a") %>% 
  add_count(b, wt = freq, name = "freq_b") %>% 
  add_count(c, wt = freq, name = "freq_c")
    a b c freq freq_a freq_b freq_c
  1 A X M    5     20      5      5
  2 A Y N   12     20     15     12
  3 A Y O    3     20     15     10
  4 B Z O    7      7      7     10

You could try to scale this add_count() solution with reduce() (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from “tidy” territory:

input %>% 
  purrr::reduce(
    c("a", "b", "c"),
    ~ .x %>% 
      add_count(.data[[.y]], wt = freq, name = paste0("freq_", .y)),
    .init = .
  )
    a b c freq freq_a freq_b freq_c
  1 A X M    5     20      5      5
  2 A Y N   12     20     15     12
  3 A Y O    3     20     15     10
  4 B Z O    7      7      7     10

IMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage you to take a stab at this yourself before proceeding if you’ve read this far!

An ave() + {dplyr} solution

The crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.

The problem becomes much simpler once we think of it in terms of “(column) vector wrangling” first, and that’s where ave() comes in!

I’ll start with the cake first - this is the one-liner ave() solution I advocated for:

input %>% 
  mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = "freq_{.col}"))
    a b c freq freq_a freq_b freq_c
  1 A X M    5     20      5      5
  2 A Y N   12     20     15     12
  3 A Y O    3     20     15     10
  4 B Z O    7      7      7     10

Taking column freq_a as an example, the ave() part of the solution essentially creates this vector of summed-up freq values by the categories of a:

ave(input$freq, input$a, FUN = sum)
  [1] 20 20 20  7

From there, across() handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient {glue} syntax ("freq_{.col}").
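In other words, the across() version above expands to roughly this longhand mutate() - same result, with one ave() call written out per column (input repeated for self-containment):

```r
library(dplyr)

# `input` as defined earlier in the post
input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)

# What across(a:c, ..., .names = "freq_{.col}") writes for us
input %>% 
  mutate(
    freq_a = ave(freq, a, FUN = sum),
    freq_b = ave(freq, b, FUN = sum),
    freq_c = ave(freq, c, FUN = sum)
  )
```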

It’s the perfect mashup of base R + tidyverse. Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with ave(), and tidyverse scales that solution up to the dataframe level with mutate() and across().

tidyverse 🤝 base R

Aside: {data.table} 🤝 {collapse}

Since I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) on the LHS of the walrus operator :=. I’m so excited for this to hit the next release (v1.16.0)!

I’ve been trying to be more mindful of showcasing {data.table} whenever I talk about {dplyr}, so here’s a solution to compare with the dplyr::across() solution above.

# data.table::update_dev_pkg()
library(data.table)
input_dt <- as.data.table(input)
input_dt
          a      b      c  freq
     <char> <char> <char> <num>
  1:      A      X      M     5
  2:      A      Y      N    12
  3:      A      Y      O     3
  4:      B      Z      O     7
input_dt[, paste0("freq_", names(.SD)) := lapply(.SD, \(x) ave(freq, x, FUN = sum)), .SDcols = a:c]
input_dt
          a      b      c  freq freq_a freq_b freq_c
     <char> <char> <char> <num>  <num>  <num>  <num>
  1:      A      X      M     5     20      5      5
  2:      A      Y      N    12     20     15     12
  3:      A      Y      O     3     20     15     10
  4:      B      Z      O     7      7      7     10

In practice, I often pair {data.table} with {collapse}, where the latter provides a rich and performant set of split-apply-combine vector operations much like ave(). In {collapse}, ave(..., FUN = sum) can be expressed as fsum(..., TRA = "replace"):

library(collapse)
ave(input_dt$freq, input_dt$a, FUN = sum)
  [1] 20 20 20  7
fsum(input_dt$freq, input_dt$a, TRA = "replace") # Also, TRA = 2
  [1] 20 20 20  7

So a version of the solution integrating fsum() would be:4

# First, drop the freq_* columns added above (input_dt was modified by reference)
input_dt[, names(.SD) := NULL, .SDcols = patterns("^freq_")]
input_dt[, paste0("freq_", names(.SD)) := lapply(.SD, \(x) fsum(freq, x, TRA = 2)), .SDcols = a:c]
input_dt
          a      b      c  freq freq_a freq_b freq_c
     <char> <char> <char> <num>  <num>  <num>  <num>
  1:      A      X      M     5     20      5      5
  2:      A      Y      N    12     20     15     12
  3:      A      Y      O     3     20     15     10
  4:      B      Z      O     7      7      7     10

data.table 🤝 collapse

sessionInfo()

  R version 4.4.1 (2024-06-14 ucrt)
  Platform: x86_64-w64-mingw32/x64
  Running under: Windows 11 x64 (build 22631)
  
  Matrix products: default
  
  
  locale:
  [1] LC_COLLATE=English_United States.utf8 
  [2] LC_CTYPE=English_United States.utf8   
  [3] LC_MONETARY=English_United States.utf8
  [4] LC_NUMERIC=C                          
  [5] LC_TIME=English_United States.utf8    
  
  time zone: Asia/Seoul
  tzcode source: internal
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     
  
  other attached packages:
   [1] collapse_2.0.14    data.table_1.15.99 lubridate_1.9.3    forcats_1.0.0     
   [5] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5       
   [9] tidyr_1.3.1        tibble_3.2.1       tidyverse_2.0.0    ggplot2_3.5.1     
  
  loaded via a namespace (and not attached):
   [1] gtable_0.3.5      jsonlite_1.8.8    compiler_4.4.1    Rcpp_1.0.12      
   [5] tidyselect_1.2.1  parallel_4.4.1    jquerylib_0.1.4   scales_1.3.0     
   [9] yaml_2.3.8        fastmap_1.1.1     R6_2.5.1          generics_0.1.3   
  [13] knitr_1.47        distill_1.6       munsell_0.5.0     tzdb_0.4.0       
  [17] bslib_0.7.0       pillar_1.9.0      rlang_1.1.4       utf8_1.2.4       
  [21] stringi_1.8.4     cachem_1.0.8      xfun_0.44         sass_0.4.9       
  [25] timechange_0.2.0  memoise_2.0.1     cli_3.6.2         withr_3.0.0      
  [29] magrittr_2.0.3    digest_0.6.35     grid_4.4.1        rstudioapi_0.16.0
  [33] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       downlit_0.4.3    
  [37] evaluate_0.23     glue_1.7.0        fansi_1.0.6       colorspace_2.1-0 
  [41] rmarkdown_2.27    tools_4.4.1       pkgconfig_2.0.3   htmltools_0.5.8.1

  1. And check out the elusive split<- function!

  2. Who I can only assume was needing this for a fancy data viz thing 😆

  3. I mean that in the technical sense here. In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).

  4. I couldn’t show this here with this particular example, but another nice feature of {collapse} 🤝 {data.table} is the fact that they do not shy away from consuming/producing matrices: see scale()[,1] vs. fscale() for a good example of this.