# ave() for the average {dplyr} user

tidyverse 🤝 base R

[June Choe](https://live-sas-www-ling.pantheon.sas.upenn.edu/) (University of Pennsylvania Linguistics)
2024-06-09

I think it’s safe to say that the average `{dplyr}` user does not know the `ave()` function. For that audience, this is a short appreciation post on `ave()`, a case of tidyverse 🤝 base R.

## `ave()`

`ave()` is a split-apply-combine function in base R (specifically, `{stats}`). It’s a pretty short function - maybe you can make out what it does from just reading the code:¹

```r
ave
```
```
function (x, ..., FUN = mean) 
{
    if (missing(...)) 
        x[] <- FUN(x)
    else {
        g <- interaction(...)
        split(x, g) <- lapply(split(x, g), FUN)
    }
    x
}
<bytecode: 0x0000020fc12974b8>
<environment: namespace:stats>
```
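
The `split<-` replacement form in that body (see footnote 1) is what puts the transformed values back into their original positions. Here is a minimal demo of that mechanic, using a toy vector and a proportion-of-group-total transformation (my own example, not from `ave()`’s docs):

```r
x <- c(10, 20, 30, 40)
g <- c("a", "b", "a", "b")

# Split by group, transform each piece, then reassign in place:
# positions are preserved, so x keeps its original order and length.
split(x, g) <- lapply(split(x, g), function(v) v / sum(v))
x
#> [1] 0.2500000 0.3333333 0.7500000 0.6666667
```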

Despite its (rather generic and uninformative) name, I like to think of `ave()` as actually belonging to the `*apply()` family of functions, having particularly close ties to `tapply()`.

A unique feature of `ave()` is the invariant that it returns a vector of the same length as the input. And if you use an aggregating function like `sum()` or `mean()`, it simply repeats those values over the observations on the basis of their grouping.
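
As a toy illustration of that invariant:

```r
x <- c(1, 2, 3, 10, 20)
g <- c("a", "a", "a", "b", "b")

# The group means (2 for "a", 15 for "b") are recycled within each group,
# so the result is always the same length as the input.
ave(x, g, FUN = mean)
#> [1]  2  2  2 15 15
```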

For example, whereas `tapply()` can be used to summarize the average `mpg` by `cyl`:

```r
tapply(mtcars$mpg, mtcars$cyl, FUN = mean)
```
```
       4        6        8 
26.66364 19.74286 15.10000 
```

The same syntax with `ave()` will repeat those values over each element of the input vector:

```r
ave(mtcars$mpg, mtcars$cyl, FUN = mean)
```
```
 [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364
 [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000
[17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000
[25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364
```

You can also get to this output from `tapply()` with an extra step of vectorized indexing:

```r
tapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]
```
```
       6        6        4        6        8        6        8        4 
19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 
       4        6        6        8        8        8        8        8 
26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 
       8        4        4        4        4        8        8        8 
15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 
       8        4        4        4        8        6        8        4 
15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364 
```
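
Note also that `ave()` accepts multiple grouping vectors via `...`, which it combines with `interaction()` under the hood. For example, to average `mpg` within each `cyl`-by-`am` cell (my own example):

```r
# One mean per cyl-am combination, repeated over the matching rows
ave(mtcars$mpg, mtcars$cyl, mtcars$am, FUN = mean)
```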

## The problem

Nothing sparks more joy than when a base R function helps you write more “tidy” code. I’ve talked about this at length before with `outer()` in a prior blog post on `dplyr::slice()`, and here I want to show a cool `ave()` + `dplyr::mutate()` combo.

This example is adapted from a reprex by Cédric Scherer² on the DSLC (previously R4DS) Slack.

Given an input of multiple discrete columns and the frequencies of these values:

```r
input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)
input
```
```
  a b c freq
1 A X M    5
2 A Y N   12
3 A Y O    3
4 B Z O    7
```

The task is to add new columns named `freq_*` that show the total frequency of the values in each column:

```r
output <- data.frame(
  a = c("A", "A", "A", "B"),
  freq_a = c(20, 20, 20, 7),
  b = c("X", "Y", "Y", "Z"),
  freq_b = c(5, 15, 15, 7),
  c = c("M", "N", "O", "O"),
  freq_c = c(5, 12, 10, 10),
  freq = c(5, 12, 3, 7)
)
output
```
```
  a freq_a b freq_b c freq_c freq
1 A     20 X      5 M      5    5
2 A     20 Y     15 N     12   12
3 A     20 Y     15 O     10    3
4 B      7 Z      7 O     10    7
```

So, for example, the value `"A"` in column `a` is associated with the values `5`, `12`, and `3` in the `freq` column, so a new `freq_a` column should record their total frequency `5 + 12 + 3` and associate that value (`20`) with all occurrences of `"A"` in the `a` column.
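
As a quick sanity check on that arithmetic for one cell:

```r
# Total freq for the "A" rows of column a: 5 + 12 + 3
sum(input$freq[input$a == "A"])
#> [1] 20
```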

## Some `{tidyverse}` solutions

The gut feeling is that this seems to lack a straightforwardly “tidy” solution. I mean, the input isn’t even tidy³ in the first place!

So maybe we’d be better off pivoting the data into a tidy, long format first, and constructing a tidy solution from there:

```r
library(tidyverse)
input %>% 
  pivot_longer(-freq)
```
```
# A tibble: 12 × 3
    freq name  value
   <dbl> <chr> <chr>
 1     5 a     A
 2     5 b     X
 3     5 c     M
 4    12 a     A
 5    12 b     Y
 6    12 c     N
 7     3 a     A
 8     3 b     Y
 9     3 c     O
10     7 a     B
11     7 b     Z
12     7 c     O
```

But recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:

```r
input %>% 
  pivot_longer(-freq) %>% 
  ... %>% 
  pivot_wider(...)
```
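
One possible way to fill in those blanks (a sketch of my own, using a hypothetical `row` helper column to preserve row identity across the pivots):

```r
input %>%
  mutate(row = row_number()) %>%                            # track row identity
  pivot_longer(c(a, b, c)) %>%
  mutate(freq_total = sum(freq), .by = c(name, value)) %>%  # per-value totals
  pivot_wider(names_from = name, values_from = c(value, freq_total))
```

followed by some renaming of the resulting `value_*`/`freq_total_*` columns to match the desired output.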

Or maybe you’d rather tackle this with some `left_join()`s, like:

```r
input %>% 
  left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% 
  ...
```
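
Spelled out in full, that approach repeats one `left_join()` per column - something like:

```r
input %>%
  left_join(summarize(input, freq_a = sum(freq), .by = a), by = "a") %>%
  left_join(summarize(input, freq_b = sum(freq), .by = b), by = "b") %>%
  left_join(summarize(input, freq_c = sum(freq), .by = c), by = "c")
```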

I’ll note that there’s actually also an idiomatic `{dplyr}` solution to this using the lesser-known function `add_count()`, but it can’t avoid the repetitiveness problem because `add_count()` doesn’t vectorize over its first argument:

```r
input %>% 
  add_count(a, wt = freq, name = "freq_a") %>% 
  add_count(b, wt = freq, name = "freq_b") %>% 
  add_count(c, wt = freq, name = "freq_c")
```
```
  a b c freq freq_a freq_b freq_c
1 A X M    5     20      5      5
2 A Y N   12     20     15     12
3 A Y O    3     20     15     10
4 B Z O    7      7      7     10
```

You could try to scale this `add_count()` solution with `reduce()` (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from the “tidy” territory:

```r
input %>% 
  purrr::reduce(
    c("a", "b", "c"),
    ~ .x %>% 
      add_count(.data[[.y]], wt = freq, name = paste0("freq_", .y)),
    .init = .
  )
```
```
  a b c freq freq_a freq_b freq_c
1 A X M    5     20      5      5
2 A Y N   12     20     15     12
3 A Y O    3     20     15     10
4 B Z O    7      7      7     10
```

IMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage you to take a stab at this yourself before proceeding if you’ve read this far!

## An `ave()` + `{dplyr}` solution

The crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.

The problem becomes simpler once we think about it in terms of “(column) vector wrangling” first, and that’s where `ave()` comes in!

I’ll start with the cake - this is the one-liner `ave()` solution I advocated for:

```r
input %>% 
  mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = "freq_{.col}"))
```
```
  a b c freq freq_a freq_b freq_c
1 A X M    5     20      5      5
2 A Y N   12     20     15     12
3 A Y O    3     20     15     10
4 B Z O    7      7      7     10
```

Taking column `freq_a` as an example, the `ave()` part of the solution essentially creates this vector of summed-up `freq` values by the categories of `a`:

```r
ave(input$freq, input$a, FUN = sum)
```
```
[1] 20 20 20  7
```

From there, `across()` handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient `{glue}` syntax (`"freq_{.col}"`).

It’s the perfect mashup of base R + tidyverse. Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with `ave()`, and tidyverse scales that solution up to the dataframe level with `mutate()` and `across()`.
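
For comparison, the same division of labor can be sketched in base R alone (my own sketch), with `lapply()` standing in for `across()` and the column names built by hand:

```r
out <- input
# ave() does the per-column split-apply-combine; lapply() iterates over columns
out[paste0("freq_", c("a", "b", "c"))] <-
  lapply(input[c("a", "b", "c")], function(col) ave(input$freq, col, FUN = sum))
out
```

This produces the same `freq_a`, `freq_b`, and `freq_c` columns (appended after `freq` rather than interleaved).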

tidyverse 🤝 base R

## sessionInfo()

```r
sessionInfo()
```
```
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Asia/Seoul
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4
[9] tidyverse_2.0.0 ggplot2_3.5.1

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      jsonlite_1.8.8    compiler_4.3.3    tidyselect_1.2.1
 [5] jquerylib_0.1.4   scales_1.3.0      yaml_2.3.8        fastmap_1.1.1
 [9] R6_2.5.1          generics_0.1.3    knitr_1.47        distill_1.6
[13] munsell_0.5.0     tzdb_0.4.0        bslib_0.7.0       pillar_1.9.0
[17] rlang_1.1.4       utf8_1.2.4        stringi_1.8.4     cachem_1.0.8
[21] xfun_0.44         sass_0.4.9        timechange_0.2.0  memoise_2.0.1
[25] cli_3.6.2         withr_3.0.0       magrittr_2.0.3    digest_0.6.35
[29] grid_4.3.3        rstudioapi_0.16.0 hms_1.1.3         lifecycle_1.0.4
[33] vctrs_0.6.5       downlit_0.4.3     evaluate_0.23     glue_1.7.0
[37] fansi_1.0.6       colorspace_2.1-0  rmarkdown_2.27    tools_4.3.3
[41] pkgconfig_2.0.3   htmltools_0.5.8.1
```

1. And check out the elusive `split<-` function!↩︎

2. Who I can only assume was needing this for a fancy data viz thing 😆↩︎

3. I mean that in the technical sense here. In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).↩︎