The many ways to (un)tidy-select

data wrangling dplyr tidyselect

Deconstructing {tidyselect} and building it back up

June Choe (University of Pennsylvania Linguistics) https://live-sas-www-ling.pantheon.sas.upenn.edu/
2023-12-04

Intro

Recently, I’ve been having frequent run-ins with {tidyselect} internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. And as is the case with my usual style of writing, I’m gonna talk about some of the weirder stuff first and then touch on the “practical” side of this.

Some observations

Let’s start with some facts about how {tidyselect} is supposed to work. I’ll use this toy data for the demo:

library(dplyr, warn.conflicts = FALSE)
library(tidyselect)
df <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])
df
  # A tibble: 2 × 3
        x y     z    
    <int> <chr> <chr>
  1     1 a     A    
  2     2 b     B

tidy-select!

{tidyselect} is the package that powers dplyr::select(). If you’ve used {dplyr}, you already know the behavior of select() pretty well. We can specify a column as a string, as a symbol, or by its position:

df %>% 
  select("x")
  # A tibble: 2 × 1
        x
    <int>
  1     1
  2     2
df %>% 
  select(x)
  # A tibble: 2 × 1
        x
    <int>
  1     1
  2     2
df %>% 
  select(1)
  # A tibble: 2 × 1
        x
    <int>
  1     1
  2     2

It’s not obvious from the outside, but the way this works is that these user-supplied expressions (like "x", x, and 1) all get resolved to integer before the selection actually happens.

So to be more specific, the three calls to select() were the same because these three calls to tidyselect::eval_select() are the same:1

eval_select(quote("x"), df)
  x 
  1
eval_select(quote(x), df)
  x 
  1
eval_select(quote(1), df)
  x 
  1
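
To make the “resolved to integer” part a bit more concrete, here’s a small sketch that mixes all three styles in one selection and then subsets by the resulting locations. The name loc is just an illustrative variable, and the toy data is rebuilt so the snippet runs on its own:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyselect)

df <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])

# Mix a string, a symbol, and a position in a single selection;
# the result is a named integer vector of column locations
loc <- eval_select(quote(c("x", z, 2)), df)
loc

# Plain subsetting by those locations picks the same columns, in the same order
names(df[loc])
names(select(df, "x", z, 2))
```

The names on loc are what select() later uses to (re)name the output columns, which is why the vector is named rather than a bare integer vector.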

You can also see eval_select() in action in the <data.frame> method for select():

dplyr:::select.data.frame
  function (.data, ...) 
  {
      error_call <- dplyr_error_call()
      loc <- tidyselect::eval_select(expr(c(...)), data = .data, 
          error_call = error_call)
      loc <- ensure_group_vars(loc, .data, notify = TRUE)
      out <- dplyr_col_select(.data, loc)
      out <- set_names(out, names(loc))
      out
  }
  <bytecode: 0x0000012f8e6de148>
  <environment: namespace:dplyr>
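
Stripped of the grouping and error-call bookkeeping, the same recipe can be sketched in a few lines. The function my_select() below is a hypothetical stand-in, not something {dplyr} exports, and it skips everything the real method does for grouped data:

```r
library(tidyselect)

# Hypothetical minimal select(): resolve to locations, subset, then rename
my_select <- function(.data, ...) {
  loc <- eval_select(rlang::expr(c(...)), data = .data)
  out <- .data[loc]
  rlang::set_names(out, names(loc))
}

df <- data.frame(x = 1:2, y = letters[1:2], z = LETTERS[1:2])

# Selects x and z, renaming z to `last` along the way
my_select(df, x, last = z)
```

The rename falls out for free because eval_select() records the new name in names(loc).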

tidy?-select

Because the column subsetting part is ultimately done using integers, we can theoretically pass select() any expression, as long as it resolves to an integer vector.

For example, we can use 1 + 1 to select the second column:

df %>% 
  select(1 + 1)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

And vector recycling is still a thing here too - we can use c(1, 2) + 1 to select the second and third columns:

df %>% 
  select(c(1, 2) + 1)
  # A tibble: 2 × 2
    y     z    
    <chr> <chr>
  1 a     A    
  2 b     B

Ordinary function calls work as well - we can select a random column using sample():

df %>% 
  select(sample(ncol(df), 1))
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

We can even use the .env pronoun to scope an integer variable from the global environment:2

offset <- 1
df %>% 
  select(1 + .env$offset)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

So that’s kinda interesting.3 But what if we try to mix the different approaches to tidyselect-ing? Can we do math on columns that we’ve selected using strings and symbols?

untidy-select?

Uh not quite. select() doesn’t like doing math on strings and symbols.

df %>% 
  select(x + 1)
  Error in `select()`:
  ! Problem while evaluating `x + 1`.
  Caused by error:
  ! object 'x' not found
df %>% 
  select("x" + 1)
  Error in `select()`:
  ! Problem while evaluating `"x" + 1`.
  Caused by error in `"x" + 1`:
  ! non-numeric argument to binary operator

In fact, it doesn’t even like doing certain kinds of math like multiplication (*), even with numeric constants:

df %>% 
  select(1 * 2)
  Error in `select()`:
  ! Can't use arithmetic operator `*` in selection context.

This actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.

Unless…

untidy-select!

It turns out that {tidyselect} helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (+) working if we wrap our columns in redundant column selection helpers like all_of() and matches():

df %>% 
  select(all_of("x") + 1)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b
df %>% 
  select(matches("^x$") + 1)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b
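
One way to peek at this outside of select() is to hand the column names to a helper directly through its vars argument. This is only a sketch for inspection: the columns vector is made up to mirror the toy data, and I’m using matches() here since all_of() doesn’t take vars.

```r
library(tidyselect)

columns <- c("x", "y", "z")

# Outside of a selection context, the column names can be supplied via `vars`
loc <- matches("^x$", vars = columns)
loc

# What comes back is a location, so ordinary arithmetic applies
columns[loc + 1]
```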

For multiplication, we have to additionally circumvent the censoring of the * symbol. Here, we can simply use a different name for the same operation:4

`%times%` <- `*`
df %>% 
  select(matches("^x$") %times% 2)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

But geez, it’s so tiring to type all_of() and matches() all the time. There must be a better way to break the rule!

Tidying untidy-select

Let’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside select() where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.

Let’s call this function math(). It should let us express stuff like “give me the column to the right of column x” via this intuitive(?) syntax:

df %>% 
  select(math(x + 1))
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

This is my take on math():

math <- function(expr) {
  math_expr <- rlang::enquo(expr)
  columns <- tidyselect::peek_vars()
  col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)
  mask <- rlang::as_data_mask(col_locs)
  out <- rlang::eval_tidy(math_expr, mask)
  out
}

There are a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. Here’s what each local variable in the function looks like for our math(x + 1) example above:

  $math_expr
  <quosure>
  expr: ^x + 1
  env:  0x0000012f8e27cec8
  
  $columns
  [1] "x" "y" "z"
  
  $col_locs
    x y z
  1 1 2 3
  
  $mask
  <environment: 0x0000012f8e3332f0>
  
  $out
  [1] 2

Let’s walk through the pieces:

  1. math_expr: the captured user expression, with the environment attached

  2. columns: the column names of the current dataframe, in order

  3. col_locs: a dataframe of column names and location, created from columns

  4. mask: a data mask created from col_locs

  5. out: location of column(s) to select

Essentially, math() first captures the expression to evaluate it in its own special environment, circumventing select()’s safeguards. Then, it grabs the column names of the data frame with tidyselect::peek_vars() to define col_locs and then mask. The data mask mask is then used inside rlang::eval_tidy() to resolve symbols like x to integer 1 when evaluating the captured expression x + 1. The expression math(x + 1) thus evaluates to 1 + 1. In turn, select(math(x + 1)) is evaluated to select(2), returning us the second column of the dataframe.
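
If you want to poke at this without going through select(), the same thing can be run with eval_select() directly. The sketch below repeats the definition of math() from above so that it runs on its own; the second call with z - x is just an extra illustration of symbols resolving to locations.

```r
library(tidyselect)

# Same definition of math() as above, repeated so the snippet is self-contained
math <- function(expr) {
  math_expr <- rlang::enquo(expr)
  columns <- tidyselect::peek_vars()
  col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)
  mask <- rlang::as_data_mask(col_locs)
  rlang::eval_tidy(math_expr, mask)
}

df <- data.frame(x = 1:2, y = letters[1:2], z = LETTERS[1:2])

# x resolves to 1 inside the mask, so x + 1 is location 2 (column y)
eval_select(quote(math(x + 1)), df)

# z - x is 3 - 1, which is again location 2
eval_select(quote(math(z - x)), df)
```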

Writing untidy-select helpers

A small yet powerful detail in the implementation of math() is the fact that it captures the expression as a quosure. This allows math() to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.

This makes more sense with some examples. Here, I define helper functions that call math() under the hood with their own templatic math expressions (and I have them print() the expression as passed to math() for clarity). The fact that math() captures its argument as a quosure is what allows local variables like n to be correctly scoped in these examples.

1) times()

times <- function(col, n) {
  col <- rlang::ensym(col)
  print(rlang::expr(math(!!col * n))) # for debugging
  math(!!col * n)
}
df %>%
  select(times(x, 2))
  math(x * n)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b
num2 <- 2
df %>%
  select(times(x, num2))
  math(x * n)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

2) offset()

offset <- function(col, n) {
  col <- rlang::ensym(col)
  print(rlang::expr(math(!!col + n))) # for debugging
  math(!!col + n)
}
df %>%
  select(offset(x, 1))
  math(x + n)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b
num1 <- 1
df %>%
  select(offset(x, num1))
  math(x + n)
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

3) neighbors()

neighbors <- function(col, n) {
  col <- rlang::ensym(col)
  range <- c(-(n:1), 1:n)
  print(rlang::expr(math(!!col + !!range))) # for debugging
  math(!!col + !!range)
}
df %>%
  select(neighbors(y, 1))
  math(y + c(-1L, 1L))
  # A tibble: 2 × 2
        x z    
    <int> <chr>
  1     1 A    
  2     2 B
df %>%
  select(neighbors(y, num1))
  math(y + c(-1L, 1L))
  # A tibble: 2 × 2
        x z    
    <int> <chr>
  1     1 A    
  2     2 B
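
The role of the quosure in these helpers can be isolated with a stripped-down sketch that leaves {tidyselect} out entirely. Everything here is hypothetical scaffolding: col_locs stands in for the column locations, and eval_quo(), eval_bare(), shift_quo(), and shift_bare() are made-up names. The only difference between the two evaluators is enquo() vs. enexpr().

```r
col_locs <- list(x = 1L, y = 2L, z = 3L)

# Captures the expression together with the environment it came from
eval_quo <- function(expr) {
  rlang::eval_tidy(rlang::enquo(expr), data = col_locs)
}

# Captures only the expression; it is then evaluated from inside eval_bare()
eval_bare <- function(expr) {
  rlang::eval_tidy(rlang::enexpr(expr), data = col_locs)
}

shift_quo <- function(n) eval_quo(x + n)
shift_bare <- function(n) eval_bare(x + n)

shift_quo(1)        # n is found in shift_quo()'s frame
try(shift_bare(1))  # n is not visible from eval_bare(), so this should error
```

The bare version only fails as long as there is no n lying around in the global environment, which is its own kind of scoping bug waiting to happen.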

DIY!

And of course, we can do arbitrary injections ourselves as well with !! or .env$:

df %>%
  select(math(x * !!num2))
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b
df %>%
  select(math(x * .env$num2))
  # A tibble: 2 × 1
    y    
    <chr>
  1 a    
  2 b

That was fun but probably not super practical. Let’s set math() aside to try our hand at something more useful.

Let’s get practical

1) Sorting columns

Probably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe, which is a mix of columns that follow some fixed pattern ("(x|y)_\\d") and those outside that pattern ("year", "day", etc.).

data_cols <- expand.grid(first = c("x", "y"), second = 1:3) %>%
  mutate(cols = paste0(first, "_", second)) %>%
  pull(cols)
df2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)
df2 <- cbind(df2, storms[1,1:5])
df2 <- df2[, sample(ncol(df2))]
df2
    y_3 x_3 month day hour y_2 y_1 x_2 year name x_1
  1   6   5     6  27    0   4   2   3 1975  Amy   1

It’s trivial to select columns by pattern - we can use the matches() helper:

df2 %>%
  select(matches("(x|y)_(\\d)"))
    y_3 x_3 y_2 y_1 x_2 x_1
  1   6   5   4   2   3   1

But what if I also wanted to further sort these columns, after I select them? There’s no easy way to do this “on the fly” inside of select, especially if we want the flexibility to sort the columns by the letter vs. the number.

But here’s one way of getting at that, exploiting two facts:

  1. matches(), like other tidyselect helpers, immediately resolves the selection to integer
  2. peek_vars() returns the column names in order, which lets us recover the column names from location

And that’s pretty much all there is to the tidyselect magic that goes into my solution below. I define locs (integer vector of column locations) and cols (character vector of column names at those locations), and the rest is just regex and sorting:

ordered_matches <- function(matches, order) {
  # tidyselect magic
  locs <- tidyselect::matches(matches)
  cols <- tidyselect::peek_vars()[locs]
  # Ordinary evaluation
  # One row per capture group, one column per matched column name
  groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]
  # Sort by the capture groups, in the priority given by `order`
  reordered <- do.call("order", asplit(groups[order, ], 1))
  locs[reordered]
}
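
The “ordinary evaluation” half is easier to follow on a plain character vector. Here’s a sketch of just that part, with cols hard-coded to the matched column names from df2 (in the order they appear there) and pattern as an illustrative variable name:

```r
cols <- c("y_3", "x_3", "y_2", "y_1", "x_2", "x_1")
pattern <- "(x|y)_(\\d)"

# Row 1 holds the letter, row 2 holds the number; one column per name
groups <- simplify2array(regmatches(cols, regexec(pattern, cols)))[-1, ]
groups

# Sort by number (row 2) first, then by letter (row 1)
reordered <- do.call("order", asplit(groups[c(2, 1), ], 1))
cols[reordered]
```

Note that the capture groups are compared as strings, so a pattern that matches multi-digit numbers would sort "10" before "2" unless the groups are converted first.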

Using ordered_matches(), we can not only select columns but also sort them using regex capture groups.

This sorts the columns by letter first then number:

df2 %>%
  select(ordered_matches("(x|y)_(\\d)", c(1, 2)))
    x_1 x_2 x_3 y_1 y_2 y_3
  1   1   3   5   2   4   6

This sorts the columns by number first then letter:

df2 %>%
  select(ordered_matches("(x|y)_(\\d)", c(2, 1)))
    x_1 y_1 x_2 y_2 x_3 y_3
  1   1   2   3   4   5   6

And if we wanted the other columns too, we can use everything() to grab the “rest”:

df2 %>%
  select(ordered_matches("(x|y)_(\\d)", c(2, 1)), everything())
    x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name
  1   1   2   3   4   5   6     6  27    0 1975  Amy

2) Error handling

One of the really nice parts about the {tidyselect} design is the fact that error messages are very informative.

For example, if you select a non-existing column, it errors while pointing out that mistake:

df3 <- data.frame(x = 1)
nonexistent_selection <- quote(c(x, y))
eval_select(nonexistent_selection, df3)
  Error:
  ! Can't subset columns that don't exist.
  ✖ Column `y` doesn't exist.

If you use a tidyselect helper that returns nothing, it won’t complain by default:

zero_selection <- quote(starts_with("z"))
eval_select(zero_selection, df3)
  named integer(0)

But you can make that error with allow_empty = FALSE:

eval_select(zero_selection, df3, allow_empty = FALSE)
  Error:
  ! Must select at least one item.

General evaluation errors are caught and chained:

evaluation_error <- quote(stop("I'm a bad expression!"))
eval_select(evaluation_error, df3)
  Error:
  ! Problem while evaluating `stop("I'm a bad expression!")`.
  Caused by error:
  ! I'm a bad expression!

These error signalling patterns are clearly very useful for users,5 but there’s a little gem in there for developers too. It turns out that the error condition object contains this information too, which lets you detect different error types programmatically and forward them to your own error handling logic.

For example, the attempted non-existent column is stored in $i:6

cnd_nonexistent <- rlang::catch_cnd(
  eval_select(nonexistent_selection, df3)
)
cnd_nonexistent$i
  [1] "y"

Zero column selections give you NULL in $i when you set it to error:

cnd_zero_selection <- rlang::catch_cnd(
  eval_select(zero_selection, df3, allow_empty = FALSE)
)
cnd_zero_selection$i
  NULL

General evaluation errors are distinguished by having a $parent:

cnd_evaluation_error <- rlang::catch_cnd(
  eval_select(evaluation_error, df3)
)
cnd_evaluation_error$parent
  <simpleError in eval_tidy(as_quosure(expr, env), context_mask): I'm a bad expression!>
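
Putting the three observations together, a rough sketch of such error-handling logic could look like the following. The function classify_selection() and its return strings are hypothetical, and the checks lean only on the $parent and $i fields as observed above, which I’d treat as version-dependent details rather than a documented contract.

```r
library(tidyselect)

# Hypothetical helper: sort a selection attempt into coarse buckets
classify_selection <- function(expr, data, ...) {
  cnd <- rlang::catch_cnd(eval_select(expr, data, ...), classes = "error")
  if (is.null(cnd)) return("ok")
  # Chained errors come from evaluating the expression itself
  if (!is.null(cnd$parent)) return("evaluation error")
  # A non-NULL $i names the column that could not be found
  if (!is.null(cnd$i)) return(paste("nonexistent column:", cnd$i))
  "other selection error (e.g. empty selection)"
}

df3 <- data.frame(x = 1)
classify_selection(quote(x), df3)
classify_selection(quote(c(x, y)), df3)
classify_selection(quote(starts_with("z")), df3, allow_empty = FALSE)
classify_selection(quote(stop("I'm a bad expression!")), df3)
```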

Again, this is more useful as a developer, if you’re building something that integrates {tidyselect}.7 But I personally find this interesting to know about anyways!

Conclusion

Here I end with the (usual) disclaimer to not actually just copy paste these for production - they’re written with the very low standard of scratching my itch, so they do not come with any warranty!

But I hope that this was a fun exercise in thinking through one of the most mysterious magics in {dplyr}. I’m sure to reference this many times in the future myself.


  1. The examples quote("x") and quote(1) are redundant because "x" and 1 are constants. I keep quote() in there just to make the comparison clearer.

  2. Not to be confused with all_of(). The idiomatic pattern for scoping an external character vector is to do all_of(x) not .env$x. It’s only when you’re scoping a non-character vector that you’d use .env$.

  3. It’s also strangely reminiscent of my previous blog post on dplyr::slice().

  4. Thanks to Jonathan Carroll for this suggestion!

  5. For those who actually read error messages, at least (points to self) …

  6. Though {tidyselect} errors early, so it’ll only record the first attempted column causing the error. You could use a while() loop (catch and remove bad columns from the data until there’s no more error) if you really wanted to get the full set of offending columns.

  7. If you want some examples of post-processing tidyselect errors, there’s some stuff I did for pointblank that may be helpful as a reference.