# The many ways to (un)tidy-select

Deconstructing {tidyselect} and building it back up

June Choe (University of Pennsylvania Linguistics) https://live-sas-www-ling.pantheon.sas.upenn.edu/
2023-12-04

## Intro

Recently, I’ve been having frequent run-ins with `{tidyselect}` internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. And as is the case with my usual style of writing, I’m gonna talk about some of the weirder stuff first and then touch on the “practical” side of this.

## Some observations

Let’s start with some facts about how `{tidyselect}` is supposed to work. I’ll use this toy data for the demo:

``````library(dplyr, warn.conflicts = FALSE)
library(tidyselect)
df <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])
df``````
``````  # A tibble: 2 × 3
x y     z
<int> <chr> <chr>
1     1 a     A
2     2 b     B``````

### tidy-select!

`{tidyselect}` is the package that powers `dplyr::select()`. If you’ve used `{dplyr}`, you already know the behavior of `select()` pretty well. We can specify a column as a string, as a symbol, or by its position:

``````df %>%
select("x")``````
``````  # A tibble: 2 × 1
x
<int>
1     1
2     2``````
``````df %>%
select(x)``````
``````  # A tibble: 2 × 1
x
<int>
1     1
2     2``````
``````df %>%
select(1)``````
``````  # A tibble: 2 × 1
x
<int>
1     1
2     2``````

It’s not obvious from the outside, but the way this works is that these user-supplied expressions (like `"x"`, `x`, and `1`) all get resolved to integer column locations before the selection actually happens.

So to be more specific, the three calls to `select()` were the same because these three calls to `tidyselect::eval_select()` are the same:¹

``eval_select(quote("x"), df)``
``````  x
1``````
``eval_select(quote(x), df)``
``````  x
1``````
``eval_select(quote(1), df)``
``````  x
1``````
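To make the "everything becomes a location" point concrete, here is a small sketch that collects the three results side by side. It uses a base `data.frame` stand-in for the toy data so it runs without `{dplyr}`; the `locs` list is just an illustrative name.

```r
library(tidyselect)

df <- data.frame(x = 1:2, y = letters[1:2], z = LETTERS[1:2])

# Three ways of writing the same selection
locs <- list(
  string   = eval_select(quote("x"), df),
  symbol   = eval_select(quote(x), df),
  position = eval_select(quote(1), df)
)

# Each result is a location vector named after the selected column
str(locs)

# Subsetting by that location is all that's left to do
df[locs$symbol]
```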

You can also see `eval_select()` in action in the `<data.frame>` method for `select()`:

``dplyr:::select.data.frame``
``````  function (.data, ...)
{
error_call <- dplyr_error_call()
loc <- tidyselect::eval_select(expr(c(...)), data = .data,
error_call = error_call)
loc <- ensure_group_vars(loc, .data, notify = TRUE)
out <- dplyr_col_select(.data, loc)
out <- set_names(out, names(loc))
out
}
<bytecode: 0x0000012f8e6de148>
<environment: namespace:dplyr>``````
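Stripped of the grouping and error-context bits, the same pattern fits in a few lines. This is only a sketch of the idea, not how `{dplyr}` does it: `mini_select()` is a made-up name, and it assumes a plain data frame.

```r
library(tidyselect)

# Hypothetical stripped-down select(): no grouping, no dplyr error context
mini_select <- function(.data, ...) {
  # Resolve the user's expressions to a named integer vector of locations
  loc <- eval_select(rlang::expr(c(...)), data = .data)
  # Subset by location, then apply the (possibly renamed) names
  out <- .data[loc]
  names(out) <- names(loc)
  out
}

df <- data.frame(x = 1:2, y = letters[1:2], z = LETTERS[1:2])
mini_select(df, z, first = x)
```

Because the names of `loc` carry any renaming, `first = x` comes out as a column called `first`.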

### tidy?-select

Because the column subsetting part is ultimately done using integers, we can theoretically pass `select()` any expression, as long as it resolves to an integer vector.

For example, we can use `1 + 1` to select the second column:

``````df %>%
select(1 + 1)``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

And vector recycling is still a thing here too - we can use `c(1, 2) + 1` to select the second and third columns:

``````df %>%
select(c(1, 2) + 1)``````
``````  # A tibble: 2 × 2
y     z
<chr> <chr>
1 a     A
2 b     B``````

Ordinary function calls work as well - we can select a random column using `sample()`:

``````df %>%
select(sample(ncol(df), 1))``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

We can even use the `.env` pronoun to scope a numeric variable from the global environment:²

``````offset <- 1
df %>%
select(1 + .env$offset)``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

So that’s kinda interesting.³ But what if we try to mix the different approaches to tidyselect-ing? Can we do math on columns that we’ve selected using strings and symbols?

### untidy-select?

Uh not quite. `select()` doesn’t like doing math on strings and symbols.

``````df %>%
select(x + 1)``````
``````  Error in `select()`:
! Problem while evaluating `x + 1`.
Caused by error:
! object 'x' not found``````
``````df %>%
select("x" + 1)``````
``````  Error in `select()`:
! Problem while evaluating `"x" + 1`.
Caused by error in `"x" + 1`:
! non-numeric argument to binary operator``````

In fact, it doesn’t even like doing certain kinds of math like multiplication (`*`), even with numeric constants:

``````df %>%
select(1 * 2)``````
``````  Error in `select()`:
! Can't use arithmetic operator `*` in selection context.``````

This actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.

Unless…

### untidy-select!

It turns out that `{tidyselect}` helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (`+`) working if we wrap our columns in redundant column selection helpers like `all_of()` and `matches()`:

``````df %>%
select(all_of("x") + 1)``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````
``````df %>%
select(matches("^x$") + 1)``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````
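One way to watch that resolution happen outside of `select()` is with `tidyselect::with_vars()`, which temporarily registers a set of column names for the helpers to look up. A minimal sketch, assuming the same three column names as the toy data:

```r
library(tidyselect)

# with_vars() temporarily registers column names, as a selecting function would
vars <- c("x", "y", "z")

# The helper returns a location right away, not a name or a lazy object
loc <- with_vars(vars, matches("^x$"))
loc

# Once it's a plain number, arithmetic on it is just arithmetic
with_vars(vars, matches("^x$") + 1)
```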

For multiplication, we have to additionally circumvent the censoring of the `*` symbol. Here, we can simply use a different name for the same operation:⁴

```````%times%` <- `*`
df %>%
select(matches("^x$") %times% 2)``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

But geez, it’s so tiring to type `all_of()` and `matches()` all the time. There must be a better way to break the rule!

## Tidying untidy-select

Let’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside `select()` where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.

Let’s call this function `math()`. It should let us express stuff like “give me the column to the right of column `x`” via this intuitive(?) syntax:

``````df %>%
select(math(x + 1))``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

This is my take on `math()`:

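``````math <- function(expr) {
  # Capture the expression along with its environment
  math_expr <- rlang::enquo(expr)
  columns <- tidyselect::peek_vars()
  col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)
  # Evaluate with column names standing in for their locations
  mask <- rlang::as_data_mask(col_locs)
  out <- rlang::eval_tidy(math_expr, mask)
  out
}``````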

There’s a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. Here’s what each local variable in the function looks like for our `math(x + 1)` example above:

``````  $math_expr
<quosure>
expr: ^x + 1
env:  0x0000012f8e27cec8

$columns
[1] "x" "y" "z"

$col_locs
  x y z
1 1 2 3

$mask
<environment: 0x0000012f8e3332f0>

$out
[1] 2``````

Let’s walk through the pieces:

1. `math_expr`: the captured user expression, with the environment attached

2. `columns`: the column names of the current dataframe, in order

3. `col_locs`: a dataframe of column names and location, created from `columns`

4. `mask`: a data mask created from `col_locs`

5. `out`: location of column(s) to select

Essentially, `math()` first captures the expression to evaluate it in its own special environment, circumventing `select()`’s safeguards. Then, it grabs the column names of the data frame with `tidyselect::peek_vars()` to define `col_locs` and then `mask`. The data mask `mask` is then used inside `rlang::eval_tidy()` to resolve symbols like `x` to integer `1` when evaluating the captured expression `x + 1`. The expression `math(x + 1)` thus evaluates to `1 + 1`. In turn, `select(math(x + 1))` is evaluated to `select(2)`, returning us the second column of the dataframe.
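The masking step can also be tried on its own, without `select()` or `peek_vars()`. The sketch below hard-codes the column names and passes a named list straight to `eval_tidy()` as the data, which is a simplification of the `col_locs` plus `mask` steps described above.

```r
library(rlang)

columns <- c("x", "y", "z")
# Map each column name to its position: list(x = 1L, y = 2L, z = 3L)
col_locs <- as.list(set_names(seq_along(columns), columns))

# With the locations as the data mask, symbols evaluate to positions
eval_tidy(quo(x + 1), data = col_locs)
#> [1] 2
eval_tidy(quo(c(x, y) + 1), data = col_locs)
#> [1] 2 3
```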

## Writing untidy-select helpers

A small yet powerful detail in the implementation of `math()` is the fact that it captures the expression as a quosure. This allows `math()` to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.

This makes more sense with some examples. Here, I define helper functions that call `math()` under the hood with their own templatic math expressions (and I have them `print()` the expression as passed to `math()` for clarity). The fact that `math()` captures its argument as a quosure is what allows local variables like `n` to be correctly scoped in these examples.
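To see the scoping in isolation first, here is a small sketch contrasting a quosure with a bare expression. It uses `quo()` and `expr()` in place of `enquo()` for brevity, and `make_quo()` / `make_expr()` are made-up names.

```r
library(rlang)

make_quo <- function() {
  n <- 10       # local to make_quo()
  quo(x + n)    # quo() records make_quo()'s environment
}
make_expr <- function() {
  n <- 10
  expr(x + n)   # expr() keeps only the code, not the environment
}

locs <- list(x = 1L)

# The quosure finds `x` in the data and `n` in its own environment
eval_tidy(make_quo(), data = locs)

# The bare expression has no way back to make_expr()'s `n`
# (this errors unless an `n` happens to exist where it's evaluated)
try(eval_tidy(make_expr(), data = locs))
```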

### 1) `times()`

``````times <- function(col, n) {
col <- rlang::ensym(col)
print(rlang::expr(math(!!col * n))) # for debugging
math(!!col * n)
}
df %>%
select(times(x, 2))``````
``  math(x * n)``
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````
``````num2 <- 2
df %>%
select(times(x, num2))``````
``  math(x * n)``
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

### 2) `offset()`

``````offset <- function(col, n) {
col <- rlang::ensym(col)
print(rlang::expr(math(!!col + n))) # for debugging
math(!!col + n)
}
df %>%
select(offset(x, 1))``````
``  math(x + n)``
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````
``````num1 <- 1
df %>%
select(offset(x, num1))``````
``  math(x + n)``
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

### 3) `neighbors()`

``````neighbors <- function(col, n) {
col <- rlang::ensym(col)
range <- c(-(n:1), 1:n)
print(rlang::expr(math(!!col + !!range))) # for debugging
math(!!col + !!range)
}
df %>%
select(neighbors(y, 1))``````
``  math(y + c(-1L, 1L))``
``````  # A tibble: 2 × 2
x z
<int> <chr>
1     1 A
2     2 B``````
``````df %>%
select(neighbors(y, num1))``````
``  math(y + c(-1L, 1L))``
``````  # A tibble: 2 × 2
x z
<int> <chr>
1     1 A
2     2 B``````

### DIY!

And of course, we can do arbitrary injections ourselves as well with `!!` or `.env$`:

``````df %>%
select(math(x * !!num2))``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````
``````df %>%
select(math(x * .env$num2))``````
``````  # A tibble: 2 × 1
y
<chr>
1 a
2 b``````

That was fun but probably not super practical. Let’s set `math()` aside to try our hand at something more useful.

## Let’s get practical

### 1) Sorting columns

Probably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe, which is a mix of columns that follow some fixed pattern (`"(x|y)_\\d"`) and those outside that pattern (`"year"`, `"day"`, etc.).

``````data_cols <- expand.grid(first = c("x", "y"), second = 1:3) %>%
mutate(cols = paste0(first, "_", second)) %>%
pull(cols)
df2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)
df2 <- cbind(df2, storms[1,1:5])
df2 <- df2[, sample(ncol(df2))]
df2``````
``````    y_3 x_3 month day hour y_2 y_1 x_2 year name x_1
1   6   5     6  27    0   4   2   3 1975  Amy   1``````

It’s trivial to select columns by pattern - we can use the `matches()` helper:

``````df2 %>%
select(matches("(x|y)_(\\d)"))``````
``````    y_3 x_3 y_2 y_1 x_2 x_1
1   6   5   4   2   3   1``````

But what if I also wanted to further sort these columns, after I select them? There’s no easy way to do this “on the fly” inside of select, especially if we want the flexibility to sort the columns by the letter vs. the number.

But here’s one way of getting at that, exploiting two facts:

1. `matches()`, like other tidyselect helpers, immediately resolves the selection to integer
2. `peek_vars()` returns the column names in order, which lets us recover the column names from location

And that’s pretty much all there is to the tidyselect magic that goes into my solution below. I define `locs` (integer vector of column locations) and `cols` (character vector of column names at those locations), and the rest is just regex and sorting:

``````ordered_matches <- function(matches, order) {
# tidyselect magic
locs <- tidyselect::matches(matches)
cols <- tidyselect::peek_vars()[locs]
# Ordinary evaluation
groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]
reordered <- do.call("order", asplit(groups[order, ], 1))
locs[reordered]
}``````
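The "ordinary evaluation" half is easier to follow when run by itself on a plain character vector. This sketch spells out the two sort orders with explicit `order()` calls rather than the `do.call()` form, which is equivalent when there are two capture groups.

```r
cols <- c("y_3", "x_3", "y_2", "y_1", "x_2", "x_1")
pattern <- "(x|y)_(\\d)"

# One column per name, one row per capture group (full match dropped)
groups <- simplify2array(regmatches(cols, regexec(pattern, cols)))[-1, ]
groups

# Sort by letter, then number
by_letter <- order(groups[1, ], groups[2, ])
cols[by_letter]
#> [1] "x_1" "x_2" "x_3" "y_1" "y_2" "y_3"

# Sort by number, then letter
by_number <- order(groups[2, ], groups[1, ])
cols[by_number]
#> [1] "x_1" "y_1" "x_2" "y_2" "x_3" "y_3"
```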

Using `ordered_matches()`, we can not only select columns but also sort them using regex capture groups.

This sorts the columns by letter first then number:

``````df2 %>%
select(ordered_matches("(x|y)_(\\d)", c(1, 2)))``````
``````    x_1 x_2 x_3 y_1 y_2 y_3
1   1   3   5   2   4   6``````

This sorts the columns by number first then letter:

``````df2 %>%
select(ordered_matches("(x|y)_(\\d)", c(2, 1)))``````
``````    x_1 y_1 x_2 y_2 x_3 y_3
1   1   2   3   4   5   6``````

And if we wanted the other columns too, we can use `everything()` to grab the “rest”:

``````df2 %>%
select(ordered_matches("(x|y)_(\\d)", c(2, 1)), everything())``````
``````    x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name
1   1   2   3   4   5   6     6  27    0 1975  Amy``````

### 2) Error handling

One of the really nice parts about the `{tidyselect}` design is the fact that error messages are very informative.

For example, if you select a non-existing column, it errors while pointing out that mistake:

``````df3 <- data.frame(x = 1)
nonexistent_selection <- quote(c(x, y))
eval_select(nonexistent_selection, df3)``````
``````  Error:
! Can't subset columns that don't exist.
✖ Column `y` doesn't exist.``````

If you use a tidyselect helper that returns nothing, it won’t complain by default:

``````zero_selection <- quote(starts_with("z"))
eval_select(zero_selection, df3)``````
``  named integer(0)``

But you can make that an error with `allow_empty = FALSE`:

``eval_select(zero_selection, df3, allow_empty = FALSE)``
``````  Error:
! Must select at least one item.``````

General evaluation errors are caught and chained:

``````evaluation_error <- quote(stop("I'm a bad expression!"))
eval_select(evaluation_error, df3)``````
``````  Error:
! Problem while evaluating `stop("I'm a bad expression!")`.
Caused by error:
! I'm a bad expression!``````

These error signalling patterns are clearly very useful for users,⁵ but there’s a little gem in there for developers too. It turns out that the error condition object contains this information too, which lets you detect different error types programmatically and forward them to your own error handling logic.

For example, the attempted non-existent column is stored in `$i`:⁶

``````cnd_nonexistent <- rlang::catch_cnd(
eval_select(nonexistent_selection, df3)
)
cnd_nonexistent$i``````
``  [1] "y"``

Zero column selections give you `NULL` in `$i` when you set it to error:

``````cnd_zero_selection <- rlang::catch_cnd(
eval_select(zero_selection, df3, allow_empty = FALSE)
)
cnd_zero_selection$i``````
``  NULL``

General evaluation errors are distinguished by having a `$parent`:

``````cnd_evaluation_error <- rlang::catch_cnd(
eval_select(evaluation_error, df3)
)
cnd_evaluation_error$parent``````
``  <simpleError in eval_tidy(as_quosure(expr, env), context_mask): I'm a bad expression!>``

Again, this is more useful for developers building something that integrates `{tidyselect}`.⁷ But I personally find this interesting to know about anyways!
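Putting the three observations together, a dispatcher could look something like the sketch below. `classify_selection()` is a made-up helper, and it leans entirely on the `$i` and `$parent` fields as observed above, which are not a documented contract and may differ across `{tidyselect}` versions.

```r
library(tidyselect)

# Hypothetical helper: label a selection by how (or whether) it fails
classify_selection <- function(expr, data) {
  cnd <- rlang::catch_cnd(
    eval_select(expr, data, allow_empty = FALSE),
    classes = "error"
  )
  if (is.null(cnd)) {
    "ok"
  } else if (!is.null(cnd[["i"]])) {
    # `[[` avoids partial matching of field names
    "nonexistent column"
  } else if (!is.null(cnd[["parent"]])) {
    "evaluation error"
  } else {
    "empty selection"
  }
}

df3 <- data.frame(x = 1)
classify_selection(quote(x), df3)
classify_selection(quote(c(x, y)), df3)
classify_selection(quote(starts_with("z")), df3)
classify_selection(quote(stop("I'm a bad expression!")), df3)
```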

## Conclusion

Here I end with the (usual) disclaimer to not actually just copy paste these for production - they’re written with the very low standard of scratching my itch, so they do not come with any warranty!

But I hope that this was a fun exercise in thinking through one of the most mysterious magics in `{dplyr}`. I’m sure to reference this many times in the future myself.

1. The examples `quote("x")` and `quote(1)` are redundant because `"x"` and `1` are constants. I keep `quote()` in there just to make the comparison clearer.↩︎

2. Not to be confused with `all_of()`. The idiomatic pattern for scoping an external character vector is to do `all_of(x)`, not `.env$x`. It’s only when you’re scoping a non-character vector that you’d use `.env$`.↩︎

3. It’s also strangely reminiscent of my previous blog post on `dplyr::slice()`↩︎

4. Thanks to Jonathan Carroll for this suggestion!↩︎

5. For those who actually read error messages, at least (points to self) …↩︎

6. Though `{tidyselect}` errors early, so it’ll only record the first attempted column causing the error. You could use a `while()` loop (catch and remove bad columns from the data until there’s no more error) if you really wanted to get the full set of offending columns.↩︎

7. If you want some examples of post-processing tidyselect errors, there’s some stuff I did for pointblank that may be helpful as a reference.↩︎