First impressions of DataFrames.jl and accessories

Tags: julia, data wrangling, DataFrames.jl, dplyr, data.table

Perspectives from a {dplyr} and {data.table} useR

June Choe (University of Pennsylvania Linguistics), https://live-sas-www-ling.pantheon.sas.upenn.edu/
2022-11-15

Intro

DataFrames.jl is a Julia package for data wrangling. As of this writing it is at v1.4.x - it’s a mature library that’s been in active development for over a decade.1

For some background, I comfortably switch between {dplyr} and {data.table}, having used both for nearly 5 years. I love digging into the implementation details of both - I really appreciate the thoughtfulness behind {dplyr}’s tidyeval/tidyselect semantics, as well as {data.table}’s conciseness and abstraction in the j. I have not been exposed to any other data wrangling frameworks but was recently compelled to learn Julia for independent reasons,2 so I decided why not pick up Julia-flavored data wrangling while I’m at it?

This blog post is a rough (and possibly evolving?) list of my first impressions of DataFrames.jl and “DataFrames.jl accessories”, namely Chain.jl and DataFramesMeta.jl.3

If you’re Julia-curious and/or just want to hear an R person talk about how another language does data wrangling differently, you’re the target audience!

However, this blog post is NOT:

All of this to say that I have no skin in the game and I don’t endorse or represent anything I write here. In fact I’m a Julia noob myself (it’s only been like 3 months) so take everything with a grain of salt and please feel free to let me know if I did anything wrong or inefficiently!

TL;DR - Big takeaways

Setup

R

# R v4.2.1
library(dplyr)      # v1.0.10
library(data.table) # v1.14.5
mtcars_df <- mtcars |>
  as_tibble(rownames = "model") |>
  type.convert(as.is = TRUE)
mtcars_dt <- as.data.table(mtcars_df)

Julia

# Julia v1.8.2
using DataFrames # (v1.4.3)
using DataFramesMeta # (v0.12.0) Also imports Chain.jl
# using Chain.jl (v0.5.0)
using StatsBase # (v0.33.21) Like base R {stats}
using RDatasets # (v0.7.7) Self-explanatory; like the {Rdatasets} package
mtcars = RDatasets.dataset("datasets", "mtcars")
  32×12 DataFrame
   Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
       │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
  ─────┼──────────────────────────────────────────────────────────────────────────
     1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
     2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
     3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
     4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
     5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
     6 │ Valiant               18.1      6    225.0    105     2.76    3.46
     7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
     8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
    ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
    26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
    27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
    28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
    29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
    30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
    31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
    32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                     5 columns and 17 rows omitted

Quick example

From mtcars…

  1. Filter for rows that represent "Merc"6 car models

  2. Calculate the average mpg by cyl

  3. Return a new column called kmpg that converts miles to kilometers (1:1.61)

{dplyr}

mtcars_df |>
  filter(stringr::str_detect(model, "^Merc ")) |>
  group_by(cyl) |>
  summarize(kmpg = mean(mpg) * 1.61)
  # A tibble: 3 × 2
      cyl  kmpg
    <int> <dbl>
  1     4  38.0
  2     6  29.8
  3     8  26.2

{data.table}

mtcars_dt[model %like% "^Merc ", .(kmpg = mean(mpg) * 1.61), by = cyl]
       cyl   kmpg
     <int>  <num>
  1:     4 37.996
  2:     6 29.785
  3:     8 26.243

DataFrames.jl

@chain mtcars begin
  subset(:Model => x -> occursin.(r"^Merc ", x))
  groupby(:Cyl)
  combine(:MPG => (x -> mean(x) * 1.61) => :kmpg)
end
  3×2 DataFrame
   Row │ Cyl    kmpg
       │ Int64  Float64
  ─────┼────────────────
     1 │     4   37.996
     2 │     6   29.785
     3 │     8   26.243

DataFramesMeta.jl

@chain mtcars begin
  @rsubset(occursin(r"^Merc ", :Model))
  groupby(:Cyl)
  @combine(:kmpg = mean(:MPG) * 1.61)
end
  3×2 DataFrame
   Row │ Cyl    kmpg
       │ Int64  Float64
  ─────┼────────────────
     1 │     4   37.996
     2 │     6   29.785
     3 │     8   26.243

List of 💜s and 💔s

Disclaimer: This is not a list of likes/dislikes or approvals/disapprovals. The 💜 and 💔 are just vibes - you know what I mean?

1) 💜 The distinctness of the “grouped df” type

In {dplyr}, it’s easy to think of grouping as a transient operation. We don’t really think about group_by() as a data-wrangling function because the returned <grouped_df> object is visually very similar to the input. Coupled with the general expectation that object attributes in R tend to be ephemeral, group_by() often gets treated like a second-class citizen even though grouping is sticky, making it a frequent culprit of subtle bugs.

group_by(mtcars_df, cyl)
  # A tibble: 32 × 12
  # Groups:   cyl [3]
     model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
     <chr>       <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
   1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
   2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
   3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
   4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
   5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
   6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
   7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
   8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
   9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
  10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
  # … with 22 more rows
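
To make the “subtle bugs” point concrete, here’s a quick sketch of the kind of thing I mean (a contrived example of my own, not from this post’s analyses): because grouping is sticky, a group_by() from several steps back quietly changes what a later mutate() computes - you’d need an ungroup() first if you wanted the global mean here.

mtcars_grouped <- mtcars_df |>
  group_by(cyl)   # ...imagine many wrangling steps in between...
mtcars_grouped |>
  mutate(mpg_centered = mpg - mean(mpg)) # centered within each cyl group, not across all 32 rows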

In {data.table}, it’s at the opposite extreme. Grouping via by= is right there in the slogan dt[i,j,by], sharing a privileged status with row selection i and column manipulation j. Grouping as an operation is very prominent in the code but that also makes it feel a bit too integrated for my taste. It feels too close to SQL, whereas I like my data wrangling operations to be more modular and portable.

I think that DataFrames.jl hits a good middle ground. It keeps groupby() as a stand-alone operation while marking the output as very clearly distinct from the input. The returned GroupedDataFrame type, when printed, visually splits the dataframe by groups:7

groupby(mtcars, :Cyl)
  GroupedDataFrame with 3 groups based on key: Cyl
  First Group (11 rows): Cyl = 4
   Row │ Model          MPG      Cyl    Disp     HP     DRat     WT       QSec   ⋯
       │ String31       Float64  Int64  Float64  Int64  Float64  Float64  Float6 ⋯
  ─────┼──────────────────────────────────────────────────────────────────────────
     1 │ Datsun 710        22.8      4    108.0     93     3.85    2.32     18.6 ⋯
     2 │ Merc 240D         24.4      4    146.7     62     3.69    3.19     20.0
    ⋮  │       ⋮           ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋮   ⋱
    10 │ Lotus Europa      30.4      4     95.1    113     3.77    1.513    16.9
    11 │ Volvo 142E        21.4      4    121.0    109     4.11    2.78     18.6
                                                      5 columns and 7 rows omitted
  ⋮
  Last Group (14 rows): Cyl = 8
   Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
       │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
  ─────┼──────────────────────────────────────────────────────────────────────────
     1 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
     2 │ Duster 360            14.3      8    360.0    245     3.21    3.57
    ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
    14 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
                                                     5 columns and 11 rows omitted

2) 💜 The imperative -! variants

Many verbs in DataFrames.jl have a -! suffix counterpart, like transform()/transform!(). These -! forms indicate a mutate-in-place operation, which modifies the object it’s called on as a side-effect.8 It’s conceptually like {data.table}’s set() or the walrus operator := in the j, with no equivalent in {dplyr}.9

I love pure functions and I’m a sucker for functional programming, but one place where I can really see myself using the -! form a lot is for rename!() and sort!(), the latter being equivalent to dplyr::arrange():10

mtcars_sorted = copy(mtcars); # Make a copy
sort!(mtcars_sorted, :MPG);   # Sort the copy by ascending MPG value, in-place
mtcars_sorted[ [begin, end], [:Model, :MPG] ] # Check the first and last row
  2×2 DataFrame
   Row │ Model               MPG
       │ String31            Float64
  ─────┼─────────────────────────────
     1 │ Cadillac Fleetwood     10.4
     2 │ Toyota Corolla         33.9

This is also an efficient way of dropping columns that you know you won’t use.11 In {data.table}, you can do this with := but in the funky form of assigning columns to NULL which I never really liked.12

mtcars_dropcol <- copy(mtcars_dt)
mtcars_dropcol[, mpg := NULL]                # drop single column
mtcars_dropcol[, `:=`(wt = NULL, vs = NULL)] # drop multiple columns in function form
names(mtcars_dropcol)
  [1] "model" "cyl"   "disp"  "hp"    "drat"  "qsec"  "am"    "gear"  "carb"

In DataFrames.jl that’s just select!() instead of select(). No copies are created in the calls to select!():

mtcars_dropcol = copy(mtcars);
select!(mtcars_dropcol, Not(:MPG));
select!(mtcars_dropcol, Not([:WT, :VS]));
names(mtcars_dropcol)
  9-element Vector{String}:
   "Model"
   "Cyl"
   "Disp"
   "HP"
   "DRat"
   "QSec"
   "AM"
   "Gear"
   "Carb"

3) 💔 Competition between Base.filter() and DataFrames.subset()

In DataFrames.jl, there are two ways of subsetting rows:

  1. The Base.filter() generic which has been extended with a DataFrame method. It works kind of like base R’s Filter() in that it takes the object to filter as its second argument:

    filter(:Gear => x -> x .== 5, mtcars)
      5×12 DataFrame
       Row │ Model           MPG      Cyl    Disp     HP     DRat     WT       QSec  ⋯
           │ String31        Float64  Int64  Float64  Int64  Float64  Float64  Float ⋯
      ─────┼──────────────────────────────────────────────────────────────────────────
         1 │ Porsche 914-2      26.0      4    120.3     91     4.43    2.14      16 ⋯
         2 │ Lotus Europa       30.4      4     95.1    113     3.77    1.513     16
         3 │ Ford Pantera L     15.8      8    351.0    264     4.22    3.17      14
         4 │ Ferrari Dino       19.7      6    145.0    175     3.62    2.77      15
         5 │ Maserati Bora      15.0      8    301.0    335     3.54    3.57      14 ⋯
                                                                     5 columns omitted
  2. The DataFrames.subset() function which, like other DataFrame transformation verbs, takes the dataframe as its first argument similar to dplyr::filter():

    subset(mtcars, :Gear => x -> x .== 5)
      5×12 DataFrame
       Row │ Model           MPG      Cyl    Disp     HP     DRat     WT       QSec  ⋯
           │ String31        Float64  Int64  Float64  Int64  Float64  Float64  Float ⋯
      ─────┼──────────────────────────────────────────────────────────────────────────
         1 │ Porsche 914-2      26.0      4    120.3     91     4.43    2.14      16 ⋯
         2 │ Lotus Europa       30.4      4     95.1    113     3.77    1.513     16
         3 │ Ford Pantera L     15.8      8    351.0    264     4.22    3.17      14
         4 │ Ferrari Dino       19.7      6    145.0    175     3.62    2.77      15
         5 │ Maserati Bora      15.0      8    301.0    335     3.54    3.57      14 ⋯
                                                                     5 columns omitted

I understand the rationale for supporting Julia’s standard filter() function for DataFrame types, but I feel like it’s something that the average DataFrames.jl user shouldn’t encounter unless they specifically go looking for it.

In my experience that hasn’t been the case, and it caused me a lot of confusion when I was first learning DataFrames.jl - I consider the simultaneous teaching of filter() and subset() in the same chapter to be the one flaw in the otherwise flawless book Julia Data Science.

4) 💜 The operation specification syntax is like {data.table}’s j on steroids

The “operation specification syntax”, also called the “transformation mini language”,13 refers to the templatic form:


[input columns] => [transformation function] => [output columns]

… available inside verbs like select(), transform(), combine(), and subset().

If I say “take the MPG column and multiply it by 1.61 to create a new column called KMPG”, then in DataFrames.jl that’s:

@chain mtcars begin
  transform( :MPG => (x -> x * 1.61) => :KMPG )
  # Below operations cut down rows and cols to save printing space
  select(:MPG, :KMPG)
  first(5)
end
  5×2 DataFrame
   Row │ MPG      KMPG
       │ Float64  Float64
  ─────┼──────────────────
     1 │    21.0   33.81
     2 │    21.0   33.81
     3 │    22.8   36.708
     4 │    21.4   34.454
     5 │    18.7   30.107

… where x -> x * 1.61 inside transform() is an anonymous function14 like R’s \(x) x * 1.61.

I make the comparison to {data.table}’s j because of the flexibility with what you can compute and return inside that special environment.

For example, let’s say you want to treat cyl like a categorical variable and do a one-hot encoding for cyl==4, cyl==6, and cyl==8. To comply with R/Julia’s variable naming rules, let’s name these new columns cyl_4, cyl_6, and cyl_8.

In {data.table} it just takes an lapply() inside the j - you can rely on the invariant that as long as an expression in j evaluates to a list, you will get columns back:

mtcars_onehot <- copy(mtcars_dt)
cyl_vals <- sort(unique(mtcars_onehot$cyl)) # [1] 4 6 8
cyl_cols <- paste0("cyl_", cyl_vals)        # [1] "cyl_4" "cyl_6" "cyl_8"
mtcars_onehot[, (cyl_cols) := lapply(cyl_vals, \(x) cyl == x)]
mtcars_onehot[, (setdiff(names(mtcars_onehot), c("model", cyl_cols))) := NULL]
mtcars_onehot[1:5,]
                 model  cyl_4  cyl_6  cyl_8
                <char> <lgcl> <lgcl> <lgcl>
  1:         Mazda RX4  FALSE   TRUE  FALSE
  2:     Mazda RX4 Wag  FALSE   TRUE  FALSE
  3:        Datsun 710   TRUE  FALSE  FALSE
  4:    Hornet 4 Drive  FALSE   TRUE  FALSE
  5: Hornet Sportabout  FALSE  FALSE   TRUE

Likewise, in DataFrames.jl, you can rely on the invariant that as long as you’re complying with the op-spec and returning certain data types,15 they will become columns in the table:

mtcars_onehot = copy(mtcars);
cyl_vals = sort(unique(mtcars_onehot.Cyl));
cyl_cols = "cyl_" .* string.(cyl_vals);
@chain mtcars_onehot begin
  transform!( :Cyl .=> ByRow(x -> cyl_vals .== x) => cyl_cols )
  select!(:Model, Cols(r"^cyl_"))
end;
mtcars_onehot[1:5,:]
  5×4 DataFrame
   Row │ Model              cyl_4  cyl_6  cyl_8
       │ String31           Bool   Bool   Bool
  ─────┼────────────────────────────────────────
     1 │ Mazda RX4          false   true  false
     2 │ Mazda RX4 Wag      false   true  false
     3 │ Datsun 710          true  false  false
     4 │ Hornet 4 Drive     false   true  false
     5 │ Hornet Sportabout  false  false   true

Read on for more about the ByRow() function wrapping the anonymous function inside transform!().

5) 💜 Rowwise operations with ByRow() and eachrow()

I mentioned at the beginning that DataFrames.jl doesn’t have a lot of idiomatic ways of doing things but one rare case that I really appreciate is ByRow().

Under the hood, ByRow() is actually just a function factory that takes a function as input and returns a broadcasted version of that function as output, kind of like R’s Vectorize():

ByRow(round)([1.2, 3.3]) # same as round.([1.2, 3.3])
  2-element Vector{Float64}:
   1.0
   3.0
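
For reference, the Vectorize() analogy on the R side goes something like this (my own contrived sketch, using a function that isn’t already vectorized):

Vectorize(\(x) if (x > 2) "big" else "small")(c(1.2, 3.3))
  [1] "small" "big"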

I call ByRow() idiomatic because, as you can see, it doesn’t have anything to do with DataFrame rows. Nevertheless it goes perfectly hand in hand with the op-spec syntax because you can just wrap the middle “[transformation function]” component in ByRow() and call it a day:

select(mtcars, :MPG => ByRow( x -> Integer(round(x)) ) => :MPG_whole )[1:3,:]
  3×1 DataFrame
   Row │ MPG_whole
       │ Int64
  ─────┼───────────
     1 │        21
     2 │        21
     3 │        23

Whereas the equivalent using the broadcasting dot (.) would take two broadcasted functions:

select(mtcars, :MPG => ( x -> Integer.(round.(x)) ) => :MPG_whole )[1:3,:]
  3×1 DataFrame
   Row │ MPG_whole
       │ Int64
  ─────┼───────────
     1 │        21
     2 │        21
     3 │        23

I especially like ByRow() in these cases because I can use Julia’s function composition operator (∘) to re-write x -> Integer(round(x)) into Integer ∘ round, which just looks sooo clean:

select(mtcars, :MPG => ByRow(Integer ∘ round) => :MPG_whole )[1:3,:]
  3×1 DataFrame
   Row │ MPG_whole
       │ Int64
  ─────┼───────────
     1 │        21
     2 │        21
     3 │        23

The one small qualm I have with ByRow() though is that the docs equate it to dplyr::rowwise() when really it’s more comparable to purrr::map_*().
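
For what it’s worth, the closer {purrr} analogue of the ByRow(round) example above would be something like this (my sketch):

purrr::map_dbl(c(1.2, 3.3), round) # apply round() element-wise, return a double vector
  [1] 1 3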

dplyr::rowwise() is unlike ByRow() because it’s a function that takes a dataframe as input and returns an object of class <rowwise_df>. In that sense, it’s actually more similar to Base.eachrow(), which you can use to convert a DataFrame into a DataFrameRows object:16

mtcars_rowwise = eachrow(mtcars[1:5, 1:3])
  5×3 DataFrameRows
   Row │ Model              MPG      Cyl
       │ String31           Float64  Int64
  ─────┼───────────────────────────────────
     1 │ Mazda RX4             21.0      6
     2 │ Mazda RX4 Wag         21.0      6
     3 │ Datsun 710            22.8      4
     4 │ Hornet 4 Drive        21.4      6
     5 │ Hornet Sportabout     18.7      8

Similar to how column-major dataframes are essentially a list of vectors representing each column, the row-major DataFrameRows object is essentially a vector of NamedTuples representing each row under the hood:

copy.(mtcars_rowwise)
  5-element Vector{NamedTuple{(:Model, :MPG, :Cyl), Tuple{InlineStrings.String31, Float64, Int64}}}:
   (Model = InlineStrings.String31("Mazda RX4"), MPG = 21.0, Cyl = 6)
   (Model = InlineStrings.String31("Mazda RX4 Wag"), MPG = 21.0, Cyl = 6)
   (Model = InlineStrings.String31("Datsun 710"), MPG = 22.8, Cyl = 4)
   (Model = InlineStrings.String31("Hornet 4 Drive"), MPG = 21.4, Cyl = 6)
   (Model = InlineStrings.String31("Hornet Sportabout"), MPG = 18.7, Cyl = 8)

6) 💔 Confusingly, select() is more like dplyr::transmute() than dplyr::select()

Interestingly, select() isn’t just for selecting columns - it can modify columns too:

@chain mtcars begin
  select(:Model, :MPG => (x -> x .* 1.61) => :KMPG)
  first(3)
end
  3×2 DataFrame
   Row │ Model          KMPG
       │ String31       Float64
  ─────┼────────────────────────
     1 │ Mazda RX4       33.81
     2 │ Mazda RX4 Wag   33.81
     3 │ Datsun 710      36.708

In this sense DataFrames.select() feels more like dplyr::transmute() than the identically named dplyr::select():17

mtcars_df |>
  transmute(model, kmpg = mpg * 1.61) |>
  head(3)
  # A tibble: 3 × 2
    model          kmpg
    <chr>         <dbl>
  1 Mazda RX4      33.8
  2 Mazda RX4 Wag  33.8
  3 Datsun 710     36.7

I think it’s misleading that select() can also transform() in DataFrames.jl, although I don’t complain that dplyr::select() can also dplyr::rename() so idk:

mtcars_df |>
  select(car_model = model, mpg) |>
  head(3)
  # A tibble: 3 × 2
    car_model       mpg
    <chr>         <dbl>
  1 Mazda RX4      21  
  2 Mazda RX4 Wag  21  
  3 Datsun 710     22.8

7) 💔 Selection helpers are not powered by boolean algebra

The fact that DataFrames.select() behaves like dplyr::transmute() might explain why you don’t really get a rich {tidyselect}-esque interface to column selection.18 This has been very challenging for me as someone who loves {tidyselect}, but DataFrames.jl has been getting new column selection helpers like Cols(), Not(), Between(), etc., which make things a bit easier compared to {data.table} at least.

But I don’t really vibe with the implementation of column selection helpers as set operations, as opposed to boolean algebra.19

I’d love to be told that I’m wrong, but it feels really clunky to do something like “select columns that are numeric and have a name that’s two characters long”. In DataFrames.jl that’s:

@chain mtcars begin
  select(_, intersect( names(_, Float64), names(_, r"^..$") ) )
  first(1)
end
  1×1 DataFrame
   Row │ WT
       │ Float64
  ─────┼─────────
     1 │    2.62

Whereas in {dplyr} you just write each condition as a predicate joined by &:

mtcars_df |>
  select( where(is.double) & matches("^..$") ) |>
  head(1)
  # A tibble: 1 × 1
       wt
    <dbl>
  1  2.62

8) 💜 groupby() has select-semantics

One thing that’s always bugged me a little in {dplyr} is the fact that dplyr::group_by() has mutate-semantics.

So like, despite the fact that people pretty much only ever20 use group_by() like select() in the form of group_by(col1, col2, col3), you can’t use {tidyselect} helpers:

mtcars_df |>
  group_by( matches("^[va]") ) |>
  summarize(mpg = mean(mpg), .groups = 'drop')
  Error in `group_by()`:
  ! Problem adding computed columns.
  Caused by error in `mutate()`:
  ! Problem while computing `..1 = matches("^[va]")`.
  Caused by error:
  ! `matches()` must be used within a *selecting* function.
  ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html> for
    details.

Instead you need to bridge select-semantics and mutate-semantics using across()21, which ensures that group_by() receives column vectors:

mtcars_df |>
  group_by( across(matches("^[va]")) ) |>
  summarize(mpg = mean(mpg), .groups = 'drop')
  # A tibble: 4 × 3
       vs    am   mpg
    <int> <int> <dbl>
  1     0     0  15.0
  2     0     1  19.8
  3     1     0  20.7
  4     1     1  28.4

In DataFrames.jl, however, groupby() has select-semantics.22 That lets you use column selection helpers like Cols() to dynamically choose columns like:

@chain mtcars begin
  groupby( Cols(r"^[VA]") )
  combine(:MPG => mean, renamecols = false)
end
  4×3 DataFrame
   Row │ VS     AM     MPG
       │ Int64  Int64  Float64
  ─────┼───────────────────────
     1 │     0      0  15.05
     2 │     0      1  19.75
     3 │     1      0  20.7429
     4 │     1      1  28.3714

9) 💔 No special marking of context-dependent expressions

In {dplyr} and {data.table} you get “context dependent expressions” like n() and .N, which return information about the dataframe that you’re currently manipulating.

So for example if I want to calculate counts by group, in {dplyr} I can do:

mtcars_df |>
  group_by(cyl) |>
  summarize(n = n())
  # A tibble: 3 × 2
      cyl     n
    <int> <int>
  1     4    11
  2     6     7
  3     8    14

And in {data.table} I can do:

mtcars_dt[, .(n = .N), by = cyl]
       cyl     n
     <int> <int>
  1:     6     7
  2:     4    11
  3:     8    14

Likewise in DataFrames.jl, I can use nrow:

@chain mtcars begin
  groupby(:Cyl)
  combine(nrow => :n)
end
  3×2 DataFrame
   Row │ Cyl    n
       │ Int64  Int64
  ─────┼──────────────
     1 │     4     11
     2 │     6      7
     3 │     8     14

Special keywords like nrow can be used in the op-spec syntax, and they essentially take up the first and middle “[input columns] => [transformation function]” slots.
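
To make that concrete (my own expansion, not an official equivalence), you can get the same counts by filling in those two slots yourself - pick any column and apply length:

@chain mtcars begin
  groupby(:Cyl)
  combine(:MPG => length => :n) # roughly what nrow => :n stands in for
end
  3×2 DataFrame
   Row │ Cyl    n
       │ Int64  Int64
  ─────┼──────────────
     1 │     4     11
     2 │     6      7
     3 │     8     14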

But here’s a thing that’s super confusing about nrow as a context dependent expression. Unlike {dplyr}’s n() or {data.table}’s .N, Julia’s nrow()23 is also a stand-alone function:

nrow(mtcars)
  32

Now imagine I wrote a nrow2() function which does the same thing but with Base.size(), the equivalent of base R’s dim():

nrow2 = function(x)
  size(x, 1) # returns the first dimension (row)
end;
nrow2(mtcars)
  32

You might expect our new nrow2 to behave like nrow in the context we just saw, but it doesn’t!

@chain mtcars begin
  groupby(:Cyl)
  combine(nrow2 => :n)
end
  ArgumentError: Unrecognized column selector var"#101#102"() => :n in AsTable constructor

Because nrow2() is not a special operator like nrow, it gets evaluated ordinarily in the first component of the op-spec syntax (“[input columns]”), throwing a rather unhelpful error message about an unrecognized column selector. But this special treatment of nrow is really difficult to figure out, because nrow also works as expected as a stand-alone function outside the op-spec syntax!

Now compare this to the treatment of {data.table}’s .N and {dplyr}’s n(), where the former isn’t exported (it’s not even defined in the package) and the latter throws a helpful error message:

dplyr::n()
  Error in `n()`:
  ! Must be used inside dplyr verbs.

But thankfully, this is an area of ongoing development in DataFrames.jl so I’m hopeful that the documentation will catch up too.

For example, v1.4 recently added a couple more context dependent expressions like groupindices, which is equivalent to {dplyr}’s cur_group_id() and {data.table}’s .GRP:

@chain mtcars begin
  groupby(:Cyl)
  combine(groupindices)
end
  3×2 DataFrame
   Row │ Cyl    groupindices
       │ Int64  Int64
  ─────┼─────────────────────
     1 │     4             1
     2 │     6             2
     3 │     8             3

However, I still feel like there’s a ways to go on the clarity front. Like, if the user doesn’t already have a mental model of context dependent expressions, then it might be confusing that you can use groupindices like this:

groupindices( groupby(mtcars[1:5,:], :Cyl) )
  5-element Vector{Union{Missing, Int64}}:
   2
   2
   1
   2
   3

But you can’t use nrow in the same way:

nrow( groupby(mtcars[1:5,:], :Cyl) )
  MethodError: no method matching nrow(::GroupedDataFrame{DataFrame})
  Closest candidates are:
    nrow(!Matched::SubDataFrame) at C:\Users\jchoe\.julia\packages\DataFrames\KKiZW\src\subdataframe\subdataframe.jl:157
    nrow(!Matched::DataFrame) at C:\Users\jchoe\.julia\packages\DataFrames\KKiZW\src\dataframe\dataframe.jl:459

Because, independently of their shared status as context dependent expressions in the op-spec syntax, groupindices and nrow differ as stand-alone functions (only the former has a method defined for GroupedDataFrame types).

10) 💜 The op-spec syntax gives you dplyr::across()/c_across() for free

So probably my favorite thing about the op-spec syntax is that the leftmost “[input columns]” component can be a vector of multiple columns. When combined with the broadcasted version of the => operator, you get dplyr::across() for free:24

select(mtcars, [:DRat, :QSec] .=> ByRow(Integer ∘ round) => uppercase)[1:5,:]
  5×2 DataFrame
   Row │ DRAT   QSEC
       │ Int64  Int64
  ─────┼──────────────
     1 │     4     16
     2 │     4     17
     3 │     4     19
     4 │     3     19
     5 │     3     17

In {dplyr}, the above would be:

mtcars_df |>
  transmute(
    across(
      .cols = c(drat, qsec),
      .fns = \(x) as.integer(round(x)),
      .names = "{toupper(.col)}"
    )
  ) |>
  head(5)
  # A tibble: 5 × 2
     DRAT  QSEC
    <int> <int>
  1     4    16
  2     4    17
  3     4    19
  4     3    19
  5     3    17

Here’s another fun one replicating {dplyr}’s rowwise() + c_across() workflow.

As a very contrived example, let’s say I want to make two new columns, calculating the min/max across float/double columns by-row. In DataFrames.jl that can be done by selecting multiple columns and sending them in batch to a vararg function:

@chain mtcars begin
  select(_,
    :Model,
    names(_, Float64) => ByRow( (x...) -> extrema(x) ) => [:min, :max]
  )
  first(5)
end
  5×3 DataFrame
   Row │ Model              min      max
       │ String             Float64  Float64
  ─────┼─────────────────────────────────────
     1 │ Mazda RX4            2.62     160.0
     2 │ Mazda RX4 Wag        2.875    160.0
     3 │ Datsun 710           2.32     108.0
     4 │ Hornet 4 Drive       3.08     258.0
     5 │ Hornet Sportabout    3.15     360.0

That kind of operation is costly in {dplyr} because it requires a rowwise() context to operate on columns selected by c_across(), the result of which must then be converted to a dataframe:

mtcars_df |>
  rowwise() |>
  transmute(
    model,
    c_across(where(is.double)) |>
      range() |>
      as.data.frame.list(col.names = c("min", "max"))
  ) |>
  ungroup() |>
  head(5)
  # A tibble: 5 × 3
    model               min   max
    <chr>             <dbl> <dbl>
  1 Mazda RX4          2.62   160
  2 Mazda RX4 Wag      2.88   160
  3 Datsun 710         2.32   108
  4 Hornet 4 Drive     3.08   258
  5 Hornet Sportabout  3.15   360

For this particular problem you can use pmax and pmin, but that’s of course not generalizable to other arbitrary operations:25

mtcars_df |>
  transmute(
    model,
    min = do.call(pmin, across(where(is.double))),
    max = do.call(pmax, across(where(is.double)))
  ) |>
  head(5)
  # A tibble: 5 × 3
    model               min   max
    <chr>             <dbl> <dbl>
  1 Mazda RX4          2.62   160
  2 Mazda RX4 Wag      2.88   160
  3 Datsun 710         2.32   108
  4 Hornet 4 Drive     3.08   258
  5 Hornet Sportabout  3.15   360

Take this benchmarking with a grain of salt, but on my machine the DataFrames.jl solution takes ~0.1ms, {dplyr}’s rowwise() + c_across() solution takes ~30ms, and {dplyr}’s across() + pmin()/pmax() solution takes ~3ms.
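
If you want to poke at the Julia number yourself, here’s a minimal sketch of how I’d time it (assuming BenchmarkTools.jl, which isn’t loaded in the setup above):

using BenchmarkTools # assumed; provides @btime
@btime select($mtcars, :Model,
  names($mtcars, Float64) => ByRow((x...) -> extrema(x)) => [:min, :max]);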

Update: a {data.table} solution that’s probably not the most efficient it could be

mtcars_dt_rowwise <- copy(mtcars_dt)
mtcars_dt_rowwise[, c("min", "max") := as.list(range(.SD)), by = .I, .SDcols = is.double]
mtcars_dt_rowwise[1:5, .(model, min, max)]
                 model   min   max
                <char> <num> <num>
  1:         Mazda RX4 2.620   160
  2:     Mazda RX4 Wag 2.875   160
  3:        Datsun 710 2.320   108
  4:    Hornet 4 Drive 3.080   258
  5: Hornet Sportabout 3.150   360

Concluding thoughts

Overall impression

Overall, DataFrames.jl has a cool design. Learning it has been mostly painless, and writing this blog post has been super fun.

I personally found it to be a great entry point from R to Julia as someone who primarily uses R for data analysis, and would recommend this route if you’re Julia-curious and you like learning by doing.

I was also pleasantly surprised by how much of my experience with {dplyr} and {data.table} transferred over to me learning an entirely different data wrangling framework. And this goes the other way as well - DataFrames.jl helps me appreciate many aspects of {dplyr} and {data.table} that I used to take for granted.

Next steps

A list of other things that I want to explore in DataFrames.jl but haven’t had the chance to:


  1. Making it slightly younger than {plyr}/{dplyr} and {data.table} by a few years↩︎

  2. I’m doing a semester of independent study on simulation-based power analysis for mixed effects models, where speed is crucial. For that I decided to switch over from {lme4} and pick up MixedModels.jl.↩︎

  3. There’s also the relatively newer DataFrameMacros.jl that makes DataFrames.jl code even closer to {dplyr}, namely by making the transformation verbs rowwise by default to save you from reasoning about broadcasting, for better or for worse.↩︎

  4. The Julia for Data Analysis book is estimated for Jan 2023 but I’ve been reading the previews and recommend it very strongly. Bogumił Kamiński, the author of the book and the JuliaCon 2022 workshop (also a core developer of DataFrames.jl) also has a blog full of DataFrames.jl goodies.↩︎

  5. Although @macroexpand is very cool and works like dplyr::show_query(), so learning DataFramesMeta.jl shouldn’t interfere with learning DataFrames.jl.↩︎

  6. Abbreviation of Mercedes-Benz, apparently.↩︎

  7. The visual effect here is similar to what you get from dplyr::group_split(), a costly operation that returns the input dataframe split into a list of dataframes.↩︎

  8. The -! suffix is a convention for side-effect (or “imperative”) functions also found in other programming languages, like my first language Racket.↩︎

  9. Though you can mimic it in {dplyr} with {magrittr}’s %<>% pipe.↩︎

  10. sort!() probably because I consider row order to be more like a dataframe’s metadata, given how fragile and not-guaranteed-to-be-stable it is.↩︎

  11. Although I usually do that inside the file-reading step, like in read_csv(file, col_select = ...).↩︎

  12. Famously the 8.1.55-56th circle of the R inferno.↩︎

  13. I’m just gonna call it op-spec syntax from now on because these names are way too long.↩︎

  14. Wrapped in parentheses in the code to make sure that the right-side => operator is not parsed as part of the function.↩︎

  15. NamedTuple, AbstractDataFrame, DataFrameRow, AbstractMatrix, to name a few.↩︎

  16. Also see the @eachrow! macro from DataFramesMeta.jl for a modify-in-place version.↩︎

  17. Or, keeping up with the current dev version of {dplyr}, which supersedes transmute(), it’d be mutate(..., .keep = "none").↩︎

  18. Although I feel like you totally could since the “[input columns]” component of the op-spec gets evaluated in its own separate environment anyways (unlike {data.table}’s j where the three components are mashed together). There was an attempt at integrating the now-stale(?) Selections.jl which looked interesting, but alas…↩︎

  19. The {tidyselect} backend of {dplyr} supports both, namely with ! and - variants, though they recently switched to emphasizing the boolean algebra technique with !.↩︎

  20. Though the mutate-semantics lets you do things like group_by(decade = 10 * (year %/% 10)).↩︎

  21. Or the new pick() function in the upcoming version.↩︎

  22. But not mutate-semantics, although I won’t miss that.↩︎

  23. Which is not part of Base but instead comes from DataAPI.jl.↩︎

  24. Actually, dplyr::across(.cols = c(col1, col2)) is more like AsTable([:col1, :col2]) => ... => ... in op-spec, but I’ve found that distinction to be seldom important.↩︎

  25. Also see @mariviere’s solution using pmap_dfr().↩︎

  26. Update: Bogumił reached out to let me know of the WIP TableMetadataTools.jl package designed to “make the most common operations on metadata convenient”. I didn’t know about this but apparently metadata is a big topic in the DataFrames.jl world - exciting!↩︎