Perspectives from a {dplyr} and {data.table} useR
DataFrames.jl is a Julia package for data wrangling. As of this writing it is at v1.4.x - it’s a mature library that’s been in active development for over a decade.1
For some background, I comfortably switch between {dplyr} and {data.table}, having used both for nearly 5 years.
I love digging into the implementation details of both - I really appreciate the thoughtfulness behind {dplyr}'s tidyeval/tidyselect semantics, as well as {data.table}'s conciseness and abstraction in the j.
I have not been exposed to any other data wrangling frameworks but was recently compelled to learn Julia for independent reasons,2 so I decided why not pick up Julia-flavored data wrangling while I’m at it?
This blog post is a rough (and possibly evolving?) list of my first impressions of DataFrames.jl and “DataFrames.jl accessories”, namely Chain.jl and DataFramesMeta.jl.3
If you’re Julia-curious and/or just want to hear an R person talk about how another language does data wrangling differently, you’re the target audience!
However, this blog post is NOT:
My first impressions of the Julia language or a pitch for why you should use Julia. If you want that from an R user’s perspective, check out Trang Le’s blog post and the Julia documentation on “Noteworthy differences from R”.
A DataFrames.jl tutorial. But if you’re curious, aside from the docs I learned almost exclusively from Bogumił Kamiński’s JuliaCon 2022 workshop, the Julia Data Science book, and the Julia for Data Analysis book.4
A {dplyr}/{data.table} to DataFrames.jl translation cheatsheet since those already exist, though I’ll be doing some of that myself when it helps illustrate a point.
All of this to say that I have no skin in the game and I don’t endorse or represent anything I write here. In fact I’m a Julia noob myself (it’s only been like 3 months) so take everything with a grain of salt and please feel free to let me know if I did anything wrong or inefficiently!
The syntax mimics {dplyr} but works more like {data.table} under the hood. There’s a bit of unlearning to do for {dplyr} users.
There are not as many idiomatic ways of doing things as in {dplyr}. Whereas you can get very far in {dplyr} without thinking much about base R, learning DataFrames.jl requires a good amount of "base" Julia first (especially distinctions between data types that R doesn't make).
I love Chain.jl but I’m not that drawn to DataFramesMeta.jl because it feels like {dtplyr}5 - I’d personally rather just focus on learning the thing itself.
Some aspects of DataFrames.jl are relatively underdeveloped IMO (e.g., context dependent expressions) but it’s in active development and I plan to stick around to see more.
R
# R v4.2.1
library(dplyr) # v1.0.10
library(data.table) # v1.14.5
mtcars_df <- mtcars |>
as_tibble(rownames = "model") |>
type.convert(as.is = TRUE)
mtcars_dt <- as.data.table(mtcars_df)
Julia
# Julia v1.8.2
using DataFrames # (v1.4.3)
using DataFramesMeta # (v0.12.0) Also imports Chain.jl
# using Chain.jl (v0.5.0)
using StatsBase # (v0.33.21) Like base R {stats}
using RDatasets # (v0.7.7) Self-explanatory; like the {Rdatasets} package
mtcars = RDatasets.dataset("datasets", "mtcars")
32×12 DataFrame
Row │ Model MPG Cyl Disp HP DRat WT QS ⋯
│ String31 Float64 Int64 Float64 Int64 Float64 Float64 Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Mazda RX4 21.0 6 160.0 110 3.9 2.62 ⋯
2 │ Mazda RX4 Wag 21.0 6 160.0 110 3.9 2.875
3 │ Datsun 710 22.8 4 108.0 93 3.85 2.32
4 │ Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215
5 │ Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 ⋯
6 │ Valiant 18.1 6 225.0 105 2.76 3.46
7 │ Duster 360 14.3 8 360.0 245 3.21 3.57
8 │ Merc 240D 24.4 4 146.7 62 3.69 3.19
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
26 │ Fiat X1-9 27.3 4 79.0 66 4.08 1.935 ⋯
27 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14
28 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513
29 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17
30 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 ⋯
31 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57
32 │ Volvo 142E 21.4 4 121.0 109 4.11 2.78
5 columns and 17 rows omitted
From mtcars…
Filter for rows that represent "Merc"6 car models
Calculate the average mpg by cyl
Return a new column called kmpg that converts miles to kilometers (1:1.61)
{dplyr}
mtcars_df |>
filter(stringr::str_detect(model, "^Merc ")) |>
group_by(cyl) |>
summarize(kmpg = mean(mpg) * 1.61)
# A tibble: 3 × 2
cyl kmpg
<int> <dbl>
1 4 38.0
2 6 29.8
3 8 26.2
{data.table}
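One way to write this in {data.table} (a sketch, not necessarily how the original chunk did it):
mtcars_dt[grepl("^Merc ", model), .(kmpg = mean(mpg) * 1.61), by = cyl]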
DataFrames.jl
@chain mtcars begin
subset(:Model => x -> occursin.(r"^Merc ", x))
groupby(:Cyl)
combine(:MPG => (x -> mean(x) * 1.61) => :kmpg)
end
3×2 DataFrame
Row │ Cyl kmpg
│ Int64 Float64
─────┼────────────────
1 │ 4 37.996
2 │ 6 29.785
3 │ 8 26.243
DataFramesMeta.jl
@chain mtcars begin
@rsubset(occursin(r"^Merc ", :Model))
groupby(:Cyl)
@combine(:kmpg = mean(:MPG) * 1.61)
end
3×2 DataFrame
Row │ Cyl kmpg
│ Int64 Float64
─────┼────────────────
1 │ 4 37.996
2 │ 6 29.785
3 │ 8 26.243
Disclaimer: this is not a list of likes/dislikes or approvals/disapprovals. The 💜 and 💔 are just vibes - you know what I mean?
In {dplyr}, it’s easy to think of grouping as a transient operation.
We don’t really think about group_by()
as a data-wrangling function because the returned <grouped_df>
object is visually very similar to the input.
Coupled with the general expectation that object attributes in R tend to be ephemeral, group_by()
often gets treated like a second-class citizen even though grouping is sticky, making it a frequent culprit of subtle bugs.
group_by(mtcars_df, cyl)
# A tibble: 32 × 12
# Groups: cyl [3]
model mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
In {data.table}, it’s at the opposite extreme.
Grouping via by=
is right there in the slogan dt[i,j,by]
, sharing a privileged status with row selection i
and column manipulation j
.
Grouping as an operation is very prominent in the code but that also makes it feel a bit too integrated for my taste.
It feels too close to SQL, whereas I like my data wrangling operations to be more modular and portable.
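To make the dt[i,j,by] form concrete, here's a minimal one-liner:
# filter rows in i, compute in j, group with by - all in one call
mtcars_dt[mpg > 20, .(avg_wt = mean(wt)), by = cyl]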
I think that DataFrames.jl hits a good middle ground. It keeps groupby() as a stand-alone operation while marking the output as very clearly distinct from the input. The returned GroupedDataFrame type, when printed, visually splits the dataframe by groups:7
groupby(mtcars, :Cyl)
GroupedDataFrame with 3 groups based on key: Cyl
First Group (11 rows): Cyl = 4
Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯
│ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 ⋯
2 │ Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
10 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9
11 │ Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.6
5 columns and 7 rows omitted
⋮
Last Group (14 rows): Cyl = 8
Row │ Model MPG Cyl Disp HP DRat WT QS ⋯
│ String31 Float64 Int64 Float64 Int64 Float64 Float64 Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 ⋯
2 │ Duster 360 14.3 8 360.0 245 3.21 3.57
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
14 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57
5 columns and 11 rows omitted
-! variants
Many verbs in DataFrames.jl have a -! suffix counterpart, like transform()/transform!(). These -! forms indicate a mutate-in-place operation, which modifies the object it's called on as a side effect.8 It's conceptually like {data.table}'s set() or the walrus operator := in the j, with no equivalent in {dplyr}.9
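For {dplyr}-only readers, the {data.table} analogue looks roughly like this (a quick sketch on a throwaway copy):
dt <- copy(mtcars_dt)
dt[, kmpg := mpg * 1.61]              # `:=` adds kmpg by reference - `dt` itself is modified
set(dt, j = "kmpg2", value = dt$kmpg) # set() does the same thing without non-standard evaluation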
I love pure functions and I'm a sucker for functional programming, but one place where I can really see myself using the -! form a lot is for rename!() and sort!(), the latter being equivalent to dplyr::arrange():10
mtcars_sorted = copy(mtcars); # Make a copy
sort!(mtcars_sorted, :MPG); # Sort the copy by ascending MPG value, in-place
mtcars_sorted[ [begin, end], [:Model, :MPG] ] # Check the first and last row
2×2 DataFrame
Row │ Model MPG
│ String31 Float64
─────┼─────────────────────────────
1 │ Cadillac Fleetwood 10.4
2 │ Toyota Corolla 33.9
This is also an efficient way of dropping columns that you know you won't use.11 In {data.table}, you can do this with := but in the funky form of assigning columns to NULL, which I never really liked.12
mtcars_dropcol <- copy(mtcars_dt)
mtcars_dropcol[, mpg := NULL] # drop single column
mtcars_dropcol[, `:=`(wt = NULL, vs = NULL)] # drop multiple columns in function form
names(mtcars_dropcol)
[1] "model" "cyl" "disp" "hp" "drat" "qsec" "am" "gear" "carb"
In DataFrames.jl that’s just select!()
instead of select()
. No copies are created in the calls to select!()
:
mtcars_dropcol = copy(mtcars);
select!(mtcars_dropcol, Not(:MPG));
select!(mtcars_dropcol, Not([:WT, :VS]));
names(mtcars_dropcol)
9-element Vector{String}:
"Model"
"Cyl"
"Disp"
"HP"
"DRat"
"QSec"
"AM"
"Gear"
"Carb"
Base.filter() and DataFrames.subset()
In DataFrames.jl, there are two ways of subsetting rows:
The Base.filter() generic, which has been extended with a DataFrame method. It works kind of like base R's Filter() in that it takes the object to filter as its second argument:
filter(:Gear => x -> x .== 5, mtcars)
5×12 DataFrame
Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯
│ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16 ⋯
2 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16
3 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14
4 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15
5 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57 14 ⋯
5 columns omitted
The DataFrames.subset() function, which, like other DataFrame transformation verbs, takes the dataframe as its first argument, similar to dplyr::filter():
subset(mtcars, :Gear => x -> x .== 5)
5×12 DataFrame
Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯
│ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16 ⋯
2 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16
3 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14
4 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15
5 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57 14 ⋯
5 columns omitted
I understand the rationale for supporting Julia's standard filter() function for DataFrame types, but I feel like it's something that the average DataFrames.jl user shouldn't encounter unless they specifically go looking for it. In my experience this hasn't been the case, and it caused me a lot of confusion when I was first learning DataFrames.jl - I consider the simultaneous teaching of filter() and subset() in the same chapter to be the one flaw in the otherwise flawless book Julia Data Science.
j on steroids
The "operation specification syntax", also called the "transformation mini language",13 refers to the templatic form:
[input columns] => [transformation function] => [output columns]
… available inside verbs like select(), transform(), combine(), and subset().
If I say "take the MPG column and multiply it by 1.61 to create a new column called KMPG", then in DataFrames.jl that's:
@chain mtcars begin
transform( :MPG => (x -> x * 1.61) => :KMPG )
# Below operations cut down rows and cols to save printing space
select(:MPG, :KMPG)
first(5)
end
5×2 DataFrame
Row │ MPG KMPG
│ Float64 Float64
─────┼──────────────────
1 │ 21.0 33.81
2 │ 21.0 33.81
3 │ 22.8 36.708
4 │ 21.4 34.454
5 │ 18.7 30.107
… where x -> x * 1.61 inside transform() is an anonymous function14 like R's \(x) x * 1.61.
I make the comparison to {data.table}'s j because of the flexibility with what you can compute and return inside that special environment.
For example, let's say you want to treat cyl like a categorical variable and do a one-hot encoding for cyl==4, cyl==6, and cyl==8. To comply with R/Julia's variable naming rules, let's name these new columns cyl_4, cyl_6, and cyl_8.
In {data.table} it just takes an lapply() inside the j - you can rely on the invariant that as long as an expression in j evaluates to a list, you will get columns back:
mtcars_onehot <- copy(mtcars_dt)
cyl_vals <- sort(unique(mtcars_onehot$cyl)) # [1] 4 6 8
cyl_cols <- paste0("cyl_", cyl_vals) # [1] "cyl_4" "cyl_6" "cyl_8"
mtcars_onehot[, (cyl_cols) := lapply(cyl_vals, \(x) cyl == x)]
mtcars_onehot[, (setdiff(names(mtcars_onehot), c("model", cyl_cols))) := NULL]
mtcars_onehot[1:5,]
model cyl_4 cyl_6 cyl_8
<char> <lgcl> <lgcl> <lgcl>
1: Mazda RX4 FALSE TRUE FALSE
2: Mazda RX4 Wag FALSE TRUE FALSE
3: Datsun 710 TRUE FALSE FALSE
4: Hornet 4 Drive FALSE TRUE FALSE
5: Hornet Sportabout FALSE FALSE TRUE
Likewise, in DataFrames.jl, you can rely on the invariant that as long as you’re complying with the op-spec and returning certain data types,15 they will become columns in the table:
mtcars_onehot = copy(mtcars);
cyl_vals = sort(unique(mtcars_onehot.Cyl));
cyl_cols = "cyl_" .* string.(cyl_vals);
@chain mtcars_onehot begin
transform!( :Cyl .=> ByRow(x -> cyl_vals .== x) => cyl_cols )
select!(:Model, Cols(r"^cyl_"))
end;
mtcars_onehot[1:5, :]
5×4 DataFrame
Row │ Model cyl_4 cyl_6 cyl_8
│ String31 Bool Bool Bool
─────┼────────────────────────────────────────
1 │ Mazda RX4 false true false
2 │ Mazda RX4 Wag false true false
3 │ Datsun 710 true false false
4 │ Hornet 4 Drive false true false
5 │ Hornet Sportabout false false true
Read on for more about the ByRow() function wrapping the anonymous function inside transform!().
ByRow() and eachrow()
I mentioned at the beginning that DataFrames.jl doesn't have a lot of idiomatic ways of doing things, but one rare case that I really appreciate is ByRow().
Under the hood, ByRow() is actually just a function factory that takes a function as input and returns a broadcasted version of that function, kind of like R's Vectorize():
ByRow(round)([1.2, 3.3]) # same as round.([1.2, 3.3])
2-element Vector{Float64}:
1.0
3.0
I call ByRow() idiomatic because, as you can see, it doesn't have anything to do with DataFrame rows. Nevertheless it goes perfectly hand in hand with the op-spec syntax because you can just wrap the middle "[transformation function]" component in ByRow() and call it a day:
select(mtcars, :MPG => ByRow( x -> Integer(round(x)) ) => :MPG_whole )[1:3,:]
3×1 DataFrame
Row │ MPG_whole
│ Int64
─────┼───────────
1 │ 21
2 │ 21
3 │ 23
Whereas the equivalent using the . would take two broadcasted functions:
select(mtcars, :MPG => ( x -> Integer.(round.(x)) ) => :MPG_whole )[1:3,:]
3×1 DataFrame
Row │ MPG_whole
│ Int64
─────┼───────────
1 │ 21
2 │ 21
3 │ 23
I especially like ByRow() in these cases because I can use Julia's function composition operator ∘ (\circ) to re-write x -> Integer(round(x)) into Integer ∘ round, which just looks sooo clean:
select(mtcars, :MPG => ByRow(Integer ∘ round) => :MPG_whole )[1:3,:]
3×1 DataFrame
Row │ MPG_whole
│ Int64
─────┼───────────
1 │ 21
2 │ 21
3 │ 23
The one small qualm I have with ByRow(), though, is that the docs equate it to dplyr::rowwise() when really it's more comparable to purrr::map_*().
dplyr::rowwise() is unlike ByRow() because it's a function that takes a dataframe as input and returns an object of class <rowwise_df>.
In that sense, it's actually more similar to Base.eachrow(), which you use to convert a DataFrame into a DataFrameRows object:16
mtcars_rowwise = eachrow(mtcars[1:5, 1:3])
5×3 DataFrameRows
Row │ Model MPG Cyl
│ String31 Float64 Int64
─────┼───────────────────────────────────
1 │ Mazda RX4 21.0 6
2 │ Mazda RX4 Wag 21.0 6
3 │ Datsun 710 22.8 4
4 │ Hornet 4 Drive 21.4 6
5 │ Hornet Sportabout 18.7 8
Similar to how a column-major dataframe is essentially a list of vectors representing each column, the row-major DataFrameRows object is essentially a vector of NamedTuples representing each row under the hood:
copy.(mtcars_rowwise)
5-element Vector{NamedTuple{(:Model, :MPG, :Cyl), Tuple{InlineStrings.String31, Float64, Int64}}}:
(Model = InlineStrings.String31("Mazda RX4"), MPG = 21.0, Cyl = 6)
(Model = InlineStrings.String31("Mazda RX4 Wag"), MPG = 21.0, Cyl = 6)
(Model = InlineStrings.String31("Datsun 710"), MPG = 22.8, Cyl = 4)
(Model = InlineStrings.String31("Hornet 4 Drive"), MPG = 21.4, Cyl = 6)
(Model = InlineStrings.String31("Hornet Sportabout"), MPG = 18.7, Cyl = 8)
select() is more like dplyr::transmute() than dplyr::select()
Interestingly, select() isn't just for selecting columns - it can modify columns too:
@chain mtcars begin
select(:Model, :MPG => (x -> x .* 1.61) => :KMPG)
first(3)
end
3×2 DataFrame
Row │ Model KMPG
│ String31 Float64
─────┼────────────────────────
1 │ Mazda RX4 33.81
2 │ Mazda RX4 Wag 33.81
3 │ Datsun 710 36.708
In this sense DataFrames.select() feels more like dplyr::transmute() than the identically named dplyr::select():17
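# a {dplyr} sketch of one way to produce the output below
mtcars_df |>
  transmute(model, kmpg = mpg * 1.61) |>
  head(3)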
# A tibble: 3 × 2
model kmpg
<chr> <dbl>
1 Mazda RX4 33.8
2 Mazda RX4 Wag 33.8
3 Datsun 710 36.7
I think it’s misleading that select()
can also transform()
in DataFrames.jl, although I don’t complain that dplyr::select()
can also dplyr::rename()
so idk:
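# one way to get the output below, using dplyr::select()-as-rename
mtcars_df |>
  select(car_model = model, mpg) |>
  head(3)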
# A tibble: 3 × 2
car_model mpg
<chr> <dbl>
1 Mazda RX4 21
2 Mazda RX4 Wag 21
3 Datsun 710 22.8
The fact that DataFrames.select() behaves like dplyr::transmute() might explain why you don't really get a rich {tidyselect}-esque interface to column selection.18
This has been very challenging as someone who loves {tidyselect}, but DataFrames.jl has been getting new column selection helpers like Cols(), Not(), Between(), etc., which make things a bit easier compared to {data.table} at least.
But I don't really vibe with the implementation of column selection helpers as set operations, as opposed to boolean algebra.19
I'd love to be told that I'm wrong, but it feels really clunky to do something like "select columns that are numeric and have a name that's two characters long". In DataFrames.jl that's:
@chain mtcars begin
select(_, intersect( names(_, Float64), names(_, r"^..$") ) )
first(1)
end
1×1 DataFrame
Row │ WT
│ Float64
─────┼─────────
1 │ 2.62
Whereas in {dplyr} you just write each condition as a predicate joined by &:
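# a {dplyr} sketch: both conditions as predicates inside select()
mtcars_df |>
  select(where(is.double) & matches("^..$")) |>
  head(1)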
# A tibble: 1 × 1
wt
<dbl>
1 2.62
groupby() has select-semantics
One thing that's always bugged me a little in {dplyr} is the fact that dplyr::group_by() has mutate-semantics.
So like, despite the fact that people pretty much only ever20 use group_by() like select() in the form of group_by(col1, col2, col3), you can't use {tidyselect} helpers:
Error in `group_by()`:
! Problem adding computed columns.
Caused by error in `mutate()`:
! Problem while computing `..1 = matches("^[va]")`.
Caused by error:
! `matches()` must be used within a *selecting* function.
ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html> for
details.
Instead you need to bridge select-semantics and mutate-semantics using across()21, which ensures that group_by() receives column vectors:
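# one way to bridge the two, wrapping the selection in across()
mtcars_df |>
  group_by(across(matches("^[va]"))) |>
  summarize(mpg = mean(mpg), .groups = "drop")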
# A tibble: 4 × 3
vs am mpg
<int> <int> <dbl>
1 0 0 15.0
2 0 1 19.8
3 1 0 20.7
4 1 1 28.4
In DataFrames.jl, however, groupby() has select-semantics.22 That lets you use column selection helpers like Cols() to dynamically choose columns:
@chain mtcars begin
groupby( Cols(r"^[VA]") )
combine(:MPG => mean, renamecols = false)
end
4×3 DataFrame
Row │ VS AM MPG
│ Int64 Int64 Float64
─────┼───────────────────────
1 │ 0 0 15.05
2 │ 0 1 19.75
3 │ 1 0 20.7429
4 │ 1 1 28.3714
In {dplyr} and {data.table} you get "context dependent expressions" like n() and .N, which return information about the dataframe that you're currently manipulating.
So for example, if I want to calculate counts by group, in {dplyr} I can do:
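# presumably something like this, using n()
mtcars_df |>
  group_by(cyl) |>
  summarize(n = n())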
# A tibble: 3 × 2
cyl n
<int> <int>
1 4 11
2 6 7
3 8 14
And in {data.table} I can do:
mtcars_dt[, .(n = .N), by = cyl]
cyl n
<int> <int>
1: 6 7
2: 4 11
3: 8 14
Likewise in DataFrames.jl, I can use nrow:
@chain mtcars begin
groupby(:Cyl)
combine(nrow => :n)
end
3×2 DataFrame
Row │ Cyl n
│ Int64 Int64
─────┼──────────────
1 │ 4 11
2 │ 6 7
3 │ 8 14
Special keywords like nrow can be used in the op-spec syntax, and they essentially take up the first and middle "[input columns] => [transformation function]" slots.
But here's a thing that's super confusing about nrow as a context dependent expression. Unlike {dplyr}'s n() or {data.table}'s .N, Julia's nrow()23 is also a stand-alone function:
nrow(mtcars)
32
Now imagine I wrote a nrow2() function which does the same thing but with Base.size(), the equivalent of base R's dim():
nrow2 = function(x)
    size(x, 1) # returns the first dimension (row)
end;
nrow2(mtcars)
32
You might expect our new nrow2 to behave like nrow in the context we just saw, but it doesn't!
@chain mtcars begin
groupby(:Cyl)
combine(nrow2 => :n)
end
ArgumentError: Unrecognized column selector var"#101#102"() => :n in AsTable constructor
Because nrow2() is not a special operator like nrow, it gets evaluated ordinarily in the first component of the op-spec syntax ("[input columns]"), throwing a rather unhelpful error message about an unrecognized column selector.
But it's really difficult to figure out this exceptionalism of nrow because it also works as an ordinary function outside the op-spec context!
Now compare this to the treatment of {data.table}'s .N and {dplyr}'s n(), where the former isn't exported (it's not even defined in the package) and the latter throws a helpful error message:
dplyr::n()
Error in `n()`:
! Must be used inside dplyr verbs.
But thankfully, this is an area of ongoing development in DataFrames.jl, so I'm hopeful that the documentation will catch up too.
For example, v1.4 recently added a couple more context dependent expressions like groupindices, which is equivalent to {dplyr}'s cur_group_id() and {data.table}'s .GRP:
@chain mtcars begin
groupby(:Cyl)
combine(groupindices)
end
3×2 DataFrame
Row │ Cyl groupindices
│ Int64 Int64
─────┼─────────────────────
1 │ 4 1
2 │ 6 2
3 │ 8 3
However, I still feel like there's a ways to go on the clarity front.
Like, if the user doesn't already have a mental model of context dependent expressions, then it might be confusing that you can use groupindices like this:
groupindices( groupby(mtcars[1:5,:], :Cyl) )
5-element Vector{Union{Missing, Int64}}:
2
2
1
2
3
But you can’t use nrow
in the same way:
nrow( groupby(mtcars[1:5,:], :Cyl) )
MethodError: no method matching nrow(::GroupedDataFrame{DataFrame})
Closest candidates are:
nrow(!Matched::SubDataFrame) at C:\Users\jchoe\.julia\packages\DataFrames\KKiZW\src\subdataframe\subdataframe.jl:157
nrow(!Matched::DataFrame) at C:\Users\jchoe\.julia\packages\DataFrames\KKiZW\src\dataframe\dataframe.jl:459
Because, independently of their shared status as context dependent expressions in the op-spec syntax, groupindices and nrow differ as stand-alone functions (only the former has a method defined for GroupedDataFrame types).
dplyr::across()/c_across() for free
So probably my favorite thing about the op-spec syntax is that the leftmost "[input columns]" component can be a vector of multiple columns. When combined with the broadcasted version of the => operator, you get dplyr::across() for free:24
select(mtcars, [:DRat, :QSec] .=> ByRow(Integer ∘ round) => uppercase)[1:5,:]
5×2 DataFrame
Row │ DRAT QSEC
│ Int64 Int64
─────┼──────────────
1 │ 4 16
2 │ 4 17
3 │ 4 19
4 │ 3 19
5 │ 3 17
In {dplyr}, the above would be:
mtcars_df |>
transmute(
across(
.cols = c(drat, qsec),
.fns = \(x) as.integer(round(x)),
.names = "{toupper(.col)}"
)
) |>
head(5)
# A tibble: 5 × 2
DRAT QSEC
<int> <int>
1 4 16
2 4 17
3 4 19
4 3 19
5 3 17
Here’s another fun one replicating {dplyr}’s rowwise()
+ c_across()
workflow.
As a very contrived example, let’s say I want to make two new columns, calculating the min/max across float/double columns by-row. In DataFrames.jl that can be done by selecting multiple columns and sending them in batch to a vararg function:
@chain mtcars begin
select(_,
:Model,
names(_, Float64) => ByRow( (x...) -> extrema(x) ) => [:min, :max]
)
first(5)
end
5×3 DataFrame
Row │ Model min max
│ String Float64 Float64
─────┼─────────────────────────────────────
1 │ Mazda RX4 2.62 160.0
2 │ Mazda RX4 Wag 2.875 160.0
3 │ Datsun 710 2.32 108.0
4 │ Hornet 4 Drive 3.08 258.0
5 │ Hornet Sportabout 3.15 360.0
That kind of operation is costly in {dplyr} because it requires a rowwise() context to operate on columns selected by c_across(), the result of which must then be converted to a dataframe:
mtcars_df |>
rowwise() |>
transmute(
model,
c_across(where(is.double)) |>
range() |>
as.data.frame.list(col.names = c("min", "max"))
) |>
ungroup() |>
head(5)
# A tibble: 5 × 3
model min max
<chr> <dbl> <dbl>
1 Mazda RX4 2.62 160
2 Mazda RX4 Wag 2.88 160
3 Datsun 710 2.32 108
4 Hornet 4 Drive 3.08 258
5 Hornet Sportabout 3.15 360
For this particular problem you can use pmax and pmin, but that's of course not generalizable to other arbitrary operations:25
mtcars_df |>
transmute(
model,
min = do.call(pmin, across(where(is.double))),
max = do.call(pmax, across(where(is.double)))
) |>
head(5)
# A tibble: 5 × 3
model min max
<chr> <dbl> <dbl>
1 Mazda RX4 2.62 160
2 Mazda RX4 Wag 2.88 160
3 Datsun 710 2.32 108
4 Hornet 4 Drive 3.08 258
5 Hornet Sportabout 3.15 360
Take this benchmarking with a grain of salt, but on my machine the DataFrames.jl solution takes ~0.1ms, {dplyr}'s rowwise() + c_across() solution takes ~30ms, and {dplyr}'s across() + pmin()/pmax() solution takes ~3ms.
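For the Julia side, a timing like that can be grabbed with BenchmarkTools.jl - a minimal sketch, not necessarily the exact benchmark I ran:
using BenchmarkTools
@btime select($mtcars,
:Model,
names($mtcars, Float64) => ByRow( (x...) -> extrema(x) ) => [:min, :max]
);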
Update: a {data.table} solution that’s probably not the most efficient it could be
mtcars_dt_rowwise <- copy(mtcars_dt)
mtcars_dt_rowwise[, c("min", "max") := as.list(range(.SD)), by = .I, .SDcols = is.double]
mtcars_dt_rowwise[1:5, .(model, min, max)]
model min max
<char> <num> <num>
1: Mazda RX4 2.620 160
2: Mazda RX4 Wag 2.875 160
3: Datsun 710 2.320 108
4: Hornet 4 Drive 3.080 258
5: Hornet Sportabout 3.150 360
Overall, DataFrames.jl has a cool design. Learning it has been mostly painless, and writing this blog post has been super fun.
I personally found it to be a great entry point from R to Julia as someone who primarily uses R for data analysis, and would recommend this route if you’re Julia-curious and you like learning by doing.
I was also pleasantly surprised by how much of my experience with {dplyr} and {data.table} transferred over to me learning an entirely different data wrangling framework. And this goes the other way as well - DataFrames.jl helps me appreciate many aspects of {dplyr} and {data.table} that I used to take for granted.
A list of other things that I want to explore in DataFrames.jl but haven’t had the chance to:
I hear that Julia and DataFrames.jl have interesting ways of representing and operating on vectors/columns with missing values.
Unlike R, where you have NA_character_, NA_integer_, and so on, Julia just has one Missing type, and vectors with missing values get a mixed type like Union{Missing, Int64}:
[1, 2, missing]
3-element Vector{Union{Missing, Int64}}:
1
2
missing
Nesting/unnesting looks amazing but I need to learn more about Julia’s data types first, and the pros and cons of each for different nested column workflows.
Joins and pivoting. You have the usual set of *join() verbs and stack()/unstack(), which correspond to base R functions of the same name. I haven't tried them out yet, but they look pretty straightforward from skimming the docs.
Row manipulation operations like append!()/prepend!(), which seem to work more like {data.table} than {dplyr}'s eternally-experimental row_*() functions.
The whole Metadata section of the docs. That includes stuff like keys I think.26
Interaction with Arrow.jl, which is apparently written in pure Julia!
Just more benchmarking and experimentation so I can optimize my bilingual R & Julia workflows (and, okay, maybe contribute to the language-wars discourse with hot takes).
Making it slightly younger than {plyr}/{dplyr} and {data.table} by a few years↩︎
I’m doing a semester of independent study on simulation-based power analysis for mixed effects models, where speed is crucial. For that I decided to switch over from {lme4} and pick up MixedModels.jl.↩︎
There’s also the relatively newer DataFramesMacros.jl that makes DataFrames.jl code even closer to {dplyr}, namely by making the transformation verbs rowwise by default to save you from reasoning about broadcasting, for better or for worse.↩︎
The Julia for Data Analysis book is estimated for Jan 2023 but I’ve been reading the previews and recommend it very strongly. Bogumił Kamiński, the author of the book and the JuliaCon 2022 workshop (also a core developer of DataFrames.jl) also has a blog full of DataFrames.jl goodies.↩︎
Although @macroexpand is very cool and works like dplyr::show_query(), so learning DataFramesMeta.jl shouldn't interfere with learning DataFrames.jl.↩︎
Abbreviation of Mercedes-Benz, apparently.↩︎
The visual effect here is similar to what you get from dplyr::group_split(), a costly operation that returns the input dataframe split into a list of dataframes.↩︎
The -! suffix is a convention for side-effect (or "imperative") functions also found in other programming languages, like my first language Racket.↩︎
Though you can mimic it in {dplyr} with {magrittr}'s %<>% pipe.↩︎
sort!(), probably because I consider row order to be more like a dataframe's metadata, given how fragile and not-guaranteed-to-be-stable it is.↩︎
Although I usually do that inside the file-reading step, like in read_csv(file, col_select = ...).↩︎
I’m just gonna call it op-spec syntax from now on because these names are way too long.↩︎
Wrapped in parentheses in the code to make sure that the right-side => operator is not parsed as part of the function.↩︎
NamedTuple, AbstractDataFrame, DataFrameRow, AbstractMatrix, to name a few.↩︎
Also see the @eachrow! macro from DataFramesMeta.jl for a modify-in-place version.↩︎
Or, keeping up with the current dev version of {dplyr}, which superseded transmute(), it'd be mutate(..., .keep = "none").↩︎
Although I feel like you totally could since the “[input columns]” component of the op-spec gets evaluated in its own separate environment anyways (unlike {data.table}’s j where the three components are mashed together). There was an attempt at integrating the now-stale(?) Selections.jl which looked interesting, but alas…↩︎
The {tidyselect} backend of {dplyr} supports both, namely with ! and - variants, though they recently switched to emphasizing the boolean algebra technique with !.↩︎
Though the mutate-semantics lets you do things like group_by(decade = 10 * (year %/% 10)).↩︎
But not mutate-semantics, although I won’t miss that.↩︎
Which is not part of Base but instead comes from DataAPI.jl.↩︎
Actually, dplyr::across(.cols = c(col1, col2)) is more like AsTable([:col1, :col2]) => ... => ... in op-spec, but I've found that distinction to be seldom important.↩︎
Also see @mariviere's solution using pmap_dfr().↩︎
Update: Bogumił reached out to let me know of the WIP TableMetadataTools.jl package designed to “make the most common operations on metadata convenient”. I didn’t know about this but apparently metadata is a big topic in the DataFrames.jl world - exciting!↩︎