R is a language optimized for meme-ing
Getting a program to print “Hello World” is one of the earliest things people are taught to do when picking up a new programming language. This universal experience among programmers has also turned it into a running joke about the complexity of programming languages.
For example, whereas in R we can express what we want transparently in the following:
print("Hello World")
This simple task can get absurdly complex in other languages; perhaps most notoriously, Java:
class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!"); } }
This joke around “Hello World” has also evolved into other forms. Every once in a while I come across a variant of the joke in the style of something like:
HelloWorld("print")
[1] "HelloWorld"
This is funny because it seemingly swaps the role of the argument and the function in an expression. It’s also a good educational example because it demonstrates the arbitrariness of signs as a universal design principle of programming (and human!) languages.1 Crucially, you should be able to produce this behavior in any reasonable programming language - the ability to do this is a feature, not a bug.
The most trivial implementation of the above is to define HelloWorld() as a function that’s been hardcoded to simply print "HelloWorld":
HelloWorld <- function(x) print("HelloWorld")
But here, too, languages show differences. Not so much in their ability to implement this specific solution, but in their ability to formulate a generalizable solution in an idiomatic way, using tools and concepts that are native to the language.
When it comes to R, it turns out that R has certain quirks which can give us a surprisingly principled and lean solution to the problem. So that’s what this blog post will be about.
In R, functions are distinguished from non-functions in part by their role as a caller. This role is defined by syntactic position: the caller always occupies the first position [[1]] of a <language> object.2
When R sees a variable in an expression and needs to resolve its value, it first determines whether the value must be a function, by virtue of its position in the expression. Here, R eagerly commits to the assumption that whatever appears in the caller position must be a function.
This gives rise to a somewhat surprising behavior. In evaluating the expression f(1, 2) inside a local scope below, R smartly skips the immediately adjacent, local value of f (a numeric constant) and resolves to the global value of f() (an alias of the function sum()) that’s “further away”.
f <- sum
local({
  f <- 0
  f(1, 2)
})
[1] 3
So the point here is that R knows to look up only the values of f that are functions, because it found f in the caller position of the expression:
f_expr <- quote(f(1, 2))
f_expr[[1]]
f
This in and of itself is interesting, but I want to return to my characterization of R as “eagerly committing” to this. Consider the fact that the above example works even if you swap f with the string "f" in the expression:
f <- sum
local({
  f <- 0
  "f"(1, 2)
})
[1] 3
Because R eagerly commits to the invariant that the first position is reserved for functions, it repairs "f"() to f() at the level of the parser, before the evaluation engine even sees the expression.
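One way to check the parser claim for yourself is to quote the expression instead of evaluating it, and then inspect what sits in the caller position. This is just a small sketch; the variable names expr and caller are made up for the example.

```r
# Parse the string-caller form without evaluating it
expr <- quote("f"(1, 2))

# Pull out whatever the parser placed in the caller position
caller <- expr[[1]]

# Inspect it: is it still a string, or has it become a symbol?
class(caller)
is.symbol(caller)

# Compare against the same call written with a bare symbol
identical(expr, quote(f(1, 2)))
```

If the repair really happens at parse time, the caller should already be a symbol here, even though nothing has been evaluated yet.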
All of this to say that the following syntax that looks even more flipped is also valid in R:
HelloWorld <- function(x) print("HelloWorld")
"HelloWorld"(print)
[1] "HelloWorld"
This is trivially true about R’s syntax and its parser but funny nonetheless, so it deserves a mention first. Now let’s talk about the implementation side of things - how well does R fare in letting us express something like “arg(f) should evaluate to f(arg)”?
I’ll cut right to the chase - the following definition for HelloWorld() gives us the ability to pass in a function that is then called with "HelloWorld" as the argument.
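HelloWorld <- function(x) {
  # Resolve x to a function, whether given as a string or a function
  fun <- match.fun(x)
  # Recover the name this function was called by, as a string
  arg <- deparse(sys.call()[[1]])
  fun(arg)
}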
HelloWorld("print")
[1] "HelloWorld"
HelloWorld(toupper)
[1] "HELLOWORLD"
There are two pieces to this solution.
First is match.fun(), which allows HelloWorld() to receive the name of a function as a string and match the function with that name. This is kind of like what we talked about in the previous section with "f"(), but it’s a more explicit, less auto-magic way of handling functions specified as a string.
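As a quick illustration of that lookup, here is a minimal sketch using toupper() as the target. The name no_such_function_xyz is deliberately made up to show what happens when nothing matches.

```r
# Look up a function by its name, given as a string
fun <- match.fun("toupper")
is.function(fun)

# The result is an ordinary function that can be called as usual
fun("HelloWorld")

# A string that names no function is an error; it is caught here
# only so that the example runs to completion
res <- tryCatch(
  match.fun("no_such_function_xyz"),
  error = function(e) "not found"
)
res
```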
A nice convenience feature is that when match.fun() receives a function, it simply passes it through. That means match.fun(print) and match.fun("print") give back the same function.
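The pass-through behavior can be checked directly with identical(). This is only a sanity check on print(), not a claim about every possible input:

```r
# A function passed to match.fun() comes back unchanged
identical(match.fun(print), print)

# The string route and the function route land on the same object
identical(match.fun(print), match.fun("print"))
```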
In sum, match.fun() gives us a choice in whether HelloWorld() receives its argument as a string vs. symbol. Combined with our observation from the previous section, this gives us a full 2-by-2 variation in whether the function or the argument is a string (vs. a symbol):
HelloWorld(print)
HelloWorld("print")
"HelloWorld"(print)
"HelloWorld"("print")
The second piece of the solution is sys.call(), which returns the expression that called the function where sys.call() is called from. It’s hard to explain in words but actually pretty intuitive in practice.
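Here is a small sketch of sys.call() in action. The functions f() and g() are throwaway definitions made up for the example; they do nothing except return the call that invoked them.

```r
# A function that simply reports how it was called
f <- function(...) sys.call()
f()
f(1, "a")

# Arguments are captured as written, not as evaluated
g <- function(x) sys.call()
call_obj <- g(x = 1 + 2)

# Like a list, the call can be indexed: the caller comes first...
call_obj[[1]]
# ...followed by the unevaluated arguments
call_obj[[2]]
```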
And that’s it! When sys.call() is called from f(), it captures the expression that makes up f(...). So in the case of HelloWorld("print"), the call to sys.call() evaluates to the following language object:
HelloWorld("print")
… which is essentially a list of length 2:
[[1]]
HelloWorld
[[2]]
[1] "print"
So the code deparse(sys.call()[[1]]) grabs the symbol HelloWorld and deparse()s it into a string, resulting in "HelloWorld". And as I mentioned before, we grab the string "print" and pass it to match.fun() to get back the print() function.
Once we have these two pieces, the line fun(arg) evaluates to the un-flipped version print("HelloWorld").
And of course, as far as the argument is concerned, HelloWorld() takes any function that can operate on the string "HelloWorld":
caps_split <- function(x) {
  # Split before every capital letter except one at the very start
  strsplit(x, "(?<!^)(?=[A-Z])", perl = TRUE)[[1]]
}
# Canonical version
caps_split("HelloWorld")
[1] "Hello" "World"
# Flipped version
HelloWorld("caps_split")
[1] "Hello" "World"
But what if I wanted to do ByeWorld("print") or yes(toupper)? Must I define a ByeWorld() and yes() each time? What would that look like?
In a sense, yes - we need to define each function to have them available for use as functions. But we don’t have to copy-paste the function definition every time. We can write a wrapper function like register() that takes a symbol and defines a function of the same name.
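register <- function(name, envir = parent.frame()) {
  # Capture the user-supplied symbol and turn it into a string
  arg <- deparse(substitute(name))
  f <- function(x) {
    fun <- match.fun(x)
    fun(arg)
  }
  # Define the function under that name in the target environment
  assign(arg, f, envir)
}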
register(ByeWorld)
ByeWorld("print")
[1] "ByeWorld"
R is pretty loose about assigning variables into different environments,3 which makes this a pretty simple task. There are two new pieces here:
First is the deparse(substitute()) combo to first capture the user-supplied argument ByeWorld as a symbol and then turn it into the string "ByeWorld". Second is the assign() function, which uses that string to define ByeWorld() in an environment which defaults to where register() is called from (determined via parent.frame()).
Since we just called register() from the global environment, ByeWorld now shows up in ls() as a consequence of this side effect.4
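To see the side effect without touching the global environment, here is a sketch that registers into a throwaway environment instead. The name sandbox is made up for this example, and register() is repeated from above so that the snippet runs on its own.

```r
# Same definition as above, repeated so this snippet is self-contained
register <- function(name, envir = parent.frame()) {
  arg <- deparse(substitute(name))
  f <- function(x) {
    fun <- match.fun(x)
    fun(arg)
  }
  assign(arg, f, envir)
}

# A fresh, empty environment to register into
sandbox <- new.env()
ls(sandbox)

# After registering, the new function is listed under its name
register(ByeWorld, envir = sandbox)
ls(sandbox)

# And it behaves like the globally registered version
sandbox$ByeWorld("toupper")
```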
And because register() resolves the value of arg immediately on the first line (vs. leaving it to be evaluated lazily), it correctly persists:
alias <- ByeWorld
alias("print") # Doesn't return `"alias"`
[1] "ByeWorld"
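For contrast, the earlier sys.call()-based definition does not survive aliasing, because it reads its name off of whatever expression called it. A minimal sketch, with HelloWorld() repeated from above so that it runs standalone:

```r
# The sys.call() version: the name is recovered at call time
HelloWorld <- function(x) {
  fun <- match.fun(x)
  arg <- deparse(sys.call()[[1]])
  fun(arg)
}

# Called through an alias, sys.call()[[1]] is the symbol `alias`,
# so the function sees "alias" rather than "HelloWorld"
alias <- HelloWorld
alias("toupper")
```

So the two approaches differ in when the name gets fixed: register() fixes it at definition time, while the sys.call() version looks it up on every call.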
We can of course go the other way: from HelloWorld("print") to print("HelloWorld"). For this we define the function unflip(), which captures the user-supplied expression and flips it inside out:
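unflip <- function(expr) {
  # Coerce the captured call into c(caller, argument) as strings
  chr <- as.character(substitute(expr))
  arg <- chr[1]
  fun <- chr[2]
  # Rebuild the call with the roles swapped
  call(fun, arg)
}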
unflip(HelloWorld("print"))
print("HelloWorld")
This works by first coercing the language object into a character vector, then plucking out its parts, and finally reconstructing the unflipped expression with call():5
as.character(quote(
HelloWorld("print")
))
[1] "HelloWorld" "print"
call("print", "HelloWorld")
print("HelloWorld")
But if you really wanted a world where you could linearly specify the argument before the function without littering your environment, and you also don’t have the pipe ("HelloWorld" |> print()), the next best tool for this job is probably currying.
Here’s the simplest attempt at that:6
curry <- function(arg) {
  function(fun) {
    fun <- match.fun(fun)
    fun(arg)
  }
}
curry("HelloWorld")(print)
[1] "HelloWorld"
Essentially, curry("HelloWorld") is returning a function that takes a function and calls that function with "HelloWorld" as its argument. Although, unfortunately, that’s not so obvious from the function definition, which just looks generic:
curry("HelloWorld")
function(fun) {
fun <- match.fun(fun)
fun(arg)
}
<bytecode: 0x0000021c85438528>
<environment: 0x0000021c87df0ad0>
For us to see "HelloWorld" in the function body for curry("HelloWorld"), we would need to in-line the value of arg when the curried function is defined.7 Let’s take this up in steps.
First, we can use substitute() (or bquote()) to create an expression where the value of arg is in-lined. Both methods produce the contextualized function definition we want.
curry2 <- function(arg) {
  list(
    substitute = substitute(
      function(fun) {
        fun <- match.fun(fun)
        fun(arg)
      }
    ),
    bquote = bquote(
      function(fun) {
        fun <- match.fun(fun)
        fun(.(arg))
      }
    )
  )
}
curry2("HelloWorld")
$substitute
function(fun) {
fun <- match.fun(fun)
fun("HelloWorld")
}
$bquote
function(fun) {
fun <- match.fun(fun)
fun("HelloWorld")
}
Let’s stick with substitute() and move on. Now that we have an expression of the function definition, we can eval()-uate it to get an actual function object back.
curry2 <- function(arg) {
  eval(substitute(
    function(fun) {
      fun <- match.fun(fun)
      fun(arg)
    }
  ))
}
curry2("HelloWorld")
function(fun) {
fun <- match.fun(fun)
fun(arg)
}
<environment: 0x0000021c860ee9c0>
Wait… "HelloWorld" just turned back into arg! It turns out that functions in R have a “memory” of how they were defined. It’s stored in the srcref attribute of functions, and this is the function definition that gets shown when we print functions.
HelloWorld <- curry2("HelloWorld")
attr(HelloWorld, "srcref")
function(fun) {
fun <- match.fun(fun)
fun(arg)
}
And actually, if we just strip this attribute away, we can see our work of in-lining arg:
attr(HelloWorld, "srcref") <- NULL
HelloWorld
function (fun)
{
fun <- match.fun(fun)
fun("HelloWorld")
}
<environment: 0x0000021c864cbd88>
We can now go back to the currying function and implement this solution there:
curry3 <- function(arg) {
  inlined <- eval(substitute(
    function(fun) {
      fun <- match.fun(fun)
      fun(arg)
    }
  ))
  # Drop the stored source so printing shows the in-lined body
  attr(inlined, "srcref") <- NULL
  inlined
}
curry3("HelloWorld")
function (fun)
{
fun <- match.fun(fun)
fun("HelloWorld")
}
<environment: 0x0000021c875a60d8>
To avoid all this mess, you could also inline arg first, and then piece together the function from scratch:
curry4 <- function(arg) {
  # Build the body with the value of arg spliced in
  inlined_body <- rlang::expr({
    fun <- match.fun(fun)
    fun(!!arg)
  })
  rlang::new_function(
    args = rlang::pairlist2(fun = ),
    body = inlined_body
  )
}
curry4("HelloWorld")
function (fun)
{
fun <- match.fun(fun)
fun("HelloWorld")
}
<environment: 0x0000021c87bf73f8>
[1] "But don't do this in practice!"
1. A topic close to my heart as a linguist. This is one of the first things we teach in intro to linguistics.
2. Where objects of class <language> are essentially a (nested) list of symbols and constants.
3. The notorious <<- is evidence of this.
4. This is starting to look something like a very butchered form of string interning…
5. You might protest that as.character(substitute()) is bad practice, which is true, but it’s idiomatic in the sense that it’s the first line of the function definition of require().
6. A version with stricter safeguards would probably use force() among other things (see Advanced R).
7. This in-lining also resolves the need for force().