Cracking open the internals of ggplot

class: center, middle, inverse, title-slide

# Cracking open the internals of ggplot
## A {ggtrace} showcase
### June Choe @yjunechoe 4 December 2021

---

# The ggplot internals bestiary

😮 {ggplot2} is old (2005~), but the last & most important overhaul of the internals (ggproto) was pretty recent ([v2.0.0](https://www.rstudio.com/blog/ggplot2-2-0-0/), December 2015)

⚠️ Guides on ggplot internals are extremely sparse and scattered, and many are also outdated (but see [[1](https://github.com/paleolimbot/ggdebug)], [[2](https://htmlpreview.github.io/?https://raw.githubusercontent.com/brodieG/ggbg/development/inst/doc/extensions.html)], [[3](https://cran.r-project.org/web/packages/gginnards/vignettes/user-guide-1.html)], [[4](https://cran.r-project.org/web/packages/lemon/vignettes/gtable_show_lemonade.html)], [[5](https://cran.r-project.org/web/packages/gridExtra/vignettes/gtable.html)])

❗ There are no smooth entry points for aspiring developers. Even for experienced users, the sheer scale and foreignness are demotivating.

❗❗ Moreover, you can't learn this through exposure, _by design_. This unintentionally creates a monopoly around the knowledge of how ggplot works under the hood.

🤔 Do we patiently wait, relying on the mercy and sacrifice of experienced developers to host webinars, release official guides, and hold our hands through this journey?

---

# What's so hard about learning it ourselves?

🤓 Many useRs are self-taught and learn through trial and error

In fact a lot of what happens in the building of a ggplot is actually just good ol' ✨data wrangling✨

> [ggprotos] are classes that are **stateless** in the sense that you have an object that receives some data and does something to the data and spits out the data again...

> ...you should think of [ggprotos] as kind of **factories**. You have this assembly line and each method is... **a robot arm**.

- Thomas Lin Pedersen, rstudio::conf (2020)

---

class: center, middle, inverse
background-image: url(img/ggproto-factory.png)
background-size: contain

---

# What's so hard about learning it ourselves?

Many useRs are self-taught and learn through trial and error

And a lot of what happens in the building of a ggplot is actually just good ol' ✨data wrangling✨

> [ggprotos] are classes that are **stateless** in the sense that you have an object that receives some data and does something to the data and spits out the data again...

> ...you should think of [ggprotos] as kind of **factories**. You have this assembly line and each method is... **a robot arm**.

- Thomas Lin Pedersen, rstudio::conf (2020)

We can learn ggplot internals ourselves - we just need a tool that allows us to peek inside and manipulate the assembly line as it runs
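The "data in, data out" picture can be tried on a toy example. This is only a sketch: `Doubler` and its `compute()` method are hypothetical names, not part of {ggplot2}; only `ggproto()` itself is the real constructor.

```r
library(ggplot2)

# `Doubler` is a hypothetical toy ggproto, not something ggplot2 ships
Doubler <- ggproto("Doubler", NULL,
  # one "robot arm": receives data, transforms it, hands it back
  compute = function(data) {
    data$y <- data$y * 2
    data
  }
)

df <- data.frame(x = 1:3, y = c(10, 20, 30))
out <- Doubler$compute(df)

out$y  # transformed copy
df$y   # the input is left untouched; the object keeps no state
```

Calling `Doubler$compute(df)` again returns the same result, which is the sense in which the method behaves like a stateless step on an assembly line.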

---

# A different kind of accessibility problem

We want to interact with ggplot internals _from the outside_

But our familiar debugging tools fail us.

-----

<blockquote cite="https://yutani.rbind.io/post/a-tip-to-debug-ggplot2/">
 "You cannot use breakpoints to dig into [ggprotos]."
</blockquote>
- Hiroaki Yutani, blog post (2019)

<blockquote cite="https://github.com/paleolimbot/ggdebug/blob/master/R/trace.R">
 "[trace()] and [untrace()] ... do not work with ggproto methods"
</blockquote>
- Dewey Dunnington, {ggdebug} (2019)

<blockquote cite="https://www.rstudio.com/resources/rstudioconf-2020/extending-your-ability-to-extend-ggplot2/">
 "ggproto methods are just horrible to debug."
</blockquote>
- Thomas Lin Pedersen, rstudio::conf (2020)
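A rough look at why these methods are awkward targets, assuming {ggplot2}'s current wrapper implementation: what `$` hands back is a wrapper built on access, with the function that was actually written stored inside the wrapper's environment as `f`.

```r
library(ggplot2)

# Accessing a method with `$` returns a wrapper object
m <- GeomBar$setup_data
class(m)

# The function written in the Geom definition sits inside the wrapper
# (`f` is an internal detail and may change between ggplot2 versions)
inner <- environment(m)$f
is.function(inner)
```

So the thing you would naively point a debugger at is not the function whose body you want to step through.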

---

# Enter {ggtrace}!

**Goal: expose the internals of ggplot in the familiar _functional-programming_ sense, for learners and developers alike**

Here, we make some simplifying assumptions about ggplot internals:

The **input** - the _data_ being plotted & the _instructions_ for plotting it

- The user-facing code `ggplot(data) + geom_*(...) + ...`

The **assembly line** - the _execution_ of plotting instructions on the data

- Each layer's Stat/Geom/Position methods transform the data

The **output** - the data prepared for rendering _graphical primitives_

- The state between `ggplot_build()` and `ggplot_gtable()`, reached when the plot is `print()`ed
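The three stages above can be walked through by hand. A minimal sketch, using `mtcars` as a stand-in dataset (an assumption for illustration, not the data used in this talk):

```r
library(ggplot2)

# Input: data plus plotting instructions
p <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Assembly line: each layer's Stat/Geom/Position methods transform the data
built <- ggplot_build(p)

# Output: the per-layer data frame prepared for drawing
layer1 <- layer_data(p, 1)
layer1[, c("x", "count", "ymin", "ymax")]

# Rendering: the built plot becomes a gtable of graphical objects
gt <- ggplot_gtable(built)
class(gt)
```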

---

# Our input

```r
library(ggplot2)  # the data comes from the {palmerpenguins} package

penguins_base <- ggplot(na.omit(palmerpenguins::penguins)) +
  aes(x = species, color = species) +
  theme_minimal()

my_plot <- penguins_base +
  geom_bar(size = 2)

my_plot
```

![](index_files/figure-html/unnamed-chunk-2-1.png)

---

# Our target assembly line

```r
# ggplot2:::print.ggplot
function (x, newpage = is.null(vp), vp = NULL, ...)
{
    set_last_plot(x)
    if (newpage)
        grid.newpage()
    grDevices::recordGraphics(requireNamespace("ggplot2", quietly = TRUE),
        list(), getNamespace("ggplot2"))
*   data <- ggplot_build(x)
    gtable <- ggplot_gtable(data)
    if (is.null(vp)) {
        grid.draw(gtable)
    }
    else {
        if (is.character(vp))
            seekViewport(vp)
        else pushViewport(vp)
        grid.draw(gtable)
        upViewport()
    }
    if (isTRUE(getOption("BrailleR.VI")) && rlang::is_installed("BrailleR")) {
        print(asNamespace("BrailleR")$VI(x))
    }
    invisible(x)
}
```

---

# Our output

```r
names(ggplot_build(my_plot))
```

```
  [1] "data"   "layout" "plot"
```

```r
ggplot_build(my_plot)$data[[1]]
```

```
     colour   y count prop x flipped_aes PANEL group ymin ymax xmin xmax   fill
  1 #F8766D 146   146    1 1       FALSE     1     1    0  146 0.55 1.45 grey35
  2 #00BA38  68    68    1 2       FALSE     1     2    0   68 1.55 2.45 grey35
  3 #619CFF 119   119    1 3       FALSE     1     3    0  119 2.55 3.45 grey35
    size linetype alpha
  1    2        1    NA
  2    2        1    NA
  3    2        1    NA
```

```r
layer_data(my_plot, 1)
```

---

# Example: delayed aesthetic evaluation (1)

You can use `after_*()` functions to access computed variables

.pull-left[

```r
penguins_plot1 <- penguins_base +
  geom_bar(size = 2) +
  aes(
    y = after_stat(count),
    fill = after_scale(alpha(color, .5))
  )
```
]

.pull-right[
![](index_files/figure-html/unnamed-chunk-7-1.png)
]

```r
layer_data(penguins_plot1)
```

```
         fill  colour   y count prop x flipped_aes PANEL group ymin ymax xmin
  1 #F8766D80 #F8766D 146   146    1 1       FALSE     1     1    0  146 0.55
  2 #00BA3880 #00BA38  68    68    1 2       FALSE     1     2    0   68 1.55
  3 #619CFF80 #619CFF 119   119    1 3       FALSE     1     3    0  119 2.55
    xmax size linetype alpha
  1 1.45    2        1    NA
  2 2.45    2        1    NA
  3 3.45    2        1    NA
```

---

# Example: delayed aesthetic evaluation (2)

How would you explain this behavior?

.pull-left[

```r
penguins_plot2 <- penguins_base +
  geom_bar(
    size = 2,
*   fill = "orange"
  ) +
  aes(
    y = after_stat(count),
    fill = after_scale(alpha(color, .5))
  )
```
]

.pull-right[
![](index_files/figure-html/unnamed-chunk-10-1.png)
]

```r
layer_data(penguins_plot2)
```

```
      fill  colour   y count prop x flipped_aes PANEL group ymin ymax xmin xmax
  1 orange #F8766D 146   146    1 1       FALSE     1     1    0  146 0.55 1.45
  2 orange #00BA38  68    68    1 2       FALSE     1     2    0   68 1.55 2.45
  3 orange #619CFF 119   119    1 3       FALSE     1     3    0  119 2.55 3.45
    size linetype alpha
  1    2        1    NA
  2    2        1    NA
  3    2        1    NA
```

---

# Testing a hypothesis about the internals

**Does the `fill` aesthetic even get computed in `penguins_plot2`**?

This question cannot be answered just by looking at the transformed data in the `ggplot_build()` output: it looks the same whether `fill` was never computed or was computed and then overwritten

We need to track the data as it gets transformed _in the assembly line_

---

# Inspecting the assembly line

In a true know-nothing fashion, we're just going to log the data after every step that transforms it inside `ggplot_build()`

```r
data_assigning_steps <- c(8, 9, 11, 12, 13, 17, 18, 19, 21, 22, 26, 29, 30, 31)
as.character(ggbody(ggplot2:::ggplot_build.ggplot)[data_assigning_steps])
```

```
 [1] "data <- layer_data" 
 [2] "data <- by_layer(function(l, d) l$setup_layer(d, plot))" 
 [3] "data <- layout$setup(data, plot$data, plot$plot_env)" 
 [4] "data <- by_layer(function(l, d) l$compute_aesthetics(d, plot))" 
 [5] "data <- lapply(data, scales_transform_df, scales = scales)" 
 [6] "data <- layout$map_position(data)" 
 [7] "data <- by_layer(function(l, d) l$compute_statistic(d, layout))"
 [8] "data <- by_layer(function(l, d) l$map_statistic(d, plot))" 
 [9] "data <- by_layer(function(l, d) l$compute_geom_1(d))" 
 [10] "data <- by_layer(function(l, d) l$compute_position(d, layout))" 
 [11] "data <- layout$map_position(data)" 
 [12] "data <- by_layer(function(l, d) l$compute_geom_2(d))" 
 [13] "data <- by_layer(function(l, d) l$finish_statistics(d))" 
 [14] "data <- layout$finish_data(data)"
```
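The step numbers above are hard-coded. Here is a minimal sketch of how such indices could be found programmatically, using a hypothetical `toy_build()` in place of `ggplot_build()` and base `body()` in place of `ggbody()`:

```r
# Hypothetical stand-in for a build function that repeatedly reassigns `data`
toy_build <- function(df) {
  data <- df
  data <- transform(data, y = y * 2)
  n <- nrow(data)
  data <- data[order(data$y), , drop = FALSE]
  data
}

# The body as a list of expressions; element 1 is the opening `{`
steps <- as.list(body(toy_build))

# Keep only the steps of the form `data <- ...`
assigns_data <- vapply(steps, function(step) {
  is.call(step) &&
    identical(step[[1]], as.name("<-")) &&
    identical(step[[2]], as.name("data"))
}, logical(1))

data_steps <- which(assigns_data)
as.character(steps[data_steps])
```

The same filter could presumably be applied to the list returned by `ggbody()`, though the resulting indices depend on the installed {ggplot2} version.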

---

# Inspecting the assembly line

```r
ggtrace(method = ggplot2:::ggplot_build.ggplot,
        trace_steps = data_assigning_steps + 1,
        trace_exprs = quote(data[[1]]),
        verbose = FALSE)
```

```
  `ggplot2:::ggplot_build.ggplot` now being traced.
```

```r
penguins_plot2
```

```
  Triggering trace on `ggplot2:::ggplot_build.ggplot`
```

```
  Untracing `ggplot2:::ggplot_build.ggplot` on exit.
```

```r
Filter(Negate(is.null), lapply(last_ggtrace(), `[[`, "fill"))
```

```
  [[1]]
  [1] "orange" "orange" "orange"
  
  [[2]]
  [1] "orange" "orange" "orange"
  
  [[3]]
  [1] "orange" "orange" "orange"
```
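The idea of logging an expression after every step can be sketched in base R. `toy_build()` and `log_after_each_step()` are hypothetical helpers, and this is not how {ggtrace} is implemented; in particular it ignores control flow such as an early `return()`.

```r
# Hypothetical stand-in for a build function
toy_build <- function(df) {
  data <- df
  data <- transform(data, y = y * 2)
  data <- data[data$y > 2, , drop = FALSE]
  data
}

# Evaluate the body one step at a time, recording `expr` after each step
log_after_each_step <- function(fn, args, expr) {
  env <- list2env(args, parent = environment(fn))
  steps <- as.list(body(fn))[-1]  # drop the opening `{`
  lapply(steps, function(step) {
    eval(step, env)  # run the step
    eval(expr, env)  # then snapshot the expression of interest
  })
}

snapshots <- log_after_each_step(
  toy_build,
  args = list(df = data.frame(y = 1:3)),
  expr = quote(data$y)
)
snapshots
```

Each element of `snapshots` is the value of `data$y` once the corresponding step has run, which is the kind of record inspected with `last_ggtrace()` above.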

---

# Inspecting the assembly line (again)

Are we really sure that the `fill` aesthetic simply doesn't get calculated when it's supplied as a constant?

Let's jump into the first robot arm that computes `fill`, and see whether it was ever calculated internally

```r
data_assigning_steps[which(sapply(last_ggtrace(), function(x) {"fill" %in% colnames(x)}))]
```

```
  [1] 29 30 31
```

```r
ggbody(ggplot2:::ggplot_build.ggplot)[[29]]
```

```
 data <- by_layer(function(l, d) l$compute_geom_2(d))
```

```r
as.character(ggbody(ggplot2:::Layer$compute_geom_2))
```

```
 [1] "`{`" 
 [2] "if (empty(data)) return(data)" 
 [3] "aesthetics <- self$computed_mapping" 
 [4] "modifiers <- aesthetics[is_scaled_aes(aesthetics) | is_staged_aes(aesthetics)]"
 [5] "self$geom$use_defaults(data, self$aes_params, modifiers)"
```

---

# Inspecting the assembly line (again)

```r
class(geom_bar()$geom)
```

```
  [1] "GeomBar"  "GeomRect" "Geom"     "ggproto"  "gg"
```

```r
ggbody(GeomBar$use_defaults)
```

```
  Error: Method 'use_defaults' is not defined for `GeomBar`
  Check inheritance with `ggbody(GeomBar$use_defaults, inherit = TRUE)`
```

```r
invisible(capture.output(ggbody(GeomBar$use_defaults, inherit = TRUE)))
```
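To see which ancestor actually defines a method, one can walk up the inheritance chain by hand. This is a sketch: `find_definer()` is a hypothetical helper, and it relies on ggproto objects being environments that store a `super` function, which is an internal detail of {ggplot2}.

```r
library(ggplot2)

# Hypothetical helper: climb the ggproto ancestry until an object
# defines `method` itself, and report that object's class
find_definer <- function(obj, method) {
  while (!is.null(obj)) {
    if (exists(method, envir = obj, inherits = FALSE)) {
      return(class(obj)[1])
    }
    # `super` is only stored on objects that inherit from a parent
    obj <- if (exists("super", envir = obj, inherits = FALSE)) {
      get("super", envir = obj)()
    } else {
      NULL
    }
  }
  NULL
}

find_definer(GeomBar, "use_defaults")  # inherited from further up the chain
find_definer(GeomBar, "setup_data")    # defined on GeomBar itself
```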

Again in a true know-nothing fashion, we log the value of `data` at every step

```r
ggtrace(method = Geom$use_defaults,
        trace_steps = seq_along(ggbody(Geom$use_defaults)),
        trace_exprs = quote(data),
        verbose = FALSE)
penguins_plot2
```

---

# Inspecting the assembly line (again)

Looks like we were wrong! Step 7 of `Geom$use_defaults` does calculate the `fill` aesthetic according to `after_scale()`. The constant `"orange"` only overwrites it at the last traced step.

```r
Filter(Negate(is.null), lapply(last_ggtrace(), `[[`, "fill"))
```

```
  [[1]]
  [1] "grey35" "grey35" "grey35"
  
  [[2]]
  [1] "#F8766D80" "#00BA3880" "#619CFF80"
  
  [[3]]
  [1] "#F8766D80" "#00BA3880" "#619CFF80"
  
  [[4]]
  [1] "#F8766D80" "#00BA3880" "#619CFF80"
  
  [[5]]
  [1] "orange" "orange" "orange"
```

```r
which(sapply(last_ggtrace(), function(x) {"fill" %in% colnames(x)}))
```

```
  [1]  6  7  8  9 10
```
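The order suggested by the trace can be re-enacted on a plain data frame. This is a toy illustration only; the real logic lives in `Geom$use_defaults`, and the three assignments below are an assumption about its broad order, read off the logged values above.

```r
library(ggplot2)  # for alpha()

# Colours as they appear in the built layer data
data <- data.frame(colour = c("#F8766D", "#00BA38", "#619CFF"))

# 1. The geom's default fill is filled in
data$fill <- "grey35"
after_default <- data$fill

# 2. The after_scale() modifier is evaluated against the scaled colours
data$fill <- alpha(data$colour, 0.5)
after_modifier <- data$fill

# 3. The constant aesthetic supplied to geom_bar() overwrites the result
data$fill <- "orange"

list(after_default, after_modifier, data$fill)
```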

---

# Yay!

[Check out the answer key for yourself!](https://github.com/tidyverse/ggplot2/blob/65d3bfc21b279a799a5450272664625b01c8778f/R/geom-.r#L124-L150)

---

# The End

**Links**:

- GitHub repo: [https://github.com/yjunechoe/ggtrace](https://github.com/yjunechoe/ggtrace/)
- pkgdown website: [https://yjunechoe.github.io/ggtrace](https://yjunechoe.github.io/ggtrace/)
- slides: [https://yjunechoe.github.io/ggtrace-talk](https://yjunechoe.github.io/ggtrace-talk/)

**Vignettes**:

- [Frequently Asked Questions](https://yjunechoe.github.io/ggtrace/articles/FAQ.html)
- [`ggplot_build` showcase (draft)](https://xenodochial-franklin-34d864.netlify.app/)
- [`aes_eval` showcase (draft)](https://goofy-lovelace-d330f8.netlify.app/)
- [{ggxmean} debugging case study](https://yjunechoe.github.io/ggtrace/articles/casestudy-ggxmean.html)