Reproducibility • jlmerclusterperm

This vignette showcases using RNG seed and counter state to guarantee the reproducibility of results from jlmerclusterperm.

See more tutorials and vignettes on the Articles page.

Setup

library(jlmerclusterperm)
jlmerclusterperm_setup(verbose = FALSE)

Minimal example data:

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
chickweights <- as_tibble(ChickWeight) %>%
  mutate(diet = forcats::fct_collapse(Diet, A = c(1, 3), B = c(2, 4))) %>%
  filter(Time %in% 0:20)
ggplot(chickweights, aes(Time, weight, color = diet)) +
  stat_summary(fun.args = list(mult = 1.96)) +
  stat_summary(geom = "line") +
  theme_minimal(base_size = 18)

The jlmer specification object used for this showcase:

jlmer_spec <- make_jlmer_spec(
  formula = weight ~ diet + (1 | Chick), data = chickweights,
  subject = "Chick", time = "Time"
)
jlmer_spec
#> ── jlmer specification ───────────────────────────────────────── <jlmer_spec> ──
#> Formula: weight ~ 1 + dietB + (1 | Chick)
#> Predictors:
#>   diet: dietB
#> Groupings:
#>   Subject: Chick
#>   Trial:
#>   Time: Time
#> Data:
#> # A tibble: 533 × 4
#>   weight dietB Chick  Time
#>    <dbl> <dbl> <chr> <dbl>
#> 1     42     0 1         0
#> 2     51     0 1         2
#> 3     59     0 1         4
#> # ℹ 530 more rows
#> ────────────────────────────────────────────────────────────────────────────────

RNG seed vs. counter

Because CPA is powered by bootstrapping, it is inherently a stochastic procedure. Reproducibility is guaranteed for clusterpermute() calls ran with the same seed and counter values. These are both properties of the Julia random number generator used by jlmerclusterperm and refer to two slightly different concepts:

The seed is conceptually equivalent to the same “seed” as talked about in R, as in set.seed(). The seed is used in the RNG’s initialization, locking it into a specific state that persists for the session. In jlmerclusterperm, the default seed is 1 and is primarily controlled via options("jlmerclusterperm.seed") which must be set before calling jlmerclusterperm_setup().
The counter initializes to 0 and increments by some value every time the RNG is consumed to produce a random number. There is no equivalent concept in base R, though it’s a more handy and intuitive way of interfacing with the RNG’s state in an interactive session. The counter can be manipulated via the functions reset_rng_state(), set_rng_state(), and get_rng_state().

The rest of this vignette will showcase interacting with the counter state, as it is more useful in practice.

Behavior of RNG counter

At the start of a fresh session, the counter is initialized at zero. And because we did not specify the seed, it’s at one:

get_rng_state()
#> [1] 0

get_rng_seed()
#> [1] 1

We start with a simple clusterpermute() with 100 simulations. The reproducibility profile of this run has a seed of 1 and a counter of 0:

CPA_1_0 <- clusterpermute(jlmer_spec, threshold = 1, nsim = 100, progress = FALSE)
CPA_1_0
#> $null_cluster_dists
#> ── Null cluster-mass distribution (t > 1) ────────────── <null_cluster_dists> ──
#> dietB (n = 100)
#>   Mean (SD): 0.724 (5.54)
#>   Coverage intervals: 95% [-10.847, 12.592]
#> ────────────────────────────────────────────────────────────────────────────────
#> 
#> $empirical_clusters
#> ── Empirical clusters (t > 1) ────────────────────────── <empirical_clusters> ──
#> dietB
#>   [2, 14]: 18.231 (p=0.0198)
#> ────────────────────────────────────────────────────────────────────────────────

After this run, the counter is now incremented to 3712:

get_rng_state()
#> [1] 3712

The same call to clusterpermute() now runs with this new counter value, and it produces a different result. Again, this is expected given the stochastic nature of the CPA (and thus the need to run tens of thousands of simulations when reporting results):

CPA_1_3712 <- clusterpermute(jlmer_spec, threshold = 1, nsim = 100, progress = FALSE)
CPA_1_3712
#> $null_cluster_dists
#> ── Null cluster-mass distribution (t > 1) ────────────── <null_cluster_dists> ──
#> dietB (n = 100)
#>   Mean (SD): -0.618 (8.51)
#>   Coverage intervals: 95% [-18.328, 19.076]
#> ────────────────────────────────────────────────────────────────────────────────
#> 
#> $empirical_clusters
#> ── Empirical clusters (t > 1) ────────────────────────── <empirical_clusters> ──
#> dietB
#>   [2, 14]: 18.231 (p=0.0792)
#> ────────────────────────────────────────────────────────────────────────────────

This second run also increments the counter state, but note how it’s not exactly by the same amount (this is not possible to predict):

get_rng_state()
#> [1] 7427

The difference between these two runs is specifically in the null distribution of the cluster-mass statistics. Both 100-simulation samples are drawn share the same underlying (null) distribution, though they are not identical to one another (they rarely are).

ggplot(mapping = aes(sum_statistic)) +
  geom_density(
    aes(fill = "1"),
    alpha = .5,
    data = tidy(CPA_1_0$null_cluster_dists)
  ) +
  geom_density(
    aes(fill = "3712"),
    alpha = .5,
    data = tidy(CPA_1_3712$null_cluster_dists)
  ) +
  guides(fill = guide_legend("Counter")) +
  theme_classic()

Manipulating RNG counter

Using set_rng_counter(), it is possible to reproduce a prior run. For example, to reproduce the second clusterpermute() call, we set the counter back to 3712 with set_rng_state():

set_rng_state(3712)
get_rng_state()
#> [1] 3712

The results of the subsequent run from this counter state is identical to the previous run:

identical(
  CPA_1_3712,
  clusterpermute(jlmer_spec, threshold = 1, nsim = 100, progress = FALSE)
)
#> [1] TRUE

To reproduce the first CPA ran with counter at 0, we can again use set_rng_state() or simply use reset_rng_state() which is an alias for set_rng_state(0):

reset_rng_state()
get_rng_state()
#> [1] 0

identical(
  CPA_1_0,
  clusterpermute(jlmer_spec, threshold = 1, nsim = 100, progress = FALSE)
)
#> [1] TRUE

This behavior can be observed at the specific step of simulating the null distribution as well. The function corresponding to this step is permute_timewise_statistics() and we run 100 simulations for this step with counter at zero:

reset_rng_state()
get_rng_state()
#> [1] 0

null_statistics <- permute_timewise_statistics(jlmer_spec, nsim = 100)
get_rng_state()
#> [1] 3712

Running the cluster detection algorithm over the result using extract_null_cluster_dists() returns a null cluster-mass distribution which is identical to that from CPA_1_0:

null_cluster_dists <- extract_null_cluster_dists(null_statistics, threshold = 1)
null_cluster_dists
#> ── Null cluster-mass distribution (t > 1) ────────────── <null_cluster_dists> ──
#> dietB (n = 100)
#>   Mean (SD): 0.724 (5.54)
#>   Coverage intervals: 95% [-10.847, 12.592]
#> ────────────────────────────────────────────────────────────────────────────────

identical(null_cluster_dists, CPA_1_0$null_cluster_dists)
#> [1] TRUE

Technical details and notes

The Julia back-end uses a thread-safe, counter-based RNG called Threefry (specifically, Random123.Threefry2x) which comes from the Random123.jl Julia library as an adaptation of the original algorithm developed by Salmon, Moraes, Dror, and Shaw (2011). The RNG is initialized with a “rounds” value of 20, which is also the default. Threefry is used for its speed and stability across Julia versions.

Additionally, as noted earlier, the RNG initializes with an explicit seed of 1. This behavior is different than in the case of R, where a random seed is chosen for the user for every new session. The consequence is that in the absence of explicit seeding via options("jlmerclusterperm.seed"), the first run of every new session (i.e., runs with counter value of zero) will be identical.

# Restart Julia session
jlmerclusterperm_setup(restart = TRUE, verbose = FALSE)

# RNG seed is again 1
get_rng_seed()
#> [1] 1


# So results of the first CPA of the session (with counter = 0) is the same
get_rng_state()
#> [1] 0

identical(
  CPA_1_0,
  clusterpermute(jlmer_spec, threshold = 1, nsim = 100, progress = FALSE)
)
#> [1] TRUE

Given the interface to the RNG counter, there are not many unique advantages to using a different seed over manipulating the RNG counter. But an interface to RNG seed is also provided via set_rng_seed() and get_rng_seed().

Lastly, please be aware that changing the seed for the R session (e.g., via set.seed()) has no effect on jlmerclusterperm functions powered by the Julia back-end.

Miscellaneous

You may be wondering whether two CPAs that overlap in the range of counter values also share overlapping samples from the null. This is actually not the case in practice because of how shuffling for the CPA is implemented:

# 100 simulations of timewise statistics
## First CPA using counter at 1
set_rng_state(1)
x <- permute_timewise_statistics(jlmer_spec, nsim = 100)
## Second CPA using counter at 2
set_rng_state(2)
y <- permute_timewise_statistics(jlmer_spec, nsim = 100)

# The null cluster-mass statistics
x_null <- extract_null_cluster_dists(x, threshold = 1)
x_null_stats <- tidy(x_null)$sum_statistic
y_null <- extract_null_cluster_dists(y, threshold = 1)
y_null_stats <- tidy(y_null)$sum_statistic

# Relationship between two CPAs with similar initial counters
matplot(cbind(x_null_stats, y_null_stats), type = "b", pch = 1)

The two colored lines represent two 100-simulation bootstrap permutations that differ in the initial RNG counter by a value of 1. The point is that the shape of the two lines look very different - one is not just the shifted form of another.

For more formal investigations (e.g., when simulating power for CPA), it’s of course best practice to generate random seeds. The counter is more of a convenience that balances interactivity with reproducibility.