---
title: "Random Numbers in _BiocParallel_"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
  email: Martin.Morgan@RoswellPark.org
date: "Edited:  7 September, 2021; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{4. Random Numbers in BiocParallel}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
---

[RPCI]: https://www.roswellpark.org/martin-morgan

# Scope 

`r Biocpkg("BiocParallel")` enables use of random number streams in a
reproducible manner. This document applies to the following
`*Param()`:

* `SerialParam()`: sequential evaluation in a single R process.
* `SnowParam()`: parallel evaluation in multiple independent R
  processes.
* `MulticoreParam())`: parallel evaluation in R sessions running in
  forked threads. Not available on Windows.

The `*Param()` can be used for evaluation with:

* `bplapply()`: `lapply()`-like application of a user-supplied
  function `FUN` to a vector or list of elements `X`.
* `bpiterate()`: apply a user-supplied function `FUN` to an unknown
  number of elements resulting from successive calls to a
  user-supplied function `ITER.`

The reproducible random number implementation also supports:

* `bptry()` and the `BPREDO=` argument, for re-evaluation of elements
  that fail (e.g., because of a bug in `FUN`).

# Essentials

## Use of `bplapply()` and `RNGseed=`

Attach `r Biocpkg("BiocParallel")` and ensure that the version is
greater than 1.27.5

```{r}
library(BiocParallel)
stopifnot(
    packageVersion("BiocParallel") > "1.27.5"
)
```

For reproducible calculation, use the `RNGseed=` argument in any of the 
`*Param()`constructors.

```{r}
result1 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
result1
```

Repeating the calculation with the same value for `RNGseed=` results
in the same result; a different random number seed results in
different results.

```{r}
result2 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
stopifnot(
    identical(result1, result2)
)

result3 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 200))
result3

stopifnot(
    !identical(result1, result3)
)
```

Results are invariant across `*Param()`

```{r}
result4 <- bplapply(1:3, runif, BPPARAM = SnowParam(RNGseed = 100))
stopifnot(
    identical(result1, result4)
)

if (!identical(.Platform$OS.type, "windows")) {
    result5 <- bplapply(1:3, runif, BPPARAM = MulticoreParam(RNGseed = 100))
    stopifnot(
        identical(result1, result5)
    )
}
```

Parallel backends can adjust the number of `workers` (processes
performing the evaluation) and `tasks` (how elements of `X` are
distributed between workers).  Results are invariant to these
parameters. This is illustrated with `SnowParam()`, but applies also
to `MulticoreParam()`.

```{r}
result6 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 2, RNGseed = 100))
result7 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 3, RNGseed = 100))
result8 <- bplapply(
    1:3, runif,
    BPPARAM = SnowParam(workers = 2, tasks = 3, RNGseed = 100)
)
stopifnot(
    identical(result1, result6),
    identical(result1, result7),
    identical(result1, result8)
)
```

Subsequent sections illustrate results with `SerialParam()`, but identical 
results are obtained with `SnowParam()` and `MulticoreParam()`.

## Use with `bpiterate()`

`bpiterate()` allows parallel processing of a ’stream’ of data as a
series of tasks, with a task consisting of a portion of the overall
data. It is useful when the data size is not known or easily
partitioned into elements of a vector or list. A real use case might
involve iterating through a BAM file, where a task represents
successive records (perhaps 100,000 per task) in the file. Here we
illustrate with a simple example – iterating through a vector `x =
1:3`

```{r}
ITER_FUN_FACTORY <- function() {
    x <- 1:3
    i <- 0L
    function() {
        i <<- i + 1L
        if (i > length(x))
            return(NULL)
        x[[i]]
    }
}
```

`ITER_FUN_FACTORY()` is used to create a function that, on each invocation, 
returns the next task (here, an element of `x`; in a real example, perhaps 
100000 records from a BAM file). When there are no more tasks, the function 
returns `NULL`

```{r, collapse = TRUE}
ITER <- ITER_FUN_FACTORY()
ITER()

ITER()

ITER()

ITER()
```

In our simple example, `bpiterate()` is performing the same
computations as `bplapply()` so the results, including the random
number streams used by each task in `bpiterate()`, are the same

```{r}
result9 <- bpiterate(
    ITER_FUN_FACTORY(), runif,
    BPPARAM = SerialParam(RNGseed = 100)
)
stopifnot(
    identical(result1, result9)
)
```

## Use with `bptry()`

`bptry()` in conjunction with the `BPREDO=` argument to `bplapply()`
or `bpiterate()` allows for graceful recovery from errors. Here a
buggy `FUN1()` produces an error for the second element. `bptry()`
allows evaluation to continue for other elements of `X`, despite the
error. This is shown in the result.

```{r}
FUN1 <- function(i) {
    if (identical(i, 2L)) {
        ## error when evaluating the second element
        stop("i == 2")
    } else runif(i)
}
result10 <- bptry(bplapply(
    1:3, FUN1,
    BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE)
))
result10
```

`FUN2()` illustrates the flexibility of `bptry()` by fixing the bug
when `i == 2`, but also generating incorrect results if invoked for
previously correct values. The identity of the result to the original
computation shows that only the error task is re-computed, and that
the random number stream used by the task is identical to the original
stream.

```{r}
FUN2 <- function(i) {
    if (identical(i, 2L)) {
        ## the random number stream should be in the same state as the
        ## first time through the loop, and rnorm(i) should return
        ## same result as FUN
        runif(i)
    } else {
        ## if this branch is used, then we are incorrectly updating
        ## already calculated elements -- '0' in the output would
        ## indicate this error
        0
    }
}
result11 <- bplapply(
    1:3, FUN2,
    BPREDO = result10,
    BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE)
)
stopifnot(
    identical(result1, result11)
)
```

## Relationship between` RNGseed=` and `set.seed()`

The global random number stream (influenced by `set.seed()`) is
ignored by `r Biocpkg("BiocParallel")`, and `r Biocpkg("BiocParallel")` does 
NOT increment the global stream.

```{r}
set.seed(200)
value <- runif(1)

set.seed(200)
result12 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
stopifnot(
    identical(result1, result12),
    identical(value, runif(1))
)
```

When `RNGseed=` is not used, an internal stream (not accessible to the
user) is used and `r Biocpkg("BiocParallel")` does NOT increment the
global stream.

```{r}
set.seed(100)
value <- runif(1)

set.seed(100)
result13 <- bplapply(1:3, runif, BPPARAM = SerialParam())
stopifnot(
    !identical(result1, result13),
    identical(value, runif(1))
)
```

## `bpstart()` and random number streams

In all of the examples so far `*Param()` objects are passed to
`bplapply()` or `bpiterate()` in the ’stopped’ state. Internally,
`bplapply()` and `bpiterate()` invoke `bpstart()` to establish the
computational environment (e.g., starting workers for
`SnowParam()`). `bpstart()` can be called explicitly, e.g., to allow
workers to be used across calls to `bplapply()`.

The cluster random number stream is initiated with `bpstart()`. Thus

```{r}
param <- bpstart(SerialParam(RNGseed = 100))
result16 <- bplapply(1:3, runif, BPPARAM = param)
bpstop(param)
stopifnot(
    identical(result1, result16)
)
```

This allows a second call to `bplapply` to represent a continuation of
a random number computation – the second call to `bplapply()` results
in different random number streams for each element of `X`.

```{r}
param <- bpstart(SerialParam(RNGseed = 100))
result16 <- bplapply(1:3, runif, BPPARAM = param)
result17 <- bplapply(1:3, runif, BPPARAM = param)
bpstop(param)
stopifnot(
    identical(result1, result16),
    !identical(result1, result17)
)
```

The results from `bplapply()` are different from the results from
`lapply()`, even with the same random number seed. This is because
correctly implemented parallel random streams require use of a
particular random number generator invoked in specific ways for each
element of `X`, as outlined in the Implementation notes section.

## Relationship between `bplapply()` and `lapply()`

The results from `bplapply()` are different from the results from
`lapply()`, even with the same random number seed. This is because
correctly implemented parallel random streams require use of a
particular random number generator invoked in specific ways for each
element of `X`, as outlined in the Implementation notes section.

```{r}
set.seed(100)
result20 <- lapply(1:3, runif)
stopifnot(
    !identical(result1, result20)
)
```

# Implementation notes

The implementation uses the L’Ecuyer-CMRG random number generator (see
`?RNGkind` and `?parallel::clusterSetRNGStream` for additional
details). This random number generates independent streams and
substreams of random numbers. In `r Biocpkg("BiocParallel")`, each
call to `bp start()` creates a new stream from the L’Ecuyer-CMRG
generator.  Each element in `bplap` `ply()` or `bpiterate()` creates a
new substream. Each application of `FUN` is therefore using the
L’Ecuyer-CMRG random number generator, with a substream that is
independent of the substreams of all other elements.

Within the user-supplied `FUN` of `bplapply()` or `bpiterate()`, it is
a mistake to use `RNGkind()` to set a different random number
generator, or to use `set.seed()`. This would in principle compromise
the independence of the streams across elements.

# `sessionInfo()`

```{r, echo = FALSE}
sessionInfo()
```