%\VignetteIndexEntry{4. Random Numbers in BiocParallel}
%\VignetteKeywords{parallel, Infrastructure}
%\VignettePackage{BiocParallel}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

<<setup, echo=FALSE>>=
suppressPackageStartupMessages({
    library(BiocParallel)
})
@

\newcommand{\BiocParallel}{\Biocpkg{BiocParallel}}

\title{Random Numbers in BiocParallel}
\author{Martin Morgan\footnote{\url{Martin.Morgan@RoswellPark.org}}}
\date{Edited: 7 September, 2021; Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

\section{Scope}

\BiocParallel{} enables use of random number streams in a reproducible
manner. This document applies to the following \Rcode{*Param()}:

\begin{itemize}
\item \Rcode{SerialParam()}: sequential evaluation in a single \emph{R} process.
\item \Rcode{SnowParam()}: parallel evaluation in multiple independent
  \emph{R} processes.
\item \Rcode{MulticoreParam())}: parallel evaluation in \emph{R}
  sessions running in forked threads. Not available on Windows.
\end{itemize}

The \Rcode{*Param()} can be used for evaluation with:

\begin{itemize}
\item \Rcode{bplapply()}: \Rcode{lapply()}-like application of a
  user-supplied function \Rcode{FUN} to a vector or list of elements
  \Rcode{X}.
\item \Rcode{bpiterate()}: apply a user-supplied function \Rcode{FUN} to
  an unknown number of elements resulting from successive calls to a
  user-supplied function \Rcode{ITER}.
\end{itemize}

The reproducible random number implementation also supports:

\begin{itemize}
\item \Rcode{bptry()} and the \Rcode{BPREDO=} argument, for
  re-evaluation of elements that fail (e.g., because of a bug in
  \Rcode{FUN}).
\end{itemize}

\section{Essentials}

\subsection{Use of \Rcode{bplapply()} and \Rcode{RNGseed=}}

Attach \BiocParallel{} and ensure that the version is greater than 1.27.5

<<load>>=
library(BiocParallel)
stopifnot(
    packageVersion("BiocParallel") > "1.27.5"
)
@

For reproducible calculation, use the \Rcode{RNGseed=} argument in any
of the \Rcode{*Param()} constructors.

<<basic-use>>=
result1 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
result1
@

Repeating the calculation with the same value for \Rcode{RNGseed=}
results in the same result; a different random number seed results in
different results.

<<same-rng>>=
result2 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
stopifnot(
    identical(result1, result2)
)

result3 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 200))
result3
stopifnot(
    !identical(result1, result3)
)
@

Results are invariant across \Rcode{*Param()}

<<different-param>>=
result4 <- bplapply(1:3, runif, BPPARAM = SnowParam(RNGseed = 100))
stopifnot(
    identical(result1, result4)
)

if (!identical(.Platform$OS.type, "windows")) {
    result5 <- bplapply(1:3, runif, BPPARAM = MulticoreParam(RNGseed = 100))
    stopifnot(
        identical(result1, result5)
    )
}
@

Parallel backends can adjust the number of \Rcode{workers} (processes
performing the evaluation) and \Rcode{tasks} (how elements of \Rcode{X}
are distributed between workers). Results are invariant to these
parameters. This is illustrated with \Rcode{SnowParam()}, but applies
also to \Rcode{MulticoreParam()}.

<<different-workers-tasks>>=
result6 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 2, RNGseed = 100))
result7 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 3, RNGseed = 100))
result8 <- bplapply(
    1:3, runif,
    BPPARAM = SnowParam(workers = 2, tasks = 3, RNGseed = 100)
)
stopifnot(
    identical(result1, result6),
    identical(result1, result7),
    identical(result1, result8)
)
@

Subsequent sections illustrate results with \Rcode{SerialParam()}, but
identical results are obtained with \Rcode{SnowParam()} and
\Rcode{MulticoreParam()}.

\subsection{Use with \Rcode{bpiterate()}}

\Rcode{bpiterate()} allows parallel processing of a 'stream' of data as
a series of tasks, with a task consisting of a portion of the overall
data. It is useful when the data size is not known or easily
partitioned into elements of a vector or list. A real use case might
involve iterating through a BAM file, where a task represents successive
records (perhaps 100,000 per task) in the file. Here we illustrate
with a simple example -- iterating through a vector \Rcode{x = 1:3}

<<bpiterate-ITER_FUN_FACTORY>>=
ITER_FUN_FACTORY <- function() {
    x <- 1:3
    i <- 0L
    function() {
        i <<- i + 1L
        if (i > length(x))
            return(NULL)
        x[[i]]
    }
}
@ 

\Rcode{ITER\_FUN\_FACTORY()} is used to create a function that, on each invocation,
returns the next task (here, an element of \Rcode{x}; in a real
example, perhaps 100000 records from a BAM file). When there are no
more tasks, the function returns \Rcode{NULL}

<<bpiterate-ITER>>=
ITER <- ITER_FUN_FACTORY()
ITER()
ITER()
ITER()
ITER()
@ 

In our simple example, \Rcode{bpiterate()} is performing the same
computations as \Rcode{bplapply()} so the results, including the
random number streams used by each task in \Rcode{bpiterate()}, are
the same

<<bpiterate>>=
result9 <- bpiterate(
    ITER_FUN_FACTORY(), runif,
    BPPARAM = SerialParam(RNGseed = 100)
)
stopifnot(
    identical(result1, result9)
)
@ 

\subsection{Use with \Rcode{bptry()}}

\Rcode{bptry()} in conjunction with the \Rcode{BPREDO=} argument to
\Rcode{bplapply()} or \Rcode{bpiterate()} allows for graceful recovery
from errors. Here a buggy \Rcode{FUN1()} produces an error for the
second element. \Rcode{bptry()} allows evaluation to continue for
other elements of \Rcode{X}, despite the error. This is shown in the
result.

<<bptry>>=
FUN1 <- function(i) {
    if (identical(i, 2L)) {
        ## error when evaluating the second element
        stop("i == 2")
    } else runif(i)
}
result10 <- bptry(bplapply(
    1:3, FUN1,
    BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE)
))
result10
@

\Rcode{FUN2()} illustrates the flexibility of \Rcode{bptry()} by fixing
the bug when \Rcode{i == 2}, but also generating incorrect results if
invoked for previously correct values. The identity of the result to
the original computation shows that only the error task is
re-computed, and that the random number stream used by the task is
identical to the original stream.

<<bptry-REDO>>=
FUN2 <- function(i) {
    if (identical(i, 2L)) {
        ## the random number stream should be in the same state as the
        ## first time through the loop, and rnorm(i) should return
        ## same result as FUN
        runif(i)
    } else {
        ## if this branch is used, then we are incorrectly updating
        ## already calculated elements -- '0' in the output would
        ## indicate this error
        0
    }
}
result11 <- bplapply(
    1:3, FUN2,
    BPREDO = result10,
    BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE)
)
stopifnot(
    identical(result1, result11)
)
@ 

\subsection{Relationship between \Rcode{RNGseed=} and \Rcode{set.seed()}}

The global random number stream (influenced by \Rcode{set.seed()}) is
ignored by \BiocParallel{}, and \BiocParallel{} does NOT increment the
global stream.

<<RNGseed-and-set-seed>>=
set.seed(200)
value <- runif(1)

set.seed(200)
result12 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100))
stopifnot(
    identical(result1, result12),
    identical(value, runif(1))
)
@

When \Rcode{RNGseed=} is not used, an internal stream (not accessible
to the user) is used and \BiocParallel{} does NOT increment the global stream.

<<not-RNGseed-and-set-seed>>=
set.seed(100)
value <- runif(1)

set.seed(100)
result13 <- bplapply(1:3, runif, BPPARAM = SerialParam())
stopifnot(
    !identical(result1, result13),
    identical(value, runif(1))
)
@

\subsection{\Rcode{bpstart()} and random number streams}

In all of the examples so far \Rcode{*Param()} objects are passed to
\Rcode{bplapply()} or \Rcode{bpiterate()} in the 'stopped'
state. Internally, \Rcode{bplapply()} and \Rcode{bpiterate()} invoke
\Rcode{bpstart()} to establish the computational environment (e.g.,
starting workers for \Rcode{SnowParam()}). \Rcode{bpstart()} can be
called explicitly, e.g., to allow workers to be used across calls to
\Rcode{bplapply()}.

The cluster random number stream is initiated with
\Rcode{bpstart()}. Thus

<<bpstart-basic>>=
param <- bpstart(SerialParam(RNGseed = 100))
result16 <- bplapply(1:3, runif, BPPARAM = param)
bpstop(param)
stopifnot(
    identical(result1, result16)
)
@ 

This allows a second call to \Rcode{bplapply} to represent a
continuation of a random number computation -- the second call to
\Rcode{bplapply()} results in different random number streams for each
element of \Rcode{X}.

<<bpstart-continuation>>=
param <- bpstart(SerialParam(RNGseed = 100))
result16 <- bplapply(1:3, runif, BPPARAM = param)
result17 <- bplapply(1:3, runif, BPPARAM = param)
bpstop(param)
stopifnot(
    identical(result1, result16),
    !identical(result1, result17)
)
@ 

\subsection{Relationship between \Rcode{bplapply()} and \Rcode{lapply()}}

The results from \Rcode{bplapply()} are different from the results from
\Rcode{lapply()}, even with the same random number seed. This is
because correctly implemented parallel random streams require use of a
particular random number generator invoked in specific ways for each
element of \Rcode{X}, as outlined in the Implementation notes section.

<<lapply>>=
set.seed(100)
result20 <- lapply(1:3, runif)
stopifnot(
    !identical(result1, result20)
)
@ 

\section{Implementation notes}

The implementation uses the L'Ecuyer-CMRG random number generator (see
\Rcode{?RNGkind} and \Rcode{?parallel::clusterSetRNGStream} for
additional details). This random number generates independent streams
and substreams of random numbers. In \BiocParallel{}, each call to
\Rcode{bpstart()} creates a new stream from the L'Ecuyer-CMRG
generator. Each element in \Rcode{bplapply()} or \Rcode{bpiterate()}
creates a new substream. Each application of \Rcode{FUN} is therefore
using the L'Ecuyer-CMRG random number generator, with a substream that
is independent of the substreams of all other elements.

Within the user-supplied \Rcode{FUN} of \Rcode{bplapply()} or
\Rcode{bpiterate()}, it is a mistake to use \Rcode{RNGkind()} to set a
different random number generator, or to use \Rcode{set.seed()}. This
would in principle compromise the independence of the streams across
elements.

\section{\Rcode{sessionInfo()}}

<<sessionInfo, echo=FALSE>>=
sessionInfo()
@

\end{document}