%\VignetteIndexEntry{2. Introduction to BatchtoolsParam}
%\VignetteKeywords{parallel, Infrastructure}
%\VignettePackage{BiocParallel}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

<<setup, echo=FALSE>>=
suppressPackageStartupMessages({
    library(BiocParallel)
})
@

\newcommand{\BiocParallel}{\Biocpkg{BiocParallel}}

\title{Introduction to \emph{BatchtoolsParam}}
\author{
  Nitesh Turaga\footnote{\url{Nitesh.Turaga@RoswellPark.org}},
  Martin Morgan\footnote{\url{Martin.Morgan@RoswellPark.org}}
}
\date{Edited: March 22, 2018; Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \Rcode{BatchtoolsParam} class is an interface to the
\CRANpkg{batchtools} package from within \BiocParallel{}. It aims to
replace \Rcode{BatchJobsParam} as \BiocParallel{}'s class for computing
on a high performance cluster managed by a job scheduler such as SGE,
TORQUE, LSF, SLURM or OpenLava.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Quick start}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

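This example demonstrates the easiest way to launch jobs using
batchtools. The first step is to create a \Rcode{BatchtoolsParam}
object. The object is then passed to \Rcode{bplapply} through the
\Rcode{BPPARAM} argument; here \Rcode{piApprox} is applied to 10 tasks
of 10e5 points each, and the results are returned as a list.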

<<intro>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

piApprox(1000)

## Apply piApprox to 10 tasks of 10e5 points each
param <- BatchtoolsParam()
result <- bplapply(rep(10e5, 10), piApprox, BPPARAM=param)
mean(unlist(result))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\emph{BatchtoolsParam} interface}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \Rcode{BatchtoolsParam} interface replaces the
\Rcode{BatchJobsParam} interface and allows more intuitive usage of
your high performance cluster with \BiocParallel{}.

The \Rcode{BatchtoolsParam} class allows the user to specify many
arguments to customize their jobs. Most of these arguments are
relevant to clusters with formal schedulers.

\begin{itemize}

  \item{\Rcode{workers}} The number of workers used by the job.

  \item{\Rcode{cluster}}
    We currently support SGE, SLURM, LSF, TORQUE and
    OpenLava. The \Rcode{cluster} argument is supported only if the R
    environment knows how to find the job scheduler. Each cluster type
    uses a template to pass the job to the scheduler. If the template is
    not given, the default template from the \CRANpkg{batchtools}
    package is used. The cluster can be accessed with
    \Rcode{bpbackend(param)}.

  \item{\Rcode{registryargs}}
    The \Rcode{registryargs} argument takes a list of arguments used to
    create a new job registry for your \Rcode{BatchtoolsParam}. The job
    registry is a data.table which stores all the information required
    to process your jobs. The arguments we support for
    \Rcode{registryargs} are:

    \begin{description}

      \item{\Rcode{file.dir}} Path where all files of the registry are
        saved. Note that some templates do not handle relative paths
        well. If nothing is given, a temporary directory will be used
        in your current working directory.

      \item{\Rcode{work.dir}} Working directory for R process for
        running jobs.

      \item{\Rcode{packages}} Packages that will be loaded on each node.

      \item{\Rcode{namespaces}} Namespaces that will be loaded on each
        node.

      \item{\Rcode{source}} Files that are sourced before executing a
        job.

      \item{\Rcode{load}} Files that are loaded before executing a job.

    \end{description}

<<>>=
registryargs <- batchtoolsRegistryargs(
    file.dir = "mytempreg",
    work.dir = getwd(),
    packages = character(0L),
    namespaces = character(0L),
    source = character(0L),
    load = character(0L)
)
param <- BatchtoolsParam(registryargs = registryargs)
param
@

  \item{\Rcode{resources}} A named list of key-value pairs to be
    substituted into the template file; see
    \Rcode{?batchtools::submitJobs}.

  \item{\Rcode{template}} The template argument is unique to the
    \Rcode{BatchtoolsParam} class. It is required by the job
    scheduler. It defines how the jobs are submitted to the job
    scheduler. If the template is not given and the cluster is chosen,
    a default template is selected from the batchtools package.

  \item{\Rcode{log}} A logical value, \Rcode{TRUE} or \Rcode{FALSE}.
    If \Rcode{TRUE}, the logs stored in the registry are copied to the
    directory given by the \Rcode{logdir} argument.

  \item{\Rcode{logdir}} Path to the directory where logs are copied.
    It is used only if \Rcode{log=TRUE}.

  \item{\Rcode{resultdir}} Path to the directory where the job saves
    its result files, if it has any.

\end{itemize}
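
As a rough sketch of how several of these arguments fit together, the
chunk below constructs a \Rcode{BatchtoolsParam} for a SLURM cluster.
The resource names (\Rcode{walltime}, \Rcode{memory}, \Rcode{ncpus})
and the log directory are assumptions chosen for illustration; the
resource names must match the variables used in your template, and the
values must suit your cluster. The chunk is not evaluated because it
requires a SLURM scheduler.

<<param_sketch, eval=FALSE>>=
library(BiocParallel)

## Hypothetical log directory; created here so that it exists
logdir <- file.path(getwd(), "batchtools-logs")
dir.create(logdir, showWarnings = FALSE)

## Hypothetical resource names and values: they must match the
## variables referenced in the template used by your cluster
param <- BatchtoolsParam(
    workers = 5,
    cluster = "slurm",
    resources = list(walltime = 3600, memory = 1024, ncpus = 1),
    log = TRUE,
    logdir = logdir
)

## Inspect the cluster type and the number of workers
bpbackend(param)
bpnworkers(param)
@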

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Defining templates}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The job submission template controls how the job is processed by the
job scheduler on the cluster.  Obviously, the format of the template
will differ depending on the type of job scheduler.  Let's look at the
default SLURM template as an example:

<<>>=
fname <- batchtoolsTemplate("slurm")
cat(readLines(fname), sep="\n")
@

The \Rcode{<\%= \%>} blocks are automatically replaced by the values of
the elements in the \Rcode{resources} argument in the
\Rcode{BatchtoolsParam} constructor.  Failing to specify critical
parameters properly (e.g., wall time or memory limits too low) will
cause jobs to crash, usually rather cryptically.  We suggest setting
parameters explicitly to provide robustness to changes to system
defaults.  Note that the \Rcode{<\%= \%>} blocks themselves do not
usually need to be modified in the template.
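
To illustrate the substitution, the sketch below fills a small,
hypothetical two-line template fragment using the \CRANpkg{brew}
package, which \CRANpkg{batchtools} relies on for template processing.
The fragment, the resource names and the values are assumptions made
for illustration only; real templates are longer and may transform the
values before use.

<<brew_sketch, eval=FALSE>>=
## Hypothetical template fragment, not a complete template
fragment <- c(
    "#SBATCH --time=<%= resources$walltime %>",
    "#SBATCH --mem-per-cpu=<%= resources$memory %>"
)

## Hypothetical resource values
resources <- list(walltime = 60, memory = 1024)

## Each <%= %> block is evaluated in an environment where
## 'resources' is visible and replaced by the resulting value
env <- list2env(list(resources = resources))
filled <- capture.output(
    brew::brew(text = paste(fragment, collapse = "\n"), envir = env)
)
filled
@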

The part of the template that is most likely to require explicit
customization is the last line containing the call to \Rcode{Rscript}.
A more customized call may be necessary if the R installation is not
standard, e.g., if multiple versions of R have been installed on a
cluster.  For example, one might use instead:

<<eval=FALSE>>=
echo 'batchtools::doJobCollection("<%= uri %>")' |\
    ArbitraryRcommand --no-save --slave
@

If such customization is necessary, we suggest making a local copy of
the template, modifying it as required, and then constructing a
\Rcode{BatchtoolsParam} object with the modified template using the
\Rcode{template} argument.  However, we find that the default
templates accessible with \Rcode{batchtoolsTemplate} are satisfactory
in most cases.
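
A minimal sketch of this workflow follows. The local file name is
hypothetical, and the chunk is not evaluated because constructing the
param requires a SLURM scheduler to be available.

<<custom_template, eval=FALSE>>=
library(BiocParallel)

## Copy the default SLURM template to a local file
## (the file name is a hypothetical choice)
local_template <- file.path(getwd(), "slurm-custom.tmpl")
file.copy(batchtoolsTemplate("slurm"), local_template)

## ... edit the local copy by hand, e.g., the final call to Rscript ...

## Construct the param with the modified template
param <- BatchtoolsParam(cluster = "slurm", template = local_template)
@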

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Use cases}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

As an example of a \Rcode{BatchtoolsParam} job being run on an SGE
cluster, we use the same \Rcode{piApprox} function as defined earlier.
The example runs the function on 5 workers and submits 100 jobs to the
SGE cluster.

Example of SGE with minimal code:

<<simple_sge_example, eval=FALSE>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox, BPPARAM=param)
@

Example of SGE demonstrating some of the \Rcode{BatchtoolsParam} methods:

<<demo_sge, eval=FALSE>>=
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## start param
bpstart(param)

## Display param
param

## To show the registered backend
bpbackend(param)

## Register the param
register(param)

## Check the registered param
registered()

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox)

bpstop(param)
@

\section{\Rcode{sessionInfo()}}

<<sessionInfo, results="asis">>=
toLatex(sessionInfo())
@

\end{document}