\documentclass{article}
\usepackage{mathptmx}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{float}
\usepackage{epsfig}
%% \usepackage{semcolor}
\usepackage[pdftex,colorlinks=false,pdfborder={0 0 0},pageanchor=false,bookmarks=false,urlcolor=black]{hyperref}

\pdfpagewidth=11truein
\pdfpageheight=8.5truein
\pdfhorigin=1truein % default value(?), but doesn't work without
\pdfvorigin=1truein % default value(?), but doesn't work without

%% \newcommand{\fs}{\fromSlide}
%% \newcommand{\bs}{\begin{slide}}
%% \newcommand{\es}{\end{slide}}
%% \slideframe{none}

\setcounter{secnumdepth}{0}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\R}{\textsf{R}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}}

\begin{document}

<<>>=
options(width=70)
@

\author{M. T. Morgan\footnote{\url{mtmorgan@fhcrc.org}} \\
  Fred Hutchinson Cancer Research Center \\
  Seattle, WA}
\title{Parallel R Practical}
\date{4 August, 2006}
\maketitle

\section{Part 1: getting going}

We have two objectives in the first part of this practical: to orient
you on using our `cluster', and to expose you to some of the basic
functionality of the \Rpackage{Rmpi} package.

\subsection{Connecting to our `cluster'}

We've set up a small `cluster' of computers to use as a parallel
environment. The cluster is very minimalist in some senses (just three
nodes), and the Linux environment might be unfamiliar to some of you.
We'll try to walk through the very basics here.

If you're familiar with \verb|ssh|, then feel free to connect
immediately to the IP address available at the front of the room.
Otherwise, follow the instructions available at the start of the lab.
All of us will use the same username \verb|knoppix| and password
\verb|bioc|.

%% The steps you'll follow are:
%% \begin{enumerate}
%% \item Connect to the cluster.
%% \item Change to your working directory.
%% \item Perform preliminary operations to make sure \R{} is available
%% and functioning properly on a single node.
%% \end{enumerate}
%% \subsubsection{Connecting to the cluster}
%% You'll need to connect to the cluster using \verb|ssh| or
%% \verb|telnet|. The cluster nodes are at the ip addresses
%% \begin{verbatim}
%% 140.107.169.186
%% 140.107.169.28
%% 140.107.169.31
%% \end{verbatim}
%% To use \verb|telnet| on Windows:
%% \begin{enumerate}
%% \item Go to the `Start' and then `Run...' menu items.
%% \item In the field labeled 'Open:', enter \verb|telnet 140.107.169.186|.
%% \item When prompted for a username, enter \verb|knoppix| and then the
%% return key. For a password, enter \verb|bioc|.
%% \end{enumerate}
%% To use \verb|telnet| on a Mac:
%% \begin{enumerate}
%% \item Open a console window.
%% \item Type \verb|telnet 140.107.169.186| at the prompt.
%% \item Enter username \verb|knoppix| and password \verb|bioc|
%% \end{enumerate}

\subsection{Once connected}

Congratulations! You are now at a Linux command line. You've all logged
in as the same user. To avoid stepping on each other's toes too much,
please change to your assigned directory immediately. Do this by typing
\begin{verbatim}
[% user0 %] cd ~/user0
\end{verbatim}
where \verb|user0| is the user id assigned to you at the start of the
practical. This will be the directory to use for storing any files you
might create, and will be the home directory for your \R{} session.
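To confirm that you are where you think you are, try
\begin{verbatim}
[% user0 %] pwd
[% user0 %] ls
\end{verbatim}
\verb|pwd| should report a path ending in your assigned user id, and
\verb|ls| will probably show an empty (or nearly empty) directory at
this point.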
An (initially identical) directory exists on each of the other nodes in
our cluster.

Here are some Linux commands that you might find useful:
\begin{description}
\item[ls] list the contents of a directory; \verb|ls| lists the current
  directory contents, \verb|ls mydir| lists the contents of the
  directory \verb|mydir| located in the current directory, and
  \verb|ls ~/mydir| lists the contents of the directory \verb|mydir| in
  the `home' (\verb|~|) directory.
\item[cd] change directory; \verb|cd ..| changes up a level,
  \verb|cd mydir| changes to the directory \verb|mydir| in the current
  directory, and \verb|cd| changes to the `home' directory.
\item[pwd] list the path to the current directory.
\item[R] launch \R{}, with the `working directory' set to the current
  directory.
\item[logout] log out of the session.
\end{description}
You are actually in a fairly full-featured Linux environment, so if
you're familiar with this feel free to launch the editor of your
choice, etc. Since we're all sharing the same computer, please don't be
too resource greedy, and please do not modify files such as
\verb|.bashrc| that will influence other users.

As a simple test that things are going well, try launching the `command
line' version of \R. Do this by typing
\begin{verbatim}
[% user0 %] R
\end{verbatim}
at the prompt. You should be greeted by the familiar welcome
information screen, and be presented with the \verb|>| prompt. Try
typing some of your favorite \R{} commands, and loading a few
packages. Everything should work as expected.
<<>>=
x <- 10:1
y <- 1:10
x + y
hist(runif(1000))
sessionInfo()
library(Biobase)
search()
@
If there are problems, let's try to work them out now before carrying
on.

A couple of quick things about \R{} in command-line mode. You might
find it useful to use the keys \verb|ctrl-p| and \verb|ctrl-n| to
recall commands that you've already entered. Some commands make use of
other programs that can be confusing; in particular, the command
<<>>=
f <- function(x) x
fix(f)
@
will open the \verb|vi| editor to edit the function \verb|f|. Unless
you're familiar with \verb|vi|, probably the best thing to do is to
exit immediately by typing \verb|:q| followed by the return key. We'll
try to stick with simple input and output, to avoid having to make
complex editing changes.

Another potential source of confusion results from searching for help:
\begin{verbatim}
> library(help = Rmpi)
> ?mpi.bcast.cmd
> help.search("bcast")
\end{verbatim}
Each of these commands opens the \verb|less| `pager'. You can scroll
through the resulting display by typing \verb|f| and \verb|b| to move
forward and backward a page at a time, and type \verb|q| to stop
displaying the help information.

\subsection{Launching and exploring \Rpackage{Rmpi}}

The objective of this section is to familiarize you with the basic
functionality of the \Rpackage{Rmpi} package. We won't really do any
useful work, but hopefully this will spur some thoughts about how to
use parallel programming.

Start by loading the \Rpackage{Rmpi} package, starting and stopping a
collection of nodes, and executing a few simple commands:
<<>>=
library(Rmpi)
mpi.spawn.Rslaves()
mpi.remote.exec(search())
x <- 1:5
mpi.bcast.Robj2slave(x)
rm(x)
mpi.remote.exec(x^2)
mpi.remote.exec(x <- x^2)
mpi.remote.exec(x <- x*mpi.comm.rank())
mpi.close.Rslaves()
@
Now let's get a more comprehensive feeling for the functionality of the
package.
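Before turning to the documentation, one small experiment helps make
the layout of the cluster concrete (a brief aside, using only functions
that appear in the chunk above plus \Rfunction{mpi.comm.size}):
\begin{verbatim}
> mpi.spawn.Rslaves(nslaves=2)
> mpi.remote.exec(mpi.comm.rank())
> mpi.comm.size()
> mpi.close.Rslaves()
\end{verbatim}
Each slave reports a different rank, the manager is rank 0, and
\Rfunction{mpi.comm.size} counts the manager plus all of the slaves.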
Look at the overview of functions provided by
<<>>=
library(help=Rmpi)
@
Remember that you can scroll forward and backward with the commands
\verb|f| and \verb|b|, and that you can stop reading with \verb|q|.

There are three groups of functions. We'll find those toward the end,
in the section `MPI Extensions specifically to slavedaemon.R', most
useful (\verb|slavedaemon.R| refers to a script in the \Rpackage{Rmpi}
package used to launch nodes when we use the command
\Rfunction{mpi.spawn.Rslaves}). Explore the help pages for specific
functions that look particularly relevant, e.g.,
\begin{verbatim}
> ?mpi.remote.exec
> ?mpi.parLapply
> ?mpi.setup.rngstream
> ?mpi.close.Rslaves
\end{verbatim}
Think about how these functions might be helpful in parallel analysis.

The functions in the section `MPI Extensions in R Environment' are
useful in conjunction with the \Rfunction{mpi.spawn.Rslaves} function,
as well as in more general parallel programming contexts. Scan the help
pages for a few of these
\begin{verbatim}
> ?mpi.spawn.Rslaves
> ?mpi.bcast.Robj
\end{verbatim}
How would you launch a cluster with, say, 6 slaves? Any ideas about
when you would want a cluster with more or fewer slaves than there are
compute nodes in the cluster?

The functions in the section `MPI APIs' are the core of
\Rpackage{Rmpi}. If you have used MPI in other programming languages,
then you'll recognize that \Rpackage{Rmpi} has an interface to some but
not all of the functionality of MPI. \Rpackage{Rmpi} provides the
common point-to-point (e.g., \verb|send| and \verb|recv| data between
two nodes), collective (e.g., \verb|bcast|, \verb|scatter| and
\verb|gather| data efficiently to collections of nodes), and
non-blocking operations. We will not emphasize these functions, though
they are actually used in many of the functions we have already
executed.

\subsection{A first example: mad about Golub}

Many Bioconductor packages summarize gene expression data in
\Rclass{ExpressionSet} objects. Type the following
<<>>=
library(golubEsets)
data(golubMerge)
golubMerge
@
and view the help page (\verb|?golubMerge|) about this data set to get
a feel for what it represents.

The \R{} function \Rfunction{mad} calculates the \emph{median absolute
deviation} of a sample. The \Rfunction{mad} is a non-parametric
analogue of the standard deviation, describing variability in the
sample. As an exploratory step, we might be interested in identifying
genes with large variability, thinking perhaps that these are going to
be most useful in distinguishing between samples (this is naive!). Here
is how we might extract the matrix of expression values from
\verb|golubMerge|, calculate the \Rfunction{mad} for each gene, and
view the top 10 most variable genes in \verb|golubMerge|:
<<>>=
exprSet <- exprs(golubMerge)
res <- apply(exprSet, 1, mad)
sort(res, decreasing=TRUE)[1:10]
@
See the help pages (\verb|?exprs|, \verb|?apply|, etc.) for information
on \R{} functions you do not recognize, or to learn more about
arguments provided to functions.

We'll now walk through a couple of different ways in which we could use
a parallel environment to perform these calculations. The first method
is to replace \Rfunction{apply} with \Rfunction{mpi.parApply}.
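The two functions are designed to be near drop-in substitutes. If you
want to compare their argument lists before running anything (a quick
aside; the extra arguments of \Rfunction{mpi.parApply}, which control
how jobs are scheduled on the slaves, may vary between \Rpackage{Rmpi}
versions), try
\begin{verbatim}
> args(apply)
> ?mpi.parApply
\end{verbatim}
In both cases the first arguments are the object, the margin to apply
over, and the function to apply.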
To do this, evaluate commands like the following:
<<>>=
library(Rmpi)
mpi.spawn.Rslaves(nslaves=3)
res <- mpi.parApply(exprSet, 1, mad)
sort(res, decreasing=TRUE)[1:10]
mpi.close.Rslaves()
@
(It is not necessary to use \Rfunction{mpi.spawn.Rslaves} and
\Rfunction{mpi.close.Rslaves} for every calculation, just once for each
\R{} session.)
\begin{itemize}
\item Hopefully the results of the two different ways of calculating
  \Robject{res} are the same. Use \Rfunction{identical} to confirm
  this. Are there situations when results might not be exactly
  identical, but `close enough'?
\end{itemize}

Use \Rfunction{system.time} to get a sense for how long calculations
are taking:
<<>>=
mpi.spawn.Rslaves(nslaves=3)
system.time(res1 <- apply(exprSet, 1, mad))
system.time(res2 <- mpi.parApply(exprSet, 1, mad))
identical(res1, res2)
@
This is very approximate, especially with the setup in the lab!
\begin{itemize}
\item How does the system time for parallel evaluation compare with
  that for evaluation on a single node (pay particular attention to the
  first and third numbers reported by \Rfunction{system.time}, and use
  the help page for \Rfunction{system.time} to help you interpret the
  results)?
\item Does \Rfunction{system.time} offer any evidence that calculations
  are actually occurring on remote nodes?
\end{itemize}
In my trials, this command took about as long on one node as on several
nodes, and was sometimes even faster on a single node.
\begin{itemize}
\item What sorts of factors might influence parallel evaluation time?
\item If evaluating this type of command really is about as
  time-efficient on one as on several nodes, does this mean that
  parallelization is pointless? Would it be better to dissect the
  calculation into smaller pieces (e.g., figuring out where
  \Rfunction{mad} spends most of its time) and try to parallelize those
  parts, or to look for ways of combining the \Rfunction{mad}
  calculation with additional steps of analysis? How would you go about
  each of these steps (dissection, or combining analyses)?
\end{itemize}

Let's explore a second way of performing our calculation in parallel.
The previous example sends portions of the \Robject{exprSet} `over the
wire' to each node. But actually, each node has access to a copy of
\verb|golubMerge| on its own disk. So let's just send the
\emph{commands} over the wire, and see what happens. The single node
code we will try to emulate uses \Rfunction{sapply}:
<<>>=
ff <- function(i) mad(exprSet[i,])
res1 <- sapply(1:nrow(exprSet), ff)
@
To execute in parallel, we have to load the required data on each node:
<<>>=
mpi.bcast.cmd(library(golubEsets))
mpi.bcast.cmd(data(golubMerge))
mpi.bcast.cmd(exprSet <- exprs(golubMerge))
@
We can then do our parallel evaluation sending only our function
\Rfunction{ff} to the remote nodes:
<<>>=
system.time(res1 <- sapply(1:nrow(exprSet), ff))
system.time(res2 <- mpi.parSapply(1:nrow(exprSet), ff))
identical(res1, res2)
@
On a real cluster, this calculation seems to scale very well, with
compute time almost inversely proportional to the number of nodes
performing calculations.
\begin{itemize}
\item Is it `fair' to only include the time spent in
  \Rfunction{mpi.parSapply}? Or should we also include the time for
  sending the additional commands? Or even for distributing the
  \Rpackage{golubEsets} library to the nodes in the first place?
\item Suppose the calculation really does scale inversely with the
  number of nodes. Sketch a graph, with the x-axis indicating the
  number of nodes in a cluster and the y-axis indicating time. Sketch a
  curve reflecting the total `computation' time as a function of node
  number.
  How much do I gain by adding a constant number of nodes, say 1, to a
  small cluster, compared to a large cluster? Sketch a second curve
  reflecting communication time. If there were no network issues,
  communication time might increase close to logarithmically with the
  number of nodes. Combine the computation and communication times into
  a single, overall time. What does this say about limits to
  parallelization?
\end{itemize}

\subsection{Reproducibility}

As a final introductory exploration of \Rpackage{Rmpi}, and as an
introduction to a thorny issue in parallel programming, let's explore
generating random numbers. Here are some things to try:
\begin{itemize}
\item Start a cluster of slaves.
\item Use \Rfunction{runif} and \Rfunction{mpi.remote.exec} to generate
  15 random numbers on each node. Why does every node generate the same
  sequence of supposedly random numbers?
\item Figure out how to use \Rfunction{mpi.setup.rngstream} to generate
  streams of random numbers that are different on each node.
\item Figure out how to use \Rfunction{mpi.setup.rngstream} so that
  each node produces a different stream of random numbers (different on
  each node -- call the streams `A'), but so that two separate
  invocations of the same random number call again produce the streams
  `A'.
\item Now arrange to start the cluster, generate a sequence of random
  numbers (stream `A'), stop the cluster, then start the cluster again
  (with the same number of nodes) and generate the sequence of random
  numbers `A' again.
\item Try stopping your cluster, and starting with a \emph{different}
  number of nodes (remember that each slave starts up a new instance of
  \R{}, and that we only have three actual nodes available, so don't be
  too ambitious about your cluster size). Can you generate the sequence
  of random numbers `A'? If you wanted to be able to repeat exactly the
  same sequence of random numbers regardless of cluster size, what
  might you do?
\end{itemize}

\section{Part 2: cross-validation, bootstrap, and `SPMD' processing}

This portion of the lab tackles some useful work, in the form of an
analysis (\Rfunction{xval}) that can require substantial computational
time. We also explore a model of parallel computation in \R{}, where
each node executes the same program, performing some redundant
calculation before or after a parallel section where work is divided
between nodes.

\subsection{Cross-validation}

The serial version of the analysis we want to do is
<<>>=
library(MLInterfaces)
library(golubEsets)
data(golubMerge)
smallG <- golubMerge[200:250, ]
lk1 <- xval(smallG, "ALL.AML", knnB, xvalMethod = "LOO",
            group = as.integer(0))
table(lk1, smallG$ALL.AML)
@
This takes a subset of the \Robject{golubMerge} data set and devises an
algorithm to classify gene expression profiles into tumor types (acute
lymphoblastic leukemia, ALL, or acute myeloid leukemia, AML) using the
$k$-nearest neighbor machine learning algorithm and leave-one-out
cross-validation. The table summarizes how the cross-validated
classifications performed: diagonal elements represent correct
classifications.

If the next paragraph becomes too complicated, use the command
<<>>=
source("~/rsrcs/xval-setup.R")
@
to read the required set-up into your \R{} session.

The \Rfunction{xval} function provides a `hook' that allows users to
reach into the function and parallelize the computationally intensive
section of code.
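Before defining anything ourselves, it is worth checking that the hook
really exists in your installed version (a small aside; I'm assuming
\Rfunction{xvalLoop} is exported by \Rpackage{MLInterfaces} as an
ordinary S4 generic, as the \Rfunction{setMethod} call below requires):
\begin{verbatim}
> ?xval
> showMethods("xvalLoop")
\end{verbatim}
If the generic is there, \Rfunction{showMethods} lists the methods
currently defined for it; the default presumably just returns
\Rfunction{lapply}, and our job is to supply a method that returns a
parallel replacement.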
We'll access the hook by creating a class that serves as a `predicate'
to influence code execution, and then creating an object of that class:
<<>>=
setClass("RmpiXval", representation("list"))
cluster <- new("RmpiXval")
@
The documentation for \Rfunction{xval} suggests that we can define a
method for \Rfunction{xvalLoop} whose return value is used in place of
the \Rfunction{lapply} loop that performs the computationally important
and parallelizable parts of the calculation. The following method makes
it possible to use \Rfunction{mpi.parLapply} instead of
\Rfunction{lapply}.
<<>>=
setMethod("xvalLoop", signature(cluster="RmpiXval"),
          function(cluster, ...) mpi.parLapply)
@
With these preliminaries, we're ready for our first attempt at
parallelizing \Rfunction{xval}:
<<>>=
mpi.bcast.cmd(library(MLInterfaces))
lk1 <- xval(smallG, "ALL.AML", knnB, xvalMethod = "LOO",
            group = as.integer(0), cluster=cluster)
table(lk1, smallG$ALL.AML)
@
\begin{itemize}
\item Hopefully the single-processor and parallel results agree. Do
  they?
\item We discussed \Rfunction{mpi.parLapply} above. When it is used in
  \Rfunction{xval}, the effect is to forward \Robject{exprs(smallG)} to
  each of the nodes in the cluster. How would you expect this solution
  to scale? Any ideas about how to make this more efficient?
\end{itemize}

\subsection{`Single program, multiple data' model}

Before starting this section, execute the command
<<>>=
source("~/rsrsc/batch.R")
@
<<wrap, echo=FALSE>>=
wrap <- function(func, pseudo, comm=1) {
    iter <- -1
    sz <- mpi.comm.size(comm)-1
    rank <- mpi.comm.rank(comm)
    tag <- list(result=1)
    if (rank==0) function(...) {
        iter <<- iter + 1
        mpi.recv.Robj(iter %% sz + 1, tag$result, comm)
    } else function(...) {
        iter <<- iter + 1
        if (iter %% sz + 1 == rank) {
            result <- func(...)
            mpi.send.Robj(result, 0, tag$result, comm)
        } else result <- pseudo
        result
    }
}
mpi.bcast.Robj2slave(wrap)
@
<<bcastRcmd, echo=FALSE>>=
bcast.Rcmd <- function (cmd = NULL, rank = 0, comm = 1) {
    if (mpi.comm.rank(comm) == rank) {
        cmdp <- deparse(substitute(cmd), width.cutoff = 500)
        cmdp <- paste(cmdp, collapse = "\"\"/")
        mpi.bcast(x = nchar(cmdp), type = 1, rank = rank, comm = comm)
        mpi.bcast(x = cmdp, type = 3, rank = rank, comm = comm)
        eval(cmd, parent.frame())
    } else {
        charlen <- mpi.bcast(x = integer(1), type = 1, rank = rank,
                             comm = comm)
        if (is.character(charlen)) {
            parse(text = "break")
        } else {
            out <- mpi.bcast(x = .Call("mkstr", as.integer(charlen),
                                       PACKAGE = "Rmpi"),
                             type = 3, rank = rank, comm = comm)
            parse(text = unlist(strsplit(out, "\"\"/")))
        }
    }
}
@
The program flow in the previous example is that the manager starts to
evaluate \Rfunction{xval}. The manager `massages' input data, and at a
critical point forwards data and work to other nodes in the cluster. A
different model is to start \Rfunction{xval} on all nodes, and for all
nodes to input and massage data. At the critical section of code marked
by \Rfunction{xvalLoop}, each node performs only its own work, and
arranges to communicate results between all nodes. This model of
computation is potentially very efficient, avoiding communication of
large data between nodes. We will now explore how to implement this
solution.

Consider the following commands:
<<>>=
bcast.Rcmd(ff <- function() {
    f <- function(i) mpi.comm.rank()
    sapply(1:20, f)
})
bcast.Rcmd(ff())
@
So far not very interesting! One way to think of the function
\Rfunction{ff} is as a small program. \Rfunction{bcast.Rcmd} is a
function I wrote for this practical, available in the file
\verb|~/rsrsc/batch.R|.
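As a very small sanity check (a sketch; it assumes a set of slaves is
already running and that the definitions in \verb|batch.R| have been
loaded), you can ask each node to evaluate something trivial through
\Rfunction{bcast.Rcmd}:
\begin{verbatim}
> bcast.Rcmd(mpi.comm.rank())
\end{verbatim}
The value printed at your prompt is the manager's answer (rank 0), but
the same expression is also broadcast to, and evaluated by, every
slave.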
\Rfunction{bcast.Rcmd} evaluates the single program on all nodes,
including the manager node. The first \Rfunction{bcast.Rcmd} assigns
the function \Robject{ff} on every node of the cluster. The second
\Rfunction{bcast.Rcmd} evaluates \Robject{ff} on all nodes.
\Rfunction{bcast.Rcmd} returns the value of the program as run on the
manager node.

We now introduce a `wrapper' that changes the way evaluation of
\Rfunction{f} works.
<<>>=
bcast.Rcmd(ff <- function() {
    f <- function(i) mpi.comm.rank()
    ff <- wrap(f, -1)
    sapply(1:20, ff)
})
bcast.Rcmd(ff())
@
The \Rfunction{wrap} function (also written for this practical) takes a
function and `wraps' it so that successive calls to the function only
sometimes compute a new value. At other times, it returns a
previously-stored pseudo-value. Computations occur in a round-robin
fashion -- first node 1, then node 2, etc. The manager node never
evaluates the function, but instead collects and coordinates the
results of the worker nodes. The end result is that the manager node
collects the round-robin evaluations from all nodes, returning
something approaching a useful result.

Here is a slightly more extensive example, where a bootstrap
calculation is distributed across nodes in a cluster.
<<>>=
bcast.Rcmd(ff <- function() {
    library(boot)
    ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
    ratiow <- wrap(ratio, -1)
    boot(city, ratiow, R = 999, stype = "w")
})
bcast.Rcmd(ff())
@
Notice that this is changed only very little from the single-processor
evaluation.

Here is a sketch of the \Rfunction{wrap} function:
<<>>=
<<wrap>>
@
This uses several features of the \R{} language, and as an exercise it
would be useful to discuss these with others in the lab.
\begin{itemize}
\item In \R{}, a function can be treated just like any other object.
  What is the return value of \Rfunction{wrap}?
\item \R{} is \emph{lexically scoped}. This means that the return
  values of \Rfunction{wrap} `remember' the environment in which they
  were defined. So the return value of \Rfunction{wrap} can access the
  variables \Robject{iter}, \Robject{sz}, \Robject{rank} and
  \Robject{tag}.
\item A line near the end of \Rfunction{wrap} reads
  \verb|iter <<- iter + 1|. The \R{} operator \verb|<<-| causes
  assignment, but unlike the usual assignment operator \Rfunction{<-}
  it looks for a variable (in this case named \Robject{iter}) to assign
  to in the \emph{environment} where the function is defined. So the
  effect is to change the value of the \Robject{iter} that is defined
  in the very first line of \Rfunction{wrap}.
\item \Rfunction{wrap} is really quite incomplete. What issues can you
  see with it, and how might it be improved?
\end{itemize}

\subsection{A parallel \Rfunction{lapply}}

\Rfunction{lapply} provides a very useful framework for embarrassingly
parallel problems.
<<>>=
args(lapply)
@
The argument \Robject{X} represents the `work' to be done,
\Robject{FUN} the program that needs to be evaluated on each piece of
work, and \Robject{...} the data on which the program is to be
evaluated.

Here is a batch program that uses \Rfunction{lapplys}, an
\Rfunction{lapply}-like alternative, to perform a simple parallel
computation:
<<lapplys, echo=FALSE>>=
lapplys <- function(X, FUN, ..., comm=1) {
    rank <- mpi.comm.rank(comm) + 1
    n <- mpi.comm.size(comm)
    tasks <- 1:length(X)
    mywork <- X[split(tasks, cut(tasks, n))[[rank]]]
    result <- lapply(mywork, FUN, ...)
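    ## exchange partial results, so that every node ends up with the
    ## complete list of results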
    allgather.Robj(result, comm)
}
mpi.bcast.Robj2slave(lapplys)
@
<<allgatherRobj, echo=FALSE>>=
allgather.Robj <- function(obj=NULL, comm=1) {
    obj <- as.integer(charToRaw(serialize(obj, NULL)))
    sz <- mpi.allgather(length(obj), 1, integer(mpi.comm.size(comm)),
                        comm)
    objs <- mpi.allgatherv(obj, 1, integer(sum(sz)), sz, comm)
    as.list(unlist(lapply(split(as.raw(objs), rep(1:length(sz)-1, sz)),
                          unserialize),
                   recursive=FALSE, use.names=FALSE))
}
mpi.bcast.Robj2slave(allgather.Robj)
@
<<>>=
res <- bcast.Rcmd(lapplys(1:20, function(i) mpi.comm.rank()))
unlist(res)
@
The `work' is the sequence of numbers from 1 to 20. The function to be
performed by each node is to determine the rank of the node.
\Rfunction{lapplys} ensures that work gets distributed approximately
evenly between nodes, and collates results from each node into a single
list (\Rfunction{unlist} simplifies the result for compact
presentation).

As a more complicated example, here is a way to perform
cross-validation in batch mode.
<<>>=
bcast.Rcmd(ff <- function() {
    library(MLInterfaces)
    setClass("BatchXval", representation("list"))
    setMethod("xvalLoop", signature(cluster="BatchXval"),
              function(cluster, ...) lapplys)
    cluster <- new("BatchXval")
    data(golubMerge)
    smallG <- golubMerge[200:250, ]
    lk1 <- xval(smallG, "ALL.AML", knnB, xvalMethod = "LOO",
                group = as.integer(0), cluster=cluster)
    table(lk1, smallG$ALL.AML)
})
bcast.Rcmd(ff())
@
\begin{itemize}
\item Describe the flow of execution of this program. Which node(s)
  read data from disk? How much communication is likely occurring
  between nodes?
\item Based on experience earlier in the lab, how do you think this
  program will perform as more nodes are used in the computation?
\end{itemize}

Let's take a closer look at \Rfunction{lapplys}.
<<>>=
<<lapplys>>
@
The function has nearly the same signature as \Rfunction{lapply}. The
first several lines make \Rfunction{lapplys} behave slightly
differently on each node that it runs on. The line \verb|mywork <- ...|
divides \Robject{X} into approximately evenly sized pieces, and selects
a different piece for each node. The next line evaluates \Robject{FUN}
for a subset of the original \Robject{X}, with the subset differing
depending on the node. The final line calls an \Rpackage{Rmpi}
extension that collates results from the different nodes.

\Rfunction{lapplys} makes use of an extension to \Rpackage{Rmpi}:
<<>>=
<<allgatherRobj>>
@
Allgather is an MPI concept, where objects from all nodes are collated
and redistributed to all nodes. \Rfunction{allgather.Robj} is an
extension of this idea, and is included here to illustrate how MPI
routines can be incorporated in \R{} functions for easy evaluation.

Finally, \Rfunction{bcast.Rcmd} is defined on the manager node to
broadcast commands to all workers, and to evaluate the same commands on
the manager.
<<>>=
<<bcastRcmd>>
@
This function makes use of several \R{} language concepts, including
the ability to deparse and serialize \R{} objects (including
functions), and to evaluate \R{} objects in environments different from
the calling environment.

\section{Conclusions}

We have covered a lot of ground in the lab. Here is a brief summary of
the highlights:
\begin{enumerate}
\item There are at present few `out of the box' solutions for parallel
  programming in \R{}; you'll have to grapple with at least some
  details of parallelization.
\item Not all parts of a script can be usefully parallelized.
\item Communication costs can be important, and for large data sets can
  outweigh much of the benefit of faster computation.
\item Even without communication costs, the benefit of each additional
  node diminishes: adding 1 node to a 1-node cluster doubles potential
  computational throughput, but adding 1 node to a 10-node cluster
  matters hardly at all.
\item Repeatability is essential for scientific research, and can be a
  thorny issue in a parallel environment.
\item \Rpackage{Rmpi} provides some functionality for parallel
  programming, including an interface to lower-level MPI functions.
\item Useful new parallel techniques introduced in the lab rely on
  executing a series of commands as a single program on the remote and
  manager nodes.
\item This technique is useful in conjunction with a `wrapper' that
  determines which nodes engage in expensive computations, and
  \Rfunction{lapplys}, an \Rfunction{lapply}-like function for
  distributing work evenly between nodes.
\end{enumerate}

Needless to say, we have barely scratched the surface of what is
possible. The algorithms presented here are only a fraction of those
useful in parallel computation. They lack any error checking or
recovery, and make no attempt to check-point or otherwise ensure
against system failure. These are natural directions for further
exploration.

\end{document}