%\VignetteIndexEntry{DREAM4 Overview}
%\VignetteDepends{DREAM4}
%\VignetteDepends{networkBMA}
%\VignettePackage{DREAM4}

\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}

\usepackage{url}


\newenvironment{packed_enum}{
\begin{enumerate}
  \setlength{\itemsep}{2pt}
  \setlength{\parskip}{0pt}
  \setlength{\parsep}{0pt}
}{\end{enumerate}}


\newenvironment{packed_list}{
\begin{itemize}
  \setlength{\itemsep}{2pt}
  \setlength{\parskip}{0pt}
  \setlength{\parsep}{0pt}
}{\end{itemize}}


\usepackage{times}
\usepackage{hyperref}


\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}

\usepackage[nottoc,numbib]{tocbibind}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm

\setlength{\parindent}{0pt}
\setlength{\parskip}{2ex}


\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\bibliographystyle{plainnat}


\title{DREAM4:  Simulated mRNA Expression Data for Assessing Network Inference Software}
\author{Paul Shannon}

\begin{document}

\maketitle
\tableofcontents


\section{Introduction}

The \emph{Dialogue for Reverse Engineering Assessments and Methods}
(\emph{DREAM}) (\url{http://www.the-dream-project.org}) organizes annual challenges in which systems
biology questions are posed, solutions invited from interested groups,
and winners selected.  As with the the
CASP (\url{http://predictioncenter.org}) protein structure contest,
DREAM poses questions to which the sponsors know the solution, but the
contestants do not.  DREAM4 challenges fall into two categories:
cellular network inference and quantitative model building.  

This vignette uses one of the simpler 2009 DREAM4 network inference
challenges to introduce Bioconductor users to the inference of genetic
regulatory networks from gene expression (and possibly additional)
data.  Synthetic gene expression data, generated by software, and
based upon patterns found in model organisms, provides the basis for
inference in the examples presented here.  Ancillary data may be
helpful when it is available and when, as with networkBMA, the
inference package supports it. Both of these scenarios are
demonstrated below.

At present, we discuss only the Bioconductor package
\Rpackage{networkBMA} (which implements the method described in
\citet{lo2012integrating}). Subsequent revisions of this vignette will
include more network inference packages.

A crucial feature of the DREAM4 challenge is that the underlying
network (the \emph{gold standard}) is distributed along with the simulated
expression data.  This allows us to assess the quality of any
inference made with the data.  The gold standard matrix was withheld 
during the original challenge, which took place in 2009, and which is
described in full here:

   \url{http://wiki.c2b2.columbia.edu/dream/index.php/D4c2}

There were three sub-challenges.  Quoting from the DREAM4 website:
 
\begin{itemize}

  \item Wild-type, knockouts, knockdowns, multifactorial
    perturbations, and time series simulated expression data, in five
    networks of size 10. Participants are challenged to predict the
    directed unsigned topology of these networks.

  \item similar to the first one, except that the five networks are of size 100.

  \item five networks of size 100. In this challenge, we assume that
    extensive knockout / knockdown or time series experiments can't be
    performed. Instead, different variations of the network can be
    observed (e.g., samples from different patients). Thus, only the
    multifactorial perturbation dataset described below is provided.


\end{itemize}

\section{The Data}

\subsection{As distributed by DREAM}

The simulated expression data is generated by GeneNetWeaver
(\url{http://gnw.sourceforge.net/}) based on network and gene
expression characteristics of two well-studied sytems, E.coli and
S.cerevisiae.  Gene regulatory patterns found in these organisms are
used to construct the gold standard network of regulators and targets,
from which simulated expression data is then generated.  Successful
inference consists of reverse-engineering the gold standard network
from the simulated expression data.

The DREAM4 dataset is distributed by the DREAM project in  multiple
directories, each containing a gold standard network and a number of
tab-delimited files with expression data generated under different
(simulated) conditions: wildtype steady state, time series after
perturbation, knockouts, knockdowns, dual knockouts, and
``multifactorial''.  Drawing again from the DREAM4 description:

\begin{itemize}
  \item Wild-type:  the steady-state levels of the unperturbed network
  \item Time series data:
The files \emph{*timeseries.tsv} contain time courses showing how the network
responds to a perturbation and how it relaxes upon removal of the
perturbation. For networks of size 10 we provide 5 different time
series, for networks of size 100 we provide 10 time series. Each time
series has 21 time points. The initial condition always corresponds to
a steady-state measurement of the wild-type. At t=0, a perturbation is
applied to the network as described below. The first half of the time
series (until t=500) shows the response of the network to the
perturbation. At t=500, the perturbation is removed (the wild-type
network is restored). The second half of the time series (until
t=1000) shows how the gene expression levels go back from the
perturbed to the wild-type state.  In contrast to the multifactorial
perturbations described in the previous section, which affect all the
genes simultaneously, the perturbations applied here only affect about
a third of all genes, but basal activation of these genes can be
strongly increased or decreased. For example, these experiments could
correspond to physical or chemical perturbations applied to the cells,
which would cause (via regulatory mechanisms not explicitly modeled
here) some genes to have an increased or decreased basal
activation. The genes that are directly targeted by the perturbation
may then cause a change in the expression level of their downstream
target genes.

\item Multifactorial data: The files \emph{*multifactorial.tsv} contain steady-state levels of
variations of the network, which are obtained by applying
multifactorial perturbations to the original network. Each line gives
the steady state of a different perturbation experiment, i.e., of a
different variation of the network. One may think of each experiment
as a gene expression profile from a different patient, for example. We
simulate multifactorial perturbations by slightly increasing or
decreasing the basal activation of all genes of the network
simultaneously by different random amounts.


\item Knockout and knockdown data are NxN matrices in which the expression 
of all genes is measured, as the expression of each gene in turn is 
eliminated (for knockouts) or substantially reduced (for knockdowns).

\item Dual knockouts consist of simulating each of the five networks
  in which two gene are knocked-out simultaneously. Gene expression
  data for dual knockouts is not provided to the
  participants. Instead, participants may predict steady-state levels
  for dual knockouts in the bonus round described in the previous
  section. The files *dualknockouts\_indexes.tsv indicate the pairs of
  genes for which a dual knockout should be predicted. For example,
  the line labeled "6 8" means that participants should predict the
  steady-state of the network after knocking out genes 6 and 8. For
  networks of size 10 we ask for predictions for 5 dual knockout
  experiments, for networks of size 100 we ask for 20 predictions.
  
\end{itemize}

The data sets are complete, in the sense that the expression
levels of all regulators and all of their targets are known.  This
advantage is offset by a corresponding disadvantage: inasmuch as the
synthetic expression data is not from an actual biological system,
various sorts of prior and related information about gene regulation
is not avaialable.  Transcription factor binding motifs, ChIP-seq,
DNAseI footprinting, and methylation data simply do not exist.  


\subsection{As repackaged into Bioc \Rclass{RangedSummarizedExperiment}s}

There are ten artifical gene regulatory datasets in DREAM4, five describing
networks of 10 nodes, five describing networks of 100.  These networks are
all defined by a gold standard adjacency matrix.  All networks are
accompanied by wild-type, timeseries, knockout, dual-knockout, and knockdown
simulated expression data.  The hundred-node networks also have
multifactorial data.  (In the DREAM4 distribution, the multifactorial data
is separated out into five additional directories; we have added them 
to the data object which has the corresponding gold standard matrix and
expression data.)


We use the Bioconductor \Rclass{RangedSummarizedExperiment} class from the 
\Rpackage{SummarizedExperiment} package to store the DREAM4 data.  We provide
ten instances, one for each simulated data set. Each can be loaded
via a \Rfunction{data} statement, for instance:

\begin{verbatim}
> data(dream4_010_01)
\end{verbatim}

We examine this dataset, loading its \Rclass{RangedSummarizedExperiment} object 
into an R session, then getting a brief summary via the \Rfunction{show} command.

But first, some initializations: a convenience function for printing, and a
command to set the display width for subsequent R output, so that it is 
visible within the printed version of this document.

<<init>>=
printf <- function(...)print(noquote(sprintf(...)))
options(width=60)
@ 

<<load-01-01>>=
library(DREAM4)
data(dream4_010_01)
show(dream4_010_01)
@ 

This tells us that \Robject{dream4\_010\_01} has
\begin{packed_list}
   \item  One assay matrix  called \emph{simulated} with 10 rows and 136 columns
   \item  10 row names:  \emph{G1} - \emph{G10}
   \item  136 column names: \emph{wt}, \emph{perturbation.t0}, \emph{...} \emph{MF.9}, \emph{MF.10}
   \item  No metadata for rows or columns (\emph{rowRanges} and \emph{colData} are both empty)
   \item  Two objects in \emph{metadata}: the gold standard matrix, and double knockout data
\end{packed_list}

The standard Bioconductor \Rclass{RangedSummarizedExperiment} class, when
used to hold real experimental data, encourages the storage of
experiment metadata with the experimental measurements.
This facilitates reproducibility, reanalysis and simplified data
storage.  However this class is, in one regard, a slightly awkward
fit to the DREAM4 data: the timeseries, knockout, knockdown, double
knockout and multiparameter measurements all have different
dimensions.  \Rclass{RangedSummarizedExperiments} can only ``stack up''
multiple measurement arrays if their dimensions are the same.  We
have worked around this restriction by combining all of the matrices in
each simulation into a single \Robject{assay} matrix.  We use
column names which allow for the easy extraction of the original matrices,
as demonstrated below.  

We now extract the full 10 x 136 combined expression matrix from the first (and only) element of
the list in the \Robject{assays} slot, displaying the first 10 rows
and 3 columns to give a sense of the data, along with a sampling of the column names:

<<extractFullMatrix>>=
mtx.all <- assays (dream4_010_01)[[1]]
dim(mtx.all)
mtx.all[1:10, c(1,2,126)]
set.seed(37)
print(colnames(mtx.all)[sort(sample(1:ncol(mtx.all), 16))])
@ 

\section{The networkBMA Package}

\subsection{Overview}

networkBMA (Regression-based network inference using Bayesian Model
Averaging) is a Bioconductor package 

\url{http://www.bioconductor.org/packages/release/bioc/html/networkBMA.html}

designed to work with time-series expression data with the capability to integrate prior
knowledge.  We will demonstrate networkBMA using the first the
five DREAM4 10-node time-series simulated expression sets.  One of
the strengths of networkBMA is its ability to incorporate prior
knowledge in the inference process.  We will illustrate the gains such
knowledge confers: the first inference run uses no priors, while the second
run does.

The gold standard adjacency matrix is stored as the first element in the 
metadata (or ``metadata'' slot) of the RangedSummarizedExperiment:

<<retrieveExpressionData>>=
mtx.goldStandard <- metadata (dream4_010_01)[[1]]
@ 

We want to extract only one of the ten perturbation timeseries included
in \Robject{dream4\_010\_01}.  Examine these self-describing column names:
<<colnames>>=
grep("perturbation.1.", colnames(mtx.all), v=T, fixed=TRUE)
@ 

The man page for \Rpackage{networkBMA} \Rfunction{?networkBMA} explains that there are two required
and eight optional parameters.  We will start with the required two, \Robject{data}, 
and \Robject{nTimePoints}, to which we add the optional \Robject{self}, by which we specify
that there are no self-edges (no self-regulating genes) in this data set.

\begin{verbatim}

       data: A matrix whose columns correspond to variables or genes and
             whose rows correspond to the observations at different time
             points.  

nTimePoints: The number of time points at which expression measurements
             are available.  The number of columns in 'data' should be a
             multiple of 'nTimePoints', which could be greater than 1 if
             there are replicates.
          

\end{verbatim}

To prepare the data for \Rfunction{networkBMA} we need to extract one timeseries of 
the ten offered, and then transpose the result into a matrix in the form expected
by \Rfunction{networkBMA}.
<<extractAndTranspose>>=
ts1.columns <- grep("perturbation.1.", colnames(mtx.all), fixed=TRUE)
mtx.ts1 <- t(mtx.all[, ts1.columns])
print(mtx.ts1[1:3, 1:3])
@ 

Extract the gold standard matrix, convert it successively to a graphAM, then a graphNEL,
and display it with RCytoscape.
<<extractGoldStandard>>=
print(names(metadata(dream4_010_01)))
mtx.goldStandard <- metadata(dream4_010_01)[[1]]
@ 
Transform the gold standard matrix into an explict table of regulators and targets.

<<transformGS>>=
idx <- which(mtx.goldStandard == 1)
idx.m1 <- idx -1
rows <- idx.m1 %% nrow (mtx.goldStandard) + 1
cols <- idx.m1 %/% nrow (mtx.goldStandard) + 1
tbl.goldStandard <- data.frame(Regulator=rownames(mtx.goldStandard)[rows],
                             Target=colnames(mtx.goldStandard)[cols],
                             Source=rep('goldStandard', length(rows)),
                             stringsAsFactors=FALSE)
@ 

Here is an RCytosxcape rendering of this network, followed by the code which produced it.


\includegraphics{d4-10-01.png}

%% disable the RCy code because Cytoscape may not be running.
%% display the code as if it were a live code chunk
%% periodically test this code, manually, to be sure it
%% still works
%% <<displayGSgraph>>=
\begin{verbatim}
> library(RCytoscape)
> print(dim(mtx.goldStandard))
> print(mtx.goldStandard[1:5, 1:5])
> g.gsam <- graphAM(mtx.goldStandard, edgemode="directed")
> g.gs <- as(g.gsam, "graphNEL")
> g.gs <- initEdgeAttribute(g.gs, "edgeType", "char", "not yet assigned")
> source.nodes <- tbl.goldStandard$Regulator 
> target.nodes <- tbl.goldStandard$Target
> edgeData(g.gs, source.nodes, target.nodes, attr="edgeType") <- "regulates"
> cw <- new.CytoscapeWindow("goldStandard 010-01", g.gs, overwriteWindow=TRUE)
> hideAllPanels(cw)
> setWindowSize(cw,800,600)
> showGraphicsDetails(cw,TRUE)
> displayGraph(cw)
> layoutFile <- system.file("extdata", "gs010-01-layout.RData",
+                          package="DREAM4")
> setEdgeTargetArrowRule(cw, "edgeType", 
+                       c("not yet assigned", "regulates"), 
+                       c("No Arrow", "Arrow"))
> restoreLayout(cw, layoutFile)
> fitContent(cw)
> setZoom(cw, 0.8*getZoom(cw))
\end{verbatim}

\subsection{Inference Using Expression Data Only}

Now we load \Rpackage{networkBMA}, infer the network from twenty-one timepoints of expression
over ten genes, assuming no prior knowledge, and sort the predictions.

<<infer0>>=
library(networkBMA)
tbl.inferred <- networkBMA(mtx.ts1, nTimePoints=nrow(mtx.ts1), self=FALSE)
tbl.inferred <- tbl.inferred[order(tbl.inferred$PostProb, decreasing=TRUE),]
@ 

The networkBMA package provides a function with which to assess the inferred 
network with respect to a reference network.    The standard 
measure of inferential success is \emph{AUPR} -- or ``area under the precision 
recall curve".  After introducing those quantities and their calculation, we
plot the AUPR for this just-completed run.

``Precision'' and ``Recall'' are standard measures of success in the
fields of pattern recognition and information retrieval; see
\href{http://en.wikipedia.org/wiki/Precision_and_recall}{Wikipedia,
``Precision and recall''}. In the following, true positives (TP), false positives (FP)
and false negative (FN) counts are used.

\begin{packed_list}
  \item \emph{precision}: the fraction of all retrieved instances which are relevant.
    Formally: TP/(TP+FP)
  \item \emph{recall}: the fraction of all relevant instances which have been retrieved.
    Formally: TP/(TP+FN)
\end{packed_list}

Both quantities range from 0.0 to 1.0. In the \emph{AUPR} plot, precision is typically
plotted on the Y axis, and recall on the X axis.  A perfect retrieval or inference
method would always retrieve 100 per cent of all relevant instances, and all retrieved
instances would be relevant.  In this idealized case, the AUPR would be infinite.
Even the best network inference algorithms fare considerably worse than this 
ideal case.  Let us examine the networkBMA inference we just ran, using the 
\Rfunction{contabs.netwBMA} function provided by the networkBMA package, and the
\Rfunction(prc) function. (prc stands for ``Precision-Recall Curve''.)

<<aupr0, fig=TRUE>>=
tbl.contingency <- contabs.netwBMA(tbl.inferred, tbl.goldStandard[,-3])
pr <- scores(tbl.contingency, what = c("precision","recall"))
colors <- c ("blue", "darkred")
plot(pr$recall, pr$precision, type='b',
     xlab='RECALL', ylab='PRECISION', 
     col=colors[1],
     xlim=c(0,1), ylim=c(0,1))
@ 

The AUPR information:
<<printaupr>>=
print(prc(tbl.contingency, plotit=FALSE))
@ 

To make sense of these results, let us look at a few rows selected from the the contingency table.
<<explicateAUPC>>=
last.row <- nrow(tbl.contingency)
mid.row <- as.integer(round(last.row)/2)
print(tbl.contingency[c(1,mid.row,last.row),])
@ 
The first row has perfect recall: all regulatory edges from the gold standard have
been inferred, but a very large number of non-existent edges have been inferred as well.

The last row has perfect precision: three edges were accurately inferred, no 
non-existent edges were proposed, eleven edges were missed.

Further insight into the inferred network is provided by examining the first few entries in
the sorted output of \Rfunction{networkBMA}.
<<headtblInferred>>=
print(head(tbl.inferred, n=10))
@ 
The three regulatory relationships with a posterior probability score of 1 are
accurately inferred.  The remaining edges are either reversals of edges in the
gold standard network, or incorrect.

\subsection{Add Prior Probabilities}

It can be seen from these results that the inference of gene regulatory
relationships from mRNA expression alone, even for an idealized
network with complete information, is imperfect.  These
strategies are among those used to improve inference:

\begin{packed_list}
   \item Incorporate ancillary data (derived, for instance, from ChiP-seq experiments, or interaction databases)
   \item Create novel algorithms (for instance, \citet{castelo2009reverse}, and \citet{danaher2011joint})
   \item Perform meta-inference, in which several inference tools are used together [\citet{greenfield2010dream4}]
\end{packed_list}

Bayesian methods, of which the \Rpackage{networkBMA} package is one
(see the package vignette for more information and references),
exploit prior information (prior probabilities) to improve inference.
Recent trends in molecular biology, and the ENCODE project in
particular, and public interaction databases, provide a large amount of useful prior regulatory
information. 


DREAM4 is synthetic data, so neither of these broad classes of prior
information is available.  We can, however, simulate priors by
extracting a few regulatory relationships from the gold standard
network, adjusting them with randomly generated probabilities to better
approximate real data, and thereby demonstrate how to use networkBMA with priors.

There are two arguments to the \Rfunction{networkBMA} function whe used to with prior information:

\begin{packed_list}
   \item \emph{prior.prob}: an adjacency matrix, where each value is a probability between 0 and 1.  This is well-suited
       for, e.g., experimental data, such as ChIP-seq.
   \item \emph{known}: a two-column matrix with known (unqualified) regulatory matrices.
\end{packed_list}

We shall use the first of these two parameters, leaving the second to default \Robject{NULL}.

Note that in the inference run above, none of the actual targets of regulator \emph{G10} were reported with a significant score:

<<g10targets>>=
print(subset(tbl.inferred, Regulator=="G10"))
@ 

But the gold standard shows two targets:
<<g10inGoldStandard>>=
mtx.goldStandard["G10",]
@ 

Since this is synthetic network, we have no experimental data with
which to improve the inference.  However, we can simulate ChIP-seq
data for G10, first by setting low probabilities for all interactions,
then adding two probabilities for the G10->G3 and G10->G4 edges:

<<simulateG10binding>>=
set.seed(37)
tbl.priors <- matrix(data=runif(100, 0, 0.4), nrow=10, ncol=10, 
                     dimnames=list(rownames(mtx.goldStandard), 
                                   colnames(mtx.goldStandard)))
tbl.priors["G10", c("G3", "G4")] <- runif(2, 0.8, 1.0)
@ 
Run the inference again:

<<inferWithPriors>>=
tbl.inferredWithPriors <- networkBMA(mtx.ts1, nTimePoints=nrow(mtx.ts1), 
                                     prior.prob=tbl.priors, self=FALSE)
tbl.inferredWithPriors <- 
   tbl.inferredWithPriors[order(tbl.inferredWithPriors$PostProb, 
                                decreasing=TRUE),]

print(tbl.inferredWithPriors)
@ 
G10->G3 and G10->G4 now show non-negligible posterior probabilities

Here we plot the precision-recall curves for both runs together, for easy comparison:

<<aupr2>>=
tbl.contingencyWithPriors <- contabs.netwBMA(tbl.inferredWithPriors, tbl.goldStandard[,-3])
plot(pr$recall, pr$precision, type='b',
     xlab='RECALL', ylab='PRECISION', 
     col=colors[1],
     xlim=c(0,1), ylim=c(0,1))

prWithPriors <- scores( tbl.contingencyWithPriors, what = c("precision","recall"))
lines(prWithPriors$recall, prWithPriors$precision, type='b',
      xlab='RECALL', ylab='PRECISION', 
      col=colors[2],
      xlim=c(0,1), ylim=c(0,1))

legend.titles = c("expression only", "with priors")
legend (0.6, 0.9, legend.titles, colors)

print(prc(tbl.contingencyWithPriors, plotit=FALSE))
@ 
\includegraphics{auprs.png}

And examine the contingency table of the second run:

<<printx>>=
print(tbl.contingencyWithPriors)
@ 

\section{Conclusion}

The DREAM4 synthetic mRNA expression data, and the gold standard
networks from which they are generated, offer a useful test for
gene regulatory network inference.  One result of such tests is the
realization of how difficult such inference is.  Incorporating prior
information into inference is beneficial.  We anticipate that in the
near future many other kinds of regulatory data will begin to match
mRNA expression for completeness, accuracy and cost, leading to
increasingly constrained, and therefore increasingly accurate network
inference.

\nocite{*} 


\bibliography{DREAM4}

\end{document}