%\VignetteIndexEntry{DREAM4 Overview} %\VignetteDepends{DREAM4} %\VignetteDepends{networkBMA} %\VignettePackage{DREAM4} \documentclass[12pt]{article} \usepackage[utf8]{inputenc} \usepackage{url} \newenvironment{packed_enum}{ \begin{enumerate} \setlength{\itemsep}{2pt} \setlength{\parskip}{0pt} \setlength{\parsep}{0pt} }{\end{enumerate}} \newenvironment{packed_list}{ \begin{itemize} \setlength{\itemsep}{2pt} \setlength{\parskip}{0pt} \setlength{\parsep}{0pt} }{\end{itemize}} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{comment} \usepackage[nottoc,numbib]{tocbibind} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \setlength{\parindent}{0pt} \setlength{\parskip}{2ex} \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \title{DREAM4: Simulated mRNA Expression Data for Assessing Network Inference Software} \author{Paul Shannon} \begin{document} \maketitle \tableofcontents \section{Introduction} The \emph{Dialogue for Reverse Engineering Assessments and Methods} (\emph{DREAM}) (\url{http://www.the-dream-project.org}) organizes annual challenges in which systems biology questions are posed, solutions invited from interested groups, and winners selected. As with the the CASP (\url{http://predictioncenter.org}) protein structure contest, DREAM poses questions to which the sponsors know the solution, but the contestants do not. DREAM4 challenges fall into two categories: cellular network inference and quantitative model building. This vignette uses one of the simpler 2009 DREAM4 network inference challenges to introduce Bioconductor users to the inference of genetic regulatory networks from gene expression (and possibly additional) data. Synthetic gene expression data, generated by software, and based upon patterns found in model organisms, provides the basis for inference in the examples presented here. Ancillary data may be helpful when it is available and when, as with networkBMA, the inference package supports it. Both of these scenarios are demonstrated below. At present, we discuss only the Bioconductor package \Rpackage{networkBMA} (which implements the method described in \citet{lo2012integrating}). Subsequent revisions of this vignette will include more network inference packages. A crucial feature of the DREAM4 challenge is that the underlying network (the \emph{gold standard}) is distributed along with the simulated expression data. This allows us to assess the quality of any inference made with the data. The gold standard matrix was withheld during the original challenge, which took place in 2009, and which is described in full here: \url{http://wiki.c2b2.columbia.edu/dream/index.php/D4c2} There were three sub-challenges. Quoting from the DREAM4 website: \begin{itemize} \item Wild-type, knockouts, knockdowns, multifactorial perturbations, and time series simulated expression data, in five networks of size 10. Participants are challenged to predict the directed unsigned topology of these networks. \item similar to the first one, except that the five networks are of size 100. \item five networks of size 100. In this challenge, we assume that extensive knockout / knockdown or time series experiments can't be performed. Instead, different variations of the network can be observed (e.g., samples from different patients). Thus, only the multifactorial perturbation dataset described below is provided. \end{itemize} \section{The Data} \subsection{As distributed by DREAM} The simulated expression data is generated by GeneNetWeaver (\url{http://gnw.sourceforge.net/}) based on network and gene expression characteristics of two well-studied sytems, E.coli and S.cerevisiae. Gene regulatory patterns found in these organisms are used to construct the gold standard network of regulators and targets, from which simulated expression data is then generated. Successful inference consists of reverse-engineering the gold standard network from the simulated expression data. The DREAM4 dataset is distributed by the DREAM project in multiple directories, each containing a gold standard network and a number of tab-delimited files with expression data generated under different (simulated) conditions: wildtype steady state, time series after perturbation, knockouts, knockdowns, dual knockouts, and ``multifactorial''. Drawing again from the DREAM4 description: \begin{itemize} \item Wild-type: the steady-state levels of the unperturbed network \item Time series data: The files \emph{*timeseries.tsv} contain time courses showing how the network responds to a perturbation and how it relaxes upon removal of the perturbation. For networks of size 10 we provide 5 different time series, for networks of size 100 we provide 10 time series. Each time series has 21 time points. The initial condition always corresponds to a steady-state measurement of the wild-type. At t=0, a perturbation is applied to the network as described below. The first half of the time series (until t=500) shows the response of the network to the perturbation. At t=500, the perturbation is removed (the wild-type network is restored). The second half of the time series (until t=1000) shows how the gene expression levels go back from the perturbed to the wild-type state. In contrast to the multifactorial perturbations described in the previous section, which affect all the genes simultaneously, the perturbations applied here only affect about a third of all genes, but basal activation of these genes can be strongly increased or decreased. For example, these experiments could correspond to physical or chemical perturbations applied to the cells, which would cause (via regulatory mechanisms not explicitly modeled here) some genes to have an increased or decreased basal activation. The genes that are directly targeted by the perturbation may then cause a change in the expression level of their downstream target genes. \item Multifactorial data: The files \emph{*multifactorial.tsv} contain steady-state levels of variations of the network, which are obtained by applying multifactorial perturbations to the original network. Each line gives the steady state of a different perturbation experiment, i.e., of a different variation of the network. One may think of each experiment as a gene expression profile from a different patient, for example. We simulate multifactorial perturbations by slightly increasing or decreasing the basal activation of all genes of the network simultaneously by different random amounts. \item Knockout and knockdown data are NxN matrices in which the expression of all genes is measured, as the expression of each gene in turn is eliminated (for knockouts) or substantially reduced (for knockdowns). \item Dual knockouts consist of simulating each of the five networks in which two gene are knocked-out simultaneously. Gene expression data for dual knockouts is not provided to the participants. Instead, participants may predict steady-state levels for dual knockouts in the bonus round described in the previous section. The files *dualknockouts\_indexes.tsv indicate the pairs of genes for which a dual knockout should be predicted. For example, the line labeled "6 8" means that participants should predict the steady-state of the network after knocking out genes 6 and 8. For networks of size 10 we ask for predictions for 5 dual knockout experiments, for networks of size 100 we ask for 20 predictions. \end{itemize} The data sets are complete, in the sense that the expression levels of all regulators and all of their targets are known. This advantage is offset by a corresponding disadvantage: inasmuch as the synthetic expression data is not from an actual biological system, various sorts of prior and related information about gene regulation is not avaialable. Transcription factor binding motifs, ChIP-seq, DNAseI footprinting, and methylation data simply do not exist. \subsection{As repackaged into Bioc \Rclass{RangedSummarizedExperiment}s} There are ten artifical gene regulatory datasets in DREAM4, five describing networks of 10 nodes, five describing networks of 100. These networks are all defined by a gold standard adjacency matrix. All networks are accompanied by wild-type, timeseries, knockout, dual-knockout, and knockdown simulated expression data. The hundred-node networks also have multifactorial data. (In the DREAM4 distribution, the multifactorial data is separated out into five additional directories; we have added them to the data object which has the corresponding gold standard matrix and expression data.) We use the Bioconductor \Rclass{RangedSummarizedExperiment} class from the \Rpackage{SummarizedExperiment} package to store the DREAM4 data. We provide ten instances, one for each simulated data set. Each can be loaded via a \Rfunction{data} statement, for instance: \begin{verbatim} > data(dream4_010_01) \end{verbatim} We examine this dataset, loading its \Rclass{RangedSummarizedExperiment} object into an R session, then getting a brief summary via the \Rfunction{show} command. But first, some initializations: a convenience function for printing, and a command to set the display width for subsequent R output, so that it is visible within the printed version of this document. <>= printf <- function(...)print(noquote(sprintf(...))) options(width=60) @ <>= library(DREAM4) data(dream4_010_01) show(dream4_010_01) @ This tells us that \Robject{dream4\_010\_01} has \begin{packed_list} \item One assay matrix called \emph{simulated} with 10 rows and 136 columns \item 10 row names: \emph{G1} - \emph{G10} \item 136 column names: \emph{wt}, \emph{perturbation.t0}, \emph{...} \emph{MF.9}, \emph{MF.10} \item No metadata for rows or columns (\emph{rowRanges} and \emph{colData} are both empty) \item Two objects in \emph{metadata}: the gold standard matrix, and double knockout data \end{packed_list} The standard Bioconductor \Rclass{RangedSummarizedExperiment} class, when used to hold real experimental data, encourages the storage of experiment metadata with the experimental measurements. This facilitates reproducibility, reanalysis and simplified data storage. However this class is, in one regard, a slightly awkward fit to the DREAM4 data: the timeseries, knockout, knockdown, double knockout and multiparameter measurements all have different dimensions. \Rclass{RangedSummarizedExperiments} can only ``stack up'' multiple measurement arrays if their dimensions are the same. We have worked around this restriction by combining all of the matrices in each simulation into a single \Robject{assay} matrix. We use column names which allow for the easy extraction of the original matrices, as demonstrated below. We now extract the full 10 x 136 combined expression matrix from the first (and only) element of the list in the \Robject{assays} slot, displaying the first 10 rows and 3 columns to give a sense of the data, along with a sampling of the column names: <>= mtx.all <- assays (dream4_010_01)[[1]] dim(mtx.all) mtx.all[1:10, c(1,2,126)] set.seed(37) print(colnames(mtx.all)[sort(sample(1:ncol(mtx.all), 16))]) @ \section{The networkBMA Package} \subsection{Overview} networkBMA (Regression-based network inference using Bayesian Model Averaging) is a Bioconductor package \url{http://www.bioconductor.org/packages/release/bioc/html/networkBMA.html} designed to work with time-series expression data with the capability to integrate prior knowledge. We will demonstrate networkBMA using the first the five DREAM4 10-node time-series simulated expression sets. One of the strengths of networkBMA is its ability to incorporate prior knowledge in the inference process. We will illustrate the gains such knowledge confers: the first inference run uses no priors, while the second run does. The gold standard adjacency matrix is stored as the first element in the metadata (or ``metadata'' slot) of the RangedSummarizedExperiment: <>= mtx.goldStandard <- metadata (dream4_010_01)[[1]] @ We want to extract only one of the ten perturbation timeseries included in \Robject{dream4\_010\_01}. Examine these self-describing column names: <>= grep("perturbation.1.", colnames(mtx.all), v=T, fixed=TRUE) @ The man page for \Rpackage{networkBMA} \Rfunction{?networkBMA} explains that there are two required and eight optional parameters. We will start with the required two, \Robject{data}, and \Robject{nTimePoints}, to which we add the optional \Robject{self}, by which we specify that there are no self-edges (no self-regulating genes) in this data set. \begin{verbatim} data: A matrix whose columns correspond to variables or genes and whose rows correspond to the observations at different time points. nTimePoints: The number of time points at which expression measurements are available. The number of columns in 'data' should be a multiple of 'nTimePoints', which could be greater than 1 if there are replicates. \end{verbatim} To prepare the data for \Rfunction{networkBMA} we need to extract one timeseries of the ten offered, and then transpose the result into a matrix in the form expected by \Rfunction{networkBMA}. <>= ts1.columns <- grep("perturbation.1.", colnames(mtx.all), fixed=TRUE) mtx.ts1 <- t(mtx.all[, ts1.columns]) print(mtx.ts1[1:3, 1:3]) @ Extract the gold standard matrix, convert it successively to a graphAM, then a graphNEL, and display it with RCytoscape. <>= print(names(metadata(dream4_010_01))) mtx.goldStandard <- metadata(dream4_010_01)[[1]] @ Transform the gold standard matrix into an explict table of regulators and targets. <>= idx <- which(mtx.goldStandard == 1) idx.m1 <- idx -1 rows <- idx.m1 %% nrow (mtx.goldStandard) + 1 cols <- idx.m1 %/% nrow (mtx.goldStandard) + 1 tbl.goldStandard <- data.frame(Regulator=rownames(mtx.goldStandard)[rows], Target=colnames(mtx.goldStandard)[cols], Source=rep('goldStandard', length(rows)), stringsAsFactors=FALSE) @ Here is an RCytosxcape rendering of this network, followed by the code which produced it. \includegraphics{d4-10-01.png} %% disable the RCy code because Cytoscape may not be running. %% display the code as if it were a live code chunk %% periodically test this code, manually, to be sure it %% still works %% <>= \begin{verbatim} > library(RCytoscape) > print(dim(mtx.goldStandard)) > print(mtx.goldStandard[1:5, 1:5]) > g.gsam <- graphAM(mtx.goldStandard, edgemode="directed") > g.gs <- as(g.gsam, "graphNEL") > g.gs <- initEdgeAttribute(g.gs, "edgeType", "char", "not yet assigned") > source.nodes <- tbl.goldStandard$Regulator > target.nodes <- tbl.goldStandard$Target > edgeData(g.gs, source.nodes, target.nodes, attr="edgeType") <- "regulates" > cw <- new.CytoscapeWindow("goldStandard 010-01", g.gs, overwriteWindow=TRUE) > hideAllPanels(cw) > setWindowSize(cw,800,600) > showGraphicsDetails(cw,TRUE) > displayGraph(cw) > layoutFile <- system.file("extdata", "gs010-01-layout.RData", + package="DREAM4") > setEdgeTargetArrowRule(cw, "edgeType", + c("not yet assigned", "regulates"), + c("No Arrow", "Arrow")) > restoreLayout(cw, layoutFile) > fitContent(cw) > setZoom(cw, 0.8*getZoom(cw)) \end{verbatim} \subsection{Inference Using Expression Data Only} Now we load \Rpackage{networkBMA}, infer the network from twenty-one timepoints of expression over ten genes, assuming no prior knowledge, and sort the predictions. <>= library(networkBMA) tbl.inferred <- networkBMA(mtx.ts1, nTimePoints=nrow(mtx.ts1), self=FALSE) tbl.inferred <- tbl.inferred[order(tbl.inferred$PostProb, decreasing=TRUE),] @ The networkBMA package provides a function with which to assess the inferred network with respect to a reference network. The standard measure of inferential success is \emph{AUPR} -- or ``area under the precision recall curve". After introducing those quantities and their calculation, we plot the AUPR for this just-completed run. ``Precision'' and ``Recall'' are standard measures of success in the fields of pattern recognition and information retrieval; see \href{http://en.wikipedia.org/wiki/Precision_and_recall}{Wikipedia, ``Precision and recall''}. In the following, true positives (TP), false positives (FP) and false negative (FN) counts are used. \begin{packed_list} \item \emph{precision}: the fraction of all retrieved instances which are relevant. Formally: TP/(TP+FP) \item \emph{recall}: the fraction of all relevant instances which have been retrieved. Formally: TP/(TP+FN) \end{packed_list} Both quantities range from 0.0 to 1.0. In the \emph{AUPR} plot, precision is typically plotted on the Y axis, and recall on the X axis. A perfect retrieval or inference method would always retrieve 100 per cent of all relevant instances, and all retrieved instances would be relevant. In this idealized case, the AUPR would be infinite. Even the best network inference algorithms fare considerably worse than this ideal case. Let us examine the networkBMA inference we just ran, using the \Rfunction{contabs.netwBMA} function provided by the networkBMA package, and the \Rfunction(prc) function. (prc stands for ``Precision-Recall Curve''.) <>= tbl.contingency <- contabs.netwBMA(tbl.inferred, tbl.goldStandard[,-3]) pr <- scores(tbl.contingency, what = c("precision","recall")) colors <- c ("blue", "darkred") plot(pr$recall, pr$precision, type='b', xlab='RECALL', ylab='PRECISION', col=colors[1], xlim=c(0,1), ylim=c(0,1)) @ The AUPR information: <>= print(prc(tbl.contingency, plotit=FALSE)) @ To make sense of these results, let us look at a few rows selected from the the contingency table. <>= last.row <- nrow(tbl.contingency) mid.row <- as.integer(round(last.row)/2) print(tbl.contingency[c(1,mid.row,last.row),]) @ The first row has perfect recall: all regulatory edges from the gold standard have been inferred, but a very large number of non-existent edges have been inferred as well. The last row has perfect precision: three edges were accurately inferred, no non-existent edges were proposed, eleven edges were missed. Further insight into the inferred network is provided by examining the first few entries in the sorted output of \Rfunction{networkBMA}. <>= print(head(tbl.inferred, n=10)) @ The three regulatory relationships with a posterior probability score of 1 are accurately inferred. The remaining edges are either reversals of edges in the gold standard network, or incorrect. \subsection{Add Prior Probabilities} It can be seen from these results that the inference of gene regulatory relationships from mRNA expression alone, even for an idealized network with complete information, is imperfect. These strategies are among those used to improve inference: \begin{packed_list} \item Incorporate ancillary data (derived, for instance, from ChiP-seq experiments, or interaction databases) \item Create novel algorithms (for instance, \citet{castelo2009reverse}, and \citet{danaher2011joint}) \item Perform meta-inference, in which several inference tools are used together [\citet{greenfield2010dream4}] \end{packed_list} Bayesian methods, of which the \Rpackage{networkBMA} package is one (see the package vignette for more information and references), exploit prior information (prior probabilities) to improve inference. Recent trends in molecular biology, and the ENCODE project in particular, and public interaction databases, provide a large amount of useful prior regulatory information. DREAM4 is synthetic data, so neither of these broad classes of prior information is available. We can, however, simulate priors by extracting a few regulatory relationships from the gold standard network, adjusting them with randomly generated probabilities to better approximate real data, and thereby demonstrate how to use networkBMA with priors. There are two arguments to the \Rfunction{networkBMA} function whe used to with prior information: \begin{packed_list} \item \emph{prior.prob}: an adjacency matrix, where each value is a probability between 0 and 1. This is well-suited for, e.g., experimental data, such as ChIP-seq. \item \emph{known}: a two-column matrix with known (unqualified) regulatory matrices. \end{packed_list} We shall use the first of these two parameters, leaving the second to default \Robject{NULL}. Note that in the inference run above, none of the actual targets of regulator \emph{G10} were reported with a significant score: <>= print(subset(tbl.inferred, Regulator=="G10")) @ But the gold standard shows two targets: <>= mtx.goldStandard["G10",] @ Since this is synthetic network, we have no experimental data with which to improve the inference. However, we can simulate ChIP-seq data for G10, first by setting low probabilities for all interactions, then adding two probabilities for the G10->G3 and G10->G4 edges: <>= set.seed(37) tbl.priors <- matrix(data=runif(100, 0, 0.4), nrow=10, ncol=10, dimnames=list(rownames(mtx.goldStandard), colnames(mtx.goldStandard))) tbl.priors["G10", c("G3", "G4")] <- runif(2, 0.8, 1.0) @ Run the inference again: <>= tbl.inferredWithPriors <- networkBMA(mtx.ts1, nTimePoints=nrow(mtx.ts1), prior.prob=tbl.priors, self=FALSE) tbl.inferredWithPriors <- tbl.inferredWithPriors[order(tbl.inferredWithPriors$PostProb, decreasing=TRUE),] print(tbl.inferredWithPriors) @ G10->G3 and G10->G4 now show non-negligible posterior probabilities Here we plot the precision-recall curves for both runs together, for easy comparison: <>= tbl.contingencyWithPriors <- contabs.netwBMA(tbl.inferredWithPriors, tbl.goldStandard[,-3]) plot(pr$recall, pr$precision, type='b', xlab='RECALL', ylab='PRECISION', col=colors[1], xlim=c(0,1), ylim=c(0,1)) prWithPriors <- scores( tbl.contingencyWithPriors, what = c("precision","recall")) lines(prWithPriors$recall, prWithPriors$precision, type='b', xlab='RECALL', ylab='PRECISION', col=colors[2], xlim=c(0,1), ylim=c(0,1)) legend.titles = c("expression only", "with priors") legend (0.6, 0.9, legend.titles, colors) print(prc(tbl.contingencyWithPriors, plotit=FALSE)) @ \includegraphics{auprs.png} And examine the contingency table of the second run: <>= print(tbl.contingencyWithPriors) @ \section{Conclusion} The DREAM4 synthetic mRNA expression data, and the gold standard networks from which they are generated, offer a useful test for gene regulatory network inference. One result of such tests is the realization of how difficult such inference is. Incorporating prior information into inference is beneficial. We anticipate that in the near future many other kinds of regulatory data will begin to match mRNA expression for completeness, accuracy and cost, leading to increasingly constrained, and therefore increasingly accurate network inference. \nocite{*} \bibliography{DREAM4} \end{document}