%\VignetteIndexEntry{splineTimeR} %\VignetteDepends{} %\VignetteKeywords{spline regression, time-sourse differential expression, network reconstruction, network properties, pathway enrichment analysis} %\VignettePackage{splineTimeR} % -*- mode: noweb; noweb-default-code-mode: R-mode; -*- \documentclass[a4paper, 11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{float} \begin{document} \SweaveOpts{concordance=TRUE} \title{\bf Using splineTimeR package} \author{Agata Michna and Herbert Braselmann} \maketitle \begin{center} Research Unit Radiation Cytogenetics, Group Integrative Biology\\ Helmholtz Zentrum M\"unchen \end{center} \begin{center} {\tt braselm@helmholtz-muenchen.de} \end{center} \vspace{5cm} \begin{frame} \centering \fbox{\begin{tabular}{@{}l@{}} This free open-source software implements academic research by the authors and \\ co-workers. If you use it, please support the project by citing the journal article \\ referred to in the overview section. \end{tabular}} \end{frame} \newpage \tableofcontents %============================================================ \section{Overview} This package provides functions for differential gene expression analysis of time-course gene expression data, pathway enrichment analysis and time dependent gene regulatory network reconstruction. The functions in this package gather and simplify the usage of other Bioconductor packages and R-functions. A natural cubic spline regression model for an experimental two-way designs with one treatment factor and time as continuous variable is fitted to the experimental time-course data. Two treatment groups (e.g. control and treated) have to be provided. Time dependent expression data must be available for each group. Parameters of fitted spline regression models are used to define differentially expressed over time genes. Those genes may further be used for pathway enrichment analysis and/or the reconstruction of time dependent gene regulatory association networks. The workflow implemented in this package is explained and discussed in detail in \citet{Michna2016}. Expression microarray data generated in our laboratory are are available from the ArrayExpress database (accession number E-MTAB-4829). %============================================================ \section{Data preparation} \subsection{Simulation of data} Two simulated gene expression data sets, generated with the open-source simulator GeneNetWeaver (GNW), are included in this package and serve as test data to demonstrate the usage of the functions \citep{Schaffter2011, Marbach2009}. The \textit{in silico} expression data were simulated based on the network structure of a 2000-gene sub-network from the Reactome functional interaction network (\citeauthor{Reactome_data}, \url{http://www.reactome.org/}). The sub-network was converted to a dynamical network model without autoregulatory interactions (self-loops). The data were simulated based on the {\tt "ODEs"} model. Two types of time-series experiments were chosen: {\tt "Time Series as in DREAM4"} and {\tt "Multifactorial"}. Gene expression data were simulated for 48 time points after perturbations. For more details see GNW User Manual (\citealp{GNW}, \url{http://gnw.sourceforge.net}). The names of {\tt "Time Series as in DREAM4"} and {\tt "Multifactorial"} simulation experiments were changed to {\tt "T1"} and {\tt "T2"}, respectively. From the generated data sets, eight time points are provided (1, 4, 8, 16, 24, 32, 40 and 48). The numbers correspond to the same time units after perturbation (e.g. minutes, hours, days, ect.). Replicates for both time-course experiments were generated by the addition of the normally distributed random errors with a standard deviation of 0.05 to the expression values for each time point. Subsequently, the entire dataset was normalized between 0 and 1. \subsection{Required data design} Gene expression data must be provided as a Biobase object of class {\tt ExpressionSet}. {\tt ExpressionSet} objects are capable of storing log-ratios or log-values of expressions from multiple microarrays as well as feature data and phenotypic data of the samples. Functions included in this R-package require phenotypic data in the {\tt ExpressionSet} with three columns named {\tt SampleName}, {\tt Time} and {\tt Treatment}. Two types of treatments (e.g. control and treated) have to be specified. This defines groups to compare. Optionally {\tt Replicate} or any other column can be added. If the {\tt Replicate} column is not present, all samples are considered as single replicates. Replicates are not required. The time points for compared treatment groups should be identical. Visualisation of results is demonstrated using simulated data ({\tt TCsimData}). <>= #path1 <- "/Users/hbraselmann/R/Bioc3.4dev/library" #path2 <- "/Library/Frameworks/R.framework/Versions/3.3/Resources/library" #.libPaths(c(path1, path2)) #.libPaths() @ <>= library(splineTimeR) data(TCsimData) head(pData(TCsimData),8) @ %============================================================ \section{Time-course data differential expression analysis} \subsection{Model and model parameters} To simulate a non-linear behaviour of gene expression over time a temporal trend using a natural cubic spline regression is fitted to the measured data points. The function {\tt splineDiffExprs} compares the time dependent behaviour of genes in two different groups. \\\\ The mathematical model of natural cubic spline regression is defined by: \begin{equation} \begin{split} y = y(t,x) = b_0 + b_1B_1(t-t_0) + b_2B_2(t-t_0) + ... + b_mB_m(t-t_0) + \\ + x(d_0 + d_1B_1(t-t_0) + d_2B_2(t-t_0) + ... + d_mB_m(t-t_0)) \end{split} \end{equation} where $b_1$, $b_2$, ..., $b_m$ are the spline coefficients in the reference group (e.g. control group) and $d_1$, $d_2$, ..., $d_m$ are differential spline coefficients for the compared group (e.g. treated group). $B_1(t-t_0)$, $B_2(t-t_0)$, ..., $B_m(t-t_0)$ are spline base functions and $t_0$ is the time of the first measurement. For $x=0$, $y=y_{reference}$ and for $x=1$, $y=y_{compared}$. \\\\ Dependent on the data type and the number of measured time points, user has to define the number of degrees of freedom {\tt df}. Choosing effective degrees of freedom in range 3-5 is reasonable. \subsection{Statistical methods and description of output table} Time dependent differential expression of a gene is determined by the application of empirical Bayes moderate F-statistics on the differences of coefficient values of the fitted natural cubic spline regression models for the same gene in the two compared treatment groups \citep{Smyth2004}. In other words, comparing the coefficient values of the fitted splines in both groups allows the detection of differences in the shape of the curves, which represent the gene expressions changes over time. The {\tt reference} parameter specify which treatent group should be considered as reference. The {\tt intercept} defines, which coefficients are considered for the statistical analysis. Two options are possible. If {\tt intercept=TRUE}, the F-test includes all parameters, i.e shape parameters and intercept at the first given time point. If {\tt intercept=FALSE}, only shape parameters are considered. The {\tt cutoff.adj.pVal} input parameter defines the Benjamini-Hochberg adjusted p-value cut-off for genes to be considered as differentially expressed. The {\tt splineDiffExprs} function generates a output table, where the first columns contain all feature data of the {\tt ExpressionSet} object ({\tt fData(eSetObject)}), if any feature data were defined. Otherwise, only one column {\tt row\underline{ }IDs}, containing the row names is created. The {\tt $b_0$}, {\tt $b_1$},..., {\tt $b_m$} coefficients correspond to the reference model parameters. The {\tt $d_0$}, {\tt $d_1$},..., {\tt $d_m$} coefficients represent the differences between the reference model parameters and the model parameters in the compared group. {\tt AveExprs} refers to the average log2-expression for a probe (representing a gene) over all arrays. The {\tt F} column contains moderate F-statistics, {\tt P.Value} raw p-value and {\tt adj.P.Value} Benjamini-Hochberg adjusted p-value. <>= diffExprs <- splineDiffExprs(eSetObject = TCsimData, df = 3, cutoff.adj.pVal = 0.01, reference = "T1", intercept = TRUE) head(diffExprs, 3) @ %============================================================ \section{Visualization of time-course data with natural cubic spline regression model} The {\tt splinePlot} function visualises the time dependent behaviour of genes in two treatment groups. The natural cubic spline regression curves fitted to discrete, time dependent expression data are plotted. One plot shows two curves - representing the reference group and the compared group, respectively. Reference group is specified by {\tt reference} parameter. User can decide which genes to plot by setting {\tt toPlot} input parameter, otherwise all genes included in {\tt ExpressionSet} object are plotted. Defined genes to plot have to be included in {\tt ExpressionSet} object. The plots are saved in a .pdf file. <>= splinePlot(eSetObject = TCsimData, df = 3, reference = "T1", toPlot = c("EEF2","OR5W2")) @ \begin{figure}[H] \centering \begin{tabular}{@{}c@{\hspace{.5cm}}c@{}} \includegraphics[page=1,width=.5\textwidth]{plots_df3_spline} & \includegraphics[page=2,width=.5\textwidth]{plots_df3_spline} \\[.5cm] \end{tabular} \end{figure} %============================================================ \section{Pathway enrichment analysis} The {\tt pathEnrich} function performs a pathway enrichment analysis of defined genes ({\tt geneList}). Collection of curated gene sets ({\tt geneSets}) representing metabolic and signaling pathways has to be downloaded from a database of interest in Gene Matrix Transposed file format (*.gmt), where each gene set is described by a pathway name, a description, and the genes in the gene set. Two examples are shown bellow to demonstrate how to define {\tt geneSets} object. The variable {\tt universe} represents a total number of genes that were probed in the initial experiment, e.g. the number of all genes on a microarray. If {\tt universe} is not defined, {\tt universe} is the number of all genes that can be mapped to any pathways in the chosen database. The result table contains the name of the enriched pathway ({\tt pathway}), the gene set description ({\tt description}), the total number of genes in the pathway ({\tt genes\underline{ }in\underline{ }pathway}), the number of matched genes from the given gene set ({\tt matches}), the percentage of matched genes referred to the total number of known genes in the pathway ({\tt \%\underline{ }match}), the p-value ({\tt pValue}), Benjamini-Hochberg adjucted p-value ({\tt adj.pValue}), and a list of genes that are part of the pathway ({\tt overlap}). Statistical significance was determined by hypergeometric test (Fisher's exact test). Gene names for pathway enrichment must be provided in HGNC-approved gene nomenclature, otherwise the genes are not included in the analysis. Simulated time-course expression data were derived from Reactome functional interaction network containing 6536 unique identifiers (nodes), meaning that in our experiment universe is equal to 6536.\\ \noindent Example 1 - gene sets from the Molecular Signatures Database (MSigDB collections), \url{http://www.broadinstitute.org/gsea/msigdb/collections.jsp} \citep{Subramanian2005} <>= ## Not run ## Download .gmt file 'c2.all.v5.0.symbols.gmt' (all curated gene sets, ## gene symbols) from the Broad, ## http://www.broad.mit.edu/gsea/downloads.jsp#msigdb, then geneSets <- getGmt("/path/to/c2.all.v5.0.symbols.gmt") ## load ExpressionSet object containing simulated time-course data data(TCsimData) ## check for differentially expressed genes diffExprs <- splineDiffExprs(eSetObject = TCsimData, df = 3, cutoff.adj.pVal = 0.01, reference = "T1") ## use differentially expressed genes for pathway enrichment analysis enrichPath <- pathEnrich(geneList = rownames(diffExprs), geneSets = geneSets, universe = 6536) ## End(Not run) @ \noindent \\Example 2 - gene sets from the Reactome Pathway Database, \url{http://www.reactome.org/pages/download-data/} \citep{Reactome2014} <>= ## Not run ## Download and unzip .gmt.zip file 'ReactomePathways.gmt.zip' ## ("Reactome Pathways Gene Set" under "Specialized data formats") from ## the Reactome website http://www.reactome.org/pages/download-data/, then geneSets <- getGmt("/path/to/ReactomePathways.gmt") data(TCsimData) diffExprs <- splineDiffExprs(eSetObject = TCsimData, df = 3, cutoff.adj.pVal = 0.01, reference = "T1") enrichPath <- pathEnrich(geneList = rownames(diffExprs), geneSets = geneSets, universe = 6536) ## End(Not run) @ %============================================================ \section{Gene association network reconstruction of time-course data} The {\tt splineNetRecon} function reconstructs gene association networks from time-course data. Based on given {\tt ExpressionSet} object, longitudinal data object is created. Subsequantly the function estimates edges using partial correlation method with shrinkage approach applying {\tt ggm.estimate.pcor} and {\tt network.test.edges} functions. As a result an object or list of object of class {\tt igraph} is created. Gene association network reconstruction is done for a selected type of {\tt Treatment}. This allows to identify regulatory associations between genes under a certain condition (treatment). First, a {\tt longitudinal} data object of the gene expression data with possible replicates is created. This object is used to estimate the partial correlation with the selected shrinkage method ({\tt dynamic} or {\tt static}) with the {\tt ggm.estimate.pcor} function (for details see {\tt ggm.estimate.pcor} function help). Finally, the the {\tt network.test.edges} function estimates the probabilities for all possible edges and lists them in descending order (for details see {\tt network.test.edges} help). An object or list of objects of class {\tt igraph} with all edges that exceeded the probability-cutoff ({\tt cutoff.ggm}) is returned. If more than one value for {\tt cutoff.ggm} is defined than function returns a {\tt list} of objects of class {\tt igraph} for each defined {\tt cutoff.ggm} value. Otherwise a single object of class {\tt igraph} with one selected probability is returned. \\ For one defined cut-off: <>= igr <- splineNetRecon(eSetObject = TCsimData, treatmentType = "T2", probesForNR = rownames(diffExprs), cutoff.ggm = 0.7, method = "dynamic") @ <>= plot(igr, vertex.label = NA, vertex.size = 3, main = "igraph_0.7") @ \begin{figure}[H] \centering \begin{tabular}{@{}c@{\hspace{.5cm}}c@{}} \includegraphics[page=1,width=.5\textwidth]{plot_igr1} \end{tabular} \end{figure} For more defined cut-off values: <>= igr <- splineNetRecon(eSetObject = TCsimData, treatmentType = "T2", probesForNR = rownames(diffExprs), cutoff.ggm = c(0.8,0.9), method = "dynamic") @ <>= plot(igr[[1]], vertex.label = NA, vertex.size = 3, main = "igraph_0.8") plot(igr[[2]], vertex.label = NA, vertex.size = 3, main = "igraph_0.9") @ \begin{figure}[H] \centering \begin{tabular}{@{}c@{\hspace{.5cm}}c@{}} \includegraphics[page=1,width=.5\textwidth]{plot_igrlist} & \includegraphics[page=2,width=.5\textwidth]{plot_igrlist} \\[.5cm] \end{tabular} \end{figure} %============================================================ \section{Scale-free properties of a network} Biological networks are thought to be scale-free. Scale-free networks follow a power-law distribution of the degrees of nodes in the network \citep{Barabasi1999, Albert2005}. Power-law distribution is characterised by the exponent $\gamma$. The probability that a node has the degree $k$ ($k$ connections) is given by: \begin{equation} P(k) \sim k^{-\gamma} \end{equation} or after logarithmic transformation: \begin{equation} log P(k) \sim -\gamma log k \end{equation} For most biological networks the exponent $\gamma$ ranges between 2 and 3 \citep{Barabasi2004}. Nevertheless, subnetworks of a scale free-network are not necessarily scale-free \citep{Stumpf2005}. The {\tt networkProperties} function plots the degree distribution of nodes in a given network ({\tt igraph}). For comparison, the degree exponents of the Reactome and BioGRID networks are shown \citep{Reactome2014, BioGRID2006}. Both of these networks are derived from functional interaction pairs obtained from Reactome and BioGRID repositories. The functional interaction pairs were extracted from mentioned databases and are provided in FIs data package. <>= library(FIs) data(FIs) names(FIs) head(FIs$FIs_Reactome) head(FIs$FIs_BioGRID) @ For each given {\tt igraph} three types of plots are created: empirical cumulative distribution, degree distribution and power-law degree distribution on log-log scale with fitted trend line. Additionally, a summary table containing the number of nodes, the number of edges and the degree exponents for each given network is returned. <>= igr <- splineNetRecon(eSetObject = TCsimData, treatmentType = "T2", probesForNR = rownames(diffExprs), cutoff.ggm = c(0.7,0.8,0.9), method = "dynamic") scaleFreeProp <- networkProperties(igr) head(scaleFreeProp) @ Examplary figures ploted in .pdf files are shown below. \begin{figure}[H] \centering \begin{tabular}{@{}c@{\hspace{.5cm}}c@{}} \includegraphics[page=1,width=1\textwidth]{degree_distribution} \\[.5cm] \end{tabular} \end{figure} \begin{figure}[H] \centering \begin{tabular}{@{}c@{\hspace{.5cm}}c@{}} \includegraphics[page=1,width=.5\textwidth]{scale_free_properties} & \includegraphics[page=4,width=.5\textwidth]{scale_free_properties} \\[.5cm] \end{tabular} \end{figure} \bibliographystyle{apalike} \bibliography{splineTimeR} \end{document}