%\VignetteIndexEntry{Trendy Vignette} %\VignettePackage{Trendy} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{graphicx, graphics, epsfig,setspace,amsmath, amsthm} \usepackage{natbib} \usepackage{moreverb} \usepackage{float} <>= BiocStyle::latex() @ \begin{document} \title{Trendy: segmented regression analysis of expression dynamics in high-throughput ordered profiling experiments} \author{Rhonda Bacher, Ning Leng, Ron Stewart} \maketitle \tableofcontents \setcounter{tocdepth}{2} \section{Overview} \label{sec:intro} Trendy is an R package for analyzing high-throughput expression data (e.g RNA-seq or microarray) with ordered conditions (e.g. time-course, spatial-course). For each gene (or other features), Trendy fits a set of segmented (or breakpoint) regression models. Each breakpoint represents a significant change in the gene's expression across the time-course. The optimal model is chosen as the one with the lowest BIC. The top dynamic genes are identified as those that are well profiled by their optimal gene-specific segmented regression model. Trendy also implements functions to: visualize dynamic genes and their trends, to order dynamic genes by their trends, and to compute the distribution of breakpoints across all genes and time-points. To illustrate Trendy here we refer specifically to time-course gene expression data, however Trendy may also be applied to other types of features (e.g. isoform or exon expression) and/or other types of experiments with ordered conditions (e.g. spatial-course). If you use Trendy in published research, please cite: \href{https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2405-x}{Bacher R, Leng N, Chu LF, Ni Z, Thomson JA, Kendziorski C, Stewart R. Trendy: segmented regression analysis of expression dynamics in high-throughput ordered profiling experiments. BMC Bioinformatics. 2018 Dec;19(1):380.} \subsection{The model} Denote the normalized gene expression of gene $g$ and sample/time $t$ as $Y_{g,t}$ for a total of $G$ genes and a total of $N$ samples. For each gene, Trendy fits a set of segmented regression models having 0 to $K$ breakpoints. $K$ defaults to 3 but can also be specified by the user. The \CRANpkg{segmented} R package is used to fit the segmented regression models. For a given gene, among the models with varying number of breakpoints, Trendy selects the optimal model by comparing the BIC. To avoid overfitting, the optimal number of breakpoints will be set as $\tilde{k_g} = \tilde{k_g} -1$ if at least one segment has less than $c_{num}$ samples. The threshold $c_{num}$ can be specified by the user; the default is \Rcode{minNumInSeg} = 5. Trendy reports the following for the optimal model: \begin{itemize} \item Gene specific adjusted $R^2$ (penalized for the chosen value of $k$) \item Segment slopes \item Segment trends (and associated p-values) \item Breakpoint estimates \end{itemize} Among all genes, the top dynamic genes are defined as those whose optimal model has a high adjusted $R^2$. Trendy also summarizes the fitted trend or expression pattern of top genes. For samples between the $i^{th}$ and $i+1 ^{th}$ breakpoint for a given gene, if the t-statistic of the segment slope (slope and standard errors are estimated by the segmented package) has p-value greater than $c_{pval}$, the trend of this segment will be defined as no change. Otherwise the trend will be defined as up/down based on the slope coefficient. The default value of $c_{pval}$ is \Rcode{pvalCut} = 0.1, but may also be specified by the user. In the \Rfunction{trendy} function, the thresholds $c_{num}$and $c_{pval}$ can be specified via parameters \Rcode{minNumInSeg} and \Rcode{pvalCut}, respectively. Trendy also computes a breakpoint distribution of the number of breakpoints over all genes alonge the time-course. Time-points with a large number of breakpoints may represent global expression changes and be targetted for follow-up investigations. \section{Installation} \subsection{Install via Bioconductor} The \Rpackage{Trendy} package can be installed from Bioconductor if you have R version $\geq 3.5$: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("Trendy") @ \subsection{Install via GitHub} The \Rpackage{Trendy} package can also be installed using functions in the \Rpackage{devtools} package. If you have R version $\geq 3.5$: <>= install.packages("devtools") library(devtools) install_github("rhondabacher/Trendy") @ For prior R versions you may use the following but note that this version is not being updated: <>= install.packages("devtools") library(devtools) install_github("rhondabacher/Trendy", ref="devel") @ \subsection{Install locally} Trendy may also be installed locally. Download the Trendy package from: \url{https://github.com/rhondabacher/Trendy} \subsection{Load the package} To load the Trendy package: <>= library(Trendy) @ \section{Analysis} \subsection{Input} \subsubsection{Normalized Data} The input data should be a $G-by-N$ matrix containing the expression values for each gene and each sample, where $G$ is the number of genes and $N$ is the number of samples. The samples should be sorted following the time course order. The object \Robject{trendyExampleData} is a simulated data matrix containing 50 rows of genes and 40 columns of samples. <>= data("trendyExampleData") str(trendyExampleData) @ These values should be expression data after normalization across samples. For example, for RNA-seq data, the raw counts may be normalized using Median Normalization (Anders and Huber, 2010) via the \Rfunction{MedianNorm} and \Rfunction{GetNormalizedMat} functions in the \Biocpkg{EBSeq} package: <>= library(EBSeq) Sizes <- MedianNorm(trendyExampleData) normalizedData <- GetNormalizedMat(trendyExampleData, Sizes) @ More details can be found in the \Biocpkg{EBSeq} vignette: \small{\url{http://www.bioconductor.org/packages/devel/bioc/vignettes/EBSeq/inst/doc/EBSeq_Vignette.pdf}} If you are working with microarray expression data, an extensive overview for normalization can be found in the vignette of the \Biocpkg{affy} package: \small{\url{https://www.bioconductor.org/packages/release/bioc/html/affy.html}} \subsubsection{Time Vector} The time vector is important to specify as it contains information regarding replicates and the relative timing or spacing of each sample. The order of the time vector should match the order of the columns in the expression data. Below are a few examples on how to specify the time vector for a variety of situations: Suppose all 40 samples are equally spaced time-points: <>= time.vector <- 1:40 time.vector @ Suppose there are 20 equally spaced time points, each with 2 replicates: <>= time.vector <- rep(1:20, each = 2) time.vector @ Suppose there are 18 unequally spaced time points, most times have 2 replicates but a few times have 3: <>= time.vector <- c(rep(1, 3), rep(2:9, each = 2), rep(10:11, 3), rep(12:17, each=2), rep(18, 3)) time.vector table(time.vector) @ Remember, it is critical that this specification corresponds exactly to the order of samples (columns) in the normalized data matrix described in the previous section! \noindent \textbf{FAQ: Does it matter if I use the real time or equally spaced time?} Suppose you have an experiment that was sampled at minutes 1,2,10,20, and 60 with two replicates at each time. In the plot below we can clearly see that the interpretation would be quite different depending on the definition of the time vector. In the true time plot (right), the expression increases quickly initially then levels off, whereas in the equal spacing plot (left) the expression appears to increase at a constant rate. In the majority of cases, we recommend using the true time to define the time vector. An example is given in Section 4.1 and 4.2. <>= mygene <- trendyExampleData[2,1:10] equalSpacing <- rep(c(1:5), each=2) trueSpacing <- c(1,1,2,2,10,10,20,20,60,60) par(mfrow=c(1,2), mar=c(5,5,2,1)) plot(equalSpacing, mygene, ylab="Expression") plot(trueSpacing, mygene, ylab="Expression") @ \subsection{Run Trendy} The \Rfunction{trendy} function will fit multiple segmented regressions models for each gene (via the \CRANpkg{segmented} R package) and select the the optimal model. For this example, we will only consider a maximum of two breakpoints for each gene. <>= time.vector <- 1:40 res <- trendy(Data = trendyExampleData, tVectIn = time.vector, maxK = 2) res <- results(res) res.top <- topTrendy(res) # default adjusted R square cutoff is 0.5 res.top$AdjustedR2 @ The \Rfunction{topTrendy} function may be used to extract top dynamic genes. By default, \Rfunction{topTrendy} will extract genes whose adjusted $R^2$, $\bar{R}^{2}$, is greater or equal to 0.5. To change this threshold, a user may specify the \Rcode{adjR2Cut} parameter in the \Rfunction{topTrendy} function. The \Rfunction{topTrendy} function returns the Trendy output with genes sorted decreasingly by $\bar{R}^{2}$. By default the \Rfunction{trendy} function only considers genes whose mean expression is greater than 10. To use another threshold, the user may specify the desired value using the parameter \Robject{meanCut}. \subsection{Visualize trends of the top dynamic genes} The object \Robject{res.top\$Trend} contains the trend specification of the top genes. The function \Rfunction{trendHeatmap} can be used to display these trends. First, the \Rfunction{trendHeatmap} function classifies the top dynamic genes into three groups: those that start with 'up', start with 'down' and start with 'no change'. Within each group, genes are sorted by the position of the first breakpoint. <>= res.trend <- trendHeatmap(res.top) str(res.trend) @ To generate an expression heatmap of the first group of genes (first go 'up'): <>= library(gplots) heatmap.2(trendyExampleData[names(res.trend$firstup),], trace="none", Rowv=FALSE,Colv=FALSE,dendrogram='none', scale="row", main="top genes (first go up)") @ Similarly, to generate an expression heatmap of the second group of genes (first go down): <>= heatmap.2(trendyExampleData[names(res.trend$firstdown),], trace="none", Rowv=FALSE,Colv=FALSE,dendrogram='none', scale="row", main="top genes (first go down)") @ To generate an expression heatmap of the second group of genes (first no change): <>= heatmap.2(trendyExampleData[names(res.trend$firstnochange),], trace="none", Rowv=FALSE,Colv=FALSE,dendrogram='none', scale="row", main="top genes (first no change)", cexRow=.8) @ \subsection{Visualize individual genes} The \Rfunction{plotFeature} function may be used to plot expression of individual features/genes and the fitted lines. For example, to plot the top six genes in the first group of genes (first go up): <>= par(mfrow=c(3,2)) plotFeature(Data = trendyExampleData, tVectIn = time.vector, simple = TRUE, featureNames = names(res.trend$firstup)[1:6], trendyOutData = res) @ These can be plot together with segment trends colored and breakpoints highlighted by setting \Rcode{simple=FALSE}. A legend can be placed by specifying \Rcode{legendLocation = 'side'} or \Rcode{legendLocation = 'bottom'}. The user may supress the legend by setting \Rcode{showLegend = FALSE}. The size of the legend text can be adjusted using the parameter \Rcode{legendCex}. <>= par(mfrow=c(3,2)) #specify the layout of multiple plots in a single panel plotFeature(Data = trendyExampleData, tVectIn = time.vector, simple = FALSE, showLegend = TRUE, legendLocation='side',cexLegend=1, featureNames = names(res.trend$firstup)[1:6], trendyOutData = res) @ <>= par(mfrow=c(3,2)) #specify the layout of multiple plots in a single panel plotFeature(Data = trendyExampleData, tVectIn = time.vector, simple = FALSE, showLegend = TRUE, legendLocation='bottom',cexLegend=1, featureNames = names(res.trend$firstup)[1:6], trendyOutData = res) @ The input of function \Rfunction{plotFeature} requires the expression data and a list of genes of interest. The parameter \Robject{trendyOut} contains the results from the \Rfunction{trendy} function. If it is not specified, then \Rfunction{plotFeature} will run \Rfunction{trendy} on the genes of interest before plotting. Specifying the output obtained from previous steps will save time by avoiding fitting the models again. Similarly, to plot the top six genes in the second group of genes (first go down): <>= par(mfrow=c(3,2)) plotFeature(Data = trendyExampleData,tVectIn = time.vector, simple=TRUE, featureNames = names(res.trend$firstdown)[1:6], trendyOutData = res) @ To plot the two genes in the third group of genes (first no change): <>= par(mfrow=c(1,2)) plotFeature(trendyExampleData,tVectIn = time.vector, simple=TRUE, featureNames = names(res.trend$firstnochange)[1:2], trendyOutData = res) @ \subsection{Gene specific estimates} For a given gene of interest, its estimated parameters can be obtained individually: <>= par(mfrow=c(1,1)) plot2 <- plotFeature(trendyExampleData,tVectIn = time.vector, featureNames = "g2", trendyOutData = res) res.top$Breakpoints["g2",] # break points res.top$AdjustedR2["g2"] # adjusted r squared res.top$Segments["g2",] # fitted slopes of the segments res.top$Segment.Pvalues["g2",] # p value of each the segment @ The above printout shows that for gene g2 the optimal number of breakpoints is two estimated at time-points 12 and 30. The fitted slopes for the 3 adjoining segments are 3.31, 0.06 and -2.97, which indicates the trend is 'up'-'no change'-'down.' These estimates can be automatically formatted using the function \Rfunction{formatResults}, which can then be saved as a .txt. or .csv file. The output currently includes the estimated slope, p-value, and trend of each segment, the estimated breakpoints, the trend for each sample, and the adjusted $R^2$. <>= trendy.summary <- formatResults(res.top) trendy.summary[1:4,1:8] # To save: # write.table(trendy.summary, file="trendy_summary.txt") @ The NA indicates that g3 does not have a segment 3 slope since it only has one breakpoint (i.e two segments). \subsection{Breakpoint distribution over the time course} To calculate the number of breakpoints for all genes over the time course: <>= res.bp <- breakpointDist(res.top) barplot(res.bp, ylab="Number of breakpoints", col="lightblue") @ The bar plot indicates that a number of genes have breakpoints around times 11 - 13. \section{More advanced analysis} \subsection{Time course with non-uniform sampling} If the samples were collected at different time intervals then it is highly suggested to denote the time vector by this scale (instead of a vector of consecutive numbers). To do so, the user may specify the order/times via the tVectIn parameter in the \Rfunction{trendy} function. For example, suppose for the simulated data, the first 30 samples were collected every hour and the remaining 10 samples were collected every 5 hours. We may define the time vector as: <>= time.vector <- c(1:30, seq(31, 80, 5)) names(time.vector) <- colnames(trendyExampleData) time.vector @ To run Trendy using the empirical collecting time instead of sample ID (1-40): <>= res2 <- trendy(Data = trendyExampleData, tVectIn = time.vector, maxK=2) res2 <- results(res2) res.top2 <- topTrendy(res2) res.trend2 <- trendHeatmap(res.top2) str(res.trend2) @ To plot the first four genes that have up-regulated pattern at the beginning of the time course: <>= par(mfrow=c(2,2)) plotFeature(trendyExampleData, tVectIn=time.vector, simple = TRUE, featureNames = names(res.trend2$firstup)[1:4], trendyOutData = res2) @ \subsection{Time-course with replicates available} Trendy is able to make use of replicated time-points if available. To do so, the user can specify the replicates directly in the the \Rcode{tVectIn} parameter in the \Rfunction{trendy} function. For example, suppose for the simulated data, 10 time points were observed 4 times each. We may define the time vector as: <>= time.vector <- rep(1:10, each=4) names(time.vector) <- colnames(trendyExampleData) time.vector @ <>= res3 <- trendy(Data = trendyExampleData, tVectIn = time.vector, maxK=2) res3 <- results(res3) res.top3 <- topTrendy(res3) res.trend3 <- trendHeatmap(res.top3) @ <>= par(mfrow=c(2,2)) plotFeature(trendyExampleData, tVectIn=time.vector, simple = FALSE, legendLocation = 'bottom', featureNames = names(res.trend2$firstup)[1:4], trendyOutData = res3) @ \subsection{Extract genes with specific patterns} Users can search for genes with patterns of interest using the \Rfunction{extractPatterns} function in the Trendy package. For example, genes that have a peak along the time-course will have fitted trend somewhere as "up-down": <>= # Genes that peak pat1 <- extractPattern(res3, Pattern = c("up","down")) head(pat1) par(mfrow=c(1,2)) plotPat1 <- plotFeature(trendyExampleData, tVectIn=time.vector, featureNames = pat1$Gene[1:2], trendyOutData = res3) @ We may only want those where the peak has occured after some time-point. This can be specified using the \Rcode{Delay} parameter: <>= # Genes that peak after some time pat3 <- extractPattern(res3, Pattern = c("up","down"), Delay = 7) head(pat3) @ To search for genes that have a 'no change' segment, the \Rcode{extractPattern} function accepts both 'no change' and 'same'. For example, here we search for genes that are stable and then go up: <>= # Genes that are constant, none extractPattern(res2, Pattern = c("no change", "up")) extractPattern(res2, Pattern = c("same", "up")) @ \subsection{Determining threshold for adjusted $R^2$} Depending on the type of experiment (RNA-seq, microarray, scRNA-seq, etc.) and level of noise, different thresholds for the adjusted $R^2$ may be used. One way to decide an appropriate threshold is to perform a permutation procedure as follows: <>= library(Trendy) res.r2 <- c() for(i in 1:100) { # permute 100 times at least BiocParallel::register(BiocParallel::SerialParam()) seg.shuffle <- trendy(trendyExampleData[sample(1:nrow(data.norm.scale), 100),], #sample genes each time tVectIn = sample(time.vector), # shuffle the time vector saveObject=FALSE, numTry = 5) res <- results(seg.shuffle) res.r2 <- c(res.r2, sapply(res, function(x) x$AdjustedR2)) } # Histogram of all R^2 hist(res.r2, ylim=c(0,1000), xlim=c(0,1), xlab=expression(paste("Adjusted R"^"2"))) # Say you want to use the value such that less than 1% of permutations reach: sort(res.r2, decreasing=T)[round(.01 * length(res.r2))] # Say you want to use the value such that less than 5% of permutations reach: sort(res.r2, decreasing=T)[round(.05 * length(res.r2))] @ Note: For an experiment with replicates, you should shuffle the replicated timepoints together: <>= time.vector = c(1,1,2,2,10,10,20,20,60,60) # How to shuffle the replicates -together- set.seed(12) shuf.temp=sample(unique(time.vector)) print(shuf.temp) setshuff=do.call(c,lapply(shuf.temp, function(x) which(!is.na(match(time.vector, x))))) use.shuff <- time.vector[setshuff] print(use.shuff) @ <>= ## Then in the permutation code you'll do: for(i in 1:100) { # permute 100 times at least BiocParallel::register(BiocParallel::SerialParam()) shuf.temp=sample(unique(time.vector)) setshuff=do.call(c,lapply(shuf.temp, function(x) which(!is.na(match(time.vector, x))))) use.shuff <- time.vector[setshuff] seg.shuffle <- trendy(trendyExampleData[sample(1:nrow(data.norm.scale), 100),], #sample genes each time tVectIn = use.shuff, # shuffle the time vector saveObject=FALSE, numTry = 5) res <- results(seg.shuffle) res.r2 <- c(res.r2, sapply(res, function(x) x$AdjustedR2)) } @ \subsection{Further analysis of Trendy expression trends} For each gene, the Trendy segments are assigned a trend as: "up", "down", or "same". These trends can be used to cluster genes having similar dynamics along the time-course. Here I will use a simple hierarchical clustering to demonstrate the clustering but other clustering methods may be used instead. <>= # Get trend matrix: trendMat <- res.top$Trends # Cluster genes using hierarchical clustering: hc.results <- hclust(dist(trendMat)) plot(hc.results) #Decide how many clusters to choose #Let's say there are 4 main clusters hc.groups <- cutree(hc.results, k = 4) @ Here are heatmaps of genes in Clusters 1 and 4. <>= cluster1.genes <- names(which(hc.groups == 1)) res.trend2 <- trendHeatmap(res.top, featureNames = cluster1.genes) cluster4.genes <- names(which(hc.groups == 4)) res.trend2 <- trendHeatmap(res.top, featureNames = cluster4.genes) @ The genes in each cluster can then be used as input for gene enrichment analysis. Two popular gene set enrichment tools include: enrichr (web-based, http://amp.pharm.mssm.edu/Enrichr/) or GSEA (via MSigDB: http://software.broadinstitute.org/gsea/msigdb/index.jsp). \section{Trendy shiny app} The Trendy shiny app requires the .RData object output from the \Rfunction{trendy} function, which can be obtained by setting \Rcode{saveObject=TRUE} and specifying a name via the \Rfunction{fileName} parameter. <>= res <- trendy(trendyExampleData, tVectIn = 1:40, maxK=2, saveObject = TRUE, fileName="exampleObject") res <- results(res) @ Then in R run: <>= trendyShiny() @ Below are screenshots of the Shiny application: \begin{figure}[H] \centering \includegraphics[width=1\textwidth]{Shiny_UploadData.png} \caption{Upload shiny object} \end{figure} \begin{figure}[H] \centering \includegraphics[width=1\textwidth]{Shing_allGenePat.png} \caption{Find all genes with a given pattern} \end{figure} \begin{figure}[H] \centering \includegraphics[width=1\textwidth]{Shiny_geneViz.png} \caption{Search genes individually} \end{figure} \newpage \section{SessionInfo} <>= sessionInfo() @ \end{document}