--- title: "Introduction to the Robust longitudinal Differential Expression (RolDE) method" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(RolDE) library(printr) library(knitr) ``` ## Introduction RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings. The RegROTS module merges the power of regression modelling with the power of the established differential expression method Reproducibility Optimized Test Statistic (ROTS) (Elo et al., Suomi et al.). A polynomial regression model for protein expression over time is fitted separately for each replicate (individual) in each condition. Differential expression between two replicates (individuals) in different conditions is examined by comparing the coefficients of the replicate-specific regression models. If all coefficient differences are zero, no longitudinal differential expression between the two replicates in different conditions exist. For a thorough exploration of differential expression between the conditions, all possible combinations of replicates in different conditions are examined. In the DiffROTS module the expression of replicates (individuals) in different conditions are directly compared at all time points. Again, if the expression level differences at all time points are zero, no differential expression between the examined replicates (individuals) in different conditions exist. Similarly to the RegROTS module, differential expression is examined between all possible combinations of replicates (individuals) in the different conditions. In non-aligned time point data, the overall expression level differences between the conditions is examined when accounting for time-associated trends of varying complexity in the data. More specifically, the expression level differences between the conditions are examined when adjusting for increasingly complex time-related expression trends of polynomial degrees d=0,1,.,d where d is the maximum degree for the polynomial and the same degree as is used for the PolyReg module. In the PolyReg module, polynomial regression modelling is used to detect longitudinal differential expression. Condition is included as a categorical factor within the models and by investigating the condition related intercept and longitudinal-term related coefficients at different levels of the condition factor, average differences in expression levels as well as differences in longitudinal expression patterns between the conditions can be examined. Finally, to conclusively detect any differential expression, the findings from the different modules are combined using the rank product. For more details about the method, see the original RolDE publication (Valikangas et al.). ## Installation The latest version of RolDE can be installed from Bioconductor: ```{r install RolDE, eval=FALSE, message=FALSE, warning=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = "http://cran.us.r-project.org") BiocManager::install("RolDE") ``` ## Applying RolDE in example datasets with aligned time points First, load the dataset and the design matrix to be used: ```{r load data1, echo=TRUE} data(data1) data("des_matrix1") ``` To understand the structure of the design matrix and what a design matrix for RolDE should look like, let's explore the data and the design matrix: Dimensions of the data: ```{r explore data1, echo=TRUE} dim(data1) ``` Column names of the data: ```{r explore data1 2, echo=TRUE} head(colnames(data1)) ``` Dimensions of the design matrix: ```{r explore data1 3, echo=TRUE} dim(des_matrix1) ``` Column names of the design matrix: ```{r explore data1 4, echo=TRUE} colnames(des_matrix1) ``` Contents of the design matrix: ```{r explore data1 5, echo=TRUE} head(des_matrix1) tail(des_matrix1) ``` The first column of the design matrix should contain the sample names of the data (column names of the data). The second column should indicate the condition/group factor status for each sample to be explored (e.g. sick/healthy, control/case). The third column should indicate the time point for each sample (data with aligned time points) or time value (or equivalent, e.g. age, temperature, disease progression) information in data with non-aligned time points. The fourth and final column of the design matrix should indicate the replicate / individual each sample came from. Let's look at the exemplary design matrix a little bit more: Unique samples: ```{r explore data1 6, echo=TRUE} unique(des_matrix1[,1]) ``` Conditions: ```{r explore data1 7, echo=TRUE} unique(des_matrix1[,2]) table(des_matrix1[,2]) ``` Time points: ```{r explore data1 8, echo=TRUE} unique(des_matrix1[,3]) table(des_matrix1[,3]) ``` Replicates: ```{r explore data1 9, echo=TRUE} unique(des_matrix1[,4]) table(des_matrix1[,4]) ``` In this example, we have a dataset of 30 samples, 2 conditions, 6 replicates each with 5 time points. The timepoints in the data are aligned. This is how a design matrix for a dataset for RolDE should look like. This gives all the essential information for RolDE it needs to determine longitudinal differential expression between the conditions. By bare minimum, RolDE needs the data and the design matrix. By default RolDE assumes that the time points in the data are aligned. If not defined otherwise, RolDE will by default use sequential processing. However, using parallel processing and multiple threads will greatly reduce the computational time required by RolDE. Please use set.seed() for reproducibility. Running RolDE using parallel processing and 3 threads: ```{r RolDE data1, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE(data=data1, des_matrix=des_matrix1, n_cores=3) ``` RolDE supports the SummarizedExperiment data structure and the data and design matrix can be provided to RolDE as a SummarizedExperiment object. In this case, the *data* argument for RolDE must be a SummarizedExperiment object, where the data matrix is included as a list in the *assays* argument and the design matrix in the *colData* argument. The format of the data matrix and the design matrix within the SummarizedExperiment object must be the same as when provided separately. Constructing a SummarizedExperiment object from data1 and the associated design matrix for RolDE: ```{r RolDE SE usage, echo=TRUE, eval=FALSE} SE_object_for_RolDE = SummarizedExperiment(assays=list(data1), colData = des_matrix1) ``` Running RolDE using a SummarizedExperiment object including the data and the metadata: ```{r RolDE SE object run, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE(data=SE_object_for_RolDE, n_cores=3) ``` ```{r explore RolDE res data1, echo=FALSE} data(res1) data1.res=res1 ``` The results of RolDE are returned in a list with lot of information: ```{r explore RolDE res data1 2, echo=TRUE} names(data1.res) ``` The main results of RolDE are located in the first element of the provided result list. Elements 2,4 and 6 provide the result for the different modules of RolDE separately (the RegROTS, DiffROTS and PolyReg modules). Elements 3 and 5 provide the significance values for the different ROTS runs of the RegROTS and DiffROTS modules, respectively. The used ROTS runs within those modules are given in element 8. The comparisons between the replicates in the different conditions are divided into different runs so that each sample is used only once within each run to preserve the proper degrees of freedom for statistical testing. All the condition related significance values of the polynomial regression models in the PolyReg module are given in element 7. The used polynomial degrees for the regression models in the RegROTS and the PolyReg modules are given in element 9. And in element 10, all the inputs used by RolDE (both given by the user and those determined automatically by the method) are given. Typically, the main (only) interest of the user is in element 1, where the main results of RolDE are given: ```{r explore RolDE res data1 3, echo=TRUE} RolDE.data1<-data1.res$RolDE_Results dim(RolDE.data1) colnames(RolDE.data1) ``` By default, the features in the result data frame are given in the same order as entered. Let's order the results based on the strength of longitudinal differential expression detected by RolDE: ```{r explore RolDE res data1 4, echo=TRUE, fig.height=5, fig.width=7} RolDE.data1<-RolDE.data1[order(as.numeric(RolDE.data1[,2])),] head(RolDE.data1, 5) ``` Explore the distribution of the estimated significance values: ```{r explore RolDE res data1 5, echo=TRUE, fig.height=5, fig.width=7} hist(RolDE.data1$`Estimated Significance Value`, main = "Estimated significance values", xlab="") ``` As can be observed, the distribution of the estimated significance values is approximately uniform. Data1 is random data; the values for the features have been randomly assigned from a normal distribution with a mean of 22 and a standard deviation of 1.5. Overall, the null hypothesis between Condition1 and Condition2 is true; the conditions are not differentially expressed and a uniform significance value distribution is expected. After correcting for the simultaneous testing of multiple hypothesis by using the Bejamini-Hochberg (FDR) correction, no differentially epxressed feature remains with the commonly used FDR level of 0.05: ```{r explore RolDE res data1 6, echo=TRUE} length(which(RolDE.data1$`Adjusted Estimated Significance Value`<=0.05)) ``` The calculation of significance values in RolDE is controlled via the parameter *sigValSampN* which control if the significance values should be calculated and how many permutations should be used when determining the significance values. Computational time required by RolDE can be reduced by not calculating the significance values by setting *sigValSampN* to 0. By increasing *sigValSampN* from the default number, the significance values can be estimated more accurately but the required computational time will also increase. If parallel processing for RolDE is enabled (*n_cores* > 1), it will also be utilized when estimating the significance values. The estimated significance values can be adjusted by any method included in the p.adjust method in the stats package. Alternatively, q-values as defined by Storey et al. in the Bioconductor package qvalue can be used. Valid values for *sig_adj_meth* are then: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY","fdr","none", "qvalue". ### Semi-simulated spike-in dataset In addition to the random data already explored, a semi-simulated spike-in dataset with differential expression between the conditions is also included in the RolDE package: ```{r explore data3, echo=TRUE} data("data3") data("des_matrix3") ?data3 ``` Let's run RolDE for data3 using the Bonferroni procedure for multiple hypothesis adjustement: ```{r Run RolDE for data3, echo=TRUE, eval=FALSE} set.seed(1) data3.res<-RolDE(data=data3, des_matrix=des_matrix3, n_cores=3, sig_adj_meth = "bonferroni") ``` ```{r explore RolDE res data3, echo=FALSE} data(res3) data3.res=res3 ``` Retrieve the final RolDE results and organize the results based on strength of differential expression: ```{r explore RolDE res data3 2, echo=TRUE, fig.height=5, fig.width=7} RolDE.data3<-data3.res$RolDE_Results RolDE.data3<-RolDE.data3[order(as.numeric(RolDE.data3[,2])),] head(RolDE.data3, 3) ``` Explore the distribution of the estimated significance values: ```{r explore RolDE res data3 3, echo=TRUE, fig.height=5, fig.width=7} hist(RolDE.data3$`Estimated Significance Value`, main = "Estimated significance values", xlab="") ``` We can now observe, that the distribution of signficance values is different than for the random data of data1, where the null hypothesis was true. In data3, true differential expression between the conditions exists. The spike-in proteins are expected to change between the conditions, while most of the background proteins are expected to remain stable between the conditions. However, in reality this is not always the case - some background proteins might be changing too due to variations in the experimental conditions during the preparation of the dataset. Let's see how RolDE has ranked the spike-in proteins known to be changing between the conditions: ```{r explore RolDE res data3 4, echo=TRUE, fig.height=5, fig.width=7} grep("ups", RolDE.data3[,1]) ``` Most of the spike-in proteins are located near the top of the result list, as expected. How many of the proteins remain signifcant at the alpha level of 0.05 after the Bonferroni procedure to adjust for multiple hypothesis testing: ```{r explore RolDE res data3 5, echo=TRUE, fig.height=5, fig.width=7} length(which(RolDE.data3[,4]<=0.05)) ``` Findings from the DE analysis can be plotted with the *plotFindings* function included in the RolDE package: ```{r plotfindings, echo=TRUE, fig.height=5, fig.width=7} ?plotFindings ``` ## Applying RolDE in example data with non aligned time points In addition to data with aligned time points, RolDE can be applied in data with non-aligned time points. However, the *aligned* parameter in RolDE must now be set to FALSE in order for RolDE to discern that time points in the data are not aligned. Instead of fixed time points, the integer values in the time point column 3 of the design matrix should now be replaced with continuous numerical values, or time values (e.g. age at the time of sampling). ## Changing the settings for RolDE While RolDE performs typically very well with the default parameter values, the user might sometimes wish to alter some of these values for a specific kind of analysis. Some of the most important parameters include: Parameter *min_comm_diff* controls how many common time points must two replicates (individuals) have in different conditions to be compared. The first value controls the number of common time points for the RegROTS module, while the second one controls the number of common time points for the DiffROTS module. If *min_comm_diff* is set to "auto", RolDE will use a value of 3 for the RegROTS module and a value of 1 for the DiffROTS module. In the case of data with non-aligned time points, the first value of *min_comm_diff* controls how many time values (or similar, e.g. age, temperature) must both replicates (individuals) in different conditions have in the common time interval to be compared. Defining larger values for the minimum number of common time points to be used in RolDE with data1: ```{r change RolDE parameters, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE(data=data1, des_matrix=des_matrix1, doPar=T, n_cores=3, min_comm_diff = c(4,3)) ``` Using the above values for *min_comm_diff* requires the replicates in different conditions to have 4 common time points to be compared in the RegROTS module and 3 common time points to be compared in the DiffROTS module. *min_feat_obs* controls the number of non-missing values a feature must have for a replicate (an individual) in a condition to be compared in the RegROTS module and the DiffROTS module (in data with aligned time points). A feature is required to have at least *min_feat_obs* non-missing values for both replicates in the different conditions to be compared. If lowered, more missing values are allowed but the analysis may become less accurate. If increased, only replicates (individuals) with less missing values for a feature are compared. The user can control the degree of polynomials used by the RegROTS and the PolyReg modules via the *degtree_RegROTS* and the *degree_PolyReg parameters*. If left to "auto", RolDE will determine the suitable degrees automatically as described in (Valikangas et al.) Using RolDE with non default user given polynomial degrees for the RegROTS and PolyReg modules: ```{r change RolDE parameters 2, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE_Main(data=data1, des_matrix=des_matrix1, n_cores=3, degree_RegROTS = 2, degree_PolyReg = 3) ``` By default, RolDE uses fixed effects only regression with a common intercept and slope for the replicates (individuals) when time points in the data are aligned and mixed effects models with a random effect for the individual baseline (intercept) if the time points are non aligned for the PolyReg and the DiffROTS (only in data with non aligned time points) modules. This behaviour is controlled with the parameter *model_type* and the default behaviour is induced when *model_type* is allowed to be "auto". However, the user can choose to use mixed effects regression modelling when appropriate by setting the parameter *model_type* as "mixed0" for random effects for the individual baseline and setting *model_type* as "mixed1" for an individual baseline and slope. Fixed effects only models can be chosen to be used by setting as "fixed". Valid inputs for \code{model_type} are "auto" (the default), "mixed0", "mixed1" and "fixed". Analyzing data1 using mixed effects modelling with random intercepts for the replicates in the PolyReg module of RolDE: ```{r change RolDE parameters 3, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE_Main(data=data1, des_matrix=des_matrix1, n_cores=3, model_type="mixed0") ``` Analyzing data1 using mixed effects modelling with random intercepts and also random linear slopes for the replicates in the PolyReg AND the DiffROTS modules of RolDE: ```{r change RolDE parameters 4, echo=TRUE, eval=FALSE} set.seed(1) data1.res<-RolDE(data=data1, des_matrix=des_matrix1, n_cores=3, model_type="mixed1") ``` Altering the *model_type* parameter has an effect for the DiffROTS module only when the time points in the data are non-aligned. In non-aligned time point data, the expression level differences between the conditions in the DiffROTS module is examined when accounting for time-associated trends of varying complexity in the data. ## Preparation of data for RolDE The data for RolDE needs to be appropriately preprocessed (e.g. log - transformed, normalized). RolDE does not perform any filtering or normalization for the data; such preprocessing must be performed prior to applying RolDE. Similarly, adjusting for possible confounding effects in the data must be performed before the application of RolDE. ## References Elo, Laura, Filen S, Lahesmaa R, et al. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008; 5:423-31. Suomi T, Seyednasrollah F, Jaakkola MK, et al. ROTS: An R package for reproducibility-optimized statistical testing. PLoS Comput. Biol. 2017; 13:5. Storey JD, Bass AJ, Dabney A, et al. qvalue: Q-value estimation for false discovery rate control. 2019. Välikangas T, Suomi T, ELo LL, et al. Enhanced longitudinal differential expression detection in proteomics with robust reproducibility optimization regression. bioRxiv 2021. ```{r session info, echo=TRUE} #Session info sessionInfo() ```