%\VignetteIndexEntry{Discordant} %\VignetteKeywords{omics, differential correlation, transcriptomics} %\VignettePackage{Discordant} %\VignetteEngine{knitr::knitr} \documentclass[12pt]{article} \usepackage{caption} \usepackage{cite} \usepackage{hyperref} \usepackage{amssymb} \usepackage[utf8]{inputenc} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\Discordant}{\Rpackage{Discordant}} \title{The Discordant R Package: A Novel Approach to Differential Correlation} \author{Charlotte Siska and Katerina Kechris} \begin{document} \maketitle <>= options(width=72) @ \tableofcontents \pagebreak \section{Introduction} Discordant is an R package that identifies pairs of features that correlate differently between phenotypic groups, with application to -omics datasets. Discordant uses a mixture model that “bins” molecular feature pairs based on their type of coexpression. More information on the algorithm can be found in \cite{siska1, siska2}. The final output are posterior probabilities of differential correlation. This package can be used to determine differential correlation within one –omics dataset or between two –omics datasets (provided that both –omics datasets were taken from the same samples). Also, the type of data can be any type of –omics with normal or non-normal distributions. Some examples are metabolomics, transcriptomic, proteomics, etc. The functions in the Discordant package provide a simple pipeline for intermediate R users to determine differentially correlated pairs. The final output is a table of molecular feature pairs and their respective posterior probabilities. Functions have been written to allow flexibility for users in how they interpret results, which will be discussed further. Currently, the package only supports the comparison between two phenotypic groups (e.g., disease vs control, mutant vs wildtype). The \Rpackage{Discordant} package is available at bioconductor.org and can be downloaded via \Rfunction{bioLite}: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("Discordant") @ To use \Rpackage{Discordant}, import into R: <>= library(discordant) @ \section{Discordant Algorithm} Discordant is originally derived from the Concordant algorithm written by \cite{lai1, lai2}. It was used to determine concordance between microarrays. We have applied it to determine differential correlation of features between groups \cite{siska1,siska2}. Using a three component mixture model and the EM algorithm, the model predicts if the correlation coefficients in phenotypic groups 1 and 2 for a molecular feature pair are dissimilar \cite{siska1}. The correlation coefficients are generated for all possible molecular feature pairs witin an -omics dataset or between two -omics datasets. The correlation coefficients are transformed into z scores using Fisher's tranformation. The three components are -, + and 0 which correspond respectively to a negative, positive or no correlation. Molecular features that have correlation coefficients in \textit{different} components are considered \textit{differentially} correlated, as opposed to when correlation coefficients are in the \textit{same} component then they are \textit{equivalently} correlated. \begin{table} \begin{center} \begin{tabular}{ c | c c c } & 0 & - & + \\ \hline 0 & 1 & 2 & 3 \\ - & 4 & 5 & 6 \\ + & 7 & 8 & 9 \\ \end{tabular} \caption{Class Matrix for Three Component Mixture Model} \end{center} \end{table} The class matrix (Table 1) contains the classes that represent all possible paired-correlation scenarios. These scenarios are based off the components in the mixture models. Molecular features that have correlation coefficients in different components are considered differentially correlated, as opposed to when correlation coefficients are in the same component they are equivalently correlated. This can be visualized in the class matrix, where the rows represent the components for group 1 and the columns represent the components for group 2. The classes on the diagonal represent equivalent correlation (1, 5 and 9), and classes in the off-diagonal represent differential correlation (2, 3, 4, 6, 8). After running the EM algorithm, we have 9 posterior probabilities for each molecular feature pair that correspond to the 9 classes in the class matrix. Since we want to summarize the probability that the molecular feature pair is differentially correlated, we sum the posterior probabilities representing the off-diagonal classes in the class matrix. \section{Example Data} All datasets are originally from the Cancer Genome Atlas (TCGA) and can be found at \href{http://cancergenome.nih.gov/}{http://cancergenome.nih.gov/}. \begin{description} \item{\Robject{TCGA\_GBM\_miRNA\_microarray}}{ Data is miRNA expression values from 10 control and 20 tumor samples for a Glioblastoma multiforme (GBM) Agilent miRNA micorarray. The feature size was originally 470, but after features with outliers were filtered out feature size reduces to 331. In this sample dataset, random 10 features are present.} \item{\Robject{TCGA\_GBM\_transcript\_microarray}}{ Data is transcript (or mRNA) expression values from 10 control and 20 tumor samples in a GBM Agilent 244k micorarray. The feature size was originally 90797, but after features with outliers were filtered out, feature size reduces to 72656. In this sample dataset, 20 random features are present.} \item{\Robject{TCGA\_Breast\_miRNASeq}}{ Data is miRNA counts from 15 control and 45 tumor samples in a Breast Cancer Illumina HiSeq miRNASeq. The feature size was originally 212, but after features with outliers were filtered out feature size reduces to 200. In this sample dataset, 100 random features are present.} \item{\Robject{TCGA\_Breast\_RNASeq}}{ Data is transcript (or mRNA) counts from 15 control and 45 tumor samples in a Breast Cancer Illumina HiSeq RNASeq. The feature size was originally 19414, but after features with outliers were filtered out feature size reduces to 16656. In this sample dataset, 100 random features are present.} \item{\Robject{TCGA\_Breast\_miRNASeq\_voom}}{ This dataset is the voom-transformed \Robject{TCGA\_Breast\_miRNASeq}.} \item{\Robject{TCGA\_Breast\_RNASeq\_voom}}{ This dataset is the voom-transformed \Robject{TCGA\_Breast\_RNASeq}.} \end{description} \section{Before Starting} \subsection{Required Inputs} "Within" –omics refers to when the Discordant analysis is performed within one –omics dataset where all molecular features within a -omics dataset are paired to each other (e.g. transcript-transcript pairs in a microarray transcriptomics experiment). "Between" -omics refers to analysis of two -omics datasets. Molecular feature pairs analyzed are between the two -omics, (e.g. transcript-protein, protein-metabolite) are paired. \begin{description} \item{\Robject{x}}{ m by n matrix where m are features and n are samples. If only this matrix is provided, a within -omics analysis is performed.} <<<<<<< HEAD \item{\Robject{y}}{ m by n matrix where m are features and n are samples. This is an optional argument which will induce between -omics analysis. Samples must be matched with those in x.} ======= \item{\Robject{y}}{ m by n matrix where m are features and n are samples. This is an optional argument which will induce between -omics analysis. Samples must be matched with those in x.} >>>>>>> release-0.99.6 \item{\Robject{group}}{ vector containing 1s and 2s that correspond to the location of samples in the columns of x (and y if provided). For example, the control group is group 1 and the experimental group 2, and the location of samples corresponding to the two groups matches the locations of 1s and 2s in the group vector} \end{description} \subsection{Outliers} In our work, we found that features with outliers would skew correlation and cause false positives. Our approach was to filter out features that had large outliers. With normal data, such as in microarrays, Grubbs' test can be used. The null hypothesis is that there are no outliers in the data, and so features with p-value $\ge$ 0.05 are kept. A simple R function is found in the \Rpackage{outliers} R package as \Rfunction{grubbs.test}. Determining outliers in non-normal data is more complicated. We used the median absolute deviation (MAD). Normally, features are filtered if they are outside 2 or 3 MADs from the median \cite{leys}. This is not completely applicable to sequencing data, because sequencing data has large variance and a non-symmetrical distribution. We used 'split MAD', which has been used before \cite{magwene}. A left MAD is determined based on data left to the median and a right MAD is determined based on data to the right of the median. If there are any feature outside a factor of the left or right MAD from the median, there are featured out. A function in Discordant is provided called \Rfunction{split.madOutlier}. The number of MAD outside of the median can be changed with option \Robject{threshold}. Another option is \Robject{filter0} which if \Robject{TRUE} will filter out any feature with at least one 0. Arguments returned are \Robject{mat.filtered}, which is the filtered matrix and \Robject{index} which is the index of features that are retained in \Robject{mat.filtered}. <>= data(TCGA_Breast_miRNASeq) mat.filtered <- splitMADOutlier(TCGA_Breast_miRNASeq,filter0 = TRUE, threshold = 4) @ \section{Create Correlation Vectors} To run the Discordant algorithm, correlation vectors respective to each group are necessary for input, which are easy to create using the $\verb|createVectors|$. Each correlation coefficient represents the correlation between two molecular features. The type of molecular feature pairs depends if a within -omics or between -omics analysis is performed. Correlation between molecular features in the same -omics dataset is within -omics, and correlation between molecular features in two different -omics datasets is between -omics. Whether or not within -omics or between -omics analysis is performed depends on whether one or two matrices are parameters for this function. $\verb|createVectors|$ has two outputs: \begin{description} \item{\Robject{v1}}{Correlation vector of molecular feature pairs corresponding to samples labeled 1 in group parameter.} \item{\Robject{v2}}{Correlation vector of molecular feature pairs corresponding to samples labeled 2 in group parameter.} \end{description} <>= data(TCGA_GBM_miRNA_microarray) data(TCGA_GBM_transcript_microarray) groups <- c(rep(1,10), rep(2,10)) #Within -Omics wthn_vectors <- createVectors(TCGA_GBM_transcript_microarray, groups = groups) #Between -Omics btwn_vectors <- createVectors(TCGA_GBM_miRNA_microarray, TCGA_GBM_transcript_microarray, groups = groups) @ \subsection{Correlation Metric} We also have included different options for correlation metrics. This argument is called \Robject{cor.method} and its default value is "spearman." Other options are "pearson," "bwmc" and "sparcc". For information and comparison of Spearman, Pearson and biweight midcorrelation (bwmc) please read this paper by \href{http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-328}{Song et al}\cite{song}. We have also investigated correlation metrics in Discordant in relation to sequencing data, and found Spearman's correlation had the best performance\cite{siska2}. The algorithm for SparCC was introduced by Friedman et al\cite{friedman}. We use R code written by Huaying Fang\cite{fang} . \section{Run the Discordant Algorithm} The Discordant Algorithm is implemented in the the function \Rfunction{discordantRun} which requires two correlation vectors and the original data. If the user wishes to generate their own correlation vector before inputting into the dataset, they can do so. However, the function will return an error message if the dimensions of the datasets inserted do not match the correlation vector. The posterior probability output of the Discordant algorithm are the differential correlation posterior probabilities (the sum of the off-diagonal of the class matrix described in Table 1). If the user wishes to observe more detailed information, alternative outputs are available. \Rfunction{discordantRun} has five outputs: \begin{description} \item{\Robject{discordPPMatrix}}{Matrix of differential correlation posterior probabilities where rows and columns reflect features. If only x was inputted, then the number of rows and columns are number of features in x. The rows and column names are the feature names, and the upper diagonal of the matrix are NAs to avoid repeating results. If x and y are inputted, the number of rows is the feature size of x and the number of columns the feature size of y. The row names are features from x and the column names are features from y.} \item{\Robject{discordPPVector}}{Vector of differential correlation posterior probabilities. The length is the number of feature pairs. The names of the vector are the feature pairs.} \item{\Robject{classMatrix}}{Matrix of classes with the highest posterior probability. Row and column names are the same as in discordPPMatrix depending if only x is inputted or both x and y.} \item{\Robject{classVector}}{Vector of class with the highest posterior probability for each pair. The length is the number of feature pairs. Names of vector correspond to the feature pairs, similar to discordPPVector.} \item{\Robject{probMatrix}}{Matrix of all posterior probabilities, where the number of rows is the number of feature pairs and the columns represent the class within the class matrix. The number of columns can be 9 or 25, depending on how many mixture components are chosen (discussed later). The values across each row add up to 1. Posterior probabilities in discordPPMatrix and discordPPVector are the summation of columns that correspond to differential correlation classes (Table 1).} \item{\Robject{loglik}}{The log likelihood.} \end{description} <>= data(TCGA_GBM_miRNA_microarray) data(TCGA_GBM_transcript_microarray) groups <- c(rep(1,10), rep(2,10)) #Within -omics wthn_vectors <- createVectors(TCGA_GBM_transcript_microarray, groups = groups) wthn_result <- discordantRun(wthn_vectors$v1, wthn_vectors$v2, TCGA_GBM_transcript_microarray) wthn_result$discordPPMatrix[1:4,1:4] head(wthn_result$discordPPVector) wthn_result$classMatrix[1:4,1:4] head(wthn_result$classVector) head(wthn_result$probMatrix) wthn_result$loglik # Between -omics btwn_vectors <- createVectors(TCGA_GBM_miRNA_microarray, TCGA_GBM_transcript_microarray, groups = groups) btwn_result <- discordantRun(btwn_vectors$v1, btwn_vectors$v2, TCGA_GBM_miRNA_microarray, TCGA_GBM_transcript_microarray) btwn_result$discordPPMatrix[1:3,1:3] head(btwn_result$discordPPVector) btwn_result$classMatrix[1:3,1:3] btwn_result$classMatrix[1:3,1:3] head(btwn_result$classVector) head(btwn_result$probMatrix) btwn_result$loglik @ \subsection{Subsampling} Subsampling is an option to run the EM algorithm with a random sample of independent feature pairs. This is repeated for a number of samplings, and then the average of these parameters are used to maximize posterior probabilities for all feature pairs. This option was introduced to speed up Discordant method and to also address the independence assumption. There are some implementation issues which are explained in Siska et al, submitted\cite{siska2}. The argument \Robject{subsampling} must be set to TRUE for subsampling to be used. The number of independent feature pairs to be subsampled is determined by the argument \Robject{subSize} whose default value is the number of rows in \Robject{x}. The number of independent feature pairs must be less or equal to the number of features in \Robject{x} and \Robject{y}. The number of random samplings to be run is set by the argument \Robject{iter}, whose default value is 100. For the subsampling example, we will use the sequencing data. The sequencing data has a greater feature size than the microarray data. A dataset with small feature size will cause a segmentation fault with subsampling because not enough feature pairs are being used to estimate parameters in the mixture model. \begin{table}[h] \begin{center} \begin{tabular}{ c | c c c c c} & 0 & - & -- & + & ++ \\ \hline 0 & 1 & 2 & 3 & 4 & 5 \\ - & 6 & 7 & 8 & 9 & 10 \\ -- & 11 & 12 & 13 & 14 & 15 \\ + & 16 & 17 & 18 & 19 & 20 \\ ++ & 21 & 22 & 23 & 24 & 25 \end{tabular} \caption{Class Matrix for Five Component Mixture Model} \end{center} \end{table} \subsection{Increase Component Size} We also provide the option to increase component size from three to five in the mixture model. The number of classes in the class matrix increases, as seen in Table 2. Incorporating the extra components means that it is possible to identify elevated differential correlation, which is when there are associations in both groups in the same direction but one is more extreme. Using this options introduces more parameters, which does have an effect on run-time. We also found that using the five mixture component mixture model reduces performance compared to the three component mixture model\cite{siska2}. However, the option is available if users wish to explore more types of differential correlation. The default is to run the three component mixture model and can be changed with option \Robject{components}. \section{Example Run with Microarrays} <>= data(TCGA_GBM_miRNA_microarray) data(TCGA_GBM_transcript_microarray) groups <- c(rep(1,10), rep(2,10)) #Within -Omics wthn_vectors <- createVectors(TCGA_GBM_transcript_microarray, groups = groups) wthn_result <- discordantRun(wthn_vectors$v1, wthn_vectors$v2, TCGA_GBM_transcript_microarray) #Between -Omics btwn_vectors <- createVectors(TCGA_GBM_miRNA_microarray, TCGA_GBM_transcript_microarray, groups = groups) btwn_result <- discordantRun(btwn_vectors$v1, btwn_vectors$v2, TCGA_GBM_miRNA_microarray, TCGA_GBM_transcript_microarray) @ \bibliographystyle{ieeetr} \bibliography{Discordant_bib_v3} \end{document}