% \VignetteEngine{knitr::knitr} % \VignetteIndexEntry{ccrepe} \documentclass{article} %% ------------------------- %% Pre-amble: %% ------------------------- \usepackage[nonamebreak,square]{natbib} <>= BiocStyle::latex(width=78, use.unsrturl=FALSE) @ <>= library(ccrepe) @ \title{CCREPE: Compositionality Corrected by Permutation and Renormalization} \author{Emma Schwager, George Weingart, Craig Bielski, Curtis Huttenhower} %% ------------------------- %% End pre-amble %% ------------------------- \begin{document} \maketitle \tableofcontents \section{Introduction} \Biocpkg{ccrepe} is a package for analysis of sparse compositional data. Specifically, it determines the significance of association between features in a composition, using any similarity measure (e.g. Pearson correlation, Spearman correlation, etc.) The CCREPE methodology stands for Compositionality Corrected by Renormalization and Permutation, as detailed below. The package also provides a novel similarity measure, the N-dimensional checkerboard score (NC-score), particularly appropriate to compositions derived from microbial community sequencing data. This results in p-values and false discovery rate q-values corrected for the effects of compositionality. The package contains two functions \Rfunction{ccrepe} and \Rfunction{nc.score} and is maintained by the Huttenhower lab (\email{ccrepe-users@googlegroups.com}). %--------------------------------------------------------- \section{ccrepe} %--------------------------------------------------------- \Rfunction{ccrepe} is the main package function. It calculates compositionality-corrected p-values and q-values for a user-selected similarity measure, operating on either one or two input matrices. If given one matrix, all features (columns) in the matrix are compared to each other using the selected similarity measure. If given two matrices, each feature in the first are compared against all features in the second. \subsection{General functionality} Compositional data induces spurious correlations between features due to the nonindependence of values that must sum to a fixed total. CCREPE abrogates this when determining the significance of a similarity measure for each feature pair using two main steps, permutation/renormalization and bootstrapping. First, given two features to compare, CCREPE generates a null distribution of the similarity expected just due to compositionality by iteratively permuting one feature, renormalizing all samples in the composition to their previous sum, and computing the resulting similarity measures. Second, CCREPE bootstraps over sample subsets in order to assess confidence in the "true" similarity measure. Finally, the two resulting distributions are compared using a pooled-variance Z-test to give a compositionality-corrected p-value. False discovery rate q-values are additionally calculated using the Benjamin-Hochberg-Yekutieli procedure. For greater detail, see \citet{faust2012microbial} and \citet{schwager2012unpublished}. \subsection{Arguments} \begin{description} \setlength{\itemsep}{1em} \item[\Rcode{x}] First \Rclass{dataframe} or \Rclass{matrix} containing relative abundances. Columns are features, rows are samples. Rows should therefore sum to a constant. Row names are used for identification if present. \item[\Rcode{y}] Second \Rclass{dataframe} or \Rclass{matrix} (optional) containing relative abundances. Columns are features, rows are samples. Rows should therefore sum to a constant. If both \Rcode{x} and \Rcode{y} are specified, they will be merged by row names. If no row names are specified for either or both datasets, the default is to merge by row number. \item[\Rcode{sim.score}] Similarity measure, such as \Rfunction{cor} or \Rfunction{nc.score}. This can be either an existing R function or user-defined. If the latter, certain properties should be satisfied as detailed below (also see examples). The default similarity measure is Spearman correlation. A user-defined similarity measure should mimic the interface of \Rfunction{cor}: \begin{enumerate} \item Take either two \Rclass{vector} inputs one \Rclass{matrix} or \Rclass{dataframe} input. \item In the case of two inputs, return a single number. \item In the case of one input, return a matrix in which the (\Rcode{i},\Rcode{j})th entry is the similarity score for column \Rcode{i} and column \Rcode{j} in the original matrix. \item The resulting matrix (in the case of one input) must be symmetric. \item The inputs must be named \Rcode{x} and \Rcode{y}. \end{enumerate} \item[\Rcode{sim.score.args}] An optional list of arguments for the measurement function. When given, they are passed to the \Rcode{sim.score} function directly. For example, in the case of cor, the following would be acceptable: <>= sim.score.args = list(method="spearman", use="complete.obs") @ Note that this example corresponds to the default behavior. \item[\Rcode{min.subj}] Minimum number (count) of samples that must be nonzero in order to apply the similarity measure. This is to ensure that there are sufficient samples to perform a bootstrap (default: 20). \item[\Rcode{iterations}] The number of iterations for both bootstrap and permutation calculations (default: 1000). \item[\Rcode{subset.cols.x}] Subset of columns from \Rcode{x} to work on. The default (\Rcode{NULL}) uses all columns. Note that all features are used for normalization, but calculations are performed only with the requested subset. \item[\Rcode{subset.cols.y}] Subset of columns from \Rcode{y} to work on. The default (\Rcode{NULL}) uses all columns. Note that all features are used for normalization, but calculations are performed only with the requested subset. \item[\Rcode{errthresh1}] Maximum allowable probability of getting all zeros in a given bootstrapped column for the first dataset \Rcode{x}. If a feature has a number of zeros that makes the probability of obtaining all zeros when sampling with replacement greater than this value, that feature will be excluded from the subsequent analysis. This is to ensure that the standard deviation of the bootstrap sample is non-zero. (default: 0.0001). \item[\Rcode{verbose}] If \Rcode{TRUE}, print periodic progress of the algorithm through the dataset(s), as well as including more detailed debugging output. (default: \Rcode{FALSE}). \item[\Rcode{iterations.gap}] If \Rcode{verbose=TRUE}, the number of iterations between issuing status messages (default: 100). \item[\Rcode{distributions}] Optional output file for detailed log (if given) of all intermediate permutation and renormalization distributions. \item[\Rcode{compare.within.x}] A boolean value indicating whether to do comparisons given by taking all subsets of size 2 from \Rcode{subset.cols.x} or to do comparisons given by taking all possible combinations of \Rcode{subset.cols.x} or \Rcode{subset.cols.y}. If \Rcode{TRUE} but \Rcode{subset.cols.y=NA}, returns all comparisons involving any features in \Rcode{subset.cols.x}. \item[\Rcode{concurrent.output}] Optional output file to which each comparison will be written as it is calculated. \item[\Rcode{make.output.table}] A boolean value indicating whether to include table-formatted output. \end{description} \subsection{Output} \Rcode{ccrepe} returns a \Rclass{list} containing both the calculation results and the parameters used: \begin{description} \setlength{\itemsep}{1em} \item[\Rcode{sim.score}] \Rclass{matrix} of simliarity scores for all requested comparisons. The (\Rcode{i},\Rcode{j})th element corresponds to the similarity score of column \Rcode{i} (or the \Rcode{i}th column of \Rcode{subset.cols.1}) and column \Rcode{j} (or the \Rcode{j}th column of \Rcode{subset.cols.1}) in one dataset, or to the similarity score of column \Rcode{i} (or the \Rcode{i}th column of \Rcode{subset.cols.1}) in dataset \Rcode{x} and column \Rcode{j} (or the \Rcode{j}th column of \Rcode{subset.cols.2}) in dataset \Rcode{y} in the case of two datasets. \item[\Rcode{p.values}] \Rclass{matrix} of the corrected p-values for all requested comparisons. The (\Rcode{i},\Rcode{j})th element corresponds to the p-value of the (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}. \item[\Rcode{q.values}] \Rclass{matrix} of the Benjamini-Hochberg-Yekutieli corrected p-values. The (\Rcode{i},\Rcode{j})th element corresponds to the p-value of the (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}. \item[\Rcode{z.stat}] \Rclass{matrix} of the z-statistics used in generating the p-values for all requested comparisons. The (\Rcode{i},\Rcode{j})th element corresponds to the z-statistic generating the (\Rcode{i},\Rcode{j})th element of \Rcode{p.values}. \end{description} \subsection{Usage} <>= ccrepe( x = NA, y = NA, sim.score = cor, sim.score.args = list(), min.subj = 20, iterations = 1000, subset.cols.x = NULL, subset.cols.y = NULL, errthresh1 = 1e-04, verbose = FALSE, iterations.gap = 100, distributions = NA, compare.within.x = TRUE, concurrent.output = NA, make.output.table = FALSE) @ \subsection{Example 1} An example of how to use \Rfunction{ccrepe} with one dataset. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data[,1] = 2*data[,2] + rnorm(10,0,1) data.rowsum <- apply(data,1,sum) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(c( "Sample 1", "Sample 2","Sample 3","Sample 4","Sample 5", "Sample 6","Sample 7","Sample 8","Sample 9","Sample 10"), c("Feature 1", "Feature 2", "Feature 3","Feature 4")) test.output <- ccrepe(x=test.input, iterations=20, min.subj=10) @ <>= par(mfrow=c(1,2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output @ \subsection{Example 2} An example of how to use \Rfunction{ccrepe} with two datasets. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data[,1] = 2*data[,2] + rnorm(10,0,1) data.rowsum <- apply(data,1,sum) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm data2 <- matrix(rlnorm(105,meanlog=0,sdlog=1),nrow=15,ncol=7) aligned.rows <- c(seq(1,4),seq(6,9),11,12) # The datasets dont need # to have subjects line up exactly data2[aligned.rows,1] <- 2*data[,3] + rnorm(10,0,1) data2.rowsum <- apply(data2,1,sum) data2.norm <- data2/data2.rowsum apply(data2.norm,1,sum) # The rows sum to 1, so the data are normalized test.input.2 <- data2.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) dimnames(test.input.2) <- list(paste("Sample",c(seq(1,4),11,seq(5,8),12,9,10,13,14,15)),paste("Feature",seq(1,7))) test.output.two.datasets <- ccrepe(x=test.input, y=test.input.2, iterations=20, min.subj=10) @ <>= par(mfrow=c(1,2)) plot(data2[aligned.rows,1],data[,3],xlab="dataset 2: Feature 1",ylab="dataset 1: Feature 3",main="Non-normalized") plot(data2.norm[aligned.rows,1],data.norm[,3],xlab="dataset 2: Feature 1",ylab="dataset 1: Feature 3", main="Normalized") test.output.two.datasets @ \subsection{Example 3} An example of how to use \Rfunction{ccrepe} with \Rfunction{nc.score} as the similarity score. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data[,1] = 2*data[,2] + rnorm(10,0,1) data.rowsum <- apply(data,1,sum) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) test.output.nc.score <- ccrepe(x=test.input, sim.score=nc.score, iterations=20, min.subj=10) @ <>= par(mfrow=c(1,2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output.nc.score @ \subsection{Example 4} An example of how to use \Rfunction{ccrepe} with a user-defined \Rfunction{sim.score} function. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data[,1] = 2*data[,2] + rnorm(10,0,1) data.rowsum <- apply(data,1,sum) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) my.test.sim.score <- function(x,y=NA,constant=0.5){ if(is.vector(x) && is.vector(y)) return(constant) if(is.matrix(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x))) if(is.data.frame(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x))) else stop('ERROR') } test.output.sim.score <- ccrepe(x=test.input, sim.score=my.test.sim.score, iterations=20, min.subj=10, sim.score.args = list(constant = 0.6)) @ <>= par(mfrow=c(1,2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output.sim.score @ \subsection{Example 5} An example of how to use \Rfunction{ccrepe} when specifying column subsets. <<<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data.rowsum <- apply(data,1,sum) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) test.output.1.3 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1,3)) test.output.1 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1), compare.within.x=FALSE) test.output.12.3 <- ccrepe(x=test.input, iterations=20, min.subj=10, subset.cols.x=c(1,2),subset.cols.y=c(3), compare.within.x=FALSE) test.output.1.3$sim.score test.output.1$sim.score test.output.12.3$sim.score @ %--------------------------------------------------------- \section{nc.score} %--------------------------------------------------------- The \Rfunction{nc.score} similarity measure is an N-dimensional extension of the checkerboard score particularly suited to similarity score calculations between compositions derived from ecological relative abundance measurements. In such cases, features typically represent species abundances, and the NC-score discretizes these continuous values into one of N bins before computing a normalized similarity of co-occurrence or co-exclusion. This can be used as a standalone function or with \Rfunction{ccrepe} as above to obtain compositionality-corrected p-values. \subsection{General Functionality} The NC-score is an extension to Diamond's checkerboard score (see \citet{cody1975ecology}) to ordinal data. The generalization of the checkerboard score is through defining general co-variation and co-exclusion patterns in ordinal data. With ordinal data, the checkerboard considers the $2\times 2$ submatrices which are of the form: \[ \left[ \begin{array}{cc} a & b\\ c & d\\ \end{array} \right] \] In the cases that $a < b$ and $c < d$ (or $a > b$ and $c > d$), the submatrix is considered a co-variation pattern; conversely, if $a > b$ and $c < d$ (or $a < b$ and $c > d$) then the submatrix is considered a co-exclusion pattern. The number of co-exclusion patterns for $n$ bins is adjusted based on the expected ratio of co-variaion to co-exclusion patterns in any given sample. The adjustment factor is given by: \[ 1.5 \ast \frac{n(n-1)}{n^2 - n + 1}, \] with the complete derivation being left to \citet{bielski2013thesis}. \noindent The function as implemented here first performs basic quality control filtering of input relative abundance data. It then transforms relative abundances to ordinal values based on user-provided or default bin thresholds. It then computes a raw NC-score: \[ C_{ij} - 1.5 \ast \frac{n(n-1)}{n^2 - n + 1} \ast D_{ij} \] where $C_{ij}$ and $D_{ij}$ are the total number of co-variation and co-exclusion patterns, respectively, between species $i$ and $j$ and where $n$ is the number of bins. \noindent The raw NC-score is then normalized to between -1 and 1 for $n$ bins and $s$ samples by dividing by the maximum possible NC-score, which is based on the maximum number of co-variation patterns possible. The normalized NC-score is then analogous to Pearsons $\rho$, where positive values indicate more co-variation than co-exclusion patterns and the magnitude of the score indicates the strength of the association between the species. The derivation of this normalization factor can be found in \citet{bielski2013thesis}. The normalization factor is given by: \[ \left(\begin{array}{c} s\\ 2\\ \end{array} \right) - \left[ s \bmod n \ast \left(\begin{array}{c} \left \lfloor{\frac{s}{n}} \right \rfloor + 1 \\ 2 \\ \end{array} \right) + \big(n - s \bmod n \big) \ast \left( \begin{array}{c} \left \lfloor{\frac{s}{n}} \right \rfloor \\ 2 \\ \end{array} \right)\right]. \] \subsection{Arguments} \begin{description} \setlength{\itemsep}{1em} \item[\Rcode{x}] First numerical \Rclass{vector}, or single \Rclass{dataframe} or \Rclass{matrix}, containing relative abundances. If the latter, columns are features, rows are samples. Rows should therefore sum to a constant. \item[\Rcode{y}] If provided, second numerical \Rclass{vector} containing relative abundances. If given, \Rcode{x} must be a \Rclass{vector} as well. \item[\Rcode{bins}] Either the number of bins to use or a \Rclass{vector} specifying bin edges. If a single number is given, this is used as the number of bins with the \Rfunction{discretize} function of the package \CRANpkg{infotheo}. If a \Rclass{vector} is specified, the function \Rfunction{findInterval} is used to discretize the data. The default behavior is to use the defaults for the \Rfunction{discretize} function. \item[\Rcode{verbose}] Request verbose output. \item[\Rcode{min.abundance}] Minimum abundance threshold for quality control filtering. For a feature to be included, it must take a value of at least \Rcode{min.abundance} in at least \Rcode{min.samples} percent of samples. \item[\Rcode{min.samples}] Minimum sample threshold for quality control filtering. For a feature to be included, it must take a value of at least \Rcode{min.abundance} in at least \Rcode{min.samples} percent of samples. \end{description} \subsection{Output} \Rfunction{nc.score} returns either a single number (if called with two vectors) or a \Rclass{matrix} of all pairwise scores (if called with a \Rclass{matrix}) of normalized scores. \subsection{Usage} <>= nc.score( x = NA, y = NA, bins = NA, verbose = FALSE, min.abundance = 1e-04, min.samples = 0.1) @ \subsection{Example 1} An example of using \Rfunction{nc.score} to get a single similarity score or a matrix. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data.rowsum <- apply(data,1,sum) data[,1] = 2*data[,2] + rnorm(10,0,1) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) test.output.matrix <- nc.score(x=test.input) test.output.num <- nc.score(x=test.input[,1],y=test.input[,2]) @ <>= par(mfrow=c(1, 2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output.matrix test.output.num @ \subsection{Example 2} An example of using \Rfunction{nc.score} with an aribitrary bin number. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data.rowsum <- apply(data,1,sum) data[,1] = 2*data[,2] + rnorm(10,0,1) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) test.output <- nc.score(x=test.input,bins=2) @ <>= par(mfrow=c(1, 2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output @ \subsection{Example 3} An example of using \Rfunction{nc.score} with user-defined bin edges. <<>>= data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4) data.rowsum <- apply(data,1,sum) data[,1] = 2*data[,2] + rnorm(10,0,1) data.norm <- data/data.rowsum apply(data.norm,1,sum) # The rows sum to 1, so the data are normalized test.input <- data.norm dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4))) test.output <- nc.score(x=test.input,bins = c(0.001,0.1,0.25,0.6)) @ <>= par(mfrow=c(1, 2)) plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized") plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2", main="Normalized") test.output @ %--------------------------------------------------------- \section{References} %--------------------------------------------------------- \bibliographystyle{plainnat} \bibliography{ccrepe} \end{document}