% modification on git from copied files

%\VignetteIndexEntry{Advanced baySeq analyses}
%\VignettePackage{baySeq}
%\VignetteKeywords{baySeq, generic, advanced}

\documentclass[a4paper]{article}

\title{Advanced analysis using baySeq; generic distribution definitions}
\author{Thomas J. Hardcastle}

<<echo = FALSE, results = tex>>=
BiocStyle::latex()
@

\begin{document}

\maketitle

\section{Generic Prior Distributions}

\verb'baySeq' now offers complete user-specification of underlying distributions. This vignette describes the use of \verb'baySeq' under this protocol. Familiarity with standard (negative-binomial) \verb'baySeq' is assumed; please consult the other vignettes for a description of this approach.

Analysis is carried out through specification of a \verb'densityFunction' class. The primary value in a \verb'densityFunction' object is the \verb'@density' slot, a user-defined function that should take variables \verb'x', \verb'observables' and \verb'parameters'. \verb'x' corresponds to a row of data in a \verb'countData' object. \verb'observables' is a list object of observed values that may influence the model. By default, the \verb'@libsizes' and \verb'@seglens' values of the \verb'countData' object will be internally appended to this list, unless objects with these names are otherwise specified by the user. \verb'parameters' is a list object of parameters to be empirically estimated from the data with the \verb'getPriors' function and used to estimate likelihoods with the \verb'getLikelihoods' function. The \verb'@density' function should return a vector of log-likelihood values (or NA for invalid parameter choices) of the same length as the input variable \verb'x'.

Other required slots of the \verb'densityFunction' object are \verb'initiatingValues', a vector of initiating values to be used in optimisation of the parameters used by the \verb'@density' function (and thus defining the length of the parameter object), and \verb'equalOverReplicates', a specification of which parameters are fixed for every replicate group and which may vary for different experimental conditions. If only one parameter is variable over experimental conditions, the Nelder-Mead optimisation used may be unstable, and one-dimensional optimisation with user-defined functions specifying lower and upper bounds may (optionally) be provided; otherwise, Nelder-Mead will be attempted.

Optionally, a function may be provided in \verb'@stratifyFunction' to stratify the data and improve prior estimation in the tails where the \verb'samplesize' argument in the \verb'getPriors' function is less than the row dimension of the \verb'countData' object. If this is provided, the \verb'@stratifyBreaks' slot should give the number of strata to be used.

Below, a model is constructed based on the normal distribution. The standard deviation is assumed to be constant for a given row of data across all experimental conditions, while the means (normalised by library scaling factor) are allowed to vary across experimental conditions.

<<results = hide>>=
set.seed(102)
options(width = 90)
@

If parallelisation is available, it is useful to use it.
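The size of the cluster will depend on the machine available; as a rough sketch, the number of cores could be chosen with the \verb'detectCores' function from the \verb'parallel' package rather than the fixed value used below.

<<eval = FALSE>>=
library(parallel)
# choose a cluster size based on the machine, leaving one core free
cl <- makeCluster(max(1, detectCores() - 1))
@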
<<>>=
if(require("parallel")) cl <- makeCluster(8) else cl <- NULL
@

<<>>=
library(baySeq)

normDensityFunction <- function(x, observables, parameters) {
  if(any(sapply(parameters, function(x) any(x < 0)))) return(rep(NA, length(x)))
  dnorm(x, mean = parameters[[2]] * observables$libsizes, sd = parameters[[1]], log = TRUE)
}

normDensity <- new("densityFunction",
                   density = normDensityFunction,
                   initiatingValues = c(0.1, 1),
                   equalOverReplicates = c(TRUE, FALSE),
                   lower = function(x) 0, upper = function(x) 1 + max(x) * 2,
                   stratifyFunction = rowMeans, stratifyBreaks = 10)
@

We construct the \verb'countData' object as before.

<<>>=
data(simData)
CD <- new("countData", data = simData,
          replicates = c("simA", "simA", "simA", "simA", "simA",
                         "simB", "simB", "simB", "simB", "simB"),
          groups = list(NDE = c(1,1,1,1,1,1,1,1,1,1),
                        DE = c(1,1,1,1,1,2,2,2,2,2)))
libsizes(CD) <- getLibsizes(CD)
densityFunction(CD) <- normDensity
@

We can then fit priors and calculate posterior likelihoods based on our specified distributional model. The distributional model is specified in the \verb'getPriors' function and will automatically be used in the \verb'getLikelihoods' function.

<<>>=
normCD <- getPriors(CD, cl = cl)
normCD <- getLikelihoods(normCD, cl = cl)
@

Similarly, we can construct a generic version of the negative-binomial model.

<<>>=
nbinomDensityFunction <- function(x, observables, parameters) {
  if(any(sapply(parameters, function(x) any(x < 0)))) return(NA)
  dnbinom(x, mu = parameters[[1]] * observables$libsizes * observables$seglens,
          size = 1 / parameters[[2]], log = TRUE)
}

densityFunction(CD) <- new("densityFunction",
                           density = nbinomDensityFunction,
                           initiatingValues = c(0.1, 1),
                           equalOverReplicates = c(FALSE, TRUE),
                           lower = function(x) 0, upper = function(x) 1 + max(x) * 2,
                           stratifyFunction = rowMeans, stratifyBreaks = 10)

nbCD <- getPriors(CD, cl = cl)
nbCD <- getLikelihoods(nbCD, cl = cl)
@

We can compare this to the standard analysis of these data.

<<>>=
CD <- getPriors.NB(CD, cl = cl)
CD <- getLikelihoods(CD, cl = cl)
@

<<compPostLikes, fig = FALSE>>=
plot(exp(CD@posteriors[,2]), exp(nbCD@posteriors[,2]),
     ylab = "Standard baySeq", xlab = "Generic (NB-distribution) baySeq")
@

<<compROC, fig = FALSE>>=
TPs <- cumsum(order(CD@posteriors[,2], decreasing = TRUE) %in% 1:100); FPs <- 1:1000 - TPs
nbTPs <- cumsum(order(nbCD@posteriors[,2], decreasing = TRUE) %in% 1:100); nbFPs <- 1:1000 - nbTPs
plot(x = FPs, y = TPs, type = "l")
lines(x = nbFPs, y = nbTPs, col = "red")
legend(x = "bottomright",
       legend = c("standard baySeq", "Generic (NB-distribution) baySeq"),
       lty = 1, col = c("black", "red"))
@

\begin{figure}[!ht]
\begin{center}
<<figCompPostLikes, fig = TRUE, echo = FALSE>>=
<<compPostLikes>>
@
\caption{Likelihoods of DE estimated by standard/generic baySeq.}
\label{figCompPostLikes}
\end{center}
\end{figure}

\begin{figure}[!ht]
\begin{center}
<<figCompROC, fig = TRUE, echo = FALSE>>=
<<compROC>>
@
\caption{ROC curves estimated by standard/generic baySeq.}
\label{figCompROC}
\end{center}
\end{figure}

The generic negative-binomial analysis performs almost identically to standard baySeq. The methods differ in that standard baySeq uses quasi-maximum-likelihood to estimate the priors, while generic baySeq uses maximum-likelihood (since no generic method exists for quasi-maximum-likelihood on arbitrary distributions).
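As in the standard analyses, the top candidates for differential expression under the generic negative-binomial model can be examined with the \verb'topCounts' function; a minimal example (not evaluated here) is shown below.

<<eval = FALSE>>=
# top candidates for differential expression (group 2, 'DE') under the generic NB model
topCounts(nbCD, group = 2)
@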
\section{Paired Data Analysis}

We illustrate the possibilities of `null' data, in which two separate models are applied to data equivalently expressed across all samples. The process for analysing paired data follows approximately the same steps as for analysing unpaired data; however, two different types of differential expression can exist within paired data. Firstly, we can find differential expression between replicate groups, as before. However, we can also find (consistent) differential expression between pairs; this would occur when, for a single row of data, the first member of each pair differs from the second member of each pair. \verb'baySeq' can identify both these types of differential expression simultaneously, and we implement this procedure below.

We begin by loading a simulated dataset containing counts for four paired datasets.

<<>>=
data(pairData)
@

The first four columns in these data are paired with the second four columns. We construct a \verb'countData' object containing the paired data in a similar fashion to before. Note that the data are now three dimensional; for each row and each sample there are two observations.

<<>>=
pairCD <- new("countData",
              data = array(c(pairData[,1:4], pairData[,5:8]), dim = c(nrow(pairData), 4, 2)),
              replicates = c(1,1,2,2),
              groups = list(NDE = c(1,1,1,1), DE = c(1,1,2,2)),
              densityFunction = bbDensity)
@

We can find the library sizes for the data with the \verb'getLibsizes' function.

<<>>=
libsizes(pairCD) <- getLibsizes(pairCD)
@

We estimate an empirical distribution on the parameters of a beta-binomial distribution by bootstrapping from the data, taking individual counts and finding the maximum-likelihood parameters for a beta-binomial distribution. By taking a sufficiently large sample, an empirical distribution on the parameters is estimated. A sample size of around 10000 iterations is suggested (though this will depend on the data being used); 1000 is used here to rapidly generate the plots and tables.

<<>>=
pairCD <- getPriors(pairCD, samplesize = 1000, cl = cl)
@

We then acquire posterior likelihoods as before. The use of \verb'nullData = TRUE' in this context allows us to identify pairs which show no differential expression between replicate groups, but do show deviation from a one-to-one ratio of data between pairs.

<<>>=
pairCD <- getLikelihoods(pairCD, pET = 'BIC', nullData = TRUE, cl = cl)
@

We can ask for the top candidates for differential expression between replicate groups using the \verb'topCounts' function as before.

<<>>=
topCounts(pairCD, group = 2)
@

However, we can also look for consistent differential expression between the pairs.

<<>>=
topCounts(pairCD, group = 1)
@

\section{Different Model Priors}

It is now possible to use different model priors for different subsets of the \verb'countData' object. If we expect a certain class of genes (for example) to have a different prior likelihood towards differential expression than another such class, we can separate the two sets and estimate (or set) the model priors independently.

Let us suppose that we have reason to believe that the first hundred genes in the \verb'nbCD' object are likely to behave differently to the remaining genes. Then

<<>>=
CDv <- getLikelihoods(nbCD, modelPriorSets = list(A = 1:100, B = 101:1000), cl = cl)
@
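In practice, such subsets would more usually be derived from annotation than specified by hand. A minimal sketch, assuming a hypothetical factor \verb'geneClass' that assigns each row of \verb'nbCD' to one of two classes:

<<eval = FALSE>>=
# hypothetical annotation: a factor assigning each gene (row) to a class
geneClass <- factor(rep(c("classA", "classB"), times = c(100, 900)))
# split the row indices by class to define the model prior sets
CDv <- getLikelihoods(nbCD, modelPriorSets = split(seq_len(nrow(nbCD)), geneClass),
                      cl = cl)
@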
The model priors used are recorded in the \verb'@priorModels' slot.

<<>>=
CDv@priorModels
@

We can see the difference in performance by computing the ROC curves as before. Using different model priors can substantially improve performance, although obviously we have cheated here by splitting exactly those data simulated as DE and those simulated as non-DE. It should also be recognised that this approach may bias downstream analyses; for example, GO enrichment analysis.

<<modelPriorROC, fig = FALSE>>=
TPs <- cumsum(order(CD@posteriors[,2], decreasing = TRUE) %in% 1:100); FPs <- 1:1000 - TPs
nbTPs <- cumsum(order(nbCD@posteriors[,2], decreasing = TRUE) %in% 1:100); nbFPs <- 1:1000 - nbTPs
vTPs <- cumsum(order(CDv@posteriors[,2], decreasing = TRUE) %in% 1:100); vFPs <- 1:1000 - vTPs
plot(x = FPs, y = TPs, type = "l")
lines(x = nbFPs, y = nbTPs, col = "red")
lines(x = vFPs, y = vTPs, col = "blue")
legend(x = "bottomright",
       legend = c("standard", "Generic (NB-distribution)", "Variable model priors"),
       lty = 1, col = c("black", "red", "blue"))
@

\begin{figure}[!ht]
\begin{center}
<<figModelPriorROC, fig = TRUE, echo = FALSE>>=
<<modelPriorROC>>
@
\caption{ROC curves estimated by standard/generic/variable model priors baySeq.}
\label{figModelPriorROC}
\end{center}
\end{figure}

Several pre-existing distributions are built into baySeq. Here we use a pre-developed zero-inflated negative-binomial distribution to analyse zero-inflated data.

<<>>=
data(zimData)
CD <- new("countData", data = zimData,
          replicates = c("simA", "simA", "simA", "simA", "simA",
                         "simB", "simB", "simB", "simB", "simB"),
          groups = list(NDE = c(1,1,1,1,1,1,1,1,1,1),
                        DE = c(1,1,1,1,1,2,2,2,2,2)))
libsizes(CD) <- getLibsizes(CD)
densityFunction(CD) <- nbinomDensity

CD <- getPriors(CD, cl = cl)
CD <- getLikelihoods(CD, cl = cl)

CDz <- CD
densityFunction(CDz) <- ZINBDensity
CDz <- getPriors(CDz, cl = cl)
CDz <- getLikelihoods(CDz, cl = cl)
@

Finally, we shut down the cluster (assuming it was started to begin with).

<<>>=
if(!is.null(cl)) stopCluster(cl)
@

\section*{Session Info}

<<>>=
sessionInfo()
@

\end{document}

# multinomial
<<eval = FALSE>>=
data(pairData)
multCD <- new("countData",
              data = list(pairData[,1:4], pairData[,5:8],
                          matrix(round(abs(rnorm(n = prod(dim(pairData[,5:8])), mean = pairData[,5:8] * 4, sd = 3))), ncol = 4),
                          matrix(round(abs(rnorm(n = prod(dim(pairData[,5:8])), mean = pairData[,5:8] * 20, sd = 3))), ncol = 4),
                          matrix(round(abs(rnorm(n = prod(dim(pairData[,5:8])), mean = pairData[,5:8] * 10, sd = 3))), ncol = 4)),
              replicates = c(1,1,2,2),
              groups = list(NDE = c(1,1,1,1), DE = c(1,1,2,2)))
libsizes(multCD) <- matrix(round(runif(4*5, 30000, 90000)), nrow = 4)

mdDensity@initiatingValues <- c(0.01, rep(1/dim(multCD@data)[3], dim(multCD@data)[3] - 1))
mdDensity@equalOverReplicates <- c(TRUE, rep(FALSE, dim(multCD@data)[3] - 1))
densityFunction(multCD) <- mdDensity

multCD <- getPriors(multCD, samplesize = 1000, cl = cl)
multCD <- getLikelihoods(multCD, subset = 1:1000, cl = cl)

TPs <- cumsum(order(multCD@posteriors[,2], decreasing = TRUE) %in% 1:100)
FPs <- 1:nrow(multCD) - TPs
plot(FPs / max(FPs), TPs / max(TPs))
@