%------------------------------------------------------------ \section{Processing of the main data set}\label{sec:main} \subsection{Preliminaries} %------------------------------------------------------------ Load the \Bioconductor{}-package \Biocexptpkg{HD2013SGI}. <>= library("HD2013SGI") @ Load data and screen annotation in workspace. <>= data("datamatrixfull",package="HD2013SGI") D = datamatrixfull$D data("TargetAnnotation",package="HD2013SGI") data("QueryAnnotation",package="HD2013SGI") @ Output directories are created where the data, figures and tables will be placed in. <>= dir.create(file.path("result","data"),recursive=TRUE,showWarnings=FALSE) dir.create(file.path("result","Figures"),recursive=TRUE,showWarnings=FALSE) dir.create(file.path("result","Tables"),recursive=TRUE,showWarnings=FALSE) @ Get indices for different sets of experiments (samples, controls, ...). <>= IndexSAMPLE = which(TargetAnnotation$group == "sample") IndexCTRL = which(TargetAnnotation$group == "SingleKDctrl") IndexNEG = which(TargetAnnotation$group == "negctrl") IndexSAMPLENEG = c(IndexSAMPLE, IndexNEG) @ Set the color value for controls. <>= colCTRL = rep("gray", nrow(TargetAnnotation)) colCTRL[TargetAnnotation$group == "SingleKDctrl"] = "royalblue" colCTRL[TargetAnnotation$group == "negctrl"] = "red" @ \subsection{Transform features and screen normalization} Previous genetic interaction screens that were based on a quantitative cell viability phenotype were analysed using a multiplicative interaction model \cite{mani2008defining,schuldiner2005exploration,costanzo2010genetic,% baryshnikova2010quantitative,bandyopadhyay2010rewiring}. For model fitting, it is then convenient to transform the data to a logarithmic scale. Motivated by \cite{horn2011mapping,axelsson2011extracting}, we adapted this approach to all features considered here. Since some features had a non-positive range of values, instead of the logarithm we applied a generalized logarithm transform~\cite{huber2002variance} \begin{equation}\label{eq:glog} f(x; c) = \log\left(\frac{x+\sqrt{x^2+c^2}}{2}\right). \end{equation} <>= glog <- function(x,c) { log2((x+sqrt(x^2+c^2))/2) } @ This family of functions has one parameter $c$. For $c=0$, the function is equivalent to an ordinary logarithm transformation. For $c>0$, the function is smooth for all values of $x$ (including 0 and negative values), avoiding the singularity of the ordinary logarithm at $x=0$, but still approximately equivalent to the ordinary logarithm for $x\gg c$ as shown in the following plot. <>= px = seq(-1.5, 9, length.out=200) trsf = list( log = function(x) log(ifelse(x>0, x, NA_real_)), glog = function(x, c=1) log( (x+sqrt(x^2+c^2))/2 )) colores = c("#202020", "RoyalBlue") pdf(file.path("result","Figures","supplementaryInformation-glog.pdf"),width=4.2,height=4.2) matplot(px, sapply(trsf, do.call, list(px)), type="l", lty=c(2,1), col=colores, lwd=2.5, ylab="f(x)", xlab="x") legend("bottomright", fill=colores, legend=names(trsf)) dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/supplementaryInformation-glog.pdf} \end{center} For each feature, we chose $c$ to be the $3\%$-quantile of the feature's empirical distribution. <>= for (i in seq_len(dim(D)[5])) { m = quantile(D[,,,,i,],probs=0.03,na.rm=TRUE) D[,,,,i,] = glog(D[,,,,i,],m) } @ After transformation, to take account of the plate and batch effects and differences in efficiency of the siRNA transfection, an additive normalization is performed per plate. The plate median is set to be equal for all siRNA designs and replicates for each query gene and phenotypic feature. This compensates for global differences between replicates and siRNA designs. <>= M = apply(D,c(2:6),median,na.rm=TRUE) M2 = apply(M, c(2,4), mean) M2 = rep(M2, times=8) dim(M2) = dim(M)[c(2,4,1,3,5)] M2 = aperm(M2,c(3,1,4,2,5)) M = M - M2 M = rep(M[], each=dim(D)[1]) dim(M) = dim(D) D = D - M @ After transformation, to take account of the fact that the data range of the different features was different, data were centered and scaled separately for each feature. Center and scale were computed as the median and median absolute deviation, respectively. <>= for (i in seq_len(dim(D)[5])) { D[,,,,i,] = (D[,,,,i,] - median(D[,,,,i,],na.rm=TRUE)) / mad(D[,,,,i,],na.rm=TRUE) } @ In the following, we will refer to the array of transformed, centered and scaled values as $D_{ijklmr}$ with the indices $i$, $j$, $k$, $l$, $m$,$r$ counting over the 6 dimensions of the data cube with extensions\\ \begin{center} \begin{tabular}{lrrl} & \Sexpr{dim(D)[1]} & target genes ($i$)\\ $\times$ & \Sexpr{dim(D)[2]} & siRNA target design ($j$)\\ $\times$ & \Sexpr{dim(D)[3]} & query genes ($k$)\\ $\times$ & \Sexpr{dim(D)[4]} & siRNA query designs ($l$)\\ $\times$ & \Sexpr{dim(D)[5]} & phenotypic features ($m$)\\ $\times$ & \Sexpr{dim(D)[6]} & biological replicates ($r$) \end{tabular} \end{center} \subsection{Quality control of features}\label{QControlFeatures} To control for the quality of each feature, its reproducibility over replicate measurements was assessed. For each feature $f$, we computed the two vectors $\mathbf{v}_{f}^{1}$ and $\mathbf{v}_{f}^{2}$, \begin{equation} \mathbf{v}_{f}^{l} = D_{\cdot\,-\,\cdot\,-\,f\,l} \quad\quad\mbox{for }l=1,2. \end{equation} Here, the notation $-$ indicates averaging over an index, and the notation $\cdot$ indicates extraction of the whole subspace spanned by this index. Here, we used only the \Sexpr{length(IndexSAMPLE)} sample genes, since the set of target genes contained a number of negative and positive controls. Thus, $\mathbf{v}_{m}^{l}$ is a vector with \Sexpr{length(IndexSAMPLE)} $\times$ \Sexpr{dim(D)[3]} $=$ \Sexpr{length(IndexSAMPLE)*(dim(D)[3])} elements. We then computed the correlation coefficient $\rho_{f}$ between $\mathbf{v}_{f}^{1}$ and $\mathbf{v}_{f}^{2}$. <>= C = rep(NA_real_,dim(D)[5]) D2 = (D[,1,,1,,] + D[,2,,1,,] + D[,1,,2,,] + D[,2,,2,,]) / 4 for (i in seq_len(dim(D)[5])) { C[i] = cor(as.vector(D2[IndexSAMPLE,,i,1]),as.vector(D2[IndexSAMPLE,,i,2])) } @ % We plot the correlation coefficients for all \Sexpr{length(C)} features. <>= pdf(file=file.path("result","Figures","QCreproducibilityOfFeatures.pdf")) @ <>= plot(sort(C, decreasing=TRUE), pch=20, xlab="features", ylab="correlation", ylim=c(0,1), cex.lab=1.75, cex.axis=1.5) abline(h=0.6) @ <>= dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/QCreproducibilityOfFeatures.pdf} \end{center} % As an example, the scatter plot of eccentricity measurements is plotted for two replicates. <>= pdf(file=file.path("result","Figures","QCscatterFeaturesBetweenReplicates.pdf")) @ <>= i = "cell.Bact.m.eccentricity" plot(as.vector(D2[IndexSAMPLE,,i,1]), as.vector(D2[IndexSAMPLE,,i,2]), pch=20, cex.lab=1.75, cex.axis=1.5, xlab="eccentricity replicate 1", ylab="eccentricity replicate 2") cc = cor(as.vector(D2[IndexSAMPLE,,i,1]), as.vector(D2[IndexSAMPLE,,i,2])) text(x=2, y=-4, sprintf("cor = %0.2f",cc), cex=1.75) @ <>= dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/QCscatterFeaturesBetweenReplicates.pdf} \end{center} <>= I = order(-C) write.table(data.frame(name=dimnames(D)[[5]][I],cor=C[I]), file=file.path("result","Tables","QC_reproducibilityOfFeatures.txt"), sep="\t",quote=FALSE,row.names=FALSE) @ We selected features with a correlation of at least 0.6 for subsequent analysis. \Sexpr{sum(C >= 0.6, na.rm=TRUE)} out of \Sexpr{dim(D)[5]} features passed quality control. <>= I = which(C >= 0.6) D = D[,,,,I,,drop=FALSE] dim(D) @ % The dimension of the data cube after quality control of the phenotypic features is\\ \begin{center} \begin{tabular}{lrrl} & \Sexpr{dim(D)[1]} & target genes \\ $\times$ & \Sexpr{dim(D)[2]} & siRNA target designs \\ $\times$ & \Sexpr{dim(D)[3]} & query genes \\ $\times$ & \Sexpr{dim(D)[4]} & siRNA query designs \\ $\times$ & \Sexpr{dim(D)[5]} & phenotypic features\\ $\times$ & \Sexpr{dim(D)[6]} & biological replicates \end{tabular} \end{center} \subsection{Quality control of siRNA designs}\label{QControlGenes} To detect cases where the siRNA reagents for our target genes had off-target effects, we compared the phenotypic profiles of the two siRNA designs for each target gene. To this end, we computed the vectors \begin{equation} \mathbf{w}_{if}^{j} = D_{ij\,\cdot\, -\, f\, -} \end{equation} using only those features that passed the quality filter described in Section~\ref{QControlFeatures}. We then computed the correlation $\tilde{\rho'}_{if}$ between $\mathbf{w}_{if}^{1}$ and $\mathbf{w}_{if}^{2}$. The congruence score \begin{equation} \tilde{\rho}_{i}=\tilde{\rho'}_{i\,-} \end{equation} is the mean over the correlation coefficients of each feature. <>= D1 = (D[,,,1,,1] + D[,,,1,,2] + D[,,,2,,1] + D[,,,2,,2])/4 Cdesign1 = rep(NA_real_,dim(D)[1]) for (k in seq_len(dim(D)[5])) { for (i in seq_len(dim(D)[1])) { Cdesign1[i] = cor(as.vector(D1[i,1,,k]),as.vector(D1[i,2,,k])) } if ( k == 1) { Cdesign1all = Cdesign1 } else { Cdesign1all = Cdesign1all + Cdesign1 } } Cdesign1all = Cdesign1all / dim(D)[5] @ The congruence scores for all target genes are plotted in decreasing order. <>= pdf(file.path("result","Figures","QCreproducibilityOfTargetsiRNA.pdf")) @ <>= plot(sort(Cdesign1all[IndexSAMPLE], decreasing=TRUE), pch=20, cex.lab=1.75, cex.axis=1.5, ylim=c(0,1), xlab="target genes",ylab="correlation") abline(h=0.7) @ <>= dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/QCreproducibilityOfTargetsiRNA.pdf} \end{center} CHAF1A is an example gene with a low congruence score. Exemplarily a barchart for the cell eccentricity for the double knock-downs of both siRNA designs of CHAF1A together with siRNA design 1 of DPF2 is plotted. <>= pdf(file.path("result","Figures","QCreproducibilityOfTargetsiRNAexample.pdf"),width=2,height=4) par(xpd=NA,mar=c(6, 4, 2, 2) + 0.1) @ <>= f = "cell.Bact.m.eccentricity" D1 = D[ TargetAnnotation$Symbol == "CHAF1A",,QueryAnnotation$Symbol == "DPF2",1,f,] bp = barplot(apply(D1,1,mean),ylab="Cell eccentricity", names.arg=c("CHAF1A #1\nDPF2 #1","CHAF1A #2\nDPF2 #1"),las=2) points(bp,D1[,1]) points(bp,D1[,2]) @ <>= dev.off() @ \begin{center} \includegraphics[width=0.25\textwidth]{result/Figures/QCreproducibilityOfTargetsiRNAexample.pdf} \end{center} Target gene $i$ was selected for subsequent analysis if $\tilde{\rho}_i\ge0.7$. The number of target genes passing quality control was \Sexpr{sum(Cdesign1all[IndexSAMPLE] >= 0.7,na.rm=TRUE)} out of \Sexpr{length(IndexSAMPLE)}. Additionally, \Sexpr{length(IndexNEG)} negative control target siRNAs were selected. <>= I = IndexSAMPLE[Cdesign1all[IndexSAMPLE] >= 0.7] I = c(I, IndexNEG) D = D[I,,,,,,drop=FALSE] TargetAnnotation = TargetAnnotation[I,] dim(D) @ % The dimension of the data cube \Robject{D} after quality control of the siRNA designs is \begin{center} \begin{tabular}{lrrl} & \Sexpr{dim(D)[1]} & target genes \\ $\times$ & \Sexpr{dim(D)[2]} & siRNA target designs \\ $\times$ & \Sexpr{dim(D)[3]} & query genes \\ $\times$ & \Sexpr{dim(D)[4]} & siRNA query designs \\ $\times$ & \Sexpr{dim(D)[5]} & phenotypic features\\ $\times$ & \Sexpr{dim(D)[6]} & biological replicates \end{tabular} \end{center} %------------------------------------------------------------ \subsection{Selection of non-redundant features}\label{featureSelection} %------------------------------------------------------------ The \Sexpr{dim(D)[5]} features are highly redundant. We selected a set of non-redundant features. To do so, in the following the phenotypic profiles are summarized over all siRNA designs, and control measurements are excluded. % <>= D1 = (D[,1,,1,,] + D[,1,,2,,] + D[,2,,1,,] + D[,2,,2,,]) / 4 D1 = D1[TargetAnnotation$group == "sample",,,] @ % 3000 perturbations are randomly selected to speed up the selection process. % <>= dim(D1) = c(prod(dim(D1)[1:2]),dim(D1)[3:4]) D1 = aperm(D1,c(1,3,2)) set.seed(5830458) Sample = sample(seq_len(dim(D1)[1]), 3000) subSampleForStabilitySelection = list(D = D1[Sample,,], Sample = Sample, phenotype = dimnames(D)[[5]]) @ % Next, the feature selection was performed. The feature \emph{number of cells} was manually preselected because of its biological interest. All other features were selected automatically in a sequential way. In each step, all remaining candidate features were evaluated separately. Let $F(i)$ be the features that were selected after step $i$. In step $i$, for all candidate features $f$ a linear regression was fit on the previously selected features. \begin{equation} D_{\cdot\,-\,\cdot\,-\,f\,\cdot} \sim D_{\cdot\,-\,\cdot\,-\,F(i-1)\,\cdot} \end{equation} Let $r_{\cdot\,\cdot\,f\,\cdot}$ be the residuals of this fit, i.\,e., the component of $D_{\cdot\,-\,\cdot\,-\,f\,\cdot}$ that is orthogonal to the linear subspace spanned by $D_{\cdot\,-\,\cdot\,-\,F(i-1)\,\cdot}$. The residuals are a composition of random noise and biological information that is non-redundant to the previously selected features. To estimate if there still exists unexplained biological signal, the Pearson correlation coefficient $\bar{\rho}_{f}$ of the residual vectors $r_{\cdot\,\cdot\,f\,1}$ and $r_{\cdot\,\cdot\,f\,2}$ between the two biological replicates was computed. It can be considered a proxy for the signal-to-noise ratio of these residuals. The feature with the largest correlation coefficient $\bar{\rho}$ was selected. % <>= stabilitySelection = HD2013SGIselectByStability(subSampleForStabilitySelection, preselect = c("count"), Rdim = 25, verbose = TRUE) @ The step-wise approach was stopped when the number of positive correlation coefficients was smaller than or equal to the number of negative correlation coefficients, \begin{equation} \sum_f\left(\Theta\left(\bar{\rho}_f\right)\right) \leq \sum_f\left(\Theta\left(-\bar{\rho}_f\right)\right)\,, \end{equation} where $\Theta\left(x\right)$ is the Heaviside function. This criterion is motivated by the observation that for random data, half of the correlation coefficients is expected to be positive and the other half is expected to be negative. <>= Sel = (stabilitySelection$ratioPositive >= 0.5) sum(Sel) @ % Thus, a set of \Sexpr{sum(Sel)} features was considered to contain non-redundant information. The barplot below shows the correlation coefficients of the residual features, which was considered the proxy for information content. % <>= col = brewer.pal(3,"Pastel1")[1:2] pdf(file.path("result","Figures","FeatureSelectionInformationgain.pdf")) par(mar=c(13,4,1,1)) @ <>= barplot(stabilitySelection$correlation, names.arg=HD2013SGI:::humanReadableNames[stabilitySelection$selected], col=ifelse(Sel, col[2], col[1]), ylim=c(0,1), las=2, ylab="information gain") @ <>= dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/FeatureSelectionInformationgain.pdf} \end{center} % Now we plot the fraction of positive correlation coefficients, which served as a stop criterion. % <>= pdf(file.path("result","Figures","FeatureSelectionRatioPositive.pdf")) par(mar=c(13,4,1,1)) @ <>= barplot(stabilitySelection$ratioPositive-0.5, names.arg=HD2013SGI:::humanReadableNames[stabilitySelection$selected], offset=0.5,col=ifelse(Sel, col[2], col[1]),ylim=c(0,1), las=2,ylab="fraction positive correlated") @ <>= dev.off() @ \begin{center} \includegraphics[width=0.5\textwidth]{result/Figures/FeatureSelectionRatioPositive.pdf} \end{center} % The criteria for feature selection are saved and the features are selected. <>= save(stabilitySelection, file=file.path("result","data", "stabilitySelection.rda")) D = D[,,,,stabilitySelection$selected[Sel],,drop=FALSE] dimnames(D)[[1]] = TargetAnnotation$Symbol dimnames(D)[[3]] = QueryAnnotation$Symbol dim(D) @ % After selecting \Sexpr{sum(Sel)} non-redundant features, the dimension of the data cube is\\ \begin{center} \begin{tabular}{lrrl} & \Sexpr{dim(D)[1]} & target genes \\ $\times$ & \Sexpr{dim(D)[2]} & siRNA target designs \\ $\times$ & \Sexpr{dim(D)[3]} & query genes \\ $\times$ & \Sexpr{dim(D)[4]} & siRNA query designs \\ $\times$ & \Sexpr{dim(D)[5]} & phenotypic features\\ $\times$ & \Sexpr{dim(D)[6]} & biological replicates \end{tabular} \end{center} % The datamatrix is now completely pre-processed and can be saved. <>= datamatrix = list(D=D, Anno = list(target = TargetAnnotation, query = QueryAnnotation, phenotype=dimnames(D)[[5]])) save(datamatrix, file=file.path("result","data","datamatrix.rda")) @ %-------------------------------------------------- \subsection{Pairwise interaction scores} %-------------------------------------------------- Pairwise interaction scores ($\pi$-scores) were estimated using a robust linear fit. The query main effects are lifted such that they equal to the mean of the single knock down measurements of the query genes (target siRNA is a scrambled sequence serving as negative control). <>= D = datamatrix$D pimatrix = datamatrix pimatrix$D[] = NA_real_ mainEffects = list(target = D[,,1,,,], query = D[1,,,,,], overall = D[1,,1,,,], Anno = datamatrix$Anno) for (i in seq_len(dim(D)[2])) { for (j in seq_len(dim(D)[4])) { for (k in seq_len(dim(D)[5])) { for (l in seq_len(dim(D)[6])) { MP = HD2013SGImaineffects(D[,i,,j,k,l], TargetNeg=which(TargetAnnotation$group == "negctrl")) pimatrix$D[,i,,j,k,l] = MP$pi mainEffects$target[,i,j,k,l] = MP$targetMainEffect mainEffects$query[i,,j,k,l] = MP$queryMainEffect mainEffects$overall[i,j,k,l] = MP$neg } } } } save(mainEffects, file=file.path("result","data","mainEffects.rda")) @ %-------------------------------------------------- \subsection{Statistical testing of interaction terms} %-------------------------------------------------- For statistical testing, the interaction terms that differed by $\ge4$ times the median standard deviation of the interaction scores were flagged as outliers. Afterwards the interaction terms were tested for significance by a moderated $t$-test implemented in the R package \Rpackage{limma}. % <>= D = pimatrix$D PADJ = D[pimatrix$Anno$target$group == "sample",,,,,1] s = rep(NA_real_, dim(D)[5]) for (i in seq_len(dim(D)[5])) { Data = D[,,,,i,] Data = Data[pimatrix$Anno$target$group == "sample",,,,] d = dim(Data) dim(Data) = c(prod(d[1:4]),prod(d[5])) Data[abs(Data[,1]-Data[,2]) > 4*mad(Data[,1]-Data[,2],center=0.0),]=NA_real_ s[i] = median(apply(Data,1,sd), na.rm=TRUE) padj = rep(NA_real_, nrow(Data)) K = which(apply(!is.na(Data),1,all)) fit = eBayes(lmFit(Data[K,])) padj[K] = p.adjust(fit$p.value, method="BH") PADJ[,,,,i] = padj cat(sprintf("i=%2d",i), " nr int (1%) = ", sum(padj <= 0.01, na.rm=TRUE)/nrow(Data), " nr int (3%) = ", sum(padj <= 0.03, na.rm=TRUE)/nrow(Data), "\n") } @ % $\pi$-scores for each siRNA pair were summarized over both replicates. Furthermore, interaction scores were divided by their median deviation over all siRNA pairs, to scale them to comparable dynamic range. % <>= PI = pimatrix$D PI = PI[pimatrix$Anno$target$group == "sample",,,,,] Interactions = list(newpiscore = PI, scale = s, padj = PADJ, Anno = pimatrix$Anno) Interactions$Anno$target = Interactions$Anno$target[ pimatrix$Anno$target$group == "sample",] save(Interactions, file=file.path("result","data","Interactions.rda")) @