%\VignetteIndexEntry{ArrayTV Vignette} %\VignetteKeywords{copy number} %\VignettePackage{ArrayTV} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \usepackage[numbers]{natbib} \usepackage{color} \usepackage[margin=1in]{geometry} \usepackage{setspace} \bibliographystyle{plain} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\texttt{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\ArrayTV}{\Rpackage{ArrayTV}} \newcommand{\comm}[1]{{\color{blue}{\small #1}}} \title{Wave correction for arrays} \author{Eitan Halper-Stromberg and Robert B. Scharpf} \begin{document} \maketitle <>= library(ArrayTV) options(width=70) @ \begin{abstract} \ArrayTV{} is a GC correction algorithm for microarray data designed to mitigate the appearance of waves. This work is inspired by a method for mitigating waves in sequencing data \cite{Benjamini2012}. We find the genomic window for computing GC that best captures the dependence of signal intensity on GC content for each array using a score similar to the total variance (TV) statistic defined in \cite{Benjamini2012}. The correction each probe receives is the mean signal intensity of all probes sharing the same local GC. We center windows at the starting position of each probe. \end{abstract} \section{Adjusting copy number estimates for GC waves} \paragraph{Importing data.} The wave correction methods in \ArrayTV{} handle simple matrices with rows indexing markers and columns indexing samples, as well as more complex data structures commonly used to process Affymetrix and Illumina platforms made available in the \Rpackage{oligoClasses} package. In this vignette, we illustrate our approach with a single female lymphoblastoid cell line sample assayed on the Agilent 1M , Affymetrix 2.7M, and NimbleGen 2.1M platforms (available in full from http://rafalab.jhsph.edu/cnvcomp, spike-in tube 2)\cite{Halper-Stromberg2011}. For each platform, we provide a matrix. The rows of the matrix correspond to markers on chromosome 15 and the columns correspond to the physical position of the first nucleotide of the marker using UCSC build hg18 (note: for genotyping arrays, the position should be adjusted to the start of the marker and not the location of the SNP) and preprocessed signal intensities that we expect to be proportional to the copy number as well as sequence artifacts such as GC content. Agilent and NimbleGen signal intensities were preprocessed by taking the log ratio of loess normalized raw signal. Affymetrix performed their own preprocessing to obtain log R ratios. Hereafter, we refer to the preprocessed signal as \textit{M} values. To keep the size of this package small, we provide the $M$ values as integers ($M \times 1000$). In the following code, we load the data for each platform, inspect the first few rows, and transform the $M$ values for each platform to the original scale. <>= library(ArrayTV) ### parallelization will be discussed later but for now we provide the following ### command to avoid warnings if(require(doParallel)){ cl <- makeCluster(1) registerDoParallel(cl) } path <- system.file("extdata", package="ArrayTV") load(file.path(path, "array_logratios.rda")) head(agilent) head(affymetrix) head(nimblegen) agilent[, "M"] <- agilent[, "M"]/1000 affymetrix[, "M"] <- affymetrix[, "M"]/1000 nimblegen[, "M"] <- nimblegen[, "M"]/1000 @ To determine the optimal window(s) for GC correction, we generate TV scores for several window sizes. Again, we assume that the positions denote the start location of each probe. While our analysis in this vignette is limited to chromosome 15 for computational reasons, in practice we recommend implementing the wave correction on all autosomes simultaneously. We specify the windows to search by providing two vectors that indicate the maximum window size and the increment: <>= max.window <- c(100,10e3,1e6) increms <- c(20, 2000, 200e3) @ \noindent The above specification indicates that we will compute TV scores for three sequential window sizes as follows. The first series of windows has a maximum window extension of 100 from center, (\texttt{max.window[1]}) (effective maximum window size is therefore 200 since we extend in both directions) and increment extension value 20 (\texttt{increms[1]}) again, extending from the center of the window, indicating that windows extending 20, 40, \ldots, and 100 from the probe starts will be evaluated. For the second series, the maximum window extension is 10e3 (\texttt{max.window[2]}) and window extensions 2000, 4000, \ldots, 10e3 will be evaluated since the increment value is 2000 (\texttt{increms[2]}). For each of the three series, the number of windows to be evaluated is 5. The sequence for each series must have the same length. \paragraph{Evaluating TV scores as a function of window size for computing GC.} The class of the first argument to the \Rfunction{gcCorrect} method determines the dispatch to the low-level function \Rfunction{gcCorrectMain}. Any of the arguments to \Rfunction{gcCorrectMain} can be passed through the \texttt{...} operator and we refer the user to the documentation for \Rfunction{gcCorrectMain} for additional arguments and \Rfunction{gcCorrect} for the classes of objects handled by the \Rfunction{gcCorrect} method. Computing the GC content of the sequence for each marker requires that the appropriate BSgenome package is loaded. For example, here the package \Rpackage{BSgenome.Hsapiens.UCSC.hg18} is loaded internally. The following codechunk performs GC correction on the $M$ values for the NimbleGen, Agilent, and Affymetrix platforms, respectively, for the windows specified previously. <>= nimcM1List <- gcCorrect(object=nimblegen[, "M", drop=FALSE], chr=rep("chr15", nrow(nimblegen)), starts=nimblegen[, "position"], increms=increms, maxwins=max.window, build='hg18') afcM1List <- gcCorrect(affymetrix[, "M", drop=FALSE], chr=rep("chr15", nrow(affymetrix)), starts=affymetrix[, "position"], increms=increms, maxwins=max.window, build="hg18") @ <>= ## We load the corrected NimbleGen values computed during unit-testing and Affy ## values, pre-computed, to save time load(file=file.path(path,"nimcM1List.rda")) load(file=file.path(path,"afcM1List.rda")) agcM1List <- gcCorrect(agilent[, "M", drop=FALSE], chr=rep("chr15", nrow(agilent)), starts=agilent[, "position"], increms=increms, maxwins=max.window, build="hg18") if(require(doParallel)) stopCluster(cl) @ To plot the TV score for each of the windows evaluated, we concatenate the results from each platform into a list and then convert the elements representing the TVscores into a data.frame. Plotting is based on the size of the genomic windows used to compute GC (small, medium, and large). <>= results <- list(nimcM1List,agcM1List,afcM1List) tvscores <- do.call("rbind", lapply(results, "[[", "tvScore"))[, 1] tv.df <- data.frame(tv=tvscores, platform=rep(c("NimbleGen", "Agilent", "Affy 2.7M"), each=15), window=c(seq(20, 100, 20), seq(2000, 10e3, 2e3), seq(2e5, 1e6, 2e5))*2) op <- par(no.readonly = TRUE) par(las=1, mar=c(6, 4, 3, 2)+0.1) with(tv.df, plot(log10(window), tv, pch=20, col=tv.df$platform, xaxt="n", xlab="window (bp)", ylab="TV score", cex=1.2)) invisible(lapply(split(tv.df, tv.df$platform), function(x) points(log10(x$window),x$tv, type="l",col=x$platform, lwd=1.5))) axis(1, at=log10(tv.df$window), labels=FALSE) #labels=FALSE, tick=TRUE)a text(x=log10(tv.df$window), par("usr")[3]-0.005, labels=tv.df$window, srt=45, xpd=TRUE, cex=0.7, adj=1) legend("bottomright", legend=unique(tv.df$platform), pch=20, col=unique(tv.df$platform), lwd=1.2) par(op) @ \begin{figure}[t] \begin{center} \includegraphics[width=0.8\textwidth]{ArrayTV-tvscorefig} \vspace{-1em} \caption{TV scores for three array platforms indicated by color. For the Agilent and NimbleGen platforms, the window selected for wave correction is relatively large ($\approx$ 1.2Mb). For the Affymetrix platform, the optimal window is $\approx$ 8kb. Often, the optimal window is closer to the size of the probe or the PCR fragment. A correction for both large and small windows could be acheived by a second iteration of the \Rfunction{gcCorrect} function.} \label{fig:tvscores} \end{center} \end{figure} \subsection*{View corrected and uncorrected $M$ values} Our GC-adjusted signal was computed when we called \Rfunction{gcCorrect} and is now stored in objects \Robject{nimcM1List}, \Robject{agcM1List}, and \Robject{afcM1List}. Here we create a \Rclass{data.frame} to store corrected and uncorrected signal, ahead of plotting. <>= wave.df <- data.frame(position=c(rep(nimblegen[, "position"]/1e6, 2), rep(agilent[, "position"]/1e6, 2), rep(affymetrix[, "position"]/1e6,2)), r=c(nimblegen[, "M"], nimcM1List[['correctedVals']], agilent[,"M"], agcM1List[['correctedVals']], affymetrix[,"M"], afcM1List[['correctedVals']]), platform=rep(c(rep("Nimblegen", 2), rep("Agilent", 2), rep("Affymetrix", 2)), c(rep(length(nimblegen[, "M"]),2), rep(length(agilent[,"M"]),2), rep(length(affymetrix[,"M"]),2))), wave.corr=c(rep("raw", length(nimblegen[, "M"])), rep("ArrayTV", length(nimblegen[, "M"])), rep("raw", length(agilent[,"M"])), rep("ArrayTV", length(agilent[,"M"])), rep("raw", length(affymetrix[,"M"])), rep("ArrayTV",length(affymetrix[,"M"])))) wave.df$platform <- factor(wave.df$platform, levels=c("Nimblegen", "Agilent", "Affymetrix")) wave.df$wave.corr <- factor(wave.df$wave.corr, levels=c("raw", "ArrayTV")) ## remove some points to make subsequent figure smaller wave.df <- wave.df[seq(1,nrow(wave.df),4),] @ We use the \R{} package \Rpackage{lattice} to visualize the $M$ values before and after correction for the three platforms (Figure \ref{fig:crossplatform}) <>= ## For each of the 3 example platforms, the uncorrected signal appears immediately above the ## corrected signal, across chromosome 15. A smoothed loess line may be drawn through each ## by uncommenting the commented line in the xyplot function require("lattice") && require("latticeExtra") p <- xyplot(r~position|platform+wave.corr, wave.df, pch=".", col="gray", ylim=c(-2,1.5), xlim=range(nimblegen[, "position"]/1e6), ylab="M", xlab="position (Mb) on chr15", scales=list(y=list(tick.number=8)), panel=function(x,y,subscripts,...){ panel.grid(lty=2) panel.xyplot(x, y, ...) panel.abline(h=0) ## panel.loess(x, y, span=1/20, lwd=2, col="black") }, par.strip.text=list(cex=0.9), layout=c(2,3), as.table=TRUE) @ <>= p2 <- useOuterStrips(p) print(p2) @ \begin{figure}[t] \begin{center} \includegraphics[width=\textwidth]{ArrayTV-crossplatform} \caption{\label{fig:crossplatform} Comparison of $M$ values before and after \ArrayTV{} wave correction (rows) for three array platforms (columns).} \label{fig:wavecorrect} \end{center} \end{figure} \section{Spike-In Example} To demonstrate that the GC-correction does not remove true CNV signals, we illustrate our approach on a spike-in datasets containing a small amplification. While the signal for the spike-in is subtle, the main point here is that the small inflection is not effected by the GC correction. The amplification was spiked in to each of the three array platforms as described in Halper-Stromberg {\it et al.} \cite{Halper-Stromberg2011}. As the segmentation of the data using circular binary segmentation (CBS; \cite{Olshen2004}) is only for visualization and is not critical to the functionality of this package, the following code chunk is static (not evaluated) to reduce computation: <>= if(require("DNAcopy")){ cbs.segs <- list() platforms <- c("Nimblegen", "Agilent", "Affymetrix") correction <- c("raw", "ArrayTV") Ms <- list(nimblegen[, "M"], nimcM1List[['correctedVals']], agilent[, "M"], agcM1List[['correctedVals']], affymetrix[, "M"], afcM1List[['correctedVals']]) starts <- list(nimblegen[, "position"], agilent[, "position"], affymetrix[, "position"]) k <- 0 for(i in 1:3){ for(j in 1:2){ k <- k+1 MsUse <- Ms[[k]] chrUse <- rep("chr15", length(MsUse)) startsUse <- starts[[i]] toUse <- !(duplicated(startsUse)) cna1 <- CNA(as.matrix(MsUse[toUse]), chrUse[toUse], startsUse[toUse], data.type="logratio", paste(platforms[i], correction[j], sep="_")) smooth.cna1 <- smooth.CNA(cna1) cbs.segs[[k]] <- segment(smooth.cna1, verbose=1,min.width = 5) } } cbs.out <- do.call("rbind", lapply(cbs.segs, "[[", "output")) } @ <>= save(cbs.out, file=file.path(path,"cbs_out.rda")) @ %As CBS segmentation of these genomes is time-consuming, We load the results from the above computation into our workspace as follows: <>= load(file.path(path, "cbs_out.rda")) @ %\subsection*{Parse the CBS output in preparation for plotting} To facilitate downstream analyses such as visualization of the results, we coerce the results from the CBS algorithm to an object of class \Robject{GRanges}. <>= ids <- cbs.out$ID ## create vectors indicating corrected/uncorrected and platform for each cbs segment platforms <- gsub("_raw", "", ids) platforms <- gsub("_ArrayTV", "", platforms) correction <- gsub("Nimblegen_", "", ids) correction <- gsub("Affymetrix_", "", correction) correction <- gsub("Agilent_", "", correction) ## store cbs segments in a GRanges object with designations for corrected/uncorrected ## and platform cbs.segs <- GRanges(rep("chr14", nrow(cbs.out)), IRanges(cbs.out$loc.start, cbs.out$loc.end), numMarkers=cbs.out$num.mark, seg.mean = cbs.out$seg.mean, platform = platforms, correction=correction) ## separate corrected and uncorrected cbs segments int separate objects cbs.segs.raw <- cbs.segs[cbs.segs$correction=="raw", ] cbs.segs.arrayTV <- cbs.segs[cbs.segs$correction=="ArrayTV", ] ## create GRanges objects with M values, platform, and ## corrected/uncorrected designation marker.data <- GRanges(rep("chr14", nrow(wave.df)), IRanges(wave.df$position*1e6, width=1), numMarkers=1, seg.mean = wave.df$r, platform=wave.df$platform, correction=wave.df$wave.corr) marker.raw <- marker.data[marker.data$correction=="raw",] marker.arrayTV <- marker.data[marker.data$correction=="ArrayTV",] ## populate cbs.seg metadata column by joining M values with CBS segments by location olaps1 <- findOverlaps(marker.raw, cbs.segs.raw, select="first") marker.raw$cbs.seg <- cbs.segs.raw$seg.mean[olaps1] olaps2 <- findOverlaps(marker.arrayTV, cbs.segs.arrayTV, select="first") marker.arrayTV$cbs.seg <- cbs.segs.arrayTV$seg.mean[olaps2] ## populate SpikeIn metadata column if a marker falls within the Spiked-in region spikeIn <- IRanges(start=45396506,end=45537976) spikeinStart <- 45396506; spikeinEnd <- 45537976 marker.raw$spikeIn <- marker.raw$cbs.seg marker.arrayTV$spikeIn <- marker.arrayTV$cbs.seg index1 <- which(is.na(findOverlaps(unlist(ranges(marker.raw)), spikeIn,select="first"))) index2 <- which(is.na(findOverlaps(unlist(ranges(marker.arrayTV)), spikeIn,select="first"))) marker.raw$spikeIn[index1] <- NA marker.arrayTV$spikeIn[index2] <- NA ## put GRanges objects into a dataframe ahead of plotting rawd <- as.data.frame(marker.raw) atvd <- as.data.frame(marker.arrayTV) rawd <- rbind(rawd, atvd) rawd$platform <- factor(rawd$platform, levels=c("Nimblegen", "Agilent", "Affymetrix")) rawd$correction <- factor(rawd$correction, levels=c("raw", "ArrayTV")) ## remove some points to make subsequent figure smaller rawd <- rawd[seq(1,nrow(rawd),2),] @ %View CBS lines through the spike in (expected copy number 2.5). Black points in Figure \ref{fig:spikein} represent signal intensities for each probe, and red line segments correspond to the segmentation of the intensities from CBS. The red line segments demarcate the spike-in. As in the previous figure, raw signal intensities appear immediately above the corrected intensities for each platform <>= p2 <- xyplot(seg.mean+cbs.seg+spikeIn~start/1e6|platform+correction, rawd, ylim=c(-1.5, 1.5), pch=".", xlim=c(spikeinStart/1e6-1.5,spikeinEnd/1e6+1.5), ylab="M", xlab="position (Mb)",##pch=c(".", NA,NA), col=c('gray','black','red'), distribute.type=TRUE, type=c('p','l','l'),lwd=c(NA,2,3), scales=list(y=list(tick.number=8)), par.strip.text=list(cex=0.9), layout=c(2,3), as.table=TRUE) p3 <- useOuterStrips(p2) @ <>= print(p3) @ \begin{figure}[h] \begin{center} \includegraphics[width=0.8\textwidth]{ArrayTV-spikein} \caption{\label{fig:spikein}Comparison of CBS segments (lines) and $M$ values (points) before and after \ArrayTV{} wave correction (rows) for three array platforms (columns) in the vicinity of a known amplification of copy number 2.5 (red).} \end{center} \end{figure} \section{Parallelization} When multiple cores are available, computation can proceed in parallel for computing the GC content for the various window sizes specified. To enable parallelization across 8 cores using the \Rpackage{doParallel} package, one could execute the following (unevaluated) code chunk prior to calling the \Rfunction{gcCorrect} function: <>= if(require(doParallel)){ cl <- makeCluster(8) registerDoParallel(cl) } ## ... execute gcCorrect if(require(doParallel)) stopCluster(cl) @ \section{Session Information} <>= toLatex(sessionInfo()) @ %%\section*{References} \bibliography{refs} \end{document}