%\VignetteIndexEntry{ENmix User's Guide} %\VignetteKeywords{DNA methylation, background correction, preprocessing} %\VignettePackage{ENmix} \documentclass[12pt]{article} <>= options(width=70) @ \usepackage{Sweave} \usepackage[utf8]{inputenc} \usepackage{times} \usepackage{color, hyperref} \usepackage{fullpage} \usepackage{parskip} \usepackage{multirow} \usepackage{booktabs} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \title{The \Rpackage{ENmix} User's Guide} \date{Modified: April 9 2020. Compiled: \today} \begin{document} \maketitle \section{Introduction} The \Rpackage{ENmix} package provides a set of quality control and data pre-processing tools for Illumina HumanMethylation450 and MethylationEPIC Beadchips. It includes ENmix background correction, RELIC dye bias correction, RCP probe-type bias adjustment, along with a number of additional tools. These functions can be used to remove unwanted experimental noise and thus to improve accuracy and reproducibility of methylation measures. \Rpackage{ENmix} functions are flexible and transparent. Users have option to choose a single pipeline command to finish all data pre-processing steps (including background correction, dye-bias adjustment, inter-array normalization and probe-type bias correction) or to use individual functions sequentially to perform data pre-processing in a more customized manner. In addition the \Rpackage{ENmix} package has selectable complementary functions for efficient data visualization (such as data distribution plots); quality control (identifing and filtering low quality data points, samples, probes, and outliers, along with imputation of missing values); identification of probes with multimodal distributions due to SNPs or other factors; exploration of data variance structure using principal component regression analysis plot; preparation of experimental factors related surrogate control variables to be adjusted in downstream statistical analysis; an efficient algorithm oxBS-MLE to estimate 5-methylcytosine and 5-hydroxymethylcytosine level; estimation of celltype proporitons; methlation age calculation and differentially methylated region (DMR) analysis. The data structure used by the \Rpackage{ENmix} package is compatible with several other related R packages, such as \Rpackage{minfi}, \Rpackage{wateRmelon} and \Rpackage{ChAMP}, providing straightforward integration of ENmix-corrected datasets for subsequent data analysis. The software is designed to support large scale data analysis, and provides multi-processor parallel computing options for most functions. \section{Citation} The following publications can be refered to learn more about the methods implemented in this package. Xu Z, Niu L, Li L, Taylor JA. ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip, Nucleic Acids Research, 2015 Niu L, Xu Z, Taylor JA. RCP: a novel probe design bias correction method for Illumina Methylation BeadChip. Bioinformatics, 2016 Xu Z, Langie SA, De Boever P, Taylor JA, Niu L. RELIC: a novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics. 2017 Xu Z, Taylor JA, Leung YK, Ho SM, Niu L. oxBS-MLE: an efficient method to estimate 5-methylcytosine and 5-hydroxymethylcytosine in paired bisulfite and oxidative bisulfite treated DNA. Bioinformatics. 2016 Xu Z, Xie C, Taylor JA, Niu L. ipDMR: Identification of differentially methylated regions with interval P values in review, 2020 \section{Example Analysis} \subsection{Example 1: using pipeline } Pipeline function \Rfunction{mpreprocess} can be used to perform background correction, dye-bias adjustment, inter-array normalization, probe-type bias correction, as well as outlier removal and imputation in a single step. <>= library(ENmix) #read in data require(minfiData) #data pre-processing beta=mpreprocess(RGsetEx,nCores=6) #or #read in IDAT files path <- file.path(find.package("minfiData"),"extdata") rgSet <- readidat(path = path,recursive = TRUE) #quality control and data pre-processing beta=mpreprocess(rgSet,nCores=6,qc=TRUE,foutlier=TRUE, rmcr=TRUE,impute=TRUE) @ \subsection{Example 2: using individual function} This example code is basically doing the same thing as in example 1, but more transparent, and has more customized options. <>= library(ENmix) #read in data path <- file.path(find.package("minfiData"),"extdata") rgSet <- readidat(path = path,recursive = TRUE) #QC info qc<-QCinfo(rgSet) #background correction and dye bias correction #if provide qc info, the low quality samples and probes #will be excluded before background correction mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", QCinfo=qc, nCores=6) #low quality samples and probes can also be excluded after #background correction using QCfilter. #mdat <- QCfilter(mdat,qcinfo=qc,outlier=TRUE) #inter-array normalization mdat<-norm.quantile(mdat, method="quantile1") #probe-type bias adjustment beta<-rcp(mdat,qcscore=qc) beta <- rm.outlier(beta,qcscore=qc,impute=TRUE,rmcr=TRUE) @ \subsection{Example 3: A more elaborated example} This example is to demonstrate the flexibility of the package functions. Getting detailed information about data quality, SNP-like probes, excluding user specified probes and/or samples, principal component regression analysis, outlier removal, sex prediction, cell sub-type proportion estimates and control surrogate variables. <>= library(ENmix) #read in data path <- file.path(find.package("minfiData"),"extdata") rgSet <- readidat(path = path,recursive = TRUE) sheet <- read.metharray.sheet(file.path(find.package("minfiData"),"extdata"), pattern = "csv$") rownames(sheet)=paste(sheet$Slide,sheet$Array,sep="_") colData(rgSet)=as(sheet[colnames(rgSet),],"DataFrame") #control plots plotCtrl(rgSet) #QC info qc<-QCinfo(rgSet) #calculate detection P values detp=calc_detP(rgSet,detPtype = "negative") detp2=calc_detP(rgSet,detPtype = "oob") #get methDataSet mraw <- getmeth(rgSet) #get raw methyaltion beta values beta<-getB(mraw) #distribution plot multifreqpoly(beta,main="Methylation Beta value distribution") #Search for multimodal CpGs #sample size in this example data is too small for this purpose! #exclude low quality data first bb=beta; bb[qc$detP>0.05 | qc$nbead<3]=NA nmode<-nmode.mc(bb, minN = 3, modedist=0.2, nCores = 6) outCpG = names(nmode)[nmode>1] #background correction and dye bias correction mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", QCinfo=qc, exCpG=outCpG, nCores=6) #inter-array normalization mdat<-norm.quantile(mdat, method="quantile1") colData(mdat)=as(sheet[colnames(mdat),],"DataFrame") #probe-type bias adjustment beta<-rcp(mdat,qcscore=qc) # Principal component regression analysis plot cov<-data.frame(group=colData(mdat)$Sample_Group, slide=factor(colData(mdat)$Slide)) pcrplot(beta, cov, npc=6) #filter out low quality and outlier data points for each probe; #rows and columns with too many missing value can be removed #if specify; Do imputation to fill missing data if specify. beta <- rm.outlier(beta,qcscore=qc,rmcr=TRUE,impute=TRUE) #Non-negative control surrogate variables csva<-ctrlsva(rgSet) #estimate cell type proportions #based on rgDataset celltype=estimateCellProp(userdata=rgSet,refdata="FlowSorted.Blood.450k", nonnegative = TRUE,normalize=TRUE) #using methDataSet celltype=estimateCellProp(userdata=mdat,refdata="FlowSorted.Blood.450k", nonnegative = TRUE,normalize=TRUE) #using beta value celltype=estimateCellProp(userdata=beta,refdata="FlowSorted.Blood.450k", nonnegative = TRUE,normalize=TRUE) #predic sex based on rgDataSet or methDataset sex=predSex(rgSet) sex=predSex(mdat) #Methylation age calculation mage=methyAge(beta) #ICC analysis, see manual #DMR analysis, see manual @ \section{Setting up the data} The first step is to import array raw data files (*.idat) to create an object of \Robject{rgDataSet}, which is similar to \Robject{RGChannelSetExtended} in minfi package, but with probe annotation integrated into data object, and having smaller data size. User can also use minfi package to read in idat files and create \Robject{RGChannelSetExtended}. Most functions in ENmix support \Robject{RGChannelSetExtended} or \Robject{RGChannelSet}. <>= library(ENmix) rgSet <- readidat(path = "directory containing idat files",recursive = TRUE) @ When methylation IDAT raw data files are not available, such as in many publically available datasets, users can use methylated (M) and unmethylated (U) intensity data to create an object of \Robject{MethylSet}. <>= M<-matrix_for_methylated_intensity U<-matrix_for_unmethylated_intensity pheno<-as.data.frame(cbind(colnames(M), colnames(M))) names(pheno)<-c("Basename","filenames") rownames(pheno)<-pheno$Basename pheno<-AnnotatedDataFrame(data=pheno) anno<-c("IlluminaHumanMethylation450k", "ilmn12.hg19") names(anno)<-c("array", "annotation") mdat<-MethylSet(Meth = M, Unmeth = U, annotation=anno, phenoData=pheno) @ \section{Quality Control} \subsection{Internal control probes} Illumina 450k chip incorporated 15 different types of internal control probes (total of 848 probes). The control plots generated using the data by Illumina GenomeStudio software are very useful to inspect experimental process and data quality. However, GenomeStudio only works on the windows operating system, it is time consuming to generate these plots for larger dataset, and there is also no option to save the plots into file. The function \Rfunction{plotCtrl} can generate similar plots for each type of control. <>= plotCtrl(rgSet) @ See Illumina Infinium HD Methylation Assay for detailed description on how to interpret these control figures. Here is a list of control types: \begin{tabular}{ l c } \toprule \textbf{Control types} & \textbf{Number of probes} \\ \midrule \textbf{Sample-Independent Controls} & \\ STAINING & 4 \\ EXTENSION & 4 \\ HYBRIDIZATION & 3 \\ TARGET REMOVAL & 2 \\ RESTORATION & 1 \\ \midrule \textbf{Sample-Dependent Controls} & \\ BISULFITE CONVERSION I & 12 \\ BISULFITE CONVERSION II & 4 \\ SPECIFICITY I & 12 \\ SPECIFICITY II & 3 \\ NON-POLYMORPHIC & 4 \\ NORM\_A & 32 \\ NORM\_C & 61 \\ NORM\_G & 32 \\ NORM\_T & 61 \\ NEGATIVE & 613 \\ \bottomrule \end{tabular} \begin{figure}[!htbp] \centerline{\includegraphics[scale=0.75]{./fig/BISULFITE_CONVERSION_I.jpg}} \caption{Bisulfite conversion controls for type I probes} \label{fig:bisul_I} \end{figure} \begin{figure}[!htbp] \centerline{\includegraphics[scale=0.75]{./fig/NEGATIVE.jpg}} \caption{Negative control probes} \label{fig:negcont} \end{figure} \begin{figure}[!htbp] \centerline{\includegraphics[scale=0.75]{./fig/NORM_ACGT.jpg}} \caption{NORM ACGT control probes} \label{fig:acgt} \end{figure} These controls can also be plotted in user specified order to check how experimental factors affect methylation measures, such as batch, plate, array or array location. <>= path <- file.path(find.package("minfiData"),"extdata") rgSet <- readidat(path = path,recursive = TRUE) sheet <- read.metharray.sheet(file.path(find.package("minfiData"),"extdata"), pattern = "csv$") rownames(sheet)=paste(sheet$Slide,sheet$Array,sep="_") colData(rgSet)=as(sheet[colnames(rgSet),],"DataFrame") #control plots IDorder=rownames(colData(rgSet))[order(colData(rgSet)$Slide,colData(rgSet)$Array)] plotCtrl(rgSet,IDorder) @ \subsection{Data distribution} Methylation intensity or beta value distribution plots are very useful for data summary, visual inspection and identification of outlier samples. Density plot is routinely generated using R function \Rfunction{multidensity}. However, the function is computationally intensive, and can take several hours to produce density plots for a large methylation dataset. Furthermore, density plot is difficult to understand for many investigators, also as noted in the man page of \Rfunction{multidensity}, density plot may not be able to display data distribution accurately for some data because of the smooth function which may lead part of the distribution to be out of range, and may obscure important details in data distribution. ENmix's frequency polygon plot provides a better alternative for inspection of data distribution. It can accurately reflect data distribution and, like histogram it is easy to understand. It is also much faster, and only take a few minutes to produce a distribution plot for >1000 samples. <>= mraw <- getmeth(rgSet) #total intensity plot is userful for data quality inspection #and identification of outlier samples multifreqpoly(assays(mraw)$Meth+assays(mraw)$Unmeth, xlab="Total intensity") #Compare frequency polygon plot and density plot beta<-getB(mraw) anno=rowData(mraw) beta1=beta[anno$Infinium_Design_Type=="I",] beta2=beta[anno$Infinium_Design_Type=="II",] library(geneplotter) jpeg("dist.jpg",height=900,width=600) par(mfrow=c(3,2)) multidensity(beta,main="Multidensity") multifreqpoly(beta,main="Multifreqpoly",xlab="Beta value") multidensity(beta1,main="Multidensity: Infinium I") multifreqpoly(beta1,main="Multifreqpoly: Infinium I", xlab="Beta value") multidensity(beta2,main="Multidensity: Infinium II") multifreqpoly(beta2,main="Multifreqpoly: Infinium II", xlab="Beta value") dev.off() @ See the following figures (Figure 4) generated from the above code. When type I and type II probes are plotted separately (Fig 4 bottom 4 panels) the difference in modes between type I and II probes can be appreciated. But when all probes are plotted together (Fig 4 top panels), the multidensity plot obscures these differences, while they remain readily apparent in the multifreqpoly plot. In addition, the multidensity plots appear to suggest that probes range in value from <0 to >1, whereas multifreqpoly correctly show the range from 0 to 1. \begin{figure}[!htbp] \centerline{\includegraphics[scale=0.75]{./fig/dist.jpg}} \caption{Methylation beta value distribution plots for all probes (top 2 panels) and for type I (middle panels) and II (bottom panels) probes separately. The smoothing function in multidensity plots (panels on left) results in misleading range and mode information which are more accurately depicted in the multifreqpoly plots (panels on right)} \label{fig:dist} \end{figure} \subsection{QC information, filtering of low quality samples and probes} Data quality measures, including detection P values, number of beads for each methylation read and average intensities for bisulfite conversion probes can be extracted using the function \Rfunction{QCinfo} from an object of \Robject{RGChannelSetExtended}. Based on default or user specified quality score thresholds, the \Rfunction{QCinfo} can also identify and export a list of low quality samples and CpG probes. Outlier samples based on total intensity or beta value distribution should be excluded before further analysis. Such samples have been difficult to identify, but by using the argument outlier=TRUE, these outlier samples will be identified automatically. Data quality score figures from \Rfunction{QCinfo} can be used to guide the selection of quality score thresholds. Low quality samples and probes can be filtered out using \Rfunction{QCfilter} or \Rfunction{preprocessENmix}. <>= qc<-QCinfo(rgSet) #exclude before backgroud correction mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", QCinfo=qc, nCores=6) #Or exclude after background correction mdat <- QCfilter(mdat,qcinfo=qc,outlier=TRUE) @ \subsection{Filtering out outliers and low quality data values} Outlier and low quality data values can have a large impact on association statistical tests. Function \Rfunction{rm.outlier} can filter out these data points and replace them as missing values. Outliers are defined as values smaller than 3 times IQR from the lower quartile or larger than 3 times IQR from the upper quartile. Some statistical methods do not allow missing values, argument \textit{impute=TRUE} in the function can be specified to impute missing data using k-nearest neighbors method. <>= #filter out outliers b1=rm.outlier(beta) #filter out low quality and outlier values b2=rm.outlier(beta,qcscore=qcscore) #filter out low quality and outlier values, remove rows and columns # with too many missing values b3=rm.outlier(beta,qcscore=qcscore,rmcr=TRUE) #filter out low quality and outlier values, remove rows and columns # with too many missing values, and then do imputation b3=rm.outlier(beta,qcscore=qcscore,rmcr=TRUE,impute=TRUE) @ \section{Background correction and dye-bias adjustment} Function \Rfunction{preprocessENmix} incorporates a model based background correction method \textit{ENmix}, which models methylation signal intensities with a flexible exponential-normal mixture distribution, together with a truncated normal distribution to model background noise. Users can also specify a list of poor performance CpGs to be excluded before background correction using argument \textit{exCpG}. Argument dyeCorr can be used to specify a method for dye-bias correction, the default is RELIC. See the following papers for the detailed description of related methods: Zongli Xu, et. al. ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip, Nucleic Acids Research, 2015 Xu Z, Langie SA, De Boever P, Taylor JA, Niu L. RELIC: a novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics. 2017 If argument QCinfo is specified, the low quality samples and probes identified by function \Rfunction{QCinfo} will be excluded before ENmix background correction. Using argument \textit{exSample} and \textit{exCpG}, User can also specify a list of samples or probes to be excluded before background correction. <>= qc=QCinfo(rgSet) mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", QCinfo=qc, exCpG=NULL, nCores=6) @ \section{Inter-array normalization} Function \Rfunction{norm.quantile} can be used to perform quantile normalization on methylation intensity values. <>= mdat<-norm.quantile(mdat, method="quantile1") @ \section{Probe type bias adjustment} The majority of probes on Illumina 450K and EPIC BeadChips are type II probes. Although type II probes facilitate increased array genome coverage, they were shown to have decreased dynamic range and reproducibility compared to type I probes. Taking advantage of the high spatial correlation of DNA methylation levels along the human genome, The RCP (Regression on Correlated Probes) method utilizes nearby (<25 bp) type I and II probe pairs to derive the quantitative relationship between probe types and then recalibrates type II probe measurements using type I probes as referents. <>= beta<-rcp(mdat) @ See the following publication for the detailed description of the method: Niu L, Xu Z, Taylor JA. RCP: a novel probe design bias correction method for Illumina Methylation BeadChip. Bioinformatics, 2016 \section{Batch effect correction} Function \Rfunction{ctrlsva} can be used to estimate surrogate variables for batch effect and unknown experimental confounders from intensity data for non-negative internal control probes. These variables can then be adjusted as covariables in downstream association analysis to remove unwanted data variation. <>= sva<-ctrlsva(rgSet) @ \section{Principal component regression analysis plot} First, principal component analysis will be performed in standardized beta value matrix (standardized for each CpG), and then the specified number of top principal components (that explain most data variation) will be used to perform linear regression with each specified variables, such as batch or environmental variables. Regression P values will be plotted to explore methylation data variance structure and to identify possible confounding variables to guide association statistical analysis. <>= cov<-data.frame(group=colData(mdat)$Sample_Group, slide=factor(colData(mdat)$Slide)) pcrplot(beta, cov, npc=6) @ \begin{figure}[!htbp] \centerline{\includegraphics[scale=0.75]{./fig/pcr_diag.jpg}} \caption{Example principal component regression p value plot of raw data generated using 450K methylation data from a published study} \label{fig:pcrdiag} \end{figure} \section{Multimodal CpGs} Function \Rfunction{nmode.mc} uses an empirical approach to identify multimodal distributed CpGs (SNP like probes). When measured in a population of people the majority of CpGs on the Illumina HumanMethylation450 BeadChip have unimodal distributions of DNA methylation values with relatively small between-person variation. However, some CpGs (typically around 10,000 in 450k array often seemingly the result of SNPs in the probe region) may have multimodal distributions of methylation values with sizeable differences between modes and large between-person variation. These multimodal distributed data are usually caused by SNP effect, problematic probe design or other unknown artifacts instead of actual methylation level and thus should be excluded from DNA methylation analysis. Researchers have often excluded CpGs based on SNP annotation information. However, because SNP annotation always depends on population origin, we found that this approach alone may exclude many well-distributed (unimodal) CpGs, while still failing to identify other multi-modal CpGs. We developed an empirical approach to identify CpGs that are obviously not uni-modally distributed, so that researchers can make more informed decisions about whether to exclude them in their particular study populations and analyses. See online supplementary materials of the following paper for an evaluation of the method using published EWAS data. Zongli Xu, Liang Niu, Leping Li and Jack A. Taylor ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip, Nucleic Acids Research, 2015 <>= nmode<- nmode.mc(beta, minN = 3, modedist=0.2, nCores = 5) @ \section{Compatibility with other related R packages} The \Rpackage{ENmix} uses the data structure provided by R \Rpackage{minfi} packages as input and output, and thus is fully compatible with the minfi package. The same data structures were also used by several other R packages, such as \Rpackage{ChAMP} and \Rpackage{wateRmelon}, so the output from ENmix functions can be easily utilized in these packages for further analysis. Here are some examples: Example 1: mixed use of minfi and ENmix functions <>= library(ENmix) #minfi functions to read in data sheet <- read.metharray.sheet(file.path(find.package("minfiData"), "extdata"), pattern = "csv$") rgSet <- read.metharray.exp(targets = sheet, extended = TRUE) #ENmix function for control plot plotCtrl(rgSet) #minfi functions to extract methylation and annotation data mraw <- preprocessRaw(rgSet) beta<-getB(mraw, "Illumina") anno=getAnnotation(rgSet) #ENmix function for fast and accurate distribution plot multifreqpoly(beta,main="Data distribution") multifreqpoly(beta[anno$Type=="I",],main="Data distribution, type I") multifreqpoly(beta[anno$Type=="II",],main="Data distribution, type II") #ENmix background correction mset<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", nCores=6) #minfi functions for further preprocessing and analysis gmSet <- preprocessQuantile(mset) bumps <- bumphunter(gmSet, design = model.matrix(~ gmSet$status), B = 0, type = "Beta", cutoff = 0.25) @ Example 2: add ENmix background correction step into ChAMP pipeline <>= library(ENmix) library(ChAMP) testDir=system.file("extdata",package="ChAMPdata") myLoad=champ.load(directory=testDir) #ENmix background correction mset<-preprocessENmix(myLoad$rgSet,bgParaEst="oob", nCores=6) #remove probes filtered by champ.load() mset=mset[rownames(myLoad$beta),] #update myLoad object with background corrected intensity data myLoad$mset=mset myLoad$beta=getB(mset) myLoad$intensity=getMeth(mset)+getUnmeth(mset) #continue ChAMP pipeline myNorm=champ.norm() @ \section{SessionInfo} <>= toLatex(sessionInfo()) @ \section{References} Zongli Xu, Liang Niu, Leping Li and Jack A. Taylor, ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip, Nucleic Acids Research, 2015 Liang Niu, Zongli Xu and Jack A. Taylor, RCP: a novel probe design bias correction method for Illumina Methylation BeadChip, Bioinformatics 2016 Zongli Xu, Jack A. Taylor, Yuet-Kin Leung, Shuk-Mei Ho and Liang Niu, oxBS-MLE: An efficient method to estimate 5-methylcytosine and 5-hydroxymethylcytosine in paired bisulfite and oxidative bisulfite treated DNA, under review. Zongli Xu, Sabine A. S. Langie, Patrick De Boever, Jack A. Taylor1 and Liang Niu, RELIC: a novel dye-bias correction method for Illumina Methylation BeadChip, in review 2016 Illumina Inc., Infinium HD Assay Methylation Protocol Guide, Illumina, Inc. San Diego, CA. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD and Irizarry RA (2014). Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays. Bioinformatics, 30(10), pp. 13631369. Pidsley, R., CC, Y.W., Volta, M., Lunnon, K., Mill, J. and Schalkwyk, L.C. (2013) A data-driven approach to preprocessing Illumina 450K methylation array data. BMC genomics, 14, 293. Teschendorff AE et. Al (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics. Johnson, WE, Rabinovic, A, and Li, C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics 2007 8(1):118-127. \end{document}