%\VignetteIndexEntry{The bigmelon Package} %\VignetteKeywords{Illumina DNA methylation 450k array normalization epic} %\VignettePackage{bigmelon} % Sweave Vignette Template derived bt Leo Schalkwyk from %% www.bioconductor.org/help/course-materials/2010/AdvancedR/BuildPackage.pdf \documentclass[11pt]{article} <>= BiocStyle::latex() @ %\newcommand{\Rfunction}[1]{{\texttt{#1}}} %\newcommand{\Robject}[1]{{\texttt{#1}}} %\newcommand{\Rpackage}[1]{{\textit{#1}}} %\newcommand{\Rclass}[1]{{\textit{#1}}} \title{The bigmelon Package} \author{Tyler Gorrie-Stone, Ayden Saffari, Karim Malki and Leonard C Schalkwyk} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{About} The \Rpackage{bigmelon} package for Illumina methylation data provides a fast and convenient way to apply a variety of different normalisation methods to your data, such as those previously described by Pidsley et al. [1] and implemented in the package \Rpackage{wateRmelon}. Bigmelon extends the capabilities of the \Rpackage{wateRmelon} to higher dimensional data, allowing larger data sets containing many more arrays to be processed simultaneously, while also providing convenient storage for data for future access and sharing with peers. This has been achieved by adapting methods from the \Rpackage{gdsfmt} package, originally designed for handling SNP data, which through efficient memory use and management is able to overcome the memory overheads associated with handling big data in \R{}.\\ \section{Quick-Start} This section will briefly describe how to import, access and 'analyse' data using bigmelon and the gds file format. All the functions described in this section are described in full in the later portions of this document. <>= library(bigmelon) data(melon) # Convert methylumiset or methylset objects to .gds gfile <- es2gds(melon, 'melon.gds') # 'melon.gds' file created in current working directory dir() # Access data with: betas(gfile) # OR index.gdsn(gfile, 'betas') # Get betas with '[' notation betas(gfile)[1:5,1:5] # Or call from gfile gfile[1:5, 1:5, node = 'methylated'] # Preprocess data with pfilter and dasen pfilter(gfile) dasen(gfile) # Note you do not have to store the output # because the functions make changes to disk # Use apply.gdsn (or clusterApply.gdsn) to perform apply-like operations meth <- methylated(gfile) apply.gdsn(meth, 2, median, as.is='double', na.rm = TRUE) # Calculating Horvath's epigenetic ages with agep data(coef) agep(gfile, coeff=coef) # Close .gds file closefn.gds(gfile) # Open a .gds file gfile <- openfn.gds('melon.gds') @ <>= closefn.gds(gfile) unlink('melon.gds', force = TRUE) @ \section{Installation} \Rpackage{bigmelon} works with existing \Bioconductor{} packages and therefore has a number of dependencies. The install.packages() should install the required packages automatically, but should this not succeed for any reason, the following commands can be used to install these manually: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install('wateRmelon', 'gdsfmt') @ Install the latest package from a local copy (located in the current working directory of your R session): <>= install.packages('bigmelon_0.99.11.tar.gz', repos = NULL, type = 'source') @ \section{Using Bigmelon} \subsection{Loading Data to gds format} There are multiple methods that can be used to load in data into a gds object. These can either be from GenomeStudio final report text files or from raw binary (.IDAT) files (preferred method). \subsubsection{IDAT Files} IDAT files are the raw intensities obtained from DNA methylation microarrays and are split into two files per sample (one Red Channel and one Green Channel). These are read into R using \Rpackage{minfi} or \Rpackage{methylumi}. In \Rpackage{bigmelon} IDAT files can be read in using the \Rfunction{iadd} or \Rfunction{iadd2} functions. These functions pass to \Rfunction{methylumIDATepic} (from \Rpackage{wateRmelon}) to read in the data - although it should be noted that the full annotation of the features are not included when reading from idats. And can be added to the \Rfunction{iadd} and \Rfunction{iadd2} function differently. \Rfunction{iadd} will take a vector of barcodes, \Rfunction{iadd2} will accept a directory pathway and extract all IDAT files within the specified path. You will need to be in the same directory as the idat files for this to work. \Rfunction{iadd2} also has the functionality to read IDAT files in chunks. This is useful if you are attempting to read in a lot of data at once and do not have sufficient memory on your workstation to support this. \textbf{This method is recommended if you are using a workstation bounded by memory limitations.} <>= # read in an IDAT file with barcode 'sentrixid_rnncnn' gfile <- iadd('sentrixid_rnncnn', gds = 'melon.gds') gfile <- iadd2('Data/IDATLocations/dataset', gds = 'melon.gds', chunksize = 100) @ \subsubsection{ExpressionSet Objects} You may have been given a \Rclass{MethylumiSet}, \Rclass{RGChannelSet} or \Rclass{MethylSet} instead of idats. These can be passed to \Rfunction{es2gds} to convert the data into a \Rclass{gds.class}. Such circumstance can arise when one wants to use a particular normalisation methodology only available to specific packages. Henceforth we will convert the ExpressionSet data object 'melon' packaged within \Rpackage{wateRmelon} to demonstrate further down stream analysis. <>= data(melon) gfile <- es2gds(melon, 'melon.gds') @ \subsubsection{Text Files} To read in text files, the \Rfunction{methyLumiR} function from \Rpackage{methylumi} can be used. If using this method, we suggest saving the unnormalised, uncorrected version of the data. We also recommend keeping the barcode names (SentrixID\_RnnCnn) as the column headers or in a separate dataframe. Alternatively you can use the function \Rfunction{finalreport2gds} that will output a .gds object. <>= library(methylumi) # read Illumina methylation data into a MethyLumiSet object melon <- methyLumiR('finalreport.txt') # read Illumina methylation final report into a gds.class object. gfile <- finalreport2gds('finalreport.txt', gds='melon.gds') @ Assuming you have used \Rfunction{methyLumiR} you would then need to convert the resultant object to a Genomic Data Structure (GDS) data file. This can also be achieved using the function \Rfunction{es2gds} which can convert \Rclass{MethyLumiSet} objects (from \Rpackage{methylumi}, \Rclass{RGChannelSet} and \Rclass{MethylSet} objects (from \Rpackage{minfi}) as described above. <>= # convert a MethyLumiSet object to a gds.class object gfile <- es2gds(melon, 'melon.gds') @ \section{Opening and Closing gds files} Now that you have created a .gds file you can continue working on it within the same R session. Or close the file for later use, or to share with others. The functions \Rfunction{openfn.gds} and \Rfunction{closefn.gds} are used. <>= # Closing File closefn.gds(gfile) # Opening File gfile <- openfn.gds('melon.gds') @ Recommended: See \Rfunction{?openfn.gds} Generally only one instance of a gds can be opened per R session, this can be disabled by setting allow.fork and allow.duplicate arguments in \Rfunction{openfn.gds} to TRUE. \section{Exploring the \Rclass{gds.class}} The resulting \Rclass{gds.class} may be different to any other data-structure you have previously used. Simply, it resembles an S4 object but instead of slots there are nodes with a \Rclass{gdsn.class} class. To access these, specialized functions need to be used as common R functionality ({@} and {\$}) are not yet existent for these objects. When printing the \Rclass{gds.object} we are given an almost directory-like output. <>= print(gfile) @ From this output we can see some useful information about our object such as the file name, total object size and the name, size and type of each node within the gds object. Typically a \Rpackage{bigmelon} gds file is comprised of some common nodes these being: betas, methylated, unmethylated, pvals, fData, pData, and History. If you are familiar with the \Robject{MethyLumiSet} this will be immediately familiar to you. If not a brief description is as follows \begin{itemize} \item{ betas: The ratio between Methylated and Unmethylated intensities - most commonly used for analysis} \item{ methylated: The methylated intensities} \item{ unmethylated: The unmethylated intensities} \item{ pvals: The detection P values of the array } \item{ NBeads(not shown): The total beadcount (per feature) on the array.} \item{ fData: The feature data, which contains all relavent biological information to CpG probes within the micro-array (rows).} \item{ pData: The pheno data, which contains information relevant to biological samples (columns).} \item{ history: Brief description of operations applied to the file.} \end{itemize} To access the data represented in the object we need to use the function \Rfunction{index.gdsn} <>= index.gdsn(gfile, 'betas') class(index.gdsn(gfile, 'betas')) # Access nodes with additional nodes inside index.gdsn(gfile, 'fData/TargetID') @ Alternatively, there are some accessors written for the common object names see \Rfunction{?'bigmelon-accessors'}. Majority of these accessors will pass to \Rfunction{index.gdsn} but if the object is small enough the accessor may read the object into memory without further indexing. <>= betas(gfile) class(betas(gfile)) @ If the directory-tree output is hard to interpret or you wish to list all available nodes the function \Rfunction{ls.gdsn} allows you to view the contents of a gds file in a vector. <>= ls.gdsn(gfile) # Look into nodes with additional nodes ls.gdsn(index.gdsn(gfile, 'fData')) @ \subsection{Exploring the \Rclass{gdsn.class}} You may ask the question - 'How do I access \textbf{that} juicy data?'. To do this, the functions \Rfunction{read.gdsn} and \Rfunction{readex.gdsn} are used. \Rfunction{read.gdsn} will load the entire object represented in a gdsn.class object into memory. While \Rfunction{readex.gdsn} allows you to specify a subset to load into memory. <>= # Call a gdsn.class node anode <- betas(gfile) anode class(anode) # All data dat <- read.gdsn(anode) dim(dat) head(dat) # Subset! datsub <- readex.gdsn(anode, sel = list(1:5, 1:3)) dim(datsub) datsub @ You may immediately notice that the rownames and column names of the matrix are missing. This is an unfortunate side-effect of using \Rfunction{read.gdsn} because such information is not stored within the specified gds node. However within \Rpackage{bigmelon} we have written a wrapper-function for \Rfunction{read.gdsn} (and \Rfunction{readex.gdsn}) to load data into R. This is achieved using \Rfunction{'['}. The purpose of this is to enable similar indexing operations that most will be familiar with. <>= # Re-using node from previous example anode datsub <- anode[1:5,1:3] dim(datsub) datsub # Additionally, the row and col names can be disabled anode[1:5, 1:3, name = FALSE] @ There are a few more tricks that are possible in \Rpackage{bigmelon} that we will briefly explore here. Most of the indexing tricks that can be implements on matrices can be performed on gdsn.nodes. <>= # Logical Indexing anode[1:5,c(TRUE,FALSE,FALSE)] # Ordering calls anode[c(5,9,1,500,345), c(8,4,1,3)] # Indexing by characters (and drop functionality) anode[c('cg00000029', 'cg00000236'), '6057825008_R02C01', drop = FALSE] # Loading entire data (no indexing) dat <- anode[ , ] # Not recommended for large data. dim(dat) @ Additionally it is possible to call a gds node from a gds.class within the \Rfunction{'['} indexing. This is particularly useful if you have a foreign matrix in your gds object with the name "foobar", it will be possible to implement the below structures to retrieve specific data. <>= gfile[1:5, 1:3, node = 'betas', name = TRUE] gfile[1:5, 1:3, node = 'methylated', name = TRUE] @ As a brief side note, the row and column names are still stored within the gds data file. Located at the bottom of each gds data file will be a node labelled as "paths". This contains a string to where the row and column names are stored. These are determined by default upon the creation of the gds data file but in events where they are incorrect they can be corrected with the \Rfunction{redirect.gds}. <>= read.gdsn(index.gdsn(gfile, "paths")) head(read.gdsn(index.gdsn(gfile, "fData/TargetID"))) head(read.gdsn(index.gdsn(gfile, "pData/sampleID"))) @ \section{Preprocessing} \subsection{Quality Control} Prior to data analysis, you may find it is necessary to perform some quality control and normalization. Within bigmelon, we have some functions can assist with the QC but you can use whatever functions you like. Typical workflows involve visualizing raw intensities: <>= rawmet <- methylated(gfile)[,] rawume <- unmethylated(gfile)[,] @ <>= boxplot(log(rawmet), las=2, cex.axis=0.8) @ <>= boxplot(log(rawume), las=2, cex.axis=0.8) @ Alternatively it is possible to used some highly specialized functions available within \Rpackage{bigmelon}. <>= rawbet <- betas(gfile)[,] @ <>= outlyx(rawbet, plot = TRUE) @ If the data is too large to load into memory, one can use the \Rpackage{bigmelon} method which determines outliers with a small subset of data. <>= outlyx(gfile, plot = TRUE, perc = 0.01) @ Filtering probes/features by detection p-values also provides another straightforward approach for removing both failed samples and probes. The \Rfunction{pfilter} function discards samples with more than 1 \% of probes above .05 detection p-value threshold, and probes with any samples with beadcount under 3 or more than 1\% above the p-value threshold. \textbf{n.b.} This will perform irreversible subsetting procedures onto the gds file and will \textbf{not} work if the gds is in read mode. <>= pfilter(gfile) @ Alternatively if you do not wish to perform subsetting at this time one can use pfilter.gds to get the list of failed probes and use those for subsetting at a later time. \subsection{Backing Up/Storing Raw data} Before performing any function that will noticeably change the data, you may want to create a physical back-up of the gds file so you do not have to retrace your steps incase you lose progress. The \Rfunction{backup.gdsn} function serves as an easy way to copy a node you may be interested in to your gds file incase you need it for later. <>= backup.gdsn(gds = NULL, node = index.gdsn(gfile, 'betas')) ls.gdsn(index.gdsn(gfile, 'backup')) @ Alternatively you can create a new gdsfile to store the copy or use the \Rfunction{copyto.gdsn}. <>= f <- createfn.gds('melon2.gds') backup.gdsn(gds = f, node = index.gdsn(gfile, 'betas')) f copyto.gdsn(node = f, source = index.gdsn(gfile, 'betas'), name = 'betacopy') f copyto.gdsn(node = gfile, source = index.gdsn(gfile, 'betas'), name='betacopy') # Close File closefn.gds(f) @ <>= unlink('melon2.gds') @ \subsection{Normalization} Within \Rpackage{bigmelon} there are numerous normalisation methods that can be used. The method \Rfunction{dasen} will work well for most data sets. \textbf{n.b.} This will perform irreversible procedures on the data. And will replace raw intensities with the normalised ones. This will not work if gds file is in read mode. <>= dasen(gfile) # Alternatively it is possible to store normalized betas to a separate node # If you want to keep the raw data dasen(gfile, node="normbeta") index.gdsn(gfile, "normbeta") @ Due to how the normalisation process is broken down within \Rpackage{bigmelon} there is only ever a small amount of memory required throughout data analysis. For example when attempting to process ~4000 EPIC array samples (>850,000 features), totalling around 28Gb of data. Simple quantile normalisation procedures quickly use up all available memory to attempt such feat. Whereas within bigmelon, the same analyses uses considerably less memory and (in this circumstance) provide a 1000 fold decrease in memory use. \includegraphics{logmemuse.pdf} \section{Analysis} While we cannot recommend any advice about how to perform your statistical analysis we will demonstrate how to make the most out of the \Rpackage{bigmelon} package. Within \Rpackage{gdsfmt} there are many functions written that are specialized for gds files. Notably the \Rfunction{apply.gdsn} function is particularly useful as it will perform functions upon specified margins efficiently instead of loading the entire object into R to perform analysis. <>= # Example of apply.gdsn apply.gdsn(betas(gfile), margin = 2, as.is='double', FUN = function(x,y){ mean(x, na.rm=y) }, y = TRUE) @ Additional analyses that could be useful to an EWAS in particular are age prediction and estimating cell types, these can be done with \Rfunction{agep} and \Rfunction{estimateCellCounts.gds} <>= # Age Prediction data(coef) # Load up a set of coefficient (horvaths) agep(gfile, coef) # Alternatively agep(betas(gfile), coef) # or index.gdsn(gfile, 'foobar') for a different matrix @ You can define your own functions to supply as an argument to FUN. Please do explore \Rfunction{apply.gdsn} as it is extremely versatile and it can additionally store outputs straight the a gds node if needed. There will ofcourse be some analyses that may not be amenable to high dimensional data but if analysis can be broken down into column/row wise operations then it is possible. Currently, all available methods within \Rpackage{wateRmelon} with the exception of \Rfunction{seabi}, \Rfunction{swan}, \Rfunction{tost}, \Rfunction{fuks} and \Rfunction{BMIQ} have been optimised for memory usage. \section{Back-Porting} Should you find it necessary to convert your gds object back into memory (perhaps for some specialised analyses) you can use the functions \Rfunction{gds2mlumi} and \Rfunction{gds2mset} which will build a \Robject{MethyLumiSet} object and \Robject{MethylSet} object in your environment. <>= gds2mlumi(gfile) gds2mset(gfile, anno="450k") @ \section{Adding additional objects to gds objects} Often, it will be necessary to make use of a transformation while additionally preserving the original betas. This can be done using the add.gdsn which is described in the gdsfmt vignette in great detail. \section{Finishing an R session} As this workflow is in its infancy there are some issues that have yet to be ironed out. Notably there have been observed instances of data-loss when connection to a gds file has been interrupted without proper closure using \Rfunction{closefn.gds}. As such it is \textbf{imperative} that once you are ready to exit R, you must close the connection to the gds file and then exit R. <>= # Closing the connection closefn.gds(gfile) @ <>= unlink('melon.gds', force = TRUE) @ \section{Session Info} <>= sessionInfo() @ \section{References} [1] Pidsley R, Wong CCY, Volta M, Lunnon K, Mill J, Schalkwyk LC (2013) A data-driven approach to preprocessing Illumina 450K methylation array data. BMC genomics, 14(1), 293. [2] Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28, 3326-3328. \end{document} % R CMD Sweave tut.Rnw % R CMD texi2dvi --pdf --clean tut.tex % R CMD Stangle tut.rnw