%\VignetteIndexEntry{methylPipe} %\VignetteEngine{knitr::knitr} %\VignetteKeywords{methylation, epigenetics} %\VignetteDepends{methylPipe} %\VignettePackage{methylPipe} \documentclass[11pt]{article} \usepackage[margin=2cm,nohead]{geometry} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\methylPipe}{\Rpackage{methylPipe}} \title{methylPipe} \author{Kamal Kishore, Mattia Pelizzola} \begin{document} \maketitle <>= options(width=72) @ \tableofcontents \section{Introduction} In this document you can find a brief tutorial on the \textbf{\Rpackage{methylPipe}} package for the analysis of base-pair resolution DNA methylation data. DNA methylation is a potentially heritable epigenetic modification of the genomic DNA typical of most eukaryotic organisms and critical for the regulation of gene transcription. The level of DNA methylation varies according to age, diet and environment and an appropriate control of methylation patterns is important for the onset of cellular differentiation processes and can be deregulated in diseases as cancer. In eukaryotic genomes the methyl group can be added to Cytosines in the CG, CHG and CHH sequence context (H being one of A, C or T). It is nowadays possible to generate genome-wide base-pair resolution DNA methylation maps (see Lister R et al Nature 2009). \textbf{\Rpackage{methylPipe}} is an R package that includes a series of objects and methods for a memory efficient management, query, analysis and visualization of DNA methylation data and their integration with heterogeneous data types. This package has been developed alongwith \textbf{\Rpackage{compEpiTools}} which provide functions and methods for the analysis of epigenomics data. The data package \textbf{\Rpackage{ListerEtAlBSseq}} has also been developed which consists of base resolution Bisulfite sequencing data of H1 and IMR90 cell line (Lister R et al Nature 2009). The overview section of this document will briefly walk you through the main available functionalities, while the following sections will provide additional details on a selection of available classes and methods. \section{Quick overview} A number of \textbf{classes} are defined in \Rpackage{methylPipe}: \Rclass{BSdata}, \Rclass{BSdataSet}, \Rclass{GEcollection} and \Rclass{GElist}. \begin{itemize} \item \textbf{\Rclass{BSdata}} is a reference class to store base resolution DNA methylation data generated from a high-thoughput sequencing experiment for a given biological sample. \item The \textbf{\Rclass{BSdataSet}} class allows combining DNA methylation data for several samples for the same organism. \item The \textbf{\Rclass{GEcollection}} class is used to store the DNA methylation status of a collection of genomic regions. \item The \textbf{\Rclass{GElist}} class is the list of multiple objects of class \Rclass{GEcollection}. \end{itemize} \textbf{\Rpackage{methylPipe}} consists of methods which allows analysis and visualisation of genome wide DNA methylation data. The \Rfunction{meth.call} function processes methylation information from aligned files and creates tabular data file. \Rfunction{BSprepare} processes and tabix compresses such tabular data files for creation of a \Rclass{BSdata} object for each sample. The \Rclass{BSdataSet} classe stores multiple such samples of class \Rclass{BSdata}. The \Rmethod{mCsmoothing} methods extracts DNA methylation data for a given genomic region and smooths/plots it. For methylation analysis on your regions of interest \Rclass{GEcollection} and \Rclass{GElist} acts as a repository of methylation data across various genomic ranges. \Rfunction{getCpos} and \Rfunction{getCposDensity} retrieves all potential genomic cytosine positions and their density across genomic regions, repectively. \Rfunction{mapBSdata2GRanges} allows extraction of DNA methylation information across multiple genomic ranges and \Rfunction{profileDNAmetBin} determines the absolute and relative methylation to create a \Rclass{GEcollection} object. Different methods for the detection of differentially methylated regions are implemented, according to the number of samples (see \Rmethod{findDMR} and \Rfunction{consolidateDMRs}). In addition the methylation profile can be visualised using \Rfunction{plotMeth}. Morover, integrative analysis with other epignomics dataset can be performed by using \Rfunction{heatmapdata}, \Rfunction{heatmapPlot} functions of the package \textbf{\Rpackage{compEpiTools}}. The methods that are most demanding in terms of computational resources are optimized for low memory fingerprint and multi-processor support. \Rpackage{methylPipe} includes a series of objects and methods that can be used as building blocks for the creation of pipelines for the data analysis of epigenomics data, with particular emphasis on DNA methylation, and their integration with any kind of annotation or additional data type. \section{Profiling Genome wide DNA methylation} First of all the \Rpackage{methylPipe} and the genome sequence libraries are loaded: <>= library(methylPipe) library(BSgenome.Hsapiens.UCSC.hg18) @ The DNA methylation information can be read from a text file. See the documentation of the \Rfunction{BSprepare} on how to build data suitable to populate a \Rclass{BSdata} object. Moreover, the methylation information can also be read directly from the sorted SAM file(s) generated from the BISMARK aligner. The function \Rfunction{meth.call} process the sorted SAM file. The user can specify the sequence context in which the methylation information is read from these files either "CpG" or "All". In case of whole genome Bisulfite data especially in case of non-CG methylation the number of potential methylation sites are enormous and vast majority of them are unmethylated. In order to avoid storing too much data while maintaining the ability to identify methylated, unmethylated and uncovered Cytosines, \Rpackage{methylPipe} only stores and access C positions with at least 1 mC read. Genomic regions not covered by sequencing are stored as a \Rclass{GRanges} object. The function \Rfunction{meth.call} produces methylation call text file and uncovered genomic regions file for each sample in the output folder. Finally, when profiling DNA methylation unmethylated Cs are determined as those genomic cytosines not having any methylated reads and not belogning to uncovered genomic regions. <>= file_loc <- system.file('extdata', 'test_methcall', package='methylPipe') meth.call(files_location=file_loc, output_folder=tempdir(), no_overlap=TRUE, read.context="CpG", Nproc=1) @ \Rpackage{methylPipe} adopts TABIX compressed indexing of flatfiles to reduce disk space which allows fast access. The \Rpackage{methylPipe} library includes a small subset of the first two base resolution human DNA methylomes (Lister R et al Nature 2009) for two well known human cell lines: embryonic stem cells (H1) and fetal lung fibroblasts (IMR90). The methylation data can be stored into \Rclass{BSdata} objects and then collected together within a \Rclass{BSdataSet} object. The \Rclass{BSprepare} command is not run in the following example since it requires a local copy of tabix software. Hence, for this example we use pre-generated tabix indexed files from \Rpackage{methylPipe} for H1 and IMR90. <>= file_loc <- system.file('extdata', 'H1_chr20_CG_10k.txt', package='methylPipe') #BSprepare(files_location=file_loc,output_folder=file_loc, tabix="/path-to-tabix/") uncov_GR <- GRanges(Rle('chr20'), IRanges(c(14350,69251,84185), c(18349,73250,88184))) H1data <- system.file('extdata', 'H1_chr20_CG_10k_tabix_out.txt.gz', package='methylPipe') H1.db <- BSdata(file=H1data, uncov=uncov_GR, org=Hsapiens) H1.db @ Multiple \Rclass{BSdata} objects can be stored in \Rclass{BSdataSet} object by specifying group name for each sample either as "C" (control) or "E" (Experiment): <>= IMR90data <- system.file('extdata', 'IMR90_chr20_CG_10k_tabix_out.txt.gz', package='methylPipe') IMR90.db <- BSdata(file=IMR90data, uncov=uncov_GR, org=Hsapiens) H1.IMR90.set <- BSdataSet(org=Hsapiens, group=c("C","E"), IMR90=IMR90.db, H1=H1.db) H1.IMR90.set @ The data includes the chromosome, the genomic position, strand and sequence context for each methylCytosine (mC). In addition, for each mC the number of reads with C at that genomic position and strand (supporting DNA methylation), the number of reads with T (evidence of an unmethylated C) and the binomial p-value supporting the mC call are attached. Importantly, these data are referenced into the \Rclass{BSdata} but they are not actually loaded into the memory. The \Rmethod{mCsmoothing} method allow uploading into memory the DNA methylation data for a given genomic region and to display their methylation profile over it. The smoothing can be performed by either "sum" or "mean" of methylation level for each regions. \begin{center} <>= gr <- GRanges("chr20",IRanges(1,5e5)) sres <- mCsmoothing(H1.db, gr, Scorefun='sum', Nbins=50, plot=TRUE) @ \end{center} \section{Descriptive statistics of DNA methylation} \Rpackage{methylPipe} allows checking the basic stats about the methylation data such as range, mean and quantile distribution of methylation and assess sample similarity with correlation and clustering analysis. The \Rfunction{methstats} method computes pairwise correlation coefficients (Pearson) between the methylation profiles across all the samples in \Rclass{BSdataSet} object. It outputs scatter plot matrix of correlation coefficients. Finally, it performs (eucledian distance based) hierarchical clustering of samples and outputs the dendrogram. In the example below, the analysis is performed on \Rclass{BSdataSet} object of artificially replicated H1 and IMR90. \begin{center} <>= stats.set <- BSdataSet(org=Hsapiens, group=c("C","C","E","E"), IMR_1=IMR90.db, IMR_2=IMR90.db, H1_1=H1.db,H1_2=H1.db) stats_res <- methstats(stats.set,chrom="chr20",mcClass='mCG', Nproc=1) stats_res @ \end{center} \section{Profiling DNA methylation on genomic ranges} \Rpackage{methylPipe} allows profiling DNA methylation over a set of genomic regions and determining their absolute or relative methylation profiles. Absolute methylation is here defined as the genomic density of mC calls per bp, while relative methylation is the proportion of density of methylated genomic sites over the density of potential methylation sites. The \Rfunction{mapBSdata2GRanges} method retrieves mC calls given a \Rclass{BSdata} object of a sample and a set of genomic regions. In the following example we will extract the CG sites for a set of genomic regions. In particular we will consider a \Rclass{GRanges} object of genomic regions that spans 2Kb upstream and downstream the TSS of the transcript ids on the first 500Kb of chromosome 20: <>= gr_file <- system.file('extdata', 'GR_chr20.Rdata', package='methylPipe') load(gr_file) resmC <- mapBSdata2GRanges(GenoRanges=GR_chr20, Sample=H1.db, context='CG') head(resmC[[4]]) @ To profile DNA methylation on genomic regions of interest the \Rclass{GEcollection} class (collection of genomic regions) is used. These genomic regions can be any regions of interest such as one or more gene loci, promoter regions, a set of transcription factor binding sites, a set of regions enriched by some histone mark or a set of transposable elements. Most of these regions can be usually retrieved from either publications or public databases like the UCSC Genome Browser and its Table Browser tools (see \Rpackage{compEpiTools} for methods for generation/manipulation of these regions). The \Rclass{GEcollection} class has data slots and methods tailored for storing and analyzing DNA methylation data. It is an extension of the \Rclass{RangedSummarizedExperiment} class from the \Rpackage{SummarizedExperiment} package. For the creation of the \Rclass{GEcollection}, the \Rfunction{getCpos} and \Rfunction{mapBSdata2GRanges} methods are called automatically to populate its various data slots. Further arguments of the latter function allows to filter the mC sites based on the depth of sequencing, number of reads with C at that genomic position and their binomial p-value. The \Rfunction{profileDNAmetBin} method acts as a wrapper for these methods to profile absolute and relative DNA methylation for a set of regions within a \Rclass{GEcollection}. In particular for each genomic region, and possibly each bin of it, the absolute methylation density (mC/bp), the density of possible methylation sites (C/bp) and the relative methylation level (mC/C) will be determined. For example, in case of the CG context: mCG/bp, CG/bp and mCG/CG densities will be stored for each bin of each genomic region in the binmC, binC and binrC data slots, respectively. <>= gec.H1 <- profileDNAmetBin(GenoRanges=GR_chr20, Sample=H1.db, mcCLASS='mCG', nbins=3) binmC(gec.H1)[4:5,] binC(gec.H1)[4:5,] binrC(gec.H1)[4:5,] @ \Rclass{GEcollection} objects can be \textbf{subsetted and combined}. Subset can be useful to work only on the regions on a particular strand and/or chromosome or chromosomal region. <>= gec1 <- gec.H1[start(gec.H1) < 153924] gec2 <- gec.H1[start(gec.H1) > 153924] @ Multiple \Rclass{GEcollection} objects can be saved into a \Rclass{GElist} object. This can be convenient if there are many of them and if one would like to iteratively apply the same method to them, for example profiling DNA methylation for their genomic regions in the same sample. <>= gecIMR_file <- system.file('extdata', 'gec.IMR90.Rdata', package='methylPipe') load(gecIMR_file) gel <- GElist(gecIMR90=gec.IMR90, gecH1=gec.H1) print(names(gel)) @ The \Rclass{GElist} objects can be visualized using \Rfunction{plotMeth}. This allows methylation data of various samples to be displayed together with the annotation information for the genomic regions of interest. Morover, various epigenomics data tracks can also be visualized together with the methylation information. See the documentation of the \Rfunction{plotMeth} for more details. <>= library(TxDb.Hsapiens.UCSC.hg18.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene gel <- GElist(gecIMR90=gec.IMR90[1:10], gecH1=gec.H1[1:10]) plotMeth(gel, colors=c("red","blue"), datatype=c("mC","mC"), yLim=c(.025, .025), brmeth=list(IMR90=IMR90.db, H1=H1.db), mcContext="CG", transcriptDB=txdb, chr="chr20", start=14350, end=277370, org=Hsapiens) @ \section{Differential DNA methylation} The \Rmethod{findDMR} function can be used to identify differentially methylated regions (DMRs) for any sequence context (mCG, mCHG or mCHH). DMRs can be identified comparing the mC methylation levels (the proportion of methylated over total reads for a set of mC) between \textbf{two samples} or within a group of \textbf{N samples}, with the Wilcoxon or the Kruskal-Wallis non parametric test, respectively. Only cytosine positions covered in both the samples (pairwise analysis) or all the samples (multiple sample analysis) are considered for DMR identification. The algorithm uses dynamic sliding window approach which identifies DMRs depending on methylated cytosine frequency and their relative distance. Moreover, to perform regions specific analysis (such as promoter or CpG island centric) genomic regions could be specified and the alogirithm will report DMRs only within those regions, if any. In a first step, all the evaluated regions are reported in the output with their corresponding statistical significance. The methylation difference is determined in terms of MethDiffPerc (percentage difference between the mean methylation of experiment and control) and log2Enrichment (log2 of mean methylation of experiment over control). <>= DMRs <- findDMR(object= H1.IMR90.set, Nproc=1, ROI=GR_chr20, MCClass='mCG', dmrSize=6, dmrBp=800) head(DMRs) @ The \Rfunction{consolidateDMRs} function can be used to multiple-testing correct the DMRs and consolidate them according to their relative distance, type of DMRs and thresholds of p-value/Methylation difference/log enrichment. A final \Rclass{GRanges} object with the set of DMRs including p-value, methylation difference and log enrichment is provided: <>= hyper.DMRs.conso <- consolidateDMRs(DmrGR=DMRs, pvThr=0.05, GAP=100, type="hyper") hyper.DMRs.conso[1:4] @ \newpage \section{Session Information} <>= sessionInfo() @ \end{document}