--- title: "SeSAMe User Guide" shorttitle: "sesame guide" package: sesame output: rmarkdown::html_vignette fig_width: 8 fig_height: 6 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{"0. SeSAMe User Guide"} %\VignetteEncoding{UTF-8} --- ```{r message=FALSE, warning=FALSE, include=FALSE} library(sesame) library(dplyr) ``` SeSAMe is designed to process Illumina Infinium DNA methylation data. It currently supports EPIC, HM450 and HM27 platforms. # Install SeSAMe From Bioconductor ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("sesame") ``` Development version can be installed from github ```{r, eval=FALSE} BiocManager::install('zwdzwd/sesameData') BiocManager::install('zwdzwd/sesame') ``` # The openSesame Pipeline The openSesame pipeline is composed of noob, nonlinear dye bias correction and pOOBAH, achieved through: ```{r message = FALSE, warning = FALSE} idat_dir <- system.file("extdata/", package = "sesameData") betas <- openSesame(idat_dir) ``` where `idat_dir` is the directory containing all the IDAT files (they can be present under nested sub-directories). If you want to have more granuality of control, openSesame is equivalent to ```{r message = FALSE, warning = FALSE, eval = FALSE} betas <- do.call(cbind, lapply(searchIDATprefixes(idat_dir), function(pfx) { pfx %>% readIDATpair %>% pOOBAH %>% noob %>% dyeBiasCorrTypeINorm %>% getBetas })) ``` behind the scene. ## How/Why Probes Are Masked? The `openSesame` command also takes the arguments to turn on/off probe masking ( when probe beta value measurements are replaced with NA) and adjust for stringency in detection calling. The current probe masking is constituted by two major parts: 1) Low intensity-based detection calling achieved by `pOOBAH`: This sets the p-value for each probe ([Zhou et al. 2018](https://www.ncbi.nlm.nih.gov/pubmed/30085201)). Probes with p-value higher than a threshold (default: 0.05) are masked. The default threshold can be adjusted to say, 0.1, by `pval.threshold = 0.1` in either `getBetas` or `openSesame`. 2) Probes masked for putative design issues designated in `sesameDataGet('HM450.probeInfo')$mask` ( [Zhou et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/27924034)): This masking only supports EPIC, HM450 and HM27 and is turned on by default. It can be turned off by setting `quality.mask = FALSE` for either `getBetas` or `openSesame`. ## Specify Your Own Manifest Instead of working with HM450, EPIC etc, SeSAMe also works with customed array. ```{r eval = FALSE} openSesame(idat_dir, 'custom_array_name', manifest_file) ``` In this case, one needs to provide a `platform` string, which can be any string used to reference the platform, and a `manifest_file` which is a data frame ( or tibble) with a minimum of four columns (`Probe_ID`, `M`, `U` and `col`). ``` Probe_ID M U col 1 cg14361672 7743487 51800947 R 2 cg21784030 NA 29783926 NA 3 cg13417420 27786954 5613976 G 4 cg12480843 19684581 16692916 R 5 cg05493344 NA 58754149 NA 6 cg10136773 NA 3699389 NA ``` The `col` is either `G` (stand for Green) or `R` (stand for Red) or `NA` ( stand for both in the case of Infinium II design). # Data Structure for Signal Intensity SeSAMe design includes alight-weight full exposure of internal signal intensities (essential information for users of Illumina methylation array data, as demonstrated in Zhou et al 2018), which permits sensitive and specific joint inference on copy number and DNA methylation. Central to the SeSAMe platform is the `SigSet` data structure, an S4 class with slots containing signals for six different classes of probes: 1) `II` - Type-II probes; 2) `IR` - Type-I Red channel probes; 3) `IG` - Type-I Grn channel probes; 4) `oobG` - Out-of-band Grn channel probes (matching Type-I Red channel probes in number); 5) `oobR` - Out-of-band Red channel probes (matching Type-I Grn channel probes in number); 6) `ctl` - control probes. For all save control probes, signal intensities are stored as an `Nx2` numeric matrix, with `N` representing the number of probes in the class. The two columns of the matrix represent the methylated probe intensity and the unmethylated probe intensity. (Previously, this was implemented in an R6 Reference class, `SignalSet`. The current S4 implementation in `SigSet` complies with Bioconductor guidelines, and for backwards compatibility, the `signalR6toS4` function transforms a `SignalSet` to a `SigSet`. ```{r, echo = FALSE, message = FALSE} library(sesameData) library(sesame) sset <- sesameDataGet('EPIC.1.LNCaP')$sset ``` For example, printing the SigSet directly shows its content ```{r} sset ``` ## Infinium I/II Probes Infinium-II probe signal can be accessed using the slot function `sset@II` or via the getter function ```{r} head(II(sset)) ``` ## Infinium I Out-of-band Signal Similarly, signals for Type-I probes can be accessed from `sset@IR` and `sset@IG` and out-of-band signals from `sset@oobG` and `sset@oobR`. As one can see the probe names (row names) of `IR` always coincide with the probe names (row names) of `oobG` (and vice versa). This is because the out-of-band probe signal for red channel probes is in green channel (and vice versa). ## Control Probes Lastly, Control probes are represented in a data frame with the last column holding the type of the control. ```{r} head(ctl(sset)) ``` ## Number of Beads # Functionalities SeSAMe implements stricter QC and preprocessing standards: comprehensive probe quality masking, bleed-through correction in background subtraction, nonlinear dye bias correction, stricter nondetection calling and control for bisulfite conversion based on C/T-extension probes. The package also provides convenient, performant implementations of typical analysis steps, such as the inference of gender, age, ethnicity (based on both internal SNP probes and channel-switching Type-I probes) directly from the data. This allows users to infer these common covariates if such information is not provided, and to check for potential sample swaps when it is provided. SeSAMe also provides functionality for calling differential methylation and segmented copy number. ### Read IDATs into SigSet list ```{r} ssets <- lapply( searchIDATprefixes(system.file("extdata/", package = "sesameData")), readIDATpair) ``` A simple list of "SigSet"s are returned. One can also just provide a vector of file paths prefixes (excluding `_Red.idat` and `_Grn.idat`, one prefix for a pair of IDATs) and call `readIDATpair` directly. ### Background subtraction Like many other Infinium Methylation-targeted software, SeSAMe implements the background subtraction based on normal-exponential deconvolution using out-of-band probes `noob` ([Triche et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3627582/)) and optionally with extra bleed-through subtraction. Signal bleed-through happens when measurement from one channel affects the measurement in the other channel. SeSAMe's `noobsb` further removes residual background by regressing out the green-to-red and red-to-green relationship using Type-I probes. ```{r} sset <- sesameDataGet('EPIC.1.LNCaP')$sset sset.nb <- noob(sset) sset.nb <- noobsb(sset) ``` ### Type-I channel inference Sometimes Type-I channel spec is inaccurate in the manifest. We can infer the channel using data. ```{r} sset.TypeICorrected <- inferTypeIChannel(sset) ``` ### Dye bias correction Dye bias refers to the difference in signal intensity between the two color channel. SeSAMe offers two flavors of dye bias correction: linear scaling (`dyeBiasCorr`) and nonlinear scaling (`dyeBiasCorrTypeINorm`). Linear scaling equalize the mean of all probes from the two color channel. ```{r} library(sesame) sset.dbLinear <- dyeBiasCorr(sset) qqplot( slot(sset.dbLinear, 'IR'), slot(sset.dbLinear, 'IG'), xlab='Type-I Red Signal', ylab='Type-I Grn Signal', main='Linear Correction', cex=0.5) abline(0,1,lty='dashed') ``` Residual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes. ```{r} sset.dbNonlinear <- dyeBiasCorrTypeINorm(sset) ``` Under this correction, Type-I Red probes and Type-I Grn probes have the same distribution of signal. ```{r} qqplot( slot(sset.dbNonlinear, 'IR'), slot(sset.dbNonlinear, 'IG'), xlab='Type-I Red Signal', ylab='Type-I Grn Signal', main='Nonlinear Correction', cex=0.5) abline(0,1,lty='dashed') ``` Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes. ### Get betas Beta values are defined as `methylated signal`/(`methylated signal` + `unmethylated signal`). It can be computed using `getBetas` function. The output is a named vector with probe ID as name. There are two options for `getBetas` that affects probe masking. The first is `quality.mask=TRUE/FALSE` which switches probe quality masking. The quality masking includes mapping issues, SNPs and non-uniqueness, and is described in [Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930). `nondetection.mask = TRUE/FALSE` is used to switch masking of nondetection based on detection P-value. Both masks are recommended to ensure data quality and defaulted to TRUE. ```{r} betas <- getBetas(sset) head(betas) ``` Beta values for Type-I probes can also be obtained by summing up the two in-band channel and out-of-band channel. This rescues probes with SNP hitting the extension base and hence switching color channel. More details can be found in [Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930). ```{r} betas <- getBetas(sset, sum.TypeI = TRUE) ``` For such probes, extra SNP allele frequencies can be derived by summing up methylated and umethylated alleles. ```{r} extraSNPAFs <- getAFTypeIbySumAlleles(sset) ``` ### Sample/experiment QC SeSAMe implements inference of sex, age, ethnicity. These are valuable information for checking the integrity of the experiment and detecting sample swaps. #### Sex Sex is inferred based on our curated X-linked probes and Y chromosome probes excluding pseudo-autosomal regions. ```{r} inferSex(sset) inferSexKaryotypes(sset) ``` #### Ethnicity Ethnicity is inferred using a random forest model trained based on both the built-in SNPs (`rs` probes) and channel-switching Type-I probes. ```{r} inferEthnicity(sset) ``` #### Age SeSAMe provides age regression a la the Horvath 353 model. ```{r} betas <- sesameDataGet('HM450.1.TCGA.PAAD')$betas predictAgeHorvath353(betas) ``` #### Mean intensity The mean intensity of all the probes characterize the quantity of input DNA and efficiency of probe hybridization. ```{r} meanIntensity(sset) ``` #### Bisulfite conversion control using GCT scores Infinium platforms are intrinsically robust to incomplete bisulfite conversion as non-converted probes would fail to hybridize to the target. Residual incomplete bisulfite conversion can be quantified using GCT score based on C/T-extension probes. Details of this method can be found in [Zhou et al. 2017](https://academic.oup.com/nar/article/45/4/e22/2290930). The closer the score to 1.0, the more complete the bisulfite conversion. ```{r} bisConversionControl(sset) ``` ### Probe retrieval and $\beta$-value visualization To visualize all probes from a gene ```{r, message=FALSE, fig.width=6, fig.height=5} betas <- sesameDataGet('HM450.10.TCGA.PAAD.normal') visualizeGene('DNMT1', betas, platform='HM450') ``` To visualize probes from arbitrary region ```{r, message=FALSE, fig.width=6, fig.height=5} visualizeRegion( 'chr19',10260000,10380000, betas, platform='HM450', show.probeNames = FALSE) ``` To visualize by probe names ```{r, message=FALSE, fig.width=6} visualizeProbes(c("cg02382400", "cg03738669"), betas, platform='HM450') ``` ### CNV SeSAMe performs copy number variation in three steps: 1) normalizes the signal intensity using a copy-number-normal data set; 2) groups adjacent probes into bins; 3) runs DNAcopy internally to group bins into segments. ```{r, message=FALSE, fig.width=6} ssets.normal <- sesameDataGet('EPIC.5.normal') segs <- cnSegmentation(sset, ssets.normal) ``` To visualize segmentation in SeSAMe, ```{r, message=FALSE, fig.width=6} visualizeSegments(segs) ``` ### Cell Composition Deconvolution SeSAMe estimates leukocyte fraction using a two-component model.This function works for samples whose targeted cell-of-origin is not related to white blood cells. ```{r, message=FALSE} betas.tissue <- sesameDataGet('HM450.1.TCGA.PAAD')$betas estimateLeukocyte(betas.tissue) ```