%\VignetteIndexEntry{An Introduction to the REMP Package}
%\VignetteKeywords{DNA Methylation, repetitive element, prediction}
%\VignettePackage{REMP}

\documentclass{article}

\usepackage{listings}
\usepackage{times}
\usepackage{hyperref}
\usepackage{biblatex}

\textwidth=6.5in
\textheight=9in
\parskip=.1cm
\oddsidemargin=0in
\evensidemargin=0in
\headheight=-.6in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpkg}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

<<echo=FALSE>>=
library(knitr)
options(width=72)

listing <- function(x, options) {
  paste("\\begin{lstlisting}[basicstyle=\\ttfamily,breaklines=true]\n",
    x, "\\end{lstlisting}\n", sep = "")
}
knit_hooks$set(source=listing, output=listing)
@

\title{An Introduction to the \Rpkg{REMP} Package}
\author{Yinan Zheng}
\date{\today}

\begin{document}

\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\section{Introduction}

\Rpkg{REMP} predicts DNA methylation of locus-specific repetitive elements (RE) by learning surrounding genetic and epigenetic information. \Rpkg{REMP} provides genomewide single-base resolution of DNA methylation on RE that is difficult to measure directly using array-based or sequencing-based platforms, which enables epigenome-wide association study (EWAS) and differentially methylated region (DMR) analysis on RE. \Rpkg{REMP} also provides handy tool to extract methylation data of CpGs that are located within RE sequences.

\Rpkg{REMP} supports both Illumina methylation BeadChip array platforms (450k and EPIC) and sequencing platforms (e.g. TruSeq Methyl Capture EPIC). Both genome build hg19 and hg38 are supported.

\section{Installation}
Install \Rpkg{REMP} (release version):

<<bioconductorREMPrelease, eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("REMP")
@

To install devel version:

<<bioconductorREMPdev, eval=FALSE>>=
library(devtools)
install_github("YinanZheng/REMP")
@

Load \Rpkg{REMP} into the workspace:

<<loadREMP, message=FALSE, eval=TRUE>>=
library(REMP)
@

\section{REMP: Repetitive Element Methylation Prediction}
Currently \Rpkg{REMP} supports Human (hg19/hg38) Alu, LINE-1 (L1), and Long Terminal Repeat (LTR) (including endogenous retroviruses, ERV) repetitive element (RE) methylation prediction using Illumina 450k/EPIC array or sequencing platform.

\subsection{Groom methylation data}
Appropriate data preprocessing including quality control and normalization of methylation data are recommended before running \Rpkg{REMP}. Many packages are available to carry out these data preprocessing steps, for example, \Rpkg{minfi}, \Rpkg{wateRmelon}, and \Rpkg{methylumi}.

\Rpkg{REMP} is trying to minimize the requirement of the methylation data format. Users can maintain the methylation data in \Rclass{RatioSet} or \Rclass{GenomicRatioSet} object offered by \Rpkg{minfi}, \Rclass{data.table}, \Rclass{data.frame}, \Rclass{DataFrame}, or \Rclass{matrix}. Users can input either beta value or M-value. There are only two basic requirements of the methylation array data (450k/EPIC):
\begin{enumerate}
  \item Each row should represent CpG probe and each column should represent sample.
  \item The row names should indicate Illumina probe ID (i.e. cg00000029).
\end{enumerate}

However, there are some other common data issues that may prevent \Rpkg{REMP} from running correctly. For example, if the methylation data are in beta value and contain zero methylation values, logit transformation (to create M-value) will create negative infinite value; or the methylation data contain \Rcode{NA}, \Rcode{Inf}, or \Rcode{NaN} data. To tackle these potential issues, \Rpkg{REMP} includes a handy function \Rfunction{grooMethy} which can help detect and fix these issues. We highly recommend to take advantage of this function:

<<grooMethy, eval=TRUE>>=
# Get GM12878 methylation data (450k array)
GM12878_450k <- getGM12878('450k') 
GM12878_450k <- grooMethy(GM12878_450k)
GM12878_450k
@

For zero beta values, \Rfunction{grooMethy} will replace them with smallest non-zero beta value. For one beta values, \Rfunction{grooMethy} will replace them with largest non-one beta value. For \Rcode{NA}/\Rcode{NaN}/\Rcode{Inf} values, \Rfunction{grooMethy} will treat them as missing values and then apply KNN-imputation to complete the dataset. If the imputed value is out of the original range (which is possible when \Rcode{imputebyrow = FALSE}), mean value will be used instead. Warning: imputed values for multimodal distributed CpGs (across samples) may not be correct. Please check package \Rpkg{ENmix} to identify the CpGs with multimodal distribution. 

For sequencing data, the users only need to prepare a methylation data matrix (row = CpGs, column = samples). The corresponding CpG location information (either in hg19 or hg38) should be prepared in a separate \Rclass{GRanges} object and provide it to the \Rfunarg{Seq.GR} argument in \Rfunction{grooMethy}. For an example of \Rfunarg{Seq.GR}, please run:

<<SeqGR, eval=TRUE>>=
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
getLocations(IlluminaHumanMethylation450kanno.ilmn12.hg19)
@

Note that the row names of the CpGs in \Rfunarg{Seq.GR} can be \Rcode{NULL}. 

\subsection{Prepare annotation data}
To run \Rpkg{REMP} for RE methylation prediction, users first need to prepare some annotation datasets. The function \Rfunction{initREMP} is designed to do the job.

Suppose users will predict Alu methylation using Illumina 450k array data:

<<remparcel, eval=TRUE>>=
data(Alu.hg19.demo)
remparcel <- initREMP(arrayType = "450k",
                     REtype = "Alu",
                     annotation.source = "AH",
                     genome = "hg19",
                     RE = Alu.hg19.demo,
                     ncore = 1)
remparcel
@

For demonstration, we only use 500 selected Alu sequence dataset which comes along with the package (\Robject{Alu.hg19.demo}). We specify \Rfunarg{RE = Alu.hg19.demo}, so that the annotation dataset will be generated for the 500 selected Alu sequences. Most of the time, specifying \Rfunarg{RE} is not necessary, as the function will fetch the complete RE sequence dataset from package \Rpkg{AnnotationHub} using \Rfunction{fetchRMSK}. Users can also use this argument \Rfunarg{RE} to provide customized RE dataset.

\Rfunarg{annotation.source} allows the users to switch the source of the annotation databases, including the RefSeq Gene annotation database and RepeatMasker annotation database. If \Rfunarg{annotation.source = "AH"}, the database will be obtained from the AnnotationHub package. If \Rfunarg{annotation.source = "UCSC"}, the database will be downloaded from the UCSC website http://hgdownload.cse.ucsc.edu/goldenpath. The corresponding build (\Rfunarg{"hg19"} or \Rfunarg{"hg38"}) can be specified in the argument \Rfunarg{genome}. Most of the time \Rcode{"hg19"} is used for array data. But if \Rcode{"hg38"} is specified, the function will liftover the CpG probe location information to \Rcode{"hg38"} and obtain annotation databases in \Rcode{"hg38"}.

If \Rfunarg{arrayType = "Sequencing"}, users should provide the genomic location information of the CpGs in a \Rclass{GRanges} object to \Rfunarg{Seq.GR}. Note that the genome build of \Rfunarg{Seq.GR} provided should match the genome build specified in \Rfunarg{genome}.

All data are stored in the \Rclass{REMParcel} object:

<<saveParcel, eval=TRUE>>=
saveParcel(remparcel)
@

It is recommended to specify a working directory using argument \Rfunarg{work.dir} in \Rfunction{initREMP} so that the annnotation data generated can be re-used. Without specifying working directory, the annotation dataset will be created under the temporal directory \Rcode{tempdir()} by default. Users can also turn on the \Rfunarg{export} argument in \Rfunction{initREMP} to save the data automatically.

\subsection{Run prediction}

Once the annotation data are ready, users can pass the annotation data parcel to \Rfunction{remp} for prediction:

<<rempredict, eval=TRUE>>=
remp.res <- remp(GM12878_450k, 
                 REtype = 'Alu', 
                 parcel = remparcel, 
                 ncore = 1, 
                 seed = 777)
@

If \Rfunarg{parcel} is missing, \Rfunction{remp} will then try to search the \Rclass{REMParcel} data file in the directory indicated by \Rfunarg{work.dir}. If \Rfunarg{work.dir} is also missing, \Rfunction{remp} will try to search the REMParcel data file in the temporal directory \Rcode{tempdir()}.

By default, \Rfunction{remp} uses Random Forest (\Rfunarg{method = 'rf'}) model (package \Rpkg{ranger} for fast implementation) for prediction. Random Forest model is recommended because it offers more accurate prediction results and it automatically enables Quantile Regression Forest (Nicolai Meinshausen, 2006) for prediction reliability evaluation. \Rfunction{remp} constructs predictors to carry out the prediction. For Random Forest model, the tuning parameter \Rfunarg{param = 6} (i.e. \Rfunarg{mtry} in \Rpkg{ranger} or \Rpkg{randomForest}) indicates how many predictors will be randomly selected for building the individual trees. The performance of random forest model is often relatively insensitive to the choice of \Rfunarg{mtry}. Therefore, auto-tune will be turned off using random forest and \Rfunarg{mtry} will be set to one third of the total number of predictors. It is recommended to specify a seed for reproducible prediction results.

Besides random forest, \Rfunction{remp} provides other machine learning engines for users to explore, including Extreme Gradient Boosting, SVM with linear kernel, and SVM with radial kernel).

\Rfunction{remp} will return a \Rclass{REMPset} object, which inherits Bioconductor's \Rclass{RangedSummarizedExperiment} class:

<<rempprint, eval=TRUE>>=
remp.res

# Display more detailed information
details(remp.res)
@

Prediction results can be obtained by accessors: 

<<rempaccessors, eval=TRUE>>=
# Predicted RE-CpG methylation value (Beta value)
rempB(remp.res)

# Predicted RE-CpG methylation value (M value)
rempM(remp.res)

# Genomic location information of the predicted RE-CpG
# Function inherit from class 'RangedSummarizedExperiment'
rowRanges(remp.res)

# Standard error-scaled permutation importance of predictors
rempImp(remp.res)

# Retrive seed number used for the reesults
metadata(remp.res)$Seed
@

Trim off less reliable predicted results:

<<remptrim, eval=TRUE>>=
# Any predicted CpG values with quality score less than
# threshold (default = 1.7) will be replaced with NA. 
# CpGs contain more than missingRate * 100% (default = 20%) 
# missing rate across samples will be discarded. 
remp.res <- rempTrim(remp.res, threshold = 1.7, missingRate = 0.2)
details(remp.res)
@

(Optional) Aggregate the predicted methylation of CpGs in RE by averaging them to obtain the RE-specific methylation level:

<<rempaggregate, eval=TRUE>>=
remp.res <- rempAggregate(remp.res, NCpG = 2)
details(remp.res)
@

Aggregating CpGs in the same RE for RE-level methylation data is beneficial because 1) it greatly reduces the data dimension for downstream analysis and 2) it may produce more robust RE methylation estimation. Note that by default, RE with 2 or more predicted CpG sites will be aggregated. Therefore, the downside of doing this is the reduced coverage of RE. The assumption of doing this is the CpG methylation level within each RE are similar.

To add genomic regions annotation of the predicted REs:

<<rempdecodeAnnot, eval=TRUE>>=
# By default gene symbol annotation will be added 
remp.res <- decodeAnnot(remp.res)
rempAnnot(remp.res)
@

Seven genomic region indicators will be added to the annotation data in the input \Rclass{REMProduct} object:

\begin{itemize}
\item InNM: in protein-coding genes (overlap with refSeq gene's "NM" transcripts + 2000 bp upstream of the transcription start site (TSS))
\item InNR: in noncoding RNA genes (overlap with refSeq gene's "NR" transcripts + 2000 bp upstream of the TSS)
\item InTSS: in flanking region of 2000 bp upstream of the TSS. Default upstream limit is 2000 bp, which can be modified globally using \Rfunction{remp\_options}
\item In5UTR: in 5'untranslated regions (UTRs)
\item InCDS: in coding DNA sequence regions
\item InExon: in exon regions
\item In3UTR: in 3'UTRs
\end{itemize}

Note that intron region and intergenic region information can be derived from the above genomic region indicators: if "InNM" and/or "InNR" is not missing but "InTSS", "In5UTR", "InExon", and "In3UTR" are missing, then the RE is strictly located within intron region; if all indicators are missing, then the RE is strictly located in intergenic region.

\subsection{Plot prediction}

Make a density plot of the predicted methylation (beta values):

\begin{figure}[h!]
\centering
<<rempplot, eval=TRUE, fig.width=4.5, fig.height=4>>=
remplot(remp.res, main = "Alu methylation (GM12878)", col = "blue")
@
\end{figure}

\pagebreak

\section{Extract RE-CpG methylation profiled by Illumina BeadChip array}

\Rpkg{REMP} offers a handy tool to extract methylation data of CpGs that are located in RE. Similar as \Rfunction{remp}, users can choose the source of annotation database (\Rcode{AH}: AnnotationHub or \Rcode{UCSC}: UCSC website) and genome build (\Rcode{hg19} or \Rcode{hg38}). 

<<remprofile, eval=TRUE>>=
# Use Alu.hg19.demo for demonstration
remp.res <- remprofile(GM12878_450k, 
                       REtype = "Alu", 
                       annotation.source = "AH", 
                       genome = "hg19", 
                       RE = Alu.hg19.demo)
details(remp.res)

# All accessors and utilites for REMProduct are applicable
remp.res <- rempAggregate(remp.res)
details(remp.res)
@

\end{document}