%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{MLSeq}

\documentclass[11pt]{article}
\usepackage[left=2.5cm, top=2.5cm, right=2.5cm, bottom=2cm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{color}

\title{\textbf{MLSeq package: Machine Learning Interface to RNA-Seq Data}}
\author{Gokmen Zararsiz$^{1,2}$, Dincer Goksuluk$^3$, Selcuk Korkmaz$^3$, Vahap Eldem$^4$,\\ Izzet Parug Duru$^5$, Turgay Unver$^6$, Ahmet Ozturk$^1$\\[1cm]
\small{$^1$Erciyes University, Faculty of Medicine, Department of Biostatistics, Kayseri, TURKEY}\\[0cm]
\small{$^2$Erciyes University, Genome and Stem Cell Center (GENKOK), Division of Bioinformatics, Kayseri, TURKEY}\\[0cm]
\small{$^3$Hacettepe University, Faculty of Medicine, Department of Biostatistics, Ankara, TURKEY}\\[0cm]
\small{$^4$Istanbul University, Faculty of Science, Department of Biology, Istanbul, TURKEY}\\[0cm]
\small{$^5$Marmara University, Faculty of Science, Department of Physics, Istanbul, TURKEY}\\[0cm]
\small{$^6$Cankiri University, Faculty of Science, Department of Biology, Cankiri, TURKEY}
\\[0cm]\texttt{\small{$^1$gokmenzararsiz@erciyes.edu.tr}}
} 
\date{}

\begin{document}
<<echo=FALSE, message=FALSE>>=
require(knitr)
opts_chunk$set(cache = TRUE, dev = "pdf")
@

\maketitle
\begin{abstract}
\texttt{MLSeq} package provides several algorithms including support vector machines (SVM), bagging support vector machines (bagSVM), random forest (RF) and classification and regression trees (CART) to classify sequencing data. To achieve this, \texttt{MLSeq} package requires a count table, which contains the number of reads mapped to each transcript for each sample. This kind of count data can be obtained from RNA-Seq experiments, also from other sequencing experiments such as DNA or ChIP-sequencing. This vignette is presented to guide researchers how to use this package. \\[1.5cm]
\end{abstract}

\tableofcontents

\newpage
\section{Introduction}

With the recent developments in molecular biology, it is feasible to measure the expression levels of thousands of genes simultaneously. Using this information, one major task is the gene-expression based classification. With the use of microarray data, numerous classification algorithms are developed and adapted for this type of classification. RNA-Seq is a recent technology, which uses the capabilities of next-generation sequencing (NGS) technologies. It has some major advantages over microarrays such as providing less noisy data and detecting novel transcripts and isoforms. These advantages can also affect the performance of classification algorithms. Working with less noisy data can improve the predictive performance of classification algorithms. Further, novel transcripts may be a biomarker in related disease or phenotype. \texttt{MLSeq} package includes several classification algorithms, also normalization and transformation approaches for RNA-Seq classification. \texttt{MLSeq} package can be loaded as below:

<<warning=FALSE, message=FALSE>>=
library(MLSeq)
@

\section{Preparation of input data}

\texttt{MLSeq} package expects a count table that contains the number of reads mapped to each transcript for each sample and class label information of samples in an \texttt{S4} class \texttt{DESeqDataSet} format.

After mapping the RNA-Seq reads to a reference genome or transcriptome, number of reads mapped to the reference genome can be counted to measure transcript abundance. It is very important that the count values must be raw sequencing read counts to implement the methods given in \texttt{MLSeq} package. There are a number of functions in Bioconductor packages which summarizes mapped reads to a count data format. These tools include \texttt{featureCounts} function in \texttt{Rsubread} package \cite{liao2013}, \texttt{summarizeOverlaps} function in \texttt{GenomicRanges} package \cite{lawrence2013} and \texttt{easyRNASeq} package \cite{delhomme2012}. It is also possible to access this type of count data from Linux-based softwares as \texttt{htseq-count} function in \texttt{HTSeq} \cite{htseq} and \texttt{multicov} function in \texttt{bedtools} \cite{quinlan2010} software. 

In this vignette, we will work with the cervical count data. Cervical data is from an experiment that measures the expression levels of 714 miRNA's of human samples \cite{witten2010}. There are 29 tumor and 29 non-tumor cervical samples and these two groups can be treated as two separete classes for classification purpose. We can define the file path with \texttt{system.file} function: 

<<chunk1>>=
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
filepath
@

Next, we can load the data using \texttt{read.table} function:

<<chunk2, message=FALSE>>=
cervical = read.table(filepath, header=TRUE)
@

After loading the data, one can check the counts as follows. These counts are the number of mapped miRNA reads to each transcript.

<<chunk3>>=
head(cervical[,1:5])
@

Cervical data is in data.frame format which contains 714 miRNA mapped counts given in rows, belonging to 58 samples given in columns:

<<chunk4>>=
class(cervical)
dim(cervical)
@

First 29 columns of the data contain the miRNA mapped counts of non-tumor samples, while the last 29 columns contain the count information of tumor samples. We need to create a class label information in order to apply classification models.

<<chunk5, echo=FALSE>>=
options(width = 63)
@

<<chunk6>>=
class = data.frame(condition = factor(rep(c("N","T"), c(29,29))))
as.factor(class[,1])
@

For simplicity, we can work with a subset of cervical data with first 150 features.

<<chunk7>>=
data = cervical[c(1:150),]
@

Now, we can split the data into two parts as training and test sets. Training set can be used to build classification models, and test set can be used to assess the performance of each model. We can use \texttt{set.seed} function to specify initial value of random-number seed and use \texttt{sample} function for selection.

<<chunk8>>=
nTest = ceiling(ncol(data)*0.2)  
set.seed(12345) 
ind = sample(ncol(data), nTest, FALSE)
@

Now, training and test sets can be created based on this sampling process:

<<chunk9>>=
data.train = data[,-ind]    
data.train = as.matrix(data.train + 1)
classtr = data.frame(condition = class[-ind,])
@

<<chunk10>>=
data.test = data[,ind]
data.test = as.matrix(data.test + 1)
classts = data.frame(condition = class[ind,])
@

Now, we have 46 samples which will be used to train the classification models and have remaining 12 samples to be used to test the model performances:

<<chunk11>>=
dim(data.train)
dim(data.test)
@

We can now transform our training and test data to \texttt{DESeqDataSet} instance, which is the main data structure in the \texttt{MLSeq} package. For this purpose, we use the \texttt{DESeqDataSetFromMatrix} function of \texttt{DESeq2} package \cite{love2014}:

<<chunk12, message=FALSE>>=
data.trainS4 = DESeqDataSetFromMatrix(countData = data.train,
colData = classtr, formula(~condition))
data.trainS4 = DESeq(data.trainS4, fitType="local")
data.trainS4
@

<<chunk13, message=FALSE>>=
data.testS4 = DESeqDataSetFromMatrix(countData = data.test,
colData = classts, formula(~condition))
data.testS4 = DESeq(data.testS4, fitType = "local") 
data.testS4
@

Counts and class label information are adequate for classification analysis. However, users can also enter other information. Furthermore, users can directly call the count data obtained from \texttt{HTSeq} software using \texttt{DESeqDataSetFromHTSeqCount} function of \texttt{DESeq2} package.

\section{Data normalization and transformation}

In differential expression analysis of RNA-Seq data, it is crucial to normalize the count data to adjust between-sample differences. In our experiments, we have also seen that normalization significantly increase the performance of most classifiers. In \texttt{MLSeq} package two normalization methods are available. First one is the ``deseq normalization", which estimates the size factors by dividing each sample by the geometric means of the transcript counts \cite{love2014}. Median statistic is mostly used as a size factor for each sample. Another normalization method is ``trimmed mean of M values (TMM)". TMM first trims the data in both lower and upper side by log-fold changes (default 30\%) to minimize the log-fold changes between the samples and by absolute intensity (default 5\%). After trimming, TMM calculates a normalization factor using the weighted mean of data. These weights are calculated based on the inverse approximate asymptotic variances using the delta method \cite{robinson2010}. 

After the normalization process, it is useful to transform the data for classification analysis. There are two transformation methods available in \texttt{MLSeq} package. First one is the ``voom transformation" which applies a logarithmic transformation to normalized count data and computes gene weights using the mean-dispersion relationship \cite{charity2014}. Second transformation method is the ``vst transformation". This approach uses an error modeling and the concept of variance stabilizing transformations to estimate the mean-dispersion relationship of data \cite{love2014}.

If the normalization method is selected as ``TMM", then \texttt{MLSeq} package automatically applies ``voom" transformation. However, it is possible to select either ``vst" or ``voom" transformations after ``deseq" normalization.

Further details on these normalization and transformation methods can be found in referenced papers. 

\section{Cross-validation concept}

One essential goal of classification analysis is to build a generalizable model that will have a low misclassification error when applied to new samples. One way is to use \emph{k}-fold cross-validation to validate obtained model. \emph{k}-fold cross-validation technique randomly splits the data into \emph{k} non-overlapping and equally sized subsets. A classification model is trained on $(k-1)$ subsets and tested in the remaining subsets. This process is repeated \emph{k} times, thus all subsets are used as a test set in each step. \texttt{MLSeq} package also has the repeat option to obtain more generalizable models. Giving a number of \emph{m} repeats, cross validation concept is applied \emph{m} times.

\section{Building classification models}

Now, we can train our data, data.trainS4, using one of the classifiers among SVM, bagSVM, RF and CART algorithms. To build a classification model, we simply use the \texttt{classify} function. First, let us use CART classifier and choose ``deseq" as normalization method, ``vst" as transformation method and assign this model in ``cart" object. We also define the number of cross validation fold as ``cv=5" and number of repeats as ``rpt=3" for model validation. The reference class is considered as ``T" via \texttt{ref = "T"}.

<<chunk14, message=FALSE, tidy.opts=list(width.cutoff=65)>>=
cart <- classify(data = data.trainS4, method = "cart", normalize = "deseq",
               deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")
cart
@

After running the code given above, we obtain the results in MLSeq class. CART successfully fits a model with \Sexpr{round(cart@confusionMat$overall[1]*100, 2)}\% true classification accuracy.

``cart" object also stores information about model training and the parameters used to build this model. 

<<chunk15>>=
getSlots("MLSeq")
trained(cart)
@

Now, let us train another model with same parameters using RF classifier and save this model in ``rf" object.

<<chunk16, message=FALSE, tidy.opts=list(width.cutoff=65)>>=
rf = classify(data = data.trainS4, method = "randomforest", normalize = "deseq", 
              deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")
rf
@

We can see that RF method successfully trained the model without misclassifying any samples.

\section{Prediction of new samples}

Now, we will predict the class labels of our test data ``data.testS4" and test the performance of classifiers based on the models we built using \texttt{classify} function. Here, we use \texttt{predictClassify} function in order to achieve this goal.

<<chunk17, message=FALSE>>=
pred.cart = predictClassify(cart, data.testS4)
pred.cart
@

<<chunk18, message=FALSE>>=
pred.rf = predictClassify(rf, data.testS4) 
pred.rf
@

To assess the predictive performance of each method, we can cross the actual class labels and predictions in a table :

<<chunk19, echo=FALSE, message=FALSE>>=
cart.tbl <- table(pred.cart, relevel(data.testS4$condition, 2))
rf.tbl <- table(pred.rf, relevel(data.testS4$condition, 2))
@

<<chunk19_2>>=
table(pred.cart, relevel(data.testS4$condition, 2))
table(pred.rf, relevel(data.testS4$condition, 2))
@

We can see that RF showed better predictive performances than CART by correctly classifying \Sexpr{sum(diag(rf.tbl))} out of \Sexpr{sum(rf.tbl)} test samples with a \Sexpr{round(100*sum(diag(rf.tbl)) / sum(rf.tbl), 2)}\% classification accuracy. However, note that the true classification rate for \texttt{predictClassify} is dependent on number of repeats and selected folds. Therefore, users may have different results that we have obtained above.

\section{Session info}

<<chunk20>>=
sessionInfo()
@
\rule[0mm]{40mm}{0.6mm} \\[2cm]
% BIBLIOGRAHY
\begin{thebibliography}{20}
\bibitem{liao2013} Liao Y, Smyth GK, Shi W (2013). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. \textit{Bioinformatics}.
\bibitem{lawrence2013} Lawrence M, Huber W, Pages H, et al. (2013). Software for Computing and Annotating Genomic Ranges. \textit{Plos Computational Biology}, DOI: 10.1371/journal.pcbi.1003118.
\bibitem{delhomme2012} Delhomme N, Padioleau I, Furlong EE, et al. (2012). easyRNASeq: a bioconductor package for processing RNA-Seq data. \textit{Bioinformatics}, 28(\textbf{19}):2532-2533.
\bibitem{htseq} {\color{blue}http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html}
\bibitem{quinlan2010} Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. \textit{Bioinformatics}, 26(\textbf{6}):841-842.
\bibitem{witten2010} Witten D, Tibshirani R, Gu S, et al. (2010). Ultra-high throughput sequencing-based small RNA discovery an discrete statistical biomarker analysis in a collection of cervical tumors and matched controls. \textit{BMC Biology}, 8(\textbf{58}).
\bibitem{love2014} Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 \textit{Genome Biology}, 15(\textbf{12}):550.
\bibitem{robinson2010} Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. \textit{Genome Biology}, 11:R25, doi:10.1186/gb--2010--11--3--r25.
 \bibitem{charity2014} Charity WL, Chen Y, Shi W, et al. (2014) Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts, \textit{Genome Biology}, 15:R29, doi:10.1186/gb--2014--15--2--r29.
\end{thebibliography}

\end{document}