%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{SGSeq}
%\VignettePackage{SGSeq}

\documentclass[10pt]{article}

<<knitr, include=FALSE, cache=FALSE>>=
library(knitr)
opts_chunk$set(fig.align='center', fig.show='hold')
@

<<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@ 

\bioctitle{Event-level prediction and quantification of transcript isoforms
    from RNA-seq data}

\author{Leonard Goldstein \\ [1em] 
    \small{Department of Bioinformatics and Computational Biology, 
        Genentech Inc.}}

\begin{document}

\maketitle

\section{Background}

RNA-seq data are informative for the analysis of known and novel transcript
isoforms. While the short length of RNA-seq reads limits the ability to
predict and quantify full-length transcripts, short read data are well suited
for the analysis of individual alternative transcript events (e.g. inclusion
or skipping of a cassette exon). Available event-centric methods typically
rely on annotated transcripts and only consider a subset of all possible
events. We developed a novel approach for the identification and quantification
of alternative transcript events from RNA-seq data, implemented in the 
\Rpackage{SGSeq} package.

\section{Overview}

\Rpackage{SGSeq} predicts splice junctions and exons from genomic RNA-seq 
read alignments in BAM format. The discrete transcript features are assembled
into a genome-wide splice graph \cite{Heber:2002aa}. Splice junctions and
disjoint exon bins form the edges of the graph, while nodes correspond to
transcript starts and ends, and splice donor and acceptor sites. Alternative
transcript events are regions with two or more transcript variants. In the
context of the splice graph, they are defined by a start and an end node
connected by two or more alternative paths and no intervening nodes with all
paths intersecting. \Rpackage{SGSeq} identifies alternative transcript events
recursively from the splice graph, and quantifies transcript variants locally,
based on counts of reads spanning event boundaries.

\section{Preliminaries}

This vignette illustrates an analysis of paired-end RNA-seq data obtained
from four colorectal tumors and four normal colorectal samples, which are
part of a data set published in \cite{Seshagiri:2012gr}. 
For the purpose of this vignette, we created BAM files including alignments
overlapping a single gene of interest (\emph{FBXO31}).

<<message=FALSE>>=
library(SGSeq)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
@

In the following, we use a \Rclass{data.frame} \Robject{si} with sample
information, and a \Rclass{GRanges} object \Robject{gr} with genomic
coordinates of the \emph{FBXO31} gene.

The \Rclass{data.frame} with sample information contains
alignment information, including paired-end status, median read length, 
median insert size and the total number of alignments. These were obtained
from the original BAM files using function \Rfunction{getBamInfo}. 
We set the correct BAM file paths in the sample information.

<<>>=
dir <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(dir, "bams", si$file_bam)
@ 

We obtain transcript annotation from the UCSC knownGene table, 
available as a \Bioconductor{} annotation package 
\Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene}. We retain transcripts
on chromosome 16, where the \emph{FBXO31} gene is located, and change
chromosome names in the annotation to match chromosome names in the
BAM files.

<<>>=
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"
@ 

\Rpackage{SGSeq} makes extensive use of the \Bioconductor{} infrastructure
for genomic ranges \cite{Lawrence:2013hi}. To store genomic coordinates
for both exons and splice junctions, we created a new class
\Rclass{TxFeatures}, which extends the \Rclass{GRanges} class with additional
columns. Column \Rcode{type} takes values in \Rcode{J} (splice junction), 
\Rcode{I} (internal exon), \Rcode{F} (first/$5^\prime$ terminal exon), 
\Rcode{L} (last/$3^\prime$ terminal exon) and \Rcode{U} (unspliced).

In addition to \Rclass{TxFeatures}, we designed the \Rclass{SGFeatures} class
to store splice graph features. Similar to \Rclass{TxFeatures}, the 
\Rclass{SGFeatures} class extends the \Rclass{GRanges} class with additional
columns. Column \Rcode{type} in an \Rclass{SGFeatures} object takes values in
\Rcode{J} (splice junction), \Rcode{E} (disjoint exon bin), 
\Rcode{D} (splice donor) and \Rcode{A} (splice acceptor). 

For both \Rclass{TxFeatures} and \Rclass{SGFeatures},
additional column data can be accessed using functions that are
named after the columns they access (e.g. use function \Rfunction{type}
to obtain feature type). Transcript features or splice graph features can
be exported to BED files using function \Rfunction{exportFeatures}.

To work with annotated transcripts in the \Rpackage{SGSeq} framework,
we extract transcript features from the \Rclass{TxDb} object and
store them as a \Rclass{TxFeatures} object. We only retain features 
overlapping the \emph{FBXO31} gene locus.

<<>>=
txf_annotated <- convertToTxFeatures(txdb)
txf_annotated <- txf_annotated[txf_annotated %over% gr]
@

If transcript annotation is not available as a \Rclass{TxDb} object,
function \Rfunction{convertToTxFeatures} can construct 
\Rclass{TxFeatures} from a \Rclass{GRangesList} of exons grouped
by transcript (which can be obtained from other formats such as GFF/GTF).

\section{Analysis based on annotated transcript features}

Initially, we perform an analysis for annotated transcript features. 
The following example converts the transcript features into splice graph
features and obtains compatible counts for each feature and each sample. 

<<>>=
sgfc <- analyzeFeatures(si, features = txf_annotated)
@ 

\sloppy \Rfunction{analyzeFeatures} returns an \Rclass{SGFeatureCounts}
object, which extends the \Rclass{SummarizedExperiment} class from the 
\Biocpkg{GenomicRanges} package. \Rclass{SGFeatureCounts} contains
sample information as \Rcode{colData}, splice graph features as
\Rcode{rowData} and assays \Rcode{counts} and \Rcode{FPKM}, 
which store compatible counts and FPKMs for each splice graph feature 
and sample, respectively. Accessor functions \Rfunction{colData},
\Rfunction{rowData}, \Rfunction{counts} and \Rfunction{FPKM} can be used
to access the data.

Compatible FPKMs for splice graph features can be visualized with function
\Rfunction{plotFeatures}.

<<figure-1, fig.width=4.5, fig.height=4.5>>=
plotFeatures(sgfc, geneID = 1)
@ 

\section{Analysis based on predicted transcript features}

Instead of relying on existing annotation, we can predict transcript features 
from BAM files directly.

<<>>=
sgfc <- analyzeFeatures(si, which = gr)
@ 

For interpretability, we annotate predicted features with respect to 
known transcript features.

<<>>=
sgfc <- annotate(sgfc, txf_annotated)
@ 

The predicted splice graph features and compatible FPKMs can be visualized
as previously. By default, splice graph features with missing annotation
are highlighted in red.

<<figure-2, fig.width=4.5, fig.height=4.5>>=
plotFeatures(sgfc, geneID = 1)
@ 

Note that, in contrast to the previous figure, the predicted gene model
does not include parts of the splice graph that are not expressed. 
Also, an unannotated exon was discovered from the RNA-seq data, which is 
expressed in three of the four normal colorectal samples. 

\section{Analysis of transcript variants}

We can focus our analysis on alternative transcript events. The following
example identifies transcript variants from the splice graph and obtains
representative counts for each transcript variant. Estimates for variant
frequencies are obtained based on representative counts.

<<>>=
txvc <- analyzeVariants(sgfc)
@ 

\Rfunction{analyzeVariants} returns a \Rclass{TxVariantCounts} object.
Similar to \Rclass{SGFeatureCounts}, \Rclass{TxVariantCounts} extends
the \Rclass{SummarizedExperiment} class from the \Biocpkg{GenomicRanges} 
package. \Rclass{TxVariantCounts} contains sample information as
\Rcode{colData} and transcript variants as \Rcode{rowData}. 
Assay \Rcode{variantFreq} stores frequency estimates
for each transcript variant and sample. Accessor functions 
\Rfunction{colData}, \Rfunction{rowData} and \Rfunction{variantFreq} 
can be used to access the data. 

Each transcript variant consists of one or more splice graph features.
Information on transcript variants is stored as \Rcode{elementMetadata}
(or \Rcode{mcols}) in the \Rclass{TxVariants} object and can be accessed
as follows.

<<>>=
mcols(txvc)
@ 

Transcript variants and estimates of variant frequencies can be visualized
with function \Rfunction{plotVariants}.

<<figure-3, fig.width=1.5, fig.height=4.5>>=
plotVariants(txvc, eventID = 1)
@ 

\section{Advanced use}

Functions \Rfunction{analyzeFeatures} and \Rfunction{analyzeVariants} 
wrap multiple analysis steps for convenience. Alternatively, the 
functions performing individual steps can be called directly. 
The previous analysis based on predicted transcript features can be
performed as follows.

<<>>=
txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_annotated)
sgfc <- getSGFeatureCounts(si, sgf)
txv <- findTxVariants(sgf)
txvc <- getTxVariantCounts(sgfc, txv)
@ 

\sloppy Feature prediction and counting (with \Rfunction{predictTxFeatures} 
and \Rfunction{getSGFeatureCounts}, respectively) can be performed for
individual samples, and results can be combined at a later point in time 
(e.g. to distribute samples across a high-performance computing cluster). 

\sloppy Note that \Rfunction{predictTxFeatures} predicts features for 
each sample, merges features across samples and finally performs 
filtering and processing of predicted terminal exons. When using 
\Rcode{predictTxFeatures} for individual samples, with predictions
intended to be merged later, run \Rcode{predictTxFeatures} with 
argument \Rcode{min\_overhang = NULL} to suppress processing of terminal
exons. Then predictions can subsequently be merged and processed with
functions \Rfunction{mergeTxFeatures} and \Rfunction{processTerminalExons}, 
respectively.

\section{Performance}

\sloppy When performing genome-wide analyses or working with large data sets, 
parallelization is highly recommended. For functions 
\Rfunction{analyzeFeatures}, \Rfunction{predictTxFeatures} and 
\Rfunction{getSGFeatureCounts}, parallelization across samples is
controlled with argument \Robject{BPPARAM}. It defaults to 
\Rcode{MulticoreParam(workers = 1)} (no parallelization). 
For analyses run on a single node with multiple cores, the number of samples
processed in parallel can be specified with argument \Robject{workers}. 
For more details, please see the documentation for the \Biocpkg{BiocParallel} 
package. The number of cores used per samples, enabling parallel processing
of multiple chromosomes/strands, is specified with argument 
\Robject{cores\_per\_sample}. Processing a single BAM file with $\sim50$ 
million paired-end reads using 4 cores typically takes $\sim2-3$ hours for
prediction and $\sim1-2$ hours for counting. Processing times can be strongly 
affected by individual genes or genomic regions with many alignments. 
For some data sets, it may be beneficial to exclude regions with high
coverage (e.g. the mitochondrial chromosome).
Identification of transcript variants from the splice graph is
performed on a per-gene basis and can be parallelized by setting argument
\Robject{cores} when using \Rfunction{analyzeVariants} or 
\Rfunction{findTxVariants}.

\section{Session information}

<<results="asis", echo=FALSE>>=
toLatex(sessionInfo())
@ 

\bibliography{SGSeq}

\end{document}