---
title: "Splice event prediction and quantification from RNA-seq data"
author: "Leonard D Goldstein"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('SGSeq')`"
abstract: >
  *SGSeq* is a software package for analyzing splice events from RNA-seq data. Input data are RNA-seq reads mapped to a reference genome in BAM format. Genes are represented as a splice graph, which can be obtained from existing annotation or predicted from the mapped sequence reads. Splice events are identified from the graph and are quantified locally using structurally compatible reads at the start or end of each splice variant. The software includes functions for splice event prediction, quantification, visualization and interpretation.
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{SGSeq}
  %\VignettePackage{SGSeq}
output:
  BiocStyle::html_document:
    toc: true
    fig_caption: true
bibliography: SGSeq.bib
---

```{r, echo = FALSE, results = 'hide'}
library(knitr)
opts_chunk$set(error = FALSE)
```

```{r style, echo = FALSE, results = 'asis'}
##BiocStyle::markdown()
```

# Overview

*SGSeq* provides functions and data structures for analyzing splice events from RNA-seq data. An overview of *SGSeq* classes and how they are related is shown in Figure 1 and summarized below:

* The *TxFeatures* class stores discrete transcript features (exons and splice junctions) as they are observed in RNA transcripts. These can be extracted from annotation or predicted from aligned RNA-seq reads.
* The *SGFeatures* class stores features defining a splice graph [@Heber:2002aa]. The splice graph is a directed acyclic graph with edges corresponding to exonic regions and splice junctions, and nodes corresponding to transcript starts, ends and splice sites. It is directed from the 5$^\prime$ end to the 3$^\prime$ end of a gene.
* The *SGVariants* class stores splice variants. If two nodes in the splice graph are connected by two or more paths, and there are no intervening nodes with all paths intersecting, the alternative paths are considered splice variants. Splice variants sharing the same start and end node, together form a splice event.

If you use *SGSeq*, please cite:

* Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R (2016) Prediction and Quantification of Splice Events from RNA-Seq Data. PLoS ONE 11(5): e0156132. [doi:10.1371/journal.pone.0156132](http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0156132)

![**Figure 1.** Overview of *SGSeq* data structures. (**a**) Schematic illustrating transcripts, discrete transcript features, the splice graph, and splice events consisting of alternative splice variants. Splice events are defined in an augmented graph with a unique source and sink node for each gene, connected to alternative starts and ends, respectively (dashed lines). (**b**) *SGSeq* representation of concepts shown in (a). Classes are shown in bold and outlined, function names are shown in italics. Dashed arrows indicate functions *analyzeFeatures()* and *analyzeVariants()*, which wrap multiple analysis steps. *SGSeq* makes extensive use of *Bioconductor* infrastructure for genomic ranges [@Lawrence:2013hi]: *TxFeatures* and *SGFeatures* extend *GRanges*, *SGVariants* extends *CompressedGRangesList*. *SGFeatureCounts* and *SGVariantCounts* extend *SummarizedExperiment* and are containers for per-sample counts (or other expression values) along with corresponding *SGFeatures* and *SGVariants* objects. $^\ast$*SGVariantCounts* assay *countsVariant5pOr3p* can only be obtained from BAM files, for details see section [Testing for differential splice variant usage].](classes.png)

# Preliminaries

```{r, message = FALSE}
library(SGSeq)
```

When starting a new project, *SGSeq* requires information about the samples to be analyzed. This information is obtained once initially, and can then be used for all future analyses. Sample information is provided as a data frame with the following columns:

* *sample_name* Character vector with a unique name for each sample
* *file_bam* Character vector or *BamFileList* specifying BAM files generated with a splice-aware alignment program
* *paired_end* Logical vector indicating whether data are paired-end or single-end
* *read_length* Numeric vector with read lengths
* *frag_length* Numeric vector with average fragment lengths (for paired-end data)
* *lib_size* Numeric vector with the total number of aligned reads for single-end data, or the total number of read pairs for paired-end data

Sample information can be stored in a *data.frame* or *DataFrame* object (if BAM files are specified as a *BamFileList*, it must be stored in a *DataFrame*). Sample information can be obtained automatically with function *getBamInfo()*, which takes as input a data frame with columns *sample_name* and *file_bam* and extracts all other information from the specified BAM files.

For *SGSeq* to work correctly it is essential that reads were mapped with a splice-aware alignment program, such as GSNAP [@Wu:2010ep], HISAT [@Kim:2015be] or STAR [@Dobin:2013fg], which generates SAM/BAM files with a custom tag 'XS' for spliced reads, indicating the direction of transcription. BAM files must be indexed. Note that *lib_size* should be the total number of aligned fragments, even if BAM files were subset to genomic regions of interest. The total number of fragments is required for accurate calculation of FPKM values (fragments per kilobase and million sequenced fragments). Here, the term 'fragment' denotes a sequenced cDNA fragment, which is represented by a single read in single-end data, or a pair of reads in paired-end data.

This vignette illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in [@Seshagiri:2012gr]. RNA-seq reads were mapped to the human reference genome using GSNAP [@Wu:2010ep]. The analysis is based on BAM files that were subset to reads mapping to a single gene of interest (*FBXO31*). A *data.frame* *si* with sample information was generated from the original BAM files with function *getBamInfo()*. Note that column *lib_size* reflects the total number of aligned fragments in the original BAM files.

```{r}
si
```

The following code block sets the correct BAM file paths for the current *SGSeq* installation.

```{r}
path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)
```

# RNA transcripts and the *TxFeatures* class

Transcript annotation can be obtained from a *TxDb* object or imported from GFF format using function *importTranscripts()*. Alternatively, transcripts can be specified as a *GRangesList* of exons grouped by transcripts. In the following code block, the UCSC knownGene table is loaded as a *TxDb* object. Transcripts on chromosome 16 (where the *FBXO31* gene is located) are retained, and chromosome names are changed to match the naming convention used in the BAM files.

```{r, message = FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"
```

To work with the annotation in the *SGSeq* framework, transcript features are extracted from the *TxDb* object using function *convertToTxFeatures()*. Only features overlapping the genomic locus of the *FBXO31* gene are retained. The genomic coordinates of *FBXO31* are stored in a *GRanges* object *gr*.

```{r}
txf_ucsc <- convertToTxFeatures(txdb)
txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
head(txf_ucsc)
```

*convertToTxFeatures()* returns a *TxFeatures* object, which is a *GRanges*-like object with additional columns. Column *type* indicates the feature type and can take values

* *J* (splice junction)
* *I* (internal exon)
* *F* (first/5$^\prime$-terminal exon)
* *L* (last/5$^\prime$-terminal exon)
* *U* (unspliced transcript).

Columns *txName* and *geneName* indicate the transcript and gene each feature belongs to. Note that a feature can belong to more than one transcript, and accordingly these columns can store multiple values for each feature. For *TxFeatures* and other data structures in *SGSeq*, columns can be accessed with accessor functions as shown in the following code block.

```{r}
type(txf_ucsc)
head(txName(txf_ucsc))
head(geneName(txf_ucsc))
```

# The splice graph and the *SGFeatures* class

Exons stored in a *TxFeatures* object represent the exons spliced together in an RNA molecule. In the context of the splice graph, exons are represented by unique non-overlapping exonic regions. Function *convertToSGFeatures()* converts *TxFeatures* to *SGFeatures*. In the process, overlapping exons are disjoined into non-overlapping exon bins.

```{r}
sgf_ucsc <- convertToSGFeatures(txf_ucsc)
head(sgf_ucsc)
```

Similar to *TxFeatures*, *SGFeatures* are *GRanges*-like objects with additional columns. Column *type* for an *SGFeatures* object takes values

* *J* (splice junction)
* *E* (disjoint exon bin)
* *D* (splice donor site)
* *A* (splice acceptor site).

By convention, splice donor and acceptor sites correspond to the exonic positions immediately flanking the intron. *SGFeatures* has additional columns not included in *TxFeatures*: *spliced5p* and *spliced3p* indicate whether exon bins have a mandatory splice at the 5$^\prime$ and 3$^\prime$ end, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, and whether an exon bin is consistent with an annotated transcript. *featureID* provides a unique identifier for each feature, *geneID* indicates the unique component of the splice graph a feature belongs to.

# Splice graph analysis based on annotated transcripts

This section illustrates an analysis based on annotated transcripts. Function *analyzeFeatures()* converts transcript features to splice graph features and obtains compatible fragment counts for each feature and each sample.

```{r, message = FALSE}
sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
```

*analyzeFeatures()* returns an *SGFeatureCounts* object. *SGFeatureCounts* contains the sample information as *colData*, splice graph features as *rowRanges* and assays *counts* and *FPKM*, which store compatible fragment counts and FPKMs, respectively. The different data types can be accessed using accessor functions with the same name.

```{r, eval = FALSE}
colData(sgfc_ucsc)
rowRanges(sgfc_ucsc)
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))
```

Counts for exons and splice junctions are based on structurally compatible fragments. In the case of splice donors and acceptors, counts indicate the number of fragments with reads spanning the spliced boundary (overlapping the splice site and the flanking intronic position).

FPKM values are calculated as $\frac{x}{NL}10^6$, where $x$ is the number of compatible fragments, $N$ is the library size (stored in *lib_size*) and *L* is the effective feature length (the number of possible positions for a compatible fragment). For paired-end data it is assumed that fragment length is equal to *frag_length*.

FPKMs for splice graph features can be visualized with function *plotFeatures*. *plotFeatures* generates a two-panel figure with a splice graph shown in the top panel and a heatmap of expression levels for individual features in the bottom panel. For customization of *plotFeatures* output, see section [Visualization]. The plotting function invisibly returns a *data.frame* with details about the splice graph shown in the plot.

```{r figure-1, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_ucsc, geneID = 1)
```

Note that the splice graph includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in this data set.

# Splice graph analysis based on *de novo* prediction

Instead of relying on existing annotation, annotation can be augmented with predictions from RNA-seq data, or the splice graph can be constructed from RNA-seq data without the use of annotation. The following code block predicts transcript features supported by RNA-seq reads, converts them into splice graph features, and then obtains compatible fragment counts. For details on how predictions are obtained, please see [@Goldstein:2016hc].

```{r, message = FALSE}
sgfc_pred <- analyzeFeatures(si, which = gr)
head(rowRanges(sgfc_pred))
```

For interpretation, predicted features can be annotated with respect to known transcripts. The *annotate()* function assigns compatible transcripts to each feature and stores the corresponding transcript and gene name in columns *txName* and *geneName*, respectively.

```{r}
sgfc_pred <- annotate(sgfc_pred, txf_ucsc)
head(rowRanges(sgfc_pred))
```

The predicted splice graph and FPKMs can be visualized as previously. Splice graph features with missing annotation are highlighted using argument *color_novel*.

```{r figure-2, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")
```

Note that most exons and splice junctions predicted from RNA-seq data are consistent with transcripts in the UCSC knownGene table (shown in gray). However, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed in the data. Also, an unannotated exon (E3, shown in red) was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples (N2, N3, N4).

# Splice variant identification

Instead of considering the full splice graph of a gene, the analysis can be focused on individual splice events. Function *analyzeVariants()* recursively identifies splice events from the graph, obtains representative counts for each splice variant, and computes estimates of relative splice variant usage, also referred to as 'percentage spliced in' (PSI or $\Psi$) [@Venables:2008aa, @Katz:2010aa].

```{r, message = FALSE}
sgvc_pred <- analyzeVariants(sgfc_pred)
sgvc_pred
```

*analyzeVariants()* returns an *SGVariantCounts* object. Sample information is stored as *colData*, and *SGVariants* as *rowRanges*. Assay *variantFreq* stores estimates of relative usage for each splice variant and sample. As previously, the different data types can be accessed using accessor functions. Information on splice variants is stored in *SGVariants* metadata columns and can be accessed with function *mcols()* as shown below. For a detailed description of columns, see the manual page for *SGVariants*.

```{r}
mcols(sgvc_pred)
```

# Splice variant quantification

Splice variants are quantified locally, based on structurally compatible fragments that extend across the start or end of each variant. Local estimates of relative usage $\Psi_i$ for variant $i$ are obtained as the number of fragments compatible with $i$ divided by the number of fragments compatible with any variant belonging to the same event. For variant start $S$ and variant end $E$, $\hat{\Psi}_i^S = x_i^S / x_.^S$ and $\hat{\Psi}_i^E = x_i^E / x_.^E$, respectively. For variants with valid estimates $\hat{\Psi}_i^S$ and $\hat{\Psi}_i^E$, a single estimate is calculated as a weighted mean of local estimates $\hat{\Psi}_i = x_.^S/(x_.^S + x_.^E) \hat{\Psi}_i^S + x_.^E/(x_.^S + x_.^E) \hat{\Psi}_i^E$.

Estimates of relative usage can be unreliable for events with low read count. If argument *min_denominator* is specified for functions *analyzeVariants()* or *getSGVariantCounts()*, estimates are set to *NA* unless at least one of $x_.^S$ or $x_.^E$ is equal or greater to the specified value.

Note that *SGVariantCounts* objects also store the raw count data. Count data can be used for statistical modeling, for example as suggested in section [Testing for differential splice variant usage].

```{r}
variantFreq(sgvc_pred)
```

Splice variants and estimates of relative usage can be visualized with function *plotVariants*.

```{r figure-3, fig.width=1.5, fig.height=4.5}
plotVariants(sgvc_pred, eventID = 1, color_novel = "red")
```

*plotVariants* generates a two-panel figure similar to *plotFeatures*. The splice graph in the top panel illustrates the selected splice event. In this example, the event consists of two variants, corresponding to a skip or inclusion of the unannotated exon. The heatmap illustrates estimates of relative usage for each splice variant. Samples N2, N3 and N4 show evidence for transcripts that include the exon as well as transcripts that skip the exon. The remaining samples show little evidence for exon inclusion.

# Splice variant interpretation

The functional consequences of a splice variant can be assessed by predicting its effect on protein-coding potential. Function *predictVariantEffects()* takes as input an *SGVariants* object with splice variants of interest, a set of annotated transcripts, and a matching reference genome provided as a *BSgenome* object.

```{r, message = FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(Hsapiens) <- "NCBI"
vep <- predictVariantEffects(sgv_pred, txdb, Hsapiens)
vep
```

The output is a data frame with each row describing the effect of a particular splice variant on an annotated protein-coding transcript. The effect of the variants is described following [HGVS recommendations](http://varnomen.hgvs.org). In its current implementation, variant effect prediction is relatively slow and it is recommended to run *predictVariantEffects()* on select variants only.

# Visualization

Functions *plotFeatures()* and *plotVariants()* support many options for customizing figures. The splice graph in the top panel is plotted by function *plotSpliceGraph*, which can be called directly.

*plotFeatures()* includes multiple arguments for selecting features to be displayed. The following code block illustrates three different options for plotting the splice graph and expression levels for *FBXO31* (Entrez ID 79791).

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1)
plotFeatures(sgfc_pred, geneName = "79791")
plotFeatures(sgfc_pred, which = gr)
```

By default, the heatmap generated by *plotFeatures()* displays splice junctions. Alternatively, exon bins, or both exon bins and splice junctions can be displayed.

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, include = "junctions")
plotFeatures(sgfc_pred, geneID = 1, include = "exons")
plotFeatures(sgfc_pred, geneID = 1, include = "both")
```

Argument *toscale* controls which parts of the gene model are drawn to scale.

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, toscale = "gene")
plotFeatures(sgfc_pred, geneID = 1, toscale = "exon")
plotFeatures(sgfc_pred, geneID = 1, toscale = "none")
```

Heatmaps allow the visualization of expression values summarized for splice junctions and exon bins. Alternatively, per-base read coverages and splice junction counts can be visualized with function *plotCoverage*.

```{r, figure-4, fig.width=4.5, fig.height=4.5}
par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
  plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}
```

# Testing for differential splice variant usage

*SGSeq* does not implement statistical tests for differential splice variant usage. However, existing software packages such as `r BiocStyle::Biocpkg("DEXSeq")` [@Anders:2012es] and `r BiocStyle::Biocpkg("limma")` [@Ritchie:2015fa] can be used for this purpose. These packages allow infererence of differential exon usage within a gene (between groups of samples). In the *SGSeq* framework, this approach can be used to test for differential splice variant usage within a splice event, treating splice variants and splice events as exons and genes, respectively.

For these methods to be applicable, a single count is required for each splice variant. *SGVariantCounts* objects as described above store two counts for each splice variant, one for the 5$^\prime$ and one for the 3$^\prime$ end of the variant. These counts can be readily obtained from *SGFeatureCounts* objects, but are impractical for count-based differential testing. A single count for each variant (based on fragments compatible at either end of the variant) can be obtained from BAM files using function *getSGVariantCounts()*. The output is an *SGVariantCounts* object with additional assay *countsVariant5pOr3p*.

```{r, message = FALSE}
sgv <- rowRanges(sgvc_pred)
sgvc <- getSGVariantCounts(sgv, sample_info = si)
sgvc
```

Performing differential tests requires per-variant counts, unique identifiers for each variant, and a variable indicating how variants are grouped by events. All three can be obtained from the *SGVariantCounts* object.

```{r}
x <- counts(sgvc)
vid <- variantID(sgvc)
eid <- eventID(sgvc)
```

Treating these three objects analogously to per-exon counts, exon identifiers and gene identifiers, they can be used to construct a *DEXSeqDataSet* object for use with `r BiocStyle::Biocpkg("DEXSeq")` or as input for function *diffSplice()* in combination with *voom()* for use with `r BiocStyle::Biocpkg("limma")`.

# Advanced usage

Functions *analyzeFeatures()* and *analyzeVariants()* wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the analysis based on *de novo* prediction can be performed as follows.

```{r, message = FALSE}
txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_ucsc)
sgfc <- getSGFeatureCounts(si, sgf)
sgv <- findSGVariants(sgf)
sgvc <- getSGVariantCounts(sgv, sgfc)
```

*predictTxFeatures()* predicts features for each sample, merges features across samples, and performs filtering and processing of predicted terminal exons. *predictTxFeatures()* and *getSGFeatureCounts()* can also be run on individual samples (e.g. for distribution across a high-performance computing cluster). When using *predictTxFeatures()* for individual samples, with predictions intended to be merged later, run *predictTxFeatures()* with argument *min_overhang = NULL* to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions *mergeTxFeatures()* and *processTerminalExons()*, respectively.

# Multi-core use and memory requirements

*SGSeq* supports parallelization across multiple cores on the same compute node. For functions that support parallelization, the number of cores can be specified with the *cores* argument. For most genome-wide analyses, memory requirements should not exceed 16 Gb per core. Memory issues encountered during feature prediction or read counting are likely due to individual genomic regions with high read coverage. By default, regions with high split-read complexity are skipped during prediction (controlled with argument *max_complexity*). For the counting step, memory issues may be resolved by excluding problematic regions (e.g. mitochondrial genes).

# Session information

```{r}
sessionInfo()
```

# References