--- title: "Introduction" shorttitle: "Introduction" author: "Diogo Veiga" affiliation: The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA" email: Diogo.Veiga@jax.org package: maser date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 100 ) ``` # Overview of maser Alternative splicing occurs in most human genes and novel splice isoforms may be associated to disease or tissue specific functions. However, functional interpretation of splicing requires the integration of heterogeneous data sources such as transcriptomic and proteomic information. We developed **maser** (**M**apping **A**lternative **S**plicing **E**vents to p**R**oteins) package to enable functional characterization of splicing in both transcriptomic and proteomic contexts. Overall, `r Rpackage("maser")` allows a detailed analysis of splicing events identified by [rMATS](http://rnaseq-mats.sourceforge.net/) by implementing the following functionalities: * Filtering of [rMATS](http://rnaseq-mats.sourceforge.net/) splicing events based on RNA-seq coverage, p-values and differential percent spliced-in (PSI) * Analysis of global splicing effects using boxplots, principal component analysis and volcano plots. * Mapping of splicing events to [Ensembl](http://www.ensembl.org/) transcripts and [UniprotKB](http://www.uniprot.org/) proteins * Integration with [UniprotKB](http://www.uniprot.org/) for batch annotation of protein features overlapping splicing events * Visualization of transcripts and protein affected by splicing using custom `r Rpackage("Gviz")` plots. A key feature of the package is mapping of splicing events to protein features such as topological domains, motifs and mutation sites provided by the [UniprotKB](http://www.uniprot.org/) database and visualized along side the genomic location of splicing. In this manner, `r Rpackage("maser")` can quickly identify splicing affecting known protein domains, extracellular and transmembrane regions, as well as mutation sites in the protein. In this vignette, we describe a basic `r Rpackage("maser")` workflow for analyzing alternative splicing, starting with importing splicing events, filtering events based on their coverage and differential expression, and analyzing global or specific spling events with several types of graphics. The second vignette describes how to use `r Rpackage("maser")` for annotation and visualization of protein features affected by splicing. Throughout the text we use the following abbreviations to describe different types of splicing events: * SE, refers to exon skipping * RI, intron retention * MXE, mutually exclusive exons * A3SS, alternative 3' splice site * A5SS, alternative 5' splice site # Importing rMATS events We demonstrate the package with data generated to investigate alternative splicing in colorectal cancer cells undergoing hypoxia [publication](https://doi.org/10.1038/npjgenmed.2016.20). RNA-seq was collected at 0h and 24h after hypoxia in the HCT116 cell line. We applied [rMATS](http://rnaseq-mats.sourceforge.net/) to detect splicing events using the release 25 GTF (GRCh38 build) from the Gencode [website](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz). The example dataset was reduced because of size constraints. To import splicing events, we use the constructor function `maser()` indicating the path containing rMATS files, and labels for conditions. The `maser()` constructor returns an object containing all events quantified by rMATS. **Note**: The argument `ftype` indicates which rMATS output files to import. Newer rMATS versions (>4.0.1) use `JCEC` or `JC` nomenclature, while `ReadsOnTargetandJunction` or `JunctionCountOnly` are used in rMATS 3.2.5 or lower. See `?maser` for a description of parameters. ```{r, warning = FALSE, message = FALSE} library(maser) library(rtracklayer) # path to Hypoxia data path <- system.file("extdata", file.path("MATS_output"), package = "maser") hypoxia <- maser(path, c("Hypoxia 0h", "Hypoxia 24h"), ftype = "ReadsOnTargetAndJunctionCounts") ``` ```{r, warning = FALSE, message = FALSE} hypoxia ``` Splicing events have genomic locations, raw counts, PSI levels and statistics (p-values) regarding their differential splicing between conditions. Access to different data types is done using `annotation()`, `counts()`, `PSI()`, and `summary()`, which takes as argument the `maser` object and event type. ```{r, warning = FALSE, message = FALSE, eval=FALSE} head(summary(hypoxia, type = "SE")[, 1:8]) ``` ```{r, warning = FALSE, message = FALSE, echo=FALSE} knitr::kable( head(summary(hypoxia, type = "SE")[, 1:8]) ) ``` # Filtering events Low coverage splicing junctions are commonly found in RNA-seq data and lead to low confidence PSI levels. We can remove low coverage events using `filterByCoverage()`, which may signficantly reduced the number of splicing events. ```{r, warning = FALSE, message = FALSE} hypoxia_filt <- filterByCoverage(hypoxia, avg_reads = 5) ``` The function `topEvents()` allows to select statistically significant events given a FDR cutoff and minimum PSI change. Default values are `fdr = 0.05` and `deltaPSI = 0.1` (ie. 10% minimum change). ```{r, warning = FALSE, message = FALSE} hypoxia_top <- topEvents(hypoxia_filt, fdr = 0.05, deltaPSI = 0.1) hypoxia_top ``` Gene specific events can be selected using `geneEvents()`. For instance, there are 8 splicing changes affecting MIB2 as seen below. ```{r, warning = FALSE, message = FALSE} hypoxia_mib2 <- geneEvents(hypoxia_filt, geneS = "MIB2", fdr = 0.05, deltaPSI = 0.1) print(hypoxia_mib2) ``` Events in a `maser` object can be queried using an interactive data table provided by `display()`. The table allows to look up event information such as gene names, identifiers and PSI levels. ```{r, warning = FALSE, message = FALSE} maser::display(hypoxia_mib2, "SE") ``` PSI levels for gene events can be plotted using `plotGenePSI()`, indicating a valid splicing type. ```{r, warning = FALSE, message = FALSE} plotGenePSI(hypoxia_mib2, type = "SE", show_replicates = TRUE) ``` # Global splicing plots An overview of significant events can be obtained using either `dotplot()` or `volcano()` functions, specifying FDR levels, minimum change in PSI between conditions and splicing type. Significant events in each condition will be highlighted. ```{r, warning = FALSE, message = FALSE, fig.small = TRUE} volcano(hypoxia_filt, fdr = 0.05, deltaPSI = 0.1, type = "SE") ``` If only significant events should be plotted, then use `topEvents()` combined with `volcano()` or `dotplot()` for visualization. ```{r, warning = FALSE, message = FALSE, fig.small = TRUE} dotplot(hypoxia_top, type = "SE") ``` Splicing patterns of individual replicates can also be inspected using principal component analysis (PCA) and plotted using `pca()`. Also, boxplots of PSI levels distributions for all events in the `maser` object can be visualized using `boxplot_PSI_levels()`. The breakdown of splicing types can be plotted using `splicingDistribution()` and desired significance thresholds. Please refer to help pages for examples on how to use these functions. # Genomic visualization of splicing events Users can use `r Rpackage("maser")` to visualize Ensembl transcripts affected by splicing. The core function for transcripts visualization is `plotTranscripts()`. This is a wrapper function for performing both mapping of splicing events to the transcriptome, as well as visualization of transcripts that are compatible with the splice event. The function uses `r Biocpkg("Gviz")` for visualization of the genomic locus of splicing event along with involved transcripts. Internally, `plotTranscripts()` uses `mapTranscriptsToEvents()` to identify compatible transcripts by overlapping exons involved in splicing to the gene models provided in the Ensembl GTF. Each type of splice event requires a specific overlapping rule (described below) and a customized `r Biocpkg("Gviz")` plot is created for each splicing type. `plotTranscripts()` requires an Ensembl or Gencode GTF using the hg38 build of the human genome. Ensembl GTFs can be retrieved using `r Biocpkg("AnnotationHub")` or imported using `import.gff()` from the `r Biocpkg("rtracklayer")` package. Several GTF releases are available, and `r Rpackage("maser")` is compatible with any version using the hg38 build. We are going to use a reduced GTF extracted from Ensembl Release 85 for running examples. ```{r, warning = FALSE, message = FALSE} ## Ensembl GTF annotation gtf_path <- system.file("extdata", file.path("GTF","Ensembl85_examples.gtf.gz"), package = "maser") ens_gtf <- rtracklayer::import.gff(gtf_path) ``` ## Exon skipping The most common type of splicing event involves a cassette exon that is either expressed or skipped. For instance, the splicing factor SRSF6 undergoes splicing during hypoxia by expressing an alternative exon. `plotTranscripts` identify the switch from isoform *SRSF6-001* to the isoform *SRSF6-002* after 24hr hypoxia. ```{r, warning = FALSE, message = FALSE} ## Retrieve SRSF6 splicing events srsf6_events <- geneEvents(hypoxia_filt, geneS = "SRSF6", fdr = 0.05, deltaPSI = 0.1 ) ## Dislay affected transcripts and PSI levels plotTranscripts(srsf6_events, type = "SE", event_id = 33209, gtf = ens_gtf, zoom = FALSE, show_PSI = TRUE) ``` In the `r Biocpkg("Gviz")`, the *event* track depicts location of exons involved in skipping event. The *Inclusion* track shows transcripts overlapping the cassette exon as well as both flanking exons (i.e upstream and downstream exons). On the other hand, the *skipping* track displays transcripts overlapping both flanking exons but missing the cassette exon. In the exon skipping event, the *PSI* track displays the inclusion level for the cassette exon (a.k.a. alternative exon). This *PSI* track is included by default and can be turned off with `r show_PSI = FALSE`. In the example above, there is a significant increase of the cassette exon inclusion after 24h hypoxia. This allows one to predict the direction of the isoform switch, i.e. from isoform *SRSF6-001* to the isoform *SRSF6-002* in hypoxic conditions. ## Intron retention Intron retention is a type of splice event that are usually associated to decreased protein translation. In this case, the *Retention* track shows transcripts with an exact overlap of the retained intron, and the *Non-retention* tracks will display transcripts in which the intron is spliced out and overlap flanking exons. Here the PSI track refers to the inclusion level of the retained intron. The example below shows the increase in retained intron affecting STAT2 in hypoxia. ```{r, warning = FALSE, message = FALSE} stat2_events <- geneEvents(hypoxia_filt, geneS = "STAT2", fdr = 0.05, deltaPSI = 0.1 ) plotTranscripts(stat2_events, type = "RI", event_id = 3785, gtf = ens_gtf, zoom = FALSE) ``` ## Mutually exclusive exons This event refers to adjacent exons that are mutually exclusive, i.e. are not expressed together. There are 36 MXE events in the hypoxia dataset, including one affecting the *IL32* gene. Tracks will display transcripts harboring the first or second mutually exclusive exons, as well as both flanking exons. The *PSI* track in the mutually exclusive exons event wil show two sets of boxplots. The first set refers to Exon 1 PSI levels while the second set refers to Exon 2 PSI levels in the two conditions. Therefore, the example below denotes increased Exon 2 PSI in hypoxia 24h. ```{r, warning = FALSE, message = FALSE} il32_events <- geneEvents(hypoxia_filt, geneS = "IL32", fdr = 0.05, deltaPSI = 0.1 ) plotTranscripts(il32_events, type = "MXE", event_id = 1136, gtf = ens_gtf, zoom = FALSE) ``` ## Alternative 5' and 3' exons Alternative 5' splicing occur due to alternative **donor** sites, while altenative 3' is caused by change in the **acceptor** sites. Practically, these splicing events will lead to expression of longer or shorter exons. The *short exon* track shows transcripts overlapping both short and flanking exons, while the *long exon* track shows transcripts overlapping both long and flanking exons. In these type of events, the *PSI* track indicates inclusion levels for the longest exon. It might be useful for visualization to set `zoom = TRUE` when the alternative splicing generates exons of similar sizes. In the example below, BCS1L contains an exon with alternative 5' splice site, being that the longer exon has a signficantly higher PSI in normal condition (0h), while the shorter exon has higher PSI in hypoxia. ```{r, warning = FALSE, message = FALSE} #A5SS event bcs1l_gene <- geneEvents(hypoxia_filt, geneS = "BCS1L", fdr = 0.05, deltaPSI = 0.1 ) plotTranscripts(bcs1l_gene, type = "A5SS", event_id = 3988, gtf = ens_gtf, zoom = TRUE) ``` # Session info Here is the output of `sessionInfo()` on the system on which this document was compiled: ```{r, echo=FALSE} sessionInfo() ```