---
title: 'NxtIRFcore: Differential Alternative Splicing and Intron Retention analysis'
author: "Alex CH Wong"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
output:
    rmarkdown::html_document:
        highlight: pygments
        toc: true
        toc_float: true
abstract:
    Intron Retention (IR) is a form of alternative splicing whereby the intron 
    is retained (i.e. not spliced) in final messenger RNA. Although many 
    bioinformatics tools are available to quantitate other forms of alternative
    splicing, dedicated tools to quantify Intron Retention are limited. 
    Quantifying IR requires not only measurement of spliced transcripts (often 
    using mapped splice junction reads), but also measurement of the coverage of
    the putative retained intron. The latter requires adjustment for the fact 
    that many introns contain repetitive regions as well as other RNA expressing
    elements. IRFinder corrects for many of these complexities; however its 
    dependencies on Linux and STAR limits its wider usage. Also, IRFinder does
    not calculate other forms of splicing besides IR. Finally, IRFinder 
    produces text-based output, requiring an established understanding of the
    data produced in order to interpret its results.
    
    NxtIRF overcomes the above limitations. Firstly, NxtIRF incorporates the
    IRFinder C++ routines, allowing users to run the IRFinder algorithm in the
    R/Bioconductor environment on multiple platforms. NxtIRF is a full pipeline
    that quantifies IR (and other alternative splicing) events, organises the
    data and produces relevant visualisation. Additionally, NxtIRF offers an
    interactive graphical interface that allows users to explore the data.
    
    NxtIRFcore is the command-line version of NxtIRF. Version 
    `r packageVersion("NxtIRFcore")`
vignette: >
    %\VignetteIndexEntry{NxtIRFcore: Differential Alternative Splicing and Intron Retention analysis}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# (1) Installation and Quick-Start

This section provides instructions for installation and a quick working example
to demonstrate the important functions of NxtIRF. NxtIRFcore is the command line
utility for NxtIRF.

For detailed explanations of each step shown here, refer to 
chapter 2: "Explaining the NxtIRF workflow" in this
vignette. For a list of ready-made "recipes" for typical-use NxtIRF in real
datasets, refer to chapter 3: "NxtIRF cookbook"

### Installation

To install NxtIRFcore, start R (version "4.1") and enter: 

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("NxtIRFcore")
```

(Optional) For **MacOS** users, make sure OpenMP libraries are installed 
correctly. We recommend users follow this 
[guide](https://mac.r-project.org/openmp/), but the
quickest way to get started is to install `libomp` via brew:

```{bash eval=FALSE}
brew install libomp
```


### Loading NxtIRF

```{r}
library(NxtIRFcore)
```

### Building the NxtIRF reference

A NxtIRF reference requires a genome FASTA file (containing genome nucleotide
sequences) and a gene annotation GTF file (preferably from Ensembl or Gencode).

NxtIRF provides an example genome and gene annotation which can be accessed
via the NxtIRFdata package installed with NxtIRF:

```{r}
# Provides the path to the example genome:
chrZ_genome()

# Provides the path to the example gene annotation:
chrZ_gtf()
```

Using these two files, we construct a NxtIRF reference as follows:

```{r, results='hide', message = FALSE, warning = FALSE}
ref_path = file.path(tempdir(), "Reference")
BuildReference(
    reference_path = ref_path,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)
```

### Running IRFinder

NxtIRF provides an example set of 6 BAM files to demonstrate its use via this
vignette.

Firstly, retrieve the BAM files from ExperimentHub using the NxtIRF helper
function `NxtIRF_example_bams()`. This makes a copy of the BAM files to the
temporary directory:

```{r}
bams = NxtIRF_example_bams()
bams
```

Finally, run NxtIRF/IRFinder as follows:

```{r, results='hide', message = FALSE, warning = FALSE}
irf_path = file.path(tempdir(), "IRFinder_output")
IRFinder(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = irf_path
)
```

### Collate individual IRFinder runs to build a NxtIRF Experiment

First, collate the IRFinder output files using the helper function
`Find_IRFinder_Output()`

```{r}
expr = Find_IRFinder_Output(irf_path)
```

This creates a 3-column data frame with sample name, IRFinder gzipped text
output, and COV files. Compile these output files into a single experiment:

```{r, results='hide', message = FALSE, warning = FALSE}
nxtirf_path = file.path(tempdir(), "NxtIRF_output")
CollateData(
    Experiment = expr,
    reference_path = ref_path,
    output_path = nxtirf_path
)
```

### Importing the collated data as a NxtSE object:

The `NxtSE` is a data structure that inherits `SummarizedExperiment`

```{r, results='hide', message = FALSE, warning = FALSE}
se = MakeSE(nxtirf_path)
```

### Set some experimental conditions:

```{r}
colData(se)$condition = rep(c("A", "B"), each = 3)
colData(se)$batch = rep(c("K", "L", "M"), 2)
```

### Perform differential alternative splicing

The code below will contrast condition:B in respect to condition:A

```{r, results='hide', message = FALSE, warning = FALSE}
# Requires limma to be installed:
require("limma")
res_limma = limma_ASE(
    se = se,
    test_factor = "condition",
    test_nom = "B",
    test_denom = "A",
)

# Requires DESeq2 to be installed:
require("DESeq2")
res_deseq = DESeq_ASE(
    se = se,
    test_factor = "condition",
    test_nom = "B",
    test_denom = "A",
)

# Requires DoubleExpSeq to be installed:
require("DoubleExpSeq")
res_DES = DoubleExpSeq_ASE(
    se = se,
    test_factor = "condition",
    test_nom = "B",
    test_denom = "A",
)
```

### Visualise a Coverage plot of a differential IR event:

Filter by visibly-different events:

```{r}
res_limma.filtered = subset(res_limma, abs(AvgPSI_A - AvgPSI_B) > 0.05)
```

Plot individual samples:

```{r, fig.width = 8, fig.height = 6}
p = Plot_Coverage(
    se = se,
    Event = res_limma.filtered$EventName[1],
    tracks = colnames(se)[c(1,2,4,5)],
)
as_egg_ggplot(p)
```

Display the plotly interactive version of the coverage plot (not shown here)

```{r eval = FALSE}
# Running this will display an interactive plot
p$final_plot
```

Plot by condition:

```{r, fig.width = 8, fig.height = 6}
p = Plot_Coverage(
    se = se,
    Event = res_limma.filtered$EventName[1],
    tracks = c("A", "B"),
    condition = "condition",
    stack_tracks = TRUE,
    t_test = TRUE,
)
as_egg_ggplot(p)
```

```{r include = FALSE}
# Reset to prevent interfering with next section
colData(se)$condition = NULL
colData(se)$batch = NULL
```

# (2) Explaining the NxtIRF Workflow

This section explains the working example provided in the prior "Quick-Start" 
section, to demonstrate how to use NxtIRF.

## Generating the NxtIRF reference

NxtIRF first needs to generate a set of reference files. The NxtIRF reference
is used to quantitate IR and alternative splicing, as well as in downstream
visualisation tools.

### Building an example NxtIRF reference using BuildReference()

First, load the NxtIRF package:

```{r, eval = FALSE}
library(NxtIRFcore)
```

NxtIRF generates a reference from a user-provided genome FASTA and 
genome annotation GTF file, and is optimised for Ensembl references but can
accept other reference GTF files. Alternatively, NxtIRF accepts AnnotationHub
resources, using the record names of AnnotationHub records as input.

We will first demonstrate a runnable example using the included example 
NxtIRF genome. This was created using 7 example genes (SRSF1, SRSF2, SRSF3, 
TRA2A, TRA2B, TP53 and NSUN5). The SRSF and TRA family of genes all contain
poison exons flanked by retained introns. Additionally, NSUN5 contains an
annotated IR event in its terminal intron. Sequences from these 7 genes were
aligned into one sequence to create an artificial chromosome Z (chrZ). The gene
annotations were modified to only contain the 7 genes with the modified genomic
coordinates.

```{r, eval = FALSE}
ref_path = file.path(tempdir(), "Reference")

GetReferenceResource(
    reference_path = ref_path,
    fasta = chrZ_genome(), 
    gtf = chrZ_gtf()
)

BuildReference(reference_path = ref_path)

# or equivalently, in a one-step process:

BuildReference(
    reference_path = ref_path,
    fasta = chrZ_genome(), 
    gtf = chrZ_gtf()
)

```

The first function, GetReferenceResource(), requires 3 arguments:
(1) `fasta` The file (or url) of the genome FASTA file,
(2) `gtf` The file (or url) of the gene annotation GTF file, and
(3) `reference_path` The directory (ideally an empty directory) to create
the NxtIRF reference

`GetReferenceResource()` processes the given source genome / annotation files,
downloading them where necesssary, and saves a local copy of the GTF file as
well as a compressed version of the FASTA sequence file (as a TwoBitFile which
is a binary compressed version of the genome sequence).

The next function, `BuildReference()`, uses the resources to build the NxtIRF
reference. NxtIRF builds its own version of alternative splicing annotations
by comparing differential combinations of splice junctions between the
transcripts of each gene. Events annotated include Skipped Exons (SE),
Mutually Exclusive Exons (MXE), Alternative First / Last Exons (AFE / ALE),
Alternative 5' or 3' Splice Sites (A5SS / A3SS), and Intron Retention (IR).
For IR events, every intron is considered as a potential Intron Retention event,
assuming most IR events are not annotated.

Additionally, BuildReference assesses the coding sequence changes that arise if
each individual IR event occurs, annotating the IR events as NMD-inducing if the
inclusion of the intron results in premature termination codons (PTCs, as 
defined by a stop codon located 55 bases upstream of the last exon junction).

## Quantitate IR and Alternative Splicing in Aligned BAM files

NxtIRF adopts the IRFinder algorithm to measure IR in aligned BAM files. The
IRFinder algorithm also provides spliced junction counts that is used by
NxtIRF to quantitate alternate splicing events.

### Running IRFinder() on example BAM files

In this vignette, we provide 6 example BAM files. These were generated based on
aligned RNA-seq BAMs of 6 samples from the Leucegene AML dataset (GSE67039).
Sequences aligned to hg38 were filtered to only include genes aligned to that
used to create the chrZ chromosome. These sequences were then re-aligned to
the chrZ reference using STAR.

First, we download the example bam files using the following convenience 
function:

```{r, eval = FALSE}
bam_path = file.path(tempdir(), "bams")

# Copies NxtIRF example BAM files to `bam_path` subdirectory; returns BAM paths
example_bams(path = bam_path)
```

Often, alignment pipelines process multiple samples. NxtIRF provides convenience
functions to recursively locate all the BAM files in a given folder, and tries
to ascertain sample names. Often sample names can be gleaned when:
* The BAM files are named by their sample names, e.g. "sample1.bam", 
"sample2.bam". In this case, `level = 0`
* The BAM files have generic names but are contained inside parent directories
labeled by their sample names, e.g. "sample1/Unsorted.bam", 
"sample2/Unsorted.bam". In this case, `level = 1`

```{r,  eval = FALSE}
# as BAM file names denote their sample names
bams = Find_Bams(bam_path, level = 0) 

# In the case where BAM files are labelled using sample names as parent 
# directory names (which oftens happens with the STAR aligner), use level = 1
```

This convenience function retrieves a data frame with the first and second
columns as the sample names and paths of all the BAM files found. We use this to
run IRFinder on all the BAM files using the following:

```{r, eval = FALSE}
irf_path = file.path(tempdir(), "IRFinder_output")
IRFinder(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = irf_path,
    n_threads = 2,
    overwrite = FALSE,
    run_featureCounts = FALSE
)
```

This runs IRFinder using 2 threads. Multi-threading is performed using OpenMP
(where available). If OpenMP is not installed when NxtIRFcore was compiled, 
BiocParallel will be used instead. Setting `overwrite = FALSE` ensures
if IRFinder files were generated by a previous run, these would not be 
overwritten. 

Also (optionally), users can generate gene counts from the same
reference using `run_featureCounts = TRUE` (requires the Rsubread package).
The featureCounts output is stored in the "main.FC.Rds" file which can be
retrieved using below:

```{r, eval = FALSE}
# Re-run IRFinder without overwrite, and run featureCounts
require(Rsubread)

IRFinder(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = irf_path,
    n_threads = 2,
    overwrite = FALSE,
    run_featureCounts = TRUE
)

# Load gene counts
gene_counts <- readRDS(file.path(irf_path, "main.FC.Rds"))

# Access gene counts:
gene_counts$counts
```

## Collating IRFinder output into a single dataset

IRFinder produces output on individual samples. NxtIRF makes it easy to collate
the output from multiple samples. As some splicing events may
occur in some samples but not others, it is important to unify the data and
organise it into a single data structure. NxtIRF uses a specialized class 
(called the NxtSE object) to contain multiple arrays of data. This is used to 
contain Included and Excluded event counts as well as associated quality control
data (including intronic "coverage" which is the fraction of the intron covered
by RNA-seq reads in IR events).

### Reviewing the IRFinder output:

Having run IRFinder on all individual BAM files of an experiment, we can collate
this data using the CollateData() function. Again, we use a convenience function
to generate a list of IRFinder output files:

```{r, eval = FALSE}
expr <- Find_IRFinder_Output(irf_path)
```

`expr` is a 3-column data frame. The first two columns contain the sample 
names and IRFinder output file path (as a ".txt.gz" - gzipped text file). 
This contains all of the original output from vanilla IRFinder, 
as well as sample QC readouts). Feel free to unzip a file to see what output
is generated.

The third column contains "COV" file paths. COV files are generated by NxtIRF
and contains RNA-seq read coverage data in a novel compressed format. COV files
compress data much better than BigWig files, and also contains RNA-seq coverage
data from individual strands. Currently, only NxtIRF is able to read COV files
but we anticipate other packages can use COV files via NxtIRF (in future
NxtIRF releases)! We will demonstrate how to use COV files in a later section.

## Using CollateData to parse IRFinder output from multiple files.

Now that we have an organised list of IRFinder output files, we parse this into
`CollateData()`

```{r, eval = FALSE}
nxtirf_path = file.path(tempdir(), "NxtIRF_output")

CollateData(
    Experiment = expr,
    reference_path = ref_path,
    output_path = nxtirf_path,
    IRMode = "SpliceOverMax",
    n_threads = 1
)
```

CollateData extracts IR quantitation and junction counts and organises these
counts into unified arrays of data. CollateData also compiles QC parameters
for all samples, including read depth and strandedness (directionality).

`IRMode` is a parameter that specifies how IRFinder calculates percentage
intron retention (PIR). Previously, IRFinder estimates spliced (or exonic)
abundance by including junction reads that involve either flanking exon
`SpliceMax = max(SpliceLeft, SpliceRight)`. 
This is done to correct for the possibility
of alternate exons flanking the intron being assessed. NxtIRF extends this 
estimate by accounting for the possibility that BOTH flanking exons may be
alternate exons, thereby accounting for splice events that overlap the intron
but does not involve the junctions of either flanking exons. We call this
parameter `SpliceOverMax`, which includes `SpliceMax` with the
addition of distant splice events.

## Building a NxtSE object from collated data

Now that `CollateData()` has created the unified data in the \code{"NxtSE"}
directory, we can retrieve the data as a NxtSE object:

```{r, eval = FALSE}
se = MakeSE(nxtirf_path, RemoveOverlapping = TRUE)
```

Two things of note:

(1) By default, MakeSE constructs the NxtSE object using all the samples in
the collated data. It is possible (and particularly useful in large data sets)
to read only a subset of samples. In this case, construct a data.frame object
with the first column containing the desired sample names and parse this into
the `colData` parameter as shown:

```{r, eval = FALSE}
subset_samples = colnames(se)[1:4]
df = data.frame(sample = subset_samples)
se_small = MakeSE(nxtirf_path, colData = df, RemoveOverlapping = TRUE)
```

(2) In complex transcriptomes including those of human and mouse, alternative
splicing implies that introns are often overlapping. Thus, algorithms run the
risk of over-calling intron retention where overlapping introns are assessed.
NxtIRF removes overlapping introns by considering only introns belonging to
the major splice isoforms. It estimates a list of introns of major isoforms
by assessing the compatible splice junctions of each isoform, and removes
overlapping introns belonging to minor isoforms. To disable this functionality,
set `RemoveOverlapping = FALSE`.

## Filtering for Expressed Splicing Events 

Often, the gene annotations contain isoforms for all discovered splicing events.
Most annotated transcripts are not expressed, and their inclusion in 
differential analysis complicates results including adjusting for multiple 
testing. It is prudent to filter these out using various approaches, akin to 
removing genes with low gene counts in differential gene analysis. We suggest
using the default filters which generally excludes splicing events whereby the
total included / excluded event counts less than 20 RNA-seq reads. There are
other quality-control filters included but these are out of the scope of this
vignette (please see the documentation for details).

To filter the NxtSE object using default filters:

```{r, results = FALSE, message = FALSE, warning = FALSE}
expressed_events <- apply_filters(se, get_default_filters())

se.filtered = se[expressed_events,]
```

In the above code, `get_default_filters()` retrieves the default filters
suggested by NxtIRF. `apply_filters` returns a vector containing which
events are included if the filters are applied to the NxtSE object. Finally,
we obtain a NxtSE object by subsetting the original NxtSE using the vector
of `expressed_events`.

To view a description of what these filters actually do, simply display them:

```{r}
get_default_filters()
```

To construct custom filters, create a NxtFilter object as shown:

```{r}
f1 = NxtFilter(filterClass = "Data", filterType = "Depth", minimum = 20)
f1
```

For more information about the filters available, refer to the manual for
details:

```{r eval=FALSE}
?NxtFilter
```

## Performing analysis of Differential Alternative Splicing Events

Before performing differential analysis, we must annotate our samples. Currently
there are no annotations:

```{r }
colData(se.filtered)
```

To populate the NxtSE object with annotations, we can modify colData just as we
would for a SummarizedExperiment object.

```{r }
colData(se.filtered)$condition = rep(c("A", "B"), each = 3)
colData(se.filtered)$batch = rep(c("K", "L", "M"), 2)

# To display what colData is:
as.data.frame(colData(se.filtered))
```

We can use either limma to perform differential alternative splicing analysis

```{r }
require("limma")

# Compare by condition-B vs condition-A

res_limma_condition <- limma_ASE(
    se = se.filtered,
    test_factor = "condition",
    test_nom = "B",
    test_denom = "A",
)

# Compare by condition-B vs condition-A, batch-corrected by "batch"

res_limma_treatment_batchcorrected <- limma_ASE(
    se = se.filtered,
    test_factor = "condition",
    test_nom = "B",
    test_denom = "A",
    batch1 = "batch"
)
```

NB: we can also use DESeq2 or DoubleExpSeq to perform the same analysis using
`DESeq_ASE()` or `DoubleExpSeq_ASE()`. The differences are as follows:

* Limma: models counts as log-normal distributed
* DESeq2: models counts as negative binomial distributed
* DoubleExpSeq: models alternative splicing ratios as beta-binomial distributed

## Visualisation of Differential ASE Analysis

The following guide demonstrates how to plot standard figures including scatter
plots of average percent-spliced-in (PSI) values, volcano plots of 
differentially expressed IR / AS events, and heatmaps of PSI values.

### Scatter plot of differential Alternate Splicing Events

These functions return a data frame with the differential analysis results.
These include average percent-spliced-in values for both conditions, which can
be used directly to produce a scatter plot comparing the two conditions:

```{r, fig.width = 7, fig.height = 5}
library(ggplot2)

ggplot(res_limma_condition, aes(x = 100 * AvgPSI_B, y = 100 * AvgPSI_A)) + 
    geom_point() + xlim(0, 100) + ylim(0, 100) +
    labs(title = "PSI values across conditions",
         x = "PSI of condition B", y = "PSI of condition A")
```

Note the columns in the results are defined by the names of the conditions
"A" and "B".

To filter for specific splicing events (e.g. IR only), simply filter the
results data.frame by EventType:

```{r, fig.width = 7, fig.height = 5}
ggplot(subset(res_limma_condition, EventType == "IR"), 
        aes(x = 100 * AvgPSI_B, y = 100 * AvgPSI_A)) + 
    geom_point() + xlim(0, 100) + ylim(0, 100) +
    labs(title = "PIR values across conditions (IR Only)",
         x = "PIR of condition B", y = "PIR of condition A")
```

### Volcano plot of differential Alternate Splicing Events

`limma_ASE` and `DESeq_ASE` contain direct output from limma or DESeq2.
The P.Value and pvalue of limma and DESeq2 output denote nominal P values, 
whereas multiple-correction is performed via the adj.P.Val and padj columns of 
limma and DESeq2, respectively. To plot these:

```{r, fig.width = 7, fig.height = 5}
ggplot(res_limma_condition,
        aes(x = logFC, y = -log10(adj.P.Val))) + 
    geom_point() +
    labs(title = "Differential analysis - B vs A",
         x = "Log2-fold change", y = "BH-adjusted P values (-log10)")
```

### Heatmap of differential events

NxtIRF provides convenience functions to retrieve the splicing ratios via the
NxtSE object. To produce a heatmap, first we create a matrix of splicing ratios
of the top 10 differential events:

```{r }
mat = make_matrix(
    se.filtered,
    event_list = res_limma_condition$EventName[1:10],
    method = "PSI"
)
```

`make_matrix()` has the option of supplying values as logit-transformed. This
is useful to contrast where the majority of values are near the 0- or 1- 
boundary.

Also, when the dataset contains low-abundance splicing depths, samples with 
splicing depths below a certain threshold can be excluded (replaced with NA).
Simply set `depth_threshold` to a level below which sample-events will be
converted to NA. Also, events can be excluded from display if a certain fraction
of samples have low coverage (i.e. return NA values). This filter can be set
using `na.percent.max`. This is useful as events with high number of NA
values can return errors in heatmap functions that also perform clustering

With the matrix of values, one can produce a heatmap as shown:

```{r, fig.width = 8, fig.height = 6}
library(pheatmap)

pheatmap(mat, annotation_col = as.data.frame(colData(se.filtered)))
```

## NxtIRF Coverage Plots

NxtIRF is able to produce RNA-seq coverage plots of analysed samples. Coverage
data is compiled simultaneous to the IR and junction quantitation performed
by the IRFinder C++ routine. This data is saved in "COV" files, which is a BGZF
compressed and indexed file. COV files show compression and performance gains
over BigWig files.

Additionally, NxtIRF performs coverage plots of multiple samples combined based
on user-defined experimental conditions. This is a powerful tool to
illustrate group-specific differential splicing or IR. NxtIRF does this by
normalising the coverage depths of each sample based on transcript depth at
the splice junction / intron of interest. By doing so, the coverage depths of
constitutively expressed flanking exons are normalised to unity. As a result,
the intron depths reflect the fraction of transcripts with retained introns and
can be compared across samples.

We will first demonstrate by plotting the RNA-seq coverage of a single gene
by a single sample. `Plot_Coverage()` performs the calculations and generates
a compound object containing both static and interactive plots. We can coerce
this to a static plot using `as_egg_ggplot()`

```{r, fig.width = 8, fig.height = 6}
res = Plot_Coverage(
    se = se.filtered, 
    Gene = "TP53", 
    tracks = colnames(se.filtered)[1]
)

as_egg_ggplot(res)
```

The interactive plot (not run here) can be displayed by directly calling
the `final_plot` element of the object returned by `Plot_Coverage`

```{r eval = FALSE}
# Running this will display an interactive plot
res$final_plot
```

There are many transcripts in TP53! This is because by default, NxtIRF displays
all annotated transcripts. For clarity, one can either collapse the transcripts
at a per-gene level, by setting condense_tracks = TRUE:
 
```{r, fig.width = 8, fig.height = 6}
res = Plot_Coverage(
    se = se.filtered, 
    Gene = "TP53",
    tracks = colnames(se.filtered)[1],
    condense_tracks = TRUE
)
as_egg_ggplot(res)
```

Alternatively, for fine control, one can supply a vector containing the
transcript names to be displayed:

```{r, fig.width = 8, fig.height = 6}
res = Plot_Coverage(
    se = se.filtered, 
    Gene = "TP53",
    tracks = colnames(se.filtered)[1],
    selected_transcripts = c("TP53-201", "TP53-204")
)
as_egg_ggplot(res)
```

In the heatmap in the previous section, we can see some retained introns in 
NSUN5 that are more highly expressed in "02H003" and "02H025". To demonstrate 
this, we first introduce a new condition that groups these two samples:

```{r }
colData(se.filtered)$NSUN5_IR = c(rep("High", 2), rep("Low", 4))
```

Performing differential analysis will confirm this to be the case:

```{r }
require("limma")

res_limma_NSUN5 <- limma_ASE(
    se = se.filtered,
    test_factor = "NSUN5_IR",
    test_nom = "High",
    test_denom = "Low",
)

head(res_limma_NSUN5$EventName)
```

We can now visualise the top hit, NSUN5 intron 8. To visualise the IR event
across conditions, we specify the type of condition to contrast. 
The names of the tracks will be the names of the nominator (test) and 
denominator (control) conditions

```{r, fig.width = 8, fig.height = 6}
res = Plot_Coverage(
    se = se.filtered, 
    Event = res_limma_NSUN5$EventName[1],
    condition = "NSUN5_IR",
    tracks = c("High", "Low"),
)
as_egg_ggplot(res)
```

Note the lines represent mean coverages of all samples in the labelled
experimental conditions. These are achieved by normalising their individual
coverages by their transcript abundance, which is calculated by summing the
abundances of spliced and intron-retaining transcripts. Thus, the normalised
coverage at the specified splice junction is approximately 1.0.

The grey shading represent 95% confidence intervals. It is possible that the
shaded areas are larger as we traverse away from the splice junction. This may
be because of differences in 3'-sequencing bias between different samples of the
same condition.

Although NSUN5 intron 8 is the most differentially retained in the analysis, the
difference isn't very apparent. This is because the difference between the
absolute values is very small. In contrast, intron 2 looks more prominently
different in the heatmap, so we can explore this as well.

We can further compare the coverage by plotting the two conditions on the same
track. This done by setting `stack_tracks = TRUE`. Further, we can add a track
that tests the per-nucleotide statistical difference between the normalised
coverage between the two conditions. To do this, set `t_test = TRUE`:

```{r, fig.width = 8, fig.height = 6}
res = Plot_Coverage(
    se = se.filtered, 
    Event = "NSUN5/ENST00000252594_Intron2/clean",
    condition = "NSUN5_IR",
    tracks = c("High", "Low"),
    stack_tracks = TRUE,
    t_test = TRUE
)
as_egg_ggplot(res)
```

On the first track, the different coloured traces represent the two conditions.
The second track plots the -log10 transformed p values, using Students T-test to
test the difference between the coverages between the conditions, at 
per-nucleotide resolution.

Note that the t-test track does not replace formal statistical analysis using
the `limma_ASE` or `DESeq_ASE` functions. Instead, they provide a useful
adjunct to assess the adequacy of the normalisation of the group coverages.

There are more complex customisation options with NxtIRF's Plot_Coverage tool
which will be explored in subsequent tutorials (TODO).

# (3) NxtIRF cookbook

```{r eval = FALSE}
library(NxtIRFcore)
```

## Reference Generation

First, define the path to the directory in which the reference should be stored.
This directory will be made by NxtIRF, but its parent directory must exist,
otherwise an error will be returned.

```{r eval = FALSE}
ref_path = "./Reference"
```

### Create a NxtIRF reference from user-defined FASTA and GTF files locally:

Note that setting `genome_path = "hg38"` will prompt NxtIRF to use the default
files for nonPolyA and Mappability exclusion references in the generation of its
reference. Valid options for `genome_path` are "hg38", "hg19", "mm10" and "mm9".

```{r eval=FALSE}
BuildReference(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "hg38"
)
```

### Create a NxtIRF reference using web resources from Ensembl's FTP:

The following will first download the genome and gene annotation files from the
online resource and store a local copy of it in a file cache, facilitated by
BiocFileCache. Then, it uses the downloaded resource to create the NxtIRF
reference.

```{r eval=FALSE}
FTP = "ftp://ftp.ensembl.org/pub/release-94/"

BuildReference(
    reference_path = ref_path,
    fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"), 
    gtf = paste0(FTP, "gtf/homo_sapiens/",
        "Homo_sapiens.GRCh38.94.chr.gtf.gz"),
    genome_type = "hg38"
)
```

### Create a NxtIRF reference using AnnotationHub resources:

AnnotationHub contains Ensembl references for many genomes. To browse what is
available:

```{r}
require(AnnotationHub)

ah = AnnotationHub()
query(ah, "Ensembl")
```

For a more specific query:

```{r}
query(ah, c("Homo Sapiens", "release-94"))
```

We wish to fetch "AH65745" and "AH64631" which contains the desired FASTA
and GTF files, respectively. To build a reference using these resources:

```{r eval=FALSE}
BuildReference(
    reference_path = ref_path,
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38"
)
```

`BuildReference` will recognise the inputs of `fasta` and `gtf` as AnnotationHub
resources as they begin with "AH".

### Create a NxtIRF reference from species other than human or mouse:

For human and mouse genomes, we highly recommend specifying `genome_type` as
the default mappability file is used to exclude intronic regions with repeat
sequences from intron retention analysis. For other species, one could
generate a NxtIRF reference without this reference:

```{r eval=FALSE}
BuildReference(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = ""
)
```

Alternatively, if `STAR` is available on the computer or server where R/RStudio
is being run, we can use the one-line function `BuildReference_Full`. This
function will:
* Prepare the resources from the given FASTA and GTF files
* Generate a STAR genome
* Use the STAR genome and the FASTA file to *de-novo* calculate and define low
  mappability regions
* Build the NxtIRF reference using the genome resources and mappability file

```{r eval=FALSE}
BuildReference_Full(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "",
    use_STAR_mappability = TRUE,
    n_threads = 4
)
```

`n_threads` specify how many threads should be used to build the STAR reference
and to calculate the low mappability regions

Finally, if `STAR` is not available, `Rsubread` is available on Bioconductor to
perform mappability calculations. The example code in the manual is displayed
here for convenience, to demonstrate how this would be done:

```{r eval = FALSE}
# (1a) Creates genome resource files 

ref_path = file.path(tempdir(), "Reference")

GetReferenceResource(
    reference_path = ref_path,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# (1b) Systematically generate reads based on the NxtIRF example genome:

Mappability_GenReads(
    reference_path = ref_path
)

# (2) Align the generated reads using Rsubread:

# (2a) Build the Rsubread genome index:

setwd(ref_path)
Rsubread::buildindex(basename = "./reference_index", 
    reference = chrZ_genome())

# (2b) Align the synthetic reads using Rsubread::subjunc()

Rsubread::subjunc(
    index = "./reference_index", 
    readfile1 = file.path(ref_path, "Mappability", "Reads.fa"), 
    output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"), 
    useAnnotation = TRUE, 
    annot.ext = chrZ_gtf(), 
    isGTF = TRUE
)

# (3) Analyse the aligned reads in the BAM file for low-mappability regions:

Mappability_CalculateExclusions(
    reference_path = ref_path,
    aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam")
)

# (4) Build the NxtIRF reference using the calculated Mappability Exclusions

BuildReference(ref_path)      

```

## Align Raw FASTQ files

### Checking if STAR is installed

To use `STAR` to align FASTQ files, one must be using a system with `STAR` 
installed.
This software is not available in Windows. To check if `STAR` is available:

```{r}
STAR_version()
```

### Building a STAR reference

```{r eval = FALSE}

ref_path = "./Reference"

# Ensure genome resources are prepared from genome FASTA and GTF file:

if(!file.exists(file.path(ref_path, "resources"))) {
    GetReferenceResource(
        reference_path = ref_path,
        fasta = "genome.fa",
        gtf = "transcripts.gtf"
    )
}

# Generate a STAR genome reference:
STAR_BuildRef(
    reference_path = ref_path,
    n_threads = 8
)

```

### Aligning a single sample using STAR

```{r eval = FALSE}
STAR_align_fastq(
    STAR_ref_path = file.path(ref_path, "STAR"),
    BAM_output_path = "./bams/sample1",
    fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq",
    n_threads = 8
)
```

### Aligning multiple samples using STAR

```{r eval = FALSE}
Experiment = data.frame(
    sample = c("sample_A", "sample_B"),
    forward = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_1.fastq", "sample_B_1.fastq")),
    reverse = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_2.fastq", "sample_B_2.fastq"))
)

STAR_align_experiment(
    Experiment = Experiment,
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    BAM_output_path = "./bams",
    n_threads = 8,
    two_pass = FALSE
)
```

To use two-pass mapping, set `two_pass = TRUE`. We recommend disabling this
feature, as one-pass mapping is adequate in typical-use cases.

## Running IRFinder on BAM files

To conveniently find all BAM files recursively in a given path:

```{r eval=FALSE}
bams = Find_Bams("./bams", level = 1)
```

This convenience function returns the putative sample names, either from BAM
file names themselves (`level = 0`), or from the names of their parent 
directories (`level = 1`).

To use IRFinder using 4 OpenMP threads:

```{r eval=FALSE}
# assume NxtIRF reference has been generated in `ref_path`

IRFinder(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = "./IRFinder_output",
    n_threads = 4,
    Use_OpenMP = TRUE
)
```

## Creating COV files from BAM files without running IRFinder

Sometimes one may wish to create a COV file from a BAM file without running
the IRFinder algorithm. One reason might be because a NxtIRF/IRFinder reference
is not available.

To convert a list of BAM files, run `BAM2COV()`. This is a function structurally
similar to `IRFinder()` but without the need to give the path to the NxtIRF
reference:

```{r eval=FALSE}

BAM2COV(
    bamfiles = bams$path,
    sample_names = bams$sample,
    output_path = "./IRFinder_output",
    n_threads = 4,
    Use_OpenMP = TRUE
)
```

## Collating IRFinder data into a single experiment:

Assuming the NxtIRF reference is in `ref_path`, after running IRFinder as shown
in the previous section, use the convenience function `Find_IRFinder_Output()`
to tabulate a list of samples and their corresponding IRFinder outputs:

```{r eval=FALSE}
expr = Find_IRFinder_Output("./IRFinder_output")
```

This data.frame can be directly used to run `CollateData`:

```{r eval = FALSE}
CollateData(
    Experiment = expr,
    reference_path = ref_path,
    output_path = "./NxtIRF_output"
)
```

Then, the collated data can be imported as a `NxtSE` object, which is an object
that inherits `SummarizedExperiment` and has specialized containers to hold
additional data required by NxtIRF.

```{r eval = FALSE}
se = MakeSE("./NxtIRF_output")
```

## Downstream analysis using NxtIRFcore

Please refer to chapters 1 and 2 for worked examples using the NxtIRF example
dataset.

# SessionInfo

```{r}
sessionInfo()
```