%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Using Branchpointer for annotation of intronic human splicing branchpoints}
%\VignetteDepends{branchpointer}
%\VignetteKeywords{branchpointer}
%\VignettePackage{branchpointer}

\documentclass{article}
\usepackage[utf8]{inputenc}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
\begin{document}

<<setup, include=FALSE, cache=FALSE>>=
library(knitr)
opts_chunk$set(out.width="0.7\\maxwidth",fig.align="center")
@

\title{Using branchpointer for annotation of intronic human splicing branchpoints}
\author{Beth Signal}

\maketitle

\tableofcontents

\section{Introduction}

The spliceosome mediates the formation of an intron lariat though inter-action between the 5’ splice site and branchpoint (Will and Luhrmann, 2011). A subsequent reaction at the 3’ SS then removes the intron lariat, producing a spliced RNA product. Mapping of branchpoints generally requires sequencing of the intron lariat following cDNA synthesis (Gao et al., 2008; Taggart et al., 2012). However, intron lariats are rapidly de-branched and degraded, and much less abundant than the spliced RNA, resulting in the poor recovery of elements using sequencing. 
Most recently, Mercer et al. (2015) employed a targeted sequencing approach to identify 59,359 branchpoints in 17.4\% of annotated human gene introns. Whilst this constituted the largest annotation to date, the identification of branchpoints was restricted to highly-expressed genes with sufficient sequence coverage. 

To address this limitation, and expand branchpoint annotations across the human genome, we have developed a machine-learning based model of branchpoints trained with this empirical annotation (Signal et al., 2016). This model requires only genomic sequence and exon annotations, and exhibits no discernible bias to gene type or expression, and can be applied using the R package, branchpointer. Aberrant splicing is known to lead to many human diseases (Singh and Cooper, 2012), however prediction of intronic variant effects have been typically limited to splice site alterations (McLaren et al., 2016; Wang et al., 2010). Therefore, in addition to annotation of branchpoints, branchpointer allows users to assess the effects of intronic mutations on branchpoint architecture. 

Gao,K. et al. (2008) Human branch point consensus sequence is yUnAy. Nucleic Acids Res., 36, 2257–67.

McLaren,W. et al. (2016) The Ensembl Variant Effect Predictor. Genome Biol., 17, 122.

Mercer,T.R. et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Res., 25, 290–303.

Signal,B. et al. (2016) Machine-learning annotation of human splicing branchpoints. BioRxiv. doi: 10.1101/094003.

Singh,R.K. and Cooper,T.A. (2012) Pre-mRNA splicing in disease and therapeutics. Trends Mol. Med., 18, 472–482.

Taggart,A.J. et al. (2012) Large-scale mapping of branchpoints in human pre-mRNA transcripts in vivo. Nat. Struct. Mol. Biol., 19, 719–21.

Wang,K. et al. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res., 38, e164.

Will,C.L. and Luhrmann,R. (2011) Spliceosome structure and function. Cold Spring Harb. Perspect. Biol., 3, a003707.


\section{Preparation}

\subsection{Download genome annotations}
Branchpointer requires a genome annotation GTF file for branchpoint annotation.
We will be using the GENCODE annotation (http://www.gencodegenes.org/releases/current.html) as an example, although others and custom annotations can be used.

Create or move to a working directory where these files can be stored.
Note that GTFs can be large files (over 1GB) when uncompressed.

\begin{verbatim}
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz

gunzip gencode.v26.annotation.gtf.gz
\end{verbatim}

branchpointer requires either a BSGenome object, or a genome .fa file for sequence retrieval.
The genome must correspond to the gtf used -- i.e. gencodev26 uses GRCh38.

load a \Biocpkg{BSGenome}:
<<load_bsgenome, message=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg38)
g <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
@

or download .fa:
\begin{verbatim}
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh38.primary_assembly.genome.fa.gz
GRCh38.primary_assembly.genome.fa.gz
\end{verbatim}

\subsection{Read in exon annotations}

Start by loading branchpointer.
<<load_branchpointer, message=FALSE>>=
library(branchpointer)
@

gtfToExons will generate an exon annotation GRanges object from a gtf.
To load in the gtf downloaded from the preparation section: 
<<read_exon_gtf, message=FALSE, eval=FALSE>>=
exons <- gtfToExons("gencode.v26.annotation.gtf")
@

We will load in a small gtf from the package data for the following examples.
<<read_exon_small, message=FALSE>>=
smallExons <- system.file("extdata","gencode.v26.annotation.small.gtf", 
                          package = "branchpointer")
exons <- gtfToExons(smallExons)
@

\section{Branchpoint annotations in intronic regions}

\subsection{Read query and calculate location attributes}

Query regions must contain a branchpoint window - that is the region located at -18 to -44 from the 3' splice site. 
Each region given will be treated as only one query, and associated with the closest 3' exon. 
To cover multiple 3'exons, please provide branchpointer with separate region queries.
For known regions, queries can be supplied as a table with the following formatting:

<<show_introns, message=FALSE>>=
queryIntronFile <- system.file("extdata","intron_example.txt", 
                               package = "branchpointer")
queryIntronTable <- read.delim(queryIntronFile)
head(queryIntronTable)
@

Query files can be read into branchpointer using readQueryFile(), and location information retrieved for these sequences:

<<read_introns, message=FALSE>>=
queryIntron <- readQueryFile(queryIntronFile, 
                             queryType = "region", 
                             exons = exons)
head(queryIntron)
@

Alternatively, to generate branchpoint window region queries by from the gtf annotation, the exon object can be used:

Note that when searching for genes, transcripts, or exons, the ids used must be in the same format as in the annotation file (i.e. ENSG00000XXXXXX.X, ENST00000XXXXXX.X, ENSE00000XXXXXX.X). If you are unsure of an id, aliases can typically be found through ensembl (ensembl.org), or through a biomaRt query.
<<make_intron, message=FALSE>>=
queryIntronFromGTF <- makeBranchpointWindowForExons("ENSE00000939171.1", 
                                                    idType = "exon_id", 
                                                    exons = exons)
head(queryIntronFromGTF)

# for multiple ids:
queryIntronFromGTF <- makeBranchpointWindowForExons(c("ENSE00000939171.1", 
                                                      "ENSE00001814242.1"), 
                                                    idType = "exon_id", 
                                                    exons = exons)
head(queryIntronFromGTF)
@

\subsubsection{Using bedtools with a genome .fa file}

During the prediction step, if a BSGenome object is not specified, sequences covering each site +/- 250 nt can be retrieved using bedtools. 
The absolute location of the bedtools binary must be provided for calls from within R.
To find the location of your installed bedtools binary, using the command line type:

\begin{verbatim}
which bedtools
\end{verbatim}

If chromosome names in the .fa genome file do not match those in the query (i.e chr1 in query, 1 in .fa), the argument rm\_chr should be set to FALSE.

\subsection{Predict branchpoint probabilities}

Branchpoint probability scores can now be evaluated using the branchpointer model.
This will generate a new GRanges object with a row for each site (of 27) in branchpoint window regions.
If a SNP query type is provided (See next section), this will also perform an in silico mutation of the sequence.

We recommend use of the cut-off probability 0.52 to distinguish branchpoints and non-branchpoint sites.
U2 binding energy can be used as a measurement of branchpoint strength when the probability score is above the cut-off.

All features required for the model to predict branchpoint probability are contained within the output object, along with the score and U2 binding energy.

<<predict_introns, message=FALSE>>=
branchpointPredictionsIntron <- predictBranchpoints(queryIntron, 
                                                    queryType = "region", 
                                                    BSgenome = g)
head(branchpointPredictionsIntron)
@

The window scores can be plotted using plotBranchpointWindow(), with optional plots for gene and isoform structure.
The main panel displays the probability scores of each site within the branchpoint window. The opacity of the bars is representative of relative U2 binding energy (darker = stronger), and the lower panel shows U2 binding energy for all sites above the provided probability cutoff.

BRCA2 intron (ENSE00000939169.1 - ENSE00000939171.1):
<<brca2-plot>>=
plotBranchpointWindow(queryIntron$id[2], 
                      branchpointPredictionsIntron,
                      probabilityCutoff = 0.52,
                      plotMutated = FALSE,
                      plotStructure = TRUE,
                      exons = exons)
@


\section{Effects of SNPs on branchpoint annotations}
In addition to locating branchpoints in intronic windows, branchpointer can be used to evaluate the local effects of SNPs on branchpoints.
The general workflow is the same as for annotation of intronic windows, however queryType="SNP" must be used.

\subsection{Read query and calculate location attributes}

Query SNPs should be located nearby a branchpoint window to have any potential effects on branchpoint architecture
SNP queries can be supplied as a table formatted as follows:

<<show_snp, message=FALSE>>=
querySNPFile <- system.file("extdata","SNP_example.txt", 
                            package = "branchpointer")
querySNPTable <- read.delim(querySNPFile)
head(querySNPTable)
@

When reading in exceptionally large numbers of SNPs, it is recommended to set filter = TRUE.
This adds a pre-filtering step which removes any SNPs not located in an intron or 50nt from a 3' exon.

Each SNP will be associated with the closest 3' exon. If SNPs are distal from branchpoint windows, the max\_dist argument will remove any greater than the specified distance. Filtering prior to exon associations can therefore speed up processing in instances where it is unknown if the majority of SNPs fall nearby branchpoint windows.

<<read_snp, message=FALSE>>=
querySNP <- readQueryFile(querySNPFile, 
                          queryType = "SNP", 
                          exons = exons, 
                          filter = TRUE)
head(querySNP)
@
Queries can be provided as stranded or unstranded. 
In the case of unstranded queries, any value except "+" or "-" will cause branchpointer to run on both strands.

Alternatively, appropriate attributes can be pulled from biomart when a list of refsnp ids is provided:
<<read_snp_mart, message=FALSE>>=
library(biomaRt)
mart <- useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp",host="www.ensembl.org")
querySNP <- makeBranchpointWindowForSNP(c("rs587776767","rs786205083"), 
                                        mart.snp = mart, 
                                        exons = exons, 
                                        filter  = FALSE)
head(querySNP)
@
By default, all SNPs retrieved will be unstranded, and hence further processing will be done on both strands

\subsection{Predict branchpoint probabilities}

Using a .fa and bedtools:
<<snp_att_fa, message=FALSE, eval=FALSE>>=
branchpointPredictionsSNP <- predictBranchpoints(querySNP,
                                        queryType = "SNP",
                                        genome = "GRCh38.primary_assembly.genome.fa",
                                        bedtoolsLocation="/Apps/bedtools2/bin/bedtools")
@

Using a BSgenome:
<<snp_att_BS, message=FALSE>>=
#for query SNPs
branchpointPredictionsSNP <- predictBranchpoints(querySNP,
                                        queryType = "SNP",
                                        BSgenome = g)
head(branchpointPredictionsSNP)

#to summarise effects:
querySNPSummary <- predictionsToSummary(querySNP,branchpointPredictionsSNP)
head(querySNPSummary)
@

The window scores in the reference and alternative sequences can be visualised using plotBranchpointWindow().

rs587776767 in TH intron
<<rs587776767-plot>>=
plotBranchpointWindow(querySNP$id[2], 
                      branchpointPredictionsSNP,
                      probabilityCutoff = 0.52,
                      plotMutated = TRUE,
                      plotStructure = TRUE,
                      exons = exons)
@
\section{Performance}

Branchpointer is vectorised where possible to decrease run times. 

When reading in SNP queries, a prefiltering step can be applied by setting filter=TRUE. This step adds less than 10 seconds to readQueryFile(), and can reduce run times when the locations of SNPs are unknown [Figure 1].

\begin{figure}[h!]
    \centering
    \textbf{}
    \par\medskip
    \includegraphics[scale=1]{readQueryFile_time.pdf}
    \caption{Time taken for readQueryFile() on SNP query files}
\end{figure}


predictBranchpoints() is the rate limiting step, taking ~55 seconds per 100 region queries. This can be lowered using parallelisation by setting useParallel=TRUE and specifying the number of cores [Figure 2].

\begin{figure}[h!]
    \centering
    \textbf{}
    \par\medskip
    \includegraphics[scale=1]{predictBranchpoints_time.pdf}
    \caption{Time taken for predictBranchpoints() on region queries}
\end{figure}


\subsection{Example run times}
<<run times, message=FALSE, eval=FALSE>>=

# Step times for annotating branchpoints in introns:
gtfToExons()
# user  system elapsed 
# 41.385   3.848  47.096 
 
# Set 1. 294 lincRNA introns on chr22:
makeBranchpointWindowForExons()
# user  system elapsed 
# 0.196   0.024   0.226
predictBranchpoints()
# user  system elapsed 
# 208.934   4.157 225.849

# Set 2. 3693 protein coding exons on chr22:
makeBranchpointWindowForExons()
# user  system elapsed 
# 0.245   0.013   0.261 
predictBranchpoints()
# user   system  elapsed 
# 2332.519   38.266 2482.032

# Step times for annotating branchpoints with SNPs:
# 29899 GWAS SNPS
readQueryFile(filter = TRUE)
# user  system elapsed 
# 5.997   1.608   7.773
readQueryFile(filter = FALSE)
# user  system elapsed 
# 1.744   0.427   2.339 

# 298 filtered SNPS
predictBranchpoints()
# user  system elapsed 
# 172.495   2.485 181.876 

predictionsToSummary()
# user  system elapsed 
# 0.057   0.003   0.061
@
  
Example scripts used to test times are found in
inst/scripts, and were run on a 2.4GHz Macbook Pro with 8GB RAM

\section{Session info}
<<sessionInfo, eval=TRUE>>=
sessionInfo()
@

\end{document}