---
title: "nearBynding Vignette"
author: "Veronica Busa"
date: "`r Sys.Date()`"
output: rmarkdown::pdf_document
urlcolor: blue
vignette: >
    %\VignetteIndexEntry{nearBynding Vignette}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

\tableofcontents
\newpage

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
library(nearBynding)
library(Rsamtools)
```
```{r, echo=FALSE}
## test whether the local computer can run the required programs
bedtools<-suppressWarnings(system2("bedtools", "--version", 
                                    stdout = NULL, stderr = NULL)) == 0
CapR<-suppressWarnings(system2("CapR", stdout = NULL, stderr = NULL)) == 0
stereogene<-suppressWarnings(system2("Stereogene", "-h", 
                                    stdout = NULL, stderr = NULL)) == 0
```

## Introduction

nearBynding is a package designed to discern annotated RNA structures proximal
to protein binding. nearBynding allows users to annotate RNA structure contexts
via CapR or input their own annotations in BED/bedGraph format. It accomodates 
protein binding information from CLIP-seq experiments as either aligned 
CLIP-seq reads or peak-called intervals. This vignette will walk you through:

    * The external software necessary to support this pipeline
    * Creating a concatenated transcriptome
    * Extracting and folding RNA from the transcriptome via CapR
    * Mapping protein-binding and RNA structure information onto a transcriptome
    * Running StereoGene to identify RNA structure proximal to protein binding
    * Visualizing binding results
    * Determining the distance between binding contexts

Before running any of these examples, it is highly recommended that the user 
establishes a new empty directory and uses `setwd()` to make certain that all 
outputs are deposited there. Some of the functions below create multiple output 
files.

## Installation

```{r, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("nearBynding")
```

## External Software Dependencies

Add all dependency directories to your PATH after installation.

### bedtools

bedtools is available for installation 
[here.](https://bedtools.readthedocs.io/en/latest/content/installation.html)

Installation instructions will vary by operating system.

### CapR

Download the zip file from the
[github repository](https://github.com/fukunagatsu/CapR), 
unzip the file, and move it to a directory where you want to permanently store 
the function.

In the command line, access the folder where the unzipped file is stored.

```{r, eval=FALSE}
cd CapR-master
make
./CapR
```

If installation is successful, the final line will output

`Error: The number of argument is invalid.`

### StereoGene

Download the zip file from the 
[github repository](https://github.com/favorov/stereogene), 
unzip the file, and move it to a directory where you want to permanently store 
the function.

In the command line, access the folder where the unzipped file is stored.

```{r, eval=FALSE}
cd stereogene-master
cd src
make
./stereogene -h
```

If installation is successful, the final line will output a menu of argument 
options.

## 1. Concatenate the Transcriptome

Although nearBynding is designed to support whole-genome analyses, we will 
exclusively be evaluating protein-coding genes of chromosomes 4 and 5 through
this vignette.

First, a list of transcripts must be identified for analysis. A recommended 
criterium for selection is that the transcripts be expressed in the cell type 
used for CLIP-seq experiments. For this vignette, 50 random transcripts have 
been selected, and the 3'UTR structure of each transcript will be used for 
analysis, though any region of a transcript such as 5'UTR or CDS could be 
assessed instead.

This step creates a chain file that will be used to map the selected regions of 
transcripts end-to-end, excluding the intergenic regions and undesired 
transcripts that comprise the majority of the genome.

```{r}
# load transcript list
load(system.file("extdata/transcript_list.Rda", package="nearBynding"))
# get GTF file
gtf<-system.file("extdata/Homo_sapiens.GRCh38.chr4&5.gtf", 
                    package="nearBynding")

GenomeMappingToChainFile(genome_gtf = gtf,
                        out_chain_name = "test.chain",
                        RNA_fragment = "three_prime_utr",
                        transcript_list = transcript_list,
                        alignment = "hg38")
```

A file containing the sizes of each concatenated chromosome in the chain file 
will be required for downstream analysis.

```{r}
getChainChrSize(chain = "test.chain", 
                out_chr = "chr4and5_3UTR.size")
```

## 2. Fold RNA via CapR

In order to fold the 3'UTRs, the sequences must first be extracted. This is 
achieved with the following code:

```{r, eval = FALSE}
ExtractTranscriptomeSequence(transcript_list = transcript_list,
                    ref_genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                    genome_gtf = gtf,
                    RNA_fragment = "three_prime_utr",
                    exome_prefix = "chr4and5_3UTR")
```

The reference genome can be found through 
[Ensembl](ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/), but 
for users who do not want to download that 3.2GB file for the sake of this 
vignette, the FASTA output of the above code is available via:

```{r}
chr4and5_3UTR.fa <- system.file("extdata/chr4and5_3UTR.fa", 
                                package="nearBynding")
```

These sequences can then be submitted to CapR for folding.

```{r, eval = CapR}
runCapR(in_file = chr4and5_3UTR.fa)
```

Warning: This step can take hours or even days depending on how many 
transcripts are submitted, how long the RNA fragments are, and the maximum 
distance between base-paired nucleotides submitted to the CapR algorithm.

## 3. Map to Transcriptome

The nearBynding pipeline can accomodate either a BAM file of aligned CLIP-seq 
reads or a BED file of peak intervals. BAM files must be sorted and 
converted to a bedGraph file first, whereas BED files can be read-in directly.

#### BAM file input

```{r, eval = bedtools}
bam <- system.file("extdata/chr4and5.bam", package="nearBynding")
sorted_bam<-sortBam(bam, "chr4and5_sorted")

CleanBAMtoBG(in_bam = sorted_bam)
```

#### Map Protein Binding and RNA Structure to the Transcriptome

BED or bedGraph files can then be mapped onto the concatenated transcriptome 
using the chain file created by `GenomeMappingToChainFile()`. This way, only 
the protein binding from transcriptomic regions of interest will be considered 
in the final protein binding analysis.

```{r}
liftOverToExomicBG(input = "chr4and5_sorted.bedGraph",
                    chain = "test.chain",
                    chrom_size = "chr4and5_3UTR.size",
                    output_bg = "chr4and5_liftOver.bedGraph")
```

For BED file inputs, use the additional argument `format = "BED"`.

The RNA structure information from the CapR output also needs to mapped onto 
the concatenated transcriptome. There are six different binding contexts 
calculated by CapR -- *stem*, *hairpin*, *multibranch*, *exterior*, *internal*, 
and *bulge*. `processCapRout()` parses the CapR output, converts it into six
separate bedGraph files, and then performs `liftOverToExomic()` for all the 
files.

For this sake of this vignette, the CapR outfile is available:

```{r}
processCapRout(CapR_outfile = system.file("extdata/chr4and5_3UTR.out", 
                                            package="nearBynding"),
                chain = "test.chain",
                output_prefix = "chr4and5_3UTR",
                chrom_size = "chr4and5_3UTR.size",
                genome_gtf = gtf,
                RNA_fragment = "three_prime_utr")
```

It is possible for users to input their own RNA structure information rather 
than using CapR. The information should be in BED file format and can be input 
into `liftOverToExomicBG()` to map the RNA structure data to the same 
transcriptome as the protein binding data.

## 4. Calculate Cross-correlation via StereoGene

This is the process that directly answers the question, "What does RNA 
structure look like around where the protein is binding?" StereoGene is used to 
calculate the cross-correlation between RNA structure and protein binding in 
order to visualize the RNA structure landscape surrounding protein binding.

If CapR is used to determine RNA structure, `runStereogeneOnCapR()` initiates 
StereoGene for a given protein against all CapR-generated RNA structure 
contexts.

For the sake of this vignette, use `outfiles()` to pull the StereoGene output
files to your local directory if you do not want to run StereoGene.

```{r, eval = stereogene}
runStereogeneOnCapR(protein_file = "chr4and5_liftOver.bedGraph",
                    chrom_size = "chr4and5_3UTR.size",
                    name_config = "chr4and5_3UTR.cfg",
                    input_prefix = "chr4and5_3UTR")
```

```{r, echo = FALSE, eval = !stereogene, results='hide'}
get_outfiles()
```

If external RNA structure data is being tested, `runStereogene()` initiates 
analysis for a given protein and a single RNA structure context.

Note: The input track file order matters! The correct order is:

    1) RNA structure
    2) protein binding

Otherwise, data visualization will be inverted and all downstream analysis will 
be backwards.

```{r, eval = FALSE}
runStereogene(track_files = c("chr4and5_3UTR_stem_liftOver.bedGraph", 
                                "chr4and5_liftOver.bedGraph"),
                name_config = "chr4and5_3UTR.cfg")
```

## 5. Visualize Results

The cross-correlation output of StereoGene can be visualized as either a 
heatmap or a line plot. Examples of both are below.

For CapR-derived RNA structure, all six contexts can be viewed simultaneously.

```{r, eval = stereogene}
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
                        protein_file = "chr4and5_liftOver",
                        heatmap = TRUE,
                        out_file = "all_contexts_heatmap",
                        x_lim = c(-500, 500))
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
                        protein_file = "chr4and5_liftOver",
                        x_lim = c(-500, 500),
                        out_file = "all_contexts_line",
                        y_lim = c(-18, 22))
```
```{r, fig.show='hold', echo = FALSE, out.width = '50%'}
knitr::include_graphics("all_contexts_heatmap.jpeg")
knitr::include_graphics("all_contexts_line.pdf")
```

Warning: This step may take up to an hour for a full transcriptome.

Alternatively, a single context can be viewed at a time.

```{r, eval = stereogene}
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
                    protein_file = "chr4and5_liftOver",
                    out_file = "stem_line",
                    x_lim = c(-500, 500))
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
                    protein_file = "chr4and5_liftOver",
                    heatmap = TRUE,
                    out_file = "stem_heatmap",
                    x_lim = c(-500, 500))
```
```{r, fig.show='hold', echo = FALSE, out.width = '50%'}
knitr::include_graphics("stem_heatmap.jpeg")
knitr::include_graphics("stem_line.pdf")
```

Although this specific, limited example does not provide a particularly 
visually stimulating image, larger data sets (which provide many more 
StereoGene windows) result in narrower peaks and less noise.

## 6. Calculate Distance

In order to determine the similarity of two binding contexts, we can calculate 
the Wasserstein distance between curves. A small value suggests two binding 
contexts are very similar, whereas larger values suggest substantial 
differences. 

For example, calculate the distance between the stem and hairpin contexts 
visualized above.

```{r}
bindingContextDistance(RNA_context = "chr4and5_3UTR_stem_liftOver",
                        protein_file = "chr4and5_liftOver",
                        RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
```

Now compare it to the distance between internal and hairpin contexts.

```{r}
bindingContextDistance(RNA_context = "chr4and5_3UTR_internal_liftOver",
                        protein_file = "chr4and5_liftOver",
                        RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
```

We can see that the stem context is less similar to the hairpin context than 
the internal context, and this is reflected in the calculated distances.

*Questions? Comments? Please email Veronica Busa at vbusa1@jhmi.edu*

```{r}
sessionInfo()
```