--- title: "nearBynding Vignette" author: "Veronica Busa" date: "`r Sys.Date()`" output: rmarkdown::pdf_document urlcolor: blue vignette: > %\VignetteIndexEntry{nearBynding Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- \tableofcontents \newpage ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(nearBynding) library(Rsamtools) ``` ```{r, echo=FALSE} ## test whether the local computer can run the required programs bedtools<-suppressWarnings(system2("bedtools", "--version", stdout = NULL, stderr = NULL)) == 0 CapR<-suppressWarnings(system2("CapR", stdout = NULL, stderr = NULL)) == 0 stereogene<-suppressWarnings(system2("Stereogene", "-h", stdout = NULL, stderr = NULL)) == 0 ``` ## Introduction nearBynding is a package designed to discern annotated RNA structures proximal to protein binding. nearBynding allows users to annotate RNA structure contexts via CapR or input their own annotations in BED/bedGraph format. It accomodates protein binding information from CLIP-seq experiments as either aligned CLIP-seq reads or peak-called intervals. This vignette will walk you through: * The external software necessary to support this pipeline * Creating a concatenated transcriptome * Extracting and folding RNA from the transcriptome via CapR * Mapping protein-binding and RNA structure information onto a transcriptome * Running StereoGene to identify RNA structure proximal to protein binding * Visualizing binding results * Determining the distance between binding contexts Before running any of these examples, it is highly recommended that the user establishes a new empty directory and uses `setwd()` to make certain that all outputs are deposited there. Some of the functions below create multiple output files. ## Installation ```{r, eval = FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("nearBynding") ``` ## External Software Dependencies Add all dependency directories to your PATH after installation. ### bedtools bedtools is available for installation [here.](https://bedtools.readthedocs.io/en/latest/content/installation.html) Installation instructions will vary by operating system. ### CapR Download the zip file from the [github repository](https://github.com/fukunagatsu/CapR), unzip the file, and move it to a directory where you want to permanently store the function. In the command line, access the folder where the unzipped file is stored. ```{r, eval=FALSE} cd CapR-master make ./CapR ``` If installation is successful, the final line will output `Error: The number of argument is invalid.` ### StereoGene Download the zip file from the [github repository](https://github.com/favorov/stereogene), unzip the file, and move it to a directory where you want to permanently store the function. In the command line, access the folder where the unzipped file is stored. ```{r, eval=FALSE} cd stereogene-master cd src make ./stereogene -h ``` If installation is successful, the final line will output a menu of argument options. ## 1. Concatenate the Transcriptome Although nearBynding is designed to support whole-genome analyses, we will exclusively be evaluating protein-coding genes of chromosomes 4 and 5 through this vignette. First, a list of transcripts must be identified for analysis. A recommended criterium for selection is that the transcripts be expressed in the cell type used for CLIP-seq experiments. For this vignette, 50 random transcripts have been selected, and the 3'UTR structure of each transcript will be used for analysis, though any region of a transcript such as 5'UTR or CDS could be assessed instead. This step creates a chain file that will be used to map the selected regions of transcripts end-to-end, excluding the intergenic regions and undesired transcripts that comprise the majority of the genome. ```{r} # load transcript list load(system.file("extdata/transcript_list.Rda", package="nearBynding")) # get GTF file gtf<-system.file("extdata/Homo_sapiens.GRCh38.chr4&5.gtf", package="nearBynding") GenomeMappingToChainFile(genome_gtf = gtf, out_chain_name = "test.chain", RNA_fragment = "three_prime_utr", transcript_list = transcript_list, alignment = "hg38") ``` A file containing the sizes of each concatenated chromosome in the chain file will be required for downstream analysis. ```{r} getChainChrSize(chain = "test.chain", out_chr = "chr4and5_3UTR.size") ``` ## 2. Fold RNA via CapR In order to fold the 3'UTRs, the sequences must first be extracted. This is achieved with the following code: ```{r, eval = FALSE} ExtractTranscriptomeSequence(transcript_list = transcript_list, ref_genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa", genome_gtf = gtf, RNA_fragment = "three_prime_utr", exome_prefix = "chr4and5_3UTR") ``` The reference genome can be found through [Ensembl](ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/), but for users who do not want to download that 3.2GB file for the sake of this vignette, the FASTA output of the above code is available via: ```{r} chr4and5_3UTR.fa <- system.file("extdata/chr4and5_3UTR.fa", package="nearBynding") ``` These sequences can then be submitted to CapR for folding. ```{r, eval = CapR} runCapR(in_file = chr4and5_3UTR.fa) ``` Warning: This step can take hours or even days depending on how many transcripts are submitted, how long the RNA fragments are, and the maximum distance between base-paired nucleotides submitted to the CapR algorithm. ## 3. Map to Transcriptome The nearBynding pipeline can accomodate either a BAM file of aligned CLIP-seq reads or a BED file of peak intervals. BAM files must be sorted and converted to a bedGraph file first, whereas BED files can be read-in directly. #### BAM file input ```{r, eval = bedtools} bam <- system.file("extdata/chr4and5.bam", package="nearBynding") sorted_bam<-sortBam(bam, "chr4and5_sorted") CleanBAMtoBG(in_bam = sorted_bam) ``` #### Map Protein Binding and RNA Structure to the Transcriptome BED or bedGraph files can then be mapped onto the concatenated transcriptome using the chain file created by `GenomeMappingToChainFile()`. This way, only the protein binding from transcriptomic regions of interest will be considered in the final protein binding analysis. ```{r} liftOverToExomicBG(input = "chr4and5_sorted.bedGraph", chain = "test.chain", chrom_size = "chr4and5_3UTR.size", output_bg = "chr4and5_liftOver.bedGraph") ``` For BED file inputs, use the additional argument `format = "BED"`. The RNA structure information from the CapR output also needs to mapped onto the concatenated transcriptome. There are six different binding contexts calculated by CapR -- *stem*, *hairpin*, *multibranch*, *exterior*, *internal*, and *bulge*. `processCapRout()` parses the CapR output, converts it into six separate bedGraph files, and then performs `liftOverToExomic()` for all the files. For this sake of this vignette, the CapR outfile is available: ```{r} processCapRout(CapR_outfile = system.file("extdata/chr4and5_3UTR.out", package="nearBynding"), chain = "test.chain", output_prefix = "chr4and5_3UTR", chrom_size = "chr4and5_3UTR.size", genome_gtf = gtf, RNA_fragment = "three_prime_utr") ``` It is possible for users to input their own RNA structure information rather than using CapR. The information should be in BED file format and can be input into `liftOverToExomicBG()` to map the RNA structure data to the same transcriptome as the protein binding data. ## 4. Calculate Cross-correlation via StereoGene This is the process that directly answers the question, "What does RNA structure look like around where the protein is binding?" StereoGene is used to calculate the cross-correlation between RNA structure and protein binding in order to visualize the RNA structure landscape surrounding protein binding. If CapR is used to determine RNA structure, `runStereogeneOnCapR()` initiates StereoGene for a given protein against all CapR-generated RNA structure contexts. For the sake of this vignette, use `outfiles()` to pull the StereoGene output files to your local directory if you do not want to run StereoGene. ```{r, eval = stereogene} runStereogeneOnCapR(protein_file = "chr4and5_liftOver.bedGraph", chrom_size = "chr4and5_3UTR.size", name_config = "chr4and5_3UTR.cfg", input_prefix = "chr4and5_3UTR") ``` ```{r, echo = FALSE, eval = !stereogene, results='hide'} get_outfiles() ``` If external RNA structure data is being tested, `runStereogene()` initiates analysis for a given protein and a single RNA structure context. Note: The input track file order matters! The correct order is: 1) RNA structure 2) protein binding Otherwise, data visualization will be inverted and all downstream analysis will be backwards. ```{r, eval = FALSE} runStereogene(track_files = c("chr4and5_3UTR_stem_liftOver.bedGraph", "chr4and5_liftOver.bedGraph"), name_config = "chr4and5_3UTR.cfg") ``` ## 5. Visualize Results The cross-correlation output of StereoGene can be visualized as either a heatmap or a line plot. Examples of both are below. For CapR-derived RNA structure, all six contexts can be viewed simultaneously. ```{r, eval = stereogene} visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR", protein_file = "chr4and5_liftOver", heatmap = TRUE, out_file = "all_contexts_heatmap", x_lim = c(-500, 500)) visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR", protein_file = "chr4and5_liftOver", x_lim = c(-500, 500), out_file = "all_contexts_line", y_lim = c(-18, 22)) ``` ```{r, fig.show='hold', echo = FALSE, out.width = '50%'} knitr::include_graphics("all_contexts_heatmap.jpeg") knitr::include_graphics("all_contexts_line.pdf") ``` Warning: This step may take up to an hour for a full transcriptome. Alternatively, a single context can be viewed at a time. ```{r, eval = stereogene} visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", out_file = "stem_line", x_lim = c(-500, 500)) visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", heatmap = TRUE, out_file = "stem_heatmap", x_lim = c(-500, 500)) ``` ```{r, fig.show='hold', echo = FALSE, out.width = '50%'} knitr::include_graphics("stem_heatmap.jpeg") knitr::include_graphics("stem_line.pdf") ``` Although this specific, limited example does not provide a particularly visually stimulating image, larger data sets (which provide many more StereoGene windows) result in narrower peaks and less noise. ## 6. Calculate Distance In order to determine the similarity of two binding contexts, we can calculate the Wasserstein distance between curves. A small value suggests two binding contexts are very similar, whereas larger values suggest substantial differences. For example, calculate the distance between the stem and hairpin contexts visualized above. ```{r} bindingContextDistance(RNA_context = "chr4and5_3UTR_stem_liftOver", protein_file = "chr4and5_liftOver", RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver") ``` Now compare it to the distance between internal and hairpin contexts. ```{r} bindingContextDistance(RNA_context = "chr4and5_3UTR_internal_liftOver", protein_file = "chr4and5_liftOver", RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver") ``` We can see that the stem context is less similar to the hairpin context than the internal context, and this is reflected in the calculated distances. *Questions? Comments? Please email Veronica Busa at vbusa1@jhmi.edu* ```{r} sessionInfo() ```