---
title: "Upcoming Features"
author:
- name: Zachary L. Skidmore
  affiliation: McDonnell Genome Institute - Washington University
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc_float: true
package: GenVisR
abstract:
    Instructions for visualizing small variants using the GenVisR package
vignette: >
  %\VignetteIndexEntry{Visualizing Small Variants}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# !!!!!!!!! Please Read important note !!!!!!!!!!!!!!!

GenVisR is being converted to S4 to increase flexibility for users. The S4-based functions in this vignette are operative but changes may still occur and documentation is still being developed. Please keep in mind that the **new functions based on S4 are similarily named to the established functions**; both are available for used in this release. For example:

**In General new functions start with a capital, old functions do not**

```{r, eval=FALSE}
# new S4 function (note the capital W)
?Waterfall

# older established function
?waterfall
```

# GenVisR a brief introduction

Intuitively visualizing and interpreting data from high-throughput genomic technologies continues to be challenging. Often creating a publication ready graphic not only requires extensive manipulation of data but also an in-depth knowledge of graphics libraries. As such creating such visualizations has traditionally taken a significant amount of time in regards to both data pre-processing and aesthetic manipulations. GenVisR (Genomic Visulizations in R) attempts to alleviate this burden by providing highly customizable publication-quality graphics in an easy to use structure. Many of the plotting functions in this library have a focus in the realm of human cancer genomics however we support a large number of species and many of the plotting methods incorprated within are of use for visualizing any type of genomic abnormality.

## getting started

For the majority of users we recommend installing GenVisR from the release branch of Bioconductor, Installation instructions using this method can be found on the [GenVisR](https://bioconductor.org/packages/GenVisR) landing page on Bioconductor.

Please note that GenVisR imports a few packages that have “system requirements”, in most cases these requirements will already be installed. If they are not please follow the instructions to install these packages given in the R terminal. Briefly these packages are: “libcurl4-openssl-dev” and “libxml2-dev”

Once GenVisR is successfully installed it will need to be loaded. For the purposes of this vignette we do this here and set a seed to ensure reproducibility.

```{r, message=FALSE, tidy=TRUE}
# set a seed
set.seed(426)

# load GenVisR into R
library(GenVisR)
```

# Visualization of small variants

Genomic changes come in many forms, one such form are single nucleotide variants (SNVS) and small insertions and deletions (INDELS). These variations while small can have a devestating impact on the health of cells and are the basis for many genomic diseases. We provide 5 functions for visualizing the impact these changes can have and to aid in recognizing patterns that could be associated with genomic diseases.

## Supported files

In order to be as flexible as possible GenVisR is designed to work with 3 file formats which encode information regarding small variants. The first of these are vep formated files from the [Variant Effect Predictor (VEP)](https://ensembl.org/info/docs/tools/vep/index.html), a tool developed by ensembl to annotate the effect a variant has on a genome. [Mutation Annotation Format (MAF)](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification) files are also widely used in the bioinformatics community particularly within The Cancer Genome Atlas (TCGA) project and are supported by GenVisR. Annotation files from the [Genome Modeling System (GMS)](https://github.com/genome/gms) are also supported. In the interest of flexibility a fourth option exists as well, users can supply a `data.table` object to any function designed to work with small variants. Users should note however that specific columns are required for functions when using this method (see specific functions for details).

## Plot Objects

The output from `Waterfall()`, `Lolliplot()`, `Rainfall()`, `SmallVariantSummary()`, and `MutSpectra()` are objects which store the data which was plotted as well as the individual subplots, and the final arranged plot. The reason for storing the data and plots in such a manner is two-fold. First allowing access to the data which was plotted provides transparency as to how the plot was produced. Secondly as all this items can be accessed by the user it allows for greater flexibility for customizing plots. These data and plots can be accessed with the `getData()` and `getGrob()` functions respectively. The final plot can be printed with the `drawPlot()` function which takes one of the afore mentioned objects. 

# Basic Pipeline

## Reading in data

When constructing a plot with GenVisR there are three basic steps. First one must load the data into R, as mentioned previously GenVisR supports three filetypes which can be read in this manner. An example for each is supplied below using test data installed with the package.

### Basic Syntax

1. Reading in a MAF file
```{r, message=FALSE, tidy=TRUE}
# get the disk location for maf test file
testFileDir <- system.file("extdata", package="GenVisR")
testFile <- Sys.glob(paste0(testFileDir, "/brca.maf"))

# define the objects for testing
mafObject <- MutationAnnotationFormat(testFile)
```

2. Reading in a VEP file
```{r, message=FALSE, tidy=TRUE}
# get the disk location for test files
testFileDir <- system.file("extdata", package="GenVisR")
testFile <- Sys.glob(paste0(testFileDir, "/*.vep"))

# define the object for testing
vepObject <- VEP(testFile)
```

3. Reading in a GMS file
```{r, tidy=TRUE}
# get the disk location for test files
testFileDir <- system.file("extdata", package="GenVisR")
testFile <- Sys.glob(paste0(testFileDir, "/FL.gms"))

# define the objects for testing
gmsObject <- GMS(testFile)
```

### Viewing Data

In some cases you may want to view data after it has been read in, there are functions available to make viewing these data easier. Briefly these are `getSample()`, `getPosition()`, `getMutation()`, and `getMeta()` which are globally available for each object type (GMS, VEP, etc.). Let's test one out by viewing the samples from the VEP file that was read in with `VEP()`.

```{r, tidy=TRUE, eval=FALSE}
# view the samples from the VEP file
getSample(vepObject)
```

### Additional Notes

All of these functions expect a path to a file of the appropriate type. Optionally a wildcard can be supplied with * as was done with `VEP()` causing the function to read in multiple files at once. With the exception of MAF files these functions **expect the files being read in to have a column called "sample"** (MAF files already have a sample designation within the file). If a sample column is not found one will be created based on the filenames. Each of these functions will attempt to infer a file specification version from the file header. If a version is not found one will need to be specified via the `version` parameter.

## Constructing a Plot Object and viewing plotted data

Once the data is read in and stored in one of the previously mentioned objects the data can be plotted with one of the plotting functions. We will go over each plotting function in more detail in section 4 however to continue with our example pipeline let's create a simple waterfall plot for the VEP annotated data.

```{r, tidy=TRUE, warning=FALSE}
waterfallPlot <- Waterfall(vepObject, recurrence=.40)
```

With the plot object created we can view the actual data making up the plot with the `getData()` function. This is an accessor function to pull out specific data making up the plot (Refer to the R documentation for `Waterfall()` to see available slots in the object which hold data). Let's use it here to extract the data making up the main plot panel, we can specify the slot either by name or it's index.

```{r eval=FALSE}
# extract data by the slot name
getData(waterfallPlot, name="primaryData")

# extract data by the slot index (same as above)
getData(waterfallPlot, index=1)
```

## Drawing and saving a plot object

With the plot object defined we can now visualize the plot, this is achieved by calling the `drawPlot()` function on the plot object. The plot can be saved simply by opening a graphics device using the base R functions such as `pdf()`, `svg()` etc. It should be noted that space within the graphics device will need to be adequate in order to draw a plot. If plots appear to be overlapping try giving the plot more space to draw via the `height` and `width` parameters.

```{r eval=FALSE}
# draw the plot
drawPlot(waterfallPlot)

# draw the plot and save it to a pdf
pdf(file="waterfall.pdf", height=10, width=15)
drawPlot(waterfallPlot)
dev.off()
```

# Functions for visualizing small variants

## Waterfall

The Waterfall plot is designed to make recognizing patterns related to co-occurring or mutually exclusive events easier to visualize across a cohort. It plots mutations for a gene/sample in a hierarchical order beginning with the most recurrently mutated gene for a cohort and working it's way down. By default two sub-plots are also displayed which show the mutation burden for a sample and the number of genes effected by a mutation in a cohort. Let's start by making a simple waterfall plot with default parameters from the MAF file we read in earlier.

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE}
# draw a waterfall plot for the maf object
drawPlot(Waterfall(vepObject, recurrence=.20))
```

### Using filtering parameters with Waterfall

Relevant Parameters: `samples`, `recurrence`, `genes`, `geneMax`, `mutation`

Often it is the case that the input data supplied to the `Waterfall()` function will contain thousands of genes and hundreds of samples. While `Waterfall()` can handle such scenarios the graphics device `Waterfall()` would neeed to output to would have to be enlarged to such a degree that the visualization may become unwieldy. To alleviate such issues and show the most relevant data `Waterfall()` provides a suite of filtering parameters. Let's filter our plot using a few of these, suppose we only wanted to visualize those genes which occur in 50% of the cohort and we further didn't care about sample "FLX0010Naive". We could tell `Waterfall()` this by giving the samples we only wanted plotted via the `samples` parameter and setting the `recurrence` parameter to .50.

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE}
# show those genes which recur in 50% of the cohort for these 4 samples
drawPlot(Waterfall(vepObject, recurrence=.50, samples=c("FLX0040Naive", "FLX0070Naive", "FLX0050Naive", "FLX0030Naive")))
```

### Mutation sub-plot

Relevant Parameters: `coverage`, `plotA`, `plotATally`, `plotALayers`

You might have noticed that when filtering the waterfall plot the top sub-plot didn't change i.e. for sample "FLX0040Naive" the Frequency of mutations remained around 50. You may also have noticed a warning when constructing a waterfall plot for the VEP data saying that duplicate genomic locations were detected. This is important to note and understand, regardless of any filtering that occurs the top sub-plot will always be based on the original input with one caveat. It will check for duplicate variants *(i.e. same sample, allele, genomic position)* and remove variants which are deemed to be duplicated. This occurs in the case of VEP as different transcripts can be annotated with different mutation types but still be the same variant which is why we see the warning. By default `Waterfall()` will simply output the frequencies observed in the cohort. However a mutation burden can be displayed as well by changing the value for parameter `plotA` from "frequency" to "burden". In doing so you will also **need to supply a value to `coverage`** giving the aproximate space in base pairs for which a mutation could have been called. Doing so will plot a mutation burden instead of a mutation frequency calculated via the folowing formula:

`Mutation Burden = (# of observed mutations per sample)/(# of bases for which a mutation could be called) * 1000000`

For a more accurate mutation burden calculation we can also supply coverage values individual per sample via a named vector, let's do that and tell `Waterfall()` to calculate a mutation burden by mutation type + sample rather than sample alone via the parameter `plotATally`. Lastly we will set the `drop` parameter to FALSE to avoid dropping mutations from the legend that are not in the main plot but are in the top sub-plot.

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE}
# define a coverage for each sample
sampCov <- c("FLX0040Naive"=1.45e7, "FLX0070Naive"=1.39e7, "FLX0050Naive"=1.21e7, "FLX0030Naive"=1.3e7, "FLX0010Naive"=1.1e7)

drawPlot(Waterfall(vepObject, recurrence=.50, coverage=sampCov, plotA="burden", plotATally="complex", drop=FALSE))
```

### Altering the mutation colors and hierarchy

Relevant Parameters: `mutationHierarchy`

You may ask yourself what happens when there are two different mutations for the same gene/sample, which one is plotted? For each file type a pre-defined hierarchy of mutations has been established which follows the order of the "Mutation Type" legend from top to bottom. By default mutations are based on a hierarchy of being most deleterious however both the priority and color of mutations can be changed by supplying a data.table to the `mutationHierarchy` parameter. Let's assume that we are particularly interested in splice variants, by default these might not show up because a "missense_variant" is deemed to be more deleterious but let's change that by supplying a custom hierarchy along with colors to go along with it. To do that we will need to give the parameter a data.table object with column names "mutation" and "color".

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE}
# find which mutations are in the input data
mutations <- getMutation(vepObject)$Conse

# define a new color and hierarchy for mutations
library(data.table)
newHierarchy <- data.table("mutation"=c("splice_region_variant", "splice_acceptor_variant",  "splice_donor_variant", "missense_variant", "stop_gained"), "color"=c("tomato1", "tomato2", "tomato3", "purple", "cyan"))

# draw the plot
drawPlot(Waterfall(vepObject, recurrence=.50, mutationHierarchy = newHierarchy, plotATally="complex", drop=FALSE))
```

Here we can see that for sample "FLX0040Naive" on gene *BIRC6* there is actually a splice_region_variant as well as a missense_variant.

<span style="color:red">Note:</span> *If mutations are found in the input but are not listed in the mutationHierarchy parameter they will be addded and assigned random colors*

<span style="color:red">Note:</span> *In a VEP formated file mutations can be comma delimited, if they are not specifically specified in the mutation hierarchy as comma delimited the mutations are split up and collapsed*

## MutSpectra

The MutSpectra plot is designed to make recognizing bias in the relative rates of transitions and transversions easier. It annotates transitions and transversions based on the variant call and then plots the relative proportion observed in each sample. A frequency plot for these transitions and transversions is also calculated and displayed in order to determine if an observed ratio is attributable to a low number of mutations for a given sample.

### Input

Input to the `MutSpectra()` function can be just one of the afore mentioned and supported file format object discussed in section 3.1.1. *In cases where the file format does not provide a reference base, such as with vep,  a `BSgenome` object will also need to be provided* in order to annotate reference bases for a given position. These `BSgenome` objects are a core object class in bioconductor and can be installed through the instructions on the bioconductor site. Let's go ahead and create a MutSpectra plot for the vep data. We first determine which assembly vep used for these annotations ussing the `getHeader()` function and can see that it is GRCh37. We next load in the BSgenome for GRCh37 (GRCh37 and hg19 are analogous assemblies with very minor differences) and supply the BSgenome object to the parameter `BSgenome`.

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE, message=FALSE}
# determine the appropriate BSgenome object to use from the vep header
assembly <- getHeader(vepObject)
assembly <- assembly[grepl("assembly", assembly$Info),]

# load in the correct BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19)

# create a MutSpectra plot
drawPlot(MutSpectra(vepObject, BSgenome=BSgenome.Hsapiens.UCSC.hg19))
```

<span style="color:red">Note:</span> *In the example above you'll see a warning related to chromosome mismatches, this is part of the difference between GRCh37 and hg19, chromosomes are designated differently between the two (i.e. chr1 and 1). `MutSpectra()` is smart enough to recognize this and append chr to all chromosomes.*

<span style="color:red">Note:</span> *If you are unsure what assembly to use try running `MutSpectra()` without specifying one. It will attempt to match an assembly from bioconductor with the information in the header.*

## Rainfall

The Rainfall plot is designed to make visualizing kataegis across the entire genome easier. It calculates the genomic distances between consecutive mutations for each sample and chromosome on a log10 scale. This information is displayed on the y-axis over the context of genome coordinates on the x-axis. A density plot for these mutations is also displayed above the main plot to display the number of mutations within a mutation hotspot.

### Input

As with the MutSpectra plots basic input consists of an object corresponding to one of the supported file types. Further a BSgenome object is required if the input format does not contain a reference column (see the MutSpectra section for more details). Let's start by creating a Rainfall plot for our VEP object. You might notice in the plot that density estimates are missing for a few chromosomes, this is normal and is done intentionally to avoid artificially skewing the y-axis when there are too few points to construct a confident density plot.

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE, message=FALSE}
# determine the appropriate BSgenome object to use from the vep header
assembly <- getHeader(vepObject)
assembly <- assembly[grepl("assembly", assembly$Info),]

# load in the correct BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19)

# create a Rainfall plot
drawPlot(Rainfall(vepObject, BSgenome=BSgenome.Hsapiens.UCSC.hg19))
```

### Using filtering parameters with Rainfall

By default the `Rainfall()` function plots everything within the input, this means that it is not only plotting all chromosomes but also all samples. In some cases this may not be desireable but fortunately `Rainfall()` has two parameters to limit the 
data which is plotted, these are `sample` and `chromosomes` which will limit the samples and chromosomes ploted respectively. Let's go ahead and try them out, we'll limit our plot to only chromosomes 1-3 and only sample "FLX0010Naive".

```{r, fig.keep='last', fig.width=10, fig.height=6.5, tidy=TRUE, warning=FALSE, message=FALSE}
# create a Rainfall plot limiting the chromosomes and samples plotted
drawPlot(Rainfall(vepObject, BSgenome=BSgenome.Hsapiens.UCSC.hg19, sample=c("FLX0010Naive"), chromosomes=c("chr1", "chr2", "chr3")))
```

## Lolliplot

TODO: rework the existing lolliplot to be quicker and look better

## SmallVariantSummary

Coming Soon!

# Session Info

```{r, message=FALSE, tidy=TRUE}
sessionInfo()
```