---
title: The Shiny Variant Explorer
author:
    -   name: Kévin Rue-Albrecht
        email: kevinrue67@gmail.com
        affiliation:
            - Department of Medicine, Imperial College London, UK
            - Nuffield Department of Medicine, University of Oxford, UK
date: "`r doc_date()`"
package: "`r pkg_ver('TVTB')`"
abstract: >
    [Shiny](http://shiny.rstudio.com) web-application that demonstrates the
    functionalities of the *TVTB* package integrated
    in a programming-free environment.
vignette: >
    %\VignetteIndexEntry{The Shiny Variant Explorer}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
output:
    BiocStyle::html_document
bibliography:
    TVTB.bib
---

# Preliminary notes {#Preliminary}

The Shiny Variant Explorer (*tSVE*) was primarily developed to demonstrate
features implemented in the `r Biocpkg("TVTB")` package, **not** as a
production environment.
As a result, a few important considerations should be made
to clarify what should and should **not** be expected from the 
web-application:

* Bug fixes for the web-application will be given a much lower priority than
    those related to package methods.
* It is technically not feasible to offer in a web-interface the same degree
    of flexibility as the command-line environment (*e.g.* `...`[^1]).
* Greater control over the input and output data is possible at the
    command-line (*e.g.* refinement of `ggplot` objects,
    definition of custom genomic ranges).
* Requests for new features should apply to the package first.
    Only features relevant to the package functionalities
    may be made available in the web-application.
* Figures (`ggplot`) are currently the only output that can be exported from
    the web-application (using the web browser
    "Download image", or equivalent context menu item).
    In the future, action buttons may be added to export
    tables (*e.g.* CSV format) and figures (*e.g.* PDF format).
* This vignette is largely static as the web-application may only be used
    in an interactive session.
* First-time users are encouraged to follow this vignette
    sequentially (*i.e.* in order, without skipping sections),
    as it takes readers through the sequence of actions of a typical analysis.
    + This vignette was designed to be read beside an open *R* session with
        the `r Biocpkg("TVTB")` package installed, so that users may follow
        the instructions marked by the word **Action** and bulleted points
        in the following sections.

[^1]: The `...` argument is called "ellipsis".

# Pre-requisites {#Prerequisites}

The *Shiny Variant Explorer* suggests a few additional package dependencies
compared to the package, to support certain forms of data input and display.

**Input**

* The `r Biocpkg("ensembldb")` package and relevant `EnsDb`[^2] annotation
    packages are required if that interface is used to query genomic ranges
    (demonstrated in this [section](#InputEnsDb)).
* The `r Biocpkg("EnsDb.Hsapiens.v75")` package is required to query
    genomic ranges associated with gene names for the demonstration data[^3].
* The `r Biocpkg("rtracklayer")` package is required if a BED file is used
    to provide genomic ranges (demonstrated in this [section](#InputBED)).

[^2]: In the future, the web-application may also
    support `TxDb` and `OrganismDb` annotation packages.

[^3]: In the future, the web-application may also use annotation packages
    to facet statistics and figures by genomic range(s).

**Display**

* The latest version of the `r Githubpkg("rstudio/DT")` package is recommended
    to benefit from the latest developments (*e.g.* column filters inactivated
    if a single value exists in that column; version *>= 0.2.2*).
* The `r Githubpkg("rstudio/shiny")` package is required for all Shiny
    web-applications.
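
Whether the suggested packages listed above are available can be checked
before launching the web-application.
The following is a minimal sketch that uses only base *R*; the vector of
package names simply repeats the names mentioned in this section.

```{r checkSuggested}
# Suggested packages named in this section
suggested <- c(
    "ensembldb", "EnsDb.Hsapiens.v75", "rtracklayer", "DT", "shiny")
# TRUE for each package that can be loaded from the current library paths
available <- vapply(suggested, requireNamespace, logical(1), quietly = TRUE)
available
# Names of the packages that would still need to be installed (if any)
names(available)[!available]
```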

# Launching the Shiny Variant Explorer {#Launch}

The `TVTB::tSVE()` method launches the web-application.
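
As a minimal sketch, the call may be guarded so that it only runs in an
interactive session with the package installed; the guard is an addition for
illustration, not a requirement of the package.
Note that the *R* console typically remains busy until the web-application
is closed.

```{r launchTSVE, eval=FALSE}
# Launch the web-application only where it can actually be used
if (interactive() && requireNamespace("TVTB", quietly = TRUE)) {
    TVTB::tSVE()
}
```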

# Overall layout of the web-application {#OverallLayout}

Overall, the web-application is implemented as a web-page
with a top level navigation bar organised 
from left to right to reflect progression through a typical analysis,
with the exception of the last two menu items **Settings** and **Session**,
which may be useful to check and update at any point.

Here is a brief overview of the menu items:

* **Input**
    + Control which samples, phenotypes, genomic ranges, and VCF fields
        must be imported.
    + An `EnsDb` annotation package may be selected to use the associated
        database interface.
* **Frequencies**
    + Add and remove INFO fields that contain calculated genotype counts
        and allele frequencies.
    + Add and remove genotype counts and allele frequencies across all
        samples, or within individual phenotype levels.
* **Filters**
    + Define and apply *VCF filter rules* (detailed in a separate vignette).
* **Views**
    + Display and examine major objects of the analysis and their slots.
* **Plots**
    + Display data plots and associated data tables.
* **Settings**
    + Control advanced parameters of the analysis and web-application.
* **Session**
    + Display session information and other relevant information.

# Input panel {#Input}

The **Input** panel controls the major input parameters of the analysis,
including phenotypes (and therefore samples), genomic ranges,
and fields to import from VCF file(s).
Those inputs are useful to import only data of interest, as well as
to limit memory usage and duration of calculations.

## Phenotypes {#InputPhenotypes}

Phenotypes are critical to define groups of samples that may be compared
in summary statistics, tables, and plots.
Moreover, phenotypes also implicitly define the set of samples required
in the analysis (unique sample identifiers usually set as `rownames`
of the phenotypes).

The web-application accepts phenotypes stored in a text file, with the
following requirements:

* Fields must be delimited by "white space"
    (default separator for the `read.table` function).
* The first column of the file must contain unique sample identifiers,
    as syntactically valid `rownames`.
* The first row of the file must contain phenotype names, as syntactically
    valid `colnames`.

When provided, phenotypes will be used to import from VCF file(s)
only genotypes for the corresponding sample identifiers.
Moreover, an error message will be displayed if any of the sample identifiers
present in the phenotypes is absent from the VCF file(s).

Note that the web-application does not *absolutely* require phenotype
information. In the absence of phenotype information, all samples are imported
from VCF file(s).
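
To illustrate the file format described above, the following sketch writes
a small phenotype file and reads it back with `read.table`.
The sample identifiers, phenotype values, and the list of samples supposedly
found in the VCF file are hypothetical; only the overall layout reflects the
requirements listed in this section.

```{r phenotypeFile}
# Write a small white-space-delimited phenotype file (hypothetical content)
phenoFile <- tempfile(fileext = ".txt")
writeLines(c(
    "sample super_pop gender",
    "sample1 EUR male",
    "sample2 AFR female",
    "sample3 EUR female"
), phenoFile)
# First column: unique sample identifiers; first row: phenotype names
phenotypes <- read.table(
    phenoFile, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
phenotypes
# Sample identifiers present in the phenotypes but absent from the VCF file
# would trigger an error message in the web-application
vcfSamples <- c("sample1", "sample2", "sample3", "sample4") # hypothetical
missingSamples <- setdiff(rownames(phenotypes), vcfSamples)
missingSamples
```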

> **Action**:
> 
> * Click on the *Browse* action button
> * Navigate to the `extdata` folder of the *TVTB* installation directory
> * Select the file `integrated_samples.txt`
>
> Alternatively: click the *Sample file* button

**Notes**

* The `r Biocpkg("TVTB")` installation directory can be identified
using the following command in an *R* session:

```{r systemFile, results='hide'}
system.file("extdata", package = "TVTB")
```

* The file selection pop-up window that is opened by the action button browses
    files on the server side. *This point is only relevant if the
    package/web-application is run on a remote server.*

## Genomic ranges {#InputGRanges}

Genomic ranges are critical to import only variants in
targeted genomic regions or features (*e.g.* genes, transcripts, exons),
as well as to limit memory usage and duration of calculations.

The Shiny Variant Explorer currently supports three types of input to define
genomic ranges:

* BED file
* UCSC-style text input
* `EnsDb` annotation packages

Currently, the web-application uses genomic ranges solely
to query the corresponding variants from VCF file(s).
In the future, those genomic ranges may also be used to produce
faceted summary statistics and plots.

**Notes**:

* The web-application does not *absolutely* require genomic ranges.
In the absence of genomic ranges, all variants are imported from VCF
file(s). *Caution recommended with large files!*
* When VCF file(s) are parsed (in a later [section](#InputVariants)),
    only the genomic ranges from the currently selected input mode are
    used as the active genomic ranges.

### BED file {#InputBED}

If a BED file is supplied, the web-application parses it using the
`r Biocpkg("rtracklayer")` `import.bed` method.
Therefore the file must respect the
[BED file format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
guidelines.

> **Action**:
> 
> * Click on the *Browse* action button
> * Navigate to the `extdata` folder of the *TVTB* installation directory
> * Select the file `SLC24A5.bed`
>
> Alternatively: click the *Sample file* button

**Notes**:

* The BED file defines the same genomic range
    that was used to extract variants from the
    [1000 Genomes Project](http://www.1000genomes.org) Phase 3
    release VCF file used in this vignette.
* The file selection window that is opened by the action button browses
    files on the server side (see [Phenotypes](#InputPhenotypes) section
    above).
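
Outside the web-application, the same BED file can be parsed at the
command-line. The sketch below assumes that the `r Biocpkg("rtracklayer")`
and `r Biocpkg("TVTB")` packages are installed; it is not evaluated in this
vignette. Note that BED files store 0-based start positions, which
`import.bed` converts to the 1-based coordinates used by `GRanges` objects.

```{r importBed, eval=FALSE}
library(rtracklayer)
# Path to the demonstration BED file shipped with the package
bedFile <- system.file("extdata", "SLC24A5.bed", package = "TVTB")
# Parse the file into a GRanges object
bedRanges <- import.bed(bedFile)
bedRanges
```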

### UCSC format {#InputUCSC}

Sequence names (*i.e.* chromosomes), start, and end positions of one or more
genomic ranges may be defined in the text field,
with individual regions separated by `";"`.

> **Action**:
> 
> * Paste `15:48,413,169-48,434,869` in the text field
>
> Alternatively: click the *Sample input* button

**Notes**:

* The web-application automatically trims `","` characters from the text
    input, before coercing the start and end positions to `numeric`.
* Multiple genomic ranges may be supplied (*e.g.*
    `1:123-456;2:234-345;2:456-789`).
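
The parsing steps described above can be sketched in base *R* as follows.
The helper `parseUcscRegions` is hypothetical and written for illustration
only; the web-application may implement those steps differently.

```{r parseUCSC}
# Hypothetical helper: parse "seqname:start-end" regions separated by ";"
parseUcscRegions <- function(txt) {
    # Trim "," characters, then split the individual regions
    txt <- gsub(",", "", txt, fixed = TRUE)
    regions <- trimws(strsplit(txt, ";", fixed = TRUE)[[1]])
    # Capture the sequence name, start, and end position of each region
    parts <- regmatches(
        regions, regexec("^([^:]+):([0-9]+)-([0-9]+)$", regions))
    invalid <- lengths(parts) != 4
    if (any(invalid)) {
        stop("Invalid region(s): ", paste(regions[invalid], collapse = ", "))
    }
    data.frame(
        seqnames = vapply(parts, `[`, character(1), 2),
        start = as.numeric(vapply(parts, `[`, character(1), 3)),
        end = as.numeric(vapply(parts, `[`, character(1), 4)),
        stringsAsFactors = FALSE)
}
parseUcscRegions("15:48,413,169-48,434,869")
parseUcscRegions("1:123-456;2:234-345;2:456-789")
```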

### Ensembl-based annotation packages {#InputEnsDb}

Currently, only genomic ranges that correspond to genes may be retrieved
from an Ensembl-based database.
This feature was adapted from the web-application implemented in the
`r Biocpkg("ensembldb")` package.

\bioccomment{
In the future, the interface to query transcript and exon annotations
may be added to the web-application.
}

> **Action**:
> 
> * Paste `SLC24A5` in the text field
>
> Alternatively: click the *Sample input* button

\fixme{
Genomic features located on contigs may cause problems when working
with one VCF per chromosome.
In the future, an option may be added to ignore contigs.
}
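
A comparable query can be run at the command-line. The sketch below assumes
a recent version of `r Biocpkg("ensembldb")`, in which gene name filters are
created with `GeneNameFilter` (older releases used a different constructor
name), and that the `r Biocpkg("EnsDb.Hsapiens.v75")` package is installed.
It is not evaluated in this vignette.

```{r queryEnsDb, eval=FALSE}
library(ensembldb)
library(EnsDb.Hsapiens.v75)
# Genomic range(s) of the gene(s) named SLC24A5, returned as GRanges
geneRanges <- genes(
    EnsDb.Hsapiens.v75, filter = GeneNameFilter("SLC24A5"))
geneRanges
```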

## Variants {#InputVariants}

At the core of the `r Biocpkg("TVTB")` package, variants must be imported from
one or more VCF file(s) annotated by the Ensembl Variant Effect Predictor
([VEP](http://www.ensembl.org/info/docs/tools/vep))
script [@RN1].

Considering the large size of most VCF file(s), it is common practice to
split genetic variants into multiple files, each file used to store
variants located on a single chromosome (more generally, a single sequence).
The Shiny Variant Explorer supports two situations:

* All variants are stored in a single VCF file ("Single-VCF" mode).
* Variants are split into one file per sequence, with the requirement that
    files be named with a pattern including the sequence name, which
    must match the `seqnames` slot of the genomic ranges described
    [above](#InputGRanges) ("Multi-VCF" mode).

In addition, VCF files can store a plethora of information in their various
fields. It is often useful to select only a subset of fields relevant for
a particular analysis, to limit memory usage.
The web-application uses the
`r Biocpkg("VariantAnnotation")` `scanVcfHeader` method to parse the header of
the VCF file (*Single-VCF* mode) or the first VCF file (*Multi-VCF* mode),
to display the list of available fields that users may choose to import.
A few considerations must be made:

* The web-application requires that Ensembl VEP predictions be present in the
    INFO field.
* The web-application requires that the `"GT"` key be present in the
    FORMAT field.

### Single-VCF mode {#SingleVCF}

This mode displays an action button that must be used to select the VCF file
from which to import variants.

> **Action**:
> 
> * Click on the *Browse* action button
> * Navigate to the `extdata` folder of the *TVTB* installation directory
> * Select the file `chr15.phase3_integrated.vcf.gz`
>
> Alternatively: click the *Sample file* button

### Multi-VCF mode {#MultiVCF}

This mode requires two pieces of information:

* The path to the folder that contains one or more VCF file(s).
* The naming pattern of VCF file(s), with the following requirement:
    + The pattern must include `"%s"` to declare the position of the
        sequence (*i.e.* chromosome) name in the pattern.

Note that a summary of VCF file(s) detected using the given folder and
pattern is displayed on the right, to help users determine whether the
parameters are correct. In addition, the content of the given folder is
displayed at the bottom of the page, beside the same content filtered for
the VCF file naming pattern.
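
The role of `"%s"` in the naming pattern can be sketched in base *R*.
The folder, file names, pattern, and sequence names below are hypothetical,
and the pattern is treated as a `sprintf` template for illustration;
the web-application may process the pattern differently.

```{r multiVcfPattern}
# Hypothetical folder with one (empty) VCF file per sequence
vcfFolder <- file.path(tempdir(), "multiVcfDemo")
dir.create(vcfFolder, showWarnings = FALSE)
invisible(file.create(
    file.path(vcfFolder, sprintf("chr%s.example.vcf.gz", c("15", "X")))))
# Naming pattern: "%s" marks the position of the sequence name
vcfPattern <- "chr%s.example.vcf.gz"
# Sequence names requested through the genomic ranges
seqNames <- c("15", "21", "X")
# File expected for each sequence name, and whether it exists
vcfFiles <- file.path(vcfFolder, sprintf(vcfPattern, seqNames))
data.frame(
    seqname = seqNames,
    file = basename(vcfFiles),
    found = file.exists(vcfFiles))
```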

> **Action**:
> 
> *None*. The text fields should already be filled with default values,
> pointing to the single example VCF file (`chr15.phase3_integrated.vcf.gz`).

### VCF scan parameters {#scanVcfParam}

This panel allows users to select the INFO and FORMAT fields to import
(in the `info` and `geno` slots of the `VCF` object, respectively).

It is important to note that the `FORMAT/GT` and `INFO/<vep>` fields (where
`<vep>` stands for the INFO key where Ensembl VEP predictions are stored) are
implicitly imported from the VCF.
Similarly, the mandatory FIXED fields `CHROM`, `POS`, `ID`, `REF`, `ALT`,
`QUAL`, and `FILTER` are automatically imported to populate
the `rowRanges` slot of the `VCF` object.
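
A comparable selection of fields can be sketched at the command-line using
the `r Biocpkg("VariantAnnotation")` package directly. This is an
illustration under the assumption that VEP predictions are stored under the
`CSQ` key; the web-application relies on the `r Biocpkg("TVTB")` import
methods, which may apply additional processing. The chunk is not evaluated in
this vignette.

```{r scanParam, eval=FALSE}
library(VariantAnnotation)
vcfFile <- system.file(
    "extdata", "chr15.phase3_integrated.vcf.gz", package = "TVTB")
# Import only the INFO/CSQ and FORMAT/GT fields
param <- ScanVcfParam(info = "CSQ", geno = "GT")
vcf <- readVcf(vcfFile, param = param)
# Fields imported in the info and geno slots
colnames(info(vcf))
names(geno(vcf))
# Fixed fields populate the rowRanges slot
rowRanges(vcf)
```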

> **Action**:
> 
> * Click the *Deselect all* action button under the *INFO fields*
>     selection input to import only the INFO/CSQ and FORMAT/GT fields.
> * Click the *Import variants* action button
>
> A summary of variants, phenotypes, and samples imported will appear beside
> the action button.

## Annotations {#EnsDbPkg}

This panel allows users to select a pre-installed annotation package.
Currently, only `EnsDb` annotation packages are supported,
and only **gene**-coding regions may be queried.

> **Action**:
> 
> * If none of the `EnsDb` packages are installed, it will simply **not** be
>     possible to use the `ensembl` interface of the *Genomic ranges* input
>     tab.
> * If the `EnsDb.Hsapiens.v75` package is the only `EnsDb` package installed,
>     no action is required; the package should already be pre-selected.
> * If the `EnsDb.Hsapiens.v75` package is **not** the only `EnsDb` package
>     installed, users should select it in the list of choices.

# Frequencies panel {#Frequencies}

This panel demonstrates the use of three methods implemented in the
`r Biocpkg("TVTB")` package, namely `addFrequencies`, `addOverallFrequencies`,
and `addPhenoLevelFrequencies`.

## Overall frequencies {#OverallFrequencies}

This panel allows users to *Add* and *Remove* INFO fields
that contain genotype counts
(*i.e.* homozygote reference, heterozygote, homozygote alternate)
and allele frequencies
(*i.e.* alternate allele frequency, minor allele frequency)
calculated across all the samples and variants imported.
The web-application uses the homozygote reference, heterozygote,
and homozygote alternate genotypes defined in the
[Advanced settings](#AdvancedSettings)
panel.

Importantly, the name of the INFO keys that are used to store
the calculated values can be defined in the
[Advanced settings](#AdvancedSettings) panel.
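
To make the calculated values concrete, the following base *R* sketch
computes the three genotype counts and two allele frequencies from a toy
genotype matrix. The data and genotype encodings are hypothetical, diploid
genotypes are assumed, and the package methods may differ in details
(*e.g.* handling of missing genotypes).

```{r frequenciesSketch}
# Toy genotype matrix: variants in rows, samples in columns (hypothetical)
gtToy <- matrix(
    c("0|0", "0|1", "1|1", "0|0",
      "1|1", "1|1", "1|0", "0|1"),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("variant1", "variant2"), paste0("sample", 1:4)))
# Genotype encodings (assumed for this example)
refGT <- "0|0"
hetGT <- c("0|1", "1|0")
altGT <- "1|1"
# Count, for each variant, the samples that carry any of the given codes
countGT <- function(gt, codes) {
    rowSums(array(gt %in% codes, dim(gt), dimnames(gt)))
}
REF <- countGT(gtToy, refGT)
HET <- countGT(gtToy, hetGT)
ALT <- countGT(gtToy, altGT)
# Alternate allele frequency: alternate alleles over all called alleles
AAF <- (HET + 2 * ALT) / (2 * (REF + HET + ALT))
# Minor allele frequency: frequency of the rarer of the two alleles
MAF <- pmin(AAF, 1 - AAF)
data.frame(REF, HET, ALT, AAF, MAF)
```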

> **Action**:
> 
> * Click the *Add* action button
> * See the *Latest changes* message update at the top of the screen.
> * Optionally, the [Views](#Views) panel can be used to examine the new fields

## Phenotype-level frequencies {#PhenoLevelFrequencies}

This panel allows users to *Refresh* the list of INFO fields
that contain genotype counts and allele frequencies
calculated within *groups of samples* associated with
various levels of a given phenotype.

> **Action**:
> 
> * Select `super_pop` in the list of phenotypes
> * Click the *Select all* action button
> * Click the *Refresh* action button
> * See the *Latest changes* message update at the top of the screen.
> * Optionally, the [Views](#Views) panel can be used to examine the new fields

# Filters panel {#VcfFilterRules}

One of the flagship features of the `r Biocpkg("TVTB")` package
is the *VCF filter rules*, extending the
`r Biocpkg("S4Vectors")` `FilterRules` class to new classes of filter rules
that can be evaluated within environments defined by the various slots
of `VCF` objects.

Generally speaking, `FilterRules` greatly facilitate the design 
and combination of powerful filter rules for table-like objects,
such as the `fixed` and `info` slots of
`r Biocpkg("VariantAnnotation")` `VCF` objects,
as well as Ensembl VEP predictions stored in the meta-columns of `GRanges`
returned by the `r Biocpkg("ensemblVEP")` `parseCSQToGRanges` method.

A separate vignette describes in greater detail the use of classes
that contain *VCF filter rules*. A simple example is shown below.
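
For readers unfamiliar with the parent `FilterRules` class, the sketch below
evaluates a single rule on a toy table using the `r Biocpkg("S4Vectors")`
package only. The table content is hypothetical, and the *VCF filter rules*
classes of `r Biocpkg("TVTB")` are deliberately not used here. The chunk is
not evaluated in this vignette.

```{r filterRulesSketch, eval=FALSE}
library(S4Vectors)
# Toy table standing in for Ensembl VEP predictions (hypothetical content)
vepToy <- DataFrame(
    Consequence = c(
        "missense_variant",
        "synonymous_variant",
        "missense_variant&splice_region_variant"),
    SYMBOL = rep("SLC24A5", 3))
# A single named rule; active by default
rules <- FilterRules(list(
    missense = expression(grepl("missense", Consequence))))
# One logical value per row: TRUE if the row passes all active rules
keep <- eval(rules, vepToy)
vepToy[keep, ]
```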

> **Action**:
> 
> * Select `VEP` as the *Type* of filter
> * Paste `grepl("missense",Consequence)` in the text field
> * Leave the *Active?* checkbox ticked
> * Click the *Add filter* action button
> * See the list of rules update at the bottom of the screen
> * Click the *Apply filters* action button
> * See the summary of filtered variants update beside the action button
> * Optionally, the [Views](#Views) panel can be used to examine the
>     filtered variants
>
> Alternatively: click the *Sample input* button

# Views panel {#Views}

This panel offers the chance to examine the main objects of the session,
namely:

* The active genomic ranges
* The `rowRanges` and selected meta-columns of the filtered variants.
* Selected fields of the `info` slot (of the filtered variants).
* Selected Ensembl VEP predictions (of the filtered variants).
* Selected phenotypes attached to the variants.
* Subset of genotypes (among the filtered variants).
    + Genotypes for all filtered variants may be displayed as a heatmap
        (`ggplot`).
        
> **Action**:
> 
> * In the various panels, select fields to examine each object
>     + In particular, note the INFO fields that contain genotype counts
>          and allele frequencies calculated [earlier](#Frequencies)
> * Go to the *Heatmap* tab of the *Genotypes* panel
> * Click the *Go!* action button to calculate and display the heatmap

# Plots panel {#Plots}

This panel demonstrates the use of two methods implemented in the
`r Biocpkg("TVTB")` package, namely `tabulateVepByPhenotype`
and `densityVepByPhenotype`.

# Settings panel

This panel stores more advanced settings that users may not need to edit as
frequently, if at all. Those settings are divided in two sub-panels:

* **Advanced**
    + Genotypes, INFO key suffixes, and VCF yield size
* **Parallel**
    + Use of multiple CPUs to accelerate calculations

## Advanced settings {#AdvancedSettings}

### Genotypes {#AdvancedGenotypes}

It is critical to accurately identify and define how the different
genotypes---homozygote reference, heterozygote, and homozygote alternate---are
encoded in the VCF file, to produce accurate
[genotype counts and frequencies](#Frequencies), for instance.
This generally requires examining the
content of the FORMAT/GT field outside of the web-application.
For instance, the functions `unique` and `table` may be used to identify
(and count) all the distinct genotype codes in the `geno` slot (`"GT"` key) of
a `VCF` object.
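
As a minimal illustration, the sketch below applies both functions to a toy
genotype matrix with hypothetical content. With an actual `VCF` object, a
comparable matrix would be obtained using
`VariantAnnotation::geno(vcf)[["GT"]]`.

```{r genotypeCodes}
# Toy genotype matrix: variants in rows, samples in columns (hypothetical)
gtCodes <- matrix(
    c("0|0", "0|1", "1|1", "0|0", "1|0", "0|0"),
    nrow = 2)
# Distinct genotype codes
unique(as.vector(gtCodes))
# Number of occurrences of each genotype code
codeCounts <- table(gtCodes)
codeCounts
```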

The default selected values are immediately compatible with the demonstration
data set. Users who wish to select genotype codes not yet available
among the current choices may
either contact the package maintainer to add them in a future release,
or edit the [Global configuration file](#GlobalConfig)
of the web-application locally.

### INFO key suffixes {#AdvancedSuffixes}

Currently, the three calculated genotype counts and two allele frequencies
require five INFO fields to store their respective values. 

Considering that `r Biocpkg("TVTB")` offers the possibility to calculate
counts and frequencies for the overall data set, and for each level of each
phenotype, it is important to define a clear and consistent naming mechanism
that does not conflict with INFO keys imported from the VCF file(s).
In the `r Biocpkg("TVTB")` package, a suffix is required for each type of
genotype and frequency calculated, to generate INFO keys as follows:

+ Overall counts and frequencies are stored in INFO keys named `<suffix>`
+ Counts and frequencies calculated for individual levels of
    selected phenotypes are stored under INFO keys formed as
    `<phenotype>_<level>_<suffix>`

Again, the default values are immediately compatible with the demonstration
data set. For other data sets, it may be necessary to change those values,
either by preference, or to avoid conflict with INFO keys imported from
the VCF file(s).
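
The naming mechanism can be sketched in base *R* as follows. The suffixes,
phenotype levels, and imported INFO keys below are assumptions made for
illustration; the values actually in use are shown in the web-application.

```{r infoKeySketch}
# Suffixes for the three genotype counts and two frequencies (assumed)
suffixes <- c("REF", "HET", "ALT", "AAF", "MAF")
# Phenotype and levels for which values are calculated (hypothetical levels)
phenotype <- "super_pop"
phenoLevels <- c("AFR", "EUR")
# Overall values are stored under keys named <suffix>
overallKeys <- suffixes
# Phenotype-level values: <phenotype>_<level>_<suffix>
phenoKeys <- as.vector(outer(
    paste(phenotype, phenoLevels, sep = "_"), suffixes, paste, sep = "_"))
phenoKeys
# INFO keys imported from the VCF file (hypothetical)
vcfInfoKeys <- c("CSQ", "AF", "MAF")
# Any key listed here would call for a different suffix
conflicts <- intersect(c(overallKeys, phenoKeys), vcfInfoKeys)
conflicts
```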

### Miscellaneous settings {#AdvancedMiscellaneous}

Other rarely used settings in this panel include:

* VCF yield size
    - Only applicable when VCF file(s) are parsed without defined
        [genomic ranges](#InputGRanges). See the `r Biocpkg("Rsamtools")`
        documentation.

## Parallel settings {#ParallelSettings}

Several functionalities of the `r Biocpkg("TVTB")` package are applied
to independent subsets of data (*e.g.* counting genotypes in various levels
of a given phenotype). Such processes can benefit from multi-threaded
calculations. Multi-threading settings in the Shiny web-application are
somewhat experimental, as they have been validated only on a small set of
operating systems, while some issues have been reported for others.

| Report |  Operating System  | Cluster Class | Cluster type | # Cores |
| :----: | :---:| :-----------: | :----------: | :-----: |
| OK | Ubuntu 14.04 | Multicore | FORK | 2 |
| OK | Scientific Linux 6.7 | Multicore | FORK | 2 |
| Hang~1~ | OS X El Capitan | Snow | SOCK | 2 |

1.  Application hangs while CPUs work indefinitely at full capacity.

\bioccomment{
Users are welcome to send feedback to report additional
successful configurations, as well as newly identified issues.
}
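
The cluster classes and types listed in the table correspond to parameter
classes of the `r Biocpkg("BiocParallel")` package. The sketch below shows
how such configurations may be created and tested outside the
web-application, using a trivial task; the choice of two workers mirrors the
table, and the platform test is an assumption for illustration. The chunk is
not evaluated in this vignette.

```{r parallelSketch, eval=FALSE}
library(BiocParallel)
# Forked processes are not available on Windows; fall back to sockets there
bpParam <- if (.Platform$OS.type == "windows") {
    SnowParam(workers = 2, type = "SOCK")
} else {
    MulticoreParam(workers = 2)
}
# Apply a trivial function to independent inputs using the workers
squares <- bplapply(1:4, function(i) i^2, BPPARAM = bpParam)
unlist(squares)
```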

# Session information {#SessionInfo}

The last panel of the Shiny Variant Explorer offers detailed views of
objects and settings in the current session, including:

* **Session info**
    + The `sessionInfo()` value
* **TVTB settings**
    + See the vignette *Introduction to TVTB* for more information
* **General settings**
    + Including the current value of various input widgets
* **Advanced settings**
    + Including the current value of more input widgets
* **BED**
    + Structural view of the active genomic ranges
* **Variants**
    + Overview of the raw `VCF` object
* **VEP**
    + Structural view of the `GRanges` that store the Ensembl VEP predictions
* **Phenotypes**
    + Structural view of the phenotype information attached to the variants
* **Genotypes**
    + Structural view of the `geno` slot (`"GT"` key) of the raw variants

# Global configuration {#GlobalConfig}

Most default values are stored in the `global.R` file of the web-application.
All the files of the web-application are stored in the `extdata/shinyApp`
folder of the `r Biocpkg("TVTB")` installation directory
(see an [earlier section](#InputPhenotypes) to identify this directory).

Users who wish to change the default values of certain input widgets
(*e.g.* genotype codes)
may edit the `global.R` file accordingly. However, the file will be reset at
each package update.
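
The location of the configuration file can be identified with the sketch
below, which builds the path from the folder and file names given above.
`system.file` returns an empty string if the file cannot be found
(*e.g.* if the package is not installed).

```{r locateGlobal}
# Path to the global configuration file of the web-application
globalFile <- system.file(
    "extdata", "shinyApp", "global.R", package = "TVTB")
if (nzchar(globalFile)) {
    message("Configuration file: ", globalFile)
} else {
    message("Configuration file not found; is the package installed?")
}
```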

\bioccomment{
In the future, a mechanism may be implemented to override global settings
locally, without risk of seeing this custom configuration overwritten
at the next package update (e.g. a file in the user home folder that would
be parsed to overwrite certain settings).
}

# Vignette session {#VignetteSessionInfo}

Here is the output of `sessionInfo()` on the system on which this
document was compiled:

```{r sessionInfo}
sessionInfo()
```

# References {#References}

[R]: http://r-project.org
[RStudio]: http://www.rstudio.com/