---
title: "Lab 1.3: Starting an analysis"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  % \VignetteIndexEntry{Lab 1.3: Starting an analysis}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
suppressPackageStartupMessages({
    library(Biostrings)
    library(GenomicRanges)
})
```

Original Authors: Martin Morgan<br />
Presenting Authors: [Martin Morgan][], [Lori Shepherd][]</br >
Date: 22 July, 2019</br >
Back: [Monday labs](lab-1-intro-to-r-bioc.html)

[Martin Morgan]: mailto: Martin.Morgan@RoswellPark.org
[Lori Shepherd]: mailto: Lori.Shepherd@RoswellPark.org

**Objective**: An overview of software available in _Bioconductor_.

**Lessons learned**:

- How to discover _CRAN_ and _Bioconductor_ packages
- Finding vignettes, workflows, and tutorials
- Exploring an existing analysis
- Approaching your own data
- How to get help

# _Bioconductor_

Analysis and comprehension of high-throughput genomic data

- Statistical analysis: large data, technological artifacts, designed
  experiments; rigorous
- Comprehension: biological context, visualization, reproducibility
- High-throughput
    - Sequencing: RNASeq, ChIPSeq, variants, copy number, ...
    - Microarrays: expression, SNP, ...
    - Flow cytometry, proteomics, images, ...

## Packages, vignettes, work flows

![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)

- 1750 software packages
- Discover and navigate via [biocViews][]
- Package 'landing page', e.g., `r Biocpkg("Gviz")`
    - Title, author / maintainer, short description, citation,
      installation instructions, ..., download statistics
- All user-visible functions have help pages, most with runnable
  examples
- 'Vignettes' an important feature in _Bioconductor_ -- narrative
  documents illustrating how to use the package, with integrated code
    - Example: `AnnotationHub` [landing page][AH] references
      [HOW-TO vignette][AH-howto] illustrating some fun use cases.
    - Some are extensive; check out [Gviz][], [limma][], [edgeR][],
      [DESeq2][]!
- 'Release' (every six months) and 'devel' branches

[AH]: https://bioconductor.org/packages/AnnotationHub
[AH-howto]: http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html

# Packages

## Finding packages

'Domain-specific' analysis

**Exercise**

- Visit the listing of [All Packages][].
- Use the 'Search biocViews' box in the upper left to identify
  packages that have been tagged for RNASeq analysis. Explore other
  analysis like ChIPSeq, Epigenetics, VariantAnnotation, Proteomics,
  SingleCell, etc. Explore the graph of software packages by
  expanding and contracting individual terms.
- Return to RNASeq. Two very popular packages are DESeq2 and
  edgeR. Visit the 'landing page' of one of these packages. The
  landing page has a title, authors, instructions for citing use the
  package, etc.
- Breifly explore the vignette and reference manual links. When
  would you consult the vignette? When would the reference manual be
  helpful?

[All Packages]: https://bioconductor.org/packages/release/BiocViews.html#___Software

_Bioconductor_ provides 'infrastructure' for working with genomic
data. We'll explore some of these in more detail in a later part of
this lab. For now...

**Exercise**

- Visit the landing pages of each of the [Biostrings][],
  [GenomicRanges][], [VariantAnnotation][], and [GenomicAlignments][]
  packages. Create a short summary describing what each package does,
  and when it might be useful.
- Visit the landing page for the [SummarizedExperiment][]
  package. This package is meant to provide a data representation that
  helps manage, in a coordinated fashion, 'assays' (e.g., gene x
  sample matrix of RNASeq counts) and the row (e.g., genomic
  coordinates, P-values) and column (e.g., sample sheets). Briefly
  review the user-oriented vignette ('SummarizedExperiment for
  Coordinating Experimental Assays, Samples, and Regions of Interest')
  to get a sense for how this package might be used.
- Visit the landing page for the [rtracklayer][] package. From the
  reference manual, when would the various `import()` and `export()`
  functions be useful?

Annotation packages are data-, rather than software-, centric,
providing information about the relationship between different
identifiers, gene models, reference genomes, etc.

**Exercise**

- On the page listing [All Packages][], click on the AnnotationData
  top-level term.
- Search, using the box on the right-hand side, for annotation
  packages that start with the following letters to get a sense of the
  packages and organisms available.

  - `org.*`: symbol mapping
  - `TxDb.*` and `EnsDb.*`: gene models
  - `BSgenome.*`: reference genomes

We'll see in a subsequent lab that a wealth of additional annotation
resources, including updated EnsDb and reference genomes, are
available through [AnnotationHub][].

Workflow packages are meant to provide a comprehensive introduction to
work flows that require several different packages. These can be quite
extensive documents, providing a very rich source of information.

**Exercise**

- Briefly explore the 'Simple Single Cell' workflow (or other workflow
  relevant to your domain of interest) to get a sense of the material
  the workflow covers.

## Installing packages

Likely the packages needed for this course are already
installed. Nonetheless it is useful to know how to install other
packages.

_Bioconductor_ has a particular approach to making packages
available. We have a 'devel' branch where new packages and features
are introduced, and a 'release' branch where users have access to
stable packages. Each six months, in spring and fall, the current
'devel' version of packages is branched to become the next
'release'. Packages within a release are tested with one another, so
it is important to install packages from the same release. The
[BiocManager][] package tries to make it easy to do this.

The first step to package installation is to make sure that the
[BiocManager][] package has been installed using standard _R_
procedures.

```{r, eval = FALSE}
if (!require(BiocManager))
    install.packages("BiocManager", repos = "https://cran.r-project.org")
```

Then, install the package(s) you would like to use

```{r, eval = FALSE}
BiocManager::install(c("Biostrings", "GenomicRanges"))
```

[BiocManager][] knows how to install CRAN and github packages, too.

There are several common problems encountered with package
installation. Often, packages have been installed using methods
different from the one recommended here, and the packages are from
different _Bioconductor_ releases. This leads to problems when
packages from different releases are incompatible with one another.

**Exercise** Verify that your packages are current and installed from
the same _Bioconductor_ release with

```{r, eval = FALSE}
BiocManager::valid()
```

Two common problems are that some packages are too old (a newer
version of the package exists) or too new (some packages have been
installed using a method other than [BiocManager][]). If there are
packages that are too old or too new, it is almost always a good idea
to follow the instructions from `BiocManager::valid()` to correct the
situation.

## Loading and using packages

Packages need to be installed only once for each version of _R_ you
use, but need to be _loaded_ into each new _R_ session that you
start. Packages are loaded using

```{r, messages = FALSE}
library(Biostrings)
```

Whan a package is loaded, it can sometimes generate messages that are
informational only, if you are confident this is the case for the
packages you're loading, use `suppressPackageStartupMessages()` for a
quieter experience:

```{r}
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(GenomicAlignments)
})
```

**Exercise** It is usually very helpful to explore package vignettes.

- Visit the vignette of the [DESeq2][] package, and walk through a few
  steps to understand what the vignette provides in terms of
  instructions for starting with the package, functionality the
  package provides, mathematical and statistical details of the
  implementation, and how the analysis provided by the package might
  be extended by other packages in the _Bioconductor_ ecosystem. One
  can visit vignettes through RStudio, or by running commands such as

  ```{r, eval = FALSE}
  vignette(package = "DESeq2")
  browseVignettes("DESeq2")
  ```

- Most vignettes are written in such a way that the _R_ code of the
  vignette _must_ be correct for the vignette to be produced. The code
  itself is available in the package. Find the code for the DESeq2
  vignette

  ```{r}
  dir(system.file(package="DESeq2", "doc"))
  vign <- system.file(package="DESeq2", "doc", "DESeq2.R")
  ```

  open it in RStudio (e.g., using File -> Open File... menu), step
  through the first few lines of _R_ code and compare your output to
  the output in the vignette. Alternatively, run the entire analysis
  in the vignette with the command

  ```{r, eval = FALSE}
  source(vign, echo = TRUE, max.lines = Inf)
  ```

**Exercise** Help pages provide more focused instructions for use of
particular functions. It is often con

- Load the [Biostrings][] package

    ```{r, message = FALSE}
    library(Biostrings)
    ```

- Look for help on the function `letterFrequency()` using the command

    ```
    ?letterFrequency
    ```

  note that there is tab completion after the `?` and first few
  letters of the command.

- The help page is quite complicated, documenting several different
  functions. In the 'Description' section, find a description of what
  `letterFrequency()` does. In the 'Usage' section, find the arguments
  that can be used with `letterFrequency()`, and try to understand,
  from the `Arguments` section what each argument might be or how it
  influences the computation. The `Value` section attempts to describe
  the return value of the `letterFrequency()` function.

- Sometimes an example is worth a thousand words. Can you run the
  first two sections of the example at the end of the help page (for
  `alphabetFrequency()` and `letterFrequency()` to arrive at a better
  understanding of how the `letterFrequency()` function works?

# Getting help

Where to get help?

- Most questions from users: https://support.bioconductor.org

What can you get help on?

- Problems on how to use software
- Statistical interpretation of results
- But not: extensive statistical consultation on, e.g., advanced
  experimental design. Best source: local experts and collaborators.

How to ask a good question

- Simplify to just a few lines of _R_ code.

- Must be able to be run by someone else

  - Use built-in data sets or simple simulations
  - Don't reference files on your system!

- include the output of `sessionInfo()`, which often shows problems
  with out-of-date packages.

**Exercise** Visit the support site and review the five most recent
questions. Which do you think are 'good', from the guidelines offered
above? Which have received helpful answers? Can you figure out who the
person answering the question is, i.e., why do they think they have an
answer?

# A sequence analysis package tour

This very open-ended topic points to some of the most prominent
_Bioconductor_ packages for sequence analysis. Use the opportunity in
this lab to explore the package vignettes and help pages highlighted
below; many of the material will be covered in greater detail in
subsequent labs and lectures.

Basics

- _Bioconductor_ packages are listed on the [biocViews][] page. Each
  package has 'biocViews' (tags from a controlled vocabulary)
  associated with it; these can be searched to identify appropriately
  tagged packages, as can the package title and author.
- Each package has a 'landing page', e.g., for
  [GenomicRanges][]. Visit this landing page, and note the
  description, authors, and installation instructions. Packages are
  often written up in the scientific literature, and if available the
  corresponding citation is present on the landing page. Also on the
  landing page are links to the vignettes and reference manual and, at
  the bottom, an indication of cross-platform availability and
  download statistics.
- A package needs to be installed once, using the instructions on the
  landing page. Once installed, the package can be loaded into an R
  session and the help system queried interactively, as outlined
  above:

```{r require}
library(GenomicRanges)
```

```{r help, eval=FALSE}
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
```

Domain-specific analysis -- explore the landing pages, vignettes, and
reference manuals of two or three of the following packages.

- Important packages for analysis of differential expression include
  [edgeR][] and [DESeq2][]; both have excellent vignettes for
  exploration. Additional research methods embodied in _Bioconductor_
  packages can be discovered by visiting the [biocViews][] web page,
  searching for the 'DifferentialExpression' view term, and narrowing
  the selection by searching for 'RNA seq' and similar.
- Single-cell 'omics are increasingly important. From the
  [biocViews][] page, enter 'single cell' in the 'search table'
  field. Check out the [single-cell workflow][sc-workflow] for an
  overview of key packages.
- Popular ChIP-seq packages include [DiffBind][] and [csaw][] for
  comparison of peaks across samples, [ChIPQC][] for quality
  assessment, and [ChIPpeakAnno][] and [ChIPseeker][] for annotating
  results (e.g., discovering nearby genes). What other ChIP-seq
  packages are listed on the [biocViews][] page?
- Working with called variants (VCF files) is facilitated by packages
  such as [VariantAnnotation][], [VariantFiltering][], and
  [ensemblVEP][]; packages for calling variants include, e.g.,
  [h5vc][] and [VariantTools][].
- Several packages identify copy number variants from sequence data,
  including [cn.mops][]; from the [biocViews][] page, what other copy
  number packages are available? The [CNTools][] package provides some
  useful facilities for comparison of segments across samples.
- Microbiome and metagenomic analysis is facilitated by packages such
  as [phyloseq][] and [metagenomeSeq][].
- Metabolomics, chemoinformatics, image analysis, and many other
  high-throughput analysis domains are also represented in
  _Bioconductor_; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw
data; these packages rely very heavily on the [IRanges][] /
[GenomicRanges][] infrastructure that we will encounter later in the
course.

- The [Biostrings][] package is used to represent DNA and other
  sequences, with many convenient sequence-related functions. Check
  out the functions documented on the help page `?consensusMatrix`,
  for instance. Also check out the [BSgenome][] package for working
  with whole genome sequences, e.g., `?"getSeq,BSgenome-method"`
- The [GenomicAlignments][] package is used to input reads aligned to
  a reference genome. See for instance the `?readGAlignments` help
  page and `vigentte(package="GenomicAlignments",
  "summarizeOverlaps")`
- The [rtracklayer][] `import` and `export` functions can read in many
  common file types, e.g., BED, WIG, GTF, ..., in addition to querying
  and navigating the UCSC genome browser. Check out the `?import` page
  for basic usage.
- The [ShortRead][] and [Rsamtools][] packages can be used for
  lower-level access to FASTQ and BAM files, respectively.
- Many genomics data files are very large. We'll explore strategies of
  _restriction_ (only input some of the data in the file) and
  _iteration_ (read the file in chunks, rather than its entirety) for
  processing large data in other labs.

Annotation: _Bioconductor_ provides extensive access to 'annotation'
resources (see the [AnnotationData][] biocViews hierarchy); these are
covered in greater detail in Thursday's lab, but some interesting
examples to explore during this lab include:

- [biomaRt][], [PSICQUIC][], [KEGGREST][] and other packages for
  querying on-line resources; each of these have informative vignettes.
- [AnnotationDbi][] is a cornerstone of the
  [Annotation Data][AnnotationData] packages provided by _Bioconductor_.
  - **org** packages (e.g., [org.Hs.eg.db][]) contain maps between
    different gene identifiers, e.g., ENTREZ and SYMBOL. The basic
    interface to these packages is described on the help page `?select`
  - **TxDb** packages (e.g., [TxDb.Hsapiens.UCSC.hg38.knownGene][])
    contain gene models (exon coordinates, exon / transcript
    relationships, etc) derived from common sources such as the hg38
    knownGene track of the UCSC genome browser. These packages can be
    queried, e.g., as described on the `?exonsBy` page to retrieve all
    exons grouped by gene or transcript.
  - **EnsDb** packages and databases (e.g. [EnsDb.Hsapiens.v86][]) provide,
    similar to TxDb packages, gene models, but also protein annotations (protein
    sequences and protein domains within these) and additional annotation
    columns such as `"gene_biotype"` or `"tx_biotype"` defining the *biotype* of
    the features (e.g. lincRNA, protein_coding, miRNA etc). EnsDb databases are
    designed for [Ensembl][] annotations and contain annotations for all genes
    (protein coding and non-coding) for a specific Ensembl release.
  - **BSgenome** packages (e.g., [BSgenome.Hsapiens.UCSC.hg38][])
    contain whole genomes of model organisms.
- [VariantAnnotation][] and [ensemblVEP][] provide access to sequence
  annotation facilities, e.g., to identify coding variants; see the
  [Introduction to VariantAnnotation](http://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf)
  vignette for a brief introduction; we'll re-visit this during the
  Thursday lab.
- Take a quick look (there are more activites in other labs) at the
  [annotation work flow](http://bioconductor.org/help/workflows/annotation/annotation/)
  on the _Bioconductor_ web site.

A number of _Bioconductor_ packages help with visualization and
reporting, in addition to functions provided by indiidual packages.

- [Gviz][] provides a track-like visualization of genomic regions;
  it's got an amazing vignette.
- [ComplexHeatmap][] does an amazing job of all sorts of heatmaps,
  including OncoPrint-style summaries.
- [ReportingTools][] provides a flexible way to generate static and
  dynamic HTML-based reports.

# End matter

## Session Info

```{r}
sessionInfo()
```

## Acknowledgements

Research reported in this tutorial was supported by the National Human
Genome Research Institute and the National Cancer Institute of the
National Institutes of Health under award numbers U41HG004059 and
U24CA180996.

This project has received funding from the European Research Council
(ERC) under the European Union's Horizon 2020 research and innovation
programme (grant agreement number 633974)


[biocViews]: http://bioconductor.org/packages/release/BiocViews.html#___Software
[AnnotationData]: http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData

[aprof]: http://cran.r-project.org/web/packages/aprof/index.html
[hexbin]: http://cran.r-project.org/web/packages/hexbin/index.html
[lineprof]: https://github.com/hadley/lineprof
[microbenchmark]: http://cran.r-project.org/web/packages/microbenchmark/index.html

[AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BSgenome]: http://bioconductor.org/packages/BSgenome
[BiocManager]: https://cran.r-project.org/package=BiocManager
[Biostrings]: http://bioconductor.org/packages/Biostrings
[CNTools]: http://bioconductor.org/packages/CNTools
[ChIPQC]: http://bioconductor.org/packages/ChIPQC
[ChIPpeakAnno]: http://bioconductor.org/packages/ChIPpeakAnno
[ChIPseeker]: http://bioconductor.org/packages/ChIPseeker
[ComplexHeatmap]: http://bioconductor.org/packages/ComplexHeatmap
[csaw]: http://bioconductor.org/packages/csaw
[DESeq2]: http://bioconductor.org/packages/DESeq2
[DiffBind]: http://bioconductor.org/packages/DiffBind
[GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments
[GenomicRanges]: http://bioconductor.org/packages/GenomicRanges
[Gviz]: http://bioconductor.org/packages/Gviz
[IRanges]: http://bioconductor.org/packages/IRanges
[KEGGREST]: http://bioconductor.org/packages/KEGGREST
[PSICQUIC]: http://bioconductor.org/packages/PSICQUIC
[rtracklayer]: http://bioconductor.org/packages/rtracklayer
[Rsamtools]: http://bioconductor.org/packages/Rsamtools
[ReportingTools]: http://bioconductor.org/packages/ReportingTools
[ShortRead]: http://bioconductor.org/packages/ShortRead
[SummarizedExperiment]: http://bioconductor.org/packages/SummarizedExperiment
[VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: http://bioconductor.org/packages/VariantFiltering
[VariantTools]: http://bioconductor.org/packages/VariantTools
[biomaRt]: http://bioconductor.org/packages/biomaRt
[cn.mops]: http://bioconductor.org/packages/cn.mops
[h5vc]: http://bioconductor.org/packages/h5vc
[edgeR]: http://bioconductor.org/packages/edgeR
[ensemblVEP]: http://bioconductor.org/packages/ensemblVEP
[limma]: http://bioconductor.org/packages/limma
[metagenomeSeq]: http://bioconductor.org/packages/metagenomeSeq
[phyloseq]: http://bioconductor.org/packages/phyloseq
[snpStats]: http://bioconductor.org/packages/snpStats

[org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db
[TxDb.Hsapiens.UCSC.hg38.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene
[BSgenome.Hsapiens.UCSC.hg38]: http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg38

[EnsDb.Hsapiens.v86]: http://bioconductor.org/packages/EnsDb.Hsapiens.v86
[Ensembl]: http://www.ensembl.org

[sc-workflow]: http://bioconductor.org/packages/release/workflows/html/simpleSingleCell.html