%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{3. Common Workflows: RNA-seq}

# Common Workflows, with an RNA-seq example

useR! 2014<br>
Author: Martin Morgan (<a
    href="mailto:mtmorgan@fhcrc.org">mtmorgan@fhcrc.org</a>), Sonali
    Arora<br>
Date: 30 June, 2014

```{r setup, echo=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```

## Overview

RNASeq

- Counting -- overlapping alignments
- Known genes
- Known transcripts
- (Novel transcripts)

ChIPSeq -- Chromatin immunopreciptation

- (Peak calling)
- Peak identification
- Differential binding
- Motif and other down-stream analysis

Variants

- (Calling)
- Annotating
- Effect prediction
- GWAS
- Integration

Structural

- Copy number
- Rearrangements
- ...

## In depth: RNASeq

Work flow

- Experimental design
  - Replication
  - Batch effects
- Sequencing & alignment
- Differential expression
  - Counting
  - Statistical evaluation
- Annotation & down-stream analysis

Statistical challenges

- Designed experiments
- Library size normalization
- Counts (vs. RPKM)
  - Negative binomial distribution 
  - Dispersion
- Borrowing information -- shrinkage
- Independent filtering

### RNASeq in parathyroid adenomas

This work flow derives from a more comprehensive example developed in
the [parathyroidSE][] package and the
[CSAMA 2014](http://bioconductor.org/help/course-materials/2014/CSAMA2014/)
workshop. It uses [DESeq2][] to estimate gene-level differential
expression.  Analysis picks up after counting reads aligning to
Ensembl gene regions; counting can be accomplished using
`summarizeOverlaps()`, as described in the [parathyroidSE][] vignette.

The data is from
[Haglund et al](http://www.ncbi.nlm.nih.gov/pubmed/23024189).,
Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas, J
Clin Endocrin Metab, Sep 2012.  The experiment investigated the role
of the estrogen receptor in parathyroid tumors. The investigators
derived primary cultures of parathyroid adenoma cells from 4
patients. These primary cultures were treated with diarylpropionitrile
(DPN), an estrogen receptor &beta; agonist, or with 4-hydroxytamoxifen
(OHT).  RNA was extracted at 24 hours and 48 hours from cultures under
treatment and control.

Start by loading the count data, which has been carefully curatd as a
`SummarizedExperiment` object. Explore the object with accessors such
as `rowData(se)`, `colData(se)`, `exptData(se)$MIAME`, and
`head(assay(se))`

```{r deseq}
require("DESeq2")
require("parathyroidSE")
data("parathyroidGenesSE")
se <- parathyroidGenesSE
colnames(se) <- se$run
```

The first step is to add an experimental design and perform additional
preparatory steps. The experiment contains several technical
replicates, and while these could be incorporated in the model, here
we simply pool them (pooling technical replicates is often appropriate
for RNASeq data).

```{r deseq-prep}
ddsFull <- DESeqDataSet(se, design = ~ patient + treatment)
ddsCollapsed <-
    collapseReplicates(ddsFull, groupby = ddsFull$sample, 
                       run = ddsFull$run)
dds <- ddsCollapsed[, ddsCollapsed$time == "48h"]
dds$time <- droplevels(dds$time)
dds$treatment <- relevel(dds$treatment, "Control")
```

The entire work flow is executed with a call to the `DESeq()`
function. This includes library size factor estimation, dispersion
estimates, model fitting, independent filtering, and formulating test
statistics based on the specified design. Here we generate the results
for a comparison between DPN and control.

```{r deseq-analysis}
dds <- DESeq(dds)
res <- results(dds, contrast = c("treatment", "DPN", "Control"))
```

The remaining steps provide some basic checks and visualizations of
the data. We identify genes for which the adjusted (for multiple
comparison) _p_ value is less than 0.1, and order these by their log2
fold change.

```{r deseq-interpret}
resSig <- res[which(res$padj < 0.1 ),]
head(resSig[order(resSig$log2FoldChange),])
```

An 'MA' plot represents each gene as a point, with average expression
over all samples on the x-axis, and log2 fold change between treatment
groups on the y-axis; highlighted values are genes with adjusted _p_
values less than 0.1.

```{r deseq-MA}
plotMA(res, ylim = c(-1, 1))
```

The strategy taken by [DESeq2][] for dispersion estimation is
summarized by the `plotDispEsts()` function. It shows (a) black
per-gene dispersion estimates, (b) a red trend line representing the
global relationship between dispersion and normalized count, (c) blue
'shrunken' values moderating individual dispersion estimates by the
global relationship, and (d) blue-circled dispersion outliers with
high gene-wise dispersion that were not adjusted.

```{r deseq-dispest}
plotDispEsts(dds, ylim = c(1e-6, 1e1))
```

A final diagnostic is a plot of the histogram of _p_ values, which
should be uniform under the null hypothesis; skew to the right may
indicate batch or other effects not accommodated in the model

```{r deseq-hist}
hist(res$pvalue, breaks=40, col="grey")
```

[DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html
[parathyroidSE]: http://bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html