---
title: "4. RNA-Seq expression example, components of the airway dataset"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{4. RNA-Seq expression example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r echo=F}
# To test
# devtools::load_all(".",export_all=F) ; rmarkdown::render("vignettes/airway.Rmd")

knitr::opts_chunk$set(fig.width=6, fig.height=4)
```

Let's look at the [airway](http://bioconductor.org/packages/release/data/experiment/html/airway.html) dataset as an example of a typical small-scale RNA-Seq experiment. In this experiment, four Airway Smooth Muscle (ASM) cell lines are treated with the asthma medication dexamethasone.

The limma `voom` function will be used to assign precision weights, then the result converted to a weitrix.

```{r setup, warning=F, message=F}
library(weitrix)
library(EnsDb.Hsapiens.v86)
library(edgeR)
library(limma)
library(reshape2)
library(tidyverse)
library(airway)

set.seed(1234)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
BiocParallel::register( BiocParallel::SnowParam() )

# The most reliable backed is to use serial processing
#BiocParallel::register( BiocParallel::SerialParam() )
```

```{r}
data("airway")
airway
```


# Initial processing

Initial steps are the same as for a differential expression analysis.

```{r}
counts <- assay(airway,"counts")

design <- model.matrix(~ dex + cell, data=colData(airway))

good <- filterByExpr(counts, design=design) 
table(good)

airway_elist <- 
    DGEList(counts[good,]) %>%
    calcNormFactors() %>%
    voom(design, plot=TRUE)
```

There are many possible variations on this:

* `voom` provides precision weights. One could instead choose a `prior.count` for `cpm` to produce a flat (or at least non-decreasing) mean-variance trend-line.

* Use `cpm` to produce log transformed counts with a small prior count, then use `weitrix_calibrate_trend` to account for any mean-variance relationship.


## Conversion to weitrix

```{r}
airway_weitrix <- as_weitrix(airway_elist)

# Include row and column information
colData(airway_weitrix) <- colData(airway)
rowData(airway_weitrix) <- 
    mcols(genes(EnsDb.Hsapiens.v86))[rownames(airway_weitrix),c("gene_name","gene_biotype")]

airway_weitrix
```


# Exploration

RNA-Seq expression is well trodden ground. The main contribution of the weitrix package here is to aid exploration by discovering components of variation, providing not just column scores but row loadings and respecting precision weights.

## Find components of variation

This will find various numbers of components, from 1 to 6. In each case, the components discovered have varimax rotation applied to their gene loadings to aid interpretability. The result is a list of Components objects.

```{r message=F}
comp_seq <- weitrix_components_seq(airway_weitrix, p=6)
comp_seq
```

We can compare the proportion of variation explained to what would be explained in a completely random weitrix. Random normally distributed values are generated with variances equal to one over the weights.

```{r message=F}
rand_weitrix <- weitrix_randomize(airway_weitrix)
rand_comp <- weitrix_components(rand_weitrix, p=1)

components_seq_screeplot(comp_seq, rand_comp)
```

## Examining components

Up to 4 components may be justified.

```{r}
comp <- comp_seq[[4]]

comp$col[,-1] %>% melt(varnames=c("Run","component")) %>%
    left_join(as.data.frame(colData(airway)), by="Run") %>%
    ggplot(aes(y=cell, x=value, color=dex)) + 
    geom_vline(xintercept=0) + 
    geom_point(alpha=0.5, size=3) + 
    facet_grid(~ component) +
    labs(title="Sample scores for each component", x="Sample score", y="Cell line", color="Treatment")

comp$row[,-1] %>% melt(varnames=c("name","component")) %>%
    ggplot(aes(x=comp$row[name,"(Intercept)"], y=value)) + 
    geom_point(cex=0.5, alpha=0.5) + 
    facet_wrap(~ component) +
    labs(title="Gene loadings for each component vs average log2 RPM", x="average log2 RPM", y="gene loading")
```

## Without varimax rotation, components may be harder to interpret

If varimax rotation isn't used, `weitrix_components` and `weitrix_components_seq` will produce a Principal Components Analysis, with components ordered from most to least variance explained.

Without varimax rotation the treatment effect is still mostly in the first component, but has also leaked a small amount into the other components.

```{r message=F}
comp_nonvarimax <- weitrix_components(airway_weitrix, p=4, use_varimax=FALSE)

comp_nonvarimax$col[,-1] %>% melt(varnames=c("Run","component")) %>%
    left_join(as.data.frame(colData(airway)), by="Run") %>%
    ggplot(aes(y=cell, x=value, color=dex)) + 
    geom_vline(xintercept=0) + 
    geom_point(alpha=0.5, size=3) + 
    facet_grid(~ component) +
    labs(title="Sample scores for each component, no varimax rotation", x="Sample score", y="Cell line", color="Treatment")
```


## `col` can potentially be used as a design matrix with limma

If you're not sure of the experimental design, for example the exact timing of a time series or how evenly a drug treatment was applied, the extracted component might actually be more accurate.

Note that this ignores uncertainty about the `col` matrix itself.

This may be useful for hypothesis generation -- finding some potentially interesting genes, while discounting noisy or lowly expressed genes -- but don't use it as proof of significance.

```{r}
airway_elist <- weitrix_elist(airway_weitrix)

fit <- 
    lmFit(airway_elist, comp$col) %>% 
    eBayes()

fit$df.prior
fit$s2.prior

topTable(fit, "C1")

all_top <- topTable(fit, "C1", n=Inf, sort.by="none")
plotMD(fit, "C1", status=all_top$adj.P.Val <= 0.01)
```

You might also consider using my `topconfects` package. This will find the largest confident effect sizes, while still correcting for multiple testing.

```{r}
library(topconfects)
limma_confects(fit, "C1")
```