---
title: "2. poly(A) tail length example"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. poly(A) tail length example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

poly(A) tail length of transcripts can be measured using the [PAT-Seq protocol](https://rnajournal.cshlp.org/content/21/8/1502.long). This protocol produces 3'-end focussed reads that include the poly(A) tail. We examine [GSE83162](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83162). This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with $\alpha$-factor, which causes them to stop dividing in antici... pation of a chance to mate. When the $\alpha$-factor is washed away, they resume cycling.

```{r echo=F}
knitr::opts_chunk$set(fig.width=6, fig.height=4, cache=TRUE, autodep=TRUE)

# To examine objects:
# devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/tail_length_cache/html")
```

# Read files, extract experimental design from sample names

```{r load, message=F}
library(tidyverse)
library(reshape2)
library(SummarizedExperiment)
library(limma)
library(topconfects)
library(org.Sc.sgd.db)
library(weitrix)

# Produce consistent results
set.seed(12345)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
BiocParallel::register( BiocParallel::SnowParam() )

# The most reliable backed is to use serial processing
#BiocParallel::register( BiocParallel::SerialParam() )

tail <- system.file("GSE83162", "tail.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()

tail_count <- system.file("GSE83162", "tail_count.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()
    
samples <- data.frame(sample=I(colnames(tail))) %>%
    extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
    mutate(
        strain=factor(strain,unique(strain)), 
        time=factor(time,unique(time)))
rownames(samples) <- colnames(tail)

samples
```

"tpre" is the cells in an unsynchronized state, other times are minutes after release into cycling.

The two strains are a wildtype and a strain with a mutated set1 gene.


# Create weitrix object

From experience, noise scales with the length of the tail. Therefore to stabilize the variance we will be examing log2 of the tail length.

These tail lengths are each the average over many reads. We therefore weight each tail length by the number of reads. This is somewhat overoptimistic as there is biological noise that doesn't go away with more reads, which we will correct for in the next step.


```{r weitrix, message=FALSE}
log2_tail <- log2(tail)

good <- rowMeans(tail_count) >= 10
table(good)

wei <- as_weitrix(
    log2_tail[good,,drop=FALSE], 
    weights=tail_count[good,,drop=FALSE])

rowData(wei)$gene <- AnnotationDbi::select(
    org.Sc.sgd.db, keys=rownames(wei), columns=c("GENENAME"))$GENENAME
rowData(wei)$total_reads <- rowSums(weitrix_weights(wei))
colData(wei) <- cbind(colData(wei), samples)
```


# Calibration

Our first step is to calibrate our weights. Our weights are overoptimistic for large numbers of reads, as there is a biological components of noise that does not go away with more reads.

Calibration requires a model explaining non-random effects. We provide a design matrix and a weighted linear model fit is found for each row. The lack of replicates makes life difficult, for simplicity here we will assume time and strain are independent.

```{r cal}
design <- model.matrix(~ strain + time, data=colData(wei))
```

The dispersion is then calculated for each row. A spline curve is fitted to the log dispersion. Weights in each row are then scaled so as to flatten this curve.

```{r}
cal_design <- weitrix_calibrate_trend(wei, design, ~splines::ns(log2(total_reads),3))
```

## Examining the effect of calibration

Let's unpack what happens in that `weitrix_calibrate_trend` step.

```{r disp-calc}
rowData(cal_design)$dispersion_unweighted <- weitrix_dispersions(weitrix_x(wei), design)
```

Consider first the dispersion if weights were uniform. Missing data is weighted 0, all other weights are 1.

```{r disp-plot-unif}
rowData(cal_design) %>% as.data.frame() %>%
    ggplot(aes(x=total_reads, y=dispersion_unweighted)) + geom_point(size=0.1) +
    scale_x_log10() + scale_y_log10() +
    labs(title="Dispersion if uniform weights used (log scales)")
```

There is information from the number of reads that we can remove.

Here are the dispersions using read-counts as weights.

```{r disp-plot-weighted}
rowData(cal_design) %>% as.data.frame() %>%
    ggplot(aes(x=total_reads, y=dispersion_before)) + geom_point(size=0.1) +
    geom_line(aes(y=dispersion_trend), color="red") +
    scale_x_log10() + scale_y_log10() +
    labs(title="Dispersion with read-counts as weights (log scales)")
```

In general it is improved, but now we have the opposite problem. There is a component of noise that does not go away with more and more reads. The red line is a trend fitted to this, which `weitrix_calibrate_trend` divides out of the weights.

Finally, here are the dispersion from the calibrated weitrix.

```{r disp-plot-cal}
rowData(cal_design) %>% as.data.frame() %>%
    ggplot(aes(x=total_reads, y=dispersion_after)) + geom_point(size=0.1) +
    scale_x_log10() + scale_y_log10() +
    labs(title="Dispersion with read-count weights / trend-based calibration\n(log scales)")
```

This is reasonably close to uniform, with no trend from the number of reads. `limma` can now estimate the variability of dispersions between genes, and apply its Emprical-Bayes squeezed dispersion based testing.


# Testing

We are now ready to test things. We feed our calibrated weitrix to [limma](https://bioconductor.org/packages/release/bioc/html/limma.html).


```{r limmadesign}
fit_cal_design <- cal_design %>%
    weitrix_elist() %>%
    lmFit(design)

ebayes_fit <- eBayes(fit_cal_design)
result_signif <- topTable(ebayes_fit, "strainDeltaSet1", n=Inf)

result_signif %>%
    dplyr::select(gene,diff_log2_tail=logFC,ave_log2_tail=AveExpr,
        adj.P.Val,total_reads) %>%
    head(20)
```

My package [topconfects](https://bioconductor.org/packages/release/bioc/html/topconfects.html) can be used to find top confident differential tail length. Rather than picking "most significant" genes, it will highlight genes with a large effect size.


```{r confectsdesign}
result_confects <- limma_confects(
    fit_cal_design, "strainDeltaSet1", full=TRUE, fdr=0.05)

result_confects$table %>% 
    dplyr::select(gene,diff_log2_tail=effect,confect,total_reads) %>% 
    head(20)

cat(sum(!is.na(result_confects$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")
```

This lists the largest confident log2 fold changes in poly(A) tail length.
The `confect` column is an inner confidence bound on 
the difference in log2 tail length,
adjusted for multiple testing.

We discover some genes with less total reads, but large change in tail length.

Note that due to PCR amplification slippage and limited read length, the observed log2 poly(A) tail lengths may be an underestimate. However as all samples have been prepared in the same way, observed differences should indicate the existence of true differences.


## Examine individual genes

Having discovered genes with differential tail length, let's look at some genes in detail.

```{r examine, fig.show="hold", fig.width=4, fig.height=3}
view_gene <- function(id, title="") {
    ggplot(samples, aes(x=time,color=strain,group=strain, y=tail[id,])) +
       geom_hline(yintercept=0) + 
       geom_line() + 
       geom_point(aes(size=tail_count[id,])) +
       labs(x="Time", y="Tail length", size="Read count", title=paste(id,title))
}

# Top "significant" genes
view_gene("YDR170W-A")
view_gene("YJR027W/YJR026W")
view_gene("YAR009C")
view_gene("YIL015W","BAR1")

# topconfects has highlighted some genes with lower total reads
view_gene("YER133W","GLC7")
view_gene("YCR014C","POL4")
```


# Exploratory analysis

The test we've performed was somewhat unsatisfactory. Due to the design of the experiment it's difficul to specify differential tests that fully interrogate this dataset: the lack of replicates, and the difficult specifying apriori how tail length will change over time.

Perhaps we should let the data speak for itself.

Perhaps this is what we should have done first!

The weitrix package allows us to look for components of variation. We'll try to explain the data with different numbers of components (from 1 to 10 components).

```{r comp, message=F}
comp_seq <- weitrix_components_seq(wei, p=10)
```

`weitrix_seq_screeplot` shows how much additional variation in the data is explained as each further component is allowed. However the ultimate decision of how many components to examine is a matter of judgement.

```{r scree}
components_seq_screeplot(comp_seq)
```

Looking at three components shows some of the major trends in this data-set.

```{r exam}
comp <- comp_seq[[3]]

matrix_long(comp$col[,-1], row_info=samples, varnames=c("sample","component")) %>%
    ggplot(aes(x=time, y=value, color=strain, group=strain)) + 
    geom_hline(yintercept=0) + 
    geom_line() + 
    geom_point(alpha=0.75, size=3) + 
    facet_grid(component ~ .) +
    labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")
```

We observe:

* C1 - A gradual lengthening of tails after release into cell cycling. (The reason for the divergence between strains at the end is unclear.)
* C2 - A lengthening of poly(A) tails in the set1 mutant.
* C3 - Variation in poly(A) tail length with the cell cycle.

The log2 tail lengths are approximated by `comp$row %*% t(comp$col)` where `comp$col` is an $n_\text{sample} \times (p+1)$ matrix of scores (shown above), and `comp$row` is an $n_\text{gene} \times (p+1)$ matrix of gene loadings, which we will now examine. (The $+1$ is the intercept "component", allowing each gene to have a different baseline tail length.)

```{r limmacomp}
cal_comp <- weitrix_calibrate_trend(wei, comp, ~splines::ns(log2(total_reads),3))
fit_comp <- cal_comp %>%
    weitrix_elist() %>%
    lmFit(comp$col)
```

**Treat these results with caution.** Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.

## Gene loadings for C1: gradual lengthing over time

```{r C1}
result_C1 <- limma_confects(fit_comp, "C1")
```

```{r examine_C1, echo=FALSE, fig.show="hold", fig.width=4, fig.height=3}
result_C1$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C1$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

view_gene("YBL016W", "FUS3")
view_gene("YOR096W", "RPS7A")
view_gene("YDR092W", "UBC1")
view_gene("YLR118C", "")
```

FUS3 is involved in yeast mating. We see here a poly(A) tail signature of yeast realizing there are not actually any $\alpha$ cells around to mate with.


## Gene loadings for C2: longer tails in set1 mutant

```{r C2}
result_C2 <- limma_confects(fit_comp, "C2")
```

```{r examine_C2, echo=FALSE, fig.show="hold", fig.width=4, fig.height=3}
result_C2$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C2$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

view_gene("YDR476C","")
view_gene("YPL131W","RPL5")
view_gene("YCR014C","POL4")
view_gene("YGR251W","NOP19")
```

## Gene loadings for C3: cell-cycle associated changes

```{r C3}
result_C3 <- limma_confects(fit_comp, "C3")
```

Given the mixture of signs for effects in C3, different genes are longer in different stages of the cell cycle. We see many genes to do with replication, and also Mating Factor A.

```{r examine_C3, echo=FALSE, fig.show="hold", fig.width=4, fig.height=3}
result_C3$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C3$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

view_gene("YFL014W", "HSP12")
view_gene("YNL036W",  "NCE103")
view_gene("YJL173C", "RFA3")
view_gene("YBL003C","HTA2")
view_gene("YDR461W", "MFA1")
```

# Discussion

Looking back to our initial differential testing in light of these results, a reasonable refinement would be to omit "tpre" and "t0m", considering only the samples that have settled into cell cycling.