---
title: "3. Alternative polyadenylation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{3. Alternative polyadenylation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

APA sites can be detected using the [PAT-Seq protocol](https://rnajournal.cshlp.org/content/21/8/1502.long). This protocol produces 3'-end focussed reads. We examine [GSE83162](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83162). This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with $\alpha$-factor, which causes them to stop dividing in antici... pation of a chance to mate. When the $\alpha$-factor is washed away, they resume cycling.


# Shift score definition

Each gene has several APA sites, ordered from furthest upstrand to furthest downstrand. For each sample, we have a read count at each site.

For each gene:

We define the "shift" of a particular sample relative to all reads from all samples. 

A "shift" score is first assigned to each site, being the proportion of all reads upstrand of the site minus the proportion of all reads downstrand of it (i.e. an average over all reads where upstrand reads are 1, downstrand reads are -1 and reads from the site itself are 0). The measurement for each sample is then an average over the site score for each read. The weight is the number of reads.

Shifts scores range from -1 to 1, and summarize whether upstrand (more negative) or downstrand (more positive) sites are being favoured. The weighted average score is zero.

The weights are the number of reads, however for a randomly chosen read we can estimate its variance based on the number of reads at each site and the site scores. (This estimate of variance includes any biological signal, so it's not exactly a residual variance.) This is stored in the `rowData()` of the weitrix, and can be used to further calibrate weights. We prefer to defer this sort of calibration until after we've discoverd components of variation, as it tends to give high weight to genes with little APA. There are clearly some alternative choices to how weighting could be performed, and we hope the weitrix package gives you basic building blocks with which you can experiment!


# Load files

```{r setup, echo=F}
knitr::opts_chunk$set(fig.width=6, fig.height=4, cache=TRUE, autodep=TRUE)

# To examine objects:
# devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/3_shift_cache/html")

options(width=1000)
```

```{r load, message=F, warning=F}
library(tidyverse)
library(reshape2)
library(limma)
library(topconfects)
library(weitrix)

# Produce consistent results
set.seed(12345)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
BiocParallel::register( BiocParallel::SnowParam() )

# The most reliable backed is to use serial processing
#BiocParallel::register( BiocParallel::SerialParam() )

peaks <- system.file("GSE83162", "peaks.csv.gz", package="weitrix") %>%
    read_csv()

counts <- system.file("GSE83162", "peak_count.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("name") %>%
    as.matrix()

genes <- system.file("GSE83162", "genes.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("name")
    
samples <- data.frame(sample=I(colnames(counts))) %>%
    extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
    mutate(
        strain=factor(strain,unique(strain)), 
        time=factor(time,unique(time)))
rownames(samples) <- samples$sample

groups <- dplyr::select(peaks, group=gene_name, name=name)
# Note the order of this data frame is important
```

```{r examine_raw}
samples

head(groups, 10)

counts[1:10,1:5]
```

A "shift" weitrix is constructed based on a matrix of site-level counts, plus a data frame grouping sites into genes. The order of this data frame is important, earlier sites are considered upstrand of later sites.

```{r shift}
wei <- counts_shift(counts, groups)

colData(wei) <- cbind(colData(wei), samples)
rowData(wei) <- cbind(rowData(wei), genes[match(rownames(wei), rownames(genes)),])
```

Having obtained a weitrix, everthing discussed for the [poly(A) tail length example](2_tail_length.html) is applicable here as well. We will only perform a brief exploratory analysis here.


# Exploratory analysis

We can look for components of variation.

```{r comp, message=F}
comp_seq <- weitrix_components_seq(wei, p=10, design=~0)
```

```{r scree}
components_seq_screeplot(comp_seq)
```

Pushing a point somewhat, we examine four components.

```{r exam, fig.height=6}
comp <- comp_seq[[4]]

matrix_long(comp$col, row_info=samples, varnames=c("sample","component")) %>%
    ggplot(aes(x=time, y=value, color=strain, group=strain)) + 
    geom_hline(yintercept=0) + 
    geom_line() + 
    geom_point(alpha=0.75, size=3) + 
    facet_grid(component ~ .) +
    labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")
```

A weitrix created with `counts_shift` has a built-in default trend formula, so we don't need to give a formula explicitly to `weitrix_calibrate_trend`.

```{r calibrate}
cal_comp <- weitrix_calibrate_trend(wei, comp)
metadata(cal_comp)$weitrix$trend_fit
```

The calibration is based on two predictors of dispersion, the total number of reads and the estimated per read variance. To illustrate the calibration, we facet by bins of the per-read variance and show total reads on the x axis.

```{r calibrate_show}
rowData(cal_comp) %>% as.data.frame() %>%
    ggplot(aes(x=total_reads, y=dispersion_before)) + 
    facet_wrap(~ cut(per_read_var, 9)) +
    geom_point(size=0.1) +
    geom_point(aes(y=dispersion_trend), color="red", size=0.1) +
    scale_x_log10() + scale_y_log10() +
    labs(y="Dispersion (log scale)", x="Total reads (log scale)")
```

```{r limma}
fit_comp <- cal_comp %>%
    weitrix_elist() %>%
    lmFit(comp$col)
```

**Treat these results with caution.** Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.

## Gene loadings for C1

```{r C1}
limma_confects(fit_comp, "C1", full=TRUE, fdr=0.05)
```

## Gene loadings for C2

```{r C2}
limma_confects(fit_comp, "C2", full=TRUE, fdr=0.05)
```

## Gene loadings for C3

```{r C3}
limma_confects(fit_comp, "C3", full=TRUE, fdr=0.05)
```

## Gene loadings for C4

```{r C4}
limma_confects(fit_comp, "C4", full=TRUE, fdr=0.05)
```

## Examine individual genes

Let's examine peak-level read counts for some genes we've identified.

```{r examiner, message=F, warning=F, fig.height=3}
examine <- function(gene_wanted, title) {
    peak_names <- filter(peaks, gene_name==gene_wanted)$name

    counts[peak_names,] %>% melt(value.name="reads", varnames=c("peak","sample")) %>%
        left_join(samples, by="sample") %>%
        ggplot(aes(x=factor(as.integer(peak)), y=reads)) + 
        facet_grid(strain ~ time) + geom_col() +
        labs(x="Peak",y="Reads",title=title)
}

examine("YLR058C", "SHM2, from C1")
examine("YLR333C", "RPS25B, from C2")
examine("YDR077W", "SED1, from C3")
examine("YIL015W", "BAR1, from C4")
examine("tK(CUU)M", "tK(CUU)M, from C4")
```