---
title: "InPAS Vignette"
author: "Jianhong Ou, Sungmi Park, Michael Green, Lihua Julie Zhu"
date: "`r doc_date()`"
package: "`r pkg_ver('InPAS')`"
bibliography: ref.bib
csl: nature.csl
abstract: >
   Alternative polyadenylation (APA) is one of the important post-transcriptional
   regulation mechanisms which occurs in most human genes. InPAS facilitates the 
   discovery of novel APA sites and the differential usage of APA sites from 
   RNA-Seq data. It leverages cleanUpdTSeq[@sheppard2013accurate] to fine tune
   identified APA sites by removing false sites due to internal-priming.
vignette: >
  %\VignetteIndexEntry{InPAS Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---
```{r preloadLibrary, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
    library(InPAS)
    library(BSgenome.Mmusculus.UCSC.mm10)
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
    library(org.Hs.eg.db)
    library(cleanUpdTSeq)
})
```
# Introduction

Alternative polyadenylation (APA) is one of the most important 
post-transcriptional regulation mechanisms which is prevalent in human 
genes. Like alternative splicing, APA can create proteome diversity. 
In addition, it defines 3' UTR and results in altered expression of the gene.
It is a tightly controlled process and mis-regulation of APA can lead to 
pathological effects to the cells, such as uncontrolled cell cycle and growth. 
Although several high throughput sequencing methods have been developed, 
there are still limited data available.

RNA-seq has become one of the most commonly used methods for quantifying
genome-wide gene expression. There are massive RNA-seq datasets available
in public databases such as GEO and TCGA. For this reason, we develop the 
InPAS algorithm for identifying APA from conventional RNA-seq data.

The workflow for InPAS is:

- Calculate coverage from BEDGraph files
- Identify putative pA sites based on the coverage changes 
- Filter putative pA sites using cleanUpdTseq to remove false sites due to internal-priming.
- Estimate alternative pA usage using the coverage for the predicted 3' UTR
- Identify differential usage of APA in different samples leveraging Limma

# How to

First, load the required libraries InPAS, species specific BSgenome, and TxDb as follows.

```{r loadLibrary}
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")
```

Next, prepare annotation using the function `utr3Annotation` with a TxDb and org annotation.
Please note that the 3' UTR annotation prepared by `utr3Annotation` 
includes the gaps to the next annotated region.

```{r prepareAnno}
library(org.Hs.eg.db)
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb, 
                   orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno
```

Users can also directly load the 3' UTR annotation included in this package 
for mm10 and hg19. Here we show how to load the pre-built mm10 3' UTR annotation file.

```{r loadAnno}
##step1 annotation
# load from dataset
data(utr3.mm10)
```

For coverage calculation, alignment files in BEDGraph format are required, 
which can be generated from BAM files using the genomecov tool in bedtools
with parameter: -bg -split. Potential pA sites identified from the coverage
data can be filtered/adjusted using the classifier provided by _cleanUpdTseq_. 
The following scripts illustrate the function calls needed to perform the 
complete analysis using _InPAS_.

```{r step2HugeDataF}
load(file.path(path, "polyA.rds"))
library(cleanUpdTSeq)
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), 
               file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs, 
                                 tags=c("Baf3", "UM15"),
                                 genome=BSgenome.Mmusculus.UCSC.mm10,
                                 hugeData=hugeData)

## we hope the coverage rate of should be greater than 0.75 in the expressed gene.
## which is used because the demo data is a subset of genome.
coverageRate(coverage=coverage,
             txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
             genome=BSgenome.Mmusculus.UCSC.mm10,
             which = GRanges("chr6", ranges=IRanges(98013000, 140678000)))

##step2 Predict cleavage sites
CPs <- CPsites(coverage=coverage, 
               genome=BSgenome.Mmusculus.UCSC.mm10,
               utr3=utr3.mm10, 
               search_point_START=200, 
               cutEnd=.2, 
               long_coverage_threshold=3,
               background="10K",
               txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
               PolyA_PWM=pwm, 
               classifier=classifier,
               shift_range=50,
               step=10)
head(CPs)
##step3 Estimate 3UTR usage
res <- testUsage(CPsites=CPs, 
                 coverage=coverage, 
                 genome=BSgenome.Mmusculus.UCSC.mm10,
                 utr3=utr3.mm10, 
                 method="fisher.exact",
                 gp1="Baf3", gp2="UM15")

##step4 view the results
as(res, "GRanges") 

filterRes(res, gp1="Baf3", gp2="UM15",
          background_coverage_threshold=3,
          adj.P.Val_cutoff=0.05, 
          dPDUI_cutoff=0.3, 
          PDUI_logFC_cutoff=0.59)
```

The steps described above can be done in one function call.

```{r OneStep}
if(interactive()){
    res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE,
              shift_range=50,
              PolyA_PWM=pwm, classifier=classifier,
              method="fisher.exact",
              adj.P.Val_cutoff=0.05,
              dPDUI_cutoff=0.3, 
              PDUI_logFC_cutoff=0.59)
}
```

InPAS can also handle single group data.

```{r SingleGroupData}
inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), 
      genome=BSgenome.Mmusculus.UCSC.mm10, 
      utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
      txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
      search_point_START=200,
      short_coverage_threshold=15,
      long_coverage_threshold=3, 
      cutStart=0, cutEnd=.2, 
      hugeData=FALSE, 
      PolyA_PWM=pwm, classifier=classifier,
      method="singleSample",
      adj.P.Val_cutoff=0.05,
      step=10)
```

# Session Info
```{r sessionInfo, results='asis'}
sessionInfo()
```