--- title: "InPAS Vignette" author: "Jianhong Ou, Sungmi Park, Michael Green, Lihua Julie Zhu" date: "`r doc_date()`" package: "`r pkg_ver('InPAS')`" bibliography: ref.bib csl: nature.csl abstract: > Alternative polyadenylation (APA) is one of the important post-transcriptional regulation mechanisms which occurs in most human genes. InPAS facilitates the discovery of novel APA sites and the differential usage of APA sites from RNA-Seq data. It leverages cleanUpdTSeq[@sheppard2013accurate] to fine tune identified APA sites by removing false sites due to internal-priming. vignette: > %\VignetteIndexEntry{InPAS Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- ```{r preloadLibrary, echo=FALSE, results="hide", warning=FALSE} suppressPackageStartupMessages({ library(InPAS) library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(org.Hs.eg.db) library(cleanUpdTSeq) }) ``` # Introduction Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which is prevalent in human genes. Like alternative splicing, APA can create proteome diversity. In addition, it defines 3' UTR and results in altered expression of the gene. It is a tightly controlled process and mis-regulation of APA can lead to pathological effects to the cells, such as uncontrolled cell cycle and growth. Although several high throughput sequencing methods have been developed, there are still limited data available. RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available in public databases such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data. The workflow for InPAS is: - Calculate coverage from BEDGraph files - Identify putative pA sites based on the coverage changes - Filter putative pA sites using cleanUpdTseq to remove false sites due to internal-priming. - Estimate alternative pA usage using the coverage for the predicted 3' UTR - Identify differential usage of APA in different samples leveraging Limma # How to First, load the required libraries InPAS, species specific BSgenome, and TxDb as follows. ```{r loadLibrary} library(InPAS) library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) path <- file.path(find.package("InPAS"), "extdata") ``` Next, prepare annotation using the function `utr3Annotation` with a TxDb and org annotation. Please note that the 3' UTR annotation prepared by `utr3Annotation` includes the gaps to the next annotated region. ```{r prepareAnno} library(org.Hs.eg.db) samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(samplefile) utr3.sample.anno <- utr3Annotation(txdb=txdb, orgDbSYMBOL="org.Hs.egSYMBOL") utr3.sample.anno ``` Users can also directly load the 3' UTR annotation included in this package for mm10 and hg19. Here we show how to load the pre-built mm10 3' UTR annotation file. ```{r loadAnno} ##step1 annotation # load from dataset data(utr3.mm10) ``` For coverage calculation, alignment files in BEDGraph format are required, which can be generated from BAM files using the genomecov tool in bedtools with parameter: -bg -split. Potential pA sites identified from the coverage data can be filtered/adjusted using the classifier provided by _cleanUpdTseq_. The following scripts illustrate the function calls needed to perform the complete analysis using _InPAS_. ```{r step2HugeDataF} load(file.path(path, "polyA.rds")) library(cleanUpdTSeq) data(classifier) bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), file.path(path, "UM15.extract.bedgraph")) hugeData <- FALSE ##step1 Calculate coverage coverage <- coverageFromBedGraph(bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, hugeData=hugeData) ## we hope the coverage rate of should be greater than 0.75 in the expressed gene. ## which is used because the demo data is a subset of genome. coverageRate(coverage=coverage, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, genome=BSgenome.Mmusculus.UCSC.mm10, which = GRanges("chr6", ranges=IRanges(98013000, 140678000))) ##step2 Predict cleavage sites CPs <- CPsites(coverage=coverage, genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, search_point_START=200, cutEnd=.2, long_coverage_threshold=3, background="10K", txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, PolyA_PWM=pwm, classifier=classifier, shift_range=50, step=10) head(CPs) ##step3 Estimate 3UTR usage res <- testUsage(CPsites=CPs, coverage=coverage, genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, method="fisher.exact", gp1="Baf3", gp2="UM15") ##step4 view the results as(res, "GRanges") filterRes(res, gp1="Baf3", gp2="UM15", background_coverage_threshold=3, adj.P.Val_cutoff=0.05, dPDUI_cutoff=0.3, PDUI_logFC_cutoff=0.59) ``` The steps described above can be done in one function call. ```{r OneStep} if(interactive()){ res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2="UM15", txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, shift_range=50, PolyA_PWM=pwm, classifier=classifier, method="fisher.exact", adj.P.Val_cutoff=0.05, dPDUI_cutoff=0.3, PDUI_logFC_cutoff=0.59) } ``` InPAS can also handle single group data. ```{r SingleGroupData} inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2=NULL, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, PolyA_PWM=pwm, classifier=classifier, method="singleSample", adj.P.Val_cutoff=0.05, step=10) ``` # Session Info ```{r sessionInfo, results='asis'} sessionInfo() ```