--- title: "Analysing transcript 5'-profiling data using icetea" author: - name: "Vivek Bhardwaj and Thomas Manke" affiliation: - "Max Planck Institute of Immunobiology and Epigenetics, Stübeweg 51, 79108, Freiburg, Germany" date: "`r format(Sys.Date(), '%d/%m/%Y')`" abstract: > Transcript 5'-profiling techniques detect 5' sites of the expressed transcripts in the genome, allowing us to detect transcription start sites in the genome at high resolution. *CAGE* (Cap Analysis of Gene Expression) is one such technique. Programs for analysis of CAGE data use the clustering of CAGE tags (CTSS) to annotate transcription start sites and detect promoter shifts. Recently, faster and more efficient 5'-profiling techniques have been developed which allow early sample multiplexing, removal of PCR duplicates using random barcodes, and use paired-end sequencing. **icetea** can be used to process such data. It also implements a new method for detection of transcription start sites based on local enrichment of genomic windows, that can take advantage of biological replicates. It further allows users to perform differential expression analysis of transcription start sites between groups of samples, therefore integrating transcript annotation and expression analysis. While icetea analysis is mostly useful for multiplexed, paired-end experiments with replicates, conventional CAGE data can also be analysed. *icetea* analysis compliments other programs that perform promoter profiling and expression analysis based on clusteing of CAGE tags. output: rmarkdown::html_document: highlight: pygments toc: true bibliography: references.bib vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Analysing transcript 5'-profiling data using icetea} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Here we show the usability of *icetea* in the analysis of data from a recently developed paired-end, multiplexed 5'-profiling protocol called MAPCap. **MAPCap** (Multiplexed Affinity Purification of Capped RNA) allows fast and accutate detection of transcription start sites and expression analysis of multiplexed, low-input RNA samples. ## Quick Start Below is the set of minimal steps for the processing of MAPCap data for the detection of TSS. Starting from the raw fastq files, we can either perform quality trimming using a standard program (eg. [cutadapt](https://cutadapt.readthedocs.io/en/stable/)/[trimgalore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) or simply begin the analysis by creating a `CapSet` object. The minimal information required to create a `CapSet` object is the path to raw/trimmed FASTQ files and a vector of sample names. For multiplexed FASTQ files, additionally a vector of sample indexes should be provided to proceed with de-multiplexing. The following steps creates the object, demultiplexes the fastq, maps them, filters them and detects the transcription start sites. 
```{r quickIntro, eval=FALSE}
# load the package
library(icetea)

# provide demultiplexing barcodes and sample names
idxlist <- c("CAAGTG", "TTAGCC", "GTGGAA", "TGTGAG")
dir <- system.file("extdata", package = "icetea")
# corresponding sample names
fnames <- c("embryo1", "embryo2", "embryo3", "embryo4")

## create CapSet object
cs <- newCapSet(expMethod = 'MAPCap',
                fastq_R1 = file.path(dir, 'mapcap_test_R1.fastq.gz'),
                fastq_R2 = file.path(dir, 'mapcap_test_R2.fastq.gz'),
                idxList = idxlist,
                sampleNames = fnames)

# demultiplex fastq and trim the barcodes
dir.create("splitting")
cs <- demultiplexFASTQ(cs, max_mismatch = 1, outdir = "splitting", ncores = 10)

# map fastq (subread_idx = path to a subread index of the genome,
# see the "Mapping the fastqs" section below)
dir.create("mapping")
cs <- mapCaps(cs, subread_idx, outdir = "mapping", ncores = 20,
              logfile = "mapping/subread_mapping.log")

# filter PCR duplicates
dir.create("removedup")
cs <- filterDuplicates(cs, outdir = "removedup")

# detect TSS
dir.create("tssCalling")
cs <- detectTSS(cs, groups = c("wt", "wt", "mut", "mut"),
                outfile_prefix = "tssCalling/testTSS")
```

## Help and citations

### How to get help

Questions about *icetea* can be posted to the [Bioconductor support site](https://support.bioconductor.org), which serves as a searchable knowledge base of questions and answers. Posting a question and tagging it with "icetea" automatically sends an alert to the package authors to respond on the support site. See the support site's posting guide for information about how to construct an informative post.

### How to cite ICETEA

To cite the *icetea* package in your analysis, type `citation("icetea")` from within your R session.

## The CapSet object

The workflow begins by creating an object of class `CapSet` using the function `newCapSet`. This object contains information about the experimental method (CAGE [@kodzius2006cage], RAMPAGE [@batut2013] or MAPCap), along with the path to the multiplexed fastq files. To allow de-multiplexing, the demultiplexing barcodes should be provided to `idxList` and the corresponding sample names to `sampleNames`. With this information, create the `CapSet` object as follows.

```{r makeCS, eval=TRUE, results = "hold"}
# provide demultiplexing barcodes as strings
idxlist <- c("CAAGTG", "TTAGCC", "GTGGAA", "TGTGAG")
# provide corresponding sample names
fnames <- c("embryo1", "embryo2", "embryo3", "embryo4")
# `dir` contains example data provided with this package
dir <- system.file("extdata", package = "icetea")

## CapSet object from raw (multiplexed) fastq files
library(icetea)
cs <- newCapSet(expMethod = 'MAPCap',
                fastq_R1 = file.path(dir, 'mapcap_test_R1.fastq.gz'),
                fastq_R2 = file.path(dir, 'mapcap_test_R2.fastq.gz'),
                idxList = idxlist,
                sampleNames = fnames)
```

The `CapSet` object is designed to allow the steps of the analysis to proceed like a pipeline. It holds the information on the output files from each step, which is then passed on to the next step. This information is stored in a DataFrame object, which can be retrieved and manipulated using the function `sampleInfo`. If the underlying files are deleted or moved to another folder, the sample information can be reset as follows.

```{r sampleinfo, eval=TRUE, results = "hold"}
# retrieve the per-sample information
si <- sampleInfo(cs)
# update the location of the mapped BAM files
dir <- system.file("extdata/bam", package = "icetea")
si$mapped_file <- list.files(dir, pattern = ".bam$", full.names = TRUE)
# write the modified information back to the CapSet object
sampleInfo(cs) <- si
```

The `CapSet` object doesn't need to be created at the beginning of the workflow.
Users can perform the initial, computationally expensive steps of FASTQ de-multiplexing, mapping and PCR de-duplication outside of the R environment. The `CapSet` object can be created at any point of the workflow, using the output files from the current step. For example, after mapping and de-duplication:

```{r cs2, eval = TRUE, results = "hold"}
dir <- system.file("extdata/filtered_bam", package = "icetea")
cs2 <- newCapSet(expMethod = 'MAPCap',
                 filtered_file = list.files(dir, pattern = ".bam$", full.names = TRUE),
                 sampleNames = fnames)
```

Providing the output files from previous steps is not necessary, but doing so allows additional metadata (such as the number of mapped reads) to be calculated, which can be useful for plotting (see section [QC](#QC)).

## Analysis workflow

### De-multiplexing the fastq

Experiments like MAPCap and RAMPAGE [@batut2013] produce multiplexed fastq files with sample indices and PCR barcodes attached to the fastq sequence. The function `demultiplexFASTQ` de-multiplexes the samples using the sample barcode, producing fastq files corresponding to each sample. It also trims off these barcodes and attaches them to the header of the fastq reads for further processing.

For the RAMPAGE [@batut2013] protocol, the modified header looks like this: `#:`

For the MAPCap protocol, the modified header looks like this: `#::`

For conventional CAGE [@kodzius2006cage] protocols, we assume that standard Illumina de-multiplexing has already been performed and this step is not required. The function returns a modified `CapSet` object that contains the location of the de-multiplexed files along with processing statistics.

```{r demultiplex, eval=TRUE, results = "hold"}
# demultiplex fastq and trim the barcodes
dir.create("splitting")
cs <- demultiplexFASTQ(cs, max_mismatch = 1, outdir = "splitting", ncores = 10)
```

It takes about 6 minutes (340 seconds) to trim and de-multiplex 1 million paired-end reads into 12 samples (12 pairs of fastq files) using a single thread. With 10 threads, this can be done in under 45 seconds. Optionally, we can skip this step and perform post-mapping de-multiplexing on the BAM files (see below).

### Mapping the fastqs

The de-multiplexed fastqs can now be mapped using the `mapCaps` function. This function is a wrapper around the `subjunc` function from the `Rsubread` package [@liao2013subread]. It additionally sorts the mapped files and collects mapping statistics, which are stored in the modified `CapSet` object. In order to run the function, we first create a subread index of our genome.

```{r indexgenome, eval=FALSE}
dir.create("genome_index")
library(Rsubread)
buildindex(basename = "genome_index/dm6", reference = "/path/to/dm6/genome.fa")
```

We can now perform the mapping.

```{r mapcaps, eval=FALSE}
# provide location of a subread index file
subread_idx <- "genome_index/dm6"
# map fastq
dir.create("mapping")
cs <- mapCaps(cs, subread_idx, outdir = "mapping", ncores = 20,
              logfile = "mapping/subread_mapping.log")
# you can save the CapSet object for later use
save(cs, file = "myCapSet.Rdata")
```

**Note:** The package `Rsubread` is not available for Windows. Windows users would need to map their de-multiplexed files using another tool.

**Note:** Optionally, we can perform mapping directly on the multiplexed files and perform the de-multiplexing afterwards (see below). This option is only recommended when the FASTQ files are not very large (for example, when the overall sequencing depth is low, below 10 million reads).
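As noted above, Windows users (or anyone preferring an external aligner) can map the de-multiplexed fastq files outside of R. The resulting sorted BAM files can then be registered in the `CapSet` object via `sampleInfo`, in the same way as shown earlier. Below is a minimal sketch, assuming the externally mapped BAM files sit in a hypothetical `mapping/` directory whose alphabetical file order matches the sample order of the object:

```{r registerExternalBAM, eval=FALSE}
# register externally mapped BAM files in the CapSet object
# (assumption: alphabetical file order matches the sample order)
si <- sampleInfo(cs)
si$mapped_file <- list.files("mapping", pattern = ".bam$", full.names = TRUE)
sampleInfo(cs) <- si
```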
### (optional) Post-mapping de-multiplexing

De-multiplexing can be performed on the BAM files after mapping, using the `splitBAM_byIndex` and (for MAPCap) `splitBAM_byRepindex` functions. These functions assume that the reads have been trimmed and the sample barcodes have been attached to the header as [described above](#demult).

```{r splitbam, eval=FALSE}
# bamFile = multiplexed BAM file; index_list = sample barcodes;
# outfile_list = output BAM file names (one per barcode)
splitBAM_byIndex(bamFile, index_list, outfile_list, max_mismatch = 1, ncores = 10)
```

### Filtering PCR duplicates

Experiments like MAPCap and RAMPAGE provide a way to remove PCR-duplicate reads from the mapped data. Random barcodes present in the read sequence are used for this purpose. In MAPCap, pre-designed random barcodes present in the oligos serve as UMIs, while in RAMPAGE, the sequences used as RT-PCR primers are treated as pseudo-random barcodes. PCR duplicates are recognized as reads that map to the same 5' position and contain the same random barcode. The function `filterDuplicates` removes these PCR-duplicate sequences (keeping only one copy of each), and creates de-duplicated BAM files. It returns the modified `CapSet` object with de-duplication statistics.

`filterDuplicates` works in a strand-specific way. By default, it filters out the R2 read from the paired-end BAM files, along with secondary alignments. Both reads can be kept using the option `keepPairs = TRUE` (a sketch is shown at the end of this section). Note that in this case each read of a pair is evaluated separately, and for fragments where only one mate is flagged as a duplicate, the other mate is kept. In order to keep only fragments where both mates were kept, the output BAM files should be filtered for properly paired reads using samtools. Note that the run-time for paired-end evaluation is significantly longer, and only the first mate is considered for TSS detection (later).

```{r filterdups, eval=TRUE, results = "hold"}
# load a previously saved CapSet object (or create a new one)
cs <- exampleCSobject()
# filter PCR duplicates and save output in a new directory
dir.create("removedup")
cs <- filterDuplicates(cs, outdir = "removedup")
```

Tagging of read IDs with the random barcodes and subsequent removal of PCR duplicates can also be performed using command-line tools such as [UMI-tools](https://github.com/CGATOxford/UMI-tools).
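If both mates are to be retained, a minimal sketch of the paired-end variant is shown below (the output directory name `removedup_pairs` is just for illustration):

```{r filterdupsPairs, eval=FALSE}
# keep both mates; each mate is evaluated for duplication separately
dir.create("removedup_pairs")
cs_pairs <- filterDuplicates(cs, outdir = "removedup_pairs", keepPairs = TRUE)
# to keep only fragments where both mates survived, filter the output BAM files
# for properly paired reads, e.g. with: samtools view -b -f 2 in.bam > out.bam
```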
### Detection of TSS

*icetea* implements a new method for the detection of transcription start sites, adapted from recently described methods for differential transcription-factor binding analysis of ChIP-seq data [@lun2015csaw]. The genome is divided into small, sliding windows, and TSSs are detected as the windows that show a user-defined fold enrichment over a local background region. Multiple consecutively enriched windows are then merged to detect broader TSSs. The method works well with replicates, resulting in one TSS file per group. It is implemented in the function `detectTSS`, which returns a modified `CapSet` object with TSS detection statistics.

```{r getTSS, eval=TRUE, results = "hold"}
# detect TSS
dir.create("tssCalling")
cs <- detectTSS(cs, groups = c("wt", "wt", "mut", "mut"),
                outfile_prefix = "tssCalling/testTSS",
                restrictChr = "X", ncores = 1)
# export the detected TSS bed files
exportTSS(cs, merged = TRUE, outfile_prefix = "tssCalling/testTSS")
```

## Plotting and QC

The `CapSet` object stores the number of reads kept at each step of processing; this information can easily be plotted using the function `plotReadStats`. We can plot either the number of reads at each step, or the proportion of reads with respect to the total de-multiplexed reads per sample. Stacked or separate bar plots can be made for each category.

```{r plotstats, eval=TRUE, results = "hold", fig.width=7, fig.height=5}
# separated barchart for numbers
plotReadStats(cs, plotValue = "numbers", plotType = "dodge")
```

```{r plotstats2, eval=TRUE, results = "hide", fig.width=7, fig.height=5}
# stacked barchart for proportions
plotReadStats(cs, plotValue = "proportions", plotType = "stack")
```

For well-annotated genomes, one way to check the quality of TSS detection is to look at the fraction of detected TSSs that fall close to an annotated TSS. The cumulative fraction can be plotted for each sample and used to compare samples. This is done using the function `plotTSSprecision`, which takes the *known* TSS annotations (e.g. transcripts from a `TxDb` object) as a reference, along with either a `CapSet` object or BED files providing the detected TSS positions.

```{r plotPrecision, eval=TRUE, results = "hold", fig.width=7, fig.height=5}
library("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
seqlevelsStyle(TxDb.Dmelanogaster.UCSC.dm6.ensGene) <- "ENSEMBL"
# only analyse genes on chrX to make the analysis faster
seqlevels(TxDb.Dmelanogaster.UCSC.dm6.ensGene) <- "X"
transcripts <- transcripts(TxDb.Dmelanogaster.UCSC.dm6.ensGene)

# plot the precision using a pre-computed set of TSSs (.bed file)
tssfile <- system.file("extdata", "testTSS_merged.bed", package = "icetea")
plotTSSprecision(reference = transcripts, detectedTSS = tssfile,
                 sampleNames = "testsample")
```
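The TSSs detected above and stored in the `CapSet` object can also be used in place of the BED file. A minimal sketch, assuming `detectTSS` has already been run on `cs` and that the object is passed via the same `detectedTSS` argument:

```{r plotPrecisionCS, eval=FALSE}
# plot TSS detection precision directly from the CapSet object
plotTSSprecision(reference = transcripts, detectedTSS = cs)
```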
## Differential TSS expression analysis

For experiments with two or more groups, *icetea* can be used to perform differential TSS expression analysis between the groups. The requirements are the same as for differential expression analysis of RNA-seq data: at least two biological replicates per group are needed. The functions `fitDiffTSS` and `detectDiffTSS` use [edgeR](http://bioconductor.org/packages/release/bioc/html/edgeR.html) [@robinson2010edger] to perform the differential TSS expression analysis.

```{r diffTSS, eval=FALSE}
## fitDiffTSS returns a DGEGLM object
csfit <- fitDiffTSS(cs, groups = rep(c("wt", "mut"), each = 2),
                    normalization = "windowTMM", outplots = NULL,
                    plotref = "embryo1")
save(csfit, file = "diffTSS_fit.Rdata")

## This object is then used by the detectDiffTSS function
## to return differentially expressed TSSs
de_tss <- detectDiffTSS(csfit, testGroup = "mut", contGroup = "wt",
                        TSSfile = file.path(dir, "testTSS_merged.bed"),
                        MAplot_fdr = 0.05)

## export the output
library(rtracklayer)
export.bed(de_tss[de_tss$score < 0.05], con = "diffTSS_output.bed")
```

**Result:** `de_tss` is a `GRanges` object that contains all the tested TSSs, along with the metadata columns `score` (adjusted p-values), `logFC` (log-fold changes of group 'test' over group 'control') and `logCPM` (average expression of the TSS across all groups). This object can be exported to a BED file using the function `export.bed` from the package `rtracklayer`, as shown above.

### Using spike-in controls

For the differential TSS expression analysis, the function `fitDiffTSS` shown above can use one of the available internal normalization methods. In some cases, however, external (spike-in) normalization is preferred. *icetea* provides a way to perform spike-in normalization during differential expression analysis. Spike-in samples can be processed in the same way as the normal samples, using a `CapSet` object as described above. Normalization factors can then be obtained from the spike-in reads using `getNormFactors` and provided to the `fitDiffTSS` function for the detection of differential TSSs between samples.

```{r spikeNorm, eval=FALSE}
## get gene counts for spike-in RNA mapped to the human genome
## (cs_spike = CapSet object created from the spike-in samples)
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
normfacs <- getNormFactors(cs_spike,
                           features = genes(TxDb.Hsapiens.UCSC.hg38.knownGene))

csfit <- fitDiffTSS(cs, groups = rep(c("wt", "mut"), each = 2),
                    normalization = NULL, normFactors = normfacs,
                    outplots = NULL, plotRefSample = "embryo1", ncores = 1)
```

## Additional useful functions

### Getting the gene counts

Given the annotations, TSS counts can be summarized into per-gene counts. The function `getGeneCounts` sums up the TSS counts per gene, using transcripts grouped by gene from a `TxDb` object, and returns the gene counts.

```{r genecounts, eval=FALSE}
# get transcripts grouped by gene
dm6trans <- transcriptsBy(TxDb.Dmelanogaster.UCSC.dm6.ensGene, "gene")
# get gene counts, counting reads around 500 bp of the TSS
gcounts <- getGeneCounts(cs, dm6trans)
```

### Annotating the TSS distribution

Detected or differentially expressed TSSs, exported as BED files, can be quickly annotated using the function `annotateTSS`. `annotateTSS` reports the number/proportion of detected TSSs falling into different genomic features (provided via a `TxDb` object) as a data.frame, and optionally as a plot. In order to break ties between overlapping features, a vector of feature names can be provided in decreasing order of preference.

```{r annotateTSS, eval = TRUE, results = "hold", fig.width=7, fig.height=5}
# save the output as data.frame (outdf) + plot on screen
outdf <- annotateTSS(tssBED = tssfile,
                     txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene,
                     plotValue = "number", outFile = NULL)
```

# References