%\VignetteIndexEntry{07 Read Counting in RNA-seq -- Exercises} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= options(max.print=1000) stopifnot(BiocInstaller::biocVersion() == "2.13") BiocStyle::latex() library(knitr) opts_chunk$set(cache=TRUE, tidy=FALSE) @ \title{Practical: Read Counting in RNA-seq} \author{Herv\'e Pag\`es (\url{hpages@fhcrc.org})} \date{5 February 2014} \newcommand{\Hsap}{\emph{H.~sapiens}} \newcommand{\Dmel}{\emph{D.~melanogaster}} \usepackage{theorem} \newtheorem{Ext}{Exercise} \newenvironment{Exercise}{ \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}% }{\end{Ext}} \newenvironment{Solution}{% \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}% }{\bigskip} \setlength{\abovecaptionskip}{6pt} \setlength{\belowcaptionskip}{6pt} \begin{document} \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% In the context of a high-throughput sequencing experiment, \emph{counting the reads} means counting the number of reads per gene (or exon). It is usually the preliminary step to a \emph{differential analysis} at the gene level (or exon level). The result of this counting will typically be organized as a matrix where: \begin{itemize} \item each row represents a gene (or exon); \item each column represents a sequencing run (usually a given sample); \item and each value is the raw number of reads from the sequencing run that were \emph{assigned} to the gene (or exon). \end{itemize} Different criteria can be used for assigning reads to genes. A common one is to assign a read to a gene if the aligned read overlaps with that gene and with that gene only. \medskip In this practical we learn how aligned reads stored in BAM files can be counted with the \Rcode{summarizeOverlaps()} function from the \Rpackage{GenomicRanges} package. With this function, the criteria used for assigning reads to genes is controlled via 2 arguments: the \Rcode{mode} and \Rcode{inter.feature} arguments. In addition to the man page for \Rcode{summarizeOverlaps()}, the ``Counting reads with summarizeOverlaps'' vignette (located in the \Rpackage{GenomicRanges} package) is recommended reading if you're planning to use this function for your work. The output of \Rcode{summarizeOverlaps()} will be a \Rclass{SummarizedExperiment} object containing the matrix of counts together with information about the genes (or exons) in the \Rcode{rowData} component and about the samples (e.g. patient ID, treatment, etc...) in the \Rcode{colData} component. This object will be suitable input to the \Rpackage{DESeq2} package for performing a \emph{differential analysis}. \medskip IMPORTANT NOTE: Starting with the upcoming version of \Bioconductor{} (BioC 2.14, scheduled for April 2014), the \Rcode{summarizeOverlaps()} function and its vignette will be located in the new \Rpackage{GenomicAlignments} package. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{First look at some precomputed read counts} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Before we do our own read counting, we start by having a quick look at some precomputed counts so we get an idea of what a \Rclass{SummarizedExperiment} object looks like. The \Rpackage{parathyroidSE} package contains RNA-seq data from the publication of Haglund et al. \cite{Haglund2012Evidence}. The \emph{paired-end} sequencing was performed on primary cultures from parathyroid tumors of 4 patients at 2 time points over 3 conditions (control, treatment with diarylpropionitrile (DPN) and treatment with 4-hydroxytamoxifen (OHT)). DPN is a selective estrogen receptor $\beta$ 1 agonist and OHT is a selective estrogen receptor modulator. One sample (patient 4, 24 hours, control) was omitted by the paper authors due to low quality. The \Rpackage{parathyroidSE} package contains several data sets. One of them is the \Rcode{parathyroidGenesSE} data set which contains the counts of reads per gene. \begin{Exercise} In this exercise, we load the \Rcode{parathyroidGenesSE} data set from the \Rpackage{parathyroidSE} package and perform some basic manipulations on it. \begin{enumerate} \item Load the \Rcode{parathyroidGenesSE} data set from the \Rpackage{parathyroidSE} package. What's the class of this object? What are its dimensions? \item The information in a \Rclass{SummarizedExperiment} object can be accessed with accessor functions. For example, to get the actual data (i.e., here, the read counts), we use the \Rcode{assay()} function. What's returned by \Rcode{assay()}? What are its dimensions. Display the top left corner of it (e.g. first 8 rows and columns). Does it have row names? Column names? What are the row names? \item In this matrix of read counts, each row represents an Ensembl gene, each column a sequencing run, and the values are the raw numbers of reads in each sequencing run that were assigned to the respective gene. How many reads were assigned to a gene in each sequencing run? How many genes have non-zero counts? \item Use \Rcode{rowData()} on \Rcode{parathyroidGenesSE}. What do you get? What's its length? \item Use \Rcode{colData()} on \Rcode{parathyroidGenesSE}. What do you get? How many rows does it have? Use \Rcode{table()} to summarize the number of runs for each treatment (Control, DPN, and OHT). \end{enumerate} \end{Exercise} \begin{Solution} \begin{enumerate} \item First we load the \Rpackage{parathyroidSE} package. <>= library(parathyroidSE) @ Before we load the \Rcode{parathyroidGenesSE} data set, we can check what data sets are contained in the \Rpackage{parathyroidSE} package with: <>= data(package="parathyroidSE") @ Load the \Rcode{parathyroidGenesSE} data set: <<>>= data(parathyroidGenesSE) parathyroidGenesSE class(parathyroidGenesSE) dim(parathyroidGenesSE) @ \medskip \item <<>>= class(assay(parathyroidGenesSE)) dim(assay(parathyroidGenesSE)) assay(parathyroidGenesSE)[1:8, 1:8] @ The row names are Ensembl gene ids. No column names: <<>>= colnames(parathyroidGenesSE) @ \medskip \item To compute the number of reads that were assigned to a gene in each sequencing run, we just need to sum all the counts that are in a column and do this for each column: <<>>= colSums(assay(parathyroidGenesSE)) @ Genes with non-zero counts: <<>>= sum(rowSums(assay(parathyroidGenesSE)) != 0) @ \medskip \item <<>>= rowData(parathyroidGenesSE) @ We get a \Rclass{GRangesList} object with one list element per gene. Each list element is a \Rclass{GRanges} object containing the exon ranges for the gene. \medskip \item <<>>= colData(parathyroidGenesSE) @ We get a \Rclass{DataFrame} object with one row per sequencing run. <<>>= table(colData(parathyroidGenesSE)$treatment) @ \end{enumerate} \end{Solution} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Aligned reads and BAM files} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To operate, the \Rcode{summarizeOverlaps()} function needs 2 data objects: \begin{enumerate} \item one representing the genomic ranges of the genes (or exons); \item one representing the aligned reads. \end{enumerate} The aligned reads are typically stored in one BAM file per sequencing run. In the next exercise we will have a quick look at the BAM files included in the \Rpackage{parathyroidSE} package. The reads in these files are \emph{paired-end reads} that were aligned using the TopHat aligner. To keep the package to a reasonable size, only a subset of all the aligned reads from the experiment have been placed in these files. More information on how these BAM files were obtained can be found in the vignette located in the \Rpackage{parathyroidSE} package. To get the paths to these files, do: <>= bamdir <- system.file("extdata", package="parathyroidSE") bampaths <- list.files(bamdir, pattern="bam$", full.names=TRUE) bampaths @ To load \emph{single-end reads} from a BAM file, we can use the \Rcode{readGAlignmentsFromBam()} function from the \Rpackage{Rsamtools} package. If the reads are \emph{paired-end} and we want to preserve the pairing, the \Rcode{readGAlignmentPairsFromBam()} function can be used. However, for downstream analyses where the pairing doesn't need to be preserved (e.g. if we're only going to compute the coverage of the reads), the reads can be loaded with \Rcode{readGAlignmentsFromBam()}, which is faster and returns an object that is simpler and easier to manipulate. One last thing before we start the exercise. By default \Rcode{readGAlignmentsFromBam()} and \Rcode{readGAlignment\-Pairs\-From\-Bam()} load PCR or optical duplicates as well as secondary alignments. These alignments are generally discarded from the read counting step. We can discard them up-front by filtering them out when we load the alignments from the BAM files. This is done by creating and passing a \Rclass{ScanBamParam} object to the \Rcode{param} argument of \Rcode{readGAlignmentsFromBam()} or \Rcode{readGAlignmentPairsFromBam()}. \begin{Exercise} In this exercise, we learn how to load paired-end reads and filter out the alignments that are not suitable for read counting. \begin{enumerate} \item Load the SRR479052.bam file included in the \Rpackage{parathyroidSE} package, first with \Rcode{readGAlignmentsFromBam()}, then with \Rcode{readGAlignmentPairsFromBam()}. What are the classes of the returned objects? How many pairs are there in the 2nd object? \item The first and last mate for each pair can be extracted from the \Rclass{GAlignmentPairs} object with the \Rcode{first()} and \Rcode{last()} accessor functions. Extract the first mates. Extract the last mates. \item See the man page for the \Rcode{ScanBamParam()} constructor in the \Rpackage{Rsamtools} package. Construct a \Rclass{ScanBamParam} object (that you will pass to \Rcode{readGAlignmentPairsFromBam()}) that will filter out PCR or optical duplicates as well as secondary alignments. Use it to load the pairs again. \end{enumerate} \end{Exercise} \begin{Solution} \begin{enumerate} \item Loading the BAM file first with \Rcode{readGAlignmentsFromBam()}: <<>>= library(Rsamtools) gal0 <- readGAlignmentsFromBam(bampaths[1]) gal0 @ then with \Rcode{readGAlignmentPairsFromBam()}: <<>>= galp0 <- readGAlignmentPairsFromBam(bampaths[1]) galp0 @ \Rcode{gal} is a \Rclass{GAlignments} object. \Rcode{galp} is a \Rclass{GAlignmentPairs} object. A \Rclass{GAlignmentPairs} object is also vector-like object where each element represents an aligned \emph{paired-end read}. So the number of pairs in it is just: <<>>= length(galp0) @ \medskip \item <<>>= first(galp0) last(galp0) @ \medskip \item <<>>= param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE, isNotPrimaryRead=FALSE)) readGAlignmentPairsFromBam(bampaths[1], param=param) @ \end{enumerate} \end{Solution} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Choosing and loading a gene model} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To operate, \Rcode{summarizeOverlaps()} needs access to the genomic ranges of the genes (or exons). This information can be extracted from what we call a \emph{gene model}. Gene models for various organisms are provided by many annotation providers on the internet (UCSC, Ensembl, NCBI, TAIR, FlyBase, WormBase, etc...) In \Bioconductor{} a gene model is typically represented as a \Rclass{TranscriptDb} object. The \Rpackage{GenomicFeatures} package contains tools for obtaining a gene model from these providers and store it in a \Rclass{TranscriptDb} object (the container for gene models). For convenience, the most commonly used gene models are available as \Bioconductor{} data packages (called TxDb packages). Each TxDb package contains a \Rclass{TranscriptDb} object ready to use. According to the vignette located in the \Rpackage{parathyroidSE} package, the reads in the BAM files were aligned to the GRCh37 human reference genome. If we wanted to use the gene model for Human provided by Ensembl, we could do: <>= ## Requires INTERNET ACCESS and takes about 6 min. Please don't try to run this! library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl") @ This would return a \Rclass{TranscriptDb} object containing the Ensembl gene model for Human. IMPORTANT NOTE: One must be careful to choose a gene model based on the same reference genome that was used to align the reads. The annotations provided by Ensembl are updated at each new Ensembl release, which typically happens 2 or 3 times per year (current release is Ensembl 73). The \Rcode{"hsapiens\_gene\_ensembl"} dataset is usually based on the most recent version of the Human reference genome (currently GRCh37). So before we proceed with this \Rclass{TranscriptDb} object, we would need to make sure that it's compatible with our BAM files, that is, we would need to check that the \Rcode{"hsapiens\_gene\_ensembl"} dataset was based on GRCh37 human at the time the \Rclass{TranscriptDb} object was made. Because our goal is to use the counts to perform a \emph{differential analysis} at the gene level, we will need to feed \Rcode{summarizeOverlaps()} with a \Rclass{GRangesList} object containing the exon ranges grouped by gene. This can be extracted from the \Rclass{TranscriptDb} object with the \Rcode{exonsBy()} function: <>= ex_by_gene <- exonsBy(txdb, by="gene") # GRangesList object @ For the purpose of this practical, we'll use a subset of the Ensembl genes. This subset is stored in the \Rpackage{parathyroidSE} package and is based on the GRCh37 human reference genome. \begin{Exercise} In this exercise, we have a quick look at the \Rcode{exonsByGene} data set included in the \Rpackage{parathyroidSE} package. \begin{enumerate} \item Load the \Rcode{exonsByGene} data set from the \Rpackage{parathyroidSE} package. What is it? \item How many genes are represented in this object? \end{enumerate} \end{Exercise} \begin{Solution} \begin{enumerate} \item <<>>= data(exonsByGene) exonsByGene @ \medskip \item Number of genes in this object: <<>>= length(exonsByGene) @ \end{enumerate} \end{Solution} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Count the reads} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To count the reads, we use the \Rcode{summarizeOverlaps()} function defined and documented in the \Biocpkg{GenomicRanges} package. The aligned reads must be passed to the \Rcode{reads} argument of the function (the 2nd argument). They can be represented in different ways, including as a \Rclass{BamFile}, a \Rclass{GAlignments}, a \Rclass{GAlignmentPairs}, or a \Rclass{BamFileList} object. The first 3 types of objects only allow passing the reads from a single sequencing run at a time. Using a \Rclass{BamFileList} object allows us to pass the reads from all the sequencing runs at once. To create such an object, we use the \Rcode{BamFileList()} constructor function from the \Rpackage{Rsamtools} package: <>= library(Rsamtools) bamfile_list <- BamFileList(bampaths, index=character()) @ Note that we need to use \Rcode{index=character()} here because there are no BAM index files (\Rcode{.bam.bai} extension) associated with our BAM files. \begin{Exercise} Let's do the read counting. \begin{enumerate} \item Use \Rcode{summarizeOverlaps()} on \Rcode{exonsByGene} and \Rcode{bamfile\_list} to count the reads. Check the man page for the details. Note that because the RNA-seq protocol was not strand specific, you need to specify \Rcode{ignore.strand=TRUE}. Also because the reads are \emph{paired-end}, you need to specify \Rcode{singleEnd=FALSE}. This will tell \Rcode{summarizeOverlaps()} to use \Rcode{readGAlignment\-Pairs\-From\-Bam()} instead of \Rcode{readGAlignmentsFromBam()} internally to read the BAM files. \item When \Rcode{summarizeOverlaps()} calls the reading function internally on each BAM file, it does so without specifying any particular \Rcode{param} value, so, by default, PCR or optical duplicates and secondary alignments are loaded. However, if a \Rcode{param} argument is passed to \Rcode{summarizeOverlaps()}, it will be passed along to the reading function. Count the reads again but discard PCR or optical duplicates as well as secondary alignments. \end{enumerate} \end{Exercise} \begin{Solution} \begin{enumerate} \item <<>>= read_count0 <- summarizeOverlaps(exonsByGene, bamfile_list, ignore.strand=TRUE, singleEnd=FALSE) read_count0 @ \medskip \item To discard PCR or optical duplicates as well as secondary alignments, we re-use the \Rclass{ScanBamParam} object we prepared earlier: <<>>= read_count <- summarizeOverlaps(exonsByGene, bamfile_list, ignore.strand=TRUE, singleEnd=FALSE, param=param) read_count @ Let's do a quick comparison between the 2 counts: <<>>= colSums(assay(read_count0)) colSums(assay(read_count)) @ No difference here in the final count. This means the reads we discarded didn't get assigned to any gene the first time we counted (but this wouldn't necessarily be the case with a bigger data set). \end{enumerate} \end{Solution} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Conclusion} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Now that we have our read counts, we're ready to perform a \emph{differential analysis} with the \Rpackage{DESeq2} package. \medskip \medskip \medskip \centerline{THANKS!} \appendix \bibliography{SummerX} \end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{How to get rid of reads with low mapping quality?} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% In a BAM file, the mapping quality of the alignments is stored in the MAPQ field (MAPping Quality). For more information about this field (and about the SAM format in general) please refer to the SAM Spec document hosted on the SAMtools website (\url{http://samtools.sourceforge.net/}). By default \Rcode{readGAlignmentsFromBam()} and \Rcode{readGAlignmentPairsFromBam()} don't load the MAPQ field from the file. In fact, they completely ignore this field and the alignments are loaded independently of their mapping quality. Using a \Rclass{ScanBamParam} object when calling these functions allows to choose what BAM fields to load. \begin{Exercise} The goal of this exercise is to get rid of reads with low mapping quality before the counting. \begin{enumerate} \item Construct a \Rclass{ScanBamParam} object that will filter out PCR or optical duplicates as well as secondary alignments, and that will load the MAPQ field. \item Use this \Rclass{ScanBamParam} object to load the pairs in SRR479052.bam \item Extract the first mates and last mates separately. Check their metadata columns. \item Subset the \Rclass{GAlignmentPairs} object to keep only the pairs in which the 2 mates have a mapping quality greater than some threshold. \item In addition to a \Rclass{BamFileList}, the \Rcode{reads} argument of \Rcode{summarizeOverlaps()} can be a \Rclass{BamFile}, a \Rclass{GAlignments}, or a \Rclass{GAlignmentPairs} object. Call \Rcode{summarizeOverlaps()} on \Rcode{exonsByGene} and the above \Rclass{GAlignmentPairs} object, and with \Rcode{ignore.strand=TRUE}. \item To obtain the full count like in the previous section, we would need to repeat this for all the BAM files. This would give us one \Rclass{SummarizedExperiment} object per file that we would then need to combine with \Rcode{cbind()}. \end{enumerate} \end{Exercise}