%\VignetteIndexEntry{Working with the ShortRead Package} %\VignetteDepends{Genominator, ShortRead, yeastRNASeq} %\VignettePackage{Genominator} \documentclass[letterpaper,pdf,english]{article} <>= options(width=80) @ \SweaveOpts{prefix.string=plots_withShortRead,eps=FALSE,echo=TRUE} \usepackage{times} \usepackage{hyperref} \usepackage{color} \usepackage{babel} \usepackage{graphicx} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfuncarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \usepackage[margin=1in]{geometry} \usepackage{fancyhdr} \pagestyle{fancy} \rhead{} \renewcommand{\footrulewidth}{\headrulewidth} \title{Working with the ShortRead Package} \author{James Bullard \and Kasper D.\ Hansen} \date{Modified: April 18, 2010. Compiled: \today} \begin{document} \maketitle \thispagestyle{fancy} In this document we show how to use the \Rpackage{Genominator} package with the \Rpackage{ShortRead} package and conduct a simple differential expression analysis. \section{Importing data} <>= require(Genominator) require(ShortRead) require(yeastRNASeq) @ The data we will use are described in the vignette of the \Rpackage{yeastRNASeq} and was originally published in \cite{Lee}. But to summarize, we have data from two different yeasts, a wild-type strain (``\texttt{wt}'') and a mutant strain (``\texttt{mut}''). Part of the RNA degradation pathway is knocked out in the mutant. For each strain we have two lanes worth of data -- it is the exact same sample and library preparation sequenced in both lanes. Only 500,000 raw reads from each lane is part of the \Rpackage{yeastRNASeq} package and they have been aligned with Bowtie, keeping only unique hits with up to two mismatches. Since not all reads align, we will be working with a up to 500,000 reads per lane. The data is available as a list of \Rclass{AlignedRead} class objects: <>= data(yeastAligned) yeastAligned[[1]] sapply(yeastAligned, length) @ In the package we also have the Bowtie output files, which we will use for illustration: <>= list.files(file.path(system.file(package = "yeastRNASeq"), "reads"), pattern = "bowtie") @ The Bowtie output files are compressed using \software{gzip}, \Rpackage{ShortRead} can handle this. Before importing a set of aligned reads using \Rpackage{Genominator} one usually has to make some decisions regarding chromosome names as well as names of the columns in the resulting database. Note the chromosome names are a bit special: \texttt{Scchr01}. In our experience it is quite common for different annotation sources to use different shorthands for the chromosomes (for yeast we have at least seen \texttt{chr1}, \texttt{chr01}, \texttt{Scchr01}, \texttt{chrI}). This is a bit of a pain and is very hard to automate. In \Rpackage{Genominator} we expect the user to explicitly convert the chromosome names to a common representation and we furthermore require chromosome names to be integers. For the sake of importing alignment files this is accomplished by using the \Rfuncarg{chrMap} argument that is a simple character vector which the chromosome names are matched up against. The first element in the vector gets assigned the integer 1 and so on. If there are chromosomes in the alignment file that does not appear in the \Rfuncarg{chrMap} vector, the corresponding reads are not imported. This is often useful, but can also lead to loss of data. We construct the \Rfuncarg{chrMap} object by <>= chrMap <- paste("Scchr", sprintf("%02d", 1:16), sep = "") unique(chromosome((yeastAligned[[1]]))) @ We see that by using this version of \Rfuncarg{chrMap} we will drop reads aligning to the mitochondrial chromosome. The other decision we need to make is how to map filenames to names of the columns of the resulting database. Note that whatever choice we make is more or less permanent. Usually we prefer short, descriptive names. It makes a lot of sense to construct these names programmatic from the filenames: <>= files <- list.files(file.path(system.file(package = "yeastRNASeq"), "reads"), pattern = "bowtie", full.names = TRUE) names(files) <- sub("_f\\.bowtie\\.gz", "", basename(files)) names(files) @ The way this is all specific is through a named vector of filenames, the names of the vector corresponds to column names in the resulting database. If two (or more) file entries have the same name, they will be joined into one column. We now import the alignment files using \Rfunction{importFromAlignedReads}, which uses the \Rfunction{readAligned} function from \Rpackage{ShortRead} to parse the files. <>= eData <- importFromAlignedReads(files, chrMap = chrMap, dbFilename = "my.db", tablename = "raw", type = "Bowtie") eData head(eData) @ This will create a database \texttt{my.db} in the current directory. The reads are automatically associated to the genomic location corresponding to the 5' end of the read -- this is slightly different from some other programs that uses the lefternmost location of the read (the difference is in reads mapping to the reverse strand). \section{Annotation} There are at least two easy ways to retrieve annotation using \software{Bioconductor}: the \Rpackage{biomaRt} package which accesses Ensembl and the \Rpackage{rtracklayer} package which accesses the UCSC genome browser. There are oftentimes species-specific databases (for yeast we have at least SGD) which is a third source of annotation. \subsection*{Using biomaRt} We will use \Rpackage{biomaRt} to retrieve information from the Ensembl database. We will illustrate a few pitfalls by comparing two different, but similar looking queries. More information on using \Rpackage{biomaRt} can be found it its excellent vignette. The following code was run in January 2010. <>= require(biomaRt) mart <- useMart("ensembl", "scerevisiae_gene_ensembl") attributes.gene <- c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "strand", "gene_biotype") attributes.tr <- c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_exon_id", "chromosome_name", "start_position", "end_position", "strand", "gene_biotype", "exon_chrom_start", "exon_chrom_end", "rank") ensembl.gene <- getBM(attributes = attributes.gene, mart = mart) ensembl.transcript <- getBM(attributes = attributes.tr, mart = mart) @ The output is saved in the \Robject{yeastAnno.sources} object which is a list containing various annotation objects from yeast. <>= data(yeastAnno.sources) ensembl.gene <- yeastAnno.sources$ensembl.gene ensembl.transcript <- yeastAnno.sources$ensembl.transcript head(ensembl.gene, n = 2) head(ensembl.transcript, n = 2) dim(ensembl.gene) dim(ensembl.transcript) subset(ensembl.gene, ensembl_gene_id == "YPR098C") subset(ensembl.transcript, ensembl_gene_id == "YPR098C") length(unique(ensembl.transcript$ensembl_transcript_id)) @ Note that altering the query a bit leads to a quite different number of rows being retrieved. In this case it is because each row in one query correspond to a gene, whereas each row in the other query correspon to an exon of a transcript. A key observation here is that the columns \texttt{start\_position} and \texttt{end\_position} contains the start and end position of the gene which is different from the start and end position of the exon when we look at a gene consiting of two exons. Furthermore, it is not even clear from the output of the first query that genes with multiple exons exists. In this example, there is no difference between the transcript id and the gene id, because (at least according to this annotation), there are no genes in yeast that produces multiple transcripts. In this case, we would argue that the right object to use is \Robject{yeastAnno.transcripts}. Below we will post-process this object for use with \Rpackage{Genominator}. \subsection*{Using rtracklayer} Here we will use the \Rpackage{rtracklayer} package to access the UCSC genome browser. UCSC sometimes have different tables for a specific genome. We will take a closer look at the SGD table (based on the name we presume that it is supposed to package information from SGD (= Saccharomyces Genome Database)) and the ENS table (which we assume is an attempt to package information from Ensembl) The following code was run in January 2010 <>= require(rtracklayer) session <- browserSession() genome(session) <- "sacCer2" ucsc.sgdGene <- getTable(ucscTableQuery(session, "sgdGene")) ucsc.ensGene <- getTable(ucscTableQuery(session, "ensGene")) @ <>= yeastAnno.sources <- list(ensembl.gene = ensembl.gene, ensembl.transcript = ensembl.transcript, ucsc.sgdGene = ucsc.sgdGene, ucsc.ensGene = ucsc.ensGene) save(yeastAnno.sources, file = "yeastAnno.sources.rda") @ We will also examine the output from these two tables <>= data(yeastAnno.sources) ucsc.sgdGene <- yeastAnno.sources$ucsc.sgdGene ucsc.ensGene <- yeastAnno.sources$ucsc.ensGene head(ucsc.sgdGene, n = 2) head(ucsc.ensGene, n = 2) dim(ucsc.sgdGene) dim(ucsc.ensGene) subset(ucsc.sgdGene, name == "YPR098C") subset(ucsc.ensGene, name == "YPR098C") subset(ucsc.sgdGene, name == "YER102W") subset(ucsc.ensGene, name == "YER102W") @ From this we see that the two tables have quite a different number of genes in them, that the two tables look very similar on the two exon gene considered in the previous section (although one table has the additional information of a protein ID), but that the two tables differ on at least one gene (actually, there are 757 entries that have same value in the \texttt{name} column but different values in the \texttt{exonStarts} column). \subsection*{Some comments on annotation} As we see here, there are different sources of annotation that differ, even for a relatively simple and well-studied species as \emph{S.\ cerevisiae}. We cannot give any recommendation as to what annotation source to use, that depends on the biological question and possibly other factors. However, some effort ought to be spend on making sure that the used annotation matches up with the genome used. Even for yeast there are several different genomes. And while they differ in only a few substitutions and insertion/deletions, the insertion/deletions can easily lead to ``off-by-a-little'' errors. Finally we note that for this specific example, none of the queries above display the SGD classification of genes into ``verified'', ``dubious'', and ``uncharacterized'', a classification that is often important when analyzing ones results. This information is obtainable directly from SGD and perhaps from a better use of the annotation tools above. \subsection*{Post-processing the annotation} In the following we will work with the \Robject{ensembl.transcript} object, which we now post-process for use with \Rpackage{Genominator}. In \Rpackage{Genominator} an annotation object is a \Rclass{data.frame} with columns \Rcode{chr, start, end, strand} where \Rcode{chr} is an integer (and should match up with whatever was used in the import step earlier) and \Rcode{strand} has values in $\{-1,1,0\}$ with $0$ indicating that there is no strand information (and hence $0$ matches both $1$ and $-1$). <>= yAnno <- yeastAnno.sources$ensembl.transcript yAnno$chr <- match(yAnno$chr, c(as.character(as.roman(1:16)), "MT", "2-micron")) yAnno$start <- yAnno$start_position yAnno$end <- yAnno$end_position rownames(yAnno) <- yAnno$ensembl_exon_id yAnno.simple <- yAnno[yAnno$chr %in% 1:16, c("chr", "start", "end", "strand")] head(yAnno.simple, n = 2) head(yAnno, n = 2) @ (note that we remove all annotation on the mitochondria and the plasmid). A useful function is \Rfunction{validAnnotation} that checks whether the produced \Robject{data.frame} satisfy the annotation assumptions <>= validAnnotation(yAnno) @ \section{Gene level counts} It is easy to obtain region level counts for a given annotation object: <>= geneCounts.1 <- summarizeByAnnotation(eData, yAnno, ignoreStrand = TRUE) head(geneCounts.1) @ This produces a matrix containing the read counts per gene. Such a matrix is ready for analysis by various packages as well as the functionality in \Rpackage{Genominator}. We use \Rfuncarg{ignoreStrand = TRUE} because the experimental assay does not keep strand of origin, so we count reads on either strand, inside each atomic region. Note that the resulting matrix has one row for each row in the annotation object. Inthis annotation object, rows correspond to exons. We can get gene level counts either by using a \Rfunction{tapply} or directly from \Rpackage{Genominator} by <>= geneCounts.2 <- summarizeByAnnotation(eData, yAnno, ignoreStrand = TRUE, groupBy = "ensembl_gene_id") head(geneCounts.2) @ Note that -- because of the way reads are represented in the database when we use \Rfunction{importFromAlignedReads} -- that a read is counting as part of a gene if the first sequenced base maps within the region defined by the gene. This may create some concern at gene boundaries. There are further complications. In yeast, it is very common for two genes to overlap each other on opposite strands. In other organisms, a gene may have several transcripts. For this purpose, \Rpackage{Genominator} supports the computation of various gene models. Union-intersection (UI) genes were introduced by \cite{}. The UI representation of a gene is the set of bases that are annotated as being part of every transcript of the gene and that are not part of any transcript of any other gene. We may transform our annotation object into a UI representation by <>= yAnno.UI <- makeGeneRepresentation(yAnno, type = "UIgene", gene.id = "ensembl_gene_id", transcript.id = "ensembl_transcript_id") head(yAnno.UI) @ In this step we loose all of the additional columns of the \Robject{yAnno} object. \section{Statistical Analysis} We can see how ``good'' the replicates are by assessing whether it fits the Poisson model of constant gene expression across lanes with variable sequencing effort. <>= groups <- gsub("_[0-9]_f", "", colnames(geneCounts)) groups plot(regionGoodnessOfFit(geneCounts, groups), chisq = TRUE) @ \section{Working with Priming Weights} In a recent publication \cite{Hansen} we describe how the use of random priming for Illumina RNA-Seq impacts the nucleotide content of the reads and we describe a method for alleviating this bias. The method associates a weight with each read and instead of counting the number of reads in a given genomic interval, the weights in the interval are summed. Because of this, the use of weights happen at the data import step. Since our example data was generated using random priming, we illustrate the methodology. The first step is to compute the weights using the function \Rfunction{computePrimingWeights} and an \Rclass{AlignedRead} object. Next, the weights are associated with each read using \Rfunction{addPrimingWeights}. Once reads have an associated weight, the \Rfunction{importFromAlignedReads} function uses these. Because of the need to compute the weights, for now, it is not possible to have \Rfunction{importFromAlignedReads} work directly on filenames. We start with the \Robject{yeastAligned} object which was simply a list of \Rclass{AlignedRead}, generated using an \Rfunction{lapply} on a vector of filenames. In this case, we have around 410.000-430.000 reads per sample, with each read having a length of only 26 bases. In order to compute the priming weights we need to assess the $k$-mer distribution at the end of the reads. In \cite{Hansen}, the end is based on reads having at least 35 bases, so we need to modify this. We also makes the weights a bit shorter (as described in Hansen et al., this does not change the effect much) <>= weightsList <- lapply(yeastAligned, computePrimingWeights, unbiasedIndex = 20:21, weightsLength = 6L) sapply(weightsList, summary) @ We will continue with a separate set of weights for each lane. <>= yeastAligned2 <- mapply(addPrimingWeights, yeastAligned, weightsList) alignData(yeastAligned2[[1]]) head(alignData(yeastAligned2[[1]])$weights) @ Now the weights have been added the \Rclass{AlignedRead} objects. Once this has happened, \Rfunction{importFromAlignedReads} will use them. <>= eData2 <- importFromAlignedReads(yeastAligned2, chrMap = chrMap, dbFilename = "my.db", tablename = "weights") @ We can now easily get the re-weighted gene level counts as usual. <>= reweightedCounts <- summarizeByAnnotation(eData2, yAnno, ignoreStrand = TRUE, groupBy = "ensembl_gene_id") head(reweightedCounts) @ \section*{SessionInfo} <>= toLatex(sessionInfo()) @ \begin{thebibliography}{30} \bibitem{Lee} Lee,A., Hansen,K.D., Bullard,J., Dudoit,S. and Sherlock,G. (2008) Novel Low Abundance and Transient RNAs in Yeast Revealed by Tiling Microarrays and Ultra High-Throughput Sequencing Are Not Conserved Across Closely Related Yeast Species. \textit{PLoS Genet}, \textbf{4}e1000299. \bibitem{Bullard} Bullard,J.H., Purdom,E.A., Hansen,K.D., and Dudoit,S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. \textit{BMC Bioinformatics}, \textbf{11}, 94. \bibitem{Hansen} Hansen,K.D., Brenner,S.E. and Dudoit,S. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. \textit{Nucleic Acids Res}, doi: 10.1093/nar/gkq224 \end{thebibliography} \end{document}