%\VignetteIndexEntry{The Genominator User Guide}
%\VignetteDepends{Genominator}
%\VignettePackage{Genominator}
\documentclass[letterpaper,pdf,english]{article}

<<>>=
options(width = 80)
@

\SweaveOpts{prefix.string=plots_Genominator,eps=FALSE,echo=TRUE}

\usepackage{times}
\usepackage{hyperref}
\usepackage{color}
\usepackage{babel}
\usepackage{graphicx}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfuncarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}

\usepackage[margin=1in]{geometry}
\usepackage{fancyhdr}
\pagestyle{fancy}
\rhead{}
\renewcommand{\footrulewidth}{\headrulewidth}

\title{The Genominator User Guide}
\author{James Bullard \and Kasper D.\ Hansen}
\date{Modified: April 18, 2010, Compiled: \today}

\begin{document}

\maketitle
\thispagestyle{fancy}

<<>>=
## -- For Debugging / Timing
## require(RSQLite)
## source("../../R/Genominator.R")
## source("../../R/importAndManage.R")
## source("../../R/plotRegion.R")
## source("../../R/coverage.R")
## source("../../R/goodnessOfFit.R")
@

\section{Introduction}

\subsection*{Overview}

The \Rpackage{Genominator} package provides an interface for storing and retrieving genomic data, together with some additional functionality aimed at high-throughput sequence data. The intent is that retrieval and summarization will be fast enough to enable online experimentation with the data. We have used the package to analyze tiling arrays and (perhaps more appropriately) RNA-Seq data consisting of more than 400 million reads.

The canonical use case at the core of the package is summarizing the data over a large number of genomic regions. The standard example is: for each annotated exon in the human genome, count the number of reads that land in that exon, for all experimental samples.

Data is stored in an SQLite database, and as such the package makes it possible to work with very large datasets in limited memory. However, working with SQLite databases is limited by I/O (disk speed), and substantial performance gains are possible by using a fast disk.

Work using this package should cite \cite{Bullard}.

\subsection*{Limitations}

While data may be stored in different ways using the experimental data model described below, typically (for sequencing data) we associate each read with a single location in the genome. This means that the package does not currently support paired-end reads, nor reads that span exon-exon junctions. Work is being done to include these important cases.

As mentioned above, the package uses an SQLite backend and, unlike many other \R{} packages, the speed of its functionality is primarily limited by disk speed. There is no gain in having multiple cores available, and there is a substantial slowdown when accessing the data over a networked drive.

\section{Data model}

The core functionality of the package is the analysis of \emph{experimental} data using genomic \emph{annotation}, such as counting the number of reads falling in annotated exons. In the following we describe what we mean by experimental and annotation data.

\subsection*{Experimental data}

The package utilizes \Rpackage{RSQLite} to store the data corresponding to an experiment.
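Because the data are kept in an ordinary SQLite file, the table backing an experiment can be inspected directly with standard \Rpackage{RSQLite} calls, outside of \Rpackage{Genominator}. The following is a minimal sketch (not run), assuming the database file \texttt{my.db} and the table \texttt{ex\_tbl} created later in this vignette.

<<eval=FALSE>>=
## A sketch, not run: inspect the SQLite file that backs an experiment.
## Assumes the file "my.db" and table "ex_tbl" created below.
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "my.db")
dbListTables(con)                               # tables in the database
dbListFields(con, "ex_tbl")                     # chr, strand, location, ...
dbGetQuery(con, "SELECT * FROM ex_tbl LIMIT 3") # first few rows
dbDisconnect(con)
@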
The SQL data model is:
\begin{verbatim}
chr INTEGER, strand INTEGER (-1,0,1), location INTEGER, [name NUMERIC]*
\end{verbatim}
Specifically, this means that each data unit has an associated (non-empty) genomic location triplet, namely \texttt{chr} (chromosome, coded as an integer), \texttt{strand} (one of $-1$, 0, or 1, representing reverse strand, no strand information, and forward strand, respectively) as well as a \texttt{location}. We also allow an unlimited number of additional numeric variables. (The requirement that the additional variables be numeric is purely an optimization.) There is no requirement that the location triplet (chr, strand, location) occur only once in the data. Examples are
\begin{verbatim}
chr INTEGER, strand INTEGER (-1,0,1), location INTEGER, chip_1 REAL, chip_2 REAL
chr INTEGER, strand INTEGER (-1,0,1), location INTEGER, counts INTEGER
\end{verbatim}
The first data model could describe data from two tiling arrays, and in that case there would typically be only a single occurrence of each location triplet (chr, strand, location). The second data model could describe
\begin{itemize}
\item data from a single sequencing experiment, each row corresponding to a read, such as
\begin{verbatim}
chr, strand, location, counts
1, 1, 1, 1
1, 1, 1, 1
1, 1, 2, 1
\end{verbatim}
(here two reads map to the same location)
\item data from a single sequencing experiment, but where the reads have been \emph{aggregated} such that each row corresponds to a genomic location and the number of reads mapped to that location:
\begin{verbatim}
chr, strand, location, counts
1, 1, 1, 2
1, 1, 2, 1
\end{verbatim}
\end{itemize}

\subsection*{Annotation data}

Unlike experimental data, which is stored in an SQLite database, we represent annotation as an \R{} \Robject{data.frame} with the following columns:
\begin{verbatim}
chr integer, strand integer (-1L,0L,1L), start integer, end integer, [name class]*
\end{verbatim}
As in the experimental data representation, strand information is encoded as $-1$ for the reverse/minus strand, 0 for no strand information available or relevant, and 1 for the forward/plus strand. Each row is typically called a region (or region of interest, abbreviated ROI). Since each region is a single contiguous interval, a transcript with multiple exons needs to be represented by multiple regions, and one would typically have a column with a transcript id, linking the different regions together. This data model is very similar to the general feature format (GFF). A common example is
\begin{verbatim}
chr integer, strand integer (-1L,0L,1L), start integer, end integer, feature factor
\end{verbatim}
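As a concrete illustration, the following sketch (not run) constructs a tiny two-exon annotation object in this format; the coordinates and row names are made up. \Rfunction{validAnnotation} (described in the Annotation section below) checks that a \Robject{data.frame} conforms to the model.

<<eval=FALSE>>=
## A made-up two-exon annotation object following the data model above.
anno <- data.frame(chr = c(1L, 1L),
                   strand = c(1L, 1L),
                   start = c(100L, 500L),
                   end = c(200L, 650L),
                   feature = factor(c("gene", "gene")))
rownames(anno) <- c("exon.1", "exon.2")
validAnnotation(anno)  # TRUE if the data.frame conforms to the model
@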
\section{Overview}

There are three broad classes of functions in \Rpackage{Genominator}: functions that import and transform data, functions that retrieve and summarize data, and finally functions that operate on retrieved data (focused on the analysis of next-generation sequencing data). In the next two sections we generate experimental data and annotation data for use in later examples.

\subsection*{Creating Experimental Data}

We are going to walk through a very simple example using simulated experimental data to present the data import pipeline. This example uses a verbose setting of \Rcode{TRUE} to illustrate the activities performed in the SQLite databases. The example data will be used later in the vignette to illustrate other aspects of the package. For an example of importing ``real'' next-generation sequence data, see the companion vignette on working with the \Rpackage{ShortRead} package, which illustrates the use of \Rfunction{importFromAlignedReads}.

The data can be thought of as next-generation sequencing data ($N$ reads) from an organism with 16 chromosomes and a small number of annotated regions ($K$ regions).

<<>>=
library(Genominator)
options(verbose = TRUE) # to be used by Genominator functions.
set.seed(123)
N <- 100000L # the number of observations.
K <- 100L # the number of annotation regions, not less than 10
df <- data.frame(chr = sample(1:16, size = N, replace = TRUE),
                 location = sample(1:1000, size = N, replace = TRUE),
                 strand = sample(c(1L,-1L), size = N, replace = TRUE))
head(df)
eDataRaw <- importToExpData(df, "my.db", tablename = "ex_tbl", overwrite = TRUE)
eDataRaw
head(eDataRaw)
@

The \Robject{df} object contains unsorted reads and is imported using \Rfunction{importToExpData}. \Robject{eDataRaw} is an example of an ExpData object, the core object in \Rpackage{Genominator}. Such an object essentially points to a specific table in a specific database (in SQLite a database is a file). The data in \Robject{eDataRaw} is ordered along the genome (unlike \Robject{df}), but there may be multiple rows with the same genomic location. The argument \Rcode{overwrite = TRUE} indicates that if the table already exists in the database, it will be overwritten. This can be handy in scripts and vignettes.

The \Robject{eDataRaw} object has a number of slots related to internal bookkeeping of the database connection. The \texttt{index columns} field indicates which columns of the database are indexed. For ``normal'' use this will always be (chr, location, strand). The \texttt{mode} indicates whether the database is in read or write mode (write implies read).

In a normal pipeline, the first thing we do is aggregate the reads. With the default settings, this means counting the number of reads starting at each (chr, location, strand) triplet. The resulting database has only one row for a given (chr, location, strand).

<<>>=
eData <- aggregateExpData(eDataRaw, tablename = "counts_tbl",
                          deleteOriginal = FALSE, overwrite = TRUE)
eData
head(eData)
@

The return object is (as always) an ExpData object pointing to the table that was \emph{created}. Note the addition of the \texttt{counts} column. The input ExpData object points to table \texttt{ex\_tbl} in database \texttt{my.db}. The output ExpData object (currently) always uses the same database as the input object, possibly with a different table name (in this case \texttt{counts\_tbl}). All functions that manipulate databases have the arguments \Rfuncarg{overwrite} and \Rfuncarg{deleteOriginal}. If \Rcode{deleteOriginal = TRUE}, the original table (in this case \texttt{ex\_tbl}) is deleted. If \Rcode{tablename = NULL} (the default), the function does a destructive in-place modification of the table.

It is possible to break ExpData objects. For example, if we had used \Rfunction{aggregateExpData} with \Rcode{deleteOriginal = TRUE}, the table that \Robject{eDataRaw} points to would have been deleted. Or, if the function had been called with \Rcode{tablename = NULL} (the default), both \Robject{eDataRaw} and \Robject{eData} would point to the same table in the database, but the schema recorded in \Robject{eDataRaw} at instantiation would be out of date, because it would not include the \texttt{counts} column. While this may seem problematic, it has not been a cause for much concern.
Remember that ExpData objects are very cheap to create, so if something seems broken, simply delete and recreate the object. With a bit of familiarity, this problem can be avoided. In general, we highly recommend carrying out the creation/manipulation of data in a script separate from the analysis, since creation/manipulation requires write access, whereas analysis is read-only.

Each ExpData object has a mode that indicates whether the database is open for reading only or for writing (which also implies reading). The \Robject{eDataRaw} and \Robject{eData} objects created above had 'write' mode. To prevent unwanted modifications to the database, we now instantiate a new \Robject{eData} object in 'read'-only mode.

<<>>=
eData <- ExpData(dbFilename = "my.db", tablename = "counts_tbl")
eData
head(eData)
@

This used the constructor function \Rfunction{ExpData}, which is the standard way to instantiate ExpData objects in a new session. It is possible to use the normal \Rfunction{[} and \Rfunction{\$} operators on ExpData objects (although they do not have rownames). This is rarely necessary, and the return objects may be massive.

<<>>=
head(eData$chr)
eData[1:3, 1:2]
@

\subsection*{Creating Annotation}

We now create a suitable annotation object. As described in earlier sections, annotation consists of contiguous genomic regions that may or may not overlap.

<<>>=
annoData <- data.frame(chr = sample(1:16, size = K, replace = TRUE),
                       strand = sample(c(1L, -1L), size = K, replace = TRUE),
                       start = (st <- sample(1:1000, size = K, replace = TRUE)),
                       end = st + rpois(K, 75),
                       feature = c("gene", "intergenic")[sample(1:2, size = K,
                                                                replace = TRUE)])
rownames(annoData) <- paste("elt", 1:K, sep = ".")
head(annoData)
@

In this example, the \Robject{annoData} object needs to have distinct row names in order to maintain the link between the annotation and the data structures returned by the \Rpackage{Genominator} API. Also, the \texttt{strand} column needs to have values in $\{-1, 0, 1\}$, and the \texttt{chr} column needs to have integer values. When you access ``real'' annotation, you will often need a post-processing step in which the annotation is massaged into this format. See the additional vignette on working with the \Rpackage{ShortRead} package for a real-life example.

\section{Creating and managing data}

For illustrative purposes, we generate another set of data:

<<>>=
df2 <- data.frame(chr = sample(1:16, size = N, replace = TRUE),
                  location = sample(1:1000, size = N, replace = TRUE),
                  strand = sample(c(1L,-1L), size = N, replace = TRUE))
eData2 <- aggregateExpData(importToExpData(df2, dbFilename = "my.db",
                                           tablename = "ex2", overwrite = TRUE))
@

as well as redoing the aggregation performed previously (with a new table name):

<<>>=
eData1 <- aggregateExpData(importToExpData(df, dbFilename = "my.db",
                                           tablename = "ex1", overwrite = TRUE))
@

\subsection*{Aggregation}

Aggregation refers to aggregation over rows of the database. It is typically used for sequencing data, in order to go from a ``one row, one read'' representation to a ``one row, one genomic location with an associated number of reads'' representation. The default arguments create a new column \texttt{counts} using an ``aggregator'' that counts the number of times a genomic location occurs in the data; a plain-\R{} analogue of this computation is sketched below. Aggregation was also discussed in an earlier section.
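To make the aggregation operation concrete, here is a small plain-\R{} sketch (not run, and not how the package works internally; \Rfunction{aggregateExpData} performs the equivalent computation inside SQLite) of counting how many times each (chr, strand, location) triplet occurs in \Robject{df}.

<<eval=FALSE>>=
## Plain-R sketch of what aggregation computes: one row per
## (chr, strand, location) triplet, with the number of reads at it.
dfAgg <- aggregate(rep(1L, nrow(df)),
                   by = df[, c("chr", "strand", "location")],
                   FUN = sum)
names(dfAgg)[4] <- "counts"
head(dfAgg)
@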
\subsection*{Merging}

It is natural to store a number of experimental replicates in the columns of a table. However, it is often the case that we receive the data in chunks over time, and merging new values with old ones is not trivial. For this reason we provide the \Rfunction{joinExpData} function to bind two tables together. Essentially, merging consists of placing two (or more) columns next to each other.

It is (somewhat) clear that (in general) it makes the most sense to merge two datasets when each dataset has only a single occurrence of each genomic location; otherwise, how would we deal with, or interpret, a genomic location that occurs multiple times in each dataset? For that reason, joining two datasets typically happens after they have been aggregated. It is possible to merge/join the datasets in \R{} and then use the import facility to store the resulting object in a database. In general, that approach is less desirable because 1) it is slower and 2) it requires all datasets to be present in memory.

<<>>=
## eData1 <- aggregateExpData(importToExpData(df, dbFilename = "my.db",
##                                            tablename = "ex1", overwrite = TRUE))
## eData2 <- aggregateExpData(importToExpData(df2, dbFilename = "my.db",
##                                            tablename = "ex2", overwrite = TRUE))
eDataJoin <- joinExpData(list(eData1, eData2),
                         fields = list(ex1 = c(counts = "counts_1"),
                                       ex2 = c(counts = "counts_2")),
                         tablename = "allcounts")
head(eDataJoin)
@

In this example, both \Robject{eData1} and \Robject{eData2} have a column named \texttt{counts}, which we rename in the resulting object using the \Rfuncarg{fields} argument. Missing values are introduced where a location present in one object is absent from the other. Finally, the \Rfunction{joinExpData} function supports joining an arbitrary number of ExpData objects.

We can examine the result using \Rfunction{summarizeExpData} (described later):

<<>>=
summarizeExpData(eDataJoin, fxs = c("total", "avg", "count"))
@

Because of the way we store the data, the ``total'' column is the total number of reads, the ``count'' column is the number of locations at which a read starts, and the ``avg'' column is the average number of reads at a given genomic location (excluding genomic locations that were not sequenced).

\subsection*{Collapsing}

Another common operation is collapsing data across columns. One use case is joining two experiments in which the same sample was sequenced; an advantage of collapsing the data in this case is speed. Collapsing is most often done by summation (the default):

<<>>=
head(collapseExpData(eDataJoin, tablename = "collapsed",
                     collapse = "sum", overwrite = TRUE))
@

But it can also be done using an ``average'' or a ``weighted average'' (weighted according to sequencing effort):

<<>>=
head(collapseExpData(eDataJoin, tablename = "collapsed",
                     collapse = "weighted.avg", overwrite = TRUE))
head(collapseExpData(eDataJoin, tablename = "collapsed",
                     collapse = "avg", overwrite = TRUE))
@

In this case, \Rcode{collapse = "avg"} and \Rcode{collapse = "weighted.avg"} yield exactly the same result, since the two experiments have the same number of reads.

\section{Interface}

In this section we describe the core functionality of the \Rpackage{Genominator} package. We will use two examples: one ExpData object containing a single (aggregated) experiment, and one containing two aggregated, joined experiments.
<<>>=
eData <- ExpData("my.db", tablename = "counts_tbl", mode = "r")
eDataJoin <- ExpData("my.db", tablename = "allcounts", mode = "r")
@

\subsection*{Summarizing experimental data}

We can use the function \Rfunction{summarizeExpData} to summarize ExpData objects. This function does not use annotation, so the summarization is in some sense ``global''. The call to compute the total number of counts (i.e., the number of reads) in the ``counts'' column is

<<>>=
ss <- summarizeExpData(eData, what = "counts")
ss
@

The \Rfuncarg{what} argument is present in many of the following functions. It specifies which columns are being used, and in general its default depends on the type of function: if the function summarizes data, the default is ``all columns except the genomic location columns'', whereas if the function retrieves data, the default is ``all columns''. We can customize the summary by specifying the name of any SQLite aggregation function (\url{www.sqlite.org/lang\_aggfunc.html}) in the \Rfuncarg{fxs} argument:

<<>>=
summarizeExpData(eData, what = "counts", fxs = c("MIN", "MAX"))
@

This yields the minimum/maximum number of reads mapped to a single location. The minimum number of reads is not zero because we only store locations associated with reads.

\subsection*{Selecting a region}

We can access genomic data in a single genomic region using the function \Rfunction{getRegion}.

<<>>=
reg <- getRegion(eData, chr = 2L, strand = -1L, start = 100L, end = 105L)
class(reg)
reg
@

It is possible to exclude either \texttt{start} or \texttt{end}, in which case the default values 0 and $10^{12}$ are used.

\subsection*{Using annotation with experimental data}

The two previous sections show useful functions, but in general we want to summarize data over many genomic regions simultaneously, in order to
\begin{itemize}
\item Summarize regions (means, lengths, sums, etc.)
\item Fit models to each region
\item Perform operations over classes of regions (genes, intergenic regions, ncRNAs)
\end{itemize}
There are essentially two strategies for this: retrieve the data as one big object and then use \R{} functions to operate on the data (e.g.\ \Rfunction{splitByAnnotation}), or perform the operation on the different regions inside the database (e.g.\ \Rfunction{summarizeByAnnotation}). The first approach is more flexible, but also slower and requires more memory. The second approach is faster, but limited to operations that can be expressed in SQL.

First, we demonstrate how to summarize over regions of interest. Here we compute the COUNT and TOTAL of each region, which tell us the number of distinct locations with reads in each region and the total number of sequencing reads in each region, respectively.

<<>>=
head(summarizeByAnnotation(eData, annoData, what = "counts",
                           fxs = c("COUNT", "TOTAL"), bindAnno = TRUE))
@

The result of \Rfunction{summarizeByAnnotation} is a \Robject{data.frame}. We use \Rcode{bindAnno = TRUE} to keep the annotation as part of the result, which is often very useful. The \Rfuncarg{fxs} argument takes the names of SQLite functions; see \url{www.sqlite.org/lang\_aggfunc.html}.

One important note regarding summation: the standard SQL function \texttt{SUM} handles a sum of only missing values, which in \R{} is expressed as \Rcode{sum(NA)}, differently from the SQLite-specific function \texttt{TOTAL}. In particular, the \texttt{SUM} function returns a missing value and the \texttt{TOTAL} function returns a zero. This is relevant when summarizing over a genomic region containing no reads in one experiment, but reads in another experiment.
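The difference is easy to demonstrate directly in SQLite. The following sketch (not run) uses \Rpackage{RSQLite} on a throwaway in-memory database; this setup is purely for illustration and is not something \Rpackage{Genominator} does.

<<eval=FALSE>>=
## SUM over only-NULL values yields NULL (NA in R); TOTAL yields 0.
library(RSQLite)
con <- dbConnect(SQLite(), dbname = ":memory:")
dbGetQuery(con, "SELECT SUM(x) AS sum, TOTAL(x) AS total
                 FROM (SELECT NULL AS x)")
dbDisconnect(con)
@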
The next example computes, for each annotated region, the number of reads mapping to the region, ignoring strand. Ignoring strand might be the right approach if the protocol does not retain strand information (e.g.\ the current standard protocol for Illumina RNA-Seq).

<<>>=
head(summarizeByAnnotation(eDataJoin, annoData, fxs = c("SUM"),
                           bindAnno = TRUE, preserveColnames = TRUE,
                           ignoreStrand = TRUE))
@

Essentially, this output is the input to a simple differential expression analysis. Note that if two regions in the annotation overlap, data falling in the overlap will be part of the end result for each region.

We can produce summaries by category using the \Rfuncarg{splitBy} argument.

<<>>=
res <- summarizeByAnnotation(eData, annoData, what = "counts",
                             fxs = c("TOTAL", "COUNT"), splitBy = "feature")
class(res)
lapply(res, head)
@

Finally, we might want to join the relevant annotation to the summaries using the \Rfuncarg{bindAnno} argument.

<<>>=
res <- summarizeByAnnotation(eData, annoData, what = "counts",
                             fxs = c("TOTAL", "COUNT"), splitBy = "feature",
                             bindAnno = TRUE)
lapply(res, head)
@

(Both of these examples might require \Rcode{ignoreStrand = TRUE} in a real-world application.) Unfortunately, the \Rfunction{summarizeByAnnotation} function is only able to use the very small set of SQLite aggregation functions, essentially limiting it to computing very simple summaries.

We also mention the \Rfuncarg{groupBy} argument to \Rfunction{summarizeByAnnotation}. This argument allows for an additional level of summarization and is used in the following way: assume that the \Robject{annoData} object has rows corresponding to exons and an additional column (say \texttt{gene\_id}) describing how exons are grouped into genes. With \Rcode{groupBy = "gene\_id"}, the final object has one line per gene, containing the summarized data over each set of exons.
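As a sketch (not run): the \texttt{gene\_id} column below is hypothetical, since the \Robject{annoData} object in this vignette does not have one, so we attach a made-up grouping for illustration.

<<eval=FALSE>>=
## Hypothetical: pretend groups of 5 regions form a "gene", so that
## the result has one row per gene_id rather than one row per region.
annoData2 <- cbind(annoData, gene_id = rep(1:20, each = 5))
head(summarizeByAnnotation(eData, annoData2, what = "counts",
                           fxs = c("TOTAL"), groupBy = "gene_id",
                           ignoreStrand = TRUE))
@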
We will now examine the \Rfunction{splitByAnnotation} function, which splits the data according to the annotation and returns the ``raw'' data. The return object of this function may be massive, depending on the size of the data and of the annotation. In general we advise doing as much computation as possible in SQLite (essentially, using \Rfunction{summarizeByAnnotation}), but sometimes it is necessary to access the raw data. Leaving aside the problems with the size of the return object, \Rfunction{splitByAnnotation} is reasonably fast. We start by splitting only on the annotated regions of type ``gene''.

<<>>=
dim(annoData[annoData$feature %in% "gene", ])
a <- splitByAnnotation(eData, annoData[annoData$feature %in% "gene", ])
class(a)
length(a)
names(a)[1:10]
head(a[[1]])
@

There are several things to note here. For starters, the return object is named according to the row names of the annotation data. Also, only annotated regions with data are represented in the return object, so in general the return object does \emph{not} have an element for each annotated region. We provide several convenience functions for operating on the results of \Rfunction{splitByAnnotation}; these are discussed later.

Now we wish to compute a trivial function of the counts, such as a quantile.

<<>>=
sapply(a, function(x) { quantile(x[,"counts"], .9) })[1:10]
@

Often we wish to use the annotation information when operating on the data in each region. The \Rfunction{applyMapped} function makes this easy by appropriately matching up the annotation and the data. Essentially, this function ensures that you are applying the right bit of annotation to the correct chunk of data.

<<>>=
applyMapped(a, annoData, FUN = function(region, anno) {
    counts <- sum(region[,"counts"])
    length <- anno$end - anno$start + 1
    counts/length
})[1:10]
@

This example computes the average number of reads per base, taking into account that there may be bases with no data. Note that \Rfuncarg{FUN} needs to be a function of two arguments. What we see is that some of our regions are not present. This is a byproduct of the fact that some of our regions have no data within their bounds. One can successfully argue that the result of the example above ought to include such regions with a value of zero (see below for this).

When our datasets are large, it is often more convenient and significantly faster to return only some columns.

<<>>=
sapply(splitByAnnotation(eData, annoData, what = "counts"), median)[1:10]
@

Often we wish to ``fill in'' missing regions. In the case of a coding sequence, there may be bases that have no reads, and these bases will not appear in the resulting object. We can ``expand'' a region to include these bases by filling in 0 reads for them. There are different ways to do this expansion. For convenience, the data are stratified by strand, so expansion produces a list of lists, where each sublist has up to two elements corresponding to the two strands. If the original annotation query is stranded, expansion produces a list where each sublist has only one element. Finally, we provide a feature for collapsing across strands, for the common use case of combining reads occurring on either strand within the region. In this case the return value is a list where each element is an expanded matrix representing the reads that occurred on either strand.

<<>>=
## This returns a list-of-lists
x1 <- splitByAnnotation(eData, annoData, expand = TRUE, ignoreStrand = TRUE)
names(x1[[1]])
## this returns a list-of-lists, however they are of length 1
x2 <- splitByAnnotation(eData, annoData, expand = TRUE)
names(x2[[1]])
## this returns a list where we have combined the two sublists
x3 <- splitByAnnotation(eData, annoData, expand = TRUE, addOverStrand = TRUE)
head(x3[[1]])
@

Leaving the \Rfunction{splitByAnnotation} function, sometimes we want to compute summaries of higher-level entities, such as genes, pseudogenes, and intergenic regions. We can label each genomic location with its annotation using the \Rfunction{mergeWithAnnotation} convenience function.

<<>>=
mergeWithAnnotation(eData, annoData)[1:3,]
@

This results in duplicated genomic locations when a genomic location is annotated multiple times. There are a number of parameters that can make this more natural.

<<fig=TRUE>>=
par(mfrow=c(1,2))
x <- lapply(mergeWithAnnotation(eData, annoData, splitBy = "feature",
                                what = "counts"),
            function(x) { plot(density(x)) })
@

\section{Annotation}

Annotation was defined earlier as a \Rcode{data.frame} satisfying certain requirements. This may be checked using \Rfunction{validAnnotation}, which returns \Rcode{TRUE} or \Rcode{FALSE}.

\Rpackage{Genominator} has a very useful function for dealing with annotation, \Rfunction{makeGeneRepresentation}. The input is a \Rcode{data.frame} not unlike the result of a query to Ensembl or UCSC (using the \Rpackage{biomaRt} package or the \Rpackage{rtracklayer} package). The typical output from such a query is an object where each row corresponds to an exon, with additional columns describing how exons are grouped into transcripts or genes. The output of \Rfunction{makeGeneRepresentation} is a new set of gene models satisfying the requirements detailed in the following.

A \Rcode{Ugene} model (``union gene'') consists of all bases annotated as belonging to a certain gene. Bases that are also annotated as belonging to other genes are removed. A \Rcode{UIgene} model (``union-intersection gene'') is an approach to summarizing a gene while protecting against different splice forms: it consists of the bases that are annotated as belonging to every transcript of the gene. Bases that are also annotated as belonging to other genes are removed. \Rcode{ROCE} models (``regions of constant expression'') are maximal intervals such that every base in the interval belongs to the same set of transcripts from the same gene. Bases that are also annotated as belonging to other genes are removed. Finally, \Rcode{Background} models consist of the bases that are not annotated as belonging to any gene. Note: this function does not require chromosome lengths, so you may have to add an additional interval (row) for each chromosome, consisting of the interval from the last annotated base to the end of the chromosome.
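As a sketch (not run): the exon-level annotation object \Robject{exonAnno} and its \texttt{geneId}/\texttt{transId} columns below are hypothetical, and the argument names are quoted from the package manual page, so consult \Rcode{?makeGeneRepresentation} for the authoritative interface.

<<eval=FALSE>>=
## Hypothetical: exonAnno has the usual (chr, strand, start, end)
## columns plus geneId and transId columns grouping exons into
## transcripts and genes. See ?makeGeneRepresentation for details.
uig <- makeGeneRepresentation(exonAnno, type = "UIgene",
                              gene.id = "geneId",
                              transcript.id = "transId")
head(uig)
@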
\section{Analysis tools}

For short-read sequencing data, the \Rpackage{Genominator} package offers a number of useful tools. They are presented in no particular order.

\subsection*{Coverage}

The \Rfunction{computeCoverage} function can be used to assess the sequencing depth.

<<fig=TRUE>>=
coverage <- computeCoverage(eData, annoData, effort = seq(100, 1000, by = 5),
                            cutoff = function(e, anno, group) { e > 1 })
plot(coverage, draw.legend = FALSE)
@

\subsection*{Statistical Functions: Goodness of Fit}

You can conduct a goodness-of-fit analysis of the Poisson model across lanes using the following function. The result is assessed by a QQ-plot against the theoretical distribution (of either the p-values or the statistic).

<<fig=TRUE>>=
plot(regionGoodnessOfFit(eDataJoin, annoData))
@

We can also do this for subsets of the data, for example within a condition.

<<fig=TRUE>>=
plot(regionGoodnessOfFit(as.data.frame(matrix(rpois(1000, 100), ncol = 10)),
                         groups = rep(c("A", "B"), 5),
                         denominator = rep(1, 10)))
@

\section*{SessionInfo}

<<results=tex>>=
toLatex(sessionInfo())
@

\begin{thebibliography}{30}
\bibitem{Bullard} Bullard,J.H., Purdom,E.A., Hansen,K.D., and Dudoit,S. (2010)
Evaluation of statistical methods for normalization and differential expression
in mRNA-Seq experiments. \textit{BMC Bioinformatics}, \textbf{11}, 94.
\end{thebibliography}

\end{document}