\documentclass{article} \usepackage[colorlinks=true,linkcolor=blue]{hyperref} \usepackage{theorem} \usepackage{underscore} \usepackage{color} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\bioc}{\software{BioConductor}} \newcommand{\ELAND}{\software{ELAND}} \newcommand{\MAQ}{\software{MAQ}} \newcommand{\Bowtie}{\software{Bowtie}} %% Excercises and Questions \theoremstyle{break} \newtheorem{Ex}{Exercise} \theoremstyle{break} \newtheorem{Q}{Question} %% And solution or answer \newenvironment{solution}{\color{blue}}{\bigskip} \begin{document} \SweaveOpts{keep.source=TRUE,width=9,height=8,eps=FALSE, include=FALSE, prefix.string=docs/IOQAFig} %% \setkeys{Gin}{width=0.95\textwidth} \title{I/0 and Quality Assessment using the \Rpackage{ShortRead} package} \date{May 31, 2009} \maketitle \tableofcontents <>= options(width=60) olocale=Sys.setlocale(locale="C") @ \section{Introduction} This portion of the course uses the \Rpackage{ShortRead} package to input aligned and other short read data files, and illustrates some of the available Solexa-based quality assessment tools. Activities during the lab are posed as exercises. So, as a first exercise: \begin{Ex} Start an \R{} session, and load the \Rpackage{ShortRead} package. <>= library("ShortRead") <>= library("ShortRead") packageDescription("ShortRead")$Version @ %% Confirm that the version of your package is at least as recent as the version in this document. Seek assistance from one of the course assistants if you need help getting the current version of \Rpackage{ShortRead}. \end{Ex} %% The course also requires access to sample data. \begin{Ex} Copy the data from the distribution media to your local hard drive. In \R{} change the working directory to point to the data location (which should contain the folders \code{extdata}, \code{scripts}, etc.), along the lines of <>= setwd("c:/Documents and Settings/Desktop/Course/labs") @ <>= oldwd = setwd("..") @ %% and confirm that the files have been copied correctly. \end{Ex} \section{Aligned read input} This section illustrates input of aligned reads. It focuses on aligned reads produced by the Solexa Genome Analyzer \ELAND{} software; reading data produced by software such as \MAQ{} or \Bowtie{} is described in the \Rpackage{ShortRead} `Overview' vignette and on the \Rfunction{readAligned} help page. \subsection{\Rclass{SolexaPath}: navigating Solexa output} This section introduces a way to conveniently navigate the hierarchy of files produced by \ELAND{}; this simplifies subsequent activities, but the full \ELAND{} output is not required to use \Rpackage{ShortRead}. Solexa software processes data in a pipeline. Raw images are extracted to image intensity files by the Firecrest software component. Image intensity files are summarized as base calls using the Bustard base caller. Subsequent analysis is performed by a diversity of software components called Gerald; one example of a Gerald program is \ELAND{}, a whole-genome aligner. The pipepline provides (or can be configured to provide -- the people running the machine have quite a bit of control over this) exquisite detail about each stage of the process. For instance, Firecrest scans each lane of a flow cell as 300 \emph{tiles}, arranged in three serpentine columns. The Firecrest output is summarized in two different file types for each tile, so there are $2 \times 300 \times 8 = 4800$ files produced by Firecrest alone. A portion of a file hierarchy is provided as course data. \begin{Ex} Consult the help page for \code{SolexaPath}, and create an instance of this object, e.g., <>= sp <- SolexaPath("extdata/ELAND/080828_HWI-EAS88_0003") @ \end{Ex} This command scans the file system rooted at the specified path (the final directory name given above is a typical top level name for a Solexa run, encoding the date and machine used, for instance), identifying likely paths associated with each stage of the pipeline. \begin{Ex} Display this object, and query it for the paths were Gerald analysis results are stored. <>= sp analysisPath(sp) @ % There are two analysis paths, because the data were generated by two separate runs of the \ELAND{} software. The analysis paths are nested inside \code{baseCallPath(sp)} as a Solexa convention, to indicate that the Gerald analyses both use the results of the same application of the base calling software. \end{Ex} Note how the display of \Robject{sp} is compact, and the names in the display hint at how to navigate the object. Short read objects can be very large. Most objects in \Rpackage{ShortRead} and related packages have been designed to display a summary, rather than the `big' data. Accessing components of objects, such as with \Rfunction{analysisPath} in the example above, often returns the data in all its glory -- frequently, you'll want to adopt a strategy of assigning such big data to a variable, and inspecting it with functions like \Rfunction{str} or \Rfunction{head}. A key functionality provided by \Rpackage{ShortRead} is input of a diversity of file types, both of files from the Solexa pipeline and from other software and in formats appropriate for other technologies. The interface to these input functions is meant to facilitate reading one or more files into a single object, e.g., to read all files containing image intensity produced by Firecrest. The interface is like that for \Rfunction{list.files}: provide a directory path where relevant files are to be found, plus a regular expression to select files you are interested in. \begin{Ex} Use \Rfunction{list.files} to display all files in the first Gerald analysis directory of the \Robject{sp} object. <>= list.files(analysisPath(sp)[[1]]) @ \end{Ex} A full Gerald data set would contain hundreds of files. We'll select just one file to use in subsequent analysis, based on files matching the regular expression \texttt{.*_export_head.txt}. These files are produced using \ELAND{} software run in \texttt{eland-extended} mode. This mode produces files, one for each lane (and `end' of paired end reads) that summarize diverse features of \emph{all} reads, and is a very convenient starting point for analysis. \begin{Ex} We'll use an abbreviated file for most parts of this lab. The abbreviated file contains the first 500,000 reads from a single lane of a Solexa paired-end run. It is from lane 1, and we will use the first end only. The name of the file is \file{s_1_1_export_head.txt}. We will use this as the `pattern' to match, and check that we have specified the pattern appropriately <>= pattern <- "s_1_1_export_head.txt" list.files(analysisPath(sp), pattern) @ %% Our success shouldn't be too surprising in this case, but it often pays to check. For instance, the pattern above would also match a file with the same name but with \texttt{.tar.gz} appended! \end{Ex} \subsection{\Rfunction{readAligned} and the \Rclass{AlignedRead} class} The \Rfunction{readAligned} function can be used to input aligned reads. The first argument is a directory path where alignment files are to be found. The second argument is the regular expression to select files to be read. By the default for this argument, all files will be read. An optional third argument allows the user to specify which type of file is to be read in. \begin{Ex} Use \Rfunction{readAligned} to read in our abbreviated version of lane 1. \Rfunction{readAligned} is smart enough to know where alignemnt files are located in the Solexa file hierarchy. <>= aln <- readAligned(sp, pattern) @ %% The \Rfunarg{sp} argument could have been replaced by a directory path, e.g., \code{"."} (if the file were in the current working directory) or \code{analysisPath(sp)}. The third argument (unspecified in the above) allows input of diverse types of Solexa alignment files, in addition to input of \MAQ{} and \Bowtie{} alignment files. See the help page for \Rfunction{readAligned} for additional details. \end{Ex} What does \Rfunction{readAligned} input? It inputs the short read sequences and base call qualities, and the chromosome, position, and strand information associated with short read alignments. This information is expected to be provided by all short read alignemnt software. \begin{Ex} Display the object we've read in. <>= aln @ % There are \Sexpr{length(aln)} reads in the object, each read consisting of \Sexpr{width(aln)} nucleotides. View the first several reads and query information about, e.g., the number of reads that align to each strand, or the number of positions recorded as \code{NA}. <>= head(sread(aln)) table(strand(aln), useNA="ifany") sum(is.na(position(aln))) @ %% Notice how elements of the \Robject{aln} object are extracted using \emph{accessors} such as \Rfunction{sread} and \Rfunction{strand}; these are described on the help page for the class of the \Robject{aln} object (indicated in the display of \Robject{aln}, above, as class \Rclass{AlignedRead}); note that the help page refers to the help page for \Rfunction{accessors} to enumerate additional ways of accessing the data. \end{Ex} What are all the \code{NA} values returned by \Rfunction{strand} and \Rfunction{position}? These correspond to reads that did not align to the reference genome used by \ELAND{}; that about 1/2 the reads do not align is below normal, making this an interesting opportunity for quality assessment. The \Rfunction{strand} function returns a factor with three levels. The first two describe reads aligned to the plus and minus strands, the third (\code{*}) is available for successful alignments where strand information is irrelevant. Aligned reads contain several different kinds of information about `quality'. Individual bases are assessed for quality during base calling. These `raw' base qualities are `calibrated' during \ELAND{} alignment; details of calibration are to be found in Illumina documentation. The alignments themselves also have qualities associated with them, with the details of alignment quality differing between alignment algorithms. \begin{Ex} Retrieve calibrated base quality from \Robject{aln}. <>= head(quality(aln)) @ % These qualities are string-encoded $-10 \log_{10}$ probabilities. The encoding in this case follows a convention established by Solexa. The details of the encoding can be obtained by querying \code{quality(aln)} for its \Rfunction{alphabet}; the letter \code{A} corresponds to a $-10 log_{10}$ score of 1. Numeric values are readily retrieved as a matrix, with rows corresponding to reads and columns to cycles. Computations can then be performed on them, e.g., to determine average calibrated quality scores as a function of cycle. <>= alf <- alphabet(quality(aln)) m <- as(quality(aln), "matrix") colMeans(m) @ \end{Ex} Alignment qualities are accessible with \Rfunction{alignQuality}. This returns an object that can contain quality scores in different formats; to extract the actual quality scores, use \Rfunction{quality}. Reads failing to align or to align in multiple locations have an alignment quality of 0. \begin{Ex} Retrieve the alignment quality scores, determine how many align poorly, and visualize the distribution (Figure~\ref{fig:alignqual}) of scores. <>= alignQuality(aln) q <- quality(alignQuality(aln)) sum(q==0) print(densityplot(q[q>1], plot.points=FALSE, xlab="Alignment quality")) @ % \begin{figure} \centering \includegraphics[width=.95\textwidth]{IOQAFig-alignQuality} \caption{Alignment quality} \label{fig:alignqual} \end{figure} \end{Ex} Alignment algorithms produce information in addition to basic data about chromosome, position, and strand alignment. The exact content varies between algorithms, and is available with \Rfunction{alignData}. \Rfunction{alignData} returns an \Rclass{AlignedDataFrame} object that contains these data and a metadata description of them. For instance, \ELAND{} includes information about whether the read passed a base-calling filter (based on strength and consistency of early bases in the read), in addition to the lane, tile, x and y coordinate of each read. \begin{Ex} Use the \Rfunction{alignData} function to extract the additional information in the \ELAND{} alignment file. The underlying data in this object can be accessed as though it were a data frame, for instance to tally the number of reads passing Solexa base calling filter. <>= alignData(aln) table(alignData(aln)$filtering) @ \end{Ex} \subsection{Subsets and filters} A very common operation is to reduce the number of reads used for subsequent analysis. This can be done in a coordinated fashion by creating a subset of \Robject{aln}. \begin{Ex} Select just the aligned reads passing Solexa filtering, and aligning to the reference genome. <