\documentclass[10pt]{article} \usepackage{hyperref} \usepackage{theorem} \usepackage{underscore} \usepackage{color} \textwidth=6.5in \textheight=8.5in \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.5in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\bioc}{\software{BioConductor}} \newcommand{\ELAND}{\software{ELAND}} \newcommand{\MAQ}{\software{MAQ}} \newcommand{\Bowtie}{\software{Bowtie}} %% Excercises and Questions \theoremstyle{break} \newtheorem{Ex}{Exercise} \theoremstyle{break} \newtheorem{Q}{Question} %% And solution or answer \newenvironment{solution}{\color{blue}}{\bigskip} \begin{document} \SweaveOpts{eps=FALSE, include=FALSE, keep.source=TRUE} \title{EMBL Short Reads Course June 2009:\\ Managing sequence and annotation data using the\\ \Rpackage{Biostrings} and \Rpackage{BSgenome} packages} \author{Patrick Aboyoun \\ Fred Hutchinson Cancer Research Center \\ Seattle, WA 98008} \date{6 June 2009} \maketitle \tableofcontents <>= options(width=72) .oldlocale <- Sys.setlocale(locale="C") @ \section{Preliminaries} This lab is designed to teach the basics of \Rpackage{Biostrings} and \Rpackage{BSgenome} data packages. For this lab you need: \begin{itemize} \item \R{} version 2.9 and Bioconductor release 2.4, \item the \Rpackage{Biostrings}, \Rpackage{BSgenome} and \Rpackage{BSgenome.Mmusculus.UCSC.mm9} packages, \item \Robject{topReads.rda}: an example data file containing the top 1000 reads for all 8 Solexa lanes of two ChIP-seq experiments. This will be provided to you by the course instructors. Store it on your local file system. \end{itemize} % --------------------------------------------------------------------------- \section{Setup} \begin{Ex} Start an \R{} session and use the \Rfunction{library} function to load the \Rpackage{BSgenome.Mmusculus.UCSC.mm9} genome package. <>= library("BSgenome.Mmusculus.UCSC.mm9") @ \end{Ex} \begin{Ex} Use the \Rfunction{load} function to load the example dataset into your \R{} session. (You will need to adapt the directory path to where the file is on your system.) <>= load(file.path("..", "data", "topReads.rda")) ls() @ \end{Ex} \Robject{topReads} is a list of length \Sexpr{length(topReads)}, corresponding to the two experiments. Each element is again a list of length \Sexpr{length(topReads[[1L]])}, corresponding to the lanes of the Solexa instrument. Each of the elements of that is an \Rclass{XDataFrame} object with \Sexpr{nrow(topReads[[1L]][[1L]])} rows and \Sexpr{ncol(topReads[[1L]][[1L]])} columns. More on the provenance of the \Robject{topReads} object is described in Section~\ref{sec:topReads}. <>= ## checking stopifnot(all(sapply(topReads, length)==length(topReads[[1L]]))) dd = dim(topReads[[1L]][[1L]]) for(expts in topReads) for(lanes in expts) stopifnot(identical(dim(lanes), dd)) rm(list=c("dd", "expts", "lanes")) @ % FIXME(wh 7 June 2009): the show method for XDataFrame is pretty uninformative, % couldn't it be a bit more verbose? <>= topReads[[1]][[1]] colnames(topReads[[1]][[1]]) topReads[[1]][[1]][1,"read"] topReads[[1]][[1]][1,"count"] @ % % --------------------------------------------------------------------------- \section{Basic containers} \subsection{DNAString objects} The \Rclass{DNAString} class is the basic container for storing a long nucleotide sequence. Unlike a standard character vector in R that can store multiple strings, a \Rclass{DNAString} object can only contain one. \begin{Ex} \begin{enumerate} \item Create a \Rclass{DNAString} object \Robject{r1} by using the \Rfunction{[[} %]] operator to extract the first read from experiment 2, lane 1. \item Use the \Rfunction{nchar} and \Rfunction{alphabetFrequency} functions to obtain the number of characters and the frequencies of the different letters in \Robject{r1}. \item Get its reverse complement. \item Extract substrings with the function \Rfunction{subseq}. \end{enumerate} \end{Ex} % <>= r1 <- topReads[["experiment2"]][["lane1"]][,"read"][[1]] nchar(r1) alphabetFrequency(r1) @ % Note that the \Rclass{DNAString} class can contain characters from the complete set of IUAPC nucleic acid codes (for example, R stands for purine, i.\,e., A or G). % <>= reverseComplement(r1) subseq(r1, start=5, end=15) subseq(r1, end=15) subseq(r1, start=-5) @ \subsection{DNAStringSet objects} The \Rclass{DNAStringSet} class is the basic container for storing multiple nucleotide sequences. As with \R{} vectors, the \Rfunction{length} function returns the number of elements (sequences) in a \Rclass{DNAStringSet} object and the \Rfunction{[} % ] operator can be used to subset it. In addition, the element access operator \Rfunction{[[} % ]] can be used to extract a single element and return it as a \Rclass{DNAString} object. \begin{Ex} \begin{enumerate} \item Use the \Rfunction{DNAStringSet} constructor to store the 1000 reads from experiment 2, lane 1 into a \Rclass{DNAStringSet} object. Let us call this instance \Robject{dss}. \item Use \Rfunction{length} and \Rfunction{width} on \Robject{dss}. \item Use subsetting operator \Rfunction{[} to remove its second element. \item Use the \Rfunction{rev} to invert the order of its elements. \item Use subsetting operator \Rfunction{[[} %]] to extract its first element as a \Rclass{DNAString} object. \item Use the \Rfunction{DNAStringSet} constructor (i) to remove the last 2 nucleotides of each element, then (ii) to keep only the last 10 nucleotides. \item Call \Rfunction{alphabetFrequency} on \Robject{dss} and on its reverse complement. Try again with setting its argument \Rfunarg{collapse=TRUE}. \item Remove reads with Ns, and put the ``cleaned up'' set of sequences back into \Robject{dss}. \end{enumerate} \end{Ex} % <>= dss <- topReads[["experiment2"]][["lane1"]][,"read"] length(dss) table(width(dss)) dss[-2] rev(dss) dss[[1]] DNAStringSet(dss, end=-3) DNAStringSet(dss, start=-10) head(alphabetFrequency(dss)) reverseComplement(dss) # Use 'collapse=TRUE' to collapse all the rows alphabetFrequency(dss, collapse=TRUE) alphabetFrequency(reverseComplement(dss), collapse=TRUE) # Use [ , ] to subset the matrix returned by alphabetFrequency() dss <- dss[alphabetFrequency(dss)[ ,"N"] == 0] @ %-------------------------------------------------- \subsection{XStringViews objects} %-------------------------------------------------- An \Rclass{XStringViews} object contains a set of views on the same sequence, which is called the {\it subject}; for example, this can be a \Rclass{DNAString} object. Each view is defined by its start and end locations: both are integers such that start $\le$ end. The \Rfunction{Views} function can be used to create an \Rclass{XStringViews} object, given a subject and a set of start and end locations. Like for \Rclass{DNAStringSet} objects, \Rfunction{length}, \Rfunction{width}, \Rfunction{[} and \Rfunction{[[} % ]] are supported for \Rclass{XStringViews} objects. Additional methods \Rfunction{subject}, \Rfunction{start}, \Rfunction{end} and \Rfunction{gaps} are also provided. A typical use case for views is for the subject to be the sequence of a molecule (e.\,g.\ a chromosome) and the different views to be certain features of the sequence, such as protein-binding regions, transcribed regions, etc. \begin{Ex} \begin{enumerate} \item Use the \Rfunction{Views} function to create an \Rclass{XStringViews} object with a \Rclass{DNAString} subject. Make it such that some views are overlapping but also that the set of views do not cover the subject entirely. \item Try \Rfunction{subject}, \Rfunction{start}, \Rfunction{end} and \Rfunction{gaps} on this \Rclass{XStringViews} object. \item Try \Rfunction{alphabetFrequency} on it. \item Turn it into a \Rclass{DNAStringSet} object with the \Rfunction{DNAStringSet} constructor. \end{enumerate} \end{Ex} % <>= v3 <- Views(dss[[1]], start=c(2, 12, 20), end=c(5, 26, 27)) subject(v3) start(v3) end(v3) gaps(v3) alphabetFrequency(v3) DNAStringSet(v3) @ % --------------------------------------------------------------------------- \section{BSgenome data packages} The name of a BSgenome data package is made of 4 parts separated by a dot, for example \Rpackage{BSgenome.Celegans.UCSC.ce2}. \begin{enumerate} \item The $1^{\mbox{st}}$ part is always the prefix {\tt BSgenome}. \item The $2^{\mbox{nd}}$ part is the (abbreviated) name of the organism. \item The $3^{\mbox{rd}}$ part is the name of the organisation who assembled the genome. \item The $4^{\mbox{th}}$ part is the release string or number used by this organisation for this assembly of the genome. \end{enumerate} All BSgenome data packages contain a single top level object whose name matches the second part of the package name. \begin{Ex} \begin{enumerate} \item Load \Rpackage{BSgenome.Mmusculus.UCSC.mm9} and display its top level object. Note that this does not load any sequence into memory yet. \item Use \Rfunction{seqlengths} on it to get the lengths of the single sequences (this does not load any sequence either). \end{enumerate} \end{Ex} % <>= Mmusculus seqlengths(Mmusculus) @ Display information about the mitochondrial chromosome. <>= Mmusculus$chrM @ Some information about the built-in masks is displayed. Let us drop the masks for now by accessing the sequence with % <>= unmasked(Mmusculus$chrM) @ % \begin{Ex} \begin{enumerate} \item Do the chromosomes contain IUPAC extended letters? \item Treatment of DNA with bisulfite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. The first reported method~\cite{Frommer} of methylation analysis using bisulfite-treated DNA used PCR and standard dideoxynucleotide DNA sequencing to directly determine the nucleotides resistant to bisulfite conversion. All sites of unmethylated cytosines were displayed as thymines in the resulting amplified sequence of the sense strand, and as adenines in the amplified antisense strand. Assuming that no part of it is methylated, use the \Rfunction{chartr} function to simulate a bisulfite transformation of a 5 kb segment of chromosome 1. \end{enumerate} \end{Ex} % Apply \Rfunction{alphabetFrequency} to each unmasked chromosome: % <>= af <- sapply(seqnames(Mmusculus), function(name) alphabetFrequency(unmasked(Mmusculus[[name]]), baseOnly=TRUE)) @ and plot the result as a barplot (see Fig.~\ref{fig:af}). <>= barplot(t(af), beside=TRUE, col=rainbow(ncol(af))) @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{BiostringsLab-figaf} \caption{\label{fig:af}Nucleotide frequencies.} \end{center} \end{figure} % Bisulfite transformation of the plus strand: <>= reg = 10000000+(1:5000) plus_strand <- chartr("C", "T", unmasked(Mmusculus$chr1)[reg]) alphabetFrequency(plus_strand, baseOnly=TRUE) @ % $\ldots$ and the minus strand: <>= minus_strand <- chartr("G", "A", unmasked(Mmusculus$chr1)[reg]) alphabetFrequency(minus_strand, baseOnly=TRUE) @ % --------------------------------------------------------------------------- \section{String matching} % --------------------------------------------------------------------------- \subsection{The \Rfunction{matchPattern} function} This function finds all the occurences (a.k.a. {\it matches} or {\it hits}) of a given pattern in a reference sequence called {\it the subject}. \begin{Ex} \begin{enumerate} \item Find all the matches of a short pattern (invent one) in mouse chromosome 1. Do not choose the pattern too short or too long. \item In fact, without further attention, we only get the hits in the plus strand of the chromosome. Find the matches in the minus strand too. (Note: the cost of taking the reverse complement of an entire chromosome sequence can be high in terms of memory usage. Try to do something better.) \item \Rfunction{matchPattern} supports insertions and deletions (``indels'') via the \Rfunarg{with.indels} argument. Use the same pattern to find all the matches in chromosome 1 that are at an edit distance <= 2 from it. \end{enumerate} \end{Ex} % <>= pattern <- DNAString("ACCGGTTATC") matchPattern(pattern, Mmusculus$chr1) @ Reverse complement \Robject{pattern} instead of \Robject{Mmusculus\$chr1}: it is more memory efficient and it keeps coordinates relative to the plus strand, which is what everybody seems to do (NCBI, UCSC, etc...) <>= matchPattern(reverseComplement(pattern), Mmusculus$chr1) @ Allow for indels. <>= matchPattern(pattern, Mmusculus$chr1, max.mismatch=2, with.indels=TRUE) @ % --------------------------------------------------------------------------- \subsection{The vmatchPattern function} % --------------------------------------------------------------------------- This function finds all the matches of a given pattern in a set of reference sequences (the ``v'' stands for \textit{vectorized}). \begin{Ex} \begin{enumerate} \item Load the \Robject{upstream5000} object from \Robject{Mmusculus} and find all the matches of a short pattern in it. \item The value returned by \Rfunction{vmatchPattern} is an \Rclass{MIndex} object containing the match coordinates for each reference sequence. You can use the \Rfunction{startIndex} and \Rfunction{endIndex} accessors on it to extract the match starting and ending positions as lists (one list element per reference sequence). \Rfunction{[[} % ]] extracts the matches of a given reference sequence as an \Rclass{MIndex} object. \Rfunction{countIndex} extracts the match counts as an integer vector (one element per reference sequence). \end{enumerate} \end{Ex} % <>= Mmusculus$upstream5000 m <- vmatchPattern(pattern, Mmusculus$upstream5000) @ To get the indices of the references sequences with hits: <>= which(countIndex(m) != 0) @ To get the hits in reference sequence 2956: <>= m[[2956]] @ %-------------------------------------------------- \subsection{Ambiguities} %-------------------------------------------------- IUPAC extended letters can be used to express ambiguities in the pattern or in the subject of a search with \Rfunction{matchPattern}. This is controlled via the \Robject{fixed} argument of the function. If \Robject{fixed} is \Robject{TRUE} (the default), all letters in the pattern and the subject are interpreted litterally. If \Robject{fixed} is \Robject{FALSE}, IUPAC extended letters in the pattern and in the subject are interpreted as ambiguities e.g. {\tt M} will match {\tt A} or {\tt C} and {\tt N} will match any letter (the \Robject{IUPAC\_CODE\_MAP} named character vector gives the mapping between IUPAC letters and the set of nucleotides that they stand for). The most common use of this feature is to introduce wildcards in the pattern by replacing some of its letters with {\tt N}s. \begin{Ex} \begin{enumerate} \item Search for the pattern {\tt GAACTTTGCCACTC} in Mouse chromosome 1. \item Repeat but this time allow the 2nd {\tt T} in the pattern (6th letter) to match anything. Anything wrong? \item Call \Rfunction{matchPattern} with \Rfunarg{fixed="subject"} to work around this problem. \end{enumerate} \end{Ex} % <>= matchPattern("GAACTTTGCCACTC", Mmusculus$chr1) @ By default, \Rfunarg{fixed} is \Robject{TRUE}, so the N in the pattern can only match an N in the subject: <>= matchPattern("GAACTNTGCCACTC", Mmusculus$chr1) matchPattern("GAACTNTGCCACTC", Mmusculus$chr1, fixed=FALSE) @ %-------------------------------------------------- \subsection{Masking} %-------------------------------------------------- The \Rclass{MaskedDNAString} container is dedicated to the storage of masked DNA sequences. As mentioned above, you can use the \Rfunction{unmasked} function to turn a \Rclass{MaskedDNAString} object into a \Rclass{DNAString} object (the masks will be lost), or use the \Rfunction{masks} accessor to extract the masks. Each mask on a sequence can be active or not. Masks can be activated individually with: <>= chr1 <- Mmusculus$chr1 active(masks(chr1))["TRF"] <- TRUE @ This will activate the Tandem Repeats Finder (TRF) mask. All masks together can activated with: <>= active(masks(chr1)) <- TRUE @ Some functions in \Rpackage{Biostrings} like \Rfunction{alphabetFrequency} or the string matching functions will skip masked regions when walking along a sequence with active masks. \begin{Ex} What percentage of Mouse chromosome 1 is made of assembly gaps? \end{Ex} <>= maskedratio(masks(Mmusculus$chr1)["AGAPS"]) @ \begin{Ex} Check the alphabet frequency of Mouse chromosome 1 when only the AGAPS mask is active, when only the AGAPS and AMB masks are active. Compare with unmasked chromosome 1. \end{Ex} \Robject{Mmusculus\$chr1} is an immutable object, so before we can turn its masks on or off, we need to copy it to another variable (note that the chromosome sequence itself is not copied during this operation, so it does not result in the use of a substantial amount of additional memory): % <>= chr1 <- Mmusculus$chr1 active(masks(chr1)) <- FALSE active(masks(chr1))["AGAPS"] <- TRUE chr1 alphabetFrequency(chr1, baseOnly=TRUE) active(masks(chr1))["AMB"] <- TRUE alphabetFrequency(chr1, baseOnly=TRUE) alphabetFrequency(unmasked(chr1), baseOnly=TRUE) @ \begin{Ex} \begin{enumerate} \item Try \code{as(chr1 , "XStringViews")} and \code{gaps(as(chr1 , "XStringViews"))} with different sets of active masks. How do you use this to display the contigs as views? \item Activate all masks and find the occurences of an arbitrary DNA pattern in it. Compare to what you get with unmasked chromosome 1. \end{enumerate} \end{Ex} % To display the contig as views: <>= chr1 <- Mmusculus$chr1 active(masks(chr1)) <- FALSE active(masks(chr1))["AGAPS"] <- TRUE as(chr1 , "XStringViews") @ Activate all masks <>= active(masks(chr1)) <- TRUE chr1 matchPattern("ACACACACACACACACACAC", chr1) matchPattern("ACACACACACACACACACAC", unmasked(chr1)) @ In addition to the built-in masks, the user can put its own mask on a sequence. Two types of user-controlled masking are supported: {\it by content} or {\it by position}. The \Rfunction{maskMotif} function will mask the regions of a sequence that contain a motif specified by the user. The \Rfunction{Mask} constructor will return the mask made of the regions defined by the start and end locations specified by the user (like with the \Rfunction{Views} function). \subsection{Finding the hits of a large set of short motifs} Our own competitor to other fast alignment tools like MAQ or bowtie is the \Rfunction{matchPDict} function. Its speed is comparable to the speed of MAQ but it uses more memory than MAQ to align the same set of reads against the same genome. Here are some important differences between \Rfunction{matchPDict} and MAQ (or bowtie): \begin{itemize} \item \Rfunction{matchPDict} ignores the quality scores, \item it finds all the matches, \item it fully supports 2 or 3 (or more) mismatching nucleotides anywhere in the reads (performance will decrease significantly though if the reads are not long enough), \item it supports masking (masked regions are skipped), \item it supports IUPAC ambiguities in the subject (useful for SNP detection). \end{itemize} % FIXME (wh 7 June 2009) Which of these points are still up-to-date? The workflow with \Rfunction{matchPDict} is the following: \begin{enumerate} \item Preprocess the set of short reads with the \Rfunction{PDict} constructor. \item Call \Rfunction{matchPDict} on it. \item Query the \Rclass{MIndex} object returned by \Rfunction{matchPDict}. \end{enumerate} \begin{Ex} \begin{enumerate} \item Preprocess \Robject{dss} (obtained earlier from \Robject{topReads.rda}) with the \Rfunction{PDict} constructor. \item Use this \Rclass{PDict} object to find the (exact) hits of \Robject{dss} in Mouse chromosome 1. \item Use \Rfunction{countIndex} on the \Rclass{MIndex} object returned by \Rfunction{matchPDict} to extract the number of hits per read. \item Which read has the highest number of hits? Display those hits as an \Rclass{XStringViews} object. Check this result with a call to \Rfunction{matchPattern}. \item You only got the hits that belong to the + strand. How would you get the hits that belong to the - strand? \item Redo this analysis for inexact matches with at most 2 mismatches per read in the last 20 nucleotides. \end{enumerate} <>= pdss <- PDict(dss) m <- matchPDict(pdss, Mmusculus$chr1) Rle(countIndex(m)) which(countIndex(m) == max(countIndex(m))) pdss[[46]] Views(unmasked(Mmusculus$chr1), start=start(m[[46]]), end=end(m[[46]])) matchPattern(pdss[[46]], Mmusculus$chr1) ### Hits in the minus strand: pdict1 <- PDict(reverseComplement(dss)) m1 <- matchPDict(pdict1, Mmusculus$chr1) Rle(countIndex(m1)) which(countIndex(m1) == max(countIndex(m1))) reverseComplement(pdict1[[433]]) # The previous analysis was for exact hits. To find inexact hits # with at most 2 mismatches per read in the last 20 nucleotides, we # need to specify a Trusted Band during preprocessing: pdict2 <- PDict(dss, tb.end=16) # and to call matchPDict() with 'max.mismatch=2': m2 <- matchPDict(pdict2, Mmusculus$chr1, max.mismatch=2) # Of course we find the same hits or more for each read: all(countIndex(m2) >= countIndex(m)) which(countIndex(m2) == max(countIndex(m2))) pdss[[90]] @ \end{Ex} %-------------------------------------------------- \section{More on the provenance of the \Robject{topReads} object} \label{sec:topReads} %-------------------------------------------------- The code that was used to produce the \Robject{topReads} object looked something like: <>= sp <- list("experiment1" = SolexaPath(file.path("path", "to", "experiment1")), "experiment2" = SolexaPath(file.path("path", "to", "experiment2"))) patSeq <- paste("s_", 1:8, "_.*_seq.txt", sep = "") names(patSeq) <- paste("lane", 1:8, sep = "") topReads <- lapply(seq_len(length(sp)), function(i) { print(experimentPath(sp[[i]])) lapply(seq_len(length(patSeq)), function(j, n = 1000) { cat("Reading", patSeq[[j]], "...") x <- tables(readXStringColumns(baseCallPath(sp[[i]]), pattern = patSeq[[j]], colClasses = c(rep(list(NULL), 4), list("DNAString")))[[1]], n = n)[["top"]] names(x) <- chartr("-", "N", names(x)) cat("done.\n") XDataFrame(read = DNAStringSet(names(x)), count = unname(x)) }) }) names(topReads) <- names(sp) for (i in seq_len(length(sp))) { names(topReads[[i]]) <- names(patSeq) } @ % You could adapt this for use with your own data. %-------------------------------------------------- \section{Session Information} %-------------------------------------------------- <>= toLatex(sessionInfo()) @ \begin{thebibliography}{10} \bibitem{Frommer} Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL. \newblock A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. \newblock \textit{Proc Natl Acad Sci U S A} 89:1827--1831 (1992) \end{thebibliography} \end{document}