\documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \usepackage{theorem} \usepackage{url} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} %% Excercises and Questions \theoremstyle{break} \newtheorem{Ex}{Exercise} \theoremstyle{break} \newtheorem{Q}{Question} %% And solution or answer \newenvironment{solution}{\color{blue}}{\bigskip} \author{Simon Anders\\[1em]European Bioinformatics Institute,\\ Hinxton, Cambridge, UK\\[1em] \texttt{sanders@fs.tum.de}} \title{EMBL course on Short Read analysis with Bioconductor:\\An exercise with coverage vectors} \date{23 June 2009} \begin{document} \maketitle In the ChIP-Seq tutorial, we used the \Rpackage{IRanges} and \Rpackage{ShortRead} packages to get coverage vectors from aligned Solexa reads. The function \Rfunction{coverage} gave us a \Rclass{GenomeData} object that contained for each chromosome an \Rclass{Rle} vector of coverage scores. In this exercise, we explore an interesting analysis that is of use especially in ChIP-Seq experiments studying histone modifications or other characteristics of chromatin structure. In one of the pioneering works in this field, Barski et al.~\cite{Barski} have performed ChIP-Seq for various histone modifications. Among other questions they investigated whether the positioning of histone methylation marks correlates with transcription start sites (TSSs). They could confirm that (as was already knwon before) the mark H3K4me3 peaks sharply at TSSs with a distinctive peak shape. Here, we will re-do the final step of this analysis. In order to save time, we have already performed these steps for you: \begin{itemize} \item Download Barski et al.'s raw data from the NCBI Short Read Archive. \item Use Maq to align the data from the lanes with H3K4me1 and H3K4me3 samples to the human reference genome. \item Use \Rpackage{ShortRead}'s \Rpackage{ReadAligned} function to read in the output of Maq. \item Call the \Rfunction{coverage} method to get the coverage vectors. \end{itemize} The following commands were used for this: <>= library( ShortRead ) maps.me1 <- sapply( list.files( "ShortReadExampleData/H3K4me1", "run.*lane.\\.map" ), function(filename) readAligned( "ShortReadExampleData/H3K4me1", filename, type="MAQMapShort" ) ) maps.me3 <- sapply( list.files( "ShortReadExampleData/H3K4me3", "run.*lane.\\.map" ), function(filename) readAligned( "ShortReadExampleData/H3K4me3", filename, type="MAQMapShort" ) ) get_coverage <- function( lanes, chrom, chromLength ) { res <- Rle( 0, chromLength ) for ( lane in lanes ) { lc <- lane[ chromosome(lane) == chrom ] cvg <- coverage( lc, start = 1, end = chromLength, extend = 185L - width(lc) ) res <- res + cvg[[ chrom ]] } res } lengthChr10 <- 135374737 coverage.chr10.me1 <- get_coverage( maps.me1, "10", lengthChr10 ) coverage.chr10.me3 <- get_coverage( maps.me3, "10", lengthChr10 ) save( coverage.chr10.me1, coverage.chr10.me3, file="coverage.H3K4meX.rda") @ In case you want to do this yourself, you can find the raw data at \url{http://www.ebi.ac.uk/~anders/ShortReadExampleData} To save space, we have taken only the vector for chromosome 10 from this data. Load the \Rpackage{ShortRead} package and the R data file \texttt{coverage.H3K4meX.rda}. <>= library("ShortRead") load(file.path("..", "data", "coverage.H3K4meX.rda")) @ We have these two \Rclass{Rle} vectors: <>= coverage.chr10.me1 coverage.chr10.me3 @ If you would like to study these tracks in a genome browser, you can save them as Wiggle files by using the \Rpackage{rtracklayer} package. \Rpackage{rtracklayer} expects to get the data as \Rclass{RangedData}, hence we have to use a coercion method first. <>= library("rtracklayer") coverage.chr10.me1.rd <- as( coverage.chr10.me1, "RangedData" ) coverage.chr10.me3.rd <- as( coverage.chr10.me3, "RangedData" ) names( coverage.chr10.me1.rd ) <- "chr10" names( coverage.chr10.me3.rd ) <- "chr10" @ The coercion method cannot know that the data is from "chr10", because this information is not part of the \Rclass{Rle} vector. (The original call to \Rfunction{coverage} returned a \Rclass{GenomeData} object, which contaions the chromosome names. However, we have only one of its \Rclass{Rle} elements here.) Hence, we have to add the chromosome name manually. Now, we can export the data as wiggle files: <>= export( coverage.chr10.me1.rd, "chr10_me1.wig", format="wig" ) export( coverage.chr10.me3.rd, "chr10_me3.wig", format="wig" ) @ Inspect the wiggle files with a text editor. You may find that the chromosome names are wrong, they read "chrchr10". This is a bug in \Rpackage{rtracklayer} that will be corrected in the next few days: The \Rfunction{export} function prefixes an extra ``chr'' to the chromosome names. To resolve the issue, just omit ``chr'' in your chromosome name: <>= names( coverage.chr10.me1.rd ) <- "10" names( coverage.chr10.me3.rd ) <- "10" export( coverage.chr10.me1.rd, "chr10_me1.wig", format="wig" ) export( coverage.chr10.me3.rd, "chr10_me3.wig", format="wig" ) @ Now, use your favorite genome browser to display the wiggle files. You can also use Hilbert curve visualisation. Either, use the stand-alone tool and load the wiggle files, or type <>= library( HilbertVisGUI ) hilbertDisplay( as.integer( coverage.chr10.me1 ) ) @ Using HilbertVis from R unfortunately requires us to convert from an \Rclass{Rle} vector to an ordinary one, and this will require a lot of RAM. (I'll change this soon.) So, it might not work well on a 2 GB machine. If you have enough RAM, you may also compare the two data vectors: <>= hilbertDisplay( as.integer( coverage.chr10.me1 ), as.integer( coverage.chr10.me3 ) ) @ In order to see how the methylation marks correlate around the TSSs, we place windows around them and collect the coverage from 200 bp before to 200 bp after the TSS. This gives us a vector of length 401 from each TSS, and by adding them all up, we see whether the histone marks tend to be at a non-random distance to TSSs. Assume that a TSS is at position 123456. Then, we can get the coverage of the 101-bp vector around it with the \Rfunction{subseq} method: % <>= as.vector( subseq( coverage.chr10.me1, 123456-50, 123456+50 ) ) @ Note that \Rfunction{subseq} returns an \Rclass{Rle} object. However, our result is a short vector, and for these sizes, it is easy enough to use ordinary vectors. Hence, the call to \Rfunction{as.vector}. If we have a vector of TSS positions, we can now easily add all these intervals up. \begin{Ex} Write a function to do that. It should take an \Rclass{Rle} coverage vector, an integer vector of TSS positions and a boolean vector indicating whether the TSS and its gene are on the reverse strand. \end{Ex} Solution: You might write something like this: <>= deltaConvolve <- function( coverage, winCentres, revStrand, left, right) { result <- rep( 0, right - left + 1 ) for( i in seq(along=winCentres) ) { v <- as.vector( subseq( coverage, winCentres[i] + left, winCentres[i] + right ) ) if( revStrand[i] ) v <- rev( v ) result <- result + v } result } @ If you like it more elegant, try using \Rfunction{sapply} instead of \Rfunction{for}. This might actually be slower, however. \begin{Ex} Now, we need a list of the transcription start sites of all genes on human chromosome 10. If you are familiar with the \Rpackage{biomaRt} package, try to get this information from Ensembl. \end{Ex} Solution: The following Biomart query does the job. % <>= library("biomaRt") mart <- useMart( "ensembl", dataset = "hsapiens_gene_ensembl" ) martReply <- getBM( attributes=c( "transcript_start", "transcript_end", "strand" ), filters="chromosome_name", values="10", mart=mart ) @ We now have <>= head(martReply) @ Note that \Rfunction{transcription\_start} is always smaller than \Rfunction{transcription\_end}, even when the transcript is on the ``$-$'' strand. Hence, we have to use either the start or the end coordinate of the transcript, depending on the strand, to get the actual transcription start sites, i.e., the 5' ends of the transcripts:\footnote{The version of this document that we distributed at the course neglected this point. Many thanks to Ludo Pagie who spotted this mistake.} <>= tss <- ifelse( martReply$strand == 1, martReply$transcript_start, martReply$transcript_end ) @ This allows us to call our function: <>= win = c(-500, 500) H3K4me1 <- deltaConvolve( coverage.chr10.me1, tss, martReply$strand == -1, left=win[1], right=win[2] ) H3K4me3 <- deltaConvolve( coverage.chr10.me3, tss, martReply$strand == -1, left=win[1], right=win[2] ) @ We can plot this and get two curves similar to those in Barski et al.'s paper. <>= plotfun = function( win, ... ) { y = cbind( ... ) matplot( win[1]:win[2], y, type='l', lty=1, lwd=2, ylab="average coverage", xlab="distance from TSS" ) legend( "topleft", colnames(y), fill=seq_len(ncol(y)) ) } <>= plotfun( win, H3K4me1, H3K4me3 ) @ The result is shown in the left panel of Figure~\ref{fig:tss}. As we used data from only one chromosome, ours is less smooth than theirs. \begin{figure} \begin{center} \includegraphics[width=0.49\textwidth]{TSS_plot-plot1} \includegraphics[width=0.49\textwidth]{TSS_plot-plot2} \caption{\label{fig:tss}Histone modification signals, averaged over transcripts originating from chromosome 10 and aligned by TSS. Compare to Fig.~2 B and D in \cite{Barski}.} \end{center} \end{figure} \begin{figure} \begin{center} \includegraphics[width=0.7\textwidth]{TSS_plot-wideWindow} \caption{\label{fig:wide}The same plot as in Fig.\ \ref{fig:tss}b, but with a wider window around the TSSs.} \end{center} \end{figure} It might be more appropriate to normalize the two curves by the respective library sizes: <>= H3K4me1.N = H3K4me1 / sum(coverage.chr10.me1) H3K4me3.N = H3K4me3 / sum(coverage.chr10.me3) <>= plotfun( win, H3K4me1.N, H3K4me3.N ) @ The result is shown in the right panel of Figure~\ref{fig:tss}. A wider range of the window might be helpful, too. hence, the same again with a window of width 6,000: <>= win.w = c(-3000, 3000) H3K4me1.w <- deltaConvolve(coverage.chr10.me1, tss, martReply$strand == -1, left=win.w[1], right=win.w[2]) H3K4me3.w <- deltaConvolve(coverage.chr10.me3, tss, martReply$strand == -1, left=win.w[1], right=win.w[2]) H3K4me1.w.N = H3K4me1.w / sum(coverage.chr10.me1) H3K4me3.w.N = H3K4me3.w / sum(coverage.chr10.me3) plotfun( win.w, H3K4me1.w.N, H3K4me3.w.N ) @ The output is shown in Fig.\ \ref{fig:wide}. \begin{thebibliography}{10} \bibitem{Barski} A. Barski, S. Cuddapah, K. Cui, T.-Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, K. Zhao. \newblock High-resolution proling of histone methylations in the human genome. \newblock \textit{Cell} 129:823--837 (2007) \end{thebibliography} \end{document}