%\VignetteIndexEntry{Data preprocessing and creation of the data objects pasillaGenes and pasillaExons} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \usepackage{cite} \usepackage{color} \usepackage{url} \usepackage{listings} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,height=4.5,width=4} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}} \title{\textsf{\textbf{Data preprocessing and creation of the data objects auxiliary for the DEXSeq package}}} \author{Alejandro Reyes} \begin{document} \maketitle \begin{abstract} This vignette describes the steps that were followed for the generation of the data objects contained in the package \Rpackage{pasilla}. \end{abstract} \tableofcontents %-------------------------------------------------- \section{Downloading the files} %-------------------------------------------------- We used the RNA-Seq data from the publication by Brooks et al.~\cite{Brooks2010}. The experiment investigated the effect of siRNA knock-down of pasilla, a gene that is known to bind to mRNA in the spliceosome, and which is thought to be involved in the regulation of splicing. The data set contains 3 biological replicates of the knockdown as well as 4 biological replicates for the untreated control. Data files are publicly available in the NCBI Gene Expression Omnibus under the accession GSE18508\footnote{\url{http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508}}. The read sequences in FASTQ format were extracted from the NCBI short read archive file (.sra files), using the sra toolkit\footnote{\url{http://www.ncbi.nlm.nih.gov/books/NBK47540/\#SRA\_Download\_Guid\_B.5\_Converting\_SRA\_for}}. \\ <>= tab = data.frame( file = c("treated1fb", "treated2fb", "treated3fb", "untreated1fb", "untreated2fb", "untreated3fb", "untreated4fb"), type = c("single-read", "paired-end", "paired-end", "single-read", "single-read", "paired-end", "paired-end"), "number of lanes" = as.integer(c(5,2,2,2,6,2,2)), "total number of reads" = c("35158667", "12242535 (x2)", "12443664 (x2)", "17812866", "34284521", "10542625 (x2)", "12214974 (x 2)"), "exon counts" = as.integer(c(15679615, 15620018, 12733865, 14924838, 20764558, 10283129, 11653031)), stringsAsFactors = TRUE, check.names = FALSE) library("xtable") print(xtable(tab, caption = "Read numbers and alignment statistics. The column \\emph{exon counts} refers to the number of reads that could be uniquely aligned to an exon.", label="tab:samples"), file="create_objects_tabsamples.tex") @ %-------------------------------------------------- \section{Read alignment and filtering} %-------------------------------------------------- The reads in the FASTQ files were aligned using tophat version 1.2.0 with default parameters against the reference Drosophila melanogaster genome. Table~\ref{tab:samples} summarizes the read number and alignment statistics. \input{create_objects_tabsamples.tex} The reference genome fasta files were obtained from the Ensembl ftp server\footnote{\url{http://www.ensembl.org/info/data/ftp/index.html}}. We ran \texttt{bowtie-build} to index the fasta file. For more information on this procedure see the bowtie webpage\footnote{\url{http://bowtie-bio.sourceforge.net/tutorial.shtml}}. The indexed form is required by bowtie, and thus tophat. \begin{lstlisting} wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/ \ dna/Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz gunzip Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz bowtie-build Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa \ d_melanogaster_BDGP5.25.62 \end{lstlisting} We generated the alignment BAM file using tophat. For the single-reads data: \begin{lstlisting} tophat bowtie_index reads1.fastq,reads2.fastq,...,readsN.fastq \end{lstlisting} For the paired-end data: \begin{lstlisting} tophat -r inner-fragment-size bowtie_index \ reads1_1.fastq,reads2_1.fastq,...,readsN_1.fastq \ reads1_2.fastq,reads2_2.fastq,...,readsN_2.fastq \end{lstlisting} More information on tophat is provided on its webpage\footnote{\url{http://tophat.cbcb.umd.edu/tutorial.html}}. The SAM alignment files from which \Rpackage{pasilla} was generated are available at \url{http://www-huber.embl.de/pub/DEXSeq/analysis/brooksetal/bam/}. %-------------------------------------------------- \section{Exon count files} %-------------------------------------------------- To generate the per-exon read counts, we first needed to define the exonic regions. To this end, we downloaded the file Drosophila\_melanogaster.BDGP5.25.62.gtf.gz from Ensembl\footnote{\url{ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila\_melanogaster}}. The script \texttt{dexseq\_prepare\_annotation.py} contained in the \Rpackage{DEXSeq} package was used to extract the exons of the transcripts from the file, define new non-overlapping exonic regions and reformat it to create the file Dmel.BDGP5.25.62.DEXSeq.chr.gff contained in pasilla/extdata. For example, for this file we ran: \begin{lstlisting} wget ftp://ftp.ensembl.org/pub/release-62/gtf/ \ drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz gunzip Drosophila_melanogaster.BDGP5.25.62.gtf.gz python dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.25.62.gtf \ Dmel.BDGP5.25.62.DEXSeq.chr.gff \end{lstlisting} To count the reads that fell into each non-overlapping exonic part, the script \texttt{dexseq\_count.py}, which is also contained in the \Rpackage{DEXSeq} package, was used. It took the alignment results in the form of a SAM file (sorted by position in the case of a paired end data) and the \texttt{gtf} file Dmel.BDGP5.25.62.DEXSeq.chr.gff and returned one file for each biological replicate with the exon counts. For example, for the file treated1.bam, which contained single-end alignments, we ran: \begin{lstlisting} samtools index treated1.bam samtools view treated1.bam > treated1.sam python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff \ treated1.sam treated1fb.txt \end{lstlisting} For the file treated2.bam, which contained paired-end alignments: \begin{lstlisting} samtools index treated2.bam samtools view treated2.bam > treated2.sam sort -k1,1 -k2,2n treated2.sam > treated2_sorted.sam python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff \ treated2_sorted.sam treated2fb.txt \end{lstlisting} The output of the two HTSeq python scripts is provided in the \Rpackage{pasilla} package: % <>= library("pasilla") <>= inDir = system.file("extdata", package="pasilla", mustWork=TRUE) dir(inDir) @ % The Python scripts are built upon the HTSeq library\footnote{\url{http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html}}. %-------------------------------------------------- \section{Creation of the \Rclass{DEXSeqDataSet} \Robject{dxd}} %-------------------------------------------------- <>= z = sub("fb$", "", tab$file) nc = nchar(z) samples = data.frame( condition = substr(z, 1, nc-1), # replicate = as.integer(substr(z, nc, nc)), type = tab$type, row.names = tab$file, stringsAsFactors = TRUE, check.names = FALSE) #stopifnot(!any(is.na(samples$replicate))) @ %$ To create an \Rclass{DEXSeqDataSet} object, we started with a data frame \Robject{samples} that contained the sample annotations, as in Table~\ref{tab:samples}. <>= head(samples) @ We also needed the annotation file with the per exon annotation. % <>= annotationfile = file.path(inDir, "Dmel.BDGP5.25.62.DEXSeq.chr.gff") @ With these, we could call the function \Rfunction{DEXSeqDataSet} to construct the object \Robject{dxd}. <>= library("DEXSeq") dxd = DEXSeqDataSetFromHTSeq( countfiles = file.path(inDir, paste(rownames(samples), "txt", sep=".")), sampleData=samples, design= ~ sample + exon + condition:exon, flattenedfile = annotationfile) @ We only wanted to work with data from a subset of genes, which was defined in the following file. % <>= genesforsubset = readLines(file.path(inDir, "geneIDsinsubset.txt")) dxd = dxd[geneIDs( dxd ) %in% genesforsubset,] @ We save our objects We saved the objects in the data directory of the package: <>= save(dxd, file=file.path("..", "data", "pasillaDEXSeqDataSet.RData")) @ %--------------------------------------------------------- % SessionInfo %--------------------------------------------------------- \begin{table*}[tbp] \begin{minipage}{\textwidth} <>= toLatex(sessionInfo()) @ \end{minipage} \caption{\label{tab:sessioninfo}% The output of \Rfunction{sessionInfo} on the build system after running this vignette.} \end{table*} %-------------------------------------------------- \bibliography{library} \bibliographystyle{plain} \end{document}