% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{fastseg: Manual for the R package} %\VignetteDepends{fastseg} %\VignettePackage{fastseg} %\VignetteKeywords{copy number analysis, segmentation, % CNV, copy number variant} \documentclass[article]{bioinf} \usepackage[noae]{Sweave} \usepackage{amsmath,amssymb} \usepackage{hyperref} \usepackage{float} \usepackage[authoryear]{natbib} \hypersetup{colorlinks=false, pdfborder=0 0 0, pdftitle={fastseg: a fast segmentation algorithm}, pdfauthor={G\"unter Klambauer}} \title{{\Huge fastseg}\\[5mm] An R Package for fast segmentation} \author{G\"unter Klambauer and Andreas Mitterecker} \affiliation{Institute of Bioinformatics, Johannes Kepler University Linz\\Altenberger Str. 69, 4040 Linz, Austria\\ \email{klambauer@bioinf.jku.at}} \newcommand{\fastseg}{\texttt{fastseg}} \newcommand{\method}[1]{{\fontfamily{phv}\fontshape{rm}\selectfont #1}} \newcommand{\R}{R} \newcommand{\Real}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} \setkeys{Gin}{width=0.55\textwidth} \SweaveOpts{eps=FALSE} \begin{document} <>= options(width=80) set.seed(0) fastsegVersion <- packageDescription("fastseg")$Version @ \newcommand{\fastsegVersion}{\Sexpr{fastsegVersion}} \manualtitlepage[Version \fastsegVersion, \today] \section*{Scope and Purpose of this Document} This document is a user manual for the \R\ package \fastseg. It is only meant as a gentle introduction into how to use the basic functions implemented in this package. Not all features of the \R\ package are described in full detail. Such details can be obtained from the documentation enclosed in the \R\ package. Further note the following: (1) this is neither an introduction to segmentation algorithms; (2) this is not an introduction to \R. If you lack the background for understanding this manual, you first have to read introductory literature on these subjects. \vspace{1cm} \newlength{\auxparskip} \setlength{\auxparskip}{\parskip} \setlength{\parskip}{0pt} \tableofcontents \clearpage \setlength{\parskip}{\auxparskip} \newlength{\Nboxwidth} \setlength{\Nboxwidth}{\textwidth} \addtolength{\Nboxwidth}{-2\fboxrule} \addtolength{\Nboxwidth}{-2\fboxsep} \newcommand{\notebox}[1]{ \begin{center} \fbox{\begin{minipage}{\Nboxwidth} \noindent{\sffamily\bfseries Note:} #1 \end{minipage}} \end{center}} \section{Introduction} \fastseg\ implements a very fast and efficient segmentation algorithm. It has similar functionality as {\tt DNACopy} \citep{Olshen:04} but is considerably faster and more flexible. \fastseg\ can segment data stemming from DNA microarrays and data stemming from next generation sequencing for example to detect copy number segments. Further it can segment data stemming from RNA microarrays like tiling arrays to identify transcripts. Most generally, it can segment data given as a matrix or as a vector. Various data formats can be used as input to \fastseg\ like expression set objects for microarrays or GRanges for sequencing data. The segmentation criterion of \fastseg\ is based on a statistical test in a Bayesian framework, namely the cyber t-test \citep{Baldi:01}. The speed-up stems from the facts, that sampling is not necessary in for \fastseg\ and that a dynamic programming approach is used for calculation of the segments' first and higher order moments. For further information regarding the algorithm and its assessment see the \method{fastseg} homepage at \url{http://www.bioinf.jku.at/software/fastseg/fastseg.html} \section{Getting started} To load the package, enter the following in your \R\ session: <<>>= library(fastseg) @ \subsection{Data} According to the DNAcopy package from bioconductor we selected a subset of the data set presented in \citep{Snijders:01}. This data set will be called {\tt coriell}. The data correspond to two array CGH studies of fibroblast cell strains. \footnote{\url{http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html}} In particular, the studies {\bf GM05296} and {\bf GM13330} were chosen. After selecting only the mapped data from chromosomes 1-22 and X, there are 2271 data points. To prepare the data for our examples we execute the following code: <>= data(coriell) head(coriell) samplenames <- colnames(coriell)[4:5] data <- as.matrix(coriell[4:5]) #data[is.na(data)] <- median(data, na.rm=TRUE) chrom <- coriell$Chromosome maploc <- coriell$Position @ The main functions of the package are {\tt fastseg} and {\tt toDNAcopyObj}. The first on runs the segmentation algorithm and the latter converts the segmentation results the a DNAcopy object which will be quite helpful for plot functions. \subsection{File formats} The package can handle different file formats: GRanges, ExpressionSet objects, matrix or a vector. \subsubsection{Vector} <>= data2 <- data[, 1] res <- fastseg(data2) head(res) @ \subsubsection{Matrix} <>= data2 <- data[1:400, ] res <- fastseg(data2) head(res) @ \subsubsection{GRanges objects} <>= library("GenomicRanges") ## with both individuals gr <- GRanges(seqnames=chrom, ranges=IRanges(maploc, end=maploc)) mcols(gr) <- data colnames(mcols(gr)) <- samplenames res <- fastseg(gr) head(res) ## with one individual gr2 <- gr data2 <- as.matrix(data[, 1]) colnames(data2) <- "sample1" mcols(gr2) <- data2 res <- fastseg(gr2) head(res) @ \subsubsection{ExpressionSet objects} <>= library(oligo) eSet <- new("ExpressionSet") assayData(eSet) <- list(intensity=data) featureData(eSet) <- new("AnnotatedDataFrame", data=data.frame( chrom = paste("chr",chrom,sep=""), start = maploc, end = maploc,stringsAsFactors=FALSE)) phenoData(eSet) <- new("AnnotatedDataFrame", data=data.frame(samples=samplenames)) sampleNames(eSet) <- samplenames res <- fastseg(eSet) head(res) @ \subsubsection{Vector} <>= data2 <- data[, 1] res <- fastseg(data2) head(res) @ \subsubsection{Matrix} <>= data2 <- data[1:400, ] res <- fastseg(data2) head(res) @ \subsection{Plotting the segmentation results} \noindent For plotting the data we have to generate an {\tt DNAcopy} object out of the segmentation results: <<>>= ## with both individuals gr <- GRanges(seqnames=chrom, ranges=IRanges(maploc, end=maploc)) mcols(gr) <- data colnames(mcols(gr)) <- samplenames res <- fastseg(gr,segMedianT=0.2) @ The plotting is done via the {\tt plot} function of {\tt DNAcopy}: \begin{center} <>= segPlot(gr,res, plot.type="w") @ \end{center} Or alternatively: \begin{center} <>= segPlot(gr,res, plot.type="s") @ \end{center} \subsection{Performance of the method} Here we show that \fastseg\ outperforms {\tt DNAcopy} with respect to computational time on summarized microarray data. The quality of the segmentation result of both \fastseg\ and {\tt DNAcopy} depends strongly on the methods' parameters.\\ The data is a small subset of copy number calls which were produced by the {\tt cn.farms} algorithm \cite{Clevert:11} from an Affymetrix SNP microarray experiment of a HapMap sample. <>= data(fastsegData) system.time(res <- fastseg(fastsegData)) @ \begin{center} <>= segPlot(fastsegData,res, plot.type="w") @ \end{center} <>= library(DNAcopy) cna <- DNAcopy::CNA(fastsegData,chrom="chr1",maploc=1:length(fastsegData)) system.time(res2 <- DNAcopy::segment(cna)) @ \begin{center} <>= plot(res2, plot.type="w", xmaploc=TRUE) @ \end{center} \section{Future Extensions} We are planning to program a parallelized version of this package. Furthermore we will enhance the plot functions by our own. \section{How to cite this package} If you use this package for research that is published later, you are kindly asked to cite it as follows: \citep{Klambauer:12}. To obtain Bib\TeX\ entries of the two references, you can enter the following into your R session: <>= toBibtex(citation("fastseg")) @ \bibliographystyle{apalike} %\bibliographystyle{natbib} \bibliography{fastseg} \end{document}