% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{fastseg: Manual for the R package}
%\VignetteDepends{fastseg}
%\VignettePackage{fastseg}
%\VignetteKeywords{copy number analysis, segmentation,
% CNV, copy number variant}

\documentclass[article]{bioinf}

\usepackage[noae]{Sweave}
\usepackage{amsmath,amssymb}
\usepackage{hyperref}
\usepackage{float}
\usepackage[authoryear]{natbib}

\hypersetup{colorlinks=false,
   pdfborder=0 0 0,
   pdftitle={fastseg: a fast segmentation algorithm},
   pdfauthor={G\"unter Klambauer}}

\title{{\Huge fastseg}\\[5mm]
  An R Package for fast segmentation}
\author{G\"unter Klambauer and Andreas Mitterecker}
\affiliation{Institute of Bioinformatics, Johannes Kepler University
Linz\\Altenberger Str. 69, 4040 Linz, Austria\\
\email{klambauer@bioinf.jku.at}}

\newcommand{\fastseg}{\texttt{fastseg}}
\newcommand{\method}[1]{{\fontfamily{phv}\fontshape{rm}\selectfont #1}}
\newcommand{\R}{R}
\newcommand{\Real}{\mathbb{R}}
\renewcommand{\vec}[1]{\mathbf{#1}}

\setkeys{Gin}{width=0.55\textwidth}

\SweaveOpts{eps=FALSE}

\begin{document}
<<echo=FALSE>>=
options(width=80)
set.seed(0)
fastsegVersion <- packageDescription("fastseg")$Version
@
\newcommand{\fastsegVersion}{\Sexpr{fastsegVersion}}
\manualtitlepage[Version \fastsegVersion, \today]

\section*{Scope and Purpose of this Document}

This document is a user manual for the \R\ package \fastseg.
It is only meant as a gentle introduction into how to use the basic
functions implemented in this package. Not all features of the \R\
package are described in full detail. Such details can be obtained
from the documentation enclosed in the  \R\ package. Further note
the following: (1) this is neither an introduction to segmentation  
algorithms; (2) this is not an introduction to \R.
If you lack the background for understanding this manual, you first
have to read introductory literature on these subjects.


\vspace{1cm}

\newlength{\auxparskip}
\setlength{\auxparskip}{\parskip}
\setlength{\parskip}{0pt}
\tableofcontents
\clearpage
\setlength{\parskip}{\auxparskip}

\newlength{\Nboxwidth}
\setlength{\Nboxwidth}{\textwidth}
\addtolength{\Nboxwidth}{-2\fboxrule}
\addtolength{\Nboxwidth}{-2\fboxsep}

\newcommand{\notebox}[1]{
\begin{center}
\fbox{\begin{minipage}{\Nboxwidth}
\noindent{\sffamily\bfseries Note:} #1
\end{minipage}}
\end{center}}

\section{Introduction}
\fastseg\ implements a very fast and efficient segmentation algorithm.
It has similar functionality as {\tt DNACopy} \citep{Olshen:04} but 
is considerably faster and more flexible. \fastseg\ can segment data 
stemming from DNA microarrays and data stemming 
from next generation sequencing for example to detect 
copy number segments. Further it can segment data stemming from RNA microarrays 
like tiling arrays to identify transcripts. Most generally, 
it can segment data given as a matrix or as a vector.
Various data formats can be used as input to \fastseg\ like  
expression set objects for microarrays or GRanges for sequencing data.

The  segmentation criterion of \fastseg\ is based on a statistical test 
in a Bayesian framework, namely the cyber t-test \citep{Baldi:01}.
The speed-up stems from the facts, that sampling is not necessary in for 
\fastseg\ and that a dynamic programming approach is used for calculation of the 
segments' first and higher order moments. 

For further information regarding the algorithm and its assessment 
see the \method{fastseg} homepage at 
\url{http://www.bioinf.jku.at/software/fastseg/fastseg.html}

\section{Getting started}

To load the package, enter the following in your \R\ session:
<<>>=
library(fastseg)
@

\subsection{Data}
According to the DNAcopy package from bioconductor we selected 
a subset of the data set presented in \citep{Snijders:01}.
This data set will be called {\tt coriell}.  The data correspond to
two array CGH studies of fibroblast cell strains.
\footnote{\url{http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html}}  
In particular, 
the studies {\bf GM05296} and {\bf GM13330} were chosen.  After selecting
only the mapped data from chromosomes 1-22 and X, there are 2271 data
points. 

To prepare the data for our examples we execute the following code:

<<echo=TRUE>>=
data(coriell)
head(coriell)

samplenames <- colnames(coriell)[4:5]
data <- as.matrix(coriell[4:5])
#data[is.na(data)] <- median(data, na.rm=TRUE)
chrom <- coriell$Chromosome
maploc <- coriell$Position
@

The main functions of the package are {\tt fastseg} and {\tt toDNAcopyObj}. 
The first on runs the segmentation algorithm and the latter converts the 
segmentation results the a DNAcopy object which will be quite helpful for plot 
functions. 

\subsection{File formats}
The package can handle different file formats: GRanges, ExpressionSet objects,
matrix or a vector. 

\subsubsection{Vector}
<<echo=TRUE>>=
data2 <- data[, 1]
res <- fastseg(data2)
head(res)


@

\subsubsection{Matrix}
<<echo=TRUE>>=
data2 <- data[1:400, ]
res <- fastseg(data2)
head(res)
@



\subsubsection{GRanges objects}
<<echo=TRUE>>=
library("GenomicRanges")

## with both individuals
gr <- GRanges(seqnames=chrom,
        ranges=IRanges(maploc, end=maploc))
mcols(gr) <- data
colnames(mcols(gr)) <- samplenames
res <- fastseg(gr)
head(res)



## with one individual
gr2 <- gr
data2 <- as.matrix(data[, 1])
colnames(data2) <- "sample1"
mcols(gr2) <- data2
res <- fastseg(gr2)
head(res)

@


\subsubsection{ExpressionSet objects}
<<echo=TRUE>>=
library(oligo)
eSet <- new("ExpressionSet")
assayData(eSet) <- list(intensity=data)

featureData(eSet) <- new("AnnotatedDataFrame", 
        data=data.frame(
                chrom = paste("chr",chrom,sep=""),
                start = maploc, 
                end   = maploc,stringsAsFactors=FALSE))
phenoData(eSet) <- new("AnnotatedDataFrame", 
        data=data.frame(samples=samplenames))
sampleNames(eSet) <- samplenames
res <- fastseg(eSet)
head(res)
@

\subsubsection{Vector}
<<echo=TRUE>>=
data2 <- data[, 1]
res <- fastseg(data2)
head(res)


@

\subsubsection{Matrix}
<<echo=TRUE>>=
data2 <- data[1:400, ]
res <- fastseg(data2)
head(res)
@

\subsection{Plotting the segmentation results}

\noindent For plotting the data we have to generate an {\tt DNAcopy} object out of the
segmentation results:

<<>>=
## with both individuals
gr <- GRanges(seqnames=chrom,
        ranges=IRanges(maploc, end=maploc))
mcols(gr) <- data
colnames(mcols(gr)) <- samplenames
res <- fastseg(gr,segMedianT=0.2)
@

The plotting is done via the {\tt plot} function of {\tt DNAcopy}:

\begin{center}
<<fig=TRUE>>=
segPlot(gr,res, plot.type="w")
@
\end{center} 

Or alternatively:

\begin{center}
<<fig=TRUE>>=
segPlot(gr,res, plot.type="s")
@
\end{center} 

\subsection{Performance of the method}
Here we show that \fastseg\ outperforms {\tt DNAcopy} with respect to
 computational time on summarized microarray data. The quality of the 
 segmentation result of both \fastseg\ and {\tt DNAcopy} depends strongly on 
 the methods' parameters.\\
 The data is a small subset of copy number calls which were
 produced by the {\tt cn.farms} algorithm \cite{Clevert:11} 
 from an Affymetrix SNP microarray experiment of a HapMap sample.

<<echo=TRUE>>=
data(fastsegData)
system.time(res <- fastseg(fastsegData))
@

\begin{center}
<<echo=TRUE,fig=TRUE>>=
segPlot(fastsegData,res, plot.type="w")
@
\end{center} 

<<echo=TRUE>>=
library(DNAcopy)
cna <- DNAcopy::CNA(fastsegData,chrom="chr1",maploc=1:length(fastsegData))
system.time(res2 <- DNAcopy::segment(cna))
@

\begin{center}
<<echo=TRUE,fig=TRUE>>=
plot(res2, plot.type="w", xmaploc=TRUE)
@
\end{center} 


\section{Future Extensions}
We are planning to program a parallelized version of this package. Furthermore 
we will enhance the plot functions by our own. 


\section{How to cite this package}

If you use this package for research that is published later, you are kindly
asked to cite it as follows:
\citep{Klambauer:12}.

To obtain Bib\TeX\ entries of the two references, you can enter the following
into your R session:
<<eval=FALSE>>=
toBibtex(citation("fastseg"))
@ 


\bibliographystyle{apalike}
%\bibliographystyle{natbib}
\bibliography{fastseg}

\end{document}