%\VignetteIndexEntry{A quick introduction to AneuFinder}
%\VignetteEngine{knitr::knitr}
\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\author{Aaron Taudt\thanks{\email{aaron.taudt@gmail.com}}}
\title{A quick introduction to AneuFinder}

\begin{document}
\maketitle
\tableofcontents
\clearpage
<<options, results='hide', message=FALSE, eval=TRUE, echo=FALSE>>=
library(AneuFinder)
options(width=90)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\Rpackage{AneuFinder} offers functionality for the study of copy number variations (CNV) in whole-genome single cell sequencing (WGSCS) data. Functionality implemented in this package includes:
\begin{itemize}
\item Copy number detection using a Hidden Markov Model on binned read counts.
\item Various plotting capabilities like genomewide heatmaps of copy number state and arrayCGH-like plots.
\item Export of copy number calls in BED format for upload to the UCSC genome browser.
\item Quality metrics.
\item Measures for addressing karyotype heterogeneity.
\end{itemize}

\section{Quickstart}

The main function of this package is called \Rfunction{Aneufinder} and performs all the necessary steps to get from aligned reads to interpretable output. \Rpackage{AneuFinder} offers two methods to call copy number variations: A Hidden Markov Model described in \cite{Bakker2016} and an approach using the \Rpackage{DNAcopy} package adopted from \cite{Garvin2015}.

<<eval=FALSE>>==
Aneufinder(inputfolder='folder-with-BAM-or-BED', outputfolder='output-directory',
           format='bam', numCPU=2, method=c('dnacopy','HMM'))
@

Although in most cases the above command will produce reasonably good results, it might be worthwile to adjust the default parameters to improve performance and the quality of the results (see section \ref{sec:workflow}). You can get a description of all available parameters by typing

<<eval=TRUE>>==
?Aneufinder
@

After the function has finished, you will find the folder \textbf{output-directory} containing all produced files and plots. This folder contains the following \emph{files} and \textbf{folders}:
\begin{itemize}
\item \emph{AneuFinder.config}: This file contains all the parameters that are necessary to reproduce your analysis. You can specify this file as

<<eval=FALSE>>==
Aneufinder(..., configfile='AneuFinder.config')
@

to run another analysis with the same parameter settings.
\item \emph{chrominfo.tsv}: A tab-separated file with chromosome length information.
\item \textbf{binned}: This folder contains the binned data. If you chose a correction method, you will also see a folder like 'binned-GC' in case of GC correction. You can load the data with

<<eval=FALSE>>==
files <- list.files('output-directory/binned', full.names=TRUE)
binned.data <- loadFromFiles(files)
@

\item \textbf{BROWSERFILES}: A folder which contains BED files with copy number calls that can be uploaded to the UCSC genome browser. If reads.store=TRUE it will also contain a subfolder \textbf{data} with the mapped reads in BED format.
\item \textbf{MODELS}: A folder with all produced Hidden Markov Models. You can load the results for further processing, such as quality control and customized plotting.

<<eval=FALSE>>==
files <- list.files('output-directory/MODELS/method-HMM', full.names=TRUE)
hmms <- loadFromFiles(files)
cl <- clusterByQuality(hmms)
heatmapGenomewide(cl$classification[[1]])
@

\item \textbf{PLOTS}: All plots that are produced by default will be stored here.
\item \textbf{data}: Only produced if reads.store=TRUE. This folder stores all the read data as RData objects. This exists mostly for internal usage.
\end{itemize}

\section{\label{sec:workflow}A detailed workflow}

\subsection{\label{sec:mapcor}Mappability correction}
The first step of your workflow should be the production of a reference file for mappability correction. Mappability correction is done via a variable-width binning approach (as compared to fixed-width bins) and requires a euploid reference. You can either simulate this reference file or take a real euploid reference. For optimal results we suggest to use a real reference, e.g. by merging BAM files of single cells from a euploid reference tissue. This can be achieved with the 'samtools merge' command (not part of R). Be careful: All CNVs that are present in the reference will lead to artifacts in the analysis later. This includes sex-chromosomes that are present in one copy only, so we advice to use a female reference and to exclude the Y-chromosome from the analysis. If you have no reference available, you can simulate one with the \Rfunction{simulateReads} command:

<<eval=TRUE, message=FALSE, warning=FALSE>>==
## Load human genome
library(BSgenome.Hsapiens.UCSC.hg19)

## Get a BAM file for the estimation of quality scores (adjust this to your experiment)
bamfile <- system.file("extdata", "BB150803_IV_074.bam", package="AneuFinderData")

## Simulate reads of length 51bp for human genome
# We simulate reads every 50000bp for demonstration purposes, for a real
# application you should use a much denser spacing (e.g. 500bp or less)
simulatedReads.file <- tempfile() # replace this with your destination file
simulateReads(BSgenome.Hsapiens.UCSC.hg19, readLength=51, bamfile=bamfile,
              file=simulatedReads.file, every.X.bp=50000)
@


This simulated FASTQ file must then be aligned with your aligner of choice (ideally the same that you used for your other samples) and given as reference in the \Rfunction{Aneufinder} function (option \Rcode{variable.width.reference}).

\subsection{\label{sec:blacklist}Blacklisting}
To further improve the quality of the results and remove artifacts caused by high mappability repeat regions, e.g. near centromers, a blacklist can be used in option \Rcode{blacklist} of the \Rfunction{Aneufinder} function. All reads falling into the regions specified by the blacklist will be discarded when importing the read files.
You can either download a blacklist from the UCSC genome browser, e.g. the ``DAC Blacklisted Regions from ENCODE/DAC(Kundaje)'' mappability track, or make your own. For optimal results, we advice to make your own blacklist from a euploid reference. The following code chunck takes a euploid reference and makes fixed-width bins of 100kb. Bins with read count above and beloow the 0.999 and 0.05 quantile are taken as blacklist:

<<eval=TRUE, warning=FALSE, message=FALSE, fig.width=9.5, fig.height=4, out.width='\\textwidth'>>==
## Get a euploid reference (adjust this to your experiment)
bedfile <- system.file("extdata", "hg19_diploid.bam.bed.gz", package="AneuFinderData")

## Make 100kb fixed-width bins
bins <- binReads(bedfile, assembly='hg19', binsizes=100e3,
                 chromosomes=c(1:22,'X'))[[1]]
## Make a plot for visual inspection and get the blacklist
lcutoff <- quantile(bins$counts, 0.05)
ucutoff <- quantile(bins$counts, 0.999)
p <- plot(bins) + coord_cartesian(ylim=c(0,50))
p <- p + geom_hline(aes(yintercept=lcutoff), color='red')
p <- p + geom_hline(aes(yintercept=ucutoff), color='red')
print(p)
## Select regions that are above or below the cutoff as blacklist
blacklist <- bins[bins$counts <= lcutoff | bins$counts >= ucutoff]
blacklist <- reduce(blacklist)
## Write blacklist to file
blacklist.file <- tempfile()
exportGRanges(blacklist, filename=blacklist.file, header=FALSE,
              chromosome.format='NCBI')
@


\subsection{Running Aneufinder}
The function \Rfunction{Aneufinder} takes an input folder with BAM or BED files and produces an output folder with results, plots and browserfiles. The following code is an example of how to run \Rfunction{Aneufinder} with variable-width bins (see section \ref{sec:mapcor}), blacklist (see section \ref{sec:blacklist}) and GC-correction. Results will be stored in 'outputfolder/hmms' as RData objects for further processing such as quality filtering and customized plotting.


<<eval=TRUE, message=FALSE>>==
## First, get some data and reference files (adjust this to your experiment)
var.width.ref <- system.file("extdata", "hg19_diploid.bam.bed.gz", package="AneuFinderData")
blacklist <- system.file("extdata", "blacklist-hg19.bed.gz", package="AneuFinderData")
datafolder <- system.file("extdata", "B-ALL-B", package = "AneuFinderData")
list.files(datafolder) # only 3 cells for demonstration purposes

## Library for GC correction
library(BSgenome.Hsapiens.UCSC.hg19)

## Produce output files
cnv.states <- c('zero-inflation', paste0(1:10, '-somy'))
outputfolder <- tempdir()
Aneufinder(inputfolder = datafolder, outputfolder = outputfolder, assembly = 'hg19',
           numCPU = 3, binsizes = c(5e5, 1e6), variable.width.reference = var.width.ref,
           chromosomes = c(1:22,'X','Y'), blacklist = blacklist, states = cnv.states,
           correction.method = 'GC', GC.BSgenome = BSgenome.Hsapiens.UCSC.hg19,
           num.trials = 1)
@


\subsection{Quality control}
Once the function \Rfunction{Aneufinder} has completed, results will be accessible as .RData files under 'outputfolder/hmms'. Single cell sequencing is prone to noise and therefore it is a good idea to filter the results by quality to retain only high-quality cells. We found that simple filtering procedures such as cutoffs on the total number of reads etc., are insufficient to distinguish good from bad-quality libraries. Therefore, we have implemented a multivariate clustering approach that works on multiple quality metrics (see ?clusterByQuality for details) to increase robustness of the filtering. Here is an example demonstrating the usage of \Rfunction{clusterByQuality}.


<<eval=TRUE, message=FALSE, fig.align='center', out.width='0.8\\textwidth'>>==
## Get some pre-produced results (adjust this to your experiment)
results <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData")
files <- list.files(results, full.names=TRUE)

## Cluster by quality, please type ?getQC for other available quality measures
cl <- clusterByQuality(files, measures=c('spikiness','num.segments','entropy','bhattacharyya','sos'))
plot(cl$Mclust, what='classification')
print(cl$parameters)
## Apparently, the last cluster corresponds to failed libraries
## while the first cluster contains high-quality libraries
@

<<eval=TRUE, message=FALSE, fig.width=50, fig.height=10, fig.align='center', out.width='\\textwidth'>>==
## Select the two best clusters and plot it
selected.files <- unlist(cl$classification[1:2])
heatmapGenomewide(selected.files)
@


\subsection{Karyotype measures}
This package implements two measures to quantify karyotype heterogeneity, an \emph{aneuploidy} and a \emph{heterogeneity} score. Both measures are independent of the number of cells, the length of the genome, and take into account every position in the genome. The following example compares the heterogeneity and aneuploidy between a primary lung cancer and the corresponding liver metastasis.


<<eval=TRUE, message=FALSE>>==
## Get some pre-produced results (adjust this to your experiment)
results <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData")
files.lung <- list.files(results, full.names=TRUE)
results <- system.file("extdata", "metastasis-liver", "hmms", package="AneuFinderData")
files.liver <- list.files(results, full.names=TRUE)

## Get karyotype measures
k.lung <- karyotypeMeasures(files.lung)
k.liver <- karyotypeMeasures(files.liver)

## Print the scores in one data.frame
df <- rbind(lung = k.lung$genomewide, liver = k.liver$genomewide)
print(df)
## While the aneuploidy is similar between both cancers, the heterogeneity is
## nearly twice as high for the primary lung cancer.
@


\section{Session Info}

<<sessionInfo, results="asis", eval=TRUE>>=
toLatex(sessionInfo())
@

\bibliography{references}


\end{document}