%\VignetteIndexEntry{A quick introduction to AneuFinder} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= BiocStyle::latex() @ \author{Aaron Taudt\thanks{\email{aaron.taudt@gmail.com}}} \title{A quick introduction to AneuFinder} \begin{document} \maketitle \tableofcontents \clearpage <>= library(AneuFinder) options(width=90) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% NOTE: This is the vignette for AneuFinder v1.6.0 \section{Introduction} \Rpackage{AneuFinder} offers functionality for the study of copy number variations (CNV) in whole-genome single cell sequencing (WGSCS) data. Functionality implemented in this package includes: \begin{itemize} \item Copy number detection using a Hidden Markov Model or changepoint-algorithm on binned read counts. \item Various plotting capabilities like genomewide heatmaps of copy number state and arrayCGH-like plots. \item Export of copy number calls in BED format for upload to the UCSC genome browser. \item Quality metrics. \item Measures for addressing karyotype heterogeneity. \end{itemize} \section{Quickstart} The main function of this package is called \Rfunction{Aneufinder()} and performs all the necessary steps to get from aligned reads to interpretable output. \Rpackage{AneuFinder} offers three methods to call copy number variations: (1) A Hidden Markov Model described in \cite{Bakker2016}, (2) an approach based on the \Rpackage{DNAcopy} package adopted for single cells from \cite{Garvin2015}, and (3) an approach described in \cite{Taudt2018} using the edivisive-algorithm from the \Rpackage{ecp} package.\footnote{Please cite \cite{Bakker2016} if you use the \emph{HMM} results. Please cite \cite{Bakker2016} and \cite{Garvin2015} if you use the \emph{dnacopy} results. Please cite \cite{Bakker2016} and \cite{Taudt2018} if you use the default \emph{edivisive} approach.} We recommend using the "edivisive" method. <>== library(AneuFinder) Aneufinder(inputfolder='folder-with-BAM-or-BED', outputfolder='output-directory', numCPU=2, method=c('edivisive', 'dnacopy','HMM')) @ The above command will produce reasonably good results. However, quality of the results can be significantly improved by applying \emph{mappability correction}, \emph{blacklisting} and \emph{GC-content correction} (see section \ref{sec:workflow}). A description of all available parameters can be obtained by typing <>== ?Aneufinder @ After the function has finished, you will find the folder \textbf{output-directory} containing all produced files and plots. This folder contains the following \emph{files} and \textbf{folders}: \begin{itemize} \item \emph{AneuFinder.config}: This file contains all the parameters that are necessary to reproduce your analysis. You can specify this file as <>== Aneufinder(..., configfile='AneuFinder.config') @ to run another analysis with the same parameter settings. \item \emph{chrominfo.tsv}: A tab-separated file with chromosome length information. \item \textbf{binned}: This folder contains the binned data. If you chose a correction method, you will also see a folder like \textbf{binned-GC} in case of GC correction. You can load the data with <>== files <- list.files('output-directory/binned', full.names=TRUE) binned.data <- loadFromFiles(files) @ \item \textbf{BROWSERFILES}: A folder which contains BED files with copy number calls and breakpoint locations that can be uploaded to the UCSC genome browser. If reads.store=TRUE it will also contain a subfolder \textbf{data} with the mapped reads in BED format. \item \textbf{MODELS}: A folder with all produced Hidden Markov Models. You can load the results for further processing, such as quality control and customized plotting. The folder might be named \textbf{MODELS\_refined} or \textbf{MODELS-StrandSeq}, depending on whether you chose to analyze Strand-seq data or refine breakpoints. <>== files <- list.files('output-directory/MODELS/method-edivisive', full.names=TRUE) hmms <- loadFromFiles(files) cl <- clusterByQuality(hmms) heatmapGenomewide(cl$classification[[1]]) @ \item \textbf{PLOTS}: All plots that are produced by default will be stored here. \item \textbf{data}: Only produced if reads.store=TRUE. This folder stores all the read data as RData objects. This exists mostly for internal usage, namely to estimate confidence intervals and refine breakpoints. \end{itemize} \section{\label{sec:workflow}A detailed workflow} \subsection{\label{sec:mapcor}Mappability correction} The first step of your workflow should be the production of a reference file for mappability correction. Mappability correction is done via a variable-width binning approach (as compared to fixed-width bins) and requires a euploid reference. You can either simulate this reference file or take a real euploid reference. For optimal results we suggest to use a real reference, e.g. by merging BAM files of single cells from a euploid reference tissue. This can be achieved with the 'samtools merge' command (not part of R). Be careful: All CNVs that are present in the reference will lead to artifacts in the analysis later. This includes sex-chromosomes that are present in one copy only, so we advice to use a female reference and to exclude the Y-chromosome from the analysis. If you have no reference available, you can simulate one with the \Rfunction{simulateReads()} command. You can also skip this step and continue without mappability correction. The algorithm will use fixed-width bins in this case. <>== library(AneuFinder) ## Load human genome library(BSgenome.Hsapiens.UCSC.hg19) ## Get a BAM file for the estimation of quality scores (adjust this to your experiment) bamfile <- system.file("extdata", "BB150803_IV_074.bam", package="AneuFinderData") ## Simulate reads of length 51bp for human genome # We simulate reads every 500000bp for demonstration purposes, for a real # application you should use a much denser spacing (e.g. 500bp or less) simulatedReads.file <- tempfile() # replace this with your destination file simulateReads(BSgenome.Hsapiens.UCSC.hg19, readLength=51, bamfile=bamfile, file=simulatedReads.file, every.X.bp=500000) @ This simulated FASTQ file must then be aligned with your aligner of choice (ideally the same that you used for your other samples) and given as reference in \Rfunction{Aneufinder(..., variable.width.reference="your-reference-bamfile")}. \subsection{\label{sec:blacklist}Blacklisting} To further improve the quality of the results and remove artifacts caused by high mappability repeat regions, e.g. near centromers, a blacklist can be used with option \Rfunction{Aneufinder(..., blacklist)}. All reads falling into the regions specified by the blacklist will be discarded when importing the read files. You can either download a blacklist from the UCSC genome browser, e.g. the ``DAC Blacklisted Regions from ENCODE/DAC(Kundaje)'' mappability track, or make your own. For optimal results, we advice to make your own blacklist from a euploid reference. The following code chunck takes a euploid reference and makes fixed-width bins of 100kb. Bins with read count above and below the 0.999 and 0.05 quantile are taken as blacklist: <>== library(AneuFinder) ## Get a euploid reference (adjust this to your experiment) bedfile <- system.file("extdata", "hg19_diploid.bam.bed.gz", package="AneuFinderData") ## Make 100kb fixed-width bins bins <- binReads(bedfile, assembly='hg19', binsizes=100e3, chromosomes=c(1:22,'X'))[[1]] ## Make a plot for visual inspection and get the blacklist lcutoff <- quantile(bins$counts, 0.05) ucutoff <- quantile(bins$counts, 0.999) p <- plot(bins) + coord_cartesian(ylim=c(0,50)) p <- p + geom_hline(aes(yintercept=lcutoff), color='red') p <- p + geom_hline(aes(yintercept=ucutoff), color='red') print(p) ## Select regions that are above or below the cutoff as blacklist blacklist <- bins[bins$counts <= lcutoff | bins$counts >= ucutoff] blacklist <- reduce(blacklist) ## Write blacklist to file blacklist.file <- tempfile() exportGRanges(blacklist, filename=blacklist.file, header=FALSE, chromosome.format='NCBI') @ \subsection{\label{sec:GCcor}GC-content correction} GC-content can greatly bias the number of read fragments in a WGSCS-experiment. Therefore it is essential to apply a GC-correction step when calling copy-number-variations with \Rfunction{Aneufinder()}. GC-correction can be easily enabled with <>== library(AneuFinder) ## Library for GC correction library(BSgenome.Hsapiens.UCSC.hg19) ## Produce output files outputfolder <- tempdir() Aneufinder(inputfolder = datafolder, outputfolder = outputfolder, assembly = 'hg19', correction.method = 'GC', GC.BSgenome = BSgenome.Hsapiens.UCSC.hg19, method = 'edivisive') @ \subsection{Running Aneufinder} The function \Rfunction{Aneufinder()} takes an input folder with BAM or BED files and produces an output folder with results, plots and browserfiles. The following code is an example of how to run \Rfunction{Aneufinder()} with variable-width bins (see section \ref{sec:mapcor}), blacklist (see section \ref{sec:blacklist}) and GC-correction. Results will be stored in \textbf{outputfolder/MODELS} as RData objects for further processing such as quality filtering and customized plotting. <>== library(AneuFinder) ## First, get some data and reference files (adjust this to your experiment) var.width.ref <- system.file("extdata", "hg19_diploid.bam.bed.gz", package="AneuFinderData") blacklist <- system.file("extdata", "blacklist-hg19.bed.gz", package="AneuFinderData") datafolder <- system.file("extdata", "B-ALL-B", package = "AneuFinderData") list.files(datafolder) # only 3 cells for demonstration purposes ## Library for GC correction library(BSgenome.Hsapiens.UCSC.hg19) ## Produce output files outputfolder <- tempdir() Aneufinder(inputfolder = datafolder, outputfolder = outputfolder, assembly = 'hg19', numCPU = 3, binsizes = c(5e5, 1e6), variable.width.reference = var.width.ref, chromosomes = c(1:22,'X','Y'), blacklist = blacklist, correction.method = 'GC', GC.BSgenome = BSgenome.Hsapiens.UCSC.hg19, refine.breakpoints=FALSE, method = 'edivisive') @ \subsection{Loading results and plotting single cells} Once the function \Rfunction{Aneufinder()} has completed, results will be accessible as \textit{.RData} files under \textbf{outputfolder/MODELS}. You can load the results into R using <>== library(AneuFinder) files <- list.files('outputfolder/MODELS/method-edivisive', full.names=TRUE) models <- loadFromFiles(files) @ Here are some examples of the plotting functions that are available. Most of these plots are also produced by default by \Rfunction{Aneufinder()} and are available as PDF in \textbf{outputfolder/PLOTS}. <>== ## Get some pre-produced results (adjust this to your experiment) results <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData") files <- list.files(results, full.names=TRUE) plot(files[1], type='profile') plot(files[1], type='histogram') plot(files[1], type='karyogram') @ \subsection{Quality control} Once the function \Rfunction{Aneufinder()} has completed, results will be accessible as .RData files under \textbf{outputfolder/MODELS}. Single cell sequencing is prone to noise and therefore it is a good idea to filter the results by quality to retain only high-quality cells. We found that simple filtering procedures such as cutoffs on the total number of reads etc., are insufficient to distinguish good from bad-quality libraries. Therefore, we have implemented a multivariate clustering approach that works on multiple quality metrics (see ?clusterByQuality for details) to increase robustness of the filtering. Here is an example demonstrating the usage of \Rfunction{clusterByQuality()}. <>== library(AneuFinder) ## Get some pre-produced results (adjust this to your experiment) results <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData") files <- list.files(results, full.names=TRUE) ## Cluster by quality, please type ?getQC for other available quality measures cl <- clusterByQuality(files, measures=c('spikiness','num.segments','entropy','bhattacharyya','sos')) plot(cl$Mclust, what='classification') print(cl$parameters) ## Apparently, the last cluster corresponds to failed libraries ## while the first cluster contains high-quality libraries @ <>== ## Select the two best clusters and plot it selected.files <- unlist(cl$classification[1:2]) heatmapGenomewide(selected.files) @ \subsection{Karyotype measures} This package implements two measures to quantify karyotype heterogeneity, an \emph{aneuploidy} and a \emph{heterogeneity} score. Both measures are independent of the number of cells, the length of the genome, and take into account every position in the genome. The following example compares the heterogeneity and aneuploidy between a primary lung cancer and the corresponding liver metastasis. <>== library(AneuFinder) ## Get some pre-produced results (adjust this to your experiment) results <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData") files.lung <- list.files(results, full.names=TRUE) results <- system.file("extdata", "metastasis-liver", "hmms", package="AneuFinderData") files.liver <- list.files(results, full.names=TRUE) ## Get karyotype measures k.lung <- karyotypeMeasures(files.lung) k.liver <- karyotypeMeasures(files.liver) ## Print the scores in one data.frame df <- rbind(lung = k.lung$genomewide, liver = k.liver$genomewide) print(df) ## While the aneuploidy is similar between both cancers, the heterogeneity is ## nearly twice as high for the primary lung cancer. plotHeterogeneity(hmms.list = list(lung=files.lung, liver=files.liver)) @ \subsection{Principal component analysis} The following code let's you plot a PCA for a selection of single cells. <>== ## Get results from a small-cell-lung-cancer lung.folder <- system.file("extdata", "primary-lung", "hmms", package="AneuFinderData") lung.files <- list.files(lung.folder, full.names=TRUE) ## Get results from the liver metastasis of the same patient liver.folder <- system.file("extdata", "metastasis-liver", "hmms", package="AneuFinderData") liver.files <- list.files(liver.folder, full.names=TRUE) ## Plot a clustered heatmap classes <- c(rep('lung', length(lung.files)), rep('liver', length(liver.files))) labels <- c(paste('lung',1:length(lung.files)), paste('liver',1:length(liver.files))) plot_pca(c(lung.files, liver.files), colorBy=classes, PC1=2, PC2=4) @ \section{Session Info} <>= toLatex(sessionInfo()) @ \bibliography{references} \end{document}