%\VignetteIndexEntry{Bioconductor LaTeX Style} %\VignettePackage{BiocStyle} %\VignetteEngine{utils::Sweave} \documentclass{article} \usepackage{amsmath} <>= BiocStyle::latex() @ \newcommand{\exitem}[3]{% \item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.% } \title{ddPCRclust --- An R package and Shiny app for automated analysis of multiplexed ddPCR data} \author{Benedikt G. Brink, Justin Meskas, Ryan R. Brinkman} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} Droplet digital PCR (ddPCR) is an emerging technology for quantifying DNA. By partitioning the target DNA into $\sim\,$20$\,$000 droplets, each serving as its own PCR reaction compartment, ddPCR has significantly increased sensitivity compared to other technologies for DNA quantification. However, manual analysis of the data is time consuming and algorithms for automated analysis of non-orthogonal, multiplexed ddPCR data are unavailable, presenting a major bottleneck for the advancement of ddPCR transitioning from low-throughput to high-throughput. During a ddPCR run, each genetic target is fluorescently labelled with a combination of two fluorophores (typically HEX and FAM), giving it a unique footprint in a two-dimensional space represented by the intensities per colour channel. The position of each droplet within this space reveals how many and, more importantly, which genetic targets it contains. Thus, droplets that contain the same targets cluster together. The number of positive droplets for each target determine its abundance. ddPCRclust is an R package for automated analysis of multiplexed ddPCR data. It can automatically analyse and visualise multiplexed, non-orthogonal ddPCR experiments with up to four targets per reaction. Results are on par with manual analysis, but only take minutes to compute instead of hours. The accompanying Shiny app ddPCRvis provides easy access to the functionalities of ddPCRclust through a web-browser based GUI. For details on the experimental setup please refer to Hughesman et al. \cite{hughesman2016robust}. \section{Input data}\label{input} The input data are one or multiple CSV files containing the raw data from Bio-Rad's droplet digital PCR systems (QX100 and QX200). Each file can be represented as a two-dimensional data frame. Each row within the data frame represents a single droplet, each column the respective intensities per colour channel. Load files from your file system with \Rfunction{readFiles}. \begin{table}[h!] \small \begin{tabular}{ll} Ch1 Amplitude & Ch2 Amplitude \\ 2360.098 & 6119.26953 \\ 2396.3916 & 1415.31665 \\ 2445.838 & 6740.79639 \\ 2451.63867 & 1381.74683 \\ 2492.55884 & 1478.19617 \\ 2519.6355 & 7082.25049 \\ \vdots & \vdots \end{tabular} \end{table} You can also set up a run template to tell ddPCRclust which and how many markers were used in each experiment. \begin{table}[h!] \small \begin{tabular}{lllllll} Well & Sample type & No of markers & Marker 1 & Marker 2 & Marker 3 & Marker 4 \\ B01 & Blood & 4 & a & b & c & d \\ G01 & FFPE & 4 & a & b & c & d \\ F02 & Blood & 3 & a & & c & d \\ D03 & FFPE & 3 & a & & c & d \\ A04 & FFPE & 4 & a & b & c & d \\ G07 & Cell line & 3 & a & & c & d \\ G08 & Cell line & 3 & a & & c & d \\ E09 & FFPE & 2 & & & c & d \end{tabular} \end{table} \section{Methods}\label{methods} Data from ddPCR consists of a number of different clusters $l_1, \dots, l_k$ and their respective centroids $c_1, \dots, c_k$, where $k$ is the number of clusters. All droplets ($x_1, \dots, x_m$) represent one or more genetic targets $t_1, \dots, t_n$, where $m$ is the number of droplets and $n$ is the number of targets. Each cluster $l_i$ is defined as a group of droplets that contain an identical combination of targets. ddPCRclust performes four steps to successfully analyze this data: \begin{enumerate} \item Find all cluster centroids $c$. \item Assign one or multiple targets $t$ to each cluster $l$ based on $c$. \item Allocate the rain and assign a cluster label $l$ to each droplet $x$. \item Determine the number of positive droplets for each target $t$ and calculate the CPDs. \end{enumerate} \subsection{ddPCRclust} The main function of the package is \Rfunction{ddPCRclust}. This function runs the algorithm with one or multiple files, automatically distributing them among all CPU cores (no parallelisation on Windows). We provide eight exemplary ddPCR files along with this package. Analyse them using the following commands: <>= library(ddPCRclust) # Read files exampleFiles <- list.files(paste0(find.package('ddPCRclust'), '/extdata'), full.names = TRUE) files <- readFiles(exampleFiles[3]) # To read all example files uncomment the following line # files <- readFiles(exampleFiles[1:8]) # Read template template <- readTemplate(exampleFiles[9]) # Run ddPCRclust result <- ddPCRclust(files, template) # Plot the results library(ggplot2) p <- ggplot(data = result$B01$data, mapping = aes(x = Ch2.Amplitude, y = Ch1.Amplitude)) p <- p + geom_point(aes(color = factor(Cluster)), size = .5, na.rm = TRUE) + ggtitle('B01 example')+theme_bw() + theme(legend.position='none') p @ \subsubsection{Parameters}\label{params} \begin{itemize} \item{files:} The input data obtained from the csv files. For more information, please see \ref{input}. \item{template:} A data frame containing information about the individual ddPCR runs. For more information, please see \ref{input}. \item{numOfMarkers:} The number of primary clusters that are expected according the experiment set up. Can be ignored if a template is provided. Else, a vector with length equal to \texttt{length(files)} should be provided, containing the number of markers used for the respective reaction. \item{sensitivity:} The clustering algorithms can be tweaked in order to partition the data into more or fewer clusters. A sensible value lies between 0.1 and 2, the standard is 1. A higher value means the data is divided into more clusters, a lower value means more clusters are merged. This allows fine tuning of the algorithm for exceptionally low or high CPDs. \item{similarityParam:} If the distance of a droplet between two or more clusters is very similar, the algorithm can't be sure where it belongs. Therefore, it will not be counted for either if the ratio of the distances from cluster 1 and cluster 2 is larger than similarityParam (assuming cluster 2 > cluster 1). The standard is 0.95, i.e. at least 95\% similarity. A sensible value lies between 0 and 1, where 0 means none of the 'rain' droplets will be counted and 1 means all droplets will be counted. \item{distanceParam:} When assigning rain between two clusters, typically the bottom 20\% are assigned to the lower cluster and the remaining 80\% to the higher cluster. This parameter changes the ratio, i.e. a value of 0.1 would assign only 10\% to the lower cluster. \item{fast:} Run a simpler version of the algorithm that uses only flowDensity (see \ref{dens}) and thus is about 10x faster. For clean data, this might already deliver very good results. However, it is mostly intended to get a quick overview over the data. \item{multithread:} Distribute the algorithm among all CPU cores to speed up the computation if set to TRUE (not available on Windows). Default is FALSE. \end{itemize} \subsection{Initial clustering} The initial clustering of the data (see step 1 in section \ref{methods}) is performed by three different clustering packages, which have been adjusted to work on ddPCR data. Each algorithm has its own function, in case users need low level control. \subsubsection{flowDensity}\label{dens} Originally designed for gating of flow cytometry data, \emph{flowDensity} \cite{malek2015flowdensity} identifies cell populations in a dataset using characteristics of the density distribution (e.g. the number, height and width of peaks and the slope of the distribution curve). Parameters can be adjusted on a population-specific basis. We use the density function to find local peaks above a threshold, which represent the centres of clusters. The method comprises the following steps: \begin{enumerate} \item Remove all $x$ where $(x_1,x_2)$ < 0.125$\,\cdot\,max(x_1,x_2)$. The bottom 12.5$\,\%$ of the data space is known to contain the negative population, i.e. the droplets without any of the targets of interest. \item Find the highest density peaks with $max(x_1)$ and $max(x_2)$, respectively. We define these as the two outer primary clusters $y$ and $z$, since the primary clusters empirically contain the majority of the non-negative events. \item Rotate the data with $\theta = \lvert atan(\frac{y_2 - z_2}{y_1 - z_1}) \rvert$. \item Cut the rotated data above the two outer clusters in a staircase manner and find all density peaks. \item Take the previously removed data and repeat steps 2 and 4, until all clusters are found. \end{enumerate} Clusters are then labelled based on their rotated position and lastly the rain is assigned. This part of the algorithm can be accessed with the function \Rfunction{runDensity}. The parameters are the same as discussed in \ref{params}. \subsubsection{SamSPECTRAL} Since spectral clustering is computationally expensive ($\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ space), \emph{SamSPECTRAL} \cite{zare2010data} uses density based pre-processing to reduce the number of edges in the graph. To do so, a faithful sampling algorithm builds $m$ communities, which are then connected to a graph where the edges represent the similarity between corresponding communities. The spectrum of this graph is subsequently analysed using classical spectral clustering to find the clusters. Finally, the clusters are combined based on their similarity in the community graph and a cluster number for each event in the original data is returned. We use this implementation of spectral clustering and choose $m$ encompassing 5$\,$\% of the data, which has empirically proven to be a good compromise between accuracy and speed. However, users can choose a different value if necessary. Clusters are then labelled based on their position and lastly the rain is assigned. This part of the algorithm can be accessed with the function \Rfunction{runSam}. The parameters are the same as discussed in \ref{params}. \subsubsection{flowPeaks} The third approach uses the \emph{flowPeaks} package \cite{ge2012flowpeaks}. The \emph{flowPeaks} algorithm first uses a two step k-means clustering with a large k, in order to partition the dataset into many compact clusters. The result is then used to generate a smoothed density function. All local peaks are exhaustively found by exploring the density function and the clusters are merged according to their local peaks. Clusters are then labelled based on their position and lastly the rain is assigned. This part of the algorithm can be accessed with the function \Rfunction{runPeaks}. The parameters are the same as discussed in \ref{params}. \subsection{Copies per droplet} Once all droplets are correctly assigned, the actual copies per droplet (CPDs) for each target are calculated by the function \Rfunction{calculateCPDs} according to Equation \ref{eq:cpd}, \begin{equation}\label{eq:cpd} CPD_i = -ln(1-\frac{C_i}{C_T}) \end{equation} where $C_i$ is the total number of positive droplets for target $i$ and $C_T$ the total droplet count. \subsubsection{Parameters} \begin{itemize} \item{results:} The result of the ddPCRclust algorithm. \item{template:} A data frame containing information about the individual ddPCR runs. For more information, please see \ref{input}. \item{constantControl:} The name of a marker that has been used as a constant control. This marker has to be present in \emph{every} row of the provided template (i.e. in every reaction). The results will be normalised against this control. \end{itemize} \subsection{Exporting results} The results can be exported using \Rfunction{exportPlots}, \Rfunction{exportToExcel}, and \Rfunction{exportToCSV}. \section{Visual Interface} Furthermore, we developed ddPCRvis, a GUI that gives access to the aforementioned functionalities of the ddPCRclust package directly through a web browser, powered by R Shiny \cite{shiny2017}. It also enables the user to check the results and manually correct them if necessary. The interface is available online at https://bibiserv.cebitec.uni-bielefeld.de/ddPCRvis/ or for download at https://github.com/bgbrink/ddPCRvis/. \bibliography{ddPCRclust} \end{document}