%\VignetteIndexEntry{An introduction to FunChIP} %\VignetteDepends{} %\VignetteKeywords{ChIP-Seq, GRanges, shape, functional data analysis} %\VignettePackage{FunChIP} %\VignetteEngine{utils::Sweave} \documentclass{article} <>= BiocStyle::latex() @ \usepackage[algoruled]{algorithm2e} \usepackage{color} \usepackage{amsfonts} \usepackage{graphicx} \definecolor{bronze}{rgb}{0.93, 0.53, 0.18} \definecolor{darkblue}{rgb}{0, 0.3, 0.6} \SetKw{align}{\color{darkblue}Alignment:} \SetKw{template}{Template:} \SetKw{assign}{Assignment:} \SetKw{norm}{Normalization:} \newcommand{\bam}{\texttt{BAM}} \title{\Biocpkg{FunChIP}: A Functional Data Analysis approach to cluster ChIP-Seq peaks according to their shapes} \author{Alice Parodi \email{alicecarla.parodi@polimi.it} \\ Laura M. Sangalli \\ Piercesare Secchi \\ Simone Vantini \\ Marco J. Morelli } %\date{Modified: February 24, 2016. Compiled: \today} \date{\today} \begin{document} \SweaveOpts{concordance=FALSE} \SweaveOpts{background = "#C0C0C0", size = tiny} \maketitle \tableofcontents <>= options(width=90) @ <>= library(FunChIP) @ \section{Introduction} The \Biocpkg{FunChIP} package provides a set of methods for the \Rclass{GRanges} class of the package \Biocpkg{GenomicRanges} to cluster ChIP-Seq peaks according to their shapes, starting from a \bam{} file containing the aligned reads and a \Rclass{GRanges} object with the corresponding enriched regions. \section{Input and Preprocessing}\label{sect:prep} ChIP-Seq enriched regions are provided by the user in a \Rclass{GRanges} object \Rcode{GR}. The user must provide the \bam{} file containing the reads aligned on the positive and negative strands of the DNA. From the \bam{} file we can compute, for each region of the \Rclass{GRanges} (let $N$ be the total number of regions), the base-level coverage separately for positive and negative reads. These two count vectors are used to estimate the distance $d_{pn}$ between positive and negative reads and then the total length of the fragments of the ChIP-Seq experiment $d$. In particular, we assume that the positive and negative counts measure the same signal, shifted by $d_{pn}$, as they are computed from the two ends of the sequencing fragments. The global length of the fragment is the sum between the length of the reads of the \bam{} file, $r$\footnote{ If in the \bam{} file multiple length are present, $r$ is estimated as the average length.}, and the distance between the positive and negative coverage $d_{pn}$ $$ d = d_{pn} + r. $$ The function \Rfunction{compute\_fragments\_length} computes, from the \Rclass{GRanges} object and the \bam{} file, the estimated length of the fragments. Given a range for $d_{pn}$: $[d_{\min}; d_{\max}]$, the optimum distance $d_{pn}$ is $$ d_{pn} = \textrm{argmin}_{\delta \in [d_{\min}: d_{\max}]} \sum_{n = 1}^N D(f_{n+}, f_{n-}^{\delta}), $$ where $f_{n+}$ is the positive coverage function of the $n$-th region, and $f_{n-}^{\delta}$ is the negative coverage of the $n$ th region, shifted by $\delta$. The distance $D$ is the square of the $L^2$ distance between the coverages, normalized by the width of the region. The definition of the $L^2$ distance is detailed in Section \ref{sect:kmean}. <>= # load the GRanges object associated # to the ChIP-Seq experiment on the # transcription factor c-Myc in murine cells data(GR100) # name of the .bam file (the # .bam.bai index file must also be present) bamf <- system.file("extdata", "test.bam", package="FunChIP", mustWork=TRUE) # compute d d <- compute_fragments_length(GR, bamf, min.d = 0, max.d = 300) d @ %% In Figure \ref{FunChIP-fragment_length} the distance function is shown varying the parameter $\delta$, and the minimum value $d_{pn}$ is computed. \incfig{FunChIP-fragment_length}{0.5\textwidth}{Identification of d.} {optimal value of $d_{pn}$ is presented. It is the minimum of the global distance function.} Once we have correctly identified the fragment length we can compute the final coverage function to obtain the shape of the peaks. The \Rfunction{pileup\_peak} method for the \Rclass{GRanges} class uses the \bam{} file to compute the base-level coverage on these regions, once the reads are extended up to their final length $d$. \Rfunction{pileup\_peak} adds to the \Rclass{GRanges} a \Rcode{counts} metadata column, containing for each region a vector with length equal to the width of the region storing the coverage function. <>= # associate to each peak # of the GRanges object the correspondent # coverage function peaks <- pileup_peak(GR, bamf, d = d) peaks @ %% Additional information can be found in the help page of the \Rfunction{pileup\_peak} method. \section{Smoothing} The \Rcode{counts} metadata is approximated by a combination of splines to guarantee the smoothness and regularity needed for further analysis, as described in the following Sections.\\ The preprocessing steps carried out in the \Rfunction{smooth\_peak} method are the following: \begin{itemize} \item \textit{Removal of the background and extension.} In ChIP-Seq experiments, peaks may have an additive noisy background, and the removal of this background is mandatory to compare different peaks. The background is estimated as a constant value "raising" the peak and equal to the minimum value the coverage assumes. Consequently, once the background has been removed, each peak has zero as minimum value, thus allowing the peak to be indefinitely extended with zeros, if necessary. In Section \ref{sect:kmean}, how this choice affects the algorithm will be discussed. \item \textit{Smoothing.} In order to be regular enough to computed derivatives, a peak has to be transformed in a suitable functional object, as described in Section \ref{sect:kmean}. The smoothing of the count vector $c$ is performed through the projection of $c$ on a cubic B-spline basis $\Phi = \{\phi_1, \ldots \phi_K\}$ with a penalization on the second derivative \cite{ramsay}. The result is a spline approximation of the data, which is continuous on the whole domain, together with its first order derivatives. Moreover, the penalization on the second derivative allows to control the global regularity of the function avoiding over-fitting and a consequent noisy spline definition. The spline approximation $s = \sum_{k = 1}^K \theta_k \phi_k$ of the count vector $ c = \{c_j\}$ is defined minimizing $$ S(\lambda) = \sum_{j = 1}^n \left[ c_j - s(x_j) \right]^2 + \lambda \int \! \left[ s''(x)\right] ^2 dx, $$ with $x_j$ being the relative genomic coordinate the counts. The multiplying coefficient $\lambda$ quantifies the penalization on the second derivative and is chosen through the Generalized Cross Validation criteria. For each peak $i$ the $GCV_i$ index is computed with a leave-one-out cross validation $$ GCV_i = \left(\frac{n}{n - df(\lambda)} \right) \left( \frac{SSE_i}{n-df(\lambda)}\right) $$ and then it is summed on the whole data set to obtain the global $GCV$. The number of degrees of freedom $df(\lambda)$ is automatically computed from the definition of the basis $\Phi$.\\ The error $SSE_i$ can be computed either on the data ($SSE_i^0$) or on the derivatives ($SSE_i^1$), to control the regularity of the function or the regularity of the derivatives, respectively: $$ SSE_i^0 = \sqrt{\sum_{j = 1}^n \left( c_j - s(x_j) \right)^2} \,\,\, \textrm{or} \,\,\, SSE_i^1 = \sqrt{\sum_{j = 1}^{n-1} \left( \nabla c_j - s'(x_j) \right)^2} , $$ with $\nabla c_j$ being the finite-difference approximation of the derivative of the \Rcode{counts} vector $c$ for the data $i$: $c = c(i)$, while $s'(x_i)$ is the evaluation of the first derivative $s' = s'(i)$ on the genomic coordinates. For further details on the spline definition see the \Rfunction{spline} function of the \CRANpkg{fda} package. \item\textit{Scaling of the peaks.} This optional preprocessing step makes all the curves having the same width and area. In particular all the abscissa grid are scaled to become equal to the smallest grid throughout the data, while $y-$values are scaled to make areas of all the curves equal to 1. \end{itemize} The \Rfunction{smooth\_peak} method approximates the \Rcode{counts} metadata by removing the background, computing the spline and potentially defining the scaled approximation. Focusing on the spline approximation, \Rfunction{smooth\_peak} automatically chooses the optimal $\lambda$ parameter according to the $GCV$ criteria; the user can decide whether to consider the data or the derivatives to compute the $SSE$. <>= # the method smooth_peak # removes the background and defines the spline # approximation from the previously computed peaks # with lambda estimated from the # GCV on derivatives. The method spans a non-uniform # grid for lambda from 10^-4 to 10^12. # ( the grid is uniform for log10(lambda) ) peaks.smooth <- smooth_peak(peaks, lambda = 10^(-4:12), subsample.data = 50, GCV.derivatives = TRUE, plot.GCV = TRUE, rescale = FALSE ) @ \incfig{FunChIP-figureGCV}{0.8\textwidth}{Generalized Cross Validation index.} {$GCV$ computed on data (left), and on the derivatives (right), as a function of $\lambda$.} In Figure \ref{FunChIP-figureGCV}, the plot of the $GCV$ for both data and derivatives is shown. From this Figure we see that the optimum value of $\lambda$, which minimizes the $GCV$ for the derivatives, is also associated to a small value of the $GCV$ for the data thus supporting the automatic choice. <>= # the automatic choice is lambda = 10^6 peaks.smooth <- smooth_peak(peaks, lambda = 10^6, plot.GCV = FALSE) head(peaks.smooth) # mantaining this choice of lambda smooth_peak # can also define the scaled approximation # of the spline peaks.smooth.scaled <- smooth_peak(peaks, lambda = 10^6, plot.GCV = FALSE, rescale = TRUE) head(peaks.smooth.scaled) @ Now the \Rclass{GRanges} object contains, besides \Rcode{counts}, 5 new metadata columns with the spline approximation evaluated on the base-level grid, its derivatives, the width of the spline and the new starting and ending points (see Figure \ref{FunChIP-plotBOTH}). For a more detailed description of the metadata columns, see the help page of the \Rfunction{smooth\_peak} method. With the introduction of the smoothing, counts at the edges of the peak are connected with regularity to 0, and therefore new values different from zeros may be introduced. In order to maintain regularity, the grid is extended up to the new boundaries. Adding to \Rfunction{smooth\_peak} the option \Rcode{rescale = TRUE} the method, beside the 5 metadata columns previously introduced, returns 2 more metadata columns with the scaled approximation of the spline and its derivatives. Once the spline approximation is defined, the summit of the smoothed peak (or even of the scaled peak), i.e. of its spline approximation, can be detected. The summit will be used to initialize the peak alignment procedure, described in Section \ref{sect:kmean}, and it can either be a user-defined parameter, stored in a vector of the same length of the \Rcode{GR}, or automatically computed as the maximum height of the spline. The summit is stored in the new metadata column \Rcode{summit\_spline}. If the \Rcode{rescale} option is set to \Rcode{TRUE} the summit of the scaled approximation is also returned in the metadata colum \Rcode{summit\_spline\_rescaled}. <>= # peaks.summit identifies the maximum point # of the smoothed peaks peaks.summit <- summit_peak(peaks.smooth) head(peaks.summit) # peaks.summit can identify also the maximum # point of the scaled approximation peaks.summit.scaled <- summit_peak(peaks.smooth.scaled, rescale = TRUE) head(peaks.summit.scaled) @ \section{The k-mean alignment algorithm and the \Rfunction{cluster\_peak} method }\label{sect:kmean} The k-mean alignment algorithm is an efficient method to classify functional data allowing for general transformation of abscissae \cite{kma}; this general method is implemented in the package \CRANpkg{fdakma} and various applications to real dataset are introduced in \cite{Sangalli:2014}, \cite{Bernardi:2014}, \cite{Patriarca:2014}. In particular, given \begin{itemize} \item a set of curves $s_1, \ldots, s_n$, \item the number of clusters $K$, \item a distance function $d(s_i, s_j)$ between two curves $s_i$ and $s_j$, as for example the integral of the difference $s_i - s_j$, \item a family of warping functions $\mathcal{W}$ to transform the abscissae of the curves and therefore align the peaks. Generally, $\mathcal{W}$ is the set of shifts or dilations or affine transformations (shift + dilation), \end{itemize} the algorithm, presented in Algorithm \ref{kmean_box}, is an iterative procedure to split the curves into $K$ clusters. The introduction of the warping function $h \in \mathcal{W}$ allows each curve to be shifted, dilated, or both, to define the minimum distance between curves. The new curve $s \circ h$ has the same values of $s$, but its abscissa grid is modified. \incfig{FunChIP-alignment}{0.8\textwidth}{Alignment procedure.} {Representation of two smoothed peaks. In the left panel they are not aligned, while in the right panel they are aligned with an integer shift.} For example, in Figure \ref{FunChIP-alignment} two peaks are presented: in the left panel, they are not aligned, while the right panel shows the effects of alignment; the transformation of the abscissae (shift transformation) makes the two peaks more similar, and the distance $d$ is not anymore affected by artificial phase distance. The code generating Figure \ref{FunChIP-alignment} calls \Rfunction{cluster\_peak} and \Rfunction{plot\_peak}, which are described in Section \ref{sect:cluster} and Section \ref{sect:plot}. <>= # representation of two peaks par (mfrow = c(1,2)) plot_peak(peaks.summit, index = c(6,7), col=c('red',2), cex.main = 2, cex.lab = 2, cex.axis = 1.5, lwd = 2) aligned.peaks <- cluster_peak(peaks.summit[c(6,7)], parallel = FALSE , n.clust = 1, seeds = 1, shift.peak = TRUE, weight = 1, alpha = 1, p = 2, t.max = 2, plot.graph.k = FALSE, verbose = FALSE) aligned.peaks # shift coefficients aligned.peaks$coef_shift plot_peak(aligned.peaks, col = 'forestgreen', shift = TRUE, k = 1, cluster.peak = TRUE, line.plot = 'spline', cex.main = 2, cex.lab = 2, cex.axis = 1.5, lwd = 2) @ For the specific case of ChIP-Seq data, the admitted warping functions for the k-mean alignment algorithm (in the \Rfunction{cluster\_peak} method), are integer shifts: \begin{equation}\label{eq:warping} \mathcal{W} = \left\{ h : h(t) = t + q \textrm{ with } q \in \mathbb{Z} \right\}. \end{equation} In other words, with this choice, peaks can be shifted by integer values in the \emph{alignment} procedure of the algorithm. \begin{algorithm} Given a set of functions $s_1, \ldots, s_n$ and a number $K$ of clusters\\ \template random choice (if not provided) of the initial centers of the clusters $c_1 \ldots, c_k$\\ \While{ decrease of the distance higher than a fixed threshold }{ \ForEach {$i \in 1:n$} { \align $s_i$ is aligned to each template $c_k$: the optimal warping function $h_{i,k}^{\star}$ in $\mathcal{W}$ is detected $$ h^{\star}_{i,k} = \textrm{argmin}_{h \in \mathcal{W}} \,\, d(c_k, x_i \circ h) $$ with the corresponding distance $d^{\star}_{i,k} = \min_{h \in \mathcal{W}} d(c_k, x_i \circ h)$\\ \assign $s_i$ is assigned to the best cluster $$ k_i^{\star} = \textrm{argmin}_{k \in 1 : K} \,\, d^{\star}_{i,k} $$ } \ForEach {$k \in 1 : K$} { \template identification of the new template of the cluster $c_k$\\ \norm the average warping function of the curves belonging to $k$ is set to be the identity transformation $$ h(s) = s $$ } } \caption{k-mean alignment algorithm} \label{kmean_box} \end{algorithm} In the \Rfunction{cluster\_peak} method the distance between two curves $s_1$ and $s_2$ is defined as \begin{eqnarray}\label{eq:dist} d (s_1, s_2) & = & (1 - \alpha) \, d_0(s_1, s_2) + \alpha \, w \, d_1(s_1, s_2) = \nonumber\\ & = & (1 - \alpha)\, \| s^e_1 - s^e_2 \|_p + \alpha \, w \, \| (s^e_1)' - (s^e_2)' \|_p, \end{eqnarray} where \begin{itemize} \item $\| f \|_p$ is the $p$ norm of $f$. In particular, for $p = 0$, $\| \cdot \|_p$ is the $L^{\infty}$ norm $$ \| f \|_0 = \|f \|_{L^{\infty}} = \max_{x \in U} |f(x)|, $$ with $U$ being the domain of $f$.\\ For $p=1$, $\| \cdot \|_p$ is the $L^1$ norm $$ \| f \|_1 = \|f \|_{L^{1}} = \int_U |f(x)| dx. $$ And for $p=2$, $\| \cdot \|_p$ is the $L^2$ norm $$ \|f \|_2 = \|f \|_{L^{2}} = \int_U \left(f(x)\right)^2 dx. $$ \item $s_1^e$ and $s_2^e$ are the functions $s_1$ and $s_2$ extended with zeros where not defined, after their backgrounds have been removed (see Section \ref{sect:prep}). The distance function is computed on the union of the domains of $s_1$ and $s_2$ ($U$); $s_1$ and $s_2$ need to be extended to cover the whole $U$. \item $\alpha \in [0,1]$ is a coefficient tuning the contributions of the norm of the data and the norm of the derivatives. If $\alpha = 0$, the distance is computed on the data, while if $\alpha = 1$ it is based on the derivatives. Intermediate values balance these two contributions: increasing the relevance given to the derivatives emphasizes the shapes of the peaks, while data are more related to the height. \item $w$ is a weight coefficient, essential to make the norm of the data and of the derivatives comparable. It can be user defined or computed inside the \Rfunction{cluster\_peak} method. A suggestion for computing the weight $w$ is given in Section \ref{sect:weight}. \end{itemize} \subsection{Definition of weight in the distance function} \label{sect:weight} If not provided, the method \Rfunction{cluster\_peak} defines $w$ as $$ w = \textrm{median} \left( \frac{d_0(s_i, s_j)}{d_1(s_i, s_j)} \right) $$ where $d_0(i,j)=\| s^e_i - s^e_j \|_p$ and $d_1(i, j) = \| (s^e_1)' - (s^e_2)' \|_p$. These matrices can be automatically computed with the \Rfunction{distance\_peak} function. <>= # compute the weight from the first 10 peaks dist_matrix <- distance_peak(peaks.summit) # dist matrix contains the two matrices d_0(i,j) # and d_1(i,j), used to compute w names(dist_matrix) ratio_norm <- dist_matrix$dist_matrix_d0 / dist_matrix$dist_matrix_d1 ratio_norm_upper_tri <- ratio_norm[upper.tri(ratio_norm)] summary(ratio_norm_upper_tri) # suggestion: use the median as weight w <- median(ratio_norm_upper_tri) w @ \subsection{The \Rfunction{cluster\_peak} method} \label{sect:cluster} The two main characteristics of the k-mean alignment algorithm used in \Biocpkg{FunChIP} are the distance function $d$ (defined in Equation (\ref{eq:dist})), used to compute the distance between curves, and the set of warping functions $\mathcal{W}$ (defined in Equation (\ref{eq:warping})) considered for the alignment. The \Rfunction{cluster\_peak} method applies the k-mean alignment algorithm with these specifications to the set of peaks stored in the \Rclass{GRanges} object. In particular, the parameters \Rcode{weight}\footnote{ \Rcode{weight} can be also set to \Rcode{NULL} and it will be automatically computed as specified in Section \ref{sect:weight}. To save computational time, it is generally computed on a random sub-sample of data, whose size is set by the \Rcode{subsample.weight} parameter.}, \Rcode{alpha} and \Rcode{p} define the distance used in the algorithm, while \Rcode{t.max} sets the maximum shift of each peak in each iteration (in this particular case, $q$ of Equation (\ref{eq:warping}) does not vary in the whole $\mathbb{Z}$ but $q \in \{ - \textrm{\Rcode{t.max}} \cdot |U|, \ldots, + \textrm{\Rcode{t.max}}\cdot |U| \}$, with $|U|$ being the maximum width of the spline approximation of the peaks. Given a \Rclass{GRanges} \Rcode{GR} containing the metadata columns computed from the \Rfunction{smooth\_peak} method, \Rfunction{cluster\_peak} applies the k-mean alignment algorithm for all the values of $k$ between 1 and \Rcode{n.clust} (parameter of the function). The algorithm can be run in parallel, setting to \Rcode{TRUE} the \Rcode{parallel} argument of the method and providing the number of cores \Rcode{num.cores}. With these settings, the different applications of the algorithm, corresponding to different numbers of clusters, are executed in parallel. As detailed in the help, the \Rfunction{cluster\_peak} method has 2 outputs: \begin{itemize} \item The \Rclass{GRanges} object, updated with new metadata columns associated to the classification. In particular, in the general case of classification with and without alignment, columns with information on the clustering of the peaks (\Rcode{cluster\_shift} and \Rcode{cluster\_NOshift}), the corresponding shifts (\Rcode{coef\_shift}) and the distances from the template of the clusters (\Rcode{dist\_shift} and \Rcode{dist\_NOshift}) are added. \item The graph of the global distance within clusters\footnote{sum over all the peaks of the distance of each peak from the corresponding template.} as a function of the number of clusters (if \Rcode{plot.graph.k = TRUE}). This plot can be used to identify the optimal number of clusters of the partition of the data set and the effect of the alignment procedure. In particular, if \Rcode{shift = NULL}, the algorithm is run both with and without alignment and two trend lines are plotted: the black line corresponds to the global distance without the shift, and the red line corresponds to the distance obtained with alignment. If \Rcode{shift} is set to \Rcode{TRUE} or \Rcode{FALSE}, just one type of algorithm is run and the correspondent curve is plotted. For each trend line, this graph allows the identification of the optimal value of the number of clusters: for this value, the distance significantly decreases with respect to the lower values of $k$, and negligibly increases with respect to higher values of $k$ (elbow in the line). The gap between the red and the black line, instead, shows the decrease of the distance when the shift is introduced. \end{itemize} It is relevant to point out that the algorithm can be run both on the original data and on the scaled peaks, depending on the focus of the analysis. The logic paramter \Rcode{rescale} allows the user to choose. <>= # classification of the smooth peaks in different # numbers of clusters, from 1 ( no distinction, only shift ) # to 4. # here the analysis is run on the spline approximation # without scaling peaks.cluster <- cluster_peak(peaks.summit, parallel = FALSE , seeds=1:4, n.clust = 1:4, shift = NULL, weight = 1, alpha = 1, p = 2, t.max = 2, plot.graph.k = TRUE, verbose = FALSE) head(peaks.cluster) @ <>= # here the analysis is run on the spline approximation # with scaling peaks.cluster.scaled <- cluster_peak(peaks.summit.scaled, parallel = FALSE , seeds=1:4, n.clust = 1:4, shift = NULL, weight = 1, alpha = 1, p = 2, t.max = 2, plot.graph.k = TRUE, verbose = FALSE, rescale = TRUE) head(peaks.cluster.scaled) @ \begin{figure} \label{graphK} \centering \includegraphics [width = 0.4\textwidth]{FunChIP-figurek} \includegraphics [width = 0.4\textwidth]{FunChIP-figurekScaled} \caption{\textbf{Global distance within clusters.} Global distance of the peaks form the corresponding template, as a function of the number of clusters $k$. In the left panel the graph for the original spline approximation, while in the right panel restuls are relative to the scaled approximation.} \end{figure} The particular case of k-mean alignment with $k = 1$ clusters can be used to highlight the effects of the alignment of the peaks: no grouping is performed, just the shifts are computed. Therefore, the decrease of the global distance is solely due to a change of the abscissae of the functions, as Figure \ref{FunChIP-alignment} shows. Moreover, focusing for exemple on the first panel of Figure \ref{graphK}, we can deduce that, for this case \begin{itemize} \item the alignment can effectively decrease the distance, for exemple for $k = 6$, the gap between red and black line is significant; \item the alignment may change the optimal $k$: looking at the black line, one would have chosen $k=4$, while the red line suggests $k=3$ is the best choice. With the introduction of the shifts, data which are originally different becomes more similar and therefore one less cluster is needed; it has to be noted that the distance obtained with $k=3$ and alignment is very similar to the one obtained with $k=4$ and no alignment. \end{itemize} Therefore, for this case, one possible classification is the one associated to $k=3$ with shift. On the contray for the scaled peaks the value of $k$ we can identify as crucial is $k=2$ and shift is relevant since it reduces a lot the global distance. In Section \ref{sect:silhouette} we present two quantitative methods to isolate the most suitable value of $k$: the first quantifies the elbow rule we have presented here and the second that computes the Silhouette index to consider the global structure of the dataset. The results for this specific number of clusters can then be selected with the \Rfunction{choose\_k} method: <>= # select the results for k = 3 with alignment peaks.classified.short <- choose_k(peaks.cluster, k = 3, shift = TRUE, cleaning = TRUE) head(peaks.classified.short) peaks.classified.extended <- choose_k(peaks.cluster, k = 3, shift = TRUE, cleaning = FALSE) # and for the scaled version for k =2 and alignment peaks.classified.scaled.short <- choose_k(peaks.cluster.scaled, k = 2, shift = TRUE, cleaning = TRUE) head(peaks.classified.scaled.short) peaks.classified.scaled.extended <- choose_k(peaks.cluster.scaled, k = 2, shift = TRUE, cleaning = FALSE) @ The \Rfunction{choose\_k} method allows, respectively, to remove all the metadata columns computed by \Biocpkg{FunChIP} and obtain a \Rclass{GRanges} equivalent to the initial one, with an extra the metadata column \Rcode{cluster} containing the classification labels (\Rcode{cleaning = TRUE}), or a \Rclass{GRanges} retaining all the details of the prepossessing and clustering (all the previously described metadata columns), with the extra column \Rcode{cluster} (\Rcode{cleaning = FALSE}). \subsection{The choice of the final classification with the bending index and the Silhouette plot} \label{sect:silhouette} In this section we present two analyses which could be useful to identify the most suitable value of the number of clusters. The first analysis quantifies the elbow rule we proposed in the previous section. The function \Rfunction{bending\_index} computes for each value of $k$ in the range $[2 : K-1]$, with $K$ the maximum number of clusters allowed in \Rfunction{cluster\_peak}, an index that quantifies the bending of the curve. The higher this index, the more appropriate is the correspondent value of $k$. The index is computed as the distance of the point $P$ in $k$ of the global distance function, normalized to its maximum value, ( $ P = (k, \tilde{D}(k))$) from the line $r$ passing by the point in $k-1$ and in $k+1$ ($ r = r((k-1, \tilde{D}(k-1)), (k+1, \tilde{D}(k+1))$). The normalized distance function is $$ \tilde{D}(k) = \frac{D(k)}{\max_{k = 1: K}{D(k)}} $$ with $D(k)$ being the global distance of the set of classified peaks, when divided into $k$ clusters: $$ D(k) = \sum_{i = 1}^n d_{i,k}^{\star}. $$ In particular, $d_{i,k}^{\star}$ is the distance of each peak $i$ from the center of the correspondent cluster $c_{k^{\star}_i}$. Then, the index can be written as: $$ r(k) = \textrm{dist}((k, \tilde{D}(k)), r((k-1, \tilde{D}(k-1)), (k+1, \tilde{D}(k+1)))) $$ <>= # here we compute the bending index for the classification # provided for the non scaled dataset index <- bending_index(peaks.cluster, plot.graph.k = FALSE) index # and then for the scaled dataset index_scaled <- bending_index(peaks.cluster.scaled, plot.graph.k = FALSE) index_scaled @ The function returns the value of the bending index varying $k$ for both the classification with \Rcode{index\_shift} and without \Rcode{index\_NOshift} alignment, if both classifications are stored in the \Rcode{object}. It can also show the plot of the global distance curves of Figure \ref{graphK}, if \Rcode{plot.graph.k = TRUE}. Moreover, we present another method to evaluate the choice the proper number of clusters, which takes into account the global set of data. This method is based on the defintion of the Silhouette index of \cite{Rousseeuw:1987} and, for the peak $i$, the index is computed as $$ s(i) = \frac{a(i)-b(i)}{\max(a(i), b(i))}, $$ with $a(i)$ being the average dissimilarity of peak $i$ with all other data within the same cluster and $b(i)$ the lowest average dissimilarity of $i$ to any other cluster, of which $i$ is not a member. To compute the $b(i)$ value it is necessary to compare peaks belonging to different clusters. In case of classification with alignment, to define the distance of peaks of different clusters it is necessary first to align the two clusters. To perform this global alignment in an efficient way, we align the centers of the clusters and then use the estimated shift coefficient to align all the peaks of the two clusters. <>= sil <- silhouette_plot(peaks.cluster, p=2, weight = 1, alpha = 1, rescale = FALSE, t.max = 2) @ <>= sil <- silhouette_plot(peaks.cluster.scaled, p=2, weight = 1, alpha = 1, rescale = FALSE, t.max = 2) @ \begin{figure} \label{shilouette} \centering \includegraphics [width = 0.99\textwidth]{FunChIP-sil} \caption{\textbf{Silhouette graph for the classification of unscaled peaks.} In the top panel, Silhouette plots for the classification without alignment, as function of the number of clusters; in the bottom panel, Silhouette plots for the classification with alignment.} \end{figure} \begin{figure} \label{shilouette} \centering \includegraphics [width = 0.99\textwidth]{FunChIP-sil_scaled} \caption{\textbf{Silhouette graph for the classification of scaled peaks.} In the top panel, Silhouette plots for the classification without alignment, as a function of the number of clusters; in the bottom panel, Silhouette plots for the classification with alignment.} \end{figure} Th plots shown in Figure \ref{shilouette} represent the Silhouette index for all the peaks divided in the clusters, together with the global average Silhouette index. The average Silhouette index is maximum when the clustering is separating the data in the most suited manner and then this index has to be maximized to choose the best classification. In the case here presented here, the best classification according to this criteria is the classification with alignment and $k =2$ or $k=3$. Considering both Silhouette and the bending criteria, in the next section we present the results of the classification with alignement and $k=3$ clusters. \section{Visualization of the peaks} \label{sect:plot} The \Rfunction{plot\_peak} method is a very flexible function for displaying ChIP-Seq peaks. In particular, it allows to plot the raw counts obtained by the \Rfunction{pileup\_peak} method, as in Figure \ref{FunChIP-plot1}. It can also plot smoothed peaks, possibly centered around the summit, as in Figure \ref{twoplots}, or scaled as in Figure \ref{scaled} and centerd. <>= # plot of the first 10 peaks (raw data) par(mar=c(4.5,5,4,4)) plot_peak(peaks, index = 1:10, line.plot = 'count', cex.main = 2, cex.lab = 2, cex.axis =2) @ \incfig{FunChIP-plot1}{0.5\textwidth}{10 peaks: counts.} {Representation of the original peaks as raw counts (no smoothing).} <>= # plot of the smoothed approximation of the first 10 peaks par(mar=c(4.5,5,4,4)) plot_peak(peaks.smooth, index = 1:10, line.plot = 'spline', cex.main = 2,cex.lab = 2, cex.axis =2) @ % \incfig{FunChIP-plot2}{0.5\textwidth}{10 peaks: splines.} % {Representation of the original peaks (with spline smoothing) as counts % on the basis of the genome.} <>= # plot of the smoothed approximation of the first 10 peaks, # centering peaks around their summits par(mar=c(4.5,5,4,4)) plot_peak(peaks.summit, index = 1:10, line.plot = 'spline', cex.main = 2,cex.lab = 2, cex.axis =2) @ <>= # plot of the smoothed approximation of the first 10 peaks; # the scaled functions are plotted. # par(mar=c(4.5,5,4,4)) plot_peak(peaks.smooth.scaled, index = 1:10, line.plot = 'spline', rescale = TRUE, cex.main = 2,cex.lab = 2, cex.axis =2) @ <>= # plot of the scaled approximation of the first 10 peaks, # centering peaks around their summits par(mar=c(4.5,5,4,4)) plot_peak(peaks.summit.scaled, index = 1:10, line.plot = 'spline', rescale = TRUE, cex.main = 2,cex.lab = 2, lwd = 2,cex.axis =2) @ \begin{figure} \label{twoplots} \centering \includegraphics [width = 0.4\textwidth]{FunChIP-plot2} \includegraphics [width = 0.4\textwidth]{FunChIP-plot3} \caption{\textbf{10 spline-smoothed peaks.} In the left panel, smoothed peaks are shown, while in the right panel the same peaks are centered around their summits.} \end{figure} \begin{figure} \label{twoplotsbis} \centering \includegraphics [width = 0.4\textwidth]{FunChIP-plot2bis} \includegraphics [width = 0.4\textwidth]{FunChIP-plot3bis} \caption{\textbf{10 spline-smoothed and scaled peaks.} In the left panel, scaled peaks are shown, while in the right panel the same peaks are centered around their summits.} \end{figure} From the comparison of Figure \ref{twoplots} and Figure \ref{twoplotsbis} it is clear how the scaling affects the shape of splines. Now peaks are no more related to the magnitude, but just to their shapes. Moreover, plotting both raw counts and spline is also possible: Figure \ref{FunChIP-plotBOTH} shows a single peak in its raw and smoothed version. This representation is useful to check the accuracy of the smoothing and, if needed, manually set the $\lambda$ parameter of the spline approximation. <>= # plot of a peak comparing its raw structure and # its spline-smoothed version. par(mar=c(4.5,5,4,4)) plot_peak(peaks.summit, index = 3, lwd =2, line.plot = 'both', col = 'darkblue', cex.main = 2,cex.lab = 2, cex.axis =2) @ \incfig{FunChIP-plotBOTH}{0.4\textwidth}{Read coverage and spline approximation.} {Plot of the original read coverage of a peak and its smoothing (spline approximation), centered around the summit.} \incfig{FunChIP-plot4}{0.99\textwidth}{Peaks divided in the three clusters} {The same spline-smoothed peaks are plotted in grey, and for each panel the peaks in the corresponding cluster are colored to show their different shapes. Peaks are aligned with the shift coefficients obtained by the k-mean alignment algorithm.} \incfig{FunChIP-plot4bis}{0.8\textwidth}{Scaled peaks divided in the two clusters} {The same spline-smoothed scaled peaks are plotted in grey, and for each panel the peaks in the corresponding cluster are colored to show their different shapes. Peaks are aligned with the shift coefficients obtained by the k-mean alignment algorithm.} <>= # plot of the results of the kmean alignment. # Peaks are plotted in three different panels # according to the clustering results. plot_peak(peaks.cluster, index = 1:100, line.plot = 'spline', shift = TRUE, k = 3, cluster.peak = TRUE, col = topo.colors(3), cex.main = 2,cex.lab = 2, cex.axis =2) @ <>= # plot of the results of the kmean alignment. # Scaled peaks are plotted in three different panels # according to the clustering results. plot_peak(peaks.cluster.scaled, index = 1:100, line.plot = 'spline', shift = TRUE, k = 2, cluster.peak = TRUE, rescale = TRUE, col = heat.colors(2), cex.main = 2,cex.lab = 2, cex.axis =2) @ Finally, the \Rfunction{plot\_peak} method allows to plot the results of the clustering via the k-mean alignment. In Figure \ref{FunChIP-plot4} and Figure \ref{FunChIP-plot4bis}, smoothed and scaled peaks are divided into the three clusters and plotted with the optimal shift obtained with the alignment. \bibliography{mybib} \end{document}