\documentclass[a4paper,11pt]{article} \usepackage[margin=1.5in]{geometry} %%%%\usepackage[left=3cm, right=3cm, top=3cm]{geometry} %%%%%\usepackage[left= 2cm, righttextwidth=10cm]{geometry} %\VignetteIndexEntry{ChIC} \usepackage[utf8]{inputenc} %%%% international characters \usepackage[T1]{fontenc} %%%% Output font for international characters \usepackage[english]{babel} \usepackage{graphicx} %%%%%\usepackage[colorlinks=false, linktocpage=true]{hyperref} \usepackage[hidelinks]{hyperref} \usepackage[dvipsnames]{xcolor} \usepackage{sectsty} \definecolor{airforceblue}{rgb}{0.36, 0.54, 0.66} \sectionfont{\color{RoyalBlue}} \subsectionfont{\color{RoyalBlue}} \definecolor{pumpkin}{rgb}{1.0, 0.46, 0.09} \usepackage{PTSerif} %%%%% Use the Paratype Serif font \usepackage{Sweave} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=1.1em, frame=single}%, label=Input, labelposition=topline} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1.1em, frame=single}%, label=Output} \usepackage[final]{pdfpages} \begin{document} \begin{titlepage} % Suppresses headers and footers on the title page %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Minimalist Book Title Page % LaTeX Template % Version 2.0 (21/7/17) % % This template was downloaded from: % http://www.LaTeXTemplates.com % % Original author: % Peter Wilson (herries.press@earthlink.net) with modifications by: % Vel (vel@latextemplates.com) % % License: % CC BY-NC-SA 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/) \raggedleft % Right align everything \vspace*{\baselineskip} % Whitespace at the top of the page {\Large Carmen M. Livi} % Author name \vspace*{0.167\textheight} % Whitespace before the title \textbf{\LARGE ChIP-seq quality Control}\\[\baselineskip] % First line %\textbf{\LARGE }\\[\baselineskip] % First title line {\textcolor{RoyalBlue}{\Huge \textbf{ChIC} }}\\[\baselineskip] % Main title line which draws the focus of the reader {\Large \textit{A short introduction}} % Subtitle \vfill % Whitespace between the titles and the publisher %{\large The Publisher~~\plogo} % Publisher and logo \includegraphics[width = 50mm]{ChIC_Logo} \vspace*{3\baselineskip} % Whitespace at the bottom of the page \end{titlepage} %\title{ChIP-seq quality Control package \textbf{ChIC}: %A short introduction} \tableofcontents \newpage \section{Introduction} \textbf{ChI}P-seq quality \textbf{C}ontrol package (ChIC) provides functions and data structures to assess the quality of ChIP-seq data. The tool computes three different categories of quality control (QC) metrics: QC-metrics designed for narrow-peak profiles and general metrics, QC-metrics based on the global read distribution and QC-metrics from local signal enrichment around annotated genes. The user-friendly functions allow performing the analysis with a single command, whereas step by step functions are also available for more experienced users. \noindent The package comes with a large reference compendium of precomputed QC-metrics from public ChIP-seq samples. Key features are the calculation, visualisation and creation of summary plots for QC-metrics, tools for the comparison of metagene profiles against reference profiles, tools for the comparison of single QC-metrics against the compendium values and finally a random forest model to compute a single score summarising quality control. \noindent ChIC provides three wrapper functions that are used to calculate a comprehensive set of metrics: \textcolor{airforceblue}{qualityScores\_EM()}, \textcolor{airforceblue}{qualityScores\_GM()} and \textcolor{airforceblue}{qualityScores\_LM()}. \newpage \section{Input} \noindent To run ChIC the user has to provide two bam files: one for ChIP and one for the input. In this tutorial we illustrate ChIC functions using a H3K36me3 ChIP-seq dataset from ENCODE (ID: ENCFF000BFX) and its input (ID: ENCFF000BDQ). The data can can be downloaded from: \noindent \url{https://www.encodeproject.org/files/ENCFF000BFX/}\\ \url{https://www.encodeproject.org/files/ENCFF000BDQ/} <>= system("wget https://www.encodeproject.org/files/ENCFF000BFX/ @@download/ENCFF000BFX.bam") system("wget https://www.encodeproject.org/files/ENCFF000BDQ/ @@download/ENCFF000BDQ.bam") chipName <- "ENCFF000BFX" inputName <- "ENCFF000BDQ" @ \textbf{PLEASE NOTE:} For timing reasons the tutorial will use toy-bam files with a reduced number of chromosomes. Input and chip data are therefore loaded from our datapackage "ChIC.data": <>= library(ChIC) #load tag-list with reads aligned to a subset of chromosomes data("chipSubset", package = "ChIC.data", envir = environment()) chipBam <- chipSubset data("inputSubset", package = "ChIC.data", envir = environment()) inputBam <- inputSubset @ <>= options(width=60) @ \newpage \section{ENCODE Metrics (EM)} \noindent \textcolor{airforceblue}{qualityScores\_EM()} is a wrapper function that reads the provided bam files and calculates a number of QC-metrics from cross-correlation analysis and peak-calling. We will refer to the output measures as ENCODE Metrics EM, as originally proposed by ENCODE consortium \cite{Landt}. <>= ##calculating EM mc=4 ##for parallelisation CC_Result <- qualityScores_EM(chipName=chipName, inputName=inputName, annotationID="hg19", read_length=36, mc=mc) finalTagShift <- CC_Result$QCscores_ChIP$tag.shift @ \noindent The function expects two bam files: one for the immunoprecipitation (ChIP) and one for the control (Input). The read length (read\_length parameter) can be taken from the bam file itself. The 'mc' parameter is set to 1 by default, when changed it triggers the parallelisation and speeds up the calculations of a few analysis steps that allow using multiple computing cores. 'annotationID' refers to the genome annotation used. Currently hg19 and mm9 (dm3 in preparation) are supported. \textcolor{airforceblue}{qualityScores\_EM()} produces the Cross-correlation plot (see Figure \ref{fig:1}) and returns a number of QC-metrics (see: \ref{Appendix} Appendix). Amongst others the tag.shift value, which represents an input parameter for further steps (i.e. peak-calling and metagene calculation). \textcolor{airforceblue}{qualityScores\_EM()} executes the following single steps: \begin{enumerate} \item Reading BAM files (readBamFile) \item Calculatig QC-metrics from CrossCorrelation analysis (getCrossCorrelationScores) \item Removing anomalies in the read distrbution (removeLocalTagAnomalies) \item Calculating QC-metrics from peak calls (getPeakCallingScores) \item Profile smoothing for further analysis steps (tagDensity) \end{enumerate} \noindent \textbf{PLEASE NOTE:} The following code chunks are automatically executed within the \textcolor{airforceblue}{qualityScores\_EM()} wrapper function. Nevertheless each listed function is available for single use if the user starts from the bam file and wants to perform all the steps individually. \subsection{Reading BAM files} The first step in the \textcolor{airforceblue}{qualityScores\_EM()} function reads ChIP-seq data in .bam file format. The function expects the filename, that can also contain the pathname. <>= chipBam <- readBamFile(chipName) inputBam <- readBamFile(inputName) @ \subsection{Calculate QC-metrics from CrossCorrelation analysis} \textcolor{airforceblue}{getCrossCorrelationScores()} calculates QC-metrics from the cross-correlation analysis and other general metrics, e.g. the non-redundant fractions of mapped reads. An important parameter required by \textcolor{airforceblue}{getCrossCorrelationScores()} is the binding-characteristics, calculated using \textcolor{airforceblue}{spp::get.binding.characteristics()} function. The binding-characteristics structure provides information about the peak separation distance and the cross-correlation profile (for more details see \cite{Kharchenko2008}). <>= mc=2 data("crossvalues_Chip", package = "ChIC.data", envir = environment()) @ <>= cluster <- parallel::makeCluster( mc ) ## calculate binding characteristics chip_binding.characteristics <- get.binding.characteristics( chipBam, srange=c(0,500), bin = 5, accept.all.tags = TRUE, cluster = cluster) input_binding.characteristics <- get.binding.characteristics( inputBam, srange=c(0,500), bin = 5, accept.all.tags = TRUE, cluster = cluster) parallel::stopCluster( cluster ) @ <>= ## calculate cross correlation QC-metrics crossvalues_Chip <- getCrossCorrelationScores( chipBam , chip_binding.characteristics, read_length = 36, annotationID="hg19", savePlotPath = filepath, mc = mc) @ \noindent "savePlotPath" sets the path in which the Cross-Correlation plot (as pdf) should be saved. If nothing is provided the plot will be forwarded to default DISPLAY. An example of a Cross-Correlation plot is shown in Figure \ref{fig:1}. Along with the visual output, \textcolor{airforceblue}{getCrossCorrelationScores()} returns a list with EM (see \ref{Appendix} Appendix), from which we have to save tag.shift value for further steps: <>= finalTagShift <- crossvalues_Chip$tag.shift @ \begin{figure} \centering .\includegraphics[scale=0.60]{phantomCrossCorrelation.pdf} \caption{Cross-correlation plot of the ChIP.} \label{fig:1} \end{figure} \subsection{Remove anomalies in the read distrbution} The data is processed further by using \textcolor{airforceblue}{removeLocalTagAnomalies()} that removes local read anomalies like regions with extremely high read counts compared to the neighborhood (for more details see \cite{Kharchenko2008}). <>= ##get chromosome information and order chip and input by it chrl_final <- intersect(names(chipBam$tags), names(inputBam$tags)) chipBam$tags <- chipBam$tags[chrl_final] chipBam$quality <- chipBam$quality[chrl_final] inputBam$tags <- inputBam$tags[chrl_final] inputBam$quality <- inputBam$quality[chrl_final] @ <>= ##remove positions with extremely high read counts with ##respect to the neighbourhood selectedTags <- removeLocalTagAnomalies(chipBam, inputBam, chip_binding.characteristics, input_binding.characteristics) inputBamSelected <- selectedTags$input.dataSelected chipBamSelected <- selectedTags$chip.dataSelected @ \subsection{Calculate QC-metrics from peak calls} The last set of QC-metrics belonging to the category of EMs are based on the number of called peaks using \textcolor{airforceblue}{getPeakCallingScores()}. <>= bindingScores <- getPeakCallingScores(chip = chipBam, input = inputBam, chip.dataSelected = chipBamSelected, input.dataSelected = inputBamSelected, annotationID="hg19", tag.shift = finalTagShift, mc = mc) @ \subsection{Profile smoothing} The last step executed in \textcolor{airforceblue}{qualityScores\_EM()} is the smoothing (using a Gaussian kernel) of the read profile to obtain the tag density profile (for more details see \cite{Kharchenko2008}). <>= smoothedChip <- tagDensity(chipBamSelected, annotationID = "hg19", tag.shift = finalTagShift, mc = mc) smoothedInput <- tagDensity(inputBamSelected, annotationID = "hg19", tag.shift = finalTagShift, mc = mc) @ \noindent The read density profile is needed to calculate the remaining two categories of QC-metrics: the Global enrichment profile Metrics (GM) and the local enrichment profile metrics (LM) (see \ref{Appendix} Appendix). \newpage \section{Global enrichment profile Metrics (GM) and Fingerprint-plot} This category of QC-metrics is based on the global read distribution along the genome for ChIP and Input \cite{Diaz2012}. The wrapper \textcolor{airforceblue}{qualityScores\_GM()} reproduces the so-called Fingerprint plot (Figure \ref{fig:Fingerprint}), i.e. the cumulative read distribution plot, from which 9 quantitative QC-metrics are derived. Examples of these metrics are the (a) fraction of bins without reads for ChIP and input, (b) the point of maximum distance between the ChIP and input (x-coordinate, y-coordinate for ChIP and input, the distance calculated as absolute difference between the two y-coordinates, the sign of the difference), (c) the fraction of reads in the top 1 percent of bins with highest coverage for ChIP and input. <>= Ch_Results <- qualityScores_GM(densityChip = smoothedChip, densityInput = smoothedInput, savePlotPath = filepath) str(Ch_Results) @ <>= Ch_Results <- qualityScores_GM(densityChip=smoothedChip, densityInput=smoothedInput) @ \begin{figure}[htbp] \centering <>= <> @ \caption{Fingerprint plot of sample ENCFF000BFX and its input ENCFF000BDQ.} \label{fig:Fingerprint} \end{figure} <>= str(Ch_Results) @ \newpage \section{Local enrichment profile metrics (LM) and metagene profiles} \textcolor{airforceblue}{createMetageneProfile()} creates the unscaled single-point and a scaled whole gene metagene profile and returns a list with three items: "TSS", "TES" and "geneBody". Each item is again a list with the metagene profiles for ChIP and input. <>= Meta_Result <- createMetageneProfile( smoothed.densityChip = smoothedChip, smoothed.densityInput = smoothedInput, annotationID="hg19", tag.shift = finalTagShift, mc = mc) @ \noindent The objects in "Meta\_Result" are needed to create the metagene plots and to extract the LMs for the different profiles. \noindent Metagene profiles show the signal enrichment around a region of interest like the transcription start site (TSS) or over the gene body. ChIC creates two types of metagene profiles: an unscaled single-point profile for the TSS and transcription end site, and a scaled whole gene metagene profile including the gene body like in Figure \ref{fig:MetaGenePlots}. For the metagene profiles the tag density of the immunoprecipitation is taken over all RefSeq annotated human genes, averaged and log2 transformed. The same is done for the input. The normalised profile (Figure \ref{fig:MetaGenePlotNorm}) is calculated as the signal enrichment (immunoprecipitation over the input) and plotted on the y-axis, whereas the genomic coordinates of the genes like the TSS or regions up- and downstream are shown on the x-axis. \subsection{Plotting an unscaled single-point metagene profile} The "TSS" or "TES" object and the \textcolor{airforceblue}{qualityScores\_LMgenebody()} function are used to plot the unscaled profile (see Figure \ref{fig:TSS}) and return the respective LM values. <>= TES_Scores=qualityScores_LM(data=Meta_Result$TES, tag="TES") @ \begin{figure}[htbp] \centering <>= TSS_Scores=qualityScores_LM(data=Meta_Result$TSS, tag="TSS") @ \caption{Unscaled single-point metagene profile: Signal enrichment for ChIP and Input at the TSS.} \label{fig:TSS} \end{figure} \subsection{Plotting a scaled whole gene metagene profile} The "geneBody" object and the \textcolor{airforceblue}{qualityScores\_LMgenebody()} function are used to plot the scaled profile (see Figure \ref{fig:MetaGenePlots}) and return the respective LM values: \begin{figure}[htbp] \centering <>= geneBody_Scores <- qualityScores_LMgenebody(Meta_Result$geneBody) @ \caption{Scaled whole gene metagene profile: Signal enrichment for ChIP and Input along the gene body.} \label{fig:MetaGenePlots} \end{figure} \begin{figure}[htbp] \centering .\includegraphics{ScaledMetaGeneNormalized.pdf} \caption{Normalized profile: signal enrichment for ChIP over Input along the gene body for a scaled whole gene profile.} \label{fig:MetaGenePlotNorm} \end{figure} \newpage \section{Quality assessment using the compendium of QC-metrics as reference} The comprehensive set of QC-metrics, computed over a large set of ChIP-seq samples, constitutes in itself a valuable compendium that can be used as a reference for comparison with new samples. ChIC provides three functions for that: \begin{itemize} \item \textit{metagenePlotsForComparison} to compare the metagene plots with the compendium \item \textit{plotReferenceDistribution} to compare a single QC-metric with the compendium values \item \textit{predictionScore} to obtain a single quality score from the random forest models trained on the previously computed QC-metrics \end{itemize} \subsection{Available chromatin marks and factors} This useful function lists all chromatin marks and transcription factors that are available for the comparison analysis. With the keywords "mark" and "TF" the respective lists with the available elements are listed. <>= listAvailableElements(target="H3K36me3") @ \subsection{List the sample IDs of the compendium} Shows the IDs of all analysed ChIP-seq samples from ENCODE and Roadmap that have been included in the compendium by providing the keyword "ENCODE" or "Roadmap". <>= head(listDatasets(dataset="ENCODE")) @ \subsection{Comparing local enrichment profiles} The \textcolor{airforceblue}{metagenePlotsForComparison()} function is used to compare the local enrichment profile to the reference compendium by plotting the metagene profile against the expected metagene for the same type of chromatin mark. The expected metagene profile is provided by the compendium mean (black line) and standard error (blue shadow). Examples are shown in Figures \ref{fig:MetaProfileComparisonGeneBody} and \ref{fig:MetaProfileComparisonTSS}. \begin{figure}[htbp] \centering <>= metagenePlotsForComparison(data = Meta_Result$geneBody, target = "H3K4me3", tag = "geneBody") @ \caption{Enrichment profile plotted against the pre-computed profiles of the compendium. The metagene profile shows the sample signal (red line) for the ChIP compared to the compendium mean signal (black line) and the 2x standard error (blue shadow).} \label{fig:MetaProfileComparisonGeneBody} \end{figure} \begin{figure}[htbp] \centering <>= metagenePlotsForComparison(data = Meta_Result$TSS, target = "H3K4me3", tag = "TSS") @ \caption{The metagene profile shows the sample signal for the ChIP (red line) compared to the compendium mean signal (black line) and the 2x standard error (blue shadow).} \label{fig:MetaProfileComparisonTSS} \end{figure} %\begin{figure}[htbp] %\centering %.\includegraphics{Comparison.pdf} %\caption{Enrichment profile plotted against the pre-computed %profiles of the compendium. The metagene profiles show the sample signal %(red line) for the ChIP (A) and the enrichment (chip over input) (B) %when compared to the compendium mean signal (black line) and its 2x standard %error (blue shadow).} %\label{fig:MetaProfileComparison} %\end{figure} %<>= %metagenePlotsForComparison(data = Meta_Result$geneBody, % target = "H3K4me3", tag = "geneBody") %@ %\begin{figure}[htbp] %\centering %<>= %<> %@ %\caption{The metagene profile shows the signal (red line) %of the input compared to the compendium.} %\label{fig:MetaProfileComparisonInput} %\end{figure} \subsection{Comparing QC-metrics to the reference values of the compendium} Plotting a single QC-metric against the reference values from a large number of already published data, adds an extra level of information that can be easily used by less experienced users. An example is shown in Figure \ref{fig:ValueComparison}. \begin{figure}[htbp] \centering <>= plotReferenceDistribution(target = "H3K4me3", metricToBePlotted = "RSC", currentValue = crossvalues_Chip$CC_RSC ) @ \caption{The QC-metric of a newly analysed ChIP-seq sample can be compared to the reference values of the compendium. The density plot shows the QC-metric RSC (red dashed line) of the sample versus the distribution of the same metric in the the compendium for the respective binding profile or TF.} \label{fig:ValueComparison} \end{figure} \subsection{Assessing data quality with machine learning} The compendium of metrics has been used to train a random forest model that summarizes the sample quality in one single QC-score. <>= EM_scoresNew=NULL EM_scoresNew$QCscores_ChIP <- crossvalues_Chip EM_scoresNew$QCscores_binding <- bindingScores EM_scoresNew$TagDensityInput <- list() EM_scoresNew$TagDensityChip <-list() CC_Result <- EM_scoresNew @ <>= te <- predictionScore(target = "H3K4me3", features_cc = CC_Result, features_global = Ch_Results, features_TSS = TSS_Scores, features_TES = TES_Scores, features_scaled = geneBody_Scores) print(te) @ \subsection{Summary report} When analysing many differet ChIP-seq samples the user might be interested only in the visualization of the quality assessment. ChIC contains also a function that produces a summary report. The summary report is a single document (pdf format) that contains all the produced analysis plots. The function returns also the predicted sample QC-score. <>= prediction=chicWrapper(chipName=chipName, inputName=inputName, chromMarkToUse= "H3K4me3", read_length=36, mc=mc , savePlotPath=filepath) @ \newpage \addcontentsline{toc}{section}{References} \bibliographystyle{unsrt} \begin{thebibliography}{10} \bibitem{Landt} Landt, Stephen G. \newblock{ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.} \newblock {Genome Research 2012} \bibitem{Kharchenko2008} Kharchenko, Peter V and Tolstorukov, Michael Y and Park, Peter J. \newblock{Design and analysis of ChIP-seq experiments for DNA-binding proteins.} \newblock {Nat. Biotechnol. 2008}, \bibitem{Diaz2012} Diaz, Aaron and Nellore, Abhinav and Song, Jun. \newblock {CHANCE: comprehensive software for quality control and validation of ChIP-seq data.} \newblock {Genome Biol. 2012} \end{thebibliography} \newpage \section{Appendix} \label{Appendix} %%\includepdf[pages=-,pagecommand={},width=\textwidth]{ListOfFeatures.pdf} %%\includepdf[pages=-,picturecommand*={% %% \put (133,660) {% %% \parbox{\textwidth}]{QC-metrics.pdf} %\includepdf[pages=-, noautoscale=true, pagecommand={}]{QC-metrics.pdf} \begin{figure}[htbp] \centering .\includegraphics[scale=1.5]{QC-metricsLM.pdf} \end{figure} \begin{figure}[htbp] \centering .\includegraphics[scale=2]{QC-metricsLM1.pdf} \end{figure} \begin{figure}[htbp] \centering .\includegraphics[scale=1.2]{QC-metricsGM_EM.pdf} \end{figure} \end{document}