%\VignetteIndexEntry{SimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data} %\VignetteDepends{Ringo, mclust, limma} %VignetteKeywords{microarray ChIP-chip DamID-chip, Nimblegen, Affymetrix} %VignettePackage{SimBindProfiles} \documentclass[11pt, a4paper]{article} \usepackage[nogin]{Sweave} \usepackage{hyperref} %%\usepackage[authoryear, round]{natbib} %%\SweaveOpts{echo=T,eval=T,cache=F} \newcommand{\R}{\texttt{R} } \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} %% colors \usepackage{color} \definecolor{Red}{rgb}{0.7,0,0} \definecolor{Blue}{rgb}{0,0,0.8} \parindent0mm \parskip2ex plus0.5ex minus0.3ex \hypersetup{% hyperindex = {true}, colorlinks = {true}, linktocpage = {true}, plainpages = {false}, linkcolor = {Blue}, citecolor = {Blue}, urlcolor = {Red}, pdfstartview = {Fit}, pdfpagemode = {UseOutlines}, pdfview = {XYZ null null null} } \addtolength{\textwidth}{2cm} \addtolength{\oddsidemargin}{-1cm} \addtolength{\evensidemargin}{-1cm} \addtolength{\textheight}{2cm} \addtolength{\topmargin}{-1cm} \addtolength{\skip\footins}{1cm} \title{SimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data} \author{Bettina Fischer} \begin{document} \maketitle \tableofcontents \section{Introduction}\label{sec:into} SimBindProfiles identifies common and unique binding regions in genome tiling array data. This package does not rely on peak calling, but directly compares binding profiles processed on the same array platform. It implements a simple threshold approach, thus allowing retrieval of commonly and differentially bound regions between datasets as well as events of compensation and increased binding. The tool requires each data set in SGR file format, which are tab-delimited files with chromosome, position and signal value columns. We suggest preprocessing two-colour microarray data with \Rpackage{Ringo} \cite{Ringo2007} and Affymetrix cel files with \Rpackage{Starr} \cite{Starr2009}, which perform smoothing of probe intensities across replicate arrays for each data set. It is important that data are sorted by chromosomes and ascending positions along each chromosome. <>= library("SimBindProfiles") @ \section{Reading data and normalisation}\label{sec:read} We provide three working example data sets in this vignette. These data have been processed on NimbleGen Drosophila melanogaster DM5 Tiling Set HX1 in triplicates and smoothed into window scores using \Rpackage{Ringo}. In \textit{Drosophila melanogaster}, SoxNeuro (SoxN) and Dichaete (D) belong to the SoxB family of transcription factors. SoxN and Dichaete play essential roles in many aspects of neurogenesis and exhibit some degree of functional redundancy in cells where they are coexpressed. Here, we study the binding patterns of SoxN and Dichaete in wildtype (wt) and SoxN mutant embryos via DamID. In this vignette, we only use the probes located on chromosome X between 1-6000000 bp:\newline SoxNDam = SoxN binding in wt embryos\newline SoxN-DDam = Dichaete binding in SoxN mutants\newline DDam = Dichaete binding in wt embryos\newline <>= dataPath <- system.file("extdata",package="SimBindProfiles") list.files(dataPath, pattern=".txt") head(read.delim(paste(dataPath, "SoxNDam_trunc.txt", sep="/"), header=FALSE, nrows=5)) @ To read data into an \Rclass{ExpressionSet} object each file name is specified omitting the .txt extension. While reading the files, data is quantile normalised as per \Rpackage{limma} \cite{limma05}. For more details on \Rclass{ExpressionSet} please refer to "An Introduction to Bioconductor's ExpressionSet Class" \cite{expressionSet}. <>= readTestSGR <- readSgrFiles(X=c("SoxNDam_trunc", "SoxN-DDam_trunc", "DDam_trunc"), dataPath) @ We continue in this vignette using the larger chromosome X probe set data which was previously saved as SGR.Rdata object and can be loaded and viewed. <>= dataPath <- system.file("data",package="SimBindProfiles") load(paste(dataPath, "SGR.RData", sep="/")) print(SGR) @ A useful plot comparing all the data sets and their correlation after normalisation can be created with the \Rfunction{eSetScatterPlot} function (Figure~\ref{fig:SimBindProfiles-scatter}). <>= eSetScatterPlot(SGR) @ <>= png("SimBindProfiles-scatter.png") eSetScatterPlot(SGR) dev.off() @ \begin{figure}[h!tb] \centering \includegraphics[width=0.5\textwidth]{SimBindProfiles-scatter.png} \caption{\textit{Smoothed scatterplots of normalised signals and the corresponding correlations.}} \label{fig:SimBindProfiles-scatter} \end{figure} \section{Determining a bound cut-off and a difference cut-off}\label{sec:cutoff} In order to identify probes or regions, which are bound similarly or differentially bound between the data sets, \Robject{bound.cutoff} and \Robject{diff.cutoff} (difference cut-off) thresholds have to be chosen. We implemented two methods to set the bound cut-off, probes above this threshold are considered "bound". The \Rfunction{twoGaussiansNull} method established in the \Rpackage{Ringo} package \cite{Ringo2007}, in which data is assumed to follow a mixture of two Gaussian distributions. The Gaussian with the lower mean value is assumed to be the null distribution and probe levels are assigned p-values based on this null distribution. Alternatively the user can select the \Rfunction{normalNull} method which assumes the null distribution is normal and symmetrical around the mode or zero. For both methods the user can decide if the resulting p-values are to be adjusted for multiple testing (fdr) or select a p-value threshold. In our example we use the twoGaussiansNull at 25\% FDR, the function also provides a QC plot of the two Gaussians curves and an optional p-value histogram (not shown). <>= bound.cutoff <- findBoundCutoff(SGR, method="twoGaussiansNull", fdr=0.25) @ <>= dataPath <- system.file("data",package="SimBindProfiles") load(paste(dataPath, "precomputed.RData", sep="/")) cat("Using bound.cutoff =", bound.cutoff, "\n") @ To show the \Robject{bound.cutoff} in relation to the data one can plot a histogram of the data with the bound cutoff (Figure~\ref{fig:SimBindProfiles-boundHist}). <>= hist(exprs(SGR)[,1], breaks=1000, freq=FALSE, border="grey", main=sampleNames(SGR)[1], xlab="signal", sub=paste("bound.cutoff =", bound.cutoff, sep=" ")) abline(v=bound.cutoff, col="red", lty=3, lwd=2) @ <>= png("SimBindProfiles-boundHist.png") hist(exprs(SGR)[,1], breaks=1000, freq=FALSE, border="grey", main=sampleNames(SGR)[1], xlab="signal", sub=paste("bound.cutoff =", bound.cutoff, sep=" ")) abline(v=bound.cutoff, col="red", lty=3, lwd=2) dev.off() @ \begin{figure}[!hbtp] \centering \includegraphics[width=0.5\textwidth]{SimBindProfiles-boundHist.png} \caption{\textit{Histogram of the signal of SoxNDam. Probes with a signal above the bound.cutoff threshold are considered "bound".}} \label{fig:SimBindProfiles-boundHist} \end{figure} A probe is considered uniquely bound in one data set if it is bound above the \Robject{diff.cutoff} threshold to the other set. We propose that the difference cut-off should be smaller than the bound cut-off, but for more stringent analysis criteria it can also be set to the same value as the \Robject{bound.cutoff}. In our example we used the diff.cutoff as 75\% of the \Robject{bound.cutoff} <>= diff.cutoff <- round(bound.cutoff * 0.75,2) @ We can plot the probe intensities along parts of the chromosome using the \Rfunction{chipAlongChrom} function from \Rpackage{Ringo}, which can be called via the plot command. First we create a \Rclass{probeAnno} class object which contains the mapping between the probes and their genomic positions and uses the information stored in the \Rclass{ExpressionSet} object and requires the probe length of the oligo on the array. In Figure~\ref{fig:visualiseBound} SoxNDam (green curve) is uniquely bound at region 2324000 - 2326000 as the intensity is above the bound.cutoff. Whereas SoxN-DDam (orange) is also above the bound.cutoff at region 2334000 - 2336000 but the difference in the intensity of SoxNDam is too small (below diff.cutoff) to call this region uniquely bound. <>= probeAnno <- probeAnnoFromESet(SGR, probeLength=50) @ <>= plot(SGR, probeAnno, chrom="X", xlim=c(2323000,2337000), ylim=c(-3,4), samples=c(1,2)) @ \begin{figure}[!hbtp] \centering <>= plot(SGR, probeAnno, chrom="X", xlim=c(2323000,2337000), ylim=c(-3,4), samples=c(1,2), icex=1, sampleLegend=FALSE) @ \caption{\textit{Probe intensities along chromosome, green = SoxNDam, orange = SoxN-DDam.}} \label{fig:visualiseBound} \end{figure} \section{Setting array specific parameters}\label{sec:para} The user has to specify the \Robject{probes} and \Robject{probe.max.spacing} parameters. Both depend on the array platform and how densely the probes are spaced across the genome on the tiling array. The minimum number of probes specifies how many probes have to be in a valid region. The probe.max.spacing is the maximum distance in base pairs allowed between probes before a region is split into separate regions. <>= probes <- 10 probe.max.spacing <- 200 @ A frequency plot of number of probes per bound regions can be used to help determine the probes parameter selection <>= probeLengthPlot(SGR, sgrset=1, chr=NULL, bound.cutoff, probe.max.spacing=200, xlim.max=25) @ \begin{figure}[!hbtp] \centering <>= probeLengthPlot(SGR, sgrset=1, chr=NULL, bound.cutoff, probe.max.spacing=200, xlim.max=25) @ \caption{\textit{Frequency plot of number of probes per bound region.}} \label{fig:probeLength} \end{figure} \section{Performing pairwise classification}\label{sec:pair} In this section we identify regions that are uniquely or commonly bound in two data sets. First the probes for each data set are flagged as bound or not bound depending on whether the signal is above the \Robject{bound.cutoff}. Then the probes are split into three classes:\newline Class 1: uniquely bound in set 1 (bound in set 1, not bound in set 2, signal 1 minus signal 2 above diff.cutoff).\newline Class 2: uniquely bound in set 2 (bound in set 2, not bound in set 1, signal 2 minus signal 1 above diff.cutoff)\newline Class 3: bound in both data sets\newline Then the classified probes are filtered into regions using the \Robject{probes} and \Robject{probe.max.spacing} parameters. The regions for each class are exported to bed files, which provide the chromosome, start and end positions, the name and a score for the region. For example the score for class 1 is calculated as follows. First we subtract for each probe within a region signal set 1 minus signal set 2, and then we calculate the mean over the region. The tool uses the names of each data set and the selected parameters as the resulting file names (b = bound.cutoff, d = diff.cutoff, v = probes, g = probe.max.spacing). In our example we query SoxNDam vs. DDam, which correspond to data set 1 and 3 in the ExpessionSet object. <>= pairwiseR <- pairwiseRegions(SGR, sgrset=c(1,3), bound.cutoff, diff.cutoff, probes, probe.max.spacing) head(pairwiseR) @ We also provide a ploting method to show the bound probes (before filtering into regions) in colour. <>= plotBoundProbes(SGR, sgrset=c(1,2), method="pairwise", bound.cutoff, diff.cutoff) @ <>= png("SimBindProfiles-pairwiseBound.png") plotBoundProbes(SGR, sgrset=c(1,2), method="pairwise", bound.cutoff, diff.cutoff) dev.off() @ \begin{figure}[!hbtp] \centering \includegraphics[width=0.5\textwidth]{SimBindProfiles-pairwiseBound.png} \caption{\textit{Scatterplot of pairwise classification of SoxNDam vs. DDam. Red highlights the unique SoxNDam probes, green unique in DDam and grey are probes common to both.}} \label{fig:pairwiseBoundProbesPlot} \end{figure} \section{Performing three-way classification}\label{sec:three} This tool allows identification of regions that are unique or common in three data sets. The approach is the same as for the pairwise classification. The probes are segregated into seven classes:\newline Class = 1: unique probes in set 1\newline Class = 2: unique probes in set 2\newline Class = 3: unique probes in set 3\newline Class = 4: common probes in set 1+2\newline Class = 5: common probes in set 2+3\newline Class = 6: common probes in set 1+3\newline Class = 7: common probes in set 1+2+3\newline Then the probes are again filtered into regions using the \Robject{probes} and \Robject{probe.max.spacing} parameters. The regions for each class are exported to bed files. The score is calculated similar to the pair-wise classification. In our example we query SoxNDam vs. SoxN-DDam vs. DDam (results not shown). <>= threewayR <- threewayRegions(SGR, sgrset=c(1,2,3), bound.cutoff, diff.cutoff, probes, probe.max.spacing) @ \section{Performing increased binding classification}\label{sec:incr} It might be of interest to identify regions showing increased binding, which are more bound in one dataset compared to the other. In our example, Dichaete can bind at a higher level in the SoxN mutant embryos (SoxN-DDam) compared to the wt embryos (DDam) (Figure~\ref{fig:visualiseIncreasedBinding}). If the signal of a bound probe in set 1 is higher than the diff.cutoff, then these probes are filtered into regions using the \Robject{probes} and \Robject{probe.max.spacing} parameters and reported as bed file, which reports the chromosome, start, end, name and score. The score is calculated similarly to the pairwise classification. Compare set SoxN-DDam vs. DDam <>= increasedR <- increasedBindingRegions(SGR, sgrset=c(2,3), bound.cutoff, diff.cutoff, probes, probe.max.spacing) head(increasedR) @ <>= head(increasedR) @ Visualise the second increased binding region in genomic context. <>= plot(SGR, probeAnno, chrom="X", xlim=c(4589000,4593200), ylim=c(-0.5,5), samples=c(2,3)) @ \begin{figure}[!hbtp] \centering <>= plot(SGR, probeAnno, chrom="X", xlim=c(4589000,4593200), ylim=c(-0.5,5), samples=c(2,3), icex=1, sampleLegend=FALSE) @ \caption{\textit{Probe intensities along chromosome at increased binding region, green = SoxN-DDam, orange = DDam.}} \label{fig:visualiseIncreasedBinding} \end{figure} To plot the probes showing increased binding (before filtering into regions) in colour (Figure~\ref{fig:increasedBindingBoundProbesPlot}). <>= plotBoundProbes(SGR, sgrset=c(2,3), method="increasedBinding", bound.cutoff, diff.cutoff, pcex=4) @ <>= png("SimBindProfiles-increasedBindingBound.png") plotBoundProbes(SGR, sgrset=c(2,3), method="increasedBinding", bound.cutoff, diff.cutoff, pcex=4) dev.off() @ \begin{figure}[!hbtp] \centering \includegraphics[width=0.5\textwidth]{SimBindProfiles-increasedBindingBound.png} \caption{\textit{Scatterplot of increased binding classification of SoxNDam vs. DDam, increased binding probes are highlighted in blue}} \label{fig:increasedBindingBoundProbesPlot} \end{figure} \section{Performing compensation classification}\label{sec:comp} This is another special case in which we want to identify regions which are bound in two sets but not in the third. In our example, in wt embryos Dichaete is not bound (DDam) and SoxN is bound (SoxNDam). However, in SoxN mutants (SoxN-DDam) Dichaete binds at this location to compensate for the loss of SoxN (Figure~\ref{fig:visualiseCompensation}). Probes above the bound.cutoff for which the average bound signal of set 1 and set 2 are larger than the diff.cutoff to the non-bound set 3 are identified. The probes are again filtered into regions using the \Robject{probes} and \Robject{probe.max.spacing} parameters and reported in a bed file, which gives the chromosome, start, end, name and score. Compare set SoxNDam + SoxN-DDam vs. DDam <>= compR <- compensationRegions(SGR, sgrset=c(1,2,3), bound.cutoff, diff.cutoff, probes, probe.max.spacing) head(compR) @ <>= head(compR) @ Visualise a compensation region in genomic context (Figure~\ref{fig:visualiseCompensation}). <>= plot(SGR, probeAnno, chrom="X", xlim=c(2943000,2947000), ylim=c(-1,4)) @ \begin{figure}[!hbtp] \centering <>= plot(SGR, probeAnno, chrom="X", xlim=c(2943000,2947000), ylim=c(-1,4), icex=1) @ \caption{\textit{Probe intensities along chromosome, green = SoxN-DDam, orange = SoxN-DDam, blue = DDam.}} \label{fig:visualiseCompensation} \end{figure} To plot the probes showing compensation (before filtering into regions) in colour (Figure~\ref{fig:compensationBoundProbesPlot}). <>= plotBoundProbes(SGR, sgrset=c(1,2,3), method="compensation", bound.cutoff, diff.cutoff, pcex=4) @ <>= png("SimBindProfiles-compensationBound.png") plotBoundProbes(SGR, sgrset=c(1,2,3), method="compensation", bound.cutoff, diff.cutoff, pcex=4) dev.off() @ \begin{figure}[!hbtp] \centering \includegraphics[width=0.5\textwidth]{SimBindProfiles-compensationBound.png} \caption{\textit{Scatterplot of compensation classification of SoxNDam + SoxN-DDam vs. DDam. Probes which are bound in SoxNDam + SoxN-DDam but not in DDAm are highlighted in orange.}} \label{fig:compensationBoundProbesPlot} \end{figure} \section{Concluding Remarks}\label{sec:remarks} The package \Rpackage{SimBindProfiles} facilitates the comparison of ChIP-chip or DamID profiles generated on the same microarray platform. It provides functions for data import, normalization and analysis. High-level plots for quality assessment are available. While this analysis approach worked well with our data, we do not claim it is the definite algorithm for this task.\medskip This vignette was generated using the following package versions: <>= toLatex(sessionInfo()) @ \section*{Acknowledgments} Many thanks to Enrico Ferrero who generated the data sets and our conductive discussions about the analysis approach, Steven Russell for helpful suggestions and Robert Stojnic for source code contributions to SimBindProfiles. \bibliographystyle{abbrv} \bibliography{SimBindProfiles-Bibliography} \end{document}