% \VignetteIndexEntry{Simulating ChIP-seq experiments} % \VignetteDepends{ChIPsim} \documentclass[a4paper, 11pt]{article} \usepackage[OT1]{fontenc} \usepackage{fullpage} \usepackage{Sweave} \begin{document} \title{Simulating ChIP-seq experiments} \author{Peter Humburg} \maketitle \section{Introduction} This document provides a brief introduction to the use of \texttt{ChIPsim} in form of a worked example. The main purpose of this package is to provide a framework for the simulation of ChIP-seq experiments. The simulation of nucleosome positioning experiments is implemented as part of the package, see `Extending ChIPsim' for a discussion of how other types of ChIP-seq experiments can be implemented with the help of this package. The simulation of ChIP-seq experiments can be a powerful tool to get a better understanding of ChIP-seq data and the challenges of analysing it. A first step into this direction was taken by Zhang~\emph{et al.}~\cite{Zhang08}. They developed a simulation of a transcription factor binding ChIP-seq experiment and used this do derive an appropriate background model. The simulation framework provided by \texttt{ChIPsim} goes one step further. It simulates the location of various genomic features, such as nucleosome positions or transcription factor binding sites, and the resulting sequence reads. This can then be used to test entire analysis pipelines, from read mapping to feature identification. We may be interested in testing a specific part of the pipeline, e.g., investigate the performance of different read mapping tools or different peak calling algorithms. While this can be very useful we may gain further insights into the performance of our pipeline by testing different combinations of tools to capture interactions between them. Once an analysis pipeline is established we can use the simulation to help us plan new experiments by providing an estimate of how much sequencing will be required to obtain the desired outcomes. In the next section we will discuss the general structure of the simulation to get a better idea of how it works and where we might intervene to change its behaviour. This is followed by an example to demonstrate the simulation of a nucleosome positioning experiment on a small genome. This includes examples to show how the parameters of the simulation can be adjusted but does not cover extensions to different experiment types. Such extensions are discussed at the end of this document where a more complex example demonstrates how we could implement the simulation of a transcription factor binding simulation. \section{Overview} The simulation is divided into six stages that are illustrated below for the example of a nucleosome positioning experiment. The general structure is the same, regardless of experiment but the details are adjusted to suit the problem under consideration. \begin{description} \item[Generate feature sequence] A Markov chain is used to generate a sequence of features, e.g. nucleosomes of different stability, to cover the supplied genome. Each state of the Markov chain represents a different feature type and generates a set of associated parameters. \item[Compute nucleosome density] Each feature type is associated with a feature generating function that translates the feature sequence into a feature density, i.e., for each position in the genome covered by the feature it computes the probability that a nucleosome is centred there. \item[Compute read density] Using information about the length distribution of DNA fragments and the distribution of binding sites within them, the feature density is translated into a read density for each strand. The read density at a given position on a given DNA strand is proportional to the probability that a sequence read starting at that position is produced during the sequencing process. \item[Sample read start sites] Based on the read densities computed in the previous step the location for the desired number of reads is generated. \item[Create read names] A name is assigned to each read. \item[Obtain read sequence and quality] Read positions are translated into the corresponding DNA sequence, qualities are assigned to each read and used to introduce errors into the read sequence. \end{description} All six stages of the simulation are carried out by the function \texttt{simChIP}. It is possible to provide intermediate results to skip some of these steps. For example, if we already have a feature density we could use that instead of generating a new one. The read sequences and qualities generated by the simulation can be exported in FASTQ format. In the default configuration of \texttt{simChIP} this is done automatically if an output file name is provided. Each stage of the simulation is carried out by a different function that is called by \texttt{simChIP} with the appropriate arguments. Individual functions can be replaced to make substantial changes to the behaviour of the simulation in cases where simply adjusting some of the function's arguments is not enough to achieve the desired result. \section{Using the nucleosome positioning simulation} With that general structure in mind we will now take a look at how we might use the ChIPsim package for a simple nucleosome positioning simulation. This section focuses on the build in simulation of a nucleosome positioning experiment. While we will see how this can be modified to adapt the simulation to a specific experiment this example mostly relies on the default settings to keep things simple. \subsection{Setup} We start by loading the ChIPsim package. To ensure that the results of the simulation are reproducible the random number generator is initialised with a fixed seed. <>= library(ChIPsim) set.seed(1) @ If we just want to run the nucleosome positioning simulation with default settings no further preparations are necessary in practice. The remainder of this section is only required because we want to run the example without relying on additional files. It also serves as an example of how the defaults can be adjusted to fit the needs of a particular experiment. The later stages of the simulation require a reference genome to generate read sequences. For the purpose of these examples we will use a very simple (and relatively short) simulated genome sequence. The variable \texttt{chrLen} gives the length (and number) of chromosomes, we could easily generate a different genome size by changing these numbers. <>= chrLen <- c(2e5, 1e5) chromosomes <- sapply(chrLen, function(n) paste(sample(DNA_BASES, n, replace = TRUE), collapse = "")) names(chromosomes) <- paste("CHR", seq_along(chromosomes), sep="") genome <- DNAStringSet(chromosomes) @ To read the genome sequence from a (FASTA formatted) file we would simply use the name of the fasta file instead of the generated sequence above. <>= rm(chromosomes) @ Since the output of the simulation is a FASTQ formatted file we will also need read qualities. By default the simulation uses read quality scores from a real sequencing experiment for this purpose. Since that is not very practical for this example we will provide a function to generate read qualities instead. This will also help to demonstrate how aspects of the simulation can be adjusted for specific needs. <>= randomQuality <- function(read, ...){ paste(sample(unlist(strsplit(rawToChar(as.raw(64:104)),"")), nchar(read), replace = TRUE), collapse="") } @ The \texttt{randomQuality} function will produce read quality scores in Illumina 1.3 format. Apart from the encoding the quality scores will not be very similar to real read qualities but this is sufficient for the purpose of this example. Another aspect of the simulation we may want to change for this example is how output is produced. By default the simulation writes the read sequences and qualities to a FASTQ file. Instead, we would like to produce a data frame that holds this information. To achieve this we have to provide a function that accepts read positions together with some other parameters and returns the \texttt{data.frame} we want. <>= dfReads <- function(readPos, readNames, sequence, readLen, ...){ ## create vector to hold read sequences and qualities readSeq <- character(sum(sapply(readPos, sapply, length))) readQual <- character(sum(sapply(readPos, sapply, length))) idx <- 1 ## process read positions for each chromosome and strand for(k in length(readPos)){ ## chromosome for(i in 1:2){ ## strand for(j in 1:length(readPos[[k]][[i]])){ ## get (true) sequence readSeq[idx] <- as.character(readSequence(readPos[[k]][[i]][j], sequence[[k]], strand=ifelse(i==1, 1, -1), readLen=readLen)) ## get quality readQual[idx] <- randomQuality(readSeq[idx]) ## introduce sequencing errors readSeq[idx] <- readError(readSeq[idx], decodeQuality(readQual[idx])) idx <- idx + 1 } } } data.frame(name=unlist(readNames), sequence=readSeq, quality=readQual, stringsAsFactors = FALSE) } @ \subsection{Simulation} Now that we have a reference genome and a way to generate read qualities we can simply call \texttt{simChIP} to simulate sequencing data. The first argument of \texttt{simChIP} is the number of reads we would like to generate, the second is the reference genome. In this case we will use the genome sequence we generated above, we could simply pass the name of a FASTA formatted file that contains the reference sequence instead. The third argument is the prefix we would like to use for the output files. For this example we will prevent file creation by setting \texttt{file = ""}. Usually it is a good idea to write results to a file, which will produce a FASTQ formatted file with the sequences of all simulated reads. Using file output will also generate additional files containing intermediate results like the underlying feature sequence or resulting read densities for later reuse. The \texttt{control} argument allows us to change the parameters of the simulation. The entire simulation process is divided into six stages and parameters for each can be set separately by providing named entries in the list passed as \texttt{control} argument. As explained above, the six stages of the simulation are: \begin{description} \item[features]{Generate feature sequence (for each chromosome);} \item[bindDensity]{Compute binding site density;} \item[readDensity]{Compute read density for both strands;} \item[sampleReads]{sample from read density to determine read start sites;} \item[readNames]{Create read names;} \item[readSequence]{Obtain read sequence and quality.} \end{description} These parameters are then passed to a function that carries out that step of the simulation. Which function is used can be controlled through the \texttt{functions} argument. Here we will mostly use the defaults but we need to change the way read sequences and qualities are generated. We can use \texttt{defaultFunctions} to obtain the default list of functions and simply replace the one for the final step of the simulation. Instead of completely replacing a part of the simulation we sometimes just want to adjust the default behaviour slightly. In many cases that is possible by changing a parameter value. We will illustrate the process by changing the mean fragment length to 150bp (the default is 160bp). <>= myFunctions <- defaultFunctions() myFunctions$readSequence <- dfReads nReads <- 1000 simulated <- simChIP(nReads, genome, file = "", functions = myFunctions, control = defaultControl(readDensity=list(meanLength = 150))) @ Note the use of \texttt{defaultControl}. This allows us to maintain the default values for all parameters except the ones we list explicitly. The \texttt{defaultControl} function uses the same argument names as \texttt{functions}. Each argument is expected to be a named list with the names corresponding to the names of arguments of the target function. We now have a data frame with \Sexpr{nReads} simulated read sequences together with the location in the genome they originated from, the read densities that were used determine read positions as well as the underlying nucleosome densities and feature sequence. The list returned by \texttt{simChIP} contains all of this information in different components. Their names should give some idea of what they are: <>= names(simulated) @ Of course we can also look at individual components in more detail. Here are the first few reads: <<>>= head(simulated$readSequence) @ We can also examine some of the nucleosome and read densities. Figure~1 shows the nucleosome and read densities surrounding the first stable nucleosome in the feature sequence. The stable nucleosome in the centre is flanked by two phased regions. <>= feat <- simulated$features[[1]] stableIdx <- which(sapply(feat, inherits, "StableFeature")) start <- feat[[stableIdx[1]]]$start plot((start-2000):(start+500), simulated$bindDensity[[1]][(start-2000):(start+500)], xlab="Chromosome 1", ylab="Density", type='l') lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),1], col=4) lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),2], col=2) legend("topright", legend=c("Nucleosome density", "Read density (+)", "Read density (-)"), col=c(1,4,2), lwd=1) @ \begin{figure} \centering <>= feat <- simulated$features[[1]] stableIdx <- which(sapply(feat, inherits, "StableFeature")) start <- feat[[stableIdx[1]]]$start plot((start-2000):(start+500), simulated$bindDensity[[1]][(start-2000):(start+500)], xlab="Chromosome 1", ylab="Density", type='l') lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),1], col=4) lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),2], col=2) legend("topright", legend=c("Nucleosome density", "Read density (+)", "Read density (-)"), col=c(1,4,2), lwd=1) @ \caption{Nucleosome and read densities for a stable feature and flanking regions.} \end{figure} Once the simulation is completed we can use the read sequences and qualities as input for an analysis pipeline. Typically the first step is to map the reads back to the genome (preferably using the read qualities in the process), followed by the application of a peak-finding algorithm to identify peaks in the read counts. The performance of both parts of the analysis can be assessed with the help of data generated by this simulation. This may prove useful to compare and improve analysis pipelines for a variety of ChIP-seq experiments. Once a pipeline is established the simulation can also be used to estimate the amount of sequencing, i.e. the coverage, necessary for a given experiment. \subsection{Using output files} If we had instructed \texttt{simChIP} to generate output files by providing a base file name to argument \texttt{file} the list would contain the name of the output file for each stage rather than the data. These file names are automatically generate by appending an appropriate suffix to the base name at each stage. This name is also used to determine whether output from a previous run is available for re-use. For example, if we set \texttt{file = "test"} in the above call to \texttt{simChIP} the simulation will generate several files including `test\_features.rdata', `test\_bindDensity.rdata' and `test\_readDensity.rdata' for the first three stages of the simulation. If any of these files are found in the working directory and we set \texttt{load = TRUE} the last file in the sequence is loaded and used as input for the next stage, skipping all previous stages. In the above example the names of files produced by the final three stages would be `test\_1\_readPosition.rdata', `test\_1\_readNames.rdata' and `test\_1\_fastq.txt'. These all relate to the reads that were sampled from the read densities and are not loaded as intermediate results. Instead \texttt{simChIP} scans the working directory for files with these names of the form `test\_$n$\_fastq.txt' to determine which index $n$ to use for the current run of the simulation. This enables us to run the same simulation with the same underlying feature sequence, nucleosome and read density repeatedly, each time generating a new set of reads. This essentially simulates repeated sequencing of the same library. Of course it is also possible to introduce some variation into the process. For example if we wanted to study the effect of changes to the DNA fragment size distribution we could use the same nucleosome density but recalculate the read densities with a new set of parameters. \section{Extending ChIPsim} The \texttt{ChIPsim} package provides a framework for the simulation of ChIP-seq experiments. While, as we have seen above, the package includes an implementation of a nucleosome positioning simulation the real strength of \texttt{ChIPsim} is its ability to accommodate a variety of different experiments. Here we will demonstrate how \texttt{ChIPsim} can be extended to other types of experiments. As an example we will create a simulation of transcription factor binding sites based on the one proposed by Zhang~\emph{et al.}~\cite{Zhang08}. This simulation divides the genome into a number of non-overlapping binding sites separated by background regions. The background between binding sites is split into blocks of 1kb, each with its own sampling weight. In the following sections we will implement a variant of this simulation using the \texttt{ChIPsim} framework, demonstrating how the different stages of the simulation can be adjusted for different types of experiments. This is followed by an alternative, more complex simulation of the same process. \subsection{A simple model} In this section we will attempt to recreate the simulation of Zhang~\emph{et al.} (or at least its core aspects). As we will see later there are some differences in the assumptions underlying the work of Zhang~\emph{et al.} and \texttt{ChIPsim}, which means that we will not use the full extend of \texttt{ChIPsim}'s capabilities. This section is followed by the description of a more complex model of transcription factor ChIP-seq that is better suited for an implementation in \texttt{ChIPsim}. \subsubsection{Markov chain} At the core of the simulation is a Markov model that generates a sequence of different features. Each state of the model corresponds to a different feature class and generates a set of parameters required to determine the sampling weight for each feature. For the transcription factor simulation we will only need two states, representing binding sites and background respectively. To specify the Markov model we need to define the transition probabilities between the states and the initial state distribution. For each state we will also need a function to generate the parameters of the corresponding features. We start by specifying the transition probabilities. To do this we need to create a list with components corresponding to the states of the model (we will us ``Binding'' and ``Background'') each of which holds an (S3) object of class ``StateDistribution'', which is simply a numeric vector with the non-zero transition probabilities to other states. <>= transition <- list(Binding=c(Background=1), Background=c(Binding=0.05, Background=0.95)) transition <- lapply(transition, "class<-", "StateDistribution") @ Note that we are allowing several background regions to follow each other while binding regions always have to be followed by a background region. Also note that the transition probability from background to binding regions determines the expected number of binding regions. Unlike Zhang~\emph{et al.} we are not determining the number in advance. We will not allow the sequence to start with a binding site. Therefore, again using the ``StateDistribution'' class, the initial state distribution is: <>= init <- c(Binding=0, Background=1) class(init) <- "StateDistribution" @ The next step is to define functions to generate the parameters for binding and background regions. The parameters for a region should always include `\texttt{start}' and `\texttt{length}', indicating the start position and length of the region respectively. Note that `\texttt{start}' will be determined by the function that generates the feature sequence (see below) so that we only have to accept it as a parameter. The only other parameter required for background regions is the sampling weight. Following the model of Zhang~\emph{et al.} we use a gamma distribution to model the background sampling weight for each region. <>= backgroundFeature <- function(start, length=1000, shape=1, scale=20){ weight <- rgamma(1, shape=1, scale=20) params <- list(start = start, length = length, weight = weight) class(params) <- c("Background", "SimulatedFeature") params } @ The convention used by \texttt{ChIPsim} is to name functions that generate parameters for the different feature classes ``\emph{state}Feature'', where \emph{state} is the name of the state. These functions always return a list of parameters. This list has classes ``\emph{state}'' and ``SimulatedFeature''. It is important that the first class matches the name used for the state in other places. The binding site model requires some additional parameters. Zhang~\emph{et al.} set the average sampling weight $\bar{w}_f$ for binding sites to be the average weight of background regions multiplied by an enrichment coefficient $t$: $\bar{w}_f = t \times \bar{w}_b$. They then increase the sampling weight of binding sites as sequence reads are assigned to them, resulting in a power-law distribution of sampling weights. Since \texttt{ChIPsim} separates the generation of feature densities and the sampling of sequence reads from those densities into two distinct stages it is not easily possible to use the same procedure here. Instead we will use a Pareto distribution with parameter $r$ to determine $w_f$ for each binding site. The minimum of the distribution is chosen such that its mean is $\bar{w}_f$ (Note that this requires $r > 1$). <>= bindingFeature <- function(start, length=500, shape=1, scale=20, enrichment=5, r=1.5){ stopifnot(r > 1) avgWeight <- shape * scale * enrichment lowerBound <- ((r - 1) * avgWeight) weight <- actuar::rpareto1(1, r, lowerBound) params <- list(start = start, length = length, weight = weight) class(params) <- c("Binding", "SimulatedFeature") params } @ With this in place we can now generate a feature sequence using the \texttt{ChIPsim} function \texttt{makeFeatures}. This function takes the description of a Markov model together with several other parameters and produces a list of features for a genomic region of given length. This list is of class ``SimulatedExperiment'' and usually has a more specialized class indicating the specific type of experiment as well. We will see later how this class information is used. To generate a feature sequence we can use something like this: <>= set.seed(1) generator <- list(Binding=bindingFeature, Background=backgroundFeature) features <- ChIPsim::makeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20), experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE)) @ We have already discussed the \texttt{transition} and \texttt{init} parameters of \texttt{makeFeatures}. The \texttt{generator} parameter also relates to the Markov model; it provides a list of the emission distributions for each state. Another interesting argument of \texttt{makeFeatures} is \texttt{globals}. This allows us to pass a list of parameters that will be passed on to all states of the model. Here we use this mechanism to ensure that the values used by both states for \texttt{shape} and \texttt{scale} match. There is another parameter called \texttt{control} that allows us to pass a list of arguments to individual states. \begin{figure}[htb] \centering <>= bindIdx <- sapply(features, inherits, "Binding") plot(density(sapply(features[!bindIdx], "[[", "weight")), xlim=c(0,max(sapply(features, "[[", "weight"))), xlab="Sampling weight", main="") lines(density(sapply(features[bindIdx], "[[", "weight")), col=2) legend("topright", legend=c("Background", "Binding"), col=1:2, lty=1) @ \caption{Sampling weight distributions for background and binding regions.} \end{figure} The feature sequence generated above contains \Sexpr{sum(bindIdx)} binding sites, see Figure~2 for a plot of the resulting weight distributions of binding and background regions. Below is the first feature in the sequence: <<>>= features[[1]] @ Before we turn our attention to the conversion of this sequence into binding site densities it is worth noting that there is another function we can use to generate a feature sequence. Using \texttt{placeFeatures} provides some additional functionality. Firstly, \texttt{placeFeatures} makes an effort to generate a feature sequence that fills as much of the genomic region as possible. This is especially useful for cases where we are dealing with variable length features that may otherwise lead to a relatively large area without any features at the end of the genomic region. Secondly, it calls \texttt{reconcileFeatures}, an S3 method that dispatches on the class of the feature sequence (i.e., the experiment type) and provides a way to post-process the feature sequence. The nucleosome simulation uses this to ensure that some of the parameters of adjacent features are compatible. This mechanism is not needed for the simulation considered here so that we can use the default. However, if we wanted to use it we could simply define a function \texttt{reconcileFeatures.TFExperiment(features, ...)} to perform the desired post-processing. <>= set.seed(1) features <- ChIPsim::placeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20), experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE)) @ This results in the same feature sequence but a closer look reveals that the default post-processing step has introduced an additional parameter for each feature: <<>>= features[[1]] @ The overlap between adjacent features defaults to 0, which is what we want here. In some cases it may be desirable to have overlapping features. The nucleosome simulation uses this to ensure smooth transitions between features. \subsubsection{Binding site density} Once the feature sequence is determined it has to be translated into a binding site density. This task is carried out through a call to \texttt{featureDensity} which is dispatched on the feature class. This means that we have to provide implementations for \texttt{featureDensity.Binding} and \texttt{featureDensity.Background}. In this case both functions are simply constant and produce output of a given length. <>= constRegion <- function(weight, length) rep(weight, length) featureDensity.Binding <- function(feature, ...) constRegion(feature$weight, feature$length) featureDensity.Background <- function(feature, ...) constRegion(feature$weight, feature$length) @ A binding site density for the entire genomic region is generated by a call to \texttt{feat2dens}. <>= dens <- ChIPsim::feat2dens(features) @ The resulting density for a region containing the first binding site in the sequence is shown in Figure~3. Since the simulation of Zhang~\emph{et al.} does not distinguish between reads on the forward and reverse strand we could simply use this binding site density to sample the location of (extended) reads. <>= readLoc <- sample(44000:64000, 1e3, prob=dens[44000:64000], replace=TRUE) @ \begin{figure}[htb] \centering <>= start <- features[[min(which(bindIdx))]]$start length <- features[[min(which(bindIdx))]]$length par(mfrow=c(2,1)) par(mar=c(0, 4.1, 1.1, 2.1)) plot((start-10000):(start+10000), dens[(start-10000):(start+10000)], type='l', xlab="", ylab="Sampling weight", xaxt = 'n') abline(v=start, col="grey", lty=2) abline(v=start+length-1, col="grey", lty=2) par(mar=c(4.1, 4.1, 0, 2.1)) counts <- hist(readLoc, breaks=seq(44000-0.5, 64000.5, 1), plot = FALSE)$counts extReads <- zoo::rollmean(ts(counts), 200)*200 plot(44100:63901,extReads, xlab="Position in genomic region", ylab="Read count", main="", type='l', xlim = c(start-10000, start+10000)) abline(v=start, col="grey", lty=2) abline(v=start+length-1, col="grey", lty=2) @ \caption{Binding site density for a transcription factor binding site (between the dashed lines) and surrounding background (top). Resulting counts of overlapping extended reads are shown below.} \end{figure} Here we are assuming that sampled read positions correspond to the fragment centre and extend reads into both directions. Figure~3 shows counts of reads that were extended to 200bp, the average fragment length. We now have a (simple) simulation of transcription factor ChIP-seq based on the work by Zhang~\emph{et al.} This does not incorporate all aspects of their simulation, e.g. the intra-site weight profile is missing from this version, but demonstrates the principle. In the next section we will build on this to create a simulation that takes information about binding site and DNA fragment length into account to generate strand specific read densities. \subsection{Advanced model} This extended version of the transcription factor ChIP-seq simulation builds on the model from the previous section. We will use the same Markov chain as above with the same emission distributions; but unlike before we will not use blocks of 500bp to represent binding sites. Instead the length of a binding site will be defined as the stretch of DNA protected by the transcription factor. What will change is the binding site density for the ``Binding'' state. We will use a single spike at the centre of the binding site rather than a uniform distribution across the binding site. This will allow us to compute strand specific read densities later. This requires some adjustment to the background density as well. \subsubsection{Post-processing the feature sequence} Since we plan to change the binding site density such that a transcription factor binding site is represented by a single spike at the centre of the binding site, we have to reduce the background density accordingly. It would be straightforward to incorporate this change into \texttt{featureDensity.Background} but for the purpose of this example we will do this as part of the post-processing step that was mentioned but not illustrated before. We will assume that all binding sites have the same length. That should be a reasonable assumption if we only deal with one transcription factor per experiment. <>= reconcileFeatures.TFExperiment <- function(features, ...){ bindIdx <- sapply(features, inherits, "Binding") if(any(bindIdx)) bindLength <- features[[min(which(bindIdx))]]$length else bindLength <- 1 lapply(features, function(f){ if(inherits(f, "Background")) f$weight <- f$weight/bindLength ## The next three lines (or something to this effect) ## are required by all 'reconcileFeatures' implementations. f$overlap <- 0 currentClass <- class(f) class(f) <- c(currentClass[-length(currentClass)], "ReconciledFeature", currentClass[length(currentClass)]) f }) } @ Now we can generate a new feature sequence. <>= set.seed(1) features <- ChIPsim::placeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20), experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE), control=list(Binding=list(length=50))) @ Note the use of the \texttt{control} argument to set the length of binding sites. The parameters of the first feature have changed slightly: <<>>= features[[1]] @ \subsubsection{Revised binding site density} Here is the modified feature density for binding regions: <>= featureDensity.Binding <- function(feature, ...){ featDens <- numeric(feature$length) featDens[floor(feature$length/2)] <- feature$weight featDens } @ Now we can generate the new binding site density, using the same call as before. <>= dens <- ChIPsim::feat2dens(features, length = 1e6) @ Figure~5 shows the binding site density in the same genomic region as before. \subsubsection{Strand specific read densities} The function \texttt{bindDens2readDens} provides us with an easy way to convert binding site densities into read densities for both strands. All we need to do to use this function for our transcription factor simulation is to specify a function that returns the length distribution of DNA fragments used in the experiment. This function should accept a numeric first argument and return the proportion of DNA fragments of that length in the sample. Further arguments the function should accept (and may use) are the minimum and maximum fragment length as well as the length of the binding site. <>= fragLength <- function(x, minLength, maxLength, meanLength, ...){ sd <- (maxLength - minLength)/4 prob <- dnorm(minLength:maxLength, mean = meanLength, sd = sd) prob <- prob/sum(prob) prob[x - minLength + 1] } @ Here we do not use the length of the binding site but our function either needs a \texttt{bind} argument or use `$\ldots$' to absorb any additional arguments. <>= readDens <- ChIPsim::bindDens2readDens(dens, fragLength, bind = 50, minLength = 150, maxLength = 250, meanLength = 200) @ This produces a two column matrix with the read density for the forward strand in the first column and the reverse strand density in the second. Figure~4 gives an impression of how the read densities relate to the binding site density. \begin{figure}[htb] \centering <>= bindIdx <- sapply(features, inherits, "Binding") start <- features[[min(which(bindIdx))]]$start length <- features[[min(which(bindIdx))]]$length par(mfrow=c(2,1)) par(mar=c(0, 4.1, 1.1, 2.1)) plot((start-10000):(start+10000), dens[(start-10000):(start+10000)], type='l', xlab="", ylab="Sampling weight", xaxt='n') par(mar=c(4.1, 4.1, 0, 2.1)) plot((start-10000):(start+10000), readDens[(start-10000):(start+10000), 1], col=4, type='l', xlab="Position in genomic region", ylab="Sampling weight") lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 2], col=2) legend("topleft", legend=c("forward", "reverse"), col=c(4,2), lty=1) @ \caption{Binding site density for a transcription factor binding site and surrounding background for the modified model (top) and resulting strand specific read densities (bottom).} \end{figure} \subsubsection{Sampling reads} Once the read densities are computed it is straightforward to sample read positions. The \texttt{sampleReads} function is provided for this purpose. It allows us to specify the number of reads required as well as weights for each strand. Here we sample 100,000 reads from the read densities with equally weighted strands, which is the default. <>= readLoc <- ChIPsim::sampleReads(readDens, 1e5) @ This produces a list with components \texttt{fwd} and \texttt{rev} giving the start positions of reads on the respective strand. Figure~5 shows the transcription factor binding site with read densities and read positions. \begin{figure}[htb] \centering <>= fwdCount <- hist(readLoc$fwd, breaks=seq(0.5, 1000000.5, by=1), plot=FALSE)$counts revCount <- hist(readLoc$rev, breaks=seq(0.5, 1000000.5, by=1), plot=FALSE)$counts fwdExt <- zoo::rollmean(ts(fwdCount[(start-1000):(start+1050)]), 200, align="right")*200 revExt <- zoo::rollmean(ts(revCount[(start-1000):(start+1050)]), 200, align="left")*200 par(mfrow=c(2,1)) mar.old <- par("mar") par(mar=c(0, 4.1, 1.1, 2.1)) plot((start-1000):(start+1050) ,fwdCount[(start-1000):(start+1050)], col="lightblue", type='h', xlab="", ylab="Read count / density", ylim=c(0, 2), xlim=c(start-975, start+1025), xaxt='n') lines((start-975):(start+1025) ,revCount[(start-975):(start+1025)], col="mistyrose2", type='h') lines((start-10000):(start+10000), dens[(start-10000):(start+10000)]) lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 1], col=4) lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 2], col=2) par(mar=c(4.1, 4.1, 0, 2.1)) plot((start-801):(start+851), fwdExt+revExt, xlim=c(start-975, start+1025), xlab="Position in genomic region", ylab="Overlap count", type='s') lines((start-801):(start+1050), fwdExt, col=4) lines((start-1000):(start+851), revExt, col=2) abline(v=c(start-150, start+200), col="grey", lty=2) @ \caption{Read counts and densities for a transcription factor binding site. In the top panel vertical light blue and red bars indicate the number of reads starting at each position on the forward and reverse strand respectively. The underlying read densities are shown as blue and red lines and the binding site density is shown in black. The number of overlapping extended reads on the forward (blue) and reverse (red) strand are shown in the bottom panel.} \end{figure} Following the procedure in Zhang~\emph{et al} we can extend each read to the average fragment length (200bp) and count the number of overlapping reads at each position. This results in a peak surrounding the binding site that is approximately 350bp wide. \subsubsection{Putting it all together} After generating the desired number of read positions the resulting read sequences can be written to a file using the infrastructure provided by \texttt{ChIPsim} and will not be covered here in detail. A final point worth noting is that although we have carried out the various steps of the simulation for demonstration purposes, it is not necessary to do this in practice. The \texttt{ChIPsim} package provides the \texttt{simChIP} function that allows us to run the entire simulation with a single function call. <>= randomQuality <- function(read, ...){ paste(sample(unlist(strsplit(rawToChar(as.raw(64:104)),"")), nchar(read), replace = TRUE), collapse="") } dfReads <- function(readPos, readNames, sequence, readLen, ...){ ## create vector to hold read sequences and qualities readSeq <- character(sum(sapply(readPos, sapply, length))) readQual <- character(sum(sapply(readPos, sapply, length))) idx <- 1 ## process read positions for each chromosome and strand for(k in length(readPos)){ ## chromosome for(i in 1:2){ ## strand for(j in 1:length(readPos[[k]][[i]])){ ## get (true) sequence readSeq[idx] <- as.character(ChIPsim::readSequence(readPos[[k]][[i]][j], sequence[[k]], strand=ifelse(i==1, 1, -1), readLen=readLen)) ## get quality readQual[idx] <- randomQuality(readSeq[idx]) ## introduce sequencing errors readSeq[idx] <- ChIPsim::readError(readSeq[idx], ChIPsim::decodeQuality(readQual[idx])) idx <- idx + 1 } } } data.frame(name=unlist(readNames), sequence=readSeq, quality=readQual, stringsAsFactors = FALSE) } @ Using the output method described in ``Introduction to ChIPsim'' to produce read sequences we first set up the functions to carry out the different stages of the simulation. As we have seen above we can rely on the build-in functions for this, only the last stage needs to be changed to accommodate the fact that we do not want to create output files. <<>>= myFunctions <- ChIPsim::defaultFunctions() myFunctions$readSequence <- dfReads @ Now we have to adjust the arguments that will be passed to these functions to enable them to work with our transcription factor simulation. <<>>= featureArgs <- list(generator=generator, transition=transition, init=init, start = 0, length = 1e6, globals=list(shape=1, scale=20), experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE), control=list(Binding=list(length=50))) readDensArgs <- list(fragment=fragLength, bind = 50, minLength = 150, maxLength = 250, meanLength = 200) @ After generating a random reference sequence we are all set. <>= genome <- Biostrings::DNAStringSet(c(CHR=paste(sample(Biostrings::DNA_BASES, 1e6, replace = TRUE), collapse = ""))) set.seed(1) simulated <- ChIPsim::simChIP(1e4, genome, file = "", functions = myFunctions, control = ChIPsim::defaultControl(features=featureArgs, readDensity=readDensArgs)) @ Note that we reduced the number of reads in this simulation to avoid generating a large number of read sequences, which can be time consuming. Apart from this change the result is the same as before: <<>>= all.equal(readDens, simulated$readDensity[[1]]) @ \section{Session info} <<>>= sessionInfo() @ \begin{thebibliography}{9} \bibitem{Zhang08} Zhengdong~D. Zhang, Joel Rozowsky, Michael Snyder, Joseph Chang, and Mark Gerstein. \newblock Modeling chip sequencing in silico with applications. \newblock {\em PLoS Computational Biology}, 4(8), August 2008. \end{thebibliography} \end{document}