% \VignetteIndexEntry{Simulating ChIP-seq experiments}
% \VignetteDepends{ChIPsim}
\documentclass[a4paper, 11pt]{article}
\usepackage[OT1]{fontenc}
\usepackage{fullpage}
\usepackage{Sweave}
\begin{document}

\title{Simulating ChIP-seq experiments}
\author{Peter Humburg}

\maketitle

\section{Introduction}
This document provides a brief introduction to the use of \texttt{ChIPsim} in 
form of a worked example. The main purpose of this package is to 
provide a framework for the simulation of ChIP-seq experiments. The simulation
of nucleosome positioning experiments is implemented as part of the package, see 
`Extending ChIPsim' for a discussion of how other types of ChIP-seq experiments 
can be implemented with the help of this package. 

The simulation of ChIP-seq experiments can be a powerful tool to get a better 
understanding of ChIP-seq data and the challenges of analysing it. A first step
into this direction was taken by Zhang~\emph{et al.}~\cite{Zhang08}. They developed
a simulation of a transcription factor binding ChIP-seq experiment and used this
do derive an appropriate background model. The simulation framework provided
by \texttt{ChIPsim} goes one step further. It simulates the location of various genomic
features, such as nucleosome positions or transcription factor binding sites, and the 
resulting sequence reads. This can then be used to test entire analysis pipelines,
from read mapping to feature identification. We may be interested in testing a specific
part of the pipeline, e.g., investigate the performance of different read mapping tools
or different peak calling algorithms. While this can be very useful we may gain further
insights into the performance of our pipeline by testing different combinations
of tools to capture interactions between them. Once an analysis pipeline is established
we can use the simulation to help us plan new experiments by providing an estimate of 
how much sequencing will be required to obtain the desired outcomes. 

In the next section we will discuss the general structure of the simulation to get 
a better idea of how it works and where we might intervene to change its behaviour.
This is followed by an example to demonstrate the simulation 
of a nucleosome positioning experiment on a small genome. This includes examples
to show how the parameters of the simulation can be adjusted but does not cover
extensions to different experiment types. Such extensions are discussed at the 
end of this document where a more complex example demonstrates how we could 
implement the simulation of a transcription factor binding simulation.
 
\section{Overview}
The simulation is divided into six stages that are illustrated below for the example
of a nucleosome positioning experiment. The general structure is the same, regardless
of experiment but the details are adjusted to suit the problem under consideration.
\begin{description}
\item[Generate feature sequence] A Markov chain is used to generate a sequence of features,
e.g. nucleosomes of different stability, to cover the supplied genome. Each state
of the Markov chain represents a different feature type and generates a set of associated
parameters. 
\item[Compute nucleosome density] Each feature type is associated with a feature 
generating function that translates the feature sequence into a feature density, i.e., 
for each position in the genome covered by the feature it computes the probability 
that a nucleosome is centred there.
\item[Compute read density] Using information about the length distribution of DNA
fragments and the distribution of binding sites within them, the feature density
is translated into a read density for each strand. The read density at a given position
on a given DNA strand is proportional to the probability that a sequence read starting
at that position is produced during the sequencing process.
\item[Sample read start sites] Based on the read densities computed in the previous step
the location for the desired number of reads is generated.
\item[Create read names] A name is assigned to each read.
\item[Obtain read sequence and quality] Read positions are translated into the corresponding
DNA sequence, qualities are assigned to each read and used to introduce errors into the
read sequence.
\end{description}
All six stages of the simulation are carried out by the function \texttt{simChIP}. It is 
possible to provide intermediate results to skip some of these steps. For example,
if we already have a feature density we could use that instead of generating a new one.
The read sequences and qualities generated by the simulation can be exported in FASTQ format.
In the default configuration of \texttt{simChIP} this is done automatically if an output
file name is provided. Each stage of the simulation is carried out by a different
function that is called by \texttt{simChIP} with the appropriate arguments. Individual
functions can be replaced to make substantial changes to the behaviour of the simulation in
cases where simply adjusting some of the function's arguments is not enough to achieve the
desired result.   

\section{Using the nucleosome positioning simulation}
With that general structure in mind we will now take a look at how we might use the
ChIPsim package for a simple nucleosome positioning simulation. This section focuses
on the build in simulation of a nucleosome positioning experiment. While we will see
how this can be modified to adapt the simulation to a specific experiment this example
mostly relies on the default settings to keep things simple. 
\subsection{Setup} 
We start by loading the ChIPsim package. To ensure that the results of the 
simulation are reproducible the random number generator is initialised with a fixed seed.
<<seeding>>=
library(ChIPsim)
set.seed(1)
@
If we just want to run the nucleosome positioning simulation with default
settings no further preparations are necessary in practice. The remainder 
of this section is only required because we want to run the example without
relying on additional files. It also serves as an example of how the defaults
can be adjusted to fit the needs of a particular experiment.
  
The later stages of the simulation require a reference genome to generate read sequences.
For the purpose of these examples we will use a very simple (and relatively short) simulated 
genome sequence. The variable \texttt{chrLen} gives the length (and number) of
chromosomes, we could easily generate a different genome size by changing these numbers.
<<genome>>=
chrLen <- c(2e5, 1e5)
chromosomes <- sapply(chrLen, function(n) paste(sample(DNA_BASES, n, replace = TRUE), collapse = ""))
names(chromosomes) <- paste("CHR", seq_along(chromosomes), sep="")
genome <- DNAStringSet(chromosomes)
@
To read the genome sequence from a (FASTA formatted) file we would simply use the name
of the fasta file instead of the generated sequence above.
<<echo=FALSE>>=
rm(chromosomes)
@

Since the output of the simulation is a FASTQ formatted file we will also need
read qualities. By default the simulation uses read quality scores from a real
sequencing experiment for this purpose. Since that is not very practical for 
this example we will provide a function to generate read qualities instead. 
This will also help to demonstrate how aspects of the simulation can be 
adjusted for specific needs.
<<simulateQualities>>=
randomQuality <- function(read, ...){
	paste(sample(unlist(strsplit(rawToChar(as.raw(64:104)),"")), 
					nchar(read), replace = TRUE), collapse="")
}
@
The \texttt{randomQuality} function will produce read quality scores in Illumina 1.3 format. 
Apart from the encoding the quality scores will not be very similar to real read qualities
but this is sufficient for the purpose of this example.

Another aspect of the simulation we may want to change for this example is how output is 
produced. By default the simulation writes the read sequences and qualities to a FASTQ 
file. Instead, we would like to produce a data frame that holds this information. To achieve
this we have to provide a function that accepts read positions together with some other 
parameters and returns the \texttt{data.frame} we want.
<<output>>=
dfReads <- function(readPos, readNames, sequence, readLen, ...){
	
	## create vector to hold read sequences and qualities
	readSeq <- character(sum(sapply(readPos, sapply, length)))
	readQual <- character(sum(sapply(readPos, sapply, length)))
	
	idx <- 1
	## process read positions for each chromosome and strand
	for(k in length(readPos)){ ## chromosome
		for(i in 1:2){ ## strand
			for(j in 1:length(readPos[[k]][[i]])){
				## get (true) sequence
				readSeq[idx] <- as.character(readSequence(readPos[[k]][[i]][j], sequence[[k]], 
						strand=ifelse(i==1, 1, -1), readLen=readLen))
				## get quality
				readQual[idx] <- randomQuality(readSeq[idx])
				## introduce sequencing errors
				readSeq[idx] <- readError(readSeq[idx], decodeQuality(readQual[idx]))
				idx <- idx + 1
			}
		}
	}
	data.frame(name=unlist(readNames), sequence=readSeq, quality=readQual, 
			stringsAsFactors = FALSE)
}
@

\subsection{Simulation}
Now that we have a reference genome and a way to generate read qualities we can simply call
\texttt{simChIP} to simulate sequencing data. The first argument of \texttt{simChIP} is the
number of reads we would like to generate, the second is the reference genome. 
In this case we will use the genome sequence we generated above, we could simply pass the name
of a FASTA formatted file that contains the reference sequence instead. The third argument
is the prefix we would like to use for the output files. For this example we will prevent file
creation by setting \texttt{file = ""}. Usually it is a good idea to write results to a file,
which will produce a FASTQ formatted file with the sequences of all simulated reads. Using
file output will also generate additional files containing intermediate results like the 
underlying feature sequence or resulting read densities for later reuse. The \texttt{control}
argument allows us to change the parameters of the simulation. The entire simulation process
is divided into six stages and parameters for each can be set separately by providing named
entries in the list passed as \texttt{control} argument. As explained above, the six stages 
of the simulation are:
\begin{description}
\item[features]{Generate feature sequence (for each chromosome);}
\item[bindDensity]{Compute binding site density;}
\item[readDensity]{Compute read density for both strands;}
\item[sampleReads]{sample from read density to determine read start sites;}
\item[readNames]{Create read names;}
\item[readSequence]{Obtain read sequence and quality.}
\end{description} 
These parameters are then passed to a function that carries out that step of the simulation. 
Which function is used can be controlled through the \texttt{functions} argument.

Here we will mostly use the defaults but we need to change the way read sequences and qualities 
are generated. We can use \texttt{defaultFunctions} to obtain the default list of functions and
simply replace the one for the final step of the simulation. Instead of completely replacing a 
part of the simulation we sometimes just want to adjust the default behaviour slightly. In many cases
that is possible by changing a parameter value. We will illustrate the process by changing the mean
fragment length to 150bp (the default is 160bp).
<<simulation>>=
myFunctions <- defaultFunctions()
myFunctions$readSequence <- dfReads 
nReads <- 1000
simulated <- simChIP(nReads, genome, file = "", functions = myFunctions,
		control = defaultControl(readDensity=list(meanLength = 150)))
@
Note the use of \texttt{defaultControl}. This allows us to maintain the default values for
all parameters except the ones we list explicitly. The \texttt{defaultControl} function
uses the same argument names as \texttt{functions}. Each argument is expected to be a named list
with the names corresponding to the names of arguments of the target function.

We now have a data frame with \Sexpr{nReads} simulated read sequences together with the location 
in the genome they originated from, the read densities that were used determine read positions 
as well as the underlying nucleosome densities and feature sequence.  The list returned by
\texttt{simChIP} contains all of this information in different components. Their names should give
some idea of what they are:
<<listSummary>>=
	names(simulated)
@
Of course we can also look at individual components in more detail. Here are the first few reads:
<<>>=
head(simulated$readSequence)
@
We can also examine some of the nucleosome and read densities. Figure~1 shows the nucleosome
and read densities surrounding the first stable nucleosome in the feature sequence. The stable nucleosome 
in the centre is flanked by two phased regions. 
<<eval=FALSE>>=
feat <- simulated$features[[1]]
stableIdx <- which(sapply(feat, inherits, "StableFeature"))
start <- feat[[stableIdx[1]]]$start
plot((start-2000):(start+500), simulated$bindDensity[[1]][(start-2000):(start+500)], xlab="Chromosome 1",
		ylab="Density", type='l')
lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),1], col=4)
lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),2], col=2)
legend("topright", legend=c("Nucleosome density", "Read density (+)", "Read density (-)"), col=c(1,4,2), lwd=1)
@
\begin{figure}
\centering
<<fig=TRUE, echo=FALSE>>=
feat <- simulated$features[[1]]
stableIdx <- which(sapply(feat, inherits, "StableFeature"))
start <- feat[[stableIdx[1]]]$start
plot((start-2000):(start+500), simulated$bindDensity[[1]][(start-2000):(start+500)], xlab="Chromosome 1",
		ylab="Density", type='l')
lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),1], col=4)
lines((start-2000):(start+500), simulated$readDensity[[1]][(start-2000):(start+500),2], col=2)
legend("topright", legend=c("Nucleosome density", "Read density (+)", "Read density (-)"), col=c(1,4,2), lwd=1)
@
\caption{Nucleosome and read densities for a stable feature and flanking regions.}
\end{figure}
Once the simulation is completed we can use the read sequences and qualities as input for an analysis pipeline.
Typically the first step is to map the reads back to the genome (preferably using the read qualities in the process),
followed by the application of a peak-finding algorithm to identify peaks in the read counts. The
performance of both parts of the analysis can be assessed with the help of data generated by this
simulation. This may prove useful to compare and improve analysis pipelines for a variety of ChIP-seq
experiments. Once a pipeline is established the simulation can also be used to estimate the amount of 
sequencing, i.e. the coverage, necessary for a given experiment. 

\subsection{Using output files}
If we had instructed \texttt{simChIP} to generate output files by providing a base file name to argument
\texttt{file} the list would contain the name of the output file for each stage rather than the data.
These file names are automatically generate by appending an appropriate suffix to the base name at
each stage. This name is also used to determine whether output from a previous run is available for re-use.
For example, if we set \texttt{file = "test"} in the above call to \texttt{simChIP} the simulation will
generate several files including `test\_features.rdata', `test\_bindDensity.rdata' and `test\_readDensity.rdata'
for the first three stages of the simulation. If any of these files are found in the working directory 
and we set \texttt{load = TRUE} the
last file in the sequence is loaded and used as input for the next stage, skipping all previous stages. In the
above example the names of files produced by the final three stages would be `test\_1\_readPosition.rdata',
`test\_1\_readNames.rdata' and `test\_1\_fastq.txt'. These all relate to the reads that were sampled from
the read densities and are not loaded as intermediate results. Instead \texttt{simChIP} scans the
working directory for files with these names of the form `test\_$n$\_fastq.txt' to determine which index
$n$ to use for the current run of the simulation. This enables us to run the same simulation with the
same underlying feature sequence, nucleosome and read density repeatedly, each time generating a new
set of reads. This essentially simulates repeated sequencing of the same library. Of course it is also possible
to introduce some variation into the process. For example if we wanted to study the effect of changes to
the DNA fragment size distribution we could use the same nucleosome density but recalculate the read
densities with a new set of parameters.

\section{Extending ChIPsim}
The \texttt{ChIPsim} package provides a framework for the simulation of ChIP-seq experiments. 
While, as we have seen above, the package includes an implementation of a nucleosome positioning simulation 
the real strength of \texttt{ChIPsim} is its 
ability to accommodate a variety of different experiments. Here we will demonstrate
how \texttt{ChIPsim} can be extended to other types of experiments. As an example we will 
create a simulation of transcription factor binding sites based on the one proposed by 
Zhang~\emph{et al.}~\cite{Zhang08}. This simulation divides the genome into a number of
non-overlapping binding sites separated by background regions. The background between binding
sites is split into blocks of 1kb, each with its own sampling weight.

In the following sections we will implement a variant of this simulation using the \texttt{ChIPsim} 
framework, demonstrating how the different stages of the simulation can be adjusted 
for different types of experiments. This is followed by an alternative, more complex
simulation of the same process.

\subsection{A simple model}
In this section we will attempt to recreate the simulation of Zhang~\emph{et al.} 
(or at least its core aspects). As we will see later there are some differences
in the assumptions underlying the work of Zhang~\emph{et al.} and \texttt{ChIPsim},
which means that we will not use the full extend of \texttt{ChIPsim}'s capabilities.
This section is followed by the description of a more complex model of transcription
factor ChIP-seq that is better suited for an implementation in \texttt{ChIPsim}.

\subsubsection{Markov chain}
At the core of the simulation is a Markov model that generates a sequence of different 
features. Each state of the model corresponds to a different feature class and generates
a set of parameters required to determine the sampling weight for each feature.
For the transcription factor simulation we will only need two states, representing
binding sites and background respectively.

To specify the  Markov model we need to define the transition probabilities between the states
and the initial state distribution. For each state we will also need a function to
generate the parameters of the corresponding features. We start by specifying the transition
probabilities. To do this we need to create a list with components corresponding to
the states of the model (we will us ``Binding'' and ``Background'') each of which holds an (S3) object of 
class ``StateDistribution'', which is simply a numeric vector with the non-zero transition 
probabilities to other states.
<<transitions>>=   
 transition <- list(Binding=c(Background=1), Background=c(Binding=0.05, Background=0.95))
 transition <- lapply(transition, "class<-", "StateDistribution")
@
Note that we are allowing several background regions to follow each other while binding
regions always have to be followed by a background region. Also note that the transition
probability from background to binding regions determines the expected number of binding
regions. Unlike Zhang~\emph{et al.} we are not determining the number in advance.

We will not allow the sequence to start with a binding site. Therefore, again using
the ``StateDistribution'' class, the initial state distribution is:
<<initial>>=
 init <- c(Binding=0, Background=1)
 class(init) <- "StateDistribution"
@

The next step is to define functions to generate the parameters for binding and 
background regions. The parameters for a region should always include `\texttt{start}'
and `\texttt{length}', indicating the start position and length of the region
respectively. Note that `\texttt{start}' will be determined by the function
that generates the feature sequence (see below) so that we only have to accept
it as a parameter. The only other parameter required for background regions is the 
sampling weight. Following the model of Zhang~\emph{et al.} we use a gamma 
distribution to model the background sampling weight for each region.
<<bgEmission>>=
 backgroundFeature <- function(start, length=1000, shape=1, scale=20){
	 weight <- rgamma(1, shape=1, scale=20)
	 params <- list(start = start, length = length, weight = weight)
	 class(params) <- c("Background", "SimulatedFeature")
	 
	 params
 }
@
The convention used by \texttt{ChIPsim} is to name functions that generate
parameters for the different feature classes ``\emph{state}Feature'', where 
\emph{state} is the name of the state. These functions always return a list 
of parameters. This list has classes ``\emph{state}'' and ``SimulatedFeature''. 
It is important that the first class matches the name used for the state in
other places.

The binding site model requires some additional parameters. Zhang~\emph{et al.} 
set the average sampling weight $\bar{w}_f$ for binding sites to be the average weight of 
background regions multiplied by an enrichment coefficient $t$: 
$\bar{w}_f = t \times \bar{w}_b$.
They then increase the sampling weight of binding sites as sequence reads are assigned to them,
resulting in a power-law distribution of sampling weights. Since \texttt{ChIPsim} 
separates the generation of feature densities and the sampling of sequence reads
from those densities into two distinct stages it is not easily possible to use the same 
procedure here. Instead we will use a Pareto distribution with parameter $r$ to determine $w_f$ 
for each binding site. The minimum of the distribution is chosen such that its mean is
$\bar{w}_f$ (Note that this requires $r > 1$).   
<<bindingEmission>>=
 bindingFeature <- function(start, length=500, shape=1, scale=20, enrichment=5, r=1.5){
	 stopifnot(r > 1)
	 
	 avgWeight <- shape * scale * enrichment
	 lowerBound <- ((r - 1) * avgWeight)
	 weight <- actuar::rpareto1(1, r, lowerBound)
	 
	 params <- list(start = start, length = length, weight = weight)
	 class(params) <- c("Binding", "SimulatedFeature")
	 
	 params
 }
@
With this in place we can now generate a feature sequence using the \texttt{ChIPsim} function
\texttt{makeFeatures}. This function takes the description of a Markov model together with
several other parameters and produces a list of features for a genomic region of given length.
This list is of class ``SimulatedExperiment'' and usually has a more specialized class indicating
the specific type of experiment as well. We will see later how this class information is used.
To generate a feature sequence we can use something like this:
<<features1_1>>=
 set.seed(1)
 generator <- list(Binding=bindingFeature, Background=backgroundFeature)
 
 features <- ChIPsim::makeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20),
		 experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE))
@
We have already discussed the \texttt{transition} and \texttt{init} parameters of \texttt{makeFeatures}.
The \texttt{generator} parameter also relates to the Markov model; it provides a list of the
emission distributions for each state. Another interesting argument of \texttt{makeFeatures} is
\texttt{globals}. This allows us to pass a list of parameters that will be passed on to all states
of the model. Here we use this mechanism to ensure that the values used by both states for 
\texttt{shape} and \texttt{scale} match. There is another parameter called \texttt{control}
that allows us to pass a list of arguments to individual states.
\begin{figure}[htb]
\centering
<<echo=FALSE, fig=TRUE>>=
 bindIdx <- sapply(features, inherits, "Binding")
 
 plot(density(sapply(features[!bindIdx], "[[", "weight")), 
		 xlim=c(0,max(sapply(features, "[[", "weight"))), xlab="Sampling weight", main="")
 lines(density(sapply(features[bindIdx], "[[", "weight")), col=2)
 legend("topright", legend=c("Background", "Binding"), col=1:2, lty=1)
@
\caption{Sampling weight distributions for background and binding regions.}
\end{figure}
The feature sequence generated above contains \Sexpr{sum(bindIdx)} binding sites, see Figure~2
for a plot of the resulting weight distributions of binding and background regions. Below
is the first feature in the sequence:
<<>>=
features[[1]]
@

Before we turn our attention to the conversion of this sequence into binding site densities 
it is worth noting that there is another function we can use to generate a feature sequence. Using
\texttt{placeFeatures} provides some additional functionality. Firstly, \texttt{placeFeatures}
makes an effort to generate a feature sequence that fills as much of the genomic region 
as possible. This is especially useful for cases where we are dealing with variable length
features that may otherwise lead to a relatively large area without any features at the end
of the genomic region. Secondly, it calls \texttt{reconcileFeatures}, an S3 method that
dispatches on the class of the feature sequence (i.e., the experiment type) and provides
a way to post-process the feature sequence. The nucleosome simulation uses this to ensure
that some of the parameters of adjacent features are compatible. This mechanism is not 
needed for the simulation considered here so that we can use the default. However, if we 
wanted to use it we could simply define a function \texttt{reconcileFeatures.TFExperiment(features, ...)} 
to perform the desired post-processing.
<<features1_2>>=
 set.seed(1)
 features <- ChIPsim::placeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20),
		 experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE))
@
This results in the same feature sequence but a closer look reveals that the default
post-processing step has introduced an additional parameter for each feature:
<<>>=
features[[1]]
@
The overlap between adjacent features defaults to 0, which is what we want here. In some cases
it may be desirable to have overlapping features. The nucleosome simulation uses this
to ensure smooth transitions between features. 

\subsubsection{Binding site density}
Once the feature sequence is determined it has to be translated into a binding site density. 
This task is carried out through a call to \texttt{featureDensity} which is dispatched on the
feature class. This means that we have to provide implementations for \texttt{featureDensity.Binding}
and \texttt{featureDensity.Background}. In this case both functions are simply constant 
and produce output of a given length. 
<<featureDensity1>>=
constRegion <- function(weight, length) rep(weight, length)
featureDensity.Binding <- function(feature, ...) constRegion(feature$weight, feature$length)
featureDensity.Background <- function(feature, ...) constRegion(feature$weight, feature$length)
@
A binding site density for the entire genomic region is generated by a call to \texttt{feat2dens}.
<<density1>>=
 dens <- ChIPsim::feat2dens(features)
@
The resulting density for a region containing the first binding site in the sequence is shown in Figure~3.
Since the simulation of Zhang~\emph{et al.} does not distinguish between reads on the forward and
reverse strand we could simply use this binding site density to sample the location of (extended) reads.
<<readLoc1>>=
 readLoc <- sample(44000:64000, 1e3, prob=dens[44000:64000], replace=TRUE)
@
\begin{figure}[htb]
\centering
<<fig=TRUE, echo=FALSE>>=
 start <- features[[min(which(bindIdx))]]$start
 length <- features[[min(which(bindIdx))]]$length
 par(mfrow=c(2,1))
 par(mar=c(0, 4.1, 1.1, 2.1))
 plot((start-10000):(start+10000), dens[(start-10000):(start+10000)], type='l',
		xlab="", ylab="Sampling weight", xaxt = 'n')
 abline(v=start, col="grey", lty=2)
 abline(v=start+length-1, col="grey", lty=2) 
 par(mar=c(4.1, 4.1, 0, 2.1))
 counts <- hist(readLoc, breaks=seq(44000-0.5, 64000.5, 1), plot = FALSE)$counts
 extReads <- zoo::rollmean(ts(counts), 200)*200
 plot(44100:63901,extReads, xlab="Position in genomic region", ylab="Read count", main="", 
		 type='l', xlim = c(start-10000, start+10000))
 abline(v=start, col="grey", lty=2)
 abline(v=start+length-1, col="grey", lty=2) 
@
\caption{Binding site density for a transcription factor binding site (between the dashed lines) 
and surrounding background (top). Resulting counts of overlapping extended reads are shown below.}
\end{figure}
Here we are assuming that sampled read positions correspond to the fragment centre and 
extend reads into both directions. Figure~3 shows counts of reads that were extended to 200bp,
the average fragment length.

We now have a (simple) simulation of transcription factor ChIP-seq based on the work by Zhang~\emph{et al.}
This does not incorporate all aspects of their simulation, e.g. the intra-site weight profile is missing from 
this version, but demonstrates the principle. In the next section we will build on this to create a
simulation that takes information about binding site and DNA fragment length into account
to generate strand specific read densities.

\subsection{Advanced model}
This extended version of the transcription factor ChIP-seq simulation builds on the model from the
previous section. We will use the same Markov chain as above with the same emission distributions;
but unlike before we will not use blocks of 500bp to represent binding sites. Instead the length
of a binding site will be defined as the stretch of DNA protected by the transcription factor. What
will change is the binding site density for the ``Binding'' state. We will use a single spike at the centre 
of the binding site rather than a uniform distribution across the binding site. This will allow us to 
compute strand specific read densities later. This requires some adjustment to the background 
density as well.

\subsubsection{Post-processing the feature sequence}
Since we plan to change the binding site density such that a transcription factor binding site
is represented by a single spike at the centre of the binding site, we have to reduce the 
background density accordingly. It would be straightforward to incorporate this change into
\texttt{featureDensity.Background} but for the purpose of this example we will do this as part of 
the post-processing step that was mentioned but not illustrated before. We will assume that 
all binding sites have the same length. That should be a reasonable assumption if we only
deal with one transcription factor per experiment.
<<reconcileFeatures>>=
	reconcileFeatures.TFExperiment <- function(features, ...){
		bindIdx <- sapply(features, inherits, "Binding")
		if(any(bindIdx))
			bindLength <- features[[min(which(bindIdx))]]$length
		else bindLength <- 1
		lapply(features, function(f){
				if(inherits(f, "Background"))
					f$weight <- f$weight/bindLength
				## The next three lines (or something to this effect)
				## are required by all 'reconcileFeatures' implementations. 
				f$overlap <- 0
				currentClass <- class(f)
				class(f) <- c(currentClass[-length(currentClass)], 
						"ReconciledFeature", currentClass[length(currentClass)])
				f
			})
	}
@
Now we can generate a new feature sequence.
<<features2>>=
 set.seed(1)
 features <- ChIPsim::placeFeatures(generator, transition, init, start = 0, length = 1e6, globals=list(shape=1, scale=20),
		 experimentType="TFExperiment", lastFeat=c(Binding = FALSE, Background = TRUE), 
		 control=list(Binding=list(length=50)))
@
Note the use of the \texttt{control} argument to set the length of binding sites. The parameters
of the first feature have changed slightly:
<<>>=
	features[[1]]
@

\subsubsection{Revised binding site density}
Here is the modified feature density for binding regions:
<<featureDensity2>>=
 featureDensity.Binding <- function(feature, ...){
	 featDens <- numeric(feature$length)
	 featDens[floor(feature$length/2)] <- feature$weight
	 featDens
 }
@
Now we can generate the new binding site density, using the same call as before.
<<density2>>=
	dens <- ChIPsim::feat2dens(features, length = 1e6)
@
Figure~5 shows the binding site density in the same genomic region as before. 
 
\subsubsection{Strand specific read densities}
The function \texttt{bindDens2readDens} provides us with an easy way to convert
binding site densities into read densities for both strands. All we need to do
to use this function for our transcription factor simulation is to specify a
function that returns the length distribution of DNA fragments used in the
experiment. This function should accept a numeric first argument and return
the proportion of DNA fragments of that length in the sample. Further arguments
the function should accept (and may use) are the minimum and maximum fragment length
as well as the length of the binding site.
<<fragmentLength>>=
	fragLength <- function(x, minLength, maxLength, meanLength, ...){
		sd <- (maxLength - minLength)/4
		prob <- dnorm(minLength:maxLength, mean = meanLength, sd = sd)
		prob <- prob/sum(prob)
		prob[x - minLength + 1]
	}
@
Here we do not use the length of the binding site but our function either needs 
a \texttt{bind} argument or use `$\ldots$' to absorb any additional arguments.
<<readDens>>=
	readDens <- ChIPsim::bindDens2readDens(dens, fragLength, bind = 50, minLength = 150, maxLength = 250,
			meanLength = 200) 
@
This produces a two column matrix with the read density for the forward strand in the first column 
and the reverse strand density in the second. Figure~4 gives an impression of how the read densities 
relate to the binding site density.
\begin{figure}[htb]
\centering
<<fig=TRUE, echo=FALSE>>=
 bindIdx <- sapply(features, inherits, "Binding")
 start <- features[[min(which(bindIdx))]]$start
 length <- features[[min(which(bindIdx))]]$length
 par(mfrow=c(2,1))
 par(mar=c(0, 4.1, 1.1, 2.1))
 plot((start-10000):(start+10000), dens[(start-10000):(start+10000)], type='l',
		 xlab="", ylab="Sampling weight", xaxt='n')
 par(mar=c(4.1, 4.1, 0, 2.1))
 plot((start-10000):(start+10000), readDens[(start-10000):(start+10000), 1], col=4, type='l',
		 xlab="Position in genomic region", ylab="Sampling weight")
 lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 2], col=2)
 legend("topleft", legend=c("forward", "reverse"), col=c(4,2), lty=1)
@
\caption{Binding site density for a transcription factor binding site 
and surrounding background for the modified model (top) and resulting 
strand specific read densities (bottom).}
\end{figure}

\subsubsection{Sampling reads}
Once the read densities are computed it is straightforward to sample read positions. 
The \texttt{sampleReads} function is provided for this purpose. It allows us to specify
the number of reads required as well as weights for each strand. Here we sample 100,000 reads
from the read densities with equally weighted strands, which is the default.
<<readLoc2>>=
	readLoc <- ChIPsim::sampleReads(readDens, 1e5)
@
This produces a list with components \texttt{fwd} and \texttt{rev} giving the start positions
of reads on the respective strand. Figure~5 shows the transcription factor binding site 
with read densities and read positions.
\begin{figure}[htb]
\centering
<<fig=TRUE, echo=FALSE>>=
	fwdCount <- hist(readLoc$fwd, breaks=seq(0.5, 1000000.5, by=1), plot=FALSE)$counts
	revCount <- hist(readLoc$rev, breaks=seq(0.5, 1000000.5, by=1), plot=FALSE)$counts
	fwdExt <- zoo::rollmean(ts(fwdCount[(start-1000):(start+1050)]), 200, align="right")*200
	revExt <- zoo::rollmean(ts(revCount[(start-1000):(start+1050)]), 200, align="left")*200
	
	par(mfrow=c(2,1))
	mar.old <- par("mar")
	par(mar=c(0, 4.1, 1.1, 2.1))
	plot((start-1000):(start+1050) ,fwdCount[(start-1000):(start+1050)], col="lightblue",
			type='h', xlab="", ylab="Read count / density", ylim=c(0, 2),
			xlim=c(start-975, start+1025), xaxt='n')
	lines((start-975):(start+1025) ,revCount[(start-975):(start+1025)], col="mistyrose2", type='h')
	lines((start-10000):(start+10000), dens[(start-10000):(start+10000)])
	lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 1], col=4)
	lines((start-10000):(start+10000), readDens[(start-10000):(start+10000), 2], col=2)
	par(mar=c(4.1, 4.1, 0, 2.1))
	plot((start-801):(start+851), fwdExt+revExt, xlim=c(start-975, start+1025), 
			xlab="Position in genomic region", ylab="Overlap count", type='s')
	lines((start-801):(start+1050), fwdExt, col=4)
	lines((start-1000):(start+851), revExt, col=2)
	abline(v=c(start-150, start+200), col="grey", lty=2)
@
\caption{Read counts and densities for a transcription factor binding site. In the top panel vertical 
light blue and red bars indicate the number of reads starting at each position on the forward and reverse 
strand respectively. The underlying read densities are shown as blue and red lines and the binding site 
density is shown in black. The number of overlapping extended reads on the forward (blue) and reverse (red) strand
are shown in the bottom panel.}
\end{figure}
Following the procedure in Zhang~\emph{et al} we can extend each read to the average fragment length (200bp)
and count the number of overlapping reads at each position. This results in a peak surrounding the binding
site that is approximately 350bp wide.

\subsubsection{Putting it all together}
After generating the desired number of read positions the resulting read sequences can be written to 
a file using the infrastructure provided by \texttt{ChIPsim} and will not be covered 
here in detail.  A final point worth noting is that although we have carried out
the various steps of the simulation for demonstration purposes, it is not necessary
to do this in practice. The \texttt{ChIPsim} package provides the \texttt{simChIP} 
function that allows us to run the entire simulation with a single function call.
<<echo=FALSE>>=	
	randomQuality <- function(read, ...){
		paste(sample(unlist(strsplit(rawToChar(as.raw(64:104)),"")), 
						nchar(read), replace = TRUE), collapse="")
	}
	dfReads <- function(readPos, readNames, sequence, readLen, ...){
		## create vector to hold read sequences and qualities
		readSeq <- character(sum(sapply(readPos, sapply, length)))
		readQual <- character(sum(sapply(readPos, sapply, length)))
		
		idx <- 1
		## process read positions for each chromosome and strand
		for(k in length(readPos)){ ## chromosome
			for(i in 1:2){ ## strand
				for(j in 1:length(readPos[[k]][[i]])){
					## get (true) sequence
					readSeq[idx] <- as.character(ChIPsim::readSequence(readPos[[k]][[i]][j], sequence[[k]], 
									strand=ifelse(i==1, 1, -1), readLen=readLen))
					## get quality
					readQual[idx] <- randomQuality(readSeq[idx])
					## introduce sequencing errors
					readSeq[idx] <- ChIPsim::readError(readSeq[idx], ChIPsim::decodeQuality(readQual[idx]))
					idx <- idx + 1
				}
			}
		}
		data.frame(name=unlist(readNames), sequence=readSeq, quality=readQual, 
				stringsAsFactors = FALSE)
	}
@
Using the output method described in ``Introduction to ChIPsim'' to produce read sequences 
we first set up the functions to carry out the different stages of the simulation.
As we have seen above we can rely on the build-in functions for this, only the
last stage needs to be changed to accommodate the fact that we do not want to create
output files.
<<>>=
	myFunctions <- ChIPsim::defaultFunctions()
	myFunctions$readSequence <- dfReads
@
Now we have to adjust the arguments that will be passed to these functions to enable them
to work with our transcription factor simulation. 
<<>>=
	featureArgs <- list(generator=generator, transition=transition, init=init, start = 0, 
			length = 1e6, globals=list(shape=1, scale=20), experimentType="TFExperiment", 
			lastFeat=c(Binding = FALSE, Background = TRUE), control=list(Binding=list(length=50)))
	readDensArgs <- list(fragment=fragLength, bind = 50, minLength = 150, maxLength = 250,
			meanLength = 200)
@
After generating a random reference sequence we are all set.
<<simulation>>=
	genome <- Biostrings::DNAStringSet(c(CHR=paste(sample(Biostrings::DNA_BASES, 1e6, replace = TRUE), collapse = "")))
	set.seed(1)
	simulated <- ChIPsim::simChIP(1e4, genome, file = "", functions = myFunctions,
			control = ChIPsim::defaultControl(features=featureArgs, readDensity=readDensArgs))
@
Note that we reduced the number of reads in this simulation to avoid generating a large number
of read sequences, which can be time consuming. Apart from this change the result is the same 
as before:
<<>>=
	all.equal(readDens, simulated$readDensity[[1]])
@


\section{Session info}
<<>>=
sessionInfo()
@
\begin{thebibliography}{9}
\bibitem{Zhang08}
Zhengdong~D. Zhang, Joel Rozowsky, Michael Snyder, Joseph Chang, and Mark
  Gerstein.
\newblock Modeling chip sequencing in silico with applications.
\newblock {\em PLoS Computational Biology}, 4(8), August 2008.
\end{thebibliography}

\end{document}