%\VignetteIndexEntry{The genomic STate ANnotation package}
%\VignetteDepends{}
%\VignetteKeywords{Genome Annotation, Hidden Markov Model}
%\VignettePackage{STAN} % name of package
%\VignetteEngine{knitr::knitr}
\documentclass{article}
\usepackage{subfig}
%cp STAN.Rnw ../STAN.Rnw
%R CMD Sweave --engine=knitr::knitr --pdf STAN.Rnw


<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

<<knitr, echo=FALSE, results="hide">>=
library("knitr")
opts_chunk$set(tidy=FALSE,dev="png",fig.show="hide",
               fig.width=4,fig.height=4.5,
               message=FALSE, highlight=FALSE)
@ 




\author{Benedikt Zacher$^{1,2,*}$, Julia Ertl$^{1}$, Julien Gagneur$^{1}$, Achim
Tresch$^{1,2,3}$ \\[1em] \small{$^{1}$ Gene Center and Department of Biochemistry, LMU, Munich, Germany} \\ \small{$^{2}$ Institute for
Genetics, University of Cologne, Cologne, Germany} \\
\small{$^{3}$
Max Planck Institute for Plant Breeding Research, Cologne, Germany} \\
 \\
\small{\texttt{$^*$zacher (at) genzentrum.lmu.de}}}


\title{The genomic STate ANnotation package}

\begin{document}
% \SweaveOpts{concordance=TRUE}

<<include=FALSE>>=
library(knitr)
opts_chunk$set(
concordance=TRUE
)
@


\maketitle

\vspace{1cm}


 \begin{center}
    \begin{tabular}{ | l | }
      \hline \\
      If you use \Biocpkg{STAN} in published research, please cite:  \\
      \\
      Zacher, B.  and Lidschreiber, M.  and Cramer, P.  and Gagneur, J.  and
      Tresch, A.  (2014): \\ \textbf{Annotation of genomics data using
      bidirectional hidden Markov models unveils} \\
      \textbf{variations in Pol II transcription cycle}  \emph{Mol. Syst. Biol.}
      \textbf{10}:768 \\
      \\ \hline 
    \end{tabular}
  \end{center}


\tableofcontents


\vspace{1cm}

\newpage

\section{Quick start}

\vspace{0.25cm}
<<Quick_start, results="hide", eval=FALSE>>=
## Loading library and data
library(STAN) 
data(trainRegions)
data(pilot.hg19)

## Model initialization
hmm_nb = initHMM(trainRegions[1:3], nStates=10, "NegativeBinomial")

## Model fitting
hmm_fitted_nb = fitHMM(trainRegions[1:3], hmm_nb, maxIters=10)

## Calculate state path
viterbi_nb = getViterbi(hmm_fitted_nb, trainRegions[1:3])

## Convert state path to GRanges object
viterbi_nb_gm12878 = viterbi2GRanges(viterbi_nb, pilot.hg19, 200)
@
\vspace{0.25cm}


\section{Introduction}
Genome segmentation with hidden Markov models has become a useful tool to
annotate genomic elements, such as promoters and enhancers by data integration. \Biocpkg{STAN} (genomic
\textbf{ST}ate \textbf{AN}notation) implements (bidirectional) Hidden Markov
Models (HMMs) using a variety of different probability distributions, which
can be used to model a wide range of current genomic data:

\begin{itemize}
  \item Multivariate gaussian: e.g. conutinuous microarray and
  transformed sequencing data.
  \item Poisson: e.g. discrete count data from sequencing experiments. 
  \item (Zero-Inflated) Negative Binomial: e.g. discrete count data from
  sequencing experiments.
  \item Poisson Log-Normal: e.g. discrete count data from sequencing
  experiments.
   \item Negative Multinomial: e.g. discrete count data from
   sequencing experiments.
  \item Multinomial: e.g. methylation rates from bisulfite sequencing (in this
  case it reduces to a Binomial) or binned nuceotide frequencies.
  \item Bernoulli: Initially proposed by \cite{ernst2010} to model
  presence/absence of chromatin marks (another example: transcription factor
  binding).
  \item Nucleotide distribution: e.g. nucleotide frequencies in the DNA
  sequence.
\end{itemize}

The use of these distributions enables integrating a wide range of genomic data
types (e.g.
continuous, discrete, binary) to \textit{de novo} learn and annotate the genome into a given
number of 'genomic states'.
The 'genomic states' may for instance reflect distinct
genome-associated protein complexes or describe recurring patterns of
chromatin features, i.e. the 'chromatin state'. Unlike other tools,
\Biocpkg{STAN} also allows for the integration of strand-specific (e.g. RNA) 
and non-strand-specific data (e.g. ChIP) \cite{zacher2014}. In this vignette, we illustrate
the use of \Biocpkg{STAN} by inferring 'chromatin states' from a small example
data set of two Roadmap Epigenomics cell lines. Moreover, we show how to use
strand-specific RNA expression with non-strand-specific ChIP data to infer
'transcription states' in yeast. Before getting started the package needs to be
loaded:

\vspace{0.25cm}
<<Loading_library, results="hide">>=
library(STAN)
@
\vspace{0.25cm}

\section{Genomic state annotation of Roadmap Epigenomics Sequencing data}
The data (or observations) provided to \Biocpkg{STAN} may consist of one or more
observation sequences (e.g. chromosomes), which are contained in a list of 
(position x experiment) matrices.
\Robject{trainRegions} is a list containing one three ENCODE pilot
regions (stored in \Robject{pilot.hg19} as \Rclass{GRanges} object) with data
for two cell types (K562:
E123, Gm12878:
E116) from the Roadmap Epigenomics project. The data set contains ChIP-Seq
experiments of seven histone modifications (H3K4me1, H3K4me3, H3K36me3,
H3K27me3, H3K27ac and H3K9ac), as well as DNase-Seq and genomic input.

\vspace{0.25cm}
<<Loading_example_data>>=
data(trainRegions)
names(trainRegions)
str(trainRegions[c(1,4)])
@
\vspace{0.25cm}

The genomic regions for each cell type in \Robject{trainRegions} are stored as a
GRanges object in \Robject{pilot.hg19}:

\vspace{0.25cm}
<<Loading_example_data_regions>>=
data(pilot.hg19)
pilot.hg19
@
\vspace{0.25cm}


Before model fitting, we calculate size factors to correct for the different
different sequencing depths between cell lines.

\vspace{0.25cm}
<<Calculate size factors>>=
celltypes = list("E123"=grep("E123", names(trainRegions)), 
        "E116"=grep("E116", names(trainRegions)))
sizeFactors = getSizeFactors(trainRegions, celltypes)
sizeFactors
@
\vspace{0.25cm}

Genome segmentation is carried out in \Biocpkg{STAN} using three functions:
\Rfunction{initHMM}, \Rfunction{fitHMM} and \Rfunction{getViterbi}.
\Rfunction{initHMM} initializes a model with \Robject{nStates} states for
a given probability/emission distribution, which we set to  'PoissonLogNormal'
in this example. \Rfunction{fitHMM} then optimizes model
parameters using the Expectation-Maximization algorithm. Model parameters
can be accessed with the \Rfunction{EmissionParams} function. Note that in this
example, we set the maximal number of iteration to 10 in this case for speed
reason. To ensure convergence this number should be higher in real world
applications. After HMM fitting, the state annotation is calculated using the
\Rfunction{getViterbi} function.

\vspace{0.25cm}
<<STAN-PoiLog>>=
nStates = 10
hmm_poilog = initHMM(trainRegions, nStates, "PoissonLogNormal", sizeFactors)
hmm_fitted_poilog = fitHMM(trainRegions, hmm_poilog, sizeFactors=sizeFactors, maxIters=10)
viterbi_poilog = getViterbi(hmm_fitted_poilog, trainRegions, sizeFactors)
str(viterbi_poilog)
@
\vspace{0.25cm}

In order to ease the use of other genomic applications and Bioconductor
packages, the viterbi path can be converted into a \Rclass{GRanges}
object.

\vspace{0.25cm}
<<STAN-PoiLog viterbi>>=
viterbi_poilog_gm12878 = viterbi2GRanges(viterbi_poilog[1:3], regions=pilot.hg19, binSize=200)
viterbi_poilog_gm12878
@
\vspace{0.25cm}

Before giving some more details about further analysis and visualization of the
models we repeat above segementations using the 'NegativeBinomial' emission
functions.

\vspace{0.25cm}
<<STAN-NB, results="hide">>=
hmm_nb = initHMM(trainRegions, nStates, "NegativeBinomial", sizeFactors)
hmm_fitted_nb = fitHMM(trainRegions, hmm_nb, sizeFactors=sizeFactors, maxIters=10)
viterbi_nb = getViterbi(hmm_fitted_nb, trainRegions, sizeFactors=sizeFactors)
viterbi_nb_gm12878 = viterbi2GRanges(viterbi_nb[1:3], pilot.hg19, 200)
@
\vspace{0.25cm}

In order to assign biologically meaningful roles to the inferred states we
calculate the mean number of reads per 200 base pair bin for both
segmentations.

\vspace{0.25cm}
<<STAN coverage, results="hide">>=
avg_cov_nb = getAvgSignal(viterbi_nb, trainRegions)
avg_cov_poilog = getAvgSignal(viterbi_poilog, trainRegions)
@
\vspace{0.25cm}


<<STAN_coverage_plotting_pdf, echo=FALSE, eval=TRUE, results="hide">>=

library(gplots)
heat = c("dark blue", "dodgerblue4", "darkred", "red", "orange", "gold", "yellow")
colfct = colorRampPalette(heat)
colpal_statemeans = colfct(200)
ord_nb = order(apply(avg_cov_nb,1,max), decreasing=TRUE)
statecols_nb = rainbow(nStates)
names(statecols_nb) = ord_nb

pdf("nb_avg_cov.pdf")
heatmap.2(log(avg_cov_nb+1)[as.character(ord_nb),], margins=c(8,7),srtCol=45, RowSideColors=statecols_nb[as.character(ord_nb)], dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_nb,1)[as.character(ord_nb),], notecol="black")
dev.off()

ord_poilog = order(apply(avg_cov_poilog,1,max), decreasing=TRUE)
statecols_poilog = rainbow(nStates)
names(statecols_poilog) = ord_poilog
pdf("poilog_avg_cov.pdf")
heatmap.2(log(avg_cov_poilog+1)[ord_poilog,], RowSideColors=statecols_poilog[as.character(ord_poilog)], margins=c(8,7),srtCol=45, dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_poilog,1)[ord_poilog,], notecol="black")
dev.off()

@

These are then plotted using the \Rfunction{heatmap.2} function (see Figure
\ref{fig:mean1}).

\vspace{0.25cm}
<<STAN_coverage_plotting, results="hide">>=
## specify color palette
library(gplots)
heat = c("dark blue", "dodgerblue4", "darkred", "red", "orange", "gold", "yellow")
colfct = colorRampPalette(heat)
colpal_statemeans = colfct(200)

## define state order and colors
ord_nb = order(apply(avg_cov_nb,1,max), decreasing=TRUE)
statecols_nb = rainbow(nStates)
names(statecols_nb) = ord_nb
heatmap.2(log(avg_cov_nb+1)[as.character(ord_nb),], margins=c(8,7), srtCol=45, 
        RowSideColors=statecols_nb[as.character(ord_nb)], dendrogram="none", 
        Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", 
        cellnote=round(avg_cov_nb,1)[as.character(ord_nb),], notecol="black")


## define state order and colors
ord_poilog = order(apply(avg_cov_poilog,1,max), decreasing=TRUE)
statecols_poilog = rainbow(nStates)
names(statecols_poilog) = ord_poilog
heatmap.2(log(avg_cov_poilog+1)[as.character(ord_poilog),], margins=c(8,7), srtCol=45, 
        RowSideColors=statecols_poilog[as.character(ord_poilog)], dendrogram="none", 
        Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", 
        cellnote=round(avg_cov_poilog,1)[as.character(ord_poilog),], notecol="black")
@
\vspace{0.25cm}


\begin{figure}[!htp]
  \centering
    \subfloat[]{{\includegraphics[width=8cm]{nb_avg_cov.pdf} }}%
    \qquad
    \subfloat[]{{\includegraphics[width=8cm]{poilog_avg_cov.pdf} }}%
      \caption{Mean read
    counts of the (a) 'NegativeBinomial' and (b) 'PoissonLogNormal' state
    annotation}%
  \label{fig:mean1}%
 \end{figure}


In order to visualize both \Biocpkg{STAN} segmentations, we
convert the viterbi paths and the data to \Biocpkg{Gviz} objects.

\vspace{0.25cm}
<<STAN_convert_to_Gviz, results="hide">>=
library(Gviz)
from = start(pilot.hg19)[3]
to = from+300000
gvizViterbi_nb = viterbi2Gviz(viterbi_nb_gm12878, "chr11", "hg19", from, to, statecols_nb)
gvizViterbi_poilog = viterbi2Gviz(viterbi_poilog_gm12878, "chr11", "hg19", from, to, 
        statecols_poilog)
gvizData = data2Gviz(trainRegions[[3]], pilot.hg19[3], binSize = 200, gen = "hg19", col="black", chrom = "chr11")
@
\vspace{0.25cm}

Then, we use the \Rfunction{plotTracks} function to plot everything (see Figure
\ref{fig:gviz1}).

\vspace{0.25cm}
<<STAN_plot_with_Gviz, eval=FALSE, results="hide">>=
gaxis = GenomeAxisTrack()
data(ucscGenes)
mySize = c(1,rep(1.2,9), 0.5,0.5,3)
plotTracks(c(list(gaxis), gvizData,gvizViterbi_nb,gvizViterbi_poilog,ucscGenes["chr11"]),
        from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black",
        cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize)
@
\vspace{0.25cm}


<<STAN_plot_with_Gviz_pdf, echo=FALSE, eval=TRUE, results="hide">>=

gaxis = GenomeAxisTrack()
data(ucscGenes)
mySize = c(1,rep(1.2,9), 0.5,0.5,3)
pdf("gviz_example.pdf", width=7*1.5)
plotTracks(c(list(gaxis), gvizData,gvizViterbi_nb,gvizViterbi_poilog,ucscGenes["chr11"]),
        from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black",
        cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize)#, stacking="dense")#, ylim=c(0,100))
dev.off()
@

\begin{figure}[!htp]
  \centering
    \includegraphics[width=18cm]{gviz_example.pdf}
    \caption{Genome Browser showing the 10 data tracks used for model
    learning together with the 'Negativebinomial' (top) and 'PoissonLogNormal'
    (bottom) segmentations and known UCSC gene annotations.}%
    \label{fig:gviz1}%
\end{figure}


\subsection*{Modeling Sequencing data using other emission functions}
In this section we illustrate the use of other distributions to annotate
the the Roadmap Epigenomics example data set, namely the 'Poisson',
'NegativeMultinomial', 'Gaussian' and 'Bernoulli' models.
The 'Poisson' model is an obvious choice when dealing with count data. However
since the variance of the Poisson is equal to its mean it might not be an ideal
choice for modeling Sequencing experiments, which have been shown to be
overdispersed \cite{anders2010}.

\vspace{0.25cm}
<<STAN_poisson_emissions, results="hide">>=
hmm_pois = initHMM(trainRegions, nStates, "Poisson")
hmm_fitted_pois = fitHMM(trainRegions, hmm_pois, maxIters=10)
viterbi_pois = getViterbi(hmm_fitted_pois, trainRegions)
@
\vspace{0.25cm}

The 'NegativeMultinomial' distribution for genome segmentation with HMMs was
first proposed in the EpicSeg model \cite{mammana2015}. The Negative Multinomial can be
understood as a Multinomial distribution, where its overdispersion of is modeled
by a Negative Binomial distribution. However, this assumes a shared
overdispersion across data tracks within a state as opposed to the
'NegativeBinomial' and 'PoissonLogNormal' models which model the variance for
each state and data track separately. In order to use the 'NegativeMultinomial'
in \Biocpkg{STAN} an additional data track - the sum of counts - for each bin
needs to be added to the data. Internally the 'NegativeMultinomial'
is modeled as a product of a 'NegativeBinomial' and a 'Multinomial' emission
(see section 'Combining different emission functions' for further details):

\vspace{0.25cm}
<<STAN_nmn_emissions, results="hide">>=
simData_nmn = lapply(trainRegions, function(x) cbind(apply(x,1,sum), x))
hmm_nmn = initHMM(simData_nmn, nStates, "NegativeMultinomial")
hmm_fitted_nmn = fitHMM(simData_nmn, hmm_nmn, maxIters=10)
viterbi_nmn = getViterbi(hmm_fitted_nmn, simData_nmn)
@
\vspace{0.25cm}

In order to model the data using Gaussian distributions, it needs to be
log-transformed and smoothed. This approach is implementd in Segway, a method
used by the ENCODE Consortium for chromatin state annotation \cite{hoffman2012}.
However, to overcome singularity of the (diagonal) covariance matrix due to the
zero-inflated distribution of the transformed read counts, it uses a shared
variance over states for each data track. To use gaussian distributions with
Sequencing data in \Biocpkg{STAN}, we transform the data (with the hyperbole
sine function \cite{hoffman2012}) and model it using the emission 'IndependentGaussian' with a shared
covariance, i.e. \Robject{sharedCov=TRUE}.

\vspace{0.25cm}
<<STAN_gaussian_emissions, results="hide">>=
trainRegions_smooth = lapply(trainRegions, function(x) 
            apply(log(x+sqrt(x^2+1)), 2, runningMean, 2))
hmm_gauss = initHMM(trainRegions_smooth, nStates, "IndependentGaussian", sharedCov=TRUE)
hmm_fitted_gauss = fitHMM(trainRegions_smooth, hmm_gauss, maxIters=10)
viterbi_gauss = getViterbi(hmm_fitted_gauss, trainRegions_smooth)
@
\vspace{0.25cm}

Another approach was proposed in ChromHMM, which models binarized data using an
independent Bernoulli model \cite{ernst2010}. Note, that the performance of the model
highly depends on the non-trivial choice of a proper cutoff and quantitative
information is lost. The latter is especially important when
predicting promoters and enhancers since these elements are both marked H3K4me1
and H3K4me3, but at different ratios. The function \Rfunction{binarizeData}
binarizes the data using the default approach by ChromHMM \cite{ernst2010}. The
model can then be fit by specifying the 'Bernoulli' model. Note however, that
initialization and model fitting are carried out differently than in the
ChromHMM implementation. In particular \Biocpkg{STAN} uses the EM algorithm
while ChromHMM uses online EM. For details on the initialization, please see
the \Rfunction{initHMM} manual.

\vspace{0.25cm}
<<STAN_bernoulli_emissions, results="hide">>=
trainRegions_binary = binarizeData(trainRegions)
hmm_ber = initHMM(trainRegions_binary, nStates, "Bernoulli")
hmm_fitted_ber = fitHMM(trainRegions_binary, hmm_ber, maxIters=10)
viterbi_ber = getViterbi(hmm_fitted_ber, trainRegions_binary)
@
\vspace{0.25cm}

We calculate the mean read coverage for each method and segmentation:

\vspace{0.25cm}
<<STAN_other_emissions_avg_cov, results="hide">>=
avg_cov_gauss = getAvgSignal(viterbi_gauss, trainRegions)
avg_cov_nmn = getAvgSignal(viterbi_nmn, trainRegions)
avg_cov_ber = getAvgSignal(viterbi_ber, trainRegions)
avg_cov_pois = getAvgSignal(viterbi_pois, trainRegions)
@
\vspace{0.25cm}

These are again plotted using the \Rfunction{heatmap.2} function (see Figure \ref{fig:mean_counts_other}). 

\vspace{0.25cm}
<<STAN_other_emissions_avg_cov_plot, eval=FALSE, results="hide">>=
heatmap.2(log(avg_cov_gauss+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, 
        Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, 
        cellnote=round(avg_cov_gauss,1), notecol="black")
heatmap.2(log(avg_cov_nmn+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, 
        Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, 
        cellnote=round(avg_cov_nmn,1), notecol="black")
heatmap.2(log(avg_cov_ber+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, 
        Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, 
        cellnote=round(avg_cov_ber,1), notecol="black")
heatmap.2(log(avg_cov_pois+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, 
        Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, 
        cellnote=round(avg_cov_pois,1), notecol="black")
@
\vspace{0.25cm}


<<STAN_other_emissions_avg_cov_plot_pdf, echo=FALSE, eval=TRUE, results="hide">>=
pdf("avg_cov_gauss.pdf")
heatmap.2(log(avg_cov_gauss+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_gauss,1), notecol="black")
dev.off()
pdf("avg_cov_nmn.pdf")
heatmap.2(log(avg_cov_nmn+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_nmn,1), notecol="black")
dev.off()
pdf("avg_cov_ber.pdf")
heatmap.2(log(avg_cov_ber+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_ber,1), notecol="black")
dev.off()
pdf("avg_cov_pois.pdf")
heatmap.2(log(avg_cov_pois+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_pois,1), notecol="black")
dev.off()
@

\begin{figure}[!htp]
  \centering
    \subfloat[]{{\includegraphics[width=8cm]{avg_cov_gauss.pdf} }}%
    \qquad
    \subfloat[]{{\includegraphics[width=8cm]{avg_cov_nmn.pdf} }}
    \\% \caption{2 Figures side by side}%
      \subfloat[]{{\includegraphics[width=8cm]{avg_cov_ber.pdf} }}%
    \qquad
    \subfloat[]{{\includegraphics[width=8cm]{avg_cov_pois.pdf} }}
    \caption{Mean read
    counts of the (a) 'IndependentGaussian' (b) 'NegativeMultinomial' (c)
    'Bernoulli' and (d) 'Poisson' state
    annotation.}
    \label{fig:mean_counts_other}%
\end{figure}


\section{Integrating strand-specific and non-strand-specific data with STAN}
\Biocpkg{STAN} also allows for the integration of strand-specific (e.g. RNA) 
and non-strand-specific data (e.g. ChIP). This is donw using bidirectional
hidden Markov models (bdHMMs) which were proposed in \cite{zacher2014}. 
A bdHMM models a directed process using the concept of twin states, where each
genomic state is split up into a pair of twin states, one for  each 
direction (e.g. sense and antisense in context of transcription). Those 
twin state  pairs  are identical in terms of their emissions (i.e. they model
the same genomic state). Currently the following models are available for
bdHMMs: 'IndependentGaussian', 'Gaussian', 'NegativeBinomial',
'ZINegativeBinomial' and 'PoissonLogNormal'. We now illustrate the use of bdHMMs
in \Biocpkg{STAN}  at an example data set of yeast transcription factors
measured by ChIP-chip and RNA expression measured with a tiling array
which was used to model the transcription cycle as a sequence of 'transcription
states' in \cite{zacher2014}.
\\
The \Rfunction{initBdHMM} function is used to initialize a bdHMM with 6 twin
states. Note that the overall number of states in the bdHMM is 12 (6 identical
twin state pairs). \Robject{dirobs} defines the directionality (or
strand-specificity) of the data tracks. In \Robject{dirobs}, the first 10 data
tracks are non-strand-specific ChIP-chip measurments, indicated by '0' and data track 11
and 12 are strand-specific RNA expression measurements, indicated by '1'.
Note that strand-specific data tracks must be labeled as increasing pairs of
integers. Thus and additional strand-specific data track pair would be labeled
as a pair of '2'. Model fitting and calculation of the state annotation are
carried out as for standard HMMs:

\vspace{0.25cm}
<<bdHMM_yeast_fit, results="hide">>=
data(yeastTF_databychrom_ex)
dStates = 6
dirobs = as.integer(c(rep(0,10), 1, 1))
bdhmm_gauss = initBdHMM(yeastTF_databychrom_ex, dStates = dStates, method = "Gaussian", directedObs=dirobs)
bdhmm_fitted_gauss = fitHMM(yeastTF_databychrom_ex, bdhmm_gauss)
viterbi_bdhmm_gauss = getViterbi(bdhmm_fitted_gauss, yeastTF_databychrom_ex)
@
\vspace{0.25cm}

We plot the means of the multivariate gaussian distrbutions for each state
(see Figure \ref{fig:means_yeast}):

\vspace{0.25cm}
<<bdHMM_yeast_params_plot, eval=FALSE, results="hide">>=
statecols_yeast = rep(rainbow(nStates), 2)
names(statecols_yeast) = StateNames(bdhmm_fitted_gauss)
means_fitted = EmissionParams(bdhmm_fitted_gauss)$mu
heatmap.2(means_fitted, col=colpal_statemeans, 
        RowSideColors=statecols_yeast[rownames(means_fitted)], 
        trace="none", cexCol=0.9, cexRow=0.9,
        cellnote=round(means_fitted,1), notecol="black", dendrogram="row",
        Rowv=TRUE, Colv=FALSE, notecex=0.9)
@
\vspace{0.25cm}

\begin{figure}[!htp]
  \centering
    \includegraphics[width=10cm]{yeast_means_gauss.pdf}
    \caption{Mean signal 6 bdHMM twin state pairs. 'F' and 'R'
    indicate forward and reverse direction of state pairs.}%
    \label{fig:means_yeast}%
\end{figure}


<<bdHMM_yeast_params_plot_pdf, echo=FALSE, eval=TRUE, results="hide">>=
pdf("yeast_means_gauss.pdf")
statecols_yeast = rep(rainbow(nStates), 2)
names(statecols_yeast) = StateNames(bdhmm_fitted_gauss)
means_fitted = EmissionParams(bdhmm_fitted_gauss)$mu
heatmap.2(means_fitted, col=colpal_statemeans, RowSideColors=statecols_yeast[rownames(means_fitted)], trace="none", cexCol=0.9, cexRow=0.9,
        cellnote=round(means_fitted,1), notecol="black", dendrogram="row",
        Rowv=TRUE, Colv=FALSE, notecex=0.9)
dev.off()
@
\vspace{0.25cm}

We convert the viterbi path into a \Robject{GRanges} object. Note that the
directionaliy of bdHMM states is indicated by 'F' (forward) and 'R' (reverse).

\vspace{0.25cm}
<<bdHMM_convert_GRanges>>=
yeastGRanges = GRanges(IRanges(start=1214616, end=1225008), seqnames="chrIV")
names(viterbi_bdhmm_gauss) = "chrIV"
viterbi_bdhmm_gauss_gr = viterbi2GRanges(viterbi_bdhmm_gauss, yeastGRanges, 8)
viterbi_bdhmm_gauss_gr
@
\vspace{0.25cm}

Next, we visualize the data, state annotation and together with SGD genes using
\Biocpkg{Gviz} (see Figure \ref{fig:gviz_yeast}):

\vspace{0.25cm}
<<bdHMM_gviz_plot, eval=FALSE, results="hide">>=
chr = "chrIV"
gen = "sacCer3"
gtrack <- GenomeAxisTrack()

from=1217060
to=1225000
forward_segments = grep("F", viterbi_bdhmm_gauss_gr$name)
reverse_segments = grep("R", viterbi_bdhmm_gauss_gr$name)
gvizViterbi_yeast = viterbi2Gviz(viterbi_bdhmm_gauss_gr[forward_segments], 
        "chrIV", "sacCer3", from, to, statecols_yeast)
gvizViterbi_yeast2 = viterbi2Gviz(viterbi_bdhmm_gauss_gr[reverse_segments], 
        "chrIV", "sacCer3", from, to, statecols_yeast)

gvizData_yeast = data2Gviz(obs = yeastTF_databychrom_ex[[1]], regions = yeastGRanges, binSize = 8, gen = "sacCer3", col="black", chrom = chr)
gaxis = GenomeAxisTrack()
data(yeastTF_SGDGenes)
mySize = c(1,rep(1,12), 0.5,0.5,3)

plotTracks(c(list(gaxis), gvizData_yeast,gvizViterbi_yeast,gvizViterbi_yeast2,
        list(yeastTF_SGDGenes)), cex.feature=0.7, background.title="darkgrey", lwd=2,
        sizes=mySize, from=from, to=to, showFeatureId=FALSE, featureAnnotation="id",
        fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey",
        showId=TRUE)
@
\vspace{0.25cm}



\vspace{0.25cm}
<<bdHMM_gviz_plot_pdf, echo=FALSE, eval=TRUE, results="hide">>=
yeastGRanges = GRanges(IRanges(start=1214616, end=1225008), seqnames="chrIV")
names(viterbi_bdhmm_gauss) = "chrIV"
viterbi_bdhmm_gauss_gr = viterbi2GRanges(viterbi_bdhmm_gauss, yeastGRanges, 8)
chr = "chrIV"
gen = "sacCer3"
gtrack <- GenomeAxisTrack()

from=1217060
to=1225000
forward_segments = grep("F", viterbi_bdhmm_gauss_gr$name)
reverse_segments = grep("R", viterbi_bdhmm_gauss_gr$name)
gvizViterbi_yeast = viterbi2Gviz(viterbi_bdhmm_gauss_gr[forward_segments], "chrIV", "sacCer3", from, to, statecols_yeast)
gvizViterbi_yeast2 = viterbi2Gviz(viterbi_bdhmm_gauss_gr[reverse_segments], "chrIV", "sacCer3", from, to, statecols_yeast)

gvizData_yeast = data2Gviz(obs = yeastTF_databychrom_ex[[1]], regions = yeastGRanges, binSize = 8, gen = "sacCer3", col="black", chrom = chr)
gaxis = GenomeAxisTrack()
data(yeastTF_SGDGenes)
mySize = c(1,rep(1,12), 0.5,0.5,3)
pdf("gviz_example_yeast.pdf", width=7*1.5)
plotTracks(c(list(gaxis), gvizData_yeast,gvizViterbi_yeast,gvizViterbi_yeast2,list(yeastTF_SGDGenes)),
        cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize, from=from, to=to,
        showFeatureId=FALSE, featureAnnotation="id",
        fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey", showId=TRUE)#, stacking="dense")#, ylim=c(0,100))
plot(1:10, (1:10)^3, type = "l")
dev.off()
@
\vspace{0.25cm}


\begin{figure}[!htp]
  \centering
    \includegraphics[width=18cm]{gviz_example_yeast.pdf}
    \caption{Genome Browser showing the 12 data tracks used for model
    learning together with the segmentations and known SGD gene annotations.}%
    \label{fig:gviz_yeast}%
\end{figure}

%\subsection*{Collapsing two directed twin states into one undirected states}
%There are two states in the yeast bdHMM that seem to have low directional
%information, since they frequently switch between forward and reverse direction
%(Figure \ref{fig:gviz_yeast}): The promoter and low signal state. It is
% possible to collapse these twin states into one undirected state and ...

%\section{Combining different emission functions}
%TODO


\section{Concluding Remarks}
This vignette was generated using the following package versions:

<<sessInfo, results="asis", echo=FALSE>>=
toLatex(sessionInfo())
@

\newpage

\bibliography{refs}

\end{document}