% \VignetteIndexEntry{HMMcopy}
% \VignetteDepends{HMMcopy, data.table}
% \VignetteKeywords{Readcount GC Mappability Correction Normalization HTS}
% \VignettePackage{HMMcopy}

\documentclass[letterpaper]{article}
\usepackage{fullpage}
\usepackage{url} 

\title{HMMcopy: A package for bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data}
\author{Daniel Lai and Gavin Ha}
\date{\today}

\begin{document}

\setkeys{Gin}{width=\textwidth}

\maketitle
\tableofcontents

\section{Introduction}

High-throughput whole genome DNA sequencing results in millions of reads which
after short-read mapping, can be used to infer the original sequence of the
sample.  An additional piece of information we gain from mapping is the coverage
or number of reads overlapping each position, which can be used as a rough
estimate of copy number.  In practice, it is often easier to work at a lower
resolution, and so we divide the genome into non-overlapping windows of fixed
width (\textit{e.g.} 1000), and use the \textit{readcount} or number of reads
starting in each of these windows (henceforth termed \textit{bins}).
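To make the binning concrete, here is a toy sketch of the idea (not run; \verb|read_starts| is a hypothetical vector of mapped read start positions, not something produced by this package):

<<eval=FALSE>>=
# Toy illustration: count reads *starting* in each fixed-width bin
bin_width <- 1000
read_starts <- c(150, 980, 1020, 2500, 2600)  # hypothetical read starts
bins <- floor(read_starts / bin_width)        # 0-based bin index per read
table(bins)                                   # readcounts per non-empty bin
@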

The main assumption of using readcount as a proxy for copy number is that during
the sequencing protocol, reads are uniformly sampled from all genetic material
present in the original input sample.  This assumption has been shown to be
incorrect, with reads being sampled at unequal rates depending on the GC
content, among other things \cite{Benjamini}.

To different extents, this bias exists and varies across all platforms,
protocols, and samples.  Additionally, Benjamini and Speed \cite{Benjamini}
show that two libraries prepared from the same DNA source can exhibit
different biases, highlighting the fact that matched sample normalization
(\textit{i.e.} normalizing tumour copy by a matched normal reference) is
insufficient to eliminate this bias.

\section{Generating Copy Number Profiles}

For the remainder of this vignette, we will work through a short example using
a matched tumour-normal dataset, showing the ease and effects of normalizing
readcounts by GC content and mappability.

The dataset described here comes from a female triple-negative breast cancer
patient in the published dataset \cite{Ha2012,Shah2012}. The genome library was
sequenced on the ABI/Life SOLiD platform, generating hybrid 25bp and 50bp
paired-end reads.  The reads were aligned with BioScope \\
(\url{https://products.appliedbiosystems.com/}), and reads mapping to multiple
sites were ignored.

\subsection{Obtaining HTS copy number data}

It should be noted that, due to memory and speed considerations, the raw
input files for this package are required to be generated by programs
distributed as part of the \textit{HMMcopy Suite}.

In short, the suite has tools to:
\begin{itemize}
  \item Obtain high resolution bin counts for large ($\approx$ 250GB) BAM files within
  a few hours.
  \item Obtain GC content for bins from standard FASTA files within minutes for
  a human genome.
  \item Obtain average mappability for bins from BigWig files within minutes,
  or from FASTA files within a day for a human genome.
\end{itemize}
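For reference, typical invocations of these tools might look like the
following (a sketch only; the tool names and the \verb|-w| window-size option
reflect the suite's documentation at the time of writing and may differ
between versions):

<<eval=FALSE>>=
# Sketch of HMMcopy Suite invocations, normally run from a shell;
# names and options here are assumptions and may vary by version.
system("readCounter -w 1000 tumour.bam > tumour.wig")
system("gcCounter -w 1000 reference.fasta > gc.wig")
system("mapCounter -w 1000 mappability.bigwig > map.wig")
@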

\subsection{Loading HTS copy number data}

To show the effects of correction in a simple case, let us begin by correcting
data obtained from a normal genome sample.

To start, we load in WIG files containing three sets of data, namely readcounts,
GC content, and average mappability, pre-computed for each bin with external
applications provided by the \textit{HMMcopy Suite}.

Briefly, the readcounts data consists of the number of reads mapped within 
the boundaries of each bin.  The GC content file consists of G and C base 
proportions for each bin; -1 is used when the reference sequence for the
bin contains at least one unknown nucleotide base (\textit{i.e.} N, rather
than A, C, G or T).  The mappability file, which was generated using the
\emph{ENCODE Duke Uniqueness of 35bp sequences} track from UCSC \\
(\url{http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeMapability}), 
provides scores indicating how mappable the reference sequence at each bin 
is to aligners; the higher the score, the more mappable the sequence.

<<>>=
options(stringsAsFactors = TRUE)
library(HMMcopy)
rfile <- system.file("extdata", "normal.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")
normal_reads <- wigsToRangedData(rfile, gfile, mfile)
normal_reads[1000:1010, ]
@
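
As a quick check of the loaded data, bins whose reference sequence contains
unknown bases can be counted through the \verb|gc| column described above (an
optional sanity check, not part of the workflow):

<<eval=FALSE>>=
# Bins flagged with gc == -1 contain at least one unknown (N) base
sum(normal_reads$gc < 0)
@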

\subsection{Correcting HTS copy number data}

The resultant output is a \verb|data.table| object, which works easily with
other R and Bioconductor functions and packages for analysis and
visualization.  It is recommended, however, that the output be left untouched
until it has been corrected as follows:

<<>>=
normal_copy <- correctReadcount(normal_reads)
normal_copy[1000:1010, ]

@

The corrected reads are in the two columns \verb|cor.gc| and \verb|cor.map|,
which are corrected and normalized, effectively a copy number estimate for
each bin.  For downstream analysis, \verb|copy| is used, which is simply
the base 2 logarithm of \verb|cor.map|, normalizing the differences between
values above and below 1.  Specifically, the columns output by this process
are:

\begin{itemize}
  \item[valid] Bins with valid GC content, valid average mappability, and non-zero readcounts
  \item[ideal] Valid bins of high mappability whose readcounts are not outliers
  \item[cor.gc] Readcounts after the first GC correction step
  \item[cor.map] \verb|cor.gc| readcounts after a further mappability correction
  \item[copy] \verb|cor.map| transformed into log2 space
\end{itemize}
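
Since \verb|copy| is by definition the base 2 logarithm of \verb|cor.map|,
the relationship can be verified directly (a quick sanity check):

<<eval=FALSE>>=
# copy should equal log2(cor.map), with NAs in matching positions
all.equal(normal_copy$copy, log2(normal_copy$cor.map))
@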

\subsection{Visualizing the effects of correction}

To best visualize the biases in the original sample and the effects of
correction, convenience plotting functions are available as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE>>=
par(cex.main = 0.7, cex.lab = 0.7, cex.axis = 0.7, mar = c(4, 4, 2, 0.5))
plotBias(normal_copy, pch = 20, cex = 0.5)

@

As observed, a unimodal relationship typically exists between readcount and
GC content (here with an admittedly weak tail), which is eliminated in the
GC correction step.  The GC-corrected readcounts still exhibit a slight
correlation with mappability, which is removed in the second step.
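
One rough way to quantify the residual mappability bias is to correlate each
stage of readcounts with the \verb|map| column over the ideal bins (a sketch;
the second correlation should shrink towards zero after the mappability step):

<<eval=FALSE>>=
ideal <- normal_copy$ideal
# correlation with mappability before and after the second correction step
cor(normal_copy$cor.gc[ideal], normal_copy$map[ideal], use = "complete.obs")
cor(normal_copy$cor.map[ideal], normal_copy$map[ideal], use = "complete.obs")
@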

\subsection{Visualizing corrected copy number profiles}

To see the effects of correction for copy number estimation, we can plot
the copy number along a segment of a chromosome as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE>>=
par(mar = c(4, 4, 2, 0))
plotCorrection(normal_copy, pch = ".")

@

In the top track, the original uncorrected reads are used to estimate copy
number (by simply dividing by the median value), followed by tracks with
additional GC and then mappability correction.  The median absolute deviation
(MAD), measuring the variability of each track, can be seen to decrease after
each correction step.
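
Comparable MAD statistics can be recomputed by hand over the ideal bins (a
sketch; the values may not match the plot exactly, depending on how
\verb|plotCorrection| normalizes each track):

<<eval=FALSE>>=
ideal <- normal_copy$ideal
mad(normal_copy$reads[ideal] / median(normal_copy$reads[ideal]), na.rm = TRUE)
mad(normal_copy$cor.gc[ideal], na.rm = TRUE)
mad(normal_copy$cor.map[ideal], na.rm = TRUE)
@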

\subsection{Correcting and visualizing tumour copy number profiles}

The identical process can be applied to tumour genomes using the same
GC and mappability files as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE>>=
tfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
tumour_copy <- correctReadcount(wigsToRangedData(tfile, gfile, mfile))

par(mar = c(4, 4, 2, 0))
plotCorrection(tumour_copy, pch = ".")

@

Note that at this stage, the correction of the normal and tumour samples was
done \emph{independently}, meaning that \textbf{HMMcopy is usable for both
single sample and matched sample libraries.}

\section{Segmentation and Classification of Copy Number Profiles}

While corrected copy number data points are pleasing to stare at, they are of
much more practical use if we can group and classify them into segments
corresponding to specific copy number variation events.  Included in the
package is an implementation of a Hidden Markov Model \cite{shah} which does
two main things:

\begin{enumerate}
\item Segments the copy number profile into regions predicted to be generated
by the same copy number variation event
\item Predicts the copy number variation event for each segment, based on the
biological likelihood of the event occurring
\end{enumerate}

<<>>=
tumour_segments <- HMMsegment(tumour_copy)

@
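
The result is a list whose components, including the per-segment table used by
the plotting functions below, are documented in \verb|?HMMsegment| and can be
inspected directly:

<<eval=FALSE>>=
names(tumour_segments)      # components of the segmentation output
head(tumour_segments$segs)  # one row per predicted segment
@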

\subsection{Visualizing segments and classified states}

We can visualize both the resultant segments and the states assigned
to each segment easily as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
par(mfrow = c(1, 1))
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, tumour_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")

cols <- stateCols() # 6 default state colours
legend("topleft", c("HOMD", "HETD", "NEUT", "GAIN", "AMPL", "HLAMP"),
  fill = cols, horiz = TRUE, bty = "n", cex = 0.5)

@

As observed, there are six coloured states, corresponding to six biological
CNA events:

\begin{itemize}
  \item[HOMD] Homozygous deletion, $\leq$ 0 copies
  \item[HETD] Heterozygous deletion, 1 copy
  \item[NEUT] Neutral change, 2 copies
  \item[GAIN] Gain of chromosome, 3 copies
  \item[AMPL] Amplification event, 4 copies
  \item[HLAMP] High level amplification, $\geq$ 5 copies
\end{itemize}
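
A simple way to summarize the classification is to tabulate the predicted
segments by state (a sketch; it assumes the segment table carries a
\verb|state| column, as described in \verb|?HMMsegment|):

<<eval=FALSE>>=
# Number of predicted segments falling into each CNA state
table(tumour_segments$segs$state)
@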

Segments are shown as neon green lines running through regions estimated to
belong to the same CNA event.

\subsection{Improving segmentation performance}

The segmentation program will try its best to create the most biologically
likely state assignment, but unfortunately (and fortunately for us humans), the
automated program can often fail, as seen in this specific example.  In cases
like these, there is a need to adjust the parameters given to the algorithm.
Instead of specifying parameters from scratch, however, one can easily
retrieve the default set of parameters as follows:

<<>>=
default_param <- HMMsegment(tumour_copy, getparam = TRUE)
default_param

@

By calling the same segmentation algorithm with the \verb|getparam| argument
set to \verb|TRUE|, the algorithm generates and returns its parameters without
actually running the segmentation.  There are 10 parameters, explained in
detail in the documentation (\textit{i.e.} \verb|?HMMsegment|), each having
values across the six states in each row.

\subsection{Reducing the number of segments}

To reduce the number of segments, the two parameters we need to adjust are
\verb|e| and \verb|strength|: the suggested probability of staying in a
segment, and the strength of that suggestion to the algorithm.  Once set, we
run the segmentation algorithm again, this time explicitly providing our new
parameters.

<<>>=
longseg_param <- default_param
longseg_param$e <- 0.999999999999999
longseg_param$strength <- 1e30
longseg_segments <- HMMsegment(tumour_copy, longseg_param, verbose = FALSE)

@

Finally, we can visualize the results to confirm the intended decrease in the
number of segments.

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, longseg_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
legend("topleft", c("HOMD", "HETD", "NEUT", "GAIN", "AMPL", "HLAMP"),
  fill = cols, horiz = TRUE, bty = "n", cex = 0.5)

@

Conversely, if for some odd reason you would like to increase the number of
segments, simply decrease \verb|e| and \verb|strength| to small values (close
to 0), although the algorithm will almost certainly take several times longer
to run.
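
For completeness, such an adjustment might look like this (a sketch, not run
here; \verb|shortseg_param| is a hypothetical name):

<<eval=FALSE>>=
shortseg_param <- default_param
shortseg_param$e <- 0.1         # low probability of staying in a segment
shortseg_param$strength <- 0.1  # weakly held suggestion
shortseg_segments <- HMMsegment(tumour_copy, shortseg_param, verbose = FALSE)
@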

\subsection{Adjusting copy number state ranges}

A second, more subtle and debilitating problem seen in this profile is the
incorrect median of each copy number state.  Specifically, this is a problem
with the \verb|mu| parameter.  We can access one of the output values of the
segmentation process:

<<>>=
longseg_segments$mus

@

We see the medians of the 6 states (rows) after each iteration of the
optimization algorithm (columns).  The first column is our initial suggested
\verb|mu|:

<<>>=
longseg_param$mu

@

And the last column is the actual values used during the segmentation process,
which we can visualize as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, longseg_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
for (i in 1:nrow(longseg_segments$mus)) {
  abline(h = longseg_segments$mus[i, ncol(longseg_segments$mus)],
    col = cols[i], lwd = 2, lty = 3)
}
abline(v = 7.68e7, lwd = 2, lty = 3)
abline(v = 8.02e7, lwd = 2, lty = 3)

@

From this, it becomes obvious that the medians are clearly not running
through the middle of the visually obvious segments for many of the states.
For example, the red medians for \verb|GAIN|, \verb|AMPL| and \verb|HLAMP|
are much too high, and the green medians for \verb|HOMD| and \verb|HETD| are
much too close together.  Finally, there is a clear segment, flanked by
vertical black dashed lines near the centre of the chromosome, that is stuck
between the medians for green \verb|HETD| and blue \verb|NEUT|, when it
should definitively belong to one or the other.

Staring at such a diagram, it is easy to suggest new \verb|mu| parameters,
which can be fed back into the segmentation algorithm and visualized once
again as follows.

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
newmu_param <- longseg_param
newmu_param$mu <- c(-0.5, -0.4, -0.15, 0.1, 0.4, 0.7)
newmu_segments <- HMMsegment(tumour_copy, newmu_param, verbose = FALSE)
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, newmu_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")

@

\subsubsection{Understanding parameter convergence}

The observant reader will notice that nothing has changed.  This was done
purposely to highlight that the parameters you set are often just
\emph{suggestions}; given sufficient flexibility, the algorithm will
effectively ignore them.  As seen below for \verb|newmu_segments|, after
several iterations our newly suggested values of \verb|mu| (first column)
effectively converge (in the last column) to the same default values of
\verb|mu| in \verb|longseg_param|.

<<>>=
newmu_segments$mus
longseg_param$mu

@

\subsubsection{Overriding parameter convergence}

The solution is to simply prevent the algorithm from making large shifts to
\verb|mu|, which we can achieve by setting the prior mean of \verb|mu|
(\textit{i.e.} \verb|m|) to values identical to \verb|mu|.

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
newmu_param$m <- newmu_param$mu
realmu_segments <- HMMsegment(tumour_copy, newmu_param, verbose = FALSE)
plotSegments(tumour_copy, realmu_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")

@

\section{Matched Tumour-Normal Sample Correction}

So far, our entire process has been run on single samples, providing
respectable results and aesthetically pleasing profiles.  However, when given
matched tumour and normal pairs, we can improve our copy number results in
two ways:
\begin{enumerate}
  \item We can divide our tumour copy number profile by normal copy number
  profile to eliminate any germline copy number variation (CNV)
  events present in both normal and tumour profiles.  This will leave only
  somatic copy number aberration (CNA) events in the tumour.
  \item A dramatic reduction in noise is also observed, further improving
  copy number profiles.
\end{enumerate}

\subsection{Normalizing tumour by normal copy number profiles}

The normalization is a simple bin-wise division of tumour copy number by
normal copy number, or equivalently, a subtraction of the values in log
space:

<<>>=
somatic_copy <- tumour_copy
# LOGARITHM IDENTITY: log(a) - log(b) == log(a / b)
somatic_copy$copy <- tumour_copy$copy - normal_copy$copy

@

We can then do the same segmentation and visualization as follows:

<<fig=TRUE, pdf=FALSE, eps=FALSE, png=TRUE, height=2>>=
somatic_segments <- HMMsegment(somatic_copy, newmu_param, verbose = FALSE)
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0), mgp = c(1, 0.5, 0))
plotSegments(somatic_copy, somatic_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")

@

The resultant corrected reads can easily be analyzed and exported with other
R and Bioconductor packages.
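
For instance, the per-bin estimates and the segment table could be written out
as plain text with base R (one possible export; the \verb|segs| element name
follows \verb|?HMMsegment|):

<<eval=FALSE>>=
# Export per-bin somatic copy number estimates and predicted segments
write.table(somatic_copy, file = "somatic_copy.tsv", quote = FALSE,
  sep = "\t", row.names = FALSE)
write.table(somatic_segments$segs, file = "somatic_segments.tsv",
  quote = FALSE, sep = "\t", row.names = FALSE)
@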

\section{Session Information}

The versions of R and the packages loaded when generating this vignette were:

<<echo=FALSE, results=tex>>=
toLatex(sessionInfo())

@

\bibliographystyle{plain}
\bibliography{HMMcopy}

\end{document}