%\VignetteIndexEntry{Growing phylogenetic trees with TreeLine}
%\VignettePackage{DECIPHER}

\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{underscore}
\usepackage{enumerate}
\usepackage{graphics}
\usepackage{wrapfig}
\usepackage{cite}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\C}{{\textsf{C}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Growing phylogenetic trees with TreeLine}
\author{Erik S. Wright}
\date{\today}
\maketitle

\newenvironment{centerfig}
{\begin{figure}[htp]\centering}
{\end{figure}}
\renewcommand{\indent}{\hspace*{\tindent}}
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{keep.source=TRUE}
<<echo=false>>=
options(continue=" ")
options(width=80)
@

\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------

This document describes how to grow phylogenetic trees using the \Rfunction{TreeLine} function in the \Rpackage{DECIPHER} package.  \Rfunction{TreeLine} takes as input a set of aligned nucleotide or amino acid sequences and returns a phylogenetic tree (i.e., \Rclass{dendrogram} object) as output.  This vignette focuses on building maximum likelihood (ML) and maximum parsimony (MP) phylogenetic trees starting from sequences, but \Rfunction{TreeLine} can also be used to build additive trees from a distance matrix.

Why is the function called \Rfunction{TreeLine}?  The goal of \Rfunction{TreeLine} is to find the most likely/parsimonious tree for a given sequence alignment.  There are often many trees with nearly maximal likelihood/parsimony.  Therefore, \Rfunction{TreeLine} seeks to find a tree as close as possible to the treeline, analogous to how no trees can grow above the treeline on a mountain.

Why use \Rfunction{TreeLine} versus other programs?  The \Rfunction{TreeLine} function is designed to return an excellent phylogenetic tree with minimal user intervention.  Many tree building programs have a large set of complex options for niche applications.  In contrast, \Rfunction{TreeLine} simply builds a great tree when relying on its defaults.  This vignette is intended to get you started and introduce additional options/functions that might be useful.

%------------------------------------------------------------
\section{Performance Considerations}
%------------------------------------------------------------

Finding a tree with very high likelihood/parsimony is no easy feat.  \Rfunction{TreeLine} systematically optimizes hundreds to thousands of candidate trees before returning the best one.  This takes time, but there are things you can do to make it go faster.

\begin{itemize}
\item Only use the sequences you need:  \Rfunction{TreeLine} scales a bit worse than quadratically with the number of sequences.  Hence, limiting the number of sequences is a worthwhile consideration.  In particular, always eliminate redundant sequences, as shown below, and remove any sequences that are not necessary.  This concern is shared for all tree building programs, and \Rfunction{TreeLine} is no exception.
\item Set a timeout:  The \code{maxTime} argument specifies the (approximate) maximum number of hours you are willing to let \Rfunction{TreeLine} run.  If you are concerned about the code running too long then simply specify this argument.
\item Compile with OpenMP support:  Significant speed-ups can be achieved with multi-threading using OpenMP.  See the ``Getting Started DECIPHERing'' vignette for how to do this on your computer platform.  Then you only need to set the argument \code{processors=NULL} and \Rfunction{TreeLine} will use all available processors.
\item Compile for SIMD support:  \Rfunction{TreeLine} is configured to make use of SIMD operations, which are available on some processors.  The easiest way to enable SIMD is to add `` -O3 -march=native'' to the end of \code{PKG_CFLAGS} in the ``DECIPHER/src/MAKEVARS'' text file.  This enables level-3 compiler optimization for your native computer architecture.  Then, after recompiling, there can be an automatic speed-up on systems with SIMD support.
\end{itemize}

%------------------------------------------------------------
\section{Growing a Phylogenetic Tree}
%------------------------------------------------------------

\Rfunction{TreeLine} takes as input a multiple sequence alignment when constructing a maximum likelihood or maximum parsimony phylogenetic tree.  Multiple sequence alignments can be constructed from a set of (unaligned) sequences using \Rfunction{AlignSeqs} or related functions.  \Rfunction{TreeLine} will optimize trees for amino acid (i.e., \code{AAStringSet}) or nucleotide (i.e., \code{DNAStringSet} or \code{RNAStringSet}) sequences.  Here, we are going to use a set of sequences that is included with \Rpackage{DECIPHER}.  These sequences are from the internal transcribed spacer (ITS) between the 16S and 23S ribosomal RNA genes in several \textit{Streptomyces} species.

<<expr1,eval=TRUE,echo=TRUE>>=
library(DECIPHER)

# specify the path to your sequence file:
fas <- "<<path to FASTA file>>"
# OR find the example sequence file used in this tutorial:
fas <- system.file("extdata", "Streptomyces_ITS_aligned.fas", package="DECIPHER")

seqs <- readDNAStringSet(fas) # use readAAStringSet for amino acid sequences
seqs # the aligned sequences
@

Many of these sequences are redundant or from the same genome.  We can de-replicate the sequences to accelerate tree building:

<<expr2,eval=TRUE,echo=TRUE>>=
seqs <- unique(seqs) # remove duplicated sequences

ns <- gsub("^.*Streptomyces( subsp\\. | sp\\. | | sp_)([^ ]+).*$", "\\2", names(seqs))
names(seqs) <- ns # name by species
seqs <- seqs[!duplicated(ns)] # remove redundant sequences from the same species
seqs
@

Now, it's time to find the most likely tree.  Here, we will set a strict time limit to make this example faster, although longer time limits (e.g., 24 hours) are advised.

Note that \Rfunction{TreeLine} automatically selects a substitution model based on Akaike information criterion (by default).  It is possible to specify specific model(s) (e.g., \code{model="GTR+G4"}) to limit the possible selections.

Also, since \Rfunction{TreeLine} is a stochastic optimizer, it is critical to always set the random number seed for reproducibility.

\begin{centerfig}
<<expr3,eval=TRUE,echo=TRUE,fig=TRUE,height=3.5,width=6>>=
set.seed(123) # set the random number seed

tree <- TreeLine(seqs, reconstruct=TRUE, maxTime=0.05) # default is method="ML"

set.seed(NULL) # reset seed

plot(tree)
@
\caption{\label{f1} Maximum likelihood tree showing the relationships between \textit{Streptomyces} species.}
\end{centerfig}

\clearpage

%------------------------------------------------------------
\section{Plotting Branch Support Values}
%------------------------------------------------------------

\Rfunction{TreeLine} automatically returns a variety of information about the tree that can be accessed with the \Rfunction{attributes} and \Rfunction{attr} functions:

<<expr4,eval=TRUE,echo=TRUE>>=
attributes(tree) # view all attributes
attr(tree, "score") # best score
@

The tree is (virtually) rooted at its midpoint by default.  For maximum likelihood trees, all internal nodes include aBayes branch support values \cite{Anisimova:2011}.  These are given as probabilities that can be used in plotting on top of each edge.  We can also italicize the species names.

\begin{centerfig}
<<expr5,eval=TRUE,echo=TRUE,fig=TRUE,height=6,width=6>>=
plot(dendrapply(tree,
	function(x) {
		s <- attr(x, "probability") # choose "probability" (aBayes) or "support"
		if (!is.null(s) && !is.na(s)) {
			s <- formatC(as.numeric(s), digits=2, format="f")
			attr(x, "edgetext") <- paste(s, "\n")
		}
		attr(x, "edgePar") <- list(p.col=NA, p.lwd=1e-5, t.col="#CC55AA", t.cex=0.7)
		if (is.leaf(x))
			attr(x, "nodePar") <- list(lab.font=3, pch=NA)
		x
	}),
	horiz=TRUE,
	yaxt='n')
# add a scale bar
arrows(0, 0, 0.4, 0, code=3, angle=90, len=0.05, xpd=TRUE)
text(0.2, 0, "0.4 subs./site", pos=3, xpd=TRUE)
@
\caption{\label{f2} Tree with (aBayes) support probabilities at each internal node.}
\end{centerfig}

\clearpage

Maximum likelihood and maximum parsimony trees both provide branch supports in the form of the fraction of optimized trees that contained a given partition (branch).  These are accessible from the ``support'' attribute.  As expected, support values and (aBayes) probabilities are correlated, but support tends to be more conservative.

\begin{centerfig}
<<expr6,eval=TRUE,echo=TRUE,fig=TRUE,height=3.5,width=3.5>>=
getSupports <- function(x) {
	if (is.leaf(x)) {
		NULL
	} else {
		rbind(cbind(attr(x, "support"), attr(x, "probability")),
			getSupports(x[[1]]), getSupports(x[[2]]))
	}
}
support <- getSupports(tree)
plot(support[, 1], support[, 2], xlab="Support", ylab="aBayes probability", asp=1)
abline(a=0, b=1, lty=2) # line of identity (y=x)
@
\caption{\label{f3} Comparison of aBayes probabilities and branch support values.}
\end{centerfig}

\clearpage

%------------------------------------------------------------
\section{Ancestral State Reconstruction}
%------------------------------------------------------------

One of the advantages of maximum likelihood and maximum parsimony tree building methods is that they automatically predict states at each internal node on the tree \cite{Joy:2016}.  This feature is enabled when \Rfunarg{reconstruct} is set to \code{TRUE}.  These character states can be used by the function \Rfunction{MapCharacters} to determine state transitions along each edge of the tree.

\begin{centerfig}
<<expr7,eval=TRUE,echo=TRUE,fig=TRUE,height=6,width=6>>=
new_tree <- MapCharacters(tree, labelEdges=TRUE)
plot(new_tree, edgePar=list(p.col=NA, p.lwd=1e-5, t.col="#55CC99", t.cex=0.7))
attr(new_tree[[1]], "change") # state changes on first branch left of (virtual) root
@
\caption{\label{f4} Edges labeled with the number of state transitions.}
\end{centerfig}

\clearpage

%------------------------------------------------------------
\section{Session Information}
%------------------------------------------------------------

All of the output in this vignette was produced under the following
conditions:

<<sessinfo,echo=FALSE,results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@

\begin{thebibliography}{}

\bibitem{Anisimova:2011}
{Anisimova, M.}, {Gil, M.}, {Dufayard, J.}, {Dessimoz, C.}, \& {Gascuel, O.}
\newblock Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes.
\newblock {\em Syst Biol.}, 60(5), 685-699.

\bibitem{Joy:2016}
{Joy, J.}, {Liang, R.}, {McCloskey, R.}, {Nguyen, T.}, \& {Poon, A.}
\newblock Ancestral Reconstruction.
\newblock {\em PLoS Comp. Biol.}, 12(7), e1004763.

\end{thebibliography}

\end{document}