%\VignetteIndexEntry{Growing phylogenetic trees with TreeLine} %\VignettePackage{DECIPHER} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{underscore} \usepackage{enumerate} \usepackage{graphics} \usepackage{wrapfig} \usepackage{cite} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{{\textsf{R}}} \newcommand{\C}{{\textsf{C}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\term}[1]{{\emph{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Growing phylogenetic trees with TreeLine} \author{Erik S. Wright} \date{\today} \maketitle \newenvironment{centerfig} {\begin{figure}[htp]\centering} {\end{figure}} \renewcommand{\indent}{\hspace*{\tindent}} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{keep.source=TRUE} <>= options(continue=" ") options(width=80) @ \tableofcontents %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ This document describes how to grow phylogenetic trees using the \Rfunction{TreeLine} function in the \Rpackage{DECIPHER} package. \Rfunction{TreeLine} takes as input a set of aligned nucleotide or amino acid sequences and returns a phylogenetic tree (i.e., \Rclass{dendrogram} object) as output. This vignette focuses on building maximum likelihood (ML) and maximum parsimony (MP) phylogenetic trees starting from sequences, but \Rfunction{TreeLine} can also be used to build additive trees from a distance matrix. Why is the function called \Rfunction{TreeLine}? The goal of \Rfunction{TreeLine} is to find the most likely/parsimonious tree for a given sequence alignment. There are often many trees with nearly maximal likelihood/parsimony. Therefore, \Rfunction{TreeLine} seeks to find a tree as close as possible to the treeline, analogous to how no trees can grow above the treeline on a mountain. Why use \Rfunction{TreeLine} versus other programs? The \Rfunction{TreeLine} function is designed to return an excellent phylogenetic tree with minimal user intervention. Many tree building programs have a large set of complex options for niche applications. In contrast, \Rfunction{TreeLine} simply builds a great tree when relying on its defaults. This vignette is intended to get you started and introduce additional options/functions that might be useful. %------------------------------------------------------------ \section{Performance Considerations} %------------------------------------------------------------ Finding a tree with very high likelihood/parsimony is no easy feat. \Rfunction{TreeLine} systematically optimizes hundreds to thousands of candidate trees before returning the best one. This takes time, but there are things you can do to make it go faster. \begin{itemize} \item Only use the sequences you need: \Rfunction{TreeLine} scales a bit worse than quadratically with the number of sequences. Hence, limiting the number of sequences is a worthwhile consideration. In particular, always eliminate redundant sequences, as shown below, and remove any sequences that are not necessary. This concern is shared for all tree building programs, and \Rfunction{TreeLine} is no exception. \item Set a timeout: The \code{maxTime} argument specifies the (approximate) maximum number of hours you are willing to let \Rfunction{TreeLine} run. If you are concerned about the code running too long then simply specify this argument. \item Compile with OpenMP support: Significant speed-ups can be achieved with multi-threading using OpenMP. See the ``Getting Started DECIPHERing'' vignette for how to do this on your computer platform. Then you only need to set the argument \code{processors=NULL} and \Rfunction{TreeLine} will use all available processors. \item Compile for SIMD support: \Rfunction{TreeLine} is configured to make use of SIMD operations, which are available on some processors. The easiest way to enable SIMD is to add `` -O3 -march=native'' to the end of \code{PKG_CFLAGS} in the ``DECIPHER/src/MAKEVARS'' text file. This enables level-3 compiler optimization for your native computer architecture. Then, after recompiling, there can be an automatic speed-up on systems with SIMD support. \end{itemize} %------------------------------------------------------------ \section{Growing a Phylogenetic Tree} %------------------------------------------------------------ \Rfunction{TreeLine} takes as input a multiple sequence alignment when constructing a maximum likelihood or maximum parsimony phylogenetic tree. Multiple sequence alignments can be constructed from a set of (unaligned) sequences using \Rfunction{AlignSeqs} or related functions. \Rfunction{TreeLine} will optimize trees for amino acid (i.e., \code{AAStringSet}) or nucleotide (i.e., \code{DNAStringSet} or \code{RNAStringSet}) sequences. Here, we are going to use a set of sequences that is included with \Rpackage{DECIPHER}. These sequences are from the internal transcribed spacer (ITS) between the 16S and 23S ribosomal RNA genes in several \textit{Streptomyces} species. <>= library(DECIPHER) # specify the path to your sequence file: fas <- "<>" # OR find the example sequence file used in this tutorial: fas <- system.file("extdata", "Streptomyces_ITS_aligned.fas", package="DECIPHER") seqs <- readDNAStringSet(fas) # use readAAStringSet for amino acid sequences seqs # the aligned sequences @ Many of these sequences are redundant or from the same genome. We can de-replicate the sequences to accelerate tree building: <>= seqs <- unique(seqs) # remove duplicated sequences ns <- gsub("^.*Streptomyces( subsp\\. | sp\\. | | sp_)([^ ]+).*$", "\\2", names(seqs)) names(seqs) <- ns # name by species seqs <- seqs[!duplicated(ns)] # remove redundant sequences from the same species seqs @ Now, it's time to find the most likely tree. Here, we will set a strict time limit to make this example faster, although longer time limits (e.g., 24 hours) are advised. Note that \Rfunction{TreeLine} automatically selects a substitution model based on Akaike information criterion (by default). It is possible to specify specific model(s) (e.g., \code{model="GTR+G4"}) to limit the possible selections. Also, since \Rfunction{TreeLine} is a stochastic optimizer, it is critical to always set the random number seed for reproducibility. \begin{centerfig} <>= set.seed(123) # set the random number seed tree <- TreeLine(seqs, reconstruct=TRUE, maxTime=0.05) # default is method="ML" set.seed(NULL) # reset seed plot(tree) @ \caption{\label{f1} Maximum likelihood tree showing the relationships between \textit{Streptomyces} species.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Plotting Branch Support Values} %------------------------------------------------------------ \Rfunction{TreeLine} automatically returns a variety of information about the tree that can be accessed with the \Rfunction{attributes} and \Rfunction{attr} functions: <>= attributes(tree) # view all attributes attr(tree, "score") # best score @ The tree is (virtually) rooted at its midpoint by default. For maximum likelihood trees, all internal nodes include aBayes branch support values \cite{Anisimova:2011}. These are given as probabilities that can be used in plotting on top of each edge. We can also italicize the species names. \begin{centerfig} <>= plot(dendrapply(tree, function(x) { s <- attr(x, "probability") # choose "probability" (aBayes) or "support" if (!is.null(s) && !is.na(s)) { s <- formatC(as.numeric(s), digits=2, format="f") attr(x, "edgetext") <- paste(s, "\n") } attr(x, "edgePar") <- list(p.col=NA, p.lwd=1e-5, t.col="#CC55AA", t.cex=0.7) if (is.leaf(x)) attr(x, "nodePar") <- list(lab.font=3, pch=NA) x }), horiz=TRUE, yaxt='n') # add a scale bar arrows(0, 0, 0.4, 0, code=3, angle=90, len=0.05, xpd=TRUE) text(0.2, 0, "0.4 subs./site", pos=3, xpd=TRUE) @ \caption{\label{f2} Tree with (aBayes) support probabilities at each internal node.} \end{centerfig} \clearpage Maximum likelihood and maximum parsimony trees both provide branch supports in the form of the fraction of optimized trees that contained a given partition (branch). These are accessible from the ``support'' attribute. As expected, support values and (aBayes) probabilities are correlated, but support tends to be more conservative. \begin{centerfig} <>= getSupports <- function(x) { if (is.leaf(x)) { NULL } else { rbind(cbind(attr(x, "support"), attr(x, "probability")), getSupports(x[[1]]), getSupports(x[[2]])) } } support <- getSupports(tree) plot(support[, 1], support[, 2], xlab="Support", ylab="aBayes probability", asp=1) abline(a=0, b=1, lty=2) # line of identity (y=x) @ \caption{\label{f3} Comparison of aBayes probabilities and branch support values.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Ancestral State Reconstruction} %------------------------------------------------------------ One of the advantages of maximum likelihood and maximum parsimony tree building methods is that they automatically predict states at each internal node on the tree \cite{Joy:2016}. This feature is enabled when \Rfunarg{reconstruct} is set to \code{TRUE}. These character states can be used by the function \Rfunction{MapCharacters} to determine state transitions along each edge of the tree. \begin{centerfig} <>= new_tree <- MapCharacters(tree, labelEdges=TRUE) plot(new_tree, edgePar=list(p.col=NA, p.lwd=1e-5, t.col="#55CC99", t.cex=0.7)) attr(new_tree[[1]], "change") # state changes on first branch left of (virtual) root @ \caption{\label{f4} Edges labeled with the number of state transitions.} \end{centerfig} \clearpage %------------------------------------------------------------ \section{Session Information} %------------------------------------------------------------ All of the output in this vignette was produced under the following conditions: <>= toLatex(sessionInfo(), locale=FALSE) @ \begin{thebibliography}{} \bibitem{Anisimova:2011} {Anisimova, M.}, {Gil, M.}, {Dufayard, J.}, {Dessimoz, C.}, \& {Gascuel, O.} \newblock Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. \newblock {\em Syst Biol.}, 60(5), 685-699. \bibitem{Joy:2016} {Joy, J.}, {Liang, R.}, {McCloskey, R.}, {Nguyen, T.}, \& {Poon, A.} \newblock Ancestral Reconstruction. \newblock {\em PLoS Comp. Biol.}, 12(7), e1004763. \end{thebibliography} \end{document}