\documentclass[article]{bioinf}

\usepackage{texshade}
\usepackage{hyperref}

\hypersetup{colorlinks=false,
    pdfborder=0 0 0,
    pdftitle={msa - An R Package for Multiple Sequence Alignment},
    pdfauthor={Enrico Bonatesta, Christoph Kainrath, and Ulrich Bodenhofer}}

\usepackage[OT1]{fontenc}

\title{{\Huge msa}\\[5mm] An R Package for Multiple Sequence Alignment}
\author{Enrico Bonatesta, Christoph Kainrath, and Ulrich Bodenhofer}
\affiliation{Institute of Bioinformatics, Johannes Kepler University
Linz\\Altenberger Str. 69, 4040 Linz, Austria\\
\email{msa@bioinf.jku.at}}

\newcommand{\msa}{\texttt{msa}}
\newcommand{\MSA}{\texttt{msa}}
\newcommand{\R}{\texttt{R}}
\newcommand{\shade}{\TeXshade}

%\VignetteIndexEntry{msa - An R Package for Multiple Sequence Alignment}
%\VignetteDepends{methods, stats, graphics, utils}
%\VignetteEngine{knitr::knitr}

\setlength{\fboxrule}{1.5pt}

\newcommand{\notebox}[1]{%
\begin{center}
\fbox{\begin{minipage}{\Nboxwidth}
\noindent{\sffamily\bfseries Note:} #1
\end{minipage}}
\end{center}}

\begin{document}
<<LoadPackageToDetermineVersion,echo=FALSE,message=FALSE,results='hide'>>=
options(width=65)
set.seed(0)
library(msa)
library(seqinr)
msaVersion <- packageDescription("msa")$Version
msaDateRaw <- packageDescription("msa")$Date
msaDateYear <- as.numeric(substr(msaDateRaw, 1, 4))
msaDateMonth <- as.numeric(substr(msaDateRaw, 6, 7))
msaDateDay <- as.numeric(substr(msaDateRaw, 9, 10))
msaDate <- paste(month.name[msaDateMonth], " ",
                 msaDateDay, ", ",
                 msaDateYear, sep="")
@

\newcommand{\msaVer}{\Sexpr{msaVersion}}
\newcommand{\msaDate}{\Sexpr{msaDate}}
\manualtitlepage[Version \msaVer, \msaDate]

\section*{Scope and Purpose of this Document}

This document provides a gentle introduction into the \R\ package \MSA. Not all
features of the \R\ package are described in full detail. Such details can be
obtained from the documentation enclosed in the  \R\ package. Further note
the following: (1) this is not an introduction to multiple sequence alignment
or algorithms for multiple sequence alignment; (2) this is not an
introduction to  \R\ or any of the Bioconductor packages used in this document.
If you lack the background for understanding this manual, you
first have to read introductory literature on the subjects mentioned above.

\newpage

\vspace{1cm}

\newlength{\auxparskip}
\setlength{\auxparskip}{\parskip}
\setlength{\parskip}{0pt}
\tableofcontents
\clearpage
\setlength{\parskip}{\auxparskip}

\newlength{\Nboxwidth}
\setlength{\Nboxwidth}{\textwidth}
\addtolength{\Nboxwidth}{-2\fboxrule}
\addtolength{\Nboxwidth}{-2\fboxsep}


\section{Introduction}
Multiple sequence alignment is one of the most fundamental tasks in
bioinformatics. Algorithms like ClustalW~\cite{Thompson1994},
ClustalOmega~\cite{Sievers2011}, and MUSCLE~\cite{Edgar2004b,Edgar2004a}
are well known and
widely used. However, all these algorithms are implemented as stand-alone
commmand line programs without any integration into the R/Bioconductor
ecosystem. Before the \MSA\ package, only the \verb+muscle+ package has
been available in \R, but no other multiple sequence alignment algorithm,
although the \verb+Biostrings+ package has provided data types for
representing multiple sequence alignments for quite some time.
The \MSA\ package aims to close that gap by
providing a unified R interface to the multiple sequence alignment algorithms
ClustalW, ClustalOmega, and MUSCLE. The package requires no additional
software packages and runs on all major platforms. Moreover, the \MSA\
package provides an R interface to
the powerful \LaTeX\ package \shade\ \cite{Beitz2000} which allows for a
highly customizable plots of multiple sequence alignments. Unless some very
special features of \shade\ are required, users can pretty-print multiple
sequence alignments without the need to know the details of \LaTeX\ or
\shade.

\section{Installation}\label{sec:install}

The \MSA\ \R\ package (current version: \msaVer) is
available via Bioconductor. The simplest way to install the package
is the following:

<<InstallMSA,eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("msa")
@

To test the installation of the \MSA\ package, enter
<<LoadMSA,eval=FALSE>>=
library(msa)
@
\noindent in your \R\ session. If this command terminates without any error
message or warning, you can be sure that the \MSA\ package has
been installed successfully. If so, the \MSA\ package is ready
for use now and you can start performing multiple sequence alignments.

To make use of all functionalities of \verb+msaPrettyPrint()+,
a \TeX/\LaTeX{} system \cite{Lamport1999} must be installed. To make use of
\LaTeX\ code created by \verb+msaPrettyPrint()+ or to use the output
of \verb+msaPrettyPrint()+ in Sweave \cite{Leisch2002} or knitr
\cite{Xie2014} documents, the \LaTeX\ package
\shade\ (file \verb+texshade.sty+) \cite{Beitz2000} must be accessible to
the \LaTeX\  system too. The file \verb+texshade.sty+ is shipped with
the \MSA\ package. To determine where the file is located, enter the
following command in your \R\ session:
<<locateTeXshadeSty,eval=FALSE>>=
system.file("tex", "texshade.sty", package="msa")
@
\noindent Alternatively, \shade\ can be installed directly from the
Comprehensive \TeX\ Archive Network (CTAN).%
\footnote{\url{https://www.ctan.org/pkg/texshade}}

\section{\MSA\ for the Impatient}\label{sec:impatient}

In order to illustrate the basic workflow, this section presents a simple
example with default settings and without going into the details of each
step. Let us first load amino acid sequences from one of the example files
that are supplied with the \MSA\ package:
<<SimpleExFileNames>>=
mySequenceFile <- system.file("examples", "exampleAA.fasta", package="msa")
mySequences <- readAAStringSet(mySequenceFile)
mySequences
@
\noindent Now that we have loaded the sequences, we can run the \verb+msa()+
function which, by default, runs ClustalW with default parameters:
<<doAlignment>>=
myFirstAlignment <- msa(mySequences)
myFirstAlignment
@
\noindent Obviously, the default printing function shortens the alignment
for the sake of compact output. The \verb+print()+ function provided by the
\MSA\ package provides some ways for customizing the output, such as,
showing the entire alignment split over multiple blocks of sub-sequences:
<<showWholeWidth>>=
print(myFirstAlignment, show="complete")
@

The \MSA\ package additionally offers the function \verb+msaPrettyPrint()+
which allows for pretty-printing multiple alignments using the \LaTeX\ package
\shade. As an example, the following \R\ code creates a PDF file
\verb+myfirstAlignment.pdf+ which is shown in
Figure~\ref{fig:myFirstAlignment}:
<<IntegratePDF2>>=
msaPrettyPrint(myFirstAlignment, output="pdf", showNames="none",
               showLogo="none", askForOverwrite=FALSE, verbose=FALSE)
@
\noindent In the above call to \verb+msaPrettyPrint()+, the printing of
sequence names
has been suppressed by \verb+showNames="none"+. The settings
\verb+askForOverwrite=FALSE+ and \verb+verbose=FALSE+ are necessary for
building this vignette, but, in an interactive \R\ session, they are not
necessary.
\begin{figure}
\includegraphics[width=\textwidth]{myFirstAlignment.pdf}
\caption{The PDF file \texttt{myfirstAlignment.pdf}
  created with \texttt{msaPrettyPrint()}.}\label{fig:myFirstAlignment}
\end{figure}

Almost needless to say, the file names created by \verb+msaPrettyPrint()+
are customizable. By default, the name of the argument is taken as file name.
More importantly, the actual output of \verb+msaPrettyPrint()+ is
highly customizable, too. For more details,
see the Section~\ref{sec:msaPrettyPrint} and the help page of the function
(\verb+?msaPrettyPrint+).

The \verb+msaPrettyPrint()+ function is particularly useful for pretty-printing
multiple sequence alignments in Sweave \cite{Leisch2002} or knitr
\cite{Xie2014} documents. More details are provided in
Section~\ref{sec:msaPrettyPrint}. Here, we restrict to a teasing example:

<<VisualizePDF,results='asis'>>=
msaPrettyPrint(myFirstAlignment, y=c(164, 213), output="asis",
               showNames="none", showLogo="none", askForOverwrite=FALSE)
@

\section{Functions for Multiple Sequence Alignment in More Detail}
\label{sec:moreDetail}

The example in Section~\ref{sec:impatient} above simply called
the function \verb+msa()+ without any additional arguments.
We mentioned already that, in this case, ClustalW is called with default
parameters. We can also explicitly request ClustalW or one of the two
other algorithms ClustalOmega or MUSCLE:
<<OtherAlgorithms,>>=
myClustalWAlignment <- msa(mySequences, "ClustalW")
myClustalWAlignment
myClustalOmegaAlignment <- msa(mySequences, "ClustalOmega")
myClustalOmegaAlignment
myMuscleAlignment <- msa(mySequences, "Muscle")
myMuscleAlignment
@
\noindent Please note that the call \verb+msa(mySequences, "ClustalW", ...)+ is
just a shortcut for the call \verb+msaClustalW(mySequences, ...)+,
analogously for
\verb+msaClustalOmega()+ and \verb+msaMuscle()+. In other words,
\verb+msa()+ is nothing else but a wrapper function that provides a
unified interface to the three functions \verb+msaClustalW()+,
\verb+msaClustalOmega()+, and \verb+msaMuscle()+.

All three functions \verb+msaClustalW()+, \verb+msaClustalOmega()+, and
\verb+msaMuscle()+ have the same parameters: The input sequences are passed
as argument \verb+inputSeqs+, and all functions have the following
arguments: \verb+cluster+, \verb+gapOpening+, \verb+gapExtension+,
\verb+maxiters+, \verb+substitutionMatrix+, \verb+order+, \verb+type+, and
\verb+verbose+. The ways these parameters are interpreted, are largely
analogous, although there are some differences, also in terms of
default values. See the subsections below and the man page
of the three functions for more details.
All of the three functions \verb+msaClustalW()+, \verb+msaClustalOmega()+, and
\verb+msaMuscle()+, however, are not restricted to the parameters mentioned
above. All three have a `\verb+...+' argument through which several other
algorithm-specific parameters can be passed on to the underlying library.
The following subsections provide an overview of which parameters are
supported by each of the three algorithms.

\subsection{ClustalW-Specific Parameters}\label{ssec:msaClustalW}

The original implementation of ClustalW offers a lot of parameters for
customizing the way a multiple sequence alignment is computed. Through
the `\verb+...+' argument, \verb+msaClustalW()+ provides an interface to
make use of most these parameters (see the documentation of ClustalW%
\footnote{\url{http://www.clustal.org/download/clustalw_help.txt}}
for a comprehensive overview). Currently, the following restrictions and
caveats apply:
\begin{itemize}
  \item The parameters \verb+infile+, \verb+clustering+, \verb+gapOpen+,
    \verb+gapExt+, \verb+numiters+, \verb+matrix+, and \verb+outorder+
    have been renamed to
    the standardized argument names \verb+inputSeqs+, \verb+cluster+,
    \verb+gapOpening+, \verb+gapExtension+, \verb+maxiters+,
    \verb+substitutionMatrix+, and \verb+order+
    in order to provide a consistent interface for all three multiple
    sequence alignment algorithms.
  \item Boolean flags must be passed as logical values, e.g.\
    \verb+verbose=TRUE+.
  \item The parameter \verb+quiet+
    has been replaced by \verb+verbose+ (with the exact opposite meaning).
  \item The following parameters are (currently) not supported:
    \verb+bootstrap+, \verb+check+, \verb+fullhelp+, \verb+interactive+,
    \verb+maxseqlen+, \verb+options+, and \verb+tree+.
  \item For the parameter \verb+output+, only the choice \verb+"clustal"+
    is available.
\end{itemize}

\subsection{ClustalOmega-Specific Parameters}
\label{ssec:msaClustalOmega}

In the same way as ClustalW, the original implementation of ClustalOmega
also offers a lot of parameters for
customizing the way a multiple sequence alignment is computed. Through
the `\verb+...+' argument, \verb+msaClustalOmega()+ provides an interface to
make use of most these parameters (see the documentation of ClustalOmega%
\footnote{\url{http://www.clustal.org/omega/README}}
for a comprehensive overview). Currently, the following restrictions and
caveats apply:
\begin{itemize}
  \item The parameters \verb+infile+, \verb+cluster-size+,
    \verb+iterations+, and \verb+output-order+ have been renamed to
    the argument names \verb+inputSeqs+, \verb+cluster+,
    \verb+maxiters+, and \verb+order+
    in order to provide a consistent interface for all three multiple
    sequence alignment algorithms.
  \item ClustalOmega does not allow for setting custom gap penalties.
    Therefore, setting the parameters \verb+gapOpening+ and
    \verb+gapExtension+ currently has no effect and will lead to a
    warning. These arguments are only defined for
    future extensions and consistency with the other algorithms
    available in \MSA.
  \item ClustalOmega only allows for choosing substitution matrices
    from a pre-defined set of names, namely \verb+"BLOSUM30"+,
    \verb+"BLOSUM40"+, \verb+"BLOSUM50"+, \verb+"BLOSUM65"+,
    \verb+"BLOSUM80"+, and \verb+"Gonnet"+. This is a new feature
    --- the original ClustalOmega implementation does not allow for
    using any custom substitution matrix. However, since these are
    all amino acid substitution matrices, ClustalOmega is still hardly
    useful for multiple alignments of nucleotide sequences.
  \item Boolean flags must be passed as logical values, e.g.\
    \verb+verbose=TRUE+.
  \item The following parameters are (currently) not supported:
    \verb+maxSeqLength+ and \verb+help+.
  \item For the parameter \verb+outFmt+, only the choice \verb+"clustal"+
    is available.
\end{itemize}

\subsection{MUSCLE-Specific Parameters}\label{msaMuscle}

Finally, also MUSCLE offers a lot of parameters for
customizing the way a multiple sequence alignment is computed. Through
the `\verb+...+' argument, \verb+msaMuscle()+ provides an interface to
make use of most these parameters (see the documentation of MUSCLE%
\footnote{\url{http://www.drive5.com/muscle/muscle.html}}
for a comprehensive overview). Currently, the following restrictions and
caveats apply:
\begin{itemize}
  \item The parameters \verb+in+, \verb+gapOpen+,
    \verb+gapExtend+, \verb+matrix+, and \verb+seqtype+ have been renamed
    to \verb+inputSeqs+, \verb+gapOpening+,
    \verb+gapExtension+, \verb+substitutionMatrix+ and \verb+type+
    in order to provide a consistent interface for all three multiple
    sequence alignment algorithms.
  \item Boolean flags must be passed as logical values, e.g.\
    \verb+verbose=TRUE+.
  \item The parameter \verb+quiet+
    has been replaced by \verb+verbose+ (with the exact opposite meaning).
  \item The following parameters are currently not supported:
    \verb+clw+, \verb+clwstrict+, \verb+fastaout+, \verb+group+, \verb+html+,
    \verb+in1+, \verb+in2+, \verb+log+, \verb+loga+,
    \verb+msaout+, \verb+msf+, \verb+out+, \verb+phyi+, \verb+phyiout+,
    \verb+phys+, \verb+physout+, \verb+refine+, \verb+refinew+,
    \verb+scorefile+, \verb+spscore+, \verb+stable+, \verb+termgaps4+,
    \verb+termgapsfull+, \verb+termgapshalf+, \verb+termgapshalflonger+,
    \verb+tree1+, \verb+tree2+,  \verb+usetree+,
    \verb+weight1+, and \verb+weight2+.
\end{itemize}

\section{Printing Multiple Sequence Alignments}\label{sec:msaPrint}

As already shown above, multiple sequence alignments can be shown in
plain text format on the \R\ console using the \verb+print()+ function
(which is implicitly called if just the object name is entered on the
\R\ console). This function allows for multiple customizations, such as,
enabling/disabling to display a consensus sequence, printing the entire
alignment or only a subset, enabling/disabling to display sequence names,
and adjusting the width allocated for sequence names. For more information,
the reader is referred to the
help page of the print function:
<<helpPrint,eval=FALSE>>=
help("print,MsaDNAMultipleAlignment-method")
@
We only provide some examples here:
<<printExamples>>=
print(myFirstAlignment)
print(myFirstAlignment, show="complete")
print(myFirstAlignment, showConsensus=FALSE, halfNrow=3)
print(myFirstAlignment, showNames=FALSE, show="complete")
@

\section{Processing Multiple Alignments}\label{sec:msaProc}

\subsection{Methods Inherited From {\tt Biostrings}}

The classes defined by the \MSA\ package for storing multiple alignment results
have been derived from the corresponding classes defined by the
\verb+Biostrings+ package. Therefore, all methods for processing
multiple alignments are available and work without any practical limitation. In
this section, we highlight some of these.

The classes used for storing multiple alignments allow for defining masks
on sequences and sequence positions via their row and column mask slots.
They can be set by \verb+rowmask()+ and \verb+colmask()+ functions which serve
both as setter and getter functions. To set row or column masks, an
\verb+IRanges+ object must be supplied:
<<maskExample>>=
myMaskedAlignment <- myFirstAlignment
colM <- IRanges(start=1, end=100)
colmask(myMaskedAlignment) <- colM
myMaskedAlignment
@

The \verb+unmasked()+ allows for removing these masks, thereby casting
the multiple alignment to a set of aligned \verb+Biostrings+ sequences
(class \verb+AAStringSet+, \verb+DNAStringSet+, or \verb+RNAStringSet+):
<<unmaskedExample>>=
unmasked(myMaskedAlignment)
@

Consensus matrices can be computed conveniently as follows:
<<consensusExample1>>=
conMat <- consensusMatrix(myFirstAlignment)
dim(conMat)
conMat[, 101:110]
@
If called on a masked alignment, \verb+consensusMatrix()+ only uses those
sequences/rows that are not masked. If there are masked columns,
the matrix contains \verb+NA+'s in those columns:
<<consensusExample2>>=
conMat <- consensusMatrix(myMaskedAlignment)
conMat[, 95:104]
@

Multiple alignments also inherit the \verb+consensusString()+ method from
the \verb+Biostrings+ package. However, for more flexibility and consistency,
we rather advise users to use the method \verb+msaConsensusSequence()+
method (see below).

\subsection{Consensus Sequences and Conservation Scores}

With version 1.7.1 of \MSA, new methods have been provided that allow for
the computation of consensus sequences and conservation scores.
By default, the \verb+msaConsensusSequence()+ method is a wrapper around the
\verb+consensusString()+ method from the \verb+Biostrings+:
<<consensusExample3>>=
printSplitString <- function(x, width=getOption("width") - 1)
{
    starts <- seq(from=1, to=nchar(x), by=width)

    for (i in 1:length(starts))
        cat(substr(x, starts[i], starts[i] + width - 1), "\n")
}

printSplitString(msaConsensusSequence(myFirstAlignment))
@
However, there is also a second method for computing consensus sequence that
has been implemented in line with a consensus sequence method implemented
in \TeXshade\ that allows for specify an upper and a lower conservation threshold
(see example below). This method can be accessed via the argument
\verb+type="upperlower"+. Additional customizations are available, too:
<<consensusExample4>>=
printSplitString(msaConsensusSequence(myFirstAlignment, type="upperlower",
                                      thresh=c(40, 20)))
@

Regardless of which method is used, masks are taken into account: masked
rows/sequences are neglected and masked columns are shown as ``\verb+#+'' in
the consensus sequence:
<<consensusExample5>>=
printSplitString(msaConsensusSequence(myMaskedAlignment, type="upperlower",
                                      thresh=c(40, 20)))
@

The main purpose of consensus sequences is to get an impression of conservation
at individual positions/columns of a multiple alignment. The \MSA\ package also
provides another means of analyzing conservation: the method
\verb+msaConservationScore()+ computes sums of pairwise scores for a given
substitution/scoring matrix. Thereby, conservation can also be analyzed in a
more sensible way than by only taking relative frequencies of letters into
account as \verb+msaConsensusSequence()+ does.
<<conservationExample1>>=
data(BLOSUM62)
msaConservationScore(myFirstAlignment, BLOSUM62)
@
As the above example shows, a substitution matrix must be provided. The result
is obviously a vector as long as the alignment has columns. The entries of the
vector are labeled by the consensus sequence. The way the consensus sequence is
computed can be customized:
<<conservationExample2>>=
msaConservationScore(myFirstAlignment, BLOSUM62, gapVsGap=0,
                     type="upperlower", thresh=c(40, 20))
@
The additional argument \verb+gapVsGap+ allows for controlling how pairs of
gap are taken into account when computing pairwise scores (see
\verb+?msaConservationScore+ for more details).

Conservation scores can also be computed from masked alignments. For masked
columns, \verb+NA+'s are returned:
<<conservationExample3>>=
msaConservationScore(myMaskedAlignment, BLOSUM62, gapVsGap=0,
                     type="upperlower", thresh=c(40, 20))
@

\subsection{Interfacing to Other Packages}

There are also other sequence analysis packages that use or make use of multiple
sequence alignments. The \msa\ package does not directly interface to these packages
in order to avoid dependencies and possible incompatibilities. However, \msa\ provides
a function \verb+msaConvert()+ that allows for converting multiple sequence alignment
objects to other types/classes. Currently, five such conversions are available, namely to
the classes \verb+alignment+ (\verb+seqinr+ package \cite{CharifLobry2007}),
\verb+align+ (\verb+bios2mds+ package \cite{PeleBecuAbdiChabbert2012}),
\verb+AAbin+/\verb+DNAbin+ (\verb+ape+ package \cite{ParadisClaudeStrimmer2004}),
and \verb+phyDat+ (\verb+phangorn+ package \cite{Schliep2011}). Except for the
conversion to the class \verb+phyDat+, these conversion are performed without loading
or depending on the respective packages.

In the following example, we perform a multiple alignment of Hemoglobin alpha
example sequences and convert the result for later processing with the \verb+seqinr+
package:
<<Hemoglobin1>>=
hemoSeq <- readAAStringSet(system.file("examples/HemoglobinAA.fasta",
                                       package="msa"))
hemoAln <- msa(hemoSeq)
hemoAln
hemoAln2 <- msaConvert(hemoAln, type="seqinr::alignment")
@
Now we compute a distance matrix using the \verb+dist.alignment()+ function from
the \verb+seqinr+ package:
<<Hemoglobin2>>=
library(seqinr)

d <- dist.alignment(hemoAln2, "identity")
as.matrix(d)[2:5, "HBA1_Homo_sapiens", drop=FALSE]
@
Now we can construct a phylogenetic tree with the neighbor joining algorithm using the
\verb+nj()+ function from the \verb+ape+ package:
<<HemoglobinTree,output.width='0.8\\textwidth',output.height='0.5\\textwidth',message=FALSE,results='hide'>>=
library(ape)

hemoTree <- nj(d)
plot(hemoTree, main="Phylogenetic Tree of Hemoglobin Alpha Sequences")
@

The following example shows how to convert a multiple alignment object in an object of
class \verb+align+ as defined by the \verb+bios2mds+ package:
<<Hemoglobin3>>=
hemoAln3 <- msaConvert(hemoAln, type="bios2mds::align")
str(hemoAln3)
@

The conversions to the standard \verb+Biostrings+ classes are straightforward using
standard \verb+as()+ methods and not provided by the \verb+msaConvert()+ function.
The following example converts a multiple alignment object to class \verb+BStringSet+
(e.g.\ the \verb+msaplot()+ function from the \verb+ggtree+ package
\cite{YuSmithZhuGuanLam2016} accepts \verb+BStringSet+ objects):
<<Hemoglobin4>>=
hemoAln4 <- as(hemoAln, "BStringSet")
hemoAln4
@

\notebox{The \texttt{msaConvert()} function has been introduced in version 1.3.3 of the
  \MSA\ package. So, to have this function available, at least Bioconductor 3.3
  is required, which requires at least R 3.3.0.}

\section{Pretty-Printing Multiple Sequence Alignments}\label{sec:msaPrettyPrint}

As already mentioned above, the \MSA\ package offers the function
\verb+msaPrettyPrint()+ which allows for pretty-printing multiple
sequence alignments using the \LaTeX\ package \shade\ \cite{Beitz2000}.
Which prerequisites are necessary to take full advantage of the
\verb+msaPrettyPrint()+ function is described in Section~\ref{sec:install}.

The \verb+msaPrettyPrint()+ function writes a multiple sequence alignment
to an alignment (\verb+.aln+) file and then creates \LaTeX\ code for
pretty-printing the multiple sequence alignment on the basis of the
\LaTeX\ package \shade. Depending on the choice of the \verb+output+
argument, the function \verb+msaPrettyPrint()+ either prints a \LaTeX\
fragment to the \R\ session (choice \verb+output="asis"+) or writes a
\LaTeX\ source file (choice \verb+output="tex"+) that it processes to
a DVI file (choice \verb+output="dvi"+) or PDF file
(choice \verb+output="pdf"+). Note that no extra software is needed for
choices \verb+output="asis"+ and \verb+output="tex"+. For
\verb+output="dvi"+ and \verb+output="pdf"+, however, a \TeX/\LaTeX\
distribution must be installed in order to translate the \LaTeX\ source
file into the desired target format (DVI or PDF).

The function \verb+msaPrettyPrint()+ allows for making the most common
settings directly and conveniently via an \R\ interface without the need
to know the details of \LaTeX\ or \shade. In the following, we will describe
some of these customizations. For all possibilities, the user is referred
to the documentation of \shade.%
\footnote{\url{https://www.ctan.org/pkg/texshade}}

\subsection{Consensus Sequence and Sequence Logo}

The consensus sequence of the alignment is one of the most important results
of a multiple sequence alignment. \verb+msaPrettyPrint()+ has a
standard possibility to show
this consensus sequence with the parameter \verb+showConsensus+.
The default value is \verb+"bottom"+, which results in the following:
<<ShowConsensusBottom,results="asis">>=
msaPrettyPrint(myFirstAlignment, output="asis", y=c(164, 213),
               subset=c(1:6), showNames="none", showLogo="none",
               consensusColor="ColdHot", showLegend=FALSE,
               askForOverwrite=FALSE)
@
Consensus sequences can also be displayed on top of a multiple sequence
alignment or omitted completely.

In the above example, an exclamation mark `\verb+!+' in the consensus
sequence stands for a conserved letter, i.e.\ a sequence positions in which
all sequences agree, whereas an asterisk `\verb+*+' stands for positions in
which there is a majority of sequences agreeing. Positions in which the
sequences disagree are left blank in the consensus sequence. For a more
advanced example how to customize the consensus sequence, see the example in
Subsection~\ref{ssec:custom} below.

The color scheme of the consensus sequence can be configured with the
\verb+consensusColors+ parameter. Possible values are
\verb+"ColdHot"+, \verb+"HotCold"+, \verb+"BlueRed"+, \verb+"RedBlue"+,
\verb+"GreenRed"+, \verb+"RedGreen"+, or \verb+"Gray"+. The above example
uses the color scheme \verb+"RedGreen"+.

Additionally, \texttt{msaPrettyPrint()} also offers a more sophisticated
visual representation of the consensus sequence --- sequence logos.
Sequence logos can be displayed either on top of the multiple sequence
alignment (\verb+showLogo="top"+), below the multiple sequence
alignment (\verb+showLogo="bottom"+), or omitted at all
(\verb+showLogo="none"+):
<<ShowLogoDefault,results="asis">>=
msaPrettyPrint(myFirstAlignment, output="asis", y=c(164, 213),
               subset=c(1:6), showNames="none", showLogo="top",
               logoColors="rasmol", shadingMode="similar",
               showLegend=FALSE, askForOverwrite=FALSE)
@
The color scheme of the sequence logo can be configured with the
\verb+logoColors+ parameter. Possible values are \verb+"chemical"+,
\verb+"rasmol"+, \verb+"hydropathy"+, \verb+"structure"+,
\verb+"standard area"+, and \verb+"accessible area"+. The above example
uses the color scheme \verb+"rasmol"+.

Note that a consensus sequence and a sequence logo can be displayed
together, but only on opposite sides.

Finally, a caveat: for computing consensus sequences,
\verb+msaPrettyPrint()+ uses the functionality provided by \shade, therefore,
the results need not match to the results of the methods described in
Section~\ref{sec:msaProc} above.

\subsection{Color Shading Modes}

\shade\ offers different shading schemes for displaying the multiple sequence
alignment itself. The following schemes are available:
\verb+"similar"+, \verb+"identical"+, and \verb+"functional"+.
Moreover, there are five different color schemes available for shading:
\verb+"blues"+, \verb+"reds"+, \verb+"greens"+, \verb+"grays"+, or
\verb+"black"+. The following example uses the shading mode
\verb+"similar"+ along with the color scheme \verb+"blues"+:
<<Shading1,results='asis'>>=
msaPrettyPrint(myFirstAlignment, output="asis", y=c(164, 213),
               showNames="none", shadingMode="similar",
               shadingColors="blues", showLogo="none",
               showLegend=FALSE, askForOverwrite=FALSE)
@

If the shading modes \verb+"similar"+ or \verb+"identical"+ are used,
the \verb+shadingModeArg+ argument allows for setting a similarity
threshold (a numerical value between 0 and 100).
For shading mode \verb+"functional"+, the following settings of the
\verb+shadingModeArg+ argument are possible:
\verb+"charge"+, \verb+"hydropathy"+, \verb+"structure"+, \verb+"hemical"+,
\verb+"rasmol"+, \verb+"standard area"+,  and
\verb+"accessible area"+. The following example uses shading mode
\verb+"functional"+ along with \verb+shadingModeArg+ set to
\verb+"structure"+:
<<Shading2,results='asis'>>=
msaPrettyPrint(myFirstAlignment, output="asis", y=c(164, 213),
               showNames="none", shadingMode="functional",
               shadingModeArg="structure",
               askForOverwrite=FALSE)
@

In the above example, a legend is shown that specifies the meaning of the
color codes with which the letters are shaded. In some of the other examples
above, we have suppressed this legend with the option \verb+showLegend=FALSE+.
The default, however, is that a legend is printed underneath the multiple
sequence alignment like in the previous example.

\subsection{Subsetting}

In case that not the complete multiple sequence alignment should be printed,
\verb+msaPrettyPrint()+ offers two ways of sub-setting.
On the one hand, the \verb+subset+ argument allows for selecting only
a subset of sequences. Not surprisingly, \verb+subset+ must be a numeric
vector with indices of sequences to be selected.
On the other hand, it is also possible to slice out certain positions
of the multiple sequence alignment using the \verb+y+ argument.
In the simplest case, \verb+y+ can be a numeric vector with two elements
in ascending order which correspond to the left and right bounds between
which the multiple sequence alignment should be displayed. However,
it is also possible to slice out multiple windows. For this purpose,
the argument \verb+y+ must be an \verb+IRanges+ object containing the starts
and ends of the windows to be selected.

\subsection{Additional Customizations}\label{ssec:custom}

The \verb+msaPrettyPrint()+ function provides an interface to the most common
functionality of \shade\ in a way that the user does not need to know the
specific commands of \shade. \shade, however, provides a host of additional
customizations many of which are not covered by the interface of the
\verb+msaPrettyPrint()+ function. In order to allow users to make use of
all functionality of \shade,  \verb+msaPrettyPrint()+ offers the
\verb+furtherCode+ argument through which users can add \LaTeX\ code to the
\verb+texshade+ environment that is created by \verb+msaPrettyPrint()+.
Moreover, the \verb+code+ argument can be used to bypass all of
\verb+msaPrettyPrint()+'s generation of \shade\ code.

Here is an example how to use the \verb+furtherCode+ argument in order to
customize the consensus sequence and to show a ruler on top:
<<ShowConsensusBottom2,results="asis">>=
msaPrettyPrint(myFirstAlignment, output="asis", y=c(164, 213),
               subset=c(1:6), showNames="none", showLogo="none",
               consensusColor="ColdHot", showLegend=FALSE,
               shadingMode="similar", askForOverwrite=FALSE,
               furtherCode=c("\\defconsensus{.}{lower}{upper}",
                             "\\showruler{1}{top}"))
@

\subsection{Sweave or knitr Integration}

The function \verb+msaPrettyPrint()+ is particularly well-suited for
pretty-printing multiple alignments in Sweave \cite{Leisch2002} or knitr
\cite{Xie2014} documents. The key is to set \verb+output+ to \verb+"asis"+
when calling \verb+msaPrettyPrint()+ and, at the same time, to
let the \R\ code chunk produce output that is directly included in the
resulting \LaTeX\ document as it is. This can be accomplished with the code
chunk option \verb+results="tex"+ in Sweave and with the code chunk option
\verb+results="asis"+ in knitr. Here is an example of a Sweave code chunk that
displays a pretty-printed multiple sequence alignment inline:
\input{SweaveExample.txt}
The same example in knitr:
\input{knitrExample.txt}

Note that, for processing the resulting \LaTeX\ source document, the
\shade\ package must be installed (see Section~\ref{sec:install}) and
the \shade\ package must be loaded in the preamble:
\begin{quote}
\begin{verbatim}
\usepackage{texshade}
\end{verbatim}
\end{quote}

\subsection{Sequence Names}

The \verb+Biostrings+ package does not impose any restrictions on the names of
sequences. Consequently, \MSA\ also allows all possible ASCII strings as
sequence (row) names in multiple alignments. As soon as \verb+msaPrettyPrint()+
is used for pretty-printing multiple sequence alignments, however, the sequence
names are interpreted as plain \LaTeX\ source code. Consequently, \LaTeX\ errors
may arise because of characters or words in the sequence names that \LaTeX\
does not or cannot interpret as plain text correctly. This particularly includes
appearances of special characters and backslash characters in the sequence names.

The \MSA\ package offers a function \verb+msaCheckNames()+ which allows for finding
and replacing potentially problematic characters in the sequence names of
multiple alignment objects (see \verb+?msaCheckNames+). However, the best solution
is to check sequence names carefully and to avoid problematic sequence names from the beginning.
Note, moreover, that too long sequence names will lead to less appealing outputs,
so users are generally advised to consider sequence names carefully.

\subsection{Pretty-Printing Wide Alignments}

If the alignment to be printed with \verb+msaPrettyPrint()+ is wide
(thousands of columns or wider), \LaTeX\ may terminate prematurely because of
exceeded \TeX\ capacity. Unfortunately, this problem remains opaque to the
user, since \verb+texi2dvi()+ and \verb+texi2pdf()+ do not convey much details
about \LaTeX\ problems when typesetting a document. We recommend the following
if a user encounters problems with running \verb+msaPrettyPrint()+'s output
with \verb+texi2dvi()+ and \verb+texi2pdf()+:
\begin{enumerate}
\item Run \verb+pdflatex+ on the generated \verb+.tex+ file to see
  whether it is actually a problem with \TeX\ capacity.
\item If so, split the alignment into multiple chunks and run
  \verb+msaPrettyPrint()+ on each chunk separately.
\end{enumerate}

The following example
demonstrates this approach for a multiple aligment object `\verb+aln+':
<<SplitAlignmentIntoJunks,eval=FALSE>>=
chunkSize <- 300 ## how much fits on one page depends on the length of
                 ## names and the number of sequences;
                 ## change to what suits your needs

for (start in seq(1, ncol(aln), by=chunkSize))
{
    end <- min(start + chunkSize - 1, ncol(aln))
    alnPart <- DNAMultipleAlignment(subseq(unmasked(aln), start, end))

    msaPrettyPrint(x=alnPart, output="pdf", subset=NULL,
                   file=paste0("aln_", start, "-", end, ".pdf"))
}
@
\noindent This creates multiple PDF files all of which show one part of the alignment.
Please note, however, that the numbering of columns is restarted for each chunk.

\subsection{Further Caveats}

\begin{itemize}
  \item Note that \verb+texi2dvi()+ and \verb+texi2pdf()+ always
    save the resulting DVI/PDF files to the current working directory,
    even if the \LaTeX\ source file is in a different directory.
    That is also the reason why the temporary file is created in the
    current working directory in the example below.
  \item \shade\ has a wide array of functionalities. Only the most common ones
    have been tested for interoperability with \R. So the use of the arguments
    \verb+furtherCode+ and \verb+code+ is the user's own risk!
\end{itemize}

\section{Known Issues}\label{sec:issues}

\subsubsection*{Memory Leaks}

The original implementations of ClustalW, ClustalOmega, and MUSCLE
are stand-alone command line programs which are only run once each time
a multiple sequence alignment is performed. During the development of the
\MSA\ package, we performed memory management checks using
Valgrind~\cite{Nethercote2007} and discovered multiple memory leaks in
ClustalW and MUSCLE.
These memory leaks have no effect for the command line tools,
since the program is closed each time the alignment is finished.
In the implementation of the \MSA\ package, however, these memory leaks
may have an effect if the same algorithm is run multiple times.

For MUSCLE, we managed to eliminate all memory leaks by deactivating the
two parameters \verb+weight1+ and \verb+weight2+. ClustalOmega did
not show any memory leaks. ClustalW indeed has several memory leaks
which are benign if the algorithm is run only a few times, but which may
have more severe effects if the algorithm is run many times. ClustalOmega also
has a minor memory leak, but the loss of data is so small that no major
problems are to be expected except for thousands of executions of
ClustalOmega.

\subsubsection*{ClustalOmega vs.\ Older GCC Versions on Linux/Unix}

We have encountered peculiar behavior of ClustalOmega if the package was
built using an older GCC version: if we built the package on an \verb+x86_64+
Linux system with GCC 4.4.7, ClustalOmega built smoothly and could be executed
without any errors. However, the resulting multiple sequence alignment was
more than sub-optimal. We could neither determine the source of this problem
nor which GCC versions show this behavior. We therefore recommend Linux/Unix
users to use an up-to-date GCC version (we used 4.8.2 during package
development, which worked nicely) or, in case they encounter dubious results,
to update to a newer GCC version and re-install the package.

\subsubsection*{ClustalOmega: OpenMP Support on Mac OS}

ClustalOmega is implemented to make use of OpenMP (if available on the
target platform). Due to issues on one of the Bioconductor build servers
running Mac OS, we had to deactivate OpenMP generally for Mac OS platforms.
If a Mac OS user wants to re-activate OpenMP, he/she should download the
source package tarball, untar it, comment/uncomment the corresponding line in
\verb+msa/src/ClustalOmega/msaMakefile+ (see first six lines), and
build/install the package from source.

\subsubsection*{Build/installation issues}

Some users have reported compiler and linker errors when building \MSA\ from
source on Linux systems. In almost all cases, these could have been tracked down
to issues with the \R\ setup on those systems (e.g.\ a \verb+Rprofile.site+ file
that makes changes to the \R\ environment that are not compatible with \MSA's
Makefiles).\footnote{See, e.g., \url{https://support.bioconductor.org/p/90735/}}
In most cases, these issues can be avoided by installing \MSA\ in a ``vanilla \R\ session'',
i.e.\ starting \R\ with option \verb+--vanilla+ when installing \MSA.

\section{Future Extensions}\label{sec:future}

We envision the following changes/extensions in future versions of the package:
\begin{itemize}
  \item Integration of more multiple sequence alignment algorithms, such as,
    T-Coffee \cite{Notredame2000} or DIALIGN \cite{Morgenstern1999}
  \item Support for retrieving guide trees from the multiple sequence
    alignment algorithms
  \item Interface to methods computing phylogenetic trees (e.g.\ as
    contained in the original implementation of ClustalW)
  \item Elimination of memory leaks described in Section~\ref{sec:issues} and
    re-activation of parameters that have been deactivated in order to
    avoid memory leaks
  \item More tolerant handling of custom substitution matrices
    (MUSCLE interface)
\end{itemize}

\section{How to Cite This Package}

If you use this package for research that is published later, you are kindly
asked to cite it as follows:
\begin{quotation}
\noindent U.~Bodenhofer, E.~Bonatesta, C.~Horej\v{s}-Kainrath, and S.~Hochreiter (2015).
msa: an R package for multiple sequence alignment. {\em Bioinformatics} {\bf 31}(24):3997--3999.
DOI: \href{http://dx.doi.org/10.1093/bioinformatics/btv494}{bioinformatics/btv494}.
\end{quotation}
To obtain a Bib\TeX\ entries of the reference, enter the
following into your R session:
<<GetBibTeX,eval=FALSE>>=
toBibtex(citation("msa"))
@

Moreover, we insist that, any time you cite the package, you also cite
the original paper in which the original algorithm has been introduced (see
bibliography below).

\bibliographystyle{plain}
\bibliography{lit}

\end{document}