%\VignetteIndexEntry{Triplex User Guide} %\VignetteKeywords{Triplex, DNA, Sequence, Biostrings} %\VignettePackage{triplex} \documentclass[a4paper]{article} \usepackage[utf8]{inputenc} \title{Triplex: User Guide} \author{Matej Lexa, Tom\'{a}\v{s} Mart\'{i}nek, Ji\v{r}\'{i} Hon} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} The R triplex package is essentially an R interface to the underlying C implementation of a dynamic-programming search strategy of the same name published in~\cite{Lexa2011}. The main functionality of the original program is to detect positions of subsequences in a much larger sequence, where these subsequences are likely to fold into an intramolecular triplex (H-DNA). The evaluation is based on the number of canonical nucleotide triplets that can form from nucleotides in such subsequence . In creating this incarnation in R, we extended the basic functionality, to also include the calculation of likely base-pairing in the triple helices. This allowed us to extend the functionality of the package towards visualization showing the exact base-pairing in 2D or 3D as published earlier~\cite{Rajdl12}. The rest of this vignette is organized as follows: The basic usage of the package for triplex detection is described in section~\ref{sec:detection}. Two methods for visualization of detected triplexes in 2D and 3D are shown in section~\ref{sec:visualization}. Section~\ref{sec:export} describes techniques for the conversion of search results into other types of objects such as {\it GRanges} or {\it DNAStringSet}, that can be further exported into GFF3 or FASTA files. The vignette is concluded with section~\ref{sec:real} showing triplex package usage on a real genomic sequence from {\it BSGenome}. \newpage \section{Detection} \label{sec:detection} As usual, before first package use, it is necessary to load the triplex library using the following command: <<>>= library(triplex) @ Identification of potential intramolecular triplex-forming sequences in DNA is performed using the {\it triplex.search} function. This function has one required parameter representing the studied DNA sequence in the form of a {\it DNAString} object and several modifying options with predefined values (see {\it triplex.search} help page). Based on triplex position (forward or reverse strand) and its third strand orientation, up to 8 types of triplexes are distinguished by the {\it triplex.search} function. All these triplex types are depicted on figure~\ref{fig:types}. By default, the function detects all 8 types, however this behaviour can be changed by setting {\it type} parameter to arbitrary value or a subset of values in the range 0 to 7. \begin{figure}[h] \centering \includegraphics[width=.8\linewidth]{fig/types.pdf} \caption{Triplex types} \label{fig:types} \end{figure} \subsection*{Example 1: Basic triplex detection} As a simple example, let's find all types of intramolecular triplexes in the DNA sequence {\it TTGGGGAAAGCAATGCCAGGCAGGGGGTTCCTTTCGTTACGGTCCGTCCC}: <<>>= seq <- DNAString("TTGGGGAAAGCAATGCCAGGCAGGGGGTTCCTTTCGTTACGGTCCGTCCC") triplex.search(seq) @ Detected triplexes are returned in the form of a {\it TriplexViews} class, which represents the basic container for storing a set of views on the same input sequence similarly to {\it XStringView} object (in fact {\it TriplexViews} only extends the {\it XStringView} class with the number of displayed columns). Each triplex view is defined by start locations, width, score, P-value, number of insertions, type, strand, loop start position and loop end position. Please note, that the strand orientation depends on triplex type only. The {\it triplex.search} function assumes that input DNA sequence represents the forward strand. \subsection*{Example 2: Selection of a specific triplex type} Let's reduce the searching procedure to triplex type 1 only, using the following command. Please note, that the output list contains potential triplexes of type 1 only. <<>>= triplex.search(seq, type=1) @ The basic requirements for shape or length of detected triplexes can be defined using four parameters: {\it min\_len}, {\it max\_len}, {\it min\_loop} and {\it max\_loop}. While {\it min\_len} and {\it max\_len} specify the length of triplex stems composed of individual triplets, {\it min\_loop} and {\it max\_loop} parameters define the range of lengths for unpaired loops at the top of detected triplexes. A graphical representation of these parameters is shown in figure~\ref{fig:triplex}. Please note that these parameters also impacts the overall computation time. For longer triplexes, larger space has to be explored and thus more computation time is consumed. \begin{figure}[h] \centering \includegraphics[width=.6\linewidth]{fig/triplex.pdf} \caption{Triplex scheme} \label{fig:triplex} \end{figure} \subsection*{Example 3: Definition of triplex shape} Let's modify the previous example by specifying minimal and maximal triplex lengths. Please, execute the following command and note that only one of the two triplexes detected before satisfies these conditions. <<>>= triplex.search(seq, min_len=10, max_len=20) @ The quality of each triplex is defined by its score value. A higher score value represents a higher-quality triplex. This quality is decreased by several types of imperfections at the level of triplets, such as character (nucleotide) mismatch, insertion, deletion, isomorphic group change etc. Penalization constants for these imperfections can be set up using the following parameters: {\it mis\_pen}, {\it ins\_pen}, {\it iso\_pen}, {\it iso\_bonus} and {\it dtwist\_pen}. Detailed information about the scoring function and penalization parameters can be found in~\cite{Lexa2011}. It is highly recommended to see~\cite{Lexa2011} prior to changing any penalization parameter. \subsection*{Example 4: Scoring function modification} Let's modify the previous example by reducing the penalization for isomorphic group change from value 5 to 2. Please execute the following command and note that calculated score values have changed. <<>>= triplex.search(seq, iso_pen=2) @ The {\it triplex.search} function can result in a large list containing tens of thousands of potential triplexes. The size of these results can be reduced using two filtration mechanisms: (1) by specifying the minimal acceptable score value using {\it min\_score} parameter or (2) by specifying maximum acceptable P-value of results using {\it p\_value} parameter. The P-value represents the probability of occurrence of detected triplexes in a random sequence. % By default, only triplexes with P-value equal or less than 0.05 are reported. Calculation of P-value depends on two extreme value distribution parameters {\it lambda} and {\it mi}. It is highly recommended to see~\cite{Lexa2011} prior changing the {\it lambda} or {\it mi} parameters. \subsection*{Example 5: Filtration of results} Let's modify the previous example to show only triplexes with score values 14 or higher. Please execute the following command and note that only one of the two previously detected triplexes satisfies this condition. <<>>= triplex.search(seq, min_score=14) @ \newpage \section{Visualization} \label{sec:visualization} Besides triplex detection, the {\it triplex} package offers also visualization of detected results. Three major methods of visualization are supported: \begin{enumerate} \item {\it Triplex alignment (text)}: Selected triplex is shown in basic text format representing the alignment of all of its strands. The output consists of four sequences: {\it plus} and {\it minus} sequences representing 5' to 3' and 3' to 5' DNA strands of the detected triplex; {\it anti/para-plus/minus} sequence representing the third triplex strand aligned to {\it plus} or {\it minus} strand in {\it antiparallel} or {\it parallel} fashion; and finally {\it loop} sequence representing the unpaired loop. Please, note that all eight triplex types shown in figure~\ref{fig:types} can be represented using four types of alignments, because each alignment can correspond to triplex detected either on forward or reverse DNA strand. \item {\it 2D diagram (graphical)}: Selected triplex is shown in a 2D diagram displaying the individual triplets (based on Watson-Crick and Hoogsteen base paring) and the loop composed of unpaired nucleotides. \item {\it 3D model (graphical)}: Selected triplex is shown in 3D. At first, a model is calculated and then the result is displayed using the {\it RGL} package, which allows you to manipulate the triplex 3D model (zoom in, zoom out, rotation, etc.). \end{enumerate} \subsection*{Example 6: Triplex visualization} Let's suppose, we would like to display an alignment (in text format), a 2D diagram and a 3D model of the best detected triplex from the previous examples. At first, it is suitable to store the results of calling the {\it triplex.search} function into an auxiliary variable. <<>>= t <- triplex.search(seq) t @ Then, call {\it triplex.alignment} function on the first item of the list. Please note that similarly to other DNA multiple sequence alignments the output of the {\it triplex.alignment} method is stored as {\it DNAStringSet} object. Also note that the {\it loop} sequence is always the last one and unaligned to the previous three sequences. <<>>= triplex.alignment(t[1]) @ Then, call {\it triplex.diagram} function on the same item of the list. Please note that at first the triplex alignment is calculated and printed into R console and then the graphical output is displayed in a separate window. R provides methods to redirect the output to other suitable devices, such as files (see {\it png()}, for example). \setkeys{Gin}{width=.6\linewidth} \begin{figure}[h] \centering <>= triplex.diagram(t[1]) @ \caption{2D diagram of a detected triplex} \label{fig:triplexvis2d} \end{figure} \newpage Finally, let's display the 3D structure of the same triplex using {\it triplex.3D} function. Please note that the result will be displayed in separate graphical window using the {\it RGL} library. The 3D model is based on optimizing angles and distances present in the molecule to be as close as possible to tabulated values (see \cite{Rajdl12} for more information). <>= triplex.3D(t[1]) @ <>= triplex.alignment(t[1]) @ \begin{figure}[h!] \centering \includegraphics[width=.6\linewidth]{fig/triplex3d} \caption{3D scheme of detected triplex} \label{fig:triplexvis3d} \end{figure} \newpage \section{Exporting Results} \label{sec:export} As mentioned above, the results of detection are stored in the {\it TriplexView} object. Because the {\it TriplexView} class is only an extension of the {\it XStringViews} class, all operations applied to the {\it XStringViews} object can also be applied to the {\it TriplexView} object as well. Additionaly, {\it TriplexView} class contains a conversion function to create {\it GRanges} objects. Thus, all detected triplexes can be transformed into elements of a {\it GRanges} object and saved as a GFF3 file, for example. \subsection*{Example 7: GRanges conversion} In this example, the output of the {\it triplex.search} function will be stored in a {\it GRanges} object and further exported as a GFF3 file. At first, let's do the conversion using the following command: <<>>= gr <- as(t, "GRanges") gr @ Please note that the chromosome name is set to {\it chr1} by default, but it can be changed to any other value. Items such as score, triplex type, P-value, loop start position, loop end position and number of indels can be added as optional attributes. In the next step the resulting {\it GRanges} object is exported as a GFF3 file. <<>>= library(rtracklayer) export(gr,"test.gff", version="3") @ Please note, that it is necessary to load the {\it rtracklayer} library before running the {\it export} command. The contents of the resulting GFF3 file are: <>= text <- readLines("test.gff",n=10) cat(strwrap(text, width=80, exdent=3),sep="\n") @ Another possibility of utilizing the results of detection is to transform the {\it TriplexView} object into a {\it DNAStringSet} object, which represents another commonly used class of the {\it Biostrings} package. Triplexes stored inside {\it DNAStringSet} can be exported into a FASTA file, for example. \subsection*{Example 8: DNAStringSet conversion} In this example, the output of the {\it triplex.search} function will be stored into a {\it DNAStringSet} object and further exported as a FASTA file. At first, let's do the conversion using the following command: <<>>= dss <- as(t, "DNAStringSet") dss @ \noindent In the next step, the {\it DNAStringSet} object is exported as a FASTA file. <<>>= writeXStringSet(dss, file="test.fa", format="fasta") @ \noindent The contents of the resulting FASTA file are: <>= text <- readLines("test.fa",n=10) cat(text,sep="\n") @ Please, note that all attributes of detection such as start position, end position, score value, P-value, number of indels, triplex type and strand are stored as a {\it name} parameter (inside the {\it DNAStringSet}), and thus, they are also shown in the description line of the FASTA format (the line with the initial '>' symbol). \newpage \section{A real world example} \label{sec:real} In the following example, we load a real genomic sequence from one of the BSGenome packages, identify potential triplexes with length over 8 triplets of nucleotides and less than 15\% mismatches (score >17 with the currently used scoring matrices), create three different visualizations of the best triplexes. We export the identified positions into a genome annotation track (via a GFF3 file) and FASTA file. Finally, we plot some statistics about the potential triplex distribution and nucleotide composition. \begin{enumerate} \item Load necessary libraries and genomes. <<>>= library(triplex) library(BSgenome.Celegans.UCSC.ce10) @ \item Search for potential triplex positions and display the results. Please note that the sequence is limited to the first 100k bases for time reasons. <<>>= t <- triplex.search(Celegans[["chrX"]][1:100000],min_score=17,min_len=8) t @ \item Sort the results by score and display 2D a 3D diagram of the best-scored triplex. <<>>= ts <- t[order(score(t),decreasing=TRUE)] ts @ \newpage \setkeys{Gin}{width=.7\linewidth} \begin{figure}[h] \centering <>= triplex.diagram(ts[1]) @ \caption{2D diagram of detected triplex} \label{fig:triplex2d} \end{figure} \newpage <>= triplex.3D(ts[1]) @ <>= triplex.alignment(ts[1]) @ \begin{figure}[h!] \centering \includegraphics[width=.6\linewidth]{fig/triplex3d2} \caption{3D scheme of detected triplex} \label{fig:triplex3d} \end{figure} \newpage \item Export all triplexes into a GFF3 format file. <<>>= library(rtracklayer) export(as(t, "GRanges"),"test.gff", version="3") @ The contents of the GFF3 file are as follows (the first 10 records only): <>= text <- readLines("test.gff",n=10) cat(strwrap(text, width=80, exdent=3),sep="\n") @ \item Export all triplexes into a FASTA format file. <<>>= writeXStringSet(as(t, "DNAStringSet"), file="test.fa", format="fasta") @ The contents of the FASTA file are as follows (the first 10 records only): <>= text <- readLines("test.fa",n=20) cat(text,sep="\n") @ \item Show histogram for score distribution of detected triplexes. <>= hist(score(t), breaks=20) @ \item Show triplex distribution along the chromosome or other analysed sequence. <>= plot(coverage(ts[0:length(t)]), type="s", col="grey75") @ \end{enumerate} \section{P-value calculation} \label{sec:pvalue} While calculating the scores of individual triplexes is straightforward with given scoring matrices and penalty scores, calculating reasonable P-values of these scores is a challenging task. The P-values describe the probability of obtaining the reported scores by chance. To estimate it, we use a randomized genomic sequence. Because of the strikingly different nucleotide and H-DNA content of prokaryotic and eukaryotic sequences, we use E.coli genome and a segment of human chromosome 5 as models. The calculation of P-value follows the approach of \cite{Eddy1997}. We used the ExtremeValueFitHistogram() function from hmmer-2.4 to fit the values of $\lambda$ and $\mu$ in the equation: \begin{equation} \textit{P-value}(x) = 1 - e^{-n P(S \geq x)} \end{equation} where \begin{equation} P(S \geq x) = 1-e^{-e^{-\lambda(x - \mu)}} \end{equation} The problematic part here is the determination of $n$. Because we search a single long sequence, but usually report multiple hits, the value of $n$ can only be estimated. It must take into account the internal filtering of hits by \emph{triplex} and the filtering property of the DP algorithm itself. We counted all the hits returned when fitting the EVD to a genome of size 4.8Mbp, to find an apparent value of n: \begin{equation} n(4.8Mbp) = 170000 \end{equation} This leads to a reported hit every 30bp, or a ratio of $n$ to sequence length: \begin{equation} rn = \frac{170000}{4800000} = 0.035 \end{equation} \begin{thebibliography}{9} \bibitem{Lexa2011} Lexa, M., Mart\'{i}nek, T., Burgetov\'{a}, I., Kope\u{c}ek, D., Br\'{a}zdov\'{a}, M.: \emph{A dynamic programming algorithm for identification of triplex-forming sequences}, In: Bioinformatics, Vol. 27, No. 18, 2011, Oxford, GB, p. 2510-2517, ISSN 1367-4803. \bibitem{Rajdl12} Rajdl, K. \emph{Funkce pro manipulaci a vizualizaci molekul\'{a}rn\'{i}ch dat v prost\v{r}ed\'{i} R} [online]. \bibitem{Eddy1997} Eddy, S. R. \emph{Maximum likelihood fitting of extreme value distributions.} 1997. Unpublished technical notes. Available at http://www.genetics.wustl.edu/eddy/publications/ \end{thebibliography} \end{document}