%\VignetteIndexEntry{The Double Life of RNA: Uncovering Non-Coding RNAs}
%\VignettePackage{DECIPHER}

\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{underscore}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in
\setlength{\parindent}{1cm}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{The Double Life of RNA: Uncovering Non-Coding RNAs}
\author{Erik S. Wright}
\date{\today}
\maketitle

\newenvironment{centerfig}
{\begin{figure}[htp]\centering}
{\end{figure}}
\renewcommand{\indent}{\hspace*{\tindent}}
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{keep.source=TRUE}
<<echo=false>>=
options(continue=" ")
options(width=80)
options(SweaveHooks=list(fig=function()	par(mar=c(4.1, 4.1, 0.3, 0.1))))
set.seed(123)
@

\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------

RNA leads a double life as both messenger (mRNA) and an alphabet soup of roles in the cell (e.g., rRNA, tRNA, snRNA, tmRNA, etc.).  RNAs that take on a life of their own are notoriously difficult to detect \emph{ab initio} using bioinformatics.  This is in part because their signature characteristic, that a single strand of RNA folds into a stable structure, can be found in many regions of the genome.  This makes life difficult because there is no clear signature that a genomic region contains a non-coding RNA in the way there is for protein coding regions.  The most promising approach is to search for intergenic regions with high GC-content a folded structure that is conserved across many genomes.  But what do you do if you only have a single genome and want to find non-coding RNAs? That is exactly what this vignette shows you how to do!

Before we can have the time of our lives finding non-coding RNAs, we must first train models on examples of known RNAs.  As a case study, this tutorial focuses on finding all known non-coding RNAs in the genome of \emph{Chlamydia trachomatis}, an intracellular bacterial pathogen known for causing chlamydia.  This genome was chosen because it is relatively small (only ~1 Mbp) so the examples run quickly.  \emph{Chlamydia} harbors all of the standard non-coding RNAs that date back near the origin of life (i.e., rRNAs, tRNAs, etc.).  It also has one small RNA (sRNA), named IhtA, that has only been found in Chlamydiae and is believed to play a role in regulating its life stages.  We are going to find IhtA in the genome of \emph{C. trachomatis}, then we are going to use the same approach to find all of the standard non-coding RNAs that are conserved across many forms of life.

%------------------------------------------------------------
\section{Getting Started}
%------------------------------------------------------------

\newlength\tindent
\setlength{\tindent}{\parindent}
\setlength{\parindent}{0pt}

\subsection{Startup}

To get started we need to load the \Rpackage{DECIPHER} package, which automatically loads a few other required packages.

<<startup,results=hide>>=
library(DECIPHER)
@

Searches for non-coding RNAs are performed by the function \Rfunction{FindNonCoding}.  Help can be accessed via:

\begin{Schunk}
\begin{Sinput}
> ? FindNonCoding
\end{Sinput}
\end{Schunk}

Once \Rpackage{DECIPHER} is installed, the code in this tutorial can be obtained via:

\begin{Schunk}
\begin{Sinput}
> browseVignettes("DECIPHER")
\end{Sinput}
\end{Schunk}

%------------------------------------------------------------
\section{Building a Non-Coding RNA Model}
%------------------------------------------------------------

\setlength{\parindent}{1cm}

Before we can find a non-coding RNA, we need to create a multiple sequence alignment of some of its sequence representatives.  This alignment will be the input to \Rfunction{LearnNonCoding}.

\subsection{Importing the sequences}

The first step is to set filepaths to the sequences (in FASTA format).  In this case we are going to use the IhtA sequences included with \Rpackage{DECIPHER}, but you could follow along with your own set of homologous sequences.  Be sure to change the path names to those on your system by replacing all of the text inside quotes labeled ``$<<$path to ...$>>$'' with the actual path on your system.

<<expr1,eval=TRUE>>=
# specify the path to your genome:
fas_path <- "<<path to FASTA file>>"
# OR use the example genome:
fas_path <- system.file("extdata",
	"IhtA.fas",
	package="DECIPHER")

# read the sequences into memory
rna <- readRNAStringSet(fas_path)
rna
@

Ideally we would start with a few thousand diverse sequence representatives, yet only 27 representatives of IhtA are known.  That's life, so we will have to make due with what we have available.

\subsection{Aligning the sequences}

Next we need to align the sequences with \Rfunction{AlignSeqs}.  Note that non-coding RNA alignments are more accurate if we provide the sequences as a \Rclass{RNAStringSet} rather than the equivalent \Rclass{DNAStringSet}, because \Rfunction{AlignSeqs} will use conserved secondary structure to improve the alignment.  Alignment is fast, so hold on for dear life!  (Consider adding \code{processors=NULL} if you want it to go even faster with multiple processors.)

<<expr2,eval=TRUE>>=
RNA <- AlignSeqs(rna)
RNA
@

We can see from the alignment that the sequences have both conserved and variable regions.  But if we really want to bring the sequences to life we need to look at their predicted secondary structures.  The IhtA is believed to form three back-to-back hairpin loops based on its minimum free energy structure.  We can use \Rfunction{PredictDBN} to predict the secondary structures of the sequences.

<<expr3,eval=TRUE>>=
p <- PredictDBN(RNA, type="structures")
BrowseSeqs(RNA, patterns=p)
@

In this color scheme, blue regions pair to green regions and red regions are unpaired.  There is clear evidence for a hairpin loop near the 3'-end if IhtA, and weaker evidence for conserved secondary structure elsewhere in the sequences (Fig. \ref{f1}).

\begin{figure}
\begin{center}
\includegraphics[width=1\textwidth]{BrowseSeqsStructureIhtA}
\caption{\label{f1} Predicted secondary structure of IhtA}
\end{center}
\end{figure}

\clearpage

We can also visualize the secondary structure through a dot plot (Fig. \ref{f2}).  One half of the dot plot shows the probabilities of pairing, and the other half shows the predicted structure.  This view reveals the level of ambiguity in the secondary structure based on the (low) amount of available information.

\begin{centerfig}
<<expr4,eval=TRUE,echo=TRUE,fig=TRUE,include=TRUE,height=6,width=6>>=
evidence <- PredictDBN(RNA, type="evidence", threshold=0, verbose=FALSE)
pairs <- PredictDBN(RNA, type="pairs", verbose=FALSE)
dots <- matrix(0, width(RNA)[1], width(RNA)[1])
dots[evidence[, 1:2]] <- evidence[, 3]
dots[pairs[, 2:1]] <- 1

image(dots, xaxt="n", yaxt="n", col=gray(seq(1, 0, -0.01)))
abline(a=0, b=1)
cons <- toString(ConsensusSequence(RNA, threshold=0.2))
cons <- strsplit(cons, "")[[1]]
at <- seq(0, 1, length.out=length(cons))
axis(1, at, cons, tick=FALSE, cex.axis=0.3, gap.axis=0, line=-1)
axis(2, at, cons, tick=FALSE, cex.axis=0.3, gap.axis=0, line=-1)
@
\caption{\label{f2} Secondary Structure Dot Plot.}
\end{centerfig}

\clearpage

Two considerations are important at this stage: (\emph{i}) that we have a clearly defined boundary representing the true beginning and end of every sequence, and (\emph{ii}) that the sequences are a diverse sample of what we hope to find.  Here we don't have any partial sequences, but if we did they could be identified by counting gaps (``-'') at their ends with \Rfunction{TerminalChar}.  Any partial sequences should be removed before proceeding, for example by using:

<<expr5,eval=TRUE>>=
RNA <- unique(RNA)
t <- TerminalChar(RNA)
w <- which(t[, "leadingChar"] <= median(t[, "leadingChar"]) &
	t[, "trailingChar"] <= median(t[, "trailingChar"]))
RNA <- RemoveGaps(RNA[w], "common")
@

Rather than remove partial sequences, it would have been possible to shorten the alignment to the region shared by all sequences using \Rfunction{subseq}.

\subsection{Learning sequence patterns}

Now we need to build a model capturing the essential characteristics of the non-coding RNA.  The function \Rfunction{LearnNonCoding} takes an alignment as input and outputs an object of class \Rclass{NonCoding} that describes the sequences.

<<expr6,eval=TRUE>>=
y <- LearnNonCoding(RNA)
y
@

The output object is a list containing patterns of three types:  ``motifs'', ``hairpins'', and ``kmers''.  Motifs are short regions of the sequence that can be used to identify the sequences:

<<expr7,eval=TRUE>>=
y[["motifs"]]
@

Note that some of the motifs contain ambiguity codes (see \code{IUPAC_CODE_MAP}) that represent multiple nucleotides.  Motifs are defined by their distance from each end of the non-coding RNA, and their prevalence across sequence representatives when allowing for a certain degree of distance.

<<expr8,eval=TRUE>>=
y[["hairpins"]]
@

Hairpins are defined similarly, but allow for ambiguity in the form of varying free energy (dG).  As we saw in the predicted structures, the IhtA sequences end in a prominent hairpin that is both long and has a low free energy.

<<expr9,eval=TRUE>>=
head(y[["kmers"]])
tail(y[["kmers"]])
@

Finally, the sequences are identifiable by their k-mer frequencies.  In general, non-coding RNAs have higher GC-content than protein coding regions of genomes.  Note that the value of ``k'' is set automatically depending on the amount of information in the input sequence alignment.

%------------------------------------------------------------
\section{Finding Non-Coding RNAs}
%------------------------------------------------------------

Now that we have captured the life force of the sequences, our next goal is to find homologous non-coding RNAs in a genome.  You can either use your own genome or follow along with the example \emph{C. trachomatis} genome.

<<expr10,eval=TRUE>>=
# specify the path to your genome:
genome_path <- "<<path to genome FASTA file>>"
# OR use the example genome:
genome_path <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")

# read the sequences into memory
genome <- readDNAStringSet(genome_path)
genome
@

The function \Rfunction{FindNonCoding} finds matches to \Rclass{NonCoding} models in a genome.  Let's search for the IhtA model in the \emph{Chlamydia} genome:

<<expr11,eval=TRUE>>=
FindNonCoding(y, genome)
@

And there it is!  The output tells us that the IhtA gene is found on the forward strand of the first sequence (\code{"Index"}) in the genome.  This match had a high score to first (and only) model in \code{y} (i.e., \code{"Gene"} is \code{-1}).  Values in the \code{"Gene"} column are negative to signify that these are non-coding RNAs and not protein coding genes.

Life's too short to build models for every non-coding RNA, so we can load a set of pre-built models for our bacterial genome.  Replace ``Bacteria'' with ``Archaea'' or ``Eukarya'' for genomes from organisms belonging to other domains of life.

<<expr12,eval=TRUE>>=
data(NonCodingRNA_Bacteria)
x <- NonCodingRNA_Bacteria
names(x)
@

What a life saver!  Our new dataset (\code{x}) is a list of models, including tRNAs (by amino acid), transfer-messenger RNA, RNase P, SsrS (6S RNA), group I and II introns, the signal recognition particle, and three rRNAs (5S, 16S, and 23S).  Let's add the model we built of IhtA into the list:

<<expr13,eval=TRUE>>=
x[[length(x) + 1]] <- y
names(x)[length(x)] <- "IhtA"
@

Now we can search for them all at once with \Rfunction{FindNonCoding}:

<<expr14,eval=TRUE>>=
rnas <- FindNonCoding(x, genome)
rnas
class(rnas)
@

Wow!  That has to be one of life's simplest pleasures.  \Rfunction{FindNonCoding} returned an object of class \Rclass{Genes}.  By convention, the starting position of non-coding RNAs on the forward (\code{0}) \code{Strand} is at \code{Begin}, while those on the reverse (\code{1}) \code{Strand} start at \code{End}.  We can take a look at which RNAs were found:

<<expr15,eval=TRUE>>=
annotations <- attr(rnas, "annotations")
m <- match(rnas[, "Gene"], annotations)
sort(table(names(annotations)[m]))
@

We see that the \emph{C. trachomatis} genome has multiple tRNA genes, two copies of each ribosomal RNA gene, and the RNaseP and tmRNA genes.  Finally, it is possible to extract the non-coding RNAs from the genome:

<<expr16,eval=TRUE>>=
ExtractGenes(rnas, genome, type="RNAStringSet")
@

You'll be the life of the party now that you know how to build models for non-coding RNAs and find them in a genome.

%------------------------------------------------------------
\section{Session Information}
%------------------------------------------------------------

All of the output in this vignette was produced under the following conditions:

<<sessinfo,echo=FALSE,results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}