%\VignetteIndexEntry{BCRANK} %\VignetteDepends{} %\VignetteKeywords{} %\VignettePackage{BCRANK} \documentclass[10pt,a4paper]{article} \setlength{\oddsidemargin}{0.3cm} %margin \setlength{\evensidemargin}{0.3cm} %margin \setlength{\textwidth}{15.2cm} \addtolength\footskip{1cm} \pagestyle{plain} \usepackage{Sweave} \usepackage{graphicx} \usepackage[small]{caption} \setkeys{Gin}{width=0.6\textwidth} \title{BCRANK: predicting binding site consensus from ranked DNA sequences} \author{Adam Ameur} \date{\today} \begin{document} \maketitle \section{Introduction} This document describes the \texttt{BCRANK} \texttt{R} package. \texttt{BCRANK}\cite{Ameur09} is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. \subsection{Input data} BCRANK takes a fasta file with DNA sequences as input: \begin{itemize} \item{The sequences in the file must be ordered, with the ones most likely to contain a TF binding sequence at the top.} \item{The sequences may be of varying lengths. Usually the average length is somewhere in the range between 50bp and 2kb.} \item{All IUPAC nucleotide symbols are allowed in the sequences, but positions with other letters than $(A,C,G,T)$ will not be considered in the motif search.} \end{itemize} \subsection{The algorithm} \texttt{BCRANK} uses a heuristic search strategy. First a score is computed for an initial short consensus sequence, typically selected at random. The score takes into account both the number of consensus occurrences and the rank of the genomic regions. Then all consensus sequences in a neighborhood of the start guess are evaluated and the one with highest score is kept as the starting point for the next iteration. When a local optimum is found, the algorithm is terminated and the locally optimal consensus is reported as a result. In order to increase the chance of detecting the globally optimal solution, the algorithm may be restarted several times using different random starting points. Alternatively, \texttt{BCRANK} can be used for assigning scores to previously established consensus sequences. The sections below describe in more detail how the neighborhood, scoring function and start guess are implemented. \texttt{BCRANK} can extend and shorten motifs and can therefore be used in situations where the motif length is not known a priori. Moreover, \texttt{BCRANK} implements two optional penalties that can give relatively higher scores some consensus sequences. The effect of the penalties is that i) the consensus contains fewer redundant bases (i.e. other bases than ${A,C,G,T}$), and ii) the consensus will not be frequently occurring as a repetitive element in the enriched regions. \subsubsection{Neighborhood} In BCRANK, all consensus sequences are represented by IUPAC nucleotide symbols. The neighborhood of one consensus sequence $s$ consists of all consensuses that can be generated from s by first adding one IUPAC letter N (representing any nucleotide) to either side of $s$ and then flipping any base to any other IUPAC symbol. Since there are 15 symbols in total, a sequence of length $l$ will have $14 \cdot (l+2)$ neighbors. After each search step any flanking Ns are removed from the highest scoring sequence in the neighborhood. The removal and additions of flanking Ns allows the algorithm to shorten and extend the predicted binding sites. \subsubsection{Scoring function} The score tells whether a given consensus sequence is overrepresented in some part of the ranked list or not. Starting from $N$ ranked regions and a consensus sequence $c$, a binary vector of size $N$ is created, with 1 at position $i$ if $c$ is occurring in sequence number $i$, and 0 if not. The reverse complement of $c$ is also allowed to match. Then the cumulative sum of the match vector is computed and stored in a vector called $A$. The $A$-vector tells where in the ranked list most occurrences are located (see Figure~\ref{fig:intro1}). \begin{figure}[h] \center \includegraphics[height=4.2cm,width=7.7cm]{BCRANK_intro_fig1} \caption{$A$-vectors for the two consensus sequences \texttt{CACGTGAC} (left) and \texttt{CAGGCTGG} (right). On the x-axis are the top 5211 regions from a whole genome ChIP-chip study on USF1 in human liver cells\cite{Rada08}, ranked by their enrichment signal. The aim of \texttt{BCRANK} is to detect sequences that are biased towards some part of the list. Therefore \texttt{CACGTGAC} will get a higher score than \texttt{CAGGCTGG} even though it has a lower number of total occurrences. It is important to have enough number of ranked input regions to \texttt{BCRANK}, so this bias is observed for the correct binding motif. The established USF1 binding sequence is \texttt{CACGTG}.} \label{fig:intro1} \end{figure} To compute a score, $A$ is compared to what it would look like if the genomic regions were randomly ordered. Therefore a large number $R$ (the \texttt{reorderings} parameter to \texttt{bcrank()}) of random orderings of the input regions are generated, and a corresponding vector $A_j$ is computed for each re-ordering $1 \le j \le R$ as above. For each $j$, the difference $D_j$ between $A_j$ and $A$ is estimated by the area between the corresponding lines (see Figure~\ref{fig:intro2}). When calculating $D_j$, the $A$ and $A_j$ vectors are first scaled so they range between 0 and 1. \begin{figure}[h] \center \includegraphics[height=4.5cm,width=8.25cm]{BCRANK_intro_fig2} \caption{The vector $A$ (red line) and a corresponding vector $A_j$ (blue line) for \texttt{CACGTGAC} in the USF1 data. There is a clear bias towards the high scoring sequences as indicated by the red line. The significance of this bias can be estimated by comparing to the \texttt{CACGTGAC}-occurrences in randomly ordered regions, as indicated by the blue line. $D_j$ corresponds to the grey area between the two lines.} \label{fig:intro2} \end{figure} $D_j$ will be close to zero when the consensus occurrences are distributed as expected by random sampling. If on the other hand all $D_j$ are far off from zero, $c$ is biased towards some part of the list. Therefore the score is calculated as the t-statistic $T$ for the $D_j$ being drawn from a distribution centered around zero. Consensus sequences that are biased towards some part of the list will thus get high scores whereas consensuses with no bias will get low scores. Moreover, consensuses that are matching just a few regions will not get a high $T$ even if it is matching only among the top ranked regions. This is because there will be a high variation within the $D_j$ values which will result in a low $T$. \subsubsection*{Penalties} The t-statistic gives consensus sequences that are biased towards some part of the list. But there may be other issues to take into account if the aim is to detect TFBS from ChIP-chip or ChIP-seq data. Therefore, \texttt{BCRANK} implements two optional penalties, $P1$ and $P2$, with values between 0 and 1. The final scoring function is defined as: $score = T \cdot P1 \cdot P2$. If a penalty is not used it will be set to 1. \begin{itemize} \item $P1$ - Penalty on non-specific bases. Let $l$ be the length of the consensus sequence and $b$ the total number of fixed bases (A, C, G, T) in the sequence. If there are no fixed bases, $b$ is set to 0.5. The penalty is then defined as $P1=b/l$. \item $P2$ - Penalty on repetitive motifs. Let $r_n, n \in {1,2}$ be the number of input DNA regions that contain at least $n$ occurrences of the consensus. Then $P2=1-(r_2/r_1)$. \end{itemize} \subsubsection{Start guess} In case the algorithm is used for ab inito search, the initial guess is a randomly generated consensus of a specified length (the \texttt{length} parameter to \texttt{bcrank()}), with 10 bases as default. Multiple restarts with different random start guesses are usually required to increase the chance of finding the globally optimal solution. The number of restarts is determined by the \texttt{restarts} parameter. \texttt{BCRANK} can also use start guesses passed to \texttt{bcrank()} by the \texttt{startguesses} parameter. By setting the \texttt{do.search} parameter to \texttt{FALSE}, \texttt{BCRANK} assigns scores for the given start guesses without performing any search. \subsection{Additional information} Some other important details: \begin{itemize} \item{The algorithm randomly re-orders the data when the score is calculated. This implies that the same consensus sequence will get different \texttt{BCRANK} scores in the same data when run with different re-orderings. The variability in scores can be decreased by increasing the \texttt{reorderings} parameter} \item{The algorithm performs a breadth-first search, meaning that the highest scoring neighbor in the neighborhood is selected in each search step.} \item{The algorithm keeps track of all consensus sequences that have already been tested so the same sequence is not visited twice when performing a search.} \end{itemize} \subsection{Citation} To cite \texttt{BCRANK}, please use (Ameur et al, 2009), see \cite{Ameur09} in the References section. \newpage \section{BCRANK - An example run} The user is required to load the package using the \texttt{library()} command: <>= library(BCRANK) @ \subsection{Sequence data} \texttt{BCRANK} takes a fasta file containing ranked sequences as input. The command below loads an example file containing 2500 ranked regions from a whole genome ChIP-chip experiment for the protein USF1 in the human liver cell line HepG2\cite{Rada08}. <>= fastaFile <- system.file("Exfiles/USF1_small.fa", package="BCRANK") @ \subsection{Running BCRANK} The \texttt{bcrank()} function call below runs the \texttt{BCRANK} algorithm on the example USF1 data set. The \texttt{set.seed()} call sets seed for the random number generator for reproducibility. \\\\ \noindent{\texttt{> set.seed(0)}}\\ \noindent{\texttt{> BCRANKout <- bcrank(fastaFile, restarts=25, use.P1=TRUE, use.P2=TRUE)}}\\ \noindent{Since it takes some time to run the algorithm, results can instead be loaded from a previous run on a larger USF1 data set containing the top 5211 regions:} <>= data(BCRANKout) @ \subsection{BCRANK output} An object of type \texttt{BCRANKresult} is returned: <>= BCRANKout @ \subsubsection{The \texttt{BCRANKsearch} object} \noindent{Use the \texttt{toptable()} function to access information about each motif found by \texttt{bcrank}. It returns an object of type \texttt{BCRANKsearch}:}. Here we extract the top scoring motif: <>= topMotif <- toptable(BCRANKout,1) topMotif @ \noindent{The \texttt{pwm()} function returns the position weight matrix for the search result. If \texttt{normalize} is set to \texttt{TRUE} each column will sum to \texttt{1}. <>= weightMatrix <- pwm(topMotif, normalize=FALSE) weightMatrix @ \noindent{A weight matrix can be viewed as a sequence logo by using the \texttt{seqLogo} package. Make sure that that the pwm is normalized before running the \texttt{seqLogo()} function. \begin{center} <>= weightMatrixNormalized <- pwm(topMotif, normalize=TRUE) library(seqLogo) seqLogo(weightMatrixNormalized) @ \end{center} \noindent{The search path can be visualized. For each consensus in the search path, the number of occurences among the ranked regions are plotted. As seen in the figure, \texttt{BCRANK} searches for consensus sequences that don't give straight lines.} \begin{center} <>= plot(topMotif) @ \end{center} \subsubsection{Predicted binding sites} \noindent{Individual predicted binding sites can be reported by the \texttt{matchingSites} function. <>= topConsensus <- as.character(toptable(BCRANKout)[1,"Consensus"]) print(topConsensus) bindingSites <- matchingSites(fastaFile,topConsensus) nrSites <- nrow(bindingSites) cat("Number predicted binding sites:",nrSites,"\n") print(bindingSites[1:15,]) @ \noindent{As seen in the example above, some binding sites can be reported both on the sense and anti-sense strands. If the consensus is palindromic, duplicate enries can be avoided by setting the \texttt{revComp} argument in the \texttt{matchingSites()} call to \texttt{FALSE}.} \bibliographystyle{plainnat} \begin{thebibliography}{} \bibitem[1]{Ameur09} Ameur, A., Rada-Iglesias, A., Komorowski, J., Wadelius, C.\begin{em} Identification of candidate regulatory SNPs by combination of transcription factor binding site prediction, SNP genotyping and haploChIP.\end{em} Nucleic Acids Res, 2009, 37(12):e85. \bibitem[2]{Rada08} Rada-Iglesias, A., Ameur, A., Kapranov, P., Enroth, S., Komorowski, J., Gingeras, T. R., Wadelius, C.\begin{em} Whole-genome maps of USF1 and USF2 binding and histone H3 acetylation reveal new aspects of promoter structure and candidate genes for common human disorders.\end{em} Genome Res, 2008, 18(3):380-92. \end{thebibliography} \end{document}