%\VignetteIndexEntry{ASEB}
%\VignetteKeywords{lysine (K)-acetyl-transferase (KAT) acetylation}
%\VignettePackage{ASEB}

\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\setlength{\textheight}{8.5in}
\setlength{\textwidth}{6in}
\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.25in}
\setlength{\evensidemargin}{0.25in}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}
\title{\bf How to use the ASEB Package}
\author{Likun Wang$^1$$^,$$^2$ and Tingting Li$^1$$^,$$^3$}
\maketitle
\noindent
$^1$Institute of Systems Biomedicine, Peking University Health Science Center.\\
\noindent
$^2$College of Computer Science and Technology, Jilin University.\\
\noindent
$^3$Department of Biomedical Informatics, Peking University Health Science Center.\\
\begin{center}
{\tt wang.likun@gmail.com}
\end{center}
\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

Lysine acetylation is a well-studied posttranslational modification on kinds of proteins. 
About four thousand lysine acetylation sites and over 20 lysine (K)-acetyl-transferases (KATs) have been identified. 
However, which KAT is responsible for a given protein or lysine site acetylation is mostly unknown. 
In our previous study, we found that different KAT families acetylate lysine sites with different sequence features \citep{Li03}. 
Based on these differences, we developed a computer program, Acetylation Set Enrichment Based (ASEB) method to
predict which KAT-families are responsible for acetylation of a given protein or lysine site.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Getting started}
To load the \Rpackage{ASEB} package, type {\tt library(ASEB)}. 
Total six methods are presented in this package. They are {\tt readSequence}, {\tt asebSites}, 
{\tt asebProteins}, {\tt drawStat} and {\tt drawEScurve}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Methods}
In this package, we use a GSEA-like method to make predictions.
Gene Set Enrichment Analysis (GSEA) method was developed and successfully used to detect coordinated expression changes  \citep{Subramanian05, Mootha03}
and find the putative functions of the long non-coding RNAs \citep{Guttman09}.   
In our study \citep{Li03}, we treated the (acetylated) lysine sites and their surrounding amino acids (8 on each side) as (acetylated) peptide sequences.
We first define all the validated acetylated peptide sequences from one KAT family as a KAT-specific set, 
and define 10000 random selected peptide sequences from the whole proteome as a background set. 
When given a new query peptide sequence, similarity scores are calculated according to the BLOSUM62 matrix between this query peptide and peptides in the KAT-specific set 
and background set. A list is then created by ranking the scores. 
Similar with GSEA method, a running sum score (enrichment score) was calculated by walking down the list. 
To estimate the significance of the enrichment score for a query peptide, at first, 
a certain number of peptide sets with the same size as KAT-specific set was randomly generated.
Secondly, treating each randomly generated set as a pseudo predefined KAT-specific set, an enrichment score can be calculated for each randomly generated set.  
At last, rank all the enrichment scores (from high to low) and a nominal P-value could be calculated.
The nominal P-value is defined as the rank of enrichment score for the KAT-specific set divided by the total number of random selected sets.
The whole process is similar with the GSEA method (permuting gene sets). Please see \cite{Li03} for details.

\section{Data}
We provide the KAT-specific set for CBP/P300 and GCN5/PCAF family in this package.
The file predefined\_sites.fa under extdata contains the KAT-specific set for CBP/P300 family (total 267 sites). 
While the file predefined\_sites2.fa contains the KAT-specific set for GCN5/PCAF family (total 82 sites). 
The two sets were generated by searching the PubMed literature.
The file background\_sites.fa contains 10000 randomly selected sites and the KAT-specific set for CBP/P300 family (total 10000+267 sites).
While the file background\_sites2.fa contains 10000 randomly selected sites and the KAT-specific set for GCN5/PCAF family (total 10000+82 sites).
\section{Examples}
\subsection{Example for readSequence}
This function return an object of {\tt SequenceInfo} that contains sequences and identifiers from FASTA format input file.  
\begin{center}
<<echo=TRUE,print=FALSE>>=
  library(ASEB)
  ff <- system.file("extdata", "background_sites.fa", package="ASEB")
  readSequence(ff)
@
\end{center}
\noindent

\subsection{Example for asebSites}
This function is used to predict lysine sites that can be acetylated by a specific KAT-family.  
\begin{center}
<<echo=TRUE,print=FALSE>>=
  backgroundSites <- readSequence(system.file("extdata", "background_sites.fa", package="ASEB")) 
  prodefinedSites <- readSequence(system.file("extdata", "predefined_sites.fa", package="ASEB"))
  testSites <- readSequence(system.file("extdata", "sites_to_test.fa", package="ASEB"))
  resultList <- asebSites(backgroundSites, prodefinedSites, testSites, permutationTimes=100)
  resultList$results[1:2,]
@
\end{center}
This method can also process FASTA format files directly 
without loading all the sequences to the workspace of R.
In this case, it can process huge number of lysine sites each time.
\begin{center}
<<echo=TRUE,print=FALSE>>=
  backgroundSitesFile <- system.file("extdata", "background_sites.fa", package="ASEB")
  prodefinedSitesFile <- system.file("extdata", "predefined_sites.fa", package="ASEB")
  testSitesFile <- system.file("extdata", "sites_to_test.fa", package="ASEB")
  asebSites(backgroundSitesFile, prodefinedSitesFile, testSitesFile, permutationTimes=100)
@
\end{center}
\noindent
\subsection{Example for drawEScurve}
This method can be used to draw enrichment score curve for a specific site.
Please see \cite{Li03} for details about enrichment score curve. 
\begin{center}
<<echo=TRUE,print=FALSE, fig=TRUE>>=
  drawEScurve(resultList$curveInfo, max_p_value=0.1, min_es=0.1)
@
\end{center}
These curves show running-sum process for calculating enrichment score.
Users can find detail algorithm from \cite{Li03}.
The data.frame object contains curve information is given by {\tt asebSites}, 
or {\tt asebProteins}.
\subsection{Example for asebProteins}
This function is used to predict all lysine sites on a specific protein that can be acetylated by a specific KAT-family. 
\begin{center}
<<echo=TRUE,print=FALSE,results=hide>>=
  backgroundSites <- readSequence(system.file("extdata", "background_sites.fa", package="ASEB")) 
  prodefinedSites <- readSequence(system.file("extdata", "predefined_sites.fa", package="ASEB"))
  testProteins <- readSequence(system.file("extdata", "proteins_to_test.fa", package="ASEB"))
  resultList <- asebProteins(backgroundSites, prodefinedSites, testProteins, permutationTimes=100)
  resultList$results[1:2,]
@
\end{center}
For processing huge number of lysine sites on proteins, this method can be used as below.
\begin{center}
<<echo=TRUE,print=FALSE,results=hide>>=
  backgroundSitesFile <- system.file("extdata", "background_sites.fa", package="ASEB")
  prodefinedSitesFile <- system.file("extdata", "predefined_sites.fa", package="ASEB")
  testProteinsFile <- system.file("extdata", "sites_to_test.fa", package="ASEB")
  asebProteins(backgroundSitesFile, prodefinedSitesFile, testProteinsFile, permutationTimes=100)
@
\end{center}
\subsection{Example for drawStat}
This function is used to show P-values and enrichment scores for all lysine sites on a specific protein.
The X-axis shows positions of all lysine sites on a specific protein, and Y-axis shows the enrichment scores (0~1) and P-values (0~1)
for each lysine site. 
\begin{center}
<<echo=TRUE,print=FALSE,results=hide, fig=TRUE>>=
  drawStat(curveInfoDataFrame=resultList$curveInfo);
@
\end{center}
The sites with less P-values are more significant.
For sites that have similar P-values, the ones with higher enrichment scores are more likely to be 
acetylated. These P-values are nominal \citep{Subramanian05, Mootha03}, hence it is hard to
give an threshold to predict significant sites. However, users can order all the sites 
and pay more attention to the ones with relatively less P-values.
Please see \cite{Li03} for details.
\noindent

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\newpage
\bibliographystyle{apalike}
\bibliography{ASEB}
\end{document}