%\VignetteIndexEntry{Perform GWAS trait-associated SNP enrichment analyses in genomic intervals}
%\VignettePackage{BiocStyle}
%\VignetteEngine{utils::Sweave}

\documentclass{article}
<<style, eval=TRUE, echo=FALSE, results=tex>>=
    BiocStyle::latex(use.unsrturl=FALSE)
@

\newcommand{\exitem}[3]
{\item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.}

\usepackage{Sweave}
\renewcommand{\baselinestretch}{1.3}

\author{Li Chen, Zhaohui S.Qin \\[1em]Department of Biostatistics and Bioinformatics\\ Emory University\\
  Atlanta, GA 303022 \\ [1em] \texttt{li.chen@emory.edu,zhaohui.qin@emory.edu}}

\title{\textsf{\textbf{traseR: TRait-Associated SNP EnRichment analyses}}}

\begin{document}
\maketitle
\tableofcontents

%% abstract


\begin{abstract}
This vignette introduces the use of {\tt traseR} 
({\bf TR}ait-{\bf A}ssociated  SNP{\bf E}n{\bf R}ichment analyses,
which is designed to provide quantitative assessment whether a selected genomic interval(s) is likely to be functionally connected with certain traits or diseases. 
{\tt traseR} consists of several modules, all written in R, to perform hypothesis testing, exploration and visualization of trait-associated SNPs(taSNPs). It also assembles the up-to-date taSNPs from dbGaP and NHGRI, SNPs from 1000 Genome Project CEU population with linkage disequilibrium greater than 0.8 within 100 kb of taSNPs, and all SNPs of CEU population from 1000 Genome project into its built-in database, which could be directly loaded when performing analyses. 
\end{abstract}

%% introduction
\section{Introduction}
Genome-wide association study (GWAS) have successfully identified many sequence variants that are significantly associated with common diseases and traits. Tens of thousands of such trait-associated SNPs have already been cataloged which we believe are great resources for genomic research. However, no tools existing utilizes those resources in a comprehensive and convenient way. In this study, we show the collection of taSNPs can be exploited to indicate whether a query genomic interval(s) is likely to be functionally connected with certain traits or diseases. A R Bioconductor package named {\tt traseR}  has been developed to carry out such analyses.

\section{Data collection}
 One great feature of {\tt traseR} is the built-in database that collects various public SNP resources. Common public SNP databases include Association Result Browser and 1000 Genome Project. We briefly introduce the procedures to process those public available SNP resources 


\subsection{Obtain taSNPs}
Association Results Browser
(\url{http://www.ncbi.nlm.nih.gov/projects/gapplusprev/sgap_plus.htm}) combines identified taSNPs from dbGaP and NHGRI, which together provide 44,078 SNP-trait associations, 48,936 SNP-trait class associations, 30,553 unique taSNPs, 573 unique traits and 33 unique trait classes. This resource has been built into GRanges object \texttt{taSNP} and could be loaded into R console by typing \texttt{data(taSNP)}.

{\tt traseR} need to specify the collection of trait-associated SNPs in particular format
before we carry out enrichment analyses. The format starts with the columns,
\begin{enumerate}
\item Trait: Description of disease/trait examined in the study
\item Trait\_Class: Trait class which is formed based on the phenotype tree. Close traits are grouped together to form one class.
\item SNP\_ID: SNP rs number
\item p.value: GWAS reported p-values
\item seqnames: Chromosome number associated with rs number
\item ranges: Chromosomal position, in base pairs, associated with rs number
\item Context: SNP functional class
\item GENE\_NAME:  Genes reported to be associated with SNPs
\item GENE\_START: Chromosome start position of genes
\item GENE\_END: Chromosome end position of genes
\item GENE\_STRAND: Chromosome strand associated with SNPs
\end{enumerate}
Currently, the {\tt traseR} package automatically synchronize trait-associated SNPs from 
Association Results Browser, which collects up-to-date GWAS results from dbGaP 
NHGRI GWAS catalog.


\subsection{Obtain linkage disequilibrium taSNPs from 1000 Genome Project}
We first download CEU vcf files from
(\url{ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/}) that contain all sequence variants information. The followed two steps are used to identify linkage disequilibrium SNPs >0.8 and located within 100kb of taSNPs. Firstly, we use {\tt vcftools} to convert the vcf file format to {\tt PLINK} format. Then we use {\tt PLINK} to call the LD taSNPs by specifying options that limit the linkage disequilibrium SNPs >0.8 (--ld-window-r2 0.8) and within 100kb of taSNP (--ld-window-kb 100). The detailed commands are listed below,\\
\texttt{vcftools --vcf  vcf.file  --out plink.file --plink plink --file plink.file --r2 --inter-chr --ld-snp-list snps.txt --ld-window-r2 0.8 --ld-window-kb 100 --out output.file --noweb}\\
Finally, we have 90,700 SNP-trait associations and 78,247 unique linkage disequilibrium trait-associated SNP. We also build linkage disequilibrium taSNP into another GRanges object   \texttt{taSNPLD}, which could be loaded into R console by typing \texttt{data(taSNPLD)}.\\
The format of \texttt{taSNPLD} is,
\begin{enumerate}
\item seqnames: Chromosome number associated with rs number
\item  SNP\_ID: SNP rs number
\item  ranges: Chromosomal position, in base pairs, associated with rs number
\item  Trait: Description of disease/trait examined in the study
\item Trait\_Class: Trait class which is formed based on the phenotype tree. Close traits are grouped together to form one class.
\end{enumerate}


\subsection{Obtain background SNPs from 1000 Genome Project}
We use the command  \texttt{ plink --file plink.file  --freq --out chr } to retrieve all SNPs with corresponding MAF (minor allele frequency) from the CEU vcf files downloaded. There are totally 6,571,512 SNPs (MAF>0.05) excluding variants on Y chromosome. Those SNPs could serve as background in hypothesis testing. We build  those SNPs into the built-in GRanges subject \texttt{CEU} into the package.\\
The format of \texttt{CEU} is,
\begin{enumerate}
\item seqnames: Chromosome number associated with rs number
\item  SNP\_ID: SNP rs number
\item  ranges: Chromosomal position, in base pairs, associated with rs number
\end{enumerate}


\section{Using {\tt traseR}}
To assess the enrichment level of trait-associated SNPs in given genomic interval(s) using {\tt traseR}, one needs to
follow the simple steps below.
\begin{enumerate}
\item Prepare the genomic intervals in R object of either data frame format with column names {\tt chr},{\tt start},{\tt end} or GRanges object
\item Query a given a set of genomic interval(s) against all the taSNPs in the collection, perform statistical analyses
\item Explore genes/SNPs of particular interest
\end{enumerate}


\subsection{Background selection}
\subsubsection{Whole genome}
The assumption is each base could be possibly be the taSNP. Based on the assumption, with the number of taSNPs inside and outside the genomic interval(s), the number of bases inside and outside of the genomic interval(s), we could classify all bases based on the fact that one base is taSNP or not and in genomic intervals or not.

\subsubsection{All SNPs}
The assumption is each SNP could possibly be the taSNP. Based on the assumption, with the number of taSNPs inside and outside the genomic interval(s), the non-taSNPs inside and outside of the genomic interval(s), we could classify all SNPs based on the fact that one SNP is taSNP or not and in genomic intervals or not.

\subsection{Hypothesis testing}
{\tt traseR} provides differential hypothesis testing methods in core function {\tt traseR}, together with other functions for exploring and visualizing the results. The genomic interval(s) could be a data frame with three columns as {\tt chr}(chromosome), {\tt start}(genomic start position) and {\tt end}(genomic end position) or a GRanges object. {\tt traseR} offers either including LD SNPs or excluding LD SNPs as the taSNPs and either using the whole genome or all SNPs as the background for hypothesis testing. \\
If using whole genome as background, the command line is:
<<eval=FALSE>>=
x=traseR(snpdb=taSNP,region=Tcell)
print(x)
@
If including the LD SNPs, the command line is:
<<eval=FALSE>>=
x=traseR(snpdb=taSNPLD,region=Tcell)
print(x)
@
If using all SNPs as background, the command line is:
<<eval=FALSE>>=
x=traseR(snpdb=taSNP,region=Tcell,snpdb.bg=CEU)
@
For the above commands, {\tt region} is the data frame; {\tt snpdb} is taSNPs or including LD SNPs; {\tt snpdb.bg} is background SNPs;  If {\tt rankby} is set as "pvalue", all traits will be sorted by p-value in increasing order; if{\tt rankby} is set as "odds.ratio", all traits will be sorted by odds ratio in decreasing order. There are four options for {\tt test.method} including "binomial", "chisq", "fisher", and "nonparametric"to perform binomial test, ${\chi}^2$ test, Fisher's exact test and nonparametric respectively. If {\tt alternative} is set to "greater", {\tt traseR} will perform hypothesis testing on whether genomic intervals are enriched of taSNPs than the background; If {\tt alternative} is set to "less",{\tt traseR} will perform hypothesis testing on whether genomic intervals are depleted of taSNPs than the background.
%

\subsubsection{${\chi}^2$  test and Fisher's exact test}
Based on which background we choose, we could construct the 2 by 2 contingency table. then, we could perform ${\chi}^2$ test on the table to assess the difference of proportions of taSNPs inside and outside of genomic intervals(s). We could also assume taSNPs inside genomic intervals follows hypergeometric distribution and calculate  p-value directly using Fisher's exact test.

\subsubsection{Binomial test}
The assumption is the probability of observing a single base/SNP being a taSNP is the same inside and outside of genomic intervals. The probability of observing a single base/SNPs being a taSNP in genomic intervals could be estimated by using total number of taSNPs divided by the genome size/number of all SNPs. Then corresponding p-value could be calculated directly by Binomial test.

\subsubsection{Nonparametric Test}
Instead of imposing any assumption, the matched genomic interval(s) are generated by permuting the genomic intervals randomly $N$ times and overlap with taSNPs in each time. Then we could calculate the empirical p-value directly by counting how many taSNP hits larger/smaller than the observed taSNP hits.


\section{Choose appropriate statistical test method}
Depending on the characteristics of the test statistics, we suggest to choose appropriate statistical test method under different scenarios,
\begin{itemize}
\item ${\chi}^2$  test:  the numbers in the contingency table is fairly large and balanced
\item  Fisher's exact test: the numbers of the contingency table is relatively small, e.g. no more than 20
\item  Nonparametric test: the number of query genomic intervals are small, e.g. no more than 1000
\item  Binomial test: default test method, not limited by sample size, distribution assumption and computational time
\end{itemize}


\subsection{Example}
To further illustrate the usage of {\tt traseR}  R package, we download H3K4me1 peak regions in peripheral blood T cell from Roadmap Epigenomics. Those peak regions are deemed the genomic intervals. Since the degree of enrichment level is measured by p-value, we could rank traits/trait classes based on p-value in an increasing order. We choose Binomial test are the default option for {\tt test.method}, use whole genome as background and over-enrichment as hypothesis testing direction.
<<>>=
library(traseR)
data(taSNP)
data(Tcell)
x=traseR(taSNP,Tcell)
print(x)
@


\subsection{Exploratory and visualization functions}
Plot the distribution of SNP functional class
<< fig=TRUE>>=
plotContext(snpdb=taSNP,region=Tcell,keyword="Autoimmune")
@

%%\begin{figure}[h!]
%%\centering
%%\includegraphics[width=3in]{plotContext.pdf}
%%\caption{Autoimmune-associated SNP functional class distribution }
%%\end{figure}
%%
Plot the distribution of p-value of trait-associated SNPs
<<fig=TRUE>>=
plotPvalue(snpdb=taSNP,region=Tcell,keyword="autoimmune",plot.type="densityplot")
@
%%\begin{figure}[h!]
%%\centering
%%\includegraphics[width=3in]{plotPvalue.pdf}
%%\caption{-logPvalue distribution of Autoimmune-associated SNP compared to background distribution }
%%\end{figure}
%%

Plot SNPs or genes given genomic interval
<<fig=TRUE>>=
plotInterval(snpdb=taSNP,data.frame(chr="chrX",start=152633780,end=152737085))
@
%%\begin{figure}[h!]
%%\centering
%%\includegraphics[width=3in]{plotInterval.pdf}
%%\caption{SNP information for queried genomic interval}
%%\end{figure}
%%

Query trait-associated SNPs by key word,
<<>>=
x=queryKeyword(snpdb=taSNP,region=Tcell,keyword="autoimmune",returnby="SNP")
head(x)
@
%%

Query trait-associated SNPs by gene name,
<<>>=
x=queryGene(snpdb=taSNP,genes=c("AGRN","UBE2J2","SSU72"))
x
@
%%

Query trait-associated SNPs by SNP name,
<<>>=
x=querySNP(snpdb=taSNP,snpid=c("rs3766178","rs880051"))
x
@

\section{Conclusion}
{\tt traseR} provides methods to assess the enrichment level of taSNPs
in a given sets of genomic intervals. Moreover, it provides other functionalities to 
explore and visualize the results.

\section{Session Info}
<<echo=TRUE>>=
sessionInfo()
@

\begin{thebibliography}{99}

\bibitem{NAR}
\textsc{Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L et al} (2010).
\newblock The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. 
\newblock {\em Nucleic Acid Research\/}, {\bf 42}, D1001-1006.

\bibitem{nature}
\textsc{Roadmap Epigenomics C, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J et al}. (2015).
\newblock Integrative analysis of 111 reference human epigenomes
\newblock {\em Nature\/},~\textbf{7539}, 317-330

\bibitem{browser}
\newblock \url{http://www.ncbi.nlm.nih.gov/projects/gapplusprev/sgap_plus.htm}
\newblock {\em Association Results Browser\/}

\bibitem{browser}
\newblock \url{ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/}
\newblock {\em 1000Genome EUR \/}

\end{thebibliography}

\end{document}