%\VignetteIndexEntry{genphen overview}
\documentclass[a4paper, 11pt]{article}
\usepackage[utf8]{inputenc}
\let\chapter\section
\usepackage{libertine}
\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}
\usepackage[english]{babel}
\selectlanguage{english}
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage{ifthen}
\usepackage{float}
\usepackage{subfig}
\usepackage{lscape}
\usepackage{parskip}
\usepackage{multirow}
\usepackage{todonotes}
\usepackage{color}
\usepackage{colortbl}
\definecolor{hellgrau}{rgb}{0.9,0.9,0.9}
\definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8}
\usepackage{hyperref}
%add-on renews
\renewcommand\topfraction{0.85}
\renewcommand\bottomfraction{0.85}
\renewcommand\textfraction{0.1}
\renewcommand\floatpagefraction{0.85}
\setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\intextsep{1.25\baselineskip plus 3pt minus 2 pt}
\setlength{\parindent}{0pt}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{Decoding genotype-phenotype associations with \texttt{genphen}}
\author{Simo Kitanovski, Daniel Hoffmann,\\
Bioinformatics and Computational Biophysics,\\
University of Duisburg-Essen, Essen, Germany}
\maketitle

\tableofcontents

This tutorial provides some of the technical background underlying
\texttt{genphen} that should enable you to understand and use the tool.

\section{\texttt{genphen} quantifies genotype-phenotype associations}
Genome-wide association studies (GWAS) have become an important tool for
understanding the association between genotypes and phenotypes. With GWAS we try
to answer questions such as ``which genotypes in the human genome predispose to
a disease?'' or ``which genotypes in certain strains of mice confer resistance
against a specific virus?''. Advances in high-throughput sequencing (HTSeq) have
provided massive genetic data and thus potentially countless applications for
GWAS, with genotypes representing single nucleotide polymorphisms (SNPs) or
single amino acid polymorphisms (SAAPs) identified in a group of individuals,
whereas the phenotype can be any quantitative or discrete trait or
characteristic measured in each individual.

The classical (frequentist) statistical methods for GWAS rely on simple and
often inadequate models to capture complex and potentially non-linear
genotype-phenotype associations. By quantifying the strength of the association
with P-values, the frequentist methods often run into the multiple-comparison
problem, which, when countered with a rigorous P-value correction, results in
large numbers of false negatives. Additional disadvantages of P-values are that
they are difficult to interpret and to compare between studies.

With \texttt{genphen} we provide a hybrid method for the joint analysis of
multiple traits of different types, which reaps the benefits of two approaches:
i) statistical learning approaches such as random forest (RF) and support vector
machine (SVM) to capture complex associations; ii) Bayesian inference using
hierarchical models for accurate quantification of the strength of association,
whereby the models are robust to outliers, consistent with the data and
automatically deal with the multiple-hypothesis problem.

Furthermore, \texttt{genphen} provides a set of additional procedures, including
a test for phylogenetic bias (used to discover biases in the data due to the
population structure) and a procedure for data reduction (used to remove
non-informative genotypes and thereby simplify the otherwise computationally
costly GWAS). Future updates will include procedures for data augmentation
(to augment small/noisy datasets) and methods for gene prioritization based on
network diffusion algorithms using functional network data.

\newpage
\section{Methods}
\label{sec:methods}
\subsection{Input}
Two types of data are necessary to perform a genetic association study:
\begin{itemize}
\item genotype data (e.g. a set of 1,000 SNPs found along the aligned genomes
of 10 individuals), provided in one of three possible formats:
  \begin{itemize}
    \item \textbf{character vector} of length $N$ (if only a single SNP/SAAP is 
    provided), containing the genotypes of $N$ individuals.
    \item \textbf{character matrix} with dimensions $N \times S$ (with $N$ = 
    individuals, $S$ = SNPs/SAAPs)
    \item \textbf{AAMultipleAlignment} or \textbf{DNAMultipleAlignment} 
    object (package \texttt{Biostrings}) - if the genotype data is a multiple 
    sequence alignment, composed of $N$ sequences (individuals).
  \end{itemize}
\item phenotype data, i.e. an experimental measurement made for each individual
(such as body mass index, immune response, survival, case-control status, etc.),
which may combine dichotomous and quantitative traits (see the sketch after this
list), provided as:
  \begin{itemize}
    \item \textbf{numerical vector} of length $N$, if only a single phenotype
    is analyzed
    \item \textbf{numerical matrix} with dimensions $N \times P$, if $P$
    phenotypes are provided.
  \end{itemize}
\end{itemize}
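
A minimal sketch of the expected input shapes is shown below (not evaluated);
the object names \texttt{gen.small} and \texttt{phen.small} are chosen here
only for illustration.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch of the expected input shapes (illustration only)
# genotype: 6 individuals (rows) x 2 SNPs (columns)
gen.small <- matrix(data = c("A", "A", "A", "T", "T", "T",
                             "C", "G", "C", "G", "C", "G"),
                    nrow = 6, ncol = 2)

# phenotype: one quantitative measurement per individual
phen.small <- c(1.2, 0.9, 1.1, 2.3, 2.7, 2.5)

# a multiple sequence alignment can be passed directly, e.g.
# aln <- Biostrings::AAMultipleAlignment(x = c("ARN", "ARD", "GRN"))
@
\end{scriptsize}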




\subsection{Association Scores}
With \texttt{genphen} we quantify the association between each genotype 
(SNP/SAAP) and phenotype using multiple metrics of association, each of which 
is explained in the following paragraphs.


\paragraph{Classification accuracy ($CA$)}
$CA$ quantifies the degree of accuracy with which one can classify (predict) 
the alleles of a SNP from the phenotype. If there exists a strong association 
between a particular SNP and the phenotype, one should be able to train a 
statistical model (using RF or SVM) which accurately classifies the two alleles 
of that SNP solely from the phenotype data (with $CA \approx 1$). Otherwise, the
model should perform poorly, with a classification accuracy close to that of
random guessing ($CA \approx 0.5$).

To estimate $CA$, \texttt{genphen} uses RF and SVM in a cross-validation (CV) 
mode, computing a distribution of possible $CAs$ for each SNP. During each 
iteration of the CV procedure, a subset (e.g. 66\%) of the genotype-phenotype 
data is selected at random (with replacement) and used to train a classifier, 
followed by testing (prediction) based on the remaining data. To summarize $CA$, 
we compute its mean and 95\% highest density interval (95\% HDI), which is 
defined as the interval that covers 95\% of the $CA$ distribution, with every 
point inside the interval having higher credibility than any point outside it. 
The SNPs with $CA \approx 1$ and a narrow HDI have a strong association with
the phenotype.
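
The following sketch (not evaluated) illustrates the idea behind the CV
procedure for a single, hypothetical SNP using the \texttt{randomForest}
package; it mirrors the description above but is not \texttt{genphen}'s
internal implementation.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Illustration of the CV idea behind CA (not genphen's internal code)
library(randomForest)

set.seed(seed = 1)
allele <- factor(rep(x = c("A", "T"), each = 30))            # hypothetical SNP
phen.toy <- c(rnorm(n = 30, mean = 0), rnorm(n = 30, mean = 1)) # hypothetical phenotype

ca <- numeric(length = 200)
for(i in 1:200) {
  train <- sample(x = 1:60, size = 40, replace = TRUE)       # roughly 66% of the data
  test <- setdiff(x = 1:60, y = train)
  rf <- randomForest(x = data.frame(phen = phen.toy[train]), y = allele[train])
  pred <- predict(object = rf, newdata = data.frame(phen = phen.toy[test]))
  ca[i] <- mean(pred == allele[test])
}

# summary of the CA distribution (a quantile interval is used here as a
# rough stand-in for the 95% HDI)
mean(ca)
quantile(x = ca, probs = c(0.025, 0.975))
@
\end{scriptsize}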


\paragraph{Cohen's $\kappa$ statistic}
The $CA$ estimate can be misleading when the analyzed SNP is composed of
unevenly represented genetic states (alleles). For instance, if allele A of a
given SNP is found in 90\% of the individuals and allele T in only 10\%, the
classifier can achieve a high $CA \approx 0.9$ without learning anything, simply
by always predicting the dominant allele. Cohen's $\kappa$ statistic can be used
to estimate the degree of $CA$ above the expected accuracy ($CA_{exp}$):
\begin{gather*}
\kappa =  (CA - CA_{exp})/(1 - CA_{exp})
\end{gather*}
The $\kappa$ statistic is a quality metric that should be interpreted together
with $CA$. Cohen defines the following meaningful $\kappa$ intervals:
[$\kappa < 0$]: ``no agreement'', [0.0--0.2]: ``slight agreement'', [0.2--0.4]:
``fair agreement'', [0.4--0.6]: ``moderate agreement'', [0.6--0.8]:
``substantial agreement'' and [0.8--1.0]: ``almost perfect agreement''. To
summarize Cohen's $\kappa$, we compute its mean and 95\% highest density
interval (95\% HDI).
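
As a small worked example of the formula, the sketch below (not evaluated)
computes $\kappa$ for the uneven-allele scenario described above, where a
trivial classifier always predicts the dominant allele.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Worked example: kappa of a trivial classifier on an uneven SNP (90% A, 10% T)
truth <- factor(rep(x = c("A", "T"), times = c(90, 10)), levels = c("A", "T"))
pred <- factor(rep(x = "A", times = 100), levels = c("A", "T"))

ca <- mean(pred == truth)             # observed accuracy: 0.9
p.pred <- table(pred)/length(pred)    # marginal frequencies of the predictions
p.true <- table(truth)/length(truth)  # marginal frequencies of the true alleles
ca.exp <- sum(p.pred * p.true)        # expected (chance) accuracy: 0.9
kappa <- (ca - ca.exp)/(1 - ca.exp)   # 0, i.e. no agreement beyond chance
@
\end{scriptsize}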


\paragraph{Effect size}
Given $N$ individuals, each having genotype values for $S$ refSNPs, we generate 
the genotype matrix $X^{N \times S}$. The genotype matrix can also be refereed 
to as a design matrix in the genetic association study, with $X_{ij}$ set to 1 
if an individual has the first allele, and $X$ set to -1 for the second allele. 
For multi-allelic genotypes, the genotype matrix is expanded (colmn-wise) to 
include each bi-allelic permutation. The phenotypes $P$ can also be grouped to 
form the phenotype matrix $Y^{N \times P}$. We model the effect of each SNP/SAAP 
on each phenotype using the following Bayesian model:
\begin{gather*}
 Y_{ik} \sim \begin{cases} 
 \operatorname{Student-t}\left(\nu_k, \alpha_{jk} + \beta_{jk} \cdot X_{ij}, 
 \sigma_k\right), & \text{if $k$ quant.}\\
 \operatorname{Bernoulli}\left(\operatorname{logit^{-1}}(\alpha_{jk} + 
 \beta_{jk} \cdot X_{ij})\right), & \text{if $k$ dich.} \end{cases}
\end{gather*}
where $i$, $j$ and $k$ index the individuals, SNPs and phenotypes, respectively.
For quantitative traits, the model assumes Student-t distributed measurement
errors with phenotype-specific standard deviation ($\sigma_k$) and degrees of
freedom ($\nu_k$), and central tendency defined by
$\alpha_{jk}+\beta_{jk} \cdot X_{ij}$, where $\alpha$ and $\beta$ are the
inferred intercept and slope (effect size) coefficients of the model. For
dichotomous traits, the model assumes Bernoulli distributed measurement errors.
Assuming that all SNPs are independent, we can use the univariate model setting
in \texttt{genphen}, which places independent vague priors on all slope and
intercept coefficients:
\begin{gather*}
\beta_{jk} \sim \text{Student-t}(1, 0, 10) \\
\alpha_{jk} \sim \text{Student-t}(1, 0, 100)
\end{gather*}
On the other hand, if we assume that the estimated slopes are not completely
independent and originate from a common overarching distribution, we can use the 
hierarchical model setting in \texttt{genphen} to model the hierarchy as:
\begin{gather*}
\beta_{jk}\sim\text{Student-t}(\nu_{\beta_k},\mu_{\beta_k},\sigma_{\beta_k}) \\
\alpha_{jk}\sim\text{Student-t}(\nu_{\alpha_k},\mu_{\alpha_k},\sigma_{\alpha_k})
\end{gather*}
The remaining lines describe the priors of the other parameters appearing in
either model type:
\begin{gather*}
\mu_{\beta_k} \sim \operatorname{Student-t}\left(1, 0, 10\right) \\
\mu_{\alpha_k} \sim \operatorname{Student-t}\left(1, 0, 100\right) \\
\nu_k, \nu_{\beta_k} \sim \operatorname{Gamma}\left(1, 2\right) \\
\sigma_{k}, \sigma_{\beta_k} \sim \operatorname{Half-Cauchy}\left(0, 5\right)
\end{gather*}
Importantly, the hierarchical version of the model performs partial pooling
and thereby corrects automatically for multiple comparisons. The model is
implemented in Stan~\footnote{Stan Development Team. 2017. Stan Modeling
Language Users Guide and Reference Manual, Version 2.17.0. http://mc-stan.org}.

We summarize each association using the mean of its slope coefficient ($\beta$)
and its 95\% (for instance) highest density interval (HDI), defined as the
interval that covers 95\% of the posterior distribution, with every point
inside the interval having higher credibility than any point outside it. Thus,
we can call an association significant if the null effect, i.e. $\beta = 0$,
lies outside the 95\% HDI. The complete posterior is provided to the user,
enabling checks of MCMC convergence and posterior predictions using built-in
routines for posterior predictive checks.
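
The $\pm 1$ encoding of the design matrix and the column-wise expansion of a
multi-allelic site into bi-allelic contrasts can be sketched as follows (not
evaluated). This mirrors the description above and is not the package's
internal code; the handling of individuals that carry neither allele of a
contrast is shown here simply as \texttt{NA}.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: +/-1 encoding of a tri-allelic SNP into bi-allelic contrast columns
snp <- c("A", "A", "C", "C", "T", "T")

# enumerate all bi-allelic pairs found at this site (A-C, A-T, C-T)
pairs <- combn(x = unique(snp), m = 2)

# one design column per pair: +1 for the first allele, -1 for the second;
# individuals carrying neither allele are marked NA here, for illustration
X <- apply(X = pairs, MARGIN = 2, FUN = function(p, snp) {
  ifelse(test = snp == p[1], yes = 1,
         no = ifelse(test = snp == p[2], yes = -1, no = NA))
}, snp = snp)
colnames(X) <- apply(X = pairs, MARGIN = 2, FUN = paste, collapse = "-")
X
@
\end{scriptsize}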

\subsection{Phylogenetic Bias ($B$)}
To control for potential phylogenetic biases (population structure), we devised
the following procedure. First, we use the complete genotype data (all SNPs) to
compute a kinship matrix $K^{N \times N}$, i.e. a dissimilarity matrix for the
$N$ individuals. Alternatively, users can provide their own kinship matrix
(e.g. kinship estimated using more accurate phylogenetic methods). For the group
of individuals defined by an allele of a given SNP, we next compute their mean
kinship distance using the kinship matrix. If the individuals in the group are
closely related, the computed mean kinship distance will be markedly lower than
the mean kinship distance computed from the complete kinship matrix. We define
the phylogenetic bias as:
\begin{gather*}
B = 1 - \hat{d}_{g}/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g}$ is the mean kinship distance between the individuals who
share the genotype $g$, and $\hat{d}_{t}$ is the mean kinship distance of the
complete kinship matrix. For a complete phylogenetic bias, $B = 1$
($\hat{d}_{g} \ll \hat{d}_{t}$); for no bias, $B = 0$ (or slightly negative).
This estimate is computed for each SNP and for each genotype group within each
SNP.

The phylogenetic bias associated with a SNP as a whole is computed as:
\begin{gather*}
B = 1 - \min(\hat{d}_{g_{1}}, \hat{d}_{g_{2}})/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g_{1}}$ and $\hat{d}_{g_{2}}$ represent the mean kinship
distances between the individuals who share the genotype (allele) $g_{1}$ or
$g_{2}$ of a given SNP, and $\hat{d}_{t}$ is the mean kinship distance in the
complete kinship matrix. Again, $B = 1$ indicates a complete phylogenetic bias,
and $B = 0$ (or slightly negative) indicates no bias. This estimate is computed
for each SNP and each pair of genotypes.
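
A minimal sketch of both bias computations for a single bi-allelic SNP is given
below (not evaluated); \texttt{K} stands for a precomputed kinship
(dissimilarity) matrix, \texttt{snp} for the genotype vector, and the alleles
``A''/``T'' are placeholders.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: phylogenetic bias for one bi-allelic SNP, given a kinship matrix K
# (K: N x N dissimilarity matrix; snp: character vector of length N)
meanKinship <- function(K, i) {
  mean(K[i, i][upper.tri(K[i, i])])
}

d.t <- mean(K[upper.tri(K)])                      # all individuals
d.g1 <- meanKinship(K = K, i = which(snp == "A")) # individuals with allele A
d.g2 <- meanKinship(K = K, i = which(snp == "T")) # individuals with allele T

B.g1 <- 1 - d.g1/d.t                              # per-genotype bias (allele A)
B.g2 <- 1 - d.g2/d.t                              # per-genotype bias (allele T)
B.snp <- 1 - min(d.g1, d.g2)/d.t                  # per-SNP bias
@
\end{scriptsize}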


\subsection{Pareto Optimization}
We use Pareto optimization (with R package \texttt{rPref}) to rank the SNPs 
based on their multi-factorial association. Given that $CA$ is encoded into 
$\kappa$, we use only $\beta$ and $\kappa$ with an objective function that 
prioritizes SNPs which score high with respect to both of them. The Pareto
optimization procedure assigns each SNP to a non-dominated front (rank).
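
One way such a ranking could be computed with \texttt{rPref} is sketched below
(not evaluated); \texttt{score} stands for the scores data frame returned by
\texttt{runGenphen} (element \texttt{\$scores}), and the exact call is
illustrative rather than the package's internal code.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: Pareto ranking over kappa and |beta| with rPref (illustration only)
library(rPref)

score$abs.beta <- abs(score$beta.mean)

# prefer SNPs that score high in both dimensions; requesting all rows adds
# a .level column holding the non-dominated front (rank) of each SNP
ranked <- psel(score, high(kappa.mean) * high(abs.beta), top = nrow(score))
head(ranked)
@
\end{scriptsize}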


\newpage
\section{Case studies}
\subsection{I: Association between SNPs and a quantitative phenotype}
\label{sec:case1}

In the first case study, we show a typical genotype-phenotype analysis, in
which the genotype data is a multiple sequence alignment of 120 protein
sequences (individuals), of which we analyze 8 amino acid positions, and the
phenotype is a quantitative trait measured for each individual.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
require(genphen)
require(ggplot2)
require(knitr)
require(ggrepel)
require(reshape)
require(ape)
require(xtable)
require(gridExtra)
options(xtable.floating = FALSE)
@
\end{scriptsize}


\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# genotype as matrix (120x154), we will subset part of it:
data(genotype.saap)

# phenotype as vector (120 measurements):
data(phenotype.saap)
@
\end{scriptsize}


\paragraph{Genotype-phenotype data}
First we show an overview of the distribution of the phenotype across the 
genetic states found at each of the 8 studied positions in the multiple 
sequence alignment.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Format the genotype-phenotype data, such that it can then
# be visualized with ggplot
df <- data.frame(genotype.saap[, 82:89], 
                 phenotype = phenotype.saap,
                 stringsAsFactors = FALSE)
df <- melt(data = df, id.vars = "phenotype")
colnames(df) <- c("phenotype", "site", "genotype")
df$site <- gsub(pattern = "X", replacement = '', x = df$site)
df$site <- factor(x = df$site, levels = unique(df$site))
@


<<echo=TRUE, include=TRUE, results=hide>>=
# Visualization
g <- ggplot(data = df)+
  facet_wrap(facets = ~site, nrow = 2, scales = "free_x")+
  geom_violin(aes(x = genotype, y = phenotype))+
  ylab(label = "Quantitative phenotype")+
  xlab(label = "Genotypes")+
  geom_point(aes(x = genotype, y = phenotype, col = genotype), 
             size = 1, shape = 21, position = position_jitterdodge())+
  scale_color_discrete(name = "genotype")+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")
g
@


\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=4.25>>=
g
@
\end{figure}
\end{scriptsize}

\textbf{Important remark:} We recommend that quantitative phenotypes be roughly
normally (or t-) distributed. While our models are designed to be robust against
outliers, we advise you to transform skewed phenotypes (e.g. by
log-transformation) before the analysis, as sketched below. Here the phenotype
has already been log10-transformed and is approximately normally distributed.
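
For instance, a quick inspection and log-transformation of a skewed phenotype
might look like the following sketch (not evaluated; \texttt{raw.phenotype} is
a placeholder for your own measurements).

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: inspect and log10-transform a skewed phenotype before the analysis
hist(x = raw.phenotype, breaks = 30, main = "raw phenotype")
phenotype.log <- log10(raw.phenotype)
hist(x = phenotype.log, breaks = 30, main = "log10-transformed phenotype")
@
\end{scriptsize}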


\paragraph{Association analysis}
Next, we perform the genetic association study for a single 
\underline{quantitative ('Q')} phenotype with \texttt{genphen} using 
the following settings:
\begin{itemize}
\item A \underline{hierarchical} Bayesian model will be run with \underline{2
MCMC chains} composed of \underline{1,500 iterations} each, including
\underline{500 warmup iterations}.
\item \underline{Random forest} was selected for the statistical learning, which
will be run in a cross-validation mode with \underline{200 iterations}, whereby
in each iteration 66\% of the data (\underline{default: cv.fold = 0.66}) will
be used to train the model.
\item For each metric we report its mean and \underline{95\% HDI}.
\item Whenever possible, \underline{1 core} will be used.
\end{itemize}

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Run genphen
c.out <- genphen::runGenphen(genotype = genotype.saap[, 82:89], 
                             phenotype = phenotype.saap,
                             phenotype.type = "Q",
                             model.type = "hierarchical",
                             mcmc.chains = 2,
                             mcmc.steps = 1500,
                             mcmc.warmup = 500,
                             cores = 1,
                             hdi.level = 0.95,
                             stat.learn.method = "rf",
                             cv.steps = 200)
@
\end{scriptsize}



A typical way of visualizing the \texttt{genphen} results is with the following
plot, where each point represents a SAAP plotted according to x = classification
accuracy ($CA$), y = effect size ($\beta$), and color = Cohen's $\kappa$. The
most promising SAAPs have $CA$ and $\kappa$ close to 1, with a non-null $\beta$,
i.e. a $\beta$ whose 95\% HDI does not overlap 0 (shown as a dashed line in the
figure). The labels show the site of each SAAP in the genotype data and its
constituent genotypes.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Get the scores data
c.score <- c.out$scores

# Some optional formatting for the SNPs (label = site : genotype1 -> genotype2)
c.score$label <- paste(c.score$site, ":", c.score$ref,
                       "->", c.score$alt, sep = '')

# Visualization
g <- ggplot(data = c.score)+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean), 
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))

g
@
\end{scriptsize}



\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=7.5>>=
g
@
\end{figure}

The association scores are also shown in the following table:

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Description:
# Rounds values to 2 decimal points and concatenates the lower and upper
# limits of the HDI into a single string for a simpler visualization
getHdiPretty <- function(x, digits = 2) {
  x[1] <- round(x = x[1], digits = digits)
  x[2] <- round(x = x[2], digits = digits)
  return(paste("(", x[1], ", ", x[2], ")", sep = ''))
}
c.score$beta.hdi <- apply(X = c.score[, c("beta.hdi.low", "beta.hdi.high")],
                          MARGIN = 1, getHdiPretty, digits = 2)
c.score$ca.hdi <- apply(X = c.score[, c("ca.hdi.low", "ca.hdi.high")],
                        MARGIN = 1, getHdiPretty, digits = 2)
c.score$kappa.hdi <- apply(X = c.score[, c("kappa.hdi.low", "kappa.hdi.high")],
                           MARGIN = 1, getHdiPretty, digits = 2)

# Print table
print(xtable(c.score[, c("label", "beta.mean", "beta.hdi", "ca.mean",
                         "ca.hdi", "kappa.mean", "kappa.hdi")],
             align = rep(x = "c", times = 8), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@
\end{scriptsize}


<<echo=FALSE, include=TRUE, results=tex>>=
print(xtable(c.score[, c("label", "beta.mean", "beta.hdi", "ca.mean",
                         "ca.hdi", "kappa.mean", "kappa.hdi")],
             align = rep(x = "c", times = 8), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@



\paragraph{Pareto optimization}
As described in Section~\ref{sec:methods}, we rank the SNPs by Pareto
optimization (R package \texttt{rPref}), using $\beta$ and $\kappa$ with an
objective function that prioritizes SNPs which score high with respect to both
of them. The results of the Pareto optimization procedure are shown below:

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=

# Visualization
g <- ggplot(data = c.score)+
  facet_wrap(facets = ~phenotype.id, scales = "free")+
  geom_line(aes(y = abs(beta.mean), x = kappa.mean, group = rank))+
  geom_point(aes(y = abs(beta.mean), x = kappa.mean, fill = rank),
             shape = 21, size = 4)+
  geom_text_repel(aes(y = abs(beta.mean), x = kappa.mean, label = label), 
                  size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("|"*beta*"|"))+
  xlab(label = expression(kappa))+
  scale_fill_gradientn(colours = terrain.colors(n = 10))+
  theme(legend.position = "top")

g
@
\end{scriptsize}


\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=6, height=5>>=
g
@
\end{figure}



\paragraph{MCMC convergence and sampling issues}
You might want to check the validity your Bayesian inference by inspecting
the \texttt{genphen} output named convergence which contains information about
the markov chain monte carlo (MCMC) simulation done with R package rstan
including potential scale reduction factor (Rhat) and effective sampling size
(ESS), as well as information concerning potential convergence issues such as
divergences and tree depth exceeded warnings. For detailed information about
each warning please read Stan documentation (mc-stan.org/users/documentation/).

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
rstan::check_hmc_diagnostics(c.out$complete.posterior)
rstan::stan_rhat(c.out$complete.posterior)
rstan::stan_ess(c.out$complete.posterior)
rstan::stan_diag(c.out$complete.posterior)
@
\end{scriptsize}




\paragraph{Phylogenetic bias control}
Next, we compute the phylogenetic bias of each mutation, shown in the table
below:

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Compute the phylogenetic bias
bias <- runPhyloBiasCheck(genotype = genotype.saap,
                          input.kinship.matrix = NULL)

# Extract kinship matrix
kinship.matrix <- bias$kinship.matrix

# Extract the bias associated with mutations of the sites which 
# were included in the association analysis
mutation.bias <- bias$bias

# To make site id concordant with data
mutation.bias$site <- mutation.bias$site - 81
mutation.bias <- merge(x = c.score, y = mutation.bias,
                       by = c("site", "ref", "alt"))

# Show the bias table
print(xtable(mutation.bias[, c("site", "ref", "alt", "bias.ref", "bias.alt")],
             align = rep(x = "c", times = 6), digits = 2),
      include.rownames = FALSE, size = "small")
@
\end{scriptsize}


<<echo=FALSE, include=TRUE, results=tex>>=
print(xtable(mutation.bias[, c("site", "ref", "alt", "bias.ref", "bias.alt")],
             align = rep(x = "c", times = 6), digits = 2),
      include.rownames = FALSE, size = "scriptsize")
@


We use the kinship matrix to perform hierarchical clustering, visualizing the
population structure together with two example mutations, with genotype 1
marked in blue and genotype 2 in orange in either case. Individuals not covered
by either genotype are marked in gray. The two examples differ in their degree
of phylogenetic bias.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
color.a <- character(length = nrow(genotype.saap))
color.a[1:length(color.a)] <- "gray"
color.a[which(genotype.saap[, 82] == "h")] <- "orange"
color.a[which(genotype.saap[, 82] == "q")] <- "blue"

color.b <- character(length = nrow(genotype.saap))
color.b[1:length(color.b)] <- "gray"
color.b[which(genotype.saap[, 84] == "a")] <- "orange"
color.b[which(genotype.saap[, 84] == "d")] <- "blue"

c.hclust <- hclust(as.dist(kinship.matrix), method = "average")

par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1)
plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6, 
     type = "fan", main = "B = 0.15")
plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6, 
     type = "fan", main = "B = 0.43")
@
\end{scriptsize}



\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, fig=TRUE, width=9, height=6>>=
par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1)
plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6, 
     type = "fan", main = "B = 0.15")
plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6, 
     type = "fan", main = "B = 0.43")
@
\end{figure}



\subsection{II: Association between a SNP and two phenotypes (quantitative and
dichotomous)}
\label{sec:case2}
In the second case study we show how to use \texttt{genphen} with two
phenotypes of different types (quantitative and dichotomous). Here, the genotype
is a single simulated SNP in 40 individuals. First we show an overview of the
distribution of the phenotypes for each genotype.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# simulate genotype
genotype <- rep(x = c("A", "C", "T", "G"), each = 10)

# simulate quantitative and dichotomous phenotypes
phenotype.Q <- c(rnorm(n = 10, mean = 0, sd = 1),
                 rnorm(n = 10, mean = 0.5, sd = 1),
                 rnorm(n = 10, mean = -0.5, sd = 1),
                 rnorm(n = 10, mean = 2, sd = 1))
phenotype.D <- c(rbinom(n = 10, size = 1, prob = 0.3),
                 rbinom(n = 10, size = 1, prob = 0.5),
                 rbinom(n = 10, size = 1, prob = 0.6),
                 rbinom(n = 10, size = 1, prob = 0.7))
phenotype <- cbind(phenotype.Q, phenotype.D)
rm(phenotype.Q, phenotype.D)

@



<<echo=TRUE, include=TRUE, results=hide>>=
# Format the genotype-phenotype data, such that it can then
# be visualized with ggplot
df <- data.frame(genotype = genotype, 
                 phenotype.Q = phenotype[, 1], 
                 phenotype.D = phenotype[, 2],
                 stringsAsFactors = FALSE)
@


<<echo=TRUE, include=TRUE, results=hide>>=
# Visualization
g1 <- ggplot(data = df)+
  geom_point(aes(x = genotype, y = phenotype.Q, col = genotype), size = 1,
             shape = 21, position = position_jitterdodge(jitter.width = 0.2,
                                                         jitter.height = 0,
                                                         dodge.width = 0.5))+
  xlab(label = "Genotypes")+
  ylab(label = "Phenotype (Q)")+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")

g2 <- ggplot(data = df)+
  geom_point(aes(x = genotype, y = phenotype.D, col = genotype), size = 1,
             shape = 21, position = position_jitterdodge(jitter.width = 0.2,
                                                         jitter.height = 0.05,
                                                         dodge.width = 0.5))+
  xlab(label = "Genotypes")+
  scale_y_continuous(name = "Phenotype (D)", 
                     breaks = c(0, 1), labels = c(0, 1))+
  theme_bw(base_size = 14)+
  theme(legend.position = "none")

gridExtra::grid.arrange(g1, g2, ncol = 2)
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=6, height=3.5>>=
gridExtra::grid.arrange(g1, g2, ncol = 2)
@
\end{figure}


\textbf{Important remark:} The dichotomous phenotype can be provided as either
a numeric or a character vector. Its elements are then encoded into two
categories (1 and 0). If you prefer a specific encoding (i.e. which category is
mapped to 1 and which to 0), perform the encoding before the analysis.
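
A minimal sketch of such an encoding (not evaluated), using hypothetical
case/control labels:

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: explicit encoding of a case/control phenotype before the analysis
status <- c("case", "control", "case", "control")  # hypothetical labels
phenotype.D <- ifelse(test = status == "case", yes = 1, no = 0)
@
\end{scriptsize}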


\paragraph{Association analysis}
We perform a genetic association study for \underline{multiple phenotypes} of
different types, one quantitative ('Q') and one dichotomous ('D'), using the
following settings:
\begin{itemize}
\item A \underline{univariate} Bayesian model will be run with \underline{4 MCMC
chains} composed of \underline{1,500 iterations} each, including \underline{500
warmup iterations}.
\item A \underline{support vector machine} was selected for the statistical
learning, which will be run in a cross-validation mode with \underline{500
iterations}, whereby in each iteration 80\% of the data will be used to train
the model.
\item All estimates will be reported with their mean and \underline{95\%
HDI}.
\item Whenever possible, \underline{1 core} will be used.
\end{itemize}

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# run genphen
m.out <- genphen::runGenphen(genotype = genotype,
                             phenotype = phenotype,
                             phenotype.type = c("Q", "D"),
                             model.type = "univariate",
                             mcmc.chains = 4,
                             mcmc.steps = 1500,
                             mcmc.warmup = 500,
                             cores = 1,
                             hdi.level = 0.95,
                             stat.learn.method = "svm",
                             cv.steps = 500,
                             cv.fold = 0.8)
@
\end{scriptsize}


Once again we visualize the \texttt{genphen} results with a plot in which each
point represents the SNP, plotted according to x = classification accuracy
($CA$), y = effect size ($\beta$) with error bars representing the 95\% HDI,
and color = Cohen's $\kappa$.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Get the scores data
m.score <- m.out$scores

# Some optional formatting for the SNPs 
# (label = site : genotype1 -> genotype2)
m.score$label <- paste(m.score$site, ":", m.score$ref,
                       "->", m.score$alt, sep = '')


# Visualization
g1 <- ggplot(data = m.score[m.score$phenotype.id == 1, ])+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean), 
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))+
  ggtitle(label = "Phenotype Q")

g2 <- ggplot(data = m.score[m.score$phenotype.id == 2, ])+
  geom_errorbar(aes(x = ca.mean, ymin = beta.hdi.low, ymax = beta.hdi.high),
                width = 0.015, col = "darkgray")+
  geom_point(aes(x = ca.mean, y = beta.mean, fill = kappa.mean), 
             shape = 21, size = 4)+
  geom_text_repel(aes(x = ca.mean, y = beta.mean, label = label), size = 5)+
  theme_bw(base_size = 14)+
  ylab(label = expression("Effect size ("*beta*") (with 95% HDI)"))+
  scale_x_continuous(name = "CA", limits = c(0, 1.05))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  theme(legend.position = "top")+
  scale_fill_distiller(palette = "Spectral", limits = c(-0.2, 1))+
  guides(fill = guide_colorbar(barwidth = 10, barheight = 1.5))+
  ggtitle(label = "Phenotype D")

grid.arrange(g1, g2, ncol = 2)
@
\end{scriptsize}


\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=7.5>>=
grid.arrange(g1, g2, ncol = 2)
@
\end{figure}




\newpage
\section{Extra Utilities}
\label{sec:utilities}
\subsection{Data Reduction}
The methods implemented in \texttt{genphen} are statistically superior to the
ones implemented by most classical (frequentist) tools for GWAS. The major
challenge, however, is the substantially increased computational cost when
analyzing the effects of hundreds of thousands of SNPs. Motivated by the
biological assumption that the major fraction of the studied SNPs are
non-informative (genetic noise) with respect to the selected phenotype, various
data reduction techniques can be used to quickly scan the SNP space and discard
a substantial portion of the SNPs deemed clearly non-informative.

Our data reduction procedure includes the following steps:
\begin{enumerate}
\item The complete data (genotypes and a single phenotype) is used to train a
random forest (RF) model, which quantifies the importance of each SNP/SAAP in
explaining the phenotype.
\item We then plot the distribution of variable importances to gain insight
into the structure of the importance values and potentially dissect the signal
from the noise.
\item The main analysis can then be performed with \texttt{runGenphen} using
only a subset of SNPs selected based on their importance (see the sketch at the
end of this section).
\end{enumerate}

Using a case study based on simulated data of 10,000 SNPs and 60 individuals,
we illustrate the typical data reduction steps in more detail. The typical
runtime for the provided dataset ($60 \times 10,000$) is a few minutes.

\begin{scriptsize}
<<echo=TRUE, include=F, results=hide>>=
# Simulate 10,000 SNPs (columns) for 60 individuals, plus a quantitative phenotype
set.seed(seed = 551155)
g1 <- replicate(n=1*10^4, expr=as.character(
  rbinom(n=30, size = 1,prob = 0.49)))
g2 <- replicate(n=1*10^4, expr=as.character(
  rbinom(n=30, size = 1,prob = 0.51)))
gen <- rbind(g1, g2)
phen <- c(rnorm(n = 30, mean = 3, sd = 3), 
          rnorm(n = 30, mean = 5, sd = 3))
@
\end{scriptsize}

\begin{scriptsize}
<<echo=TRUE, include=F, results=hide>>=
# Run diagnostics
diag <- genphen::runDiagnostics(genotype = gen,
                                phenotype = phen,
                                phenotype.type = "Q",
                                rf.trees = 50000)
@
\end{scriptsize}

We can inspect the distribution of importances and select a set of promising
SNPs based on their importance scores, which can then be studied in the main
association study as explained previously.

\begin{scriptsize}
<<echo=TRUE, include=TRUE, results=hide>>=
# Visualization
g <- ggplot(data = diag)+
  geom_density(aes(importance))+
  xlab("Importance")+
  theme_bw(base_size = 14)+
  scale_x_continuous(trans = "log10")+
  annotation_logticks(base = 10, sides = "b")
g
@
\end{scriptsize}

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=6>>=
g
@
\end{figure}
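
A possible final step is sketched below (not evaluated): keep only the
highest-ranking SNPs and pass the reduced genotype matrix to
\texttt{runGenphen} as in the previous sections. The 95th-percentile cutoff is
arbitrary and should be chosen from the importance distribution of your own
data, and the sketch assumes that the rows of \texttt{diag} follow the column
order of \texttt{gen}.

\begin{scriptsize}
<<echo=TRUE, eval=FALSE>>=
# Sketch: keep only the most important SNPs and rerun the main analysis
cutoff <- quantile(x = diag$importance, probs = 0.95)
keep <- which(diag$importance >= cutoff)
gen.reduced <- gen[, keep]

reduced.out <- genphen::runGenphen(genotype = gen.reduced,
                                   phenotype = phen,
                                   phenotype.type = "Q",
                                   model.type = "hierarchical",
                                   mcmc.chains = 2,
                                   mcmc.steps = 1500,
                                   mcmc.warmup = 500,
                                   cores = 1,
                                   hdi.level = 0.95,
                                   stat.learn.method = "rf",
                                   cv.steps = 200)
@
\end{scriptsize}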


%By visualizing the estimated slope coefficients of the diagnosed SNPs, we can 
%observe an enrichment of non-informative (statistically not-significant) SNPs 
%beyond the rank of 1,000. We can thus narrow down our interval of interest to
%the top-ranked 2,500 SNPs, yielding massive data reduction of 95\%, while still
%retaining many SNPs with small effects.



\end{document}