%\textwidth=6.2in \textheight=8.5in
%\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in
%\headheight=-.3in
%\VignetteIndexEntry{Case study: predicting protein function}
%\VignettePackage{BiocStyle}
%\VignetteEngine{utils::Sweave}

\documentclass[12pt]{article}

<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\usepackage{dsfont}
\usepackage{booktabs}
\usepackage{fixltx2e} %subscripts
%\usepackage{hyperref}


\newcommand{\textq}[1]{``#1''}
% \newcommand{\Rcode}[1]{{\texttt{#1}}}
% \newcommand{\Robject}[1]{{\texttt{#1}}}
% \newcommand{\Rfunction}[1]{{\texttt{#1}}}
% \newcommand{\Rpackage}[1]{{\textit{#1}}}
% \newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}


\begin{document}
\SweaveOpts{concordance=TRUE, include = TRUE, echo=TRUE}
%\SweaveOpts{fig=FALSE, prefix.string=yeast_latexfiles/}
%\setkeys{Gin}{width=1\textwidth}
\bioctitle[The diffuStats R package]{diffuStats: an R package to compute
diffusion-based scores on biological networks}
\author[1,2]{Sergio Picart-Armada\thanks{\email{sergi.picart@upc.edu}}}
\author[3,4]{Wesley K. Thompson}
\author[3]{Alfonso Buil}
\author[1,2]{Alexandre Perera-Lluna}
\affil[1]{B2SLab, Departament d'Enginyeria de Sistemes,
Autom\`atica i Inform\`atica Industrial,
Universitat Polit\`ecnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain}
\affil[2]{Institut de Recerca Pedi\`atrica
Hospital Sant Joan de D\'eu, Esplugues de Llobregat, Barcelona, 08950, Spain}
\affil[3]{Mental Health Center Sct. Hans, 4000 Roskilde, Denmark}
\affil[4]{Department of Family Medicine and Public Health,
University of California, San Diego, La Jolla, California, USA}
%\small{\author{Sergio Picart-Armada,
%Wesley K. Thompson, \\
%Alfonso Buil and Alexandre Perera-Lluna}}
\maketitle

\section{Abstract}

Label propagation is a standard and ubiquitous
procedure in computational biology for giving context
to molecular entities.
Node labels, which can derive from gene expression,
genome-wide association studies,
protein domains or metabolomics profiling,
are propagated to their neighbours,
effectively smoothing the scores through
prior annotated knowledge and prioritising novel candidates.
However, there are several settings to tune
when defining the diffusion process,
including the diffusion kernel,
the numeric codification of the labels and
a choice of statistical normalisation of the scores.
These settings can have a large impact
on results, and there is currently no
software implementing many of them in one place
to screen their performance in the application of interest.
This vignette presents \Rpackage{diffuStats}, an R package with a
collection of diffusion kernels and scores,
as well as a parallel permutation analysis for the normalised scores,
that eases the analysis of several sets of molecular entities at once.

\section{Introduction}

The application of label propagation algorithms \cite{labelpropagation}
is based on the guilt by association principle \cite{gba},
which can be rephrased in the protein-protein interaction context as
\textq{proteins that interact are more likely to share biological functions}.
However, this principle is extremely general and has proven
successful in numerous bioinformatics applications.

HotNet \cite{hotnet} uses a diffusion process with mutated genes as
seed nodes to find modules with a statistically high number of mutated genes
in cancer.
Another attempt to find relevant
modules from gene expression and mutation data
can be found in \cite{mosca},
where the authors propose a diffusion process followed
by a statistical normalisation and an automatic process
to extract a subnetwork.
TieDIE \cite{tiedie} runs two diffusion processes to
link perturbation in the genome with changes in the transcriptome,
effectively linking two sets of genes.
GeneMANIA \cite{genemania} is a web server that predicts
gene function using label propagation with a bias on the
unlabelled nodes.
Network-based learning sharing a background with diffusion
has also been applied to protein classification using
multiple networks
\cite{fastprotein}.
Label propagation using graph kernels
has proven successful in gene-disease
association \cite{valentini, diffusion_gwas}.
Equivalent formulations can be found under different terminology,
like the electrical model applied to prioritise candidate
genes in eQTL in \cite{electricalmodel}.

The heterogeneity of applications hinders comparisons among approaches;
tools that gather the state of the art are therefore much needed.
An existing solution is \Rpackage{RANKS}, an
R package that contains a variety of diffusion kernels and kernelised
scores for label propagation using a binary input vector.
\Rpackage{RANKS} eases the benchmarking of kernelised scores
and models label propagation as a semi-supervised \textq{one-class}
classification problem, in which only some members
of the class (positives) are known.
Another possibility is to divide nodes into labelled positive, 
negative and unlabelled, like in \cite{genemania}, 
which poses questions like the effect of unlabelled nodes
and possible numeric codifications of the labels,
or the option to include quantitative data in the labels.
In addition, statistical normalisations such as in \cite{mosca}
remove the effect of network structures like hubs and
should be taken into account when choosing a diffusion scoring
method.
This motivates the introduction of our \Rpackage{diffuStats}
R package, which collects widely adopted input codifications and
explicitly accounts for unlabelled nodes.
It also includes three statistically normalised scores,
which can be obtained through Monte Carlo trials or
a parametric alternative.
The \Rpackage{diffuStats} package uses existing classes and
provides high-level functions to screen the performance of
several diffusion scores, in order to facilitate their integration
in any computational biology study.

\section{Methodology}

One of the main purposes of \Rpackage{diffuStats} is to offer
a battery of approaches to compute and compare diffusion scores.
The diffusion scores $f$ using an input vector $y$ and
a diffusion kernel $K$ are generally computed as $$ f = K\cdot y$$
possibly followed by further adjustments or a statistical normalisation.

The decisions taken in the definition of $K$, $y$ and the posterior
normalisation generally give rise to
different prioritisations, due to different treatments of
the balance between positive and negative examples,
the unlabelled data and the network structure.
The following sections cover the implemented choices
for the kernel $K$ and the initial labels $y$.

\subsection{Diffusion kernels and regularisation}

The representation of any kind of data in a network model
allows the definition of notions like distance or similarity
based on the links in the network.
This section will follow the notation in \cite{smola} and
summarise the kernels proposed by the authors.
In general, an undirected graph $G = (V, E)$ consists of a set
of $n$ nodes $V$ and a set of edges $E$ of unordered pairs of nodes.
This can be extended to weighted, undirected graphs, where each
edge $i \sim j$ has a weight attribute $W_{ij} \in [0, \infty)$.
The degree matrix of $G$ is defined as the $n \times n$
diagonal matrix so that $D_{ii} = \sum_{j=1}^n W_{ij}$.
The (unnormalised) Laplacian of $G$ is defined as
the $n \times n$ matrix $L = D - W$,
whereas its normalised version is
$\tilde{L} = D^{-\frac{1}{2}}\cdot L \cdot D^{-\frac{1}{2}}$.

The graph Laplacian is diagonalisable and
can be written in terms of its eigenvalues $\lambda_j$
and eigenvectors $v_j$,
as $L = \sum_{j=1}^n \lambda_j v_j v_j^T$.
The proposed kernels stem from
a family of regularisation functions $r(\lambda)$
on the spectrum of the graph Laplacian:
$$K = \sum_{j=1}^n r^{-1}(\lambda_j) v_j v_j^T$$

Well known graph kernels belong to this family because
they can be written as transformations on the Laplacian spectrum.
Table \ref{tab:kernels} summarises them, assuming
the usage of the normalised Laplacian -
the unnormalised Laplacian can also be used
as long as the resulting kernel is still positive semidefinite.
Further details about this family of kernels,
all available in our package \Rpackage{diffuStats}, can be found
in the original manuscript \cite{smola}.

\begin{table}[!t]
\begin{center}
\begin{tabular}{@{}ll@{}}
\toprule
Kernel & Function  \\\midrule
Regularised Laplacian &
    $r(\lambda) = 1 + \sigma^2\lambda$  \\
Diffusion process &
    $r(\lambda) = \exp(\frac{\sigma^2}{2}\lambda)$  \\
$p$-Step random walk &
    $r(\lambda) = (a - \lambda)^{-p}$ with $a \geq 2$, $p \geq 1$ \\
Inverse cosine &
    $r(\lambda) = (\cos(\lambda \frac{\pi}{4}))^{-1}$ \\\bottomrule
\end{tabular}
\caption{Implemented diffusion kernels from \cite{smola}}
\label{tab:kernels}
\end{center}
\end{table}

Additionally, the \Rpackage{diffuStats} package includes
the commute time kernel, introduced in \cite{ctkernel}.
This kernel, also writable in terms of a regularisation function,
is simply the pseudoinverse
of the graph Laplacian, $K = L^+$.

The default option in the \Rpackage{diffuStats} package is
the regularised Laplacian kernel, as it is widely adopted
and describes many physical models, for instance in
\cite{hotnet, tiedie, electricalmodel}.
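
As a quick illustration, the default kernel can be computed directly from
an \Rpackage{igraph} object.
The chunk below is a minimal sketch, not evaluated in this vignette;
it only assumes a small toy graph built with \Rfunction{barabasi.game}.

<<kernel-sketch, eval=FALSE>>=
library(igraph)
library(diffuStats)

# Toy undirected scale-free network
g <- igraph::barabasi.game(100, m = 2, directed = FALSE)

# Default kernel: the regularised Laplacian
K <- diffuStats::regularisedLaplacianKernel(g)
dim(K) # dense n x n matrix
@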

\subsection{Diffusion scores}

Besides choosing a graph kernel, the codification of the input and
the presence of a statistical normalisation can lead to important
differences in the results.
Table \ref{tab:scores} gives an overview of the implemented scores,
which are detailed in the following sections.
The argument \Rfunarg{method} in the function \Rfunction{diffuse}
can be set to any of the scores in table \ref{tab:scores}.
The numeric values of the
positive, negative and unlabelled examples are
respectively $y^+$, $y^-$ and $y^u$.
Column \textq{normalised} refers to the application of a statistical
model involving permutations, whereas \textq{stochastic}
enumerates the normalised scores that need actual
Monte Carlo permutations. 
The scores that also accept quantitative inputs instead of binary labels 
are listed in the \textq{quantitative} column.

\begin{table}[!t]
\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{@{}llllllll@{}}
\toprule
Score & $y^+$ & $y^-$ & $y^u$ & Normalised & Stochastic & 
Quantitative & Reference \\\midrule
raw & 1 & 0 & 0 & No & No & Yes & \cite{hotnet} \\
ml & 1 & -1 & 0 & No & No & No & \cite{fastprotein, labelpropagation} \\
gm & 1 & -1 & $k$ & No & No & No & \cite{genemania} \\
ber\textsubscript{s} & 1 & 0 & 0 & No & No & Yes & \cite{mosca} \\
ber\textsubscript{p} & 1 & 0 & 0* & Yes & Yes & Yes & \cite{mosca} \\
mc & 1 & 0 & 0* & Yes & Yes & Yes & \cite{mosca} \\
z & 1 & 0 & 0* & Yes & No & Yes & \cite{kerneltesting} \\\bottomrule
\end{tabular}
}
\caption{Implemented diffusion scores}
\label{tab:scores}
\end{center}
\end{table}

\subsubsection{Scores without statistical normalisation}

The base diffusion score \Rcode{raw}, which has been used in
algorithms like HotNet \cite{hotnet} and TieDIE \cite{tiedie},
solves a diffusion problem in terms of the
regularised Laplacian kernel \cite{smola}.

$$ f_{raw} = K\cdot y_{raw} $$

$K$ is the kernel matrix and $y_{raw}$ the vector of codified inputs.
In general, the $i$-th component of $y$
equals $y^+$ if node $i$ is a positive,
$y^-$ if $i$ is a negative and $y^u$ if $i$ is unlabelled.
In the particular case of $y_{raw}$, the positively labelled nodes
introduce one flow unit in the network ($y_{raw}^{+}=1$),
whereas the negative and unlabelled nodes are treated
equivalently and do not introduce anything
($y_{raw}^{-}=y_{raw}^{u}=0$).
In the physical model using the regularised Laplacian kernel,
the flow can be evacuated from the graph due to the presence of
first-order leaking in every node,
see \cite{hotnet} for further details on this.
This formulation is equivalent, up to a scaling factor,
to the average score in \Rpackage{RANKS}.

On the other hand, the classical label propagation
\cite{labelpropagation} treats positives
as $y_{ml}^{+}=1$ and negatives as $y_{ml}^{-}=-1$, while unlabelled nodes
remain as $y_{ml}^{u}=0$, thus making a distinction between the last two.
A biological example can be found 
in protein classification \cite{fastprotein}.
This option is available as \Rcode{ml}, and intuitively scores a
node by counting if the majority of its neighbours
vote positive or negative:
$$ f_{ml} = K\cdot y_{ml} $$

The authors of GeneMANIA \cite{genemania} propose a modification
on the \Rcode{ml} input - they adhere to
$y_{gm}^{+}=1$ and $y_{gm}^{-}=-1$, but introduce a bias term in the
unlabelled nodes
$$y_{gm}^{u} = \frac{n^{+}-n^{-}}{n^{+}+n^{-}+n^{u}}$$
where $n^{+}$, $n^{-}$ and $n^{u}$ are the numbers of positive,
negative and unlabelled entities, respectively.
The \Rcode{gm} score is then computed through
$$ f_{gm} = K\cdot y_{gm} $$

The last unnormalised option, named \Rcode{ber\_s} \cite{mosca},
is a quantification of the relative change in the node score
before and after the network smoothing. The score for a
particular node $i$ can be written as
$$ f_{ber_s, i} = \frac{f_{raw, i}}{y_{raw, i} + \epsilon}$$

where $\epsilon > 0$ is a parameter that regulates the importance of
the relative change.
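
All the scores in this section reduce to a matrix-vector product on
different input codifications.
The sketch below makes this explicit; it is not evaluated and assumes
hypothetical objects \Robject{K} (a kernel matrix) and logical vectors
\Robject{y\_pos}, \Robject{y\_neg} marking the positive and negative nodes.

<<unnorm-scores-sketch, eval=FALSE>>=
# raw: positives 1, negatives and unlabelled 0
y_raw <- 1*y_pos
# ml: positives 1, negatives -1, unlabelled 0
y_ml <- 1*y_pos - 1*y_neg
# gm: like ml, plus a bias term on the unlabelled nodes
n_pos <- sum(y_pos)
n_neg <- sum(y_neg)
n_unl <- length(y_pos) - n_pos - n_neg
y_gm <- y_ml
y_gm[!(y_pos | y_neg)] <- (n_pos - n_neg)/(n_pos + n_neg + n_unl)

f_raw <- K %*% y_raw
f_ml  <- K %*% y_ml
f_gm  <- K %*% y_gm
@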

\subsubsection{Scores with statistical normalisation}

Recently, the combination of a permutation analysis with
diffusion processes has been suggested \cite{mosca}.
This is a way to quantify how the diffusion score
of a certain node compares to its score if the
input was randomised - nodes that might have systematically
high or low scores regardless of the input are
normalised accordingly.

The cornerstone of normalised scores is
the empirical p-value \cite{empiricalpvalue}
that indicates, for a node $i$,
the proportion of input permutations
that led to a diffusion score as high or higher than the
original diffusion score.
Specifically, $f_{raw}$ is compared
to scores from random trials $j$,
$f_{raw}^{null, j} = K\cdot \pi_j(y_{raw})$,
where $\pi_j(y_{raw})$ is a permutation
of $y_{raw}$ on the labelled entities.
The empirical p-value for node $i$ is therefore defined as in
\cite{empiricalpvalue}:
$$p_i = \frac{r_i + 1}{n + 1}$$
where $r_i$ is the number of trials $j$ in which
$f_{raw, i}^{null, j} \geq f_{raw, i}$,
and $n$ is the total number of trials.

To be consistent with the
increasing direction of the scores,
the \Rcode{mc} scores are defined as
$$f_{mc, i} = 1 - p_i$$
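
As a didactic sketch, the \Rcode{mc} scores can be written in plain R
with explicit permutations (the package performs this step in parallel
C++ through \Rfunction{diffuse\_mc}).
The chunk is not evaluated and assumes hypothetical objects
\Robject{k\_lab}, the rows of $K$ restricted to the labelled columns,
and \Robject{y\_lab}, the inputs on the labelled nodes.

<<mc-sketch, eval=FALSE>>=
f_obs <- as.vector(k_lab %*% y_lab)
n_perm <- 1000
r <- numeric(length(f_obs))
for (j in seq_len(n_perm)) {
    # permute the input over the labelled nodes only
    f_null <- as.vector(k_lab %*% sample(y_lab))
    r <- r + (f_null >= f_obs)
}
p <- (r + 1)/(n_perm + 1) # empirical p-values
f_mc <- 1 - p
@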

Importantly, the permutation has been applied only to the
observed nodes.
Therefore, any node outside the labelled
background cannot receive a different input
score in a permuted input.
This implies that even if both negatives and unlabelled nodes
are assigned the same input value ($y^{-}=y^{u}=0$),
the negatives are actually permuted and eventually exchanged
for a $1$ in some permutations provided that
the number of runs is large enough.
For this reason, negatives and unlabelled nodes are not
equivalent in these scores.

A parametric alternative to $f_{mc}$ are the \Rcode{z} scores,
where each node $i$ is scored as:

$$f_{z,i} = \frac{f_{raw, i} - E(f_{raw, i}^{null, j})}
{\sqrt{V(f_{raw, i}^{null, j})}}$$

The expectation and variance are computed in a closed form
and these scores do not require running actual permutations,
therefore saving on computational time and avoiding
stochastic models.
The analytical expression derived below also covers the general case
in which the input has quantitative labels.

First of all, note that all the unlabelled nodes will cancel out when
subtracting the expected value, so they can be excluded
without loss of generality.
Thus, let the row vector $\tilde{k}_{i}$ be the $i$-th row of the
submatrix from $K$ including the columns
indexed by the labelled nodes only, and let
$n_{lab}$ be the number of labelled nodes.
Analogously, let $\tilde{y}$ be the (quantitative or
binary) inputs for the observed nodes,
with the same indexing as $\tilde{k}_{i}$.
Consider the following scalar quantities:

$$ Sk_i^I = \sum_{j = 1}^{n_{lab}} (\tilde{k}_{i})_{j} $$
$$ Sk_i^{II} = \sum_{j = 1}^{n_{lab}} (\tilde{k}_{i})_{j}^2 $$

$$ Sy^I = \sum_{j = 1}^{n_{lab}} \tilde{y}_j $$
$$ Sy^{II} = \sum_{j = 1}^{n_{lab}} \tilde{y}_j^2 $$

Using these definitions and notation, the raw score
for node $i$ is

$$f_i = \tilde{k}_{i} \cdot \tilde{y}$$

Its null score using a permuted input score is the random variable

$$ f_i^{null} = \tilde{k}_{i} \cdot X $$

where $X$ is the random variable that arises from permuting $\tilde{y}$.
In order to compute the z-scores, the expected value and variance
of $f_i^{null}$ are needed.
Starting with its expected value,

$$ E_X(f_i^{null}) = E_X(\tilde{k}_{i} \cdot X) =
\tilde{k}_{i}\cdot E_X(X) =
\tilde{k}_i \cdot\frac{\sum_{j = 1}^{n_{lab}}
\tilde{y}_j}{n_{lab}}\cdot\mathds{1}=
\frac{Sk_i^I \cdot Sy^I}{n_{lab}} $$

where $\mathds{1}$ is the $n_{lab}$-th dimensional
column vector full of ones.
Regarding its variance,

$$ V_X(f_i^{null}) = V_X(\tilde{k}_{i} \cdot X) =
\tilde{k}_{i}\cdot V_X(X)\cdot \tilde{k}_{i}^T $$

The covariance of $X$ can be written as

$$ V_X(X) = \left[ \frac{\sum_{j = 1}^{n_{lab}}
\tilde{y}_j^2}{n_{lab}} - \left(\frac{\sum_{j = 1}^{n_{lab}}
\tilde{y}_j}{n_{lab}}\right)^2 \right] \left[ \frac{n_{lab}}{n_{lab} - 1}Id
- \frac{1}{n_{lab} - 1}\mathds{1}\mathds{1}^T \right]$$

where $Id$ is the $n_{lab} \times n_{lab}$ identity matrix.
Expanding, the variance of $f_i^{null}$ can finally be computed as

$$ V_X(f_i^{null}) = \frac{1}{(n_{lab} - 1)n_{lab}^2}
\left[n_{lab}Sy^{II} - (Sy^I)^2 \right]
\left[n_{lab}Sk_i^{II} - (Sk_i^I)^2 \right]$$

The z-score for node $i$ is, in terms of the new notation:

$$f_{z,i} = \frac{f_{i} - E_X(f_{i}^{null})}
{\sqrt{V_X(f_{i}^{null})}}$$

This closes the \Rcode{z} scoring option, which clearly does not require
any actual permutations but normalises 
through the theoretical first and second
statistical moments of the null distribution of the diffusion scores.
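
The closed form translates directly into vectorised R.
As above, the sketch is not evaluated and assumes hypothetical objects
\Robject{k\_lab} (rows of $K$ restricted to the labelled columns) and
\Robject{y\_lab} (the corresponding inputs).

<<z-sketch, eval=FALSE>>=
n_lab <- length(y_lab)
Sk1 <- rowSums(k_lab)   # Sk_i^I
Sk2 <- rowSums(k_lab^2) # Sk_i^II
Sy1 <- sum(y_lab)       # Sy^I
Sy2 <- sum(y_lab^2)     # Sy^II

f_obs  <- as.vector(k_lab %*% y_lab)
e_null <- Sk1*Sy1/n_lab
v_null <- (n_lab*Sy2 - Sy1^2)*(n_lab*Sk2 - Sk1^2)/((n_lab - 1)*n_lab^2)
f_z    <- (f_obs - e_null)/sqrt(v_null)
@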

Finally, the authors in \cite{mosca} also suggest
combining a classical score with a statistically normalised one.
Available as \Rcode{ber\_p}, the score of node $i$ is
defined as
$$f_{ber_p, i} = -\log_{10}(p_i)\cdot f_{raw, i}$$
This approach corrects the original diffusion scores by
the effect of the network, in order to mitigate the effect of
structures like hubs.

\subsubsection{Quantitative inputs}

In its current release, \Rpackage{diffuStats} also accepts quantitative
labels as input in the following scores: \Rcode{raw},
\Rcode{ber\_s}, \Rcode{z}, \Rcode{ber\_p} and \Rcode{mc}.
The scores \Rcode{ml} and
\Rcode{gm} are naturally excluded from non-binary inputs due
to their definitions.
Currently, \Rcode{mc} scores (and therefore \Rcode{ber\_p}
scores as well) accept quantitative inputs,
which are treated as sparse.
Dense continuous inputs may take longer to compute;
support for them will be extended in future versions.
Beware that quantitative inputs should be
meaningful and one-tailed (monotonic) -
the nature of diffusion processes involves averaging and strong
effects with opposing signs cancel out instead of adding up.

\subsection{Implementation, functions and classes}

The package \Rpackage{diffuStats} is mainly implemented in R \cite{r}, but
takes advantage of existing classes in \Rpackage{igraph} \cite{igraph} and
basic data types, thus not introducing any new data structure -
this minimises the learning effort by the final user.
Inputs and outputs are conceived to require minimal
formatting effort in the whole analysis.
The computationally intense
stochastic part of the permutation analysis is
implemented in C++ through the packages
\Rpackage{Rcpp} \cite{rcpp}, \Rpackage{RcppArmadillo} \cite{rcpparmadillo}
and parallelised through \Rpackage{RcppParallel} \cite{rcppparallel}.
Package \Rpackage{diffuStats} also contains documented functions and
unit testing with small cases to spot potential bugs.
Two vignettes facilitate the user experience: an introductory
vignette showing the basic usage of the functions on a synthetic
example and this vignette, which further describes the contents
of the package and shows an application to real data.

\begin{figure}[!tpb]%figure1
{\includegraphics{diffustats_fig_overview.eps}}
\caption{Overview of the main package functions.}
\label{fig:overview}
\end{figure}

A diagram containing the main functions in the R
package \Rpackage{diffuStats} can be found in Fig. \ref{fig:overview}.
The main function is \Rfunction{diffuse},
a wrapper for computing diffusion scores
from several categories at once, stemming from possibly different observed
backgrounds. Function \Rfunction{diffuse} makes use of the deterministic
\Rfunction{diffuse\_raw} or the stochastic 
\Rfunction{diffuse\_mc} implementations
and combines them to give the desired scores for all the nodes
in each of the observed backgrounds.
The second wrapper \Rfunction{perf} compares the
result of the diffusion scores to target validation scores.
Validation scores might include only part of the nodes of the network
and these nodes can be background-specific.

Regarding memory and processing power requirements,
the analysis of networks with more than 10,000 nodes might need
additional RAM and processing power.
The main reason is the
manipulation of dense graph kernel matrices that scale quadratically with
the network order.
To give a reference, a dense matrix of double-precision real
numbers with 10,000 rows and columns
uses roughly 800MB of memory.
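
This back-of-the-envelope estimate follows from the 8 bytes that each
double-precision entry occupies:

<<mem-sketch, eval=FALSE>>=
n <- 10000
n^2 * 8 / 2^20 # about 763 MiB, i.e. roughly 800MB
@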
Computing a graph kernel on a large network
can require, depending on the kernel, matrix diagonalisation,
matrix products and matrix inversions, all of which are likely to use
a considerable amount of memory and time.

\subsection{Limitations}

The kernel framework is known to scale poorly with the number
of nodes of the network when the kernel is explicitly computed and dense. 
Therefore, \Rpackage{diffuStats} is best suited for manipulating biological 
networks of a medium size - a few tens of thousands of nodes.
Protein-protein interaction networks can have around 20,000 nodes, 
which is also the limit of the capabilities of \Rpackage{diffuStats}, 
as it is now, on a standard workstation with 16GB of physical memory.

In particular, the explicit kernel computation is a demanding step, 
although it only has to be performed once with a given network.
Figure \ref{fig:profiling} contains a profiling on the kernel computation
using a Compaq q8100 workstation 
(Intel Core i5 650 @ 3.20GHz, 16GB physical memory). 
This should give an approximation of what to expect in terms of 
time and memory consumption. 
The same figure contains the memory usage of the kernel matrix per se.

\begin{figure}[!tpb]%figure1
{\includegraphics{diffustats_fig_profiling.eps}}
\caption{Profiling of the computation of the regularised Laplacian kernel
on a synthetic network of order $n$. 
Undirected networks are generated through \Rfunction{barabasi.game} 
in \CRANpkg{igraph} using default parameters and \Rfunarg{m = 6}
(each node adds 6 edges).}
\label{fig:profiling}
\end{figure}

On the other hand, the parallel implementation
of the stochastic permutation analysis is another demanding task. 
It has been optimised under the assumption that the number of positives
is usually low.
Lists of scores with a very high number of positives might
slow down the permutation analysis, but not the parametric \Rcode{z} scores.

\section{Getting started}

This vignette contains a classical example of label propagation
on a biological network. The core tools for this analysis are
the \Rpackage{igraph} R package \cite{igraph} and the
\Rpackage{diffuStats} package,
whereas \Rpackage{ggplot2} \cite{ggplot2} is a convenient tool to
plot the results.

The data for this example is the \Rcode{yeast} interactome with
functional annotations, as found in the data package \Rcode{igraphdata}
\cite{igraphdata}.

<<01_imports>>=
# Core
library(igraph)
library(diffuStats)

# Plotting
library(ggplot2)
library(ggsci)

# Data
library(igraphdata)
data(yeast)
set.seed(1)
@

\subsection{Data description}

A summary of the network object can be obtained by just
showing the object:

<<02_data1>>=
summary(yeast)
@

For this analysis, only the largest connected
component of this graph will be used,
although the algorithms can handle graphs
with several connected components.

<<02_data2>>=
yeast <- diffuStats::largest_cc(yeast)
@

This yields a graph with \Sexpr{format(vcount(yeast))} nodes and
\Sexpr{format(ecount(yeast))} edges.
There are several attributes that can be of interest.
First of all, the \Rcode{name} of the protein nodes:

<<02_data3>>=
head(V(yeast)$name)
@

Furthermore, the corresponding aliases and complete names can
be found in \Rcode{Description}

<<02_data4>>=
head(V(yeast)$Description)
@

The labels to perform network propagation are MIPS categories
\cite{mips}, which provide means to classify proteins regarding their
function. These functions are coded as characters in the \Rcode{yeast}
object, in the node attribute \Rcode{Class}

<<02_data5>>=
table_classes <- table(V(yeast)$Class, useNA = "always")
table_classes
@

The graph attribute \Rcode{Classes} maps
these abbreviations to the actual category:

<<02_data6>>=
head(yeast$Classes)
@

Finally, the graph edges have a \Rcode{Confidence} attribute that
assesses the amount of evidence supporting the interaction.
All the edges will be kept in this analysis, but
different confidences can be weighted to favour diffusion in
high confidence edges.

<<02_data7>>=
table(E(yeast)$Confidence)
@

More on the yeast object can be found through \Rcode{?yeast}.

\subsection{First analysis: protein ranking}

In this first case, the diffusion scores will be applied to
the prediction of a single protein function.
Let's assume that 50\% of the proteins labelled in the graph as
\Rcode{transport and sensing} (category \Rcode{A})
are actually unlabelled.
Now, using the labels of the known positive and negative
examples for \Rcode{transport and sensing},
can we correctly label the remaining 50\%?
First of all, the list of known and unknown positives is generated.
The function \Rfunction{diffuse} uses the (row)names in the input scores,
so that unlabelled nodes are treated as such.

<<03_datasplit>>=
perc <- .5

# Transport and sensing is class A
nodes_A <- V(yeast)[Class %in% "A"]$name
nodes_unlabelled <- V(yeast)[Class %in% c(NA, "U")]$name
nodes_notA <- setdiff(V(yeast)$name, c(nodes_A, nodes_unlabelled))

# Known labels
known_A <- sample(nodes_A, perc*length(nodes_A))
known_notA <- sample(nodes_notA, perc*length(nodes_notA))
known <- c(known_A, known_notA)

# Unknown target nodes
target_A <- setdiff(nodes_A, known_A)
target_notA <- setdiff(nodes_notA, known_notA)
target <- c(target_A, target_notA)
target_id <- V(yeast)$name %in% target

# True scores
scores_true <- V(yeast)$Class %in% "A"
@

Now that the input is ready, the diffusion algorithm can be applied
to rank all the proteins. As a first approach, the
vanilla diffusion scores will be computed through the \Rfunarg{raw} method
and the default regularised Laplacian kernel,
which is calculated on the fly.

<<03_diffuse>>=
# Vector of scores
scores_A <- setNames((known %in% known_A)*1, known)

# Diffusion
diff <- diffuStats::diffuse(
    yeast,
    scores = scores_A,
    method = "raw"
)
@

Diffusion scores are ready and
in the same format they were introduced:

<<03_diff1>>=
head(diff)
@

Now, the scores obtained by the proteins
actually belonging to \Rcode{transport and sensing}
can be compared to proteins with other labels.

<<03_diff2, fig=TRUE>>=
# Compare scores
df_plot <- data.frame(
    Protein = V(yeast)$name,
    Class = ifelse(scores_true, "Transport and sensing", "Other"),
    DiffusionScore = diff,
    Target = target_id,
    Method = "raw",
    stringsAsFactors = FALSE
)

ggplot(subset(df_plot, Target), aes(x = Class, y = DiffusionScore)) +
    geom_boxplot(aes(fill = Method)) +
    theme_bw() +
    scale_y_log10() +
    xlab("Protein class") +
    ylab("Diffusion score") +
    ggtitle("Target proteins in 'transport and sensing'")
@

The last plot justifies the usefulness of label propagation,
as proteins in \Rcode{transport and sensing} obtain
higher diffusion scores than the rest.
The network analysis can be deepened by examining, for instance,
the subnetwork containing the proteins with the top 30 diffusion
scores, highlighting with squares the ones that were
positive labels in the input.
Notice the small clusters of proteins:

<<03_diff3, fig=TRUE>>=
# Top scores subnetwork
vertex_ids <- head(order(df_plot$DiffusionScore, decreasing = TRUE), 30)
yeast_top <- igraph::induced.subgraph(yeast, vertex_ids)

# Overlay desired properties
# use tkplot for interactive plotting
igraph::plot.igraph(
    yeast_top,
    vertex.color = diffuStats::scores2colours(
        df_plot$DiffusionScore[vertex_ids]),
    vertex.shape = diffuStats::scores2shapes(
        df_plot$Protein[vertex_ids] %in% known_A),
    vertex.label.color = "gray10",
    main = "Top 30 proteins from diffusion scores"
)
@

\subsection{Second example: comparing scores with single protein ranking}

The proposed diffusion scores can be easily applied and compared.
The regularised Laplacian kernel will be used
to compute all the implemented scores for the
target nodes in \Rcode{transport and sensing}.

<<04_kernel>>=
K_rl <- diffuStats::regularisedLaplacianKernel(yeast)
@

Functions \Rfunction{diffuse} and \Rfunction{perf} also accept
an \Rcode{igraph} object, in which case they compute the kernel
automatically.
For medium-sized networks (10,000 nodes or more),
however, the kernel computation can be expensive in
memory and time, so precomputing it avoids unnecessary
recalculations.

The diffusion scores can be computed over 
a list of methods or sets of parameters. 
This can be achieved with instructions like \Rfunction{lapply}, 
but \Rpackage{diffuStats} contains a wrapper to facilitate this 
task.
The function \Rfunction{diffuse\_grid} takes the specified 
combinations of parameters -which can include the scoring method as well-
and computes the diffusion scores for each one. 
The results are concatenated in a data frame that can be 
easily plotted:


<<04_diff>>=
list_methods <- c("raw", "ml", "gm", "ber_s", "ber_p", "mc", "z")

df_diff <- diffuse_grid(
    K = K_rl,
    scores = scores_A,
    grid_param = expand.grid(method = list_methods),
    n.perm = 1000 
)
df_diff$transport <- ifelse(
    df_diff$node_id %in% nodes_A, 
    "Transport and sensing", 
    "Other"
)
@

The results can be directly plotted:

<<04_plot, fig=TRUE>>=
df_plot <- subset(df_diff, node_id %in% target)
ggplot(df_plot, aes(x = transport, y = node_score)) +
    geom_boxplot(aes(fill = method)) +
    scale_fill_npg() +
    theme_bw() +
    theme(axis.text.x = element_text(
        angle = 45, vjust = 1, hjust = 1)) +
    facet_wrap( ~ method, nrow = 1, scales = "free") +
    xlab("Protein class") +
    ylab("Diffusion score") +
    ggtitle("Target proteins scores in 'transport and sensing'")
@

As expected, all the diffusion scores qualitatively show differences
between positive and negative labels, but the quality
of class separation will generally
depend on the dataset and scoring method.

\subsection{Third example: 
benchmarking scores with multiple protein functions}

The package \Rpackage{diffuStats} is able to perform several
screenings at once. To show its usefulness, we will generalise
the procedure in the last section but screening all the
categories in the yeast graph.

First of all, the input data must meet an
adequate format - a straightforward approach is to
populate a matrix with the input labels (one category per column).

<<05_scores>>=
# All classes except NA and unlabelled
names_classes <- setdiff(names(table_classes), c("U", NA))

# matrix format
mat_classes <- sapply(
    names_classes,
    function(class) {
        V(yeast)$Class %in% class
    }
)*1
rownames(mat_classes) <- V(yeast)$name
colnames(mat_classes) <- names_classes
@

The former 50\% known / 50\% unknown approach will be kept,
using the same split, although the
\Sexpr{ncol(mat_classes)} categories will not all be
perfectly balanced between the two halves.
All the methods will be compared
using the area under the ROC curve
(AUROC) as a performance index.

Please note that \Rpackage{diffuStats} is equipped with basic 
performance measures for rankers: the AUROC, the 
area under the precision-recall curve (AUPRC), and 
their partial versions. 
These are available through the helper function \Rfunction{metric\_fun} 
and can be passed in list format to \Rfunction{perf}.
These measures are based on the \CRANpkg{precrec} R package; 
further detail can be found in the original manuscript \cite{precrec}.
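
As an illustrative, non-evaluated sketch, a named list of such metric
functions could be built and then supplied to \Rfunction{perf}.
The argument name \Rfunarg{curve} and the list layout below are
assumptions for illustration; please check \Rcode{?metric\_fun} and
\Rcode{?perf} for the exact interface:

<<05_metrics, eval=FALSE>>=
# Illustrative only (not run): assumed metric_fun() usage,
# building one ROC-based and one precision-recall-based metric
list_metrics <- list(
    auroc = metric_fun(curve = "ROC"),
    auprc = metric_fun(curve = "PRC")
)
@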

<<05_perf>>=
list_methods <- c("raw", "ml", "gm", "ber_s", "ber_p", "mc", "z")

df_methods <- perf(
    K = K_rl,
    scores = mat_classes[known, ],
    validation = mat_classes[target, ],
    grid_param = expand.grid(
        method = list_methods,
        stringsAsFactors = FALSE),
    n.perm = 1000
)
@

This allows plotting of the AUCs over the categories for each method
in one step:

<<05_plot, fig=TRUE>>=
ggplot(df_methods, aes(x = method, y = auc)) +
    geom_boxplot(aes(fill = method)) +
    scale_fill_npg() +
    theme_bw() +
    xlab("Method") +
    ylab("Area under the curve") +
    ggtitle("Methods performance in all categories")
@

<<05_plotsave, echo=FALSE, eval=FALSE, include=FALSE>>=
# Save plot for manuscript
# # library(extrafont)
setEPS()
postscript("data-raw/BoxplotAUC.eps", width = 3.38, height = 3.38)
ggplot(df_methods, aes(x = method, y = auc)) +
    # geom_boxplot(aes(fill = method), outlier.size = 1) +
    geom_boxplot(outlier.size = 1) +
    scale_fill_npg(name = "Method") +
    theme_bw() +
    xlab("Method") +
    ylab("Area under the curve") +
    # coord_fixed() +
    ggtitle("Benchmark of protein categories") +
    theme(
        text = element_text(size = 10),
        axis.text.x = element_text(
            size = 8, angle = 45, vjust = 1,
            hjust = 1, color = "black"),
        axis.text.y = element_text(size = 8, color = "black"),
        plot.title = element_text(size = 12, hjust = .5))
dev.off()

# Build table for manuscript
df_perf <- reshape2::acast(df_methods, Column~method, value.var = "auc")
df_test <- perf_wilcox(
    df_perf, 
    digits_p = 1, 
    adjust = function(p) p.adjust(p, method = "fdr"), 
    scientific = FALSE)
xtable::xtable(df_test)
@

Scaling up the analysis can be useful for assessing
how adequate a diffusion score is for the dataset of interest.
These results suggest that, for the current \Rcode{yeast} interactome
and protein functions, the best prioritisations are those
obtained through a statistical normalisation,
which might motivate its usage in other biological
networks.

The user can also statistically compare the performance metrics 
through the function \Rfunction{perf\_wilcox}. 
This generates a table with (i) the estimated differences 
in performance between the methods in rows and columns, with 
their confidence intervals, and (ii) the associated p-values 
(Wilcoxon test). 
Positive and negative estimates favour the method 
in the row and in the column, respectively.

<<05_test, results=tex>>=
# Format the data
df_perf <- reshape2::acast(df_methods, Column~method, value.var = "auc")
# Compute the comparison matrix
df_test <- perf_wilcox(
    df_perf, 
    digits_p = 1, 
    adjust = function(p) p.adjust(p, method = "fdr"), 
    scientific = FALSE)
knitr::kable(df_test, format = "latex")
@

\section{Conclusions}

The \Rpackage{diffuStats} package is a new
computational tool to compute and compare
single-network diffusion scores, which are the object of
active research in several bioinformatics areas.
It gathers a collection of
settings of the diffusion process, such as the graph kernel, the
label codification and the choice of statistical normalisation.
The \Rpackage{diffuStats} package will
help the end user choose and compute the best-performing
diffusion scores in the application of interest.

\section{Funding}
This work was supported by the Spanish
Ministry of Economy and Competitiveness (MINECO)
[TEC2014-60337-R to A.P.] and the
National Institutes of Health (NIH) [R01GM104400 to W.T.].
A.P. and S.P. thank the
Spanish Biomedical Research Centre in Diabetes
and Associated Metabolic Disorders (CIBERDEM)
and the Networking Biomedical Research Centre
in the subject area of Bioengineering,
Biomaterials and Nanomedicine (CIBER-BBN),
both initiatives of the Instituto de Salud Carlos III (ISCIII),
for their funding.
S.P. thanks the AGAUR FI scholarship programme.

\newpage

\bibliography{bibliography}

\newpage

\appendix

\section{Session info}

Here is the output of \Rfunction{sessionInfo()} on the system that
compiled this vignette:
<<sessioninfo, results=tex, echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}