\documentclass[a4paper,12pt]{article}
<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@
%\VignetteIndexEntry{Manual for the TTMap library}
%\VignettePackage{TTmap}
\usepackage{amsmath}    % need for subequations
\usepackage{amssymb}    %useful mathematical symbols
\usepackage{bm}         %needed for bold greek letters and math symbols
\usepackage{graphicx}   % need for PS figures
\usepackage{hyperref}   % use for hypertext links
\begin{document}
\title[Two-Tier Mapper]{\textbf{\LARGE{Two-Tier Mapper: 
a user-independent clustering 
method 
for global gene expression analysis 
based on topology}}}
\author{Rachel Jeitziner\\
\small{UPBRI\& ISREC} \\
\small{EPFL Lausanne}}
\date{}  %comment to include current date
\maketitle
\tableofcontents
\section{Introduction}
\label{sec:intro}
We developed a new user-independent analytical 
framework, called 
\textit{Two-Tier Mapper (TTMap)}.
This tool is separated into two parts. 
TTMap consists of two separated and 
independent parts : 1. Hyperrectangle 
Deviation assessment
(HDA) and 2. Global-to-Local Mapper (GtLMap), where the first step 
establishes properties of the
control group and removes outliers in order to calculate the 
deviation of each vector in the test group
from the corrected control group. The second step uses the 
traditional Mapper algorithm
\cite{Extracting} with a two-tier cover and a special distance. 
This topological tool detects both global and local differences in the 
patterns of deviations and thereby 
captures the structure of the test group. The samples are clustered 
according to the shape of their 
deviation (do they both deviate positively, negatively or are they as the 
control). To still keep on the 
information about the amount of deviation, one separates the data into 4 
clusters according to a 
function measuring the amount of deviation. These represent then the 
second tier. 
Each cluster is colored by the extent of the deviation. A list of the 
differentially expressed genes is also 
provided.
The functions and methods presented on this \textit{vignette} provide
explanation on how to use TTMap, by default and what can be changed by 
the user. 
\begin{figure*}[h]
\begin{center}
\includegraphics[height= 17.5cm]{Figure1new_test_small.png}
\caption{ Schematic overview of TTMap. The inputs (green) 
are given by two gene expression matrices,
the control ($\mathsf{N}$) and the test group ($\mathsf{T}$), 
rows represent genes and columns 
samples. In Part 1, TTMap adjusts the control 
group for outlier values ($\bar{N}_*$), feature by feature. 
It calculates deviation from this corrected control 
group for individual samples in the test group ($Dc.T_*
$). In Part 2, TTMap computes a similarity 
measure, the mismatch distance (represented as a heatmap) 
using the deviation components. 
The Mapper \cite{Mapper} algorithm is used with a two-tier cover to 
generate a visual representation of 
the clustering creating a network of global clusters (Overall) and 
local clusters (1st, 2nd, 3rd, 4th quartile 
of a filter function). It takes as inputs the mismatch distance and 
the deviation components.}
\end{center}
\end{figure*}
\section{Prepare the data}
\label{sec:intro}
Upload the file(s) to compare in R. Log transform and subselect them.
Use \emph{make\_matrices} to create 
the needed files for the first function of TTMap since it generates the 
control and the test matrix in the right format. As an example, we 
use the airway data set available at Bioconductor. 
\begin{scriptsize}
<<loadLibAndData>>=
library(airway)
data(airway)
airway <- airway[rowSums(assay(airway))>80,]
assay(airway) <- log(assay(airway)+1,2)
experiment <- TTMap::make_matrices(airway,seq_len(4),seq_len(4)+4,
NAME = rownames(airway), CLID =rownames(airway))
@ 
\end{scriptsize}
This function can directly be used on a normalised count table from RNA-
seq precising what are the columns of the control group (in col\_ctrl) and 
what are the columns in the test group (in col\_test) .
\section{TTMap part1: Adjustement of the control group (ctrl\_adj)}
The first part of the method checks if the control and the test matrices 
have the same row-names, and if not the method subselects the common 
rows. It outputs the files with the common rows subselected (with the 
extension mesh).  It then calculates the corrected control matrix, which 
removes outliers and replaces them by a chosen method (given by a 
function with input the matrix with NAs where there is an outlier and 
should 
return a matrix without NAs), or by the median of the other values by 
default. The inputs can even be given by the CTRL and TEST variables of 
the list given by the output of \emph{make\_matrices} or by imputed 
control and test matrices in pcl format (see \cite{Monica}). The name of 
the control group and the project name need to be inputed as well as the 
working directory, in which the output files will be created. 
A value for what to consider as an outlier (called e)
can be imputed or use the data-driven 
default value given by the method. If there are any batch effects to 
consider, they can be imputed using the variable B, which is a vector of 
numbers representing the batches. Last, the parameter $P$ is a value 
which will remove the genes that have a higher percentage than $P$ of 
outlier values. 
\begin{scriptsize}
<<part1>>=
E=1
Pw=1.1
Bw=0
TTMAP_part1prime <-TTMap::control_adjustment(normal.pcl = experiment$CTRL,
tumor.pcl = experiment$TEST, 
normalname = "The_healthy_controls", dataname = "The_effect_of_cancer", 
org.directory = getwd(), e=E,P=Pw,B=Bw);
@
\end{scriptsize}
This outputs: 
\begin{itemize}
\item A file with the number of outliers per sample (Dataname followed by 
the number of the batch followed by na\_numbers\_per\_col.txt)
\item A file with the number of outliers per row (Dataname followed by 
the 
number of the batch followed by na\_numbers\_per\_col.txt)
\item A picture of the distribution of the mean against variance for each 
gene, before (Dataname followed by \_mean\_vs\_variance.pdf) and
\item after correction of outliers (Dataname followed by \\
\_mean\_vs\_variance\_after\_correction.pdf).
\end{itemize}
The corrected control matrix is output in the next step.
A possible output after this first step is shown in figure 
\ref{fig_mean_vs_var}
\begin{center}
\begin{figure}
\includegraphics[scale=14]{The_effect_of_cancer_mean_vs_variance.pdf}
\caption{\texttt{barplotSignifSignatures}: Plot of mean against variance 
per 
gene.}
\label{fig_mean_vs_var}
\end{figure}
\end{center}
\section{TTMap part1: Hyperrectangle deviation assessement (hda)}
This part consists in calculating deviation 
components from a hyperrectangle. This enables the calculation in 
the third function (ttmap\_part2\_gtlmap) of the 
shape of deviation. One parameter k is given by if all the vectors of the 
control group should be kept or if only the the top k-dimensional principal 
component approximation of the control matrix should be kept using the 
singular value decomposition (as in \cite{Monica}). The default is to keep 
all 
the vectors.
\begin{scriptsize}
<<part2>>=
TTMAP_part1_hda <- TTMap::hyperrectangle_deviation_assessment(x = 
TTMAP_part1prime,k=dim(TTMAP_part1prime$Normal.mat)[2],
dataname = "The_effect_of_cancer", normalname = "The_healthy_controls");
head(TTMAP_part1_hda$Dc.Dmat)
@
\end{scriptsize}
The outputs of this step are the following. 
\begin{itemize}
\item The corrected control matrix, calculated at the first step is given in 
\textit{The\_healthy\_controls.NormalModel.pcl}, with a possible trimming 
of columns if $k$ is 
different than the number of columns in the corrected matrix.
\item The deviation component of each test sample is written in 
\textit{The\_effect\_of\_cancer.Tdis.pcl}. An example of the deviation 
component is found in 
the previous script by writing \textit{head(TTMAP\_part1\_hda\$Dc.Dmat)}
\item The normal component of each test sample is written in 
\textit{The\_effect\_of\_cancer.Tnorm.pcl}.
\end{itemize}
The two values of this function are the deviation component matrix and 
the 
overall deviation (calculated by summing in absolute values the deviation 
components).
\section{TTMap part2: Global-to-local Mapper (gtlmap)}
The third part corresponds to the Global-to-local Mapper part. One starts 
with an annotation file of our samples, in order to annotate the obtained 
clusters. In this example here we just copied several times the column 
names. This annotation file needs to have as rownames the columns of 
the 
test samples followed by ".Dis". We then calculate the distance matrix 
between the samples using the \textit{generate\_mismatch\_distance} 
function, which uses a cutoff parameter $\alpha$ 
in order to decide what is 
a considered as noise. Any other distance matrix can be computed here 
and 
used for the next step. Then, we calculate and output the clusters using 
\textit{ttmap\_part2\_gtlmap}, which needs as inputs the values of 
\textit{ttmap\_part1\_ctrl\_adj, ttmap\_part1\_hda.} 
The default parameter uses all the genes to 
calculate the overall deviation, but if a subset should be selected (only 
one 
pathway for example), it can be imputed here. 
ttmap\_part2\_gtlmap then 
calculates using \textit{calcul\_e} a parameter of closeness using the 
data, 
in order to know what distance is "close" enough to clusters samples 
together. The parameter n determines which column of metadata should 
be 
chosen for the output files.  Two more parameters of convenience,
if ad is 
set to something different than 0 (the default) then the clusters on the 
output picture will not be annotated and if bd is different than 0 (default), 
the output will be without outliers of the test data set. After the picture 
has been adjusted to what one wants to see one can save it using the 
\textit{rgl.postscript} function.
\begin{scriptsize}
<<part3>>=
library(rgl)
ALPHA <- 1
annot <- c(paste(colnames(experiment$TEST[,-seq_len(3)]),"Dis",sep=".")
,paste(colnames(experiment$CTRL[,-seq_len(3)]),"Dis",sep="."))
annot <- cbind(annot,annot)
rownames(annot)<-annot[,1]
dd5_sgn_only <-TTMap::generate_mismatch_distance(TTMAP_part1_hda,
select=rownames(TTMAP_part1_hda$Dc.Dmat),alpha = ALPHA)
TTMAP_part2_gtlmap <- 
TTMap::ttmap(TTMAP_part1_hda,TTMAP_part1_hda$m,
select=rownames(TTMAP_part1_hda$Dc.Dmat),
annot,e= TTMap::calcul_e(dd5_sgn_only,0.95,TTMAP_part1prime,1), 
filename="first_comparison",
n=1,dd=dd5_sgn_only)
rgl.postscript("first_output.pdf","pdf")
@
\end{scriptsize}
\section{TTMap: Finding the significant genes (sgn\_genes)}
This last function analyses the different clusters for significant features. It 
outputs a file per level 
(one for overall, called all, one for the lower quartile, 
called low, one for the second quartile, called mid1, the third, mid2, and 
the 
higher quartile, called high). In each of them one file per cluster is given, 
with the list of significant genes linked to the cluster. Relaxed is a 
parameter permitting 
to 
select as a match one sample that would be 0 
for the deviation component, 
while the others deviate in the same shape.
\begin{scriptsize}
<<part4>>=
TTMap::ttmap_sgn_genes(TTMAP_part2_gtlmap, 
TTMAP_part1_hda, TTMAP_part1prime, 
annot, n = 2, a = ALPHA,
filename = "first_trial", annot = TTMAP_part1prime$tag.pcl, col = "NAME",
path = getwd(), Relaxed = 0)
@
\end{scriptsize}
\section{Conclusion}
Two-Tier Mapper (TTMap) is a topology-based clustering tool, which is 
user-
friendly and reliable. The algorithm first provides an overall clustering, in 
an 
unbiased manner, since all the parameters are defined in a data-driven 
manner or by reliable default parameters. his method enables a refined 
view 
on the composition of the clusters by delineating
how clusters differ locally and how the local clusters relate to the global 
structure of the dataset. The output is a visual interpretation
of the data 
given by a colored graph that is easy to interpret, which describes the 
shape of the data according to the chosen distance. \\
\bibliography{biblio2}
\end{document}