%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%


%\VignetteIndexEntry{A modified spectral clustering method for clustering Flow Cytometry Data}
%\VignetteDepends{SamSPECTRAL}
%\VignetteKeywords{}
%\VignettePackage{SamSPECTRAL}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{cite, hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\author{Habil Zare and Parisa Shooshtari }

\begin{document}
\title{SamSPECTRAL: A Modified Spectral Clustering Method for Clustering Flow Cytometry Data}
\maketitle{}

\tableofcontents
\newpage

\section{Introduction} 
Data analysis is a crucial step in most of recent biological research areas such as microarray techniques, gene expression and protein classification. A classical approach for analysing biological data is to first group individual data points based on some similarity criterion,
a process known as clustering, and then compare the outcome of clustering with the desired biological hypotheses. Spectral clustering is a non-parametric clustering method 
%%(\citealp{spectral_tutorial})  
which has proved useful in many pattern recognition areas. 
%%(\citealp{first_spectral}). 
Not only it does not require a priori assumptions on the size, shape and distributions of clusters, but it has several features that make it an appropriate candidate for clustering biological data:


\begin{itemize}
	\item It is not sensitive to outliers, noise or shape of clusters.

	\item It is adjustable so we can make use of biological knowledge to adapt it for a specific problem or dataset.

	\item There is mathematical evidence to guarantee its proper performance.
		 %%(\citealp{spectral_theory}).
\end{itemize}

However, because of the machine limitations, one faces serious empirical barriers in applying this method for large data sets. SamSPECTRAL is a modification to spectral clustering such that it will be applicable on large size datasets.
See the reference for more details and cite it if you use this
approach.


\section{How to run SamSPECTRAL?}
SamSPECTRAL is an R package source that can be downloaded from BioCunductor. 
In Linux, it can be installed by the following command:

$\\$
\texttt{R CMD INSTALL SamSPECTRAL\_x.y.z.tar.gz}
\newline
$\\$
where x.y.z. determines the version.

The main function of this package is SamSPECTRAL() which is loaded by using the command library(SamSPECTRAL) in R. 
Before running this function on a data set, some parameters are required to be set including: 
normal.sigma and separation.factor. This can be best done by  running the algorithm on some number of samples 
(Normally, 2 or 3 samples are sufficient). 
Then the function SamSPECTRAL() can be applied to all samples in that data set to identify cell populations in each sample data.


\subsection{An example}

This example shows how SamSPECTRAL can be run on flow cytometry data. If $f$ is a flow frame 
(which is normally read from an FCS file using flowCore), 
then the object "small" in the following example should be replaced by expr($f$). 


\begin{center}
<<f_example,fig=FALSE, echo=TRUE>>=
library(SamSPECTRAL)
set.seed(4)
data(small_data)
full <- small

L <- SamSPECTRAL(full,dimension=c(1,2,3),normal.sigma = 200,separation.factor = 0.39)   
plot(full, pch='.', col= L)
@
\end{center}

SamSPECTRAL is done. The results are in $L$, a vector that provides a numeric label for each event.
All events with equal label are in one component and isolated outliers are labelled by $NA$.  
The following piece of code is not a part of the analysis 
and it is included only for more clear presentation of the results. 
The code computes the frequency of events in each component and adds a legend to the figure.

\begin{center}
<<f_example,fig=TRUE, echo=TRUE>>=
## Computing the frequency:
plot(full, pch='.', col= L)
frequency <- c()	
minimum.frequency <- 0.01	
## components smaller than this threshould, will not be aprear in the legend statistics.
freq.large <- c()	
labels <- as.character(unique(L))
for(label in labels){
  if(!is.na(label)){
    frequency[label] <- length(which(L==label))/length(L)
    if(frequency[label] > minimum.frequency)
      freq.large[label] <- frequency[label]
  }
}	
print(frequency)	

## Adding legend 
legend(x="topleft",as.character(round(freq.large,3)),col=names(freq.large),pch=19)
@
\end{center}


\section{Adjusting parameters}

For efficiency, one can set $m=3000$ to keep the running time bellow 1 minute by a 2 GHz processor
and normally the results remained satisfactory for flow cytometry data. The separation factor 
and scaling parameter ($\sigma$) are two main parameters that needed to be adjusted.
The general way is to run SamSPECTRAL on one or two random data samples of a flow cytometry data set 
and try different values for $\sigma$ and separation factor. Then, the selected parameters were fixed 
and used to apply SamSPECTRAL on the rest of data samples. An efficient strategy is explained by the following example.


\subsection{Example}
  First we load data and store the transformed coordinates in a matrix called $full$:

<<ReadFiles2, echo=true,results=verbatim>>=

data(small_data)
full <- small
@

The objects needed for creating this vignette can be directly computed or loaded from previously saved workspace to save time.
The later increases the speed of building this vignette.
	
<<ReadFiles2, echo=true,results=verbatim>>=
run.live <- FALSE
@


The following parameters are rarely needed to be changed for flow cytometry data:


<<ReadFiles2, echo=true,results=verbatim>>=	
## Parameters:
m <- 3000; 
community.weakness.threshold <-1; precision <- 6; 
maximum.number.of.clusters <- 30    
@

The following piece of code, scales the coordinates in range [0,1]:
<<ReadFiles2, echo=true,results=verbatim>>=	
for (i.column in 1:dim(full)[2]){#For all columns
  ith.column <- full[,i.column]
  full[,i.column] <- (ith.column-min(ith.column)) /(max(ith.column) - min(ith.column) )  
  ##^ This is the scaled column.
}#End for (i.column.
## Therefore, 	
space.length <- 1
@

To perform faithful sampling, we run:
<<f_sampling,fig=TRUE, echo=true,results=verbatim>>=	
## Sample the data and build the communities 
society <- Building_Communities(full,m, space.length, community.weakness.threshold)
plot(full[society$representatives, ], pch=20)
@


We intend to first find an appropriate value for $\sigma$ and then set separation factor. 
Note that normal.sigma=$\frac{1}{\sigma^2}$, therefore, decreasing normal.sigma is equivalent 
to increasing $\sigma$ and visa versa.
We start with normal.sigma=10:
<<f_small,fig=TRUE, echo=true,results=verbatim>>=	
normal.sigma <- 10
## Compute conductance between communities
conductance <- Conductance_Calculation(full, normal.sigma, space.length, society, precision)
## Compute the eigenspace:
if (run.live){
  clust_result.10 <- Civilized_Spectral_Clustering(full, maximum.number.of.clusters, society, conductance,stabilizer=1)    
  eigen.values.10 <- clust_result.10@eigen.space$values
} else 
  data("eigen.values.10")		
plot(eigen.values.10[1:50])
@


We observe that the eigen values curve does not have a "knee" shape. So we increase sigma:
<<f_large,fig=TRUE, echo=true,results=verbatim>>=	
normal.sigma <- 1000
## Compute conductance between communities
conductance <- Conductance_Calculation(full, normal.sigma, space.length, society, precision)
## Compute the eigenspace:
if (run.live){
  clust_result.1000 <- Civilized_Spectral_Clustering(full, maximum.number.of.clusters, society, conductance,stabilizer=1)    
  eigen.values.1000 <- clust_result.1000@eigen.space$values
} else 
  data("eigen.values.1000")		
plot(eigen.values.1000[1:50])
@


We observe that in the eigen values plot, "too many" values are close to 1 but for this example 
we do not expect 20 populations. So we decrease sigma:
<<f_appropriate,fig=TRUE, echo=true,results=verbatim>>=	
normal.sigma <- 250
## Compute conductance between communities
conductance <- Conductance_Calculation(full, normal.sigma, space.length, society, precision)
## Compute the eigenspace:
clust_result.250 <- Civilized_Spectral_Clustering(full, maximum.number.of.clusters, society, conductance,stabilizer=1)    
eigen.values.250 <- clust_result.250@eigen.space$values
plot(eigen.values.250[1:50])
@

This is ``a right'' value for normal.sigma because the curve has now a knee shape. 
Even some variation to this parameter does not change the shape significantly (200 or 300 can be tried). 

Now having sigma been adjusted, separation factor can be tuned:
<<f_sep1,fig=TRUE, echo=true,results=verbatim>>=	
## Extracting labels:
labels.for_num.of.clusters <- clust_result.250@labels.for_num.of.clusters
number.of.clusters <- clust_result.250@number.of.clusters
L33 <- labels.for_num.of.clusters[[number.of.clusters]]
## Setting septation factor:
separation.factor <- 0.1
## post-processing:
component.of <- Connecting(full, society, conductance, number.of.clusters, labels.for_num.of.clusters, separation.factor)$label
## ploting:
plot(full, pch='.', col= component.of)
@

This value is too small for the separation factor and a population is combined by mistake. 
Therefore, we increase septation factor to separate the components more:
<<f_sep2,fig=TRUE, echo=true,results=verbatim>>=	
## Setting septation factor:
separation.factor <- 0.5
## post-possesing:
component.of <- Connecting(full, society, conductance, number.of.clusters, labels.for_num.of.clusters, separation.factor)$label
## ploting:
plot(full, pch='.', col= component.of)
@

This is the right value for separator factor as all population are now separated. 

Now, we can fix these values for the parameters; normal.sigma=250 and separation.factor=0.5. 
One can run the SamSPECTRAL algorithm on the rest of the data set without changing them, 
hopefully, obtaining as appropriate results. 


\section{Reference}
Zare, H. and Shooshtari, P.  and Gupta, A. and Brinkman R.B: 
  \textbf{Data Reduction for Spectral Clustering to Analyse High Throughput Flow Cytometry Data.} 
\textit{BMC Bioinformatics}, 2010, \textbf{11}:403.


\end{document}