---
title: "ClusterJudge"
author: "Adrian Pasculescu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

references:
- author:
  - family: Gibbons
    given: F.D
  - family: Roth
    given: F.P.
  container-title: Genome Research
  id: GibRoth2002
  issued:
    month: 10
    volume: 12
    year: 2002
  publisher: Genome Research
  title: 'Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation'
- author:
  - family: Cho
    given: R.C. et all
  container-title: Molecular Cell
  id: Cho1998
  issued:
    month: 7
    volume: 2
    year: 1998
  publisher: Science Direct
  title: 'A genome-wide transcriptional analysis of the mitotic cell cycle'
- author:
  - family: Tavazoie
    given: S. et all
  container-title: Nature Genetics
  id: Taivasoie1999
  issued:
    month: 7
    volume: 22
    year: 1999
  publisher: Nature Publishing Group
  title: 'Systematic determination of genetic network architecture'
- author:
  - family: Cover
    given: T.M. et all
  container-title: book
  id: Cover1991
  issued:
    month: 1
    volume: (ed. D.L. Schilling)
    year: 1991
  publisher: Wiley-Interscience
  title: 'Elements of information theory'
---

```{r echo=FALSE, warning=FALSE, message=FALSE}
devtools::load_all('.')
```

## Introduction
The quality of a cluster analysis can be judged based on external information.
For example when the clustered entities possess attributes, the total mutual information between
the clustering and each of the important and non-redundant attributes of entities might be helpful.
Gene Ontology is an example of a system of attributes that can help in assessing gene clustering resulting 
from genomics or proteomics experiments.  
A clustering that `agrees well` with a system of attributes should have a pronounced decrease of 
the total mutual information when the number of random swaps of entities between clusters 
is increased [@GibRoth2002].


The following simple steps must be performed to evaluate a clustering using ClusterJudge:

1. Obtain the clusters as a structure containing entities and their cluster IDs;
2. Obtain the entity-attribute associations and verify the entity-attribute associations for consistency;
3. Often, the number of attributes can be consolidated (reduced) by using a set of _uncertainty_ levels;  
4. 'Judge' the original clustering versus clustering of random changes that are applied to the original data.


## Obtaining the clustering

In most of the cases the clustering is obtained from experimental data after 
applying some statistical methods such as k-means or cutting a hierarchical clustering at a predefined 'height'.

The following example extracts the Yeast cell cycle clustering of genes related to the Yeast cell 
cycle  [@Cho1998] [@Taivasoie1999]:
```{r  warning=FALSE, message=FALSE}
library('yeastExpData')
```
```{r echo=TRUE}
data(ccyclered)
head(ccyclered)
```

The clustering should not contain any ambiguities ( i.e. one entity should be assigned to just one cluster)
and no NA should be present.
For this package, the clusters must be organized as a named vector:
```{r}
clusters <- ccyclered$Cluster
###  convert from Gene names to the new standard of Saccharomyces Genome Database (SGD) gene ids
ccyclered$SGDID <- sub('^S','S00',ccyclered$SGDID)
names(clusters) <- ccyclered$SGDID
str(clusters)
```

Note that, in this example, we modified the names of the genes to match the names of entities in 
the Gene Ontology data (see next section).


## Obtaining Entity Attribute associations


For genomics or proteomics data, one source of attributes is the Gene Ontology (http://www.geneontology.org/).
For Yeast Gene Ontology attributes can also be downloaded from the [Saccharomyces Genome Database (SGD)]
(http://www.yeastgenome.org/)
In the ClusterJudge R package there is already a downloaded mapping of Yeast genes to attributes: 

```{r, fig.show='hold', echo=TRUE}
data(Yeast.GO.assocs);
str(Yeast.GO.assocs);
head(Yeast.GO.assocs);
validate_association(Yeast.GO.assocs)
```

The *download_Yeast_GO_mapping()* function is also available to download a fresh copy of the mapping. 

For other species one can use specialized R and Bioconductor packages such as biomaRt 
as in the following `toy` example.
```{r, eval=FALSE, echo=TRUE}
library(biomaRt)
rn <-  useDataset("rnorvegicus_gene_ensembl", mart=useMart("ensembl"))
rgd.symbol=c("As3mt", "Borcs7", "Cyp17a1", "Wbp1l", "Sfxn2", "Arl3") ### exemplify for a limited set of  genes
entity.attr  <- getBM(attributes=c('rgd_symbol','go_id'), filters='rgd_symbol', values=rgd.symbol, mart=rn)
```


The _validate_association()_ function quickly verifies if the input structure has two columns, 
if there are no NAs or NULLs and if there are no duplicated associations.

## Consolidating Entity-Attribute associations
In some cases, the attributes may not bring additional useful information. For example, an attribute 
may be too specific as it is only assigned to one or two genes (e.g. it may simply be a new name of a gene). 
In other cases, two or more attributes may be very correlated and appear on almost the same set of genes.

For the first case a consolidated association will ignore those attributes that belong to very few genes 
(*min.entities.per.attr*) 

```{r, fig.width=6, fig.height=4}
entities_attribute_stats(Yeast.GO.assocs) ### shows number of entities per attribute distribution
Yeast.GO.assocs.cons1 <- consolidate_entity_attribute(
       entity.attribute = Yeast.GO.assocs
     , min.entities.per.attr =3  ### keep only attributes associated to 3 or more entities
     , mut.inf=FALSE
     )

dim(Yeast.GO.assocs)
dim(Yeast.GO.assocs.cons1) ### shows reduction in the number of associations
``` 
For the second case we use the pre-calculated mutual information [@Cover1991] between attributes and impose 
a maximum _uncertainty_ level where uncertainty is defined as mutual information divided by the maximum 
of mutual information.

```{r, fig.width=6, fig.height=4}
data(mi.GO.Yeast)
Yeast.GO.assocs.cons <- consolidate_entity_attribute(
     entity.attribute = Yeast.GO.assocs
   , min.entities.per.attr =3
   , mut.inf=mi.GO.Yeast   ### use precalculated mutual information
   , U.limit = c(0.1, 0.001) ### calculate consolidated association for these uncertainty levels
   ) ### shows distribution of the number of pairs of attributes by Uncertainty 
str(Yeast.GO.assocs.cons)                                                   
```

The lower the Uncertainty limit the more attributes are ignored and the faster the judging 
of clusters will be performed.


## Calculating the mutual information 
In the case where the entity-attributes associations are downloaded from other organisms or other sources,
mutual information can be calculated using this function: *attribute_mut_inf* :

```{r}
data(Yeast.GO.assocs) 
### because it takes time, we use a small sampled subset of associations
entity.attribute.sampled <- Yeast.GO.assocs[sample(1:nrow(Yeast.GO.assocs),100),]
mi.GO.Yeast.sampled <- attribute_mut_inf(
    entity.attribute = entity.attribute.sampled
  , show.progress    = FALSE  ## for this small example do not print progress
  )  
str(mi.GO.Yeast.sampled)
```

The mutual information of the pair of attributes A,B is defined as 
  mi(A,b) = H(A)+H(B) - H(cbind(A,B))
where H is the entropy.


## Judging the clustering for selected `uncertainty` levels
The _clusters_ named vector prepared above and one of the consolidated entity-attribute
associations (for example the one at 0.001 level of Uncertainty) are needed at this step.
The _ClusterJudge_ will swap at random more and more entities between clusters and will calculate
and plot the new total mutual information obtained at each iteration. 
If the total mutual information decreases by increasing the number of random swaps we can conclude 
that there is a good agreement between the clustering of experimental data and the ensemble of information 
provided by the entity-attribute associations.

```{r, fig.width=6, fig.height=4}
mi.by.swaps<-clusterJudge(
    clusters = clusters
  , entity.attribute=Yeast.GO.assocs.cons[["0.001"]]
  , plot.notes='Yeast clusters judged at uncertainty level 0.001 - Ref: Tavazoie S,& all 
`Systematic determination of genetic network architecture. Nat Genet. 1999`'
, plot.saveRDS.file= 'cj.rds') ### save the plot for later use  

p <- readRDS('cj.rds') ### retrieve the previous plot
pdf('cj.pdf'); plot(p); dev.off() ### plot on another device
```
In most of the cases mutual information should have a decreasing trend when the number of shuffles increases.
However, in rare cases, a consistent increasing trend might suggest novel discoveries; only when both experimental data 
and clustering are *very reliable*. In these rare cases a *high confident* clustering  might complement or even contradict 
information accumulated in time into entity-attribute associations.


*ClusterJudge* can also be seen as a complement to many of the Bioconductor packages
related to evaluation and comparison of Clustering (such as [clustComp](http://www.bioconductor.org/packages/devel/bioc/html/clustComp.html ), 
[clusterExperiment](http://www.bioconductor.org/packages/devel/bioc/html/clusterExperiment.html ), 
[clusterSignificance](http://www.bioconductor.org/packages/devel/bioc/html/ClusterSignificance.html ),
[GOpro](http://www.bioconductor.org/packages/devel/bioc/html/GOpro.html ), 
[multiClust](http://www.bioconductor.org/packages/devel/bioc/html/multiClust.html ) ).

## References