---
title: "Benchmarks and other GO similarity methods"
author: "by Casper Peters"
date: "`r Sys.Date()`"
output:
  # rmarkdown::html_vignette:
  prettydoc::html_pretty:
    theme: cayman
    toc: true
    toc_depth: 4
    highlight: vignette
    compressed_css: true
    code_folding: none
vignette: >
  %\VignetteIndexEntry{Benchmarks and other GO similarity methods}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: inline
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

### Necessary libraries

```{r load_libs, warning = FALSE, message = FALSE, include = TRUE}
library(GAPGOM)
library(profvis)
library(GO.db)
library(graph)
library(ggplot2)
library(reshape2)
library(Biobase)
```

### Background

Benchmarks are an important part of getting to know how fast algorithms perform and what to expect. Thus, we will compare multiple example use cases to give an idea of how fast the GAPGOM algorithms are, and try to give some insight into why certain algorithms perform the way they do. The profvis library will be used to determine the amount of time spent on each calculation, as well as RAM usage. **Since profvis can hugely impact performance depending on the specifics of the code, the reported timings might not be an accurate representation of real-time performance. They do, however, show the important relations between analyses.**

Because this package is mainly made for human gene research, human is the model organism used for all testing here, and the Biological Process (BP) ontology is used as the model ontology. Some internal functions are used because they are needed for some of the operations required by the base algorithms, or for showing off the performance.

Most of these results are based on random samples and shouldn't necessarily be taken as-is or as a baseline. Instead, this benchmarking is meant to give an idea of how well the processes scale. Besides this, the results have been pre-calculated and ship with the package, as the calculation times would otherwise be neither reasonable nor accurate (for some reason, building a vignette takes longer than evaluating normal R code). All code blocks used to generate the results are included, however.
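For reference, the timings and memory figures shown throughout this vignette are extracted from the profvis result object. A minimal sketch of that pattern (the workload inside `profvis` is just a placeholder here) looks like this:

```{r, eval=FALSE}
# Profile an arbitrary (placeholder) workload. profvis samples every 10 ms by
# default, so max(time) * 10 gives the elapsed time in milliseconds;
# memalloc is the peak memory allocation reported by profvis (in MB).
p <- profvis::profvis({
  x <- lapply(seq_len(50), function(i) sort(runif(1e5)))
})
time_ms <- max(p$x$message$prof$time) * 10
mem_mb <- max(p$x$message$prof$memalloc)
```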
The benchmarks have been run on two systems, a laptop and a larger server, both with the following specs (obtained with `cat /proc/cpuinfo` and `cat /proc/meminfo`):

- **Laptop**

CPU:

```
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping : 3
microcode : 0xc6
cpu MHz : 2800.039
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 5184.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
```

Memory:

```
MemTotal: 8071404 kB
MemFree: 978056 kB
MemAvailable: 3087884 kB
Buffers: 367592 kB
Cached: 2616176 kB
SwapCached: 0 kB
Active: 4737256 kB
Inactive: 1944028 kB
Active(anon): 3699308 kB
Inactive(anon): 760208 kB
Active(file): 1037948 kB
Inactive(file): 1183820 kB
Unevictable: 48 kB
Mlocked: 48 kB
SwapTotal: 8000508 kB
SwapFree: 8000508 kB
Dirty: 5696 kB
Writeback: 0 kB
AnonPages: 3697580 kB
Mapped: 630132 kB
Shmem: 761956 kB
Slab: 279024 kB
SReclaimable: 198148 kB
SUnreclaim: 80876 kB
KernelStack: 11136 kB
PageTables: 47776 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 12036208 kB
Committed_AS: 9791644 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 166288 kB
DirectMap2M: 7077888 kB
DirectMap1G: 2097152 kB
```

- **Server**

CPU:

```
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping : 7
microcode : 0x713
cpu MHz : 2699.953
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 5400.94
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
```

Memory:

```
MemTotal: 264108916 kB
MemFree: 66614704 kB
Buffers: 262744 kB
Cached: 65338616 kB
SwapCached: 74732 kB
Active: 62788360 kB
Inactive: 44151044 kB
Active(anon): 56367000 kB
Inactive(anon): 23284424 kB
Active(file): 6421360 kB
Inactive(file): 20866620 kB
Unevictable: 7556 kB
Mlocked: 7556 kB
SwapTotal: 268387324 kB
SwapFree: 265225612 kB
Dirty: 88 kB
Writeback: 0 kB
AnonPages: 41301344 kB
Mapped: 212636 kB
Shmem: 38309896 kB
Slab: 21008548 kB
SReclaimable: 630584 kB
SUnreclaim: 20377964 kB
KernelStack: 12368 kB
PageTables: 125496 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 400441780 kB
Committed_AS: 89205012 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 68290760 kB
VmallocChunk: 34093725116 kB
HardwareCorrupted: 0 kB
AnonHugePages: 11550720 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 191424 kB
DirectMap2M: 5005312 kB
DirectMap1G: 265289728 kB
```

**All processes are single-core.** (Some functions might be multi-threadable, however.)

## lncRNA2GOA `expression_prediction`

Since lncRNA2GOA is based on expression values, we'll use the standard example dataset that is included with the package to make life easier. This algorithm should scale linearly with the number of expression values you add. This will not be shown here, however, because the actual calculations on the expression data are very fast; the main thing that slows this algorithm down is querying the GO terms and attaching them to the correct IDs for the results.

```{r, eval=FALSE}
# Example with default dataset, take a look at the data documentation
# to fully grasp what's going on with making of the filter etc. (Biobase
# ExpressionSet)

# keep everything that is a protein coding gene
filter_vector <- expset@featureData@data[(
  expset@featureData@data$GeneType == "protein_coding"), ]$GeneID
# set gid and run.
gid <- "ENSG00000228630"

p <- profvis::profvis({
  GAPGOM::expression_prediction(gid,
                                GAPGOM::expset,
                                "human",
                                "BP",
                                id_translation_df = GAPGOM::id_translation_df,
                                id_select_vector = filter_vector,
                                method = "combine",
                                verbose = TRUE,
                                filter_pvals = TRUE)
})
time <- max(p$x$message$prof$time) * 10
mem <- max(p$x$message$prof$memalloc)
```

```{r}
# laptop
# time 310
# mem 124.91
# ---
# server
# time 560
# mem 72.47
```

Neither the laptop nor the server took long, because the example is rather small: the laptop took about ~310 ms to complete the calculations, while the server took ~560 ms. If no `id_translation_df` is supplied, most of the time is spent looking up GO terms, so the discrepancies might actually be due mainly to disk IO (AnnotationDbi uses an SQLite database to store the GO DAGs).

There is also the `expression_semantic_scoring` function, which only calculates the scores. It is not benchmarked separately, because the same work is also done within `expression_prediction`.

## TopoICSim

There are three TopoICSim algorithms in the GAPGOM package: one on the term level (`topo_ic_sim_term`), one on the gene level (`topo_ic_sim_genes`) and one on the geneset level (also `topo_ic_sim_genes`). Because it is slightly difficult to determine which GO terms are attributes of which genes, random genes will be sampled for testing some of the algorithms. The same goes for the GO terms themselves, as their path lookups may differ as well.

First we prepare the GO data, which is used for the random data selection, the TopoICSim term calculation and some other necessary variables.

```{r, eval=FALSE}
# prepare the GO data for human and some other calculations needed later in the benchmarks
organism <- "human"
ontology <- "BP"
go_data <- GAPGOM::set_go_data(organism, ontology)
```

Since the GO data does not necessarily need to be prepared beforehand, real use cases might be slower than these benchmarks if it isn't prepared.
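To illustrate the effect (a minimal sketch, not part of the benchmarks themselves; the GO IDs are simply taken from the random GO set used below): the same term-level call can be timed with and without a pre-built `go_data` object, and the second call should be noticeably faster because the GO annotation does not have to be rebuilt.

```{r, eval=FALSE}
# Without a prepared go_data object, the function builds it internally on each call.
system.time(
  GAPGOM::topo_ic_sim_term(organism, ontology, "GO:0030177", "GO:0001771")
)
# Reusing the go_data object prepared above avoids that overhead.
system.time(
  GAPGOM::topo_ic_sim_term(organism, ontology, "GO:0030177", "GO:0001771",
                           go_data = go_data)
)
```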
### TopoICSim - Term level

To get a good baseline for the underlying algorithm, multiple GO-pair combinations will be tested against each other.

```{r topotitj, eval=FALSE}
# grab 15 random GOs (for the term algorithm)
## sample(unique(go_data@geneAnno$GO), 15)
random_gos <- c("GO:0030177", "GO:0001771", "GO:0045715", "GO:0044330",
                "GO:0098780", "GO:1901006", "GO:0061143", "GO:0060025",
                "GO:0015695", "GO:0090074", "GO:0035445", "GO:0008595",
                "GO:1903634", "GO:1903826", "GO:0048389")
# print them for reproducibility
## dput(random_gos)

# now compare all unique random GO pairs (105 unique pairs).
unique_pairs <- GAPGOM:::.unique_combos(random_gos, random_gos)

times <- c()
mem_usages <- c()
for (i in seq_len(nrow(unique_pairs))) {
  prof_toptitj <- profvis({
    pair <- unique_pairs[i]
    go1 <- pair[[1]]
    go2 <- pair[[2]]
    GAPGOM::topo_ic_sim_term(organism, ontology, go1, go2, go_data = go_data)
  })
  time <- max(prof_toptitj$x$message$prof$time) * 10
  mem <- max(prof_toptitj$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times_term <- times
mems_term <- mem_usages
```

```{r, fig.width=3, fig.height=3}
times_term_df <- data.frame(GAPGOM:::benchmarks$server_times_term,
                            GAPGOM:::benchmarks$laptop_times_term,
                            seq_len(105))
colnames(times_term_df) <- c("server", "laptop", "n")
times_term_df_melted <- melt(times_term_df, id = "n")
colnames(times_term_df_melted) <- c("n", "machine", "milliseconds")
ggplot(times_term_df_melted, aes(x = machine, y = milliseconds, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(times_term_df_melted$milliseconds, n = 5)) +
  labs(title = paste(strwrap("Speed of term algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

The term algorithm largely completes within 1.5 seconds, with ~2 seconds being the maximum. The median is around ~1250 ms, so most calculations take about that long, and they seem to take at least ~900 ms. The reason a single term pair takes this long is that some variables need to be prepared each time the user-facing function is called. In the higher-level functions this overhead becomes negligible, since everything is prepared once at the start.

```{r, fig.width=3, fig.height=3}
mems_term_df <- data.frame(GAPGOM:::benchmarks$server_mems_term,
                           GAPGOM:::benchmarks$laptop_mems_term,
                           seq_len(105))
colnames(mems_term_df) <- c("server", "laptop", "n")
mems_term_df_melted <- melt(mems_term_df, id = "n")
colnames(mems_term_df_melted) <- c("n", "machine", "RAM")
ggplot(mems_term_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_term_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for term algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

The RAM usage is a bit more peculiar: the server's baseline is higher, probably because the server's R session had more in memory from the start. For both time and memory the spread is quite big. This can be explained by the fact that each GO pair has a certain number of common ancestors to loop through, depending on how far the GO terms are removed from each other; this differs on a case-by-case basis and is hard to predict. Other than that, there is a rough spread of ~500 MB for both machines for this algorithm.

### TopoICSim - Gene level

To measure how the gene-level algorithm performs with random genes, we will sample 5 random genes and test all 10 unique gene pairs.
```{r, eval=FALSE}
## dput(sample(unique(go_data@geneAnno$ENTREZID), 5))
random_genes <- c("3848", "2824", "65108", "3988", "10800")

unique_pairs <- GAPGOM:::.unique_combos(random_genes, random_genes)

times <- c()
mem_usages <- c()
for (i in seq_len(nrow(unique_pairs))) {
  prof_topg1g2 <- profvis({
    pair <- unique_pairs[i]
    gene1 <- pair[[1]]
    gene2 <- pair[[2]]
    GAPGOM::topo_ic_sim_genes(organism, ontology, gene1, gene2, go_data = go_data)
  })
  time <- max(prof_topg1g2$x$message$prof$time) * 10
  mem <- max(prof_topg1g2$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times
mem_usages
times_gen <- times
mems_gen <- mem_usages
```

```{r, fig.width=3, fig.height=3}
times_gene_df <- data.frame(GAPGOM:::benchmarks$server_times_gen,
                            GAPGOM:::benchmarks$laptop_times_gen,
                            seq_len(10))
colnames(times_gene_df) <- c("server", "laptop", "n")
times_gene_df_melted <- melt(times_gene_df, id = "n")
times_gene_df_melted$value <- times_gene_df_melted$value / 1000
colnames(times_gene_df_melted) <- c("n", "machine", "seconds")
ggplot(times_gene_df_melted, aes(x = machine, y = seconds, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  labs(title = paste(strwrap("Speed of gene algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

As you can see from the plot, the spread is pretty big. This is because every gene has different GO terms, and each GO pair has a different relationship as well. All of this strongly affects calculation time, making it hard to predict how the algorithm will perform without knowing the specifics of the underlying GO terms. The spread is fairly even between the two machines, however, ranging from ~5 seconds to ~45 seconds. This might indicate that there are groups of genes with certain numbers of GO terms. Most calculations seem to finish within ~32 seconds.

```{r, fig.width=3, fig.height=3}
mems_gene_df <- data.frame(GAPGOM:::benchmarks$server_mems_gen,
                           GAPGOM:::benchmarks$laptop_mems_gen,
                           seq_len(10))
colnames(mems_gene_df) <- c("server", "laptop", "n")
mems_gene_df_melted <- melt(mems_gene_df, id = "n")
colnames(mems_gene_df_melted) <- c("n", "machine", "RAM")
ggplot(mems_gene_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_gene_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for gene algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

Funnily enough, RAM usage here is very consistent; most if not all of the quartiles for both machines fall on a single point (with some spread, however). For the server, RAM usage is mostly around ~190 MB, and for the laptop around ~150 MB.

### TopoICSim - Geneset level (and scalability)

To measure scalability on the geneset level, [the Pfam clan gene set is used](https://pfam.xfam.org/family/PF00171). The reason we do not choose random genes here is that they are not representative of a practical use case: most of the time you want to compare genes that belong to a certain pathway, rather than completely unrelated genes.
```{r, eval=FALSE}
list1 <- c("126133", "221", "218", "216", "8854", "220", "219", "160428",
           "224", "222", "8659", "501", "64577", "223", "217", "4329",
           "10840", "7915", "5832")

times <- c()
mem_usages <- c()
for (i in seq(length(list1) - 1)) {
  sampled_list <- list1[1:(i + 1)]
  print(sampled_list)
  p <- profvis({
    GAPGOM::topo_ic_sim_genes(organism, ontology, sampled_list, sampled_list,
                              drop = NULL, go_data = go_data)
  })
  time <- max(p$x$message$prof$time) * 10
  mem <- max(p$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times
mem_usages
times_genset <- times
mems_genset <- mem_usages
```

```{r, fig.width=6, fig.height=3}
final_times <- data.frame(GAPGOM:::benchmarks$server_times_genset,
                          GAPGOM:::benchmarks$laptop_times_genset,
                          seq_along(GAPGOM:::benchmarks$laptop_times_genset))
colnames(final_times) <- c("server", "laptop", "n")
final_times_melted <- melt(final_times, id = "n")
final_times_melted$value <- final_times_melted$value / 1000 / 60
colnames(final_times_melted) <- c("Genes in geneset", "Machine", "Minutes")
p <- ggplot(data = final_times_melted,
            aes(x = `Genes in geneset`, y = Minutes, colour = Machine)) +
  geom_point() +
  labs(title = paste(strwrap("Speed of TopoICSim geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
p
p + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)
```

As you can see, TopoICSim seems to scale reasonably well, although the underlying algorithm scales roughly quadratically (the geneset mode compares every gene against every other gene in the lists). A certain "stepping" is also visible. This is probably because genes share similar GO terms, which are cached temporarily to prevent unnecessary calculations; this caching becomes less effective the more genes you keep adding, especially unrelated or less related genes. The quadratic fit shows that the underlying curve follows this growth, which is logical because the number of calculations depends directly on the list lengths. The confidence intervals of the two machines mostly overlap at the beginning of the curve; the curve is a bit steeper on the server because of its lower single-core performance.

To give you an idea of actual performance: the "genelist mode" example of the `topo_ic_sim_genes()` function takes roughly ~16 seconds for a 5x5 gene matrix (set `verbose = TRUE` to see the time taken in seconds).
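A minimal sketch of such a genelist-mode call (using the first five Entrez IDs from `list1` above; exact timings will of course differ per machine):

```{r, eval=FALSE}
# Compare a small gene list against itself ("genelist mode"), reusing the
# prepared go_data object. verbose = TRUE prints the time taken in seconds.
small_list <- c("126133", "221", "218", "216", "8854")
res <- GAPGOM::topo_ic_sim_genes(organism, ontology, small_list, small_list,
                                 verbose = TRUE, go_data = go_data)
res$GeneSim
```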
```{r, fig.width=3, fig.height=3}
mems_geneset_df <- data.frame(GAPGOM:::benchmarks$server_mems_genset,
                              GAPGOM:::benchmarks$laptop_mems_genset,
                              seq_len(18))
colnames(mems_geneset_df) <- c("server", "laptop", "n")
mems_geneset_df_melted <- melt(mems_geneset_df, id = "n")
colnames(mems_geneset_df_melted) <- c("Genes in geneset", "machine", "RAM")
ggplot(mems_geneset_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_geneset_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

```{r, fig.width=6, fig.height=3}
p <- ggplot(mems_geneset_df_melted,
            aes(x = `Genes in geneset`, y = RAM, colour = machine)) +
  geom_point() +
  scale_y_continuous(breaks = pretty(mems_geneset_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
p
p + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)
```

You can see that the RAM usage also fits a quadratic-looking curve, again suggesting the underlying algorithm has this growth behaviour. There is very little discrepancy between the two machines in terms of RAM usage. What is notable as well, but not shown in the graphs, is that RAM usage drops as soon as the R session is reset, which indicates that memory is being retained ("leaked") between runs. This is also why it is advised to restart your R session every now and then. These leaks seem to be attributable to the use of `apply` and are hard to prevent or avoid.

### Overall thoughts

Although speed can be determined reasonably well for these algorithms, RAM usage is trickier because of the many variables involved: differences between R sessions, memory leaks, and other semi-random events and details of how the code is structured. The main conclusion we can draw is that quite a lot of memory is retained, especially in the case of the geneset algorithm because of its iterative nature. Speed might vary hugely depending on the organism, ontology and geneset used, because, again, GO paths may differ drastically. Also, profvis' timings can be inaccurate, since profiling itself impacts performance considerably.

## Benchmarks

In the following benchmark, lncRNA2GOA itself will be tested. We use the Glycolysis hallmark geneset together with the GSE63733 expression dataset to test how well lncRNA2GOA performs. First, we take only a small selection of the geneset, based on the highest variance in expression (otherwise the calculations take too long). After this, we predict the annotation of each member of this subset using the full expression set (a leave-one-out approach). With these predictions, we then compare the semantic similarity scores of the original genes against their predicted counterparts.

```{r, eval=FALSE}
## set stringsAsFactors to FALSE to ensure the data is loaded properly.
options(stringsAsFactors = FALSE)

## load data
expdata <- read.table("~/Downloads/GSE63733_m.txt")
# http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_GLYCOLYSIS.html
# GLYCOLYSIS HALLMARK geneset
geneset <- read.table("~/Downloads/geneset_glyc.txt", sep = "\n")
colnames(expdata)[1:2] <- c("ENSEMBL", "SYMBOL")  # Set important colnames

## make an expressionset
expression_matrix <- as.matrix(expdata[, 3:ncol(expdata)])
rownames(expression_matrix) <- expdata[[1]]
featuredat <- as.data.frame(expdata[, 1:2])
rownames(featuredat) <- expdata[[1]]
expset <- ExpressionSet(expression_matrix,
                        featureData = new("AnnotatedDataFrame", data = featuredat))

## make a selection on the geneset
geneset <- as.vector(geneset[3:nrow(geneset), ])
geneset_ensembl <- expdata[expdata[[2]] %in% geneset, ][[1]]
# select the 10 genes with the most variance
geneset_ensembl_topvar <- names(rev(sort(apply(
  expression_matrix[geneset_ensembl, ], 1, var)))[1:10])
# convert back to symbol
geneset_selection <- expdata[expdata$ENSEMBL %in% geneset_ensembl_topvar, ][[2]]

## predict annotation of the geneset selection (per ontology).
ontologies <- c("MF", "BP", "CC")
bench_results <- list()
for (ontology in ontologies) {
  # print("---")
  # print(ontology)
  predicted_annotations <- list()
  for (gene in geneset_ensembl_topvar) {
    symbol <- expdata[expdata$ENSEMBL == gene, ]$SYMBOL
    result <- GAPGOM::expression_prediction(gene,
                                            expset,
                                            "human",
                                            ontology,
                                            idtype = "ENSEMBL")
    go_annotation_prediction <- unique(as.vector(result$GOID))
    predicted_annotations[[symbol]] <- go_annotation_prediction
  }
  print("Prediction finished! Calculating sims now")
  ## compare custom genes to original genes
  tmp_go_data <- set_go_data("human", ontology, computeIC = TRUE, keytype = "SYMBOL")
  bench_results[[ontology]] <- mapply(
    function(gos, name) {
      custom_gene <- list()
      custom_gene[[paste0(name, "_pred")]] <- gos
      name_orig <- unlist(strsplit(name, "_"), FALSE, FALSE)[1]
      # print(name_orig)
      return(GAPGOM::topo_ic_sim_genes("human", ontology, c(), name_orig,
                                       custom_genes1 = custom_gene,
                                       idtype = "SYMBOL",
                                       progress_bar = FALSE,
                                       go_data = tmp_go_data)$GeneSim)
    },
    gos = predicted_annotations,
    name = names(predicted_annotations)
  )
}
# save(bench_results, file = "~/tmp_bench.RData")
```

For the results, we calculated the similarity of each original gene against its predicted-annotation counterpart. The similarity should be as close to 1 as possible (exactly the same annotation). The predicted GO terms may not be exactly the same or overlapping, but can still be close in the GO tree, meaning they still indicate the correct process/function/component (depending on the ontology). Because this benchmark also takes long to run, its results are stored with the package.

```{r}
bench_results <- GAPGOM:::benchmarks$bench_results
ontologies <- c("MF", "BP", "CC")
for (ontology in ontologies) {
  dat <- data.frame(Gene = names(bench_results[[ontology]]),
                    sim = bench_results[[ontology]])
  brplot <- ggplot(dat, aes(x = Gene, weight = sim)) +
    geom_bar() +
    coord_flip() +
    labs(title = paste(strwrap(paste0(
      "Similarities of custom genes versus originals in ", ontology,
      " Ontology"), width = 30), collapse = "\n")) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ylab("Similarity (TopoICSim)")
  print("---")
  print(ontology)
  print("mean:")
  print(mean(bench_results[[ontology]]))
  print(brplot)
}
```

Here we can see that the scores rank as MF < BP < CC. This presumably has to do with the increasing ambiguity between these ontologies.
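The per-ontology means behind that observation can be pulled directly out of the stored benchmark results (a small convenience snippet, not part of the original benchmark code):

```{r}
# Mean TopoICSim similarity of predicted vs. original annotation, per ontology.
sapply(GAPGOM:::benchmarks$bench_results, mean)
```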
From these results we can roughly say that, on average and depending on the ontology, the predicted annotations have a TopoICSim similarity to the original genes of between ~0.6 and ~0.86.

## For code reviewers

Finally, for code reviewers: you can simply call (print) the result of each `profvis` run to get a flame graph and other details about code speed.

## SessionInfo

```{r}
sessionInfo()
```