---
title: "Benchmarks and other GO similarity methods"
author: "by Casper Peters"
date: "`r Sys.Date()`"
output:
  # rmarkdown::html_vignette:
  prettydoc::html_pretty:
    theme: cayman
    toc: true
    toc_depth: 4
    highlight: vignette
    compressed_css: true
    code_folding: none
vignette: >
  %\VignetteIndexEntry{Benchmarks and other GO similarity methods}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: inline
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

### Necessary libraries

```{r load_libs, warning = FALSE, message = FALSE, include = TRUE}
library(GAPGOM)
library(profvis)
library(GO.db)
library(graph)
library(ggplot2)
library(reshape2)
library(Biobase)
```

### Background

Benchmarks are an important part of getting to know how fast algorithms perform and what to expect. Thus, we will compare multiple example use cases to give an idea of how fast the GAPGOM algorithms are, and try to give some insight into why certain algorithms perform the way they do. The profvis library will be used to determine the amount of time spent on each calculation, as well as RAM usage. **Since profvis can hugely impact performance depending on the specifics of the code, the reported timings might not be an accurate representation of real-time performance. They do, however, show the important relations between analyses.**

Because this package is mainly made for human gene research, human is the model organism used for all testing here, and the Biological Process (BP) ontology is used as the model ontology. Some internal functions are used because they are needed for some of the operations required by the base algorithms, or for showing off the performance.

Most of these results are based on random samples and shouldn't necessarily be taken as-is or as a baseline. Instead, this benchmarking is meant to give an idea of how well the processes scale. Besides this, the results have been pre-calculated and ship with the package, as the calculation times would otherwise be neither reasonable nor accurate (for some reason, building a vignette takes longer than evaluating normal R code). All code blocks used to generate the results are included, however.
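For reference, the timings and memory figures shown throughout this vignette are extracted from the profvis result object. A minimal sketch of that pattern (the workload inside `profvis` is just a placeholder here) looks like this:

```{r, eval=FALSE}
# Profile an arbitrary (placeholder) workload. profvis samples every 10 ms by
# default, so max(time) * 10 gives the elapsed time in milliseconds;
# memalloc is the peak memory allocation reported by profvis (in MB).
p <- profvis::profvis({
  x <- lapply(seq_len(50), function(i) sort(runif(1e5)))
})
time_ms <- max(p$x$message$prof$time) * 10
mem_mb <- max(p$x$message$prof$memalloc)
```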
The benchmarks have been run on two systems, a laptop and a larger server, both with the following specs (obtained with `cat /proc/cpuinfo` and `cat /proc/meminfo`):

- **Laptop**

CPU:

```
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping : 3
microcode : 0xc6
cpu MHz : 2800.039
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 5184.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
```

Memory:

```
MemTotal: 8071404 kB
MemFree: 978056 kB
MemAvailable: 3087884 kB
Buffers: 367592 kB
Cached: 2616176 kB
SwapCached: 0 kB
Active: 4737256 kB
Inactive: 1944028 kB
Active(anon): 3699308 kB
Inactive(anon): 760208 kB
Active(file): 1037948 kB
Inactive(file): 1183820 kB
Unevictable: 48 kB
Mlocked: 48 kB
SwapTotal: 8000508 kB
SwapFree: 8000508 kB
Dirty: 5696 kB
Writeback: 0 kB
AnonPages: 3697580 kB
Mapped: 630132 kB
Shmem: 761956 kB
Slab: 279024 kB
SReclaimable: 198148 kB
SUnreclaim: 80876 kB
KernelStack: 11136 kB
PageTables: 47776 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 12036208 kB
Committed_AS: 9791644 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 166288 kB
DirectMap2M: 7077888 kB
DirectMap1G: 2097152 kB
```

- **Server**

CPU:

```
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping : 7
microcode : 0x713
cpu MHz : 2699.953
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 5400.94
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
```

Memory:

```
MemTotal: 264108916 kB
MemFree: 66614704 kB
Buffers: 262744 kB
Cached: 65338616 kB
SwapCached: 74732 kB
Active: 62788360 kB
Inactive: 44151044 kB
Active(anon): 56367000 kB
Inactive(anon): 23284424 kB
Active(file): 6421360 kB
Inactive(file): 20866620 kB
Unevictable: 7556 kB
Mlocked: 7556 kB
SwapTotal: 268387324 kB
SwapFree: 265225612 kB
Dirty: 88 kB
Writeback: 0 kB
AnonPages: 41301344 kB
Mapped: 212636 kB
Shmem: 38309896 kB
Slab: 21008548 kB
SReclaimable: 630584 kB
SUnreclaim: 20377964 kB
KernelStack: 12368 kB
PageTables: 125496 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 400441780 kB
Committed_AS: 89205012 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 68290760 kB
VmallocChunk: 34093725116 kB
HardwareCorrupted: 0 kB
AnonHugePages: 11550720 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 191424 kB
DirectMap2M: 5005312 kB
DirectMap1G: 265289728 kB
```

**All processes are single-core.** (Some functions might be multi-threadable, however.)

## lncRNA2GOA `expression_prediction`

Since lncRNA2GOA is based on expression values, we'll use the standard example dataset that is included with the package to make life easier. This algorithm should scale linearly with the number of expression values you add. This will not be shown here, however, because the actual calculations on the expression data are very fast; the main thing that slows this algorithm down is querying the GO terms and attaching them to the correct IDs for the results.

```{r, eval=FALSE}
# Example with default dataset, take a look at the data documentation
# to fully grasp what's going on with making of the filter etc. (Biobase
# ExpressionSet)

# keep everything that is a protein coding gene
filter_vector <- expset@featureData@data[(
  expset@featureData@data$GeneType == "protein_coding"), ]$GeneID
# set gid and run.
gid <- "ENSG00000228630"

p <- profvis::profvis({
  GAPGOM::expression_prediction(gid,
                                GAPGOM::expset,
                                "human",
                                "BP",
                                id_translation_df = GAPGOM::id_translation_df,
                                id_select_vector = filter_vector,
                                method = "combine",
                                verbose = TRUE,
                                filter_pvals = TRUE)
})
time <- max(p$x$message$prof$time) * 10
mem <- max(p$x$message$prof$memalloc)
```

```{r}
# laptop
# time 310
# mem 124.91
# ---
# server
# time 560
# mem 72.47
```

Neither the laptop nor the server took long, because the example is rather small: the laptop took about ~310 ms to complete the calculations, while the server took ~560 ms. If no `id_translation_df` is supplied, most of the time is spent looking up GO terms, so the discrepancies might actually be due mainly to disk IO (AnnotationDbi uses an SQLite database to store the GO DAGs).

There is also the `expression_semantic_scoring` function, which only calculates the scores. It is not benchmarked separately, because the same work is also done within `expression_prediction`.

## TopoICSim

There are three TopoICSim algorithms in the GAPGOM package: one on the term level (`topo_ic_sim_term`), one on the gene level (`topo_ic_sim_genes`) and one on the geneset level (also `topo_ic_sim_genes`). Because it is slightly difficult to determine which GO terms are attributes of which genes, random genes will be sampled for testing some of the algorithms. The same goes for the GO terms themselves, as their path lookups may differ as well.

First we prepare the GO data, which is used for the random data selection, the TopoICSim term calculation and some other necessary variables.

```{r, eval=FALSE}
# prepare the GO data for human and some other calculations needed later in the benchmarks
organism <- "human"
ontology <- "BP"
go_data <- GAPGOM::set_go_data(organism, ontology)
```

Since the GO data does not necessarily need to be prepared beforehand, real use cases might be slower than these benchmarks if it isn't prepared.
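To illustrate the effect (a minimal sketch, not part of the benchmarks themselves; the GO IDs are simply taken from the random GO set used below): the same term-level call can be timed with and without a pre-built `go_data` object, and the second call should be noticeably faster because the GO annotation does not have to be rebuilt.

```{r, eval=FALSE}
# Without a prepared go_data object, the function builds it internally on each call.
system.time(
  GAPGOM::topo_ic_sim_term(organism, ontology, "GO:0030177", "GO:0001771")
)
# Reusing the go_data object prepared above avoids that overhead.
system.time(
  GAPGOM::topo_ic_sim_term(organism, ontology, "GO:0030177", "GO:0001771",
                           go_data = go_data)
)
```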
### TopoICSim - Term level

To get a good baseline for the underlying algorithm, multiple GO-pair combinations will be tested against each other.

```{r topotitj, eval=FALSE}
# grab 15 random GOs (for the term algorithm)
## sample(unique(go_data@geneAnno$GO), 15)
random_gos <- c("GO:0030177", "GO:0001771", "GO:0045715", "GO:0044330",
                "GO:0098780", "GO:1901006", "GO:0061143", "GO:0060025",
                "GO:0015695", "GO:0090074", "GO:0035445", "GO:0008595",
                "GO:1903634", "GO:1903826", "GO:0048389")
# print them for reproducibility
## dput(random_gos)

# now compare all unique random GO pairs (105 unique pairs).
unique_pairs <- GAPGOM:::.unique_combos(random_gos, random_gos)

times <- c()
mem_usages <- c()
for (i in seq_len(nrow(unique_pairs))) {
  prof_toptitj <- profvis({
    pair <- unique_pairs[i]
    go1 <- pair[[1]]
    go2 <- pair[[2]]
    GAPGOM::topo_ic_sim_term(organism, ontology, go1, go2, go_data = go_data)
  })
  time <- max(prof_toptitj$x$message$prof$time) * 10
  mem <- max(prof_toptitj$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times_term <- times
mems_term <- mem_usages
```

```{r, fig.width=3, fig.height=3}
times_term_df <- data.frame(GAPGOM:::benchmarks$server_times_term,
                            GAPGOM:::benchmarks$laptop_times_term,
                            seq_len(105))
colnames(times_term_df) <- c("server", "laptop", "n")
times_term_df_melted <- melt(times_term_df, id = "n")
colnames(times_term_df_melted) <- c("n", "machine", "milliseconds")
ggplot(times_term_df_melted, aes(x = machine, y = milliseconds, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(times_term_df_melted$milliseconds, n = 5)) +
  labs(title = paste(strwrap("Speed of term algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

The term algorithm largely completes within 1.5 seconds, with ~2 seconds being the maximum. The median is around ~1250 ms, so most calculations take about that long, and they seem to take at least ~900 ms. The reason a single term pair takes this long is that some variables need to be prepared each time the user-facing function is called. In the higher-level functions this overhead becomes negligible, since everything is prepared once at the start.

```{r, fig.width=3, fig.height=3}
mems_term_df <- data.frame(GAPGOM:::benchmarks$server_mems_term,
                           GAPGOM:::benchmarks$laptop_mems_term,
                           seq_len(105))
colnames(mems_term_df) <- c("server", "laptop", "n")
mems_term_df_melted <- melt(mems_term_df, id = "n")
colnames(mems_term_df_melted) <- c("n", "machine", "RAM")
ggplot(mems_term_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_term_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for term algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

The RAM usage is a bit more peculiar: the server's baseline is higher, probably because the server's R session had more in memory from the start. For both time and memory the spread is quite big. This can be explained by the fact that each GO pair has a certain number of common ancestors to loop through, depending on how far the GO terms are removed from each other; this differs on a case-by-case basis and is hard to predict. Other than that, there is a rough spread of ~500 MB for both machines for this algorithm.

### TopoICSim - Gene level

To measure how the gene-level algorithm performs with random genes, we will sample 5 random genes and test all 10 unique gene pairs.
```{r, eval=FALSE}
## dput(sample(unique(go_data@geneAnno$ENTREZID), 5))
random_genes <- c("3848", "2824", "65108", "3988", "10800")

unique_pairs <- GAPGOM:::.unique_combos(random_genes, random_genes)

times <- c()
mem_usages <- c()
for (i in seq_len(nrow(unique_pairs))) {
  prof_topg1g2 <- profvis({
    pair <- unique_pairs[i]
    gene1 <- pair[[1]]
    gene2 <- pair[[2]]
    GAPGOM::topo_ic_sim_genes(organism, ontology, gene1, gene2, go_data = go_data)
  })
  time <- max(prof_topg1g2$x$message$prof$time) * 10
  mem <- max(prof_topg1g2$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times
mem_usages
times_gen <- times
mems_gen <- mem_usages
```

```{r, fig.width=3, fig.height=3}
times_gene_df <- data.frame(GAPGOM:::benchmarks$server_times_gen,
                            GAPGOM:::benchmarks$laptop_times_gen,
                            seq_len(10))
colnames(times_gene_df) <- c("server", "laptop", "n")
times_gene_df_melted <- melt(times_gene_df, id = "n")
times_gene_df_melted$value <- times_gene_df_melted$value / 1000
colnames(times_gene_df_melted) <- c("n", "machine", "seconds")
ggplot(times_gene_df_melted, aes(x = machine, y = seconds, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  labs(title = paste(strwrap("Speed of gene algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

As you can see from the plot, the spread is pretty big. This is because every gene has different GO terms, and each GO pair has a different relationship as well. All of this strongly affects calculation time, making it hard to predict how the algorithm will perform without knowing the specifics of the underlying GO terms. The spread is fairly even between the two machines, however, ranging from ~5 seconds to ~45 seconds. This might indicate that there are groups of genes with certain numbers of GO terms. Most calculations seem to finish within ~32 seconds.

```{r, fig.width=3, fig.height=3}
mems_gene_df <- data.frame(GAPGOM:::benchmarks$server_mems_gen,
                           GAPGOM:::benchmarks$laptop_mems_gen,
                           seq_len(10))
colnames(mems_gene_df) <- c("server", "laptop", "n")
mems_gene_df_melted <- melt(mems_gene_df, id = "n")
colnames(mems_gene_df_melted) <- c("n", "machine", "RAM")
ggplot(mems_gene_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_gene_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for gene algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

Funnily enough, RAM usage here is very consistent; most if not all of the quartiles for both machines fall on a single point (with some spread, however). For the server, RAM usage is mostly around ~190 MB, and for the laptop around ~150 MB.

### TopoICSim - Geneset level (and scalability)

To measure scalability on the geneset level, [the Pfam clan gene set is used](https://pfam.xfam.org/family/PF00171). The reason we do not choose random genes here is that they are not representative of a practical use case: most of the time you want to compare genes that belong to a certain pathway, rather than completely unrelated genes.
```{r, eval=FALSE}
list1 <- c("126133", "221", "218", "216", "8854", "220", "219", "160428",
           "224", "222", "8659", "501", "64577", "223", "217", "4329",
           "10840", "7915", "5832")

times <- c()
mem_usages <- c()
for (i in seq(length(list1) - 1)) {
  sampled_list <- list1[1:(i + 1)]
  print(sampled_list)
  p <- profvis({
    GAPGOM::topo_ic_sim_genes(organism, ontology, sampled_list, sampled_list,
                              drop = NULL, go_data = go_data)
  })
  time <- max(p$x$message$prof$time) * 10
  mem <- max(p$x$message$prof$memalloc)
  mem_usages <- c(mem_usages, mem)
  times <- c(times, time)
  gc()
}
times
mem_usages
times_genset <- times
mems_genset <- mem_usages
```

```{r, fig.width=6, fig.height=3}
final_times <- data.frame(GAPGOM:::benchmarks$server_times_genset,
                          GAPGOM:::benchmarks$laptop_times_genset,
                          seq_along(GAPGOM:::benchmarks$laptop_times_genset))
colnames(final_times) <- c("server", "laptop", "n")
final_times_melted <- melt(final_times, id = "n")
final_times_melted$value <- final_times_melted$value / 1000 / 60
colnames(final_times_melted) <- c("Genes in geneset", "Machine", "Minutes")
p <- ggplot(data = final_times_melted,
            aes(x = `Genes in geneset`, y = Minutes, colour = Machine)) +
  geom_point() +
  labs(title = paste(strwrap("Speed of TopoICSim geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
p
p + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)
```

As you can see, TopoICSim seems to scale reasonably well, although the underlying algorithm scales roughly quadratically (the geneset mode compares every gene against every other gene in the lists). A certain "stepping" is also visible. This is probably because genes share similar GO terms, which are cached temporarily to prevent unnecessary calculations; this caching becomes less effective the more genes you keep adding, especially unrelated or less related genes. The quadratic fit shows that the underlying curve follows this growth, which is logical because the number of calculations depends directly on the list lengths. The confidence intervals of the two machines mostly overlap at the beginning of the curve; the curve is a bit steeper on the server because of its lower single-core performance.

To give you an idea of actual performance: the "genelist mode" example of the `topo_ic_sim_genes()` function takes roughly ~16 seconds for a 5x5 gene matrix (set `verbose = TRUE` to see the time taken in seconds).
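A minimal sketch of such a genelist-mode call (using the first five Entrez IDs from `list1` above; exact timings will of course differ per machine):

```{r, eval=FALSE}
# Compare a small gene list against itself ("genelist mode"), reusing the
# prepared go_data object. verbose = TRUE prints the time taken in seconds.
small_list <- c("126133", "221", "218", "216", "8854")
res <- GAPGOM::topo_ic_sim_genes(organism, ontology, small_list, small_list,
                                 verbose = TRUE, go_data = go_data)
res$GeneSim
```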
```{r, fig.width=3, fig.height=3}
mems_geneset_df <- data.frame(GAPGOM:::benchmarks$server_mems_genset,
                              GAPGOM:::benchmarks$laptop_mems_genset,
                              seq_len(18))
colnames(mems_geneset_df) <- c("server", "laptop", "n")
mems_geneset_df_melted <- melt(mems_geneset_df, id = "n")
colnames(mems_geneset_df_melted) <- c("Genes in geneset", "machine", "RAM")
ggplot(mems_geneset_df_melted, aes(x = machine, y = RAM, colour = machine)) +
  geom_boxplot(notch = FALSE) +
  scale_y_continuous(breaks = pretty(mems_geneset_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
```

```{r, fig.width=6, fig.height=3}
p <- ggplot(mems_geneset_df_melted,
            aes(x = `Genes in geneset`, y = RAM, colour = machine)) +
  geom_point() +
  scale_y_continuous(breaks = pretty(mems_geneset_df_melted$RAM, n = 5)) +
  labs(title = paste(strwrap("RAM usage for geneset algorithm", width = 20),
                     collapse = "\n")) +
  theme(plot.title = element_text(hjust = 0.5))
p
p + stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)
```

You can see that the RAM usage also fits a quadratic-looking curve, again suggesting the underlying algorithm has this growth behaviour. There is very little discrepancy between the two machines in terms of RAM usage. What is notable as well, but not shown in the graphs, is that RAM usage drops as soon as the R session is reset, which indicates that memory is being retained ("leaked") between runs. This is also why it is advised to restart your R session every now and then. These leaks seem to be attributable to the use of `apply` and are hard to prevent or avoid.

### Overall thoughts

Although speed can be determined reasonably well for these algorithms, RAM usage is trickier because of the many variables involved: differences between R sessions, memory leaks, and other semi-random events and details of how the code is structured. The main conclusion we can draw is that quite a lot of memory is retained, especially in the case of the geneset algorithm because of its iterative nature. Speed might vary hugely depending on the organism, ontology and geneset used, because, again, GO paths may differ drastically. Also, profvis' timings can be inaccurate, since profiling itself impacts performance considerably.

## Benchmarks

In the following benchmark, lncRNA2GOA itself will be tested. We use the Glycolysis hallmark geneset together with the GSE63733 expression dataset to test how well lncRNA2GOA performs. First, we take only a small selection of the geneset, based on the highest variance in expression (otherwise the calculations take too long). After this, we predict the annotation of each member of this subset using the full expression set (a leave-one-out approach). With these predictions, we then compare the semantic similarity scores of the original genes against their predicted counterparts.

```{r, eval=FALSE}
## set stringsAsFactors to FALSE to ensure the data is loaded properly.
options(stringsAsFactors = FALSE)

## load data
expdata <- read.table("~/Downloads/GSE63733_m.txt")
# http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_GLYCOLYSIS.html
# GLYCOLYSIS HALLMARK geneset
geneset <- read.table("~/Downloads/geneset_glyc.txt", sep = "\n")
colnames(expdata)[1:2] <- c("ENSEMBL", "SYMBOL")  # Set important colnames

## make an expressionset
expression_matrix <- as.matrix(expdata[, 3:ncol(expdata)])
rownames(expression_matrix) <- expdata[[1]]
featuredat <- as.data.frame(expdata[, 1:2])
rownames(featuredat) <- expdata[[1]]
expset <- ExpressionSet(expression_matrix,
                        featureData = new("AnnotatedDataFrame", data = featuredat))

## make a selection on the geneset
geneset <- as.vector(geneset[3:nrow(geneset), ])
geneset_ensembl <- expdata[expdata[[2]] %in% geneset, ][[1]]
# select the 10 genes with the most variance
geneset_ensembl_topvar <- names(rev(sort(apply(
  expression_matrix[geneset_ensembl, ], 1, var)))[1:10])
# convert back to symbol
geneset_selection <- expdata[expdata$ENSEMBL %in% geneset_ensembl_topvar, ][[2]]

## predict annotation of the geneset selection (per ontology).
ontologies <- c("MF", "BP", "CC")
bench_results <- list()
for (ontology in ontologies) {
  # print("---")
  # print(ontology)
  predicted_annotations <- list()
  for (gene in geneset_ensembl_topvar) {
    symbol <- expdata[expdata$ENSEMBL == gene, ]$SYMBOL
    result <- GAPGOM::expression_prediction(gene,
                                            expset,
                                            "human",
                                            ontology,
                                            idtype = "ENSEMBL")
    go_annotation_prediction <- unique(as.vector(result$GOID))
    predicted_annotations[[symbol]] <- go_annotation_prediction
  }
  print("Prediction finished! Calculating sims now")
  ## compare custom genes to original genes
  tmp_go_data <- set_go_data("human", ontology, computeIC = TRUE, keytype = "SYMBOL")
  bench_results[[ontology]] <- mapply(
    function(gos, name) {
      custom_gene <- list()
      custom_gene[[paste0(name, "_pred")]] <- gos
      name_orig <- unlist(strsplit(name, "_"), FALSE, FALSE)[1]
      # print(name_orig)
      return(GAPGOM::topo_ic_sim_genes("human", ontology, c(), name_orig,
                                       custom_genes1 = custom_gene,
                                       idtype = "SYMBOL",
                                       progress_bar = FALSE,
                                       go_data = tmp_go_data)$GeneSim)
    },
    gos = predicted_annotations,
    name = names(predicted_annotations)
  )
}
# save(bench_results, file = "~/tmp_bench.RData")
```

For the results, we calculated the similarity of each original gene against its predicted-annotation counterpart. The similarity should be as close to 1 as possible (exactly the same annotation). The predicted GO terms may not be exactly the same or overlapping, but can still be close in the GO tree, meaning they still indicate the correct process/function/component (depending on the ontology). Because this benchmark also takes long to run, its results are stored with the package.

```{r}
bench_results <- GAPGOM:::benchmarks$bench_results
ontologies <- c("MF", "BP", "CC")
for (ontology in ontologies) {
  dat <- data.frame(Gene = names(bench_results[[ontology]]),
                    sim = bench_results[[ontology]])
  brplot <- ggplot(dat, aes(x = Gene, weight = sim)) +
    geom_bar() +
    coord_flip() +
    labs(title = paste(strwrap(paste0(
      "Similarities of custom genes versus originals in ", ontology,
      " Ontology"), width = 30), collapse = "\n")) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ylab("Similarity (TopoICSim)")
  print("---")
  print(ontology)
  print("mean:")
  print(mean(bench_results[[ontology]]))
  print(brplot)
}
```

Here we can see that the scores rank as MF < BP < CC. This presumably has to do with the increasing ambiguity between these ontologies.
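The per-ontology means behind that observation can be pulled directly out of the stored benchmark results (a small convenience snippet, not part of the original benchmark code):

```{r}
# Mean TopoICSim similarity of predicted vs. original annotation, per ontology.
sapply(GAPGOM:::benchmarks$bench_results, mean)
```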
From these results we can roughly say that, on average and depending on the ontology, the predicted annotations have a TopoICSim similarity to the original genes of between ~0.6 and ~0.86.

## For code reviewers

Finally, for code reviewers: you can simply call (print) the result of each `profvis` run to get a flame graph and other details about code speed.

## SessionInfo

```{r}
sessionInfo()
```