funOmics

Elisa Gómez de Lope

2024-02-16

Introduction

The funOmics R package is a collection of functions designed to aggregate omics data into higher-level functional representations such as pathways, protein complexes, and cellular locations. This vignette provides a detailed guide on how to use the package.

Omics data analysis is a critical component of modern biomedical research. The funOmics package provides a tool for aggregating omics data from high-throughput experiments (e.g. transcriptomics, metabolomics, proteomics) into higher-level functional activity scores that can then be used for further analysis and modeling. This capability provides a more global view of the biological systems, reduces the dimensionality, and facilitates biological interpretation of results.

The package provides different pooling operators, such as aggregation statistics (mean, median, standard deviation, min, max), dimension-reduction derived scores (pca, nmf, mds, pathifier), or test statistics (t-test, Wilcoxon test, Kolmogorov–Smirnov test) with options for adjusting parameters and settings to suit specific research questions and data types. The package is also well-documented, with detailed descriptions of each function and an example of usage.

funOmics distinguishes itself from existing Bioconductor packages dedicated to pathway or gene set analysis such as GSEA and ORA (clusterProfiler, fgsea, GSEAset), or GSVA, by offering a comprehensive tool for directly aggregating diverse omics data types into higher-level functional representations, allowing the analysis of such functional representations as functional activity scores that can be modeled as input features for identifying candidate biomarkers, or in clustering strategies for patient identification. Unlike GSEA and ORA, which primarily focus on gene expression and predefined gene sets, funOmics accommodates various omics modalities (e.g., metabolomics, transcriptomics, proteomics), and allows users to define custom molecular sets for aggregation. Additionally, funOmics goes beyond GSVA by providing flexibility in the choice of aggregation operators, enabling users to derive interpretable functional activity scores tailored to their specific research questions. By offering a flexible and user-friendly, alternative tool for functional analysis, funOmics aims to contribute to the diverse array of Bioconductor packages and enhance the capabilities of the community.

Installation

Install funOmics from Bioconductor (release 3.19 onwards) via:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("funOmics")

or the pre-release and latest development version from GitHub:

if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("elisagdelope/funOmics") 

Usage

Loading the Package

To use the Funomics R package, load it with the following command:

library(funOmics)

Main Function: summarize_pathway_level

You can then access the main function provided by the package, summarize_pathway_level with the type of pooling operator desired to be applied for each molecular set. This function has several options for adjusting parameters and settings to suit specific research questions and data types. The available aggregation operators and other parameters options are described in detail in the package documentation. You can also see the documentation for this function with the command:

?funOmics::summarize_pathway_level
## No documentation for 'summarize_pathway_level' in specified packages and libraries:
## you could try '??summarize_pathway_level'

Let’s see two examples of usage, with a simulated gene expression matrix X of dimensions g*s (g genes and s samples), and a list of 100 gene sets pathways. The expression values are random positive values sampled from a standard normal distribution. These examples mimic gene expression data and pathway gene sets, but funOmics can be used to aggregate other types of omics data and molecular sets, such as metabolomics or proteomics. Go to section Real data: Where to find molecular sets?, to see where to find real molecular sets for different types of omics. Note that if your actual omics data is in SummarizedExperiment format, the assay matrix can easily be retrieved as a matrix format (assay(SEobject)).

Generate simulation data

# Example usage:
set.seed(1)
g <- 10000
s <- 20
X <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(paste0("g", 1:g), paste0("s", 1:s)))
pathways <- as.list(sample(10:100, size = 100, replace = TRUE))
pathways <- lapply(pathways, function(s, g) paste0("g", sample(1:g, size = s, replace = FALSE)), g)
names(pathways) <- paste0("pathway", seq_along(pathways))
print(paste("Dimensions of omics matrix X:", dim(X)[1], "*", dim(X)[2], "; Length of molecular sets list:", length(pathways)))
## [1] "Dimensions of omics matrix X: 10000 * 20 ; Length of molecular sets list: 100"
print(head(X))
##           s1        s2        s3        s4        s5        s6        s7
## g1 0.6264538 0.8043316 0.2353485 0.6179223 0.2212571 0.5258908 0.3413341
## g2 0.1836433 1.0565257 0.2448250 0.8935057 0.3517935 0.4875444 0.4136665
## g3 0.8356286 1.0353958 0.6421869 0.4277562 0.1606019 1.1382508 0.1220357
## g4 1.5952808 1.1855604 1.9348085 0.2999012 0.1240523 1.2151344 1.5893806
## g5 0.3295078 0.5004395 1.0386957 0.5319833 0.6598739 0.4248307 0.7874385
## g6 0.8204684 0.5249887 0.2835501 1.7059816 0.5038493 1.4508403 1.5920640
##            s8         s9        s10       s11          s12       s13       s14
## g1 1.00203611 1.55915937 0.09504307 0.7914415 0.0145724495 0.8663644 0.1604426
## g2 0.02590761 0.20166217 0.38805939 0.3921679 1.7854043337 0.9476952 0.9241849
## g3 0.44814178 1.04017610 2.13657003 0.4726670 0.0002997544 0.4522428 1.5561751
## g4 0.84323332 0.07195772 0.55661945 0.4579517 0.4356948690 0.2782408 0.8812202
## g5 0.21846310 0.01526544 0.59094164 0.1681319 1.4076452475 1.4175945 0.5263595
## g6 0.47678629 0.33938598 1.52014345 0.5856737 0.6929698698 0.6329981 0.4627372
##            s15       s16       s17       s18        s19        s20
## g1 0.249371112 1.2520152 2.3150930 0.4414410 0.82485558 1.37086468
## g2 0.335346796 0.3351313 1.0603800 0.4130862 0.74402087 0.54569610
## g3 0.004405287 0.1080678 0.3970672 0.8660777 0.69009734 1.62446330
## g4 0.986768348 0.4717051 0.4840034 2.2615708 1.76900681 0.06247283
## g5 0.543575705 2.5070607 1.3584146 0.1787018 0.55215640 0.57021255
## g6 0.626142823 1.2451344 0.6370574 0.4713582 0.03257056 0.31350574
print(head(pathways))
## $pathway1
##  [1] "g6248" "g3566" "g2064" "g8249" "g8532" "g5047" "g4018" "g7193" "g2574"
## [10] "g3728" "g4493" "g4394" "g9017" "g2334" "g7652" "g9604" "g9035" "g4100"
## [19] "g408"  "g9042" "g1229" "g2484" "g7442" "g7239" "g3059" "g811"  "g509" 
## [28] "g2182" "g8420" "g5947" "g1425" "g7218" "g7758" "g1236" "g6966" "g9457"
## [37] "g2668" "g9397" "g7394" "g2975" "g4304" "g9533" "g5287" "g5723" "g6432"
## [46] "g3826" "g3774" "g5906" "g6948" "g295"  "g6675" "g2471" "g4504" "g5162"
## [55] "g3110" "g1901" "g7837" "g5189" "g4422" "g2063" "g7950" "g2301" "g1107"
## [64] "g856"  "g6523" "g6524" "g1880" "g1675" "g7082" "g9914" "g2523" "g6892"
## [73] "g3631" "g8880" "g5560" "g6779" "g5713" "g5485" "g6659" "g6694" "g9592"
## [82] "g2434" "g3335" "g7725" "g3508" "g9399" "g7404" "g7160" "g9628" "g9345"
## 
## $pathway2
##  [1] "g1046" "g9504" "g6237" "g4026" "g2799" "g1764" "g8273" "g9165" "g6322"
## [10] "g9264" "g6996" "g3352" "g6635" "g2801" "g4064" "g2936" "g1823" "g2693"
## [19] "g3927" "g9267" "g3880" "g4245" "g2254" "g1570" "g6507" "g7158" "g7553"
## [28] "g292"  "g7602" "g7571" "g2353" "g3741" "g5694" "g706"  "g9877" "g6889"
## [37] "g4524" "g9776" "g4015" "g3747" "g6639" "g2807" "g2101" "g6546" "g8113"
## [46] "g847"  "g9948" "g6219" "g4743" "g7975" "g466"  "g869"  "g5913" "g8362"
## [55] "g7665" "g838"  "g8119" "g167"  "g2999" "g2484" "g8541" "g7628" "g8437"
## [64] "g6932" "g3017" "g6458" "g8875" "g7429" "g6948" "g8641" "g2004" "g338" 
## [73] "g9654" "g1781" "g5126" "g9549" "g469"  "g5229" "g115"  "g7608" "g2606"
## [82] "g3479" "g5674"
## 
## $pathway3
##  [1] "g9331" "g5832" "g5311" "g6300" "g4939" "g3028" "g8170" "g277"  "g2162"
## [10] "g1887" "g1011" "g9797" "g6396" "g7073" "g3353" "g6757" "g3167" "g3991"
## [19] "g3457" "g7550" "g5707" "g5096" "g7510" "g1334" "g6667" "g7433" "g7005"
## [28] "g3474" "g6809" "g9081" "g6952" "g3525" "g6561" "g8581" "g4255" "g1702"
## [37] "g9013" "g8298" "g2499" "g2690" "g6702" "g6605" "g7020" "g1743" "g8261"
## [46] "g9191" "g6258" "g7717" "g4973" "g3082" "g9346" "g2522" "g6260" "g9108"
## [55] "g7869" "g6858" "g7192" "g9378"
## 
## $pathway4
##  [1] "g7398" "g6306" "g4012" "g6673" "g7713" "g2518" "g9017" "g3360" "g4784"
## [10] "g6259" "g9059" "g1799" "g9813" "g5004" "g4810" "g4174" "g2439" "g8029"
## [19] "g5730" "g7516" "g154"  "g7419" "g19"   "g4068" "g4232" "g9767" "g9163"
## [28] "g311"  "g1916" "g3805" "g8445" "g3603" "g6501" "g153"  "g2922" "g8097"
## [37] "g9985" "g9719" "g7827" "g1428" "g2116" "g2994" "g2918" "g8054" "g2103"
## [46] "g3184" "g5540" "g8584" "g8271" "g9014" "g960"  "g1951" "g2607" "g8926"
## [55] "g7844" "g1999" "g1731" "g7758" "g5109" "g9245" "g6640" "g7569" "g8124"
## [64] "g394"  "g7910" "g8973" "g366"  "g2027" "g1070" "g8787" "g5606" "g5249"
## [73] "g7203" "g6650" "g1617"
## 
## $pathway5
##  [1] "g6454" "g2022" "g9627" "g8471" "g9677" "g3516" "g3262" "g4189" "g4912"
## [10] "g3286" "g2037" "g821"  "g3957" "g2482" "g8656" "g9962" "g3550" "g5620"
## [19] "g2042" "g1207" "g8654" "g9313" "g8064" "g2393" "g757"  "g1574" "g9199"
## [28] "g7132" "g9324" "g1657" "g8535" "g5200" "g4639" "g8986" "g6650" "g488" 
## [37] "g6241" "g5407" "g1038" "g198"  "g8992" "g9396" "g9200" "g4567" "g6191"
## [46] "g9970" "g4035" "g1762" "g815"  "g1205" "g5063" "g4883" "g2993" "g3763"
## [55] "g1555" "g5455" "g8534" "g1242" "g3209" "g2687" "g3786" "g9385" "g6187"
## [64] "g3120" "g1013" "g7091" "g6423" "g1548" "g8316" "g4777" "g8392" "g2101"
## [73] "g9991" "g8915" "g3369" "g1075" "g3770" "g9501" "g7476" "g5178" "g9552"
## [82] "g1877" "g6608" "g1399" "g5689" "g4526" "g1043" "g5942" "g2524" "g5958"
## [91] "g8162"
## 
## $pathway6
##  [1] "g748"  "g2887" "g2237" "g6300" "g562"  "g1823" "g1476" "g3523" "g5496"
## [10] "g4240" "g7780" "g2982" "g3004" "g6110" "g2512" "g6411" "g7219" "g8294"

Example usage 1: apply summarize_pathway_level to summarize data with mean pooling and a minimum size of sets

Now we can apply the function summarize_pathway_level. In this example, pathway activity is summarized using the mean pooling aggregation for those sets containing at least 12 genes. Note that you can adjust the minsize and type of aggregation as desired.

min <- 12
pathway_activity <- summarize_pathway_level(X, pathways, type = "mean", minsize = min)
## 
## 100 functional sets read.
## iteration 100
## 95 successful functional aggregations over minsize
## 5 failed functional aggregations under minsize
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 95 , 20"

From the original matrix X with dimensions [g,s] (in this example, 1000 genes and 20 samples), summarize_pathway_level has generated a pathway-level activity score for each of the 20 samples, for 95 pathways having containing more than 12 genes. Let’s see how this matrix looks like:

print(head(pathway_activity))
##                 s1        s2        s3        s4        s5        s6        s7
## pathway1 0.8856707 0.7523236 0.8549943 0.7388130 0.7925286 0.8334914 0.7992642
## pathway2 0.8195863 0.8626590 0.8864375 0.7387156 0.7385007 0.7833466 0.7602959
## pathway3 0.7850385 0.7178379 0.8601687 0.7368610 0.7706303 0.7247713 0.8604497
## pathway4 0.9223525 0.6942613 0.8204792 0.8153523 0.7955479 0.9152442 0.7992134
## pathway5 0.9114610 0.7112581 0.8054646 0.8684732 0.8134853 0.7694264 0.6587641
## pathway6 0.6227958 0.8245890 0.8223787 0.7626611 0.8641242 0.9292614 0.6487087
##                 s8        s9       s10       s11       s12       s13       s14
## pathway1 0.8360021 0.6835664 0.7858110 0.7760340 0.7035882 0.7978041 0.7516856
## pathway2 0.8074307 0.6878013 0.7672425 0.7680346 0.8845340 0.7153691 0.8658347
## pathway3 0.7472542 0.7420240 0.7557788 0.8714399 0.6596007 0.7748371 0.7146510
## pathway4 0.8153612 0.8172311 0.7705315 0.8365182 0.7158393 0.7362310 0.9941277
## pathway5 0.8208152 0.8176990 0.7511963 0.8832247 0.7476202 0.7715769 0.8993376
## pathway6 0.9917968 0.6278624 0.8573334 0.6735154 0.4910625 0.7790914 0.8918141
##                s15       s16       s17       s18       s19       s20
## pathway1 0.7707884 0.8048002 0.8139325 0.7562304 0.8257812 0.7988866
## pathway2 0.7450440 0.8073472 0.8396782 0.7724498 0.7732119 0.6771755
## pathway3 0.7427666 0.9419171 0.9125146 0.7504836 0.7127936 0.7270405
## pathway4 0.7121663 0.7625575 0.7675355 0.8329739 0.8193932 0.7584810
## pathway5 0.8388787 0.8025472 0.8337763 0.7940322 0.6808545 0.7773564
## pathway6 0.7292842 0.7034269 0.6653677 0.7941846 0.7447829 0.8574432

The resulting matrix of higher-level functional representations looks very similar to the original one, except that the original had many more features (1000 instead of 95).

In this example, 5 of the gene sets in pathways has size < 12. You can check which pathways have been left out and how many genes they had by running the following command:

length_sets <- sapply(pathways, function(p) length(p))
short_sets <- names(length_sets[length_sets < min])
print(length_sets[short_sets])
##  pathway20  pathway29  pathway56  pathway60 pathway100 
##         10         11         11         11         11

You can also check which genes were found in these sets:

print(pathways[short_sets])
## $pathway20
##  [1] "g7619" "g7228" "g4784" "g7604" "g2909" "g550"  "g3498" "g9766" "g1379"
## [10] "g9024"
## 
## $pathway29
##  [1] "g9593" "g7338" "g9592" "g1214" "g5061" "g2451" "g2666" "g2299" "g836" 
## [10] "g3948" "g4898"
## 
## $pathway56
##  [1] "g1402" "g2296" "g2731" "g5182" "g3993" "g7093" "g6375" "g3336" "g1762"
## [10] "g2398" "g1667"
## 
## $pathway60
##  [1] "g3429" "g5579" "g6744" "g4102" "g7061" "g5217" "g1025" "g2838" "g1540"
## [10] "g8468" "g6575"
## 
## $pathway100
##  [1] "g3438" "g443"  "g7243" "g8206" "g2375" "g9002" "g625"  "g5807" "g6973"
## [10] "g8004" "g1757"

Example usage 2: apply summarize_pathway_level to summarize data with pca pooling and no minimum size of sets

Using the same matrix X and gene sets pathways, let’s generate aggregated representations using dimension-reduction derived scores from the PCA. The pca-aggregated activity scores values represent the projection of the overall expression of all genes in each pathway onto the first principal component.

pathway_activity <- summarize_pathway_level(X, pathways, type = "pca", minsize = 0)
## 
## 100 functional sets read.
## iteration 100
## 100 successful functional aggregations over minsize
## 0 failed functional aggregations under minsize
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 100 , 20"

Now from the original matrix X with dimensions [g,s] (1000 genes and 20 samples), summarize_pathway_level has generated a pathway-level activity score for each of the 20 samples for all 100 pathways, since the size of the molecular sets was not restricted. Let’s see how this matrix looks like:

print(head(pathway_activity))
##                   s1         s2         s3         s4         s5          s6
## pathway1 -0.51596316  4.4081016 -3.6172441  0.9078345 -4.3632274  4.47148780
## pathway2 -6.35344708  4.4721898 -0.1072179 -5.6146386 -0.4361479  3.43595874
## pathway3 -3.69833166  1.8035359  0.9002908 -0.7440726  3.7495110  1.25453327
## pathway4 -5.71654009  1.1753201  2.3176206  1.9626837 -2.9970946  0.08046059
## pathway5 -2.45806955 -0.9199807 -6.6012194  3.9444841 -0.5009367  2.02176596
## pathway6  0.02525345  0.3790888  1.4559051  1.2151299 -4.6422525 -2.02724802
##                  s7          s8         s9        s10        s11        s12
## pathway1  0.1786907  2.68548927  0.6952866  0.2297664  4.0740206  1.1389743
## pathway2  2.0274377 -0.05016894  3.8767232  1.0640257 -1.7338074  2.4601009
## pathway3  2.1352637 -1.38294820  1.0078658  4.1406329  0.2845910 -1.6128825
## pathway4  2.8826719 -2.24804495 -2.5774802  5.7096576  2.4964751  2.6572455
## pathway5 -1.0535810 -1.78555013 -0.1863068  1.6203119  4.3788157 -0.1565894
## pathway6 -2.0024241  2.17669923 -0.3979514 -1.7318485  0.6773351 -0.8882183
##                 s13       s14        s15        s16        s17        s18
## pathway1 -0.2447407 -4.932566  2.6488582  0.5928979 -2.2855632  1.0699588
## pathway2  1.4911608 -1.204245  1.1324795 -3.3654006 -0.6572542 -1.8564556
## pathway3 -0.8852759 -4.948525 -1.5682527  4.3381664  1.0671926 -0.6944561
## pathway4  0.2437641 -2.979152  0.8372801 -1.5266307  2.8745670 -0.5905516
## pathway5 -2.0571340  6.301765  2.0743195 -3.5761567  2.6095770  0.6137905
## pathway6 -2.4032898  2.959492  2.1802264 -0.1962708  0.8000977  2.0549126
##                 s19        s20
## pathway1 -5.2955798 -1.8464827
## pathway2 -0.2952237  1.7139303
## pathway3 -2.6939543 -2.4528843
## pathway4 -2.7427071 -1.8595451
## pathway5 -1.9449648 -2.3243405
## pathway6 -0.2149931  0.5803559

Packages & Session information

The funOmics package was developed for R version >= 4.0.3. However, BioConductor release 3.19 runs on R-4.4. See session information and loaded packages below:

sI <- sessionInfo()
print(sI, locale = FALSE)
## R Under development (unstable) (2024-03-18 r86148)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] funOmics_0.99.1     bigmemory_4.6.4     Biobase_2.63.0     
## [4] BiocGenerics_0.49.1
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9          utf8_1.2.4          generics_0.1.3     
##  [4] stringi_1.8.3       digest_0.6.35       magrittr_2.0.3     
##  [7] evaluate_0.23       grid_4.4.0          RColorBrewer_1.1-3 
## [10] iterators_1.0.14    fastmap_1.1.1       R.oo_1.26.0        
## [13] foreach_1.5.2       doParallel_1.0.17   plyr_1.8.9         
## [16] jsonlite_1.8.8      princurve_2.1.6     bigmemory.sri_0.1.8
## [19] fansi_1.0.6         scales_1.3.0        codetools_0.2-19   
## [22] jquerylib_0.1.4     cli_3.6.2           registry_0.5-1     
## [25] rlang_1.1.3         R.methodsS3_1.8.2   munsell_0.5.0      
## [28] cachem_1.0.8        yaml_2.3.8          tools_4.4.0        
## [31] parallel_4.4.0      uuid_1.2-0          reshape2_1.4.4     
## [34] dplyr_1.1.4         colorspace_2.1-0    ggplot2_3.5.0      
## [37] rngtools_1.5.2      NMF_0.27            pathifier_1.41.0   
## [40] vctrs_0.6.5         R6_2.5.1            lifecycle_1.0.4    
## [43] stringr_1.5.1       cluster_2.1.6       pkgconfig_2.0.3    
## [46] pillar_1.9.0        bslib_0.6.2         gtable_0.3.4       
## [49] glue_1.7.0          Rcpp_1.0.12         xfun_0.42          
## [52] tibble_3.2.1        tidyselect_1.2.1    knitr_1.45         
## [55] htmltools_0.5.7     rmarkdown_2.26      compiler_4.4.0     
## [58] gridBase_0.4-7

Real-world data: Where to find molecular sets?

The examples in the section above run on simulated gene expression data and pathway gene sets. Real-world molecular sets can be downloaded from several sources. In terms of gene sets, the Gene Ontology is a versatile resource that covers three domains: cellular components, biological processes and molecular functions. Gene sets from KEGG and reactome pathways can also be used. Explore the different releases and download the corresponding gene sets for the different types of GO terms, KEGG and reactome pathways here. You can also aggregate genes into protein complexes, which you can find in the CORUM database.

Regarding other omics types, such as metabolomics, the function summarize_pathway_level can be applied in a similar manner to a metabolomics matrix X and KEGG metabolic pathways. Metabolite sets from KEGG pathways can also be downloaded with the KEGG API.

Once you have downloaded the molecular sets, this information needs to be in the form of a list of lists, i.e., a list of molecular sets, each of which is a list of molecule identifiers (e.g. entrez IDs, PubChem CIDs, etc).

For instance, let’s retrieve gene sets from GO terms for cellular compartments. The information can be downloaded from msigdb:

goccdb <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c5.go.cc.v7.5.entrez.gmt"
downdb <- sapply(readLines(goccdb), function(x) strsplit(x, "\t")[[1]])
gocc <- sapply(as.matrix(downdb), function(x) x[3:length(x)])
names(gocc) = sapply(as.matrix(downdb), function(x) x[1])
gocc[1:3]
## $GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX
##  [1] "1069" "1161" "2067" "2071" "2072" "2073" "5424" "5887" "7507" "7508"
## [11] "7515"
## 
## $GOCC_HISTONE_DEACETYLASE_COMPLEX
##  [1] "10013"  "10014"  "10284"  "10428"  "10467"  "10524"  "10629"  "10847" 
##  [9] "10856"  "10902"  "10933"  "1107"   "1108"   "116092" "1457"   "221037"
## [17] "23186"  "23309"  "23314"  "23468"  "25855"  "25942"  "26038"  "283248"
## [25] "3065"   "3066"   "3094"   "3622"   "473"    "51317"  "51564"  "51780" 
## [33] "53615"  "54556"  "54815"  "55758"  "55806"  "55809"  "55818"  "55869" 
## [41] "55929"  "57459"  "57504"  "57634"  "57649"  "58516"  "5928"   "5931"  
## [49] "64426"  "64431"  "6907"   "7764"   "79595"  "79685"  "79718"  "79885" 
## [57] "81611"  "8204"   "8295"   "83933"  "84215"  "84312"  "8607"   "8819"  
## [65] "8841"   "90665"  "9112"   "91748"  "9219"   "9611"   "9734"   "9759"  
## 
## $GOCC_SAGA_COMPLEX
##  [1] "100130302" "10474"     "10629"     "112869"    "117143"    "170067"   
##  [7] "23326"     "2648"      "27097"     "55578"     "56943"     "56970"    
## [13] "6871"      "6878"      "6880"      "6881"      "6883"      "8295"     
## [19] "8464"      "8850"      "9913"

Now let’s simulate a gene expression matrix X, where gene IDs are codes between 1:10000 (to match entrez IDs), and summarize_pathway_level can be applied:

X_expr <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(1:g, paste0("s", 1:s)))
sd_gocc_expr <- summarize_pathway_level(X_expr, gocc, type="sd")
## 
## 1006 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## iteration 400
## iteration 500
## iteration 600
## iteration 700
## iteration 800
## iteration 900
## iteration 1000
## 484 successful functional aggregations over minsize
## 522 failed functional aggregations under minsize
head(sd_gocc_expr)
##                                                s1        s2        s3        s4
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.6874068 0.7700535 0.6081128 0.4530643
## GOCC_HISTONE_DEACETYLASE_COMPLEX        0.6223028 0.5385014 0.6643452 0.4426644
## GOCC_SAGA_COMPLEX                       0.5129453 0.4294224 0.8654145 0.8470737
## GOCC_GOLGI_MEMBRANE                     0.5917873 0.5855094 0.5465478 0.5718336
## GOCC_UBIQUITIN_LIGASE_COMPLEX           0.5948998 0.4760826 0.5789951 0.5767282
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   0.5407960 0.5548760 0.6030492 0.5342973
##                                                s5        s6        s7        s8
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.5681003 0.7631878 0.9503497 0.5612025
## GOCC_HISTONE_DEACETYLASE_COMPLEX        0.5758610 0.5930402 0.4941448 0.6645293
## GOCC_SAGA_COMPLEX                       0.4338404 0.4709883 0.3084187 0.7409339
## GOCC_GOLGI_MEMBRANE                     0.5807997 0.5888171 0.5551484 0.5884325
## GOCC_UBIQUITIN_LIGASE_COMPLEX           0.5727471 0.6340398 0.6245755 0.6393219
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   0.5298508 0.5555573 0.5965341 0.7068170
##                                                s9       s10       s11       s12
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.3333466 0.6413364 0.5172743 0.3772728
## GOCC_HISTONE_DEACETYLASE_COMPLEX        0.7136638 0.6075606 0.4458465 0.4218555
## GOCC_SAGA_COMPLEX                       0.4651406 0.6359423 0.4213689 0.6133269
## GOCC_GOLGI_MEMBRANE                     0.5864312 0.6190158 0.5857053 0.5615674
## GOCC_UBIQUITIN_LIGASE_COMPLEX           0.6192819 0.6271703 0.5979364 0.6197959
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   0.6772410 0.5017567 0.7980839 0.5586112
##                                               s13       s14       s15       s16
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.8742280 0.6755375 0.2849060 0.6235697
## GOCC_HISTONE_DEACETYLASE_COMPLEX        0.6471213 0.7559094 0.6199201 0.5354556
## GOCC_SAGA_COMPLEX                       0.4484959 0.8593859 0.5471176 0.6409236
## GOCC_GOLGI_MEMBRANE                     0.5466437 0.6052277 0.4992445 0.6505082
## GOCC_UBIQUITIN_LIGASE_COMPLEX           0.5173636 0.6161817 0.5185140 0.5601128
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   0.5490904 0.5522513 0.5792256 0.5016461
##                                               s17       s18       s19       s20
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.3961030 0.3872135 0.3654351 0.3670957
## GOCC_HISTONE_DEACETYLASE_COMPLEX        0.6429443 0.5779971 0.4946191 0.6472834
## GOCC_SAGA_COMPLEX                       0.7105728 0.6171745 0.3329560 0.7926726
## GOCC_GOLGI_MEMBRANE                     0.5940520 0.6790624 0.5742977 0.6443335
## GOCC_UBIQUITIN_LIGASE_COMPLEX           0.5960200 0.5793572 0.5719646 0.7395734
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   0.5711452 0.5433063 0.5448356 0.8566029

A GO cellular compartments activity score is generated via the sd-pooling signal of its gene components.

Similarly, funOmics is interoperable with other standard Bioconductor classes for gene sets like KEGGREST. See the code snippet below to retrieve KEGG gene sets for humans:

library("KEGGREST")
# Retrieve all pathways from organism
pathway_ids <- keggList("pathway", organism = "hsa")
pathway_ids <- names(pathway_ids)
# Iterate through batches of pathway IDs: this may take ~1 min
batch_size <- 10  # KEGG limitation: 10 requests at once
kegg_sets <- list()
for (i in seq(1, length(pathway_ids), by = batch_size)) {
  batch_ids <- pathway_ids[i:(min(i + batch_size - 1, length(pathway_ids)))]
  batch_objs <- keggGet(batch_ids)
  # Retrieve genes
  for (obj in batch_objs) {
    if ("GENE" %in% names(obj)) {
      entrez_id_indices <- seq(1, length(obj$GENE), by = 2) # Extract Entrez IDs (odd entries)
      entrez_ids <- obj$GENE[entrez_id_indices]
      kegg_sets[[obj$PATHWAY_MAP[[1]]]] <- entrez_ids
    }
  }
}

kegg_sets[1:3]
## $`Glycolysis / Gluconeogenesis`
##  [1] "3101"   "3098"   "3099"   "80201"  "2645"   "2821"   "5213"   "5214"  
##  [9] "5211"   "2203"   "8789"   "230"    "226"    "229"    "7167"   "2597"  
## [17] "26330"  "5232"   "5230"   "5223"   "5224"   "441531" "2027"   "2026"  
## [25] "2023"   "387712" "5315"   "5313"   "5161"   "5160"   "5162"   "1737"  
## [33] "1738"   "160287" "92483"  "3939"   "3945"   "3948"   "124"    "125"   
## [41] "126"    "131"    "127"    "128"    "130"    "10327"  "217"    "224"   
## [49] "219"    "501"    "223"    "221"    "222"    "218"    "84532"  "55902" 
## [57] "130589" "5236"   "55276"  "2538"   "57818"  "92579"  "83440"  "669"   
## [65] "9562"   "5105"   "5106"  
## 
## $`Citrate cycle (TCA cycle)`
##  [1] "1431"  "47"    "50"    "48"    "3417"  "3418"  "3420"  "3421"  "3419" 
## [10] "55753" "4967"  "1743"  "1738"  "8802"  "8801"  "8803"  "6389"  "6390" 
## [19] "6391"  "6392"  "2271"  "4190"  "4191"  "5091"  "5105"  "5106"  "5161" 
## [28] "5160"  "5162"  "1737" 
## 
## $`Pentose phosphate pathway`
##  [1] "2821"   "2539"   "25796"  "9563"   "5226"   "6120"   "729020" "7086"  
##  [9] "84076"  "8277"   "6888"   "22934"  "23729"  "51071"  "64080"  "5236"  
## [17] "55276"  "221823" "5634"   "5631"   "9104"   "414328" "132158" "230"   
## [25] "226"    "229"    "2203"   "8789"   "5213"   "5214"   "5211"

And then summarize_pathway_level can be applied seamlessly to generate, for instance, pathway-level expression signatures via dimension-reduction multi-dimensional scaling (mds).

sd_kegg_expr <- summarize_pathway_level(X_expr, kegg_sets, type="mds")
## 
## 346 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 319 successful functional aggregations over minsize
## 27 failed functional aggregations under minsize
head(sd_kegg_expr)
##                                                   s1         s2          s3
## Glycolysis / Gluconeogenesis              0.03381027 -0.5900246  0.76860349
## Citrate cycle (TCA cycle)                 1.21589878 -2.1771730 -1.79260831
## Pentose phosphate pathway                 2.00240252 -0.3552397  0.86013327
## Pentose and glucuronate interconversions -1.68504534 -1.0775848  0.77029731
## Fructose and mannose metabolism           3.41935713  0.6260220 -0.06882378
## Galactose metabolism                     -1.04495679 -1.0375985 -0.92278221
##                                                  s4         s5          s6
## Glycolysis / Gluconeogenesis              1.6568284 -1.0467614 -1.40658544
## Citrate cycle (TCA cycle)                -0.7172462 -2.5040382  0.83970969
## Pentose phosphate pathway                 0.1196814 -0.3114586  3.36588332
## Pentose and glucuronate interconversions  0.2319988  0.1038933 -0.73356118
## Fructose and mannose metabolism          -0.5195243 -0.6916561  0.07335589
## Galactose metabolism                     -1.6570534  1.2079572 -0.74982696
##                                                  s7         s8          s9
## Glycolysis / Gluconeogenesis              1.9128943  1.5526851  0.77418666
## Citrate cycle (TCA cycle)                -2.2339720  0.9192363 -0.05543352
## Pentose phosphate pathway                -1.2810089 -0.7468456  0.22231480
## Pentose and glucuronate interconversions  0.8996248  0.3213037 -1.85265180
## Fructose and mannose metabolism          -0.9884894 -1.1378034  1.25408414
## Galactose metabolism                      1.4841675  0.4248430  0.15216205
##                                                 s10        s11         s12
## Glycolysis / Gluconeogenesis             -3.9245110  2.1522299  1.18926945
## Citrate cycle (TCA cycle)                 0.3081241  1.0494616  0.37608681
## Pentose phosphate pathway                -1.0977821  0.4498480 -0.53811659
## Pentose and glucuronate interconversions -0.7918918  0.4927486 -0.75757127
## Fructose and mannose metabolism           0.6974275  0.2533823  0.88980534
## Galactose metabolism                     -0.5631880 -1.1458326  0.02300677
##                                                  s13        s14        s15
## Glycolysis / Gluconeogenesis              0.06481599  0.1579013 -0.2041039
## Citrate cycle (TCA cycle)                 0.51451774  1.5155227  0.8896059
## Pentose phosphate pathway                -0.71194067 -1.4379657 -0.4433792
## Pentose and glucuronate interconversions  0.12037131  0.5756674 -0.1742791
## Fructose and mannose metabolism          -0.44316800  0.2012736 -0.3794476
## Galactose metabolism                      1.89883786  0.1795751  1.0246083
##                                                 s16        s17        s18
## Glycolysis / Gluconeogenesis             -1.3501557 -0.4141481 -2.8753269
## Citrate cycle (TCA cycle)                 0.3918214  1.8650711  0.2687718
## Pentose phosphate pathway                 0.3302367 -0.3782894 -0.1618136
## Pentose and glucuronate interconversions -0.9088215  1.0806923  1.4317707
## Fructose and mannose metabolism           0.8446045 -0.8315701 -0.2122450
## Galactose metabolism                     -0.6839999  0.8635995 -1.3031425
##                                                 s19         s20
## Glycolysis / Gluconeogenesis              1.5427459  0.00564625
## Citrate cycle (TCA cycle)                 0.1052626 -0.77861926
## Pentose phosphate pathway                 0.9363595 -0.82301948
## Pentose and glucuronate interconversions  1.5734153  0.37962322
## Fructose and mannose metabolism          -2.1514227 -0.83516189
## Galactose metabolism                      1.9032379 -0.05361439

Note that these are just some examples, and you can apply similar procedures for other types of molecular sets.

Contact Information

Feedback is very welcome! If you have any questions, issues, or suggestions for improving the funOmics package, please use the GitHub issues page or contact .

License

The funOmics package is released under the terms of the MIT License. See the LICENSE file for more details.