---
title: "muleaData - GMT Datasets for the mulea Package"
author: 
- name: "Eszter Ari"
  email: "arieszter@gmail.com"
- name: "Márton Ölbei"
- name: "Leila Gul"
- name: "Balázs Bohár"
- name: "Tamás Stirling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{muleaData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

`muleaData` is an `ExperimentHubData` Bioconductor package that provides 
pre-processed gene set data for the 
[`mulea`](https://github.com/ELTEbioinformatics/mulea) 
R package, a comprehensive tool for overrepresentation and functional 
enrichment analysis. `mulea` uses ontologies (gene and protein sets) 
stored in the standardized Gene Matrix Transposed (GMT) format.

We provide these *GMT* files for 27 different model organisms, ranging from 
*Escherichia coli* to human. These files are compiled from publicly available 
sources and include various gene and protein identifiers like *UniProt* 
protein IDs, *Entrez*, *Gene Symbol*, and *Ensembl* gene IDs. 
The *GMT* files and the scripts we used to create them are available in the 
[GMT_files_for_mulea](https://github.com/ELTEbioinformatics/GMT_files_for_mulea)
repository. For `muleaData`, we read these *GMT* 
files with the `mulea::read_gmt()` function and saved 
them as **.rds** files with the standard R `saveRDS()` function.
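
The conversion described above can be illustrated with a small base-R sketch. It does not call `mulea::read_gmt()`. Instead it parses two made-up gene sets by hand, to show the GMT layout (set ID, set name, then the elements, all tab-separated) and the `saveRDS()` / `readRDS()` round trip. The gene sets and the column names `ontology_id`, `ontology_name` and `list_of_values` are assumptions for illustration, not a copy of the package's build script.

```{r 'gmt_roundtrip_sketch'}
# Two hypothetical gene sets in GMT layout:
# ID <TAB> name <TAB> element 1 <TAB> element 2 ...
gmt_lines <- c(
    "SET_A\tHypothetical set A\tgene1\tgene2\tgene3",
    "SET_B\tHypothetical set B\tgene2\tgene4"
)
gmt_path <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_path)

# Split every line on tabs; the first two fields describe the set,
# the remaining fields are its elements.
fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)
gmt_df <- data.frame(
    ontology_id = vapply(fields, `[`, character(1), 1),
    ontology_name = vapply(fields, `[`, character(1), 2)
)
gmt_df$list_of_values <- lapply(fields, function(x) x[-(1:2)])

# Save the data frame to an .rds file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(gmt_df, rds_path)
restored <- readRDS(rds_path)

# Check that the round trip preserved the object.
identical(restored, gmt_df)
```

Storing the parsed object as **.rds** keeps the list column of elements intact, so the file does not have to be parsed again after download.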

`muleaData` covers the following species:

-   *Arabidopsis thaliana*
-   *Bacillus subtilis*
-   *Bacteroides thetaiotaomicron VPI-5482*
-   *Bifidobacterium longum*
-   *Bos taurus*
-   *Caenorhabditis elegans*
-   *Chlamydomonas reinhardtii*
-   *Danio rerio*
-   *Daphnia pulex*
-   *Dictyostelium discoideum*
-   *Drosophila melanogaster*
-   *Drosophila simulans*
-   *Escherichia coli*
-   *Gallus gallus*
-   *Homo sapiens*
-   *Macaca mulatta*
-   *Mus musculus*
-   *Mycobacterium tuberculosis*
-   *Neurospora crassa*
-   *Pan troglodytes*
-   *Rattus norvegicus*
-   *Saccharomyces cerevisiae*
-   *Salmonella enterica subsp. enterica serovar Typhimurium str. LT2*
-   *Schizosaccharomyces pombe*
-   *Tetrahymena thermophila*
-   *Xenopus tropicalis*
-   *Zea mays*

Category, name, link, content and reference of the databases `muleaData` covers:

| **Ontology category**               |                                   **Ontology name**                                    |                                                                  **Short description of content**                                                                  |                                                                                         **Reference**                                                                                          |
|-------------------------------------|:--------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| **Gene expression**                 |                          [FlyAtlas](http://www.flyatlas.org/)                          |                                                   Tissue-specific expression data for *Drosophila melanogaster*.                                                   |                      Chintapalli,V.R. *et al.* (2007) Using FlyAtlas to identify better *Drosophila melanogaster* models of human disease. *Nat Genet*, **39**, 715–720.                       |
|                                     |                        [modENCODE](http://data.modencode.org/)                         | Functional characterization (cell line, temporal expression, tissue expression, treatment) of elements for *Caenorhabditis elegans* and *Drosophila melanogaster*. |                The modENCODE Consortium *et al.* (2010) Identification of functional elements and regulatory circuits by *Drosophila* modENCODE. *Science*, **330**, 1787–1797.                |
| **Genomic location**                |                                   Chromosomal Bands                                    |                                                                Location of genes on the chromosome.                                                                |                                                       Martin,F.J. *et al.* (2023) Ensembl 2023. *Nucleic Acids Res*, **51**, D933–D941.                                                        |
|                                     |                                   Consecutive genes                                    |                                                              *n* consecutive genes on the chromosome.                                                              |                                                                                                                                                                                                |
| **miRNA regulation**                | [miRTarBase](https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2022/php/index.php) |                                                       Experimentally validated miRNA - target interactions.                                                        |          Huang,H.-Y. *et al.* (2022) miRTarBase update 2022: an informative resource for experimentally validated miRNA–target interactions. *Nucleic Acids Res*, **50**, D222–D230.           |
| **Gene Ontology**                   |                            [GO](https://geneontology.org/)                             |                                            Gene Ontology (GO) categorizes genes into unified categories and attributes.                                            |                                      The Gene Ontology Consortium *et al.* (2023) The Gene Ontology knowledgebase in 2023. *Genetics*, **224**, iyad031.                                       |
| **Pathway**                         |                   [Pathway Commons](https://www.pathwaycommons.org/)                   |                                                       Collection of biological pathway and interaction data.                                                       |                   Rodchenkov,I. *et al.* (2020) Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. *Nucleic Acids Res*, **48**, D489–D497.                    |
|                                     |                           [Reactome](https://reactome.org/)                            |                                                       Collection of biological pathway and interaction data.                                                       |                                             Jassal,B. *et al.* (2020) The reactome pathway knowledgebase. *Nucleic Acids Res*, **48**, D498–D503.                                              |
|                                     |                           [Signalink](http://signalink.org/)                           |                                              Interaction database focussing on pathways and interactions of pathways.                                              |                     Csabai,L. *et al.* (2022) SignaLink3: a multi-layered resource to uncover tissue-specific signaling networks. *Nucleic Acids Res*, **50**, D701–D709.                      |
|                                     |                     [Wikipathways](https://www.wikipathways.org/)                      |                                                       Collection of biological pathway and interaction data.                                                       |                                            Martens,M. *et al.* (2021) WikiPathways: connecting communities. *Nucleic Acids Res*, **49**, D613–D621.                                            |
| **Protein domain**                  |                             [PFAM](http://pfam.xfam.org/)                              |                                                                 Protein domain structure database.                                                                 |                                         Mistry,J. *et al.* (2021) Pfam: The protein families database in 2021. *Nucleic Acids Res*, **49**, D412–D419.                                         |
| **Transcription factor regulation** |                            [ATRM](http://atrm.gao-lab.org/)                            |                                            Transcription factor - target gene interactions for *Arabidopsis thaliana*.                                             | Jin,J. *et al.* (2015) An *Arabidopsis* transcriptional regulatory map reveals distinct functional and evolutionary features of novel transcription factors. *Mol Biol Evol*, **32**, 1767–1773. |
|                                     |                    [dorothEA](https://saezlab.github.io/dorothea/)                     |                                                Transcription factor - target gene interactions for human and mouse.                                                |             Garcia-Alonso,L. *et al.* (2019) Benchmark and integration of resources for the estimation of human transcription factor activities. *Genome Res*, **29**, 1363–1375.              |
|                                     |                      [RegulonDB](https://regulondb.ccg.unam.mx/)                       |                                          Transcription factor - target gene interactions for *Escherichia coli* bacteria.                                          |        Tierrafría,V.H. *et al.* (2022) RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in *Escherichia coli* K-12. *Microb Genom*, **8**, 000833.         |
|                                     |                             [TFLink](https://tflink.net/)                              |                              Small- and large-scale transcription factor - target gene interactions for human and 6 model organisms.                               |              Liska,O. *et al.* (2022) TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. *Database*, **2022**, baac083.               |
|                                     |                       [TRRUST](https://www.grnpedia.org/trrust/)                       |                                                     Transcription factor - target gene interactions for human.                                                     |              Han,H. *et al.* (2018) TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. *Nucleic Acids Res*, **46**, D380–D386.              |
|                                     |                         [Yeastract](http://www.yeastract.com/)                         |                                          Transcription factor - target gene interactions for *Saccharomyces cerevisiae*.                                           |   Teixeira,M.C. *et al.* (2018) YEASTRACT: an upgraded database for the analysis of transcription regulatory networks in *Saccharomyces cerevisiae*. *Nucleic Acids Res*, **46**, D348–D353.   |

## Installation Instructions

Install the development version of `R` from 
[CRAN](https://cran.r-project.org/sources.html). Then install 
the development version of [Bioconductor](http://bioconductor.org/) 
and the `ExperimentHub` and `muleaData` packages using the following code:

```{r 'install', eval=FALSE}
# Install BiocManager only if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ExperimentHub")
BiocManager::install("muleaData")
```

## Example

This basic example shows how to use `muleaData`:

```{r 'example'}

# Calling the ExperimentHub library.
library(ExperimentHub)

# Downloading the metadata from ExperimentHub.
eh <- ExperimentHub()

# Creating the muleaData variable.
muleaData <- query(eh, "muleaData")

# Checking the muleaData variable.
muleaData

# Looking for the ExperimentHub ID of, for example, the target genes of
# transcription factors from TFLink in Caenorhabditis elegans.
mcols(muleaData) %>% 
    as.data.frame() %>% 
    dplyr::filter(species == "Caenorhabditis elegans" & 
        sourceurl == "https://tflink.net/")

# Creating a variable for the GMT data.frame of EH8735.
# EH8735 contains small-scale measurement results, where the target genes are
# coded with Ensembl IDs.
Transcription_factor_TFLink_Caenorhabditis_elegans_SS_EnsemblID <- 
    muleaData[["EH8735"]]
```
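
A common next step after loading a GMT data frame is to check how many elements each set contains and to drop very small or very large sets before the enrichment analysis. The sketch below uses a small made-up data frame rather than the downloaded one. The column names `ontology_id`, `ontology_name` and `list_of_values`, as well as the size limits, are assumptions for illustration, so compare them with the `str()` output of the object you retrieved.

```{r 'filter_sketch'}
# Hypothetical GMT-like data frame with three sets of different sizes.
toy_gmt <- data.frame(
    ontology_id = c("SET_A", "SET_B", "SET_C"),
    ontology_name = c("Hypothetical set A", "Hypothetical set B",
        "Hypothetical set C")
)
toy_gmt$list_of_values <- list(
    c("gene1", "gene2", "gene3"),
    c("gene2"),
    c("gene1", "gene4", "gene5", "gene6", "gene7")
)

# Number of elements in each set.
set_sizes <- lengths(toy_gmt$list_of_values)

# Keep only the sets whose size falls between the illustrative limits.
min_size <- 2
max_size <- 4
filtered_gmt <- toy_gmt[set_sizes >= min_size & set_sizes <= max_size, ]

# IDs of the sets that passed the filter.
filtered_gmt$ontology_id
```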

## Session Info

```{r 'session_info'}
sessionInfo()
```

## Citation

To cite package `muleaData` in publications use:

C. Turek, M. Olbei, T. Stirling, G. Fekete, E. Tasnadi, L. Gul, B. Bohar, 
B. Papp, W. Jurkowski, E. Ari: 
mulea - an R package for enrichment analysis using multiple ontologies and 
empirical FDR correction. *bioRxiv* (2024),
[doi:10.1101/2024.02.28.582444](https://doi.org/10.1101/2024.02.28.582444).