0.1 `orthogene`: Interspecies gene mapping

orthogene is an R package for easy mapping of orthologous genes across hundreds of species.
It pulls up-to-date interspecies gene ortholog mappings across 700+ organisms.

It also provides various utility functions to map common objects (e.g. data.frames, gene expression matrices, lists) onto 1:1 gene orthologs from any other species.

In brief, orthogene lets you easily:

1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
# orthogene is only available on Bioconductor>=3.14
if(BiocManager::version()<"3.14") BiocManager::install(version = "3.14")

BiocManager::install("orthogene")

library(orthogene)

data("exp_mouse")
# Setting to "homologene" for the purposes of quick demonstration.
# We generally recommend using method="gprofiler" (default).
method <- "homologene"

2 Examples

2.1 Convert orthologs

convert_orthologs is very flexible with what users can supply as gene_df, and can take a data.frame/data.table/tibble, (sparse) matrix, or list/vector containing genes.

Genes, transcripts, proteins, SNPs, or genomic ranges will be recognised in most formats (HGNC, Ensembl, RefSeq, UniProt, etc.) and can even be a mixture of different formats.

All genes will be mapped to gene symbols, unless specified otherwise with the ... arguments (see ?orthogene::convert_orthologs or here for details).

2.1.1 Note on non-1:1 orthologs

A key feature of convert_orthologs is that it handles the issue of genes with many-to-many mappings across species. This can occur due to evolutionary divergence, and the function of these genes tends to be less conserved and less translatable. Users can address this using different strategies via non121_strategy=:

"drop_both_species" : Drop genes that have duplicate mappings in either the input_species or output_species, (DEFAULT).
"drop_input_species" : Only drop genes that have duplicate mappings in input_species.
"drop_output_species" : Only drop genes that have duplicate mappings in the output_species.
"keep_both_species" : Keep all genes regardless of whether they have duplicate mappings in either species.
"keep_popular" : Return only the most “popular” interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.
"sum","mean","median","min" or "max": When gene_df is a matrix and gene_output="rownames", these options will aggregate many-to-one gene mappings (input_species-to-output_species) after dropping any duplicate genes in the output_species.

gene_df <- orthogene::convert_orthologs(gene_df = exp_mouse,
                                        gene_input = "rownames", 
                                        gene_output = "rownames", 
                                        input_species = "mouse",
                                        output_species = "human",
                                        non121_strategy = "drop_both_species",
                                        method = method)

## Preparing gene_df.

## sparseMatrix format detected.

## Extracting genes from rownames.

## 15,259 genes extracted.

## Converting mouse ==> human orthologs using: homologene

## Retrieving all organisms available in homologene.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: 10090

## Retrieving all organisms available in homologene.

## Mapping species name: human

## Common name mapping found for human

## 1 organism identified from search: 9606

## Checking for genes without orthologs in human.

## Extracting genes from input_gene.

## 13,416 genes extracted.

## Extracting genes from ortholog_gene.

## 13,416 genes extracted.

## Checking for genes without 1:1 orthologs.

## Dropping 46 genes that have multiple input_gene per ortholog_gene.

## Dropping 56 genes that have multiple ortholog_gene per input_gene.

## Filtering gene_df with gene_map

## Setting ortholog_gene to rownames.

## 
## =========== REPORT SUMMARY ===========

## Total genes dropped after convert_orthologs :
##    2,016 / 15,259 (13%)

## Total genes remaining after convert_orthologs :
##    13,243 / 15,259 (87%)

knitr::kable(as.matrix(head(gene_df)))

	astrocytes_ependymal	endothelial-mural	interneurons	microglia	oligodendrocytes	pyramidal CA1	pyramidal SS
TSPAN12	0.3303571	0.5872340	0.6413793	0.1428571	0.1207317	0.2864750	0.1453634
TSHZ1	0.4285714	0.4468085	1.1551724	0.4387755	0.3621951	0.0692226	0.8320802
ADAMTS15	0.0089286	0.0978723	0.2206897	0.0000000	0.0231707	0.0117146	0.0375940
CLDN12	0.2232143	0.1148936	0.5517241	0.0510204	0.2609756	0.4376997	0.6842105
RXFP1	0.0000000	0.0127660	0.2551724	0.0000000	0.0158537	0.0511182	0.0751880
SEMA3C	0.1964286	0.9957447	8.6379310	0.2040816	0.1853659	0.1608094	0.2280702

2.2 Map species

map_species lets you standardise species names from a wide variety of identifiers (e.g. common name, taxonomy ID, Latin name, partial match).

All exposed orthogene functions (including convert_orthologs) use map_species under the hood, so you don’t have to worry about getting species names exactly right.

You can check the full list of available species by simply running map_species() with no arguments, or checking here.

species <- orthogene::map_species(species = c("human",9544,"mus musculus",
                                              "fruit fly","Celegans"), 
                                  output_format = "scientific_name")

## Retrieving all organisms available in homologene.

## Mapping species name: human

## Common name mapping found for human

## 1 organism identified from search: Homo sapiens

## Mapping species name: 9544

## 1 organism identified from search: Macaca mulatta

## Mapping species name: mus musculus

## 1 organism identified from search: Mus musculus

## Mapping species name: fruit fly

## Common name mapping found for fruit fly

## 1 organism identified from search: Drosophila melanogaster

## Mapping species name: Celegans

## 1 organism identified from search: Caenorhabditis elegans

print(species)

##                     human                      9544              mus musculus 
##            "Homo sapiens"          "Macaca mulatta"            "Mus musculus" 
##                 fruit fly                  Celegans 
## "Drosophila melanogaster"  "Caenorhabditis elegans"

2.3 Report orthologs

It may be helpful to know the maximum expected number of orthologous gene mappings from one species to another.

ortholog_report generates a report that tells you this information genome-wide.

orth_zeb <- orthogene::report_orthologs(target_species = "zebrafish",
                                        reference_species = "human",
                                        method_all_genes = method,
                                        method_convert_orthologs = method)

## Retrieving all organisms available in homologene.

## Mapping species name: zebrafish

## Common name mapping found for zebrafish

## 1 organism identified from search: Danio rerio

## Retrieving all organisms available in homologene.

## Mapping species name: human

## Common name mapping found for human

## 1 organism identified from search: Homo sapiens

## Retrieving all genes using: homologene.

## Retrieving all organisms available in homologene.

## Mapping species name: Danio rerio

## 1 organism identified from search: 7955

## Gene table with 20,897 rows retrieved.

## Returning all 20,897 genes from Danio rerio.

## Retrieving all genes using: homologene.

## Retrieving all organisms available in homologene.

## Mapping species name: Homo sapiens

## 1 organism identified from search: 9606

## Gene table with 19,129 rows retrieved.

## Returning all 19,129 genes from Homo sapiens.

## Preparing gene_df.

## data.frame format detected.

## Extracting genes from Gene.Symbol.

## 20,897 genes extracted.

## Converting Danio rerio ==> Homo sapiens orthologs using: homologene

## Retrieving all organisms available in homologene.

## Mapping species name: Danio rerio

## 1 organism identified from search: 7955

## Retrieving all organisms available in homologene.

## Mapping species name: Homo sapiens

## 1 organism identified from search: 9606

## Checking for genes without orthologs in Homo sapiens.

## Extracting genes from input_gene.

## 14,768 genes extracted.

## Extracting genes from ortholog_gene.

## 14,768 genes extracted.

## Checking for genes without 1:1 orthologs.

## Dropping 47 genes that have multiple input_gene per ortholog_gene.

## Dropping 2,708 genes that have multiple ortholog_gene per input_gene.

## Filtering gene_df with gene_map

## Adding input_gene col to gene_df.

## Adding ortholog_gene col to gene_df.

## 
## =========== REPORT SUMMARY ===========

## Total genes dropped after convert_orthologs :
##    10,338 / 20,895 (49%)

## Total genes remaining after convert_orthologs :
##    10,557 / 20,895 (51%)

## 
## =========== REPORT SUMMARY ===========

## 10,556 / 20,895 (50.52%) target_species genes remain after ortholog conversion.

## 10,556 / 19,129 (55.18%) reference_species genes remain after ortholog conversion.

knitr::kable(head(orth_zeb$map))

	HID	Gene.ID	Gene.Symbol	taxonomy_id	input_gene	ortholog_gene
8	3	406283	acadm	7955	acadm	ACADM
20	5	573723	acadvl	7955	acadvl	ACADVL
32	6	445290	acat1	7955	acat1	ACAT1
55	7	30615	acvr1l	7955	acvr1l	ACVR1
74	12	334431	adsl	7955	adsl	ADSL
92	13	566517	aga	7955	aga	AGA

knitr::kable(orth_zeb$report)

target_species	target_total_genes	reference_species	reference_total_genes	one2one_orthologs	target_percent	reference_percent
Danio rerio	20895	Homo sapiens	19129	10556	50.52	55.18

2.4 Map genes

map_genes finds matching within-species synonyms across a wide variety of gene naming conventions (HGNC, Ensembl, RefSeq, UniProt, etc.) and returns a table with standardised gene symbols (or whatever output format you prefer).

genes <-  c("Klf4", "Sox2", "TSPAN12","NM_173007","Q8BKT6",9999,
             "ENSMUSG00000012396","ENSMUSG00000074637")
mapped_genes <- orthogene::map_genes(genes = genes,
                                     species = "mouse", 
                                     drop_na = FALSE)

## Retrieving all organisms available in gprofiler.

## Using stored `gprofiler_orgs`.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: mmusculus

## 8 / 7 (87.5%) genes mapped.

knitr::kable(head(mapped_genes))

input_number	input	target_number	target	name	description	namespace
1	Klf4	1.1	ENSMUSG00000003032	Klf4	Kruppel-like factor 4 (gut) [Source:MGI Symbol;Acc:MGI:1342287]	ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE
2	Sox2	2.1	ENSMUSG00000074637	Sox2	SRY (sex determining region Y)-box 2 [Source:MGI Symbol;Acc:MGI:98364]	ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE
3	TSPAN12	3.1	ENSMUSG00000029669	Tspan12	tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818]	ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE
4	NM_173007	4.1	ENSMUSG00000029669	Tspan12	tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818]	REFSEQ_MRNA_ACC
5	Q8BKT6	5.1	ENSMUSG00000029669	Tspan12	tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818]	UNIPROTSWISSPROT_ACC,UNIPROT_GN_ACC
6	9999	6.1	NA	NA	NA

2.5 Aggregate mapped genes

aggregate_mapped_genes does the following:

Uses map_genes to identify within-species many-to-one gene mappings (e.g. Ensembl transcript IDs ==> gene symbols). Alternatively, can map across species if output from map_orthologs is supplied to gene_map argument (and gene_map_col="ortholog_gene").
Drops all non-mappable genes.
Aggregates the values of matrix gene_df using "sum","mean","median","min" or "max".

Note, this only works when the input data (gene_df) is a sparse or dense matrix, and the genes are row names.

data("exp_mouse_enst") 
knitr::kable(tail(as.matrix(exp_mouse_enst)))

	astrocytes_ependymal	endothelial-mural	interneurons	microglia	oligodendrocytes	pyramidal CA1	pyramidal SS
ENSMUST00000102875	2.8258910	0.4041560	1.3171987	0.3774840	1.3426606	1.0403481	1.0876508
ENSMUST00000133343	2.8259032	0.4042189	1.3171312	0.3774038	1.3425772	1.0403432	1.0876385
ENSMUST00000143890	2.8258554	0.4041963	1.3171145	0.3774192	1.3426119	1.0403496	1.0876334
ENSMUST00000005053	0.4597978	0.3403299	0.9067953	0.2958589	0.7254482	0.4813420	0.7418000
ENSMUST00000185896	0.4596631	0.3403637	0.9067538	0.2958896	0.7255006	0.4812783	0.7417918
ENSMUST00000188282	0.4597399	0.3403441	0.9067727	0.2957819	0.7255681	0.4811978	0.7417924

exp_agg <- orthogene::aggregate_mapped_genes(gene_df=exp_mouse_enst,
                                             species="mouse",
                                             FUN = "sum")

## Retrieving all organisms available in gprofiler.

## Using stored `gprofiler_orgs`.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: mmusculus

## 482 / 482 (100%) genes mapped.

## Aggregating rows using: monocle3

## Matrix aggregated:
##   - Input: 482 x 7 
##   - Output: 92 x 7

knitr::kable(tail(as.matrix(exp_agg)))

	astrocytes_ependymal	endothelial-mural	interneurons	microglia	oligodendrocytes	pyramidal CA1	pyramidal SS
Tshz1	1.713936	1.7869498	4.620366	1.7545487	1.4483505	0.2764327	3.3280256
Tspan12	1.981594	3.5228954	3.847706	0.8565873	0.7237384	1.7184690	0.8716624
Ugp2	11.303434	1.6167531	5.268556	1.5096830	5.3705113	4.1614945	4.3505804
Usp28	1.561545	1.4885072	12.481956	0.9176950	1.0237324	5.5261972	6.4652509
Vat1l	0.178117	0.0337314	1.199619	0.0812187	0.1165772	0.2339571	0.4006268
Wtap	2.691383	2.4118074	12.289111	3.5809075	3.2808114	9.3443456	8.6384533

2.6 Get all genes

You can also quickly get all known genes from the genome of a given species with all_genes.

genome_mouse <- orthogene::all_genes(species = "mouse", 
                                     method = method)

## Retrieving all genes using: homologene.

## Retrieving all organisms available in homologene.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: 10090

## Gene table with 21,207 rows retrieved.

## Returning all 21,207 genes from mouse.

knitr::kable(head(genome_mouse))

	HID	Gene.ID	Gene.Symbol	taxonomy_id
6	3	11364	Acadm	10090
18	5	11370	Acadvl	10090
29	6	110446	Acat1	10090
52	7	11477	Acvr1	10090
64	9	20391	Sgca	10090
71	12	11564	Adsl	10090

3 Session Info

utils::sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] orthogene_1.0.2  BiocStyle_2.22.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3              lattice_0.20-45          
##  [3] tidyr_1.2.0               assertthat_0.2.1         
##  [5] digest_0.6.29             utf8_1.2.2               
##  [7] R6_2.5.1                  backports_1.4.1          
##  [9] evaluate_0.15             highr_0.9                
## [11] httr_1.4.2                ggplot2_3.3.5            
## [13] pillar_1.7.0              rlang_1.0.2              
## [15] lazyeval_0.2.2            data.table_1.14.2        
## [17] car_3.0-12                jquerylib_0.1.4          
## [19] magick_2.7.3              Matrix_1.4-1             
## [21] grr_0.9.5                 rmarkdown_2.13           
## [23] labeling_0.4.2            stringr_1.4.0            
## [25] htmlwidgets_1.5.4         RCurl_1.98-1.6           
## [27] munsell_0.5.0             broom_0.7.12             
## [29] compiler_4.1.3            gprofiler2_0.2.1         
## [31] xfun_0.30                 pkgconfig_2.0.3          
## [33] htmltools_0.5.2           tidyselect_1.1.2         
## [35] tibble_3.1.6              GenomeInfoDbData_1.2.7   
## [37] bookdown_0.25             fansi_1.0.3              
## [39] viridisLite_0.4.0         crayon_1.5.1             
## [41] dplyr_1.0.8               ggpubr_0.4.0             
## [43] Matrix.utils_0.9.8        bitops_1.0-7             
## [45] grid_4.1.3                jsonlite_1.8.0           
## [47] gtable_0.3.0              lifecycle_1.0.1          
## [49] DBI_1.1.2                 magrittr_2.0.3           
## [51] scales_1.1.1              cli_3.2.0                
## [53] stringi_1.7.6             carData_3.0-5            
## [55] farver_2.1.0              ggsignif_0.6.3           
## [57] bslib_0.3.1               ellipsis_0.3.2           
## [59] generics_0.1.2            vctrs_0.4.0              
## [61] tools_4.1.3               homologene_1.4.68.19.3.27
## [63] glue_1.6.2                purrr_0.3.4              
## [65] abind_1.4-5               parallel_4.1.3           
## [67] fastmap_1.1.0             yaml_2.3.5               
## [69] babelgene_22.3            colorspace_2.0-3         
## [71] BiocManager_1.30.16       rstatix_0.7.0            
## [73] plotly_4.10.0             knitr_1.38               
## [75] patchwork_1.1.1           sass_0.4.1

orthogene: Getting Started

Author: Brian M. Schilder

Most recent update: Apr-10-2022

Contents

0.1 `orthogene`: Interspecies gene mapping

1 Installation

2 Examples

2.1 Convert orthologs

2.1.1 Note on non-1:1 orthologs

2.2 Map species

2.3 Report orthologs

2.4 Map genes

2.5 Aggregate mapped genes

2.6 Get all genes

3 Session Info

orthogene: Getting Started

Author: Brian M. Schilder

Most recent update: Apr-10-2022

Contents

0.1 orthogene: Interspecies gene mapping

1 Installation

2 Examples

2.1 Convert orthologs

2.1.1 Note on non-1:1 orthologs

2.2 Map species

2.3 Report orthologs

2.4 Map genes

2.5 Aggregate mapped genes

2.6 Get all genes

3 Session Info

0.1 `orthogene`: Interspecies gene mapping