Cellosaurus is a knowledge resource dedicated to cell lines used in biomedical research. It serves as a centralized repository of detailed data on cell lines, including their origins, characteristics, authentication methods, and references. See the Cellosaurus website at https://web.expasy.org/cellosaurus/ for more information; a detailed description is available at https://www.cellosaurus.org/description.html.
The AnnotationGx package provides a wrapper around the Cellosaurus API to map cell line identifiers to the Cellosaurus database fields.
Cellosaurus is licensed under CC BY 4.0. Source: https://www.cellosaurus.org/faq
The main function provided by the package is mapCell2Accession. It takes a vector of cell line identifiers and returns a data.table.
By default, the function maps using the common identifiers and synonyms (from = "idsy") and returns the standardized identifier as cellLineName and the Cellosaurus accession ID as accession. It also returns a query column holding the original input, which can be used to match each result back to the identifier that produced it.
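To illustrate what the query column is for, here is a minimal sketch that joins a mapping result back onto a sample sheet. The mapped table is a hand-built data.frame that mirrors the three-column shape described above, so no API call is made; sample_sheet and its columns are hypothetical and only serve the illustration.

```r
# Hand-built stand-in for the table returned by mapCell2Accession();
# the values are copied from outputs shown in this vignette.
mapped <- data.frame(
  cellLineName = c("HeLa", "A-549"),
  accession    = c("CVCL_0030", "CVCL_0023"),
  query        = c("hela", "A549"),
  stringsAsFactors = FALSE
)

# Hypothetical sample sheet keyed by the names that were originally queried.
sample_sheet <- data.frame(
  sampleName = c("A549", "hela"),
  batch      = c("b1", "b2"),
  stringsAsFactors = FALSE
)

# Join on the original query string so each sample picks up its accession.
annotated <- merge(
  sample_sheet, mapped,
  by.x = "sampleName", by.y = "query",
  all.x = TRUE
)
print(annotated)
```

Keeping all.x = TRUE means samples without a match stay in the sheet with NA in the mapped columns instead of being dropped.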
Let’s see how we can use this function to map the “HeLa” and “A549” cell line names to the Cellosaurus database.
mapCell2Accession("hela")
#> [11:17:10][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:17:10][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:17:11][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:17:14][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession query
#> <char> <char> <char>
#> 1: HeLa CVCL_0030 hela
mapCell2Accession("A549")
#> [11:17:20][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:17:20][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:17:21][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:17:23][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession query
#> <char> <char> <char>
#> 1: A-549 CVCL_0023 A549
Mapping multiple cell lines in a single call is also supported.
mapCell2Accession(c("A549", "THIS SHOULD FAIL", "BT474"))
#> [11:17:29][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:17:29][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:17:30][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:17:33][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> [11:17:38][WARNING]No results found for THIS SHOULD FAIL
#> cellLineName accession query
#> <char> <char> <char>
#> 1: A-549 CVCL_0023 A549
#> 2: <NA> <NA> THIS SHOULD FAIL
#> 3: BT-474 CVCL_0179 BT474
By default, the function will parse the API responses to return the most common mapping. To return all possible mappings, set parsed = FALSE.
# parsed
mapCell2Accession(c("A549", "hela", "BT474"), parsed = TRUE)
#> [11:17:38][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:17:38][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:17:39][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> Querying Cellosaurus... ■■■■■■■■■■■■■■■■■■■■■ 67% | ETA: 1s
#> Querying Cellosaurus... ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100% | ETA: 0s
#> [11:17:42][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession query
#> <char> <char> <char>
#> 1: A-549 CVCL_0023 A549
#> 2: HeLa CVCL_0030 hela
#> 3: BT-474 CVCL_0179 BT474
# no parsing
mapCell2Accession(c("A549", "hela", "BT474"), parsed = FALSE)
#> [11:17:54][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:17:54][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:17:55][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> Querying Cellosaurus... ■■■■■■■■■■■■■■■■■■■■■ 67% | ETA: 1s
#> Querying Cellosaurus... ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100% | ETA: 0s
#> [11:17:57][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession category ageAtSampling sexOfCell
#> <char> <char> <char> <char> <char>
#> 1: A-549 CVCL_0023 Cancer cell line 58Y Male
#> 2: A549(VM)28 CVCL_4V06 Cancer cell line 58Y Male
#> 3: A549(VP)28 CVCL_4V07 Cancer cell line 58Y Male
#> 4: A549.EpoB40 CVCL_4Z15 Cancer cell line 58Y Male
#> 5: A549-Dual CVCL_5I73 Cancer cell line 58Y Male
#> ---
#> 4194: BT474-LAPRa CVCL_EI02 Cancer cell line 60Y Female
#> 4195: BT474-LAPRb CVCL_EI03 Cancer cell line 60Y Female
#> 4196: BT474-LR CVCL_VL01 Cancer cell line 60Y Female
#> 4197: BT474 A3 CVCL_YX79 Cancer cell line 60Y Female
#> 4198: BT474-J4 CVCL_ZL46 Cancer cell line 60Y Female
#> synonyms diseases
#> <list> <list>
#> 1: A 549,A549,NCI-A549,A549/ATCC,A549 ATCC,A549ATCC,...[7] <list[1]>
#> 2: NA <list[1]>
#> 3: NA <list[1]>
#> 4: EpoB40 <list[1]>
#> 5: NA <list[1]>
#> ---
#> 4194: NA <list[1]>
#> 4195: NA <list[1]>
#> 4196: BT474/LR,BT474 LR <list[1]>
#> 4197: BT474-A3 <list[1]>
#> 4198: BT-474-J4 <list[1]>
#> crossReferences hierarchy comments query
#> <list> <list> <list> <char>
#> 1: <list[1209]> NA <list[14]> A549
#> 2: <list[2]> <list[1]> <list[4]> A549
#> 3: <list[2]> <list[1]> <list[4]> A549
#> 4: <list[2]> <list[1]> <list[4]> A549
#> 5: <list[3]> <list[1]> <list[5]> A549
#> ---
#> 4194: <list[2]> <list[1]> <list[4]> BT474
#> 4195: <list[2]> <list[1]> <list[4]> BT474
#> 4196: <list[3]> <list[1]> <list[4]> BT474
#> 4197: <list[2]> <list[1]> <list[4]> BT474
#> 4198: <list[14]> <list[1]> <list[5]> BT474
The backend of the function also tries to map any misspellings or synonyms of the cell line names.
samples <- c("SK23", "SJCRH30")
mapCell2Accession(samples)
#> [11:18:09][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:18:09][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:18:10][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:18:10][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession query
#> <char> <char> <char>
#> 1: SK-MEL-23 CVCL_6027 SK23
#> 2: Rh30 CVCL_0041 SJCRH30
If some cell lines still cannot be found, there is an additional parameter for fuzzy searching.
# No fuzzy
mapCell2Accession("DOR 13")
#> [11:18:10][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:18:10][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:18:11][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:18:11][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> [11:18:11][WARNING]No results found for DOR 13
#> query
#> <char>
#> 1: DOR 13
# Fuzzy
mapCell2Accession("DOR 13", fuzzy = TRUE)
#> [11:18:11][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:18:11][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:18:11][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:18:12][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
#> cellLineName accession query
#> <char> <char> <char>
#> 1: DOV13 CVCL_6774 DOR 13
Once accession IDs are obtained and the mappings are satisfactory, they can then be mapped to other fields in the Cellosaurus database. A list of available fields can be found using cellosaurus_fields().
cellosaurus_fields()
#> [1] "id" "sy" "idsy"
#> [4] "ac" "acas" "dr"
#> [7] "ref" "rx" "ra"
#> [10] "rt" "rl" "ww"
#> [13] "genome-ancestry" "hla" "registration"
#> [16] "sequence-variation" "anecdotal" "biotechnology"
#> [19] "breed" "caution" "cell-type"
#> [22] "characteristics" "donor-info" "derived-from-site"
#> [25] "discontinued" "doubling-time" "from"
#> [28] "group" "karyotype" "knockout"
#> [31] "msi" "miscellaneous" "misspelling"
#> [34] "mab-isotype" "mab-target" "omics"
#> [37] "part-of" "population" "problematic"
#> [40] "resistance" "senescence" "integrated"
#> [43] "transformant" "virology" "cc"
#> [46] "str" "di" "din"
#> [49] "dio" "ox" "sx"
#> [52] "ag" "oi" "hi"
#> [55] "ch" "ca" "dt"
#> [58] "dtc" "dtu" "dtv"
The annotateCellAccession() function can be used to map the accession IDs to the desired fields. By default, the function will try to map to "id", "ac", "hi", "sy", "ca", "sx", "ag", "di", "derived-from-site", "misspelling", "dt".
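Failed mappings carry NA in the accession column (as with the "THIS SHOULD FAIL" query earlier), so it can be useful to separate them out before annotating. The following is a minimal sketch on a hand-built data.frame standing in for a mapCell2Accession() result, so no API call is made. Whether annotateCellAccession() tolerates NA input is not shown in this vignette, so the filtering here is a cautious assumption and not a documented requirement.

```r
# Hand-built stand-in for a mapCell2Accession() result with one failed query;
# the values are copied from the multi-query example in this vignette.
mapped <- data.frame(
  cellLineName = c("A-549", NA, "BT-474"),
  accession    = c("CVCL_0023", NA, "CVCL_0179"),
  query        = c("A549", "THIS SHOULD FAIL", "BT474"),
  stringsAsFactors = FALSE
)

# Queries that did not map: candidates for manual review or a fuzzy retry.
unmapped <- mapped$query[is.na(mapped$accession)]

# Unique, non-missing accessions that are ready to be annotated.
accessions <- unique(mapped$accession[!is.na(mapped$accession)])

print(unmapped)
print(accessions)
```

The accessions vector could then be supplied through the accessions argument of annotateCellAccession(), while unmapped could be retried with fuzzy = TRUE.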
# Annotate the A549 cell line
mappedAccessions <- mapCell2Accession("A549")
#> [11:18:13][INFO][AnnotationGx::mapCell2Accession] Creating Cellosaurus queries
#> [11:18:13][INFO][AnnotationGx::mapCell2Accession] Building Cellosaurus requests
#> [11:18:13][INFO][AnnotationGx::mapCell2Accession] Performing Cellosaurus queries
#> [11:18:16][INFO][AnnotationGx::mapCell2Accession] Parsing Cellosaurus responses
annotateCellAccession(accessions = mappedAccessions$accession)
#> [11:18:21][INFO][AnnotationGx::annotateCellAccession] Building Cellosaurus requests...
#> [11:18:22][INFO][AnnotationGx::annotateCellAccession] Performing Requests...
#> [11:18:22][INFO][AnnotationGx::annotateCellAccession] Parsing Responses...
#> cellLineName accession category
#> <char> <char> <char>
#> 1: A-549 CVCL_0023 Cancer cell line
#> date ageAtSampling
#> <char> <char>
#> 1: Created: 04-04-12; Last updated: 27-11-25; Version: 53 58Y
#> sexOfCell synonyms diseases
#> <char> <list> <list>
#> 1: Male A 549,A549,NCI-A549,A549/ATCC,A549 ATCC,A549ATCC,...[7] <list[1]>
#> crossReferences hierarchy comments
#> <char> <char> <list>
#> 1: <NA> <NA> <list[2]>
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.18.2.1 AnnotationGx_0.99.2
#>
#> loaded via a namespace (and not attached):
#> [1] crayon_1.5.3 cli_3.6.6 knitr_1.51 rlang_1.2.0
#> [5] xfun_0.57 otel_0.2.0 jsonlite_2.0.0 glue_1.8.1
#> [9] backports_1.5.1 htmltools_0.5.9 sass_0.4.10 rappdirs_0.3.4
#> [13] rmarkdown_2.31 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [17] yaml_2.3.12 lifecycle_1.0.5 httr2_1.2.2 memoise_2.0.1
#> [21] compiler_4.6.0 digest_0.6.39 R6_2.6.1 curl_7.0.0
#> [25] parallel_4.6.0 magrittr_2.0.5 bslib_0.10.0 checkmate_2.3.4
#> [29] withr_3.0.2 tools_4.6.0 cachem_1.1.0