Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database. If your focus lies in cancer cell lines, you can access data from cancercelllines.org by specifying the dataset parameter as “cancercelllines”. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.
library(pgxRpi)
pgxLoader function
This function loads various data from the Progenetix database.
The parameters of this function used in this tutorial:
type
: A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
filters
: Identifiers for cancer type, literature, cohorts, and age, such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
filterLogic
: A string specifying the logic for combining multiple filters when querying metadata. Available options are “AND” and “OR”. Default is “AND”. An exception is age filters, which are always combined with any other filter using AND logic, even if filterLogic = “OR”; in that case OR applies only to the other filters.
individual_id
: Identifiers used in the Progenetix database for identifying individuals.
biosample_id
: Identifiers used in the Progenetix database for identifying biosamples.
codematches
: A logical value determining whether to exclude samples from child concepts of the specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples exactly encoded by the specified filters are kept. Do not use this parameter when filters include ontology-irrelevant filters such as PMID and cohort identifiers. Default is FALSE.
limit
: Integer specifying the number of returned samples/individuals/coverage profiles for each filter. Default is 0 (return all).
skip
: Integer specifying the number of skipped samples/individuals/coverage profiles for each filter, counted in units of limit. E.g. if skip = 2 and limit = 500, the first 2 * 500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
dataset
: A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.

Relevant parameters: type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset
Filters are a significant enhancement to the Beacon query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the documentation.
The pgxFilter function helps access the available filters used in Progenetix. Here is an example:
# access all filters
all_filters <- pgxFilter()
# get all prefixes
all_prefix <- pgxFilter(return_all_prefix = TRUE)
# access specific filters based on prefix
ncit_filters <- pgxFilter(prefix="NCIT")
head(ncit_filters)
#> [1] "NCIT:C28076" "NCIT:C18000" "NCIT:C14158" "NCIT:C14161" "NCIT:C28077"
#> [6] "NCIT:C28078"
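Since the printed output above shows a plain character vector, ordinary base R functions can be used to check filters before sending a query. The following is a minimal sketch under that assumption: `available` is a hand-typed stand-in built from the six identifiers printed above, and `candidates` (including the made-up code "NCIT:C99999") is a hypothetical user input, so no network access is needed.

```r
# Stand-in for a vector such as ncit_filters (values copied from the output above)
available <- c("NCIT:C28076", "NCIT:C18000", "NCIT:C14158",
               "NCIT:C14161", "NCIT:C28077", "NCIT:C28078")

# Hypothetical filters a user might want to query ("NCIT:C99999" is made up)
candidates <- c("NCIT:C14158", "NCIT:C99999")

# Split the candidates into those present among the available filters and the rest
known   <- candidates[candidates %in% available]
unknown <- setdiff(candidates, available)

print(known)
print(unknown)
```

With a real result from pgxFilter, the same two lines could be applied to the full vector instead of the stand-in.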
The following query is designed to retrieve metadata in Progenetix related to all samples of lung adenocarcinoma, utilizing a specific type of filter based on an NCIt code as an ontology identifier.
biosamples <- pgxLoader(type="biosample", filters = "NCIT:C3512")
# data looks like this
biosamples[1700:1705, ]
#> biosample_id biosample_label biosample_legacy_id individual_id
#> 1700 pgxbs-kftvgjwk NA NA pgxind-kftx2a7n
#> 1701 pgxbs-kftvkvin NA NA pgxind-kftx725f
#> 1702 pgxbs-kftvkv7s NA NA pgxind-kftx71rq
#> 1703 pgxbs-kftvla1o NA NA pgxind-kftx7kck
#> 1704 pgxbs-kftvj6rv NA NA pgxind-kftx50m6
#> 1705 pgxbs-kftvlali NA NA pgxind-kftx7l1e
#> callset_ids group_id group_label pubmed_id
#> 1700 pgxcs-kftvmigq NA NA PMID:24174329
#> 1701 pgxcs-kftwuyog NA NA PMID:28336552
#> 1702 pgxcs-kftwuvbc NA NA PMID:28336552
#> 1703 pgxcs-kftx10y3 NA NA PMID:28481359
#> 1704 pgxcs-kftwfjnr NA NA PMID:21521776
#> 1705 pgxcs-kftx1731 NA NA PMID:29337640
#> pubmed_label
#> 1700 Clinical Lung Cancer Genome Project (CLCGP), Network Genomic Medicine (NGM). (2013): A genomics-based classification of human lung...
#> 1701 Jordan EJ, Kim HR et al. (2017): Prospective Comprehensive Molecular Characterization of Lung...
#> 1702 Jordan EJ, Kim HR et al. (2017): Prospective Comprehensive Molecular Characterization of Lung...
#> 1703 Zehir A, Benayed R et al. (2017): Mutational landscape of metastatic cancer revealed...
#> 1704 Broët P, Dalmasso C et al. (2011): Genomic profiles specific to patient ethnicity...
#> 1705 Rizvi H, Sanchez-Vega F et al. (2018): Molecular Determinants of Response to Anti-Programmed...
#> cellosaurus_id cellosaurus_label cbioportal_id
#> 1700
#> 1701 cbioportal:lung_msk_2017
#> 1702 cbioportal:lung_msk_2017
#> 1703 cbioportal:msk_impact_2017
#> 1704
#> 1705 cbioportal:nsclc_pd1_msk_2018
#> cbioportal_label tcgaproject_id tcgaproject_label
#> 1700 NA
#> 1701 NA
#> 1702 NA
#> 1703 NA
#> 1704 NA
#> 1705 NA
#> external_references_id___arrayexpress
#> 1700
#> 1701
#> 1702
#> 1703
#> 1704
#> 1705
#> external_references_label___arrayexpress cohort_ids
#> 1700 NA
#> 1701 NA
#> 1702 NA
#> 1703 NA
#> 1704 NA
#> 1705 NA
#> legacy_ids
#> 1700 PGX_AM_BS_24174329-clc-S01702
#> 1701 PGX_AM_BS_LUNG_MSK_2017-P_0002917_T01_IM3
#> 1702 PGX_AM_BS_LUNG_MSK_2017-P_0000689_T01_IM3
#> 1703 PGX_AM_BS_MSK_IMPACT_2017-P_0012161_T01_IM5
#> 1704 PGX_AM_BS_GSM837716
#> 1705 PGX_AM_BS_NSCLC_PD1_MSK_2018-P_0005295_T01_IM5
#> notes histological_diagnosis_id
#> 1700 adenocarcinoma [lung] NCIT:C3512
#> 1701 Lung Adenocarcinoma NCIT:C3512
#> 1702 Lung Adenocarcinoma NCIT:C3512
#> 1703 Lung Adenocarcinoma NCIT:C3512
#> 1704 lung adenocarcinoma [East Asian] NCIT:C3512
#> 1705 Lung Adenocarcinoma NCIT:C3512
#> histological_diagnosis_label icdo_morphology_id icdo_morphology_label
#> 1700 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> 1701 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> 1702 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> 1703 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> 1704 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> 1705 Lung Adenocarcinoma pgx:icdom-81403 Adenocarcinoma, NOS
#> icdo_topography_id icdo_topography_label pathological_stage_id
#> 1700 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> 1701 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> 1702 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> 1703 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> 1704 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> 1705 pgx:icdot-C34.9 Lung, NOS NCIT:C92207
#> pathological_stage_label biosample_status_id biosample_status_label
#> 1700 Stage Unknown EFO:0009656 neoplastic sample
#> 1701 Stage Unknown EFO:0009656 neoplastic sample
#> 1702 Stage Unknown EFO:0009656 neoplastic sample
#> 1703 Stage Unknown EFO:0009656 neoplastic sample
#> 1704 Stage Unknown EFO:0009656 neoplastic sample
#> 1705 Stage Unknown EFO:0009656 neoplastic sample
#> sampled_tissue_id sampled_tissue_label tnm stage grade age_iso sex_id
#> 1700 UBERON:0002048 lung NA NA NA P60Y NA
#> 1701 UBERON:0002048 lung NA NA NA P48Y NA
#> 1702 UBERON:0002048 lung NA NA NA P56Y NA
#> 1703 UBERON:0002048 lung NA NA NA P69Y NA
#> 1704 UBERON:0002048 lung NA NA NA NA
#> 1705 UBERON:0002048 lung NA NA NA P73Y NA
#> sex_label followup_state_id followup_state_label followup_time
#> 1700 NA EFO:0030041 alive (follow-up status) NA
#> 1701 NA EFO:0030039 no followup status NA
#> 1702 NA EFO:0030039 no followup status NA
#> 1703 NA EFO:0030039 no followup status NA
#> 1704 NA EFO:0030039 no followup status NA
#> 1705 NA EFO:0030039 no followup status NA
#> geoprov_city geoprov_country geoprov_iso_alpha3 geoprov_long_lat
#> 1700 Koeln Germany DEU 6.95::50.93
#> 1701 New York City United States of America USA -74.01::40.71
#> 1702 New York City United States of America USA -74.01::40.71
#> 1703 New York City United States of America USA -74.01::40.71
#> 1704 Evry France FRA 2.45::48.63
#> 1705 New York City United States of America USA -74.01::40.71
#> cnv_fraction cnv_del_fraction cnv_dup_fraction cell_line
#> 1700 NA NA NA
#> 1701 NA NA NA
#> 1702 NA NA NA
#> 1703 NA NA NA
#> 1704 NA NA NA
#> 1705 NA NA NA
The data contains many columns representing different aspects of sample information.
In Progenetix, the biosample id and individual id serve as unique identifiers for biosamples and the corresponding individuals. You can obtain these ids through a metadata search with filters as described above, or through a query on the website interface.
biosamples_2 <- pgxLoader(type="biosample", biosample_id = "pgxbs-kftvgioe",individual_id = "pgxind-kftx28q5")
metainfo <- c("biosample_id","individual_id","pubmed_id","followup_state_label","followup_time")
biosamples_2[metainfo]
#> biosample_id individual_id pubmed_id followup_state_label
#> 1 pgxbs-kftvgioe pgxind-kftx28pu PMID:24174329 alive (follow-up status)
#> 2 pgxbs-kftvgiom pgxind-kftx28q5 PMID:24174329 dead (follow-up status)
#> followup_time
#> 1 NA
#> 2 NA
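In the output above, the first row matches the requested biosample id and the second row matches the requested individual id, which suggests that the two id parameters are combined as a union and not as an intersection. The base R sketch below rebuilds the two printed rows by hand (no query is sent) and labels which requested id each row matches. The column name `matched_by` is introduced only for illustration and is not part of the package output.

```r
# Hand-typed copy of the two rows printed above
result <- data.frame(
  biosample_id  = c("pgxbs-kftvgioe", "pgxbs-kftvgiom"),
  individual_id = c("pgxind-kftx28pu", "pgxind-kftx28q5"),
  stringsAsFactors = FALSE
)

# The ids that were passed to pgxLoader in the call above
requested_biosample  <- "pgxbs-kftvgioe"
requested_individual <- "pgxind-kftx28q5"

# Label each row by the requested id it matches
result$matched_by <- ifelse(
  result$biosample_id %in% requested_biosample, "biosample_id",
  ifelse(result$individual_id %in% requested_individual,
         "individual_id", NA_character_)
)

print(result)
```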
It’s also possible to query by a combination of filters, biosample id, and individual id.
By default, it returns all related samples (limit=0). You can access a subset of them via the parameters limit and skip. For example, if you want to access the first 1000 samples, you can set limit = 1000, skip = 0.
biosamples_3 <- pgxLoader(type="biosample", filters = "NCIT:C3512",skip=0, limit = 1000)
# Dimension: Number of samples * features
print(dim(biosamples))
#> [1] 4641 49
print(dim(biosamples_3))
#> [1] 1000 49
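The parameter description states that skip is counted in units of limit (skip = 2 with limit = 500 skips the first 1000 profiles). The arithmetic can be sketched in base R as follows. The helper `page_rows` is a hypothetical name written for this illustration, not a pgxRpi function, and the positions refer to the order in which the server returns results.

```r
# Positions returned for a given skip/limit pair, following the rule
# "skip * limit records are skipped, then the next limit records are returned"
page_rows <- function(skip, limit) {
  first <- skip * limit + 1
  last  <- (skip + 1) * limit
  c(first = first, last = last)
}

page_rows(skip = 0, limit = 1000)  # positions 1 to 1000
page_rows(skip = 2, limit = 500)   # positions 1001 to 1500

# Number of pages of size 1000 needed to cover the 4641 samples reported above
n_pages <- ceiling(4641 / 1000)
print(n_pages)
```

Under this rule, paging through all samples would mean calling the loader with skip running from 0 to n_pages - 1 while keeping limit fixed.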
The number of samples in a specific group can be queried with the pgxCount function.
pgxCount(filters = "NCIT:C3512")
#> filters label total_count exact_match_count
#> 1 NCIT:C3512 Lung Adenocarcinoma 4641 4505
codematches use
The NCIt codes of the retrieved samples are not limited to the specified filter; they also include its child terms.
unique(biosamples$histological_diagnosis_id)
#> [1] "NCIT:C3512" "NCIT:C5649" "NCIT:C7270" "NCIT:C2923" "NCIT:C7268"
#> [6] "NCIT:C7269" "NCIT:C5650"
Setting codematches to TRUE makes this function return only biosamples that exactly match the filter.
biosamples_4 <- pgxLoader(type="biosample", filters = "NCIT:C3512",codematches = TRUE)
unique(biosamples_4$histological_diagnosis_id)
#> [1] "NCIT:C3512"
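The difference between the two query modes can be read off the pgxCount output shown earlier, where total_count is 4641 and exact_match_count is 4505. The sketch below does that subtraction and then mimics exact matching on a small hand-made vector of diagnosis codes; the vector is a toy example that reuses codes printed above, not real query output.

```r
# Counts copied from the pgxCount output for NCIT:C3512
total_count       <- 4641
exact_match_count <- 4505

# Samples that are annotated with a child term instead of NCIT:C3512 itself
child_term_samples <- total_count - exact_match_count
print(child_term_samples)

# Toy diagnosis codes: two exact matches and two child terms
diagnosis <- c("NCIT:C3512", "NCIT:C5649", "NCIT:C3512", "NCIT:C7270")

# Exact matching keeps only the entries equal to the queried code
exact_only <- diagnosis[diagnosis == "NCIT:C3512"]
print(exact_only)
```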
filterLogic use
This function supports querying samples that match multiple filters. For example, if you want to retrieve information about lung adenocarcinoma samples from the publication PMID:24174329, you can specify multiple matching filters and set filterLogic to “AND”.
biosamples_5 <- pgxLoader(type="biosample", filters = c("NCIT:C3512","PMID:24174329"),
filterLogic = "AND")
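Conceptually, “AND” corresponds to the intersection of the samples matched by each filter and “OR” to their union. The base R sketch below illustrates this with made-up sample ids; the ids and the assignment of ids to filters are hypothetical and serve only to show the set logic, not the server-side implementation.

```r
# Hypothetical sample ids matched by each filter (made up for illustration)
luad_samples  <- c("s1", "s2", "s3", "s4")  # e.g. matched by "NCIT:C3512"
study_samples <- c("s3", "s4", "s5")        # e.g. matched by "PMID:24174329"

# filterLogic = "AND": samples matched by both filters
and_result <- intersect(luad_samples, study_samples)

# filterLogic = "OR": samples matched by at least one filter
or_result <- union(luad_samples, study_samples)

print(and_result)
print(or_result)
```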
If you want to query metadata (e.g. survival data) of the individuals from whom the samples of interest were taken, you can follow the tutorial below.
Relevant parameters: type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset
individuals <- pgxLoader(type="individual",filters="NCIT:C3270")
# Dimension: Number of individuals * features
print(dim(individuals))
#> [1] 2001 25
# data looks like this
individuals[36:40, ]
#> individual_id individual_legacy_id legacy_ids sex_id
#> 36 pgxind-kftx2orw NA PGX_IND_Nbl-mic-AD-101 PATO:0020000
#> 37 pgxind-kftx3804 NA PGX_IND_NBL-Meta-142 PATO:0020000
#> 38 pgxind-kftx6b31 NA PGX_IND_GSM174440 PATO:0020000
#> 39 pgxind-kftx381l NA PGX_IND_NBL-Meta-166 PATO:0020000
#> 40 pgxind-kftx2ovz NA PGX_IND_Nbl-mic-AH-5 PATO:0020000
#> sex_label age_iso age_days data_use_conditions_id
#> 36 genotypic sex P4Y7M 1673.8869 NA
#> 37 genotypic sex P1Y4M 486.9093 NA
#> 38 genotypic sex NA NA
#> 39 genotypic sex P2Y 730.4850 NA
#> 40 genotypic sex P0Y1M 30.4167 NA
#> data_use_conditions_label histological_diagnosis_id
#> 36 NA NCIT:C3270
#> 37 NA NCIT:C3270
#> 38 NA NCIT:C3270
#> 39 NA NCIT:C3270
#> 40 NA NCIT:C3270
#> histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 36 Neuroblastoma NA P87.6M
#> 37 Neuroblastoma NA P41M
#> 38 Neuroblastoma NA None
#> 39 Neuroblastoma NA P48M
#> 40 Neuroblastoma NA P19.2M
#> index_disease_followup_state_id index_disease_followup_state_label
#> 36 EFO:0030041 alive (follow-up status)
#> 37 EFO:0030041 alive (follow-up status)
#> 38 EFO:0030039 no followup status
#> 39 EFO:0030041 alive (follow-up status)
#> 40 EFO:0030041 alive (follow-up status)
#> auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 36 NA NA NA
#> 37 NA NA NA
#> 38 NA NA NA
#> 39 NA NA NA
#> 40 NA NA NA
#> geoprov_id geoprov_city geoprov_country geoprov_iso_alpha3
#> 36 NA Gent Belgium NA
#> 37 NA Gent Belgium NA
#> 38 NA Philadelphia United States NA
#> 39 NA Gent Belgium NA
#> 40 NA Gent Belgium NA
#> geoprov_long_lat cell_line_donation_id cell_line_donation_label
#> 36 3.7199999999999998::51.05 NA NA
#> 37 3.7199999999999998::51.05 NA NA
#> 38 -75.16::39.95 NA NA
#> 39 3.7199999999999998::51.05 NA NA
#> 40 3.7199999999999998::51.05 NA NA
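The age_iso column holds ISO 8601 durations such as P4Y7M, while age_days holds the same age in days. The sketch below converts the former to the latter in base R. The helper names are hypothetical, and the conversion factors (365.2425 days per year, 30.4167 days per month) are an assumption inferred from the printed rows above, which they happen to reproduce; they are not taken from the package source.

```r
# Return the number captured by `pattern`, or 0 when that component is absent
extract_number <- function(x, pattern) {
  if (!grepl(pattern, x, perl = TRUE)) return(0)
  as.numeric(sub(pattern, "\\1", x, perl = TRUE))
}

# Convert durations like "P4Y7M" to days; only years and months are read,
# and missing or empty values give NA
iso_age_to_days <- function(age_iso) {
  vapply(age_iso, function(x) {
    if (is.na(x) || !nzchar(x)) return(NA_real_)
    years  <- extract_number(x, "^P([0-9.]+)Y.*$")
    months <- extract_number(x, "^P(?:[0-9.]+Y)?([0-9.]+)M.*$")
    years * 365.2425 + months * 30.4167
  }, numeric(1), USE.NAMES = FALSE)
}

# Ages copied from the rows printed above
ages <- c("P4Y7M", "P1Y4M", "", "P2Y", "P0Y1M")
print(iso_age_to_days(ages))
```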
You can obtain these ids from a biosample query.
individual <- pgxLoader(type="individual",individual_id = "pgxind-kftx26ml", biosample_id="pgxbs-kftvh94d")
individual
#> individual_id individual_legacy_id legacy_ids sex_id
#> 1 pgxind-kftx3565 NA PGX_IND_EpTu-N270 PATO:0020000
#> 2 pgxind-kftx26ml NA PGX_IND_AdSqLu-bjo-01 PATO:0020001
#> sex_label age_iso age_days data_use_conditions_id
#> 1 genotypic sex NA NA NA
#> 2 male genotypic sex NA NA NA
#> data_use_conditions_label histological_diagnosis_id
#> 1 NA NCIT:C3697
#> 2 NA NCIT:C3493
#> histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 1 Myxopapillary Ependymoma NA None
#> 2 Squamous Cell Lung Carcinoma NA None
#> index_disease_followup_state_id index_disease_followup_state_label
#> 1 EFO:0030039 no followup status
#> 2 EFO:0030039 no followup status
#> auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 1 NA NA NA
#> 2 NA NA NA
#> geoprov_id geoprov_city geoprov_country geoprov_iso_alpha3 geoprov_long_lat
#> 1 NA Nijmegen The Netherlands NA 5.84::51.81
#> 2 NA Helsinki Finland NA 24.94::60.17
#> cell_line_donation_id cell_line_donation_label
#> 1 NA NA
#> 2 NA NA
pgxMetaplot function
This function generates a survival plot using the metadata of individuals obtained by the pgxLoader function.
The parameters of this function:
data
: The metadata of individuals returned by the pgxLoader function.
group_id
: A string specifying which column is used for grouping in the Kaplan-Meier plot.
condition
: Condition for splitting individuals into younger and older groups. Only used if group_id is age related.
return_data
: A logical value determining whether to return the metadata used for plotting. Default is FALSE.
...
: Other parameters relevant to the KM plot. These include pval, pval.coord, pval.method, conf.int, linetype, and palette (see the ggsurvplot function from the survminer package).

Suppose you want to investigate whether there are survival differences between younger and older patients with a particular disease. You can query and visualize the relevant information as follows:
# query metadata of individuals with lung adenocarcinoma
luad_inds <- pgxLoader(type="individual",filters="NCIT:C3512")
# use 65 years old as the splitting condition
pgxMetaplot(data=luad_inds, group_id="age_iso", condition="P65Y", pval=TRUE)
Note that not all individuals have available survival data. If you set return_data to TRUE, the function will return the metadata of individuals used for the plot.
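To make the point about missing survival data concrete, the sketch below prepares follow-up information by hand for a Kaplan-Meier style analysis. It is a rough base R illustration and not the internal logic of pgxMetaplot: the three rows are copied from the neuroblastoma output shown earlier, and the derived columns `months` and `event` are names chosen here for illustration.

```r
# Three rows copied from the individuals output shown earlier
followup <- data.frame(
  individual_id = c("pgxind-kftx2orw", "pgxind-kftx3804", "pgxind-kftx6b31"),
  index_disease_followup_time = c("P87.6M", "P41M", "None"),
  index_disease_followup_state_label = c("alive (follow-up status)",
                                         "alive (follow-up status)",
                                         "no followup status"),
  stringsAsFactors = FALSE
)

# Keep only individuals whose follow-up time is a duration in months
has_time <- grepl("^P[0-9.]+M$", followup$index_disease_followup_time)
usable   <- followup[has_time, ]

# Follow-up time in months and an event indicator (1 = dead, 0 = otherwise)
usable$months <- as.numeric(
  sub("^P([0-9.]+)M$", "\\1", usable$index_disease_followup_time)
)
usable$event <- as.integer(
  usable$index_disease_followup_state_label == "dead (follow-up status)"
)

print(usable[, c("individual_id", "months", "event")])
```

The third individual is dropped because it has no follow-up time, which is the kind of loss to expect when plotting real query results.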
#> R version 4.4.0 (2024-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows Server 2022 x64 (build 20348)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pgxRpi_1.0.2 BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.5 xfun_0.44 bslib_0.7.0
#> [4] ggplot2_3.5.1 rstatix_0.7.2 lattice_0.22-6
#> [7] vctrs_0.6.5 tools_4.4.0 generics_0.1.3
#> [10] curl_5.2.1 tibble_3.2.1 fansi_1.0.6
#> [13] highr_0.11 pkgconfig_2.0.3 Matrix_1.7-0
#> [16] data.table_1.15.4 lifecycle_1.0.4 compiler_4.4.0
#> [19] farver_2.1.2 munsell_0.5.1 tinytex_0.51
#> [22] carData_3.0-5 htmltools_0.5.8.1 sass_0.4.9
#> [25] yaml_2.3.8 pillar_1.9.0 car_3.1-2
#> [28] ggpubr_0.6.0 jquerylib_0.1.4 tidyr_1.3.1
#> [31] cachem_1.1.0 survminer_0.4.9 magick_2.8.3
#> [34] abind_1.4-5 km.ci_0.5-6 tidyselect_1.2.1
#> [37] digest_0.6.35 dplyr_1.1.4 purrr_1.0.2
#> [40] bookdown_0.39 labeling_0.4.3 splines_4.4.0
#> [43] fastmap_1.2.0 grid_4.4.0 colorspace_2.1-0
#> [46] cli_3.6.2 magrittr_2.0.3 survival_3.7-0
#> [49] utf8_1.2.4 broom_1.0.6 withr_3.0.0
#> [52] scales_1.3.0 backports_1.5.0 lubridate_1.9.3
#> [55] timechange_0.3.0 rmarkdown_2.27 httr_1.4.7
#> [58] gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-12
#> [61] evaluate_0.24.0 knitr_1.47 KMsurv_0.1-5
#> [64] survMisc_0.5.6 rlang_1.1.4 Rcpp_1.0.12
#> [67] xtable_1.8-4 glue_1.7.0 BiocManager_1.30.23
#> [70] attempt_0.3.1 jsonlite_1.8.8 R6_2.5.1
#> [73] plyr_1.8.9