Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database. If your focus lies in cancer cell lines, you can access data from cancercelllines.org by specifying the dataset parameter as “cancercelllines”. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.

1 Load library

library(pgxRpi)

1.1 pgxLoader function

This function loads various types of data from the Progenetix database.

The parameters of this function used in this tutorial:

  • type: A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
  • filters: Identifiers for cancer type, literature, cohorts, and age, such as c(“NCIT:C7376”, “pgx:icdom-98353”, “PMID:22824167”, “pgx:cohort-TCGAcancers”, “age:>=P50Y”). For more information about filters, see the documentation.
  • filterLogic: A string specifying the logic for combining multiple filters when querying metadata. Available options are “AND” and “OR”. Default is “AND”. Age filters are an exception: they are always combined with any other filter using AND logic, even if filterLogic = “OR”, in which case “OR” applies only to the remaining filters.
  • individual_id: Identifiers used in the Progenetix database for identifying individuals.
  • biosample_id: Identifiers used in the Progenetix database for identifying biosamples.
  • codematches: A logical value determining whether to exclude samples from child concepts of the specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples encoded exactly by the specified filters are kept. Do not use this parameter when filters include non-ontology filters such as PMID and cohort identifiers. Default is FALSE.
  • limit: Integer specifying the number of returned samples/individuals/coverage profiles for each filter. Default is 0 (return all).
  • skip: Integer specifying how many blocks of limit-sized results (samples/individuals/coverage profiles) to skip for each filter. E.g. if skip = 2, limit = 500, the first 2*500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
  • dataset: A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.
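
The skip and limit arithmetic can be checked offline with a small base R sketch. The helper page_rows below is hypothetical (it is not part of pgxRpi) and only assumes the semantics described in the list above: limit = 0 returns everything, and skip counts blocks of limit rows.

```r
# Hypothetical helper, not part of pgxRpi: which 1-based rows of a result set
# would be returned under the skip/limit semantics described above.
page_rows <- function(total, limit = 0, skip = NULL) {
  if (limit == 0) return(seq_len(total))   # limit = 0 means "return all"
  if (is.null(skip)) skip <- 0             # NULL means "no skip"
  first <- skip * limit + 1
  if (first > total) return(integer(0))    # skipped past the end
  seq(first, min(first + limit - 1, total))
}

# skip = 2, limit = 500: rows 1001 to 1500 of a 4641-row result
rows <- page_rows(total = 4641, limit = 500, skip = 2)
range(rows)
length(rows)
```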

2 Retrieve metadata of samples

2.1 Relevant parameters

type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset

2.2 Search by filters

Filters are a significant enhancement to the Beacon query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the documentation.

The pgxFilter function lists the filters available in Progenetix. Example usage:

# access all filters
all_filters <- pgxFilter()
# get all prefixes
all_prefix <- pgxFilter(return_all_prefix = TRUE)
# access specific filters based on prefix
ncit_filters <- pgxFilter(prefix="NCIT")
head(ncit_filters)
#> [1] "NCIT:C28076" "NCIT:C18000" "NCIT:C14158" "NCIT:C14161" "NCIT:C14167"
#> [6] "NCIT:C28077"
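
Before sending a query, it can help to check candidate filter values against the returned list. The following is a minimal offline sketch: ncit_filters is a stand-in holding only the six values printed above, and the second candidate is deliberately made up.

```r
# Stand-in for pgxFilter(prefix = "NCIT"); values copied from the output
# shown above so that the sketch runs without network access.
ncit_filters <- c("NCIT:C28076", "NCIT:C18000", "NCIT:C14158",
                  "NCIT:C14161", "NCIT:C14167", "NCIT:C28077")

# Candidate filters to validate; "NCIT:C0000000" is a made-up code.
candidates <- c("NCIT:C18000", "NCIT:C0000000")

known   <- candidates[candidates %in% ncit_filters]
unknown <- setdiff(candidates, ncit_filters)

known
unknown
```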

The following query is designed to retrieve metadata in Progenetix related to all samples of lung adenocarcinoma, utilizing a specific type of filter based on an NCIt code as an ontology identifier.

biosamples <- pgxLoader(type="biosample", filters = "NCIT:C3512")
# data looks like this
biosamples[c(1700:1705),]
#>        biosample_id   individual_id                                      notes
#> 1700 pgxbs-kftvjjhx pgxind-kftx5fyd                        lung adenocarcinoma
#> 1701 pgxbs-kftvjjhz pgxind-kftx5fyf                        lung adenocarcinoma
#> 1702 pgxbs-kftvjji1 pgxind-kftx5fyh                        lung adenocarcinoma
#> 1703 pgxbs-kftvjjn2 pgxind-kftx5g4r   lung adenocarcinoma [cell line PC-9/GR4]
#> 1704 pgxbs-kftvjjn4 pgxind-kftx5g4t lung adenocarcinoma [cell line PC-9/WZR10]
#> 1705 pgxbs-kftvjjn5 pgxind-kftx5g4v lung adenocarcinoma [cell line PC-9/WZR11]
#>      histological_diagnosis_id histological_diagnosis_label
#> 1700                NCIT:C3512          Lung Adenocarcinoma
#> 1701                NCIT:C3512          Lung Adenocarcinoma
#> 1702                NCIT:C3512          Lung Adenocarcinoma
#> 1703                NCIT:C3512          Lung Adenocarcinoma
#> 1704                NCIT:C3512          Lung Adenocarcinoma
#> 1705                NCIT:C3512          Lung Adenocarcinoma
#>      pathological_stage_id pathological_stage_label biosample_status_id
#> 1700           NCIT:C27975                 Stage Ia         EFO:0009656
#> 1701           NCIT:C27976                 Stage Ib         EFO:0009656
#> 1702           NCIT:C27976                 Stage Ib         EFO:0009656
#> 1703           NCIT:C92207            Stage Unknown         EFO:0030035
#> 1704           NCIT:C92207            Stage Unknown         EFO:0030035
#> 1705           NCIT:C92207            Stage Unknown         EFO:0030035
#>       biosample_status_label sample_origin_type_id sample_origin_type_label
#> 1700       neoplastic sample           OBI:0001479   specimen from organism
#> 1701       neoplastic sample           OBI:0001479   specimen from organism
#> 1702       neoplastic sample           OBI:0001479   specimen from organism
#> 1703 cancer cell line sample           OBI:0001876             cell culture
#> 1704 cancer cell line sample           OBI:0001876             cell culture
#> 1705 cancer cell line sample           OBI:0001876             cell culture
#>      sampled_tissue_id sampled_tissue_label tnm    stage_id   stage_label
#> 1700    UBERON:0002048                 lung     NCIT:C27975      Stage Ia
#> 1701    UBERON:0002048                 lung     NCIT:C27976      Stage Ib
#> 1702    UBERON:0002048                 lung     NCIT:C27976      Stage Ib
#> 1703    UBERON:0002048                 lung     NCIT:C92207 Stage Unknown
#> 1704    UBERON:0002048                 lung     NCIT:C92207 Stage Unknown
#> 1705    UBERON:0002048                 lung     NCIT:C92207 Stage Unknown
#>      tumor_grade_id tumor_grade_label age_iso biosample_label
#> 1700             NA                NA                      NA
#> 1701             NA                NA                      NA
#> 1702             NA                NA                      NA
#> 1703             NA                NA                      NA
#> 1704             NA                NA                      NA
#> 1705             NA                NA                      NA
#>      icdo_morphology_id icdo_morphology_label icdo_topography_id
#> 1700    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#> 1701    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#> 1702    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#> 1703    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#> 1704    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#> 1705    pgx:icdom-81403   Adenocarcinoma, NOS    pgx:icdot-C34.9
#>      icdo_topography_label     pubmed_id
#> 1700             Lung, NOS PMID:26444668
#> 1701             Lung, NOS PMID:26444668
#> 1702             Lung, NOS PMID:26444668
#> 1703             Lung, NOS PMID:22961667
#> 1704             Lung, NOS PMID:22961667
#> 1705             Lung, NOS PMID:22961667
#>                                                                                                                                                                                        pubmed_label
#> 1700 Aramburu A, Zudaire I, Pajares MJ, Agorreta J, Orta et al. (2015): Combined clinical and genomic signatures for the prognosis of early stage non-small cell lung cancer based on gene copy ...
#> 1701 Aramburu A, Zudaire I, Pajares MJ, Agorreta J, Orta et al. (2015): Combined clinical and genomic signatures for the prognosis of early stage non-small cell lung cancer based on gene copy ...
#> 1702 Aramburu A, Zudaire I, Pajares MJ, Agorreta J, Orta et al. (2015): Combined clinical and genomic signatures for the prognosis of early stage non-small cell lung cancer based on gene copy ...
#> 1703                                                                                                    Ercan D, Xu C, Yanagita M et al. (2012): Reactivation of ERK signaling causes resistance...
#> 1704                                                                                                    Ercan D, Xu C, Yanagita M et al. (2012): Reactivation of ERK signaling causes resistance...
#> 1705                                                                                                    Ercan D, Xu C, Yanagita M et al. (2012): Reactivation of ERK signaling causes resistance...
#>             cellosaurus_id cellosaurus_label cbioportal_id cbioportal_label
#> 1700                                                                     NA
#> 1701                                                                     NA
#> 1702                                                                     NA
#> 1703 cellosaurus:CVCL_DH34          PC-9/GR4                             NA
#> 1704 cellosaurus:CVCL_DG31        PC-9/WZR10                             NA
#> 1705 cellosaurus:CVCL_DG32        PC-9/WZR11                             NA
#>      tcgaproject_id tcgaproject_label cohort_ids biosample_name  geoprov_city
#> 1700                                          NA     GSM1857292 San Sebastian
#> 1701                                          NA     GSM1857293 San Sebastian
#> 1702                                          NA     GSM1857294 San Sebastian
#> 1703                                          NA      GSM925738        Boston
#> 1704                                          NA      GSM925740        Boston
#> 1705                                          NA      GSM925741        Boston
#>               geoprov_country geoprov_iso_alpha3 geoprov_long_lat group_id
#> 1700                    Spain                ESP     -1.97::43.31       NA
#> 1701                    Spain                ESP     -1.97::43.31       NA
#> 1702                    Spain                ESP     -1.97::43.31       NA
#> 1703 United States of America                USA    -71.06::42.36       NA
#> 1704 United States of America                USA    -71.06::42.36       NA
#> 1705 United States of America                USA    -71.06::42.36       NA
#>      group_label
#> 1700          NA
#> 1701          NA
#> 1702          NA
#> 1703          NA
#> 1704          NA
#> 1705          NA

The data contains many columns representing different aspects of sample information.
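
Because the result is an ordinary data frame, the usual base R tools apply. The sketch below uses a small stand-in data frame (three columns copied from rows 1700 to 1705 above) to count samples per status and keep only the non-cell-line samples; with a live query the same code would be applied to biosamples.

```r
# Stand-in for a few columns of the pgxLoader result shown above.
biosamples_demo <- data.frame(
  biosample_id = c("pgxbs-kftvjjhx", "pgxbs-kftvjjhz", "pgxbs-kftvjji1",
                   "pgxbs-kftvjjn2", "pgxbs-kftvjjn4", "pgxbs-kftvjjn5"),
  biosample_status_label = c(rep("neoplastic sample", 3),
                             rep("cancer cell line sample", 3)),
  pubmed_id = c(rep("PMID:26444668", 3), rep("PMID:22961667", 3)),
  stringsAsFactors = FALSE
)

# Number of samples per biosample status
status_counts <- table(biosamples_demo$biosample_status_label)
status_counts

# Keep only samples taken directly from tumors, with two columns of interest
is_tumor <- biosamples_demo$biosample_status_label == "neoplastic sample"
tumor_only <- biosamples_demo[is_tumor, c("biosample_id", "pubmed_id")]
tumor_only
```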

2.3 Search by biosample id and individual id

In Progenetix, biosample id and individual id serve as unique identifiers for biosamples and the corresponding individuals. You can obtain these IDs through a metadata search with filters as described above, or through a query on the website interface.

biosamples_2 <- pgxLoader(type="biosample", biosample_id = "pgxbs-kftvgioe",individual_id = "pgxind-kftx28q5")

metainfo <- c("biosample_id","individual_id","pubmed_id","histological_diagnosis_id","geoprov_city")
biosamples_2[metainfo]
#>     biosample_id   individual_id     pubmed_id histological_diagnosis_id
#> 1 pgxbs-kftvgioe pgxind-kftx28pu PMID:24174329                NCIT:C3512
#> 2 pgxbs-kftvgiom pgxind-kftx28q5 PMID:24174329                NCIT:C3512
#>   geoprov_city
#> 1        Koeln
#> 2        Koeln

It’s also possible to query by a combination of filters, biosample id, and individual id.

2.4 Access a subset of samples

By default, pgxLoader returns all related samples (limit=0). You can access a subset of them via the parameters limit and skip. For example, if you want to access the first 1000 samples, you can set limit = 1000, skip = 0.

biosamples_3 <- pgxLoader(type="biosample", filters = "NCIT:C3512",skip=0, limit = 1000)
# Dimension: Number of samples * features
print(dim(biosamples))
#> [1] 4641   40
print(dim(biosamples_3))
#> [1] 1000   40
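
To walk through a full result set in fixed-size blocks, the skip values can be planned from the total count. This is a minimal sketch of the arithmetic only; the commented-out loop shows how it might be combined with pgxLoader and is not run here.

```r
total     <- 4641   # number of samples reported above for NCIT:C3512
page_size <- 1000   # value passed as limit

n_pages    <- ceiling(total / page_size)
skips      <- seq_len(n_pages) - 1                       # 0, 1, ..., n_pages - 1
page_sizes <- pmin(page_size, total - skips * page_size) # rows expected per page

data.frame(skip = skips, limit = page_size, expected_rows = page_sizes)

# Not run (needs network access and the pgxRpi package):
# pages <- lapply(skips, function(s)
#   pgxLoader(type = "biosample", filters = "NCIT:C3512",
#             skip = s, limit = page_size))
# all_samples <- do.call(rbind, pages)
```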

2.5 Query the number of samples in Progenetix

The number of samples in a specific group can be queried with the pgxCount function.

pgxCount(filters = "NCIT:C3512")
#>      filters               label total_count exact_match_count
#> 1 NCIT:C3512 Lung Adenocarcinoma        4641              4505
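
The two count columns can be compared directly. The sketch below rebuilds the row shown above as a stand-in data frame and assumes that total_count includes samples coded with child terms while exact_match_count covers only samples coded with the filter itself; that reading is an interpretation of the column names, not something stated by the package output.

```r
# Stand-in for the pgxCount result shown above.
counts <- data.frame(
  filters = "NCIT:C3512",
  label = "Lung Adenocarcinoma",
  total_count = 4641,
  exact_match_count = 4505
)

# Samples attributed to more specific (child) terms, under the assumption above
counts$child_term_count <- counts$total_count - counts$exact_match_count
counts$exact_fraction   <- counts$exact_match_count / counts$total_count

counts[, c("filters", "child_term_count", "exact_fraction")]
```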

2.6 Parameter codematches use

By default, the retrieved samples are not limited to the specified NCIt code; samples annotated with its child terms are returned as well.

unique(biosamples$histological_diagnosis_id)
#> [1] "NCIT:C3512" "NCIT:C5649" "NCIT:C7269" "NCIT:C2923" "NCIT:C7268"
#> [6] "NCIT:C5650" "NCIT:C7270"

Setting codematches to TRUE makes this function return only biosamples that exactly match the filter.

biosamples_4 <- pgxLoader(type="biosample", filters = "NCIT:C3512",codematches = TRUE)

unique(biosamples_4$histological_diagnosis_id)
#> [1] "NCIT:C3512"
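
If a full result including child terms has already been downloaded, the exact matches can also be selected locally. This is a minimal sketch on a stand-in data frame with hypothetical sample ids; it illustrates the idea of exact matching and is not how codematches is implemented.

```r
# Stand-in result; the biosample ids are hypothetical, the diagnosis codes
# are taken from the list of codes printed above.
demo <- data.frame(
  biosample_id = paste0("demo-", 1:5),
  histological_diagnosis_id = c("NCIT:C3512", "NCIT:C5649", "NCIT:C3512",
                                "NCIT:C7269", "NCIT:C3512"),
  stringsAsFactors = FALSE
)

target <- "NCIT:C3512"

# Rows coded exactly with the requested term
exact <- demo[demo$histological_diagnosis_id == target, ]

# Codes that entered the result only as child terms
child_codes <- setdiff(unique(demo$histological_diagnosis_id), target)

unique(exact$histological_diagnosis_id)
child_codes
```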

2.7 Parameter filterLogic use

This function supports querying samples that match multiple filters. For example, if you want to retrieve information about lung adenocarcinoma samples from the publication PMID:24174329, you can specify both filters and set filterLogic to “AND”.

biosamples_5 <- pgxLoader(type="biosample", filters = c("NCIT:C3512","PMID:24174329"), 
                          filterLogic = "AND")
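
The difference between the two logic options can be pictured as set operations on the samples matched by each filter. The sketch below uses hypothetical sample ids and plain base R; it illustrates the concept and makes no claim about how the server evaluates the query.

```r
# Hypothetical ids of samples matched by each filter on its own.
matches_ncit <- c("s1", "s2", "s3", "s4")   # e.g. NCIT:C3512
matches_pmid <- c("s3", "s4", "s5")         # e.g. PMID:24174329

# filterLogic = "AND": samples matched by every filter
and_ids <- intersect(matches_ncit, matches_pmid)

# filterLogic = "OR": samples matched by at least one filter
or_ids <- union(matches_ncit, matches_pmid)

and_ids
or_ids
```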

3 Retrieve metadata of individuals

If you want to query metadata (e.g. survival data) of the individuals from whom the samples of interest were taken, you can follow the tutorial below.

3.1 Relevant parameters

type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset

3.2 Search by filters

individuals <- pgxLoader(type="individual",filters="NCIT:C3270")
# Dimension: Number of individuals * features
print(dim(individuals))
#> [1] 2001   17
# data looks like this
individuals[c(36:40),]
#>      individual_id       sex_id     sex_label age_iso  age_days
#> 36 pgxind-kftx27zb PATO:0020000 genotypic sex   P5Y7M 2039.1294
#> 37 pgxind-kftx27zd PATO:0020000 genotypic sex   P0Y5M  152.0835
#> 38 pgxind-kftx27zf PATO:0020000 genotypic sex   P0Y4M  121.6668
#> 39 pgxind-kftx27zh PATO:0020000 genotypic sex   P0Y6M  182.5002
#> 40 pgxind-kftx27zj PATO:0020000 genotypic sex   P0Y4M  121.6668
#>    data_use_conditions_id data_use_conditions_label histological_diagnosis_id
#> 36                     NA                        NA                NCIT:C3270
#> 37                     NA                        NA                NCIT:C3270
#> 38                     NA                        NA                NCIT:C3270
#> 39                     NA                        NA                NCIT:C3270
#> 40                     NA                        NA                NCIT:C3270
#>    histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 36                Neuroblastoma                  NA                        P20M
#> 37                Neuroblastoma                  NA                        P41M
#> 38                Neuroblastoma                  NA                        P18M
#> 39                Neuroblastoma                  NA                         P8M
#> 40                Neuroblastoma                  NA                        P19M
#>    index_disease_followup_state_id index_disease_followup_state_label
#> 36                     EFO:0030049           alive (follow-up status)
#> 37                     EFO:0030049           alive (follow-up status)
#> 38                     EFO:0030049           alive (follow-up status)
#> 39                     EFO:0030049           alive (follow-up status)
#> 40                     EFO:0030041            dead (follow-up status)
#>    auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 36                   NA                      NA                      NA
#> 37                   NA                      NA                      NA
#> 38                   NA                      NA                      NA
#> 39                   NA                      NA                      NA
#> 40                   NA                      NA                      NA
#>    individual_legacy_id
#> 36                   NA
#> 37                   NA
#> 38                   NA
#> 39                   NA
#> 40                   NA
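
The age_iso column holds ISO 8601 durations, and age_days holds the same age in days. The sketch below converts one into the other with a hypothetical helper (not part of pgxRpi). The conversion factors of 365.2425 days per year and 30.4167 days per month are assumptions inferred from the values printed above, not taken from the package source.

```r
# Hypothetical helper, not part of pgxRpi: convert ISO 8601 ages such as
# "P5Y7M" to days. Only the Y, M and D date components are handled.
iso_age_to_days <- function(x, days_per_year = 365.2425,
                            days_per_month = 30.4167) {
  component <- function(unit) {
    pattern <- paste0("^P.*?([0-9]+)", unit, ".*$")
    value <- rep(0, length(x))                 # absent components count as 0
    hit <- grepl(pattern, x, perl = TRUE)
    value[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
    value[is.na(x)] <- NA                      # keep missing ages missing
    value
  }
  component("Y") * days_per_year +
    component("M") * days_per_month +
    component("D")
}

ages <- c("P5Y7M", "P0Y5M", "P0Y4M", "P0Y6M")
iso_age_to_days(ages)
```

The same helper can turn a threshold such as “P65Y” into days when comparing against age_days.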

3.3 Search by biosample id and individual id

You can get the IDs from a sample query, as described above.

individual <- pgxLoader(type="individual",individual_id = "pgxind-kftx26ml", biosample_id="pgxbs-kftvh94d")

individual
#>     individual_id       sex_id     sex_label age_iso age_days
#> 1 pgxind-kftx3565 PATO:0020000 genotypic sex      NA       NA
#> 2 pgxind-kftx26ml  NCIT:C20197          male      NA       NA
#>   data_use_conditions_id data_use_conditions_label histological_diagnosis_id
#> 1                     NA                        NA                NCIT:C3697
#> 2                     NA                        NA                NCIT:C3493
#>   histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 1     Myxopapillary Ependymoma                  NA                        None
#> 2 Squamous Cell Lung Carcinoma                  NA                        None
#>   index_disease_followup_state_id index_disease_followup_state_label
#> 1                     EFO:0030039                 no followup status
#> 2                     EFO:0030039                 no followup status
#>   auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 1                   NA                      NA                      NA
#> 2                   NA                      NA                      NA
#>   individual_legacy_id
#> 1                   NA
#> 2                   NA

4 Visualization of survival data

4.1 pgxMetaplot function

This function generates a survival plot using metadata of individuals obtained by the pgxLoader function.

The parameters of this function:

  • data: The metadata of individuals returned by the pgxLoader function.
  • group_id: A string specifying which column is used for grouping in the Kaplan-Meier plot.
  • condition: Condition for splitting individuals into younger and older groups. Only used if group_id is age related.
  • return_data: A logical value determining whether to return the metadata used for plotting. Default is FALSE.
  • ...: Other parameters relevant to KM plot. These include pval, pval.coord, pval.method, conf.int, linetype, and palette (see ggsurvplot function from survminer package)

To investigate whether survival differs between younger and older patients with a particular disease, you can query and visualize the relevant information as follows:

# query metadata of individuals with lung adenocarcinoma
luad_inds <- pgxLoader(type="individual",filters="NCIT:C3512")
# use 65 years old as the splitting condition
pgxMetaplot(data=luad_inds, group_id="age_iso", condition="P65Y", pval=TRUE)

Note that not all individuals have survival data available. If you set return_data to TRUE, the function will return the metadata of the individuals used for the plot.
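
To make the grouping behind a Kaplan-Meier plot concrete, the following minimal sketch fits the curves directly with survfit from the survival package. It is not the implementation of pgxMetaplot: the data frame is a stand-in built from the five neuroblastoma rows shown earlier in this vignette, the 150-day age threshold is arbitrary, and the follow-up parsing assumes durations of the form “P<n>M”.

```r
library(survival)

# Stand-in for pgxLoader(type = "individual", ...) output; column names and
# values mirror rows 36 to 40 of the individuals table shown earlier.
inds <- data.frame(
  age_days = c(2039.1294, 152.0835, 121.6668, 182.5002, 121.6668),
  index_disease_followup_time = c("P20M", "P41M", "P18M", "P8M", "P19M"),
  index_disease_followup_state_label = c(rep("alive (follow-up status)", 4),
                                         "dead (follow-up status)"),
  stringsAsFactors = FALSE
)

# Follow-up time in months (assumes the "P<n>M" form only)
inds$months <- as.numeric(sub("^P([0-9]+)M$", "\\1",
                              inds$index_disease_followup_time))

# Event indicator: TRUE if the individual died during follow-up
inds$event <- grepl("^dead", inds$index_disease_followup_state_label)

# Split into two age groups at an arbitrary illustrative threshold
inds$group <- ifelse(inds$age_days >= 150, "older", "younger")

# Kaplan-Meier estimate per age group
fit <- survfit(Surv(months, event) ~ group, data = inds)
print(fit)
```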

5 Session Info

#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgxRpi_1.0.4     BiocStyle_2.32.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5        xfun_0.47           bslib_0.8.0        
#>  [4] ggplot2_3.5.1       rstatix_0.7.2       lattice_0.22-6     
#>  [7] vctrs_0.6.5         tools_4.4.1         generics_0.1.3     
#> [10] curl_5.2.2          tibble_3.2.1        fansi_1.0.6        
#> [13] highr_0.11          pkgconfig_2.0.3     Matrix_1.7-0       
#> [16] data.table_1.16.0   lifecycle_1.0.4     compiler_4.4.1     
#> [19] farver_2.1.2        munsell_0.5.1       tinytex_0.53       
#> [22] carData_3.0-5       htmltools_0.5.8.1   sass_0.4.9         
#> [25] yaml_2.3.10         pillar_1.9.0        car_3.1-2          
#> [28] ggpubr_0.6.0        jquerylib_0.1.4     tidyr_1.3.1        
#> [31] cachem_1.1.0        survminer_0.4.9     magick_2.8.4       
#> [34] abind_1.4-8         km.ci_0.5-6         tidyselect_1.2.1   
#> [37] digest_0.6.37       dplyr_1.1.4         purrr_1.0.2        
#> [40] bookdown_0.40       labeling_0.4.3      splines_4.4.1      
#> [43] fastmap_1.2.0       grid_4.4.1          colorspace_2.1-1   
#> [46] cli_3.6.3           magrittr_2.0.3      survival_3.7-0     
#> [49] utf8_1.2.4          broom_1.0.6         withr_3.0.1        
#> [52] scales_1.3.0        backports_1.5.0     lubridate_1.9.3    
#> [55] timechange_0.3.0    rmarkdown_2.28      httr_1.4.7         
#> [58] gridExtra_2.3       ggsignif_0.6.4      zoo_1.8-12         
#> [61] evaluate_1.0.0      knitr_1.48          KMsurv_0.1-5       
#> [64] survMisc_0.5.6      rlang_1.1.4         Rcpp_1.0.13        
#> [67] xtable_1.8-4        glue_1.7.0          BiocManager_1.30.25
#> [70] attempt_0.3.1       jsonlite_1.8.8      R6_2.5.1           
#> [73] plyr_1.8.9