• 1 How could I generate a manifest file with filtering of Race and Ethnicity?
• 2 How can I get the number of cases with RNA-Seq data added by date to TCGA project with GenomicDataCommons?

1 How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

library(GenomicDataCommons,quietly = TRUE)

I made a small change to how filter expressions are written, following changes in lazy evaluation best practices: the leading ~ is no longer needed in the filter expression. So:

q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)
## [1] 521

And the manifest.

manifest(q)
## Rows: 521 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## First 10 of 521 rows (id column):
##    id
##    <chr>
##  1 9d8f2b87-a61d-44db-b3e8-39a3d66e6e16
##  2 efc3d63e-21af-4cdd-8315-cf04fee407b3
##  3 25b9e5c8-514c-48d9-be40-a5644b7b99f9
##  4 2d8a4267-969c-4a3a-bd89-c3ce0b9a6cf2
##  5 bd582f37-941a-4136-8b91-3ea8ed81dcfe
##  6 ff6d6688-c19c-4a7f-8058-4d1bc0249d83
##  7 af862835-d908-4ee8-8753-78a612d419be
##  8 958a5e6e-92b2-41a7-8ed4-3c464869120c
##  9 1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
## 10 b0b7a16b-a600-46d0-98ec-cdd707456446
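
Since the question asks for a manifest file, it may help to see how the data frame returned by manifest() could be saved as a tab-delimited file, which is the form the GDC Data Transfer Tool reads via `gdc-client download -m`. The following is a minimal sketch under stated assumptions: `manifest_df` is a stand-in for the real manifest(q) result, the two ids are from the output above, and the filename, md5, size and state values are placeholders rather than real GDC data.

```r
# Stand-in for the data frame returned by manifest(q). The column names follow
# the output above; filename, md5, size and state are PLACEHOLDERS.
manifest_df <- data.frame(
  id = c("9d8f2b87-a61d-44db-b3e8-39a3d66e6e16",
         "efc3d63e-21af-4cdd-8315-cf04fee407b3"),
  filename = c("placeholder_1.bam", "placeholder_2.bam"),
  md5 = c("placeholder_md5_1", "placeholder_md5_2"),
  size = c(1, 2),
  state = c("released", "released"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no quoting and no row names.
manifest_path <- file.path(tempdir(), "gdc_manifest.txt")
write.table(manifest_df, manifest_path,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the file has the expected shape.
round_trip <- read.delim(manifest_path, stringsAsFactors = FALSE)
print(names(round_trip))
```

With a real query, `manifest_df` would simply be replaced by the result of manifest(q).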

Your question about race and ethnicity is a good one. To find the relevant fields, we can start by listing every field available for files.

all_fields = available_fields(files())

And we can grep for race or ethnic to get potential matching fields to look at.

grep('race|ethnic',all_fields,value=TRUE)
## [1] "cases.demographic.ethnicity"                 
## [2] "cases.demographic.race"                      
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use" 
## [5] "cases.follow_ups.scan_tracer_used"
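
The last three matches are substring hits: "contraceptive" and "tracer" both contain "race". As a small sketch (the stricter pattern is an illustration and not part of the original answer), the search can be narrowed by requiring race or ethnicity to be the final component of the dotted field name. The field names are copied from the output above so the example runs without contacting the GDC.

```r
# Field names copied from the grep() output above.
candidate_fields <- c(
  "cases.demographic.ethnicity",
  "cases.demographic.race",
  "cases.follow_ups.hormonal_contraceptive_type",
  "cases.follow_ups.hormonal_contraceptive_use",
  "cases.follow_ups.scan_tracer_used"
)

# Require 'race' or 'ethnicity' to be the whole last component of the field
# name, so substrings such as contRACEptive and tRACEr no longer match.
strict_matches <- grep("(^|\\.)(race|ethnicity)$", candidate_fields, value = TRUE)
print(strict_matches)
```

The looser pattern is still useful for exploration; the stricter one only saves a manual pass over the results.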

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")
## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "not allowed to collect" "_missing"
available_values('files',"cases.demographic.race")
##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "other"                                    
##  [7] "not allowed to collect"                   
##  [8] "american indian or alaska native"         
##  [9] "native hawaiian or other pacific islander"
## [10] "_missing"
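
Note that all of these values are lowercase. A filter value that does not exactly match one of them is likely to return zero results rather than raise an error, so it can be worth checking a value before using it. The sketch below assumes a hypothetical helper, `validate_value()`, which is not part of GenomicDataCommons; the allowed values are copied from the output above.

```r
# Race values copied from the available_values() output above.
race_values <- c(
  "white", "not reported", "black or african american", "asian", "unknown",
  "other", "not allowed to collect", "american indian or alaska native",
  "native hawaiian or other pacific islander", "_missing"
)

# Hypothetical helper (not a GenomicDataCommons function): stop early if the
# requested value is not one of the allowed values, otherwise return it.
validate_value <- function(value, allowed) {
  if (!(value %in% allowed)) {
    stop(sprintf("'%s' is not an available value; choose one of: %s",
                 value, paste(allowed, collapse = ", ")))
  }
  value
}

race_choice <- validate_value("white", race_values)
print(race_choice)
```

Calling `validate_value("White", race_values)` would stop with an error, because the comparison is case-sensitive.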

We can now complete our filter expression to limit the results to cases whose race is recorded as white.

q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)
## [1] 249
manifest(q_white_only)
## Rows: 249 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## First 10 of 249 rows (id column):
##    id
##    <chr>
##  1 efc3d63e-21af-4cdd-8315-cf04fee407b3
##  2 bd582f37-941a-4136-8b91-3ea8ed81dcfe
##  3 ff6d6688-c19c-4a7f-8058-4d1bc0249d83
##  4 1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
##  5 6702c4d7-a218-4dc3-810a-004f5a166c2a
##  6 3c6c8465-a13d-4dbe-bf55-92dd423f9f8f
##  7 c1c3ed06-d423-46bf-8b43-77f7817c59bd
##  8 4de88051-0d80-419d-ae7d-89dcce5f6baa
##  9 fe0ce41e-3f45-47e1-bd31-56fde0668b0e
## 10 ea4d5db7-d421-4b92-a9fb-5ed74565e85c
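
The same approach should extend to filtering on race and ethnicity together. The sketch below is an assumption-laden illustration rather than part of the original answer: it writes all conditions in a single filter expression, uses the ethnicity value "not hispanic or latino" from the available_values() output above, and needs network access to the GDC API. The name `q_white_non_hispanic` is made up for this example, and no result count is shown because it was not run here.

```r
library(GenomicDataCommons)

# One filter expression combining the file-level conditions used earlier with
# both demographic fields. Field names and values come from the outputs above.
q_white_non_hispanic <- GenomicDataCommons::filter(
  files(),
  cases.project.project_id == 'TCGA-COAD' &
    data_type == 'Aligned Reads' &
    experimental_strategy == 'RNA-Seq' &
    data_format == 'BAM' &
    cases.demographic.race == 'white' &
    cases.demographic.ethnicity == 'not hispanic or latino'
)

# Number of matching files; the value depends on the current GDC data release.
n_files <- GenomicDataCommons::count(q_white_non_hispanic)
print(n_files)
```

Calling manifest(q_white_non_hispanic) would then return the manifest for this narrower set of files.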

2 How can I get the number of cases with RNA-Seq data added by date to TCGA project with GenomicDataCommons?

I would like to get the number of cases added to the TCGA project by experiment type and by date (the creation date or any other sensible datetime would do). I tried to get this with the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?

library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicDataCommons':
## 
##     count, filter, select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GenomicDataCommons)

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% 
  as_tibble() %>%
  dplyr::arrange(dplyr::desc(key))
## doc_count key
##     <int> <chr>
##       362 2021-04-05t12:48:23.926301-05:00
##       438 2021-04-05t08:30:00.775501-05:00
##       373 2021-04-05t08:29:15.674486-05:00
##       427 2021-04-05t08:20:25.746896-05:00
##       472 2021-04-05t08:19:17.399147-05:00
##       358 2021-04-05t08:16:31.043565-05:00
##       875 2021-04-05t08:14:54.002129-05:00
##       380 2018-11-08t15:58:37.938089-06:00
##       535 2018-10-24t15:05:03.191583-05:00
##       500 2018-10-24t15:05:00.562958-05:00
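
The keys are full timestamps, so several rows fall on the same calendar day. To get counts by date, as the question asks, the rows can be collapsed by the date part of the key. The sketch below uses only the rows shown above, copied into a data frame so that it runs offline. One caveat, stated as an assumption: a case with files created at more than one timestamp may appear in more than one row, so the daily sums are best read as upper bounds on the number of distinct cases.

```r
# Facet results copied from the rows shown above.
agg <- data.frame(
  doc_count = c(362L, 438L, 373L, 427L, 472L, 358L, 875L, 380L, 535L, 500L),
  key = c(
    "2021-04-05t12:48:23.926301-05:00",
    "2021-04-05t08:30:00.775501-05:00",
    "2021-04-05t08:29:15.674486-05:00",
    "2021-04-05t08:20:25.746896-05:00",
    "2021-04-05t08:19:17.399147-05:00",
    "2021-04-05t08:16:31.043565-05:00",
    "2021-04-05t08:14:54.002129-05:00",
    "2018-11-08t15:58:37.938089-06:00",
    "2018-10-24t15:05:03.191583-05:00",
    "2018-10-24t15:05:00.562958-05:00"
  ),
  stringsAsFactors = FALSE
)

# The first 10 characters of each key are the calendar date (YYYY-MM-DD).
agg$date <- as.Date(substr(agg$key, 1, 10))

# Sum the per-timestamp counts within each date, newest date first.
by_date <- aggregate(doc_count ~ date, data = agg, FUN = sum)
by_date <- by_date[order(by_date$date, decreasing = TRUE), ]
print(by_date)
```

For these rows the sums are 3305 for 2021-04-05, 380 for 2018-11-08 and 1035 for 2018-10-24.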