1 How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

library(GenomicDataCommons,quietly = TRUE)

I made a small change to how filter expressions are written, following changes in lazy-evaluation best practices: the leading `~` is no longer needed in the filter expression. So:

q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)

## [1] 521

And the manifest.

manifest(q)

## Rows: 521 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

	id <chr>
1	9d8f2b87-a61d-44db-b3e8-39a3d66e6e16
2	efc3d63e-21af-4cdd-8315-cf04fee407b3
3	25b9e5c8-514c-48d9-be40-a5644b7b99f9
4	2d8a4267-969c-4a3a-bd89-c3ce0b9a6cf2
5	bd582f37-941a-4136-8b91-3ea8ed81dcfe
6	ff6d6688-c19c-4a7f-8058-4d1bc0249d83
7	af862835-d908-4ee8-8753-78a612d419be
8	958a5e6e-92b2-41a7-8ed4-3c464869120c
9	1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
10	b0b7a16b-a600-46d0-98ec-cdd707456446

Your question about race and ethnicity is a good one. We can start by listing all the fields available for the files endpoint.

all_fields = available_fields(files())

And we can grep for race or ethnic to get potential matching fields to look at.

grep('race|ethnic',all_fields,value=TRUE)

## [1] "cases.demographic.ethnicity"                 
## [2] "cases.demographic.race"                      
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use" 
## [5] "cases.follow_ups.scan_tracer_used"
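The last three matches above are picked up only because "race" occurs inside "contraceptive" and "tracer". A minimal sketch of a tighter pattern, assuming we only want fields whose final dot-separated component is exactly `race` or `ethnicity`. The vector `candidate_fields` is an illustrative name and simply repeats the five field names from the output above, so the block runs without querying the GDC.

```r
# The five field names returned by the broader grep above
candidate_fields <- c(
  "cases.demographic.ethnicity",
  "cases.demographic.race",
  "cases.follow_ups.hormonal_contraceptive_type",
  "cases.follow_ups.hormonal_contraceptive_use",
  "cases.follow_ups.scan_tracer_used"
)

# Require a literal dot before the name and the end of the string after it,
# so substrings such as "tracer" no longer match
demographic_fields <- grep("\\.(race|ethnicity)$", candidate_fields, value = TRUE)
demographic_fields
```

The same pattern could be applied to `all_fields` directly; the broader pattern is still useful as a first pass when the exact field names are not yet known.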

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")

## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "not allowed to collect" "_missing"

available_values('files',"cases.demographic.race")

##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "other"                                    
##  [7] "not allowed to collect"                   
##  [8] "american indian or alaska native"         
##  [9] "native hawaiian or other pacific islander"
## [10] "_missing"

We can now complete our filter expression, here limiting the results to cases whose race is recorded as "white".

q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)

## [1] 249
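Since the original question asked about both race and ethnicity, the same approach can be extended with a second condition. The following is a sketch rather than a tested recipe: it assumes network access to the GDC, picks "not hispanic or latino" purely as an illustration from the values listed above, and introduces the name `q_white_not_hispanic`, which is not part of the original answer. No count is shown because it depends on the current GDC data release.

```r
library(GenomicDataCommons)

# Base query, repeated from above so this block stands alone
q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

# Add both demographic conditions on top of the base query.
# The ethnicity value is an illustrative choice, not from the original answer.
q_white_not_hispanic = q %>%
  GenomicDataCommons::filter(
    cases.demographic.race == 'white' &
      cases.demographic.ethnicity == 'not hispanic or latino')

# Namespaced because dplyr also exports count()
GenomicDataCommons::count(q_white_not_hispanic)
```

Calling `manifest()` on this query should work the same way as for the race-only query.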

manifest(q_white_only)

## Rows: 249 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

	id <chr>
1	efc3d63e-21af-4cdd-8315-cf04fee407b3
2	bd582f37-941a-4136-8b91-3ea8ed81dcfe
3	ff6d6688-c19c-4a7f-8058-4d1bc0249d83
4	1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
5	6702c4d7-a218-4dc3-810a-004f5a166c2a
6	3c6c8465-a13d-4dbe-bf55-92dd423f9f8f
7	c1c3ed06-d423-46bf-8b43-77f7817c59bd
8	4de88051-0d80-419d-ae7d-89dcce5f6baa
9	fe0ce41e-3f45-47e1-bd31-56fde0668b0e
10	ea4d5db7-d421-4b92-a9fb-5ed74565e85c
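To end up with an actual manifest file, the data frame returned by `manifest()` still has to be written to disk. A minimal sketch using base R, assuming a tab-delimited text file is wanted (the delimiter reported in the output above). `write_manifest` is a hypothetical helper, not a GenomicDataCommons function, and the one-row data frame is a stand-in so the block runs offline: it has the five columns reported above, but every value except the `id` is a placeholder.

```r
# Hypothetical helper (not part of GenomicDataCommons): write a manifest
# data frame as a tab-delimited text file with a header row.
write_manifest <- function(manifest_df, path) {
  write.table(manifest_df, file = path, sep = "\t",
              quote = FALSE, row.names = FALSE)
  invisible(path)
}

# Stand-in for the result of manifest(); only the id comes from the output above,
# all other values are placeholders
example_manifest <- data.frame(
  id = "efc3d63e-21af-4cdd-8315-cf04fee407b3",
  filename = "placeholder.bam",
  md5 = "00000000000000000000000000000000",
  state = "placeholder",
  size = 0,
  stringsAsFactors = FALSE
)

out_path <- write_manifest(example_manifest, tempfile(fileext = ".txt"))
manifest_lines <- readLines(out_path)
manifest_lines
```

With a live query, the call would look like `write_manifest(manifest(q_white_only), "gdc_manifest.txt")`, where the file name is again only an example.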

2 How can I get the number of cases with RNA-Seq data added by date to TCGA project with `GenomicDataCommons`?

From https://support.bioconductor.org/p/9135791/

I would like to get the number of cases added to the TCGA project, by experiment type and by date (the created date, or any other logical datetime, would suffice). I tried to get this with the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?

library(tibble)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:GenomicDataCommons':
## 
##     count, filter, select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(GenomicDataCommons)

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% 
  as_tibble() %>%
  dplyr::arrange(dplyr::desc(key))

doc_count <int>	key <chr>
362	2021-04-05t12:48:23.926301-05:00
438	2021-04-05t08:30:00.775501-05:00
373	2021-04-05t08:29:15.674486-05:00
427	2021-04-05t08:20:25.746896-05:00
472	2021-04-05t08:19:17.399147-05:00
358	2021-04-05t08:16:31.043565-05:00
875	2021-04-05t08:14:54.002129-05:00
380	2018-11-08t15:58:37.938089-06:00
535	2018-10-24t15:05:03.191583-05:00
500	2018-10-24t15:05:00.562958-05:00
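
The keys above are full timestamps, so several rows fall on the same calendar day. A minimal sketch, assuming a per-day view is wanted: it copies the ten rows shown above into a tibble (so it runs without querying the GDC), truncates each key to its date, and sums `doc_count` per day. The names `res` and `by_date` are illustrative. The sum only regroups the counts as returned; it does not settle whether `doc_count` refers to cases or files, and a case with files created at more than one timestamp could be counted more than once.

```r
library(tibble)
library(dplyr)

# The ten rows shown above, copied verbatim
res <- tribble(
  ~doc_count, ~key,
  362L, "2021-04-05t12:48:23.926301-05:00",
  438L, "2021-04-05t08:30:00.775501-05:00",
  373L, "2021-04-05t08:29:15.674486-05:00",
  427L, "2021-04-05t08:20:25.746896-05:00",
  472L, "2021-04-05t08:19:17.399147-05:00",
  358L, "2021-04-05t08:16:31.043565-05:00",
  875L, "2021-04-05t08:14:54.002129-05:00",
  380L, "2018-11-08t15:58:37.938089-06:00",
  535L, "2018-10-24t15:05:03.191583-05:00",
  500L, "2018-10-24t15:05:00.562958-05:00"
)

# The first ten characters of each key are the calendar date (YYYY-MM-DD)
by_date <- res %>%
  dplyr::mutate(created_date = as.Date(substr(key, 1, 10))) %>%
  dplyr::group_by(created_date) %>%
  dplyr::summarise(doc_count = sum(doc_count), .groups = "drop") %>%
  dplyr::arrange(dplyr::desc(created_date))

by_date
```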

Questions and answers from over the years

Tuesday, October 26, 2021
