GenomicDataCommons
From https://support.bioconductor.org/p/9138939/.
library(GenomicDataCommons,quietly = TRUE)
I made a small change to the filtering expression approach based on
changes to lazy evaluation best practices. There is now no need to
include the ~ in the filter expression. So:
q = files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == 'TCGA-COAD' &
        data_type == 'Aligned Reads' &
        experimental_strategy == 'RNA-Seq' &
        data_format == 'BAM')
And get a count of the results:
count(q)
## [1] 521
And retrieve the manifest:
manifest(q)
## Rows: 521 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
row | id <chr>
---|---
1 | 9d8f2b87-a61d-44db-b3e8-39a3d66e6e16
2 | efc3d63e-21af-4cdd-8315-cf04fee407b3
3 | 25b9e5c8-514c-48d9-be40-a5644b7b99f9
4 | 2d8a4267-969c-4a3a-bd89-c3ce0b9a6cf2
5 | bd582f37-941a-4136-8b91-3ea8ed81dcfe
6 | ff6d6688-c19c-4a7f-8058-4d1bc0249d83
7 | af862835-d908-4ee8-8753-78a612d419be
8 | 958a5e6e-92b2-41a7-8ed4-3c464869120c
9 | 1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
10 | b0b7a16b-a600-46d0-98ec-cdd707456446
Your question about race and ethnicity is a good one. We can start by listing all the fields available for files:
all_fields = available_fields(files())
And we can grep for race or ethnic to get potential matching fields to look at.
grep('race|ethnic',all_fields,value=TRUE)
## [1] "cases.demographic.ethnicity"
## [2] "cases.demographic.race"
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use"
## [5] "cases.follow_ups.scan_tracer_used"
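Note that the last three hits are incidental: the unanchored pattern also matches "race" inside "contraceptive" and "tracer". A minimal sketch of a tighter search follows. It assumes the candidate fields are the five returned above, typed in by hand (the vector name candidate_fields is just for illustration) so that the example runs without querying the GDC, and it requires a word boundary in front of the search terms:

```r
# Field names copied from the grep() output above, so no network access is needed.
candidate_fields <- c(
  "cases.demographic.ethnicity",
  "cases.demographic.race",
  "cases.follow_ups.hormonal_contraceptive_type",
  "cases.follow_ups.hormonal_contraceptive_use",
  "cases.follow_ups.scan_tracer_used"
)

# \\b requires a word boundary before "race" or "ethnic". Letters, digits and
# "_" are word characters, so "contraceptive" and "tracer" no longer match,
# while the "." separator in the field path still counts as a boundary.
demographic_hits <- grep("\\b(race|ethnic)", candidate_fields,
                         value = TRUE, perl = TRUE)
print(demographic_hits)
```

The same pattern could be applied to the full all_fields vector; whether it surfaces other relevant fields there is not something this sketch checks.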
Now, we can check available values for each field to determine how to complete our filter expressions.
available_values('files',"cases.demographic.ethnicity")
## [1] "not hispanic or latino" "not reported" "hispanic or latino"
## [4] "unknown" "not allowed to collect" "_missing"
available_values('files',"cases.demographic.race")
## [1] "white"
## [2] "not reported"
## [3] "black or african american"
## [4] "asian"
## [5] "unknown"
## [6] "other"
## [7] "not allowed to collect"
## [8] "american indian or alaska native"
## [9] "native hawaiian or other pacific islander"
## [10] "_missing"
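We can now extend our filter expression to keep only files from cases whose race is recorded as white. Because q already carries the TCGA-COAD filter, this narrows the earlier result.
q_white_only = q %>%
    GenomicDataCommons::filter(cases.demographic.race == 'white')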
count(q_white_only)
## [1] 249
manifest(q_white_only)
## Rows: 249 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): id, filename, md5, state
## dbl (1): size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
row | id <chr>
---|---
1 | efc3d63e-21af-4cdd-8315-cf04fee407b3
2 | bd582f37-941a-4136-8b91-3ea8ed81dcfe
3 | ff6d6688-c19c-4a7f-8058-4d1bc0249d83
4 | 1ec4cf4e-ef14-4554-b3a7-a28ab51a9610
5 | 6702c4d7-a218-4dc3-810a-004f5a166c2a
6 | 3c6c8465-a13d-4dbe-bf55-92dd423f9f8f
7 | c1c3ed06-d423-46bf-8b43-77f7817c59bd
8 | 4de88051-0d80-419d-ae7d-89dcce5f6baa
9 | fe0ce41e-3f45-47e1-bd31-56fde0668b0e
10 | ea4d5db7-d421-4b92-a9fb-5ed74565e85c
GenomicDataCommons
I would like to get the number of cases added (created; any logical datetime would suffice here) to the TCGA project by experiment type. I attempted to get this data via the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?
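The distinction the question draws, files versus cases, can be illustrated on a toy table. This is only a sketch on made-up records: the data frame file_records and its IDs are hypothetical and do not come from the GDC. It shows why a per-file count can exceed a per-case count when one case has several files of the same experiment type.

```r
# Hypothetical file records: one row per file, each tied to a case.
file_records <- data.frame(
  file_id = c("f1", "f2", "f3", "f4", "f5"),
  case_id = c("c1", "c1", "c2", "c3", "c3"),
  experimental_strategy = c("RNA-Seq", "RNA-Seq", "RNA-Seq", "WXS", "RNA-Seq"),
  stringsAsFactors = FALSE
)

# Files per experiment type: every row counts.
files_per_strategy <- table(file_records$experimental_strategy)

# Cases per experiment type: each case counts once per type,
# no matter how many files it contributes.
cases_per_strategy <- tapply(
  file_records$case_id,
  file_records$experimental_strategy,
  function(ids) length(unique(ids))
)

print(files_per_strategy)
print(cases_per_strategy)
```

Here case c1 has two RNA-Seq files, so the RNA-Seq file count is one higher than the RNA-Seq case count.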
library(tibble)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicDataCommons':
##
## count, filter, select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GenomicDataCommons)
cases() %>%
    GenomicDataCommons::filter(~ project.program.name=='TCGA' &
        files.experimental_strategy=='RNA-Seq') %>%
    facet(c("files.created_datetime")) %>%
    aggregations() %>%
    .[[1]] %>%
    as_tibble() %>%
    dplyr::arrange(dplyr::desc(key))
doc_count <int> | key <chr>
---|---
362 | 2021-04-05t12:48:23.926301-05:00
438 | 2021-04-05t08:30:00.775501-05:00
373 | 2021-04-05t08:29:15.674486-05:00
427 | 2021-04-05t08:20:25.746896-05:00
472 | 2021-04-05t08:19:17.399147-05:00
358 | 2021-04-05t08:16:31.043565-05:00
875 | 2021-04-05t08:14:54.002129-05:00
380 | 2018-11-08t15:58:37.938089-06:00
535 | 2018-10-24t15:05:03.191583-05:00
500 | 2018-10-24t15:05:00.562958-05:00
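The keys are full timestamps, so the same day shows up in several rows. If a per-day summary is more useful, the buckets can be collapsed by calendar date. The sketch below is offline and uses only the ten rows shown above, copied by hand into a data frame named agg (an illustrative name), so it does not cover the full aggregation. It also assumes that summing doc_count across buckets is acceptable; a case with files created at more than one timestamp would be counted once per bucket.

```r
# doc_count and key copied from the aggregation table above.
agg <- data.frame(
  doc_count = c(362L, 438L, 373L, 427L, 472L, 358L, 875L, 380L, 535L, 500L),
  key = c(
    "2021-04-05t12:48:23.926301-05:00",
    "2021-04-05t08:30:00.775501-05:00",
    "2021-04-05t08:29:15.674486-05:00",
    "2021-04-05t08:20:25.746896-05:00",
    "2021-04-05t08:19:17.399147-05:00",
    "2021-04-05t08:16:31.043565-05:00",
    "2021-04-05t08:14:54.002129-05:00",
    "2018-11-08t15:58:37.938089-06:00",
    "2018-10-24t15:05:03.191583-05:00",
    "2018-10-24t15:05:00.562958-05:00"
  ),
  stringsAsFactors = FALSE
)

# The first 10 characters of each key are the calendar date (YYYY-MM-DD),
# expressed in the key's own UTC offset rather than converted to UTC.
agg$created_date <- as.Date(substr(agg$key, 1, 10))

# Sum doc_count within each date, then list the most recent date first.
by_date <- aggregate(doc_count ~ created_date, data = agg, FUN = sum)
by_date <- by_date[order(by_date$created_date, decreasing = TRUE), ]
print(by_date)
```

In a live session the same two steps could be applied to the tibble returned by the pipeline above instead of the hand-copied agg.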