---
title: "Questions and answers from over the years"
author: "Sean Davis"
date: "`r format(Sys.Date(), '%A, %B %d, %Y')`"
always_allow_html: yes
output:
  BiocStyle::html_document:
    df_print: paged
    toc_float: true
    keep_md: true
abstract: >

vignette: >
  %\VignetteIndexEntry{Questions and answers from over the years}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

```{r}
library(GenomicDataCommons, quietly = TRUE)
```

I made a small change to the filtering expression approach based on
changes to lazy evaluation best practices. There is now no need to
include the `~` in the filter expression. So:

```{r}
q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')
```

And get a count of the results:

```{r}
count(q)
```

And the manifest.

```{r}
manifest(q)
```

Your question about race and ethnicity is a good one.

```{r}
all_fields = available_fields(files())
```

And we can grep for `race` or `ethnic` to get potential matching fields
to look at.

```{r}
grep('race|ethnic', all_fields, value = TRUE)
```

Now, we can check available values for each field to determine how to
complete our filter expressions.

```{r}
available_values('files', "cases.demographic.ethnicity")
available_values('files', "cases.demographic.race")
```

We can complete our filter expression now to limit to `white` race only.

```{r}
q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race == 'white')
count(q_white_only)
manifest(q_white_only)
```

# How can I get the number of cases with RNA-Seq data added by date to TCGA project with `GenomicDataCommons`?

- From https://support.bioconductor.org/p/9135791/

I would like to get the number of cases added (created, any logical
datetime would suffice here) to the TCGA project by experiment type.
I attempted to get this data via the GenomicDataCommons package, but I
believe it is giving me the number of files for a given experiment type
rather than the number of cases. How can I get the number of cases for
which there is RNA-Seq data?

```{r}
library(tibble)
library(dplyr)
library(GenomicDataCommons)

# Query the cases endpoint (not files) so that counts refer to cases,
# then facet on the file creation datetime and sort newest first.
cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
    files.experimental_strategy == 'RNA-Seq') %>%
  facet(c("files.created_datetime")) %>%
  aggregations() %>%
  .[[1]] %>%
  as_tibble() %>%
  dplyr::arrange(dplyr::desc(key))
```
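
The facet table above has one row per distinct timestamp, which can be more detail than you want. Here is a minimal sketch of rolling it up to calendar dates with base R. It assumes the table has a `key` column holding ISO 8601 timestamps and a `doc_count` column; the `agg` data frame is made-up stand-in data, not real GDC output. A case with files created on several dates may be counted under each of them, so the sums are not necessarily unique cases.

```r
# Hypothetical stand-in for the facet result; in practice this would be the
# table returned by aggregations() in the chunk above.
agg <- data.frame(
  key = c("2018-05-21T16:07:40.645885-05:00",
          "2018-05-21T16:09:12.001234-05:00",
          "2017-03-04T10:15:00.000000-06:00"),
  doc_count = c(3L, 2L, 7L),
  stringsAsFactors = FALSE
)

# The first 10 characters of an ISO 8601 timestamp are the calendar date.
agg$created_date <- as.Date(substr(agg$key, 1, 10))

# Sum the counts for each calendar date, then sort newest first.
by_date <- aggregate(doc_count ~ created_date, data = agg, FUN = sum)
by_date <- by_date[order(by_date$created_date, decreasing = TRUE), ]
by_date
```

Taking the date from the string prefix keeps the date as recorded in the timestamp and sidesteps time zone conversion; parse the full timestamp instead if you need everything in one time zone.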