---
title: "Questions and answers from over the years"
author: "Sean Davis"
date: "`r format(Sys.Date(), '%A, %B %d, %Y')`"
always_allow_html: yes
output:
  BiocStyle::html_document:
    df_print: paged
    toc_float: true
    keep_md: true
abstract: >

vignette: >
  %\VignetteIndexEntry{Questions and answers from over the years}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

```{r}
library(GenomicDataCommons,quietly = TRUE)
```

I made a small change to the filtering expression approach based on 
changes to lazy evaluation best practices. There is now no need to 
include the `~` in the filter expression. So:

```{r}
q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')
```
And get a count of the results:

```{r}
count(q)
```

And the manifest.

```{r}
manifest(q)
```

Your question about race and ethnicity is a good one. 

```{r}
all_fields = available_fields(files())
```

And we can grep for `race` or `ethnic` to get potential matching fields
to look at.

```{r}
grep('race|ethnic',all_fields,value=TRUE)
```

Now, we can check available values for each field to determine how to complete
our filter expressions.

```{r}
available_values('files',"cases.demographic.ethnicity")
available_values('files',"cases.demographic.race")
```

We can complete our filter expression now to limit to `white` race only.

```{r}
q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)
manifest(q_white_only)
```

# How can I get the number of cases with RNA-Seq data added by date to TCGA project with `GenomicDataCommons`?

- From https://support.bioconductor.org/p/9135791/

I would like to get the number of cases added (created, any logical datetime would suffice here) to the TCGA project by experiment type. I attempted to get this data via GenomicDataCommons package, but it is giving me I believe the number of files for a given experiment type rather than number cases. How can I get the number of cases for which there is RNA-Seq data?

```{r}
library(tibble)
library(dplyr)
library(GenomicDataCommons)

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% 
  as_tibble() %>%
  dplyr::arrange(dplyr::desc(key))
```