--- title: "Lab 1.2: Introduction to tidy _R_" output: BiocStyle::html_document: toc: true vignette: > % \VignetteIndexEntry{Lab 1.2: Introduction to tidy R} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r setup, echo=FALSE} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) ``` Original Authors: Martin Morgan
Presenting Authors: [Martin Morgan][]
Date: 22 July, 2019
Back: [Monday labs](lab-1-intro-to-r-bioc.html) [Martin Morgan]: mailto: Martin.Morgan@RoswellPark.org [Lori Shepherd]: mailto: Lori.Shepherd@RoswellPark.org **Objective**: Gain confidence working with 'tidy' R commands and data structures. **Lessons learned**: - Basic data input using [readr][]. - Working with tibbles using common [dplyr][] 'verbs'. - Using `gather()` to 'tidy' data. - Introduction to writing our own tidy functions. - Visualization using base R and [ggplot2][]. [readr]: https://cran.r-project.org/package=readr [dplyr]: https://cran.r-project.org/package=dplyr [ggplot2]: https://cran.r-project.org/package=ggplot2 [broom]: https://cran.r-project.org/package=broom # Introduction to the tidyverse The 'tidyverse' is a a recent approach to working with data in _R_. The basic principle is that data is often best represented in 'long-form' data.frames, with data transformations implemented with a few central functions. Start by loading some important packages in the tidyverse, _readr_ for data input, _tibble_ for data representation, _dplyr_ for manipulation, and _ggplot2_ for visualization. ```{r, message = FALSE} library(readr) library(tibble) library(dplyr) library(ggplot2) ``` Read the BRFSS data into a `tibble` (data.frame) using `read_csv()`; this function is similar to `read.csv()` but standardized the argument naming convention. ```{r echo=TRUE, eval=FALSE} fname <- file.choose() ## BRFSS-subset.csv stopifnot(file.exists(fname)) brfss <- read_csv(fname) ``` Note that by default character values were _not_ interpreted as factors, that the display is more informative (the class of each column is indicated under the title), and only a 'preview' of all the data is displayed. ```{r} brfss ``` A `tibble()` can be manipulated like a data.frame, but usually operations in the tidyverse are 'piped' using the `%>%` symbol from one representation to another. We start with some clean-up using the `mutate()` function to update or add individual columns. Start with an interactive exploration... ```{r} brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) ) ``` ...and when the pipeline looks good re-assign the updated tibble. ```{r} brfss <- brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) ) ``` Common operations are to `filter()` rows to those containing only certain values, and to `select()` a subset of columns. ```{r} brfss %>% filter(Sex == "Female", Year == "1990") %>% select(Age, Weight, Height) ``` Another common operation is to `group_by()` one or more columns, and `summarize()` data in other columns, e.g., ```{r} brfss %>% group_by(Sex, Year) %>% summarize( AveAge = mean(Age, na.rm=TRUE), AveWeight = mean(Weight, na.rm=TRUE) ) ``` Note that the output of each pipe is itself a tibble, so that the output can be further transformed using tidyverse functions. The main features of 'tidy' data are 1. Standard representation as a long-format data.frame-like `tibble()` 2. Restricted vocabulary of core functions -- `mutate()`, `filter()`, `select()`, `group_by()`, `summarize()`, ... These functions are 'isomorphic', meaning that the return value is the same type (a tibble!) as the first argument of the function. 3. Short 'pipes' summarizing data transformation steps. # Case studies ## Revisiting the ALL phenotypic data ### Data input and basic manipulation Revisit case study 2.1, ALL Phenotype Data in Lab 1.1: Introduction to _R_ using a tidy approach to data exploration. 1. Input the data "ALLphenoData.tsv" using `read_tsv()` (note: we use `_tsv()` because the columns are _t_ab _s_eparated). Compare column types, display, etc with base _R_ functions. Note that columns are never `factor()` by default. 2. Use `filter()` to create a subset of the data consisting only of female individuals over 40. Compare this approach with that used in base _R_. Likewise, create a subsect with only "BCR/ABL" and "NEG" in the `mol.biol` column. 3. Use `mutate()` to further transform the bcrabl subset by recoding the `BT` column to be either B or T, based on the first letter of the column. 4. Use `group_by()` and `summarize()` to deterimine the number of individuals in each combination of `BT` and `mol.biol`. Can you perform this same computation using only `count()`? ### Using un-tidy functions: `t.test()` We'd like to compare the average age of males and females in the study using `t.test()`. In base _R_ , we can write ```{r} pdata <- read_tsv("ALLphenoData.tsv") t.test(age ~ sex, pdata) ``` `t.test()` takes a formula `age ~ sex` (`age` as a function of `sex`) describing the comparison that we'd like to make. It also takes an argument `data=` that contains the data we'd like to perform the t-test on. Unlike functions we've encountered so far where the data to be processed is the first argument, `t.test()` expects the data as the second argument. To adapt `t.test()` for use, we need to explicitly indicate that the data should be the second argument. One way of doing this is to use the special symbol `.` to represent the location of the incoming data, invoking `t.test(age ~ sex, data = .)`: ```{r, eval = FALSE} pdata %>% t.test(age ~ sex, data = .) ``` **Exercise** Perform a t-test to ask whether there is evidence of differences in ages between the sexes. How can we change the default value of `var.equal` to `TRUE`? Is this appropriate? **Exercise** Write a function that makes it easier to use `t.test()` in a 'tidy' work flow. Do this by arranging it so that the first argument of your function is the data set, the second argument the formula, and then allow for any number of additional arguments. Pass these to `t.test()`, along the lines of ```{r} t_test <- function(data, formula, ...) { t.test(formula, data, ...) } ``` Verify that you can use your `t_test()` function in a straight-forward way ```{r} pdata %>% t_test(age ~ sex) ``` **Exercise (advanced)** The return value of `t.test()` doesn't fit well with `tidy` data analysis, because it is a complicated object that is not represented as a `tibble` and hence cannot be computed upon using the common tidy verbs. Update `t_test()` so that it is more tidy-friendly, accepting `data = ` as it's first argument, using `t.test()` internally to compute results, and returning a `tibble` containing results formatted for subsequent computation. One way to accomplish the last task is through use of [broom][]`::tidy()` function to transform many base _R_ objects into tidy-friendly data structures. ## Tidying the 'airway' data We'll encounter the 'airway' data set extensively later in the course. Here we read in the description of 8 samples used in a RNAseq analysis, usnig `select()` to choose specific columns to work with. ```{r, eval = FALSE} pdata_file <- file.choose() # airway-sample-sheet.csv count_file <- file.choose() # airway-read-counts.csv ``` ```{r, echo = FALSE} pdata_file <- "airway-sample-sheet.csv" ``` ```{r} pdata <- read_csv(pdata_file) pdata <- pdata %>% select(Run, cell, dex) pdata ``` Now read in the 'airway-read_counts.csv' file. Each row is a sample, each column (other than the first, the gene identifier) is a gene, and each entry the number of RNA-seq reads overlapping each gene and sample -- a measure of gene expression. ```{r} count_file <- "airway-read-counts.csv" counts <- read_csv(count_file) eg <- counts[, 1:6] # make it easy to work with eg ``` An advanced tidy concept is to _join_ to data sets, as described on the help page `?left_join`. Use `left_join()` to combine the phenotypic and expression data. ```{r} data <- left_join(pdata, eg) data ``` This is how a statistician might organize their data -- in a 'wide' format, with each line representing a sample and each column a variable measured on that sample. Tidy data is usually transformed into 'long' format, where 'Count' is interpreted as a measurement, with 'Gene' and 'Run' as indicator variables describing which experimental entity the count is associated with. Use `gather()` to transform the wide format data into long-format 'tidy' data. `gather()` is a tricky function to work with, so study the help page carefully. ```{r, message=FALSE} library(tidyr) ``` ```{r} tbl <- gather(data, "Gene", "Count", -(1:3)) tbl ``` Use `group_by()` and `summarize()` to calculate the library size, i.e., the total number of reads mapped per run. Likewise use `group_by()` and `summarize()` to descrbe average and log-transformed counts. ```{r} tbl %>% group_by(Run) %>% summarize(lib_size = sum(Count)) tbl %>% group_by(Gene) %>% summarize( ave_count = mean(Count), ave_log_count = mean(log(1 + Count)) ) ``` Now tidy all the data. ```{r} counts_tbl <- gather(counts, "Gene", "Count", -Run) ``` and join to the `pdata` ```{r} data_tbl <- left_join(pdata, counts_tbl) data_tbl ``` Summarize library size (what are the maximum and minimum library sizes?) ```{r} data_tbl %>% group_by(Run) %>% summarize(lib_size = sum(Count)) ``` and average 'expression' of each gene. ```{r} gene_summaries <- data_tbl %>% group_by(Gene) %>% summarize( ave_count = mean(Count), ave_log_count = mean(log(1 + Count)) ) gene_summaries ``` And visualize using `ggplot2` ```{r, message=FALSE} library(ggplot2) ``` ```{r} ggplot(gene_summaries, aes(ave_log_count)) + geom_density() ``` # End matter ## Session Info ```{r} sessionInfo() ``` ## Acknowledgements Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 633974)