--- title: "Overview of the tidybulk package" author: "Stefano Mangiola" date: "`r Sys.Date()`" package: tidybulk output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Overview of the tidybulk package} %\usepackage[UTF-8]{inputenc} --- **Brings transcriptomics to the tidyverse!** # [![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) ## Functions/utilities available Function | Description ------------ | ------------- `aggregate_duplicates` | Aggregate robustly abundance and annotation of duplicated transcripts `scale_abundance` | Scale (normalise) abundance for RNA requencing depth `reduce_dimensions` | Perform dimensionality reduction (PCA, MDS, tSNE) `cluster_elements` | Labels elements with cluster identity (kmeans, SNN) `remove_redundancy` | Filter out elements with highly correlated features `adjust_abundance` | Remove known unwanted variation (Combat) `test_differential_abundance` | Differenatial abundance testing (DE) `deconvolve_cellularity` | Estimated tissue composition (Cibersort) `keep_variable` | Filter top variable features `keep_abundant` | Filter out lowly abundant transcripts `test_gene_enrichment` | Gene enrichment analyses (EGSEA) `test_gene_overrepresentation` | Gene enrichment on list of transcript names (no rank) `impute_abundance` | Impute abundance for missing data points using sample groupings Utilities | Description ------------ | ------------- `tidybulk` | add tidybulk attributes to a tibble object `tidybulk_SAM_BAM` | Convert SAM BAM files into tidybulk tibble `pivot_sample` | Select sample-wise columns/information `pivot_transcript` | Select transcript-wise columns/information `rotate_dimensions` | Rotate two dimensions of a degree `ensembl_to_symbol` | Add gene symbol from ensembl IDs `symbol_to_entrez` | Add entrez ID from gene symbol ## Minimal input data frame sample | transcript | abundance | annotation ------------ | ------------- | ------------- | ------------- `chr` or `fctr` | `chr` or `fctr` | `integer` | ... ## Output data frame sample | transcript | abundance | annotation | new information ------------ | ------------- | ------------- | ------------- | ------------- `chr` or `fctr` | `chr` or `fctr` | `integer` | ... | ... ```{r, echo=FALSE, include=FALSE} library(knitr) #library(kableExtra) knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE) #options(width = 120) options(pillar.min_title_chars = Inf) library(tibble) library(dplyr) library(magrittr) library(tidyr) library(ggplot2) # library(widyr) library(rlang) library(purrr) library(tidybulk) my_theme = theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(size = 0.2), panel.grid.minor = element_line(size = 0.1), text = element_text(size=12), legend.position="bottom", aspect.ratio=1, strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)) ) # counts_mini = # tidybulk::counts %>% # filter(transcript %in% (tidybulk::X_cibersort %>% rownames)) %>% # filter(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058", "SRR1740043", "SRR1740067")) %>% # mutate(condition = ifelse(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058"), TRUE, FALSE)) # se_mini se_mini = tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts_mini, sample, transcript, count) se_breast_tcga_mini = tidybulk:::tidybulk_to_SummarizedExperiment( tidybulk::breast_tcga_mini, sample, ens, `count`) se.cibersort = tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts, sample , transcript, count) se.norm.batch = tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts, sample , transcript, count) %>% scale_abundance() ``` ## Create `tidybulk` tibble. It memorises key column names ```{r} tt = counts_mini %>% tidybulk(sample, transcript, count) ``` **All tidybulk methods are directly compatible with SummarizedExperiment as well.** ## Aggregate `transcripts` tidybulk provide the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble with aggregate transcript with the same name. All the rest of the column are appended, and factors and boolean are appended as characters. ```{r aggregate, cache=TRUE} tt.aggr = tt %>% aggregate_duplicates( aggregation_function = sum ) tt.aggr ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r aggregate se, cache=TRUE} se.aggr = se_mini %>% aggregate_duplicates( aggregation_function = sum ) se.aggr ``` ## Scale `counts` We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). `scale_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method as arguments and returns a tibble with additional columns with scaled data as `_scaled`. ```{r normalise, cache=TRUE} tt.norm = tt.aggr %>% scale_abundance(method="TMM") tt.norm %>% select(`count`, count_scaled, lowly_abundant, everything()) ``` We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type. ```{r plot_normalise, cache=TRUE} tt.norm %>% ggplot(aes(count_scaled + 1, group=sample, color=`Cell type`)) + geom_density() + scale_x_log10() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r normalise se, cache=TRUE} se.norm = se.aggr %>% scale_abundance(method="TMM") se.norm ``` ## Filter `variable transcripts` We may want to identify and filter variable transcripts. ```{r filter variable, cache=TRUE} tt.norm.variable = tt.norm %>% keep_variable() ``` ## Reduce `dimensions` We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions. **MDS** (Robinson et al., 10.1093/bioinformatics/btp616) ```{r mds, cache=TRUE} tt.norm.MDS = tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3) tt.norm.MDS %>% select(sample, contains("Dim"), `Cell type`, time ) %>% distinct() ``` On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type. ```{r plot_mds, cache=TRUE, eval=FALSE} tt.norm.MDS %>% select(contains("Dim"), sample, `Cell type`) %>% distinct() %>% GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`)) ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r mds se, cache=TRUE} se.norm.MDS = se.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3) se.norm.MDS ``` **PCA** ```{r pca, cache=TRUE} tt.norm.PCA = tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" , .dims = 3) tt.norm.PCA %>% select(sample, contains("PC"), `Cell type`, time ) %>% distinct() ``` On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type. ```{r plot_pca, cache=TRUE, eval=FALSE} tt.norm.PCA %>% select(contains("PC"), sample, `Cell type`) %>% distinct() %>% GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`)) ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r pca se, cache=TRUE} se.norm.PCA = se.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" , .dims = 3) se.norm.PCA ``` **tSNE** ```{r, echo=FALSE, include=FALSE} tt_tcga_breast = tidybulk::breast_tcga_mini %>% tidybulk(sample, ens, `count`) ``` ```{r tsne, cache=TRUE} tt.norm.tSNE = tt_tcga_breast %>% reduce_dimensions( .abundance = count_scaled, method = "tSNE", top = 500, perplexity=10, pca_scale =TRUE ) tt.norm.tSNE %>% select(contains("tSNE", ignore.case = FALSE), sample, Call) %>% distinct() tt.norm.tSNE %>% pivot_sample() %>% ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r tsne se, cache=TRUE} se.norm.tSNE = se_breast_tcga_mini %>% reduce_dimensions( .abundance = count_scaled, method = "tSNE", top = 500, perplexity=10, pca_scale =TRUE ) se.norm.tSNE ``` ## Rotate `dimensions` We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as ` rotated ` by default, or as specified in the input arguments. ```{r rotate, cache=TRUE} tt.norm.MDS.rotated = tt.norm.MDS %>% rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample) ``` **Original** On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type. ```{r plot_rotate_1, cache=TRUE} tt.norm.MDS.rotated %>% pivot_sample() %>% ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type` )) + geom_point() + my_theme ``` **Rotated** On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type. ```{r plot_rotate_2, cache=TRUE} tt.norm.MDS.rotated %>% pivot_sample() %>% ggplot(aes(x=`Dim1 rotated 45`, y=`Dim2 rotated 45`, color=`Cell type` )) + geom_point() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r rotate se, cache=TRUE} se.norm.MDS %>% rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample) ``` ## Test `differential abundance` We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). `test_differential_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate). ```{r de, cache=TRUE} tt %>% test_differential_abundance( ~ condition, action="only") ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r de se, cache=TRUE} se_mini %>% test_differential_abundance( ~ condition) ``` ## Adjust `counts` We may want to adjust `counts` for (known) unwanted variation. `adjust_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as `_adjusted`. At the moment just an unwanted covariated is allowed at a time. ```{r, echo=FALSE, include=FALSE} tt.norm.batch = tt.norm %>% # Add fake batch and factor of interest left_join( (.) %>% distinct(sample) %>% mutate(batch = c(0,1,0,1,1)) ) %>% mutate(factor_of_interest = `Cell type` == "b_cell") ``` ```{r adjust, cache=TRUE} tt.norm.adj = tt.norm.batch %>% adjust_abundance( ~ factor_of_interest + batch, .abundance = count_scaled, action = "only" ) tt.norm.adj ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r adjust se, cache=TRUE} se.norm.batch %>% adjust_abundance( ~ factor_of_interest + batch, .abundance = count_scaled ) ``` ## Deconvolve `Cell type composition` We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). `deconvolve_cellularity` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the adjusted cell type proportions. **columns truncated** ```{r cibersort, cache=TRUE} tt.cibersort = tt %>% deconvolve_cellularity(action="get", cores=2) tt.cibersort %>% select(sample, contains("cibersort:")) ``` With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting). ```{r plot_cibersort, cache=TRUE} tt.cibersort %>% gather(`Cell type inferred`, `proportion`, 5:26) %>% distinct(sample, `Cell type`, `Cell type inferred`, proportion) %>% ggplot(aes(x=`Cell type inferred`, y=proportion, fill=`Cell type`)) + geom_boxplot() + facet_wrap(~`Cell type`) + my_theme + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5) ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r cibersort se, cache=TRUE} se.cibersort %>% deconvolve_cellularity(cores=2) ``` ## Cluster `samples` We may want to cluster our data (e.g., using k-means sample-wise). `cluster_elements` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods. **k-means** ```{r cluster, cache=TRUE} tt.norm.cluster = tt.norm %>% cluster_elements(.abundance = count_scaled, method="kmeans", centers = 2 ) tt.norm.cluster ``` We can add cluster annotation to the MDS dimesion reduced data set and plot. ```{r plot_cluster, cache=TRUE} tt.norm.MDS %>% cluster_elements( .abundance = count_scaled, method="kmeans", centers = 2, action="get" ) %>% ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster kmeans`)) + geom_point() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r cluster se, cache=TRUE} se.norm %>% cluster_elements(.abundance = count_scaled, method="kmeans", centers = 2 ) ``` **SNN** ```{r SNN, cache=TRUE} tt.norm.SNN = tt.norm.tSNE %>% cluster_elements(.abundance= count_scaled, method = "SNN") tt.norm.SNN %>% pivot_sample() tt.norm.SNN %>% select(contains("tSNE", ignore.case = FALSE), `cluster SNN`, sample, Call) %>% gather(source, Call, c("cluster SNN", "Call")) %>% distinct() %>% ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + facet_grid(~source) + my_theme # Do differential transcription between clusters tt.norm.SNN %>% mutate(factor_of_interest = `cluster SNN` == 3) %>% test_differential_abundance( ~ factor_of_interest, action="only" ) ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r SNN se, cache=TRUE} se.norm.tSNE %>% cluster_elements(.abundance= count_scaled, method = "SNN") ``` ## Drop `redundant` We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. `remove_redundancy` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble dropped recundant elements (e.g., samples). Two redundancy estimation approaches are supported: + removal of highly correlated clusters of elements (keeping a representative) with method="correlation" + removal of most proximal element pairs in a reduced dimensional space. **Approach 1** ```{r drop, cache=TRUE} tt.norm.non_redundant = tt.norm.MDS %>% remove_redundancy( method = "correlation" ) ``` We can visualise how the reduced redundancy with the reduced dimentions look like ```{r plot_drop, cache=TRUE} tt.norm.non_redundant %>% pivot_sample() %>% ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) + geom_point() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r drop se, cache=TRUE} se.norm.MDS %>% remove_redundancy( method = "correlation" ) ``` **Approach 2** ```{r drop2, cache=TRUE} tt.norm.non_redundant = tt.norm.MDS %>% remove_redundancy( method = "reduced_dimensions", .element = sample, .feature = transcript, Dim_a_column = `Dim1`, Dim_b_column = `Dim2` ) ``` We can visualise MDS reduced dimensions of the samples with the closest pair removed. ```{r plot_drop2, cache=TRUE} tt.norm.non_redundant %>% pivot_sample() %>% ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) + geom_point() + my_theme ``` All functions are also directly compatible with `SummarizedExperiment`. ```{r drop2 se, cache=TRUE} se.norm.MDS %>% remove_redundancy( method = "reduced_dimensions", .element = sample, .feature = transcript, Dim_a_column = `Dim1`, Dim_b_column = `Dim2` ) ``` ## Other useful wrappers The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above. ## From BAM/SAM to tibble of gene counts We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts). ```{r eval=FALSE} counts = tidybulk_SAM_BAM( file_names, genome = "hg38", isPairedEnd = TRUE, requireBothEndsMapped = TRUE, checkFragLength = FALSE, useMetaFeatures = TRUE ) ``` ## From ensembl IDs to gene symbol IDs We can add gene symbols from ensembl identifiers. This is useful since different resources use ensembl IDs while others use gene symbol IDs. ```{r ensembl, cache=TRUE} counts_ensembl %>% ensembl_to_symbol(ens) ``` ## ADD versus GET versus ONLY modes Every function takes a tidytranscriptomics structured data as input, and (i) with action="add" outputs the new information joint to the original input data frame (default), (ii) with action="get" the new information with the sample or transcript relative informatin depending on what the analysis is about, or (iii) with action="only" just the new information. For example, from this data set ```{r, cache=TRUE} tt.norm ``` **action="add"** (Default) We can add the MDS dimensions to the original data set ```{r, cache=TRUE} tt.norm %>% reduce_dimensions( .abundance = count_scaled, method="MDS" , .element = sample, .feature = transcript, .dims = 3, action="add" ) ``` **action="get"** We can add the MDS dimensions to the original data set selecting just the sample-wise column ```{r, cache=TRUE} tt.norm %>% reduce_dimensions( .abundance = count_scaled, method="MDS" , .element = sample, .feature = transcript, .dims = 3, action="get" ) ``` **action="only"** We can get just the MDS dimensions relative to each sample ```{r, cache=TRUE} tt.norm %>% reduce_dimensions( .abundance = count_scaled, method="MDS" , .element = sample, .feature = transcript, .dims = 3, action="only" ) ``` ## Appendix ```{r} sessionInfo() ```