--- title: "Using the depmap data" author: - name: Theo Killian affiliation: Computational Biology, UCLouvain - name: Laurent Gatto affiliation: Computational Biology, UCLouvain date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{depmap use cases} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` This vignette illustrates use cases and visualizations of the data found in the [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package. See the *depmap* vignette for details about the datasets. # Introduction The [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package aims to provide a reproducible research framework to cancer dependency data described by [Tsherniak, Aviad, et al. "Defining a cancer dependency map." Cell 170.3 (2017): 564-576.](https://www.ncbi.nlm.nih.gov/pubmed/28753430). The data found in the [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package has been formatted to facilitate the use of common R packages such as `dplyr` and `ggplot2`. We hope that this package will allow researchers to more easily mine, explore and visually illustrate dependency data taken from the Depmap cancer genomic dependency study. # Use cases Perhaps the most interesting datasets found within the [depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html) package are those that relate to the cancer gene dependency score, such as `rnai` and `crispr`. These datasets contain a score expressing how vital a particular gene is in terms of how lethal the knockout/knockdown of that gene is on a target cell line. For example, a highly negative dependency score implies that a cell line is highly dependent on that gene. Load necessary libaries. ```{r load_libraries, message=FALSE, warning=FALSE, echo=TRUE} library("dplyr") library("ggplot2") library("viridis") library("tibble") library("gridExtra") library("stringr") library("depmap") library("ExperimentHub") ``` Load the `rnai`, `crispr` and `copyNumber` datasets for visualization. ```{r load_data, message=FALSE, warning=FALSE, echo=TRUE} ## create ExperimentHub query object eh <- ExperimentHub() query(eh, "depmap") rnai <- eh[["EH2260"]] crispr <- eh[["EH2261"]] copyNumber <- eh[["EH2262"]] # note: the datasets listed above are from the 19Q1 release. Newer datasets, # such as 19Q2 and 19Q3 are available. ``` ## Find dependency score for "BRCA1" on "184A1_Breast" We will demonstrate how to obtain individual dependency scores corresponding to a specific gene and cell lineage. For example, shown below is the dependency of a breast cancer lineage, such as `184A1_BREAST` has on a human tumor suppressor gene, like `BRCA1` when it is knocked down via rnai. Shown below is the comparison for data found within the `rnai` dataset. This shows a score which is slightly positive, indicating that the knockdown of this gene is slightly beneficial to the vitality of this cancer cell lineage. However, it may be insightful to put this single dependency score in context. ```{r dep_score_BRCA1_184A1Breast, echo=TRUE} dep_score_BRCA1_184A1Breast <- rnai %>% select(cell_line, gene_name, dependency) %>% filter(cell_line == "184A1_BREAST", gene_name == "BRCA1") dep_score_BRCA1_184A1Breast ``` ## Average gene dependency for "BRCA1" Shown below is the average dependency score for `BRCA1` for all cancer cell lines in the `rnai` dataset. ```{r, BRCA1_Avg_Dep_Score, echo=TRUE} brca1_dep_score_avg_rnai <- rnai %>% select(gene_name, dependency) %>% filter(gene_name == "BRCA1") %>% summarise(mean_dependency_brca1 = mean(dependency, na.rm=TRUE)) brca1_dep_score_avg_rnai ``` ## Average gene dependency for all genes in the `rnai` dataset Or to see the average gene dependency across all genes in the entire `rnai` dataset. As one can see below, the average dependency for an average gene in the `rnai` dataset is slightly negative but close to zero. ```{r, all_gene_ds_avg_rnai, echo=TRUE} all_gene_dep_score_avg_rnai <- rnai %>% select(gene_name, dependency) %>% summarise(mean_dependency_all_genes_rnai = mean(dependency, na.rm=TRUE)) all_gene_dep_score_avg_rnai ``` ## Cell lines in the `rnai` dataset with "soft tissue" in the name If we are interested researching soft tissue sarcomas and wanted to find the cell lines withing the `rnai` dataset that had "soft tissue" in the CCLE name of cancer cell line, and sort by the highest dependency score. The results of such a search is shown below. Note: CCLE names are in ALL CAPS with an underscore. ```{r, soft_tissue_cell_lines, echo=TRUE} soft_tissue_dependency_rnai <- rnai %>% select(cell_line, gene_name, dependency) %>% filter(stringr::str_detect(cell_line, "SOFT_TISSUE")) %>% arrange(dependency) soft_tissue_dependency_rnai ``` ## Cell lines with dependency for a entrez_id of interest Sometimes it is difficult to find the subset with the exact gene name one wishes to find. In this case, it is better to search by `entrez_id`. For example, a [recent paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6251792/) describes gene knockdown of *NRF2* increases chemosensitivity in certain types of cancer. It might be interesting to see what interactions knockdown of this gene has on other cancer cell lines. However, searching by *filter(gene_name == "NRF2")* will not yield any results. We know from NCBI that the Entrez ID for this gene is "4780" and it is possible to search this dataset by that criteria. Here it can be shown that the gene name for *NRF2* in the `rnai` dataset is *NFE2L2*. ```{r cell_lines_with_entrez_id_NRF2, echo=TRUE} entrez_id_NRF2 <- rnai %>% select(entrez_id, cell_line, gene_name, dependency) %>% filter(entrez_id == "4780") entrez_id_NRF2 ``` ## Cell lines with dependency for "NFE2L2" Below the highest dependency scores via rnai knock down of a specific gene, *NFE2L2* will be obtained and the cancer cell lines associated with those values will be listed. It appears that the knockdown of this gene is strongly associated with cell death with in lung and kidney cancer cell lines. ```{r greatest_Dep_Score_NFE2L2, echo=TRUE} top_dep_score_NFE2L2_rnai <- rnai %>% select(cell_line, gene_name, dependency) %>% filter(gene_name == "NFE2L2") %>% arrange(dependency) top_dep_score_NFE2L2_rnai ``` ## Genes for cell line "NCIH2066_LUNG" If we would like to obtain the top 10 lowest dependency scores for a particular cell line (for example `NCIH2066_LUNG`) along with the genes associated with those values: ```{r top_dep_score_NCIH2066_LUNG_rnai, echo=TRUE} top_dep_score_NCIH2066_LUNG_rnai <- rnai %>% select(cell_line, gene_name, dependency) %>% filter(cell_line == "NCIH2066_LUNG") %>% arrange(dependency) top_dep_score_NCIH2066_LUNG_rnai ``` ## Most and least RNAi dependency genes Below shows the most significant genes that deplete cancer cell lines upon knockdown and their dependency scores for the entire `rnai` data. ```{r greatest_dep_score_gene_rnai, echo=TRUE} greatest_dep_score_gene_rnai <- rnai %>% select(cell_line, gene_name, dependency) %>% arrange(dependency) greatest_dep_score_gene_rnai ``` Below shows the least significant genes that induce cancer cell line vitality upon knockdown and their dependency scores for the entire `rnai` data. Unsurprisingly, we see high incidence of "TP53", a well known cancer driver. ```{r cell_line_gene_rnai_lowest_dep_score, echo=TRUE} lowest_dep_score_gene_rnai <- rnai %>% select(cell_line, gene_name, dependency) %>% arrange(desc(dependency)) lowest_dep_score_gene_rnai ``` ## Most and least CRISPR-Cas9 dependency genes Below we will apply some of the same selections as shown in the above examples on the `crispr` gene knockout dataset and observe the difference between that dataset and `rnai`. First we will look at the most significant dependency scores in the `crispr` dataset. As can be seen below, there is a different population of significant genes with the highest dependency score. ```{r cell_line_gene_crispr_greatest_dep_score, echo=TRUE} greatest_dep_score_gene_crispr <- crispr %>% select(cell_line, gene_name, dependency) %>% arrange(dependency) greatest_dep_score_gene_crispr ``` First we will look at the least significant (most cancer inducing) dependency scores in the `crispr` dataset. ```{r cell_line_gene_crispr_lowest_dep_score, echo=TRUE} lowest_dep_score_gene_crispr <- crispr %>% select(cell_line, gene_name, dependency) %>% arrange(desc(dependency)) lowest_dep_score_gene_crispr ``` ## Differences in RNAi and CRISPR-Cas9 dependency scores Here we will plot the difference in expression between the most signficant genes found in the `crispr` and `rnai` datasets. ```{r comparison_rnai_crispr_dep_scores, fig.height=6, fig.width=7, fig.align="center", echo=FALSE, message=FALSE} # sort `crispr` dep scores by most cancer inducing top_20_dep_scores_crispr <- crispr %>% select(gene_name, dependency) %>% arrange(desc(dependency)) %>% top_n(20) # sort `crispr` dep scores by least cancer inducing lowest_20_dep_scores_crispr <- crispr %>% select(gene_name, dependency) %>% arrange(dependency) %>% top_n(-20) # sort `rnai` dep scores by most cancer inducing top_20_dep_scores_rnai <- rnai %>% select(gene_name, dependency) %>% arrange(desc(dependency)) %>% top_n(20) # shorten gene name so it fits on plot top_20_dep_scores_rnai[12, 1] <- "GAS6-AS2" # sort `rnai` dep scores by least cancer inducing lowest_20_dep_scores_rnai <- rnai %>% select(gene_name, dependency) %>% arrange(dependency) %>% top_n(-20) # rnai highest dep scores p1 <- ggplot(lowest_20_dep_scores_rnai) + aes(gene_name, dependency, color = dependency) + geom_point() + scale_colour_viridis() + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("High DS genes rnai") # crispr highest dep scores p2 <- ggplot(lowest_20_dep_scores_crispr) + aes(gene_name, dependency, color = dependency) + geom_point() + scale_colour_viridis() + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("High DS genes crispr ") # rnai lowest dep scores p3 <- ggplot(top_20_dep_scores_rnai) + aes(gene_name, dependency, color = dependency) + geom_point() + scale_colour_viridis() + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Low DS genes rnai") # crispr lowest dep scores p4 <- ggplot(top_20_dep_scores_crispr) + aes(gene_name, dependency, color = dependency) + geom_point() + scale_colour_viridis() + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Low DS genes crispr") #plot as 1x2 grid grid.arrange(p1, p2, p3, p4, nrow=2, top = "Most Extreme Dep Scores for CRISPR and RNAI") ``` Compare the count of top 50 unique genes for `crispr` and `rnai` datasets for the most cancer-vitality inducing genes. ```{r count_comp_rnai_crispr_dep_scores, fig.height=6, fig.width=7, fig.align="center", echo=FALSE, message=FALSE} # get counts of top 50 genes in `crispr` that are most cancer-vitality inducing unique_lowest_dep_scores_gene_crispr <- crispr %>% select(gene_name, dependency) %>% arrange(desc(dependency)) %>% top_n(50) %>% count(gene_name) %>% arrange(desc(n)) # TBC1D3 appears to be an extremely common for `crispr` and dominates top 100 # most cancer-vitality-inducing genes. # get counts of top 50 genes in `rnai` that are most cancer-vitality inducing unique_lowest_dep_scores_gene_rnai <- rnai %>% select(gene_name, dependency) %>% arrange(desc(dependency)) %>% top_n(50) %>% count(gene_name) %>% arrange(desc(n)) # Whereas for `rnai` UBBP4 and ACTG1P4 are most common. # shorten gene name to fit on plot unique_lowest_dep_scores_gene_rnai[9, 1] <- "GAS6-AS2" # get counts of top 50 genes in `crispr` with greatest dependency unique_top_dep_scores_crispr <- crispr %>% select(gene_name, dependency) %>% arrange(dependency) %>% top_n(-50) %>% count(gene_name) %>% arrange(desc(n)) # Most common gene for top 100 most dependent genes in `crispr` is RAN # get counts of top 50 genes in `rnai` with greatest dependency unique_top_dep_scores_rnai <- rnai %>% select(gene_name, dependency) %>% arrange(dependency) %>% top_n(-50) %>% count(gene_name) %>% arrange(desc(n)) # Most common genes for top 100 most dependent genes in `rnai` are RPL7, EIF3B # and RPL14. # rnai highest dep scores p5 <- ggplot(unique_top_dep_scores_crispr, aes(x=gene_name, y=n)) + geom_bar(stat='identity', fill="steelblue2") + ylab("count") + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Highest DS genes crispr") p6 <- ggplot(unique_top_dep_scores_rnai, aes(x=gene_name, y=n)) + geom_bar(stat='identity', fill="steelblue2") + ylab("count") + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Highest DS genes rnai") p7 <- ggplot(unique_lowest_dep_scores_gene_crispr, aes(x=gene_name, y=n)) + geom_bar(stat='identity', fill="steelblue2") + ylab("count") + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Lowest DS genes crispr") p8 <- ggplot(unique_lowest_dep_scores_gene_rnai, aes(x=gene_name, y=n)) + geom_bar(stat='identity', fill="steelblue2") + ylab("count") + theme(axis.text.x = element_text(angle=90, hjust=1)) + ggtitle("Lowest DS genes rnai") ## plot as 1x2 grid grid.arrange(p5, p6, p7, p8, nrow=2, top = "Top 50 CRISPR and RNAI genes with High and Low Dep Score") ``` Mean log copy number (total dataset) and mean log copy number for each gene ```{r mean_log_copy_num_gene, fig.height=4, fig.width=6, fig.align="center", echo=FALSE, message=FALSE} # mean log copy number for all genes mean_log_copy_num_gene <- copyNumber %>% select(gene_name, log_copy_number) %>% summarise(mean_log_copy_number_all_genes = mean(log_copy_number, na.rm = TRUE)) # get average value val_mean_log_copy_num_gene <- as.numeric(mean_log_copy_num_gene[1,1]) # mean log copy number for each gene each_log_copy_num_gene <- copyNumber %>% select(gene_name, log_copy_number) %>% group_by(gene_name) %>% summarise(mean_log_copy_number = mean(log_copy_number, na.rm = TRUE)) # add an ID column all_log_copy_num_gene <- as.data.frame(na.omit(each_log_copy_num_gene$mean_log_copy_number)) %>% tibble::rowid_to_column(., "ID") # add col names colnames(all_log_copy_num_gene, do.NULL = FALSE) colnames(all_log_copy_num_gene) <- c("gene", "log_copy_number") # plot of mean copy number for every gene p9 <- ggplot(all_log_copy_num_gene) + aes(x=gene, y=log_copy_number, color = log_copy_number) + geom_point() + scale_colour_viridis() + geom_hline(yintercept = val_mean_log_copy_num_gene, linetype = "dashed", color = "red") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Mean log copy number change for every gene") p9 ``` ## Genes with greatest mean log copy number Find genes with greatest mean log copy number ```{r greatest_log_copy_num_genes, fig.height=4, fig.width=6, fig.align="center", echo=FALSE, message=FALSE} # mean log copy number for each gene greatest_log_copy_num_genes <- copyNumber %>% select(gene_name, log_copy_number) %>% group_by(gene_name) %>% summarise(mean_long_copy_num = mean(log_copy_number))%>% arrange(desc(mean_long_copy_num)) %>% top_n(20) p10 <- ggplot(greatest_log_copy_num_genes, aes(x=gene_name, y=mean_long_copy_num)) + geom_bar(stat='identity', fill="steelblue2") + ylab("log Copy Number") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Top 20 genes with largest log copy number") p10 ``` # Session information ```{r echo = FALSE} sessionInfo() ```