---
title: "GenomicSuperSignature - Quickstart"
author: "Sehyun Oh"
date: "`r format(Sys.time(), '%B %d, %Y')`"
abstract: |
  This vignette demonstrates basic usage of GenomicSuperSignature. More
  extensive and biology-relevant use cases are available
  [**HERE**](https://shbrief.github.io/GenomicSuperSignaturePaper/).
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Quickstart}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    comment = "#>", collapse = TRUE, message = FALSE, warning = FALSE,
    fig.align='center'
)
```

# Setup

## Install and load package

```{r eval = FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
BiocManager::install("bcellViper")
```

```{r results="hide", message=FALSE, warning=FALSE}
library(GenomicSuperSignature)
library(bcellViper)
```

## Download RAVmodel

You can download a GenomicSuperSignature model (RAVmodel) from the Google Cloud bucket using the `GenomicSuperSignature::getModel` function. The currently available models are built from the top 20 PCs of 536 studies (44,890 samples in total). They cover the 13,934 genes common to the top 90% most variable genes of each of the 536 studies, ranked by study-level standard deviation. There are two versions of this RAVmodel, annotated with different gene sets for GSEA: MSigDB C2 (`C2`) and three priors from the PLIER package (`PLIERpriors`).

The demo in this vignette is based on human B-cell expression data, so we use the `PLIERpriors` model, which is annotated with blood-associated gene sets.

Note that on the first interactive run of this code, you will be asked to allow R to create a cache directory. The model file is stored there, and subsequent calls to `getModel` read from the cache.
```{r load_model}
RAVmodel <- getModel("PLIERpriors", load=TRUE)
RAVmodel
```

## Example dataset

The human B-cell dataset (Gene Expression Omnibus series GSE2350) consists of 211 normal and tumor human B-cell phenotypes. This dataset was generated on Affymetrix HG-U95Av2 arrays and is stored in an ExpressionSet object with 6,249 features x 211 samples.

```{r message=FALSE, warning=FALSE}
data(bcellViper)
dset
```

You can provide your own expression dataset in any of these formats: simple matrix, ExpressionSet, or SummarizedExperiment. Just make sure that the genes are in 'symbol' format.

# Which RAV best represents the dataset?

The `validate` function calculates a validation score, which quantifies the relevance of a RAV to a new dataset. The RAVs that give these validation scores are called _*validated RAVs*_. The validation results can be displayed in different ways for more intuitive interpretation.

```{r}
val_all <- validate(dset, RAVmodel)
head(val_all)
```

## HeatmapTable

`heatmapTable` takes validation results as its input and displays them in a two-panel table: the top panel shows the average silhouette width (avg.sw) and the bottom panel displays the validation score.

`heatmapTable` can display different subsets of the validation output. For example, if you specify `scoreCutoff`, any validation result above that score will be shown. If you specify the number (n) of top validation results through `num.out`, the output will be an n-column heatmap table. You can also filter by the average silhouette width (`swCutoff`), the cluster size (`clsizeCutoff`), or one of the top 8 PCs of the dataset (`whichPC`).

Here, we print out the top 5 validated RAVs with an average silhouette width above 0.
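To make these filtering options concrete, the sketch below mimics `swCutoff = 0` combined with `num.out = 5` in base R on a toy table. It is an illustration only, not the code that `heatmapTable` runs: the values are invented, the helper `top_validated` is hypothetical, and the column names `score` and `sw` are assumed to match the validation output.

```r
# Toy validation table; all values are invented for illustration.
toy_val <- data.frame(
  score = c(0.59, 0.12, 0.48, 0.33, 0.71, 0.27, 0.40, 0.05),
  sw    = c(0.06, 0.21, -0.03, 0.10, 0.15, 0.02, -0.11, 0.30),
  row.names = paste0("RAV", c(23, 104, 311, 684, 1552, 2538, 4002, 999))
)

# Hypothetical helper: keep rows whose average silhouette width is above
# the cutoff, then return the n rows with the highest validation score.
top_validated <- function(val, n = 5, sw_cutoff = 0) {
  kept <- val[val$sw > sw_cutoff, , drop = FALSE]
  kept <- kept[order(kept$score, decreasing = TRUE), , drop = FALSE]
  head(kept, n)
}

top5 <- top_validated(toy_val, n = 5, sw_cutoff = 0)
print(top5)
```

In this toy table, six rows pass the silhouette-width filter and the lowest-scoring of them is dropped by the top-5 limit. The call below applies the package's own selection to `val_all`.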
```{r out.height="80%", out.width="80%", message=FALSE, warning=FALSE}
heatmapTable(val_all, num.out = 5, swCutoff = 0)
```

## Interactive Graph

Under the default condition, `plotValidate` plots the validation results of all non-single-element RAVs in one graph, where the x-axis represents the average silhouette width of the RAVs (a quality control measure of RAVs) and the y-axis represents the validation score. We recommend that users focus on RAVs with a higher validation score and use the average silhouette width as a secondary criterion.

```{r plotValidate_function, out.height="75%", out.width="75%"}
plotValidate(val_all, interactive = FALSE)
```

Note that `interactive = TRUE` will result in a zoomable, interactive plot that includes tooltips. You can hover over each data point for more information:

- **sw** : the average silhouette width of the cluster
- **score** : the top validation score between the 8 PCs of the dataset and the RAVs
- **cl_size** : the size of the RAV, represented by the dot size
- **cl_num** : the RAV number. You need this index to find more information about the RAV.
- **PC** : the test dataset's PC number that validates the given RAV. Because we used the top 8 PCs of the test dataset for validation, there are 8 categories.

If you double-click the PC legend on the right, you will enter an individual display mode where you can add an additional group of data points with a single click.

# What kinds of information can you access through RAV?

GenomicSuperSignature connects different public databases and prior information through RAVmodel, creating the knowledge graph illustrated below. Through RAVs, you can access and explore the knowledge graph from multiple entry points such as gene expression profiles, publications, study metadata, keywords in MeSH terms, and gene sets.
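The RAV number is the handle for these lookups. As a minimal sketch, assuming the validation output is a data frame with `score`, `PC`, and `cl_num` columns (the fields listed for the interactive plot), the best-validated RAV and its index could be read off as follows. The toy values are invented and stand in for real validation results.

```r
# Toy stand-in for a validation table; all values are invented.
toy_val <- data.frame(
  score  = c(0.59, 0.71, 0.33),
  PC     = c(2, 1, 5),
  cl_num = c(23, 1552, 684)
)

# Row with the highest validation score.
best <- toy_val[which.max(toy_val$score), ]

rav_index <- best$cl_num  # RAV number to use when looking up the RAV
rav_pc    <- best$PC      # dataset PC that validated this RAV

cat("RAV", rav_index, "validated by PC", rav_pc, "\n")
```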