---
title: "Discover and download datasets and files from the cellxgene data portal"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center
  email: Martin.Morgan@RoswellPark.org
package: cellxgenedp
output:
  BiocStyle::html_document
abstract: |
  The cellxgene data portal (https://cellxgene.cziscience.com/) provides
  a graphical user interface to collections of single-cell sequence data
  processed in standard ways to 'count matrix' summaries. The
  cellxgenedp package provides an alternative, R-based interface,
  allowing flexible data discovery, viewing, and downloading.
vignette: |
  %\VignetteIndexEntry{Discover and download datasets and files from the cellxgene data portal}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

NOTE: The interface to CELLxGENE has changed; versions of
[cellxgenedp][] prior to 1.4.1 / 1.5.2 will cease to work when
CELLxGENE removes the previous interface. See the vignette section
'API changes' for additional details.

# Installation and use

This package is available in _Bioconductor_ version 3.15 and later. The
following code installs [cellxgenedp][] as well as other packages
required for this vignette.

[cellxgenedp]: https://bioconductor.org/packages/cellxgenedp

```{r, eval = FALSE}
pkgs <- c("cellxgenedp", "zellkonverter", "SingleCellExperiment", "HDF5Array")
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)
```

Use the following `pkgs` vector to install from GitHub (latest,
unchecked, development version) instead.

```{r, eval = FALSE}
pkgs <- c(
    "mtmorgan/cellxgenedp", "zellkonverter", "SingleCellExperiment", "HDF5Array"
)
```

Load the package into your current _R_ session. We make extensive use
of the dplyr package, and at the end of the vignette use
SingleCellExperiment and zellkonverter, so load those as well.
```{r}
suppressPackageStartupMessages({
    library(zellkonverter)
    library(SingleCellExperiment) # load early to avoid masking dplyr::count()
    library(dplyr)
    library(cellxgenedp)
})
```

# `cxg()` provides a 'shiny' interface

The following sections outline how to use the [cellxgenedp][] package
in an _R_ script; most functionality is also available in the `cxg()`
shiny application, providing an easy way to identify, download, and
visualize one or several datasets. Start the app

```{r, eval = FALSE}
cxg()
```

choose a project on the first tab, and a dataset for visualization, or
one or more datasets for download!

# Collections, datasets and files

Retrieve metadata about resources available at the cellxgene data
portal using `db()`:

```{r}
db <- db()
```

Printing the `db` object provides a brief overview of the available
data, as well as hints, in the form of functions like `collections()`,
for further exploration.

```{r}
db
```

The portal organizes data hierarchically, with 'collections'
(research studies, approximately), 'datasets', and 'files'. Discover
data using the corresponding functions.

```{r}
collections(db)

datasets(db)

files(db)
```

Each of these resources has a unique primary identifier (e.g.,
`file_id`) as well as an identifier describing the relationship of the
resource to other components of the database (e.g.,
`dataset_id`). These identifiers can be used to 'join' information
across tables.

## Using `dplyr` to navigate data

A collection may have several datasets, and datasets may have several
files. For instance, here is the collection with the most datasets:

```{r}
collection_with_most_datasets <-
    datasets(db) |>
    count(collection_id, sort = TRUE) |>
    slice(1)
```

We can find out about this collection by joining with the
`collections()` table.
```{r}
left_join(
    collection_with_most_datasets |> select(collection_id),
    collections(db),
    by = "collection_id"
) |> glimpse()
```

We can take a similar strategy to identify all datasets belonging to
this collection:

```{r}
left_join(
    collection_with_most_datasets |> select(collection_id),
    datasets(db),
    by = "collection_id"
)
```

## `facets()` provides information on 'levels' present in specific columns

Notice that some columns are 'lists' rather than atomic vectors like
'character' or 'integer'.

```{r}
datasets(db) |>
    select(where(is.list))
```

This indicates that at least some of the datasets had more than one
type of `assay`, `cell_type`, etc. The `facets()` function provides a
convenient way of discovering possible levels of each column, e.g.,
`assay`, `organism`, `self_reported_ethnicity`, or `sex`, and the
number of datasets with each label.

```{r facets}
facets(db, "assay")
facets(db, "self_reported_ethnicity")
facets(db, "sex")
```

## Filtering faceted columns

Suppose we were interested in finding datasets from the 10x 3' v3
assay (`ontology_term_id` of `EFO:0009922`) containing individuals of
African American ethnicity, and female sex. Use the `facets_filter()`
utility function to filter datasets as needed.

```{r african_american_female}
african_american_female <-
    datasets(db) |>
    filter(
        facets_filter(assay, "ontology_term_id", "EFO:0009922"),
        facets_filter(self_reported_ethnicity, "label", "African American"),
        facets_filter(sex, "label", "female")
    )
```

Use `nrow(african_american_female)` to find the number of datasets
satisfying our criteria. It looks like there are up to

```{r}
african_american_female |>
    summarise(total_cell_count = sum(cell_count))
```

cells sequenced (each dataset may contain cells from several
ethnicities, as well as males or individuals of unknown sex, so we
do not know the actual number of cells available without downloading
files).
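To see why the total is only an upper bound, here is a minimal sketch
on toy data using base _R_ only. The `toy` data frame, the layout of
its `sex` list-column, and the `has_facet()` helper are hypothetical
illustrations; they are not part of [cellxgenedp][], and `has_facet()`
is not meant to describe how `facets_filter()` is implemented.

```r
## Toy stand-in for a faceted list-column: one element per dataset,
## each element a list of (label, ontology_term_id) records. This
## layout is an assumption made for illustration.
toy <- data.frame(
    dataset_id = c("ds1", "ds2", "ds3"),
    cell_count = c(100L, 250L, 40L)
)
toy$sex <- list(
    list(list(label = "female", ontology_term_id = "PATO:0000383")),
    list(
        list(label = "female", ontology_term_id = "PATO:0000383"),
        list(label = "male", ontology_term_id = "PATO:0000384")
    ),
    list(list(label = "male", ontology_term_id = "PATO:0000384"))
)

## Hypothetical helper: TRUE for each dataset where at least one
## record has `key` equal to `value`
has_facet <- function(facet, key, value) {
    vapply(facet, function(records) {
        any(vapply(
            records,
            function(record) identical(record[[key]], value),
            logical(1)
        ))
    }, logical(1))
}

keep <- has_facet(toy$sex, "label", "female")
as.character(toy$dataset_id[keep]) # "ds1" "ds2"
sum(toy$cell_count[keep])          # 350
```

Dataset `ds2` is retained because it contains female donors, but its
250 cells also include cells from male donors, so 350 overstates the
number of cells from female donors.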
Use `left_join` to identify the corresponding collections:

```{r}
## collections
left_join(
    african_american_female |> select(collection_id) |> distinct(),
    collections(db),
    by = "collection_id"
)
```

## Publication and other external data

Many collections include publication information and other external
data. This information is available in the return value of
`collections()`, but the helper functions `publisher_metadata()`,
`authors()`, and `links()` may facilitate access.

Suppose one is interested in the publication "A single-cell atlas of
the healthy breast tissues reveals clinically relevant clusters of
breast epithelial cells". Discover it in the collections

```{r}
title_of_interest <- paste(
    "A single-cell atlas of the healthy breast tissues reveals clinically",
    "relevant clusters of breast epithelial cells"
)
collection_of_interest <-
    collections(db) |>
    dplyr::filter(startsWith(name, title_of_interest))
collection_of_interest |>
    glimpse()
```

Use the `collection_id` to extract publisher metadata (including a DOI
if available) and author information

```{r}
collection_id_of_interest <- pull(collection_of_interest, "collection_id")
publisher_metadata(db) |>
    filter(collection_id == collection_id_of_interest) |>
    glimpse()
authors(db) |>
    filter(collection_id == collection_id_of_interest)
```

Collections may have links to additional external data, in this case a
DOI and two links to `RAW_DATA`.

```{r}
external_links <- links(db)
external_links
external_links |>
    count(link_type)
external_links |>
    filter(collection_id == collection_id_of_interest)
```

Conversely, knowledge of a DOI, etc., can be used to discover details
of the corresponding collection.
```{r}
doi_of_interest <- "https://doi.org/10.1016/j.stem.2018.12.011"
links(db) |>
    filter(link_url == doi_of_interest) |>
    left_join(collections(db), by = "collection_id") |>
    glimpse()
```

# Visualizing data in `cellxgene`

Discover files associated with our selected datasets

```{r}
selected_files <-
    left_join(
        african_american_female |> select(dataset_id),
        files(db),
        by = "dataset_id"
    )
selected_files
```

The `filetype` column lists the type of each file. The cellxgene
service can be used to visualize *datasets* that have `CXG` files.

```{r, eval = FALSE}
selected_files |>
    filter(filetype == "CXG") |>
    slice(1) |> # visualize a single dataset
    datasets_visualize()
```

Visualization is an interactive process, so `datasets_visualize()`
will only open up to 5 browser tabs per call.

# File download and use

Datasets usually contain `CXG` (cellxgene visualization), `H5AD`
(files produced by the Python AnnData module), and `Rds` (serialized
files produced by the _R_ Seurat package). There are no public parsers
for `CXG`, and the `Rds` files may be unreadable if the version of
Seurat used to create the file is different from the version used to
read the file. We therefore focus on the `H5AD` files. For
illustration, we download one of our selected files.

```{r local_file_from_dataset_id}
local_file <-
    selected_files |>
    filter(
        dataset_id == "de985818-285f-4f59-9dbd-d74968fddba3",
        filetype == "H5AD"
    ) |>
    files_download(dry.run = FALSE)
basename(local_file)
```

These are downloaded to a local cache (use the internal function
`cellxgenedp:::.cellxgenedb_cache_path()` for the location of the
cache), so the process is only time-consuming the first time.

`H5AD` files can be converted to _R_ / _Bioconductor_ objects using
the [zellkonverter][] package.

```{r readH5AD}
h5ad <- readH5AD(local_file, use_hdf5 = TRUE, reader = "R")
h5ad
```

The `SingleCellExperiment` object is a matrix-like object with rows
corresponding to genes and columns to cells.
Thus we can easily explore the cells present in the data.

```{r}
## colData() returns per-cell annotations, one row per cell
colData(h5ad) |>
    as_tibble() |>
    count(sex, donor_id)
```

# Next steps

The [Orchestrating Single-Cell Analysis with Bioconductor][OSCA]
online resource provides an excellent introduction to analysis and
visualization of single-cell data in _R_ / _Bioconductor_. Extensive
opportunities for working with AnnData objects in _R_ but using the
native Python interface are briefly described in, e.g., the
`?AnnData2SCE` help page of [zellkonverter][].

[zellkonverter]: https://bioconductor.org/packages/zellkonverter
[OSCA]: https://bioconductor.org/books/OSCA

The [hca][] package provides programmatic access to the Human Cell
Atlas [data portal][HCAportal], allowing retrieval of primary as well
as derived single-cell data files.

[hca]: https://bioconductor.org/packages/hca
[HCAportal]: https://data.humancellatlas.org/explore

# API changes

Data access provided by CELLxGENE has changed to a new 'Discover'
[API][]. The main functionality of the [cellxgenedp][] package has not
changed, but specific columns have been removed, replaced or added, as
follows:

`collections()`

- Removed: `access_type`, `data_submission_policy_version`
- Replaced: `updated_at` replaced with `revised_at`
- Added: `collection_version_id`, `collection_url`, `doi`,
  `revising_in`, `revision_of`

`datasets()`

- Removed: `is_valid`, `processing_status`, `published`, `revision`,
  `created_at`
- Replaced: `dataset_deployments` replaced with `explorer_url`, `name`
  replaced with `title`, `updated_at` replaced with `revised_at`
- Added: `dataset_version_id`, `batch_condition`,
  `x_approximate_distribution`

`files()`

- Removed: `file_id`, `filename`, `s3_uri`, `user_submitted`,
  `created_at`, `updated_at`
- Added: `filesize`, `url`

[API]: https://api.cellxgene.cziscience.com/curation/ui/

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```