---
title: "Contribution guidelines and dataset format"
date: "Created: 13 September 2022; Compiled: `r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('imcdatasets')`"
author:
- name: Nicolas Damond
  affiliation: [Department for Quantitative Biomedicine; University of Zurich, Institute of Molecular Health Sciences; ETH Zurich]
  email: nicolas.damond@dqbm.uzh.ch
output:
  BiocStyle::html_document:
    toc_float: yes
bibliography: "`r system.file('scripts', 'ref.bib', package='imcdatasets')`"
vignette: >
  %\VignetteIndexEntry{"Contributing guidelines and datasets format"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)
```

# **Introduction**

This vignette describes the procedure to contribute new datasets to the
`imcdatasets` package and contains guidelines for dataset formatting.

# **Contribution guidelines**

Contributions or suggestions for new imaging mass cytometry (IMC) datasets to
add to the `imcdatasets` package are always welcome. New datasets can be
suggested by opening an issue at the `imcdatasets`
[GitHub page](https://github.com/BodenmillerGroup/imcdatasets/issues). The only
requirements are that the new dataset *(i)* is publicly available and *(ii)*
has been described in a published scientific article.

Details about creating Bioconductor's `ExperimentHub` packages
[are available here](https://bioconductor.org/packages/HubPub).

## Create a data generation script

The first step is to create a new branch at the `imcdatasets`
[GitHub page](https://github.com/BodenmillerGroup/imcdatasets/branches).

Then, create an R Markdown (`.Rmd`) script in `./inst/scripts/` to generate the
data objects:

* Download single cell data, multiplexed images and cell segmentation masks.
* Format the single cell data into a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) object.
* Format the images and masks into [CytoImageList](https://bioconductor.org/packages/CytoImageList) objects.
* Save the three data objects so that they can be uploaded to `ExperimentHub`.
* The three data objects must contain matching information so that they can be
associated by the [cytomapper](https://bioconductor.org/packages/cytomapper)
package, as described in the [imcdatasets vignette](https://www.bioconductor.org/packages/release/data/experiment/vignettes/imcdatasets/inst/doc/imcdatasets.html#5_Usage).

The `.Rmd` script must be formatted in the same way as pre-existing scripts.
Examples can be found
[here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/inst/scripts/make-Damond-2019-Pancreas.Rmd)
and
[here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/inst/scripts/make-JacksonFischer-2020-BreastCancer.Rmd).
Each step should be clearly and comprehensively documented.

**For usability of the package and consistency across datasets, the data objects must be formatted as described in the** `Dataset format` **section below.**

## Update the documentation

Other files in the `imcdatasets` package should be updated to include the new
dataset:

* Make a new `./R/Lastname_Year_Type.R` file with a function to load the new
dataset and extensive documentation. Examples can be found
[here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/R/Damond_2019_Pancreas.R)
and
[here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/R/JacksonFischer_2020_BreastCancer.R).
* Run roxygenize to generate the `man` documentation files (go to the
`imcdatasets` directory and run `roxygen2::roxygenize(".")`).
* Update the `./inst/scripts/make-metadata.R` R script and run it. This will
update the `./inst/extdata/metadata.csv` file that is used by `ExperimentHub`
to provide metadata information about datasets.
* Add the reference of the paper that describes the dataset to the `./inst/scripts/ref.bib` file.
* Add the new dataset to the `./inst/extdata/alldatasets.csv` file.
* Add the new dataset to the dataset list in `./tests/testthat/test_loading.R`.

## Open a pull request

After these steps have been completed, open a pull request at the
[imcdatasets GitHub page](https://github.com/BodenmillerGroup/imcdatasets/pulls).

The package maintainers will do the following:

* Check the R Markdown script for data generation.
* Generate the data objects by knitting the R Markdown script.
* Make sure the data objects are well formatted and consistent with the other
datasets.
* Check all the new package metadata and documentation.
* Upload the data objects to `AWS S3` and announce the upload to Bioconductor
Hubs.
* Download the data objects from `ExperimentHub` and check the format again.
* Update the `NEWS`, `DESCRIPTION` (add new contributor, version bump) and
`README.md` (if needed) files.
* Build and check the package to make sure it passes all R and Bioconductor
checks.
* Push to GitHub and check that the `imcdatasets` package can be installed from
there.
* Test the package functionality in R.
* Once everything works, approve the pull request and merge it into the master
branch.

Contributors will be recognized by having their names added to the
`DESCRIPTION` file of the `imcdatasets` package.

# **Dataset format**

The `imcdatasets` package is meant to provide quick and easy access to
published and curated IMC datasets.

Each dataset consists of three data objects that can be retrieved individually:

* __Single cell data__ in the form of a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) object.
* __Multichannel images__ formatted into a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object.
* __Cell segmentation masks__ formatted into a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object.

The three data objects can be mapped using unique `image_name` values contained
in the metadata of each object.
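
The mapping between the three objects can be checked before submission. The
snippet below is a minimal base R sketch, not part of the package: the three
data frames are hypothetical stand-ins for `colData(sce)`, `mcols(images)` and
`mcols(masks)`, and the image names are made up for illustration.

```r
# Hypothetical stand-ins (assumption) for the metadata of the three objects:
# colData(sce), mcols(images) and mcols(masks).
cell_meta  <- data.frame(image_name = c("A01", "A01", "A02", "A03"))
image_meta <- data.frame(image_name = c("A01", "A02", "A03"))
mask_meta  <- data.frame(image_name = c("A01", "A02", "A03"))

# Each image and each mask should appear only once in its metadata.
stopifnot(
    !anyDuplicated(image_meta$image_name),
    !anyDuplicated(mask_meta$image_name)
)

# The same set of image names should be present in all three objects.
same_images <-
    setequal(image_meta$image_name, mask_meta$image_name) &&
    setequal(unique(cell_meta$image_name), image_meta$image_name)
same_images
```

With real objects, the same comparison would presumably be run on the
`image_name` columns extracted from the `SingleCellExperiment` and
`CytoImageList` metadata.
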
For consistency across datasets, the guidelines below must be followed when
creating a new dataset.

## Single cell data

Single cell data should be formatted into a
[SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment)
object named `sce` that contains the following slots:

- `colData`: observation metadata.
- `rowData`: marker metadata.
- `assays`: marker expression levels per cell.
- `colPairs` *(optional)*: neighborhood information.

### colData

The `colData` entry of the `SingleCellExperiment` object is a `DataFrame` that
contains observation metadata (e.g., for cells, slides, tissues or patients).

It is recommended that all column names have a prefix that indicates the level
of observation (e.g., `cell_`, `slide_`, `tissue_`, `patient_`, `tumor_`).

The following columns are required:

- `image_name` and/or `image_number`: unique image (ablated ROI) name or
number, respectively. Should map to the `image_name`/`image_number` column(s)
in the metadata of the `images` and `masks` objects.
- `cell_number`: integer representing cell numbers. Should map to the values of
cell segmentation masks.
- `cell_id`: a unique cell identifier defined as
{`image_number` `_` `cell_number`} (e.g., `7_138`).
- `cell_x` and `cell_y`: position of the cell centroid on the image. These
columns are used as `SpatialCoords` when converting to a
[SpatialExperiment](https://bioconductor.org/packages/SpatialExperiment)
object.

In addition, `colnames(sce)` should be set as `colData(sce)$cell_id`.

### rowData

The `rowData` entry of the `SingleCellExperiment` is a `DataFrame` that
contains marker (protein, RNA, probe) information.

The following columns are required in the `rowData` entry:

- `channel`: a unique integer that maps to the channels of the associated
multichannel images.
- `metal`: the metal isotope used for detection, formatted as
{`ChemicalSymbol` `IsotopeMass`} (e.g., `Y89`, `In115`, `Yb176`, `Bi209`).
- `name`: marker name used in the publication that describes the dataset.
- `full_name`: full marker name.
- `short_name`: abbreviated marker name, preferably following the official
[UniProt](https://www.uniprot.org) nomenclature.

For the `full_name` and `short_name` columns, the following guidelines apply:

* In `short_name`, all dashes, dots and spaces should be removed or replaced
with underscores.
* For post-translationally modified proteins:
    * Prefix the `full_name` with the modification type (e.g., `phospho-`,
    `methyl-`) and suffix it with the modified amino acids (e.g., `[S28]`).
    * Prefix the `short_name` with an abbreviation of the modification type
    (e.g., `p_`, `me_`) and do not indicate the modified amino acids, unless
    confusion with another target in the dataset is possible.

```{r echo=FALSE, results='hide'}
marker_names <- data.frame(
    full_name = c(
        "Carbonic anhydrase IX", "CD3 epsilon", "CD8 alpha", "E-Cadherin",
        "cleaved-Caspase3 + cleaved-PARP", "Cytokeratin 5", "Forkhead box P3",
        "Glucose transporter 1", "Histone H3", "phospho-Histone H3 [S28]",
        "Ki-67", "Myeloperoxidase", "Programmed cell death protein 1",
        "Programmed death-ligand 1", "phospho-Rb [S807/S811]",
        "Smooth muscle actin", "Vimentin", "Iridium 191", "Iridium 193"
    ),
    short_name = c(
        "CA9", "CD3e", "CD8a", "CDH1", "cCASP3_cPARP", "KRT5", "FOXP3",
        "SLC2A1", "H3", "p_H3", "Ki67", "MPO", "PD_1", "PD_L1", "p_Rb",
        "SMA", "VIM", "DNA1", "DNA2"
    )
)
```

```{r echo=FALSE, results='asis'}
knitr::kable(
    marker_names,
    caption = "'full_name' and 'short_name' examples for some commonly used markers"
)
```

In addition, `rownames(sce)` should be set as `rowData(sce)$short_name`.

### assays

The `assays` slot of the `SingleCellExperiment` contains counts matrices
representing marker expression levels per cell and channel.

It should at least contain a `counts` matrix with raw ion counts.
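
The naming conventions described above (marker `short_name` as row names,
`cell_id` as column names) can be illustrated with a small base R sketch. The
helper `clean_short_name()` and all values are hypothetical and only serve as
an example; the helper replaces dashes, dots and spaces with underscores,
whereas the guidelines also allow removing them (e.g., `Ki67`).

```r
# Hypothetical helper (not part of imcdatasets): replace runs of dashes,
# dots and spaces with a single underscore, then trim stray underscores.
clean_short_name <- function(x) {
    x <- gsub("[-. ]+", "_", x)
    gsub("^_+|_+$", "", x)
}

short_name <- clean_short_name(c("PD-1", "PD-L1", "p-Rb", "CD3e"))

# Made-up cell metadata; cell_id is built as image_number, "_", cell_number.
cell_meta <- data.frame(
    image_number = c(7L, 7L, 8L),
    cell_number  = c(137L, 138L, 1L)
)
cell_meta$cell_id <- paste(
    cell_meta$image_number, cell_meta$cell_number, sep = "_"
)

# Placeholder counts matrix: one row per marker, one column per cell.
counts <- matrix(
    0,
    nrow = length(short_name),
    ncol = nrow(cell_meta),
    dimnames = list(short_name, cell_meta$cell_id)
)
counts
```

In a real data generation script, a matrix with these dimension names would
be passed as the `counts` assay when constructing the `SingleCellExperiment`
object, so that `rownames(sce)` and `colnames(sce)` follow the conventions.
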
The `assays` slot can also contain additional matrices with commonly used
counts transformations, or counts transformations that were used in the
publication that describes the dataset. All counts transformations must be
documented in the `.R` function used to load the dataset. Common examples
include:

* `exprs`: asinh-transformed counts. For IMC, a cofactor of 1 is typically
used.
* `quant_norm`: counts censored (e.g., at the 99th percentile) and scaled from
0 to 1.

### colPairs

Neighborhood information, such as a list of cells that are localized next to
each other, can be stored as a
[SelfHits](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html#6_Storing_row_or_column_pairings)
object in the `colPairs` slot of the `SingleCellExperiment` object.

## Images and masks

### Images

Multichannel images are stored in a
[CytoImageList](https://bioconductor.org/packages/CytoImageList) object named
`images`.

Channel names of the `images` object (`channelNames(images)`) must map to
`rownames(sce)` (marker short names).

The metadata slot (`mcols(images)`) must contain an `image_name` column that
maps to the `image_name` column of `colData(sce)`, and to the `image_name`
column of `mcols(masks)`. This information is used by
[cytomapper](https://bioconductor.org/packages/cytomapper) to associate
multichannel images, cell segmentation masks, and single cell data.

### Masks

Cell segmentation masks are stored in a
[CytoImageList](https://bioconductor.org/packages/CytoImageList) object named
`masks`.

The values of the masks should be integers mapping to the `cell_number` column
of `colData(sce)`. This information is used by
[cytomapper](https://bioconductor.org/packages/cytomapper) to associate single
cell data and cell segmentation masks.

The metadata slot (`mcols(masks)`) must contain an `image_name` column that
maps to the `image_name` column of `colData(sce)`, and to the `image_name`
column of `mcols(images)`.
This information is used by
[cytomapper](https://bioconductor.org/packages/cytomapper) to associate
multichannel images, cell segmentation masks, and single cell data.

# **Session info** {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```