---
title: "Contribution guidelines and dataset format"
date: "Created: 13 September 2022; Compiled: `r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('imcdatasets')`"
author:
- name: Nicolas Damond
  affiliation: [Department for Quantitative Biomedicine; University of Zurich, 
  Institute of Molecular Health Sciences; ETH Zurich]
  email: nicolas.damond@dqbm.uzh.ch
output:
    BiocStyle::html_document:
        toc_float: yes
bibliography: "`r system.file('scripts', 'ref.bib', package='imcdatasets')`"
vignette: >
    %\VignetteIndexEntry{"Contributing guidelines and datasets format"}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)
```

# **Introduction**

This vignette describes the procedure to contribute new datasets to the
`imcdatasets` package and contains guidelines for dataset formatting.


# **Contribution guidelines**

Contributions or suggestions for new imaging mass cytometry (IMC) datasets to 
add to the `imcdatasets` package are always welcome. New datasets can be 
suggested by opening an issue at the `imcdatasets` 
[GitHub page](https://github.com/BodenmillerGroup/imcdatasets/issues). The only 
requirements are that the new dataset *(i)* is publicly available and *(ii)* 
has been described in a published scientific article.

Details about creating Bioconductor's `ExperimentHub` packages 
[are available here](https://bioconductor.org/packages/HubPub).


## Create a data generation script

The first step is to create a new branch at the `imcdatasets` 
[GitHub page](https://github.com/BodenmillerGroup/imcdatasets/branches).

Then, create an R markdown (`.Rmd`) script in `.inst/scripts/` to generate the 
data objects:

* Download single cell data, multiplexed images and cell segmentation masks.
* Format the single cell data into a 
[SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) 
object.
* Format the images and masks into 
[CytoImageList](https://bioconductor.org/packages/CytoImageList) objects.
* Save the three data objects so that they can be uploaded to `ExperimentHub`.
* The three data objects must contain matching information so that they can be 
associated by the [cytomapper](https://bioconductor.org/packages/cytomapper) 
package, as described in the [imcdataset vignette](https://www.bioconductor.org/packages/release/data/experiment/vignettes/imcdatasets/inst/doc/imcdatasets.html#5_Usage).

The `.Rmd` script must be formatted in the same way as pre-existing scripts. 
Examples can be found [here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/inst/scripts/make-Damond-2019-Pancreas.Rmd) and [here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/inst/scripts/make-JacksonFischer-2020-BreastCancer.Rmd). Each step should be clearly and comprehensively 
documented.

**For usability of the package and consistency across datasets, the data **
**objects must be formatted as described in the** `Dataset format` **section **
**below.**


## Update the documentation

Other files in the `imcdatasets` package should be updated to include the new 
dataset:

* Make a new `./R/Lastname_Year_Type.R` file with a function to load the new 
dataset and extensive documentation. Examples can be found [here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/R/Damond_2019_Pancreas.R) 
and [here](https://github.com/BodenmillerGroup/imcdatasets/blob/master/R/JacksonFischer_2020_BreastCancer.R).
* Run roxygenize to generate `man` documentation files (go to the `imcdatasets`
directory and run `roxygen2::roxygenize(".")`).
* Update the `./inst/scripts/make-metadata.R` R script and run it. This will 
update the  `./inst/extdata/metadata.csv` file that is used by `ExperimentHub` 
to provide metadata information about datasets.
* Add the reference of the paper that describes the dataset to the 
`./inst/scripts/ref.bib` file.
* Add the new dataset to the `./inst/extdata/alldatasets.csv` file.
* Add the new dataset to the dataset list in `./tests/testthat/test_loading.R`.


## Open a pull request

After these steps have been completed, open a pull request at the 
[imcdataset GitHub page](https://github.com/BodenmillerGroup/imcdatasets/pulls).

The package maintainers will do the following:

* Check the R markdown script for data generation.
* Generate the data objects by knitting the R markdown script.
* Make sure the data objects are well formatted and consistent with the other 
datasets.
* Check all the new package metadata and documentation.
* Upload the data objects to `AWS S3` and announce the upload to Bioconductor 
Hubs.
* Download the data objects from `ExperimentHub` and check the format again.
* Update the `NEWS`, `DESCRIPTION` (add new contributor, version bump) and 
`README.md` (if needed) files.
* Build and check the package, make sure it passes all R and Bioconductor 
checks.
* Push to GitHub and check that the `imcdatasets` package can be installed 
from there.
* Test the package functionality in R.
* Once everything works, approve the pull request and merge with the master 
branch.

Contributors will be recognized by having their names added to the DESCRIPTION 
file of the `imcdatasets` package.


# **Dataset format**

The `imcdatasets` package is meant to provide quick and easy access to 
published and curated IMC datasets. Each dataset consists of three data objects
that can be retrieved individually:

* __Single cell data__ in the form of a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) 
object.
* __Multichannel images__ formatted into a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object.
* __Cell segmentation masks__ formatted into a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object.

The three data objects can be mapped using unique `image_name` values contained
in the metadata of each object.

For consistency across datasets, the guidelines below must be followed when 
creating a new dataset.


## Single cell data

Single cell data should be formatted into a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment) 
object named `sce` that contains the following slots:

- `colData`: observations metadata.
- `rowData`: marker metadata.
- `assays`: marker expression levels per cell.
- `colPairs` *(optional)*: neighborhood information.

### colData

The `colData` entry of the `SingleCellExperiment` object is a `DataFrame` that 
contains observation metadata; i.e., cells, slides, tissue, patients, .... It 
is recommended that all column names have a prefix that indicates the level of 
observation (e.g. `cell_`, `slide_` , `tissue_`, `patient_`, `tumor_`).

The following columns are required:

- `image_name` and/or `image_number`: unique image (ablated ROI) name, 
respectively number. Should map to the `image_name`/`image_number` column(s) 
in the metadata of the `images` and `masks` objects.
- `cell_number`: integer representing cell numbers. Should map to the values of
cell segmentation masks.
- `cell_id`: a unique cell identifier defined as 
{`image_number` `_` `cell_number`} (e.g., `7_138`).
- `cell_x` and `cell_y`: position of the cell centroid on the image. These 
columns are used as `SpatialCoords` when converting to a [SpatialExperiment](https://bioconductor.org/packages/SpatialExperiment) 
object.

In addition, `colnames(sce)` should be set as `colData(sce)$cell_id`.

### rowData

The `rowData` entry of the `SingleCellExperiment` is a `DataFrame` that 
contains marker (protein, RNA, probe) information.

The following columns are required in the `rowData` entry:

- `channel`: a unique integer that maps to the channels of the associated 
multichannel images.
- `metal`: the metal isotope used for detection, formatted as 
{ `ChemicalSymbol` `IsotopeMass`} (e.g., `Y89`, `In115`, `Yb176`, `Bi209`).
- `name`: marker name used in the publication that describes the dataset.
- `full_name`: full marker name.
- `short_name`: abbreviated marker name, preferably following the official [UniProt](https://www.uniprot.org) nomenclature.

For the `full_name` and `short_name` columns, the following guidelines apply:

* In `short_name`, all dashes, dots and spaces should removed or replaced with 
underscores.
* For post-translationally modified proteins:
* Prefix the `full_name` with the modification type (e.g., `phospho-`, 
`methyl-`) and suffix it with the modified aminoacids (e.g., `[S28]`).
* Prefix the `short_name` with an abbreviation of the modification type 
(e.g., `p_`, `me_`) and do not indicate the modified aminoacids, unless there
is a possible confusion with another target in the dataset.

```{r echo=FALSE, results='hide'}
marker_names <- data.frame(
    full_name = c(
        "Carbonic anhydrase IX",
        "CD3 epsilon",
        "CD8 alpha",
        "E-Cadherin",
        "cleaved-Caspase3 + cleaved-PARP",
        "Cytokeratin 5",
        "Forkhead box P3",
        "Glucose transporter 1",
        "Histone H3",
        "phospho-Histone H3 [S28]",
        "Ki-67",
        "Myeloperoxidase",
        "Programmed cell death protein 1",
        "Programmed death-ligand 1",
        "phospho-Rb [S807/S811]",
        "Smooth muscle actin",
        "Vimentin",
        "Iridium 191",
        "Iridium 193"
    ),
    short_name = c(
        "CA9",
        "CD3e",
        "CD8a",
        "CDH1",
        "cCASP3_cPARP",
        "KRT5",
        "FOXP3",
        "SLC2A1",
        "H3",
        "p_H3",
        "Ki67",
        "MPO",
        "PD_1",
        "PD_L1",
        "p_Rb",
        "SMA",
        "VIM",
        "DNA1",
        "DNA2"
    )
)
```

```{r echo=FALSE, results='asis'}
knitr::kable(
    marker_names,
    caption = "'full_name' and 'short_name' examples for some commonly 
        used markers"
)
```

In addition, `rownames(sce)` should be set as `rowData(sce)$short_name`.

### assays

The `assays` slot of the `SingleCellExperiment` contains counts matrices 
representing marker expression levels per cell and channel.

It should at least contain a `counts` matrix with raw ion counts. The `assays` 
slot can also contain additional matrices with commonly used counts 
transformations, or counts transformations that were used in the publication 
that describes the dataset. All counts transformations must be documented in 
the `.R` function used to load the dataset. Common examples include:

* `exprs`: asinh-transformed counts. For IMC, a cofactor of 1 is typically 
used.
* `quant_norm`: counts censored (e.g., at the 99th percentile) and scaled from 
0 to 1.

### colPairs

Neighborhood information, such as a list of cells that are localized next to 
each other, can be stored as a [SelfHits](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html#6_Storing_row_or_column_pairings) object in the `colPair` slot of the `SingleCellExperiment` object.


## Images and masks

### Images

Multichannel images are stored in a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object named 
`images`.

Channel names of the `images` object (`channelNames(images)`) must map to 
`rownames(sce)` (marker short names).

The metadata slot (`mcols(images)`) must contain an `image_name` column that 
maps to the `image_name` column of `colData(sce)`, and to the `image_name` column of `mcols(masks)`. This information is used by [cytomapper](https://bioconductor.org/packages/cytomapper) 
to associate multichannel images, cell segmentation masks, and single cell data.

### Masks

Cell segmentation masks are stored in a [CytoImageList](https://bioconductor.org/packages/CytoImageList) object named 
`masks`.

The values of the masks should be integers mapping to the `cell_number` column 
of `colData(sce)`. This information is used by [cytomapper](https://bioconductor.org/packages/cytomapper) to associate single 
cell data and cell segmentation masks.

The metadata slot (`mcols(masks)`) must contain an `image_name` column that 
maps to the `image_name` column of `colData(sce)`, and to the `image_name` column of `mcols(images)`. This information is used by [cytomapper](https://bioconductor.org/packages/cytomapper) to associate 
multichannel images, cell segmentation masks, and single cell data.


# **Session info** {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```