---
title: "Providing Bioimage Dataset for ExperimentHub"
author:
  - name: Satoshi Kume
    email: satoshi.kume.1984@gmail.com
date: "`r Sys.Date()`"
graphics: no
package: BioImageDbs
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: BioImageDbs.bib
csl: bj.csl
vignette: >
  %\VignetteIndexEntry{Providing Bioimage Dataset for ExperimentHub}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{ExperimentHub}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```

**Last modified:** `r file.info("BioImageDbs.Rmd")$mtime`
**Compiled:** `r Sys.time()`

# Utilisation and prospects of bioimage datasets

In recent years, the need for data analysis using machine learning has grown in bioimaging research. Machine learning is an inductive, data-driven approach, and building models for tasks such as image segmentation and classification requires the image data itself. Therefore, publishing and sharing bioimage datasets [@Williams2017], as well as creating knowledge by attaching metadata to bioimages [@Kobayashi2018;@Kume2017], are important issues.

At present, there is no commonly used format for sharing bioimage datasets, and the data are scattered among various repositories. As a result, different image repositories manage their data (the images themselves and metadata such as image format, instruments/microscopes and biosamples) in different formats. Analysing and quantifying those images usually requires several image pre-processing steps, which depend on the analysis environment used.

In practice, supervised learning starts with finding a repository that holds a bioimage dataset containing both the original images and their corresponding supervised labels. Once the repository is found, the image data are downloaded, loaded into the user's environment and converted into a format suitable for the analytical package. These steps are time-consuming and must be completed before the main analysis can begin. Moreover, most image repositories do not publish their data in a format suited to reading and processing in R (.Rdata, etc.), so the data are not easy for R users to work with.

To support supervised learning on bioimage data, BioImageDbs provides R list objects in which the original images and their corresponding supervised labels have been converted into 4D or 5D arrays.
After retrieving the data from ExperimentHub, users can apply them to deep learning with Keras/TensorFlow [@Chollet2017kerasR] and to other machine learning methods without further pre-processing.

Many image analysis packages are available in R, but image analysis itself lacks standardisation. Using common, open datasets is an essential step towards standardising and comparing analytical methods. Providing image arrays through ExperimentHub is also intended to support applications such as (1) comparing models on data shared among R users and (2) applying predictions to new datasets through transfer learning and fine-tuning based on these arrays.

# Fetch Bioimage Datasets from ExperimentHub

The `BioImageDbs` package provides the metadata for the bioimage datasets in `r Biocpkg("ExperimentHub")`. The datasets themselves are preprocessed into array format and stored in ExperimentHub.

First we load/update the `ExperimentHub` resource.

```{r load-lib, message = FALSE}
library(ExperimentHub)
eh <- ExperimentHub()
```

Next we list all BioImageDbs entries from `ExperimentHub`.

```{r list-BioImageDbs}
query(eh, "BioImage")
```

We can confirm the metadata held by ExperimentHub in the Bioconductor S3 bucket with `mcols()`.

```{r confirm-metadata}
mcols(query(eh, "BioImage"))
```

We can narrow the query to a single dataset, here `LM_id0001`, as follows.

```{r query-mouse}
qr <- query(eh, c("BioImageDbs", "LM_id0001"))
qr

# Import data
# BioImageDbs_image_Dat <- qr[[1]]
```

# 5D Arrays from the ExperimentHub

The array dimensions follow the channels_last ordering, which is the default in R/Keras. A 5D array has the shape (batch, spatial_dim1, spatial_dim2, spatial_dim3, channels). The batch size equals the number of 3D image sets. The number of channels is 1 for greyscale images and 3 for RGB images.
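
The channels_last layout of a 5D array can be illustrated with a small dummy array in base R. This is only a sketch: the sizes and the names `n_sets`, `n_chan`, `vol_dims`, `x5d` and `vol1` are arbitrary assumptions chosen for illustration and do not correspond to any dataset in BioImageDbs.

```r
# Dummy 5D array in channels_last order:
# (batch, spatial_dim1, spatial_dim2, spatial_dim3, channels)
n_sets   <- 2             # number of 3D image sets (batch); illustrative value
n_chan   <- 1             # 1 for greyscale, 3 for RGB
vol_dims <- c(16, 16, 5)  # arbitrary illustrative spatial sizes

x5d <- array(0, dim = c(n_sets, vol_dims, n_chan))
dim(x5d)   # 2 16 16 5 1

# Extract the first 3D image set; the batch and channel
# dimensions (both indexed by a single value) are dropped
vol1 <- x5d[1, , , , 1]
dim(vol1)  # 16 16 5
```

Indexing the first and last positions selects one image set and one channel, which leaves a plain 3D volume that can be passed to ordinary array-based tools.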

# 4D Arrays from the ExperimentHub

The array dimensions follow the channels_last ordering, which is the default in R/Keras. A 4D array has the shape (batch, height, width, channels). The batch size equals the number of 2D images.

# Visualization of gif images from the ExperimentHub

As a test, we also provide gif files of some arrays for visualization. These files can be displayed with the `magick::image_read()` function.

```{r}
qr <- query(eh, c("BioImageDbs", ".gif"))
qr

# EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_data
qr[1]

## Display the gif image
# magick::image_read(qr[[1]])
```

```{r Fig001, fig.cap = "EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[1]])
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0001.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig002, fig.cap = "EM_id0002_Drosophila_brain_region_5dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[2]])
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0002.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig003, fig.cap = "EM_id0003_J558L_4dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[9]])
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0003.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig004, fig.cap = "EM_id0004_PrHudata_4dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[10]])
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0004.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig008, fig.cap = "EM_id0005_Mouse_Kidney_2D_All_Mito_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0005_Mouse_Kidney_2D_Mito.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig009, fig.cap = "EM_id0005_Mouse_Kidney_2D_All_Nuc_1024_4dtensor.Rds", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0005_Mouse_Kidney_2D_Nuc.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig010, fig.cap = "EM_id0006_Rat_Liver_2D_All_Mito_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0006_Rat_Liver_Mito.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig011, fig.cap = "EM_id0006_Rat_Liver_2D_All_Nuc_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0006_Rat_Liver_Nuc.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig012, fig.cap = "EM_id0007_Mouse_Kidney_MultiScale_All_Low_Glomerulus_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0007_Mouse_Kidney_MultiScale_Glomerulus.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig013, fig.cap = "EM_id0007_Mouse_Kidney_MultiScale_All_Middle_Podocyte_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0007_Mouse_Kidney_MultiScale_Podocyte.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig014, fig.cap = "EM_id0008_Human_NB4_2D_All_Cel_512_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0008_Human_NB4_Nuc.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig015, fig.cap = "EM_id0008_Human_NB4_2D_All_Nuc_1024_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0008_Human_NB4_Cell.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig016, fig.cap = "EM_id0009_MurineBMMC_All_512_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0009.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig017, fig.cap = "EM_id0010_HumanBlast_All_512_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0010.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig018, fig.cap = "EM_id0011_HumanJurkat_All_512_4dTensor_dataset.gif", echo = FALSE}
options(EBImage.display = "raster")
img <- system.file("images", "EM_id0011.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig005, fig.cap = "LM_id0001_DIC_C2DH_HeLa_4dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[3]])
options(EBImage.display = "raster")
img <- system.file("images", "LM_id0001.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig006, fig.cap = "LM_id0002_PhC_C2DH_U373_4dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[5]])
options(EBImage.display = "raster")
img <- system.file("images", "LM_id0002.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

```{r Fig007, fig.cap = "LM_id0003_Fluo_N2DH_GOWT1_4dTensor_train_dataset.gif", echo = FALSE}
#magick::image_read(qr[[7]])
options(EBImage.display = "raster")
img <- system.file("images", "LM_id0003.png", package="BioImageDbs")
EBImage::display(EBImage::readImage(files = img))
```

# A simple execution command using Keras/TensorFlow

We select a data array and a label array from the data list and assign them to variables. As an example, these variables are then passed as the `x` and `y` arguments of the Keras `fit()` function. The Keras model must be prepared before execution.
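
Before calling `fit()`, it can be worth confirming that the data and label arrays agree on their batch and spatial dimensions. The following is a minimal base R sketch under stated assumptions: `check_xy_dims()` is a hypothetical helper written for illustration (it is not part of BioImageDbs or Keras), the arrays are small dummy stand-ins for `Train_Original` and `Train_GroundTruth`, and it assumes that labels share every dimension with the images except possibly the channel dimension.

```r
# Hypothetical helper (not part of BioImageDbs or Keras): checks that two
# channels_last arrays agree on every dimension except the last (channels).
check_xy_dims <- function(x, y) {
  dx <- dim(x)
  dy <- dim(y)
  if (is.null(dx) || is.null(dy)) {
    stop("both x and y must be arrays with a dim attribute")
  }
  if (length(dx) != length(dy)) {
    stop("x and y must have the same number of dimensions")
  }
  n <- length(dx)
  # Compare batch and spatial dimensions only; channels may differ
  identical(dx[-n], dy[-n])
}

# Dummy stand-ins for Train_Original and Train_GroundTruth:
# 4 greyscale images of 8 x 8 pixels with binary labels
data   <- array(runif(4 * 8 * 8), dim = c(4, 8, 8, 1))
labels <- array(rbinom(4 * 8 * 8, size = 1, prob = 0.5), dim = c(4, 8, 8, 1))

ok <- check_xy_dims(data, labels)
ok
```

A mismatch in the batch or spatial dimensions makes the helper return `FALSE`, which is easier to diagnose at this point than an error raised during model fitting.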

```{r}
## Not Run ##
# qr <- query(eh, c("BioImageDbs"))
# BioImageData <- qr[[1]]
# data <- BioImageData$Train$Train_Original
# labels <- BioImageData$Train$Train_GroundTruth
# dim(data); dim(labels)
# model %>% fit( x = data, y = labels )
```

# About the imaging dataset and its metadata in BioImageDbs

The datasets in BioImageDbs were built from the following published open data:

1. For cellular ultra-microstructures, electron microscopy-based imaging data of the mouse B myeloma cell line J558L (e.g. EM_id0003_J558L_4dTensor.Rda) [@Morath2013], primary human T cells isolated from peripheral blood mononuclear cells (e.g. EM_id0004_PrHudata_4dTensor.Rda) [@Morath2013], human NB-4 cells (e.g. EM_id0008_Human_NB4_2D_All_Cel_512_4dTensor.Rds) [@Kume2017], murine bone marrow-derived mast cells (e.g. EM_id0009_MurineBMMC_All_512_4dTensor.Rds) [@Morath2013], human blasts (e.g. EM_id0010_HumanBlast_All_512_4dTensor.Rds) [@Morath2013], and the human T-cell line Jurkat (e.g. EM_id0011_HumanJurkat_All_512_4dTensor.Rds) [@Morath2013] were used.

2. For bio-tissue ultra-microstructures, electron microscopy-based imaging data of the mouse brain (e.g. EM_id0001_Brain_CA1_hippocampus_region_5dTensor.Rda) [@Lucchi2012;@Lucchi2013bib], Drosophila brain (e.g. EM_id0002_Drosophila_brain_region_5dTensor.Rda) [@Cardona2010;@ArgandaCarreras2015], mouse kidney (e.g. EM_id0005_Mouse_Kidney_2D_All_Nuc_1024_4dtensor.Rds) [@Kume2016] and rat liver (e.g. EM_id0006_Rat_Liver_2D_All_Mito_1024_4dtensor.Rds) [@Kume2016] were used.

3. For cell tracking, light microscopy-based imaging data of human HeLa cells on flat glass (e.g. LM_id0001_DIC_C2DH_HeLa_4dTensor.Rda) [@Maska2014;@Ulman2017], human glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate (e.g. LM_id0002_PhC_C2DH_U373_4dTensor.Rda) [@Maska2014;@Ulman2017] and GFP-GOWT1 mouse stem cells (e.g. LM_id0003_Fluo_N2DH_GOWT1_4dTensor.Rda) [@Bartova2011] were used.

The supervised labels are provided as array data with binary or multiple values. Detailed information is given in the metadata file of BioImageDbs. Some of the cell tracking data were obtained from the [Cell Tracking Challenge](http://celltrackingchallenge.net/2d-datasets/).

# Session information {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References