---
title: Overview of the MouseGastrulationData datasets
author: Jonathan Griffiths and Aaron Lun
date: "Revised: August 5, 2019"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Available datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: biblio.bib
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r Biocpkg("MouseGastrulationData")` package provides convenient access to the single-cell RNA sequencing (scRNA-seq) datasets from @pijuan-sala_single-cell_2019, and additional data generated in similar systems.
This study focuses on mouse gastrulation and organogenesis, providing transcriptomic profiles at single-cell resolution across several stages of early development.
Datasets are provided as count matrices with additional feature- and sample-level metadata after processing.
Raw sequencing data can be acquired from ArrayExpress accession [E-MTAB-6967](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6967/).

# Installation

The package may be installed from Bioconductor.
Bioconductor packages can be accessed using the `r CRANpkg("BiocManager")` package.

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MouseGastrulationData")
```

BiocManager also supports installation of the development version of the package from Github.

```{r, eval = FALSE}
BiocManager::install("MarioniLab/MouseGastrulationData")
```

To use the package, load it in the typical way.

```{r Load, message=FALSE}
library(MouseGastrulationData)
```

# Processing overview

Detailed methods are available in the methods that accompany [the paper](https://doi.org/10.1038/s41586-019-0933-9),
or from the code in the corresponding [Github repository](https://github.com/MarioniLab/EmbryoTimecourse2018/).
Briefly, whole embryos were dissociated at timepoints between embryonic days (E) 6.5 and 8.5 of development.
Libraries were generated using the 10x Genomics Chromium platform (v1 chemistry) and sequenced on the Illumina HiSeq 2500.
The computational analysis involved a number of steps:

- Demultiplexing, read alignment and feature quantification was performed with _Cellranger_ using Ensembl 92 genome annotation.
- Swapped molecules were excluded using the `swappedDrops()` function from `r Biocpkg("DropletUtils")` [@griffiths2018detection].
- Cell-containing droplets were called using the `emptyDrops()` function from `r Biocpkg("DropletUtils")` [@lun2019emptydrops].
- Called cells with aberrant transcriptional features (e.g., high mitochondrial gene content) were filtered out.
- Size factors were computed using the `computeSumFactors()` function from `r Biocpkg("scran")` [@lun2016pooling].
- Putative doublets were identified and excluded using the `doubletCells()` function from `r Biocpkg("scran")`.
- Cytoplasm-stripped nuclei were also excluded.
- Batch correction was performed in the principal component space with `fastMNN()` from `r Biocpkg("scran")` [@haghverdi2018batch].
- Clusters were identified using a recursive strategy with `buildSNNGraph()` (from `r Biocpkg("scran")`) 
and `cluster_louvain` (from `r CRANpkg("igraph")`), and were annotated and merged into interpretable units by hand.

# Atlas data format

The data accessible via this package is stored in subsets according to the different 10x samples that were generated.
For the embryo atlas, the exported object `AtlasSampleMetadata` provides metadata information for each of the samples.
Descriptions of the contents of each column can be accessed using `?AtlasSampleMetadata`.

```{r}
head(AtlasSampleMetadata, n = 3)
```

All data access functions allow you to select the particular samples you would like to access.
By loading only the samples that you are interested in for your particular analysis, you will save time when downloading and loading the data, and also reduce memory consumption on your machine.

## Processed data access

The package provides the dataset in the form of a `SingleCellExperiment` object.
This section details how you can interact with the object.
We load in only one of the samples from the atlas to reduce memory consumption when compiling this vignette.

```{r, message=FALSE}
sce <- EmbryoAtlasData(samples = 21)
sce
```

We use the `counts()` function to retrieve the count matrix.
These are stored as a sparse matrix, as implemented in the `r CRANpkg("Matrix")` package.

```{r}
counts(sce)[6:9, 1:3]
```

Size factors for normalisation are present in the object and are accessed with the `sizeFactors()` function.

```{r}
head(sizeFactors(sce))
```

After running `r Biocpkg("scater")`'s `normalize` function on the `r Biocpkg("SingleCellExperiment")` object, normalised or log-transformed counts can be accessed using `normcounts` and `logcounts`.
These are not demonstrated in this vignette to avoid a dependency on `r Biocpkg("scater")`.

The MGI symbol and Ensembl gene ID for each gene is stored in the `rowData` of the `SingleCellExperiment` object.

```{r}
head(rowData(sce))
```

The `colData` contains cell-specific attributes.
The meaning of each field is detailed in the function documentation (`?EmbryoAtlasData`).

```{r}
head(colData(sce))
```

Batch-corrected PCA representations of the data are available via the `reducedDim` function, in the `pca.corrected` slot.
This representation contains `NA` values for cells that are doublets, or cytoplasm-stripped nuclei.

A vector of celltype colours (as used in the paper) is also provided in the exported object `EmbryoCelltypeColours`.
Its use is shown below.

```{r, fig.height = 6}
#exclude technical artefacts
singlets <- which(!(colData(sce)$doublet | colData(sce)$stripped))
plot(
    x = reducedDim(sce, "umap")[singlets, 1],
    y = reducedDim(sce, "umap")[singlets, 2],
    col = EmbryoCelltypeColours[colData(sce)$celltype[singlets]],
    pch = 19,
    xaxt = "n", yaxt = "n",
    xlab = "UMAP1", ylab = "UMAP2"
)
```

## Raw data access

Unfiltered count matrices are also available from `r Biocpkg("MouseGastrulationData")`.
This refers to count matrices where swapped molecules have been removed but no cells have been called.
They can be obtained using the `EmbryoAtlasData()` function and are returned as `SingleCellExperiment` objects.

```{r}
unfilt <- EmbryoAtlasData(type="raw", samples=c(1:2))
sapply(unfilt, dim)
```

These unfiltered matrices may be useful if you want to perform tests of cell-calling analyses, 
or analyses which use the ambient pool of RNA in 10x samples.
Note that empty columns are excluded from these matrices.

# Chimera data information

## Background

Data from experiments involving chimeric embryos in @pijuan-sala_single-cell_2019 are also available from this package.
In these embryos, a population of fluorescent embryonic stem cells were injected into wild-type E3.5 mouse embryos.
The embryos were then returned to a parent mouse, and allowed to develop normally until collection.
The cells were flow-sorted to purify host and injected populations, 
libraries were generated using 10x version 2 chemistry and sequencing was performed on the HiSeq 4000.

Chimeras are especially effective for studying the effect of knockouts of essential developmental genes.
We inject stem cells that possess a knockout of a particular gene, and allow the resulting chimeric embryo to develop.
Both injected and host cells contribute to the different tissues in the mouse.
The presence of the wild-type host cells allows the embryo to compensate and avoid gross developmental failures, 
while cells with the knockout are also captured, and their aberrant behaviour can be studied.

## Available datasets

The package contains two chimeric datasets:

- Wild-type chimeras involving ten samples, from five independant embryo pools at two timepoints.
The injected wild-type cells differ only in the insertion of the *td-Tomato* construct.
These data are useful for identifying properties of a typical chimera in *scRNAseq* data.
Raw sequencing data are available at [E-MTAB-7324](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7324/).
The data can be accessed using the `WTChimeraData()` function.
- *Tal1* knockout chimeras involving four samples, from one embryo pool at one timepoint.
The injected cells in the *Tal1* chimeras have are knockouts for the *Tal1* gene.
They also contain the *td-Tomato* construct.
Raw sequencing data are available at [E-MTAB-7325](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7325/).
The data can be accessed using the `Tal1ChimeraData()` function.

The processed data for each experiment are provided as a `SingleCellExperiment`, as for the previously described atlas data.
However, there are a few small differences:

- They contain an extra feature for expression of the *td-Tomato*.
- Cells derived from the injected cells (and thus are positive for *td-Tomato*) are marked in the `colData` field `tomato`.
- Information for the proper pairing of samples from the same embryo pools can be found in the `colData` field `pool`.

Unfiltered count matrices are also provided for each sample in these datasets.

# Session Information

```{r}
sessionInfo()
```

# References