---
title: 'Using Monocle as input to **tradeSeq**'
author: "Koen Van den Berge and Hector Roux de Bézieux"
bibliography: tradeSeq.bib
date: "02/20/2020"
output: 
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Monocle + tradeSeq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
library(knitr)
```

# Introduction

`tradeSeq` is an `R` package that allows analysis of gene expression along trajectories. While it has been developed and applied to single-cell RNA-sequencing (scRNA-seq) data, its applicability extends beyond that, and also allows the analysis of, e.g., single-cell ATAC-seq data along trajectories or bulk RNA-seq time series datasets. For every gene in the dataset, `tradeSeq` fits a generalized additive model (GAM) by building on the `mgcv` R package. It then allows statistical inference on the GAM by assessing contrasts of the parameters of the fitted GAM model, aiding in interpreting complex datasets. All details about the `tradeSeq` model and statistical tests are described in our preprint [@VandenBerge2019a].

In this vignette, we analyze a subset of the data from [@Paul2015]. 
A `SingleCellExperiment` object of the data has been provided with the [`tradeSeq`](https://github.com/statOmics/tradeSeqpaper) package and can be retrieved as shown below. The data and UMAP reduced dimensions were derived from following the [Monocle 3 vignette](http://cole-trapnell-lab.github.io/monocle-release/monocle3/#tutorial-1-learning-trajectories-with-monocle-3). 

The [main vignette](https://bioconductor.org/packages/devel/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html) focuses on using `tradeSeq` downsteam of `slingshot`. Here, we present how to use `tradeSeq` downstream of `monocle`[@Qiu2017].

# Load data

```{r, warning=F, message=F}
library(tradeSeq)
library(RColorBrewer)
library(SingleCellExperiment)
library(monocle)
library(dplyr)
library(tidyr)

# For reproducibility
palette(brewer.pal(8, "Dark2"))
data(countMatrix, package = "tradeSeq")
counts <- as.matrix(countMatrix)
rm(countMatrix)
data(celltype, package = "tradeSeq")
```

# Fit trajectories using Monocle

We will fit developmental trajectories using the `monocle` package.

```{r, warning=FALSE, out.width="50%", fig.asp=.6}
set.seed(200)
pd <- data.frame(cells = colnames(counts), cellType = celltype)
rownames(pd) <- pd$cells
fd <- data.frame(gene_short_name = rownames(counts))
rownames(fd) <- fd$gene_short_name
cds <- newCellDataSet(counts, phenoData = new("AnnotatedDataFrame", data = pd),
                      featureData = new("AnnotatedDataFrame", data = fd))
cds <- estimateSizeFactors(cds)
cds <- reduceDimension(cds, max_components = 2)
cds <- orderCells(cds)
cds <- orderCells(cds, root_state = 5)
```

# Fit Smoothers

Only possible for DDRTree

```{r}
sce <- fitGAM(cds, verbose = TRUE)
```

You can then analyse the `sce` object following the [main vignette](https://bioconductor.org/packages/devel/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html).

# Monocle3

As of now (02/2020), monocle3[@Cao2019], is still in its beta version. Therefore, we have no plan yet to include a S4 method for monocle3 while it is not on CRAN or Bioconductor and the format is still moving. However, we present below a way to use `tradeSeq` downstream of `monocle3` as of version '0.2', for a fully connected graph. We follow the tutorial from the [monocle3 website](https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/). 

## Constructing the trajectory

You will need to install monocle3 from [here](https://cole-trapnell-lab.github.io/monocle3/docs/installation/) before running the code below.

```{r, eval = FALSE}
set.seed(22)
library(monocle3)
# Create a cell_data_set object
cds <- new_cell_data_set(counts, cell_metadata = pd,
                gene_metadata = data.frame(gene_short_name = rownames(counts),
                                           row.names = rownames(counts)))
# Run PCA then UMAP on the data
cds <- preprocess_cds(cds, method = "PCA")
cds <- reduce_dimension(cds, preprocess_method = "PCA",
                        reduction_method = "UMAP")

# First display, coloring by the cell types from Paul et al
plot_cells(cds, label_groups_by_cluster = FALSE, cell_size = 1,
           color_cells_by = "cellType")

# Running the clustering method. This is necessary to the construct the graph
cds <- cluster_cells(cds, reduction_method = "UMAP")
# Visualize the clusters
plot_cells(cds, color_cells_by = "cluster", cell_size = 1)

# Construct the graph
# Note that, for the rest of the code to run, the graph should be fully connected
cds <- learn_graph(cds, use_partition = FALSE)

# We find all the cells that are close to the starting point
cell_ids <- colnames(cds)[pd$cellType ==  "Multipotent progenitors"]
closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
closest_vertex <- closest_vertex[cell_ids, ]
closest_vertex <- as.numeric(names(which.max(table(closest_vertex))))
mst <- principal_graph(cds)$UMAP
root_pr_nodes <- igraph::V(mst)$name[closest_vertex]

# We compute the trajectory
cds <- order_cells(cds, root_pr_nodes = root_pr_nodes)
plot_cells(cds, color_cells_by = "pseudotime")
```

## Extracting the pseudotimes and cell weights for tradeSeq

```{r, eval = FALSE}
# Get the closest vertice for every cell
y_to_cells <-  principal_graph_aux(cds)$UMAP$pr_graph_cell_proj_closest_vertex %>%
  as.data.frame() %>%
  dplyr::mutate(cells = rownames(.)) %>%
  rename("Y" = V1)

# Get the root vertices
# It is the same node as above
root <- cds@principal_graph_aux$UMAP$root_pr_nodes

# Get the other endpoints
endpoints <- names(which(igraph::degree(mst) == 1))
endpoints <- endpoints[!endpoints %in% root]

# For each endpoint
cellWeights <- lapply(endpoints, function(endpoint) {
  # We find the path between the endpoint and the root
  path <- igraph::shortest_paths(mst, root, endpoint)$vpath[[1]]
  path <- as.character(path)
  # We find the cells that map along that path
  df <- y_to_cells %>%
    dplyr::filter(Y %in% path)
  df <- data.frame(weights = as.numeric(colnames(cds) %in% df$cells))
  colnames(df) <- endpoint
  return(df)
  }) %>% bind_cols() 
rownames(cellWeights) <- colnames(cds)
pseudotime <- matrix(pseudotime(cds), ncol = ncol(cellWeights),
                     nrow = ncol(cds), byrow = FALSE)
```

```{r, eval = FALSE}
sce <- fitGAM(counts = counts,
              pseudotime = pseudotime,
              cellWeights = cellWeights)
```

Then, the `sce` object can be analyzed following the [main vignette](https://bioconductor.org/packages/devel/bioc/vignettes/tradeSeq/inst/doc/tradeSeq.html).

# Session

```{r}
sessionInfo()
```

# References