--- title: "Introduction to MFA" author: "Kieran R Campbell" date: "2017-3-3" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibliography.bib --- ```{r use-libs, include = FALSE} library(mfa) library(ggplot2) library(dplyr) knitr::opts_chunk$set(echo = TRUE, cache = TRUE, fig.width = 6, fig.height = 4, warning = FALSE, message = FALSE) ``` # Introduction `mfa` is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling. # Installation `mfa` can be installed in one of two ways: ## From Bioconductor ```{r install-bioconductor, eval = FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("mfa") library(mfa) ``` ## From Github This requires the `devtools` package to be installed first ```{r install-github, eval = FALSE} install.packages("devtools") # If not already installed devtools::install_github("kieranrcampbell/mfa") library(mfa) ``` # An example on synthetic data ## Generating synthetic data We first create some synthetic data for 100 cells and 40 genes calling the `mfa` function `create_synthetic`. This returns a list with gene expression, pseudotime, branch allocation, and various parameter estimates: ```{r synthetic} synth <- create_synthetic(C = 100, G = 40) print(str(synth)) ``` We can then PCA and put into a tidy format: ```{r to-tidy} df_synth <- as_data_frame(prcomp(synth$X)$x[,1:2]) %>% mutate(pseudotime = synth$pst, branch = factor(synth$branch)) ``` and have a look at a PCA representation, coloured by both pseudotime and branch allocation: ```{r pca-rep} ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point() ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point() ``` ## Calling `mfa` The input to `mfa` is either an `ExpressionSet` (e.g. from using the package [Scater](http://bioconductor.org/packages/release/bioc/html/scater.html)) or a cell-by-gene expression matrix. If an `ExpressionSet` is provided then the values in the `exprs` slot are used for gene expression. We invoke `mfa` with a call to the `mfa(...)` function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time: ```{r run-mfa} m <- mfa(synth$X) print(m) ``` Particular care must be paid to the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the `pc_initialise` argument. ## MCMC diagnostics As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. @cowles1996markov for a full discussion). For a quick summary of these, `mfa` provides two functions: `plot_mfa_trace` and `plot_mfa_autocorr` for quick plotting of the trace and autocorrelation of the posterior log-likelihood: ```{r diagnostics} plot_mfa_trace(m) plot_mfa_autocorr(m) ``` ## Plotting results We can extract posterior mean estimates along with credible intervals using the `summary` function: ```{r summary} ms <- summary(m) print(head(ms)) ``` This has six entries: * `pseudotime` The MAP pseudotime estimate * `branch` The MAP branch estimate * `branch_certainty` The proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch * `pseudotime_lower` and `pseudotime_upper`: the lower and upper 95\% highest-probability-density posterior credible intervals We can compare the inferred pseudotimes to the true values: ```{r compare-pst} qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) + xlab('True pseudotime') + ylab('Inferred pseudotime') + scale_color_discrete(name = 'True\nbranch') ``` And we can equivalently plot the PCA representation coloured by MAP branch: ```{r pca-rep-with-branch} mutate(df_synth, inferred_branch = ms[['branch']]) %>% ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) + geom_point() + scale_color_discrete(name = 'Inferred\nbranch') ``` ## Finding genes that bifurcate A unique part of this model is that through an ARD-like prior structure on the loading matrices we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the `plot_chi` function, where larger values of inverse-chi imply the gene is associated with the bifurcation: ```{r plot-chi} plot_chi(m) ``` To calculate the MAP values for chi we can call the `calculate_chi` function, which returns a `data_frame` with the feature names and values: ```{r posterior-mean-chi} posterior_chi_df <- calculate_chi(m) head(posterior_chi_df) ``` # Advanced usage ## The `mfa` class A call to `mfa(...)` returns an `mfa` object that contains all the information about the dataset and the MCMC inference performed. Note that it does _not_ contain a copy of the original data. We can see the structure by calling `str` on an `mfa` object: ```{r str-mfa} str(m, max.level = 1) ``` This contains the following slots: * `traces` - the raw MCMC traces (discussed in following section) * `iter` - the number of MCMC iterations * `thin` - the thinning of the MCMC chain * `burn` - the number of MCMC iterations thrown away as burn-in * `b` - the number of branches modelled * `collapse` - whether collapsed Gibbs sampling was implemented * `N` - the number of cells * `G` - the number of features (e.g. genes) * `feature_names` - the names of the features (e.g. genes) * `cell_names` - the names of the cells ## Accessing MCMC traces MCMC traces can be accessed through the `traces` slot of an `mfa` object. This gives a list with an element for each variable, along with the log-likelihood: ```{r str-traces} print(names(m$traces)) ``` For non-branch-specific variables this is simply a matrix. For example, for the variable $\tau$ is just an interation-by-gene matrix: ```{r tau} str(m$traces$tau_trace) ``` We can easily get the posterior mean by calling `colMeans`. More fancy posterior density estimation can be perfomed using the `MCMCglmm` package, such as `posterior.mode(...)` for MAP estimation (though in practice this is often similar to posterior mean). We can estimate posterior intervals using the `HPDInterval(...)` function from the `coda` package (note that traces must be converted to `coda` objects before calling either of these). Some variables are branch dependent, meaning the traces returned are arrays (or _tensors_ in fashionable speak) that have dimension `iteration x gene x branch`. An example is the $k$ variable: ```{r print-k} str(m$traces$k_trace) ``` To get posterior means (or modes, or intervals) we then need to use the `apply` function to iterate over the branches. To find the posterior means of `k`, we then call ```{r posterior-mean-of-k} pmean_k <- apply(m$traces$k_trace, 3, colMeans) str(pmean_k) ``` This returns a gene-by-branch matrix of posterior estimates. # Technical ```{r sess-info} sessionInfo() ``` # References