Detecting Differential Expression in Single-Cell RNAseq data

The waddR package provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance which is specifically tailored to identify differential distributions in single-cell RNA-seqencing (scRNA-seq) data.

In particular, a two-stage (TS) approach has been implemented that takes account of the specific nature of scRNA-seq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.

Application to an example data set

This an example on how to analyse the expression of single cell data in two conditions. As example data we will look at single cell data from decidua and blood samples analysed by Vento-Tormo et al. 2018 (https://doi.org/10.1038/s41586-018-0698-6). For this demonstration we use only one replicate of the two conditions that has been normalized, scaled and is available for download on github.

First, we will first load all required packages:

suppressPackageStartupMessages({
    library("waddR")
    library("SingleCellExperiment")
    library("BiocFileCache")
})

Download the example data set

url.base <- "https://github.com/goncalves-lab/waddR-data/blob/master/data/"
sce.blood.url <- paste0(url.base, "sce_blood.RDa?raw=true")
sce.decidua.url <- paste0(url.base, "sce_decidua.RDa?raw=true")

getCachedPath <- function(url, rname){
    bfc <- BiocFileCache() # fire up cache
    res <- bfcquery(bfc, url, field="fpath", exact=TRUE)
    if (bfccount(res) == 0L)
        cachedFilePath <- bfcadd(bfc, rname=rname, fpath=url)
    else
        cachedFilePath <- bfcpath(bfc, res[["rid"]])
    cachedFilePath
}

load(getCachedPath(sce.blood.url, "sce.blood"))
load(getCachedPath(sce.decidua.url, "sce.decidua"))

We have downloaded the two SingleCellExperiment objects sce.blood and sce.decidua. Next we will randomly select a subset of genes with which we will proceed, a whole analysis would be out of scope for this vignette.

set.seed(28)
randgenes <- sample(rownames(sce.blood), 2500, replace=FALSE)
sce.blood <- sce.blood[randgenes, ]
sce.decidua <- sce.decidua[randgenes, ]

The input data to wasserstein.sc always discriminates two conditions. It can be given either as a data matrix and a vector explaining the condition of each column, or a SingleCellExperiment for each condition.

We will proceed to use the two SingleCellExperiment objects we loaded. The SingleCellExperiment objects passed to wasserstein.sc must contain a counts assay each.

assays(sce.blood)
#> List of length 1
#> names(1): counts
assays(sce.decidua)
#> List of length 1
#> names(1): counts

Before testing for differential expression in the two conditions, the counts should be normalized for cell or gene-specific biases. We can skip this here since the example data has already been normalized and an entire analysis would be out of scope for this vignette. The detailed processing steps performed on this data can be seen here.

wsres <- wasserstein.sc(sce.blood, sce.decidua, "OS")
head(wsres, n=10L)
#>                                   d.wass    d.wass^2    d.comp^2     d.comp
#> GNPTG-ENSG00000090581         0.20573251 0.042325867 0.042627776 0.20646495
#> HIVEP3-ENSG00000127124        0.20588614 0.042389102 0.043288184 0.20805813
#> C6orf47-AS1-ENSG00000227198   0.00000000 0.000000000 0.000000000 0.00000000
#> RP11-377D9.3-ENSG00000255621  0.00000000 0.000000000 0.000000000 0.00000000
#> BBOX1-ENSG00000129151         0.00000000 0.000000000 0.000000000 0.00000000
#> CEP290-ENSG00000198707        0.16215300 0.026293594 0.025579340 0.15993542
#> FOXRED2-ENSG00000100350       0.07682033 0.005901363 0.005922216 0.07695594
#> RP11-166D18.1-ENSG00000251471 0.00000000 0.000000000 0.000000000 0.00000000
#> BBX-ENSG00000114439           0.39587040 0.156713370 0.157161283 0.39643572
#> MESTIT1-ENSG00000272701       0.00000000 0.000000000 0.000000000 0.00000000
#>                                   location        size      shape       rho
#> GNPTG-ENSG00000090581         4.142331e-04 0.019841603 0.02237194 0.9344195
#> HIVEP3-ENSG00000127124        3.160493e-03 0.017597022 0.02253067 0.8807532
#> C6orf47-AS1-ENSG00000227198   0.000000e+00 0.000000000 0.00000000 0.0000000
#> RP11-377D9.3-ENSG00000255621  0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBOX1-ENSG00000129151         0.000000e+00 0.000000000 0.00000000 0.0000000
#> CEP290-ENSG00000198707        9.538226e-04 0.004271303 0.02035421 0.7916142
#> FOXRED2-ENSG00000100350       4.165072e-05 0.005880565 0.00000000 0.0000000
#> RP11-166D18.1-ENSG00000251471 0.000000e+00 0.000000000 0.00000000 0.0000000
#> BBX-ENSG00000114439           2.463632e-02 0.002027704 0.13049726 0.8288606
#> MESTIT1-ENSG00000272701       0.000000e+00 0.000000000 0.00000000 0.0000000
#>                                     pval p.ad.gdp N.exc perc.loc perc.size
#> GNPTG-ENSG00000090581         0.19590000       NA    NA     0.97     46.55
#> HIVEP3-ENSG00000127124        0.08170000       NA    NA     7.30     40.65
#> C6orf47-AS1-ENSG00000227198   1.00000000       NA    NA      NaN       NaN
#> RP11-377D9.3-ENSG00000255621  1.00000000       NA    NA      NaN       NaN
#> BBOX1-ENSG00000129151         1.00000000       NA    NA      NaN       NaN
#> CEP290-ENSG00000198707        0.22290000       NA    NA     3.73     16.70
#> FOXRED2-ENSG00000100350       0.59540000       NA    NA     0.70     99.30
#> RP11-166D18.1-ENSG00000251471 1.00000000       NA    NA      NaN       NaN
#> BBX-ENSG00000114439           0.00059994       NA    NA    15.68      1.29
#> MESTIT1-ENSG00000272701       1.00000000       NA    NA      NaN       NaN
#>                               perc.shape decomp.error    pval.adj
#> GNPTG-ENSG00000090581              52.48  0.007132959 0.727373887
#> HIVEP3-ENSG00000127124             52.05  0.021210199 0.380353818
#> C6orf47-AS1-ENSG00000227198          NaN  0.000000000 1.000000000
#> RP11-377D9.3-ENSG00000255621         NaN  0.000000000 1.000000000
#> BBOX1-ENSG00000129151                NaN  0.000000000 1.000000000
#> CEP290-ENSG00000198707             79.57  0.027164565 0.796071429
#> FOXRED2-ENSG00000100350             0.00  0.003533602 1.000000000
#> RP11-166D18.1-ENSG00000251471        NaN  0.000000000 1.000000000
#> BBX-ENSG00000114439                83.03  0.002858166 0.009802941
#> MESTIT1-ENSG00000272701              NaN  0.000000000 1.000000000

The 2-Wasserstein single-cell testing procedure is performed on all genes in the data set – one line of the output describes the result of testing one gene.

Note: The two-stage (TS) procedure includes a test for differential proportions of zero gene expression, considering the proportion of zeroes over all conditions and genes in a model. It may yield incorrect results, if it is applied on only a subset of genes is given.

Note: For reproducible p-values, see the “seed” argument.

2-Wasserstein single cell test output
Output Column Description
d.wass 2-Wasserstein distance between the two samples
d.wass^2 squared 2-Wasserstein distance between the two samples
d.comp^2 squared 2-Wasserstein distance (decomposition approx.)
d.comp 2-Wasserstein distance (decomposition approx.)
location location term in the squared 2-Wasserstein decomposition
size size term in the squared 2-Wasserstein decomposition
shape shape term in the squared 2-Wasserstein decomposition
rho correlation coefficient in the quantile-quantile plot
pval p-value of the semi-parametric test
p.ad.gpd p-value of the Anderson-Darling test to check whether the GPD actually fits the data well *
N.exc number of exceedances (starting with 250 and iteratively decreased by 10 if necessary) that are required to obtain a good GPD fit (i.e. p-value of Anderson-Darling test greater or eqaul to 0.05) *
perc.loc fraction (in %) of the location part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation
perc.size fraction (in %) of the size part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation
perc.shape fraction (in %) of the shape part with respect to the overall squared 2-Wasserstein distance obtained by the decomposition approximation
decomp.error relative error between the squared 2-Wasserstein distance computed by the quantile approximation and the squared 2-Wasserstein distance computed by the decomposition approximation
p.zero p-value of the test for differential proportions of zero expression (logistic regression model) **
p.combined combined p-value of p.nonzero and p.zero obtained by Fisher’s method **
pval.adj adjusted p-value of the semi-parametric 2-Wasserstein distance-based test according to the method of Benjamini-Hochberg
p.adj.zero adjusted p-value of the test for differential proportions of zero expression (logistic regression model) according to the method of Benjamini-Hochberg **
p.adj.combined adjusted combined p-value of p.nonzero and p.zero obtained by Fisher’s method according to the method of Benjamini-Hochberg **

* only if GDP fitting is performed (NA otherwise); ** only if the two-stage test is run

See Also

Session Info

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] BiocFileCache_1.12.0        dbplyr_1.4.3               
#>  [3] SingleCellExperiment_1.10.0 SummarizedExperiment_1.18.0
#>  [5] DelayedArray_0.14.0         matrixStats_0.56.0         
#>  [7] Biobase_2.48.0              GenomicRanges_1.40.0       
#>  [9] GenomeInfoDb_1.24.0         IRanges_2.22.0             
#> [11] S4Vectors_0.26.0            BiocGenerics_0.34.0        
#> [13] waddR_1.2.0                
#> 
#> loaded via a namespace (and not attached):
#>  [1] statmod_1.4.34         tidyselect_1.0.0       xfun_0.13             
#>  [4] purrr_0.3.4            splines_4.0.0          lattice_0.20-41       
#>  [7] vctrs_0.2.4            htmltools_0.4.0        yaml_2.2.1            
#> [10] blob_1.2.1             rlang_0.4.5            nloptr_1.2.2.1        
#> [13] pillar_1.4.3           glue_1.4.0             DBI_1.1.0             
#> [16] BiocParallel_1.22.0    rappdirs_0.3.1         bit64_0.9-7           
#> [19] GenomeInfoDbData_1.2.3 lifecycle_0.2.0        stringr_1.4.0         
#> [22] zlibbioc_1.34.0        coda_0.19-3            memoise_1.1.0         
#> [25] evaluate_0.14          knitr_1.28             curl_4.3              
#> [28] Rcpp_1.0.4.6           arm_1.11-1             XVector_0.28.0        
#> [31] abind_1.4-5            lme4_1.1-23            bit_1.1-15.2          
#> [34] digest_0.6.25          stringi_1.4.6          dplyr_0.8.5           
#> [37] grid_4.0.0             tools_4.0.0            bitops_1.0-6          
#> [40] magrittr_1.5           RCurl_1.98-1.2         tibble_3.0.1          
#> [43] RSQLite_2.2.0          crayon_1.3.4           pkgconfig_2.0.3       
#> [46] ellipsis_0.3.0         MASS_7.3-51.6          Matrix_1.2-18         
#> [49] minqa_1.2.4            assertthat_0.2.1       rmarkdown_2.1         
#> [52] httr_1.4.1             boot_1.3-25            R6_2.4.1              
#> [55] nlme_3.1-147           compiler_4.0.0