---
title: "Detecting and correcting batch effects with BEclear"
author: 
    - name: David Rasp
      affiliation:
        Center for Bioinformatics, Saarland University, Saarbruecken, Germany
      email: David.J.Rasp@gmail.com
    - name: Markus Merl
package: BEclear
abstract: |
    We show in this tutorial how to use the BEclear package [@Akulenko2016] to detect and correct
    batch effects in methylation data. Even though BEclear was developed for
    the use on methylation data, it can also be used to find and correct batch 
    effects in other kinds of data.
    The central method of BEclear is based on Latent Factor Models [@Candes2009], which can in 
    theory be used on every matrix containing real numbers to predict missing values.
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
            toc_float: true
bibliography: "`r system.file('REFERENCES.bib', package = 'BEclear')`"
vignette: |
  %\VignetteIndexEntry{BEclear tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(pander)
panderOptions('knitr.auto.asis', TRUE)
panderOptions('plain.ascii', TRUE)
```

# Introduction

We guide you through the individual steps of the `r BiocStyle::Biocpkg("BEclear")`
package in their own chapters. They will follow in the logical order of an 
example of correcting some batch affected DNA methylation data. 
This article should only give a small tutorial,
more details about the individual methods can always be found in the help
sections of the `r BiocStyle::Biocpkg("BEclear")` package, e.g. through typing
`calcBatchEffects` in the R environment with the package loaded.
To work with the methods contained in the BEclear package, a matrix or
data.frame with genes as row-names and samples as column names as well as a
samples data.frame with the first column named "sample\_id" and the second
column named "batch\_id" is needed as input.

# Installation

`r BiocStyle::Biocpkg("BEclear")` is available on Bioconductor. To install it 
you can therefore use the `r BiocStyle::Biocpkg("BiocManager")`:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("BEclear")
```

Otherwise you can also install `r BiocStyle::Githubpkg("David-J-R/BEclear")`
from its Github repository by the following command:

```{r, eval=FALSE}
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("David-J-R/MoDentify")
```

We however recommend installing it through Bioconductor, as this takes care of 
installing the dependencies and furthermore you can refer to the release of 
Bioconductor, when using our package, which enables you to reproduce the exact 
conditions of your run.

During the compilation of the code, many parts of the software will be automatically
tested for correct execution and reproduction of expected results. This is implemented 
in form of unit tests with the help of the `r BiocStyle::CRANpkg("testthat")` package.

When done with the installation you can simply load the package by typing:

```{r}
library(BEclear)
```

# Data

The beta values stored in the ex.data matrix were obtained from level 3 BRCA
data from the TCGA portal [@TCGA]. Generally, beta values are calculated by
dividing the methylated signal by the sum of the unmethylated and methylated
signals from a DNA methylation microrarray. In the level 3 TCGA data, this
calculation has already been done. The sample data used here contains averaged
beta values of probes that belong to promoter regions of single genes. Another
possibility would be to use beta values of single probes, whereby the probe
names should then be used instead of the gene names as rownames of the matrix.

You can load our sample data via the following command:

```{r data}
data("BEclearData")
```

It contains one matrix with the beta values:

```{r}
knitr::kable(ex.data[1:10,1:5], caption = 'Some entries from the example data-set')
```

And one data.frame containing the assignment of samples to batches:

```{r}
knitr::kable(ex.samples[1:10,], caption = 'Some entries from the example sample annotation')
```

# Detection of batch effects

For the detection of batch effects we calculate the median difference between the 
beta values of a gene in a batch and the values of this gene in all other batches. 
Furthermore we use a non-parametric Kolmogorov-Smirnov test (`ks.test`) to  compare the
distribution of the beta value for this gene in the batch and the other batches.

If one gene in a batch has a p-value determined by the `ks.test` of less or equal
0.01 and a median difference of greater or equal 0.05 it is considered batch effected.

## Detection

For the calculation of the batch effects you just use the `calcBatchEffects` function.
It calculates both median difference and p-value. By default we correct the p-values
by the false discovery rate developed by @BH, but you can use all adjustment
methods covered by `p.adjust.methods`.

```{r detection, cache=TRUE}
batchEffect <- calcBatchEffects(
  data = ex.data, samples = ex.samples,
  adjusted = TRUE, method = "fdr"
)
mdifs <- batchEffect$med
pvals <- batchEffect$pval
```


## Summary

To see which genes in which batches are effected you use the `calcSummary` function 
as follows:

```{r summary, cache=TRUE}
summary <- calcSummary(medians = mdifs, pvalues = pvals)
knitr::kable(head(summary), caption = 'Summary over the batch affected gene-sample combination of the example data set')
```

## Scoring 

Furthermore you can calculate a batch score for a whole batch to determine the
severeness how it is affected.

```{r score, cache=TRUE}
score <- calcScore(ex.data, ex.samples, summary, dir = getwd())
knitr::kable(score, caption = 'Batch scores of the example data-set')
```

# Imputation of missing values

For the imputation of missing values we use a slightly modified version of the
stochastic gradient descent method described by @Koren2009. 
In this section we will describe our implementation of this method and how to 
use it.


We assume that our complete data matrix \(D_{ij}\) can be described by the effects of
a matrix \(L_i\), which represents the effect of the features (genes in our case)
and a matrix \(R_j\) describing the effect of the samples in the following way:

\begin{equation}
D_{ij} = L_{i}^{T} \times R_{j} .
(\#eq:assumption)
\end{equation}

The method can either be run on the complete data set or the data set can be 
divided into blocks on which the method is applied.
This division into blocks allows for parallelisation of the method, which can be 
useful to speed up the process. We have found that a block-size of 60x60 works 
well[@Akulenko2016].

The error for each block is calculated in the following way:

\begin{equation}
  errorMatrix_{ij} = Block_{ij} - L_{i}^{T} \times R_{j} .
  (\#eq:errormatrix)
\end{equation}

We try to minimize the following loss function through a gradient descent:

\begin{equation}
  min_{L, R}  \sum_{ij \in K}(errorMatrix_{ij}^2) + \lambda \times
  (\left\lVert L_{i}\right\rVert_{F}^{2} + 
  \left\lVert R_{j}\right\rVert_{F}^{2} ).
  (\#eq:loss)
\end{equation}
Where \(K\) is the set of tuples \((i,j)\) for which the value is present. 
\(\lambda\) is the penalty coefficient, which controls how restrictive the 
selection of variables should be. The default of \(\lambda\) is 1.

Another coefficient \(\gamma\) controls the size of the step by which the 
two matrices \(L_i\) and \(R_j\) are modified. It is initialized 
by default with 0.01 and its value changes during the iterations (epochs).

For the first iteration the matrices \(L_i\) and \(R_j\) are filled with random values
generated by the `rnorm` function from the `r BiocStyle::Rpackage("stats")` 
package and the initial loss and error matrix are calculated.

Then for each iteration the following is done:
    
* \(L_i\) and \(R_j\) are modified proportional by \(\gamma\) through the following 
calculation:

    + \begin{equation}
      L_i = L_i + 2 \times \gamma \times  (errorMatrix_{ij} \times R_j - \lambda \times L_i).
      (\#eq:Lmod)
      \end{equation}

    + \begin{equation}
      R_j = R_j + 2 \times \gamma \times (errorMatrix_{ij} \times L_i - \lambda \times R_j).
      (\#eq:Rmod)
      \end{equation}

* Then the new error matrix and loss are calculated.
* If the old loss is smaller than the new one: 
    + \(\gamma = \gamma \div 2.\)
* Else:
    + \(\gamma = \gamma \times 1.05.\)
    
The \(L_i\) and \(R_j\) matrices at the end of the last iteration are then used to 
impute the missing data. The default number of iterations is 50.

## Usage

First you have to set the found batch effect values to NAs. You can do this
by using the `clearBEgenes` function:

```{r clearBE, cache=TRUE}
cleared.data <- clearBEgenes(ex.data, ex.samples, summary)
```
In case you're using `r BiocStyle::Biocpkg("BEclear")` not for correcting batch
effects, but just for the data imputation, you would have to set the values you
want to impute to NA, if they not already are. 

For the data imputation you use the `imputeMissingData` function:

```{r imputation, cache=TRUE}
corrected.data <- imputeMissingData(cleared.data,
  rowBlockSize = 60,
  colBlockSize = 60, epochs = 50,
  outputFormat = "", dir = getwd()
)
```

If you set rowBlockSize and colBlockSize to 0 the matrix will not be divided into 
block and the gradient descent will be applied to the matrix as a whole.

## Replacing values outside the boundaries

Note that sometimes during the prediction, it can happen that values beyond the
boundaries of beta values are returned, that means values smaller than zero or
greater than one. `findWrongValues` simply returns a list of these values,
together with the position in the output matrix, `replaceOutsideValues` corrects
these by simply setting the wrong values to zero or one, respectively. Note that 
these methods are especially designed for the prediction of beta values from 
DNA methylation data, which only take on values between 0 and 1. 


```{r replace, cache=TRUE}
corrected.data.valid<-replaceOutsideValues(corrected.data)
```

In this case there were no values to be replaced.


# Overall correction

Besides the individual methods BEclear also offers an overall method, which 
executes all the described previous steps in one call. It also applies some 
preprocessing to your data set if necessary.

```{r correction, cache=TRUE}
result <- correctBatchEffect(data = ex.data, samples = ex.samples)
```

Returned is a list containing all results from the executed functions.

# Parallelization

For parallelization we use the `r BiocStyle::Biocpkg("BiocParellel")` package.
However by default all methods are executed in serial mode.
The methods `CalcBatchEffect`, `imputeMissingData` and `correctBatchEffect` 
support parallelization through there parameter `BPPARAM`, which takes a `BiocParallel::BiocParallelParam` class as an argument. 

Type the following to get an overview over the supported evaluation environments:

```{r}
?BiocParallel::BiocParallelParam
```
# Plotting

Additionally `r BiocStyle::Biocpkg("BEclear")` also includes a method for 
plotting the batch effects.
Let us now use the `makeBoxplot` to compare the distributions of the values
in the different samples before and after the batch effect correction:

```{r boxplot1, fig.wide = TRUE, fig.cap = "Distribution of the example beta values grouped by sample"}
makeBoxplot(ex.data, ex.samples, score,
  bySamples = TRUE,
  col = "standard", main = "Example data", xlab = "Batch",
  ylab = "Beta value", scoreCol = TRUE)
```

```{r boxplot2, fig.wide = TRUE, fig.cap = "Distribution of the corrected beta values grouped by sample"}
makeBoxplot(corrected.data, ex.samples, score,
  bySamples = TRUE,
  col = "standard", main = "Corrected example data",
  xlab = "Batch", ylab = "Beta value", scoreCol = FALSE)
```

# Session info {.unnumbered}

Here is the output of `sessionInfo()` on the system on which this document 
was compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r sessionInfo, echo=FALSE}
pander(sessionInfo(), compact=TRUE)
```

# References {.unnumbered}