---
title: "Use of IlluminaHumanMethylationEPICv2manifest"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use of IlluminaHumanMethylationEPICv2manifest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE, message = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.width = 6,
    fig.height = 6,
    fig.align = "center")
```

This package provides manifest for Illumina methylation EPIC array v2.0. The data is based on the file https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/methylationepic/MethylationEPIC%20v2%20Files.zip from 
  https://support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html.

A script for creating the data object in this package is in
  `system.files("scripts", "manifest.R", package = "IlluminaHumanMethylationEPICv2manifest")`, 
which is slightly adjusted from the original script in the **IlluminaHumanMethylationEPICmanifest** package.

When using with the **minfi** package, you can manually set the `"array"` element by providing the prefix (removing the string "manifest"):

```{r, eval = FALSE}
RGset = read.metharray.exp(...)

annotation(RGset)["array"] = "IlluminaHumanMethylationEPICv2"

# explained in the IlluminaHumanMethylationEPICv2anno.20a1.hg38 package
annotation(RGset)["annotation"] = "20a1.hg38"
```

## Probe IDs are coded differently in v1 and v2 packages

In **IlluminaHumanMethylationEPICmanifest**, probe IDs (e.g. in the format of "cg18478105") are unique in the array. But
in **IlluminaHumanMethylationEPICv2manifest**, probe IDs may be duplicated. Thus, we use the "illumina_ID" (column "IlmnID" in the manifest file, https://knowledge.illumina.com/microarray/general/microarray-general-reference_material-list/000001568) as the ID type for probes in v2-packages. The duplicated probes have the same probe sequence, but locate
randomly on the array. The illumina_ID is a combination of probe ID and a "duplication ID".

The original illumina IDs:

```{r}
library(IlluminaHumanMethylationEPICv2manifest)

illumina_ID = getManifestInfo(IlluminaHumanMethylationEPICv2manifest, "locusNames")
head(illumina_ID)
any(duplicated(illumina_ID))
```

Let's check how many probes have duplicated probe IDs. First remove the suffix to only keep the probe IDs.

```{r}
probe_ID = gsub("_.*$", "", illumina_ID)
```

Check the duplication of probe IDs:

```{r}
tb = table(probe_ID)
table(tb)
```

We can see in the most extreme case, a probe ID is repeated 10 times in the array. Let's check what it is:

```{r}
tb[tb == 10]
```

```{r}
illumina_ID[probe_ID == "cg06373096"]
```

## Compare EPIC array v1 and v2 probes

The following code compares probes in **IlluminaHumanMethylationEPICv2manifest** and **IlluminaHumanMethylationEPICmanifest**.

```{r}
library(IlluminaHumanMethylationEPICmanifest)

probe1 = getManifestInfo(IlluminaHumanMethylationEPICmanifest, "locusNames")
probe2 = getManifestInfo(IlluminaHumanMethylationEPICv2manifest, "locusNames")

# probes can be duplicated
probe1 = unique(probe1)

probe2 = gsub("_.*$", "", probe2)  # remove suffix
probe2 = unique(probe2)
```

Number of probes:

```{r}
length(probe1)
length(probe2)
```

Overlap of probes:

```{r, fig.width = 6, fig.height = 6}
library(eulerr)
plot(euler(list(v1 = probe1, v2 = probe2)), quantities = TRUE,
    main = "Compare total probes in the two arrays")
```

We can also compare Type I and Type II probes in the two arrays.

```{r}
TypeI_1 = IlluminaHumanMethylationEPICmanifest@data$TypeI
TypeI_2 = IlluminaHumanMethylationEPICv2manifest@data$TypeI
plot(euler(list(v1_TypeI = unique(TypeI_1$Name), v2_TypeI = unique(gsub("_.*$", "", TypeI_2$Name)))), 
    quantities = TRUE, main = "Compare TypeI probes in the two arrays")

TypeII_1 = IlluminaHumanMethylationEPICmanifest@data$TypeII
TypeII_2 = IlluminaHumanMethylationEPICv2manifest@data$TypeII
plot(euler(list(v1_TypeII = unique(TypeII_1$Name), v2_TypeII = unique(gsub("_.*$", "", TypeII_2$Name)))), 
    quantities = TRUE, main = "Compare TypeII probes in the two arrays")
```


## What if using the wrong array version?

We get the demo dataset of EPIC array v2.0 from the following link: https://support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html, with the name "Infinium MethylationEPIC v2.0 Demo Data Set (iScan)".


We randomly picked one pair of these files. To create this vignette dynamically, we hosted the files on GitHub and now we download
them and put them in a temporary folder. `206891110001_R01C01.zip` contains one `*_Grn.idat` and corresponding `*_Red.idat` file.

```{r}
tempdir = tempdir()
datadir = paste0(tempdir, "/206891110001")
dir.create(datadir, showWarnings = FALSE)

url = "https://github.com/jokergoo/IlluminaHumanMethylationEPICv2manifest/files/11008723/206891110001_R01C01.zip"
local = paste0(tempdir, "/206891110001_R01C01.zip")
download.file(url, dest = local, quiet = TRUE)

unzip(local, exdir = datadir)
```

Next we use `minfi::read.metharray.exp()` to import the original data.

```{r}
library(minfi)
RGset = read.metharray.exp(datadir)
RGset
```

We can see there are more than 1 million probes. Note **minfi** cannot
automatically detect the array type of this data ("array" is assigned to
"Unknown"). For users' interest, they can refer to the source code of `minfi:::.guessArrayTypes()` to see
how **minfi** identifies array type by comparing to the number of total probes in the array.

If we manually assign the EPIC array v1.0 to `RGset` and perform preprocessing:

```{r}
annotation(RGset)["array"] = "IlluminaHumanMethylationEPIC"
preprocessRaw(RGset)
```

Only around 100K probes remain, which is definitely wrong.

While if we choose the correct v2.0 array type:

```{r}
annotation(RGset)["array"] = "IlluminaHumanMethylationEPICv2"
preprocessRaw(RGset)
```

More than 900K probes remain, which fits the design of the EPIC array v2.0.

## Change illumina IDs to probe IDs in the final beta matrix

You can simply take the average of beta values of probes with the same probe IDs. The following
is an easy-to-understand but ineffcient way.

```{r}
obj = preprocessRaw(RGset)
# there can be more intermediate steps ...
beta = getBeta(obj)

beta2 = do.call(rbind, tapply(1:nrow(beta), gsub("_.*$", "", rownames(beta)), function(ind) {
    colMeans(beta[ind, , drop = FALSE])
}, simplify = FALSE))

head(beta2)
```

```{r, echo = FALSE}
unlink(datadir)
unlink(local)
```

## Session info

```{r}
sessionInfo()
```