---
title: "Guide to ggmanh Package"
package: ggmanh
vignette: >
    %\VignetteIndexEntry{Guide to ggmanh Package}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignettePackage{ggmanh}
    %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
---

# Introduction

The package `ggmanh` is aimed to provide easy and direct access to visualisation to the GWAS / PWAS results while also providing many functionalities and features. 

Manhattan plot is commonly used to display significant Single Nucleotide Polymorphisms (SNPs) in Genome Wide Association Study (GWAS). The x-axis is divided into chromosomes, and SNPs are plotted in their respective positions. The y-axis typically represents $-10*log(p value)$. Majority of the points have low y-values, with some of the significant SNPs having high y-values. It is not uncommon to see strong association between SNPs and a phenotype, yielding a high y-value. This results in a wide y-scales in which SNPs with lower significance are squished. 

This function addresses this problem by rescaling the y-axis according to the range of y.
There are more features, such as labelling without overlap (with the help of [ggrepel package](https://ggrepel.slowkow.com)), reflecting the size of chromosomes along the x-axis, and displaying significant lines.

The package can be installed by:

```{r, eval = FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("ggmanh")
```


# Functions Overview

`manhattan_plot()` is a generic method that can take a `data.frame`, `MPdata`, or a `GRanges` object. The `data.frame`, at bare minimum, must have three columns containing: `chromosome`, `position`, and `p.value`. For a `GRanges` object, meta data column name for the p-value needs to be passed.

There are two steps to this function: preprocess data and plot data. The preprocessing step (accomplished with `manhattan_data_preprocess()`) preprocesses the data by calculating the new x-position to map to the plot (`new_pos` column added to the data), "thining" the data points, and saving other graphical information needed for manhattan plot, which is returned as a `MPdata` object. The plot step (accomplished with `manhattan_plot()`) determines if rescaling of the y-axis is needed and plots / saves the manhattan plot.

While using `manhattan_plot()` on `data.frame` is sufficient, it is fine to separately run pre-processing and plotting for customizing the plot without having to preprocess again and again.

# Example with Simulated GWAS

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

```{r setup}
library(ggmanh)
library(SeqArray)
```

First, create a simulated data to be used for demonstration.

```{r simulated_data}
set.seed(1000)

nsim <- 50000

simdata <- data.frame(
  "chromosome" = sample(c(1:22,"X"), size = nsim, replace = TRUE),
  "position" = sample(1:100000000, size = nsim),
  "P.value" = rbeta(nsim, shape1 = 5, shape2 = 1)^7
)
```

`manhattan_plot` expects data.frame to have at least three columns: chromosome, position, and p.value. 

```{r showsimdata}
head(simdata)
```

To avoid ambiguity in plotting, it is recommended that that the chromosome column is passed as a factor, or `chr.order` is specified. 

```{r order_chrom}
simdata$chromosome <- factor(simdata$chromosome, c(1:22,"X"))
```

This is the bare minimum to plot manhattan plot, and `manhattan_plot` can handle the rest.

```{r manh_minimum}
g <- manhattan_plot(x = simdata, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position", plot.title = "Simulated P.Values", y.label = "P")
g
```

`manhattan_plot` is also defaulted to display the GWAS p.value threshold at `5e-8` and `5e-7`. For now, the threshold is required; the values and color can be customized.

## Rescaling

The function is also suited to rescale the y-axis depending on the magnitude of p values.

Let's suppose that there are signals from chromosome 5 and 21, and the significant p-value is low for chromosome 21 and even lower for chromosome 5.

```{r simdata_w_signif}
tmpdata <- data.frame(
  "chromosome" = c(rep(5, 10), rep(21, 5)),
  "position" = c(sample(250000:250100, 10, replace = FALSE), sample(590000:600000, 5, replace = FALSE)),
  "P.value" = c(10^-(rnorm(10, 100, 3)), 10^-rnorm(5, 9, 1))
)

simdata <- rbind(simdata, tmpdata)
simdata$chromosome <- factor(simdata$chromosome, c(1:22,"X"))
```

```{r manh_badscale}
g <- manhattan_plot(x = simdata, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position", plot.title = "Simulated P.Values - Significant", rescale = FALSE)
g
```

The significant point at chromosome 5 has such a small p-value compared to other chromosomes that other significant poitns are less visible and the pattern at other chromosomes are masked. Rescaling attempts to fix this by changing the visual scale near the significant cutoff, removing the large white space in between.

```{r manh_goodscale}
g <- manhattan_plot(x = simdata, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position", plot.title = "Simulated P.Values - Significant", rescale = TRUE)
g
```

The place where jump happens is indicated on the y-axis by a double line ("=").

## Annotation

Annotation is also very simple. The function makes use of `ggrepel`'s `geom_repel_label()` to annotate each point while trying to avoid collision between texts.

All that is needed is to add a column containing labels. **The function labels everything included in the column**, so it is important to keep the labels for the points you are interested in, and set the rest with `NA` or `""`. From my understanding & experience, the differences between the two options are:

1. `NA`: The labels will likely not overlap, but may conceal the points. If there are a lot of points & labels, this is faster than using `""`.
2. `""`: The labels will avoid other labels as well as other points.

If you forget to do so and try to label all the points, the plotting may take hours.

```{r label_data}
sig <- simdata$P.value < 5e-07

simdata$label <- ""
simdata$label[sig] <- sprintf("Label: %i", 1:sum(sig))
```

To add the labels to the plot, specify the column name with labels to `label.colname`. Also, you can change various aspects of the plot by with different arguments. To avoid having to preprocess over and over, you can separately preprocess first and then customize the plot.

```{r plot_labelled_data_biglabel}
simdata$label2 <- ""
i <- (simdata$chromosome == 5) & (simdata$P.value < 5e-8)
simdata$label2[i] <- paste("Chromosome 5 label", 1:sum(i))

mpdata <- manhattan_data_preprocess(x = simdata, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position")

g <- manhattan_plot(x = mpdata, label.colname = "label", plot.title = "Simulated P.Values with labels")
g

g <- manhattan_plot(x = mpdata, label.colname = "label2", label.font.size = 2, plot.title = "Simulated P.Values with labels", max.overlaps = Inf, force = 5)
g
```

## Highlighting

You can also choose the colors for each point. You will need to add a column that categorizes the point to the data, and supply the column name to `highlight.colname` as well pass a character vector to `highlight.col`, which specifies the colop mapped to each value in `highlight.colname`. Then set `color.by.highlight = TRUE`.

You can see that you only have to preprocess the original data once while still making changes to the plot itself.

```{r highlight chromosome}
simdata$color <- "Not Significant"
simdata$color[simdata$P.value <= 5e-8] <- "Significant"

highlight_colormap <- c("Not Significant" = "grey", "Significant" = "black")

tmp <- manhattan_data_preprocess(
  simdata, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position",
  highlight.colname = "color", highlight.col = highlight_colormap
)

g_nohighlight <- manhattan_plot(tmp, plot.title = "No Highlight Points")
g <- manhattan_plot(tmp, plot.title = "Highlight Points", color.by.highlight = TRUE)
g_nohighlight
g
```

## thinPoints

Another simple, yet useful, feature of the package is a function called `thinPoints`. The function can be called manually or used as a part of `manhattan_data_preprocess`. It will partition the $log_{10}(pvalue)$ into `nbin` bins (`manhattan_data_preprocess` uses 200 bins to be conservative), and then randomly sample up to `n` points to effectively reduce the data while keeping the resulting output visually equal. This is done to keep the size of `MPdata`, which contains the data itself, small, and speed up the process of plotting.

## Zoom into Chromosome

If a value is passed to `chromosome` argument of the function `manhattan_plot` or `manhattan_data_preprocess`, then the plot zooms into a chromosome if you are interested in a specific chromosome.

**NOTE:** When `chromosome` is specified, the default behavior is to turn off `thinPoints` as the default `thin.n` value may be too small and may make the plot less cluttered than it actually is. The user can manually increase `thin.n` so that the visual output is not changed dramatically while keeping the data size small.

```{r chromosome}
manhattan_plot(simdata, chromosome = 5, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position")
```

# Annotation with GDS File

The package also comes with a small [GDS file](https://academic.oup.com/bioinformatics/article/33/15/2251/3072873) in a [SeqArray](https://bioconductor.org/packages/release/bioc/html/SeqArray.html) file format that contains chromosome, position, reference / alternate allele, and [Ensembl Variant Effect Predictor](https://useast.ensembl.org/info/docs/tools/vep/index.html) of highest consequence. We obtained exomic data from the [GnomAD database](https://gnomad.broadinstitute.org/downloads) in VCF format, then converted to GDS file. Finally, some cleaning was done to get this final GDS file. The gds data that comes with this package can be located with this code after installation.

```{r locate_gds, results="hide"}
default_gds_path <- system.file("extdata", "gnomad.exomes.vep.hg19.v5.gds", mustWork = TRUE, package = "ggmanh")
print(default_gds_path)

gds <- SeqArray::seqOpen(default_gds_path)
print(gds)
SeqArray::seqClose(gds)
```

The gds file contains `position`, `chromosome`, and `allele` (reference and alternate), which can be used to search variants. In this file, the gene symbol, consequence (by VEP), and the loss of function predictor (by LOFTEE) are stored in `annotation/symbol`, `annotation/consequence`, `annnotation/LoF`, respectively. We were able to keep the annotation data size small due to several reasons:

    * We have removed RS IDs, which is stored in `annotation/id`. Although the gds file has the capabilities to store the RS IDs, we removed them to save space
    * Each variant may have multiple Gene/Consequence predicted by VEP. For example, VEP predicts rs429358 to have an effect on APOE and TOMM40 genes. We've selected only one effect with the highest consequence in this case. Only one predicted effect is included for each variant.
    * The VCF data format has been converted to a GDS data format.
    
`gds_annotate()` is used to retrieve annotations.

```{r gds_annotation_demonstration}
sampledata <- data.frame(
  chromosome = c(1, 17, 19, 22),
  position = c(12422750, 10366900, 43439739, 39710145), 
  Reference = c("A", "T", "T", "G"),
  Alternate = c("G", "G", "C", "A")
)
gds_annotate(
  sampledata, annot.method = "position", 
  chr = "chromosome", pos = "position", ref = "Reference", alt = "Alternate"
  )
```

Annotations are returned in the same order as the rows in the `data.frame`. If the parameter `gdsfile` is empty (`NULL`), then the default GDS file is used.

There are two ways to retrieve annotations. One way is to match the variants by chromosome, position, reference allele, and alternate allele. Another way is to match by RS ID (the RS ID needs to be stored in "annotation/id" in the annotation gds file. Look up documentation for `SeqArray::seqSetFilterAnnotID` for more detail).

Therefore, you can use `gds_annotate` to quickly add annotations to the GWAS summary results, and then plot manhattan plot with annotations.

```{r manhattan_label_with_gds}
simdata$Reference <- NA
simdata$Alternate <- NA
simdata <- simdata[,c("chromosome", "position", "Reference", "Alternate", "P.value")]
sampledata$P.value <- 10^-rnorm(4, 14, 1)
sampledata <- sampledata[c("chromosome", "position", "Reference", "Alternate", "P.value")]
simdata_label <- rbind(simdata, sampledata)

# add annotation
simdata_label$label <- gds_annotate(x = simdata_label, annot.method = "position", chr = "chromosome", pos = "position", ref = "Reference", alt = "Alternate")

manhattan_plot(simdata_label, outfn = NULL, pval.colname = "P.value", chr.colname = "chromosome", pos.colname = "position", label.colname = "label")
```


# SessionInfo

```{r sessionInfo}
sessionInfo()
```