---
title: "Word Cloud Annotation"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: 
  rmarkdown::html_vignette:
    fig_caption: true
    css: main.css
vignette: >
  %\VignetteIndexEntry{2. Word Cloud Annotation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center",
    dev = "jpeg"
)
options(width = 100)
```

In the plot generated by `simplifyGO()`, there is a word cloud annotation
attached to the heatmap which shows the general biological functions of the GO
terms in each cluster. In this vignette, I will demonstrate a general function
`anno_word_cloud()` that generates word cloud annotations to work with the
**ComplexHeatmap** package.

The `anno_word_cloud()` function is basically a wrapper of two components:
(1) constructing the word cloud (with `count_word()` and `word_cloud_grob()`)
and (2) constructing an annotation (with `ComplexHeatmap::anno_link()`) that
can be used in the `rowAnnotation()` function.

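To make the first component more concrete, here is a minimal base-R sketch of word counting. The helper `count_words_sketch()` is hypothetical and only approximates the idea; it is not the implementation of `count_word()`, which may tokenize and filter words differently.

```r
# Hypothetical helper: a base-R approximation of the word-counting step.
# It is NOT the package's count_word(); it only illustrates the idea.
count_words_sketch = function(text, exclude_words = character(0), max_words = 10) {
    # lower-case the text and split on anything that is not a letter
    words = unlist(strsplit(tolower(text), "[^a-z]+"))
    # drop empty strings and explicitly excluded words
    words = words[nchar(words) > 0 & !words %in% exclude_words]
    # count occurrences and keep the most frequent words
    freq = sort(table(words), decreasing = TRUE)
    head(freq, max_words)
}

desc = c("regulation of cell cycle", "cell cycle arrest", "regulation of cell growth")
freq = count_words_sketch(desc, exclude_words = "of")
freq
```

In a word cloud, such frequencies would then be mapped to font sizes, so that more frequent words are drawn larger.
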
`anno_word_cloud()` has two main arguments, `align_to` and `term`. `align_to`
defines how to align the annotation to the heatmap. As in
`ComplexHeatmap::anno_link()`, the value of `align_to` can be a list of row
indices where each index vector in the list corresponds to a word cloud. The
value of `align_to` can also be a categorical vector where rows with the same
level correspond to the same word cloud. If `align_to` is a categorical vector
and `term` is a list, the names of `term` should overlap with the levels in
`align_to`. When `align_to` is set as a categorical vector, normally the same
value is set to `row_split` in the main heatmap so that each row slice
corresponds to a word cloud.

`term` defines the description texts used for constructing the word clouds.
The value should have the same format as `align_to`. If `align_to` is a list,
`term` should also be a list. In this case, the length of the vectors in
`term` is not necessarily the same as in `align_to`, e.g. `length(term[[1]])`
is not necessarily equal to `length(align_to[[1]])`. In other words,
`term[[i]]` can contain arbitrary text as long as
`length(term) == length(align_to)`. If `align_to` is a categorical vector,
`term` can be a character vector with the same length as `align_to`, or a
list whose names correspond to the levels of `align_to`.

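The two formats of `align_to` and the corresponding `term` can be sketched in base R as follows. All variable names and texts here are hypothetical toy values, chosen only to show the shapes of the inputs.

```r
# A categorical vector: one group label per heatmap row (toy example, 12 rows)
row_group = rep(c("a", "b", "c"), times = c(5, 4, 3))

# The equivalent list form: one vector of row indices per group
align_list = split(seq_along(row_group), row_group)

# `term` as a list: one character vector per word cloud.
# The number of texts per group does not need to match the number of rows.
term_list = list(
    a = c("cell cycle arrest", "mitotic cell cycle"),
    b = c("immune response"),
    c = c("lipid transport", "lipid storage", "lipid binding", "fatty acid transport")
)

# `term` as a character vector: one text per heatmap row
term_vec = paste("description for row", seq_along(row_group))

# Sanity checks that mirror the rules described above
stopifnot(length(term_list) == length(align_list))
stopifnot(all(names(term_list) %in% unique(row_group)))
stopifnot(length(term_vec) == length(row_group))
```
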
Other arguments in `anno_word_cloud()` are straightforward to understand:

- `exclude_words`: Words excluded from the word cloud.
- `max_words`: Maximal number of words in each word cloud.
- `word_cloud_grob_param`: Graphic parameters sent to `word_cloud_grob()`. The value should be a named list.
- `fontsize_range`: Range of the font size. The value is a vector of length two.
- `bg_gp`: Graphic parameters for controlling the background.
- `side`: Side of the annotation relative to the heatmap. The value should be either "right" or "left".

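As a minimal sketch of how several of these arguments can be combined, the following uses a small toy matrix and toy texts. The variables `grp`, `toy_mat` and `toy_term` are hypothetical, and the specific values for `fontsize_range` and `bg_gp` are arbitrary choices, assuming `bg_gp` accepts a `grid::gpar()` object.

```r
library(grid)
library(ComplexHeatmap)
library(simplifyEnrichment)

# Toy inputs (hypothetical values, for illustration only)
set.seed(1)
grp = rep(c("a", "b"), each = 10)
toy_mat = matrix(rnorm(20 * 4), nrow = 20)
toy_term = list(
    a = c("regulation of cell cycle", "cell cycle arrest", "mitotic cell cycle"),
    b = c("immune response", "regulation of immune response", "innate immune response")
)

anno = anno_word_cloud(grp, toy_term,
    exclude_words = c("regulation"),                 # drop an uninformative word
    max_words = 5,                                   # at most 5 words per cloud
    fontsize_range = c(6, 14),                       # smallest and largest font size
    bg_gp = gpar(fill = "#EEEEEE", col = "#999999"), # background fill and border
    side = "right"
)

Heatmap(toy_mat, row_split = grp,
    right_annotation = rowAnnotation(wc = anno))
```
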
Specifically for GO terms, users do not need to provide the full GO
descriptions. Instead, they can provide only the GO IDs, and the descriptions
will be extracted automatically. In this case, users can use the helper
function `anno_word_cloud_from_GO()` and set the GO IDs via the `go_id`
argument. The format of `go_id` is the same as that of `term` in
`anno_word_cloud()`: either a list of GO ID vectors or a single vector. Note
again that if `go_id` is a list, `length(go_id[[1]])` is not necessarily equal
to `length(align_to[[1]])`.

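When `go_id` is a list and `align_to` is a categorical vector, it can help to compare the list names with the group labels before plotting. The following base-R sketch uses hypothetical toy objects `km_toy` and `go_list_toy`; the check itself is only a suggestion and is not part of the package.

```r
# Toy stand-ins for a grouping vector and a GO ID list
# (hypothetical values, for illustration only)
km_toy = c("1", "2", "2", "3", "1", "3")
go_list_toy = list(
    "1" = c("GO:0006915", "GO:0008219"),
    "2" = c("GO:0007049"),
    "3" = c("GO:0006955", "GO:0002376", "GO:0045087")
)

groups = unique(as.character(km_toy))

# Groups that would have no word cloud, and list names that match no group
missing_groups = setdiff(groups, names(go_list_toy))
unused_names = setdiff(names(go_list_toy), groups)

stopifnot(length(missing_groups) == 0)
stopifnot(length(unused_names) == 0)
```
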

In the first example, I generate 10 word clouds and attach them to a heatmap
whose rows are split into 10 groups.

```{r, fig.width = 7, fig.height = 4}
library(tm)
data(crude)
term = lapply(content(crude), as.character)[1:10]
mat = matrix(rnorm(100*10), nrow = 100)

split = sample(letters[1:10], 100, replace = TRUE)

names(term) = letters[1:10]  ## names of `term` must be the same as `unique(split)`
str(term, nchar.max = 80)

library(ComplexHeatmap)
library(simplifyEnrichment)

Heatmap(mat, row_split = split, 
	right_annotation = rowAnnotation(wc = anno_word_cloud(split, term))
)
```

The value for the first argument of `anno_word_cloud()` can also be explicitly converted into a list:

```{r, eval = FALSE}
align_to = split(seq_along(split), split)
Heatmap(mat, row_split = split, 
	right_annotation = rowAnnotation(wc = anno_word_cloud(align_to, term))
)
```

Argument `side` can be set to `"left"` to put the annotation on
the left of the heatmap:

```{r, fig.width = 7, fig.height = 4}
Heatmap(mat, row_split = split, row_dend_side = "right", row_title_side = "right",
	left_annotation = rowAnnotation(wc = anno_word_cloud(split, term, side = "left"))
)
```

The second example is more specific to GO terms. It visualizes a gene expression
matrix where rows are split into three groups by k-means clustering. GO enrichment analysis was applied
to the genes in the three groups separately. Variable `km` contains the k-means classification and `go_list`
contains a list of IDs of the significant GO terms in each group.

```{r}
load(system.file("extdata", "golub_sig_go.RData", package = "simplifyEnrichment"))
head(km)
str(go_list)
```

Make sure the names of `go_list` correspond to the levels in `km`. Adding word cloud annotations
for the enriched GO terms is then straightforward:

```{r, echo = FALSE}
ht_opt$message = FALSE
```

```{r, fig.width = 7, fig.height = 4}
library(circlize)
Heatmap(t(scale(t(sig_mat))), name = "z-score",
	col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
	show_row_names = FALSE, show_column_names = FALSE, 
	row_title = NULL, column_title = NULL,
	show_row_dend = FALSE, show_column_dend = FALSE,
	row_split = km) +
rowAnnotation(go = anno_word_cloud_from_GO(km, go_list, max_words = 30))
```

The major keywords appear to be very similar across the groups, so these words can be excluded:


```{r, fig.width = 7, fig.height = 4}
library(circlize)
Heatmap(t(scale(t(sig_mat))), name = "z-score",
	col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
	show_row_names = FALSE, show_column_names = FALSE, 
	row_title = NULL, column_title = NULL,
	show_row_dend = FALSE, show_column_dend = FALSE,
	row_split = km) +
rowAnnotation(go = anno_word_cloud_from_GO(km, go_list, max_words = 20,
	exclude_words = c("regulation", "process", "response", "positive", "cell")))
```