---
title: "Network comparison: consensus modules and module preservation"
author:
- name: Fabricio Almeida-Silva
  affiliation: Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, Brazil
- name: Thiago Motta Venancio
  affiliation: Universidade Estadual do Norte Fluminense Darcy Ribeiro, RJ, Brazil
output:
  BiocStyle::html_document:
    toc: true
    number_sections: yes
bibliography: vignette3.bib
vignette: |
  %\VignetteIndexEntry{Network comparison: consensus modules and module preservation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = TRUE,
  warning = FALSE,
  cache = FALSE,
  fig.align = 'center',
  fig.width = 5,
  fig.height = 4
)
```

# Installation

```{r installation, eval=FALSE}
if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')

BiocManager::install("BioNERO")
```

```{r load_package}
# Load package after installation
library(BioNERO)
set.seed(12) # for reproducibility
```

# Introduction

Comparing different coexpression networks can reveal relevant biological 
patterns. For instance, seeking for **consensus modules** can identify 
coexpression modules that occur in all data sets regardless of natural 
variation and, hence, are core players of the studied phenotype. 
Additionally, **module preservation** within and across species can reveal 
patterns of conservation and divergence between transcriptomes. 
In this vignette, we will explore consensus modules and module preservation 
analyses with `BioNERO`. Although they seem similar, their goals are opposite:
while consensus modules identification focuses on the commonalities, module 
preservation focuses on the divergences. 


# Data loading and description

We will use RNA-seq data of maize (*Zea mays*) and rice (*Oryza sativa*) 
obtained from @Shin2020. 

```{r inspect_data}
data(zma.se)
zma.se

data(osa.se)
osa.se
```

All `BioNERO`'s functions for consensus modules and module preservation 
analyses require the expression data to be in a **list**. Each element of the 
list can be a SummarizedExperiment object (recommended) or an expression data 
frame with genes in row names and samples in column names. 


# Consensus modules

The most common objective in consensus modules identification is to find core 
modules across different tissues or treatments for the same species. 
For instance, one can infer GCNs for different types of cancer in human 
tissues (say prostate and liver) and identify modules that occur in all sets, 
which are likely core components of cancer biology. Likewise, one can also 
identify consensus modules across samples from different geographical origins 
to find modules that are not affected by population structure or kinship.

## Data preprocessing

Here, we will subset 22 random samples from the maize data twice and find 
consensus modules between the two sets.

```{r create_list_consensus}
# Preprocess data and keep top 2000 genes with highest variances
filt_zma <- exp_preprocess(zma.se, variance_filter = TRUE, n = 2000)

# Create different subsets by resampling data
zma_set1 <- filt_zma[, sample(colnames(filt_zma), size=22, replace=FALSE)]
zma_set2 <- filt_zma[, sample(colnames(filt_zma), size=22, replace=FALSE)]
colnames(zma_set1)
colnames(zma_set2)

# Create list
zma_list <- list(set1 = zma_set1, set2 = zma_set2)
length(zma_list)
```

## Identification of consensus modules

As described in the first vignette, before inferring the GCNs, we need to 
identify the optimal $\beta$ power that makes the network closer to a scale-free 
topology. We can do that with `consensus_SFT_fit()`.

```{r consensus_sft, message=FALSE}
cons_sft <- consensus_SFT_fit(zma_list, setLabels = c("Maize 1", "Maize 2"),
                              cor_method = "pearson")
```

This function returns a list with the optimal \beta powers and a summary plot, 
exactly as `SFT_fit()` does.

```{r sft_results, fig.width=6}
powers <- cons_sft$power
powers
cons_sft$plot
```

Now, we can infer GCNs and identify consensus modules across data sets.

```{r consensus_modules_identification}
consensus <- consensus_modules(zma_list, power = powers, cor_method = "pearson")
names(consensus)
head(consensus$genes_cmodules)
```

Finally, we can correlate consensus module eigengenes to sample metadata 
(here, plant tissues).[^1]

[^1]: **NOTE:** Blank grey cells in the heatmap represent correlation 
values that have opposite sign in the expression sets. For each correlation 
pair, consensus correlations are calculated by selecting the minimum value 
across matrices of consensus module-trait correlations.

```{r consensus_trait_cor, fig.width=5, fig.height=5}
consensus_trait <- consensus_trait_cor(consensus, cor_method = "pearson")
head(consensus_trait)
```

Users can also transpose the heatmap as in `module_trait_cor()`.


# Module preservation

Module preservation is often used to study patterns of evolutionary 
conservation and divergence across transcriptomes, an approach 
named *phylotranscriptomics*. This way, one can investigate how evolution 
shaped the expression profiles for particular gene families across taxa.

## Data preprocessing

To calculate module preservation statistics, gene IDs must be shared by 
the expression sets. For intraspecies comparisons, this is an easy task, 
as gene IDs are the same. However, for interspecies comparisons, users need 
to identify orthogroups between the different species and collapse the 
gene-level expression values to orthogroup-level expression values. 
This way, all expression sets will have common row names. We recommend 
identifying orthogroups with **OrthoFinder** [@Emms2015], as it is simple 
to use and widely used.[^2] Here, we will compare maize and rice expression 
profiles. The orthogroups between these species were downloaded from the 
PLAZA 4.0 Monocots database [@VanBel2018].

[^2]: **PRO TIP:** If you identify orthogroups with **OrthoFinder**, 
`BioNERO` has a helper function named `parse_orthofinder()` that parses 
the *Orthogroups.tsv* file generated by OrthoFinder into a suitable data frame 
for module preservation analysis. See `?parse_orthofinder` for more details.

```{r load_orthogroups}
data(og.zma.osa)
head(og.zma.osa)
```

As you can see, the orthogroup object for `BioNERO` must be a data frame with 
orthogroups, species IDs and gene IDs, respectively. Let's collapse gene-level 
expression to orthogroup-level with `exp_genes2orthogroups()`. By default, if 
there is more than one gene in the same orthogroup for a given species, their 
expression levels are summarized to the median. Users can also summarize 
to the mean.


```{r genes2orthogorups}
# Store SummarizedExperiment objects in a list
zma_osa_list <- list(osa = osa.se, zma = zma.se)

# Collapse gene-level expression to orthogroup-level
ortho_exp <- exp_genes2orthogroups(zma_osa_list, og.zma.osa, summarize = "mean")

# Inspect new expression data
ortho_exp$osa[1:5, 1:5]
ortho_exp$zma[1:5, 1:5]
```

Now, we will preprocess both expression sets and keep only the top 1000 
orthogroups with the highest variances for demonstration purposes. 

```{r create_list_preservation}
# Preprocess data and keep top 1000 genes with highest variances
ortho_exp <- lapply(ortho_exp, exp_preprocess, variance_filter=TRUE, n=1000)

# Check orthogroup number
sapply(ortho_exp, nrow)
```

## Calculating module preservation statistics

Now that row names are comparable, we can infer GCNs for each set. We will 
do that iteratively with lapply.

```{r gcn_inference}
# Calculate SFT power
power_ortho <- lapply(ortho_exp, SFT_fit, cor_method="pearson")

# Infer GCNs
gcns <- lapply(seq_along(power_ortho), function(n) 
  exp2gcn(ortho_exp[[n]], SFTpower = power_ortho[[n]]$power, 
          cor_method = "pearson")
  )

length(gcns)
```

Initially, module preservation analyses were performed with WGCNA's 
algorithm [@Langfelder2008]. However, the summary preservation statistics 
used by WGCNA rely on parametric assumptions that are often not met. 
For this purpose, the NetRep algorithm [@Ritchie2016] is more accurate 
than WGCNA, as it uses non-parametric permutation analyses. 
Both algorithms are implemented in `BioNERO` for comparison purposes, 
but we **strongly recommend** using the NetRep algorithm. 
Module preservation analysis can be performed with a 
single function: `module_preservation()`.

```{r preservation}
# Using rice as reference and maize as test
pres <- module_preservation(ortho_exp, 
                            ref_net = gcns[[1]], 
                            test_net = gcns[[2]], 
                            algorithm = "netrep")
```

None of the modules in rice were preserved in maize. This can be 
either due to the small number of orthogroups we have chosen or to the 
natural biological variation between species and sampled tissues. You can 
(and should) include more orthogroups in your analyses for a better view of 
transcriptional conservation between species.

# Identifying singletons and duplicated genes

Finally, `BioNERO` can identify singletons and duplicated genes 
with `is_singleton()`. This function returns logical vectors indicating if 
each of the input genes is a singleton or not.

```{r singleton}
# Sample 50 random genes
genes <- sample(rownames(zma.se), size = 50)
is_singleton(genes, og.zma.osa)
```


# Session info {.unnumbered}

This vignette was created under the following conditions:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References