---
title: "Introduction to magrene"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: 
    - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
    - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: 
    - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
    - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
    - College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
    - Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_depth: 2
date: "`r Sys.Date()`"
bibliography: vignette_bibliography.bib
vignette: >
  %\VignetteIndexEntry{Introduction to magrene}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.align = 'center',
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

# Introduction

Network motifs are the building blocks of complex networks, and they can be 
interpreted as small genetic circuits. Identifying and counting motifs in 
gene regulatory networks can reveal important aspects of the evolution of
transcriptional regulation. In particular, they can be used to explore
the impact of gene duplication in the rewiring of 
regulatory interactions [@mottes2021impact]. `magrene` aims to identify and
analyze motifs in (duplicated) gene regulatory networks to better comprehend
the role of gene duplication in network evolution. The figure below shows
the networks motifs users can identify with `magrene`.


```{r intro_motifs, echo = FALSE, fig.cap = "Network motifs and functions to identify them. Shaded boxes indicate paralogs. Regulators and targets are indicated in purple and green, respectively. Arrows indicate directed regulatory interactions, while dashed lines indicate protein-protein interaction."}
knitr::include_graphics("motifs_vignette.png")
```

# Installation

You can install `magrene` with:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("magrene")
```

Then, you can load the package with:

```{r load}
library(magrene)
set.seed(123) # for reproducibility
```

# Data description

For this vignette, we will use three example data sets:

1. **gma_grn:** A gene regulatory network inferred with 
`BioNERO` [@almeida2022bionero] using soybean RNA-seq data from the Soybean
Expression Atlas [@machado2020systematic].

2. **gma_ppi:** A protein-protein interaction (PPI) network for soybean
obtained from the STRING database [@szklarczyk2021string], filtered to contain
only physical interactions with confidence score > 0.4.

3. **gma_paralogs:** Soybean paralogous gene pairs derived by whole-genome,
tandem, proximal, transposed, and dispersed duplications (WGD, TD, PD, TRD,
and DD, respectively). This data set was obtained from @almeida2020exploring.

Networks are represented as edge lists. Let's take a look at the data.

```{r data}
data(gma_grn)
head(gma_grn)

data(gma_ppi)
head(gma_ppi)

data(gma_paralogs)
head(gma_paralogs)
```


# Finding motifs

Motifs can be found using `find_*` motifs, as shown in Figure 1. Each function
returns a character vector of motifs, and each motif has its own character
representation. Let's demonstrate how they work. 
For the sake of demonstration, we will only use WGD-derived paralogs. 
For GRN motifs, we will only use a smaller subset of the edge list.

```{r filtering}
# Include only WGD-derived paralogs
paralogs <- gma_paralogs[gma_paralogs$type == "WGD", 1:2]

# Keep only the top 30k edges of the GRN, remove "Weight" variable
grn <- gma_grn[1:30000, 1:2]
```

## PPI V

PPI V motifs are paralogous proteins that share an interaction partner.
To find them, you will use `find_ppi_v()`. The character representation of
PPI V motifs is:

$$
\text{paralog1-partner-paralog2}
$$

```{r ppi_v}
# Find PPI V motifs
ppi_v <- find_ppi_v(gma_ppi, paralogs)
head(ppi_v)
```

## V

V motifs are paralogous regulators that regulate the same target.
These motifs can be created when a regulator undergoes a 
small-scale duplication. To find them, you will use `find_v()`. 
The character representation of V motifs is:

$$
\text{regulator->target<-regulator}
$$

```{r V}
# Find V motifs
v <- find_v(grn, paralogs)
head(v)
```


## Lambda

Lambda motifs are the opposite of V motifs: a single regulator that regulates
two target genes that are paralogous. These motifs can be created when
an ancestral target gene undergoes a small-scale duplication. To find them
you will use `find_lambda()`. The character representation of lambda motifs is:

$$
\text{target1<-regulator->target2}
$$


```{r lambda}
lambda <- find_lambda(grn, paralogs)
head(lambda)
```

## Delta

Delta motifs are pretty similar to lambda motifs, but here we take 
protein-protein interactions between targets into account. Thus, they are
represented by a regulator that regulates two targets that interact at the
protein level. They can be created by the same evolutionary mechanism of
lambda motifs. To find them, you will use `find_delta()`. The character
representation of delta motifs is:

$$
\text{target1<-regulator->target2}
$$

To find delta motifs, you have two options:

1. Pass PPI edge list + a vector of previously identified lambda motifs
(recommended).

2. Pass PPI edge list + GRN edge list + paralogs data frame. In this option,
`find_delta()` will find lambda motifs first, then use the lambda vector to 
find delta motifs. If you have identified lambda motifs beforehand, it is 
way faster to pass the lambda vector to `find_delta()`, so you don't have
to do double work.

```{r delta}
# Find delta motifs from lambda motifs
delta <- find_delta(edgelist_ppi = gma_ppi, lambda_vec = lambda)
head(delta)
```

## Bifan

Bifan motifs are the most complex: they are represented by 
two paralogous regulators that regulate the same set of two paralogous targets.
They can be created when both the ancestral regulator and the ancestral target
are duplicated by small-scale duplications, or when the genome undergoes a 
whole-genome duplication event. To find these motifs, 
you will use `find_bifan()`. The character representation of bifan motifs is:

$$
\text{regulator1,regulator2->target1,target2}
$$

Under the hood, what `find_bifan()` does it to find lambda motifs involving
the same targets and check if their regulators are paralogs. Thus, if you have 
identified lambda motifs beforehand, it is much faster to simply give them
to `find_bifan()`, so it doesn't have to find them again.

```{r bifan}
# Find bifans from lambda motifs
bifan <- find_bifan(paralogs = paralogs, lambda_vec = lambda)
head(bifan)
```

# Counting motifs and evaluating significance

As motifs are simple character vectors, one can count their frequencies with 
the base R `length()` function. For example, let's count the frequency of 
each motif in our example data set:

```{r count}
count_df <- data.frame(
    Motif = c("PPI V", "V", "Lambda", "Delta", "Bifan"),
    Count = c(
        length(ppi_v), length(v), length(lambda), length(delta), length(bifan)
    )
)

count_df
```

However, unless you have another data set to which you can compare your
frequencies, counting is not enough. You need to evaluate the significance
of your motif frequencies. One way to do that is by comparing your observed
frequencies to a null distribution generated by counting motifs in
*N* (e.g., 1000) simulated networks [^1]. `magrene` allows you to generate
null distributions of motif frequencies for each motif type with the function
`generate_nulls()`. As generating the null distributions takes a bit of time,
we will demonstrate `generate_nulls()` with 5 permutations only.
As a rule of thumb, you would probably want *N* >= 1000.

```{r generate_nulls}
generate_nulls(grn, paralogs, gma_ppi, n = 5)
```

[^1]: **NOTE:** Simulated networks are created by node label permutation 
(i.e., resampling node labels without replacement). This method allows you
to have random networks that preserve the same degree of the 
original network. Hence, networks are called 
**degree-preserving simulated networks**.

As you can see, the output of `generate_nulls()` is a list of numeric vectors
with the frequency of each motif type in the simulated networks[^2]. 
You can use the null distribution to calculate Z-scores for your observed 
frequencies, which are defined as below:


[^2]: **Note on performance:** The function `generate_nulls()` can be 
parallelized thanks to the Bioconductor 
package `r BiocStyle::Biocpkg("BiocParallel")`. However, keep in mind that
parallelization is not always the best choice, because it may 
take longer to create multiple copies of your data to split into multiple 
cores than it takes to find motifs with a single core.


$$
Z = \frac{ n_{observed} - \bar{n}_{null} }{ \sigma_{null} }
$$

To calculate Z-scores, you can use the function `calculate_Z()`. As input,
you need to give a list of observed frequencies and a list of nulls.
Here, we will load pre-computed null distributions of *N* = 100.

```{r calculate_Z}
# Load null distros
data(nulls)
head(nulls)

# Create list of observed frequencies
observed <- list(
    lambda = length(lambda), 
    bifan = length(bifan), 
    V = length(v),
    PPI_V = length(ppi_v),
    delta = length(delta)
)
calculate_Z(observed, nulls)
```

Now that you have Z-scores, you can use a cut-off of your choice to 
define significance.

# Evaluting interaction similarity

Finally, another interesting pattern you may want to analyze is the interaction
similarity between paralogous gene pairs. Previous studies have demonstrated
that the Sorensen-Dice similarity is a suitable index for interaction 
similarity [@defoort2019evolution; @mottes2021impact], which is defined as:

$$
S(A,B) = \frac{2 \left| A \cap B \right|}{ \left|A \right| + \left| B \right|}
$$

where A and B are the interacting partners of nodes A and B. 
To calculate the Sorensen-Dice similarity for paralogous gene pairs,
you can use the function `sd_similarity()`. Let's demonstrate it by
calculating the similarity between paralogs in the PPI network.

```{r similarity}
sim <- sd_similarity(gma_ppi, paralogs)
head(sim)

summary(sim$sorensen_dice)
```


# Session information {.unnumbered}

This document was created under the following conditions:

```{r sessioninfo}
sessioninfo::session_info()
```

# References {.unnumbered}