--- title: "Introduction to magrene" author: - name: Fabricio Almeida-Silva affiliation: - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium - name: Yves Van de Peer affiliation: - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium - College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China - Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa output: BiocStyle::html_document: self_contained: yes toc: true toc_depth: 2 date: "`r Sys.Date()`" bibliography: vignette_bibliography.bib vignette: > %\VignetteIndexEntry{Introduction to magrene} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = 'center', crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` # Introduction Network motifs are the building blocks of complex networks, and they can be interpreted as small genetic circuits. Identifying and counting motifs in gene regulatory networks can reveal important aspects of the evolution of transcriptional regulation. In particular, they can be used to explore the impact of gene duplication in the rewiring of regulatory interactions [@mottes2021impact]. `magrene` aims to identify and analyze motifs in (duplicated) gene regulatory networks to better comprehend the role of gene duplication in network evolution. The figure below shows the networks motifs users can identify with `magrene`. ```{r intro_motifs, echo = FALSE, fig.cap = "Network motifs and functions to identify them. Shaded boxes indicate paralogs. Regulators and targets are indicated in purple and green, respectively. Arrows indicate directed regulatory interactions, while dashed lines indicate protein-protein interaction."} knitr::include_graphics("motifs_vignette.png") ``` # Installation You can install `magrene` with: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("magrene") ``` Then, you can load the package with: ```{r load} library(magrene) set.seed(123) # for reproducibility ``` # Data description For this vignette, we will use three example data sets: 1. **gma_grn:** A gene regulatory network inferred with `BioNERO` [@almeida2022bionero] using soybean RNA-seq data from the Soybean Expression Atlas [@machado2020systematic]. 2. **gma_ppi:** A protein-protein interaction (PPI) network for soybean obtained from the STRING database [@szklarczyk2021string], filtered to contain only physical interactions with confidence score > 0.4. 3. **gma_paralogs:** Soybean paralogous gene pairs derived by whole-genome, tandem, proximal, transposed, and dispersed duplications (WGD, TD, PD, TRD, and DD, respectively). This data set was obtained from @almeida2020exploring. Networks are represented as edge lists. Let's take a look at the data. ```{r data} data(gma_grn) head(gma_grn) data(gma_ppi) head(gma_ppi) data(gma_paralogs) head(gma_paralogs) ``` # Finding motifs Motifs can be found using `find_*` motifs, as shown in Figure 1. Each function returns a character vector of motifs, and each motif has its own character representation. Let's demonstrate how they work. For the sake of demonstration, we will only use WGD-derived paralogs. For GRN motifs, we will only use a smaller subset of the edge list. ```{r filtering} # Include only WGD-derived paralogs paralogs <- gma_paralogs[gma_paralogs$type == "WGD", 1:2] # Keep only the top 30k edges of the GRN, remove "Weight" variable grn <- gma_grn[1:30000, 1:2] ``` ## PPI V PPI V motifs are paralogous proteins that share an interaction partner. To find them, you will use `find_ppi_v()`. The character representation of PPI V motifs is: $$ \text{paralog1-partner-paralog2} $$ ```{r ppi_v} # Find PPI V motifs ppi_v <- find_ppi_v(gma_ppi, paralogs) head(ppi_v) ``` ## V V motifs are paralogous regulators that regulate the same target. These motifs can be created when a regulator undergoes a small-scale duplication. To find them, you will use `find_v()`. The character representation of V motifs is: $$ \text{regulator->target<-regulator} $$ ```{r V} # Find V motifs v <- find_v(grn, paralogs) head(v) ``` ## Lambda Lambda motifs are the opposite of V motifs: a single regulator that regulates two target genes that are paralogous. These motifs can be created when an ancestral target gene undergoes a small-scale duplication. To find them you will use `find_lambda()`. The character representation of lambda motifs is: $$ \text{target1<-regulator->target2} $$ ```{r lambda} lambda <- find_lambda(grn, paralogs) head(lambda) ``` ## Delta Delta motifs are pretty similar to lambda motifs, but here we take protein-protein interactions between targets into account. Thus, they are represented by a regulator that regulates two targets that interact at the protein level. They can be created by the same evolutionary mechanism of lambda motifs. To find them, you will use `find_delta()`. The character representation of delta motifs is: $$ \text{target1<-regulator->target2} $$ To find delta motifs, you have two options: 1. Pass PPI edge list + a vector of previously identified lambda motifs (recommended). 2. Pass PPI edge list + GRN edge list + paralogs data frame. In this option, `find_delta()` will find lambda motifs first, then use the lambda vector to find delta motifs. If you have identified lambda motifs beforehand, it is way faster to pass the lambda vector to `find_delta()`, so you don't have to do double work. ```{r delta} # Find delta motifs from lambda motifs delta <- find_delta(edgelist_ppi = gma_ppi, lambda_vec = lambda) head(delta) ``` ## Bifan Bifan motifs are the most complex: they are represented by two paralogous regulators that regulate the same set of two paralogous targets. They can be created when both the ancestral regulator and the ancestral target are duplicated by small-scale duplications, or when the genome undergoes a whole-genome duplication event. To find these motifs, you will use `find_bifan()`. The character representation of bifan motifs is: $$ \text{regulator1,regulator2->target1,target2} $$ Under the hood, what `find_bifan()` does it to find lambda motifs involving the same targets and check if their regulators are paralogs. Thus, if you have identified lambda motifs beforehand, it is much faster to simply give them to `find_bifan()`, so it doesn't have to find them again. ```{r bifan} # Find bifans from lambda motifs bifan <- find_bifan(paralogs = paralogs, lambda_vec = lambda) head(bifan) ``` # Counting motifs and evaluating significance As motifs are simple character vectors, one can count their frequencies with the base R `length()` function. For example, let's count the frequency of each motif in our example data set: ```{r count} count_df <- data.frame( Motif = c("PPI V", "V", "Lambda", "Delta", "Bifan"), Count = c( length(ppi_v), length(v), length(lambda), length(delta), length(bifan) ) ) count_df ``` However, unless you have another data set to which you can compare your frequencies, counting is not enough. You need to evaluate the significance of your motif frequencies. One way to do that is by comparing your observed frequencies to a null distribution generated by counting motifs in *N* (e.g., 1000) simulated networks [^1]. `magrene` allows you to generate null distributions of motif frequencies for each motif type with the function `generate_nulls()`. As generating the null distributions takes a bit of time, we will demonstrate `generate_nulls()` with 5 permutations only. As a rule of thumb, you would probably want *N* >= 1000. ```{r generate_nulls} generate_nulls(grn, paralogs, gma_ppi, n = 5) ``` [^1]: **NOTE:** Simulated networks are created by node label permutation (i.e., resampling node labels without replacement). This method allows you to have random networks that preserve the same degree of the original network. Hence, networks are called **degree-preserving simulated networks**. As you can see, the output of `generate_nulls()` is a list of numeric vectors with the frequency of each motif type in the simulated networks[^2]. You can use the null distribution to calculate Z-scores for your observed frequencies, which are defined as below: [^2]: **Note on performance:** The function `generate_nulls()` can be parallelized thanks to the Bioconductor package `r BiocStyle::Biocpkg("BiocParallel")`. However, keep in mind that parallelization is not always the best choice, because it may take longer to create multiple copies of your data to split into multiple cores than it takes to find motifs with a single core. $$ Z = \frac{ n_{observed} - \bar{n}_{null} }{ \sigma_{null} } $$ To calculate Z-scores, you can use the function `calculate_Z()`. As input, you need to give a list of observed frequencies and a list of nulls. Here, we will load pre-computed null distributions of *N* = 100. ```{r calculate_Z} # Load null distros data(nulls) head(nulls) # Create list of observed frequencies observed <- list( lambda = length(lambda), bifan = length(bifan), V = length(v), PPI_V = length(ppi_v), delta = length(delta) ) calculate_Z(observed, nulls) ``` Now that you have Z-scores, you can use a cut-off of your choice to define significance. # Evaluting interaction similarity Finally, another interesting pattern you may want to analyze is the interaction similarity between paralogous gene pairs. Previous studies have demonstrated that the Sorensen-Dice similarity is a suitable index for interaction similarity [@defoort2019evolution; @mottes2021impact], which is defined as: $$ S(A,B) = \frac{2 \left| A \cap B \right|}{ \left|A \right| + \left| B \right|} $$ where A and B are the interacting partners of nodes A and B. To calculate the Sorensen-Dice similarity for paralogous gene pairs, you can use the function `sd_similarity()`. Let's demonstrate it by calculating the similarity between paralogs in the PPI network. ```{r similarity} sim <- sd_similarity(gma_ppi, paralogs) head(sim) summary(sim$sorensen_dice) ``` # Session information {.unnumbered} This document was created under the following conditions: ```{r sessioninfo} sessioninfo::session_info() ``` # References {.unnumbered}