---
title: "A TCGA dataset application"
author: "Selcen Ari"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
   toc: true
vignette: >
  %\VignetteIndexEntry{An TCGA dataset application}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ref.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# 1. Introduction

This vignette is about the integration of gene and miRNA pairs and their expression dataset and analysis. The sample dataset in this demonstration, which contains human miRNA:target pairs, was retrieved from [miRTarBase](http://mirtarbase.mbc.nctu.edu.tw/php/download.php) website (Release 7.0).

```{r message= FALSE, warning=FALSE}
library(ceRNAnetsim)
```

# 2. Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ceRNAnetsim")
```

# 3. Integration of dataset which includes only miRNA and gene expression values

## 3.1. miRNA:target pairs

```{r, warning= FALSE, message=FALSE}
data("mirtarbasegene")
head(mirtarbasegene)
```

**NOTE** if the mirna:target dataset includes miRNA genes as targets, the `priming_graph()` function can fail. Because, the function define to miRNAs and targets without distinguishing between uppercase or lowercase. 

## 3.2. Gene expression in normal and tumor samples

The gene and mirna expression counts of patient barcoded with TCGA-E9-A1N5 is retrieved from TCGA via `TCGAbiolinks` package  [@colaprico2015tcgabiolinks] from `Bioconductor`. The instructions of retrieving data can be found at `TCGAbiolinks` manual. 

For this step you don't have to use TCGA data, any other source or package can be utilized.

```{r}
data("TCGA_E9_A1N5_normal")
head(TCGA_E9_A1N5_normal)
```

```{r}
data("TCGA_E9_A1N5_tumor")
head(TCGA_E9_A1N5_tumor)
```

## 3.3. miRNA expression data

```{r}
data("TCGA_E9_A1N5_mirnatumor")
head(TCGA_E9_A1N5_mirnatumor)
```

```{r}
data("TCGA_E9_A1N5_mirnanormal")
head(TCGA_E9_A1N5_mirnanormal)
```

Here's the summary of size of each dataset

|Dataset name| Number of rows |
|------------|----------------|
| `mirtarbasegene`           | `r nrow(mirtarbasegene)`           |
| `TCGA_E9_A1N5_normal`      | `r nrow(TCGA_E9_A1N5_normal)`      |
| `TCGA_E9_A1N5_tumor`       | `r nrow(TCGA_E9_A1N5_tumor)`       |
| `TCGA_E9_A1N5_mirnanormal` | `r nrow(TCGA_E9_A1N5_mirnanormal)` |
| `TCGA_E9_A1N5_mirnatumor`  | `r nrow(TCGA_E9_A1N5_mirnatumor)`  |

## 3.4. Integrating and analysing data

All of these datasets are integrated using the code below resulting in  miRNA:target dataset that contains miRNA and gene expression values.

```{r}
TCGA_E9_A1N5_mirnagene <- TCGA_E9_A1N5_mirnanormal %>%
  inner_join(mirtarbasegene, by= "miRNA") %>%
  inner_join(TCGA_E9_A1N5_normal, 
             by = c("Target"= "external_gene_name")) %>%
  select(Target, miRNA, total_read, gene_expression) %>%
  distinct() 
```

Note: Some of genes have expression values more than one because some of tissue samples were sequenced in two medium separately. So, we select maximum expression values of that genes at following:

```{r, echo=FALSE}

TCGA_E9_A1N5_mirnagene%>%
  group_by(Target, miRNA)%>%
  count()%>%
  filter(n==2)

TCGA_E9_A1N5_mirnagene %>%    
  group_by(Target) %>%        
  mutate(gene_expression= max(gene_expression)) %>%
  distinct() %>%
  ungroup() -> TCGA_E9_A1N5_mirnagene

```

```{r}
head(TCGA_E9_A1N5_mirnagene)
```

When we compared the two gene expression dataset of TCGA-E9A1N5 patient, and selected a gene which has 30-fold increased expression, (gene name: HIST1H3H), this gene node will be used in the example. Note that the selected node must not be isolated one. If the an isolated node is selected the change in expression will not propagate in network. (You can see commands for node selection in the vignette [The auxiliary commands which can help to the users](https://selcenari.github.io/ceRNAnetsim/articles/auxiliary_commands.html))

Optionally, you can filter the low expressed gene nodes because they are not effective elements.

```{r}
TCGA_E9_A1N5_mirnagene <- TCGA_E9_A1N5_mirnagene%>%
  filter(gene_expression > 10)
```

The analysis is performed based on amounts of miRNAs and targets as seen. Firstly, we tried to find optimal iteration for the network when simulation start with *HIST1H3H* node. As an example, `simulation()` function was used with `cycle = 5` argument, this argument can be arranged according to network. Note that it can be appropriate that using greater number of `cycle` for comprehensive network objects.

```{r, warning=FALSE, fig.height=5, fig.width=6, fig.align='center', warning=FALSE}

simulation_res_HIST <- TCGA_E9_A1N5_mirnagene %>% 
  priming_graph(competing_count = gene_expression, 
                miRNA_count = total_read) %>% 
  update_how(node_name = "HIST1H3H", how =30) %>% 
  simulate(5)

simulation_res_HIST%>%
  find_iteration(plot=TRUE)

```

The graph was shown that the change in expression level of HIST1H3H results in weak perturbation efficiency, despite 30-fold change. The code shown below can be used for calculation of fold changes after simulation HIST1H3H gene to 30 fold:

```{r}

simulation_res_HIST%>%
  as_tibble()%>%
  mutate(FC= count_current/initial_count)%>%
  arrange(desc(FC))
```

And then, we tried to simulate the network with the gene which has higher expression value. For this, we selected *ACTB* node as shown in [The auxiliary commands which can help to the users](https://selcenari.github.io/ceRNAnetsim/articles/auxiliary_commands.html)

```{r, warning=FALSE, fig.height=5, fig.width=6, fig.align='center', warning=FALSE}
simulation_res_ACTB <- TCGA_E9_A1N5_mirnagene %>% 
  priming_graph(competing_count = gene_expression, 
                miRNA_count = total_read) %>% 
  update_how(node_name = "ACTB", how =1.87) %>% 
  simulate(5)

simulation_res_ACTB%>%
  find_iteration(plot=TRUE)
```

Following codes are shown entire gene fold changes after simulation ACTB gene to 1.87 fold:

```{r}

simulation_res_ACTB%>%
  as_tibble()%>%
  mutate(FC= count_current/initial_count)%>%
  arrange(desc(FC))
```

Note: it can be useful that you look at [The auxiliary commands which can help to the users](https://selcenari.github.io/ceRNAnetsim/articles/auxiliary_commands.html) for perturbation efficiency of ACTB gene by simulation with same conditions and different expression changes.

## 3.5. The sum of two conditions:

In a real biological sample, we tested perturbation efficiencies of two genes;
* one with low expression but high fold change (HIST1H3H, 30-fold increase in tumor) 
* another one with high expression but small change in expression level (ACTB, 1.87-fold increase in tumor)

With these two samples, it has been obtained that expression values of genes, rest of the perturbed gene, changed slightly.

Despite high fold change, former gene caused little perturbation. When the perturbation efficiencies of both of these genes are analysed, it has been oberved that HIST1H3H does not affect the other genes in given limit. On the contrary, high expressing gene with very low fold increase in tumor causes greater perturbation in the network. Additionaly, the perturbation efficiency of ACTB gene is quite high from HIST1H3H with 30-fold change, when ACTB is simulated with 30 fold-change.

Thus, if the perturbed node has lower target:total target ratio in group or groups, the efficiency of it can be weak, or vice versa. The efficiency of ACTB gene may be high for this reason, in comparison with HIST1H3H perturbation. In fact, it has been observed that ACTB has not strong perturbation efficiency too. This could be arisen from low miRNA:target ratio or ineffective target nodes which have very low expression levels.


# 4. Dataset (`huge_example`) which includes miRNA and gene expressions and miRNA:target interaction factors

## 4.1. Description of the *huge_example* dataset

Interactions between miRNAs and their targets can be analyzed after the integration of miRNA and targets via various datasets. As an example, we prepared the *huge_example* dataset. It was generated by integrating:

* Next-generation RNA sequencing data of a breast cancer patient from TCGA (patient id:TCGA_A7_A0CE) 
* The  microRNA expression values of the same breast cancer patient.
* High-throughput miRNA:target determination datasets. We utilised the microRNA:target gene dataset which was integrated from high-throughput experimental studies([CLASH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3650559/) & [CLEAR-CLiP](https://www.nature.com/articles/ncomms9864) methods). These datasets give exact microRNA:target pairs because of these methods base on chimeric reading of pairs after binding of microRNAs and their targets. For this reason, the datasets contain detailed information for microRNA:target pairs such as interacted bases (in terms of degree of complementarity) on microRNAs and target mRNAs, location of interaction region (seed) on mRNA and estimated binding energies of pairs.


Below, only 6 rows from total of 26,176 rows are shown.

```{r, message=FALSE, warning=FALSE}
data("huge_example")
head(huge_example)
```

## 4.2. Select a node as trigger

The node that initiates simulation can be determined according your interest or research.

The dataset, which is a data frame, can be manipulated with [tidyverse packages](https://www.tidyverse.org/packages/). As an example, competing RNAs targeted by less than 5 miRNAs are eliminated to make the network manageable size.

```{r}
filtered_example <- huge_example %>%
  add_count(competing) %>%
  filter(n > 5) %>%
  select(-n)

head(filtered_example)
```

On the other hand, we chose the node *GAPDH* according to interaction count of the nodes. With the simulation, the graph was visualized after node *GAPDH* was increased to five fold.

```{r, fig.height=5, fig.width=6, fig.align='center', warning=FALSE}
simulation_GAPDH <- filtered_example %>%
  priming_graph(competing_count = competing_counts, 
                miRNA_count = mirnaexpression_normal, 
                aff_factor = Energy) %>%
  update_how("GAPDH", 5) 

simulation_GAPDH%>%
  vis_graph(title = "Distribution of GAPDH gene node")
```

Let's visualize each step of simulation via `simulate_vis()` function.

```{r, fig.height=5, fig.width=6, fig.align='center', warning=FALSE, fig.show='hide'}

simulation_GAPDH%>%
  simulate_vis(title = "GAPDH over expression in the real dataset", 3)
```

![GAPDH over expression in real dataset](gifs/realexample.gif#center)

Now, we can track changes in expression levels at every node for 3 cycles when GAPDH is overexpressed 5-fold.

- After increase in GAPDH expression level in the first graph, the responses of the other competing elements to the GAPDH distributions were calculated. 

- The changing regulations (up or down) were observed as a result of interactions in the second graph.

- When three graphs were carefully compared to each other, it can be observed that the expression levels of nodes change continuously at each stage.

# 5. Finding perturbation efficiency on an experimental dataset

`find_node_perturbation()` runs `calc_perturb` on all nodes in the network in parallel with help of the `future` and `furrr` packages. In this vignette, the function is demonstrated on the `midsamp` data. This dataset is not comparable to actual biological miRNA:target gene datasets in size and complexity. Although `find_node_perturbation()` runs in parallel it might take long time to run in real huge biological datasets. 

In real biological datasets, more complex interactions whether functional or non-functional could be observed. We have improved our approach with `fast` argument in `find_node_perturbation()` based on selection of elements that could be affected from perturbation. In this fucntion, `fast` argument specifies the percentage of the competing amount that can be affected within the initial competing amount and acts as a selection parameter. For instance, in filtered example data:

```{r, warning=FALSE}

entire_perturbation <- filtered_example%>%
  priming_graph(competing_count = competing_counts, miRNA_count = mirnaexpression_normal)%>%
  find_node_perturbation(how=5, cycle=3, fast = 15)%>%
  select(name, perturbation_efficiency, perturbed_count)
  
```

```{r, warning=FALSE}

entire_perturbation%>%
  filter(!is.na(perturbation_efficiency), !is.na(perturbed_count))%>%
  select(name, perturbation_efficiency, perturbed_count)
```


# 6. Session Info
```{r sessioninfo}
sessionInfo()
```

## References