---
title: "NetBoxR Tutorial"
author: "Eric Minwei Liu and Augustin Luna"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  BiocStyle::html_document:
    toc: yes
    toc_float: no
  html_document:
    df_print: paged
    toc: yes
  html_notebook: default
  md_document:
    toc: yes
    variant: gfm
  pdf_document:
    toc: yes
always_allow_html: yes
vignette: >
  %\VignetteIndexEntry{NetBoxR Tutorial} 
  %\VignetteEncoding{UTF-8} 
  %\VignetteEngine{knitr::rmarkdown}
---


```{r knitrSetup, include=FALSE}
library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center", fig.width=12, fig.height=12, tidy=TRUE)
```

```{r style, include=FALSE, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

# Overview

The **netboxr** package composes a number of functions to retrive and process 
genetic data from large-scale genomics projects (e.g. TCGA projects) including 
from mutations, copy number alterations, gene expression and DNA methylation. 
The netboxr package implements NetBox algorithm in R package. NetBox algorithm 
integrates genetic alterations with literature-curated pathway knowledge to 
identify pathway modules in cancer. NetBox algorithm uses (1) global network 
null model and (2) local network null model to access the statistic significance 
of the discovered pathway modules.

# Basics
## Installation

```{r installNetBoxr, eval=FALSE}
BiocManager::install("netboxr")
```

## Getting Started

Load **netboxr** package: 

```{r loadLibrary, message=FALSE, warning=FALSE}
library(netboxr)
```

A list of all accessible vignettes and methods is available with the following command: 

```{r searchHelp, eval=FALSE, tidy=FALSE}
help(package="netboxr")
```

For help on any **netboxr** package functions, use one of the following command formats:

```{r showHelp, eval=FALSE, tidy=FALSE}
help(geneConnector)
?geneConnector
```

# Example of Cerami et al. PLoS One 2010

This is an example to reproduce the network discovered on [Cerami et al.(2010)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008918).

The results presented here are comparable to the those from Cerami et al. 2010 
though the unadjusted p-values for linker genes are not the same. 
It is because the unadjusted p-value of linker genes in Cerami et al. 2010 were
calculated by the probabiliy of the observed data point, Pr(X). The netboxr used the probability
of an observed or more extreme assuming the null hypothesis is true,  Pr(X>=x|H),
as unadjusted p-value for linker genes. The final number of linker genes after 
FDR correction are the same between netboxr result and original Cerami et al. 2010.

## Load Human Interactions Network (HIN) network

Load pre-defined HIN network and simplify the interactions by removing loops
and duplicated interactions in the network. The netowork after reduction
contains 9264 nodes and 68111 interactions. 

```{r netboxrExampleNetwork}
data(netbox2010)
sifNetwork <- netbox2010$network
graphReduced <- networkSimplify(sifNetwork,directed = FALSE)      
```

## Load altered gene list

The altered gene list contains 517 candidates from mutations and copy number
alterations. 

```{r netboxrExampleGene}
geneList <- as.character(netbox2010$geneList) 
length(geneList)
```

## Map altered gene list on HIN network

The geneConnector function in the netboxr package takes altered gene list as
input and maps the genes on the curated network to find the local processes 
represented by the gene list. 

```{r netboxrExampleGeneConnector, fig.width=12, fig.height=12}
## Use Benjamini-Hochberg method to do multiple hypothesis 
## correction for linker candidates.

## Use edge-betweeness method to detect community structure in the network. 
threshold <- 0.05
results <- geneConnector(geneList=geneList,
                        networkGraph=graphReduced,
                        directed=FALSE,
                       pValueAdj="BH",
                       pValueCutoff=threshold,
                       communityMethod="ebc",
                       keepIsolatedNodes=FALSE)

# Add edge annotations
library(RColorBrewer)
edges <- results$netboxOutput
interactionType<-unique(edges[,2])
interactionTypeColor<-brewer.pal(length(interactionType),name="Spectral")

edgeColors<-data.frame(interactionType,interactionTypeColor,stringsAsFactors = FALSE)
colnames(edgeColors)<-c("INTERACTION_TYPE","COLOR")


netboxGraphAnnotated <- annotateGraph(netboxResults = results,
                                      edgeColors = edgeColors,
                                      directed = FALSE,
                                      linker = TRUE)

# Check the p-value of the selected linker
linkerDF <- results$neighborData
linkerDF[linkerDF$pValueFDR<threshold,]

# The geneConnector function returns a list of data frames. 
names(results)

# Plot graph with the Fruchterman-Reingold layout algorithm
# As an example, plot both the original and the annotated graphs
# Save the layout for easier comparison
graph_layout <- layout_with_fr(results$netboxGraph) 

# plot the original graph
plot(results$netboxCommunity,results$netboxGraph, layout=graph_layout) 

# Plot the edge annotated graph
plot(results$netboxCommunity, netboxGraphAnnotated, layout = graph_layout,
     vertex.size = 10,
     vertex.shape = V(netboxGraphAnnotated)$shape,
     edge.color = E(netboxGraphAnnotated)$interactionColor,
     edge.width = 3)

# Add interaction type annotations
legend("bottomleft", 
       legend=interactionType,
       col=interactionTypeColor,
       lty=1,lwd=2,
       bty="n",
       cex=1)


```

## Consistency with Previously Published Results 
The GBM result by netboxr identified exactly the same linker genes (6 linker genes), the same number of modules (10 modules) and the same genes in each identified module as GBM result in Cerami et al. 2010.  

The results of netboxr are consistent with previous implementation of the NetBox algorithm. The RB1 and PIK3R1 modules are clearly represented in the figure. For example, the RB1 module contains genes in blue color and enclosed by light orange circle. The PIK3R1 module contains genes in orange color and enclosed by pink circle. 

# Statistical Significance of Discovered Network 

NetBox algorithm used (1) global network null model and (2) local network null model to access the statistical significance of the discovered network. 

## Global Network Null Model 
The global network null model calculates the empirical p-value as the number of times (over a set of iterations) the size of the largest connected component (the giant component) in the network coming from the same number of randomly selected genes (number of genes is 274 in this example) equals or exceeds the size of the largest connected component in the observed network. The suggested iterations are 1000.  

```{r netboxrExampleGlobalTest, eval=FALSE}
## This function will need a lot of time to complete. 
globalTest <- globalNullModel(netboxGraph=results$netboxGraph, networkGraph=graphReduced, iterations=10, numOfGenes = 274)
```

## Local Network Null Model
Local network null model evaluates the deviation of modularity in the observed network from modularity distribution in the random network. For each interaction, a random network is produced from local re-wiring of literature curated network. It means all nodes in the network kept the same degree of connections but connect to new neighbors randomly. Suggested iterations is 1000.


```{r netboxrExampleLocalTest}
localTest <- localNullModel(netboxGraph=results$netboxGraph, iterations=1000)

```

Through 1000 iterations, we can obtain the mean and the standard deviation of modularity in the local network null model. Using the mean  (~0.3) and the standard deviation (0.06), we can covert the observed modularity in the network (0.519) into a Z-score (~3.8). From the Z-score, we can calculate one-tail p-value. If one-tail pvalue is less than 0.05, the observed modularity is significantly different from random. In the histogram, the blue region is the distribution of modularity in the local network null model. The red vertical line is the observed modularity in the NetBox results.

```{r netboxrExampleLocalTestPlot}
h<-hist(localTest$randomModularityScore,breaks=35,plot=FALSE)
h$density = h$counts/sum(h$counts)
plot(h,freq=FALSE,ylim=c(0,0.1),xlim=c(0.1,0.6), col="lightblue")
abline(v=localTest$modularityScoreObs,col="red")
```

* The global null model is used to assess the global connectivity (number of nodes and edges) of the largest module in the identified network compared with the same number but randomly selected gene list.

* The local null model is used to assess the network modularity in the identified network compared with random re-wired network. 

# View Module Membership 

The table below shows the module memberships for all genes. 

```{r}
DT::datatable(results$moduleMembership, rownames = FALSE)
```

# Write NetBox Output to Files 

```{r netboxrEampleOutput, eval=FALSE}
# Write results for further visilaztion in the cytoscape software. 
#
# network.sif file is the NetBox algorithm output in SIF format.  
write.table(results$netboxOutput, file="network.sif", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
#
# netighborList.txt file contains the information of all neighbor nodes. 
write.table(results$neighborData, file="neighborList.txt", sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)
#
# community.membership.txt file indicates the identified pathway module numbers.
write.table(results$moduleMembership, file="community.membership.txt", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
#
# nodeType.txt file indicates the node is "linker" node or "candidate" node. 
write.table(results$nodeType,file="nodeType.txt", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
```

# Term Enrichment in Modules using Gene Ontology (GO) Analysis 

After module identification, one main task is understanding the biological processes that may be represented by the returned modules. Here we use the Bioncoductor clusterProfiler to do an enrichment analysis using GO Biological Process terms on a selected module. 

```{r clusterExample, eval=TRUE}
library(clusterProfiler)
library(org.Hs.eg.db)

module <- 6
selectedModule <- results$moduleMembership[results$moduleMembership$membership == module,]
geneList <-selectedModule$geneSymbol

# Check available ID types in for the org.Hs.eg.db annotation package
keytypes(org.Hs.eg.db)

ids <- bitr(geneList, fromType="SYMBOL", toType=c("ENTREZID"), OrgDb="org.Hs.eg.db")
head(ids)

ego <- enrichGO(gene = ids$ENTREZID,
                OrgDb = org.Hs.eg.db,
                ont = "BP",
               pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
               qvalueCutoff  = 0.05,
               readable = TRUE)
```

## Enrichment Results 

```{r}
head(ego)
```

## Visualize Enrichment Results
```{r, fig.width=10, fig.height=5}
dotplot(ego)
```

# Alternative Module Discovery Methods 
In netboxr, we used the Girvan-Newman algorithm (communityMethod="ebc") as the default method to detect community membership in the identified network. The Girvan-Newman algorithm iteratativly removes the edge in the network with highest edge betweeness until no edges left. When the identified network contains many edges, the Girvan-Newman algorithm will spend a large amount of time to remove edges and re-calucalte the edge betweenese score in the network.  If  the user cannot get the community detection result in reasonable time,  we suggest to switch to leading eigenvector method (communityMethod="lec") for community detection. Users can check original papers of [the Girvan-Newman algorithm](http://www.pnas.org/content/99/12/7821) and [leading eigenvector method](https://journals.aps.org/pre/abstract/10.1103/PhysRevE.74.036104) for more details.  

# Alternative Pathway Data 
## Using Tabular Simple Interaction Format (SIF)-Based Network Data
Users can load alternative pathway data formatted in the SIF format (Simple Interaction Format). SIF is a space/tab separated format that summarizes interactions in a graph as an edgelist. In the format,  every row corresponds to an individual interaction (edge) between a source and a target node. NOTE: An arbitrary interaction type can be used, such as "interacts" if the true interaction type is unknown.

``` 
PARTICIPANT_A INTERACTION_TYPE PARTICIPANT_B
nodeA relationship nodeB
nodeC relationship nodeA
nodeD relationship nodeE
```

Resources, such as the Functional Interaction network from Reactome (https://reactome.org/download-data) and StringDB (https://string-db.org/) provide network information in formats reusable as a SIF. NOTE: The next section demonstrates how to retrieve SIF-based networks for many well-known interaction databases using paxtoolsr. 

SIF formatted data can be passed to networkSimplify(). The result of which is used with the geneConnector() function as other examples in this vignette demonstrate. 

```{r}
example <- "PARTICIPANT_A	INTERACTION_TYPE	PARTICIPANT_B
TP53	interacts	MDM2
MDM2	interacts	MDM4"

sif <- read.table(text=example, header=TRUE, sep="\t", stringsAsFactors=FALSE)

graphReduced <- networkSimplify(sif, directed = FALSE)  
```

## Using PaxtoolsR for Pathway Commons Data
Users can load alternative pathway data from the [Pathway Commons](http://www.pathwaycommons.org/) repository using the **paxtoolsr** package from [Bioconductor](https://bioconductor.org/packages/release/bioc/html/paxtoolsr.html). This pathway data represents an update to the Pathway Commons data used in the original 2010 NetBox publication. Below is an example that makes use of data from the [Reactome pathway database](http://www.reactome.org/). 

**NOTE:** Downloaded data is automatically cached to avoid unnecessary downloads. 

```{r paxtoolsr, fig.width=15, fig.height=15, eval=FALSE}
library(paxtoolsr)

filename <- "PathwayCommons.8.reactome.EXTENDED_BINARY_SIF.hgnc.txt.gz"
sif <- downloadPc2(filename, version="8")


# Filter interactions for specific types
interactionTypes <- getSifInteractionCategories()

filteredSif <- filterSif(sif$edges, interactionTypes=interactionTypes[["BetweenProteins"]])
filteredSif <- filteredSif[(filteredSif$INTERACTION_TYPE %in% "in-complex-with"), ]

# Re-run NetBox algorithm with new network
graphReduced <- networkSimplify(filteredSif, directed=FALSE)      
geneList <- as.character(netbox2010$geneList) 

threshold <- 0.05
pcResults <- geneConnector(geneList=geneList,
                          networkGraph=graphReduced,
                           directed=FALSE,
                           pValueAdj="BH",
                           pValueCutoff=threshold,
                           communityMethod="lec",
                           keepIsolatedNodes=FALSE)

# Check the p-value of the selected linker
linkerDF <- results$neighborData
linkerDF[linkerDF$pValueFDR<threshold,]

# The geneConnector function returns a list of data frames. 
names(results)

# plot graph with the Fruchterman-Reingold layout algorithm
plot(results$netboxCommunity,results$netboxGraph, layout=layout_with_fr) 
```

# Selecting Input Gene Lists for use with NetBox

The main input for the NetBox algorithm is an input list of "significantly" altered genes. Each project is different, unique considerations for how significance should be considered may be required. Researchers may seek stronger thresholds of significance for particular questions and different profiling technologies may have their own considerations. It is beyond the scope of this work to provide guidance for all situations. 

However, to help users better understand the process of generating an input gene list we provide examples using best practices derived from the The Cancer Genome Project using the cBioPortal (http://cbioportal.org/), a platform that aggregates clinical genomics datasets into a standard representation. As of August 2020, cBioPortal has approximately 290 studies. In cases where appropriate data is available a similar procedure to the example can be used. 

## Accesing Pre-Computed Alteration Results from the cBioPortal DataHub 

For TCGA studies on cBioPortal, users can access pre-processed datasets from the [cBioPortal DataHub](https://github.com/cBioPortal/datahub/tree/master/public) that contain significantly altered genes by mutations and copy number. Example study link: https://github.com/cBioPortal/datahub/tree/master/public/acc_tcga

* Significantly altered genes by mutations (via MutSig algorithm) are accessible within the 'data_mutsig.txt' file for a study; typically mutations with a q-value < 0.1 are selected as significantly altered
* Significantly altered genes by copy number (via GISTIC algorithm) are accessible within the 'data_gistic_genes_del.txt' (deletions) file and 'data_gistic_genes_amp.txt' (amplifications). 

Users are directed to the accompanying study publications; study publication details are in the 'meta_study.txt' file for a study.

## Accessing Cancer Genomics Data from cBioPortal

Users can download cancer alteration data from [cBioPortal](https://www.cbioportal.org/) using the **cgdsr** package from [CRAN](https://cran.r-project.org/web/packages/cgdsr/index.html). Here we show how a simple example for selecting genes for use with netboxr for datasets provided by cBioPortal using a using a 10% alteration frequency threshold to select genes; this general procedure has previously been used as part of [TCGA studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3910500/). In the example, we consider: 

* For mutations, mutations of any type contribute to the overall alteration frequency of the gene
* For copy number, discretized GISTIC-derived values for amplification or deep deletions contribute to the overall alteration frequency

The resulting gene list then becomes an input for netboxr. The resulting gene list will select EGFR and TP53, which have high alteration frequencies in glioblastoma (GBM) over the housekeeping genes ACTB and GAPDH, which have very low alteration frequencies.

```{r, eval=FALSE}
library(cgdsr)

mycgds <- CGDS("http://www.cbioportal.org/")

# Find available studies, caselists, and geneticProfiles 
studies <- getCancerStudies(mycgds)
caselists <- getCaseLists(mycgds,'gbm_tcga_pub')
geneticProfiles <- getGeneticProfiles(mycgds,'gbm_tcga_pub')

genes <- c("EGFR", "TP53", "ACTB", "GAPDH")

results <- sapply(genes, function(gene) {
  geneticProfiles <- c("gbm_tcga_pub_cna_consensus", "gbm_tcga_pub_mutations")
  caseList <- "gbm_tcga_pub_cnaseq"
  
  dat <- getProfileData(mycgds, genes=gene, geneticProfiles=geneticProfiles, caseList=caseList)
  head(dat)
  
  cna <- dat$gbm_tcga_pub_cna_consensus
  cna <- as.numeric(levels(cna))[cna]
  
  mut <- dat$gbm_tcga_pub_mutations
  mut <- as.character(levels(mut))[mut]
  
  tmp <- data.frame(cna=cna, mut=mut, stringsAsFactors = FALSE)
  tmp$isAltered <- abs(tmp$cna) == 2 | !is.na(tmp$mut) # Amplification or Deep Deletion or any mutation
  length(which(tmp$isAltered))/nrow(tmp)
}, USE.NAMES = TRUE)

# 10 percent alteration frequency cutoff 
geneList <- names(results)[results > 0.1]
```

# References

* Cerami E, Demir E, Schultz N, Taylor BS, Sander C (2010) Automated Network Analysis Identifies Core Pathways in Glioblastoma. PLoS ONE 5(2): e8918. doi:10.1371/journal.pone.0008918
* Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, Schultz N, Bader GD, Sander C. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011 Jan;39(Database issue):D685-90. doi:10.1093/nar/gkq1039. Epub 2010 Nov 10.

# Session Information

```{r sessionInfo}
sessionInfo()
```