---
title: "Basic checks of BioPlex PPI data"
output:
  BiocStyle::html_document:
    self_contained: yes 
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('BioPlex')`"
vignette: >
  % \VignetteIndexEntry{2. Data checks}
  % \VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

# Setup

```{r, message = FALSE}
library(BioPlex)
library(AnnotationDbi)
library(AnnotationHub)
library(graph)
```

Connect to 
[AnnotationHub](http://bioconductor.org/packages/AnnotationHub):

```{r ahub, message = FALSE}
ah <- AnnotationHub::AnnotationHub()
```

Connect to 
[ExperimentHub](http://bioconductor.org/packages/ExperimentHub):

```{r ehub, message = FALSE}
eh <- ExperimentHub::ExperimentHub()
```

OrgDb package for human:
```{r orgdb, message = FALSE}
orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens"))
orgdb <- orgdb[[1]]
orgdb
keytypes(orgdb)
```

# Check: identify CORUM complexes that have a subunit of interest

Get core set of complexes:
```{r corumCore}
core <- getCorum(set = "core", organism = "Human")
```

Turn the CORUM complexes into a list of graph instances,
where all nodes of a complex are connected to all other nodes of that complex
with undirected edges.

```{r corum2glist}
core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT")
```

Identify complexes that have a subunit of interest:
```{r corum-subunit}
has.cdk2 <- hasSubunit(core.glist, 
                       subunit = "CDK2",
                       id.type = "SYMBOL")
```

Check the answer:
```{r corum-subunit2}
table(has.cdk2)
cdk2.glist <- core.glist[has.cdk2]
lapply(cdk2.glist, function(g) unlist(graph::nodeData(g, attr = "SYMBOL")))
```

We can then also inspect the graph with plotting utilities from the 
[Rgraphviz](https://bioconductor.org/packages/Rgraphviz)
package:

```{r corum-subunit3, message = FALSE, eval = FALSE}
plot(cdk2.glist[[1]], main = names(cdk2.glist)[1])
```

# Check: extract BioPlex PPIs for a CORUM complex

Get the latest version of the 293T PPI network:

```{r bioplex293T}
bp.293t <- getBioPlex(cell.line = "293T", version = "3.0")
```

Turn the BioPlex PPI network into one big graph where bait and prey relationship
are represented by directed edges from bait to prey.

```{r bpgraph}
bp.gr <- bioplex2graph(bp.293t)
```

Now we can also easily pull out a BioPlex subnetwork for a CORUM complex
of interest:

```{r}
n <- graph::nodes(cdk2.glist[[1]])
bp.sgr <- graph::subGraph(n, bp.gr)
bp.sgr
```

# Check: identify interacting domains for a PFAM domain of interest

Add PFAM domain annotations to the node metadata:

```{r}
bp.gr <- BioPlex::annotatePFAM(bp.gr, orgdb)
```

Create a map from PFAM to UNIPROT:

```{r}
unip2pfam <- graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM")
pfam2unip <- stack(unip2pfam)
pfam2unip <- split(as.character(pfam2unip$ind), pfam2unip$values)
head(pfam2unip, 2)
```

Let's focus on [PF02023](http://pfam.xfam.org/family/PF02023), corresponding to the
zinc finger-associated SCAN domain. For each protein containing the SCAN domain,
we now extract PFAM domains connected to the SCAN domain by an edge in the BioPlex network.

```{r}
scan.unip <- pfam2unip[["PF02023"]]
getIAPfams <- function(n) graph::nodeData(bp.gr, graph::edges(bp.gr)[[n]], "PFAM")
unip2iapfams <- lapply(scan.unip, getIAPfams)
unip2iapfams <- lapply(unip2iapfams, unlist)
names(unip2iapfams) <- scan.unip
```

Looking at the top 5 PFAM domains most frequently connected to the SCAN domain 
by an edge in the BioPlex network ...

```{r}
pfam2iapfams <- unlist(unip2iapfams)
sort(table(pfam2iapfams), decreasing = TRUE)[1:5]
```

... we find [PF02023](http://pfam.xfam.org/family/PF02023), the SCAN domain itself,
and [PF00096](http://pfam.xfam.org/family/PF00096), a C2H2 type zinc finger domain.
This finding is consistent with results reported in the
[BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011).


See also the 
[PFAM domain-domain association analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/PFAM.html)
for a more comprehensive analysis of PFAM domain associations in the BioPlex network.

# Check: agreement between CNV tracks

Genomic data from whole-genome sequencing for six different lineages
of the human embryonic kidney HEK293 cell line can be obtained from
[hek293genome.org](http://hek293genome.org).

This includes copy number variation (CNV) data for the 293T cell line.
Available CNV tracks include (i) CNV regions inferred from sequencing 
read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays.

Here, we check agreement between inferred copy numbers from both assay types.

We start by obtaining genomic coordinates and copy number scores from the sequencing
track ...
 
```{r cnv-seq}
cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T")
cnv.hmm
```

... and from the SNP track.

```{r cnv-snp}
cnv.snp <- getHEK293GenomeTrack(track = "cnv.snp", cell.line = "293T")
cnv.snp
```

We reduce the check for agreement between both CNV tracks by transferring copy
numbers to overlapping genes, and subsequently, assess the agreement between the
resulting gene copy numbers for both tracks.  

As the genomic coordinates from both CNV tracks is based on the hg18 human genome 
assembly, we obtain gene coordinates for hg18 from AnnotationHub:

```{r hg18-genes}
AnnotationHub::query(ah, c("TxDb", "Homo sapiens"))
txdb <- ah[["AH52257"]]
gs <- GenomicFeatures::genes(txdb)
gs
```

We then transfer SNP-inferred copy numbers to genes by overlap ...

```{r}
olaps <- GenomicRanges::findOverlaps(gs, cnv.snp)
joined <- gs[S4Vectors::queryHits(olaps)]
joined$score <- cnv.snp$score[S4Vectors::subjectHits(olaps)]
joined
```

... and, analogously, transfer sequencing-inferred copy numbers to genes by overlap.

```{r}
olaps <- GenomicRanges::findOverlaps(gs, cnv.hmm)
joined2 <- gs[S4Vectors::queryHits(olaps)]
joined2$score <- cnv.hmm$score[S4Vectors::subjectHits(olaps)]
joined2
```

We then restrict both tracks to common genes.

```{r}
isect <- intersect(names(joined), names(joined2))
joined <- joined[isect]
joined2 <- joined2[isect]
```

Now, can assess agreement by testing the correlation between SNP-inferred gene
copy numbers and the corresponding sequencing-inferred gene copy numbers. 

```{r}
cor(joined$score, joined2$score)
cor.test(joined$score, joined2$score)
```

We also inspect the correlation via a boxplot.

```{r, fig.width = 8, fig.height = 8}
spl <- split(joined2$score, joined$score)
boxplot(spl, xlab = "SNP-inferred copy number", ylab = "sequencing-inferred copy number")
rho <- cor(joined$score, joined2$score)
rho <- paste("cor", round(rho, digits=3), sep=" = ")

p <- cor.test(joined$score, joined2$score)
p <- p$p.value
p <- "   p < 2.2e-16"

legend("topright", legend=c(rho, p))
```


# Check: expressed genes are showing up as prey (293T cells)

Get RNA-seq data for HEK293 cells from GEO: 
[GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425)

```{r gse122425}
se <- getGSE122425()
se
```

Inspect expression of prey genes:

```{r prey-expression}
bait <- unique(bp.293t$SymbolA)
length(bait)
prey <- unique(bp.293t$SymbolB)
length(prey)
```

```{r}
ind <- match(prey, rowData(se)$SYMBOL)
par(las = 2)
boxplot(log2(assay(se, "rpkm") + 0.5)[ind,], 
        names = se$title, 
        ylab = "log2 RPKM")
```

How many prey genes are expressed (raw read count > 0) in all 3 WT reps:

```{r prey-expression2}
# background: how many genes in total are expressed in all three WT reps
gr0 <- rowSums(assay(se)[,1:3] > 0)
table(gr0 == 3)
# prey: expressed in all three WT reps
table(gr0[ind] == 3)
# prey: expressed in at least one WT rep
table(gr0[ind] > 0)
```

Are prey genes overrepresented in the expressed genes?
```{r prey-expression-ora}
exprTable <-
     matrix(c(9346, 1076, 14717, 32766),
            nrow = 2,
            dimnames = list(c("Expressed", "Not.expressed"),
                            c("In.prey.set", "Not.in.prey.set")))
exprTable
```

Test using hypergeometric test (i.e. one-sided Fisher's exact test):
```{r prey-expression-ora2}
fisher.test(exprTable, alternative = "greater")
```

Alternatively: permutation test, i.e. repeatedly sample number of prey genes 
from the background, and assess how often we have as many or more than 9346 genes
expressed:

```{r prey-expression-293T-perm}
permgr0 <- function(gr0, nr.genes = length(prey)) 
{
    ind <- sample(seq_along(gr0), nr.genes)
    sum(gr0[ind] == 3)
}
```

```{r prey-expression-perm2}
perms <- replicate(permgr0(gr0), 1000)
summary(perms)
(sum(perms >= 9346) + 1) / 1001
```

# Check: is there a relationship between prey frequency and prey expression level?

Check which genes turn up most frequently as prey:

```{r prey-freq}
prey.freq <- sort(table(bp.293t$SymbolB), decreasing = TRUE)
preys <- names(prey.freq)
prey.freq <- as.vector(prey.freq)
names(prey.freq) <- preys
head(prey.freq)
summary(prey.freq)
hist(prey.freq, breaks = 50, main = "", xlab = "Number of PPIs", ylab = "Number of genes")
```

Prey genes are involved in `r round(mean(as.vector(prey.freq)))` PPIs on average.

There doesn't seem to be a strong correlation between expression level and the 
frequency of gene to turn up as prey: 

```{r}
ind <- match(names(prey.freq), rowData(se)$SYMBOL)
rmeans <- rowMeans(assay(se, "rpkm")[ind, 1:3])
log.rmeans <- log2(rmeans + 0.5)
par(pch = 20)
plot( x = prey.freq,
      y = log.rmeans,
      xlab = "prey frequency",
      ylab = "log2 RPKM")
cor(prey.freq, 
    log.rmeans,
    use = "pairwise.complete.obs")
```

See also the [BioNet maximum scoring subnetwork analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/BioNet.html) 
for a more comprehensive analysis of the 293T transcriptome data from GSE122425
when mapped onto BioPlex PPI network.

# Check: differential protein expression (HEK293 vs. HCT116)  

Get the relative protein expression data comparing 293T and HCT116 cells
from Supplementary Table S4A of the BioPlex 3 paper:

```{r bp.prot}
bp.prot <- getBioplexProteome()
bp.prot
rowData(bp.prot)
```

A couple of quick sanity checks:

1. The relative abundances are scaled to sum up to 100% for each protein:

```{r}
rowSums(assay(bp.prot)[1:5,]) 
```

2. The `rowData` column `log2ratio` corresponds to the mean of the five HEK samples, 
divided by the mean of the five HCT samples (and then taking log2 of it):

```{r}
ratio <- rowMeans(assay(bp.prot)[1:5, 1:5]) / rowMeans(assay(bp.prot)[1:5, 6:10])
log2(ratio)
```

3. The `rowData` column `adj.pvalue` stores Benjamini-Hochberg adjusted *p*-values
from a *t*-test between the five HEK samples and the five HCT samples:

```{r}
t.test(assay(bp.prot)[1, 1:5], assay(bp.prot)[1, 6:10])
```

The [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) also explores the agreement between differential gene expression and 
differential protein expression when comparing HEK293 against HCT116 cells.

# SessionInfo

```{r sessionInfo}
sessionInfo()
```