---
title: "Overview of BiocPkgTools"
author:
- name: Shian Su
  affiliation: Walter and Eliza Hall Institute, Melbourne, Australia
- name: Vince Carey
  affiliation: Channing Lab, Brigham and Womens Hospital, Harvard University, Boston, MA USA
- name: Lori Shepherd
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY USA
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY USA
- name: Sean Davis
  affiliation: National Cancer Institute, National Institutes of Health, Bethesda, MD USA
  email: seandavi@gmail.com
package: BiocPkgTools
output:
  BiocStyle::html_document:
    toc: false
abstract: |
  Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages. 
vignette: |
  %\VignetteIndexEntry{Overview of BiocPkgTools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages. 

Functionality includes access to :

- Download statistics
- General package listing
- Build reports
- Package dependency graphs
- Vignettes

```{r init, include=FALSE}
library(knitr)
opts_chunk$set(warning = FALSE, message = FALSE, cache=FALSE)
```

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Build reports

The Bioconductor build reports are available online as HTML pages. 
However, they are not very computable.
The `biocBuildReport` function does some heroic parsing of the HTML
to produce a *tidy* data.frame for further processing in R. 

```{r}
library(BiocPkgTools)
head(biocBuildReport())
```

## Personal build report

Because developers may be interested in a quick view of their own
packages, there is a simple function, `problemPage`, to produce an HTML report of
the build status of packages matching a given author *regex*. The default is
to report only "problem" build statuses (ERROR, WARNING).

```{r eval=FALSE}
problemPage()
```

When run in an interactive environment, the `problemPage` function 
will open a browser window for user interaction. Note that if you want
to include all your package results, not just the broken ones, simply 
specify `includeOK = TRUE`.

# Download statistics

Bioconductor supplies download stats for all packages. The `biocDownloadStats`
function grabs all available download stats for all packages in all
Experiment Data, Annotation Data, and Software packages. The results
are returned as a tidy data.frame for further analysis.

```{r}
head(biocDownloadStats())
```

The download statistics reported are for ***all available versions*** of a package.
There are no separate, publicly available statistics broken down by version.

# Package details

The R `DESCRIPTION` file contains a plethora of information regarding package
authors, dependencies, versions, etc. In a repository such as Bioconductor, these
details are available in bulk for all inclucded packages. The `biocPkgList` returns
a data.frame with a row for each package. Tons of information are avaiable, as 
evidenced by the column names of the results.

```{r}
bpi = biocPkgList()
colnames(bpi)
```

Some of the variables are parsed to produce `list` columns. 

```{r}
head(bpi)
```

As a simple example of how these columns can be used, extracting
the `importsMe` column to find the packages that import the 
`r Biocpkg("GEOquery")` package.


```{r}
require(dplyr)
bpi = biocPkgList()
bpi %>% 
    filter(Package=="GEOquery") %>%
    pull(importsMe) %>%
    unlist()
```

# Package Explorer

For the end user of Bioconductor, an analysis often starts with finding a
package or set of packages that perform required tasks or are tailored 
to a specific operation or data type. The `biocExplore()` function
implements an interactive bubble visualization with filtering based on 
biocViews terms. Bubbles are sized based on download statistics. Tooltip
and detail-on-click capabilities are included. To start a local session:

```{r biocExplore}
biocExplore()
```

# Dependency graphs

The Bioconductor ecosystem is built around the concept of interoperability
and dependencies. These interdependencies are available as part of the
`biocPkgList()` output. The `BiocPkgTools` provides some convenience
functions to convert package dependencies to R graphs. A modular approach leads
to the following workflow.

1. Create a `data.frame` of dependencies using `buildPkgDependencyDataFrame`.
2. Create an `igraph` object from the dependency data frame using `buildPkgDependencyIgraph`
3. Use native `igraph` functionality to perform arbitrary network operations. 
Convenience functions, `inducedSubgraphByPkgs` and `subgraphByDegree` are available.
4. Visualize with packages such as `r CRANpkg("visNetwork")`.


## Working with dependency graphs

A dependency graph for all of Bioconductor is a starting place.

```{r}
library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g
library(igraph)
head(V(g))
head(E(g))
```

See `inducedSubgraphByPkgs` and `subgraphByDegree` to produce
subgraphs based on a subset of packages.

See the igraph documentation for more detail on graph analytics, setting
vertex and edge attributes, and advanced subsetting.

## Graph visualization

The visNetwork package is a nice interactive visualization tool
that implements graph plotting in a browser. It can be integrated
into shiny applications. Interactive graphs can also be included in
Rmarkdown documents (see vignette)

```{r}
igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
``` 

The full dependency graph is really not that informative to look at, though
doing so is possible. A common use case is to visualize the graph of
dependencies "centered" on a package of interest. In this case, I will 
focus on the `r Biocpkg("GEOquery")` package. 

```{r}
igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")
```

The `subgraphByDegree()` function returns all nodes and connections within
`degree` of the named package; the default `degree` is `1`.

The visNework package can plot `igraph` objects directly, but more flexibility
is offered by first converting the graph to visNetwork form.

```{r}
library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)
```

The next few code chunks highlight just a few examples of the visNetwork
capabilities, starting with a basic plot.

```{r }
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
```

For fun, we can watch the graph stabilize during drawing, best viewed
interactively.

```{r}
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization=FALSE)
```

Add arrows and colors to better capture dependencies.

```{r}
data$edges$color='lightblue'
data$edges[data$edges$edgetype=='Imports','color']= 'red'
data$edges[data$edges$edgetype=='Depends','color']= 'green'

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows='from') 
```

Add a legend.

```{r}
ledges <- data.frame(color = c("green", "lightblue", "red"),
  label = c("Depends", "Suggests", "Imports"), arrows =c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
  visEdges(arrows='from') %>%
  visLegend(addEdges=ledges)
```

## Integration with `r Biocpkg("BiocViews")`

[Work in progress]

The `r Biocpkg("biocViews")` package is a small ontology of terms describing
Bioconductor packages. This is a work-in-progress section, but here is a small example of 
plotting the biocViews graph.

```{r biocViews}
library(biocViews)
data(biocViewsVocab)
biocViewsVocab
library(igraph)
g = igraph.from.graphNEL(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)
visNetwork(gv$nodes, gv$edges, width="100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular=TRUE) %>%
    visNodes(size=20) %>%
    visPhysics(stabilization=FALSE)
```

# Dependency burden

The dependency burden of a package, namely the amount of functionality that
a given package is importing, is an important parameter to take into account
during package development. A package may break because one or more of its
dependencies have changed the part of the API our package is importing or
this part has even broken. For this reason, it may be useful for package
developers to quantify the dependency burden of a given package. To do that
we should first gather all dependency information using the function
`buildPkgDependencyDataFrame()` but setting the arguments to work with
packages in Bioconductor and CRAN and dependencies categorised as `Depends`
or `Imports`, which are the ones installed by default for a given package.

```{r}
library(BiocPkgTools)

depdf <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"),
                                     dependencies=c("Depends", "Imports"))
depdf
```

Finally, we call the function `pkgDepMetrics()` to obtain different metrics
on the dependency burden of a package we want to analyze, in the case below,
the package `BiocPkgTools` itself:

```{r}
pkgDepMetrics("BiocPkgTools", depdf)
```

In this resulting table, rows correspond to dependencies and columns provide
the following information:

  * `ImportedAndUsed`: number of functionality calls imported and used in
    the package.
  * `Exported`: number of functionality calls exported by the dependency.
  * `Usage`: (`ImportedAndUsed`x 100) / `Exported`. This value provides an
    estimate of what fraction of the functionality of the dependency is
    actually used in the given package.
  * `DepOverlap`: Similarity between the dependency graph structure of the
    given package and the one of the dependency in the corresponding row,
    estimated as the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)
    between the two sets of vertices of the corresponding graphs. Its values
    goes between 0 and 1, where 0 indicates that no dependency is shared, while
    1 indicates that the given package and the corresponding dependency depend
    on an identical subset of packages.
  * `DepGainIfExcluded`: The 'dependency gain' (decrease in the total number 
    of dependencies) that would be obtained if this package was excluded 
    from the list of direct dependencies. 

The reported information is ordered by the `Usage` column to facilitate the
identification of dependencies for which the analyzed package is using a small
fraction of their functionality and therefore, it could be easier remove them.
To aid in that decision, the column `DepOverlap` reports the overlap of the
dependency graph of each dependency with the one of the analyzed package. Here
a value above, e.g., 0.5, could, albeit not necessarily, imply that removing
that dependency could substantially lighten the dependency burden of the analyzed
package.

An `NA` value in the `ImportedAndUsed` column indicates that the function
`pkgDepMetrics()` could not identify what functionality calls in the analyzed
package are made to the dependency. This may happen because `pkgDepMetrics()`
has failed to identify the corresponding calls, as it happens with imported
constants such as `DNA_BASES` from `Biostrings`, or that although the given
package is importing that dependency, none of its functionality is actually
being used. In such a case, this dependency could be safely removed without
any further change in the analyzed package.

We can find out what actually functionality calls are we importing as follows:

```{r}
imp <- pkgDepImports("BiocPkgTools")
imp %>% filter(pkg == "DT")
```

# Provenance

```{r}
sessionInfo()
```