--- title: "SummarizedBenchmark: Introduction" author: "Patrick K. Kimes, Alejandro Reyes" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('SummarizedBenchmark')`" abstract: > "When performing a data analysis in R, users are often presented with multiple packages and methods for accomplishing the same task. Benchmarking the performance of these different methods on real and simulated data sets is a common way of learning the relative strengths and weaknesses of each approach. However, as the number of tools and parameters increases, keeping track of output and how it was generated can quickly becomes messy. The `SummarizedBenchmark` package provides a framework for organizing benchmark comparisons, making it easier to both reproduce the original benchmark and replicate the comparison with new data. This vignette introduces the general approach and features of the package using a simple example. SummarizedBenchmark package version: `r packageVersion("SummarizedBenchmark")`" output: BiocStyle::html_document: highlight: pygments toc: true fig_width: 5 bibliography: library.bib vignette: > %\VignetteIndexEntry{SummarizedBenchmark: Introduction} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: inline --- ```{r echo=FALSE, include=FALSE} knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png", message = FALSE, error = FALSE, warning = TRUE) ``` # Introduction With `SummarizedBenchmark`, a complete benchmarking workflow is comprised of three primary components: 1. data, 2. methods, and 3. performance metrics. The first two (_data_ and _methods_) are necessary for carrying out the benchmark experiment, and the last (_performance metrics_) is essential for evaluating the results of the experiment. Following this approach, the `SummarizedBenchmark` package defines two primary types of objects: *BenchDesign* objects and *SummarizedBenchmark* objects. *BenchDesign* objects contain only the design of the benchmark experiment, namely the _data_ and _methods_, where a _method_ is defined as the combination of a function or algorithm and parameter settings. After constructing a *BenchDesign*, the experiment is executed to create a *SummarizedBenchmark* containing the results of applying the methods to the data. *SummarizedBenchmark* objects extend the Bioconductor `SummarizedExperiment` class, with the additional capability of working with _performance metrics_. The basic framework is illustrated in the figure below. A *BenchDesign* is created with _methods_ and combined with _data_ to create a *SummarizedBenchmark*, which contains the output of applying the methods to the data. This output is then paired with _performance metrics_ specified by the user. Note that the same *BenchDesign* can be combined with several data sets to produce several *SummarizedBenchmark* objects with the corresponding outputs. For convenience, several default _performance metrics_ are implemented in the package, and can be added to *SummarizedBenchmark* objects using simple commands. ![basic benchmarking class relationship](summarizedbenchmark-figure1.png) In this vignette, we first illustrate the basic use of both the *BenchDesign* and *SummarizedBenchmark* classes with a simple comparison of methods for p-value correction in the context of multiple hypothesis testing. More advanced features of the package are demonstrated in several other case study vignettes. 
After becoming familiar with the basic `SummarizedBenchmark` framework described in this introduction, we recommend reading the **SummarizedBenchmark: Class Details** vignette for more on the structure of the *BenchDesign* and *SummarizedBenchmark* classes and details on issues of _reproducibility_ using these classes. Then, we recommend moving on to the more detailed **SummarizedBenchmark: Full Case Study** vignette, where we describe more advanced features of the package with a case study comparing three methods for differential expression analysis.

## Related Work

Other frameworks for benchmarking methods include `r BiocStyle::Biocpkg("iCOBRA")` (a package for comparing results of _"binary classification and ranking methods"_ with a Shiny web application for interactive analyses), `r BiocStyle::Biocpkg("rnaseqcomp")` (a package for comparing results of RNA-seq quantification pipelines), and `r BiocStyle::Githubpkg("stephenslab/dsc")` (a framework for _"managing computational benchmarking experiments that compare several competing methods"_ written in Python but capable of running methods implemented in both Python and R).

# Quickstart Case Study

```{r}
library("SummarizedBenchmark")
library("magrittr")
```

To illustrate the basic use of the *BenchDesign* class, we use the `tdat` data set included with this package.

```{r}
data(tdat)
```

The data set is a *data.frame* containing the results of 50 two-sample t-tests. The tests were performed using independently simulated sets of 20 observations drawn from a single standard Normal distribution (when `H = 0`) or two mean-shifted Normal distributions (when `H = 1`).

```{r}
head(tdat)
```

Several approaches have been proposed and implemented to compute *adjusted p-values* and *q-values* with the goal of controlling the total number of false discoveries across a collection of tests. In this example, we compare three such methods:

1. Bonferroni correction (`p.adjust` w/ `method = "bonferroni"`) [@Dunn_1961],
2. Benjamini-Hochberg (`p.adjust` w/ `method = "BH"`) [@Benjamini_1995], and
3. Storey's FDR q-value (`qvalue::qvalue`) [@Storey_2002].

First, consider how benchmarking the three methods might look without the *SummarizedBenchmark* framework. To compare methods, each is applied to `tdat`, and the results are stored in separate variables.

```{r}
adj_bonf <- p.adjust(p = tdat$pval, method = "bonferroni")
adj_bh <- p.adjust(p = tdat$pval, method = "BH")

qv <- qvalue::qvalue(p = tdat$pval)
adj_qv <- qv$qvalues
```

Since the values of interest are available from the output of each method as a vector of length 50 (the number of hypotheses tested), to keep things clean, they can be combined into a single *data.frame*.

```{r}
adj <- cbind.data.frame(adj_bonf, adj_bh, adj_qv)
head(adj)
```

The *data.frame* of adjusted p-values and q-values can be used to compare the methods, either by directly parsing the table or using a framework like `r BiocStyle::Biocpkg("iCOBRA")`. Additionally, the *data.frame* can be saved as an `RDS` or `Rdata` object for future reference, eliminating the need to recompute on the original data.

While this approach can work well for smaller comparisons, it can quickly become overwhelming and unwieldy as the number of methods and parameters increases. Furthermore, once each method is applied and the final *data.frame* (`adj`) is constructed, there is no way to determine *how* each value was calculated. While an informative name can be used to "label" each method (as done above), this does not capture the full complexity, e.g.
the parameters and context in which the function was evaluated. One solution might involve manually recording function calls and parameters in a separate *data.frame* with the hope of maintaining synchrony with the output *data.frame*. However, this is prone to errors, e.g. during fast "copy and paste" operations or additions and deletions of parameter combinations. An alternative (and hopefully better) solution is to use the framework of the *SummarizedBenchmark* package.

In the *SummarizedBenchmark* approach, a *BenchDesign* is constructed with the data and any number of methods. Optionally, a *BenchDesign* can also be constructed without any data or method inputs. Methods and data can be added to or removed from the object modularly in a few different ways, as will be described in the following section. For simplicity, we first show how to construct the *BenchDesign* with just the data set as input. The data object, here `tdat`, must be passed explicitly to the `data =` parameter.

```{r}
b <- BenchDesign(data = tdat)
```

Then, each method of interest can be added to the *BenchDesign* using `addMethod()`.

```{r}
b <- addMethod(bd = b, label = "bonf",
               func = p.adjust,
               params = rlang::quos(p = pval, method = "bonferroni"))
```

At a minimum, `addMethod()` requires three parameters:

1. `bd`: the *BenchDesign* object to modify,
2. `label`: a character name for the method, and
3. `func`: the function to be called.

After the minimum parameters are specified, any parameters needed by the `func` method should be passed as named parameters, e.g. `p = pval, method = "bonferroni"`, to `params =` as a list of [*quosures*](http://rlang.tidyverse.org/reference/quosure.html) using `rlang::quos(..)`. Notice here that the `pval` wrapped in `rlang::quos(..)` **does not** need to be specified as `tdat$pval` for the function to access the column in the data.frame. For readers familiar with the `r BiocStyle::CRANpkg("ggplot2")` package, the use of `params = rlang::quos(..)` here should be viewed as similar to the use of `aes = aes(..)` in `ggplot2` for mapping between the data and plotting (or benchmarking) parameters.

The process of adding methods can be written more concisely using the pipe operators from the `r BiocStyle::CRANpkg("magrittr")` package.

```{r}
b <- b %>%
    addMethod(label = "BH",
              func = p.adjust,
              params = rlang::quos(p = pval, method = "BH")) %>%
    addMethod(label = "qv",
              func = qvalue::qvalue,
              params = rlang::quos(p = pval),
              post = function(x) { x$qvalues })
```

For some methods, such as the q-value approach above, it may be necessary to call a "post-processing" function on the primary method to extract the desired output (here, the q-values). This should be specified using the optional `post =` parameter.

Now, the *BenchDesign* object contains three methods. This can be verified by printing the object.

```{r}
b
```

More details about each method can be seen using the `printMethods()` function.

```{r}
printMethods(b)
```

While the *BenchDesign* now includes all the information necessary for performing the benchmarking study, the actual adjusted p-values and q-values have not yet been calculated. To do this, we simply call `buildBench()`. While `buildBench()` does not require any inputs other than the *BenchDesign* object, when the corresponding ground truth is known, the `truthCols =` parameter should be specified. In this example, the `H` column of the `tdat` *data.frame* contains the true null or alternative status of each simulated hypothesis test.
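Before running the experiment, it is worth recalling that a *BenchDesign* can also be constructed without any data; in that case, the data set can instead be supplied when the benchmark is executed. A minimal, unevaluated sketch of this alternative is shown below; it assumes that `buildBench()` accepts the data set through a `data =` argument.

```{r design-without-data, eval = FALSE}
## Sketch: construct the design without data and supply the data set only
## when the benchmark is executed (assumes buildBench() has a `data =` argument).
bnull <- BenchDesign()
bnull <- addMethod(bnull, label = "bonf", func = p.adjust,
                   params = rlang::quos(p = pval, method = "bonferroni"))
sbnull <- buildBench(bnull, data = tdat, truthCols = "H")
```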
Note that if any of the methods are defined in a separate package, they must be installed and loaded _before_ running the experiment.

```{r}
sb <- buildBench(b, truthCols = "H")
```

The returned object is a *SummarizedBenchmark*, an extension of a *SummarizedExperiment* object. The table of adjusted p-values and q-values is contained in a single "assay" of the object, with each method added using `addMethod()` stored as a column named with the corresponding `label`.

```{r}
head(assay(sb))
```

Metadata for the methods is contained in the `colData()` of the same object, with each row corresponding to one method in the comparison.

```{r}
colData(sb)
```

In addition to columns for the functions and parameters specified with `addMethod()` (`func`, `post`, `label`, `param.*`), the `colData()` includes several other columns added during the `buildBench()` process. Most notably, these include the package name and version of `func`, if available (`func.pkg`, `func.pkg.vers`).

When available, ground truth data is contained in the `rowData()` of the *SummarizedBenchmark* object.

```{r}
rowData(sb)
```

The *SummarizedBenchmark* class also contains a slot where users can define performance metrics to evaluate the different methods. Since different benchmarking experiments may require different metrics to evaluate the performance of the methods, the *SummarizedBenchmark* class provides a flexible way to define them.

Performance metrics are defined using the function `addPerformanceMetric()`, which takes a *SummarizedBenchmark* object, a name for the metric, an assay name, and the function that defines the metric. Importantly, the function must contain the following two arguments: `query` (referring to a vector of values being evaluated, i.e. the output of one method) and `truth` (referring to the vector of ground truths). If further arguments are provided to the performance function, they must have default values.

For our example, we define the performance metric "TPR" (True Positive Rate), which calculates the fraction of true positives recovered given an alpha value. This performance metric uses the `H` assay of our *SummarizedBenchmark* example object.

```{r addPerformanceMetric}
sb <- addPerformanceMetric(
    object = sb,
    assay = "H",
    evalMetric = "TPR",
    evalFunction = function(query, truth, alpha = 0.1) {
        goodHits <- sum((query < alpha) & truth == 1)
        goodHits / sum(truth == 1)
    }
)

performanceMetrics(sb)[["H"]]
```

Having defined all the desired performance metrics, the function `estimatePerformanceMetrics()` calculates these for each method. Parameters for the performance functions can be passed here. In the example below, we specify several `alpha =` values to be used for calculating the performance metrics with each function.

```{r}
resWide <- estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2))
resWide
```

By default, the function above returns a *DataFrame*, where the parameters of the performance function are stored in its `elementMetadata()`.

```{r elWide}
elementMetadata(resWide)
```

A second possibility is to set the parameter `addColData = TRUE`, in which case the results are stored in the `colData()` of the *SummarizedBenchmark* object.

```{r}
sb <- estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2),
                                 addColData = TRUE)
colData(sb)
elementMetadata(colData(sb))
```

Finally, if the user prefers tidier formats, setting the parameter `tidy = TRUE` makes the function return a long-formatted version of the results.
```{r}
estimatePerformanceMetrics(sb, alpha = c(0.05, 0.1, 0.2), tidy = TRUE)
```

As an alternative way to obtain the same *data.frame* as in the previous chunk, we can call the function `tidyUpMetrics()` on the saved results of a *SummarizedBenchmark* object.

```{r}
head(tidyUpMetrics(sb))
```

For example, the code below extracts the TPR at an alpha of 0.1 for the Bonferroni method.

```{r}
tidyUpMetrics(sb) %>%
    dplyr::filter(label == "bonf", alpha == 0.1, performanceMetric == "TPR") %>%
    dplyr::select(value)
```

# Next Steps

This vignette described the minimal structure of the `SummarizedBenchmark` framework using the *BenchDesign* and *SummarizedBenchmark* classes. This should be sufficient for building and executing simple benchmarks, but may not be enough for more complex benchmarking experiments. A more detailed example using `SummarizedBenchmark` is provided in the **SummarizedBenchmark: Full Case Study** vignette, and additional features can be found in the various **Feature** vignettes. A more complete description of the *BenchDesign* and *SummarizedBenchmark* classes can be found in the **SummarizedBenchmark: Class Details** vignette.

# References