--- title: "An overview of FELLA: data enrichment for metabolomics summary data" author: - name: Sergio Picart-Armada affiliation: B2SLab at Polytechnic University of Catalonia email: sergi.picart@upc.edu - name: Alexandre Perera-Lluna affiliation: B2SLab at Polytechnic University of Catalonia email: alexandre.perera@upc.edu date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('FELLA')`" output: BiocStyle::html_document bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Quick start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction `FELLA` is an R package that brings a new concept for metabolomics data interpretation. The starting point of this data enrichment is a list of affected metabolites, which can stem from a contrast between experimental groups. This list, that may vary in size, encompasses key role players from different __biological pathways__ that generate a biological perturbation. The classical way to analyse this list is the __over representation analysis__. Each metabolic pathway has a statistic, the number of affected metabolites in it, that yields a p-value. After correcting for multiple testing, a list of prioritised pathways helps performing a quality check on the data and suggesting novel biological mechanisms related to the data. Subsequent generations of __pathway analysis__ methods attempt to include quantitative and/or topological data in the statistics in order to improve power for subtle signals, but the interpretation of a prioritised pathway list remains a challenge. Package `FELLA`, on the other hand, introduces a comprehensive output that encompasses other biological entities that coherently relate the top ranked pathways. The priorisation of the pathways and other entiteis stems from a diffusion process on a holistic __graph representation__ of the __KEGG database__. `FELLA` needs: 1. The KEGG graph and other complementary data files. This is stored in a unique `FELLA.DATA` S4 object. 2. A list of affected metabolites (KEGG compounds). This is stored in a unique `FELLA.USER` S4 object, along with user analyses. # Loading the KEGG data This vignette makes use of sample data that contains small subgraph of `FELLA`'s KEGG graph (mid 2017 KEGG release). All the necessary contextual data is stored in an S4 data structure with class `FELLA.DATA`. Several functions need access to the contextual data, passed as an argument called `data`, being the enrichment itself among them. ```{r, warning=FALSE, message=FALSE} library(FELLA) data("FELLA.sample") class(FELLA.sample) show(FELLA.sample) ``` Keep in mind that `FELLA.DATA` objects need to be constructed only once by using `buildGraphFromKEGGREST` and `buildDataFromGraph`, in that precise order. This will store them in a local path and they should be loaded through `loadKEGGdata`. The user is disadvised from manually modifying the database internal files and the `FELLA.DATA` object slots not to corrupt the database. # Loading the metabolomics summary data The second block of necessary data is a list of affected metabolites, which shoud be specified as KEGG compound IDs. Provided is a list of hypothetical affected metabolites belonging to the graph, to which some decoys that do not map to the graph are added. ```{r, warning=FALSE, message=FALSE} data("input.sample") input.full <- c(input.sample, paste0("intruder", 1:10)) show(input.full) ``` Compounds are introduced through the `defineCompounds` function and provide the first `FELLA.USER` user data object containing the mapped compounds and empty analyses slots. The user should always build `FELLA.USER` objects through `defineCompounds` instead of manipulating the slots of the object manually - this might skip quality checks. ```{r, warning=TRUE, message=TRUE} myAnalysis <- defineCompounds( compounds = input.full, data = FELLA.sample) ``` Note that a warning message informs the user that some compounds did not map to the KEGG compound collection. Compounds that successfully mapped can be obtained through `getInput`, ```{r, warning=TRUE, message=TRUE} getInput(myAnalysis) ``` while compounds that were excluded because of mismatch can be accessed through `getExcluded`: ```{r, warning=TRUE, message=TRUE} getExcluded(myAnalysis) ``` Keep in mind that exact matching is sought, so be extremely careful with __whitespaces__, tabs or similar characters that might create mismatches. For example: ```{r, warning=FALSE, message=FALSE, error=TRUE} input.fail <- paste0(" ", input.full) defineCompounds( compounds = input.fail, data = FELLA.sample) ``` # Enriching the data Once the `FELLA.DATA` and the `FELLA.USER` with the affected metabolites are ready, the data can be easily enriched. ## Enrichment methods There are three methods to enrich: 1. __Hypergeometric test__ (`method = "hypergeom"`): it performs the metabolite-sampling hypergeometric test using the connections in `FELLA`'s KEGG graph. This is included for completeness and does not include the contextual novelty of the diffusive methods. 2. __Diffusion__ (`method = "diffusion"`): it performs sub-network analysis on the KEGG graph to extract a meaningful subgraph. This subgraph can be plotted an interpreted 3. __PageRank__ (`method = "pagerank"`): analogous to `"diffusion"` but using the directed diffusion, which matches the PageRank algorithm for web ranking. ## Statistical approximations For methods `"diffusion"` and `"pagerank"`, two statistical approximations are proposed: 1. __Normal approximation__ (`approx = "normality"`): scores are computed through z-scores based on analytical expected value and covariance matrix of the null model for diffusion. This approximation is deterministic and fast. 1. __Monte Carlo trials__ (`approx = "simulation"`): scores are computed through Monte Carlo trials of the random variables. This approximation requires computing the random trials, governed by the `ntrials` argument. ## Enrichment: methods, approximations and wrapper function The function `enrich` wraps the functions `defineCompounds`, `runHypergeom`, `runDiffusion` and `runPagerank` in an easily usable manner, returning a `FELLA.USER` object with complete analyses. ```{r, warning=TRUE, message=TRUE} myAnalysis <- enrich( compounds = input.full, method = "diffusion", approx = "normality", data = FELLA.sample) ``` The output is quite informative and aggregates all the warnings. Let's compare an empty `FELLA.USER` object ```{r, warning=TRUE, message=TRUE} show(new("FELLA.USER")) ``` to the output of a processed one: ```{r, warning=TRUE, message=TRUE} show(myAnalysis) ``` The wrapper function `enrich` can run the three analysis at once with the option `method = listMethods()`, or only the desired ones providing them as a character vector: ```{r, warning=FALSE, message=FALSE} myAnalysis <- enrich( compounds = input.full, method = listMethods(), approx = "normality", data = FELLA.sample) show(myAnalysis) ``` The wrapped functions work in a similar way, here is an example with `runDiffusion`: ```{r, warning=FALSE, message=TRUE} myAnalysis_bis <- runDiffusion( object = myAnalysis, approx = "normality", data = FELLA.sample) show(myAnalysis_bis) ``` # Visualising the results The method `plot` for data from the package `FELLA` allows a friendly visualisation of the relevant part of the KEGG graph. ## Hypergeom In the case `method = "hypergeom"` the plot encompasses a bipartite graph that contains top pathways and affected compounds. In that case, `threshold = 1` allows the visualisation of both pathways; otherwise a plot with only one pathway would be quite uninformative. ```{r, warning=FALSE, message=TRUE, fig.width=8, fig.height=8} plot( x = myAnalysis, method = "hypergeom", main = "My first enrichment using the hypergeometric test in FELLA", threshold = 1, data = FELLA.sample) ``` ## Diffusion For `method = "diffusion"` the graph contains a richer representations involving __modules, enzymes and reactions__ that link affected pathways and compounds. ```{r, warning=FALSE, message=TRUE, fig.width=8, fig.height=8} plot( x = myAnalysis, method = "diffusion", main = "My first enrichment using the diffusion analysis in FELLA", threshold = 0.1, data = FELLA.sample) ``` ## PageRank For `method = "pagerank"` the concept is analogous to diffusion: ```{r, warning=FALSE, message=TRUE, fig.width=8, fig.height=8} plot( x = myAnalysis, method = "pagerank", main = "My first enrichment using the PageRank analysis in FELLA", threshold = 0.1, data = FELLA.sample) ``` # Exporting the results `FELLA` offers several exporting alternatives, both for the R environment and for external software. ## Exporting inside R The appropriate functions to export the results inside R are `generateResultsTable` for a __data.frame__ object: ```{r, warning=FALSE, message=TRUE, results='asis'} myTable <- generateResultsTable( object = myAnalysis, method = "diffusion", threshold = 0.1, data = FELLA.sample) knitr::kable(head(myTable, 20)) ``` ...and `generateResultsGraph` for a __graph__ in [igraph](http://igraph.org/r/) format: ```{r, warning=FALSE, message=TRUE} myGraph <- generateResultsGraph( object = myAnalysis, method = "diffusion", threshold = 0.1, data = FELLA.sample) show(myGraph) ``` ## Exporting outside R Results can be saved as permanent files. The __data.frame__ data format can be saved as a `.csv` file: ```{r, warning=FALSE, message=TRUE, results='asis'} myTempDir <- tempdir() myExp_csv <- paste0(myTempDir, "/table.csv") exportResults( format = "csv", file = myExp_csv, method = "pagerank", threshold = 0.1, object = myAnalysis, data = FELLA.sample) test <- read.csv(file = myExp_csv) knitr::kable(head(test)) ``` In the same line, the __graph__ can be saved in `RData`: ```{r, warning=FALSE, message=TRUE, results='asis'} myExp_graph <- paste0(myTempDir, "/graph.RData") exportResults( format = "igraph", file = myExp_graph, method = "pagerank", threshold = 0.1, object = myAnalysis, data = FELLA.sample) stopifnot("graph.RData" %in% list.files(myTempDir)) ``` Other formats exported by [igraph](http://igraph.org/r/) are also available, internally using their function `igraph::write.graph`. Check the __format__ argument of `?igraph::write.graph` for a list of the supported formats. For example, using `"pajek"` format: ```{r, warning=FALSE, message=TRUE, results='asis'} myExp_pajek <- paste0(myTempDir, "/graph.pajek") exportResults( format = "pajek", file = myExp_pajek, method = "diffusion", threshold = 0.1, object = myAnalysis, data = FELLA.sample) stopifnot("graph.pajek" %in% list.files(myTempDir)) ``` This option is toggled if the format does not match any other predefined export option. # Session info For reproducibility purposes, below is the `sessionInfo()` output: ```{r} sessionInfo() ```