--- title: "pRolocGUI - Interactive visualisation of spatial proteomics data" author: - name: Lisa Breckels affiliation: - &cpu Computational Proteomics Unit, Cambridge, UK - name: Thomas Naake - name: Laurent Gatto affiliation: *cpu package: pRolocGUI output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{pRolocGUI - Interactive visualisation of spatial proteomics data} %\VignetteEngine{knitr::rmarkdown} %\VignetteKeywords{Infrastructure, Bioinformatics, ontology, data} %\VignetteEncoding{UTF-8} --- ```{r env, echo=FALSE} library("BiocStyle") ``` # Foreword `r Biocpkg("pRolocGUI")` is under active development; current functionality is evolving and new features will be added. This software is free and open-source. You are invited to open issues in the [Github `pRolocGUI` repository](https://github.com/ComputationalProteomicsUnit/pRolocGUI/issues) in case you have any questions, suggestions or have found any bugs or typos. To reach a broader audience for more general questions about proteomics analyses using R consider of writing to the [Bioconductor Support Forum](https://support.bioconductor.org). # Introduction This vignette describes the implemented functionality of the `pRolocGUI` package. The package is based on the `MSnSet` class definitions of `r Biocpkg("MSnbase")` and on the functions defined in the `r Biocpkg("pRoloc")` package. `r Biocpkg("pRolocGUI")` is intended for, but not limited to, the interactive visualisation and analysis of quantitative spatial proteomics data. To achieve reactivity and interactivity, `pRolocGUI` relies on the [`shiny`](http://www.rstudio.com/shiny/) framework. We recommend some familiarity with the `MSnSet` class (see `?MSnSet` for details) and the `pRoloc` vignette (see `vignette("pRoloc-tutorial")`) before using `pRolocGUI`. There are 3 applications distributed with `pRolocGUI` which are wrapped and launched by the `pRolocVis` function. These 3 applications are called according to the argument `app` in the `pRolocVis` function which may be one of "pca", "classify" or "compare". * The `pca` application launches a Principal Components Analysis (PCA) plot of the data, with an alternate profiles tab for visualisation of protein profiles, it also features a searchable data table for the identification of proteins of interest. * The `classify` application has been designed to view machine learning classification results according to user-specified thresholds for the assignment of sub-cellular location. * The `compare` application allows the comparison of two comparable `MSnSet` instances, e.g. this might be of help for the analyses of changes in protein localisation in different conditions. ## Getting started Once R is started, the first step to enable functionality of the package is to load it, as shown in the code chunk below. We also load the `r Biocpkg("pRolocdata")` data package, which contains quantitative proteomics datasets. ```{r loadPkgs, message = FALSE, warning = FALSE} library("pRolocGUI") library("pRolocdata") ``` We begin by loading the dataset `hyperLOPIT2015` from the `pRolocdata` data package. The data was produced from using the hyperLOPIT technology on mouse E14TG2a embryonic stem cells ([Christoforou et al 2016](http://www.nature.com/ncomms/2016/160112/ncomms9992/full/ncomms9992.html)). For more background spatial proteomics data anlayses please see [Gatto et al 2010](http://www.ncbi.nlm.nih.gov/pubmed/21080489), [Gatto et al 2014](http://www.ncbi.nlm.nih.gov/pubmed/24846987) and also the [`pRoloc` tutorial vignette](http://bioconductor.org/packages/release/bioc/vignettes/pRoloc/inst/doc/pRoloc-tutorial.html). ```{r loadData, echo = TRUE, message = FALSE, warning = FALSE} data(hyperLOPIT2015) ``` To load one of the applications using the `pRolocVis` function and view the data you are required to specify a minimum of one key argument, `object`, which is the data to display and must be of class `MSnSet` (or a `MSnSetList` of length 2 for the `compare` application). Please see `vignette("pRoloc-tutorial")` or `vignette("MSnbase-io")` for importing and loading data. The argument `app` tells the `pRolocVis` function what type of application to load. One can choose from: `"pca"` (default), `"classify"`, `"compare"`. The optional argument `fcol` (and `fcol1` and `fcol2` for the compare app) can be used which allows the user to specify the feature meta-data label(s) (`fData` column name(s)) to be plotted. The default is `markers` (i.e. the labelled data) for the PCA and compare For the classification app one must specify the prediction column i.e. the feature meta-data label that corresponds to the column containing the classification results, generated from running a supervised machine learning analysis (see [below](#the-classify-application)). For example, to load the default `pRolocVis` application: ```{r example, eval = FALSE, echo = TRUE} pRolocVis(object = hyperLOPIT2015, fcol = "markers") ``` Launching any of the `pRolocVis` applications will open a new tab in a separate pop-up window, and then the application can be opened in your default Internet browser if desired, by clicking the 'open in browser' button in the top panel of the window. To stop the applications from running press `Esc` or `Ctrl-C` in the console (or use the "STOP" button when using RStudio) and close the browser tab, where `pRolocVis` is running. ## Which app should I use? There are 3 different applications, each one designed to address a different specific user requirement. * The PCA app is intended for exploratory data analysis, which features a clickable interface and zoomable PCA plot. If you would like to search for a particular protein or set of proteins this is the application to use. This app also features a protein profiles tab, designed for examining the patterns of user-specified sets of proteins. For example, if one has several overlapping sub-cellular clusters in their data, as highlighted by the PCA plot or otherwise, one can check for separation in all data dimensions by examining the protein profile patterns. Proteins that co-localise are known to exhibit similar distributions (De Duve's principale). * The classification app can be used for viewing the sub-cellular class predictions output from a supervised machine learning analysis and to help the user set a classification threshold (see the [`pRoloc` tutorial](http://bioconductor.org/packages/release/bioc/vignettes/pRoloc/inst/doc/pRoloc-tutorial.pdf) for details on spatial proteomics data analysis). * The comparison application may be of interest if a user wishes to examine two replicate experiments, or two experiments from different conditions etc. Two PCA plots are loaded side-by-side and one can search and identify common proteins between the two data sets. As per the default application there is also a protein profiles tab to allow one to look at the patterns of protein profiles of interest in each dataset. # The `pca` application The `pca`, default, application is characterised by an interactive and searchable Principal Components Analysis (PCA) plot. PCA is an ordinance method that can be used to transform a high-dimensional dataset into a smaller lower-dimenensional set of uncorrelated variables (principal components), such that the first principal component has the largest possible variance to account for as much variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. Thus, PCA is particularly useful for visualisation of multidimensional data in 2-dimensions, wherein all the proteins can be plotted on the same figure. The application is subdivided in to three tabs: (1) PCA, (2) Profiles, and (3) Table Selection. A searchable data table containing the experimental feature meta-data is permanantly dispalyed at the bottom of the screen for ease. You can browse between the tabs by simply clicking on them at the top of the screen. To run the `pca` application using `pRolocVis`: ```{r pca1, eval = FALSE, echo = TRUE} pRolocVis(object = hyperLOPIT2015, fcol = "markers") ``` ![The PCA Tab](figures/SS_PCA1.jpg) **Viewing** The PCA tab is characterised by its main panel which shows a PCA plot for the selected `MSnSet`. By default a PCA plot is used to display the data and the first two principal components are plotted. The sidebar panel controls what features to highlight on the PCA plot. Under the 'Labels' menu, input can be selected by clicking on and off the data class names, or by typing and searching in the white input box. Selected items can then be deleted, by clicking on the name of the class and pressing the delete button on your keyboard. The PCA plot will then be updated accordingly. Below the select box is a 'transparancy' slider bar which controls the opacity of the highlighted data classes and two action buttons 'Zoom/reset plot' and 'Clear selection', which are described below. ![Selecting sub-cellular classes](figures/SS_PCA2.jpg) **Searching** Below the PCA plot is a searchable data table containing the fetaure meta data (`fData`). For LOPIT experiments, such as the one used in this example, this may contain protein accession numbers, protein entry names, protein description, the number of quantified peptides per protein, and columns containing sub-cellular localisation information. The data table is limited to displaying 12 columns of information, these are automatically selected from the `fData` to be the first 6 and last features. To select specific columns in the `fData` to display in the data table use the `fdataInds` argument, see `?pRolocVis` for more details.One can search for proteins of interest by using the white search box, above the table to the right. Searching is done by partial pattern matching with table elements. Any matches or partial text matches that are found are highlighted in the data table. To select/unselect a protein of interest one can simply click/unclick on the corresponding entry in the table or double click directly on a protein of interest on the interactive PCA plot. If a protein(s) in the table is clicked and selected the row in the table will turn grey and the protein(s) will be highlighted on the PCA plot by a dark grey circle(s), if the 'Show labels' box is checked in the left sidebar panel the protein names for the selected protein(s) will also be shown on the PCA plot. Any selected proteins on the PCA plot or in the table can be cleared at any time by clicking the 'Clear selection' button on the left hand side panel. ![Searching for proteins of interest](figures/SS_PCA3.jpg) **Zooming** If a user wishes to examine a protein(s) in more detail, one can zoom in on specific points by hovering the mouse over the plot, then clicking and drawing a (square) brush and then clicking the 'Zoom/reset button' in the left side panel to zoom to the brushed area. This process can be repeated until the desired level of zoom is reached. The plot can be resetted to the original size by clicking the 'Zoom/reset button' once again. ![Brushing on the plot](figures/SS_PCA4.jpg) ![Zooming proteins of interest](figures/SS_PCA5.jpg) **Profiles** By clicking the profiles tab at the top of the page a protein profiles plot is displayed that shows the quantitation data that is stored in the `exprs` data slot of the `MSnSet`. For the `hyperLOPIT2015` dataset this is the relative abundances of each protein across the 20 fractions (2 x 10-plex replicates). As per the PCA tab, the profiles plot can also be updated according to the input selected in the sidebar panel on the left. The profiles tab may be useful to specifically look for discrimination between (potentially overlappling) sub-cellular niches. It allows one to do this in an easy and direct manor where all proteins belonging to the same sub-cellular niche/data cluster (as specified by `fcol`) are loaded together. The protein distribution patterns can then be examined on a group vs group basis. Proteins of interest can be searched in the data table and once clicked, the distribution(s) of selected protein(s) are shown by black lines. ![The profiles tab](figures/SS_PCA6.jpg) ![The profiles tab, selecting proteins of interest](figures/SS_PCA7.jpg) **Features** There is also functionality to use the `FeaturesOfInterest`/`FoICollection` infrastructure distributed by the `MSnbase` package (for examples on how to create `FeaturesOfInterest` see the [`pRoloc` tutorial](http://bioconductor.org/packages/release/bioc/vignettes/pRoloc/inst/doc/pRoloc-tutorial.html)). ![Table Selection](figures/SS_PCA8.jpg) **Table Selection** The Table Selection tab provides an interface for data table column selection. Multiple columns can be selected on and off by clicking/unclicking the checkboxes that correspond to the columns in the data table. **Note:** Other ordinance methods are available for displaying the data, for example, multidimensional scaling (MDS), and kernal-PCA, and t-SNE are all supported, and can be specified using the `method` argument when caling `pRolocVis` (this is not supported in the `compare` or `classify` application). # The `classify` application Machine learning classification forms a large part of spatial proteomics data analysis. Protein localisation prediction can be cast as a supervised machine learning problem (learning from labelled instances), wherein one has a set of a few well-known examples (labelled data), that is sub-cellular protein markers (proteins that are known to belong to a set of finite sub-cellular niches), which can used to learn a classifier to associate unlabelled proteins to one of the sub-cellular classes that appear in the labelled training data. In the example below, we use one of the classification algorithms from the `r Biocpkg("pRoloc")` package; a Support Vector Machine (SVM) classifier, and train a model for protein localisation prediction of unassigned proteins in the `hyperLOPIT2015` dataset. We first use the `svmOptimisation` function to find the best model parameters using the labelled training data found in `fcol = "markers"` and then apply these parameters using the `svmClassification` function. (Note, here we perform a reduced search using `times = 3` in the interest of time. In practise we recommend at least to use `times = 100` as described in the [`pRoloc` tutorial](http://bioconductor.org/packages/release/bioc/vignettes/pRoloc/inst/doc/pRoloc-tutorial.html). This tutorial also contains more information on machine learning, the practise of training and testing, and some extensive examples of machine learning classification in spatial proteomics.) ```{r classify, eval = TRUE, echo = TRUE, warning = FALSE} opt <- svmOptimisation(object = hyperLOPIT2015, fcol = "markers", times = 3, verbose = FALSE) res <- svmClassification(object = hyperLOPIT2015, assessRes = opt) ``` By default, the classification function adds new feature variables containing the new sub-cellular assignments made by the SVM classifier and the associated assignment probabilities, called scores, to the `featureData` slot of the `MSnSet`, in this case, they are labelled `svm` and `svm.scores`, and can be accessed using the `fData` accessor method, e.g. `fData(res)$svm` or `fData(res)$svm.scores`. It is common when applying a supervised classification algorithm, wherein the whole class diversity is not present in the training data, to set a specific score cutoff on which to define new assignments, below which classifications are set to unknown/unassigned. Deciding on a threshold is not trivial as classifier scores are heavily dependent upon the classifier used and different sub-cellular niches can exhibit different score distributions. To help examine these distributions and set a threshold one can use the `classify` app. To launch the `classify` application: ```{r cutoff, eval = FALSE, echo = TRUE} pRolocVis(object = res, app = "classify", fcol = "svm") ``` ![The classification application, setting a "quantile" cutoff score](figures/SS_Classify1.jpg) The data is loaded and displayed on a PCA plot and a boxplot is used to display the classifier scores by data class. On the left there is a sidebar panel with sliders to control the thresholds upon which classifications are made. There are two types of cut-off that the user can choose from: (1) "Quantile" and (2) "User-defined". By default, when the application is launched quatile scoring is selected and set to 0.5, the median. The class-specific score thresholds that correspond to selecting the desired quantile are shown on as red dots on the boxplot. The assignments on the PCA plot are also updated according to the selected threshold. The quantile threshold can be set by moving the corresponding quantile slider. If one wished to set their own cut-offs the "User-defined" radio button must be selected and then the sliders for defining user-specified scores become active and the scores and highlighted on the boxplot by blue dots. ![The classification application, setting a "user-defined" cutoff score](figures/SS_Classify2.jpg) By default, when user-specified scores are selected all sliders are set to 1 and can be changed by moving the sliders to the desired score. Once the desired score has been found the application can be closed and the class-specific scores are displayed in the R console. These scores can be used to get protein localisation predictions using the `getPredictions` function, as demonstrated below: ```{r score, eval=FALSE} mythreshold <- pRolocVis(object = res, app = "classify", fcol = "svm") res <- getPredictions(res, fcol = "svm", mcol = "markers", t = mythreshold) ``` The classification app can also be used as an intercative version of the function `orgQuants` in the `r Biocpkg("pRoloc")` package. # The `compare` application The comparison application may be of interest if a user wishes to examine two replicate experiments, or two experiments from different conditions etc. Two PCA plots are loaded side-by-side and one can search and identify common proteins between the two data sets. A `MSnSetList` of length 2 must be supplied as input, containing the two datasets one wishes to compare. In the example below we load two replicate datasets of mouse embryonic stem cells produced using the hyperLOPIT technology. ```{r compare, eval = FALSE, echo = TRUE} data(hyperLOPIT2015ms3r1) data(hyperLOPIT2015ms3r2) mydata <- MSnSetList(list(hyperLOPIT2015ms3r1, hyperLOPIT2015ms3r2)) pRolocVis(mydata, app = "compare") ``` ![The compare application, main panel](figures/SS_Compare1.jpg) ![The compare application, selecting classes](figures/SS_Compare2.jpg) **Viewing, remapping, searching and zooming** The compare app has the same functionality as the pca application and PCA, Profiles and Table Selection tabs. One key feature of the compare application is the ability to re-map the second dataset onto the PCA data space of the first (reference) data set (see `?pRolocVis` and the argument `remap = TRUE`). Currently, only PCA is supported and re-mapping is done by default. This can be switched off with the `remap` argument. Using the first dataset as the reference set, PCA is carried out on the first dataset and the standard deviations of the principal components (i.e. the square roots of the eigenvalues of the covariance/correlation matrix) and the matrix of variable loadings (i.e. a matrix whose columns contain the eigenvectors) are stored and then used to calculate the principal components of the second dataset. Both datasets are scaled and centered in the usual way. The first dataset appears on the left, and the second re-mapped data appears on the right. The order of the first (the reference data for remapping) and second dataset can be changed through regeneration/re-ordering of the `MSnSetList` object. Note: the proteins that are common in both datasets are only displayed. As per the pca application, and described in detail above, there is a sidebar with a 'Labels' menu, where input can be selected by clicking on and off the data class names. Proteins of interest can be highlighted by double clicking on any of the PCA plots (and highlighted in both datasets on both PCA plots), or by typing and searching in the white input box above the data table and clicking on the protein of interest in the data table. Zooming, clicking and searching for proteins of interest is supported as per the pca app. ![The compare application, searching fpr proteins](figures/SS_Compare3.jpg) ![The compare application, brushing](figures/SS_Compare4.jpg) ![The compare application, zooming](figures/SS_Compare5.jpg) **Profiles** As per the pca application there is a profiles tab which loads the quantitative protein profiles for the first experiment and second experiment, on the left- and right-hand sides respectively. One can highlight proteins of interest by clicking items in the data table, and selecting classes to display in the side panel under the 'Labels' menu. ![The compare application, profiles](figures/SS_Compare6.jpg) **Table Selection** By default 4 columns containing the feature data the first dataset (dark blue) and 4 columns for the second dataset (black) will be displayed in the table, and users can select particular columns they wish to display in the Table Selection tab. ![The compare application, table selection](figures/SS_Compare7.jpg) # References Gatto L., VizcaĆ­no J.A., Hermjakob H., Huber W. and Lilley K.S. *Organelle proteomics experimental designs and analysis* [Proteomics, 10:22, 3957-3969, 2010](http://www.ncbi.nlm.nih.gov/pubmed/21080489). Gatto L., Breckels L.M., Burger T., Nightingale D., Groen A.J., Campbell C., Nikolovski N., Mulvey C.M., Christoforou A., Ferro M., Lilley K.S. *A foundation for reliable spatial proteomics data analysis*, [Mol Cell Proteomics. 2014 Aug;13(8):1937-52](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4125728/). Christoforou A., Mulvey C.M., Breckels L.M., Hayward P.C., Geladaki E., Hurrell T., et al. *A draft map of the mouse pluripotent stem cell spatial proteome*. [Nat Commun. 2016 Jan 12;7:9992](http://www.nature.com/ncomms/2016/160112/ncomms9992/full/ncomms9992.html).