--- title: "R / Bioconductor packages for Proteomics" author: "[Laurent Gatto](http://proteome.sysbiol.cam.ac.uk/lgatto/)" output: html_document: theme: united toc: true --- **Abstract** This workshop will give delegates the opportunity to discover and try some of the recent R / Bioconductor developments for proteomics. Topics covered will including support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, quantitative proteomics, mass spectrometry (MS) and quantitation data processing, and visualisation. The workshop material will be a self-contained vignette/workflow including example data. This short tutorial is part of the `ProteomicsBioc2014Workshop` package (version 0.2.0), available at https://github.com/ComputationalProteomicsUnit/ProteomicsBioc2014Workshop. ## Introduction In Bioconductor version 2.14, there are respectively 53 proteomics and 31 mass spectrometry software packages and 7 mass spectrometry experiment packages. These respective packages can be extracted with the `proteomicsPackages()`, `massSpectrometryPackages()` and `massSpectrometryDataPackages()` and explored interactively. ```r library("RforProteomics") pp <- proteomicsPackages(biocv) display(pp) ``` ## Mass spectrometry data |Type |Format |Package | |:--------------|:---------------------------|:-----------------------------------------------------------------------------------------| |raw |mzML, mzXML, netCDF, mzData |[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read) | |identification |mzIdentML |[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) (read) | |quantitation |mzQuantML | | |peak lists |mgf |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) | |other |mzTab |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) | ### Getting data from proteomics repositories Contemporary MS-based proteomics data is disseminated through the [ProteomeXchange](http://www.proteomexchange.org/) infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the [PRIDE](https://www.ebi.ac.uk/pride/archive/) data base at the EBI for MS/MS experiments, [PASSEL](http://www.peptideatlas.org/passel/) at the ISB for SRM data and the [MassIVE](http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) resource. The [`rpx`](http://www.bioconductor.org/packages/release/bioc/html/rpx.html) is an interface to ProteomeXchange and provides a basic and unified access to PX data. ```r library("rpx") pxannounced() ``` ``` ## 14 new ProteomeXchange annoucements ``` ``` ## Data.Set Publication.Data Message ## 1 PXD001187 2014-07-30 15:21:44 New ## 2 PXD000249 2014-07-30 14:31:46 Updated information ## 3 PXD000249 2014-07-30 12:24:35 New ## 4 PXD001049 2014-07-30 10:12:30 New ## 5 PXD000459 2014-07-29 09:59:26 New ## 6 PXD001029 2014-07-29 07:27:32 New ## 7 PXD000807 2014-07-28 09:24:47 Updated information ## 8 PXD000479 2014-07-28 09:23:05 Updated information ## 9 PXD000909 2014-07-28 08:09:14 New ## 10 PXD000813 2014-07-25 09:16:02 New ## 11 PXD000764 2014-07-25 08:53:34 Updated information ## 12 PXD000004 2014-07-24 15:07:14 New ## 13 PXD000001 2014-07-24 14:59:35 New ## 14 PXD001030 2014-07-24 12:28:19 Updated information ``` ```r px <- PXDataset("PXD000001") px ``` ``` ## Object of class "PXDataset" ## Id: PXD000001 with 8 files ## [1] 'F063721.dat' ... [8] 'erwinia_carotovora.fasta' ## Use 'pxfiles(.)' to see all files. ``` ```r pxfiles(px) ``` ``` ## [1] "F063721.dat" ## [2] "F063721.dat-mztab.txt" ## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz" ## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz" ## [5] "PXD000001_mztab.txt" ## [6] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML" ## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw" ## [8] "erwinia_carotovora.fasta" ``` Other metadata for the `px` dataset: ```r pxtax(px) pxurl(px) pxref(px) ``` Data files can then be downloaded with the `pxget` function as illustrated below. Alternatively, the file is available on the workshop's Amazon virtual machine in `/data/Proteomics/data/`. ```r mzf <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML" mzf <- file.path("/data/Proteomics/data", mzf) if (!file.exists(mzf)) mzf <- pxget(px, pxfiles(px)[6]) ``` ``` ## Downloading 1 file ## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML already present. ``` ```r mzf ``` ``` ## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML" ``` ### Handling raw MS data The `mzR` package provides an interface to the [proteowizard](http://proteowizard.sourceforge.net/) code base, the legacy RAMP is a non-sequential parser and other C/C++ code to access various raw data files, such as `mzML`, `mzXML`, `netCDF`, and `mzData`. The data is accessed on-disk, i.e it does not get loaded entirely in memory by default. The three main functions are `openMSfile` to create a file handle to a raw data file, `header` to extract metadata about the spectra contained in the file and `peaks` to extract one or multiple spectra of interest. Other functions such as `instrumentInfo`, or `runInfo` can be used to gather general information about a run. ```r library("mzR") ms <- openMSfile(mzf) ms ``` ``` ## Mass Spectrometry file handle. ## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML ## Number of scans: 7534 ``` ```r hd <- header(ms) dim(hd) ``` ``` ## [1] 7534 21 ``` ```r names(hd) ``` ``` ## [1] "seqNum" "acquisitionNum" ## [3] "msLevel" "polarity" ## [5] "peaksCount" "totIonCurrent" ## [7] "retentionTime" "basePeakMZ" ## [9] "basePeakIntensity" "collisionEnergy" ## [11] "ionisationEnergy" "lowMZ" ## [13] "highMZ" "precursorScanNum" ## [15] "precursorMZ" "precursorCharge" ## [17] "precursorIntensity" "mergedScan" ## [19] "mergedResultScanNum" "mergedResultStartScanNum" ## [21] "mergedResultEndScanNum" ``` #### Exercise Extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum. Is the data centroided or in profile mode? ### Handling identification data The `RforProteomics` package distributes a small identification result file (see `?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we load and parse using infrastructure from the [`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) package. ```r library("mzID") (f <- dir(system.file("extdata", package = "ProteomicsBioc2014Workshop"), pattern = "mzid", full.names=TRUE)) ``` ``` ## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.1/ProteomicsBioc2014Workshop/extdata/TMT_Erwinia.mzid.gz" ``` ```r id <- mzID(f) ``` ``` ## reading TMT_Erwinia.mzid.gz... DONE! ``` ```r id ``` ``` ## An mzID object ## ## Software used: MS-GF+ (version: Beta (v10072)) ## ## Rawfile: /home/lgatto/dev/00_github/RforProteomics/sandbox/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML ## ## Database: /home/lgatto/dev/00_github/RforProteomics/sandbox/erwinia_carotovora.fasta ## ## Number of scans: 5287 ## Number of PSM's: 5563 ``` Various data can be extracted from the `mzID` object, using one the accessor functions such as `database`, `scans`, `peptides`, ... The object can also be converted into a `data.frame` using the `flatten` function. #### Exercise Is there a relation between the length of a protein and the number of identified peptides, conditioned by the (average) e-value of the identifications? ### MS/MS database search While searches are generally performed using third-party software independently of R or can be started from R using a `system` call, the [`rTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/rTANDEM.html) package allows one to execute such searches using the X!Tandem engine. The [`shinyTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/shinyTANDEM.html) provides a interactive interface to explore the search results. ```r library("rTANDEM") ?rtandem library("shinyTANDEM") ?shinyTANDEM ``` ## High-level data interface The above sections introduced low-level interfaces to raw and identification results. The [`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) package provides abstractions for raw data through the `MSnExp` class and containers for quantification data via the `MSnSet` class. Both store the actual assay data (spectra or quantitation matrix) and sample and feature metadata, accessed with `spectra` (or the `[`, `[[` operators) or `exprs`, `pData` and `fData`. The figure below give a schematics of an `MSnSet` instance and the relation between the assay data and the respective feature and sample metadata. plot of chunk msnset Another useful slot is `processingData`, accessed with `processingData(.)`, that records all the processing that objects have undergone since their creation (see examples below). The `readMSData` will parse the raw data, extract the MS2 spectra and construct an MS experiment file. ```r library("MSnbase") quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "mzXML$") quantFile ``` ``` ## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzXML" ``` ```r msexp <- readMSData(quantFile, verbose=FALSE) msexp ``` ``` ## Object of class "MSnExp" ## Object size in memory: 0.2 Mb ## - - - Spectra data - - - ## MS level(s): 2 ## Number of MS1 acquisitions: 1 ## Number of MSn scans: 5 ## Number of precursor ions: 5 ## 4 unique MZs ## Precursor MZ's: 437.8 - 716.34 ## MSn M/Z range: 100 2017 ## MSn retention times: 25:1 - 25:2 minutes ## - - - Processing information - - - ## Data loaded: Wed Jul 30 21:43:48 2014 ## MSnbase version: 1.13.12 ## - - - Meta data - - - ## phenoData ## rowNames: 1 ## varLabels: sampleNames ## varMetadata: labelDescription ## Loaded from: ## dummyiTRAQ.mzXML ## protocolData: none ## featureData ## featureNames: X1.1 X2.1 ... X5.1 (5 total) ## fvarLabels: spectrum ## fvarMetadata: labelDescription ## experimentData: use 'experimentData(object)' ``` The identification results stemming from the same raw data file can then be used to add PSM matches. ```r ## find path to a mzIdentML file identFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "mzid$") identFile ``` ``` ## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzid" ``` ```r msexp <- addIdentificationData(msexp, identFile) ``` ``` ## reading dummyiTRAQ.mzid... DONE! ``` ```r fData(msexp) ``` ``` ## spectrum passthreshold rank calculatedmasstocharge ## X1.1 1 TRUE 1 645.0 ## X2.1 2 TRUE 1 547.0 ## X3.1 3 NA NA NA ## X4.1 4 NA NA NA ## X5.1 5 TRUE 1 437.3 ## experimentalmasstocharge chargestate ms-gf:denovoscore ms-gf:evalue ## X1.1 645.4 3 77 79.37 ## X2.1 547.0 3 39 13.47 ## X3.1 NA NA NA NA ## X4.1 NA NA NA NA ## X5.1 437.8 2 5 366.38 ## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod ## X1.1 -39 5.527e-05 CID ## X2.1 -30 9.399e-06 CID ## X3.1 NA NA ## X4.1 NA NA ## X5.1 -42 2.578e-04 CID ## isotopeerror isdecoy post pre end start accession length ## X1.1 1 FALSE A R 186 170 ECA0984;ECA3829 231 ## X2.1 0 FALSE A K 62 50 ECA1028 275 ## X3.1 NA NA NA NA ## X4.1 NA NA NA NA ## X5.1 1 FALSE L K 28 22 ECA0510 166 ## description ## X1.1 DNA mismatch repair protein;acetolactate synthase isozyme III large subunit ## X2.1 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase ## X3.1 ## X4.1 ## X5.1 putative capsular polysacharide biosynthesis transferase ## pepseq modified modification databaseFile ## X1.1 VESITARHGEVLQLRPK FALSE NA erwinia_carotovora.fasta ## X2.1 IDGQWVTHQWLKK FALSE NA erwinia_carotovora.fasta ## X3.1 NA NA ## X4.1 NA NA ## X5.1 LVILLFR FALSE NA erwinia_carotovora.fasta ## identFile nprot npep.prot npsm.prot npsm.pep ## X1.1 2 2 1 1 1 ## X2.1 2 1 1 1 1 ## X3.1 NA NA NA NA NA ## X4.1 NA NA NA NA NA ## X5.1 2 1 1 1 1 ``` ```r msexp[[1]] ``` ``` ## Object of class "Spectrum2" ## Precursor: 645.4 ## Retention time: 25:1 ## Charge: 3 ## MSn level: 2 ## Peaks count: 2921 ## Total ion count: 668170086 ``` ```r plot(msexp[[1]], full=TRUE) ``` plot of chunk specplot ```r as(msexp[[1]], "data.frame")[100:105, ] ``` ``` ## mz i ## 100 141.1 588595 ## 101 141.1 845401 ## 102 141.1 791352 ## 103 141.1 477623 ## 104 141.1 155376 ## 105 141.1 4753 ``` ### Quantitative proteomics There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2. | |Label-free |Labelled | |:---|:----------|:----------| |MS1 |XIC |SILAC, 15N | |MS2 |Counting |iTRAQ, TMT | In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the [`xcms`](http://bioconductor.org/packages/release/bioc/html/xcms.html) infrastructure. An `MSnExp` is converted to an `MSnSet` by the `quantitation` method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the `iTRAQ4` parameter; other tags are available). ```r plot(msexp[[1]], full=TRUE, reporters = iTRAQ4) ``` plot of chunk itraq4plot ```r msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE) ``` ``` ## Error: argument "verbose" is missing, with no default ``` ```r exprs(msset) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'exprs': Error: object 'msset' not found ``` ```r processingData(msset) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'msset' not found ``` Other MS2 quantitation methods available in `quantify` include the (normalised) spectral index `SI` and (normalised) spectral abundance factor `SAF` or simply a simple count method. ```r exprs(si <- quantify(msexp, method = "SIn")) ``` ``` ## 1 ## ECA0510 0.003589 ## ECA1028 0.001470 ``` ```r exprs(saf <- quantify(msexp, method = "NSAF")) ``` ``` ## 1 ## ECA0510 0.6236 ## ECA1028 0.3764 ``` Note that spectra that have not been assigned any peptide (`NA`) or that match non-unique peptides (`npsm > 1`) are discarded in the counting process. **See also** The [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html) package supports quantitation from centroided `mgf` peak lists or its own tab-separated files that can be generated from Mascot and Phenyx vendor files. ### Importing third-party data The PSI `mzTab` file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections. ```r mztf <- pxget(px, pxfiles(px)[2]) ``` ``` ## Downloading 1 file ## F063721.dat-mztab.txt already present. ``` ```r (mzt <- readMzTabData(mztf, what = "PEP")) ``` ``` ## Warning: Support for mzTab version 0.9 only. Support will be added soon. ``` ``` ## Detected a metadata section ## Detected a peptide section ``` ``` ## MSnSet (storageMode: lockedEnvironment) ## assayData: 1528 features, 6 samples ## element names: exprs ## protocolData: none ## phenoData ## rowNames: sub[1] sub[2] ... sub[6] (6 total) ## varLabels: abundance ## varMetadata: labelDescription ## featureData ## featureNames: 1 2 ... 1528 (1528 total) ## fvarLabels: sequence accession ... uri (14 total) ## fvarMetadata: labelDescription ## experimentData: use 'experimentData(object)' ## Annotation: ## - - - Processing information - - - ## mzTab read: Wed Jul 30 21:44:00 2014 ## MSnbase version: 1.13.12 ``` It is also possible to import arbitrary spreadsheets as `MSnSet` objects into R with the `readMSnSet2` function. The main 2 arguments of the function are (1) a text-based spreadsheet and (2) column names of indices that identify the quantitation data. ```r csv <- dir(system.file ("extdata" , package = "pRolocdata"), full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv") getEcols(csv, split = ",") ``` ``` ## [1] "\"Protein ID\"" "\"FBgn\"" ## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\"" ## [5] "\"Mascot score\"" "\"No. peptides quantified\"" ## [7] "\"area 114\"" "\"area 115\"" ## [9] "\"area 116\"" "\"area 117\"" ## [11] "\"PLS-DA classification\"" "\"Peptide sequence\"" ## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\"" ## [15] "\"pd.2013\"" "\"pd.markers\"" ``` ```r ecols <- 7:10 res <- readMSnSet2(csv, ecols) head(exprs(res)) ``` ``` ## area.114 area.115 area.116 area.117 ## 1 0.3790 0.2810 0.2250 0.1140 ## 2 0.4200 0.2097 0.2061 0.1639 ## 3 0.1873 0.1673 0.1697 0.4760 ## 4 0.2475 0.2530 0.3200 0.1790 ## 5 0.2160 0.1830 0.3420 0.2590 ## 6 0.0720 0.2123 0.5730 0.1427 ``` ```r head(fData(res)) ``` ``` ## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score ## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86 ## 2 CG10067 FBgn0000044 Act57B 5 222.40 ## 3 CG10077 FBgn0035720 CG10077 5 219.65 ## 4 CG10079 FBgn0003731 Egfr 2 86.39 ## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10 ## 6 CG10130 FBgn0010638 Sec61beta 2 79.90 ## No..peptides.quantified PLS.DA.classification Peptide.sequence ## 1 1 PM ## 2 9 PM ## 3 3 ## 4 2 PM ## 5 1 GGVFDTIQK ## 6 3 ER/Golgi ## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers ## 1 PM unknown ## 2 PM unknown ## 3 unknown unknown ## 4 PM unknown ## 5 626.887 2 Phenotype 1 unknown ## 6 ER/Golgi ER ``` ## Data processing and analysis ### Processing and normalisation Each different types of quantitative data will require their own pre-processing and normalisation steps. Both `isobar` and `MSnbase` allow to correct for isobaric tag impurities normalise the quantitative data. ```r data(itraqdata) qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4, verbose = FALSE) ``` ``` ## Error: argument "verbose" is missing, with no default ``` ```r impurities <- matrix(c(0.929,0.059,0.002,0.000, 0.020,0.923,0.056,0.001, 0.000,0.030,0.924,0.045, 0.000,0.001,0.040,0.923), nrow=4, byrow = TRUE) ## or, using makeImpuritiesMatrix() ## impurities <- makeImpuritiesMatrix(4) qnt.crct <- purityCorrect(qnt, impurities) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'purityCorrect': Error: object 'qnt' not found ``` ```r processingData(qnt.crct) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'qnt.crct' not found ``` ```r plot0 <- function(x, y, main = "") { old.par <- par(no.readonly = TRUE) on.exit(par(old.par)) par(mar = c(4, 4, 1, 1)) par(mfrow = c(2, 2)) sx <- sampleNames(x) sy <- sampleNames(y) for (i in seq_len(ncol(x))) { plot(exprs(x)[, i], exprs(y)[, i], log = "xy", xlab = sx[i], ylab = sy[i]) grid() } } plot0(qnt, qnt.crct) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found ``` Various normalisation methods can be applied the `MSnSet` instances using the `normalise` method: variance stabilisation (`vsn`), quantile (`quantiles`), median or mean centring (`center.media` or `center.mean`), ... ```r qnt.crct.nrm <- normalise(qnt.crct,"quantiles") ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct' not found ``` ```r plot0(qnt, qnt.crct.nrm) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found ``` The `combineFeatures` method combines spectra/peptides quantitation values into protein data. The grouping is defined by the `groupBy` parameter, which is generally taken from the feature metadata (protein accessions, for example). ```r ## arbitraty grouping g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15))) prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum") ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct.nrm' not found ``` ```r processingData(prt) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'prt' not found ``` Finally, proteomics data analysis is generally hampered by missing values. Missing data imputation is a sensitive operation whose success will be guided by many factors, such as degree and (non-)random nature of the missingness. Missing value in `MSnSet` instances can be filtered out and imputed using the `filterNA` and `impute` functions. ```r set.seed(1) qnt0 <- qnt ``` ``` ## Error: object 'qnt' not found ``` ```r exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA ``` ``` ## Error: object 'qnt0' not found ``` ```r table(is.na(qnt0)) ``` ``` ## Error: object 'qnt0' not found ``` ```r qnt00 <- filterNA(qnt0) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'filterNA': Error: object 'qnt0' not found ``` ```r dim(qnt00) ``` ``` ## Error: object 'qnt00' not found ``` ```r qnt.imp <- impute(qnt0) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'impute': Error: object 'qnt0' not found ``` ```r plot0(qnt, qnt.imp) ``` ``` ## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found ``` #### Exercise The `mzt` instance created from the `mzTab` file has the following is a TMT 6-plex with the following design: In this TMT 6-plex experiment, four exogenous proteins were spiked into an equimolar *Erwinia carotovora* lysate with varying proportions in each channel of quantitation; yeast enolase (ENO) at 10:5:2.5:1:2.5:10, bovine serum albumin (BSA) at 1:2.5:5:10:5:1, rabbit glycogen phosphorylase (PHO) at 2:2:2:2:1:1 and bovin cytochrome C (CYT) at 1:1:1:1:1:2. Proteins were then digested, differentially labelled with TMT reagents, fractionated by reverse phase nanoflow UPLC (nanoACQUITY, Waters), and analysed on an LTQ Orbitrap Velos mass spectrometer (Thermo Scientic). Explore the `mzt` data using some of the illustrated functions. The heatmap and MAplot (see `MAplot` function), taken from the [`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html) vignette, have been produced using the same data. ![heatmap](figure/heatmap.png) ![maplot](figure/maplot.png) ### Statistical analysis R in general and Bioconductor in particular are well suited for the statistical analysis of data. Several packages provide dedicated resources for proteomics data: - [`MSstats`](http://www.bioconductor.org/packages/release/bioc/html/MSstats.html): A set of tools for statistical relative protein significance analysis in DDA, SRM and DIA experiments. - [`msmsTest`](http://www.bioconductor.org/packages/release/bioc/html/msmsTests.html): Statistical tests for label-free LC-MS/MS data by spectral counts, to discover differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. - [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html) also provides dedicated infrastructure for the statistical analysis of isobaric data. ### Machine learning The [`MLInterfaces`](http://www.bioconductor.org/packages/release/bioc/html/MLInterfaces.html) package provides a unified interface to a wide range of machine learning algorithms. Initially developed for microarray and `ExpressionSet` instances, the [`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html) package enables application of these algorithms to `MSnSet` data. ```r library("MLInterfaces") library("pRoloc") library("pRolocdata") data(dunkley2006) traininds <- which(fData(dunkley2006)$markers != "unknown") ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds) ans ``` ``` ## MLInterfaces classification output container ## The call was: ## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5), ## trainInd = traininds) ## Predicted outcome distribution for test set: ## ## ER Golgi mit/plastid PM vacuole ## 203 86 129 112 17 ## Summary of scores on test set (use testScores() method for details): ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.400 1.000 1.000 0.965 1.000 1.000 ``` ```r kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12) kcl ``` ``` ## clusteringOutput: partition table ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 38 10 22 94 24 99 43 118 107 48 50 36 ## The call that created this object was: ## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12) ``` ```r plot(kcl, exprs(dunkley2006)) ``` plot of chunk clust ```r hcl <- MLearn( ~ ., data = t(dunkley2006), hclustI(distFun = dist, cutParm = list(k = 4))) hcl ``` ``` ## clusteringOutput: partition table ## ## 1 2 3 4 ## 4 4 4 4 ## The call that created this object was: ## MLearn(formula = ~., data = t(dunkley2006), .method = hclustI(distFun = dist, ## cutParm = list(k = 4))) ``` ```r plot(hcl, exprs(t(dunkley2006))) ``` plot of chunk clust2 ## Annotation All the [Bioconductor annotation infrastructure](http://bioconductor.org/help/workflows/annotation/annotation/), such as [`biomaRt`](http://bioconductor.org/packages/release/bioc/html/biomaRt.html), [`GO.db`](http://www.bioconductor.org/packages/release/data/annotation/html/GO.db.html), organism specific annotations, .. are directly relevant to the analysis of proteomics data. Some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications are available through the [`rols`](http://www.bioconductor.org/packages/release/bioc/html/rols.html). Data from the [Human Protein Atlas](http://www.proteinatlas.org/) is available via the [`hpar`](http://www.bioconductor.org/packages/release/bioc/html/hpar.html) package. ## Other relevant packages/pipelines - Analysis of post translational modification with [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html). - Analysis of label-free data from a Synapt G2 (including ion mobility) with [`synapter`](http://www.bioconductor.org/packages/release/bioc/html/synapter.html). - Analysis of spatial proteomics data with [`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html). - Analysis of MALDI data with the [`MALDIquant`](http://cran.r-project.org/web/packages/MALDIquant/index.html) package. - Access to the Proteomics Standard Initiative Common QUery InterfaCe with the [`PSICQUIC`](http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html) package. Additional relevant packages are described in the [`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html) vignette. ## Session information ``` ## R Under development (unstable) (2014-04-10 r65396) ## Platform: x86_64-unknown-linux-gnu (64-bit) ## ## attached base packages: ## [1] parallel methods stats graphics grDevices utils datasets ## [8] base ## ## other attached packages: ## [1] shinyTANDEM_1.3.0 xtable_1.7-3 mixtools_1.0.2 ## [4] segmented_0.4-0.0 boot_1.3-11 shiny_0.10.0 ## [7] rTANDEM_1.5.0 data.table_1.9.2 BiocInstaller_1.15.5 ## [10] pRolocdata_1.3.0 pRoloc_1.5.13 MLInterfaces_1.45.1 ## [13] sfsmisc_1.0-26 cluster_1.15.2 annotate_1.43.5 ## [16] XML_3.98-1.1 AnnotationDbi_1.27.8 GenomeInfoDb_1.1.12 ## [19] IRanges_1.99.23 S4Vectors_0.1.2 rda_1.0.2-2 ## [22] rpart_4.1-8 genefilter_1.47.6 MASS_7.3-33 ## [25] rpx_1.1.1 MSnbase_1.13.12 BiocParallel_0.7.8 ## [28] ggplot2_1.0.0 Biobase_2.25.0 BiocGenerics_0.11.3 ## [31] mzID_1.3.4 mzR_1.11.10 Rcpp_0.11.2 ## [34] RforProteomics_1.3.1 knitr_1.6 ## ## loaded via a namespace (and not attached): ## [1] affy_1.43.3 affyio_1.33.0 ## [3] BatchJobs_1.3 BBmisc_1.7 ## [5] biocViews_1.33.10 bitops_1.0-6 ## [7] BradleyTerry2_1.0-5 brew_1.0-6 ## [9] brglm_0.5-9 car_2.0-20 ## [11] caret_6.0-30 Category_2.31.1 ## [13] caTools_1.17 checkmate_1.2 ## [15] class_7.3-11 codetools_0.2-8 ## [17] colorspace_1.2-4 DBI_0.2-7 ## [19] digest_0.6.4 doParallel_1.0.8 ## [21] e1071_1.6-3 evaluate_0.5.5 ## [23] fail_1.2 FNN_1.1 ## [25] foreach_1.4.2 formatR_0.10 ## [27] gdata_2.13.3 graph_1.43.0 ## [29] grid_3.2.0 gridSVG_1.4-0 ## [31] GSEABase_1.27.1 gtable_0.1.2 ## [33] gtools_3.4.1 htmltools_0.2.4 ## [35] httpuv_1.3.0 impute_1.39.0 ## [37] interactiveDisplay_1.3.9 iterators_1.0.7 ## [39] kernlab_0.9-19 labeling_0.2 ## [41] lattice_0.20-29 limma_3.21.10 ## [43] lme4_1.1-7 lpSolve_5.6.9 ## [45] MALDIquant_1.10 Matrix_1.1-4 ## [47] mboost_2.3-0 mclust_4.3 ## [49] minqa_1.2.3 munsell_0.4.2 ## [51] mvtnorm_1.0-0 nlme_3.1-117 ## [53] nloptr_1.0.0 nnet_7.3-8 ## [55] nnls_1.4 pcaMethods_1.55.0 ## [57] plyr_1.8.1 preprocessCore_1.27.1 ## [59] proto_0.3-10 proxy_0.4-12 ## [61] quadprog_1.5-5 R.methodsS3_1.6.1 ## [63] R.oo_1.18.0 R.utils_1.32.4 ## [65] randomForest_4.6-10 RBGL_1.41.0 ## [67] RColorBrewer_1.0-5 RCurl_1.95-4.1 ## [69] reshape2_1.4 RJSONIO_1.2-0.2 ## [71] RSQLite_0.11.4 RUnit_0.4.26 ## [73] sampling_2.6 scales_0.2.4 ## [75] sendmailR_1.1-2 splines_3.2.0 ## [77] stats4_3.2.0 stringr_0.6.2 ## [79] survival_2.37-7 tools_3.2.0 ## [81] vsn_3.33.0 zlibbioc_1.11.1 ```