---
title: "R / Bioconductor packages for Proteomics"
author: "[Laurent Gatto](http://proteome.sysbiol.cam.ac.uk/lgatto/)"
output:
  html_document:
    theme: united	
    toc: true
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Bioc2014}
-->

**Abstract** This workshop will give delegates the opportunity to
discover and try some of the recent R / Bioconductor developments for
proteomics. Topics covered will including support for open
community-driven formats for raw data and identification results,
packages for peptide-spectrum matching, quantitative proteomics, mass
spectrometry (MS) and quantitation data processing, and
visualisation. The workshop material will be a self-contained
vignette/workflow including example data.

This short tutorial is part of the `ProteomicsBioc2014Workshop`
package (version 0.2.0),
available at
https://github.com/ComputationalProteomicsUnit/ProteomicsBioc2014Workshop.


## Introduction


In Bioconductor version 2.14, there are respectively 53
proteomics and 31 mass spectrometry software packages and
7 mass spectrometry experiment packages. These respective
packages can be extracted with the `proteomicsPackages()`,
`massSpectrometryPackages()` and `massSpectrometryDataPackages()` and
explored interactively.


```r
library("RforProteomics")
pp <- proteomicsPackages(biocv)
display(pp)
```

## Mass spectrometry data


|Type           |Format                      |Package                                                                                   |
|:--------------|:---------------------------|:-----------------------------------------------------------------------------------------|
|raw            |mzML, mzXML, netCDF, mzData |[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read)               |
|identification |mzIdentML                   |[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) (read)             |
|quantitation   |mzQuantML                   |                                                                                          |
|peak lists     |mgf                         |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) |
|other          |mzTab                       |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) |

### Getting data from proteomics repositories

Contemporary MS-based proteomics data is disseminated through the
[ProteomeXchange](http://www.proteomexchange.org/) infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
[PRIDE](https://www.ebi.ac.uk/pride/archive/) data base at the EBI for
MS/MS experiments, [PASSEL](http://www.peptideatlas.org/passel/) at
the ISB for SRM data and the
[MassIVE](http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp)
resource. The
[`rpx`](http://www.bioconductor.org/packages/release/bioc/html/rpx.html)
is an interface to ProteomeXchange and provides a basic and unified
access to PX data.


```r
library("rpx")
pxannounced()
```

```
## 14 new ProteomeXchange annoucements
```

```
##     Data.Set    Publication.Data             Message
## 1  PXD001187 2014-07-30 15:21:44                 New
## 2  PXD000249 2014-07-30 14:31:46 Updated information
## 3  PXD000249 2014-07-30 12:24:35                 New
## 4  PXD001049 2014-07-30 10:12:30                 New
## 5  PXD000459 2014-07-29 09:59:26                 New
## 6  PXD001029 2014-07-29 07:27:32                 New
## 7  PXD000807 2014-07-28 09:24:47 Updated information
## 8  PXD000479 2014-07-28 09:23:05 Updated information
## 9  PXD000909 2014-07-28 08:09:14                 New
## 10 PXD000813 2014-07-25 09:16:02                 New
## 11 PXD000764 2014-07-25 08:53:34 Updated information
## 12 PXD000004 2014-07-24 15:07:14                 New
## 13 PXD000001 2014-07-24 14:59:35                 New
## 14 PXD001030 2014-07-24 12:28:19 Updated information
```


```r
px <- PXDataset("PXD000001")
px
```

```
## Object of class "PXDataset"
##  Id: PXD000001 with 8 files
##  [1] 'F063721.dat' ... [8] 'erwinia_carotovora.fasta'
##  Use 'pxfiles(.)' to see all files.
```

```r
pxfiles(px)
```

```
## [1] "F063721.dat"                                                
## [2] "F063721.dat-mztab.txt"                                      
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"                         
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"                           
## [5] "PXD000001_mztab.txt"                                        
## [6] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"  
## [8] "erwinia_carotovora.fasta"
```

Other metadata for the `px` dataset:


```r
pxtax(px)
pxurl(px)
pxref(px)
```

Data files can then be downloaded with the `pxget` function as
illustrated below. Alternatively, the file is available on the
workshop's Amazon virtual machine in `/data/Proteomics/data/`.


```r
mzf <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
mzf <- file.path("/data/Proteomics/data", mzf)
if (!file.exists(mzf))
    mzf <- pxget(px, pxfiles(px)[6])
```

```
## Downloading 1 file
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML already present.
```

```r
mzf
```

```
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
```

### Handling raw MS data

The `mzR` package provides an interface to the
[proteowizard](http://proteowizard.sourceforge.net/) code base,
the legacy RAMP is a non-sequential parser and other C/C++ code to
access various raw data files, such as `mzML`, `mzXML`, `netCDF`, and
`mzData`. The data is accessed on-disk, i.e it does not get loaded
entirely in memory by default. The three main functions are
`openMSfile` to create a file handle to a raw data file, `header` to
extract metadata about the spectra contained in the file and `peaks`
to extract one or multiple spectra of interest. Other functions such
as `instrumentInfo`, or `runInfo` can be used to gather general
information about a run.


```r
library("mzR")
ms <- openMSfile(mzf)
ms
```

```
## Mass Spectrometry file handle.
## Filename:  TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML 
## Number of scans:  7534
```


```r
hd <- header(ms)
dim(hd)
```

```
## [1] 7534   21
```

```r
names(hd)
```

```
##  [1] "seqNum"                   "acquisitionNum"          
##  [3] "msLevel"                  "polarity"                
##  [5] "peaksCount"               "totIonCurrent"           
##  [7] "retentionTime"            "basePeakMZ"              
##  [9] "basePeakIntensity"        "collisionEnergy"         
## [11] "ionisationEnergy"         "lowMZ"                   
## [13] "highMZ"                   "precursorScanNum"        
## [15] "precursorMZ"              "precursorCharge"         
## [17] "precursorIntensity"       "mergedScan"              
## [19] "mergedResultScanNum"      "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum"
```

#### Exercise

Extract the index of the MS2 spectrum with the highest base peak
intensity and plot its spectrum. Is the data centroided or in profile
mode?


### Handling identification data

The `RforProteomics` package distributes a small identification result
file (see
`?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we
load and parse using infrastructure from the 
[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html)
package.


```r
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2014Workshop"),
         pattern = "mzid", full.names=TRUE))
```

```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.1/ProteomicsBioc2014Workshop/extdata/TMT_Erwinia.mzid.gz"
```

```r
id <- mzID(f)
```

```
## reading TMT_Erwinia.mzid.gz... DONE!
```

```r
id
```

```
## An mzID object
## 
## Software used:   MS-GF+ (version: Beta (v10072))
## 
## Rawfile:         /home/lgatto/dev/00_github/RforProteomics/sandbox/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 
## Database:        /home/lgatto/dev/00_github/RforProteomics/sandbox/erwinia_carotovora.fasta
## 
## Number of scans: 5287
## Number of PSM's: 5563
```

Various data can be extracted from the `mzID` object, using one the
accessor functions such as `database`, `scans`, `peptides`, ... The
object can also be converted into a `data.frame` using the `flatten`
function.

#### Exercise

Is there a relation between the length of a protein and the number of
identified peptides, conditioned by the (average) e-value of the
identifications?


### MS/MS database search

While searches are generally performed using third-party software
independently of R or can be started from R using a `system` call, the
[`rTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/rTANDEM.html)
package allows one to execute such searches using the X!Tandem
engine. The
[`shinyTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/shinyTANDEM.html)
provides a interactive interface to explore the search results.


```r
library("rTANDEM")
?rtandem
library("shinyTANDEM")
?shinyTANDEM
```


## High-level data interface

The above sections introduced low-level interfaces to raw and
identification results. The
[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html)
package provides abstractions for raw data through the `MSnExp` class
and containers for quantification data via the `MSnSet` class. Both
store the actual assay data (spectra or quantitation matrix) and
sample and feature metadata, accessed with `spectra` (or the
`[`, `[[` operators) or `exprs`, `pData` and `fData`.

The figure below give a schematics of an `MSnSet` instance and the
relation between the assay data and the respective feature and sample
metadata.


<img src="figure/msnset.png" title="plot of chunk msnset" alt="plot of chunk msnset" style="display: block; margin: auto;" />

Another useful slot is `processingData`, accessed with `processingData(.)`, that records all the processing that objects have undergone since their creation (see examples below).

The `readMSData` will parse the raw data, extract the MS2 spectra and
construct an MS experiment file.


```r
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$")
quantFile
```

```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzXML"
```

```r
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
```

```
## Object of class "MSnExp"
##  Object size in memory: 0.2 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of MS1 acquisitions: 1 
##  Number of MSn scans: 5 
##  Number of precursor ions: 5 
##  4 unique MZs
##  Precursor MZ's: 437.8 - 716.34 
##  MSn M/Z range: 100 2017 
##  MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Wed Jul 30 21:43:48 2014 
##  MSnbase version: 1.13.12 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1.1 X2.1 ... X5.1 (5 total)
##   fvarLabels: spectrum
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
```

The identification results stemming from the same raw data file can
then be used to add PSM matches.


```r
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzid$")
identFile
```

```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzid"
```

```r
msexp <- addIdentificationData(msexp, identFile)
```

```
## reading dummyiTRAQ.mzid... DONE!
```

```r
fData(msexp)
```

```
##      spectrum passthreshold rank calculatedmasstocharge
## X1.1        1          TRUE    1                  645.0
## X2.1        2          TRUE    1                  547.0
## X3.1        3            NA   NA                     NA
## X4.1        4            NA   NA                     NA
## X5.1        5          TRUE    1                  437.3
##      experimentalmasstocharge chargestate ms-gf:denovoscore ms-gf:evalue
## X1.1                    645.4           3                77        79.37
## X2.1                    547.0           3                39        13.47
## X3.1                       NA          NA                NA           NA
## X4.1                       NA          NA                NA           NA
## X5.1                    437.8           2                 5       366.38
##      ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod
## X1.1            -39        5.527e-05                       CID
## X2.1            -30        9.399e-06                       CID
## X3.1             NA               NA                      <NA>
## X4.1             NA               NA                      <NA>
## X5.1            -42        2.578e-04                       CID
##      isotopeerror isdecoy post  pre end start       accession length
## X1.1            1   FALSE    A    R 186   170 ECA0984;ECA3829    231
## X2.1            0   FALSE    A    K  62    50         ECA1028    275
## X3.1         <NA>      NA <NA> <NA>  NA    NA            <NA>     NA
## X4.1         <NA>      NA <NA> <NA>  NA    NA            <NA>     NA
## X5.1            1   FALSE    L    K  28    22         ECA0510    166
##                                                                      description
## X1.1 DNA mismatch repair protein;acetolactate synthase isozyme III large subunit
## X2.1          2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
## X3.1                                                                        <NA>
## X4.1                                                                        <NA>
## X5.1                    putative capsular polysacharide biosynthesis transferase
##                 pepseq modified modification             databaseFile
## X1.1 VESITARHGEVLQLRPK    FALSE           NA erwinia_carotovora.fasta
## X2.1     IDGQWVTHQWLKK    FALSE           NA erwinia_carotovora.fasta
## X3.1              <NA>       NA           NA                     <NA>
## X4.1              <NA>       NA           NA                     <NA>
## X5.1           LVILLFR    FALSE           NA erwinia_carotovora.fasta
##      identFile nprot npep.prot npsm.prot npsm.pep
## X1.1         2     2         1         1        1
## X2.1         2     1         1         1        1
## X3.1        NA    NA        NA        NA       NA
## X4.1        NA    NA        NA        NA       NA
## X5.1         2     1         1         1        1
```


```r
msexp[[1]]
```

```
## Object of class "Spectrum2"
##  Precursor: 645.4 
##  Retention time: 25:1 
##  Charge: 3 
##  MSn level: 2 
##  Peaks count: 2921 
##  Total ion count: 668170086
```

```r
plot(msexp[[1]], full=TRUE)
```

<img src="figure/specplot.png" title="plot of chunk specplot" alt="plot of chunk specplot" style="display: block; margin: auto;" />

```r
as(msexp[[1]], "data.frame")[100:105, ]
```

```
##        mz      i
## 100 141.1 588595
## 101 141.1 845401
## 102 141.1 791352
## 103 141.1 477623
## 104 141.1 155376
## 105 141.1   4753
```

### Quantitative proteomics

There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2. 


|    |Label-free |Labelled   |
|:---|:----------|:----------|
|MS1 |XIC        |SILAC, 15N |
|MS2 |Counting   |iTRAQ, TMT |

In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
[`xcms`](http://bioconductor.org/packages/release/bioc/html/xcms.html)
infrastructure. 


An `MSnExp` is converted to an `MSnSet` by the `quantitation`
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the `iTRAQ4` parameter; other tags are available).


```r
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
```

<img src="figure/itraq4plot.png" title="plot of chunk itraq4plot" alt="plot of chunk itraq4plot" style="display: block; margin: auto;" />


```r
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE)
```

```
## Error: argument "verbose" is missing, with no default
```

```r
exprs(msset)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'exprs': Error: object 'msset' not found
```

```r
processingData(msset)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'msset' not found
```

Other MS2 quantitation methods available in `quantify` include the
(normalised) spectral index `SI` and (normalised) spectral abundance
factor `SAF` or simply a simple count method.


```r
exprs(si <- quantify(msexp, method = "SIn"))     
```

```
##                1
## ECA0510 0.003589
## ECA1028 0.001470
```

```r
exprs(saf <- quantify(msexp, method = "NSAF"))
```

```
##              1
## ECA0510 0.6236
## ECA1028 0.3764
```

Note that spectra that have not been assigned any peptide (`NA`) or
that match non-unique peptides (`npsm > 1`) are discarded in the
counting process.

**See also** The
[`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
package supports quantitation from centroided `mgf` peak lists or its
own tab-separated files that can be generated from Mascot and Phenyx
vendor files.

### Importing third-party data

The PSI `mzTab` file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections. 


```r
mztf <- pxget(px, pxfiles(px)[2])
```

```
## Downloading 1 file
## F063721.dat-mztab.txt already present.
```

```r
(mzt <- readMzTabData(mztf, what = "PEP"))
```

```
## Warning: Support for mzTab version 0.9 only. Support will be added soon.
```

```
## Detected a metadata section
## Detected a peptide section
```

```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   rowNames: sub[1] sub[2] ... sub[6] (6 total)
##   varLabels: abundance
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 1528 (1528 total)
##   fvarLabels: sequence accession ... uri (14 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## mzTab read: Wed Jul 30 21:44:00 2014 
##  MSnbase version: 1.13.12
```

It is also possible to import arbitrary spreadsheets as `MSnSet`
objects into R with the `readMSnSet2` function. The main 2 arguments
of the function are (1) a text-based spreadsheet and (2) column names
of indices that identify the quantitation data.


```r
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
           full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
```

```
##  [1] "\"Protein ID\""              "\"FBgn\""                   
##  [3] "\"Flybase Symbol\""          "\"No. peptide IDs\""        
##  [5] "\"Mascot score\""            "\"No. peptides quantified\""
##  [7] "\"area 114\""                "\"area 115\""               
##  [9] "\"area 116\""                "\"area 117\""               
## [11] "\"PLS-DA classification\""   "\"Peptide sequence\""       
## [13] "\"Precursor ion mass\""      "\"Precursor ion charge\""   
## [15] "\"pd.2013\""                 "\"pd.markers\""
```

```r
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
```

```
##   area.114 area.115 area.116 area.117
## 1   0.3790   0.2810   0.2250   0.1140
## 2   0.4200   0.2097   0.2061   0.1639
## 3   0.1873   0.1673   0.1697   0.4760
## 4   0.2475   0.2530   0.3200   0.1790
## 5   0.2160   0.1830   0.3420   0.2590
## 6   0.0720   0.2123   0.5730   0.1427
```

```r
head(fData(res))
```

```
##   Protein.ID        FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1    CG10060 FBgn0001104    G-ialpha65A               3       179.86
## 2    CG10067 FBgn0000044         Act57B               5       222.40
## 3    CG10077 FBgn0035720        CG10077               5       219.65
## 4    CG10079 FBgn0003731           Egfr               2        86.39
## 5    CG10106 FBgn0029506        Tsp42Ee               1        52.10
## 6    CG10130 FBgn0010638      Sec61beta               2        79.90
##   No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1                       1                    PM                 
## 2                       9                    PM                 
## 3                       3                                       
## 4                       2                    PM                 
## 5                       1                              GGVFDTIQK
## 6                       3              ER/Golgi                 
##   Precursor.ion.mass Precursor.ion.charge     pd.2013 pd.markers
## 1                                                  PM    unknown
## 2                                                  PM    unknown
## 3                                             unknown    unknown
## 4                                                  PM    unknown
## 5            626.887                    2 Phenotype 1    unknown
## 6                                            ER/Golgi         ER
```

## Data processing and analysis

### Processing and normalisation

Each different types of quantitative data will require their own
pre-processing and normalisation steps. Both `isobar` and `MSnbase`
allow to correct for isobaric tag impurities normalise the
quantitative data.


```r
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap",
                reporters = iTRAQ4, verbose = FALSE)
```

```
## Error: argument "verbose" is missing, with no default
```

```r
impurities <- matrix(c(0.929,0.059,0.002,0.000,
                       0.020,0.923,0.056,0.001,
                       0.000,0.030,0.924,0.045,
                       0.000,0.001,0.040,0.923),
                     nrow=4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt.crct <- purityCorrect(qnt, impurities)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'purityCorrect': Error: object 'qnt' not found
```

```r
processingData(qnt.crct)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'qnt.crct' not found
```


```r
plot0 <- function(x, y, main = "") {
    old.par <- par(no.readonly = TRUE)
    on.exit(par(old.par))
    par(mar = c(4, 4, 1, 1))
    par(mfrow = c(2, 2))
    sx <- sampleNames(x)
    sy <- sampleNames(y)
    for (i in seq_len(ncol(x))) {
        plot(exprs(x)[, i], exprs(y)[, i], log = "xy",
             xlab = sx[i], ylab = sy[i])
        grid()
    }
}

plot0(qnt, qnt.crct)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```

Various normalisation methods can be applied the `MSnSet` instances
using the `normalise` method: variance stabilisation (`vsn`), quantile
(`quantiles`), median or mean centring (`center.media` or
`center.mean`), ...


```r
qnt.crct.nrm <- normalise(qnt.crct,"quantiles")
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct' not found
```

```r
plot0(qnt, qnt.crct.nrm)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```

The `combineFeatures` method combines spectra/peptides quantitation
values into protein data. The grouping is defined by the `groupBy`
parameter, which is generally taken from the feature metadata (protein
accessions, for example).


```r
## arbitraty grouping
g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15)))
prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum")
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct.nrm' not found
```

```r
processingData(prt)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'prt' not found
```

Finally, proteomics data analysis is generally hampered by missing
values. Missing data imputation is a sensitive operation whose success
will be guided by many factors, such as degree and (non-)random nature
of the missingness. Missing value in `MSnSet` instances can be
filtered out and imputed using the `filterNA` and `impute` functions.


```r
set.seed(1)
qnt0 <- qnt
```

```
## Error: object 'qnt' not found
```

```r
exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA
```

```
## Error: object 'qnt0' not found
```

```r
table(is.na(qnt0))
```

```
## Error: object 'qnt0' not found
```

```r
qnt00 <- filterNA(qnt0)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'filterNA': Error: object 'qnt0' not found
```

```r
dim(qnt00)
```

```
## Error: object 'qnt00' not found
```

```r
qnt.imp <- impute(qnt0)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'impute': Error: object 'qnt0' not found
```

```r
plot0(qnt, qnt.imp)
```

```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```

#### Exercise

The `mzt` instance created from the `mzTab` file has the following is
a TMT 6-plex with the following design:

   In this TMT 6-plex experiment, four exogenous proteins were spiked
   into an equimolar *Erwinia carotovora* lysate with varying
   proportions in each channel of quantitation; yeast enolase (ENO) at
   10:5:2.5:1:2.5:10, bovine serum albumin (BSA) at 1:2.5:5:10:5:1,
   rabbit glycogen phosphorylase (PHO) at 2:2:2:2:1:1 and bovin
   cytochrome C (CYT) at 1:1:1:1:1:2. Proteins were then digested,
   differentially labelled with TMT reagents, fractionated by reverse
   phase nanoflow UPLC (nanoACQUITY, Waters), and analysed on an LTQ
   Orbitrap Velos mass spectrometer (Thermo Scientic).

Explore the `mzt` data using some of the illustrated functions. The
heatmap and MAplot (see `MAplot` function), taken from the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette, have been produced using the same data.

![heatmap](figure/heatmap.png)
![maplot](figure/maplot.png)

### Statistical analysis

R in general and Bioconductor in particular are well suited for the
statistical analysis of data. Several packages provide dedicated
resources for proteomics data:

- [`MSstats`](http://www.bioconductor.org/packages/release/bioc/html/MSstats.html):
  A set of tools for statistical relative protein significance
  analysis in DDA, SRM and DIA experiments.

- [`msmsTest`](http://www.bioconductor.org/packages/release/bioc/html/msmsTests.html):
  Statistical tests for label-free LC-MS/MS data by spectral counts,
  to discover differentially expressed proteins between two biological
  conditions. Three tests are available: Poisson GLM regression,
  quasi-likelihood GLM regression, and the negative binomial of the
  edgeR package.

<!-- ```{r, msmstest} -->
<!-- library(msmsTests) -->
<!-- data(msms.dataset) -->
<!-- msms.dataset -->
<!-- e <- pp.msms.data(msms.dataset) -->
<!-- e -->
     
<!-- null.f <- "y~batch" -->
<!-- alt.f <- "y~treat+batch" -->
<!-- div <- apply(exprs(e),2,sum) -->
<!-- res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat") -->
     
<!-- head(res) -->
<!-- ``` -->

- [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
  also provides dedicated infrastructure for the statistical analysis of isobaric data. 

### Machine learning

The
[`MLInterfaces`](http://www.bioconductor.org/packages/release/bioc/html/MLInterfaces.html)
package provides a unified interface to a wide range of machine
learning algorithms. Initially developed for microarray and
`ExpressionSet` instances, the
[`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html)
package enables application of these algorithms to `MSnSet` data.


```r
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
```

```
## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5), 
##     trainInd = traininds)
## Predicted outcome distribution for test set:
## 
##          ER       Golgi mit/plastid          PM     vacuole 
##         203          86         129         112          17 
## Summary of scores on test set (use testScores() method for details):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.400   1.000   1.000   0.965   1.000   1.000
```


```r
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
```

```
## clusteringOutput: partition table
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  38  10  22  94  24  99  43 118 107  48  50  36 
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)
```

```r
plot(kcl, exprs(dunkley2006))
```

<img src="figure/clust.png" title="plot of chunk clust" alt="plot of chunk clust" style="display: block; margin: auto;" />


```r
hcl <- MLearn( ~ ., data = t(dunkley2006), hclustI(distFun =  dist, cutParm = list(k = 4)))
hcl
```

```
## clusteringOutput: partition table
## 
## 1 2 3 4 
## 4 4 4 4 
## The call that created this object was:
## MLearn(formula = ~., data = t(dunkley2006), .method = hclustI(distFun = dist, 
##     cutParm = list(k = 4)))
```

```r
plot(hcl, exprs(t(dunkley2006)))
```

<img src="figure/clust2.png" title="plot of chunk clust2" alt="plot of chunk clust2" style="display: block; margin: auto;" />


## Annotation

All the
[Bioconductor annotation infrastructure](http://bioconductor.org/help/workflows/annotation/annotation/),
such as
[`biomaRt`](http://bioconductor.org/packages/release/bioc/html/biomaRt.html),
[`GO.db`](http://www.bioconductor.org/packages/release/data/annotation/html/GO.db.html),
organism specific annotations, .. are directly relevant to the
analysis of proteomics data. Some proteomics-centred annotations such
as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI
2.5) or Protein Modifications are available through the
[`rols`](http://www.bioconductor.org/packages/release/bioc/html/rols.html). Data
from the [Human Protein Atlas](http://www.proteinatlas.org/) is
available via the
[`hpar`](http://www.bioconductor.org/packages/release/bioc/html/hpar.html)
package.

## Other relevant packages/pipelines

- Analysis of post translational modification with
  [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html).
- Analysis of label-free data from a Synapt G2 (including ion
  mobility) with
  [`synapter`](http://www.bioconductor.org/packages/release/bioc/html/synapter.html).
- Analysis of spatial proteomics data with
  [`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html).
- Analysis of MALDI data with the
  [`MALDIquant`](http://cran.r-project.org/web/packages/MALDIquant/index.html)
  package.
- Access to the Proteomics Standard Initiative Common QUery InterfaCe
  with the
  [`PSICQUIC`](http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html)
  package.

Additional relevant packages are described in the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette. 

## Session information


```
## R Under development (unstable) (2014-04-10 r65396)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## attached base packages:
## [1] parallel  methods   stats     graphics  grDevices utils     datasets 
## [8] base     
## 
## other attached packages:
##  [1] shinyTANDEM_1.3.0    xtable_1.7-3         mixtools_1.0.2      
##  [4] segmented_0.4-0.0    boot_1.3-11          shiny_0.10.0        
##  [7] rTANDEM_1.5.0        data.table_1.9.2     BiocInstaller_1.15.5
## [10] pRolocdata_1.3.0     pRoloc_1.5.13        MLInterfaces_1.45.1 
## [13] sfsmisc_1.0-26       cluster_1.15.2       annotate_1.43.5     
## [16] XML_3.98-1.1         AnnotationDbi_1.27.8 GenomeInfoDb_1.1.12 
## [19] IRanges_1.99.23      S4Vectors_0.1.2      rda_1.0.2-2         
## [22] rpart_4.1-8          genefilter_1.47.6    MASS_7.3-33         
## [25] rpx_1.1.1            MSnbase_1.13.12      BiocParallel_0.7.8  
## [28] ggplot2_1.0.0        Biobase_2.25.0       BiocGenerics_0.11.3 
## [31] mzID_1.3.4           mzR_1.11.10          Rcpp_0.11.2         
## [34] RforProteomics_1.3.1 knitr_1.6           
## 
## loaded via a namespace (and not attached):
##  [1] affy_1.43.3              affyio_1.33.0           
##  [3] BatchJobs_1.3            BBmisc_1.7              
##  [5] biocViews_1.33.10        bitops_1.0-6            
##  [7] BradleyTerry2_1.0-5      brew_1.0-6              
##  [9] brglm_0.5-9              car_2.0-20              
## [11] caret_6.0-30             Category_2.31.1         
## [13] caTools_1.17             checkmate_1.2           
## [15] class_7.3-11             codetools_0.2-8         
## [17] colorspace_1.2-4         DBI_0.2-7               
## [19] digest_0.6.4             doParallel_1.0.8        
## [21] e1071_1.6-3              evaluate_0.5.5          
## [23] fail_1.2                 FNN_1.1                 
## [25] foreach_1.4.2            formatR_0.10            
## [27] gdata_2.13.3             graph_1.43.0            
## [29] grid_3.2.0               gridSVG_1.4-0           
## [31] GSEABase_1.27.1          gtable_0.1.2            
## [33] gtools_3.4.1             htmltools_0.2.4         
## [35] httpuv_1.3.0             impute_1.39.0           
## [37] interactiveDisplay_1.3.9 iterators_1.0.7         
## [39] kernlab_0.9-19           labeling_0.2            
## [41] lattice_0.20-29          limma_3.21.10           
## [43] lme4_1.1-7               lpSolve_5.6.9           
## [45] MALDIquant_1.10          Matrix_1.1-4            
## [47] mboost_2.3-0             mclust_4.3              
## [49] minqa_1.2.3              munsell_0.4.2           
## [51] mvtnorm_1.0-0            nlme_3.1-117            
## [53] nloptr_1.0.0             nnet_7.3-8              
## [55] nnls_1.4                 pcaMethods_1.55.0       
## [57] plyr_1.8.1               preprocessCore_1.27.1   
## [59] proto_0.3-10             proxy_0.4-12            
## [61] quadprog_1.5-5           R.methodsS3_1.6.1       
## [63] R.oo_1.18.0              R.utils_1.32.4          
## [65] randomForest_4.6-10      RBGL_1.41.0             
## [67] RColorBrewer_1.0-5       RCurl_1.95-4.1          
## [69] reshape2_1.4             RJSONIO_1.2-0.2         
## [71] RSQLite_0.11.4           RUnit_0.4.26            
## [73] sampling_2.6             scales_0.2.4            
## [75] sendmailR_1.1-2          splines_3.2.0           
## [77] stats4_3.2.0             stringr_0.6.2           
## [79] survival_2.37-7          tools_3.2.0             
## [81] vsn_3.33.0               zlibbioc_1.11.1
```