---
title: "Proteomics Data Import"
author: "Constantin Ahlmann-Eltze"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Data Import}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

After running your samples on a mass spectrometer, you want to find out if 
there are interesting patterns in the data. But the first challenge is how
do you get the data from the files that your mass spectrometer produced into R?

In the following, I will describe several ways of importing data from MaxQuant.
The general approaches will also be applicable to data from other tools, you will
just have to adapt the column names.

## MaxQuant File Overview

MaxQuant is a popular tool for identifying, integrating, and combining MS peaks
to derive peptide and protein intensities. MaxQuant produces several output
files including **proteinGroups.txt**. It is usually a tab separated table with 
a lot of different columns, which can make it difficult to not get overwhelmed
with information. 

The most important columns that every proteinGroups.txt file contains are

* *Protein IDs*: a semicolon delimited text listing all protein identifiers
  that match an identified set of peptides. Most of the time this is just a 
  single protein, but sometimes proteins are so similar to each other because
  of gene duplication  that it was not possible to distinguish them.

* *Majority protein IDs*: a semicolon delimited text that lists all proteins 
  from the *Protein IDs* column which had more than half of their peptides 
  identified.
  
* *Identification type [SAMPLENAME]*: For each sample there is one column that 
  explains how the peptide peaks where identified. Either they were directly
  sequenced by the MS2 ("By MS/MS") or by matching the m/z peak and elution timing
  across samples ("By matching").

* *Intensity [SAMPLENAME]*: The combined intensity of the peptides of the protein.
  Missing or non-identified proteins are simply stored as `0`. In a label-free 
  experiment, this is also often called *LFQ Intensity [SAMPLENAME]*.
  
* *iBAQ [SAMPLENAME]*: iBAQ is short for intensity-based absolute quantification.
  It is an attempt to make intensity values comparable across proteins. Usually
  the intensity values are only relative, which means that they are only 
  comparable within one protein. This is because differences in ionization 
  and detection efficiency. It is usually better to just
  compare the *Intensity* columns to identify differentially abundant proteins.
  
 * *Only identified by site*: Contains a "+" if the protein was only identified by
   a modification site.

 * *Reverse*:  Contains a "+" if the protein matches the reversed part of the
   decoy database.
  
 * *Contaminant*: Contains a "+" if the protein is a commonly occurring
   contaminant.

The last three columns are commonly used to filter out false positive hits.

The full information what each column means is provided in the *tables.pdf* file
in the MaxQuant output folder.


# Workflow

Our goal is to turn this complicated table into a useable matrix or a 
`SummarizedExperiment` object. There are several ways to achieve this:

1. Use the base R functions (`read.delim()` and `[<-`) to read in the data
2. Use the `tidyverse` packages to load the file and turn it into a useable object
3. Use the [`DEP`](https://bioconductor.org/packages/DEP/) package 
   and the `import_MaxQuant()` function

I will demonstrate each approach using an example file that comes with this
package.

The example file contains the LFQ data from a BioID
experiment in *Drosophila melanogaster*. 11 different Palmitoyltransferases 
(short DHHC) were tagged with a promiscuous biotin ligase and all biotinylated
proteins were enriched and identified using label-free mass spectrometry. The
conditions are named after the tagged DHHC and the negative control condition 
is called S2R for the cell line. Each condition was measured in triplicates,
which means that there are a total of 36 samples To make the file smaller,
I provide a reduced data set which only contains the first 122 rows of the data.


# Base R

The example file is located in

```{r}
system.file("extdata/proteinGroups.txt", 
            package = "proDA", mustWork = TRUE)
```

In this specific file, all spaces have been replaced with dots. This is an 
example how each output file from MaxQuant slightly differs. This can make it 
difficult to write a generic import function. Instead I will first demonstrate
the most general approach which is to simply use the base R tools for loading
the data and turning it into useful objects.

The first step is to load the full table.

```{r}
full_data <- read.delim(
    system.file("extdata/proteinGroups.txt", 
                package = "proDA", mustWork = TRUE),
    stringsAsFactors = FALSE
)

head(colnames(full_data))
```

Next, I  create a matrix of the intensity data, where each sample is a column and
each protein group is a row.

```{r}
# I use a regular expression (regex) to select the intensity columns
intensity_colnames <- grep("^LFQ\\.intensity\\.", colnames(full_data), value=TRUE)

# Create matrix which only contains the intensity columns
data <- as.matrix(full_data[, intensity_colnames])
colnames(data) <- sub("^LFQ\\.intensity\\.", "", intensity_colnames)
# Code missing values explicitly as NA
data[data == 0] <- NA
# log transformation to account for mean-variance relation
data <- log2(data)
# Overview of data
data[1:7, 1:6]
# Set rownames after showing data, because they are so long
rownames(data) <- full_data$Protein.IDs
```


In the next step I will create an annotation `data.frame` that contains information 
on the sample name, the condition and the replicate.


```{r}
annotation_df <- data.frame(
    Condition = sub("\\.\\d+", "", sub("^LFQ\\.intensity\\.", 
                                       "", intensity_colnames)),
    Replicate = as.numeric(sub("^LFQ\\.intensity\\.[[:alnum:]]+\\.", 
                               "", intensity_colnames)),
    stringsAsFactors = FALSE, row.names = colnames(data)
)
head(annotation_df)
```

We can use this data to fit the probabilistic dropout model and 
test for differentially abundant proteins.

```{r eval=FALSE, include=TRUE}
# Not Run
library(proDA)
fit <- proDA(data, design= annotation_df$Condition, col_data = annotation_df)
test_diff(fit, contrast = CG1407 - S2R)
# End Not Run
```


Optionally, we can turn the data also into a `SummarizedExperiment` 
or `MSnSet` object

```{r, include=FALSE}
library(SummarizedExperiment)
library(MSnbase)
```


```{r}
library(SummarizedExperiment)
se <- SummarizedExperiment(SimpleList(LFQ=data), colData=annotation_df)
rowData(se) <- full_data[, c("Only.identified.by.site", 
                             "Reverse", "Potential.contaminant")]
se
```


```{r}
library(MSnbase)

fData <- AnnotatedDataFrame(full_data[, c("Only.identified.by.site", 
                                 "Reverse", "Potential.contaminant")])
rownames(fData) <- rownames(data)
ms <- MSnSet(data, pData=AnnotatedDataFrame(annotation_df), fData=fData)
ms
```

Both input types are also accepted by `proDA`.

```{r eval=FALSE, include=TRUE}
# Not Run
library(proDA)
fit <- proDA(se, design = ~ Condition - 1)
test_diff(fit, contrast = ConditionCG1407 - ConditionS2R)
# End Not Run
```


# Tidyverse


The [tidyverse](https://www.tidyverse.org/) is a set of coherent R packages 
that provide many useful functions
for common data analysis tasks. It replicates many of the functionalities 
already available in base R packages, but learns from its mistakes and avoids 
some of the surprising behaviors. For example strings are never automatically
converted to factors. Another popular feature in the tidyverse is the pipe 
operator (`%>%`) that makes it easy to chain complex transformations.


```{r, include=FALSE}
library(dplyr)
library(stringr)
library(readr)
library(tidyr)
library(tibble)
```


```{r}
library(dplyr)
library(stringr)
library(readr)
library(tidyr)
library(tibble)
# Or short 
# library(tidyverse)
```


I first load the full data file

```{r}
# The read_tsv function works faster and more reliable than read.delim
# But it sometimes needs help to identify the right type for each column,
# because it looks only at the first 1,000 elements. 
# Here, I explicitly define the `Reverse` column as a character column
full_data <- read_tsv(
    system.file("extdata/proteinGroups.txt", 
                package = "proDA", mustWork = TRUE),
    col_types = cols(Reverse = col_character())
)

full_data
```

Next, I create a tidy version of the data set. I pipe (`%>%`) the
results from each transformation to the next transformation, to
first `select` the columns of interest, reshape (`gather`) the dataset from 
wide to long format, and lastly create new columns with `mutate`.

```{r}
# I explicitly call `dplyr::select()` because there is a naming conflict
# between the tidyverse and BioConductor packages for `select()` function
tidy_data <- full_data %>%
    dplyr::select(ProteinID=Protein.IDs, starts_with("LFQ.intensity.")) %>%
    gather(Sample, Intensity, starts_with("LFQ.intensity.")) %>%
    mutate(Condition = str_match(Sample, 
                 "LFQ\\.intensity\\.([[:alnum:]]+)\\.\\d+")[,2]) %>%
    mutate(Replicate = as.numeric(str_match(Sample, 
                 "LFQ\\.intensity\\.[[:alnum:]]+\\.(\\d+)")[,2])) %>%
    mutate(SampleName = paste0(Condition, ".", Replicate))

tidy_data
```

Using the tidy data, I create the annotation data frame and the data matrix.

```{r}
data <- tidy_data %>%
    mutate(Intensity = ifelse(Intensity == 0, NA, log2(Intensity))) %>%
    dplyr::select(ProteinID, SampleName, Intensity) %>%
    spread(SampleName, Intensity) %>%
    column_to_rownames("ProteinID") %>%
    as.matrix()

data[1:4, 1:7]

annotation_df <- tidy_data %>%
    dplyr::select(SampleName, Condition, Replicate) %>%
    distinct() %>%
    arrange(Condition, Replicate) %>%
    as.data.frame() %>%
    column_to_rownames("SampleName")

annotation_df
```


Optionally, we can again turn this into a `SummarizedExperiment` or `MSnSet` object


```{r}
library(SummarizedExperiment)
se <- SummarizedExperiment(SimpleList(LFQ=data), colData=annotation_df)
rowData(se) <- full_data[, c("Only.identified.by.site", 
                             "Reverse", "Potential.contaminant")]
se
```


```{r}
library(MSnbase)

fData <- AnnotatedDataFrame(full_data[, c("Only.identified.by.site", 
                                 "Reverse", "Potential.contaminant")])
rownames(fData) <- rownames(data)
ms <- MSnSet(data, pData=AnnotatedDataFrame(annotation_df), fData=fData)
ms
```


Both input types are also accepted by `proDA`.

```{r eval=FALSE, include=TRUE}
# Not Run
library(proDA)
fit <- proDA(se, design = ~ Condition - 1)
test_diff(fit, contrast = ConditionCG1407 - ConditionS2R)
# End Not Run
```


# DEP

DEP is a [BioConductor package](https://bioconductor.org/packages/release/bioc/html/DEP.html)
that is designed for the analysis of 
mass spectrometry data. It provides helper functions
to impute missing values and makes it easy to run 
[limma](https://bioconductor.org/packages/release/bioc/html/limma.html) on 
the completed dataset.

To load the data, we need to provide all the column names of the 
intensity values. I then call the `import_MaxQuant()` function that directly 
creates a `SummarizedExperiment` object.

```{r}
library(DEP)

full_data <- read.delim(
    system.file("extdata/proteinGroups.txt", 
                package = "proDA", mustWork = TRUE),
    stringsAsFactors = FALSE
)


exp_design <- data.frame(
   label =c("LFQ.intensity.CG1407.01", "LFQ.intensity.CG1407.02",  "LFQ.intensity.CG1407.03",  "LFQ.intensity.CG4676.01",  "LFQ.intensity.CG4676.02", "LFQ.intensity.CG4676.03", "LFQ.intensity.CG51963.01", "LFQ.intensity.CG51963.02", "LFQ.intensity.CG51963.03","LFQ.intensity.CG5620A.01", "LFQ.intensity.CG5620A.02", "LFQ.intensity.CG5620A.03", "LFQ.intensity.CG5620B.01","LFQ.intensity.CG5620B.02", "LFQ.intensity.CG5620B.03", "LFQ.intensity.CG5880.01", "LFQ.intensity.CG5880.02",  "LFQ.intensity.CG5880.03",  "LFQ.intensity.CG6017.01",  "LFQ.intensity.CG6017.02", "LFQ.intensity.CG6017.03", "LFQ.intensity.CG6618.01",  "LFQ.intensity.CG6618.02",  "LFQ.intensity.CG6618.03",  "LFQ.intensity.CG6627.01", "LFQ.intensity.CG6627.02", "LFQ.intensity.CG6627.03",  "LFQ.intensity.CG8314.01", "LFQ.intensity.CG8314.02",  "LFQ.intensity.CG8314.03", "LFQ.intensity.GsbPI.001", "LFQ.intensity.GsbPI.002",  "LFQ.intensity.GsbPI.003",  "LFQ.intensity.S2R.01", "LFQ.intensity.S2R.02", "LFQ.intensity.S2R.03"),
   condition = c("CG1407", "CG1407", "CG1407", "CG4676", "CG4676", "CG4676", "CG51963", "CG51963", "CG51963", "CG5620A", "CG5620A", "CG5620A", "CG5620B", "CG5620B", "CG5620B", "CG5880", "CG5880", "CG5880", "CG6017", "CG6017", "CG6017", "CG6618", "CG6618", "CG6618", "CG6627", "CG6627", "CG6627", "CG8314", "CG8314", "CG8314", "GsbPI", "GsbPI", "GsbPI", "S2R", "S2R", "S2R" ),
   replicate = rep(1:3, times=12),
   stringsAsFactors = FALSE
)

se <- import_MaxQuant(full_data, exp_design)
se
assay(se)[1:5, 1:5]
```


Again, we can run `proDA` on the result:

```{r eval=FALSE, include=TRUE}
# Not Run
library(proDA)
fit <- proDA(se, design = ~ condition - 1)
# Here, we need to be specific, because DEP also has a test_diff method
proDA::test_diff(fit, contrast = conditionCG1407 - conditionS2R)
# End Not Run
```


# Session Info

```{r}
sessionInfo()
```