---
title: "Proteolytic resistance analysis"
author: Valentina Cappelletti (<cappelletti@imsb.biol.ethz.ch>),
  Malinovska Liliana (<malinovska@imsb.biol.ethz.ch>), 
  Devon Kohler (<kohler.d@northeastern.edu>)
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{MSstatsLiP Proteolytic Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    toc: yes
    highlight: pygments
    theme: lumen
  pdf_document:
    toc: yes
---
# MSstatsLiP Workflow: Protease resistance analysis

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook describing the analysis of a LiP-MS experiment using [MSstatsLiP](https://github.com/Vitek-Lab/MSstatsLiP). When you execute code within the notebook, the results appear beneath the code. 

Here, we use LiP-MS data of human alpha-Synuclein in the monomeric (M) and fibrillar form (F) spiked into a *S.cerevisiae* lysate  at 5 pmol/ug lysate (M1 and F1) and 20 pmol/ug lysate (M2 and F2).The data set is composed of four biological replicates per condition.

## 1. Installation

- Install and load all necessary packages. The installation needs to be performed at first use only. Un-comment the lines for execution.

```{r setup}
 knitr::opts_chunk$set(include = FALSE)
```

```  {r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstatsLiP")
```

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(MSstatsLiP)
library(tidyverse)
library(data.table)
library(gghighlight)
```

- Set the working directory

```{r set working drectory, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, eval = FALSE}
input_folder=choose.dir(caption="Choose the working directory")
knitr::opts_knit$set(root.dir = input_folder) 
```

# 2. Data preprocessing

## 2.1 Load datasets

Load the data from the Spectronaut export. LiP data is loaded as `raw_lip`,  trypsin-only control data (TrP data) is loaded as `raw_prot`. The function `choose.files()` enables browsing for the input file. 

**CAVE:** Make sure the separator `delim` is set correctly. For comma-separated values (csv), the separator is set to `delim=","`.

```{r load LiP data, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, eval = FALSE}
raw_lip <- read_delim(file=choose.files(caption="Choose LiP dataset"), 
                         delim=",", escape_double = FALSE, trim_ws = TRUE)
```

```{r load TrP data, eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}

raw_prot <- read_delim(file=choose.files(caption="Choose TrP dataset"), 
                          delim=",", escape_double = FALSE, trim_ws = TRUE)
```

```{r Adjust aSynuclein nomenclature to match fasta file, eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}
raw_lip <- raw_lip %>% mutate_all(funs(ifelse(.=="P37840.1", "P37840", .)))
raw_prot <- raw_prot %>% mutate_all(funs(ifelse(.=="P37840.1", "P37840", .)))
```

Load the fasta file that was used in the Spectronaut search.

```{r load fasta file, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE, eval = FALSE}
fasta_file=choose.files(caption = "Choose FASTA file")
```

```{r include=FALSE}
fasta_file = "../inst/extdata/proteolytic_fasta_data.fasta"
```

Convert the data to MSstatsLiP format. Load first the LiP data set `raw_lip`, then the FASTA file `fasta_file` used for searches. If the experiment contains TrP data, `raw_prot` is loaded last.

To remove information on iRT peptides, the default setting is `removeiRT = TRUE`. As default, peptides containing modifications are filtered, but this can be changed using the argument `removeModifications`. Also, peptides with multiple protein annotations are filtered as default. However, for data sets containing protein isoforms, this argument can be set to `removeNonUniqueProteins = FALSE`.

The default settings use *PeakArea* as measure of intensity, filter features based on the q-value, with a q-value cut-off of 0.01 and import all conditions. You can adjust the settings accordingly. For information on each option, refer to the vignette of the function.

```{r convert to MSstatsLiP format, message=FALSE, warning=FALSE,echo=TRUE,include=TRUE}
msstats_data <- SpectronauttoMSstatsLiPFormat(raw_lip, fasta_file, raw_prot)
```

## 2.2 Select only fully tryptic (FT) peptides in both LiP and TrP dataset

Proteolytic resistance is calculated as the of the intensity of fully tryptic peptides in the LiP condition to the TrP condition. Half-tryptic (HT) peptides are excluded from this analysis. The function "calculateTrypticity" is used to annotate FT and HT peptides in the LiP dataset. Next, from the TrP dataset we filtered out FT peptides not identified in the LiP dataset.The msstats_data list will finally contain only FT peptides measured in both LiP and TrP datasets.  

```{r find FT peptides, echo=TRUE, message=FALSE, warning=FALSE,include=TRUE}

FullyTrP <- msstats_data[["LiP"]] %>% 
  distinct(ProteinName, PeptideSequence) %>% 
  calculateTrypticity(fasta_file) %>% 
  filter(fully_TRI) %>% 
  select(ProteinName, PeptideSequence, StartPos, EndPos)
```

```{r select FT peptides in LiP data, echo=TRUE, message=FALSE, warning=FALSE,include=TRUE}
msstats_data[["LiP"]] <- msstats_data[["LiP"]] %>% 
  select(-ProteinName) %>% inner_join(FullyTrP)
```

```{r select FT peptides in TrP data, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}
msstats_data[["TrP"]] <- msstats_data[["TrP"]] %>% 
  select(-ProteinName) %>% inner_join(FullyTrP)
```

## 2.3 Correct nomenclature

#### Step 1: 
Ensure that the `Condition` nomenclature is identical in both data sets. If the output is `TRUE` for all conditions, continue to [step 2](#steptwo).

```{r Test Condtion nomenclature, echo=TRUE,include=TRUE}
unique(msstats_data[["LiP"]]$Condition)%in%unique(msstats_data[["TrP"]]$Condition)
```

To correct the condition nomenclature, display the condition for both data sets.
```{r Display Condtion nomenclature, echo=TRUE,include=TRUE}
paste("LiP Condition nomenclature:", unique(msstats_data[["LiP"]]$Condition), ",",
      "TrP Condition nomenclature:",unique(msstats_data[["TrP"]]$Condition))
```

If necessary, un-comment following lines to correct the condition nomenclature in either of the data sets. E.g. change the nomenclature of the TrP samples from `Cond1` to `cond1`.

```{r Correct Condition nomenclature, echo=TRUE,include=TRUE}
# msstats_data[["TrP"]] = msstats_data[["TrP"]] %>% 
#   mutate(Condition = case_when(Condition == "Cond1" ~ "cond1",
#                                Condition == "Cond2" ~ "cond2"))
```


#### Step 2: {#steptwo}

Ensure that `BioReplicate` nomenclature is correctly annotated (see also  [MSstats](http://msstats.org/wp-content/uploads/2020/02/MSstats_v3.18.1_manual_2020Feb26-v2.pdf) user manual. The BioReplicate needs a unique nomenclature, while the technical replicates can have duplicate numbering. If the replicate nomenclature is correct, proceed to [section 2.3](#data-summ).

```{r Display BioReplicate nomenclature, echo=TRUE,include=TRUE}
paste("LiP BioReplicate nomenclature:", unique(msstats_data[["LiP"]]$BioReplicate), ",",
      "TrP BioReplicate nomenclature:",unique(msstats_data[["TrP"]]$BioReplicate))
```

Adjust `BioReplicate` column to correct nomenclature for a Case-control experiment. 

```{r Correct replicate nomenclature, echo=TRUE,include=TRUE}
msstats_data[["LiP"]] = msstats_data[["LiP"]] %>% 
  mutate(BioReplicate = paste0(Condition,".",BioReplicate))

msstats_data[["TrP"]] = msstats_data[["TrP"]] %>% 
  mutate(BioReplicate = paste0(Condition,".",BioReplicate))
```

Inspect corrected `BioReplicate` column. 

```{r Display corrected BioReplicate nomenclature, echo=TRUE,include=TRUE}
paste("LiP BioReplicate nomenclature:", unique(msstats_data[["LiP"]]$BioReplicate), ",",
      "TrP BioReplicate nomenclature:",unique(msstats_data[["TrP"]]$BioReplicate))
```

## 2.4 Data Summarization{#data-summ}

Summarize the data. The default settings use a log2-transformation and normalize the data using the  `"equalizeMedians"` method. The default summary method is `"TMP"` and imputation is set to `"FALSE"`. For detailed information on all settings, please refer to the function vignette. 

This function will take some time and memory. If memory is limited, it is advisable to remove the raw files using the `rm()` function and clearing the memory cache using the `gc()` function.

```{r Data summarization, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE}
MSstatsLiP_Summarized <- dataSummarizationLiP(msstats_data, normalization.LiP = "equalizeMedians")

```

Inspect `MSstatsLiP_Summarized`. 
```{r Inspect summarized data, echo=TRUE,include=TRUE}
names(MSstatsLiP_Summarized[["LiP"]])

head(MSstatsLiP_Summarized[["LiP"]]$FeatureLevelData)
head(MSstatsLiP_Summarized[["LiP"]]$ProteinLevelData)

head(MSstatsLiP_Summarized[["TrP"]]$FeatureLevelData)
head(MSstatsLiP_Summarized[["TrP"]]$ProteinLevelData)
```

Save and/or load summarized data. 
```{r Save summarized data, echo=TRUE,include=TRUE, eval=FALSE}
save(MSstatsLiP_Summarized, file = 'MSstatsLiP_summarized.rda')
load(file = 'MSstatsLiP_summarized.rda')
```

# 3. Modelling

Run the modeling to obtain significantly altered peptides and proteins. The function `groupComparisonLiP`outputs a list with three separate models: 1. `LiP.Model`, which contains the differential analysis on peptide level in the LiP sample without correction for protein abundance alterations. 2. `Adjusted.LiP.Model`, which contains the differential analysis on peptide level in the LiP sample with correction for protein abundance alterations 3. `TrP.Model`, which contains the differential analysis on protein level. The default setting of the function is a pairwise comparison of all existing groups. Alternatively, a contrast matrix can be provided to specify the comparisons of interest. See Vignette for details.

```{r Modelling, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE}
MSstatsLiP_model = groupComparisonLiP(MSstatsLiP_Summarized)

```

Inspect `MSstatsLiP_model`. 
```{r Inspect model, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE}
head(MSstatsLiP_model[["LiP.Model"]])
head(MSstatsLiP_model[["TrP.Model"]])
```

Save and/or load model data. 
```{r Save model, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, eval=FALSE}
save(MSstatsLiP_model, file = 'MSstatsLiP_model.rda')
load(file = 'MSstatsLiP_model.rda')
```

# 4. Calculate proteolytic resistance ratios

Proteolytic resistance ratios are calculated as the ratio of the intensity of fully tryptic peptides in the LiP condition to the TrP condition. In general, a low protease resistance value is indicative of high extent of cleavage, while high protease resistance values indicate low cleavage extent.

```{r calculate accessibility, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE}

Accessibility = calculateProteolyticResistance(MSstatsLiP_Summarized, 
                                               fasta_file, 
                                               differential_analysis = TRUE)

Accessibility$RunLevelData

```

```{r Barplot of protease resistance of aSynuclein (monomer - M and fibril - F), message=FALSE, warning=FALSE, fig.height=2, fig.width=10,echo=TRUE,include=TRUE}

ResistanceBarcodePlotLiP(Accessibility,
                         fasta_file,
                         which.prot = "P16622",
                         which.condition = "F1",
                         address = FALSE)


```


# 5. Proteolytic resistance differential analysis

In this paragraph we described how to compare proteolytic resistance patterns of different conditions, as reported in Cappelletti et al., 2021, Figure 3. As described in the "Protease digestion accessibility analysis" paragraph of Cappelletti et al., proteolytic resistance is calculated as the ratio of the intensity of fully tryptic peptides in the LiP condition to the TrP condition and can be compared across different conditions using the linear mixed effects models-based differential analysis implemented in the MSstatsLiP package. First, infinite values are filtered out from the result of the groupComparisonLiP function. Next, logFCs and standard errors of the LiP (log2FC, s2) and TrP (log2FC_ref,s2_ref) models are combined and Student’s T-test is applied to compare proteolytic resistance between different conditions.Finally, p-values are adjusted for multiple comparisons (default is Benjamini & Hochberg method).
In general, a low Proteolytic resistance value is indicative of high extent of cleavage, while high Proteolytic resistance values indicate low cleavage extent.


```{r Perform Proteolytic resistance differential analysis on Fully tryptic peptides, message=FALSE, warning=FALSE, echo=TRUE,include=TRUE}

Accessibility$groupComparison

```

Save and/or load model data
```{r Save protection model, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, eval=FALSE}
# save(FullyTrp.Model, file = 'Protection_model.rda')
# load(file = 'Protection_model.rda')
```

# 6. Save outputs

Save the output of the modeling in a .csv file. 
```{r Save output, echo=TRUE,include=TRUE, eval=FALSE}
# write.csv(FullyTrp.Model, "Proteolytic_resistance_DA.csv")
```

# 7. Plot aSynuclein proteolytic resistance DA result as barcode

Proteolytic resistance barcodes can be used to visualize FT peptides along the sequence of aSynucelin. Significant peptides showing high protease resistance are colored in red, significant peptides showing a decreased protease resistance are colored in blue and non-significant peptides (no change in protease resistance between conditions) are colored in grey. Black regions represent regions with no identified matching peptide. Position of the NAC domain is indicated by a rectangle.


```{r Barplot of DA of protease resistance of aSynuclein (monomer - M and fibril - F), message=FALSE, warning=FALSE, fig.height=2, fig.width=10,echo=TRUE,include=TRUE}

ResistanceBarcodePlotLiP(Accessibility,
                         fasta_file,
                         which.prot = "P16622",
                         which.condition = "F1",
                         differential_analysis = TRUE,
                         which.comp = "F1 vs F2",
                         address = FALSE)


```