%\VignetteIndexEntry{Handling Modifications with MSnID} %\VignetteDepends{BiocStyle, ggplot2} %\VignetteKeywords{Documentation} %\VignettePackage{MSnID} \documentclass[11pt]{article} <>= BiocStyle::latex(use.unsrturl=FALSE) @ <>= library(MSnID) library(xtable) library(dplyr) library(Biostrings) @ \title{Handling Modifications with \Rpackage{MSnID}} \author{Vladislav A. Petyuk} \begin{document} \SweaveOpts{concordance=TRUE, eval=TRUE, prefix.string=graphics} \maketitle \tableofcontents \section{Introduction} This vignette describes handling modifications of the peptides. Modifications can be biologically-relevant and introduced after protein translation, thus \underline{p}ost-\underline{t}ranlational \underline{m}odifications or PTMs. Modification can also be introduced as artifact during sample processing. Here we use more general term - modification, that encompasses both PTMs, artifacts and intential modifications during the sample preparation. \section{Reading and Brief Info on Present Mods} <<>>= m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) @ Method \Rcode{report\_mods} returns the table with masses and their counts within the dataset. This is a quick way to get insight on what is present. The other useful piece of information is the exact masses of modifications. In the later steps we will be using them to encode with characters, typically asterisk, which is a rather common representation of modified peptides. <<>>= # to know the present mod masses report_mods(m) @ \section{Encoding Mods with Characters} A common way to denote the position and type of modificaton is with a non-alhanumeric character character. E.g. X.XXXX*XXXX.X means the modification at 4th residue. Typically it is most interest to map modifications that were dynamic in the MS/MS search. In this example TMT (229.1629) and cystein alkylation (57.021463735) are static modifications. The 79.966330925 is dynamic (that is may or maynot be present) phosphorylation. Note, \Rcode{add\_mod\_symbol} added \texttt{peptide\_mod} column. <<>>= m <- add_mod_symbol(m, mod_mass="79.966330925", symbol="*") x <- psms(m) %>% distinct(modification, peptide, peptide_mod) @ Sample of the table: <>= sel_idx <- c(1,6,10,13,19) x <- x %>% `[`(sel_idx,) %>% xtable() align(x) <- "lp{1.8in}p{1.8in}p{1.8in}" print(x, include.rownames = FALSE, size="\\fontsize{8pt}{6pt}\\selectfont", add.to.row = list(pos = as.list(seq_along(sel_idx)), command = rep("\\\\[3pt]",length(sel_idx))), floating = FALSE) @ \\ We can map additional modifications. Althogh typically, a given study focuses on one PTM at a time. Nonetheless: <>= m <- add_mod_symbol(m, mod_mass="229.1629", symbol="#") m <- add_mod_symbol(m, mod_mass="57.021463735", symbol="^") x <- psms(m) %>% distinct(modification, peptide, peptide_mod) @ Sample of the table: <>= x <- x %>% `[`(sel_idx,) %>% xtable() align(x) <- "lp{1.8in}p{1.8in}p{1.8in}" print(x, include.rownames = FALSE, size="\\fontsize{8pt}{6pt}\\selectfont", add.to.row = list(pos = as.list(seq_along(sel_idx)), command = rep("\\\\[3pt]",length(sel_idx))), floating = FALSE) @ \section{Mapping Sites to Protein Sequence} Somewhat conventional form of PTM notation (or non-synonymous mutations) is gene/protein ID followed by AA code in upper case, position in the sequence and original AA shows as low case. For example, phosphoryation of serine at position 473 of AKT1 would look like AKT1-S473s. Besides \Robject{MSnID} object, the key component to mapping the modifications is the FASTA file with protein sequences. <<>>= fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") fst <- readAAStringSet(fst_path) @ When we link accession IDs with FASTA entry names, obviously they need to be in the same format. So in this case we have to trip the names in FASTA. <<>>= names(fst) <- sub("(^[^ ]*) .*$", "\1", names(fst)) @ The core method for mapping modification sites. The warning message it gives about extra characters in peptide sequences is about "\#" and "\string^" we used to denote TMT modification and alkylation. They are ignored during mapping. <<>>= m <- map_mod_sites(m, fst, accession_col = "accession", peptide_mod_col = "peptide_mod", mod_char = "*", site_delimiter = "lower") x <- psms(m) %>% distinct(peptide_mod, SiteID) @ <>= x <- x %>% `[`(sel_idx,) %>% xtable() align(x) <- "lp{1.8in}p{3.6in}" print(x, include.rownames = FALSE, size="\\fontsize{8pt}{6pt}\\selectfont", add.to.row = list(pos = as.list(seq_along(sel_idx)), command = rep("\\\\[3pt]",length(sel_idx))), floating = FALSE) @ \section{Re-Doing Mapping with Gene IDs} The most commont (human readable and somewhat comprehensible) identifier of proteins and corresponding genes is gene symbol. For example, \textbf{sp|P84996|ALEX\_HUMAN} UniProt ID corresponds to \textbf{GNAS} gene symbol. In this section we'll remap UniProt IDs to gene symbols and report Site IDs in a more human readable way. % remap_accessions % remap_fasta_entry_names % fetch_conversion_table First, we'll download a table converting from one ID to another. There are multiple ways how one can get this type of table. In this example we implicitely use \Rpackage{AnnotationHub} package. <<>>= conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") head(conv_tab) @ Re-mapping accessions from IDs indicated in the first columns of the \Robject{conv\_tab} to the second. The accessions in the \Robject{MSnID} object may not be in exactly in the same form as in the database used to fetch \Robject{conv\_tab} conversion table. Thus, there is the \Rcode{extraction\_pttrn} argument. It extracts the first matching group \verb_"\\1"_ as the proper ID. There are three suggested extraction patterns for UniProt, RefSeq and ENSEMBL. In case the accession is more complicated than that, user can provide a custom extraction pattern. <<>>= head(accessions(m)) m <- remap_accessions(m, conv_tab, extraction_pttrn = "\\|([^|-]+)(-\\d+)?\\|") head(accessions(m)) @ Since we updated the accessions in the \Robject{MSnID} object, we need to provide FASTA file with corresponding entry names if we want to map the PTM sites. If such FASTA file isn't readily available, which is very likely if the \Robject{MSnID} object accessions converted to gene symbols. Entry names can be updated using the same conversion table. <<>>= fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") fst_path_2 <- remap_fasta_entry_names(fst_path, conv_tab, "\\|([^|-]+)(-\\d+)?\\|") library(Biostrings) readAAStringSet(fst_path) readAAStringSet(fst_path_2) @ Now we can execute the same remapping, but using gene symbols as protein IDs. <<>>= fst <- readAAStringSet(fst_path_2) m <- map_mod_sites(m, fst, accession_col = "accession", peptide_mod_col = "peptide_mod", mod_char = "*", site_delimiter = "lower") x <- psms(m) %>% distinct(peptide_mod, SiteID) @ <>= x <- x %>% `[`(sel_idx,) %>% xtable() align(x) <- "lp{1.8in}p{3.6in}" print(x, include.rownames = FALSE, size="\\fontsize{8pt}{6pt}\\selectfont", add.to.row = list(pos = as.list(seq_along(sel_idx)), command = rep("\\\\[3pt]",length(sel_idx))), floating = FALSE) @ <>= unlink(".Rcache", recursive=TRUE) @ \end{document}