%\VignetteIndexEntry{Introduction to pepXMLTab}
%\VignetteKeywords{pepXML, peptide identification, peptide table, peptide FDR}
%\VignettePackage{pepXMLTab}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage[numbers]{natbib}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\Bioconductor}{\software{Bioconductor}}
\newcommand{\pepXMLTab}{\Rpackage{pepXMLTab}}

\title{Introduction to \pepXMLTab}
\author{Xiaojing Wang}
\date{\today}

\begin{document}

\maketitle
\tableofcontents

\section{Introduction}

Mass spectrometry (MS)-based proteomics is widely used in biological research.
The MS/MS spectra generated by this technology are usually searched against a
sequence database and assigned to peptides. Such assignments are typically
stored in pepXML, a data format developed at the SPC/Institute for Systems
Biology. More detailed information about pepXML can be found at
\url{http://tools.proteomecenter.org/wiki/index.php?title=Formats:pepXML}.
More recently, HUPO has defined a community standard format, mzIdentML
\cite{mzIdentML} (\url{http://www.psidev.info/mzidentml}), which is set to
replace pepXML. An existing R package, \Rpackage{mzID}, is designed to parse
the mzIdentML file format. However, as the first widely accepted format and
one supported by many search engines, pepXML is still commonly used. Although
this XML-based format has a highly organized structure, it is not easy for
humans to read, so converting it to a human-readable format is often
desirable.
To this end, we developed the R package \Rpackage{pepXMLTab}, which imports
the peptide-spectrum matches (PSMs) and related information from pepXML files
and filters them based on a user-specified FDR threshold.
\Rpackage{pepXMLTab} has been tested using sample pepXML files generated by
multiple search engines: MyriMatch \cite{myrimatch}, Mascot \cite{mascot},
X!Tandem \cite{xtandam} and SEQUEST \cite{Sequest}.

\section{Convert pepXML to a tabular format}

In order to calculate FDR at the peptide level, \Rpackage{pepXMLTab} uses the
function \Rfunction{pepXML2tab} to convert the `spectrum\_query' section of a
pepXML file to a data frame. The structure of the output data frame depends on
the input pepXML file, with each column holding one piece of information
reported by the search engine.

Different search engines use their own scoring methods for PSMs. For instance,
MyriMatch uses a sophisticated statistical scoring system. For each
experimental spectrum, MyriMatch examines every m/z location and computes two
probabilistic scores: an intensity-based MVH score and a mass error-based
mzFidelity score. In SEQUEST, a cross-correlation score (XCorr) is used to
represent an average of the differences between the m/z values in the observed
and virtual spectra. Please consult the documentation of each search engine
for more details.
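
To make the shape of the output table more concrete, the following minimal
sketch flattens a toy pepXML-like fragment into a data frame with the
\Rpackage{XML} package. It is an illustration only and not the implementation
of \Rfunction{pepXML2tab}: the fragment, its values and the object names
\Robject{toy}, \Robject{rows} and \Robject{toy\_tab} are hypothetical, and
real pepXML files carry an XML namespace and many more attributes.

<<>>=
library(XML)

## Toy fragment with two spectrum queries (hypothetical values).
toy <- '<msms_run_summary>
  <spectrum_query spectrum="run.1.1.2" assumed_charge="2">
    <search_result>
      <search_hit hit_rank="1" peptide="PEPTIDEK" protein="P1">
        <search_score name="mvh" value="45.2"/>
        <search_score name="mzFidelity" value="60.1"/>
      </search_hit>
    </search_result>
  </spectrum_query>
  <spectrum_query spectrum="run.2.2.3" assumed_charge="3">
    <search_result>
      <search_hit hit_rank="1" peptide="SAMPLER" protein="rev_P2">
        <search_score name="mvh" value="12.7"/>
        <search_score name="mzFidelity" value="20.3"/>
      </search_hit>
    </search_result>
  </spectrum_query>
</msms_run_summary>'

doc <- xmlParse(toy, asText = TRUE)
hits <- getNodeSet(doc, "//search_hit")

## One row per search hit: attributes of the enclosing spectrum_query,
## attributes of the hit itself, and one column per search_score name.
rows <- lapply(hits, function(hit) {
  query <- xmlParent(xmlParent(hit))
  scores <- xpathSApply(hit, "./search_score", xmlGetAttr, "value")
  names(scores) <- xpathSApply(hit, "./search_score", xmlGetAttr, "name")
  c(xmlAttrs(query), xmlAttrs(hit), scores)
})
toy_tab <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
toy_tab
@

Because the score columns are named after the \Rcode{search\_score} entries
written by the search engine, the set of columns differs from one engine to
another.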
<<>>=
options(width=70)
@

<<>>=
library(pepXMLTab)
@

<<>>=
## MyriMatch example
pepxml <- system.file("extdata/pepxml", "Myrimatch.pepXML",
    package="pepXMLTab")
tttt <- pepXML2tab(pepxml)
tttt[1:2,]

## Mascot example
pepxml <- system.file("extdata/pepxml", "Mascot.pepXML",
    package="pepXMLTab")
tttt <- pepXML2tab(pepxml)
tttt[1:2,]

## SEQUEST example
pepxml <- system.file("extdata/pepxml", "SEQUEST.pepXML",
    package="pepXMLTab")
tttt <- pepXML2tab(pepxml)
tttt[1:2,]

## X!Tandem example
pepxml <- system.file("extdata/pepxml", "XTandem.pepXML",
    package="pepXMLTab")
tttt <- pepXML2tab(pepxml)
tttt[1:2,]
@

\section{PSMs Filtering}

After the PSMs have been loaded from the pepXML files, the function
\Rfunction{PSMfilter} is used to filter them based on score (as defined by the
search engine), hit rank and peptide length. By default,
\Rfunction{PSMfilter} selects the top-ranking peptide hit with a minimum
length of 6 amino acids. The FDR estimation is based on decoy database
matches, and the calculation is similar to the one used in IDPicker2
\cite{idpicker}. All peptides are separated into classes based on tryptic
status and charge state. Within each peptide class, PSMs are filtered at the
user-specified FDR (the default is 0.01). The PSMs that pass the FDR threshold
in each class are then pooled together as output \cite{idpicker}. For example,
combining three tryptic types (fully tryptic, semi-tryptic and nontryptic)
with three charge states (1+, 2+, 3+) divides all PSMs into 9 groups. In each
group, only the PSMs with an FDR less than 0.01 are kept, and the PSMs that
pass in each group are pooled together as output.
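
The class-wise filtering described above can be sketched in a few lines of
base R. This is a simplified illustration under stated assumptions, not the
code of \Rfunction{PSMfilter}: the toy table \Robject{psms}, the helper
\Rcode{filter\_class} and the estimate
$2 \times \mathrm{decoys} / \mathrm{total}$ (one common target-decoy
estimator) are all chosen here for illustration, and the sketch keeps PSMs
whose estimated FDR is at or below the threshold.

<<>>=
## Hypothetical toy PSM table; higher scores are better.
psms <- data.frame(
  score   = c(50, 45, 40, 35, 30, 25, 48, 44, 20, 18),
  decoy   = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
              FALSE, FALSE, TRUE, FALSE),
  charge  = c(2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  tryptic = "full"
)

## Filter one peptide class: rank by score, estimate the FDR at every
## rank, and keep target PSMs down to the last rank within the threshold.
filter_class <- function(d, fdr = 0.01) {
  d <- d[order(d$score, decreasing = TRUE), ]
  est_fdr <- 2 * cumsum(d$decoy) / seq_len(nrow(d))
  ok <- which(est_fdr <= fdr)
  if (length(ok) == 0) return(d[0, ])
  kept <- d[seq_len(max(ok)), ]
  kept[!kept$decoy, ]
}

## Split into classes by charge and tryptic status, filter each class
## separately, then pool the survivors.
classes <- split(psms, list(psms$charge, psms$tryptic), drop = TRUE)
passed_toy <- do.call(rbind, lapply(classes, filter_class, fdr = 0.01))
passed_toy
@

In this toy table the three best-scoring 2+ PSMs and the two best-scoring 3+
PSMs are kept. Each class is cut at its own score threshold before the
results are pooled.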
<<>>=
## MyriMatch example
pepxml <- system.file("extdata/pepxml", "Myrimatch.pepXML",
    package="pepXMLTab")
tttt <- pepXML2tab(pepxml)
passed <- PSMfilter(tttt, pepFDR=0.01, scorecolumn='mvh', hitrank=1,
    minpeplen=6, decoyprefix='rev_')
passed[1, ]
@

\section{Session Information}

<<>>=
sessionInfo()
@

\bibliographystyle{unsrtnat}
\bibliography{pepXMLTab}

\end{document}