%% LyX 2.0.5 created this file. For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass{scrartcl}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[authoryear]{natbib}
\usepackage[unicode=true,pdfusetitle,
 bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
 breaklinks=true,pdfborder={0 0 0},backref=page,colorlinks=false]
 {hyperref}
\usepackage{breakurl}

\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
<>=
if(exists(".orig.enc")) options(encoding = .orig.enc)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\VignetteIndexEntry{The proteinProfiles package}
%\VignettePackage{proteinProfiles}
\usepackage[english]{babel}

\makeatother

\begin{document}

\title{The \emph{proteinProfiles} package}
\author{Julian Gehring}
\maketitle

<>=
set.seed(1)
options(width=65)
@

\section{Introduction}

\subsection{Motivation and method}

In current high-throughput proteomics, it is feasible to assess the abundance of a large number of proteins in one measurement. When these measurements correspond to different time points, it is often of interest to identify groups of proteins showing similar time courses. The \emph{proteinProfiles} package offers the functionality to
\begin{enumerate}
\item Define protein groups of interest based on matching text annotation.
\item Compute similarity (distance) measures of time courses for a set of proteins.
\item Assess the significance of the similarity in terms of p-values in relation to randomly permuted sets.
\end{enumerate}
A detailed use case for this method is described in \citet{hansson_highly_2012}.
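
The three steps above can be written compactly. The following is a sketch under stated assumptions rather than the package's own definition: it assumes that the distance of a group is the mean Euclidean distance over all pairs of profiles in that group, and the symbols $x_{i}$, $G_{0}$, $G_{k}$ and $N$ are notation introduced here for illustration only. For a group $G$ of proteins with time courses $x_{i}$, one possible distance measure is
\[
d(G)=\frac{2}{|G|\,(|G|-1)}\sum_{i,j\in G,\; i<j}\|x_{i}-x_{j}\|_{2}.
\]
With $G_{0}$ denoting the group of interest and $G_{1},\dots,G_{N}$ denoting $N$ randomly drawn groups of the same size, an empirical p-value is the fraction of random groups whose profiles are at least as similar as those of $G_{0}$:
\[
p=\frac{1}{N}\,\#\{k:\, d(G_{k})\leq d(G_{0})\}.
\]
Under these assumptions, a small $p$ means that few randomly chosen groups show time courses as similar as those of the group of interest.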
\subsection{About the package}

To use the functions and the data described in this document, you have to load the package first:

<>=
library(proteinProfiles)
@

If you have not yet installed the package, you can do so in the same way as for any other Bioconductor package (see also \href{http://bioconductor.org/install/}{http://bioconductor.org/install/} for details):

<>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("proteinProfiles")
@

You can get more information about the package in general and about specific functions (e.g. the \emph{profileDistance} function) with:

<>=
vignette(package="proteinProfiles")
vignette("proteinProfiles")
?profileDistance
@

\section{Data import and structure}

To illustrate a typical workflow, we will use an example data set which mimics the data used in \citet{hansson_highly_2012}.

<>=
data(ips_sample)
ls()
@

For the analysis, you first need the abundance measurements for the proteins over time. These can be absolute or relative values, and can optionally include replicates. The data is stored as a numeric matrix, with rows corresponding to proteins and columns to time points/replicates.

<>=
head(ratios)
@

Further, you have to provide a data frame with annotation data associated with the proteins. This can include multiple annotation columns, as shown in the example data set.

<>=
colnames(annotation)
@

The matching of the annotation to the measurements relies on a custom identifier which is stored as row names in both \emph{ratios} and \emph{annotation}.

\section{Removing features with missing data}

The measurement was not successful for all data points, so the data contain missing values (\emph{NA}). Since computing the distances of profiles with several data points missing may be unreliable, you can optionally remove protein measurements whose fraction of missing data points exceeds a user-defined threshold. A threshold of e.g.
0.3 will remove all features with more than 30\% of the data points missing.

<>=
ratios_filtered <- filterFeatures(ratios, 0.3, verbose=TRUE)
@

\section{Defining protein groups of interest based on annotation}

Based on the annotation provided in the original data set, a group of proteins of interest can be obtained. The \texttt{grepAnnotation} function matches a regular expression against a column of the annotation object and returns the identifiers of the matching proteins. Here, we search for all protein names starting with the string \emph{``28S''}. For details on the pattern syntax, read the documentation of the \emph{grep} function.

<>=
names(annotation)
index_28S <- grepAnnotation(annotation, pattern="^28S", column="Protein.Name")
index_28S
@

We can also use other columns of the annotation. Here, we search for all proteins associated with the term \emph{``Ribosome''} in the \emph{KEGG} annotation column, which is taken from the KEGG database.

<>=
index_ribosome <- grepAnnotation(annotation, "Ribosome", "KEGG")
index_ribosome
@

\section{Computing profile distances and assessing significance}

The \texttt{profileDistance} function constitutes the core part of the analysis.
\begin{enumerate}
\item It computes the mean Euclidean distance $d_{0}$ of the profiles for the proteins of interest defined by \emph{index}. This distance is shown as a red vertical line in the plot.
\item It repeats step (1) for a number \emph{nSample} of randomly selected groups with the same size as our group of interest. The resulting distances are shown as a cumulative distribution in the plot.
\item Based on the results of steps (1) and (2), it computes a p-value $p$ given by the cumulative distribution of these distances evaluated at $d_{0}$ (which is equivalent to the area under the probability density in the range $(-\infty,d_{0}]$). It indicates the probability of observing, by chance, a group of proteins whose profiles have the same or a smaller distance than our group of interest.
\end{enumerate}

<>=
z1 <- profileDistance(ratios, index_28S)
z1$d0
z1$p.value
plotProfileDistance(z1)
@

<>=
z2 <- profileDistance(ratios, index_ribosome, nSample=2000)
plotProfileDistance(z2)
@

\appendix

\section*{\newpage{}\bibliographystyle{plain}
\bibliography{proteinProfiles-references}
}

\section*{Session Info}

<>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}