%\VignetteIndexEntry{Introduction to proBAMr} %\VignetteKeywords{Proteogenomics, Data integration, Protein assembly} %\VignettePackage{proBAMr} \documentclass[11pt]{article} \usepackage{times} \usepackage[utf8]{inputenc} \usepackage{hyperref} \usepackage{booktabs,caption,fixltx2e} \usepackage[flushleft]{threeparttable} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\proBAMr}{\Rpackage{proBAMr}} \title{Introduction to \proBAMr} \author{Xiaojing Wang} \date{\today} \begin{document} \maketitle \tableofcontents \section{Why proBAM file} Recent advances of sequencing technologies have reformed our conception of genomic data analysis, storage and interpretation, instigating more research interest in exploring human proteome at a parallel scale. Shotgun proteomics holds this promise by surveying proteome both qualitatively and quantitatively. Over the last years large amount of proteomics data has been accumulated, an emerging demand is to combine these efforts to catalogue the wide dynamic range of protein expression and complexity of alternative isoforms. However, this task is daunting due to the fact that different studies use varying databases, search engines and assembly tools. Such a challenge calls for an efficient approach of integrating data from different proteomics studies and even with genomic data. Here we provide an R package, \Rpackage{proBAMr}, that maps identified PSMs to the genome in BAM format, a binary format for efficient data storage and fast access in genomic research field. This method differs from other approaches because of its ability of building connections between peptide and genomic location and simultaneously maintaining spectra count information. PSMs are aligned under the same coordination framework regardless of the annotation systems (e.g. RefSeq, ENSEMBL) of the input proteomics data, which enables flexible protein assembly switch between different annotation or at different level (gene or protein). When genomic/transcriptomic information of the same individual is available, this approach allows the co-analysis with -omics data together. \section{What is proBAM file} \begin{table}[!ht] \centering \caption{Mandatory field definition of proBAM file and compare to original BAM format for genomic studies.} \includegraphics[keepaspectratio=true, width=\textwidth]{Tab1.jpg} \label{tab:MandatoryproBAM} \end{table} To take full advantage of tools developed for processing BAM files in genomics studies, we designed proBAM by incorporating features from the BAM file format and other features specifically for proteomics. Like BAM, it contains a header section and an alignment section. A full description of the BAM format is available at \url{http://samtools.github.io/hts-specs/SAMv1.pdf}. A PSGM (peptide-spectrum-genomic location-match) is the basic unit in proBAM and is similar to a read in NGS data. In Table \ref{tab:MandatoryproBAM}, we compared the BAM and proBAM description of each mandatory column in the alignment section. \begin{table}[!ht] \centering \caption{FLAG description in proBAM file.} \includegraphics[keepaspectratio=true, width=\textwidth]{Tab2.jpg} \label{tab:FLAGproBAM} \end{table} proBAM allows for 5 FLAG values due to the less complicated requirements by shotgun proteomics data (Table \ref{tab:FLAGproBAM}). For the same reason, the CIGAR tag in proBAM file only supports 'M' for match/mismatch and 'N' for skipped bases on the reference. The optional filed keep extra information from proteomics experiment platform or search engines. The definition and value format of each optional column is descripted in Table \ref{tab:OptionalproBAM}. It is important to note that this table is extendable depending on continuous development and input from the community. \begin{table}[!ht] \centering \caption{Optional field definition of proBAM file.} \includegraphics[keepaspectratio=true, width=\textwidth]{Tab3.jpg} \begin{tablenotes} \small \item The optional field follow the rule TAG:TYPE:VALUE defined by BAM file. There are three type of VALUE format: i, Singed 32-bit integer; Z, Printable string; f, Single-precision floating number. \end{tablenotes} \label{tab:OptionalproBAM} \end{table} \section{How to build proBAM file} \subsection{Preparing annotation files} To map proteomics data to the genome, numerous pieces of genome annotation information are needed, such as genome elements region boundary, protein coding sequence and protein sequence et al. It is possible to manually download these data from different public resources (e.g. NCBI, UCSC and ENSEMBL) and then parse them to an appropriate format. To make this process more efficient and autonomous, we provide functions to prepare the gene/transcript annotation files from UCSC, ENSEMBL and GENCODE. The \Rfunction{PrepareAnnotationRefseq} and \Rfunction{PrepareAnnotationEnsembl} were included in another R package \Rpackage{customProDB} \url{http://bioconductor.org/packages/3.0/bioc/html/customProDB.html}. Here, we provide the function \Rfunction{PrepareAnnotationGENCODE} to prepare the annotation from GENCODE. This function requires users to download GTF file, coding sequence and protein sequence FASTA files from GENCODE ftp \url{ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/}. Users should use the same version of annotations through the same project annalysis. All the annotations are saved to a specified directory for latter use. <>= options(width=70) @ <>= library(proBAMr) @ <>= gtfFile <- system.file("extdata", "test.gtf", package="proBAMr") CDSfasta <- system.file("extdata", "coding_seq.fasta", package="proBAMr") pepfasta <- system.file("extdata", "pro_seq.fasta", package="proBAMr") annotation_path <- tempdir() PrepareAnnotationGENCODE(gtfFile, CDSfasta, pepfasta, annotation_path, dbsnp=NULL, splice_matrix=FALSE, COSMIC=FALSE) @ \subsection{Preparing PSMs table} After preparing all the annotation files, the R package \Rpackage{pepXMLTab} is used to extract confident PSMs and related information from pepXML files. Other tools are also applicable at this step, as long as it generates similar tabular files, as shown below. <>= passedPSM <- read.table(system.file("extdata", "passedPSM.tab", package="proBAMr"), sep='\t', header=TRUE) passedPSM[1:3, ] @ \subsection{Generate SAM file using \Rfunction{PSMtab2SAM}} The function \Rfunction{PSMtab2SAM} first finds the peptide location in protein sequences, then maps the coding sequence of the peptide back to the genome according to the annotation. <>= load(system.file("extdata/GENCODE", "exon_anno.RData", package="proBAMr")) load(system.file("extdata/GENCODE", "proseq.RData", package="proBAMr")) load(system.file("extdata/GENCODE", "procodingseq.RData", package="proBAMr")) options(stringsAsFactors=FALSE) passedPSM <- read.table(system.file("extdata", "passedPSM.tab", package="proBAMr"), sep='\t', header=TRUE) SAM <- PSMtab2SAM(passedPSM, XScolumn='mvh', exon, proteinseq, procodingseq) write.table(SAM, file=paste(tempdir(), '/test.sam', sep=''), sep='\t', quote=FALSE, row.names=FALSE, col.names=FALSE) dim(SAM) SAM[20:27, ] @ \subsection{Convert SAM file to BAM and index} Add the header to the SAM file. Converted them to the binary BAM files using samtools \url{http://samtools.sourceforge.net/}. Sort and index them for fast access. The bullet list below summarizes the steps after the SAM file been generated. <>= paste('cat header test.sam > test_header.sam') paste('samtools view -S -b test_header.sam > test_header.bam') paste('samtools sort test_header.bam > test_header_sort') paste('samtools index test_header_sort') @ \subsection{Visulize proteomics data in IGV} The proBAM files can be visulized in IGV directly. Furthermore, users can co-visulize their proteomics data with the paired genomics/transcriptomics data, as shown in Fig \ref{fig:igvprobam}. \begin{figure}[!ht] \caption{IGV snapshot of a homozygous mutation in gene ALDH1B1 in both proteomics and RNA-Seq data (inside read box)} \centering \includegraphics[keepaspectratio=true, width=\textwidth]{Fig1.jpg} \label{fig:igvprobam} \end{figure} \section{Session Information} <>= sessionInfo() @ \end{document}