% \VignetteIndexEntry{ExpressionView file format}

\documentclass{article}
\usepackage{ragged2e}
\usepackage{url}
\usepackage{listings}
\usepackage{color}
\usepackage{courier}

\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\texttt{#1}}
\newcommand{\Rargument}[1]{\textsl{#1}}
\newcommand{\filename}[1]{\texttt{#1}}
\newcommand{\variable}[1]{\texttt{#1}}

\definecolor{green}{rgb}{0.8,1,0.8}
\definecolor{blue}{rgb}{0.8,0.8,1}

\lstset{basicstyle=\footnotesize\ttfamily,%
  keywordstyle=\footnotesize\ttfamily,%
  language=XML,frame=single,columns=fullflexible,%
  aboveskip=8pt,belowskip=-5pt}
\lstnewenvironment{schema}{\lstset{backgroundcolor=\color{green}}}{}
\lstnewenvironment{example}{\lstset{backgroundcolor=\color{blue}}}{}

\newcommand{\xmltag}[1]{%
  \csname a:xmltag\endcsname\texttt{<#1>}\csname b:xmltag\endcsname}

\begin{document}

\title{ExpressionView file format}
\author{G\'abor Cs\'ardi}
\maketitle

\tableofcontents

\RaggedRight
\setlength{\parskip}{12pt}

\section{Introduction}

ExpressionView is an interactive visualization tool for biclusters in
gene expression data. The software has two parts. The first part is an
ordering algorithm, written in GNU R, that reorders the rows and columns
of the gene expression matrix to make (potentially overlapping)
biclusters more visible. The second part is the interactive tool,
written in Adobe Flex. It runs in an Adobe Flash-enabled web browser.
The user can export the ordered gene expression matrix, together with
additional meta-data, from R to a data file that can be opened by the
Adobe Flash application. In this document we briefly discuss the format
of this data file.

\section{The file format}

The EVF data file, used by ExpressionView, is a standard XML file. The R
package contains an XML Schema file that describes the exact format. In
the following we show this schema file and explain its parts step by
step, while also showing samples from an example EVF data file.
Parts of the schema file appear in green boxes; EVF file code snippets
are in blue boxes.

\subsection{Header and main parts}

The schema file starts with a standard header:
\begin{schema}
ExpressionView file format schema, version 1.1.
ExpressionView is a tool to visualize modules (biclusters) in gene expression data.
Please see http://www.unil.ch/cbg for details.
Copyright 2010 UNIL DGM Computational Biology Group
\end{schema}

This is version 1.1 of the EVF file format. ExpressionView can also read
the older 1.0 version.
\begin{schema}
\end{schema}

An EVF file contains a single \xmltag{evf} tag. It has the following
parts:
\begin{schema}
\end{schema}

The \xmltag{summary} tag contains information about the data set, such
as the number of genes, samples, modules, etc. \xmltag{experimentdata}
is for experiment meta-data, e.g. the lab where the experiment was
performed, the abstract of the related publication and possibly more.
\xmltag{genes} and \xmltag{samples} hold the gene and sample meta-data.
\xmltag{modules} defines the biclusters. Finally, \xmltag{data} contains
the expression values themselves. Let us now look at each of these tags
in detail.

\subsection{Summary}

This is the type of the \xmltag{summary} tag:
\begin{schema}
\end{schema}

The \xmltag{summary} tag optionally contains a description of the data
file (\xmltag{description} tag), which is displayed above the main gene
expression window in ExpressionView. The \xmltag{version} tag must be
\texttt{1.1} for files in the format we are discussing here. The
\xmltag{dataorigin} tag is optional and has the value `\texttt{eisa}'
for modules generated with the ISA algorithm~\cite{bergmann03} and
exported using the ExpressionView R package. \xmltag{xaxislabel} and
\xmltag{yaxislabel} are optional axis labels for the gene expression
plot. The last three tags (\xmltag{nmodules}, \xmltag{ngenes} and
\xmltag{nsamples}) are required and give the number of modules, genes
and samples in the data set.
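
As a rough illustration of how these fields might be consumed, the
following Python sketch reads the \xmltag{summary} tag of an EVF-like
document with the standard \texttt{xml.etree.ElementTree} module. The
tag names are taken from the description above. The helper
\texttt{read\_summary}, the inline sample document, and the assumption
that the elements carry no XML namespace are hypothetical and are not
part of the ExpressionView package.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal EVF fragment. Tag names follow the text above;
# a real file may use an XML namespace and contains further sections.
SAMPLE = """<evf>
  <summary>
    <description>ExpressionView data file</description>
    <version>1.1</version>
    <dataorigin>eisa</dataorigin>
    <nmodules>8</nmodules>
    <ngenes>3522</ngenes>
    <nsamples>128</nsamples>
  </summary>
</evf>"""

# The three counts are described as required, the rest as optional.
REQUIRED_COUNTS = ("nmodules", "ngenes", "nsamples")
OPTIONAL_TEXT = ("description", "dataorigin", "xaxislabel", "yaxislabel")


def read_summary(xml_text):
    """Return the <summary> fields as a dict.

    Raises ValueError if <summary>, <version> or a required count is missing.
    """
    root = ET.fromstring(xml_text)
    summary = root.find("summary")
    if summary is None:
        raise ValueError("no <summary> element found")

    version = summary.findtext("version")
    if version is None:
        raise ValueError("no <version> element found")
    result = {"version": version.strip()}

    # Required counts are converted to integers.
    for tag in REQUIRED_COUNTS:
        text = summary.findtext(tag)
        if text is None:
            raise ValueError("missing required tag: " + tag)
        result[tag] = int(text)

    # Optional fields are copied only when present.
    for tag in OPTIONAL_TEXT:
        text = summary.findtext(tag)
        if text is not None:
            result[tag] = text.strip()
    return result


if __name__ == "__main__":
    print(read_summary(SAMPLE))
```

A real EVF file is better validated against the schema file shipped with
the R package than with ad hoc checks like these.
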
A sample EVF file header, together with the \xmltag{summary} tag, looks
like this:
\begin{example}
ExpressionView data file
1.1
eisa
8
3522
128
\end{example}

\subsection{Experiment meta-data}

The \xmltag{experimentdata} tag is next; it contains the experiment
meta-data. It has the following parts:
\begin{schema}
\end{schema}

All fields are optional here: \xmltag{title}, \xmltag{name},
\xmltag{lab}, \xmltag{abstract}, \xmltag{url}, \xmltag{annotation},
\xmltag{organism}. They are collected and shown in the \emph{Experiment}
tab in the interactive ExpressionView viewer. The \xmltag{annotation}
tag should contain the name of the chip on which the experiment was
performed. If the \xmltag{organism} tag contains
`\texttt{Homo sapiens}', then ExpressionView links genes to the
GeneCards homepage; for other organisms it uses Entrez Gene. The tags
within \texttt{experimentdata} can be specified in arbitrary order.

Here is an example of the \xmltag{experimentdata} tag:
\begin{example}
Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival.
Chiaretti et al.
Department of Medical Oncology, ...
Gene expression profiles were examined ...
http://...
hgu95av2
Homo sapiens
\end{example}

\subsection{Genes}

The next tag within \xmltag{evf} is \xmltag{genes}; its type is defined
as
\begin{schema}
\end{schema}

It has two main parts: first a \xmltag{genetags} tag, and then a number
of \xmltag{gene} tags, one for each gene in the data set.

Before we continue with the gene tags, we define a new tag type that is
required to have a \texttt{name} attribute. We call this type
\texttt{ExtraType}. This tag type can be used to add any kind of
meta-data to the table in the \emph{Genes} tab in ExpressionView.
\begin{schema}
\end{schema}

The \xmltag{genetags} tag has one required and many optional subtags:
\begin{schema}
\end{schema}

The \xmltag{id} must be an integer, starting at one and going up to the
number of genes. It serves as the id that other tags in the document
refer to. \xmltag{name} is typically the name of the probe-set, but it
can be used for other purposes as well. \xmltag{symbol} is typically the
canonical gene symbol. \xmltag{entrezid} is the Entrez gene id. If a
given probe-set does not map to a known gene, then \xmltag{symbol} and
\xmltag{entrezid} can be set to \texttt{NA}. Finally, the gene tags may
contain any number of \xmltag{x} tags, each defining an additional
column in the \emph{Genes} tab in ExpressionView. These tags must have a
\texttt{name} attribute, which is referred to by the tags of the
individual genes.

Here is an example gene tags section:
\begin{example}
# Name Symbol EntrezID Chromosome
\end{example}

Here we define an extra column for the gene table, which gives the
chromosome on which the gene is located.

After the gene tags, the file typically contains a large number of
genes. The definition of the \xmltag{gene} tag:
\begin{schema}
\end{schema}

The \xmltag{gene} tags must have exactly the same subtags as the ones
given in the gene tags section. For example, assuming the
\xmltag{genetags} example above, we can have:
\begin{example}
1 33500_i_at NA NA NA
2 40990_at TSPAN5 10098 4
...
\end{example}

Observe how the extra tag for the chromosomes is referenced here.

\subsection{Samples}

The samples and their associated data have a format similar to the
genes:
\begin{schema}
\end{schema}

The \xmltag{samples} tag has two parts: the first defines the sample
meta-data entries and the second contains one \xmltag{sample} tag for
each sample in the data set.
\begin{schema}
\end{schema}

The \xmltag{id} and \xmltag{name} sample tags are required.
\xmltag{id} contains numeric ids, from one up to the number of samples
in the data set. These are referenced by other tags, e.g. the ones that
define the modules. \xmltag{name} is an arbitrary string; it is
typically the sample name in the R ExpressionSet object that was used to
find the biclusters. As with the genes, any number of additional
\xmltag{x} tags can be added, each with a unique \texttt{name}
attribute, to define additional information about the samples.

Here is an example \xmltag{sampletags} tag:
\begin{example}
# Name Date of diagnosis Gender of the patient Age of the patient at entry does the patient have B-cell or T-cell ALL
\end{example}

We defined four extra tags for the samples in this example. Then the
samples follow:
\begin{schema}
\end{schema}

Here is an example that corresponds to the \xmltag{sampletags} entry
given above:
\begin{example}
1 01010 3/29/2000 2 19 3
2 04010 10/30/1997 1 18 2
...
\end{example}

\subsection{Modules}

Again, the \xmltag{modules} tag has a syntax that is similar to that of
the genes and the samples:
\begin{schema}
\end{schema}

There are two required and two optional parts. The first part is
\xmltag{moduletags}, which defines the meta-data associated with the
modules. The \xmltag{gotags} and \xmltag{keggtags} tags are optional;
they are only present if the file contains Gene Ontology and/or KEGG
pathway enrichment calculation results for the biclusters. Finally,
there is a list of \xmltag{module} tags, one for each bicluster.

The type of \xmltag{moduletags}:
\begin{schema}
\end{schema}

Only one tag is required in \xmltag{moduletags}: the \xmltag{id}, which
is a numeric id going from 1 to the number of modules. The rest are
optional, and it is also possible to create any kind of extra tags with
the \xmltag{x} tag. \xmltag{name} is an arbitrary character string.
The other optional tags are mainly intended for modules coming from the
ISA algorithm: \xmltag{iterations} gives the number of ISA iterations
needed to find the module, and \xmltag{oscillation} specifies the
oscillation cycle length for oscillating modules. The latter is mostly
zero, meaning that the module does not oscillate. \xmltag{thr\_row} is
the ISA gene threshold, \xmltag{thr\_col} is the ISA condition
threshold. \xmltag{freq} is the number of ISA seeds that converged to
the module, \xmltag{rob} is its robustness score. \xmltag{rob\_limit} is
the robustness threshold that was used to filter the module.

Here is an example \xmltag{moduletags} tag:
\begin{example}
# Name Iterations Oscillation cycle Gene threshold Column threshold Frequency Robustness Robustness limit
\end{example}

The optional \xmltag{gotags} and \xmltag{keggtags} tags are discussed in
Section~\ref{sec:gokegg}. Let us now see what the data for a single
module looks like:
\begin{schema}
\end{schema}

The first few fields refer to the ones defined in the
\xmltag{moduletags} section above. The others are:
\xmltag{containedgenes}, a space separated list of gene ids, for the
genes that are included in the module; \xmltag{genescores}, the gene
scores of these genes, in the order of the list in
\xmltag{containedgenes}; \xmltag{containedsamples}, the ids of the
samples in the module; \xmltag{samplescores}, the scores for these
samples; \xmltag{intersectingmodules}, the ids of the modules that
overlap with the given module. \xmltag{gos} and \xmltag{keggs} are
optional and describe the Gene Ontology and KEGG pathway enrichment of
the given module; see Section~\ref{sec:gokegg} for details.

Here is an example \xmltag{module} tag:
\begin{example}
1 module 1 22 0 2.7 1.4 1 21.98 21.98
214 215 216 217 218 219 220 221 222
-0.94 -0.88 0.74 -0.76 -1.00 -0.84 -0.74 -0.76 -0.85
63 64 65 54 66
-0.62 -1.00 -0.77 -0.40 -0.28
7
... see below ...
... see below ...
\end{example}

We referred to the integer list and double list data types above; we
define them now. This is the integer list:
\begin{schema}
\end{schema}
and similarly for the double list:
\begin{schema}
\end{schema}

\subsection{Gene Ontology and KEGG pathway enrichment}%
\label{sec:gokegg}

EVF files can optionally contain Gene Ontology (GO) and/or KEGG pathway
enrichment calculation results. In this case \xmltag{moduletags} is
followed by a \xmltag{gotags} and/or a \xmltag{keggtags} tag. The
definition of \xmltag{gotags}:
\begin{schema}
\end{schema}

\xmltag{id} is a numeric id, starting from one, up to the total number
of enriched GO categories; \xmltag{go} is the GO id; \xmltag{term} is
the GO term; \xmltag{ontology} is the GO ontology of the term, its
possible values are \texttt{BP}, \texttt{CC} and \texttt{MF}, standing
for \emph{Biological process}, \emph{Cellular component} and
\emph{Molecular function}; \xmltag{pvalue} is the enrichment $p$-value;
\xmltag{oddsratio} is the odds ratio; \xmltag{expcount} is the number of
genes expected in the intersection by chance; \xmltag{count} is the
number of genes in the intersection; \xmltag{size} is the size of the GO
term, i.e. the number of genes (in the current gene universe) that are
annotated with the enriched term.
\begin{schema}
\end{schema}

The tags for \xmltag{keggtags} are almost the same as for
\xmltag{gotags}, but here \xmltag{kegg} is the KEGG pathway ID, and
\xmltag{pathname} is the name of the pathway. The rest is the same.

Part of an EVF file with the \xmltag{gotags} and \xmltag{keggtags} tags:
\begin{example}
# GO Term Ontology PValue OddsRatio ExpCount Count Size
# KEGG Path Name PValue OddsRatio ExpCount Count Size
\end{example}

The actual enrichment data is given in the \xmltag{module} tags, which
can contain a \xmltag{gos} and/or a \xmltag{keggs} tag with the
enrichment $p$-values and other statistics.
\begin{schema}
\end{schema}

\xmltag{gos} is a list of \xmltag{go} tags.
\begin{schema}
\end{schema}

\xmltag{keggs} is a list of \xmltag{kegg} tags.

The subtags within a \xmltag{go} or a \xmltag{kegg} tag refer to the
tags already listed above in the \xmltag{gotags} and \xmltag{keggtags}
tags:
\begin{schema}
\end{schema}
\begin{schema}
\end{schema}

Here is an example; this is part of a \xmltag{module} tag:
\begin{example}
1 GO:0002376 immune system process BP 1.64e-04 7.03 3.95 16 353
2 GO:0002504 antigen processing and presentation of peptide or polysaccharide antigen via MHC class II BP 1.06e-03 79.95 0.10 4 9
1 05310 Asthma 0.02 31.60 0.15 3 10
2 05320 Autoimmune thyroid disease 0.03 24.54 0.18 3 12
\end{example}

\subsection{The expression data}

Finally, the expression data is included in the \xmltag{data} tag. This
is a Base64 encoded string, generated by representing each expression
value with a single unsigned byte, and then concatenating and Base64
encoding these bytes. The data looks like this:
\begin{example}
/Acs1hZT+FDeFGH/29zL+3rK4+sTCvzZzxVgvNrM5Bw8Kfjw2vb6Of8qti943/0QIF8xJy7g2ufB
zMoOqyYA4/ne1KsZ1bH69u3c28zewtQM9G8T5zGK7P0szwIM/ub2z9zwHxclIdz3ywC6xtfOySje
DewMklghAfGqzsQtDeHZ3wolAC1GSN+9N+/i5Oz99bTzzRb1QvzHE+Qlr1Ej1U697+AFvtynDeHd
...
\end{example}
\begin{schema}
\end{schema}

\section{Additional information}

Please see the ExpressionView homepage at
\url{http://www.unil.ch/cbg/ExpressionView} for more information, the
schema file and example data files.

\bibliographystyle{unsrt}
\bibliography{ExpressionView}

\end{document}