% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{geNetClassifier-vignette} %\VignetteKeywords{Classification, Network, Gene Expression, SVM, feature selection, cross-validation, gene marker, machine-learning} %\VignetteDepends{geNetClassifier} %\VignettePackage{geNetClassifier} \documentclass[a4paper,12pt]{article} \usepackage{Sweave} %modify for your R home directory \usepackage{amsmath} \usepackage{hyperref} %\usepackage[authoryear,round]{natbib} \usepackage{authblk} \usepackage{anysize} \usepackage{float} \usepackage[font=small, labelfont=bf, labelsep=period]{caption} \setlength{\captionmargin}{30pt} \marginsize{1in}{1in}{0.5in}{0.5in} \setlength{\parindent}{0pt} \pagestyle{myheadings} \markright{geNetClassifier\hfill} \title{\textit{geNetClassifier}\\ classify multiple diseases and build associated gene networks using gene expression profiles} \author{Sara Aibar} \author{Celia Fontanillo} \author{Conrad Droste} \author{Javier De Las Rivas} \affil{Bioinformatics and Functional Genomics Group\\ Centro de Investigacion del Cancer (CiC-IBMCC, CSIC/USAL)\\ Salamanca - Spain} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \begin{center} Version: 1.6 \end{center} \tableofcontents \newpage \section{Introduction to \textit{geNetClassifier}} \label{sec:intro} \textit{geNetClassifier} is an algorithm designed to build transparent classifiers and the associated gene networks based on genome-wide expression data. \\ \textit{geNetClassifier()} is also the name of the main function in the package. This function takes as input the \textit{expressionSet} or expression matrix of the studied samples and the classes the samples belong to (i.e. the diseases or disease subtypes). Once the data are analyzed, geNetClassifier() provides: \textbf{(i)} ranked gene sets (or gene signatures) that identify each class; \textbf{(ii)} a multiple-class classifier; and \textbf{(iii)} gene networks associated to each class. \begin{itemize} \item Gene ranking: The genes, probesets, or any other variables that are input in the \textit{expressionSet} are considered \textit{features} for the classification. These features are analyzed by \textit{geNetClassifier}, and ranked according to the class they best identify, in order to select the optimum set for training the classifier. This ranking is returned by \textit{geNetClassifier()} as well as the parameters calculated for gene selection. \item Classifier: \textit{geNetClassifier()} also returns a multi-class SVM-based classifier, which can be queried later on; the genes (features) chosen for classification; their discriminant power (a parameter that measures the importance that the classifier internally gives to each gene); and, optionally, the classifier's generalization error and statistics about the selected genes. \item Network: The mutual-information (interactions) and the co-expression (correlations) between the genes are also calculated and analyzed by the algorithm. These allow to estimate the degree of association between the variables and they are used to generate a gene network for each class. These networks can be plotted, providing a integrated overview of the genes that characterized each disease (i.e. each class). \end{itemize} \begin{figure}[H] \centering \includegraphics[width=0.7\textwidth]{Fig1intro} \caption{Taking an \textit{expressionSet} as input, \textit{geNetClassifier()} returns a gene signature for each class, a classifier to discriminate the classes, and gene networks associated to each class. The package also includes several analytic and visualizing tools to explore these results.} \label{fig:intro} \end{figure} \textbf{Examples of use}\\ The algorithm shows a robust performance applied to patient-based gene expression datasets that study disease subtypes or disease classes. In this vignette, we show its performance for a leukemia dataset that includes 60 microarray samples from bone marrow of patients with four major leukemia subtypes (ALL, AML, CLL, CML) and no-leukemia controls (NoL). The results outperform a previously published classification analysis of these data \cite{mile}.\\ The method is designed to be applied to the analysis and classification of different disease subtypes. Therefore, in the R package and this vignette, all the explanations and examples are disease-oriented. However, \textit{geNetClassifier} can be applied to the classification of any other type of biological states, pathological or not.\\ \textbf{Methods}\\ The algorithm \textit{geNetClassifier()} integrates several existing machine learning and statistical methods. The \textit{feature ranking} is achieved based on a Parametric Empirical Bayes method (PEB). Double-nested internal cross-validation (CV) \cite{barrier} is used for the \textit{feature selection} process and to estimate the \textit{generalization error} of the classifier. The machine learning method implemented in the classifier is a multi-class Support Vector Machine (SVM) \cite{meyer}. The gene \textit{networks} are built calculating the relations derived from gene to gene co-expression analysis (by default, \textit{Pearson correlation}) and the interactions derived from gene mutual information analysis (using \textit{minet} package) \cite{minet}. More details about these methods are available in the appropriate sections.\\ \textbf{Queries}\\ \textit{geNetClassifier} includes a \textit{query} function that allows either validation of the classifiers using external independent samples of known class (section \ref{sec:extVal}) or classification of new samples whose class is unknown (section \ref{sec:prediction}). This function facilitates the application of the classification algorithm as a predictor for new samples, and it is designed to resemble expert behavior by allowing \textit{NotAssigned} (NA) instances when it is not sure about the class labelling. In order to assign a sample to a class, the algorithm requires a minimum certainty (i.e. probability), leaving it unassigned in case it does not achieve a clear call to a single class. These probability thresholds can be tuned to achieve a more or less stringent assignment. By following this procedure, the algorithm emulates human experts in the decision-making.\\ \newpage \section{Install the package and example data} \label{sec:installation} To install \textit{geNetClassifier} from \textit{Bioconductor}: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("geNetClassifier") @ %The package requires the installation of other packages that are usually automatically installed with it using Bioconductor: \textit{Biobase}, \textit{BiocGenerics}, \textit{EBarrays}, \textit{lattice}, \textit{minet} and \textit{infotheo}. \textit{igraph} and \textit{RColorBrewer} are optional, but highly recommended in order to get full functionality of geNetClassifier, including the plots. To follow the examples presented in this \textit{Vignette}, we also need to install a sample dataset called \textit{leukemiasEset}: <>= BiocManager::install("leukemiasEset") @ This dataset contains an \textit{expresssionSet} built with 60 gene expression microarrays (HG-U133 plus 2.0 from \textit{Affymetrix}) hybridized with mRNA extracted from bone marrow biopsies of patients of the 4 major types of leukemia (ALL, AML, CLL and CML) and from non-leukemia controls (NoL). These data was produced by the Microarray Innovations in LEukemia (MILE) research project \cite{mile} and are available at GEO, under accession number GSE13159. The selected samples are labeled keeping their source GEO IDs.\\ To have an overview of this \textit{ExpressionSet} and its available info: <<>>= library(leukemiasEset) data(leukemiasEset) leukemiasEset @ <<>>= summary(leukemiasEset$LeukemiaType) @ <>= pData(leukemiasEset) @ For further information/help about this \textit{ExpressionSet}: <>= ?leukemiasEset @ CEL files were preprocessed using an alternative Chip Description File (CDF), which allows mapping the expression directly to genes (Ensembl IDs ENSG) instead of \textit{Affymetrix} probesets. This alternative CDF, with redefines gene-based annotation files for the \textit{Affymetrix} expression microarrays, can be found in \textit{GATExplorer} (a bioinformatic web platform that integrates a gene loci browser with nucleotide level re-mapping of the oligo \textit{probes} from \textit{Affymetrix} expression microarrays to genes: mRNAs and ncRNAs)\cite{gatexplorer}.\\ To translate these Ensembl gene IDs into Gene Symbols for easier reading, the optional argument \textit{geneLabels} from \textit{geNetClassifier} can be used. This option allows to extend the annotation and labelling of the genes by providing a table that contains the gene symbol and other characteristics of the genes in the \textit{expresssionSet}. This option can be used with any annotation (i.e. Bioconductor's \textit{org.Hs.eg.db} package) as long as it is provided in the correct format. However, for increased consistency between versions, when using \textit{GATExplorer} CDF, we recomend to also use \textit{GATExplorer} annotation files. Annotation files with the Gene Symbol corresponding to each Ensembl gene ID can be found at: \begin{small}\url{http://bioinfow.dep.usal.es/xgate/mapping/mapping.php?content=annotationfiles}\end{small}. The one used in this example is the \textit{Human Genes} R annotation file. A subset of this file was saved into the object \textit{geneSymbols} for easier use in the examples: <>= data(geneSymbols) head(geneSymbols) @ This annotation file provides further information which can be used to filter the genes. i.e. To consider only protein-coding genes for the construction of \textit{geNetClassifier}, use the following filter: <>= load("genes-human-annotation.R") leukEset_protCoding <- leukemiasEset[featureNames(leukemiasEset) %in% rownames(genes.human.Annotation[genes.human.Annotation$biotype %in% "protein_coding",]),] dim(leukemiasEset) dim(leukEset_protCoding) @ Please note that \textit{geNetClassifier} is designed to work with genes. In case the expression data is not summarized into genes (i.e. it uses the default \textit{probesets}) \textit{geNetClassifier} can still be used but those probesets/features will still be called \textit{genes}. \section{Main function of the package: \textit{geNetClassifier()} } \label{sec:geNetClassifier} \textit{geNetClassifier()} is the main function of the package. It builds the classifier and the gene network associated to each class, and also returns the genes ranking and further information about the selected genes.\\ The workflow internally followed by \textit{geNetClassifier()} includes the following steps:\\ \\ \textbf{1.-} Filtering data and calculating the genes ranking.\\ \textbf{2.-} Calculating correlations between genes.\\ \textbf{3.-} Calculating interactions between genes.\\ \textbf{Optional - } Filter of redundant genes from the ranking (see arguments \textit{removeCorrelations} and \textit{removeInteractions}).\\ \textbf{4.-} Construction of the classifier: Selects of a subset of genes to train the classifier through 8-fold cross-validation. The selected genes are used to train the classifier with the complete set of samples.\\ \textbf{5.-} Estimation of performance: calculates the \textit{generalization error} of the classifier and the statistics about the genes adding an 5-fold \textit{cross-validation} around the construction of the classifier (\textit{nested cross-validation}).\\ \textbf{6.-} Construction of the gene networks: a gene network is built for each one of the classes using the pairwise gene-to-gene correlations and interactions.\\ \textbf{7.-} Writing and saving the results including a series of plots for visualization.\\ The following sections show: how to load the package and the data (sec. \ref{sec:load}); how to run the algorithm (sec. \ref{sec:run}); an overview of the results and returned data (sec. \ref{sec:returns}): the genes ranking (sec. \ref{sec:ranking}), the classifier (sec. \ref{sec:classifier}) and the gene networks (sec. \ref{sec:networks}). \subsection{Loading the package and data} \label{sec:load} In order to have \textit{geNetClassifier} functions available, the first step is to load the package: <>= library(geNetClassifier) @ To list all available tutorials for this package, or to open this \textit{Vignette} you can use: <>= # List available vignettes for package geNetClassifier: vignette(package="geNetClassifier") # Open vignette named "geNetClassifier-vignette": vignette("geNetClassifier-vignette") @ To list all the available functions and objects included in \textit{geNetClassifier} use the function \textit{objects()}. Typing its name with a question mark (?) before any function, will show its help file. Through this tutorial, we will see how to use the main ones: <>= options(width=80) @ <<>>= objects("package:geNetClassifier") @ <>= ?geNetClassifier @ After the package is loaded, you can proceed to analyze your data. In this vignette we use \textit{leukemiasEset}: 60 microarrays from bone marrow from patients of the 4 major types of leukemia (ALL, AML, CLL, CML) and from healthy non-leukemia controls (NoL).(For installation and further information regarding \textit{leukemiasEset} data package see Section \ref{sec:installation}). <>= library(leukemiasEset) data(leukemiasEset) @ In \textit{leukemiasEset} there are 60 samples: 12 of each class (ALL, AML, CLL, CML and NoL). We will select 10 samples from each class to execute \textit{geNetClassifier()}, and leave 2 for external validation of the resulting classifier. In this way, it makes a total of 50 samples for the \textit{training} and 10 samples for the \textit{validation}. <<>>= trainSamples <- c(1:10, 13:22, 25:34, 37:46, 49:58) summary(leukemiasEset$LeukemiaType[trainSamples]) @ \subsection{Run \textit{geNetClassifier()} } \label{sec:run} The essential input elements that \textit{geNetClassifier} needs are: \begin{quote} 1.- An \textit{expressionSet}: R object defined in Bioconductor that contains a genome-wide expression matrix with data for multiple samples; see \textit{?ExpressionSet}. Note that since the ranking is built though package \emph{EBarrays}, the data in the expression set should be normalized intensity values (positive and on raw scale, not on a logarithmic scale). 2.- The \textit{sampleLabels}: a vector with the class name of each sample or the \textit{ExpressionSet phenoData} object containing this information. Note that to run \textit{geNetClassifier} it is highly recommended to have the \textbf{same number of samples in each class}. A balanced number of samples allows an even exploration of each class and provides better classification. \\ \end{quote} The algorithm input also includes many other arguments that allow to personalize the execution or modify some of the parameters internally used. All of them have a default value and there is no need to modify them. In the following step we will see examples on how to use the main ones. Information about them can be found using the help options (i.e. \textit{?geNetClassifier}). This is the full list of arguments with their default values: \begin{quote} \begin{small} \begin{verbatim} geNetClassifier(eset, sampleLabels, plotsName=NULL, buildClassifier=TRUE, estimateGError=FALSE, calculateNetwork=TRUE, labelsOrder=NULL, geneLabels=NULL, numGenesNetworkPlot=100, minGenesTrain=1, maxGenesTrain=100, continueZeroError=FALSE, numIters=6, lpThreshold=0.95, numDecimals=3, removeCorrelations=FALSE, correlationsThreshold=0.8, correlationMethod="pearson", removeInteractions=FALSE, interactionsThreshold=0.5, skipInteractions=FALSE, minProbAssignCoeff=1, minDiffAssignCoeff=0.8, IQRfilterPercentage=0, precalcGenesNetwork=NULL, precalcGenesRanking=NULL, returnAllGenesRanking=TRUE, verbose=TRUE) \end{verbatim} \end{small} \end{quote} The execution time will depend on the computer and the size of the dataset. To avoid waiting now for the construction of a new classifier to continue this tutorial, a pre-executed example is included in the package: <>= data(leukemiasClassifier) @ This classifier was built running the following code: <>= leukemiasClassifier <- geNetClassifier(leukEset_protCoding[,trainSamples], sampleLabels="LeukemiaType", plotsName="leukemiasClassifier", estimateGError=TRUE, geneLabels=geneSymbols) @ These are some examples of standard use: \begin{itemize} \item{The fastest execution would be training the classifier exploring a reduced number of genes (by default \textit{maxGenesTrain=100}). In order ot skip calculating the network within the genes, set \textit{calculateNetwork=FALSE}. However, since the correlations are relatively fast to calculate, we recommend keeping \textit{calculateNetwork=TRUE}, and set \textit{skipInteractions=TRUE} instead.} <>= leukemiasClassifier <- geNetClassifier(eset=leukemiasEset[,trainSamples], sampleLabels="LeukemiaType", plotsName="leukemiasClassifier", skipInteractions=TRUE, maxGenesTrain=20, geneLabels=geneSymbols) @ \item{The default execution (\textit{buildClassifier=TRUE}, \textit{calculateNetwork=TRUE}) only requires the \textit{expressionSet} and the \textit{sampleLabels}. Providing \textit{plotsName} is also recommended in order to produce the plots:} <>= leukemiasClassifier <- geNetClassifier(eset=leukemiasEset[,trainSamples], sampleLabels="LeukemiaType", plotsName="leukemiasClassifier") @ \item{In order to also estimate the classifier's performance, set \textit{estimateGError=TRUE}. This option will take longer to execute} <>= leukemiasClassifier <- geNetClassifier(eset=leukemiasEset[,trainSamples], sampleLabels="LeukemiaType", plotsName="leukemiasClassifier", estimateGError=TRUE) @ \end{itemize} Some of the parameters allow to provide extra information for an easier reading of the results: - \textit{labelsOrder} allows to show and plot the classes in a specific order (i.e. \textit{labelsOrder=c('ALL', 'CLL', 'AML', 'CML', 'NoL')}) - \textit{geneLabels} can be used to add a label to the genes to show in the outputs instead of the \textit{featureNames} from the \textit{ExpressionSet}. In the example, the genes were labeled with the gene symbols provided by \textit{GATExplorer} gene-based probe mapping (\textit{geneLabels=geneSymbols}), as it was indicated in section \ref{sec:load}.\\ After running \textit{geNetClassifier()}, we recommend to save the output: <>= getwd() save(leukemiasClassifier, file="leukemiasClassifier.RData") @ \subsection{Overview of the data returned by \textit{geNetClassifier()}} \label{sec:returns} The main results that \textit{leukemiasClassifier()} provides are: the \textbf{genes ranking} (sec. \ref{sec:ranking}), the \textbf{classifier} (sec.\ref{sec:classifier}) and the \textbf{gene networks} (sec. \ref{sec:networks}). All this information is returned by \textit{geNetClassifier()} in an object of class \textit{GeNetClassifierReturn}. This objec tontains several slots which can be seen with the function names(): <>= options(width=50) @ <<>>= names(leukemiasClassifier) @ <>= options(width=80) @ The slot \textbf{@call} contains the R sentence that was used to execute \textit{geNetClassifier()}. It is the only slot that will always be returned by \textit{geNetClassifier()}, the presence and contents of the other components returned by the algorithm will depend on the arguments used to run it. %<>= <<>>= leukemiasClassifier@call @ All the outputs and returned components are explained in detail in the following sections: \begin{itemize} \item{ @\textbf{genesRanking} in section \ref{sec:ranking}} \item{ @\textbf{classifier} and @\textbf{classificationGenes} in section \ref{sec:classifier}} \item{@\textbf{generalizationError} in section \ref{sec:GE}} \item{ @\textbf{genesNetwork} in section \ref{sec:networks}} \item{ The \textbf{plots} are explained in section \ref{sec:plots} } \end{itemize} A general view of the output can be seen by just typing the assigned name: <<>>= leukemiasClassifier @ \subsection{Return I: Genes ranking} \label{sec:ranking} The first step of \textit{geNetClassifier} algorithm is to determine a ranking of genes for each class based in the analysis of the expression signal. To create this ranking, it uses the function \textit{emfit}, a Parametric Empirical Bayes method \cite{morris}, included in package \textit{EBarrays} \cite{EBarrays}. This method implements an expectation-maximization (EM) algorithm for gene expression mixture models, which compares the patterns of differential expression across multiple conditions and provides a \textit{posterior probability}. \\ The posterior probability is calculated for each gene-class pair, and represents how much each gene differentiates a class from the other classes; being 1 the best value, and 0 the worst. In this way, the posterior probability allows to find the genes that show significant differential expression when comparing the samples of one class \textit{versus} all the other samples (One-versus-Rest comparison).\\ A first version of the ranking is built by ordering the genes decreasingly by their posterior probability for each class. To resolve the ties, \textit{geNetClassifier} uses the expression difference between the mean for each gene in the given class and the mean in the closest class. In addition, the genes with a posterior probability greater or equal to 0.95 for the 'no difference' -the genes that do not show any difference between classes- are filtered out before proceeding into further steps.\\ The final version of the ranking is built assigning each gene to the class in which it has the best ranking. In this way the separation between classes is optimized, and the method will choose first the genes that best differentiate any of the classes. As a result of this process, even if a gene is found associated to several classes during the expression analysis, \textbf{each gene can only be on the ranking of one class}.\\ Note that in case of an analysis performed for only \textbf{two classes} (i.e. a control vs case study, or a comparison between two diseases), the \emph{one vs rest} approach only provides one list of genes. This list of genes, the genes that differentiate the classes, is labeled \emph{BothClasses}. The sign of the expression value reflects in which class it is up or down. \textit{geNetClassifier} will use the first label as reference and print a message when executed: i.e. \emph{Expression difference calculated for AML (Reference/control: ALL)}. In this way, a positive value means the mean expression value is higher in the second class (AML). \begin{figure}[H] \centering \includegraphics[width=0.6\textwidth]{Fig3overlap} \caption{Scheme representing the overlap between the sets of genes that each disease may affect. \textit{geNetClassifier} explores all the genes that affect each disease (\textbf{ovals}) and selects as significant, the genes that are unique (differentially expressed) to each disease (\textbf{coloured circles}). } \label{fig:overlap} \end{figure} The genes ranking obtained for each class is used for the gene selection in the classification procedure and it is also provided as an output of \textit{geNetClassifier()} in the slot: ...@genesRanking. <<>>= leukemiasClassifier@genesRanking @ This ranking an object of class \textit{GenesRanking}. This class provides some utility functions which will help working with the information contained in the object. The total number of genes in the ranking for each class can be queried using the function \textit{numGenes()}. These numbers include all the genes that have some ability to distinguish between classes, although only the top ones are really significant. <<>>= numGenes(leukemiasClassifier@genesRanking) @ With \textit{getTopRanking()} a subset of the ranking containing only the given number of top genes can be obtained. Since the returned object is also a \textit{GenesRanking} object, no information is lost and other functions (i.e. \textit{genesDetails()}) can be used afterwards. <<>>= subRanking <- getTopRanking(leukemiasClassifier@genesRanking, 10) @ In order to retrieve the whole ranking in the form of a matrix (i.e. to print the full version or get a subset of it), the function \textit{getRanking()} can be used. This function provides the option to show the ranking with the gene IDs or the gene Labels. <<>>= getRanking(subRanking) @ <<>>= getRanking(subRanking, showGeneID=TRUE)$geneID[,1:4] @ \newpage The function \textit{genesDetails()} allows to show all the available info of the genes in the ranking. <<>>= genesDetails(subRanking)$AML @ NOTE: If the console splits the table into several lines, try: <>= options(width=200) @ By default, the \textit{rownames} are the ID included in the \textit{expressionSet}: in our case the ENSEMBL IDs. The \textit{GeneName} column has been added by setting the argument \textit{geneLabels=geneSymbols} (see sec. \ref{sec:run}).\\ To see the description of the content of this table write: \textit{?genesDetails}. \\ More details about \textit{GenesRanking} class is available at: \textit{?GenesRanking}. \newpage \subsubsection{Significant genes} \label{sec:significantGenes} The set of genes considered \textit{significant} for each of the classes is determined by a common threshold for the posterior probability (by default \textit{lpThreshold=0.95}). This common threshold provides a way to quantify the size of the \textit{gene signature} assigned to each disease (as always: compared to the other diseases in the study). In this way, the algorithm provides a framework to compare biological states, i.e. the biological or pathological conditions represented in the samples.\\ \textit{plotSignificantGenes()} provides a plot of the distribution of the posterior probabilities of the genes within the rankings for each class: \begin{figure}[H] \centering \includegraphics[width=0.6\textwidth]{Fig2significantGeneswithoutNoL} \caption{Plot of the posterior probabilities of the genes of 4 leukemia classes, ordering the genes according to their rank.} \label{fig:significantGeneswithoutNoL} \end{figure} This example shows the big differences in size of the gene sets assigned to a disease: at lpThreshold 0.95 CLL has been assigned 2028 genes, while AML only 308 genes. The biological interpretation of this observation will depend on the specific study. Larger gene signatures may be an indication of more \textit{systemic} diseases (i.e. a disease affect more genes than another), but it may also be an indication of the relative differences between the diseases in the study (i.e. one of the diseases affects different genes than the others). In any case, the results provided by \textit{geNetClassifier} may help to unravel disease sub-types differences based on the gene signatures.\\ \textit{numSignificantGenes()} provides the number of significant genes, the number of genes with posterior probability over the threshold: <<>>= numSignificantGenes(leukemiasClassifier@genesRanking) @ The plot of the posterior probability (\textit{plotSignificantGenes()}) is the default plot for objects of class \textit{GenesRanking}. (More details in section \ref{sec:plotSignificantGenes}). <>= plot(leukemiasClassifier@genesRanking) @ In both functions, the threshold can be modified through \textit{lpThreshold}: <>= plot(leukemiasClassifier@genesRanking, numGenesPlot=3000, lpThreshold=0.80) @ \subsection{Return II: Classifier} \label{sec:classifier} The information regarding the classifier is saved into the slots: @classifier, @classifcationGenes and @generalizationError.\\ The \textit{@\textbf{classifier}} slot contains the SVM classifier that can later be used to make queries. The SVM method included in the algorithm is a linear kernel implementation from R package \textit{e1071}. This implementation allows multi-class classification by using a One-versus-One (OvO) approach, in which all the binary classifications are fitted and the correct class is found based on a voting system. <>= leukemiasClassifier@classifier @ \textit{@\textbf{classificationGenes}} contains the final genes selected to build the classifier. Since \textit{@classificationGenes} is an object of class \textit{GenesRanking}, functions such as \textit{numGenes()} or \textit{genesDetails()} can be used to explore it. <<>>= leukemiasClassifier@classificationGenes @ <<>>= numGenes(leukemiasClassifier@classificationGenes) genesDetails(leukemiasClassifier@classificationGenes)$ALL @ Note that besides the common information about the genes provided by the genes ranking (sec. \ref{sec:ranking}), the classification genes also have information about the \textbf{discriminant power} of the genes (sec. \ref{sec:plotDiscriminantPower}).\\ For details on the \textit{gene selection procedure} (sec. \ref{sec:geneSelect}) and the \textit{estimation of performance and generalization error procedure} (slot \textit{@\textbf{generalization}}) (sec. \ref{sec:GE}), see the next two sections. %\newpage \subsubsection{Gene selection procedure} \label{sec:geneSelect} The optimum number of genes to train the classifier is selected by evaluating the classifiers trained with increasing number of genes. This is done using several iterations of 8-fold cross-validation. Each cross-validation iteration starts with the first ranked gene of each class: it trains an internal classifier with these genes, and evaluates its performance. One more gene is added in each step to those classes for which a 'perfect prediction' is not achieved (i.e. not all samples correctly identified). The genes are taken in order from the \textit{genes ranking} of each class until any of the classes reaches gets to the maximum number of genes (\textit{maxGenesTrain=100}) or until zero error is reached (\textit{continueZeroError=FALSE}). The error for each of the classifiers and the number of genes used to construct them are saved. Once the cross-validation loop is finished, it saves the minimum number of genes per class which produced the classifier with minimum error.\\ To achieve the best stability in the number of selected genes, the cross-validation is not run just once, but it is repeated several times with new samplings. This process is repeated as many times as indicated by the optional parameter \textit{numIters} (6 by default). In each of these iterations, the minor number of genes that provided the smallest error is selected. \begin{figure}[H] \centering \includegraphics[page=1, width=0.9\textwidth]{Fig4-5geneSelection} \caption{ Plot of the gene-selection iterations. Each line represents an iteration and the error rates observed for each number of genes (starting at 5, one per class). The algorithm runs until exploring a maximum number of genes in any class (\textit{maxGeneTrain=100}) or until zero error is reached (\textit{continueZeroError=FALSE}). In each iteration the minimum number of genes with minimum error is selected. } \label{fig:geneSelectionIterations} \end{figure} The final selection is done based on the genes selected in each of the iterations. For each class, the top ranked genes are selected by taking the highest number of genes --excluding outliers-- selected in the cross-validaton iterations. This allows to identify a stable number of genes, while accounting for the diffences in sampling. \\ \begin{figure}[H] \centering \includegraphics[page=2, width=0.9\textwidth]{Fig4-5geneSelection} \caption{ Plot of the number of genes selected in each iteration. The bars represent the number of genes with minimum error rates in each iteration. Each color represents an iteration. The filled bar is the final number of genes of each class selected to train the classifier.} \label{fig:numberGenesSelectedEachIter} \end{figure} Figures \ref*{fig:geneSelectionIterations} and \ref*{fig:numberGenesSelectedEachIter} show the gene selection for the leukemia's example. \\ \subsubsection{Estimation of performance and generalization error procedure} \label{sec:GE} The estimation of the \textit{generalization error} (GE) of the classification algorithm is an option that can be included using the parameter \textit{estimateGError=TRUE}. When this option is chosen, an independent validation is simulated by adding a second loop of cross-validation (CV) around the construction of the classifier. In each iteration of this loop, a few samples are left out of the \textit{training} and used as \textit{test} samples. This step allows to estimate and provide statistics and metrics regarding the quality of the classifier and the genes selected for classification. The parameters measured for the classifier are the following:\\ - \textbf{Sensitivity}: Proportion of samples from a given class which were correctly identified. In statistical terms it is the rate of true positives (TP). \textit{Sensitivity} relates to the ability of the test to identify positive results.\\ $ Sensitivity = \dfrac{TP}{TP + FN} = True Positive Rate $\\ - \textbf{Specificity}: Proportion of samples assigned to a given class which really belonged to the class. In statistical terms it is the rate of true negatives (TN). \textit{Specificity} relates to the ability of the test to identify negative results.\\ $ Specificity = \dfrac{TN}{TN + FP} = True Negative Rate $\\ Note: In order to truly evaluate the classification, both sensitivity and specificity need to be taken into account. For example, 100\% sensitivity for AML will be achieved by assigning all AML samples to AML. In the same way, 100\% specificity will be achieved by not assigning any sample from other class to AML. Therefore, the classification will only be reliable if both -sensitivity and specificity- are optimized, by identifying all samples from one class while not having samples from another classes miss-classified. \\ - \textbf{Matthews Correlation Coefficient} (MCC): It is a measure which takes into account both true and false positives and negatives. It is generally regarded as a balanced measure of performance. In machine learning it is used as a measure of the quality of binary classifications. \\ $ MCC = \dfrac{TP\times TN - FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $\\ \\ - \textbf{Global Accuracy}: Proportion of true results within the assigned samples.\\ - \textbf{Call Rate per class and Global Call Rate}: Proportion of \textit{assigned} samples within a class or in the whole prediction.\\ $ Call Rate = \dfrac{Assigned}{Assigned + Not Assigned} $\\ The results about the estimation of performance and the generalization error are saved in the slot: \textit{@generalizationError} <<>>= leukemiasClassifier@generalizationError @ To see all the available info gathered during estimation of performance use the \textit{overview()} function: <>= overview(leukemiasClassifier@generalizationError) @ This object contains all the information regarding estimation of performance in different slots: \textit{@accuracy}, \textit{@sensitivitySpecificity}, \textit{@confMatrix}, \textit{@probMatrix}, \textit{@querySummary}.\\ The slot \textit{...@confMatrix} contains the confusion matrix. A confusion matrix is a table used to quickly visualize and evaluate the performance of a classification algorithm. The rows represent the real class of the samples, while the columns represent the class to which the samples were assigned. Therefore, the correctly assigned samples are in the diagonal. <<>>= leukemiasClassifier@generalizationError@confMatrix @ The slot \textit{...@probMatrix} presents the probabilities of assignment to each class that are calculated during the 5-fold cross-validation. This \textit{probability matrix} provides a good estimation of how easy or difficult is to assign each sample to its class. It also provides an indication about the likelihood to confuse one class with others: <<>>= leukemiasClassifier@generalizationError@probMatrix @ The slot \textit{...@classificationGenes.stats} includes calculations about the number of times that each gene was selected for classification in the 5-fold cross-validation executions: \\%To query the classification statistics of the top genes for a given class use $class. - \textit{timesChosen}, number of times that each gene is chosen for classification in the 5 CV.\\ - \textit{chosenRankMean}, average rank of the gene only within the CV loops in which the gene was chosen for classification.\\ - \textit{chosenRankSD}, standard deviation of the gene rank only within the CV loops in which the gene was chosen for classification.\\ - \textit{geRankMean}, average rank of the gene in the 5 CV loops performed during the generalization error estimation.\\ - \textit{geRankSD}, standard deviation of the rank of the gene in the 5 CV loops performed during the generalization error estimation. <<>>= leukemiasClassifier@generalizationError@classificationGenes.stats$CLL @ %$ The slot \textit{...@classificationGenes.num} includes calculations about the number of genes selected for each class in the 5 runs of the 5-fold cross-validation applied for the estimation of performance. These numbers allow to explore the number of genes that are used per class. However, the proper calculation of the final \textit{number of genes} selected for each class in the classifier is done with the other 8-fold cross-validation which includes all the available samples (as indicated in section \ref{sec:geneSelect}). <<>>= leukemiasClassifier@generalizationError@classificationGenes.num @ %\newpage \subsection{Return III: Gene networks} \label{sec:networks} Together to the classifier and the genes ranking, the third major result that the algorithm \textit{geNetClassifier} produces are the gene networks associated to each class.\\ The gene networks for each class are built based on association parameters between genes. These association parameters are gene to gene co-expression calculated using a correlation coefficient (\textit{Pearson} by default) and gene to gene interactions derived from \textit{mutual information} (MI) analysis (\textit{mi.empirical} entropy estimator from the R package minet \cite{minet}); both calculated along all the samples of each class of the studied dataset.\\ The \textit{correlations} and \textit{interactions} also allow to find possible redundancy between the genes as features in the classification procedure. Such redundancy can be tested by producing comparative classifiers that include or not the associated genes. Usually, classifiers without redundant genes need less features for classification.\\ The \textit{...@genesNetwork} slot contains the list of networks. %options(width=70) <<>>= leukemiasClassifier@genesNetwork overview(leukemiasClassifier@genesNetwork$AML) @ %$ Each of the networks in this list is an object of the class \textit{GenesNetwork}. This class offers some functions to retrieve and count the edges and nodes, and also to subset the network (\textit{getSubNetwork()}). Note that \textit{getNodes()} includes all possible nodes even if they are no linked by edges. <<>>= getNumEdges(leukemiasClassifier@genesNetwork$AML) getNumNodes(leukemiasClassifier@genesNetwork$AML) @ <<>>= getEdges(leukemiasClassifier@genesNetwork$AML)[1:5,] getNodes(leukemiasClassifier@genesNetwork$AML)[1:12] @ The function \textit{network2txt()} allows to save or export the networks as text files. This function produces two text files: one with the information about the \textit{nodes} and another with the information about the \textit{edges}. They are flat text files (.txt). In the case of the \textit{edges} file, it includes the nodes that interact (gene1 -- gene2), the type of link (correlation or interaction) and the value of such relation. <>= network2txt(leukemiasClassifier@genesNetwork, filePrefix="leukemiasNetwork") @ To produce just the files with the information about the \textit{edges}: <>= geneNtwsInfo <- lapply(leukemiasClassifier@genesNetwork, function(x) write.table(getEdges(x), file=paste("leukemiaNtw_",getEdges(x)[1,"class1"],".txt",sep=""))) @ These flat text files allow to export the networks to external software (e.g. \textit{Cytoscape}, http://www.cytoscape.org). \\ The networks can also be exported using direct R connectors (e.g. RCytoscape) with the igraph objects returned by the function \textit{plotNetwork} (sec. \ref{sec:plotNetwork}).\\ For more information see the class help \textit{?GenesNetwork}. \newpage \section{External validation: query with new samples of known class} \label{sec:extVal} Once a classifier is built for a group of diseases or disease subtypes, it can be queried with new samples to know their class. However, before proceeding with samples whose class is unknown, an external validation is normally performed. An external validation consists on querying the classifier with several samples whose class is \textit{a priori} known, in order to see if the classification is done correctly. As indicated in section \ref{sec:GE}, if the number of known samples is limited (as it is usually the case) to avoid leaving a sub-set of known samples out of the training, \textit{geNetClassifier()} provides the \textit{generalization error} option, which will simulate an external validation by using cross-validation. Despite this possibility, it is clear that using external samples (totally independent to the classifier built) is the best option to validate its performance. \\ In this section, we will proceed with an example of external validation with the leukemia's classifier. In \textit{leukemiasEset}, the class of all the available samples is known \textit{a priori}. Since we had 60 samples in the initial leukemia dataset and only 50 were used to train the classifier, the 10 remaining can be used for external validation. \\ The first step is to select the 10 samples that were not used for training: <<>>= testSamples <- c(1:60)[-trainSamples] testSamples @ The classifier is then be asked about the class of these 10 samples using \textit{queryGeNetClassifier()}: <<>>= queryResult <- queryGeNetClassifier(leukemiasClassifier, leukemiasEset[,testSamples]) @ This query will return the class that each sample has been assigned to, which will be saved into \$class. It also returns the probabilities of assignment of each sample to each class in \$probabilities. <<>>= queryResult$class queryResult$probabilities @ Since the real class of the samples is known, we can create a confusion matrix. Note: For using this matrix as input in upcoming functions the real classes should be placed as row names (\textit{rownames}) and the predicted classes (assigned by the classifier) as column names (\textit{colnames}). <<>>= confusionMatrix <- table(leukemiasEset[,testSamples]$LeukemiaType, queryResult$class) @ Once we have executed the query, \textit{externalValidation.stats()} can be used to calculate the parameters to evaluate the classifier (Section \ref{sec:GE}). <<>>= externalValidation.stats(confusionMatrix) @ The class to class assignment probability matrix, that gives support to the confusion matrix, can be also created for the external validation analysis: <<>>= externalValidation.probMatrix(queryResult, leukemiasEset[,testSamples]$LeukemiaType, numDecimals=3) @ \newpage \subsection{Assignment conditions} \label{sec:assignment} \textit{queryGeNetClassifier()} includes an expert-like approach to decide if a sample is assigned to a class: instead of directly assigning a sample to the class with the highest probability, it takes into account the probability of belonging to the class and the probability of the closest class before taking the final decision.\\ By default, the probability to assign a sample to a given class should be at least double than the \textit{random probability}, and the difference with the next likely class should also be higher than 0.8 times the \textit{random probability}. For example, to assign a sample in a 5 class classifier, the highest probability should be at least 40\% (2 x 0.20 = 0.40) and the probability of belonging to the closest class should be at least 16\% lower than the highest (0.8 x 0.20 = 0.16). This implies that if a sample's probability to belong to one class is 55\% and to belong to another class is 40\%, since the the difference is lower than 16\%, it is not clear enough, and it will be left as a \textit{NotAssigned} (NA). This feature allows modulation of the assignment to resembles expert decision-making. \\ To allow adapting these conditions, \textit{queryGeNetClassifier()} includes two coefficients that determine the \textit{minimum probability for assignment} (minProbAssignCoeff), and the \textit{minimum difference between the of the first and the second classes} (minDiffAssignCoeff). If these two coefficients are set up to 0 all samples will be assigned to the most likely class and therefore no samples will be left as \textit{NotAssigned}. <<>>= queryResult_AssignAll <- queryGeNetClassifier(leukemiasClassifier, leukemiasEset[,testSamples], minProbAssignCoeff=0, minDiffAssignCoeff=0) which(queryResult_AssignAll$class=="NotAssigned") @ On the contrary, the thresholds can be raised to increase the the certainty of the assignments: i.e. by setting the coefficients to 1.5 and 1, the minimum probability to be assigned is 0.6 (\textbf{1.5} x 2 x 0.20) and the minimum difference between first and second class probabilities is 0.2 (\textbf{1} x 0.20). <<>>= queryResult_AssignLess <- queryGeNetClassifier(leukemiasClassifier, leukemiasEset[,testSamples], minProbAssignCoeff=1.5, minDiffAssignCoeff=1) queryResult_AssignLess$class @ In this case, these samples were left as \textit{NotAssigned}: <<>>= t(queryResult_AssignLess$probabilities[, queryResult_AssignLess$class=="NotAssigned", drop=FALSE]) @ To help understanding how these thresholds behave for a specific dataset, if \textit{geNetClassifier()} is executed with \textit{estimateGError=TRUE}, it generates a plot presenting the assignment probabilities for each sample. This plot shows the probability of the most likely class \textit{versus} the probability difference with next likely class for each sample. Therefore, it allows to view the effects of the 2 coefficients (\textit{minProbAssignCoeff} and \textit{minDiffAssignCoeff}) in the assignment. \\ \begin{figure}[H] \centering \includegraphics{Fig6leukemiasClassifierAssingorNotAssing} \caption{ Assignment probabilities plot: It shows for each sample the probability of its most likely class \textit{versus} the difference in probability with the next likely class. \textbf{Green} dots indicate that the probability of the most likely class is the correct class. \textbf{Red} dots indicate that the probability of the most likely class is not the correct class and, if assigned, such sample would have been missclassified. Dotted lines represent the chosen thresholds. The green area between them shows the samples that are actually assigned, those out of the green area are left as \textit{NotAssigned}. } \label{fig:leukemiasClassifierAssingorNotAssing} \end{figure} The plot in Figure \ref*{fig:leukemiasClassifierAssingorNotAssing} was obtained through the execution of \textit{geNetClassifier()} with the leukemia's dataset. It shows that there are several samples under the assignment thresholds: these samples are left as \textit{NotAssigned}. Out of these not assigned samples, the highest probability of some of them was to the real class (green), but some others was to an incorrect class (red). If the classifier had assigned the samples in red, it would have been an incorrect assignment. \section{Sample classification: query with new samples of unknown class} \label{sec:prediction} Once a classifier is built for a group of diseases or biological states, we can take external samples from new patients or new studies to query the classifier and know their class type.\\ Since we had 60 samples in the initial leukemia dataset and only 50 were used in the classifier, the 10 not used for training can be used as new samples to query the classifier and find out their class. In this case we will consider that the class of these samples is unknown. <<>>= testSamples <- c(1:60)[-trainSamples] @ \textit{queryGeNetClassifier()} can then be used to ask the classifier about the class of the new samples. <<>>= queryResult_AsUnkown <- queryGeNetClassifier(leukemiasClassifier, leukemiasEset[,testSamples]) @ In the field \textit{\$class} of the return, we can see the class that each sample has been assigned to. <<>>= names(queryResult_AsUnkown) @ <<>>= queryResult_AsUnkown$class @ If there were samples that had not been assigned to any class, they would be marked as\textit{NotAssigned}. In the field \textit{\$probabilities}, we could see the probability of each sample to belong to each class. All these steps are very similar to the ones describes in section \ref{sec:assignment}. <<>>= t(queryResult_AsUnkown$probabilities[ , queryResult$class=="NotAssigned"]) @ The function \textit{querySummary()} provides a summary of the results by counting the number of samples that were assigned to each class and with which probabilities. It is a good way to have an overview of the classification results. In this case, the 100\% \textit{call rate} indicates that all samples have been assigned. <<>>= querySummary(queryResult_AsUnkown, numDecimals=3) @ \newpage \section{Functions to plot the results} \label{sec:plots} \subsection{Plot Ranked Significant Genes: \textit{plot(...@genesRaking)}} \label{sec:plotSignificantGenes} As indicated in section \ref{sec:significantGenes}, the default plot of a \textit{genesRanking} can be obtained through the plot() function. This plot represents the gene rank obtained for each class \textit{versus} the posterior probability of the genes.\\ <>= plot(leukemiasClassifier@genesRanking) @ Some of the parameters to personalize this plot are: \begin{itemize} \item{\textit{lpThreshold} to set the value of the posterior probability threshold (marked as an horizontal line in the plot)} \item{\textit{numGenesPlot} to determine the maximum number of genes that will be plot} \end{itemize} <>= plot(leukemiasClassifier@genesRanking, numGenesPlot=3000, plotTitle="5 classes: ALL, AML, CLL, CML, NoL", lpThreshold=0.80) @ \begin{figure}[H] \centering \includegraphics[width=0.6\textwidth]{Fig7significantGenes} \caption{ Plot of the posterior probabilities of the genes of 4 leukemia classes and the non-leukemia controls, ordering the genes according to their rank and setting the \textit{lpThreshold} at 0.80.} \label{fig:plotSignificantGenes} \end{figure} \textit{calculateGenesRanking()} allows to calculate (and plot) the ranking for a given data set without building the classifier: <>= ranking <- calculateGenesRanking(leukemiasEset[,trainSamples], "LeukemiaType") @ \subsection{Plot Gene Expression Profiles: \textit{plotExpressionProfiles()}} \label{sec:plotExpressionProfiles} The function \textit{plotExpressionProfiles()} generates an overview of the expression profile of each gene along all the samples contained in the studied dataset. The plot will be saved as a PDF if \textit{fileName} is indicated. The parameter \textit{geneLabels} can be used to show a different name to the one included in the expression matrix (i.e. gene symbol instead of ENSEMBL ID or \textit{Affymetrix} ID).\\ To plot the expression of 4 specific genes across the samples included in the leukemia's set: <>= data(geneSymbols) @ <>= data(geneSymbols) topGenes <- getRanking( getTopRanking(leukemiasClassifier@classificationGenes,numGenesClass=1), showGeneID=TRUE)$geneID plotExpressionProfiles(leukemiasEset, topGenes[,c("ALL","AML"), drop=FALSE], sampleLabels="LeukemiaType", geneLabels=geneSymbols) @ \begin{figure}[H] \centering \includegraphics[width=0.90\textwidth]{Fig8expressionProfilePlot_2g} \caption{ Plot of the expression profiles across 60 samples of 2 genes.} \label{fig:expressionProfilePlot} \end{figure} If a \textit{geNetClassifierReturn} object is provided instead of a list of genes, it will plot the expression of all the genes used for training the classifier: <>= plotExpressionProfiles(leukemiasEset[,trainSamples], leukemiasClassifier, sampleLabels="LeukemiaType", fileName="leukExprs_trainSamples.pdf") @ To plot the expression of all the genes chosen for classification for a specific class, for example AML: <>= classGenes <- getRanking(leukemiasClassifier@classificationGenes, showGeneID=TRUE)$geneID[,"AML"] plotExpressionProfiles(leukemiasEset, genes=classGenes, sampleLabels="LeukemiaType", geneLabels=geneSymbols, fileName="AML_genes.pdf") @ \newpage These plots can be modified in several ways, for example coloring specific samples or classes, or plotting the expression as boxplot \begin{itemize} \item{Coloring specific samples or classes:} \end{itemize} <>= plotExpressionProfiles(leukemiasEset, genes=topGenes[,3, drop=FALSE], sampleLabels="LeukemiaType", showMean=TRUE, identify=FALSE, sampleColors=c("grey","red") [(sampleNames(leukemiasEset)%in% c("GSM331386.CEL","GSM331392.CEL"))+1]) @ <>= plotExpressionProfiles(leukemiasEset, genes=topGenes[,3, drop=FALSE], sampleLabels="LeukemiaType", showMean=TRUE, identify=FALSE, classColors=c("red","red", "blue","red","red")) @ \begin{itemize} \item{Plotting the expression as boxplot (grouped by classes):} \end{itemize} <>= plotExpressionProfiles(leukemiasEset, genes=topGenes[,3, drop=FALSE], sampleLabels="LeukemiaType", type="boxplot", geneLabels=geneSymbols, sameScale=FALSE) @ \begin{figure}[H] \centering \includegraphics[width=0.90\textwidth]{Fig9expressionProfilePlot_B} \caption{Two different versions of expression plot.} \label{fig:expressionProfilePlot_B} \end{figure} See \emph{?plotExpressionProfiles} for more details. \newpage \subsection{Plot Genes Discriminant Power: \textit{plotDiscriminantPower()}} \label{sec:plotDiscriminantPower} The \textit{discriminant power} is a parameter derived from the classifier's \textit{support vectors} which resembles the power of each gene to mark the difference between classes.\\ The multi-class SVM algorithm (One-versus-One, OvO) produces a set of \textit{support vectors} for each binary comparison between classes. Such \textit{support vectors} include the \textit{Lagrange coefficients} (alpha) for all the genes selected for the classification. Therefore, we can assign to each gene the sum of the \textit{Lagrange coefficients} of all the \textit{support vectors} of each class (represented as piled up bars in the plot). The \textit{discriminant power} is then calculated as the difference between the value of the largest class and the closest (the distance marked by two red lines in the plot). In conclusion, the \textit{discriminant power} is a parameter that allows the characterization of the genes based in their capacity to separate different classes (i.e. different diseases or diseases subtypes compared).\\ The \textit{discriminant power} is calculated for each gene included in the classifier (the \textit{@classificationGenes}) when it is built \textit{geNetClassifier()}). The \textit{plotDiscriminantPower()} function is included in the package to generate a graphic representation of the \textit{discriminant power}. <>= plotDiscriminantPower(leukemiasClassifier, classificationGenes="ENSG00000169575") @ \begin{figure}[H] \centering <>= plotDiscriminantPower(leukemiasClassifier, classificationGenes="ENSG00000169575") @ \caption{Plot of the discriminant power of gene VPREB1 (ENSG00000169575). The plot shows that this gene identifies class ALL and the closest class is NoL.} \label{fig:discPowerPlot} \end{figure} The next example shows the discriminant power of the top genes of a class. In order to plot more than 20 genes, or to save the plots as PDF, provide a \textit{fileName}. <>= discPowerTable <- plotDiscriminantPower(leukemiasClassifier, classificationGenes=getRanking(leukemiasClassifier@classificationGenes, showGeneID=TRUE)$geneID[1:4,"AML",drop=FALSE], returnTable=TRUE) @ \begin{figure}[H] \centering <>= discPowerTable <- plotDiscriminantPower(leukemiasClassifier, classificationGenes=getRanking(leukemiasClassifier@classificationGenes, showGeneID=TRUE)$geneID[1:4,"AML",drop=FALSE], returnTable=TRUE) @ %$ \caption{Plot of the discriminant power of the 4 genes that best discriminate AML class from the other classes. The figures indicate that MEIS1 (ENSG00000143995) presents the highest discriminant power. This gene encodes a homeobox protein that has been involved in myeloid leukemia. A high discriminant power can help to identify gene markers.} \label{fig:discPowerPlot2} \end{figure} Some of the options to personalize the plot are \textit{classNames} to provide a different name for the classes and textit{geneLabels} to provide a alias for the genes. As usual, more details about the function are available at \textit{?plotDiscriminantPower}. \newpage \subsection{Plot Gene Networks: \textit{plotNetwork()}} \label{sec:plotNetwork} The package also includes some functions to manipulate the networks produced by \textit{geNetClassifier()} (i.e. select part of a network and personalize the plots).\\ \textbf{Step 1}: Select a network or sub-network. \textit{getSubNetwork()} allows to select sub-networks. i.e. the sub-network containing only the classification genes: <<>>= clGenesSubNet <- getSubNetwork(leukemiasClassifier@genesNetwork, leukemiasClassifier@classificationGenes) @ \textbf{Step 2}: Get the info of the genes to plot. \textit{genesDetails()} provides the available information about the genes. This information can be shown in the network: The gene name will be the node label. The expression of the gene will be shown with the node color, and the discriminant power will determine its size. In case the network includes genes selected for classification and genes which were not selected, the genes selected for classification will be plot as squares and the not selected as circles (only available for PDF plot, not on the dynamic view). For more details see the network legend in figure \ref*{fig:plotNtwAML100g}. <>= clGenesInfo <- genesDetails(leukemiasClassifier@classificationGenes) @ \textbf{Step 3}: Plot the network. The network plots can be produced either using R interactive view (\textit{tkplot} from \textit{igraph}) or plotted as saved PDF files. Use \textit{plotType="pdf"} to save the network as a static PDF file. This option is recommended to produce an overview of several networks. To produce interactive networks skip this argument. Iteractive plotscan be exported as a \textit{postscript} files (.eps).\\ Some plot examples: Network of ALL classification genes: <>= plotNetwork(genesNetwork=clGenesSubNet$ALL, genesInfo=clGenesInfo) @ \begin{figure}[H] \centering \includegraphics[width=0.3\textwidth]{Fig11plotNtwALL9g} \caption{ Gene network obtained for class ALL including the 9 classification genes selected for this disease.} \label{fig:plotNtwALL9g} \end{figure} \newpage Only connected nodes from ALL classification genes network: <>= plotNetwork(genesNetwork=clGenesSubNet$ALL, genesInfo=clGenesInfo, plotAllNodesNetwork=FALSE, plotOnlyConnectedNodesNetwork=TRUE) @ AML network of the top 30 genes from the ranking (calculated as co-expression and mutual information): <>= top30g <- getRanking(leukemiasClassifier@genesRanking, showGeneID=TRUE)$geneID[1:30,] top30gSubNet <- getSubNetwork(leukemiasClassifier@genesNetwork, top30g) top30gInfo <- lapply(genesDetails(leukemiasClassifier@genesRanking), function(x) x[1:30,]) plotNetwork(genesNetwork=top30gSubNet$AML, genesInfo=top30gInfo$AML) @ \begin{figure}[H] \centering \includegraphics[width=0.8\textwidth]{Fig12plotNtwAML30g} \caption{ Gene network obtained for class AML including the top 30 genes selected from the gene ranking of this disease.} \label{fig:plotNtwAML30g} \end{figure} Network of the top 100 genes from AML ranking. A preview of this network is automatically plotted for every class by \textit{geNetClassifier()} if \textit{plotsName} is provided. <>= top100gRanking <- getTopRanking(leukemiasClassifier@genesRanking, numGenes=100) top100gSubNet <- getSubNetwork(leukemiasClassifier@genesNetwork, getRanking(top100gRanking, showGeneID=TRUE)$geneID) plotNetwork(genesNetwork=top100gSubNet, classificationGenes=leukemiasClassifier@classificationGenes, genesRanking=top100gRanking, plotAllNodesNetwork=TRUE, plotOnlyConnectedNodesNetwork=TRUE, labelSize=0.4, plotType="pdf", fileName="leukemiasNetwork") @ \begin{figure}[H] \centering \includegraphics[page=1, width=0.8\textwidth]{Fig13plotNtwAML100g} \includegraphics[page=2, width=0.7\textwidth]{Fig13plotNtwAML100g} \caption{ Gene network obtained for class AML selecting the 100 top genes from the gene ranking of this disease, but presenting only the connected nodes. The figure also includes the network legend indicating the meaning of the shapes and colors given to the nodes and edges.} \label{fig:plotNtwAML100g} \end{figure} \newpage \section*{Acknowledgements} This work was supported by Instituto de Salud Carlos III and by a grant from the Junta de Castilla y Leon and the European Social Fund to S.A and C.D. \begin{thebibliography}{8} % 100 is a random guess of the total number of references \bibitem{mile} Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie GT, Bene MC, De Vos J, Hernandez JM, Hofmann WK, Mills KI, Gilkes A, Chiaretti S, Shurtleff SA, Kipps TJ, Rassenti LZ, Yeoh AE, Papenhausen PR, Liu WM, Williams PM, Foa R (2010). \emph{Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the International Microarray Innovations in Leukemia Study Group}. J Clin Oncol. 28: 2529-2537. \bibitem{barrier} Barrier A, Lemoine A, Boelle PY, Tse C, Brault D, Chiappini F, Breittschneider J, Lacaine F, Houry S, Huguier M, Van der Laan MJ, Speed T, Debuire B, Flahault A, Dudoit S (2005) \emph{Colon cancer prognosis prediction by gene expression profiling}. Oncogene. 24: 6155-6164. \bibitem{meyer} Meyer D, Leischa F, Hornik K (2005). \emph{The supportvector machine under test}. Neurocomputing. 55: 169-186 \bibitem{minet} Meyer PE, Lafitte F, Bontempi G (2008). \emph{minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information}. BMC Bioinformatics. 9: 461. \bibitem{gatexplorer} Risueno A, Fontanillo C, Dinger ME, De Las Rivas J (2010). \emph{GATExplorer: genomic and transcriptomic explorer; mapping expression probes to gene loci, transcripts, exons and ncRNAs}. BMC Bioinformatics. 11: 221. \bibitem{morris} Morris C (1983). \emph{Parametric empirical Bayes inference: theory and applications}. JASA. 78: 47-65. \bibitem{EBarrays} Kendziorski CM, Newton MA, Lan H, Gould MN (2003). \emph{On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles}. Statistics in Medicine 22: 3899-3914. \bibitem{fdr} Benjamini Y, Hochberg Y (1995). \emph{Controlling the false discovery rate: A practical and powerful approach to multiple testing}. J. Roy. Statist. Soc. Ser. B. 57: 289-300. \end{thebibliography} \end{document}