%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{MLInterfaces Primer}
%\VignetteDepends{golubEsets}
%\VignetteKeywords{Genomics}
%\VignettePackage{MLInterfaces}

\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{\Rpackage{MLInterfaces}: towards uniform behavior of machine
learning tools in R}
\author{VJ Carey, J Mar, R Gentleman}
\maketitle

\section{Introduction}

We define machine learning methods as data-based algorithms for
prediction.  Given data $D$, a generic machine learning procedure
$MLP$ produces a function $ML = MLP(D)$.  For data $D'$ with
structure comparable to that of $D$, $ML(D')$ is a set of
predictions about elements of $D'$.

To be slightly more precise, a dataset $D$ is a set of records, each
with the same structure, consisting of a set of features (predictors)
and one or more predictands (classes or responses of interest, to be
predicted).  $MLP$ uses features, predictands, and tuning parameter
settings to construct the function $ML$, which is to be a function
from features only to predictands.
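In R pseudocode, this abstraction can be sketched as follows (an
illustrative sketch only; \texttt{MLP}, \texttt{D}, and
\texttt{Dprime} are placeholder names, not objects supplied by the
package):
<<eval=FALSE>>=
## schematic sketch only, not package code: a machine learning
## procedure consumes a dataset and returns a prediction function
MLP <- function(D) {
    ## ... use the features, predictands, and tuning parameter
    ## settings recorded in D to fit a model ...
    ML <- function(Dprime) {
        ## ... return predictions for the records of Dprime,
        ## using features only ...
    }
    ML
}
@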
There are many packages and functions in R that provide machine
learning procedures.  They conform to the abstract setup described
above, but with great diversity in the details of implementation and
use, and the input requirements and output objects differ from
procedure to procedure.

Our objective in \Rpackage{MLInterfaces} is to simplify the use and
evaluation of machine learning methods by providing specifications
and implementations for a uniform interface.  (The \texttt{tune}
procedures in \Rpackage{e1071} also pursue a more uniform interface
to machine learning procedures.)  At present we want to simplify the
use of machine learning with microarray data, assumed to take the
form of \texttt{ExpressionSet} instances.  The present implementation
addresses the following concerns:
\begin{itemize}
\item simplify the selection of the predictand from the
\texttt{ExpressionSet} structure;
\item simplify (in fact, require) the decomposition of the input
data into training and test sets, with output emphasizing test set
results;
\item provide a uniform output structure.
\end{itemize}
The output structures currently supported are subclasses of a
general class \Rclass{MLOutput}, described in Section~\ref{mlosec}
below.

%Several other concerns will be addressed as the
%project matures.  Among these are:
%\begin{itemize}
%\item generic interfaces for cross-validation (exploiting
%native resources when available)
%\item simplified specification of criteria for outlier
%and doubt predictions
%\item appropriate visualizations.
%\end{itemize}

To give a flavor of the current implementation, we perform a few
runs with different machine learning tools.  We will use 60 genes
drawn arbitrarily from Golub's data.
<<>>=
library(MLInterfaces)
library(golubEsets)
@
<<>>=
data(Golub_Merge)
smallG <- Golub_Merge[200:259,]
smallG
@
Here is how $k$-nearest neighbors is used to get predictions of ALL
status, using the first 40 records as the training set:
<<>>=
#krun <- knnB( smallG, "ALL.AML", trainInd=1:40 )
krun <- MLearn(ALL.AML~., smallG, knnI(k=1), 1:40)
krun
@
The \Rfunction{confuMat} method computes the confusion matrix that
results from applying the trained model to the reserved test data:
<<>>=
confuMat(krun)
@
Additional parameters can be supplied as accepted by the target
procedure in package \Rpackage{class}.  To use a neural net in the
same context (with fewer genes, to simplify the summary below):
<<>>=
set.seed(1234)
#nns <- nnetB( smallG[1:10,], "ALL.AML", trainInd=1:40, size=2, decay=.01, maxit=250 )
nns <- MLearn(ALL.AML~., smallG[1:10,], nnetI, trainInd=1:40,
    size=2, decay=.01, maxit=250)
nns
confuMat(nns)
@

\section{Usage}

The basic call sequence for supervised learning with
\texttt{ExpressionSet}s is
\begin{verbatim}
MLearn(formula, data, learnerSchema, trainInd, ...)
\end{verbatim}
The parameter \texttt{formula} is a standard R formula, with
\verb|y~x+z| indicating that \verb|x| and \verb|z| are predictors of
the response \verb|y|.  If \texttt{data} is a \texttt{data.frame}
instance, the formula has the usual interpretation for R.  If
\texttt{data} is an \texttt{ExpressionSet} instance, it is assumed
that the dependent variable is present in the \texttt{pData}
component of \texttt{phenoData}, and that the variables on the
right-hand side are found in the \texttt{exprs} component of
\texttt{assayData}.  If \verb|.| is used on the right-hand side, all
features in the \texttt{exprs} component are used as predictors.

The \texttt{learnerSchema} parameter is bound by instances of the
\Rclass{learnerSchema} class.  Many examples are provided with
\Rpackage{MLInterfaces}; see \texttt{help(MLearn)} for a complete
list.

The parameter \texttt{trainInd} is a numeric sequence isolating the
samples to be used for training; it may also be bound by an instance
of \Rclass{xvalSpec} to define a cross-validation of a learning
process (see Section~\ref{xval}).

%For unsupervised learning, there are a number of methods of the form
%\begin{verbatim}
%methB(eset, k, height, ...)
%\end{verbatim}
%The idea here is that one may specify a number {\tt k}
%of clusters, or a height of a clustering tree that will
%be cut to form clusters from the {\tt eset} samples.
%Note that there is no training/test dichotomy for clustering
%at this stage.

The \Rmethod{RObject} method accesses the fitted object returned by
the basic procedure.  Thus, returning to the \texttt{nnetI}
invocation above, we have
<<>>=
summary(RObject(nns))
@
which is the customary \Rfunction{nnet} summary.  This also gives
access to visualization:
<<>>=
ags <- agnesB(smallG, k=4, height=0, stand=FALSE)
@
<<fig=TRUE>>=
plot(RObject(ags), which.plot=2)
@

\section{Classes}
\label{mlosec}

Two major classes are defined as input to \texttt{MLearn}, to
specify the procedure to be used: \Rclass{learnerSchema} and
\Rclass{xvalSpec}.
<<>>=
getClass("learnerSchema")
getClass("xvalSpec")
@
For output, we have only the \Rclass{classifierOutput} class:
<<>>=
getClass("classifierOutput")
@
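The accessor methods defined for \Rclass{classifierOutput} extract
components of the test-set results.  As a small illustration (reusing
the neural net fit \texttt{nns} obtained above; these accessors are
also used in later sections):
<<>>=
## test-set class assignments and predicted class scores for nns
testPredictions(nns)
head(testScores(nns))
@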
Some legacy classes for output structures exist, but these are too
complex and will be abandoned after a deprecation cycle.  The
\Rclass{clustOutput} class will be replaced by a
\Rclass{clusteringOutput} class:
<<>>=
getClass("clustOutput")
@

\section{Cross-validation}
\label{xval}

Instances of the \Rclass{xvalSpec} class are bound to the
\texttt{trainInd} parameter of \texttt{MLearn} to perform
cross-validation, and the constructor \texttt{xvalSpec} can be used
in line.  It has parameters \texttt{type} (relevant only to select
\texttt{"LOO"}, for leave-one-out), \texttt{niter} (the number of
partitions to use), \texttt{partitionFunc} (a function that returns
the indices of the members of partitions), and \texttt{fsFunc} (a
function that performs feature selection and returns a formula with
the selected features on the right-hand side).  A
\texttt{partitionFunc} must take parameters \texttt{data},
\texttt{clab}, and \texttt{iternum}: \texttt{data} is the usual data
frame to be supplied to the learner function; \texttt{clab} must be
the name of a column in \texttt{data}, and values of the variable in
that column are balanced across cross-validation partitions;
\texttt{iternum} selects the partition elements as we iterate
through the cross-validation.
\begin{itemize}
\item Straight leave-one-out (LOO) -- note that the group parameter
must be an integer; it is irrelevant for the LOO method.
<<>>=
library(golubEsets)
data(Golub_Merge)
smallG <- Golub_Merge[200:250,]
lk1 <- MLearn(ALL.AML~., smallG, knnI(k=1,l=0), xvalSpec("LOO"))
confuMat(lk1)
@
\item Now do a random 8-fold cross-validation.
<<>>=
ranpart <- function(K, data) {
    N <- nrow(data)
    cu <- as.numeric(cut(1:N, K))
    sample(cu, size=N, replace=FALSE)
}
ranPartition <- function(K) function(data, clab, iternum) {
    p <- ranpart(K, data)
    which(p != iternum)  # retain the training fraction
}
lkran <- MLearn(ALL.AML~., smallG, knnI(k=1,l=0),
    xvalSpec("LOG", 8, partitionFunc=ranPartition(8)))
confuMat(lkran)
@
\item Now do an 8-fold cross-validation with approximate balance
among the groups with respect to the frequencies of ALL and AML.
The utility function \texttt{balKfold.xvspec} helps with this.
<<>>=
lk3 <- MLearn(ALL.AML~., smallG, knnI(k=1,l=0),
    xvalSpec("LOG", 8, partitionFunc=balKfold.xvspec(8)))
confuMat(lk3)
@
\end{itemize}

\section{Cross-validation with feature selection}

Stephen Henderson of University College London supplied
infrastructure to allow the embedding of feature selection in the
cross-validation process.  This has been wrapped in \texttt{fs.*}
closures that can be passed in \texttt{xvalSpec}:
%See the manual page on xval for more details.
<<>>=
data(iris)
iris2 <- iris[ iris$Species %in% levels(iris$Species)[1:2], ]
iris2$Species <- factor(iris2$Species) # drop unused levels
x1 <- MLearn(Species~., iris2, ldaI, xvalSpec("LOG", 3,
    balKfold.xvspec(3), fs.absT(3)))
fsHistory(x1)
@

\section{A sketch of a `doubt' computation}

The \Rfunction{nnet} function returns a structure encoding the
predicted probabilities of class occupancy.  We will use this to
enrich the \Rfunction{MLearn}/\texttt{nnetI} output to include a
``doubt'' outcome.  As written, this code handles a two-class
outcome; additional structure emerges with more than two classes,
and some changes will be needed for such cases.

First we obtain the predicted probabilities (for the test set) and
round them for display purposes.
<<>>=
predProb <- round(testScores(nns), 3)
@
We save the true labels and the predicted labels.
<<>>=
truth <- as.character(smallG$ALL.AML[-c(1:40)])
simpPred <- as.character(testPredictions(nns))
@
We create a closure that allows boundaries on the class
probabilities to be specified for the assertion of ``doubt'':
<<>>=
douClo <- function(pprob) function(lo,hi) pprob > lo & pprob < hi
smallDou <- douClo(predProb)
@
Now replace the labels for those predictions that are very close
to 0.5.
<<>>=
douPred <- simpPred
douPred[smallDou(.35,.65)] <- "doubt"
@
The resulting modified predictions are in the fourth column:
<<>>=
mm <- cbind(predProb,truth,simpPred,douPred)
mm
table(mm[,"truth"], mm[,"simpPred"])
table(mm[,"truth"], mm[,"douPred"])
@

\end{document}