% NOTE -- ONLY EDIT .Rnw!!!
% .tex file will get overwritten.
%
%\VignetteIndexEntry{MLInterfaces devel for schema-based MLearn}
%\VignetteDepends{MASS}
%\VignetteKeywords{Genomics}
%\VignettePackage{MLInterfaces}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.


%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{MLInterfaces 2.0 -- a new design}
\author{VJ Carey}
\maketitle

\section{Introduction}

MLearn, the workhorse method of MLInterfaces, has been streamlined
to support simpler development.

In 1.*, MLearn included a substantial switch statement, and the
external learning function was identified by a string.  Many
massage tasks were wrapped up in switch case elements devoted to
each method.  MLearn returned instances of MLOutput, but these
had complicated subclasses.

MLearn now takes a signature \texttt{c("formula", "data.frame", "learnerSchema",
"numeric")}, with the expectation that extra parameters captured in ... go to the fitting
function.  The complexity of dealing with expectations and return
values of different machine learning functions is handled primarily by
the learnerSchema instances.  The basic realizations are that
\begin{itemize}
\item most learning functions use the formula/data idiom, and additional
parameters can go in ...
\item the problem of converting from the function's output structures (typically
lists, but sometimes also objects with attributes) to the uniform
structure delived by MLearn should be handled as generically
as possible, but specialization will typically be needed
\item the conversion process can be handled in most cases using only the
native R object returned by the learning function, the data, and the training index set.
\item some functions, like knn, are so idiosyncratic (lacking formula interface or
predict method) that special software is needed to adapt MLearn to work with them
\end{itemize}

Thus we have defined a learnerSchema class,
<<lks>>=
library(MLInterfaces)
library(gbm)
getClass("learnerSchema")
@
along with a constructor used to define a family of schema objects that help
MLearn carry out specific tasks of learning.

\section{Some examples}

We define interface schema instances with suffix "I".

randomForest has a simple converter:
<<lkrf>>=
randomForestI@converter
@

The job of the converter is to populate as much as the classifierOutput
instance as possible.  For something like nnet, we can do more:
<<lknn>>=
nnetI@converter
@
We can get posterior class probabilities.

To obtain the predictions necessary for confusionMatrix computation, we
may need the converter to know about parameters used in the fit.
Here, closures are used.

<<lkknn>>=
knnI(k=3, l=2)@converter
@

So we can have the following calls:
<<show>>=
library(MASS)
data(crabs)
kp = sample(1:200, size=120)
rf1 = MLearn(sp~CL+RW, data=crabs, randomForestI, kp, ntree=100)
rf1
RObject(rf1)
knn1 = MLearn(sp~CL+RW, data=crabs, knnI(k=3,l=2), kp)
knn1
@

\section{Making new interfaces}
\subsection{A simple example: ada}

The \Rfunction{ada} method of the \Rpackage{ada} package has a formula
interface and a predict method.  We can create a learnerSchema on the fly,
and then use it:

<<mkadaI>>=
adaI = makeLearnerSchema("ada", "ada", standardMLIConverter )
arun = MLearn(sp~CL+RW, data=crabs, adaI, kp )
confuMat(arun)
RObject(arun)
@

What is the standardMLIConverter?
<<lks>>=
standardMLIConverter
@

\subsection{Dealing with gbm}

The \Rpackage{gbm} package workhorse fitter is \texttt{gbm}.  The formula
input must have a numeric response, and the predict method only returns
a numeric vector.  There is also no namespace.  We introduced a gbm2 function
<<lkggg>>=
gbm2
@
that requires a two-level factor response and recodes for use by gbm.
It also returns an S3 object of newly defined class gbm2, which only
returns a factor.  At this stage, we could use a standard interface, but
the prediction values will be unpleasant to work with.  Furthermore the
predict method requires specification of \texttt{n.trees}.
So we pass a parameter \texttt{n.trees.pred}.
<<tryg>>=
BgbmI
set.seed(1234)
gbrun = MLearn(sp~CL+RW+FL+CW+BD, data=crabs, BgbmI(n.trees.pred=25000,thresh=.5), 
   kp, n.trees=25000, 
   distribution="bernoulli", verbose=FALSE )
gbrun
confuMat(gbrun)
summary(testScores(gbrun))
@

\subsection{A harder example: rda}

As of Jan 2020, rda has been proposed for archiving at CRAN.  While
maintenance of rda is being contemplated, the computations in this
section are conditional on having an installed instance of rda.

The \Rpackage{rda} package includes the rda function for regularized
discriminant analysis, and rda.cv for aiding in tuning parameter selection.
No formula interface is supported, and the predict method delivers
arrays organized according to the tuning parameter space search.
Bridge work needs to be done to support the use of MLearn with this
methodology.


First we create a wrapper that uses formula-data for the rda.cv function:
<<dowrap>>=
if (requireNamespace("rda", quietly=TRUE)) {
library("rda")
rdaCV = function( formula, data, ... ) {
 passed = list(...)
 if ("genelist" %in% names(passed)) stop("please don't supply genelist parameter.")
 # data input to rda needs to be GxN
 x = model.matrix(formula, data)
 if ("(Intercept)" %in% colnames(x))
   x = x[, -which(colnames(x) %in% "(Intercept)")]
 x = t(x)
 mf = model.frame(formula, data)
 resp = as.numeric(factor(model.response(mf)))
 run1 = rda( x, resp, ... )
 rda.cv( run1, x, resp )
}
}
@

We also create a wrapper for a non-cross-validated call, where
alpha and delta parameters are specified:
<<dow2>>=
if (requireNamespace("rda", quietly=TRUE)) {
rdaFixed = function( formula, data, alpha, delta, ... ) {
 passed = list(...)
 if ("genelist" %in% names(passed)) stop("please don't supply genelist parameter.")
 # data input to rda needs to be GxN
 x = model.matrix(formula, data)
 if ("(Intercept)" %in% colnames(x))
   x = x[, -which(colnames(x) %in% "(Intercept)")]
 x = t(x)
 featureNames = rownames(x)
 mf = model.frame(formula, data)
 resp = as.numeric(resp.fac <- factor(model.response(mf)))
 finalFit=rda( x, resp, genelist=TRUE, alpha=alpha, delta=delta, ... )
 list(finalFit=finalFit, x=x, resp.num=resp, resp.fac=resp.fac, featureNames=featureNames,
    keptFeatures=featureNames[ which(apply(finalFit$gene.list,3,function(x)x)==1) ])
}
}
@

Now our real interface is defined.  It runs the rdaCV  wrapper
to choose tuning parameters, then passes these to rdaFixed, and
defines an S3 object (for uniformity, to be
like most learning packages that we work with.)
We also define a predict method for that object.
This has to deal with the fact that rda does not
use factors.

<<doreal>>=
if (requireNamespace("rda", quietly=TRUE)) {
rdacvML = function(formula, data, ...) {
 run1 = rdaCV( formula, data, ... )
 perf.1se = cverrs(run1)$one.se.pos
 del2keep = which.max(perf.1se[,2])
 parms2keep = perf.1se[del2keep,]
 alp = run1$alpha[parms2keep[1]]
 del = run1$delta[parms2keep[2]]
 fit = rdaFixed( formula, data, alpha=alp, delta=del, ... )
 class(fit) = "rdacvML"
 attr(fit, "xvalAns") = run1
 fit
}

predict.rdacvML = function(object, newdata, ...) {
 newd = data.matrix(newdata)
 fnames = rownames(object$x)
 newd = newd[, fnames]
 inds = predict(object$finalFit, object$x, object$resp.num, xnew=t(newd))
 factor(levels(object$resp.fac)[inds])
}

print.rdacvML = function(x, ...) {
 cat("rdacvML S3 instance. components:\n")
 print(names(x))
 cat("---\n")
 cat("elements of finalFit:\n")
 print(names(x$finalFit))
 cat("---\n") 
 cat("the rda.cv result is in the xvalAns attribute of the main object.\n")
}
}
@

Finally, the converter can take the standard form; look at
rdacvI.


\section{Additional features}

The xvalSpec class allows us to specify types of cross-validation, and to
control carefully how partitions are formed.  More details are provided
in the MLprac2\_2 vignette.

\section{The MLearn approach to clustering and other forms
of unsupervised learning}

A learner schema for a clustering method needs to specify clearly
the feature distance measure.  We will experiment here.  Our main
requirements are
\begin{itemize}
\item ExpressionSets are the basic input objects
\item The typical formula interface would be \verb+~.+ but one can
imagine cases where a factor from phenoData is specified as a 'response' to
color items, and this will be allowed
\item a clusteringOutput class will need to be defined to contain the
results, and it will propagate the result object from the native
learning method.
\end{itemize}


%need unsupLearnerSchema?


\end{document}