%\VignetteIndexEntry{clst} %\VignetteKeywords{clst, classifier, ROC} %\VignettePackage{clst} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[10pt]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{subfigure} % all figures at end of document % \usepackage{endfloat} %%%% NOT the default vignette page layout \usepackage[margin=2cm]{geometry} \geometry{letterpaper} %%%% part of default vignette page layout (or at least, it came from somewhere) % \textwidth=6.2in % \textheight=8.5in % %\parskip=.3cm % \oddsidemargin=.1in % \evensidemargin=.1in % \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} %\setkeys{Gin}{width=8in} % figures saved in ./figs; create if necessary below <>= figdir <- 'figs_out' dir.create(figdir, showWarnings=FALSE) @ \SweaveOpts{prefix.string=\Sexpr{figdir}/clst,eps=FALSE} \title{\Rpackage{clst} demo} \author{Noah Hoffman} \maketitle \tableofcontents \section{Introduction} \Rpackage{clst} performs supervised classification using a modified nearest-neighbor approach. This vignette demonstrates its use. \subsection{Definition of symbols} Consider a set of $N$ objects that can be grouped into $K$ classes with class labels $c_m, m \in 1...K$. Each object may be compared to another to generate a pairwise distance $d_{i,j}$. Additional symbol definitions are as follows: \vspace{1em} \begin{center} \begin{tabular}{r p{10cm}} $d^w$ & all within-class pairwise distances.\\ $d^b$ & all between-class pairwise distances.\\ $d^w_m$ & within-group distances among members of class $m$.\\ $d_{x,i \in 1...N}$ & distances between object $x$ and each object in the reference set. \\ $d_{x,i\in c_m}$ & distances between query object $x$ and each object belonging to class $c_m$.\\ $D$ & A value defining an optimal separation between within-class comparisons and between-class comparisons. In general, therefore, we would expect $d^w < D < d^b$. \end{tabular} \end{center} \vspace{1em} Given these definitions, we can further define a match score $s$ to describe the confidence that object $x$ is a member of class $c_m$ as follows: \begin{equation*} s_{x,m} = \frac{d_{x,i \in c_m} < D}{d_{x,i \in 1...N} + d} \end{equation*} We include a small offset $d$ in the denominator to prevent over-weighting match scores for small groups. Thus if $d=0.5$, a query object with all $d_{x,m} < D$ will have a match score of $1/1+0.5 \approx 0.66$ for a class with a single member. We can define an additional parameter $C$ as the value of $s_{x,c_m}$ above which object $x$ is classified as a member of class $c_m$. (Note: $C$ is defined by the \texttt{minScore} argument to the function \texttt{classify}. \subsection{Classification outcomes} Classification of an object $x$ can be performed by considering the values of $s_{x,m}$ for all classes represented in a reference set. Possible outcomes of classification might include an unequivocal \textit{match} with a single class; \textit{confusion} between two or more possible classes; \textit{no match}, which reflects confidence that the object can \textit{not} be placed into any of the represented classes; or evidence that the object is a clear \textit{outlier}. Given the simple case of two classes $A$ and $B$ and four query objects $w$, $x$, $y$, and $z$ (illustrated in Figure~\ref{fig:matchtypes}, we might define these outcomes as follows: \begin{figure}[h] \centering \fbox{\includegraphics{matchtypes.pdf}} \caption[Classification scenarios]{Members of classes A and B lie within respectively labeled circles; points w, x, y, and z indicate unclassified objects.} \label{fig:matchtypes} \end{figure} \begin{description} \item[match (point $w$)] $\begin{cases} s_{w,A} > C \\ s_{w,B} < C \end{cases}$ \item[confusion (point $x$)] $\begin{cases} s_{x,A} > C \\ s_{x,B} > C \end{cases}$ \item[no match (point $y$)] $\begin{cases} s_{y,A} < C \\ s_{y,B} < C \\ d_{y,i \in 1...N} < some \; d^b \end{cases}$ \item[outlier (point $z$)] $\begin{cases} s_{z,A} < C \\ s_{z,B} < C \\ d_{y,i \in 1...N} > most \; d^b \end{cases}$ \end{description} Described in the context of the function \Rfunction{classify}, classification is performed as follows. The arguments \code{minScore} and \code{doffset} provide the parameters $C$ and $d$, respectively. On each iteration of the classification algorithm, pairwise distances in the reference set are defined as either within- or between-group based on the labels in \code{groups}. A threshold partitioning these two categories is determined as the point of maximal mutual information. If classification results in confusion between two or more categories, pairwise distances among objects in these remaining categories will be re-partitioned, and a new threshold determined. The \code{maxDepth} parameter defines the maximum number of times this re-partitioning may occur. \section{Iris example} Create a distance matrix (\texttt{dmat}) using the Iris data set, which describes phenotypic characteristics of three flower species. We also define a factor (\texttt{groups}) that associates each of the samples in the data set with one of three Iris species. <<>>= library(clst) data(iris) dmat <- as.matrix(dist(iris[,1:4], method="euclidean")) groups <- iris$Species @ The relationship among objects in the Iris data set can be illustrated using multidimensional scaling. The function \texttt{scaleDistPlot} creates an annotated scatterplot of the output of \texttt{cmdscale}. \begin{figure}[p!] \centering <>= ii <- c(1,125) plot(scaleDistPlot(dmat, groups, indices=ii,O=ii)) @ \caption[Visualization of the Iris data set using multidimensional scaling.]{Visualization of the Iris data set using multidimensional scaling. Objects to be classified in the examples below are indicated.} \label{fig:scaleDistPlot} \end{figure} We can calculate $D$ as the value that results in maximal mutual information between the vector classifying distances as ``within'' or ``between'' and the vector indicating if the corresponding pairwise distance is greater than or less than $D$ (method=``mutinfo''). This is the default method used by the classifier. <<>>= thresh <- findThreshold(dmat, groups, type="mutinfo") str(thresh) @ By default, $D$ has a lower and upper bound defined as the minimum of the between-class distances and maximum of the within-class distances. One can also set the upper and lower bounds of $D$ as some quantile of the within class distances and between-class differences, respectively using the \texttt{prob} argument (the default is 0.5); the constraint can be removed by setting \code{prob=NA} <<>>= thresh2 <- findThreshold(dmat, groups, type="mutinfo", prob=NA) print(thresh2$interval) @ In this example, setting alternative bounds for $D$ does not alter the results. The plot method for the thresh object illustrates the underlying calculations performed using either method <>= plot(do.call(plotDistances, thresh)) @ <>= plot(do.call(plotDistances, thresh2)) @ This data set contains two classes (that is, flower species) that are closely related, and one more distantly related to the other two. Therefore the performance of a single $D$ for predicting inter- versus intra-class distances is rather poor. \subsection{Classification of selected items} The function \texttt{classify} implements the classification algorithm. Inputs to the function include a square matrix of distances and a factor of class labels (\texttt{dmat} and \texttt{groups}) as well as \texttt{dvect}, a vector containing distances between the sample to be classified and each of the reference objects. The classification algorithm requires the two parameters, $C$ and $d$, which are provided to \texttt{classify} as arguments \texttt{minScore} and \texttt{doffset}. The more important of the two parameters, \texttt{minScore}, defines the score cutoff for group membership; that is, if an unknown sample has a proportion of similarity scores $>$ \texttt{minScore} when compared to reference samples in a given class, it is considered a match for that class. We will experiment with the effect of changing the value of \texttt{minScore} later. \texttt{doffset} is used in the calculation of the match score; its purpose is to reduce the weight of scores resulting from matches to members of a group with very few members. Thus with the default value of \texttt{doffset}=0.5, if a query object has a distance $< D$ to an object in a group of size 1, the resulting score is $1/(1 + 0.5) = 0.66$. To provide an artificial example, we can remove a single sample from the Iris data and perform classification using the remaining samples as a reference set. <<>>= ind <- 1 species <- gettextf('I. %s', groups[ind]) cat('class of "unknown" sample is',species) dmat1 <- dmat[-ind,-ind] groups1 <- groups[-ind] dvect1 <- dmat[ind, -ind] cc <- classify(dmat1, groups1, dvect1) printClst(cc) @ The query sequence is a member of the species \textit{I. setosa}, which is relatively divergent from the other two, so our classifier performs well on the first try. Classification of members of the other two Iris species present a more challenging case: <<>>= ind <- 125 species = gettextf('I. %s', groups[ind]) pp <- pull(dmat, groups, ind) cc <- do.call(classify, pp) cat(paste('class of "unknown" sample is', species)) printClst(cc) @ \subsection{Leave-one-out analysis} The performance of the classifier for a data set using a given set of parameters can be examined using a leave-one-out analysis. The list \Robject{loo} contains the results of the final iteration of classification for each object in the iris data set. <<>>= loo <- lapply(seq_along(groups), function(i){ do.call(classify, pull(dmat, groups, i)) }) matches <- lapply(loo, function(x) rev(x)[[1]]$matches) result <- sapply(matches, paste, collapse='-') table(ifelse(result=='','no match',result),groups) @ Note that there is some confusion between versicolor and virginica. The objects that could not be assigned to a single taxon are indicated in Figure~\ref{fig:confusion45}. \begin{figure}[p!] \centering <>= confusion <- sapply(matches, length) > 1 no_match <- sapply(matches, length) < 1 plot(scaleDistPlot(dmat, groups, fill=confusion, O=confusion, X=no_match)) @ \caption[MDS plot of a leave-one-out analysis.]{Visualization of the Iris data set using multidimensional scaling. Objects with a classification output of confusion are filled and indicated with a circle.} \label{fig:confusion45} \end{figure} The specificity of the classifier can be improved by increasing the value of \code{minScore}, at the cost of a decrease in sensitivity (Figure~\ref{fig:confusion65}). <<>>= loo <- lapply(seq_along(groups), function(i){ do.call(classify, c(pull(dmat, groups, i),minScore=0.65)) }) matches <- lapply(loo, function(x) rev(x)[[1]]$matches) result <- sapply(matches, paste, collapse='-') table(ifelse(result=='','no match',result),groups) @ \begin{figure}[p!] \centering <>= confusion <- sapply(matches, length) > 1 no_match <- sapply(matches, length) < 1 plot(scaleDistPlot(dmat, groups, fill=confusion, O=confusion, X=no_match, indices=no_match)) @ \caption[MDS plot of leave-one-out analysis, \texttt{minScore=0.65}]{Visualization of the Iris data set using multidimensional scaling. Objects with a classification output of confusion are filled and indicated with a circle. Unclassified items are crossed out. Here \texttt{minScore=0.65}} \label{fig:confusion65} \end{figure} Finally, we can see a summary of the classification results of an item for which the result was ``no match.'' <<>>= printClst(loo[[118]]) @ \end{document}