%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{bioDist Introduction}
%\VignetteKeywords{Distances}
%\VignettePackage{bioDist}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}

\title{bioDist Introduction}
\maketitle

\section*{Introduction}

The \Rpackage{bioDist} package contains some distance functions that
have been shown to be useful in a number of different biological or
bioinformatic problems. The return values are typically instances of
the S3 class \Rclass{dist}.

\section{Data}

We will use the \Robject{sample.ExpressionSet} object from the
\Rpackage{Biobase} package as our data. The \Rpackage{bioDist}
functions are in some ways extensions of the distance functions
available via the \Rfunction{dist} function in R, and hence they
compute pairwise distances between the rows of the input. For an
expression matrix, the rows correspond to the genes or features on the
array. Since we are generally more interested in distances between
samples, we will transpose the data in this demonstration.

<<>>=
library("bioDist")
data(sample.ExpressionSet)
exData = t(exprs(sample.ExpressionSet))
@

\section{Distance Measures}

The two most used distance measures in the \Rpackage{bioDist} package
are mutual information (MI) and the Kullback-Leibler distance (KLD).
These measures focus on very different distributional aspects of the data.
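
As background, the population quantities behind these two measures can
be sketched as follows. These are the standard textbook definitions for
continuous variables, not a description of the package's estimators:
the symbols $f_{XY}$, $f_X$, $f_Y$, $f$ and $g$ are notation introduced
here for illustration only, and the package functions work with
estimates of these densities and may differ in details such as
symmetrization.
% Illustrative notation (an assumption, not taken from the package):
% f_{XY} is a joint density with marginals f_X and f_Y; f and g are two densities.
\begin{align*}
\mathrm{MI}(X, Y) &= \int\!\!\int f_{XY}(x, y)\,
  \log \frac{f_{XY}(x, y)}{f_X(x)\, f_Y(y)} \, dx \, dy, \\
\mathrm{KLD}(f, g) &= \int f(x)\, \log \frac{f(x)}{g(x)} \, dx .
\end{align*}
Under these definitions, MI is zero exactly when $f_{XY} = f_X f_Y$,
that is, under independence, and KLD is zero exactly when $f = g$
almost everywhere.
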
MI is large when the joint distribution is quite different from the
product of the marginals, while KLD measures how much the shape of one
distribution resembles that of the other.
MI can be considered as a multivariate measure of association, and if
the transformation
\begin{equation}
\label{eq:distJOE}
\delta^* = [ 1 - \exp(-2\,\mathrm{MI})]^{1/2}
\end{equation}
is used, then $\delta^*$ takes values in the interval $[0,1]$ and can be
interpreted as a generalization of the correlation. We will make the
further transformation to $1 - \delta^*$ so that this measure has the
same interpretation as other correlation-based distance measures:
values near 0 indicate strong dependence, and a value of 1 corresponds
to $\mathrm{MI} = 0$.

There are two functions for computing mutual information distance
measures: \Rfunction{mutualInfo}, which computes the distance from
independence, and \Rfunction{MIdist}, which computes the transformation
in Equation~(\ref{eq:distJOE}). We note that the computations are not
terribly fast, and computing these distances on very large data sets is
time-consuming.

<<>>=
s1 = MIdist(exData)
s2 = as.matrix(s1)
dim(s2)
r1 = mutualInfo(exData)
@

For KL distances, there is one implementation that uses binning,
\Rfunction{KLdist.matrix}, and one that uses density estimation
followed by numerical integration, \Rfunction{KLD.matrix}.

<<>>=
kl1 = KLdist.matrix(exData)
kl2 = KLD.matrix(exData, method="density", supp=range(exData))
@

The \Rpackage{bioDist} package also provides implementations of
distances based on two other measures of correlation: Kendall's tau and
Spearman's rho. In the examples below we will measure distance between
genes, not between samples as was done in the first few examples. We
will also restrict our analysis to the last 100 genes in the sample data
in order to keep computing times low.

<<>>=
eS = sample.ExpressionSet[401:500,]
tauD = tau.dist(eS, sample=FALSE)
sp = spearman.dist(eS, sample=FALSE)
@

To find a specified number of nearest neighbors, we will use a simple
helper function called \Rfunction{closest.top}.
<<>>=
f1 = featureNames(eS)[1]
closest.top(f1, sp, 3)
@

\section{Session Information}

The versions of R and of the packages loaded to generate this vignette
were:

\begin{verbatim}
<<>>=
sessionInfo()
@
\end{verbatim}

\end{document}