---
title: Detecting exact nearest neighbors
author:
- name: Aaron Lun
  affiliation: Cancer Research UK Cambridge Institute, Cambridge, United Kingdom
date: "Revised: 2 December 2018"
output:
  BiocStyle::html_document:
    toc_float: true
package: BiocNeighbors
vignette: >
  %\VignetteIndexEntry{1. Detecting exact nearest neighbors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ref.bib
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocNeighbors)
```

# Introduction

The `r Biocpkg("BiocNeighbors")` package implements a few algorithms for exact nearest neighbor searching:

- The k-means for k-nearest neighbors (KMKNN) algorithm [@wang2012fast] uses k-means clustering to create an index. Within each cluster, the distance of each of that cluster's points to the cluster center is computed and used to sort all points. Given a query point, the distance to each cluster center is determined and the triangle inequality is applied to determine which points in each cluster warrant a full distance calculation.
- The vantage point (VP) tree algorithm [@yianilos1993data] involves constructing a tree where each node is located at a data point and is associated with a subset of neighboring points. Each node progressively partitions points into two subsets that are either closer to or further from the node than a given threshold. Given a query point, the triangle inequality is applied at each node in the tree to determine if the child nodes warrant searching.

Both methods involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic^[Except in the presence of ties, see `?findKNN` for details.].

# Identifying k-nearest neighbors

The most obvious application is to perform a k-nearest neighbors search. We'll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.

```{r}
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
```

The `findKNN()` method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns. We indicate that we want to use the KMKNN algorithm by setting `BNPARAM=KmknnParam()` (which is also the default, so this is not strictly necessary here). We could use a VP tree instead by setting `BNPARAM=VptreeParam()`.

```{r}
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
head(fout$distance)
```

Each row of the `index` matrix corresponds to a point in `data` and contains the row indices in `data` that are its nearest neighbors. For example, the 3rd point in `data` has the following nearest neighbors:

```{r}
fout$index[3,]
```

... with the following distances to those neighbors:

```{r}
fout$distance[3,]
```

Note that the reported neighbors are sorted by distance.

# Querying k-nearest neighbors

Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small dataset:

```{r}
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
```

We then use the `queryKNN()` function to identify the 5 nearest neighbors in `data` for each point in `query`.

```{r}
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
head(qout$distance)
```

Each row of the `index` matrix contains the row indices in `data` that are the nearest neighbors of a point in `query`. For example, the 3rd point in `query` has the following nearest neighbors in `data`:

```{r}
qout$index[3,]
```

... with the following distances to those neighbors:

```{r}
qout$distance[3,]
```

Again, the reported neighbors are sorted by distance.
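To convince ourselves that the search is indeed exact, we can cross-check the reported neighbors against a brute-force calculation over all distances. The sketch below is purely illustrative and not part of the `r Biocpkg("BiocNeighbors")` API (the `d3` variable is our own); it recomputes the 5 nearest neighbors in `data` for the 3rd query point, which should match the `queryKNN()` output barring ties.

```{r}
# Euclidean distances from the 3rd query point to every point in 'data';
# 't(data)' has one column per point, so the subtraction recycles per column.
d3 <- sqrt(colSums((t(data) - query[3,])^2))

# The indices of the 5 smallest distances should be the reported neighbors.
all(order(d3)[1:5] == qout$index[3,])
```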
# Further options

Users can perform the search for a subset of query points with the `subset=` argument. This yields the same result as, but is more efficient than, performing the search for all points and subsetting the output.

```{r}
findKNN(data, k=5, subset=3:5)
```

If only the indices are of interest, users can set `get.distance=FALSE` to avoid returning the matrix of distances. This will save some time and memory.

```{r}
names(findKNN(data, k=2, get.distance=FALSE))
```

It is also simple to speed up functions by parallelizing the calculations with the `r Biocpkg("BiocParallel")` framework.

```{r}
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
```

For multiple queries to a constant `data`, the pre-clustering can be performed in a separate step with `buildIndex()`. The result can then be passed to multiple calls, avoiding the overhead of repeated clustering^[The algorithm type is automatically determined when `BNINDEX` is specified, so there is no need to also specify `BNPARAM` in the later functions.].

```{r}
pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
```

The default setting is to search on the Euclidean distance. Alternatively, we can use the Manhattan distance by setting `distance="Manhattan"` in the `BiocNeighborParam` object.

```{r}
out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))
```

Advanced users may also be interested in the `raw.index=` argument, which returns indices directly to the precomputed object rather than to `data`. This may be useful inside package functions where it may be more convenient to work on a common precomputed object.

# Session information

```{r}
sessionInfo()
```

# References