BiocNeighbors 1.5.6
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including Annoy and HNSW.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6488 6448 9532 8837 7654 1951 1571 2883 3390 3108
## [2,] 7227 4779 9210 6418 7337 4286 5522 7658 2041 7575
## [3,] 8727 746 5064 5330 7241 2447 5257 3291 3197 6644
## [4,] 8459 6527 1425 8837 7654 7808 36 8097 9346 4357
## [5,] 6779 1096 9684 2386 5975 6334 3802 2183 359 9580
## [6,] 3200 7261 8178 2015 3969 3762 1458 7051 9246 4844
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8600450 0.9249793 0.9269966 0.9306393 0.9345161 0.9924889 1.0047200
## [2,] 0.9097685 0.9099226 0.9916550 1.0116951 1.0248376 1.0465744 1.0513953
## [3,] 0.9476256 0.9681602 0.9729730 0.9758121 1.0004933 1.0342309 1.0441247
## [4,] 1.0204934 1.0457920 1.1406522 1.1797287 1.1931326 1.1973635 1.2246155
## [5,] 0.9022015 0.9353637 0.9363446 0.9484413 0.9602932 0.9683115 0.9860380
## [6,] 0.8698812 0.8929705 0.8962781 0.9073462 0.9307991 0.9365644 0.9492784
## [,8] [,9] [,10]
## [1,] 1.0081060 1.0179118 1.020115
## [2,] 1.0649713 1.0661714 1.072467
## [3,] 1.0574653 1.0621282 1.073696
## [4,] 1.2257073 1.2296618 1.232692
## [5,] 0.9879158 0.9910637 1.008006
## [6,] 0.9965479 1.0009067 1.001656
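Because Annoy is approximate, some of the reported neighbors may differ from the true nearest neighbors. The following is a minimal sketch of how one might gauge this, assuming the exact KmknnParam() algorithm from the same package as the reference. The smaller simulated dataset, the seed and the recall summary are illustrative choices and not part of the original vignette.

```r
library(BiocNeighbors)

set.seed(1000)  # illustrative seed for the simulated data
nobs <- 2000
ndim <- 20
data <- matrix(runif(nobs * ndim), ncol = ndim)

# Exact search as the reference, approximate search with Annoy.
exact <- findKNN(data, k = 10, BNPARAM = KmknnParam())
annoy <- findKNN(data, k = 10, BNPARAM = AnnoyParam())

# For each point, the fraction of exact neighbors that Annoy also reported.
recall <- vapply(seq_len(nobs), function(i) {
    mean(exact$index[i, ] %in% annoy$index[i, ])
}, numeric(1))

mean(recall)
```

A mean recall close to 1 suggests that the approximation loses little on data of this kind; the value will vary with the data and the Annoy settings.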
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2131 9835 6651 9994 431
## [2,] 6371 8130 7286 3535 6515
## [3,] 782 7921 590 7126 7420
## [4,] 8334 1380 6420 4402 5796
## [5,] 1235 5500 639 8698 4143
## [6,] 9300 9768 931 4165 1110
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8906709 0.9397735 0.9484660 0.9861960 1.0011728
## [2,] 0.7787679 0.9276452 0.9487606 0.9563434 0.9589845
## [3,] 0.8892353 0.9742768 1.0102720 1.0148180 1.0340621
## [4,] 0.9149737 0.9488438 0.9737120 1.0285389 1.0346249
## [5,] 0.8810596 1.0035259 1.0174783 1.0246798 1.0375212
## [6,] 0.9929487 0.9995984 1.0036471 1.0305095 1.0393341
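To make the structure of the query output concrete, the sketch below checks that the indices refer to rows of data and that the reported distance matches a manual Euclidean calculation. The seed and the tolerance are assumptions chosen for illustration; a loose tolerance is used because the two calculations are not expected to agree to full double precision.

```r
library(BiocNeighbors)

set.seed(2000)  # illustrative seed
data <- matrix(runif(10000 * 20), ncol = 20)
query <- matrix(runif(1000 * 20), ncol = 20)
qout <- queryKNN(data, query, k = 5, BNPARAM = AnnoyParam())

# One row per query point, one column per neighbor.
dim(qout$index)

# Recompute the Euclidean distance from the first query point to its
# first reported neighbor, which is a row of 'data'.
first <- qout$index[1, 1]
manual <- sqrt(sum((query[1, ] - data[first, ])^2))
all.equal(manual, qout$distance[1, 1], tolerance = 1e-4)
```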
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
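As a brief sketch of the switch described above, the same two calls can be repeated with HnswParam() in place of AnnoyParam(). The dataset sizes and seed here are illustrative assumptions; nothing else about the calls changes.

```r
library(BiocNeighbors)

set.seed(3000)  # illustrative seed
data <- matrix(runif(5000 * 20), ncol = 20)
query <- matrix(runif(500 * 20), ncol = 20)

# Same interface as before; only the BNPARAM argument differs.
hfind <- findKNN(data, k = 10, BNPARAM = HnswParam())
hquery <- queryKNN(data, query, k = 5, BNPARAM = HnswParam())

head(hfind$index)
head(hquery$distance)
```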
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
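The subset and get.distance options mentioned above can be sketched in the same way. This example assumes the argument names as given in the text; the choice of the first 10 points is arbitrary.

```r
library(BiocNeighbors)

set.seed(4000)  # illustrative seed
data <- matrix(runif(5000 * 20), ncol = 20)

# Neighbors for the first 10 points only, without retrieving distances.
sub <- findKNN(data, k = 5, subset = 1:10, get.distance = FALSE,
    BNPARAM = AnnoyParam())

# One row per requested point, one column per neighbor.
dim(sub$index)
```

Restricting the search to the points of interest and skipping distances can reduce both time and memory when only neighbor identities are needed.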
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
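A minimal sketch of a Manhattan search with Annoy follows, with a manual check that the reported distance is the sum of absolute coordinate differences. The seed and the tolerance are illustrative assumptions.

```r
library(BiocNeighbors)

set.seed(5000)  # illustrative seed
data <- matrix(runif(5000 * 20), ncol = 20)

mout <- findKNN(data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the Manhattan (L1) distance from point 1 to its first neighbor.
nn <- mout$index[1, 1]
manual <- sum(abs(data[1, ] - data[nn, ]))
all.equal(manual, mout$distance[1, 1], tolerance = 1e-4)
```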
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can set the TMPDIR environment variable to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpp4TWOP/file1161a27dad858.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
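One way to handle this clean-up is sketched below, assuming the file-backed index and the AnnoyIndex_path() accessor of the package version documented here. The index is built in the default location for simplicity; the same unlink() call would apply to a user-specified path.

```r
library(BiocNeighbors)

set.seed(6000)  # illustrative seed
data <- matrix(runif(1000 * 20), ncol = 20)

pre <- buildIndex(data, BNPARAM = AnnoyParam())
path <- AnnoyIndex_path(pre)
file.exists(path)

# Use the index for as many searches as needed.
out <- findKNN(BNINDEX = pre, k = 5)

# Remove the index file once all calculations are complete.
# 'pre' must not be used for further searches after this point.
unlink(path)
```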
sessionInfo()
## R version 4.0.0 alpha (2020-04-05 r78150)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.5.6 knitr_1.28 BiocStyle_2.15.8
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 bookdown_0.18 lattice_0.20-41
## [4] digest_0.6.25 grid_4.0.0 stats4_4.0.0
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.5
## [10] stringi_1.4.6 S4Vectors_0.25.15 Matrix_1.2-18
## [13] rmarkdown_2.1.2 BiocParallel_1.21.3 tools_4.0.0
## [16] stringr_1.4.0 parallel_4.0.0 xfun_0.13
## [19] yaml_2.2.1 compiler_4.0.0 BiocGenerics_0.33.3
## [22] BiocManager_1.30.10 htmltools_0.4.0