BiocNeighbors 1.14.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods demonstrated in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7688 7911 5982 6329 8261 6717 8775 3909 9563 8518
## [2,] 8682 4076 7016 657 7899 8394 1199 407 3561 2410
## [3,] 6148 8763 6962 3122 1419 4972 6396 7781 1805 178
## [4,] 245 9156 9560 6542 7738 1612 7744 7280 757 3125
## [5,] 8273 6427 8740 2778 3202 2819 3773 7550 658 3864
## [6,] 8316 4854 3029 3623 64 448 1468 4982 2999 726
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8304179 0.8750356 0.8759342 0.8829713 0.9082669 0.9491317 0.9684428
## [2,] 0.9251890 0.9276413 0.9449535 0.9631755 1.0158614 1.0201880 1.0280715
## [3,] 0.7472256 0.8548992 0.8610095 0.9206980 0.9432310 0.9796737 0.9917855
## [4,] 0.8656153 0.9403350 0.9639321 0.9715745 0.9871412 0.9894491 1.0108013
## [5,] 0.9155345 0.9801137 0.9867500 0.9923490 1.0192692 1.0513476 1.0730041
## [6,] 0.9352902 0.9399719 0.9447927 0.9566540 1.0086448 1.0389432 1.0474594
## [,8] [,9] [,10]
## [1,] 0.9723414 0.979493 0.9814662
## [2,] 1.0306145 1.044805 1.0577210
## [3,] 0.9959299 1.003495 1.0126903
## [4,] 1.0369606 1.041605 1.0471301
## [5,] 1.1029378 1.115360 1.1172078
## [6,] 1.0697364 1.071698 1.0945700
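Because the search is approximate, it can be useful to check how closely the reported neighbors agree with an exact search. The following is a minimal sketch that compares Annoy against the exact KMKNN algorithm (KmknnParam()) on a smaller simulated dataset. The dataset size, the seed and the `recall` variable are choices made here for illustration, and the resulting value will differ between datasets and parameter settings.

```r
library(BiocNeighbors)

set.seed(1000)
small <- matrix(runif(2000 * 20), ncol=20)
k <- 10

# Exact and approximate searches on the same data.
exact <- findKNN(small, k=k, BNPARAM=KmknnParam())
ann <- findKNN(small, k=k, BNPARAM=AnnoyParam())

# For each point, the fraction of exact neighbors that Annoy also reports.
recall <- vapply(seq_len(nrow(small)), function(i) {
    length(intersect(exact$index[i,], ann$index[i,])) / k
}, numeric(1))

mean(recall)
```

A mean recall close to 1 suggests that the approximation loses little for this dataset; lower values may warrant adjusting the algorithm parameters described in the documentation.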
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8679 2426 7392 5154 5646
## [2,] 3513 2300 7373 3718 2891
## [3,] 4796 4914 8079 8206 9027
## [4,] 4212 3137 6809 7052 2715
## [5,] 5221 2539 8357 9500 7903
## [6,] 4114 5603 801 3653 779
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9181021 0.9334105 0.9611604 0.9939877 1.0098161
## [2,] 0.8746895 0.9087604 0.9108444 0.9175109 0.9178008
## [3,] 1.0728860 1.1161153 1.1323576 1.1492079 1.1816459
## [4,] 0.7262368 0.9570859 0.9860527 0.9911924 1.0460759
## [5,] 0.7616389 0.8509811 0.8773931 0.9265016 0.9295678
## [6,] 0.9703227 0.9858002 0.9861577 0.9886760 1.0036516
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
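As a brief illustration, the searches above could be repeated with HNSW by swapping only the BNPARAM argument. This sketch regenerates simulated data so that it runs on its own; the seed and the object names `hout` and `hqout` are arbitrary choices made here.

```r
library(BiocNeighbors)

set.seed(200)
data <- matrix(runif(10000 * 20), ncol=20)
query <- matrix(runif(1000 * 20), ncol=20)

# Same calls as for Annoy, with only BNPARAM changed.
hout <- findKNN(data, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

# One row per point (or query point), one column per neighbor.
dim(hout$index)
dim(hqout$distance)
```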
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
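The other options can be combined with a pre-built index in the same way. The sketch below assumes the findKNN interface of this release (subset, get.distance and BPPARAM arguments) and uses MulticoreParam from BiocParallel with two workers purely as an example; the number of workers and the object names are illustrative.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(300)
data <- matrix(runif(10000 * 20), ncol=20)
pre <- buildIndex(data, BNPARAM=AnnoyParam())

# Neighbors for the first 10 points only, without computing the distance matrix.
sub <- findKNN(BNINDEX=pre, k=5, subset=1:10, get.distance=FALSE)
dim(sub$index)

# The same index searched with the work split across two workers.
par.out <- findKNN(BNINDEX=pre, k=5, BPPARAM=MulticoreParam(2))
dim(par.out$index)
```

Skipping the distances saves memory when only neighbor identities are needed, for example when building a nearest-neighbor graph.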
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
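A minimal sketch of a Manhattan-distance search with Annoy follows. To confirm which metric is being reported, it recomputes one distance by hand as the sum of absolute coordinate differences; the tolerance is a loose value chosen here to allow for reduced floating-point precision in the index.

```r
library(BiocNeighbors)

set.seed(400)
data <- matrix(runif(10000 * 20), ncol=20)

mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance from point 1 to its first reported neighbor.
nn <- mout$index[1, 1]
manual <- sum(abs(data[1,] - data[nn,]))

# Compare against the reported distance, allowing for rounding error.
all.equal(manual, mout$distance[1, 1], tolerance=1e-4)
```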
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can set the TMPDIR environment variable so that tempdir() points to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpDtEfx3/file13795537a7e65.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
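The following sketch shows one way this might look for Annoy. It assumes that buildIndex() forwards an fname argument to the underlying Annoy builder, which should be checked against the documentation of the installed version. The file name is hypothetical, and tempdir() is used only so that the example cleans up after itself; a persistent index would be written to a directory that outlives the session.

```r
library(BiocNeighbors)

set.seed(500)
data <- matrix(runif(10000 * 20), ncol=20)

# Hypothetical file name; replace with a persistent location in real use.
index.path <- file.path(tempdir(), "my_annoy_index.idx")

# Assumption: 'fname' is passed through buildIndex() to the Annoy builder.
pre <- buildIndex(data, BNPARAM=AnnoyParam(), fname=index.path)
AnnoyIndex_path(pre)
file.exists(index.path)

# The index object is used as before.
out <- findKNN(BNINDEX=pre, k=5)

# The file is no longer managed by the session, so remove it explicitly.
unlink(index.path)
```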
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.14.0 knitr_1.38 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 magrittr_2.0.3 BiocGenerics_0.42.0
## [4] BiocParallel_1.30.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.2.0 parallel_4.2.0 grid_4.2.0
## [13] xfun_0.30 cli_3.3.0 jquerylib_0.1.4
## [16] htmltools_0.5.2 yaml_2.3.5 digest_0.6.29
## [19] bookdown_0.26 Matrix_1.4-1 BiocManager_1.30.17
## [22] S4Vectors_0.34.0 sass_0.4.1 evaluate_0.15
## [25] rmarkdown_2.14 stringi_1.7.6 compiler_4.2.0
## [28] bslib_0.3.1 stats4_4.2.0 jsonlite_1.8.0