BiocNeighbors 1.10.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW algorithms demonstrated below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9419 5383 5801 9819 6156 5697 1356 8024 7357 7540
## [2,] 4779 4255 4130 8292 1443 8055 1486 9925 8293 2982
## [3,] 5442 9170 6624 7997 6284 9768 6725 6781 9442 9028
## [4,] 4302 5048 9194 1659 7720 7617 4212 8528 814 6324
## [5,] 4530 5494 7118 1715 6176 8868 3942 4484 2468 2079
## [6,] 9039 3176 7014 9810 577 8868 2781 3591 3334 3193
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8132501 0.8488185 0.8631178 0.8958908 0.9000078 0.9576015 0.9660131
## [2,] 0.8062201 0.8723599 0.8773741 0.9117950 0.9172041 0.9524634 0.9702494
## [3,] 0.9617059 0.9792035 1.0102417 1.0139475 1.0299968 1.0361048 1.0634689
## [4,] 0.8564699 0.8584789 0.8761120 0.9338816 0.9732273 0.9996098 1.0361505
## [5,] 0.6855983 0.9793282 0.9934275 0.9937800 1.0161976 1.0279846 1.0353119
## [6,] 0.9363781 0.9726624 0.9879481 0.9955242 1.0042561 1.0085794 1.0132691
## [,8] [,9] [,10]
## [1,] 0.9889613 1.009330 1.011024
## [2,] 0.9849720 1.010131 1.023426
## [3,] 1.0769291 1.076997 1.077183
## [4,] 1.0406028 1.056442 1.057538
## [5,] 1.0397735 1.040114 1.047013
## [6,] 1.0344353 1.043902 1.049183
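Because the search is approximate, it can be useful to check how many of the true nearest neighbors are actually recovered. The following is a minimal sketch, assuming that the exact KMKNN search (KmknnParam(), from the exact algorithms mentioned earlier) is an acceptable ground truth; the variable names x, approx, exact and recall are illustrative only.

```r
library(BiocNeighbors)

# Simulated data with the same dimensions as in the example above.
x <- matrix(runif(10000 * 20), ncol=20)

# Approximate search with Annoy, and exact search with KMKNN as the reference.
approx <- findKNN(x, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(x, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of its exact neighbors found by Annoy.
recall <- vapply(seq_len(nrow(x)), function(i) {
    mean(exact$index[i,] %in% approx$index[i,])
}, numeric(1))

# Average recall across all points (1 means identical neighbor sets).
mean(recall)
```

The acceptable level of recall depends on the downstream application, so this value is best treated as a diagnostic rather than a pass/fail criterion.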
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 554 8831 5359 5778 9117
## [2,] 5091 1960 5960 8102 3176
## [3,] 3667 3024 5256 3273 3159
## [4,] 6237 9451 8803 4074 9523
## [5,] 6377 9364 8209 8845 9010
## [6,] 4433 7194 8348 2378 2911
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8107834 0.9270491 1.0416028 1.0461897 1.0559363
## [2,] 0.7699306 0.9092005 0.9839563 0.9840844 1.0110680
## [3,] 0.8941807 0.8978932 0.9420072 0.9506536 0.9589225
## [4,] 0.8914589 0.9189705 0.9708286 1.0087608 1.0187865
## [5,] 1.0459340 1.0973414 1.1116436 1.1399796 1.1554955
## [6,] 0.8877277 0.9150562 0.9550412 0.9810067 0.9820225
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
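As a short sketch of the HNSW equivalent of the Annoy calls shown earlier, the same findKNN and queryKNN calls can be repeated with only the BNPARAM argument changed. The object names x, q, hout and hqout are illustrative and not part of the package.

```r
library(BiocNeighbors)

# Simulated reference and query data, as in the Annoy examples.
x <- matrix(runif(10000 * 20), ncol=20)
q <- matrix(runif(1000 * 20), ncol=20)

# Neighbors of each point within the same dataset.
hout <- findKNN(x, k=10, BNPARAM=HnswParam())

# Neighbors in 'x' for each query point in 'q'.
hqout <- queryKNN(x, q, k=5, BNPARAM=HnswParam())

head(hout$index)
head(hqout$distance)
```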
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
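The remaining options (subset, get.distance and BPPARAM) can be sketched in the same way. This example assumes that the BiocParallel package is installed and that forking via MulticoreParam() is available on the platform; on Windows, SnowParam() would be the usual substitute. The object names are illustrative.

```r
library(BiocNeighbors)
library(BiocParallel)

x <- matrix(runif(10000 * 20), ncol=20)

# Only report neighbors for the first 10 points.
sub <- findKNN(x, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub$index)

# Skip the distance matrix when only neighbor identities are needed.
nodist <- findKNN(x, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(nodist)

# Spread the search over two workers (assumes forking is supported).
par.out <- findKNN(x, k=5, BNPARAM=AnnoyParam(), BPPARAM=MulticoreParam(2))
dim(par.out$index)
```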
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
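A minimal sketch of a Manhattan search follows, together with a sanity check that the reported distance to the first neighbor agrees with a manual L1 calculation. The tolerance of 1e-4 is an assumption made here to allow for reduced floating-point precision inside the index; the object names are illustrative.

```r
library(BiocNeighbors)

x <- matrix(runif(1000 * 20), ncol=20)

# Approximate search using the Manhattan (L1) distance.
mout <- findKNN(x, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the L1 distance between point 1 and its first reported neighbor.
j <- mout$index[1, 1]
manual <- sum(abs(x[1,] - x[j,]))

# Compare with the distance reported by the search.
isTRUE(all.equal(manual, mout$distance[1, 1], tolerance=1e-4))
```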
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the TMPDIR environment variable to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpJWVm2v/file7109568b367.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
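The sketch below illustrates writing an Annoy index to a chosen path and cleaning it up afterwards. It assumes that the Annoy-specific builder buildAnnoy() accepts the file path through an fname argument, which should be confirmed against the documentation for the installed version. The file name my_annoy.idx is hypothetical, and tempdir() is used only so that the example leaves nothing behind; a persistent index would need a directory outside tempdir().

```r
library(BiocNeighbors)

x <- matrix(runif(1000 * 20), ncol=20)

# Hypothetical index location; replace with a persistent path as needed.
idx.path <- file.path(tempdir(), "my_annoy.idx")

# Build the index at the chosen path ('fname' is assumed, see above).
pre2 <- buildAnnoy(x, fname=idx.path)
AnnoyIndex_path(pre2)

# Re-use the pre-built index for a search.
out3 <- findKNN(BNINDEX=pre2, k=5)

# Remove the index file once all calculations are complete.
unlink(idx.path)
```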
sessionInfo()
## R version 4.1.0 RC (2021-05-10 r80283)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.10.0 knitr_1.33 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.38.0
## [4] BiocParallel_1.26.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.30.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2