BiocNeighbors 1.14.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 4042 7856 4238 8299 1766 5796 5621 6000 9152 6617
## [2,] 2651 604 6064 5111 4753 7231 6854 8954 1729 1971
## [3,] 9193 1346 5334 6023 6992 331 6063 61 37 6677
## [4,] 4759 6542 4552 7614 1768 8657 960 4970 9903 1503
## [5,] 5590 5862 7763 1951 3114 2410 6985 7746 3644 444
## [6,] 1504 4662 9109 5744 188 357 1513 3972 4232 9816
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8239768 0.8296127 0.8839809 0.8971590 0.8987589 0.8999861 0.9078665
## [2,] 0.6472155 0.9567929 0.9744633 0.9749049 0.9864627 0.9981125 1.0008955
## [3,] 0.9584417 1.0204414 1.0658659 1.0752560 1.0907028 1.0965564 1.0992800
## [4,] 0.8943398 0.9228672 0.9440770 0.9861521 0.9943904 1.0246195 1.0302318
## [5,] 0.9527171 0.9658626 0.9683394 1.0050453 1.0252653 1.0646908 1.0674213
## [6,] 1.0222392 1.1300329 1.1315137 1.1404370 1.1887219 1.1944500 1.1952969
## [,8] [,9] [,10]
## [1,] 0.9312438 0.9337491 0.9372119
## [2,] 1.0061921 1.0160054 1.0245962
## [3,] 1.1149708 1.1175532 1.1233436
## [4,] 1.0326362 1.0623778 1.0674558
## [5,] 1.0696968 1.0736390 1.0765567
## [6,] 1.1991218 1.2015804 1.2017171
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9175 6051 5461 6733 5945
## [2,] 1786 3211 2969 3911 8359
## [3,] 8263 138 4407 7105 681
## [4,] 2623 655 4523 8536 2800
## [5,] 6097 4074 6542 4103 3086
## [6,] 8852 882 3304 2622 6612
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9157733 0.9319692 0.9942271 1.019294 1.053326
## [2,] 0.9436610 1.0530164 1.0871286 1.090240 1.091675
## [3,] 0.9417372 1.0759110 1.0840186 1.111516 1.143231
## [4,] 1.0033902 1.0060449 1.0133778 1.015667 1.024746
## [5,] 0.8307928 0.9824422 0.9932968 1.119641 1.130494
## [6,] 0.9663057 1.0589536 1.1283503 1.130326 1.143419
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.15-bioc\\tmpdir\\RtmpED9viq\\file3b705af469e.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 20348)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.14.0 knitr_1.38 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 magrittr_2.0.3 BiocGenerics_0.42.0
## [4] BiocParallel_1.30.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.2.0 parallel_4.2.0 grid_4.2.0
## [13] xfun_0.30 cli_3.3.0 jquerylib_0.1.4
## [16] htmltools_0.5.2 yaml_2.3.5 digest_0.6.29
## [19] bookdown_0.26 Matrix_1.4-1 BiocManager_1.30.17
## [22] S4Vectors_0.34.0 sass_0.4.1 evaluate_0.15
## [25] rmarkdown_2.14 stringi_1.7.6 compiler_4.2.0
## [28] bslib_0.3.1 stats4_4.2.0 jsonlite_1.8.0