TileDBArray 1.2.1
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
Creating a TileDBArray is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.02493567 -0.06813714 -0.61287869 . 1.14287741 1.10965186
## [2,] -1.37066931 -2.32875989 -1.15545691 . 0.10657002 1.89199271
## [3,] -0.91428619 0.46166392 1.42954891 . -1.41940020 -0.47717514
## [4,] -0.11493400 1.27049143 1.04680415 . 0.03078511 0.42449471
## [5,] 1.29993491 0.34148876 -0.87254952 . 0.62386948 0.09848873
## ... . . . . . .
## [96,] 1.26437706 0.86259918 -0.66980090 . -0.7034460 -0.6499060
## [97,] 0.19829676 -0.76734639 -0.15346548 . -0.1363722 -0.3365438
## [98,] -0.24915771 -0.71702292 0.93239525 . -0.4765130 1.1528684
## [99,] 0.08230368 -1.25379843 -0.26331820 . 0.2748773 0.2841799
## [100,] 0.72515334 0.65160940 1.00606481 . -0.4198069 -0.8818134
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.02493567 -0.06813714 -0.61287869 . 1.14287741 1.10965186
## [2,] -1.37066931 -2.32875989 -1.15545691 . 0.10657002 1.89199271
## [3,] -0.91428619 0.46166392 1.42954891 . -1.41940020 -0.47717514
## [4,] -0.11493400 1.27049143 1.04680415 . 0.03078511 0.42449471
## [5,] 1.29993491 0.34148876 -0.87254952 . 0.62386948 0.09848873
## ... . . . . . .
## [96,] 1.26437706 0.86259918 -0.66980090 . -0.7034460 -0.6499060
## [97,] 0.19829676 -0.76734639 -0.15346548 . -0.1363722 -0.3365438
## [98,] -0.24915771 -0.71702292 0.93239525 . -0.4765130 1.1528684
## [99,] 0.08230368 -1.25379843 -0.26331820 . 0.2748773 0.2841799
## [100,] 0.72515334 0.65160940 1.00606481 . -0.4198069 -0.8818134
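As a quick sanity check on either route, we can pull the stored values back into memory and compare them with the original matrix. The sketch below assumes that as.matrix() realizes the whole array in memory, which is only sensible for small inputs like this one; the seed value and the variable names tdb and roundtrip are purely illustrative.

```r
library(TileDBArray)

set.seed(1)  # illustrative seed, only so that the example is reproducible
X <- matrix(rnorm(1000), ncol=10)

# Write the matrix to a TileDB backend in a temporary location.
tdb <- writeTileDBArray(X)

# Realize the on-disk array back into an ordinary in-memory matrix
# and compare it with the input.
roundtrip <- as.matrix(tdb)
all.equal(roundtrip, X)
```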
This process also works for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 0
## [4,] 0 0 0 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [996,] 0 0 0 . 0 0
## [997,] 0 0 0 . 0 0
## [998,] 0 0 0 . 0 0
## [999,] 0 0 0 . 0 0
## [1000,] 0 0 0 . 0 0
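To confirm that sparsity is carried through to the backend, we can query the result with is_sparse() from DelayedArray. This is a minimal sketch that assumes the installed DelayedArray version exports is_sparse(); the names sparse_tdb and dense_tdb are illustrative.

```r
library(TileDBArray)

set.seed(2)  # illustrative seed
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)

# A sparse input should yield a sparse TileDB-backed array ...
sparse_tdb <- writeTileDBArray(Y)
is_sparse(sparse_tdb)

# ... whereas an ordinary dense matrix should not.
dense_tdb <- writeTileDBArray(matrix(rnorm(100), ncol=10))
is_sparse(dense_tdb)
```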
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . FALSE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] FALSE FALSE FALSE . FALSE FALSE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 0.02493567 -0.06813714 -0.61287869 . 1.14287741 1.10965186
## GENE_2 -1.37066931 -2.32875989 -1.15545691 . 0.10657002 1.89199271
## GENE_3 -0.91428619 0.46166392 1.42954891 . -1.41940020 -0.47717514
## GENE_4 -0.11493400 1.27049143 1.04680415 . 0.03078511 0.42449471
## GENE_5 1.29993491 0.34148876 -0.87254952 . 0.62386948 0.09848873
## ... . . . . . .
## GENE_96 1.26437706 0.86259918 -0.66980090 . -0.7034460 -0.6499060
## GENE_97 0.19829676 -0.76734639 -0.15346548 . -0.1363722 -0.3365438
## GENE_98 -0.24915771 -0.71702292 0.93239525 . -0.4765130 1.1528684
## GENE_99 0.08230368 -1.25379843 -0.26331820 . 0.2748773 0.2841799
## GENE_100 0.72515334 0.65160940 1.00606481 . -0.4198069 -0.8818134
TileDBArrays are simply DelayedArray objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## GENE_1 GENE_2 GENE_3 GENE_4 GENE_5 GENE_6
## 0.02493567 -1.37066931 -0.91428619 -0.11493400 1.29993491 0.47817405
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 0.024935673 -0.068137142 -0.612878693 -1.980013958 -0.573325848
## GENE_2 -1.370669314 -2.328759891 -1.155456913 0.008304827 0.694028469
## GENE_3 -0.914286189 0.461663921 1.429548907 -0.445354507 -0.353929032
## GENE_4 -0.114933998 1.270491429 1.046804152 0.874761810 -1.617543651
## GENE_5 1.299934906 0.341488763 -0.872549522 -0.937829952 0.172763592
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 0.04987135 -0.13627428 -1.22575739 . 2.28575482 2.21930372
## GENE_2 -2.74133863 -4.65751978 -2.31091383 . 0.21314004 3.78398543
## GENE_3 -1.82857238 0.92332784 2.85909781 . -2.83880040 -0.95435028
## GENE_4 -0.22986800 2.54098286 2.09360830 . 0.06157023 0.84898942
## GENE_5 2.59986981 0.68297753 -1.74509904 . 1.24773895 0.19697747
## ... . . . . . .
## GENE_96 2.5287541 1.7251984 -1.3396018 . -1.4068919 -1.2998120
## GENE_97 0.3965935 -1.5346928 -0.3069310 . -0.2727444 -0.6730877
## GENE_98 -0.4983154 -1.4340458 1.8647905 . -0.9530260 2.3057369
## GENE_99 0.1646074 -2.5075969 -0.5266364 . 0.5497547 0.5683597
## GENE_100 1.4503067 1.3032188 2.0121296 . -0.8396139 -1.7636268
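One way to see the delayed behaviour is to inspect the pending operations with showtree() from DelayedArray and then realize the result explicitly. The following is a sketch rather than a recommended workflow: as.matrix() loads everything into memory, and the names doubled and realized are illustrative.

```r
library(TileDBArray)

set.seed(3)  # illustrative seed
X <- matrix(rnorm(1000), ncol=10)
out <- as(X, "TileDBArray")

# The multiplication is recorded, not executed against the backend.
doubled <- out * 2
class(doubled)

# Print the tree of delayed operations sitting on top of the TileDB seed.
showtree(doubled)

# Values are only computed when explicitly requested.
realized <- as.matrix(doubled)
all.equal(realized, X * 2)
```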
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6 SAMP_7
## 5.871220 -11.289699 -4.585060 -3.733275 -8.348973 14.564894 9.725526
## SAMP_8 SAMP_9 SAMP_10
## -6.764044 -5.703727 7.986859
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 0.1092489
## GENE_2 -1.4180811
## GENE_3 1.2549715
## GENE_4 4.2807866
## GENE_5 0.5116300
## ... .
## GENE_96 -0.802353
## GENE_97 -2.198789
## GENE_98 -2.112606
## GENE_99 -2.809833
## GENE_100 1.647648
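These results can be checked against the same computations on the in-memory matrix. The sketch below uses a matrix without dimension names and compares values only, since the attributes attached to the two results are not guaranteed to match; the names v, tdb_prod and mem_prod are illustrative.

```r
library(TileDBArray)

set.seed(4)  # illustrative seed
X <- matrix(rnorm(1000), ncol=10)
out <- as(X, "TileDBArray")

# Column sums computed from the TileDB backend versus in memory.
all.equal(colSums(out), colSums(X), check.attributes=FALSE)

# Matrix product with a fixed vector, so both sides use the same weights.
v <- runif(ncol(out))
tdb_prod <- as.vector(as.matrix(out %*% v))
mem_prod <- as.vector(X %*% v)
all.equal(tdb_prod, mem_prod)
```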
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray().
For example, the code below controls the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.2175741 -0.4621092 0.4299813 . -1.7061255 -1.9288403
## [2,] 0.6955273 1.0614182 0.1420658 . -0.6925629 0.6127065
## [3,] 0.2783070 -0.2958901 -0.3278123 . 1.9606334 -0.3289870
## [4,] -0.6340023 -0.9678723 -1.1508408 . 0.1264930 0.5503688
## [5,] -0.9717764 -0.3981920 -0.5088833 . 0.1895312 1.0342039
## ... . . . . . .
## [96,] 0.53104118 0.78074564 -0.42852046 . -0.8877758 -1.2113294
## [97,] -0.57042574 1.05277439 -1.90333613 . 1.6000215 -0.1859337
## [98,] -1.59427296 0.51504530 -0.82834998 . 0.6485175 -0.4266327
## [99,] 0.05654771 0.33282495 0.93416361 . 0.9780190 -0.1698558
## [100,] 0.07441734 -0.32873663 1.09843974 . 1.0647916 0.1468263
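Knowing the path and attribute name is useful because the same backend can be reopened later. The sketch below assumes that the TileDBArray() constructor accepts a path together with an attr argument; check the package documentation for the installed version before relying on this. The names written and reopened are illustrative.

```r
library(TileDBArray)

set.seed(5)  # illustrative seed
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()

# Write the data to a known location under a custom attribute name.
written <- writeTileDBArray(X, path=path, attr="WHEE")

# Reopen the existing backend from its path (assumed constructor usage).
reopened <- TileDBArray(path, attr="WHEE")
all.equal(as.matrix(reopened), X)
```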
As these arguments cannot be passed during coercion, we instead provide global variables that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.2175741 -0.4621092 0.4299813 . -1.7061255 -1.9288403
## [2,] 0.6955273 1.0614182 0.1420658 . -0.6925629 0.6127065
## [3,] 0.2783070 -0.2958901 -0.3278123 . 1.9606334 -0.3289870
## [4,] -0.6340023 -0.9678723 -1.1508408 . 0.1264930 0.5503688
## [5,] -0.9717764 -0.3981920 -0.5088833 . 0.1895312 1.0342039
## ... . . . . . .
## [96,] 0.53104118 0.78074564 -0.42852046 . -0.8877758 -1.2113294
## [97,] -0.57042574 1.05277439 -1.90333613 . 1.6000215 -0.1859337
## [98,] -1.59427296 0.51504530 -0.82834998 . 0.6485175 -0.4266327
## [99,] 0.05654771 0.33282495 0.93416361 . 0.9780190 -0.1698558
## [100,] 0.07441734 -0.32873663 1.09843974 . 1.0647916 0.1468263
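Because the setting is global, it is worth restoring the default once it is no longer needed. The sketch below assumes that getTileDBPath() reports the current setting and that setTileDBPath(NULL) unsets it; both are assumptions about the package's global accessors and should be confirmed in its help pages. The names arr and created are illustrative.

```r
library(TileDBArray)

set.seed(6)  # illustrative seed
X <- matrix(rnorm(1000), ncol=10)
path2 <- tempfile()

# Direct all subsequent coercions to path2.
setTileDBPath(path2)
getTileDBPath()

arr <- as(X, "TileDBArray")

# The backend should now exist at the requested location.
created <- file.exists(path2)
created

# Unset the global path (assumed behaviour) so later coercions
# fall back to the default location.
setTileDBPath(NULL)
```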
sessionInfo()
## R version 4.1.0 RC (2021-05-10 r80283)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TileDBArray_1.2.1 DelayedArray_0.18.0 IRanges_2.26.0
## [4] S4Vectors_0.30.0 MatrixGenerics_1.4.0 matrixStats_0.58.0
## [7] BiocGenerics_0.38.0 Matrix_1.3-3 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 bslib_0.2.5.1 compiler_4.1.0
## [4] BiocManager_1.30.15 jquerylib_0.1.4 tools_4.1.0
## [7] digest_0.6.27 bit_4.0.4 jsonlite_1.7.2
## [10] evaluate_0.14 lattice_0.20-44 nanotime_0.3.2
## [13] rlang_0.4.11 RcppCCTZ_0.2.9 yaml_2.2.1
## [16] xfun_0.23 stringr_1.4.0 knitr_1.33
## [19] sass_0.4.0 bit64_4.0.5 grid_4.1.0
## [22] R6_2.5.0 rmarkdown_2.8 bookdown_0.22
## [25] tiledb_0.9.2 magrittr_2.0.1 htmltools_0.5.1.1
## [28] stringi_1.6.2 zoo_1.8-9