1 Overview

beachmat provides a few useful utilities outside of its C++ API. This document describes how to use them.

2 Choosing HDF5 chunk dimensions

Given the dimensions of a matrix, the getBestChunkDims() function chooses HDF5 chunk dimensions that give fast access along both rows and columns.

library(beachmat)
nrows <- 10000
ncols <- 200
getBestChunkDims(c(nrows, ncols))
## [1] 708  15
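
To see why these dimensions balance the two access patterns, the sketch below counts how many chunks a single full row or full column overlaps. It is an illustration in base R only: chunks_touched() is a hypothetical helper, not part of beachmat, and it assumes a regular chunk grid anchored at the matrix origin.

```r
# Hypothetical helper (not part of beachmat): count the chunks overlapped
# by one full row and by one full column of a chunked matrix.
chunks_touched <- function(dims, chunk_dims) {
    grid <- ceiling(dims / chunk_dims)  # number of chunks along each dimension
    c(per_row = grid[2], per_column = grid[1])
}

dims <- c(10000, 200)
chunks_touched(dims, c(708, 15))    # per_row 14, per_column 15
chunks_touched(dims, c(1, 200))     # pure row chunks: per_row 1, per_column 10000
chunks_touched(dims, c(10000, 1))   # pure column chunks: per_row 200, per_column 1
```

With the 708 x 15 chunks reported above, a row and a column each overlap a similar number of chunks, whereas purely row- or column-based chunks favour one access pattern at the expense of the other.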

In the future, it should be possible to feed these chunk dimensions back into the API. Currently, if chunk dimensions are not specified in the C++ code, the API will retrieve them from R via the getHDF5DumpChunkDim() function from HDF5Array. The aim is to also provide a setHDF5DumpChunkDim() function so that any chunk dimensions specified in R will be respected.

3 Rechunking an HDF5 file

The most common access patterns for matrices (at least, for high-throughput biological data) are by row or by column. The rechunkByMargins() function takes an HDF5 file and converts it to use purely row- or column-based chunks.

library(HDF5Array)
A <- as(matrix(runif(5000), nrow=100, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
byrow
## <100 x 50> HDF5Matrix object of type "double":
##              [,1]       [,2]       [,3] ...       [,49]       [,50]
##   [1,] 0.12486206 0.20868439 0.29459917   .  0.74968205  0.06714846
##   [2,] 0.02607930 0.69304142 0.15390463   .  0.02529415  0.87960745
##   [3,] 0.24767370 0.66030521 0.83139350   .  0.05757601  0.40336922
##   [4,] 0.75944610 0.60733397 0.10299036   .  0.97410812  0.84616836
##   [5,] 0.05319507 0.80794798 0.23957735   .  0.78224182  0.40128613
##    ...          .          .          .   .           .           .
##  [96,] 0.99737888 0.47785621 0.34327972   . 0.007512688 0.797500690
##  [97,] 0.62765806 0.70956984 0.95881573   . 0.705777390 0.720051073
##  [98,] 0.58004488 0.05857615 0.14267108   . 0.379293296 0.097039901
##  [99,] 0.07213459 0.81417975 0.28473076   . 0.948091235 0.111810020
## [100,] 0.20245589 0.97955223 0.74598644   . 0.981190281 0.003867964
bycol <- rechunkByMargins(A, byrow=FALSE)
bycol
## <100 x 50> HDF5Matrix object of type "double":
##              [,1]       [,2]       [,3] ...       [,49]       [,50]
##   [1,] 0.12486206 0.20868439 0.29459917   .  0.74968205  0.06714846
##   [2,] 0.02607930 0.69304142 0.15390463   .  0.02529415  0.87960745
##   [3,] 0.24767370 0.66030521 0.83139350   .  0.05757601  0.40336922
##   [4,] 0.75944610 0.60733397 0.10299036   .  0.97410812  0.84616836
##   [5,] 0.05319507 0.80794798 0.23957735   .  0.78224182  0.40128613
##    ...          .          .          .   .           .           .
##  [96,] 0.99737888 0.47785621 0.34327972   . 0.007512688 0.797500690
##  [97,] 0.62765806 0.70956984 0.95881573   . 0.705777390 0.720051073
##  [98,] 0.58004488 0.05857615 0.14267108   . 0.379293296 0.097039901
##  [99,] 0.07213459 0.81417975 0.28473076   . 0.948091235 0.111810020
## [100,] 0.20245589 0.97955223 0.74598644   . 0.981190281 0.003867964

Rechunking can provide a substantial speed-up to downstream functions, especially those requiring access to random columns or rows. Indeed, the time saved in those functions often offsets the time spent in constructing a new HDF5Matrix.
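
The size of the potential saving can be estimated with some simple arithmetic. The sketch below computes how many values must be pulled from disk to read one full column of the 100 x 50 matrix used above. It rests on assumptions not stated in this document: values_read_per_column() is a hypothetical helper, whole chunks are assumed to be read with no caching between requests, and the row and column chunks are assumed to have shapes 1 x 50 and 100 x 1 respectively, which may differ from what rechunkByMargins() actually chooses.

```r
# Hypothetical helper (not part of beachmat): values read from disk to
# obtain one full column, assuming every overlapping chunk is read whole.
values_read_per_column <- function(dims, chunk_dims) {
    chunks_hit <- ceiling(dims[1] / chunk_dims[1])  # chunks stacked down a column
    chunks_hit * prod(chunk_dims)
}

dims <- c(100, 50)
row_chunks <- values_read_per_column(dims, c(1, 50))    # assumed row-based layout
col_chunks <- values_read_per_column(dims, c(100, 1))   # assumed column-based layout

row_chunks                # 5000: every chunk in the file is read
col_chunks                # 100: only the requested values are read
row_chunks / col_chunks   # 50-fold more data read with the mismatched layout
```

Under these assumptions, a mismatched layout reads the entire matrix for every column request, which is the overhead that rechunking is meant to avoid.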

4 Session information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] HDF5Array_1.8.0     rhdf5_2.24.0        DelayedArray_0.6.0 
##  [4] BiocParallel_1.14.0 IRanges_2.14.0      S4Vectors_0.18.0   
##  [7] BiocGenerics_0.26.0 matrixStats_0.53.1  beachmat_1.2.0     
## [10] knitr_1.20          BiocStyle_2.8.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16    magrittr_1.5    stringr_1.3.0   tools_3.5.0    
##  [5] xfun_0.1        htmltools_0.3.6 yaml_2.1.18     rprojroot_1.3-2
##  [9] digest_0.6.15   bookdown_0.7    Rhdf5lib_1.2.0  evaluate_0.10.1
## [13] rmarkdown_1.9   stringi_1.1.7   compiler_3.5.0  backports_1.1.2