h5mread {HDF5Array}    R Documentation

An alternative to rhdf5::h5read

Description

h5mread is the result of experimenting with alternative rhdf5::h5read implementations.

It should still be considered experimental!

Usage

h5mread(filepath, name, starts, counts=NULL, noreduce=FALSE,
        as.integer=FALSE, method=0L)

get_h5mread_returned_type(filepath, name, as.integer=FALSE)

Arguments

filepath

The path (as a single character string) to the HDF5 file where the dataset to read from is located.

name

The name of the dataset in the HDF5 file.

starts, counts

Two lists specifying the array selection. Each list must have one list element per dimension in the dataset.

Each list element in starts must be a vector of valid positive indices along the corresponding dimension in the dataset. An empty vector (integer(0)) is accepted and indicates an empty selection along that dimension. A NULL is also accepted and indicates a full selection along that dimension, so it has the same meaning as a missing subscript when subsetting an array-like object with [. (Note that this differs from [, where a NULL subscript indicates an empty selection.)

Each list element in counts must be NULL or a vector of non-negative integers of the same length as the corresponding list element in starts. Each value in the vector indicates how many positions to select starting from the associated start value. A NULL indicates that a single position is selected for each value along the corresponding dimension.

If counts is NULL, then each index in each starts list element indicates a single position selection along the corresponding dimension. Note that in this case the starts argument is equivalent to the index argument of h5read and extract_array (with the caveat that h5read doesn't accept empty selections).

Finally note that when counts is not NULL then the selection described by starts and counts must be strictly ascending along each dimension.
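
The selection rules above can be illustrated in base R. The following is only a sketch: expand_selection is a hypothetical helper, not part of HDF5Array, and it makes no claim about how h5mread is implemented internally. It expands the starts and counts given for one dimension into the explicit positions they describe, assuming the semantics stated above.

```r
## Hypothetical helper (NOT part of HDF5Array): expand the 'starts' and
## 'counts' list elements given for ONE dimension into explicit positions.
expand_selection <- function(start, count = NULL, dim_length) {
    if (is.null(start))
        return(seq_len(dim_length))   # NULL start: full selection
    if (is.null(count))
        return(as.integer(start))     # NULL count: one position per start
    stopifnot(length(count) == length(start))
    ## Each (start, count) pair selects start, start+1, ..., start+count-1.
    as.integer(unlist(Map(function(s, n) s + seq_len(n) - 1L, start, count)))
}

m1 <- matrix(1:60, ncol=6)

## starts=list(c(2, 7), NULL) with counts=list(c(4, 2), NULL):
rows <- expand_selection(c(2, 7), c(4, 2), nrow(m1))
cols <- expand_selection(NULL, NULL, ncol(m1))
stopifnot(identical(rows, c(2:5, 7:8)))
stopifnot(identical(m1[rows, cols], m1[c(2:5, 7:8), ]))

## When counts is supplied, the expanded selection must be strictly
## ascending along the dimension:
stopifnot(!is.unsorted(rows, strictly=TRUE))

## An empty start vector gives an empty selection:
stopifnot(identical(expand_selection(integer(0), NULL, nrow(m1)), integer(0)))
```

The expanded rows match the selection m1[c(2:5, 7:8), ] used in the Examples section.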

noreduce

TODO

as.integer

TODO

method

TODO

Details

COMING SOON...

Value

h5mread returns an array.

get_h5mread_returned_type returns the type of the array that h5mread would return. It is equivalent to:

  typeof(h5mread(filepath, name, rep(list(integer(0)), ndim)))

where ndim is the number of dimensions (a.k.a. the rank in HDF5 jargon) of the dataset. get_h5mread_returned_type is provided for convenience.
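
As a sketch of this equivalence, the snippet below compares the two approaches on a small dataset. It assumes the HDF5Array version documented here is installed, and it reuses writeHDF5Array and path as they are used in the Examples section. The variable names are illustrative only.

```r
library(HDF5Array)

## Write a small matrix to an HDF5 file, as in the Examples section.
m1 <- matrix(1:60, ncol=6)
M1 <- writeHDF5Array(m1, name="M1")
ndim <- length(dim(m1))  # rank of the dataset

## Type obtained by reading an empty selection along every dimension ...
type1 <- typeof(h5mread(path(M1), "M1", rep(list(integer(0)), ndim)))

## ... should agree with what get_h5mread_returned_type() reports.
type2 <- get_h5mread_returned_type(path(M1), "M1")
stopifnot(identical(type1, type2))
```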

See Also

Examples

## ---------------------------------------------------------------------
## BASIC USAGE
## ---------------------------------------------------------------------
library(HDF5Array)
library(rhdf5)  # provides h5ls() and h5read()

m0 <- matrix((runif(600) - 0.5) * 10, ncol=12)
M0 <- writeHDF5Array(m0, name="M0")

m <- h5mread(path(M0), "M0", starts=list(NULL, c(3, 12:8)))
stopifnot(identical(m0[ , c(3, 12:8)], m))

m <- h5mread(path(M0), "M0", starts=list(integer(0), c(3, 12:8)))
stopifnot(identical(m0[NULL , c(3, 12:8)], m))

## With as.integer=TRUE the result matches m0 after coercing its
## storage mode to integer:
m <- h5mread(path(M0), "M0", starts=list(1:5, NULL), as.integer=TRUE)
storage.mode(m0) <- "integer"
stopifnot(identical(m0[1:5, ], m))

m1 <- matrix(1:60, ncol=6)
M1 <- writeHDF5Array(m1, filepath=path(M0), name="M1")
h5ls(path(M1))

m <- h5mread(path(M1), "M1", starts=list(c(2, 7), NULL),
                             counts=list(c(4, 2), NULL))
stopifnot(identical(m1[c(2:5, 7:8), ], m))

## ---------------------------------------------------------------------
## PERFORMANCE
## ---------------------------------------------------------------------
library(ExperimentHub)
hub <- ExperimentHub()

## With the "sparse" TENxBrainData dataset
## ---------------------------------------
fname0 <- hub[["EH1039"]]
h5ls(fname0)  # all datasets are 1D datasets

index <- list(77 * sample(34088679, 5000, replace=TRUE))
## h5mread() about 3x faster than h5read():
system.time(a <- h5mread(fname0, "mm10/data", index))
system.time(b <- h5read(fname0, "mm10/data", index=index))
stopifnot(identical(a, b))

index <- list(sample(1306127, 7500, replace=TRUE))
## h5mread() about 20x faster than h5read():
system.time(a <- h5mread(fname0, "mm10/barcodes", index))
system.time(b <- h5read(fname0, "mm10/barcodes", index=index))
stopifnot(identical(a, b))

## With the "dense" TENxBrainData dataset
## ---------------------------------------
fname1 <- hub[["EH1040"]]
h5ls(fname1)  # "counts" is a 2D dataset

index <- list(sample(  27998, 250, replace=TRUE),
              sample(1306127, 250, replace=TRUE))
## h5mread() about 2x faster than h5read():
system.time(a <- h5mread(fname1, "counts", index))
system.time(b <- h5read(fname1, "counts", index=index))
stopifnot(identical(a, b))

## The bigger the selection, the greater the speedup of h5mread()
## over h5read():
## Not run: 
  index <- list(sample(  27998, 1000, replace=TRUE),
                sample(1306127, 1000, replace=TRUE))
  ## h5mread() about 8x faster than h5read() (22s vs 3min):
  system.time(a <- h5mread(fname1, "counts", index))
  system.time(b <- h5read(fname1, "counts", index=index))
  stopifnot(identical(a, b))

## End(Not run)

[Package HDF5Array version 1.12.1 Index]