CreateBrick {HiCBricks} | R Documentation |
CreateBrick
creates the complete HDF5 on-disk data structure
CreateBrick(ChromNames, BinTable, bin.delim = "\t", col.index = c(1, 2, 3), impose.discontinuity = TRUE, ChunkSize = NULL, Output.Filename, exec = "cat", remove.existing = FALSE)
ChromNames |
Required. A character vector containing the chromosomes to be considered for the dataset. This vector is used to verify the presence of all chromosomes in the provided bintable. |
BinTable |
Required. A string containing the path to the file to load as the binning table for the Hi-C experiment. The number of entries per chromosome defines the dimension of the associated Hi-C data matrices. For example, if chr1 contains 250 entries in the binning table, the cis Hi-C data matrix for chr1 will be expected to contain 250 rows and 250 cols. Similarly, if the same binning table contains 150 entries for chr2, the trans Hi-C matrix for chr1,chr2 will be expected to contain 250 rows and 150 cols. There are no constraints on the bintable format. As long as the table is in a delimited format, the corresponding table columns can be specified with the associated parameters. The columns of importance are chr, start and end. It is recommended to always use binning tables where the end and start of consecutive ranges are not the same. If they are the same, this may lead to unexpected behaviour when using the GenomicRanges "any" overlap function. |
bin.delim |
Optional. Defaults to tabs. A character vector of length 1 specifying the delimiter used in the file containing the binning table. |
col.index |
Optional. Default "c(1,2,3)". A numeric vector of length 3 containing the indexes of the required columns in the binning table. The first index corresponds to the chr column, the second to the start column and the third to the end column. |
impose.discontinuity |
Optional. Default TRUE. If TRUE, a check is performed to make sure that the end and start coordinates of consecutive entries are not the same per chromosome. |
ChunkSize |
Optional. A numeric vector of length 1. If provided, the HDF dataset will use this value as the chunk size, for all matrices. By default, the ChunkSize is set to matrix dimensions/100. |
Output.Filename |
Required. A string specifying the location and name of the HDF file to create. If a path is not provided, the file will be created in the Bioc File Cache. Otherwise, it will be created in the specified directory and tracked via Bioc File Cache. |
exec |
Optional. Default cat. A string specifying the program or expression to use for reading the file. For bz2 files, use bzcat and for gzipped files use zcat. |
remove.existing |
Optional. Default FALSE. If TRUE, will remove the HDF file with the same name and create a new one. By default, it will not replace existing files. |
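The relationship between the binning table, col.index and the expected matrix dimensions can be illustrated with a small base R sketch. The toy table and the variable names below are hypothetical and are not part of HiCBricks; the sketch only mirrors the rule stated for BinTable (entries per chromosome define the matrix dimensions) and does not reproduce the package's internal code.

```r
# Toy binning table with an extra leading column, so that chr, start and
# end sit in columns 2, 3 and 4 (hypothetical data, not from the package).
bintable <- data.frame(
    id    = 1:5,
    chr   = c("chr1", "chr1", "chr1", "chr2", "chr2"),
    start = c(1, 40001, 80001, 1, 40001),
    end   = c(40000, 80000, 120000, 40000, 80000),
    stringsAsFactors = FALSE
)

# col.index picks the chr, start and end columns, in that order.
col.index <- c(2, 3, 4)
bins <- bintable[, col.index]
colnames(bins) <- c("chr", "start", "end")

# Number of bins per chromosome.
bins_per_chr <- lengths(split(bins$start, bins$chr))

# Expected dimensions: cis matrix for chr1, trans matrix for chr1 vs chr2.
cis_dim   <- c(bins_per_chr[["chr1"]], bins_per_chr[["chr1"]])
trans_dim <- c(bins_per_chr[["chr1"]], bins_per_chr[["chr2"]])

# Every chromosome requested must be present in the table, and vice versa.
ChromNames <- c("chr1", "chr2")
all_present <- setequal(ChromNames, unique(bins$chr))
```

Here chr1 has three bins and chr2 has two, so the cis matrix for chr1 would be 3 x 3 and the trans matrix for chr1,chr2 would be 3 x 2.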
This function creates the complete HDF data structure, loads the binning table associated to the Hi-C experiment and creates (for now) a 2D matrix layout for all chromosome pairs. Please note, the binning table must be a discontinuous one (first range end != second range start), as range overlaps using the "any" form will routinely identify adjacent ranges with the same end and start to be in the overlap. Therefore, this criterion is enforced as default behaviour.
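The discontinuity criterion can be sketched in base R as follows. The helper has_shared_boundaries is a hypothetical name introduced for illustration; it is not a HiCBricks function and is not claimed to match the check the package performs internally. It assumes the rows are already sorted by start within each chromosome.

```r
# Hypothetical helper: TRUE if, within any chromosome, the end of one
# range equals the start of the next range (a "continuous" bintable).
# Assumes rows are sorted by start within each chromosome.
has_shared_boundaries <- function(chr, start, end) {
    by_chr <- split(seq_along(chr), chr)
    any(vapply(by_chr, function(idx) {
        n <- length(idx)
        if (n < 2) {
            return(FALSE)
        }
        # Compare each end with the start of the following range.
        any(end[idx[-n]] == start[idx[-1]])
    }, logical(1)))
}

# Continuous ranges: 40000 is both an end and the next start.
continuous <- has_shared_boundaries(
    chr   = c("chr1", "chr1"),
    start = c(0, 40000),
    end   = c(40000, 80000)
)

# Discontinuous ranges: ends and starts are separated by 1.
discontinuous <- has_shared_boundaries(
    chr   = c("chr1", "chr1"),
    start = c(1, 40001),
    end   = c(40000, 80000)
)
```

A table for which this helper returns TRUE is the kind of input that impose.discontinuity = TRUE is described as rejecting.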
The structure of the HDF file is as follows: The structure contains three major groups which are then hierarchically nested with other groups to finally lead to the corresponding datasets.
Base.matrices - group - For storing Hi-C matrices
    chromosome - group
        chromosome - group
            attributes - attribute
                Filename - Name of the file
                Min - min value of Hi-C matrix
                Max - max value of Hi-C matrix
                sparsity - specifies if this is a sparse matrix
                distance - max distance of data from main diagonal
                Done - specifies if a matrix has been loaded
            matrix - dataset - contains the matrix
            bin.coverage - dataset - proportion of cells with values greater than 0
            row.sums - dataset - total sum of all values in a row
            sparsity - dataset - proportion of non-zero cells near the diagonal
Base.ranges - group - Ranges tables for quick and easy access. Additional ranges tables are added here under separate group names.
    Bintable - group - The main binning table associated to a Brick.
        ranges - dataset - Contains the three main columns chr, start and end.
        offsets - dataset - First occurrence of any given chromosome in the ranges dataset.
        lengths - dataset - Number of occurrences of that chromosome.
        chr.names - dataset - The chromosomes present in the given ranges table.
Base.metadata - group - A place to store metadata info
    chromosomes - dataset - Metadata information specifying the chromosomes present in this particular Brick file.
    other metadata tables.
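As an illustration of how the offsets, lengths and chr.names datasets relate to the ranges dataset, the following base R sketch derives them from a chromosome column. It assumes that all rows of a chromosome are contiguous and that offsets are 1-based row positions; the source does not state the indexing base actually stored in the file, so treat that as an assumption.

```r
# Chromosome column of a toy ranges table (hypothetical data).
chr <- c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3")

# Run-length encoding gives one run per chromosome when rows are contiguous.
runs <- rle(chr)
chr.names   <- runs$values
chr_lengths <- runs$lengths

# First row of each chromosome (1-based, an assumption of this sketch).
offsets <- cumsum(c(1, chr_lengths))[seq_along(chr_lengths)]

# With an offset and a length, the rows of one chromosome can be sliced
# without scanning the whole table.
chr2_rows <- seq(offsets[2], length.out = chr_lengths[2])
```

Storing an offset and a length per chromosome is what makes it possible to read only the relevant slice of the ranges dataset for a given chromosome.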
This function will generate the target Brick file. Upon completion, the function will provide the path to the created/tracked HDF file.
Bintable.path <- system.file("extdata", "Bintable_40kb.txt",
    package = "HiCBricks")
Chromosomes <- "chr19"
Path_to_cached_file <- CreateBrick(ChromNames = Chromosomes,
    BinTable = Bintable.path, bin.delim = " ",
    Output.Filename = file.path(tempdir(), "test.hdf"),
    exec = "cat", remove.existing = TRUE)

## Not run:
Bintable.path <- system.file("extdata", "Bintable_40kb.txt",
    package = "HiCBricks")
Chromosomes <- c("chr19", "chr20", "chr22", "chr21")
Path_to_cached_file <- CreateBrick(ChromNames = Chromosomes,
    BinTable = Bintable.path, impose.discontinuity = TRUE,
    col.index = c(1, 2, 3),
    Output.Filename = file.path(tempdir(), "test.hdf"),
    exec = "cat", remove.existing = TRUE)

# This will cause an error, as the file located at Bintable.path contains
# coordinates for chromosome 19 only. For this code to work, either all
# other chromosomes need to be removed from the Chromosomes variable, or
# coordinate information for the other chromosomes needs to be provided.
# The converse is also true: if the Bintable contains data for other
# chromosomes that are not listed in ChromNames, this will cause an error.
#
# Keep in mind that if the end coordinates and start coordinates of
# adjacent ranges are not separated by at least a value of 1, then
# impose.discontinuity = TRUE will likely cause an error to occur. This
# may seem strict, but GenomicRanges by default will consider an overlap
# of 1 bp as an overlap. Therefore, to be certain that ranges which should
# not be targeted are not targeted during retrieval operations, a check is
# made to ensure that adjacent ends and starts are not overlapping. To
# load continuous ranges, use impose.discontinuity = FALSE.
#
# Also note that col.index determines which columns to use for chr, start
# and end. The original binning table may therefore have 10 or 20 columns;
# with col.index = c(1, 2, 3), only the first three are used, in the order
# chr, start and end.
## End(Not run)