seqVCF2GDS {SeqArray} | R Documentation |
Reformats Variant Call Format (VCF) files.
seqVCF2GDS(vcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, start=1L, count=-1L, optimize=TRUE, raise.error=TRUE,
    digest=TRUE, parallel=FALSE, verbose=TRUE)

seqBCF2GDS(bcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, optimize=TRUE, raise.error=TRUE, digest=TRUE,
    bcftools="bcftools", verbose=TRUE)
vcf.fn |
the file name(s) of VCF format; or a |
bcf.fn |
a file name of binary VCF format (BCF) |
out.fn |
the file name of output GDS file |
header |
if NULL, |
storage.option |
specify the storage and compression option,
"ZIP_RA" ( |
info.import |
characters, the variable name(s) in the INFO field
for import; or |
fmt.import |
characters, the variable name(s) in the FORMAT field
for import; or |
genotype.var.name |
the ID for genotypic data in the FORMAT column; "GT" by default, VCFv4.0 |
ignore.chr.prefix |
a character vector giving the chromosome prefix(es) to be ignored, like "chr"; matching is not case-sensitive |
scenario |
"general": use float32 to store floating-point numbers (the default); "imputation": use packedreal16 to store DS and GP in the FORMAT field with an accuracy of four decimal places |
reference |
genome reference, like "hg19", "GRCh37"; if the genome reference is not available in VCF files, users could specify the reference here |
start |
the starting variant when importing only part of the VCF files |
count |
the maximum number of variants to import when importing only part of the VCF files; -1 means importing to the end |
optimize |
if |
raise.error |
|
digest |
a logical value (TRUE/FALSE) or a character ("md5", "sha1", "sha256", "sha384" or "sha512"); add md5 hash codes to the GDS file if TRUE or a digest algorithm is specified |
parallel |
|
verbose |
if |
bcftools |
the path of the program |
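The arguments above can be combined to import only a slice of a VCF file. The following is a minimal sketch, assuming the SeqArray package is installed and using its bundled example VCF file; the temporary output path and the choice of 100 variants are illustrative only.

```r
# Sketch: import at most the first 100 variants and skip all INFO fields.
# Guarded so that the code is a no-op when SeqArray is not installed.
if (requireNamespace("SeqArray", quietly = TRUE)) {
  library(SeqArray)

  vcf.fn <- seqExampleFileName("vcf")      # example VCF shipped with SeqArray
  out.fn <- tempfile(fileext = ".gds")     # illustrative temporary output file

  seqVCF2GDS(vcf.fn, out.fn, storage.option = "ZIP_RA",
             info.import = character(0),   # no INFO fields are imported
             start = 1L, count = 100L,     # partial import
             verbose = FALSE)

  # count the variants that were actually imported
  f <- seqOpen(out.fn)
  n.variant <- length(seqGetData(f, "variant.id"))
  seqClose(f)

  unlink(out.fn, force = TRUE)
}
```

With count=-1L (the default shown in the usage) the import would instead run to the end of the file.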
If vcf.fn contains more than one file, seqVCF2GDS will merge all of the VCF files together, provided that they contain the same samples. This is useful when variant data are split by chromosome.
The real numbers in the VCF file(s) are stored in 32-bit floating-point format by default. Users can set storage.option=seqStorageOption(float.mode="float64") to switch to 64-bit floating-point format. Alternatively, packed real numbers can be adopted by setting storage.option=seqStorageOption(float.mode="packedreal16:scale=0.0001").
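To give a feel for what a scale of 0.0001 means for dosage-like values, here is a small base R sketch of generic 16-bit fixed-point quantization. It is an illustration under stated assumptions, not the actual packedreal16 implementation: the rounding rule and the signed 16-bit range check are assumptions, and the sample values are made up.

```r
# Generic fixed-point sketch (assumption: value ~ integer code * scale).
scale <- 0.0001
ds <- c(0, 0.12345, 1.00004, 1.99996, 2)   # made-up dosage values in [0, 2]

code <- round(ds / scale)                  # integer codes to be stored
fits.16bit <- all(abs(code) <= 32767)      # do the codes fit in 16 bits?

restored <- code * scale                   # values recovered on reading
max.err <- max(abs(restored - ds))         # worst-case rounding error

print(data.frame(ds, code, restored))
```

Under these assumptions the rounding error cannot exceed half the scale, which is consistent with the four-decimal-place accuracy stated for the "imputation" scenario.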
The "ZIP_RA" compression method uses the zlib algorithm with the default compression level and independent data blocks; note that the usage above shows "LZMA_RA" as the default value of storage.option. Users can maximize the compression ratio with storage.option="ZIP_RA.max" or storage.option=seqStorageOption("ZIP_RA.max").
LZ4 (http://cyan4973.github.io/lz4/) is an option via storage.option="LZ4_RA" or storage.option=seqStorageOption("LZ4_RA").
LZMA (xz, http://tukaani.org/xz/) is another option via storage.option="LZMA_RA" or storage.option=seqStorageOption("LZMA_RA"), and it is known to have a higher compression ratio than zlib.
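The compression options described above can be compared on a concrete file. The sketch below assumes that SeqArray is installed and that the underlying GDS library was built with LZ4 and LZMA support; it converts the bundled example VCF once per method and records the resulting file sizes. The temporary file names are illustrative.

```r
# Sketch: compare output file sizes for three compression methods.
if (requireNamespace("SeqArray", quietly = TRUE)) {
  library(SeqArray)

  vcf.fn <- seqExampleFileName("vcf")
  methods <- c("ZIP_RA", "LZ4_RA", "LZMA_RA")

  sizes <- vapply(methods, function(m) {
    out.fn <- tempfile(fileext = ".gds")
    on.exit(unlink(out.fn, force = TRUE))   # clean up after each conversion
    seqVCF2GDS(vcf.fn, out.fn, storage.option = m, verbose = FALSE)
    file.size(out.fn)                       # size in bytes
  }, numeric(1))

  print(sizes)   # named vector: one size per compression method
}
```

The sizes depend on the data, so a comparison on the small example file may not carry over to large studies.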
If multiple cores/processes are specified in parallel, all VCF files are scanned to calculate the total number of variants before format conversion.
storage.option="Ultra" and storage.option="UltraMax" need much more memory. Users may consider using seqRecompress() to recompress the GDS file after calling seqVCF2GDS(), since seqRecompress() takes much less memory when "Ultra" or "UltraMax" is used.
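The two-step workflow suggested above might look like the following sketch. It assumes SeqArray is installed and that seqRecompress() accepts a compress argument naming the target method; "LZMA" is used here only to keep the example light, and "Ultra" or "UltraMax" would be substituted for the case described in the text.

```r
# Sketch: convert with a fast method first, then recompress in place.
if (requireNamespace("SeqArray", quietly = TRUE)) {
  library(SeqArray)

  out.fn <- tempfile(fileext = ".gds")     # illustrative temporary file
  seqVCF2GDS(seqExampleFileName("vcf"), out.fn,
             storage.option = "LZ4_RA", verbose = FALSE)
  size.before <- file.size(out.fn)

  # recompress the existing GDS file (assumed signature, see lead-in)
  seqRecompress(out.fn, compress = "LZMA", verbose = FALSE)
  size.after <- file.size(out.fn)

  print(c(before = size.before, after = size.after))
  unlink(out.fn, force = TRUE)
}
```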
Returns the file name of the output GDS file with an absolute path.
Xiuwen Zheng
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-2158.
seqVCF_Header, seqStorageOption, seqMerge, seqGDS2VCF, seqRecompress
# the VCF file
vcf.fn <- seqExampleFileName("vcf")

# conversion
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA")

# conversion in parallel
seqVCF2GDS(vcf.fn, "tmp_p2.gds", storage.option="ZIP_RA", parallel=2L)

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# convert without the INFO fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# convert without the INFO and FORMAT fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0), fmt.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# delete the temporary files
unlink(c("tmp.gds", "tmp_p2.gds"), force=TRUE)