The VariantAnnotation package has facilities for reading in all or portions of Variant Call Format (VCF) files. Structural location information can be determined as well as amino acid coding changes for non-synonymous variants. Consequences of the coding changes can be investigated with the SIFT and PolyPhen database packages.
This workflow annotates variants found in the Transient Receptor Potential Vanilloid (TRPV) gene family on chromosome 17. The VCF file is available in the cgdv17 data package and contains Complete Genomics data for population type CEU.
library(VariantAnnotation)
library(cgdv17)
file <- system.file("vcf", "NA06985_17.vcf.gz", package = "cgdv17")
## Explore the file header with scanVcfHeader
hdr <- scanVcfHeader(file)
info(hdr)
## DataFrame with 3 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## NS 1 Integer Number of Samples With Data
## DP 1 Integer Total Depth
## DB 0 Flag dbSNP membership, build 131
geno(hdr)
## DataFrame with 12 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## GQ 1 Integer Genotype Quality
## DP 1 Integer Read Depth
## HDP 2 Integer Haplotype Read Depth
## HQ 2 Integer Haplotype Quality
## ... ... ... ...
## mRNA . String Overlaping mRNA
## rmsk . String Overlaping Repeats
## segDup . String Overlaping segmentation duplication
## rCov 1 Float relative Coverage
## cPd 1 String called Ploidy(level)
Convert the gene symbols to gene ids compatible with the TxDb.Hsapiens.UCSC.hg19.knownGene annotations. The annotations are used to define the TRPV ranges that will be extracted from the VCF file.
## get entrez ids from gene symbols
library(org.Hs.eg.db)
genesym <- c("TRPV1", "TRPV2", "TRPV3")
geneid <- select(org.Hs.eg.db, keys=genesym, keytype="SYMBOL",
                 columns="ENTREZID")
geneid
## SYMBOL ENTREZID
## 1 TRPV1 7442
## 2 TRPV2 51393
## 3 TRPV3 162514
Load the annotation package.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
## TranscriptDb object:
## | Db type: TranscriptDb
## | Supporting package: GenomicFeatures
## | Data source: UCSC
## | Genome: hg19
## | Organism: Homo sapiens
## | UCSC Table: knownGene
## | Resource URL: http://genome.ucsc.edu/
## | Type of Gene ID: Entrez Gene ID
## | Full dataset: yes
## | miRBase build ID: GRCh37
## | transcript_nrow: 82960
## | exon_nrow: 289969
## | cds_nrow: 237533
## | Db created by: GenomicFeatures package from Bioconductor
## | Creation time: 2014-03-17 16:15:59 -0700 (Mon, 17 Mar 2014)
## | GenomicFeatures version at creation time: 1.15.11
## | RSQLite version at creation time: 0.11.4
## | DBSCHEMAVERSION: 1.0
Modify the seqlevels (chromosomes) in the txdb to match those in the VCF file. This step is necessary because we want to use ranges from the txdb to extract a subset from the VCF.
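The code for this renaming step is not shown on this page. As a rough sketch, assuming the VCF names its chromosomes without the "chr" prefix (for example "17" rather than "chr17"), the mapping amounts to stripping that prefix and keeping only the chromosome of interest. `strip_chr` is a hypothetical helper introduced for illustration, and the commented lines show how the same mapping might be applied to `txdb` with `renameSeqlevels()` and `keepSeqlevels()`; treat them as an assumption, not as the workflow's exact code.

```r
# Hypothetical helper: drop a leading "chr" from UCSC-style names.
strip_chr <- function(x) sub("^chr", "", x)

# Illustrative seqlevels, standing in for seqlevels(txdb).
ucsc_levels <- c("chr1", "chr17", "chrX")
new_levels  <- strip_chr(ucsc_levels)

# Keep only chromosome 17, the chromosome covered by the VCF file.
kept <- new_levels[new_levels == "17"]
print(new_levels)
print(kept)

# Applied to the workflow's objects, this might look like (assumption):
# txdb <- renameSeqlevels(txdb, strip_chr(seqlevels(txdb)))
# txdb <- keepSeqlevels(txdb, "17")
```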
Create a list of transcripts by gene:
txbygene <- transcriptsBy(txdb, "gene")
Create the gene ranges for the TRPV genes:
gnrng <- unlist(range(txbygene[geneid$ENTREZID]), use.names=FALSE)
names(gnrng) <- geneid$SYMBOL
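Conceptually, calling `range()` on the per-gene transcript list collapses each gene's transcripts into a single span, from the leftmost start to the rightmost end. The following is a minimal base R sketch of that idea. It uses made-up coordinates and a plain data frame in place of a GRangesList, and it ignores chromosome and strand for brevity.

```r
# Toy transcripts (coordinates are illustrative, not real annotations).
tx <- data.frame(
  gene  = c("A", "A", "B"),
  start = c(100, 150, 500),
  end   = c(300, 400, 900),
  stringsAsFactors = FALSE
)

# Per-gene span: smallest start and largest end over all transcripts.
starts <- tapply(tx$start, tx$gene, min)
ends   <- tapply(tx$end, tx$gene, max)
span <- data.frame(gene  = names(starts),
                   start = as.vector(starts),
                   end   = as.vector(ends),
                   stringsAsFactors = FALSE)
print(span)
```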
A ScanVcfParam object is used to retrieve data subsets. This object can specify genomic coordinates (ranges) or individual VCF elements. Extracting ranges (as opposed to fields) requires a tabix index. See ?indexTabix for details.
param <- ScanVcfParam(which = gnrng, info = "DP", geno = c("GT", "cPd"))
param
## class: ScanVcfParam
## vcfWhich: 1 elements
## vcfFixed: character() [All]
## vcfInfo: DP
## vcfGeno: GT cPd
## vcfSamples:
## Extract the TRPV ranges from the VCF file
vcf <- readVcf(file, "hg19", param)
## Inspect the VCF object with the 'fixed', 'info' and 'geno' accessors
vcf
## class: CollapsedVCF
## dim: 405 1
## rowData(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 1 column: DP
## info(header(vcf)):
## Number Type Description
## DP 1 Integer Total Depth
## geno(vcf):
## SimpleList of length 2: GT, cPd
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## cPd 1 String called Ploidy(level)
head(fixed(vcf))
## DataFrame with 6 rows and 4 columns
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1 A G 120 PASS
## 2 A 0 PASS
## 3 AAAAA 0 PASS
## 4 AA 0 PASS
## 5 C T 59 PASS
## 6 T C 157 PASS
geno(vcf)
## List of length 2
## names(2): GT cPd
To find the structural location of the variants, we'll use the locateVariants function with the TxDb.Hsapiens.UCSC.hg19.knownGene package loaded earlier.
## Use the 'region' argument to define the region
## of interest. See ?locateVariants for details.
cds <- locateVariants(vcf, txdb, CodingVariants())
five <- locateVariants(vcf, txdb, FiveUTRVariants())
splice <- locateVariants(vcf, txdb, SpliceSiteVariants())
intron <- locateVariants(vcf, txdb, IntronVariants())
all <- locateVariants(vcf, txdb, AllVariants())
Each row in cds represents a variant-transcript match so multiple rows per variant are possible. If we are interested in gene-centric questions the data can be summarized by gene regardless of transcript.
## Did any variants match more than one gene?
table(sapply(split(mcols(all)$GENEID, mcols(all)$QUERYID),
             function(x) length(unique(x)) > 1))
##
## FALSE TRUE
## 391 11
## Summarize the number of variants by gene:
idx <- sapply(split(mcols(all)$QUERYID, mcols(all)$GENEID), unique)
sapply(idx, length)
## 125144 162514 51393 7442 84690
## 1 172 62 143 35
## Summarize variant location by gene:
sapply(names(idx),
    function(nm) {
        d <- all[mcols(all)$GENEID %in% nm, c("QUERYID", "LOCATION")]
        table(mcols(d)$LOCATION[!duplicated(d)])
    })
## 125144 162514 51393 7442 84690
## spliceSite 0 2 0 1 0
## intron 0 153 58 117 19
## fiveUTR 0 0 0 0 0
## threeUTR 0 0 0 0 0
## coding 0 5 3 8 0
## intergenic 0 0 0 0 0
## promoter 1 12 1 17 16
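The split/sapply idiom used in these summaries can be easier to follow on a small example. The sketch below uses a toy data frame with illustrative values in place of `mcols(all)`. It de-duplicates on the three columns shown, which is an illustrative choice and may not count variants exactly as the workflow code does.

```r
# Toy hits: one row per variant-transcript match (values are illustrative).
hits <- data.frame(
  QUERYID  = c(1, 1, 2, 3, 3, 4),
  GENEID   = c("7442", "7442", "51393", "7442", "84690", "51393"),
  LOCATION = c("intron", "intron", "coding", "promoter", "promoter", "intron"),
  stringsAsFactors = FALSE
)

# Does any variant match more than one gene?
multi_gene <- sapply(split(hits$GENEID, hits$QUERYID),
                     function(x) length(unique(x)) > 1)
print(table(multi_gene))

# Count each variant once per gene and location, then cross-tabulate.
uniq <- unique(hits[, c("QUERYID", "GENEID", "LOCATION")])
loc_by_gene <- table(uniq$LOCATION, uniq$GENEID)
print(loc_by_gene)
```

Only variant 3 in the toy data is assigned to two genes, so it contributes one count to each of them in the cross-tabulation.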
Amino acid coding for non-synonymous variants can be computed with the function predictCoding. The BSgenome.Hsapiens.UCSC.hg19 package is used as the source of the reference alleles. Variant alleles are provided by the user.
library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(vcf) <- "UCSC"
seqlevelsStyle(txdb) <- "UCSC"
aa <- predictCoding(vcf, txdb, Hsapiens)
## Warning: records with missing 'varAllele' were ignored
## Warning: varAllele values containing 'N' were not translated
predictCoding returns results for coding variants only. As with locateVariants, the output has one row per variant-transcript match so multiple rows per variant are possible.
## Did any variants match more than one gene?
table(sapply(split(mcols(aa)$GENEID, mcols(aa)$QUERYID),
             function(x) length(unique(x)) > 1))
##
## FALSE
## 17
## Summarize the number of variants by gene:
idx <- sapply(split(mcols(aa)$QUERYID, mcols(aa)$GENEID, drop=TRUE), unique)
sapply(idx, length)
## 162514 51393 7442
## 6 3 8
## Summarize variant consequence by gene:
sapply(names(idx),
    function(nm) {
        d <- aa[mcols(aa)$GENEID %in% nm, c("QUERYID", "CONSEQUENCE")]
        table(mcols(d)$CONSEQUENCE[!duplicated(d)])
    })
## 162514 51393 7442
## nonsynonymous 2 0 2
## not translated 1 0 5
## synonymous 3 3 1
The 'not translated' variants are explained by the warnings thrown when predictCoding was called: variants that have a missing varAllele or an 'N' in the varAllele are not translated. Had a varAllele substitution resulted in a frameshift, the consequence would have been 'frameshift'. See ?predictCoding for details.
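The rules described above can be written down as a small decision function. This is a simplified sketch, not how predictCoding is implemented: `classify_consequence` and its arguments are hypothetical, the amino acids are supplied by the caller instead of being translated from sequence, and other outcomes the real function may report are ignored.

```r
# Hypothetical classifier mirroring the rules described in the text.
classify_consequence <- function(ref, alt, ref_aa = NA_character_,
                                 var_aa = NA_character_) {
  # Missing alleles or alleles containing 'N' cannot be translated.
  if (is.na(alt) || alt == "" || grepl("N", alt, fixed = TRUE)) {
    return("not translated")
  }
  # A length change that is not a multiple of 3 shifts the reading frame.
  if ((nchar(alt) - nchar(ref)) %% 3 != 0) {
    return("frameshift")
  }
  # Otherwise compare the reference and variant amino acids.
  if (identical(ref_aa, var_aa)) "synonymous" else "nonsynonymous"
}

print(classify_consequence("A", NA))             # missing variant allele
print(classify_consequence("A", "AT"))           # one-base insertion
print(classify_consequence("C", "T", "T", "I"))  # amino acid changes
print(classify_consequence("C", "T", "T", "T"))  # amino acid unchanged
```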
The SIFT.Hsapiens.dbSNP132 and PolyPhen.Hsapiens.dbSNP131 packages provide predictions of how damaging amino acid coding changes may be to protein structure and function. Both packages search on rsid.
The pre-computed predictions in the SIFT and PolyPhen packages are based on specific gene models. SIFT is based on Ensembl and PolyPhen on UCSC Known Gene. The TranscriptDb we used to identify coding variants was from UCSC Known Gene so we will use PolyPhen for predictions.
## Load the PolyPhen package and explore the available keys and columns
library(PolyPhen.Hsapiens.dbSNP131)
keys <- keys(PolyPhen.Hsapiens.dbSNP131)
cols <- columns(PolyPhen.Hsapiens.dbSNP131)
## column descriptions are found at ?PolyPhenDbColumns
cols
## [1] "RSID" "TRAININGSET" "OSNPID" "OACC" "OPOS"
## [6] "OAA1" "OAA2" "SNPID" "ACC" "POS"
## [11] "AA1" "AA2" "NT1" "NT2" "PREDICTION"
## [16] "BASEDON" "EFFECT" "PPH2CLASS" "PPH2PROB" "PPH2FPR"
## [21] "PPH2TPR" "PPH2FDR" "SITE" "REGION" "PHAT"
## [26] "DSCORE" "SCORE1" "SCORE2" "NOBS" "NSTRUCT"
## [31] "NFILT" "PDBID" "PDBPOS" "PDBCH" "IDENT"
## [36] "LENGTH" "NORMACC" "SECSTR" "MAPREG" "DVOL"
## [41] "DPROP" "BFACT" "HBONDS" "AVENHET" "MINDHET"
## [46] "AVENINT" "MINDINT" "AVENSIT" "MINDSIT" "TRANSV"
## [51] "CODPOS" "CPG" "MINDJNC" "PFAMHIT" "IDPMAX"
## [56] "IDPSNP" "IDQMIN" "COMMENTS"
## Get the rsids for the non-synonymous variants from the
## predictCoding results
rsid <- unique(names(aa)[mcols(aa)$CONSEQUENCE == "nonsynonymous"])
## Retrieve predictions for the non-synonymous variants. A variant can
## return more than one row (here rs322937 has two predictions).
select(PolyPhen.Hsapiens.dbSNP131, keys=rsid,
       columns=c("AA1", "AA2", "PREDICTION"))
## RSID AA1 AA2 PREDICTION
## 1 rs224534 T I benign
## 2 rs222747 M I benign
## 3 rs322937 R G possibly damaging
## 4 rs322937 R G benign
## 5 rs322965 I V benign
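Because an rsid can return several rows, it can help to collapse the predictions to one entry per variant before reporting. The sketch below re-enters the table printed above as a plain data frame (so it runs without the PolyPhen package) and joins the distinct predictions for each rsid. The "/" separator is an arbitrary choice.

```r
# The select() result shown above, re-entered by hand.
pred <- data.frame(
  RSID       = c("rs224534", "rs222747", "rs322937", "rs322937", "rs322965"),
  AA1        = c("T", "M", "R", "R", "I"),
  AA2        = c("I", "I", "G", "G", "V"),
  PREDICTION = c("benign", "benign", "possibly damaging", "benign", "benign"),
  stringsAsFactors = FALSE
)

# One entry per rsid, listing every distinct prediction it received.
per_variant <- vapply(split(pred$PREDICTION, pred$RSID),
                      function(x) paste(unique(x), collapse = "/"),
                      character(1))
print(per_variant)
```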
Follow installation instructions to start using these packages. To install VariantAnnotation use
library(BiocInstaller)
biocLite("VariantAnnotation")
Package installation is required only once per R installation. View a full list of available software and annotation packages.
To use the VariantAnnotation package, evaluate the commands
library(VariantAnnotation)
These commands are required once in each R session.
Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like
help(package="VariantAnnotation")
?predictCoding
to obtain an overview of help on the VariantAnnotation package and the predictCoding function. View the package vignette with
browseVignettes(package="VariantAnnotation")
To view vignettes providing a more comprehensive introduction to package functionality use
help.start()
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] C
##
## attached base packages:
## [1] splines parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] PolyPhen.Hsapiens.dbSNP131_1.0.2
## [2] BSgenome.Hsapiens.UCSC.hg19_1.3.99
## [3] BSgenome_1.32.0
## [4] cgdv17_0.2.0
## [5] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
## [6] GenomicFeatures_1.16.2
## [7] GGtools_5.0.0
## [8] data.table_1.9.2
## [9] GGBase_3.26.1
## [10] snpStats_1.14.0
## [11] Matrix_1.1-4
## [12] survival_2.37-7
## [13] org.Hs.eg.db_2.14.0
## [14] RSQLite_0.11.4
## [15] DBI_0.3.0
## [16] AnnotationDbi_1.26.0
## [17] Biobase_2.24.0
## [18] VariantAnnotation_1.10.5
## [19] Rsamtools_1.16.1
## [20] Biostrings_2.32.1
## [21] XVector_0.4.0
## [22] GenomicRanges_1.16.4
## [23] GenomeInfoDb_1.0.2
## [24] IRanges_1.22.10
## [25] BiocGenerics_0.10.0
##
## loaded via a namespace (and not attached):
## [1] BBmisc_1.7 BatchJobs_1.3
## [3] BiocParallel_0.6.1 Formula_1.1-2
## [5] GenomicAlignments_1.0.6 Gviz_1.8.4
## [7] Hmisc_3.14-4 KernSmooth_2.23-12
## [9] R.methodsS3_1.6.1 RColorBrewer_1.0-5
## [11] RCurl_1.95-4.3 ROCR_1.0-5
## [13] Rcpp_0.11.2 XML_3.98-1.1
## [15] annotate_1.42.1 biglm_0.9-1
## [17] biomaRt_2.20.0 biovizBase_1.12.3
## [19] bit_1.1-12 bitops_1.0-6
## [21] brew_1.0-6 caTools_1.17.1
## [23] checkmate_1.4 cluster_1.15.2
## [25] codetools_0.2-8 colorspace_1.2-4
## [27] dichromat_2.0-0 digest_0.6.4
## [29] evaluate_0.5.5 fail_1.2
## [31] ff_2.2-13 foreach_1.4.2
## [33] formatR_0.10 gdata_2.13.3
## [35] genefilter_1.46.1 gplots_2.14.1
## [37] grid_3.1.0 gtools_3.4.1
## [39] hexbin_1.27.0 iterators_1.0.7
## [41] knitr_1.6 lattice_0.20-29
## [43] latticeExtra_0.6-26 matrixStats_0.10.0
## [45] munsell_0.4.2 plyr_1.8.1
## [47] reshape2_1.4 rtracklayer_1.24.2
## [49] scales_0.2.4 sendmailR_1.1-2
## [51] stats4_3.1.0 stringr_0.6.2
## [53] tools_3.1.0 xtable_1.7-3
## [55] zlibbioc_1.10.0