Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM, gff, bed, and wig files, among others. Packages support common and advanced sequence manipulation operations such as trimming, transformation, and alignment. Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).
This workflow walks through the annotation of a generic set of ranges with Bioconductor packages. The ranges can be any user-defined region of interest or can be from a public file.
As a first step, data are put into a GRanges object so we can take advantage of overlap operations and store identifiers as metadata columns.
The first set of ranges are variants from a dbSNP Variant Call Format (VCF) file. This file can be downloaded from the ftp site at NCBI ftp://ftp.ncbi.nlm.nih.gov/snp/ and imported with readVcf() from the VariantAnnotation package. Alternatively, the file is available as a pre-parsed VCF object in the AnnotationHub.
library(VariantAnnotation)
library(AnnotationHub)
hub <- AnnotationHub()
vcf <- hub$dbSNP.organisms.human_9606.VCF.ByChromosome.22.12159.GIH.RData
dim(vcf)
## [1] 19698 88
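For readers working from a downloaded file rather than the AnnotationHub, the readVcf() route mentioned above can be sketched as follows. This is a minimal sketch: the file used is the small chr22 example shipped with VariantAnnotation, standing in (as an assumption) for a VCF downloaded from NCBI, and the name vcf_file is chosen only for illustration.

```r
library(VariantAnnotation)

# Example VCF bundled with VariantAnnotation; substitute the path to a
# file downloaded from the NCBI ftp site (assumption for illustration).
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# The second argument labels the genome build of the imported ranges.
vcf_file <- readVcf(fl, "hg19")

# Rows are variants, columns are samples.
dim(vcf_file)
```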
When performing overlap operations, the seqlevels and genome of the objects must match. Here we modify the VCF to match the TxDb.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
head(seqlevels(txdb_hg19))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
seqlevels(vcf)
## [1] "22"
seqlevels(vcf) <- paste0("chr", seqlevels(vcf))
unique(genome(txdb_hg19))
## [1] "hg19"
genome(vcf) <- "hg19"
Sanity check to confirm we have matching seqlevels.
intersect(seqlevels(txdb_hg19), seqlevels(vcf))
## [1] "chr22"
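To see why the seqlevels must match, the following toy sketch compares overlap counts before and after renaming. The two ranges and the names ncbi_style and ucsc_style are hypothetical and are not part of the workflow data.

```r
library(GenomicRanges)

# Two toy ranges at the same position, named in different styles.
ncbi_style <- GRanges("22", IRanges(100, width = 1))
ucsc_style <- GRanges("chr22", IRanges(100, width = 1))

# With no sequence names in common no overlap is found
# (R may also warn that the objects share no sequence levels).
n_before <- suppressWarnings(countOverlaps(ncbi_style, ucsc_style))

# Rename the seqlevels as done for the VCF above, then count again.
seqlevels(ncbi_style) <- paste0("chr", seqlevels(ncbi_style))
n_after <- countOverlaps(ncbi_style, ucsc_style)

c(before = n_before, after = n_after)
```

The count goes from 0 to 1 once the names agree, which is the situation the sanity check above guards against.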
The GRanges in a VCF object is in the 'rowData' slot.
gr_hg19 <- rowData(vcf)
The second set of ranges is a user-defined region of chromosome 4 in mouse. The idea here is that any region, known or unknown, can be annotated with the following steps.
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb_mm10 <- TxDb.Mmusculus.UCSC.mm10.ensGene
We are creating the GRanges from scratch and can specify the seqlevels (chromosome names) to match the TxDb.
head(seqlevels(txdb_mm10))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
gr_mm10 <- GRanges("chr4", IRanges(c(4000000, 107889000), width=1000))
Now assign the genome.
unique(genome(txdb_mm10))
## [1] "mm10"
genome(gr_mm10) <- "mm10"
locateVariants() in the VariantAnnotation package annotates ranges with transcript, exon, CDS and gene IDs from a TxDb. Various extractions are performed on the TxDb (exonsBy(), transcripts(), cdsBy(), etc.) and the results are overlapped with the ranges. An appropriate GRangesList can also be supplied as the annotation. Different variant locations such as 'coding', 'fiveUTR', 'threeUTR', 'spliceSite', 'intron', 'promoter', and 'intergenic' can be searched for by passing the appropriate constructor as the 'region' argument. See ?locateVariants for details.
loc_hg19 <- locateVariants(gr_hg19, txdb_hg19, AllVariants())
table(loc_hg19$LOCATION)
##
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic
##          6      36416        246       1586       1690       8877
##   promoter
##       2425
loc_mm10 <- locateVariants(gr_mm10, txdb_mm10, AllVariants())
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 3 out-of-bound ranges located on sequences
## chr4_JH584293_random, chr4_JH584295_random, and
## chr5_JH584296_random. Note that only ranges located on a
## non-circular sequence whose length is not NA can be considered
## out-of-bound (use seqlengths() and isCircular() to get the lengths
## and circularity flags of the underlying sequences). You can use
## trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
## more information.
table(loc_mm10$LOCATION)
##
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic
##          6          1          0          0          0          0
##   promoter
##         12
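The calls above use AllVariants(). To restrict the search to a single location type, one of the other region constructors can be passed instead. The sketch below is illustrative only: the two query positions, the promoter window sizes, and the object names are assumptions chosen for the example.

```r
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Toy single-base ranges on chr22 (positions chosen for illustration).
query <- GRanges("chr22", IRanges(c(17072483, 17264565), width = 1))
genome(query) <- "hg19"

# Search coding regions only.
coding_only <- locateVariants(query, txdb, CodingVariants())

# Search promoters only, with a user-defined window around the
# transcription start site (window sizes are arbitrary here).
promoter_only <- locateVariants(query, txdb,
    PromoterVariants(upstream = 5000, downstream = 500))

table(coding_only$LOCATION)
table(promoter_only$LOCATION)
```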
The IDs returned from locateVariants() can be used in select() to map to IDs in other annotation packages.
library(org.Hs.eg.db)
cols <- c("UNIPROT", "PFAM")
keys <- unique(loc_hg19$GENEID)
head(select(org.Hs.eg.db, keys, cols, keytype="ENTREZID"))
##   ENTREZID UNIPROT    PFAM
## 1   150160  Q96SF2 PF00118
## 2   387590    <NA>    <NA>
## 3    23783    <NA>    <NA>
## 4   150165  Q5GH77 PF09815
## 5    27437    <NA>    <NA>
## 6   128954  Q2WGN9 PF00169
For the mouse ranges, the 'keytype' argument is set to "ENSEMBL" because the mouse TxDb contains Ensembl rather than Entrez gene IDs.
library(org.Mm.eg.db)
keys <- unique(loc_mm10$GENEID)
head(select(org.Mm.eg.db, keys, cols, keytype="ENSEMBL"))
##              ENSEMBL UNIPROT    PFAM
## 1 ENSMUSG00000028236  Q7TQA3 PF00106
## 2 ENSMUSG00000028608  Q8BHG2 PF05907
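When it is unclear which values are valid for the 'keytype' and column arguments, the annotation package can be queried directly. A minimal sketch, using two of the Entrez IDs from the human output above and the "SYMBOL" column as an illustrative choice:

```r
library(org.Hs.eg.db)

# Identifier types that can be used as keys.
head(keytypes(org.Hs.eg.db))

# Identifier types that can be returned.
head(columns(org.Hs.eg.db))

# Map two Entrez gene IDs to gene symbols.
sym <- select(org.Hs.eg.db, keys = c("150160", "23783"),
    columns = "SYMBOL", keytype = "ENTREZID")
sym
```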
Files stored in the AnnotationHub have been pre-processed into range-based R objects such as GRanges, GAlignments and VCF. The positions in our GRanges can be overlapped with the ranges in the AnnotationHub files. This allows for easy subsetting of multiple files, resulting in only the ranges of interest.
Create a 'hub' from AnnotationHub and filter the files based on organism.
hub <- AnnotationHub()
filters(hub) <- list(Species = "Homo sapiens")
hg19_files <- names(hub)[grepl("hg19", names(hub))]
length(hg19_files)
## [1] 4609
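The filters() interface shown above belongs to the AnnotationHub release used when this workflow was written. In more recent releases of the package, records are searched with query() instead. The following is a hedged sketch of that approach: it needs network access, the number of records returned will differ from the count above, and the names ah and hg19_recs are chosen only for illustration.

```r
library(AnnotationHub)

# Connect to the hub (requires network access and a local cache).
ah <- AnnotationHub()

# Keep records whose metadata match all of the search terms.
hg19_recs <- query(ah, c("Homo sapiens", "hg19"))
length(hg19_recs)

# A single resource would then be downloaded with hg19_recs[[i]].
```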
Extract the matching ranges from the first 3 files.
ov_hg19 <- lapply(hg19_files[1:3], function(x)
subsetByOverlaps(hub[[x]], gr_hg19))
Take a look at the results.
names(ov_hg19) <- hg19_files[1:3]
lapply(ov_hg19, head, n=3)
## $goldenpath.hg19.encodeDCC.wgEncodeSydhTfbs.wgEncodeSydhTfbsHepg2Brca1a300IggrabPk.narrowPeak_0.0.1.RData
## GRanges object with 3 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr22 [43010733, 43011938] * | . 1000
## [2] chr22 [21355953, 21356739] * | . 1000
## [3] chr22 [21212791, 21213563] * | . 778
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 6.02722 52.80257 50.45445 230
## [2] 12.61455 23.33600 21.40622 420
## [3] 10.25426 15.37789 13.68114 341
## -------
## seqinfo: 24 sequences from hg19 genome
##
## $goldenpath.hg19.encodeDCC.wgEncodeAffyRnaChip.wgEncodeAffyRnaChipFiltTransfragsGm12878CellTotal.broadPeak_0.0.1.RData
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr22 [16886836, 16886882] * | . 0
## [2] chr22 [17012922, 17012968] * | . 0
## [3] chr22 [17306056, 17306167] * | . 0
## signalValue pValue qValue
## <numeric> <numeric> <numeric>
## [1] 177.675 -1 -1
## [2] 216.450 -1 -1
## [3] 147.787 -1 -1
## -------
## seqinfo: 25 sequences from hg19 genome
##
## $goldenpath.hg19.encodeDCC.wgEncodeAffyRnaChip.wgEncodeAffyRnaChipFiltTransfragsGm12878CytosolLongnonpolya.broadPeak_0.0.1.RData
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr22 [17326625, 17326691] * | . 0
## [2] chr22 [17490889, 17490950] * | . 0
## [3] chr22 [17493505, 17493561] * | . 0
## signalValue pValue qValue
## <numeric> <numeric> <numeric>
## [1] 123.712 -1 -1
## [2] 167.059 -1 -1
## [3] 135.562 -1 -1
## -------
## seqinfo: 25 sequences from hg19 genome
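The behaviour of subsetByOverlaps() is easier to see on a small example. The sketch below uses hypothetical toy ranges (peaks and snps) rather than the hub files: only the ranges of the first argument that overlap at least one range of the second are kept, along with their metadata columns.

```r
library(GenomicRanges)

# Three toy "peaks" with a score, and two toy single-base "variants".
peaks <- GRanges("chr22", IRanges(c(100, 500, 900), width = 50),
    score = c(10L, 20L, 30L))
snps <- GRanges("chr22", IRanges(c(120, 910), width = 1))

# Keep only the peaks hit by at least one variant.
hits <- subsetByOverlaps(peaks, snps)
hits

# Number of variants falling in each peak.
countOverlaps(peaks, snps)
```

Here the first and third peaks are retained, and the middle peak, which contains no variant, is dropped.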
Annotating the mouse ranges in the same fashion is left as an exercise.
For the set of dbSNP variants that fall in coding regions, amino acid changes can be computed. The output contains one line for each variant-transcript match, so a variant that overlaps several transcripts appears on several lines.
library(BSgenome.Hsapiens.UCSC.hg19)
head(predictCoding(vcf, txdb_hg19, Hsapiens), 3)
## GRanges object with 3 ranges and 17 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## rs2236639 chr22 [17072483, 17072483] - | <NA>
## rs5747988 chr22 [17073066, 17073066] - | <NA>
## rs5748622 chr22 [17264565, 17264565] - | <NA>
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## rs2236639 A G <NA> .
## rs5747988 A G <NA> .
## rs5748622 G T <NA> .
## varAllele CDSLOC PROTEINLOC QUERYID
## <DNAStringSet> <IRanges> <IntegerList> <integer>
## rs2236639 C [ 958, 958] 320 34
## rs5747988 C [ 375, 375] 125 35
## rs5748622 A [1324, 1324] 442 99
## TXID CDSID GENEID CONSEQUENCE REFCODON
## <character> <integer> <character> <factor> <DNAStringSet>
## rs2236639 74436 216505 150160 nonsynonymous TGG
## rs5747988 74436 216505 150160 synonymous GCT
## rs5748622 74439 216506 150165 nonsynonymous CAC
## VARCODON REFAA VARAA
## <DNAStringSet> <AAStringSet> <AAStringSet>
## rs2236639 CGG W R
## rs5747988 GCC A A
## rs5748622 AAC H N
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
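The result of predictCoding() is a GRanges, so its metadata columns can be summarized and subset in the usual way. The sketch below is self-contained and therefore uses the small chr22 example file shipped with VariantAnnotation instead of the dbSNP object. The names vcf_ex, coding and nonsyn are chosen for illustration.

```r
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

# Example VCF bundled with VariantAnnotation, renamed to UCSC style.
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
vcf_ex <- readVcf(fl, "hg19")
seqlevels(vcf_ex) <- "chr22"

coding <- predictCoding(vcf_ex, TxDb.Hsapiens.UCSC.hg19.knownGene, Hsapiens)

# Tally the predicted consequences.
table(coding$CONSEQUENCE)

# Keep only the amino-acid-changing matches.
nonsyn <- coding[coding$CONSEQUENCE == "nonsynonymous"]

# How many variants match one, two, ... transcripts.
table(table(coding$QUERYID))
```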
The ensemblVEP package provides access to the online Ensembl Variant Effect Predictor (VEP tool). The VEP tool outputs predictions of functional consequences of known and unknown variants as reported by Sequence Ontology or Ensembl. Regulatory region consequences, HGNC, Ensembl protein identifiers, HGVS, and co-located variants are optional outputs. ensemblVEP() accepts the name of a VCF file and returns a VCF on disk or GRanges in the R workspace.
library(ensemblVEP)
fl <- system.file("extdata", "ex2.vcf",
package="VariantAnnotation")
gr <- ensemblVEP(fl)
head(gr, 3)
## GRanges object with 3 ranges and 13 metadata columns:
## seqnames ranges strand | Allele
## <Rle> <IRanges> <Rle> | <factor>
## rs6054257 20 [ 14370, 14370] * | A
## 20:17330_T/A 20 [ 17330, 17330] * | A
## rs6040355 20 [1110696, 1110696] * | T
## Gene Feature Feature_type
## <factor> <factor> <factor>
## rs6054257 <NA> <NA> <NA>
## 20:17330_T/A <NA> <NA> <NA>
## rs6040355 ENSG00000125818 ENST00000333082 Transcript
## Consequence cDNA_position CDS_position
## <factor> <factor> <factor>
## rs6054257 intergenic_variant <NA> <NA>
## 20:17330_T/A intergenic_variant <NA> <NA>
## rs6040355 upstream_gene_variant <NA> <NA>
## Protein_position Amino_acids Codons Existing_variation
## <factor> <factor> <factor> <factor>
## rs6054257 <NA> <NA> <NA> <NA>
## 20:17330_T/A <NA> <NA> <NA> <NA>
## rs6040355 <NA> <NA> <NA> <NA>
## DISTANCE STRAND
## <factor> <factor>
## rs6054257 <NA> <NA>
## 20:17330_T/A <NA> <NA>
## rs6040355 2567 1
## -------
## seqinfo: 1 sequence from genome
Exercise 1: VCF header and reading data subsets.
VCF files can be large and it's often the case that only a subset of variables or genomic positions are of interest. The scanVcfHeader() function in the VariantAnnotation package retrieves header information from a VCF file. Based on the information returned from scanVcfHeader() a ScanVcfParam() object can be created to read in a subset of data from a VCF file.
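As a starting point for this exercise, the two steps can be sketched as follows. The sketch assumes the tabix-indexed chr22 example file shipped with VariantAnnotation. The genomic region is arbitrary, and the fields read are simply the first 'info' and 'geno' fields listed in the header, not a recommendation.

```r
library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Step 1: inspect the header to see which fields are available.
hdr <- scanVcfHeader(fl)
info_fields <- rownames(info(hdr))
geno_fields <- rownames(geno(hdr))

# Step 2: read one info field and one geno field, restricted to a
# region of chromosome 22 (coordinates arbitrary, for illustration).
rng <- GRanges("22", IRanges(50300000, 50400000))
param <- ScanVcfParam(info = info_fields[1], geno = geno_fields[1],
    which = rng)

# Range-based reads need a tabix index, so a TabixFile is used here.
vcf_sub <- readVcf(TabixFile(fl), "hg19", param = param)
vcf_sub
```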
Exercise 2: Annotate the mouse ranges in 'gr_mm10' with AnnotationHub files.
Exercise 3: Annotate a gene range from Saccharomyces cerevisiae.