## Warning: package 'knitcitations' was built under R version 3.0.2
## Warning: package 'knitr' was built under R version 3.0.2
## Warning: package 'GGtools' was built under R version 3.0.2
## Warning: package 'GGBase' was built under R version 3.0.2
## Warning: package 'snpStats' was built under R version 3.0.2
## Warning: package 'survival' was built under R version 3.0.2
## Warning: package 'Matrix' was built under R version 3.0.2
## Warning: package 'IRanges' was built under R version 3.0.2
## Warning: package 'BiocGenerics' was built under R version 3.0.2
## Warning: package 'GenomicRanges' was built under R version 3.0.2
## Warning: package 'XVector' was built under R version 3.0.2
## Warning: package 'Rsamtools' was built under R version 3.0.2
## Warning: package 'Biostrings' was built under R version 3.0.2
Numerous studies have employed genome-wide measures of mRNA abundance (typically assayed using DNA microarrays, and more recently RNA-seq) in combination with high-resolution genotyping (often supplemented with statistical imputation to loci not directly assayed, leading to genotype calls with quantified uncertainty) to search for genetic determinants of variation in gene expression. Key references in human applications include Cheung et al. (2005), Majewski & Pastinen (2011), and Gaffney et al. (2012); Shabalin (2012) addresses computational concerns.
This document focuses on searches for eQTL
A typical report describes tuning of the search (including, for example, boundaries on minor allele frequencies of variants to be tested, approach to correction for batch effects and other forms of confounding, scope of search in terms of distance from gene coding region), enumerates variants with evidence of association to expression variation in nearby genes, and then characterizes the biological roles of the discovered variants.
Suppose there are \(N\) independently sampled individuals with gene expression measures on \(G\) genes or probes. Each individual is genotyped (or has genotype statistically imputed) at \(S\) DNA locations, catalogued by NCBI dbSNP or 1000 genomes. We are given a \(G \times N\) matrix of expression assay results, and \(N \times S\) genotyping results in the form of the number of B alleles (or expected number of B alleles) for each of the loci. Select the search radius \(\rho\) (for example, 100kb) and for each gene \(g\), determine the search neighborhoods \(N_g = N_{g,\rho} = [a_g-\rho, b_g+\rho]\), where \(a_g\) denotes the genomic coordinate of the 5' end of the transcript region for gene \(g\), and \(b_g\) is the coordinate at the 3' end. Let \(|N_g|\) denote the number of SNP in that neighborhood. Key objectives are
The code in the example for the GGtools function All.cis() yields an example of a sharply restricted search for cis eQTL on chr21, using data on the HapMap CEU population.
cc = new("CisConfig") # take a default configuration
chrnames(cc) = "21" # confine to chr21
estimates(cc) = FALSE # no point estimates neede
f1 <- All.cis(cc) # compute the tests; can be slow without attendance
# to parallelization
The result of the function inherits from GRanges, and includes metadata concerning its generation.
length(f1)
## [1] 6060
f1[1:3]
## cisRun with 3 ranges and 10 metadata columns:
## seqnames ranges strand | snp
## <Rle> <IRanges> <Rle> | <character>
## GI_11342663-S 21 [42683950, 42830869] + | rs2838008
## GI_11342663-S 21 [42683950, 42830869] + | rs9985079
## GI_11342663-S 21 [42683950, 42830869] + | rs12627067
## snplocs score fdr probeid MAF
## <integer> <numeric> <numeric> <character> <numeric>
## GI_11342663-S 42685044 0.44 0.8945 GI_11342663-S 0.3483
## GI_11342663-S 42687136 0.40 0.9070 GI_11342663-S 0.4156
## GI_11342663-S 42689443 0.21 0.9171 GI_11342663-S 0.2697
## dist.mid permScore_1 permScore_2 permScore_3
## <numeric> <numeric> <numeric> <numeric>
## GI_11342663-S -72366 0.10 1.57 0.44
## GI_11342663-S -70274 0.40 0.25 0.08
## GI_11342663-S -67967 0.35 0.00 0.09
## ---
## seqlengths:
## 21
## NA
metadata(f1)
## $call
## All.cis(config = cc)
##
## $config
## CisConfig instance. Key parameters:
## smpack = GGdata ; chrnames = 21
## nperm = 3 ; radius = 50000
## ====
## Configure using
## [1] "smpack<-" "rhs<-" "nperm<-"
## [4] "folderStem<-" "radius<-" "shortfac<-"
## [7] "chrnames<-" "smchrpref<-" "gchrpref<-"
## [10] "schrpref<-" "geneApply<-" "geneannopk<-"
## [13] "snpannopk<-" "smFilter<-" "exFilter<-"
## [16] "keepMapCache<-" "SSgen<-" "excludeRadius<-"
## [19] "estimates<-"
Use of GRanges for the organization of association test statistics allows easy amalgamation of findings with other forms of genomic annotation. Retention of the association scores achieved under permutation allows recomputation of plug-in FDR after combination or filtering.
Targeted visualization of association is supported with the plot_EvG function in GGBase. To obtain the figure on the right, the expression matrix has been transformed by removing the principal components corresponding to the 10 largest eigenvalues. This is a crude approach to reducing “expression heterogeneity'', a main concern of eQTL analyses to date (Leek & Storey, 2007).
## Warning: package 'yri1kgv' was built under R version 3.0.2
## To get a tailored smlSet, use getSS("yri1kgv", [chrvec])
##
## available chromosomes are named chr1, chr10, ..., chr8, chr9
## Warning: harmonizeSamples TRUE and sampleNames for es not coincident with
## rownames(sml[[1]]); harmonizing...
Above we have a single SNP-gene association.
The family of
associations observed cis to ABHD12 can also be
visualized in conjunction with the transcript models.
As of November 2013, a reasonably efficient representation of expression, sample and genotype data is defined using an R package containing
Elements of the sampleNames of the ExpressionSet instance must coincide with elements of the row names of the SnpMatrix instances. At time of analysis, warnings will be issued and harmonization attempts will be made when the coincidence is not exact.
The SnpMatrix instances make use of a byte code for (potentially) imputed genotypes. Each element of the code defines a point on the simplex displayed below, allowing a discrete but rich set of the key quantities of interest, the expected number of B alleles.
## Warning: package 'scatterplot3d' was built under R version 3.0.2
Note that the nucleotide codes are not carried in this representation. Typically for a diallelic SNP, B denotes the alphabetically later nucleotide.
We can illustrate the basic operations with this overall structure, using data collected on Yoruban (YRI) HapMap cell lines. Expression data were obtained at ArrayExpression E-MTAB-264 (Stranger et al. 2012).
library(GGtools)
library(yri1kgv)
library(lumiHumanAll.db)
## Warning: package 'lumiHumanAll.db' was built under R version 3.0.2
## Loading required package: AnnotationDbi
## Warning: package 'AnnotationDbi' was built under R version 3.0.2
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 3.0.2
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
##
##
## lumiHumanAll.db is using or is likely to need access to special
## nuID identifiers. Users can learn about these identifiers from
## vignette documentation provided with the lumi package.
if (!exists("y22")) y22 = getSS("yri1kgv", "chr22")
## Warning: harmonizeSamples TRUE and sampleNames for es not coincident with
## rownames(sml[[1]]); harmonizing...
y22
## SnpMatrix-based genotype set:
## number of samples: 79
## number of chromosomes present: 1
## annotation: lumiHumanAll.db
## Expression data dims: 21800 x 79
## Total number of SNP: 494322
## Phenodata: An object of class 'AnnotatedDataFrame'
## sampleNames: NA18486 NA18487 ... NA19257 (79 total)
## varLabels: Source.Name Material.Type ... Factor.Value.SIGNAL.
## (26 total)
## varMetadata: labelDescription
dim(exprs(y22))
## [1] 21800 79
fn = featureNames(y22)[1:5]
The annotation of expression features can be explored in several directions. First, the probe names themselves encode the 50mers on the chip.
library(lumi)
## Warning: package 'lumi' was built under R version 3.0.2
## Warning: replacing previous import 'image' when loading 'graphics'
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## Warning: replacing previous import 'nleqslv' when loading 'nleqslv'
id2seq(fn) # get the 50mer for each probe
## NQqs8dKRwVSgI4SRPk
## "CAAGGGGTATTACTCAGGCACTAACCCCAGGAAAGATGACAGCACATTGC"
## BvIpQQ9yzp__kCLnEU
## "GTTAGAGGCCAACAATTCTAGTATGGCTTGTTGGCAAAGAGTGCTACACC"
## NH1MoTHk7CULTog3nk
## "ACTTCCATAGGACATACTGCATGTAAGCCAAGTCATGGAGAATCTGCTGC"
## KNJlVFShMX1UoyIkRc
## "ATCAGCGCCCCCACCCAGGACATACCTTCCCCAGGATAGAGAGCACACCT"
## fuplG2R3erO3QrujDk
## "GTGGGCGCCACGTCGCACTCTCTGGGTATGTCTCAAGGTGTGGATAATGC"
# and some annotation
Second, the mapping to institutionally curated gene identifiers is available.
select(lumiHumanAll.db, keys = fn, keytype = "PROBEID", columns = c("SYMBOL",
"CHR", "ENTREZID"))
## PROBEID SYMBOL CHR ENTREZID
## 1 NQqs8dKRwVSgI4SRPk THBS3 1 7059
## 2 BvIpQQ9yzp__kCLnEU SLC38A2 12 54407
## 3 NH1MoTHk7CULTog3nk CCNB1 5 891
## 4 KNJlVFShMX1UoyIkRc ZNF496 1 84838
## 5 fuplG2R3erO3QrujDk LOC100130238 12 100130238
Finally, we can look at the genotype information. This can be voluminous and is managed in an environment to reduce potential copying expenses.
gt22 <- smList(y22)[[1]] # access to genotypes
as(gt22[1:5, 1:5], "character")
## rs149201999 rs146752890 rs139377059 rs188945759 rs6518357
## NA18486 "A/A" "A/A" "A/A" "A/A" "A/A"
## NA18487 "A/A" "A/A" "A/A" "A/A" "A/A"
## NA18489 "A/A" "A/A" "A/A" "A/A" "A/A"
## NA18498 "A/A" "A/A" "A/A" "A/A" "A/A"
## NA18499 "A/A" "A/A" "A/A" "A/A" "A/A"
cs22 = col.summary(gt22) # some information on genotypes
cs22[1:10, ]
## Calls Call.rate Certain.calls RAF MAF P.AA P.AB
## rs149201999 79 1 1 0.094937 0.094937 0.8101 0.18987
## rs146752890 79 1 1 0.069620 0.069620 0.8608 0.13924
## rs139377059 79 1 1 0.069620 0.069620 0.8608 0.13924
## rs188945759 79 1 1 0.006329 0.006329 0.9873 0.01266
## rs6518357 79 1 1 0.075949 0.075949 0.8481 0.15190
## rs62224609 79 1 1 0.000000 0.000000 1.0000 0.00000
## rs62224610 79 1 1 0.297468 0.297468 0.4937 0.41772
## rs143503259 79 1 1 0.000000 0.000000 1.0000 0.00000
## rs192339082 79 1 1 0.000000 0.000000 1.0000 0.00000
## rs79725552 79 1 1 0.075949 0.075949 0.8481 0.15190
## P.BB z.HWE
## rs149201999 0.00000 0.932328
## rs146752890 0.00000 0.665103
## rs139377059 0.00000 0.665103
## rs188945759 0.00000 0.056613
## rs6518357 0.00000 0.730537
## rs62224609 0.00000 NA
## rs62224610 0.08861 -0.005111
## rs143503259 0.00000 NA
## rs192339082 0.00000 NA
## rs79725552 0.00000 0.730537
This workflow is based on Amazon EC2 computation managed using the MIT starcluster utilities. Configuration and management of EC2 based machinery is quite simple. The bulk of the complete run described here used configuration variables
For chromosome 1, a rescue run was required with a larger instance type (m3.2xlarge).
We will describe an approach to using a cluster to search for eQTL. The master process will communicate with slaves via sockets; slaves will write results to disk and ship back to master. The task is executed across chromosomes that have been split roughly in half.
The ciseqByCluster function of GGtools is the workhorse for the search. Arguments to this function determine how the search will be configured and executed. The invocation here asks for a search on three chromosomes, dispatching work from a master R process to a 3 node cluster, with multicore concurrency for gene-specific searches on four cores per node. The output will be GFF3 files in the folder yrex1, subordinate to the folder where the master process invoked the function.
ciseqByCluster(pack = "yri1kgv", chromsToRun = 20:22, numNodes = 3, ncoresPerNode = 4,
targetfolder = "./yrex1")
The full set of arguments and defaults for ciseqByCluster is
# this program will split all chromsomes in roughly equal halves (by probe)
# and compute All.cis for all probes ... snps may recur in the different
# halves
if (!file.exists(targetfolder)) try(system(paste0("mkdir ", targetfolder)))
if (!file.exists(targetfolder)) stop(paste0("cannot create ", targetfolder))
library(parallel)
cl <<- makePSOCKcluster(nodeNames)
firstHalf <<- function(x) x[1:floor(length(x))/2]
secondHalf <<- function(x) x[-(1:floor(length(x))/2)]
setupSplit = function(nodeset = 1:numNodes) {
clusterApply(cl, nodeset, function(w) {
library(parallel) # get resources
library(GGtools)
library(pack, character.only = TRUE)
library(geneannopk, character.only = TRUE)
library(snpannopk, character.only = TRUE)
library(Rsamtools)
library(rtracklayer)
cc = new("CisConfig") # configure search, except for choice of chromosome
smpack(cc) = pack
nperm(cc) = nperm
geneannopk(cc) = geneannopk
radius(cc) = radius
smchrpref(cc) = smchrpref
geneApply(cc) = mclapply # so genes are dispatched to cores
options(mc.cores = ncoresPerNode)
cc <<- cc
})
}
setupSplit(1:numNodes) # causes library attachment on all nodes and generation of CisConfig instances there
runOneSplit <<- function(chrtag, suffix = "A", splitter = firstHalf, gffOnly = TRUE) {
# assumes cc is defined locally as the config
if (!exists("cc"))
stop("did not find cc for local CisConfig manipulation")
chrnames(cc) = as.character(chrtag)
folderStem(cc) = paste0(folderStem(cc), "_", chrtag, suffix)
smFilter(cc) = function(x) {
# late filtering
fn = featureNames(x)
if (numPCtoFilter > 0)
tmp = MAFfilter(clipPCs(x, 1:numPCtoFilter), lower = lowerMAF) else tmp = MAFfilter(x, lower = lowerMAF)
tmp[probeId(splitter(fn)), ]
}
obn = paste0(outprefix, "_", chrnames(cc), suffix)
fn = paste0(obn, ".rda")
res <- try(All.cis(cc))
if (inherits(res, "try-error"))
return(res)
getmindist = function(snplocs, probes, geneannopk = "lumiHumanAll.db") {
require(geneannopk, character.only = TRUE, quietly = TRUE)
stub = gsub(".db$", "", geneannopk)
locenv = get(paste0(stub, "CHRLOC"))
locendenv = get(paste0(stub, "CHRLOCEND"))
gloc = abs(sapply(mget(probes, locenv), "[", 1))
gend = abs(sapply(mget(probes, locendenv), "[", 1))
ifelse(snplocs <= gend & snplocs >= gloc, 0, ifelse(snplocs > gend,
snplocs - gend, gloc - snplocs))
}
res$mindist = getmindist(res$snplocs, res$probeid, geneannopk = geneannopk)
assign(obn, res)
if (!gffOnly) {
save(list = obn, file = fn) # save to local disk
system(paste0("cp ", fn, " ", targetfolder)) # copy to target folder
}
# define transformer
cr2gff = function(cr, obn, targetfolder = ".") {
require(GenomicRanges)
require(Rsamtools)
res = as(cr, "GRanges")
sl = IRanges(res$snplocs, width = 1)
ranges(res) = sl
names(res) = NULL
sn = seqnames(res)
sn = gsub("chr", "", sn)
o = order(as.numeric(sn), start(res))
res = res[o]
gffFile = paste0(targetfolder, "/", obn, ".gff3")
# seqlevels(res) = gsub('chr', '', seqlevels(res))
export.gff3(res, gffFile)
# owd = getwd() setwd(targetfolder) bgzip( gffFile , overwrite=TRUE)
# indexTabix( paste0(gffFile, '.gz') , format='gff') setwd( owd )
}
# continue processing to tabix-indexed gff3 based on SNP address
cr2gff(as(res, "GRanges"), obn, targetfolder)
NULL
}
clusterExport(cl, "runOneSplit")
clusterExport(cl, "firstHalf")
clusterExport(cl, "secondHalf")
njobs = 2 * length(chromsToRun)
chrtags = as.character(rep(chromsToRun, each = 2))
reqlist = vector("list", njobs)
j <- 1
for (i in 1:length(chrtags)) {
reqlist[[j]] = list(chr = chrtags[j], tag = "A", splitter = firstHalf) # need to distinguish splitter elements
reqlist[[j + 1]] = list(chr = chrtags[j + 1], tag = "B", splitter = secondHalf) # therefore loop is complex
j <- j + 2
}
reqlist <<- reqlist
clusterApplyLB(cl, reqlist, function(x) runOneSplit(x[["chr"]], x[["tag"]],
x[["splitter"]]))
# }