motif_pvalue {universalmotif} | R Documentation |
For calculating p-values/logodds scores for any number of motifs.
motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 6, progress = ifelse(length(motifs) > 1, TRUE, FALSE), BP = FALSE)
motifs |
See |
score |
|
pvalue |
|
bkg.probs |
|
use.freq |
|
k |
|
progress |
|
BP |
|
Calculating p-values for motifs can be very computationally intensive. This is due to how p-values must be calculated: for a given score, all possible sequences which score equal or higher must be found, and the probability for each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences higher than a given score can be done very efficiently and quickly with a branch-and-bound algorithm, but as the motif length increases this calculation becomes impractical. To get around this, the p-value calculation can be approximated.
In order to calculate p-values for longer motifs, this function uses the
approximation proposed by Hartmann et al. (2013), where
the motif is subset, p-values calculated for the subsets, and finally
combined for a total p-value. The smaller the size of the subsets, the
faster the calculation; but also, the bigger the approximation. This can be
controlled by setting k
. In fact, for smaller motifs (< 13 positions)
calculating exact p-values can be done individually in reasonable time by
setting k = 12
.
To calculate a score based on a given p-value, the means and variances of
each motif subsets are combined to estimate the distribution of all
possible scores using stats::qnorm()
:
qnorm(pvalue, mean = sum(subset.means), sd = sqrt(sum(subset.vars)))
For calculating exact scores, stats::ecdf()
and stats::quantile()
are
used:
quantile(ecdf(scores), probs = pvalue)
It is important to keep in mind that both approximate and exact score
calculations assume uniform backgrounds, so do not use this function for
motifs with extremely imbalanced backgrounds. To get all possible scores for
each subset, expand.grid()
is used instead of the branch-and-bound
algorithm used for calculating p-values. Keep this in mind for determining
the best k
value for motifs with alphabets longer than those of DNA/RNA
motifs.
numeric
A vector of scores/p-values.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
Hartmann H, Guthöhrlein E, Siebert M, and S. Luehr J. Söding (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181–194.
data(examplemotif) ## p-value/score calculations are performed using the PWM version of the ## motif; these calculations do not work if any -Inf values are present examplemotif["pseudocount"] <- 1 # or examplemotif <- normalize(examplemotif) ## get a minimum score based on a p-value motif_pvalue(examplemotif, pvalue = 0.001) ## get the probability of a particular sequence hit motif_pvalue(examplemotif, score = 0) ## the calculations can be performed for multiple motifs motif_pvalue(list(examplemotif, examplemotif), pvalue = c(0.001, 0.0001)) ## Compare score thresholds and P-value: scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9)) motif_pvalue(examplemotif, scores) ## Calculate the probability of getting a certain match or better: TATATAT <- score_match(examplemotif, "TATATAT") TATATAG <- score_match(examplemotif, "TATATAG") motif_pvalue(examplemotif, TATATAT) motif_pvalue(examplemotif, TATATAG) ## Get all possible matches by P-value: get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001))