compare_motifs {universalmotif}    R Documentation

Compare motifs.

Description

Compare motifs using four available metrics: Pearson correlation coefficient (Pietrokovski 1996), Euclidean distance (Choi et al. 2004), Sandelin-Wasserman similarity (Sandelin and Wasserman 2004), and Kullback-Leibler divergence (Roepcke et al. 2005).

Usage

compare_motifs(motifs, compare.to, db.scores, use.freq = 1,
  use.type = "PPM", method = "MPCC", tryRC = TRUE, min.overlap = 6,
  min.mean.ic = 0.5, relative_entropy = FALSE,
  normalise.scores = FALSE, max.p = 0.01, max.e = 10,
  progress = TRUE, BP = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

compare.to

numeric If missing, all motifs are compared to all other motifs. Otherwise, all motifs are compared to the specified motif(s).

db.scores

data.frame See details.

use.freq

numeric(1). For comparing the multifreq slot.

use.type

character(1) Either 'PPM' or 'ICM'. The latter allows background frequencies to be taken into account if relative_entropy = TRUE.

method

character(1) One of c('PCC', 'MPCC', 'EUCL', 'MEUCL', 'SW', 'MSW', 'KL', 'MKL'). See details.

tryRC

logical Also try the reverse complement of the motifs and report the best score.

min.overlap

numeric(1) Minimum overlap required when aligning the motifs. Setting this to a number higher than the width of the motifs will not allow any overhangs. Can also be a number less than 1, representing the minimum fraction that the motifs must overlap.

min.mean.ic

numeric(1) Minimum mean information content between the two motifs for an alignment to be scored. This helps prevent scoring alignments between low information content regions of two motifs.

relative_entropy

logical(1) For ICM calculation. See convert_type().

normalise.scores

logical(1) Favour alignments that leave fewer unaligned positions, as well as alignments between motifs of similar length. Similarity scores are multiplied by the ratio of aligned positions to the total number of positions in the larger motif; distance scores are multiplied by the inverse of this ratio.

max.p

numeric(1) Maximum P-value allowed in reporting matches. Only used if compare.to is set.

max.e

numeric(1) Maximum E-value allowed in reporting matches. Only used if compare.to is set. The E-value is the P-value multiplied by the number of input motifs times two.

progress

logical(1) Show progress. Not recommended if BP = TRUE.

BP

logical(1) Allows the use of BiocParallel within compare_motifs(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for comparing large numbers of motifs (>10,000). Furthermore, the behaviour of progress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).
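
The arithmetic described for normalise.scores and max.e can be sketched in base R. This is only an illustration of the formulas as worded above, not the package's internal code; the helper names normalise_score() and e_value() are hypothetical and do not exist in universalmotif.

```r
# Hypothetical helpers (NOT part of universalmotif), written from the
# argument descriptions above.

# Scale a score by the ratio of aligned positions to the width of the
# larger motif: similarity scores are multiplied by the ratio, distance
# scores by its inverse.
normalise_score <- function(score, n.aligned, width1, width2,
                            type = c("similarity", "distance")) {
  type <- match.arg(type)
  ratio <- n.aligned / max(width1, width2)
  if (type == "similarity") score * ratio else score / ratio
}

# E-value: the P-value multiplied by the number of input motifs times two.
e_value <- function(p, n.motifs) p * n.motifs * 2

normalise_score(0.9, n.aligned = 6, width1 = 8, width2 = 12)   # 0.45
normalise_score(2, n.aligned = 6, width1 = 8, width2 = 12,
                type = "distance")                             # 4
e_value(0.001, n.motifs = 50)                                  # 0.1
```

With 50 input motifs, a match with a P-value of 0.001 passes the default max.p = 0.01 and, with an E-value of 0.1, also passes the default max.e = 10.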

Details

Comparisons are calculated between two motifs at a time. All possible alignments are scored, and the best score is reported. Scores are calculated per position and summed, unless the 'mean' version of the specific metric is chosen. With a similarity metric, the sum of scores favours comparisons between longer motifs; with a distance metric, the sum of scores favours comparisons between shorter motifs. This can be avoided by using the 'mean' of scores.
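
The alignment procedure described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the two matrices are made-up toy PPMs, score_alignments() is a hypothetical helper, only ungapped alignments with the Pearson correlation coefficient are scored, and tryRC and min.mean.ic are ignored.

```r
# Two toy PPMs (made up): rows are A, C, G, T; columns are positions.
ppm1 <- matrix(c(0.7, 0.1, 0.1, 0.1,
                 0.1, 0.7, 0.1, 0.1,
                 0.1, 0.1, 0.7, 0.1,
                 0.1, 0.1, 0.1, 0.7), nrow = 4)
ppm2 <- ppm1[, 2:4]  # the last three positions of ppm1

# Hypothetical helper: score every ungapped alignment of m2 against m1
# that has at least min.overlap aligned columns.
score_alignments <- function(m1, m2, min.overlap = 2) {
  w1 <- ncol(m1)
  w2 <- ncol(m2)
  offsets <- seq(-(w2 - 1), w1 - 1)  # shift of m2 relative to m1
  out <- lapply(offsets, function(off) {
    i1 <- seq_len(w1)
    i2 <- i1 - off                   # column of m2 aligned to column i1 of m1
    keep <- i2 >= 1 & i2 <= w2
    if (sum(keep) < min.overlap) return(NULL)
    # Per-position score: Pearson correlation between the aligned columns
    s <- mapply(function(a, b) cor(m1[, a], m2[, b]), i1[keep], i2[keep])
    data.frame(offset = off, overlap = sum(keep), sum = sum(s), mean = mean(s))
  })
  do.call(rbind, Filter(Negate(is.null), out))
}

scores <- score_alignments(ppm1, ppm2)
best_sum <- scores[which.max(scores$sum), ]    # best alignment by summed score
best_mean <- scores[which.max(scores$mean), ]  # best alignment by mean score
scores
```

In this toy case both the sum and the mean pick the alignment at offset 1, where ppm2 lines up with the matching columns of ppm1. The summed score there equals the number of aligned positions, while the mean score does not depend on the overlap length.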

A note regarding P-values: P-values are precomputed using the make_DBscores() function. If db.scores is not given, a set of internal precomputed P-values from the JASPAR2018 CORE motifs is used. These precalculated scores depend on the lengths of the motifs being compared; this takes into account that comparing small motifs with larger motifs leads to higher scores, since the probability of finding a higher-scoring alignment is higher.

The default p-values have been precalculated for regular DNA motifs; they are of little use for motifs with a different number of alphabet letters (or even the multifreq slot).

Value

matrix if compare.to is missing; data.frame otherwise.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Choi I, Kwon J, Kim S (2004). “Local feature frequency profile: a method to measure structural similarity in proteins.” PNAS, 101, 3797–3802.

Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon J, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas D, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman W, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260–D266.

Pietrokovski S (1996). “Searching databases of conserved sequence regions by aligning protein multiple-alignments.” Nucleic Acids Research, 24, 3836–3845.

Roepcke S, Grossmann S, Rahmann S, Vingron M (2005). “T-Reg Comparator: an analysis tool for the comparison of position weight matrices.” Nucleic Acids Research, 33, W438–W441.

Sandelin A, Wasserman W (2004). “Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.” Journal of Molecular Biology, 338(2), 207–215.

See Also

convert_motifs(), motif_tree(), view_motifs()

Examples

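library(universalmotif)

## create two random motifs with the default settings
motif1 <- create_motif()
motif2 <- create_motif()
## compare them using the mean Pearson correlation coefficient
motif1vs2 <- compare_motifs(list(motif1, motif2), method = "MPCC")
## to get a dist object:
as.dist(1 - motif1vs2)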
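
A dist object like the one above can be passed to base R clustering functions. The following is a minimal sketch that assumes a similarity metric with a maximum of 1 (such as the mean Pearson correlation coefficient); the matrix sim holds made-up values standing in for the matrix output of compare_motifs(), and the motif names are hypothetical.

```r
# Made-up similarity scores standing in for compare_motifs() matrix output
ids <- c("m1", "m2", "m3")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.3,
                0.2, 0.3, 1.0),
              nrow = 3, dimnames = list(ids, ids))

# Convert similarity to distance, then cluster hierarchically
d <- as.dist(1 - sim)
hc <- hclust(d, method = "average")

# Cut the tree into two groups: m1 and m2 fall together, m3 is on its own
groups <- cutree(hc, k = 2)
groups
```

For distance metrics the scores could be given to as.dist() directly, since the 1 - score conversion only applies to similarity scores.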


[Package universalmotif version 1.2.0 Index]