get_bkg {universalmotif}    R Documentation

Calculate sequence background.

Description

For a set of input sequences, calculate the overall sequence background for any k-let size. This function is recommended only for non-DNA/RNA sequences; for DNA/RNA sequences, use the much faster Biostrings::oligonucleotideFrequency() instead.

Usage

get_bkg(sequences, k = 1:3, as.prob = TRUE, pseudocount = 0,
  alphabet = NULL, to.meme = NULL, RC = FALSE, list.out = TRUE,
  progress = FALSE, BP = FALSE)

Arguments

sequences

XStringSet Input sequences. Note that if multiple sequences are present, they will be combined into one.

k

integer Size(s) of k-let. Any k-let size can be used, and several sizes can be given at once (e.g. 1:3), in which case a background is calculated for each.

as.prob

logical(1) If TRUE, return k-let probabilities; if FALSE, return k-let counts.

pseudocount

integer(1) Count added to each possible k-let. A positive value prevents any k-let from having a probability of exactly 0 or 1.

alphabet

character(1) Provide a custom alphabet to calculate a background for. If NULL, then standard letters will be assumed for DNA, RNA and AA sequences, and all unique letters found will be used for BStringSet type sequences.

to.meme

If not NULL, then get_bkg() will return the sequence background in MEME Markov Background Model format. The value of this argument is passed on as cat(..., file = to.meme) within get_bkg(), so it can be a file name or a connection such as stdout(). See http://meme-suite.org/doc/bfile-format.html for a description of the format.

RC

logical(1) Calculate the background of the reverse complement of the input sequences as well. Only valid for DNA/RNA.

list.out

logical(1) Return background frequencies as a list, with an entry for each k. If FALSE, return a single vector.

progress

logical(1) Show progress. Not recommended if BP = TRUE.

BP

logical(1) Allows the use of BiocParallel within get_bkg(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for large jobs. Furthermore, the behaviour of progress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).

Value

If to.meme = NULL and list.out = TRUE: a list with each entry being a named numeric vector for every element in k. If to.meme = NULL and list.out = FALSE: a named numeric vector. Otherwise: NULL, invisibly.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Bailey T, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers.” Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 2, 28–36.

See Also

create_sequences(), scan_sequences(), shuffle_sequences()

Examples

## Compare to Biostrings version
library(Biostrings)
seqs.DNA <- create_sequences()
bkg.DNA <- get_bkg(seqs.DNA, k = 3, as.prob = FALSE, list.out = FALSE)
bkg.DNA2 <- oligonucleotideFrequency(seqs.DNA, 3, 1, as.prob = FALSE)
bkg.DNA2 <- colSums(bkg.DNA2)
all(bkg.DNA == bkg.DNA2)

## Create a MEME background file
get_bkg(seqs.DNA, k = 1:3, to.meme = stdout(), pseudocount = 1)

## Non-DNA/RNA/AA alphabets
seqs.QWERTY <- create_sequences("QWERTY")
bkg.QWERTY <- get_bkg(seqs.QWERTY, k = 1:2)
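
## Illustrative base-R sketch of overlapping k-let counting with a
## pseudocount, to show roughly what the counts and probabilities mean.
## This is NOT the implementation used by get_bkg(): count_klets() is a
## hypothetical helper, the input string is made up, and a single sequence
## with a window step of 1 is assumed.
count_klets <- function(x, k, alphabet, pseudocount = 0) {
  # Enumerate every possible k-let over the alphabet, in sorted order
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  klets <- sort(apply(grid, 1, paste, collapse = ""))
  # Extract all overlapping windows of width k (step of 1)
  starts <- seq_len(max(0, nchar(x) - k + 1))
  windows <- substring(x, starts, starts + k - 1)
  # Count each k-let, keeping k-lets that never occur as zeros
  counts <- table(factor(windows, levels = klets))
  # Add the pseudocount to every possible k-let
  setNames(as.numeric(counts), klets) + pseudocount
}
toy.counts <- count_klets("ACGTACGA", k = 2,
                          alphabet = c("A", "C", "G", "T"), pseudocount = 1)
## Dividing by the total turns counts into probabilities; with a positive
## pseudocount, no k-let ends up with a probability of exactly 0 or 1
toy.probs <- toy.counts / sum(toy.counts)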


[Package universalmotif version 1.2.0 Index]