scoreTranscripts {transite}    R Documentation

Scores transcripts with position weight matrices

Description

Counts the putative binding sites in a set of sequences for all or a subset of RNA-binding protein sequence motifs. The result includes a data frame, which is subsequently used by calculateMotifEnrichment to obtain binding site enrichment scores.

Usage

scoreTranscripts(sequences, motifs = NULL, max.hits = 5,
  threshold.method = "p.value", threshold.value = 0.25^6,
  n.cores = 1, cache = paste0(tempdir(), "/sc/"))

Arguments

sequences

character vector of named sequences (only containing upper case characters A, C, G, T), where the names are RefSeq identifiers and sequence type qualifiers ("3UTR", "5UTR", "mRNA"), e.g. "NM_010356|3UTR"

motifs

a list of motifs that is used to score the specified sequences. If motifs is NULL (the default), all Transite motifs are used.

max.hits

maximum number of putative binding sites per mRNA that are counted

threshold.method

either "p.value" (default) or "relative". If threshold.method equals "p.value", the default threshold.value is 0.25^6, which is the lowest p-value that can be achieved by hexamer motifs, the shortest supported motifs. If threshold.method equals "relative", the default threshold.value is 0.9, which is 90% of the maximum PWM score.

threshold.value

semantics of the threshold.value depend on threshold.method (default is 0.25^6)

n.cores

the number of cores that are used

cache

either a logical value or the path to a directory where scores are cached. The scores of each motif are stored in a separate file that contains a hash table with RefSeq identifiers and sequence type qualifiers as keys and the number of putative binding sites as values. If cache is FALSE, scores will not be cached.
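
The difference between the two threshold methods can be illustrated with a small base R sketch. It assumes a made-up six-position score matrix (toy.pwm, not a Transite motif) and an empirical p-value defined as the fraction of all 4^6 hexamers that score at least as high as a given hexamer. How scoreTranscripts computes scores and p-values internally may differ.

```r
# Hypothetical 6 x 4 score matrix (rows: positions, columns: bases).
# The values are invented for illustration and do not come from Transite.
bases <- c("A", "C", "G", "U")
toy.pwm <- matrix(-1, nrow = 6, ncol = 4, dimnames = list(NULL, bases))
consensus <- c("U", "A", "U", "U", "U", "A")
toy.pwm[cbind(seq_len(6), match(consensus, bases))] <- 2
toy.pwm[6, "U"] <- 1.5  # near-optimal alternative at the last position

# Enumerate all 4^6 = 4096 hexamers as base indices and score each one.
grid <- expand.grid(rep(list(seq_along(bases)), 6))
scores <- rowSums(sapply(seq_len(6), function(p) toy.pwm[p, grid[[p]]]))

# Empirical p-value: fraction of hexamers scoring at least as high.
p.values <- vapply(scores, function(s) mean(scores >= s), numeric(1))

# "p.value" method with threshold 0.25^6: only the single best hexamer passes.
n.pvalue <- sum(p.values <= 0.25^6)

# "relative" method with threshold 0.9: hexamers scoring at least
# 90% of the maximum score pass.
n.relative <- sum(scores >= 0.9 * max(scores))
```

In this toy setting the p-value threshold of 0.25^6 admits exactly one hexamer out of 4096, whereas the relative threshold also admits the near-optimal variant.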

Value

A list with three entries:

(1) df: a data frame with the following columns:

motif.id: the motif identifier that is used in the original motif library
motif.rbps: the gene symbol of the RNA-binding protein(s)
absolute.hits: the absolute frequency of putative binding sites per motif in all transcripts
relative.hits: the relative, i.e., absolute divided by total, frequency of binding sites per motif in all transcripts
total.sites: the total number of potential binding sites
one.hit, two.hits, ...: number of transcripts with one, two, three, ... putative binding sites

(2) total.sites: a numeric vector with the total number of potential binding sites per transcript

(3) absolute.hits: a numeric vector with the absolute (not relative) number of putative binding sites per transcript
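
The relationship between these quantities can be sketched in base R. The per-transcript counts below are invented for illustration (they are not output of scoreTranscripts), and the column names beyond two.hits are an assumed continuation of the naming pattern listed above.

```r
# Invented per-transcript values for a single motif; illustration only.
absolute.hits <- c(tx1 = 0, tx2 = 1, tx3 = 3, tx4 = 1)
total.sites <- c(tx1 = 8, tx2 = 10, tx3 = 20, tx4 = 12)
max.hits <- 5

# relative.hits: absolute hits divided by total potential sites.
relative.hits <- sum(absolute.hits) / sum(total.sites)

# one.hit, two.hits, ...: number of transcripts with exactly that many hits.
hit.counts <- setNames(
  as.vector(table(factor(absolute.hits, levels = seq_len(max.hits)))),
  c("one.hit", "two.hits", "three.hits", "four.hits", "five.hits")
)
```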

See Also

Other matrix functions: calculateMotifEnrichment, runMatrixSPMA, runMatrixTSMA, scoreTranscriptsSingleMotif

Examples

foreground.set <- c(
  "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU",
  "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU",
  "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU",
  "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA",
  "AUAGAC", "AGUUC", "CCAGUAA"
)
# names are used as keys in the hash table (cached version only)
# ideally sequence identifiers (e.g., RefSeq ids) and region labels
# (e.g., 3UTR for 3'-UTR)
names(foreground.set) <- c(
  "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR",
  "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR",
  "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR",
  "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR",
  "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR"
)

# specific motifs, uncached
motifs <- getMotifByRBP("ELAVL1")
scores <- scoreTranscripts(foreground.set, motifs = motifs, cache = FALSE)
## Not run: 
# all Transite motifs, cached (writes scores to disk)
scores <- scoreTranscripts(foreground.set)

# all Transite motifs, uncached
scores <- scoreTranscripts(foreground.set, cache = FALSE)

foreground.df <- transite:::ge$foreground1
foreground.set <- foreground.df$seq
names(foreground.set) <- paste0(foreground.df$refseq, "|",
   foreground.df$seq.type)
scores <- scoreTranscripts(foreground.set)

## End(Not run)
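
Before scoring, it can help to check that sequence names follow the identifier|type convention described under Arguments. The helper below, hasValidNames, is hypothetical and not part of transite. It is a base R sketch that only checks the name pattern.

```r
# Hypothetical helper (not a transite function): checks that every name has
# the form "<identifier>|<type>" with type one of 3UTR, 5UTR, or mRNA.
hasValidNames <- function(sequences) {
  nms <- names(sequences)
  !is.null(nms) && all(grepl("^[^|]+\\|(3UTR|5UTR|mRNA)$", nms))
}

seqs <- c("CAACAGCCUUAAUU", "CAGUCAAGACUCC")
names(seqs) <- paste0(c("NM_1_DUMMY", "NM_2_DUMMY"), "|", "3UTR")
hasValidNames(seqs)
```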

[Package transite version 1.2.0 Index]