search_CPs {InPAS} | R Documentation |
Estimate the CP sites for UTRs on a given chromosome
search_CPs( seqname, sqlite_db, utr3, background, z2s, depth.weight, genome, MINSIZE = 10, window_size = 100, search_point_START = 50, search_point_END = NA, cutStart = 10, cutEnd = 0, adjust_distal_polyA_end = TRUE, coverage_threshold = 5, long_coverage_threshold = 2, PolyA_PWM = NA, classifier = NA, classifier_cutoff = 0.8, shift_range = window_size, step = 1, two_way = FALSE, hugeData = TRUE, outdir, silence = FALSE )
seqname |
A character(1) vector, specifying a chromososome/scaffold name |
sqlite_db |
A path to the SQLite database for InPAS, i.e. the output of
|
utr3 |
An object of GenomicRanges::GRanges. Output of |
background |
A character(1) vector, the range for calculating cutoff threshold of local background. It can be "same_as_long_coverage_threshold", "1K", "5K","10K", or "50K". |
z2s |
one element of an output of |
depth.weight |
A named vector. One element of an output of
|
genome |
A BSgenome::BSgenome object |
MINSIZE |
A integer(1) vector, specifying the minimal length in bp of a short/proximal 3' UTR. Default, 10 |
window_size |
An integer(1) vector, the window size for novel distal or proximal CP site searching. default: 100. |
search_point_START |
A integer(1) vector, starting point relative to the 5' extremity of 3' UTRs for searching for proximal CP sites |
search_point_END |
A integer(1) vector, ending point relative to the 3' extremity of 3' UTRs for searching for proximal CP sites |
cutStart |
An integer(1) vector a numeric(1) vector. What percentage or how many nucleotides should be removed from 5' extremities before searching for CP sites? It can be a decimal between 0, and 1, or an integer greater than 1. 0.1 means 10 percent, 25 means cut first 25 bases |
cutEnd |
An integer(1) vector a numeric(1) vector. What percentage or how many nucleotides should be removed from 5' extremities before searching for CP sites? It can be a decimal between 0, and 1, or an integer greater than 1. 0.1 means 10 percent, 25 means cut first 25 bases |
adjust_distal_polyA_end |
A logical(1) vector. If true, distal CP sites are subject to adjustment by the Naive Bayes classifier from the cleanUpdTSeq::cleanUpdTSeq-package |
coverage_threshold |
An integer(1) vector, specifying the cutoff threshold of coverage for first 100 nucleotides. If the coverage of first 100 nucleotides is lower than coverage_threshold, that transcript will be not considered for further analysis. Default, 5. |
long_coverage_threshold |
An integer(1) vector, specifying the cutoff threshold of coverage for the terminal of long form 3' UTRs. If the coverage of first 100 nucleotides is lower than coverage_threshold, that transcript will be not considered for further analysis. Default, 2. |
PolyA_PWM |
An R object for a position weight matrix (PWM) for a hexamer polyadenylation signal (PAS), such as AAUAAA. |
classifier |
An R object for Naive Bayes classifier model, like the one in the cleanUpdTSeq package. |
classifier_cutoff |
A numeric(1) vector. A cutoff of probability that a site is classified as true CP sites. The value should be between 0.5 and 1. Default, 0.8. |
shift_range |
An integer(1) vector, specifying a shift range for adjusting the proximal and distal CP sites. Default, 100. It determines the range flanking the candidate CP sites to search the most likely real CP sites. |
step |
An integer (1) vector, specifying the step size used for adjusting the proximal or distal CP sites using the Naive Bayes classifier from the cleanUpdTSeq package. Default 1. It can be in the range of 1 to 10. |
two_way |
A logical (1), indicating whether the proximal CP sites are searched from both directions or not. |
hugeData |
A logical(1), indicating whether it is huge data |
outdir |
A character(1) vector, a path with write permission for storing the CP sites. If it doesn't exist, it will be created. |
silence |
logical(1), indicating whether progress is reported or not. By default, FALSE |
An object of GenomicRanges::GRanges containing distal and proximal CP site information for each 3' UTR
Jianhong Ou, Haibo Liu
search_proximalCPs()
, adjust_proximalCPs()
,
adjust_proximalCPsByPWM()
, adjust_proximalCPsByNBC()
,
get_PAscore()
, get_PAscore2()
if (interactive()) { library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) genome <- BSgenome.Mmusculus.UCSC.mm10 TxDb <- TxDb.Mmusculus.UCSC.mm10.knownGene ## load UTR3 annotation and convert it into a GRangesList data(utr3.mm10) utr3 <- split(utr3.mm10, seqnames(utr3.mm10)) bedgraphs <- system.file("extdata",c("Baf3.extract.bedgraph", "UM15.extract.bedgraph"), package = "InPAS") tags <- c("Baf3", "UM15") metadata <- data.frame(tag = tags, condition = c("Baf3", "UM15"), bedgraph_file = bedgraphs) outdir = tempdir() write.table(metadata, file =file.path(outdir, "metadata.txt"), sep = "\t", quote = FALSE, row.names = FALSE) sqlite_db <- setup_sqlitedb(metadata = file.path(outdir, "metadata.txt"), outdir) coverage <- list() for (i in seq_along(bedgraphs)) { coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i], tag = tags[i], genome = genome, sqlite_db = sqlite_db, outdir = outdir, removeScaffolds = TRUE, BPPARAM = NULL)} coverage_files <- assemble_allCov(sqlite_db, outdir, genome, removeScaffolds = TRUE) data4CPsSearch <- setup_CPsSearch(sqlite_db, genome, utr3, background = "10K", TxDb = TxDb, removeScaffolds = TRUE, BPPARAM = NULL, hugeData = TRUE, outdir = outdir) ## polyA_PWM load(system.file("extdata", "polyA.rda", package = "InPAS")) ## load the Naive Bayes classifier model from the cleanUpdTSeq package library(cleanUpdTSeq) data(classifier) CPs <- search_CPs(seqname = "chr6", sqlite_db = sqlite_db, utr3 = utr3, background = data4CPsSearch$background, z2s = data4CPsSearch$z2s, depth.weight = data4CPsSearch$depth.weight, genome = genome, MINSIZE = 10, window_size = 100, search_point_START =50, search_point_END = NA, cutStart = 10, cutEnd = 0, adjust_distal_polyA_end = TRUE, coverage_threshold = 5, long_coverage_threshold = 2, PolyA_PWM = pwm, classifier = classifier, classifier_cutoff = 0.8, shift_range = 100, step = 5, two_way = FALSE, hugeData = TRUE, outdir = outdir) }