analyzeORF {IsoformSwitchAnalyzeR}    R Documentation

Prediction of Transcript Open Reading Frame.

Description

Predicts the most likely Open Reading Frame (ORF) and the NMD sensitivity of the isoforms stored in a switchAnalyzeRlist object. This functionality is intended to help annotate isoforms if you have performed (guided) de-novo isoform reconstruction (isoform deconvolution). Otherwise you should use the annotated CDS (CoDing Sequence), typically obtained through one of the implemented import methods (see vignette for details).

Usage

analyzeORF(
    switchAnalyzeRlist,
    genomeObject = NULL,
    minORFlength=100,
    orfMethod = "longest",
    cds = NULL,
    PTCDistance = 50,
    startCodons="ATG",
    stopCodons=c("TAA", "TAG", "TGA"),
    showProgress=TRUE,
    quiet=FALSE
)

Arguments

switchAnalyzeRlist

A switchAnalyzeRlist object.

genomeObject

A BSgenome object used as reference genome (e.g. 'Hsapiens' for Homo sapiens). Only necessary if transcript sequences were not already added (via the 'isoformNtFasta' argument in importRdata()).

minORFlength

The minimum size (in nucleotides) an ORF must have to be considered (and reported). Please note that we recommend using CPAT to predict coding potential instead of relying on this cutoff, which is implemented only as a pre-filter (see analyzeCPAT). Default is 100 nucleotides, a length that more than 97.5% of Gencode coding isoforms in both human and mouse reach.

orfMethod

A string indicating which of the 4 available ORF identification methods should be used. The methods are:

  • longest : Identifies the longest ORF in the transcript (after filtering via minORFlength). This approach is similar to what the CPAT tool uses in its analysis of coding potential.

  • mostUpstream : Identifies the most upstream ORF in the transcript (after filtering via minORFlength).

  • longestAnnotated : Identifies the longest ORF (after filtering via minORFlength) downstream of an annotated translation start site (which is supplied via the cds argument).

  • mostUpstreamAnnoated : Identifies the ORF (after filtering via minORFlength) downstream of the most upstream overlapping annotated translation start site (supplied via the cds argument).

Default is longest.
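
The difference between the longest and mostUpstream methods can be illustrated with a minimal base-R sketch. It is not the package's implementation: findOrfs() and pickOrf() are hypothetical helpers, only the three forward reading frames are scanned, and ORF length is assumed to be counted from the first nucleotide of the start codon to the last nucleotide of the stop codon.

```r
# Hypothetical helper (not part of IsoformSwitchAnalyzeR): list all ORFs in a
# transcript sequence by scanning the three forward reading frames.
findOrfs <- function(ntSeq,
                     startCodons  = "ATG",
                     stopCodons   = c("TAA", "TAG", "TGA"),
                     minORFlength = 100) {
    ntSeq <- toupper(ntSeq)
    n <- nchar(ntSeq)
    orfs <- list()
    for (frame in 0:2) {
        if (n - frame < 3) next
        # First nucleotide of every complete codon in this frame
        pos <- seq(frame + 1, n - 2, by = 3)
        codons <- substring(ntSeq, pos, pos + 2)
        isStart <- codons %in% startCodons
        isStop  <- codons %in% stopCodons
        open <- NA  # index of the first start codon since the last stop
        for (i in seq_along(codons)) {
            if (is.na(open) && isStart[i]) {
                open <- i
            } else if (!is.na(open) && isStop[i]) {
                orfs[[length(orfs) + 1]] <-
                    data.frame(start = pos[open], end = pos[i] + 2)
                open <- NA
            }
        }
    }
    if (length(orfs) == 0) {
        return(data.frame(start = numeric(0), end = numeric(0),
                          length = numeric(0)))
    }
    orfs <- do.call(rbind, orfs)
    # Assumption: length includes both the start and the stop codon
    orfs$length <- orfs$end - orfs$start + 1
    orfs[orfs$length >= minORFlength, , drop = FALSE]
}

# Hypothetical helper: choose one ORF according to the selection method.
pickOrf <- function(orfs, orfMethod = c("longest", "mostUpstream")) {
    orfMethod <- match.arg(orfMethod)
    if (nrow(orfs) == 0) return(orfs)
    if (orfMethod == "longest") {
        orfs[which.max(orfs$length), , drop = FALSE]
    } else {
        orfs[which.min(orfs$start), , drop = FALSE]
    }
}

# Toy transcript with a short upstream ORF and a longer downstream ORF
toy  <- "CCATGAAATAACATGGCCGCCGCCTGA"
orfs <- findOrfs(toy, minORFlength = 9)
pickOrf(orfs, "longest")       # ORF starting at position 13 (15 nt)
pickOrf(orfs, "mostUpstream")  # ORF starting at position 3 (9 nt)
```

Raising minORFlength to 10 in this toy example removes the upstream ORF before selection, so both methods then return the same ORF. This is the sense in which the cutoff acts as a pre-filter.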

cds

A CDSSet object containing annotated coding regions, see ?CDSSet and ?getCDS for more information. Only necessary if the orfMethod argument is 'longestAnnotated' or 'mostUpstreamAnnoated'.

PTCDistance

A numeric giving the premature termination codon (PTC) distance threshold: the maximal allowed distance (number of nucleotides) from the STOP codon to the final exon-exon junction. If the distance from the STOP codon to the final exon-exon junction is larger than this, the isoform is marked as NMD-sensitive. Default is 50.

startCodons

A vector of strings indicating the start codons identified in the DNA sequence. Default is 'ATG' (corresponding to the RNA-sequence AUG).

stopCodons

A vector of strings indicating the stop codons identified in the DNA sequence. Default is c("TAA", "TAG", "TGA").

showProgress

A logical indicating whether to make a progress bar (if TRUE) or not (if FALSE). Default is TRUE.

quiet

A logical indicating whether to avoid printing progress messages (incl. progress bar). Default is FALSE.

Details

The function uses the genomic coordinates of the transcript model to extract the nucleotide sequence of the transcript from the supplied BSgenome object (reference genome). The nucleotide sequence is then used to predict the most likely ORF (the method is controlled by the orfMethod argument, see above). If the distance from the stop position (ORF end) to the final exon-exon junction is larger than the threshold given in PTCDistance (and the stop position does not fall in the last exon), the stop position is considered premature and the transcript is marked as NMD (nonsense mediated decay) sensitive, in accordance with literature consensus (Weischenfeldt et al (see references)).
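
The PTC rule described above can be written out as a small base-R sketch. It is only an illustration under stated assumptions, not the package's code: isPTC() is a hypothetical helper, exon widths are given in transcript order (5' to 3'), and the distance is assumed to be measured in transcript coordinates from the last nucleotide of the stop codon to the last nucleotide before the final exon-exon junction.

```r
# Hypothetical helper (not part of IsoformSwitchAnalyzeR): apply the
# PTC rule to one transcript.
#   exonWidths  : exon lengths in transcript order (5' to 3')
#   stopPos     : transcript coordinate of the last nucleotide of the stop codon
#   PTCDistance : distance threshold in nucleotides
isPTC <- function(exonWidths, stopPos, PTCDistance = 50) {
    # A single-exon transcript has no exon-exon junction
    if (length(exonWidths) < 2) return(FALSE)
    # Transcript coordinate of the last nucleotide before the final junction
    lastJunction <- sum(exonWidths[-length(exonWidths)])
    # A stop codon in the last exon is never considered premature
    inLastExon <- stopPos > lastJunction
    # Assumption: distance is measured from the end of the stop codon
    distToJunction <- lastJunction - stopPos
    !inLastExon && distToJunction > PTCDistance
}

exons <- c(200, 150, 300)  # final junction after transcript position 350
isPTC(exons, stopPos = 250)  # TRUE:  100 nt upstream of the junction
isPTC(exons, stopPos = 320)  # FALSE: only 30 nt upstream of the junction
isPTC(exons, stopPos = 400)  # FALSE: stop codon lies in the last exon
```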

The Gencode reference annotations used here are GencodeV19, GencodeV24, GencodeM1 and GencodeM9. For more info see Vitting-Seerup et al 2017.

Value

A switchAnalyzeRlist where:

The data.frame added has one row per isoform and contains 11 columns:

NA means no information was available, i.e. no ORF (passing the minORFlength filter) was found.

Author(s)

Kristoffer Vitting-Seerup

References

See Also

createSwitchAnalyzeRlist
preFilter
isoformSwitchTestDEXSeq
isoformSwitchTestDRIMSeq
extractSequence
analyzeCPAT

Examples

### Prepare for ORF analysis
# Load example data and prefilter
data("exampleSwitchList")
exampleSwitchList <- preFilter(exampleSwitchList)

# Perform test
exampleSwitchListAnalyzed <- isoformSwitchTestDEXSeq(exampleSwitchList, dIFcutoff = 0.3) # high dIF cutoff for fast runtime

### analyzeORF
library(BSgenome.Hsapiens.UCSC.hg19)
exampleSwitchListAnalyzed <- analyzeORF(exampleSwitchListAnalyzed, genomeObject = Hsapiens)

### Explore result
head(exampleSwitchListAnalyzed$orfAnalysis)
head(exampleSwitchListAnalyzed$isoformFeatures) # PTC column added

[Package IsoformSwitchAnalyzeR version 1.6.0 Index]