ORFik 1.12.1
Welcome to the introduction to data management with the ORFik experiment class. This vignette will walk you through how to work effectively with large amounts of sequencing data in ORFik.
ORFik is an R package containing various functions for the analysis of Ribo-seq, RNA-seq, RCP-seq, TCP-seq, ChIP-seq and CAGE data. We advise you to read the ORFikOverview vignette before starting this one.
NGS libraries are becoming more and more numerous. As a bioinformatician or biologist you often work on multi-library experiments, for example 6 RNA-seq libraries and 6 Ribo-seq libraries, split across 3 conditions with 2 replicates each, and then make plots or statistics from them. A lot of things can go wrong when you scale up from 1 library to many, or even to multiple experiments.
Another problem is that annotations, such as gff and fasta files, must be loaded separately from the NGS data, which makes it possible to use the wrong annotation for a given set of NGS data.
An ORFik experiment is an object that simplifies and error-checks your NGS workflow by creating a single R object that stores and controls all results relevant to a specific experiment. As the example in this vignette shows, it keeps track of the NGS library file paths, the annotation files (gtf/gff and fasta genome), the organism and the experiment name.
Let’s say we have a human experiment containing annotation files (.gtf and .fasta genome) and next-generation sequencing (NGS) libraries: RNA-seq, Ribo-seq and CAGE.
An example of how to make the experiment will now be shown:
First load ORFik
library(ORFik)
In a normal experiment you would usually start with only the bam files from aligning your reads (and you would split them into 3 experiments: 1 for RNA-seq, 1 for Ribo-seq and 1 for CAGE). To make this easy to replicate, we use the ORFik example data instead.
The minimal amount of information you need to make an ORFik experiment is:
# 1. Pick directory (normally a folder with your aligned bam files)
NGS.dir <- system.file("extdata", "", package = "ORFik")
# 2. .gff/.gtf location
txdb <- system.file("extdata", "annotations.gtf", package = "ORFik")
# 3. fasta genome location
fasta <- system.file("extdata", "genome.fasta", package = "ORFik")
# 4. Pick an experiment name
exper.name <- "ORFik_example_human"
list.files(NGS.dir)
## [1] "QC_STATS" "annotations.gtf" "cage-seq-heart.bed.bgz"
## [4] "features.rdata" "genome.fasta" "genome.fasta.fai"
## [7] "pshifted" "ribo-seq-heart.bed.bgz" "ribo-seq.bam"
## [10] "ribo-seq.bam.bai" "rna-seq-heart.bed.bgz"
Experiments are created from all accepted files in a folder (the accepted file types are set by the type argument), so remember to keep your experiment folder clean of NGS libraries not related to the experiment.
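Before creating an experiment, it can help to check which files in the folder look like NGS libraries. The following is a minimal base R sketch and not ORFik's own file detection; the helper name `listCandidateLibs`, the extension list and the demo file names are assumptions made for illustration.

```r
# Hypothetical helper (not part of ORFik): list files whose names end in one
# of the given extensions, optionally followed by a compression suffix.
listCandidateLibs <- function(dir, exts = c("bam", "bed", "wig", "ofst")) {
  # Builds a pattern such as "\\.(bam|bed|wig|ofst)(\\.(gz|bgz))?$"
  pattern <- paste0("\\.(", paste(exts, collapse = "|"), ")(\\.(gz|bgz))?$")
  list.files(dir, pattern = pattern)
}

# Demonstration on a throwaway folder with made-up file names
demo.dir <- file.path(tempdir(), "orfik_folder_check")
dir.create(demo.dir, showWarnings = FALSE)
invisible(file.create(file.path(demo.dir,
                                c("ribo-seq.bam", "ribo-seq.bam.bai",
                                  "rna-seq-heart.bed.bgz", "genome.fasta"))))
candidates <- listCandidateLibs(demo.dir)
print(candidates) # index and annotation files are not listed
```

Any file in this list that does not belong to the experiment is a candidate for moving to another folder.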
template <- create.experiment(dir = NGS.dir,  # directory of the NGS files for the experiment
                              exper.name,     # Experiment name
                              txdb = txdb,    # gtf / gff / gff.db annotation
                              fa = fasta,     # Fasta genome
                              organism = "Homo sapiens", # Scientific naming
                              saveDir = NULL) # If not NULL, saves experiment directly
data.frame(template)
## X1 X2 X3
## 1 name ORFik_example_human
## 2 gff /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/annotations.gtf
## 3 fasta /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/genome.fasta
## 4 libtype stage rep
## 5 CAGE heart
## 6 RFP heart
## 7 RFP
## 8 RNA heart
## X4 X5
## 1
## 2 organism
## 3
## 4 condition fraction
## 5
## 6
## 7
## 8
## X6
## 1
## 2 Homo sapiens
## 3
## 4 filepath
## 5 /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/cage-seq-heart.bed.bgz
## 6 /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/ribo-seq-heart.bed.bgz
## 7 /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/ribo-seq.bam
## 8 /tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/rna-seq-heart.bed.bgz
You can see from the template that files ending in .bai, .fai, .rdata etc. are excluded; only NGS library files, as defined by the type argument, are used.
You can also see that it tries to guess library types, stages, replicates, conditions etc. It will also try to auto-detect paired-end bam files.
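To give a feeling for what guessing from file names can look like, here is a small base R sketch. It is not ORFik's implementation: the function `guessFromName` and its matching rules are hypothetical, and they only cover the tokens that appear in the example file names.

```r
# Hypothetical guesser (not ORFik's implementation): derive library type and
# stage/tissue from tokens in the file name.
guessFromName <- function(files) {
  base <- tolower(basename(files))
  libtype <- ifelse(grepl("cage", base), "CAGE",
             ifelse(grepl("ribo", base), "RFP",
             ifelse(grepl("rna", base), "RNA", "")))
  stage <- ifelse(grepl("heart", base), "heart", "")
  data.frame(libtype = libtype, stage = stage, file = basename(files),
             stringsAsFactors = FALSE)
}

files <- c("cage-seq-heart.bed.bgz", "ribo-seq-heart.bed.bgz",
           "ribo-seq.bam", "rna-seq-heart.bed.bgz")
guess <- guessFromName(files)
print(guess)
```

Fields that cannot be guessed stay empty, which is why a template usually needs a manual check before it is read in as an experiment.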
To fix the things it did not find (a condition not specified, etc.), there are 3 ways: save the file and modify it in Excel or in LibreOffice, or do it directly in R.
Let’s update the template to have correct tissue-fraction in one of the samples.
template$X5[6] <- "heart_valve" # <- fix non unique row (tissue fraction is heart valve)
# read experiment from template
df <- read.experiment(template)
To save it, do:
save.experiment(df, file = "path/to/save/experiment.csv")
You can then load the experiment whenever you need it.
To see the object, just show it like this:
df
## experiment: ORFik_example_human with 3 library types and 4 runs
## libtype stage fraction
## 1: CAGE heart
## 2: RFP heart heart_valve
## 3: RFP
## 4: RNA heart
You see here that file paths are hidden. You can access them with filepath(). If you have several versions of the libraries, like p-shifted, bam or simplified wig files, the type argument selects which version's file paths are returned:
filepath(df, type = "default")
## [1] "/tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/cage-seq-heart.bed.bgz"
## [2] "/tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/ribo-seq-heart.bed.bgz"
## [3] "/tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/ribo-seq.bam"
## [4] "/tmp/Rtmpeu4i0l/Rinst1c4cf21c190147/ORFik/extdata/rna-seq-heart.bed.bgz"
# First load experiment if not present
# We use our already loaded experiment: (df) here
# Load transcript annotation
txdb <- loadTxdb(df) # transcript annotation
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
# And now NGS data
outputLibs(df, chrStyle = seqlevelsStyle(txdb)) # Use txdb as seqlevelsStyle reference
## Outputting libraries from: ORFik_example_human
By default all libraries are loaded into .GlobalEnv (the global environment), with names decided by the columns in the experiment. To see what the names will be, do:
bamVarName(df) #This will be the names:
## [1] "CAGE_heart" "RFP_heart_fheart_valve" "RFP"
## [4] "RNA_heart"
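Because the libraries are ordinary variables, they can be fetched by name with base R. The sketch below uses a separate environment and made-up stand-in objects so that it runs without any sequencing data; with a real experiment the names would come from bamVarName(df).

```r
# Stand-ins for loaded libraries: made-up objects assigned by name
env <- new.env()
lib.names <- c("CAGE_heart", "RFP", "RNA_heart")
for (nm in lib.names) assign(nm, paste("reads of", nm), envir = env)

# Fetch a single object by its name
one.lib <- get("RFP", envir = env)

# Fetch all of them at once as a named list
libs.demo <- mget(lib.names, envir = env)
print(names(libs.demo))
```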
If you have multiple experiments, there is a chance of non-unique naming; for example, 2 experiments might both have a library called cage. To be sure names are unique, add the experiment name to the variable name:
df@expInVarName <- TRUE
bamVarName(df) #This will be the names:
## [1] "ORFik_example_human_CAGE_heart"
## [2] "ORFik_example_human_RFP_heart_fheart_valve"
## [3] "ORFik_example_human_RFP"
## [4] "ORFik_example_human_RNA_heart"
You see here that the experiment name, “ORFik_example_human”, is now part of the variable name. If you are only working on one experiment, you do not need to include the name, since there is no possibility of duplicate naming (the experiment class validates that all names are unique).
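The naming collision is easy to reproduce in base R. This sketch uses made-up library names and the hypothetical experiment names expA and expB; it only illustrates why a prefix helps and does not call ORFik.

```r
# Made-up library names from two hypothetical experiments
exp1 <- c("CAGE_heart", "RFP")
exp2 <- c("CAGE_heart", "RNA_heart")

# Without the experiment name, the combined names collide
plain <- c(exp1, exp2)
collision.plain <- anyDuplicated(plain) > 0

# Prefixing each name with its experiment removes the collision
prefixed <- c(paste("expA", exp1, sep = "_"), paste("expB", exp2, sep = "_"))
collision.prefixed <- anyDuplicated(prefixed) > 0

print(c(plain = collision.plain, prefixed = collision.prefixed))
```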
Since we want NGS data names without the experiment name, let’s remove the loaded libraries and load them again.
df@expInVarName <- FALSE
remove.experiments(df)
## Removed loaded libraries from experiment:ORFik_example_human
outputLibs(df, chrStyle = seqlevelsStyle(txdb))
## Outputting libraries from: ORFik_example_human
Let’s say we want to load all leaders, CDSs and 3’ UTRs that are at least 30 nt long. With an ORFik experiment this is easy:
txNames <- filterTranscripts(txdb, minFiveUTR = 30, minCDS = 30, minThreeUTR = 30)
loadRegions(txdb, parts = c("leaders", "cds", "trailers"), names.keep = txNames)
The regions are now loaded into .GlobalEnv, only keeping transcripts from txNames.
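Conceptually, the filter keeps a transcript only if all three of its parts pass the length threshold. The base R sketch below illustrates that logic on made-up lengths, reading the min* arguments as minimum lengths; it is not how filterTranscripts is implemented.

```r
# Made-up lengths (in nucleotides) per transcript for each region
leader.len  <- c(tx1 = 120, tx2 = 10,  tx3 = 45)
cds.len     <- c(tx1 = 900, tx2 = 300, tx3 = 27)
trailer.len <- c(tx1 = 200, tx2 = 80,  tx3 = 60)

# Keep transcripts where every region is at least 30 nt long
keep <- leader.len >= 30 & cds.len >= 30 & trailer.len >= 30
txNames.demo <- names(keep)[keep]
print(txNames.demo) # tx2 fails on the leader, tx3 on the CDS
```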
Let’s make a plot of coverage over mRNA for just the Ribo-seq library:
transcriptWindow(leaders, cds, trailers, df[3,])
## RFP
## [[1]]
##
## [[2]]
If your experiment consists of Ribo-seq, you want to do p-site shifting.
shiftFootprintsByExperiment(df[df$libtype == "RFP",])
P-shifted Ribo-seq will automatically be stored as .wig (wiggle files for IGV and other genome browsers) and .ofst (ORFik serialized format for R) files in a ./pshifted folder, relative to the original libraries.
To validate p-shifting, use shiftPlots. Here is an example made from the Bazzini et al. 2014 data.
df.baz <- read.experiment("zf_bazzini14_RFP")
shiftPlots(df.baz, title = "Ribo-seq, zebrafish, Bazzini et al. 2014")
p-site analysis
To see the shifts per library do:
shifts.load(df)
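As an illustration of what a per-read-length shift does, the base R sketch below moves the 5' end of each read downstream by an offset that depends on its length. The reads, the offsets and the sign convention (positive distances from the 5' end) are assumptions for this example and do not reflect how ORFik stores its shift tables.

```r
# Made-up reads: 5' end position, strand and read length
reads <- data.frame(five.prime = c(100, 250, 400),
                    strand = c("+", "-", "+"),
                    len = c(28, 29, 28),
                    stringsAsFactors = FALSE)
# Made-up offsets per read length (distance from the 5' end to the P-site)
offsets <- c("28" = 12, "29" = 13)

off <- unname(offsets[as.character(reads$len)])
# Move downstream in the direction of transcription:
# add the offset on the "+" strand, subtract it on the "-" strand
reads$psite <- ifelse(reads$strand == "+",
                      reads$five.prime + off,
                      reads$five.prime - off)
print(reads)
```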
To see the location of pshifted files:
filepath(df[df$libtype == "RFP",], type = "pshifted")
To load p-shifted libraries, you can do:
outputLibs(df[df$libtype == "RFP",], type = "pshifted")
Bam files are slow to load, and usually you don’t need all the information contained in a bam file.
Usually you would convert to bed or wig files, but ORFik also supports 3 formats for much faster loading and use of data.
ofst (ORFik serialized format): from the bam file, these columns are stored as a serialized file: seqname, start, cigar, strand, score (number of identical replicates for that read).
This is the fastest format to use; the loading time of a 10GB Ribo-seq bam file is reduced from minutes to ~ 1 second, and the size to ~ 20MB.
bedo (optimized bed file): from the bam file, these columns are stored as a text file: seqname, start, end (if not all widths are 1), strand, score (number of identical replicates for that read), size (size of cigar Ms according to reference).
The R object loaded from these files is a GRanges, since cigar is not needed.
The loading time of a 10GB Ribo-seq bam file is reduced to ~ 10 seconds, and the size to ~ 100MB.
bedoc (optimized bed file with cigar): from the bam file, these columns are stored as a text file: seqname, cigar, start, strand, score (number of identical replicates for that read).
The R object loaded from these files is a GAlignments or GAlignmentPairs, since cigar is needed.
The loading time of a 10GB Ribo-seq bam file is reduced to ~ 15 seconds, and the size to ~ 200MB.
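Much of the size reduction comes from the score column: identical reads are stored once, together with a count. The base R sketch below shows that collapsing step on made-up alignments; it is only an illustration of the idea and not ORFik's conversion code.

```r
# Made-up alignments; rows 1, 2 and 4 describe identical reads
aln <- data.frame(seqname = c("chr1", "chr1", "chr1", "chr1"),
                  start = c(100, 100, 150, 100),
                  cigar = c("28M", "28M", "29M", "28M"),
                  strand = c("+", "+", "-", "+"),
                  stringsAsFactors = FALSE)

# Give every read a score of 1, then sum the scores of identical reads
aln$score <- 1
collapsed <- aggregate(score ~ seqname + start + cigar + strand,
                       data = aln, FUN = sum)
print(collapsed) # 4 reads stored as 2 rows
```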
ORFik also supports a full QC report with post-alignment statistics, correlation plots, simplified libraries for plotting, meta coverage and more.
The default QC report:
QCreport(df)
The plots and statistics are saved to disk. To see the statistics, you can do:
QCstats(df)
In addition there is a QC report for Ribo-seq, with some additional analysis of read lengths and frames. This should only be run after you have p-shifted the reads.
RiboQC.plot(df)
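The frame analysis rests on a simple idea: the position of a P-site relative to the start of the CDS, modulo 3, gives its reading frame. The base R sketch below computes a frame distribution from made-up positions; it is not the computation RiboQC.plot performs internally.

```r
# Made-up P-site positions relative to the first base of the start codon
rel.pos <- c(0, 3, 6, 7, 9, 12, 14, 15)

# Reading frame of each position: 0, 1 or 2
frame <- rel.pos %% 3

# Count reads per frame, keeping all three frames even if one is empty
frame.counts <- table(factor(frame, levels = 0:2))
print(frame.counts)
```

In well p-shifted Ribo-seq data, one frame is expected to hold clearly more reads than the other two.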
Usually you want to do some operation on multiple data-sets. If ORFik does not include a premade function for what you want, you can make it yourself. If your data is in the format of an ORFik experiment, this operation is simple.
There are 3 ways to run loops for the data:
# Way 1: load all libraries into the environment, then loop over their names
library(BiocParallel) # provides bplapply()
outputLibs(df, type = "pshifted") # Output all libraries, fastest way
libs <- bamVarName(df) # <- here are the names of the libs that were output
cds <- loadRegion(df, "cds")
# parallel loop
bplapply(libs, FUN = function(lib, cds) {
  return(entropy(cds, get(lib)))
}, cds = cds)
# Way 2: loop over the file paths in parallel, importing each file in the loop
files <- filepath(df, type = "pshifted")
cds <- loadRegion(df, "cds")
# parallel loop
res <- bplapply(files, FUN = function(file, cds) {
  return(entropy(cds, fimport(file)))
}, cds = cds)
# Way 3: same as way 2, but on a single thread with lapply
files <- filepath(df, type = "pshifted")
cds <- loadRegion(df, "cds")
# Single thread loop
lapply(files, FUN = function(file, cds) {
  return(entropy(cds, fimport(file)))
}, cds = cds)
# Reformat the output of a loop to a data.table, 1 column per library
library(data.table)
library(BiocParallel) # provides bplapply()
outputLibs(df, type = "pshifted")
libs <- bamVarName(df) # <- here are the names of the libs that were output
cds <- loadRegion(df, "cds")
# parallel loop
res <- bplapply(libs, FUN = function(lib, cds) {
  return(entropy(cds, get(lib)))
}, cds = cds)
# Add some names and convert
names(res) <- libs
data.table::setDT(res) # Will give 1 column per library
res # Now by columns
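If a long layout (one row per value, with the library as a column) suits your plotting or statistics better than one column per library, base R can reshape a named list directly. The sketch below uses made-up values in a list called `res.demo`, shaped like the named list of per-library results before conversion.

```r
# Made-up per-gene values for two libraries
res.demo <- list(RFP_heart = c(0.2, 0.5, 0.9),
                 RNA_heart = c(0.1, 0.4, 0.8))

# One row per value; the library name ends up in the column `ind`
long <- stack(res.demo)
print(long)
```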