processAmplicons {edgeR} | R Documentation |
Given a list of sample-specific index (barcode) sequences and hairpin/sgRNA-specific sequences from an amplicon sequencing screen, generate a DGEList of counts from the raw fastq file(s) containing the sequence reads. Assumes a fixed structure of amplicon sequences (i.e. both the sample-specific index sequences and the hairpin/sgRNA sequences can be found at particular locations within each read).
processAmplicons(readfile, readfile2=NULL, barcodefile, hairpinfile, barcodeStart=1, barcodeEnd=5, barcode2Start=NULL, barcode2End=NULL, barcodeStartRev=NULL, barcodeEndRev=NULL, hairpinStart=37, hairpinEnd=57, allowShifting=FALSE, shiftingBase=3, allowMismatch=FALSE, barcodeMismatchBase=1, hairpinMismatchBase=2, allowShiftedMismatch=FALSE, verbose=FALSE)
readfile |
character vector giving one or more fastq filenames |
readfile2 |
character vector giving one or more fastq filenames for reverse reads, defaults to NULL |
barcodefile |
filename containing sample-specific barcode ids and sequences |
hairpinfile |
filename containing hairpin/sgRNA-specific ids and sequences |
barcodeStart |
numeric value, starting position (inclusive) of barcode sequence in reads |
barcodeEnd |
numeric value, ending position (inclusive) of barcode sequence in reads |
barcode2Start |
numeric value, starting position (inclusive) of second barcode sequence in forward reads |
barcode2End |
numeric value, ending position (inclusive) of second barcode sequence in forward reads |
barcodeStartRev |
numeric value, starting position (inclusive) of barcode sequence in reverse reads, defaults to NULL |
barcodeEndRev |
numeric value, ending position (inclusive) of barcode sequence in reverse reads, defaults to NULL |
hairpinStart |
numeric value, starting position (inclusive) of hairpin/sgRNA sequence in reads |
hairpinEnd |
numeric value, ending position (inclusive) of hairpin/sgRNA sequence in reads |
allowShifting |
logical, indicates whether a given hairpin/sgRNA can be matched to a neighbouring position |
shiftingBase |
numeric value of maximum number of shifted bases from input |
allowMismatch |
logical, indicates whether sequence mismatch is allowed |
barcodeMismatchBase |
numeric value of maximum number of base sequence mismatches allowed in a barcode sequence when allowMismatch is TRUE |
hairpinMismatchBase |
numeric value of maximum number of base sequence mismatches allowed in a hairpin/sgRNA sequence when allowMismatch is TRUE |
allowShiftedMismatch |
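logical, effective when allowShifting and allowMismatch are both TRUE. Indicates whether sequence mismatches are checked at a shifted position |
verbose |
if TRUE, output program progress |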
The processAmplicons function assumes the sequences in your fastq files have a fixed structure (as per Figure 1A of Dai et al, 2014).
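As a minimal illustration of what a fixed structure means, the sketch below builds one made-up read and pulls out the barcode and hairpin by position with base R. The sequences and the filler are hypothetical; only the default positions (1 to 5 and 37 to 57) come from the usage above.

```r
# Hypothetical 57-base read: a 5-base sample index, 31 bases of filler,
# then a 21-base hairpin. All sequences here are invented for illustration.
barcode <- "ACGTA"
filler  <- strrep("N", 31)
hairpin <- "GATTACAGATTACAGATTACA"
read    <- paste0(barcode, filler, hairpin)

# Start and end positions are 1-based and inclusive, matching the defaults.
barcodeStart <- 1;  barcodeEnd <- 5
hairpinStart <- 37; hairpinEnd <- 57

read_barcode <- substr(read, barcodeStart, barcodeEnd)
read_hairpin <- substr(read, hairpinStart, hairpinEnd)
```

Because both ends are inclusive, the defaults imply a 5-base barcode and a 21-base hairpin/sgRNA.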
The input barcode and hairpin/sgRNA files are tab-separated text files with at least two columns, named 'ID' and 'Sequences'. The 'ID' column contains the sample or hairpin/sgRNA ids and the 'Sequences' column contains the sample index or hairpin/sgRNA sequences to be matched.
If barcode2Start and barcode2End are specified, a third column 'Sequences2' is expected in the barcode file.
If readfile2, barcodeStartRev and barcodeEndRev are specified, another column 'SequencesReverse' is expected in the barcode file.
The barcode file may also contain a 'group' column that indicates which experimental group a sample belongs to.
Additional columns in each file will be included in the respective $samples or $genes data.frames of the final DGEList object.
These files, along with the fastq file(s), are assumed to be in the current working directory.
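The file layout described above can be sketched with base R as follows. The sample names, sequences, the 'Gene' annotation column and the file names are all hypothetical. The files are written to a temporary directory here only to keep the example tidy; in practice they would sit in the working directory alongside the fastq file(s).

```r
# Made-up sample indexes; 'group' is the optional experimental group column.
barcodes <- data.frame(
  ID        = c("Sample1", "Sample2"),
  Sequences = c("ACGTA", "TGCAT"),
  group     = c("control", "treated"),
  stringsAsFactors = FALSE
)

# Made-up 21-base hairpins; 'Gene' is a hypothetical extra annotation column.
hairpins <- data.frame(
  ID        = c("Hairpin1", "Hairpin2"),
  Sequences = c("GATTACAGATTACAGATTACA", "CCGGAATTCCGGAATTCCGGA"),
  Gene      = c("GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Hypothetical file names in a temporary directory.
out_dir     <- tempdir()
barcodefile <- file.path(out_dir, "Barcodes.txt")
hairpinfile <- file.path(out_dir, "Hairpins.txt")

# Tab-separated, with a header row and no row names or quoting.
write.table(barcodes, barcodefile, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(hairpins, hairpinfile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the barcode file back to confirm the layout.
check <- read.table(barcodefile, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
```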
To compute the count matrix, matching to the given barcodes and hairpins/sgRNAs is conducted in two rounds.
The first round looks for an exact sequence match for the given barcode sequences and hairpin/sgRNA sequences at the locations specified.
If allowShifting is set to TRUE, the program also checks if a given hairpin/sgRNA sequence can be found at a neighbouring position in the read.
If a match isn't found, the program performs a second round of matching which allows for sequence mismatches if allowMismatch is set to TRUE.
The program also checks the parameter allowShiftedMismatch, which accommodates mismatches at the shifted positions.
The maximum numbers of mismatched bases in the barcode and hairpin/sgRNA sequences are specified by the parameters barcodeMismatchBase and hairpinMismatchBase.
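The two rounds of matching can be sketched in base R for a single read and the hairpin/sgRNA part only. This is an illustration of the logic described above, not edgeR's own implementation: the helper names match_hairpin and n_mismatch are hypothetical, and the order in which shifted positions are tried and the way ties are broken are assumptions.

```r
# Count mismatching bases between two equal-length strings.
n_mismatch <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: index of the matching hairpin, or NA if none matches.
match_hairpin <- function(read, hairpins, start, end,
                          allowShifting = FALSE, shiftingBase = 3,
                          allowMismatch = FALSE, hairpinMismatchBase = 2,
                          allowShiftedMismatch = FALSE) {
  # Offsets to try: the stated position first, then neighbouring positions.
  offsets <- 0
  if (allowShifting) {
    offsets <- c(0, seq_len(shiftingBase), -seq_len(shiftingBase))
  }
  # Keep only offsets whose window lies entirely within the read.
  offsets <- offsets[start + offsets >= 1 & end + offsets <= nchar(read)]

  # Round 1: exact match at the stated (and optionally shifted) positions.
  for (off in offsets) {
    hit <- match(substr(read, start + off, end + off), hairpins)
    if (!is.na(hit)) return(hit)
  }

  # Round 2: tolerate mismatches, at shifted positions only if requested.
  if (allowMismatch) {
    offsets2 <- if (allowShiftedMismatch) offsets else offsets[offsets == 0]
    for (off in offsets2) {
      window <- substr(read, start + off, end + off)
      d <- vapply(hairpins, n_mismatch, integer(1), b = window,
                  USE.NAMES = FALSE)
      best <- which.min(d)
      if (d[best] <= hairpinMismatchBase) return(best)
    }
  }
  NA_integer_
}

# Made-up hairpins and reads for a quick check of each route.
hairpins   <- c("GATTACAGATTACAGATTACA", "CCGGAATTCCGGAATTCCGGA")
read_exact <- paste0("ACGTA", strrep("N", 31), hairpins[1])
read_shift <- paste0("ACGTA", strrep("N", 33), hairpins[2])   # shifted by 2
read_mm    <- paste0("ACGTA", strrep("N", 31), "GATTACAGATTACAGATTACT")

exact_hit   <- match_hairpin(read_exact, hairpins, 37, 57)
shift_miss  <- match_hairpin(read_shift, hairpins, 37, 57)
shift_hit   <- match_hairpin(read_shift, hairpins, 37, 57, allowShifting = TRUE)
mm_miss     <- match_hairpin(read_mm, hairpins, 37, 57)
mm_hit      <- match_hairpin(read_mm, hairpins, 37, 57, allowMismatch = TRUE)
```

In this sketch a read shifted by two bases is only recovered when allowShifting is TRUE, and a read with a single altered base is only recovered when allowMismatch is TRUE.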
The program outputs a DGEList object, with a count matrix indicating the number of times each barcode and hairpin/sgRNA combination could be matched in reads from the input fastq file(s).
For further examples and data, refer to the case studies available from http://bioinf.wehi.edu.au/shRNAseq.
Returns a DGEList object with the following components:
counts |
read count matrix tallying up the number of reads with particular barcode and hairpin/sgRNA matches. Each row is a hairpin and each column is a sample |
genes |
In this case, hairpin/sgRNA-specific information (ID, sequences, corresponding target gene) may be recorded in this data.frame |
lib.size |
auto-calculated column sum of the counts matrix |
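The relationship between counts and lib.size can be illustrated with a small made-up matrix in base R. The hairpin and sample names and the counts are hypothetical, and no edgeR call is involved.

```r
# Hypothetical counts: rows are hairpins/sgRNAs, columns are samples.
counts <- matrix(
  c(10L, 0L, 3L,    # Sample1
     7L, 5L, 2L),   # Sample2
  nrow = 3,
  dimnames = list(c("Hairpin1", "Hairpin2", "Hairpin3"),
                  c("Sample1", "Sample2"))
)

# The library size of each sample is the column sum of the counts matrix.
lib.size <- colSums(counts)
```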
This function replaced the earlier function processHairpinReads in edgeR 3.7.17.
This function cannot be used if the hairpins/sgRNAs/sample index sequences are in random locations within each read. If that is the case, then analysts will need to customise their own sequence processing pipeline, although edgeR can still be used for downstream analysis.
Zhiyin Dai and Matthew Ritchie
Dai Z, Sheridan JM, Gearing LJ, Moore DL, Su S, Wormald S, Wilcox S, O'Connor L, Dickins RA, Blewitt ME, Ritchie ME (2014). edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens. F1000Research 3, 95. http://f1000research.com/articles/3-95