improvedCV2 {scran} | R Documentation |
Model the technical coefficient of variation as a function of the mean, and determine the significance of highly variable genes.
This is intended to be a more stable version of technicalCV2
.
## S4 method for signature 'ANY' improvedCV2(x, is.spike, sf.cell=NULL, sf.spike=NULL, log.prior=NULL, df=4, robust=FALSE, use.spikes=FALSE) ## S4 method for signature 'SingleCellExperiment' improvedCV2(x, spike.type=NULL, ..., assay.type="logcounts", logged=NULL, normalized=NULL)
x |
A numeric matrix of counts, normalized counts or normalized log-expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SingleCellExperiment object that contains such values. |
is.spike |
A vector indicating which rows of |
sf.cell |
A numeric vector containing size factors for endogenous genes.
If this is not specified, counts for endogenous genes are assumed to already be normalized.
This is ignored if |
sf.spike |
A numeric vector containing size factors for spike-in transcripts.
If this is not specified, counts for the spike-in transcripts are assumed to already be normalized.
This is ignored if |
log.prior |
A numeric scalar specifying the pseudo-count added prior to log-transformation.
If this is set, |
df |
An integer scalar indicating the number of degrees of freedom for the spline fit with |
robust |
A logical scalar indicating whether robust fitting should be performed with |
use.spikes |
A logical scalar indicating whether p-values should be returned for spike-in transcripts. |
spike.type |
A character vector containing the names of the spike-in sets to use. |
... |
Additional arguments to pass to |
assay.type |
A string specifying which assay values to use. |
logged |
A logical scalar indicating if |
normalized |
A logical scalar indicating if |
This function will estimate the squared coefficient of variation (CV2) and mean for each spike-in transcript.
Both values are log-transformed and a mean-dependent trend is fitted to the log-CV2 values, using a linear model with a natural spline of degree df
.
The trend is used to obtain the technical contribution to the CV2 for each gene.
The biological contribution is computed by subtracting the technical contribution from the total CV2.
Deviations from the trend are identified by modelling the CV2 estimates for the spike-in transcripts as log-normally distributed around the fitted trend. This accounts for sampling variance as well as any variability in the true dispersions (e.g., due to transcript-specific amplification biases). The p-value for each gene is calculated from a one-sided Z-test on the log-CV2, using the fitted value as the mean and the robust scale estimate as the standard deviation. A Benjamini-Hochberg adjustment is applied to correct for multiple testing.
If log.prior
is specified, x
is assumed to contain log-expression values.
These are converted back to the count scale prior to calculation of the CV2.
Otherwise, x
is assumed to contain raw counts, which need to be normalized with sf.cell
and sf.spike
prior to calculating the CV2.
Note that both sets of size factors are set to 1 by default if their values are not supplied to the function.
For any given data set, the maximum CV2 that can be achieved is equal to the number of cells. (This occurs when only one cell has a non-zero expression value - proof via Holder's inequality.) Genes with CV2 values equal to the maximum are ignored during trend fitting. This ensures that the trend is not distorted by the presence of an upper bound on CV2 values, especially at low means.
For improvedCV2,ANY-method
, the rows corresponding to spike-in transcripts are specified with is.spike
.
These rows will be used for trend fitting, while all other rows are treated as endogenous genes.
By default, p-values are set to NA
for the spike-in transcripts, such that they do not contribute to the multiple testing correction.
This behaviour can be modified with use.spikes=TRUE
, which will return p-values for all features.
For improvedCV2,SingleCellExperiment-method
, transcripts from spike-in sets named in spike.type
will be used for trend fitting.
If spike.type=NULL
, all spike-in sets listed in x
will be used.
Size factors for endogenous genes are automatically extracted via sizeFactors
.
Spike-in-specific size factors for spike.type
are extracted from x
, if available; otherwise they are set to the size factors for the endogenous genes.
Note that the spike-in-specific factors must be the same for each set in spike.type
.
Users can also set is.spike
to NA
in improvedCV2,ANY-method
; or spike.type
to NA
in improvedCV2,SingleCellExperiment-method
.
In such cases, all rows will be used for trend fitting, and (adjusted) p-values will be reported for all rows.
This should be used in cases where there are no spike-ins.
Here, the assumption is that most endogenous genes do not exhibit high biological variability and thus can be used to model decompose variation.
A data frame is returned containing one row per row of x
(including both endogenous genes and spike-in transcripts).
Each row contains the following information:
mean
:A numeric field, containing mean (scaled) counts for all genes and transcripts.
var
:A numeric field, containing the variances for all genes and transcripts.
cv2
:A numeric field, containing CV2 values for all genes and transcripts.
trend
:A numeric field, containing the fitted value of the trend in the CV2 values. Note that the fitted value is reported for all genes and transcripts, but the trend is only fitted using the transcripts.
p.value
:A numeric field, containing p-values for all endogenous genes (NA
for rows corresponding to spike-in transcripts).
FDR
:A numeric field, containing adjusted p-values for all genes.
Aaron Lun
Lun ATL (2018). Description of the HVG machinery in scran. https://github.com/LTLA/HVGDetection2018
# Mocking up some data. ngenes <- 10000 nsamples <- 50 means <- 2^runif(ngenes, 6, 10) dispersions <- 10/means + 0.2 counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples) is.spike <- logical(ngenes) is.spike[seq_len(500)] <- TRUE # Running it directly on the counts. out <- improvedCV2(counts, is.spike) head(out) plot(out$mean, out$cv2, log="xy") points(out$mean, out$trend, col="red", pch=16, cex=0.5) # Same again with an SingleCellExperiment. rownames(counts) <- paste0("X", seq_len(ngenes)) colnames(counts) <- paste0("Y", seq_len(nsamples)) X <- SingleCellExperiment(list(counts=counts)) isSpike(X, "Spikes") <- is.spike # Dummying up some size factors (for convenience only, use computeSumFactors() instead). sizeFactors(X) <- 1 X <- computeSpikeFactors(X, general.use=FALSE) X <- normalize(X) # Running it. out <- improvedCV2(X, spike.type="Spikes") head(out)