decomposeVar {scran}R Documentation

Decompose the gene-level variance

Description

Decompose the gene-specific variance into biological and technical components for single-cell RNA-seq data.

Usage

## S4 method for signature 'ANY,list'
decomposeVar(x, fit, block=NA, design=NA, subset.row=NULL, 
    BPPARAM=SerialParam(), ...)

## S4 method for signature 'SingleCellExperiment,list'
decomposeVar(x, fit, subset.row=NULL, ..., 
    assay.type="logcounts", get.spikes=NA)

Arguments

x

A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SingleCellExperiment object containing such a matrix.

fit

A list containing trend, a function that takes a numeric vector of abundances and returns the technical component of variation. This is usually produced by running trendVar on log-expression values for spike-in genes.

block

A factor containing the level of a blocking factor for each cell.

design

A numeric matrix describing the uninteresting factors contributing to expression in each cell. Alternatively, a single factor for one-way layouts.

subset.row

See ?"scran-gene-selection".

BPPARAM

A BiocParallelParam object indicating whether and how parallelization should be performed across genes.

...

For decomposeVar,matrix,list-method, additional arguments to pass to testVar. For decomposeVar,SingleCellExperiment,list-method, additional arguments to pass to the matrix method.

assay.type

A string specifying which assay values to use from x.

get.spikes

A logical scalar specifying whether decomposition should be performed for spike-ins.

Details

This function computes the variance of the normalized log-counts for each endogenous gene. The technical component of the variance for each gene is determined by interpolating the fitted trend in fit at the mean log-count for that gene. This represents variance due to sequencing noise, variability in capture efficiency, etc. The biological component is determined by subtracting the technical component from the total variance.

Highly variable genes (HVGs) can be identified as those with large biological components. Unlike other methods for decomposition, this approach estimates the variance of the log-counts rather than of the counts themselves. The log-transformation blunts the impact of large positive outliers and ensures that HVGs are driven by strong log-fold changes between cells, not differences in counts. Interpretation is not compromised – HVGs will still be so, regardless of whether counts or log-counts are considered.

The fit list should contain at least trend, as this is necessary for the decomposition. If x is missing, fit should also contain mean and var, numeric vectors of the means and variances for all features. This will be used to perform the decomposition rather than (re)computing any statistics from x. The list may optionally contain block and design, but this will be overridden by any explicitly passed arguments.

If assay.type="logcounts" and the size factors are not centred at unity, a warning will be raised - see ?trendVar for details.

Value

A DataFrame is returned where each row corresponds to and is named after a row of x. This contains the numeric fields:

mean:

Mean normalized log-expression per gene.

total:

Variance of the normalized log-expression per gene.

bio:

Biological component of the variance.

tech:

Technical component of the variance.

p.value, FDR:

Raw and adjusted p-values for the test against the null hypothesis that bio=0.

If get.spikes=NA, the p.value and FDR fields will be set to NA for rows corresponding to spike-in transcripts. Otherwise, if get.spikes=FALSE, rows corresponding to spike-in transcripts are removed.

If subset.row!=NULL, each row of the output DataFrame corresponds to an element of subset.row instead.

The metadata field of the output DataFrame also contains num.cells, an integer scalar storing the number of cells in x; and resid.df, an integer scalar specifying the residual d.f. used for variance estimation.

Accounting for uninteresting factors

To account for uninteresting factors of variation, either block or design can be specified:

If either of these arguments are NA, they will be extracted from fit, assuming that the same cells were used to fit the trend. If block is specified, this will override any setting of design. Use of block is generally favoured as group-specific means result in a better estimate of the technical component than an average mean across all groups.

Note that the use of either block or design assumes that there are no systematic differences in the size factors across levels of an uninteresting factor. If such differences are present, we suggest using multiBlockVar instead, see the discussion in ?trendVar for more details.

Feature selection

The behaviour of get.spikes and subset.row is the same as described in ?"scran-gene-selection". The only additional feature is that users can specify get.spikes=NA, which sets the p-value and FDR to NA for spike-in transcripts. This is the default as it returns the other variance statistics for diagnostic purposes, but ensures that the spike-ins are not treated as candidate HVGs. (Note that this setting is the same as get.spikes=TRUE when considering the interaction between get.spikes and subset.row.)

If x is not supplied, all genes used to fit the trend in fit will be used instead for the variance decomposition. This may be useful when a trend is fitted to all genes in trendVar, such that the statistics for all genes will already be available in fit. By not specifying x, users can avoid redundant calculations, which is particularly helpful for very large data sets.

Author(s)

Aaron Lun

References

Lun ATL, McCarthy DJ and Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5:2122

Lun ATL (2018). Description of the HVG machinery in scran. https://github.com/LTLA/HVGDetection2018

See Also

trendVar, testVar

Examples

example(computeSpikeFactors) # Using the mocked-up data 'y' from this example.
y <- computeSumFactors(y) # Size factors for the the endogenous genes.
y <- computeSpikeFactors(y, general.use=FALSE) # Size factors for spike-ins. 
y <- normalize(y) # Normalizing the counts by the size factors.

# Decomposing technical and biological noise.
fit <- trendVar(y)
results <- decomposeVar(y, fit)
head(results)

plot(results$mean, results$total)
o <- order(results$mean)
lines(results$mean[o], results$tech[o], col="red", lwd=2)

plot(results$mean, results$bio)

# A trend fitted to endogenous genes can also be used, pending assumptions.
fit.g <- trendVar(y, use.spikes=FALSE)
results.g <- decomposeVar(y, fit.g)
head(results.g)

[Package scran version 1.12.1 Index]