trendVar {scran}R Documentation

Fit a variance trend

Description

Fit a mean-dependent trend to the gene-specific variances in single-cell RNA-seq data.

Usage

## S4 method for signature 'ANY'
trendVar(x, method=c("loess", "spline"), parametric=FALSE, 
    loess.args=list(), spline.args=list(), nls.args=list(), block=NULL, 
    design=NULL, weighted=TRUE, min.mean=0.1, subset.row=NULL, 
    BPPARAM=SerialParam()) 

## S4 method for signature 'SingleCellExperiment'
trendVar(x, subset.row=NULL, ..., assay.type="logcounts", use.spikes=TRUE)

Arguments

x

A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SingleCellExperiment object that contains such values.

method

A string specifying the algorithm to use for smooth trend fitting.

parametric

A logical scalar indicating whether a parametric curve should be fitted prior to smoothing.

loess.args

A named list of arguments to pass to loess when method="loess".

spline.args

A named list of arguments to pass to robustSmoothSpline when method="spline".

nls.args

A named list of arguments to pass to nls when parametric=TRUE.

block

A factor specifying the blocking level for each cell.

design

A numeric matrix describing the uninteresting factors contributing to expression in each cell. Alternatively, a single factor can be supplied for one-way layouts.

weighted

A logical scalar indicated whether weighted trend fitting should be performed when block!=NULL.

min.mean

A numeric scalar specifying the minimum mean log-expression in order for a gene to be used for trend fitting.

subset.row

See ?"scran-gene-selection".

BPPARAM

A BiocParallelParam object indicating whether and how parallelization should be performed across genes.

...

Additional arguments to pass to trendVar,ANY-method.

assay.type

A string specifying which assay values in x to use.

use.spikes

A logical scalar specifying whether the trend should be fitted to variances for spike-in transcripts or endogenous genes.

Details

This function fits an abundance-dependent trend to the variance of the log-normalized expression for the spike-in transcripts. For SingleCellExperiment objects, these expression values are computed by normalize after setting the size factors, e.g., with computeSpikeFactors. Log-transformed values are used as these are more robust to genes/transcripts with strong expression in only one or two outlier cells. It also allows the fitted trend to be applied in downstream procedures that use log-transformed counts.

The mean and variance of the normalized log-counts is calculated for each spike-in transcript, and a trend is fitted to the variance against the mean for all transcripts. The fitted value of this trend represents technical variability due to sequencing, drop-outs during capture, etc. at a given mean. This assumes that a constant amount of spike-in RNA was added to each cell, such that any differences in observed expression are purely due to measurement error. Variance decomposition to biological and technical components for endogenous genes can then be performed later with decomposeVar.

Value

A named list is returned, containing:

mean:

A numeric vector of mean log-expression values for all spike-in transcripts, if block=NULL. Otherwise, a numeric matrix of means where each row corresponds to a spike-in and each column corresponds to a level of block.

var:

A numeric vector of the variances of log-expression values for all spike-in transcripts, if block=NULL. Otherwise, a numeric matrix of variances where each row corresponds to a spike-in and each column corresponds to a level of block.

resid.df:

An integer scalar specifying the residual d.f. used for variance estimation of each spike-in transcript, if block=NULL. Otherwise, a integer vector where each entry specifies the residual d.f. used in each level of block.

block:

A factor identical to the input block, only returned if it was not NULL.

design:

A numeric matrix (or factor) identical to the input design, only returned if it was not NULL and block=NULL.

trend:

A function that returns the fitted value of the trend at any mean.

df2:

A numeric scalar, specifying the second degrees of freedom for a scaled F-distribution describing the variability of variance estimates around the trend.

Trend fitting options

If parametric=FALSE, smoothing is performed directly on the log-variances. This is the default as it provides the most stable performance on arbitrary mean-variance relationships.

If parametric=TRUE, a non-linear curve of the form

y = ax/(x^n + b)

is fitted to the variances against the means using nls. Starting values and the number of iterations are automatically set if not explicitly specified in nls.args. A smoothing algorithm is then applied to the log-ratios of the variance to the fitted value for each gene. The aim is to use the parametric curve to reduce the sharpness of the expected mean-variance relationship[for easier smoothing. Conversely, the parametric form is not exact, so the smoothers will model any remaining trends in the residuals.

The method argument specifies the smoothing algorithm to be applied on the log-ratios/variances. By default, a robust loess curve is used for trend fitting via loess. This provides a fairly flexible fit while protecting against genes with very large or very small variances. Arguments to loess are specified with loess.args, with defaults of span=0.3, family="symmetric" and degree=1 unless otherwise specified. Some experimentation with these parameters may be required to obtain satisfactory results.

If method="spline", smoothing will instead be performed using robustSmoothSpline (a robustified version of the smooth.spline function in the aroma.light package). Splines can be more effective than loess at capturing smooth curves with strong non-linear gradients. Spline fitting is also faster for very large numbers of points, which may occur in large data sets with many genes and/or many levels of block. Arguments are specified with spline.args with a default df=4 and "symmetric" robust weighting. Users can force the use of cross-validation by supplying an invalid df=0, but this tends to result in a rather bumpy trend.

The trendVar function will produce an output trend function with which fitted values can be computed. When extrapolating to values below the smallest observed mean, the output function will approach zero. When extrapolating to values above the largest observed mean, the output function will be set to the fitted value of the trend at the largest mean.

Handling uninteresting factors of variation

There are three approaches to handling unwanted factors of variation. The simplest approach is to use a design matrix containing the uninteresting factors can be specified in design. This will fit a linear model to the log-expression values for each gene, yielding an estimate for the residual variance. The trend is then fitted to the residual variance against the mean for each spike-in transcripts.

Another approach is to use block, where all cells in each level of the blocking factor are treated as a separate group. Means and variances are estimated within each group and the resulting sets of means/variances are pooled across all groups. The trend is then fitted to the pooled observations, where observations from different levels are weighted according to the residual d.f. used for variance estimation. This effectively multiplies the number of points by the number of levels in block. If both block and design are specified, block will take priority and design will be ignored.

The final approach is to subset the data set for each level of the blocking factor and model the variances within each block separately. This is done using the multiBlockVar function, see its documentation for more details. Separate modelling and trend fitting is the most correct approach if there are systematic differences in the size factors (spike-in or endogenous) between levels. With the other two methods, such differences would be normalized out in the full log-expression matrix, preventing proper estimation of the level-specific abundance.

Assuming there are no differences in the size factors between levels, we suggest using block wherever possible instead of design. This is because the use of block preserves differences in the means/variances between levels of the factor. In contrast, using design will effectively compute an average mean/variance. This may yield an inaccurate representation of the trend, as the fitted value at an average mean may not be equal to the average variance for non-linear trends. Nonetheless, we still support design as it can accommodate additive models, whereas block only handles one-way layouts.

Additional notes on row selection

The selection of spike-in transcripts can be adjusted in trendVar,SingleCellExperiment-method using the use.spikes method.

If use.spikes=FALSE, this implies that trendVar will be applied to the endogenous genes in the SingleCellExperiment object. For trendVar,ANY-method, it is equivalent to manually supplying a matrix of normalized expression for endogenous genes. This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population.

Low-abundance genes with mean log-expression below min.mean are not used in trend fitting, to preserve the sensitivity of span-based smoothers at moderate-to-high abundances. It also protects against discreteness, which can interfere with estimation of the variability of the variance estimates and accurate scaling of the trend. The default threshold is chosen based on the point at which discreteness is observed in variance estimates from Poisson-distributed counts. For heterogeneous droplet data, a lower threshold of 0.001-0.01 may be more appropriate.

Users can directly specify which rows to use with subset.row. See ?"scran-gene-selection" for how the different options interact with each other.

Note that features are not used under any circumstance if they have a variance of zero. These are not compatible with trend fitting on the log-scale. In practice, it should be effectively impossible to obtain variances of zero for any expressed gene, though some care may be required with simulated data sets. Any NA variances are similarly ignored.

Note that, depending on the output of the various row selection parameters, the number of used rows may be much less than the number of rows of x. In pathological cases, this may result in an error stating that insufficient points are available for the trend points.

Warning on size factor centring

If assay.type="logcounts", trendVar,SingleCellExperiment-method will attempt to determine if the expression values were computed from counts via normalize. If so, a warning will be issued if the size factors are not centred at unity. This is because different size factors are typically used for endogenous genes and spike-in transcripts. If these size factor sets are not centred at the same value, there will be systematic differences in abundance between these features. This precludes the use of a spike-in fitted trend with abundances for endogenous genes in decomposeVar.

For other expression values and in trendVar,ANY-method, the onus is on the user to ensure that normalization (i) does not introduce differences in abundance between spike-in and endogenous features, while (ii) preserving differences in abundance within the set of endogenous or spike-in features. In short, the scaling factors used to normalize each feature should have the same mean across all cells. This ensures that spurious differences in abundance are not introduced by the normalization process.

Author(s)

Aaron Lun

References

Lun ATL, McCarthy DJ and Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5:2122

Lun ATL (2018). Description of the HVG machinery in scran. https://github.com/LTLA/HVGDetection2018

See Also

nls, loess, decomposeVar, computeSpikeFactors, computeSumFactors, normalize

Examples

example(computeSpikeFactors) # Using the mocked-up data 'y' from this example.

# Normalizing (gene-based factors for genes, spike-in factors for spike-ins)
y <- computeSumFactors(y) 
y <- computeSpikeFactors(y, general.use=FALSE)
y <- normalize(y)

# Fitting a trend to the spike-ins.
fit <- trendVar(y)
plot(fit$mean, fit$var)
curve(fit$trend(x), col="red", lwd=2, add=TRUE)

# Fitting a trend to the endogenous genes. 
fit.g <- trendVar(y, use.spikes=FALSE)
plot(fit.g$mean, fit.g$var)
curve(fit.g$trend(x), col="red", lwd=2, add=TRUE)

[Package scran version 1.12.0 Index]