Denoise with PCA {scran}R Documentation

Denoise expression with PCA

Description

Denoise log-expression data by removing principal components corresponding to technical noise.

Usage

## S4 method for signature 'ANY'
denoisePCA(x, technical, subset.row=NULL,
    value=c("pca", "n", "lowrank"), min.rank=5, max.rank=100, approximate=NULL,
    irlba.args=list(), BSPARAM=ExactParam(), BPPARAM=SerialParam())

## S4 method for signature 'SingleCellExperiment'
denoisePCA(x, ..., subset.row=NULL, 
    value=c("pca", "n", "lowrank"), assay.type="logcounts", 
    get.spikes=FALSE, sce.out=TRUE)

denoisePCANumber(var.exp, var.tech, var.total) 

Arguments

x

A numeric matrix of log-expression values for denoisePCA,ANY-method, or a SingleCellExperiment object containing such values for denoisePCA,SingleCellExperiment-method.

technical

A function that computes the technical component of the variance for a gene with a given mean (log-)expression, see ?trendVar. This can also be a numeric vector containing the technical component for each gene in x; or the entire DataFrame generated by decomposeVar or combineVar.

subset.row

See ?"scran-gene-selection".

value

A string specifying the type of value to return; the PCs, the number of retained components, or a low-rank approximation.

min.rank, max.rank

Integer scalars specifying the minimum and maximum number of PCs to retain.

approximate, irlba.args

Deprecated, use BSPARAM instead.

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA.

BPPARAM

A BiocParallelParam object to use for parallel processing.

...

Further arguments to pass to denoisePCA,ANY-method.

assay.type

A string specifying which assay values to use.

get.spikes

See ?"scran-gene-selection".

sce.out

A logical scalar specifying whether a modified SingleCellExperiment object should be returned.

var.exp

A numeric vector of the variances explained by successive PCs, starting from the first (but not necessarily containing all PCs).

var.tech

A numeric scalar containing the variance attributable to technical noise.

var.total

A numeric scalar containing the total variance in the data.

Details

This function performs a principal components analysis to reduce random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological substructure should be correlated and captured by earlier PCs, as this explains more variance for sets of genes. The idea is to discard later PCs to remove technical noise and improve the resolution of substructure.

The choice of the number of PCs to discard is based on the estimates of technical variance in technical. This can be:

The percentage of variance explained by technical noise is estimated by summing the technical components across genes and dividing by the summed total variance. Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the overall technical estimate.

Now, consider the retention of the first X PCs. For a given value of X, we compute the variance explained by all of the later PCs. We aim to find the largest value of X such that the sum of variances explained by the later PCs is still less than the variance attributable to technical noise. This X represents a lower bound on the number of PCs that can be retained before biological variation is definitely lost. Note that X will be coerced to lie between min.rank and max.rank.

Value

For denoisePCA,ANY-method, a numeric matrix is returned containing the selected PCs (columns) for all cells (rows) if value="pca". If value="n", it will return an integer scalar specifying the number of retained components. If value="lowrank", it will return a low-rank approximation of x with the same dimensions.

For denoisePCA,SingleCellExperiment-method, the return value is the same as denoisePCA,ANY-method if sce.out=FALSE or value="n". Otherwise, a SingleCellExperiment object is returned that is a modified version of x. If value="pca", the modified object will contain the PCs as the "PCA" entry in the reducedDims slot. If value="lowrank", it will return a low-rank approximation in assays slot, named "lowrank".

For all uses of denoisePCA, the fractions of variance explained by the first max.rank PCs will be stored as the "percentVar" attribute in the return value. This is directly compatible with functions such as plotPCA.

denoisePCANumber will return an integer scalar specifying the number of PCs to retain. This is equivalent to the output from denoisePCA after setting value="n", but ignoring any setting of min.rank or max.rank.

Generating low-rank approximations

When value="lowrank", a low-rank approximation of the original matrix is computed using only the first X components. This is useful for denoising prior to downstream applications that expect gene-wise expression profiles.

Note that approximate values are returned for all genes. This includes “unselected” genes, i.e., with negative biological components or that were not selected with subset.row. The low-rank approximation is obtained for these genes by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes. The exception is when get.spikes=FALSE, whereby zeroes are returned for all spike-in rows.

Handling uninteresting factors of variation

Previous versions of this function allowed users to specify a design matrix to regress out uninteresting factors of variation. This behaviour is now defunct in favour of users running appropriate batch correction functions beforehand. If x is a batch-corrected expression matrix, technical should not be a function. Rather, users should supply a DataFrame from decomposeVar or combineVar. This is because calculation of the residual variance (or mean) from a corrected matrix is not accurate, so precomputed values are necessary.

Any correction should preserve the residual variance of each gene for the variances to be correctly interpreted. Specifically, calculation of the variance within each batch should yield the same result in both the corrected and original matrices. This limits the possible methods for batch correction:

Note that, if technical is a DataFrame, the denoisePCA function will also adjust scale the precomputed variances to match the sample variance of each gene in x. Specifically, variance components are scaled until technical$total is equal to the sample variance. This reflects the fact that the PCA (and the variance explained by each PC) does not account for the loss of residual degrees of freedom due to blocking factors. Scaling adjusts the estimated technical component to match the decreased total when the sample variance is used instead of the residual variance, and ensures that we have a correct estimate of the proportion of the sample variance driven by technical noise.

Author(s)

Aaron Lun

References

Lun ATL (2018). Discussion of PC selection methods for scRNA-seq data. https://github.com/LTLA/PCSelection2018

See Also

trendVar and decomposeVar for methods of computing technical components.

runSVD for the underlying SVD algorithm(s).

Examples

# Mocking up some data.
ngenes <- 1000
is.spike <- 1:100
means <- 2^runif(ngenes, 6, 10)
dispersions <- 10/means + 0.2
nsamples <- 50
counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples)
rownames(counts) <- paste0("Gene", seq_len(ngenes))

# Fitting a trend.
lcounts <- log2(counts + 1)
fit <- trendVar(lcounts, subset.row=is.spike)

# Denoising (not including the spike-ins in the PCA;
# spike-ins are automatically removed with the SingleCellExperiment method). 
pcs <- denoisePCA(lcounts, technical=fit$trend, subset.row=-is.spike)
dim(pcs)

[Package scran version 1.12.1 Index]