correlatePairs {scran}R Documentation

Test for significant correlations

Description

Identify pairs of genes that are significantly correlated based on a modified Spearman's rho.

Usage

correlateNull(ncells, iters=1e6, block=NULL, design=NULL, BPPARAM=SerialParam()) 

## S4 method for signature 'ANY'
correlatePairs(x, null.dist=NULL, ties.method=c("expected", "average"), tol=1e-8, 
    iters=1e6, block=NULL, design=NULL, lower.bound=NULL, use.names=TRUE, 
    subset.row=NULL, pairings=NULL, per.gene=FALSE, cache.size=100L,
    BPPARAM=SerialParam())

## S4 method for signature 'SingleCellExperiment'
correlatePairs(x, ..., use.names=TRUE, subset.row=NULL, per.gene=FALSE,
    lower.bound=NULL, assay.type="logcounts", get.spikes=FALSE)

Arguments

ncells

An integer scalar indicating the number of cells in the data set.

iters

An integer scalar specifying the number of values in the null distribution.

block

A factor specifying the blocking level for each cell.

design

A numeric design matrix containing uninteresting factors to be ignored.

BPPARAM

A BiocParallelParam object that specifies the manner of parallel processing to use.

x

A numeric matrix-like object of log-normalized expression values, where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix.

null.dist

A numeric vector of rho values under the null hypothesis.

ties.method

String specifying how tied ranks should be handled.

tol

Deprecated argument, ignored.

use.names

A logical scalar specifying whether the row names of x should be used in the output. Alternatively, a character vector containing the names to use.

subset.row

See ?"scran-gene-selection".

pairings

A NULL value indicating that all pairwise correlations should be computed; or a list of 2 vectors of genes between which correlations are to be computed; or a integer/character matrix with 2 columns of specific gene pairs - see below for details.

per.gene

Deprecated argument, use correlateGenes instead.

lower.bound

A numeric scalar specifying the theoretical lower bound of values in x, only used when residuals=TRUE.

cache.size

Deprecated argument, ignored.

...

Additional arguments to pass to correlatePairs,ANY-method.

assay.type

A string specifying which assay values to use.

get.spikes

See ?"scran-gene-selection".

Details

The aim of the correlatePairs function is to identify significant correlations between all pairs of genes in x. This allows prioritization of genes that are driving systematic substructure in the data set. By definition, such genes should be correlated as they are behaving in the same manner across cells. In contrast, genes driven by random noise should not exhibit any correlations with other genes.

We use Spearman's rho to quantify correlations robustly based on ranked gene expression, with some adjustment for ties (see below). To identify correlated gene pairs, the significance of non-zero correlations is assessed using a permutation test. The null hypothesis is that the ranking of normalized expression across cells should be independent between genes. This allows us to construct a null distribution by randomizing the ranks within each gene.

The correlateNull function constructs an empirical null distribution for rho computed with ncells cells. When design=NULL, this is done by shuffling the ranks, calculating the rho and repeating until iters values are obtained. The p-value for each gene pair is defined as the tail probability of this distribution at the observed correlation (with some adjustment to avoid zero p-values). Correction for multiple testing is done using the BH method.

For correlatePairs, a pre-computed empirical distribution can be supplied as null.dist if available. Otherwise, it will be automatically constructed via correlateNull with ncells set to the number of columns in x. For correlatePairs,SingleCellExperiment-method, correlations should be computed for normalized expression values in the specified assay.type.

The lower bound of the p-values is determined by the value of iters. If the limited field is TRUE in the returned dataframe, it may be possible to obtain lower p-values by increasing iters. This should be examined for non-significant pairs, in case some correlations are overlooked due to computational limitations. The function will automatically raise a warning if any genes are limited in their significance at a FDR of 5%.

Value

For correlateNull, a numeric vector of length iters is returned containing the sorted correlations under the null hypothesis of no correlations. Arguments to design and residuals are stored in the attributes.

For correlatePairs with per.gene=FALSE, a DataFrame is returned with one row per gene pair and the following fields:

gene1, gene2:

Character or integer fields specifying the genes in the pair. If use.names=FALSE, integers are returned representing row indices of x, otherwise gene names are returned.

rho:

A numeric field containing the approximate Spearman's rho.

p.value, FDR:

Numeric fields containing the permutation p-value and its BH-corrected equivalent.

limited:

A logical scalar indicating whether the p-value is at its lower bound, defined by iters.

Rows are sorted by increasing p.value and, if tied, decreasing absolute size of rho. The exception is if subset.row is a matrix, in which case each row in the dataframe correspond to a row of subset.row.

Accounting for uninteresting variation

If the experiment has known (and uninteresting) factors of variation, these can be included in design or block. correlatePairs will then attempt to ensure that these factors do not drive strong correlations between genes. Examples might be to block on batch effects or cell cycle phase, which may have substantial but uninteresting effects on expression.

The approach used to remove these factors depends on whether design or block is used. If there is only one factor, e.g., for plate or animal of origin, block should be used. Each level of the factor is defined as a separate group of cells. For each pair of genes, correlations are computed within each group, and a weighted mean based on the group size) of the correlations is taken across all groups. The same strategy is used to generate the null distribution where ranks are computed and shuffled within each group.

For experiments containing multiple factors or covariates, a design matrix should be passed into design. The correlatePairs function will fit a linear model to the (log-normalized) expression values. The correlation between a pair of genes is then computed from the residuals of the fitted model. Similarly, to obtain a null distribution of rho values, normally-distributed random errors are simulated in a fitted model based on design; the corresponding residuals are generated from these errors; and the correlation between sets of residuals is computed at each iteration.

We recommend using block wherever possible (and it will take priority if both block and design are specified). While design can also be used for one-way layouts, this is not ideal as it involves more work/assumptions:

Gene selection

The pairings argument specifies the pairs of genes to compute correlations for:

If subset.row is not NULL, only the genes in the selected subset are used to compute correlations - see ?"scran-gene-selection". This will interact properly with pairings, such that genes in pairings and not in subset.row will be ignored. Rows corresponding to spike-in transcripts are also removed by default with get.spikes=FALSE. This avoids picking up strong technical correlations between pairs of spike-in transcripts.

We recommend setting subset.row and/or pairings to contain only the subset of genes of interest. This reduces computational time and memory usage by only computing statistics for the gene pairs of interest. For example, we could select only HVGs to focus on genes contributing to cell-to-cell heterogeneity (and thus more likely to be involved in driving substructure). There is no need to account for HVG pre-selection in multiple testing, because rank correlations are unaffected by the variance.

Lowly-expressed genes can also cause problems when design is non-NULL and residuals=TRUE. Tied counts, and zeroes in particular, may not result in tied residuals after fitting of the linear model. Model fitting may break ties in a consistent manner across genes, yielding large correlations between genes with many zero counts. Focusing on HVGs should mitigate the detection of these uninteresting correlations, as genes dominated by zeroes will usually have low variance.

Handling tied values

By default, ties.method="expected" which uses the expected rho from randomly broken ties. Imagine obtaining unique ranks by randomly breaking ties in expression values, e.g., by addition of some very small random jitter. We then report the expected value of all possible permutations of broken ties.

With ties.method="average", each set of tied observations is assigned the average rank across all tied values. This yields the standard value of Spearman's rho as computed by cor.

We use the expected rho by default as avoids inflated correlations in the presence of many ties. This is especially relevant for single-cell data containing many zeroes that remain tied after scaling normalization. An extreme example is that of two genes that are only expressed in the same cell, for which the standard rho is 1 while the expected rho is close to zero.

Note that the p-value calculations are not accurate in the presence of ties, as tied ranks are not considered by correlateNull. With ties.method="expected", the p-values are probably a bit conservative. With ties.method="average", they will be very anticonservative.

Author(s)

Aaron Lun

References

Lun ATL (2019). Some thoughts on testing for correlations. https://ltla.github.io/SingleCellThoughts/software/correlations/corsim.html

Phipson B and Smyth GK (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9:Article 39.

Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.

See Also

Compare to cor for the standard Spearman's calculation.

Use correlateGenes to get per-gene correlation statistics.

Examples

set.seed(0)
ncells <- 100
null.dist <- correlateNull(ncells, iters=100000)
exprs <- matrix(rpois(ncells*100, lambda=10), ncol=ncells)
out <- correlatePairs(exprs, null.dist=null.dist)
hist(out$p.value) 

[Package scran version 1.12.1 Index]