pairwiseWilcox {scran}    R Documentation

Perform pairwise Wilcoxon rank sum tests

Description

Perform pairwise Wilcoxon rank sum tests between groups of cells, possibly after blocking on uninteresting factors of variation.

Usage

pairwiseWilcox(x, clusters, block=NULL, direction=c("any", "up", "down"),
    log.p=FALSE, gene.names=rownames(x), subset.row=NULL, tol=1e-8,
    BPPARAM=SerialParam())

Arguments

x

A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene.

clusters

A vector of cluster identities for all cells.

block

A factor specifying the blocking level for each cell.

direction

A string specifying the direction of effects to be considered for each cluster.

log.p

A logical scalar indicating if log-transformed p-values/FDRs should be returned.

gene.names

A character vector of gene names with one value for each row of x.

subset.row

See ?"scran-gene-selection".

tol

Numeric scalar specifying the tolerance for tied values when x is numeric.

BPPARAM

A BiocParallelParam object indicating whether and how parallelization should be performed across genes.

Details

This function performs Wilcoxon rank sum tests to identify differentially expressed genes (DEGs) between pairs of clusters. A list of tables is returned, where each table contains the statistics for all genes for the comparison between one pair of clusters. This can be examined directly or used as input to combineMarkers for marker gene detection.

Effect sizes are computed as overlap proportions. Consider the distribution of expression values for gene X within each of two groups of cells A and B. The overlap proportion is defined as the probability that a randomly selected cell in A has a greater expression value of X than a randomly selected cell in B. Overlap proportions near 0 (A is lower than B) or 1 (A is higher than B) indicate that the expression distributions are well-separated. The Wilcoxon rank sum test effectively tests for significant deviations from an overlap proportion of 0.5.
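As a minimal sketch of this definition in base R (not scran's internal code), the overlap proportion can be computed by comparing every value in A against every value in B. The helper name overlap_proportion and the simulated values are hypothetical. Counting tied pairs as one half is an assumption made here, chosen so that the result equals the W statistic from wilcox.test divided by the number of pairs.

```r
# Hypothetical helper: proportion of (a, b) pairs in which a > b.
# Tied pairs are counted as 0.5 (an assumption of this sketch).
overlap_proportion <- function(a, b) {
    greater <- outer(a, b, ">")
    tied <- outer(a, b, "==")
    mean(greater + 0.5 * tied)
}

set.seed(100)
a <- rnorm(20, mean=1) # expression of gene X in group A (simulated)
b <- rnorm(30)         # expression of gene X in group B (simulated)

prop.ab <- overlap_proportion(a, b)
prop.ba <- overlap_proportion(b, a)

# The same quantity, recovered from the rank sum statistic.
w <- unname(wilcox.test(a, b)$statistic)
prop.from.w <- w / (length(a) * length(b))
```

Under this definition, swapping the two groups gives the complement, so prop.ab + prop.ba is 1.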

Wilcoxon rank sum tests are more robust to outliers and insensitive to non-normality, in contrast to the t-tests in pairwiseTTests. However, they take longer to run, the effect sizes are less interpretable, and there are more subtle violations of their assumptions in real data. For example, the i.i.d. assumptions are unlikely to hold after scaling normalization due to differences in variance. Also note that we approximate the distribution of the Wilcoxon rank sum statistic to deal with large numbers of cells and ties.
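The exact form of the approximation is not described here. As an illustration only, the sketch below uses the standard normal approximation with a tie-corrected variance and no continuity correction, which is what wilcox.test computes with exact=FALSE and correct=FALSE. The helper name approx_wilcox_p and the simulated counts are hypothetical, and this should not be read as scran's implementation.

```r
# Hypothetical helper: two-sided p-value from the normal approximation
# to the rank sum statistic, with a tie correction to the variance.
approx_wilcox_p <- function(x, y) {
    n.x <- length(x)
    n.y <- length(y)
    n <- n.x + n.y

    r <- rank(c(x, y)) # tied values receive midranks
    w <- sum(r[seq_len(n.x)]) - n.x * (n.x + 1) / 2

    # Sizes of the groups of tied values; untied values contribute zero.
    ties <- as.numeric(table(r))
    sigma <- sqrt(n.x * n.y / 12 *
        ((n + 1) - sum(ties^3 - ties) / (n * (n - 1))))

    z <- (w - n.x * n.y / 2) / sigma
    2 * pnorm(-abs(z))
}

set.seed(42)
x <- rpois(200, lambda=2)   # many ties, as in count-derived data
y <- rpois(300, lambda=2.5)

p.approx <- approx_wilcox_p(x, y)
p.ref <- wilcox.test(x, y, exact=FALSE, correct=FALSE)$p.value
```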

Value

A list is returned containing statistics and pairs.

The statistics element is itself a list of DataFrames. Each DataFrame contains the statistics for a comparison between a pair of clusters, including the overlap proportions, p-values and false discovery rates.

The pairs element is a DataFrame with one row corresponding to each entry of statistics. This contains the fields first and second, specifying the two clusters under comparison in the corresponding DataFrame in statistics.

In each DataFrame in statistics, the overlap proportion represents the probability of sampling a value in the first cluster greater than a random value from the second cluster. Note that switching the first and second clusters will affect the value of the overlap and, when direction!="any", the size of the p-value itself.

Blocking on uninteresting factors

If block is specified, Wilcoxon tests are performed between clusters within each level of block. For each pair of clusters, the p-values for each gene across all levels of block are combined using Stouffer's Z-score method. Blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a cluster in a particular level.

The weight for the p-value in a particular level of block is defined as N_xN_y, where N_x and N_y are the number of cells in clusters X and Y, respectively, for that level. This means that p-values from blocks with more cells will have a greater contribution to the combined p-value for each gene.
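A minimal sketch of the weighted Z-score combination follows, assuming one-sided p-values from three blocking levels. The helper name stouffer_combine and all numbers are hypothetical, and how scran handles two-sided p-values internally is not covered by this sketch.

```r
# Hypothetical helper: weighted Stouffer's Z-score method for
# one-sided p-values from independent tests.
stouffer_combine <- function(p, weights) {
    z <- qnorm(p, lower.tail=FALSE)
    combined.z <- sum(weights * z) / sqrt(sum(weights^2))
    pnorm(combined.z, lower.tail=FALSE)
}

# P-values for one gene in three levels of the blocking factor (made up).
p <- c(0.01, 0.20, 0.04)

# Number of cells in clusters X and Y within each level (made up).
n.x <- c(50, 10, 30)
n.y <- c(40, 5, 25)
w <- n.x * n.y # weight for each level, defined as N_x * N_y

combined <- stouffer_combine(p, w)
```

With these weights, the first level (2000 cell pairs) dominates the second (50 cell pairs), so the combined p-value stays close to the evidence from the largest blocks.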

Direction of the effect

If direction="any", two-sided Wilcoxon rank sum tests will be performed for each pairwise comparison between clusters. Otherwise, one-sided tests in the specified direction will be used to compute p-values for each gene. This can be used to focus on genes that are upregulated in each cluster of interest, which is often easier to interpret.

To interpret the setting of direction, consider the DataFrame for cluster X, in which we are comparing to another cluster Y. If direction="up", genes will only be significant in this DataFrame if they are upregulated in cluster X compared to Y. If direction="down", genes will only be significant if they are downregulated in cluster X compared to Y. See ?wilcox.test for more details on the interpretation of one-sided Wilcoxon rank sum tests.
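The behaviour of one-sided tests can be illustrated with wilcox.test in base R. In this sketch, the mapping of direction="up" to alternative="greater" and direction="down" to alternative="less" is an assumption made for illustration, and the expression values are simulated.

```r
set.seed(1)
x <- rnorm(20, mean=2) # gene upregulated in cluster X (simulated)
y <- rnorm(20)         # same gene in cluster Y (simulated)

# Two-sided test, analogous to direction="any".
p.any <- wilcox.test(x, y, alternative="two.sided")$p.value

# One-sided tests, analogous to direction="up" and direction="down"
# when X is the first cluster and Y is the second.
p.up <- wilcox.test(x, y, alternative="greater")$p.value
p.down <- wilcox.test(x, y, alternative="less")$p.value
```

Here p.up is small because X has higher values than Y, whereas p.down is close to 1. Swapping x and y exchanges the two one-sided p-values.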

Author(s)

Aaron Lun

References

Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.

Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods

Examples

# Using the mocked-up data 'y2' from this example.
example(computeSpikeFactors) 
y2 <- normalize(y2)
kout <- kmeans(t(logcounts(y2)), centers=2) # Any clustering method is okay.

# Vanilla application:
out <- pairwiseWilcox(logcounts(y2), clusters=kout$cluster)
out

# Directional:
out <- pairwiseWilcox(logcounts(y2), clusters=kout$cluster, direction="up")
out

[Package scran version 1.12.1 Index]