combineMarkers {scran}    R Documentation

Combine DE results to a marker set

Description

Combine pairwise comparisons between groups or clusters into a single marker set for each cluster.

Usage

combineMarkers(de.lists, pairs, pval.field="p.value", effect.field="logFC", 
    pval.type=c("any", "all"), log.p.in=FALSE, log.p.out=log.p.in, 
    output.field=NULL, full.stats=FALSE)

Arguments

de.lists

A list-like object where each element is a data.frame or DataFrame. Each element should represent the results of a pairwise comparison between two groups/clusters, in which each row should contain the statistics for a single gene/feature. Rows should be named by the feature name in the same order for all elements.

pairs

A matrix, data.frame or DataFrame with two columns and a number of rows equal to the length of de.lists. Each row should specify the pair of clusters being compared for the corresponding element of de.lists.

pval.field

A string specifying the column name of each element of de.lists that contains the p-value.

effect.field

A string specifying the column name of each element of de.lists that contains the effect size.

pval.type

A string specifying the type of combined p-value to be computed, i.e., Simes' ("any") or IUT ("all").

log.p.in

A logical scalar indicating if the p-values in de.lists were log-transformed.

log.p.out

A logical scalar indicating if log-transformed p-values/FDRs should be returned.

output.field

A string specifying the prefix of the field names containing the effect sizes.

full.stats

A logical scalar indicating whether all statistics in de.lists should be stored in the output for each pairwise comparison.
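The shape expected for de.lists and pairs can be sketched with plain data.frames. This is a minimal illustration under stated assumptions: the gene names, cluster names, statistics and the column names first and second in pairs are hypothetical, and the checks only restate the requirements described for the arguments above.

```r
# Hypothetical toy inputs: three genes, two clusters "A" and "B".
genes <- c("gene1", "gene2", "gene3")

# One data.frame per comparison, with rows named by gene in the same order.
a.vs.b <- data.frame(logFC = c(2.1, -0.3, 0.8),
                     p.value = c(0.001, 0.60, 0.04),
                     row.names = genes)
b.vs.a <- data.frame(logFC = -a.vs.b$logFC,
                     p.value = a.vs.b$p.value,
                     row.names = genes)
de.lists <- list(a.vs.b, b.vs.a)

# One row per element of de.lists, naming the two clusters being compared.
# The column names are an assumption made for readability.
pairs <- data.frame(first = c("A", "B"), second = c("B", "A"),
                    stringsAsFactors = FALSE)

# Checks mirroring the documented requirements.
stopifnot(nrow(pairs) == length(de.lists))
stopifnot(all(vapply(de.lists,
                     function(df) identical(rownames(df), genes),
                     logical(1))))
```

With inputs of this shape, the call would presumably be combineMarkers(de.lists, pairs), relying on the default pval.field and effect.field.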

Details

An obvious strategy for characterizing differences between clusters is to look for genes that are differentially expressed (DE) between them. However, comprehensively identifying the genes that define each cluster requires comparisons between all pairs of clusters. For all pairwise comparisons involving a single cluster, we would like to consolidate the DE results into a single list of candidate marker genes. This is the purpose of the combineMarkers function. DE statistics from any testing regime can be supplied to this function; see the Examples for how this is done with t-tests from pairwiseTTests.

Value

A named List of DataFrames where each DataFrame contains the consolidated results for the cluster of the same name.

Within each DataFrame (say, the DataFrame for cluster X), rows correspond to genes with the fields:

Top:

Integer, the minimum rank across all pairwise comparisons. Only reported if pval.type="any".

p.value:

Numeric, the p-value across all comparisons if log.p.out=FALSE. This is a Simes' p-value if pval.type="any", otherwise it is an IUT p-value.

log.p.value:

Numeric, the log-transformed version of p.value if log.p.out=TRUE.

FDR:

Numeric, the BH-adjusted p-value for each gene if log.p.out=FALSE.

log.FDR:

Numeric, the log-transformed adjusted p-value for each gene if log.p.out=TRUE.

logFC.Y:

Numeric for every other cluster Y that is paired with X in pairs, containing the effect size of the comparison of X to Y when full.stats=FALSE. Note that this column is named according to the output.field prefix, so it does not necessarily contain the log-fold change.

stats.Y:

DataFrame for every other cluster Y that is paired with X in pairs, returned when full.stats=TRUE. This contains the same fields as the corresponding entry of de.lists for the X versus Y comparison.

Genes are ranked by the Top column (if available) and then the p.value column.

DataFrames are sorted according to the order of cluster IDs in pairs[,1]. The logFC.Y columns are sorted according to the order of cluster IDs in pairs[,2] within the corresponding level of the first cluster.

Note that DataFrames are only created for clusters present in pairs[,1]. Clusters unique to pairs[,2] will only be present within each DataFrame as Y.

Consolidating p-values into a ranking

By default, each table is sorted by the Top value when pval.type="any". This is the minimum rank across all pairwise comparisons for each gene, and specifies the size of the candidate marker set. Taking all rows with Top values less than or equal to X will yield a marker set containing the top X genes (ranked by significance) from each pairwise comparison. The marker set for each cluster allows it to be distinguished from every other cluster based on the differential expression of at least one gene.

To demonstrate, let us define a marker set with an X of 1 for a given cluster. The set of genes with Top <= 1 will contain the top gene from each pairwise comparison to every other cluster. If X is instead, say, 5, the set will consist of the union of the top 5 genes from each pairwise comparison. Obviously, multiple genes can have the same Top as different genes may have the same rank across different pairwise comparisons. Conversely, the marker set may be smaller than the product of Top and the number of other clusters, as the same gene may be shared across different comparisons.
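The relationship between Top and the marker set can be sketched in base R. This is an illustration only: the p-values and gene names are hypothetical, and ranking by p-value with ties broken by order is an assumption rather than a description of the package's internals.

```r
# Hypothetical p-values for five genes, comparing cluster X to Y1 and to Y2.
pvals <- cbind(
    Y1 = c(g1 = 0.001, g2 = 0.20, g3 = 0.03, g4 = 0.50, g5 = 0.04),
    Y2 = c(g1 = 0.30, g2 = 0.002, g3 = 0.01, g4 = 0.60, g5 = 0.70)
)

# Rank genes within each comparison (1 = most significant).
ranks <- apply(pvals, 2, rank, ties.method = "first")

# Top is the minimum rank of each gene across the comparisons.
top <- apply(ranks, 1, min)

# Marker sets for thresholds of 1 and 2.
set1 <- names(top)[top <= 1]
set2 <- names(top)[top <= 2]
```

Here set1 holds the top gene from each comparison (g1 and g2, which share a Top of 1), while set2 holds three genes rather than four because g3 is ranked second in both comparisons.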

This approach does not explicitly favour genes that are uniquely expressed in a cluster. Such a strategy is often too stringent, especially in cases involving overclustering or cell types defined by combinatorial gene expression. However, if pval.type="all", the null hypothesis is that the gene is not DE in at least one contrast, and an IUT p-value is computed for each gene. In this case, the Top field is omitted from the output table and the p.value field contains the IUT p-value. Ranking based on the IUT p-value will focus on genes that are DE against every other cluster, i.e., uniquely DE in that cluster.
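The effect of pval.type="all" on the ranking can be sketched as follows. This assumes the standard intersection-union construction (Berger and Hsu, 1996), in which the combined p-value is the largest p-value across contrasts; the p-values are hypothetical and the package's implementation may differ in detail.

```r
# Hypothetical p-values for three genes, comparing cluster X to Y1 and to Y2.
pvals <- cbind(
    Y1 = c(g1 = 0.001, g2 = 0.20, g3 = 0.03),
    Y2 = c(g1 = 0.30, g2 = 0.002, g3 = 0.01)
)

# Assumed IUT p-value: the maximum across contrasts, so a gene is only
# significant if it is significant in every comparison.
iut.p <- apply(pvals, 1, max)

# Genes ordered from most to least significant.
ranked <- names(sort(iut.p))
```

Gene g3 is ranked first because it is moderately significant in both comparisons, whereas g1 and g2 are each strongly significant in only one.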

Correcting for multiple testing

When pval.type="any", a combined p-value is calculated by consolidating p-values across contrasts for each gene using Simes' method. This represents the evidence against the null hypothesis that the gene is not DE in any of the contrasts. The BH method is then applied to the combined p-values across all genes to obtain the FDR field. The same procedure is used with pval.type="all", but with the IUT p-values in place of the Simes' p-values.
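The two steps for pval.type="any" can be sketched in base R. The helper simes() and the p-values are hypothetical and written here for illustration; only p.adjust() is an existing base R function, and the package may compute these quantities differently.

```r
# Hypothetical helper: Simes' combined p-value for one gene, i.e. the
# minimum of n * p_(i) / i over the sorted p-values p_(1) <= ... <= p_(n).
simes <- function(p) {
    n <- length(p)
    min(n * sort(p) / seq_len(n))
}

# Hypothetical p-values for three genes across two contrasts.
pvals <- cbind(
    Y1 = c(g1 = 0.001, g2 = 0.20, g3 = 0.03),
    Y2 = c(g1 = 0.30, g2 = 0.002, g3 = 0.01)
)

# Combine across contrasts for each gene, then adjust across genes with BH.
combined <- apply(pvals, 1, simes)
fdr <- p.adjust(combined, method = "BH")
```

The combination runs across contrasts within a gene, while the BH adjustment runs across genes, so the two steps address different sources of multiplicity.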

If log.p.out=TRUE, log-transformed p-values and FDRs will be reported. This may be useful in over-powered studies with many cells, where directly reporting the raw p-values would result in many zeroes due to the limits of machine precision.
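The precision issue that motivates log-transformed output can be seen with base R's pnorm(). This is an illustration only; the test statistics are hypothetical and not taken from any real comparison.

```r
# Two hypothetical, very extreme test statistics.
z <- c(-40, -45)

# On the raw scale both p-values underflow to zero and cannot be ranked.
p.raw <- pnorm(z)

# On the log scale they stay finite and keep their ordering.
p.log <- pnorm(z, log.p = TRUE)

all(p.raw == 0)        # TRUE
p.log[2] < p.log[1]    # TRUE
```

Working on the log scale therefore preserves the ranking of genes whose raw p-values would otherwise all be reported as zero.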

Note that the reported FDRs are intended only as a rough measure of significance. Properly correcting for multiple testing is not generally possible when the clusters are determined from the same data that are used for DE testing.

Author(s)

Aaron Lun

References

Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.

Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11:283-319.

See Also

pairwiseTTests and pairwiseWilcox for functions that can generate de.lists and pairs.

findMarkers and overlapExprs provide wrappers that use combineMarkers on the t-test or Wilcoxon test results.

Examples

# Using the mocked-up data 'y2' from this example.
example(computeSpikeFactors) 
y2 <- normalize(y2)
kout <- kmeans(t(logcounts(y2)), centers=2) # Any clustering method is okay.

out <- pairwiseTTests(logcounts(y2), clusters=paste0("Cluster", kout$cluster))
comb <- combineMarkers(out$statistics, out$pairs)
comb[["Cluster1"]]

[Package scran version 1.12.1 Index]