combineMarkers {scran} | R Documentation |
Combine pairwise comparisons between groups or clusters into a single marker set for each cluster.
combineMarkers(de.lists, pairs, pval.field="p.value", effect.field="logFC", pval.type=c("any", "all"), log.p.in=FALSE, log.p.out=log.p.in, output.field=NULL, full.stats=FALSE)
de.lists |
A list-like object where each element is a data.frame or DataFrame. Each element should represent the results of a pairwise comparison between two groups/clusters, in which each row should contain the statistics for a single gene/feature. Rows should be named by the feature name in the same order for all elements. |
pairs |
A matrix, data.frame or DataFrame with two columns and number of rows equal to the length of |
pval.field |
A string specifying the column name of each element of |
effect.field |
A string specifying the column name of each element of |
pval.type |
A string specifying the type of combined p-value to be computed, i.e., Simes' ( |
log.p.in |
A logical scalar indicating if the p-values in |
log.p.out |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
output.field |
A string specifying the prefix of the field names containing the effect sizes. |
full.stats |
A logical scalar indicating whether all statistics in |
An obvious strategy to characterizing differences between clusters is to look for genes that are differentially expressed (DE) between them.
However, this entails a number of comparisons between all pairs of clusters to comprehensively identify genes that define each cluster.
For all pairwise comparisons involving a single cluster, we would like to consolidate the DE results into a single list of candidate marker genes.
This is the intention of the combineMarkers
function.
DE statistics from any testing regime can be supplied to this function - see the Examples for how this is done with t-tests from pairwiseTTests
.
A named List of DataFrames where each DataFrame contains the consolidated results for the cluster of the same name.
Within each DataFrame (say, the DataFrame for cluster X), rows correspond to genes with the fields:
Top
:Integer, the minimum rank across all pairwise comparisons.
Only reported if pval.type="any"
.
p.value
:Numeric, the p-value across all comparisons if log.p.out=FALSE
.
This is a Simes' p-value if pval.type="any"
, otherwise it is an IUT p-value.
log.p.value
:Numeric, the log-transformed version of p.value
if log.p.out=TRUE
.
FDR
:Numeric, the BH-adjusted p-value for each gene if log.p.out=FALSE
.
log.FDR
:Numeric, the log-transformed adjusted p-value for each gene if log.p.out=TRUE
.
logFC.Y
:Numeric for every other cluster Y in clusters
, containing the effect size of the comparison of X to Y when full.stats=FALSE
.
Note that this is named according to the output.field
prefix and so may not necessarily contain the log-fold change.
stats.Y
:DataFrame for every other cluster Y in clusters
, returned when full.stats=TRUE
.
This contains the same fields in the corresponding entry of de.lists
for the X versus Y comparison.
Genes are ranked by the Top
column (if available) and then the p.value
column.
DataFrames are sorted according to the order of cluster IDs in pairs[,1]
.
The logFC.Y
columns are sorted according to the order of cluster IDs in pairs[,2]
within the corresponding level of the first cluster.
Note that DataFrames are only created for clusters present in pairs[,1]
.
Clusters unique to pairs[,2]
will only be present within each DataFrame as Y.
By default, each table is sorted by the Top
value when pval.type="any"
.
This is the minimum rank across all pairwise comparisons for each gene, and specifies the size of the candidate marker set.
Taking all rows with Top
values less than or equal to X will yield a marker set containing the top X genes (ranked by significance) from each pairwise comparison.
The marker set for each cluster allows it to be distinguished from every other cluster based on the differential expression of at least one gene.
To demonstrate, let us define a marker set with an X of 1 for a given cluster.
The set of genes with Top <= 1
will contain the top gene from each pairwise comparison to every other cluster.
If X is instead, say, 5, the set will consist of the union of the top 5 genes from each pairwise comparison.
Obviously, multiple genes can have the same Top
as different genes may have the same rank across different pairwise comparisons.
Conversely, the marker set may be smaller than the product of Top
and the number of other clusters, as the same gene may be shared across different comparisons.
This approach does not explicitly favour genes that are uniquely expressed in a cluster.
Such a strategy is often too stringent, especially in cases involving overclustering or cell types defined by combinatorial gene expression.
However, if pval.type="all"
, the null hypothesis is that the gene is not DE in all contrasts, and the IUT p-value is computed for each gene.
This yields a IUT.p
field instead of a Top
field in the output table.
Ranking based on the IUT p-value will focus on genes that are uniquely DE in that cluster.
When pval.type="any"
, a combined p-value is calculated by consolidating p-values across contrasts for each gene using Simes' method.
This represents the evidence against the null hypothesis is that the gene is not DE in any of the contrasts.
The BH method is then applied on the combined p-values across all genes to obtain the FDR
field.
The same procedure is done with pval.type="all"
, but using the IUT p-values across genes instead.
If log.p=TRUE
, log-transformed p-values and FDRs will be reported.
This may be useful in over-powered studies with many cells, where directly reporting the raw p-values would result in many zeroes due to the limits of machine precision.
Note that the reported FDRs are intended only as a rough measure of significance.
Properly correcting for multiple testing is not generally possible when clusters
is determined from the same x
used for DE testing.
Aaron Lun
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11, 283-319.
pairwiseTTests
and pairwiseWilcox
for functions that can generate de.lists
and pairs
.
findMarkers
and overlapExprs
provide wrappers that use combineMarkers
on the t-test or Wilcoxon test results.
# Using the mocked-up data 'y2' from this example. example(computeSpikeFactors) y2 <- normalize(y2) kout <- kmeans(t(logcounts(y2)), centers=2) # Any clustering method is okay. out <- pairwiseTTests(logcounts(y2), clusters=paste0("Cluster", kout$cluster)) comb <- combineMarkers(out$statistics, out$pairs) comb[["Cluster1"]]