Quick clustering {scran}R Documentation

Quick clustering of cells

Description

Cluster similar cells based on their expression profiles, using either log-expression values or ranks.

Usage

## S4 method for signature 'ANY'
quickCluster(x, min.size=100, method=c("igraph", "hclust"), 
    use.ranks=NULL, pc.approx=NULL, d=NULL, subset.row=NULL, min.mean=1, 
    graph.fun=cluster_walktrap, BSPARAM=ExactParam(), BPPARAM=SerialParam(),
    block=NULL, block.BPPARAM=SerialParam(), ...)

## S4 method for signature 'SingleCellExperiment'
quickCluster(x, subset.row=NULL, ..., assay.type="counts", get.spikes=FALSE)

Arguments

x

A numeric count matrix where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix.

min.size

An integer scalar specifying the minimum size of each cluster.

method

A string specifying the clustering method to use.

use.ranks

A logical scalar indicating whether clustering should be performed on the rank matrix, i.e., based on Spearman's rank correlation. Defaults to TRUE for consistency with old defaults, but this will be changed to FALSE in subsequent releases.

pc.approx

Deprecated, use BSPARAM instead.

d

An integer scalar specifying the number of principal components to retain.

If NULL and use.ranks=TRUE, this defaults to 50. If use.rank=FALSE, the number of PCs is chosen by denoisePCA.

If NA, no dimensionality reduction is performed and the gene expression values (or their rank equivalents) are directly used in clustering.

subset.row

See ?"scran-gene-selection".

min.mean

A numeric scalar specifying the filter to be applied on the average count for each filter prior to computing ranks. Only used when use.ranks=TRUE, see ?scaledColRanks for details.

graph.fun

A function specifying the community detection algorithm to use on the nearest neighbor graph when method="igraph". Usually obtained from the igraph package.

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA, if d is not NA.

BPPARAM

A BiocParallelParam object to use for parallel processing within each block.

block

A factor of length equal to ncol(x) specifying whether clustering should be performed within pre-specified blocks. By default, all columns in x are treated as a single block.

block.BPPARAM

A BiocParallelParam object specifying whether and how parallelization should be performed across blocks, if block is non-NULL and has more than one level.

...

For quickCluster,ANY-method, additional arguments to be passed to cutreeDynamic for method="hclust", or buildSNNGraph for method="igraph".

For quickCluster,SingleCellExperiment-method, additional arguments to pass to quickCluster,ANY-method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

See ?"scran-gene-selection".

Details

This function provides a convenient wrapper to quickly define clusters of a minimum size min.size. Two clustering strategies are available:

By default, quickCluster will apply these clustering algorithms on the principal component (PC) scores generated from the log-expression values. These are obtained by denoisePCA based on the trend fitted to endogenous genes with trendVar.

If use.ranks=TRUE, clustering is instead performed on PC scores obtained from scaled and centred ranks generated by scaledColRanks. This effectively means that clustering uses distances based on the Spearman's rank correlation between two cells. In addition, if x is a dgCMatrix and BSPARAM has deferred=TRUE, ranks will be computed without loss of sparsity to improve speed and memory efficiency during PCA.

Setting use.ranks=TRUE is invariant to scaling normalization and avoids circularity between normalization and clustering, e.g., in computeSumFactors. However, the default is to use the log-expression values as this yields finer and more precise clusters.

Value

A character vector of cluster identities for each cell in counts is returned.

Enforcing cluster sizes

With method="hclust", cutreeDynamic is used to ensure that all clusters contain a minimum number of cells. However, some cells may not be assigned to any cluster and are assigned identities of "0" in the output vector. In most cases, this is because those cells belong in a separate cluster with fewer than min.size cells. The function will not be able to call this as a cluster as the minimum threshold on the number of cells has not been passed. Users are advised to check that the unassigned cells do indeed form their own cluster. Otherwise, it may be necessary to use a different clustering algorithm.

When using method="igraph", clusters are first identified using the specified graph.fun. If the smallest cluster contains fewer cells than min.size, it is merged with the closest neighbouring cluster. In particular, the function will attempt to merge the smallest cluster with each other cluster. The merge that maximizes the modularity score is selected, and a new merged cluster is formed. This process is repeated until all (merged) clusters are larger than min.size.

Gene selection

Spike-in transcripts are not used by default as they provide little information on the biological similarities between cells. This may not be the case if subpopulations differ by total RNA content, in which case setting get.spikes=TRUE may provide more discriminative power.

When use.ranks=TRUE, the function will also filter out genes with average counts (as defined by calcAverage) below min.mean. This removes low-abundance genes with many tied ranks, especially due to zeros, which may reduce the precision of the clustering. We suggest setting min.mean to 1 for read count data and 0.1 for UMI data.

Author(s)

Aaron Lun and Karsten Bach

References

van Dongen S and Enright AJ (2012). Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv 1208.3145

Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75

See Also

cutreeDynamic, computeSumFactors, buildSNNGraph

scaledColRanks to get the rank matrix directly.

Examples

set.seed(100)
popsize <- 200
ngenes <- 1000
all.facs <- 2^rnorm(popsize, sd=0.5)
counts <- matrix(rnbinom(ngenes*popsize, mu=all.facs, size=1), ncol=popsize, byrow=TRUE)

clusters <- quickCluster(counts, use.ranks=FALSE, min.size=20)
clusters <- quickCluster(counts, use.ranks=TRUE)

[Package scran version 1.12.1 Index]