multiBatchPCA {batchelor} | R Documentation |
Perform a principal components analysis across multiple gene expression matrices to project all cells to a common low-dimensional space.
multiBatchPCA(..., batch = NULL, d = 50, subset.row = NULL, rotate.all = FALSE, get.variance = FALSE, preserve.single = FALSE, assay.type = "logcounts", get.spikes = FALSE, BSPARAM = ExactParam(), BPPARAM = SerialParam())
... |
Two or more matrices containing expression values (usually log-normalized). Each matrix is assumed to represent one batch. Alternatively, two or more SingleCellExperiment objects containing these matrices. Alternatively, one matrix or SingleCellExperiment can be supplied containing cells from all batches.
This requires |
batch |
A factor specifying the batch identity of each cell in the input data.
Ignored if |
d |
An integer scalar specifying the number of dimensions to keep from the initial multi-sample PCA. |
subset.row |
A vector specifying which features to use for correction. |
rotate.all |
A logical scalar indicating whether the reported rotation vectors should include genes that are excluded by a non- |
get.variance |
A logical scalar indicating whether to return the (weighted) variance explained by each PC. |
preserve.single |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in |
assay.type |
A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in |
get.spikes |
A logical scalar indicating whether to retain rows corresponding to spike-in transcripts. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA, see |
BPPARAM |
A BiocParallelParam object specifying whether the SVD should be parallelized. |
This function is roughly equivalent to cbind
ing all matrices in ...
and performing PCA on the merged matrix.
The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors.
Specifically, the mean vector used for centering is defined as the grand mean of the mean vectors within each batch.
Each batch's contribution to the gene-gene covariance matrix is also divided by the number of cells in that batch.
Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches.
If ...
contains SingleCellExperiment objects, any spike-in transcripts should be the same across all batches.
These will be removed prior to PCA unless get.spikes=TRUE
.
If subset.row
is specified and get.spikes=FALSE
, only the non-spike-in specified features will be used.
Setting rotate.all=TRUE
will report rotation vectors that span all genes, even when only a subset of genes are used for the PCA.
This is done by projecting all non-used genes into the low-dimensional “cell space” defined by the first d
components.
If BSPARAM
is defined with deferred=TRUE
, the per-gene centering and per-cell scaling will be manually deferred during matrix multiplication.
This can greatly improve speeds when the input matrices are sparse, as deferred operations avoids loss of sparsity (at the cost of numerical precision).
A List of numeric matrices is returned where each matrix corresponds to a batch and contains the first d
PCs (columns) for all cells in the batch (rows).
If preserve.single=TRUE
and ...
contains a single object, the List will only contain a single matrix.
This contains the first d
PCs (columns) for all cells in the same order as supplied in the single input object.
The metadata contains rotation
, a matrix of rotation vectors, which can be used to construct a low-rank approximation of the input matrices.
This has number of rows equal to the number of genes after any subsetting, except if rotate.all=TRUE
, where the number of rows is equal to the genes before subsetting.
If get.variance=TRUE
, the metadata will also contain var.explained
, the weighted variance explained by each PC;
and var.total
, the total variance after weighting.
Aaron Lun
d1 <- matrix(rnorm(5000), ncol=100) d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1 d2 <- matrix(rnorm(2000), ncol=40) d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2 out <- multiBatchPCA(d1, d2) xlim <- range(c(out[[1]][,1], out[[2]][,1])) ylim <- range(c(out[[1]][,2], out[[2]][,2])) plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim) points(out[[2]][,1], out[[2]][,2], col="blue")