rescaleBatches {batchelor} | R Documentation |
Scale counts so that the average count within each batch is the same for each gene.
rescaleBatches(..., batch = NULL, restrict = NULL, log.base = 2, pseudo.count = 1, subset.row = NULL, assay.type = "logcounts", get.spikes = FALSE)
... |
Two or more log-expression matrices where genes correspond to rows and cells correspond to columns. Each matrix should contain cells from the same batch; multiple matrices represent separate batches of cells. Each matrix should contain the same number of rows, corresponding to the same genes (in the same order). Alternatively, one or more SingleCellExperiment objects can be supplied, each containing a log-expression matrix in the assay.type assay. |
batch |
A factor specifying the batch of origin for all cells when only a single object is supplied in ... |
restrict |
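A list of length equal to the number of objects in ..., where each element specifies the subset of cells in the corresponding object to use for computing the correction. |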
log.base |
A numeric scalar specifying the base of the log-transformation. |
pseudo.count |
A numeric scalar specifying the pseudo-count used for the log-transformation. |
subset.row |
A vector specifying which features to use for correction. |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values, if SingleCellExperiment objects are present in ... |
get.spikes |
A logical scalar indicating whether to retain rows corresponding to spike-in transcripts. Only used for SingleCellExperiment inputs. |
This function assumes that the log-expression values were computed by a log-transformation of normalized count data, plus a pseudo-count. It reverses the log-transformation and scales the underlying counts in each batch so that the average (normalized) count is equal across batches. The assumption here is that each batch contains the same population composition. Thus, any scaling difference between batches is technical and must be removed.
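The steps described above can be sketched in base R. This is a minimal illustration and not the batchelor implementation: the helper name rescale_sketch is hypothetical, and the choice to scale every batch down to the smallest per-gene average is an assumption made here so that a concrete target exists.

```r
# Hypothetical helper for illustration; not part of batchelor.
rescale_sketch <- function(batches, log.base = 2, pseudo.count = 1) {
    # Reverse the log-transformation to recover the normalized counts.
    counts <- lapply(batches, function(x) log.base^x - pseudo.count)

    # Average count of each gene (row) within each batch.
    averages <- lapply(counts, rowMeans)

    # Assumed target: the smallest per-gene average across batches.
    target <- do.call(pmin, averages)

    # Scale each batch to the target, then re-apply the log-transformation.
    mapply(function(x, avg) {
        scaling <- target / avg
        scaling[!is.finite(scaling)] <- 0  # genes with a zero average
        log(x * scaling + pseudo.count, base = log.base)
    }, counts, averages, SIMPLIFY = FALSE)
}

set.seed(100)
A1 <- matrix(rpois(1000, lambda = 5), ncol = 50)   # batch 1: 20 genes, 50 cells
A2 <- matrix(rpois(1000, lambda = 10), ncol = 50)  # batch 2: same genes, larger counts
corrected <- rescale_sketch(list(log2(A1 + 1), log2(A2 + 1)))

# Per-gene average counts after correction, one vector per batch.
avg1 <- rowMeans(2^corrected[[1]] - 1)
avg2 <- rowMeans(2^corrected[[2]] - 1)
all.equal(avg1, avg2)
```

Each gene is handled independently, so the scaling vector has one entry per row and is recycled across the columns of the batch.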
This function is equivalent to centering in log-expression space, the simplest application of linear regression methods for batch correction. However, by scaling the raw counts, it avoids loss of sparsity that would otherwise result from centering. It also mitigates issues with artificial differences in variance due to log-transformation.
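The sparsity argument can be checked with a small base R comparison. The simulated data, the variable names, and the scale-down-to-the-smallest-average target are all assumptions for this sketch; it only contrasts what happens to zero counts under centering and under count scaling.

```r
set.seed(42)
counts1 <- matrix(rpois(200, lambda = 0.5), ncol = 20)  # batch 1: 10 genes, 20 cells
counts2 <- matrix(rpois(200, lambda = 1.5), ncol = 20)  # batch 2: same genes, larger counts
logs2 <- log2(counts2 + 1)

# Centering in log-expression space: subtract the per-gene mean.
# Every zero becomes a negative value, so the matrix is no longer sparse.
centred2 <- logs2 - rowMeans(logs2)

# Scaling the counts: multiply each gene by a per-gene factor, then re-log.
# A zero count stays zero because log2(0 * s + 1) is exactly 0.
m1 <- rowMeans(counts1)
m2 <- rowMeans(counts2)
s <- pmin(m1, m2) / m2          # assumed target: smallest average across batches
s[!is.finite(s)] <- 0           # genes with a zero average in batch 2
scaled2 <- log2(counts2 * s + 1)

zero <- counts2 == 0
c(zeros.in.input = sum(zero),
  zeros.after.centering = sum(centred2[zero] == 0),
  zeros.after.scaling = sum(scaled2[zero] == 0))
```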
The output values are always re-log-transformed with the same log.base and pseudo.count.
These can be used directly in place of the input values for downstream operations.
A SingleCellExperiment object containing the corrected assay.
This contains corrected log-expression values for each gene (row) in each cell (column) in each batch.
A batch field is present in the column data, specifying the batch of origin for each cell.
Cells in the output object are always ordered in the same manner as supplied in ....
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
All genes are used with the default setting of subset.row=NULL.
Users can set subset.row to subset the inputs, though this is purely for convenience as each gene is processed independently of other genes.
For SingleCellExperiment inputs, spike-in transcripts are automatically removed unless get.spikes=TRUE.
If subset.row is specified and get.spikes=FALSE, only the non-spike-in specified features will be used.
All SingleCellExperiment objects should have the same set of spike-in transcripts.
It is possible to compute the correction using only a subset of cells in each batch, and then extrapolate that correction to all other cells. This may be desirable in experimental designs where a control set of cells from the same source population was run in each batch. Any difference between the controls must be artificial in origin and can be directly removed without making further biological assumptions.
To do this, users should set restrict to specify the subset of cells in each batch to be used for correction.
This should be set to a list of length equal to the length of ..., where each element is a subsetting vector to be applied to the columns of the corresponding batch.
A NULL element indicates that all the cells from a batch should be used.
In situations where one input object contains multiple batches, restrict is simply a list containing a single subsetting vector for that object.
The function will compute the scaling differences using only the specified subset of cells.
However, the re-scaling will then be applied to all cells in each batch - hence the extrapolation.
This means that the output is always of the same dimensionality, regardless of whether restrict is specified.
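The extrapolation can be sketched in base R as follows. This is an illustration under stated assumptions, not the batchelor code: it works on counts directly and skips the log round trip for brevity, the first 10 cells of each simulated batch are taken to be the controls, and the target is assumed to be the smallest per-gene control average.

```r
set.seed(7)
n.genes <- 10
batches <- list(
    matrix(rpois(n.genes * 30, lambda = 4), nrow = n.genes),  # batch 1: 30 cells
    matrix(rpois(n.genes * 30, lambda = 8), nrow = n.genes)   # batch 2: 30 cells
)

# One subsetting vector per batch; a NULL element would mean "use all cells".
restrict <- list(1:10, 1:10)

# Per-gene averages are computed from the restricted cells only.
averages <- mapply(function(x, idx) {
    if (is.null(idx)) rowMeans(x) else rowMeans(x[, idx, drop = FALSE])
}, batches, restrict, SIMPLIFY = FALSE)

# Assumed target: the smallest per-gene average across batches.
target <- do.call(pmin, averages)

# The scaling is then applied to every cell in the batch, not just the controls.
rescaled <- mapply(function(x, avg) {
    scaling <- target / avg
    scaling[!is.finite(scaling)] <- 0  # genes with a zero control average
    x * scaling
}, batches, averages, SIMPLIFY = FALSE)

# Control averages now agree, and the dimensions match the inputs.
all.equal(rowMeans(rescaled[[1]][, 1:10]), rowMeans(rescaled[[2]][, 1:10]))
sapply(rescaled, dim)
```

Only the control cells are guaranteed to have matching averages after this step; the remaining cells simply inherit the same per-gene scaling.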
Aaron Lun
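library(batchelor)

# Simulate 200 genes in 50 cells per batch, with one mean per gene
# so that the recycled means line up with the rows of each matrix.
means <- 2^rgamma(200, 2, 1)
A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1
A2 <- matrix(rpois(10000, lambda=means*runif(200, 0, 2)), ncol=50) # Batch 2

B1 <- log2(A1 + 1)
B2 <- log2(A2 + 1)
out <- rescaleBatches(B1, B2)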