testVar {scran}R Documentation

Test for significantly large variances

Description

Test for whether the total variance exceeds that expected under some null hypothesis, for sample variances estimated from normally distributed observations.

Usage

testVar(total, null, df, design=NULL, test=c("chisq", "f"), second.df=NULL, log.p=FALSE)

Arguments

total

A numeric vector of total variances for all genes.

null

A numeric scalar or vector of expected variances under the null hypothesis for all genes.

df

An integer scalar specifying the degrees of freedom on which the variances were estimated.

design

A design matrix, used to determine the degrees of freedom if df is missing.

test

A string specifying the type of test to perform.

second.df

A numeric scalar specifying the second degrees of freedom for the F-distribution when test="f".

log.p

A logical scalar indicating whether log-transformed p-values should be returned.

Details

The null hypothesis is that the true variance for each gene is equal to null. (Technically, it is that the variance is equal to or less than this value, but the most conservative test is obtained at equality.) If test="chisq", variance estimates are assumed to follow a chi-squared distribution on df degrees of freedom and scaled by null/df. This is used to compute a p-value for total being greater than null. The underlying assumption is that the observations are normally distributed under the null, which is reasonable for log-counts with low-to-moderate dispersions.

The aim is to use this function to identify significantly highly variable genes (HVGs). For example, the null vector can be set to the values of the trend fitted to the spike-in variances. This will identify genes with variances significantly greater than technical noise. Alternatively, it can be set to the trend fitted to the cellular variances, which will identify those that are significantly more variable than the bulk of genes. Selecting HVGs on p-values is better than using total - null, as the latter is less precise when null is large.

If test="f", the true variance of each spike-in transcript is assumed to be sampled from a scaled inverse chi-squared distribution. This accounts for any inflated scatter around the trend due to differences in amplification efficiency between transcripts. As a result, the gene-wise variance estimates are should be F-distributed around the trend under the null. The second degrees of freedom is estimated from the scatter around the trend in trendVar using fitFDistRobustly, and needs to be supplied to second.df to calculate an appropriate p-value.

Value

A numeric vector of p-values for all genes.

Author(s)

Aaron Lun

References

Law CW, Chen Y, Shi W and Smyth GK (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts Genome Biol. 15(2), R29.

See Also

trendVar, decomposeVar, fitFDistRobustly

Examples

set.seed(100)
null <- 100/runif(1000, 50, 2000)
df <- 30
total <- null * rchisq(length(null), df=df)/df

# Direct test:
out <- testVar(total, null, df=df)
hist(out)

# Rejecting the null:
alt <- null * 5 * rchisq(length(null), df=df)/df
out <- testVar(alt, null, df=df)
plot(alt[order(out)]-null)

# Focusing on genes that have high absolute increases in variability:
out <- testVar(alt, null+0.5, df=df)
plot(alt[order(out)]-null)

[Package scran version 1.12.0 Index]