samplesize {OCplus}    R Documentation

FDR as a function of sample size

Description

This function tabulates the false discovery rate (FDR) for selecting differentially expressed genes as a function of sample size and cutoff level. The same information can optionally be displayed as a plot.

Usage

samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1,
           paired = FALSE, crit, crit.style = c("top percentage", "cutoff"),
           plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main,
           legend.show = FALSE, grid.show = FALSE, ...)

Arguments

n            sample size (subjects per group)
p0           the proportion of non-differentially expressed genes
sigma        the standard deviation of the log2 expression values
D            assumed average log2 fold change (in units of sigma), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through F1.
F0           the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation sigma, but mixtures of normals can be specified, see Details and Examples.
F1           the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means D and -D and standard deviation sigma; mixtures of normals are again possible, see Details and Examples.
paired       logical value indicating whether this is the independent sample case (default) or the paired sample case.
crit         a vector of cutoff values for selecting differentially expressed genes; the interpretation depends on crit.style.
crit.style   indicates how differentially expressed genes are selected: either as a fixed percentage of the largest absolute t-statistics ("top percentage", the default) or by a fixed cutoff level for the absolute value of the t-statistic ("cutoff").
plot         logical value indicating whether to draw the plot
local.show   logical value indicating whether to show the local instead of the global false discovery rate (default: global).
nplot        number of points at which the curves are evaluated
ylim         the limits of the vertical axis
main         the main title of the plot
legend.show  logical value indicating whether to show a legend for the types of gene selection in the plot
grid.show    logical value indicating whether to draw grid lines in the plot at the sample sizes n that are tabulated
...          the usual graphical parameters, passed to plot

Details

This function plots the FDR as a function of the sample size when comparing the expression of multiple genes between two groups of subjects. The underlying model assumes that a proportion p0 of the genes is not differentially expressed (regulated) between the groups and that the remaining proportion 1-p0 is. The log-transformed expression values of regulated and non-regulated genes are assumed to be generated by mixtures of normal distributions; these mixtures can be specified through the parameters F0, F1 or D, and sigma. See TOC for details on the model and the specification of the mixtures. By default, the null distribution of the log expression values is a normal centered at zero, and the alternative is an equal mixture of normals centered at +D and -D.
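To make the default model concrete, the following base-R sketch simulates data from it and computes a pooled-variance t-statistic per gene. It does not use OCplus and is not how samplesize obtains its results; the number of genes m, the seed, and all variable names are assumptions chosen for illustration.

```r
# Simulation sketch of the default model (assumptions: m genes, unpaired
# design, n subjects per group; none of these names come from OCplus).
set.seed(1)
m     <- 10000   # hypothetical number of genes
n     <- 10      # subjects per group
p0    <- 0.99    # proportion of non-regulated genes
D     <- 1       # effect size in units of sigma
sigma <- 1

# Each gene is regulated with probability 1 - p0; regulated genes are
# shifted by +D*sigma or -D*sigma with equal probability.
regulated <- runif(m) >= p0
delta     <- ifelse(regulated, sample(c(-D, D), m, replace = TRUE), 0) * sigma

# Rows are genes, columns are subjects; delta is recycled down the rows.
x <- matrix(rnorm(m * n, mean = 0,     sd = sigma), nrow = m)  # group 1
y <- matrix(rnorm(m * n, mean = delta, sd = sigma), nrow = m)  # group 2

# Row-wise two-sample t-statistic with pooled variance (equal group sizes).
mx <- rowMeans(x)
my <- rowMeans(y)
vx <- rowSums((x - mx)^2) / (n - 1)
vy <- rowSums((y - my)^2) / (n - 1)
tstat <- (my - mx) / sqrt((vx + vy) / 2 * (2 / n))

# Regulated genes should show clearly larger absolute t-statistics.
tapply(abs(tstat), regulated, median)
```

With only about 1 percent of the genes regulated, most large t-statistics at small n still come from the null genes, which is why the FDR depends so strongly on the sample size.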

The list of nominally differentially expressed genes can be selected in two ways:

  • all genes with an absolute t-statistic larger than the specified critical cutoff value (cutoff),
  • all genes whose absolute t-statistics fall within the specified critical top percentage of all absolute t-statistics (top percentage).

Each critical value produces its own curve, labeled by that value. Only a single value can be specified for the proportion of non-regulated genes p0 and for the standard deviation sigma.
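The two selection rules can be related to the global FDR with a short calculation. The sketch below assumes the default model (unpaired design, equal group sizes, symmetric alternative at +D and -D) and uses the central and non-central t-distributions from base R. The helper names reject_prob, fdr_cutoff and fdr_top are hypothetical, and the numbers need not agree exactly with the output of samplesize.

```r
# Sketch only, not the OCplus implementation.

# Probability that |T| exceeds `crit` when the true mean difference is
# D standard deviations (D = 0 gives the null distribution).
reject_prob <- function(crit, n, D = 0) {
  df <- 2 * n - 2                      # two-sample t-test, n per group
  if (D == 0) return(2 * pt(-crit, df))
  ncp <- D * sqrt(n / 2)               # non-centrality parameter
  pt(-crit, df, ncp) + pt(crit, df, ncp, lower.tail = FALSE)
}

# Global FDR when selecting all genes with |t| > crit.
fdr_cutoff <- function(crit, n, p0 = 0.99, D = 1) {
  a0 <- reject_prob(crit, n, 0)        # rejection rate among null genes
  a1 <- reject_prob(crit, n, D)        # rejection rate among regulated genes
  p0 * a0 / (p0 * a0 + (1 - p0) * a1)
}

# Global FDR when selecting the proportion `top` of largest |t|:
# first solve for the cutoff that yields this overall rejection rate.
fdr_top <- function(top, n, p0 = 0.99, D = 1) {
  excess <- function(crit) {
    p0 * reject_prob(crit, n, 0) + (1 - p0) * reject_prob(crit, n, D) - top
  }
  crit <- uniroot(excess, interval = c(0, 100), tol = 1e-8)$root
  fdr_cutoff(crit, n, p0, D)
}

n_grid <- seq(5, 50, by = 5)
fdr_by_cutoff <- sapply(n_grid, function(k) fdr_cutoff(3, k))
fdr_by_top    <- sapply(n_grid, function(k) fdr_top(0.01, k))
cbind(n = n_grid, cutoff_3 = fdr_by_cutoff, top_1pct = fdr_by_top)
```

Because the alternative is symmetric, the rejection probability is the same for +D and -D, so a single non-central term covers both mixture components.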

Value

A matrix with rows corresponding to elements of n and columns corresponding to the specified critical values is returned. The matrix has the attribute param that contains the specified arguments, see Examples.

Note

Both the curve labels and the legend may be squashed if the plotting device is too small. Increasing the size of the device and re-plotting should improve readability.

Author(s)

Y. Pawitan and A. Ploner

References

Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. Bioinformatics, 21, 3017-3024.

Jung SH (2005) Sample size for FDR-control in microarray data analysis. Bioinformatics, 21, 3097-3104.

See Also

FDR, TOC, EOC

Examples

    # Default assumes a proportion of 0.01 regulated genes equally split
    # between two-fold up- and down-regulated
    # We select the top 1, 2, 3 percent absolute largest t-statistics
    samplesize(crit=c(0.03,0.02, 0.01))
    
    # Same model, but using a hard cutoff for the t-statistics
    samplesize(crit=2:4, crit.style="cutoff")
    
    # Paired test of the same size has slightly better FDR (as expected)
    samplesize(paired=TRUE)
    
    # Compare the effect of p0 and effect size
    par(mfrow=c(2,2))
    samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
    samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
    samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
    samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)
    
    # An asymmetric alternative distribution: 20 percent of the regulated genes
    # are expected to be (at least) four-fold up-regulated
    # NB, no graphical output
    ret <- samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
    ret
    # Look at the parameters
    attr(ret, "param")
    
    # A wide null distribution that allows genes with a small effect to be disregarded
    # Here: |log2 fold change| < 0.25, i.e. fold change of less than 19 percent
    samplesize(F0=list(D=c(-0.25,0,0.25)), grid.show=TRUE)
    
    # This is close to Example 3 in Jung's paper (see References):
    # p0=0.99 and sensitivity=0.6, so we want a rejection rate of
    # around 0.006 from the top list.
    # Here we require around 40 arrays/group, compared to
    # around 37 in Jung's paper, most likely because we use
    # the t-distribution instead of the normal. Jung's alternative
    # is only one-sided, so the exact correspondence is:
    samplesize(p0=0.99, crit.style="top percentage", crit=0.006, F1=list(D=1, p=1), grid.show=TRUE)
    abline(h=0.01)

    # The result is very close to that for the symmetric alternative:
    samplesize(p0=0.99, crit=0.006, D=1, grid.show=TRUE, ylim=c(0,0.9))
    
    

[Package OCplus version 1.10.0 Index]