KEstimate {pcaMethods}    R Documentation

Estimate best number of Components for missing value estimation

Description

Perform cross validation to estimate the optimal number of components for missing value estimation. Cross validation is done on the subset containing only complete observations, because including incomplete observations may distort the results. The underlying assumption is that genes that are highly correlated in one region (here the non-missing observations) are also correlated in another (here the missing observations). This also implies that the complete subset must be large enough to be representative. For each incomplete gene, the available values are divided into a user-defined number of cv-segments. The segments have equal size, and their members are drawn at random from a uniform distribution. Together, the segments cover all non-missing values of the gene. PPCA, BPCA, SVDimpute and Nipals PCA may be used for imputation.
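The segmentation step described above can be illustrated with a minimal sketch. The helper `makeSegments` is hypothetical and not part of pcaMethods; it only assumes that the available (non-missing) positions of one gene are shuffled uniformly at random and then dealt into `segs` groups whose sizes differ by at most one.

```r
# Hypothetical helper (not part of pcaMethods): split the non-missing
# positions of one gene into `segs` random segments of near-equal size.
makeSegments <- function(gene, segs = 3) {
  observed <- which(!is.na(gene))                      # indices of available values
  shuffled <- observed[sample.int(length(observed))]   # uniform random order
  # Deal the shuffled indices out round-robin, so that segment sizes
  # differ by at most one and every available value is used exactly once.
  split(shuffled, rep_len(seq_len(segs), length(shuffled)))
}

set.seed(1)
gene <- c(2.1, NA, 3.4, 1.8, NA, 2.9, 3.1, 2.2)
segments <- makeSegments(gene, segs = 3)
```

Each element of `segments` holds the positions that would be left out in one cross-validation fold; the actual package code may construct its segments differently.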
The whole cross validation is repeated nruncv times. The NRMSEP (see Feten et al., 2005) is used as the error measure. It normalises the RMSD between the original data and the estimate by the gene-wise variance, because a higher variance leads to a higher estimation error.
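As a rough illustration of this kind of normalisation, the sketch below divides the gene-wise mean squared error by the gene-wise variance, averages over genes and takes the square root. This is an assumed form for illustration only; the function name `nrmsepSketch` is hypothetical, and the exact definition in Feten et al. and in the package may differ.

```r
# Illustrative NRMSEP-style error (assumed form, not the package's code).
# `original` and `estimate` are numeric matrices with genes in columns.
nrmsepSketch <- function(original, estimate) {
  mse <- colMeans((original - estimate)^2)   # gene-wise mean squared error
  geneVar <- apply(original, 2, var)         # gene-wise variance
  sqrt(mean(mse / geneVar))                  # normalise, average, take root
}

original <- cbind(g1 = c(1, 2, 3, 4), g2 = c(10, 20, 30, 40))
estimate <- original + 0.5                   # same absolute error for both genes
err <- nrmsepSketch(original, estimate)
```

In this toy input both genes carry the same absolute error, but the high-variance gene `g2` contributes far less to the normalised error than `g1`.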

Usage

kEstimate(data, method = "ppca", maxPcs = 3, segs = 3, nruncv = 10,
          allGenes = FALSE, verbose = interactive(), random = FALSE)

Arguments

data matrix – numeric matrix containing observations in rows and genes in columns
method character – One of ppca | bpca | svdImpute | nipals
maxPcs numeric – maximum number of principal components to use for cross validation. The NRMSEP is calculated for 1:maxPcs components.
segs numeric – number of segments for cross validation
nruncv numeric – number of times the whole cross validation is repeated
allGenes boolean – If TRUE, the NRMSEP is calculated for all genes; if FALSE, only the incomplete ones are included. Setting this to TRUE may be useful for comparing several methods on a complete data set.
verbose boolean – If TRUE, the NRMSEP and the variance are printed to the console in each iteration.
random boolean – If TRUE, impute normally distributed random values with the same mean and standard deviation as the original data. This is intended for comparison only.
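The idea behind the random option can be sketched as follows. The helper `randomImpute` is hypothetical and not the package's implementation; it assumes that the mean and standard deviation are taken over all observed values of the matrix, which the text above does not specify.

```r
# Hypothetical baseline imputation (not the package's implementation):
# fill missing entries with normal random values that match the mean and
# standard deviation of the observed data (assumed: computed matrix-wide).
randomImpute <- function(data) {
  missing <- is.na(data)
  mu <- mean(data, na.rm = TRUE)
  sigma <- sd(as.vector(data), na.rm = TRUE)
  data[missing] <- rnorm(sum(missing), mean = mu, sd = sigma)
  data
}

set.seed(42)
m <- matrix(c(1, NA, 3, 4, NA, 6), nrow = 3)
filled <- randomImpute(m)
```

A baseline like this gives a reference error level against which the NRMSEP of a real imputation method can be judged.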

Details

Run time may be very high on large data sets, especially when used with methods like BPCA or Nipals PCA, which are already quite slow. The estimation method is called (g_miss * segs * nruncv) times, where g_miss is the number of genes with missing values.
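To get a feeling for the cost before starting a long run, the call count can be computed directly from the data. The helper `countEstimationCalls` below is a hypothetical convenience function, not part of pcaMethods; it simply applies the formula above to a matrix with genes in columns.

```r
# Count how often the estimation method would be called, following the
# formula g_miss * segs * nruncv.
countEstimationCalls <- function(data, segs = 3, nruncv = 10) {
  gMiss <- sum(colSums(is.na(data)) > 0)   # genes (columns) with missing values
  gMiss * segs * nruncv
}

# Toy matrix, filled column by column: columns 1 and 3 contain an NA
data <- matrix(c(1, NA, 3,
                 4, 5, 6,
                 7, 8, NA), nrow = 3)
calls <- countEstimationCalls(data, segs = 3, nruncv = 10)
```

Here two of the three genes have missing values, so the estimation method would be called 2 * 3 * 10 = 60 times.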

Value

list Returns a list with the elements:
  • mink - number of PCs for which the minimal average NRMSEP was obtained
  • nrmsep - a matrix of dimension (nruncv, maxPcs). Each row holds the NRMSEP values from one repeat of the cross validation; each column corresponds to one number of components.
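The relation between the two elements can be sketched as follows, assuming that the average is taken over the repeats (rows) for each number of components (columns). The NRMSEP values below are made up for illustration and are not results from any data set.

```r
# Made-up NRMSEP values: 3 cross-validation repeats (rows) for
# 1, 2 and 3 components (columns).
nrmsep <- matrix(c(0.90, 0.60, 0.70,
                   0.85, 0.55, 0.75,
                   0.95, 0.65, 0.72),
                 nrow = 3, byrow = TRUE)

avgNrmsep <- colMeans(nrmsep)   # average NRMSEP per number of components
mink <- which.min(avgNrmsep)    # number of components with the lowest average
```

With these toy values the second column has the lowest average, so `mink` is 2.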

Author(s)

Wolfram Stacklies
Max Planck Institut fuer Molekulare Pflanzenphysiologie, Potsdam, Germany
wolfram.stacklies@gmail.com

See Also

bpca, svdImpute, prcomp, nipalsPca, pca.

Examples

## Load the package and a sample metabolite dataset (metaboliteData)
library(pcaMethods)
data(metaboliteData)

# Remove roughly 10% of the data at random
set.seed(1)  # optional, makes the random removal reproducible
rows <- nrow(metaboliteData)
cols <- ncol(metaboliteData)
cond <- matrix(runif(rows * cols), rows, cols) < 0.1
metaboliteData[cond] <- NA

# Do cross validation with ppca for 1:3 components
nrmsep <- kEstimate(metaboliteData, method = "ppca", maxPcs = 3, nruncv = 1)

# Plot the result
barplot(drop(nrmsep$nrmsep), xlab = "Components", ylab = "NRMSEP (1 iteration)")

# The best k value is:
print(nrmsep$mink)

[Package pcaMethods version 1.2.3 Index]