ppca {pcaMethods}    R Documentation

Probabilistic PCA Missing Value Estimator

Description

Implementation of probabilistic PCA (PPCA). PPCA makes it possible to perform PCA on incomplete data and may be used for missing value estimation. This script was implemented after the Matlab version provided by Jakob Verbeek (see http://lear.inrialpes.fr/~verbeek/) and the draft "EM Algorithms for PCA and Sensible PCA" written by Sam Roweis. Thanks a lot!

Probabilistic PCA combines an EM approach for PCA with a probabilistic model. The EM approach is based on the assumption that the latent variables as well as the noise are normally distributed.
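The EM iteration can be sketched in a few lines of base R. This is a minimal illustration of the textbook PPCA updates on complete data only (no NA handling), not the code used by ppca; the function name ppca_em_sketch, the fixed iteration count and the simulated data are assumptions made for the example.

```r
# Minimal EM for probabilistic PCA on COMPLETE data (no missing values).
# ppca_em_sketch is a hypothetical helper, not a pcaMethods function.
ppca_em_sketch <- function(X, k = 2, iters = 200, seed = 1) {
  set.seed(seed)
  X <- scale(X, center = TRUE, scale = FALSE)   # mean-center the columns
  n <- nrow(X)
  d <- ncol(X)
  W <- matrix(rnorm(d * k), d, k)               # random initial loadings
  sigma2 <- 1                                   # initial noise variance
  for (i in seq_len(iters)) {
    # E-step: posterior moments of the latent variables
    Minv <- solve(crossprod(W) + sigma2 * diag(k))
    Ez   <- X %*% W %*% Minv                    # n x k, row i is E[z_i]
    Szz  <- n * sigma2 * Minv + crossprod(Ez)   # sum over i of E[z_i z_i^T]
    # M-step: update the loadings, then the noise variance
    W <- crossprod(X, Ez) %*% solve(Szz)
    sigma2 <- (sum(X^2) - 2 * sum((X %*% W) * Ez) +
               sum(Szz * crossprod(W))) / (n * d)
  }
  list(W = W, sigma2 = sigma2)
}

# Usage on simulated data with two latent components plus small noise
set.seed(42)
X <- matrix(rnorm(400), 200, 2) %*% matrix(rnorm(10), 2, 5) +
  0.1 * matrix(rnorm(1000), 200, 5)
fit <- ppca_em_sketch(X, k = 2)
```

The columns of fit$W span the estimated principal subspace; they are not orthonormal, so compare them with prcomp loadings through the subspace they span rather than column by column.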

In standard PCA, data which is far from the training set but close to the principal subspace may have the same reconstruction error as data close to the training set. PPCA defines a likelihood function such that the likelihood for data far from the training set is much lower, even if they are close to the principal subspace. This can improve the estimation accuracy.
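A small numerical illustration of this difference, assuming a two-dimensional PPCA model with one component. The loading vector W, the noise variance sigma2 and the two test points are made-up values chosen for the example, not output of this package.

```r
# Illustrative PPCA model in 2 dimensions with one component.
# W, sigma2 and the two test points are assumed values.
W <- matrix(c(2, 0), nrow = 2)           # principal axis is the x-axis
sigma2 <- 0.1                            # isotropic noise variance
C <- W %*% t(W) + sigma2 * diag(2)       # model covariance W W^T + sigma^2 I

# Squared distance from x to the principal subspace spanned by W
recon_error <- function(x) {
  P <- W %*% solve(crossprod(W)) %*% t(W)   # orthogonal projector onto span(W)
  sum((x - P %*% x)^2)
}

# Log-density of a centered observation x under N(0, C)
log_lik <- function(x) {
  -0.5 * (length(x) * log(2 * pi) + log(det(C)) +
          drop(t(x) %*% solve(C, x)))
}

near <- c(1, 0)     # on the subspace, close to the bulk of the data
far  <- c(50, 0)    # on the subspace, far from the bulk of the data

c(recon_error(near), recon_error(far))   # identical reconstruction errors
c(log_lik(near), log_lik(far))           # the far point is much less likely
```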

A method called kEstimate is provided to estimate the optimal number of components via cross validation. In general, a few components are sufficient for reasonable estimation accuracy. See also the package documentation for further discussion of the kinds of data for which PCA-based missing value estimation is advisable.

Requires MASS

Usage

  ppca(Matrix, nPcs = 2, center = TRUE, completeObs = TRUE, seed = NA, ...)

Arguments

Matrix       matrix – Data containing the variables in columns and observations in rows. The data may contain missing values, denoted as NA.
nPcs         numeric – Number of components to estimate. The accuracy of the missing value estimation depends on the number of components, which should resemble the internal structure of the data.
center       boolean – Mean-center the data if TRUE.
completeObs  boolean – Return the complete observations if TRUE. This is the original data with NA values filled with the estimated values.
seed         numeric – Set the seed for the random number generator. PPCA fills the initial loading matrix with random numbers drawn from a normal distribution, so results may vary slightly between runs. Set the seed for exact reproduction of your results.
...          Reserved for future use. Currently no further parameters are used.

Details

Complexity: Runtime is linear in the number of observations, the number of data dimensions and the number of principal components.

Convergence: In rare cases the algorithm does not seem to converge to proper results when unfavourable initial random numbers are chosen. To avoid this you can set the seed (parameter seed) of the random number generator.
If used for missing value estimation, results may be checked by simply running the algorithm several times with different seeds. If the estimated values show little variance, the algorithm converged well.
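The check described above could be scripted roughly as follows. seed_spread is a hypothetical helper, not part of pcaMethods; it takes any function that maps a data matrix and a seed to a completed matrix. The commented usage assumes that pca() forwards seed to ppca through its ... argument.

```r
# Per-cell standard deviation of imputed values across several seeds.
# seed_spread is a hypothetical helper, not a pcaMethods function.
# 'estimate' must be a function(X, seed) returning X with NAs filled in.
seed_spread <- function(estimate, X, seeds = 1:5) {
  miss <- is.na(X)
  # One row per missing cell, one column per seed
  est <- sapply(seeds, function(s) estimate(X, s)[miss])
  est <- matrix(est, ncol = length(seeds))
  apply(est, 1, sd)
}

# Assumed usage with this package (left commented out, not run here):
# library(pcaMethods)
# data(metaboliteData)
# ppca_fill <- function(X, s) {
#   pca(X, method = "ppca", nPcs = 3, center = TRUE, seed = s)@completeObs
# }
# summary(seed_spread(ppca_fill, metaboliteData))
```

Values close to zero for every missing cell suggest that the runs agree; a few large values point to cells, or seeds, worth inspecting.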

Value

pcaRes Standard PCA result object used by all PCA-based methods of this package. Contains scores, loadings, data mean and more. See pcaRes for details.

Author(s)

Wolfram Stacklies
Max Planck Institut fuer Molekulare Pflanzenphysiologie, Potsdam, Germany
wolfram.stacklies@gmail.com

See Also

bpca, svdImpute, prcomp, nipalsPca, pca, pcaRes.

Examples

## Load the package and a sample metabolite dataset (metaboliteData)
library(pcaMethods)
data(metaboliteData)

## Now remove 10% of the data
rows <- nrow(metaboliteData)
cols <- ncol(metaboliteData)
cond <- matrix(runif(rows * cols),rows,cols) < 0.1
metaboliteData[cond] <- NA

## Perform probabilistic PCA using the 3 largest components
result <- pca(metaboliteData, method="ppca", nPcs=3, center=TRUE)

## Get the estimated principal axes (loadings)
loadings <- result@loadings

## Get the estimated scores
scores <- result@scores

## Get the estimated complete observations
cObs <- result@completeObs

## Now plot the scores
plotPcs(result, scoresLoadings=c(TRUE,FALSE))


[Package pcaMethods version 1.2.3 Index]