scoreSpectrum {transite}R Documentation

Calculates spectrum scores and creates spectrum plots

Description

Spectrum scores are a means to evaluate if a spectrum has a meaningful (i.e., biologically relevant) or a random pattern.

Usage

scoreSpectrum(x, p.value = array(1, length(x)),
  x.label = "log enrichment", midpoint = 0, max.model.degree = 3,
  max.cs.permutations = 1e+07, min.cs.permutations = 5000, e = 5)

Arguments

x

vector of values (e.g., enrichment values, normalized RBP scores) per bin

p.value

vector of p-values (e.g., significance of enrichment values) per bin

x.label

label of values (e.g., "enrichment value")

midpoint

for enrichment values the midpoint should be 1, for log enrichment values 0)

max.model.degree

maximum degree of polynomial

max.cs.permutations

maximum number of permutations performed in Monte Carlo test for consistency score

min.cs.permutations

minimum number of permutations performed in Monte Carlo test for consistency score

e

integer-valued stop criterion for consistency score Monte Carlo test: aborting permutation process after observing e random consistency values with more extreme values than the actual consistency value

Details

One way to quantify the meaningfulness of a spectrum is to calculate the deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum. The lower the score, the more consistent the trend in the spectrum plot. Formally, the local consistency score x_c is defined as

x_c = \frac{1}{n} ∑_{i = 1}^{n - 2}{|\frac{s_i + s_{i + 2}}{2} - s_{i + 1}|}.

In order to obtain an estimate of the significance of a particular score x_c', Monte Carlo sampling is performed by randomly permuting the coordinates of the scores vector s and recomputing x_c. The probability estimate \hat{p} is given by the lower tail version of the cumulative distribution function

\hat{Pr}(T(x)) = \frac{∑_{i = 1}^n 1(T(y_i) ≤ T(x)) + 1}{n + 1},

where 1 is the indicator function, n is the sample size, i.e., the number of performed permutations, and T equals x_c in the above equation.

An alternative approach to assess the consistency of a spectrum plot is via polynomial regression. In a first step, polynomial regression models of various degrees are fitted to the data, i.e., the dependent variable s (vector of scores), and orthogonal polynomials of the independent variable b (vector of bin numbers). Secondly, the model that reflects best the true nature of the data is selected by means of the F-test. And lastly, the adjusted R^2 and the sum of squared residuals are calculated to indicate how well the model fits the data. These statistics are used as scores to rank the spectrum plots. In general, the polynomial regression equation is

y_i = β_0 + β_1 x_i + β_2 x_i^2 + \cdots + β_m x_i^m + ε_i,

where m is the degree of the polynomial (usually m ≤ 5), and ε_i is the error term. The dependent variable y is the vector of scores s and x to x^m are the orthogonal polynomials of the vector of bin numbers b. Orthogonal polynomials are used in order to reduce the correlation between the different powers of b and therefore avoid multicollinearity in the model. This is important, because correlated predictors lead to unstable coefficients, i.e., the coefficients of a polynomial regression model of degree m can be greatly different from a model of degree m + 1.

The orthogonal polynomials of vector b are obtained by centering (subtracting the mean), QR decomposition, and subsequent normalization. Given the dependent variable y and the orthogonal polynomials of b x to x^m, the model coefficients β are chosen in a way to minimize the deviance between the actual and the predicted values characterized by

M(x) = β_0 + β_1 x + β_2 x^2 + \cdots + β_m x^m

M = argmin_{M}{(∑_{i = 1}^n{L(y_i, M(x_i))})},

where L(actual value, predicted value) denotes the loss function.

Ordinary least squares is used as estimation method for the model coefficients β. The loss function of ordinary least squares is the sum of squared residuals (SSR) and is defined as follows SSR(y, \hat{y}) = ∑_{i = 1}^n{(y_i - \hat{y}_i)^2}, where y are the observed data and \hat{y} the model predictions.

Thus the ordinary least squares estimate of the coefficients \hat{β} (including the intercept \hat{β}_0) of the model M is defined by

\hat{β} = argmin_{β}{(∑_{i = 1}^n{(y_i - β_0 - ∑_{j = 1}^m{β_j x_i^j})^2})}.

After polynomial models of various degrees have been fitted to the data, the F-test is used to select the model that best fits the data. Since the SSR monotonically decreases with increasing model degree (model complexity), the relative decrease of the SSR between the simpler model and the more complex model must outweigh the increase in model complexity between the two models. The F-test gives the probability that a relative decrease of the SSR between the simpler and the more complex model given their respective degrees of freedom is due to chance. A low p-value indicates that the additional degrees of freedom of the more complex model lead to a better fit of the data than would be expected after a mere increase of degrees of freedom.

The F-statistic is calculated as follows

F = \frac{(SSR_1 - SSR_2) / (p_2 - p_1)}{SSR_2 / (n - p_2)},

where SSR_i is the sum of squared residuals and p_i is the number of parameters of model i. The number of data points, i.e., bins, is denoted as n. F is distributed according to the F-distribution with df_1 = p_2 - p_1 and df_2 = n - p_2.

Value

A list object of class SpectrumScore with the following components:

adj.r.squared adjusted R^2 of polynomial model
degree maximum degree of polynomial
residuals residuals of polynomial model
slope coefficient of the linear term of the polynomial model (spectrum "direction")
f.statistic statistic of the F-test
f.statistic.p.value p-value of F-test
consistency.score normalized sum of deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum
consistency.score.p.value obtained by Monte Carlo sampling (randomly permuting the coordinates of the scores vector)
consistency.score.n number of permutations
plot

See Also

Other SPMA functions: runKmerSPMA, runMatrixSPMA, spectrumClassifier, subdivideData

Examples

# random spectrum
scoreSpectrum(runif(n = 40, min = -1, max = 1), max.model.degree = 1)

# non-random linear spectrum
signal <- seq(-1, 0.99, 2 / 40)
noise <- rnorm(n = 40, mean = 0, sd = 0.5)
scoreSpectrum(signal + noise, max.model.degree = 1,
  max.cs.permutations = 100000)

# non-random quadratic spectrum
signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5
noise <- rnorm(n = 40, mean = 0, sd = 0.2)
scoreSpectrum(signal + noise, max.model.degree = 2,
  max.cs.permutations = 100000)

[Package transite version 1.2.0 Index]