ClassifyR provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages: data transformation, feature selection, classifier training, and prediction. The stages can be run in any order that is sensible.
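Each stage can be provided with custom functions that follow some rules about parameters. The driver function runTests implements several varieties of cross-validation, including leave-k-out cross-validation and repeated sample permutation followed by k-fold cross-validation, both of which appear later in this guide.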
runTests can use parallel processing capabilities in R to speed up cross-validations when many CPUs are available. The output of runTests is a ClassifyResult object which can be directly used by the performance evaluation functions. The process of classification is summarised by a flowchart.
Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features where the means are different. In addition to changes in means, ClassifyR also allows for classification using differential deviation (changes in scale) and differential distribution (changes in location and/or scale).
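The distinction between a change in means and a change in scale can be illustrated with a small base R sketch. The two features and their values below are made up for illustration and are not part of ClassifyR or the melanoma dataset; describeChange is a hypothetical helper.

```r
# Two hypothetical features measured in classes A and B (made-up values).
# "shifted" differs in location (mean); "spread" differs only in scale.
shifted <- list(A = c(1, 2, 3, 4, 5), B = c(4, 5, 6, 7, 8))
spread  <- list(A = c(4, 5, 6, 7, 8), B = c(2, 4, 6, 8, 10))

# Hypothetical helper: summarise how a feature changes from class A to class B.
describeChange <- function(feature) {
  c(meanDifference = mean(feature$B) - mean(feature$A),
    sdRatio = sd(feature$B) / sd(feature$A))
}

changes <- rbind(shifted = describeChange(shifted),
                 spread = describeChange(spread))
changes
```

A method that ranks features only by difference in means would give no weight to the second feature, although its standard deviation doubles between the classes.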
In the following sections, some of the most useful functions provided in ClassifyR will be demonstrated. However, a user can provide any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the last section of this guide “Rules for New Functions” for a description of these.
There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.
Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Intel DAAL Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Sample-wise Error Heatmap | Direct Support for MultiAssayExperiment Input |
---|---|---|---|---|---|---|---|---|
ClassifyR | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
caret | Yes | Yes | Yes | No | No | No | No | No |
MLInterfaces | Yes | No | No | No | No | No | No | No |
MCRestimate | Yes | No | Yes | No | No | No | No | No |
CMA | No | No | Yes | No | No | No | No | No |
To demonstrate some key features of ClassifyR, a small dataset consisting of 35 features and 20 blood samples from stage 4 melanoma patients will be used to quickly obtain results. The 20 patients, who were treated with either nivolumab or pembrolizumab, are divided into 11 responders and 9 non-responders. The journal article corresponding to the dataset was published in 2018 and is titled High-dimensional Single-cell Analysis Predicts Response to Anti-PD-1 Immunotherapy. Although thousands of cells were measured for each sample, those measurements have been summarised to 1 value per protein by taking the median of all cells’ measurements for a particular protein.
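The per-protein median summarisation described above can be sketched in base R as follows. The cell-level values are simulated, and the object names cells and sampleSummary are illustrative assumptions rather than objects provided by the package.

```r
# Simulated single-cell measurements for one sample:
# rows are cells, columns are proteins (values are random, for illustration only).
set.seed(1)
cells <- matrix(rexp(1000 * 3), nrow = 1000,
                dimnames = list(NULL, paste("Protein", 1:3)))

# Summarise the sample to one value per protein by taking the median over cells.
sampleSummary <- apply(cells, 2, median)
sampleSummary
```

Repeating this for every sample and binding the results column-wise would give a proteins-by-samples matrix of the same shape as measurements.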
library(ClassifyR)
data(melanomaResponse) # Replace with real dataset soon. This is rnorm.
measurements[1:5, 1:5]
## Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
## Protein 1 -1.5498020 -1.0009375 -0.2256400 0.07468511 -0.1543675
## Protein 2 0.1993516 -0.5029450 2.1664501 0.82270289 -0.9588970
## Protein 3 -0.1134202 -0.8706476 -1.1501180 1.15974980 1.1442343
## Protein 4 0.5906970 1.4497073 0.9202853 1.72510974 0.9495312
## Protein 5 -0.3403030 -1.4498872 0.1747108 -0.75318423 0.7068944
head(classes)
## Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6
## Yes Yes Yes Yes Yes Yes
## Levels: No Yes
The numeric matrix variable measurements stores the normalised values of the protein abundances for each sample and the factor vector classes identifies which class the samples belong to.
For more complex datasets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a MultiAssayExperiment is recommended for data storage and supported by ClassifyR’s methods.
runTests is the main function in ClassifyR which handles the sample splitting and parallelisation, if used, of cross-validation. To begin with, a simple classifier will be specified. It uses a limma moderated t-test ranking for feature selection and DLDA for classification. The limmaSelection function also uses DLDA for estimating a resubstitution error rate for a number of top-f ranked features, as a heuristic for picking f features from the feature ranking which are used in the training and prediction stages of classification. This classifier relies on differences in means between classes.
resubstituteParams <- ResubstituteParams(nFeatures = 2:10,
performanceType = "balanced error",
better = "lower")
DMresults <- runTests(measurements, classes, "Melanoma", "Different Means",
validation = "leaveOut", leave = 1,
params = list(SelectParams(limmaSelection, "Moderated t Statistic",
resubstituteParams = resubstituteParams),
TrainParams(DLDAtrainInterface),
PredictParams(DLDApredictInterface,
getClasses = function(result)
result[["class"]])),
verbose = 1)
DMresults
## An object of class 'ClassifyResult'.
## Dataset Name: Melanoma.
## Classification Name: Different Means.
## Feature Selection Name: Moderated t Statistic.
## Features: List of length 20 of feature indices.
## Validation: Leave 1 Out.
## Predictions: List of data frames of length 1.
## Performance Measures: None calculated yet.
Here, leave-one-out cross-validation (LOOCV) is specified by the values of validation and leave. The feature selection and classifier algorithms are specified by the list of parameter objects given to params. For computers with more than 1 CPU, the number of cores to use can be given to runTests by using the argument parallelParams. Although not used in this example, the parameter seed is important to set for result reproducibility when doing a cross-validation such as 100 sample permutations and 5 folds, because such a scheme employs randomisation to partition the samples into folds. For more details about runTests and the parameter classes used by it, consult their help pages.
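Why a seed matters for permutation-and-fold schemes but not for LOOCV can be seen in a minimal base R sketch. This is not how runTests partitions samples internally; makeFolds is a hypothetical helper written only to show the idea.

```r
nSamples <- 20
nFolds <- 5

# Hypothetical helper: permute the samples once and deal them into folds.
makeFolds <- function(seed) {
  set.seed(seed)
  permuted <- sample(nSamples)                           # random order of samples
  split(permuted, rep(1:nFolds, length.out = nSamples))  # 5 folds of 4 samples
}

foldsA <- makeFolds(2018)
foldsB <- makeFolds(2018)
identical(foldsA, foldsB)  # the same seed gives the same partition

# Leave-one-out involves no randomness: each sample is the test set exactly once.
looTestSets <- as.list(seq_len(nSamples))
```

Without a fixed seed, two runs would usually partition the samples differently and so could report slightly different performance values.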
The most frequently selected protein can be identified using the distribution function and its relative abundance values for all samples can be displayed visually by plotFeatureClasses.
selectionPercentages <- distribution(DMresults, plot = FALSE)
mostChosen <- names(sort(selectionPercentages, decreasing = TRUE)[1])
bestProteinPlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1,
xAxisLabel = "Standardised Expression")
The means of the protein levels are substantially different between the responder and non-responder patients. plotFeatureClasses can also plot categorical data, such as may be found in a clinical data table, as a bar chart.
Classification error rates, as well as many other prediction performance measures, can be calculated with calcCVperformance. Next, the balanced error rate is calculated considering all samples, each of which was in the test set once. The balanced error rate is defined as the average of the classification errors of each class.
See the documentation of calcCVperformance for a list of performance metrics which may be calculated.
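The definition of the balanced error rate given above can be worked through by hand in base R. The class labels and predictions below are made up for illustration and do not come from the melanoma results.

```r
# Made-up true classes and predictions for 10 samples (6 "Yes", 4 "No").
actual <- factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                 levels = c("No", "Yes"))
predicted <- factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No"),
                    levels = c("No", "Yes"))

# Error rate within each class: 2 of 4 "No" and 1 of 6 "Yes" samples are wrong.
classErrors <- tapply(predicted != actual, actual, mean)

# Balanced error rate: the unweighted average of the per-class error rates.
balancedError <- mean(classErrors)

# Ordinary error rate, for comparison: 3 of 10 samples are wrong.
overallError <- mean(predicted != actual)
```

Here the balanced error rate is 1/3 while the ordinary error rate is 0.3. The two differ because the classes have unequal sizes and the smaller class is predicted less accurately.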
DMresults <- calcCVperformance(DMresults, "balanced error")
DMresults
## An object of class 'ClassifyResult'.
## Dataset Name: Melanoma.
## Classification Name: Different Means.
## Feature Selection Name: Moderated t Statistic.
## Features: List of length 20 of feature indices.
## Validation: Leave 1 Out.
## Predictions: List of data frames of length 1.
## Performance Measures: Balanced Error Rate.
performance(DMresults)
## $`Balanced Error Rate`
## [1] 0.5050505
The balanced error rate of about 0.51 is close to the 0.5 expected from random guessing between two classes, so this classifier performs no better than chance on these data. If a cross-validation scheme such as 100 permutations, 5 folds was used instead, there would be 100 data sets to cross-validate and numerous predictions made of each sample, resulting in 100 values for each performance metric. Then, the function performancePlot could be used to visualise the distribution of the metric and even compare different classifiers. See the documentation of performancePlot for more details. If only a vector of predictions and a vector of actual classes is available, such as from an old study which did not use ClassifyR, then calcExternalPerformance can be used on a pair of factor vectors which have the same length.
The samplesMetricMap function allows the visual comparison of sample-wise error rate or accuracy measures from different ClassifyResult objects. Firstly, a classifier will be run that uses Kullback-Leibler divergence ranking and resubstitution error as a feature selection heuristic and a naive Bayes classifier for classification. This classification will use features that have either a change in location or in scale between classes.
DDresults <- runTests(measurements, classes, "Melanoma", "Differential Distribution",
validation = "leaveOut", leave = 1,
params = list(SelectParams(KullbackLeiblerSelection,
resubstituteParams = resubstituteParams),
TrainParams(naiveBayesKernel),
PredictParams(predictor = NULL, getClasses = function(result) result,
weighted = "weighted", weight = "height difference",
returnType = "both")),
verbose = 1)
DDresults
## An object of class 'ClassifyResult'.
## Dataset Name: Melanoma.
## Classification Name: Differential Distribution.
## Feature Selection Name: Kullback-Leibler Divergence.
## Features: List of length 20 of feature indices.
## Validation: Leave 1 Out.
## Predictions: List of data frames of length 1.
## Performance Measures: None calculated yet.
The naive Bayes kernel classifier has many options specifying how the distances between class densities are used. For more information, consult the documentation of the naiveBayesKernel function.
Now, the classification error for each sample is also calculated for both the differential means and differential distribution classifiers and both ClassifyResult objects generated so far are plotted with samplesMetricMap.
library(grid)
DMresults <- calcCVperformance(DMresults, "sample error")
DDresults <- calcCVperformance(DDresults, "sample error")
resultsList <- list(Abundance = DMresults, Distribution = DDresults)
errorPlot <- samplesMetricMap(resultsList, metric = "error", plot = FALSE)
grid.draw(errorPlot)
The benefit of this plot is that it allows the easy identification of samples which are hard to classify and could be explained by considering additional information about them.
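A sample-wise error rate can be sketched in base R as the fraction of a sample's predictions that are wrong. With leave-one-out cross-validation each sample is predicted once, so the value is 0 or 1. With repeated schemes it can take intermediate values, as in this made-up example of three samples predicted in four repeats. The sample names and predictions are illustrative assumptions, not package output.

```r
# Made-up true classes of three samples.
actual <- c(S1 = "Yes", S2 = "No", S3 = "Yes")

# Made-up predictions from four repeats of cross-validation.
predictions <- data.frame(
  sample = rep(c("S1", "S2", "S3"), times = 4),
  predicted = c("Yes", "No",  "No",    # repeat 1
                "Yes", "Yes", "No",    # repeat 2
                "Yes", "No",  "Yes",   # repeat 3
                "Yes", "No",  "No"),   # repeat 4
  stringsAsFactors = FALSE
)

# Fraction of wrong predictions for each sample.
wrong <- predictions$predicted != actual[predictions$sample]
sampleError <- tapply(wrong, predictions$sample, mean)
sampleError
```

In this example S3 is misclassified in three of four repeats, which is the kind of hard-to-classify sample that such a heatmap is meant to highlight.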
The features being ranked and selected in the feature selection stage can be compared within and between classifiers by the plotting functions rankingPlot and selectionPlot. Consider the task of visually representing how consistent the feature rankings of the top 10 different features were for the differential distribution classifier for all 20 cross-validations.
rankOverlaps <- rankingPlot(list(DDresults), topRanked = 1:10, xLabelPositions = 1:10,
lineColourVariable = "None", pointTypeVariable = "None",
columnVariable = "None", plot = FALSE)
rankOverlaps
The top-ranked features are highly consistent between all pairs of the 20 cross-validations, as is intuitively expected if only one sample changes between each cross-validation.
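One plausible way to quantify such consistency is the average percentage overlap of the top-ranked features over all pairs of cross-validations. The base R sketch below uses made-up rankings and a hypothetical helper, topOverlap. It is an illustration of the idea and is not claimed to be the exact calculation done by rankingPlot.

```r
# Made-up feature rankings from three cross-validation iterations.
rankings <- list(cv1 = c("P1", "P2", "P3", "P4", "P5"),
                 cv2 = c("P2", "P1", "P3", "P5", "P4"),
                 cv3 = c("P1", "P3", "P2", "P4", "P6"))

# Hypothetical helper: mean percentage overlap of the top features
# over every pair of iterations.
topOverlap <- function(rankings, top) {
  pairs <- combn(length(rankings), 2)  # each column is one pair of iterations
  overlaps <- apply(pairs, 2, function(pair) {
    first <- rankings[[pair[1]]][1:top]
    second <- rankings[[pair[2]]][1:top]
    length(intersect(first, second)) / top * 100
  })
  mean(overlaps)
}

overlapByTop <- sapply(1:5, function(top) topOverlap(rankings, top))
overlapByTop
```

With 20 leave-one-out iterations there are 190 pairs to compare for each value of top, which is why the number of comparisons grows quickly for larger schemes.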
For a large cross-validation scheme, such as leave-2-out cross-validation, or when the results contain many classifications, there are many feature set comparisons to make. Note that rankingPlot and selectionPlot have a parallelParams option which allows the calculation of feature set overlaps to be done on multiple processors.
Sometimes, cross-validation is unnecessary. This happens when studies have large sample sizes and are well-designed such that a large number of samples is prespecified to form a test set. The classifier is only trained on the training sample set, and makes predictions only on the test sample set. This can be achieved by using the function runTest directly. See its documentation for required inputs.
Once a cross-validated classification is complete, the usefulness of the features selected may be explored in another dataset. previousSelection is a function which takes an existing ClassifyResult object and returns the features selected at the equivalent iteration which is currently being processed. This is necessary, because the models trained on one dataset are not directly transferrable to a new dataset; the classifier training (e.g. choosing thresholds, fitting model coefficients) is redone.
Some classifiers can be set to output scores or probabilities representing how likely a sample is to be from one of the classes, rather than class labels. This enables different score thresholds to be tried, to generate pairs of false positive and false negative rates. The naive Bayes classifier used previously had its returnType parameter set to “both”, so class predictions and scores were both stored in the classification result. In this case, a data.frame with two columns (named “class” and “score”) is returned by the classifier to the framework. Setting returnType to “score” is also sufficient to generate a ROC plot. Many existing classifiers in other R packages also have an option that allows a score or probability to be calculated.
ROCcurves <- ROCplot(list(DDresults), fontSizes = c(24, 12, 12, 12, 12))
This ROC plot shows the classifiability of the melanoma dataset is high. Other included functions which can output scores are fisherDiscriminant, DLDApredictInterface, and SVMpredictInterface.
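How trying different score thresholds yields the points of a ROC curve can be sketched in base R. The classes and scores are made up, and the code is not taken from ROCplot. The true positive rate used here is simply one minus the false negative rate.

```r
# Made-up true classes and classifier scores (higher means more likely "Yes").
actual <- factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"),
                 levels = c("No", "Yes"))
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Use every observed score as a threshold, from strictest to most lenient.
thresholds <- sort(unique(scores), decreasing = TRUE)

rocPoints <- t(sapply(thresholds, function(threshold) {
  calledYes <- scores >= threshold
  c(threshold = threshold,
    FPR = sum(calledYes & actual == "No") / sum(actual == "No"),
    TPR = sum(calledYes & actual == "Yes") / sum(actual == "Yes"))
}))
rocPoints
```

Each row of rocPoints is one point on the curve. Plotting TPR against FPR and joining the points gives the ROC curve for these scores.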
Some classifiers allow the setting of a tuning parameter, which controls some aspect of their model learning. An example of doing parameter tuning with a linear SVM is presented. This particular SVM has a single tuning parameter, the cost. Higher values of this parameter penalise misclassifications more.
This is achieved in ClassifyR by providing a variable called tuneParams to the TrainParams container constructor. tuneParams is a named list, with the names being the names of the tuning variables, and the contents being vectors of values to try. If tuneParams has more than one element, all combination of values of the tuning variables are tried. The performance criterion specified to resubstituteParams is also used as the criterion for choosing the best tuning parameter(s). Any of the non-sample-specific performance metrics which calcCVperformance calculates can be optimised.
SVMresults <- runTests(measurements, classes, "Melanoma", "Tuned SVM",
validation = "leaveOut", leave = 1,
params = list(SelectParams(limmaSelection, resubstituteParams = resubstituteParams),
TrainParams(SVMtrainInterface, kernel = "linear",
resubstituteParams = resubstituteParams,
tuneParams = list(cost = c(0.01, 0.1, 1, 10))),
PredictParams(SVMpredictInterface,
getClasses = function(result) result))
)
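The way a named list of tuning values expands into all combinations can be sketched with expand.grid from base R. The second parameter, gamma, and the error values are made up for illustration. The example above tunes only cost, and this sketch does not show how ClassifyR evaluates each combination internally.

```r
# A named list in the style of tuneParams; gamma is a hypothetical second parameter.
tuneParams <- list(cost = c(0.01, 0.1, 1, 10), gamma = c(0.1, 1))

# All combinations of the tuning values: 4 costs x 2 gammas = 8 rows.
tuneGrid <- expand.grid(tuneParams)
nrow(tuneGrid)

# Made-up balanced error rates, one per combination, in the row order of tuneGrid.
errors <- c(0.40, 0.35, 0.30, 0.30, 0.45, 0.25, 0.30, 0.35)

# With better = "lower", the combination with the smallest error is chosen.
best <- tuneGrid[which.min(errors), ]
best
```

The size of the grid is the product of the lengths of the value vectors, so adding tuning variables increases the number of models to fit multiplicatively.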
The chosen values of the parameters are stored for every validation, and can be accessed with the tunedParameters function.
length(tunedParameters(SVMresults))
## [1] 20
tunedParameters(SVMresults)[1:3]
## $V1
## $V1$cost
## [1] 10
##
##
## $V2
## $V2$cost
## [1] 10
##
##
## $V3
## $V3$cost
## [1] 10
These are the cost values chosen for the first three of the 20 leave-one-out cross-validation iterations.
ClassifyR is a framework for cross-validated classification that provides a variety of unique functions for performance evaluation. It provides wrappers for many popular classifiers but is designed to be extensible if other classifiers are desired.
A number of feature selection methods are provided for users. Functions with names ending in “Interface” are wrappers for existing methods implemented in other packages. Different methods select different types of changes between classes.
Likewise, a variety of classifiers is also provided.
If a desired selection or classification method is not already implemented, rules for writing functions to work with ClassifyR are outlined in the next section.
The required inputs and the type of output of each stage of classification are summarised by the table below. The functions can have any number of other arguments after the set of arguments which are mandatory.
The argument verbose is sent from runTest to these functions so they must handle it, even if not explicitly using it. In the ClassifyR framework, verbose is a number which indicates the amount of progress messages to be printed. If verbose is 0, no progress messages are printed. If it is 1, only one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves.
A version of each included transformation, selection, training and prediction function is typically implemented for three kinds of input: (1) a numeric matrix in which rows are features and columns are samples (a data storage convention in bioinformatics), accompanied by a factor vector of the same length as the number of columns of the matrix; (2) a DataFrame in which columns are features, possibly of different data types (e.g. categorical and numeric), and rows are samples, accompanied by a class specification; and (3) a MultiAssayExperiment which stores sample class information in the colData slot’s DataFrame with column name “class”. Inputs of types 1 and 3 are converted to a DataFrame, because they can be stored as one without loss of information, and the transformation, selection and classification functions which accept a DataFrame contain the code that does the actual computations. At a minimum, a new function must have a method taking a DataFrame as input with the sample classes either stored in a column named “class” or provided as a factor vector. Although not required, providing a version of a function that accepts a numeric matrix with an accompanying factor vector and another version that accepts a MultiAssayExperiment is desirable to provide flexibility regarding input data. If intending to implement novel classification-related functions to be used with ClassifyR, see the code of existing functions in the package for examples of this.
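The conversion from a features-by-samples matrix to a samples-by-features table can be sketched in base R. This sketch uses an ordinary data.frame as a stand-in for the S4Vectors DataFrame that the package works with, and the small matrix is made up. It illustrates the layout change only, not the package's own conversion code.

```r
# Made-up matrix in the bioinformatics convention: rows are features, columns are samples.
measurements <- matrix(c(0.5, 1.2, -0.3,
                         0.1, 0.9, -1.1,
                         1.4, 0.2,  0.7,
                         -0.6, 0.8, 0.3),
                       nrow = 3,
                       dimnames = list(paste("Protein", 1:3), paste("Sample", 1:4)))
classes <- factor(c("Yes", "Yes", "No", "No"))

# Transpose so that rows are samples and columns are features,
# keeping the original feature names as column names.
samplesAsRows <- data.frame(t(measurements), check.names = FALSE)

# Store the sample classes in a column named "class".
samplesAsRows[["class"]] <- classes
samplesAsRows
```

No information is lost by the transposition, and a table laid out this way could also hold categorical feature columns alongside the numeric ones.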
Strbenac, D., Yang, J., Mann, G. J. and Ormerod, J. T. (2015) ClassifyR: an R package for performance assessment of classification with applications to transcriptomics, Bioinformatics, 31(11):1851-1853
Strbenac, D., Mann, G. J., Yang, J. and Ormerod, J. T. (2016) Differential distribution improves gene selection stability and has competitive classification performance for patient survival, Nucleic Acids Research, 44(13):e119