CRAN status CRAN downloads CRAN monthly downloads License: MIT R ≥ 3.5.0 Lifecycle: stable

Case Based Reasoning using Statistical Models

The R package case-based-reasoning provides an R interface for case-based reasoning using machine learning methods.

Introduction: What is Case Based Reasoning?

Case-Based Reasoning (CBR) is an artificial intelligence (AI) and problem-solving methodology that leverages the knowledge and experience gained from previously encountered situations, known as cases, to address new and complex problems. CBR relies on the principle that similar problems often have similar solutions, and it focuses on identifying, adapting, and reusing those solutions to solve new problems.

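The CBR process consists of four main steps:

  1. Retrieve: find the stored cases that are most similar to the new problem.
  2. Reuse: apply the solutions of the retrieved cases to the new problem, adapting them if needed.
  3. Revise: evaluate the proposed solution and correct it where it falls short.
  4. Retain: store the new problem and its confirmed solution as a case for future use.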

CBR has been successfully applied in various domains, including medical diagnosis, legal reasoning, customer support, and design optimization. Its ability to learn from experience and adapt to new situations makes it a valuable approach in fields where expertise and problem-solving skills are crucial.

How This Package Implements CBR

The central challenge in CBR is defining what “similar” means. This package solves the Retrieve step by learning a distance function from data rather than relying on ad-hoc similarity measures.

The workflow follows three stages:

  1. Fit a statistical model to the training data. The model captures which variables matter and how strongly they influence the outcome.
  2. Derive a distance function from the fitted model. The model’s structure determines how distance between cases is measured.
  3. Retrieve the k nearest cases for each new query observation using the learned distance.

The package offers two families of models for this:

Regression-based distance (Cox, linear, logistic)

A regression model (Cox proportional hazards, OLS, or logistic regression via the rms package) is fitted on the training data. The estimated regression coefficients are then used as weights in a weighted absolute distance function. Variables with larger coefficients contribute more to the distance, while variables the model deems irrelevant contribute little. This follows the approach of Dippon et al. (2002).
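The idea of a coefficient-weighted absolute distance can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: it assumes the distance has the form d(x, y) = sum_j |beta_j| * |x_j - y_j| over numeric predictors, uses `lm()` on the built-in `mtcars` data as a stand-in model, and the helper `weighted_abs_dist()` is a hypothetical name introduced here. How the package treats factors or scaling may differ.

```r
# Sketch under assumptions: distance = sum_j |beta_j| * |x_j - y_j|
train_df <- mtcars[1:25, ]
query_df <- mtcars[26:32, ]

# Stand-in regression model fitted on the training rows only
fit  <- lm(mpg ~ wt + hp + qsec, data = train_df)
beta <- coef(fit)[-1]                 # drop the intercept
vars <- names(beta)

train <- as.matrix(train_df[, vars])
query <- as.matrix(query_df[, vars])

# Hypothetical helper: rows are training cases, columns are query cases
weighted_abs_dist <- function(train, query, weights) {
  w <- abs(unname(weights))
  d <- matrix(0, nrow(train), nrow(query),
              dimnames = list(rownames(train), rownames(query)))
  for (j in seq_len(ncol(train))) {
    # |x_j - y_j| for every train/query pair, weighted by |beta_j|
    d <- d + w[j] * abs(outer(train[, j], query[, j], "-"))
  }
  d
}

dist_mat <- weighted_abs_dist(train, query, beta)
dim(dist_mat)  # one row per training case, one column per query case
```

Predictors with a large absolute coefficient dominate the sum, so two cases that differ mainly in a weakly weighted variable remain close.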

Random Forest distance (proximity and depth)

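A random forest (via the ranger package) is fitted on the training data. Two distance measures can be extracted:

  - Proximity: two cases are close when they fall into the same terminal node in a large share of the trees.
  - Depth: two cases are close when the terminal nodes they reach are separated by few edges within the trees.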
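The proximity idea can be sketched directly with ranger, without going through this package. The example below is an illustration under assumptions: it uses `mtcars` as stand-in data, derives proximity from the terminal node ids returned by `predict(..., type = "terminalNodes")`, and converts it to a distance as 1 - proximity, which is one common choice; the transformation used by the package may differ. The depth distance is not shown.

```r
library(ranger)

set.seed(1)
train <- mtcars[1:25, ]
query <- mtcars[26:32, ]

rf <- ranger(mpg ~ wt + hp + qsec, data = train, num.trees = 200)

# Terminal node id of each case in each tree (rows: cases, columns: trees)
nodes_train <- predict(rf, data = train, type = "terminalNodes")$predictions
nodes_query <- predict(rf, data = query, type = "terminalNodes")$predictions

# Proximity: share of trees in which a training case and a query case
# end up in the same terminal node
proximity <- sapply(seq_len(nrow(nodes_query)), function(q) {
  rowMeans(sweep(nodes_train, 2, nodes_query[q, ], "=="))
})

# Assumed conversion to a distance; rows are training cases, columns query cases
prox_dist <- 1 - proximity
dim(prox_dist)
```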

Both approaches produce an (n x m) distance matrix, where n is the number of training observations and m the number of query observations. This matrix can be used directly for clustering or visualization, or passed to the built-in k-nearest-neighbor search to retrieve the most similar cases.
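Retrieving the nearest cases from such a matrix amounts to sorting each query column. The following base R sketch uses a random stand-in matrix and a hypothetical helper `nearest_cases()`; it is not the package's built-in search, only an illustration of the operation.

```r
set.seed(1)

# Stand-in distance matrix: 6 training cases (rows) x 2 query cases (columns)
dist_mat <- matrix(runif(12), nrow = 6,
                   dimnames = list(paste0("train", 1:6), paste0("query", 1:2)))

# Hypothetical helper: row indices of the k closest training cases per query
nearest_cases <- function(dist_mat, k) {
  idx <- apply(dist_mat, 2, function(d) order(d)[seq_len(k)])
  # keep a k x m matrix even when k = 1
  matrix(idx, nrow = k, dimnames = list(NULL, colnames(dist_mat)))
}

nn <- nearest_cases(dist_mat, k = 3)
nn                               # row 1 holds the nearest case per query
rownames(dist_mat)[nn[, "query1"]]  # names of the 3 nearest cases for query1
```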

Installation

CRAN

install.packages("CaseBasedReasoning")

GitHub

install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")

Features

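This R package provides two methods for case-based reasoning by using an endpoint:

  - Regression models (linear, logistic, and Cox regression)
  - Random forests (proximity and depth distance)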

Besides the functionality of searching for similar cases, we added some additional features:

Warning Message

“Warning: Cases with missing values in the dependent variable (Y) or predictor variables (X) have been dropped from the analysis. This may lead to a reduced dataset and potential loss of information. Please review your data and consider appropriate missing value imputation techniques to mitigate these issues.”
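Before fitting, it can help to check how many rows would be affected by this. A small base R sketch with a made-up data frame (the column names are illustrative only):

```r
# Made-up data with missing values in the outcome and in a predictor
df <- data.frame(
  time   = c(5, 8, NA, 12, 7),
  status = c(1, 0, 1, 1, NA),
  age    = c(60, NA, 55, 70, 65)
)
vars <- c("time", "status", "age")

# Missing values per variable used in the model
na_per_column <- colSums(is.na(df[, vars]))
na_per_column

# Rows that are complete in all model variables
complete <- complete.cases(df[, vars])
sum(!complete)   # number of rows that would be dropped: 3

df_complete <- df[complete, ]
```

If a large share of rows is incomplete, imputing the missing values before fitting may be preferable to dropping them.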

Example: Cox Beta Model

Initialization

In the first example, we use the Cox proportional hazards (CPH) model and the ovarian data set from the survival package. First, we initialize the R6 model object.

library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)

# initialize R6 object
cph_model <- CoxModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = ovarian)

Similar Cases

After initialization, we can retrieve, for each case in the query data, the most similar cases from the training data.

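# split the data into 80% training and 20% query cases
set.seed(42)  # make the random split reproducible
n <- nrow(ovarian)
trainID <- sample(seq_len(n), floor(0.8 * n), replace = FALSE)
testID <- setdiff(seq_len(n), trainID)
cph_model <- CoxModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = ovarian[trainID, ])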

# fit model
cph_model$fit()

# get similar cases
matched_tbl <- cph_model$get_similar_cases(query = ovarian[testID, ], k = 3)

# or use the standard S3 predict interface
matched_tbl <- predict(cph_model, newdata = ovarian[testID, ], k = 3)

To analyze the results, you can extract the similar cases and training data and combine them:

These columns help organize and interpret the results, ensuring a clear understanding of the most similar cases and their corresponding query cases.

Distance Matrix

A distance matrix holds the pairwise distances between data points. In the context of Case-Based Reasoning (CBR), it captures the dissimilarities between cases in the training and test (or query) datasets, based on the fitted model and the values of the predictor variables. It is square only when the training data is compared with itself; otherwise it has one row per training case and one column per query case.

The distance matrix can be helpful in various situations:

In summary, a distance matrix can provide valuable insights into the relationships between cases, facilitate the identification of similar cases for CBR, and aid in the validation of the chosen statistical models.

distance_matrix <- cph_model$calc_distance_matrix()

cph_model$calc_distance_matrix() calculates the distance matrix between the training and test data. When the test data is omitted, as in the call above, the distances among the observations in the training data are calculated. Rows correspond to training observations and columns to test observations. The distance matrix is also stored in the CoxModel object as cph_model$dist_matrix.
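A train-only distance matrix can be passed to standard clustering tools. The sketch below uses a stand-in matrix built from `mtcars` rather than the output of `calc_distance_matrix()`, and assumes the matrix is symmetric with a zero diagonal, which is what `as.dist()` expects.

```r
# Stand-in for a symmetric train-vs-train distance matrix
x <- scale(mtcars[, c("wt", "hp", "qsec")])
d <- as.matrix(dist(x, method = "manhattan"))

# Hierarchical clustering on the precomputed distances
hc <- hclust(as.dist(d), method = "average")

# Cut the tree into three groups of similar cases
groups <- cutree(hc, k = 3)
table(groups)

# plot(hc) draws the dendrogram for visual inspection
```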

Contribution

Responsible for Mathematical Model Development and Programming

Medical Advisor

Funding

The Robert Bosch Foundation funded this work. Special thanks go to Professor Dr. Friedel (Thoraxchirurgie - Klinik Schillerhöhe).

References

Main

Other