Let’s start with an extremely simple example: a prediction problem with a continuous outcome, where we want to use cross-validation to minimize the expected risk/loss on held-out data across a few different models.
We’ll use the iris dataset to do this.
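The code below assumes the nadir package is already installed and attached:

library(nadir)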
nadir::super_learner() strives to keep the syntax
simple, so the simplest call to super_learner() might look
something like this:
super_learner(
  data = iris,
  formula = Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
  learners = list(lnr_lm, lnr_rf, lnr_earth, lnr_mean))
#> $predict
#> function (newdata)
#> {
#>     Reduce(`+`, x = future_lapply(1:length(fit_learners), function(i) {
#>         fit_learners[[i]](newdata) * learner_weights[[i]]
#>     }, future.seed = TRUE))
#> }
#> <bytecode: 0x1492f1408>
#> <environment: 0x1492f38d8>
#>
#> $y_variable
#> [1] "Petal.Width"
#>
#> $outcome_type
#> [1] "continuous"
#>
#> $learner_weights
#>        lm        rf     earth      mean
#> 0.5769071 0.4230929 0.0000000 0.0000000
#>
#> $holdout_predictions
#> # A tibble: 150 × 6
#> .sl_fold lm rf earth mean Petal.Width
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.425 0.329 1.83 1.20 0.4
#> 2 1 0.143 0.250 1.43 1.20 0.1
#> 3 1 0.0913 0.227 0.959 1.20 0.1
#> 4 1 0.0996 0.599 1.68 1.20 0.2
#> 5 1 0.319 0.220 1.64 1.20 0.4
#> 6 1 0.417 0.293 1.71 1.20 0.1
#> 7 1 0.203 0.191 1.55 1.20 0.2
#> 8 1 -0.0406 0.467 1.20 1.20 0.3
#> 9 1 0.232 0.213 1.02 1.20 0.2
#> 10 1 0.346 0.221 1.63 1.20 0.6
#> # ℹ 140 more rows
#>
#> attr(,"class")
#> [1] "nadir_sl_model"Notice what it returns: A function of newdata that
predicts across the learners, sums up according to the learned weights,
and returns the ensemble predictions.
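To make that concrete, here is a minimal conceptual sketch of such a weighted ensemble predictor; the names ensemble_predict, learner_predictions, and weights are illustrative only and not part of nadir's API.

# Conceptual sketch only (not nadir's internals): an ensemble prediction is a
# weighted sum of the individual learners' predictions.
ensemble_predict <- function(newdata, learner_predictions, weights) {
  # one prediction vector per fitted learner
  preds <- lapply(learner_predictions, function(f) f(newdata))
  # scale each learner's predictions by its weight and sum them up
  Reduce(`+`, Map(`*`, preds, weights))
}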
We can store the fitted model and use its learned predictor:
# We recommend storing more complicated arguments used repeatedly to simplify
# the call to super_learner()
petal_formula <- Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width
learners <- list(lnr_lm, lnr_rf, lnr_earth, lnr_mean)

sl_model <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)

In particular, we can use it to predict on the same dataset,
predict(sl_model, iris) |> head()
#>         1         2         3         4         5         6
#> 0.2274732 0.1725615 0.1903219 0.2566321 0.2482329 0.3803236

On a random sample of it,
predict(sl_model, iris[sample.int(size = 10, n = nrow(iris)), ]) |>
head()
#>       128        45        25         2        97       106
#> 1.7339317 0.4793721 0.4237248 0.1725615 1.3776838 2.2293237

Or on completely new data.
fake_iris_data <- cbind.data.frame(
  Sepal.Length = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Length),
    sd = sd(iris$Sepal.Length)
  ),
  Sepal.Width = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Width),
    sd = sd(iris$Sepal.Width)
  ),
  Petal.Length = rnorm(
    n = 6,
    mean = mean(iris$Petal.Length),
    sd = sd(iris$Petal.Length)
  )
)
predict(sl_model, fake_iris_data) |>
head()
#>         1         2         3         4         5         6
#> 1.1015928 1.7899799 1.1575923 0.8751806 0.8710849 0.5649973

If we want to know more about the super_learner() process (how it weighted the candidate learners, what the candidate learners predicted on the held-out data, and so on), we can look at the other metadata contained in the nadir_sl_model object it produces:
sl_model_iris <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)
str(sl_model_iris, max.level = 2)
#> List of 5
#>  $ predict            :function (newdata)
#>  $ y_variable         : chr "Petal.Width"
#>  $ outcome_type       : chr "continuous"
#>  $ learner_weights    : Named num [1:4] 0.521 0.479 0 0
#>   ..- attr(*, "names")= chr [1:4] "lm" "rf" "earth" "mean"
#>  $ holdout_predictions: tibble [150 × 6] (S3: tbl_df/tbl/data.frame)
#>  - attr(*, "class")= chr "nadir_sl_model"

To describe what is contained in the output from super_learner():

- $predict(): a function that takes newdata and returns the ensemble predictions.
- $y_variable and $outcome_type: context on the learning task that was performed.
- $learner_weights: the weights given to the different candidate learners.
- $holdout_predictions: a data.frame of predictions from each of the candidate learners, along with the actual outcome from the held-out data.

These components can be accessed directly, as shown below.
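For example, the pieces reported by str() above can be pulled out of the fitted object with the usual $ accessor:

sl_model_iris$learner_weights            # named vector of ensemble weights
sl_model_iris$outcome_type               # "continuous"
head(sl_model_iris$holdout_predictions)  # cross-validated predictions and the observed Petal.Width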

We can call compare_learners() on the output
from super_learner() if we want to assess how the different
learners performed. We can also call cv_super_learner()
with the same arguments as super_learner() to wrap the
super_learner() call in another layer of cross-validation
to assess how super_learner() performs on held-out
data.
compare_learners(sl_model_iris)
#> Inferring the loss metric for learner comparison based on the outcome type:
#> outcome_type=continuous -> using mean squared error
#> # A tibble: 1 × 4
#> lm rf earth mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0373 0.0391 1.69 0.590
cv_super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)$cv_loss
#> The loss_metric is being inferred based on the outcome_type=continuous -> using CV-MSE
#> [1] 0.03374206

We can, of course, do anything with a super learned model that we would do with a conventional prediction model, such as calculating performance statistics like \(R^2\).
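For instance, an in-sample \(R^2\) for the ensemble could be computed by hand as below; this is only a sketch for illustration, and an honest out-of-sample analogue would be based on cross-validated predictions (e.g., via cv_super_learner()).

# In-sample R^2 of the super learner's predictions on iris (illustrative only)
preds  <- predict(sl_model, iris)
ss_res <- sum((iris$Petal.Width - preds)^2)                   # residual sum of squares
ss_tot <- sum((iris$Petal.Width - mean(iris$Petal.Width))^2)  # total sum of squares
1 - ss_res / ss_tot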