Survival Analysis Setup for UKB Outcomes

Overview

After disease case definitions have been derived (see vignette("derive")), three additional functions prepare the data for time-to-event analysis:

Function Output columns Purpose
derive_timing() {name}_timing Classify prevalent vs. incident disease
derive_age() age_at_{name} Age at event (years)
derive_followup() {name}_followup_end, {name}_followup_years Follow-up end date and duration

Prerequisite: {name}_status and {name}_date must already be present — produced by vignette("derive"). The examples below assume the full disease derivation pipeline has been run on an ops_toy() dataset, so the baseline date column is p53_i0 and age at recruitment is p21022.


Step 1: Classify Timing — Prevalent vs. Incident

derive_timing() compares the disease date to the UKB baseline assessment date and assigns each participant to one of four categories:

Value Meaning
0 No disease (status is FALSE)
1 Prevalent — disease date on or before baseline
2 Incident — disease date strictly after baseline
NA Case with no recorded date; timing cannot be determined
library(ukbflow)

# Build on the derive pipeline from vignette("derive")
df <- ops_toy(n = 500)
df <- derive_missing(df)
df <- derive_covariate(df, as_factor = c("p31", "p20116_i0"))
df <- derive_selfreport(df, name = "dm", regex = "type 2 diabetes")
df <- derive_icd10(df, name = "dm", icd10 = "E11", source = c("hes", "death"))
df <- derive_case(df, name = "dm")
# Uses {name}_status and {name}_date by default
df <- derive_timing(df, name = "dm", baseline_col = "p53_i0")

Supply explicit column names when the defaults do not apply:

df <- derive_timing(df,
  name         = "dm",
  status_col   = "dm_status",
  date_col     = "dm_date",
  baseline_col = "p53_i0"
)

Call once per variable needed — for example, once for the combined case and once per individual source (HES, self-report, etc.).


Step 2: Age at Event

derive_age() computes age at the time of the event for cases, and returns NA for non-cases and cases without a date.

\[\text{age\_at\_event} = \text{age\_at\_recruitment} + \frac{\text{event\_date} - \text{baseline\_date}}{365.25}\]

The divisor 365.25 accounts for leap years, ensuring sub-monthly precision in age calculation across the full UKB follow-up window.

# Auto-detects {name}_date and {name}_status; produces age_at_{name} column.
df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022"
)

Supply explicit column mappings when names do not follow the default {name}_date / {name}_status pattern:

df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022",
  date_cols    = c(dm = "dm_date"),
  status_cols  = c(dm = "dm_status")
)

Step 3: Follow-Up Time

derive_followup() computes the follow-up end date as the earliest of:

  1. The outcome event date (if the participant is a case)
  2. Date of death (field 40000; competing event)
  3. Date lost to follow-up (field 191)
  4. The administrative censoring date

Follow-up time in years is then derived from the baseline date.

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),   # set to your study's cut-off date
  death_col    = "p40000_i0",
  lost_col     = FALSE                    # not available in ops_toy
)

Output columns:

Column Type Description
dm_followup_end IDate Earliest competing date
dm_followup_years numeric Years from baseline to end

Prevalent cases receive NA follow-up time

Participants whose event date falls before or on the baseline date (prevalent cases, {name}_timing == 1) will have followup_years set to NA rather than a zero or negative value, which has no meaning in time-to-event analysis. Use derive_timing() to identify and exclude prevalent cases before fitting a Cox model (see the full pipeline example below).

Auto-detection of death and lost-to-follow-up columns

When death_col and lost_col are NULL (default), derive_followup() looks them up automatically from the field cache (UKB fields 40000 and 191). Pass FALSE to explicitly disable a competing event:

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),
  death_col    = FALSE,
  lost_col     = FALSE
)

Full Survival-Ready Pipeline

After completing all three steps, the data contains everything needed to fit a Cox proportional hazards model:

library(survival)

# Incident analysis: exclude prevalent cases and those with undetermined timing
df_incident <- df[dm_timing != 1L]

fit <- coxph(
  Surv(dm_followup_years, dm_status) ~
    p20116_i0 + p21022 + p31 + p1558_i0,
  data = df_incident
)
summary(fit)

Column roles in the model:

Column Role
dm_status Event indicator (logical)
dm_followup_years Time variable
dm_timing Filter: exclude prevalent (== 1)
age_at_dm Age at diagnosis (descriptive / secondary analysis)
p20116_i0 Exposure of interest (smoking status)

Getting Help