Association Analysis for UKB Outcomes

Overview

The assoc_* functions fit regression models for each exposure variable and return tidy result tables suitable for downstream forest plots and publication tables.

Function	Alias	Model	Effect measure
`assoc_coxph()`	`assoc_cox()`	Cox proportional hazards	HR
`assoc_logistic()`	`assoc_logit()`	Logistic regression	OR
`assoc_linear()`	`assoc_lm()`	Linear regression	beta
`assoc_coxph_zph()`	`assoc_zph()`	Schoenfeld residual PH test	chisq / p
`assoc_subgroup()`	`assoc_sub()`	Stratified analysis + LRT interaction	HR / OR / beta
`assoc_trend()`	`assoc_tr()`	Dose-response trend	HR / OR / beta + p_trend
`assoc_competing()`	`assoc_fg()`	Fine-Gray competing risks	SHR
`assoc_lag()`	—	Cox lag sensitivity analysis	HR

Prerequisite: the analysis dataset should already contain derived case status, follow-up time, and covariates produced by the derive_* functions. See vignette("derive") and vignette("derive-survival").

library(ukbflow)

# ops_toy(scenario = "association") returns a pre-derived analysis-ready table:
# covariates already as factors, bmi_cat / tdi_cat binned, and two outcomes
# (dm_* and htn_*) with status / date / timing / followup columns in place.
dt <- ops_toy(scenario = "association")
dt <- dt[dm_timing != 1L]   # incident analysis: exclude prevalent DM cases

The Three-Model Framework

All main functions automatically produce up to three adjustment levels without requiring manual formula construction:

Model	Covariates	When included
Unadjusted	None (crude)	Always (when `base = TRUE`)
Age and sex adjusted	Age (field 21022) + sex (field 31), auto-detected	When both columns are found
Fully adjusted	User-supplied `covariates`	When `covariates` is non-NULL

Age and sex columns are located automatically by scanning the dataset’s column names for standard UKB naming patterns (p21022_* for age at recruitment, p31_* for sex). As a fallback, any column starting with "age" or named "sex" is also recognised. No prerequisite pipeline step is required — the detection works on any data.frame whose columns follow UKB or ukbflow naming conventions. Set base = FALSE to skip the first two models and run only the Fully adjusted model.

Step 1: Cox Proportional Hazards

assoc_coxph() is the primary function for time-to-event outcomes. It accepts logical (TRUE/FALSE) or integer (0/1) event indicators and returns one row per exposure x term x model combination.

# Crude + age-sex adjusted (automatic); p21022 and p31 auto-detected
res <- assoc_coxph(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = c("p20116_i0", "bmi_cat")
)

# Add a Fully adjusted model
res <- assoc_coxph(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "p20116_i0",
  covariates   = c("bmi_cat", "tdi_cat", "p1558_i0",
                   paste0("p22009_a", 1:10))
)

Output columns:

Column	Description
`exposure`	Exposure variable name
`term`	Coefficient name from `coxph`
`model`	Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted
`n`	Participants in model (after NA removal)
`n_events`	Events in model
`person_years`	Total person-years (rounded)
`HR`	Hazard ratio

n, n_events, and person_years all reflect the model-specific complete-case analysis set — participants with any missing value across outcome, time, exposure, or covariates are dropped before fitting. These counts therefore differ between models (e.g. Fully adjusted typically has fewer participants than Unadjusted). | CI_lower, CI_upper | Confidence interval bounds | | p_value | Wald test p-value | | HR_label | Formatted string, e.g. "1.43 (1.28-1.60)" |

Step 2: Logistic Regression

assoc_logistic() is for binary outcomes without a time dimension (e.g. case-control or cross-sectional designs).

res <- assoc_logistic(
  data         = dt,
  outcome_col  = "dm_status",
  exposure_col = c("p20116_i0", "bmi_cat"),
  covariates   = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10))
)

For sparse data or small samples, use profile likelihood confidence intervals:

res <- assoc_logistic(
  data         = dt,
  outcome_col  = "dm_status",
  exposure_col = "grs_bmi",
  ci_method    = "profile"   # slower but more accurate for sparse data
)

Output is identical in structure to assoc_coxph() but with OR and OR_label in place of HR / HR_label, and n_cases instead of n_events / person_years.

Step 3: Linear Regression

assoc_linear() is for continuous outcomes (e.g. biomarker levels, BMI). The standard error of beta is included to support downstream meta-analysis.

# Continuous outcome (BMI); smoking and GRS as exposures
res <- assoc_linear(
  data         = dt,
  outcome_col  = "p21001_i0",
  exposure_col = c("p20116_i0", "grs_bmi"),
  covariates   = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10))
)

Common mistake: passing a binary (0/1) or logical column as outcome_col is permitted (linear probability model) but triggers a warning. Use assoc_logistic() for binary outcomes — linear regression on a 0/1 outcome produces unbounded predictions and is rarely appropriate in epidemiological analyses.

Step 4: Proportional Hazards Assumption Test

assoc_coxph_zph() re-fits the same Cox models and tests the PH assumption via Schoenfeld residuals (cox.zph()). Use it alongside assoc_coxph() to validate model assumptions.

zph <- assoc_coxph_zph(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = c("p20116_i0", "bmi_cat"),
  covariates   = c("tdi_cat", "p1558_i0")
)

# Identify any violations
zph[ph_satisfied == FALSE]

Output columns include chisq, df, p_value, ph_satisfied (logical), and global test statistics (global_chisq, global_df, global_p).

When the PH assumption is violated, consider adding strata() to assoc_coxph() or modelling time-varying effects.

Step 5: Subgroup Analysis

assoc_subgroup() stratifies the dataset by a grouping variable and fits the specified model within each subgroup. A likelihood ratio test (LRT) for the exposure x subgroup interaction is computed on the full dataset and appended as p_interaction.

# Subgroup by sex; p31 is automatically excluded from within-stratum models
res <- assoc_subgroup(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "p20116_i0",
  by           = "p31",
  method       = "coxph",
  covariates   = c("p21022", "bmi_cat", "tdi_cat")
)

Output columns include subgroup, subgroup_level, and p_interaction in addition to the standard effect estimate columns.

Note: the by variable is automatically excluded from the subgroup models. Do not include it in covariates — a collinearity warning is issued if you do. Subgroup analysis is most interpretable when by has a small number of levels (e.g. sex, smoking status); continuous variables are technically permitted but per-unique-value subgrouping is rarely meaningful in practice — use a pre-categorised version (e.g. via derive_cut()) instead.

Step 6: Dose-Response Trend Analysis

assoc_trend() fits categorical and trend models simultaneously for each ordered-factor exposure, returning per-category estimates alongside a p-value for linear trend.

# assoc_trend() requires an ordered factor; make bmi_cat ordered in-place
dt[, bmi_cat := factor(bmi_cat, levels = levels(bmi_cat), ordered = TRUE)]

res <- assoc_trend(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "bmi_cat",
  method       = "coxph",
  covariates   = c("p21022", "p31", "tdi_cat", "p20116_i0")
)

Supply custom scores reflecting approximate median BMI per category:

res <- assoc_trend(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "bmi_cat",
  method       = "coxph",
  covariates   = c("p21022", "p31", "tdi_cat", "p20116_i0"),
  scores       = c(17, 22, 27, 35)   # approximate median BMI per category
)

The output contains a reference row (HR = 1.00 (ref)) followed by non-reference rows, with HR_per_score, HR_per_score_label, and p_trend appended as shared columns. These trend columns carry the same value across every row within the same exposure × model combination — including the reference row — so the full result table remains self-contained and easy to filter or export.

Step 7: Fine-Gray Competing Risks

assoc_competing() fits a Fine-Gray subdistribution hazard model via survival::finegray() + inverse-probability-of-censoring-weighted Cox regression. Use this when a competing event (e.g. death) can preclude the outcome of interest.

Two input modes are supported:

Mode A — a single column encodes all event types:

# Construct a combined event column: 0 = censored, 1 = DM, 2 = HTN (competing)
dt_cr <- dt[htn_timing != 1L]   # also exclude prevalent HTN
dt_cr[, event_type := data.table::fcase(
  dm_timing  == 2L, 1L,
  htn_timing == 2L, 2L,
  default          = 0L
)]

res <- assoc_competing(
  data         = dt_cr,
  outcome_col  = "event_type",       # 0 = censored, 1 = DM, 2 = HTN
  time_col     = "dm_followup_years",
  exposure_col = "p20116_i0",
  event_val    = 1L,
  compete_val  = 2L,
  covariates   = c("bmi_cat", "tdi_cat")
)

Mode B — separate 0/1 columns for primary and competing events:

res <- assoc_competing(
  data         = dt_cr,
  outcome_col  = "dm_status",        # primary event
  time_col     = "dm_followup_years",
  exposure_col = c("p20116_i0", "bmi_cat"),
  compete_col  = "htn_status",       # competing event
  covariates   = c("tdi_cat", "p1558_i0")
)

Output uses SHR (subdistribution hazard ratio) and SHR_label in place of HR, and adds n_compete (number of competing events in the analysis set).

Working with Results

All functions return a data.table that feeds directly into plot_forest():

res <- assoc_coxph(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "p20116_i0",
  covariates   = c("bmi_cat", "tdi_cat", "p1558_i0")
)

# Pass directly to plot_forest()
plot_forest(as.data.frame(res))

# Filter to a single model
res[model == "Fully adjusted"]

# Export
data.table::fwrite(res, "assoc_results.csv")

Step 8: Lag Sensitivity Analysis

assoc_lag() re-runs the same Cox models at one or more lag periods to assess whether associations are driven by early events (reverse causation or detection bias). For each lag, participants whose follow-up time is less than lag_years are excluded; follow-up is kept on its original scale.

res <- assoc_lag(
  data         = dt,
  outcome_col  = "dm_status",
  time_col     = "dm_followup_years",
  exposure_col = "p20116_i0",
  lag_years    = c(0, 1, 2),   # 0 = full cohort reference
  covariates   = c("bmi_cat", "tdi_cat", "p1558_i0")
)

Setting lag_years = 0 (or including 0 in the vector) runs the model on the full unfiltered cohort, providing a reference row against which lagged results can be compared.

Output adds two columns to the standard assoc_coxph() result:

Column	Description
`lag_years`	Lag period applied (numeric, same units as `time_col`)
`n_excluded`	Participants excluded because follow-up < `lag_years`

Getting Help

?assoc_coxph, ?assoc_logistic, ?assoc_linear
?assoc_coxph_zph, ?assoc_subgroup, ?assoc_trend, ?assoc_competing, ?assoc_lag
vignette("derive-survival") — follow-up time and event derivation
vignette("derive") — disease phenotype derivation
GitHub Issues