A core goal of behavior genetics is to decompose observed phenotypic variance into genetic and environmental components. Traditionally, this has been done using twin studies, which compare monozygotic (MZ) and dizygotic (DZ) twin pairs. However, extended pedigrees – multi-generational families with known parentage – provide richer information about relatedness and allow researchers to estimate a wider array of variance components.
The BGmisc package provides a complete pipeline for
pedigree-based variance component modeling:
simulatePedigree()ped2add(), ped2cn(), ped2mit(),
and ped2ce()identifyComponentModel() and comp2vech()buildOneFamilyGroup(), buildPedigreeMx(), and
fitPedigreeModel()This vignette walks through each step, from generating a pedigree to fitting a variance component model and interpreting the results.
This vignette requires the OpenMx package for structural
equation modeling. If you don’t have it installed, you can install it
from CRAN or from the OpenMx website.
library(BGmisc)
has_openmx <- requireNamespace("OpenMx", quietly = TRUE)
has_mvtnorm <- requireNamespace("mvtnorm", quietly = TRUE)
if (has_openmx) {
library(OpenMx)
} else {
message(
"OpenMx is not installed. Code will be shown but not executed.\n",
"Install OpenMx with: install.packages('OpenMx')"
)
}
if (!has_mvtnorm) {
message(
"mvtnorm is not installed. Data simulation examples will not run.\n",
"Install mvtnorm with: install.packages('mvtnorm')"
)
} else {
library(mvtnorm)
}
run_models <- has_openmx && has_mvtnormWe begin by simulating a multi-generational pedigree using
simulatePedigree(). This creates a balanced family
structure with a specified number of generations and children per
couple.
set.seed(421)
ped <- simulatePedigree(
kpc = 3, # 3 children per couple
Ngen = 4, # 4 generations
sexR = 0.5, # equal sex ratio
marR = 0.6 # 60% mating rate
)
head(ped)
#> fam ID gen dadID momID spID sex
#> 1 fam 1 10101 1 NA NA 10102 F
#> 2 fam 1 10102 1 NA NA 10101 M
#> 3 fam 1 10201 2 NA NA 10203 M
#> 4 fam 1 10202 2 10102 10101 NA M
#> 5 fam 1 10203 2 10102 10101 10201 F
#> 6 fam 1 10204 2 NA NA 10205 MThe resulting data frame contains one row per individual with columns
for family ID (fam), personal ID (ID),
generation (gen), parent IDs (dadID,
momID), spouse ID (spID), and biological sex
(sex).
254summarizeFamilies(ped, famID = "fam")$family_summary
#> fam count gen_mean gen_median gen_min gen_max gen_sd spID_mean
#> <char> <int> <num> <num> <num> <num> <num> <num>
#> 1: fam 1 25 3 3 1 4 0.9574271 10237.42
#> spID_median spID_min spID_max spID_sd
#> <num> <num> <num> <num>
#> 1: 10253 10101 10309 79.4818Before fitting a model, we need to verify that the variance components are identified – that is, the data provide enough information to uniquely estimate each parameter. If components are not identified, their estimates can trade off against each other, leading to unstable or meaningless results.
The identifyComponentModel() function checks
identification by vectorizing each relatedness component matrix (via
comp2vech()) and testing whether the resulting design
matrix has full column rank. Each component matrix becomes a column in
this design matrix. If the rank equals the number of components, the
model is identified.
For more background on identification in variance component models,
see
vignette("v1_modelingvariancecomponents", package = "BGmisc").
We plan to fit a 5-component model with additive genetic (A), common nuclear environment (Cn), common extended environment (Ce), mitochondrial (Mt), and unique environment (E) components. Let’s check whether these five components are simultaneously identified using the relatedness matrices we just computed:
id_full <- identifyComponentModel(
A = add_matrix,
Cn = cn_matrix,
Ce = ce_matrix,
Mt = mt_matrix,
E = diag(1, nrow(add_matrix))
)
#> Component model is identified.
id_full
#> $identified
#> [1] TRUE
#>
#> $nidp
#> character(0)The full model is identified. We can proceed to fit it. However, to
illustrate the process of checking identification and refining the
model, we will also show how to interpret the details of the
identification check. Below, I have provided an unidentified model to
demonstrate how to use the nidp element of the result to
understand which components are confounded.
# show if model is identified
identifyComponentModel(
A = add_matrix,
A2 = add_matrix,
Cn = cn_matrix,
Ce = ce_matrix,
Mt = mt_matrix,
E = diag(1, nrow(add_matrix))
)
#> Component model is not identified.
#> Non-identified parameters are A, A2
#> $identified
#> [1] FALSE
#>
#> $nidp
#> [1] "A" "A2"With a single pedigree, the 5-component model may not be
fully identified. The nidp element in the result tells us
which components are confounded. This is because some relatedness
matrices can be linearly dependent – for example, ce_matrix
is a matrix of all ones for a single family, and the identity matrix (E)
plus other components may span a similar space. In this case, our model
is identified, but if it were not, we would see a message like “Variance
components are not all identified. (And even if they were, we might not
have enough data to estimate them all.)
Based on the identification check above, we can drop or fix the
non-identified components. A natural choice is to remove the components
flagged by identifyComponentModel() and re-check:
# A simpler model: A + Cn + E
id_ace <- identifyComponentModel(
A = list(add_matrix),
Cn = list(cn_matrix),
E = diag(1, nrow(add_matrix))
)
#> Component model is identified.
id_ace
#> $identified
#> [1] TRUE
#>
#> $nidp
#> character(0)A single extended pedigree typically provides enough variation in relatedness coefficients (parent-offspring = 0.5, siblings = 0.5, grandparent-grandchild = 0.25, cousins = 0.125, etc.) to identify the A + Cn + E model.
When a component is not identified with one family, adding families with different structures can help. Each family contributes additional rows to the design matrix. Let’s check whether the full 5-component model becomes identified when we combine two pedigrees:
set.seed(99)
ped2 <- simulatePedigree(kpc = 4, Ngen = 3, marR = 0.5) |> makeTwins()
add2 <- ped2add(ped2, sparse = FALSE)
cn2 <- ped2cn(ped2, sparse = FALSE)
ce2 <- ped2ce(ped2)
mt2 <- ped2mit(ped2, sparse = FALSE)
# Check the full model across two families
n_combined <- nrow(add_matrix) + nrow(add2)
id_two_fam <- identifyComponentModel(
A = list(add_matrix, add2),
Cn = list(cn_matrix, cn2),
Ce = list(ce_matrix, ce2),
Mt = list(mt_matrix, mt2),
E = diag(1, n_combined)
)
#> Component model is identified.
id_two_fam
#> $identified
#> [1] TRUE
#>
#> $nidp
#> character(0)This result guides our modeling decisions in the steps that follow. When fitting the model below, we set the non-identified components’ true values to zero so that we have a known ground truth to recover.
Before fitting a model, we need observed data. In practice, this would be measured phenotypes (e.g., height, cognitive ability, disease status). Here, we simulate phenotypic data from a known variance component structure so we can verify that our model recovers the true parameters.
We define “true” variance components and use the relatedness matrices to construct the population covariance matrix, then sample from it.
# True variance components (proportions of total variance)
true_var <- list(
ad2 = 0.50, # additive genetic
cn2 = 0.10, # common nuclear environment
ce2 = 0.00, # common extended environment (set to 0 for identifiability)
mt2 = 0.00, # mitochondrial (set to 0 for simplicity)
ee2 = 0.40 # unique environment (residual)
)
# Build the implied covariance matrix
# V = ad2*A + cn2*Cn + ce2*U + mt2*Mt + ee2*I
n <- nrow(add_matrix)
I_mat <- diag(1, n)
U_mat <- matrix(1, n, n)
V_true <- true_var$ad2 * add_matrix +
true_var$cn2 * cn_matrix +
true_var$ce2 * U_mat +
true_var$mt2 * mt_matrix +
true_var$ee2 * I_mat
# Simulate one realization of the phenotype vector
set.seed(123)
y <- mvtnorm::rmvnorm(1, sigma = V_true)
# Create named variable labels (required by OpenMx)
ytemp <- paste("S", rownames(add_matrix))We simulated phenotypic data for25 individuals, with a mean of -0.064 and a standard deviation of 0.884. The variance in this simulated phenotype arises from the specified genetic and environmental components according to the covariance structure we defined.
In practice, you would have data from multiple independently ascertained families. Here we simulate data from a single pedigree for simplicity, but the model-fitting functions support multiple pedigrees (shown in a later section).
The BGmisc package provides helper functions that
construct the OpenMx model in three layers:
buildPedigreeModelCovariance() –
Creates the variance component parameters (free parameters to be
estimated)buildOneFamilyGroup() – Creates the
model for a single family, embedding the relatedness matrices and
observed databuildPedigreeMx() – Combines the
variance components with one or more family groups into a multi-group
modelLet’s walk through building the model step by step.
The buildPedigreeModelCovariance() function creates the
OpenMx sub-model that holds the free variance component parameters. You
can control which components to include:
# Starting values for the optimizer
start_vars <- list(
ad2 = 0.3, dd2 = 0, cn2 = 0.1,
ce2 = 0.1, mt2 = 0.1, am2 = 0,
ee2 = 0.4
)
# Build variance component sub-model
vc_model <- buildPedigreeModelCovariance(
vars = start_vars,
Vad = TRUE, # estimate additive genetic variance
Vdd = FALSE, # do not estimate dominance
Vcn = TRUE, # estimate common nuclear environment
Vce = TRUE, # estimate common extended environment
Vmt = TRUE, # estimate mitochondrial
Vam = FALSE, # do not estimate A x Mt interaction
Ver = TRUE # estimate unique environment
)
vc_model
#> MxModel 'ModelOne'
#> type : default
#> $matrices : 'Vad', 'Vcn', 'Vce', 'Vmt', and 'Ver'
#> $algebras : NULL
#> $penalties : NULL
#> $constraints : NULL
#> $intervals : NULL
#> $latentVars : none
#> $manifestVars : none
#> $data : NULL
#> $submodels : NULL
#> $expectation : NULL
#> $fitfunction : NULL
#> $compute : NULL
#> $independent : FALSE
#> $options :
#> $output : FALSE
summary(vc_model)
#> Summary of ModelOne
#>
#> Model Statistics:
#> | Parameters | Degrees of Freedom | Fit ( units)
#> Model: 0 0 NA
#> Saturated: NA NA NA
#> Independence: NA NA NA
#> Number of observations/statistics: 0/0
#>
#> timestamp: NULL
#> Wall clock time: NULL
#> OpenMx version number: NULL
#> Need help? See help(mxSummary)
#> WARNING: This model has not been run yet. Tip: Use
#> model = mxRun(model)
#> to estimate a model.The buildOneFamilyGroup() function constructs the model
for one family. It takes the relatedness matrices and the observed data
for that family:
# Format the observed data as a 1-row matrix with named columns
obs_data <- matrix(y, nrow = 1, dimnames = list(NULL, ytemp))
# Build the family group model
family_group <- buildOneFamilyGroup(
group_name = "family1",
Addmat = add_matrix,
Nucmat = cn_matrix,
Extmat = ce_matrix,
Mtdmat = mt_matrix,
full_df_row = obs_data,
ytemp = ytemp
)The family group model contains:
The buildPedigreeMx() function combines the variance
component parameters (shared across all families) with the family group
model(s):
full_model <- buildPedigreeMx(
model_name = "PedigreeVCModel",
vars = start_vars,
group_models = list(family_group)
)
full_model$submodels
#> $ModelOne
#> MxModel 'ModelOne'
#> type : default
#> $matrices : 'Vad', 'Vcn', 'Vce', 'Vmt', and 'Ver'
#> $algebras : NULL
#> $penalties : NULL
#> $constraints : NULL
#> $intervals : NULL
#> $latentVars : none
#> $manifestVars : none
#> $data : NULL
#> $submodels : NULL
#> $expectation : NULL
#> $fitfunction : NULL
#> $compute : NULL
#> $independent : FALSE
#> $options :
#> $output : FALSE
#>
#> $family1
#> MxModel 'family1'
#> type : default
#> $matrices : 'I', 'U', 'A', 'Cn', 'Mt', and 'M'
#> $algebras : 'V'
#> $penalties : NULL
#> $constraints : NULL
#> $intervals : NULL
#> $latentVars : none
#> $manifestVars : none
#> $data : 1 x 25
#> $data means : NA
#> $data type: 'raw'
#> $submodels : NULL
#> $expectation : MxExpectationNormal
#> $fitfunction : MxFitFunctionML
#> $compute : NULL
#> $independent : FALSE
#> $options :
#> $output : FALSEWith the model assembled, we fit it using OpenMx’s optimizer. The
mxRun() function performs maximum likelihood
estimation:
fitted_model <- mxRun(full_model)
#> Running PedigreeVCModel with 6 parameters
smr <- summary(fitted_model)# Extract variance component estimates
estimates <- c(
Vad = fitted_model$ModelOne$Vad$values[1, 1],
Vcn = fitted_model$ModelOne$Vcn$values[1, 1],
Vce = fitted_model$ModelOne$Vce$values[1, 1],
Vmt = fitted_model$ModelOne$Vmt$values[1, 1],
Ver = fitted_model$ModelOne$Ver$values[1, 1]
)
# Compare estimates to true values
truth <- c(
Vad = true_var$ad2,
Vcn = true_var$cn2,
Vce = true_var$ce2,
Vmt = true_var$mt2,
Ver = true_var$ee2
)
comparison <- data.frame(
Component = names(truth),
True = truth,
Estimated = round(estimates, 4)
)
comparison
#> Component True Estimated
#> Vad Vad 0.5 0.5510
#> Vcn Vcn 0.1 0.0085
#> Vce Vce 0.0 0.0000
#> Vmt Vmt 0.0 0.0909
#> Ver Ver 0.4 0.2890cat("-2 Log Likelihood:", smr$Minus2LogLikelihood, "\n")
#> -2 Log Likelihood: 63.47269
cat("Converged:", fitted_model$output$status$code == 0, "\n")
#> Converged: TRUEWith a single pedigree realization, estimates will vary from the true values due to sampling variability. Estimation improves substantially with multiple families, as shown next.
In practice, researchers have data from multiple families. The BGmisc helpers are designed for this multi-group scenario, where variance component parameters are shared across families but each family has its own relatedness structure and data.
set.seed(2024)
n_families <- 5
# Storage
ped_list <- vector("list", n_families)
add_list <- vector("list", n_families)
cn_list <- vector("list", n_families)
mt_list <- vector("list", n_families)
ce_list <- vector("list", n_families)
y_list <- vector("list", n_families)
ytemp_list <- vector("list", n_families)
for (i in seq_len(n_families)) {
# Simulate each family with slightly different structure
ped_i <- simulatePedigree(kpc = 3, Ngen = 4, marR = 0.6)
ped_list[[i]] <- ped_i
# Compute relatedness matrices
A_i <- as.matrix(ped2add(ped_i))
Cn_i <- as.matrix(ped2cn(ped_i))
Mt_i <- as.matrix(ped2mit(ped_i))
Ce_i <- ped2ce(ped_i)
n_i <- nrow(A_i)
add_list[[i]] <- A_i
cn_list[[i]] <- Cn_i
mt_list[[i]] <- Mt_i
ce_list[[i]] <- Ce_i
# Build implied covariance and simulate data
I_i <- diag(1, n_i)
U_i <- matrix(1, n_i, n_i)
V_i <- true_var$ad2 * A_i +
true_var$cn2 * Cn_i +
true_var$ce2 * U_i +
true_var$mt2 * Mt_i +
true_var$ee2 * I_i
y_list[[i]] <- mvtnorm::rmvnorm(1, sigma = V_i)
ytemp_list[[i]] <- paste("S", rownames(A_i))
}
cat("Simulated", n_families, "families\n")
#> Simulated 5 families
cat("Family sizes:", vapply(ped_list, nrow, integer(1)), "\n")
#> Family sizes: 25 25 25 25 25We build a group model for each family and then combine them:
# Build group models for each family
group_models <- lapply(seq_len(n_families), function(i) {
obs_data_i <- matrix(y_list[[i]], nrow = 1, dimnames = list(NULL, ytemp_list[[i]]))
buildOneFamilyGroup(
group_name = paste0("ped", i),
Addmat = add_list[[i]],
Nucmat = cn_list[[i]],
Extmat = ce_list[[i]],
Mtdmat = mt_list[[i]],
full_df_row = obs_data_i,
ytemp = ytemp_list[[i]]
)
})
# Build the multi-group model
multi_model <- buildPedigreeMx(
model_name = "MultiPedigreeModel",
vars = start_vars,
group_models = group_models
)
# Fit the model
fitted_multi <- mxRun(multi_model)
#> Running MultiPedigreeModel with 6 parameters
smr_multi <- summary(fitted_multi)# Extract and compare estimates
estimates_multi <- c(
Vad = fitted_multi$ModelOne$Vad$values[1, 1],
Vcn = fitted_multi$ModelOne$Vcn$values[1, 1],
Vce = fitted_multi$ModelOne$Vce$values[1, 1],
Vmt = fitted_multi$ModelOne$Vmt$values[1, 1],
Ver = fitted_multi$ModelOne$Ver$values[1, 1]
)
comparison_multi <- data.frame(
Component = c("Additive genetic (Vad)", "Common nuclear (Vcn)",
"Common extended (Vce)", "Mitochondrial (Vmt)",
"Unique environment (Ver)"),
True = truth,
Estimated = round(estimates_multi, 4)
)
comparison_multi
#> Component True Estimated
#> Vad Additive genetic (Vad) 0.5 0.4780
#> Vcn Common nuclear (Vcn) 0.1 0.0645
#> Vce Common extended (Vce) 0.0 0.0000
#> Vmt Mitochondrial (Vmt) 0.0 0.0000
#> Ver Unique environment (Ver) 0.4 0.5457
cat("\n-2 Log Likelihood:", smr_multi$Minus2LogLikelihood, "\n")
#>
#> -2 Log Likelihood: 352.1827
cat("Converged:", fitted_multi$output$status$code == 0, "\n")
#> Converged: TRUEWith multiple families providing more information, the estimates should more closely approximate the true generating values.
fitPedigreeModel() WrapperFor convenience, fitPedigreeModel() wraps the build and
fit steps together. It accepts pre-built group models and uses
mxTryHard() for robust optimization with multiple
starts:
fitted_easy <- fitPedigreeModel(
model_name = "EasyFit",
vars = start_vars,
data = NULL,
group_models = group_models,
tryhard = TRUE
)
#> Beginning initial fit attemptFit attempt 0, fit=352.182707904676, new current best! (was 361.708560299926)
#>
#> Solution found! Final fit=352.18271 (started at 361.70856) (1 attempt(s): 1 valid, 0 errors)
summary(fitted_easy)
#> Summary of EasyFit
#>
#> free parameters:
#> name matrix row col Estimate Std.Error A lbound ubound
#> 1 vad ModelOne.Vad 1 1 4.779743e-01 0.3104667 1e-10
#> 2 vcn ModelOne.Vcn 1 1 6.448255e-02 0.0948814 1e-10
#> 3 vce ModelOne.Vce 1 1 1.002645e-10 0.2203776 ! 0!
#> 4 vmt ModelOne.Vmt 1 1 1.000546e-10 0.1041140 ! 0!
#> 5 ver ModelOne.Ver 1 1 5.457066e-01 0.1982269 1e-10
#> 6 meanLI ped1.M 1 S 10101 -1.761523e-01 0.1531313
#>
#> Model Statistics:
#> | Parameters | Degrees of Freedom | Fit (-2lnL units)
#> Model: 6 119 352.1827
#> Saturated: NA NA NA
#> Independence: NA NA NA
#> Number of observations/statistics: 5/125
#>
#> Information Criteria:
#> | df Penalty | Parameters Penalty | Sample-Size Adjusted
#> AIC: 114.1827 364.1827 322.1827
#> BIC: 160.6596 361.8393 344.7898
#> To get additional fit indices, see help(mxRefModels)
#> timestamp: 2026-03-13 12:25:28
#> Wall clock time: 0.16693 secs
#> optimizer: SLSQP
#> OpenMx version number: 2.22.11
#> Need help? See help(mxSummary)The key equation underlying the model is:
\[V = \sigma^2_a \mathbf{A} + \sigma^2_{cn} \mathbf{C}_n + \sigma^2_{ce} \mathbf{U} + \sigma^2_{mt} \mathbf{M} + \sigma^2_e \mathbf{I}\]
where:
ped2add())ped2cn())ped2ce())ped2mit())This is an extension of the classical twin model to arbitrary pedigree structures. The additive genetic relatedness matrix generalizes the concept of MZ twins sharing 100% of genes and DZ twins sharing 50% – in a pedigree, every pair of relatives has a specific coefficient of relatedness determined by their genealogical connection.
This vignette demonstrated the full workflow for pedigree-based variance component modeling:
| Step | Function | Purpose |
|---|---|---|
| 1 | simulatePedigree() |
Generate a multi-generational pedigree |
| 2 | ped2add(), ped2cn(),
ped2mit(), ped2ce() |
Compute relatedness matrices |
| 3 | identifyComponentModel() |
Check model identification |
| 4 | Simulate or prepare phenotypic data | Observed data for model fitting |
| 5 | buildOneFamilyGroup(),
buildPedigreeModelCovariance() |
Build OpenMx model components |
| 6 | buildPedigreeMx(), mxRun() or
fitPedigreeModel() |
Assemble and fit the model |
| 7 | Multiple families | Scale to multi-group pedigree models |
The helper functions (buildPedigreeModelCovariance(),
buildOneFamilyGroup(), buildFamilyGroups(),
buildPedigreeMx(), fitPedigreeModel()) handle
the mechanics of translating pedigree relatedness matrices into proper
OpenMx model specifications, allowing researchers to focus on the
substantive questions rather than the modeling boilerplate.