Introduction

Welcome to MEAT (Muscle Epigenetic Age Test)! If you are reading these lines, you are probably an inquisitive scientist who has put a lot of effort into collecting skeletal muscle samples from – hopefully – consenting humans. Your coin purse (grant) is now lighter after profiling these muscle samples with the Illumina HumanMethylation technology (HM27, HM450 and HMEPIC) and you are yearning to know what the skeletal muscle epigenome has to say about your samples' age. I am here to guide you in your quest to find out how old your skeletal muscle samples are, by simply looking at their DNA methylation profiles. DNA methylation doesn't lie, but it can be tricky to understand what it says. Are you ready to undertake your quest to uncover the secrets of the muscle epigenome?

You can view MEAT as a spellbook (package) that contains all the necessary spells (functions) to estimate epigenetic age in human skeletal muscle samples. However, the spells will only work if you cast them in a particular order (data cleaning, data calibration, and epigenetic age estimation). Starting from preprocessed data (a matrix of beta-values that has been normalized, batch corrected, etc.), MEAT will estimate epigenetic age in each sample, based on a penalized regression model developed by Voisin et al. on 682 skeletal muscle samples from 12 independent datasets. This muscle epigenetic clock is a machine learning algorithm (elastic net regression) essentially similar to Horvath's original pan-tissue clock.
You have the option to provide the actual age of each sample (if known), so MEAT can calculate age acceleration as the difference between epigenetic age and real age (AAdiff), or as the residuals from a linear regression of epigenetic age against real age (AAresid). For more information on the distinction between AAdiff and AAresid, see our original paper.
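
To make the two definitions concrete, here is a minimal sketch in base R. The ages below are invented toy values, not MEAT output, and the variable names are only illustrative.

```r
# Toy data: hypothetical chronological and epigenetic ages for six samples
age  <- c(20, 30, 40, 50, 60, 70)
dnam <- c(24, 31, 45, 49, 66, 71)

# AAdiff: epigenetic age minus chronological age
AAdiff <- dnam - age

# AAresid: residuals from a linear regression of epigenetic age on real age
AAresid <- residuals(lm(dnam ~ age))

round(cbind(age, dnam, AAdiff, AAresid), 2)
```

Because AAresid comes from a fitted regression, its values sum to zero within the dataset, whereas AAdiff keeps any overall offset between epigenetic and real age.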

Installation

Install the MEAT package:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("MEAT")

Then, load the package:

library(MEAT)

Step-by-step guide

Data requirements

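To use this guide, you will need in your inventory a beta matrix of DNA methylation profiles, with CpGs in rows and samples in columns, and optionally a phenotype table with one row per sample. The example dataset GSE121961 used throughout this guide illustrates both: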

data("GSE121961", envir = environment())
knitr::kable(head(GSE121961),
             caption = "Top rows of the GSE121961 matrix before cleaning and calibration.")

Table: Top rows of the GSE121961 matrix before cleaning and calibration.

             GSM3450524   GSM3450528
cg00000029   0.35         0.35
cg00000103   NA           NA
cg00000109   0.77         0.79
cg00000155   0.94         0.96
cg00000158   0.96         0.96
cg00000165   0.23         0.20

data("GSE121961_pheno", envir = environment())
knitr::kable(GSE121961_pheno,
             caption = "Phenotypes corresponding to GSE121961.")

Table: Phenotypes corresponding to GSE121961.

ID           Sex   Age   Group     Batch          Position
GSM3450524   M     NA    Control   202128330007   R01C01
GSM3450528   F     14    SELENON   202128330007   R07C01

The phenotype table is useful if you want to discover the age acceleration (AA) of your samples and to associate this AA with a phenotype of interest (e.g. do elves show systematically lower AA than humans, therefore explaining their exceptional lifespans?).
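
As a rough illustration of such an association test, the sketch below fits a linear model of AA against a grouping variable in base R. The AA values and the Species labels are invented for the example. A real analysis would usually adjust for covariates such as sex or batch.

```r
# Hypothetical age acceleration values (in years) for two groups of samples
pheno <- data.frame(
  AA      = c(-2.1, 0.4, -1.3, 1.8, 2.5, 0.9),
  Species = factor(c("Elf", "Elf", "Elf", "Human", "Human", "Human"),
                   levels = c("Elf", "Human"))
)

# The SpeciesHuman coefficient is the mean AA difference, Human minus Elf
fit <- lm(AA ~ Species, data = pheno)
coef(summary(fit))
```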

Step 1: Data formatting

A good adventurer never embarks on a quest without a minimum of preparation. That is particularly true for your inventory! The beta matrix, the phenotype table and the annotation table should all be bundled into a single object, so that the metadata and assays stay coordinated when subsetting. For example, if you have skeletal muscle DNA methylation profiles from humans, elves and dwarves, but you only want the samples from humans and elves, you can select them in a single operation that applies to both the beta matrix and the phenotype table. This ensures the beta matrix, phenotype table and annotation table remain in sync. SummarizedExperiment objects have the ideal format for your inventory. Let's create such an object with the beta matrix and the optional phenotype and annotation tables. Please make sure you name the beta matrix “beta”, as the upcoming functions rely on that name.

library(SummarizedExperiment)
GSE121961_SE <- SummarizedExperiment(assays=list(beta=GSE121961),
colData=GSE121961_pheno)
GSE121961_SE
#> class: SummarizedExperiment 
#> dim: 866091 2 
#> metadata(0):
#> assays(1): beta
#> rownames(866091): cg00000029 cg00000103 ... ch.X.97737721F
#>   ch.X.98007042R
#> rowData names(0):
#> colnames(2): GSM3450524 GSM3450528
#> colData names(6): ID Sex ... Batch Position
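
The single-operation subsetting described above can be sketched on a toy object. The sample names, the Species column and the beta values below are hypothetical and are not part of GSE121961.

```r
library(SummarizedExperiment)

# Toy beta matrix: 3 CpGs x 3 samples (hypothetical values and names)
toy_beta <- matrix(c(0.10, 0.80, 0.50,
                     0.12, 0.79, 0.55,
                     0.09, 0.85, 0.47),
                   nrow = 3,
                   dimnames = list(c("cg01", "cg02", "cg03"),
                                   c("human1", "elf1", "dwarf1")))
toy_pheno <- DataFrame(Species = c("Human", "Elf", "Dwarf"),
                       row.names = colnames(toy_beta))
toy_SE <- SummarizedExperiment(assays = list(beta = toy_beta),
                               colData = toy_pheno)

# One subsetting call filters the assay and the phenotype table together
toy_SE_sub <- toy_SE[, toy_SE$Species %in% c("Human", "Elf")]
colnames(assay(toy_SE_sub, "beta"))
rownames(colData(toy_SE_sub))
```

Both calls at the end list the same two samples, which is the point of keeping everything in one object.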

Step 2: Data cleaning

The first important step is data 'cleaning', which essentially means reducing the beta matrix to the 19,401 CpGs common to the 12 datasets used in the muscle clock. If some of the 19,401 CpGs are not present in your beta-matrix, these missing values will be imputed.

GSE121961_SE_clean <- clean_beta(SE = GSE121961_SE)
#> Cluster size 19362 broken into 5430 13932 
#> Cluster size 5430 broken into 3338 2092 
#> Cluster size 3338 broken into 1330 2008 
#> Done cluster 1330 
#> Cluster size 2008 broken into 1128 880 
#> Done cluster 1128 
#> Done cluster 880 
#> Done cluster 2008 
#> Done cluster 3338 
#> Cluster size 2092 broken into 1081 1011 
#> Done cluster 1081 
#> Done cluster 1011 
#> Done cluster 2092 
#> Done cluster 5430 
#> Cluster size 13932 broken into 1946 11986 
#> Cluster size 1946 broken into 1043 903 
#> Done cluster 1043 
#> Done cluster 903 
#> Done cluster 1946 
#> Cluster size 11986 broken into 7645 4341 
#> Cluster size 7645 broken into 4848 2797 
#> Cluster size 4848 broken into 2831 2017 
#> Cluster size 2831 broken into 1561 1270 
#> Cluster size 1561 broken into 658 903 
#> Done cluster 658 
#> Done cluster 903 
#> Done cluster 1561 
#> Done cluster 1270 
#> Done cluster 2831 
#> Cluster size 2017 broken into 1160 857 
#> Done cluster 1160 
#> Done cluster 857 
#> Done cluster 2017 
#> Done cluster 4848 
#> Cluster size 2797 broken into 856 1941 
#> Done cluster 856 
#> Cluster size 1941 broken into 849 1092 
#> Done cluster 849 
#> Done cluster 1092 
#> Done cluster 1941 
#> Done cluster 2797 
#> Done cluster 7645 
#> Cluster size 4341 broken into 2857 1484 
#> Cluster size 2857 broken into 1645 1212 
#> Cluster size 1645 broken into 368 1277 
#> Done cluster 368 
#> Done cluster 1277 
#> Done cluster 1645 
#> Done cluster 1212 
#> Done cluster 2857 
#> Done cluster 1484 
#> Done cluster 4341 
#> Done cluster 11986 
#> Done cluster 13932
knitr::kable(head(assays(GSE121961_SE_clean)$beta),
             caption = "Top rows of the GSE121961 beta matrix after cleaning.")

Table: Top rows of the GSE121961 beta matrix after cleaning.

             GSM3450524   GSM3450528
cg21432842   0.30         0.39
cg15376097   0.20         0.26
cg05876918   0.11         0.10
cg25771195   0.64         0.67
cg21380842   0.12         0.13
cg00602891   0.09         0.08
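
To give a feel for what 'cleaning' involves, here is a minimal base R sketch on toy data. The CpG names and the reference set are hypothetical. The per-sample mean used to fill gaps is a deliberately simple stand-in and should not be read as the imputation method that clean_beta() applies.

```r
# Hypothetical reference set of CpGs required by a clock
clock_cpgs <- c("cg01", "cg02", "cg03", "cg04")

# Toy beta matrix that lacks cg03, has one NA, and carries an extra CpG
beta <- matrix(c(0.10, 0.80, NA,   0.30,
                 0.20, 0.70, 0.60, 0.40),
               nrow = 4,
               dimnames = list(c("cg01", "cg02", "cg04", "cg99"),
                               c("s1", "s2")))

# Keep only clock CpGs, in reference order; absent CpGs become NA rows
cleaned <- beta[match(clock_cpgs, rownames(beta)), , drop = FALSE]
rownames(cleaned) <- clock_cpgs

# Stand-in imputation: replace each NA with that sample's mean beta value
for (j in seq_len(ncol(cleaned))) {
  is_missing <- is.na(cleaned[, j])
  cleaned[is_missing, j] <- mean(cleaned[, j], na.rm = TRUE)
}
cleaned
```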

Step 3: Data calibration

The second step is data 'calibration', which essentially means re-scaling the DNA methylation profiles to that of the gold standard dataset used to develop the muscle clock. This step harmonises differences in data processing, sample preparation and lab-to-lab variability, so that you obtain accurate measures of epigenetic age in your samples. Note that this 'calibration' is entirely different from the data preprocessing mentioned earlier (i.e. probe and sample filtering, normalisation of Type I and Type II probes, and correction for batch effects). The calibration implemented in BMIQcalibration() does use code from the original BMIQ algorithm, but it is not used to normalise the methylation distributions of Type I and Type II probes. Instead, the BMIQcalibration() function of the MEAT package re-scales the methylation distribution of your samples to that of the gold standard dataset GSE50498.

GSE121961_SE_calibrated <- BMIQcalibration(SE = GSE121961_SE_clean)
#> [1] Inf
#> [1] 0.001941611
#> [1] 0.0006828534
#> ii= 1
#> ii= 2
knitr::kable(head(assays(GSE121961_SE_calibrated)$beta),
             caption = "Top rows of the GSE121961 beta matrix after cleaning and calibration.")

Table: Top rows of the GSE121961 beta matrix after cleaning and calibration.

             GSM3450524   GSM3450528
cg21432842   0.3015962    0.3986824
cg15376097   0.2045055    0.2734146
cg05876918   0.1077539    0.0997550
cg25771195   0.6317048    0.6684900
cg21380842   0.1184796    0.1345813
cg00602891   0.0864733    0.0770285
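
The idea of re-scaling a sample's distribution onto a reference can be illustrated with plain quantile mapping in base R. This is a simplified stand-in chosen for clarity, not the beta-mixture approach used by BMIQcalibration(), and both profiles below are simulated.

```r
set.seed(1)

# Hypothetical reference profile (bimodal, like methylation data)
gold <- rbeta(1000, shape1 = 0.5, shape2 = 0.5)

# Hypothetical sample whose distribution is compressed and noisy
sample_beta <- 0.1 + 0.8 * rbeta(1000, shape1 = 0.5, shape2 = 0.5) +
  rnorm(1000, sd = 0.01)

# Quantile mapping: the k-th smallest sample value is replaced by the
# k-th smallest reference value, so CpG ranks within the sample are kept
calibrated <- sort(gold)[rank(sample_beta, ties.method = "first")]

summary(sample_beta)
summary(calibrated)
```

After mapping, the sample has the same distribution as the reference while the ordering of its CpGs is preserved.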

You can have a look at the distribution of DNA methylation before and after calibration with a density plot. On this plot, each line is an individual sample, and you can clearly see the bimodal distribution of DNA methylation data, with most CpGs harboring very low methylation levels (left side of the graph), very few CpGs with intermediate methylation levels, and some CpGs with high methylation levels. Before calibration, the profiles do not align well with that of the gold standard, which makes it difficult to obtain accurate estimates of epigenetic age. After calibration, however, the samples' profiles overlap nicely with that of the gold standard.

data("gold.mean", envir = environment())
GSE121961_SE_clean_with_gold_mean <- cbind(assays(GSE121961_SE_clean)$beta,
                                           gold.mean$gold.mean) # add the gold mean
GSE121961_SE_calibrated_with_gold_mean <- cbind(assays(GSE121961_SE_calibrated)$beta,
                                                gold.mean$gold.mean) # add the gold mean
groups <- c(rep("GSE121961",
                ncol(GSE121961_SE_clean)), "Gold mean")

library(minfi)
par(mfrow = c(2, 1))
densityPlot(GSE121961_SE_clean_with_gold_mean,
  sampGroups = groups,
  main = "Before calibration",
  legend = FALSE
)
densityPlot(GSE121961_SE_calibrated_with_gold_mean,
  sampGroups = groups,
  main = "After calibration"
)

Figure: DNA methylation profile distribution before and after calibration.

Step 4: Epigenetic age estimation

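Your quest is almost over! The only spell left to cast is epiage_estimation(), which uses methylation levels at 200 CpGs from the calibrated profiles to estimate epigenetic age. If you do not have information on age, epiage_estimation() will only return epigenetic age (“DNAmage”). However, if you have information on age (and other phenotypes), epiage_estimation() can also return age acceleration, expressed either as AAdiff (the difference between epigenetic age and real age) or as AAresid (the residuals from a linear regression of epigenetic age against real age).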

While AAdiff is a straightforward way of calculating the error in age prediction, it is sensitive to the mean age of the dataset and to the pre-processing of the DNA methylation dataset; AAdiff can be biased upwards or downwards depending on how the dataset was normalized, and depending on the mean age and age variance of the dataset. In contrast, AAresid is insensitive to the mean age of the dataset and is robust against different pre-processing methods.
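
The contrast described above can be checked on simulated data. In this sketch a constant +5 year offset stands in for a systematic bias introduced by preprocessing. That is an assumption made for illustration, and all values are simulated rather than taken from a real dataset.

```r
set.seed(42)
age  <- runif(50, min = 20, max = 80)   # hypothetical chronological ages
dnam <- age + rnorm(50, sd = 4)         # hypothetical epigenetic ages
dnam_shifted <- dnam + 5                # same data with a +5 year offset

AAdiff          <- dnam - age
AAdiff_shifted  <- dnam_shifted - age
AAresid         <- residuals(lm(dnam ~ age))
AAresid_shifted <- residuals(lm(dnam_shifted ~ age))

# The offset carries over entirely to AAdiff ...
mean(AAdiff_shifted) - mean(AAdiff)
# ... whereas AAresid is unchanged, up to floating-point error
max(abs(AAresid_shifted - AAresid))
```

The regression intercept absorbs the offset, which is why the residuals do not move.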

GSE121961_SE_epiage <- epiage_estimation(SE = GSE121961_SE_calibrated,
age_col_name = "Age")
knitr::kable(colData(GSE121961_SE_epiage),
             caption = "Phenotypes corresponding to GSE121961 with AAdiff for each sample.")

Table: Phenotypes corresponding to GSE121961 with AAdiff for each sample.

             ID           Sex   Age   Group     Batch          Position   AAdiff
GSM3450524   GSM3450524   M     NA    Control   202128330007   R01C01     NA
GSM3450528   GSM3450528   F     14    SELENON   202128330007   R07C01     5.747296
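
For readers curious about what a penalized regression clock does at prediction time, here is a generic sketch of the linear predictor: an intercept plus a weighted sum of beta values. The intercept, the coefficients and the CpG names are hypothetical, and the sketch leaves out any transformation a clock may apply between the linear predictor and age in years.

```r
# Hypothetical model: the real clock has its own CpGs and weights
intercept <- 30
coefs <- c(cg01 = 25, cg02 = -40, cg03 = 10)

# Toy calibrated beta matrix: CpGs in rows, samples in columns
beta <- matrix(c(0.30, 0.20, 0.10,
                 0.40, 0.25, 0.60),
               nrow = 3,
               dimnames = list(names(coefs), c("s1", "s2")))

# Linear predictor: intercept + sum over CpGs of coefficient x beta value
predicted <- intercept + drop(crossprod(beta[names(coefs), ], coefs))
predicted
```

Indexing the matrix by names(coefs) guards against the rows being in a different order from the coefficients.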

Session information

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] minfi_1.34.0                bumphunter_1.30.0          
#>  [3] locfit_1.5-9.4              iterators_1.0.12           
#>  [5] foreach_1.5.0               Biostrings_2.56.0          
#>  [7] XVector_0.28.0              SummarizedExperiment_1.18.1
#>  [9] DelayedArray_0.14.0         matrixStats_0.56.0         
#> [11] Biobase_2.48.0              GenomicRanges_1.40.0       
#> [13] GenomeInfoDb_1.24.2         IRanges_2.22.2             
#> [15] S4Vectors_0.26.1            BiocGenerics_0.34.0        
#> [17] MEAT_1.0.1                  BiocStyle_2.16.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_1.4-1          ellipsis_0.3.1           
#>   [3] siggenes_1.62.0           dynamicTreeCut_1.63-1    
#>   [5] mclust_5.4.6              base64_2.0               
#>   [7] affyio_1.58.0             bit64_0.9-7              
#>   [9] AnnotationDbi_1.50.1      xml2_1.3.2               
#>  [11] codetools_0.2-16          splines_4.0.0            
#>  [13] impute_1.62.0             methylumi_2.34.0         
#>  [15] scrime_1.3.5              knitr_1.29               
#>  [17] Rsamtools_2.4.0           annotate_1.66.0          
#>  [19] cluster_2.1.0             dbplyr_1.4.4             
#>  [21] HDF5Array_1.16.1          wateRmelon_1.32.0        
#>  [23] BiocManager_1.30.10       readr_1.3.1              
#>  [25] compiler_4.0.0            httr_1.4.1               
#>  [27] assertthat_0.2.1          Matrix_1.2-18            
#>  [29] limma_3.44.3              RPMM_1.25                
#>  [31] htmltools_0.5.0           prettyunits_1.1.1        
#>  [33] tools_4.0.0               affy_1.66.0              
#>  [35] glue_1.4.1                GenomeInfoDbData_1.2.3   
#>  [37] dplyr_1.0.0               rappdirs_0.3.1           
#>  [39] doRNG_1.8.2               Rcpp_1.0.4.6             
#>  [41] vctrs_0.3.1               multtest_2.44.0          
#>  [43] preprocessCore_1.50.0     nlme_3.1-148             
#>  [45] rtracklayer_1.47.0        DelayedMatrixStats_1.10.1
#>  [47] xfun_0.15                 stringr_1.4.0            
#>  [49] ROC_1.64.0                lifecycle_0.2.0          
#>  [51] rngtools_1.5              XML_3.99-0.3             
#>  [53] beanplot_1.2              nleqslv_3.3.2            
#>  [55] zlibbioc_1.34.0           MASS_7.3-51.6            
#>  [57] hms_0.5.3                 rhdf5_2.32.1             
#>  [59] GEOquery_2.56.0           RColorBrewer_1.1-2       
#>  [61] yaml_2.2.1                curl_4.3                 
#>  [63] memoise_1.1.0             biomaRt_2.44.1           
#>  [65] reshape_0.8.8             stringi_1.4.6            
#>  [67] RSQLite_2.2.0             highr_0.8                
#>  [69] genefilter_1.70.0         lumi_2.40.0              
#>  [71] GenomicFeatures_1.40.0    BiocParallel_1.22.0      
#>  [73] shape_1.4.4               rlang_0.4.6              
#>  [75] pkgconfig_2.0.3           bitops_1.0-6             
#>  [77] nor1mix_1.3-0             evaluate_0.14            
#>  [79] lattice_0.20-41           purrr_0.3.4              
#>  [81] Rhdf5lib_1.10.0           GenomicAlignments_1.24.0 
#>  [83] bit_1.1-15.2              tidyselect_1.1.0         
#>  [85] plyr_1.8.6                magrittr_1.5             
#>  [87] R6_2.4.1                  generics_0.0.2           
#>  [89] DBI_1.1.0                 mgcv_1.8-31              
#>  [91] pillar_1.4.4              survival_3.2-3           
#>  [93] RCurl_1.98-1.2            tibble_3.0.1             
#>  [95] crayon_1.3.4              KernSmooth_2.23-17       
#>  [97] BiocFileCache_1.12.0      rmarkdown_2.3            
#>  [99] progress_1.2.2            grid_4.0.0               
#> [101] data.table_1.12.8         blob_1.2.1               
#> [103] digest_0.6.25             xtable_1.8-4             
#> [105] tidyr_1.1.0               illuminaio_0.30.0        
#> [107] openssl_1.4.2             glmnet_4.0-2             
#> [109] askpass_1.1               quadprog_1.5-8