DNA methylation age of human tissues and cell types. Genome Biol. 2013 14(10):R115. PMID: 24138928

Statistical goal and challenge
• Goal: Build an age prediction method based on tens of thousands of variables
  – dependent variable y = transformed version of chronological age (in years)
  – covariates = CpGs
  – Approach: penalized regression (elastic net)
• Challenge: how to combine multiple training data sets generated by different labs, etc.

Training data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
1 (turquoise) | Blood WB | 27K | Training | 715 (0.38) | 33 (16,88) | Horvath 2012
2 (blue) | Blood WB | 450K | Training | 94 (0.28) | 29 (18,65) | Horvath 2012
3 (brown) | Blood WB | 450K | Training | 656 (0.52) | 65 (19,100) | Hannum 2012
4 (blue2) | Blood PBMC | 450K | Training | 72 (0) | 3.1 (1,16) | Alisch 2012
5 (green) | Blood PBMC | 450K | Training | 48 (0.52) | 15 (3.5,76) | Harris et al 2012
6 (red) | Blood Cord | 27K | Training | 216 (0.51) | 0 (0,0) | Adkins 2011
7 (black) | Brain CRBLM | 27K | Training | 168 (NaN) | 45 (20,70) | Liu 2013
8 (pink) | Brain CRBLM | 27K | Training | 114 (0.3) | 44 (16,96) | Gibbs 2010
9 (magenta) | Brain FCTX | 27K | Training | 133 (0.32) | 43 (16,100) | Gibbs 2010
10 (purple) | Brain PONS | 27K | Training | 125 (0.3) | 43 (15,100) | Gibbs 2010
11 (greenyellow) | Brain Prefr. CTX | 27K | Training | 108 (0.48) | 19 (-0.5,84) | Numata 2012
12 (tan) | Brain Various Cells | 450K | Training | 145 (0.48) | 35 (13,79) | Guintivano 2013
13 (salmon) | Brain TCTX | 27K | Training | 127 (0.33) | 44 (15,100) | Gibbs 2010
14 (cyan) | Breast NL | 27K | Training | 23 (1) | 46 (19,75) | Zhuang 2012
15 (midnightblue) | Buccal | 27K | Training | 109 (0.61) | 15 (15,15) | Essex 2011
16 (indianred) | Buccal | 27K | Training | 8 (0.75) | 43 (16,68) | Rakyan 2010
17 (grey60) | Buccal | 450K | Training | 53 (0.45) | 0 (0,1.5) | Martino 2013
18 (green2) | Cartilage Knee | 27K | Training | 41 (0.49) | 66 (40,79) | Fernández-Tajes 2013
19 (gold) | Colon | 27K | Training | 35 (0.63) | 74 (43,90) | TCGA, COAD
20 (royalblue) | Colon | 450K | Training | 24 (0.54) | 14 (3.5,19) | Kellermayer 2013
21 (darkred) | Dermal fibroblast | 27K | Training | 14 (1) | 20 (6,73) | Koch 2011
22 (darkgreen) | Epidermis | 27K | Training | 10 (0) | 50 (26,71) | Gronniger 2010
23 (darkturquoise) | Gastric | 27K | Training | 52 (NaN) | 68 (25,88) | Zouridis 2012
24 (darkgrey) | Head+Neck | 450K | Training | 50 (0.24) | 62 (26,87) | TCGA, HNSC
25 (orange) | Heart | 27K | Training | 17 (0.41) | 55 (16,68) | Haas 2013
26 (darkorange) | Kidney | 450K | Training | 43 (0.3) | 66 (31,83) | TCGA, KIRP
27 (lightsteelblue2) | Kidney | 450K | Training | 160 (0.34) | 63 (38,90) | TCGA, KIRC
28 (skyblue) | Liver | 27K | Training | 57 (0.14) | 51 (20,79) | Shen 2012
29 (saddlebrown) | Lung NL Adj | 27K | Training | 27 (0.15) | 69 (52,83) | TCGA, LUSC
30 (steelblue) | Lung NL Adj | 27K | Training | 24 (0.58) | 66 (51,77) | TCGA, LUAD
31 (paleturquoise) | Lung NL Adj | 450K | Training | 40 (0.32) | 73 (40,85) | TCGA, LUSC
32 (violet) | MSC (bone marrow) | 27K | Training | 16 (0.38) | 52 (21,85) | Bork 2010
33 (darkolivegreen) | Placenta | 27K | Training | 28 (1) | 0 (0,0) | Gordon 2012
34 (darkmagenta) | Prostate NL | 27K | Training | 69 (0) | 61 (44,73) | Kobayashi 2011
35 (sienna3) | Prostate NL | 450K | Training | 44 (0) | 63 (44,72) | TCGA, PRAD
36 (yellowgreen) | Saliva | 27K | Training | 131 (0.015) | 29 (21,55) | Liu 2010
37 (skyblue3) | Saliva | 27K | Training | 69 (0) | 35 (21,55) | Bocklandt 2011
38 (plum1) | Stomach | 27K | Training | 41 (0.51) | 69 (43,87) | TCGA, STAD
39 (orangered4) | Thyroid | 450K | Training | 25 (0.8) | 40 (18,76) | TCGA, THCA

Test data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
40 (mediumpurple3) | Blood WB | 27K | Test | 191 (0.51) | 43 (24,74) | Teschendorff 2010
41 (lightsteelblue1) | Blood WB | 27K | Test | 93 (1) | 63 (49,74) | Rakyan 2010
42 (darkcyan) | Blood WB | 27K | Test | 262 (1) | 67 (49,91) | Song 2010
43 (orange) | Blood WB | 27K | Test | 269 (1) | 64 (52,78) | Teschendorff 2010 So
44 (green) | Blood WB | 450K | Test | 689 (0.71) | 54 (17,70) | Liu 2013
45 (darkorange2) | Blood PBMC | 27K | Test | 386 (0) | 9.3 (3.6,18) | Alisch 2012
46 (brown4) | Blood PBMC | 450K | Test | 38 (0.74) | 44 (0,100) | Heyn 2012
47 (bisque4) | Blood PBMC | 27K | Test | 92 (NaN) | 33 (24,45) | Lam 2012
48 (darkslateblue) | Blood Cord | 27K | Test | 48 (0.021) | 0 (0,0) | Turan
49 (plum2) | Blood Cord | 27K | Test | 84 (0.52) | 0 (0,0.75) | Khulan 2012
50 (thistle2) | Blood Cord | 27K | Test | 53 (0.45) | 0 (0,0) | Gordon 2012
51 (darkblue) | Blood CD4 Tcells | 450K | Test | 48 (NaN) | 0.5 (0,1) | Martino 2012
52 (salmon4) | Blood CD4+CD14 | 27K | Test | 50 (0.68) | 34 (16,69) | Rakyan 2010
53 (palevioletred3) | Blood Cell Types | 450K | Test | 16 (0.62) | 32 (17,60) | Heyn 2013
54 (brown3) | Brain Cerebellar | 27K | Test | 20 (0) | 22 (1,60) | Ginsberg 2012
55 (maroon) | Brain Occipital Cortex | 27K | Test | 16 (0) | 25 (1,60) | Ginsberg 2012
56 (lightpink4) | Breast NL Adj | 450K | Test | 81 (1) | 55 (28,90) | TCGA, BRCA
57 (lavenderblush3) | Breast NL Adj | 27K | Test | 27 (1) | 51 (35,88) | TCGA, BRCA
58 (deepskyblue) | Buccal | 450K | Test | 51 (0.45) | 0 (0,1.5) | Martino 2013
59 (darkseagreen4) | Colon | 450K | Test | 38 (0.45) | 72 (40,90) | TCGA, COAD
60 (coral1) | Fat Adip | 27K | Test | 10 (0.4) | 75 (73,78) | Ribel-Madsen 2012
61 (brown2) | Heart | 27K | Test | 6 (0) | 60 (55,71) | Pai 2011
62 (coral2) | Kidney | 27K | Test | 198 (0.35) | 60 (33,86) | TCGA, KIRC
63 (mediumorchid) | Liver | 450K | Test | 37 (0.35) | 68 (20,81) | TCGA, LIHC
64 (skyblue2) | Lung NL Adj | 450K | Test | 26 (0.46) | 66 (42,86) | TCGA, LUAD
65 (yellow4) | Muscle | 27K | Test | 22 (0.55) | 66 (53,78) | Ribel-Madsen 2012
66 (skyblue1) | Muscle | 27K | Test | 44 (0) | 25 (25,25) | Jacobsen 2012
67 (plum) | Placenta | 450K | Test | 40 (NaN) | 0 (0,0) | Blair 2013
68 (orangered3) | Saliva | 27K | Test | 52 (0.92) | 27 (21,55) | Liu 2010
69 (mediumpurple2) | Uterine Cervix | 27K | Test | 152 (1) | 25 (19,55) | Zhuang 2012
70 (lightsteelblue) | Uterine Endomet | 450K | Test | 28 (1) | 62 (35,90) | TCGA, UCEC
71 (lightcoral) | Various Tissues | 27K | Test | 44 (0.41) | 71 (0,83) | Myers 2012
72 (indianred4) | Chimp+Human Tissues | 27K | Other | 35 (0.4) | 47 (9,81) | Pai 2011
73 (firebrick4) | Ape WB | 450K | Other | 32 (0.62) | 22 (9,43) | Hernando-Herraez 201
74 (darkolivegreen4) | Sperm | 27K | Other | 19 (1) | 0 (0,0) | Pacheco 2011
75 (brown2) | Sperm | 450K | Other | 26 (0) | 0 (0,0) | Krausz 2012
76 (blue2) | Vasc. Endoth (Umbilical) | 27K | Other | 42 (0.43) | 0 (0,0) | Gordon 2012
77 | Stem cells+Somatic Cells | 27K | Other | 271 (NA) | NA | Nazor 2012
78 | Stem cells+Somatic Cells | 450K | Other | 153 (0.63) | NA | Nazor 2012
79 | Reprogrammed mesenchymal stromal cells | 450K | Other | 24 (NA) | NA | Shao 2012
80 | hESC and normal primary tissue | 27K | Other | 34 (NA) | NA | Calvanese 2012
81 | hESC | 27K | Other | 6 (NA) | NA | Ramos-Mejía 2012
82 | Blood Cell Types | 450K | Other | 60 (0) | NA | Reinius 2012

Construction of the epigenetic clock
• I assembled a large DNA methylation data set by combining publicly available individual data sets measured on the Illumina 27K or Illumina 450K array platform.
• The training and test data involved n=7844 non-cancer samples from 82 individual data sets, which assess DNA methylation levels in 51 different tissues and cell types.
• Although many data sets were collected for studying certain diseases, they largely involved healthy tissues.
  – In particular, cancer tissues were excluded.

Illumina data sets
• The first 39 data sets were used to construct ("train") the age predictor.
• Data sets 40-71 were used to test (validate) the age predictor.
• Data sets 72-82 served other purposes, e.g. to estimate the DNAm age of embryonic stem and iPS cells.
• Training data were chosen i) to represent a wide spectrum of tissues/cell types, ii) to involve samples whose mean age (43 years) is similar to that in the test data, and iii) to involve a high proportion of samples (37%) measured on the Illumina 450K platform, since many ongoing studies use this recent Illumina platform.
• Only 21369 CpGs were studied: those measured with the Infinium type II assay that were present on both Illumina platforms (Infinium 450K and 27K) and had fewer than 10 missing values across the data sets.

Age predictor
• To ensure an unbiased validation in the test data, only the training data were used to define the age predictor.
• A transformed version of chronological age was regressed on the CpGs using a penalized regression model (elastic net).
• The elastic net regression model automatically selected 353 CpGs.
• I refer to the 353 CpGs as (epigenetic) clock CpGs since their weighted average (formed by the regression coefficients) amounts to an epigenetic clock.
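
The age predictor steps above (transform age, fit an elastic net on CpG beta values, read off the selected CpGs, form the weighted average) can be sketched on toy data. This is only an illustration under stated assumptions, not the code behind the published clock. The slides do not give the age transformation, so the log-then-linear form and the cutoff ADULT_AGE = 20 are assumptions. The penalty settings (lam, alpha), the coordinate descent solver, all function names, and the simulated CpGs are likewise hypothetical.

```python
import math
import random
import statistics

ADULT_AGE = 20  # assumed cutoff; the slides do not state the transformation


def transform_age(age, adult_age=ADULT_AGE):
    """Assumed transformation: logarithmic up to adult_age, linear after it."""
    if age <= adult_age:
        return math.log(age + 1) - math.log(adult_age + 1)
    return (age - adult_age) / (adult_age + 1)


def inverse_transform(value, adult_age=ADULT_AGE):
    """Map a predicted transformed age back to years."""
    if value <= 0:
        return (adult_age + 1) * math.exp(value) - 1
    return value * (adult_age + 1) + adult_age


def soft_threshold(x, t):
    """Shrink x toward zero by t; this is what sets coefficients exactly to 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def elastic_net(X, y, lam=0.1, alpha=0.5, sweeps=200):
    """Coordinate descent on standardized columns.

    Minimizes (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2) and
    returns (intercept, coefficients) on the original beta-value scale.
    """
    n, p = len(X), len(X[0])
    means = [statistics.fmean(row[j] for row in X) for j in range(p)]
    sds = [statistics.pstdev([row[j] for row in X]) or 1.0 for j in range(p)]
    Z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
    y_mean = statistics.fmean(y)
    resid = [v - y_mean for v in y]
    coef = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the partial residual
            # (each standardized column has variance 1).
            rho = sum(Z[i][j] * resid[i] for i in range(n)) / n + coef[j]
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - coef[j]
            if delta:
                for i in range(n):
                    resid[i] -= Z[i][j] * delta
                coef[j] = new
    coef = [b / s for b, s in zip(coef, sds)]
    intercept = y_mean - sum(b * m for b, m in zip(coef, means))
    return intercept, coef


def dnam_age(betas, intercept, coef):
    """Weighted average of CpG beta values, mapped back to years."""
    linear = intercept + sum(b * x for b, x in zip(coef, betas))
    return inverse_transform(linear)


# Toy data (entirely simulated): 60 samples, 30 CpGs, 2 of which track age.
rng = random.Random(0)
ages = [rng.uniform(0, 90) for _ in range(60)]
y = [transform_age(a) for a in ages]
X = []
for t in y:
    row = [rng.gauss(0.5, 0.05) for _ in range(30)]  # uninformative CpGs
    row[0] = 0.4 + 0.08 * t + rng.gauss(0, 0.01)     # gains methylation with age
    row[1] = 0.6 - 0.08 * t + rng.gauss(0, 0.01)     # loses methylation with age
    X.append(row)

intercept, coef = elastic_net(X, y)
selected = [j for j, b in enumerate(coef) if b != 0.0]
# In-sample error only; the slides validate on separate test data sets.
errors = [abs(dnam_age(row, intercept, coef) - a) for row, a in zip(X, ages)]
print("selected CpGs:", selected)
print("median absolute error (years):", round(statistics.median(errors), 2))
```

The L1 part of the penalty is what sets most coefficients exactly to zero, which is how an elastic net can "automatically select" a subset of CpGs. In practice a tuned library implementation with a cross-validated penalty would replace this hand-written solver.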
Accuracy across tissues and cell types (training)

Accuracy across test data

Accuracy in brain tissue

Results sent to me via email

Blood data from Marco Boks, Jan 2014

Excerpts from emails

Epigenetic clock applied to large cohort studies
Median error is less than 3.5 years.

Aging clock applied to urine
• This figure was created by bioinformatician Wei Guo at Zymo Research.

Factors influencing accuracy: standard deviation of age, tissue

Using the clock for measuring the age of different parts of the body

The clock works in the genus Pan: common chimpanzees + bonobos

ES cells and iPS cells are perfectly young

Heritability (based on twin studies) of age acceleration is 40% in older subjects and 100% in newborns
Rows correspond to 2 different twin data sets
Red dots = monozygotic twin pair
Black dots = dizygotic twin pair

Conclusions
• Most studies that involved telomere length and other biomarkers can be revisited.
• User-friendly software can be found on my webpage.
  – I recommend the online age calculator since it outputs a host of array quality statistics that can be used to identify samples where the age prediction may not be accurate.
• Data get deleted right after you upload them.
  – Don't pre-process data too much. Don't remove batch effects, etc. Raw beta values will be fine.
• I am always happy to collaborate.
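
As a footnote to the accuracy and heritability results above, two quantities recur: the median error and age acceleration. The slides do not define either, so this sketch assumes the median error is the median absolute difference between DNAm age and chronological age, and that age acceleration is the signed difference for one sample. The function names and the five sample values are made up for illustration.

```python
import statistics


def age_acceleration(dnam_age, chron_age):
    """Assumed definition: DNAm age minus chronological age, in years."""
    return dnam_age - chron_age


def median_absolute_error(dnam_ages, chron_ages):
    """Median of |DNAm age - chronological age| over a set of samples."""
    return statistics.median(
        abs(age_acceleration(d, c)) for d, c in zip(dnam_ages, chron_ages)
    )


# Hypothetical cohort of five samples (values invented for the example).
chron = [25.0, 40.0, 61.0, 73.0, 8.0]
dnam = [27.5, 38.0, 64.0, 70.5, 9.0]

accelerations = [age_acceleration(d, c) for d, c in zip(dnam, chron)]
print("age acceleration per sample:", accelerations)
print("median absolute error:", median_absolute_error(dnam, chron))
```

A positive age acceleration means the tissue looks older than the person's chronological age under the clock, and a negative value means it looks younger.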