HorvathEpigeneticClock2014

DNA methylation age of human
tissues and cell types.
Genome Biol. 2013 14(10):R115 PMID: 24138928
Statistical goal and challenge
• Goal: Build an age prediction method
based on tens of thousands of variables
– dependent variable y = transformed version of
chronological age (in years)
– covariates = CpGs
– Approach: penalized regression (elastic net)
• Challenge: how to combine multiple
training data sets generated by different labs,
on different platforms, etc.
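The slides do not spell out the age transformation. As a minimal sketch, the code below uses the form described in the cited Genome Biology paper as best I recall it: logarithmic up to an "adult age" and linear afterwards, with adult age set to 20 years. Treat both the functional form and the constant as assumptions to check against the paper; the function names are hypothetical.

```python
import math

ADULT_AGE = 20  # assumption: adult-age constant recalled from the paper


def transform_age(age, adult_age=ADULT_AGE):
    """Map chronological age (years) to the regression outcome y."""
    if age <= adult_age:
        # logarithmic part: stretches the fast-changing childhood years
        return math.log(age + 1) - math.log(adult_age + 1)
    # linear part for adults
    return (age - adult_age) / (adult_age + 1)


def inverse_transform_age(y, adult_age=ADULT_AGE):
    """Map a predicted y back to an age in years."""
    if y <= 0:
        return (adult_age + 1) * math.exp(y) - 1
    return (adult_age + 1) * y + adult_age


if __name__ == "__main__":
    for age in (-0.5, 0, 3.1, 20, 43, 100):
        y = transform_age(age)
        print(age, round(y, 4), round(inverse_transform_age(y), 4))
```

The two branches meet at the adult age (y = 0), so the transformation is continuous and can be inverted to turn a regression prediction back into years.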
Training data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
1 (turquoise) | Blood WB | 27K | Training | 715 (0.38) | 33 (16,88) | Horvath 2012
2 (blue) | Blood WB | 450K | Training | 94 (0.28) | 29 (18,65) | Horvath 2012
3 (brown) | Blood WB | 450K | Training | 656 (0.52) | 65 (19,100) | Hannum 2012
4 (blue2) | Blood PBMC | 450K | Training | 72 (0) | 3.1 (1,16) | Alisch 2012
5 (green) | Blood PBMC | 450K | Training | 48 (0.52) | 15 (3.5,76) | Harris et al 2012
6 (red) | Blood Cord | 27K | Training | 216 (0.51) | 0 (0,0) | Adkins 2011
7 (black) | Brain CRBLM | 27K | Training | 168 (NaN) | 45 (20,70) | Liu 2013
8 (pink) | Brain CRBLM | 27K | Training | 114 (0.3) | 44 (16,96) | Gibbs 2010
9 (magenta) | Brain FCTX | 27K | Training | 133 (0.32) | 43 (16,100) | Gibbs 2010
10 (purple) | Brain PONS | 27K | Training | 125 (0.3) | 43 (15,100) | Gibbs 2010
11 (greenyellow) | Brain Prefr.CTX | 27K | Training | 108 (0.48) | 19 (-0.5,84) | Numata 2012
12 (tan) | Brain Various Cells | 450K | Training | 145 (0.48) | 35 (13,79) | Guintivano 2013
13 (salmon) | Brain TCTX | 27K | Training | 127 (0.33) | 44 (15,100) | Gibbs 2010
14 (cyan) | Breast NL | 27K | Training | 23 (1) | 46 (19,75) | Zhuang 2012
15 (midnightblue) | Buccal | 27K | Training | 109 (0.61) | 15 (15,15) | Essex 2011
16 (indianred) | Buccal | 27K | Training | 8 (0.75) | 43 (16,68) | Rakyan 2010
17 (grey60) | Buccal | 450K | Training | 53 (0.45) | 0 (0,1.5) | Martino 2013
18 (green2) | Cartilage Knee | 27K | Training | 41 (0.49) | 66 (40,79) | Fernández-Tajes 2013
19 (gold) | Colon | 27K | Training | 35 (0.63) | 74 (43,90) | TCGA, COAD
20 (royalblue) | Colon | 450K | Training | 24 (0.54) | 14 (3.5,19) | Kellermayer 2013
21 (darkred) | Dermal fibroblast | 27K | Training | 14 (1) | 20 (6,73) | Koch 2011
22 (darkgreen) | Epidermis | 27K | Training | 10 (0) | 50 (26,71) | Gronniger 2010
23 (darkturquoise) | Gastric | 27K | Training | 52 (NaN) | 68 (25,88) | Zouridis 2012
24 (darkgrey) | Head+Neck | 450K | Training | 50 (0.24) | 62 (26,87) | TCGA, HNSC
25 (orange) | Heart | 27K | Training | 17 (0.41) | 55 (16,68) | Haas 2013
26 (darkorange) | Kidney | 450K | Training | 43 (0.3) | 66 (31,83) | TCGA, KIRP
27 (lightsteelblue2) | Kidney | 450K | Training | 160 (0.34) | 63 (38,90) | TCGA, KIRC
28 (skyblue) | Liver | 27K | Training | 57 (0.14) | 51 (20,79) | Shen 2012
29 (saddlebrown) | Lung NL Adj | 27K | Training | 27 (0.15) | 69 (52,83) | TCGA, LUSC
30 (steelblue) | Lung NL Adj | 27K | Training | 24 (0.58) | 66 (51,77) | TCGA, LUAD
31 (paleturquoise) | Lung NL Adj | 450K | Training | 40 (0.32) | 73 (40,85) | TCGA, LUSC
32 (violet) | MSC (bonemarrow) | 27K | Training | 16 (0.38) | 52 (21,85) | Bork 2010
33 (darkolivegreen) | Placenta | 27K | Training | 28 (1) | 0 (0,0) | Gordon 2012
34 (darkmagenta) | Prostate NL | 27K | Training | 69 (0) | 61 (44,73) | Kobayashi 2011
35 (sienna3) | Prostate NL | 450K | Training | 44 (0) | 63 (44,72) | TCGA, PRAD
36 (yellowgreen) | Saliva | 27K | Training | 131 (0.015) | 29 (21,55) | Liu 2010
37 (skyblue3) | Saliva | 27K | Training | 69 (0) | 35 (21,55) | Bockland 2011
38 (plum1) | Stomach | 27K | Training | 41 (0.51) | 69 (43,87) | TCGA, STAD
39 (orangered4) | Thyroid | 450K | Training | 25 (0.8) | 40 (18,76) | TCGA, THCA
Test data sets and other data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
40 (mediumpurple3) | Blood WB | 27K | Test | 191 (0.51) | 43 (24,74) | Teschendorff 2010
41 (lightsteelblue1) | Blood WB | 27K | Test | 93 (1) | 63 (49,74) | Rakyan 2010
42 (darkcyan) | Blood WB | 27K | Test | 262 (1) | 67 (49,91) | Song 2010
43 (orange) | Blood WB | 27K | Test | 269 (1) | 64 (52,78) | Teschendorff 2010 So
44 (green) | Blood WB | 450K | Test | 689 (0.71) | 54 (17,70) | Liu 2013
45 (darkorange2) | Blood PBMC | 27K | Test | 386 (0) | 9.3 (3.6,18) | Alisch 2012
46 (brown4) | Blood PBMC | 450K | Test | 38 (0.74) | 44 (0,100) | Heyn 2012
47 (bisque4) | Blood PBMC | 27K | Test | 92 (NaN) | 33 (24,45) | Lam 2012
48 (darkslateblue) | Blood Cord | 27K | Test | 48 (0.021) | 0 (0,0) | Turan
49 (plum2) | Blood Cord | 27K | Test | 84 (0.52) | 0 (0,0.75) | Khulan 2012
50 (thistle2) | Blood Cord | 27K | Test | 53 (0.45) | 0 (0,0) | Gordon 2012
51 (darkblue) | Blood CD4 Tcells | 450K | Test | 48 (NaN) | 0.5 (0,1) | Martino 2012
52 (salmon4) | Blood CD4+CD14 | 27K | Test | 50 (0.68) | 34 (16,69) | Rakyan 2010
53 (palevioletred3) | Blood Cell Types | 450K | Test | 16 (0.62) | 32 (17,60) | Heyn 2013
54 (brown3) | Brain Cerebellar | 27K | Test | 20 (0) | 22 (1,60) | Ginsberg 2012
55 (maroon) | Brain Occipital Cortex | 27K | Test | 16 (0) | 25 (1,60) | Ginsberg 2012
56 (lightpink4) | Breast NL Adj | 450K | Test | 81 (1) | 55 (28,90) | TCGA, BRCA
57 (lavenderblush3) | Breast NL Adj | 27K | Test | 27 (1) | 51 (35,88) | TCGA, BRCA
58 (deepskyblue) | Buccal | 450K | Test | 51 (0.45) | 0 (0,1.5) | Martino 2013
59 (darkseagreen4) | Colon | 450K | Test | 38 (0.45) | 72 (40,90) | TCGA, COAD
60 (coral1) | Fat Adip | 27K | Test | 10 (0.4) | 75 (73,78) | Ribel-Madsen 2012
61 (brown2) | Heart | 27K | Test | 6 (0) | 60 (55,71) | Pai 2011
62 (coral2) | Kidney | 27K | Test | 198 (0.35) | 60 (33,86) | TCGA, KIRC
63 (mediumorchid) | Liver | 450K | Test | 37 (0.35) | 68 (20,81) | TCGA, LIHC
64 (skyblue2) | Lung NL Adj | 450K | Test | 26 (0.46) | 66 (42,86) | TCGA, LUAD
65 (yellow4) | Muscle | 27K | Test | 22 (0.55) | 66 (53,78) | Ribel-Madsen 2012
66 (skyblue1) | Muscle | 27K | Test | 44 (0) | 25 (25,25) | Jacobsen 2012
67 (plum) | Placenta | 450K | Test | 40 (NaN) | 0 (0,0) | Blair 2013
68 (orangered3) | Saliva | 27K | Test | 52 (0.92) | 27 (21,55) | Liu 2010
69 (mediumpurple2) | Uterine Cervix | 27K | Test | 152 (1) | 25 (19,55) | Zhuang 2012
70 (lightsteelblue) | Uterine Endomet | 450K | Test | 28 (1) | 62 (35,90) | TCGA, UCEG
71 (lightcoral) | Various Tissues | 27K | Test | 44 (0.41) | 71 (0,83) | Myers 2012
72 (indianred4) | Chimp+Human Tissues | 27K | Other | 35 (0.4) | 47 (9,81) | Pai 2011
73 (firebrick4) | Ape WB | 450K | Other | 32 (0.62) | 22 (9,43) | Hernando-Herraez 201
74 (darkolivegreen4) | Sperm | 27K | Other | 19 (1) | 0 (0,0) | Pacheco 2011
75 (brown2) | Sperm | 450K | Other | 26 (0) | 0 (0,0) | Krausz 2012
76 (blue2) | Vasc.Endoth (Umbilical) | 27K | Other | 42 (0.43) | 0 (0,0) | Gordon 2012
77 | Stem cells+Somatic Cells | 27K | Other | 271 (NA) | NA | Nazor 2012
78 | Stem cells+Somatic Cells | 450K | Other | 153 (0.63) | NA | Nazor 2012
79 | Reprogrammed mesenchymal stromal cells | 450K | Other | 24 (NA) | NA | Shao 2012
80 | hESC and normal primary tissue | 27K | Other | 34 (NA) | NA | Calvanese 2012
81 | hESC | 27K | Other | 6 (NA) | NA | Ramos-Mejía 2012
82 | Blood Cell Types | 450K | Other | 60 (0) | NA | Reinius 2012
Construction of the epigenetic clock
• Assembled a large DNA methylation data set
by combining publicly available individual data
sets measured on the Illumina 27K or Illumina
450K array platform.
• Training + test data involved n=7844 non-cancer
samples from 82 individual data sets, which
assess DNA methylation levels in 51 different
tissues and cell types.
• Although many data sets were collected for
studying certain diseases, they largely involved
healthy tissues.
– In particular, cancer tissues were excluded.
Illumina data sets
• The first 39 data sets were used to construct ("train") the
age predictor.
• Data sets 40-71 were used to test (validate) the age
predictor.
• Data sets 72-82 served other purposes, e.g. to estimate
the DNAm age of embryonic stem and iPS cells.
• Training data were chosen i) to represent a wide
spectrum of tissues/cell types, ii) to involve samples
whose mean age (43 years) is similar to that in the test
data, and iii) to involve a high proportion of samples
(37%) measured on the Illumina 450K platform, since
many ongoing studies use this recent Illumina platform.
• Only 21369 CpGs were studied: those (measured with the
Infinium type II assay) which were present on both Illumina
platforms (Infinium 450K and 27K) and had fewer than 10
missing values across the data sets.
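The probe selection described above amounts to a set intersection followed by a missing-value filter. Here is a minimal sketch, assuming probe IDs are strings and that missing values have already been counted per probe across all data sets. The function name, the argument layout, and the toy probe IDs are hypothetical, and the assay-type criterion is not modeled.

```python
def shared_probes(probes_27k, probes_450k, missing_counts, max_missing=10):
    """Return probes on both platforms with fewer than max_missing missing values.

    missing_counts maps probe ID -> number of missing values across all
    data sets; probes absent from the mapping are treated as complete.
    """
    common = set(probes_27k) & set(probes_450k)
    return sorted(p for p in common if missing_counts.get(p, 0) < max_missing)


if __name__ == "__main__":
    # Toy probe IDs, purely illustrative
    on_27k = ["cgA", "cgB", "cgC", "cgD"]
    on_450k = ["cgB", "cgC", "cgD", "cgE"]
    missing = {"cgB": 0, "cgC": 12, "cgD": 9}
    print(shared_probes(on_27k, on_450k, missing))
```

Restricting the predictor to probes shared by both arrays is what lets a single set of coefficients be applied to 27K and 450K data alike.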
Age predictor
• To ensure an unbiased validation in the test
data, only the training data were used to define the
age predictor.
• A transformed version of chronological age was
regressed on the CpGs using a penalized
regression model (elastic net).
• The elastic net regression model automatically
selected 353 CpGs.
• I refer to the 353 CpGs as (epigenetic) clock
CpGs since their weighted average (formed by
the regression coefficients) amounts to an
epigenetic clock.
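To illustrate how an elastic net can "automatically select" CpGs, here is a minimal pure-Python sketch of coordinate descent for the elastic net objective (1/2n)·RSS + λ[α·|b|₁ + (1−α)/2·|b|₂²]. It is not the software used for the clock; the penalty values, the toy methylation matrix, and all function names are assumptions chosen for illustration. The L1 part of the penalty sets the coefficients of weakly informative columns exactly to zero, and the prediction is the intercept plus the coefficient-weighted sum of beta values.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within gamma become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=500):
    """Fit intercept and coefficients by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center columns and outcome so the intercept drops out of the updates
    cols = [[row[j] - x_means[j] for row in X] for j in range(p)]
    residual = [v - y_mean for v in y]
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            # covariance of column j with the partial residual (j excluded)
            rho = sum(c * (r + c * coefs[j]) for c, r in zip(col, residual)) / n
            denom = sum(c * c for c in col) / n + lam * (1 - alpha)
            new = soft_threshold(rho, lam * alpha) / denom
            delta = new - coefs[j]
            if delta:
                residual = [r - c * delta for c, r in zip(col, residual)]
                coefs[j] = new
    intercept = y_mean - sum(m * b for m, b in zip(x_means, coefs))
    return intercept, coefs


def predict(intercept, coefs, betas):
    """Weighted sum of beta values plus intercept (transformed-age scale)."""
    return intercept + sum(c * b for c, b in zip(coefs, betas))


if __name__ == "__main__":
    # Toy data: 8 samples x 3 CpGs; only the first CpG tracks the outcome
    x0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    x1 = [0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5]
    x2 = [0.9, 0.9, 0.8, 0.9, 0.8, 0.9, 0.9, 0.8]
    X = [list(row) for row in zip(x0, x1, x2)]
    y = [2 * v - 1 for v in x0]  # hypothetical transformed ages
    intercept, coefs = fit_elastic_net(X, y, lam=0.02, alpha=0.5)
    print(intercept, coefs)
    print(predict(intercept, coefs, X[0]))
```

In this toy fit only the first CpG keeps a non-zero coefficient, which is the same mechanism by which a sparse subset of CpGs is retained from tens of thousands of candidates. The prediction is on the transformed scale and has to be mapped back to years with the inverse of the age transformation.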
Accuracy across tissues and cell types (training)
Accuracy across test data
Accuracy in brain tissue
Results sent to me via email
Blood data from Marco Boks Jan 2014
Excerpts from emails
Epigenetic clock applied to large cohort studies
Median error is less than 3.5 years.
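"Error" here is read as the median absolute difference between predicted DNAm age and chronological age. A minimal sketch of that calculation follows; the function name and the example ages are made up for illustration.

```python
from statistics import median


def median_absolute_error(dnam_ages, chronological_ages):
    """Median of |DNAm age - chronological age| across samples (years)."""
    return median(abs(p - a) for p, a in zip(dnam_ages, chronological_ages))


if __name__ == "__main__":
    predicted = [30.5, 41.0, 58.0, 22.0, 70.0]  # hypothetical DNAm ages
    actual = [33, 40, 55, 25, 64]               # hypothetical true ages
    print(median_absolute_error(predicted, actual))
```

Using the median rather than the mean keeps a few badly predicted samples from dominating the summary.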
Aging clock applied to urine
• Figure created by bioinformatician Wei Guo at Zymo
Research.
Factors influencing accuracy:
standard deviation of age, tissue
Using the clock for measuring the age of
different parts of the body
The clock works in the genus Pan: common
chimpanzees and bonobos
ES cells and iPS cells are perfectly young
Heritability (based on twin studies) of age acceleration
is 40% in older subjects and 100% in newborns
Rows correspond to 2 different twin data sets
Red dots=monozygotic twin pair
Black dots=dizygotic twin pair
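One common way to turn twin correlations into a heritability estimate is Falconer's formula, h² = 2·(r_MZ − r_DZ), where r_MZ and r_DZ are the correlations of the trait (here, age acceleration) within monozygotic and dizygotic twin pairs. The sketch below assumes that estimator and a plain Pearson correlation between co-twins; whether this matches the exact analysis behind the slide is an assumption, and the pair values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def falconer_heritability(mz_pairs, dz_pairs):
    """h^2 = 2 * (r_MZ - r_DZ); each pair is (twin 1 value, twin 2 value)."""
    r_mz = pearson([a for a, _ in mz_pairs], [b for _, b in mz_pairs])
    r_dz = pearson([a for a, _ in dz_pairs], [b for _, b in dz_pairs])
    return 2 * (r_mz - r_dz)


if __name__ == "__main__":
    # Hypothetical age acceleration values (DNAm age minus chronological age)
    mz = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
    dz = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
    print(falconer_heritability(mz, dz))
```

With small twin samples the estimate is noisy and can fall outside the 0 to 1 range, so in practice it is usually truncated and reported with its uncertainty.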
Conclusions
• Most studies that involved telomere length and
other biomarkers can be revisited with the epigenetic clock.
• User-friendly software can be found on my
webpage.
– I recommend the online age calculator since it
outputs a host of array quality statistics that can be
used to identify samples where the age prediction
may not be accurate.
– Data get deleted right after you upload them.
– Don't pre-process data too much. Don't remove
batch effects, etc. Raw beta values will be fine.
• I am always happy to collaborate.