TalkBanffHorvath2014

Empirical evaluation of prediction and correlation network methods
applied to genomic data
Steve Horvath
University of California, Los Angeles
Content
Review of weighted correlation network analysis
(WGCNA)
When Is Hub Gene Selection Better than Standard
Meta-Analysis?
Evaluating systems biologic gene selection methods
The epigenetic clock: a highly accurate genomic
predictor of age
What is weighted correlation
network analysis (WGCNA)?
Construct a network
Rationale: make use of interaction patterns between genes
Identify modules
Rationale: module (pathway) based analysis
Relate modules to external information
Array Information: Clinical data, SNPs, proteomics
Gene Information: gene ontology, EASE, IPA
Rationale: find biologically interesting modules
Study Module Preservation across different data
Rationale:
• Same data: to check robustness of module definition
• Different data: to find interesting modules.
Find the key drivers in interesting modules
Tools: intramodular connectivity, causality testing
Rationale: experimental validation, therapeutics, biomarkers
Weighted correlation networks are valuable
for a biologically meaningful…
• reduction of high dimensional data
– expression: microarray, RNA-seq
– gene methylation data, fMRI data, etc.
• integration of multiscale data
– expression data from multiple tissues
– SNPs (module QTL analysis)
– Complex phenotypes
An anatomically comprehensive atlas of
the adult human brain transcriptome
MJ Hawrylycz, E Lein,..,AR Jones
(2012) Nature 489, 391-399
Allen Brain Institute
Data
• Brains from two healthy males (ages 24 and 39)
• 170 brain structures
• over 900 microarray samples per individual
• 64K Agilent microarray
• This data set provides a neuroanatomically precise,
genome-wide map of transcript distributions
Modules in brain 1
Global gene networks.
How to construct a weighted
correlation network?
Systems biology as a field of study: interactions between the
components of biological systems
Network=Adjacency Matrix
• A network can be represented by an adjacency
matrix, A=[aij], that encodes whether/how a
pair of nodes is connected.
– A is a symmetric matrix with entries in [0,1]
– For unweighted network, entries are 1 or 0
depending on whether or not 2 nodes are adjacent
(connected)
– For weighted networks, the adjacency matrix reports
the connection strength between node pairs
– Our convention: diagonal elements of A are all 1.
Two types of weighted correlation networks
Unsigned network, absolute value:
$a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$

Signed network preserves sign info:
$a_{ij} = |0.5 + 0.5\,\mathrm{cor}(x_i, x_j)|^{\beta}$

Default values: β=6 for unsigned and β=12 for signed
networks.
We prefer signed networks…
Zhang et al SAGMB Vol. 4: No. 1, Article 17.
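The two adjacency definitions can be sketched in a few lines of plain Python. This is a minimal illustration and not the WGCNA implementation: the function names `pearson` and `adjacency` and the toy expression profiles are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, beta, signed):
    """Soft-thresholded adjacency matrix; diagonal entries are set to 1."""
    n = len(profiles)
    A = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            # signed: 0.5 + 0.5*cor keeps sign information; unsigned: |cor|
            base = 0.5 + 0.5 * r if signed else abs(r)
            A[i][j] = A[j][i] = base ** beta
    return A

# Toy profiles (hypothetical): gene 2 is perfectly anti-correlated with gene 0.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
unsigned = adjacency(profiles, beta=6, signed=False)
signed = adjacency(profiles, beta=12, signed=True)
print(unsigned[0][2], signed[0][2])
```

In the unsigned network the anti-correlated pair is strongly connected, while in the signed network its adjacency drops to zero, which is the practical difference between the two conventions.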
Adjacency versus correlation in unsigned
and signed networks
Unsigned Network
Signed Network
Advantages of soft thresholding with the
power function
1. Robustness: Network results are highly robust with
respect to the choice of the power β (Zhang et al 2005)
2. Calibration of different networks becomes
straightforward, which facilitates consensus module
analysis
3. Module preservation statistics are particularly sensitive
for measuring connectivity preservation in weighted
networks
4. Math reason: Geometric Interpretation of Gene Coexpression Network Analysis. PLoS Computational Biology 4(8):
e1000117
How to detect network modules?
Systems biology as a paradigm, usually defined in antithesis to the so-called reductionist paradigm (biological organization)
Module Definition
• Based on the resulting cluster tree, we define
modules as branches
• Modules are either labeled by integers (1,2,3…)
or equivalently by colors (turquoise, blue,
brown, etc)
• We often use average linkage hierarchical
clustering coupled with the topological overlap
dissimilarity measure.
• Next we use the dynamic tree cutting method
to define clusters. Langfelder et al 2007
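The topological overlap dissimilarity used as clustering input can be sketched as follows. The slide does not give the formula; the version below is the weighted-network definition from Zhang and Horvath (2005), and the function name and toy adjacency matrix are assumptions. Hierarchical clustering and dynamic tree cutting are not shown.

```python
def topological_overlap(A):
    """Topological overlap matrix for a symmetric adjacency matrix A.

    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where u runs over nodes other than i and j, and k_i is the
    connectivity of node i (row sum without the diagonal).
    """
    n = len(A)
    k = [sum(A[i][u] for u in range(n) if u != i) for i in range(n)]
    T = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(A[i][u] * A[u][j] for u in range(n) if u not in (i, j))
            T[i][j] = T[j][i] = (shared + A[i][j]) / (min(k[i], k[j]) + 1.0 - A[i][j])
    return T

# Hypothetical weighted network on three nodes.
A = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
tom = topological_overlap(A)
# Dissimilarity that would be passed to average linkage hierarchical clustering.
diss_tom = [[1.0 - t for t in row] for row in tom]
```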
Defining modules based on a
hierarchical cluster tree
Langfelder P, Zhang B et al (2007) Defining clusters from a hierarchical cluster tree: the
Dynamic Tree Cut library for R. Bioinformatics 2008 24(5):719-720
Module=branch of a cluster
tree
Dynamic hybrid branch
cutting method combines
advantages of hierarchical
clustering and pam clustering
How does one find “consensus”
modules based on multiple gene
expression data (networks)?
Example: Multiple Human brain
expression data sets from Huntington's
Disease
Publicly available caudate nucleus gene expression
data from HD subjects and controls
1) Durrenberger et al (2012). Selection of novel reference genes for
use in the human central nervous system: a BrainNet Europe Study.
Acta Neuropathol. 2012 Dec;124(6):893-903
2) Hodges et al Luthi-Carter (2006) Regional and cellular gene
expression changes in human Huntington’s disease brain. Human
Molecular Genetics, 2006, Vol. 15, No. 6
Analysis steps of WGCNA
1. Construct a signed weighted correlation network
based on 2 human gene expression data sets
Purpose: keep track of co-expression relationships
2. Identify consensus modules
Purpose: find robustly defined and reproducible modules
Technique: Consensus adjacency is a quantile of the input
e.g. minimum, lower quartile, median
3. Relate modules to external information
HD disease status
Gene Information: gene ontology, cell marker genes
Purpose: find biologically meaningful modules
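Step 2 defines the consensus adjacency as a quantile of the input adjacencies. A minimal sketch, assuming the input networks have already been calibrated to be comparable and using hypothetical 2x2 matrices, is:

```python
import statistics

def consensus_adjacency(adjacencies, combine=min):
    """Entry-wise combination of several adjacency matrices.

    `combine` maps a list of values to one value, e.g. min for the
    minimum or statistics.median for the median.
    """
    n = len(adjacencies[0])
    return [
        [combine([A[i][j] for A in adjacencies]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical adjacencies of the same gene pair in three data sets.
A1 = [[1.0, 0.8], [0.8, 1.0]]
A2 = [[1.0, 0.5], [0.5, 1.0]]
A3 = [[1.0, 0.6], [0.6, 1.0]]

cons_min = consensus_adjacency([A1, A2, A3], combine=min)
cons_median = consensus_adjacency([A1, A2, A3], combine=statistics.median)
```

The minimum is the most conservative choice: a pair of genes is strongly connected in the consensus network only if it is strongly connected in every input network.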
Consensus dendrogram with module colors and meta-analysis
significance for diagnosis. The colors correspond to the meta-analysis Z score (with
weights proportional to the square root of the number of degrees of freedom); blue denotes
genes that are down in HD vs controls, and red denotes genes that are up in HD vs controls.
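One common way to form such a weighted meta-analysis Z score is the weighted Stouffer method. The sketch below assumes that this is the combination rule meant here; the function name and the example numbers are hypothetical.

```python
import math

def meta_z(z_scores, dofs):
    """Weighted Stouffer Z with weights equal to sqrt(degrees of freedom)."""
    weights = [math.sqrt(d) for d in dofs]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

# Hypothetical gene: Z = 2.0 in a study with 36 DOF, Z = 1.0 in one with 64 DOF.
z = meta_z([2.0, 1.0], [36, 64])
print(z)  # 2.0
```

A positive combined Z would be drawn in red (up in HD) and a negative one in blue (down in HD).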
Question: How does one summarize the
expression profiles in a module?
Answer: This has been solved.
Math answer: module eigengene
= first principal component
Network answer: the most highly connected
intramodular hub gene
Both turn out to be equivalent
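The module eigengene can be sketched without any libraries by standardizing each gene and running power iteration on the sample-by-sample cross-product matrix. This is an illustrative sketch, not the WGCNA routine: the function names, the iteration count, and the toy module are assumptions.

```python
import math

def standardize(row):
    """Center a gene profile and scale it to unit sample variance."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def module_eigengene(expr, iterations=200):
    """First principal component across samples of a genes-by-samples matrix."""
    X = [standardize(gene) for gene in expr]
    n = len(X[0])
    # Sample-by-sample matrix X^T X; its leading eigenvector is the eigengene.
    M = [[sum(g[s] * g[t] for g in X) for t in range(n)] for s in range(n)]
    v = [float(s + 1) for s in range(n)]  # non-constant starting vector
    for _ in range(iterations):
        w = [sum(M[s][t] * v[t] for t in range(n)) for s in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Fix the sign so the eigengene points along the average expression.
    mean_profile = [sum(g[s] for g in X) / len(X) for s in range(n)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-c for c in v]
    return v

# Hypothetical module: three co-expressed genes measured on four samples.
module = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 3.0, 2.0, 5.0],
]
me = module_eigengene(module)
```

The returned vector has one value per sample and unit length, so it can be correlated directly with sample traits.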
Module Eigengene= measure of overexpression=average redness
Rows=genes, Columns=microarrays
[Figure: brown module heat map and module eigengene; index labels 1 to 185, value axis from -0.1 to 0.4]
The brown module eigengenes across samples
Module eigengenes are very useful
• 1) They allow one to relate modules to each other
– Allows one to determine whether modules should be
merged
• 2) They allow one to relate modules to clinical traits
(HD status) and genetic variation (e.g. CAG trinucleotide repeat length)
-> avoids multiple comparison problem
• 3) They allow one to define a measure of module
membership: kME=cor(x,ME)
– Can be used for finding centrally located hub genes
– Can be used to define gene lists for GO enrichment
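The module membership measure kME=cor(x,ME) can be illustrated as below. The eigengene values and gene names are hypothetical, and the eigengene is assumed to have been computed beforehand.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def module_membership(expr, eigengene):
    """kME for every gene: correlation of its profile with the eigengene."""
    return {gene: pearson(profile, eigengene) for gene, profile in expr.items()}

# Hypothetical eigengene over four samples and three gene profiles.
eigengene = [-0.6, -0.2, 0.2, 0.6]
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
kme = module_membership(expr, eigengene)
hub = max(kme, key=kme.get)  # gene with the highest module membership
```

Ranking genes by kME gives both the candidate hub genes and a natural cut-off for gene lists passed to enrichment analysis.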
Content
When Is Hub Gene Selection Better than Standard Meta-Analysis?
Evaluating systems biologic gene selection methods
When does hub gene selection lead to more
meaningful gene lists than a standard statistical
analysis based on significance testing?
• Here we address this question for the special case when
multiple data sets are available.
• This is of great practical importance since for many research
questions multiple gene expression or other -omics data sets
are publicly available.
• In this case, the data analyst can decide between a standard
statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects
intramodular hubs in consensus modules.
Intramodular hub genes
versus whole network hubs
• Intramodular hubs have high intramodular
connectivity kME with respect to a given
module of interest
• Whole network hubs have high values of whole
network connectivity k
– k= row sum of the adjacency matrix
– k= number of direct neighbors in case of an
unweighted network
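Whole network connectivity can be computed as a row sum. The sketch below assumes the diagonal (which is 1 by the convention above) is left out, so that k equals the number of direct neighbors in the unweighted case; the example network is hypothetical.

```python
def connectivity(A):
    """Whole network connectivity: row sums of A without the diagonal."""
    return [
        sum(a for j, a in enumerate(row) if j != i)
        for i, row in enumerate(A)
    ]

# Hypothetical unweighted network: edges 0-1, 0-2, 2-3.
unweighted = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
k = connectivity(unweighted)
print(k)  # [2, 1, 2, 1]
```

The same function applies to a weighted adjacency matrix, where k becomes the sum of connection strengths.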
Q&A
• 1. Are whole-network hub genes relevant or should one
exclusively focus on intramodular hubs?
• Answer: Focus exclusively on intramodular hubs in trait-related
modules.
• 2. Do network-based gene selection strategies lead to gene lists
that are biologically more informative than those based on
standard marginal approaches?
• Answer: Yes, gene selection based on intramodular connectivity
leads to biologically more informative gene lists than marginal
approaches.
• 3. Do network-based gene selection strategies lead to gene lists
that have more reproducible trait associations than those based
on standard marginal approaches?
• Answer: Overall no. But in the case of a weak signal, networks
can help.
Criteria for judging gene selection
methods
• Criterion 1 evaluates the biological insights
gained, i.e. it is relevant in basic research.
• Criterion 2 evaluates the validation success in
independent data sets, i.e. it is relevant when it
comes to developing diagnostic or prognostic
biomarkers.
Data sets used in the empirical
evaluation
• We compare standard meta-analysis with consensus
network analysis in three comprehensive and unbiased
empirical studies:
• (1) Find genes predictive of lung cancer survival
– Gold standard=cell proliferation related genes
• (2) Find age related DNA methylation markers
– Gold standard= Polycomb group target genes
• (3) Find genes related to total cholesterol in mouse liver
tissues
– Gold standard= immune system related genes
R code in the WGCNA package
• For standard screening, we used the
metaAnalysis function
• For finding hubs in consensus modules, we
used the consensusKME function
Results
• The results demonstrate that intramodular
hub gene status is more useful than a meta-analysis p-value when identifying biologically
meaningful gene lists (reflecting criterion 1).
• However, meta-analysis methods perform as
well as (if not better than) a co-expression
network approach in terms of validation
success (criterion 2).
Overview of biological aging clocks
Here a biological aging clock
• is defined as a method for predicting the age (in
years) of a subject/biological sample
• Examples
1. based on telomere length
2. based on gene expression levels
3. based on protein expression levels
4. based on DNA methylation levels
Telomere length versus age in white blood cells
• Relation between age and TRF in men
(r=−0.45) and in women (r=−0.48)
Benetos A, et al (2001) Telomere Length as an Indicator of Biological Aging: The Gender Effect
and Relation With Pulse Pressure and Pulse Wave Velocity. Hypertension. 2001
p16INK4a clock
CDKN2A=p16Ink4A=tumor suppressor
• tumor suppressor protein encoded by the CDKN2A
gene
• Cyclin-dependent kinase inhibitor 2A, (CDKN2A,
p16Ink4A)
– also known as multiple tumor suppressor 1 (MTS-1)
• p16 plays an important role in regulating the cell cycle,
and mutations in p16 increase the risk of developing a
variety of cancers, notably melanoma.
• Increased expression of the p16 gene as organisms age
reduces the proliferation of stem cells.
– This reduction in the division and production of stem cells
protects against cancer while increasing the risks associated
with cellular senescence.
p16INK4a clock
• R^2=0.40 means that the
age correlation is 0.63
• Liu Y et al (May 2009).
"Expression of p16INK4a in
peripheral blood T-cells is
a biomarker of human
aging". Aging Cell 8 (4):
439–48.
Disruptive clock technology based on
DNA methylation levels
• State of the art of biological clocks before
epigenetic markers
– Gene products (mRNA, protein levels) lead to an age
correlation = 0.63
• DNA methylation levels (epigenetics) can be used
to define drastically more accurate clocks
– Epigenetic clock leads to an age correlation = 0.96
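The two accuracy figures can be put on the same scale with a short calculation. It assumes a simple linear relation between predicted and chronological age, so that the squared correlation equals the proportion of variance explained.

```python
import math

# p16INK4a clock: R^2 = 0.40 corresponds to an age correlation of about 0.63.
age_correlation_p16 = math.sqrt(0.40)
print(round(age_correlation_p16, 2))  # 0.63

# Epigenetic clock: an age correlation of 0.96 corresponds to R^2 of about 0.92.
r_squared_epigenetic = 0.96 ** 2
print(round(r_squared_epigenetic, 2))  # 0.92
```

Under this assumption the epigenetic clock explains roughly 92% of the variance in age, compared with 40% for the expression-based clock.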
DNA methylation age of human
tissues and cell types.
Genome Biol. 2013 14(10):R115 PMID: 24138928
Training data sets
Data label (color) | DNA origin | Platform | Data use | n (Prop. female) | Median age (range) | Citation
1 (turquoise) | Blood WB | 27K | Training | 715 (0.38) | 33 (16,88) | Horvath 2012
2 (blue) | Blood WB | 450K | Training | 94 (0.28) | 29 (18,65) | Horvath 2012
3 (brown) | Blood WB | 450K | Training | 656 (0.52) | 65 (19,100) | Hannum 2012
4 (blue2) | Blood PBMC | 450K | Training | 72 (0) | 3.1 (1,16) | Alisch 2012
5 (green) | Blood PBMC | 450K | Training | 48 (0.52) | 15 (3.5,76) | Harris et al 2012
6 (red) | Blood Cord | 27K | Training | 216 (0.51) | 0 (0,0) | Adkins 2011
7 (black) | Brain CRBLM | 27K | Training | 168 (NaN) | 45 (20,70) | Liu 2013
8 (pink) | Brain CRBLM | 27K | Training | 114 (0.3) | 44 (16,96) | Gibbs 2010
9 (magenta) | Brain FCTX | 27K | Training | 133 (0.32) | 43 (16,100) | Gibbs 2010
10 (purple) | Brain PONS | 27K | Training | 125 (0.3) | 43 (15,100) | Gibbs 2010
11 (greenyellow) | Brain Prefr.CTX | 27K | Training | 108 (0.48) | 19 (-0.5,84) | Numata 2012
12 (tan) | Brain Various Cells | 450K | Training | 145 (0.48) | 35 (13,79) | Guintivano 2013
13 (salmon) | Brain TCTX | 27K | Training | 127 (0.33) | 44 (15,100) | Gibbs 2010
14 (cyan) | Breast NL | 27K | Training | 23 (1) | 46 (19,75) | Zhuang 2012
15 (midnightblue) | Buccal | 27K | Training | 109 (0.61) | 15 (15,15) | Essex 2011
16 (indianred) | Buccal | 27K | Training | 8 (0.75) | 43 (16,68) | Rakyan 2010
17 (grey60) | Buccal | 450K | Training | 53 (0.45) | 0 (0,1.5) | Martino 2013
18 (green2) | Cartilage Knee | 27K | Training | 41 (0.49) | 66 (40,79) | Fernández-Tajes 2013
19 (gold) | Colon | 27K | Training | 35 (0.63) | 74 (43,90) | TCGA, COAD
20 (royalblue) | Colon | 450K | Training | 24 (0.54) | 14 (3.5,19) | Kellermayer 2013
21 (darkred) | Dermal fibroblast | 27K | Training | 14 (1) | 20 (6,73) | Koch 2011
22 (darkgreen) | Epidermis | 27K | Training | 10 (0) | 50 (26,71) | Gronniger 2010
23 (darkturquoise) | Gastric | 27K | Training | 52 (NaN) | 68 (25,88) | Zouridis 2012
24 (darkgrey) | Head+Neck | 450K | Training | 50 (0.24) | 62 (26,87) | TCGA, HNSC
25 (orange) | Heart | 27K | Training | 17 (0.41) | 55 (16,68) | Haas 2013
26 (darkorange) | Kidney | 450K | Training | 43 (0.3) | 66 (31,83) | TCGA, KIRP
27 (lightsteelblue2) | Kidney | 450K | Training | 160 (0.34) | 63 (38,90) | TCGA, KIRC
28 (skyblue) | Liver | 27K | Training | 57 (0.14) | 51 (20,79) | Shen 2012
29 (saddlebrown) | Lung NL Adj | 27K | Training | 27 (0.15) | 69 (52,83) | TCGA, LUSC
30 (steelblue) | Lung NL Adj | 27K | Training | 24 (0.58) | 66 (51,77) | TCGA, LUAD
31 (paleturquoise) | Lung NL Adj | 450K | Training | 40 (0.32) | 73 (40,85) | TCGA, LUSC
32 (violet) | MSC (bonemarrow) | 27K | Training | 16 (0.38) | 52 (21,85) | Bork 2010
33 (darkolivegreen) | Placenta | 27K | Training | 28 (1) | 0 (0,0) | Gordon 2012
34 (darkmagenta) | Prostate NL | 27K | Training | 69 (0) | 61 (44,73) | Kobayashi 2011
35 (sienna3) | Prostate NL | 450K | Training | 44 (0) | 63 (44,72) | TCGA, PRAD
36 (yellowgreen) | Saliva | 27K | Training | 131 (0.015) | 29 (21,55) | Liu 2010
37 (skyblue3) | Saliva | 27K | Training | 69 (0) | 35 (21,55) | Bockland 2011
38 (plum1) | Stomach | 27K | Training | 41 (0.51) | 69 (43,87) | TCGA, STAD
39 (orangered4) | Thyroid | 450K | Training | 25 (0.8) | 40 (18,76) | TCGA, THCA
Test data sets
Data label (color) | DNA origin | Platform | Data use | n (Prop. female) | Median age (range) | Citation
40 (mediumpurple3) | Blood WB | 27K | Test | 191 (0.51) | 43 (24,74) | Teschendorff 2010
41 (lightsteelblue1) | Blood WB | 27K | Test | 93 (1) | 63 (49,74) | Rakyan 2010
42 (darkcyan) | Blood WB | 27K | Test | 262 (1) | 67 (49,91) | Song 2010
43 (orange) | Blood WB | 27K | Test | 269 (1) | 64 (52,78) | Teschendorff 2010 So
44 (green) | Blood WB | 450K | Test | 689 (0.71) | 54 (17,70) | Liu 2013
45 (darkorange2) | Blood PBMC | 27K | Test | 386 (0) | 9.3 (3.6,18) | Alisch 2012
46 (brown4) | Blood PBMC | 450K | Test | 38 (0.74) | 44 (0,100) | Heyn 2012
47 (bisque4) | Blood PBMC | 27K | Test | 92 (NaN) | 33 (24,45) | Lam 2012
48 (darkslateblue) | Blood Cord | 27K | Test | 48 (0.021) | 0 (0,0) | Turan
49 (plum2) | Blood Cord | 27K | Test | 84 (0.52) | 0 (0,0.75) | Khulan 2012
50 (thistle2) | Blood Cord | 27K | Test | 53 (0.45) | 0 (0,0) | Gordon 2012
51 (darkblue) | Blood CD4 Tcells | 450K | Test | 48 (NaN) | 0.5 (0,1) | Martino 2012
52 (salmon4) | Blood CD4+CD14 | 27K | Test | 50 (0.68) | 34 (16,69) | Rakyan 2010
53 (palevioletred3) | Blood Cell Types | 450K | Test | 16 (0.62) | 32 (17,60) | Heyn 2013
54 (brown3) | Brain Cerebellar | 27K | Test | 20 (0) | 22 (1,60) | Ginsberg 2012
55 (maroon) | Brain Occipital Cortex | 27K | Test | 16 (0) | 25 (1,60) | Ginsberg 2012
56 (lightpink4) | Breast NL Adj | 450K | Test | 81 (1) | 55 (28,90) | TCGA, BRCA
57 (lavenderblush3) | Breast NL Adj | 27K | Test | 27 (1) | 51 (35,88) | TCGA, BRCA
58 (deepskyblue) | Buccal | 450K | Test | 51 (0.45) | 0 (0,1.5) | Martino 2013
59 (darkseagreen4) | Colon | 450K | Test | 38 (0.45) | 72 (40,90) | TCGA,COAD
60 (coral1) | Fat Adip | 27K | Test | 10 (0.4) | 75 (73,78) | Ribel-Madsen 2012
61 (brown2) | Heart | 27K | Test | 6 (0) | 60 (55,71) | Pai 2011
62 (coral2) | Kidney | 27K | Test | 198 (0.35) | 60 (33,86) | TCGA, KIRC
63 (mediumorchid) | Liver | 450K | Test | 37 (0.35) | 68 (20,81) | TCGA, LIHC
64 (skyblue2) | Lung NL Adj | 450K | Test | 26 (0.46) | 66 (42,86) | TCGA, LUAD
65 (yellow4) | Muscle | 27K | Test | 22 (0.55) | 66 (53,78) | Ribel-Madsen 2012
66 (skyblue1) | Muscle | 27K | Test | 44 (0) | 25 (25,25) | Jacobsen 2012
67 (plum) | Placenta | 450K | Test | 40 (NaN) | 0 (0,0) | Blair 2013
68 (orangered3) | Saliva | 27K | Test | 52 (0.92) | 27 (21,55) | Liu 2010
69 (mediumpurple2) | Uterine Cervix | 27K | Test | 152 (1) | 25 (19,55) | Zhuang 2012
70 (lightsteelblue) | Uterine Endomet | 450K | Test | 28 (1) | 62 (35,90) | TCGA, UCEG
71 (lightcoral) | Various Tissues | 27K | Test | 44 (0.41) | 71 (0,83) | Myers 2012
72 (indianred4) | Chimp+Human Tissues | 27K | Other | 35 (0.4) | 47 (9,81) | Pai 2011
73 (firebrick4) | Ape WB | 450K | Other | 32 (0.62) | 22 (9,43) | Hernando-Herraez 201
74 (darkolivegreen4) | Sperm | 27K | Other | 19 (1) | 0 (0,0) | Pacheco 2011
75 (brown2) | Sperm | 450K | Other | 26 (0) | 0 (0,0) | Krausz 2012
76 (blue2) | Vasc.Endoth(Umbilical) | 27K | Other | 42 (0.43) | 0 (0,0) | Gordon 2012
77 | Stem cells+Somatic Cells | 27K | Other | 271 (NA) | NA | Nazor 2012
78 | Stem cells+Somatic Cells | 450K | Other | 153 (0.63) | NA | Nazor 2012
79 | Reprogrammed mesenchymal stromal cells | 450K | Other | 24 (NA) | NA | Shao 2012
80 | hESC and normal primary tissue | 27K | Other | 34 (NA) | NA | Calvanese 2012
81 | hESC | 27K | Other | 6 (NA) | NA | Ramos-Mejía 2012
82 | Blood Cell Types | 450K | Other | 60 (0) | NA | Reinius 2012
Illumina data sets
• The first 39 data sets were used to construct ("train") the age
predictor.
• Data sets 40-71 were used to test (validate) the age predictor.
• Data sets 72-82 served other purposes e.g. to estimate the
DNAm age of embryonic stem and iPS cells.
• Training data were chosen i) to represent a wide spectrum of
tissues/cell types, ii) to involve samples whose mean age (43
years) is similar to that in the test data, and iii) to involve a high
proportion of samples (37%) measured on the Illumina 450K
platform since many on-going studies use this recent Illumina
platform.
• Only studied 21369 CpGs (measured with the Infinium type II
assay) which were present on both Illumina platforms (Infinium
450K and 27K) and had fewer than 10 missing values across the
data sets.
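The CpG selection rule in the last bullet amounts to a set intersection followed by a missingness filter. The sketch below uses hypothetical probe identifiers and counts; it does not reproduce the actual probe annotation.

```python
def select_cpgs(probes_27k, probes_450k, missing_counts, max_missing=10):
    """Keep probes on both platforms with fewer than max_missing missing values."""
    shared = set(probes_27k) & set(probes_450k)
    return sorted(p for p in shared if missing_counts.get(p, 0) < max_missing)

# Hypothetical probe lists and missing-value counts across all data sets.
probes_27k = ["cpg_a", "cpg_b", "cpg_c"]
probes_450k = ["cpg_b", "cpg_c", "cpg_d"]
missing_counts = {"cpg_b": 2, "cpg_c": 15, "cpg_d": 0}

kept = select_cpgs(probes_27k, probes_450k, missing_counts)
print(kept)  # ['cpg_b']
```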
Age predictor
• To ensure an unbiased validation in the test data,
only used the training data to define the age
predictor.
• A transformed version of chronological age was
regressed on the CpGs using a penalized regression
model (elastic net).
• The elastic net regression model automatically
selected 353 CpGs.
• I refer to the 353 CpGs as (epigenetic) clock CpGs
since their weighted average (formed by the
regression coefficients) amounts to an epigenetic
clock.
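Once the elastic net has selected its CpGs, applying the clock is a weighted sum followed by the inverse of the age transformation. The sketch below does not fit the model, and the slide does not specify the transformation, so it is passed in as a function that defaults to the identity. The coefficients, intercept, and CpG names are hypothetical.

```python
def predict_dnam_age(betas, coefficients, intercept, inverse_transform=lambda t: t):
    """Weighted sum of methylation values, mapped back to years.

    betas: methylation level per CpG for one sample.
    coefficients: regression coefficient per clock CpG.
    inverse_transform: undoes the age transformation used in training
        (identity here, which is an assumption for illustration).
    """
    score = intercept + sum(
        coefficients[cpg] * betas[cpg] for cpg in coefficients
    )
    return inverse_transform(score)

# Hypothetical two-CpG "clock"; the real predictor uses 353 CpGs.
coefficients = {"cpg_a": 2.0, "cpg_b": -1.0}
sample = {"cpg_a": 0.5, "cpg_b": 0.25, "cpg_c": 0.9}  # cpg_c is not a clock CpG

age = predict_dnam_age(sample, coefficients, intercept=0.5)
print(age)  # 1.25
```

CpGs with positive coefficients push the predicted age up as their methylation increases, and those with negative coefficients push it down.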
Accuracy across tissues and cell types (training)
Accuracy across test data
Accuracy in brain tissue
Results sent to me via email
Blood data from Marco Boks Jan 2014
Blood data Jim Pankow, Jan 2014
Median error=3.5 years
Aging clock applied to urine
• Figure created by Wei Guo from Zymo Research
• Median error=2.7 years,
• Cor=0.98
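The two accuracy measures quoted on these slides can be computed as follows, assuming that "median error" means the median absolute difference between predicted and chronological age. The ages in the example are hypothetical.

```python
import math
import statistics

def median_error(predicted, chronological):
    """Median absolute difference between predicted and true age."""
    return statistics.median(abs(p - c) for p, c in zip(predicted, chronological))

def age_correlation(predicted, chronological):
    """Pearson correlation between predicted and true age."""
    n = len(predicted)
    mp, mc = sum(predicted) / n, sum(chronological) / n
    spc = sum((p - mp) * (c - mc) for p, c in zip(predicted, chronological))
    spp = sum((p - mp) ** 2 for p in predicted)
    scc = sum((c - mc) ** 2 for c in chronological)
    return spc / math.sqrt(spp * scc)

# Hypothetical samples: chronological age and predicted (DNAm) age in years.
chronological = [20, 30, 40, 50, 60]
predicted = [22, 29, 43, 50, 56]

err = median_error(predicted, chronological)
cor = age_correlation(predicted, chronological)
print(err)  # 2
```

The median error is reported in years and is insensitive to a few badly predicted samples, whereas the correlation also reflects the age range covered by the data set.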
Acknowledgements
• WGCNA analysis
– Lin Song
– Peter Langfelder