Empirical evaluation of prediction and correlation network methods applied to genomic data
Steve Horvath
University of California, Los Angeles

Content
• Review of weighted correlation network analysis (WGCNA)
• When Is Hub Gene Selection Better than Standard Meta-Analysis? Evaluating systems biologic gene selection methods
• The epigenetic clock: a highly accurate genomic predictor of age

What is weighted correlation network analysis (WGCNA)?
• Construct a network
  Rationale: make use of interaction patterns between genes
• Identify modules
  Rationale: module (pathway) based analysis
• Relate modules to external information
  Array Information: clinical data, SNPs, proteomics
  Gene Information: gene ontology, EASE, IPA
  Rationale: find biologically interesting modules
• Study module preservation across different data
  Rationale:
  – Same data: to check robustness of module definition
  – Different data: to find interesting modules
• Find the key drivers in interesting modules
  Tools: intramodular connectivity, causality testing
  Rationale: experimental validation, therapeutics, biomarkers

Weighted correlation networks are valuable for a biologically meaningful…
• reduction of high dimensional data
  – expression: microarray, RNA-seq
  – gene methylation data, fMRI data, etc.
• integration of multiscale data
  – expression data from multiple tissues
  – SNPs (module QTL analysis)
  – complex phenotypes

An anatomically comprehensive atlas of the adult human brain transcriptome
MJ Hawrylycz, E Lein, .., AR Jones (2012) Nature 489, 391-399

Allen Brain Institute Data
• Brains from two healthy males (ages 24 and 39)
• 170 brain structures
• over 900 microarray samples per individual
• 64K Agilent microarray
• This data set provides a neuroanatomically precise, genome-wide map of transcript distributions

Modules in brain 1
Global gene networks.

How to construct a weighted correlation network?
Systems biology as a field of study: interactions between the components of biological systems

Network = Adjacency Matrix
• A network can be represented by an adjacency matrix, A=[aij], that encodes whether/how a pair of nodes is connected.
  – A is a symmetric matrix with entries in [0,1]
  – For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected)
  – For weighted networks, the adjacency matrix reports the connection strength between node pairs
  – Our convention: diagonal elements of A are all 1.

Two types of weighted correlation networks
• Unsigned network, absolute value: $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$
• Signed network, preserves sign info: $a_{ij} = |0.5 + 0.5\,\mathrm{cor}(x_i, x_j)|^{\beta}$
Default values: β=6 for unsigned and β=12 for signed networks.
We prefer signed networks…
Zhang et al SAGMB Vol. 4: No. 1, Article 17.

Adjacency versus correlation in unsigned and signed networks
[Figure: adjacency versus correlation; panels: Unsigned Network, Signed Network]

Advantages of soft thresholding with the power function
1. Robustness: Network results are highly robust with respect to the choice of the power β (Zhang et al 2005)
2. Calibration of different networks becomes straightforward, which facilitates consensus module analysis
3. Module preservation statistics are particularly sensitive for measuring connectivity preservation in weighted networks
4. Math reason: Geometric Interpretation of Gene CoExpression Network Analysis. PloS Computational Biology. 4(8): e1000117

How to detect network modules?

Systems biology as a paradigm, usually defined in antithesis to the so-called reductionist paradigm (biological organization)

Module Definition
• Based on the resulting cluster tree, we define modules as branches
• Modules are either labeled by integers (1,2,3…) or equivalently by colors (turquoise, blue, brown, etc)
• We often use average linkage hierarchical clustering coupled with the topological overlap dissimilarity measure.
• Next we use the dynamic tree cutting method to define clusters.
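
The soft-thresholding step described earlier in this section can be sketched in a few lines. This is a minimal illustration, not the WGCNA implementation: it assumes plain Pearson correlation, uses the default powers quoted on the slide (β=6 unsigned, β=12 signed), and the function names `pearson` and `adjacency` and the toy expression profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, beta, signed):
    """Soft-thresholded adjacency matrix; the diagonal is set to 1 by convention."""
    n = len(profiles)
    A = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            # signed: |0.5 + 0.5*cor|^beta, unsigned: |cor|^beta
            base = abs(0.5 + 0.5 * r) if signed else abs(r)
            A[i][j] = A[j][i] = base ** beta
    return A

# Toy data (hypothetical): rows are genes, columns are samples.
profiles = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 8.0],  # gene 1, perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],  # gene 2, perfectly anti-correlated with gene 0
]
unsigned = adjacency(profiles, beta=6, signed=False)
signed = adjacency(profiles, beta=12, signed=True)
```

In this toy example the anti-correlated pair (gene 0, gene 2) is strongly connected in the unsigned network but disconnected in the signed network, which is the distinction the two definitions are meant to capture.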
Langfelder et al 2007

Defining modules based on a hierarchical cluster tree
Langfelder P, Zhang B et al (2007) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 2008 24(5):719-720
• Module = branch of a cluster tree
• Dynamic hybrid branch cutting method combines advantages of hierarchical clustering and pam clustering

How does one find “consensus” modules based on multiple gene expression data (networks)?

Example: Multiple human brain expression data sets from Huntington's Disease
Publicly available caudate nucleus gene expression data from HD subjects and controls
1) Durrenberger et al (2012). Selection of novel reference genes for use in the human central nervous system: a BrainNet Europe Study. Acta Neuropathol. 2012 Dec;124(6):893-903
2) Hodges et al Luthi-Carter (2006) Regional and cellular gene expression changes in human Huntington’s disease brain. Human Molecular Genetics, 2006, Vol. 15, No. 6

Analysis steps of WGCNA
1. Construct a signed weighted correlation network based on 2 human gene expression data sets
   Purpose: keep track of co-expression relationships
2. Identify consensus modules
   Purpose: find robustly defined and reproducible modules
   Technique: Consensus adjacency is a quantile of the input adjacencies, e.g. minimum, lower quartile, median
3. Relate modules to external information
   HD disease status
   Gene Information: gene ontology, cell marker genes
   Purpose: find biologically meaningful modules

Consensus dendrogram with module colors and meta-analysis significance for diagnosis. The colors correspond to the meta-analysis Z score (with weights proportional to the square root of the number of degrees of freedom, DOF); blue denotes genes that are down in HD vs controls, and red denotes genes that are up in HD vs controls.

Question: How does one summarize the expression profiles in a module?
Answer: This has been solved.
Math answer: module eigengene = first principal component
Network answer: the most highly connected intramodular hub gene
Both turn out to be equivalent

Module eigengene = measure of over-expression = average redness
Rows = genes, Columns = microarray
[Figure: heat map of the brown module; below it, the brown module eigengene across samples]

Module eigengenes are very useful
• 1) They allow one to relate modules to each other
  – Allows one to determine whether modules should be merged
• 2) They allow one to relate modules to clinical traits (HD status) and genetic variation (e.g. CAG trinucleotide repeat length) -> avoids multiple comparison problem
• 3) They allow one to define a measure of module membership: kME=cor(x,ME)
  – Can be used for finding centrally located hub genes
  – Can be used to define gene lists for GO enrichment

Content
When Is Hub Gene Selection Better than Standard Meta-Analysis? Evaluating systems biologic gene selection methods

When does hub gene selection lead to more meaningful gene lists than a standard statistical analysis based on significance testing?
• Here we address this question for the special case when multiple data sets are available.
• This is of great practical importance since for many research questions multiple gene expression or other -omics data sets are publicly available.
• In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a coexpression network analysis approach that selects intramodular hubs in consensus modules.

Intramodular hub genes versus whole network hubs
• Intramodular hubs have high intramodular connectivity kME with respect to a given module of interest
• Whole network hubs have high values of whole network connectivity k
  – k = row sum of the adjacency matrix
  – k = number of direct neighbors in case of an unweighted network

Q&A
• 1. Are whole-network hub genes relevant or should one exclusively focus on intramodular hubs?
• Answer: Focus exclusively on intramodular hubs in trait-related modules.
• 2. Do network-based gene selection strategies lead to gene lists that are biologically more informative than those based on standard marginal approaches?
• Answer: Yes, gene selection based on intramodular connectivity leads to biologically more informative gene lists than marginal approaches.
• 3. Do network-based gene selection strategies lead to gene lists that have more reproducible trait associations than those based on standard marginal approaches?
• Answer: Overall no. But in case of a weak signal, networks can help.

Criteria for judging gene selection methods
• Criterion 1 evaluates the biological insights gained, i.e. it is relevant in basic research.
• Criterion 2 evaluates the validation success in independent data sets, i.e. it is relevant when it comes to developing diagnostic or prognostic biomarkers.
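
The two connectivity measures contrasted above (whole network connectivity k and module membership kME=cor(x,ME)) can be illustrated with a small sketch. Several things here are assumptions made for illustration: the module eigengene is replaced by a simple stand-in (the mean of the standardized module profiles) rather than the first principal component, k is taken as the plain row sum so the diagonal contributes a constant 1 to every node, and all function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def whole_network_connectivity(A):
    """k = row sum of the adjacency matrix (diagonal included)."""
    return [sum(row) for row in A]

def eigengene_proxy(module_profiles):
    """Stand-in for the module eigengene: mean of standardized profiles.
    This is NOT the first principal component, only a simple approximation."""
    z = []
    for p in module_profiles:
        m, s = mean(p), pstdev(p)
        z.append([(v - m) / s for v in p])
    return [mean(col) for col in zip(*z)]

def module_membership(module_profiles, me):
    """kME = cor(x, ME) for every gene profile x in the module."""
    return [pearson(p, me) for p in module_profiles]

# Toy adjacency matrix (hypothetical) for three genes.
A = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
k = whole_network_connectivity(A)

# Toy module (hypothetical): genes 0 and 1 move together, gene 2 less so.
module = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 4.0, 6.0, 8.0],
          [1.0, 3.0, 2.0, 4.0]]
me = eigengene_proxy(module)
kme = module_membership(module, me)
```

Ranking genes by `kme` within a trait-related module gives candidate intramodular hubs, whereas ranking by `k` ignores module structure.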
Data sets used in the empirical evaluation
• We compare standard meta-analysis with consensus network analysis in three comprehensive and unbiased empirical studies:
• (1) Find genes predictive of lung cancer survival
  – Gold standard = cell proliferation related genes
• (2) Find age related DNA methylation markers
  – Gold standard = Polycomb group target genes
• (3) Find genes related to total cholesterol in mouse liver tissues
  – Gold standard = immune system related genes

R code in the WGCNA package
• For standard screening, we used the metaAnalysis function
• For finding hubs in consensus modules, we used the consensusKME function

Results
• The results demonstrate that intramodular hub gene status is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1).
• However, meta-analysis methods perform as well as (if not better than) a co-expression network approach in terms of validation success (criterion 2).

Overview of biological aging clocks
Here a biological aging clock
• is defined as a method for predicting the age (in years) of a subject/biological sample
• Examples
  1. based on telomere length
  2. based on gene expression levels
  3. based on protein expression levels
  4. DNA methylation levels

Telomere length versus age in white blood cells
• Relation between age and TRF in men (r=−0.45) and in women (r=−0.48)
Benetos A, et al (2001) Telomere Length as an Indicator of Biological Aging: The Gender Effect and Relation With Pulse Pressure and Pulse Wave Velocity. Hypertension. 2001

p16INK4a clock
CDKN2A=p16Ink4A=tumor suppressor
• tumor suppressor protein encoded by the CDKN2A gene
• Cyclin-dependent kinase inhibitor 2A (CDKN2A, p16Ink4A)
  – also known as multiple tumor suppressor 1 (MTS-1)
• p16 plays an important role in regulating the cell cycle, and mutations in p16 increase the risk of developing a variety of cancers, notably melanoma.
• Increased expression of the p16 gene as organisms age reduces the proliferation of stem cells.
  – This reduction in the division and production of stem cells protects against cancer while increasing the risks associated with cellular senescence.

p16INK4a clock
• R^2=0.40 means that the age correlation is 0.63 (the square root of 0.40)
• Liu Y et al (May 2009). "Expression of p16INK4a in peripheral blood T-cells is a biomarker of human aging". Aging Cell 8 (4): 439–48.

Disruptive clock technology based on DNA methylation levels
• State of the art of biological clocks before epigenetic markers
  – Gene products (mRNA, protein levels) lead to an age correlation = 0.63
• DNA methylation levels (epigenetics) can be used to define drastically more accurate clocks
  – Epigenetic clock leads to an age correlation = 0.96
DNA methylation age of human tissues and cell types. Genome Biol. 2013 14(10):R115 PMID: 24138928

Training data sets
Data label (color) | DNA origin | Platform | Data use | n (Prop. Female) | Median age (range) | Citation
1 (turquoise) | Blood WB | 27K | Training | 715 (0.38) | 33 (16,88) | Horvath 2012
2 (blue) | Blood WB | 450K | Training | 94 (0.28) | 29 (18,65) | Horvath 2012
3 (brown) | Blood WB | 450K | Training | 656 (0.52) | 65 (19,100) | Hannum 2012
4 (blue2) | Blood PBMC | 450K | Training | 72 (0) | 3.1 (1,16) | Alisch 2012
5 (green) | Blood PBMC | 450K | Training | 48 (0.52) | 15 (3.5,76) | Harris et al 2012
6 (red) | Blood Cord | 27K | Training | 216 (0.51) | 0 (0,0) | Adkins 2011
7 (black) | Brain CRBLM | 27k | Training | 168 (NaN) | 45 (20,70) | Liu 2013
8 (pink) | Brain CRBLM | 27K | Training | 114 (0.3) | 44 (16,96) | Gibbs 2010
9 (magenta) | Brain FCTX | 27K | Training | 133 (0.32) | 43 (16,100) | Gibbs 2010
10 (purple) | Brain PONS | 27K | Training | 125 (0.3) | 43 (15,100) | Gibbs 2010
11 (greenyellow) | Brain Prefr.CTX | 27K | Training | 108 (0.48) | 19 (-0.5,84) | Numata 2012
12 (tan) | Brain Various Cells | 450K | Training | 145 (0.48) | 35 (13,79) | Guintivano 2013
13 (salmon) | Brain TCTX | 27K | Training | 127 (0.33) | 44 (15,100) | Gibbs 2010
14 (cyan) | Breast NL | 27K | Training | 23 (1) | 46 (19,75) | Zhuang 2012
15 (midnightblue) | Buccal | 27K | Training | 109 (0.61) | 15 (15,15) | Essex 2011
16 (indianred) | Buccal | 27K | Training | 8 (0.75) | 43 (16,68) | Rakyan 2010
17 (grey60) | Buccal | 450K | Training | 53 (0.45) | 0 (0,1.5) | Martino 2013
18 (green2) | Cartilage Knee | 27k | Training | 41 (0.49) | 66 (40,79) | Fernández-Tajes 2013
19 (gold) | Colon | 27K | Training | 35 (0.63) | 74 (43,90) | TCGA, COAD
20 (royalblue) | Colon | 450K | Training | 24 (0.54) | 14 (3.5,19) | Kellermayer 2013
21 (darkred) | Dermal fibroblast | 27K | Training | 14 (1) | 20 (6,73) | Koch 2011
22 (darkgreen) | Epidermis | 27K | Training | 10 (0) | 50 (26,71) | Gronniger 2010
23 (darkturquoise) | Gastric | 27K | Training | 52 (NaN) | 68 (25,88) | Zouridis 2012
24 (darkgrey) | Head+Neck | 450K | Training | 50 (0.24) | 62 (26,87) | TCGA, HNSC
25 (orange) | Heart | 27K | Training | 17 (0.41) | 55 (16,68) | Haas 2013
26 (darkorange) | Kidney | 450K | Training | 43 (0.3) | 66 (31,83) | TCGA, KIRP
27 (lightsteelblue2) | Kidney | 450K | Training | 160 (0.34) | 63 (38,90) | TCGA, KIRC
28 (skyblue) | Liver | 27K | Training | 57 (0.14) | 51 (20,79) | Shen 2012
29 (saddlebrown) | Lung NL Adj | 27K | Training | 27 (0.15) | 69 (52,83) | TCGA, LUSC
30 (steelblue) | Lung NL Adj | 27K | Training | 24 (0.58) | 66 (51,77) | TCGA, LUAD
31 (paleturquoise) | Lung NL Adj | 450K | Training | 40 (0.32) | 73 (40,85) | TCGA, LUSC
32 (violet) | MSC (bonemarrow) | 27K | Training | 16 (0.38) | 52 (21,85) | Bork 2010
33 (darkolivegreen) | Placenta | 27K | Training | 28 (1) | 0 (0,0) | Gordon 2012
34 (darkmagenta) | Prostate NL | 27K | Training | 69 (0) | 61 (44,73) | Kobayashi 2011
35 (sienna3) | Prostate NL | 450K | Training | 44 (0) | 63 (44,72) | TCGA, PRAD
36 (yellowgreen) | Saliva | 27K | Training | 131 (0.015) | 29 (21,55) | Liu 2010
37 (skyblue3) | Saliva | 27K | Training | 69 (0) | 35 (21,55) | Bockland 2011
38 (plum1) | Stomach | 27K | Training | 41 (0.51) | 69 (43,87) | TCGA, STAD
39 (orangered4) | Thyroid | 450K | Training | 25 (0.8) | 40 (18,76) | TCGA, THCA

Test data sets
Data label (color) | DNA origin | Platform | Data use | n (Prop. Female) | Median age (range) | Citation
40 (mediumpurple3) | Blood WB | 27K | Test | 191 (0.51) | 43 (24,74) | Teschendorff 2010
41 (lightsteelblue1) | Blood WB | 27K | Test | 93 (1) | 63 (49,74) | Rakyan 2010
42 (darkcyan) | Blood WB | 27K | Test | 262 (1) | 67 (49,91) | Song 2010
43 (orange) | Blood WB | 27K | Test | 269 (1) | 64 (52,78) | Teschendorff 2010
44 (green) | Blood WB | 450K | Test | 689 (0.71) | 54 (17,70) | So Liu 2013
45 (darkorange2) | Blood PBMC | 27K | Test | 386 (0) | 9.3 (3.6,18) | Alisch 2012
46 (brown4) | Blood PBMC | 450K | Test | 38 (0.74) | 44 (0,100) | Heyn 2012
47 (bisque4) | Blood PBMC | 27K | Test | 92 (NaN) | 33 (24,45) | Lam 2012
48 (darkslateblue) | Blood Cord | 27K | Test | 48 (0.021) | 0 (0,0) | Turan
49 (plum2) | Blood Cord | 27K | Test | 84 (0.52) | 0 (0,0.75) | Khulan 2012
50 (thistle2) | Blood Cord | 27K | Test | 53 (0.45) | 0 (0,0) | Gordon 2012
51 (darkblue) | Blood CD4 Tcells | 450K | Test | 48 (NaN) | 0.5 (0,1) | Martino 2012
52 (salmon4) | Blood CD4+CD14 | 27K | Test | 50 (0.68) | 34 (16,69) | Rakyan 2010
53 (palevioletred3) | Blood Cell Types | 450K | Test | 16 (0.62) | 32 (17,60) | Heyn 2013
54 (brown3) | Brain Cerebellar | 27K | Test | 20 (0) | 22 (1,60) | Ginsberg 2012
55 (maroon) | Brain Occipital Cortex | 27K | Test | 16 (0) | 25 (1,60) | Ginsberg 2012
56 (lightpink4) | Breast NL Adj | 450K | Test | 81 (1) | 55 (28,90) | TCGA, BRCA
57 (lavenderblush3) | Breast NL Adj | 27K | Test | 27 (1) | 51 (35,88) | TCGA, BRCA
58 (deepskyblue) | Buccal | 450K | Test | 51 (0.45) | 0 (0,1.5) | Martino 2013
59 (darkseagreen4) | Colon | 450K | Test | 38 (0.45) | 72 (40,90) | TCGA, COAD
60 (coral1) | Fat Adip | 27K | Test | 10 (0.4) | 75 (73,78) | Ribel-Madsen 2012
61 (brown2) | Heart | 27K | Test | 6 (0) | 60 (55,71) | Pai 2011
62 (coral2) | Kidney | 27K | Test | 198 (0.35) | 60 (33,86) | TCGA, KIRC
63 (mediumorchid) | Liver | 450K | Test | 37 (0.35) | 68 (20,81) | TCGA, LIHC
64 (skyblue2) | Lung NL Adj | 450K | Test | 26 (0.46) | 66 (42,86) | TCGA, LUAD
65 (yellow4) | Muscle | 27K | Test | 22 (0.55) | 66 (53,78) | Ribel-Madsen 2012
66 (skyblue1) | Muscle | 27K | Test | 44 (0) | 25 (25,25) | Jacobsen 2012
67 (plum) | Placenta | 450k | Test | 40 (NaN) | 0 (0,0) | Blair 2013
68 (orangered3) | Saliva | 27K | Test | 52 (0.92) | 27 (21,55) | Liu 2010
69 (mediumpurple2) | Uterine Cervix | 27K | Test | 152 (1) | 25 (19,55) | Zhuang 2012
70 (lightsteelblue) | Uterine Endomet | 450K | Test | 28 (1) | 62 (35,90) | TCGA, UCEG
71 (lightcoral) | Various Tissues | 27K | Test | 44 (0.41) | 71 (0,83) | Myers 2012
72 (indianred4) | Chimp+Human Tissues | 27K | Other | 35 (0.4) | 47 (9,81) | Pai 2011
73 (firebrick4) | Ape WB | 450k | Other | 32 (0.62) | 22 (9,43) | Hernando-Herraez 201
74 (darkolivegreen4) | Sperm | 27K | Other | 19 (1) | 0 (0,0) | Pacheco 2011
75 (brown2) | Sperm | 450k | Other | 26 (0) | 0 (0,0) | Krausz 2012
76 (blue2) | Vasc.Endoth(Umbilical) | 27K | Other | 42 (0.43) | 0 (0,0) | Gordon 2012
77 | Stem cells+Somatic Cells | 27K | Other | 271 (NA) | NA | Nazor 2012
78 | Stem cells+Somatic Cells | 450K | Other | 153 (0.63) | NA | Nazor 2012
79 | Reprogrammed mesenchymal stromal cells | 450K | Other | 24 (NA) | NA | Shao 2012
80 | hESC and normal primary tissue | 27k | Other | 34 (NA) | NA | Calvanese 2012
81 | hESC | 27k | Other | 6 (NA) | NA | Ramos-Mejía 2012
82 | Blood Cell Types | 450K | Other | 60 (0) | NA | Reinius 2012

Illumina data sets
• The first 39 data sets were used to construct ("train") the age predictor.
• Data sets 40-71 were used to test (validate) the age predictor.
• Data sets 72-82 served other purposes, e.g. to estimate the DNAm age of embryonic stem and iPS cells.
• Training data were chosen i) to represent a wide spectrum of tissues/cell types, ii) to involve samples whose mean age (43 years) is similar to that in the test data, and iii) to involve a high proportion of samples (37%) measured on the Illumina 450K platform since many on-going studies use this recent Illumina platform.
• Only studied 21369 CpGs (measured with the Infinium type II assay) which were present on both Illumina platforms (Infinium 450K and 27K) and had fewer than 10 missing values across the data sets.

Age predictor
• To ensure an unbiased validation in the test data, only used the training data to define the age predictor.
• A transformed version of chronological age was regressed on the CpGs using a penalized regression model (elastic net).
• The elastic net regression model automatically selected 353 CpGs.
• I refer to the 353 CpGs as (epigenetic) clock CpGs since their weighted average (formed by the regression coefficients) amounts to an epigenetic clock.
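
The idea that a weighted average of clock CpGs amounts to an age predictor can be sketched as follows. Everything concrete here is hypothetical: the CpG names, coefficients, intercept and ages are made-up toy values (the real clock uses 353 CpGs), the age transformation is not specified on the slide so the inverse transformation defaults to the identity, and "median error" is assumed to mean the median absolute difference between predicted and chronological age.

```python
from statistics import median

def predict_age(beta_values, coefficients, intercept, inverse_transform=lambda t: t):
    """Weighted average of CpG methylation values, mapped back to years.

    inverse_transform is a placeholder: the slide only says that a
    transformed version of age was regressed on the CpGs.
    """
    linear = intercept + sum(coefficients[cpg] * beta_values[cpg] for cpg in coefficients)
    return inverse_transform(linear)

def median_abs_error(predicted, actual):
    """Median absolute difference between predicted and chronological age."""
    return median(abs(p - a) for p, a in zip(predicted, actual))

# Hypothetical two-CpG "clock" (toy numbers, not real coefficients).
coefficients = {"cpg_A": 40.0, "cpg_B": -20.0}
intercept = 30.0

# Hypothetical methylation beta values (between 0 and 1) for two samples.
samples = [
    {"cpg_A": 0.5, "cpg_B": 0.25},
    {"cpg_A": 0.25, "cpg_B": 0.5},
]
chronological_age = [43.0, 33.0]

predicted = [predict_age(s, coefficients, intercept) for s in samples]
error = median_abs_error(predicted, chronological_age)
```

Fitting the coefficients themselves would require an elastic net solver on the training data sets; the sketch only covers applying an already fitted predictor and scoring it.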
Accuracy across tissues and cell types (training)

Accuracy across test data

Accuracy in brain tissue

Results sent to me via email
• Blood data from Marco Boks, Jan 2014
• Blood data from Jim Pankow, Jan 2014
• Median error=3.5 years

Aging clock applied to urine
• This figure was created by Wei Guo from Zymo Research
• Median error=2.7 years
• Cor=0.98

Acknowledgements
• WGCNA analysis
  – Lin Song
  – Peter Langfelder