Large scale genomic data integration for functional metagenomics
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
03-29-10

Greatest Biological Discoveries?

Are We There Yet?
• How much biology is out there?
• How much have we found?
• How fast are we finding it?
[Figures: Species Diversity of Environmental Samples (Fierer 2008); Human Proteins with Annotated Biological Roles, # Distinct Roles (Matt Hibbs); Age-Adjusted Citation Rates for Major Sequencing Projects]

Are We There Yet?
• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough
Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.
[Figures: Species Diversity of Environmental Samples (Fierer 2008); Human Proteins with Annotated Biological Roles, # Distinct Roles (Matt Hibbs); Age-Adjusted Cost per Citation for Major Sequencing Projects]

Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities

A framework for functional genomics
[Figure: 1Ks of datasets are integrated over 100Ms of gene pairs. Each dataset scores gene pairs among G1 to G9: known related pairs (+) score 0.9 and 0.7, known unrelated pairs (-) score 0.1 and 0.2, and the unlabeled pair G2-G5 (?) scores 0.8. Per-dataset frequency histograms compare + and - pairs over Low Correlation vs. High Correlation, Not coloc. vs. Coloc., and Dissimilar vs. Similar. Result: P(G2-G5|Data) = 0.85]
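The framework slide can be read as a Bayesian integration: each dataset contributes the frequency of its observed bin among known related (+) and unrelated (-) gene pairs, and the frequencies are combined into a posterior such as P(G2-G5|Data). A minimal Python sketch, assuming a naive Bayes model in which datasets are conditionally independent (the slide does not spell this out); the function name, the prior, and all numbers below are hypothetical and chosen only for illustration:

```python
from math import prod


def posterior_related(prior, likelihoods, observed_bins):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior         -- assumed P(related) before looking at any dataset
    likelihoods   -- one dict per dataset mapping a bin label to
                     (P(bin | related), P(bin | unrelated)), i.e. the
                     frequencies learned from + and - gene pairs
    observed_bins -- the bin observed for this gene pair in each dataset
    """
    p_related = prior * prod(
        lk[b][0] for lk, b in zip(likelihoods, observed_bins)
    )
    p_unrelated = (1.0 - prior) * prod(
        lk[b][1] for lk, b in zip(likelihoods, observed_bins)
    )
    return p_related / (p_related + p_unrelated)


# Hypothetical frequencies for two datasets (not taken from the slides).
coexpression = {"high": (0.8, 0.2), "low": (0.2, 0.8)}
colocalization = {"coloc": (0.6, 0.1), "not_coloc": (0.4, 0.9)}

p = posterior_related(
    prior=0.1,
    likelihoods=[coexpression, colocalization],
    observed_bins=["high", "coloc"],
)
print(f"P(related | data) = {p:.3f}")
```

Multiplying per-dataset frequencies keeps the cost per gene pair linear in the number of datasets, which matters at the scale of thousands of datasets and hundreds of millions of pairs.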
Functional network prediction and analysis
HEFalMp currently includes data from 30,000 human experimental results: 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases.
[Figure labels: Global interaction network; Metabolism network; Signaling network; Gut community network]

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
[Figure: Luciferase (negative control), ATG5 (positive control), and predicted novel autophagy proteins LAMP2 and RAB11A, each shown Not Starved vs. Starved (Autophagic)]

Meta-analysis for unsupervised functional data integration
Huttenhower 2006; Hibbs 2007; Evangelou 2007
z' = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}
Simple regression (all datasets are equally accurate):
y_{e,i} = \mu_e + \varepsilon_{e,i}
Random effects (variation within and among datasets and interactions):
y_{e,i} = \mu_e + \delta_{e,i} + \varepsilon_{e,i}
\hat{\mu}_e = \frac{\sum_i w^*_{e,i} \, y_{e,i}}{\sum_i w^*_{e,i}}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}
Following up with semisupervised approach

Functional mapping: mining integrated networks
Predicted relationships between genes (Low Confidence to High Confidence)
The strength of these relationships indicates how cohesive a process is.
[Figure: Chemotaxis]

Functional mapping: mining integrated networks
Predicted relationships between genes (Low Confidence to High Confidence)
The strength of these relationships indicates how associated two processes are.
[Figure: Chemotaxis; Flagellar assembly]

Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
FA_{G_1,G_2} = \frac{between(G_1, G_2)}{background(G_1, G_2)} \cdot \frac{baseline}{within(G_1, G_2)}
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.

Functional Mapping: Bootstrap p-values
• Scoring functional associations is great… …how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1.
Standard deviation is asymptotic in the sizes of both gene sets.
[Equation: \hat{\sigma}_{FA}(G_i, G_j) as a fitted function of |G_i| and |G_j| with parameters A, B, C; exact form not recoverable from the extracted text]
P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2), \hat{\sigma}(G_1,G_2)}(x)
Maps FA scores to p-values for any gene sets and underlying graph.
[Figures: Histograms of FAs for random sets (# Genes: 1, 5, 10, 50); Null distribution σs for one graph, plotted against |G1| and |G2|]
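The scoring and bootstrap steps above can be sketched in Python. This is an illustration under stated assumptions, not the talk's implementation: each of the four measures is taken to be a mean edge weight in a dense symmetric weight matrix, the null mean and standard deviation are estimated directly from random gene sets of the same sizes (standing in for the fitted asymptotic formula on the slide), and the names `mean_weight`, `fa_score`, and `fa_p_value` are hypothetical.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev


def mean_weight(w, pairs):
    """Average edge weight over (i, j) gene index pairs, skipping self-pairs."""
    return mean(w[i][j] for i, j in pairs if i != j)


def fa_score(w, g1, g2):
    """Functional association of gene sets g1 and g2 in weight matrix w.

    Assumes each set holds at least two genes so that 'within' is defined.
    """
    genes = range(len(w))
    between = mean_weight(w, [(i, j) for i in g1 for j in g2])
    within = mean_weight(
        w, list(combinations(g1, 2)) + list(combinations(g2, 2))
    )
    background = mean_weight(
        w, [(i, j) for i in set(g1) | set(g2) for j in genes]
    )
    baseline = mean_weight(w, combinations(genes, 2))
    return (between / background) * (baseline / within)


def fa_p_value(w, g1, g2, n_random=200, seed=0):
    """Upper-tail p-value of FA(g1, g2) under a normal null distribution.

    The null is built from random gene sets with the same sizes as g1, g2.
    """
    rng = random.Random(seed)
    genes = list(range(len(w)))
    null = [
        fa_score(w, rng.sample(genes, len(g1)), rng.sample(genes, len(g2)))
        for _ in range(n_random)
    ]
    null_dist = NormalDist(mean(null), stdev(null))
    return 1.0 - null_dist.cdf(fa_score(w, g1, g2))


# Toy network: 30 genes with weak background edges (0.1) and strong
# edges (0.9) between two hypothetical processes of five genes each.
n = 30
w = [[0.1] * n for _ in range(n)]
process_a, process_b = list(range(5)), list(range(5, 10))
for i in process_a:
    for j in process_b:
        w[i][j] = w[j][i] = 0.9

print(fa_score(w, process_a, process_b))
print(fa_p_value(w, process_a, process_b))
```

Resampling for every pair of gene sets is costly on a genome-scale network, which is presumably why the slide fits the null standard deviation as a function of the two set sizes rather than re-estimating it each time.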
Functional Mapping: Functional Associations Between Processes
[Figure: functional map of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
• Edges: associations between processes (Moderately Strong to Very Strong)
• Nodes: cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)
• Borders: data coverage of processes (Sparsely Covered to Well Covered)

Functional Maps: Focused Data Summarization
[Figure: raw sequence data: ACGGTGAACGTACA, GTACAGATTACTAG, GACATTAGGCCGTA, TCCGATACCCGATA]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities

Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present, over-enriched, or under-enriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …?
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data

Data Integration for Microbial Communities
~300 available expression datasets, ~30 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997

Functional network prediction from diverse microbial data
• 486 bacterial expression experiments: 876 raw datasets, 310 postprocessed datasets, 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments: 154796 raw interactions, 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Figure: E. coli integration; axes: Precision ↑, Recall ↓]

Functional maps for cross-species knowledge transfer
[Figure: genes from different species (ECG1, ECG2, BSG1, ECG3, BSG2, …) are assigned to gene groups (O1: G1, G2, G3; O2: G4; O3: G6; …), and a network over genes G1 to G17 is mapped onto a network over groups O1 to O9]
Following up with unsupervised and partially anchored network alignment
[Figure axes: Precision ↑, Recall ↓]

Functional maps for functional metagenomics
[Figure: GOS 4441599.3, Hypersaline Lagoon, Ecuador. Env. + Pathogens, combined with KEGG Pathways, Organisms, and integrated functional interaction networks in 27 species = mapping genes into pathways, mapping pathways into organisms, mapping organisms into phyla]

Functional maps for functional metagenomics
• Edges: process association in obesity (Less Coregulated, Baseline (no change), More Coregulated)
• Nodes: process cohesiveness in obesity (Very Downregulated, Baseline (no change), Very Upregulated)

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.

Outline
1. Data mining: Algorithms for integrating very large data compendia
• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections
2. Metagenomics: Network models of microbial communities
• Integration for microbial communities and metagenomics
• Accurate cross-species interactome transfer
• Sleipnir software for efficient large scale data mining

Thanks!
Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Jacques Izard, Hilary Coller, Erin Haley, Sarah Fortune, Tracy Rosebrock, Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp

Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Nurses’ Health Study; Health Professionals Follow-Up Study
• ~3,100 gastrointestinal subjects
• ~2,100 cancer mutation tests
• ~1,200 LINE-1 methylation
• ~3,800 tissue samples
• ~1,450 colon cancer samples
• ~1,150 CpG island methylation
• ~700 TMA immunohistochemistry
• ~775 gene expression
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
Nonnegative matrix factorization
[Figure: expression heatmap, genes (rows) by tumors (columns), with tumor clusters C1, C2, C3, C4. Gene group labels: Cell cycle regulation; Chr. 19 rearrangement, membrane receptors/channels; Angiogenesis, proliferation; HSC signature; Neural/ESC signature; BRCA interactors, chrom. stability factors]

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
Subramanian et al, 2005
[Figure: overlap among the Hematopoietic Stem Cell Signature (CD133 + Bcl-X(L)), the Neural Stem Cell Signature (CD44 + CD166), the Embryonic Stem Cell Signature, Chr. 19q, and BAX; overlap counts as extracted: 166, 799, 945, 195, 678, 18, 146, 7, 8, 325]
Hypotheses?
• Two main pathways to proliferation:
– HSC program + BAX
– ESC/NSC program
• Two main pathways to deregulation:
– Angiogenesis + chrom. instability
– Cell cycle disruption (MSI?)
Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers… Testing now for correlation with outcome.

Epigenetics of Colorectal Cancer: LINE-1 methylation levels
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008); x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80; ρ = 0.718, p < 0.01]
What does it all mean??
What is the biological mechanism linking LINE-1 methylation to colon cancer?
• Is anything different about these outliers?
• This suggests linkage to a cancer-related pathway.
• This suggests a copy number variation.
• This suggests a genetic effect.

Epigenetics of Colorectal Cancer: LINE-1 methylation levels
Preliminary Data
• 10 genes differentially expressed even using simple methods
• 1/3 are from the same family with known GI tumor prognostic value
• 1/3 are X-chromosome testis/cancer-specific antigens
• 1/2 fall in the same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a cascade of antigens/membrane receptors/TFs
– Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
• GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates
Check back in a couple of months!
What is the biological mechanism linking LINE-1 methylation to colon cancer?