Provenance issues in 3R systems
Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
Larry.Hunter@uchsc.edu
http://compbio.uchsc.edu/Hunter

What is a 3R system?
• Reading, Reasoning & Reporting.
• Used to facilitate knowledge-based analysis of genome-scale experimental data in biology
• SM Leach, et al. “Biomedical Discovery Acceleration, with Applications to Craniofacial Development.” PLoS Comput Biol 2009, 5(3): e1000215. doi:10.1371/journal.pcbi.1000215
• http://hanalyzer.sourceforge.net

Understanding Gene Lists
• There is no “gene” for any complex phenotype; gene products function together in dynamic groups
• A key task is to understand why a set of gene products is grouped together in a condition, exploiting all existing knowledge about:
  – The genes (all of them)
  – Their relationships (|genes|²)
  – The condition(s) under study.

Knowledge-based data analysis
• Goal: Bring all existing knowledge (and more!) to bear on explaining experimental results.
• How?
  – Integrate multiple databases (using the semantic web)
  – Extract knowledge from the literature
  – Infer implicit interactions
  – Represent with knowledge networks
    • Nodes are social fiducials, like gene IDs or ontology terms
    • Arcs (relations) are qualified (typed) and quantified (with reliability)
  – Deliver a tool for biomedical analysts to use knowledge networks to explain results and generate hypotheses

[Figure: 3R system architecture]
Section labels: External sources; Reading methods; Reasoning methods; Network integration methods; Reporting methods
Component labels: Ontology annotations; Medline abstracts; Gene database 1, Gene database 2, …, Gene database n; Experimental data; Parsers & Provenance tracker; Biomedical language processing; Ontology enrichment; Co-annotation inference; Literature co-occurrence; Co-database inference; Semantic integration; Reliability estimation; Data Network; Knowledge Network; Visualization & Drill-down tool

Semantic database integration
• More than 1,000 peer-reviewed gene databases:
  – Annotations to function, location, process, disease, etc. ontologies
  – Linkages to many sorts of experimental and derived data (GWAS, expression, structure, pathways, population frequencies)
  – Linkages to publications that report evidence relevant to them
  – A “Gene” is not one thing:
    • DNA locus
    • Specific allele (coding sequence and/or regulatory sequences)
    • Varieties of gene products: intermediate and functional RNAs, proteins, alternative splices
    • Post-translational variations (cleavage, phosphorylation) to activate, transport
• Many can be integrated into a single, unified network using gene or publication identifiers.
  – Identifier cross-reference lists increasingly reliable
  – Increasing coordination and standardization among providers
• Many provenance challenges remain

OpenDMAP extracts typed relations from the literature
• Concept recognition tool
  – Connect ontological terms to literature instances
  – Built on Protégé knowledge representation system
• Language patterns associated with concepts and slots
  – Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.
  – Linked to many text analysis engines via UIMA
• Best performance in BioCreative II IPS task
• >500,000 instances of three predicates (with arguments) extracted from Medline abstracts [Hunter, et al., 2008]
• http://bionlp.sourceforge.net

Network inference issues
Ddc; MGI:94876
  – BP carboxylic acid metabolic process (GO:0019752)
  – BP catecholamine biosynthesis process (GO:0042423)
  – BP response to toxin (GO:0009636)
  – …
ChEBI terms: catechols (CHEBI:33566); catecholamines (CHEBI:33567); adrenaline (CHEBI:33568); noradrenaline (CHEBI:335…)
Cadps; MGI:1350922
  – …
  – BP catecholamine secretion (GO:0050432)
  – BP protein transport (GO:0015031)
  – BP vesicle organization (GO:0016050)
  – …
Inferred link via shared GO/ChEBI terms (Reliability = 0.009740):
MGI:94876 ? GO:0042423 ? CHEBI:33567 ? GO:0050432 ?
MGI:1350922

Inferred interactions
• Dramatically increase coverage…
• But at the cost of lower reliability
• We apply a new method to assess reliability without an explicit gold standard [Leach, et al., 2007; Gabow, et al., 2008]
[Figure: Top 1,000 Craniofacial genes (1,000,000 possible edges)]

Link calculations for MyoD1 ↔ MyoG
[Figure: evidence supporting the MyoD1 ↔ MyoG link]
Evidence types shown:
• DMAP transport relation: PMID:16407395…
• Shared GO molecular functions: GO:3705…
• Co-occurrence in abstracts
• Shared knockout phenotypes: MP:5374…
• Shared GO cell component: GO:5667…
• Shared InterPro domains: IPR:11598…
• Shared GO biological processes: GO:6139…
• Premod_M interaction: Mod074699
• Inferred link through shared GO/ChEBI: ChEBI:16991
Reliabilities shown (in slide order): R = 0.0105, 0.018, 0.1034, 0.0284, 0.0438, 0.0190, 0.1005, 0.0172, 0.01
Correlation in expression data: Pdata = 0.4808

A real use
• WT mouse craniofacial development: 3 tissues, 5 time points, 7 replicates each
• >1000 genes differentially expressed @ FDR<0.01
• Graph of top 1000 edges each from average or logistic or both (1734 edges total).

Strong data and background knowledge facilitate explanations
[Figure legend: Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family; AVE edges; Both edges]
• Goal is abductive inference (explanation): why are these genes doing this?
  – Specifically, why the increase in mandible before the increase in maxilla, and not at all in the frontonasal prominence?

Exploring the knowledge network
Scientist + aide + literature → explanation: tongue development
[Figure legend: Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family; AVE edges; Both edges]
The delayed onset, at E12.5, of the same group of proteins during mastication muscle development.
Myoblast differentiation and proliferation continue until E15, at which point the tongue muscle is completely formed.
Myogenic cells invade the tongue primordia ~E11

Hypothesis generation
[Figure legend: HANISCH edges; AVE edges; Both edges; Inferred synapse signaling proteins; Inferred myogenic proteins; Proteins of no common family; Proteins in the previous AVE-based sub-network]
• Add the strong data, weak background knowledge (Hanisch) edges to the previous network, bringing in new genes.
• Four of these genes not previously implicated in facial muscle development (1 almost completely unannotated)

Hypotheses confirmed
[Figure panels: Zim1, E12.5; E43rik, E12.5; ApoBEC2, E11.5; HoxA2, E12.5]

Genes ~ Bad Guys?
• Pirolli & Card, Int’l Conf. on Intelligence Analysis, 2005

3R Provenance issues
• Reasoning about the evidence for a link (e.g. types of evidence, epistemic relations)
• Identifying and avoiding redundant evidence, while recognizing corroborating evidence
• Capturing provenance aspects of inference
  – Automatic inferences have many

Reliability estimation
• Biological knowledge is provisional
• Reliability estimates are very important
• ECO used by GO for typing evidence
• OBI (Ontology for Biomedical Investigations) models: investigation design; materials, protocols and instrumentation; data generated; and analysis performed.

Estimating and Combining Reliabilities
• Assessing and Combining Reliability of Protein Interaction Sources [Leach, et al., 2007]
• Many alternative approaches to combination:

Redundant vs. Corroborating
• Multiple sources of evidence for many assertions: Which are genuinely independent?
• Many databases curate from the same articles, but shared article ID alone doesn’t prove redundancy (multiple assertions per article)
• Some data comes directly from an experiment, but then is later peer reviewed and published: redundant or corroborating?

Capturing complex inferences
• Goal of 3R systems is to support explanation and hypothesis generation
• There are many computational and cognitive tasks involved.
• Want to capture (some aspects of) use
  – Maintain an “analyst’s notebook” to facilitate gathering materials and sense-making; what needs to be tracked?
  – Report to a server on what was done with the network (to calculate “value” or “interestingness”)

3R Provenance issues
• Reasoning about the evidence for a link (e.g. types of evidence, estimating reliability)
• Identifying and avoiding redundant evidence, while recognizing corroborating evidence
• Capturing provenance aspects of inference
  – Automatic inferences have many
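
Worked sketch: combining reliabilities while discounting redundant evidence
The slides note that there are many alternative ways to combine reliabilities and that redundant evidence must be told apart from corroborating evidence. The sketch below illustrates one possible approach under loudly stated assumptions: it uses a noisy-OR combination (an assumption here, not something these slides specify), and it treats two pieces of evidence as redundant when they share a hypothetical "origin" key such as an article ID plus the specific assertion. The names Evidence, origin, noisy_or and combine are illustrative only. The nine R values from the MyoD1 ↔ MyoG slide are reused purely as sample inputs, without pairing them to evidence types.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Evidence:
    source: str         # hypothetical label for a database or inference method
    origin: str         # hypothetical independence key, e.g. "PMID:...#assertion"
    reliability: float  # estimated probability that this evidence is correct


def noisy_or(reliabilities):
    """Return 1 - prod(1 - r), assuming the inputs are independent."""
    rs = list(reliabilities)
    if any(not 0.0 <= r <= 1.0 for r in rs):
        raise ValueError("reliabilities must lie in [0, 1]")
    return 1.0 - prod(1.0 - r for r in rs)


def combine(evidence):
    """Keep only the best item per origin (redundant copies add nothing),
    then combine the remaining items with noisy-OR."""
    best_per_origin = defaultdict(float)
    for ev in evidence:
        best_per_origin[ev.origin] = max(best_per_origin[ev.origin], ev.reliability)
    return noisy_or(best_per_origin.values())


# Sample inputs: the R values listed on the MyoD1 <-> MyoG slide, each given
# its own made-up origin so that all nine count as independent.
slide_values = [0.0105, 0.018, 0.1034, 0.0284, 0.0438, 0.0190, 0.1005, 0.0172, 0.01]
link = [Evidence(f"source{i}", f"origin{i}", r) for i, r in enumerate(slide_values)]

# A second database that curated the same assertion from the same article:
# it shares an origin key, so it should not raise the combined reliability.
redundant = link + [Evidence("db_copy", "origin2", 0.1034)]

# Genuinely new evidence has its own origin and does raise it.
corroborating = link + [Evidence("new_experiment", "origin_new", 0.1034)]

print(combine(link), combine(redundant), combine(corroborating))
```

The design choice that matters is the origin key. Keying on article ID alone would wrongly merge distinct assertions from one paper, which is the pitfall the "Redundant vs. Corroborating" slide points out, so a real system would need a finer-grained notion of where each assertion came from.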