TITech 13 Nov 2000
Bioinformatics: Converting Data to Knowledge
Gio Wiederhold
Stanford University: Computer Science, E.E. & Medicine
http://www-db.stanford.edu/people/gio.html
[Figure labels: Data; Aggregation of instances; Integration of sources; Knowledge; Analyses; Observations; Filters]
• The product: Information
7/26/2016 Gio Wiederhold - TITech 2000 2

Bio-Information
• to learn about ourselves,
  – our origins, our place in the world (Primates, Mice, Zebrafish, Fruit Flies, Roundworms, Yeast)
  – modesty, seeing how much we share with all organisms
  – not just of philosophical interest, but also
• to help humanity to lead healthy lives
  – to create new scientific methods
  – to create new diagnostics
  – to create new therapeutics

Loops of Data and Knowledge
Information is created at the confluence of data -- the state -- and knowledge -- the ability to select and project the state into the future.
[Figure labels: Storage; Education; Selection; Recording; Integration; Abstraction; Experience; State changes; Decision-making; Action; Knowledge Loop; Data Loop]

Volume and Variety
Two interacting issues in generating information
1. The volume is large -- we need automation
2.
The data is varied & heterogeneous
• many autonomous sources
• many distinct objectives
many incompatibilities, errors

Quantities
[Column labels: Nature; Progress]
• 1 human; > 30 000 genes; ~ 10 000 proteins; diseases
• The human genome: ~ 4 000 000 000 base pairs
• Genes, and gene abnormalities
• 6 000 000 000 humans: everybody's genes
• < 1000 systems: metabolic pathways
• ~ 2 000 000 molecules: small organic molecules that affect proteins and are suitable for drugs

Diversity, Heterogeneity
• A wide variety of knowledge is needed to interpret the data
• A large variety of experts is developing this knowledge
• The scope of interests differs among those experts
• The knowledge is expressed in diverse ways
• The terms differ in precise meaning: semantics
• A large variety of data types is needed
• A wide variety of representations is used
• The database and file schemas differ
• The openness and accessibility of the information differs

Scope differences
A scope difference exists when terms differ in their mapping to real-world objects.
[Figure labels: employee (payroll); disabled; employee (personnel); all possible employees; contractors]
The local objective determines scope.
Example: "binding site" in the PDB database [Waugh & Altman]
[Figure labels: binding sites reported for publication; doubtful; all actual binding sites]
Reporting doubtful results risks rejection of publication.

Heterogeneity inhibits Integration
• An essential feature of science
  – autonomy of fields
  – differing granularity and scope of focus
  – growth of fields requires new terms
• A feature of technological process
  – standards require stability
  – yesterday's innovations are today's infrastructure
Must be dealt with explicitly
  – sharing, integration, and aggregation are essential
  – large quantities of data require precision

Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency
  – Local needs have priority
  – Outside uses are a byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✓✓
• Data Representation and Access Conventions ✓
• Metadata: Annotations, Naming, and Ontology
  – needed to share data from distinct sources

Required precision = F(volume)
More precision is needed as data volume increases: a small error rate still leads to too many errors.
• False positives have to be investigated (an attractive-looking supplier that makes toys; an apparent drug target with poor annotation)
• False negatives cause lost opportunities; results are suboptimal to some degree
[Figure: "Information Wall"; labels: data errors; information quantity. Adapted from Warren Powell, Princeton Un.]

Inconsistency causes errors, while results need precision
False positives (= poor precision) typically cost more than false negatives (= poor recall).
Example: [Todd Lowe, tRNA search <rna.wustl.edu/tRDB>]
• A search in Yeast for 55 methylation sites required manual elimination of pseudogenes
• The search space in the human genome is 215 times larger; not yet done
In drug discovery we now have more targets than pharmaceutical companies can afford to investigate.

Broad array of relatable sources
• Genomic
• Bibliographic
• Demographic
• Epidemiological
  – Familial
  – Contacts
• Clinical
  – Drug effectiveness
  – Drug-resistance
  – Co-occurrence
[Many used in data-mining, as in PRM (Probabilistic Relational Model) research by Lise Getoor @ Stanford]
Requires acyclicity. Use temporal dependencies?

Intersection of a large (irrelevant data) and a small (good data) distribution.
Result: The optimal separation creates more false positives (irrelevant results) than false negatives (good results missed).

Quality of data verified through publication
Data characteristics project [Stephen Koslow, Office on Neuroinformatics, NIMH, www.nimh.nih.gov/neuroinformatics/index.cfg]
• The human brain uses 15 Watts; has dozens of cell types, 100 billion (10^11) neural cells, 10^15 connections.
• Neuroscience is a growing field; it includes neuroinformatics.
• Initial, broad journals; reductionist journals.
• Numerical, symbolic, literature and image data.
• The volume of publication is becoming impossible to follow: serotonin alone, discovered in 1948, now has 70 000 papers.
• Voluminous 3-D MRI data. UCLA brain mapping. Basis for localization of diagnostic EEG, MEG observations.

Projects requiring manual curation are domain specific
Virtual Cell Project [Dong-Guk Shin, Univ. Connecticut, shin@engr.uconn.edu; also available without DB support, www.nrcam.uchc.edu]
• NIH support: physiology modeling. NSF support: computational modeling approach.
• Bottom-up approach to cell modeling. Cross checking of models and HXs.
• Geometry from segmented images; 2-D visualization of specified reactions: channels, pumps, for extra, intra (cytosol), of core cellular compartments.
• Generates equations for simulation.
• Result is a DB publication cycle, supporting model copying and adaptation.
• Access to remote DBs will need more than a browser: also a query system, with join over association.
• DBs need APIs and mediation for scalability and mismatch.

Data integration in Literature
[Jim Garrels, Proteome, Inc., www.proteome.com - free]
BioKnowledge Library, a portal site with 50 billion bytes of text covering the 5 billion bytes in Genbank.
Classification, curated by experts.
Pages { title with brief functional description, family, properties (mutant phenotype, ) , } sequence annotations, related proteins: Orthologs and Interologs (in different species) [Marc Vidal, MGH].
Integrated data from cDNA microarrays and chips, systematic 2-hybrids.
Model organisms: first Yeast, now Worms [Stuart Kim, Stanford].
Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and curate their own papers, but submit their results and publish detailed expression studies and update their own results.

Relationships among search parameters
[Figure: curves from 0% to 100% over the space of methods, ranked from best; labels: perfect; recall r = v.relevant / v.available; precision p = v.relevant / v.retrieved]

Means to achieve precision in text
Textual information - knowledge - complements pure data-oriented searches such as BLAST [Liu & Altman]
• Reduce redundancy
  – omit similar results from alternate sources: reports, workshop papers, journals, books
• Reduce false positives
  – recognize contextual domains
  – the same word refers to different object types: nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
  – linguistic processing based on customer model
  – medical case studies have similar formats

Integration makes Semantic Mismatches visible
Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items: { lorry, truck }
  – same terms for dissimilar items: trunk (luggage, car)
  – differing coverage: vehicles (DMV, police, AIA)
  – differing granularity: trucks (shipper, manuf.)
  – different scope: student (museum fee, Stanford)
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision for interoperation
  – ok for web browsing
  – poor for business and science

Shared Knowledge Base
PharmGKB -- PharmacoGenetics Knowledge Base, starting 2000
"An Ontology for Genetic Information" [Russ Altman] <pharmgkb.org>
Based at Stanford, funded by NIGMS to link existing projects, but open to others.
Phenotype variation --> Genotype variation
• Phase 2 metabolizing enzymes -- R. Weinshilboum at Mayo Clinic
• Asthma -- Weiss (was Jeff Raizin) at Harvard Un.
• Anti-cancer agents -- Mark Ratain at Un. of Chicago
• Membrane Transporters -- Kathleen Giacomini, UCSF
• Tamoxifen metabolic activation -- Dave Flockhart at Georgetown Un.
• Minority Populations and Privacy -- M. Rothstein at Univ. of Houston
• Depression in Mexican-Americans -- J. Licinio at UCLA
• Database Tools -- Prakash Nadkarni at Yale Un.

Complex Relationships
[Figure labels:] Genomic information; Isolated functional measures; Pharma.
activity Drug response systems Clinical phenotype Physiology Coding Molecules Molecular & cellular phenotype Integrated functional measures Protein Products Observable phenotypes Genetic Makeup Alleles Molecular Variation Drugs Individuals Non-genetic factors Environment
[Figure courtesy of R. Altman & Teri Klein, PharmGKB]

PharmGKB
• Ontology for pharmacogenetics
  – Represented in Protégé [Musen: smi.stanford.edu/project/protege]
• Service for Universities and Industry
• Open access to information and tools, but not a warehouse
  – Industrial affiliates, contributors and consumers at larger scales: geneticXchange, Merck Co, Pharmacia, SmithKline-Beecham (& Glaxo-Wellcome), GeneLogic, Guidant, Doubletwist, Incyte, Informax, SGI, Sun
• Collaboration in larger topics:
  – Biotechnology -- Clark Center
  – Education -- NIH sponsored training program, new UG degrees

Consistency: global or partial?
• Global consistency
  + wonderful for users and their programs
  – too many interacting sources
  – long time to achieve: 2 sources (UAL, LH), 3 (+ trucks), 4, … all ?
  – costly maintenance, since all sources evolve
  – no world-wide authority to dictate conformance
• Domain-specific ontologies (XML DTD assumption)
  + small, focused, cooperating groups
  + high quality; some examples: arthritis, Shakespeare plays
  + allows sharable, formal tools
  + ongoing, local maintenance affecting users: annual updates
  – poor interoperation; users still face inter-domain mismatches
  – periodic source updates
need automation in interoperation

Stanford Infolab SKC project (Scalable Knowledge Composition)
Objective: High precision in semantic interoperation of autonomous sources
• Basic -- pessimistic -- assumption:
  – The ontological mapping of terms to objects differs between autonomous domains.
• But
  – The collections of real-world objects provide a grounding for the definitions, and an opportunity to validate the meaning of the terms being employed.
  – Relationships have semantic and a related structural significance.

Exploit Domain-specific Expertise
Knowledge needed is huge in science and in business
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction
[Figure labels: three groups, each marked "Society of specialists"]

SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation

Sample Operation: INTERSECTION
[Figure: Source Domain 1, owned and maintained by Store; Source Domain 2, owned and maintained by Factory; Articulation between them. Result contains shared terms, useful for purchasing.]

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology; keep sharable entries
• Union: create a joint ontology; merge entries
• Difference: create a distinct ontology; remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies.

INTERSECTION support
[Figure: Store Ontology and Factory Ontology linked by an Articulation ontology, which holds terms useful for purchasing and matching rules that use terms from the 2 source domains]

Other Basic Operations
• UNION: merging entire ontologies
• DIFFERENCE: material fully under local control
Articulation ontology: typically prior intersections
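The three operations of the ontology algebra can be illustrated with a minimal sketch. This is not the SKC implementation: it assumes, purely for illustration, that an ontology is a flat set of term strings and that the articulation ontology is a set of (term in source 1, term in source 2) matching rules. The function names and the store and factory terms are hypothetical.

```python
# Illustrative sketch only: ontologies as flat sets of terms, and the
# articulation as a set of (source-1 term, source-2 term) matching rules.
# All names and sample terms here are hypothetical.

def intersection(a_terms, b_terms, rules):
    """Keep sharable entries: rule pairs whose terms exist in both sources."""
    return {(s, t) for (s, t) in rules if s in a_terms and t in b_terms}


def difference(a_terms, b_terms, rules):
    """Remove shared entries: terms of the first source with no match."""
    shared = {s for (s, _) in intersection(a_terms, b_terms, rules)}
    return set(a_terms) - shared


def union(a_terms, b_terms, rules):
    """Merge entries: matched pairs once, plus unmatched terms of each side."""
    shared = intersection(a_terms, b_terms, rules)
    only_a = difference(a_terms, b_terms, rules)
    only_b = set(b_terms) - {t for (_, t) in shared}
    return shared | {(s, None) for s in only_a} | {(None, t) for t in only_b}


store = {"shoe", "price", "customer"}
factory = {"footwear", "cost", "machine"}
# The last rule names terms that neither source contains, so it is ignored.
rules = {("shoe", "footwear"), ("price", "cost"), ("receipt", "invoice")}

print(sorted(intersection(store, factory, rules)))  # terms useful for purchasing
print(sorted(difference(store, factory, rules)))    # fully under the store's control
print(len(union(store, factory, rules)))            # size of the joint ontology
```

Because each operation takes the matching rules as an explicit argument, the rules can be maintained separately from the two source ontologies, which stay under the control of their owners.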
Tools to create articulations
Graph matcher for the articulation-creating expert
[Figure labels: Transport ontology; Vehicle ontology; Suggestions for articulations]

Tools to create articulations (continued from initial point)
Also suggest similar terms for further articulation:
• by spelling similarity
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded. Okays are converted into articulation rules.

Candidate Match Nexus
Term linkages automatically extracted from the 1912 Webster's dictionary *
Notice the presence of 2 domains: chemistry, transport
Based on processing headwords and definitions using algebra primitives
* free; have processed the OED (Oxford English Dictionary) at Stanford for internal use

Using the Match Nexus
Experiment on government structures of NATO countries: the SKEIN system resolved over 70% of unmatched terms.

Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused when sources change

Knowledge Composition
[Figure: Knowledge resources A, B, C, D, E. Articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ E), and (C ∩ D). Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ union, ∩ intersection]

Support Domain
Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable: Empowerment
* based on experience with software

Summary: Scalable Knowledge Composition (SKC)
Provide for Maintainable Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts

Many Other Tasks at/near Stanford
• Matching cell / protein 3D with chemical's 3D
• Regulatory gene motifs:
  – Bioprospector [Brutlag & Liu <www-cmgm.stanford.edu>]
• Protein structure generation
  – moving from small to larger proteins
  1: Powerful parallel processing [IBM BlueGene]
  2: Two-level: use features as an intermediate (alpha-helix, beta-sheets, …)
  3: Protein folding speedup by delegation [Shirts & Pande: foldingathome.stanford.edu]
• RNA folding (simpler, larger) [Nakatani & Pande]

Provenance of derived data
Assure having a proper history of derived results
[Peter Buneman, UPenn, www.humgen.upenn.edu] K2 integration tool
• Integrated databases often don't indicate the original sources. E.g., SwissProt does not distinguish inferred from observed data.
[William Gelbart, Harvard University] Flybase
• Flybase also collects data as exons and their mutations, transposon insertion sites.
• Moving from being Hunter Gatherers in science to Harvesters, moving to an agronomical society.
• Classical genomics is being superseded by expression and interaction of gene products and gene perturbation.
[Peter Karp, SRI Int., Bioinformatics Res. Group, www.ai.sri.com/pkarp/] EcoCyc
• EcoCyc links proteins to 150 metabolic pathways in E. coli.
• Databases are supplanting journals. They are re-analyzable.
Results in journals are not.
• An estimated 500 public databases now exist for bioinformatics; not all of them have APIs or use real DBMSs, and they differ in models and units of measurement, leading to semantic problems.

The People Problem
The demand for people in bioinformatics is high, at all levels
• Critical is a lack of
  – training opportunities: programs and teachers
  – available trainees
• Being in a multi-disciplinary field is scary
  – tenure for faculty
  – load for students
  – salary and growth differentials in biology and CS
• Some institutions are moving aggressively
  – must compete with World-Wide Web visions

Bioinformatics: Converting Data to Knowledge
• The means: People
• The product: Information

Up-to-dateness
[Figure labels: %tage up-to-date (0%, 50%, 100%) = effort, methods; F(user need); frequency of source change (never, 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second); frequency of visits (0, 1, ?, as often as possible); Feb. 2000: F(capability given 2.2M public sites with 288M pages)]

Privacy requires Ethics
Knowledge carries responsibilities.
How will people feel about your knowledge about them: their genetic make-up, their physical & psychological propensities?
Privacy is hard to formalize, but that does not mean it is not real to people. Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.

Securing Collaboration
[Figure: Collaborator (source) sends a query to the Security Filter; the filter passes a certified query to the Private Patient Data; the unfiltered result returns to the filter, which releases a certified result to the collaborator; the filter writes Logs. Gio Wiederhold, TIHI, Oct 96]
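The query and result flow in the Securing Collaboration diagram can be sketched as a small mediator function. This is only an illustration under stated assumptions, not the TIHI system: the function `secure_query`, the two rule functions, the sample records, and the choice to drop (rather than hold for review) any row that fails a result rule are all hypothetical.

```python
# Illustrative sketch only. Every name, rule, and record here is a
# hypothetical stand-in; none of it is taken from the TIHI system.

PATIENTS = [
    {"name": "A", "age": 34, "diagnosis": "asthma"},
    {"name": "B", "age": 16, "diagnosis": "asthma"},
]


def run_query(fields):
    """Private side: project the requested fields from every record."""
    return [{f: p[f] for f in fields} for p in PATIENTS]


def query_ok(fields):
    """Assumed query rule: identifying fields may not be requested."""
    return "name" not in fields


def result_ok(row):
    """Assumed result rule: release only rows known to describe adults."""
    return row.get("age", 0) >= 18


def secure_query(fields, log):
    """Certify the query, run it, and release only certified result rows."""
    log.append(("query", tuple(fields)))
    if not query_ok(fields):
        log.append(("rejected", tuple(fields)))
        return None
    unfiltered = run_query(fields)
    certified = [row for row in unfiltered if result_ok(row)]
    log.append(("released", len(certified), len(unfiltered)))
    return certified


log = []
print(secure_query(["age", "diagnosis"], log))  # filtered result rows
print(secure_query(["name"], log))              # rejected query
print(len(log))                                 # every step is logged
```

Checking both the incoming query and the outgoing result means that a query which looks harmless can still have sensitive rows withheld, and the log keeps a record of what was asked and how much was released.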