Reind van de Riet celebration Increasing the Precision of Semantic Interoperation Gio Wiederhold Stanford University August 2000 Thanks to Jan Jannink, Shrish Agarwal, Prasenjit Mitra, Stefan Decker. August 2000 Gio vdR 1 Reind van de Riet, music and computing August 2000 Gio vdR 2 Achievements • Computer Science-based Formalization • Effective Dissemination - students and writings • Powerful Database Technology • Language-sensitive Methods • Ubiquitous Databases • World-wide interconnectivity • All the Information one might want somewhere . . . . August 2000 Gio vdR 3 Information Data overload starvation • More databases – public & corporate • Faster communication – digital – packeting: TCP-IP, ATM • World-wide connectivity – internet – world-wide web • Disintermediation – ubiquitous publishing Data and Knowledge Information is created at the Storage confluence of Education data -- the state Selection Recording & knowledge -Integration the ability to select and Abstraction Experience State changes project the Decision-making state into the future Action Knowledge Loop Data Loop Language issues • Our languages are specialized for our needs – efficiency of expression: minimal symbols Examples: • carpenter versus householder domain • surgeon versus pathologist • World-wide communication changes ranges – geographical locality is less relevant – conceptual locality is important • Business requires precision, including that words map consistently to instances knowledge controls use of data August 2000 Gio vdR 6 Transform Data to Information Application Layer Mediation Layer decision-makers at workstations value-added services Foundation Layer data and simulation resources August 2000 Gio vdR 7 Heterogeneity among Domains If interoperation involves distinct domains mismatch ensues • Autonomy conflicts with consistency, – Local Needs have Priority, – Outside uses are a Byproduct Heterogeneity must be addressed • Platform and Operating Systems 4 4 • Representation and Access Conventions 4 • Naming and Ontology : August 2000 Gio vdR 8 Semantic Mismatches Information comes from many autonomous sources • Differing viewpoints (by source) – – – – – differing terms for similar items { lorry, truck } same terms for dissimilar items trunk(luggage, car) differing coverage vehicles (DMV, AIA) differing granularity trucks (shipper, manuf.) different scope student museum fee, Stanford • Hinders use of information from disjoint sources – missed linkages – irrelevant linkages loss of information, opportunities overload on user or application program • Poor precision when merged ok for web browsing , August 2000 poor for business Gio vdR 9 Need for precision More precision is needed as data volume increases --- a small error rate still leads to too many errors False Positives have to be investigated ( attractive-looking supplier - makes toys apparent drug-target with poor annotation ) Information Wall lost opportunities, suboptimal to some degree False positives = poor precision typically cost more than false negatives = poor recall data errors False Negatives cause information quantity adapted from Warren Powell, Princeton Un. August 2000 Gio vdR 10 Means to achieve precision • Reduce redundancy – omit similar results from alternate sources reports, workshop papers, journals, books • Reduce false positives – recognize contextual domains * • the same word refers to different object types nail (carpentry, anatomy), miter (carpentry, religion) • Abstract findings to higher levels – Linguistic processing based on customer model medical case studies have similar formats August 2000 Gio vdR 11 Proposed Language Solutions Specify and define terminology usage: ontology • Domain-specific ontologies XML DTD assumption – – – – – Small, focused, cooperating groups high quality, some examples - genomics, arthritis, Shakespeare plays allows sharable, formal tools ongoing, local maintenance affecting users - annual updates poor interoperation, users still face inter-domain mismatches • Cannot achieve globally consistency – – – – – wonderful for users and their programs too many interacting sources long time to achieve, 2 sources (UAL, BA), 3 (+ trucks), 4, … all ? costly maintenance, since all sources evolve no world-wide authority to dictate conformance August 2000 Gio vdR 12 Domains and Consistency . • a domain will contain many objects • the object configuration is consistent • within a domain all terms are consistent & • relationships among objects are consistent Domain Ontology • context is implicit No committee is needed to forge compromises * within a domain Compromises hide valuable details August 2000 Gio vdR 13 Many ontologies Interoperation Have devolved maintenance onto many domain-specific experts / authorities • Many applications span multiple domains • Need an algebra to compute composed ontologies that are limited to their articulation terms SKC • enable interpretation within the source contexts August 2000 Gio vdR 14 Sample Operation: INTERSECTION Articulation Source Domain 1: Owned and maintained by Store August 2000 Result contains shared terms, useful for purchasing Source Domain 2: Owned and maintained by Factory Gio vdR 15 Vehicle sales ontology Vehicle registration ontology Tools to create articulations Suggestions for articulations August 2000 Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)Gio vdR 16 An Ontology Algebra A knowledge-based algebra for ontologies Intersection Union Difference create a subset ontology keep sharable entries create a joint ontology merge entries create a distinct ontology remove shared entries The Articulation Ontology (AO) consists of rules that link domain ontologies August 2000 matching Gio vdR 17 INTERSECTION support Articulation ontology Terms useful for purchasing Matching rules that use terms from the 2 source domains Store Ontology August 2000 Factory Ontology Gio vdR 18 Other Basic Operations UNION: merging entire ontologies DIFFERENCE: material fully under local control Articulation ontology typically prior intersections August 2000 Gio vdR 19 Features of an algebra Operations can be composed Operations can be rearranged Alternate arrangements can be evaluated Optimization is enabled The record of past operations can be kept and reused when sources change August 2000 Gio vdR 20 Knowledge Composition Composed knowledge for applications using A,B,C,E Articulation knowledge (A B) U (B C) U (C E) Articulation knowledge (C E) U U U : union : intersection U Knowledge resource E Articulation knowledge for (A B) U Knowledge resource A August 2000 U (B C) Knowledge resource B Knowledge resource C (C U Legend: U U for D) Knowledge resource D Gio vdR 21 Domain Specialization . • Knowledge Acquisition (20% effort) & • Knowledge Maintenance (80% effort *) to be performed • Domain specialists • Professional organizations • Field teams of modest size autonomously maintainable Empowerment * based on experience with software August 2000 Gio vdR 22 Summary To sustain the growth of web usage 1. The value of the results has to keep increasing precision, relevance not volume 2. Value is provided by mediating experts, encoded as models of diverse resources, diverse customers Problems being addressed redundancy Clear models mismatches quality maintenance } Need research to develop tools for these tasks August 2000 Gio vdR 23 Integration Science ? Databases access storage algebras scalability Systems Engineering analysis documentation costing Artificial Intelligence knowledge mgmt linguistic models uncertainty Integration Science August 2000 Gio vdR 24 Background Slides In Handout distributed at the VU, but not actually used August 2000 Gio vdR 25 Sample Processing in HPKB • What is the most recent year – Problems resolved by SKC * Factbook has out of date an OPEC member nation was OPEC & UN SC lists on the UN security council? – Related to DARPA HPKB Challenge Problem – SKC resolves 3 Sources • CIA Factbook ‘96 (nation) • OPEC (members, dates) • UN (SC members, years) – SKC obtains the Correct Answer • 1996 (Indonesia) – Other groups obtained more, but factually wrong answers August 2000 * * • • – Indonesia not listed – Gabon (left OPEC 1994) different country names – Gambia => The Gambia historical country names – Yugoslavia UN lists future security council members – Gabon 1999 intent of original question – Temporal variants Gio vdR 26 Vehicle sales ontology Vehicle registration ontology Tools to create articulations Suggestions for articulations August 2000 Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)Gio vdR 27 Tools to create articulations Graph matcher for Articulationcreating Expert Transport ontology Vehicle ontology Suggestions for articulations August 2000 Gio vdR 28 continue from initial point Also suggest similar terms for further articulation: • by spelling similarity, • by graph position • by term match repository Expert response: 1. Okay 2. False 3. Irrelevant to this articulation All results are recorded Okay’s are converted into articulation rules August 2000 Gio vdR 29 Candidate Match Repository Term linkages automatically extracted from 1912 Webster’s dictionary * * free, other sources have been processed. . Based on processing headwords definitions using algebra primitives Notice presence of 2 domains: chemistry, transport August 2000 Gio vdR 30 Using the match repository August 2000 Gio vdR 31 Navigating the match repository August 2000 Gio vdR 32 Primitive Operations Model and Instance Unary • Summarize -- structure up • Glossarize - list terms • Filter - reduce instances • Extract - circumscription Binary • Match - data corrobaration • Difference - distance measure • Intersect - schem discovery • Blend - schema extension August 2000 Constructors • create object • create set Connectors • match object • match set Editors • insert value • edit value • move value • delete value Converters • object - value • object indirection • reference indirection Gio vdR 33 Future: exploiting the result Result has links to source Avoid n2 problem of interpreter mapping as stated by Swartout as an issue in HPKB year 1 Processing & query evaluation is best performed within Source Domains & by their engines August 2000 Gio vdR 34