S K C An Algebraic Approach to Articulate Ontologies for Information Integration February 2001 Gio Wiederhold Stanford University 7/26/2016 Gio Wiederhold SKC 1 What are Ontologies? Ontologies list the terms and their relationships that allow communication among partners in enterprises (in machine-readable form) Relationships determine meaning - parent, school, company Databases use ontologies during design in their E-R diagrams (Implicitly) and represent the leaf nodes in their schemas Knowledge-bases use ontologies (often implicitly) add class definition (to hold instances), constraints, and, sometimes, operations among the terms 7/26/2016 Gio Wiederhold SKC 2 Functions of Ontologies . • Enable Precision in Understanding People = designers, implementors, users, maintainers Systems = implementors = users = maintainers • Share the Cost of Knowledge Acquistion & Maintenance reuse encoded knowledge, remain up-to-date as domains change • Enable Information Interoperation Define the terms that link domains 7/26/2016 * Gio Wiederhold SKC 3 Ancestors of Ontologies Lexicons: Taxonomies: . collect terms used in informtion systems categorize, abstract, classify terms Schemas of databases: attributes, ranges, constraints Data dictionaries: systems with multiple files, owners Object libraries: grouped attributes, inherit., methods Symbol tables: terms bound to implemented programs Domain object models: (XML DTD): interchange terms . . . More Knowledge formalized 7/26/2016 Gio Wiederhold SKC 4 Establishing Ontologies . Top-down: –Commonly acceptable UPPER layers Domain-specific –Sharing tools –Object based Bottom-up –Pragmatic, TASK-specific collections –Database schemas and models 7/26/2016 Gio Wiederhold SKC 5 Heterogeneity among Domains If interoperation involves distinct domains mismatch ensues • Autonomy conflicts with consistency, – Local Needs have Priority, – Outside uses are a Byproduct Heterogeneity must be addressed • Platform and Operating Systems • Representation and Access Conventions • Naming and Ontology 7/26/2016 Gio Wiederhold SKC 6 Semantic Mismatches Information comes from many autonomous sources • Differing viewpoints (by source) – differing terms for similar items { lorry, truck } – same terms for dissimilar items trunk(luggage, car) – differing coverage vehicles (DMV, AIA) – differing granularity trucks (shipper, manuf.) – different scope student museum fee, Stanford • Hinders use of information from disjoint sources – missed linkages loss of information, opportunities – irrelevant linkages overload on user or application program • Poor precision when merged Still ok for web browsing , poor for business & science 7/26/2016 Gio Wiederhold SKC 7 Need for precision More precision is needed as data volume increases --- a small error rate still leads to too many errors False Positives have to be investigated ( attractive-looking supplier - makes toys apparent drug-target with poor annotation ) Information Wall lost opportunities, suboptimal to some degree False positives = poor precision typically cost more than false negatives = poor recall data errors False Negatives cause information quantity adapted from Warren Powell, Princeton Un. 7/26/2016 Gio Wiederhold SKC 8 Ontology Sharing Three Alternatives Create a committee to define everybody’s terms Takes many years, until people are worn out Ignored when changes make deviation necessary Get all terms and put them into large model [ Cyc, UMLS, Federated Schemas, . . . ] Can be rapid Ignores conflicts Hard to maintain (requires committee) Keep all Terms distinct, except where sharing Requires initial effort Empowers participants 7/26/2016 Gio Wiederhold SKC 9 Proposed Language Solutions Specify and define terminology usage: ontology • Domain-specific ontologies XML DTD assumption – Small, focused, cooperating groups – high quality, some examples - genomics, arthritis, Shakespeare plays – allows sharable, formal tools – ongoing, local maintenance affecting users - annual updates – poor interoperation, users still face inter-domain mismatches • Cannot achieve globally consistency – wonderful for users and their programs – too many interacting sources – long time to achieve, 2 sources (UAL, BA), 3 (+ trucks), 4, … all ? – costly maintenance, since all sources evolve – no world-wide authority to dictate conformance 7/26/2016 Gio Wiederhold SKC 10 An unsolved problem Common assumption in assembling and integrating distributed information resources • The language used by the resources is the same • Sub languages used by the resources are subsets of a globally consistent language This assumption is provably false Working towards the goal of globally consistency is 1. naïve -- the goal cannot be achieved 2. inefficient -- languages are efficient in local contexts 7/26/2016 Gio Wiederhold SKC 11 General Ontologies? Have all the Knowledge together + simple for customers of KBs – hard for owners of KBs Large KB will cover multiple domains created by a committee -- slow maintained by a committee -- costly Differences in level of abstraction -- efficiency homeowner: nail carpenter: sinker, brad, boxnail, . . . 7/26/2016 Gio Wiederhold SKC 12 Structural Heterogeneity 7/26/2016 Gio Wiederhold SKC 13 Domains and Consistency . • a domain will contain many objects • the object configuration is consistent • within a domain all terms are consistent & • relationships among objects are consistent Domain Ontology • context is implicit No committee is needed to forge compromises * within a domain Compromises hide valuable details 7/26/2016 Gio Wiederhold SKC 14 SKC grounded definition . • Ontology: a set of terms and their relationships • Term: a reference to real-world and abstract objects • Relationship: a named and typed set of links between objects • Reference: a label that names objects • Abstract object: a concept which refers to other objects • Real-world object: an entity instance with a physical manifestation 7/26/2016 Gio Wiederhold SKC 15 Domain-specific Expertise . Knowledge needed is huge • Partition into natural domains • Determine domain responsibility and authority • Empower domain owners • Provide tools Consider interaction 7/26/2016 Society of specialists Gio Wiederhold SKC 16 An Ontology Algebra A knowledge-based algebra for ontologies Intersection Union Difference create a subset ontology keep sharable entries create a joint ontology merge entries create a distinct ontology remove shared entries The Articulation Ontology (AO) consists of matching rules that link domain ontologies 7/26/2016 Gio Wiederhold SKC 17 Sample Operation: INTERSECTION Result contains shared terms Source Domain 1: Owned and maintained by Store 7/26/2016 Terms useful for purchasing Source Domain 2: Owned and maintained by Factory Gio Wiederhold SKC 18 INTERSECTION support Articulation ontology Terms useful for purchasing Matching rules that use terms from the 2 source domains Store Ontology 7/26/2016 Factory Ontology Gio Wiederhold SKC 19 Sample Intersections Articulation size = size ontology matching rules : color =table(colcode) style = style Anatomy {. . . } Shoe Factory Shoe Store • Shoes { . . . } • Customers { . . . } • Employees { . . . } foot = foot Employees Nail (toe, foot) ... 7/26/2016 • Material inventory {...} • Employees { . . . } • Machinery { . . . } • Processes { . . . } • Shoes { . . . } Department Store Hardware Employees Nail (fastener) ... Gio Wiederhold SKC 20 Other Basic Operations DIFFERENCE: material fully under local control UNION: merging entire ontologies Articulation ontology typically prior intersections 7/26/2016 Gio Wiederhold SKC 21 Features of an algebra Operations can be composed Operations can be rearranged Alternate arrangements can be evaluated Optimization is enabled The record of past operations can be kept and reused 7/26/2016 Gio Wiederhold SKC 22 Sample Processing in HPKB • What is the most recent year an – Problems resolved by SKC OPEC member nation was on the * Factbook – a secondary UN security council (SC)? source -- has out of date – Related to DARPA HPKB OPEC & UN SC lists Challenge Problem • Indonesia not listed – SKC resolves 3 Sources • Gabon (left OPEC 1994) » CIA Factbook ‘96 (nation) » OPEC (members, dates) » UN (SC members, years) – SKC obtains the Correct Answer » 1996 (Indonesia) – Other groups obtained more, but factually wrong answers 7/26/2016 * different country names • Gambia => The Gambia * historical country names • Yugoslavia » UN lists future security council members • Gabon 1999 needed ancillary data Gio Wiederhold SKC 23 Interoperation via Articulation At application definition time – Match ontologies – Establish articulation rules. – Record the process At execution time – Query rewriting – Optimization based on an Ontology Algebra. For maintenance –Regenerate rules using the stored formulation 7/26/2016 Gio Wiederhold SKC 24 Semi-automatic approach Provide library of automatic match heuristics • • • • • Lexical Methods -- spelling Structural Methods -- relative graph position Reasoning-based Methods Nexus Hybrid Methods – Iterative/Non-iterative Methods GUI tool to - display matches and - verify generated matches using human expert - expert can also supply matching rules 7/26/2016 Gio Wiederhold SKC 25 Articulation Generator Being built by Prasenjit Mitra Thesaurus OntA Phrase Relator Context-based Word Relator Driver Structural Matcher Ont1 Ont2 Semantic Network (Nexus) Human Expert 7/26/2016 Gio Wiederhold SKC 26 Lexical Methods • Preprocessing rules. -Expert-generated seed rules. e.g., (Match O1.President O2.PrimeMinister) -Context-based preprocessing directives. • Thesaurus - synonyms, relationships • Distance of words as measure of relatedness. 7/26/2016 Gio Wiederhold SKC 27 Vehicle sales ontology Vehicle registration ontology Tools to create articulations Suggestions for articulations 7/26/2016 Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.) Gio Wiederhold SKC 28 Tools to create articulations Graph matcher for Articulationcreating Expert Transport ontology Vehicle ontology Suggestions for articulations 7/26/2016 Gio Wiederhold SKC 29 continue from initial point Also suggest similar terms for further articulation: • by spelling similarity, • by graph position • by term match repository Expert response: 1. Okay 2. False 3. Irrelevant to this articulation All results are recorded Okay’s are converted into articulation rules 7/26/2016 Gio Wiederhold SKC 30 Candidate Match Repository Term linkages automatically extracted from 1912 Webster’s dictionary * * free, other sources have been processed. . Based on processing headwords definitions using algebra primitives Notice presence of 2 domains: chemistry, transport 7/26/2016 Gio Wiederhold SKC 31 Using the match repository 7/26/2016 Gio Wiederhold SKC 32 Navigating the match repository 7/26/2016 Gio Wiederhold SKC 33 Relative Arc Importance • PageRank (Google) limitations – node oriented – high rank to words with little semantic value » conjunctions, articles The » prepositions, pronouns And to it • Relative arc importance – contribution of source rank to target rank – vs ,t ps / | as | pt es ,t 7/26/2016 Gio Wiederhold SKC 34 ArcRank • For All source s and target t nodes in graph sort outgoing vs ,t j , rank by sorted order res (vs ,t j ) sort incoming vs ,t , rank by sorted order ret (vsi ,t ) i for each arc es ,t compute mean( res (vs ,t ), ret (vs ,t )) • In ranking – Equal values take same rank – Ranks numbered consecutively 7/26/2016 Gio Wiederhold SKC 35 All Pairs Similarity • Compute similarity value for all node pairs n,n' product Pi of inbound arc importance vectors product Po of outbound arc importance vectors similarity = Pi Po • Similarity Matrix – Initial state: nodes similar only to themselves – Node substitution: terms replace similar ones – Iterative convergence: bounded substitution 7/26/2016 Gio Wiederhold SKC 36 Contents » Examples • Verb (Educate) • Adverb (Ever) • Proper Noun (Scotland -- undefined term 7/26/2016 Gio Wiederhold SKC 37 Examples (Verb) 7/26/2016 Gio Wiederhold SKC 38 Examples (Adverb) 7/26/2016 Gio Wiederhold SKC 39 Examples (Proper Noun) 7/26/2016 Gio Wiederhold SKC 40 Country Graphs 7/26/2016 Gio Wiederhold SKC 41 To be matched to 7/26/2016 Gio Wiederhold SKC 42 Knowledge Composition Composed knowledge for applications using A,B,C,E Articulation knowledge Legend: U U for (A B) U (B C) U (C E) Articulation knowledge (C E) U U : union U : intersection Knowledge resource E U Knowledge resource A 7/26/2016 U (B C) Knowledge resource B Knowledge resource C (C U U Articulation knowledge for (A B) D) Knowledge resource D Gio Wiederhold SKC 43 Primitive Operations Model and Instance Unary • Summarize -- structure up • Glossarize - list terms • Filter - reduce instances • Extract - circumscription Binary • Match - data corrobaration • Difference - distance measure • Intersect - schem discovery • Blend - schema extension 7/26/2016 Constructors • create object • create set Connectors • match object • match set Editors • insert value • edit value • move value • delete value Converters • object - value • object indirection • reference indirection Gio Wiederhold SKC 44 Future: exploiting the result Result has links to source Avoid n2 problem of interpreter mapping as stated by Swartout as an issue in HPKB year 1 Processing & query evaluation is best performed within Source Domains & by their engines 7/26/2016 Gio Wiederhold SKC 45 SKC Synopsis • Research: – Reliable query answers from heterogeneous, imperfect data sources • Sources: – General: CIA World Factbook ‘96, UN-www, OPEC-www Webster’s Dictionary, Thesaurus, Oxford English Dictionary – Topical: OPEC, BattleSpace Sensors, Logistics Servers • Client: – DARPA High Performance Knowledge Base project • Theory: – Rule-based algebra – Translation & Composition primitives 7/26/2016 Gio Wiederhold SKC 46 Domain Specialization • Knowledge Acquisition (20% effort) & • Knowledge Maintenance (80% effort *) to be performed by • Domain specialists • Professional organizations • Field teams of modest size autonomously maintainable Empowerment * based on experience with software 7/26/2016 Gio Wiederhold SKC 47 Innovation in SKC • • • • No need to harmonize full ontologies Focus on what is critical for interoperation Rules specific for articulation Tools for creation and maintenance – Maintenance is distributed » to n sources » to m articulation agents • Potentially many sets of articulation rules is m < n2 , depends on semantic architecture density a research question 7/26/2016 Gio Wiederhold SKC 48 Backup Viewgraphs 7/26/2016 Gio Wiederhold SKC 49 Summary . • Algebra enables Interoperation by dealing explicitly with differences by knowledge identifying maintenance domains keeping sources autonomous • Assumes domain has a common ontology composing domain ontologies requires the algebra to manage the linkages where articulation occurs processes are best executed within the domains • Knowledge about articulation is disjoint allows integration specialists to work independently supports multiple intersections and views • Maintenance is structured and partitioned 7/26/2016 Gio Wiederhold SKC 50 Current Directions • Experience with real world (imperfect) data confirms validity of our approach – Expert sources are better maintained than general sources – Rules applied to multiple sources provide more reliable and accurate query results – Component architecture enables scalable, maintainable knowledge base development • Developing proof of concept environment with HPKB standard knowledge base connectivity interface 7/26/2016 Gio Wiederhold SKC 51 Components of Ontologies . • Vocabularies • words as handles for classes and concepts • from database schemas. textbooks, catalog indexes • identified with their domain or context • Relationships • meaning is primarily defined through relationships Employee of a Company; Nails in a Shoe • Annotations • material to clarify the meaning of the terms • contributed by users as well as original authors • best with some examples 7/26/2016 Gio Wiederhold SKC 52 Status September 1997 . • Base HPKB funding from AFOSR – New World Vistas – some industrial co-funding • Prior work supported through Commercenet – support for common representation, an interlingua • Acquiring ontologies that – – – – are interesting to HPKB projects not trivial, I.e., represent realistic activities intersectable Logistics: DoD CIM, CIA, Cyc, . . . • Starting smart students • Integrating into architecture managed by TFS 7/26/2016 Gio Wiederhold SKC 53 Information Flow for Training Initiative sample scenarios scenario refinement trainer / controller aggregation/ analysis/ evaluation ISI scenario language Scenarios Objectives tasks explosion doctrine TRADOC mediator knowledge base Requirements exercise design Legend sources scenario justification Data collection Probepoint settings draft 1 aggregation 7/26/2016 Gio Wiederhold SKC 54 Interlingua(s) Interlingua: Query : Object Exchange Model Mediator Specification Language OEM MSL { OID, LABEL, TYPE, VALUE } <document {<author AUTHOR> <title TITLE>}:- <biblioentry {<author AUTHOR>}>@biblio <inproceedings {<title TITLE>}> @sybase AND AND Equal(AUTHOR, “Jeff Ullman”) Interlingua: Query: Knowledge Interchange Format Knowledge Query and Manipulation Language KIF KQML (PACKAGE :FROM ap001 :TO ap002 :CONTENT (MSG :TYPE query :CONTENT-LANGUAGE KIF :CONTENT (and (document (author@biblio ?a) (title@sybase ?t)) (eq “Jeff Ullman” ?a))) 7/26/2016 Gio Wiederhold SKC 55 Support for KB-Algebra • Ontolingua [Gruber, Fikes @ Stanford KSL]: Repository for Domain Terminologies Used for mechanical design, bibliographies, catalogs • LOOM [MacGregor@ USC ISI]: Classification-based Expert System Helps in structuring and processing ontologies • PROTÉGÉ [Musen@ Stanford MIS] Reuse • Penguin [Barsalou, Keller@ Stanford MIS, CIFE]: Object manipulation based on Relational Algebra Used for genetics laboratory, building design 7/26/2016 Gio Wiederhold SKC 56