SKC
Scalable Knowledge Composition
September 2001
Gio Wiederhold, Shrish Agarwal, Stefan Decker, Jan Jannink, Prasenjit Mitra, et al.
Stanford University, CSD
7/26/2016 SKC Synopsis

Data + Knowledge → Information
• Apply relevant Knowledge to relevant Data
• to obtain: Information for decision-making
[Diagram labels: Analyses; SKC focus; Composition of source information; Aggregation of instances; Selection; Observations; Quality Filters]

Many sources, disciplines, people
• Extraction of actionable information, so that future benefits can accrue, requires broad-based knowledge
• Areas make deep progress in isolation
• Benefits are possible when
  – solid results are available
  – the results are Integrated or Composed
• Broad base leads to heterogeneity and inconsistency of terminologies

Language differences inhibit integration
• An essential feature of science
  – autonomy of fields
  – differing granularity and scope of focus
  – growth of fields requires new terms
• A feature of technological process
  – standards require stability
  – yesterday’s innovations are today’s infrastructure
  – today’s innovations are tomorrow’s infrastructure
• Must be dealt with explicitly
  – sharing, integration, and aggregation are essential
  – large quantities of data require precision

Semantic Mismatches
Autonomous sources in all domains have
• Differing viewpoints (by source)
  – differing terms for similar items: { lorry, truck }
  – same terms for dissimilar items: trunk (luggage, car)
  – differing coverage: vehicles (DMV, police, AIA)
  – differing granularity: trucks (shipper, manuf.)
  – different scope: student (museum fee, Stanford)
  – different hierarchical structures: supplier vs. usage
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision when merged
  – OK for web browsing, poor for business & science

Heterogeneity among Domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency
  – Local Needs have Priority
  – Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Data Representation and Access Conventions
• Metadata: Naming and Ontology
  – needed to share data from distinct sources

Two Mismatch Solutions
1. A Single, Globally consistent Ontology (Your Hope)
  – wonderful for users and their programs
  – too many interacting sources
  – long time to achieve: 2 sources (UAL, LH), 3 (+ trucks), 4, … all?
  – costly maintenance, since all sources evolve
  – no world-wide authority to dictate conformance
2. Domain-specific ontologies (XML DTD assumption)
  – small, focused, cooperating groups
  – high quality; some examples: arthritis, Shakespeare plays
  – allows sharable, formal tools
  – ongoing, local maintenance affecting users: annual updates
  – poor interoperation; users still face inter-domain mismatches

Our approach (SKC project)
1. Define Terminology in a domain precisely
  – Schemas, XML DTDs
  – Ontologies
2. Develop methods to permit interoperation among differing domains (not integration)
  – Articulation: support the limited interoperation needed to solve problems in an application domain
  – Ontology Algebra: enable scalability to as many sources as are needed to support applications
3. Develop tools to support the methods
  – Ontology matching

An Ontology Algebra
The glue that holds the bricks together
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology, keep sharable entries
• Union: create a joint ontology, merge entries
• Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

Sample Operation: INTERSECTION
• Result contains shared terms
• Articulates the two domains
• Terms useful for purchasing
• Source Domain 1: Owned and maintained by Store
• Source Domain 2: Owned and maintained by Factory

Sample Intersections
Articulation ontology matching rules:
  – size = size
  – color = table(colcode)
  – style = style
Shoe Factory
• Material inventory { . . . }
• Employees { . . . }
• Machinery { . . . }
• Processes { . . . }
• Shoes { . . . }
Shoe Store
• Shoes { . . . }
• Customers { . . . }
• Employees { . . . }
Anatomy { . . . }
Matching rule: foot = foot
Department Store
• Employees
• Nail (toe, foot)
• ...
Hardware
• Employees
• Nail (fastener)
• ...

Within a Domain Terms have clear Meanings
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
No committee is needed to forge compromises* within a domain
* Compromises hide valuable details
[Diagram label: Domain Ontology]

SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation (or its representation in a factual database)

Grounding enables implementation
• We use many abstract terms in our work
  – Needed because we are dealing with many objects
  – Human thinking is limited to short-term memory
• Someone must be able to translate them into code reliably
  – Each abstract term must have a path to reality
  – One must provide that path for students and coders
• Without a clear path that is not possible
  – Not automatically at all: machines need specs
  – Not reliably by human programmers: failures occur
• Without implementation there is no benefit

INTERSECTION support
• Articulation ontology: terms useful for purchasing
• Matching rules that use terms from the 2 source domains
• Store Ontology
• Factory Ontology

Other Basic Operations
• DIFFERENCE: material fully under local control
• UNION: merging entire ontologies
• Articulation ontology: typically prior intersections

Features of an algebra
• The record of past operations can be kept and reused (experience: 3 months → 1 week for Webster's annual update, 2 weeks for OED (6 x size) [Jannink:01])
• Maintenance is enabled by using
  1. remote, deep domain expertise
  2. rapid recomposition for application domain
• Expect also that
  – Operations can be composed
  – Operations can be rearranged
  – Alternate arrangements can be evaluated
  – Optimization is enabled

Sample Processing in the DARPA HPKB challenge
What is the most recent year an OPEC member nation was on the UN security council (SC)?
– SKC resolves 3 Sources
  • CIA Factbook ‘96 (nation)
  • OPEC (members, dates)
  • UN (SC members, years)
– SKC obtains the Correct Answer
  • 1996 (Indonesia)
– Other groups obtained more, but factually wrong answers; they relied on one global source, the CIA Factbook.
– Problems resolved by SKC
  * Factbook, a secondary source, has out-of-date OPEC & UN SC lists
    – Indonesia not listed
    – Gabon (left OPEC 1994)
  * different country names
    – Gambia => The Gambia
  * historical country names
    – Yugoslavia
  * UN lists future security council members
    – Gabon 1999
  * needed ancillary data

Interoperation via Articulation
Process phases:
• At application definition time
  – Match relevant ontologies where needed
  – Establish articulation rules among them
  – Record the process
• At execution time
  – Perform query rewriting to get to sources
  – Optimize based on the ontology algebra
• For maintenance
  – Regenerate rules using the stored formulation

Generation of the articulation rules
Provide library of automatic match heuristics
• Lexical Methods
  – spelling similarity (commonly used by others)
• Structural Methods
  – relative graph position
• Reasoning-based Methods
  – Nexus: a graph we derive from the OED / Webster's; links terms based on definitions, not lexical similarity
• Hybrid Methods
  – Iteratively, with an expert in control
GUI tool to
  – display matches and
  – verify generated matches using the human expert
  – expert can also supply matching rules

Articulation Generator
Being built by Prasenjit Mitra
[Diagram labels: Thesaurus; OntA; Phrase Relator; Context-based Word Relator; Driver; Structural Matcher; Ont1; Ont2; Semantic Network (Nexus); Human Expert]

Principle of Knowledge Composition
[Diagram: knowledge resources A, B, C, D, E linked by articulation knowledge]
• Composed knowledge for applications using A, B, C, E
• Articulation knowledge for (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
• Pairwise articulation knowledge: (A ∩ B), (B ∩ C), (C ∩ E), (C ∩ D)
• Legend: ∪ : union, ∩ : intersection

Exploiting the result (future plans)
• Avoid n² problem of interpreter mapping [Swartout HPKB year 1]
• Result has links to source
• Processing & query evaluation is best performed within Source Domains & by their engines

Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort*)
to be performed by
• Domain specialists (SMEs)
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software

Domain-specific Expertise
Knowledge needed is huge
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Exploit domain-specific expertise
• Provide computer-science tools
[Side labels: Our Ontology; Society of specialists; Consider interaction]

SKC Project Synopsis
• Research Objective:
  – Precise information for applications from heterogeneous, imperfect, scalably many data sources
• Sources for Ontologies used currently:
  – General: CIA World Factbook ‘96, www.UN, www.OPEC, Webster’s Dictionary, Thesaurus, Oxford English Dictionary
  – Topical: NATO, BattleSpace Sensors, Logistics Servers
• Theory:
  – Domain autonomy and exploitation
  – Rule-based algebra over ontologies
  – Translation & Composition primitives
• Sponsors and collaboration
  – AFOSR; DARPA DAML program; W3C; Stanford KSL and SMI; Univ. of Karlsruhe, Germany; others.
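
Illustration: the three algebra operations (intersection, union, difference) over the Shoe Store and Shoe Factory example can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not the SKC implementation: an ontology is treated as a flat set of terms, an articulation rule as a plain pair of linked terms, and the rule color = table(colcode) is simplified to a direct link. The names Rule, intersection, difference, and union are hypothetical, and the term lists are abbreviated and flattened from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical articulation rule linking one term from each source ontology."""
    left: str   # term in the first ontology
    right: str  # term in the second ontology


def intersection(ont1, ont2, rules):
    """Articulation: term pairs linked by a rule whose two sides both exist."""
    return {(r.left, r.right) for r in rules if r.left in ont1 and r.right in ont2}


def difference(ont1, ont2, rules):
    """Terms of ont1 that no rule links to ont2 (material under local control)."""
    linked = {left for left, _ in intersection(ont1, ont2, rules)}
    return ont1 - linked


def union(name1, ont1, name2, ont2, rules):
    """Joint ontology: linked pairs become one entry, all other terms stay source-qualified."""
    shared = intersection(ont1, ont2, rules)
    linked1 = {left for left, _ in shared}
    linked2 = {right for _, right in shared}
    merged = {f"{name1}.{a}={name2}.{b}" for a, b in shared}
    merged |= {f"{name1}.{t}" for t in ont1 - linked1}
    merged |= {f"{name2}.{t}" for t in ont2 - linked2}
    return merged


# Assumed, flattened term lists for the two source domains.
store = {"shoes", "customers", "employees", "size", "color", "style"}
factory = {"shoes", "employees", "machinery", "processes",
           "material inventory", "size", "colcode", "style"}

# Matching rules from the slide; color = table(colcode) simplified to a direct link.
rules = [Rule("size", "size"), Rule("color", "colcode"), Rule("style", "style")]

articulation = intersection(store, factory, rules)
store_only = difference(store, factory, rules)
joint = union("store", store, "factory", factory, rules)
```

Note that "employees" occurs in both sources but stays as two separate entries in the joint ontology, because no matching rule links them. In this sketch only explicit articulation rules create a match; identical spelling alone does not.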