Using Knowledge to Exploit Data
March 2001
Gio Wiederhold
Stanford University
www-db.stanford.edu/people/gio.html
7/26/2016 Gio Wiederhold IBM R 1

Data + Knowledge → Information
[Diagram: Knowledge Loop and Data Loop. Labels: Storage, Selection, Integration, Abstraction, Experience, Projection, Information, Education, Recording, State changes, Decision-making, Action]

4 Information Technology Tasks
• Selection: SQL (one-verb language)
• Integration: Middleware, Mediation*
• Abstraction: Visualization, Summarization
• Projection*: Simulation, Spreadsheets
Actionable information is created at the confluence of data (the state) and knowledge (the ability to select, integrate, abstract, and project the state into the future).
* Focus of my talk

Today's sources and needs
Many resources
• computing
 – databases
 – analytical software
 – simulations
• people
 – organizations
 – domains
Many objectives
• operations
 – reports
 – routine actions
 – optimization
• decision-making
 – crisis response
 – allocations
 – conflict resolution
   » semantics

Integration Problems: Source Autonomy, Heterogeneity
• Hardware platforms: hidden by operating system
• Operating systems: choices are reducing (NT, UNIX, ...)
• Programming languages: irrelevant in remote access
• Database system models: relational and E-R common
• Database systems: standards, convergence
• Coverage: source dependent
 – Attributes: documented, catenable
 – Scope: undocumented, intersecting
• Data representation: conversion problems, nulls
• Data semantics: requires multi-domain knowledge

Managing Large Scale Problems
• Build big systems ?
 – cost
 – obsolescence
 – inflexibility
 – unmaintainable
• Compose systems ?
 – limits to control
 – remote maintenance
 – heterogeneity

Semantic Heterogeneity
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
 – Local needs have priority
 – Outside uses are a byproduct
Semantic heterogeneity must be addressed
• Naming of object classes and objects
• Ontology -- relationships
• Scope of object classes and attributes

Semantic Mismatch Examples
Information comes from many autonomous sources
• Differing viewpoints (by source)
 – differing terms for similar items: { lorry, truck, shag }
 – same terms for dissimilar items: trunk ( luggage, car )
 – differing coverage: vehicles ( DMV, police, AIA )
 – differing granularity: trucks ( shipper, manufacturer )
 – different scope: student ( Stanford museum, classes ), employee ( Stanford personnel, payroll )
• Hinders use of information from disjoint sources
 – missed linkages: loss of information, opportunities
 – irrelevant linkages: overload on user & application program
• Poor precision when merged
 – still ok for web browsing, poor for business & science

Need for precision
More precision is needed as data volume increases: a small error rate still leads to too many errors.
1. False positives have to be investigated and cause work ( attractive-looking supplier that makes toy trucks; apparent drug target with poor annotation )
2. False negatives cause lost opportunities; the result is suboptimal to some degree
False positives (poor precision) typically cost more than false negatives (poor recall).
[Figure: data errors versus information quantity, with an "Information Wall"; adapted from Warren Powell, Princeton Un.]
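The arithmetic behind "a small error rate still leads to too many errors" can be sketched as follows. This is a minimal illustration, not part of the original talk: the function name, the 1% false-positive rate, and the six minutes per investigation are all hypothetical values chosen for the example.

```python
def false_positive_load(n_candidates: int, fp_rate: float, minutes_per_check: float):
    """Return (expected false positives, hours needed to investigate them).

    All inputs are hypothetical; the point is only that the absolute
    workload grows linearly with data volume at a fixed error rate.
    """
    false_positives = n_candidates * fp_rate
    hours = false_positives * minutes_per_check / 60.0
    return false_positives, hours


# The error rate stays at 1%; only the data volume grows.
for n in (1_000, 100_000, 10_000_000):
    fps, hours = false_positive_load(n, fp_rate=0.01, minutes_per_check=6)
    print(f"{n:>10} candidates -> {fps:>9.0f} false positives, {hours:>8.0f} hours of checking")
```

A rate that is tolerable for a thousand candidates becomes a full-time investigation job at ten million, which is why precision has to improve as volume increases.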
Proposed Language Solution
Specify and define terminology usage: ontology
• Domain-specific ontologies (XML DTD assumption)
 – small, focused, cooperating groups
 – high quality; some examples: oil trading, SLE, Shakespeare plays
 – allows sharable, formal tools
 – ongoing, local maintenance affecting users: annual updates
 – poor interoperation; users still face inter-domain mismatches
• Cannot achieve global consistency, although that would be wonderful for users and their programs
 – too many interacting source domains
 – long time to achieve: 2 sources ( UAL, LH ), 3 (+ trucks), 4, … all ?
 – costly maintenance, since all sources evolve
 – no world-wide authority to dictate conformance

Ontology Creation
Three common alternatives:
• Create a committee to define everybody’s terms
 – Takes many years, until people are worn out
 – Ignored when changes make deviation necessary
• Collect all terms and put them into a large model [ Cyc, UMLS, Federated Schemas, . . . ]
 – Can be rapid
 – Hides conflicts, leads to low precision
 – Hard to maintain (requires committee)
• Focus on local terms
 – Requires local effort
 – Ignores conflicts among distinct domains
 – Empowers participants

Central Solutions do not Scale
• What works with 7 modules and one person in charge fails when we have 100 and need a committee
• Any change in resources affects the central module

The semantics problem
Common assumption in assembling and integrating distributed information resources:
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved or maintained
2. inefficient -- languages are efficient in local contexts

Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
[Diagram label: Domain Ontology]
No committee is needed to forge compromises* within a domain
* Compromises hide valuable details

Assumption
• The assumption of consistency defines the scope of a domain
• When it is false, we get semantic trickery:
  When the going gets tough, the tough get going.
  [Alfred Spector, when convincing me Tuesday to brave the storm of the decade]

Mediated Architecture: Transform Data to Information
• Application Layer: decision-makers at workstations
 – interface: CORBA / XML
• Mediation Layer: domain-specific value-added services
 – interface: SQL / SimQL
• Foundation Layer: data and simulation resources

Sample: Integration provides precision
Problem posed in DARPA HPKB Program
• What is the most recent year an OPEC member nation was on the UN security council (SC)?
• SKC resolves 3 sources
   » CIA Factbook ‘96 (nation)
   » OPEC (members, dates)
   » UN (SC members, years)
 – SKC obtains the correct answer
   » 1996 (Indonesia)
• Other groups used only the integrated source, the CIA Factbook
 – obtained more, but factually wrong, answers
Subproblems resolved by SKC
* Factbook, a secondary source, has out-of-date OPEC & UN SC lists
 • Indonesia not listed
 • Gabon (left OPEC 1994)
* different country names
 • Gambia => The Gambia
* historical country names
 • Yugoslavia
» UN lists future security council members
 • Gabon 1999: needed temporal reasoning

Formalize precise integration
• Start with local, well-defined and maintained ontologies.
• Do not assume mutual consistency
• Create means to link those ontologies precisely: articulations
 – Concepts
 – Tools for ontology support
 – Tools for ontology exploitation
• Support to rapidly update the articulations when sources are updated

SKC grounded (DBish) definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object (captured in databases): an entity instance with a physical manifestation

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology, keep sharable entries
• Union: create a joint ontology, merge entries
• Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

Sample Operation: INTERSECTION
• Source Domain 1: owned and maintained by Store
• Source Domain 2: owned and maintained by Factory
• Result contains shared terms: terms useful for purchasing

Other Basic Operations
• DIFFERENCE: material fully under local control
• UNION: merging entire ontologies
 – Articulation ontology: typically prior intersections

Sample Intersections
[Diagram of overlapping ontologies]
• Shoe Store: Shoes { . . . }, Customers { . . . }, Employees { . . . }
• Shoe Factory: Material inventory {...}, Employees { . . . }, Machinery { . . . }, Processes { . . . }, Shoes { . . . }
• Articulation ontology matching rules: size = size, color = table(colcode), style = style
• Anatomy {. . . }: Employees, Nail (toe, foot), ...; matching rule foot = foot
• Department Store, Hardware: Employees, Nail (fastener), ...
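The intersection and difference operations above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SKC implementation: ontologies are reduced to plain sets of term names, the `Rule` class and both functions are hypothetical, and the color table stands in for the `color = table(colcode)` matching rule on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    """One articulation rule linking a term in each source ontology (hypothetical type)."""
    left: str                                   # term in the first ontology
    right: str                                  # term in the second ontology
    convert: Optional[Callable[[object], object]] = None  # optional value conversion


def intersection(ont1: set, ont2: set, rules: list) -> list:
    """Keep only the rules whose terms exist in both source ontologies."""
    return [r for r in rules if r.left in ont1 and r.right in ont2]


def difference(ont1: set, ont2: set, rules: list) -> set:
    """Terms of ont1 that no applicable rule shares with ont2 (fully local material)."""
    shared = {r.left for r in intersection(ont1, ont2, rules)}
    return ont1 - shared


# Hypothetical store and factory vocabularies, following the shoe example.
store = {"size", "color", "style", "customer"}
factory = {"size", "colcode", "style", "machinery"}
COLOR_TABLE = {"red": 1, "black": 2}            # assumed lookup for color = table(colcode)

rules = [
    Rule("size", "size"),
    Rule("color", "colcode", COLOR_TABLE.get),
    Rule("style", "style"),
    Rule("customer", "client"),                 # no "client" in factory, so never applies
]

articulation = intersection(store, factory, rules)
local_to_store = difference(store, factory, rules)
```

Here `articulation` holds the three rules usable for purchasing, while `local_to_store` contains only the term that stays fully under the store's control. Keeping the rules as data rather than merging the two vocabularies is what lets each source evolve independently.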
INTERSECTION support
• Store Ontology and Factory Ontology: the 2 source domains
• Articulation ontology: terms useful for purchasing
• Matching rules that use terms from the 2 source domains

Interoperation via Articulation
At application definition time
 – Match ontologies
 – Establish articulation rules
 – Record the process
At execution time
 – Query rewriting
 – Optimization based on an Ontology Algebra
For maintenance
 – Regenerate rules using the stored formulation

Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused

Example: Nexus creation
Term linkages automatically extracted from the 1913 Webster’s dictionary* using algebra primitives.
* free; other sources have been processed
[Figure: one graph from the network, showing term definitions using the term 'Vehicle' and terms used in defining the term 'Vehicle'. Notice the presence of 2 domains: chemistry, transport.]

Re-Use Example
• Nexus Creation (a graph derived from a dictionary, linking headwords and definitions)
1. Created from public 1913 Webster’s (OCRed, 5% errors): 50Mb, 97K defs; 17 weeks, 480 primitive rules
2. Recreated from revised 1913 Webster’s (> 10%): 50Mb, 97K defs; 2 weeks, 440 primitive rules
3. Applied to Roget’s Thesaurus: 1Mb, 1K defs; <1 week, 40 primitive rules
4. Applied to Oxford English Dictionary: 570Mb, 5130K defs; 5 weeks, 400 primitive rules
[ Jan Jannink thesis, Stanford CSD 2001 ]

Articulation Generator
Being built by Prasenjit Mitra
[Diagram components: Ont1, Ont2, OntA, Driver, Phrase Relator, Context-based Word Relator, Structural Matcher, Thesaurus, Semantic Network (Nexus), Human Expert]

Semi-automatic approach
Provide library of automatic match heuristics
• Lexical Methods -- spelling
• Structural Methods -- walk the graphs and try to link terms ( buyer[DMV], owner[Ford] )
• Nexus
• Reasoning-based Methods
• Hybrid Methods -- generally iterative methods
GUI tool to
 - display matches and
 - verify generated matches using human expert
 - expert can also supply matching rules

Lexical Methods
• Preprocessing rules
 - Expert-generated seed rules, e.g., (Match O1.President O2.PrimeMinister)
 - Context-based preprocessing directives
• Existing narrow domain ontologies
• Thesaurus: synonyms, relationships
• Nexus: definition-based graph linkage
• Distance of words as measure of relatedness

Tools to create articulations
[Screenshot: graph matcher for the articulation-creating expert, showing a Transport ontology, a Vehicle ontology, and suggestions for articulations]

continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity
• by graph position
• by term match in the nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded.
Okays are converted into articulation rules.

Using the nexus

Navigating the nexus

Exploiting the result
• Result has links to source
• Processing and evaluation is best performed within source domains: R=f(KR1, IE1), ..., R=f(KRn, IEn)
• Not yet done in SKC

Next Steps
1. We now have integrated multiple sources, with great care
2. This result is likely to be large; it needs
 – Visualization
 – Summarization
 – Abstraction
 – Modeling
3. Projection into the future
 – a major gap in today’s information systems

Current state of DM Support
[Timeline figure: past, now, future]
• Up to now: organized support
 – Data integration
 – Databases: distributed, heterogeneous
• Future: disjointed support
 – Intuition +
 – Spreadsheets
 – Planning of allocations
 – Other simulations
 – various point assessments

Information Systems should also Project into the Future
[Timeline figure: past, now, future]
• Databases, accessed via SQL or CORBA-compliant wrappers
• Msgs, sensors
• Simulations, accessed via SimQL and compliant wrappers
Human decision-making is essential when inputs do not have common metrics: price vs. quality of goods / healthcare

Types of simulation services
1. Continuously executing: weather prediction
 – SimQL result reports best-match samples
2. Execution specific to query: what-if assessment
 – primary tool today: spreadsheets
 – may require HPC power for adequate response
3. Low-level simulation results stored in a base: use of smart materials in product design
 – perform inter-/extra-polations to match queries
4. Combinations, i.e., 2. + 3.: top-layer simulation using stored partial lower-level results: product performance in new setting
All with uncertainty parameters

Use of Simulation Results
[Figure: tree of alternative courses of action over time, with a likelihood on each branch]
Simulation results can be composed for planning alternative courses of action.
A: Composition should include computation and recomputation of likelihoods
B: Likelihoods × values of outcomes lead to present values
Likelihoods change as now moves forward and eliminates earlier alternatives.
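Points A and B can be made concrete with a small sketch. It assumes that branch likelihoods along one course of action multiply, that a present value is the likelihood-weighted sum of outcome values, and that eliminating an alternative means renormalizing what remains. The function names and all numbers are hypothetical and are not taken from the slide's figure.

```python
from math import prod


def path_likelihood(branch_probs):
    """A: likelihood of one course of action, the product of its branch likelihoods."""
    return prod(branch_probs)


def present_value(outcomes):
    """B: sum of likelihood x value over (likelihood, value) pairs."""
    return sum(p * v for p, v in outcomes)


def renormalize(outcomes):
    """Recompute likelihoods after some alternatives have been eliminated."""
    total = sum(p for p, _ in outcomes)
    return [(p / total, v) for p, v in outcomes]


# Hypothetical outcomes at planning time: (likelihood, value).
outcomes = [(0.5, 1000), (0.3, 500), (0.2, -200)]
value_at_planning = present_value(outcomes)

# "Now" moves forward and the first alternative is no longer possible.
remaining = renormalize(outcomes[1:])
value_later = present_value(remaining)

print(path_likelihood([0.6, 0.5]), value_at_planning, value_later)
```

In this example the present value drops once the most favorable alternative is eliminated, which is why the likelihoods have to be recomputed as time advances rather than fixed at planning time.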
[Use of Simulation Results figure, outcome values: 1000, 500, 700, 1200, -200, 600, 950, -100]

Prototype Implementation
[Architecture diagram. Actors: Developer, Customer. Components: Schema Manager, Parser, Metadata Manager, Metadata, Query Manager, Wrapped Simulations. Flows: Development Interaction, Production Interaction, Query, Help, Schema Commands, Filing of Access Specs, Use of Access Specs, Initiation and Results of Simulations, Error reports]

Interoperation for Simulation
Databases
• serve clients via SQL by
 – sharing a model (the schema)
 – a query language over the model
• the SQL interface enables
 – independence of application development and DBMS technology development
 – reuse of infrastructure
Today
• most new systems use a DBMS for data storage, even with less performance and inability to handle all problems, but enough of them well enough
Simulations should
• serve clients via SimQL by
 – sharing a model (research q.)
 – a query language over the model
• a SimQL interface will enable
 – independence of application development and simulation technology development
 – reuse of infrastructure
Objective
• build information systems combining DBMS and simulations, even with less performance and inability to handle all problems, but enough of them . . .

Even the present needs SimQL
Not all data are current:
[Timeline figure: past, now, future. Last recorded observations lie in the past; simple simulations extrapolate the data to the point-in-time needed for situational assessment]
• Where is the customer now?
• Is the delivery truck in X?
• Where will they meet (mobile setting)?
• Is the right stuff on the truck?
• Will the loading crew be at X?
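A "simple simulation to extrapolate data" can be as small as dead reckoning from the last recorded observation. The sketch below is an assumption-laden illustration and does not show SimQL syntax, which the slides do not give: the `Observation` type, the `extrapolate` function, and the 10% speed error are all hypothetical. It returns an uncertainty alongside the estimate, in the spirit of "all with uncertainty parameters".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Last recorded state of a delivery truck (hypothetical record layout)."""
    t: float          # time of the observation, in hours
    position: float   # kilometres along the route
    speed: float      # kilometres per hour at observation time


def extrapolate(obs: Observation, now: float, speed_error: float = 0.1):
    """Estimate the position at time `now`, with an uncertainty that grows with data age.

    speed_error is an assumed relative error on the recorded speed.
    """
    age = now - obs.t
    if age < 0:
        raise ValueError("observation lies in the future")
    estimate = obs.position + obs.speed * age
    uncertainty = abs(obs.speed) * speed_error * age
    return estimate, uncertainty


# Where is the truck now, half an hour after it was last seen?
last_seen = Observation(t=9.0, position=120.0, speed=60.0)
where, plus_minus = extrapolate(last_seen, now=9.5)
print(f"truck is near km {where:.0f}, give or take {plus_minus:.0f} km")
```

The older the observation, the wider the error band, so a client asking "is the truck in X?" gets an answer it can weigh rather than a stale fact presented as current.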
The language is an interface
Applies to database and simulation services
• A service program can be written in any language
   » C, C++, Fortran, Java
• A simulation service must comply with the interface specifications
   » SQL, SimQL schema
   » SQL, SimQL query language
   » XML for output: voluminous semantic tags

Moving to a Service Paradigm
• Server is an independent contractor, defines service
• Client selects service and specifies parameters
• Server’s success depends on value provided
• Some form of payment to be received for services
Databases are a current example. Simulations have the same potential.

Knowledge maintenance
[Diagram: a module of software & people (models, programs, rules, caches, . . .) between an Application Interface and Resource Interfaces. Roles: Owner/Creator, Maintainer, Lessor/Seller, Advertiser. Sources of change: changes of user needs, domain changes, resource changes]

Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort*)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size: autonomously maintainable
Empowerment
* based on experience with software

Today DM support is disjoint
• does not interoperate: Planning, Science, Distribution
• extensions to move to networked support are also disjoint

Research Goals
Reduce volume, especially type 2 errors
• Among intermediate processing nodes
 – bandwidth requirements
• To end-users
 – mobile displays
 – human capabilities: 7±2
Services reduce bandwidth requirements
 – Effective mediation (not just middleware)
 – smarts can reduce volume while increasing information content
 – Model of customer-class needs > personalization

Research Questions
Manage the complexity: "only simple stuff works"
• Number of interfaces
 – Among n domains, m < n² ?
• Complexity of interfaces
 – Variety of standards, used to establish dominance
 – Feature overload (recovery, security, ... )
• Intellectual domain interfaces
 – Acceptance of service paradigm, rather than ownership

Vision for Information Systems
[Figure labels: sensors, Models, Predictions, CoAs, Decisions, Effects & Reactions, on a timeline of past, now, future]

Backup Viewgraphs

Functional Layer
[Layered diagram]
• Layers: Human-computer interaction, Application-specific code, MEDIATION (domain-specific code), Source-specific code
• Interfaces: User interface, Service interface, Resource access interface, Real-world interface

Requirements . . . assessment
• Access to remote data: good progress (I3)
• Resolution of heterogeneity: ok, - semantics (I3, ..)
• Integration of matched data: good progress (DDB)
• Access to knowledge: progress (HPKB)
• Access to sensor data: ok, poor integration (BADD)
• Access to simulation tools: poor, manual
• Creation of alternate CoAs: isolated tools (DRPI)
• Assessment of alternate CoAs: isolated tools
• Showing alternatives: good progress
• Provision of drill-down: poor, many barriers
• Security: poor integration (SURV)

SKC Summary
• Algebra enables interoperation
 – by dealing explicitly with differences
 – by identifying knowledge maintenance domains
 – keeping sources autonomous
• Assumes domain has a common ontology
 – composing domain ontologies requires the algebra to manage the linkages where articulation occurs
 – processes are best executed within the domains
• Knowledge about articulation is disjoint
 – allows integration specialists to work independently
 – supports multiple intersections and views
• Maintenance is structured and partitioned

Current Directions
• Experience with real-world (imperfect) data confirms validity of our approach
 – Expert sources are better maintained than general sources
 – Rules applied to multiple sources provide more reliable and accurate query results
 – Component architecture enables scalable, maintainable knowledge base development
• Developing proof-of-concept environment with HPKB standard knowledge base connectivity interface

Components of Ontologies
• Vocabularies
 – words as handles for classes and concepts
 – from database schemas, textbooks, catalog indexes
 – identified with their domain or context
• Relationships
 – meaning is primarily defined through relationships: Employee of a Company; Nails in a Shoe
• Annotations
 – material to clarify the meaning of the terms
 – contributed by users as well as original authors
 – best with some examples

Knowledge Composition
[Diagram]
• Knowledge resources: A, B, C, D, E
• Articulation knowledge for the intersections (A ∩ B), (B ∩ C), (C ∩ D), (C ∩ E)
• Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
• Legend: ∪ : union, ∩ : intersection

Primitive Operations
Model and Instance
 Unary
 • Summarize: structure up
 • Glossarize: list terms
 • Filter: reduce instances
 • Extract: circumscription
 Binary
 • Match: data corroboration
 • Difference: distance measure
 • Intersect: schema discovery
 • Blend: schema extension
Constructors
 • create object
 • create set
Connectors
 • match object
 • match set
Editors
 • insert value
 • edit value
 • move value
 • delete value
Converters
 • object - value
 • object indirection
 • reference indirection

Innovation in SKC
• No need to harmonize full ontologies
• Focus on what is critical for interoperation
• Rules specific for articulation
• Tools for creation and maintenance
 – Maintenance is distributed
   » to n sources
   » to m articulation agents
• Potentially many sets of articulation rules
 – is m < n² ? Depends on semantic architecture density: a research question

SKC Synopsis
• Research:
 – Reliable query answers from heterogeneous, imperfect data sources
• Sources:
 – General: CIA World Factbook ‘96, UN-www, OPEC-www, Webster’s Dictionary, Thesaurus, Oxford English Dictionary
 – Topical: OPEC, BattleSpace Sensors, Logistics Servers
• Client:
 – DARPA High Performance Knowledge Base project
• Theory:
 – Rule-based algebra
 – Translation & Composition primitives

ArcRank
• For all source nodes s and target nodes t in the graph
 – sort the outgoing arc values $v_{s,t_j}$, rank by sorted order: $r_{e_s}(v_{s,t_j})$
 – sort the incoming arc values $v_{s_i,t}$, rank by sorted order: $r_{e_t}(v_{s_i,t})$
 – for each arc $e_{s,t}$ compute $\mathrm{mean}\bigl(r_{e_s}(v_{s,t}),\, r_{e_t}(v_{s,t})\bigr)$
• In ranking
 – Equal values take the same rank
 – Ranks are numbered consecutively

Tools to create articulations
[Screenshot: Vehicle sales ontology, Vehicle registration ontology, suggestions for articulations]
Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)

Relative Arc Importance
• PageRank (Google) limitations
 – node oriented
 – high rank to words with little semantic value
   » conjunctions, articles: And, The
   » prepositions, pronouns: to, it
• Relative arc importance
 – contribution of source rank to target rank
 – for arc $e_{s,t}$: $v_{s,t} = \dfrac{p_s / |a_s|}{p_t}$

Internet requirements for Simulation
• Ubiquitous access to simulations of a wide variety of types
• Rapid response to parameter changes
 – often high-performance computation is needed
 – distributed simulations with synchronization
• Rapid service composition
 – High bandwidth among simulations
 – Access to multiple services in parallel

Country Graphs

To be matched to

Place of SimQL in Objective-based Planning
Higher Level Objectives, Intel, OB, ROE, Commanders Guidance & Intent, Etc.
[Process diagram, from JFACC PIP. Numbered steps: 1 Determine Campaign Status; 2 Develop Objectives*; 3 Phase & Sequence Objectives*; 4 Assign Task / Activity; 5 Develop Assessment Plan; Determine Req’mts; 7 Assess and/or Rehearse Plan. Flows: Execution Feedback, Phased Sequenced Objectives, Prioritized Sequenced Tasks, Assessment Plan, Required Resources, Resource Constraints, Simulation parameters, Simulation results, Plan Assessment Feedback. SimQL: access to simulations. *: w/Measures]

Information Flow for Training Initiative
[Flow diagram labels: sample scenarios, scenario refinement, trainer / controller, aggregation / analysis / evaluation, ISI scenario language, Scenarios, Objectives, tasks, explosion, doctrine, TRADOC, mediator, knowledge base, Requirements, exercise design, scenario justification, Data collection, Probepoint settings, draft 1, aggregation. Legend: sources]

Interlingua(s)
• Interlingua: Object Exchange Model (OEM); Query: Mediator Specification Language (MSL)
    { OID, LABEL, TYPE, VALUE }
    <document {<author AUTHOR> <title TITLE>}> :-
        <biblioentry {<author AUTHOR>}>@biblio
        AND <inproceedings {<title TITLE>}>@sybase
        AND Equal(AUTHOR, “Jeff Ullman”)
• Interlingua: Knowledge Interchange Format (KIF); Query: Knowledge Query and Manipulation Language (KQML)
    (PACKAGE :FROM ap001 :TO ap002
      :CONTENT (MSG :TYPE query :CONTENT-LANGUAGE KIF
        :CONTENT (and (document (author@biblio ?a) (title@sybase ?t))
                      (eq “Jeff Ullman” ?a))))

Support for KB-Algebra
• Ontolingua [Gruber, Fikes @ Stanford KSL]: Repository for Domain Terminologies
 – Used for mechanical design, bibliographies, catalogs
• LOOM [MacGregor @ USC ISI]: Classification-based Expert System
 – Helps in structuring and processing ontologies
• PROTÉGÉ [Musen @ Stanford MIS]: Reuse
• Penguin [Barsalou, Keller @ Stanford MIS, CIFE]: Object manipulation based on Relational Algebra
 – Used for genetics laboratory, building design
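The ArcRank and Relative Arc Importance viewgraphs in this backup section describe a two-step computation that can be sketched as follows. This is a reading of those slides under stated assumptions, not the code used to build the nexus: node ranks `p` are taken as given (for example from a PageRank run), arcs are ranked in descending order of importance so that rank 1 is the most important, and the function names are hypothetical.

```python
from collections import defaultdict


def dense_ranks(values):
    """Rank values so that equal values share a rank and ranks are consecutive."""
    ordered = sorted(set(values), reverse=True)   # assumption: largest value gets rank 1
    return {value: rank for rank, value in enumerate(ordered, start=1)}


def arc_rank(arcs, p):
    """Return (relative arc importance, ArcRank) for a list of (source, target) arcs.

    p maps each node to its rank value; it is assumed to be precomputed.
    """
    out_arcs, in_arcs = defaultdict(list), defaultdict(list)
    for s, t in arcs:
        out_arcs[s].append((s, t))
        in_arcs[t].append((s, t))

    # Relative arc importance: share of the source's rank carried by the arc,
    # divided by the rank of the target.
    v = {(s, t): (p[s] / len(out_arcs[s])) / p[t] for s, t in arcs}

    ranks = {}
    for s, t in arcs:
        r_out = dense_ranks([v[a] for a in out_arcs[s]])[v[(s, t)]]
        r_in = dense_ranks([v[a] for a in in_arcs[t]])[v[(s, t)]]
        ranks[(s, t)] = (r_out + r_in) / 2        # mean of the two ranks
    return v, ranks


# Tiny hypothetical graph with made-up node ranks.
arcs = [("a", "b"), ("a", "c"), ("b", "c")]
p = {"a": 0.2, "b": 0.3, "c": 0.5}
importance, arcranks = arc_rank(arcs, p)
```

Because the measure is attached to arcs rather than nodes, a frequent but semantically empty word can have a high node rank while each of its individual links still ranks low.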