Prepared for CERN seminar, June 2000
Heterogeneous Information Management
Gio Wiederhold, Stanford University
7/26/2016 Gio - CERN

Abstract
Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The autonomy of the suppliers causes heterogeneity and inconsistencies. The number of potential suppliers and their autonomy also create information overload. To cope with these issues novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change. We will present the concepts and status of such services. Collaboration, security, and payment schemes are some of the considerations.

Outline
• Background for Mediated Systems
• Motivation and Functions needed
• Architecture
• Current Status
• Resolving Semantic Heterogeneity
• Research Directions
• Background
  – Maintenance
  – Research Projects
  – Integration of Simulation Information

Evolution of mediation applications
[Figure: applications A1-A6, integrators I1-I2, mediators M1-M2, wrappers W1-W3, and data sources D1-D6 connected over a network, with parts labeled a. through e.]

Transforming Data to Information
• Application Layer: users at workstations
• Mediation Layer: value-added services
• Foundation Layer: data and simulation resources

Data and Knowledge
[Figure: a Data Loop and a Knowledge Loop. Labels: Storage, Education, Selection, Abstraction, Integration, Recording, Summarization, Experience, Decision-making, State changes, Action.]
Information is created at the confluence of data -- the state -- and knowledge -- the ability to select and project the state into the future.

Definition*
A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.
It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.
* Wiederhold: IEEE Computer, March 1992

Information overload, Data starvation
• More databases
  – public & corporate
• Faster communication
  – digital
  – packeting: TCP-IP, ATM
• World-wide connectivity
  – Internet & Intranets
  – world-wide web
• Disintermediation
  – ubiquitous publishing

Change in Supply vs Demand
What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it. [Herbert Simon]

Function of Mediation
Apply Domain-specific Specialist Knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
INFORMATION

Interfaces
[Figure labels, in order: User interface, Human-computer Interaction, Application-specific code, Service interface, MEDIATION, Resource access interface, Domain-specific code, Source-specific code, Real-world interface.]

Making data relevant
• Data reduction
• Data abstraction
  – Level changing
  – Summarization
  – Exception search
  – Level change to integrate with other data sources
• Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm

Functions inside Mediation
[Figure labels: articulation, Summarize, Transform, Selection, Heterogeneous resources.]

Status of Mediation Technology
Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain Expert maintains models
• Specification determines functions
• Resource changes trigger regeneration

Coverage of Current DARPA I3 Efforts
[Figure: coverage chart over Databases / Web / Text / Simulation. Legend symbols, in the order shown ( :-) :-| :-( :-[ ): Good progress / active research / related work / poor coverage. Areas: Discovery, Abstraction, Maintenance, Caching, History, Security Mediators, Facilitation, Integration over sources, Wrapping (syntactical heterogeneity). Annotations: (web, schema searching), for relevance to customer, for multiple domains, for cooperation, (rule technology?), (auto linking).]

Mediator Design Principle
Transform Data into Information
Match Customer Model (Hierarchical) to Resource Model (General network)
(and maintain models)

Heterogeneity among Domains
If interoperation involves distinct domains mismatch ensues
• Autonomy conflicts with consistency,
  – Local Needs have Priority,
  – Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontology

Unsolved problem in Interoperation
Common assumption in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts

Ontology: components
We represent the contents and structure of a language by its ontology:
• a set of well-defined terms, which delimit the domain of discourse
• relationships among those terms, chosen from a limited set
a formalizable subset of expert knowledge

SKC’s grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Real-world object: an entity instance with a physical manifestation
• Abstract object: a concept which refers to other objects

Where are Ontologies found?
Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Variable and Class names in Software
Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.
Knowledge-bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms.

Establishing Ontologies
Top-down:
  – Commonly acceptable UPPER layers
Domain-specific
  – Analysis and Sharing tools
  – Model and Object-type based
Bottom-up
  – Wordlist creation from task-specific collections
  – Database models, schemas, and contents

Large Ontologies: good or bad?
Have all the Knowledge together
  + simple for customers of KBs
  – hard for owners of KBs, must synchronize with many others
  – in the limit -- everybody must be globally consistent
Large KB will cover multiple / all domains
  created by a committee -- slow
  maintained by a committee -- costly
Differences in level of abstraction -- efficiency
  homeowner: nail
  carpenter: sinker, brad, boxnail, . . .

Domain ontology assumption
• a domain will contain known objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
Domain Ontology
• context is implicit in use
• explicit context is needed for external use
No committee is needed to forge compromises within a domain
Compromises hide valuable details

SKC Objective
Provide for Maintainable Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts

Conservative assumption
When dealing with multiple ontologies one can never be sure that identically or similarly spelled words mean the same thing, i.e., refer to exactly the same set of real-world objects under all current and future conditions
• Common, optimistic assumption: Meaning is identical
  – Gets worse when terms are stemmed
• SKC, conservative or pessimistic assumption: Meaning never matches, unless there is a match rule
  – number of matching rules is reduced by focusing on the articulation

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology, keep sharable entries
• Union: create a joint ontology, merge entries
• Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

Sample Operation: INTERSECTION
[Figure: Source Domain 1 (owned and maintained by Store) and Source Domain 2 (owned and maintained by Factory). The result contains shared terms: terms useful for purchasing.]

INTERSECTION support
[Figure: the Articulation ontology (terms useful for purchasing) holds matching rules that use terms from the 2 source domains, the Store Ontology and the Factory Ontology.]

Sample Intersections
[Figure: Articulation ontology matching rules: size = size, color = table(colcode), style = style, foot = foot.
Shoe Factory: Material inventory {...}, Employees {...}, Machinery {...}, Processes {...}, Shoes {...}.
Shoe Store: Shoes {...}, Customers {...}, Employees {...}.
Anatomy {...}.
Department Store: Employees, Nail (toe, foot), ...
Hardware: Employees, Nail (fastener), ...]

Other Basic Operations
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
Articulation ontology: typically prior intersections

Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be kept and reused

Knowledge Composition
[Figure: knowledge resources A, B, C, D, and E, with articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E). Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ union, ∩ intersection.]

Sample Processing in HPKB
• What is the most recent year an OPEC member nation was on the UN security council?
  – Related to DARPA HPKB Challenge Problem
  – SKC resolves 3 Sources
    • CIA Factbook ‘96 (nation)
    • OPEC (members, dates)
    • UN (SC members, years)
  – SKC obtains the Correct Answer
    • 1996 (Indonesia)
  – Other groups obtained more, but factually wrong answers
  – Problems resolved by SKC
    • Factbook has out of date OPEC & UN SC lists
      – Indonesia not listed
      – Gabon (left OPEC 1994)
    • different country names
      – Gambia => The Gambia
    • historical country names
      – Yugoslavia
    • UN lists future security council members
      – Gabon 1999
    • intent of original question
      – Temporal variants

Tools to create articulations
[Figure: graph matcher for the articulation-creating expert, applied to the Transport ontology and the Vehicle ontology, producing suggestions for articulations.]

Continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity,
• by graph position
• by term match repository
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okay’s are converted into articulation rules

Candidate Match Repository
Term linkages automatically extracted from 1912 Webster’s dictionary *
* free, other sources have been processed.
Based on processing headwords and definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport

[Figure: Using the match repository]

[Figure: Navigating the match repository]

Primitive Operations: Model and Instance
Model
  Unary
  • Summarize -- structure up
  • Glossarize -- list terms
  • Filter -- reduce instances
  • Extract -- circumscription
  Binary
  • Match -- data corroboration
  • Difference -- distance measure
  • Intersect -- schema discovery
  • Blend -- schema extension
Instance
  Constructors
  • create object
  • create set
  Connectors
  • match object
  • match set
  Editors
  • insert value
  • edit value
  • move value
  • delete value
  Converters
  • object - value
  • object indirection
  • reference indirection

Future: exploiting the result
Avoid n² problem of interpreter mapping, as stated by Swartout as an issue in HPKB year 1
Result has links to source
Processing & query evaluation is best performed within Source Domains & by their engines

SKC Synopsis
• Research: Reliable query answers from heterogeneous, imperfect data sources
• Sources:
  – General: CIA World Factbook ‘96, UN www, OPEC www, Webster’s Dictionary, Thesaurus, Oxford English Dictionary
  – Topical: OPEC, BattleSpace Sensors, Logistics Servers
• Client: DARPA High Performance Knowledge Base (HPKB) project
• Theory: Rule-based algebra
  – Translation & Composition primitives

Innovation in SKC
• No need to harmonize full ontologies
• Focus on what is critical for interoperation
• Rules specific for articulation
• Potentially many sets of articulation rules
• Maintenance is distributed
  – to n sources
  – to m articulation agents
  is m < n², depending on architecture density -- a research question

Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software

SKC Summary
• Algebra enables Interoperation
  by dealing explicitly with differences
  by identifying knowledge maintenance domains
  keeping sources autonomous
• Assumes domain has a common ontology
  composing domain ontologies requires the algebra to manage the linkages where articulation occurs
  processes are best executed within the domains
• Knowledge about articulation is disjoint
  allows integration specialists to work independently
  supports multiple intersections and views
• Maintenance is structured and partitioned

Current SKC Directions
• Experience with real world (imperfect) data confirms validity of our approach
  – Expert sources are better maintained than general sources
  – Rules applied to multiple sources provide more reliable and accurate query results
  – Component architecture enables scalable, maintainable knowledge base development
• Porting the concepts to the DARPA Agent Markup Language (DAML) setting

Mediation Research Topics
• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
  – many levels and types
  – roles for wrappers vs. mediators vs. applications
  – scalability by partitioning -- make it simple!
  – Domain Ontologies --- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
• New types of information systems

Long Range Science Vision
• Databases: access, storage, algebras
• Systems Engineering: analysis, documentation, costing
• Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
• Integration Methods
• GIS: Spatial is special.
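
As a rough illustration of the ontology algebra summarized in the SKC slides above, the following Python sketch treats each ontology as a plain set of terms and the articulation as an explicit set of match rules. The names `Rule`, `intersection`, `difference`, and `union`, and the toy store and factory terms, are assumptions made for this example and are not part of SKC. A real rule such as color = table(colcode) would also carry a conversion; here a rule only links two terms. In line with the conservative assumption, identically spelled terms are not matched unless a rule says so.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical articulation rule: a term in source 1 matches a term in source 2."""
    left: str
    right: str


def intersection(o1, o2, rules):
    """Keep only the term pairs linked by an explicit rule (conservative matching)."""
    return {(r.left, r.right) for r in rules if r.left in o1 and r.right in o2}


def difference(o1, o2, rules):
    """Terms of o1 that no rule links to o2: material fully under local control."""
    shared = {left for left, _ in intersection(o1, o2, rules)}
    return o1 - shared


def union(o1, o2, rules, name1="store", name2="factory"):
    """Merge both ontologies: matched terms become one entry, the rest stay qualified."""
    merged = {f"{left}={right}" for left, right in intersection(o1, o2, rules)}
    merged |= {f"{name1}.{term}" for term in difference(o1, o2, rules)}
    reversed_rules = {Rule(r.right, r.left) for r in rules}
    merged |= {f"{name2}.{term}" for term in difference(o2, o1, reversed_rules)}
    return merged


# Toy ontologies; "employees" appears in both but has no match rule.
store = {"size", "color", "style", "customers", "employees"}
factory = {"size", "colcode", "style", "machinery", "employees"}
rules = {Rule("size", "size"), Rule("color", "colcode"), Rule("style", "style")}

shared = intersection(store, factory, rules)
local = difference(store, factory, rules)
joint = union(store, factory, rules)
```

In this sketch `employees` stays separate in the union (`store.employees` and `factory.employees`) because no rule links the two, which is the behavior the conservative assumption asks for.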
Integration Science

Background Material:
• Technology Sources
• Maintenance
• Projects
• Information about the Future

Interfaces
• Human - Computer {x-widgets, HTML}
• Application - Mediator {OQL, KQML, ...}
• Mediator - Data sources {SQL, TQL, XML, … }
• Data - real world {sensors, clerks, … }

Support for KB-Algebra
• Ontolingua [Gruber, Fikes @ Stanford KSL]: Repository for Domain Terminologies
  Used for mechanical design, bibliographies, catalogs
• LOOM [MacGregor @ USC ISI]: Classification-based Expert System
  Helps in structuring and processing ontologies
• PROTÉGÉ [Musen @ Stanford MIS]: Reuse
• Penguin [Barsalou, Keller @ Stanford MIS, CIFE]: Object manipulation based on Relational Algebra
  Used for genetics laboratory, building design

Getting there: Available Technology/Science
Web Search Tools, Multimedia Interfaces, Agents, Database Models, Security Filters, Domain Ontologies, Object Bases, Temporal Algebras, Uncertainty Algebras, Customer Models, Constraint Management, Circumscription, Communication Standards, Active Databases, DB Views, Case-based Reasoning, Internet Billing, Knobots, Simulation Access, Wrappers, Public Databases, GIS, Caching, Text & Speech Processing, Distributed Storage Systems, High Perf. Comm.

Fat versus thin mediators
Service scope
• too thin: insufficient added value
• too fat: hard to compose
Domain scope
• too narrow: few customers
• too broad: hard to maintain, needs a committee
Just right

Maintenance is good for you ?
[Figure: chart comparing automobile, hardware, and software by lifetime (years) and relative annual maintenance cost (%); depreciation = 1 / lifetime.]

Client-Server Architecture
[Figure: client systems connected directly to data and simulation resources; a change is marked X.]
Fast build of clients by resource reuse
Changes (X) are difficult, can affect many clients

Systems with Mediators
Gio Wiederhold, 1995
[Figure: three layers: Applications, Mediators, Data Resources.]

Growth through Reuse
Gio Wiederhold, 1995
[Figure: New Application; Prior & Revised Mediators; Extended Data Resources.]

Linear O(n) Cost of Growth -- now O(n²)
• Data changes only affect some mediators; only in their domain
• Mediators can
  1. supply old information to n-1 prior applications
  2. provide better information to the new application
  3. be partially or completely reused
• New applications, using the new data, can be developed and inserted dynamically

A mediator is not just static software: Knowledge ages
[Figure: a mediator between the Application Interface and the Resource Interfaces. Software & People: models, programs, rules, caches, . . . ; Owner / Creator, Maintainer, Lessor - Seller, Advertiser. Pressures: changes of user needs, domain changes, resource changes.]

Roles
Computer Scientists
• Provide tools
  – adaptation
  – integration
  – matching
  – composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
• Provide feedback

Assigning maintenance responsibility
Sources
a. Source data quality – supplier database, files, or web pages
b. Interface to the source – wrapper, supplier or vendor for supplier
Services
c. Source selection – expert specialist in mediator
d. Source quality assessment – customer input to mediator
e. Semantic interoperation – specialist group providing input to the mediator
f. Consistency and metadata information – mediator service operation or warehouse
Customers
g. Informal, pragmatic integration – client services with customer input
h. User presentation formats – client services with customer input

Sample projects
• Tsimmis at Stanford
• E-Commerce in Digital Libraries
• INEEL: information integration for environmental restoration
• MIFT: feedback for training
• Civil Engineering and Architecture
• F-22
• SimQL
• Security

Projects at Stanford DB group
• Data Mining: MIDAS
• Mediator & Wrapper Generation: TSIMMIS
• Warehousing: WHIPS
• Security Mediators: TIHI
• Megaprogramming: CHAIMS
• Simulation Access: SimQL
• Changes, Consistency, and Configurations: C3

The TSIMMIS Project
Ramana Yerneni, Yannis Papakonstantinou, ...
• Objective: Support mediation technology
  – integrated access to distributed, autonomous, heterogeneous data sources, using object fusion
  – wrapper toolkit to rapidly create wrappers, based on source specification, for a uniform interface to heterogeneous sources
  – mediator toolkit to rapidly construct mediators, based on a mediator specification, to integrate data from a set of wrappers

Investors Need to Fuse Information from Multiple Sources
[Figure: sources include Network, Ticker Tape, WWW, and a Personal database.]
• group together information about the same real-world entity
• remove redundancies
• resolve conflicts

An Integration Architecture
[Figure: Client Application (portfolios for each company) above a Mediator, above Wrappers for Ticker Tape (stock market prices) and Dialog (business reports).]

Additional Challenge: Sources Without a Well-Structured Schema
• semistructured
  – irregular
  – deeply nested
• incomplete schema knowledge
  – autonomous
  – dynamic
Examples
• World Wide Web
• SGML documents
• genome, chemical structures
• bibliographic information
• files

Wrappers & Mediators from High-Level Specifications
[Figure: Client above a Mediator (Mediator Specification Interpreter driven by a Declarative Mediator Specification), above Wrappers (Wrapper Specification Interpreter driven by Declarative Source Specifications), each over a Source.]

E-money
Services must be paid for
• Incentive for creation and improvement
• price proportional to value added, often small
• profit = f(cost, market, price, overhead)
• price low per item, so overhead must be low
Simple payment (no credit accounts, checks)
Enabled through secure signatures

E-Commerce in the Digital Library
Steven Ketchpel & DL Economics Group
Payment: CyberCash, DigiCash, First Virtual, SET
Delivery: Cryptolope, DigiBox, HTTP, E-mail
Major Integration Problem
Shopping Models: Pay-per-view, Subscription, Session, Shareware, Auctions, Site License, Gift Certificate, Layaway, Pre-paid vouchers, …
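
One way to read the shopping models listed above is as different orderings and repetitions of order, payment, and delivery events. The Python sketch below is a minimal illustration under that assumption: `ShoppingModel`, `Session`, the event names, and the two sample step lists are hypothetical and do not come from the Digital Library project. It shows how merchant-independent logic could accept events only in the order the model prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class ShoppingModel:
    """Hypothetical merchant-independent description of the expected event order."""
    name: str
    steps: list  # e.g. ["order", "pay", "deliver"]


@dataclass
class Session:
    """Tracks one customer's progress through a shopping model."""
    model: ShoppingModel
    position: int = 0
    log: list = field(default_factory=list)

    @property
    def done(self):
        return self.position == len(self.model.steps)

    def handle(self, event):
        """Accept an event only if it is the next step the model expects."""
        if self.done or event != self.model.steps[self.position]:
            raise ValueError(f"unexpected event {event!r} for {self.model.name}")
        self.log.append(event)
        self.position += 1


# Assumed step lists, chosen only to show two different orderings.
pay_per_view = ShoppingModel("pay-per-view", ["order", "pay", "deliver"])
subscription = ShoppingModel("subscription", ["order", "pay"] + ["deliver"] * 52)

session = Session(pay_per_view)
for event in ["order", "pay", "deliver"]:
    session.handle(event)
```

Keeping the step list as data rather than code means a new shopping model only needs a new list, while payment and delivery services can stay behind their own handlers.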
Shopping model: merchant-independent logic controlling flow of business model
Example shopping models:
  Order, Pay, (Deliver 52 times)
  (1 month; Order, Deliver) Pay
Abstract API allows application to interact with many different services in a consistent way
[Figure: shopping model holding State Information and Event Handlers, with numbered steps (1-4) labeled Start, Order Complete, Transfer $, and Payment Complete; event handlers connect the Customer, Bill, Merchant, and Payment/Delivery/Other Services.]
Proxy event handlers translate from native applications to shopping model defined protocols

TSIMMIS Status
• Mediator Specification Interpreter running on Ultrix, AIX, OSF.
• 9000 lines of C/C++ code
• 4000 C++ lines of Server/Client Support Libraries
• Integration of three disparate bibliographic sources
  – legacy system
  – flat BibTeX files
  – relational DB
  – Web files

Mediator Specification Interpreter Architecture
[Figure: a Query and the Mediator Specification enter the Query Rewriter, which produces a logical datamerge program; the Cost-Based Optimizer produces a plan; the Datamerge Engine sends Queries to Wrappers, receives Results, and returns the Result.]

Environmental Restoration at INEL
Undoing 50 years of messes ….
[Figure: mediators and wrappers exchanging OEM and QEM over CORBA, with other mediators; query languages MSL [Stanford], OQL [ODMG], MQL [ISX]; many projects, many sources (ERIS, IEDMS); participants: Lockheed Martin, ISX, Stanford Univ.]
Idaho National Engineering Laboratory

Mediation to Implement Feedback in Training (MIFT)
David Maluf, Priya Panchapagesan, Ted Linden
Another task of mediators, prior to integration: Abstraction
Abstraction to match levels of granularity

Mediation Feedback: Playback or Graph
[Figure: layered architecture. User Interface (UI in Java) for Commanders, Trainees, Observers, Training Developers, and Analysts; Application Layer (standards in KQML); Mediation Layers with Objectives and Tasks (Stanford mediators with rules in CLIPS); Wrapped Simulation Resources (wrappers in C/C++): I.D.A, Janus, SimNet.]

MIFT Result
[Figure: analyses (force ratio, losses, area gain) by exercise and simulator type.]

Control Valve Sizing, Future
From Andrew Arnold: Civ. Eng. Qualification Exam
• Interpretation – Programmatic
• Analysis – Integrated
• Evaluation – Integrated
• Transformation – Automated

F-22 IWSDB Phase 6
[Figure: User Interfaces and Application (PRIDE, Provisioner, Engineer, IWSDB client GUI); Integration Services (Change Notification, Query Reformulation, Match maker, Domain Model, Domain Matching); Wrappers; Databases (Sybase, Index, WAIS server, PD, DS, Suppliers); SQL.]

Simulation services
1. Continuously executing: weather prediction
   – SimQL result reports best match samples
2. Execution specific to query: what-if assessment, spreadsheets
   – may require HPC power for adequate response
3. Complement base data: materials data, assembly
   – performs inter- or extra-polations to match query parameters
4. Combinations of 2. and 3.: top layer simulation using stored partial lower level results: weapon performance in setting
5. Human-in-the-loop (mediated by an agent program): SAFs
Note
• A simulation service program can be written in any language
• A simulation service must be compliant to the interface

SimQL: Simulation Access Service
Information Systems should also deal with the Future
[Figure: timeline from past through now to future; SQL covers the past, SimQL the future.]
Decision-making requires dealing with the future, as well as the past
• Databases deal well with the past
• Sensors can provide current status
• Spreadsheets, simulations deal with the likely futures
Information systems should be able to combine all three
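
The idea of combining all three kinds of sources can be sketched as a single lookup that dispatches on time. The Python below is only an illustration under stated assumptions: `make_information_system`, the toy history, the constant sensor reading, and the linear `extrapolate` function are all hypothetical, and nothing here reflects actual SimQL syntax or interfaces.

```python
from typing import Callable, Dict, Tuple


def make_information_system(history: Dict[int, float],
                            sensor: Callable[[], float],
                            simulate: Callable[[int], float],
                            now: int) -> Callable[[int], Tuple[str, float]]:
    """Return a lookup that answers for past, present, and future times."""
    def value_at(t: int) -> Tuple[str, float]:
        if t < now:
            return ("database", history[t])    # stored past state
        if t == now:
            return ("sensor", sensor())        # current status
        return ("simulation", simulate(t))     # likely future
    return value_at


# Toy data: stored readings for times 1..3, current time 4.
history = {1: 10.0, 2: 12.0, 3: 14.0}
now = 4


def read_sensor() -> float:
    return 16.0


def extrapolate(t: int) -> float:
    """Toy simulation: continue the trend from the last stored value to the current reading."""
    slope = read_sensor() - history[now - 1]
    return read_sensor() + slope * (t - now)


value_at = make_information_system(history, read_sensor, extrapolate, now)
```

Each answer is tagged with its origin so that a decision-maker can tell a recorded fact from a prediction; a real simulation service would also need to report uncertainty, which this sketch leaves out.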