Getting Data to Applications: Why Do We Fail, and How We Can Do Better?
Arnon Rosenthal, Frank Manola, Scott Renner

Toward an Industrial Revolution for Data Interoperability
Incremental, (full) Interfaces, Incentives

Goal: A Common Operational Picture (COP)
- View tier: logistics, mapmaker, intelligence, operations
    - User sees data values, assembled and expressed in user’s own terms
- The “Common Operational Picture” warehouse or federation: an integrated subset of information sources with presentations for different users
- Source tier: sensor, naval, NIMA info products, ground, air

Current Status
- Read only is insufficiently ambitious for a guiding vision, but is driving many industrial solutions
- Proposed architectures (e.g., messaging) often don’t fit
    - Metadata
    - Operations: update / annotate / subscribe
    - Fusion
- Numerous initiatives that are likely to fail
    - e.g., common operational pictures
- Today’s technology: costly, little reuse, skill-intensive

Toward Attainable Goals (and more realistic slogans)
- “Give everyone transparent (read) access to all data”. (Any success stories?)
- The vision of perfection crowds out ability to live with imperfection!
- Restate the challenge: Prepare data/software systems to work with partners -- including unknown future ones?
- Connection-creation as a core competence for IT
    - Describe each service that is offered or wanted (e.g., some operation on some data)
    - Reduce cost of establishing the software connection
    - Reuse knowledge captured when a connection is built

What Do We Mean by “Industrial Revolution”?
- Small tasks
    - Each with one skill
    - Many atomic steps become automatable
    - Each produces reusable knowledge (as opposed to motivating a few lines within a program)
- “Market-driven” (as connections are made) rather than giant initiatives

Future of Large Info Management Architectures
- Consensus among researchers for scalable sharing
    - Each data resource describes what it offers
    - Each consumer describes what it wants
    - Discovery and brokering processes create a connection (prototypes automate some cases)
- Is it really so different from today? Each functional task is performed by today’s developers
    - Key difference: “describe and generate”

A word from our sponsor: We’re Hiring
- Researcher / Consultants, Prototypers, Systems Engineers (or make us an offer)
- Main offices: suburbs of Boston and Washington DC
    - Also jobs in Norfolk, Montgomery, St. Louis, San Diego, … + Europe, Asia
- We’re a nonprofit working mostly for the US government (A good place to learn. So you’ll get more stock options later)
- US Citizens and Permanent residents only (so MITRE can get you a security clearance)

Talk Outline
- Why do current approaches so often fail?
    - We act as if we believe ridiculous things -- in architectures and in design discussions
- Where should we try to go? Incremental Interoperability
    - Aim to revolutionize -- incrementally
- How to Start Moving in this Direction?
- Scope of talk:
    - Create logical connectivity -- development and logical admin
    - Omits: Systems planning, execution performance (cache selection, indexing, dissemination)

Tacit Assumptions -- and Antidotes -- 2
- “End State” fallacies:
    - Architectures are for a perfect end state (?)
    - Systems conform and consumers benefit only when transition is complete (?)
- You’ll add flexibility later (?)
- Config. mgt. is a sufficient strategy for change (?)
Advice Nuggets
- Architect for manageable, adaptable, imperfect systems (for 2001, 2002, … 2999)
    - Transitional states are within the architecture
- Architect for adaptability.
  How to contract for it?
    - Config. management is only a brake

Tacit Assumptions -- and Antidotes -- 3
- Mandates will elicit good quality metadata (?)
    - Local administrators will rush to keep you up to date (?)
Advice Nuggets
- Active (operational) metadata is kept accurate
    - Passive metadata is untested, and soon too obsolete to drive automated processing (except browsing)
- More carrots, fewer sticks
    - If your tools use the metadata to ease the providers’ tasks, you’ll get better metadata
- Calls for metadata should include an exploitation plan

Tacit Assumptions -- and Antidotes -- 4
- “Midpoint” Fallacy: Design a compromise interface (msg?). Build around and above it. (?)
- “Message interface” Fallacy: “Send message Mxyz” is a fine interface between systems (?)
    - Support interfaces procedurally (e.g., Java + parser) (?)
- Describe the “natural” interface.
    - One interface supports all subsets.
    - Connectors are separate & declarative (e.g., SQL + fns?)
- On the consumer’s interface, generate
    - operations (e.g., query, update, subscribe)
    - metadata, e.g., units, error, access controls

Tacit Assumption 6: Interoperability Metaphor: Universal Plug
- Two prongs: too simple
- Important element of truth: Design to plug into the “infosphere”, not into one neighbor

A Better Interoperability Metaphor: A Multi-Pin Connector
[Figure: 25-pin connector; labeled pins include CORBA/DCOM, transactions, XML, SQL]
- All the pins have to fit -- and many are compound
- Data: each attribute has semantics, format, quality
- Track resolution of each pin’s issues

Organization of the Section
- Why do current approaches so often fail?
- Where Should We Want to Go?
    - Approach
    - Taxonomy of needed capabilities
- How to Start Moving in this Direction?
- Research Agenda: Risk Mitigation

Transition is the steady state, with good ways to cope
- Descriptions of sources, consumers exist -- sometimes
    - When you build the next connection, capture more
    - You’re still funded to build connections
- No giant process cutover
- Discovery and brokering tools work with whatever descriptions they find
- Integration contractors already do discovery and brokering!
    - Manually, with too little reuse!
- For everything, there are multiple ways to do it
    - Choose one, but work with those who chose differently
    - Connections and transforms are partially known

Steps to Connect a Consumer to Provider(s): (with metadata reuse)
- Obtain descriptions of each player
    - Use same form for consumers’ needs as for providers
    - May employ intermediary vocabularies
- Discover potential (source, consumer) pairs
- Obtain transforms for
    - Element representations (e.g., miles ↔ km; jpeg ↔ gif)
    - Object and set representations (e.g., ODBC ↔ XML)
    - Protocols (e.g., DCOM ↔ CORBA)
    - Pull versus push, whole versus changes
- Generate the entire connection (tuned for efficiency)
- What vendor can supply the framework?

Metadata Drives Connection Creation (when there is enough metadata)
[Figure: New “Wants” from consumer; Repository/Knowl. Base; Discovery process; Brokering process; Transform Library + execute]

Connection Creation Drives Metadata
[Figure: as above, with M’data capture tools attached to the Discovery process and the Brokering process]

Connection Creation Drives Vocabularies (?)
[Figure: as above, with Vocab and I/f creation tools and an Optimizer added]

Toward an “industrial revolution” for IT: Re-imagine Existing Processes as Simpler Steps
- Each step should
    - Require just one or two skills
    - Benefit from existing resources -- metadata and transforms
    - Be fully automated (sometimes)
    - Produce reusable resources for later steps
- Key challenges:
    - Incentives: It must be made easier to generate from resource atoms than to code it all yourself!
    - To support these incentives, we may need tools that assemble the atomic components into a solution

Data Descriptions: A Taxonomy (foil 1 of 2)
- Data admin for requirements parallels admin for offers! Use same constructs
    - Enables (partly) automated comparisons
- Interpretation: element semantics, element representation, schema
- Scope and completeness of what you provide (population), e.g., images of
    + all US air-fuel depots, since 1970
    + some NATO fuel depots since 1990
- Delivery style (push/pull, whole / changes)
- (Is offer/need model adequate for update transactions?)

Data Descriptions’ Taxonomy (foil 2 of 2)
- Quality of service
    - Data quality, timeliness, attribution, completeness, obligation (to continue providing), cost, …
- Guidance for data merging (match-up, conflict resolution)
- Server information, e.g. (catch-all)
    - Access language, protocols, address, security domains, …

Talk Outline
- Why do current approaches so often fail?
- Discussion of a “low risk” approach
    - What the goal system looks like
    - How it evolves
    - Tool and technology details
- How to Start Moving in this Direction? How to:
    - Simplify the task of interfacing to a particular system
    - Establish more connections
    - Make created interfaces “first class”
- Research Agenda: Risk Mitigation

Getting Started along the New Road
- Provide help in creating needed interfaces
    - Focus on individual programs, small initiatives
    - Give incremental benefits, to keep all aboard
- What’s the minimum to give some benefits?
- Separate existing work into atomic tasks that require fewer skills, and are sometimes automatable
    - No giant cutovers, with massive retraining, coordination
- Issues
    - What does each program need to do?
    - What requires coalitions, or central funding? (e.g., repository, brokers)

Tasks (examples)
- Define vocabularies for
    - Metadata (how to say “means the same”, or “distanceUnits = km”, or “Corba3.0 interface”)
    - Aspects to be brokered (of scope, representation, …)
    - Frequently-exchanged domain data (Part#, Facility#)
- Describe portions of systems in terms of these vocabs
    - Be opportunistic, e.g., when building new connections
- Provide transforms among major representations, protocols
- Provide brokers for various aspects (simple brokers first)
- “Partial brokering” must help metadata providers

Who Will Be Most Interested? (Suggested Initial Targets)
- Find a system which needs multiple interfaces (to customers and/or feeders)
- Good candidates
    - Non-dominant players who must connect to multiple others
    - Dominant player with bad ease-of-connecting (MIDB?)
- Issue: How soon till it’s helpful
    - Generate, based on own entries in metadata repository
    - Transformers are quickly helpful (esp. harder ones, e.g., coordinates, image formats)
    - Perhaps attach to DBMS, or to XML engine?

Example Initiatives (and their benefits)
- Publish interface in one formalism (with description), e.g., SQL
    - Tools generate the additional interfaces, without disturbing the original publisher, e.g., XML, CORBA, DCOM, html, …
- Publish interface in one vocabulary, for all exported info, e.g., Supply
    - Tools generate “closest feasible” interface in other vocabularies that have been related to it, e.g., Repair, Procurement, Defense finance, …
- Transform representations (image format, coord system)
- Provide interfaces as (root concept, well known modifier)
- Derive metadata, additional operations (e.g., update)

Summary: Try an approach that hasn’t failed consistently!
- Identified pitfalls that are too rarely avoided
- Described incremental steps toward large scale data admin for diverse, changing, incomplete systems
- Generate connections from reusable resources (system metadata, vocabulary metadata, transforms)
    - Active metadata
    - Separation of skills, use point and click
    - Incentives: Make “provide resource + generate” easier than writing connecting code
- Connection-creation creates more reusable resources
    - Projects cooperate to create vocabularies, acquire tools
- It’s a low risk approach -- begin prototyping

Challenges for Database Researchers
- Better brokering for matching requirements to sets of views
    - Assume multiple ontologies, spotty connection, incremental improvement
    - Explain the shortfalls, understandably
- Scalable fusion (to match objects, resolve data conflicts) without n x n pairwise administration
- Pragmatic
    - Acquisition guidance, e.g., metrics on flexibility (what should be in each acquisition contract?)
- Combine techniques for learning metadata? No more discovery heuristics!
- Automate physical DBA work (caching, optimization)
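
To make the brokering challenge above a little more concrete, here is a minimal sketch of “describe and generate” in Python. It is an illustration only, not the authors’ tooling: the names `Description`, `Connection`, `TRANSFORMS`, and `broker` are hypothetical, the vocabulary is a single element name, and the transform library holds just a miles/km pair. Offers and wants share one description form, the broker picks a transform from the library, and any mismatch is reported as a readable shortfall.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Description:
    """One offer or want; the same form serves providers and consumers."""
    system: str             # hypothetical system name
    element: str            # shared vocabulary term, e.g. "range"
    unit: str               # element representation, e.g. "miles"
    delivery: str = "pull"  # delivery style: "pull" or "push"


@dataclass(frozen=True)
class Connection:
    """A generated connection: which provider, and how to convert values."""
    provider: str
    convert: Callable[[float], float]


# Tiny illustrative transform library, keyed by (from_unit, to_unit).
TRANSFORMS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("miles", "km"): lambda v: v * 1.609344,
    ("km", "miles"): lambda v: v / 1.609344,
}


def broker(want: Description,
           offers: List[Description]) -> Tuple[Optional[Connection], List[str]]:
    """Match one want against the offers; return a connection and shortfalls."""
    shortfalls: List[str] = []
    for offer in offers:
        if offer.element != want.element:
            continue  # different vocabulary term, not a candidate
        if offer.delivery != want.delivery:
            shortfalls.append(
                f"{offer.system}: delivers by {offer.delivery}, "
                f"consumer wants {want.delivery}")
            continue
        if offer.unit == want.unit:
            return Connection(offer.system, lambda v: v), shortfalls
        transform = TRANSFORMS.get((offer.unit, want.unit))
        if transform is None:
            shortfalls.append(
                f"{offer.system}: no transform {offer.unit} -> {want.unit}")
            continue
        return Connection(offer.system, transform), shortfalls
    if not shortfalls:
        shortfalls.append(f"no provider offers element '{want.element}'")
    return None, shortfalls


if __name__ == "__main__":
    offers = [Description("sensor", "range", "miles")]
    conn, gaps = broker(Description("mapmaker", "range", "km"), offers)
    if conn is not None:
        print(conn.provider, conn.convert(10.0))
    print(gaps)
```

In this sketch, each new transform or description added to the library is reusable by later connections, and a failed match still yields an explanation of what is missing rather than a silent failure.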