12. Models of Business Information [2]
DE + IA (INFO 243) - 3 March 2008
Bob Glushko
1 of 35

Plan for Today's Class
  "Operation Clean Data" case studies
  Authority Control
  Data Warehouses
  "Interoperability Costs in Auto Supply Chain" case study
  Hub languages
  The Universal Business Language
2 of 35

But First... Schedule for Assignments
  Assignment 3, Business patterns (assigned today 3/3, due 3/12)
  Assignment 4, Requirements and Source Inventory (assigned 3/12, due 3/24)
  Assignment 5, Process Analysis (assigned 3/31, due 4/9)
  Assignment 6, Document Analysis (assigned 4/16, due 4/23)
3 of 35

We're Both "Shipping Containers"
  "The expense of resolving ambiguous business terms over and over on a daily basis pales in comparison with the expense of NOT realizing there is an ambiguity in the term" (Farish)
4 of 35

Controlled Vocabularies
  The words people use to describe things or concepts are "embodied" in their context and experiences... so they are often different or even "bad" with respect to the words used by others
  These naturally-occurring words are an "uncontrolled vocabulary"
  Information retrieval or other processes with uncontrolled vocabularies are often ineffective and error-prone
  Creating a controlled vocabulary creates an artificial language by:
  1. Choosing an authoritative form of a term, name or identifier
  2. Ensuring that the term is distinctive
  3. Mapping all the variant forms to the authoritative one
5 of 35

"Operation Clean Data" -- British Military Case
  What were the symptoms or implications of "dirty" data in the British army's supply chains?
  What were the primary causes of this "dirty" data?
  Which data items were the focus of the data cleanup effort? Why?
  What technologies or tools were used in the data cleanup effort?
6 of 35

"Operation Clean Data" -- Carlson Wagonlit Case
  What were the symptoms or implications of "dirty" data for the Carlson Wagonlit travel agency?
  What were the primary causes of this "dirty" data?
  How is Carlson Wagonlit improving its data quality?
7 of 35

"Operation Clean Data" -- Cendant Case
  What were the symptoms or implications of "dirty" data for Cendant?
  What were the primary causes of this "dirty" data?
  How is Cendant improving its data quality?
8 of 35

Normative Name Forms
  When names appear in multiple forms, one form needs to be chosen using criteria that include:
  Fullness (e.g., full names vs. initials only)
  Language of the name
  Spelling (choose predominant form)
  Entry element
    "Smith, John" not "John Smith"
    "Mao Zedong" or "Zedong, Mao" or "Mao Tse Tung" or ?
9 of 35

Authority Control for Places
  Variant forms: St. Petersburg, Санкт-Петербург, Saint-Pétersbourg
  Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar
  Name changes: Bombay -> Mumbai
  Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields
  Anachronisms: No Germany before 1870
  Vague, e.g. Midwest, Silicon Valley
  Unstable boundaries: 19th century Poland; Balkans; USSR
10 of 35

"Operation Clean Data" -- US govt agencies
  How are the US Census Bureau and CDC improving data quality?
  How do these processes differ for printed and electronic surveys/forms?
11 of 35

Some General Questions about Data Quality
  Are the data quality problems primarily technology ones or process/management ones?
  Why are "homonyms" worse than "synonyms" in a set of item identifiers?
  Does data have to be perfectly clean? Can it ever be?
  How can your own actions contribute to data quality problems or to their resolution?
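The three controlled-vocabulary steps and the homonym/synonym question above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the authority file below, the choice of which form counts as authoritative, and the function names build_variant_index and normalize are all hypothetical and do not come from any of the case studies. Synonyms (several names for one place) are resolved by a simple lookup, while a homonym (one name for several places) cannot be mapped at all and is rejected when the index is built.

```python
# Hypothetical authority file: authoritative place name -> known variant forms.
# Which form counts as authoritative is an assumption made for this sketch.
AUTHORITY_FILE = {
    "Saint Petersburg": ["St. Petersburg", "Saint-Pétersbourg", "Санкт-Петербург"],
    "Mumbai": ["Bombay"],
    "Cluj": ["Klausenburg", "Kolozsvar"],
}


def build_variant_index(authority_file):
    """Map every known form, case-insensitively, to its authoritative form.

    Raises ValueError when one form would point to two different
    authoritative names (a homonym), because the term is then not distinctive.
    """
    index = {}
    for authoritative, variants in authority_file.items():
        for form in [authoritative, *variants]:
            key = form.casefold()
            # A key already bound to a different authoritative name is a homonym.
            if index.get(key, authoritative) != authoritative:
                raise ValueError(
                    f"{form!r} is a homonym for {index[key]!r} and {authoritative!r}"
                )
            index[key] = authoritative
    return index


def normalize(name, index):
    """Return the authoritative form of a name, or None if it is unknown."""
    return index.get(name.strip().casefold())


index = build_variant_index(AUTHORITY_FILE)
print(normalize("Bombay", index))        # Mumbai
print(normalize("klausenburg", index))   # Cluj
print(normalize("Springfield", index))   # None
```

Listing "Vienna" as a variant of both "Vienna, VA" and "Vienna, Austria" would raise the error. That is one way to see why homonyms are harder than synonyms: resolving them needs extra context (here, the state or country) that the identifier alone does not carry.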
12 of 35

Principles and Processes for Quality Information
  Prioritize the data items
  Involve the data owners
  Keep future data clean (enough)
  Find the data owners and the "headwaters"
  Validate at the time of capture or creation
  Set realistic goals for data quality
13 of 35

Data Warehouses
  A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data used in organizational decision making"
  Data warehouses extract data from ERP systems or other transactional applications into a separate repository
  It is common practice to "stage" data prior to merging it into a data warehouse with an "Extract, Transform, and Load" (ETL) application
  The data model for the warehouse, designed to enable efficient ad hoc data analysis and reporting, is sometimes called a "hypercube"
14 of 35

Generic Enterprise Information Integration Architecture with Warehouse (Gantz, XML 2004)
15 of 35

ETL vs ELT
  The traditional ETL (Extract-Transform-Load) approach relies on proprietary ETL engines being deployed between sources and targets.
  Relational databases are rapidly eliminating the ETL category by incorporating transformation functionality
  So ETL is becoming ELT (Extract-Load-Transform), with all the complex processing of data occurring inside the database
16 of 35

The Virtual Warehouse
  A virtual warehouse is created "on demand" by centralizing and normalizing metadata about the data sources rather than the data itself.
  The data is left in its original location and extracted only when needed, which makes more "real time" analysis possible
17 of 35

Virtual Warehouse Via Metadata Repository (Gantz, XML 2004)
18 of 35

"Interoperability Costs in the US Auto Supply Chain"
  Excellent case study about how a concurrent engineering business model escalates the information exchanges and interoperability problems in the "ecosystem"
  Analyzes various alternatives for data transfer, and finds that the choices made are not the optimal ones
  Concepts and lessons apply to other industries with "data exchange-intensive" supply chains
19 of 35

Alternatives for Data Transfer Between Two Systems
  Manual re-entry
  Everyone has to learn to "speak" all the languages
  Native formal transfer
  Point-to-point translation
  Everyone has to learn just one new language but it has to be the same one
  Dominant players impose their language on their ecosystem
  Multiple vocabularies exist, but there is at least one "interchange" or "hub" language designed to facilitate translations between "native" vocabularies
20 of 35

CAD / CAM Systems Proliferation
21 of 35

Juran's "Quality Costs" Framework
  Joseph Juran's "Quality Control Handbook" (1951) -- "cost of quality" framework determines how much to spend on quality at any point in the "quality system"
  The costs of preventing and finding quality problems (avoidance) ...
    Prevention costs (design reviews, training, guidelines, knowledge, ...)
    Appraisal costs (tests, process control measurements, reports, evaluations, ...)
  ... must be balanced against the costs associated with those quality problems (mitigation):
    Internal failure costs (costs incurred before the product or service is delivered: scrap, rework, lost time, unused capacity, ...)
    External failure costs (costs incurred when quality problems reach customers: returns, recalls, complaints, field services, warranty repairs, liability lawsuits, ...)
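The balance that Juran's framework describes can be sketched as a simple tally. This is a minimal illustration, not Juran's own method: the class name QualityCosts, its field names, and every figure in the usage lines are hypothetical placeholders, chosen only to show how the four cost categories group into avoidance and mitigation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityCosts:
    """One period's quality costs in the four categories of the framework."""
    prevention: float        # design reviews, training, guidelines
    appraisal: float         # tests, measurements, reports
    internal_failure: float  # scrap, rework, lost time
    external_failure: float  # returns, recalls, warranty repairs

    @property
    def avoidance(self) -> float:
        """Costs of preventing and finding quality problems."""
        return self.prevention + self.appraisal

    @property
    def mitigation(self) -> float:
        """Costs incurred because quality problems occurred."""
        return self.internal_failure + self.external_failure

    @property
    def total(self) -> float:
        """Overall cost of quality for the period."""
        return self.avoidance + self.mitigation


# Hypothetical figures, in arbitrary currency units, for two spending plans.
low_avoidance = QualityCosts(
    prevention=10, appraisal=15, internal_failure=120, external_failure=200
)
high_avoidance = QualityCosts(
    prevention=60, appraisal=40, internal_failure=30, external_failure=50
)

for label, plan in [("low avoidance", low_avoidance), ("high avoidance", high_avoidance)]:
    print(f"{label}: avoidance={plan.avoidance}, "
          f"mitigation={plan.mitigation}, total={plan.total}")
```

With these made-up numbers the plan that spends more on avoidance has the lower total, but the framework itself only says that the two sides must be weighed against each other; real figures decide the outcome.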
22 of 35

The Case for Investing in Avoidance
23 of 35

Interoperability Avoidance Costs
24 of 35

Interoperability Mitigation and Delay Costs
25 of 35

Estimated Interoperability Costs
26 of 35

An Interchange or Hub Language
27 of 35

Hub Languages for e-Business
  (early 1990s) - Ad hoc efforts in EDIFACT to "harmonize" core components across verticals
  1997 - XML Common Business Library is 1st XML horizontal vocabulary, incorporated EDIFACT semantics and code lists
  1999 - ebXML initiative of EDIFACT and OASIS to develop syntax-neutral "core components"
  2001 - Universal Business Language effort begins, building on xCBL and ebXML Core Components
28 of 35

Universal Business Language
  DOCUMENT ARCHITECTURE: A generic XML interchange format for business documents that can be extended to meet the requirements of particular industries
  CORE COMPONENTS: A library of XML schemas for reusable data components such as "Address," "Item," and "Payment" -- the common data elements of everyday business documents
  STANDARD DOCUMENTS: A small set of XML schemas for common business documents such as "Order," "Despatch Advice," and "Invoice" that are constructed from the UBL library components and can be used in a generic order-to-invoice trading context
29 of 35

UBL 1.0 Document / Process Scope
30 of 35

How a Hub Language Increases the XML Advantage over EDI
31 of 35

How a Hub Language Shortens the Time to the XML Payoff
32 of 35

Document Exchange Context with UBL
33 of 35

Mapping in and out of Hub Language
  If all parties/applications/services rely on a hub language for their external interfaces, an interoperability challenge that grows with the square of the number of parties becomes a linear one
  Mapping tools for transforming instances from an internal information model to another one are ubiquitous as standalone tools and as parts of application servers
  EXAMPLE: Altova MapForce
34 of 35

For Wednesday March 5
  Chapter 5 of Document Engineering
  "E-Government Architecture in Ireland" Sean McGrath and Fergal Murray, XML 2004
  Conference
  "Mobile Telemedicine System for Home Care and Patient Monitoring" M. V. M. Figueredo and J. S. Dias, Proceedings of the 26th Annual Conference of the IEEE EMBS (September 2004)
  "Redefining the Patient Record Paradigm" MedicAlert Foundation, Whitepaper (January 2005)
35 of 35
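The effect of a hub language on the number of mappings, as described on the "Mapping in and out of Hub Language" slide, can be checked with a short count. The sketch assumes that every pair of parties exchanges documents in both directions and that each party keeps its own native format; the function names are hypothetical.

```python
def point_to_point_mappings(n_formats: int) -> int:
    """Directed translators needed when every format maps directly to every other."""
    return n_formats * (n_formats - 1)


def hub_mappings(n_formats: int) -> int:
    """Directed translators needed when each format maps only to and from the hub."""
    return 2 * n_formats


# Compare the two approaches as the number of native formats grows.
for n in (2, 5, 10, 50):
    print(n, point_to_point_mappings(n), hub_mappings(n))
```

For two formats the hub needs more mappings (4 against 2), so a hub language pays off only as the ecosystem grows; at 10 formats the count is 20 against 90, and at 50 it is 100 against 2450.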