Oh! So That is What you Meant: The Interplay of Data Quality and Data Semantics
2003 International Conference on Conceptual Modeling (ER’03)
Chicago, October 2003
Stuart Madnick
MIT Sloan School of Management
smadnick@mit.edu
© S. Madnick, 2003 V4

Agenda
• Examples of Multiple Interpretations
  - No one “right” answer
  - “Wrong” answer = bad quality data
• Corporate Householding
  - Examples
  - Framework
• COntext INterchange (COIN) Approach
  - Basic concept
  - Applied to Corporate Householding

Example Questions
• Simple questions:
  • “How much did MIT buy from IBM last year?”
  • “How much did IBM sell to MIT last year?”
    [Do you expect the answers to be the same?]
  • “How much did Merrill Lynch loan to IBM last year?”
  • “How many employees does IBM have?”

There can be multiple purposes and contexts with differing answers
[Figure: picture of old lady or young lady?]

Data source . . . (how do you cite in a Journal article?)

Role of Context
[Figure: three contexts around a “?”, labeled with the dates 02-01-03, 01-02-03, 03-02-01 and the currencies $, £, ¥]
Context variations:
- Geographic (US vs. UK)
- Functional (cash management vs. loans)
- Organizational (Citibank vs. Chase)
Data: databases, web data, e-mail

Types of Context: Representational, Ontological, Temporal

                  Example                                  Temporal
Representational  Currency: $ vs €                         Francs before 2000, € thereafter
                  Scale factor: 1 vs 1000
Ontological       Revenue: includes vs excludes interest   Revenue: excludes interest before 1994 but incl. thereafter

Example: Context Differences (from multiple web sources)
Daimler Benz (DCX) Financial Data

Source        P/E Ratio
ABC           11.6
Bloomberg     5.57
DBC           19.19
MarketGuide   7.46

Additional Example
Q: How did CO2 emissions (total, per GDP, per capita) change over time (between 1990 and 2000) in Yugoslavia?
Context 1: YUG as a geographic region bounded before the breakup
Context 2: YUG as a legal autonomous state
Related effort:
- Laboratory for Information Globalization and Harmonization Technologies and Studies (LIGHTS) Project

The 1999 Overture
Unit-of-measure mixup tied to loss of $125 million Mars Orbiter
“NASA’s Mars Climate Orbiter was lost because engineers did not make a simple conversion from English units to metric, an embarrassing lapse that sent the $125 million craft off course. . . . The navigators (JPL) assumed metric units of force per second, or newtons. In fact, the numbers were in pounds of force per second as supplied by Lockheed Martin (the contractor).”
Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.

Revisit Example Questions
• Simple questions:
  • “How much did MIT buy from IBM last year?”
  • “How much did IBM sell to MIT last year?”
  • “How much did Merrill Lynch loan to IBM last year?”
  • “How many employees does IBM have?”
• Even the definition of “employee” might be ambiguous
  • How to count part-time employees?
  • How to count full-time consultants?
  • (In a university) how to count joint appointments or unpaid visiting faculty?

An example of a “Householding” issue

“Household Data”
• Example: letter from Smith Barney
  “To reduce mailing expense … We have adopted a policy known as “householding.” … shareholders who are members of the same family, share the same address and have multiple accounts in the same Fund will receive only one copy of the annual prospectus.”
• In general, what is the definition of “household”?
• What would be the corresponding “corporate household”?

“Household” - Definition?
Is it:
• Father, mother and children living at the same address?
• What if father and mother have different last names?
• Father, mother, children even if living at different addresses?
• Include cousin Alice visiting for 6 months?
• An unmarried couple living together? (of any sex)
• Roommates?
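
The questions above can be made concrete with a small sketch. It groups the same people into “households” under three of the candidate definitions and shows that the count changes with the definition. The records, the field names, and the `households` helper are all hypothetical, invented only for illustration; they do not come from the Smith Barney letter or any real data set.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
PEOPLE = [
    {"name": "Pat Smith", "last": "Smith", "address": "1 Elm St"},
    {"name": "Lee Jones", "last": "Jones", "address": "1 Elm St"},
    {"name": "Sam Smith", "last": "Smith", "address": "1 Elm St"},
    {"name": "Kim Smith", "last": "Smith", "address": "9 Oak Ave"},
    {"name": "Ann Jones", "last": "Jones", "address": "5 Pine Rd"},
]


def households(people, key):
    """Group people into households; `key` encodes the chosen definition."""
    groups = defaultdict(list)
    for person in people:
        groups[key(person)].append(person["name"])
    return dict(groups)


# Definition A: everyone living at the same address is one household.
by_address = households(PEOPLE, key=lambda p: p["address"])

# Definition B: same last name AND same address.
by_name_and_address = households(PEOPLE, key=lambda p: (p["last"], p["address"]))

# Definition C: same last name, even if living at different addresses.
by_family_name = households(PEOPLE, key=lambda p: p["last"])

print(len(by_address), len(by_name_and_address), len(by_family_name))
```

None of the three groupings is “the” right one; each is the right answer for a different purpose, which is the point the corporate householding slides build on.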
A Framework for Corporate Householding
a. Identical entity instance identification
   Name: MIT                 Addr: 77 Mass Ave
   Name: Mass Inst of Tech   Addr: 77 Massachusetts
b. Entity aggregation
   Name: MIT           Employees: 1200
   Name: Lincoln Lab   Employees: 840
c. Transparency of inter-entity relationships
   [Figure: MIT, MicroComputer, CompUSA, IBM]

a. Identical entity instance identification
   Name: MIT                 Addr: 77 Mass Ave
   Name: Mass Inst of Tech   Addr: 77 Massachusetts
• Unambiguous universal identifiers are rare (or rarely used)
• Examples:
  - Massachusetts Institute of Technology
  - Mass Inst of Tech
  - MIT, M.I.T., M I T
• In practice a frequent problem for mailing lists
• Addressed, in part, by companies such as FirstLogic

b. Entity aggregation
   Name: MIT           Employees: 1200
   Name: Lincoln Lab   Employees: 840
• What should be included as part of an entity?
• Example: “Lincoln Lab” is “Federally Funded R&D Center of MIT”
• Is Lincoln Lab included in the answer to questions such as:
  - How many employees does MIT have?
  - What was MIT’s total budget last year?
  - How much have we sold to MIT?
• The different circumstances are called “contexts”
• Addressed, in part, by companies such as D&B

Example: What is “IBM”?
What is the relationship among these entities (and the changes over time, the “temporal context”):
• International Business Machines Corporation
• IBM
• IBM Global Services
• IBM Global Network (1999-)
• IBM de Colombia, S.A (90%)
• Lotus Development Corporation (100%)
• Software Artistry, Inc. (1998+, 2000-)
• Dominion Semiconductor Company (50/50 jv)
• MiCRUS (majority jv)
• Computing-Tabulating-Recording Co.

c. Transparency of inter-entity relationships
   [Figure: MIT, MicroComputer, IBM, CompUSA]
• Relationships might be direct or indirect
• Under what circumstances (i.e., contexts) should they be collapsed?
• This can be multi-leveled, especially in
  - financial transactions
  - supply chain management

Types of Entities and Relationships
• Many possible “atoms”:
  - Locations (“branches”)
  - Scope (“divisions”)
  - Ownerships (“subsidiaries”), including fractionals
  - Joint ventures
  - ... Others?

Examples of multiple purposes . . .
Corporate household / family structure purposes
- Accounting: Account consolidation
- Financial: Risk (credit - bankruptcy, country)
- Marketing/Purchasing (multiple divisions & subsidiaries):
  - Customers & Supplier consolidation (volume discounts)
  - Customer Relationship Management (CRM)
- Managerial: Regional and/or Product separations
- Legal: Liability (insurance), Licensing (software, patents)
- Relationship: Consultant Conflict of interest & competition
And these are dynamic … changing over time

Account Consolidation
Example: Should Company A consolidate accounts with Company B?
[Flowchart: decision questions with YES/NO branches leading to the outcomes below]
Decision questions:
- Does Company A have majority ownership of Company B (either directly or indirectly)?
- Does Company A have a controlling financial interest in Company B?
- Is Company A a bank holding company?
- Is Company B subject to the Bank Holding Company Act?
- Does Company A have the same fiscal periods as Company B?
- Is Company B a foreign subsidiary of Company A?
Outcomes:
- No Consolidation
- Consolidation at the Discretion of Company A
- Consolidation; however, changes need to be made in fiscal periods of Company B
- Consolidation should occur

General Challenge: There is not a single right answer
• As seen with the “old woman or young woman” picture
• Need to be able to answer a question “in context”
• This issue appears in many situations …

Research Agenda
• We do not want to build a custom solution for each situation
  - Need to understand requirements
  - Need to understand sources
• Process
  1. Interviews: describe representative purposes / cases in detail
  2. Study use of Corporate Householding in organizations
  3. Identify key characteristics & typical kinds of rules
  4. Incorporate into a context knowledge management system
  5. Develop reasoning algorithm to map purposes onto data sources automatically (extend prior context mediation work)

The Context Interchange Approach
[Figure: COIN architecture. Components shown: Shared Ontologies (concept: Length, with Meters and Feet), Conversion Libraries (function( ) between meters and feet), Context Mediator, Context Management, Source Context, Receiver Context, Context Transformation. The source holds a part length value (17); the receiver application issues the query below.]
  Select partlength From catalog Where partno=“12AY”

Corporate Household - Ontology, Relations and Elevations
[Figure: ontology, with elevations from columns in application relations to groups of semantic objects]
Legend: Inheritance; Attributes (attr); Modifiers (mod); Elevation
Semantic types: Basic (number), CorporateEntity, Relationship, EntityFinancials, AggregationItem, Revenue, Country, CurrencyType, Date
Attributes and modifiers shown: relationshipType (string), ownership (number), scale (number), aggregationType (string), fyEnding, location, childEntity, parentEntity, company, currency, officialCurrency
Groups of semantic objects:
- Relationship(“subsidiary”), Relationship(“branch”), Relationship(“division”), ...
- CorporateEntity(“IBM”), CorporateEntity(“Lotus”), CorporateEntity(“IIP”), CorporateEntity(“General Motors”), ...
- Country(“USA”), Country(“Netherlands”), Country(“China”), ...
- Revenue(77,966,000), Revenue(970,000), Revenue(1,200,000), Revenue(177,828,100), ...

Application Relation(s)

Corporate householding database: structure (CHH Relations):
ChildEntity                            ParentEntity                      RelationshipType  Ownership
Lotus Development                      International Business Machines   Subsidiary        100
IBM Global Services                    International Business Machines   Division          100
IBM Far East Holdings B. V.            International Business Machines   Subsidiary        100
International Information Products     IBM Far East Holdings B. V.       Subsidiary        80
IBM Germany                            International Business Machines   Branch            100
IBM France                             International Business Machines   Branch            100
IBM International Treasury Services    IBM Germany                       Subsidiary        33
IBM International Treasury Services    IBM France                        Subsidiary        14
...

Corporate householding database: country:
CorporateEntity                        Country
International Business Machines        USA
Lotus Development                      USA
IBM Global Services                    USA
IBM Far East Holdings B. V.            Netherlands
International Information Products     China
IBM Germany                            Germany
IBM France                             France
IBM International Treasury Services    Ireland
General Motors                         USA
Hughes Electronics                     USA
Electronic Data Systems                USA
...

Source Data: revenue1 (c_unconsolidated_revenue):
CorporateEntity                        Revenue
International Business Machines        77,966,000
Lotus Development                      970,000
International Information Products     1,200,000
IBM Far East Holdings B. V.            550,000
IBM International Treasury Services    500,000
General Motors                         177,828,100
Hughes Electronics                     8,934,900
Electronic Data Systems                21,502,000
...

Example Contexts and Queries
Input query:
  Select CorporateEntity, Revenue from revenue1 where CorporateEntity = “IBM”

Contexts:
- c_unconsolidated_revenue (source context): an entity’s revenue only includes revenues of its divisions and branches; scale = 1,000; currency = ‘USD’; aggregationType = ‘division+branch’
- c_majorityowned_revenue: an entity’s revenue includes revenues of its divisions, branches, and all majority owned subsidiaries; scale = 1,000; currency = ‘USD’; aggregationType = ‘subsidiary+division+branch’
- c_whollyowned_revenue: an entity’s revenue includes revenues of its divisions, branches, and wholly owned subsidiaries; scale = 1,000,000; currency = ‘USD’; aggregationType = ‘whollyownedsubsidiary+division+branch’

Relations (revenue1 is the source; revenue2 and revenue3 are obtained from it by conversion, and their values correspond to the majority-owned and wholly-owned contexts respectively):
CorporateEntity                        revenue1       revenue2       revenue3
IBM                                    77,966,000     81,186,000     79,986
Lotus                                  970,000        970,000        970
IIP                                    1,200,000      1,200,000      1,200
IBM Far East Holdings                  550,000        1,750,000      550
IBM International Treasury Services    500,000        500,000        500
General Motors                         177,828,100    186,763,000    186,763
Hughes                                 8,934,900      8,934,900      8,934.9
EDS                                    21,502,000     21,502,000     21,502
...

Asked in the c_unconsolidated_revenue context (no conversion needed):
  Result: CorporateEntity IBM, Revenue 77,966,000

Asked in the c_majorityowned_revenue context:
  Output query: Select “IBM” as CorporateEntity, SUM(Revenue) from revenue1 where CorporateEntity in (“IBM”, “Lotus”, “IBM Far East Holdings”, “IBM International Treasury Services”, “IIP”)
  Result: CorporateEntity IBM, Revenue 81,186,000

Asked in the c_whollyowned_revenue context:
  Output query: Select “IBM” as CorporateEntity, SUM(Revenue) from revenue1 where CorporateEntity in (“IBM”, “Lotus”, “IBM Far East Holdings”, “IBM International Treasury Services”)
  Result: CorporateEntity IBM, Revenue 79,986

COIN Demo - Result of Execution

The 1805 Overture
In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians said their forces would be in the field in Bavaria by Oct. 20. The Austrian staff planned based on that date in the Gregorian calendar. Russia, however, used the ancient Julian calendar, which lagged 10 days behind. The difference allowed Napoleon to surround Austrian General Mack's army at Ulm on Oct. 21, well before the Russian forces arrived.
Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan, 1966, p. 390.

Acknowledgements
Work reported has been supported, in part, by
• Banco Santander Central Hispano, DARPA, D&B, Fleet Bank, Firstlogic, Merrill Lynch, MIT Total Data Quality Management (TDQM) Program, PricewaterhouseCoopers, Singapore-MIT Alliance (SMA), Suruga Bank, and USAF/ROME Laboratory.
For more information, go to http://web.mit.edu/TDQM and http://context2.mit.edu
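
As a supplement to the “Example Contexts and Queries” slide, the following is a minimal sketch of how one input query yields three different answers for IBM, depending on the receiver context. It is not the COIN mediator. The entity lists are copied from the slide's output queries rather than derived from the ownership relation, and the names `REVENUE1`, `CONTEXTS`, and `mediated_revenue` are hypothetical. The only logic shown is the sum over the listed entities, followed by rescaling from the source scale factor (1,000) to the receiver's scale factor.

```python
# revenue1 as shown on the slide: source context c_unconsolidated_revenue,
# scale factor 1,000, currency USD.
REVENUE1 = {
    "IBM": 77_966_000,
    "Lotus": 970_000,
    "IIP": 1_200_000,
    "IBM Far East Holdings": 550_000,
    "IBM International Treasury Services": 500_000,
    "General Motors": 177_828_100,
    "Hughes": 8_934_900,
    "EDS": 21_502_000,
}
SOURCE_SCALE = 1_000

# Entity lists are taken verbatim from the slide's output queries for "IBM";
# this sketch does NOT derive them from the ownership/structure relation.
CONTEXTS = {
    "c_unconsolidated_revenue": {
        "entities": ["IBM"],
        "scale": 1_000,
    },
    "c_majorityowned_revenue": {
        "entities": ["IBM", "Lotus", "IBM Far East Holdings",
                     "IBM International Treasury Services", "IIP"],
        "scale": 1_000,
    },
    "c_whollyowned_revenue": {
        "entities": ["IBM", "Lotus", "IBM Far East Holdings",
                     "IBM International Treasury Services"],
        "scale": 1_000_000,
    },
}


def mediated_revenue(context_name):
    """Answer 'revenue of IBM' in the given receiver context (hypothetical helper)."""
    ctx = CONTEXTS[context_name]
    # Equivalent of the slide's SUM(Revenue) ... where CorporateEntity in (...).
    total = sum(REVENUE1[entity] for entity in ctx["entities"])
    # Rescale from the source context's scale factor to the receiver's.
    return total * SOURCE_SCALE / ctx["scale"]


for name in CONTEXTS:
    print(name, mediated_revenue(name))
```

Under these assumptions the three contexts return 77,966,000, 81,186,000, and 79,986, matching the results on the slide.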