The Interplay of Data Quality and Data Semantics

Oh! So That is What you Meant:
The Interplay of
Data Quality and Data Semantics
2003 International Conference on Conceptual Modeling
(ER’03)
Chicago – October 2003
Stuart Madnick
MIT Sloan School of Management
smadnick@mit.edu
© S.Madnick, 2003
V4
1
Agenda

• Examples of Multiple Interpretations
  - No one “right” answer
  - “Wrong” answer = bad quality data
• Corporate Householding
  - Examples
  - Framework
• COntext INterchange (COIN) Approach
  - Basic concept
  - Applied to Corporate Householding
2
Example Questions
• Simple questions:
• “How much did MIT buy from IBM last year ?”
• “How much did IBM sell to MIT last year ?”
[ Do you expect the answers to be the same ?]
• “How much did Merrill Lynch loan to IBM last year ?”
• “How many employees does IBM have ?”
3
There can be multiple purposes and
contexts with differing answers
Picture of old lady or young lady ?
4
Data source . . . (how do you cite in a Journal article?)
5
Role Of Context
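[Figure: three contexts, shown with the date strings 02-01-03, 01-02-03, and 03-02-01 and the currency symbols $, £, and ¥; what a value means (“?”) depends on the context in which it is read.]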
CONTEXT VARIATIONS:
- GEOGRAPHIC ( US vs. UK )
- FUNCTIONAL (CASH MGMT vs. LOANS )
- ORGANIZATIONAL ( CITIBANK vs. CHASE )
Data:
Databases
Web data
E-mail
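As a small illustration of the date strings in the figure, the sketch below parses one string under three notations. The context names (`us`, `uk`, `year_first`) and the notation assigned to each are assumptions made for this example; the slide does not say which notation belongs to which context.

```python
from datetime import datetime

# Hypothetical mapping from a context name to its date notation.
DATE_FORMATS = {
    "us": "%m-%d-%y",          # month-day-year
    "uk": "%d-%m-%y",          # day-month-year
    "year_first": "%y-%m-%d",  # year-month-day
}

def interpret(date_string, context):
    """Parse date_string using the notation of the given context."""
    return datetime.strptime(date_string, DATE_FORMATS[context]).date()

# The same string yields three different dates.
for context in DATE_FORMATS:
    print(context, interpret("02-01-03", context))
```

The string never changes; only the context used to read it does.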
6
Types of Context
                   Example                     Temporal
Representational   Currency: $ vs €            Francs before 2000,
                   Scale factor: 1 vs 1000     € thereafter
Ontological        Revenue: Includes vs        Revenue: Excludes interest before
                   excludes interest           1994 but incl. thereafter
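The franc/euro example on this slide can be sketched as a context whose representation depends on the year of the data. This is only an illustration: the function names, the source scale factor of 1000, and the choice of euros at scale 1 as the receiver context are assumptions. The 2000 cutoff is the one on the slide, and 6.55957 is the official fixed franc-to-euro rate.

```python
FRF_PER_EUR = 6.55957  # official fixed conversion rate, francs per euro

def source_context(year):
    """Return the assumed representational context for a figure reported for `year`."""
    currency = "FRF" if year < 2000 else "EUR"
    return {"currency": currency, "scale": 1000}

def to_receiver(value, year):
    """Convert a reported value into euros at scale 1 (the assumed receiver context)."""
    ctx = source_context(year)
    amount = value * ctx["scale"]      # undo the source scale factor
    if ctx["currency"] == "FRF":
        amount /= FRF_PER_EUR          # francs to euros
    return amount

print(to_receiver(5, 2001))  # reported in thousands of euros
print(to_receiver(5, 1999))  # reported in thousands of francs
```

The same reported number 5 stands for different amounts depending on the year, which is what makes the context temporal.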
7
Example : Context Differences
( from multiple web sources)
Daimler Benz (DCX) Financial Data
Source        P/E Ratio
ABC           11.6
Bloomberg     5.57
DBC           19.19
MarketGuide   7.46
8
Additional Example

Q: How did CO2 emissions
(total, per GDP, per capita)
change over time
(between 1990 and 2000)
in Yugoslavia?


Context 1: YUG as a geographic
region bounded before the breakup
Context 2: YUG as a legal
autonomous state
Related effort:
- Laboratory for Information Globalization and Harmonization Technologies and Studies
( LIGHTS ) Project
9
The 1999 Overture
Unit-of-measure mix-up tied to
loss of $125 million Mars Orbiter
“NASA’s Mars Climate Orbiter was lost because
engineers did not make a simple conversion from English
units to metric, an embarrassing lapse that sent the $125
million craft off course. . . .
. . . The navigators ( JPL ) assumed metric units of
force per second, or newtons. In fact, the numbers were in
pounds of force per second as supplied by Lockheed Martin
( the contractor ).”
Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.
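One way to avoid this kind of silent mismatch is to carry the unit with the number, so that the receiver converts explicitly instead of assuming. The sketch below is a minimal illustration, not a description of any NASA or contractor software; the `Force` class and its method name are hypothetical. The conversion factor is the standard definition of the pound-force in newtons.

```python
from dataclasses import dataclass

NEWTONS_PER_LBF = 4.4482216152605  # one pound-force expressed in newtons

@dataclass(frozen=True)
class Force:
    """A force value that always travels with its unit ("N" or "lbf")."""
    value: float
    unit: str

    def to_newtons(self):
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * NEWTONS_PER_LBF
        raise ValueError(f"unknown unit: {self.unit}")

supplied = Force(10.0, "lbf")   # what the source actually meant
assumed = Force(10.0, "N")      # what a receiver would get by assuming metric
print(supplied.to_newtons(), assumed.to_newtons())
```

Reading the bare number 10.0 as newtons understates the supplied force by a factor of about 4.45.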
10
Revisit Example Questions
• Simple questions:
• “How much did MIT buy from IBM last year ?”
• “How much did IBM sell to MIT last year ?”
• “How much did Merrill Lynch loan to IBM last year ?”
• “How many employees does IBM have ?”
• Even the definition of “employee” might be ambiguous
• How to count part-time employees?
• How to count full-time consultants?
• (In a university) how to count joint appointments or unpaid
visiting faculty?
11
An example of “Householding” issue
12
“Household Data”
• Example: letter from Smith Barney
“To reduce mailing expense …
We have adopted a policy known as “householding.”
… shareholders who are members of the same family, share
the same address and have multiple accounts in the same
Fund will receive only one copy of the annual prospectus.”
• In general, what is definition of “household”?
• What would be corresponding “corporate household”?
13
“Household” – Definition?
Is it:
• Father, mother and children living at same address?
• What if father and mother have different last names?
• Father, mother, children even if living at different addresses ?
• Include cousin Alice visiting for 6 months?
• An unmarried couple living together? (of any sex)
• Roommates?
14
A Framework for Corporate Householding
a. Identical entity instance identification
Name: MIT
Addr: 77 Mass Ave
Name: Mass Inst of Tech
Addr: 77 Massachusetts
b. Entity aggregation
Name: MIT
Employees: 1200
Name: Lincoln Lab
Employees: 840
c. Transparency of inter-entity relationships
MIT
MicroComputer
CompUSA
IBM
15
a. Identical entity instance identification
Name: MIT
Addr: 77 Mass Ave
Name: Mass Inst of Tech
Addr: 77 Massachusetts
• Unambiguous universal identifiers rare (or rarely used)
• Examples:
Massachusetts Institute of Technology
Mass Inst of Tech
MIT, M.I.T., M I T
• In practice a frequent problem for mailing lists
• Addressed, in part, by companies such as FirstLogic
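A crude sketch of how the name variants above might be matched. The abbreviation table, the stopword list, and the acronym heuristic are all assumptions for illustration; this is not how any commercial product works, and an acronym test this simple would also produce false matches on a real mailing list.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"mass": "massachusetts", "inst": "institute", "tech": "technology"}
STOPWORDS = {"of", "the"}

def tokens(name):
    """Lowercase the name, split into words, expand abbreviations, drop stopwords."""
    words = re.findall(r"[a-z]+", name.lower())
    return [ABBREVIATIONS.get(w, w) for w in words if w not in STOPWORDS]

def acronym(name):
    """Initial letters of the name's tokens."""
    return "".join(w[0] for w in tokens(name))

def same_entity(a, b):
    """True if the names match word for word, or one is an acronym of the other."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return True
    return "".join(ta) == acronym(b) or "".join(tb) == acronym(a)

variants = ["Massachusetts Institute of Technology", "Mass Inst of Tech",
            "MIT", "M.I.T.", "M I T"]
print(all(same_entity(variants[0], v) for v in variants))
```

Address fields such as "77 Mass Ave" and "77 Massachusetts" would need similar treatment.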
16
b. Entity aggregation
Name: MIT
Employees: 1200
Name: Lincoln Lab
Employees: 840
• What should be included as part of an entity ?
• Example:
“Lincoln Lab” is “Federally Funded R&D Center of MIT”
• Is Lincoln Lab included in answer to questions, such as:
How many employees does MIT have ?
What was MIT’s total budget last year ?
How much have we sold to MIT ?
• The different circumstances are called “contexts”
• Addressed, in part, by companies such as D&B
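The aggregation question can be sketched as a lookup of which entities a given context folds into “MIT”. The employee counts are the ones on the slide; the context names and the membership table are hypothetical.

```python
# Employee counts as shown on the slide.
EMPLOYEES = {"MIT": 1200, "Lincoln Lab": 840}

# Hypothetical rule table: which entities each context counts as part of MIT.
CONTEXT_MEMBERS = {
    "excluding_lincoln_lab": ["MIT"],
    "including_lincoln_lab": ["MIT", "Lincoln Lab"],
}

def employee_count(context):
    """Answer 'How many employees does MIT have?' in the given context."""
    return sum(EMPLOYEES[name] for name in CONTEXT_MEMBERS[context])

print(employee_count("excluding_lincoln_lab"))  # 1200
print(employee_count("including_lincoln_lab"))  # 2040
```

Both answers are correct; they answer the question in different contexts.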
17
Example: What is “IBM” ?
What is the relationship among these entities
(and the changes over time – “temporal context”):
• International Business Machines Corporation
• IBM
• IBM Global Services
• IBM Global Network (1999-)
• IBM de Colombia, S.A (90%)
• Lotus Development Corporation (100%)
• Software Artistry, Inc. (1998+, 2000-)
• Dominion Semiconductor Company (50/50 jv)
• MiCRUS (majority jv)
• Computing-Tabulating-Recording Co.
18
c. Transparency of inter-entity relationships
MIT
MicroComputer
IBM
CompUSA
• Relationships might be direct or indirect
• Under what circumstances (i.e., contexts)
should they be collapsed?
• This can be multi-leveled, especially in
- financial transactions
- supply chain management
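A sketch of collapsing indirect relationships, assuming the diagram shows MIT buying IBM products both directly and through the resellers MicroComputer and CompUSA. The purchase records, amounts, and the `OtherBrand` manufacturer are invented for illustration only.

```python
# Hypothetical purchase records: (buyer, seller, manufacturer, amount).
PURCHASES = [
    ("MIT", "IBM", "IBM", 500),
    ("MIT", "MicroComputer", "IBM", 120),
    ("MIT", "CompUSA", "IBM", 80),
    ("MIT", "CompUSA", "OtherBrand", 40),
]

def bought_from(buyer, supplier, collapse_resellers):
    """Total bought from `supplier`; optionally count purchases made via resellers."""
    total = 0
    for record_buyer, seller, manufacturer, amount in PURCHASES:
        if record_buyer != buyer:
            continue
        direct = seller == supplier
        indirect = collapse_resellers and manufacturer == supplier
        if direct or indirect:
            total += amount
    return total

print(bought_from("MIT", "IBM", collapse_resellers=False))  # direct only
print(bought_from("MIT", "IBM", collapse_resellers=True))   # direct + via resellers
```

Whether the reseller purchases count as “buying from IBM” depends on the context of the question.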
19
Types of Entities and Relationships
• Many possible “atoms”:
Locations (“branches”)
Scope (“divisions”)
Ownerships (“subsidiaries”) – including fractionals
Joint ventures
...
Others ?
20
Examples of multiple purposes . . .
Corporate household / family structure purposes:
- Accounting: Account consolidation
- Financial: Risk (credit - bankruptcy, country)
- Marketing/Purchasing (multiple divisions & subsidiaries):
  Customer & Supplier consolidation (volume discounts)
  Customer Relationship Management (CRM)
- Managerial: Regional and/or Product separations
- Legal: Liability (insurance)
  Licensing (software, patents)
- Relationship: Consultant Conflict of interest & competition
And these are dynamic … changing over time
21
Account Consolidation Example:
Should Company A consolidate accounts with Company B?
[Decision flowchart; the YES/NO branch layout is not recoverable from the extracted text.]
Questions asked:
• Does Company A have majority ownership of Company B
(either directly or indirectly)?
• Does Company A have a controlling financial interest in Company B?
• Is Company A a bank holding company?
• Is Company B subject to the Bank Holding Company Act?
• Does Company A have the same fiscal periods as Company B?
• Is Company B a foreign subsidiary of Company A?
Possible outcomes:
• No Consolidation
• Consolidation at the Discretion of Company A
• Consolidation, however, changes need to be made in
fiscal periods of Company B
• Consolidation should occur
22
General Challenge:
There is not a single right answer
• As seen with “old woman or young woman” picture
• Need to be able to answer question “in context”
• This issue appears in many situations …
23
Research Agenda
• We do not want to build a custom solution for each situation
- Need to understand requirements
- Need to understand sources
• Process
1. Interviews – describe representative purposes / cases in detail
2. Study use of Corporate Householding in organizations
3. Identify key characteristics & typical kinds of rules
4. Incorporate into a context knowledge management system
5. Develop reasoning algorithm to map purposes onto data sources
automatically (extend prior context mediation work)
24
The Context Interchange Approach
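[Figure: Context Interchange architecture. A receiver application issues a query against a source:
Select partlength
From catalog
Where partno=“12AY”
A context mediator sits between the source context and the receiver context. Using shared ontologies (concept: Length), conversion libraries (function( ) between meters and feet), and context management, it performs the context transformation on the returned part length (17 in the figure).]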
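A minimal sketch of the mediation idea for the Length concept, assuming each context is a small dictionary that names its length unit. The function and key names are hypothetical, and this is not the COIN implementation. The slide does not say which side uses feet and which uses meters, so the direction in the usage line is an assumption.

```python
METERS_PER_FOOT = 0.3048  # exact by definition

# Conversion library: functions keyed by (source unit, receiver unit).
CONVERSIONS = {
    ("feet", "meters"): lambda x: x * METERS_PER_FOOT,
    ("meters", "feet"): lambda x: x / METERS_PER_FOOT,
}

def mediate(value, source_context, receiver_context):
    """Transform a length from the source's unit into the receiver's unit."""
    src = source_context["length_unit"]
    dst = receiver_context["length_unit"]
    if src == dst:
        return value  # contexts agree, no conversion needed
    return CONVERSIONS[(src, dst)](value)

# Assumed direction: the source catalog stores feet, the receiver expects meters.
print(mediate(17, {"length_unit": "feet"}, {"length_unit": "meters"}))
```

The application's query stays the same; only the declared contexts decide whether a conversion is applied.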
25
Corporate Household - Ontology, Relations and Elevations
[Figure: three layers: Ontology, Elevations, and Source Data (CHH relations and application relation(s)).]

Ontology
Semantic types: Basic (Number), CorporateEntity, Relationship, Date, CurrencyType, Country, EntityFinancials, AggregationItem, Revenue
Link types: Inheritance, Attributes (attr), Modifiers (mod)
Link labels: relationshipType (string), ownership (number), scale (number), aggregationType (string), fyEnding, location, childEntity, parentEntity, currency, company, officialCurrency

Elevations (groups of semantic objects elevated from columns in relations)
Relationship(“subsidiary”), Relationship(“branch”), Relationship(“division”), ...
CorporateEntity(“IBM”), CorporateEntity(“Lotus”), CorporateEntity(“IIP”), CorporateEntity(“General Motors”), ...
Country(“USA”), Country(“Netherlands”), Country(“China”), ...
Revenue(77,966,000), Revenue(970,000), Revenue(1,200,000), Revenue(177,828,100), ...

Source Data:

CHH Relations
Corporate householding database: structure:
ChildEntity                           ParentEntity                      RelationshipType   Ownership
Lotus Development                     International Business Machines   Subsidiary         100
IBM Global Services                   International Business Machines   Division           100
IBM Far East Holdings B. V.           International Business Machines   Subsidiary         100
International Information Products    IBM Far East Holdings B. V.       Subsidiary         80
IBM Germany                           International Business Machines   Branch             100
IBM France                            International Business Machines   Branch             100
IBM International Treasury Services   IBM Germany                       Subsidiary         33
IBM International Treasury Services   IBM France                        Subsidiary         14
...

Corporate householding database: country:
CorporateEntity                       Country
International Business Machines       USA
Lotus Development                     USA
IBM Global Services                   USA
IBM Far East Holdings B. V.           Netherlands
International Information Products    China
IBM Germany                           Germany
IBM France                            France
IBM International Treasury Services   Ireland
General Motors                        USA
Hughes Electronics                    USA
Electronic Data Systems               USA
...

Application Relation(s)
revenue1 (c_unconsolidated_revenue):
CorporateEntity                       Revenue
International Business Machines       77,966,000
Lotus Development                     970,000
International Information Products    1,200,000
IBM Far East Holdings B. V.           550,000
IBM International Treasury Services   500,000
General Motors                        177,828,100
Hughes Electronics                    8,934,900
Electronic Data Systems               21,502,000
...
26
Example Contexts and Queries
Input query:
Select CorporateEntity, Revenue from revenue1
where CorporateEntity = “IBM”

Contexts:

c_unconsolidated_revenue: (source context)
an entity’s revenue only includes revenues of
its divisions and branches;
scale = 1,000;
currency = ‘USD’;
aggregationType = ‘division+branch’

c_majorityowned_revenue:
an entity’s revenue includes revenues of its
divisions, branches, and all majority owned
subsidiaries;
scale = 1,000;
currency = ‘USD’;
aggregationType = ‘subsidiary+division+branch’

c_whollyowned_revenue:
an entity’s revenue includes revenues of its
divisions, branches, and wholly owned subsidiaries;
scale = 1,000,000;
currency = ‘USD’;
aggregationType = ‘whollyownedsubsidiary+division+branch’

Relations:

CorporateEntity                       revenue1      revenue2      revenue3
IBM                                   77,966,000    81,186,000    79,986
Lotus                                 970,000       970,000       970
IIP                                   1,200,000     1,200,000     1,200
IBM Far East Holdings                 550,000       1,750,000     550
IBM International Treasury Services   500,000       500,000       500
General Motors                        177,828,100   186,763,000   186,763
Hughes                                8,934,900     8,934,900     8,934.9
EDS                                   21,502,000    21,502,000    21,502
...

source context: c_unconsolidated_revenue

Input query asked in the c_unconsolidated_revenue context:
result (no conversion needed)
CorporateEntity   Revenue
IBM               77,966,000

Input query asked in the c_majorityowned_revenue context (conversion):
Output query:
Select “IBM” as CorporateEntity, SUM(Revenue) from revenue1
where CorporateEntity in (“IBM”, “Lotus”, “IBM Far East
Holdings”, “IBM International Treasury Services”, “IIP”)
result
CorporateEntity   Revenue
IBM               81,186,000

Input query asked in the c_whollyowned_revenue context (conversion):
Output query:
Select “IBM” as CorporateEntity, SUM(Revenue) from revenue1
where CorporateEntity in (“IBM”, “Lotus”, “IBM Far East
Holdings”, “IBM International Treasury Services”)
result
CorporateEntity   Revenue
IBM               79,986
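The three results on this slide can be reproduced with a short sketch that runs the slide's output queries against an in-memory copy of revenue1 and then rescales from the source scale (1,000) to the receiver's scale. The entity lists are copied from the output queries rather than derived from the structure relation, and the function name and explicit rescaling step are assumptions of this sketch, not the COIN mediator itself.

```python
import sqlite3

# revenue1 as shown on the slide (source context: scale 1,000, USD).
REVENUE1 = [
    ("IBM", 77_966_000),
    ("Lotus", 970_000),
    ("IIP", 1_200_000),
    ("IBM Far East Holdings", 550_000),
    ("IBM International Treasury Services", 500_000),
    ("General Motors", 177_828_100),
    ("Hughes", 8_934_900),
    ("EDS", 21_502_000),
]
SOURCE_SCALE = 1_000

def mediated_revenue(entities, receiver_scale):
    """Sum revenue1 over `entities`, then rescale to the receiver's scale."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue1 (CorporateEntity TEXT, Revenue INTEGER)")
    conn.executemany("INSERT INTO revenue1 VALUES (?, ?)", REVENUE1)
    placeholders = ", ".join("?" for _ in entities)
    (total,) = conn.execute(
        "SELECT SUM(Revenue) FROM revenue1 "
        f"WHERE CorporateEntity IN ({placeholders})",
        entities,
    ).fetchone()
    conn.close()
    return total * SOURCE_SCALE / receiver_scale

# Entity lists taken from the two output queries on the slide.
MAJORITY_OWNED = ["IBM", "Lotus", "IBM Far East Holdings",
                  "IBM International Treasury Services", "IIP"]
WHOLLY_OWNED = ["IBM", "Lotus", "IBM Far East Holdings",
                "IBM International Treasury Services"]

print(mediated_revenue(["IBM"], 1_000))             # c_unconsolidated_revenue
print(mediated_revenue(MAJORITY_OWNED, 1_000))      # c_majorityowned_revenue
print(mediated_revenue(WHOLLY_OWNED, 1_000_000))    # c_whollyowned_revenue
```

The only difference between the last two results is whether IIP (80% owned) is included and which scale the receiver expects.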
27
COIN Demo – Result of Execution
28
The 1805 Overture
In 1805, the Austrian and Russian Emperors agreed to
join forces against Napoleon. The Russians said their forces
would be in the field in Bavaria by Oct. 20.
The Austrian staff planned based on that date in the
Gregorian calendar. Russia, however, used the ancient Julian
calendar, which lagged 10 days behind.
The difference allowed Napoleon to surround Austrian
General Mack's army at Ulm on Oct. 21, well before the
Russian forces arrived.
Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, pg. 390.
29
Acknowledgements
Work reported has been supported, in part, by
• Banco Santander Central Hispano, DARPA, D&B, Fleet
Bank, Firstlogic, Merrill Lynch, MIT Total Data Quality
Management (TDQM) Program, PricewaterhouseCoopers,
Singapore-MIT Alliance (SMA), Suruga Bank, and
USAF/ROME Laboratory.
For more information, go to http://web.mit.edu/TDQM and
http://context2.mit.edu
30