Stuart Madnick: Information Integration & Re-Use


Stuart Madnick (smadnick@mit.edu): Information Integration & Re-Use

[Overview diagram centered on "Information Integration". Numbers in parentheses point to the numbered sections that follow.]

Area labels: Technologies; Applications; Data Quality; Strategy, Policy & Legal Issues; Security

Topics shown:

  COntext INterchange (COIN) (1)
  Stakeholder Perceptions of Security (2)
  Economic model of alternatives to EU Database Directive (3)
  System Dynamics Modeling of State Stability (4)
  Total Data Quality (TDQM) Program (5)
  MIT Information Quality (MIT-IQ) Program
  Pros and cons of data standards
  Financial Services (account aggregation)
  RFID IT Infrastructure
  Security Analysis
  Military Logistics
  Others …

(1) Unit-of-measure mixup tied to loss of $125 million Mars Orbiter

“NASA’s Mars Climate Orbiter was lost because engineers did not make a simple conversion from English units to metric, an embarrassing lapse that sent the $125 million craft off course. . . .

. . . The navigators (JPL) assumed metric units of force per second, or newtons. In fact, the numbers were in pounds of force per second as supplied by Lockheed Martin (the contractor).”

Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.

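The Context Interchange Approach

[Diagram: a Context Mediator performs context transformation between the source contexts of the numbered data sources and the receiver context. It relies on Shared Ontologies (concept: Length, expressed in meters or feet), Conversion Libraries (a function f() relating meters and feet), and a Context Management Administrator.]

Original query:

    Select partlength
    From catalog
    Where partno=“12AY”

Mediated query (after context transformation):

    Select partlength x 3.35
    From catalog
    Where partno=“12AY”

Part length values shown in the diagram: 17 and 55.25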
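The mediation step in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the COIN implementation: the names `TO_METERS`, `conversion_factor`, and `mediate_query` are hypothetical, the source is assumed to store lengths in meters and the receiver to expect feet, and the exact definition of 0.3048 meters per foot is used instead of the rounded multiplier shown on the slide.

```python
# Hypothetical conversion library: how many meters one unit of each kind equals.
TO_METERS = {"meters": 1.0, "feet": 0.3048}


def conversion_factor(source_unit, receiver_unit):
    """Multiplier that turns a value in source_unit into receiver_unit."""
    return TO_METERS[source_unit] / TO_METERS[receiver_unit]


def mediate_query(column, table, where, source_unit, receiver_unit):
    """Rewrite a simple query so the result arrives in the receiver's unit."""
    factor = conversion_factor(source_unit, receiver_unit)
    # Leave the column untouched when both contexts already agree.
    select = column if factor == 1.0 else f"{column} * {factor:.4f}"
    return f"SELECT {select} FROM {table} WHERE {where}"


# Assumed contexts: the catalog stores meters, the receiver wants feet.
query = mediate_query("partlength", "catalog", "partno = '12AY'", "meters", "feet")
print(query)
```

The multiplier is folded into the select list, so the source stays unchanged while the receiver sees values in its own context.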

(2) Security Gap Analysis Findings – Different Organizations

Question 39: People are aware of good security practices.

Gap between Assessment and Importance – for your company

  Group               Gap     Assessment vs. Importance
  Overall             1.28    5.04 vs. 6.32
  Miscellaneous [1]   2.40    4.20 vs. 6.60
  Company X [2]       1.83    5.00 vs. 6.83
  Company W [2]       1.89    4.61 vs. 6.50
  Company I [3]       0.44    5.33 vs. 5.78

[1] Original pilot sample: diverse array of companies, many middle managers
[2] High-tech organizations
[3] Non-USA company

[Chart: MA, MI, and Gap shown for Overall, Misc., Comp X, Comp W, and Comp I, with axis ticks from 1 to 7.]
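The gap figures can be recomputed from the paired scores. The sketch below assumes that the gap is importance minus assessment and that the first number in each pair is the assessment score; the variable and function names are illustrative only.

```python
# (assessment, importance) pairs as reported on the slide.
scores = {
    "Overall": (5.04, 6.32),
    "Miscellaneous": (4.20, 6.60),
    "Company X": (5.00, 6.83),
    "Company W": (4.61, 6.50),
    "Company I": (5.33, 5.78),
}


def gap(assessment, importance):
    """Assumed definition: how far importance exceeds assessment."""
    return round(importance - assessment, 2)


gaps = {group: gap(*pair) for group, pair in scores.items()}
largest = max(gaps, key=gaps.get)  # group with the widest gap

for group, value in gaps.items():
    print(f"{group:15s} {value:.2f}")
```

Recomputing from the rounded scores gives 0.45 for Company I rather than the reported 0.44; a plausible explanation, though not one the slide states, is that the reported gap was computed from unrounded averages.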

(3) Recent legislation/proposals regarding data reuse & repurposing

Timeline, 1991 to 2004:

1991 (start): Feist case. The US Supreme Court decided that databases lacking minimal originality are not copyrightable.

1996 (adopted): The EU introduced the Database Directive, granting database makers a sui generis right to prevent unauthorized extraction and reutilization of the whole of, or a substantial part of, a database’s contents, as well as systematic extraction of insubstantial parts.

1996 (failed): HR 3531, Database Investment and Intellectual Property Antipiracy Act: similar to the EU Directive, with no fair use exceptions.

1998 (failed): HR 2652, Collections of Information Antipiracy Act: criminal and civil remedies if reuse of a substantial part of another person’s database causes or has the potential to cause harm.

1999 (failed): HR 354, Collections of Information Antipiracy Act: similar to HR 2652, with more fair use exceptions.

1999 (failed): HR 1858, Consumer and Investor Access to Information Act: disallows verbatim copying of a database.

2003 (failed): HR 3261, Database and Collections of Information Misappropriation Act: disallows free riders from creating functionally equivalent databases that reduce the revenue of the creators.

2004 (failed): HR 3872, Consumer Access to Information Act: prevents free riders from engaging in direct competition that threatens the existence or the quality of the creator’s database.

Incentives for creating databases vs. value creation through data reuse

(4) MIT Model – High Level System Dynamics View: Loads on State Stability vs. Capacity to Manage

[Causal loop diagram linking the following variables with positive (+) and negative (-) influences:]

  Regime Legitimacy
  State Institutional Capacity
  Economic Performance
  Investment in Other Productive Investments
  Investment in Internal/External Security
  Regime Force and Violence (Regime Anti-terrorism Violent Activity)
  Anti-Regime Activity
  Dissident Institutional Capacity
  Insurgents
  Demographic and Socio-political Cleavages
  Civic Capacity and Social Liberties
  Socio-political Mobilization

(5) Interplay of Data Quality & Data Semantics: multiple purposes with differing answers

Picture of an old lady or a young lady?

The 1805 Overture

In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians said their forces would be in the field in Bavaria by Oct. 20.

The Austrian staff planned based on that date in the Gregorian calendar.

Russia, however, used the ancient Julian calendar, which in 1805 lagged 12 days behind.

The difference allowed Napoleon to surround Austrian General Mack's army at Ulm on Oct. 21, well before the Russian forces arrived.

Source: David Chandler, The Campaigns of Napoleon, New York: Macmillan, 1966, p. 390.
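The calendar mismatch is a context difference that can be resolved mechanically. The sketch below converts a Julian-calendar date to its Gregorian equivalent using the standard Julian Day Number formula; the function name `julian_to_gregorian` is illustrative, and the example simply assumes that the promised "Oct. 20" was meant as a Julian date.

```python
from datetime import date


def julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar date to a (proleptic) Gregorian date."""
    # Standard integer formula for the Julian Day Number of a Julian date.
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinals count days from Gregorian 0001-01-01 (ordinal 1),
    # whose Julian Day Number is 1721426.
    return date.fromordinal(jdn - 1721425)


# "Oct. 20" read in the Julian calendar, expressed in the Gregorian calendar.
russian_date = julian_to_gregorian(1805, 10, 20)
lag_days = (russian_date - date(1805, 10, 20)).days
print(russian_date.isoformat(), lag_days)
```

Going through the day number avoids hard-coding the offset between the two calendars, which grows by one day in most century years.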

Data Integration: MIT COntext INterchange (COIN) Project

Professor Stuart Madnick & Dr. Michael Siegel, Sloan School of Management, {smadnick,msiegel}@mit.edu

Integrating multiple semantically heterogeneous data sources has been a focus of the COntext INterchange (COIN) research group at MIT for several years. The work, which rests on a solid theoretical foundation documented in several journal publications, has been used to develop a series of prototypes and development systems in applications ranging from financial services to the integration of diverse intelligence data.

The COIN technologies consist of two key components that can be used separately but were designed to work efficiently together: (1) a multimedia data federation engine (aka Cameleon), which can extract and merge data from semistructured HTML and XML web sites (i.e., a “web wrapper”) as well as from traditional relational databases, and (2) a powerful data semantics mediation engine (aka COIN mediator), which provides a means for semantic representation and reasoning.

Some key papers on Cameleon can be found at:

“Information Aggregation using the Caméléon# Web Wrapper” at http://web.mit.edu/smadnick/www/wp/2005-06.pdf

“The Cameleon Web Wrapper Engine” at http://web.mit.edu/smadnick/www/wp/2000-03.pdf

“Querying Web-Sources within a Data Federation” at http://web.mit.edu/smadnick/www/wp/2006-09.pdf

Functionality: It is not surprising that data gathered by different agencies (possibly from different countries) for different purposes would represent “similar” information in different ways. This can range from simple things like representing a person’s weight in pounds, kilograms, or even “stones” (in the UK) to which definition of “terrorist” is used in “number of terrorists.” To use these diverse sources in combination, these differences must be resolved. A further complication is that these “definitions” (i.e., “contexts”) also change over time (i.e., “temporal context”). As a simple example, military expenditures in France used to be reported in Francs; now they are reported in Euros. We have demonstrated the technology in application areas ranging from financial services to global price shopping to counter-terrorism and intelligence data. Some papers illustrating the functionality can be found at:

“Context Mediation Demonstration of Counter-Terrorism Intelligence (CTI) Integration” at http://web.mit.edu/smadnick/www/wp/2005-03.pdf

“Semantic Information Integration in the Large: Adaptability, Extensibility, and Scalability of the Context Mediation Approach” at http://web.mit.edu/smadnick/www/wp/2005-04.pdf
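The temporal-context example (Francs, later Euros) can be illustrated with a short sketch. It is not taken from the COIN system: the function names are hypothetical, the figures are made up, and the cutover year of 1999 is an assumption chosen only to make the example concrete. The rate of 6.55957 francs per euro is the official fixed conversion rate.

```python
# Official fixed rate at which the French franc was converted to the euro.
FRF_PER_EUR = 6.55957

# ASSUMPTION: year in which the source switches its reporting currency.
CUTOVER_YEAR = 1999


def reporting_currency(year, cutover_year=CUTOVER_YEAR):
    """Temporal context: which currency the source used in a given year."""
    return "FRF" if year < cutover_year else "EUR"


def to_euros(amount, year, cutover_year=CUTOVER_YEAR):
    """Express a reported amount in the receiver's context (euros)."""
    if reporting_currency(year, cutover_year) == "FRF":
        return amount / FRF_PER_EUR
    return amount


# Made-up figures: the same real value reported under two temporal contexts.
reported = {1997: 655.957, 2003: 100.0}
in_euros = {year: round(to_euros(amount, year), 2) for year, amount in reported.items()}
print(in_euros)
```

Without the temporal context, a query spanning both years would compare 655.957 with 100.0 directly and report a drop that never happened.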

Technology: COIN mediation reasoning uses an integrated framework of abductive and constraint logic programming to detect semantic conflicts. This is combined with the ability to dynamically produce complex conversion programs: symbolic equation-solving techniques are used to invert simple conversion components and compose them into complex conversion programs.
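The idea of inverting and composing simple conversion components can be shown with a deliberately restricted sketch. It handles only linear conversions of the form y = scale * x + offset, which is a simplification standing in for the symbolic equation solving described above, and the `Linear` class with its `then` and `inverse` methods is a hypothetical helper, not part of COIN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """A conversion component of the form y = scale * x + offset."""
    scale: float
    offset: float = 0.0

    def __call__(self, x):
        return self.scale * x + self.offset

    def inverse(self):
        # Solve y = scale * x + offset for x.
        return Linear(1.0 / self.scale, -self.offset / self.scale)

    def then(self, other):
        # Composition: apply self first, then other.
        return Linear(other.scale * self.scale,
                      other.scale * self.offset + other.offset)


# Simple components (1 stone = 14 pounds, 1 pound = 0.45359237 kilograms).
stones_to_pounds = Linear(14.0)
pounds_to_kg = Linear(0.45359237)

# Derived automatically: neither conversion below was written by hand.
stones_to_kg = stones_to_pounds.then(pounds_to_kg)
kg_to_stones = stones_to_kg.inverse()

print(stones_to_kg(10.0), kg_to_stones(63.5029318))
```

Only the two elementary conversions are supplied; the conversion from kilograms to stones falls out of composition and inversion, which is the property that keeps a conversion library small as the number of contexts grows.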

The basic theory of COIN mediation was presented in this paper:

“Context Interchange: New Features and Formalisms for the Intelligent Integration of Information,” published in ACM Transactions on Information Systems, at http://web.mit.edu/smadnick/www/wp/1997-03.pdf

The research has been extensively expanded since then to include temporal reasoning and much more powerful automatic and dynamic conversion program generation, described in papers such as:

“Information Integration In the Presence of Equational Ontological Conflicts” at http://web.mit.edu/smadnick/www/wp/2002-16.pdf

“Effective Data Integration in the Presence of Temporal Semantic Conflicts” at http://web.mit.edu/smadnick/www/wp/2004-02.pdf

Although there still exist important context mediation research issues, such as automated generation and reconciliation of ontologies, the current COIN technologies are advanced enough to be immediate and valuable contributors to efforts in enterprise integration, semantic integration and the Semantic Web.
