Data Mining - Lyle School of Engineering

DATA WAREHOUSING & INFORMATION RETRIEVAL

Margaret H. Dunham

Department of Computer Science and Engineering

Southern Methodist University

PO Box 750122

Dallas, Texas 75275-0122

214-768-3087 mhd@engr.smu.edu

The contents of this presentation draw extensively from slides for:

 Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

4/17/07, Tecnológico de Monterrey, SMU

CSE 8337

DW&IR Outline

 Introduction

 Data Warehousing

 Research

 Summary

DW&IR Outline

 Introduction

– Data Warehousing Overview

– Information Retrieval

 Data Warehousing

 Research

 Summary

Data Warehousing

“Subject-oriented, integrated, time-variant, nonvolatile”

William Inmon

 http://www.inmondatasystems.com/

Operational Data: Data used for the day-to-day needs of a company.

Informational Data: Supports other functions such as planning and forecasting.

Data mining tools often access data warehouses rather than operational data.

Data Warehouse Variations

 Data Mart – Subset of complete data warehouse

 Virtual Warehouse – Warehouse implemented as a view of operational data

Operational vs. Informational

               Operational Data      Data Warehouse

Application    OLTP                  OLAP

Use            Precise Queries       Ad Hoc

Temporal       Snapshot              Historical

Modification   Dynamic               Static

Orientation    Application           Business

Data           Operational Values    Integrated

Size           Gigabits              Terabits

Level          Detailed              Summarized

Access         Often                 Less Often

Response       Few Seconds           Minutes

Data Schema    Relational            Star/Snowflake

Information Retrieval

Information Retrieval (IR): retrieving desired information from textual data.

Library Science

Digital Libraries

Web Search Engines

Traditionally keyword based

Sample query:

Find all documents about “data mining”

IR is also being applied to other unformatted data

DB vs IR

 Records (tuples) vs. documents

 Well defined results vs. fuzzy results

 DB grew out of files and traditional business systems

 IR grew out of library science and need to categorize/group/access books/articles

DB vs IR (cont’d)

 Data retrieval

 which docs contain a set of keywords?

 Well defined semantics

 a single erroneous object implies failure!

 Information retrieval

 information about a subject or topic

 semantics is frequently loose

 small errors are tolerated

 IR system:

 interpret contents of information items

 generate a ranking which reflects relevance

 notion of relevance is most important
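
A minimal sketch of how an IR system might generate a ranking that reflects relevance. The scoring rule (counting occurrences of query terms) and the sample documents are assumptions made for illustration only; real systems use more refined similarity measures.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def score(query, document):
    """Count how many times the query's terms occur in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))

def rank(query, documents):
    """Return (doc_id, score) pairs with score > 0, most relevant first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical document collection.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "a data warehouse stores historical data",
    "d3": "library science organizes books",
}
ranking = rank("data mining", docs)
print(ranking)
```

Unlike a database query, the result is an ordered list: d2 is still returned, only with a lower score, which is the sense in which small errors are tolerated.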

Information Retrieval (cont’d)

Similarity: measure of how close a query is to a document.

Documents which are “close enough” are retrieved.

 Metrics:

– Precision = |Relevant and Retrieved| / |Retrieved|

– Recall = |Relevant and Retrieved| / |Relevant|
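
The two metrics can be computed directly from sets of document identifiers. This is a small sketch; the document IDs are invented, and returning 0.0 for an empty set is an assumed convention, not something the slides specify.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 actually relevant.
relevant = {"d1", "d2", "d5"}
retrieved = {"d1", "d2", "d3", "d4"}
p, r = precision_recall(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50), and two of the three relevant documents were found (recall 0.67).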

IR Query Result Measures and Classification

IR Classification

DW&IR Outline

 Introduction

 Data Warehousing

– Dimensional Modeling

– OLAP

– Decision Support Systems

 Research

 Summary

Data Transformation for Data Warehouse

ETL – Extract, Transform, Load

Unwanted data must be removed

Convert heterogeneous sources into one common schema

Operational data is usually only a snapshot, so multiple snapshots may need to be merged to create the historical view

Summarize data

New derived data

Handle missing and erroneous data
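
The steps above can be sketched in a few lines. Everything here is an illustrative assumption: the two source layouts, the field names, the common schema, and the policy of simply dropping rows with missing values.

```python
from collections import defaultdict

# Two hypothetical operational snapshots with different field names.
store_a = [
    {"day": "2007-04-16", "city": "Dallas", "qty": 3, "unit_price": 2.0},
    {"day": "2007-04-16", "city": "Dallas", "qty": None, "unit_price": 2.0},
]
store_b = [
    {"date": "2007-04-17", "location": "Dallas", "quantity": 5,
     "price": 2.0, "clerk": "x"},
]

def extract_transform(rows, mapping):
    """Rename fields to the common schema, drop unmapped (unwanted) fields
    and rows with missing values, and add a derived 'revenue' column."""
    for row in rows:
        record = {target: row.get(source) for target, source in mapping.items()}
        if any(value is None for value in record.values()):
            continue  # simplest possible policy for missing data
        record["revenue"] = record["quantity"] * record["price"]
        yield record

def load(snapshots):
    """Merge the snapshots into one historical list ordered by date."""
    merged = [record for snapshot in snapshots for record in snapshot]
    return sorted(merged, key=lambda record: record["date"])

warehouse = load([
    extract_transform(store_a, {"date": "day", "city": "city",
                                "quantity": "qty", "price": "unit_price"}),
    extract_transform(store_b, {"date": "date", "city": "location",
                                "quantity": "quantity", "price": "price"}),
])

# Summarize: total revenue per city across all snapshots.
revenue_by_city = defaultdict(float)
for record in warehouse:
    revenue_by_city[record["city"]] += record["revenue"]
print(dict(revenue_by_city))
```

In practice missing or erroneous values are often repaired rather than discarded; dropping them just keeps the sketch short.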

Data Warehouse Creation

Fig 1 [1]

Dimensional Modeling

 View data hierarchically, much as business executives might

 Useful in decision support systems and mining

 Dimension: collection of logically related attributes; axis for modeling data.

Facts: data stored

Ex: Dimensions – products, locations, date

Facts – quantity, unit price

Multidimensional Model Example

Fig 2 [1]

Cube view of Data

Fig 4 [1]

Aggregation Hierarchies

Multidimensional Schemas

 Star Schema shows facts and dimensions

– Center of the star has facts shown in fact tables

– Outside of the facts, each dimension is shown separately in dimension tables

– Access to fact table from dimension table via join

SELECT Quantity, Price
FROM Facts, Location
WHERE (Facts.LocationID = Location.LocationID) AND
      (Location.City = 'Dallas')

– When viewed as relations, the problems are the volume of data and indexing
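
A small runnable sketch of a star schema using Python's built-in sqlite3 module. The Facts and Location tables and the Quantity, Price, LocationID, and City columns follow the slide's query; the Product dimension, the State column, and all sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one per dimension, on the points of the star.
    CREATE TABLE Location (LocationID INTEGER PRIMARY KEY, City TEXT, State TEXT);
    CREATE TABLE Product  (ProductID  INTEGER PRIMARY KEY, Description TEXT);
    -- Fact table at the center, holding a key into each dimension.
    CREATE TABLE Facts (
        ProductID  INTEGER REFERENCES Product(ProductID),
        LocationID INTEGER REFERENCES Location(LocationID),
        Quantity   INTEGER,
        Price      REAL
    );
    INSERT INTO Location VALUES (1, 'Dallas', 'TX'), (2, 'Houston', 'TX');
    INSERT INTO Product  VALUES (10, 'Widget'), (20, 'Gadget');
    INSERT INTO Facts VALUES (10, 1, 5, 2.50), (20, 1, 1, 9.99), (10, 2, 7, 2.50);
""")

# The slide's query: reach the facts through a join with a dimension table.
rows = conn.execute("""
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE Facts.LocationID = Location.LocationID
      AND Location.City = 'Dallas'
    ORDER BY Quantity
""").fetchall()
print(rows)
conn.close()
```

Only the two Dallas facts are returned; the Houston row is filtered out by the join condition on the Location dimension.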

Star Schema

Flattened Star

Normalized Star

Snowflake Schema

OLAP

Online Analytic Processing (OLAP): provides more complex queries than OLTP.

OnLine Transaction Processing (OLTP): traditional database/transaction processing.

Dimensional data; cube view

Support ad hoc querying

Require analysis of data

Can be thought of as an extension of some of the basic aggregation functions available in SQL

OLAP tools may be used in decision support systems (DSS)

Multidimensional view is fundamental

OLAP Implementations

MOLAP (Multidimensional OLAP)

– Multidimensional Database (MDD)

– Specialized DBMS and software system capable of supporting the multidimensional data directly

– Data stored as an n-dimensional array (cube)

– Indexes used to speed up processing

ROLAP (Relational OLAP)

– Data stored in a relational database

– ROLAP server (middleware) creates the multidimensional view for the user

– Less Complex; Less efficient

HOLAP (Hybrid OLAP)

– Data not updated frequently is stored in the MDD

– Data updated frequently is stored in the RDB
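
The difference between the MOLAP and ROLAP storage styles can be sketched as follows. This is a toy contrast under assumed data, not how any particular OLAP product is implemented: the MOLAP side keeps a dense array addressed through index maps, while the ROLAP side keeps rows in a relational table and builds the multidimensional view on demand.

```python
import sqlite3

# Hypothetical dimension members and sales facts.
products = ["Widget", "Gadget"]
cities = ["Dallas", "Houston"]
sales = [("Widget", "Dallas", 5), ("Widget", "Houston", 7), ("Gadget", "Dallas", 1)]

# MOLAP-style: a dense 2-dimensional array addressed through index maps.
product_index = {name: i for i, name in enumerate(products)}
city_index = {name: i for i, name in enumerate(cities)}
array = [[0] * len(cities) for _ in products]
for product, city, quantity in sales:
    array[product_index[product]][city_index[city]] += quantity
molap_cell = array[product_index["Widget"]][city_index["Houston"]]

# ROLAP-style: the same data in a relational table, with the
# multidimensional view computed by a GROUP BY query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Product TEXT, City TEXT, Quantity INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", sales)
rolap_view = {
    (product, city): total
    for product, city, total in conn.execute(
        "SELECT Product, City, SUM(Quantity) FROM Sales GROUP BY Product, City"
    )
}
conn.close()
print(molap_cell, rolap_view[("Widget", "Houston")])
```

Both lookups give the same answer; the array answers it with direct indexing, while the relational version pays for a query but needs no specialized storage.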

OLAP Operations

[Figure: OLAP operations on a cube, with panels Single Cell, Multiple Cells, Slice, Dice, Roll Up, Drill Down]

OLAP Operations

Simple query – single cell in the cube

Slice – Look at a subcube to get more specific information

Dice – Rotate cube to look at another dimension

Roll Up – Dimension Reduction; Aggregation

 Drill Down – Reverse of roll up; view more detailed data

 Visualization: These operations allow OLAP users to actually “see” the results of an operation.
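
A minimal sketch of three of these operations on a tiny cube stored as a Python dictionary. The dimensions (product, city, month) and the cell values are assumptions for illustration; dice is left out of the sketch.

```python
from collections import defaultdict

# Cube cells keyed by (product, city, month); values are quantities sold.
DIMENSIONS = ("product", "city", "month")
cube = {
    ("Widget", "Dallas", "Jan"): 5,
    ("Widget", "Dallas", "Feb"): 3,
    ("Widget", "Houston", "Jan"): 7,
    ("Gadget", "Dallas", "Jan"): 1,
}

def single_cell(cube, product, city, month):
    """Simple query: the value of one cell (0 if the cell is empty)."""
    return cube.get((product, city, month), 0)

def slice_cube(cube, dimension, value):
    """Slice: keep only the cells where one dimension has the given value."""
    axis = DIMENSIONS.index(dimension)
    return {key: qty for key, qty in cube.items() if key[axis] == value}

def roll_up(cube, dimension):
    """Roll up: aggregate away one dimension by summing over it."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, qty in cube.items():
        totals[key[:axis] + key[axis + 1:]] += qty
    return dict(totals)

cell = single_cell(cube, "Widget", "Dallas", "Jan")
dallas = slice_cube(cube, "city", "Dallas")
by_product_city = roll_up(cube, "month")
print(cell)
print(dallas)
print(by_product_city)
```

Drilling down from by_product_city corresponds to going back to the monthly cells of the original cube, for example splitting the Widget/Dallas total of 8 into 5 for Jan and 3 for Feb.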

Relationship Between Topics

Decision Support Systems

 Tools and computer systems that assist management in decision making

 What if types of questions

 High level decisions

 Data warehouse – data which supports DSS

Data Warehouse Links

OLAP

– http://www.olapreport.com/

General Data Warehousing

– http://www.inmoncif.com/home/

– http://www.datawarehouseconsulting.com/

– http://www.datawarehousing.com/

– http://www.dw-institute.com/

DW Products

– http://www-306.ibm.com/software/data/informix/redbrick/

– http://www.oracle.com/solutions/business_intelligence/dw_home.html

– http://www.sas.com/technologies/dw/index.html

– http://msdn2.microsoft.com/en-us/library/aa545535.aspx

– http://www.sybase.com/detail?id=1027323

Interesting Articles

– “Teaching Effective Methodologies to Design a Data Warehouse,” by Behrooz Seyed-Abbassi http://isedj.org/isecon/2001/35c/ISECON.2001.Seyed-Abbassi.pdf

– “An Oracle DBA’s Guide to the OLAP Option,” by Mark Rittman http://www.dbazine.com/datawarehouse/dw-articles/rittman1

DW&IR Outline

 Introduction

 Data Warehousing

 Research

– Bibliomining

– Hierarchical Multimedia IR

– Ontology-based OLAP & IR

 Summary

Bibliomining [2,3]

Data Warehousing + Data Mining + Libraries

Abstract, cleanse, summarize library data

– Documents

– Users (including demographics)

– Circulation Records (including Web server records)

Privacy of utmost importance http://www.bibliomining.com/nicholson/biblioprocess.htm

[2] http://bibliomining.com/nicholson/nicholsonbibliointro.html

[3]

Hierarchical Multimedia IR [4]

DW Approach to Multimedia IR

– Allows easier integration of multiple data types

– Facilitates indexing

– Facilitates searching

– Allows data to be stored at many different granularities and dimensions

– Data aggregation

“data warehouses are not just large databases; they are large, complex environments that integrate many technologies” [p729]

Multimedia starflake schema

– Denormalized star dimension table

– Normalized snowflake tables

Starflake

Fig 2 [4]

Hierarchy of Data Cubes

Fig 4 [4]

Ontology-Based OLAP & IR [5]

 Combine structured and document data obtained from Web

 Global Ontology

– Includes OLAP dimensions

– Contains resource metadata

– RDF based

 IR based on

– Both queries and resources represented as RDF metadata

– http://www.w3.org/RDF/
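
One way to picture retrieval when both queries and resources are described by metadata is to compare sets of property/value pairs. The sketch below is an assumption-laden toy, not the method of [5]: it uses plain Python sets rather than an RDF library, the `dim:` prefix and all resource names are hypothetical, and the similarity measure (fraction of query pairs matched) was chosen only for simplicity.

```python
# Each resource is described by (property, value) metadata pairs.
# All names and values below are hypothetical.
resources = {
    "report.pdf": {("dc:subject", "sales"), ("dim:location", "Dallas"),
                   ("dim:year", "2006")},
    "memo.txt": {("dc:subject", "sales"), ("dim:location", "Houston")},
    "notes.doc": {("dc:subject", "hiring")},
}

def similarity(query, metadata):
    """Fraction of the query's metadata pairs that the resource also has."""
    return len(query & metadata) / len(query) if query else 0.0

def search(query, resources):
    """Return (resource, similarity) pairs with similarity > 0, best first."""
    scored = [(name, similarity(query, meta)) for name, meta in resources.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# The query uses the same vocabulary as the resource descriptions,
# including a value taken from an OLAP-style dimension (location).
query = {("dc:subject", "sales"), ("dim:location", "Dallas")}
results = search(query, resources)
print(results)
```

Because the query shares its vocabulary with the resource metadata, a dimension value such as a location can serve as a search criterion for documents as well.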

Ontology OLAP&IR Architecture

Fig 1 [5]

OLAP Dimensions in RDF

Fig 2 [5]

RDF Query

Fig 6 [5]

DW&IR Outline

 Introduction

 Data Warehousing

 Research

 Summary

Summary

Information Retrieval is being extended to many different data types

– Multimedia

– Data warehouse

Data Warehousing is being extended beyond the basic business domain

Little research in combining DW and IR

“Integrating Unstructured Text into the Structured Environment: The Value Proposition,” by Bill Inmon

– http://www.inmondatasystems.com/whitepapers/integratingunstructured.pdf

Bibliography

[1] Anne-Muriel Arigon, Anne Tchounikine, and Maryvonne Miquel, “Handling Multiple Points of View in a Multimedia Data Warehouse,” ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, August 2006, pp. 199–218.

[2] S. Nicholson, “The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making,” Information Technology and Libraries, 22(4), 2003.

[3] S. Nicholson, “The Basis for Bibliomining: Frameworks for Bringing Together Usage-Based Data Mining and Bibliometrics through Data Warehousing in Digital Library Services,” Information Processing & Management, 42(3), May 2006, pp. 785–804.

[4] Jane You, Tharam Dillon, James Liu, and Edwige Pissaloux, “On Hierarchical Multimedia Information Retrieval,” Proceedings of the 2001 International Conference on Image Processing, 7-10 Oct 2001, pp. 729–732.

[5] Torsten Priebe and Gunther Pernul, “Ontology-based Integration of OLAP and Information Retrieval,” Proceedings of the 14th International Workshop on Database and Expert Systems Applications, 2003.
