Uploaded by Sai Krishna

1a-DWH

Data Mining:
Concepts and Techniques
(3rd ed.)
Unit-I
Jiawei Han, Micheline Kamber, and Jian Pei
Unit-1 (a) : Data Warehousing
1. Basic Concepts
2. Data Warehouse Modeling
3. Data Warehouse Design and Usage
4. Data Generalization by Attribute-Oriented Induction
1. Basic Concepts
i. What is a DWH?
ii. Differences between OLTP and OLAP
iii. Why a Separate Data Warehouse?
iv. A Multi-tiered Architecture
v. DWH Models (Enterprise WH, Data Mart, Virtual WH)
vi. Extraction, Transformation and Loading
vii. Metadata Repository
(i) DWH Definitions

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from
the organization’s operational database.

Support information processing by providing a solid platform of
consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon

Data warehousing:

The process of constructing and using data warehouses
March 2, 2022
Unit-I: DWH and OLAP Technology
DWH Characteristics—Subject-Oriented

Organized around major subjects, such as customer,
product, sales.

Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.

Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
DWH Characteristics - Integrated


Constructed by integrating multiple, heterogeneous data sources
- Relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
- Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
- E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
DWH Characteristics — Time Variant

The time horizon for the data warehouse is significantly
longer than that of operational systems



Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse


Contains an element of time, explicitly or implicitly.
But the key of operational data may or may not
contain “time element”.
DWH Characteristics—Nonvolatile

A physically separate store of data transformed from the
operational environment

Operational update of data does not occur in the data warehouse environment
- Does not require transaction processing, recovery, and concurrency control mechanisms
- Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse Vs Heterogeneous DBMS

Traditional heterogeneous DB integration: A query driven approach

Build wrappers/mediators on top of heterogeneous databases

When a query is posed to a client site, a meta-dictionary is used
to translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into
a global answer set.


Complex information filtering, compete for resources.
Data warehouse: update-driven, High Performance

Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis.
ii. Data Warehouse Vs Operational DBMS

OLTP (on-line transaction processing)




Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making
Distinct features (OLTP Vs OLAP):

User and system orientation: Customer Vs Market

Data contents: Current, Detailed Vs Historical, Consolidated

Database design: ER + Application Vs Star + Subject

View: Current, Local Vs Evolutionary, Integrated

Access patterns: Update Vs Read-only but Complex Queries
DBMS, OLAP and Data Mining
Task
- DBMS: Extraction of detailed and summary data
- OLAP: Summaries, trends and forecasts
- Data Mining: Knowledge discovery of hidden patterns and insights
Type of Result
- DBMS: Information
- OLAP: Analysis
- Data Mining: Insight and Prediction
Method
- DBMS: Deduction (Ask the question, verify with data)
- OLAP: Multidimensional data modeling, Aggregation, Statistics
- Data Mining: Induction (Build the model, apply it to new data, get the result)
Example question
- DBMS: Who purchased mutual funds in the last 3 years?
- OLAP: What is the average income of mutual fund buyers by region by year?
- Data Mining: Who will buy a mutual fund in the next 6 months and why?
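The first two example questions can be sketched as queries. This is a minimal illustration only: the `purchases` table, its columns, and all of its rows are hypothetical, and an in-memory SQLite database stands in for both the operational DBMS and the warehouse. The data mining question is left out because it calls for a predictive model, not a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table of mutual fund purchases (not from the slides).
    CREATE TABLE purchases (buyer TEXT, region TEXT, year INTEGER, income REAL);
    INSERT INTO purchases VALUES
        ('Ana',  'East', 2020, 50000), ('Ben', 'East', 2020, 70000),
        ('Chen', 'West', 2020, 80000), ('Dia', 'West', 2021, 60000);
""")

# DBMS-style question: a deductive lookup that returns detailed records.
buyers = [row[0] for row in conn.execute(
    "SELECT buyer FROM purchases WHERE year >= 2020 ORDER BY buyer")]

# OLAP-style question: aggregate a measure along two dimensions (region, year).
avg_income = conn.execute(
    "SELECT region, year, AVG(income) FROM purchases "
    "GROUP BY region, year ORDER BY region, year").fetchall()

print(buyers)
print(avg_income)
```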
iii. Why Separate Data Warehouse?

High performance for both systems
- DBMS, tuned for OLTP: Access methods, Indexing, Concurrency control, Recovery
- Warehouse, tuned for OLAP: Complex OLAP queries, Multidimensional view, Consolidation
Different functions and different data:
- missing data: Decision support requires historical data which operational DBs do not typically maintain
- data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
- data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis directly on relational databases
iv. Architecture of a DWH
v. Three Data Warehouse Models


Enterprise warehouse
- collects all of the information about subjects spanning the entire organization
Data Mart
- a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
- Independent vs. dependent (sourced directly from the warehouse) data mart
Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be materialized
vi. Extraction, Transformation, and Loading (ETL)





Data Extraction
- get data from multiple, heterogeneous, and external sources
Data Cleaning
- detect errors in the data and rectify them when possible
Data Transformation
- convert data from legacy or host format to warehouse format
Load
- sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
- propagate the updates from the data sources to the warehouse
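The ETL stages above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the two source extracts, their field names, and their date and currency formats are all hypothetical, and the "load" step only sorts and summarizes in memory; it does not build real views, indices, or partitions.

```python
from collections import defaultdict

# Hypothetical extracts from two sources with different formats.
source_a = [
    {"sku": "tv-01", "amount_usd": "499.00", "sold_on": "2022-03-01"},
    {"sku": "tv-01", "amount_usd": "", "sold_on": "2022-03-01"},  # missing amount
]
source_b = [
    {"item": "TV-01", "amount_cents": 51000, "date": "02/03/2022"},  # DD/MM/YYYY
]

def extract():
    """Extraction: gather raw records from every source, tagged by origin."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]

def clean(tagged):
    """Cleaning: drop records whose amount is missing (not rectifiable here)."""
    return [(src, r) for src, r in tagged
            if r.get("amount_usd", r.get("amount_cents")) not in ("", None)]

def transform(tagged):
    """Transformation: convert each source format to one warehouse format."""
    rows = []
    for src, r in tagged:
        if src == "a":
            rows.append({"item": r["sku"].upper(),
                         "dollars_sold": float(r["amount_usd"]),
                         "day": r["sold_on"]})
        else:
            d, m, y = r["date"].split("/")
            rows.append({"item": r["item"].upper(),
                         "dollars_sold": r["amount_cents"] / 100,
                         "day": f"{y}-{m}-{d}"})
    return rows

def load(rows):
    """Load: summarize by (item, day) and sort the result."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["item"], row["day"])] += row["dollars_sold"]
    return dict(sorted(totals.items()))

warehouse = load(transform(clean(extract())))
print(warehouse)
```

A refresh step would rerun the same chain on only the records that changed since the last load.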
vii. Metadata Repository

Meta data is the data defining warehouse objects. It stores:
- Description of the structure of the data warehouse: schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
- Operational meta-data: data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
- The algorithms used for summarization
- The mapping from operational environment to the data warehouse
- Data related to system performance: warehouse schema, view and derived data definitions
- Business data: business terms and definitions, ownership of data, charging policies
Unit-1 (a) : Data Warehousing

Data Warehouse: Basic Concepts

Data Warehouse Modeling: Data Cube and OLAP

Data Warehouse Design and Usage

Data Generalization by Attribute-Oriented Induction

Summary
2. Data Warehouse Modeling
i. Data Cube: A Multidimensional Data Model
ii. Schemas for Multidimensional Data Models: Stars, Snowflakes, and Fact Constellations
iii. Dimensions: The Role of Concept Hierarchies
iv. Measures: Their Categorization and Computation
v. Typical OLAP Operations
vi. A Starnet Query Model for Querying Multidimensional Databases
i. Data Cube: A Multidimensional Data Model

A data warehouse is based on a multidimensional data model which
views data in the form of a data cube.

A data cube allows data to be modeled and viewed in multiple
dimensions (e.g., sales)

Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids forms
a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
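The lattice can be enumerated mechanically: each subset of the dimensions is one cuboid, so n dimensions give 2^n cuboids when no concept hierarchies are considered. A minimal sketch for the four dimensions in this example (the variable names are illustrative only):

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group every subset of the dimensions by its size k; each subset is one k-D cuboid.
lattice = {
    k: [", ".join(c) if c else "all" for c in combinations(dimensions, k)]
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D: {cuboids}")

# Total number of cuboids, from the apex (0-D) down to the base (4-D).
total_cuboids = sum(len(c) for c in lattice.values())
print(total_cuboids)
```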
Schemas for Multidimensional Data Models

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a
set of dimension tables

Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake

Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
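A star schema can be sketched with an in-memory SQLite database. The table names (`fact_sales`, `dim_item`, `dim_time`), the key columns, and every row are assumptions made for illustration; only the item and time attributes and the `dollars_sold` measure come from the data cube description earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables.
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                           quarter TEXT, year INTEGER);
    -- Fact table in the middle: one measure plus a key to each dimension table.
    CREATE TABLE fact_sales (
        item_key INTEGER REFERENCES dim_item(item_key),
        time_key INTEGER REFERENCES dim_time(time_key),
        dollars_sold REAL
    );
    INSERT INTO dim_item VALUES (1, 'TV', 'BrandA', 'home entertainment'),
                                (2, 'Laptop', 'BrandB', 'computer');
    INSERT INTO dim_time VALUES (1, 5, 1, 'Q1', 2022), (2, 9, 4, 'Q2', 2022);
    INSERT INTO fact_sales VALUES (1, 1, 500.0), (2, 1, 900.0), (1, 2, 450.0);
""")

# A typical star join: facts joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM fact_sales AS s
    JOIN dim_item AS i ON i.item_key = s.item_key
    JOIN dim_time AS t ON t.time_key = s.time_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)
```

A snowflake variant would split, for example, `brand` out of `dim_item` into its own table referenced by a key.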
iii. Dimensions: The Role of Concept Hierarchies
iv. Measures: Their Categorization and Computation

Distributive: if the result derived by applying the function to
n aggregate values is the same as that derived by applying
the function on all the data without partitioning
- E.g., count(), sum(), min(), max()

Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
- E.g., avg(), min_N(), standard_deviation()

Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
- E.g., median(), mode(), rank()
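The three categories can be checked on a small partitioned data set. The numbers below are arbitrary and chosen only to make the difference visible; this is a sketch, not a cube computation algorithm.

```python
from statistics import median

# Arbitrary data split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: sum() of the per-partition sums equals sum() over all the data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from two distributive values, sum() and count().
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: median() cannot in general be rebuilt from per-partition medians.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the median of the partition medians differs from the median of the full data, which is why holistic measures need access to the underlying values.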
v. Typical OLAP Operations

Roll up (drill-up): summarize data
- by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
- from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
- reorient the cube, visualization, 3D to series of 2D planes
Other operations
- drill across: involving (across) more than one fact table
- drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
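These operations can be sketched on a tiny cube held in a dictionary. The cell values, the city names, and the city-to-country hierarchy are hypothetical, and the function names are illustrative, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, city, item) -> dollars_sold.
cube = {
    ("Q1", "Vancouver", "TV"): 100, ("Q1", "Vancouver", "Phone"): 40,
    ("Q1", "Toronto", "TV"): 80,    ("Q2", "Vancouver", "TV"): 120,
    ("Q2", "Chicago", "Phone"): 60, ("Q2", "Toronto", "Phone"): 30,
}
# Hypothetical concept hierarchy for the location dimension: city -> country.
country_of = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), value in cells.items():
        out[(quarter, country_of[city], item)] += value
    return dict(out)

def roll_up_drop_item(cells):
    """Roll-up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (quarter, city, _item), value in cells.items():
        out[(quarter, city)] += value
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: select on one dimension, leaving a 2-D plane (city, item)."""
    return {(c, i): v for (q, c, i), v in cells.items() if q == quarter}

def dice(cells, quarters, cities):
    """Dice: select on two or more dimensions, keeping a subcube."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}

def pivot(plane):
    """Pivot: swap the two axes of a 2-D plane."""
    return {(b, a): v for (a, b), v in plane.items()}

by_country = roll_up_location(cube)
q1 = slice_quarter(cube, "Q1")
print(by_country)
print(pivot(q1))
```

Drill-down is the reverse direction: moving from `by_country` back to `cube` requires the detailed city-level cells, which is why the base cuboid has to be kept.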
OLAP Systems Vs Statistical Databases




A statistical database is a database system that is designed
to support statistical applications.
SDBs tend to focus on socioeconomic applications and OLAP
has been targeted for business applications.
Privacy issues regarding concept hierarchies are a major
concern for SDBs.
OLAP systems are designed for efficiently handling huge
amounts of data.
vi. A Starnet Query Model for Querying MDBs
3. Data Warehouse Design and Usage
i. A Business Analysis Framework for Data Warehouse Design
ii. Data Warehouse Design Process
iii. Data Warehouse Usage for Information Processing
iv. From Online Analytical Processing to Multidimensional Data Mining
i. A Business Analysis Framework

Four views regarding the design of a data warehouse

Top-down view
- allows selection of the relevant information necessary for the data warehouse
Data source view
- exposes the information being captured, stored, and managed by operational systems
Data warehouse view
- consists of fact tables and dimension tables
Business query view
- sees the perspectives of data in the warehouse from the view of the end-user
ii. Data Warehouse Design Process


Top-down, bottom-up approaches or a combination of both
- Top-down: Starts with overall design and planning (mature)
- Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
- Waterfall: structured and systematic analysis at each step before proceeding to the next
- Spiral: rapid generation of increasingly functional systems, with short turnaround time
Typical data warehouse design process
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measure that will populate each fact table record
iii. Data Warehouse Usage

Three kinds of data warehouse applications

Information processing
- supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
- multidimensional analysis of data warehouse data
- supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
- knowledge discovery from hidden patterns
- supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
iv. From On-Line Analytical Processing to On-Line Analytical Mining

Why online analytical mining?
- High quality of data in data warehouses
  - DW contains integrated, consistent, cleaned data
- Available information processing structure surrounding data warehouses
  - ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
- OLAP-based exploratory data analysis
  - Mining with drilling, dicing, pivoting, etc.
- On-line selection of data mining functions
  - Integration and swapping of multiple mining functions, algorithms, and tasks
4. Data Generalization by Attribute-Oriented Induction
i. Attribute-Oriented Induction for Data Characterization
ii. Efficient Implementation of Attribute-Oriented Induction
iii. Attribute-Oriented Induction for Class Comparisons
Introduction to Data Generalization


The data cube can be viewed as a kind of multidimensional
data generalization.
Data generalization summarizes data by replacing relatively
low-level values with higher-level concepts
- Eg: Age -> Young, Middle-aged and Senior
or by reducing the number of dimensions to summarize data in
concept space involving fewer dimensions
- Eg: Removing birth_date and telephone_no when summarizing the behavior of a group of students
Introduction to Concept Description

Allowing data sets to be generalized at multiple levels of
abstraction facilitates users in examining the general
behavior of the data.
Eg: Given the AllElectronics database, instead of examining
individual customer transactions, sales managers may prefer
to view the data generalized to higher levels, such as
summarized by customer groups according to geographic
regions, frequency of purchases per group, and customer
income.
Characterization and Discrimination




Concept description is a form of data generalization.
It generates descriptions for data characterization and
comparison.
Characterization provides a concise and succinct
summarization of the given data collection.
Concept or class comparison (discrimination) provides
descriptions comparing two or more data collections.
i. Attribute-Oriented Induction

Proposed in 1989, before data cube approach.

Process steps:
i. Collect the task-relevant data (initial relation) using a relational database query
ii. Perform generalization by attribute removal or attribute generalization
iii. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
iv. Interaction with users for knowledge presentation
An Example

Query: Describe general characteristics of graduate
students in the Big_University Database.

Given the attributes:
- name
- gender
- major
- birth_place
- birth_date
- residence
- phone#
- gpa (grade_point_average)
Steps in AOI

Step 1: Fetch relevant set of data using an SQL
statement.

Step 2: Perform attribute-oriented induction

Step 3: Present results in generalized relation,
cross-tab, or rule forms
DMQL for characterization
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
RDBMS Query for characterization
use Big_University_DB
select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“M.Sc.”, “M.A.”, “M.B.A.”, “Ph.D.”}
(Task-relevant) Initial Working Relation
Step 2: Perform attribute-oriented induction


The essential operation of attribute-oriented induction is data
generalization, which can be performed in two ways on the initial
working relation: attribute removal and attribute generalization.
Rule for attribute removal:
- If there is a large set of distinct values for an attribute of the initial working relation, but either
  - case 1: there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute; example: name), or
  - case 2: its higher-level concepts are expressed in terms of other attributes (example: street -> city, state, country),
  then the attribute should be removed from the working relation.
Step 2: Perform attribute-oriented induction

Rule for Attribute generalization



If there is a large set of distinct values for an attribute in the
initial working relation, and there exists a set of generalization
operators on the attribute, then a generalization operator
should be selected and applied to the attribute. (DOB -> Age)
Attribute generalization control: If the number of distinct values
in an attribute is greater than the attribute threshold, further
attribute removal or attribute generalization should be performed
Problems
- Over-generalization and under-generalization
Step 2: Perform attribute-oriented induction

Generalized relation threshold control

If the number of (distinct) tuples in the generalized relation is
greater than the threshold, further generalization should be
performed. Otherwise, no further generalization should be
performed.
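The removal and generalization rules, together with attribute threshold control, can be sketched as follows. This is a simplified single-pass illustration, not the textbook algorithm: the four-row relation, the one-level generalization operators, and the threshold value of 2 are all assumptions, and the generalized relation threshold is not enforced.

```python
from collections import Counter

# Hypothetical initial working relation (a tiny stand-in for the student data).
initial_relation = [
    {"name": "Jim",   "major": "CS",      "birth_place": "Vancouver", "gpa": 3.67},
    {"name": "Scott", "major": "CS",      "birth_place": "Montreal",  "gpa": 3.70},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle",   "gpa": 3.83},
    {"name": "Ann",   "major": "Biology", "birth_place": "Toronto",   "gpa": 3.55},
]

# Hypothetical generalization operators (one-level concept hierarchies).
# "name" has no operator, so it is a candidate for attribute removal.
generalize = {
    "major": lambda m: "Science",  # every major in this toy data is a science
    "birth_place": lambda p: "Canada" if p in {"Vancouver", "Montreal", "Toronto"} else "Foreign",
    "gpa": lambda g: "Excellent" if g >= 3.75 else "Very-good",
}
ATTRIBUTE_THRESHOLD = 2  # assumed value for this sketch

def attribute_oriented_induction(relation, operators, threshold):
    attributes = list(relation[0])
    rows = [dict(r) for r in relation]
    for attr in attributes:
        distinct = {r[attr] for r in rows}
        if len(distinct) <= threshold:
            continue                      # few enough distinct values: keep as is
        if attr in operators:             # attribute generalization
            for r in rows:
                r[attr] = operators[attr](r[attr])
        else:                             # attribute removal
            for r in rows:
                del r[attr]
    # Merge identical generalized tuples and accumulate their counts.
    return Counter(tuple(sorted(r.items())) for r in rows)

prime_relation = attribute_oriented_induction(initial_relation, generalize, ATTRIBUTE_THRESHOLD)
for generalized_tuple, count in prime_relation.items():
    print(dict(generalized_tuple), count)
```

A fuller version would keep climbing each concept hierarchy until the attribute threshold is met, then check the number of distinct generalized tuples against the relation threshold.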
Class Characterization: An Example
Initial Relation
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …
Removed | Retained | Sci, Eng, Bus | Country | Age range | City | Removed | Excl, VG, ..

Prime Generalized Relation
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …

Cross-tab of Gender by Birth_Region
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
Basic Principles of Attribute-Oriented Induction





Data focusing: task-relevant data, including dimensions, and the
result is the initial relation
Attribute-removal: remove attribute A if there is a large set of
distinct values for A but (1) there is no generalization operator on
A, or (2) A’s higher level concepts are expressed in terms of other
attributes
Attribute-generalization: If there is a large set of distinct values
for A, and there exists a set of generalization operators on A,
then select an operator and generalize A
Attribute-threshold control: typical 2-8, specified/default
Generalized relation threshold control: control the final
relation/rule size
Step3: Presentation of Generalized Results

Generalized relation:
- Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
Cross tabulation:
- Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
- Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
- Mapping generalized result into characteristic rules with quantitative information associated with it, e.g.,

$$\mathit{grad}(x) \wedge \mathit{male}(x) \Rightarrow \mathit{birth\_region}(x) = \text{``Canada''}\ [t: 53\%] \ \vee\ \mathit{birth\_region}(x) = \text{``foreign''}\ [t: 47\%]$$
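The t-weights in the rule for male graduate students follow from the gender by birth_region counts in the class characterization example (16 Canada, 14 foreign, 30 in total). A minimal sketch of that arithmetic, with the helper name `t_weights` chosen only for illustration:

```python
# Counts taken from the gender x birth_region cross-tab in the example.
crosstab = {
    "M": {"Canada": 16, "Foreign": 14},
    "F": {"Canada": 10, "Foreign": 22},
}

def t_weights(row):
    """Return each cell's share of its row total, as a percentage."""
    total = sum(row.values())
    return {region: 100 * count / total for region, count in row.items()}

male = t_weights(crosstab["M"])
print({region: round(w) for region, w in male.items()})
```

Rounded to whole percentages, 16/30 and 14/30 give the 53% and 47% that appear in the rule.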
Attribute-Oriented Induction: Basic Algorithm




InitialRel: Query processing of task-relevant data, deriving
the initial relation.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
Mining Class Comparisons

Comparison: Comparing two or more classes

Method:
- Partition the set of relevant data into the target class and the contrasting class(es)
- Generalize both classes to the same high level concepts
- Compare tuples with the same high level descriptions
- Present for every tuple its description and two measures
  - support: distribution within single class
  - comparison: distribution between classes
- Highlight the tuples with strong discriminant features
Relevance Analysis:
- Find attributes (features) which best distinguish different classes
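The two measures can be sketched for a target class and one contrasting class. The generalized tuples and their counts below are hypothetical, and the formulas are one plausible reading of the slide: support as the tuple's share within the target class, comparison as the target class's share of that tuple across both classes.

```python
# Hypothetical counts of generalized tuples (major, age_range) per class.
target = {("Science", "20-25"): 30, ("Science", "26-30"): 10}
contrasting = {("Science", "20-25"): 90, ("Science", "26-30"): 10}

def compare(target_counts, contrast_counts):
    """For each generalized tuple, compute support and comparison percentages."""
    target_total = sum(target_counts.values())
    result = {}
    for desc, count in target_counts.items():
        both = count + contrast_counts.get(desc, 0)
        result[desc] = {
            "support": 100 * count / target_total,  # distribution within the target class
            "comparison": 100 * count / both,       # distribution between the classes
        }
    return result

measures = compare(target, contrasting)
for desc, m in measures.items():
    print(desc, m)
```

In this toy data the 26-30 tuple has low support but a higher comparison value, which makes it the more discriminating description of the target class.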
Summary

Data warehousing: A multi-dimensional model of a data warehouse
- A data cube consists of dimensions & measures
- Star schema, snowflake schema, fact constellations
- OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data Warehouse Architecture, Design, and Usage
- Multi-tiered architecture
- Business analysis design framework
- Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
Implementation: Efficient computation of data cubes
- Partial vs. full vs. no materialization
- Indexing OLAP data: Bitmap index and join index
- OLAP query processing
- OLAP servers: ROLAP, MOLAP, HOLAP
Data generalization: Attribute-oriented induction
Unit-1 (a) : Data Warehousing

Data Warehouse: Basic Concepts
(a) What Is a Data Warehouse?
(b) Data Warehouse: A Multi-Tiered Architecture
(c) Three Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
(d) Extraction, Transformation and Loading
(e) Metadata Repository

Data Warehouse Modeling: Data Cube and OLAP
(a) Cube: A Lattice of Cuboids
(b) Conceptual Modeling of Data Warehouses
(c) Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
(d) Dimensions: The Role of Concept Hierarchy
(e) Measures: Their Categorization and Computation
(f) Cube Definitions in Database systems
(g) Typical OLAP Operations
(h) A Starnet Query Model for Querying Multidimensional Databases

Data Warehouse Design and Usage
(a) Data Warehouses Design Processes
(b) Data Warehouse Usage
(c) From OLAP to OLAM
(d) Design of Data Warehouses: A Business Analysis Framework
(e) Analytical Processing to On-Line Analytical Mining

Data Generalization by Attribute-Oriented Induction
(a) Attribute-Oriented Induction for Data Characterization
(b) Efficient Implementation of Attribute-Oriented Induction
(c) Attribute-Oriented Induction for Class Comparisons
(d) Attribute-Oriented Induction vs. Cube-Based OLAP