Notes 6

advertisement
Introduction to Data Warehousing
CPS 196.03
Notes 6
Warehousing
Growing industry: $30+ billion industry
 Range from desktop to huge:

 Walmart:
900-CPU, 2,700 disk, 23TB
Teradata system (numbers from earlier part
of this decade)

Lots of buzzwords, hype
 slice
& dice, rollup, MOLAP, pivot, ...
2
Outline
What is a data warehouse?
 Why a warehouse?
 Models & operations
 Implementing a warehouse

3
What is a Warehouse?

Collection of diverse data
 subject
oriented
 aimed at executive, decision maker
 often a copy of operational data
 with value-added data (e.g., summaries, history)
 integrated
 time-varying
 non-volatile
more
4
What is a Warehouse?

Collection of tools
 gathering
data
 cleansing, integrating, ...
 querying, reporting, analysis
 data mining
 monitoring, administering warehouse
5
Warehouse Architecture
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
Source
Source
6
Motivating Examples
Forecasting
 Comparing performance of units
 Monitoring, detecting fraud
 Visualization

7
Why a Warehouse?

Two Approaches:
 Query-Driven
(Lazy)
 Warehouse (Eager)
?
Source
Source
8
Query-Driven Approach
Client
Client
Mediator
Wrapper
Source
Wrapper
Wrapper
Source
Source
9
Advantages of Warehousing
High query performance
 Queries not visible outside warehouse
 Local processing at sources unaffected
 Can operate when sources unavailable
 Can query data not stored in a DBMS
 Extra information at warehouse

 Modify,
summarize (store aggregates)
 Add historical information
10
Advantages of Query-Driven

No need to copy data
 less
storage
 no need to purchase data
More up-to-date data
 Query needs can be unknown
 Only query interface needed at sources
 May be less draining on sources

11
OLTP vs. OLAP

OLTP: On Line Transaction Processing
 Describes

processing at operational sites
OLAP: On Line Analytical Processing
 Describes
processing at warehouse
12
OLTP vs. OLAP
OLTP







Mostly updates
Many small transactions
Mb-Gb of data
Raw data
Clerical users
Up-to-date data
Consistency,
recoverability critical
OLAP





Mostly reads
Queries long, complex
Tb-Pb of data
Summarized,
consolidated data
Decision-makers,
analysts as users
13
Data Marts
Smaller warehouses
 Spans part of organization

 e.g.,

marketing (customers, products, sales)
Do not require enterprise-wide consensus
 but
long term integration problems?
14
Warehouse Models & Operators

Data Models
 relations
 stars
& snowflakes
 cubes

Operators
 slice
& dice
 roll-up, drill down
 pivoting
 other
15
Warehouse Models

Modeling data warehouses: dimensions, measures
 Star
schema: A fact table in the middle connected to a set
of dimension tables
 Snowflake
schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
 Fact
constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
16
Star
product
prodId
p1
p2
name price
bolt
10
nut
5
sale oderId date
o100 1/7/97
o102 2/7/97
105 3/8/97
store storeId
c1
c2
c3
custId
53
53
111
prodId
p1
p2
p1
storeId
c1
c1
c3
qty
1
2
5
city
nyc
sfo
la
amt
12
11
50
Measures
customer
custId
53
81
111
name
joe
fred
sally
address
10 main
12 main
80 willow
city
sfo
sfo
la
17
Star Schema
product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
18
Another Example of Star Schema
time
item
time_key
day
day_of_the_week
month
quarter
year
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
item_key
item_name
brand
type
supplier_type
location
location_key
street
city
state_or_province
country
Measures
19
Terms
Fact table
 Dimension tables
 Measures

product
prodId
name
price
sale
orderId
date
custId
prodId
storeId
qty
amt
customer
custId
name
address
city
store
storeId
city
20
Dimension Hierarchies
sType
store
store storeId
s5
s7
s9
city
cityId
sfo
sfo
la
tId
t1
t2
t1
mgr
joe
fred
nancy
 snowflake schema
 constellations
region
sType tId
t1
t2
size
small
large
city cityId pop
sfo
1M
la
5M
location
downtown
suburbs
regId
north
south
region regId
name
north cold region
south warm region
21
Example of Snowflake Schema
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
supplier
supplier_key
supplier_type
location
location_key
street
city_key
city
city_key
city
state_or_province
country
22
Example of Fact Constellation
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
item_name
brand
type
supplier_type
item_key
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
time_key
item_key
shipper_key
from_location
branch_key
branch
Shipping Fact Table
location
to_location
location_key
street
city
province_or_state
country
dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
23
shipper_type
Cube
Fact table view:
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
Multi-dimensional cube:
amt
12
11
50
8
p1
p2
c1
12
11
c2
c3
50
8
dimensions = 2
Recall counters in Apriori
24
3-D Cube
Fact table view:
sale
prodId
p1
p2
p1
p2
p1
p1
storeId
c1
c1
c3
c2
c1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
8
dimensions = 3
25
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
81
26
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
ans
date
1
2
sum
81
48
27
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
sale
prodId
p1
p2
p1
date
1
1
2
amt
62
19
48
rollup
drill-down
28
Aggregates
Operators: sum, count, max, min,
median, ave
 “Having” clause
 Using dimension hierarchy

 average
by region (within store)
 maximum by month (within date)
29
Types of Measures in Data Cubes

Distributive: if the result derived by applying the function to
n aggregate values is the same as that derived by applying
the function on all the data without partitioning


Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function


E.g., count(), sum(), min(), max()
E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.

E.g., median(), mode(), rank()
30
Cube Aggregation
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
Example: computing sums
...
c3
50
8
c2
4
8
rollup
drill-down
c3
50
sum
c1
67
c2
12
c3
50
129
p1
p2
sum
110
19
31
Cube Operators
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
...
c3
50
sale(c1,*,*)
8
c2
4
8
c3
50
sale(c2,p2,*)
sum
c1
67
c2
12
c3
50
129
p1
p2
sum
110
19
sale(*,*,*)
32
Extended Cube
c2
4
8
c312
p1
p2
c1
*
12
p1
p2
c1*
44
c1
56
11
c267
4
c2
44
c3
4
50
11
23
8
8
50
*
62
19
81
*
day 2
day 1
p1
p2
*
c3
50
* 50
48
48
*
110
19
129
sale(*,p2,*)
33
Cube Aggregates Lattice
129
all
c1
67
p1
c2
12
c3
50
city
city, product
p1
p2
c1
56
11
c2
4
8
product
city, date
date
product, date
c3
50
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
city, product, date
34
Dimension Hierarchies
all
cities
state
city
c1
c2
state
CA
NY
city
35
Dimension Hierarchies
all
city
city, product
product
city, date
city, product, date
date
product, date
state
state, date
state, product
state, product, date
not all arcs shown...
36
Interesting Hierarchy
time
all
years
weeks
quarters
months
day
1
2
3
4
5
6
7
8
week
1
1
1
1
1
1
1
2
month
1
1
1
1
1
1
1
1
quarter
1
1
1
1
1
1
1
1
year
2000
2000
2000
2000
2000
2000
2000
2000
conceptual
dimension table
days
37
Aggregation Using Hierarchies
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
customer
region
8
country
p1
p2
region A region B
56
54
11
8
(customer c1 in Region A;
customers c2, c3 in Region B)
38
Multidimensional Data

Sales volume as a function of product,
month, and region Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region
Year
Product
Category Country Quarter
Product
City
Office
Month Week
Day
Month
39
Typical OLAP Operations
2Qtr
3Qtr
4Qtr
sum
U.S.A
Canada
Mexico
Country
TV
PC
VCR
sum
1Qtr
Date
Total annual sales
of TV in U.S.A.
sum
40
Typical OLAP Operations





Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
41
Fig. 3.10 Typical OLAP
Operations
42
Pivoting
Fact table view:
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
Pivot turns unique values from
one column into unique columns
in the output
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
c3
50
8
c2
4
8
c3
50
43
Download