PPT

advertisement
Data Warehouses and
OLAP
*Slides by Nikos Mamoulis
Multi-Tiered Architecture
other
Metadata
sources
Operational
DBs
Extract
Transform
Load
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
OLAP Server Architectures

Relational OLAP (ROLAP)




Multidimensional OLAP (MOLAP)



Array-based multidimensional storage engine (sparse matrix
techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)


Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing
pieces
Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
greater scalability
User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers

specialized support for SQL queries over star/snowflake
schemas
Data Warehousing and OLAP
Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids

The bottom-most cuboid is the base cuboid

The top-most cuboid (apex) contains only one cell

How many cuboids in an n-dimensional cube with L
levels?
n
T   ( Li 1)
i 1

Materialization of data cube

Materialize every (cuboid) (full materialization), none (no
materialization), or some (partial materialization)

Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.
Cube Operations

Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales

Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
()
FROM SALES
CUBE BY item, city, year

(city)
(item)
(year)
Need compute the following Group-Bys
(date, product, customer),
(date,product),(date, customer),
(product,
customer),
(city,
item)
(city, year)
(item, year)
(date), (product), (customer)
()
(city, item, year)
Cube Computation: ROLAP-Based
Method

Efficient cube computation methods




ROLAP-based cubing algorithms (Agarwal et al’96)
Array-based cubing algorithm (Zhao et al’97)
Bottom-up computation method (Bayer & Ramarkrishnan’99)
ROLAP-based cubing algorithms

Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related
tuples

Grouping is performed on some sub-aggregates as a “partial
grouping step”

Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
Multi-way Array Aggregation for
Cube Computation

Partition arrays into chunks (a small subcube which fits in
memory).

Compressed sparse array addressing: (chunk_id, offset)

Compute aggregates in “multiway” by visiting cube cells in
the order which minimizes the # of times to visit each cell,
and reduces memory access and storage cost.
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
b3
B13
b2
9
b1
5
b0
14
15
16
1
2
3
4
a0
a1
a2
a3
A
60
44
28 56
40
24 52
36
20
What is the best
traversing order
to do multi-way
aggregation?
Multi-way Array Aggregation
for Cube Computation
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
b3
B
b2
B13
14
15
60
16
44
28
9
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
20
52
Multi-way Array Aggregation
for Cube Computation
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
b3
B
b2
B13
14
15
60
16
44
28
9
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
20
52
Multi-Way Array Aggregation for
Cube Computation (Cont.)

Method: the planes should be sorted and
computed according to their size in ascending
order.



See the details of Example 4.4
Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
Limitation of the method: works well only for a
small number of dimensions

If there are a large number of dimensions, “bottom-up
computation” and iceberg cube computation methods
can be explored
Indexing OLAP Data: Bitmap Index





Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base
table
The i-th bit is set if the i-th row of the base table has the
value for the indexed column
not suitable for high cardinality domains
Base table
Cust
C1
C2
C3
C4
C5
Region
Asia
Europe
Asia
America
Europe
Index on Region
Index on Type
Type RecIDAsia Europe America RecID Retail Dealer
Retail
1
1
0
1
1
0
0
Dealer 2
2
0
1
0
1
0
Dealer 3
1
0
0
3
0
1
4
0
0
1
4
1
0
Retail
0
1
0
5
0
1
Dealer 5
Indexing OLAP Data: Join Indices


Join index: JI(R-id, S-id) where R (R-id, …)
 S (S-id, …)
Traditional indices map the values to a list
of record ids


It materializes relational join in JI file and
speeds up relational join — a rather costly
operation
In data warehouses, join index relates the
values of the dimensions of a start schema
to rows in the fact table.


E.g. fact table: Sales and two dimensions
city and product
 A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
Join indices can span multiple dimensions
Efficient Processing OLAP Queries

Determine which operations should be performed
on the available cuboids:

transform drill, roll, etc. into corresponding SQL and/or
OLAP operations, e.g, dice = selection + projection

Determine to which materialized cuboid(s) the
relevant operations should be applied.

Exploring indexing structures and compressed vs.
dense array structures in MOLAP
Metadata Repository

Meta data is the data defining warehouse objects. It has the
following kinds

Description of the structure of the warehouse


Operational meta-data




data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data
warehouse
Data related to system performance


schema, view, dimensions, hierarchies, derived data defn, data mart
locations and contents
warehouse schema, view and derived data definitions
Business data

business terms and definitions, ownership of data, charging policies
Data Warehouse Back-End Tools and
Utilities





Data extraction:
 get data from multiple, heterogeneous, and external
sources
Data cleaning:
 detect errors in the data and rectify them when possible
Data transformation:
 convert data from legacy or host format to warehouse
format
Load:
 sort, summarize, consolidate, compute views, check
integrity, and build indicies and partitions
Refresh
 propagate the updates from the data sources to the
warehouse
Data Warehousing and OLAP
Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
Discovery-Driven Exploration of
Data Cubes

Hypothesis-driven: exploration by user, huge search space

Discovery-driven (Sarawagi et al.’98)

pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation

Exception: significantly different from the value anticipated,
based on a statistical model

Visual cues such as background color are used to reflect the
degree of exception of each cell

Computation of exception indicator (modeling fitting and
computing SelfExp, InExp, and PathExp values) can be
overlapped with cube construction
Examples: Discovery-Driven Data Cubes
Data Warehousing and OLAP
Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
Data Warehouse Usage

Three kinds of data warehouse applications

Information processing


Analytical processing



multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling,
pivoting
Data mining



supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools.
Differences among the three tasks
From On-Line Analytical Processing
to On Line Analytical Mining (OLAM)

Why online analytical mining?





High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
 ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
OLAP-based exploratory data analysis
 mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
 integration and swapping of multiple mining functions,
algorithms, and tasks.
Architecture of OLAM
An OLAM Architecture
Mining query
Mining result
Layer4
User Interface
User GUI API
OLAM
Engine
OLAP
Engine
Layer3
OLAP/OLAM
Data Cube API
Layer2
MDDB
MDDB
Meta Data
Filtering&Integration
Database API
Filtering
Layer1
Data cleaning
Databases
Data
Data integration Warehouse
Data
Repository
Summary

Data warehouse





A multi-dimensional model of a data warehouse

Star schema, snowflake schema, fact constellations

A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes




A subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process
Partial vs. full vs. no materialization
Multiway array aggregation
Bitmap index and join index implementations
Further development of data cube technology


Discovery-drive and multi-feature cubes
From OLAP to OLAM (on-line analytical mining)
References (I)

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,
and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996
Int. Conf. Very Large Data Bases, 506-521, Bombay, India, Sept. 1996.

D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, 417-427,
Tucson, Arizona, May 1997.

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering
of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int.
Conf. Management of Data, 94-105, Seattle, Washington, June 1998.

R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc.
1997 Int. Conf. Data Engineering, 232-243, Birmingham, England, April 1997.

K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.
In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), 359-370,
Philadelphia, PA, June 1999.

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology.
ACM SIGMOD Record, 26:65-74, 1997.

OLAP council. MDAPI specification version 2.0. In
http://www.olapcouncil.org/research/apily.htm, 1998.

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow,
and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by,
cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
References (II)

V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal,
Canada, June 1996.

Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998.

K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int.
Conf. Very Large Data Bases, 116-125, Athens, Greece, Aug. 1997.

K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple
granularities. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), 263277, Valencia, Spain, March 1998.

S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data
cubes. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), pages 168182, Valencia, Spain, March 1998.

E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John
Wiley & Sons, 1997.

Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for
simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf.
Management of Data, 159-170, Tucson, Arizona, May 1997.
Selection of tables, attributes,
domains in the DW design process

If you are asked to design a data
warehouse for a set of operational
databases how would you do it?

Use specifications of requirements to design
the data warehouse schema and select:



The central theme(s) of the analysis (e.g., sales)
The measures on the central themes (e.g.,
sum(dollars))

The dimensions used by analytical processing

The attributes and hierarchies of the dimensions
Clean, transform, and integrate information
Example

A large company which sells engine parts


Database 1 (Los Angeles)

employee(id, name, dept, lot, salary, age)

department(id, name, type, manager_id)

part(id, name, type, brand, manufacturer, color)

customer(id, name, type, age, city, state, zip, tel)

sales(id, part_id, customer_id, quantity, price)
Database 2 (New York)

employee(id, ename, dept_id, salary, age)

department(id, name, type, manager)

part(id, title, type, brand, manufacturer, color)

customer(id, name, type, zip, tel)

location(zip, city_id)

city(city_id, state, country)

sales(id, part_id, customer_id, quantity, price)
Example: requirements and selection

Requirements of the data warehouse


we want to analyze the total sales in dollars
and the average price of sold units with respect
to time, part, customer.
Selection of the basic features of the
warehouse

central theme(s): sales

measure(s): sum(sales_in_dollars),
avg(price_sold_units)

dimensions: time, part, customer
Example: selection of hierarchies

To determine the dimension hierarchies we
have to select which dimensional attributes
are required to include for analysis

We go back to the requirements and ask
the analyst:

time: day, week, month, quarter, year

part: name, type, color, brand, manufacturer

customer: name, type, city, state, country
Exercise

Find the hierarchies for time, part,
customer
•time: day, week, month, quarter, year
•part: name, type, color, brand,
manufacturer
•customer: name, type, city, state, country
Example: selection of hierarchies

Definition of the hierarchies:
year
manufacturer
state
quarter
month
country
week type color brand
type city
day
name
name
time
part
customer
Example: is the information that we
need to analyze available?

We have to check if the required
information for analysis exists in the
databases to be integrated

All requested attributes exist, except from the
time. This can be determined by accessing the
transaction logs of the databases.
Example: Design the DW schema

We use the star schema in this example
Time
time_id
day
week
month
quarter
year
Part
part_id
name
type
color
brand
manufacturer
Fact table
time_id
part_id
cust_id
quantity
price
Customer
cust_id
name
type
city
state
country
We need just quantity and (unit) price to derive
sum(sales_in_dollars) = quantity*price
avg(price_sold_units) =
Σ(quantity*price)/ Σ(quantity)
Integration tasks

A large company which sells engine parts


Database 1 (Los Angeles)

employee(id, name, dept, lot, salary, age)

department(id, name, type, manager_id)

part(id, name, type, brand, manufacturer, color)

customer(id, name, type, age, city, state, zip, tel)

sales(id, part_id, customer_id, quantity, price)
Database 2 (New York)

employee(id, ename, dept_id, salary, age)

department(id, name, type, manager)

part(id, title, type, brand, manufacturer, color)

customer(id, name, type, zip, tel)

location(zip, city_id)

city(city_id, state, country)

sales(id, part_id, customer_id, quantity, price)
convert:
•attribute names
•attribute types
join tables
derive data not
stored explicitly
in the databases
fill in missing
values
ignore irrelevant data:
•tables
•attributes
Example: how to populate the DW?
Load, Clean, Integrate

Convert attribute names and types

E.g., part.name = part.title

Convert values to be consistent

Join tables if necessary


Derive time if not present


Check transaction log for sales table to get the time and convert it to
the required format
Complete missing values


Join customer,location,city from “New York” DB
The “Los Angeles” database does not record customer country
information because all its customers are from US. In the integrated
data from LA country value is set to “USA”
Ignore irrelevant tables and attributes

Tables employee, department are ignored

Attributes zip, tel, id are ignored.
How many cuboids ?

How do we compute the total number of
cuboids of a data cube?

Compute the product of the number of
levels for each dimension

Number of cuboids =
(levels for time)*
(levels for part)*
(levels for customer) =
5*4*5 = 100
What is the data cube?

The data cube is NOT a cube

Multiple dimensions, variable range and interpretations of
the cells, does not look always like a “cube”

The data cube is the set of all non-redundand,
multidimensional views from which we can
analyze the measures on the central theme(s)

Remember: the terms multidimensional view and
cuboid have the same meaning

A multidimensional view is non-redundant, if
there are no hierarchical relationships between its
dimensional attributes
Redundancy in views
A non-redundant combination of
attributes:

(month, part_type)
A redundant (not useful)
combination of attributes:

(part_brand, part_manufacturer)
part_type

month
manufacturer

23 12 23 11 5067
10 20 43 33 13 22
58 18 72 25 30 25
23 12 0 40 9012
23 45 43 33 13 1
2 0 56 23 6 25
part_brand
23 12 0 0 0 0
0 0 43 33 13 0
0 0 0 0 0 25
Exercise

Find the non-redundant combinations of
the attributes of part.
manufacturer
type color brand
name
part
Visualization of non-redundant attribute
combinations for part
ALL
manufacturer
type
color
(type,color)
(color,brand)
brand
(type, manufacturer) (color, manufacturer)
(type,brand) (type,color,manufacturer)
(type,color,brand)
name
part
How many (non-redundant) cuboids ?

How do we compute the total number of cuboids of a data
cube?

Compute the product of the number of combinations for
each dimension

Number of cuboids =
(combinations for time)*
(combinations for part)*
(combinations for customer) =
6*13*9 = 702

Note: sometimes we use the term cuboid to also denote
multidimensional views with selections:


total sales for each (part.type,customer.city) for year = “2001”
We do not count such cuboids in the computation above
Cube: A Lattice of Cuboids
E.g., (location) is dependent on
(time,item,location)
all
time
time,item
item
time,location
location
item,location
time,supplier
time,item,location
supplier
location,supplier
item,supplier
time,location,supplier
time,item,supplier
item,location,supplier
time, item, location, supplier
How to answer queries from a set of
materialized cuboids?

In real-life examples it is not possible to materialize the
whole data cube

Typically, a small subset of cuboids is materialized

To answer a query we have to select the cuboid that results
in the minimum cost for the query

A query typically consists of


a set of group-by attributes

a set of selection clauses

E.g., compute the total sales per part.type, cust.city for
year=2002.
The factors for selecting the best cuboid for a particular
query are:

the size of the cuboid

any indexes on the attributes of the “select” clause of the query
Cube: A Lattice of Cuboids
= materialized
cuboid
time
time,item
all
item
time,location
location
item,location
time,supplier
time,item,location
supplier
How would you compute the
following queries?
Q1: <time,item>
Q2: <supplier>
Q3: <location>
Q4: <time,supplier>
Q5: <item>
location,supplier
item,supplier
time,location,supplier
time,item,supplier
item,location,supplier
time, item, location, supplier
Which cuboids should we materialize?

In real-life examples it is not possible to materialize the
whole data cube

We have to select the most beneficial cuboids to materialize

This depends mainly on the size of the cuboids and their
usage by queries

Thus to select we need information about (i) the size of
cuboids, (ii) the queries and their frequency

The base cuboid almost always corresponds to the fact
table, which is already materialized. For example, if the
products have unique name, and customers unique name,
we can use:

time_id to derive the day

part_id to derive part.name

cust_id to derive customer.name
Which cuboids should we materialize?

Example


candidate cuboids:

(day,pname,cname) – 100GB (already materialized)

(day,pname) – 60GB

(day,cname) – 20GB

(pname,cname) – 1GB

(day) – 10GB

(pname) – 200MB

(cname) – 30MB

(ALL) – 8 bytes
queries (with equal probability)

Q1: total sales per (pname,cname)

Q2: total sales per (pname)

Q3: total sales per (cname)
Exercise: Which views
should we materialize if
the available space is:
1. 10GB
2. 1GB
3. 100MB
Which cuboids should we materialize?

Case 1: Available space = 10GB

We can materialize all three views
(pname,cname) ,
(pname), and (cname)

The cost of Q1 is reading 1GB

The cost of Q2 is reading 200MB

The cost of Q3 is reading 30MB

Average query cost = (1230MB)/3=410 MB/query
Which cuboids should we materialize?

Case 2: Available space = 1GB

We have two choices:



materialize (pname,cname) using 1GB

Q1 costs 1GB, Q2 costs 1GB, Q3 costs 1GB

Average query cost = 1GB
materialize (pname) and (cname) using 230MB

Q1 costs 100GB, Q2 costs 200MB, Q3 costs 30MB

Average query cost = (100,230MB)/3 = 34GB
First choice is better than the second!
Which cuboids should we materialize?

Case 3: Available space = 100MB

We can only materialize (cname)

Q1 costs 100GB

Q2 costs 100GB

Q3 costs 30MB

Average query cost = (200,030MB)/3 = 67GB
Bitmap Indexes

The bitmap index is used to index attributes with
small domains

For each attribute value, a bitmap is defined to
indicate the rows of the table that contain this
value

Bitmaps are useful especially when we want to
join some attribute values

Example: find the total sales of red parts to
female customers
Bitmap Indexes - Example
100 bytes
date
10/10/00
12/10/00
15/10/00
20/10/00
22/11/00
26/11/00
...
pcolor
red
green
green
blue
red
red
...
cname
Smith
Jones
Kane
Nike
Ellis
Jones
...
index for pcolor
red
1
0
0
0
1
1
...
green
0
1
1
0
0
0
...
blue
0
0
0
1
0
0
...
cgender
M
F
M
M
F
F
...
sales
21
13
14
23
9
92
...
index for gender
M
1
0
1
1
0
0
...
F
0
1
0
0
1
1
...
1 billion rows
Exercise:
1. what are the sizes of
the table and indexes
2. what is the cost of the
query: “find the total sales of
red parts to female
customers” if:
a) the table is used
b) the indexes are used
Bitmap Indexes - Example

The size of the table is 100GB=100bytes*1billion

The size of the indexes are:

3bits*1billion = 3000Mbits = 375MB

2bits*1billion = 2000Mbits = 250MB

The cost of evaluating the query directly on the table is reading
100GB and for each of the 1 billion tuples perform a comparison
with “red” and “F” (expensive)

The cost of evaluating the query using the indexes is:

read bitmap for red, read bitmap for F and join them.


This costs reading 1Gbits+1Gbits = 250MB
for each join result accumulate the sales for the corresponding rid


this retrieves (1/3)*(1/2) = 1/6 tuples (estimated) and accumulates
them
we will probably read the whole table (since we want to avoid random
accesses), but we will avoid any comparisons.
Download