Data Warehousing and OLAP

Data Warehousing and OLAP:
MOLAP and ROLAP
dr. Toon Calders
t.calders@tue.nl
Previous Lectures
• Online analytical processing
– Data cubes as a conceptual model
– Query languages for data cubes
• Database explosion problem
– Materializing the complete cube is usually infeasible
– Partially materializing can help
This Lecture
• How is the data stored?
– relational database (ROLAP)
– Specialized structures (MOLAP)
• How can we speed up computation?
– Indexing structures
• bitmap index
• join index
Implementation
Current systems can be divided into three
categories:
– ROLAP (Relational OLAP)
• OLAP supported on top of a relational database
– MOLAP (Multi-Dimensional OLAP)
• Use of special multi-dimensional data structures
– HOLAP (Hybrid OLAP)
• combination of the previous two
ROLAP
• Typical database scheme:
– star schema
• fact table is central
• links to dimensional tables
– Extensions:
• snowflake schema
– dimensions have hierarchy/extra information attached
• Star constellation
– multiple star schemas sharing dimensions
Example of a Star Schema
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Dimension tables:
– Order: Order No, Order Date
– Customer: Customer No, Customer Name, Customer Address, City
– Salesperson: SalespersonID, SalespersonName, City, Quota
– Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
– Date: DateKey, Date
– City: CityName, State, Country
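A minimal sketch of how a ROLAP system might answer an analytical query on such a star schema: the fact table is joined to two dimension tables and the measures are aggregated. It uses Python's built-in sqlite3 module. The table and column names follow the example, only a subset of the columns is modelled, and all rows are made up for illustration.

```python
import sqlite3

# In-memory database holding a reduced version of the star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ProductNO TEXT PRIMARY KEY, ProdName TEXT, Category TEXT);
    CREATE TABLE City (CityName TEXT PRIMARY KEY, State TEXT, Country TEXT);
    CREATE TABLE Fact (ProdNo TEXT, CityName TEXT, Quantity INTEGER, TotalPrice REAL);
""")

# Illustrative rows (not taken from the slides).
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)", [
    ("p1", "bolt", "hardware"),
    ("p2", "nut", "hardware"),
    ("p3", "glue", "chemicals"),
])
conn.executemany("INSERT INTO City VALUES (?, ?, ?)", [
    ("Detroit", "MI", "USA"),
    ("Paris", "IDF", "France"),
])
conn.executemany("INSERT INTO Fact VALUES (?, ?, ?, ?)", [
    ("p1", "Detroit", 2, 20.0),
    ("p2", "Detroit", 3, 15.0),
    ("p1", "Paris", 1, 10.0),
    ("p3", "Paris", 4, 12.0),
])

# Star join: fact table joined with two dimensions, grouped at a coarser level.
rows = conn.execute("""
    SELECT p.Category, c.Country, SUM(f.Quantity), SUM(f.TotalPrice)
    FROM Fact f
    JOIN Product p ON f.ProdNo = p.ProductNO
    JOIN City c ON f.CityName = c.CityName
    GROUP BY p.Category, c.Country
    ORDER BY p.Category, c.Country
""").fetchall()

for row in rows:
    print(row)
```

The GROUP BY on Category and Country corresponds to rolling the cube up along the product and city dimensions.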
Example of a Snowflake Schema
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Dimension tables:
– Order: Order No, Order Date
– Customer: Customer No, Customer Name, Customer Address, City
– Salesperson: SalespersonID, SalespersonName, City, Quota
– Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
• Category: CategoryName, CategoryDescr
– Date: DateKey, Date, Month
• Month: Month, Year
• Year: Year
– City: CityName, State, Country
• State: StateName, Country
Example of Fact Constellation
Multiple fact tables share dimension tables
Sales Fact Table: time_key, item_key, branch_key, location_key
– Measures: unit_sold, euros_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
– Measures: euros_sold, unit_shipped
Dimension tables:
– Time: time_key, day, day_of_the_week, month, quarter, year
– Item: item_key, item_name, brand, type, supplier_key
– Branch: branch_key, branch_name, branch_type
– Location: location_key, street, city, province/state, country
– Shipper: shipper_key, shipper_name, location_key, shipper_type
MOLAP
• Not on top of relational database
– most popular design
– specialized data structures
• Multicubes vs Hypercubes
– Not all subcubes are materialized
Storing the cube
• User identifies set of sparse attributes S,
and a set of dense attributes D.
• Index tree is constructed on sparse
dimensions.
• Each leaf points to a multidimensional
array indexed by D.
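A minimal sketch of this storage layout, under simplifying assumptions: a plain Python dict stands in for the index tree on the sparse dimensions, and a nested list stands in for the multidimensional array on the dense dimensions. The function names (`insert`, `lookup`) and the fixed value lists are hypothetical, chosen only for illustration.

```python
# Dense dimensions: the position of a value in these lists is its array index.
DATES = ["1/1/07", "2/1/07"]
CUSTOMER_TYPES = ["1time", "ret", "reg"]

# Index on the sparse dimensions (product, store).
# Assumption: a dict replaces the index tree (e.g., a B-tree) of a real system.
sparse_index = {}


def empty_array():
    """Allocate a dense date x customer-type array filled with zeros."""
    return [[0] * len(CUSTOMER_TYPES) for _ in DATES]


def insert(product, store, date, ctype, value):
    """Add a measure value; the sub-cube array is created on first use."""
    array = sparse_index.setdefault((product, store), empty_array())
    array[DATES.index(date)][CUSTOMER_TYPES.index(ctype)] += value


def lookup(product, store, date, ctype):
    """Index lookup on the sparse part, then direct access into the array."""
    array = sparse_index.get((product, store))
    if array is None:
        return 0  # this sparse combination was never stored
    return array[DATES.index(date)][CUSTOMER_TYPES.index(ctype)]


insert("p", "s1", "1/1/07", "ret", 25)
insert("p", "s1", "2/1/07", "ret", 20)
insert("p", "s2", "1/1/07", "reg", 158)

print(lookup("p", "s1", "2/1/07", "ret"))
print(len(sparse_index))
```

Only sparse combinations that actually occur get an array, which is the point of splitting the dimensions into S and D.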
Example
• product, store are sparse dimensions
• date and customer-type are dense

prod. p, store s1:
date     1time   ret   reg   Total
1/1/07      51    25   158     234
2/1/07      58    20   120     198
…           65    22    51     138
Total      174    67   329     570

prod. p, store s2:
date     1time   ret   reg   Total
1/1/07      51    25   158     234
2/1/07      58    20   120     198
…           65    22    51     138
Total      174    67   329     570

…

[Figure annotations: "E.g., B-tree, R-tree, …" on the index over the sparse dimensions; "2D array", "Direct access" and "Linked list" on the sub-cubes.]
Queries
• Efficiency depends on:
– Does the index on the sparse dimensions fit into memory?
– Type of queries:
• Restrictions on all dimensions
• Restrictions only on dense
• Restrictions only on some sparse and dense
Queries
• Selection on all attributes: (p,s1,ret,all)
– Index lookup for (p, s1), then direct access to the cell (Total, ret) of its array: 67
Queries
• Only on dense attributes: (-,-,ret,"2/1/07")
– No restriction on the sparse dimensions: every sub-cube is visited, and in each one the cell (2/1/07, ret) is accessed directly
Queries
• Only some sparse and dense attributes: (-,s1,ret,"2/1/07")
– Only the sub-cubes with store s1 qualify, e.g. (p, s1) and (p2, s1); in each one the cell (2/1/07, ret) is accessed directly
Queries
• Only some sparse and dense attributes: (p,-,-,"2/1/07")
– All sub-cubes with product p qualify, e.g. (p, s1) and (p, s2); in each one the row for 2/1/07 is read
Storing the Cube
• Dense combinations of dimensions can be
stored in multi-dimensional arrays
• For every combination of sparse
dimensions
– one sub-cube
• Sub-cubes indexed by sparse dimensions
– E.g., B-tree
– Order of the dimensions plays a role
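Why the order of the dimensions matters can be sketched with a sorted key list standing in for the index on the sparse dimensions. The keys (p1, s1), … and the function names are hypothetical. A restriction on the leading dimension maps to one contiguous key range that binary search can locate, whereas a restriction on the second dimension alone has to scan all keys.

```python
from bisect import bisect_left

# Keys of the sparse index, sorted by (product, store); product is the leading dimension.
keys = sorted([("p1", "s1"), ("p1", "s2"), ("p2", "s1"), ("p3", "s2")])


def by_product(product):
    """Restriction on the leading dimension: binary search, then a contiguous range."""
    start = bisect_left(keys, (product,))  # first key whose product is >= the given one
    result = []
    for key in keys[start:]:
        if key[0] != product:
            break  # left the contiguous range of matching keys
        result.append(key)
    return result


def by_store(store):
    """Restriction on the second dimension only: matches are scattered, so scan all keys."""
    return [key for key in keys if key[1] == store]


print(by_product("p1"))
print(by_store("s1"))
```

With the index ordered as (store, product) instead, the roles of the two query types would swap.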
Specialized Indexing Structures
• B-trees (covered in other courses)
• Bitmapped indices
• Join indices
• Spatial data structures (covered later)
Index Structures
• Indexing principle:
• mapping key values to records for
associative direct access
• Most popular indexing techniques in
relational database: B+-trees
• For multi-dimensional data, a large
number of indexing techniques have
been developed, e.g., R-trees
Bitmap Indexes
• Bitmap index: indexing technique that has attracted
attention in multi-dimensional DB implementation
Base table:
Customer  City     Car
c1        Detroit  Ford
c2        Chicago  Honda
c3        Detroit  Honda
c4        Poznan   Ford
c5        Paris    BMW
c6        Paris    Nissan
Bitmap Indexes
• The index consists of bitmaps:

Bitmaps for City:
rec  Chicago  Detroit  Paris  Poznan
1    0        1        0      0
2    1        0        0      0
3    0        1        0      0
4    0        0        0      1
5    0        0        1      0
6    0        0        1      0

Bitmaps for Car:
rec  BMW  Ford  Honda  Nissan
1    0    1     0      0
2    0    0     1      0
3    0    0     1      0
4    0    1     0      0
5    1    0     0      0
6    0    0     0      1

• Index on a particular column
• The index consists of a number of bit vectors (bitmaps)
• Each value in the indexed column has its own bit vector
• The length of each bit vector is the number of records in the base table
• The i-th bit is set if the i-th row of the base table has that value in the
indexed column
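The construction described above can be sketched as follows, using the Customer table from the example. Bit vectors are kept as plain Python lists for readability; a real system would pack them into machine words. The function name `build_bitmap_index` is hypothetical.

```python
# Base table from the example: (Customer, City, Car).
rows = [
    ("c1", "Detroit", "Ford"),
    ("c2", "Chicago", "Honda"),
    ("c3", "Detroit", "Honda"),
    ("c4", "Poznan", "Ford"),
    ("c5", "Paris", "BMW"),
    ("c6", "Paris", "Nissan"),
]


def build_bitmap_index(rows, column):
    """Return {value: bit vector}; bit i is 1 if row i has that value in the column."""
    index = {}
    for i, row in enumerate(rows):
        # One bit vector per distinct value, as long as the base table.
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[i] = 1
    return index


city_index = build_bitmap_index(rows, 1)
car_index = build_bitmap_index(rows, 2)

for value in sorted(city_index):
    print(value, city_index[value])
```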
Bitmap Indexes
Data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

Age index: one bitmap per age value (18, 19, 20, 21, 22, 23, 25, 26, ...)

Query:
Get people with age = 20 and name = "fred"
– List for age = 20: 1101100000
– List for name = "fred": 0100000001
– Answer is intersection: 0100000000

Suited well for domains with small cardinality
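The intersection in the query above is a bitwise AND of the two bitmaps. A small sketch, assuming the bitmaps are given as strings of 0s and 1s with the leftmost bit belonging to record id 1; the helper names are hypothetical.

```python
# Bitmaps from the example.
age_20 = "1101100000"
name_fred = "0100000001"


def bitmap_and(a, b):
    """Bitwise AND of two equally long bitmaps given as '0'/'1' strings."""
    width = len(a)
    # Convert to integers, AND them, and pad the result back to the original width.
    return format(int(a, 2) & int(b, 2), "0{}b".format(width))


def matching_ids(bitmap):
    """Record ids (1-based) whose bit is set."""
    return [i + 1 for i, bit in enumerate(bitmap) if bit == "1"]


answer = bitmap_and(age_20, name_fred)
print(answer)
print(matching_ids(answer))
```

Only the records whose bits survive the AND have to be fetched from the base table.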
Bitmap Index
• Size of bitmaps can be further reduced
– use run-length encoding
1111000111100000001111000 is encoded as
4x1;3x0;4x1;7x0;4x1;3x0
– Can reduce the storage space significantly
– Logical operations can work directly on the
encoding
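A short sketch of the run-length encoding shown above, producing the same "count x bit" notation as the slide. It covers only encoding and decoding; operating directly on the encoded form is not shown. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a '0'/'1' string as runs, e.g. '1110' becomes '3x1;1x0'."""
    return ";".join(
        "{}x{}".format(len(list(run)), bit) for bit, run in groupby(bits)
    )


def rle_decode(encoded):
    """Inverse of rle_encode."""
    if not encoded:
        return ""
    parts = (part.split("x") for part in encoded.split(";"))
    return "".join(bit * int(count) for count, bit in parts)


bitmap = "1111000111100000001111000"
encoded = rle_encode(bitmap)
print(encoded)
```

The saving grows with the length of the runs, so sparse bitmaps with long stretches of zeros benefit most.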
Bitmap Index – Summary
• With efficient hardware support for bitmap
operations (AND, OR, XOR, NOT), bitmap index
offers better access methods for certain queries
• e.g., selection on two attributes
• Some commercial products have implemented
bitmap index
• Works poorly for high cardinality domains since the
number of bitmaps increases
• Difficult to maintain - need reorganization when
relation sizes change (new bitmaps)
Join Indexes
• Traditional indexes: value → rids.
• Join indices: tuples in the join → rids in
the source tables.
• Data warehouse:
– values of dimensions of star schema → rows in
fact table.
• Join indexes can span multiple dimensions
Join
• "Combine" SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

sale:
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product:
id  name  price
p1  bolt  10
p2  nut   5

joinTb:
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Join Indexes
product (with join index):
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4

sale:
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
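A minimal sketch of the join index from this example: for every product id it records the rids of the matching sale rows, so the join result for one product can be assembled without scanning the whole sale table. Dicts stand in for the tables, and the function names are hypothetical.

```python
# product: id -> (name, price)
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# sale: rId -> (prodId, storeId, date, amt)
sale = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}


def build_join_index(sale):
    """Map each product id to the rids of the sale rows that join with it."""
    jindex = {}
    for rid, (prod_id, _store, _date, _amt) in sale.items():
        jindex.setdefault(prod_id, []).append(rid)
    return jindex


jindex = build_join_index(sale)


def sales_of(prod_id):
    """Rows of the join for one product, fetched through the join index."""
    name, price = product[prod_id]
    return [(prod_id, name, price) + sale[rid][1:] for rid in jindex.get(prod_id, [])]


print(jindex)
print(sales_of("p2"))
```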
OLAP - Summary
• A data warehouse is a specialized database
that supports analytical (OLAP) queries
• Data cube as conceptual model
• Implementation of Data Cube
– View selection problem
– Explosion problem
– ROLAP vs. MOLAP
– Indexing structures