A TPC-DS Multidimensional Implementation in Cloud Systems

advertisement
MOLAP on Cloud
Interactive, Cluster Data Warehouse
Hongwei Zhao, XiaojunYe
Tsinghua University
hwzhao73@gmail.com,
yexj@mail.tsinghua.edu.cn
Motivation
Extend the cube model to support OLAP
operations on Big Data:
»OLAP operations
»Interactive queries
Outline
Cube modelling
Building and querying
Experimenting
Data Transform for Cube
TPC-DS tables
Star views
User queries
Cube data
A Simplified Cube Model
Result
Cube Instance
*
Cube
Metadata
Dimension
Dimension
Dimension
Instance
Instance
Instance
Key
Key
Key
Member
Member
Dimension
Member
Cuboid
CuboidInstance
Instance
Key
Key
Key
Measure
Measure
Node
Measure
Node
Cell
A
AB
C
B
BC
AC
ABC
Base Cuboids
Example: TPC-DS Query7
select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk and
ss_item_sk = i_item_sk and
ss_cdemo_sk = cd_demo_sk and
ss_promo_sk = p_promo_sk and
cd_gender = '[GEN]' and
cd_marital_status = '[MS]' and
cd_education_status = '[ES]' and
(p_channel_email = 'N' or p_channel_event = 'N') and
d_year = [YEAR]
group by i_item_id
order by i_item_id
Relation Schema
Customer
Demographics
Date Dim
Store Sales
Item
Promotion
Converting to BitKey
Dimension A Dimension B Dimension C Measure
Fact1
Fact
2
A1
B1
C1
M1
A2
B1
C2
M2
A3
B2
C2
M3
Dimension
Intermediate Member
Result1
Intermediate
Result1
BitKey
Dimension
Mask
A1
000001
000001
B1
000010
000010
C1
000100
000100
A2
001000
001001
B1
000010
000010
C2
010000
010100
BitKeys
Value
Result1
000111
M1
Result
011010
M2
2
Cube Storage
Table
One table for dimension instances storage:
Row Key
Dimension Name
Column Family
Default
Column
Member BitKey
ColumnFamily
Value
Member Value
Row
Multiple tables for cuboids instances
Region
Column
Version
Value
Cell
Table Name
Cuboid Name
Row Key
Cell BitKey
Column Family
Default
Column
Measure Name
Value
Measure Value
MDX for query 7
select { i_item_id } on rows,
{ avg(ss_quantity), avg(ss_list_price),
avg(ss_coupon_amt), avg(ss_sales_price) }
on columns
from store_sales_cube
where (cd_gender .[Male],
cd_marital_status .[Single],
cd_education_status .[College],
d_year.[2000])
Cube Implementation
Base cuboid building with 4 stages:
Dimension constructing
Hive query
Aggregation
Saving
Query execution with 4 stages:
Loading dimension
Other cuboid constructing
Mapping
Reducing
Cluster Framework
OLAP System
• Distribute dynamically
cubes data onto
worker nodes
• Parallelize OLAP
operations into a
concurrent model
Columnar
Database
Worker Nodes
Engine
Dispatcher Node
Dispatcher
Node
Worker Node
Worker Node
Worker Node
cache
data
Master
Node
Region Node
Region Node
Region Node
Cube
data
Actor of Akka
State
Behavior
Mailbox
Lifecycle
Fault tolerance
Actors for Query
• Load dimension
members
• Build other cuboids
• Mapping
• Reducing
Execute Query
require
Dimension
load
1
Query
Dispatcher
Dimension
Manager
Cuboid
Manager
data
ready
3
4
Extract
Query
Cuboid
ready
Hit Cell
Hit Cell
Mapper
2
Reducer
Compiling & Mapping
Query 7 Condition:
GEN=M and MS=S and ES=College and YEAR=2000
GEN Mask:
MS Mask:
ES Mask:
YEAR Mask:
000000011
000011100
001100000
110000000
Male
Single :
College:
2000:
000000010
000001100
001000000
010000000
Mapper1
Mask:
FilterKey:
Query
Dispatcher
111111111
011001110
Mapper2
Mapper3
For each cell in mapper
{
If (key & mask
== Filter Key)
Send to Reducer
}
Query Execution
Cache 1
• Master sends task messages to workers
results
• Each worker caches each region data
Master
Worker
Region 1
• Sequential tasks reuse the cache data
messages
Cache 2
Worker
First query on 1G consume 48 secs,
the following queries with various parameters
consume 2.4 secs
Cache 3
Worker
Region 3
Region 2
Experiments On TPC-DS
1G
4 nodes:
•
•
•
•
2*Intel Xeon CPU E5-2630
4*600G 15000r/s SAS
256G RAM
10Gb Network
Dimensions:
1. "i_item_id",
2. "cd_gender",
3. "cd_marital_status",
4. "cd_education_status",
5. "p_channel_email",
6. "p_channel_event",
7. "d_year“
Measures:
1. ss_quantity_avg,
2. ss_list_price_avg,
3. ss_coupon_amt_avg,
4. ss_sales_price_avg
records
number
cube
cell
number
10G
100G
2,653,108
26,532,571 265,325,821
1,836,162
10,190,922 41,892,286
300000000
250000000
200000000
150000000
fact records
100000000
cells
50000000
0
1g
10g 100g
Build Cube for Query 7
100G
TPC-DS data size
• Partition by the largest
Dimension(i_item_id)
• In-Memory aggregation
• Saving stage can be
ignore(cache)
Initializing
10G
querying
aggregating
1G
0
1000
2000
3000
4000
5000
6000
running time (seconds)
Execute Query 7
Sequential execution includes:
• Mapping
• Reducing
running time (seconds)
First execution on the cube includes
• Dimension loading
• other cuboids construction
• Caching
• Mapping
• Reducing
400
350
300
250
200
4 workers
150
8 workers
100
16 workers
50
0
1
2
3
4
iteration number
5
Hive Query for Fact Data
select p_channel_email, p_channel_event, cd_gender,
cd_marital_status, cd_education_status, i_item_id,d_year,
ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price
from store_sales
join date_dim on (store_sales.ss_sold_date_sk
= date_dim.d_date_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics on (store_sales.ss_cdemo_sk
= custom-er_demographics.cd_demo_sk)
join promotion on (store_sales.ss_promo_sk
= promotion.p_promo_sk)
Compare with Hive
1400
First query time compare:
2-3X
1200
1000
800
hive
600
prototype
400
200
0
1G
10G
100G
1400
Sequential execution time:
30-50X
1200
1000
800
hive
600
prototype
400
200
0
1G
10G
100G
Future work
• Cube Model:
• Demand-driven & Data-driven
• Cube Data:
• Model-driven & Requirement-driven
• More experiments on TPC-DS queries
• Report, ad hoc, iterative, data mining,
• MDX/XMLA compliance
Thanks.
Storage for Example
Table: Dimension
Row Key
Column Family: default
Dimension Mask
A
001001
000001
001000
001001
A1
A2
A3
Dimension Mask
B
100010
000010
100000
B1
B2
Table: Cuboid_ABC
Row Key
000111
011010
Column Family: default
Mea_count
Mea_sum
1
M1
Mea_count
Mea_sum
1
M2
Download