The Making of TPC-DS
Meikel Poess
Oracle Corporation
Raghunath Othayoth Nambiar
Hewlett-Packard Company
Agenda
1. Industry standard benchmark development
2. Limitations of TPC-H
3. Key elements of TPC-DS
4. Current state of the specification
5. Q&A
August 31, 2006
32nd International Conference on Very Large Data Bases
Benchmark Categories
• Industry standard benchmarks
  − Transaction Processing Performance Council (TPC)
  − Standard Performance Evaluation Corporation (SPEC)
• Application benchmarks
  − SAP, Oracle Apps, JD Edwards, Exchange, Domino
• Special purpose benchmarks
  − Dhrystone, Whetstone, Linpack, Iozone, Netperf, Stream
Industry Standard Benchmarks
− Broad industry representation (all decisions taken by the board)
− Verifiable (audit process)
− Domain-specific standard tests
− Resolution of disputes and challenges

TPC benchmarks:
• TPC-C (OLTP), TPC-E (new OLTP)
• TPC-H (DSS), TPC-DS (new DSS)
• TPC-App (dynamic web)

SPEC benchmarks:
• SPEC CPU – integer and floating point
• SPEC SFS – system file server
• SPECweb – web server
• SPECpower – power consumption (new)
Why Benchmarks Are Important
• Vendor point of view
  − Define the playing field (measurable, repeatable)
  − Enable competitive analysis
  − Monitor release-to-release progress
  − Results understood by engineering, sales and customers
  − Accelerate focused technology development
• Customer point of view
  − Cross-vendor comparisons (performance, TCO)
  − Evaluate new technologies
  − Eliminate costly in-house characterization
Tracking Release-to-Release Progress, Example

SPEC CPU2000 benchmark results on HP ProLiant DL380, 2002 to date.
[Chart: SPECint®_rate2000 and SPECfp®_rate2000 scores for the ProLiant DL380 G2 (01/02), DL380 G3 (07/03), DL380 G4 (08/04), DL380 G4 (09/05), DL380 G4 Dual-core (10/05) and DL380 G5 Dual-core (07/06), rising roughly tenfold over the period.]
All SPEC® CPU2000 benchmark results stated above reflect results published as of July 25, 2006. For the latest SPEC® CPU2000 benchmark results, visit www.spec.org/cpu2000/.
Tracking Product-line Progress, Example

TPC-C benchmark on HP ProLiant servers over 10 years.
[Chart: tpmC and $/tpmC for twelve ProLiant results from Oct-95 through Oct-05 (ProLiant 4500 through ProLiant DL585 generations). Throughput grew from a few thousand tpmC in 1995 to 202,551 tpmC in 2005, while price/performance fell from $241.64/tpmC to $2.04/tpmC.]
Competitive Analysis, Example
Top Ten 3000GB TPC-H by Performance, As of 12-Sept-2006.
Transaction Processing Performance Council
(TPC)
What Makes the TPC Unique
• TPC is the only benchmark organization that requires price/performance scores across all of its benchmarks
• All tests require full documentation of the components and applications under test, so that the test can be replicated
• The TPC requires an independent audit of results prior to publication
• TPC tests whole-system performance, not just one piece
• TPC is database agnostic: Oracle, IBM DB2, Sybase, Microsoft SQL Server, NonStop SQL/MX and other databases
• TPC provides cross-platform performance comparisons, a view of processor versus real performance, technology comparisons, and actual cost-of-performance comparisons
TPC Business Model
• TPC follows the philosophy of real-world benchmarks, so that its customers can:
  − relate their business to the benchmark's business model
  − relate their workload to the workload of the benchmark
  − understand the benchmark
TPC Members
• 20 member companies
• 4 associate members
TPC
• TPC organization
  − Subcommittees: OLTP, DSS, Web, Pricing
  − Technical Advisory Board
  − Steering Committee
• Benchmark development/maintenance
  − 6 face-to-face meetings per year
  − Weekly conference calls
New Benchmark Development
Industry Standard Benchmark Development Activities
• Development of a new benchmark in a new domain
• Refinement of existing benchmarks
• New benchmark in an existing domain
  − Good benchmarks drive industry and technology forward
  − At some point, all reasonable advances have been made
  − Benchmarks can become counterproductive by encouraging artificial optimizations
  − So, even good benchmarks become obsolete over time
  − As technology and user environments evolve, so should the benchmark
Benchmark Lifecycle

new idea → Requirements → Subcommittee → Draft Spec → Prototypes → Benchmark Spec → Vendors Publish Benchmark Results

(A new idea forms the requirements; a feasibility debate forms the subcommittee, which defines a draft specification; feasibility implementations produce prototypes; after evaluation and refinement the benchmark specification is published, and vendors publish benchmark results. Eventually the benchmark becomes obsolete, which results in new requirements.)
Industry Standard Benchmark Development, Challenges
• Development cycle
  − Benchmark development can take years
    • Technology and business could change significantly
• Members have their own agendas
  − Hardware vs. software
  − Scale-out vs. scale-up
  − Proprietary vs. industry standard
New Industry Standard Benchmarks, Challenges
• Unknowns
  − How does my product perform under the new load and metric?
• Risk factors
  − There are failed benchmarks
• Investments
  − TPC benchmark publications are expensive
  − Vendors want to keep their existing publications
  − Need to train engineers, sales and marketing, customers
TPC-H
TPC-H
• Measures generally applicable aspects of a decision support system
• Its basic ideas have been a standard since 1994
• Currently about 100 results from 14 vendors (system/database) on the TPC website
• Has served industry and academia very well
• Shortcomings in:
  − Data model
  − Workload model
  − Metric
Data Model Shortcomings
• Database schema
  − 3rd normal form
  − 8 tables
  − On average 10 columns per table
  − Commonly used database technologies are restricted
• Dataset
  − Uniform data distributions
  − Synthetic data
  − No null values
  − Linear scaling of almost all tables with the scale factor
  − Unrealistic table sizing
    • Scale factor 100,000 → 20 billion parts sold to 15 billion customers at a rate of 150 billion orders a year
Workload Model Shortcomings
• Query workload
  − 22 SQL-92 queries
  − Simple structure
  − Only ad-hoc queries
• Update workload
  − Simple insert and delete operations
  − No data transformations
  − Only two tables are maintained
  − Random inserts and deletes based on non-contiguous keys in the dataset
Metric Shortcomings
• Primary performance metric
  − Very complex
  − Mix of geometric mean and arithmetic mean

$$\mathrm{QphH} = \sqrt{\frac{3600 \cdot SF}{\sqrt[24]{\prod_{i=1}^{22} Q_i \cdot \prod_{i=1}^{2} RF_i}} \cdot \frac{S \cdot 22 \cdot 3600 \cdot SF}{\max(T_1 \ldots T_S)}}$$
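The mix of means the slide criticizes is easiest to see by computing the metric directly: the power term is a geometric mean of the 22 query and 2 refresh times, while the throughput term is an arithmetic ratio. A minimal sketch with made-up timings (not from any published result):

```python
import math

def qphh(sf, query_times, refresh_times, s, stream_elapsed):
    """TPC-H composite metric: geometric mean of the Power and
    Throughput components (all times in seconds, sf = scale factor)."""
    # Power: 3600*SF over the 24th root of the product of the
    # 22 query times and 2 refresh-function times (a geometric mean).
    power = 3600 * sf / math.prod(query_times + refresh_times) ** (1 / 24)
    # Throughput: total work (S streams * 22 queries) per elapsed hour,
    # using the longest-running stream.
    throughput = s * 22 * 3600 * sf / max(stream_elapsed)
    return math.sqrt(power * throughput)

# Illustrative numbers only: 22 queries of 10 s each, 2 refresh
# functions of 20 s each, 4 streams finishing within 1200 s at SF 100.
print(round(qphh(100, [10.0] * 22, [20.0] * 2, 4, [1100.0, 1200.0])))
```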
Objectives for TPC-DS
• Realistic data model
• Complex workload
  − Large query set
  − ETL-like update model
• Simple and comprehensible metric
• Understandable business model
Data Model
Fact Tables

Catalog Sales / Catalog Returns, Web Sales / Web Returns, Store Sales / Store Returns, Inventory

• 3 sales channels: Catalog, Web, Store
• 7 fact tables
• 2 fact tables for each sales channel
• 24 tables total
• Basic auxiliary data structures are allowed on all tables
• Complex auxiliary data structures are only allowed on Catalog Sales and Catalog Returns
Snow Flake

[Schema diagram: the Store_Sales fact table joined to the dimensions Date_Dim, Time_Dim, Store, Item, Promotion, Customer, Customer_Demographics, Household_Demographics, Customer_Address and Income_Band.]
Data Model Advantages
• Complex relationships
• Fact-to-fact table relations
• Large number of tables (24)
• Large number of columns (on average 18)
• Auxiliary data structures are allowed on a subset of the schema
→ complex queries
→ star and “traditional” executions
→ satisfies hardware and software vendors
→ extends the lifetime of the benchmark due to complexity
Dataset
Database Scaling
• Database size is defined in scale factors
• The scale factor indicates the raw data size in GB
• Auxiliary data structures and temporary storage are not included

Scale Factor | Database Size
1            | 1 GB
100          | 100 GB
300          | 300 GB
1000         | 1 TB
3000         | 3 TB
10000        | 10 TB
30000        | 30 TB
100000       | 100 TB
Fact Table Scaling
• Fact tables scale linearly with the scale factor

[Chart: number of rows per fact table (log scale, 1E+05 to 1E+12) versus scale factor (1 to 100,000) for Store Sales, Catalog Sales, Web Sales and Inventory.]
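Because fact tables scale linearly, their cardinality at any scale factor can be extrapolated from the SF-100 sizes given on the "Table Sizes" slide. A small sketch; the per-SF base rates below are derived from those slide numbers (e.g. 288 million Store Sales rows at SF 100), not taken from the specification:

```python
# Rows per unit of scale factor, derived from the SF=100 table sizes
# shown in these slides (store_sales: 288M, catalog_sales: 144M,
# web_sales: 72M rows at SF 100).
ROWS_PER_SF = {
    "store_sales": 2_880_000,
    "catalog_sales": 1_440_000,
    "web_sales": 720_000,
}

def fact_rows(table: str, sf: int) -> int:
    """Linear extrapolation of fact-table cardinality with scale factor."""
    return ROWS_PER_SF[table] * sf

print(fact_rows("store_sales", 100))      # 288,000,000 rows at 100 GB
print(fact_rows("store_sales", 100_000))  # 288 billion rows at 100 TB
```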
Database Scaling (Dimensions)
• Dimension tables scale sub-linearly
• They account for only a fraction of the fact tables' size

[Chart: number of rows per dimension table versus scale factor (1 to 100,000) for Date, Item and Stores.]
Table Sizes at SF 100GB

Table           | #Rows        | Percent of Total
Store Sales     | 288 Million  | 39
Store Returns   | 28.8 Million | 3.4
Catalog Sales   | 144 Million  | 30
Catalog Returns | 14.4 Million | 2.4
Web Sales       | 72 Million   | 15
Web Returns     | 7.2 Million  | 1
Inventory       | 390 Million  | 9
Customer        | 2 Million    | 0.5
Item            | 100,000      | 0.1
Catalog Page    | 24,000       | 0.002
Remaining       | 3.3 Million  | 0.005
Data Content
• Some data has “real world” content:
  − Last names: “Sanchez”, “Ward”, “Roberts”
  − Addresses: “630 Railroad, Woodbine, Sullivan County, MO-64253”
• Data is skewed
  − Sales are modeled after US census data
  − More green items than red
  − Small and large cities
Sales Distribution

[Chart: distribution of Store Sales over the months of the year. 14% of all sales happen between January and July (Group 1), 28% between August and October (Group 2), and 58% in November and December (Group 3).]
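This seasonal skew can be mimicked when generating synthetic sale dates. A toy sketch using the group percentages from the chart (14% Jan–Jul, 28% Aug–Oct, 58% Nov–Dec), spread evenly within each group; the actual dbgen tool uses its own distributions:

```python
import random

# Per-month weights: 14% spread over Jan-Jul, 28% over Aug-Oct,
# 58% over Nov-Dec (group percentages from the chart above).
MONTH_WEIGHTS = [14 / 7] * 7 + [28 / 3] * 3 + [58 / 2] * 2

def random_sale_month(rng: random.Random) -> int:
    """Draw a month (1-12) following the skewed sales distribution."""
    return rng.choices(range(1, 13), weights=MONTH_WEIGHTS, k=1)[0]

rng = random.Random(42)
sample = [random_sale_month(rng) for _ in range(10_000)]
nov_dec = sum(m >= 11 for m in sample) / len(sample)
print(f"share of sales in Nov/Dec: {nov_dec:.1%}")  # close to 58%
```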
Color Distribution

[Chart: number of occurrences per item color. 8% of all colors are in Group 1, 24% in Group 2, and 68% in Group 3, with per-color counts ranging up to about 1,400 occurrences.]
Dataset Advantages
• Realistic table scaling
• Real-world data content
• Non-uniform distributions
→ challenging for:
  − statistics collection
  − query optimizer
Query Model
Query Model
• Queries are designed to be realistic. They:
  − Answer real-world questions
  − Cover the system functionality relevant to decision support applications
  − Only allow tuning methods available to a DBA
  − Cover all the data, so that unrealistic creation of auxiliary data structures is unlikely
  − Yet impose a controlled and repeatable workload
Query Templates
• TPC-DS requires a large query set
  − E.g., a 100 TB benchmark runs 1,089 queries
• Queries are written in a query template language
• Each query template is unique
• Queries are automatically generated from the query templates
• More information about the query generator:
  Meikel Poess, John M. Stephens: Generating Thousand Benchmark Queries in Seconds. VLDB 2004: 1045-1053
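The template mechanism can be pictured as placeholder substitution: tokens such as [YEAR] or [STATE_A] in a template (as in the example queries on the following slides) are replaced with randomly chosen values. A minimal sketch; the real qgen tool implements a much richer template language:

```python
import random
import re

def instantiate(template: str, substitutions: dict) -> str:
    """Replace every [TOKEN] placeholder in a query template with
    its value from the substitutions dictionary."""
    return re.sub(r"\[([A-Z_]+)\]", lambda m: substitutions[m.group(1)], template)

# Hypothetical mini-template in the style of the TPC-DS examples.
template = ("select * from store_sales, date_dim "
            "where ss_sold_date_sk = d_date_sk "
            "and d_year = [YEAR] and ss_quantity > [QTY]")

rng = random.Random(7)
subs = {"YEAR": str(rng.randint(1998, 2002)), "QTY": str(rng.randint(1, 100))}
print(instantiate(template, subs))
```

Drawing the substitutions from a seeded random generator is what makes the workload both varied (thousands of distinct queries) and repeatable.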
Query Model
• Query language: SQL99 + OLAP extensions
• Queries need to be executed “as is”
  − No hints or rewrites allowed, except when approved by the TPC
• 99 different query templates
• 4 different query types:

Type        | Simulates                                                | Implemented via                                        | Templates
Reporting   | Finely tuned, recurring queries                          | Access catalog sales channel tables                    | 38
Ad-hoc      | Sporadic queries, minimal tuning                         | Access store and web sales channel tables              | 47
Iterative   | Users issuing sequences of queries                       | Sequence of queries where each query adds SQL elements | 4
Data Mining | Queries feeding data mining tools for further processing | Return a large number of rows                          | 10
Ad Hoc Query
select i_item_id,
s_state, grouping(s_state) g_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, store,
item
where ss_sold_date_sk = d_date_sk and
ss_item_sk = i_item_sk and
ss_store_sk = s_store_sk and
ss_cdemo_sk = cd_demo_sk and
cd_gender = '[GEN]' and
cd_marital_status = '[MS]' and
cd_education_status = '[ES]' and
d_year = [YEAR] and
s_state in ('[STATE_A]','[STATE_B]', '[STATE_C]',
'[STATE_D]','[STATE_E]', '[STATE_F]')
group by rollup (i_item_id, s_state);
Reporting Query
select
count(distinct cs_order_number) as "order count"
,sum(cs_ext_ship_cost) as "total shipping cost"
,sum(cs_net_profit) as "total net profit"
from
catalog_sales cs1 ,date_dim ,customer_address ,call_center
where
d_date between '[YEAR]-[MONTH]-01' and
(cast('[YEAR]-[MONTH]-01' as date) + 60 )
and cs1.cs_ship_date_sk = d_date_sk
and cs1.cs_ship_addr_sk = ca_address_sk
and ca_state = '[STATE]'
and cs1.cs_call_center_sk = cc_call_center_sk
and cc_county in ('[COUNTY_A]','[COUNTY_B]','[COUNTY_C]'
,'[COUNTY_D]','[COUNTY_E]')
and exists (select *
from catalog_sales cs2
where cs1.cs_order_number = cs2.cs_order_number
and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and not exists(select *
from catalog_returns cr1
where cs1.cs_order_number = cr1.cr_order_number);
Iterative Query Part # 1
with frequent_ss_items as
(select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date solddate,count(*) cnt
from store_sales ,date_dim ,item
where ss_sold_date_sk = d_date_sk
and ss_item_sk = i_item_sk
and d_year in ([YEAR],[YEAR]+1,[YEAR]+2,[YEAR]+3)
group by substr(i_item_desc,1,30),i_item_sk,d_date
having count(*) >4), max_store_sales as (select max(csales) cmax
from (select c_customer_sk,sum(ss_quantity*ss_sales_price) csales
from store_sales ,customer ,date_dim
where ss_customer_sk = c_customer_sk
and ss_sold_date_sk = d_date_sk
and d_year in ([YEAR],[YEAR]+1,[YEAR]+2,[YEAR]+3)
group by c_customer_sk) x),
best_ss_customer as
(select c_customer_sk,sum(ss_quantity*ss_sales_price) ssales
from store_sales,customer
where ss_customer_sk = c_customer_sk
group by c_customer_sk
having sum(ss_quantity*ss_sales_price) > 0.95 * (select * from max_store_sales))
select sum(sales)
from ((select cs_quantity*cs_list_price sales
from catalog_sales ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and cs_sold_date_sk = d_date_sk
and cs_item_sk in (select item_sk from frequent_ss_items)
and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer))
union all
(select ws_quantity*ws_list_price sales
from web_sales ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and ws_sold_date_sk = d_date_sk
and ws_item_sk in (select item_sk from frequent_ss_items)
and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer))) y;
Iterative Query Part # 2
select c_last_name,c_first_name,sales
from ((select c_last_name,c_first_name,sum(cs_quantity*cs_list_price) sales
from catalog_sales ,customer ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and cs_sold_date_sk = d_date_sk
and cs_item_sk in (select item_sk from frequent_ss_items)
and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
and cs_bill_customer_sk = c_customer_sk
group by c_last_name,c_first_name)
union all
(select c_last_name,c_first_name,sum(ws_quantity*ws_list_price) sales
from web_sales, customer ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and ws_sold_date_sk = d_date_sk
and ws_item_sk in (select item_sk from frequent_ss_items)
and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer)
and ws_bill_customer_sk = c_customer_sk
group by c_last_name,c_first_name)) y;
Iterative Query Part # 3
select c_last_name,c_first_name,sales
from ((select c_last_name,c_first_name,sum(cs_quantity*cs_list_price) sales
from catalog_sales ,customer ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and cs_sold_date_sk = d_date_sk
and cs_item_sk in (select item_sk from frequent_ss_items)
and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
and cs_bill_customer_sk = c_customer_sk
and cs_bill_customer_sk = cs_ship_customer_sk
group by c_last_name,c_first_name)
union all
(select c_last_name,c_first_name,sum(ws_quantity*ws_list_price) sales
from web_sales ,customer ,date_dim
where d_year = [YEAR]
and d_moy = [MONTH]
and ws_sold_date_sk = d_date_sk
and ws_item_sk in (select item_sk from frequent_ss_items)
and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer)
and ws_bill_customer_sk = c_customer_sk
and ws_bill_customer_sk = ws_ship_customer_sk
group by c_last_name,c_first_name)) y;
Query Model Advantages
• SQL99 + OLAP extensions
• Query templates allow for the generation of thousands of different queries
• Combination of different query classes
  − Ad-hoc
  − Reporting
  − Iterative
  − Data mining
• Star-schema and “traditional” query execution
Execution Rules
Benchmark Execution

System Setup → Database Setup → Database Load → Query Run #1 → Data Maintenance → Query Run #2
(System Setup and Database Setup are un-timed; Database Load, the query runs and Data Maintenance are timed.)

• System Setup: servers, storage (including RAID arrays), networks, operating system, database software
• Database Setup: creation of table spaces, file groups and log files
• Database Load: load of raw data into fact and system tables, creation of auxiliary data structures, statistics gathering; flat files (optional)
• Query Run #1: runs n streams concurrently, each stream running the 99 queries with different random substitutions; simulates n concurrent users
• Data Maintenance: deletes from fact tables, maintenance of slowly changing dimensions
• Query Run #2: repeat of Query Run #1
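The timed query-run phases above amount to a driver that starts n streams concurrently and records wall-clock elapsed time, which later feeds the metric. A purely illustrative, thread-based sketch with a stub in place of a real DBMS session:

```python
import threading
import time

def run_stream(stream_id: int, queries: list, execute) -> None:
    """Run one stream's queries sequentially, as one simulated user."""
    for q in queries:
        execute(stream_id, q)

def query_run(streams: list, execute) -> float:
    """Run all streams concurrently; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    threads = [
        threading.Thread(target=run_stream, args=(i, qs, execute))
        for i, qs in enumerate(streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Stub standing in for a DBMS session executing one query.
def execute_stub(stream_id, query):
    time.sleep(0.001)

# 4 streams, each running the 99 queries (in identical order here;
# the benchmark varies order and substitutions per stream).
streams = [[f"Q{n}" for n in range(1, 100)] for _ in range(4)]
elapsed = query_run(streams, execute_stub)
print(f"query run finished in {elapsed:.2f}s")
```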
Benchmark Execution

The number of query streams in the query runs is determined by the scale factor:

Scale Factor | Number of Streams
1            | n/a
100          | 3
300          | 5
1000         | 7
3000         | 9
10000        | 11
30000        | 13
100000       | 15

Each stream executes the 99 queries in its own order, e.g.:
  Stream 1: Q3, Q21, Q11, Q3, Q8, …, Q47, Q99
  Stream 2: Q1, Q55, Q4, Q14, Q9, …, Q12, Q3
  Stream n: Q94, Q3, Q1, Q8, Q84, …, Q34, Q23
Benchmark Execution

Data Maintenance: flat files are read, transformed and loaded into the data warehouse tables inside the DBMS:
  Flat File → read → Transformation → load → Data Warehouse Tables (DBMS)
Benchmark Execution

Query Run #2 runs the same streams again after Data Maintenance:
  Stream 1: Q3, Q21, Q11, Q3, Q8, …, Q47, Q99
  Stream 2: Q1, Q55, Q4, Q14, Q9, …, Q12, Q3
  Stream n: Q94, Q3, Q1, Q8, Q84, …, Q34, Q23
Database Load
• Simulates a data warehouse reload
• Measures the system’s ability to:
  − Load data
  − Create auxiliary data structures
  − Gather statistics
• Is part of the metric because:
  − Data warehouses get recreated
  − It prevents the use of unrealistic auxiliary data structures
Execution Query Run #1
• Simulates execution of queries by multiple concurrent users
• Measures the system’s ability to:
  − Process concurrent query executions in the least amount of time
  − Allocate resources efficiently among multiple concurrent users
• A minimum number of streams is required
• The maximum number of streams is not limited
Data Maintenance
• Simulates incremental raw data feeds from an OLTP system
• Includes the cost of maintaining auxiliary data structures
• The amount of data loaded is linear in the number of streams → guarantees the significance of data maintenance in the metric
• The approach is database centric → no ETL tools
Execution Query Run #2
• Rerun of Query Run #1
• Measures the system’s ability to repeat the results of Query Run #1 after Data Maintenance
Metric
Primary Metrics
• Three primary metrics:
  − Queries per hour
  − Price per query
  − System availability
• Queries per hour:

QphDS = (S · 198 · 3600 · SF) / (0.01 · S · T_LOAD + T_Q1 + T_DM + T_Q2)

  − S: number of query streams
  − SF: scale factor
  − T_Q1 and T_Q2: elapsed times to complete query runs #1 and #2
  − T_DM: elapsed time to complete the data maintenance
  − T_LOAD: total elapsed time to complete the database load
Metric Explanation

QphDS = (S · 198 · 3600 · SF) / (0.01 · S · T_LOAD + T_Q1 + T_DM + T_Q2)

• Numerator:
  − S · 198 → normalizes the result to queries (99 queries × 2 runs per stream)
  − 3600 → normalizes the result to hours
  − SF → normalizes the result to the scale factor
• Denominator:
  − 0.01 → weights the load at 1%
  − S → avoids diminishing the weight of the load as the number of streams grows
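In contrast to QphH, the metric is a single ratio, which makes it easy to compute and reason about. A minimal sketch with made-up timings (not from any published result):

```python
def qphds(s: int, sf: int, t_load: float, t_q1: float,
          t_dm: float, t_q2: float) -> float:
    """QphDS: total queries (S streams * 198) per normalized hour.
    All times in seconds; the load is weighted at 1% per stream."""
    return (s * 198 * 3600 * sf) / (0.01 * s * t_load + t_q1 + t_dm + t_q2)

# Made-up example: 4 streams at SF 100, a 10-hour load, 2 hours per
# query run, and 30 minutes of data maintenance.
print(round(qphds(4, 100, 36_000, 7_200, 1_800, 7_200)))  # 16163
```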
Current Status of TPC-DS Spec
Status TPC-DS

[Lifecycle diagram repeated from the Benchmark Lifecycle slide (new idea → Requirements → Subcommittee → Draft Spec → Prototypes → Benchmark Spec → Vendors Publish Benchmark Results), indicating where TPC-DS currently stands.]
More Information
• Specification:
  http://www.tpc.org/tpcds/default.asp
• Benchmark tools (will be available on the website soon):
  − Dbgen
  − Qgen
  − Query templates
Q&A