Synthesizing Representative I/O
Workloads for TPC-H
J. Zhang*, A. Sivasubramaniam*,
H. Franke, N. Gautam*, Y. Zhang, S. Nagar
* Pennsylvania State University
IBM T.J. Watson
Rutgers University
Outline
• Motivation
• Related Work
• Methodology
– Arrival Time
– Access Pattern
– Request Sizes
• Accuracy of synthetic traces
• Concluding Remarks
Motivation
• I/O subsystems are critical for
commercial services and in production
environments.
• Real applications are essential for
system design and evaluation.
• TPC-H is a decision-support workload
for business enterprises.
Disadvantages of Traces
• Not easily obtainable
• Can be very large
• Difficult to get statistical confidence
• Very difficult to change workload behavior
• Do not isolate the influence of one parameter
• On the other hand, a deeper understanding of
the workload can:
• Help generate a synthetic workload
• Help in system design itself.
What do we need to
synthesize?
• Inter-arrival times (temporal
behavior) of disk block requests.
• Access pattern (spatial behavior) of
blocks being referenced
• Size (volume) of each I/O request.
Related work
• Scientific Application I/O behavior
– Time-series models for arrivals
– Sequentiality/Markov models for access
pattern
• Commercial/production workloads
– Self-similar arrival patterns
– Sequentiality in TPC-H/TPC-D
• No prior complete synthesis of all
three attributes for TPC-H
Our TPC-H Workload
• Trace Collection Platform
– IBM Netfinity 8-way SMP with 2.5GB
memory and 15 disks
– Linux 2.4.17
– DB2 UDB EE V7.2
• TPC-H Configuration
– Power Run of 22 queries
– Partitioning tables across the disks
– 30 GB dataset
Validation
[Diagram: original I/O traces -> identify characteristics -> generate synthetic traces; original and synthetic traces are run through DiskSim 2.0 and compared by the CDF of response time]
Metrics
• RMS: root-mean-square error of differences between two CDF curves
• nRMS: RMS/m, where m is the average response time for the original trace
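The two metrics can be sketched as follows. This is only an illustration under a stated assumption: the slides do not say how the "difference between two CDF curves" is measured, so the sketch measures it horizontally, as the difference between response times at matching CDF levels, which makes RMS/m dimensionless. The function names and the `points` parameter are hypothetical.

```python
import math

def quantiles(samples, levels):
    """Empirical quantiles: the sample at which the CDF reaches each level."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[min(n - 1, max(0, math.ceil(p * n) - 1))] for p in levels]

def rms_between_cdfs(original, synthetic, points=100):
    """RMS of response-time differences at `points` evenly spaced CDF levels.

    Assumption: differences are taken between quantiles (horizontal distance).
    """
    levels = [(i + 1) / points for i in range(points)]
    q_orig = quantiles(original, levels)
    q_syn = quantiles(synthetic, levels)
    squared = sum((a - b) ** 2 for a, b in zip(q_orig, q_syn))
    return math.sqrt(squared / points)

def nrms(original, synthetic, points=100):
    """RMS normalized by m, the mean response time of the original trace."""
    mean_original = sum(original) / len(original)
    return rms_between_cdfs(original, synthetic, points) / mean_original
```

Here `original` and `synthetic` would be the per-request response times reported by the simulator for the two traces.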
Overall Methodology
• Arrival pattern characteristics
– Investigate correlations
• Time series
• Self-similar
• iid distributions
• Access pattern characteristics
– Sequentiality/pseudo sequentiality/randomness
– Size characteristics
• Investigating correlations between time,
space and volume to get final synthesis
Arrival pattern
• Statistical analysis
– Auto-correlation
function (ACF) plots
• Shows the correlation
between current
inter-arrival time and
one that is x-steps
away
– Correlations seem very weak (<0.15 for
12 queries, and <0.30 for the rest)
• Errors with Time series models
(AR/MA/ARIMA/ARFIMA) are high
• No evidence of self-similarity either
– Perhaps iid (independent and identically
distributed) is not a bad assumption.
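As a rough illustration of the ACF analysis above, the sample autocorrelation of the inter-arrival times can be computed as below. The function names are hypothetical, and the sketch assumes the series is not constant (non-zero variance).

```python
def inter_arrival_times(arrivals):
    """Differences between successive arrival timestamps."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag.

    Entry x-1 of the result is the correlation between a value and the
    value x steps away.
    """
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((value - mean) ** 2 for value in series)
    acf = []
    for lag in range(1, max_lag + 1):
        covariance_sum = sum(
            (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
        )
        acf.append(covariance_sum / variance_sum)
    return acf
```

Values close to zero at every lag, as reported on the slide, are what one would expect if the inter-arrival times are close to independent.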
• Fitting distributions
– Tried hyper-exponential/normal/pareto
– Used Maximum Likelihood Estimator
(normal/pareto) and Expectation
Maximization (hyper-exponential) to
estimate distribution parameters
– Use K-S test to measure goodness-of-fit
– Maximum distance between fitted
distribution and original CDF was ensured to
be less than 0.1
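The fitting step can be sketched for one of the candidate distributions. The example below fits a Pareto distribution by maximum likelihood and computes the K-S distance (maximum gap between the fitted and empirical CDFs). It is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes positive samples that are not all equal.

```python
import math

def fit_pareto_mle(samples):
    """Maximum likelihood estimates (x_min, alpha) for a Pareto distribution."""
    x_min = min(samples)
    log_sum = sum(math.log(x / x_min) for x in samples)
    alpha = len(samples) / log_sum
    return x_min, alpha

def pareto_cdf(x, x_min, alpha):
    """CDF of the Pareto distribution with scale x_min and shape alpha."""
    return 0.0 if x < x_min else 1.0 - (x_min / x) ** alpha

def ks_distance(samples, cdf):
    """Largest vertical gap between the empirical CDF and a fitted CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        fitted = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs((i + 1) / n - fitted), abs(i / n - fitted))
    return worst
```

A fit would then be accepted when `ks_distance(data, lambda x: pareto_cdf(x, x_min, alpha))` is below the 0.1 threshold quoted above.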
Comparing CDF of fitted
distribution and data
Access Pattern
(Location + Size)
• Most studies use sequentiality to describe TPC-H
• However, this is not always the case.
[Figure: three panels, one per category, each plotting Location against Arrival Time]
Cat1: Q10
Cat2: Q12,
Cat3: Q20
Q4, Q14
Q1,Q3,Q5,Q7,
Q9, Q17
Q8,Q15,Q18,
Q19,Q21
Category 1: Intermingling sequential streams
• Consider the following:
– Run: A strictly sequential set of I/O
requests
– Stream: A pseudo-sequential set of I/O
requests that could be interrupted by
another stream.
– i.e. a stream could have several runs
that are interrupted by runs of other
streams.
Run and Stream
An example run of 5 requests: 1-4, 5-8, 9-10, 11-14, 15-18
A stream (pseudo-sequential) of 4 requests: 1-4, 7-8, 9-12, 11-14
An example trace:
Stream A: 1-4, 7-8, 9-12, 11-14
Stream B: 100-104, 105-108, 109-112
Trace: 1-4, 7-8, 105-108, 100-104, 9-12, 11-14, 109-112
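The run definition can be made concrete with a small sketch that splits the requests of one stream into runs. It assumes that "strictly sequential" means each request starts at the block right after the previous request's last block; the slides do not state the exact rule, and the function name is hypothetical.

```python
def split_into_runs(requests):
    """Group (first_block, last_block) requests into strictly sequential runs.

    Assumption: a request continues the current run only if it starts at
    the block immediately following the previous request's last block.
    """
    runs = []
    for first, last in requests:
        if runs and first == runs[-1][-1][1] + 1:
            runs[-1].append((first, last))
        else:
            runs.append([(first, last)])
    return runs

run_example = [(1, 4), (5, 8), (9, 10), (11, 14), (15, 18)]
stream_example = [(1, 4), (7, 8), (9, 12), (11, 14)]
run_lengths = [len(run) for run in split_into_runs(run_example)]
stream_run_lengths = [len(run) for run in split_into_runs(stream_example)]
```

Under this rule the 5-request example forms a single run, while the 4-request stream breaks into three runs because of the gap before 7-8 and the overlap at 11-14.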
Secondary Attributes
• Run Length: # of requests in a run
• Run Start location: start sector of run
• Stream Length: # of requests in a stream
• Inter-stream Jump Distance: spatial separation between start of run and previous request
• Intra-stream Jump Distance: spatial separation
between successive requests within a stream
• Number of active streams (at any instant)
• Interference Distance: number of requests
between 2 successive requests in a stream
• Derive empirical distributions for these from the
trace
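Deriving an empirical distribution for one of these attributes and sampling from it can be sketched as below. The helper names are hypothetical, and the observations stand for values already measured from a trace, for example run lengths.

```python
import random
from collections import Counter

def empirical_distribution(observations):
    """Map each observed value to its relative frequency."""
    counts = Counter(observations)
    total = len(observations)
    return {value: count / total for value, count in sorted(counts.items())}

def sample(distribution, rng, k=1):
    """Draw k values according to the empirical frequencies."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=k)

# Hypothetical run lengths measured from a trace.
run_length_dist = empirical_distribution([1, 1, 2, 5])
draws = sample(run_length_dist, random.Random(0), k=10)
```

The same two helpers would apply unchanged to stream lengths, jump distances, and interference distances.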
Location Synthesis - Q10
(Time and size from trace)
• LocIID: locations are i.i.d.
• LocRUN: incorporate run length distribution and run start location distribution.
• LocSTREAM: combine all stream and run statistics.
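A minimal sketch of what LocRUN might look like, assuming request sizes are taken from the trace (as the slide states) and that a request within a run starts right where the previous one ended. The function names and the dictionary form of the distributions (value mapped to probability) are assumptions made for illustration.

```python
import random

def draw(distribution, rng):
    """Draw one value from a {value: probability} distribution."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def loc_run(sizes, run_length_dist, run_start_dist, rng):
    """Generate one start location per request, one run at a time."""
    locations = []
    remaining = 0          # requests left in the current run
    next_location = None
    for size in sizes:
        if remaining == 0:
            # Start a new run: draw its length and its start location.
            remaining = max(1, draw(run_length_dist, rng))
            next_location = draw(run_start_dist, rng)
        locations.append(next_location)
        next_location += size   # the next request in the run is contiguous
        remaining -= 1
    return locations
```

LocIID would correspond to drawing every location independently, with no run structure at all.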
Request Size
• Requests are one of
– 64, 128, 192, 256, 320, 384, 448, 512
blocks
• But the attributes (location, size, time) are not independent!
Correlations between size
and location
Fraction of requests, by request size (blocks)
Size         64    128   192   256   320   384   448   512
All req.     .716  .009  .010  .009  .009  .011  .011  .225
Run start    .577  .012  .013  .012  .013  .015  .016  .342
Within run   .916  .004  .004  .004  .004  .005  .005  .057
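The size/location correlation can be used directly when generating sizes: condition the size distribution on whether a request starts a run or falls within one. The sketch below uses the fractions reported above as Pr(size | kind); the names `SIZE_GIVEN_KIND` and `sample_size` are hypothetical.

```python
import random

SIZES = [64, 128, 192, 256, 320, 384, 448, 512]  # request sizes in blocks

# Fractions from the slide; rows sum to roughly 1 (rounded values).
SIZE_GIVEN_KIND = {
    "run_start":  [.577, .012, .013, .012, .013, .015, .016, .342],
    "within_run": [.916, .004, .004, .004, .004, .005, .005, .057],
}

def sample_size(kind, rng):
    """Draw a request size conditioned on the kind of request."""
    return rng.choices(SIZES, weights=SIZE_GIVEN_KIND[kind])[0]

rng = random.Random(0)
within_run_sizes = [sample_size("within_run", rng) for _ in range(10000)]
```

Requests within a run are overwhelmingly 64 blocks, while about a third of run start requests are 512 blocks, so sampling sizes independently of location would lose this structure.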
Correlations between size and
time
[Figure: size frequency (0% to 100%) of request sizes 64, 128-448, and 512 in each inter-arrival time interval (1 to 29)]
Correlations between location
and time
Final Synthesis Methodology
(Category 1)
• Location: use LocSTREAM to generate start locations. Two kinds of requests: a run start request or a request within a run
• Time: use Pr(inter-arrival time | run start requests) and Pr(inter-arrival time | within a run requests) to generate times.
• Size:
1) For run start request, use Pr(size | inter-arrival times of run start requests) to generate sizes.
2) For within a run requests, use Pr(size | within a run requests) to generate sizes.
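The three steps above can be combined roughly as follows. This is a sketch under assumptions, not the authors' implementation: locations are supplied already tagged as run start or within run, inter-arrival times are discretized so that a drawn value can index the size distribution, and the `model` layout and all names are hypothetical.

```python
import random

def draw(distribution, rng):
    """Draw one value from a {value: probability} distribution."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def synthesize(locations, model, rng):
    """Build (arrival_time, location, size) tuples.

    locations: list of (start_location, is_run_start) pairs.
    model["iat"][kind]: Pr(inter-arrival time | kind of request).
    model["size_at_run_start"][gap]: Pr(size | inter-arrival time) for run starts.
    model["size_within_run"]: Pr(size | within a run).
    """
    trace = []
    clock = 0.0
    for location, is_run_start in locations:
        kind = "run_start" if is_run_start else "within_run"
        gap = draw(model["iat"][kind], rng)
        clock += gap
        if is_run_start:
            size = draw(model["size_at_run_start"][gap], rng)
        else:
            size = draw(model["size_within_run"], rng)
        trace.append((clock, location, size))
    return trace
```

Conditioning time on the kind of request, and size on time for run starts, is what carries the pairwise correlations into the synthetic trace.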
• Can be easily adapted for Category 2
(strictly sequential) and Category 3
(random) queries.
• Validation: Compare the response time
characteristics of synthesized and
real trace.
Validation of CDF of response times
(Category 1)
Validation of CDF of response times
(Category 2)
Validation of CDF of response times
(Category 3)
Storage Requirements
Query                       Q1    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Storage Fraction (x0.001)   3.46  3.64  2.76  3.43  3.46  3.47  3.66  .004  2.79
nRMS                        0.10  0.09  0.20  0.07  0.01  0.04  0.05  0.15  0.16

Query                       Q12   Q14   Q15   Q17   Q18   Q19   Q20   Q21
Storage Fraction (x0.001)   3.73  6.49  3.46  2.03  3.54  3.44  4.57  2.95
nRMS                        0.06  0.19  0.01  0.05  0.06  0.03  0.10  0.07
Contributions
• A synthesis methodology to capture
– Intermingling streams of requests
– Correlations between request attributes
• An application of this methodology to
TPC-H
• Along the way (for TPC-H),
– iid can capture arrival time
characteristics
– Strict sequentiality is not always the case
Backup slides
Validating arrival time synthesis
LocSTREAM
1. Use Pr(stream length) to generate stream
lengths.
2. Use Pr(run length | stream length) to
generate run lengths for each stream length.
3. Generate start location for each run:
a) Use Pr(inter-stream jump dist.) to generate the start location of the first run in the stream.
b) Use Pr(intra-stream jump distance | this stream) to generate other runs' start location in this stream.
4. Use Pr(interference distance) to interleave
all streams.
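The four steps can be sketched as below. Several simplifications are assumptions of this sketch rather than part of the slides: every request has a fixed size, the intra-stream jump is applied from the end of the previous run, a single intra-stream jump distribution is shared by all streams, and interleaving only approximates the interference distance by spacing each stream's requests in a global ordering. All names are hypothetical.

```python
import random

def draw(distribution, rng):
    """Draw one value from a {value: probability} distribution."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def build_stream(previous_location, dists, rng, request_size=64):
    """Steps 1-3: request start locations of one stream."""
    stream_length = draw(dists["stream_length"], rng)                 # step 1
    location = previous_location + draw(dists["inter_stream_jump"], rng)  # 3a
    locations = []
    while len(locations) < stream_length:
        run_length = draw(dists["run_length"][stream_length], rng)    # step 2
        run_length = max(1, min(run_length, stream_length - len(locations)))
        for _ in range(run_length):
            locations.append(location)
            location += request_size        # contiguous requests within a run
        location += draw(dists["intra_stream_jump"], rng)             # step 3b
    return locations

def interleave(streams, interference_dist, rng):
    """Step 4 (approximate): order requests so that successive requests of a
    stream are separated by roughly the drawn interference distance."""
    keyed = []
    for index, locations in enumerate(streams):
        position = index
        for location in locations:
            keyed.append((position, index, location))
            position += 1 + draw(interference_dist, rng)
    keyed.sort()
    return [location for _, _, location in keyed]
```

In a fuller version the number of active streams would also bound how many streams are interleaved at any instant.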