Can you trust your experimental results?

Rigorous Benchmarking in
Reasonable Time
Tomas Kalibera, Richard Jones
University of Kent
ISMM, Seattle, June 2013
What do we want to establish?
 By comparing an old and a new system rigorously, establish:
 Is there a performance change?
 How large is the change? (e.g. the ratio new execution time / old execution time)
 What variation do we expect?
 How confident are we of the result?
 How many experiments must we carry out?
Uncertainty
 Computer systems are complex.
 Many factors influence performance:
 Some known.
 Some out of experimenter’s control.
 Some non-deterministic.
 Execution times vary.
 We need to design experiments and summarise results in
a repeatable and reproducible fashion.
Uncertainty should be reported!
[Bar chart, papers published in 2011: 122 papers surveyed; 67 report execution time; 59 report an execution time ratio; 47 ignored uncertainty.]
How were the experiments
performed?
 Not always obvious if experiments were repeated.
 Very few report that experiments repeat at more than
one level, e.g.
 Repeat executions (e.g. invocations of a JVM).
 Repeat measurements (e.g. iterations of an application).
 Number of repetitions: arbitrary or heuristic-based?
One benchmark… a suite… add invocations… and iterations… and heap sizes
 Good experimental methods take time: the costs multiply at each step.
A lost cause?
 Is statistically rigorous experimental methodology simply infeasible?
NO!
 With some initial one-off investment, we can cater for variation without excessive repetition (in most cases).
 Our contributions:
 A sound experimental methodology that makes best use of experiment time.
 How to establish how much repetition is needed.
 How to estimate error bounds.
The Challenge of Reasonable
Repetition
 Variation at several stages of a benchmark experiment
— iteration, execution, compilation…
 Controlled variables: platform, heap size or compiler options.
 Random variables: characterise their statistical properties.
 Uncontrolled variables: try to convert these to controlled or randomised (e.g. by randomising link order).
 The challenge:
 How to design efficient experiments given the random
variables present, and
 Summarise the results, with a confidence interval.
Our running example
 An experiment with 3 “levels” (though our technique is
general):
1. Repeat compilation to create a binary
— e.g. if code performance depends on layout.
2. Repeat executions of the same binary.
3. Repeat iterations of a benchmark.
Independent state
 Researchers are typically interested in steady state
performance.
 Initialised state: no significant initialisation overhead.
 Independent state: iteration times are (statistically)
independent and identically distributed (IID).
 Don’t repeat measurements before independence.
If measurements are not IID, the variance and
confidence interval estimates will be biased.
Independent state
 Does a benchmark reach an independent state?
After how many iterations?
 DaCapo/OpenJDK 7: ‘large’ and ‘small’ sizes
3 executions, 300 iterations/execution.
 Inspect run-sequence, lag and auto-correlation plots for
patterns indicating dependence.
[Figure: run-sequence and lag plots of iteration times (Time [s] against LAG 1 of Time [s]) for three executions.]
Independent state
RECOMMENDATION: Use this manual procedure just once to find how many iterations each benchmark, VM and platform combination requires to reach an independent state.
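To make the inspection concrete, here is a minimal Python sketch (the file name and plotting choices are assumptions, not the authors' tooling): it loads one execution's iteration times and draws the run-sequence, lag-1 and autocorrelation plots named above.

    # Sketch: plots for judging whether iteration times look IID.
    # Assumes iteration_times.txt holds one time (seconds) per line.
    import numpy as np
    import matplotlib.pyplot as plt

    def autocorr(x, max_lag):
        # Normalised sample autocorrelation at lags 0..max_lag.
        x = np.asarray(x, dtype=float) - np.mean(x)
        full = np.correlate(x, x, mode="full")[len(x) - 1:]
        return full[:max_lag + 1] / full[0]

    times = np.loadtxt("iteration_times.txt")

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
    ax1.plot(times)
    ax1.set(title="Run sequence", xlabel="Iteration", ylabel="Time [s]")
    ax2.scatter(times[:-1], times[1:], s=8)
    ax2.set(title="Lag plot", xlabel="LAG 1 of Time [s]", ylabel="Time [s]")
    ax3.bar(np.arange(31), autocorr(times, 30))
    ax3.set(title="Autocorrelation", xlabel="Lag")
    plt.tight_layout()
    plt.show()

Trends, clusters or a diagonal band in the lag plot, or autocorrelation well away from zero, all indicate the benchmark has not yet reached an independent state.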
Reached independent state?
DaCapo ‘small’
[Figure: per-benchmark verdicts for avrora9, bloat6, chart6, eclipse6/9, fop6/9, h29, hsqldb6, jython6/9, luindex6/9, lusearch9, pmd6/9, sunflow9, tomcat9, tradebeans9, tradesoap9, xalan6/9.]
Intel Xeon: 2 processors x 4 cores x 2-way HT
Reached independent state?
DaCapo ‘small’
[Figure: per-benchmark verdicts for the same benchmarks.]
AMD Opteron: 4 processors x 16 cores
Reached independent state?
DaCapo ‘large’
[Figure: per-benchmark verdicts for the same benchmarks.]
Intel Xeon: 2 processors x 4 cores x 2-way HT
Reached independent state?
DaCapo ‘large’
[Figure: per-benchmark verdicts for the same benchmarks.]
AMD Opteron: 4 processors x 16 cores
Some benchmarks don’t reach
independent state
 Many benchmarks do not reach an independent state in
a reasonable time.
 Most have strong auto-dependencies.
 Gradual drift in times and trends (increases and decreases);
abrupt state changes; systematic transitions.
 Choice of iteration significantly influences a result.
 Problematic for online algorithms that try to distinguish small differences when the noise is many times larger.
 Fortunately, trends tend to be consistent across runs.
Some benchmarks don’t reach
independent state
RECOMMENDATION: If a benchmark does not reach an independent state in a reasonable time, take the same iteration from each run.
Heuristics don’t do well
            Initialised  Independent  Harness  Georges
bloat             2            4          8        ∞
chart             3            4          1        -
eclipse           5            7          7        4
fop              10          180          7        8
hsqldb            6            6          8       15
jython            3            5          2        -
luindex          13            4          8        -
lusearch         10           85          7        8
pmd               7            4          1        -
xalan             6           13         15      139
 The heuristics both waste time (e.g. Georges: 139 iterations for xalan, ∞ for bloat) and stop far too early to be usable (e.g. 7 or 8 iterations where 85 or 180 are needed).
 All benchmarks reach an initialised state in reasonable time.
What to repeat?
 Run a benchmark to independence and then repeat a
number of iterations, collecting each result? or
 Repeatedly run a benchmark until it is initialised and then collect a single result?
 The first method saves experimentation time if
 variation between iterations > variation between executions,
 initialisation warmup + VM initialisation is large, and
 independence warmup is small.
Variation (%)   bloat6  eclipse9  lusearch9  xalan6  xalan9
Iteration         14.1       0.8        3.3     7.0     3.5
Execution          3.7       0.4       30.3     9.1     1.0
AMD Opteron: 4 processors x 16 cores
A clear but rigorous account
 Goal: We want to quantify a performance optimisation in
the form of an effect size confidence interval, e.g.
“we are 95% confident that system A is faster than
system B by 5.5% ± 2.5%”.
 We need to repeat executions and take multiple
measurements from each.
 For a given experimental budget, we want to obtain the
tightest possible confidence interval.
 Adding repetition at the highest level always increases precision, but it is often cheaper to add repetitions at lower levels.
Multi-level repetition
 How many repetitions to do at which levels?
1. Run an initial, dimensioning experiment
 Gather the cost of a repetition at each level.
 Iteration — time to complete an iteration.
 Execution — more expensive, need to get to an
independent state.
 Calculate optimal repetition counts for the real experiment.
2. Run the real experiment.
 Use the optimal repetition counts from the initial experiment.
 Calculate the effect size confidence interval.
Initial Experiment
Initial experiment
 Choose arbitrary repetition counts r_1, …, r_n.
 20 may be enough, 30 if possible, 10 if you must (e.g. if there are many levels).
 Then measure the cost of each level, e.g.
 c_1: time to get an iteration (iteration duration).
 c_2: time to get an execution (time to independent state).
 c_3: time to get a binary (build time).
 Also take the measurement times Y_(j_n, …, j_1), e.g.
 Y_(2,1,3) = time of the 3rd non-warmup iteration from the 1st execution of the 2nd binary.
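As an illustration, the three-level running example's measurements can be laid out as a nested array; the counts and costs below are made-up placeholders, not measured values.

    import numpy as np

    # Arbitrary initial repetition counts for the dimensioning experiment.
    r3, r2, r1 = 5, 10, 20        # binaries, executions, iterations

    # Y[j3, j2, j1] = time of the j1-th non-warmup iteration of the
    # j2-th execution of the j3-th binary (filled in by the harness).
    Y = np.zeros((r3, r2, r1))

    # Measured costs of one repetition at each level (seconds, illustrative).
    c1 = 2.5      # one iteration
    c2 = 30.0     # one execution, i.e. time to reach the independent state
    c3 = 120.0    # one binary, i.e. build time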
Initial Experiment
Variance estimators
 First calculate n biased estimators S_1^2, …, S_n^2, where S_i^2 is the pooled sample variance of the level-i means around their level-(i+1) means.
 Then the unbiased estimators T_i^2 iteratively: T_1^2 = S_1^2, and T_i^2 = S_i^2 − S_(i−1)^2 / r_(i−1) for i = 2, …, n.
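A sketch of these estimators for a nested array Y shaped (r_n, …, r_1), following the definitions summarised above; this is an illustrative implementation, not the authors' code, so check it against the paper.

    import numpy as np

    def variance_estimators(Y):
        # Y has shape (r_n, ..., r_1): highest level first.
        # Returns ([S_1^2, ..., S_n^2], [T_1^2, ..., T_n^2]).
        n = Y.ndim
        r = Y.shape[::-1]                  # r[0] = r_1, ..., r[n-1] = r_n
        S2 = []
        means = Y
        for i in range(n):
            upper = means.mean(axis=-1)    # means one level up
            dev = means - upper[..., None]
            groups = int(np.prod(means.shape[:-1]))  # groups pooled over
            S2.append((dev ** 2).sum() / (groups * (means.shape[-1] - 1)))
            means = upper
        T2 = [S2[0]]                       # T_1^2 = S_1^2
        for i in range(1, n):
            T2.append(S2[i] - S2[i - 1] / r[i - 1])  # T_i^2 recurrence
        return S2, T2

S2[-1] is the S_n^2 used later for the confidence interval; T2 feeds the optimal repetition counts on the next slide.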
Real Experiment
Optimal repetition counts
 The optimal repetition counts to be used in the real experiments are r_1, …, r_(n−1).
 We don’t calculate r_n, the repetition count for the highest level:
 r_n can always be increased for more precision.
 Calculate the variance estimators S_n^2 for the real experiment as before, but using the optimal repetition counts r_1, …, r_(n−1) and the measurements from the real experiment.
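A sketch of that calculation, assuming the cost/variance trade-off formula r_i = ceil(sqrt(c_(i+1) · T_i^2 / (c_i · T_(i+1)^2))) derived in the paper (quoted from memory here, so verify its exact form there):

    import math

    def optimal_counts(c, T2):
        # c[i] and T2[i]: cost and unbiased variance estimate of level i+1,
        # both taken from the dimensioning experiment.
        # Returns [r_1, ..., r_{n-1}]; r_n stays open-ended.
        return [max(1, math.ceil(math.sqrt((c[i + 1] * T2[i]) /
                                           (c[i] * T2[i + 1]))))
                for i in range(len(c) - 1)]

    # e.g. optimal_counts([c1, c2, c3], T2) -> [r1, r2] for the 3-level example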
Real Experiment
Confidence intervals
 Asymptotic confidence interval with confidence (1 − α): the overall mean ± h, where the half-width h = t_(1−α/2, ν) · sqrt(S_n^2 / r_n) and t_(1−α/2, ν) is the (1−α/2)-quantile of the t-distribution with ν = r_n − 1 degrees of freedom.
 See the ISMM’13 paper for details of constructing confidence intervals
of execution time ratios.
 See our technical report for proofs and gory details.
Confidence interval for
execution time ratios
 Confidence interval due to Fieller (1954).
 Inputs: the average execution times from the old and new systems, the variance estimators S_n^2 and S′_n^2, and the half-widths h, h′, as before.
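For reference, one common form of Fieller's interval for the ratio of two means m and m′ with half-widths h and h′, assuming the two estimates are independent (the paper gives the exact variant used):

    ( m·m′ ± sqrt( m^2·h′^2 + m′^2·h^2 − h^2·h′^2 ) ) / ( m′^2 − h′^2 )

The interval is only meaningful when m′^2 > h′^2, i.e. when the denominator's confidence interval excludes zero.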
In practice
 For each
benchmark/VM/platform…
 Conduct a dimensioning
experiment to establish the
optimal repetition counts
for each but the top level
of the real experiment.
 Redimension only if the benchmark/VM/platform changes.
DaCapo (revisited)
                              bloat6  lusearch9  xalan6  xalan9
c1 (s)                          35.5        1.7    10.8     6.7
c2 (s)                         110.0       12.3     3.4    30.2
r1                                10          1       2      15
Half-interval, optimal (%)      14.0        3.4     7.2     3.5
Half-interval, original (%)     14.1        3.3     7.0     3.5
AMD Opteron: 4 processors x 16 cores
 The confidence half-intervals using optimal repetition
counts correspond closely to those obtained by running
large numbers of executions (30) and iterations (40).
 But repetition counts are much lower.
 E.g. lusearch: r_1 = 1, so time is better spent repeating executions.
Conclusions
 Researchers should provide measures of variation when
reporting results.
 DaCapo and SPEC CPU benchmarks need very different
repetition counts on different platforms before they
reach an initialised or independent state.
 Iteration execution times are often strongly auto-dependent: for these, automatic detection of steady state is not applicable; such heuristics can waste time or mislead.
 A one-off (per benchmark/VM/platform) dimensioning experiment can provide the optimal counts for repetition at each level of the real experiments.
RECOMMENDATION: Benchmark
developers should include our
dimensioning methodology as a
one-off, per-system configuration
requirement.
Code layout experiments
What’s of interest?
 Mean execution times
 Minimum threshold for ratio of execution times
 Only interested in ‘significant’ performance changes
 Improvements in systems research are often small, e.g. 10%.
 Many factors influence performance
 E.g. memory placement, randomised compilation algorithms,
JIT compiler, symbol names…
 [Mytkowicz et al., ASPLOS 2009; Gu et al., Component and middleware performance workshop, 2004]
 Randomisation to avoid measurement bias
 E.g. Stabiliser tool [Curtsinger & Berger, UMass TR, 2012]
ISMM, Seattle, June 2013
47
Current best practice
 Based on 2-level hierarchical experiments
 Repeat measurements until standard deviation of last few
measurements is small enough.
 Quantify changes using a visual or statistical significance
test
 [Georges et al, OOPSLA 2007; PhD 2008]
 Problems
 Two levels are not always appropriate
 Null hypothesis significance tests are deprecated in other
sciences
 Visual tests are overly conservative
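A sketch of the kind of stopping heuristic meant here; the window size, threshold and names are assumptions for illustration, not Georges et al.'s exact procedure.

    import statistics

    def run_until_stable(measure, k=4, cov_threshold=0.02, max_iters=300):
        # Iterate until the coefficient of variation of the last k
        # measurements drops below cov_threshold (or we give up).
        # `measure` runs one benchmark iteration and returns its time.
        times = []
        while len(times) < max_iters:
            times.append(measure())
            window = times[-k:]
            if len(window) == k:
                cov = statistics.stdev(window) / statistics.mean(window)
                if cov < cov_threshold:
                    break
        return times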
Null hypothesis significance tests
 Null hypothesis: “the 2 systems have the same
performance”
 Tests if the null hypothesis can be rejected: “it is unlikely
that the systems have the same performance”
 Student’s t-test
 Visual test
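For concreteness, a minimal example of such a test on two made-up samples of per-execution times, using SciPy's Welch variant:

    import numpy as np
    from scipy import stats

    old = np.array([2.31, 2.28, 2.35, 2.30, 2.33])   # hypothetical times (s)
    new = np.array([2.21, 2.25, 2.19, 2.24, 2.22])

    t, p = stats.ttest_ind(old, new, equal_var=False)  # Welch's t-test
    print(f"t = {t:.2f}, p = {p:.4f}")
    # Small p: reject "same performance" -- but see the caveats that follow.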
Visual test
 Construct confidence intervals
 Do they overlap?
 If not, it is unlikely that the
systems have the same
performance
 [If only slight overlap — centre
not covered by other CI — fall
back to statistical test]
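A sketch of this decision rule, with confidence intervals given as (low, high) pairs:

    def visual_test(ci_a, ci_b):
        (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
        if hi_a < lo_b or hi_b < lo_a:
            return "no overlap: unlikely the systems perform the same"
        mid_a = (lo_a + hi_a) / 2
        mid_b = (lo_b + hi_b) / 2
        if lo_b <= mid_a <= hi_b or lo_a <= mid_b <= hi_a:
            return "centres covered: treat as inconclusive"
        return "slight overlap: fall back to a statistical test"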
What’s wrong with this?
1. It does not tell us what we want to know
 Only if there is a performance change
 We could also report the ratio of sample means
 But we still don’t know how much of this change is due to
uncertainty
2. The decision is affected by sample size
 The larger the sample, the more likely even a small and meaningless change is to be flagged as significant.
 Its limitations have been known for 70 years
 Deprecated in many fields: statistics, psychology, medicine,
biology, chemistry, sociology, education, ecology…
What’s wrong with this (cont.)?
3. Both tests use parametric methods that violate their
assumptions
 Performance measurements are not usually normally
distributed
 Multi-modal, long tails to the right
 Good practice to check if data is close to normal
 Robust methods are used in some fields
 Should at least make assumptions clear
 That using Student’s t-test is OK…
 …Often it is OK
Two methods
 Statistical model of random effects in n-way classification
 Use this model to construct effect size confidence interval
for the ratio of the means of execution time.
1. A parametric method based on asymptotic normality
2. A non-parametric method based on statistical
simulation (‘bootstrap’)
Quantifying the performance (1)
 Parametric method
 Use the same number of repetitions for the old (OY) and
new (NY) system.
 Report a (1 − α) confidence interval (e.g. α = 0.05 for a 95% CI).
 t_(α/2, ν) denotes the α/2-quantile of the t-distribution with ν = n_(n+1) − 1 degrees of freedom.
Quantifying performance (2)
 Bootstrap method
1. Perform many simulations (1000 or more if there is time)
 Use real data within each simulated step
2. Randomly choose the values to use at each level
 Replacement at all levels seems safe
3. Calculate many sample means from these
 Asymptotically normal due to the Central Limit Theorem
 Form a (1-a) CI by using the a/2 and 1-a/2 sample
quantiles
 E.g. order the values and use the 25th and 975th
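A sketch of this hierarchical bootstrap for two nested data sets shaped like the arrays used earlier; the ratio direction and all names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def resample(a):
        # Resample with replacement at this level, then recurse below it.
        picked = a[rng.integers(0, a.shape[0], a.shape[0])]
        if picked.ndim == 1:
            return picked
        return np.stack([resample(group) for group in picked])

    def bootstrap_ratio_ci(old_Y, new_Y, sims=1000, alpha=0.05):
        # Percentile CI for mean(new)/mean(old); every simulation
        # resamples at all levels of both data sets.
        ratios = np.array([resample(new_Y).mean() / resample(old_Y).mean()
                           for _ in range(sims)])
        return np.quantile(ratios, [alpha / 2, 1 - alpha / 2])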
Parametric vs. bootstrap
 Bootstrap is more robust than parametric method
 Uses fewer assumptions
 Does not depend on underlying distribution
 No need to check if data is reasonably close to normal
 Can be used with other metrics, e.g. medians
 Parametric method is more confident
 Narrower confidence intervals
 More likely to find a significant difference
Download