Lies, Damn Lies and Benchmarks
Are your benchmark tests reliable?
Typical Computer Systems Paper
Abstract: What this paper contains.
– Most readers will read only this.
Introduction: Present a problem.
– The universe cannot go on if the problem persists.
Related Work: Show the work of competitors.
– They stink.
Solution: Present the suggested solution.
– We are the best.
Typical Paper (Cont.)
Technique: Go into details.
– Many drawings and figures.
Experiments: Prove our point; evaluation methodology.
– Which benchmarks adhere to my assumptions?
Results: Show how great the enhancement is.
– The objective benchmarks agree that we are the best.
Conclusions: Highlights of the paper.
– Some readers will read this in addition to the abstract.
SPEC
SPEC is the Standard Performance Evaluation Corporation.
– Legally, SPEC is a non-profit corporation registered in California.
SPEC's mission: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems.
"SPEC CPU2000 is the next-generation industry-standardized CPU-intensive benchmark suite."
– Composed of 12 integer (CINT2000) and 14 floating-point (CFP2000) benchmarks.
Some Conference Statistics
[Figure: number of papers per conference – ISCA 2001, Micro 2001, HPCA 2002, ISCA 2002, Micro 2002, HPCA 2003, ISCA 2003.]
Number of papers published: 209
Papers that used a version of SPEC: 138 (66%)
Earliest conference deadline: December 2000
SPEC CPU2000 announced: December 1999
Partial use of CINT2000
[Figure: percent of papers by number of CINT2000 benchmarks used per paper (0, 1-6, 7-11, 12), for ISCA 2001, Micro 2001, HPCA 2002, ISCA 2002, Micro 2002, HPCA 2003, ISCA 2003, and in total.]
Why not use it all?
It seems that many papers do not use all the benchmarks of the suite.
Selected excuses were:
– "The chosen benchmarks stress the problem …"
– "Several benchmarks couldn't be simulated …"
– "A subset of CINT2000 was chosen …"
– "… select benchmarks from CPU2000 …"
– "More benchmarks wouldn't fit into our displays …"
Omission Explanation
Only roughly a third of the papers (34/108) present any reason at all.
Many reasons are not so convincing.
– Are the claims in the previous slide persuasive?
[Figure: number of papers per conference, broken down into Full Use, Reason Given, and No Reason.]
What has been omitted
Possible reasons for the omissions:
– eon is written in C++.
– gap calls the ioctl system call, which is a device-specific call.
– crafty uses a 64-bit word.
– perlbmk has problems with 64-bit processors.
[Figure: percent of usage per CINT2000 benchmark – gzip, vpr, parser, gcc, mcf, vortex, twolf, bzip2, perlbmk, crafty, gap, eon.]
CINT95
Still widespread even though it was retired in June 2000.
Smaller suite (8 benchmarks vs. 12).
Over 50% full use, but it had already been around for at least 3 years.
Only 5 papers out of 36 explain the partial use.
[Figure: percent of papers by number of benchmarks used per paper – 0, 1-6 (1-4 for CINT95), 7-11 (5-7), 12 (8) – comparing CINT95 (1999-2000) with CINT2000 (2001-2002).]
Use of CINT2000
The use of CINT has been increasing over the years.
New systems are benchmarked with old tests.
[Figure: percent of papers using CINT2000 and fully using CINT2000, 2001-2003 (actual) and 2004-2005 (projected).]
Amdahl's Law
$F_{\text{enhanced}}$ is the fraction of the benchmarks that were enhanced. The speedup is:

$$\text{Speedup} = \frac{\text{CPU Time}_{\text{old}}}{\text{CPU Time}_{\text{new}}}
= \frac{\text{CPU Time}_{\text{old}}}{\text{CPU Time}_{\text{old}}\,(1 - F_{\text{enhanced}}) + \text{CPU Time}_{\text{old}}\, F_{\text{enhanced}} \cdot \frac{1}{\text{speedup}_{\text{enhanced}}}}
= \frac{1}{(1 - F_{\text{enhanced}}) + \frac{F_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}}$$

Example: if we have a way to improve just the gzip benchmark by a factor of 10, what fraction of usage must gzip be to achieve a speedup of 3 (300%)?

$$3 = \frac{1}{(1 - F_{\text{enhanced}}) + \frac{F_{\text{enhanced}}}{10}} \;\Rightarrow\; F_{\text{enhanced}} = \frac{20}{27} \approx 74\%$$
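A minimal sketch of this calculation in Python (the function names and the algebraic inversion are mine, not from the slides):

```python
# Amdahl's Law: overall speedup when a fraction F of the workload
# is accelerated by a local factor s.
def amdahl_speedup(f_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / speedup_enhanced)

# Invert Amdahl's Law: what fraction must be enhanced to reach a
# target overall speedup, given the local speedup factor?
def required_fraction(target: float, speedup_enhanced: float) -> float:
    # From target = 1 / ((1 - F) + F/s), solving for F gives:
    # F = (1 - 1/target) / (1 - 1/s)
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / speedup_enhanced)

# The gzip example from the slide: a 10x improvement on gzip alone
# must cover 20/27 ~ 74% of the total time to triple overall speed.
f = required_fraction(target=3.0, speedup_enhanced=10.0)
print(f"required fraction: {f:.4f}")            # 0.7407...
print(f"check: {amdahl_speedup(f, 10.0):.2f}x")  # 3.00x
```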
Breaking Amdahl's Law
"The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used."
Only the full suite can accurately gauge the enhancement.
It is possible that the other benchmarks:
– produce similar results.
– degrade performance.
– are invariant to the enhancement. Even in this case, the published results are too high according to Amdahl's Law (see the sketch below).
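To make that last point concrete, a sketch with made-up numbers: if a paper reports the speedup measured only on the benchmarks it ran, and the omitted benchmarks are invariant to the enhancement, the true whole-suite speedup is lower:

```python
# Hypothetical numbers: 8 of 12 benchmarks are run and each speeds up
# 2x; the 4 omitted benchmarks are invariant (1x). Equal weights assumed.
ran, omitted = 8, 4
reported = 2.0              # speedup measured on the 8 benchmarks that ran
f = ran / (ran + omitted)   # fraction of the suite that was enhanced

# Amdahl's Law over the whole suite:
whole_suite = 1.0 / ((1.0 - f) + f / reported)
print(f"reported on subset: {reported:.2f}x")     # 2.00x
print(f"whole suite:        {whole_suite:.2f}x")  # 1.50x
```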
Tradeoffs
What about papers that offer performance tradeoffs?
– Performance tradeoffs appear in more than 40% of the papers.
– An average paper contains just 8 of the 12 tests.
What do we assume about the missing results?
"I shouldn't have left eon out"
Besides SPEC
Categories of benchmarks:
– Official benchmarks like SPEC; there are also official benchmarks from non-vendor sources.
• They will not always concentrate on the points important for your usage.
– Traces – real users whose activities are logged and kept.
• An improved (or worsened) system may change the users' behavior.
Besides SPEC (Cont.)
– Microbenchmarks – test just an isolated component of a system (a sketch follows this list).
• Using multiple microbenchmarks will not test the interaction between the components.
– Ad-hoc benchmarks – run a bunch of programs that seem interesting.
• If you suggest a way to compile Linux faster, Linux compilation can be a good benchmark.
– Synthetic benchmarks – write a program to test yourself.
• You can stress your point.
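As a minimal illustration of a microbenchmark (the operation and sizes chosen here are arbitrary): it times one isolated operation in a loop and says nothing about how that operation interacts with the rest of a real workload.

```python
import timeit

# Microbenchmark: time a single isolated operation (a dict lookup),
# separated from everything else a real workload would do.
setup = "d = {i: i for i in range(10_000)}"
stmt = "d[5000]"

# timeit runs the statement `number` times and returns total seconds.
runs = 1_000_000
seconds = timeit.timeit(stmt, setup=setup, number=runs)
print(f"{seconds / runs * 1e9:.1f} ns per lookup")
```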
Whetstone Benchmark
Historically, it is the first synthetic microbenchmark.
The original Whetstone benchmark was designed in the 60's; the first practical implementation appeared in 1972.
– It was named after the small town of Whetstone, where it was designed.
Designed to measure the execution speed of a variety of FP instructions (+, *, sin, cos, atan, sqrt, log, exp). It contains a small loop of FP instructions.
The majority of its variables are global; hence it will not show the RISC advantage, where a large number of registers enhances the handling of local variables. A Whetstone-style loop is sketched below.
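A toy, Whetstone-style loop in Python (this is not the actual Whetstone code; the constants and operation mix are made up to illustrate the idea of a global-heavy FP loop):

```python
import math

# Globals on purpose: like Whetstone, this stresses global-variable
# access rather than register-resident locals.
x1, x2, x3, x4 = 1.0, -1.0, -1.0, -1.0

def fp_kernel(iterations: int) -> None:
    global x1, x2, x3, x4
    t = 0.499975  # arbitrary damping constant, in Whetstone's spirit
    for _ in range(iterations):
        # A small mix of +, *, sin, cos, atan, sqrt, log, exp.
        x1 = (x1 + x2 + x3 - x4) * t
        x2 = math.sqrt(abs(x1)) + math.exp(min(x2, 1.0))
        x3 = math.atan(x3) + math.log(abs(x4) + 1.0)
        x4 = math.sin(x4) * math.cos(x1)

fp_kernel(100_000)
print(x1, x2, x3, x4)
```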
The Andrew benchmark
The Andrew benchmark was suggested in 1988.
In the early 90's, the Andrew benchmark was one of the popular non-vendor benchmarks for file system efficiency.
The Andrew benchmark (sketched below):
– Copies a directory hierarchy containing the source code of a large program.
– "stat"s every file in the hierarchy.
– Reads every byte of every copied file.
– Compiles the code in the copied hierarchy.
Does this reflect reality? Who does work like this?
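A rough sketch of those four phases in Python (the paths and the compile command are placeholders; the real benchmark used a specific source tree with its own build setup):

```python
import os
import shutil
import subprocess

SRC = "/path/to/source-tree"   # placeholder: a large program's source
DST = "/path/to/copy"          # placeholder: where the copy goes

# Phase 1: copy the directory hierarchy.
shutil.copytree(SRC, DST)

# Phase 2: stat every file in the copied hierarchy.
for root, _dirs, files in os.walk(DST):
    for name in files:
        os.stat(os.path.join(root, name))

# Phase 3: read every byte of every copied file.
for root, _dirs, files in os.walk(DST):
    for name in files:
        with open(os.path.join(root, name), "rb") as f:
            while f.read(64 * 1024):
                pass

# Phase 4: compile the code in the copied hierarchy.
subprocess.run(["make"], cwd=DST, check=True)
```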
Kernel Compilation
Maybe a "real" job can be more representative?
Measure the compilation of the Linux kernel.
The compilation reads large memory areas only once. This reduces the influence of cache efficiency.
– The influence of the L2 cache will be drastically reduced.
Benchmarks' Contribution
In 1999, Mogul presented statistics showing that while hardware is usually measured by SPEC, no standard is popular when it comes to operating system code.
Distributed systems are commonly benchmarked with NAS.
In 1993, Chen & Patterson wrote: "Benchmarks do not help in understanding system performance".