Variability in Architectural Simulations of Multithreaded Workloads
Jason Bosko
February 27, 2008

Based on "Variability in Architectural Simulations of Multi-threaded Workloads" by Alaa R. Alameldeen and David A. Wood, appearing in the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA-9)
Outline
• What is variability?
• How does variability affect results?
• Examples of variability
• How can we account for variability?
What is variability?
• Variability is "the differences between performance estimates obtained from multiple runs of the same workload"
• Variability is bad because it can lead to incorrect conclusions about performance
• What are the two types of variability?
Time vs. Space Variability
• Time variability occurs when different phases of the same program result in different performance
• Space variability occurs when the same program, started from the same initial conditions, exhibits different performance on each run
How can time variability occur?
• Workloads have different execution phases
• Even if the phases are "identical", the state of the caches will be different, threads may be scheduled differently, etc.
How can space variability occur?
• Small-scale variations due to interrupt timing, bus contention, and memory access delays can have a domino effect on performance
• Although these variations can be small, they can cause threads to be scheduled differently, locks to be acquired in different orders, transactions to fail to complete, etc.
Does it really matter for real systems?
[Figure: measurements on a real system, showing both time variability and space variability]

Which variability matters most?
• Time variability affects the "true" performance of a workload
• Space variability affects the relative performance of multiple runs of a workload
• Simulations are usually used to compare a new idea to some base idea, so it is the relative performance of multiple runs that matters most!
Slower memory is better???
[Figure: simulated runtimes plotted against DRAM latency for multiple runs]
• Based on these points, you could conclude that a longer DRAM latency improves performance!
• Most of the focus is on space variability: slightly changing the DRAM latency has a curious effect on overall performance!
How do we model variability?
• Use the Simics simulator
  • Good: has checkpointing, can be extended
  • Bad: it is deterministic, so space variability will not occur!
• Introduce small, random perturbations in L2-cache miss latencies (sketched below)
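A minimal sketch of the perturbation idea, assuming a uniform nudge of a few cycles on a nominal miss penalty; the constants, seed, and function name are illustrative, not values from the paper:

```python
import random

# Sketch: Simics is deterministic, so a tiny random component is added to each
# L2-cache miss latency to recover space variability. The base latency and
# jitter range below are assumed values for illustration only.
L2_MISS_LATENCY = 80          # nominal miss penalty in cycles (assumed)
MAX_JITTER = 2                # maximum perturbation in cycles (assumed)

def perturbed_l2_miss_latency(rng: random.Random) -> int:
    """Return the nominal L2 miss latency plus a small uniform random nudge."""
    return L2_MISS_LATENCY + rng.randint(-MAX_JITTER, MAX_JITTER)

if __name__ == "__main__":
    rng = random.Random(42)   # a different seed per run yields a different interleaving
    print([perturbed_l2_miss_latency(rng) for _ in range(10)])
```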
Wrong Conclusion Ratio (WCR)
• The percentage of comparison experiment pairs that reach an incorrect conclusion
• The WCR can be used to estimate the probability that a wrong conclusion would be reached about an experiment if variability is ignored (example below)
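A rough sketch of how a WCR could be computed, assuming per-run runtimes for a baseline and an enhanced configuration, comparing every single-run pair against the conclusion given by the means; the numbers are made up:

```python
import itertools
import numpy as np

# Illustrative per-run runtimes (cycles) for two configurations; not paper data.
baseline = np.array([1.02e9, 0.99e9, 1.01e9, 1.05e9, 0.98e9])
enhanced = np.array([0.97e9, 1.00e9, 0.96e9, 1.03e9, 0.95e9])

# "True" conclusion: compare the mean runtimes over all runs.
true_enhanced_wins = enhanced.mean() < baseline.mean()

# WCR: fraction of single-run comparison pairs that contradict the true conclusion.
wrong = sum(
    (e < b) != true_enhanced_wins
    for b, e in itertools.product(baseline, enhanced)
)
wcr = 100.0 * wrong / (len(baseline) * len(enhanced))
print(f"WCR = {wcr:.1f}% of pairwise comparisons reach the wrong conclusion")
```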
Proof of variability: cache design
[Figure: runtimes with error bars for different L2 cache configurations]
• The associativity of the L2 cache was modified, and each experiment was run 20 times from the same checkpoint.
• Overlapping error bars = significant chance of making a wrong conclusion!
Variability across benchmarks
• Coefficient of variation = 100 times the ratio of the standard deviation to the mean; this is used to estimate the magnitude of space variability
• Range of variability = the difference between the maximum and minimum runtimes, as a percentage of the mean (both computed in the sketch below)
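Both metrics follow directly from the definitions above; a small sketch over made-up runtimes of repeated runs of one benchmark:

```python
import numpy as np

# Runtimes (cycles) of repeated runs of one benchmark; illustrative numbers.
runtimes = np.array([1.00e9, 1.04e9, 0.97e9, 1.02e9, 0.99e9])

mean = runtimes.mean()
std = runtimes.std(ddof=1)            # sample standard deviation

coefficient_of_variation = 100.0 * std / mean
range_of_variability = 100.0 * (runtimes.max() - runtimes.min()) / mean

print(f"coefficient of variation = {coefficient_of_variation:.2f}%")
print(f"range of variability     = {range_of_variability:.2f}%")
```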
Time variability, revisited
• The effect of time variability on simulation results is still important
[Figure: results with both time and space variability vs. time variability only]
What can we do about it?
• Intuition?
  • Allow each run to execute more transactions/cycles
  • Perform multiple runs and use the average
• But this is time consuming!
  • Consider that modeling a 16-processor system with OOO processors has a 24,000x slowdown (according to Alameldeen and Wood)
• How can we find the right balance between being confident about a conclusion and running simulations for an eternity?
Confidence Intervals
• A confidence interval is based on a desired confidence probability and the number of runs used: a tighter confidence interval is obtained by lowering the desired confidence probability or increasing the number of runs
• If you use a confidence probability of p and the confidence intervals do not overlap, then there is at most a 1-p chance of reaching a wrong conclusion
• Confidence intervals place a very conservative bound on the probability of making a wrong conclusion (illustrated below)
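A minimal sketch of the interval-overlap check, using a Student-t confidence interval for each configuration's mean runtime; the runtimes and the 95% confidence level are illustrative choices:

```python
import numpy as np
from scipy import stats

def confidence_interval(runtimes, confidence=0.95):
    """Return (low, high) for the mean of `runtimes` at the given confidence."""
    runtimes = np.asarray(runtimes, dtype=float)
    mean = runtimes.mean()
    sem = stats.sem(runtimes)                      # standard error of the mean
    half_width = stats.t.ppf((1 + confidence) / 2, df=len(runtimes) - 1) * sem
    return mean - half_width, mean + half_width

# Made-up per-run runtimes (cycles) for two configurations.
baseline = [1.02e9, 0.99e9, 1.01e9, 1.05e9, 0.98e9, 1.03e9]
enhanced = [0.95e9, 0.97e9, 0.94e9, 0.98e9, 0.96e9, 0.93e9]

lo_b, hi_b = confidence_interval(baseline)
lo_e, hi_e = confidence_interval(enhanced)
overlap = not (hi_e < lo_b or hi_b < lo_e)
print(f"baseline CI: ({lo_b:.3e}, {hi_b:.3e})")
print(f"enhanced CI: ({lo_e:.3e}, {hi_e:.3e})")
print("intervals overlap: cannot safely draw a conclusion" if overlap
      else "no overlap: at most a 5% chance of a wrong conclusion")
```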
Determining sample size
• t is calculated from the desired confidence probability, and r is the minimum relative error required
• S/Y is the ratio of the true population standard deviation to the mean, which is unknown and must be estimated based on prior knowledge (any problem with this?); the resulting bound is worked through below
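From the variables the slide names, the usual sample-size bound is n ≥ (t · S / (r · Ȳ))², i.e. enough runs that the confidence-interval half-width t·S/√n falls within a relative error r of the mean; a sketch, assuming the S/Ȳ ratio is guessed from prior runs and iterating because t itself depends on n:

```python
from scipy import stats

# Sketch of the sample-size bound implied by the slide's variables: pick n so
# that the half-width t * S / sqrt(n) is within a relative error r of the mean
# Y, i.e. n >= (t * S / (r * Y))**2. The S/Y estimate is an assumed value.
def required_runs(s_over_y: float, rel_error: float, confidence: float = 0.95) -> int:
    n = 2
    while True:
        t = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # t depends on n
        if n >= (t * s_over_y / rel_error) ** 2:
            return n
        n += 1

# e.g. a guessed S/Y of 5%, and we want the mean known to within 2%:
print(required_runs(s_over_y=0.05, rel_error=0.02))
```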
Hypothesis Testing
• Consider testing the effect of ROB size (32 vs. 64)
• Actually test the null hypothesis that there is no significant difference between ROB sizes; we want to prove this wrong
• The significance level (α) corresponds to our wrong-conclusion probability; we want the significance of our hypothesis to be very low, meaning that the probability of making a wrong conclusion is very low
• y: sample mean runtimes; s: sample variances
• The goal is to find a value of n that reduces the area of the rejection region to our desired significance level! (example below)
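A sketch of the ROB-size test using a standard two-sample (Welch) t-test as a stand-in for the slide's procedure; the runtimes and the 0.05 significance level are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Made-up per-run runtimes (cycles) for the two ROB configurations.
rob32 = np.array([1.10e9, 1.12e9, 1.08e9, 1.11e9, 1.09e9, 1.13e9])
rob64 = np.array([1.04e9, 1.06e9, 1.03e9, 1.07e9, 1.05e9, 1.02e9])

alpha = 0.05                               # desired significance level
t_stat, p_value = stats.ttest_ind(rob32, rob64, equal_var=False)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject 'no difference'; "
          "the ROB-size effect is statistically significant")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject 'no difference'; "
          "more runs (larger n) may be needed")
```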
What about time variability again?
• Use analysis of variance (ANOVA) to determine whether runs from different starting points are statistically equivalent
• Based on ANOVA results, it may be sufficient to use all runs from a single starting point, or it may be necessary to choose multiple starting points in addition to using multiple runs! (checked in the sketch below)
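A sketch of that check using one-way ANOVA over runs started from several checkpoints; the checkpoint labels and runtimes are made up:

```python
import numpy as np
from scipy import stats

# Per-run runtimes (cycles) for runs started from three different checkpoints;
# illustrative numbers only.
checkpoint_a = np.array([1.00e9, 1.03e9, 0.98e9, 1.01e9, 1.02e9])
checkpoint_b = np.array([1.01e9, 0.99e9, 1.04e9, 1.00e9, 1.02e9])
checkpoint_c = np.array([0.97e9, 1.02e9, 1.00e9, 1.03e9, 0.99e9])

f_stat, p_value = stats.f_oneway(checkpoint_a, checkpoint_b, checkpoint_c)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: starting point matters; "
          "sample from multiple checkpoints as well as multiple runs")
else:
    print(f"p = {p_value:.3f}: no significant checkpoint effect; "
          "multiple runs from one checkpoint may suffice")
```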
Thoughts
• How much confidence is enough? Theoretically we'd like to be 100% confident, but that would take too much time to simulate!
• As multi-threaded benchmarks expand, will variability become too great and make simulations too unreliable?