Variability in Architectural Simulations of Multithreaded Workloads
Jason Bosko
February 27, 2008
Based on “Variability in Architectural Simulations of Multi-threaded Workloads” by Alaa R. Alameldeen and David A. Wood, appearing in the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA-9)

Outline
What is variability?
How does variability affect results?
Examples of variability
How can we account for variability?

What is variability?
Variability is “the differences between performance estimates obtained from multiple runs of the same workload”
Variability is bad because it can lead to incorrect conclusions about performance
What are the two types of variability?

Time vs. Space Variability
Time variability occurs when different phases of the same program result in different performance
Space variability occurs when the same program, run multiple times from the same initial conditions, exhibits different performance on each run

How can time variability occur?
Workloads have different execution phases
Even if the phases are “identical”, the state of the caches will be different, threads may be scheduled differently, etc.

How can space variability occur?
Small-scale variations due to interrupt timing, bus contention, and memory access delays can have a domino effect on performance
Although these variations can be small, they can cause threads to be scheduled differently, locks to be acquired in different orders, transactions to fail to complete, etc.

Does it really matter for real systems?
[Figures: one labeled “Time variability!”, one labeled “Space variability!”]

Which variability matters most?
Time variability
  Affects the “true” performance of a workload
Space variability
  Affects the relative performance of multiple runs of a workload
Simulations are usually used to compare a new idea to some base idea, so it is the relative performance of multiple runs that matters most!

Slower memory is better???
Based on these points, you could conclude that a longer DRAM latency improves performance!
Most of the focus is on space variability: slightly changing the DRAM latency has a curious effect on overall performance!

How do we model variability?
Use the Simics simulator
  Good: Has checkpointing, can be extended
  Bad: It is deterministic, so space variability will not occur!
Introduce small, random perturbations in L2-cache miss latencies

Wrong Conclusion Ratio (WCR)
The percentage of comparison experiment pairs that reach an incorrect conclusion
The WCR can be used to estimate the probability that a wrong conclusion would be made about an experiment if variability is ignored

Proof of variability: cache design
The associativity of the L2 cache was modified, and each experiment was run 20 times from the same checkpoint.
Overlapping error bars = significant chance of making a wrong conclusion!

Variability across benchmarks
Coefficient of variation = 100 times the ratio of the standard deviation to the mean; this is used to estimate the magnitude of space variability
Range of variability = the difference between the maximum and minimum runtimes, as a percentage of the mean

Time variability, revisited
The effect of time variability on simulation results is still important
[Figures: one labeled “Both time and space variability”, one labeled “Time variability only”]

What can we do about it?
Intuition?
  Allow each run to execute more transactions/cycles
  Perform multiple runs and use the average
But this is time consuming! Consider that modeling a 16-processor system with out-of-order (OOO) processors has a 24,000x slowdown (according to Alameldeen and Wood)
How can we find the right balance between being confident about a conclusion and running simulations for an eternity?

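The three metrics defined on the earlier slides (wrong conclusion ratio, coefficient of variation, range of variability) can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the paper. The function names are hypothetical, the sample (n - 1) standard deviation is an assumption, and the "correct" conclusion for the WCR is assumed to be the one given by comparing the means of all runs.

```python
import statistics


def coefficient_of_variation(runtimes):
    """100 * (standard deviation / mean).

    Assumption: the sample (n - 1) standard deviation is used.
    """
    return 100.0 * statistics.stdev(runtimes) / statistics.mean(runtimes)


def range_of_variability(runtimes):
    """(max - min) runtime, expressed as a percentage of the mean."""
    return 100.0 * (max(runtimes) - min(runtimes)) / statistics.mean(runtimes)


def wrong_conclusion_ratio(runs_a, runs_b):
    """Percentage of single-run pairs (a, b) that contradict the means.

    Assumption: the "correct" conclusion is whichever configuration has the
    lower mean runtime over all runs. Exact ties are not treated specially.
    """
    a_is_faster = statistics.mean(runs_a) < statistics.mean(runs_b)
    pairs = [(a, b) for a in runs_a for b in runs_b]
    wrong = sum(1 for a, b in pairs if (a < b) != a_is_faster)
    return 100.0 * wrong / len(pairs)


if __name__ == "__main__":
    # Hypothetical runtimes (arbitrary units) for two configurations.
    base = [100, 104]
    modified = [102, 106]
    print("CoV (%):", coefficient_of_variation(base))
    print("Range (%):", range_of_variability(base))
    print("WCR (%):", wrong_conclusion_ratio(base, modified))
```

With these made-up numbers, one of the four single-run pairs (104 vs. 102) points the wrong way, which is the situation the WCR is meant to quantify.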
Confidence Intervals
A confidence interval is based on a desired confidence probability and the number of runs used
A tighter confidence interval is obtained by lowering the desired confidence probability or increasing the number of runs
With a confidence probability of p, if the confidence intervals do not overlap, there is at most a 1-p chance of reaching a wrong conclusion
Confidence intervals place a very conservative bound on the probability of making a wrong conclusion

Determining sample size
t is calculated from the desired confidence probability, and r is the minimum relative error required
S/Y is the ratio of the true population standard deviation to the mean, which is unknown and must be estimated based on prior knowledge (any problem with this?)

Hypothesis Testing
Consider testing the effect of ROB size (32 vs. 64)
Actually test the hypothesis that there is no significant difference between ROB sizes; we want to prove this wrong
The significance level (α) corresponds to our wrong conclusion probability; we want to show that the significance of our hypothesis is very low, meaning that the probability of making a wrong conclusion is very low
y: sample mean runtimes
s: sample variances
The goal is to find a value of n that reduces the area of the rejection region to our desired significance level!

What about time variability again?
Use “analysis of variance (ANOVA)” to determine whether the variances in runs from different starting points are statistically the same
Based on ANOVA results, it may be sufficient to use all runs from a single starting point, or it may be necessary to choose multiple starting points in addition to using multiple runs!

Thoughts
How much confidence is enough? Theoretically we’d like to be 100% confident, but that would take too much time to simulate!
As multi-threaded benchmarks expand, will variability become too great and make simulations too unreliable?
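The confidence-interval, sample-size, and hypothesis-testing slides can be made concrete with a short Python sketch. Several things here are assumptions rather than content from the slides: the function names are hypothetical, the interval is the textbook mean ± t·s/√n, the sample size uses the standard relative-error form n = (t·(S/Y)/r)², and the test statistic is the unpooled two-sample t statistic. The caller supplies the t critical value (for example from a table), since the Python standard library has no Student's t distribution.

```python
import math
import statistics


def confidence_interval(runtimes, t_value):
    """Return (low, high) for the mean runtime: mean +/- t * s / sqrt(n).

    t_value is the t critical value for the chosen confidence probability
    and n - 1 degrees of freedom, looked up by the caller.
    """
    n = len(runtimes)
    mean = statistics.mean(runtimes)
    half_width = t_value * statistics.stdev(runtimes) / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def required_runs(t_value, s_over_y, rel_error):
    """Assumed formula: n = (t * (S / Y) / r)^2, rounded up.

    s_over_y is an estimate of standard deviation / mean (as a fraction),
    and rel_error is the relative error r to be resolved.
    """
    return math.ceil((t_value * s_over_y / rel_error) ** 2)


def two_sample_t(runs_a, runs_b):
    """Assumed unpooled statistic: (y_a - y_b) / sqrt(s_a^2/n_a + s_b^2/n_b)."""
    var_a = statistics.variance(runs_a)
    var_b = statistics.variance(runs_b)
    std_err = math.sqrt(var_a / len(runs_a) + var_b / len(runs_b))
    return (statistics.mean(runs_a) - statistics.mean(runs_b)) / std_err


if __name__ == "__main__":
    # Hypothetical runtimes for two ROB sizes; t = 3.182 is the two-sided
    # 95% critical value for 3 degrees of freedom (4 runs each).
    rob32 = [98, 102, 100, 100]
    rob64 = [93, 95, 94, 96]
    ci32 = confidence_interval(rob32, 3.182)
    ci64 = confidence_interval(rob64, 3.182)
    print("Intervals overlap:", intervals_overlap(ci32, ci64))
    print("t statistic:", two_sample_t(rob32, rob64))
    print("Runs needed:", required_runs(2.0, 0.25, 0.125))
```

The sample-size function makes the trade-off on the slides visible: halving the relative error r quadruples the number of runs, and an underestimated S/Y yields too few runs, which is the concern raised by the "any problem with this?" question.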