Summarizing Performance is No Mean Feat No matter how much people want performance to be a single number, it is usually a distribution, not a mean alone. John R. Mashey For EE282, Stanford, Oct 12, 2004, Alpha Version NoMeanFeat – Copyright 2004, John Mashey 0 Speaker – John Mashey • Ancient UNIX software guy, Bell Labs 1973-1983, MTS…Manager – Programmer’s Workbench, shell programming, text processing, workload measurement/tuning in first UNIX computer center, UNIX per-process accounting UNIX+mainframe data mining apps, capacity planning/tuning • Convergent Technologies 1983-1984, MTS…Director Software – Compiler & OS tuning, uniprocessor/multiprocessor servers • MIPS Computer Systems 1985-1992, Mgr. OS…VP Systems Technology – {system coprocessor, TLB, interrupt-handling, etc}, {byte addressing(!), halfword instructions}, ISA evolution, SMP features, multi-page-sizes, 64-bit – MIPS Performance Brief editor; ex-long-time frequent poster on comp.arch – One of the SPEC founders, 1988; long-time Hot Chips committee • Silicon Graphics 1992-2000, Dir. Systems Tech … VP & Chief Scientist – Fingers in many areas, R10000 & later architecture, including performance counters & software, Origin3000/Altix ccNUMA architecture, performance issues in HPC, DBMS • Current: consult for VCs & high-tech co’s, sit on technical advisory boards; Trustee at Computer History Museum * Not a statistician, so statisticians in audience, please be nice. NoMeanFeat – Copyright 2004, John Mashey 1 Overview • Background – Who benchmarks and why – “Standard model” advice about use of means ((W)AM, (W)HM, WGM) » Good advice, but incomplete; contradictions remain; industry mismatch – “Alternate model” » Various means useful when applied appropriately » Requirements, assumptions, results differ • Review of basic statistics – Populations, samples; parameters versus statistics – Distributions, especially normal, inverse normal, lognormal • Alternate model – WCA, WAW (mean = WAM or WHM), SERPOP (mean = GM) • Sample analyses using WAW and/or SERPOP – SPEC CINT2000, CFP2000 – Livermore Fortran Kernels (LFK) – (Digital Review CPU2; not in this version) • Conclusion NoMeanFeat – Copyright 2004, John Mashey 2 Who benchmarks and why • Computer designers – – – – • Sell magazines Industry consortia – – • Workload understanding, capacity planning; evaluate potential purchases Computer magazines – • Understand where to focus efforts on improvements Owners/buyers of computers, sometimes in groups – • System for dedicated application; must understand application, especially real-time Example: embedded systems, System-on-Chip, Tensilca/ARC/etc Software engineers – • New H/W + S/W to attack new problem domain; little existing data Example: Cray-1 vector systems … but many such have failed New H/W + S/W to compete in wide existing market; can gather related data Examples: RISC systems of 1980s … but most new ISAs have failed New implementation of established ISA; much data on workloads and programs Examples: IBM S/360, Digital VAX [reputed to have 500+ benchmarks] Try to get meaningful benchmarks to avoid coercion/waste of silly ones SPEC, TPC, EEMBC, etc Researchers NoMeanFeat – Copyright 2004, John Mashey 3 Standard Model for Summarizing Benchmarks • Summarize times – Arithmetic Mean (AM) – Weighted AM (WAM) • Summarize rates or ratios – Harmonic Mean (HM) – Weighted HM (WHM) • Do not ever use for anything! 
– Geometric Mean (GM) [1]
– Weighted GM (WGM)
– Do not predict workload run-time
– I.e., SPEC and LFK wrong
• Do not use Means (But do)
• References [2, 3, 4, 5, 6]
• AM, HM, GM are the Power Means M1, M-1, M0, or the Pythagorean Means … these are old!

Alternate Model of Summarizing Benchmarks
• Summarize times – Arithmetic Mean (AM), Weighted AM (WAM)
• Summarize rates or ratios – Harmonic Mean (HM), Weighted HM (WHM)
– Workload-dependent (WAW): WHM and WAM the same, if done right; unweighted AM or HM imply rare assumptions
– Performance : workload; population, algebra, definite
– Measure workload (WCA)
• "Do not ever use for anything!" – Geometric Mean (GM), Weighted GM (WGM); do not predict workload run-time
– Workload-neutral (SERPOP): GM; WGM only to fix a sample
– Performance : program, not workload; sample, statistical inference, probabilistic
• Do not use means (but do) → Really do not use means (alone)
– WAW: really sum/n, not a distribution
– SERPOP: means + other metrics

Some Really Basic Statistics
• Populations and samples
• General distribution descriptions
• Normal distribution – x
• Handling non-normal distributions
• Inverse normal – 1/x
• Lognormal – ln x
• What do the Means mean

Populations and Samples
• Population: set of observations measured across members of group
– Forms a distribution
– Summarized by descriptive statistics, or better, parameters
– Uncertainty: individual measurement errors
• Sample: subset of population
– Compute statistics
– Know population distribution a priori, or check sample versus assumption
– Extra uncertainty: small samples or selection bias
[Diagram: Population (size N, parameters) → Sample (size n, statistics); sample size, representativeness]

General Distribution Descriptions
• Mean: measure of central tendency, 1st moment
• Variance: measure of dispersion, 2nd moment
• Standard deviation: measure of dispersion, same scale as Mean
• Coefficient of Variation: CoV, dimensionless measure of dispersion
• Excel functions noted with each formula, where they exist (OpenOffice.org Calc is mostly the same)
– AVERAGE: AM = μ = (1/N) Σ xi, i = 1..N
– VARP: σ² = (1/N) Σ (xi − μ)²
– STDEVP: σ = √σ²
– CoV = σ / μ

Samples
• Sample Mean: used to estimate population mean; AM OK for 3 cases below
• Sample Variance, standard deviation, CoV; note slight difference (n − 1 divisor)
• Skewness or skew: degree of asymmetry, 3rd moment, Excel: SKEW
– Zero for symmetric, negative for long left tail, positive for long right
• Kurtosis: concentration comparison with normal, 4th moment, Excel: KURT
– Positive: more peaked and heavier-tailed than normal; negative: flatter, lighter tails
• NOTE: for further discussion, assume all xi positive
– AVERAGE: AM = x̄ = (1/n) Σ xi
– VAR: s² = (1/(n − 1)) Σ (xi − x̄)²
– STDEV: s = √s²
– CoV = s / x̄
[Chart: example density shapes – normal, negative kurtosis, positive kurtosis, right-skewed]
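As a supplement (not in the original deck), here is a minimal Python sketch of the sample statistics above, following the Excel SKEW/KURT definitions; the example values are the 12 benchmark ratios used later in the talk.

```python
import statistics

def describe(xs):
    """Sample statistics from the slides: AM, s, CoV, skew, excess kurtosis."""
    n = len(xs)
    am = statistics.mean(xs)                 # AVERAGE
    s = statistics.stdev(xs)                 # STDEV: n-1 divisor
    z = [(x - am) / s for x in xs]
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)                  # Excel SKEW
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))                        # Excel KURT
    return am, s, s / am, skew, kurt         # CoV = s / AM

ratios = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]   # r values used later
print(describe(ratios))                      # AM ~108, STDEV ~45, CoV ~0.42, SKEW ~1.5
```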
Normal (Gaussian) Distribution
• Arises from a large number of additive effects
• Familiar, useful properties … but cannot automatically assume normal
• 68% within mean ± s
• 95% within mean ± 2s
• 99.7% within mean ± 3s
• 3 of 1000 outside 3s, i.e., rare!
[Chart: normal density on a linear scale, with bands at x̄ ± s (68%), x̄ ± 2s (95%), x̄ ± 3s (99.7%)]

Normal z-score Transformation
• Normals all one shape, linear-scale symmetric, with percentages as given
• Can be converted to standard normal, mean = 0, s = 1; Excel: STANDARDIZE
– zi = (xi − x̄) / s
[Chart: same density re-plotted in z units, −4 to +4]

Confidence Intervals
• Confidence Intervals can be computed if population is normal
• Commonly described as chance that population mean is within the interval
• Alpha = significance level, such as .05; 100(1 − Alpha) = confidence level, such as 95%
• Small samples (less than 30) need to use Student's t-distribution, like normal, but with wider tails. Excel: TINV
• Interval improves (gets smaller) with larger sample
– Conf. interval = x̄ ± TINV(0.05, n) · s / √n

Is x (or x*) Normal? Quick Tests
• Normal's Skew and Kurtosis ≈ 0, and CoV < 0.3
• If sample much different, the population-normal assumption needs checking
– Population may be distinctly non-normal, as in heavily-skewed example
– Population may include several distinct sub-populations [see LFK]
– Sample may be small or biased
– Sample may include unusual outliers: AM especially sensitive to large outliers, right skew. Error? Odd, but legal? Illusory, due to small sample?
– As CoV rises above 0.3, normal increasingly predicts negative xi (Bad).

Is it Normal? Quick Tests
• Normal probability plot; perfect normal = straight line
• Coefficient of Determination: CoD = 1.0 for a perfect fit, decreases as fit worsens
• Excel: INDEX(LINEST(sorted.zdata,,,TRUE), 3)
• Example ratios r: 41 79 86 87 88 89 103 106 124 127 144 225
– 1 big, 1 medium outlier; CoD = 0.77; HM=93, GM=100, AM=108; STDEV=45; SKEW=1.48; CONF [80, 137]
– Trimmed (no outliers); CoD = 0.91; HM=100, GM=101, AM=103; STDEV=22; SKEW=0.77; CONF [88, 119]
[Charts: normal probability plots (z-scores of r), full and trimmed]
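A sketch (not from the deck) of the normal-probability-plot quick test: regress the sorted data against normal quantiles and report R² as the CoD. The talk's Excel recipe (LINEST over sorted z-scores) uses rank as the x-axis, so the numbers will differ somewhat; this is stdlib-only Python, assuming 3.8+.

```python
from statistics import NormalDist, mean

def cod_normal(xs):
    """CoD (R^2) of a normal probability plot: 1.0 = perfectly straight line."""
    n = len(xs)
    ys = sorted(xs)
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]   # theoretical quantiles
    my, mq = mean(ys), mean(qs)
    sxy = sum((q - mq) * (y - my) for q, y in zip(qs, ys))
    sxx = sum((q - mq) ** 2 for q in qs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

ratios = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(cod_normal(ratios))            # low CoD flags the outlier-heavy, right-skewed sample
```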
Sample distribution metrics, summary
• Mean: measure of central tendency
• s, CoV: dispersion, relative dispersion
• Skew, Kurtosis: more on shape
• CoD: similarity to normal
• Confidence limits: goodness of sample
• (Should be more error analysis; later)

Handling Non-Normal Distributions
• Example distributions: Jain [4], DeCoursey [7], NIST/SEMATECH [8]
– Different processes → different distributions
– Bernoulli, Beta, Binomial, Cauchy, Chi-Square, Double exponential, Erlang, Exponential, Extreme value, F, Gamma, Geometric, Lognormal, Inverse Normal, Negative Binomial, Normal, Pareto, Pascal, Poisson, Power lognormal, Student's t, Tukey-Lambda, Uniform, Weibull.
– http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
• Normal is so useful … but nothing guarantees it, so must check
• If it isn't normal, try to transform to one that could be
– xi* = f(xi) transform; use whatever works, f(x) = 1/x, f(x) = ln(x), etc
– Compute mean, standard deviation, etc of xi*
– Check normality!
– Back-transform the mean (and other metrics that can be) via f⁻¹

Inverse Normal – 1/x
• If 1/x is normal, x has an inverse normal distribution
– In computing, minimal direct use; typically used for converting rates → times
• Transform: xi* = 1/xi. Mean:
– AVERAGE: AM(x*) = x̄* = (1/n) Σ (1/xi)
• Compute higher moments. Back-transformed Mean is HM:
– back-transform: HM = 1 / AM(x*)
– direct, HARMEAN: HM = n / Σ (1/xi)

Linear and Logarithmic scales
• On the usual linear scale: 1(a) Normal, symmetric; 1(b) Normal, symmetric, wider; 1(c) Lognormal, slight right-skew; 1(d) Lognormal, noticeable right-skew; 1(a) and 1(c) very similar
• On a logarithmic scale: 2(a) Normal, slight left-skew; 2(b) Normal, noticeable left-skew; 2(c) Lognormal, symmetric; 2(d) Lognormal, symmetric, wider
[Charts: (a) Normal s=0.2, (b) Normal s=0.3, (c) Lognormal s=0.2, (d) Lognormal s=0.5, plotted on linear and logarithmic axes]

Lognormal (or log-normal) – ln(x)
• If ln(x) is normal, x has a lognormal distribution
– Nothing magic about ln; any base works
• Transform: xi* = ln(xi). Mean:
– AVERAGE: AM(x*) = x̄* = (1/n) Σ ln xi
• Compute higher moments on xi*. Back-transformed Mean is GM:
– #1, direct back-transformed AM: GM = exp(AM(x*)) = exp((1/n) Σ ln xi)
– #2, direct, easier to compute, non-obvious: GEOMEAN: GM = (Π xi)^(1/n)
• Sigma = exp(s(x*)): multiplicative standard deviation
– Sigma can be used like s: 68% of x data in [m/Sigma, m·Sigma]
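A small sketch (added here, not in the deck) of the transform-and-back-transform recipe just described: HM via 1/x, GM via ln x, and the multiplicative standard deviation Sigma.

```python
import math
from statistics import mean, stdev

def back_transformed_means(xs):
    """All xs must be positive, as the slides assume."""
    hm = 1 / mean(1 / x for x in xs)                    # AM of 1/x, inverted back
    logs = [math.log(x) for x in xs]
    gm = math.exp(mean(logs))                           # AM of ln x, exponentiated back
    sigma = math.exp(stdev(logs))                       # multiplicative standard deviation
    return hm, gm, sigma

ratios = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
hm, gm, sigma = back_transformed_means(ratios)
print(hm, gm, (gm / sigma, gm * sigma))   # roughly 68% of a lognormal sample in [GM/Sigma, GM*Sigma]
```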
Lognormal in real world, more or less
• "Examples of variates which have approximately log normal distributions include the size of silver particles in a photographic emulsion, the survival time of bacteria in disinfectants, the weight* and blood pressure of humans, and the number of words written in sentences by George Bernard Shaw." http://mathworld.wolfram.com/LogNormalDistribution.html
– *Human heights are normal/lognormal, but weights need lognormal
• Useful article by Stahel and Abbt: "Log-Normal Distributions Across the Sciences: Keys and Clues." http://www.inf.ethz.ch/personal/gutc/lognormal/bioscience.pdf
• Net-based graphical simulation: Gut, Limpert, Hinterberger, "Modeling the Genesis of Normal and Log-Normal Distributions – A Computer simulation on the web to visualize the genesis of normal and log-normal distributions" http://www.inf.ethz.ch/~gut/lognormal

Inverse normal
• Same example ratios, fit as C/r (C = 100)
• Coefficient of Determination: CoD = 0.68, or 0.95 trimmed
• CONF [72, 130]; [88, 115] trimmed
• Normal probability plot, perfect normal = straight line [note inverted scale]
[Charts/tables: C/r values and z-scores; 1 big outlier, CoD = 0.68; trimmed, no outliers, CoD = 0.95]

Lognormal
• Coefficient of Determination: CoD = 0.83, or 0.93 trimmed
• CONF [72, 130]; [88, 117] trimmed
• Normal probability plot, perfect normal = straight line
[Charts/tables: ln(r) values and z-scores; 2 moderate outliers, CoD = 0.83; trimmed, no outliers, CoD = 0.93]

What Do the Means Mean?
• Relationships:
– HM(x) ≤ GM(x) ≤ AM(x)
– HM(x) = 1 / AM(1/x)
– AM(x) = 1 / HM(1/x)
– GM(x) = 1 / GM(1/x)
• Normal → AM, inverse normal → HM, lognormal → GM
• The reverse is not implied: using AM does not make the data normal, HM inverse normal, or GM lognormal
• Must
– Understand physical meaning to avoid irrelevant math
– Discover appropriate distribution type for population, check sample
– Know whether population or sample!
– Quantify uncertainty from measurement; sample size, bias
• So far, the main example might be modeled by lognormal, normal, or inverse normal, although untrimmed lognormal is better for the original example
• Trimming helps … but are the outliers good data or not?
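A quick numeric check (mine, assuming Python 3.8+ for statistics.geometric_mean) of the identities on the "What Do the Means Mean?" slide, using the same example ratios.

```python
from statistics import mean, geometric_mean, harmonic_mean

xs = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
inv = [1 / x for x in xs]

assert abs(harmonic_mean(xs) - 1 / mean(inv)) < 1e-6             # HM(x) = 1/AM(1/x)
assert abs(mean(xs) - 1 / harmonic_mean(inv)) < 1e-6             # AM(x) = 1/HM(1/x)
assert abs(geometric_mean(xs) - 1 / geometric_mean(inv)) < 1e-6  # GM(x) = 1/GM(1/x)
assert harmonic_mean(xs) <= geometric_mean(xs) <= mean(xs)       # HM <= GM <= AM
print(harmonic_mean(xs), geometric_mean(xs), mean(xs))           # ~93, ~100, ~108
```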
Entering Benchmark Territory
• "Danger, Will Robinson. Warning! Warning! Enemy benchmarketeers!"
• Lies, Damn Lies, and Benchmarks [This talk is not that one.]
• This talk is about trying to get the math right
• In science
– Data usually is what it is, within measurement error
– People make mistakes, but try to explain data, make good charts; liars get hurt
– Example: human metrics. Year 1: measure people's heights (normal/lognormal), or weights (lognormal). Year 2: do it again, probably get mostly similar numbers
• In benchmarking
– Data changes inherently, business changes quickly
– People are selective in presentation, can be tricky; Jain [4], "Ratio Games"
– People do make mistakes, sometimes on purpose; rule-bending … cheating
– If human metrics were benchmarks, assuming bigger is better, Year 2: some will have had crash eating binges; a few will have lead weights in pockets; some will have elevator shoes; somebody will say they'll be taller next month; somebody will say why heights don't really matter in their application; one will have discovered stilts

Benchmarking: What Does "R% as fast" Mean?
• Computers X and Y
• Assume n programs Pi, i = 1..n, supposedly members of some related class
– Many examples here: large, CPU-intense, user-level integer codes
• Run programs on X & Y with same inputs → measure run-times xi and yi → compute ratios ri = 100 · xi / yi
• Compute R from these numbers, somehow, then claim: "Y is R% as fast as X"
• What could this possibly mean that is true and useful?
– "Always" – simple, useful, but essentially never true
– "For a given workload" – true and useful, sometimes
– "For systems and programs" – true and useful, sometimes, different times

"X R% as fast as Y on all such programs"
• Two very similar systems, same cache, memory system, software
– SPEC CINT2000, 12 benchmarks
– X: Dell PW350, 2266MHz
– Y: Dell PW350, 3066MHz (135% of X clock rate)
– But ri vary noticeably; HM=124, GM=125, AM=125; STDEV=9
– Note: did not get 135% performance, for usual memory reasons
– Bad enough; earlier example is worse (41%–225%); typical
[Table/chart: r = 106, 118, 119, 120, 122, 123, 125, 129, 133, 133, 133, 138 and their z-scores. Typical; others worse.]

"Two Useful Answers – WAW or SERPOP"
• Overall population: hopeless!
• Workload Characterization Analysis (WCA)
– Gather data, generate Weights; codify "local knowledge"
– External: published rates/ratios can sometimes be used to fill in missing data, increase sample size
• Workload Analysis with Weights (WAW)
– Needs goodness of Weights; can be "what if" analyses
– Algebra on workload parameters → R% for a workload
• Sample Estimate of Relative Performance Of Programs (SERPOP)
– Representativeness, sample size; statistical analysis on the sample
– R% for programs on systems, plus distribution and confidence
[Flow diagram, reconstructed from the slide: Population of Programs Pi, i = 1..N, run-times for X & Y.
 Workload-dependent path – WCA (system log, "experience"): Pi: Txi = total run-time → compute weights Wi → Pi: Wi. External (published metrics): Pk: Mxk, Myk → compute rk(Y:X).
 WAW: for each Pi, Wi – select input, run on X, Y → xi, yi and ri = xi/yi (assume IA1, IA2); WAM on (Wi, xi, yi): Ra via IA3 → Rwa; WHM on (Wi, ri): Rh via IA4 → Rwh; → Rw from WCA or IA3, IA4, IA5.
 Workload-neutral path – SERPOP: for each Pi – select input, run on X, Y → ri = xi/yi (assume IA1, IA2) → GM: Rg via IA7, IA8; plus s, CoV, Skew, Kurtosis, CoD, Confidence limits.]

WCA
• Assume X running real workload, days, weeks, etc.
• Goal: estimate Y's R% without running the entire workload
• Identify Pi, i = 1..n, that account for most of the total run-time → identify Wi, i = 1..n
• Weights, or fractions of total run-time: Txi = sum of all run-times for Pi; Wi = Txi / Σ Txj
• Can be easy, for example: on most UNIX systems, turn on accounting, use acctcom(1) & accumulate
• Wi must be well-measured, or at the very least, well-estimated by experience
• Later WAW analyses are no better than the Weights
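A sketch of the WCA weight calculation just described. The record format here is made up for illustration; in practice the per-program totals would come from accounting data such as acctcom output.

```python
from collections import defaultdict

def wca_weights(records):
    """records: iterable of (program_name, cpu_seconds) observations."""
    tx = defaultdict(float)
    for program, seconds in records:
        tx[program] += seconds                       # Txi = sum of run-times for Pi
    total = sum(tx.values())
    return {p: t / total for p, t in tx.items()}     # Wi = Txi / sum of all Txj

records = [("simulate", 3600.0), ("compile", 400.0), ("simulate", 5400.0), ("report", 200.0)]
print(wca_weights(records))    # later WAW analyses are no better than these weights
```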
WAW
• For each Pi, select representative input
• Run programs on X & Y with same input to get: run-times xi and yi; ratios ri = 100 · xi / yi
• Implicit Assumptions need to be made explicit, always checked
• IA1: Repeatability
– Multiple runs of Pi, same input, same system → low-variability run-times
– Run-times long enough to avoid clock measurement problems
– If true, tiny samples plausibly representative
– Recent SPEC rule: run 3 times, take median time
– Often more-or-less true … but not always
• IA2: Consistent Relative Performance
– Each ri has small variability for different inputs
– If true, only one input needed to estimate ri
– Often more-or-less true … but not always

Violating IA1: Repeatability
• IA1: awful case in Summer 1986, on first MIPSco systems (~M/400)
– Run-time distribution of ECAD benchmark supposed to run overnight
[Chart: run-times in hours, from ~2 ("We wish!") through ~8 ("OK") out to ~18–20 ("NOT OK!")]
• Why: 4KB pages, 8KB I-cache, 8KB D-cache, direct-mapped. OS randomly mapped pages (virtual → real); about 20% caused bad collisions, 2X time, unpredictable, unacceptable
• Fix: OS changed to map (virtual → real) in a consistent way per process → the worst outliers disappeared [real-time: max more important than mean]
• M/500 got 16KB I-cache instead; big modern set-associative caches OK
• BUT: issue still exists, has moved to embedded designs, SoCs

Run-times: one program, one system, different inputs
• Some programs run ~fixed time, not much dependent on input
• Some simulations run 1 hour, 2 hours, 3 hours … depending on iterations
• Some can run arbitrary time, dependent on input size, array sizes
• What is the overall distribution shape of run-times in general? A: unknown, perhaps unknowable
• What is the overall distribution of Pi's run-times in a specific environment? A: WCA, measure it
[Chart: example run-time profiles – almost constant; discrete simulation steps; proportional to input or sizing; min, typical, long tail]
• xi, yi: ?? but at least hope that ri have small variability, maybe

Violating IA2: Consistent Relative Performance
• IA2: some programs execute radically different code for different inputs
– Example: Spice in SPEC89, different circuit types
– Oops! We knew Spice was floating-point, but happened to pick one input that spent most time doing integer storage allocation
• Some programs stress the data cache differently according to a size parameter
– Example: array size for matrix operations, easily changeable
• Following chart known to benchmarketeers …
– One of several profiles, mostly dealing with cache size, memory design
– X: lower clock rate or equivalent, larger cache, same memory
– Y: higher clock rate or equivalent, smaller cache, same memory
[Chart: Relative Performance of Y over X (higher = better for Y) versus increasing problem size; regions run from both caches at high hit rates ("Y (first choice)"), through Y decreasing while X still has a high hit rate ("X happy"), to Y low / X decreasing, and finally both very low ("Y (second choice)")]

WAW using AM
• Simplest analysis adds run-times and divides to get Ra:
– tx = Σ xi, ty = Σ yi
– Ra = 100 · AM(x) / AM(y) = 100 · tx / ty
• Ra = 100 · tx / ty = 100 · 1881 / 1789 = 105%
• Ra = 100 · AM(x) / AM(y) = 100 · 157 / 149 = 105%; sums give the same answer
• What do AMs really mean? = sum / n
• Are AMs good central measures here? Unclear. We have not characterized the distribution types. No reason to believe any particular distribution
• In any case, run-times might be arbitrary, adjusted for convenient size, availability of inputs, etc
[Table: x (secs) and y (secs) for the 12 benchmarks – full table on the next slide; sums tx = 1881, ty = 1789; AMs 157, 149]
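The "WAW using AM" arithmetic above, as a short sketch; the run-times are the x and y seconds from the slide's table (x = IBM, y = Einux, as labeled in the later SERPOP slide).

```python
# Total the run-times and divide: Ra = 100 * tx / ty = 100 * AM(x) / AM(y).
x = [100, 133, 97, 114, 213, 133, 96, 99, 175, 106, 217, 398]   # seconds on X
y = [246, 169, 113, 131, 242, 150, 93, 93, 141, 83, 151, 177]   # seconds on Y

tx, ty = sum(x), sum(y)
print(tx, ty, round(100 * tx / ty))    # 1881 1789 105
```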
WAW using AM, Run-time Distributions?
• X distribution, far from normal, CoD = 0.66 (low)
– Right-skewed by 197.parser: skew > 2
– AM (157) further from GM (141) than HM (131)
– Large STDEV, high CoV (> 0.3); {HM, GM, AM} >> Median
• Y distribution, closer to normal, CoD = 0.92
– Somewhat right-skewed by 181.mcf & 300.twolf
• SPEC run-times somewhat arbitrary, picked for convenience, keep fastest systems > 60 seconds

Benchmark      x (secs)   y (secs)
181.mcf           100        246
256.bzip2         133        169
252.eon            97        113
255.vortex        114        131
300.twolf         213        242
175.vpr           133        150
176.gcc            96         93
254.gap            99         93
164.gzip          175        141
186.crafty        106         83
253.perlbmk       217        151
197.parser        398        177
MEDIAN            124        146
HM                131        133
GM                141        141
AM                157        149
STDEV              88         54
SKEW             2.16       0.75
KURTOSIS         5.19      -0.12
CoV              0.56       0.36
[Charts: z-score probability plots of x and y]

WAW with AM, usually wrong
• Ra calculations never used actual Wi
• AM makes a very strong, and in practice usually wrong, assumption:
• IA3: Benchmark Equals Workload
– Each Pi uses the same fraction of time in both workload and benchmark: Wi = xi / tx
– Weights are assumed; this makes the choice explicit
• Useful only when benchmark = workload, perhaps with a multiplicative factor
• This does happen, as when someone's workload is:
– Small number of programs, often run in dependent order
– With small run-time variance
– Run in equal numbers
– Examples: some CAD workloads, where the nightly run = P1 → P2 → P3 → P4, and the sequence must finish overnight
• Otherwise the simplicity of the AM can fool people into thinking that:
– A real distribution is being measured, and AM is a good central measure
– The original workload is being modeled

WAW with Weighted Arithmetic Mean (WAM)
• First, calculate xi* proportional to the original Wi, to reflect their fractions of the original total run-time
• Then yi* computed to maintain the same relative performance ri:
– xi* = Wi · tx and yi* = xi* / (ri / 100)
• Rwa = 100 · AM(x*) / AM(y*) = 100 · tx* / ty*
 = Σ (100 · Wi · tx) / Σ (100 · Wi · tx / ri)
 = Σ (Wi · tx) / Σ (Wi · tx / ri)
 = Σ Wi / Σ (Wi / ri)
• Results depend on Wi → need good WCA
• Without good WCA, people make various assumptions, implicit or explicit:
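A sketch of the weighted formula just derived, Rwa = Σ Wi / Σ (Wi/ri); with equal weights it collapses to the plain HM of the ratios, which is the IA4 case on the next slide.

```python
def rwa(weights, ratios):
    """Weighted arithmetic-mean comparison, expressed purely in weights and ratios."""
    assert abs(sum(weights) - 1.0) < 1e-9      # Wi are fractions of total run-time on X
    return sum(weights) / sum(w / r for w, r in zip(weights, ratios))

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
equal = [1 / len(r)] * len(r)
print(round(rwa(equal, r)))                    # ~93: equal time on X (IA4) = HM of the ratios
```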
WAW with WAM, Assumptions without WCA
• IA4: Equal times on X
– Set Wi = 1/n; assume each Pi consumes equal time on X
– All xi* = 157. tx* = tx = 1881, but ty* = 2031
– Rwa = 93% (Y slower than X)
• IA5: Equal times on Y
– Is X somehow special? No, no WCA. If X and Y swap roles:
– All yi* = 149. ty* = ty = 1789, but tx* = 1936
– Rwa = 108% (Y faster than X)
• IA6: Extreme cases
– By assuming a workload consisting only of the worst or best case, can get Rmin = 41% to Rmax = 225%
• Is it better to be the base (equal-weight) system, as in IA4 or IA5? Do benchmarketeers know the answer? Jain [4], "Ratio Games"
[Table: original x, y, r per benchmark, plus y* under IA4 and x* under IA5; ty* = 2031, tx* = 1936; AMs 169 and 161]

WAW with WAM Summary
• Y is R% as fast as X:
– Rmin = 41%: worst case from this data*
– Rwa = 93%: IA4, equal times on X
– Ra = 105%: IA3, benchmark = workload
– Rwa = 108%: IA5, equal times on Y
– Rmax = 225%: best case on this data*
• In the absence of weights from WCA, any of these are assumptions
• For any assumption, must ask: Why was this assumption made? What evidence is there for it?
• Assumptions are no substitute for knowledge
• * Actually, later SERPOP analysis would predict about 5% of programs would fall outside this range, so it could be even worse or better for the wrong/right single-program workload

WAW with Weighted Harmonic Mean (WHM)
• HM or WHM used when xi and yi are really rates or performance ratios
• Usual general WHM formula:
– Rwh = WHM(r) = Σ Wi / Σ (Wi / ri)
• Same formula as in Rwa, so Rwh = Rwa
• If all Wi are equal, reduces to the usual HM calculation under IA4
– HM assumes equal time per Pi on X
– Each Pi's run-time proportional to 1/ri, and Rh = 93%
• If X and Y swap, and assume equal time on Y under IA5, Rh = 108%
• Defined this particular way:
– IA4: WAM and WHM give the same answer (93%)
– IA5: WAM and WHM give the same answer (108%)
– WCA: WAM and WHM give the same answer (according to WCA)
– But of course, the different cases give different answers

WAW Summary
• Appropriate when WCA Wi are known
• Appropriate for "what-ifs"
• Appropriate for design of dedicated systems
• Strong influence of goodness and completeness of Wi
• Makes no useful predictions about program n+1
• Sometimes difficult to use for general computer architectural design studies
– Difficult to get the data in any reliable way
– New design aimed to target a workload 3 years away
• Very difficult for industry-consortia benchmarking!
– Agreement on weights is an "interesting" experience
• Population, not sample
• Algebra, not statistics

External Data
• Published metrics can be a useful complement to analysis
• Times useful only for replication testing, credibility
• Usually rates (like MFLOPS, Dhrystones) or ratios (SPECratios); common practice: convert the former to the latter
• 1980s: vendor performance documents often included some awful published metrics, for lack of anything better, i.e., Dhrystones, Whetstones
• Recognizability and understandability by the user very important
– Mysterious numbers are useless … especially from a vendor
• Can feed into WAW analysis, example:
– 80% of workload is in 4 programs on X, 20% is "other"
– Able to get x1, x2, x3, x4, but only y1, y2, y3 (P4 is 3rd-party code, no port yet); then can compute r1, r2, r3, but not r4
– If experience has shown that P4 is "like" some published benchmark, then can estimate r4 and continue
– Likewise, may be able to estimate an overall r for the missing 20%

SERPOP*
• * Actually not [new]: the rest is just a slight formalization of methods people have used for decades, but usually justified for (true) reasons that (also) led to seeming contradictions. Some of us (author included) failed to dig deeper into the math and explain it, or the whole argument over Means would have been over years ago. There are no new statistical methods here, and this is just the tip of the applicable statistics iceberg. Resampling techniques (jack-knife, bootstrap, etc.), general Box-Cox transformations, and more complex distributions than lognormal are beyond the scope here, but are worth studying, especially for the small sample sizes commonly found in benchmarking.
Distributions of Ratios
• xi and yi fit no consistent, recognizable, useful distributions
– No real-world reason for them to do so
– They are also often arbitrarily adjusted for convenience
• Wi are fractions that must sum to 1.0
– No real-world reason for them to follow any consistent distribution, although a given site might find that its workload fits something
• That leaves the ratios ri
• If xi and yi were independent, from a standard normal (negative … 0 … positive):
– Cauchy distribution, among the most awkward known. No mean or variance. Increasing sample size useless
• But fortunately, for benchmarks:
– xi and yi positive → ri positive → not Cauchy, thank goodness!
– xi and yi not completely independent, sometimes extremely well-correlated
– Real experience says Y is actually faster/slower than X, usually
– Example: Correl = 0.45; Dell example = 0.99
• One more important characteristic of benchmark ratio distributions …

Benchmark Ratio Distributions need Log-scale Symmetry
• X and Y are arbitrary labels; results cannot depend on the numerator choice
– i.e., R(a/b) = 100² / R(b/a) (given the percentage notation used here) → log-scale symmetry
• X 50% of Y on P1, and 200% on P2; Y just the opposite
– Given only the r values, the two systems must be equal, as GM shows
– If this is WAW with AM, X, Y, and Z are equal (100)
– If this is a sample from a log-symmetric distribution, Z is slower … (94)
– Note the effect of consistently doubling P1 run-time
• Tiny examples mostly silly, but the math has to work for them also
• Fleming and Wallace [1]: reflexive, symmetric, multiplicative properties

Example 1 (run-times):        P1    P2    HM    GM    AM
  x                            2     4    2.67  2.83  3.00
  y                            4     2    2.67  2.83  3.00
  z                            3     3    3.00  3.00  3.00
  r  = 100·x/y                50   200     80   100   125
  rz = 100·x/z                67   133     89    94   100
  (ln r = 3.912, 5.298, AM 4.605 = ln 100; ln rz = 4.200, 4.893, AM 4.546 ≈ ln 94)

Example 2 (P1 run-times doubled): x = 4, 4; y = 8, 2; z = 6, 3; r and rz unchanged
  (HM/GM/AM of x: 4.00/4.00/4.00; of y: 3.20/4.00/5.00; of z: 4.00/4.24/4.50)

Log-scale Symmetry and Lognormal Distribution
• AM(ln(x)) → GM(x): GM appropriate for log-symmetric distributions
• Lognormal is one of many log-symmetric distributions; obvious first choice to try
• Lognormals normally caused by (mostly) multiplicative effects
– Clock rate: consider adding 100MHz to a 100MHz system, versus 100MHz to a 1000MHz system. First: <200%, second: <110%; multiplicative, not additive
– Micro-architecture
– Memory system
– Compiler code generation
• Earlier examples: lognormal fits, and if s is small, fits normal as well, as it should
• Possible: a related symmetric distribution with an extra parameter to vary Kurtosis
[Chart: log-scale density curves – normal, negative kurtosis, positive kurtosis]
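A tiny sketch of the log-scale symmetry requirement, using the two-program example above: swapping which machine is in the numerator must only invert the answer, which the GM respects and the AM does not.

```python
from statistics import mean, geometric_mean

x = [2, 4]                                    # run-times on X for P1, P2
y = [4, 2]                                    # run-times on Y

r_xy = [100 * xi / yi for xi, yi in zip(x, y)]   # [50, 200]
r_yx = [100 * yi / xi for xi, yi in zip(x, y)]   # [200, 50]

print(geometric_mean(r_xy), geometric_mean(r_yx))  # 100.0 100.0 -> R(a/b) * R(b/a) = 100^2
print(mean(r_xy), mean(r_yx))                      # 125.0 125.0 -> each "25% faster": contradiction
```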
SERPOP
• For each Pi, select representative input
• Run programs on X & Y with same input, get: run-times xi and yi; ratios ri = 100 · xi / yi; and ignore xi and yi thereafter
• Assume IA1: Repeatability and IA2: Consistent Relative Performance, plus:
• IA7: Sufficient sample size
– More is better, especially if CoV is large
– Compute Confidence Intervals to understand goodness of sample
– So far: even small samples (CINT2000: 12, CFP2000: 14) look OK
• IA8: Representative sample
– Experience and analysis needed to know that the selected Pi are "representative"
– People often know representative programs in a local environment, even if weights not really known → LFK
– Vendors often have detailed simulations, micro-architectural statistics
– "This just does not look like anything real we've ever seen." → experience with synthetic benchmarks like Dhrystone
– Wide use of a benchmark does not imply goodness of the benchmark

SERPOP
• Just use GM(ri) and other usual statistics for lognormal
• WGM(ri) is possible, but usually reserved for samples known to be badly biased, while awaiting a better, larger sample
• Detailed analysis on the next few pages
– Shows normal, lognormal, inverse normal for illustration
– In real usage, would likely use lognormal
– Others are not log-symmetric
– When s is small, similar anyway

SERPOP – CINT2000: Einux (Opteron) vs IBM (POWER4+)
• y: Einux A4800, 1800MHz AMD Opteron (#02097, int_base2000 1077) versus x: IBM eServer Turbo 690, 1700MHz POWER4+ (#02136, int_base2000 1081)
• B1:C1 are unique identifiers from filenames
• 5.5X range of relative performance: r from 41 (181.mcf) to 225 (197.parser)
• Unrelated systems; Correl = 0.45
• Outliers pull down CoD; lognormal copes better with outliers
• x run-times: negative times predicted by the normal assumption, as usual with strong skew
• Summary (n = 12): r = 41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225; MEDIAN 96; HM 93, GM 100, AM 108; STDEV 45; SKEW 1.48; CoV 0.42; Sigma = 1.51; CoD 0.77 (r), 0.83 (ln r), 0.68 (C/r); 95% confidence limits [80, 137] (r), [78, 130] (ln r), [72, 130] (C/r)
[Spreadsheet detail, z-score plots, and histograms omitted]
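A sketch of the lognormal SERPOP summary for a set of ratios: GM, multiplicative Sigma, and a back-transformed confidence interval for the mean of ln(r). The t critical value is passed in; 2.201 below is Excel's TINV(0.05, 11) for this 12-benchmark sample.

```python
import math
from statistics import mean, stdev, geometric_mean

def serpop_summary(ratios, t_crit):
    """GM, Sigma, and a confidence interval back-transformed from ln(r)."""
    logs = [math.log(r) for r in ratios]
    gm = geometric_mean(ratios)
    sigma = math.exp(stdev(logs))                          # multiplicative standard deviation
    half = t_crit * stdev(logs) / math.sqrt(len(ratios))
    ci = (math.exp(mean(logs) - half), math.exp(mean(logs) + half))
    return gm, sigma, ci

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(serpop_summary(r, 2.201))    # GM ~100, Sigma ~1.5, CI roughly [78, 130], as on the slide
```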
SERPOP – CFP2000: Einux (Opteron) vs IBM (POWER4+)
• Same two systems (fp_base2000: IBM 1598, Einux 1122; results #02137, #02109)
• Well-behaved lognormal
• 4.7X range of relative performance: r from 31 (183.quake) to 146 (177.mesa)
• Unrelated systems; Correl = 0.49
• Remember, a high r does not necessarily mean Y is great; it means it is unusually better than X. X might be bad on that code. Need to look at other systems.
• Summary (n = 14): MEDIAN 72; HM 65, GM 70, AM 75; STDEV 29; Sigma = 1.48; CoD 0.88 (r), 0.92 (ln r), 0.81 (C/r); 95% confidence limits for ln r [56, 88]
[Spreadsheet detail and z-score plots omitted]

SERPOP – CINT2000, Dell PW350 3066MHz vs 2266MHz
• 135% clock rate difference → R = 125%
• 512KB cache; Correl = 0.99
• 181.mcf (r = 106): likely low cache hit rate; drags the faster system almost down to the slower one
• 254.gap, 253.perlbmk, 186.crafty, 164.gzip (r = 129–133): likely high cache hit rate → near the clock-rate ratio
• 252.eon (r = 138): ??
• Summary (n = 12): HM 124, GM 125, AM 125; STDEV 9; Sigma = 1.07; CoD ≈ 0.9 for all three fits; 95% confidence limits ≈ [119, 131]
[Spreadsheet detail and z-score plots omitted]

SERPOP – CFP2000, Dell PW350 3066MHz vs 2266MHz
• 135% clock rate difference → R = 117%: noticeable cache-miss effect; CFP2000 has larger data than CINT
• 512KB cache each; Correl = 0.97
• Low STDEV; HM ≈ GM ≈ AM ≈ Median
• 177.mesa, 200.sixtrack (r = 133, 135) get good cache hit rates
• Summary (n = 14): r ranges from 98 (179.art) to 135 (200.sixtrack); HM 116, GM 117, AM 117; STDEV 10; Sigma = 1.09; CoD ≈ 0.92; 95% confidence limits [111, 123]
[Spreadsheet detail and z-score plots omitted]

SERPOP – CINT2000 – Two Sun Blades, Caches
• Blade 2500: US IIIi, 1MB on-chip cache, 1280MHz; Blade 2000: 8MB off-chip cache, 1200MHz
• 107% higher clock rate, but 94% GM
• Correl = 0.74
• 186.crafty, 254.gap, 164.gzip (r = 109–113) > 107%: good cache hit in the on-chip cache
• 181.mcf (r = 48) likely fits the 8MB cache, misses much in 1MB; 256.bzip2 (r = 80) also shows the cache effect
• High KURT
• Summary (n = 12): HM 92, GM 94, AM 96; STDEV 18; Sigma = 1.26; CoD 0.72 (r), 0.60 (ln r), 0.50 (C/r); 95% confidence limits for ln r [81, 109]
[Spreadsheet detail and z-score plots omitted]
SERPOP – CFP2000 – Two Sun Blades, Caches
• Blade 2500: US IIIi, 1MB on-chip cache, 1280MHz; Blade 2000: 8MB off-chip cache, 1200MHz
• 107% higher clock rate, 92% GM
• Correl = 0.92
• 178.galgel, 189.lucas, 188.ammp, 179.art (r = 71–77) form a clump: cache size makes the difference
• 200.sixtrack, 301.apsi, 191.fma3d (r = 108–114) > 107%: good cache hit in the on-chip cache
• Negative KURT
• Summary (n = 14): HM 91, GM 92, AM 93; STDEV 15; Sigma = 1.18; CoD 0.98 (r), 0.96 (ln r), 0.94 (C/r); 95% confidence limits for ln r [84, 101]
[Spreadsheet detail and z-score plots omitted]

Livermore Fortran Kernels
• McMahon [9]: good WCA of codes in a local scientific environment, to identify the Pi, not the Wi
– Large environment, workloads varied strongly
• http://www.llnl.gov/asci_benchmarks/asci/limited/lfk/README.html
• Kernel 1: an excerpt from a hydrodynamic code.
• Kernel 2: an excerpt from an Incomplete Cholesky-Conjugate Gradient code.
• Kernel 3: the standard Inner Product function of linear algebra.
• Kernel 4: an excerpt from a Banded Linear Equations routine.
• Kernel 5: an excerpt from a Tridiagonal Elimination routine.
• Kernel 6: an example of a general linear recurrence equation.
• Kernel 7: an Equation of State fragment.
• Kernel 8: an excerpt of an Alternating Direction, Implicit Integration code.
• Kernel 9: an Integrate Predictor code.
• Kernel 10: a Difference Predictor code.
• Kernel 11: a First Sum.
• Kernel 12: a First Difference.
• Kernel 13: an excerpt from a 2-D Particle-in-Cell code.
• Kernel 14: an excerpt of a 1-D Particle-in-Cell code.
• Kernel 15: a sample of how casually FORTRAN can be written.
• Kernel 16: a search loop from a Monte Carlo code.
• Kernel 17: an example of an implicit conditional computation.
• Kernel 18: an excerpt from a 2-D Explicit Hydrodynamic code.
• Kernel 19: a general Linear Recurrence Equation.
• Kernel 20: an excerpt from a Discrete Ordinate Transport program.
• Kernel 21: a matrix X matrix product calculation.
• Kernel 22: a Planckian Distribution procedure.
• Kernel 23: an excerpt from 2-D Implicit Hydrodynamics.
• Kernel 24: finds the location of the first minimum in X.

Livermore Fortran Kernels
• http://www.netlib.org/benchmark/livermore says:
– "The best central measure is the Geometric Mean(GM) of 72 rates because the GM is less biased by outliers than the Harmonic(HM) or Arithemetic(AM). CRAY hardware monitors have demonstrated that net Mflop rates for the LLNL and UCSD tuned workloads are closest to the 72 LFK test GM rate. [However, CRAY memories are "all cache". LLNL codes ported to smaller cache microprocessors typically perform at only LFK Harmonic mean MFlop rates.]"
– It also associates: 2*AM → best applications; AM → optimized applications; GM → tuned workload; HM → untuned workload; HM(scalar) → all-scalar applications
• Such advice seems to be a set of heuristics from people with long experience in a specific environment, as there is no obvious mathematical reason for all of these to be true, other than the usual HM < GM < AM
• The comment "less biased" is true, but more important, this collection of codes is a reasonable SERPOP sample
• MFLOPS rates are really ratios versus a mythical system that does 1 MFLOP on each loop
• In the following, MFLOPS rates are given as "r"

LFK – MIPS M/1000, 15MHz R2000 uniprocessor, Oct 1987
• Scalar uniprocessor; no cache-blocking
• 24 data points, fairly well behaved
• Good fit for normal, lognormal, as expected with small s
• No really weird outliers
• MIPS [14]
• #NUM!s: no worry; do not care about HM and GM of the logs. If these were needed, would scale the values to avoid negative logs
• Summary (24 loops, 64-bit, Size = 471): r (MFLOPS) from 0.64 (loop 13) to 3.23 (loop 9); MEDIAN 1.73; HM 1.49, GM 1.63, AM 1.77; STDEV 0.69; CoV 0.39; Sigma (MSTDEV) = 1.53; good CoD for both the normal and lognormal fits (≈ 0.96, 0.94)
[Per-loop table and probability plots omitted]
LFK – SGI 4D240GTX, 4 R3000-25, July 1989
• Vectorizing, parallelizing FORTRAN compiler
• Says: 11 of 24 loops parallelizable, 15 of 24 vectorizable
• Much bigger range – an example of the rationale that led to use of GM; much better fit for lognormal when substantial variation exists
• Step-like groupings of programs with related performance
• Humphries [15]
• Summary (24 loops, 64-bit, Size = 471): r (MFLOPS) from 1.12 (loop 13) to 23.58 (loop 7), ~21X range; MEDIAN 4.88; HM 4.37, GM 5.86, AM 7.97; STDEV 6.58; CoV 0.83; Sigma (MSTDEV) = 2.24; CoD: lognormal 0.95, versus 0.81 and 0.74 for the other fits
[3 points off chart; per-loop table and probability plots omitted]

LFK – CRAY X-MP, uniprocessor, 1988 (COS, CFT77.12)
• Vectorized kernels asterisked, all but one below the bar separating loop 14 and loop 2
• r: large SKEW, KURT, CoV. AM ≈ STDEV, awkward for normal
• Tang & Davidson [16]
• r, ln(r), 1/r s-sized intervals below. Grayed boxes were made up, as the normal calculations yield impossible results; the −62 and 0 values are also impossible
• Summary (24 loops, 64-bit, Size = 471): r (MFLOPS) from 3.65 (loop 24) to 187.75 (loop 7), ~51X range; MEDIAN 33.9; HM 15.6, GM 31.5, AM 61.3; STDEV 61.5; CoV 1.00; Sigma (MSTDEV) = 3.64; CoD 0.87 (r), 0.97 (ln r), 0.77 (1/r)
[7 points off chart; per-loop table, probability plots, and s-interval histograms omitted – the normal fit's lower histogram bins land at −62 and 0 MFLOPS, which cannot occur]

Optimization / Tuning
• The "social engineering" issue of creating good industry benchmarks and controlling their "gaming" is an entirely different talk, except for one bit of math:
• In competitive benchmarking, tuning focus depends on the metric
– HM: work to raise the smallest ri, especially low outliers
– AM: work to reduce the largest run-times xi or yi, especially high outliers
– GM: work to tune every program, since improving any ri by factor F is as good as improving any other ri by that factor
– WAM/WHM: work on the programs with the largest weights from WCA
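A small sketch of the tuning point about GM: speeding up any single ri by the same factor moves the GM by the same amount, while the HM rewards fixing the low outlier.

```python
from statistics import geometric_mean, harmonic_mean

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]

def boost(rs, i, factor=1.10):
    out = list(rs)
    out[i] *= factor          # speed up just one program's ratio by 10%
    return out

gm0, hm0 = geometric_mean(r), harmonic_mean(r)
print([round(geometric_mean(boost(r, i)) / gm0, 4) for i in (0, 11)])  # same gain either way: 1.1**(1/12)
print([round(harmonic_mean(boost(r, i)) / hm0, 4) for i in (0, 11)])   # HM gains more from the low outlier
```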
"Two Useful Answers – WAW or SERPOP" (recap)
• Overall population: hopeless!
• Workload Characterization Analysis (WCA): gather data, generate Weights; codify "local knowledge"; external published rates/ratios can sometimes fill in missing data, increase sample size
• Workload Analysis with Weights (WAW): needs goodness of Weights; can be "what if" analyses; algebra on workload parameters; R% for a workload
• Sample Estimate of Relative Performance Of Programs (SERPOP): representativeness, sample size; statistical analysis on the sample; R% for programs on systems, plus distribution and confidence
• [Same flow diagram as before: WCA and External data feed WAW (WAM/WHM, IA1–IA5) on the workload-dependent side; SERPOP (GM plus s, CoV, Skew, Kurtosis, CoD, confidence limits; IA1, IA2, IA7, IA8) on the workload-neutral side]

Assumptions Again
• IA1: Repeatability
• IA2: Consistent Relative Performance
• IA3: Benchmark Equals Workload
• IA4: Equal times on X
• IA5: Equal times on Y
• IA6: Extreme cases
• IA7: Sufficient sample size
• IA8: Representative sample

Conclusion
• Do as much WCA as affordable
– Competitively, better WCA = better products
– Very difficult for general-purpose CPUs
– Much more plausible for dedicated, embedded systems, System-on-Chip; really important for real-time or equivalent
• When weights are known, use WAW to estimate workload run-time
– Algebra
• When they are not, use SERPOP to analyze the distribution
– Statistics
– Large CoV or equivalent, wide confidence limits → get more/better data
• Performance is a distribution, not just a mean

References
1. P. J. Fleming, J. J. Wallace, "How Not to Lie With Statistics: The Correct Way to Summarize Benchmark Results," Comm. ACM, Vol. 29, No. 3, pp. 218-221, March 1986.
2. J. E. Smith, "Characterizing Computer Performance with a Single Number," Comm. ACM, Vol. 31, No. 10, pp. 1202-1206, October 1988.
3. L. K. John, "More on Finding a Single Number to Indicate Overall Performance of a Benchmark Suite," Computer Architecture News, Vol. 32, No. 1, pp. 3-8, March 2004.
4. R. Jain, The Art of Computer Systems Performance Analysis, Wiley, New York, 1991. See especially the chapter on "Ratio Games."
5. D. J. Lilja, Measuring Computer Performance – A Practitioner's Guide, Cambridge University Press, 2000.
6. J. L. Hennessy, D. A. Patterson, Computer Architecture – A Quantitative Approach, Third Edition, Morgan Kaufmann Publishers, 2003.
7. W. J. DeCoursey, Statistics and Probability for Engineering Applications, Newnes, Amsterdam, 2003.
8. NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook, 2004.
9. F. McMahon, "The Livermore Fortran Kernels: A Computer Test of Numerical Performance Range," Tech. Rep. UCRL-55745, Lawrence Livermore National Laboratory, Univ. of California, Livermore, 1986.
10. SPEC, www.spec.org.
11. J. R. Mashey, "War of the Benchmark Means: Time for a Truce," Computer Architecture News, Vol. 32, No. 3, Sept. 2004. [TO BE PUBLISHED]
12. J. Tang, E. S. Davidson, "An Evaluation of Cray-1 and Cray X-MP Performance on Vectorizable Livermore Fortran Kernels," International Conference on Supercomputing, pp. 510-518, ACM, July 1988.
13.
http://mathworld.wolfram.com, a good website on mathematics. Good place to look up distributions. 14. MIPS Computer Systems, “Performance Brief Part 1: CPU Benchmarks, Issue 3.0,” October 1987. 15. Ralph Humphries, “Performance Report, Revision 1.4, July 1, 1989,” Silicon Graphics Computer Systems, Mountain View, CA. 16. J.H. Tang, E. S. Davidson, “An evaluation of Cray-1 and Cray X-MP Performance on Vectorizable Livermore Fortran Kernels,” International Conf. on Supercomputing, ACM, pp. 510-518, July 1988. NoMeanFeat – Copyright 2004, John Mashey 63 Feedback Please • A. Working on: 1. BS 2. MS 3. PhD 4. Already have PhD or other • B. Statistics background 1. None 2. Some embedded in other course 3. High school course 4. Undergraduate course 6. Grad school And if class taken, what department label? ______________________________ • C. Any terminology that was insufficiently defined early enough? • D. Anything else you would have expected to be in this talk? • E. Anything too elementary? • F. Any other comments or suggestions? NoMeanFeat – Copyright 2004, John Mashey 64