Fun Size Your Data: Using Statistical Techniques to Efficiently Compress and Exploit Benchmarking Results

David J. Lilja
Electrical and Computer Engineering
University of Minnesota
lilja@umn.edu

The Problem
  Benchmark programs
    We can generate heaps of data
    But it's noisy
    Too much to understand or use efficiently
  Heaps o' data
    445 446 397 226 388 3445 188 1002 47762 432 54 12 98 345 2245 8839 77492 472 565 999 1 34 882 545 4022 827 572 597 364 …

A Solution
  Statistical design of experiments techniques
    Compress complex benchmark results
    Exploit the results in interesting ways
    Extract new insights
  Demonstrate using
    Microarchitecture-aware floorplanning
    Benchmark classification

Why Do We Need Statistics?
  Draw meaningful conclusions in the presence of noisy measurements
    Noise filtering
  Aggregate data into meaningful information
    Data compression
  [Figure: heaps o' data reduced to x ...]
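One common way to do the noise filtering and aggregation named on the slide above is to report a mean with a confidence interval instead of the raw measurements. The following is a minimal sketch under stated assumptions, not the method used in the talk: the function name `summarize` and the sample values are hypothetical, and the interval uses a normal approximation (for very small samples a t-distribution quantile would be the more careful choice).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(samples, confidence=0.95):
    """Compress repeated noisy measurements into (mean, (low, high)).

    Hypothetical helper: uses a normal-approximation confidence interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    m = mean(samples)
    s = stdev(samples)  # sample standard deviation
    # Two-sided quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * s / sqrt(n)
    return m, (m - half_width, m + half_width)


if __name__ == "__main__":
    # Hypothetical execution times (seconds) from repeated runs of one benchmark.
    runs = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4]
    avg, (low, high) = summarize(runs)
    print(f"mean = {avg:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the intervals for two configurations do not overlap, the difference is unlikely to be measurement noise alone; if they overlap, more runs may be needed before drawing a conclusion.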

Design of Experiments for Data Compression
  [Table: input values V1-V4 (rows) against inputs A, B, C (columns); the placement of the check marks is not recoverable]
  [Figure: the heaps o' data (445 446 397 226 388 3445 …) reduced to the items below]
  Effects of each input
    A, B, C
  Effects of interactions
    AB, AC, BC, ABC

Types of Designs of Experiments
  (m = number of inputs, v = number of values per input; here m = 3, v = 4)
  Full factorial design with replication
    O(v^m) experiments = O(4^3)
  Fractional factorial designs
    O(2^m) experiments = O(2^3)
  Multifactorial design (Plackett and Burman, P&B)
    O(m) experiments = O(3)
    Main effects only, no interactions
  m-factor resolution x designs
    k O(2^m) experiments = k O(2^3)
    Selected interactions

Example: Architecture-Aware Floor-Planner
  V. Nookala, S. Sapatnekar, D. Lilja, DAC'05.

Motivation
  Imbalance between device and wire delays
  Global wire delays > system clock cycle in nanometer technology
  [Figure: layout with a global wire]

Solution
  Wire-pipelining
    If delay > a clock cycle → insert flip-flops (FFs) along a wire
  Several methods for optimal FF insertion on a wire
    Li et al. [DATE 02]
    Cocchini et al. [ICCAD 02]
    Hassoun et al. [ICCAD 02]
  But what about the performance impact of the pipeline delays?
  [Figure: layout with FFs inserted along a wire]

Impact on Performance
  Execution time = num-instr * cycles/instr (CPI) * cycle-time
  Wire-pipelining
  Key idea
    Some buses are critical
    Some can be freely pipelined without (much) penalty

Change Objective Function
  Execution time = num-instr * cycles/instr (CPI) * cycle-time
  Wire-pipelining
  Traditional physical design objectives
    Minimize area, total wire length, etc.
  New objective
    Optimize only throughput-critical wires to maximize overall performance

Conventional Microarchitecture Interaction with Floor Planner
  [Figure: flow diagram with boxes µ-arch, Benchmarks, Simulation Methodology, CPI info, Frequency, Physical Design]

Microarchitecture-aware Physical Design
  [Figure: flow diagram with boxes µ-arch, Benchmarks, Simulation Methodology, CPI info, Frequency, Physical Design, Layout]
  Incorporate wire-pipelining models into the simulator
    Extra pipeline stages in processor
    Simulator needs to adjust operation latencies

But There are Problems
  [Figure: flow diagram with boxes µ-arch, Benchmarks, Simulation Methodology, CPI info, Frequency, Physical Design, Layout]
  Simulation is too slow
    2000-3000 instructions per simulated instruction
    Numerous benchmark programs to consider
  Exponential search space
    Thousands of combinations tried in physical design step

Design of Experiments Methodology
  [Figure: flow diagram with boxes µ-arch, benchmarks, Design of Experiments based Simulation Methodology, Bus and interaction weights, Frequency, Floorplanning, Layout, Validation]
  MinneSPEC reduced input sets
  Number of simulations is linear in the number of buses (if no interactions)

Related Floorplanning Work
  Simulated Annealing (SA)
    CPI look-up table [Liao et al, DAC 04]
    Bus access ratios from simulation profiles
      Minimize the weighted sum of bus latencies [Ekpanyapong et al, DAC 04]
    Throughput sensitivity models for a selected few critical paths
      Limited sampling for a large solution space [Jagannathan et al, ASPDAC 05]
  Our approach
    Design of experiments to identify criticality of each bus

Microarchitecture and factors
  22 buses → 19 factors in experimental design
    Some factors model multiple buses
  [Figure: block diagram with units Fetch, Decode, IADD2, RUU, REG, IADD3, IMULT, LSQ, BPRED, DL1, IL1, ITLB, IADD1, L2, FADD, DTLB, FMULT]

2-level Resolution III Design
  2 levels for each factor
    Lowest and highest
    possible values (range)
  Latency range of buses
    Min = 0
    Max = chip corner-to-corner wire latency
  19 factors
    32 simulations (nearest power of 2)
  Captured by a design matrix (32x19)
    32 rows: 32 simulations
    19 columns: factor values

Experimental setup
  Nine SPEC 2000 benchmarks
    MinneSPEC reduced input sets
  SimpleScalar simulator
  Floorplanner: PARQUET
    Simulated annealing based
    Objective function
      Minimize the weighted sum of bus latencies
      Secondarily minimize aspect ratio and area

Comparisons
  Case    Description
  SFP     Our "statistical floorplanner"
  acc     Access ratios from [Ekpanyapong et al, DAC 04]
  minWL   Traditional floorplanning

Typical Results for Single Benchmark
  [Figure: results plot, not recoverable]

Averaged Over All Benchmarks
  Compared to acc
    3-7 percentage point improvement
    Better improvements over acc at higher frequencies
  SFP-comb ≈ SFP (within about 1-3 percentage points)

Summary
  Use statistical design of experiments
    Compress benchmark data into critical bus weights
  Used by microarchitecture-aware floorplanner
    Optimizes insertion of pipeline delays on wires to maximize performance
  Extend methodology for other critical objectives
    Power consumption
    Heat distribution

Collaborators and Funders
  Vidyasagar Nookala
  Joshua J. Yi
  Sachin Sapatnekar
  Semiconductor Research Corporation (SRC)
  Intel
  IBM
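
Appendix sketch: the 2-level Resolution III design described in the talk (19 factors, 32 simulations, a 32x19 design matrix) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' actual matrix: it builds a full factorial on 5 hypothetical base factors and derives the remaining 14 columns as products of base columns, then estimates each factor's main effect from one response value per run. The names `resolution3_design` and `main_effects`, the choice of generators, and the synthetic response are all assumptions made for illustration.

```python
from itertools import combinations, product


def resolution3_design(n_base=5, n_factors=19):
    """Return a 2-level design matrix with 2**n_base rows and n_factors columns.

    Levels are coded -1 (lowest value) and +1 (highest value).
    """
    # Full factorial over the base factors: 2**n_base runs.
    base_runs = list(product((-1, 1), repeat=n_base))
    # Column generators: single base factors, then 2-way, 3-way, ... products.
    generators = [g for size in range(1, n_base + 1)
                  for g in combinations(range(n_base), size)]
    if len(generators) < n_factors:
        raise ValueError("too many factors for this number of runs")
    generators = generators[:n_factors]
    matrix = []
    for run in base_runs:
        row = []
        for gen in generators:
            level = 1
            for i in gen:
                level *= run[i]  # product of the chosen base columns
            row.append(level)
        matrix.append(row)
    return matrix


def main_effects(matrix, responses):
    """Estimate each factor's effect: mean response at +1 minus mean at -1."""
    n_runs = len(matrix)
    n_factors = len(matrix[0])
    return [sum(row[j] * y for row, y in zip(matrix, responses)) / (n_runs / 2)
            for j in range(n_factors)]


if __name__ == "__main__":
    design = resolution3_design()
    # Synthetic stand-in for simulated CPI: only factors 0 and 7 matter here.
    cpi = [1.0 + 0.20 * row[0] + 0.05 * row[7] for row in design]
    for j, effect in enumerate(main_effects(design, cpi)):
        print(f"factor {j:2d}: effect {effect:+.3f}")
```

In a design like this, main effects are not confounded with each other but are confounded with two-factor interactions, which is the trade-off that keeps the run count at 32 instead of 2^19. Estimated effects of this kind could serve as per-bus weights for a floorplanner objective.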