Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks

R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*
*IBM T.J. Watson Research Center
+Barcelona Supercomputing Center

MICRO 2012 Tuesday, December 4, 2012 © 2012 IBM Corporation Barcelona Supercomputing Center


Why do we need micro-benchmarks?

What is the maximum power consumption? Any performance bug? Any reliability issues? …
Micro-benchmarks!

Time consuming and tedious
– Error prone task
  • Trial and error process
– Several micro-benchmarks are required
Deep expertise limited to few designers
– Detailed knowledge of the underlying architecture is required


MicroProbe: a micro-benchmark generation framework


MicroProbe Workflow

[Diagram: inputs, framework and outputs]
Inputs
– User: micro-benchmark generation policy, for example:
  • Endless loop, 50% INT and 50% FP
  • Endless loop for each instruction of the ISA
  • Max power stressmark
– Architecture definition files
MicroProbe Framework
Outputs
– Micro-benchmarks
External tools
– Real platforms
– Simulators
– Models


MicroProbe: Distinguishing Features

Columns: Feature, Previous works, MicroProbe

ISA queries
- Instruction type
- Operand length, binary codification etc.
Micro-architecture queries
- Functional unit, latency, throughput, energy per instruction, average instruction power etc.
Micro-architecture models
- Set-associative cache model
Code generation
- Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
- Configurable passes
Design space exploration
- Integrated
- GA-based search
- Exhaustive search
- Customizable search

Previous works entries, in slide order: (manual) (manual) (no) (no) (no) (manual) (manual)


MicroProbe Usage and Design Overview

[Diagram: from research idea to micro-benchmarks]
Research idea
Micro-benchmark generation policies (user-defined scripts), for example:
– Loop stressing the floating point unit
– Sequence of loads hitting 50% L1 and 50% L2
– Generate a stressmark for each functional unit of the architecture
– Search for the sequence of 2 loads and 2 integer operations with maximum IPC
MicroProbe Framework (Python API)
– Architecture module
  • ISA definitions
  • Micro-architecture definitions
  • Micro-architecture analytical models
  • Automatic bootstrap process
– Code generation module
  • Micro-benchmark synthesizer
  • Passes
– Design space exploration module
  • Search drivers
– Properties
External tools
Micro-benchmarks


Max-power Stressmark Generation

Use MicroProbe to generate a max-power stressmark
– Characterize energy per instruction (EPI) and IPC (Architecture Module)
– Select N instructions with max (IPC * EPI)
– Form a basic endless loop (e.g. 4K) using selected instructions (Code Generation Module)
– Generate micro-benchmarks with different orders of the selected N instructions
– Evaluate using Design Space Exploration Module
– Pick the highest power micro-benchmark

Selected instructions: mulldo, xvnmsubmdp, lxvw4x
Example orderings:
  Loop: mulldo mulldo mulldo lxvw4x lxvw4x lxvw4x xvnmsubmdp xvnmsubmdp xvnmsubmdp …
  Loop: mulldo lxvw4x xvnmsubmdp …


MicroProbe: A Micro-benchmark Generation Framework
CASE STUDIES


Experimental Methodology

Platform:
– Processor: POWER7 @ 3 GHz
  • 8-core 4-way SMT
  • 32 KB L1, 256 KB L2 and 4 MB L3 per core
– Memory: 32 GB DDR3 SDRAM @ 800 MHz
– OS: RHEL 5.7 + Linux 3.0.1
– EnergyScale architecture
  • Power measurements in milliwatts
  • Sampling interval as short as 1 ms
In-house software collects power and performance counter traces [C. Lefurgy et al, IBM]


Case Study 1: EPI Characterization

Category (functional units)   Instruction   Core IPC   Normalized EPI (global)   Normalized EPI (category)
FXU                           mulldo        1.40       2.60                      2.60
FXU                           subf          2.00       1.69                      1.69
FXU                           addic         2.00       1.00                      1.00
LSU                           lxvw4x        1.68       2.88                      1.35
LSU                           lvewx         1.68       2.81                      1.31
LSU                           lbz           1.68       2.14                      1.00
VSU                           xvnmsubmdp    2.00       2.35                      1.78
VSU                           xvmaddadp     2.00       2.31                      1.75
VSU                           xstsqrtdp     2.00       1.32                      1.00

High differences in EPI across instructions stressing different micro-architecture components

Category, functional units                                       Instruction   Core IPC   Normalized EPI (global)   Normalized EPI (category)
Simple Integer Operations, FXU or LSU                            add           3.50       1.73                      1.49
Simple Integer Operations, FXU or LSU                            nor           3.50       1.58                      1.36
Simple Integer Operations, FXU or LSU                            and           3.50       1.16                      1.00
Integer Memory Operations, LSU and FXU                           ldux          1.00       5.12                      1.21
Integer Memory Operations, LSU and FXU                           lwax          1.00       5.01                      1.18
Integer Memory Operations, LSU and FXU                           lfsu          1.00       4.24                      1.00
Integer Memory Operations, LSU and 2FXU                          lhaux         1.00       5.51                      1.15
Integer Memory Operations, LSU and 2FXU                          lwax          1.00       5.29                      1.10
Integer Memory Operations, LSU and 2FXU                          lhaux         1.00       4.80                      1.00
Vector/Float/Decimal memory operations, LSU and VSU              stxvw4x       0.48       8.36                      1.40
Vector/Float/Decimal memory operations, LSU and VSU              stxsdx        0.48       7.16                      1.20
Vector/Float/Decimal memory operations, LSU and VSU              stfd          0.48       5.97                      1.00
Vector/Float/Decimal memory operations, LSU and VSU and FXU      stfsux        0.48       10.00                     1.19
Vector/Float/Decimal memory operations, LSU and VSU and FXU      stfdux        0.48       9.49                      1.13
Vector/Float/Decimal memory operations, LSU and VSU and FXU      stfdu         0.48       8.40                      1.00

High differences in EPI across instructions stressing the same micro-architecture components and at the same rate (IPC)


Case Study 2: Max-power Stressmark Generation

Four methods compared:
– DAXPY: use a computational intensive kernel
– Expert manual: use complex instructions accessing different functional units with high IPC
  • Hand-written loop built from mullw, xvmaddadp and lxvd2x
– Expert DSE: generate all possible combinations of complex instructions stressing different units
  • Selected instructions: mullw, xvmaddadp, lxvd2x
– MicroProbe: use MicroProbe
  • Selected instructions: mulldo, xvnmsubmdp, lxvw4x
  • Heuristic: Max(EPI * IPC)


Max-power Stressmark Generation

[Figure: Max-power results. Normalized power (min, mean and max) for each method: DAXPY, Expert Manual, Expert DSE, MicroProbe.]


Case Study 3: Counter-based Processor Power Model

Bottom-up power modeling method
– Functional unit micro-benchmarks (CMP1–SMT1): Dynamic power f(PMCs)
– Random micro-benchmarks (CMP1–SMT1): Intercept SMT1
– Random micro-benchmarks (CMP1–SMT2/4): Intercept SMT2-4, SMT effect
– Random micro-benchmarks (CMP1/8–SMT2/4): Linear regression f(CMP), CMP effect, Uncore power

Model:
\[
\text{Power} = \sum_{k=1}^{\#\text{threads}} \text{Dynamic Power}_k\, f(\text{PMCs})
 + \sum_{k=1}^{\#\text{cores}} \text{SMT effect} \times \text{SMT enabled}_k
 + \text{CMP effect} \times \#\text{cores}
 + \text{Uncore power}
\]


Counter-based Processor Power Model Validation

[Figure: Model accuracy results on SPEC CPU2006. % error of the Micro trained, Random trained, SPEC trained and Proposed models for each CMP-SMT configuration (1-1 through 8-4) and the mean.]

Within acceptable error margins: < 4% on average


Counter-based Processor Power Model Validation on Corner Cases

[Figure: Model accuracy results. % error of the Micro trained, Random trained, SPEC trained and Proposed models for each validation set: FXU High, FXU Low, L1 Loads, Main Memory, VSU High, VSU Low and the mean. One value is labeled 62%.]

Models trained using non-micro-architecture aware training sets show high errors and variability
Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average


Conclusions

MicroProbe is a productive micro-benchmark generation framework
– Adaptive and flexible
– Includes micro-architecture semantics
– Integrates design space exploration
Presented three case studies:
– Instruction-based EPI characterization
– Automated max-power stressmark generation
– CMP/SMT-aware bottom-up counter-based processor power model


MicroProbe: A Micro-benchmark Generation Framework
QUESTIONS?
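
The max-power stressmark recipe from the talk (characterize EPI and IPC, select instructions with max IPC * EPI, generate different orderings, pick the highest-power one) can be sketched in a few lines of Python. This is an illustrative sketch only and does not use the MicroProbe API: the names `EPI_TABLE`, `select_per_unit`, `candidate_loops` and `pick_max_power` are hypothetical, the data is the FXU/LSU/VSU portion of the Case Study 1 table, and choosing one instruction per functional unit is an assumption made because it reproduces the selection shown in the slides (mulldo, lxvw4x, xvnmsubmdp). Power measurement is left to a caller-supplied function, since it needs a real platform.

```python
from itertools import permutations

# (functional unit, instruction, core IPC, normalized global EPI),
# taken from the Case Study 1 table in the slides.
EPI_TABLE = [
    ("FXU", "mulldo", 1.40, 2.60),
    ("FXU", "subf", 2.00, 1.69),
    ("FXU", "addic", 2.00, 1.00),
    ("LSU", "lxvw4x", 1.68, 2.88),
    ("LSU", "lvewx", 1.68, 2.81),
    ("LSU", "lbz", 1.68, 2.14),
    ("VSU", "xvnmsubmdp", 2.00, 2.35),
    ("VSU", "xvmaddadp", 2.00, 2.31),
    ("VSU", "xstsqrtdp", 2.00, 1.32),
]


def select_per_unit(table):
    """For each functional unit, keep the instruction with the highest IPC * EPI.

    Assumption: one instruction per unit (the slides only say "select N
    instructions with max IPC * EPI").
    """
    best = {}
    for unit, name, ipc, epi in table:
        score = ipc * epi
        if unit not in best or score > best[unit][1]:
            best[unit] = (name, score)
    # Return instructions ordered by unit name for determinism.
    return [best[unit][0] for unit in sorted(best)]


def candidate_loops(instructions, repeat=1):
    """Yield one loop body per ordering of the selected instructions.

    With repeat > 1 each instruction is emitted `repeat` times in a row,
    which gives the grouped pattern shown in the slides.
    """
    for order in permutations(instructions):
        yield [ins for ins in order for _ in range(repeat)]


def pick_max_power(loops, measure_power):
    """Return the loop body for which measure_power(loop) is highest.

    measure_power is a hypothetical callable supplied by the user; on real
    hardware it would run the micro-benchmark and read a power sensor.
    """
    return max(loops, key=measure_power)
```

A caller would pass its own measurement routine, for example `pick_max_power(candidate_loops(select_per_unit(EPI_TABLE)), measure_power=run_and_read_sensor)`, where `run_and_read_sensor` is a placeholder for platform-specific code. Exhaustive permutation is only practical for small N; the deck lists GA-based and customizable search drivers for larger design spaces.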