The Elusive Metric for Low-Power Architecture Research

Hsien-Hsin “Sean” Lee, A. Utku Diril, Joshua B. Fryman, Yuvraj S. Dhillon
Center for Experimental Research in Computer Systems
Georgia Institute of Technology, Atlanta, GA 30332
Workshop for Complexity-Effective Design (WCED-03), San Diego, CA, 2003

Background Picture

Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  “Power” is meaningless ($\propto$ frequency)
  “Energy per instruction” is elusive ($\propto CV^2$)
  “Energy $\times$ Delay” (J/SPEC or J/IPC) is better
  Use Alpha-power model, $ED \propto C V_{dd}^3 / (V_{dd} - V_{th})^{\alpha}$
  Note that EDP has no “physical” meaning
Widespread adoption
  De facto standard by community
  Metric for energy and complexity effectiveness
New architectural techniques have arrived
  New hardware exploiting low-power opportunities
  Temperature-aware power detectors
  Voltage & Frequency Scaling
  Multi-threshold voltage

Outline of the Talk

Potential pitfalls
  Yeah, we all know, it is obvious... but
  Which “E” goes in ED product?
  Impact of new hardware (more transistors)
  Methodology matters in deep submicron processes
Observations
Summary

Calculating ED Product

New architecture solutions save energy at the expense of (insensitive) performance loss
A number of research results were reported in the following manner:
  Technique “X” for Data Cache
    Reduce 50% energy of Data Cache
    Lose 20% IPC
    EDP = (1-0.5)(1+0.2) = 0.60
    Very energy efficient
  Technique “Y” for Branch Predictor
    Reduce 10% energy of Branch Predictor
    Lose 20% IPC
    EDP = (1-0.1)(1+0.2) = 1.08
    Energy inefficient

So What is E and What is D in EDP?

Hypothetical black box
  Battery (i.e. E) shared by CPU, DRAM, chipsets, graphics, TFT, Wi-Fi, HDD, flash disk
  D typically accounts for some system effect such as DRAM latency
Improvement proposed:
  Remove 5% of E from flash disk
  No delay incurred
Is this a good design decision?
  Flash disk is 10% of total E in system
  Improvement amounts to 0.5% system impact
  “In-the-noise” improvement
Is the “complexity” worth the effort?
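The flash-disk arithmetic above can be sketched in a few lines. This is a minimal illustration rather than the authors' model: it assumes a component's saving scales linearly with its share of system energy, that the slowdown applies to the whole system, and that the energy of the other components does not grow with run time. The function name `system_edp_ratio` and the 15% data-cache share are hypothetical and chosen only for illustration.

```python
def system_edp_ratio(component_share, energy_saved, delay_increase):
    """Return EDP_new / EDP_ref for the whole system (below 1.0 is a win).

    component_share: fraction of total system energy drawn by the component
    energy_saved:    fraction of that component's energy the technique removes
    delay_increase:  fractional slowdown (0.2 means run time grows by 20%)
    """
    energy_ratio = 1.0 - component_share * energy_saved
    delay_ratio = 1.0 + delay_increase
    return energy_ratio * delay_ratio


# Flash-disk example from the slide: 10% of system E, 5% saved, no delay.
flash = system_edp_ratio(0.10, 0.05, 0.0)

# Technique "X" as reported: the data cache is treated as if it were the
# whole system (share = 1.0), which reproduces the 0.60 figure.
x_local = system_edp_ratio(1.0, 0.50, 0.20)

# Same technique with a HYPOTHETICAL 15% system-energy share for the
# data cache (an assumption, not a number from the talk).
x_system = system_edp_ratio(0.15, 0.50, 0.20)

print(f"flash disk, system level:    {flash:.3f}")
print(f"technique X, component only: {x_local:.3f}")
print(f"technique X, system level:   {x_system:.3f}")
```

Under these assumptions, a technique whose component-only EDP looks very favorable can end up above 1.0 once the component's actual share of system energy is used, which is the pitfall the slide points at.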
[Figure: system block diagram with a battery supplying the Gfx card, flash, C.S., 802.11, DDR DRAM, HDD, and TFT display.]
So, is EDP used in the right way? And is EDP so important?

Energy Efficiency: E versus D

[Figure: maximum delay tolerance (log scale, 0.0001 to 100) versus power distribution of a functional unit (FU) w.r.t. target system (0 to 1); one curve each for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%.]

Example: Energy Efficiency: E vs. D

[Figure: the same plot with the x-axis labeled energy distribution w.r.t. target system; annotation: tolerate ~25% performance loss.]

Using EDP: Pentium Pro

Data Source: [Brooks et al. 00]
Assume 100% for CPU
40% IFU power reduction can tolerate < 10% performance loss
[Figure: maximum delay tolerance (0 to 0.3) versus energy saved for a functional unit u (0 to 1); curves for IFU (22%), IEU (14%), ROB, DCU (11.1%), RS, FPU, Global Clock (7.9%), RAT, MOB (6.3%), BTB (4.7%).]

But CPU is not 100% of a System

[Figure: two panels of maximum delay tolerance (0 to 150) over an x-axis from 0 to 1; curves for CPU = 100%, 75%, 50%, 25%.]

Case Study: Filter Cache [Kin et al. 97, 00]

The Filter Cache design as reported
  58% energy savings in “L1 Caches”
  21% IPC degradation
ED product as shown, (1-0.58)(1+0.21) << 1, suggests this is a winning design
Question is “which E?”

Filter Cache: E Values

Esaved = 58% [Kin et al. 00]
Use StrongARM 110: 43% of energy consumed by caches
  27% in I-cache
  16% in D-cache
CPU=X% stands for X% of overall power drawn by CPU
Delay tolerance (FC slowdown is 21%)
  33% : CPU=100%
  21% : CPU=70%
  14% : CPU=50%
  6% : CPU=25%
Not energy-efficient if CPU < 70%
[Figure: maximum delay tolerance (0 to 1.4) versus energy distribution for a functional unit u w.r.t. CPU only (0 to 1); curves for CPU = 100%, 70%, 50%, 25%, with the Filter Cache on SA-110 (I$+D$ = 43%) marked.]

Rethinking EDP: Switching Activity vs. New Hardware

Ignore leakage and short-circuit power
  Dynamic switching power is dominant
The “E” would be as below
  T: transistor count
  f: frequency
  $P_{dyn} = a \cdot f \cdot C \cdot V_{dd}^2 = a \cdot f \cdot C_{g,avg} \cdot T \cdot V_{dd}^2$
  $\frac{P_{dyn,ref}}{P_{dyn,new}} = \frac{a_{ref} \cdot f \cdot T}{a_{new} \cdot (f + \Delta f) \cdot (T + \Delta T)}$

ED Variables

The elegant ratio governing E...
  $\frac{a_{ref}}{a_{new}} > 1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f \, \Delta T}{f T}$
To include the application delay, D...
  $\frac{a_{ref}}{a_{new}} > \left(1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f \, \Delta T}{f T}\right)\left(1 + \frac{\Delta D}{D}\right)^2$
Can be applied to macromodeling to determine the trade-off between transistor count and performance degradation

Impact of Additional Transistor Count

[Figure, left: % impact on D versus % impact on T (given freq. unchanged), T from -35% to 45%; curves for 30%, 25%, and 10% switching reduced.]
[Figure, right: % impact on f versus % impact on T (given delay unchanged by frequency scaling), T from 0% to 50%; same three curves.]
Given a new avg switching probability of the new architecture
  LHS: Trading transistors with delay given no freq. scaling
  RHS: Delay recovered by freq. scaling

Role of Leakage Energy

As the Deep Sub-Micron (DSM) era is upon us...
  More than 50% of power from leakage
  Source: Intel Corp., Custom Integrated Circuits Conference 2002
Ignoring leakage could reverse the conclusion
  Early architecture evaluation
  Leakage cannot be isolated from switching during evaluation
  Additional HW can be harmful

Evaluate the Leakage when adding HW in Early Stage of Arch Definition

Example: Dual-speed pipeline [Pyreddy and Tyson 01]
  Idea appears to be plausible
  Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  Two datapaths: fast and slow
  Critical inst to fast pipe; remainder to slow
  Slow pipe consumes less E than fast pipe
    E.g. multi-voltage supply, lower frequency
Let’s evaluate and assume:
  N instructions; x to the slow datapath, (N-x) to the fast datapath
How does leakage impact efficiency?
What x value to achieve energy efficiency?
[Figure: instruction stream split into x% non-critical instructions (slow datapath) and (1-x)% critical instructions (fast datapath).]

Dual Datapath Leakage Impact

“r” is the power ratio of slow vs. fast
  A small r impairs performance
  Slow path becomes critical path
% of non-critical inst needed for slow datapath
  Today: ~17%
  Soon: ~40%
[Figure: minimum instructions to slow datapath (0 to 0.5) versus static-to-total energy ratio (0 to 1); curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, 0.2; “Today” and “Soon to be” markers.]

Energy Savings v. # Inst of Slow Path

[Figure: two panels, r = 75% and r = 50%. X-axis: % of instructions to non-critical datapath (0 to 0.4). Y-axis: % energy saved (-60 to 20). Curves for static-to-total = 1%, 20%, 33%, 50%, 67%, 75%.]
If 30% of instructions are sent to the non-critical datapath
  Only save ~5% energy (savings only on datapath) in DSM for r=75%
  Consume more energy in DSM for r=50%
Does the extra complexity pay off?

Observations

It is insufficient to examine ED product on a microscale; the entire system must be examined.
Adding HW complexity for low energy needs to be evaluated thoroughly
  If the target process is not DSM, ED product can be examined via simplified ratio analysis
  For DSM process
    Leakage must be accounted for in local and system E
    Additional HW could be an overkill

Summary

Low-power architecture research:
  Metric could be elusive
  Methodology
    More susceptible to reverse conclusions than performance research, if not meticulously applied
    2nd order effect today, 1st order effect tomorrow
  “Complexity” can be ineffective in energy reduction
Purposes of our study
  Provide analytical models and methodology for early evaluation
  No intention to invalidate prior results
  WCED, WDDD
  Raise more discussions
  To get it right in education

That’s All Folks!
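The talk does not spell out the formula behind its maximum-delay-tolerance curves. One reading that agrees with the figures quoted on the Pentium Pro and Filter Cache slides is the break-even condition (1 - s)(1 + d) = 1, where s is the fraction of system energy saved and d is the fractional slowdown. The sketch below assumes that reading; the function name `max_delay_tolerance` and its parameter names are illustrative, not taken from the slides.

```python
def max_delay_tolerance(unit_share, energy_saved, cpu_share=1.0):
    """Largest fractional slowdown that keeps system EDP at or below 1.

    unit_share:   fraction of CPU energy drawn by the functional unit
    energy_saved: fraction of the unit's energy removed by the technique
    cpu_share:    fraction of system energy drawn by the CPU (assumed knob)

    Break-even: (1 - s)(1 + d) = 1 with s = cpu_share * unit_share * energy_saved,
    which gives d = s / (1 - s).
    """
    s = cpu_share * unit_share * energy_saved
    return s / (1.0 - s)


# Filter Cache on StrongARM 110: caches draw 43% of CPU energy, 58% is saved.
filter_cache = {
    cpu_share: max_delay_tolerance(0.43, 0.58, cpu_share)
    for cpu_share in (1.0, 0.70, 0.50, 0.25)
}
for cpu_share, tolerance in filter_cache.items():
    print(f"CPU={cpu_share:.0%}: tolerates up to {tolerance:.1%} slowdown")

# Pentium Pro IFU: 22% of CPU energy, 40% reduction, CPU assumed to be 100%.
ifu = max_delay_tolerance(0.22, 0.40)
print(f"IFU: tolerates up to {ifu:.1%} slowdown")
```

Under this assumption the results land close to the slide values of roughly 33%, 21%, 14%, and 6% for the Filter Cache, and just under 10% for the IFU. Since the reported Filter Cache slowdown is 21%, the design sits at about break-even when the CPU draws 70% of system energy.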