PowerPoint

A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors

Rakesh Kumar
Keith Farkas (HP Labs)
Norman Jouppi (HP Labs)
Partha Ranganathan (HP Labs)
Dean Tullsen (UCSD)

Motivation
* Power is an important issue for processors
* Going up every successive generation (with complexity)
  - Up to 150W for Alpha 21464!

Past Techniques for Power Reduction
* Voltage/frequency scaling
  Limitation: Limited by technology. Also, not possible below a certain feature size.
* Architectural adaptation
  - shut off portions of core when not needed
  - dynamic speculation control
  - reconfigurable caches
  Limitations:
  - Very few choices to make
  - Only dynamic power being saved
  - Has associated overhead

Single-ISA Heterogeneous Multi-Core Architectures

Our Proposal
* Have multiple heterogeneous cores on the same die
* Match workload (or workload phase) to the core that achieves best efficiency according to some objective function (ensure that the new core has acceptable performance)
* Power down the unused cores

Motivation
* Hypotheses
  - Performance difference between cores varies based on workload or workload phases
  - Different cores have varying relative energy efficiencies for the same workload
* Implication: possibility of dynamically changing "best" core

Goals of the Paper
* Validate the hypotheses
* Get an idea of the design space
* Get an idea of the potential benefits

Outline of Talk
* Motivation
* Past Work
* Our Work
  - Assumptions
  - Decisions
  - Methodology
* Results and Conclusions
* Summary and Future Work

Choice of Cores on the Die
* Five cores on the die:
  In-order: QED R4700, EV4 (Alpha 21064), EV5 (Alpha 21164)
  Out-of-order: EV6 (Alpha 21264), "EV8-"
* All cores assumed to be without L2 cache.
* "EV8-": Issue width is same as EV8 (Alpha 21464)
  - Resources reduced to account for a single thread.
  - Core-power dissipation: 100W

Properties of the Cores

Processor      R4700         EV4          EV5          EV6              EV8-
Issue-width    1             2            4            6 (OOO)          8 (OOO)
I-Cache        2-way 16KB    DM, 8KB      DM, 8KB      2-way 64KB       4-way 64KB
D-Cache        2-way 16KB    DM, 8KB      DM, 8KB      2-way 64KB       4-way 64KB
Branch Pred.   No            2KB/1-bit    2K-gshare    Hybrid 2-level   Hybrid 2-level
MSHR           1             2            4            8                16

Notice the gradation!

Properties of Cores (contd.)
* Assume all cores implemented in 0.1um
  - Scaled area and power accordingly
* Clock speed?
  - All Alpha cores assumed to run at 2.1GHz (EV6 frequency at 0.10 micron)
  - R4700 assumed to run at 1GHz

Core Power and Area
* Peak power of core estimated from data sheets
  - minus that used by L2 caches and pins
  - then scaled for 0.1um process
* Area of core estimated from die photos
  - minus that of I/O pad, wires, L2 cache & control
  - then scaled for 0.1um process
* L2 cache area and power
  - estimated using CACTI

Core Power and Area (contd.)

Processor   Core-power (in W)   Core-area (in mm^2)
R4700       0.45                3
EV4         4.97                3
EV5         9.83                5
EV6         17.80               24
EV8-        92.88               260

EV8- consumes 200 times more power than R4700! It is more than 85 times bigger too!

Core Power and Area (contd.)
[Figure]

Methodology
* Simulator used: SMTSIM
* ROB-size, Activelist-size and Load-store queue always kept big enough to ensure no conflicts.
* Benchmarks used: 14 chosen randomly out of SPEC2000 suite
* Fast-forwarded for 2 billion instructions, simulated for 1 billion instructions.
* Data collected after every 1 million instructions.

Validating Hypotheses
* Performance difference between cores varies based on workload or workload phases (IPS)
* Different cores have varying relative energy efficiencies for the same workload (IPS/W)

Performance Variation with Time
[Figure: IPS (y-axis, 0 to 2) against committed instructions in millions (x-axis, 1 to 801); one curve each for EV8-, EV6, EV5, EV4, R4700]
Ah! Those clear, distinct phases!
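
As a rough check on the "Core Power and Area" table, and to illustrate the IPS/W metric named under "Validating Hypotheses", the arithmetic can be sketched as below. This is a minimal sketch, not the authors' tooling: the power and area numbers are copied from the table, while the IPS samples, the helper names (`ratio`, `ips_per_watt`) and the use of peak core power as the divisor are assumptions made only for illustration.

```python
# Peak core power (W) and core area (mm^2), copied from the
# "Core Power and Area" table in the slides.
CORE_POWER_W = {"R4700": 0.45, "EV4": 4.97, "EV5": 9.83, "EV6": 17.80, "EV8-": 92.88}
CORE_AREA_MM2 = {"R4700": 3, "EV4": 3, "EV5": 5, "EV6": 24, "EV8-": 260}


def ratio(table, core, baseline):
    """Return how many times larger `core` is than `baseline` in `table`."""
    return table[core] / table[baseline]


def ips_per_watt(ips, core):
    """IPS/W, ASSUMING peak core power stands in for power drawn in the interval."""
    return ips / CORE_POWER_W[core]


power_ratio = ratio(CORE_POWER_W, "EV8-", "R4700")   # a little over 200
area_ratio = ratio(CORE_AREA_MM2, "EV8-", "R4700")   # a little over 85

# HYPOTHETICAL IPS samples for a single interval (arbitrary units).
# These values are NOT taken from the slides.
sample_ips = {"R4700": 0.2, "EV8-": 1.6}
efficiency = {core: ips_per_watt(ips, core) for core, ips in sample_ips.items()}

if __name__ == "__main__":
    print(f"power ratio: {power_ratio:.1f}, area ratio: {area_ratio:.1f}")
    for core, value in efficiency.items():
        print(f"{core}: {value:.4f} IPS/W")
```

With these made-up samples the EV8- is eight times faster than the R4700 yet far less efficient per watt, because its power is roughly 200 times higher.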
Variation of Energy Efficiency with Time
[Figure: IPS/W (y-axis, 0 to 80) against committed instructions in millions (x-axis, 1 to 801); one curve each for R4700, EV4, EV5, EV6, EV8-]
Power dominates IPS/W numbers! How does a composite objective function fare?

Energy-delay Product Profile
[Figure: IPS^2/W (y-axis, 0 to 0.2) against committed instructions in millions (x-axis, 1 to 801); one curve each for R4700, EV4, EV5, EV6, EV8-]
So why not run on the "best" core at all points of time?

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%)
[Figure: IPS^2/W (y-axis, 0 to 0.2) against committed instructions in millions (x-axis, 1 to 801); curves for R4700, EV4, EV5, EV6, EV8- and Best-path]
Notice the regions where best-path is not along the best energy-delay product!

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%) [Summary of Results]
* Table columns: Energy-Delay Savings (%), Performance Degradation (%)
* Table rows: Maximum, Minimum, Mean
* Values in the order they survive in the source: 97.9, 0.1, 65.4 and 8.5, 0.1, 18.2
* Number of switchings: Maximum = 387 (art), Minimum = 0, Median = 1

Dissecting the Results
* More improvements possible:
  - locally-best decisions not necessarily globally-best
  - there was a performance constraint
  - choice of cores not the best for this objective function
  - cache configurations not necessarily the best
* Even for present improvements, beats voltage scaling handsomely (44.2% ED2 improvement)

Conclusion
* Enormous potential for power savings
* No leakage-power solution
* Does considerable IP reuse
* Complexity-appropriate
  - every application matched to the "appropriate" complexity core
Tip of the iceberg?

Current/Future Work
* Cores can be non-ordered
* Some cores can be multithreaded
* Throughput impact of the architecture

Questions?
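
The dynamic selection described in the "Choosing Dynamically the Core with Best Energy-Delay Product" slides can be sketched as a per-interval greedy choice. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the 50% performance-loss bound is measured against the EV8- core, treats IPS^2/W (the inverse of energy-delay product, so higher is better) as the objective, uses peak core power from the slides' table, and ignores switching overhead. The function names and the IPS samples are hypothetical.

```python
# Peak core power in watts, copied from the "Core Power and Area" table.
CORE_POWER_W = {"R4700": 0.45, "EV4": 4.97, "EV5": 9.83, "EV6": 17.80, "EV8-": 92.88}


def pick_core(ips_by_core, power_w=CORE_POWER_W, reference="EV8-", max_loss=0.5):
    """Pick the core with the highest IPS^2/W whose performance loss
    relative to `reference` stays below `max_loss` (ASSUMED baseline)."""
    min_ips = (1.0 - max_loss) * ips_by_core[reference]
    eligible = [core for core, ips in ips_by_core.items() if ips > min_ips]
    return max(eligible, key=lambda core: ips_by_core[core] ** 2 / power_w[core])


def best_path(intervals):
    """Greedy per-interval choice; returns the chosen cores and the switch count."""
    path = [pick_core(sample) for sample in intervals]
    switches = sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)
    return path, switches


# HYPOTHETICAL per-interval IPS samples (arbitrary units, not from the slides).
intervals = [
    {"R4700": 0.2, "EV4": 0.5, "EV5": 0.7, "EV6": 1.0, "EV8-": 1.2},
    {"R4700": 0.3, "EV4": 0.6, "EV5": 0.9, "EV6": 1.0, "EV8-": 1.1},
]
path, switches = best_path(intervals)

if __name__ == "__main__":
    print(path, switches)
```

In this made-up data the R4700 has the highest IPS^2/W in both intervals but fails the performance bound, so the chosen path does not follow the best energy-delay curve. Because each interval is decided on its own, the result is locally best and not necessarily globally best, which is the first limitation listed under "Dissecting the Results".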
