Forwardflow: A Scalable Core for Power-Constrained CMPs
Dan Gibson and David A. Wood
ISCA 2010, Saint-Malo, France
UW-Madison Computer Sciences, Multifacet Group

Executive Summary [1/2]
• Future CMPs will need Scalable Cores
  – Scale UP for single-thread performance
    • Exploit ILP
  – Scale DOWN for multiple threads
    • Save power
    • Exploit TLP
• Hard with traditional μArch
ISCA 2010 - 2

Executive Summary [2/2]
• Our Contribution: Forwardflow
  – New Scalable Core μArch
    • Uses pointers to eliminate associative search
    • Distributes values, no PRF
    • Scales to large instruction window sizes
    • Full-window scheduler → no IQ clog
  – Scales dynamically
    • Variable-sized instruction window
    • ~20% power/performance range
ISCA 2010 - 3

Ancient History: The Memory Wall [Wulf94] (16 years ago)
• 1994: Processors get faster faster than DRAM gets faster
  – Solutions: more caches, superscalar, OoO, etc.
• 2010: Processors are a lot faster than DRAM
  – 1 DRAM access = ~100s of cycles
[Timeline: 386, 20 MHz; 486, 50 MHz; P6, 166 MHz; PIV, 4000 MHz]
[Image: Prise de la Bastille (Storming of the Bastille), by Jean-Pierre-Louis-Laurent Houël]
ISCA 2010 - 4

Moore's Law Endures (Obligatory Slide)
• Device counts continue to grow
  – Rock, 65nm [JSSC2009]
  – Rock16, 16nm [ITRS2007]
• More Transistors → More Threads? (~512) More Cache?
[Image: Intel Moore's Law graphic, © 2005 Intel Corporation: "In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore's Law remains true, driven largely by Intel's unparalleled silicon expertise."]
ISCA 2010 - 5

Amdahl's Law Endures
"Everyone knows Amdahl's law, but quickly forgets it." - Thomas Puzak
• Parallel speedup is limited by the parallel fraction f:
  Speedup(N, f) = [ (1 - f) + f/N ]^-1
  – i.e., only ~10x speedup at N = 512, f = 90% (a worked example follows the Scalable Cores slide below)
• Takeaway: No TLP = No Speedup [Hill08]
[Plot: speedup vs. core count N (1 to 512) for f = 99%, 95%, and 90%]
ISCA 2010 - 6

Utilization Wall (aka SAF) [Venkatesh2009] [Chakraborty2008]
• Simultaneously Active Fraction (SAF): fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget
• Takeaway: More Transistors → Lots of them have to be off
[Plot: dynamic SAF vs. process node (90nm, 65nm, 45nm, 32nm) for HP and LP devices]
ISCA 2010 - 7

Walls, Laws, and Threads
• Power prevents all of the chip from operating all of the time
• Many applications are single-threaded – Need ILP
• Some applications are multi-threaded – Need TLP
• Emerging Solution: Scalable Cores
ISCA 2010 - 8

Scalable Cores
"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" - Attributed to Seymour Cray
• Scale UP for Performance
  – Use more resources for more performance (e.g., 2 strong oxen)
• Scale DOWN for Energy Conservation
  – Exploit TLP with many small cores (e.g., 1024 chickens)
ISCA 2010 - 9
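As a worked instance of the formula on the Amdahl's Law slide (a reference calculation added here, not a slide from the deck):

    \[
      \mathrm{Speedup}(N, f) \;=\; \bigl[(1 - f) + \tfrac{f}{N}\bigr]^{-1}
    \]
    \[
      \mathrm{Speedup}(512,\, 0.90) = \bigl[0.10 + \tfrac{0.90}{512}\bigr]^{-1} \approx 9.8\times,
      \qquad
      \mathrm{Speedup}(512,\, 0.99) = \bigl[0.01 + \tfrac{0.99}{512}\bigr]^{-1} \approx 84\times
    \]

Even with 512 cores, a 90%-parallel program tops out near 10x; only a very high parallel fraction recovers meaningful speedup, which is the slide's "No TLP = No Speedup" point.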
Core Scaling
• Assume SAF = 50%
• Baseline 8-core CMP:
  – Scale Down → many small cores: many threads for TLP
  – Scale Up → one big core: one thread for ILP
• Hard to do with a traditional core design (not impossible)
[Diagram: 8-core CMP with per-core caches, scaled down to many small cores or scaled up to a single large core]
ISCA 2010 - 10

Microarchitecture for Scalable Cores
• Conventional OoO:
  – Interdependent structures scaled together
    • Some structures easy to scale, some hard
  – Scaling up means scaling to large sizes
    • Hard to tolerate search operations in large structures
• This Work:
  – Single, integrated structure: a RAM-based, disaggregated instruction window
  – Wire-delay-tolerant design
  – Avoid associative search: pointers instead
ISCA 2010 - 11

Forwardflow – Forward Pointers
• Use pointers to explicitly define data movement
  – Every operand has a Next Use pointer
  – Register names not needed
+ No associative search (ever)
– Serialized wakeup
  • Usually OK: most ops have few successors [Ramirez04, Sassone07]
[Diagram: example pointer chains over the sequence ld [R4+4]→R1; add R1,R3→R3; sub R4,16→R4; st R3,[R8]; breq R4, with S1/S2/Dest fields linked by Next Use pointers]
ISCA 2010 - 12

Forwardflow – Dataflow Queue (DQ)
• Combination scheduler, ROB, and register file
  – Schedules instructions
  – Holds data values for all operands
• Register Consumer Table (RCT) tracks, per register, the DQ slot and field of its most recent use (a minimal dispatch sketch follows the Scaling a Forwardflow Core slide below)
[Diagram: DQ holding ld [R4+4]→R1; add R1,R3→R3; sub R4,16→R4; st R3,[R8], with RCT entries R1 → 2-S1, R3 → 4-S1, R4 → 5-S1 as the next instruction, breq R4, dispatches]
ISCA 2010 - 13

Physical Organization
• Logical DQ organization vs. physical DQ organization
• DQ Bank Group – the fundamental unit of scaling
[Diagram: the logical DQ partitioned across physical DQ bank groups]
ISCA 2010 - 14

Scaling a Forwardflow Core
• Fully-provisioned Forwardflow core:
  – Frontend (BP, L1-I), RCTs, ARF, backend, L1-D
  – 4 bank groups: 128-entry DQ, 2 IALU, 2 FPALU each
• Scale the core by scaling the DQ
  – BGs power on/off independently
• Configurations:
  – F-1: 1 BG, 128-entry DQ, 2 IALU, 2 FPALU, 2 DMEM
  – F-2: 2 BGs, 256-entry DQ, 4 IALU, 4 FPALU, 2 DMEM
  – F-4: 4 BGs, 512-entry DQ, 8 IALU, 8 FPALU, 2 DMEM
ISCA 2010 - 15
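As a rough illustration of the dispatch-time pointer linking described on the Forward Pointers and Dataflow Queue slides, here is a minimal Python sketch (added for illustration; class and field names such as DQEntry, next_use, and rct are invented, and the real design distributes these tables across DQ banks):

    # Minimal sketch of Forwardflow-style dispatch: each operand field gets a
    # single "next use" pointer, and the Register Consumer Table (RCT) records
    # the most recent use of each architectural register so a new consumer can
    # be appended to that register's pointer chain.

    class DQEntry:
        def __init__(self, op, srcs, dest):
            self.op = op                      # opcode, e.g. "add"
            self.srcs = list(srcs)            # register names or immediates
            self.dest = dest                  # destination register name or None
            self.src_vals = [None] * len(srcs)
            self.next_use = {}                # field ("S1"/"S2"/"D") -> (slot, field)

    class DQ:
        def __init__(self):
            self.entries = []                 # the Dataflow Queue
            self.rct = {}                     # reg -> (slot, field) of most recent use

        def dispatch(self, op, srcs, dest):
            slot = len(self.entries)
            entry = DQEntry(op, srcs, dest)
            self.entries.append(entry)
            for i, src in enumerate(srcs):
                field = "S%d" % (i + 1)
                if isinstance(src, str):      # register operand: link into its chain
                    prev = self.rct.get(src)
                    if prev is not None:
                        prev_slot, prev_field = prev
                        self.entries[prev_slot].next_use[prev_field] = (slot, field)
                    self.rct[src] = (slot, field)
                else:                         # immediate: value known at dispatch
                    entry.src_vals[i] = src
            if dest is not None:              # destination starts a fresh chain
                self.rct[dest] = (slot, "D")
            return slot

    dq = DQ()
    dq.dispatch("ld",  ["R4", 4],    "R1")    # ld  [R4+4] -> R1
    dq.dispatch("add", ["R1", "R3"], "R3")    # add R1, R3 -> R3
    dq.dispatch("sub", ["R4", 16],   "R4")    # sub R4, 16 -> R4
    print(dq.entries[0].next_use)             # {'D': (1, 'S1'), 'S1': (2, 'S1')}

Running it shows the ld's destination field pointing at the add's first source, and the ld's first source field pointing at the sub's first source: one successor pointer per operand rather than a broadcast tag match.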
Evaluation: Questions
1. Is single-thread Forwardflow core performance comparable to a similarly-sized OoO?
2. Does FF DQ scaling effectively scale performance for single threads?
3. How does DQ scaling affect power consumption?
ISCA 2010 - 16

Evaluation: Target Machine
• 8-core CMP
  – 32KB L1s, 1MB L2s, 8MB shared L3
  – NoSQ [Sha06]
  – OoO baseline
  – SPARCv9 "+"
• Running one thread
  – 7 cores off, 1 on
  – SPEC CPU + commercial workloads (1 thread)
[Diagram: 8 cores (Core0-Core7) with private L1-I/L1-D and L2, an 8-bank shared L3 (L3B0-L3B7), and two DRAM channels (DRAM0, DRAM1)]
ISCA 2010 - 17

Results: OoO-like Performance
• Some bad cases (e.g., bzip2): not enough misses to cover serialized wakeup
• Some good cases (e.g., libquantum): OoO suffers from IQ clog
• Overall, Forwardflow (F-1) performance is close to that of a same-size OoO
[Plot: normalized runtime of F-1 across SPEC INT 2006, SPEC FP 2006, and commercial workloads, with geometric mean]
ISCA 2010 - 18

Results: Performance Scaling
• Takeaway: Forwardflow's backend scaling scales core performance
• Runtime reduction compared to F-1:
  – F-2: 12%
  – F-4: 21%
• Some great cases, some non-great cases
[Plot: normalized runtime of F-1, F-2, and F-4 across SPEC INT 2006, SPEC FP 2006, and commercial workloads, with geometric mean]
ISCA 2010 - 19

Results: Power Scaling
• F-1 consumes 10% less power than OoO
  – Most of the difference comes from the fine-grained DQ accesses and smaller RF
• Scaling up increases power consumption in unscaled components
  – Larger windows better utilize caches and frontend
  – Backend consumption scales reasonably (30%)
[Plot: normalized power of OoO, F-1, F-2, and F-4, broken down into backend, frontend, static, caches, and other; per-component increases of +11% to +16% annotated]
ISCA 2010 - 20

Concluding Remarks
• Future CMPs will need Scalable Cores
  – Scale UP for single-thread performance
  – Scale DOWN to run multiple threads
• Forwardflow Core:
  – New μArch for scaling the instruction window
  – ~20% power/performance range
ISCA 2010 - 21

This looks familiar…
"Didn't I just see a talk on this topic from the same institution?" - 75% of the audience (the waking portion)

                WiDGET [Watanabe10]      Forwardflow [Gibson10]
    Vision      Scalable Cores           Scalable Cores
    Mechanism   Steering, In-Order       Pointers, DQ, OoO
                Approximates In-Order    Full-Window Scheduling
ISCA 2010 - 22

Acknowledgments / Q&A
NSF CCR-0324878, CNS-0551401, and CNS-0720565 for financial support (e.g., keeping me alive in graduate school, buying cluster nodes, etc.). Multifacet and Multiscalar groups for years of guidance and advice. Yasuko Watanabe for simulator contributions. UW Computer Architecture Affiliates for many discussions, suggestions, and encouraging remarks. ACM/SIGARCH + IEEE/TCCA for part of a trip to France. Megan Gibson for the rest. Anonymous reviewers are also swell people and their advice made this work better.
ISCA 2010 - 23

INDEX OF BACKUP SLIDES
• Multithreaded Workloads
• Using DVFS to Scale
• Mispredictions
• ARF
• More vs. WiDGET
• A Day in the Life of a Forwardflow Op
  – Decode
  – Dispatch
  – Wakeup
  – Issue
  – Writeback
  – Commit
ISCA 2010 - 24

Related Work
• Scalable Schedulers
  – Direct Instruction Wakeup [Ramirez04]:
    • Scheduler has a pointer to the first successor
    • Secondary table for a matrix of successors
  – Hybrid Wakeup [Huang02]:
    • Scheduler has a pointer to the first successor
    • Each entry has a broadcast bit for multiple successors
  – Half Price [Kim02]:
    • Slice the scheduler in half
    • Second operand often unneeded
ISCA 2010 - 26

Related Work
• Dataflow & Distributed Machines
  – Tagged-Token [Arvind90]:
    • Values (tokens) flow to successors
  – TRIPS [Sankaralingam03]:
    • Discrete execution tiles: X, RF, $, etc.
    • EDGE ISA
  – Clustered designs [e.g., Palacharla97]:
    • Independent execution queues
ISCA 2010 - 27
Results: Multiple Threads
• SPEC OMP power/performance
• Most benchmarks trade off power/performance with different Forwardflow configurations; some do not
• Feasible operating points depend on available power, e.g.:
  – … at Y = 8, 9 of 14 can run without DVFS
  – … at Y = 15, nearly all can run F-4
[Plot: normalized power vs. speedup for the SPEC OMP benchmarks (applu, apsi, equake, swim, wupwise, …) across Forwardflow configurations]
ISCA 2010 - 28

Shutting Off Cores (or DVFS)
• Assume SAF = 50%
• Baseline 8-core CMP → shut off 50% → 4 equally-powerful cores
• Shutting off cores is too much:
  – Limits TLP: not enough cores
  – Limits ILP: cores aren't aggressive enough
[Diagram: 8-core CMP with half the cores powered off]
ISCA 2010 - 29

Forwardflow – Resolving Branches
• On branch prediction:
  – Checkpoint the RCT
  – Checkpoint pointer valid bits
• Checkpoint restore:
  – Restores the RCT
  – Invalidates bad pointers
  (a checkpoint/restore sketch follows the final DQ Q&A slide)
[Example DQ: 1: ld [R4+4]→R1; 2: add R1,R3→R3; 3: sub R4,16→R4; 4: st R3,[R8]; 5: breq R4,R5; 6: ld [R4+4]→R1; 7: add R1,R3→R3]
ISCA 2010 - 30

Forwardflow – ARF [1/2]
• Architectural Register File (ARF)
  – Read at dispatch
  – Written at commit
• Example: the next instruction, mov R2 → R5, has no in-flight producer for R2, so R2 is read from the ARF at dispatch
[Diagram: DQ holding ld, add, sub, st with RCT entries R1 → 2-S1, R3 → 4-S1, R4 → 5-S1; the dispatching mov reads R2 from the ARF]
ISCA 2010 - 31

Forwardflow – ARF [2/2]
• Architectural Register File (ARF)
  – Read at dispatch
  – Written at commit
• Example: the next commit, ld [R4+4] → R1, writes R1 = 44 to the ARF
[Diagram: the committing ld's destination value being written to the ARF]
ISCA 2010 - 32

WiDGET vs. FF [1/3]
• Forwardflow:
  – Full-window scheduling, clog-free
    • Good for lookahead
    • Good for MLP
  – Pointer-based dependences
    • Serialized wakeup: bad for serial uses of the same value
• WiDGET:
  – Steering-based scheduling
    • Simple steering, simple execution logic
    • Can clog
  – Some centralization (PRF)
  – Scales down to in-order
ISCA 2010 - 33

WiDGET vs. FF [2/3]
• Example: ample MLP, many forward slices (a loop: ld [R1+R2]→R3; independent work; R2 updated; br -8)
  – WiDGET: once its instruction buffers fill, it cannot steer further and stalls
  – Forwardflow: the entire window is available, so later loads keep entering
[Diagram: WiDGET instruction buffers (IB 0, IB 1) feeding execution units (EU 0, EU 1) vs. Forwardflow's full window]
ISCA 2010 - 34

WiDGET vs. FF [3/3]
• Example: serial uses of the same value R3 (ld [R1+R2]→R3; mul R5,R3→R2; add R5,R3→R4; shr R3,4→R9)
  – WiDGET: steers the consumers of R3 across execution units
  – Forwardflow: the pointer chain wakes the consumers one at a time → artificial serialization
  (a serialized-wakeup sketch follows the Dispatch walkthrough below)
[Diagram: the three consumers of R3 spread across WiDGET's IB 0/IB 1 and EU 0/EU 1 vs. Forwardflow's serial pointer walk]
ISCA 2010 - 35

A Day in the Life of a Forwardflow Instruction: Decode
• Decoding add R1, R3 → R3 into DQ slot 8
• RCT lookup: R1's most recent use is 7-D (the ld's destination); R3's value is already known (R3 = 0)
• RCT update: R1 → 8-S1, R3 → 8-D
[Diagram: RCT before and after decoding the add]
ISCA 2010 - 36

A Day in the Life of a Forwardflow Instruction: Dispatch
• The add is written into DQ slot 8: Op1 = R1 (chained from 7-D), Op2 = 0, Dest = R3
• The destination register name is implicit – not actually written
[Diagram: DQ slots 7 (ld R4, 4 → R1) and 8 (add R1, 0 → R3) after dispatch]
ISCA 2010 - 37
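To make the serialized-wakeup behavior above concrete, here is a minimal Python sketch of a value being walked along its pointer chain, one successor per step (added for illustration; the table layout and names such as "next" and "vals" are invented, and the example mirrors the DQ7/DQ8 walkthrough that follows):

    # Minimal sketch of Forwardflow's serialized wakeup: when a result is
    # produced, hardware follows the producer's "next use" pointer chain,
    # writing the value into one successor operand at a time and issuing any
    # instruction whose operands become complete.

    dq = {
        7:  {"op": "ld",  "vals": {"S1": 100,  "S2": 4},   "next": {"D": (8, "S1")}},
        8:  {"op": "add", "vals": {"S1": None, "S2": 0},   "next": {"D": (10, "S1")}},
        9:  {"op": "sub", "vals": {"S1": 100,  "S2": 16},  "next": {}},
        10: {"op": "st",  "vals": {"S1": None, "S2": 200}, "next": {}},
    }

    def ready(slot):
        return all(v is not None for v in dq[slot]["vals"].values())

    def wakeup(slot, field, value):
        """Deliver `value` along the chain starting at dq[slot]'s `field` pointer."""
        ptr = dq[slot]["next"].get(field)
        while ptr is not None:
            succ, succ_field = ptr
            dq[succ]["vals"][succ_field] = value        # one operand write per step
            if ready(succ):
                print("issue DQ%d (%s)" % (succ, dq[succ]["op"]))
            ptr = dq[succ]["next"].get(succ_field)      # follow source-to-source link

    wakeup(7, "D", 0)    # DQ7's load result is 0: wakes DQ8 (add), whose own
                         # result will later be walked to DQ10 via its D pointer

This is the case the WiDGET comparison flags: if several in-flight instructions read the same value, the chain visits them serially rather than broadcasting a tag to all of them at once.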
A Day in the Life of a Forwardflow Instruction: Wakeup
• DQ7's result is 0
• DestPtr.Read(7) returns the destination's next-use pointer: 8-S1
• DestVal.Write(7, 0) records the result in the DQ
• The pointer-walk hardware is updated: next = 8-S1, value = 0
[Diagram: DQ slots 7 (ld R4, 4 → R1), 8 (add R1, 0 → R3), 9 (sub R4, 16 → R4), 10 (st R3, [R8])]
ISCA 2010 - 38

A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)
• The walk reaches 8-S1: S1Val.Write(8, 0) delivers the value
• Meta.Read(8) and S2Val.Read(8) supply the opcode and the other operand
• S1Ptr.Read(8) is checked for further uses of the same value
• The add issues and executes: 0 + 0 → DQ8
ISCA 2010 - 39

A Day in the Life of a Forwardflow Instruction: Writeback
• The add's result (R3 = 0) is written back: DestVal.Write(8, 0)
• DestPtr.Read(8) returns the destination's next use: 10-S1
• The pointer-walk hardware advances from 8-D to next = 10-S1, value = 0
ISCA 2010 - 40

A Day in the Life of a Forwardflow Instruction: Commit
• Commit logic reads the completed entry: Meta.Read(8), DestVal.Read(8)
• Architectural state is updated: ARF.Write(R3, 0)
[Diagram: commit logic reading DQ slot 8 (add, R3 = 0) and writing R3 = 0 to the ARF]
ISCA 2010 - 41

DQ Q&A
[Reference diagram: the full DQ example across two loop iterations – 1: ld [R4+4]→R1; 2: add R1,R3→R3; 3: sub R4,16→R4; 4: st R3,[R8]; 5: breq R4,R5; 6: ld [R4+4]→R1; 7: add R1,R3→R3; 8: sub R4,16→R4; 9: st R3,[R8] – with the RCT entries for R1, R3, and R4 updated to their most recent uses in the second iteration]
ISCA 2010 - 42
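Finally, the Resolving Branches slide above says branch prediction checkpoints the RCT and the pointer valid bits, and recovery restores the RCT and invalidates bad pointers. Below is a minimal, hedged sketch of that idea (the Frontend class, ptr_valid map, and checkpoint-per-branch scheme are invented for illustration and omit many details of the real design):

    # Minimal sketch of branch checkpoint/restore for a Forwardflow-style
    # frontend: a predicted branch snapshots the RCT and the pointer-valid
    # bits; on a misprediction, the snapshot is restored, which both rebuilds
    # the RCT and invalidates pointers created by wrong-path instructions.

    import copy

    class Frontend:
        def __init__(self):
            self.rct = {}              # reg -> (slot, field) of most recent use
            self.ptr_valid = {}        # (slot, field) -> True if its pointer is live
            self.checkpoints = {}      # branch slot -> saved (rct, ptr_valid)
            self.tail = 0              # next DQ slot to allocate

        def dispatch(self, srcs, dest, is_branch=False):
            slot = self.tail
            self.tail += 1
            for i, reg in enumerate(srcs):
                prev = self.rct.get(reg)
                if prev is not None:
                    self.ptr_valid[prev] = True          # prev use now points at us
                self.rct[reg] = (slot, "S%d" % (i + 1))
            if dest is not None:
                self.rct[dest] = (slot, "D")
            if is_branch:
                # Snapshot state so everything dispatched after the branch
                # can be undone if the prediction turns out to be wrong.
                self.checkpoints[slot] = (copy.deepcopy(self.rct),
                                          copy.deepcopy(self.ptr_valid))
            return slot

        def restore(self, branch_slot):
            """Branch mispredicted: roll back the RCT and pointer-valid bits."""
            self.rct, self.ptr_valid = self.checkpoints.pop(branch_slot)
            self.tail = branch_slot + 1                  # squash wrong-path slots

    fe = Frontend()
    fe.dispatch(["R4"], "R1")                       # slot 0: ld  [R4+4] -> R1
    fe.dispatch(["R1", "R3"], "R3")                 # slot 1: add R1, R3 -> R3
    b = fe.dispatch(["R4"], None, is_branch=True)   # slot 2: breq R4 (predicted)
    fe.dispatch(["R4"], "R1")                       # slot 3: wrong-path ld
    fe.restore(b)                                   # undo the wrong-path pointers
    print(fe.rct)                                   # RCT as of the branch again

After restore(), the RCT again names the branch's own operand as the last use of R4, and the pointer created for the wrong-path load is no longer marked valid, matching the "invalidates bad pointers" bullet.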