NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers

Lizhong Chen, Timothy M. Pinkston
University of Southern California
{lizhongc, tpink}@usc.edu

Abstract

While power-gating is a promising technique to mitigate the increasing static power of a chip, a fundamental requirement is for the idle periods to be sufficiently long to compensate for the power-gating and performance overhead. On-chip routers are potentially good targets for power optimizations, but few works have explored effective ways of power-gating them due to the intrinsic dependence between the node and router: any packet (sent, received or forwarded) must wake up the router before being transferred, thus breaking the potentially long idle period into fragmented intervals. Simulation shows that directly applying conventional power-gating techniques would cause frequent state-transitions and significant energy and performance overhead. In this work, we propose NoRD (Node-Router Decoupling), a novel power-aware on-chip network approach that provides for power-gating bypass to decouple the node's ability for transferring packets from the powered-on/off status of the associated router, thereby maximizing the length of router idle periods. Full system evaluation using PARSEC benchmarks shows that the proposed approach can substantially reduce the number of state-transitions, completely hide wakeup latency from the critical path of packet transport and eliminate node-network disconnection problems. Compared to an optimized conventional power-gating technique applied to on-chip routers, NoRD can further reduce the router static energy by 29.9% and improve the average packet latency by 26.3%, with only 3% additional area overhead.

NoC Power Consumption

• Issue of high NoC power consumption
• The increasing static power of on-chip routers

[Figure: static power percentage at 65nm, 45nm and 32nm, each at 1.2V, 1.1V and 1.0V]
[Figure: power breakdown of a canonical router at 45nm and 1.0V: Dynamic 62%, Buffer_static 21%, VA_static 7%, Xbar_static 5%, Clock_static 4%, SA_static 2%]

Power-gating Challenges

Two Concerns:
• Breakeven-time (BET): the minimum number of gated-off idle cycles to offset power-gating energy overhead (~10 cycles for router)
• Wakeup latency: around 10 to 15 cycles for router

Problems in Applying Power-gating to Routers:
• Intensified BET limitation
  - Intermittent packet arrivals break long idle periods into fragments
  - For PARSEC, 61% of the total number of idle periods are below BET
• Cumulative wakeup latency in multi-hop NoCs
  - Worse for larger networks
• Disconnection problem
  - Idle period is upper bounded by local node's traffic
  - Disconnected network

Node-Router Decoupling

Basic Idea:
• Breaks the node-router dependence via decoupling bypass paths

Chip-level:
• A bypass ring connecting all nodes
• Receiving: add a bypass path from Bypass Inport to the NI ejection
• Sending: add a bypass path from the NI injection to Bypass Outport
• Forwarding: packets bypass a gated-off router by using the above two bypass paths together

Router/NI-level:
• Two bypass paths and control logic are added
• Router is power-gated off when its datapath is empty
• Router is turned on when the wakeup metric exceeds a threshold
  - Wakeup metric: VC request rate at the local NI
• Low implementation cost (3.1% of router area)

Routing:
• Based on Duato's Protocol
  - Escape resources consist of the escape VCs of the bypass ring formed by (Bypass Inport, Bypass Outport) pairs
  - Other VCs are adaptive resources
• Packets on adaptive VCs
  - First routed minimally
  - If not possible, detoured by one hop; may still be routed on adaptive VCs
  - If misrouted hops reach a threshold, forced to enter escape VCs
• Packets on escape VCs
  - Confined to the bypass ring until destination

Increasing NoRD Efficiency:
• Routers have different impact on performance based on their location
• Classify routers into a performance-centric class and a power-centric class
• Wake up early a few performance-critical routers to improve performance by adding "shortcuts" in routing
• Wake up late the rest (majority) of the routers to save more static power by allowing those routers to stay in the gated-off state for a longer time
• An off-line program based on the Floyd-Warshall all-pair shortest path algorithm classifies routers in this work; further exploration is left for future work

The left figure shows the classification of routers:
• The red ones are performance-centric routers
• The blue ones are power-centric routers

[Figure: classification of routers into performance-centric (red) and power-centric (blue)]

Advantages of NoRD: Solving All Three Problems:
• Mitigate BET limitation: use bypass paths instead of waking up routers
• Hide wakeup latency: use bypass paths while routers are waking up
• Eliminate disconnection: all nodes are always connected by the bypass ring

Evaluation Methodology

Simulation Platform:
• Platform: Simics + Gems (Garnet + Orion 2.0)
• Workloads: PARSEC 2.0 + Synthetic traffic

Schemes Under Comparison:
• No power-gating (No_PG)
• Conventional power-gating (Conv_PG)
  - Apply power-gating technique conventionally to routers
• Optimized conventional power-gating (Conv_PG_OPT)
  - Conv_PG + early wakeup (hide some wakeup latency)
• Node-router decoupling (NoRD)

Results

Static Energy Savings:
• Conv_PG: 51.2%, Conv_PG_OPT: 47.0%, NoRD: 62.9%
• Relative improvement of NoRD: 23.9% and 29.9% (over Conv_PG and Conv_PG_OPT, respectively)

[Figure: static energy (norm. to No_PG) for No_PG, Conv_PG, Conv_PG_OPT and NoRD across blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, raytrace, swaptions, vips, x264 and AVG]

Power-gating Overhead Reduction:
• NoRD reduces power-gating overhead by over 80%

[Figure: power-gating overhead energy for Conv_PG, Conv_PG_OPT and NoRD]

Overall NoC Energy Savings:
• Conv_PG: 9.1%, Conv_PG_OPT: 9.4%, NoRD: 20.6%
• Static energy savings vs. dynamic energy losses

[Figure: breakdown of power (normalized to No_PG) for No_PG, Conv_PG, Conv_PG_OPT and NoRD; legend: power-gating overhead, router static power, router dynamic power, link static power, link dynamic power]

Performance:
• Average packet latency penalty
  - Conv_PG: 63.8%, Conv_PG_OPT: 41.5%, NoRD: 15.2%
• Execution time penalty
  - Conv_PG: 11.7%, Conv_PG_OPT: 8.1%, NoRD: 3.9%

[Figure: average packet latency (cycles) for No_PG, Conv_PG, Conv_PG_OPT and NoRD]
[Figure: execution time (norm. to No_PG) for No_PG, Conv_PG, Conv_PG_OPT and NoRD]

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. We especially acknowledge the efforts of Yuho Jin in creating Simics checkpoints prior to this work. We also thank Li-Shiuan Peh's research group for their assistance in Orion 2.0. This research was supported, in part, by the National Science Foundation (NSF), grant CCF-0946388.
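
Illustration: breakeven-time and idle-period fragmentation

The breakeven-time (BET) concern can be made concrete with a small back-of-the-envelope sketch. It assumes a simple linear model that is not taken from the poster: each gated-off cycle saves one unit of router static energy, and each power-gating event costs BET units, so a gated-off period of L cycles nets L - BET units. The function names and the idle-period trace are hypothetical.

```python
def fraction_below_bet(idle_periods, bet=10):
    """Share of idle periods that are too short to pay back the gating overhead."""
    return sum(1 for length in idle_periods if length < bet) / len(idle_periods)


def net_saving(idle_periods, bet=10):
    """Net saving in units of one cycle of router static energy.

    Assumed linear model: every period is gated off for its full length,
    saving `length` units and paying `bet` units of overhead.
    """
    return sum(length - bet for length in idle_periods)


# Hypothetical trace: intermittent packets fragment the idle time.
fragmented = [3, 4, 25, 6, 40]
# Same idle cycles if the interrupting packets had used a bypass path instead.
merged = [sum(fragmented)]

print(fraction_below_bet(fragmented))  # share of fragments below BET
print(net_saving(fragmented))          # net saving with fragmentation
print(net_saving(merged))              # net saving with one long idle period
```

Under these assumptions the same number of idle cycles yields a larger net saving when it is not fragmented, which is the effect the bypass paths aim for.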
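
Illustration: routing decision on adaptive and escape VCs

The routing rules listed under Routing can be read as a per-hop decision. The sketch below is one possible reading, not the authors' implementation: the `Action` names, the `next_action` signature, and the choice to check the misroute threshold before anything else for adaptive packets are all assumptions.

```python
from enum import Enum


class Action(Enum):
    ADAPTIVE_MINIMAL = "route minimally on an adaptive VC"
    ADAPTIVE_DETOUR = "detour on an adaptive VC"
    ESCAPE_RING = "follow the bypass ring on an escape VC"


def next_action(on_escape_vc, minimal_hop_available, misrouted_hops, threshold):
    """Pick the next routing action for one packet at one hop."""
    # Packets on escape VCs stay on the bypass ring until the destination.
    if on_escape_vc:
        return Action.ESCAPE_RING
    # Assumed ordering: once the misroute budget is used up, force escape.
    if misrouted_hops >= threshold:
        return Action.ESCAPE_RING
    # Adaptive packets are first routed minimally.
    if minimal_hop_available:
        return Action.ADAPTIVE_MINIMAL
    # Otherwise they are detoured but may remain on adaptive VCs.
    return Action.ADAPTIVE_DETOUR


print(next_action(False, True, 0, threshold=2))
print(next_action(False, False, 1, threshold=2))
print(next_action(False, False, 2, threshold=2))
```

Because a packet that enters the escape VCs never leaves the ring, the escape path stays free of cyclic dependencies on adaptive resources, which is the property Duato's Protocol relies on.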
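
Illustration: ranking routers with Floyd-Warshall

The poster states only that an off-line program based on the Floyd-Warshall all-pair shortest path algorithm classifies routers; it does not give the classification criterion. The sketch below shows one plausible way such a program could work and is an assumption throughout: it models the bypass ring as a unidirectional cycle, treats waking a router as adding hypothetical "shortcut" edges, and ranks routers by how much they reduce the average hop count. The names `floyd_warshall`, `average_hops`, `rank_routers`, and the example topology are illustrative.

```python
from itertools import product

INF = float("inf")


def floyd_warshall(n, edges):
    """All-pairs shortest hop counts for a directed graph on nodes 0..n-1."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = min(dist[u][v], 1)
    # product() varies the first index slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def average_hops(dist):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def rank_routers(n, ring_edges, shortcuts):
    """Rank routers by the average-hop reduction from waking each one alone."""
    base = average_hops(floyd_warshall(n, ring_edges))
    gains = {
        router: base - average_hops(floyd_warshall(n, ring_edges + extra))
        for router, extra in shortcuts.items()
    }
    return sorted(gains, key=gains.get, reverse=True), gains


# Hypothetical 6-node bypass ring and per-router shortcut edges.
ring = [(i, (i + 1) % 6) for i in range(6)]
shortcuts = {0: [(0, 3)], 1: [(1, 2)]}
order, gains = rank_routers(6, ring, shortcuts)
print(order, gains)
```

Routers at the top of such a ranking would be candidates for the performance-centric class (woken up early), and the rest for the power-centric class.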