GCA: Global Congestion Awareness for Load Balance in Networks-onChip Mukund Ramakrishna, Paul V. Gratz & Alex Sprintson Department of Electrical and Computer Engineering Texas A&M University Networks-on-Chip • Moore’s Law is putting more and more transistors on the chip. • NoCs scale better than traditional interconnects. • High interconnect latencies translate into idle processor core cycles and wasted power Intel Single-Chip Cloud Computer (ISSCC 2010) Tilera Tile64 Routing in NoCs fft benchmark from SPLASH-2 under DOR Darker arrows = Higher congestion • Real workloads are unbalanced in nature • Oblivious routing (DOR) – Tends to exacerbate congestion • Adaptive routing: – Try to avoid congested spots – Classified on the basis of awareness: • Local adaptive • Regionally aware • Globally aware Outline • • • • Introduction Motivation for Global Awareness Related Work Global Congestion Awareness (GCA): – Route Computation – Information Propagation • Implementation • Evaluation • Conclusion and Future Work Local Congestion • Local adaptive – Measure local congestion metric (free VC, free buffers) – Greedy local decisions due to poor visibility D Low congestion Moderate congestion High congestion Optimal S Local adaptive Regional Awareness • Regionally aware(Gratz et al. HPCA’ 08, Ma et al. ISCA ’11 ) – Aggregated congestion of all nodes in a dimension – Noisy information degrades performance. D Low congestion Moderate congestion High congestion Optimal RCA-1D S Ideally … • On a per-destination basis: – Evaluate end-to-end delay along all minimal paths to destination – Pick path with least delay D Low congestion Moderate congestion High congestion Optimal S Global Awareness • Earlier schemes utilize a separate congestion monitoring network (Manevich et al. DSD`11, Ramanujam et al. ANCS`10) • Increased network complexity • Slow route calculation mechanism in DAR • Challenges: – Low overhead dissemination technique – Limited resource for storage and computation Outline • Introduction • Motivation for Global Awareness • Global Congestion Awareness (GCA): – Route Computation – Information Propagation • Implementation • Evaluation • Conclusion and Future Work GCA: Bird’s eye view 1. Congestion information is conveyed via piggybacking onto the header flits 2. Every node builds a “map” representing the congestion of the network 3. Optimal path is calculated using a shortest path graph algorithm in each router Packet-level • “Piggybacking” of congestion information in header flits (Zhang et al. PrimeAsia`10, Chen et al. NOCS`12) • Back-annotation appends congestion information for link in opposite direction – Direction of flit traversal: Black – Congestion Information appended: Red Router Micro-Architecture Traffic Vector VC-1 N VC-n Optimal Output Port Table E Header Modification W Routing Unit Local Congestion Values Route Compute Hardware VC Allocator S I n Congestion Map Destination Node VC-1 VC-n XB Allocator X N E W S Ej Router Micro-Architecture Traffic Vector VC-1 N VC-n Optimal Output Port Table E Header Modification W Routing Unit Local Congestion Values Route Compute Hardware VC Allocator S I n Congestion Map Destination Node VC-1 VC-n XB Allocator X N E W S Ej Router Micro-Architecture P1 Pout P2 d1 l1 + < d2 l2 + Dout 4 Route Computation • Node marked 0 is source • Number on link denotes congestion • Number in node denotes shortest path cost to that node • Letter denotes optimal output port. Route Computation (contd.) • At most two feeder nodes for every node • Pick the feeder node with the least cost path • For nodes a hop away: – Cost = Congestion of connecting link Route Computation (contd.) • Simple “add and compare” step • Example: Top-left node – From East port: • 3+1=4 – From South port: • 8+0=8 • Cost assigned = 4 Route Computation (contd.) • Every iteration flows outward • Every quadrant computed in parallel • Re-evaluate only the downstream subgraph every update Caveats • Infrequent updates of distant links – Scale the weights so that distant information is less important – For a link i with congestion ci, which is n hops away from the local node, we calculate its weight as 𝑊𝑖 = 𝑐𝑖 ∗ (1 − 𝑤 ∗ 𝑛 ) • Staleness of congestion information – Entries untouched for n cycles are faded – Fade towards a nominal value in steps of x (empirically determined) – For large networks, fade one entry every n/L cycles where L is the number of links Limited GCA (LGCA) • Constrain the visibility to a smaller window • Store information only for nodes k hops or less away • Reduces storage overhead at the cost of slight performance penalty vis-à-vis GCA Implementation • Storage overhead: – Congestion Map: • 3 bit congestion metric • 6𝑁 ∗ 𝑁 − 1 𝑏𝑖𝑡𝑠 – Optimal output port table: • 1 bit per node because of minimal routing • No storage for nodes in same dimension • 𝑁 2 − 2𝑁 + 1 𝑏𝑖𝑡𝑠 – Flag array for fading mechanism: • One bit per entry in the congestion map • 2𝑁 ∗ 𝑁 − 1 𝑏𝑖𝑡𝑠 • Synthesis results: – 16 node computation and storage circuit – 2GHz at 45nm and less than 1% area overhead – No impact to router critical path Simulation parameters • Simulation carried out in a cycle accurate C++ simulator1 Characteristics of simulated design Topology Realistic Workloads Synthetic traffic 7x7 2D Mesh 8x8 2D Mesh Router uArch Per hop latency 3 cycles: 2 cycles in router, 1 cycle to cross channel Virtual Channels/Port 8 Flit buffers/VC 5 Traffic Workload SPLASH-2 traces Random, Transpose, Bitcomplement Duration of simulation 10 million cycles or end of trace 10000 warm-up cycles followed by 100000 packets Scaling Factor (w) Fading (x,n) 1. Two Stage Speculative 0.25 x=1 unit; n = 100 cycles S. Prabhu, B. Grot, P. Gratz, and J. Hu, “Ocin_tsim - DVFS Aware Simulator for NoCs,” in Proc. SAW-1, Jan 2010. SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace ocean barnes geomean (overall) SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace Improvement due to GCA (average): - DOR: 45% - Local: 26% - RCA-1D: 15% - DAR: 8% ocean barnes geomean (overall) SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace ocean barnes geomean (overall) Outliers: • DAR better than GCA on inherently static workloads (fft,radix) • Statistical traffic distribution enables better performance • GCA better than DAR on all other workloads SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace ocean barnes geomean (overall) LGCA performance: • Close to GCA on most workloads • lu is an exception • Overall average slightly worse than GCA but still better than other competing algorithms Conclusion • Proposed a novel adaptive routing mechanism which uses global congestion information to perform per-hop routing in on-chip networks. • Uses back-annotated piggybacking to propagate congestion information which alleviates the issue of overheads • Light-weight implementation of the shortest path computation • GCA improves average packet latency – By 26% against local adaptive – By 15% against RCA -1D – By 8% against DAR On average for the SPLASH-2 suite of benchmarks. Thank You BACKUP Minimal Adaptive Routing • Model – Adaptive routing along minimal paths. D S Fading • Entries untouched for n cycles are faded • Fade towards the gray value in steps of x, an empirically determined parameter – Black = extremely congested – White = uncongested – Gray = middle value 𝑊𝑖 = 𝑊𝑖 ± 𝑥 • For large networks, stagger the fading mechanism – Fade one entry every n/L cycles where L is the number of links Synthetic Traffic Transpose Uniform Random 60 40 xydor 30 local 20 RCA 10 LGCA 0 GCA 10 20 30 Injection Rate (flits/node/cycle) 40 Average Packet Latency 50 50 40 xydor 30 local 20 RCA LGCA 10 GCA 0 5 Bit-complement Average Packet Latency Average Packet Latency 60 15 25 35 Injection Rate (flits/node/cycle) 90 80 70 60 50 40 30 20 10 0 xydor local RCA LGCA GCA 3 8 13 18 23 Injection Rate (flits/node/cycle) 28 45 Network Sensitivity Experiments • Variation of two parameters: – Vary VC Count • No variation in relative performance – Vary the mesh size • Performs better for larger meshes • For both experiments, we simulate Transpose traffic Network Size 16 x 16 Mesh 8 x 8 Mesh 60 21 % 16 % 100 80 60 local LGCA 40 GCA Average Packet Latency Average Packet Latency 50 40 local 30 LGCA 20 GCA 10 20 0 0 4 9 14 Injection Rate (% of flits/node/cycle) 19 5 15 25 35 Injection Rate (% of flits/node/cycle) 45 Congestion Information Scaling • Scale the weights so that distant information is less important in making routing decision • Degree of scaling determined by an empirical constant w (0< w < 1). • For a link i with congestion ci, which is n hops away from the local node, we calculate its weight as 𝑊𝑖 = 𝑐𝑖 ∗ (1 − 𝑤 ∗ 𝑛 ) Number of steps • For a 𝑁𝑥𝑁 2-D mesh, the number of steps required lies in the range: – The upper bound is always 2𝑁 − 3 – If N is even, the lower bound is ⌊𝑁/2⌋ – If N is odd, the lower bound is 𝑁/2 − 1 Empirical parameters • Scaling Factor w: – w = 0.25 • GCA: links beyond 4 hops are assigned a constant scaling factor of 0.25 • LGCA: links beyond 4 hops are not stored as k=4 • Fading mechanism: – n = 100 cycles – x = 1 unit Challenges in global awareness • Dissemination of congestion information – Low overhead – Account for staleness • Limited storage in on-chip routers – Exponential number of paths to each destination • Limited hardware resources for computations Future Work • Congestion prediction: proactive adaptive routing instead of reactive adaptive routing • Stability analysis: Does the algorithm thrash between different paths for some traffic patterns? • Effect of imperfect congestion state representation Back-annotation For each outgoing flit, the node appends the congestion metric for the link in the same direction For each outgoing flit, the node appends the congestion metric for the link in the opposite direction. D S D S Packet Traversal direction Congestion information direction Multi-region • Network partitioned into four quadrants • Each quadrant runs a benchmark as shown • Isolated traffic regions emulate virtual machine-like scenario Average Packet Latency (relative to DOR) Multi-region local_uniregion local_multiregion RCA_uniregion RCA_multiregion GCA_uniregion GCA_multiregion 1.4 1.2 1 0.8 0.6 0.4 0.2 0 lu waternsq ocean raytrace Geomean • Local adaptive is unaffected due to its lack of visibility • RCA’s performance suffers due to noise through aggregation • GCA maintains fine-grained information • Helps avoid noise and perform better than RCA SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace Improvement over local adaptive: • GCA: 26% average, 86% best case • LGCA: 23% average, 82% best case ocean barnes geomean (overall) SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace Improvement over RCA-1D: • GCA: 15% average, 51% best case • LGCA: 11% average, 38% best case ocean barnes geomean (overall) SPLASH-2 Average Packet Latency (Relative to DOR) 1.2 1 0.8 local RCA 0.6 DAR GCA 0.4 LGCA 0.2 0 water spatial fft lu water nsquared radix raytrace Improvement over DAR: • GCA: 8% average, 53% best case • LGCA: 4% average, 41% best case ocean barnes geomean (overall)