Design and Evaluation of Hierarchical Rings with Deflection Routing
Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, Onur Mutlu
Executive Summary
•  Rings do not scale well as core count increases
•  Traditional hierarchical ring designs are complex and energy inefficient
–  Complicated buffering and flow control
•  Solution: Hierarchical Rings with Deflection (HiRD)
–  Guarantees livelock freedom and delivery
–  Eliminates all buffers at local routers and most buffers at bridge routers
•  HiRD provides higher performance and energy efficiency than buffered hierarchical rings
•  HiRD is simpler than buffered hierarchical rings
2
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
3
Scaling Problems in a Ring NoC
•  As the number of cores grows:
– Lower performance
– More power
4
Alternative: Hierarchical Designs
Figure: local rings (Level 0) connected by a global ring (Level 1)
Packets can reach a far destination in fewer hops
5
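The "fewer hops" claim can be illustrated with a rough hop-count sketch. Everything here is an assumption made for illustration, not the configuration evaluated in the talk: rings are treated as bidirectional with shortest-path routing, each local ring has a single bridge at position 0, and crossing a bridge costs no extra hop. The function names are hypothetical.

```python
def ring_hops(src, dst, n):
    """Shortest hop count between two positions on a bidirectional ring of n nodes."""
    d = abs(src - dst) % n
    return min(d, n - d)


def hier_hops(src, dst, local_size, num_local):
    """Hop count in a hypothetical two-level ring hierarchy.

    Assumption: node i sits on local ring i // local_size at position
    i % local_size, and position 0 of every local ring hosts the bridge.
    """
    src_ring, src_pos = divmod(src, local_size)
    dst_ring, dst_pos = divmod(dst, local_size)
    if src_ring == dst_ring:
        return ring_hops(src_pos, dst_pos, local_size)
    return (ring_hops(src_pos, 0, local_size)            # source to its bridge
            + ring_hops(src_ring, dst_ring, num_local)   # along the global ring
            + ring_hops(0, dst_pos, local_size))         # bridge to destination


def worst_case(hops_fn, n):
    """Largest hop count over all source/destination pairs."""
    return max(hops_fn(s, d) for s in range(n) for d in range(n))


single = worst_case(lambda s, d: ring_hops(s, d, 64), 64)
hier = worst_case(lambda s, d: hier_hops(s, d, 8, 8), 64)
print(single, hier)  # 32 12
```

Under these assumptions, a 64-node single ring has a worst case of 32 hops, while eight local rings of eight nodes each bring it down to 12.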
Single Ring vs. Hierarchical Rings
Figure: normalized system performance by network size (4x4, 8x8) for Ring 64-bit, Ring 128-bit, Ring 256-bit, and Hring
A hierarchical design provides better performance as the network scales
6
Complexity in Hierarchical Designs
Complex buffering and flow control
7
Single Ring vs. Hierarchical Rings
Figure: normalized network power by network size (4x4, 8x8) for Ring 64-bit, Ring 128-bit, Ring 256-bit, and Hring
Design complexity increases power consumption
8
Our Goal
•  Design a hierarchical ring that has lower complexity without sacrificing performance
9
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
10
Key Idea
•  Eliminate buffers
•  Use deflection routing
– Simpler flow control
11
Local Router
Figure: local router with Local West, Local East, and Core ports
•  Key functionality:
– Accept new flits
– Pass flits around the ring
12
Eliminating Buffers in Local Routers
Figure: local router with Local West, Local East, and Core ports
13
Eliminating Buffers in Local Routers
•  Flits can enter the ring if the output is available
Figure: local router with an ejector and Local West, Local East, and Core ports, labeled "No Buffer" and "Simpler Crossbar"
14
Deflection Routing
Figure: local router with an ejector and Local West, Local East, and Core ports, showing a deflected flit
15
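The local-router behaviour on the preceding slides can be sketched as a per-cycle function. This is a minimal illustration under stated assumptions, not the router's actual logic: the ring is modelled as unidirectional, a flit is a dict with a hypothetical `dst` field, a slot freed by ejection can be reused for injection in the same cycle, and a flit that cannot move into a full bridge FIFO simply stays on the ring (is deflected). All function names are hypothetical.

```python
from collections import deque


def local_router_step(incoming, node_id, inject_queue):
    """One cycle of a bufferless local router.

    incoming: a flit dict {"dst": int}, or None for an empty ring slot.
    Returns (outgoing, ejected).
    """
    ejected = None
    outgoing = incoming
    if incoming is not None and incoming["dst"] == node_id:
        # The ejector removes a flit that has reached its destination.
        ejected, outgoing = incoming, None
    if outgoing is None and inject_queue:
        # A new flit may enter the ring only if the output slot is free.
        outgoing = inject_queue.popleft()
    return outgoing, ejected


def bridge_transfer_step(incoming, wants_transfer, fifo, capacity):
    """Try to move a flit off the ring into a bridge FIFO.

    Returns the flit that stays on the ring (None if it transferred).
    A flit that wanted to transfer but found the FIFO full is deflected:
    it keeps circulating and can retry on its next pass.
    """
    if incoming is not None and wants_transfer and len(fifo) < capacity:
        fifo.append(incoming)
        return None
    return incoming
```

The router itself holds no flit storage in this sketch: every flit either leaves through the ejector, moves into a bridge FIFO, or continues around the ring.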
Bridge Router
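Figure: bridge router connecting the global ring (West, East) and the local ring (West, East) through a crossbar
16
Eliminating Buffers in Bridge Routers
Figure: bridge router connecting the global ring (West, East) and the local ring (West, East) through a crossbar
17
Eliminating Buffers in Bridge Routers
Figure: bridge router between the global ring (West, East) and the local ring (West, East)
Simpler Buffering
Fewer Buffers
Simpler Crossbar
18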
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
19
Livelock in Deflection Routing
•  Injection starvation
Figure: ring in which the source node (Src) is unable to inject a starved flit
20
HiRD: Injection Guarantee
•  After 150 cycles: all nodes stop injecting flits
Figure: ring with throttled routers and the starved flit at Src
•  Throttling provides injection guarantee
21
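The throttling idea can be sketched with a per-node starvation counter. The 150-cycle threshold comes from the slide; everything else is an assumption for illustration: the class and function names are hypothetical, the throttle is modelled as applying to every node except the starved one, and a throttled node does not accumulate starvation cycles.

```python
STARVATION_THRESHOLD = 150  # cycles, taken from the slide


class Injector:
    """Tracks how long a node has been unable to inject (hypothetical model)."""

    def __init__(self):
        self.starved_cycles = 0

    @property
    def is_starved(self):
        return self.starved_cycles >= STARVATION_THRESHOLD

    def tick(self, wants_to_inject, slot_free, throttled):
        """Advance one cycle; return True if the node injects a flit."""
        if not wants_to_inject or throttled:
            return False
        if slot_free:
            self.starved_cycles = 0
            return True
        # The node wanted to inject but the ring slot was occupied.
        self.starved_cycles += 1
        return False


def throttle_flags(injectors):
    """A node is throttled whenever some *other* node is starved."""
    starved = [inj.is_starved for inj in injectors]
    return [any(s for j, s in enumerate(starved) if j != i)
            for i in range(len(injectors))]
```

Once the other nodes stop injecting, free slots eventually pass the starved node, which is what turns throttling into an injection guarantee.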
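Livelock in Deflection Routing
•  Transfer starvation
Figure: ring in which a starved flit is unable to transfer into the transfer FIFO
22
HiRD: Transfer Guarantee
•  After 10 looparounds: a slot in the transfer FIFO is reserved for the starved flit
Figure: ring with a starved flit and a reserved slot in the transfer FIFO
•  Reservation provides transfer guarantee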
23
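The reservation mechanism can be sketched as follows. The 10-looparound threshold is from the slide; the rest is an assumption for illustration: the class name, the flit fields `id` and `looparounds`, and the rule that only one reservation is outstanding at a time are all hypothetical.

```python
from collections import deque

LOOPAROUND_THRESHOLD = 10  # failed passes before a slot is reserved (from the slide)


class TransferFifo:
    """Bridge-router transfer FIFO with a single-slot reservation (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()
        self.reserved_for = None  # id of the flit holding the reservation

    def try_transfer(self, flit):
        """flit: dict with "id" and "looparounds". Returns True if it entered the FIFO."""
        has_room = len(self.slots) < self.capacity
        if has_room and self.reserved_for in (None, flit["id"]):
            self.slots.append(flit)
            if self.reserved_for == flit["id"]:
                self.reserved_for = None
            return True
        # Deflected: the flit goes around the ring once more.
        flit["looparounds"] += 1
        if flit["looparounds"] >= LOOPAROUND_THRESHOLD and self.reserved_for is None:
            self.reserved_for = flit["id"]
        return False

    def drain_one(self):
        """Move the oldest flit out of the FIFO onto the next ring, if any."""
        return self.slots.popleft() if self.slots else None
```

While a reservation is held, other flits are deflected even if a slot is free, so the starved flit is the next one to transfer.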
Ejection Guarantee
•  Provided by prior work
•  Retransmit-Once [Fallin et al., HPCA’11]
– Drop a flit if there is no available slot
– Reserve a buffer slot at the destination if a flit was dropped
24
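The two Retransmit-Once rules above can be sketched as a destination-side buffer. This is a simplified illustration, not the mechanism as specified in the cited work: the class and method names are hypothetical, and the sketch assumes the reservation is made when a slot frees up and that the returned flit id stands in for a retransmit request to the sender.

```python
class EjectionBuffer:
    """Destination buffer that drops on overflow and reserves slots for dropped flits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []       # accepted flit ids
        self.dropped = []      # ids of dropped flits still waiting for a slot
        self.reserved = set()  # ids of dropped flits that now own a reserved slot

    def _has_free_slot(self):
        return len(self.buffer) + len(self.reserved) < self.capacity

    def receive(self, flit_id):
        """Return True if the flit is accepted, False if it is dropped."""
        if flit_id in self.reserved:
            # A retransmitted flit always finds its reserved slot.
            self.reserved.remove(flit_id)
            self.buffer.append(flit_id)
            return True
        if self._has_free_slot():
            self.buffer.append(flit_id)
            return True
        if flit_id not in self.dropped:
            self.dropped.append(flit_id)
        return False

    def consume(self):
        """Free one slot; return the id of a flit to retransmit, or None."""
        if self.buffer:
            self.buffer.pop(0)
        if self.dropped and self._has_free_slot():
            flit_id = self.dropped.pop(0)
            self.reserved.add(flit_id)
            return flit_id
        return None
```

Because the slot is reserved before the retransmission is requested, a dropped flit needs to be sent again only once.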
End-to-end Delivery Guarantees
Figure: path from src on a local ring, across the global ring, to dest on another local ring, annotated with the injection, transfer, and ejection guarantees
25
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
26
An Overview of HiRD
•  Deflection routing
–  Simpler flow control
–  Simpler crossbars and control logic
•  No buffers in the local rings
–  Simpler and faster local routers
•  Simpler bridge routers
–  Lower power, less area, and simpler to design
•  Provides end-to-end delivery guarantees
–  Injection guarantee by throttling
–  Transfer guarantee by reservation
27
Putting It All Together
•  Deflection routing
–  Simpler flow control
–  Simpler crossbars and control logic
•  No buffers in the local rings
–  Simpler and faster local routers
•  Simpler bridge routers
–  Lower power, less area, and simpler to design
•  Provides end-to-end delivery guarantees
–  Injection guarantee by throttling
–  Transfer guarantee by reservation
28
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
29
Methodology
•  Cores
–  16 and 64 OoO CPU cores
–  64 KB 4-way private L1
–  Distributed L2
•  Network
–  1-flit local-to-global buffer
–  4-flit global-to-local buffers
–  2-cycle per-hop latency for local routers
–  3-cycle per-hop latency for global routers
•  60 workloads consisting of SPEC2006 apps
30
Comparison to Previous Designs
•  Single ring design
–  Kim and Kim, NoCArc’09
–  64-bit links
–  128-bit links
–  256-bit links
•  Buffered hierarchical ring design
–  Ravindran and Stumm, HPCA’97
–  Identical topology
–  Identical bisection bandwidth
–  4-flit buffers in both local and global routers
31
Results: System Performance
Figure: normalized system performance by network size (4x4, 8x8) for Ring 64-bit, Ring 128-bit, Ring 256-bit, Hring, and HiRD; chart annotations: 1.9%, 2.9%
1)  Hierarchical designs provide better performance than a single ring on a larger network
2)  HiRD performs better than buffered hierarchical rings due to lower latency in local routers and throttling
32
Results: Network Power
Figure: normalized network power by network size for Ring 64-bit, Ring 128-bit, Ring 256-bit, Hring, and HiRD; chart annotations: 46.6% (4x4), 15% (8x8)
1)  Hierarchical designs consume much less power than the highest-performance single ring
2)  Routers and flow control in HiRD are simpler than those in buffered hierarchical rings
33
Router Area and Critical Path
•  16-node network with 8 bridge routers
•  Verilog RTL design using 45 nm technology
•  HiRD reduces NoC area by 50.3% compared to a buffered hierarchical ring design
•  HiRD reduces local router critical path by 29.9% compared to a buffered hierarchical ring design
34
Additional Results
•  Detailed power breakdown
•  Synthetic evaluations
•  Energy efficiency results
•  Worst case analysis
•  Technical Report:
–  Multithreaded evaluation
–  Average, 90th percentile, and max latency
–  Comparison against other topologies
–  Sensitivity analysis on different link bandwidths and number of buffers
35
Outline
•  Background and Motivation
•  Key Idea: Deflection Routing
•  End-to-end Delivery Guarantees
•  Our Solution: HiRD
•  Results
•  Conclusion
36
Conclusion
•  Rings do not scale well as core count increases
•  Traditional hierarchical ring designs are complex and energy inefficient
–  Complicated buffering and flow control
•  Solution: Hierarchical Rings with Deflection (HiRD)
–  Guarantees livelock freedom and delivery
–  Eliminates all buffers at local routers and most buffers at bridge routers
•  HiRD provides higher performance and energy efficiency than buffered hierarchical rings
•  HiRD is simpler than buffered hierarchical rings
37
Design and Evaluation of Hierarchical Rings with Deflection Routing
Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, Onur Mutlu
Backup Slides
39
Network-Intensive Workloads
•  15 network-intensive workloads
40
System Performance
Figure: normalized system performance by network size (4x4, 8x8) for Ring 64-bit, Ring 128-bit, Ring 256-bit, Hring, and HiRD; chart annotations: 2.5%, 5.6%
Deflections balance out the network load
Throttling reduces congestion
41
Network Power
Figure: normalized network power by network size for Ring 64-bit, Ring 128-bit, Ring 256-bit, Hring, and HiRD; chart annotations: 37% (4x4), 11.9% (8x8)
More deflections happen when the network is congested
42
Detailed Results
43
Multithreaded Applications
44
Network Latency
45
Synthetic Traffic Evaluations
46
Topology Comparison
47
Sweep over Different Bandwidth
48
Packet Reassembly
•  Borrowed from CHIPPER [Fallin et al., HPCA’11]
– Retransmit-Once: the destination node reserves a buffer slot for a dropped packet
– Provides ejection guarantee
49
Other Optimizations
•  Map cores that communicate heavily with each other onto the same local ring
– Takes advantage of the faster local ring routers
50
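One way to reason about this mapping optimization is to measure how much traffic stays inside a local ring for a given placement. The sketch below is purely illustrative: the function name, the traffic counts, and both mappings are made-up assumptions, not data from the evaluation.

```python
def local_traffic_fraction(traffic, ring_of):
    """Fraction of flits whose source and destination share a local ring.

    traffic: dict mapping (src_core, dst_core) -> flit count.
    ring_of: dict mapping core id -> local ring id.
    """
    total = sum(traffic.values())
    local = sum(count for (src, dst), count in traffic.items()
                if ring_of[src] == ring_of[dst])
    return local / total if total else 0.0


# Hypothetical traffic: cores 0 and 1 talk a lot, as do cores 2 and 3.
traffic = {(0, 1): 90, (2, 3): 90, (0, 2): 10, (1, 3): 10}

split = {0: 0, 1: 1, 2: 0, 3: 1}    # heavy communicators on different rings
grouped = {0: 0, 1: 0, 2: 1, 3: 1}  # heavy communicators on the same ring

print(local_traffic_fraction(traffic, split))
print(local_traffic_fraction(traffic, grouped))
```

In this toy example the grouped mapping keeps 180 of the 200 flits on a local ring, versus 20 of 200 for the split mapping, so far fewer flits need to cross the global ring.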
Related Concurrent Works
•  Clumsy Flow Control [Kim et al., IEEE CAL’13]
– Requires coordination between cores and memory controllers
•  Transportation-inspired NoCs [Kim et al., HPCA’14]
– tNoCs require an additional credit network
– tNoCs have more complex flow control
– HiRD is more lightweight
51
Some Related Previous Works
•  Hierarchical Bus [Udipi et al., HPCA’10]
– HiRD provides more scalability
•  Concentrated Mesh [Das et al., HPCA’09]
– Several nodes share one router
– Used on mesh networks
– Less power efficient than HiRD
•  Low-cost Mesh Router [J. Kim, MICRO’09]
– Specifically designed for meshes
– Does not solve issues in deflection-based flow control (HiRD does)
52