Garnet2.0: A Detailed On-Chip Network Model Inside a Full-System Simulator Tushar Krishna gem5 workshop ARM Research Summit September 11, 2017 Assistant Professor School of ECE and CS Georgia Institute of Technology tushar@ece.gatech.edu http://synergy.ece.gatech.edu/tools/garnet Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 2 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 3 Networks-on-Chip Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 4 Introduction to NoCs Core + L1$ Core + L1$ Core + L1$ Core + L1$ A Core + L1$ Core + L1$ LD (A) R1 Rsp Req On-Chip Network Fwd L2$ L2$ Garnet2.0 Tutorial | ARM Research Summit 2017 L2$ L2$ Tushar Krishna | Georgia Institute of Technology L2$ L2$ September 11, 2017 5 Modern NoCs Core L1 D$ L1 I$ L2$ L3$/Directory Network Interface Router Network Interface converts cache messages (ctrl or data) into packets. Packets get broken down into one or more flits depending on NoC link width “Tile” Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 6 Network Architecture u Topology u Routing u Flow Control u Router Microarchitecture Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 7 Topology: How to connect nodes with links ~Road Network Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 8 Routing: Which path should a message take ~Series of road segments from source to destination Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 9 Flow Control: When does a message stop/proceed ~Traffic Signals / Stop signs at end of each road segment Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 10 Router Microarchitecture: How to build the routers ~Design of traffic intersection (number of lanes, algorithm for turning red/green) Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 11 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 12 Why model NoCs accurately? 1.6 Full-state Directory HyperTransport Case Study II: Private vs. Shared L2 Token Coherence 1.4 1.2 Normalized Runtime Normalized Runtime Case Study I: Directory vs. Broadcast Protocols 1.2 1 0.8 0.6 0.4 0.2 0 Baseline NoC (Intel SCC) FANOUT + FANIN NoC SMART NoC (Krishna et al, MICRO 2011) (Krishna et al, HPCA 2013) Shared L2 Private L2 1 0.8 0.6 0.4 0.2 0 64-core CMP with different NoC Microarchitectures Baseline NoC SMART NoC (Intel SCC) (Krishna et al, HPCA 2013) 64-core CMP with different NoC Microarchitectures Different NoC Microarchitectures may lead to different microarchitectural decisions and new design optimization opportunities 64-core CMP running PARSEC workloads in full-system gem5. Average runtime plotted. Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 13 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 14 Garnet2.0 u Detailed NoC Model u Currently part of Ruby Memory System in gem5 u Original version (5-stage pipeline) released in 2009 u Developed by Niket Agarwal (currently at Google) and myself u New version (1-stage pipeline, more configurability) released in 2016 u Resources u Source: src/mem/ruby/network/garnet2.0 u gem5 wiki page: www.gem5.org/garnet2.0 u Dev patches + practice labs: http://synergy.ece.gatech.edu/garnet Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 15 Topology Each topology is a python file in configs/topologies/ Dir Dir Core + L1$ ExtLink (bi-directional) L2$ DMA R L2$ Core + L1$ L2$ L2$ IntLink (uni-directional) R Core + L1$ Core + L1$ R R L2$ L2$ Dir Garnet2.0 Tutorial | ARM Research Summit 2017 Core + L1$ Step 2: Instantiate “Int Links” and connect to routers in desired topology This is an example of MeshDirCorners_XY.py R R Step 1: Instantiate routers and connect to controllers via “Ext Links” Dir Core + L1$ Tushar Krishna | Georgia Institute of Technology September 11, 2017 16 Topology Configurable Parameters u Router u u router_latency (in cycles) u Can be set per router u Defined in src/mem/ruby/network/BasicRouter.py Link u u link_latency (in cycles) u Can be set per link u Defined in src/mem/ruby/network/BasicLink.py weight (i.e., link weight) u u u To bias routing algorithm [later slides] src_outport (string) and dst_inport (string) u Port direction (e.g., “East”) u Helps with readability of Config file + Adaptive routing algorithms bw_multiplier (value) u Used by Ruby’s simple network model, NOT by Garnet u Link bandwidth is set inside Garnet via the ni_flit_size parameter [later slides] Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 17 Default Topologies Supported u Pt2Pt configs/topologies/YourTopology.py u Crossbar u Mesh u Mesh_XY u Mesh_westfirst u MeshDirCorners_XY u Cluster Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 18 Routing u Routing table u Automatically populated based on topology u All messages use shortest path u In case of multiple options, the path with the smaller weight is chosen u Deterministic Routing A custom routing algorithm can be selected from the command line by setting --routing-algorithm=2. See configs/network/Network.py and src/mem/ruby/network/garnet2.0/GarnetNetwork.py u Custom u Users can leverage outport/inport direction names associated with each port to implement custom algorithms (say adaptive) Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 19 Deadlock Avoidance u Deadlock: A condition in which a set of agents wait indefinitely trying to acquire a set of resources D C u u u u u 0 1 A x 3 w v 2 B Packet A holds buffer u (in 1) and wants buffer v (in 2) Packet B holds buffer v (in 2) and wants buffer w (in 3) Packet C holds buffer w (in 3) and wants buffer x (in 0) Packet D holds buffer x (in 0) and wants buffer u (in 1) Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 20 Deadlock-free Routing Algorithms u Deadlocks may occur if the turns taken form a cycle u Removing some turns can make the routing algorithm deadlock free XY Model Garnet2.0 Tutorial | ARM Research Summit 2017 West-First Turn Model Tushar Krishna | Georgia Institute of Technology September 11, 2017 21 Deadlock-free Routing Algorithms u Assign weights to bias which links used first (to ensure no cyclic dependence) XY Routing: Always go X first, then Y DC DA DB 2 2 SA SB 1 1 1 1 2 2 1 1 1 1 2 SC 2 2 2 2 2 1 1 1 1 2 2 Mesh_XY Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 22 Deadlock-free Routing Algorithms u Assign weights to bias which links used first (to ensure no cyclic dependence) West-first Routing: Go W first, then N/S DC DA DB 2 2 SA SB 1 1 2 2 2 2 1 1 2 2 2 SC 2 2 2 1 1 2 2 2 2 2 2 Mesh_westfirst Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 23 Flow Control u Virtual Channels u Coherence Protocol requires certain number of virtual networks / message classes to avoid protocol deadlocks u This u Within each vnet, there can be more than one VC for boosting network performance u In u is the minimum number of VCs required Garnet, only one packet can use a VC inside a router at a time u VCs in vnets carrying control messages are 1-flit deep u VCs in vnets carrying data (cacheline) messages are 4-5 flit deep Credits u Each VC conveys its buffer availability by sending credits to its upstream router Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 24 Conventional VC Router Microarch Route Compute BW: Buffer Write VC Allocator RC: Route Compute SW Allocator V N 0 FLIT V N 1 VC 0 VA: VC Allocation VC 1 Input VCs arbitrate for “output” VCs (Input VCs at next router) VC n Input Buffers SA: Switch Allocation Input ports arbitrate for output ports VC 1 VC 2 BR: Buffer Read VC n Input Buffers BW RC VA Garnet2.0 Tutorial | ARM Research Summit 2017 ST: Switch Traversal Crossbar Switch LT: Link Traversal SA BR ST LT Tushar Krishna | Georgia Institute of Technology September 11, 2017 25 Single-Cycle Router Implementation Router.ccàwakeup() 1 Network Link.cc SwitchAllocator.cc àwakeup() InputUnit.cc àwakeup() VN0 VC 0 VC 1 VN1 VC 2 VC 3 * Buffer Write * Route Compute For multi-cycle router, add delay in flit before it can do SA has_free_vc() * Arbitrate outports * VC Allocate * send credit select_free_vc() * Buffer Read 4 OutputUnit.cc àwakeup() * Arbitrate inports (schedule creditlink wakeup next cycle) VirtualChannel.cc Credit Link.cc 2 3 (send flit to switch) OutVCState.cc CreditLink.cc * Rcv Credit * Update State CrossbarSwitch.ccàwakeup() NetworkLink.cc * Switch Traversal: Push winner flits on link queue * Schedule output link wakeup for next cycle Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 26 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 27 System Organization Ruby Memory System GarnetNetwork.cc Coherence Protocol Interface is “MessageBuffers ” from Ruby Garnet2.0 Tutorial | ARM Research Summit 2017 Network Interface Router Router Network Link Credit Link Stats Tushar Krishna | Georgia Institute of Technology September 11, 2017 28 Msg_size decides number of flits Network Interface Dir L2$ <VN, msg> Core + L1$ NI NI DMA L2$ Core + L1$ NI NI R R R R NI NI L2$ NI NI NI Dir credit Flitisize Cache Controller Core + L1$ Can add additional info in flit VN0 VN1 VC Select VC 0 VC 1 VC 2 To/From Router VC 3 Ingress NI NI Egress L2$ Core + L1$ Garnet2.0 Tutorial | ARM Research Summit 2017 NetworkInterface.cc àwakeup() Tushar Krishna | Georgia Institute of Technology September 11, 2017 29 Garnet Configurable Parameters u ni_flit_size u u u vcs_per_vnet u u Default = 1 routing_algorithm u u Default = 4 buffers_per_ctrl_vc u u Total VCs in each inport = num_vnets * vcs_per_vnet buffers_per_data_vc u u Default = 16B (128b) à 1-flit ctrl, 5-flit data This sets the bandwidth of each physical link Weight-based table or custom Defined in: src/mem/ruby/network/garnet2.0/GarnetNetwork.py Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 30 Command Line Parameters src/mem/ruby/network/ BasicRouter.py BasicLink.py Network.py garnet2.0/GarnetNetwork.py Definitions and Default Values configs/ ruby/Ruby.py common/Options.py network/Network.py example/garnet_synth_test.py - Overrides default values in Ruby - All parameters in in these .py files can be specified from command line Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 31 Running Garnet with Ruby u Build Ruby Coherence Protocol scons build/X86_MOESI_hammer/gem5.opt PROTOCOL=MOESI_hammer u Protocol determines number of message classes/virtual networks required u Invoke garnet2.0 from command line with appropriate network parameters ./build/X86_MOESI_hammer/gem5.opt configs/example/fs.py --network=garnet2.0 --topology=Mesh_XY ... Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 32 Running Garnet Standalone u Build the Garnet_standalone protocol scons build/NULL/gem5.debug PROTOCOL=Garnet_Standalone Dummy protocol just for traffic injection via Garnet Synthetic Traffic tester (next slide) u 3 Virtual Networks: vnet 0 and vnet 1 inject ctrl (1-flit) packets, vnet 2 injects data (5-flit) packets u u Run user-specified synthetic traffic ./build/NULL/gem5.debug configs/example/garnet_synth_traffic.py --network=garnet2.0 --topology=... \ --synthetic=uniform_random \ --injectionrate=0.01 \ --num-packets-max=100 \ Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 33 Garnet Synthetic Traffic u Dummy CPU Model [only works with Garnet_standalone protocol] u src/cpu/testers/garnet_synthetic_traffic u Inject packets for user-specified traffic pattern at user-specified injection rate u uniform_random Add your own synthetic traffic: u tornado configs/example/garnet_synth_traffic.py u bit_complement src/cpu/testers/garnet_synthetic_traffic/GarnetSyntheticTraffic.hh u bit_reverse src/cpu/testers/garnet_synthetic_traffic/GarnetSyntheticTraffic.cc u u bit_rotation u neighbor u shuffle u transpose Ability to inject continuously or fixed number of packets, and/or for fixed time, and/or from fixed source and/or to fixed destination in one/more vnets u u Heavy customization very useful for debugging All parameters described here: http://www.gem5.org/Garnet_Synthetic_Traffic Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 34 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 35 Sample Simulation ./build/NULL/gem5.debug configs/example/garnet_synth_traffic.py --topology=Mesh_XY --num-cpus=16 --num-dirs=16 --mesh-rows=4 m5out/config.ini Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 36 Output Stats m5out/stats.txt Packet and Flit stats (#injected, #received, queueing latency at NIC, and network latency) per VN and overall Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology More stats can be added via GarnetNetwork.cc September 11, 2017 37 Overview u What are Networks-On-Chip (NoCs)? u Why model NoCs accurately u Garnet2.0 u u u u Configuration u Topology u Routing u Flow-Control u Router Microarchitecture System Integration u Network Interface u Network Parameters u Running Garnet with Ruby u Ruby Garnet Standalone Output Stats Extensions and FAQs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 38 Extensions and FAQs u System Level Modeling u How can I integrate Garnet into my own simulator (or with the gem5 classic memory system)? u u How do I print a network trace? u u You can run Garnet in a standalone mode, and have it inject traffic from a trace instead of a fixed synthetic pattern. Alternately, try to use cpu/testers/traffic_gen Does Garnet report NoC area and power numbers? u u You can add some code to the NI. I have a patch on my website for reference. Can garnet read a network trace? u u This should not be too hard - would just require some changes to the NI code on how it receives its inputs. If anyone wants to try it, I would love to give pointers. The output of Garnet can be fed into DSENT (Sun et al, NOCS 2012) which is present in ext/dsent. We will add an automatic stats.txt parser for it soon. Can we model multiple clock domains? u The entire Ruby memory system (including Garnet inside it) is one clock domain. To mimic a multi-clock domain design, you can schedule wakeups of slow routers intelligently at some multiples of the clock rather than every cycle. Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 39 Extensions and FAQs u Topology u How do we model multiple BW links in Garnet? u u Can we model a heterogeneous CPU-GPU system? u u Yes. The current AMD GPU model models a cluster of CPUs connected to a GPU. Can we model indirect networks such as Clos? u u Inherently, that would require support in the router to manage multiple flit sizes. Instead, you can add multiple links between the same nodes if you want to model higher bandwidth. If they have the same weight, Garnet will randomly send over the two. Yes, there can be additional routers that are not connected to any controller and act purely as switches. Can we model large-scale HPC networks? u Garnet can model any sized network. You can run 256-node standalone simulations easily. However, beyond that gem5 cannot instantiate more directories (which it uses as destination nodes). I have a patch to run 1024node synthetic sims on my website. But these run quite slowly. Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 40 Extensions and FAQs u Routing u How do we model an adaptive routing algorithm? u u Flow Control u Can we implement alternate deadlock avoidance schemes (such as escape VCs or dateline)? u u If you want to use internal NoC metrics (such as number of credits at an output port) for making routing decisions, do not use table-based routing. Instead, set the routing-algorithm to custom, and implement your own routing function inside RoutingUnit.cc. See outportComputeXY() for reference. You can update the vc selection scheme inside SwitchAllocator to control which VCs get allocated. Microarchitecture u Can we model variable number of VCs in each router? u Currently the codebase is very tied to the global vcs_per_vnet parameter. If you want to model variable number of VCs, one hack could be to have everyone instantiate the same number of VCs, but modify the VC select to never allocate certain VCs Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 41 Conclusions u Garnet2.0 is an open-source research vehicle Use it and contribute to it! u Being actively maintained by the following people u Georgia Tech: Tushar Krishna u AMD Research: Brad Beckmann, Onur Kayiran, Jieming Yin, Matt Porembas u If you have any questions, email on gem5-users or gem5-dev mailing lists u u u Would love to have it integrated with the classic memory model Thank you! Useful Resources: www.gem5.org/Interconnection_Network u www.gem5.org/garnet2.0 u http://synergy.ece.gatech.edu/garnet u Garnet2.0 Tutorial | ARM Research Summit 2017 Tushar Krishna | Georgia Institute of Technology September 11, 2017 42