Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks
Spring 2013

Examples of Tiled Multiprocessors
• 2D-mesh networks are often used as the on-chip fabric
[die photos: Tilera Tile64, and Intel 80-core with a 12.64mm × 21.72mm die, 1.5mm × 2.0mm single tiles, and I/O areas at the die edges]

Typical architecture
• Each tile typically comprises the CPU, a local L1 cache, a “slice” of a distributed L2 cache, and a router
[tile diagram: compute unit (CPU, L1 cache, slice of L2 cache) attached to a router]

Router function
• The job of the router is to forward packets from a source tile to a destination tile (e.g., when a “cache line” is read from a “remote” L2 slice).
• Two example switching modes:
– Store-and-forward: bits of a packet are forwarded only after the entire packet has been stored.
– Cut-through: bits of a packet are forwarded as soon as the header portion has been received.

Store-and-forward switching
• Packets are completely stored at each switch before any portion is forwarded.
• Requirement: buffers must be sized to hold an entire packet.
[figure: a packet is stored in full at each hop between the source end node and the destination end node]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

Cut-through switching
• Virtual cut-through:
– Buffers hold data packets; requirement: buffers must be sized to hold an entire packet (MTU)
– On a busy link, the packet is completely stored at the blocked switch
• Wormhole:
– Buffers hold flits; packets can be larger than the buffers
– On a busy link, the packet is stored “along the path,” spread across multiple switches
• (A simple latency comparison of the two switching modes appears after the wormhole routing slide below.)
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

Packets to flits
Transaction   Message Type   Packet Size
Read          Request        1 flit
Read          Reply          1+n flits
Write         Request        1+n flits
Write         Reply          1 flit
[adapted from Becker STM’09 talk]

Wormhole routing
• The head flit establishes the connection from input port to output port. It contains the destination address.
• Body flits go through the established connection (they do not need destination address information).
• The tail flit releases the connection.
• All other flits are blocked until the connection is released.
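To make the flit roles above concrete, here is a minimal Python sketch of the connection bookkeeping a wormhole router implies. The names (FlitType, WormholePort, route_fn) are illustrative inventions, not from the slides, and the buffering/retry of a blocked head flit is elided.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FlitType(Enum):
    HEAD = auto()   # carries the destination; sets up the connection
    BODY = auto()   # follows the established connection
    TAIL = auto()   # last flit; tears the connection down

@dataclass
class Flit:
    kind: FlitType
    dest: int = -1  # meaningful only for head flits

class WormholePort:
    """Sketch of one router's connection table: input port -> output port."""

    def __init__(self, route_fn):
        self.route_fn = route_fn   # route_fn(dest) -> output port number
        self.connection = {}       # connections currently held by packets

    def forward(self, in_port, flit):
        """Return the output port for this flit, or None if it must stall."""
        if flit.kind is FlitType.HEAD:
            out = self.route_fn(flit.dest)
            if out in self.connection.values():
                return None                  # output busy: head flit (and packet) blocks
            self.connection[in_port] = out   # connection established
        # Body/tail flits reuse the connection (assumes the head went first).
        out = self.connection[in_port]
        if flit.kind is FlitType.TAIL:
            del self.connection[in_port]     # tail flit releases the connection
        return out
```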
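The latency advantage of cut-through over store-and-forward also falls out of a back-of-the-envelope model. The sketch below assumes H hops, packet length L, header length L_h, channel bandwidth b, and per-hop router delay t_r; the symbols and functions are chosen here, not taken from the slides.

```python
def store_and_forward_latency(hops, pkt_bits, bw, t_r=0.0):
    """Every hop must receive the full packet before forwarding any of it."""
    return hops * (t_r + pkt_bits / bw)

def cut_through_latency(hops, pkt_bits, hdr_bits, bw, t_r=0.0):
    """Every hop forwards after seeing only the header; the rest of the
    packet then streams behind the header, pipelined across the hops."""
    return hops * (t_r + hdr_bits / bw) + (pkt_bits - hdr_bits) / bw

# Example: 5 hops, 512-bit packet, 64-bit header, 64 bits/cycle channels.
print(store_and_forward_latency(5, 512, 64))   # 40.0 cycles
print(cut_through_latency(5, 512, 64, 64))     # 12.0 cycles
```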
Deadlock
• Wormhole packets hold channels for their entire lifetime, so a cycle of packets, each waiting for a channel held by the next, can block forever.

Virtual channels
• Share channel capacity between multiple data streams
– Interleave flits from different packets
• Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
• “The Swiss Army Knife for Interconnection Networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Also useful for providing QoS
[adapted from Becker STM’09 talk]

Using VCs for deadlock prevention
• Protocol deadlock
– Circular dependencies between messages at the network edge
– Solution: partition the range of VCs into different message classes
• Routing deadlock
– Circular dependencies between resources within the network
– Solution: partition the range of VCs into different resource classes, and restrict transitions between resource classes to impose a partial order on resource acquisition
• {packet classes} = {message classes} × {resource classes}
[adapted from Becker STM’09 talk]

Using VCs for flow control
• Coupling between channels and buffers causes head-of-line blocking:
– Adds false dependencies between packets
– Limits channel utilization
– Increases latency
– Even with VCs for deadlock prevention, this still applies to packets in the same class
• Solution: assign multiple VCs to each packet class
[adapted from Becker STM’09 talk]

VC router pipeline
• Per-packet stages:
– Route Computation (RC): determine candidate output port(s) and VC(s); can be precomputed at the upstream router (lookahead routing)
– Virtual Channel Allocation (VA): assign available output VCs to waiting packets at input VCs
• Per-flit stages:
– Switch Allocation (SA): assign switch time slots to buffered flits
– Switch Traversal (ST): send flits through the crossbar switch to the appropriate output
[adapted from Becker STM’09 talk]

Allocation basics
• Arbitration:
– Multiple requestors, single resource
– Request + grant vectors
• Allocation:
– Multiple requestors, multiple equivalent resources
– Request + grant matrices
• Matching:
– Each grant must satisfy a request
– Each requestor gets at most one grant
– Each resource is granted at most once
[adapted from Becker STM’09 talk]

Separable allocators
• Matchings have at most one grant per row and per column
• Implemented as two phases of arbitration: column-wise and row-wise
– Performed in either order (input-first or output-first)
– Arbiters in each stage are fully independent
• Fast and cheap
• But bad choices in the first phase can prevent the second phase from generating a good matching! (See the sketch after the wavefront allocator timing slide below.)
[adapted from Becker STM’09 talk]

Wavefront allocators
• Avoid separate phases
– … and the bad first-phase decisions that come with them
• Generate better matchings
• But delay scales linearly, and the design is difficult to pipeline
• Principle of operation:
– Pick an initial diagonal
– Grant all requests on the diagonal (they can never conflict!)
– For each grant, delete requests in the same row and column
– Repeat for the next diagonal
[adapted from Becker STM’09 talk]

Wavefront allocator timing
• Originally conceived as a full-custom, tiled design
• True delay scales linearly
• Signal wraparound creates combinational loops
– Effectively broken at the priority diagonal
– But static timing analysis cannot infer that
– Synthesized designs must be modified to avoid the loops!
[adapted from Becker STM’09 talk]
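As a concrete illustration of the first-phase problem with separable allocation, here is a small Python model of an input-first allocator. Fixed-priority arbiters (lowest index wins) stand in for real rotating-priority arbiters; all names are illustrative, not from the talk.

```python
def arbitrate(candidates):
    """Stand-in for a fixed-priority arbiter: lowest index wins."""
    return min(candidates) if candidates else None

def separable_input_first(requests):
    """Input-first separable allocation over an n x n request matrix.

    Phase 1: each input (row) picks one of its requested outputs.
    Phase 2: each output (column) picks one input that selected it.
    The two stages are independent, which is fast and cheap, but a
    poor phase-1 choice can leave an output idle.
    """
    n = len(requests)
    picks = [arbitrate([j for j in range(n) if requests[i][j]]) for i in range(n)]
    grants = []
    for j in range(n):
        winner = arbitrate([i for i in range(n) if picks[i] == j])
        if winner is not None:
            grants.append((winner, j))
    return grants

# Inputs 0 and 1 both pick output 0, although input 0 also wanted output 1:
reqs = [[True, True],
        [True, False]]
print(separable_input_first(reqs))   # [(0, 0)] -- output 1 goes unused
```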
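The wavefront principle, by contrast, can be modeled in a few lines. This is a behavioral sketch of the grant logic only (the wraparound and timing subtleties from the previous slide are not modeled), with names chosen for illustration.

```python
def wavefront_allocate(requests, start=0):
    """Sweep an n x n request matrix one diagonal at a time.

    Cells on a diagonal never share a row or a column, so every request
    on the current diagonal can be granted at once; each grant then
    deletes the remaining requests in its row and column.
    """
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for d in range(n):
        k = (start + d) % n              # rotating the start diagonal gives fairness
        for i in range(n):
            j = (i + k) % n              # cell (i, j) lies on diagonal k
            if requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False      # delete requests in the same row...
                col_free[j] = False      # ...and the same column
    return grants

# The request matrix that tripped up the separable allocator above
# now yields a full matching:
print(wavefront_allocate([[True, True], [True, False]], start=1))   # [(0, 1), (1, 0)]
```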
Diagonal Propagation Allocator
• Unrolled matrix avoids combinational loops
• A sliding priority window activates sub-matrix cells
• But static timing analysis again sees false paths!
– Actual delay is ~n
– Reported delay is ~(2n-1)
– Hurts synthesized designs
[adapted from Becker STM’09 talk]

VC allocation
• Before a packet can proceed through the router, it must acquire ownership of a VC at the downstream router
• The VC allocator matches unassigned input VCs with output VCs that are not currently in use
– P×V requestors (input VCs), P×V resources (output VCs)
• The VC is acquired by the head flit and inherited by the body and tail flits
[adapted from Becker STM’09 talk]

VC allocator implementations
[figure: allocator implementation variants; masking logic for busy VCs not shown]
[adapted from Becker STM’09 talk]

Typical pipelined router
RC → VA → SA → ST → LT
(RC: route computation; VA/SA: VC + switch allocation; ST: switch traversal; LT: link traversal)
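To tie the pipeline slides together, here is a minimal sketch of which stages each flit type occupies. The stage names follow the slides; everything else is illustrative.

```python
def stages_for(flit_kind):
    """Stages a flit passes through in the canonical RC-VA-SA-ST-LT pipeline.

    RC and VA are per-packet work performed for the head flit; body and
    tail flits inherit the route and the VC, so they only compete for
    the switch and then traverse it and the link.
    """
    per_packet = ["RC", "VA"]         # route computation, VC allocation
    per_flit = ["SA", "ST", "LT"]     # switch allocation, switch traversal, link traversal
    return (per_packet if flit_kind == "head" else []) + per_flit

print(stages_for("head"))   # ['RC', 'VA', 'SA', 'ST', 'LT']
print(stages_for("body"))   # ['SA', 'ST', 'LT']
```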