Networks-on-Chips (NoCs) Basics
ECE 284 On-Chip Interconnection Networks
Spring 2013

Examples of Tiled Multiprocessors
• 2D-mesh networks are often used as the on-chip fabric
[die photos: Tilera Tile64, and Intel 80-core with a 12.64mm × 21.72mm die, 1.5mm × 2.0mm single tiles, and I/O areas at the die edges]

Typical architecture
• Each tile typically comprises the CPU, a local L1 cache, a “slice” of a distributed L2 cache, and a router
[tile diagram: compute unit (CPU, L1 cache, slice of L2 cache) attached to a router]

Router function
• The job of the router is to forward packets from a source tile to a destination tile (e.g., when a “cache line” is read from a “remote” L2 slice).
• Two example switching modes:
– Store-and-forward: bits of a packet are forwarded only after the entire packet has been stored.
– Cut-through: bits of a packet are forwarded as soon as the header portion has been received.

Store-and-forward switching
• Packets are completely stored at each switch before any portion is forwarded.
• Requirement: buffers must be sized to hold an entire packet.
[figure: a packet is stored in full at each hop between the source end node and the destination end node]
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

Cut-through switching
• Virtual cut-through:
– Buffers hold data packets; requirement: buffers must be sized to hold an entire packet (MTU)
– On a busy link, the packet is completely stored at the blocked switch
• Wormhole:
– Buffers hold flits; packets can be larger than the buffers
– On a busy link, the packet is stored “along the path,” spread across multiple switches
• (A simple latency comparison of the two switching modes appears after the wormhole routing slide below.)
[adapted from instructional slides of Pinkston & Duato, Computer Architecture: A Quantitative Approach]

Packets to flits
Transaction   Message Type   Packet Size
Read          Request        1 flit
Read          Reply          1+n flits
Write         Request        1+n flits
Write         Reply          1 flit
[adapted from Becker STM’09 talk]

Wormhole routing
• The head flit establishes the connection from input port to output port. It contains the destination address.
• Body flits go through the established connection (they do not need destination address information).
• The tail flit releases the connection.
• All other flits are blocked until the connection is released.
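To make the flit roles above concrete, here is a minimal Python sketch of the connection bookkeeping a wormhole router implies. The names (FlitType, WormholePort, route_fn) are illustrative inventions, not from the slides, and the buffering/retry of a blocked head flit is elided.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FlitType(Enum):
    HEAD = auto()   # carries the destination; sets up the connection
    BODY = auto()   # follows the established connection
    TAIL = auto()   # last flit; tears the connection down

@dataclass
class Flit:
    kind: FlitType
    dest: int = -1  # meaningful only for head flits

class WormholePort:
    """Sketch of one router's connection table: input port -> output port."""

    def __init__(self, route_fn):
        self.route_fn = route_fn   # route_fn(dest) -> output port number
        self.connection = {}       # connections currently held by packets

    def forward(self, in_port, flit):
        """Return the output port for this flit, or None if it must stall."""
        if flit.kind is FlitType.HEAD:
            out = self.route_fn(flit.dest)
            if out in self.connection.values():
                return None                  # output busy: head flit (and packet) blocks
            self.connection[in_port] = out   # connection established
        # Body/tail flits reuse the connection (assumes the head went first).
        out = self.connection[in_port]
        if flit.kind is FlitType.TAIL:
            del self.connection[in_port]     # tail flit releases the connection
        return out
```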
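The latency advantage of cut-through over store-and-forward also falls out of a back-of-the-envelope model. The sketch below assumes H hops, packet length L, header length L_h, channel bandwidth b, and per-hop router delay t_r; the symbols and functions are chosen here, not taken from the slides.

```python
def store_and_forward_latency(hops, pkt_bits, bw, t_r=0.0):
    """Every hop must receive the full packet before forwarding any of it."""
    return hops * (t_r + pkt_bits / bw)

def cut_through_latency(hops, pkt_bits, hdr_bits, bw, t_r=0.0):
    """Every hop forwards after seeing only the header; the rest of the
    packet then streams behind the header, pipelined across the hops."""
    return hops * (t_r + hdr_bits / bw) + (pkt_bits - hdr_bits) / bw

# Example: 5 hops, 512-bit packet, 64-bit header, 64 bits/cycle channels.
print(store_and_forward_latency(5, 512, 64))   # 40.0 cycles
print(cut_through_latency(5, 512, 64, 64))     # 12.0 cycles
```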
Deadlock
• Wormhole packets hold channels for their entire lifetime, so a cycle of packets, each waiting for a channel held by the next, can block forever.

Virtual channels
• Share channel capacity between multiple data streams
– Interleave flits from different packets
• Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
• “The Swiss Army Knife for Interconnection Networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Also useful for providing QoS
[adapted from Becker STM’09 talk]

Using VCs for deadlock prevention
• Protocol deadlock
– Circular dependencies between messages at the network edge
– Solution: partition the range of VCs into different message classes
• Routing deadlock
– Circular dependencies between resources within the network
– Solution: partition the range of VCs into different resource classes, and restrict transitions between resource classes to impose a partial order on resource acquisition
• {packet classes} = {message classes} × {resource classes}
[adapted from Becker STM’09 talk]

Using VCs for flow control
• Coupling between channels and buffers causes head-of-line blocking:
– Adds false dependencies between packets
– Limits channel utilization
– Increases latency
– Even with VCs for deadlock prevention, this still applies to packets in the same class
• Solution: assign multiple VCs to each packet class
[adapted from Becker STM’09 talk]

VC router pipeline
• Per-packet stages:
– Route Computation (RC): determine candidate output port(s) and VC(s); can be precomputed at the upstream router (lookahead routing)
– Virtual Channel Allocation (VA): assign available output VCs to waiting packets at input VCs
• Per-flit stages:
– Switch Allocation (SA): assign switch time slots to buffered flits
– Switch Traversal (ST): send flits through the crossbar switch to the appropriate output
[adapted from Becker STM’09 talk]

Allocation basics
• Arbitration:
– Multiple requestors, single resource
– Request + grant vectors
• Allocation:
– Multiple requestors, multiple equivalent resources
– Request + grant matrices
• Matching:
– Each grant must satisfy a request
– Each requestor gets at most one grant
– Each resource is granted at most once
[adapted from Becker STM’09 talk]

Separable allocators
• Matchings have at most one grant per row and per column
• Implemented as two phases of arbitration: column-wise and row-wise
– Performed in either order (input-first or output-first)
– Arbiters in each stage are fully independent
• Fast and cheap
• But bad choices in the first phase can prevent the second phase from generating a good matching! (See the sketch after the wavefront allocator timing slide below.)
[adapted from Becker STM’09 talk]

Wavefront allocators
• Avoid separate phases
– … and the bad first-phase decisions that come with them
• Generate better matchings
• But delay scales linearly, and the design is difficult to pipeline
• Principle of operation:
– Pick an initial diagonal
– Grant all requests on the diagonal (they can never conflict!)
– For each grant, delete requests in the same row and column
– Repeat for the next diagonal
[adapted from Becker STM’09 talk]

Wavefront allocator timing
• Originally conceived as a full-custom, tiled design
• True delay scales linearly
• Signal wraparound creates combinational loops
– Effectively broken at the priority diagonal
– But static timing analysis cannot infer that
– Synthesized designs must be modified to avoid the loops!
[adapted from Becker STM’09 talk]
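As a concrete illustration of the first-phase problem with separable allocation, here is a small Python model of an input-first allocator. Fixed-priority arbiters (lowest index wins) stand in for real rotating-priority arbiters; all names are illustrative, not from the talk.

```python
def arbitrate(candidates):
    """Stand-in for a fixed-priority arbiter: lowest index wins."""
    return min(candidates) if candidates else None

def separable_input_first(requests):
    """Input-first separable allocation over an n x n request matrix.

    Phase 1: each input (row) picks one of its requested outputs.
    Phase 2: each output (column) picks one input that selected it.
    The two stages are independent, which is fast and cheap, but a
    poor phase-1 choice can leave an output idle.
    """
    n = len(requests)
    picks = [arbitrate([j for j in range(n) if requests[i][j]]) for i in range(n)]
    grants = []
    for j in range(n):
        winner = arbitrate([i for i in range(n) if picks[i] == j])
        if winner is not None:
            grants.append((winner, j))
    return grants

# Inputs 0 and 1 both pick output 0, although input 0 also wanted output 1:
reqs = [[True, True],
        [True, False]]
print(separable_input_first(reqs))   # [(0, 0)] -- output 1 goes unused
```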
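The wavefront principle, by contrast, can be modeled in a few lines. This is a behavioral sketch of the grant logic only (the wraparound and timing subtleties from the previous slide are not modeled), with names chosen for illustration.

```python
def wavefront_allocate(requests, start=0):
    """Sweep an n x n request matrix one diagonal at a time.

    Cells on a diagonal never share a row or a column, so every request
    on the current diagonal can be granted at once; each grant then
    deletes the remaining requests in its row and column.
    """
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for d in range(n):
        k = (start + d) % n              # rotating the start diagonal gives fairness
        for i in range(n):
            j = (i + k) % n              # cell (i, j) lies on diagonal k
            if requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False      # delete requests in the same row...
                col_free[j] = False      # ...and the same column
    return grants

# The request matrix that tripped up the separable allocator above
# now yields a full matching:
print(wavefront_allocate([[True, True], [True, False]], start=1))   # [(0, 1), (1, 0)]
```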
Diagonal Propagation Allocator
• Unrolled matrix avoids combinational loops
• A sliding priority window activates sub-matrix cells
• But static timing analysis again sees false paths!
– Actual delay is ~n
– Reported delay is ~(2n-1)
– Hurts synthesized designs
[adapted from Becker STM’09 talk]

VC allocation
• Before a packet can proceed through the router, it must acquire ownership of a VC at the downstream router
• The VC allocator matches unassigned input VCs with output VCs that are not currently in use
– P×V requestors (input VCs), P×V resources (output VCs)
• The VC is acquired by the head flit and inherited by the body and tail flits
[adapted from Becker STM’09 talk]

VC allocator implementations
[figure: allocator implementation variants; masking logic for busy VCs not shown]
[adapted from Becker STM’09 talk]

Typical pipelined router
RC → VA → SA → ST → LT
(RC: route computation; VA/SA: VC + switch allocation; ST: switch traversal; LT: link traversal)
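To tie the pipeline slides together, here is a minimal sketch of which stages each flit type occupies. The stage names follow the slides; everything else is illustrative.

```python
def stages_for(flit_kind):
    """Stages a flit passes through in the canonical RC-VA-SA-ST-LT pipeline.

    RC and VA are per-packet work performed for the head flit; body and
    tail flits inherit the route and the VC, so they only compete for
    the switch and then traverse it and the link.
    """
    per_packet = ["RC", "VA"]         # route computation, VC allocation
    per_flit = ["SA", "ST", "LT"]     # switch allocation, switch traversal, link traversal
    return (per_packet if flit_kind == "head" else []) + per_flit

print(stages_for("head"))   # ['RC', 'VA', 'SA', 'ST', 'LT']
print(stages_for("body"))   # ['SA', 'ST', 'LT']
```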