18-742 Parallel Computer Architecture
Lecture 13: Interconnection Networks II
Michael Papamichael, Carnegie Mellon University

Readings: Interconnection Networks
Required
  Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.
  Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007.
Recommended
  Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004.
  Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.
  Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006.

Last Lecture
  Interconnection networks: introduction & terminology
  Topology
  Buffering and flow control

Today
  Review (topology & flow control)
  More on interconnection networks
    Routing
    Router design
    Network performance metrics
    On-chip vs. off-chip differences
  Research on NoCs and packet scheduling
    The problem with packet scheduling
    Application-aware packet scheduling
    Aergia: latency-slack-based packet scheduling

Review: Topologies
[Figure: example crossbar, multistage logarithmic network, and 2D mesh]

                          Crossbar       Multistage log.   Mesh
  Direct/Indirect         Indirect       Indirect          Direct
  Blocking/Non-blocking   Non-blocking   Blocking          Blocking
  Cost                    O(N^2)         O(N log N)        O(N)
  Latency                 O(1)           O(log N)          O(sqrt(N))

Review: Flow Control
  Store and forward: each router receives the entire packet from source S before forwarding it toward destination D.
  Cut-through / wormhole: forwarding begins as soon as the head flit arrives, which shrinks buffers and reduces latency.
  Any other issues? Head-of-line blocking: a packet (red) that holds a channel but is blocked by other packets downstream leaves that channel idle until the red packet can proceed, while a packet (blue) queued behind it cannot proceed because the buffer ahead of it is full.
  Solution: use virtual channels, so that packets in separate virtual channels do not block each other on the same physical channel.
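To make the latency difference concrete, here is a minimal back-of-the-envelope sketch (not from the lecture) comparing the two flow-control styles under zero contention; all parameter names and values are illustrative assumptions.

```python
# A minimal latency sketch, assuming one router-traversal delay per hop
# and zero contention. Names are illustrative: H = hops, L = packet
# size in bits, b = channel bandwidth in bits/cycle, t_r = per-hop
# router delay in cycles.

def store_and_forward_latency(H, L, b, t_r):
    # Each router buffers the whole packet before forwarding it,
    # so the serialization delay L/b is paid at every hop.
    return H * (t_r + L / b)

def cut_through_latency(H, L, b, t_r):
    # The head flit pipelines through the routers while the body
    # streams behind it, so L/b is paid only once.
    return H * t_r + L / b

H, L, b, t_r = 5, 512, 32, 2
print(store_and_forward_latency(H, L, b, t_r))  # 5 * (2 + 16) = 90.0 cycles
print(cut_through_latency(H, L, b, t_r))        # 5 * 2 + 16  = 26.0 cycles
```

The cut-through/wormhole packet pays the serialization delay L/b once instead of at every hop, which is why forwarding before the full packet arrives reduces latency.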
Routing Mechanism
  Arithmetic: simple arithmetic determines the route in regular topologies, e.g., dimension-order routing in meshes/tori.
  Source based: the source specifies the output port for each switch on the route.
    + Simple switches with no control state: each switch strips its output port off the header
    - Large header
  Table lookup based: index into a table to get the output port.
    + Small header
    - More complex switches

Routing Algorithm Types
  Deterministic: always choose the same path.
  Oblivious: do not consider network state (e.g., choose randomly).
  Adaptive: adapt to the state of the network.
  How to adapt: local/global feedback; minimal or non-minimal paths.

Deterministic Routing
  All packets between the same (source, destination) pair take the same path.
  Dimension-order routing, e.g., XY routing (used in the Cray T3D and many on-chip networks): first traverse dimension X, then traverse dimension Y.
  + Simple
  + Deadlock freedom (no cycles in resource allocation)
  - Can lead to high contention
  - Does not exploit path diversity
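As a concrete illustration of dimension-order routing, here is a small sketch of XY routing; the coordinate convention, port names, and example endpoints are assumptions for illustration, not from the lecture.

```python
# Minimal sketch of XY dimension-order routing for a 2D mesh.

def xy_route(cur, dest):
    """Return the output port for one hop: resolve X first, then Y."""
    cx, cy = cur
    dx, dy = dest
    if cx < dx:
        return "EAST"
    if cx > dx:
        return "WEST"
    if cy < dy:
        return "NORTH"
    if cy > dy:
        return "SOUTH"
    return "LOCAL"  # arrived: eject to the processing element

# Walk a packet from (0, 0) to (2, 2). Because X is fully resolved
# before Y, the set of allowed turns is restricted and no resource
# cycle (deadlock) can form.
pos = (0, 0)
while True:
    port = xy_route(pos, (2, 2))
    if port == "LOCAL":
        break
    step = {"EAST": (1, 0), "WEST": (-1, 0),
            "NORTH": (0, 1), "SOUTH": (0, -1)}[port]
    pos = (pos[0] + step[0], pos[1] + step[1])
    print(port, "->", pos)
```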
Deadlock
  No forward progress. Caused by circular dependencies on resources: each packet waits for a buffer occupied by another packet downstream.

Handling Deadlock
  Avoid cycles in routing: dimension-order routing, or restrict the “turns” each packet can take.
  Avoid deadlock by adding virtual channels, with a separate VC pool per distance: a circular dependency cannot be built.
  Detect and break deadlock: preemption of buffers.

Turn Model to Avoid Deadlock
  Idea: analyze the directions in which packets can turn in the network, determine the cycles that such turns can form, and prohibit just enough turns to break the possible cycles.
  Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992.

Valiant’s Algorithm
  An example of an oblivious algorithm. Goal: balance network load.
  Idea: randomly choose an intermediate destination, route to it first, then route from there to the destination. Between source and intermediate, and between intermediate and destination, dimension-order routing can be used.
  + Randomizes/balances network load
  - Non-minimal (packet latency can increase)
  Optimizations: do this only under high load; restrict the intermediate node to be close (in the same quadrant).

Adaptive Routing
  Minimal adaptive: the router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to; a productive output port is one that gets the packet closer to its destination.
    + Aware of local congestion
    - Minimality restricts achievable link utilization (load balance)
  Non-minimal (fully) adaptive: “misroute” packets to non-productive output ports based on network state.
    + Can achieve better network utilization and load balance
    - Need to guarantee livelock freedom

More on Adaptive Routing
  Can avoid faulty links/routers. Idea: route around faults.
  + Deterministic routing cannot handle faulty components
  - Need to change the routing tables to disable faulty routes
  - Assumes the faulty link/router is detected

On-chip Networks
[Figure: 4x4 mesh of processing elements (PEs: cores, L2 banks, memory controllers, etc.), each attached to a router (R). Each router input port (From North/South/East/West/PE) has buffers for virtual channels VC 0-2, selected by a VC identifier; the control logic consists of the routing unit (RC), VC allocator (VA), and switch allocator (SA); a 5x5 crossbar connects the five input ports to the five output ports (To North/South/East/West/PE).]

Router Design: Functions of a Router
  Buffering (of flits)
  Route computation
  Arbitration of flits (i.e., prioritization) when there is contention; this is called packet scheduling
  Switching: from input port to output port
  Power management: scale link/router frequency

Router Pipeline
  Logical stages: BW RC VA SA ST LT
  BW: buffer write
  RC: route computation
  VA: virtual channel allocation
  SA: switch allocation
  ST: switch traversal
  LT: link traversal

Wormhole Router Timeline
  Cycle:   1  2  3  4  5  6  7  8  9
  Head:    BW RC VA SA ST LT
  Body 1:     BW       SA ST LT
  Body 2:        BW       SA ST LT
  Tail:             BW       SA ST LT
  Route computation is performed once per packet and a virtual channel is allocated once per packet; body and tail flits inherit this information from the head flit.
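The timeline above can be reproduced with a small model; this is a sketch assuming one cycle per stage and no contention, with stage assignments taken from the timeline and everything else illustrative.

```python
# Sketch of the wormhole router timeline, assuming one cycle per stage
# and no contention. Only the head flit performs RC and VA; each
# following flit is written into the buffer a cycle after the previous
# one and wins the switch (SA) one cycle behind the flit ahead of it.

def wormhole_timeline(n_body_flits):
    """Return {flit_name: {stage: cycle}} for head, body, and tail flits."""
    sched = {"Head": {"BW": 1, "RC": 2, "VA": 3, "SA": 4, "ST": 5, "LT": 6}}
    for i in range(1, n_body_flits + 2):  # body flits, then the tail
        name = "Tail" if i == n_body_flits + 1 else f"Body{i}"
        sched[name] = {"BW": 1 + i, "SA": 4 + i, "ST": 5 + i, "LT": 6 + i}
    return sched

for flit, stages in wormhole_timeline(2).items():
    print(flit, stages)
# Head  {'BW': 1, 'RC': 2, 'VA': 3, 'SA': 4, 'ST': 5, 'LT': 6}
# Body1 {'BW': 2, 'SA': 5, 'ST': 6, 'LT': 7}
# Body2 {'BW': 3, 'SA': 6, 'ST': 7, 'LT': 8}
# Tail  {'BW': 4, 'SA': 7, 'ST': 8, 'LT': 9}
```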
Dependencies in a Router
  Wormhole router:                    decode + routing -> switch arbitration -> crossbar traversal
  Virtual channel router:             decode + routing -> VC allocation -> switch arbitration -> crossbar traversal
  Speculative virtual channel router: decode + routing -> {VC allocation || speculative switch arbitration} -> crossbar traversal
  The dependence between the output of one module and the input of another determines the critical path through the router; e.g., a flit cannot bid for a switch port until routing has been performed.

Pipeline Optimizations: Lookahead Routing
  At the current router, perform the routing computation for the next router, overlapped with BW.
  Precomputing the route allows flits to compete for VCs immediately after BW: RC only decodes the route header, and the routing computation needed at the next hop is computed in parallel with VA.
  Pipeline: BW -> {RC || VA} -> SA -> ST -> LT.
  Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro 1997.

Pipeline Optimizations: Speculation
  Assume that the virtual channel allocation stage will be successful; valid under low to moderate loads.
  Perform the entire VA and SA in parallel: BW -> RC -> {VA || SA} -> ST -> LT.
  If VA is unsuccessful (no virtual channel returned), VA/SA must be repeated in the next cycle. Prioritize non-speculative requests.

Pipeline Optimizations: Bypassing
  When there are no flits in the input buffer, speculatively enter ST; on a port conflict, the speculation is aborted.
  Pipeline: {VA || RC || crossbar setup} -> ST -> LT: in the first stage a free VC is allocated, routing is performed, and the crossbar is set up.

Interconnection Network Performance
[Figure: latency vs. offered traffic (bits/sec). The minimum latency is set by the topology and raised by the routing algorithm; the zero-load latency reflects topology + routing + flow control. Throughput is bounded by the topology, tightened by routing, and tightened further by flow control.]

Ideal Latency
  Ideal latency is solely due to the wire delay between source and destination:
    T_ideal = D/v + L/b
  where D = Manhattan distance, v = propagation velocity, L = packet size, and b = channel bandwidth.

Actual Latency
  Dedicated wiring is impractical, so long wires are segmented by inserting routers:
    T_actual = D/v + L/b + H * T_router + T_c
  where D = Manhattan distance, v = propagation velocity, L = packet size, b = channel bandwidth, H = number of hops, T_router = router latency, and T_c = latency due to contention.
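A worked instance of the two formulas, with made-up input values for illustration:

```python
# Worked instance of the latency formulas above; all inputs are
# illustrative (distances and delays in cycles, sizes in bits).

def t_ideal(D, v, L, b):
    # T_ideal = D/v + L/b: wire delay plus serialization latency.
    return D / v + L / b

def t_actual(D, v, L, b, H, t_router, t_contention):
    # T_actual = D/v + L/b + H*T_router + T_c: segmenting the wires
    # with routers adds per-hop router delay and contention latency.
    return D / v + L / b + H * t_router + t_contention

print(t_ideal(D=8, v=1, L=128, b=16))               # 8 + 8 = 16.0 cycles
print(t_actual(D=8, v=1, L=128, b=16,
               H=4, t_router=3, t_contention=5))    # 16 + 12 + 5 = 33.0 cycles
```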
Network Performance Metrics
  Packet latency
  Round-trip latency
  Saturation throughput
  Application-level performance: system performance, which is affected by interference among threads/applications

On-Chip vs. Off-Chip Differences
  Advantages of on-chip:
    Wires are “free”: can build highly connected networks with wide buses
    Low latency: can cross the entire network in a few clock cycles
    High reliability: packets are not dropped and links rarely fail
  Disadvantages of on-chip:
    Shares resources (area, power) with the rest of the components on the chip
    Limited buffering available
    Not all topologies map well to a 2D plane

Packet Scheduling
  Which packet to choose for a given output port? The router needs to prioritize between competing flits:
    Which input port? Which virtual channel? Which application’s packet?
  Common strategies:
    Round robin across virtual channels
    Oldest packet first (or an approximation of it)
    Prioritize some virtual channels over others
  Better policies in a multi-core environment use application characteristics.

The Problem: Packet Scheduling
[Figure: N applications (App1 ... AppN) running on processors (P), sharing a Network-on-Chip that connects them to L2 cache banks, memory controllers, and accelerators.]
  The Network-on-Chip is a critical resource shared by multiple applications.
[Figure: conceptual view of one router. Each input port (From East/West/North/South/PE) holds packets from different applications (App1-App8) in its virtual channels VC 0-2; the scheduler, together with the routing unit (RC), VC allocator (VA), and switch allocator (SA), must decide which packet to choose.]
  Existing scheduling policies: round robin and age.
  Problem 1: they are local to a router, which leads to contradictory decision making between routers: packets from one application may be prioritized at one router, only to be delayed at the next.
  Problem 2: they are application oblivious: they treat all applications’ packets equally, but applications are heterogeneous.
  Solution: application-aware global scheduling policies.
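To make "local and application oblivious" concrete, here is a minimal sketch of the baseline kind of policy: a per-router round-robin arbiter over virtual channels. It never looks at which application a flit belongs to, and each router decides independently of all others; the class and its interface are illustrative, not from the paper.

```python
# Minimal sketch of a local round-robin arbiter over virtual channels.
# The grant pointer rotates so every requesting VC is eventually
# served, regardless of which application owns the flit.

class RoundRobinArbiter:
    def __init__(self, n_vcs):
        self.n_vcs = n_vcs
        self.last = n_vcs - 1  # start so that VC 0 is checked first

    def grant(self, requests):
        """requests[vc] is True if that VC has a flit ready to go."""
        for i in range(1, self.n_vcs + 1):
            vc = (self.last + i) % self.n_vcs
            if requests[vc]:
                self.last = vc
                return vc
        return None  # no requester this cycle

arb = RoundRobinArbiter(3)
print(arb.grant([True, True, False]))  # VC 0
print(arb.grant([True, True, False]))  # VC 1
print(arb.grant([True, True, False]))  # VC 0 again
```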
Motivation: Stall Time Criticality
  Applications are not homogeneous: they have different criticality with respect to the network. Some applications are network-latency sensitive; others are network-latency tolerant.
  An application’s stall time criticality (STC) can be measured by its average network stall time per packet (NST/packet), where network stall time (NST) is the number of cycles the processor stalls waiting for network transactions to complete.
  Why do applications have different STC?
    Memory-level parallelism (MLP): lower MLP leads to higher STC.
    Shortest-job-first principle (SJF): lower network load leads to higher STC.
    Average memory access time: higher memory access time leads to higher STC.

STC Principle 1: MLP
[Figure: compute/stall timelines. For an application with high MLP, packet latencies overlap with one another, so the stall caused by the red packet is 0; for an application with low MLP, each packet’s latency shows up as stall time.]
  Observation 1: packet latency != network stall time.
  Observation 2: a low-MLP application’s packets have higher criticality than a high-MLP application’s.

STC Principle 2: Shortest Job First
[Figure: timelines of a heavy and a light application, each compared with running alone.]

                     Heavy application        Light application
  Baseline (RR)      1.3x network slowdown    4x network slowdown
  SJF scheduling     1.6x network slowdown    1.2x network slowdown

  Overall system throughput (weighted speedup) increases by 34%.

Solution: Application-Aware Policies
  Idea: identify stall-time-critical (i.e., network-sensitive) applications and prioritize their packets in each router.
  Key components of the scheduling policy: application ranking and packet batching.
  The proposal is a low-hardware-complexity solution.

Component 1: Ranking
  Ranking distinguishes applications based on stall time criticality: periodically rank applications by STC.
  Many heuristics for quantifying STC were explored (details and analysis in the paper); a heuristic based on the outermost private cache’s misses per instruction (L1-MPI) is the most effective: low L1-MPI => high STC => higher rank.
  Why misses per instruction (L1-MPI)?
    Easy to compute (low complexity)
    Stable metric (unaffected by interference in the network)
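A minimal sketch of how such a ranking could be formed from per-core L1-MPI values, following low L1-MPI => higher rank (rank 0 is highest); the mechanism that actually collects these values is described next. Core names and MPI values are made up.

```python
# Minimal sketch of rank formation from per-core L1-MPI values.
# Lower L1-MPI implies higher stall time criticality, so cores are
# sorted by L1-MPI ascending; rank 0 is the highest priority.

def form_ranking(l1_mpi_per_core):
    """Return {core: rank}; rank 0 is the highest-priority application."""
    order = sorted(l1_mpi_per_core, key=l1_mpi_per_core.get)
    return {core: rank for rank, core in enumerate(order)}

mpi = {"core0": 0.002, "core1": 0.030, "core2": 0.011}
print(form_ranking(mpi))  # {'core0': 0, 'core2': 1, 'core1': 2}
```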
Component 1: How to Rank?
  Execution time is divided into fixed “ranking intervals” of 350,000 cycles.
  At the end of an interval, each core calculates its L1-MPI and sends it to the central decision logic (CDL), located at the central node of the mesh. The CDL forms a ranking order and sends its rank back to each core: two control packets per core every ranking interval.
  The ranking order is a “partial order”.
  Rank formation is not on the critical path: the ranking interval is significantly longer than the rank computation time, and cores use the older rank values until the new ranking is available.

Component 2: Batching
  Problem: starvation. Prioritizing a higher-ranked application can lead to starvation of a lower-ranked application.
  Solution: packet batching. Network packets are grouped into finite-sized batches, and packets of older batches are prioritized over younger batches. Alternative batching policies are explored in the paper.
  Time-based batching: new batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles.

Putting It All Together
  Before a packet is injected into the network, it is tagged with a batch ID (3 bits) and a rank ID (3 bits).
  Three-tier priority structure at routers (a sketch follows the performance results below):
    1. Oldest batch first (prevents starvation)
    2. Highest rank first (maximizes performance)
    3. Local round robin (final tie breaker)
  Simple hardware support: priority arbiters.
  Globally coordinated scheduling: the ranking order and batching order are the same across all routers.

STC Scheduling Example
[Figure: packet injection order at three cores (Core1-Core3) over 8 injection cycles. With a batching interval of 3 cycles, injections fall into batch 0, batch 1, and batch 2; a ranking order over the cores sets priority within a batch. The resulting stall cycles at the router:]

          App 1   App 2   App 3   Avg
  RR      8       6       11      8.3
  Age     4       6       11      7.0
  STC     1       3       11      5.0

Qualitative Comparison
  Round robin & age: local and application oblivious. Age is biased towards heavy applications: heavy applications flood the network, so there is a higher likelihood that an older packet is from a heavy application.
  Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance and penalizes heavy and bursty applications. Each application gets an equal, fixed quota of flits (credits) in each batch; a heavy application quickly runs out of credits after injecting into all active batches and stalls until the oldest batch completes and frees up fresh credits, underutilizing network resources.

System Performance
  STC provides a 9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads). Detailed case studies are in the paper.
[Figure: normalized system speedup and network unfairness for LocalRR, GSF, LocalAge, and STC.]
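As promised above, here is a sketch of the three-tier priority from "Putting It All Together". The field widths follow the slides (3-bit batch ID, 3-bit rank); the Flit fields and the use of a tuple key are illustrative, and real hardware would also have to handle 3-bit batch-ID wrap-around, which is ignored here.

```python
# Sketch of the three-tier STC priority: oldest batch first, then
# highest rank (lowest rank number) first, then a local tie-breaker.

from dataclasses import dataclass

@dataclass
class Flit:
    batch_id: int   # 3 bits; smaller = older batch (wrap-around ignored)
    rank: int       # 3 bits; smaller = higher-ranked application
    arrival: int    # stand-in for the local round-robin tie-breaker

def priority_key(flit):
    # Tuples compare lexicographically: batch first, rank second,
    # local order last.
    return (flit.batch_id, flit.rank, flit.arrival)

competing = [Flit(1, 0, 5), Flit(0, 7, 9), Flit(0, 2, 1)]
winner = min(competing, key=priority_key)
print(winner)  # Flit(batch_id=0, rank=2, arrival=1): oldest batch wins,
               # and within it the higher-ranked application's flit
```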
What is Aérgia?
  Aérgia is the spirit of laziness in Greek mythology. Some packets can afford to slack!

Slack of Packets
  What is the slack of a packet? The slack of a packet is the number of cycles it can be delayed in a router without reducing the application’s performance (local network slack).
  Source of slack: memory-level parallelism (MLP). The latency of an application’s packet is hidden from the application when it overlaps with the latency of pending cache-miss requests.
  Idea: prioritize packets with lower slack.

Concept of Slack
[Figure: instruction window and Network-on-Chip timelines for two outstanding load misses. The second miss’s packet returns earlier than necessary, so it has slack: Slack(second packet) = Latency(first packet) - Latency(second packet) = 26 - 6 = 20 hops. The second packet can be delayed for its available slack cycles without reducing performance.]

Prioritizing Using Slack
[Figure: Core A’s two load misses produce packets with (latency 13 hops, slack 0 hops) and (latency 3 hops, slack 10 hops); Core B’s produce (latency 10 hops, slack 0 hops) and (latency 4 hops, slack 6 hops). Two of the packets interfere 3 hops into their routes; since one has greater slack than the other, the lower-slack packet is prioritized.]

Slack in Applications
[Figure: cumulative slack distributions. For Gems, 50% of packets have 350+ slack cycles (non-critical) and only 10% of packets have <50 slack cycles (critical). For art, 68% of packets have zero slack cycles.]

Diversity in Slack
[Figure: slack distributions for Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, and h264ref.]
  Slack varies between packets of different applications, and slack varies between packets of a single application.

Estimating Slack Priority
  Slack(P) = max(latencies of P’s predecessors) - latency of P, where the predecessors of P are the packets of the cache-miss requests outstanding when P is issued.
  Packet latencies are not known when a packet is issued, so the latency of any packet Q is predicted: it is higher if Q corresponds to an L2 miss, and higher if Q has to travel a larger number of hops.
  Slack is therefore encoded as three priority fields (a packing sketch follows this list):
    PredL2 (2 bits): set if any predecessor packet is servicing an L2 miss
    MyL2 (1 bit): set if P is NOT servicing an L2 miss
    HopEstimate (2 bits): max(# of hops of predecessors) - # of hops of P
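A minimal sketch of packing the three fields above into a single priority value; the bit widths follow the slide, but the packing order and the rule "larger value = more slack = lower priority" are assumptions for illustration.

```python
# Sketch of composing a slack priority from the three fields above.
# PredL2 is most significant, then MyL2, then HopEstimate: each set
# field indicates that P's latency is small relative to its
# predecessors', i.e., that P has more slack.

def slack_priority(pred_l2, my_l2, hop_estimate):
    """Pack the fields; larger values mean more slack (lower priority)."""
    assert 0 <= pred_l2 < 4 and my_l2 in (0, 1) and 0 <= hop_estimate < 4
    return (pred_l2 << 3) | (my_l2 << 2) | hop_estimate

# A packet whose predecessors include an L2 miss, which itself hits in
# L2, and whose predecessors travel 2 more hops than it does:
print(slack_priority(pred_l2=1, my_l2=1, hop_estimate=2))  # 0b01110 = 14
```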
Estimating Slack Priority (continued)
  How to predict an L2 hit or miss at the core?
    Global-branch-predictor-based L2 miss predictor: use a pattern history table and 2-bit saturating counters.
    Threshold-based L2 miss predictor: if the number of L2 misses among the last M misses >= a threshold T, predict that the next load is an L2 miss.
  Number of miss predecessors? Taken from the list of outstanding L2 misses.
  Hops estimate? Hops = ΔX + ΔY distance; use the predecessor list to calculate the slack hop estimate.

Starvation Avoidance
  Problem: prioritizing packets can lead to starvation of lower-priority packets.
  Solution: time-based packet batching. New batches are formed every T cycles, and packets of older batches are prioritized over younger batches.

Qualitative Comparison
  Round robin & age: local and application oblivious; age is biased towards heavy applications.
  Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance; penalizes heavy and bursty applications.
  Application-aware prioritization policies (SJF) [Das et al., MICRO 2009]: shortest-job-first principle; packet scheduling policies that prioritize network-sensitive applications, which inject lower load.

System Performance
  SJF provides an 8.9% improvement in weighted speedup.
  Aérgia improves system throughput by 10.3%; Aérgia+SJF improves system throughput by 16.1%.
[Figure: normalized system speedup for RR, Age, GSF, SJF, Aérgia, and SJF+Aérgia.]