SMART: A Single-Cycle Reconfigurable NoC for SoC Applications
Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramanian, Anantha P. Chandrakasan, Li-Shiuan Peh
Department of Electrical Engineering and Computer Science, MIT, Cambridge
- Jyoti Wadhwani
ECE 284 Spring 2013

Evolution of on-chip systems

Challenges with this evolution
• Scaling "compute" is possible: Moore's Law
• What about the communication network?

More "hops" are bad
At each hop (router):
• Latency
• Power
At system level:
• Delayed responses
• Delayed injection of fresh requests
• Overall shutdown
• Increased power budget

Motivation
• NoCs should deliver
  • Low latency
  • High bandwidth with low power and area overhead
• Signaling at low-voltage swing can lower energy consumption and propagation delay
• Wire delay is much shorter than a typical router cycle time
• Flits can traverse multiple hops in a single cycle by bypassing buffering and arbitration at the routers
Wires can be driven multiple mm within a cycle using repeaters:
• Router cycle time = 500 ps for a 2 GHz clock
• Full-swing repeated wire delay ~ 100 ps/mm
• By bypassing the buffers, a flit can traverse 5 mm in one clock cycle
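The 5 mm figure on the Motivation slide follows from dividing the router cycle time by the per-mm wire delay. A minimal sketch of that arithmetic is below; the function names are hypothetical, and the model assumes the whole cycle is available for wire traversal (it ignores crossbar, MUX and setup-time overheads, which the slides do not quantify).

```python
def single_cycle_reach_mm(clock_ghz, wire_delay_ps_per_mm):
    """Distance a signal can cover in one clock cycle (idealized).

    Assumption: the full cycle is spent on the repeated wire, with no
    time lost in crossbars, muxes or flip-flop setup.
    """
    cycle_ps = 1000.0 / clock_ghz  # 2 GHz -> 500 ps
    return cycle_ps / wire_delay_ps_per_mm


def hops_per_cycle(clock_ghz, wire_delay_ps_per_mm, hop_length_mm=1.0):
    """Whole hops that fit in one cycle, given the hop (core) length."""
    reach = single_cycle_reach_mm(clock_ghz, wire_delay_ps_per_mm)
    return int(reach // hop_length_mm)


# Numbers from the slide: 2 GHz clock, full-swing repeated wire at ~100 ps/mm.
print(single_cycle_reach_mm(2.0, 100.0))  # 5.0
print(hops_per_cycle(2.0, 100.0))         # 5
```

Plugging in a lower per-mm delay for a faster link circuit raises the reach in the same way; the actual hop count also depends on the overheads this sketch leaves out.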
The number of hops traversed in a cycle depends on the repeater circuit and wire parasitics.

Approaches to reduce on-chip latency
• Application-specific topology reconfiguration is needed
  • To bypass the buffering and arbitration at routers
• The topology can be reconfigured to match application-specific communication patterns at
  • Design time
    • Requires knowledge of all applications and their communication graphs at design time
    • Overhead: wiring density to support dedicated links
  • Runtime
    • Computation of contention-free routes, allowing flits to bypass the queues
This paper performs online reconfiguration of network routers at runtime, to enable different applications to run on tailored topologies.

SMART Link
• Voltage lock repeater (VLR): asynchronous low-swing repeater circuit
  • For single-cycle multi-hop link traversal
• A low-swing link stretches the maximum distance spanned by a repeated link in a single clock cycle
• For transmitting 5.5 Gb/s data with BER less than 10^-9, power consumption for
  • Full-swing repeater is 4.21 mW
  • VLR is 3.78 mW
• Delay of the link with
  • Full-swing repeaters is 100 ps/mm
  • VLRs is 60 ps/mm
Node X is voltage-locked to swing near the threshold voltage of INV1x without a decrease in drive current.
The low-swing voltage level is determined by transistor sizes and link wire impedance; simulations were performed across process corners.

SMART Router Microarchitecture
SMART crossbar and bypass path:
• If the MUX is preset to connect the incoming link to the crossbar, the bypass path is enabled
• If the MUX is set to connect the input port buffer to the crossbar, the bypass path is disabled
• The bypass path is disabled when the same output port is shared by multiple input ports

SMART Flow
• The green and purple flows do not overlap with each other, so they traverse from the source to the destination router in a single clock cycle
• Reverse credit mesh network: keeps track of the free VCs at the endpoint of an
  arbitrary SMART route
• The red and blue flows overlap, so they need to be stopped at routers 9 and 10 to arbitrate for the shared crossbar ports
• For the blue flow, routers 3, 7 and 11 forward credits from NIC3 to router 10's East output port
• The VC queue of a router keeps track of the VCs at the input port of a router multiple hops away, not just the neighbor

Results
• SMART is compared against two baselines:
  • Mesh:
    • No reconfiguration
    • Each hop takes 3 cycles in the router and 1 cycle in the link
  • Dedicated:
    • 1-cycle dedicated links tailored to each application
• At 2 GHz, the SMART NoC can traverse 8 mm within a single clock cycle, i.e. 8 hops with 1 mm cores
• SMART is 1.5 cycles off in performance from the Dedicated baseline
  • when one core acts as a source and another acts as a sink for most of the flows

Results (continued)
• The benefits of SMART are more visible when certain tasks are tied to specific cores, resulting in longer paths
• The SMART NoC gives 60% latency savings and 2.2X power savings compared to the Mesh
• Power savings come from bypassing buffers, low-voltage signaling and clock gating at the routers

Conclusion
• The paper proposes
  • an NoC architecture that reconfigures and tailors a generic mesh topology for SoC applications at runtime
  • a low-swing clockless repeated link circuit embedded within router crossbars that allows packets to bypass all the way from the source to the destination core within a single clock cycle

Critiques/Comments
• Unlike gate delay, wire delay does not scale as transistors shrink.
• In a multi-mode design (operating at different voltage levels), and with wire resistance increasing as temperature rises, the repeater circuit needs careful transistor sizing, validated by simulation across all PVT corners (not just process corners).

Thank you
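The bypass rule from the SMART Router Microarchitecture and SMART Flow slides (the bypass path is disabled when the same output port is shared by multiple input ports) can be illustrated with a small sketch. This is not the paper's reconfiguration algorithm; the function name, the flow encoding as (router, input port, output port) tuples, and the router labels R1 to R3 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def bypass_settings(flows):
    """Decide, per flow and per router, whether the flit may bypass.

    flows: dict mapping a flow name to a list of hops, each hop being a
    (router, input_port, output_port) tuple (hypothetical encoding).
    Returns a dict mapping each flow name to a list of
    (router, "bypass" | "buffer") decisions.
    """
    # Collect which input ports want each (router, output port).
    users = defaultdict(set)
    for hops in flows.values():
        for router, in_port, out_port in hops:
            users[(router, out_port)].add(in_port)

    # An output port requested by exactly one input port can be preset to
    # bypass; a shared one forces buffering and arbitration at that router.
    settings = {}
    for name, hops in flows.items():
        settings[name] = [
            (router,
             "bypass" if len(users[(router, out_port)]) == 1 else "buffer")
            for router, in_port, out_port in hops
        ]
    return settings


# Hypothetical example: "red" and "blue" both need R2's East output,
# while "green" has R3's East output to itself.
flows = {
    "red": [("R1", "West", "East"), ("R2", "West", "East")],
    "blue": [("R2", "North", "East")],
    "green": [("R3", "West", "East")],
}
settings = bypass_settings(flows)
for name in flows:
    print(name, settings[name])
```

In this example the red and blue flows are buffered at R2, mirroring how the overlapping flows on the SMART Flow slide stop to arbitrate for shared crossbar ports, while the green flow bypasses every router on its path.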