Networks-on-Chip
Ben Abdallah Abderazek
The University of Aizu, Graduate School of Computer Science and Eng.
Adaptive Systems Laboratory, E-mail: benab@u-aizu.ac.jp
03/01/2010
Hong Kong University of Science and Technology, March 2010

Part I
• Application requirements
• Network on Chip: a paradigm shift in VLSI
• Critical problems addressed by NoC
• Traffic abstractions
• Data abstraction
• Network delay modeling

Application Requirements
• Signal processing (typically on DSPs)
  o Hard real time
  o Very regular load
  o High quality
• Media processing (SoC/media processors)
  o Hard real time
  o Irregular load
  o High quality
• Multimedia (PC/desktop)
  o Soft real time
  o Irregular load
  o Limited quality
Very challenging!

What Does the Internet Need?
• A huge and increasing number of packets, plus routing, packet classification, encryption, QoS, new applications and protocols, etc.
• ASIC: large, expensive to develop, not flexible
• General-purpose RISC: not capable enough
• SoC, MCSoC?
  o High processing power
  o Support wire speed
  o Programmable
  o Scalable
  o Specialized for network applications

Example - Network Processor (NP): IBM PowerNP
• 16 pico-processors and 1 PowerPC
• Each pico-processor
  o Supports 2 hardware threads
  o 3-stage pipeline: fetch/decode/execute
• Dyadic Processing Unit
  o Two pico-processors
  o 2 KB shared memory
  o Tree search engine
• Focus is layers 2-4
• PowerPC 405 for control plane operations
  o 16K I and D caches
• Target is OC-48

Example - Network Processor (NP)
• An NP can be applied in various network layers and applications
  o Traditional apps: forwarding, classification
  o Advanced apps: transcoding, URL-based switching, security, etc.
  o New apps

Telecommunication Systems and NoC Paradigm
• The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoC): network processors, multimedia hubs, and base-band telecom circuits
• These applications have tight time-to-market and performance constraints

Telecommunication Systems and NoC Paradigm
A telecommunication multicore SoC is composed of 4 kinds of components:
1. Software tasks
2. Processors executing software
3. Specific hardware cores
4. Global on-chip communication network (this is the most challenging part)

Technology & Architecture Trends
• Technology trends:
  o Vast transistor budgets
  o Relatively poor interconnect scaling
  o Need to manage complexity and power
  o Build flexible designs (multi-/general-purpose)
• Architectural trends:
  o Go parallel!
  o Keep core complexity constant or simplify
  o Result is lots of modules (cores, memories, off-chip interfaces, specialized IP cores, etc.)

Wire Delay vs. Logic Delay
Operation                               Delay (0.13 µm)   Delay (0.05 µm)
32-bit ALU operation                    650 ps            250 ps
32-bit register read                    325 ps            125 ps
Read 32 bits from 8 KB RAM              780 ps            300 ps
Transfer 32 bits across chip (10 mm)    1400 ps           2300 ps
Transfer 32 bits across chip (20 mm)    2800 ps           4600 ps
• Ratio of global on-chip communication delay to operation delay: 2:1 at 0.13 µm, 9:1 in 2010
Ref: W.J. Dally, HPCA Panel presentation, 2002

Communication Reliability
• Information transfer is inherently unreliable at the electrical level, due to:
  o Timing errors
  o Cross-talk
  o Electro-magnetic interference (EMI)
  o Soft errors
• The problem will get increasingly worse as technology scales down

Evolution of on-chip communication
[Figure]

Traditional SoC nightmare
• Variety of dedicated interfaces
• Design and verification complexity
• Unpredictable performance
• Many underutilized wires
[Figure: bus-based SoC with DMA, CPU, DSP and control signals on a CPU bus, a bridge, and a peripheral bus with blocks A, B, C and IO]

Network on Chip: A Paradigm Shift in VLSI
• From: dedicated signal wires
• To: shared network
[Figure: computing modules attached to network switches (s) joined by point-to-point links]

NoC essentials
• Communication by packets of bits
• Routing of packets through several hops, via switches
• Efficient sharing of wires
• Parallelism

Characteristics of a paradigm shift
• Solves a critical problem
• Step-up in abstraction
• Design is affected:
  o Design becomes more restricted
  o New tools
  o The changes enable higher complexity and capacity
  o Jump in design productivity

Origins of the NoC concept
• The idea was talked about in the 1990s, but actual research came in the new millennium.
• Some well-known early publications:
  o Guerrier and Greiner (2000) “A generic architecture for on-chip packet-switched interconnections”
  o Hemani et al. (2000) “Network on chip: An architecture for billion transistor era”
  o Dally and Towles (2001) “Route packets, not wires: on-chip interconnection networks”
  o Wingard (2001) “MicroNetwork-based integration of SoCs”
  o Rijpkema, Goossens and Wielage (2001) “A router architecture for networks on silicon”
  o Kumar et al. (2002) “A Network on chip architecture and design methodology”
  o De Micheli and Benini (2002) “Networks on chip: A new paradigm for systems on chip design”

Don't we already know how to design interconnection networks?
• Many network topologies, router designs and much theory have already been developed for high-end supercomputers and telecom switches
• Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs!

Critical problems addressed by NoC
1) Global interconnect design problem: delay, power, noise, scalability, reliability
2) System integration productivity problem
3) Chip Multi Processors (key to power-efficient computing)

1(a): NoC and global wire delay
• Long wire delay is dominated by resistance
• Add repeaters
• Repeaters become latches (with clock frequency scaling)
• Latches evolve to NoC routers
[Figure: a long wire segmented by NoC routers]

1(b): Wire design for NoC
• NoC links:
  o Regular
  o Point-to-point (no fanout tree)
  o Can use transmission-line layout
  o Well-defined current return path
• Can be optimized for noise / speed / power
  o Low swing, current mode, ...
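
The repeater argument in 1(a) can be illustrated numerically. The sketch below is only a first-order model: it assumes the commonly used 0.38·r·c·L² approximation for the delay of a distributed RC wire, and the constants `RC_PS_PER_MM2`, `REPEATER_DELAY_PS` and `SEGMENT_MM` are made-up placeholders, not figures from the talk.

```python
import math

# Assumed, illustrative parameters (not taken from the slides).
RC_PS_PER_MM2 = 20.0       # wire resistance * capacitance per mm^2, in ps
REPEATER_DELAY_PS = 20.0   # delay added by each repeater stage
SEGMENT_MM = 1.0           # wire length driven by one repeater


def unrepeated_delay_ps(length_mm):
    """First-order delay of a distributed RC wire: grows with length squared."""
    return 0.38 * RC_PS_PER_MM2 * length_mm ** 2


def repeated_delay_ps(length_mm):
    """Delay of the same wire cut into equal repeater-driven segments."""
    segments = max(1, math.ceil(length_mm / SEGMENT_MM))
    segment_mm = length_mm / segments
    return segments * (unrepeated_delay_ps(segment_mm) + REPEATER_DELAY_PS)


if __name__ == "__main__":
    for length in (5.0, 10.0, 20.0):
        print(length, unrepeated_delay_ps(length), repeated_delay_ps(length))
```

Under these assumptions, doubling the wire length quadruples the unrepeated delay but only doubles the repeated one. Once the clock period drops below the end-to-end delay, the repeater stages have to be clocked, which is the step the slide describes as repeaters becoming latches.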

1(c): NoC scalability
For the same performance, compare the wire area and power:
Architecture      Wire area     Power
NoC               O(n)          O(n)
Simple bus        O(n^3 √n)     O(n√n)
Point-to-point    O(n^2 √n)     O(n√n)
Segmented bus     O(n^2 √n)     O(n√n)

1(d): NoC and communication reliability
• Fault tolerance & error correction
[Figure: micro-modem (UMODEM) link interfaces between routers and the interconnect, with input buffer, error correction, synchronization, ISI reduction, parallel-to-serial converter and modulation blocks]
A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, “Micro-modem – reliability solution for NOC communications”, ICECS 2004

1(e): NoC and GALS
• Modules in a NoC system use different clocks
  o May use different voltages
• NoC can take care of synchronization
• NoC design may be asynchronous
  o No waste of power when the links and routers are idle

2: NoC and engineering productivity
• NoC eliminates ad-hoc global wire engineering
• NoC separates computation from communication
• NoC supports modularity and reuse of cores
• NoC is a platform for system integration, debugging and testing

3: NoC and CMP
• Uniprocessors cannot provide power-efficient performance growth
  o Interconnect dominates dynamic power
  o Global wire delay doesn’t scale
  o Instruction-level parallelism is limited
• Power efficiency requires many parallel local computations
  o Chip Multi Processors (CMP)
  o Thread-Level Parallelism (TLP)
• Network is a natural choice for CMP!
[Figure: uniprocessor dynamic power split into interconnect, gate and diff. (Magen et al., SLIP 200…)]
[Figure: performance vs. die area (or power) for uniprocessor and CMP]

Why is now the time for NoC?
• Difficulty of DSM wire design
• Productivity pressure
• CMPs

Traffic abstractions
• Traffic models are generally captured from actual traces of functional simulation
• A statistical distribution is often assumed for messages
Flow     Bandwidth   Packet size   Latency
1->10    400 kb/s    1 kb          5 ns
2->10    1.8 Mb/s    3 kb          12 ns
1->4     230 kb/s    2 kb          6 ns
4->10    50 kb/s     1 kb          3 ns
4->5     300 kb/s    3 kb          4 ns
3->10    34 kb/s     0.5 kb        15 ns
5->10    400 kb/s    1 kb          4 ns
6->10    699 kb/s    2 kb          1 ns
8->10    300 kb/s    3 kb          12 ns
9->8     1.8 Mb/s    5 kb          7 ns
9->10    200 kb/s    5 kb          10 ns
7->10    200 kb/s    3 kb          12 ns
11->10   300 kb/s    4 kb          10 ns
12->10   500 kb/s    5 kb          12 ns
[Figure: traffic graph over PE1 to PE12]

Data abstractions
[Figure]

Layers of abstraction in network modeling
• Software layers
  o Application, OS
• Network & transport layers
  o Network topology: e.g. crossbar, ring, mesh, torus, fat tree, ...
  o Switching: circuit / packet switching (SAF, VCT), wormhole
  o Addressing: logical/physical, source/destination, flow, transaction
  o Routing: static/dynamic, distributed/source, deadlock avoidance
  o Quality of Service: e.g. guaranteed-throughput, best-effort
  o Congestion control, end-to-end flow control
• Data link layer
  o Flow control (handshake)
  o Handling of contention
  o Correction of transmission errors
• Physical layer
  o Wires, drivers, receivers, repeaters, signaling, circuits, ...

How to select an architecture?
• The choice of architecture depends on system needs.
• A large range of solutions!
[Figure: reconfiguration rate (at design time, at boot time, during run time) vs. flexibility (single application to general-purpose or embedded systems), placing ASIC, FPGA, ASSP and CMP/multicore]

Example: OASIS
• ASIC assumed
• Traffic requirements are known a priori
• Features
  o Packet switching: wormhole
  o Quality of service
  o Mesh topology
K. Mori, A. Ben Abdallah, and K. Kuroda, "Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), pp. 318-321, Sep. 2009.
S. Miura, A. Ben Abdallah, and K. Kuroda, "PNoC - Design and Preliminary Evaluation of a Parameterizable NoC for MCSoC Generation and Design Space Exploration", The 19th Intelligent System Symposium (FAN 2009), pp. 314-317, Sep. 2009.

Perspective 1: NoC vs. Bus
• NoC
  o Aggregate bandwidth grows
  o Link speed unaffected by N
  o Concurrent spatial reuse
  o Pipelining is built-in
  o Distributed arbitration
  o Separate abstraction layers
  o However: no performance guarantee, extra delay in routers, area and power overhead?, modules need an NI, unfamiliar methodology
• Bus
  o Bandwidth is limited, shared
  o Speed goes down as N grows
  o No concurrency
  o Pipelining is tough
  o Central arbitration
  o No layers of abstraction (communication and computation are coupled)
  o However: fairly simple and familiar

Perspective 2: NoC vs. Off-chip Networks
• NoC
  o Sensitive to cost: area, power
  o Wires are relatively cheap
  o Latency is critical
  o Traffic may be known a priori
  o Design-time specialization
  o Custom NoCs are possible
• Off-chip networks
  o Cost is in the links
  o Latency is tolerable
  o Traffic/applications unknown
  o Changes at runtime
  o Adherence to networking standards

VLSI CAD problems
• Application mapping
• Floorplanning / placement
• Routing
• Buffer sizing
• Timing closure
• Simulation
• Testing

VLSI CAD problems in NoC
• Application mapping (map tasks to cores)
• Floorplanning / placement (within the network)
• Routing (of messages)
• Buffer sizing (size of FIFO queues in the routers)
• Timing closure (link bandwidth capacity allocation)
• Simulation (network simulation, traffic/delay/power modeling)
• Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control, ...)

Typical NoC design flow
• Place modules
• Determine routing and adjust link capacities

Timing closure in NoC
1. Define inter-module traffic
2. Place modules
3. Increase link capacities
4. QoS satisfied? If no, go back to step 3; if yes, finish
• Too low a capacity results in poor QoS
• Too high a capacity wastes area
• Uniform link capacities are a waste in an ASIP system

Network delay modeling
• Analysis of mean packet delay in a wormhole network
  o Multiple virtual channels
  o Different link capacities
  o Different communication demands

NoC design requirements
• High-performance interconnect
  o High throughput, latency, power, area
• Complex functionality (performance again)
  o Support for virtual channels
  o QoS
  o Synchronization
  o Reliability, high throughput, low latency

ISO/OSI network protocol stack model
[Figure]

Part II
• NoC topologies
• Switching strategies
• Routing algorithms
• Flow control schemes
• Clocking schemes
• QoS
• Basic building blocks
• Status and open problems

NoC Topology
• The connection map between PEs
• Adopted from large-scale networks and parallel computing
• Topology classifications:
  o Direct topologies
  o Indirect topologies

Direct topologies
• Each switch (SW) is connected to a single PE
• As the number of nodes in the system increases, the total bandwidth also increases
[Figure: each PE is connected to only a single SW]

Direct topologies: Mesh
• 2D mesh is the most popular
• All links have the same length
  o Eases physical design
• Area grows linearly with the number of nodes
[Figure: 4x4 mesh]

Direct topologies: Torus and Folded Torus
• Torus
  o Similar to a regular mesh
  o Excessive delay problem due to the long end-around connections
• Folded torus
  o Overcomes the long-link limitation of a 2-D torus
  o Links have the same size
[Figure: torus and folded torus, each node a PE with its router R]

Direct topologies: Octagon topology
• Messages sent between any 2 nodes require at most two hops
• More octagons can be tiled together to accommodate larger designs
[Figure: octagon of eight PEs around a SW]

Indirect topologies
• A set of PEs is connected to a switch (router).
• Fat tree topology
  o Nodes are connected only to the leaves of the tree
  o More links near the root, where bandwidth requirements are higher
[Figure: fat tree of SWs with PEs at the leaves]

Indirect topologies: k-ary n-fly butterfly network
• Blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
• Example: 2-ary 3-fly butterfly network

Indirect topologies: (m, n, r) symmetric Clos network
• 3-stage network in which each stage is made up of a number of crossbar switches
  o m: number of middle-stage switches
  o n: number of input/output nodes on each input/output switch
  o r: number of input and output switches
• Example: (3, 3, 4) Clos network
• Non-blocking network
• Expensive (several full crossbars)

Indirect topologies: Benes network
• Rearrangeable network in which paths may have to be rearranged to provide a connection, requiring an appropriate controller
• Clos topology composed of 2 x 2 switches
• Example: (2, 2, 4) rearrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches

Irregular Topologies
• Customized for an application
• Usually a mix of shared bus, direct, and indirect network topologies
• Example 1: Reduced mesh
• Example 2: Cluster-based hybrid topology

Example 1: Partially irregular 2D-Mesh topology
• Contains oversized, rectangularly shaped PEs.
[Figure: mesh of PEs and routers R with tile dimensions ∆x, ∆y and oversized PEs spanning 2∆x or 2∆y]

Example 2: Irregular Mesh
• This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC.

How to Select a Topology?
• The application decides the topology type
  o If PEs = a few tens: Star and Mesh topologies are recommended
  o If PEs = 100 or more: Hierarchical Star and Mesh are recommended
• Some topologies are better for certain designs than others
• Most of the time, when one topology is better in performance, it is worse in power consumption!

NoC Switching Strategies
• Switching determines how flits and packets flow through routers in the network
• There are two basic modes:
  o Circuit switching
  o Packet switching

Circuit Switching
• Network resources (channels) are reserved before a packet is sent
• The entire path must be reserved first
• The packets do not contain routing information, but rather data and information about the data
• Circuit-switched networks require no overhead for packetisation, packet header processing or packet buffering
[Figure: time-space diagram across routers R1, R2, R3 showing header, ACK and data, with routing + switching delay, router delay, setup time and transfer time]

Circuit Switching
• Once the circuit is set up, router latency and control overheads are very low
• Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
• More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks

Packet Switching
• We can aim to make better use of channel resources by buffering packets. We then arbitrate for access to network resources dynamically.
• We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node

Packet Switching
• Packet-buffer flow control
  o Store-and-forward (SAF): advance when the entire packet is buffered and L free flit buffers are available at the next node
  o Cut-through: advance when L free flit buffers are available at the next node
• Flit-buffer flow control
  o Wormhole: can advance when at least one flit buffer is available
• L: packet length

Packet Switching: Store and Forward (SAF)
• A packet is sent from one router to the next only if the receiving router has buffer space for the entire packet
• Buffer size in the router is at least equal to the size of a packet
• Forward packet by packet
[Figure: store-and-forward switching through buffers and switches, showing header and data flits]

Packet Switching: Wormhole (WH)
• A flit is forwarded to a router if space exists for that flit
• Parts of the packet can be distributed among two or more routers
• Buffer requirements are reduced to one flit, instead of an entire packet
• Forward flit by flit
[Figure: WH switching technique through buffers and switches, showing header and data flits]

Packet Switching: Virtual Channel (VC)
• Improves the performance of WH routing by preventing a single packet from blocking a free channel, e.g. if the green packet is blocked, the red packet may still make progress through the network
• We can interleave flits from different packets over the same channel
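
The practical difference between store-and-forward and wormhole switching can be sketched with a zero-load latency model. This is a simplified illustration under stated assumptions, not a model given in the slides: there is no contention, each link carries one flit per cycle, every router adds a fixed `router_delay` cycles, and the function names are hypothetical.

```python
def saf_latency(hops, packet_flits, router_delay=1):
    """Store-and-forward: the whole packet is received at every hop
    before it is forwarded, so the serialization time is paid per hop."""
    return hops * (router_delay + packet_flits)


def wormhole_latency(hops, packet_flits, router_delay=1):
    """Wormhole: the header advances as soon as one flit buffer is free,
    and the remaining flits follow in a pipeline, so the serialization
    time is paid only once."""
    return hops * router_delay + packet_flits


def min_buffer_flits(scheme, packet_flits):
    """Smallest per-router input buffer each scheme needs, in flits."""
    if scheme == "saf":
        return packet_flits   # must hold an entire packet
    if scheme == "wormhole":
        return 1              # one flit is enough
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    hops, flits = 4, 8
    print("SAF cycles:", saf_latency(hops, flits))
    print("Wormhole cycles:", wormhole_latency(hops, flits))
```

With 4 hops and 8-flit packets this toy model gives 36 cycles for store-and-forward against 12 for wormhole, while the minimum buffer drops from 8 flits to 1. The model ignores blocking, which is exactly where wormhole switching suffers and where virtual channels help.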