NOC: Networks on Chip
MPSoC: Multiprocessor System on Chip
EE8205: Embedded Computer Systems
http://www.ee.ryerson.ca/~courses/EE8205/
Dr. Gul N. Khan
http://www.ee.ryerson.ca/~gnkhan
Electrical and Computer Engineering
Ryerson University
NOC and SOC Design

Overview
• Introduction to SoC and MPSoC
• Networks on a Chip
• Bus-based and Point-to-point NoC Systems
• Regular and Application Specific NoC Topologies
• Routing and Switching Techniques
• NOC Topology Generation and Analysis
Introductory articles on MPSoC and NoC are available at the course webpage.

System on a Chip
Systems-on-Chip (SoC)
• Advances in chip design and integration.
• Incorporate multiple components on a single chip.
• MPSoC addresses ever-increasing performance requirements.
[Figure: Samsung S3C6400 Platform]
[Figure: Samsung S3C6410 Platform]

S3C6410 System on Chip
• A 16/32-bit RISC microprocessor offering low power and high performance.
• Applications include mobile phones, portable navigation devices and other general applications.
• Provides optimized H/W performance for 2.5G and 3G communication services.
• Includes many powerful hardware accelerators for motion video processing, display control and scaling.
• An integrated Multi Format Codec (MFC) supports encoding and decoding of MPEG4/H.263 and H.264.
• Many hardware peripherals such as camera interface, TFT 24-bit LCD controller, power management, etc.
[Figure: ARM 11 (v6) based SOC]
[Figure: S3C6410 based Mobile Processor; Navigation System; iPhone based on ARM1176JZ; S3C6410]

System on Chip Design Flow
• Specify:
  * What does the customer really want?
• Architect:
  * What is the most cost- and performance-effective architecture to implement it?
  * What existing components can we adapt and re-use?
• Evaluate:
  * What is the performance impact of a cheaper architecture?
• Implement:
  * What can we generate automatically from libraries and customization?
Use separate computation, communication and performance

System-on-Chip and NoC
System-on-Chip ---to--- Network-on-Chip
[Figure: SoC components (CPU, MPEG core, DSP, VGA core, analog ADC/DAC) reorganized from a System-on-Chip into a Network-on-Chip]

SoC Structure: NoC-based System on a Chip
[Figure: a tile of the chip, showing a processor core, instruction and data caches, L2 cache, control logic, switch fabric with ports p0 to p4 (data, control, parity, spare wires), and a network interface; legend: a communication link, a computational block]

System on Chip Design Flow
[Figure: design flow in four numbered stages, covering system behavior (with behavior simulation) and system architecture, mapping, performance simulation, and communication refinement, followed by flow to implementation]

System on Chip Design Flow
1. Annotation of architectural timing and energy onto behavior
2. Performance simulation: behavior annotated with architectural effects
3. Analyze / visualize results

SoC Application: Wireless LAN Physical Layer
• PicoRadio: ad hoc networks; low rate (b/sec to kb/sec); low power (100 μW)
• HiperLan/2: multimedia wireless networks; high rate (10 Mb/sec); low power (10-100 mW)
• Protocol stack: Application, Network, MAC, OFDM Physical Layer / Digital BB (OFDM RX, OFDM TX)
• Dynamic reconfiguration

Wireless LAN SoC
[Figure: Wireless LAN SoC with analog front ends, A/D and D/A converters, power amplifiers (PA), digital modem, ASIC, FPGA, clock manager, sleep mode management, microcontroller block, and user interface, connected by a bus]
Main Points
• Which micro-controller to use?
• Do we need more FPGAs?
• DSP or ASIC?
• Which MAC?
• Where will the MAC run?
• Which other applications can I add?
• Is the chip reusable?
• Is there too much memory?
[Figure: alternative Wireless LAN SoC with analog front end, A/D and D/A converters, PA, FPGA, Turbo Codec, Protocol block (f0, f1, f2), and micro-controller connected through a crossbar and a bus]

Wireless LAN Physical Layer Design Flow
Scope: Higher Layers; OFDM Physical Layer (OFDM RX, OFDM TX); SystemC Functional IP Reuse; Functional Partitioning (TX, RX)
• Application Specification: English (UML, SystemC…)
• Algorithm Exploration: SystemC or C (Matlab/Simulink, …)
• Functional Simulation and Refinement: Coware (….)
• Architecture Exploration, Performance Simulation: Coware (, …)
• Mapping
• Architecture Refinement
• Implementation

Physical Layer SoC Architecture
[Figure: FPGA with FFT, FIR, UART and BUFFER blocks; XBAR and XBAR interface; FPGA configuration memory; interrupt bridge; DPR2/SPS2 bridge; processor bus and processor bus interface; micro with I/D caches; JTAG interface; clock generator (CK1, CK2, MCK), reset; datapath; SPS2 (instruction/data RAM); pins VDD, VSS, TEST(0..2)]

Multiple Processor/Core System-on-Chip
Inter-node communication between CPUs/cores can be performed by message passing or shared memory. The number of processors on the same chip die increases at each technology node (CMP and MPSoC).
• Memory sharing requires a SHARED BUS:
  * Large multiplexers
  * Cache coherence techniques
  * Not scalable
• Message passing uses a NOC:
  * Scalable
  * Requires data transfer transactions
  * Has the overhead of extra communication

NOC: Network-on-Chip
[Figure: system bus]
A shared bus is not a long-term solution.
• It has poor scalability.
On-chip micro-networks meet the demand for scalability and performance.

NOC and Off-Chip Networks
NOC:
• Sensitive to cost: area and power
• Wires are relatively cheap
• Latency is critical
• Traffic is known a priori
• Design-time specialization
• Custom NoCs are possible
Off-Chip Networks:
• Cost is in the links
• Latency is tolerable
• Traffic/applications unknown
• Changes at runtime
• Adherence to networking standards

On-Chip Communication Structures
[Figure: on-chip communication structures]

On-Chip Bus Interconnection
• For highly connected multi-core systems, the bus is a communication bottleneck.
• For multi-master buses, arbitration becomes a complex problem.
• Power per communication event grows, because every attached unit adds capacitive load.
• A crossbar switch can overcome some of these bus problems and limitations, but a crossbar is not scalable.

SOC Communication Structures: Dedicated Point-to-Point
• Advantages
  * Optimal in terms of bandwidth, availability, latency and power usage
  * Simple to design and verify, and easier to model
• Disadvantages
  * The number of links grows quadratically with the number of cores (n(n-1)/2 links for n fully connected cores)
  * Hardware area
  * Routing problems

SOC Communication Structures: Network on Chip
• Advantages
  * Structured architecture: lower complexity and cost of SOC design
  * Reuse of components, architectures, design methods and tools
  * Efficient and high-performance interconnect
  * Scalability of the communication architecture
• Disadvantages
  * Internal network contention can add latency
  * Bus-oriented IPs need smart wrapping hardware
  * Software needs clear synchronization in multiprocessor systems

Networks-on-Chip
• Interconnect for SoCs, CMPs, MPSoC and FPGAs
  * Multi-hop, packet-based communication
  * Efficient resource sharing
• Scalable communication infrastructure provides scalable performance/efficiency in
  * Power
  * Hardware area
  * Design productivity

NoC ?
A chip-wide network: in the NoC architecture, Processing Elements (PEs) are interconnected via a packet-based network.
[Figure: 4x4 grid of routers, each attached to one of PE 1 to PE 16; a message (MSG) is packetized at the source PE, carried router to router, and decoded at the destination PE]

Network-on-Chip vs. Bus Interconnection
Network-on-Chip:
• Total bandwidth grows
• Link speed unaffected
• Concurrent spatial reuse
• Pipelining is built-in
• Distributed arbitration
• Separate abstraction layers
However
• No performance guarantee
• Extra delay in routers
• Area and power overhead?
• Modules need NI
• Unfamiliar methodology
Bus:
• Bus interconnection is fairly simple and familiar
However
• Bandwidth is limited, shared
• Speed goes down as N grows
• No concurrency
• Pipelining is tough
• Central arbitration
• No layers of abstraction (communication and computation are coupled)

On-Chip Buses
• Ad hoc Buses: traditional data/address buses
• ARM AMBA Bus: Advanced Microcontroller Bus Architecture
• IBM CoreConnect Bus: CoreConnect Bus Architecture

AMBA On-Chip Bus
AMBA evolved from ARM’s internal bus development:
• ASB/AHB: Advanced System Bus / Advanced High-performance Bus, with support for pipelining, burst transfer and multiple bus masters
• APB: Advanced Peripheral Bus, with all slave devices
• Bridge: a slave on the ASB that connects it to the APB

AMBA based Single Chip GPS Controller
■ Suitable for handheld and personal navigation systems
■ ARM7TDMI 16/32 bit RISC CPU based host
■ Complete embedded memory system: Flash 256 KB, RAM 64 KB.
■ 12 channel GPS correlation DSP
■ 4 channels A/D
■ 4 serial communication interfaces
■ One serial peripheral interface (SPI)
■ Real-time clock module
■ 16-bit watchdog timer

IBM CoreConnect On-Chip Bus
CoreConnect is an SOC bus proposed by IBM having:
• PLB: Processor Local Bus, PLB Arbiter, PLB to OPB Bridge
• OPB: On-Chip Peripheral Bus, OPB Arbiter
• DCR: Device Control Register Bus and a Bridge

CoreConnect Advanced Features
IBM CoreConnect Bus comes in 32-, 64-, and 128-bit versions to support a variety of applications.
• PLB: Fully synchronous, supports up to 8 masters
  - Separate read/write data buses
  - Burst transfers (variable and fixed-length), pipelining
  - DMA transfers; no on-chip tri-states required
  - Overlapped arbitration, programmable priority fairness
• OPB: Fully synchronous, 32-bit address and data buses
  - Supports 1-cycle data transfers between master and slaves
  - Arbitration for up to 4 OPB master peripherals
  - Bridge function can be master on PLB or OPB
• DCR: Provides fully synchronous movement of GPR data between CPU and slave logic

[Figure: CoreConnect Bus based SoC]
[Figure: Comparing AMBA and CoreConnect SoC Buses]

NoC: Buses to Networks
Original Bus Features
• One transaction at a time
• Central arbiter
• Limited bandwidth
• Synchronous
• Low cost
[Figure: Shared Bus to Segmented Bus]

Advanced Bus, Segmented Bus
• More general/versatile bus architecture
• Pipelining capability
• Burst transfer
• Split transactions
• Overlapped arbitration
• Transaction preemption, resumption and reordering
[Figure: Shared Bus to Segmented Bus]

Buses to Networks
• Architectural paradigm shift: replace wire spaghetti by a network
• Usage paradigm shift: pack everything in packets
• Organizational paradigm shift:
  * Confiscate communications from logic designers
  * Create a new discipline, a new infrastructure responsibility

NoC Related Main Problems
• Global interconnect design problems: delay, power, noise, scalability, reliability
• System integration: productivity problem
• Chip Multi Processors: for power-efficient computing

NoC and Global Connections Delay (NoC: Long Wiring Delays)
Long wiring delay is dominated by resistance.
• Add repeaters
• Repeaters will become latches (with clock frequency scaling)
• Latches can become NoC routers
[Figure: a long wire segmented by a chain of NoC routers]

NoC Wiring Design
• NoC links:
  - Regular
  - Point-to-point: no fan-out tree (problem)
  - Can use transmission-line layout
  - Well-defined current return path
• Can be optimized for noise / speed / power
  - Low swing, current mode, ….

NoC Scalability
Compare the wire area for the same performance (n modules):
• NoC: O(n)
• Bus: O(n^3 √n)
• Segmented Bus: O(n^2 √n)
• Pt-to-Pt: O(n^2 √n)

NoC as a Platform
NoC platform for system integration, testing and debugging:
• System modules may use different clocks/voltages. The NoC can take care of synchronization.
• NoC design may be asynchronous: no power is wasted when links/routers are idle.
• It eliminates ad-hoc global wire engineering.
• It separates computation from communication.
• It supports modularity and reuse of cores.

CMP and NoC
• Uniprocessors cannot provide power-efficient performance growth
  * Interconnect dominates dynamic power
  * Global wire delay doesn’t scale
  * Instruction-level parallelism is limited
• Power efficiency requires many parallel local computations
  * Chip Multi Processors (CMP)
  * Thread-Level Parallelism (TLP)
[Figure: uni-processor dynamic power, split into Gate, Interconnect, and Diff. (Magen et al., SLIP 2004)]
[Figure: uni-processor performance versus die area (or power), “Pollack’s rule” (F. Pollack, Micro 32, 1999)]
A network is another choice for CMP.

Network-on-Chip Topologies
[Figure: application-specific irregular topologies]

Irregular NoC Topologies
• Based on the concept of using only what is necessary.
• Application-specific topologies.
• Eliminate unneeded resources and bandwidth from the system.
• Lead to reduced power and area use.
• Require additional design work.

NOC Topology
[Figure: 16-node (4x4) mesh and its physical implementation]

NOC Torus Topology
[Figure: 16-node (4x4) torus and its physical implementation, with nodes reordered in the layout]

NOC Abstraction Layers (Network Modeling)
• Software layers: O/S, application
• Network and transport layers:
  - Network topology: e.g. crossbar, ring, mesh, torus, fat tree, …
  - Switching: circuit / packet switching (VCT, wormhole)
  - Addressing: logical/physical, source/destination, flow, transaction
  - Routing: static/dynamic, distributed/source, deadlock avoidance
  - Quality of Service: e.g. guaranteed-throughput, best-effort
  - Congestion control, end-to-end flow control
• Data link layer: flow control (handshake), handling of contention, correction of transmission errors
• Physical layer: wires, drivers, receivers, repeaters, signaling, circuits, …

Definitions and Terminology
Switch: The component of the network that is in charge of flit routing.
Flit Latency: The time needed for a flit to reach its target PE from its source PE.
Packet Latency: The time needed for a packet to reach its target PE from its source PE.
Packet Spread: The time from the reception of the first flit of a packet to the reception of the last one.

Message Abstraction
Packet: An element of information that a processing element (PE) sends to another PE. A packet may consist of a variable number of flits.
Flit: The elementary unit of information exchanged in the communication network in a clock cycle.
[Figure: a message is divided into packets (header, payload), and a packet into flits (body and tail flits with Type and VC fields; a flit with Type, VC and Dest. fields)]

Switching Techniques
• Circuit Switching
• Packet Switching (routing protocols):
  - Store and Forward: Router cost is packet based. Packet size also affects latency and buffering requirements. Stalling happens at two nodes and the link between them.
  - Wormhole: Router cost is based on the header. The header can affect latency, and buffering at the router is based on the header size. Stalling can happen at all the nodes and links spanned by the packet.
  - Virtual Cut-through: Router cost depends on header and packet size. Stalling is limited to the local node level.

Relevant Parameters: Routing
• Minimum latency is of paramount importance in NOC/SOC (inter-process communication).
• Ideally: 1 clock of latency per switch/router (a flit enters at time t and exits at t+1).
• Maximum switch clock frequency (technology + routing logic limits).
• Deadlock free.
• No flits are ever lost: once a flit is injected into the NOC, it must reach its destination, possibly after a long time.

Fixed Shortest Path Routing
Suitable for regular topologies, e.g. mesh, torus, tree, etc.
X-Y routing (first the x direction, then the y direction):
• Simple router
• No deadlock scenario
• No retransmission
• No reordering of messages
• Power-efficient

Wormhole Routing
In wormhole routing, a header flit “digs” the path and holds it.
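As an illustration of the X-Y routing rule named under Fixed Shortest Path Routing, the following minimal Python sketch lists the routers a header flit would visit on a mesh, moving first along x and then along y. The function name `xy_route` and the (x, y) coordinate convention are assumptions made for this example and are not part of the course material.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst on a 2D mesh.

    src and dst are (x, y) tuples. The path first corrects the x
    coordinate, then the y coordinate (X-Y dimension-order routing).
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: travel along the x dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: travel along the y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    for hop in xy_route((0, 0), (2, 1)):
        print(hop)
```

Because every packet finishes its x movement before turning into y, the route is fixed by the source and destination alone, which is why the slide lists no reordering of messages.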
Successive flits are routed along the same path or direction.
When flits can be blocked and the NoC must be loss-less, we need:
• Buffers
• A back-pressure mechanism, since FIFOs are not of infinite size

Wormhole
[Animation: a packet made of a header flit (HF), body flits F2 to F4 and a tail flit (TF) advances hop by hop from Src to Dest, with the flits following the header along the same path]

Deflection Routing
Hot Potato: Deadlock-Free Routing
• Every flit can be routed in a different direction (no packet notion at the switch level).
• If the optimal direction is blocked, the flit is “deflected” to another direction.
• Switch latency of 1 clock cycle whatever the level of congestion.
• Minimum buffer requirements.

Deflection Routing:
• Packet reordering
• Adaptive routing
• No buffering
• No back pressure
• Works with torus/mesh
Wormhole Routing:
• No packet reordering
• Static routing
• Buffering (≥ 2 flits/port)
• Back pressure
• XY routing needs a mesh

Hot-Potato
[Animation: the flits HF, F2, F3 and TF of one packet travel from Src to Dest along different paths and can arrive out of order]

Network-on-Chip
[Figure: Network-on-Chip]

Core to Network Connection
[Figure: core to network connection]

NOC Switch/Router
[Figure: generic router/switch]

Another Generic Router with Virtual Channels
[Figure: inputs 0 (from west), 1 (from north), through N (from PE); each input takes Flit_in through a VCID demultiplexer into virtual-channel buffers VC0 to VC(V-1) and a mux; a full 5x5 crossbar; Switch Allocator (SA) and VC Allocator (VA) for scheduling; routing logic; signals Credit_in, Credit_out and output VC Resv_State]

A Typical Router Pipeline
FLIT IN → Routing & Buffers → VC Allocation → Arbitration → Switch Traversal → FLIT OUT

VC: Virtual-Channels
[Figure: virtual channels]

CAD Problems for NOC
• Application Mapping (map tasks to cores)
• Floorplanning/Placement (within the network)
• Routing (of messages)
• Buffer Sizing (size of FIFO queues in the routers)
• Timing Closure (link bandwidth capacity allocation)
• Simulation (network simulation for traffic, delay, power modeling)
• Testing
• …
Combined with the problems of designing the NOC itself (topology synthesis, switching, virtual channels, arbitration, flow control, …)

Topology Generation and Analysis
• Aim:
  * Generate a viable network topology.
  * Analyze the generated topology.
• Targeted Network:
  * Best-effort, wormhole switched.
  * Lookup table based source routing.
  * No virtual channel support.
  * Round Robin switch output arbitration.
  * One NI per component master or slave interface.
  * All transactions converted to packets of the same length (flit count).
  * Burst beats converted to separate packets.

System Input and Output
• Input:
  * Core Graph
  * Network Parameters
• Output:
  * Topology Graph
  * Route tables
  * Recommended Operating Clock Frequency

Topology Generation
• Aims:
  * Provide physical links.
  * Minimize latency on select paths.
  * Use a minimum of resources.
• Two algorithms are used.
  * ALG1: Point-to-Point Oriented Topologies.
  * ALG2: Partitioned Crossbar Topologies.
• Heuristic approach.

Point-to-Point Oriented Topologies
[Figure: point-to-point oriented topologies]

Partitioned Crossbar Topologies
• Initial topology: Fully Connected Crossbar (single switch).
• Ideal latency situation.
• May violate the maximum port requirement.
• Partitioning process.
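The targeted network described above uses round-robin switch output arbitration. As a minimal software sketch of that policy (real arbiters are hardware blocks, and the class name `RoundRobinArbiter` and its interface are hypothetical, chosen only for this illustration), the arbiter below grants one requesting input per cycle and gives lowest priority to the input that was served last.

```python
class RoundRobinArbiter:
    """Grant one of several requesting inputs per cycle, in rotating order."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Start as if the last input had just been served, so that
        # input 0 has the highest priority in the first cycle.
        self.last_granted = num_inputs - 1

    def grant(self, requests):
        """requests[i] is True when input i wants the output port.

        Returns the index of the granted input, or None if nobody requests.
        """
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_granted + offset) % self.num_inputs
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(4)
    # Inputs 0, 2 and 3 keep requesting the same output port.
    for cycle in range(4):
        print(cycle, arbiter.grant([True, False, True, True]))
```

Rotating the priority after every grant bounds the wait of any requesting input to at most one grant per competing input, which is one reason the policy pairs naturally with a best-effort network.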
Topology Analysis
• Aim:
  * Estimate achievable performance.
  * Account for interference in the system.
• Use of Petri Nets.
• Partitioned analysis.
  * Analyze components in isolation.
  * Sum contention effects across paths.
• Two Stages:
  * Frequency selection.
  * Path verification.

Verification Process
• Verify all path latencies.
  * Write packet latency.
  * Read packet latency.
• Adjust delays based on contention.
• Contention Areas:
  * Switch output.
  * Destination NI.

Contention Estimation
[Figure: contention estimation]

Frequency Selection
• Cyclical relation between contention and frequency.
• Frequency is fixed before contention is analyzed.
• To find the minimum valid frequency:
  * Interval halving process.
  * Large number of frequency points.

Simulation Environment
• SystemC based.
• Collection of models:
  * Generators and Sinks.
  * Master and Slave NIs.
  * Various Switches.
• AMBA AXI protocol implemented.

Results
• Applications and generated topologies.
• Comparative results.
• Resource Usage.
• Accuracy tests.

MPEG4 - Decoder
[Figure: panels A) and B)]
Clock Frequency: 3.43 GHz

MWD Application
[Figure: panels A) and B)]
Clock Frequency: 573.4 MHz

AV Benchmark
[Figure: AV benchmark]

AV Topologies
[Figure: panels A) and B)]
Clock Frequency: 2.31 GHz

Comparative Results I
[Figure: comparative results I]

Comparative Results II
[Figure: comparative results II]

Resource Usage
Topology          Mesh   Fat Tree   Custom 1   Custom 2
MPEG4 Decoder      46      44         22         14
MWD Application    59      47         13         17
AV Benchmark       87      67         25

Accuracy Test Results I
[Figure: accuracy test results I]

Accuracy Test Results II
[Figure: accuracy test results II]
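The interval halving process named under Frequency Selection can be sketched as a bisection search: each trial frequency is fixed first, the paths are then verified at that frequency, and the interval is halved toward the lowest frequency that still passes. The sketch below is only an illustration under stated assumptions. The function `min_valid_frequency`, the predicate `is_valid`, and the toy latency model are hypothetical and stand in for the Petri-net based path verification, which the slides do not detail. The search also assumes that any frequency above a valid one is valid too.

```python
def min_valid_frequency(is_valid, f_low, f_high, tolerance):
    """Bisection (interval halving) for the lowest frequency that passes.

    is_valid(f) must return True when all paths meet their latency
    requirements at frequency f. f_low is assumed invalid and f_high
    valid; returns None if even f_high fails verification.
    """
    if not is_valid(f_high):
        return None
    while f_high - f_low > tolerance:
        mid = (f_low + f_high) / 2.0
        if is_valid(mid):
            f_high = mid   # mid works: the minimum is at or below mid
        else:
            f_low = mid    # mid fails: the minimum is above mid
    return f_high


def toy_is_valid(freq_ghz):
    """Hypothetical check: a 40-cycle path must finish within 20 ns."""
    cycles = 40
    deadline_ns = 20.0
    return cycles / freq_ghz <= deadline_ns


if __name__ == "__main__":
    print(min_valid_frequency(toy_is_valid, 0.1, 4.0, 0.001))
```

In the toy model the exact threshold is 2.0 GHz, so the search returns a value at or slightly above 2.0 GHz, within the chosen tolerance.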