On-Chip Networks (NoCs) Seminar on Embedded System Architecture (Prof. A. Strey)

On-Chip Networks (NoCs) Seminar on Embedded System Architecture (Prof. A. Strey) Simone Pellegrini simone.pellegrini@uibk.ac.at University of Innsbruck – December 10, 2009 1 Introduction Advancements in chip manufacturing technology allow, nowadays, the integration of several hardware components into a single integrated circuit reducing both manufacturing costs and system dimensions. System-on-a-Chip (SoC) integrates – on a single chip – several building blocks, or IPs (Intellectual Property), such as general-purpose programmable processors, DSPs, accelerators, memory blocks, I/O blocks, etc. Recently, the exponential growth of the number of IPs employed in SoC designs is arising the need for a more efficient and scalable on-chip core-to-core interconnection. Traditionally, on-chip interconnects consist of 3 solutions [1]: Bus : A bus is shared between several IPs, arbitration logic is needed to guarantee serialization of access requests. Crossbar : A switch connecting multiple inputs to multiple outputs in a matrix manner. Point-to-Point : Each component has a dedicated interconnection towards other dependent IPs. An example of bus interconnection is depicted in Fig. 1.a, bus are cheap to realize, however in a highly interconnected multicore system it can easily become a communication bottleneck. Furthermore, the capacitive load increases with the number of cores attached to the bus resulting in higher energy consumption. The crossbar overcomes some of the limitations of the bus, it provides lower latency but it is not scalable. Dedicated point-to-point links are optimal in terms of latency, power usage and bandwidth availability as designed accordingly to the specific core-to-core bandwidth requirement. However, the number of dedicated point-to-point interconnections exponentially grows with the number of IPs resulting in a larger realization area. Crossbar and point-to-point interconnections usually scale efficiently up to 20 cores, for larger number of IPs integrated in a single chip a more scalable and flexible solution is needed. The solution consists of an on-chip data-routing network consisting of communication links and routing nodes generally known as Network-on-Chip (NoC) architectures (Fig. 1.c). CPU DSP a) Bus RAM CPU MPEG DSP MPEG CPU RAM DSP MPEG RAM b) Crossbar c) Point-to-Point Figure 1: Examples of communication structures in Systems-on-Chip. a) traditional bus-based communication, b) crossbar and c) point-to-point. 1 2 Basic NoCs’ Building Blocks A NoC consists of routing nodes spread out across the chip connected via communication links. In Fig. 2 the main components of a NoC are depicted: Network Adapter implements the interface by which cores (IP blocks) connect to the NoC. Its function is to decouple computation (the cores) from communication (the network). switch NA Routing node routes the data according to chosen protocols. It implements the routing strategy. Figure 2: Example of On-Chip Network. Link connects the nodes, providing the raw bandwidth. It may consist of one or more logical or physical channels. Several NoC designs have been proposed in literature, while some of them aim at low-latency others are more oriented to maximize bandwidth. Like computer networks, a NoC is usually designed to met application requirements and at each level (network adapter, routing and link) several design trade-offs must be considered. 2.1 Network Adapter (NA) NA handles the end-to-end flow control, encapsulating the messages or transactions generated by the cores for the routing strategy of the network. At this level messages are broken into packets and, depending from the underlying network, additional routing information (like destination core) are encoded into a packet header. Packets are further decomposed into flits (flow control units) which again are divided into phits (physical units), which are the minimum size datagram that can be transmitted in a single link transaction. A flit is usually composed by 1 to 2 phits. The NA implements a Core Interface (CI) at the core side and a Network interface (NI) at network side. The level of decoupling introduced by the NA may vary. A high degree of decoupling allows for easy reuse of cores. On the other hand, a lower level of decoupling (a more network-aware core) has the potential to make more optimal use of the network resources. Standard sockets can be used to implement the CI, the Open Core Protocol (OCP) and the Virtual Component Interface (VCI) are two examples widely used in SoCs. 2.2 Network Level (routing) The main responsibility of the network is to deliver messages from a sender IP to a receiver core. A NoC is defined by its topology and the protocol implemented by it. 2.2.1 Topology Topology concerns the layout and connectivity of the nodes and links on the chip. Two forms of interconnections are explored in modern NoCs: regular and irregular topologies. Regular topologies are divided into 3 classes k-ary n-cube, k-ary tree and k-ary n-dimensional fat tree, where k is the degree of each dimension and n is the number of dimensions. Most NoCs implement regular forms of network ,topology that can be laid out on a chip surface, for example, k-ary 2-cube, commonly known as grid-based topology. Generally, a grid topology makes better use of links (utilization), while tree-based topologies are useful for exploiting locality of traffic and better optimize the bandwidth. Irregular forms of topologies are based on the concept of clustering (frequently used in GALS chips) and derived by mixing different forms in a hierarchical, hybrid, or asymmetric way. 2 2.2.2 Protocol The protocol concerns the strategy of moving data through the network and it is implemented at the router level. A generic scheme of a router is showed in Fig. 2. The switch is the physical component which deals with data transport by connecting input and output buffers. Switch connections are dictated by the routing algorithm which defines the path a message has to follow to reach its destination. The wide majority of NoCs are based on packet-switching networks. Alike circuit-switching networks, where a circuit from source to destination is set up and reserved until the transfer of data is completed, in packet-switching networks packets are forwarded on a per-hop basis (each packet contains routing information and data). Most common forward strategies are the following: Store-and-forward the node stores the complete packet and forwards it based on the information within its header. The packet may stall if the router does not have sufficient buffer space. Wormhole the node looks at the header of the packet (stored in the first flit) to determine its next hop and immediately forwards it. The subsequent flits are forwarded as they arrive to the same destination node. As no buffering is done, wormhole routing attains a minimal packet latency. The main drawback is that a stalling packet can occupy all the links a worm spans. Virtual cut-through works like the wormhole routing but before forwarding a packet the node waits for a guarantee that the next node in the path will accept the entire packet. The main forwarding technique used in NoCs is wormhole because of the low latency and the small realization area as no buffering is required. Most often connection-less routing is employed for best effort (BE) while connection-oriented routing is preferable for guarantee throughput (GT) needed when applications have QoS requirements. Once the destination node is known, in order to determine to which of the switch’s output ports the message should be forwarded, static or dynamic techniques can be used. Deterministic routing the path is determined by packet source and destination. Popular deterministic routing schemes for NoC are source routing and X-Y routing (2D dimension order routing). In X-Y routing, for example, the packet follows the rows first, then moves along the columns toward the destination or vice versa. In source routing, the routing information are directly encoded into the packet header by the sending IP. Adaptive routing the routing path is decided on a per-hop basis. Adaptive schemes involve dynamic arbitration mechanisms, for example, based on local link congestion. This results in a more complex node implementations but offers benefits like dynamic load balancing. Deterministic routing is preferred as it is easier to implement, adaptive routing, on the other hand, tends to concentrate traffic at the centre of the network resulting in increased congestion there [1]. Beside the delivery of messages from the source to the destination core, the network level is also responsible of ensuring the correct operation, also known as flow control. Deadlock and livelock should be detected and resolved. Deadlock occurs when network resources (e.g., link bandwidth or buffer space) are suspended waiting for each other to be released, i.e. where one path is blocked leading to others being blocked in a cyclic way. Livelock occurs when resources constantly change state waiting for other to finish. Methods to avoid deadlock, and livelock can be applied either locally at the nodes or globally by ensuring logical separation of data streams by applying end-to-end control mechanisms. Deadlocks can be avoided by using Virtual Channels (VC), VCs allow multiple logical channels to share a single physical channel. When one logical channel stalls, the physical node can continue serving other logical channels (VCs require additional buffering as several independent queues must be provided). 2.3 Link Level Link-level research regards the node-to-node links, the goal is to design point-to-point interconnections which are fast, low power and reliable. As chip manufacturing technology scales, the effects of wires on link delay and power consumption increase. Due to physical limitations, is not possible 3 to produce long and fast wires, so segmentation is done by inserting repeater buffers at a regular distance to keep the delay linearly dependent on the length of the wire. Partition of long wires into pipeline stages as an alternative segmentation technique is an effective way of increasing throughput. This comes at expense of latency as pipeline stages are more complex than standard repeaters (buffering is required). In future chip interconnects, optical wires are gaining more and more interests. The idea consist in integrating the technology used in fiber optic wires on the chip surface. As shown in Fig. 3, for links longer than 3-4 mm optical interconnects can dramatically reduce signal delay and power consumption [1]. 3 NoC Analysis Metrics usually employed to describe a NoC can be divided into two categories: performance and cost. Latency, bandwidth, throughput and jitter are the most common performance factors while cost is often referred to power consumption and area usage. Design a NoC usually requires to find the optimal trade-off between application needs (usually expressed in terms of traffic requirements), area usage and power consumption. For example choosing adaptive routing scheme in favour of static ones can have benefits on the average latency as the network load increases. Routing algorithm can also optimize the aggregate bandwidth as well as link utilization. The jitter, at last, refers to the delay variation of received packets in a flow caused by network congestion or improper queuing. Knowledge of the application traffic behaviour can help in compensating the jitter via a proper buffer dimensioning which allows to absorb the effect of traffic bursts. Figure 3: Delay comparison of optical and electriFrom the point of view of costs, reduce the cal interconnect (with and without repeaters) in a power consumption is one of the main objec- projected 50 nm technology tives. There are two main terms which are used to measure the power consumption of a NoC system: (i) power per communicated bit and (ii) idle power. Idle power is often reduced by using asynchronous circuits in implementing some of the NoC components, e.g. the NA. 4 Available NoC Solutions Design of a NoC is possible by means of libraries which give a high level description of NoC components that can be assembled together by system designers to suite the application needs. Available NoC libraries can be classified corresponding to the two following metrics as depicted in Fig.4: Parametrizability at system-level : the ease with which a system-level NoC characteristic can be changed at instantiation time. The NoC description may encompass a wide range of parameters, such as: number of slots in the switch, pipeline stages in the links, number of ports of the network, and others. Granularity of NoC : describe the abstraction level at which the NoC component is described. At fine level the level the network can be assembled by putting together basic building blocks which at the coarse level, the NoC can be described as a system core. In the next sections two concrete systems which employ NoC will be analyzed: Xpipes and the RAW processor. Xpipes is a library with the purpose of providing to system designers the building blocks 4 to instantiate custom NoCs for SoC systems. The RAW processor consists of a general purpose multicore CPU architecture which use NoC-based interconnect between processing cores. 4.1 Xpipes Xpipes is a library of highly parameterised soft macros (network interface, switch and switch-to-switch link) that can be turned into instance-specific network components at instantiation time. Components can be assembled together allowing users to explore several NoC designs (e.g. different topologies) to better fit the specific application needs. The high degree of parameterisation of Xpipes network building blocks regards both global network-specific parameters (such as flit size, maximum number of hops between any two nodes, etc.) and block-specific parameters (content of routing tables for source-based routing, switch I/O ports, number of virtual channels, etc.). The Xpipes NoC backbone relies on a wormhole switching technique and makes use of a static routing algorithm called street sign routing. Important attention is given to reliability as distributed error detection techniques are implemented at Figure 4: Available NoC solutions classilink level. To achieve at high clock frequency, in Xpipes fication. links are pipelined in order to optimize throughput. The size of segments can be tailored to the desired clock frequency. Overall, the delay for a flit to traverse from across one link and node is 2N + M cycles, where N is number of pipeline stages and M the switch stages. One of the main advantage of Xpipes over other NoC libraries is the provided tool set. The XpipesCompiler is a tool designed to automatically instantiates an application-specific custom communication infrastructure using Xpipes components from the system specification. The output of the XpipesCompiler is a SystemC description that can be fed to a back-end RTL synthesis tool for silicon implementation. 4.2 RAW Processor The RAW microprocessor is a general purpose multicore CPU first designed at MIT [3]. The main focus is to exploit ILP across several CPU cores (called tiles) whose functional units are connected through a NoC. The first prototype, realized in 1997, had 16 individual tiles and a clock speed of 225 MHz. RAW processor design is optimized for low-latency communication and for efficient execution of parallel codes. Tiles are interconnected using four 32-bit full-duplex networks-on-chip, a 2 dimensional grid topology is used thus each tile connects to its four neighbours. The limited length of wires, which is no greater than the width of a tile, allows high clock frequencies. Two of the networks are static and managed by a single static router (which is optimized for low-latency), while the remaining two are dynamic. The interesting aspect of the RAW processor is that the networks are integrated directly into the processors’ pipeline as depicted in Fig. 5. This architecture enables an ALU-to-network latency of 4 clock cycles (for a 8 stage pipeline tile). The static router is a 5 stages pipeline that controls 2 routing crossbars and thus 2 physical networks. Routing is performed by programming the static routers in a per-clock-basis. These instructions are generated by the compiler and as the traffic pattern is extracted from the application at compile time, router preparation can be pipelined allowing data words to be forwarded towards the correct port upon arrival. The dynamic network is based on packet-switching and the wormhole routing protocol is used. The packet header contains the destination tile, an user field and the length of the message. Dynamic routing introduces higher latency as additional clock cycles are required to process the message header. Two dynamic networks are implemented to handle deadlocks. The memory network has a 5 Figure 5: Raw compute processor pipeline. restricted usage model that uses deadlock avoidance. The general network usage is instead unrestricted and when deadlocks happen the memory network is used to restore the correct functionality. The success of the RAW architecture is demonstrate by the Tilera company, founded by former MIT members, which distributes CPUs based on the RAW design. Current top microprocessor is the Tile-GX which packs into a single chip 100 tiles at a clock speed of 1,5 Ghz [4]. Also Intel recently presented the Single-chip Cloud Computer (SCC) which packs 48 tiles connected through a NoC [5]. 5 Summary On-Chip networks are gaining more and more interest for both embedded SoC systems and general purpose CPUs as the number of integrated cores is constantly increasing and traditional interconnects do not scale. As chip manufacturing technology scales down, computation is getting cheaper and communication is becoming the main source of power consumption. NoC tries to solve the problem by reducing the complexity and the number of chip interconnections and by limiting the wire length to allow higher clock speeds. Several NoC libraries and NoC-based general purpose CPUs have been introduced making effective programmability of these chips one of the next big research challenges. References [1] Bjerregaard, Tobias and Mahadevan, Shankar: A survey of research and practices of Network-onchip, ACM Computing Surveys, Volume 38, Number 1, 2006. [2] Davide Bertozzi and Luca Benini: Xpipes: A Network-on-Chip Architecture for Gigascale Systemson-Chip, IEEE Circuits and Systems Magazine, Volume 4, Issue 2, page(s): 18-31, 2004. [3] Michael B. Taylor et al.: The RAW microprocessor: A Computational Fabric for Software circuits and General Purpose Programs, IEEE Micro, Volume 22, Issue 2, page(s): 25- 35, 2002. [4] Tilera: Tile-GX Processors Family, http://www.tilera.com/products/TILE-Gx.php. [5] Intel Research: Single-chip Cloud Computer, Tera-Scale/1826.htm 6 http://techresearch.intel.com/articles/

On-Chip Networks (NoCs) Seminar on Embedded System Architecture (Prof. A. Strey)

Related documents

Products

Support

On-Chip Networks (NoCs) Seminar on Embedded System Architecture (Prof. A. Strey)

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib