Ninth National Conference with International Participation - ETAI 2009
Ohrid, Republic of Macedonia, 26-29 IX 2009

IE2-4

NETWORK PROCESSORS: EVOLUTION AND TRENDS

Danijela Jakimovska, Goce Dokoski, Aristotel Tentov and Marija Kalendar
Faculty of Electrical Engineering and Information Technologies, Dept. of Computer Science, Karposh 2, b.b, Skopje, Macedonia
{danijela, goce.dokoski, toto, marijaka}@feit.ukim.edu.mk

Abstract--This paper provides an overview of network processor research and design, current trends, and proposals for future development. First, we explain the need for network processors, software-controllable chips optimized for high-performance communication traffic. We then describe the key aspects of network processors: data processing, chip design and software development possibilities. Next, we explain the reasons for the existence of different architectures and organizations. In that sense, we describe a few different network processors, thus presenting current trends and possible directions for future evolution. Finally, we analyze the approach of using general purpose processors in combination with specific software in order to further augment the performance of network processors.

Index terms-- network processor, design, architecture, evolution, trends.

1. INTRODUCTION

Networks grow rapidly and carry numerous complex applications, services and real-time data that must be delivered at very high speeds, up to multiple Gb/s. Therefore, there is a constant demand for ever-increasing packet processing speed, while at the same time an increasing number of services (QoS, firewalls, scheduling, flow control, etc.) must be provided by the networking hardware.
Moreover, data, voice and video networks are converging, and users expect on-demand services delivered over any network, on any platform. Network devices must follow this evolution and process data at these transmission rates.

Routers have traditionally been built around application-specific integrated circuits (ASICs) tailored to the tasks of routing and forwarding. However, this approach has proven inflexible when new capabilities must be added. At the same time, advances in System on Chip (SoC) technology and the availability of field-programmable gate arrays (FPGAs) and complex programmable logic devices (CPLDs) have opened many new possibilities in processor design. This evolution resulted in the concept of the network processor, a processor optimized for packet processing and routing. It turned out to be the most suitable solution, as it provides the necessary flexibility while maintaining a decent operating speed.

In order to design an efficient network processor, a careful examination of the current solutions is necessary; therefore, we analyze a few of the most important ones. Network processor design is an ongoing field of research and development. Many approaches have been applied and many new ideas are emerging, such as the NetFPGA architecture or software routers. The aim of this paper is to outline the achievements in network processor design, discuss current trends, and propose ideas for further improvement.

This paper is organized as follows. Section 2 gives an overview of network processors, how network processing works and the different levels of operation. Section 3 describes current architectural trends in network processor design. Section 4 gives a few examples of commercial network processors, including one software solution for general purpose processors. Section 5 presents possible architectural changes in general purpose processors for achieving better processing speeds.
The paper concludes in Section 6.

2. NETWORK PROCESSORS OVERVIEW

The development of network processors began in the late 1990s, when existing network devices could no longer handle complex network processing requirements. Network processors are programmable chips optimized for packet processing at very high speeds (multi-Gb/s), and they are included in many different types of network equipment, such as routers, switches and firewalls [1] [2]. A network processor is an application-specific instruction-set processor (ASIP) which, like many general purpose processors, usually implements a simple RISC instruction set. Additionally, it implements parallel processing and pipelining at a low level, and it contains special-purpose hardware blocks for traffic management, searching, high-speed memory and packet I/O [2] [3]. As a result, it can operate at wire speed and achieve high performance. Additional requirements must be satisfied as well: a network processor should be flexible and easily programmable, and it should allow a short time-to-market [4].

Network processing devices usually perform packet analysis, searching and classification of frames, modification of packet contents, extraction of relevant information from the frames, and packet forwarding. To achieve this, they are usually designed as a composition of four functional blocks: physical interface, data plane, control plane and switching interface. The physical layer interface converts the signal received from the communication medium into data and, in the opposite direction, converts outgoing data into a signal transmitted over the medium. The packet processing speed depends on the operations that must be executed on each packet. Fast packet processing is usually referred to as the data plane and is characterized by simple tasks performed at wire speed. On the other hand, slow packet processing, called the control plane, is responsible for packets that need more complex processing. This plane also performs control, configuration and management operations for the network device.
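The split between a fast data plane and a slow control plane can be sketched in a few lines. The Python model below is a hypothetical illustration, not code for any particular network processor: the `Packet` fields, the conditions that send a packet to the slow path, and the plain dictionary used as a forwarding table (an exact match on the destination rather than a real longest-prefix-match lookup) are all assumptions made for brevity.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical packet model with only the fields the dispatch needs."""
    dst: str                    # destination address (exact-match key in this sketch)
    ttl: int                    # IP time-to-live
    has_options: bool = False   # IP header options present
    to_device: bool = False     # addressed to the device itself (e.g. routing protocol traffic)


def fast_path_eligible(pkt: Packet) -> bool:
    """Data-plane test: only simple, common-case packets stay on the fast path."""
    return pkt.ttl > 1 and not pkt.has_options and not pkt.to_device


def process(pkt: Packet, fib: dict, slow_queue: list):
    """Forward pkt on the data plane if possible, otherwise punt it.

    Returns the egress port for fast-path packets, or None when the packet
    was queued for the control plane (slow path).
    """
    if fast_path_eligible(pkt) and pkt.dst in fib:
        pkt.ttl -= 1            # simple header modification done at wire speed
        return fib[pkt.dst]     # egress port taken from the forwarding table
    # Expiring TTL, IP options, local delivery or an unknown destination:
    # these need more complex processing, so the control plane handles them.
    slow_queue.append(pkt)
    return None
```

For example, with `fib = {"10.0.0.1": 3}`, a packet to 10.0.0.1 with a TTL of 64 leaves on port 3 with its TTL decremented to 63, while the same packet with a TTL of 1 returns `None` and is appended to the slow-path queue. Keeping the fast-path test to a few field comparisons is what lets the data plane run at wire speed; everything unusual is deferred to the control plane.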
The control plane is usually implemented on a general purpose processor, while the data plane is implemented on a network processor, which has greater packet processing power. The switching fabric is another part of network devices; its basic function is to forward traffic from ingress to egress ports [3] [4] [5] [6].

Fig. 1 Functional elements of network devices

Network processors operate at three levels: entry-level or access network processors, mid-level or edge network processors, and high-end or core network processors [3]. Access-level processors currently provide throughput of up to 2 Gb/s. Their applications include routers for customer branch offices, homes, etc. Network processors used in equipment at this level include EZchip's NPA, Wintegra's WinPath, devices from Agere and PMC-Sierra, and Intel's IXP2300. Edge network processors aggregate traffic from multiple access routers and serve as ingress and egress points to the core. They run at speeds from 2 to 5 Gb/s. This group includes multipurpose network processors from AMCC, Intel (IXP), C-Port, Agere, Vitesse and IBM. Core network processors form the fastest part of the network, operating at wire speeds between 10 and 100 Gb/s. To achieve these speeds, they are usually built from high-speed components able to process huge amounts of data (millions of packets per second). Examples include EZchip's network processors, those from Xelerated, Sandburst and Bay Microsystems, and the in-house Alcatel-Lucent SP2 [2].

Fig. 2 Different operation levels of network processors

3. CURRENT ARCHITECTURE TRENDS

Network processor design is an ongoing field of development, and many different solutions have been proposed. In general, network processor architectures include a processing engine (PE), dedicated hardware, a network interface, memory resources and software support.
A network processor architecture must also exploit parallelism, using various parallelization techniques and pipelining. The PE is the basic programmable unit of a network processor, responsible for data processing. Depending on the architecture, PEs can be grouped into blocks of multiple PEs [8]. PEs are usually placed close to special hardware accelerators called coprocessors. This dedicated hardware is easily programmable and performs additional computations, thereby increasing processing power and speed. Memory organization includes packet memory, instruction memory and routing table memory. Since the processor frequently interacts with memory, this communication must be very fast. Improvements in this area are achieved by using content addressable memory (CAM). Network interfaces are the communication points for the ingress and egress packet flows of the network processor.

Another important aspect of network processor design, in terms of flexibility, is software support. Companies now pay much more attention to programmability, allowing network software to be written in a high-level language such as C, and the core routines in microcode [3]. However, developing software for network processors is not easy, since they have different architectures, complex designs and strict performance constraints. The current trend is toward software uniformity and design portability.

In order to meet performance and speed requirements, current architectures include parallel processing and special on-chip bus and memory organizations. Parallelism can be implemented at three different levels: instructions, threads and packets. Instruction-level parallelism enables simultaneous execution of program instructions, similar to the pipeline approach. Multithreaded processors execute multiple threads on one or more processors. On the other hand, packet-level parallelism refers to parallel packet processing.
This approach is implemented by employing multiple PEs, so that each PE simultaneously processes a different packet. Various network processors currently achieve their performance by increasing the number of processing engines. This trend is shown in Fig. 3 [2].

Fig. 3 Trade-offs between number of PEs and issue width.

High-performance on-chip communication architectures include buses and crossbars. Bus-based communication cannot satisfy increasing performance needs of up to 40 Gb/s, so buses are being replaced by crossbar switches. On the other hand, the use of crossbars is limited by their high cost, difficult design and low scalability. As a result, some network processors use high-bandwidth buses [4].

Memory in network processors is usually organized in several levels. It is very important, as it has many responsibilities, such as storing the program and register contents, buffering packets, keeping intermediate results, storing data produced by the processors while working, holding and maintaining potentially huge tables and trees for look-ups, maintaining statistical tables, and so forth. Therefore, memory should have low latency and fulfill the speed requirements; consequently, there is a trade-off between memory size and speed [2]. Improvements in memory organization are achieved by memory coprocessors and various caching mechanisms. A memory coprocessor executes instructions on the data held in memory without involving the main processor. Memory search operations are optimized by the use of content addressable memory. Caching mechanisms can significantly improve route lookup performance, and hence packet forwarding. Although caching mechanisms were not widely used at the beginning of network processor evolution, today both data and instruction caches are crucial [4].

4. EXAMPLES OF NETWORK PROCESSORS

Many network processor architectures have emerged so far, each with its own advantages and disadvantages. In this section we describe three well-known architectures, two of which have been very successful on the market. We also consider a software solution based on general purpose processors.

4.1 Intel IXP 2800/2850 Network Processor

The Intel IXP 2800/2850 network processor architecture allows high-performance network processing and routing at speeds ranging from OC-3 (155 Mb/s) to OC-192 (10 Gb/s) [9]. It can operate on very heavy traffic, such as in internet core and edge routers/switches, Storage Area Networks (SANs), and data-center and enterprise routers/switches. At the same time, it is capable of advanced traffic management routines (deep packet inspection, load balancing and high-speed packet forwarding). According to [1], the IXP2850 is the first processor to integrate security functionality into the core. This is achieved by two security-specific functional units that handle cryptographic algorithms and encrypt and decrypt IPsec packets at OC-192 speeds, either on one 10 Gb/s connection or on several 1 Gb/s connections simultaneously.

Its hardware configuration consists of two IXP multiprocessors, one for the ingress and one for the egress direction, as well as a switch fabric. Each IXP processor is itself a complete multiprocessor system, consisting of one XScale core based on a RISC architecture and operating at 700 MHz, 16 programmable "micro engines" working at 1.4 GHz, and additional coprocessors that aid packet reassembly, acceleration and control flow. The memory consists of general purpose registers, CAMs, SDRAM and RDRAM, distributed as global and local memory among the micro engines and the XScale core. The XScale core handles communication and coordination with the backplane, maintains the data, and processes the packets that the micro engines cannot handle.
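A back-of-the-envelope calculation shows why a processor like the IXP2800 relies on many micro engines working in parallel. The Python sketch below assumes a nominal 10 Gb/s (OC-192) line rate, minimum-size 40-byte IP packets and no framing overhead; these are simplifying assumptions, not figures taken from [9]. It computes the packet arrival rate, the time available per packet, and the number of 1.4 GHz micro engine cycles available per packet, first for a single engine and then when 16 engines process different packets in parallel (packet-level parallelism).

```python
def cycle_budget(line_rate_bps, packet_bytes, clock_hz, engines=1):
    """Per-packet time and cycle budget at a given line rate.

    Simplifying assumptions: all packets have the same size and
    link-layer framing overhead is ignored.
    """
    packets_per_s = line_rate_bps / (packet_bytes * 8)  # packet arrival rate
    ns_per_packet = 1e9 / packets_per_s                  # time between arrivals
    cycles_single = clock_hz / packets_per_s             # cycles if one engine did all the work
    # Packet-level parallelism: with N engines each one receives every N-th
    # packet, so it may spend N times as many cycles on each packet
    # (ignoring contention for shared memory and buses).
    return {
        "packets_per_s": packets_per_s,
        "ns_per_packet": ns_per_packet,
        "cycles_single_engine": cycles_single,
        "cycles_per_engine": cycles_single * engines,
    }


# OC-192 (nominal 10 Gb/s), 40-byte packets, 16 micro engines at 1.4 GHz
budget = cycle_budget(10e9, 40, 1.4e9, engines=16)
for name, value in budget.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the line delivers 31.25 million packets per second, leaving 32 ns per packet: about 45 cycles for a single 1.4 GHz engine, but about 717 cycles per packet for each of 16 engines working on different packets. In practice, memory latency and contention for shared resources reduce the usable budget, so the figures are an upper bound rather than a performance prediction.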
In [10], an interesting approach is proposed in which the XScale core is replaced by one of the micro engines in order to achieve higher speed (by about 40%). This alternative allows high-speed routing for the IPv6 protocol, as it does not require complex header processing in the XScale core. The micro engines are programmable in a high-level language (C), and their task is to perform high-speed packet processing. Their configuration encompasses packet processing stages, pipes and the memory model. To conclude, in the 1 Gb/s configuration this processor can be used as an edge-class router, and in the 10 Gb/s configuration it can be used for line-speed packet processing (e.g. firewalls, intrusion detection systems, etc.).

4.2 EZchip Architecture

This architecture is based on a pipeline of parallel heterogeneous processors optimized for packet processing. Each processor is equipped with specific functional units, memory and internal data buses. There are many network processor (NP) types and versions based on the EZchip architecture, suitable for various applications. EZchip's NP-1 is a 10 Gb/s full-duplex network processor that provides seven-layer packet processing. This is achieved with many specific hardware accelerators capable of processing packets at wire speed. Unlike the NP-1, the NP-2 includes traffic management units. This architecture uses DRAM for all lookup tables and frame buffers, as well as external SRAM for statistical data. The use of DRAM minimizes overall system cost and power dissipation.

The NP-1 is based on a five-stage packet processing pipeline, each stage containing multiple parallel processors optimized to perform specific tasks. There are four types of such Task-Optimized Processors (TOP engines), employed to parse, search, resolve and modify packets, respectively. Each type of TOP processor has a function-specific data path, functional units and instruction set, as required for complex seven-layer packet processing. The instruction sets are similar, and only small adjustments are needed to operate the specific functional units. As a result, the overall architecture resembles a highly superscalar system. One important characteristic is that the allocation of TOP engines to incoming frames, inter-TOP communication and the maintenance of frame ordering are performed in hardware and are completely transparent to the programmer [2].

4.3 Architecture of NetFPGA Router

When it comes to achieving decent speeds while affording sufficient flexibility for implementing new protocols, Field Programmable Gate Array (FPGA) technology is a very handy tool. With this in mind, the NetFPGA was designed as a sandbox for network hardware design. It allows researchers to experiment with new ways of processing packets at line rate by programming its functionality in a hardware description language (e.g. Verilog). The NetFPGA is built as a PCI card that contains an FPGA, four 1 GigE ports and buffer memory (SRAM and DRAM). It consists of modules connected as a sequence of stages in a pipeline, which communicate using a simple packet-based synchronous FIFO push interface [11]. Although its primary intent is research, the NetFPGA suggests the idea of combining FPGAs with network processors. This combination can significantly increase the flexibility of a network processor at higher speeds.

4.4 General-Purpose Processor as a Network Processor

Recently, increasing attention has been devoted to software routers based on off-the-shelf hardware and open-source operating systems running on personal computer (PC) architectures. Today's high-end PCI shared buses can be used for multi-gigabit-per-second routing, at a price much lower than that of commercial routers. The remaining bottleneck of this approach is the programmability of the NICs, although an interesting solution has already been proposed [14]. Although this category is not primarily intended for core-level routing, it can certainly serve well at the edge level, allowing multi-gigabit connectivity [15]. There are two very popular open-source software solutions:

QUAGGA is free software, based on the GNU Zebra project, that manages TCP/IP-based routing protocols; it is released as part of the GNU Project. According to its developers, it offers true modularity by running a separate process for each protocol. QUAGGA works independently of the operating system on which it is installed, and it does not include non-routing functionality such as DHCP, NTP or SSH access; these services can be added separately to the operating system. Another advantage of this software is its availability for different architectures.

Vyatta, on the other hand, is built together with the operating system and is easier to configure. It is also free software; however, it only supports the x86 processor architecture.

5. GENERAL PURPOSE MULTICORE PROCESSORS

Multicore general purpose processors, combined with operating systems and application programs with multithreading capabilities, promise very good performance for routing. However, they cannot be applied directly to network applications such as routing; instead, they need careful redesign of their internal hardware architecture, which should ultimately bring a crucial improvement in network packet processing performance. This redesign will in turn require corresponding changes in the internal software architecture and in the instruction set. We have started to investigate the necessary hardware and instruction set changes for the SUN T2 processor architecture.
Initial results are very promising regarding the capability of such processors, with appropriate internal hardware and software interventions, to be used in multi-gigabit-per-second routing applications.

6. CONCLUSION

At first, either general purpose processors or ASICs were used as network processors, but today a combination of these two approaches is used in order to cope with the need for greater flexibility and speed.

Fig. 4 Comparison of General purpose processors, ASICs and network processors

To keep up with the evolution of networks, characterized by ever-increasing bandwidth and the requirement for heavy packet processing, these processors were constantly equipped with an increasing number of hardware accelerators. That is how the concept of today's network processors evolved [13]. Today, network processors are multi-engine or multi-core, multi-threaded systems on chip, with varying degrees of programmability. More recent network processors require relatively general purpose programming techniques. Network processors are replacing ASICs and fixed-function chips in a variety of networking equipment, and are thus emerging as a dominant technology in new network system designs.

According to [12], the market share leader in 2007 was Intel, with its IXP2350 access NPU and the IXP28xx 10 Gb/s NPU. Intel is now shipping an embedded x86 processor that uses some elements of the IXP2350. Netronome Systems, on the other hand, develops high-end derivatives of this processor under a license from Intel [6]. EZchip is shipping its highly integrated 20 Gb/s NP-2, which combines a network processor unit, a fine-grained traffic manager and several Ethernet MACs. Other manufacturers ranking high on the market are LSI, Wintegra and Broadcom, as well as a few smaller companies. The NPU market is expected to show strong growth in the near future.

Software solutions such as the QUAGGA and Vyatta platforms, which add routing functionality to general purpose processors, cannot reach the line speeds achieved by network processors at the core level. This might be achieved by some minor architectural changes. One may conclude that a compromise between programmable processors and fixed-function circuits (possibly FPGAs) is always necessary. Another interesting area is to investigate the possibilities for including general purpose multicore processors, with the necessary internal hardware and software redesign and appropriate multithreaded routing software, to reach or even surpass the line speeds achieved by specialized network processors. We are at the very beginning of this research; initial results are very promising, but further investigation is necessary, and we will pursue it.

REFERENCES

[1] Nazar Zaidi: Network Processors: Evolution and Current Trends, RMI Corporation, USA, May 1, 2008.
[2] Ran Giladi: Network Processors - Architecture, Programming and Implementation, Ben-Gurion University of the Negev and EZchip Technologies Ltd., 2008.
[3] Mahmood Ahmadi, Stephan Wong: Network Processors: Challenges and Trends, Computer Engineering Laboratory, Electrical Engineering, Mathematics and Computer Science Department, Delft University of Technology, Netherlands.
[4] Mohammad Shorfuzzaman, Rasit Eskicioglu, Peter Graham: Architectures for Network Processors: Key Features, Evaluation, and Trends, Department of Computer Science, University of Manitoba, Winnipeg.
[5] Panos C. Lekkas: Network Processors: Architectures, Protocols and Platforms, McGraw-Hill Professional, 2003.
[6] Netronome: Switching & Routing Solutions, http://www.netronome.com/pages/switching-routing, 2000.
[7] Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk: Network Processor Design, Volume 1/2, Morgan Kaufmann, 2003.
[8] Md. Ehtesamul Haque, Md. Humayun Kabir: A Survey on Network Processors, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, April 3, 2007.
[9] David Meng, Ravi Gunturi, Manohar Castelino: IXP2800 Intel Network Processor IP Forwarding Benchmark Full Disclosure Report for OC192-POS, Oct. 30, 2003.
[10] Ankush Garg, Prantik Bhattacharyya: Network Processor Architecture for Next Generation Packet Format, Department of Computer Science, University of California, Davis.
[11] Jad Naous, Sara Bolouki, Glen Gibb, Nick McKeown: NetFPGA: Reusable Router Architecture for Experimental Research, Stanford University, California, USA.
[12] Bob Wheeler, Linley Gwennap: A Guide to Network Processors, Ninth Edition, California, January 2008.
[13] Matthias Gries: Algorithm-Architecture Trade-offs in Network Processor Design, PhD thesis, Sweden, May 21, 2001.
[14] M. Petracca, R. Birke, A. Bianco: HERO: High-speed enhanced routing operation in software routers NICs, Politecnico di Torino, Turin, 13 Feb. 2008.
[15] Robert Olsson, Hans Wassen, Emil Pedersen: Open Source Routing in High-Speed Production Use, Uppsala Universitet, 2008.