Network Processor Architectures, Programming Models, and Applications

Fang Li and Jie Wang
Department of Computer Science
University of Massachusetts Lowell

Abstract: Network processors (NPs) are a new breed of high-performance network devices. They enable users, through software, to create and run network applications at line rates. This article describes major NP architectures, their programmability, and various applications.

1. Introduction

Network processors are fully programmable network devices specially designed to store, process, and forward large volumes of data packets at wire speed. These tasks are traditionally performed by ASIC-based switches and routers. While ASIC devices offer performance, they lack programmability. Software-based solutions running on general-purpose processors, on the other hand, offer programming flexibility but lack performance. Network processors are designed to fill this gap: they offer the advantages of both hardware-based and software-based solutions. This technological advancement opens a new direction for data communications, and major chip makers have seized the opportunity and developed their own NP lines.

What do NP architectures look like, and what are their programming models? This article provides brief answers to these questions. In particular, we describe major NP architectures and their programming models, including IBM's PowerNP, Intel's IXP, Motorola's C-Port, and Agere's PayloadPlus. We then describe a number of applications suitable for NPs. Finally, we discuss a few issues and challenges.

2. NP Architectures and Programmability

A typical NP chip consists of an array of programmable packet processors in a highly parallel architecture, a programmable control processor, hardware coprocessors or accelerators for common networking operations, high-speed memory interfaces, and high-speed network interfaces.

2.1 IBM PowerNP. A PowerNP chip consists of an embedded processor complex (EPC), a data flow (DF) unit, a scheduler, MACs, and coprocessors (see Figure 1).

Figure 1: IBM PowerNP architecture [1]

The EPC consists of an embedded PowerPC core processor and an array of picoprocessors (PPs), where each PP can perform a full set of operations on each packet it receives. The PowerPC core and the PPs are programmable; the instructions executed by PPs are referred to as Picocode.

NP4GS3 is a popular high-end member of the PowerNP family. It supports 4 Gbps of aggregate bandwidth at line rates up to OC-48 (2.5 Gbps), and it integrates the switching engine, the search engine, and security functions on one chip to provide fast switching. Listed below are the major components of NP4GS3:

• An embedded PowerPC core that runs at 133 MHz, with 16 KB of instruction cache (ICache), 16 KB of data cache (DCache), and up to 128 MB of program space.
• 16 PPs, each clocked at 133 MHz, providing 2128 MIPS of aggregate packet-processing capability and sharing a total of 32 KB of instruction memory (IMEM).
• Multiple hardware accelerators for tree searching, frame forwarding, filtering, and alteration.
• 40 Fast Ethernet or 4 Gigabit Ethernet MACs, supporting industry-standard PHY components.
• An integrated Packet over SONET (POS) interface supporting one OC-48c line, one OC-48 line, four OC-12 lines, or sixteen OC-3 lines for attaching industry-standard POS framers.
• A data flow unit that serves as the primary data path for receiving and transmitting network traffic.
• A scheduler that schedules traffic flows.

NP4GS3 processes POS frames using a combination of PPs, hardware accelerators, and external coprocessors. Each PP offers two hardware threads, so the 16 PPs in the EPC can simultaneously process 32 frames with zero context-switching overhead between threads.
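To make the division of labor concrete, the following minimal sketch, written in plain C rather than actual Picocode, shows the kind of per-frame work one PP hardware thread might perform while delegating route lookup and header rewriting to the accelerators listed above. Every type and function name here (frame, tree_search, alter_frame, enqueue) is a hypothetical stand-in for the corresponding hardware facility, not part of the IBM toolkit, and parsing/classification is elided.

    /* Hypothetical sketch of per-frame work on one PowerNP picoprocessor
     * thread. tree_search(), alter_frame(), and enqueue() stand in for
     * the tree-search and frame-alteration accelerators and the data
     * flow unit; none of these names come from the IBM APIs. */
    #include <stdint.h>
    #include <stddef.h>

    struct frame {
        uint8_t  data[2048];   /* frame contents in the data store       */
        size_t   len;
        uint32_t dst_ip;       /* assumed to be parsed before dispatch   */
    };

    static uint32_t tree_search(uint32_t dst_ip)
    {
        return dst_ip & 0x3F;                /* pretend next-hop/port id */
    }

    static void alter_frame(struct frame *f, uint32_t port)
    {
        (void)f; (void)port;   /* rewrite link-layer header, decrement TTL */
    }

    static void enqueue(struct frame *f, uint32_t port)
    {
        (void)f; (void)port;   /* hand the frame to the data flow unit   */
    }

    /* Work done for one frame by one hardware thread. */
    static void process_frame(struct frame *f)
    {
        uint32_t port = tree_search(f->dst_ip);  /* longest-prefix match  */
        alter_frame(f, port);                    /* alteration accelerator */
        enqueue(f, port);                        /* scheduler transmits it */
    }

    int main(void)
    {
        struct frame f = { .len = 64, .dst_ip = 0x0A000001u };
        process_frame(&f);
        return 0;
    }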
NP2G is a low-end member of the PowerNP family. It provides deep packet processing and substantial performance headroom for OC-12 lines. NP2G consists of one embedded PowerPC, 12 PPs, and 60 hardware accelerators for tree searching, frame forwarding, frame filtering, and frame alteration.

The PowerNP Developer's Toolkit provides the programming model for PowerNP chips. It offers a set of tools for developing and optimizing PowerNP applications, including an assembler/linker (NPAsm), a graphical debugger (NPScore), a system simulator (NPSim), a test-case generator (NPTest), and a software performance profiler (NPProfile). These tools are written in C++ and are tightly coupled with the Tcl/Tk scripting language. The toolkit provides both high-level (a C API and a C compiler) and low-level APIs for software developers.

2.2 Intel IXP. IXP network processors consist of three major components: a StrongARM or XScale core processor, an array of multi-threaded packet processors called microengines (MEs), and the IXA framework, where IXA stands for Internet Exchange Architecture. The core processor and the MEs are fully programmable.

IXP1200 is the first generation of the IXP family (see Figure 2). It offers OC-3 and OC-12 line rates.

Figure 2: Intel IXP1200 Block Diagram [2]

Listed below are the major components of an IXP1200 chip [2]:

• A StrongARM core. It runs at 166/200/233 MHz and can be programmed to run control-plane applications. It has 16 KB of instruction cache and 8 KB of main data cache.
• Six 32-bit RISC MEs. Each ME has four hardware threads with zero-overhead context switching (illustrated in the sketch after this list) and 8 KB of programmable instruction control storage that can hold up to 2048 instructions.
• An FBI unit and an IX Bus. The FBI unit is responsible for serving fast MAC-layer devices on the IX Bus, providing an interface for receiving and transmitting packets. It has 4 KB of scratchpad memory.
• An SRAM unit and an SDRAM unit. The SRAM unit provides 8 MB of SRAM that can be used to store lookup tables. The SDRAM unit provides 256 MB of lower-cost SDRAM for storing mass data, forwarding information, and transmit queues.
• A PCI unit. It provides an industry-standard 32-bit PCI bus for PCI peripheral devices such as host processors and MAC devices.
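The key programming idea behind the MEs is latency hiding: a thread issues a memory reference and swaps out at essentially zero cost so another thread can run. The sketch below, a plain C simulation rather than real Microengine C, illustrates this by rotating four logical contexts through a receive-lookup-transmit cycle; receive_packet, lookup_in_sram, and transmit are illustrative stand-ins for the FBI, SRAM, and SDRAM units, not Intel APIs.

    /* Plain-C sketch (not Microengine C) of how an IXP microengine hides
     * memory latency: each of the four hardware thread contexts runs
     * until it issues a memory reference, then yields so another context
     * can run.  All helper names are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CONTEXTS 4

    enum stage { RECEIVE, LOOKUP, TRANSMIT };

    struct context {
        enum stage stage;      /* where this thread will resume           */
        uint32_t   packet_id;  /* packet currently owned by this thread   */
        uint32_t   next_hop;
    };

    static uint32_t receive_packet(void)        { static uint32_t id; return ++id; }
    static uint32_t lookup_in_sram(uint32_t id) { return id % 16; }   /* route lookup */
    static void     transmit(uint32_t id, uint32_t port)
    {
        printf("packet %u -> port %u\n", (unsigned)id, (unsigned)port);
    }

    int main(void)
    {
        struct context ctx[NUM_CONTEXTS] = {0};

        /* Round-robin over the four contexts.  On real hardware the
         * switch happens in zero cycles whenever a thread swaps out on a
         * pending SRAM/SDRAM reference; here we advance one stage per turn. */
        for (int turn = 0; turn < 12; turn++) {
            struct context *c = &ctx[turn % NUM_CONTEXTS];
            switch (c->stage) {
            case RECEIVE:
                c->packet_id = receive_packet();   /* read from the IX Bus via FBI */
                c->stage = LOOKUP;   break;        /* swap out while it completes  */
            case LOOKUP:
                c->next_hop = lookup_in_sram(c->packet_id); /* issue SRAM read     */
                c->stage = TRANSMIT; break;
            case TRANSMIT:
                transmit(c->packet_id, c->next_hop);        /* queue via SDRAM     */
                c->stage = RECEIVE;  break;
            }
        }
        return 0;
    }

On the real chip the four contexts share an ME's registers and swap in zero cycles, which is what lets the six MEs keep up to 24 packets in flight while waiting on SRAM and SDRAM references.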
IXP2400, IXP2800, and IXP2850 are the second generation of IXP network processors [3, 4]. IXP2400 is designed for OC-48 (2.5 Gbps) applications. It has one XScale core and eight MEs, where each ME has eight hardware threads. The XScale core has 32 KB of instruction memory, 32 KB of data memory (DMEM), and a 2 KB mini data cache. Each ME has 4K 40-bit words of instruction control storage, holding up to 4096 instructions, and 640 32-bit words of locally addressable memory. IXP2400 supports 64 MB of SRAM and up to 2 GB of DDR SDRAM. It provides two unidirectional 32-bit media interfaces, which can be configured as SPI-3, UTOPIA level 1/2/3, or CSIX-L1. To improve processing performance, IXP2400 also supports multiplication (which IXP1200 does not), and it has built-in functions for generating pseudo-random numbers and for time stamping.

IXP2800 is designed for OC-192 (10 Gbps) applications. The IXP designers introduced the Hyper Task Chaining processing technology into IXP2800 for deep packet inspection via software pipelining at wire speed. IXP2800 has one XScale core and 16 MEs, supporting 23.1 giga-operations per second. IXP2850 is IXP2800 plus on-chip cryptography units; in particular, IXP2850 integrates two cryptography blocks into IXP2800 to provide hardware acceleration of standard encryption algorithms.

IXP comes with a comprehensive software development environment, including a Software Development Kit (IXA SDK) and a hardware development platform for rapid product development and prototyping. The Workbench component of the IXA SDK provides a friendly GUI simulation environment for code development and debugging. The IXA SDK also provides programming frameworks and the IXA Active Computing Engine (ACE) model, which supply complete code for typical network applications and packet-processing functions. The programming languages used by IXP are Microengine C, a C-like language, and Microcode, an assembly language.

2.3 Motorola C-Port. A C-Port chip consists of an Executive Processor (EP), an array of packet processors called Channel Processors (CPs), a Fabric Processor, and a number of coprocessors. The EP and the CPs are programmable RISC processors, and each processor can be individually configured to enhance the flexibility of a C-Port chip.

Figure 3: Motorola C-5 Block Diagram [5]

C-5 is the first member of the C-Port family (see Figure 3). It was designed to support complete programmability from Layer 2 through Layer 7 of the OSI model. It provides up to 5 Gbps of bandwidth and more than 3000 MIPS of computing power. It supports a variety of industry-standard serial and parallel protocols and data rates from DS1 (1.544 Mbps) to GE (1 Gbps), including 10/100 Mb Ethernet, 1 Gb Ethernet, OC-3c, OC-12, OC-12c, and Fibre Channel. Listed below are the major components of C-5:

• An EP that runs at 166/200/233 MHz and performs conventional supervisory tasks, including initialization, program loading and control of the CPs, statistics gathering, and centralized exception handling. It also executes routing algorithms and updates routing tables.
• 16 CPs responsible for receiving, processing, and transmitting packets and cells at wire speed. Each CP consists of a 32-bit C/C++-programmable RISC core and two microcode-programmable serial data processors (SDPs). The RISC core handles the more complex tasks: it is responsible for classification, for making scheduling decisions, and for overall management of the CP. One SDP processes the received data stream, and the other processes the transmitted data stream (a minimal sketch of this split appears after this list). Each CP has 12 KB of data memory, and each cluster of four adjacent CPs shares 24 KB of instruction memory, giving each CP 6 KB of instruction memory of its own. The CPs can operate independently of each other or cooperate in clusters. Each CP has programmable MAC and PHY interfaces.
• A Fabric Processor (FP), a high-speed network interface port with advanced functionality. It supports bidirectional transfer of packets, frames, or cells and can be configured for different fabric protocols; the FP is compatible with UTOPIA level 1/2/3, IBM PRIZMA, and PowerX (CSIX-L0).
• Three coprocessors, which operate as shared resources for the CPs and perform specific tasks such as table lookup, queue management, buffer management, and fabric interface management. The Buffer Management Unit (BMU) coprocessor can be programmed to manage centralized payload storage during packet processing; it has 32 MB of memory.
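As a rough illustration of how work is split inside a single CP, the following plain C sketch stages one frame through the receive SDP, the RISC core, and the transmit SDP via the CP's local data memory; the function names and the toy classification rule are assumptions made for illustration, not Motorola C-Ware APIs.

    /* Plain-C sketch of the three-stage split inside one C-5 Channel
     * Processor: the receive SDP assembles bytes into a frame in the
     * CP's local data memory, the RISC core classifies it (here via a
     * stand-in for the shared table-lookup coprocessor), and the
     * transmit SDP streams it out.  All names are illustrative. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    struct frame { uint8_t bytes[1518]; size_t len; uint32_t out_port; };

    /* Stage 1: receive serial data processor (microcode in reality). */
    static void rx_sdp(struct frame *f, const uint8_t *wire, size_t n)
    {
        memcpy(f->bytes, wire, n);     /* deposit the frame in CP data memory */
        f->len = n;
    }

    /* Stage 2: the CP's RISC core, classification and scheduling decision. */
    static void risc_core(struct frame *f)
    {
        uint32_t dst = f->bytes[0];    /* toy "header" field                  */
        f->out_port = dst % 16;        /* stand-in for the table-lookup unit  */
    }

    /* Stage 3: transmit serial data processor. */
    static void tx_sdp(const struct frame *f)
    {
        (void)f;                       /* stream bytes out to the PHY/MAC     */
    }

    int main(void)
    {
        const uint8_t wire[64] = { 0x2A };   /* a toy received frame          */
        struct frame f;
        rx_sdp(&f, wire, sizeof wire);
        risc_core(&f);
        tx_sdp(&f);
        return 0;
    }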
C-5e is the second generation of the C-Port family. It has 18 programmable RISC core processors (16 CPs, one Executive Processor, and one Fabric Processor) and 32 programmable Serial Data Processors. Each CP has 12 KB of local data memory and shares access to a 32 KB instruction memory with a cluster of four adjacent CPs. C-5e supports 5 Gbps of bandwidth and more than 4500 MIPS of computing power.

C-Port is programmed in C/C++. Its development environment provides a set of design, development, and debugging tools to support services and enhance productivity, including the C-Ware Software Toolset, which provides application libraries, APIs, a simulator, and a GNU-based compiler and debugger, and the C-Ware Development System for different service modules.

2.4 Agere PayloadPlus. The design of PayloadPlus differs from those discussed above. PayloadPlus (see Figure 4) provides multi-service solutions (IP, ATM, Ethernet, MPLS, Frame Relay, etc.) at GbE, OC-48c, and OC-192c speeds. It supports Layer 2 through Layer 7 protocol processing, buffer management, traffic shaping, data modification, and per-flow policing and statistics. PayloadPlus employs a pipelined processor architecture and uses "Pattern Matching Optimization" technology.

Figure 4: Agere PayloadPlus block diagram [6]

PayloadPlus consists of the following major components:

• A Fast Pattern Processor (FPP). The FPP is a programmable, pipelined, multi-threaded processor that can analyze and classify up to 64 packets at wire speed. It classifies and reassembles packets; the outcome of classification is forwarded to the Routing Switch Processor for further processing.
• A Routing Switch Processor (RSP). The RSP is a programmable processor that handles queuing, traffic management, and packet modification at wire speed. It contains three VLIW processors, called engines: one for traffic management, one for traffic shaping, and one for outgoing packet modification. These three engines can run different programs.
• An Agere System Interface (ASI). The ASI is a configurable, non-programmable engine that manages the interface between the FPP, the RSP, and the host computer. It handles slow-path packets and communicates with the host through a PCI bus.
• A microprocessor (µP) that handles initial setup and exceptions.

PayloadPlus protocols are programmed in the Functional Programming Language (FPL). Compared with C/C++, FPL reduces the number of instructions needed to carry out a given task. In addition to FPL, Agere offers the Agere System Language (ASL), a C-like scripting language, for programming procedural tasks executed by the RSP and ASI components.

2.5 Architecture Summary. Table 1 summarizes the PowerNP, IXP, and C-Port network processors discussed above in four categories: line rate, physical interface, chip memory, and programmability.
Table 1: Comparison of network processors

IBM PowerNP NP4GS3
  Line rate: 1 Gbps; OC-12, OC-12c; OC-48, OC-48c
  Physical interface: 40 Fast Ethernet/OC-48 MACs
  Chip memory: 16 KB ICache and 16 KB DCache for the PowerPC core; 2 KB IMEM for each PP; 128 KB SRAM for input packet buffering
  Programmability: PowerPC core, programmed in C; 16 PPs (each with 2 hardware threads), programmed in Picocode; IBM PowerNP Developer's Toolkit

Intel IXP IXP1200
  Line rate: OC-3, OC-12
  Physical interface: 10/100/1000 Ethernet MACs, ATM, T1/E1, SONET, xDSL; up to 56 physical ports
  Chip memory: 16 KB ICache and 8 KB DCache for the StrongARM core; 2 KB IMEM for each ME; 4 KB on-chip scratchpad for the FBI unit; 8 MB off-chip SRAM; 256 MB off-chip SDRAM
  Programmability: StrongARM core, programmed in C/C++; 6 MEs (each with 4 hardware threads), programmed in Microengine C or Microcode; Intel IXA SDK

Intel IXP IXP2400
  Line rate: OC-3, OC-12, OC-48
  Physical interface: 2 unidirectional 32-bit media interfaces, configurable as SPI-3, UTOPIA 1/2/3, or CSIX-L1
  Chip memory: 32 KB IMEM, 32 KB DMEM, and 2 KB mini data cache for the XScale core; 4K 40-bit IMEM and 640 32-bit words of local memory for each ME; 64 MB off-chip SRAM; 2 GB off-chip DDR SDRAM
  Programmability: XScale core, programmed in C/C++; 8 MEs (each with 8 hardware threads), programmed in Microengine C or Microcode; Intel IXA SDK

Motorola C-Port C-5
  Line rate: 10/100 Mbps, 1 Gbps; OC-3c, OC-12, OC-12c, OC-48; Fibre Channel
  Physical interface: Ethernet, 1GE, OC-3c, OC-12, OC-12c, OC-48, Fibre Channel; up to 16 physical interfaces per CP
  Chip memory: 16 MB for table lookup; 128 MB for the Buffer Management Unit; 12 KB data memory for each CP; 24 KB instruction memory for each cluster of 4 adjacent CPs
  Programmability: Executive Processor (XP), programmed in C/C++; 16 CPs, with the RISC core in each CP programmed in C/C++ and the 2 SDPs in each CP programmed in microcode; C-Ware Software Toolset

3. Major NP Applications

3.1 Routing and switching. Network processors are designed to route and switch large volumes of packets at wire speed, so routers and switches are direct applications of network processors. The programmability of NPs makes it possible and convenient to add new network services and new protocols to a router without jeopardizing the robustness of its service. For example, Spalink et al. [7] recently implemented an IXP1200-based router, using the MEs for packet classification, the MEs or the StrongARM for packet forwarding, and the StrongARM for scheduling. The programmability of NPs also makes it possible to design one's own network protocols and implement them on NPs.

3.2 QoS and traffic management. Ensuring that consumers get the promised services is a challenging issue in QoS. Good QoS relies on good traffic management, and good traffic management must decide, with good strategies, which packets to drop when the network is congested. To improve performance, certain traffic management functions, such as packet classification, have been implemented at the network layer using NPs.

3.3 Load balancing. NPs can be used to help balance job loads in a distributed system. For example, Dittmann and Herkersdorf [8] devised an algorithm to distribute traffic from a high-speed link across multiple lower-speed NPs. The algorithm avoids the need for synchronization between NPs and helps obtain a better load distribution. In particular, when a packet arrives, the algorithm inspects the packet, compresses its headers with a hash function into a fixed-length index (which serves as an address into a lookup memory), and decides, based on the information stored in the lookup memory, to which NP the packet will be sent. The algorithm then reunites the packet streams from the NPs. Implementation is straightforward provided the NPs do not deliver more traffic than the switch link can carry.
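The dispatch step of [8] can be illustrated with a short C sketch; the header fields, the fold-and-mask hash, and the 4096-entry lookup memory below are illustrative assumptions rather than details taken from the paper.

    /* Sketch of the load-balancing dispatch in Section 3.3: hash the
     * packet header to a fixed-length index, then use a lookup memory
     * to pick the target NP.  Field names, hash, and table size are
     * illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    #define LOOKUP_BITS  12                    /* index length: 4096 entries */
    #define LOOKUP_SIZE  (1u << LOOKUP_BITS)

    struct header { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; uint8_t proto; };

    static uint8_t lookup_memory[LOOKUP_SIZE]; /* entry = NP assigned to this bucket */

    /* Compress the header to a fixed-length index (stand-in for the hash in [8]). */
    static uint32_t flow_index(const struct header *h)
    {
        uint32_t x = h->src_ip ^ h->dst_ip;
        x ^= ((uint32_t)h->src_port << 16) | h->dst_port;
        x ^= h->proto;
        x ^= x >> 16; x ^= x >> 8;             /* fold                       */
        return x & (LOOKUP_SIZE - 1);          /* mask to the index length   */
    }

    /* Decide which lower-speed NP receives this packet. */
    static unsigned dispatch(const struct header *h, unsigned num_nps)
    {
        return lookup_memory[flow_index(h)] % num_nps;
    }

    int main(void)
    {
        struct header h = { .src_ip = 0xC0A80001u, .dst_ip = 0x0A000002u,
                            .src_port = 1234, .dst_port = 80, .proto = 6 };
        printf("packet goes to NP %u\n", dispatch(&h, 4));
        return 0;
    }

Because packets of the same flow hash to the same bucket, they reach the same NP, which is what helps avoid synchronization between the NPs.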
3.4 Web switch. Web switches are network devices that connect multiple Web servers. Web switches must support Layer 3 and Layer 4 load balancing, URL-based load balancing, server health monitoring, peak-load or traffic-surge management, and policy-based service-level management. Balancing thousands or tens of thousands of connections per second across dozens of servers requires a vast amount of processing capacity and memory. Session processing is CPU-intensive, particularly when the traffic to be balanced arrives simultaneously from many high-speed ports. We observe that NP-based Web switches would offer a reasonable solution. For example, the embedded core processor can monitor servers and services, track loads, and bring servers in and out of service, while the programmable packet processors handle session state management, server assignment, session binding and unbinding, and packet data examination. Processing tasks for each session are distributed to different programmable packet processors for parallel operation, which increases session performance and scales the Web switch's load-balancing capacity with its port density.

3.5 Network security. Security coprocessors and inline security processors are the standard hardware solutions for adding security processing power to networks. It is, however, difficult for coprocessors to scale up to higher data rates. Inline security processors, on the other hand, can scale up to higher data rates, but they must perform many of the same functions as NPs to achieve these rates. To solve this problem, the IXP designers integrated security functions into IXP2850 to provide network security at 10 Gbps while retaining the same NP design [9]. IXP2850 is therefore a natural choice for implementing IPsec. For example, we can use the on-chip security units to execute standard cryptographic algorithms and use the MEs to process security protocols such as ESP and AH for IPsec traffic. We can use the DRAM memory to hold security associations (SAs) with sufficient throughput, use the hashing unit to look up the SA information required for a given packet, and use the SRAM memory to store the hash tables needed to carry out the IPsec protocol. The XScale core can handle exceptions and carry out the Internet Key Exchange (IKE) protocol.
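As a rough sketch of the division of labor just described, the following C fragment walks one outbound ESP packet through an SA lookup (standing in for the hashing unit and the SRAM hash tables) and a call that stands in for the on-chip cryptography blocks. None of the names, nor the toy placeholder cipher, correspond to Intel IXA APIs.

    /* Hedged sketch of an outbound ESP fast path as described in
     * Section 3.5.  sa_lookup() stands in for the hashing unit plus
     * SRAM hash tables, and crypto_unit_encrypt() for the IXP2850
     * cryptography blocks; all names are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    struct sa {                       /* security association, kept in DRAM */
        uint32_t spi;
        uint32_t peer_ip;
        uint8_t  key[16];
        uint32_t seq;
    };

    #define SA_TABLE_SIZE 1024
    static struct sa sa_table[SA_TABLE_SIZE];

    /* Hashing-unit stand-in: map (peer, SPI) to a hash-table slot. */
    static struct sa *sa_lookup(uint32_t peer_ip, uint32_t spi)
    {
        struct sa *s = &sa_table[(peer_ip ^ spi) % SA_TABLE_SIZE];
        return (s->spi == spi && s->peer_ip == peer_ip) ? s : NULL;
    }

    /* Crypto-block stand-in; a real implementation would run 3DES/AES. */
    static void crypto_unit_encrypt(const uint8_t *key, uint8_t *payload, size_t len)
    {
        for (size_t i = 0; i < len; i++) payload[i] ^= key[i % 16];  /* placeholder */
    }

    /* Work an ME thread might do for one outbound ESP packet; IKE and
     * exceptions would be handed to the XScale core. */
    static int esp_outbound(uint32_t peer_ip, uint32_t spi, uint8_t *payload, size_t len)
    {
        struct sa *s = sa_lookup(peer_ip, spi);
        if (!s) return -1;            /* no SA: punt to the core processor   */
        s->seq++;                     /* build ESP header: SPI + sequence no. */
        crypto_unit_encrypt(s->key, payload, len);
        return 0;                     /* enqueue the packet for transmit     */
    }

    int main(void)
    {
        uint8_t payload[64] = {0};
        sa_table[(0xC0A80001u ^ 256u) % SA_TABLE_SIZE] =
            (struct sa){ .spi = 256u, .peer_ip = 0xC0A80001u };  /* install one SA */
        return esp_outbound(0xC0A80001u, 256u, payload, sizeof payload);
    }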
3.6 Grid computing. Grid computing distributes computing tasks in a distributed environment and coordinates resource sharing. Services provided in grid computing include resource allocation, resource management, information discovery, and a secure infrastructure, and NPs could play an important role in providing them. For example, Liljeqvist and Bengtsson [10] designed a grid computing architecture that uses NP-based programmable routers to distribute computing tasks across a grid at wire speed. A new grid protocol is used to utilize resources efficiently and to load-balance all computers on the grid in a truly distributed fashion.

3.7 VoIP gateway. Digital communication technology can be used to carry voice traffic. VoIP (Voice over IP) gateways convert media formats and translate IP protocols for setting up and releasing calls, and NPs can play an important role here as well. For example, one can build a VoIP gateway on IXP1200 [11] by optimally partitioning the control and signaling layers and the media protocol layers between the host CPU and the packet processors.

3.8 Wireless communication. The wireless infrastructure consists of central switching centers, base stations, and Radio Network Controllers (RNCs). Central switching centers connect base stations. RNCs manage the wireless radio interfaces of base stations and control handoff: in the forward direction they send data from the core network to the base stations, and in the reverse direction they select the best signal from several base stations and send it to the core network. The processing functions at the RNC, including IPv6 routing, IPv4 routing, header compression and decompression, tunneling, and QoS, can be implemented on NPs as reusable building blocks. Each stage of packet processing in an RNC can be implemented as a context pipeline stage on a set of processors. The receiving and transmitting functions are link-layer specific and can be implemented easily using the features provided by the media and switch fabric interfaces on NPs [12].

3.9 Peer-to-peer network. The distributed Replica Location Indexing (RLI) protocol is a critical component of peer-to-peer communications. NPs can be used to implement certain parts of RLI at the network layer to improve locality.

3.10 Computer clusters. NPs can be used as a switch to connect computers into a cluster and, at the same time, to implement resource management and scheduling algorithms for sharing resources and scheduling jobs among the clustered computers.

3.11 Storage network. We observe that NPs can be used to build an intelligent file resource switch in storage networks. Such a switch acts as a file proxy that aggregates heterogeneous storage infrastructures, enables intelligent data-management policies, increases the flexibility of the file storage network, and adapts to the dynamic storage demands of users and applications.

4. Issues and Challenges

NP designers and NP programmers face a number of challenges. To improve performance and obtain higher throughput, the embedded processors (core processors and packet processors) and their memories are critical. One approach is to add more processors with greater processing power, but additional processors generate additional traffic between processors and shared memories, which can make the data path and memory management a performance bottleneck. Another approach is to include different types of processors on one chip, but coordinating a hierarchy of processors can also create a bottleneck.

High-performance memories are needed for NPs to process packets at wire speed. Increasing memory bandwidth improves performance, but it can also lead to other problems; for example, it may require more hardware space because of the increased parallelism of the internal data path needed to supply the larger memory bandwidth.

The programmability of NPs lies in a set of programmable processors, but programming these embedded processors to take full advantage of the underlying architectures is not straightforward. The steep learning curve is a challenge for NP programmers. One would like to maximize hardware parallelism for performance without increasing software complexity. Thus, keeping programming models simple, keeping code size small, and keeping programming complexity down are challenging issues.
Maximizing reusability of code across multiple product generations is also a challenge. Moreover, since different NP families offer different programming paradigms, abstraction layers, and hardware assists, it is difficult to write code that is portable across more than one NP family.

Despite these issues and challenges, network processors, armed with highly parallel architectures, multiple embedded processors, and programmability, have opened a new direction for data communications. New features can be added to an existing NP chip without the constant hardware upgrades that ASIC-based solutions require. The future of network processors looks promising.

References

[1] IBM PowerNP Network Processor: Hardware, Software, and Applications. IBM.
[2] Intel IXP1200 Network Processor Family Hardware Reference Manual. Intel Corp.
[3] Intel IXP2800 Network Processor Product Brief. Intel Corp.
[4] Intel IXP2850 Network Processor Product Brief. Intel Corp.
[5] C-5 Network Processor Architecture Guide. Motorola.
[6] Advanced PayloadPlus Network Processor Family. Agere.
[7] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a robust software-based router using network processors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 216-229, Alberta, Canada, October 2001.
[8] G. Dittmann and A. Herkersdorf. Network processor load balancing for high-speed links. In Proceedings of the 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, pages 727-735, San Diego, July 2002.
[9] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan. Security: adding protection to the network via the network processor. Intel Technology Journal, Volume 6, Issue 3, August 2002.
[10] B. Liljeqvist and L. Bengtsson. Grid computing distribution using network processors. Technical report, Department of Computer Engineering, Chalmers University of Technology. http://tech.nplogic.com/gridpaper.pdf.
[11] Intel Architecture in the Voice over IP Gateway. Intel Corp.
[12] H. Vipat, P. Mathew, M. Castelino, and A. Tripathy. Network processor building blocks for all-IP wireless networks. Intel Technology Journal, Volume 6, Issue 3, August 2002.

Fang Li is a doctoral candidate in the Department of Computer Science, University of Massachusetts Lowell. Her research interests include Web switches, NP applications, and network security. She received a BE in electromagnetic field and microwave technology from Beijing Broadcasting Institute in 1994 and an MS in computer science from the University of Massachusetts Boston in 2001. Contact her at fli@cs.uml.edu.

Jie Wang is a professor of computer science at the University of Massachusetts Lowell. His research interests include NP applications, combinatorial algorithms, complexity theory, and network security. He received a BS in computational mathematics in 1982 and an ME in computer science in 1984 from Zhongshan University, and a PhD in computer science from Boston University in 1991. Contact him at wang@cs.uml.edu.