Network Processor Architectures, Programming Models, and Applications
Fang Li and Jie Wang
Department of Computer Science
University of Massachusetts Lowell
Abstract: Network processors (NPs) are a new breed of high-performance network devices. They enable
users, through software, to create and run network applications at line rates. This article describes major
NP architectures, their programmability, and various applications.
1. Introduction
Network processors are fully programmable network devices specially designed to store, process, and
forward large volumes of data packets at wire speed. These tasks are traditionally performed by ASIC-based switches and routers. While ASIC devices offer performance, they lack programmability. On the other hand, software-based solutions running on general-purpose processors offer programming flexibility, but they lack performance. Network processors are designed to fill this gap. They offer the advantages of both
hardware-based and software-based solutions. This technological advancement opens a new direction for
data communications. Major chip makers have seized this opportunity and developed their NP lines. What
do NP architectures look like and what are their programming models? This article provides brief answers
to these questions. In particular, we will describe major NP architectures and their programming models,
including IBM’s PowerNP, Intel’s IXP, Motorola’s C-Port, and Agere’s Payload Plus. We will then
describe a number of applications suitable for NPs. Finally, we will discuss a few issues and challenges.
2. NP Architectures and Programmability
A typical NP chip consists of an array of programmable packet processors in a highly parallel
architecture, a programmable control processor, hardware coprocessors or accelerators for common
networking operations, high-speed memory interfaces, and high-speed network interfaces.
2.1 IBM PowerNP. A PowerNP chip consists of an Embedded Processor Complex (EPC), a data flow
(DF) unit, a scheduler, MACs, and coprocessors (see Figure 1).
Figure 1: IBM PowerNP architecture [1]
The EPC component consists of an embedded PowerPC core processor and an array of Picoprocessors
(PPs), where each PP can perform a full set of operations on each packet it receives. Both the PowerPC core and the PPs are programmable. Instructions executed by the PPs are referred to as Picocode.
NP4GS3 is a popular high-end member of the PowerNP family. It supports the OC-48 (2.5 Gbps) line rate. It
integrates the switching engine, the search engine, and security functions on one chip to provide fast
switching. Listed below are major components of NP4GS3:
• An embedded PowerPC that runs at 133 MHz. It has 16 KB of instruction cache (ICache), 16 KB
of data cache (DCache), and up to 128 MB of program space.
• 16 PPs, each clocked at 133 MHz, providing a total of 2128 MIPS of aggregate packet
processing capability. There is a total of 32 KB of instruction memory (IMEM).
• Multiple hardware accelerators for tree-searching, frame-forwarding, filtering, and alteration.
• 40 Fast Ethernet MACs or 4 Gigabit Ethernet MACs, supporting industry-standard PHY components.
• An integrated Packet over SONET (POS) interface supporting one OC-48c line, one OC-48 line, four OC-12 lines, or sixteen OC-3 lines, for attaching industry-standard POS framers.
• A data flow unit that serves as the primary data path for receiving and transmitting network traffic.
• A Scheduler that schedules traffic flows.
NP4GS3 processes POS frames using a combination of PPs, hardware accelerators, and external coprocessors. Each PP offers two hardware threads, so the 16 PPs in the EPC can simultaneously process 32 frames with zero context-switching overhead between threads.
NP2G is a low-end member of the PowerNP family. It provides deep packet processing and substantial
performance headroom for OC-12 lines. NP2G consists of one Embedded PowerPC, 12 PPs, and 60
hardware accelerators for tree searching, frame forwarding, frame filtering, and frame alteration.
The PowerNP Developer’s Toolkit provides a programming model for PowerNP chips. It offers a set of
tools for developing and optimizing PowerNP applications. It contains an assembler/linker (NPAsm), a
graphical debugger (NPScore), a system simulator (NPSim), a test-case generator (NPTest), and a
software performance profiler (NPProfile). These tools are written in C++ and are tightly coupled with
the Tcl/Tk scripting language. The toolkit provides both high-level (a C API and a C compiler) and low-level APIs for
software developers.
2.2 Intel IXP. IXP network processors consist of three major components: a StrongARM or an XScale
core processor, an array of multi-threaded packet processors called Microengines (MEs), and an IXA
framework. IXA stands for Internet Exchange Architecture. The core processor and the MEs are fully
programmable.
IXP1200 is the first generation of the IXP family (see Figure 2). It offers OC-3 and OC-12 line rates.
Listed below are the major components of an IXP1200 chip [2]:
• A StrongARM core. It runs at 166/200/233 MHz and can be programmed to run control-plane
applications. It has 16 KB instruction cache and 8 KB main data cache.
• Six 32-bit RISC MEs. Each ME has four hardware threads with zero overhead context switching,
and 8 KB programmable instruction control storage that can hold up to 2048 instructions.
• An FBI unit and an IX Bus. The FBI unit serves fast MAC-layer devices on the IX Bus, providing an interface for receiving and transmitting packets. The FBI unit has 4 KB of scratchpad memory.
• An SRAM unit and an SDRAM unit. The SRAM unit provides an interface to 8 MB of off-chip SRAM that can be used to store lookup tables. The SDRAM unit provides an interface to 256 MB of lower-cost off-chip SDRAM for storing bulk packet data, forwarding information, and transmit queues (a usage sketch follows this list).
• A PCI unit. The PCI unit provides an industry-standard 32-bit PCI bus for PCI peripheral devices
such as host processors and MAC devices.
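The division of labor among these memories is easier to see in code. Below is a minimal host-C sketch, not actual Microengine C; the arrays, names, and sizes are our own illustrative assumptions for how an ME receive thread might use the three regions: the SRAM stand-in holds a lookup table, the SDRAM stand-in holds packet payloads, and the scratchpad stand-in holds small shared state such as a ring of transmit handles.

/*
 * Illustrative host-C sketch of the IXP1200 memory partitioning described
 * above.  This is NOT Microengine C; the arrays below merely stand in for
 * the SRAM, SDRAM, and FBI scratchpad regions, and every name is assumed.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SRAM_ROUTE_ENTRIES 1024     /* lookup tables live in SRAM        */
#define SDRAM_BUF_SIZE     2048     /* packet payloads live in SDRAM     */
#define SCRATCH_RING_SIZE  16       /* small shared state in scratchpad  */

static uint32_t sram_route_table[SRAM_ROUTE_ENTRIES];  /* next hop per hashed prefix */
static uint8_t  sdram_packet_buf[SDRAM_BUF_SIZE];      /* one reassembled packet     */
static uint32_t scratch_ring[SCRATCH_RING_SIZE];       /* handles passed to TX       */
static unsigned scratch_head;

/* Classify by hashing the destination address into the SRAM table. */
static uint32_t lookup_next_hop(uint32_t dst_addr)
{
    return sram_route_table[dst_addr % SRAM_ROUTE_ENTRIES];
}

/* One iteration of a receive thread: store payload, classify, enqueue. */
static void receive_one_packet(const uint8_t *mpacket, size_t len, uint32_t dst)
{
    memcpy(sdram_packet_buf, mpacket, len);                   /* payload -> SDRAM  */
    uint32_t port = lookup_next_hop(dst);                     /* classify via SRAM */
    scratch_ring[scratch_head++ % SCRATCH_RING_SIZE] = port;  /* hand off to TX    */
}

int main(void)
{
    uint8_t frame[64] = {0};
    sram_route_table[0x0a000001u % SRAM_ROUTE_ENTRIES] = 3;   /* pretend route */
    receive_one_packet(frame, sizeof frame, 0x0a000001u);
    printf("packet queued for port %u\n", (unsigned)scratch_ring[0]);
    return 0;
}

On the real chip this loop would be replicated across the six MEs and their four threads each, with the zero-overhead context switch hiding the SRAM and SDRAM access latencies.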
Figure 2: Intel IXP1200 Block Diagram [2]
IXP2400, IXP2800, and IXP2850 are the second generation of IXP network processors [3, 4].
IXP2400 is designed for OC-48 (2.5 Gbps) applications. It has one Xscale core and eight MEs, where
each ME has eight hardware threads. The Xscale core has 32 KB instruction memory, 32 KB data
memory (DMEM), and a 2 KB mini data cache. Each ME has a 4K × 40-bit instruction control store that can hold up to 4096 instructions, and 640 32-bit words of local memory. IXP2400 supports 64 MB of SRAM and up to 2 GB of DDR SDRAM. It provides two unidirectional 32-bit media interfaces, which can be configured as SPI-3, UTOPIA Level 1/2/3, or CSIX-L1. To improve processing performance, IXP2400 also supports hardware multiplication (which IXP1200 does not). IXP2400 also has built-in support for pseudo-random number generation and timestamping.
IXP2800 is designed for OC-192 (10 Gbps) applications. The IXP designers introduced the Hyper Task Chaining technology in IXP2800 to support deep packet inspection via software pipelining at wire speed. IXP2800 has one Xscale core and 16 MEs, supporting 23.1 giga-operations per second.
IXP2850 is IXP2800 plus on-chip cryptography units. In particular, IXP2850 integrates two cryptography
blocks into IXP2800 to provide hardware acceleration of standard encryption algorithms.
IXP comes with a comprehensive software development environment, including a Software Development Kit (SDK) and a hardware development platform for rapid product development and prototyping. The
Workbench component in IXA SDK provides a friendly GUI simulation environment for code
development and debugging. IXA SDK also provides programming frameworks and the IXA Active
Computing Engine (ACE) model, which provide complete code for typical network applications and
packet processing functionalities. The programming languages used by IXP are Microengine C, a C-like
language, and Microcode, an assembly language.
2.3 Motorola C-Port. A C-Port chip consists of an Executive Processor (EP), an array of packet
processors called Channel Processors (CPs), a Fabric Processor, and a number of coprocessors. The EP
and the CPs are fully programmable, and each processor can be individually configured, enhancing the flexibility of a C-Port chip.
Figure 3: Motorola C-5 Block Diagram [5]
C-5 is the first member of the C-Port family (see Figure 3). It was designed to support complete
programmability from Layer 2 through Layer 7 in the OSI model. It provides up to 5 Gbps bandwidth and
more than 3000 MIPS of computing power. It supports a variety of industry-standard serial and parallel protocols and data rates from DS1 (1.544 Mbps) to Gigabit Ethernet (1 Gbps), including 10/100 Mb Ethernet, 1 Gb Ethernet, OC-3c, OC-12, OC-12c, and Fibre Channel. Listed below are the major components of C-5:
• An EP that runs at 166/200/233 MHz and performs conventional supervisory tasks including
initialization, program loading and control of CPs, statistics gathering, and centralized exception
handling. It also executes routing algorithms and updates routing tables.
• 16 CPs responsible for receiving, processing, and transmitting packets and cells at wire speed.
Each CP consists of a 32-bit C/C++ programmable RISC core and two microcode programmable
serial data processors (SDPs). The RISC core is used to handle more complex tasks and is
responsible for classification and making scheduling decisions. It is also responsible for overall
management of the entire CP. One SDP processes the received data stream, and the other processes the transmitted data stream. Each CP has 12 KB of data memory. In addition,
a cluster of four adjacent CPs has 24 KB of instruction memory, giving each CP 6 KB of
instruction memory of its own. The CPs can be arranged to operate independently of each other.
They can also co-operate in clusters. Each CP has programmable MAC interfaces and PHY
interfaces.
• A Fabric processor (FP), which is a high-speed network interface port with advanced
functionality. It supports bidirectional transfer of packets, frames or cells, and can be configured
for different fabric protocols. The FP is compatible with UTOPIA Level 1/2/3, IBM PRIZMA, and PowerX (CSIX-L0).
• Three co-processors, which operate as shared resources for the CPs and perform specific tasks,
such as table lookup, queue management, buffer management, and fabric interface management.
The Buffer Management Unit (BMU) coprocessor can be programmed to manage centralized payload storage during packet processing. It has 32 MB of memory.
C-5e is the second generation of the C-Port family. It has 18 programmable RISC core processors (16 CPs, one Executive Processor, and one Fabric Processor) and 32 programmable Serial Data Processors. Each cluster of 4 adjacent CPs shares access to a 32 KB instruction memory, and each CP also has 12 KB of local data memory. C-5e supports 5 Gbps bandwidth and more than
4500 MIPS of computing power.
C-Port processors are programmed in C/C++. The development environment provides a set of design, development, and debugging tools to enhance productivity, including the C-Ware Software Toolset, which provides application libraries, APIs, a simulator, and a GNU-based compiler and debugger, and the C-Ware Development System for different service modules.
2.4 Agere PayloadPlus. The design of PayloadPlus differs from those discussed above. PayloadPlus (see
Figure 4) provides multi-service solutions (IP, ATM, Ethernet, MPLS, Frame Relay, etc.) at the speed of
GbE, OC-48c, and OC-192c. It supports Layer 2 through Layer 7 protocol processing, buffer management, traffic shaping, data modification, and per-flow policing and statistics. PayloadPlus employs a pipelined processor architecture and uses the "Pattern Matching Optimization" technology.
Figure 4: Agere Payload Plus block diagram [6]
PayloadPlus consists of the following major components:
• A Fast Pattern Processor (FPP). FPP is a programmable, pipelined, multi-threaded processor that can analyze and classify up to 64 packets at wire speed. It classifies packets and re-assembles packets. The outcome of classification is forwarded to the Routing Switch Processor for further processing.
• A Routing Switch Processor (RSP). RSP is a programmable processor that handles queuing, traffic management, and packet modification at wire speed. It contains three VLIW processors called engines: one for traffic management, one for traffic shaping, and one for outgoing packet modification. These three engines can run different programs.
• An Agere System Interface (ASI). ASI is a configurable, non-programmable engine that manages the interfaces between the FPP, the RSP, and the host computer. It handles slow-path packets and communicates with the host through a PCI bus.
• One µP, which handles initial setup and exceptions.
PayloadPlus protocols are programmed using the Functional Programming Language (FPL). Compared to C/C++, FPL reduces the number of instructions needed to carry out a given task. In addition to FPL, Agere also offers the Agere System Language (ASL), a C-like scripting language, to program procedural tasks executed by the RSP and the ASI components.
2.5 Architecture Summary. Table 1 summarizes the PowerNP, IXP, and C-Port network processors discussed above in the following categories: line rate, physical interface, chip memory, and programmability.
IBM PowerNP NP4GS3
• Line rate: 1 Gbps; OC-12, OC-12c; OC-48, OC-48c
• Physical interface: 40 Fast Ethernet/OC-48 MACs
• Chip memory: 16 KB ICache and 16 KB DCache for the PowerPC core; 2 KB IMEM for each PP; 128 KB SRAM for input packet buffering
• Programmability: PowerPC core, programmed in C; 16 PPs (each with 2 hardware threads), programmed in Picocode; IBM PowerNP Developer's Toolkit

Intel IXP1200
• Line rate: OC-3; OC-12
• Physical interface: 10/100/1000 Ethernet MACs, ATM, T1/E1 SONET, xDSL; up to 56 physical ports
• Chip memory: 16 KB ICache and 8 KB DCache for the StrongARM core; 2 KB IMEM for each ME; 4 KB on-chip scratchpad for the FBI unit; 8 MB off-chip SRAM; 256 MB off-chip SDRAM
• Programmability: StrongARM core, programmed in C/C++; 6 MEs (each with 4 hardware threads), programmed in Microengine C or Microcode; Intel IXA SDK

Intel IXP2400
• Line rate: OC-3; OC-12; OC-48
• Physical interface: 2 unidirectional 32-bit media interfaces, configurable as SPI-3, UTOPIA 1/2/3, or CSIX-L1
• Chip memory: 32 KB IMEM, 32 KB DMEM, and 2 KB mini DMEM for the Xscale core; 4K × 40-bit IMEM and 640 32-bit words of local memory for each ME; 64 MB off-chip SRAM; 2 GB off-chip DDR SDRAM
• Programmability: Xscale core, programmed in C/C++; 8 MEs (each with 8 hardware threads), programmed in Microengine C or Microcode; Intel IXA SDK

Motorola C-Port C-5
• Line rate: 10/100 Mbps; 1 Gbps; OC-3c; OC-12, OC-12c; OC-48; Fibre Channel
• Physical interface: 10/100 Ethernet, 1 GE, OC-3c, OC-12, OC-12c, OC-48, Fibre Channel; up to 16 physical interfaces to each CP
• Chip memory: 16 MB for table lookup; 128 MB for the Buffer Management Unit; 12 KB data memory for each CP; 24 KB instruction store for each cluster of 4 adjacent CPs
• Programmability: Executive Processor (XP), programmed in C/C++; 16 CPs, with the RISC core in each CP programmed in C/C++ and the 2 SDPs in each CP programmed in microcode; C-Ware Software Toolset

Table 1: Comparisons of network processors
3. Major NP Applications
3.1 Routing and switching. Network processors are designed to route and switch large volumes of packets at wire speed. Thus, routers and switches are direct applications of network processors. The programmability of NPs makes it possible and convenient to add new network services and new protocols to a router without jeopardizing the robustness of its service. For example, Spalink et al. [7] recently implemented an IXP1200-based router, where they used MEs to implement packet
classification. They used MEs or StrongARM to implement packet forwarding, and used
StrongARM to implement scheduling. The programmability of NPs also makes it possible to design
one’s own network protocols and implement them on NPs.
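To make the forwarding step concrete, the sketch below shows the kind of table lookup an ME-based forwarding path performs. This is not the code from [7]; the routes are invented, and a linear longest-prefix match is used only for clarity, whereas a production NP implementation would use a trie or the hardware search units.

/* Minimal longest-prefix-match lookup of the kind an ME forwarding path
 * performs.  A linear scan is used only for clarity; all values are
 * illustrative. */
#include <stdint.h>
#include <stdio.h>

struct route { uint32_t prefix; uint8_t len; uint8_t out_port; };

static const struct route table[] = {
    { 0x0a000000u,  8, 1 },   /* 10.0.0.0/8     -> port 1 */
    { 0x0a010000u, 16, 2 },   /* 10.1.0.0/16    -> port 2 */
    { 0xc0a80000u, 16, 3 },   /* 192.168.0.0/16 -> port 3 */
};

static int lpm(uint32_t dst)
{
    int best = -1, best_len = -1;
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++) {
        uint32_t mask = table[i].len ? ~0u << (32 - table[i].len) : 0;
        if ((dst & mask) == table[i].prefix && (int)table[i].len > best_len) {
            best = table[i].out_port;
            best_len = table[i].len;
        }
    }
    return best;   /* -1 means no route: hand the packet to the core CPU */
}

int main(void)
{
    printf("10.1.2.3 -> port %d\n", lpm(0x0a010203u));   /* matches the /16 */
    printf("10.9.9.9 -> port %d\n", lpm(0x0a090909u));   /* matches the /8  */
    return 0;
}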
3.2 QoS and traffic management. Ensuring that consumers will get the promised services is a
challenging issue in QoS. Good QoS relies on good traffic management, and good traffic
management must decide, with good strategies, which packets to drop when the network is
congested. To improve performance, certain traffic management functions such as packet
classification have been implemented at the network layer using NPs.
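As a deliberately simplified illustration of such a drop strategy, and not one prescribed by any particular NP or by the text above, the sketch below drops lower-priority traffic classes first once the shared queue crosses per-class thresholds; the thresholds themselves are assumptions.

/* Simplified per-class threshold dropping: when the shared queue fills,
 * lower-priority classes are dropped first.  Thresholds are illustrative. */
#include <stdbool.h>
#include <stdio.h>

/* Occupancy above which a class's packets are dropped (class 0 = highest). */
static const unsigned drop_threshold[3] = { 256, 192, 128 };

static bool should_drop(unsigned queue_len, unsigned traffic_class)
{
    return queue_len >= drop_threshold[traffic_class];
}

int main(void)
{
    unsigned queue_len = 200;   /* pretend the queue is congested */
    for (unsigned cls = 0; cls < 3; cls++)
        printf("class %u: %s\n", cls,
               should_drop(queue_len, cls) ? "drop" : "enqueue");
    return 0;
}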
3.3 Load balancing. NPs can be used to help balance job loads in a distributed system. For example,
Dittmann and Herkersdorf [8] devised an algorithm to distribute traffic from a high-speed link
across multiple lower-speed NPs. The algorithm avoids the need for synchronization between NPs
and helps obtain a better load distribution. In particular, when a packet arrives, the algorithm
inspects the packet, compresses the headers by a hash function to an index of fixed length (the index
serves as an address of the lookup memory), and decides to which NP the packet will be sent based
on the information stored in the lookup memory. The algorithm then reunites the packet streams
from the NPs. Implementation is straightforward, as the NPs do not deliver more traffic than the switch link can carry.
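The following hedged C sketch captures the hash-and-lookup step just described. The hash function, table size, and round-robin initialization are our own assumptions, not the scheme actually used in [8]: the packet's header fields are compressed to a fixed-length index, and the lookup memory at that index names the NP that receives the packet, so no synchronization between NPs is needed.

/* Sketch of the hash-and-lookup distribution described above.  The hash
 * and table size are illustrative stand-ins for the scheme in [8]. */
#include <stdint.h>
#include <stdio.h>

#define LOOKUP_BITS  12
#define LOOKUP_SIZE  (1u << LOOKUP_BITS)
#define NUM_NPS      4

static uint8_t lookup_memory[LOOKUP_SIZE];   /* index -> assigned NP */

/* Compress the flow identifier (src, dst, ports) to a fixed-length index. */
static uint32_t flow_index(uint32_t src, uint32_t dst, uint16_t sport, uint16_t dport)
{
    uint32_t h = src * 2654435761u ^ dst * 40503u ^ (((uint32_t)sport << 16) | dport);
    return h & (LOOKUP_SIZE - 1);
}

int main(void)
{
    /* Populate the lookup memory round-robin; [8] balances it adaptively. */
    for (uint32_t i = 0; i < LOOKUP_SIZE; i++)
        lookup_memory[i] = (uint8_t)(i % NUM_NPS);

    uint32_t idx = flow_index(0x0a000001u, 0xc0a80101u, 12345, 80);
    printf("flow hashed to index %u, dispatched to NP %u\n",
           (unsigned)idx, (unsigned)lookup_memory[idx]);
    return 0;
}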
3.4 Web switch. Web switches are network devices that connect multiple Web servers. Web switches
must support Layer 3 and Layer 4 load balancing, URL-based load balancing, server health
monitoring, peak load or traffic surge management, and policy-based service level management.
Balancing loads between thousands or tens of thousands of connections per second over dozens of servers requires a vast amount of processing capacity and memory. Session processing is CPU-intensive, particularly when traffic to be balanced arrives simultaneously from many high-speed
ports. We observe that NP-based Web switches would offer a reasonable solution. For example, we
can use the embedded core processor to monitor server and service. We can also use the embedded
core processor to track loads and bring servers in and out of service. We can use the programmable
packet processors to handle session state management, server assignment, session binding and
unbinding, and packet data examination. Processing tasks for each session are distributed to
different programmable packet processors for parallel operations, which can increase session
performance and scale the Web switch’s load balancing capacity with its port density.
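To make the session-handling step concrete, the hedged sketch below shows the binding logic a packet processor could run. The data structures, sizes, and least-loaded policy are our own illustrative assumptions, not any vendor's API: a new connection's 5-tuple is hashed into a session table and bound to the least-loaded server, and subsequent packets of the same session are forwarded to the bound server.

/* Illustrative session-binding logic for an NP-based Web switch.  The data
 * structures and sizes are assumptions used only to show the idea. */
#include <stdint.h>
#include <stdio.h>

#define SESSION_SLOTS 4096
#define NUM_SERVERS   8

struct session { uint64_t key; int server; };        /* key == 0 means free */

static struct session sessions[SESSION_SLOTS];
static unsigned server_load[NUM_SERVERS];

static int least_loaded_server(void)
{
    int best = 0;
    for (int s = 1; s < NUM_SERVERS; s++)
        if (server_load[s] < server_load[best])
            best = s;
    return best;
}

/* Return the server for this connection, binding it on first sight. */
static int bind_session(uint32_t src, uint16_t sport, uint32_t dst, uint16_t dport)
{
    uint64_t key = ((uint64_t)src << 32) ^ dst ^ ((uint64_t)sport << 16) ^ dport;
    unsigned slot = (unsigned)(key % SESSION_SLOTS);
    if (sessions[slot].key != key) {                  /* new session        */
        sessions[slot].key = key;
        sessions[slot].server = least_loaded_server();
        server_load[sessions[slot].server]++;
    }
    return sessions[slot].server;                     /* existing binding   */
}

int main(void)
{
    int s1 = bind_session(0x0a000001u, 40000, 0xc0a80001u, 80);
    int s2 = bind_session(0x0a000001u, 40000, 0xc0a80001u, 80);  /* same flow */
    printf("bound to server %d, repeat packet to server %d\n", s1, s2);
    return 0;
}

Because each session's state lives in one table slot, the work can be spread across packet processors by session, which is the parallelism the paragraph above relies on.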
3.5 Network security. Security coprocessors or inline security processors are standard hardware solutions for adding security processing power to networks. It is, however, difficult for
coprocessors to scale up to higher data rates. Inline security processors, on the other hand, can scale
up to higher data rates, but they must perform many of the same functions as NPs do to achieve
these rates. To solve this problem, the IXP designers integrate security functions into IXP2850 to
provide network security at the rate of 10 Gbps while using the same NP designs [9]. IXP2850
therefore becomes a natural choice to implement IPsec. For example, we can use the on-chip
security units to execute standard cryptographic algorithms and use the MEs to process security
protocols such as ESP and AH for IPsec traffic. We can use the DRAM memory to handle security
associations (SAs) with sufficient throughput. We can use the hashing unit for lookup to find the
required SA information for a given packet and use the SRAM memory to store hash tables
necessary to carry out the IPsec protocol. We can use the Xscale core to handle exceptions and carry
out the Internet Key Exchange (IKE) protocol.
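A hedged sketch of the SA-lookup step just described follows. The hash table stands in for the SRAM-resident lookup structure and the record array stands in for the DRAM-resident SAs; the names, fields, and hash are our own assumptions, not Intel's IXP2850 interfaces.

/* Illustrative IPsec SA lookup: a hash table (SRAM stand-in) maps the
 * packet's SPI and destination to a security-association record held in a
 * larger array (DRAM stand-in).  All names and sizes are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SA_HASH_SLOTS 1024
#define MAX_SAS        256

struct sa_record {
    uint32_t spi;
    uint32_t dst;
    uint8_t  key[16];       /* symmetric key handed to the crypto unit */
};

static int32_t          sa_hash[SA_HASH_SLOTS];   /* slot -> SA index, -1 if empty */
static struct sa_record sa_store[MAX_SAS];
static unsigned         sa_count;

static unsigned sa_slot(uint32_t spi, uint32_t dst)
{
    return (spi * 2654435761u ^ dst) % SA_HASH_SLOTS;
}

static void sa_install(uint32_t spi, uint32_t dst, const uint8_t key[16])
{
    sa_store[sa_count].spi = spi;
    sa_store[sa_count].dst = dst;
    memcpy(sa_store[sa_count].key, key, 16);
    sa_hash[sa_slot(spi, dst)] = (int32_t)sa_count++;
}

static const struct sa_record *sa_lookup(uint32_t spi, uint32_t dst)
{
    int32_t idx = sa_hash[sa_slot(spi, dst)];
    if (idx < 0 || sa_store[idx].spi != spi || sa_store[idx].dst != dst)
        return NULL;                      /* miss: punt to the core CPU */
    return &sa_store[idx];
}

int main(void)
{
    static const uint8_t key[16] = { 0x01 };
    memset(sa_hash, -1, sizeof sa_hash);  /* mark every slot empty */
    sa_install(0x1001, 0xc0a80001u, key);
    printf("SA %sfound\n", sa_lookup(0x1001, 0xc0a80001u) ? "" : "not ");
    return 0;
}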
3.6 Grid computing. Grid computing distributes computing tasks in a distributed environment and coordinates resource sharing. Services provided in grid computing include resource allocation,
resource management, information discovery, and secure infrastructure, where NPs could play an
important role. For example, Liljeqvist and Bengtsson [10] designed a new grid computing
architecture using programmable routers implemented with NPs to distribute computing tasks in a grid at wire speed. A new grid protocol is used to efficiently utilize resources and load-balance all computers on the grid in a truly distributed fashion.
3.7 VoIP gateway. Digital communication technology can be used to carry voice traffic.
VoIP (Voice over IP) gateways are used to convert media formats and translate IP protocols for
setting up and releasing calls, where NPs can also play an important role. For example, one can
build a VoIP gateway using IXP1200 [11] by optimally partitioning the control and signaling layers
and the media protocol layers between the host CPU and the packet processors.
3.8 Wireless communication. The wireless infrastructure consists of central switching centers, base
stations, and Radio Network Controllers (RNCs). Central switching centers connect base stations.
RNCs manage the wireless radio interfaces to the base stations: they control handoff, send data from the core network to the base stations in the forward direction, and select the best signal from several base stations and send it to the core network in the reverse direction. The processing functions at the RNC, including IPv6 routing, IPv4 routing, header compression and
decompression, tunneling, and QoS, can be implemented on NPs as reusable building blocks. Each
stage of packet processing in an RNC can be implemented as a context pipeline stage on a set of
processors. The receiving and transmitting functions are link-layer specific, and can be easily
implemented using features provided by the media and switch fabric interface on NPs [12].
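The context-pipeline structure can be sketched as a fixed sequence of stage functions, each corresponding to one of the reusable building blocks named above. The stages, packet fields, and constant outcomes below are illustrative assumptions only; on an NP each stage would run on its own set of MEs or threads rather than in a simple loop.

/* Sketch of an RNC packet path built from pipelined, reusable stages.
 * The struct fields and stage bodies are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct rnc_packet {
    bool header_compressed;
    bool tunneled;
    int  qos_class;
    int  out_port;
};

typedef void (*stage_fn)(struct rnc_packet *);

static void decompress_header(struct rnc_packet *p) { p->header_compressed = false; }
static void detunnel(struct rnc_packet *p)          { p->tunneled = false; }
static void classify_qos(struct rnc_packet *p)      { p->qos_class = 1; }
static void route(struct rnc_packet *p)             { p->out_port = 2; }

/* On an NP, each stage would occupy its own ME (or set of threads); here
 * the stages simply run in order over the same packet context. */
static const stage_fn pipeline[] = { decompress_header, detunnel, classify_qos, route };

int main(void)
{
    struct rnc_packet pkt = { .header_compressed = true, .tunneled = true };
    for (unsigned i = 0; i < sizeof pipeline / sizeof pipeline[0]; i++)
        pipeline[i](&pkt);
    printf("qos class %d, out port %d\n", pkt.qos_class, pkt.out_port);
    return 0;
}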
3.9 Peer-to-peer network. The distributed Replica Location Indexing (RLI) protocol is a critical component in peer-to-peer communications. NPs can be used to implement certain parts of RLI at
the network layer to improve locality.
3.10 Computer clusters. NPs can be used as a switch to connect computers to form a cluster and, at the
same time, implement resource management and scheduling algorithms for sharing resources and
scheduling jobs between clustered computers.
3.11 Storage network. We observe that NPs can be used to build an intelligent file resource switch in
storage networks. Such a switch acts like a file proxy used to aggregate heterogeneous storage
infrastructures, enable intelligent policies in data management, increase the flexibility of the file storage network, and adapt to the dynamic storage demands of users and applications.
4. Issues and Challenges
NP designers and NP programmers are facing a number of challenges. To improve performance and
obtain higher throughput, the embedded processors (core processors and packet processors) and their
memories are critical. One approach is to add more processors with greater processing power. But this
approach may incur more traffic, since additional processors would generate additional traffic between processors and shared memories, which could make the data path and memory management a performance
bottleneck. Another approach is to include different types of processors on one chip. But coordinating a
hierarchy of processors could also result in a bottleneck. High performance memories are needed for NPs
to process packets at wire speed. Increasing memory bandwidth would improve performance. But it could
also lead to other problems. For example, it may require a larger hardware area due to the increased parallelism of the internal data path needed for larger memory bandwidth.
The programmability of NPs lies in their set of programmable processors. But programming these embedded
processors to take full advantage of the underlying architectures is not straightforward. The steep learning
curve is a challenge for NP programmers. One would like to maximize hardware parallelism for
performance without increasing software complexity. Thus, keeping programming models simple,
keeping the code size small, and keeping programming complexity down are challenging issues.
Maximizing reusability of code across multiple product generations is also a challenge. Moreover, since
different NP families offer different programming paradigms, abstraction layers, and hardware assists, it is difficult to write code that is portable across more than one NP family.
Despite these issues and challenges, network processors, armed with highly parallel architectures with
multiple embedded processors and programmability, have opened a new direction for data
communications. New features can be added to an existing NP chip without the constant hardware upgrades required by ASIC-based solutions. The future of network processors looks promising.
References
[1] IBM PowerNP Network Processor: Hardware, Software, and Applications. IBM.
[2] Intel IXP1200 Network Processor Family Hardware Reference Manual. Intel Corp.
[3] Intel IXP2800 Network Processor Product Brief. Intel Corp.
[4] Intel IXP2850 Network Processor Product Brief. Intel Corp.
[5] C-5 Network Processor Architecture Guide. Motorola.
[6] Advanced PayloadPlus Network Processor Family. Agere.
[7] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a robust software-based router using
network processors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages
216-229, Alberta, October 2001.
[8] G. Dittmann and A. Herkersdorf. Network processor load balancing for high-speed links. In
Proceedings of the 2002 International Symposium on Performance Evaluation of Computer and
Telecommunication Systems, pages 727-735, San Diego, July 2002.
[9] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan. Security: adding protection to the network via the
network processor. Intel Technology Journal, Volume 6, Issue 3, August 2002.
[10] B. Liljeqvist and L. Bengtsson. Grid computing distribution using network processors. Tech Report,
Dept. of Computer Engineering, Chalmers University of Technology,
http://tech.nplogic.com/gridpaper.pdf.
[11] Intel Architecture in the Voice over IP Gateway. Intel Corp.
[12] H. Vipat, P. Mathew, M. Castelino, and A. Tripathy. Network processor building blocks for all-IP
wireless networks. Intel Technology Journal, Volume 6, Issue 3, August 2002.
Fang Li is a doctoral candidate in the Department of Computer Science, University of Massachusetts
Lowell. Her research interests include Web switches, NP applications, and network security. She received
a BE in electromagnetic field and microwave technology from Beijing Broadcasting Institute in 1994 and
an MS in computer science from the University of Massachusetts Boston in 2001. Contact her at
fli@cs.uml.edu.
Jie Wang is a professor of computer science at the University of Massachusetts Lowell. His research interests
include NP applications, combinatorial algorithms, complexity theory, and network security. He received
a BS in computational mathematics in 1982 and an ME in computer science in 1984 from Zhongshan
University. He received a PhD in computer science from Boston University in 1991. Contact him at
wang@cs.uml.edu.