Building Scalable Network Processing Platforms with Multicore Processors

With the huge demand being placed on today’s networks, thanks in part to the surge in smartphones and tablets, increased processing is needed in different roles in the network. To address this while keeping power, heat and cost under control, multicore processors are finding enthusiastic acceptance among developers.

by Paul Stevens, Advantech

As excited new users charge their latest mobile device for the first time, little thought is given to the challenges these new devices bring to the infrastructure that must support them. Whether smartphone, iPad or Android tablet, they all add to the rapid growth in network traffic, as new devices and applications, especially those in the mobile space, place greater demands on the infrastructure. Beyond the overall traffic volume, which Cisco’s Visual Networking Index (VNI) predicts will approach the zettabyte-per-year threshold (1 zettabyte = 1 billion terabytes) by 2015 (Figure 1), an increased burden falls on all infrastructure support applications, such as security and traffic management platforms. At all levels, the evolving infrastructure needs platforms that can handle this load while keeping both physical and power footprints in check. Through all this, carriers must still watch the bottom line, so cost-optimized and efficient solutions are a prerequisite. A range of multicore processors, implemented across a variety of system platform architectures, is being used both individually and in combination to meet these challenges.

It does not seem that long ago that we experienced the telecom crash, when the huge amount of excess capacity and dark fiber was crying out for the next “killer app.” Well, the tables have turned, and one could say that a multitude of applications have contributed to the current growth challenges of the infrastructure that supports them.
There are now estimated to be more than 500,000 apps available for the iPhone, iPad and Android platforms alone. Much of the demand is based around delivering new rich media and video.

According to the Cisco VNI, an ongoing initiative to track and forecast the impact of visual networking applications, a number of both exciting and frightening trends are fueling network growth and evolution. Here are just four numbers to consider: 32, 26, 40 and 61. The first, 32 percent, is the compound annual growth rate (CAGR) at which IP traffic will grow over the next five years. The second, 26, is the factor by which mobile data traffic will increase between 2010 and 2015. The numbers 40 and 61 are the percentages of consumer Internet traffic that is video content, today and in 2015 respectively. No longer can there be complaints about unused capacity; the challenge is how to corral the data traffic in the most efficient ways possible.

Network technologies and applications such as deep packet inspection (DPI), traffic-based filtering, encryption, and packet and media processing all need to take on extra load. One fundamental attribute common to all new network application platforms is the need for “wire speed” processing, since they must interact with the traffic flows without impeding them in any way. The volumes are huge, and the throughput and speed requirements in turn demand a serious amount of compute capability. Simply throwing more processors, systems and racks full of equipment at the task just won’t do, as one begins to approach the limits of one’s resources, whether those are power, real estate or cash. There are many similarities with the challenges that faced processor developers as they fast approached the limits of physics with the traditional performance-enhancing technique of increasing clock frequencies.
Higher clock frequencies dramatically increased power consumption and heat output, making it more challenging to build system platforms with the necessary densities. The resulting solution was the development of multicore processing technologies. Multicore architectures combine two or more identical CPU cores (now as many as 32 or more) in one processor, typically sharing a common system memory. Each core can operate independently on different processing elements and dataflows, and can also easily interact with other cores and processors.

The processing, bandwidth, power, scalability and cost requirements for platforms in next-generation mobile (4G/LTE) telecom infrastructure and enterprise networking are well matched with the capabilities of multicore technology. Many of the applications, such as those related to DPI (security, filtering, content management), can easily be split into logical chunks, and because the heavy-lifting processes are highly repetitive, they are well suited to scaling across many cores.

Not all network applications have the same requirements. This has led to a “division of labor” approach to network equipment architecture:

Control Plane and Device Management functions (such as call setup, connection control, routing, signaling, device operation, administration and maintenance) were performed on General Purpose Processors (GPP).

Data Plane functions (such as packet processing, encryption/decryption, compression/decompression, traffic-based filtering, video transcoding and deep packet inspection) were performed on Network Processing Units (NPU).

Digital Signal Processing functions (such as audio and speech processing, digital image and video processing, sensor array and radar/sonar signal processing) were performed on Digital Signal Processors (DSP).

Early-generation NPU and DSP products used ASICs to provide the required performance and functionality, sacrificing the flexibility provided by the software programmability of GPP-based solutions.
The current generations of NPU and DSP products use multicore technology to gain the benefits of programmability and scalability, typically using a less complex RISC processor for each core. Backed by a full SDK, the multicore NPU and DSP products are now as flexible (i.e., programmable) as general purpose processors.

The boundaries between these different types of solutions are blurring. GPP multicore processors have added hardware acceleration for certain packet processing and/or security functions, and some NPU and DSP processors have added general purpose CPU cores to handle control plane and device management functions. As always, products are adapted to meet market needs, and these “hybrid” architectures are a good example of that.

There are numerous examples of multicore GPP, NPU and DSP processors that can fit the bill for telecom and enterprise networking applications. Although there is some crossover, each is suited to a certain set of applications.

Intel Xeon Processors: The top end of the embedded Intel Xeon Processor 5000 Sequence Family, the E5645, is a 32 nm processor designed for high-performance, data-demanding applications. Its six 2.4 GHz cores support 12 threads in total (two per core), making it a great choice for use in networking platforms (Figure 2). The 5000 family includes specific low-power options that provide greater performance per watt, making them well suited to the power envelope constraints of embedded standard form factors. Intel targets this family of processors at a wide range of applications including storage area networks, network attached storage, routers, IP-PBX, converged/unified communications platforms, sophisticated content firewalls, unified threat management systems, medical imaging equipment, military signal and image processing, and telecommunications (wireless and wireline) servers.
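To make the earlier point about splitting packet workloads across cores more concrete, the following is a minimal Python sketch of one common way to do it: hashing each flow's addresses and ports so that every packet of a flow lands on the same worker and per-flow state never has to be shared between cores. It is an illustration only, not a description of how any processor named in this article distributes traffic. The function names, the CRC32 hash and the choice of 12 workers (picked to mirror the 12 threads cited above) are all assumptions made for the example.

```python
import zlib

# Assumption for illustration: one worker per hardware thread.
NUM_WORKERS = 12


def flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Build a direction-independent key for a flow.

    Sorting the two endpoints means request and reply packets of the
    same connection produce the same key.
    """
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{proto}".encode()


def worker_for(packet, num_workers=NUM_WORKERS):
    """Map a packet's 5-tuple to a worker index in [0, num_workers)."""
    return zlib.crc32(flow_key(*packet)) % num_workers


if __name__ == "__main__":
    # Hypothetical 5-tuples: (src_ip, dst_ip, src_port, dst_port, proto)
    request = ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
    reply = ("10.0.0.2", "10.0.0.1", 80, 40000, "tcp")
    print("request ->", worker_for(request))
    print("reply   ->", worker_for(reply))
```

Both directions of a connection map to the same worker, which is what lets a stateful function such as DPI run on each core independently. The trade-off of a static hash like this is that a few very heavy flows can overload one worker while others sit idle.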
Cavium Networks’ Octeon II Internet Application Processor Family: A flexible multicore design using the MIPS64 architecture, the Octeon family can support up to 32 cores and can be configured with up to 75 application acceleration engines. A state-of-the-art network processor, it is designed for the needs of next-generation networking applications. With specialized functions for security and packet processing acceleration built directly into the hardware (with supporting software) at very low power consumption, these processors are designed to maximize throughput for a multitude of protocols all the way to layer 7. Key application uses for the Octeon family are routers, switches, HD video over IP, deep packet inspection (DPI), unified threat management (UTM) appliances, content-aware switches, application-aware gateways, triple-play gateways, WLAN and 3G/4G access and aggregation devices, storage arrays, storage networking equipment, servers and intelligent NICs.

NetLogic Microsystems XLP Processor Family: The XLP832 processor supports 8 MIPS64 cores and is designed for both control plane and data plane applications. Numerous autonomous acceleration engines (AAEs) provide packet processing, security, compression/decompression, load balancing and storage acceleration functions. NetLogic’s low-latency Fast Messaging Network (FMN) allows for non-intrusive communication and control messaging among VirtuCores, acceleration engines and I/O, enabling inter-unit communication without the need for spin-locks or semaphores. NetLogic targets the XLP Processor at high-end communication systems, including wired and wireless security, networking, storage and data center acceleration.

Texas Instruments Multicore DSPs: Texas Instruments offers a high-performance multimedia solution based on its TMS320C6678 digital signal processor (DSP).
Designed for applications such as multimedia gateways, IMS media servers, video conferencing servers and video broadcast equipment, the C6678 is a highly dense media solution that is both power and cost efficient at the system level. Based on TI’s newest DSP generation, the TMS320C66x, the C6678 features eight 1.25 GHz DSP cores with 320 GMACs and 160 GFLOPs of combined fixed- and floating-point performance on a single device, enabling users to consolidate multiple DSPs to save board space and cost, as well as reduce overall power requirements (Figure 3).

Multicore-Based Network Application Platforms

We have seen that multicore GPP, NPU and DSP platforms all have healthy roles to play, and equipment designers have a multitude of choices from which to select the best possible solution for their specific application needs. A development organization may choose one architecture over another for many reasons: specific technical features, existing software investments, power requirements or competitive economics. Examples of the two ends of that spectrum of choice are Advantech’s AdvancedTCA and Packetarium product lines.

AdvancedTCA is a standards-based board and system platform architecture designed with telecommunication solutions in mind. It is supported as part of the SCOPE Alliance’s profiles and carrier grade base platform definition, and numerous network platforms have been built using it. Advantech offers a number of multicore AdvancedTCA blades. For GPP requirements, the MIC-5322 is a dual-processor Intel Xeon 5500/5600-based blade. The MIC-5322 supports one of the highest-performing Intel Xeon processors available in the ATCA form factor, with 12 cores and 24 threads of processing power, low DDR3 memory latency, fast PCI Express 2.0 and accelerated virtualization.

Aimed at providing a large amount of video and media processing capability, the Advantech DSPA-8901 is designed with 20 TI TMS320TCI6608 DSPs.
That totals 160 cores of processing power to reach the higher levels of performance density needed to build the highest-capacity wireless media gateways. The DSPA-8901 significantly reduces overall system power dissipation and system cost, and frees up valuable slots in gateway elements for additional subscriber capacity and throughput. The DSPA-8901 includes a high-performance Freescale QorIQ P2020 processor and a Broadcom BCM56321 switch, which terminates the 10 Gigabit Ethernet fabric connections and distributes traffic to the 20 DSPs.

Although they have an impressive array of carrier grade features, AdvancedTCA platforms can be prohibitive in size, power and price for some applications, especially those that are heavily dedicated to network processing. This was one of the key reasons behind Advantech’s cost-optimized Packetarium range. The goal was to pack as much network processing performance as possible into the smallest package while keeping power consumption and cost optimized for the targeted applications.

The NCP-5260 represents a new generation of hybrid system designs, with Intel architecture processing on the control plane and Packetarium network processing boards featuring NetLogic NPUs for the data plane. It integrates up to two powerful, multicore Packetarium network processing boards for wire speed packet processing and accommodates up to 16 x 10 GbE external interfaces. The main carrier board provides the high-speed switched interconnects between Packetarium boards (Figure 4).

At the high-performance end of Advantech’s Packetarium product line, the NCP-7560 integrates up to eight powerful NCPB-2320 multicore Packetarium Network Processing Boards. Utilizing Cavium Networks’ CN6880 Octeon II processor, a fully configured NCP-7560 packs 256 cores into a 4U server space to handle 80 Gbit/s of network traffic from multiple 10 Gigabit Ethernet ports.
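As a rough illustration of what “wire speed” implies for a system quoted at 256 cores and 80 Gbit/s, the short Python sketch below estimates how much time each core has per packet. It is back-of-the-envelope arithmetic, not a vendor figure: it assumes traffic is spread perfectly evenly, that every core does packet work, and that each Ethernet frame carries the standard 20 bytes of on-wire overhead (preamble, start-of-frame delimiter and inter-frame gap). The function names are made up for the example.

```python
# Standard Ethernet on-wire overhead per frame:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
ETH_OVERHEAD_BYTES = 20


def packets_per_second(link_bps, frame_bytes):
    """Maximum frame rate on a link for a given frame size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def per_core_budget_us(link_bps, frame_bytes, cores):
    """Microseconds each core may spend per packet, assuming the load
    is spread perfectly evenly across all cores (an idealization)."""
    return cores / packets_per_second(link_bps, frame_bytes) * 1e6


if __name__ == "__main__":
    link = 80e9   # 80 Gbit/s, as quoted for the fully configured system
    cores = 256   # 8 boards x 32 cores
    for size in (64, 1518):  # minimum and maximum standard frame sizes
        mpps = packets_per_second(link, size) / 1e6
        budget = per_core_budget_us(link, size, cores)
        print(f"{size:>5}-byte frames: {mpps:8.2f} Mpps, "
              f"{budget:6.2f} us per packet per core")
```

With minimum-size 64-byte frames this works out to roughly 119 million packets per second in aggregate, or about 2.15 microseconds per packet per core; with 1518-byte frames the budget grows to about 39 microseconds. The gap between those two cases helps explain why hardware acceleration for small-packet handling matters so much in these designs.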
Applications that reap the performance benefits of the new Octeon II processor family include high-capacity radio network controllers, network acceleration platforms, and data center and LTE gateways.

None of us has a crystal ball, but we can all be certain that the future of global networks will require a huge increase in capacity and capability. As the various models of cloud computing go from strength to strength, and network-capable mobile devices become even more pervasive, the requirement for ever more powerful network system platforms will increase. Whichever high-level architectures solution developers choose, multicore silicon combined with flexible and cost-optimized system platforms will provide a major implementation advantage.

Advantech, Irvine, CA. (949) 789-7178. [www.advantech.com].
Cavium Networks, San Jose, CA. (650) 623-7000. [www.cavium.com].
Intel, Santa Clara, CA. (408) 765-8080. [www.intel.com].
NetLogic Microsystems, Santa Clara, CA. (408) 454-3000. [www.netlogicmicro.com].
Texas Instruments, Dallas, TX. [www.ti.com].