Case Study on Using Multicore Processing and Programming Base Stations for Fourth-Generation Cellular Services

How new-generation multicore processors with hardware acceleration deliver high performance at low power while supporting flexible programming models in today's advanced cellular networks.

According to the European Information Technology Observatory (EITO) and other organizations, the number of mobile phone users worldwide exceeded the 4 billion mark for the first time in 2009. This means that approximately two-thirds of the world's population uses a mobile phone. Escalating competition and rising mobile user expectations are driving demand for ever more advanced features and services. Mobile infrastructures are under pressure to process data streams efficiently and reliably at faster rates as third-generation (3G) services evolve to 4G, with anticipated speeds of 100 Mbps or more.

The basic components of a mobile network, shown in Figure 1, include the user handsets that connect wirelessly to base stations, which in turn communicate with the mobile service provider's core network over a wired connection. Base stations can become a choke point, restricting the flow of data upstream and downstream between the handsets and the wireless provider's core network. As carriers migrate to faster 4G networks, how can they manage the increasing project complexity resulting from new or additional wireless technologies, higher throughput, and more features?

________________________________________________________________________
University@ CaviumNetworks.com Page 1

Supplemental teaching aid

The professor may wish to pose one or more of the following questions to students to appraise their comprehension of the concepts presented within this document.

1. What software programming models are commonly used in multicore programming?
2. Describe the key characteristics, advantages, and potential disadvantages of the run-to-completion model.
3. Describe the key characteristics, advantages, and potential disadvantages of the software pipelining model.
4. What factors would you need to consider when choosing between a run-to-completion model and a pipelining model?
5. Select an application besides a base station and develop the software architecture for implementing it using the run-to-completion model.
6. Select an application besides a base station and develop the software architecture for implementing it using the pipelining model.
7. What are the advantages of using multicore processors over DSPs or network processors?
8. Describe the LTE or WiMAX base station processing flow.
9. Which functions of an LTE or WiMAX base station benefit significantly from hardware acceleration, and what kinds of hardware acceleration would be most relevant in these cases?

[Insert Fig. 1 showing gateways, base stations and handsets]
Fig. 1 Mobile network architecture

Previous generations of base stations have mostly been implemented using digital signal processors (DSPs) and network processors. In DSP implementations, multiple DSPs are typically required to provide adequate performance. DSPs are based on proprietary instruction set architectures and are optimized for digital signal processing as opposed to general-purpose processing. Moreover, development using DSPs requires proprietary software tools. Because multiple DSP chips are needed to meet performance requirements, development complexity increases. Most network processors are multicore-based. However, these cores do not support industry-standard instruction set architectures, and some require assembly language programming. In addition, some network processors require the software program to fit entirely into the limited on-chip program storage.
This restricts the size of the program and the functionality that can be supported, and increases development complexity. Furthermore, network processors usually impose strong dependencies between the software programming model and the network processor hardware architecture. For example, some network processors require a pipelining programming model to achieve good performance. In contrast, most multicore processors support flexible software programming models, industry-standard instruction set architectures, and standard software tools. In terms of multicore programming models, run-to-completion and software pipelining are the two most commonly used. This document describes the pluses and minuses of these two models.

Following the Data

A mobile data stream (which may include both voice and data) is transmitted wirelessly from handsets within a geographical cell to a nearby base station, where it is processed before being sent on to the core network. This is known as the uplink. Conversely, data from the wireless service provider's core network is transmitted to the base station, where it is processed before being transmitted to the handset. This is the downlink. In parallel to this data plane is a control plane that provides Operations & Management (O&M) functions.

Multiple generations of wireless technologies and protocols currently exist, supporting various features and levels of throughput. Going forward, Long Term Evolution (LTE) and Worldwide Interoperability for Microwave Access (WiMAX) offer the next generation (4G) of features and capabilities, such as interactive video conferencing. This case study focuses on base stations utilizing LTE or WiMAX, as these next-generation wireless technologies present a significant increase in processing complexity and higher throughput, requiring processors with increased capabilities.
LTE is based on a set of standards developed by the 3rd Generation Partnership Project (3GPP), a collaboration among groups of telecommunications associations originally formed to create a global 3G mobile phone specification. 3GPP LTE, the 4G successor, provides downlink speeds of up to 100 Mbps and uplink speeds of up to 50 Mbps. WiMAX, also known as IEEE 802.16, was initially intended for broadband metropolitan area networks (MANs), offering high data rates (up to 75 Mbps) over long distances (up to 10 miles from a base station). The reader can visit http://www.comsysmobile.com/pdf/LTEvsWiMax.pdf for a comparison of LTE and WiMAX specifications.

LTE and WiMAX networks operate at several protocol layers to allow the base station to process the data from the handsets and core network. The PHY layer (Layer 1) provides the over-the-air communications channel for exchanging bits of data between handsets and the base station. At the Medium Access Control (MAC) layer (Layer 2), and at Layer 3, LTE and WiMAX perform a complex and compute-intensive series of operations on the incoming data. Figure 2 below illustrates the process under WiMAX. LTE follows a similar, but not identical, flow of steps. Downlink processing takes place in roughly the reverse order.

[Insert Fig 2 showing WiMAX uplink processing steps for a single flow]
Fig. 2 Uplink operations

Base station developers can choose either a run-to-completion programming model or a pipelining programming model to develop the software that implements these functions. These models are described in the following two sections.

Run-to-Completion Programming Model

Depending on the class of the base station, WiMAX and LTE networks can support hundreds of simultaneous users. Each user represents a separate packet flow (or multiple flows when data and voice go through separate flows).
Each flow represents all the packets related to a call, or a Wireless Application Protocol (WAP) connection for data access. Base stations can exploit the parallelism of multiple packet flows using multicore processors. The base station receives packets, identifies packet flows, and schedules processing of these packets on the multiple processing cores. Some multicore processors, like Cavium Networks' OCTEON, provide hardware acceleration and automation for these functions.

In a run-to-completion model, each processing core completes the entire processing flow for a packet. Packets belonging to the same flow (e.g., a WAP flow) tend to depend on each other. For instance, packets that belong to the same flow are typically processed in sequence and access shared data structures, which maintain the context of the relevant flow. When a processor core completes the processing flow for a packet, it picks up another packet and processes it. At any given point in time, the processor may be simultaneously processing as many flows as there are cores. However, the processor is managing all the flows (perhaps hundreds of users, each with multiple flows) that are passing through the base station at any time. Figure 3 below illustrates multiple cores executing multiple flows.

[Insert Fig 3 showing WiMAX uplink processing steps for multiple flows]
Fig. 3 Run-to-Completion Programming Model

Pipelining Programming Model

The software pipelining model takes a different approach, dividing the entire processing flow into a pipeline of pipestages. Each pipestage performs one or more functions that are a subset of the entire processing flow. For example, Figure 4 below illustrates two pipestages. [Please note that this example merely demonstrates the pipelining model.
It does not reflect technical trade-offs or optimization of the WiMAX processing flow, nor does it suggest how to divide the flow into pipestages. WiMAX base station developers should design the pipeline based on specific product requirements and the processors used.]

The first pipestage handles the Demux, Decoder, and Scheduler steps for a packet flow, while the second pipestage handles Payload Header Suppression (PHS), Classification, and Generic Routing Encapsulation (GRE). As a packet is received, a core is assigned to execute the steps in the first pipestage for this packet. Once complete, the processed data is assigned to another core to complete the remaining steps in the second pipestage of the overall processing flow.

[Insert Fig 4 showing two pipestages]
Fig. 4 2-pipestage Pipelining Programming Model

After dividing the overall processing into pipestages, developers assign cores to process each pipestage. The cores associated with each pipestage constitute a core group, which can include one or more cores depending on the performance requirements and complexity of the pipestage. Processing within each core group is performed in parallel. Scheduling of packets must account for and maintain packet ordering and dependencies.

A critical task is scheduling the processing of each packet from one pipestage to the next. The data and context of each packet must be communicated to allow the next pipestage to pick up the processing where the previous one left off. Some multicore processors, such as Cavium OCTEON, provide hardware acceleration and automation to streamline the scheduling process. OCTEON integrates a hardware scheduler that can schedule processing to groups of cores based on tag values. A developer partitions the cores into multiple groups, with each group covering a pipestage.
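The core-group partitioning and tag-based hand-off just described can be sketched as a small software simulation. This is purely illustrative and assumes nothing about the OCTEON API: the packet fields, the two-stage layout mirroring Figure 4, and the round-robin choice of core within a group are all inventions of the example.

```python
# Illustrative simulation of a 2-pipestage pipelining model.
# In real hardware a scheduler performs the dispatch and tag handling;
# here a toy loop plays that role.
from collections import deque

PIPESTAGE_1 = 1   # e.g., Demux, Decoder, Scheduler
PIPESTAGE_2 = 2   # e.g., PHS, Classification, GRE
DONE = 3

class Packet:
    def __init__(self, flow_id):
        self.flow_id = flow_id
        self.tag = PIPESTAGE_1      # tag identifies the current pipestage
        self.trace = []             # records (stage, core) for each step

# Each pipestage is served by its own core group.
core_groups = {
    PIPESTAGE_1: ["core0", "core1"],   # first pipestage: two cores
    PIPESTAGE_2: ["core2"],            # second pipestage: one core
}

def run_pipestage(pkt, core):
    """Perform the functions of the packet's current pipestage, then
    advance the tag so the scheduler hands it to the next core group."""
    pkt.trace.append((pkt.tag, core))
    pkt.tag += 1                       # next pipestage, or DONE

def schedule(packets):
    """Toy scheduler: repeatedly dispatch each packet to a core in the
    group covering its current pipestage, until all packets finish."""
    work = deque(packets)
    rr = {stage: 0 for stage in core_groups}   # round-robin index per group
    while work:
        pkt = work.popleft()
        group = core_groups[pkt.tag]
        core = group[rr[pkt.tag] % len(group)]
        rr[pkt.tag] += 1
        run_pipestage(pkt, core)
        if pkt.tag != DONE:
            work.append(pkt)           # re-queue for the next pipestage

pkts = [Packet(flow_id=i % 2) for i in range(4)]
schedule(pkts)
# Every packet passes through both pipestages, in order.
for p in pkts:
    assert [s for s, _ in p.trace] == [PIPESTAGE_1, PIPESTAGE_2]
```

The simulation shows the essential property of the model: a packet never skips a pipestage, and the tag alone determines which core group may process it next.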
Tag values identify which pipestage a packet is currently at. As the processing for a pipestage finishes, the software simply updates the tag value for the packet to reflect the next stage. The hardware scheduler uses the tag value to schedule the packet to one of the cores covering the next stage.

Programming Model Considerations

Which programming model should base station developers choose? As with most technologies, there are advantages and disadvantages to both. Developers should take the following points into consideration.

When comparing run-to-completion with pipelining, developers usually prefer run-to-completion because it is more intuitive and less complicated. It is also more flexible, because there are no dependencies between the processing of different pipestages. However, dependencies remain when concurrently processing packets that belong to the same flow and have ordering requirements, and synchronization mechanisms are still needed to ensure that dependencies among the processing on multiple cores, and accesses to shared data structures, are properly coordinated.

On the other hand, the pipelining model should not be dismissed, as it offers several advantages over run-to-completion. First, it forces the developer to partition the entire processing flow into pipestages. As a result, the software is, by design, modular. Second, each pipestage focuses on certain functions as opposed to the entire processing flow. As a result, the amount of software involved is more limited. For network processors with separate, limited program storage, the pipelining model helps the code fit within the storage limit.
For multicore processors that implement a cache hierarchy with a dedicated L1 cache for each core, the hit rate for the L1 cache tends to be better with a pipelining model, again because the amount of software involved is more limited. Third, the pipelining model, by definition, imposes a sequence on processing among the pipestages. In some cases this sequencing can replace explicit synchronization, although not for all of the synchronization the application requires. Furthermore, some portions of the application may require critical-section access or exclusive (e.g., atomic) access to shared data. If these accesses are confined to a separate pipestage handled by a single core, software locking can be eliminated, since only one core is involved and there is no contention. Even if several cores handle this pipestage, synchronization overhead is still likely to be lower than if all the cores requested synchronized access at the same time, because synchronization overhead grows with the number of competing cores.

One specific multicore processor that base station developers may wish to consider is Cavium OCTEON, which supports both run-to-completion and pipelining models. OCTEON includes hardware features that classify packets into their individual flows. As a result, it can schedule processing of packets while automatically maintaining processing order for packets that belong to the same flow. In addition, it can schedule processing of atomic sequences without requiring the programmer to use explicit software locking and synchronization mechanisms. These hardware features make the run-to-completion model very efficient on OCTEON.
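The per-flow ordering guarantee described above can be illustrated with a small software sketch. This is a simulation of the scheduling behavior, not OCTEON code; the packet representation, the "atomic flow" rule, and the dispatch loop are assumptions made for the example.

```python
# Illustrative simulation of run-to-completion scheduling in which each
# flow is treated atomically: at most one packet per flow is in flight,
# so per-flow order is preserved without explicit software locks.
from collections import deque

def run_to_completion(packets, num_cores=4):
    """Dispatch packets to idle 'cores', never starting a packet while
    another packet of the same flow is still being processed."""
    pending = deque(packets)          # packets in arrival order
    in_flight = {}                    # flow_id -> (core, packet)
    completed = []
    idle = list(range(num_cores))
    while pending or in_flight:
        # Dispatch: pick pending packets whose flow is not already busy.
        for pkt in list(pending):
            if not idle:
                break
            if pkt["flow"] not in in_flight:
                pending.remove(pkt)
                in_flight[pkt["flow"]] = (idle.pop(), pkt)
        # "Complete" one in-flight packet: the whole processing flow for
        # that packet ran on a single core (run-to-completion).
        flow, (core, pkt) = next(iter(in_flight.items()))
        del in_flight[flow]
        idle.append(core)
        completed.append(pkt)
    return completed

# Two interleaved flows sharing the same pool of cores.
pkts = [{"flow": f, "seq": s} for s, f in enumerate([0, 1, 0, 1, 0])]
done = run_to_completion(pkts)
# Per-flow order is preserved even though flows interleave across cores.
for f in (0, 1):
    seqs = [p["seq"] for p in done if p["flow"] == f]
    assert seqs == sorted(seqs)
```

In the hardware-assisted case the scheduler enforces the "one packet per flow in flight" rule itself, which is what removes the need for software locking around per-flow context.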
If the developer prefers a pipelining model (for example, when porting legacy software from a prior implementation on network processors), OCTEON also provides effective hardware support. OCTEON integrates a hardware scheduler that can tag packets as belonging to a particular flow and a particular pipestage. When a core completes processing of a packet for the current pipestage, it can update the tag to reflect the next pipestage. The hardware scheduler will then schedule the packet to be processed by another core or set of cores.

Added Efficiency through Hardware Acceleration

Another key feature of the latest multicore processors is integrated hardware acceleration engines that can offload functions for MAC layer operations and all higher protocol layers within LTE or WiMAX. While general-purpose multicore processors perform these functions in software, newer multicore processors, such as OCTEON, use acceleration engines to offload many mechanical or compute-intensive tasks, providing significantly higher performance at much lower power consumption. For example, both WiMAX and LTE perform encryption/decryption based on standard security algorithms (such as SNOW 3G for LTE, and IPSec). Hardware acceleration can speed up these tasks, as well as cyclical redundancy check (CRC) calculations, packet flow classification, scheduling for both run-to-completion and pipelining models, timers, re-transmission, quality of service, and more.

About Cavium Networks

Cavium Networks is a leading provider of highly integrated semiconductor products that enable intelligent processing in networking, communications, wireless, storage, video and security applications.
Cavium Networks offers a broad portfolio of integrated, software-compatible processors ranging in performance from 10 Mbps to 40 Gbps that enable secure, intelligent functionality in enterprise, data-center, broadband/consumer, and access and service provider equipment. Cavium Networks processors are supported by ecosystem partners that provide operating systems, tool support, reference designs and other services. Cavium Networks' principal offices are in Mountain View, CA, with design team locations in California, Massachusetts and India. For more information, please visit: http://www.caviumnetworks.com.