Case Study on Using Multicore Processing and Programming
Base Stations for Fourth-Generation Cellular Services
How new-generation multicore processors with hardware
acceleration deliver high performance at low power
while supporting flexible programming models in
today’s advanced cellular networks.
According to the European Information Technology Observatory (EITO), and other
organizations, the number of mobile phone users worldwide exceeded the 4 billion mark
for the first time in 2009. This means that approximately two-thirds of the world’s
population uses a mobile phone.
Escalating competition and rising mobile user expectations are driving demand for
evermore advanced features and services. Mobile infrastructures are under pressure to
efficiently and reliably process data streams at faster rates as third-generation (3G)
services evolve to 4G with anticipated speeds of 100 Mbps or more. The basic
components of a mobile network, shown in Figure 1, include the user handsets that
connect wirelessly to base stations, which in turn communicate with the mobile service
provider’s core network over a wired connection. Base stations can become a choke point
restricting the flow of data upstream and downstream between the handsets and the
wireless provider’s core network. As carriers migrate to faster, 4G networks, how can
they manage the increasing project complexity resulting from new or additional wireless
technologies, higher throughput, and more features?
________________________________________________________________________
University@ CaviumNetworks.com
Page 1
Supplemental teaching aid
The professor may wish to pose one or more of the following questions to students to
assess their comprehension of the concepts presented in this document.
1. What software programming models are commonly used in multicore
programming?
2. Describe the key characteristics, advantages, and potential disadvantages of the
run-to-completion model.
3. Describe the key characteristics, advantages, and potential disadvantages of the
software pipelining model.
4. What factors would you need to consider when choosing between a run-to-completion model and a pipelining model?
5. Select an application besides a base station and develop the software architecture
for implementing this application by using the run-to-completion model.
6. Select an application besides a base station and develop the software architecture
for implementing this application by using the pipelining model.
7. What are the advantages of using multicore processors over DSPs or network
processors?
8. Describe the LTE or WiMAX base station processing flow.
9. Which functions of an LTE or WiMAX base station benefit significantly from
hardware acceleration and what kinds of hardware acceleration would be most
relevant in these cases?
[Insert Fig. 1 showing gateways, base stations and handsets]
Fig. 1 Mobile network architecture
Previous generations of base stations have mostly been implemented using digital signal
processors (DSPs) and network processors. In DSP implementations, multiple DSPs are
typically required to provide adequate performance. DSPs are built on proprietary
instruction set architectures and are optimized for digital signal processing rather than
general-purpose processing. Moreover, development with DSPs requires proprietary
software tools. Because multiple DSP chips are needed to meet performance
requirements, development complexity increases.
Most network processors are multicore-based. However, their cores do not implement an
industry-standard instruction set architecture, and some require assembly language
programming. In addition, some network processors require the software program to fit
entirely into limited on-chip program storage, which restricts both the size of the
program and the functionality it can support, and increases development complexity.
Furthermore, network processors usually impose strong dependencies
between the software programming model and the network processor hardware
architecture. For example, some network processors require using a pipelining
programming model to achieve good performance.
In contrast, most multicore processors support flexible software programming models
and industry-standard instruction set architecture and software tools. In terms of
multicore programming models, run-to-completion and software pipelining are the
models most commonly used. This document describes the pluses and minuses of these
two models.
Following the Data
A mobile data stream (which may include both voice and data) is transmitted wirelessly
from handsets within a geographical cell to a nearby base station where it is processed
before sending it on to the core network. This is known as the uplink. Conversely, data
from the wireless service provider’s core network is transmitted to the base station where
it is processed before transmitting to the handset. This is the downlink.
In parallel to this data plane is a control plane that provides Operations & Management
(O&M) functions.
Multiple generations of wireless technologies and protocols currently exist, supporting
various features and levels of throughput. Going forward, Long Term Evolution (LTE)
and Worldwide Interoperability for Microwave Access (WiMAX) offer the next
generation (4G) of features and capabilities, such as interactive video conferencing. This
case study focuses on base stations utilizing LTE or WiMAX, as these next-generation
wireless technologies present a significant increase in processing complexity and higher
throughput, requiring processors with increased capabilities.
LTE is based on a set of standards developed by the 3rd Generation Partnership Project
(3GPP), a collaboration between groups of telecommunications associations to create a
global 3G mobile phone specification. 3GPP LTE, the 4G successor, provides downlink
speeds up to 100 Mbps and uplink speeds of 50 Mbps. WiMAX, also known as IEEE
802.16, was initially intended for broadband metropolitan area networks (MANs)
offering high data rates (up to 75 Mbps) over long distances (up to 10 miles from a base
station). The reader can visit http://www.comsysmobile.com/pdf/LTEvsWiMax.pdf for a
comparison of LTE and WiMAX specifications.
LTE and WiMAX networks operate at several protocol layers to allow the base station to
process the data from the handsets and core network. The PHY layer (Layer 1) provides
the over-the-air communications channel for exchanging bits of data between handsets
and base station. At the Medium Access Control (MAC) layer (Layer 2), and at Layer 3,
LTE and WiMAX perform a complex and compute-intensive series of operations on the
incoming data. Figure 2 below illustrates the process under WiMAX. LTE follows a
similar, but not identical, flow of steps. Downlink processing takes place in roughly the
reverse order.
[Insert Fig 2 showing WiMAX uplink processing steps for a single flow]
Fig. 2 Uplink operations
Base station developers can choose either a run-to-completion programming model or a
pipelining programming model to develop the software that implements these functions.
These models are described in the following two sections.
Run-to-Completion Programming Model
Depending on the class of the base station, WiMAX and LTE networks can support
hundreds of simultaneous users. Each user represents a separate packet flow (or flows
when data and voice go through separate flows). Each flow represents all the packets
related to a call, or a Wireless Application Protocol (WAP) connection for data access.
Base stations can exploit the parallelism of multiple packet flows using multicore
processors.
The base station receives packets, identifies packet flows and schedules processing of
these packets on the multiple processing cores. Some multicore processors, like Cavium
Networks OCTEON, provide hardware acceleration and automation for these functions.
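The flow-identification step above can be illustrated with a minimal software sketch. The 5-tuple hash and the field names below are illustrative assumptions for this document, not the actual mechanism of any particular processor, which performs the equivalent lookup in hardware at line rate:

```python
import hashlib

def flow_id(src_ip, dst_ip, src_port, dst_port, protocol):
    """Derive a stable flow identifier by hashing the 5-tuple.

    Packets of the same flow always hash to the same id, so they can be
    steered to the same core (preserving per-flow processing order).
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def pick_core(fid, num_cores):
    # Simple modulo steering: same flow -> same core; flows spread across cores.
    return fid % num_cores
```

The only invariant this sketch demonstrates is the one the base station relies on: identical 5-tuples always map to the same flow id, and therefore to the same ordering domain.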
In a run-to-completion model, each processing core completes the entire processing flow
for a packet. Packets belonging to the same flow (e.g., a WAP flow) tend to depend on
each other. For instance, packets that belong to the same flow are typically processed in
sequence and access shared data structures, which maintain the context of the relevant
flow. When a processor core completes the processing flow for a packet, it picks up
another packet and processes it. At any given point in time, the processor may be
simultaneously processing as many flows as there are cores. However, the processor is
managing all the flows (perhaps hundreds of users, each with multiple flows) that are
passing through the base station at any time. Figure 3 below illustrates multiple cores
executing multiple flows.
[Insert Fig 3 showing WiMAX uplink processing steps for multiple flows]
Fig. 3 Run-to-Completion Programming Model
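In code, the run-to-completion model reduces to every core running the same loop over a shared work queue, with each core carrying a packet through the entire processing flow. The sketch below uses threads to stand in for cores, and the stage names are placeholders loosely following Figure 2:

```python
import queue
import threading

def process_packet(pkt):
    # Run-to-completion: this core executes the ENTIRE flow for its packet.
    for stage in ("demux", "decode", "schedule", "phs", "classify", "gre"):
        pkt.setdefault("stages", []).append(stage)
    return pkt

def core_loop(work_q, done):
    # One instance of this loop runs per core (a thread here, for illustration).
    while True:
        pkt = work_q.get()
        if pkt is None:            # sentinel: no more packets
            break
        done.append(process_packet(pkt))

work_q, done = queue.Queue(), []
cores = [threading.Thread(target=core_loop, args=(work_q, done)) for _ in range(4)]
for c in cores:
    c.start()
for i in range(100):               # arriving packets, already split into flows
    work_q.put({"id": i})
for _ in cores:
    work_q.put(None)
for c in cores:
    c.join()
```

At any instant, up to four packets (one per "core") are in flight, each being processed end to end; no hand-off between cores occurs mid-packet.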
Pipelining Programming Model
The software pipelining model takes a different approach, dividing up the entire
processing flow into a pipeline of pipestages. Each pipestage represents the processing of
one or more functions which are a subset of the entire processing flow.
For example, Figure 4 below illustrates two pipestages. [Please note that this example
merely demonstrates the pipelining model. It does not reflect technical trade-offs or
optimization of the WiMAX processing flow, nor does it suggest how to divide up the
flow into pipestages. WiMAX base station developers should design the pipeline based
on specific product requirements and processors used.]
The first pipestage handles the Demux, Decoder, and Scheduler steps for a packet flow,
while the second pipestage handles Payload Header Suppression (PHS), Classification,
and Generic Routing Encapsulation (GRE). As a packet is received, a core is assigned to
execute the steps in the first pipestage for this packet. Once complete, that processed data
is assigned to another core to complete the remaining steps in the second pipestage of the
overall processing flow.
[Insert Fig 4 showing two pipestages]
Fig. 4 2-pipestage Pipelining Programming Model
After dividing up the overall processing into pipestages, developers assign cores to
process each pipestage. Cores associated with each pipestage constitute a core group,
which can include one or more cores depending on the performance requirement and
complexity of the pipestage. Processing within each core group is performed in parallel.
Scheduling of packets to be processed must account for and maintain packet ordering and
dependencies.
A critical task is scheduling the processing of each packet from one pipestage to the next.
The data and context of each packet must be communicated to allow the next pipestage to
pick up the processing where the previous one left off. Some multicore processors, such
as Cavium OCTEON, provide hardware acceleration and automation to streamline the
scheduling process. OCTEON integrates a hardware scheduler that can schedule
processing to groups of cores based on tag values. A developer partitions the cores into
multiple groups, with each group covering a pipestage. Tag values are used to identify
which pipestage a packet is currently at. As the processing for a pipestage finishes, it can
simply update the tag value for the packet to reflect the next stage. The hardware
scheduler uses the tag value to schedule the packet to one of the cores covering the next
stage.
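The tag-driven hand-off described above can be sketched in software. The two stage functions, tag names, and the round-robin dispatcher below are illustrative assumptions; in OCTEON the scheduler is a hardware unit, not a loop:

```python
from collections import deque

STAGE1, STAGE2, DONE = "stage1", "stage2", "done"

def stage1(pkt):
    # First pipestage (cf. Fig. 4): Demux, Decoder, Scheduler.
    pkt["work"] += ["demux", "decode", "schedule"]
    pkt["tag"] = STAGE2            # update the tag: next core group picks it up
    return pkt

def stage2(pkt):
    # Second pipestage: PHS, Classification, GRE.
    pkt["work"] += ["phs", "classify", "gre"]
    pkt["tag"] = DONE
    return pkt

HANDLERS = {STAGE1: stage1, STAGE2: stage2}

def schedule(packets):
    """Software stand-in for the hardware scheduler: dispatch each packet to
    the handler for its current tag until every packet reaches DONE."""
    q, finished = deque(packets), []
    while q:
        pkt = HANDLERS[q.popleft()["tag"].__class__ and q[0]["tag"]] if False else q.popleft()
        pkt = HANDLERS[pkt["tag"]](pkt)
        (finished if pkt["tag"] == DONE else q).append(pkt)
    return finished

pkts = [{"id": i, "tag": STAGE1, "work": []} for i in range(3)]
result = schedule(pkts)
```

The key point the sketch preserves is that a pipestage never calls the next one directly: it only rewrites the tag, and the scheduler routes the packet to whichever core group owns that tag.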
Programming Model Considerations
Which programming model should base station developers choose? As with most
technologies, there are advantages and disadvantages to both. Developers should take the
following points into consideration.
When comparing the run-to-completion and pipelining models, developers usually prefer
run-to-completion because it is more intuitive and less complicated. It is also more
flexible, because there are no dependencies between the processing of different
pipestages. However, dependencies remain between packets processed concurrently:
packets belonging to the same flow may have ordering requirements among them.
Additionally, synchronization mechanisms are still needed to ensure that dependencies
among the processing on multiple cores, and accesses to shared data structures, are
properly coordinated.
On the other hand, the pipelining model should not be dismissed, as it offers several
advantages over run-to-completion. First, it forces the developer to partition the entire
processing flow into pipestages. As a result, the software is modular by design.
Second, each pipestage focuses on certain functions as opposed to the entire processing
flow. As a result, the amount of software involved is more limited. For network
processors with separate, limited program storage, the pipelining model helps to fit within
the storage limit. For multicore processors that implement a cache hierarchy with
dedicated L1 cache for each core, the L1 hit rate tends to be better with a pipelining
model, because each core executes a smaller body of code.
Third, the pipelining model, by definition, imposes a processing sequence among the
pipestages. In some cases this sequencing removes the need for explicit
synchronization, though not in every case the application requires. Furthermore, some
portions of the application may require critical-section or exclusive (e.g., atomic)
access to shared data. If these accesses are isolated in a separate pipestage handled by
a single core, software locking can be eliminated entirely: only one core touches the
data, so there is no contention. Even if several cores handle this pipestage,
synchronization overhead is still likely to be lower than if all the cores requested
synchronized access at once, since that overhead grows rapidly with the number of
competing cores.
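The lock-elimination argument above can be made concrete with a small sketch. Instead of many cores locking a shared flow table, earlier pipestages hand updates to a single owner "core" over a queue; the class and names below are hypothetical:

```python
import queue
import threading

class FlowTable:
    """Shared per-flow context, owned by exactly one pipestage thread,
    so its updates need no locking."""
    def __init__(self):
        self.counts = {}

    def update(self, fid):
        # Safe without a lock ONLY because a single thread ever calls it.
        self.counts[fid] = self.counts.get(fid, 0) + 1

def owner_stage(in_q, table):
    # The dedicated pipestage: the sole writer of the shared structure.
    while True:
        fid = in_q.get()
        if fid is None:            # sentinel: shut down
            break
        table.update(fid)

in_q, table = queue.Queue(), FlowTable()
owner = threading.Thread(target=owner_stage, args=(in_q, table))
owner.start()

# Earlier pipestages (any number of cores) enqueue work instead of
# touching the table directly.
for fid in [1, 2, 1, 3, 1]:
    in_q.put(fid)
in_q.put(None)
owner.join()
```

The queue serializes access by construction, trading lock contention for a hand-off, which is exactly the structure a dedicated pipestage gives for free.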
One specific multicore processor that base station developers may wish to consider is
Cavium OCTEON, which supports both run-to-completion and pipelining models.
OCTEON includes hardware features that classify packets into their individual flows. As
a result, it can schedule processing of packets while automatically maintaining processing
order for packets that belong to the same flow. In addition, it can schedule processing of
atomic sequences without requiring the programmer to use explicit software locking and
synchronization mechanisms. These hardware features make the run-to-completion
model very efficient on OCTEON.
If the developer prefers a pipelining model instead, for example because he or she is
porting legacy software from a prior implementation on network processors, OCTEON
also provides effective hardware support. OCTEON
integrates a hardware scheduler that can tag packets belonging to the same flow as well
as the same pipestage. When a core completes processing of a packet for the current
pipestage, it can update the tag to reflect the next pipestage. The hardware scheduler will
then schedule the packet to be processed by another core or set of cores.
Added Efficiency through Hardware Acceleration
Another key component of the latest multicore processors is integrated application
hardware acceleration engines that can offload functions for MAC layer operations and
all higher protocol layers within LTE or WiMAX. While general-purpose multicore
processors perform these functions in software, newer multicore processors, such as
OCTEON, use acceleration engines to offload many mechanical or compute-intensive
tasks, providing significantly higher performance at much lower power consumption.
For example, both WiMAX and LTE perform encryption and decryption based on standard
security algorithms (such as SNOW 3G for LTE, and IPsec). Hardware acceleration can
speed up these tasks, as well as cyclic redundancy check (CRC) calculations, packet
flow classification, packet scheduling for either the run-to-completion or the pipelining
model, timers, re-transmission, quality of service, and more.
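CRC is a good example of a task that is trivial for dedicated hardware but adds up in software at line rate. In software it is just a function over the payload; the sketch below uses zlib's CRC-32 as a stand-in for whatever polynomial the air-interface standard actually mandates:

```python
import zlib

def append_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer, as a transmitter would before sending."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it to the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

frame = append_crc(b"MAC PDU payload")
```

An acceleration engine performs this same check-per-frame in parallel with the cores, which is why offloading it frees a noticeable share of cycles at 4G data rates.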
About Cavium Networks
Cavium Networks is a leading provider of highly integrated semiconductor products that
enable intelligent processing in networking, communications, wireless, storage, video and
security applications. Cavium Networks offers a broad portfolio of integrated, software-compatible processors ranging in performance from 10 Mbps to 40 Gbps that enable
secure, intelligent functionality in enterprise, data-center, broadband/consumer and
access and service provider equipment. Cavium Networks processors are supported by
ecosystem partners that provide operating systems, tool support, reference designs and
other services. Cavium Networks' principal offices are in Mountain View, CA, with design
team locations in California, Massachusetts and India. For more information, please visit:
http://www.caviumnetworks.com.