Fundamental Architectural Considerations for Network Processors

Network Processors
Harsh Chilwal
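Evolution : Cellular phone generation
[Figure: data rate (kb/s; marks at 12, 170 and 1000) across the 1G, 2G, 2.5G and 3G generations. Frequency bands grow from 900MHz through 900MHz/1800MHz and 900-1800MHz to 900-1800-1900MHz. Services grow from voice through smart phone and tiny Internet to full web service.]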
Evolution : 3G cellular phones
[Figure: mobile stations (MS) connect to base stations (BS), which connect through a base station controller (BSC) to the network. Labels: 12Kb/second, 100 MS, 5Mb/second, 10 BS, 100Mb/second.]
Evolution : 3G cellular phones
[Figure: the same MS, BS, BSC and network topology, with an NP marked at three points. Labels: 1Mb/second, 500 MS, 500Mb/second, 100 BS, 50Gbit/second.]
Evolution : Networks
[Figure: bandwidth (Mb/s, log scale from 0.1 to 100,000) versus year (1980 to 2005), with NP marked on the curve. DS = Digital signal, OC = Optical carrier.]
Carrier   Bandwidth   Multiple of previous level
DS0       64K
DS1       1.5M        x24
DS3       44Mb        x28
OC12      622Mb       x12
OC192     10Gb        x16
OC768     40Gb        x4
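The multipliers in the figure can be checked against the nominal carrier rates. The sketch below uses the standard 64 kb/s DS0 and 51.84 Mb/s OC-1 rates; the function and constant names are illustrative only, and the figure itself rounds the results (622Mb, 10Gb, 40Gb).

```python
# Nominal rates in kb/s, kept as integers to avoid rounding noise.
DS0_KBPS = 64        # one digitized voice channel
OC1_KBPS = 51_840    # base SONET optical carrier rate

def ds1_payload_kbps():
    """24 DS0 channels; the 1544 kb/s DS1 line rate adds 8 kb/s of framing."""
    return 24 * DS0_KBPS

def oc_rate_kbps(n):
    """Line rate of OC-n, which is n times the OC-1 rate."""
    return n * OC1_KBPS

print(ds1_payload_kbps())   # 1536
print(oc_rate_kbps(12))     # 622080
print(oc_rate_kbps(192))    # 9953280
print(oc_rate_kbps(768))    # 39813120
```

The x16 and x4 steps in the figure are exact here: OC-192 is 16 times OC-12 and OC-768 is 4 times OC-192.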
Networking Trends
Increasing network traffic.
New, sophisticated protocols are being introduced at a rapid pace.
Need to support new applications that provide new services.
Convergence of voice and data networks is introducing many changes in the communication industry.
Increasing time-to-market (TTM) pressures.
Decreasing product life cycles.
General Purpose Processor based Software Router
Benefits
– Flexible for upgrading the system
– Easy to support additional interfaces
– Quick to develop new products with a short TTM
The core processor performs all the routing functions
Drawbacks
– Not able to scale up to higher bandwidths; maximum of OC-12 speeds only
– Can support complex network operations (traffic engineering, QoS, etc.) only with a major reduction in performance
ASIC based Routers
Benefits
– Provide high-speed, wire-speed performance
Drawbacks
– Lack flexibility; difficult to meet changing market needs and demands
– Long design cycles increase TTM and reduce the product life cycle (PLC)
– A change or failure in the design involves more risk
– The ASIC must be replaced to provide new functionality
– Complex network operations are still executed in software
Network Processor based boxes
Promise to provide both performance and flexibility
Comprise many packet processing elements supporting multiple threads
Achieve higher performance through pipelining and parallel processing, both across threads and across packet processing elements
Bring in flexibility through software programming
Easy to add features
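The two ways of using several packet processing elements can be compared with a simple throughput model. This is a minimal sketch, not a model of any particular device: the stage times are hypothetical, and it assumes perfect load distribution with no memory contention.

```python
def pipeline_mpps(stage_times_ns):
    """Pipelined elements: one packet leaves per slowest-stage time."""
    return 1000 / max(stage_times_ns)

def parallel_mpps(stage_times_ns, num_elements):
    """Parallel elements: each one runs every stage for its own packet."""
    return num_elements * 1000 / sum(stage_times_ns)

# Hypothetical per-packet stage times in nanoseconds.
stages = [200, 500, 300]

print(pipeline_mpps(stages))      # 2.0 (limited by the 500 ns stage)
print(parallel_mpps(stages, 3))   # 3.0 (three elements, 1000 ns per packet each)
```

With the same three elements, the pipeline is limited by its slowest stage, while the parallel arrangement is limited by the total work per packet. Real designs mix both, and the balance depends on how evenly the work splits into stages.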
Basic Architecture of Network Processors
[Figure: basic architecture, continued over two slides. Labels: Multiple Streams, Dispatcher, RISC, Com – Engine, Look-A-Side Co-processors CP1 to CP4, Merger.]
Intro: Systems and Protocols: Relation with Standards
Protocols
– IETF: IPv4, MPLS, PPP/L2TP, IPv6, MIBs
– ITU-T/ANSI/ATM Forum: ATM
– IEEE: Ethernet
Systems
– IETF / ForCES WG: Data / Forwarding Plane, Control Plane
– NPF: Service Layer (system wide, no awareness of where things are), Functional Layer (awareness of where things are), Operational Layer, Interface Management
OSI Network Architecture
[Figure: hosts A and B connected through a network, each with the seven OSI layers: 7 Application, 6 Presentation, 5 Session, 4 Transport, 3 Network, 2 Data Link, 1 Physical. On the way down the stack each layer adds its header to the DATA (AH, PH, SH, TH, NH, DH); the receiving side removes them in reverse order.]
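The header nesting in the OSI figure can be sketched in a few lines. The header names come from the figure; the list-of-strings representation and the function names are assumptions made for illustration, and the data link trailer that real frames carry is left out.

```python
# Headers added on the way down the stack, layer 7 first.
HEADERS_TOP_DOWN = ["AH", "PH", "SH", "TH", "NH", "DH"]

def encapsulate(data):
    """Prepend one header per layer, so the lowest layer's header ends up first."""
    pdu = [data]
    for header in HEADERS_TOP_DOWN:
        pdu.insert(0, header)
    return pdu

def decapsulate(pdu):
    """Strip headers from the bottom layer up and return the original data."""
    pdu = list(pdu)
    for header in reversed(HEADERS_TOP_DOWN):
        if pdu[0] != header:
            raise ValueError(f"expected {header}, found {pdu[0]}")
        pdu.pop(0)
    return pdu[0]

frame = encapsulate("DATA")
print(frame)               # ['DH', 'NH', 'TH', 'SH', 'PH', 'AH', 'DATA']
print(decapsulate(frame))  # DATA
```

A network processor typically works on the outermost headers of such a frame, which is why layer 2 to layer 4 parsing dominates its workload.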
Typical Applications
WAN/LAN Switching and Routing, Multiservice Switches, Multi-layer switches, Aggregators
Web caching, Load balancing, Web switching, Content based load balancers
QoS solutions
VoIP Gateways
2.5G and 3G wireless infrastructure equipment
Security - Firewall, VPN, Encryption, Access control
Storage solutions
Residential Gateways
Software Framework
Scene setting - why specs are not enough
2 NPU vendors want to promote their solution with some ‘numbers’
Both chip architectures comprise (commonalities in building blocks)
– RISC engines
– Hardware support engines
– Various types of interfaces
– Support for internal and external memory
They report the following data (commonalities in specifications)
– Aggregate MIPS
– Max number of lookups per second
– ...
Interpretation?
Specifications
                  NPU A       NPU B
Aggregate MIPS    1000        6000
Lookups/s         50M         100M
#Counters         32K         4M
Speedgrade        10Gbps      10Gbps
Performance       wirespeed   wirespeed
Test scenario
What is measured? Performance in packets per second versus a forwarding information base (FIB) that is increased in size.
The starting application is IPv4.
Next, counters are added for per-flow billing purposes.
Next, load balancing is introduced as an additional feature.
Finally, encryption becomes an additional requirement for 2% of the data that is being forwarded.
Performance curves
[Figure: IPv4. Performance (Mpps, axis 10 to 30) versus FIB size (K entries, axis 50 to 150), with one curve each for NPU A and NPU B.]
Performance curves
[Figure: IPv4 + counters. Performance (Mpps, axis 10 to 30) versus FIB size (K entries, axis 50 to 150), with one curve each for NPU A and NPU B. Annotation: requires more memory references.]
Performance curves
[Figure: IPv4 + counters + load balancing. Performance (Mpps, axis 10 to 30) versus FIB size (K entries, axis 50 to 150), with one curve each for NPU A and NPU B. Annotation: requires even more memory references.]
Performance curves
[Figure: IPv4 + counters + load balancing + encryption. Performance (Mpps, axis 10 to 30) versus FIB size (K entries, axis 50 to 150), with one curve each for NPU A and NPU B. Annotations: no extra references and resources available; A does not have sufficient resources.]
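Architecture A
[Figure: block diagram of NPU A. OC-192 POS interfaces in and out, 3 MIPS cores, hardware blocks for key extract, lookup (LU), hash, count and scheduling (Sched), internal memories (Int. mem) and an external buffer memory. The diagram marks the blocks used for IPv4, + counters, + LB and + crypto.]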
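Architecture B
[Figure: block diagram of NPU B. 10GE interfaces in and out, 10 MIPS cores, IMEM, a memory interface, an LB block and an external buffer memory. The diagram marks the blocks used for IPv4, + counters, + LB and + crypto.]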
Specifications - revisited
                  NPU A       NPU B
Aggregate MIPS    1000        6000
Lookups/s         50M         100M
#Counters         32K         4M
Speedgrade        10Gbps      10Gbps
Performance       wirespeed   wirespeed
Lookup width      128-bit     32-bit
I/F technology    POS         Ethernet
Power             12W         20W
Core frequency    300MHz      600MHz
Cost (USD)        800         1500
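A few derived ratios show why NPU B looks stronger on paper. This is a rough sketch using the figures from the table; the 10 Mpps packet rate is a hypothetical operating point chosen for illustration, and the function names are not from the slides.

```python
# Values taken from the specification table.
NPU_A = {"mips": 1000, "power_w": 12, "cost_usd": 800}
NPU_B = {"mips": 6000, "power_w": 20, "cost_usd": 1500}

def instructions_per_packet(npu, packet_rate_mpps):
    """Core instruction budget available for each packet at a given rate."""
    return npu["mips"] / packet_rate_mpps

def cost_per_mips(npu):
    """Purchase cost in USD per MIPS of aggregate core performance."""
    return npu["cost_usd"] / npu["mips"]

RATE_MPPS = 10  # hypothetical packet rate, not a measured value

for name, npu in (("NPU A", NPU_A), ("NPU B", NPU_B)):
    print(name, instructions_per_packet(npu, RATE_MPPS), cost_per_mips(npu))
# NPU A 100.0 0.8
# NPU B 600.0 0.25
```

These ratios count only core instructions. They ignore the work that NPU A offloads to dedicated hardware blocks, which is one reason core-centric numbers alone do not predict the measured curves.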
So
No clear value statement could be made in favor of either NPU solution
– NPU A achieves higher throughput but with limited flexibility
– NPU B achieves lower throughput but is more flexible
Were the provided specs accurate?
– Yes.
– The devices performed up to spec.
– Although NPU B looks better on paper at first sight, more resources have to be consumed for less performant results.
– There is a cost associated with flexibility
Were the provided specs relevant?
– No. They represent granular maximum performances.
– For ‘real world’ applications, some resources could not be maximally consumed while others were over-consumed
Benchmarking considerations
Processor core metrics are not always relevant for networking applications
– They might be relevant for NPU B, since its functionality relies almost totally on those cores.
– They are definitely not relevant for NPU A, since there is extensive additional hardware support for specific functions.
GRANULARITY
Highly granular specifications, data or benchmarking information can give a misleading picture of the actual performance capabilities of the device under test (DUT). Since network processing devices are designed with specific applications in mind, benchmarks must exist for those specific applications.
Benchmarking considerations
External factors affect NPD performance (where you don’t always suspect it)
– A forwarding application relies on FIB lookups to determine the destination of a packet
– The size of the FIB table can influence performance in many ways, for example through the use of multiple memory banks or an increasing number of hash collisions
EXTERNAL FACTORS
Benchmarks should include parameters that take into account external factors that are relevant to the particular applications that are being benchmarked.
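The effect of FIB size on hash collisions can be illustrated with a small simulation. This is a sketch under stated assumptions: a chained hash table with a fixed, hypothetical bucket count and a uniformly random hash, which is not how either NPU in the example necessarily implements its lookups.

```python
import random

def average_probes(num_entries, num_buckets, seed=0):
    """Mean number of chain probes for a successful lookup.

    Entries are placed in buckets uniformly at random. Finding the i-th
    entry in a chain costs i probes, so a chain of length k costs
    1 + 2 + ... + k = k * (k + 1) / 2 probes in total.
    """
    rng = random.Random(seed)
    chain_lengths = [0] * num_buckets
    for _ in range(num_entries):
        chain_lengths[rng.randrange(num_buckets)] += 1
    total_probes = sum(k * (k + 1) // 2 for k in chain_lengths)
    return total_probes / num_entries

BUCKETS = 65536  # hypothetical table size
for fib_entries in (50_000, 100_000, 150_000):
    print(fib_entries, round(average_probes(fib_entries, BUCKETS), 2))
```

Each extra probe is an extra memory reference, so the average cost per lookup grows with the FIB even though the lookup engine itself is unchanged.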
Benchmarking considerations
Interfaces present performance boundary conditions
– Ethernet applications require inter-frame gaps that result in more relaxed pps numbers
INTERFACES
Benchmarks should also specify the types of interfaces that are being used, since those interfaces have an impact all by themselves on maximum performance figures.
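The interface effect can be worked through for minimum-size packets at a nominal 10 Gb/s. The Ethernet figures (64-byte minimum frame, 8 bytes of preamble and start delimiter, 12-byte inter-frame gap) are standard. The POS figures are simplifying assumptions: a 40-byte IP packet with 9 bytes of PPP/HDLC framing, no gap, and the SONET overhead of a real OC-192 link ignored.

```python
def max_pps(line_rate_bps, packet_bytes, overhead_bytes):
    """Maximum packets per second when every packet costs this many bytes on the wire."""
    return line_rate_bps / ((packet_bytes + overhead_bytes) * 8)

LINE_RATE_BPS = 10_000_000_000  # nominal 10 Gb/s

# Ethernet: 64-byte frame plus preamble/SFD (8) and inter-frame gap (12).
ethernet_pps = max_pps(LINE_RATE_BPS, 64, 8 + 12)

# POS (assumed framing): 40-byte IP packet plus 9 bytes of PPP/HDLC overhead.
pos_pps = max_pps(LINE_RATE_BPS, 40, 9)

print(round(ethernet_pps / 1e6, 2))  # 14.88
print(round(pos_pps / 1e6, 2))       # 25.51
```

Under these assumptions a POS device has to sustain well over one and a half times the packet rate of an Ethernet device at the same nominal speed grade, so "wirespeed at 10Gbps" does not mean the same thing for both.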
Benchmarking considerations
Combinations of applications or minor extensions have a completely different impact on both network processing devices
– NPU A has a lot of well engineered hardware support that can offer additional services BUT fails almost completely when additional computing resources are required
– NPU B is very ‘soft’; performance degrades slowly when additional services are requested and shows no abrupt peaks in the performance curves.
HEADROOM
Benchmarks should combine applications as they occur in the real world to give a ‘sense’ of the headroom that is available to support real world scenarios. It is, however, very hard to define a metric for headroom.
CommBench – A Telecommunication Benchmark for NPs
CommBench
HPAs (header-processing applications): RTR, FRAG, DRR, TCP
PPAs (payload-processing applications): CAST, ZIP, REED, JPEG
Benchmark Characteristics – Code & Computational Kernel Sizes
Benchmark Characteristics – Computational Complexity
N_{a,l} – number of instructions per byte required for application a operating on a packet of length l
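The complexity metric can be turned into a processing requirement for a given link. This is a minimal sketch: the instruction count and link rate are hypothetical values, not CommBench measurements, and the link is assumed to be fully loaded with packets of length l.

```python
def instructions_per_byte(instructions_executed, packet_length_bytes):
    """N_{a,l}: instructions per byte for one application and packet length."""
    return instructions_executed / packet_length_bytes

def required_mips(n_al, link_rate_bps):
    """Processing power needed to keep up with a fully loaded link."""
    bytes_per_second = link_rate_bps / 8
    return n_al * bytes_per_second / 1e6

# Hypothetical example: 1280 instructions spent on a 64-byte packet.
n_al = instructions_per_byte(1280, 64)
print(n_al)                                  # 20.0
print(required_mips(n_al, 1_000_000_000))    # 2500.0 MIPS at 1 Gb/s
```

Header-processing applications work on a fixed part of the packet, so their N_{a,l} falls as l grows, while payload-processing applications touch every byte and stay roughly constant.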
Benchmark Characteristics – Instruction Set Characteristics
Benchmark Characteristics – Memory Hierarchy
Example System: Cisco Toaster
10000
Almost all data plane operations execute on the programmable XMC
Pipeline stages are assigned tasks – e.g. classification, routing, firewall, MPLS
– Classic SW load balancing problem
External SDRAM shared by common pipe stages
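The load balancing problem mentioned above can be sketched as splitting an ordered list of tasks across a fixed number of pipeline stages so that the busiest stage is as light as possible. The task costs are hypothetical, the requirement that tasks stay in order is an assumption, and this is a generic partitioning technique, not the method Cisco uses.

```python
def stages_needed(costs, capacity):
    """Greedy count of stages when no stage may exceed the given capacity."""
    stages, load = 1, 0
    for cost in costs:
        if load + cost > capacity:
            stages += 1      # start a new stage with this task
            load = cost
        else:
            load += cost
    return stages

def min_bottleneck(costs, num_stages):
    """Smallest achievable load of the busiest stage, via binary search."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(costs, mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical cycle costs: classification, routing, firewall, MPLS.
task_costs = [40, 70, 30, 20]
print(min_bottleneck(task_costs, 2))  # 110 (classification+routing | firewall+MPLS)
print(min_bottleneck(task_costs, 4))  # 70  (one task per stage)
```

The bottleneck stage sets the packet rate of the whole pipeline, so adding stages helps only until the most expensive single task becomes the limit.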
Example System: IXP 2400
XScale core replaces StrongARM
Microengines
– Faster
– More: 2 clusters of 4 microengines each
Local memory
Next neighbor routes added between microengines
Hardware to accelerate CRC operations and random number generation
16 entry CAM
[Figure: IXP 2400 block diagram with microengines ME0 to ME7, XScale core, DDR DRAM controller, QDR SRAM controller, Scratch/Hash/CSR unit, MSF unit and PCI.]
References
Network Processor Design – Patrick Crowley et al.
CommBench - A Telecommunications Benchmark for Network Processors, Tilman Wolf and Mark Franklin. Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), http://www.ecs.umass.edu/ece/wolf/papers/commbench.pdf
Network Processing Forum - Benchmarking
www.wipro.com/pdf_files/networkprocessors_wipro_solPPT.pdf
http://intrage.insatlse.fr/~etienne/netpro.ppt
