An Efficient Programmable
10 Gigabit Ethernet
Network Interface Card
Paul Willmann, Hyong-youb Kim,
Scott Rixner, and Vijay S. Pai
Designing a 10 Gigabit NIC

Programmability for performance
 - Offloading improves performance
 - Solutions should be efficient: NICs have power, area concerns
 - Above all, must support 10 Gb/s links

Key questions
 - Computation: What are the computation, memory requirements?
 - Architecture: What architecture efficiently meets them?
 - Firmware: What firmware organization should be used?
Mechanisms for an Efficient Programmable 10 Gb/s NIC

A partitioned memory system
 - Low-latency access to control structures
 - High-bandwidth, high-capacity access to frame data

A distributed task-queue firmware
 - Utilizes frame-level parallelism to scale across many simple, low-frequency processors

New RMW instructions
 - Reduce firmware frame-ordering overheads by 50% and reduce clock frequency requirement by 17%
Outline

 - Motivation
 - How Programmable NICs Work
 - Architecture Requirements, Design
 - Frame-parallel Firmware
 - Evaluation
How Programmable NICs Work

[Diagram: the PCI Interface and Ethernet Interface connect to the Processor(s) and Memory over a shared bus; frames and control data cross the NIC between the PCI bus and the Ethernet link]
Per-frame Requirements

            Instructions   Data Accesses
TX Frame    281            101
RX Frame    253            85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions
Aggregate Requirements

10 Gb/s, max-sized frames:

            Instruction   Control Data   Frame Data
            Throughput    Bandwidth      Bandwidth
TX Frame    229 MIPS      2.6 Gb/s       19.75 Gb/s
RX Frame    206 MIPS      2.2 Gb/s       19.75 Gb/s
Total       435 MIPS      4.8 Gb/s       39.5 Gb/s

1514-byte frames at 10 Gb/s = 812,744 frames/s
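The aggregate numbers follow from the per-frame trace counts and the wire-level frame rate. A quick arithmetic check (assuming standard Ethernet per-frame overhead of preamble, inter-frame gap, and CRC, and 32-bit data accesses; these assumptions are not stated on the slide):

```python
# Bytes on the wire per maximum-sized frame:
# 1514 B frame + 8 B preamble + 12 B inter-frame gap + 4 B CRC = 1538 B
WIRE_BYTES = 1514 + 8 + 12 + 4
frames_per_sec = 10e9 / (WIRE_BYTES * 8)       # ~812,744 frames/s at 10 Gb/s

# TX per-frame counts from the dynamic traces: 281 instructions, 101 accesses
tx_mips = 281 * frames_per_sec / 1e6           # ~228 MIPS (slide rounds to 229)
tx_ctrl_gbps = 101 * 4 * 8 * frames_per_sec / 1e9  # ~2.6 Gb/s, assuming 32-bit accesses
```

The RX and total rows follow the same way from the 253-instruction, 85-access RX trace.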
Meeting 10 Gb/s Requirements with Hardware

Processor architecture
 - At least 435 MIPS within an embedded device
 - Does NIC firmware have ILP?

Memory architecture
 - Low-latency control data
 - High-bandwidth, high-capacity frame data
 - ... both, how?
ILP Processors for NIC Firmware?

 - ILP limited by data, control dependences
 - Analysis of dynamic traces reveals dependences

IPC by issue width and branch prediction:

              In-order 1   In-order 2   In-order 4   Out-of-order 1   Out-of-order 2   Out-of-order 4
Perfect BP    0.87         1.19         1.34         1.00             1.96             2.65
Perfect 1BP   0.87         1.19         1.33         1.00             1.74             2.00
No BP         0.87         1.13         1.17         0.88             1.21             1.29
Processors: 1-Wide, In-order

 - Out-of-order 2 vs. in-order 1: 1.74 vs. 0.87 IPC with perfect 1BP (1.21 vs. 0.87 with no BP)
 - 2x performance is costly
   - Branch prediction, reorder buffer, renaming logic, wakeup logic
   - Overheads translate to greater than 2x core power, area costs
   - Great for a GP processor; not for an embedded device
 - Other opportunities for parallelism? YES!
   - Many steps to process a frame - run them simultaneously
   - Many frames need processing - process them simultaneously
   - Use parallel single-issue cores
Memory Architecture

Competing demands
 - Frame data: high bandwidth, high capacity for many offload mechanisms
 - Control data: low latency; coherence among processors, PCI Interface, and Ethernet Interface

The traditional solution: caches
 - Advantages: low latency, transparent to the programmer
 - Disadvantages: hardware costs (tag arrays, coherence)
 - In many applications, advantages outweigh costs
Are Caches Effective?

[Chart: hit ratio (percent) vs. cache size, 16 B to 32 KB; the 6-processor hit ratio stays below 60% even at 32 KB]

SMPCache trace analysis of a 6-processor NIC architecture
Choosing a Better Organization

[Diagram: a conventional cache hierarchy contrasted with a partitioned memory organization]
Putting it All Together

[Diagram: P CPUs, each with a private I-cache fed from a shared instruction memory, connect through a (P+4)x(S) 32-bit crossbar to S scratchpads, the PCI Interface (to the PCI bus), the Ethernet Interface, and an external memory interface to off-chip DRAM]
Parallel Firmware

 - NIC processing steps already well-defined
 - Previous Gigabit NIC firmware divides steps between 2 processors
 - ... but does this mechanism scale?
Task Assignment with an Event Register

[Diagram: an event register with a PCI Read bit, a SW Event bit, and other bits; the PCI Interface sets its bit when it finishes work, processors inspect the register to enqueue transactions, and processors pass TX data on to the Ethernet Interface]
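The dispatch idea behind the event register can be sketched as follows (Python as pseudocode for the firmware logic; the bit layout and handler names are illustrative, not the NIC's actual encoding):

```python
# Hypothetical event-register layout: one bit per pending task type.
PCI_READ_BIT = 1 << 0   # PCI Interface finished a read transfer
SW_EVENT_BIT = 1 << 1   # software-raised event

def dispatch(event_register, handlers):
    """Run the handler for each bit set in the event register.

    `handlers` maps a bit mask to a function; this models how a
    processor inspects the shared register to pick up pending work."""
    for bit, handler in handlers.items():
        if event_register & bit:
            handler()
            event_register &= ~bit   # clear the serviced bit
    return event_register

done = []
remaining = dispatch(PCI_READ_BIT | SW_EVENT_BIT,
                     {PCI_READ_BIT: lambda: done.append("pci_read"),
                      SW_EVENT_BIT: lambda: done.append("sw_event")})
```

In hardware the register bits are set by the interfaces themselves; the sketch only shows the inspect-and-clear loop each processor runs.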
Task-level Parallel Firmware

[Timeline: under task-level parallelism, Proc 0 transfers DMAs 0-4 and then 5-9 while Proc 1 processes each batch in turn; the PCI Read hardware-status bit gates each handoff, so each processor spends part of every interval idle]
Frame-level Parallel Firmware

[Timeline: under frame-level parallelism, Proc 0 transfers DMAs 0-4, builds an event, and processes DMAs 0-4, while Proc 1 does the same for DMAs 5-9 in parallel, eliminating most of the idle time]
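The contrast with the previous slide can be sketched as worker loops: instead of splitting steps between processors, each processor takes a whole batch of frames and runs every step for it (a minimal sketch; the queue, batch sizes, and step names are illustrative, not the firmware's actual structures):

```python
from queue import Queue
from threading import Thread

def worker(tasks: Queue, results: list):
    """Frame-parallel worker: pull an independent batch of frames and run
    all steps (transfer, then process) for that batch end to end."""
    while True:
        batch = tasks.get()
        if batch is None:                 # sentinel: no more work
            break
        transferred = list(batch)                  # "transfer DMAs"
        results.append(("processed", transferred)) # "process DMAs"

tasks, results = Queue(), []
for batch in ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]):  # DMAs 0-4 and 5-9
    tasks.put(batch)
tasks.put(None); tasks.put(None)                  # one sentinel per worker
threads = [Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

Because batches are independent, adding workers scales throughput; the ordering problem this creates is what the RMW instructions later address.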
Evaluation Methodology

Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
 - Memory latency, bandwidth, contention modeled precisely
 - Processors modeled in detail
 - NIC I/O (PCI, Ethernet Interfaces) modeled in detail
 - Verified when modeling the Tigon 2 Gigabit NIC (LCTES 2004)

Idea: model everything inside the NIC
 - Gather performance, trace data
Scaling in Two Dimensions

[Chart: throughput (Gb/s) vs. core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors; throughput scales with both processor count and frequency, up to the Ethernet limit]
Processor Performance

IPC breakdown by component:

Execution                    0.72
Miss Stalls                  0.01
Load Stalls                  0.12
Scratchpad Conflict Stalls   0.05
Pipeline Stalls              0.10
Total                        1.00

 - Achieves 83% of theoretical peak IPC
 - Small I-caches work
 - Sensitive to memory stalls: half of loads are part of a load-to-use sequence
 - Conflict stalls could be reduced with more ports, more banks
Reducing Frame Ordering Overheads

 - Firmware ordering is costly - 30% of execution
 - Synchronization, bitwise checks/updates occupy processors, memory
 - Solution: atomic bitwise operations that also update a pointer according to the last set location
Maintaining Frame Ordering

[Diagram: a frame status array with one bit per frame index; CPUs A and B prepare frames and set status bits, while CPU C detects completed frames by taking a lock, iterating over the bits, notifying the hardware (Ethernet Interface), and unlocking]
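The software-only scan in the diagram can be sketched as follows (Python as pseudocode for the firmware logic; the lock, array layout, and function name are illustrative). The proposed RMW instruction collapses the set-bit-then-advance-pointer sequence into a single atomic operation, removing the lock-and-iterate work:

```python
from threading import Lock

status = [0] * 8        # frame status array: bit i set when frame i is prepared
next_to_send = 0        # pointer to the oldest frame not yet handed to hardware
lock = Lock()

def mark_ready_and_advance(i):
    """Set frame i's status bit, then scan forward from the pointer over
    contiguous ready frames; those can be passed to the Ethernet Interface
    in order. This whole critical section is what the RMW instruction
    replaces with one atomic bitwise-set-and-update-pointer operation."""
    global next_to_send
    with lock:                       # serializes every CPU on every update
        status[i] = 1
        sendable = []
        while next_to_send < len(status) and status[next_to_send]:
            sendable.append(next_to_send)   # notify hardware of this frame
            next_to_send += 1
        return sendable

assert mark_ready_and_advance(1) == []       # frame 0 not ready: hold frame 1
assert mark_ready_and_advance(0) == [0, 1]   # now both go out, in order
```

Frames finish out of order under frame-level parallelism, but the pointer only advances over a contiguous run of set bits, so transmission order is preserved.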
RMW Instructions Reduce Clock Frequency

 - Performance: 6x166 MHz with RMW = 6x200 MHz software-only
   - Performance is equivalent at all frame sizes
   - 17% reduction in frequency requirement
 - Dynamically tasked firmware balances the benefit
   - Send cycles reduced by 28.4%
   - Receive cycles reduced by 4.7%
Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
 - Data memory system: partitioned organization, not coherent caches
 - Processor architecture: parallel scalar processors
 - Firmware: frame-level parallel organization
 - RMW instructions: reduce ordering overheads

A programmable NIC: a substrate for offload services
Comparing Frame Ordering Methods

[Chart: full-duplex throughput (Gb/s) vs. UDP datagram size (0-1400 bytes); the 6x166 MHz RMW-enhanced configuration matches the 6x200 MHz software-only configuration across datagram sizes, both approaching the duplex Ethernet limit]