Exploiting Task-level Concurrency in
a Programmable Network Interface
Hyong-youb Kim, Vijay S. Pai, and Scott Rixner
Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
June 11, 2003
Why a Programmable Network Interface?
• More complex functionality on the network interface
– TCP offload, iSCSI, etc.
• Easy maintenance
– Bug fixes, upgrades, customization, etc.
• Performance?
– 51% less web server throughput than an ASIC NIC
– Big problem
Improving Performance
• Increase clock speed and/or complexity
– Typical solutions for general-purpose processors
– Do not work for embedded processors
• Design constraints: limited power and area
– Power proportional to C·V²·f
– Higher f requires higher V
– Thus, power roughly proportional to f³
– Complexity increases C only for marginal gains
• Implication: simple, low-frequency processor
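The f³ scaling argument above can be checked with a quick calculation. The sketch below is illustrative only (the function name and the factor-of-two comparison are not from the slides); it shows why two slower cores beat one faster core under a power budget.

```python
# Dynamic power model from the slide: P = C * V^2 * f.
# If supply voltage must scale roughly with frequency (V ~ f),
# then P ~ C * f^3 for a fixed design.

def relative_power(freq_ratio, num_cores=1, cap_ratio=1.0):
    """Power relative to a single baseline core, assuming P ~ C * f^3."""
    return num_cores * cap_ratio * freq_ratio ** 3

# One core at double the clock vs. two cores at the original clock:
doubled_clock = relative_power(2.0)               # 8x baseline power
two_cores = relative_power(1.0, num_cores=2)      # 2x baseline power
print(doubled_clock, two_cores)
```

Doubling the clock costs roughly 8x the power, while doubling the core count costs only 2x, which is the case for parallelism made on the next slide.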
Use Parallel Programming
• Use multiple programmable cores
– Increase computational capacity
– Achieve performance within the power limit
• Consume far less power than a higher-frequency core
• Improvements with two cores over a single core
– 65-157% for bidirectional traffic
– 27-51% for web server workloads
– Web server throughput comparable to ASIC NICs
Outline
• Background
– Tigon Programmable Gigabit Ethernet Controller
– Network Interface Processing: Send/Receive
• Parallelization of Firmware
• Experimental Results
• Conclusion
Tigon Gigabit Ethernet Controller
• Two programmable cores
– Based on MIPS, running at 88 MHz
– Small on-chip memory (scratch pad) per core
• Shared off-chip SRAM
• Supports event-driven firmware
• No interrupts
– Event handlers run to completion
– Handlers on the same core require no synchronization
• Released firmware fully utilizes only one core
• No previous Ethernet firmware utilizes both cores
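The run-to-completion event model described above can be sketched as a simple dispatch loop. This is a minimal illustration, not the Tigon's actual firmware; all names (handler functions, the event list) are hypothetical. The key property is that handlers on one core never preempt each other, so they need no locks between them.

```python
# Minimal sketch of run-to-completion event dispatch in the style the
# slides describe for the Tigon: no interrupts, no preemption.
# All handler and event names below are illustrative.

def make_dispatcher(handlers):
    """handlers: dict mapping event name -> handler function."""
    def dispatch(pending_events):
        log = []
        for event in pending_events:   # poll pending events in order
            handlers[event](log)       # each handler runs to completion
        return log
    return dispatch

def on_send_data_ready(log):
    log.append("fetched send descriptor")

def on_dma_read_complete(log):
    log.append("started transmit")

dispatch = make_dispatcher({
    "Send Data Ready": on_send_data_ready,
    "DMA Read Complete": on_dma_read_complete,
})
print(dispatch(["Send Data Ready", "DMA Read Complete"]))
```

Because only one handler runs at a time on a core, shared state touched exclusively by same-core handlers needs no synchronization; synchronization is only needed once handlers are split across two cores.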
Send Processing
[Diagram: send processing across the CPU, main memory, bridge, PCI bus, network interface card, and network, using memory-mapped I/O, DMA, and interrupts. Tigon events: Send Data Ready, Mailbox, Send Buffer Descriptor Ready, DMA Read Complete, Update Send Consumer, DMA Write Complete.]
1. Host creates a buffer descriptor in main memory
2. Alert: descriptor produced (mailbox write via memory-mapped I/O)
3. NIC fetches the buffer descriptor (DMA)
4. NIC transfers the packet (DMA)
5. NIC transmits the packet on the network
6. Alert: descriptor consumed (interrupt)
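The six send steps above can be walked through in a toy model. The data structures below (plain dicts standing in for main memory and NIC state) are illustrative only, not the Tigon's actual descriptor layout.

```python
# Toy walk-through of the six send-processing steps from the slide.
# All structures here are illustrative, not the Tigon's real layout.

def send_packet(main_memory, nic, packet):
    # 1. Host creates a buffer descriptor in main memory.
    desc = {"len": len(packet)}
    main_memory["descriptors"].append(desc)
    # 2. Host alerts the NIC via a mailbox write (memory-mapped I/O).
    nic["mailbox"] += 1
    # 3. NIC fetches the buffer descriptor (DMA read).
    fetched = main_memory["descriptors"][-1]
    # 4. NIC transfers the packet payload into its buffer (DMA read).
    nic["tx_buffer"] = packet[:fetched["len"]]
    # 5. NIC transmits the packet on the wire.
    wire = nic["tx_buffer"]
    # 6. NIC alerts the host that the descriptor was consumed (interrupt).
    nic["consumed"] += 1
    return wire

mem = {"descriptors": []}
nic = {"mailbox": 0, "tx_buffer": None, "consumed": 0}
print(send_packet(mem, nic, b"payload"))
```

Note that steps 1-2 run on the host while 3-6 run on the NIC; in the firmware each NIC-side step is driven by one of the Tigon events listed above.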
Receive Processing: Pre-allocation
[Diagram: receive buffer pre-allocation across the CPU, main memory, bridge, PCI bus, network interface card, and network, using memory-mapped I/O and DMA. Tigon events: Receive Buffer Descriptor Ready, Mailbox, DMA Read Complete.]
1. Host allocates a receive buffer
2. Host creates a buffer descriptor
3. Alert: buffer descriptor produced (mailbox write via memory-mapped I/O)
4. NIC fetches the produced buffer descriptor (DMA)
Receive Processing: Actual Receive
[Diagram: packet reception across the CPU, main memory, bridge, PCI bus, network interface card, and network, using DMA and interrupts. Tigon events: Receive Complete, DMA Write Complete, Update Receive Return Producer.]
1. NIC receives a packet from the network
2. NIC creates a buffer descriptor
3. NIC transfers the packet (DMA)
4. Packet stored in main memory
5. Alert: buffer descriptor produced (interrupt)
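The two receive slides above can be combined into one toy model: the host pre-allocates buffers and descriptors, then the NIC fills one on each packet arrival. As with the send sketch, every name and structure here is illustrative, not the Tigon's real layout.

```python
# Toy model of the receive path: pre-allocation (steps 1-4 above),
# then actual receive (steps 1-5). All structures are illustrative.

def preallocate(main_memory, nic, count):
    for _ in range(count):
        buf = bytearray(1514)                          # 1. allocate a receive buffer
        main_memory["rx_descriptors"].append(buf)      # 2. create its buffer descriptor
    nic["mailbox"] += 1                                # 3. alert: descriptors produced
    nic["rx_ring"] = list(main_memory["rx_descriptors"])  # 4. NIC fetches descriptors

def receive(nic, packet):
    buf = nic["rx_ring"].pop(0)      # 1-2. packet arrives; take a pre-allocated descriptor
    buf[:len(packet)] = packet       # 3-4. DMA the packet into the host buffer
    nic["interrupts"] += 1           # 5. alert: buffer descriptor produced (interrupt)
    return bytes(buf[:len(packet)])

mem = {"rx_descriptors": []}
nic = {"mailbox": 0, "rx_ring": [], "interrupts": 0}
preallocate(mem, nic, 4)
print(receive(nic, b"incoming"))
```

Pre-allocation exists so the NIC never has to wait for the host on the critical receive path; it simply consumes the next ready descriptor.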
Tigon Uniprocessor Performance
[Graph: maximum UDP throughput for the Intel PRO/1000 MT, the Netgear 622T, and the Tigon with uniprocessor firmware. Maximum throughput decreases due to network headers and per-frame overhead; the Intel NIC achieves 100% higher throughput than the Tigon.]
Outline
• Background
• Parallelization of Firmware
– Principles
– Resource Sharing Patterns
– Partitioning Process
• Experimental Results
• Conclusion
Principles
• Identify the unit of concurrency
– Event handler
• Analyze resource sharing patterns
• Profile uniprocessor firmware
• Partition event handlers so as to
– Balance load
– Minimize synchronization
– Maximize on-chip memory utilization
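The load-balancing part of these principles can be sketched as a greedy partitioner over the profiled handler costs. This is an illustrative heuristic only, not the authors' actual partitioning method: it ignores the synchronization and scratch-pad criteria, which is why its split differs slightly from the 47%/53% partition the talk arrives at. The percentages are the profile from the partitioning-process slide.

```python
# Illustrative greedy load balancer: assign each profiled handler to
# the currently less-loaded core. A sketch of the "balance load"
# principle only; the slides also minimize synchronization and
# maximize on-chip memory use, which this ignores.

def balance(handler_costs):
    loads = [0, 0]
    assignment = {}
    for name, cost in sorted(handler_costs.items(), key=lambda kv: -kv[1]):
        core = 0 if loads[0] <= loads[1] else 1   # pick the lighter core
        assignment[name] = core
        loads[core] += cost
    return assignment, loads

profile = {  # handler costs (% of CPU time) from the profiling slide
    "Send Data Ready": 31, "Receive Complete": 30, "DMA Read Complete": 14,
    "DMA Write Complete": 6, "Receive Buffer Descriptor Ready": 6,
    "Update Receive Return Producer": 5, "Mailbox": 4,
    "Send Buffer Descriptor Ready": 3, "Update Send Consumer": 1,
}
assignment, loads = balance(profile)
print(sorted(loads))
```

Greedy balancing alone gets within a few percent of even; the talk's final partition trades a little balance for less cross-core sharing.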
Resource Sharing Patterns
[Diagram: the nine event handlers — Mailbox, Receive Buffer Descriptor Ready, DMA Write Complete, Send Buffer Descriptor Ready, DMA Read Complete, Update Receive Return Producer, Receive Complete, Update Send Consumer, Send Data Ready — connected through the resources they share: shared data objects, the shared DMA read channel, and the shared DMA write channel.]
Partitioning Process
[Diagram: profiled cost of each handler — Receive Complete 30%, Send Data Ready 31%, DMA Read Complete 14%, DMA Write Complete 6%, Receive Buffer Descriptor Ready 6%, Update Receive Return Producer 5%, Mailbox 4%, Send Buffer Descriptor Ready 3%, Update Send Consumer 1% — with the shared data objects, shared DMA read channel, and shared DMA write channel linking them. The two heaviest handlers seed the two cores (Receive Complete on CPU A at 30%, Send Data Ready on CPU B at 31%); the remaining handlers are then assigned step by step, growing CPU A's load through 41% and 47% toward the final 47%/53% split.]
Final Partition
CPU A (47%): Receive Complete, DMA Write Complete, Receive Buffer Descriptor Ready, Update Receive Return Producer
CPU B (53%): Send Data Ready, DMA Read Complete, Mailbox, Send Buffer Descriptor Ready, Update Send Consumer
Remaining shared resources: shared data objects, shared DMA read channel, shared DMA write channel
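The final split can be cross-checked against the profile from the previous slide. The grouping encoded below follows the receive-path/send-path division consistent with the 47%/53% totals (the slide's two-column layout did not survive extraction cleanly, so the column membership is inferred from those sums).

```python
# Cross-check of the final partition: summing the profiled handler
# costs per core should reproduce the 47% / 53% totals on the slide.

profile = {
    "Receive Complete": 30, "DMA Write Complete": 6,
    "Receive Buffer Descriptor Ready": 6, "Update Receive Return Producer": 5,
    "Send Data Ready": 31, "DMA Read Complete": 14, "Mailbox": 4,
    "Send Buffer Descriptor Ready": 3, "Update Send Consumer": 1,
}
cpu_a = ["Receive Complete", "DMA Write Complete",
         "Receive Buffer Descriptor Ready", "Update Receive Return Producer"]
cpu_b = ["Send Data Ready", "DMA Read Complete", "Mailbox",
         "Send Buffer Descriptor Ready", "Update Send Consumer"]

load_a = sum(profile[h] for h in cpu_a)   # receive-side handlers
load_b = sum(profile[h] for h in cpu_b)   # send-side handlers
print(load_a, load_b)
```

Keeping the receive path on one core and the send path on the other is what lets the partition stay near-balanced while minimizing the data shared across cores.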
Outline
• Background
• Parallelization of Firmware
• Experimental Results
– Improved Maximum Throughput
– Improved Web Server Throughput
• Conclusion
Experimental Setup
• Network interface card
– 3Com 710024 Gigabit Ethernet interface card based on the Tigon
• Firmware versions
– Uniprocessor firmware: 12.4.13 from the original manufacturer
– Parallel firmware: modified version of 12.4.13
• Benchmarks
– UDP bidirectional, unidirectional, and ping traffic
– Web server (thttpd) and software router (Click)
• Testbed
– PCs with AMD Athlon 2600+ CPUs and 2 GB RAM
– FreeBSD 4.7
Overall Improvements
[Graph: bidirectional UDP throughput with parallel versus uniprocessor firmware; the two-core firmware improves throughput by 65% to 157%.]
Sources of Improvements
70%improvement
improvement
37%
due
scratch
pads
due
toto
two
processors
19
Comparison to ASIC NICs
[Graph: maximum UDP throughput of the Intel PRO/1000 MT (Intel ASIC, 2002), the Netgear GA622T (National Semiconductor ASIC, 2001), and the 3Com 710024 (Tigon, 1997) with parallel firmware and with uniprocessor firmware. With the parallel firmware, the Intel NIC is only 21% faster than the Tigon.]
Impact on Web Server Throughput
[Graph: web server throughput with each NIC; the parallel firmware improves throughput by 27-51% overall, making the Tigon comparable to the ASIC NICs.]
Parallelization Makes Programmability Viable
• Programmability is useful for complex functions
• Embedded processors have limited clock speeds
– Limited uniprocessor performance
• Use multiple cores to improve performance
• Two cores vs. a single core
– 65% increase in maximum throughput
– 51% increase in web server throughput
– Web server throughput comparable to ASIC NICs
[Backup slides, graphs only: UDP Send: Overall Improvements; UDP Send: Sources of Improvements; UDP Receive: Overall Improvements; UDP Receive: Sources of Improvements; UDP Ping: Overall Improvements; UDP Ping: Sources of Improvements; Impact on Routing Throughput.]