ATM and Fast Ethernet Network Interfaces for User-Level Communication
Matt Welsh
Department of Computer Science
Cornell University
mdw@cs.cornell.edu, http://www.cs.cornell.edu/Info/Projects/U-Net/
Joint work with Anindya Basu and Thorsten von Eicken
HPCA-3, San Antonio, 5 February 1997
ATM and Fast Ethernet NIs for User-Level Communication
Matt Welsh, Cornell University

User-level Network Interfaces: Motivation
Motivation 1: Performance
• Utilize high-speed nets
• Support parallel processing on networks of workstations (NoWs)
• Finer comm granularity
⇒ That means: Low latency and high bandwidth
Motivation 2: Flexibility
• Variants of standard protocols
• New communication semantics
⇒ That means: Abandon in-kernel protocol stacks
Proposed solution: User-level network access
• Provide minimal interface enabling communication
• Application implements protocols directly
• Need to ensure protection between processes

U-Net: Basic Idea
Generic communication architecture
- Can be implemented in hardware, software, or both
Traditional:
- All communication via kernel
U-Net:
- Applications send/recv directly via simple MUX in NI
- Kernel only involved in connection set-up/shut-down
Invariants:
- Off-the-shelf hardware and software
- No compromise on protection: A cannot inspect or corrupt B's messages, and A cannot impersonate B

Inspiration: MPP Systems
Key idea: User-level access to NI
• Examples: TMC CM-5, IBM SP-2, Meiko CS-2
Advantages
• Bypasses the kernel for send/recv
• No copy: DMA direct to/from user memory
• Application-specific protocols
Shortcomings
• Custom network and NI
• Assumes homogeneous nodes
• Restricts multi-user capabilities

Overview
Summary: Explore the design space of user-level NIs
- What is the hardware/software tradeoff?
- Does user-level communication require expensive/complex NIs?
Our method:
Comparing implementations of U-Net
- NIs with and without a programmable co-processor
- Explore Fast Ethernet as an alternative to ATM for commodity interconnect
Detailed performance analysis
- Careful instrumentation of U-Net/FE implementation
- Micro-benchmarks for latency/bandwidth performance
Application performance
- Set of parallel benchmarks measured over FE and ATM workstation clusters

The U-Net Interface
U-Net Endpoint: Virtual device interface
- Message buffers and send/recv/free queues
Transmit operation:
- User constructs msg in buffer area, pushes Tx descriptor onto send queue
Receive operation:
- Msg arrives, data is placed in a buffer taken from the free queue, Rx descriptor is pushed onto the recv queue
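The endpoint's send, receive, and free queues can be pictured as fixed-size descriptor rings in memory shared by the application and the NI. The C sketch below is a single-threaded model, not U-Net's actual data layout: the descriptor fields, the ring size, the 4 KB buffer area, and all names (unet_ring, endpoint_send, ...) are hypothetical, and a real shared ring would also need memory barriers between filling a slot and publishing the new head index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: where a message lives in the buffer area. */
typedef struct {
    uint32_t offset;   /* byte offset into the endpoint buffer area */
    uint32_t length;   /* message length in bytes */
} unet_desc;

#define RING_SIZE 8    /* assumed capacity; a power of two */

/* Single-producer/single-consumer ring with free-running counters:
 * the producer only advances head, the consumer only advances tail. */
typedef struct {
    unet_desc slots[RING_SIZE];
    uint32_t head;     /* next slot to write */
    uint32_t tail;     /* next slot to read */
} unet_ring;

static bool ring_push(unet_ring *r, unet_desc d) {
    if (r->head - r->tail == RING_SIZE)
        return false;                      /* full */
    r->slots[r->head % RING_SIZE] = d;
    r->head++;
    return true;
}

static bool ring_pop(unet_ring *r, unet_desc *out) {
    if (r->head == r->tail)
        return false;                      /* empty */
    *out = r->slots[r->tail % RING_SIZE];
    r->tail++;
    return true;
}

/* Hypothetical endpoint: a buffer area plus the three queues. */
typedef struct {
    unsigned char buffers[4096];
    unet_ring send_q, recv_q, free_q;
} unet_endpoint;

/* Transmit side: build the message in the buffer area, then push a
 * Tx descriptor onto the send queue. Out-of-range arguments are a
 * caller bug here, hence the assert. */
static bool endpoint_send(unet_endpoint *ep, uint32_t offset,
                          const void *msg, uint32_t len) {
    assert(offset <= sizeof ep->buffers &&
           len <= sizeof ep->buffers - offset);
    memcpy(ep->buffers + offset, msg, len);
    unet_desc d = { offset, len };
    return ring_push(&ep->send_q, d);
}
```

Free-running head and tail counters make "full" (head - tail == RING_SIZE) and "empty" (head == tail) unambiguous without sacrificing a slot.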

U-Net ATM Implementation
• Original implementation of U-Net
• Programmable co-processor, ATM as "obvious choice" for interconnect
FORE Systems PCA-200
- PCI-bus 155 Mbps OC-3 ATM NI
- 25 MHz i960, 256K SRAM
- Pentium 133 workstations, Linux 1.3.97
• U-Net implemented on i960
• Tx/Free rings mapped from i960 RAM
• Buffers, Rx ring in pinned memory segments, so they are always DMA-able by the i960
• No OS or host CPU intervention in Tx/Rx
[Figure: U-Net endpoint in user address space; FORE PCA-200 ATM interface with 25 MHz i960 and 256K SRAM]

U-Net Fast Ethernet Implementation
DECchip 21140 FE controller
- 100 Mbps UTP5 or fiber
- PCI busmastering interface
- But not programmable
- Low cost: $150/board
- Pentium 133 workstations, Linux 1.3.97
• Single, shared Tx and Rx rings, buffer pool
• Assumes a single OS agent to mux the queues
• U-Net implemented in kernel trap and interrupt routines

Transmit Operation
U-Net/ATM:
1. User constructs data in buffer region
2. User pushes Tx descriptor into Tx ring
3. i960 polls Tx rings, fetches descriptor
4. i960 initiates DMA to fiber output
5. i960 sets Tx descriptor done flag
U-Net/Fast Ethernet:
1. User constructs data
2. User pushes Tx descriptor
3. User calls trap
4. Trap pushes descriptor to device Tx ring
5. On Tx done, trap sets Tx descriptor done flag
[Figure: U-Net endpoint in user address space; FORE PCA-200 ATM interface with 25 MHz i960 and 256K SRAM]
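On U-Net/ATM, step 3 has the i960 polling the Tx rings of all endpoints. A minimal C sketch of one polling pass follows; the round-robin policy of servicing at most one descriptor per endpoint per pass is our assumption (the slide only says the i960 polls the rings), and start_dma is a hypothetical stand-in that merely counts bytes instead of programming a DMA engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENDPOINTS 3   /* assumed number of active endpoints */
#define TX_RING_SIZE 4    /* assumed per-endpoint ring size */

typedef struct {
    uint32_t offset, length;
    bool done;            /* the "Tx descr done flag" set by the NI */
} tx_desc;

typedef struct {
    tx_desc slots[TX_RING_SIZE];
    uint32_t head;        /* advanced by the application */
    uint32_t tail;        /* advanced by the NI after transmitting */
} tx_ring;

/* Hypothetical stand-in for "initiate DMA to fiber output". */
static uint64_t bytes_sent;
static void start_dma(const tx_desc *d) { bytes_sent += d->length; }

/* One polling pass: service at most one descriptor per endpoint so a
 * busy endpoint cannot starve the others. Returns descriptors serviced. */
static int poll_tx_rings(tx_ring rings[], int n) {
    int serviced = 0;
    for (int i = 0; i < n; i++) {
        tx_ring *r = &rings[i];
        if (r->head == r->tail)
            continue;                              /* nothing queued */
        tx_desc *d = &r->slots[r->tail % TX_RING_SIZE];
        start_dma(d);                              /* step 4 */
        d->done = true;                            /* step 5 */
        r->tail++;
        serviced++;
    }
    assert(serviced <= n);
    return serviced;
}
```

On U-Net/FE the same work is done by the kernel trap instead, on behalf of the calling process only.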

Receive Operation
U-Net/ATM:
1. AAL5 PDU cells arrive at fiber input
2. i960 fetches free buffer descriptor
3. i960 initiates DMA to free buffer
4. At end of PDU, i960 writes Rx descriptor
5. User polls Rx FIFO, or receives an upcall
U-Net/Fast Ethernet:
1. FE packet arrives, interrupt raised
2. Interrupt handler fetches free buffer descriptor
3. Interrupt handler copies from device buffer to user buffer
4. Interrupt handler writes Rx descriptor into Rx FIFO
5. User polls Rx FIFO, or receives an upcall
[Figure: U-Net endpoint in user address space; FORE PCA-200 ATM interface with 25 MHz i960 and 256K SRAM]
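The Fast Ethernet receive path (steps 2 to 5) can be modeled as follows. This is a sketch under stated assumptions: the queue and buffer sizes, the rx_desc layout, and the function names are hypothetical, and dropping a message when the endpoint has no free buffer (or a full receive queue) is our assumed policy, not something the slide specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define Q_SIZE 4          /* assumed queue capacity */
#define BUF_SIZE 2048     /* assumed size of each receive buffer */

typedef struct { uint32_t buf; uint32_t length; } rx_desc;

/* Hypothetical endpoint state; indices are free-running counters. */
typedef struct {
    unsigned char buffers[Q_SIZE][BUF_SIZE];
    uint32_t free_q[Q_SIZE]; uint32_t free_head, free_tail;
    rx_desc  recv_q[Q_SIZE]; uint32_t recv_head, recv_tail;
    uint32_t dropped;
} endpoint;

/* Application hands a buffer (by index) back to the free queue. */
static void user_return_buffer(endpoint *ep, uint32_t b) {
    assert(b < Q_SIZE && ep->free_head - ep->free_tail < Q_SIZE);
    ep->free_q[ep->free_head++ % Q_SIZE] = b;
}

/* Interrupt handler, steps 2-4: take a free buffer, copy the packet out
 * of the kernel's device buffer, publish an Rx descriptor. */
static bool rx_interrupt(endpoint *ep, const void *pkt, uint32_t len) {
    assert(len <= BUF_SIZE);
    if (ep->free_head == ep->free_tail ||
        ep->recv_head - ep->recv_tail == Q_SIZE) {
        ep->dropped++;                                  /* assumed policy */
        return false;
    }
    uint32_t b = ep->free_q[ep->free_tail++ % Q_SIZE];  /* step 2 */
    memcpy(ep->buffers[b], pkt, len);                   /* step 3 */
    ep->recv_q[ep->recv_head++ % Q_SIZE] = (rx_desc){ b, len }; /* step 4 */
    return true;
}

/* Step 5: the application polls its receive queue. */
static bool user_poll(endpoint *ep, rx_desc *out) {
    if (ep->recv_head == ep->recv_tail)
        return false;
    *out = ep->recv_q[ep->recv_tail++ % Q_SIZE];
    return true;
}
```

The memcpy in rx_interrupt is the receive-side copy that makes copy time dominate on U-Net/FE; on the PCA-200 the i960 instead DMAs directly into the user buffer.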

U-Net/FE Transmit operation
Fast trap to start transmit: 4.2 usec for any packet size
- Null trap: 1 usec
- PCI access time dominates
- Trap semantics are 'service U-Net Tx queue'
- Trap seen as a 'protected co-routine'
- Takes a (small) slice of main CPU time to mux U-Net
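Because the DC21140 has a single shared Tx ring, the 'service U-Net Tx queue' trap is where per-endpoint send queues are multiplexed onto the device, and where protection must be enforced. The C sketch below is hypothetical in structure and names; in particular the bounds check (each descriptor must lie inside the caller's own buffer area and fit in one Ethernet frame) is our assumption about what such a trap has to verify, not code from the U-Net/FE implementation.

```c
#include <assert.h>
#include <stdint.h>

#define EP_RING 8          /* assumed per-endpoint send-queue size */
#define DEV_RING 16        /* assumed size of the shared device Tx ring */
#define MAX_FRAME_DATA 1500

typedef struct { uint32_t offset, length; } desc;

typedef struct {
    uint64_t buf_base;     /* start of this endpoint's pinned buffer area */
    uint32_t buf_size;
    desc     send_q[EP_RING];
    uint32_t head, tail;   /* user advances head, trap advances tail */
} endpoint;

typedef struct { uint64_t addr; uint32_t length; } dev_desc;

typedef struct {
    dev_desc slots[DEV_RING];
    uint32_t head, tail;   /* trap advances head, the NIC consumes at tail */
} device_ring;

/* Move pending descriptors from the caller's send queue to the shared
 * device ring, rejecting any that reference memory outside the caller's
 * buffer area. Returns the number handed to the device. */
static int trap_service_tx(endpoint *ep, device_ring *dev, int *rejected) {
    int queued = 0;
    while (ep->tail != ep->head) {
        if (dev->head - dev->tail == DEV_RING)
            break;                         /* device ring full: retry later */
        desc d = ep->send_q[ep->tail % EP_RING];
        ep->tail++;
        /* Protection check, written to avoid overflow in offset + length. */
        if (d.length > MAX_FRAME_DATA || d.offset > ep->buf_size ||
            d.length > ep->buf_size - d.offset) {
            (*rejected)++;
            continue;
        }
        dev->slots[dev->head++ % DEV_RING] =
            (dev_desc){ ep->buf_base + d.offset, d.length };
        queued++;
    }
    assert(dev->head - dev->tail <= DEV_RING);
    return queued;
}
```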

U-Net/FE Receive operation
Interrupt handler on Rx; copy time dominates
- Msg arrives in a fixed buffer pool in the kernel and is then copied to the user
Mux/Demux:
- U-Net 'protocol ID' in Ethernet header, plus 'channel number' and length field
- Need to integrate with IP/packet filtering
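A receive-side demultiplexer along these lines could parse each frame as sketched below. The slide does not give the field layout, so everything here is assumed: the U-Net protocol ID is taken to occupy the EtherType field, with a 2-byte channel number and a 2-byte length immediately after the 14-byte Ethernet header, all big-endian. UNET_ETHERTYPE is set to 0x88B5, one of the IEEE 802 local-experimental EtherTypes, purely as a placeholder; the channel table and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNET_ETHERTYPE 0x88B5u   /* placeholder, NOT the value U-Net used */
#define ETH_HDR_LEN 14
#define UNET_HDR_LEN 4           /* assumed: 2-byte channel + 2-byte length */
#define MAX_CHANNELS 64

/* Filled in by the kernel at connection set-up: channel -> endpoint id,
 * or -1 if the channel is unused. */
static int channel_to_endpoint[MAX_CHANNELS];

static void demux_init(void) {
    for (int i = 0; i < MAX_CHANNELS; i++)
        channel_to_endpoint[i] = -1;
}

static uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

enum { DEMUX_NOT_UNET = -1, DEMUX_BAD = -2 };

/* Returns the destination endpoint id, DEMUX_NOT_UNET for frames that
 * belong to the normal stack (e.g. IP), or DEMUX_BAD for malformed or
 * unknown-channel U-Net frames. */
static int unet_demux(const uint8_t *frame, size_t frame_len,
                      const uint8_t **payload, uint16_t *payload_len) {
    assert(frame != NULL);
    if (frame_len < ETH_HDR_LEN || be16(frame + 12) != UNET_ETHERTYPE)
        return DEMUX_NOT_UNET;
    if (frame_len < ETH_HDR_LEN + UNET_HDR_LEN)
        return DEMUX_BAD;
    uint16_t chan = be16(frame + ETH_HDR_LEN);
    uint16_t len  = be16(frame + ETH_HDR_LEN + 2);
    /* The explicit length is needed because short frames are padded
     * to the 60-byte minimum (excluding FCS). */
    if (chan >= MAX_CHANNELS || channel_to_endpoint[chan] < 0 ||
        len > frame_len - ETH_HDR_LEN - UNET_HDR_LEN)
        return DEMUX_BAD;
    *payload = frame + ETH_HDR_LEN + UNET_HDR_LEN;
    *payload_len = len;
    return channel_to_endpoint[chan];
}
```

Frames that are not U-Net traffic return DEMUX_NOT_UNET so they can fall through to the normal IP path, which is where integration with IP and packet filtering would come in.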

Performance: Bandwidth
[Figure: Bandwidth (Mbps, 0 to 120) vs. message size (bytes, 0 to 1500) for ATM, FE through a hub, and FE through a Bay Networks 28115 switch]
ATM
- 120 Mbps; TAXI used as receiver
Fast Ethernet
- 90+ Mbps with 500-byte messages ... but the switch seems to shave off some bandwidth?
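To see how close 90+ Mbps is to the wire, the sketch below computes the bandwidth ceiling imposed by Ethernet framing alone, ignoring all host and NIC costs. The per-frame constants are standard IEEE 802.3 values (8-byte preamble and start delimiter, 14-byte header, 4-byte FCS, 12-byte inter-frame gap, 46-byte minimum payload); the 4-byte U-Net header (channel number plus length) is our assumption.

```c
#include <assert.h>

/* Per-frame Fast Ethernet overhead in bytes: preamble + SFD (8),
 * MAC header (14), FCS (4), inter-frame gap (12). */
#define ETH_OVERHEAD 38
#define ETH_MIN_PAYLOAD 46
#define ETH_MAX_PAYLOAD 1500
#define UNET_HDR 4            /* assumed: channel number + length field */
#define LINK_MBPS 100.0

/* Upper bound on user-visible bandwidth for back-to-back messages of
 * msg_bytes each. */
static double fe_max_goodput_mbps(int msg_bytes) {
    assert(msg_bytes > 0 && msg_bytes + UNET_HDR <= ETH_MAX_PAYLOAD);
    int payload = msg_bytes + UNET_HDR;
    if (payload < ETH_MIN_PAYLOAD)
        payload = ETH_MIN_PAYLOAD;    /* short frames are padded */
    return LINK_MBPS * msg_bytes / (payload + ETH_OVERHEAD);
}
```

Under these assumptions, 500-byte messages top out at about 92 Mbps (500 of every 542 bytes on the wire), while 40-byte messages cannot exceed about 48 Mbps because of padding and per-frame overhead.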

Performance: Latency
[Figure: Round-trip time (usec, 0 to 800) vs. message size (bytes, 0 to 1500) for ATM, FE through a hub, FE through a Bay Networks 28115 switch, and FE through a Cabletron FN100 switch]
FE has lower latency than ATM!
... for small messages, anyway
• FE switches add 17 usec one-way
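A simple additive model shows what the 17 usec per-switch figure means for a ping-pong test: each switch is crossed once in each direction. The function below is an illustrative C sketch; it ignores any size-dependent store-and-forward cost, and the example assumes (our assumption) that the 57 usec minimum round trip quoted in the summary was measured through the hub.

```c
#include <assert.h>

/* Additive model: each switch adds a fixed one-way delay, paid once on
 * the way out and once on the way back. */
static double rtt_with_switches_usec(double base_rtt_usec,
                                     double per_switch_oneway_usec,
                                     int switches) {
    assert(switches >= 0);
    return base_rtt_usec + 2.0 * switches * per_switch_oneway_usec;
}
```

With one switch, 57 + 2 × 17 = 91 usec, which under that assumption would still be below the 120 usec quoted for 40-byte ping-pongs over ATM.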

Split-C Benchmarks: ATM vs. FE
Split-C
• Novel parallel language based on C
• Use of 'global pointers' to access other processors' address spaces
[Figure: bar chart (y-axis 0 to 2.0) with cpu and net components for configurations atm2, fe2, atm4, fe4, atm8, fe8 on benchmarks mm16x16, mm128x128, ssortsm512K, ssortlg512K, rsortsm512K, rsortlg512K]
mm: Matrix Multiply
ssort: Sample sort
rsort: Radix sort
• SparcStations used in ATM cluster
• SPARC fp faster than Pentium
• Pentium int ops faster than SPARC
• FE faster for small msgs
• ATM faster for large msgs
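Split-C's global pointers name a location on some processor; dereferencing one that points to another node turns into communication. The plain-C sketch below models that idea only; it is not Split-C syntax, and the names (global_ptr, global_read), the simulated per-node memory, and the remote_reads counter standing in for a network round trip are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NODES 4            /* hypothetical cluster size */
#define WORDS_PER_NODE 16  /* hypothetical per-node memory size */

/* Simulated memories of every node in the cluster. */
static int64_t node_mem[NODES][WORDS_PER_NODE];
static int my_node = 0;
static unsigned remote_reads;   /* stands in for request/reply messages */

/* A global pointer names a (node, local address) pair. */
typedef struct {
    int node;
    int index;   /* word index within that node's memory */
} global_ptr;

/* Local global pointers become ordinary loads; remote ones would become
 * a request/reply message pair over the network. */
static int64_t global_read(global_ptr p) {
    assert(p.node >= 0 && p.node < NODES);
    assert(p.index >= 0 && p.index < WORDS_PER_NODE);
    if (p.node != my_node)
        remote_reads++;
    return node_mem[p.node][p.index];
}
```

Programs that issue many such fine-grained remote reads are dominated by small-message round-trip time, which is consistent with FE doing well where messages are small.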

Current work: Memory Management
Pinned buffers and queues
- Locked into physical memory for lifetime of process
- Required to allow direct DMA to/from user space
Paging Endpoints
- On-demand paging of U-Net buffers
- Uses software TLB to cache page mappings
- TLB miss causes kernel interrupt to fetch page
- Pages discarded on TLB capacity miss
Issues
- Writeable pages are easy to get
- What about a swapped-out read page? (Can't swap it in from interrupt context...)
- Lazy read-page retrieval: Tell the NI to try again
Implementations for PCA-200 & Linux, DC21140, Zeitnet & Windows NT
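The software TLB used for paging endpoints can be sketched as a small table from virtual page numbers to physical frames, consulted before each DMA. This C model assumes a direct-mapped 4-entry table and 4 KB pages; kernel_fetch_frame is a hypothetical stand-in for the kernel interrupt that pins and maps the page, and neither unpinning of evicted pages nor the lazy handling of swapped-out read pages is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 4        /* assumed: tiny, direct-mapped */
#define PAGE_SHIFT 12        /* assumed 4 KB pages */

typedef struct {
    bool valid;
    uint32_t vpage;          /* virtual page number */
    uint32_t frame;          /* physical frame number */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static unsigned tlb_misses;

/* Hypothetical stand-in for the kernel path taken on a miss. */
static uint32_t kernel_fetch_frame(uint32_t vpage) { return vpage + 1000; }

/* Translate a user virtual address to a physical address for DMA.
 * Reusing a slot discards the old mapping; in a direct-mapped table this
 * happens on conflicts as well as when the table is simply too small. */
static uint64_t soft_tlb_translate(uint64_t vaddr) {
    uint32_t vpage = (uint32_t)(vaddr >> PAGE_SHIFT);
    tlb_entry *e = &tlb[vpage % TLB_ENTRIES];
    if (!e->valid || e->vpage != vpage) {
        tlb_misses++;                         /* kernel interrupt */
        e->valid = true;
        e->vpage = vpage;
        e->frame = kernel_fetch_frame(vpage);
    }
    uint64_t phys = ((uint64_t)e->frame << PAGE_SHIFT)
                  | (vaddr & ((1u << PAGE_SHIFT) - 1));
    assert((phys >> PAGE_SHIFT) == e->frame);
    return phys;
}
```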

Summary
U-Net extended to non-programmable NICs
- Implementation using DC21140 FE interface
- Hardware requires kernel trap and copy on receive
U-Net extended to Fast Ethernet
- Round-trip latency starts at 57 usec (40-byte ping-pong)
- Lower latency than OC-3 ATM (120 usec, 40-byte ping-pong)
- Bandwidth reaches > 90 Mbps with 500-byte messages
Conclusions
- U-Net model extends to other networks and NI architectures
- Split-C benchmarks demonstrate comparable app performance
- Fast Ethernet is an excellent price-performance point for workstation clusters