NFS/RDMA over IB under Linux

NFS/RDMA over IB under Linux
Charles J. Antonelli
Center for Information Technology Integration
University of Michigan, Ann Arbor
February 7, 2005
(portions copyright Tom Talpey and Gary Grider)
Agenda
NFSv2,3,4
NFS/RDMA
Linux NFS/RDMA server
NFS Sessions
pNFS and RDMA
NFSv2,3
One of the major software innovations of the 1980s
Open systems
Open specification
Remote procedure call (RPC)
Invocation across machine boundaries
Support for heterogeneity
Virtual file system interface (VFS)
Abstract interface to file system functions
Read, write, open, close, etc.
Stateless server
Ease of implementation
Compensates for unreliable servers: no state to recover after a crash
Problems with NFSv2,3
Naming
Under client control (automounter helps)
Scalability
Caching is hard to get right
Consistency
Three-second rule (attribute cache revalidation window)
Performance
Chatty protocol
Problems with NFSv2,3
Access control
Trusted client
Identity agreement
Locking
Outside the NFS protocol specification
System administration
No tools for backend management
Proliferation of exported workstation disks
NFSv4
Major components
Export management
Compound RPC
Delegation
State and locks
Access control lists
Security: RPCSEC_GSS
NFSv4
Export Management
NFSv4 pseudo fs allows the client to mount
the server root, and browse to discover
offered exports
No more mountd
Access into an export is based on the user’s
credentials
Obviates the client list in /etc/exports
Compound RPC
Designed to reduce wire traffic
Multiple operations per request:
PUTROOTFH
LOOKUP
GETATTR
GETFH
“Start with the pseudo fs root, lookup
mount point path name, and return
attributes and file handle.”
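As a hedged illustration, the compound above could be assembled roughly like this; the op codes are from RFC 3530, but the compound_builder structure and encode helpers are placeholders, not the Linux client's actual XDR routines:

    /* op codes from RFC 3530 */
    enum nfs_opnum4 { OP_GETATTR = 9, OP_GETFH = 10,
                      OP_LOOKUP = 15, OP_PUTROOTFH = 24 };

    struct compound_builder {
        unsigned char *p;      /* XDR encode cursor */
        unsigned int   nops;   /* operation count in the compound header */
    };

    /* placeholder encode helpers, not the Linux client's XDR routines */
    void encode_op(struct compound_builder *cb, enum nfs_opnum4 op);
    void encode_string(struct compound_builder *cb, const char *s);

    static void mount_lookup_compound(struct compound_builder *cb,
                                      const char *component)
    {
        encode_op(cb, OP_PUTROOTFH);      /* start at the pseudo-fs root */
        encode_op(cb, OP_LOOKUP);
        encode_string(cb, component);     /* one path component per LOOKUP */
        encode_op(cb, OP_GETATTR);        /* attributes of the result */
        encode_op(cb, OP_GETFH);          /* file handle of the result */
        /* one round trip replaces several separate NFSv3 RPCs */
    }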
Delegation
Server issues delegations to clients
A read delegation on a file is a guarantee that
no other clients are writing to the file
A write delegation on a file is a guarantee that
no other clients are accessing the file
Reduces revalidation requirements
Not necessary for correctness
Intended to reduce RPC requests to the
server
State and Locks
NFSv3 is an ostensibly stateless protocol
However, NFSv3 is typically used with a stateful auxiliary
locking protocol (NLM)
NFSv4 locking is part of the protocol
No more lockd
LOCK operation sets up lock state
Client polls server when LOCK request is denied
NFSv4 servers also keep track of
Open files, mainly to support Windows share reservation
semantics
Delegations
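A hedged sketch of the polling behavior mentioned above: NFS4ERR_DENIED is the status defined in RFC 3530, but the helpers and backoff constants are illustrative only.

    #define NFS4ERR_DENIED 10010            /* from RFC 3530 */

    struct nfs_file;
    struct lock_req;
    int  nfs4_op_lock(struct nfs_file *f, struct lock_req *lk); /* placeholder */
    void msleep(unsigned int ms);                               /* kernel delay */

    /* NFSv4 has no server callback for blocked byte-range locks,
     * so a denied LOCK is simply retried with backoff. */
    int nfs4_lock_poll(struct nfs_file *f, struct lock_req *lk)
    {
        unsigned int delay_ms = 100;
        for (;;) {
            int err = nfs4_op_lock(f, lk);     /* LOCK operation */
            if (err != NFS4ERR_DENIED)
                return err;                    /* granted, or hard error */
            msleep(delay_ms);                  /* poll the server */
            if (delay_ms < 30000)
                delay_ms *= 2;
        }
    }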
State Management
Open file and lock state are lease-based
A lease is the amount of time a server will wait
without receiving a state-referencing operation
from a client before reaping the client's state.
Delegation state is callback-based
A callback is a communication channel from the
server back to the client
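A minimal sketch (not the CITI implementation) of how lease-based reaping might look; the structure and lease constant are illustrative:

    #include <time.h>

    struct nfs4_client {
        time_t last_renewal;        /* updated on every state-referencing op */
        /* ... open, lock, and delegation state ... */
    };

    #define NFS4_LEASE_SECONDS 90   /* illustrative server lease period */

    /* a reaper can discard all state for clients past their lease */
    static int lease_expired(const struct nfs4_client *clp, time_t now)
    {
        return now - clp->last_renewal > NFS4_LEASE_SECONDS;
    }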
Access Control Lists
NFSv4 defines ACLs for file system
objects
Richer and more granular than POSIX
ACLs
Similar to NT ACLs
ACLs are showing up on local UNIX file
systems
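For concreteness, an NFSv4 ACE carries roughly the following fields, after the nfsace4 type in RFC 3530 (uint32_t stands in for the XDR types here):

    #include <stdint.h>

    struct nfsace4 {
        uint32_t type;          /* ALLOW, DENY, AUDIT, or ALARM */
        uint32_t flag;          /* e.g. directory/file inheritance flags */
        uint32_t access_mask;   /* fine-grained bits, richer than rwx */
        char    *who;           /* UTF-8 principal, e.g. "alice@umich.edu" */
    };

The string principal and per-ACE mask are what make these ACLs richer than POSIX ACLs and similar to NT ACLs.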
Security Model
Security added to RPC layer
RFC 2203 defines RPCSEC_GSS
Adds the GSSAPI to ONC RPC
An application that uses the GSSAPI can "plug
in" any security service implementing the API
NFSv4 mandates the implementation of Kerberos
v5 and LIPKEY GSSAPI security mechanisms.
LIPKEY (layered on SPKM-3) provides a
security service similar to TLS
Existing NFSv4 Implementations
SUN Solaris client and server
Network Appliance multi-protocol server
NFSv4, NFSv3, CIFS
Hummingbird WinXXX client and server
CITI
Linux client and server
OpenBSD/FreeBSD client
EMC multi-protocol server
HPUX server
Guelph OpenBSD server
IBM AIX client and server
Future Implementations
Cluster-coherent NFS server
pNFS
NFS/RDMA
A way to run NFS v2/v3/v4 over RDMA
Greatly enhanced NFS performance
Low overhead
Full bandwidth
Direct I/O – true zero copy
Implemented on Linux
kDAPL API
Client today, server soon
RPC layer approach
Implemented within RPC layer
New RPC transport type
Adds RDMA-transport specific header
“Chunks” direct data transfer between client
memory and server buffers
Bindings for NFSv2/v3, also NFSv4
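A sketch of what the RDMA-transport-specific header and a chunk entry might look like, loosely following the RPC/RDMA internet-draft of the period; field names are illustrative, not the on-the-wire XDR:

    #include <stdint.h>

    /* one chunk segment: a region of client memory the server may
     * RDMA to or from */
    struct rpcrdma_segment {
        uint32_t handle;   /* registered memory handle (steering tag) */
        uint32_t length;   /* bytes the peer may transfer */
        uint64_t offset;   /* address within the client's memory */
    };

    struct rpcrdma_header {
        uint32_t xid;      /* matches the RPC transaction id */
        uint32_t vers;     /* RPC/RDMA protocol version */
        uint32_t credits;  /* flow control: receive buffers granted */
        uint32_t proc;     /* e.g. inline message vs. chunked message */
        /* read list, write list, and reply chunk of rpcrdma_segment
         * entries follow, XDR-encoded */
    };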
Implementation Layering
Client implemented as kernel RPC transport
Server approach similar
RDMA API: kDAPL
NFS client code remains unchanged
Completely transparent to application
Use of kDAPL
All RDMA interfacing is via kDAPL
Very simple subset of kDAPL 1.1 API
Connection, connection DTOs
Kernel-virtual or physical LMRs, RMRs
Small (1KB-4KB typical) send/receive
Large RDMA (4KB-64KB typical)
All RDMA read/write initiated by server
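A sketch of how that server-initiated transfer policy might look in code; the dapl_* wrappers below are hypothetical stand-ins, not the real DAT 1.1 entry points:

    #include <stddef.h>
    #include <stdint.h>

    struct rdma_conn;   /* wraps a DAT endpoint and its event dispatchers */

    /* hypothetical wrappers over the kDAPL subset listed above */
    int dapl_post_send(struct rdma_conn *c, const void *buf, size_t len);
    int dapl_post_rdma_write(struct rdma_conn *c, const void *buf,
                             size_t len, uint32_t peer_handle,
                             uint64_t peer_addr);

    /* Reply to a large NFS READ: RDMA-Write the payload (4KB-64KB
     * typical) into the client's advertised buffer, then a small Send
     * carries the RPC reply header. The fully-inline small-reply case
     * is elided for brevity. */
    int serve_read_reply(struct rdma_conn *c,
                         const void *hdr, size_t hdr_len,   /* reply header */
                         const void *data, size_t data_len, /* registered LMR */
                         uint32_t peer_handle, uint64_t peer_addr)
    {
        if (dapl_post_rdma_write(c, data, data_len, peer_handle, peer_addr))
            return -1;                   /* server initiates all RDMA */
        return dapl_post_send(c, hdr, hdr_len);
    }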
Potential NFS/RDMA Users
Anywhere high bandwidth, low overhead is
important:
HPC/Supercomputing clusters
Database
Financial applications
Scientific computing
General cluster computing
Linux NFS/RDMA server
Project goals
RPC/RDMA implementation
kDAPL API
Mellanox IB
Interoperate with NetApp RPC RDMA client
Performance gain over TCP transport
Linux NFS/RDMA server
Approach
Divide RPC layer into unified state management
and abstract transport layer
Socket-specific code replaced by general
interface implemented by socket or RDMA
transports
Similar to client RPC transport switch concept
Linux NFS/RDMA server
Implementation stages
Listen for and accept connections
Process inline NFSv3 requests
NFSv3 RDMA
NFSv4 RDMA
Listen for and accept connections
svc_makexprt
Similar to svc_makesock for socket transports
RDMA transport tasks:
Open HCA
Register memory
Create endpoint for RDMA connections
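A hedged sketch of svc_makexprt covering the three tasks above; the rdma_* helpers are placeholders for the underlying kDAPL calls, and error handling is elided:

    struct svc_serv;    /* generic RPC server state (Linux sunrpc) */
    struct svc_rdma;    /* transport-private state, hypothetical */
    struct svc_xprt;

    struct svc_rdma *rdma_alloc(void);
    void rdma_open_hca(struct svc_rdma *rd);
    void rdma_register_pool(struct svc_rdma *rd);
    void rdma_create_listen_ep(struct svc_rdma *rd, int port);
    struct svc_xprt *svc_xprt_attach(struct svc_serv *serv,
                                     struct svc_rdma *rd);

    /* the RDMA analogue of svc_makesock */
    struct svc_xprt *svc_makexprt(struct svc_serv *serv, int port)
    {
        struct svc_rdma *rd = rdma_alloc();

        rdma_open_hca(rd);                /* open the HCA */
        rdma_register_pool(rd);           /* pre-register buffer memory */
        rdma_create_listen_ep(rd, port);  /* listen for RDMA connects */
        return svc_xprt_attach(serv, rd); /* hand it to the RPC layer */
    }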
Listen for and accept connections
svc_xprt
Retains transport-independent components of
svc_sock
Add pointer to transport-specific structure
Support for registering dynamic transport
implementations (eventually)
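The split might look roughly like this; field and method names are illustrative, not the final Linux structures:

    struct svc_rqst;    /* an RPC request being served */
    struct svc_xprt;
    struct list_head { struct list_head *next, *prev; };

    /* socket and RDMA transports each supply one of these */
    struct svc_xprt_ops {
        int  (*xpo_recvfrom)(struct svc_rqst *rqstp);
        int  (*xpo_sendto)(struct svc_rqst *rqstp);
        void (*xpo_detach)(struct svc_xprt *xprt);
    };

    struct svc_xprt {
        struct svc_xprt_ops *xpt_ops;   /* socket or RDMA implementation */
        unsigned long        xpt_flags; /* transport-independent state,
                                           retained from svc_sock */
        struct list_head     xpt_ready; /* queued for a server thread */
        void                *xpt_priv;  /* transport-specific structure */
    };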
Listen for and accept connections
Reorganize code into transport-agnostic and
transport-specific blocks
Update calling code to specify transport
Process inline NFSv3 requests
RDMA-specific send and receive routines
All data sent inline via RDMA Send
Tasks
Register memory buffers for RDMA send
Manage buffer transmission by the hardware
Process RDMA headers
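A sketch of the inline receive path implied by those tasks; all structure and helper names are hypothetical:

    #include <stddef.h>
    #include <stdint.h>

    #define RDMA_MSG        0    /* illustrative: RPC call is inline */
    #define RPCRDMA_HDRLEN  28   /* illustrative header length, no chunks */

    struct rpcrdma_header { uint32_t xid, vers, credits, proc; };
    struct recv_buf { unsigned char *data; size_t len; };
    struct rpc_msg  { unsigned char *data; size_t len; };
    struct rdma_conn;

    struct recv_buf *rdma_wait_recv(struct rdma_conn *c);
    struct recv_buf *alloc_recv_buf(struct rdma_conn *c);
    void rdma_post_recv(struct rdma_conn *c, struct recv_buf *rv);

    /* strip the RDMA header, hand the RPC call to the generic server
     * code, and replenish the hardware's receive buffer pool */
    static int rdma_recv_inline(struct rdma_conn *c, struct rpc_msg *out)
    {
        struct recv_buf *rv = rdma_wait_recv(c);   /* completed RDMA Recv */
        struct rpcrdma_header *h = (struct rpcrdma_header *)rv->data;

        if (h->proc != RDMA_MSG)          /* inline only, at this stage */
            return -1;
        out->data = rv->data + RPCRDMA_HDRLEN;  /* call follows header */
        out->len  = rv->len - RPCRDMA_HDRLEN;
        rdma_post_recv(c, alloc_recv_buf(c));   /* replace consumed buffer */
        return 0;
    }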
NFSv3 RDMA
Use RDMA Read and Write for large transfers
RPC page management
xdr_buf contains initial kvec and list of pages
Initial kvec holds RPC header and short payloads
Page list used for large data transfer
Server memory registration
All server memory pre-registered
Allows simpler memory management
May need revisiting wrt security
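For reference, the sunrpc xdr_buf of that era looks approximately like this (from include/linux/sunrpc/xdr.h; comments ours):

    #include <stddef.h>

    struct page;                          /* kernel page descriptor */
    struct kvec { void *iov_base; size_t iov_len; };

    struct xdr_buf {
        struct kvec  head[1];        /* RPC header + short inline payload */
        struct kvec  tail[1];        /* padding/trailer after page data */
        struct page **pages;         /* page list for bulk data transfer */
        unsigned int page_base;      /* offset of data within first page */
        unsigned int page_len;       /* length of data in the page list */
        unsigned int buflen;         /* total space allocated */
        unsigned int len;            /* length of encoded data */
    };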
NFSv3 RDMA
Client write
Server issues RDMA Read from client-provided
read chunks
Server reads into xdr_buf page list
Similar to socket-based receive for ULP
Client read
Server issues RDMA Write into client-provided
write chunks
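A sketch of the client-write path: the server pulls each client-provided read chunk into the xdr_buf page list before the NFS procedure runs. The helpers are hypothetical stand-ins for kDAPL calls; rpcrdma_segment and xdr_buf are as sketched earlier.

    int  rdma_post_read(struct rdma_conn *c, struct page **pages,
                        unsigned int page_off, uint32_t handle,
                        uint64_t offset, uint32_t length);
    void rdma_wait_reads(struct rdma_conn *c);

    static int rdma_pull_write_data(struct rdma_conn *c,
                                    struct rpcrdma_segment *chunks,
                                    int nchunks, struct xdr_buf *arg)
    {
        unsigned int copied = 0;
        for (int i = 0; i < nchunks; i++) {
            /* RDMA Read from the client's advertised buffer into the
             * server's xdr_buf pages */
            if (rdma_post_read(c, arg->pages, copied,
                               chunks[i].handle, chunks[i].offset,
                               chunks[i].length))
                return -1;
            copied += chunks[i].length;
        }
        rdma_wait_reads(c);      /* completions before NFS sees the data */
        arg->page_len = copied;  /* looks like a socket receive to the ULP */
        return 0;
    }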
NFSv3 RDMA
Reply chunks
Applies when client requests generate replies that
are too large for RDMA Send
Server issues RDMA write into client-supplied
buffers
NFSv4 RDMA
NFSv4 layered on RPC/RDMA
Task:
export modifications for RDMA transport
NFSv4.1 Sessions
Adds a session layer to NFSv4
Enhances protocol reliability
Accurate duplicate request caching
Bounded resources
Provides transport diversity
Trunking, multipathing
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-00.txt
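A sketch of the bounded duplicate request cache that sessions make possible: each request carries a (slot, sequence) pair, so the server caches at most one reply per slot. The structures are illustrative, not the draft's XDR.

    #include <stdint.h>

    struct drc_slot {
        uint32_t seqid;         /* last sequence number seen on this slot */
        void    *cached_reply;  /* the one reply retained for replay */
    };

    struct nfs41_session {
        uint32_t        nslots; /* negotiated at session creation: bounded */
        struct drc_slot slot[]; /* at most nslots replies ever cached */
    };

    /* A retransmission carries the same (slot, seqid): return the cached
     * reply verbatim, giving accurate duplicate detection. A new request
     * on the slot carries seqid+1 and may overwrite the cache entry. */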
pNFS basics
Separation of data and control: NFS
metadata requests go through NFS, while data
requests flow directly to the devices (OBSD,
block/iSCSI, file)
This allows an NFSv4.X pNFS client to be a
native client of an Object/SAN/data-filer file
system and to scale efficiently
Limits the need for custom VFS clients for
every version of every OS/kernel known to
mankind
pNFS and RDMA
An NFSv4.x client with RDMA gives us a low-latency,
low-overhead path for metadata (via the RPC/RDMA layer)
pNFS gives us parallel paths for data direct to the
storage devices or filers (for the OBSD, block, and file
methods)
For the file method, RPC/RDMA provides a standards-based data
path to the data filer
For the block method, iSCSI/iSER or SRP could be used; this
provides a standards-based data path (though it lacks
transactional security)
For the OBSD method, since ANSI OBSD extends iSCSI, if
OBSD/iSCSI/iSER all interoperate, this provides a standards-based
data path that is transactionally secure
pNFS and RDMA
With the previous two items, combined with other
NFSv4 features like leasing, compound RPCs, etc., we
have a first-class, standards-based file system client
that gets native device performance, all provided by
NFSv4.XXX, capable of effectively using any global
parallel file system
AND ALL WITH STANDARDS!
pNFS and RDMA
We really need all this work to be enabled on both
Ethernet and InfiniBand, and to be completely routable
between the two media.
Will higher-level apps that become RDMA-aware be able to
use both Ethernet and InfiniBand, and mixtures of both,
transparently?
Will NFSv4 RPC/RDMA, iSCSI, and SRP be routable
between media?
CITI
Developing NFSv4 reference implementation
since 1999
NFS/RDMA and NFSv4.1 Sessions since 2003
Funded by Sun, Network Appliance, ASCI,
PolyServe, NSF
http://www.citi.umich.edu/projects/nfsv4/
Key message
Give us kDAPL
Any questions?
http://www.citi.umich.edu/