NFS/RDMA over IB under Linux
Charles J. Antonelli
Center for Information Technology Integration
University of Michigan, Ann Arbor
February 7, 2005
(portions copyright Tom Talpey and Gary Grider)

Agenda
- NFSv2,3,4
- NFS/RDMA
- Linux NFS/RDMA server
- NFS Sessions
- pNFS and RDMA

NFSv2,3
- One of the major software innovations of the '80s
- Open systems, open specification
- Remote procedure call (RPC): invocation across machine boundaries, support for heterogeneity
- Virtual file system interface (VFS): an abstract interface to file system functions (read, write, open, close, etc.)
- Stateless server: eases implementation and masks the lack of server reliability

Problems with NFSv2,3
- Naming: under client control (the automounter helps)
- Scalability: caching is hard to get right
- Consistency: the three-second rule
- Performance: chatty protocol
- Access control: trusted clients, identity agreement
- Locking: outside the NFS protocol specification
- System administration: no tools for backend management; proliferation of exported workstation disks

NFSv4
- Major components: export management, compound RPC, delegation, state and locks, access control lists, security (RPCSEC_GSS)

NFSv4 Export Management
- The NFSv4 pseudo file system allows the client to mount the server root and browse to discover the offered exports
- No more mountd
- Access into an export is based on the user's credentials, obviating the client list in /etc/exports

Compound RPC
- Designed to reduce wire traffic
- Multiple operations per request, e.g. PUTROOTFH, LOOKUP, GETATTR, GETFH: "start with the pseudo-fs root, look up the mount point path name, and return its attributes and file handle"
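A minimal sketch in C of the compound idea above. The operation names (PUTROOTFH, LOOKUP, GETATTR, GETFH) come from the NFSv4 protocol; the structures, the print loop, and the path components ("export", "home") are assumptions made for illustration and are not the Linux client's actual XDR code.

    /* Illustrative only: models a COMPOUND request as a list of operations
     * bundled into a single RPC. */
    #include <stdio.h>

    enum nfs4_op { OP_PUTROOTFH, OP_LOOKUP, OP_GETATTR, OP_GETFH };

    struct compound_op {
        enum nfs4_op op;
        const char *name;        /* lookup component, used only by OP_LOOKUP */
    };

    /* "Start with the pseudo-fs root, look up the mount point path name,
     * and return its attributes and file handle" -- one round trip. */
    static const struct compound_op mount_lookup[] = {
        { OP_PUTROOTFH, NULL },
        { OP_LOOKUP, "export" }, /* hypothetical path component */
        { OP_LOOKUP, "home" },   /* hypothetical path component */
        { OP_GETATTR, NULL },
        { OP_GETFH, NULL },
    };

    int main(void)
    {
        static const char *names[] = { "PUTROOTFH", "LOOKUP", "GETATTR", "GETFH" };
        size_t i, n = sizeof(mount_lookup) / sizeof(mount_lookup[0]);

        /* A real client would encode all of these into one request on the wire. */
        printf("COMPOUND with %zu operations:\n", n);
        for (i = 0; i < n; i++) {
            if (mount_lookup[i].op == OP_LOOKUP)
                printf("  %s \"%s\"\n", names[mount_lookup[i].op], mount_lookup[i].name);
            else
                printf("  %s\n", names[mount_lookup[i].op]);
        }
        return 0;
    }

The point of bundling is fewer round trips: one request replaces what would otherwise be a separate RPC per lookup, getattr, and getfh.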
Delegation
- The server issues delegations to clients
- A read delegation on a file is a guarantee that no other clients are writing to the file
- A write delegation on a file is a guarantee that no other clients are accessing the file
- Reduces revalidation requirements; not necessary for correctness
- Intended to reduce RPC requests to the server

State and Locks
- NFSv3 is an ostensibly stateless protocol, but it is typically used with a stateful auxiliary locking protocol (NLM)
- NFSv4 locking is part of the protocol: no more lockd
- The LOCK operation sets up lock state; the client polls the server when a LOCK request is denied
- NFSv4 servers also keep track of open files (mainly to support Windows share reservation semantics) and delegations

State Management
- Open file and lock state are lease-based: a lease is the amount of time a server will wait, while not receiving a state-referencing operation from a client, before reaping the client's state
- Delegation state is callback-based: a callback is a communication channel from the server back to the client

Access Control Lists
- NFSv4 defines ACLs for file system objects
- Richer and more granular than POSIX ACLs; similar to NT ACLs
- ACLs are showing up on local UNIX file systems

Security Model
- Security is added at the RPC layer: RFC 2203 defines RPCSEC_GSS, which adds the GSS-API to ONC RPC
- An application that uses the GSS-API can "plug in" any security service implementing the API
- NFSv4 mandates implementation of the Kerberos v5 and LIPKEY GSS-API security mechanisms
- The combination of LIPKEY and SPKM-3 provides a security service similar to TLS

Existing NFSv4 Implementations
- Sun Solaris client and server
- Network Appliance multi-protocol server (NFSv4, NFSv3, CIFS)
- Hummingbird Windows client and server
- CITI Linux client and server
- OpenBSD/FreeBSD client
- EMC multi-protocol server
- HP-UX server
- Guelph OpenBSD server
- IBM AIX client and server

Future Implementations
- Cluster-coherent NFS server
- pNFS

NFS/RDMA
- A way to run NFSv2/v3/v4 over RDMA
- Greatly enhanced NFS performance: low overhead, full bandwidth
- Direct I/O: true zero copy
- Implemented on Linux over the kDAPL API
- Client today, server soon

RPC layer approach
- Implemented within the RPC layer as a new RPC transport type
- Adds an RDMA-transport-specific header
- "Chunks" direct data transfer between client memory and server buffers
- Bindings for NFSv2/v3, and also NFSv4

Implementation Layering
- Client implemented as a kernel RPC transport; the server approach is similar
- RDMA API: kDAPL
- NFS client code remains unchanged; completely transparent to applications

Use of kDAPL
- All RDMA interfacing is via kDAPL, using a very simple subset of the kDAPL 1.1 API
- Connections and connection DTOs (data transfer operations)
- Kernel-virtual or physical LMRs and RMRs
- Small (1 KB-4 KB typical) send/receive; large RDMA (4 KB-64 KB typical)
- All RDMA Reads and Writes are initiated by the server

Potential NFS/RDMA Users
- Anywhere high bandwidth and low overhead are important: HPC/supercomputing clusters, databases, financial applications, scientific computing, general cluster computing

Linux NFS/RDMA server: Project goals
- RPC/RDMA implementation over the kDAPL API on Mellanox IB
- Interoperate with the NetApp RPC/RDMA client
- Performance gain over the TCP transport

Linux NFS/RDMA server: Approach
- Divide the RPC layer into unified state management and an abstract transport layer
- Socket-specific code is replaced by a general interface implemented by socket or RDMA transports (sketched below)
- Similar to the client's RPC transport switch concept

Linux NFS/RDMA server: Implementation stages
- Listen for and accept connections
- Process inline NFSv3 requests
- NFSv3 RDMA
- NFSv4 RDMA

Listen for and accept connections
- svc_makexprt, similar to svc_makesock for socket transports
- RDMA transport tasks: open the HCA, register memory, create an endpoint for RDMA connections
- svc_xprt retains the transport-independent components of svc_sock and adds a pointer to a transport-specific structure
- Support for registering dynamic transport implementations (eventually)
- Reorganize code into transport-agnostic and transport-specific blocks; update calling code to specify the transport

Process inline NFSv3 requests
- RDMA-specific send and receive routines; all data is sent inline via RDMA Send
- Tasks: register memory buffers for RDMA Send, manage buffer transmission by the hardware, process RDMA headers

NFSv3 RDMA
- Use RDMA Read and Write for large transfers
- RPC page management: the xdr_buf contains an initial kvec and a list of pages; the initial kvec holds the RPC header and short payloads, and the page list is used for large data transfers
- Server memory registration: all server memory is pre-registered, which allows simpler memory management but may need revisiting with respect to security
- Client write: the server issues RDMA Reads from client-provided read chunks into the xdr_buf page list, similar to a socket-based receive for the ULP
- Client read: the server issues RDMA Writes into client-provided write chunks
- Reply chunks: when a client request generates a reply that is too large for RDMA Send, the server issues RDMA Writes into client-supplied buffers

NFSv4 RDMA
- NFSv4 layered on RPC/RDMA
- Task: export modifications for the RDMA transport
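A minimal sketch in C of the transport-switch approach described above under "Linux NFS/RDMA server: Approach" and "Listen for and accept connections". The svc_xprt name comes from the slides; the ops-table fields, callbacks, and loopback stand-in are assumptions made for illustration, not the actual Linux sunrpc code.

    /* Illustrative only: generic server RPC code calls through an ops table,
     * and each transport (socket or RDMA) supplies its own callbacks. */
    #include <stdio.h>
    #include <string.h>

    struct svc_xprt;

    struct svc_xprt_ops {
        int  (*xpo_recvfrom)(struct svc_xprt *xprt, char *buf, size_t len);     /* pull in a request */
        int  (*xpo_sendto)(struct svc_xprt *xprt, const char *buf, size_t len); /* send the reply    */
        void (*xpo_detach)(struct svc_xprt *xprt);                              /* tear down         */
    };

    struct svc_xprt {
        const struct svc_xprt_ops *xpt_ops; /* socket or RDMA implementation */
        void *xpt_private;                  /* transport-specific state: the socket,
                                               or the HCA/endpoint/registered memory
                                               for an RDMA transport */
    };

    /* Transport-agnostic request loop: it never knows which transport it drives. */
    static void svc_handle_one(struct svc_xprt *xprt)
    {
        char buf[256];
        int n = xprt->xpt_ops->xpo_recvfrom(xprt, buf, sizeof(buf));
        if (n > 0)
            xprt->xpt_ops->xpo_sendto(xprt, buf, (size_t)n); /* echo stands in for the NFS reply */
    }

    /* A loopback stand-in transport so the sketch runs; an RDMA transport would
     * post receives and issue RDMA Sends via kDAPL in these callbacks instead. */
    static int lo_recv(struct svc_xprt *x, char *buf, size_t len)
    { (void)x; strncpy(buf, "NFS request", len); return (int)strlen(buf); }
    static int lo_send(struct svc_xprt *x, const char *buf, size_t len)
    { (void)x; return (int)fwrite(buf, 1, len, stdout); }
    static void lo_detach(struct svc_xprt *x) { (void)x; }

    static const struct svc_xprt_ops lo_ops = { lo_recv, lo_send, lo_detach };

    int main(void)
    {
        struct svc_xprt xprt = { &lo_ops, NULL };
        svc_handle_one(&xprt);
        putchar('\n');
        return 0;
    }

The design choice mirrors the client-side RPC transport switch: the NFS service code above the ops table stays the same whether requests arrive over a socket or over RDMA.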
NFSv4.1 Sessions
- Adds a session layer to NFSv4
- Enhances protocol reliability: accurate duplicate request caching, bounded resources
- Provides transport diversity: trunking, multipathing
- http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-00.txt

pNFS basics
- Separation of data and control: NFS metadata requests go through NFS, while data requests flow directly to the devices (OBSD object storage, block/iSCSI, or file)
- This allows an NFSv4.x pNFS client to be a native client of an object, SAN, or data-filer file system and to scale efficiently
- Limits the need for custom VFS clients for every version of every OS/kernel known to mankind

pNFS and RDMA
- An NFSv4.x client with RDMA gives us a low-latency, low-overhead path for metadata (via the RPC/RDMA layer)
- pNFS gives us parallel paths for data directly to the storage devices or filers (for the OBSD, block, and file methods)
- For the file method, RPC/RDMA provides a standards-based data path to the data filer
- For the block method, iSCSI/iSER or SRP could be used; this provides a standards-based data path (though it lacks transactional security)
- For the OBSD method, since ANSI OBSD is an iSCSI extension, if OBSD, iSCSI, and iSER all get along, this provides a standards-based data path that is transactionally secure
- Combined with other NFSv4 features such as leasing and compound RPCs, this gives us a first-class, standards-based file system client that gets native device performance, all provided by NFSv4.x, and capable of effectively using any global parallel file system. AND ALL WITH STANDARDS!
- We really need all of this work to be enabled on both Ethernet and InfiniBand, and to be completely routable between the two media
- Will higher-level applications that become RDMA-aware be able to use both Ethernet and InfiniBand, and mixtures of both, transparently?
- Will NFSv4 RPC/RDMA, iSCSI, and SRP be routable between media?

CITI
- Developing the NFSv4 reference implementation since 1999
- NFS/RDMA and NFSv4.1 Sessions since 2003
- Funded by Sun, Network Appliance, ASCI, PolyServe, and NSF
- http://www.citi.umich.edu/projects/nfsv4/

Key message
- Give us kDAPL

Any questions?
http://www.citi.umich.edu/