pNFS Extension for NFSv4
Brent Welch (welch@panasas.com)
IETF-62, March 2005

Abstract
- pNFS extends NFSv4
- The scalable I/O problem
- Brief review of the proposal
- Status and next steps
- Object storage and pNFS
- Object security scheme

pNFS Extension to NFSv4
- NFS is THE file system standard
- Fewest additions that enable parallel I/O from clients
- Layouts are the key additional abstraction
  - Describe where data is located and how it is organized
  - Clients access storage directly, in parallel
- Generalized support for asymmetric solutions
  - Files: clean way to do filer virtualization, eliminate the bottleneck
  - Objects: standard way to do object-based file systems
  - Blocks: standard way to do block-based SAN file systems

Scalable I/O Problem
- Storage for 1000s of active clients => lots of bandwidth
- Scaling capacity through 100s of TB and into PBs
- The file server model has good sharing and manageability, but it is hard to scale
- Many proprietary solutions: GPFS, CXFS, StorNext, Panasas, Sistina, SANergy, ...
  - Everyone has their own client
- We would like a standards-based solution => pNFS

Scaling and the Client
- Gary Grider's rule of thumb for HPC: 1 GB/sec of I/O for each teraflop of computing power
  - 2000 3.2 GHz processors => 6 TF => 6 GB/sec
  - One file server with 48 GE NICs? I don't think so.
  - A 100 GB/sec I/O system in '08 or '09 for a 100 TF cluster
- Making movies: a 1000-node rendering farm, plus 100s of desktops at night
- Oil and gas: 100s to 1000s of clients, with lots of large files (10s of GB to TB each)
- EDA, compile farms, life sciences, ...
- Everyone has a Linux cluster these days

Asymmetric File Systems
- Control path vs. data path ("out-of-band" control)
- Metadata servers (control path)
  - File name space
  - Access control (ACL checking)
  - Sharing and coordination
- Clients speak NFSv4 + pNFS ops to the metadata servers
- Storage devices (data path)
  - Clients access storage directly
- SAN file systems: CXFS, EMC HighRoad, SANergy, SAN FS
- Object storage file systems: Panasas, Lustre
[Diagram: clients use NFSv4 + pNFS ops to reach the pNFS (metadata) server and access the storage devices directly; a storage management protocol connects the pNFS server to the storage devices.]

NFS and Cluster File Systems
[Diagram: NFS clients talk to NFS "heads", each of which is a client of a shared cluster file system alongside native cluster file system clients.]
- Currently, an NFS head can be a client of a cluster file system (instead of a client of the local file system)
- NFSv4 adds more state, which makes integration with a cluster file system more interesting

pNFS and Cluster File Systems
[Diagram: pNFS clients talk to a row of servers that each run the cluster file system plus NFSv4/pNFS, alongside native cluster file system clients.]
- Replace proprietary cluster file system protocols with pNFS
- Lots of room for innovation inside the cluster file system

pNFS Ops Summary
- GETDEVINFO: maps the opaque device ID used in layout data structures to the storage protocol type and the addressing information needed for that device
- LAYOUTGET: fetch location and access control information (i.e., capabilities)
- LAYOUTCOMMIT: commit write activity; the new file size and attributes become visible on storage
- LAYOUTRELEASE: give up the lease on the layout
- CB_LAYOUTRETURN: server callback to recall a layout lease
- (A pseudocode sketch of the client-side flow follows this slide.)
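The five ops above imply a simple client-side sequence: get a layout, resolve the device IDs it names, perform I/O directly against the storage devices, then commit and release. Below is a minimal sketch of that flow; the class and method names (PnfsClient, mds.layoutget, and so on) are hypothetical illustrations built from the slide's op names, not the actual NFSv4 wire protocol or any real client API.

```python
# Hypothetical sketch of the client-side pNFS op flow described above.
# The metadata server and layout-driver objects are assumed, not real APIs.

class PnfsClient:
    def __init__(self, metadata_server, storage_drivers):
        self.mds = metadata_server          # speaks NFSv4 + pNFS ops
        self.drivers = storage_drivers      # file/object/block layout drivers
        self.devices = {}                   # cache: device ID -> addressing info

    def write(self, filehandle, offset, data):
        # LAYOUTGET: fetch locations and capabilities for this byte range.
        layout = self.mds.layoutget(filehandle, offset, len(data))

        # GETDEVINFO: resolve each opaque device ID to a protocol type + address.
        for dev_id in layout.device_ids:
            if dev_id not in self.devices:
                self.devices[dev_id] = self.mds.getdevinfo(dev_id)

        # Direct, parallel I/O to the storage devices named in the layout;
        # each segment maps a byte range of the file to one device.
        for segment in layout.segments(offset, len(data)):
            device = self.devices[segment.device_id]
            self.drivers[device.protocol].write(device, segment, data)

        # LAYOUTCOMMIT makes the new size/attributes visible on storage;
        # LAYOUTRELEASE gives up the lease once the client is done.
        self.mds.layoutcommit(filehandle, offset, len(data))
        self.mds.layoutrelease(filehandle, layout)
```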
Multiple Data Server Protocols
- Be inclusive!! Broaden the market reach
- Three (or more) flavors of out-of-band metadata attributes:
  - BLOCKS: SBC/FCP/FC or SBC/iSCSI for files built on blocks
  - OBJECTS: OSD/iSCSI/TCP/IP/GE for files built on objects
  - FILES: NFS/ONCRPC/TCP/IP/GE for files built on subfiles
- Inode-level encapsulation in server and client code
[Diagram: client apps sit on a pNFS installable file system with per-protocol layout drivers (1. SBC blocks, 2. OSD objects, 3. NFS files); NFSv4 is extended with orthogonal layout metadata attributes; the pNFS server grants and revokes layout metadata on top of its local file system.]

March 2005 Connectathon
- NetApp server prototype by Garth Goodson
  - NFSv4 is the data protocol
  - Layouts are an array of filehandles and stateIDs
  - The storage management protocol is also NFSv4
- Client prototype by Sun: a Solaris 10 variant

Object Storage
- The object interface is midway between files and blocks (think inode)
  - Create, Delete, Read, Write, GetAttr, SetAttr, ...
- Objects have numeric IDs, not pathnames
- Objects have a capability-based security scheme (details in a moment)
- Objects have extensible attributes to hold high-level file system information
- Based on NASD and OBSD research out of CMU (Gibson et al.)
- SNIA and T10 standards based; V1 complete, V2 in progress

Objects as Building Blocks
- An object comprises user data, attributes, and layout
- Interface: ID <dev, grp, obj>; Read/Write; Create/Delete; GetAttr/SetAttr; capability-based
- File component: stripe files across storage nodes

Objects and pNFS
- Clients speak pNFS to the metadata server (or they will, eventually)
- Clients speak iSCSI/OSD to the data servers (OSDs)
  - Files are striped across multiple objects (see the striping sketch after the Object Security slides)
- The metadata manager speaks iSCSI/OSD to the data servers
- The metadata manager uses extended attributes on objects to store metadata
  - High-level file system attributes such as owners and ACLs
  - Storage maps
- Directories are just files stored in objects

Object Security (1)
- Clean security model based on shared, secret device keys
  - Part of the T10 standard
- The metadata manager knows the device keys
- The metadata manager signs capabilities, and the OSD verifies them (explained on the next slide)
- The metadata server returns a capability to the client
- The capability encodes: object ID, operation, expiry time, data range, capability version, and a signature
  - The data range allows serialization over file data, as well as different rights to different attributes
  - The capability version allows revocation by changing it on the object
- We may want to protect the capability with privacy (encryption) to avoid snooping on the path between the metadata manager and the client

Object Security (2)
- Capability request (client to metadata manager)
  - CapRequest = object ID, operation, data range
- Capability signed with the device key known by the metadata manager
  - CapSignature = {CapRequest, Expires, CapVersion}_KeyDevice
  - CapSignature is then used as a signing key (!)
- OSD command (client to OSD)
  - The request contains {CapRequest, Expires, CapVersion}, other details, and a nonce to prevent replay attacks
  - RequestSignature = {Request}_CapSignature
  - The OSD can compute CapSignature by signing CapRequest with its own key
  - The OSD can then verify RequestSignature
- Caches and other tricks are used to make this go fast
- (A sketch of this two-level signing chain follows this slide.)
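The two-level signing chain above (the device key signs the capability, and the capability signature in turn keys the per-request signature) is easy to see in code. Here is a minimal sketch assuming a keyed MAC (HMAC-SHA256) for both steps; the field layout and function names are illustrative, not the exact T10 OSD encoding.

```python
import hmac
import hashlib
import os

def sign(key: bytes, message: bytes) -> bytes:
    """Keyed MAC used at both levels of the capability scheme (assumed HMAC)."""
    return hmac.new(key, message, hashlib.sha256).digest()

# --- Metadata manager: issue a capability ---------------------------------
def make_capability(device_key, cap_request, expires, cap_version):
    # CapSignature = {CapRequest, Expires, CapVersion} signed with the device key.
    return sign(device_key, cap_request + expires + cap_version)

# --- Client: sign each OSD request with the capability signature ----------
def sign_request(cap_signature, request, nonce):
    # The capability signature itself is the signing key for the request.
    return sign(cap_signature, request + nonce)

# --- OSD: verify without talking to the metadata manager ------------------
def verify_request(device_key, cap_request, expires, cap_version,
                   request, nonce, request_signature):
    # Recompute CapSignature from the shared device key, then check the
    # per-request signature.
    cap_signature = sign(device_key, cap_request + expires + cap_version)
    expected = sign(cap_signature, request + nonce)
    return hmac.compare_digest(expected, request_signature)

# Tiny usage example with made-up field values.
device_key = os.urandom(32)
cap_request = b"obj=0x2a;op=WRITE;range=0-65535"
expires, cap_version = b"2005-04-01", b"1"

cap_sig = make_capability(device_key, cap_request, expires, cap_version)
nonce = os.urandom(16)
request = cap_request + expires + cap_version + b";len=4096"
req_sig = sign_request(cap_sig, request, nonce)
assert verify_request(device_key, cap_request, expires, cap_version,
                      request, nonce, req_sig)
```

The point of the design is visible in verify_request: the OSD only needs its own device key and the fields carried in the request, so capability checking stays local to the storage device.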
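The "Objects as Building Blocks" and "Objects and pNFS" slides above say files are striped across multiple objects. The offset arithmetic for one simple case, plain RAID-0 style striping with a fixed stripe unit, is sketched below; real layouts returned by LAYOUTGET may be more elaborate, and this function is only an illustration.

```python
def stripe_location(offset: int, stripe_unit: int, num_objects: int):
    """Map a file byte offset to (object index, offset within that object)
    for RAID-0 style striping across num_objects component objects."""
    stripe_number = offset // stripe_unit          # which stripe unit overall
    object_index = stripe_number % num_objects     # which component object
    full_rounds = stripe_number // num_objects     # full stripes already laid out
    object_offset = full_rounds * stripe_unit + (offset % stripe_unit)
    return object_index, object_offset

# Example: 64 KB stripe unit across 4 objects; file byte 300,000 lands in
# component object 0 at offset 103,392.
print(stripe_location(300_000, 64 * 1024, 4))  # -> (0, 103392)
```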
Scalable Bandwidth
[Chart: Panasas aggregate bandwidth (GB/sec) versus number of OSDs, scaling from 0 to roughly 350 OSDs. Each Panasas OSD has 2 disks, a processor, and a GE NIC.]

Per Shelf Bandwidth
[Chart: bandwidth (MB/sec) versus number of clients for 10 OSDs, with separate 64 KB read and 64 KB write curves.]

Scalable NFS
[Chart: SPEC SFS response time (msec) versus SPEC SFS ops/sec for 60-director and 10-director configurations, with the overall response time (ORT) marked for each.]
[Chart: rendering load from 1000 clients.]

Status
- pNFS ad hoc working group meetings
  - Dec '03 Ann Arbor, April '04 FAST, Aug '04 IETF, Sept '04 Pittsburgh
- Internet Drafts
  - draft-gibson-pnfs-problem-statement-01.txt, July 2004
  - draft-gibson-pnfs-reqs-00.txt, October 2004
  - draft-welch-pnfs-ops-00.txt, October 2004
  - draft-welch-pnfs-ops-01.txt, March 2005
- Next steps
  - Add the NFSv4 layout and storage protocol to the current draft
  - Add text for additional error and recovery cases
  - Add separate drafts for object storage and block storage

Backup

Object Storage File System
- Out-of-band control path
- Direct data path

"Out-of-band" Value Proposition
- Out-of-band allows a client to use more than one storage address for a given file, directory, or closely linked set of files
  - Parallel I/O direct from the client to multiple storage devices
- Scalable capacity: a file/dir uses space on all storage, so it can get big
- Capacity balancing: a file/dir uses space on all storage, evenly
- Load balancing: dynamic access to a file/dir over all storage, evenly
- Scalable bandwidth: dynamic access to a file/dir over all storage, so bandwidth can get big
- Lower latency under load: no bottleneck developing deep queues
- Cost-effectiveness at scale: use streamlined storage servers
- A pNFS standard leads to standard client software: share the client support $$$

Scaling and the Server
- Tension between sharing and throughput
  - The file server provides semantics, including sharing
  - Direct-attached I/O provides throughput, but no sharing
- The file server is a bottleneck between clients and storage
  - Pressure to make the server ever faster and more expensive
  - Clustered NAS solutions, e.g., Spinnaker
- SAN file systems provide sharing and direct access
  - Asymmetric, out-of-band systems with distinct control and data paths
  - Proprietary solutions with vendor-specific clients
  - Physical security model, which we would like to improve
[Diagram: clients funnel through a single file server to reach storage.]

Symmetric File Systems
- Distribute storage among all the clients
  - GPFS (AIX), GFS, PVFS (user level)
- Reliability issues
  - Compute nodes are less reliable because of the disks
  - Storage is less reliable unless replication schemes are employed
- Scalability issues
  - Steals cycles from the clients, which have other work to do
  - Couples computing and storage
- Like the early days of engineering workstations and private storage