AN INTRODUCTION TO LUSTRE ARCHITECTURE
Malcolm Cowe, High Performance Data
August 2015

Legal Information
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation
A brief [and inaccurate] recent history of big computers and their storage

Building blocks for Lustre scalable storage
[Diagram: management and metadata servers (MGT, MDT) and object storage servers (OSTs) grouped into failover pairs, attached to a management network and a high performance data network (InfiniBand, 10GbE), serving Lustre clients (1 – 100,000+).]

Traditional network file systems vs Lustre
Distributed client, central server (e.g. NFS, SMB clients):
• Performance and scalability are limited by the bandwidth of a single server, which becomes the bottleneck
• Adding a 2nd server can provide HA but does not increase performance
• All clients use a single network path to one server
• Adding a new storage server creates a separate file system, so storage becomes fragmented into silos
Distributed client, distributed servers (Lustre clients, 1 – 100,000+):
• The storage server is the building block for scalability: add storage servers to increase capacity and performance
• Performance and capacity scale linearly as the client population grows
• Client network scalability is not limited by the file system
• All clients can access all storage servers simultaneously and in parallel
• Single coherent name space across all servers; a single file system can contain hundreds of network storage servers

Intel EE Lustre building blocks
• Metadata is stored separately from file object data; usually only one MDS pair per Lustre file system
• The OSS building block is the unit of scalability: add more OSS building blocks to increase capacity and throughput
• Balance server I/O with storage I/O capability for best utilisation
• Uniformity in building block configuration promotes uniformity in performance as the system grows
[Diagram: MGT, MDT and OSTs on storage servers grouped into failover pairs, managed by Intel Manager for Lustre over a management network, with Lustre clients attached via a high performance data network (InfiniBand, 10GbE).]

Lustre HSM system architecture
[Diagram: the standard Lustre servers and clients, plus a policy engine and HSM agents on the data network that move file data between the Lustre file system and an archive tier.]

Lustre SMB system architecture
[Diagram: a CTDB cluster of Samba gateways mounts the Lustre file system over the data network and re-exports it to SMB clients on a public network; the CTDB nodes coordinate over a private CTDB network.]

LUSTRE OVERVIEW

Lustre overview
Client-server, parallel, distributed, network file system
Servers manage presentation of storage to a network of clients
3 different classes of server:
• Management server provides configuration information and file system registries
• Metadata servers record the file system namespace and inodes – the file system index
• Object storage servers record file content in distributed binary objects
• A single file is comprised of 1 or more objects, organised as a concatenation (RAID 0 stripes) across multiple storage targets
Lustre separates metadata (inode) storage from block data storage (file content)
Clients aggregate the name space and object data to present a coherent POSIX file system to applications
Clients do not access storage directly; all I/O is sent over a network
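From the client's point of view, the separate metadata and object storage targets appear as a single mounted file system. As a minimal sketch (the file system name "demo", the MGS NID and the mount point are assumptions for illustration, not values from the slides), mounting a client and listing the targets it aggregates might look like this:

    # Mount the Lustre client (assumed MGS NID and file system name)
    mount -t lustre 192.168.10.1@o2ib0:/demo /mnt/demo

    # List the MDT and OST targets that make up the mounted file system
    lfs df -h /mnt/demo

lfs df reports usage per MDT and per OST, which is also a quick way to confirm that all object storage targets are online and evenly filled.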
Lustre overview: networking (1/2)
Lustre I/O is sent over a network using a protocol called LNet
• All application I/O is transacted over a network; applications do not run directly on the storage servers
Native support for most networks; the most commonly used today are:
• RDMA verbs support for OFA-derived drivers (InfiniBand, RoCE)
• TCP sockets (Ethernet)
Heterogeneous network support:
• Multi-homed servers
• Lustre network (LNet) routers provide a gateway between different LNet networks
• Routers are a scalable resource: multiple routers can be grouped into pools

Lustre overview: networking (2/2)
The Lustre network protocol is connection-based:
• End-points maintain shared, coordinated state
• Servers maintain exports for each active connection from a client to a target; clients maintain imports
• Connections are per-target and per-client: if a server exports i targets to j clients, there will be i x j exports on the server, and each client will have i imports
Most Lustre protocol actions are initiated by clients:
• The most common activity in the Lustre protocol is for a client to initiate an RPC to a specific target
• A server may also initiate an RPC to a target on another server, e.g. an MDS RPC to the MGS for configuration data, or an RPC to an OST to update the MDS's view of available space

Lustre and Linux
The core of Lustre runs in the Linux kernel on both servers and clients
Lustre servers using ldiskfs (derived from EXT4) OSD storage require patches to the Linux kernel:
• Performance patches for the most part
• The list of patches continues to shrink as kernel development advances (compare the RHEL 6 patch list to the RHEL 7 patch list)
• Lustre servers using ZFS OSD storage do not require patched kernels
Developers merge changes upstream when possible:
• EXT4 is just one example
• Lustre is considered a 'niche' use of the Linux kernel, so a full merge may never occur
Lustre client packages do not require a patched kernel
• The Lustre client is starting to be merged into the mainstream Linux kernel

Lustre and High Availability
Metadata and object storage servers support failover:
• Completely transparent to applications
• System calls are guaranteed to complete across failovers
• Individual storage targets are managed as failover resources (active-passive)
• Multiple resources can run on the same failover cluster for optimal utilisation
Data modifications are asynchronous:
• Clients maintain a transaction log
• On failure, clients replay their transactions
• Transaction log entries are removed once committed to disk
The system recovers quickly from failed clients

SERVER OVERVIEW

Lustre file system architecture – server overview
Each Lustre file system comprises, at a minimum:
• 1 x Management service (MGS, with MGT storage)
• 1 x Metadata service (MDS, with MDT storage)
• 1+ Object storage services (OSS, with OST storage)
For High Availability, the minimum working configuration is:
• 2 metadata servers, running the MGS and MDS in a failover configuration
  • MGS service on one node, MDS service on the other node
  • Shared storage for the MGT and MDT
• 2 object storage servers, running multiple OSTs in a failover configuration
  • Shared storage for the OSTs
  • All OSTs evenly balanced across the OSS servers
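As a rough sketch of how such a minimum configuration is brought up (every NID, device path and the file system name "demo" below are assumptions for illustration, not values from the slides), the LNet interface is declared once per node, and each storage target is then formatted and mounted:

    # /etc/modprobe.d/lustre.conf – declare the LNet network on every server and client
    options lnet networks="o2ib0(ib0)"

    # Format the management, metadata and object storage targets (assumed devices)
    mkfs.lustre --mgs /dev/mapper/mgt
    mkfs.lustre --mdt --fsname=demo --index=0 \
        --mgsnode=192.168.10.1@o2ib0 \
        --servicenode=192.168.10.1@o2ib0 --servicenode=192.168.10.2@o2ib0 \
        /dev/mapper/mdt0
    mkfs.lustre --ost --fsname=demo --index=0 \
        --mgsnode=192.168.10.1@o2ib0 \
        --servicenode=192.168.10.3@o2ib0 --servicenode=192.168.10.4@o2ib0 \
        /dev/mapper/ost0

    # Mounting a target starts the corresponding service on that server
    mount -t lustre /dev/mapper/mgt /lustre/demo/mgt
    mount -t lustre /dev/mapper/mdt0 /lustre/demo/mdt0
    mount -t lustre /dev/mapper/ost0 /lustre/demo/ost0

    # Verify the node's LNet identifiers
    lctl list_nids

In a real deployment the target mounts are normally controlled by the HA framework (for example via Intel Manager for Lustre or Pacemaker) rather than mounted by hand, so that each target can fail over between the paired servers.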
Lustre file system – MGS and MDS
The MGS is a global resource and can be associated with more than one Lustre file system:
• Carries configuration data only, stored on the MGT
• There can only be one MGT on a metadata server or server pair
The MDS serves the metadata for a file system:
• Metadata is stored on an MDT
• Each Lustre file system must have 1 MDT served by an MDS
• Multiple MDTs can be hosted by a single metadata server pair
The MGS and MDS are usually paired into a high availability server configuration

Management and metadata server storage
The MGS has a shared storage target, the MGT:
• 2 disks, RAID 1
The MDS has a high performance shared storage target, the MDT:
• RAID 10
• Storage is sized for metadata I/O performance and the number of inodes
• Metadata performance is dependent upon fast storage LUNs
• One pair of MDS servers can host multiple MDTs

Server example – MGS and MDS (metadata servers)
[Diagram: two metadata servers (MDS1, MDS2) with mirrored OS disks (RAID 1) and cluster communication links, multipathed via HBAs to a shared storage array (controllers A and B) holding the MGT (RAID 1), MDT0 (RAID 10) and a hot spare; both servers attach to the Lustre data and management networks.]

Lustre file system – OSS
Object storage servers (OSS) store file content data in binary objects:
• A minimum of 2 servers is required for high availability
• The OSS tier provides storage capacity and bandwidth
• Add more OSS servers to increase capacity and performance
• The aggregate throughput of the file system is the sum of the throughput of each OSS
• OST LUNs should be evenly distributed across OSS servers to balance performance

OSS (object storage servers)
The object storage servers provide scalable bulk storage:
• Configured as pairs for high availability
• The OSS pair is the scalable unit: add more OSS server pairs to increase capacity and performance
Each OSS may have several shared object storage targets (OSTs):
• RAID 6 (8+2 is the most common configuration)
• Preferred layout options are RAID 6 (2^N + 2), e.g. 4+2, 8+2, 16+2
• Storage targets are configured for capacity, balanced with performance and resilience
• An OSS pair may have more than one storage array; 2 arrays per OSS pair is common – effectively one per OSS
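As a worked example (all figures are assumptions for illustration, not taken from the slides): with 4 TB drives in an 8+2 RAID 6 geometry, each OST provides roughly 8 x 4 TB = 32 TB of usable capacity; an OSS pair hosting 8 such OSTs contributes around 256 TB, and a file system built from four of these building blocks offers on the order of 1 PB, with aggregate throughput growing in the same stepwise fashion as OSS pairs are added.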
Server example – OSS (object storage servers)
Low density storage chassis based on 24 disks per tray, plus hot spares
[Diagram: an OSS pair (OSS1, OSS2) with mirrored OS disks (RAID 1) and cluster communication links, multipathed via HBAs to two storage arrays (controllers A and B); each array presents RAID 6 OSTs (OST0–OST3) plus hot spares, and both servers attach to the Lustre data and management networks.]

Server example – OSS
High density storage chassis based on 60 disks per tray, no hot spares
[Diagram: an OSS pair (OSS1, OSS2) with mirrored OS disks, multipathed to two high density enclosures (controllers A and B) presenting twelve RAID 6 OSTs (OST0–OST11); both servers attach to the Lustre data and management networks.]

CLIENT OVERVIEW

Lustre file system – clients
The Lustre client combines the metadata and object storage into a single, coherent POSIX file system:
• Presented to the client OS as a file system mount point
• Applications access data as they would on any POSIX file system
• Applications therefore do not need to be re-written to run with Lustre
All Lustre client I/O is sent via RPC over a network connection:
• Clients do not make use of any node-local storage and can be diskless

Lustre client services
Management Client (MGC):
• The MGC handles RPCs with the MGS
• All servers (even the MGS) run one MGC, and every Lustre client runs one MGC per MGS
Metadata Client (MDC):
• The MDC handles RPCs with the MDS
• Only Lustre clients issue RPCs to the MDS
• Each client runs an MDC for each MDT
Object Storage Client (OSC):
• The OSC handles RPCs with a single OST
• Both the MDS and Lustre clients initiate RPCs to OSTs, so each runs one OSC per OST

Other Lustre client types
Lustre router nodes:
• Route Lustre network traffic between networks
• Efficiently connect different network types
• Run only the Lustre networking software stack
• Use RDMA network transfers for efficiency
NFS / SMB server clients:
• Lustre clients that re-export the Lustre file system to non-Linux clients

PROTOCOLS OVERVIEW
Clients, File IDs, Layouts

Lustre client I/O overview
The Lustre client mounts the file system root from MDT0
When a client looks up a file name, an RPC is sent to the MDS to get a lock, either:
• A read lock with look-up intent, or
• A write lock with create intent
The MDS returns a lock plus the metadata attributes and file layout to the client:
• If the file is new, the MDS also allocates OST objects for the file layout on open
• This avoids the need for further MDS communication until the file is closed
The layout contains an access pattern (stripe information) and a list of OST objects:
• This allows the client to access data directly from the OSTs
• Each file has a unique layout
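The layout returned on open can be inspected from any client. A minimal sketch (the file name and mount point are assumptions for illustration):

    # Show the stripe count, stripe size and the OST objects backing a file
    lfs getstripe /mnt/demo/results.dat

The output lists the stripe count and stripe size, followed by one entry per OST object (OST index and object ID), which is the information the client uses to issue reads and writes directly to the object storage servers.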
Basic Lustre I/O Operation
[Diagram: a client writes a file striped across three OST objects; the client's LMV/MDC layer talks to the MDS, the LOV/OSC layer talks to the OSSs, and data stripes 0–8 are distributed round-robin across objects A, B and C.]
1. The client sends a file open request to the MDS
2. The MDS returns the layout EA and FID (object A, object B, object C)
3. The objects are written in parallel on the OSSs

LOV – Logical Object Volume
A software layer in the client stack:
• Aggregates multiple OSCs together
• Presents a single logical volume to the client
[Diagram: the client communicates with the MDS for directory operations, file open/close, metadata and concurrency, and with the OSSs for file I/O and file locking; the MDS and OSSs communicate for recovery, file status and file creation.]

LMV – Logical Metadata Volume
A software layer in the client stack, introduced in Lustre 2.4:
• Aggregates multiple MDCs together
• Presents a single logical metadata space to the client
[Diagram: as above, but the LMV distributes metadata operations across multiple MDSs, while the LOV continues to handle file I/O to the OSSs.]

Locking
Distributed lock manager in the manner of OpenVMS
Cache-coherent across all clients
The metadata server uses inode bit locks for file lookup, state (modification, open r/w/x), extended attributes and layout:
• Clients are only ever granted read locks and can fetch multiple bit locks for an inode in a single RPC
• The MDS manages all inode modifications to avoid lock resource contention
The object storage servers provide extent-based locks for OST objects:
• File data locks are managed by each OST
• Clients can be granted read extent locks for part or all of a file, allowing multiple concurrent readers of the same file
• Clients can be granted non-overlapping write extent locks for regions of a file
• Multiple Lustre clients may therefore access a single file concurrently for both read and write, avoiding bottlenecks during file I/O

Lustre File Identifier (FID)
Lustre file identifiers (FIDs) provide a device-independent replacement for UNIX inode numbers to uniquely identify files or objects
Lustre files and objects are accessed by a unique 128-bit file identifier (FID):
• 64-bit sequence number – used to locate the storage target; unique across all OSTs and MDTs in a file system
• 32-bit object identifier (OID) – reference to the object within the sequence
• 32-bit version number – currently unused; reserved for future work (e.g. snapshot)

Field:  Sequence number | Object ID | Version
Width:  64-bit          | 32-bit    | 32-bit

The FID-in-dirent feature stores the FID as part of the name of the file in the parent directory:
• Significantly improves performance of "ls" executions by reducing disk I/O
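FIDs are visible from the client and can be translated to and from path names. A quick sketch (the path, mount point and FID value shown are assumptions for illustration):

    # Report the FID of a file; the output is [sequence:OID:version] in hex
    lfs path2fid /mnt/demo/results.dat

    # Resolve a FID back to a path within the file system
    lfs fid2path /mnt/demo '[0x200000401:0x12:0x0]'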
FID allocation
The sequence number is unique across all Lustre targets (OSTs and MDTs):
• Independent of the backend file system
The sequence controller (MDT0) allocates super-sequence ranges to sequence managers:
• A super-sequence is a large contiguous range of sequence numbers
Sequence managers control the distribution of sequences to clients:
• The MDS and OSS servers are sequence managers
Ranges of sequence IDs are granted by managers to clients as reservations:
• This allows a client to create the FID for a new file using a reserved sequence ID
• When the existing allocation is exhausted, a new set of sequence numbers is provided
A given sequence ID always maps to the same storage target
FIDs are never re-used

FID Location Database (FLDB)
The FID does not contain any location information
The FLDB maps a FID sequence to a specific target (MDT or OST):
• The FLDB is cached by all clients and servers in the file system
• The complete FLDB is held on MDT0
• With DNE, every MDT also has its own local FLD, a subset of the full FLDB
Files created within the same sequence will be located on the same storage target

File layout: striping
Each file in Lustre has its own unique file layout, comprised of 1 or more objects in a concatenation (RAID 0)
The file layout is allocated by the MDS
The layout is selected by the client, either:
• by policy (inherited from the parent directory), or
• by the user or application
The layout of a file is fixed once the file has been created
[Diagram: files A, B and C on OST00, OST01 and OST02; file A's stripes are distributed round-robin across all three OSTs, while files B and C each use a single object on one OST.]
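Striping policy is usually applied to directories and inherited by new files, but it can also be set per file before any data is written. A minimal sketch (directory and file names and the stripe parameters are assumptions for illustration):

    # Set a default layout on a directory: 4 stripes of 1 MiB, inherited by new files
    lfs setstripe -c 4 -S 1M /mnt/demo/output

    # Create an empty file striped across all available OSTs
    lfs setstripe -c -1 /mnt/demo/output/checkpoint.dat

    # Confirm the resulting layout
    lfs getstripe /mnt/demo/output/checkpoint.dat

Because the layout is fixed at creation time, stripe settings must be chosen before data is written; changing the striping of an existing file requires copying it into a file with the new layout.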