The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003
Peter J. Braam, braam@clusterfs.com
Tim Reddin, tim.reddin@hp.com
Cluster File Systems, Inc., http://www.clusterfs.com

Topics
- History of the project
- High-level picture
- Networking
- Devices and fundamental APIs
- File I/O
- Metadata & recovery
- Project status

Lustre's History

Project history
- 1999: CMU & Seagate
  - Worked with Seagate for one year
  - Storage management, clustering
  - Built prototypes, much design
  - Much survives today
- 2000-2002: File system challenge
  - First put forward Sep 1999, Santa Fe
  - New architecture for the National Labs
  - Characteristics: 100's of GB/sec of I/O throughput, trillions of files, 10,000's of nodes, petabytes
  - From the start, Garth & Peter in the running

2002-2003: fast lane
- 3-year ASCI Path Forward contract with HP and Intel
- MCR & ALC, 2 x 1000-node Linux clusters
- PNNL: HP IA64, 1000-node Linux cluster
- Red Storm, Sandia (8000 nodes, Cray)
- Lustre Lite 1.0
- Many partnerships (HP, Dell, DDN, ...)

2003: Production, performance
- Spring and summer:
  - LLNL MCR from no, to partial, to full-time use
  - PNNL similar
  - Stability much improved
- Performance:
  - Summer 2003: I/O problems tackled
  - Dec/Jan: metadata much faster, Lustre 1.0

High-level picture

Lustre Systems: Major Components
- Clients
  - Have access to the file system
  - Typical role: compute server
- OST (Object Storage Targets)
  - Handle (stripes of, references to) file data
- MDS (Metadata Server)
  - Metadata request transaction engine
- Also: LDAP, Kerberos, routers, etc.

[Diagram: Lustre clients (1,000 with Lustre Lite, up to 10,000's) reach Linux OST servers with disk arrays and 3rd-party OST appliances (OST 1-7) over QSW Elan, GigE, or a SAN; MDS 1 (active) and MDS 2 (failover) form the metadata server pair.]

[Diagram: Clients talk to the Meta-data Server (MDS) for directory operations, metadata & concurrency, and to the Object Storage Targets (OST) for file I/O & file locking; MDS and OSTs cooperate on recovery, file status & file creation; an LDAP server holds configuration information, network connection details & security management.]

Networking

Lustre Networking
- Currently runs over: TCP, Quadrics Elan 3 & 4
- Beta: Myrinet, SCI
- Under development: SAN (FC/iSCSI), I/B
- Planned: SCTP, some special NUMA and other nets
- Lustre can route and can use heterogeneous nets

Lustre Network Stack: Portals
- Lustre request processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
- NIO API: moves small & large buffers, remote DMA handling, generates events
- Portal library: Sandia's API, with a CFS-improved implementation
- Portal NALs: Network Abstraction Layer for TCP, QSW, etc.; small & hard; includes a routing API
- Device library (Elan, Myrinet, TCP, ...)
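The NAL layer in the stack above can be pictured as a small per-transport operations table that the transport-neutral Portals library calls into. The sketch below is only an illustration of that idea under invented names (nal_ops, tcp_send, and so on); it is not the real Portals or Lustre source.

    /* Hypothetical sketch of a Portals-style Network Abstraction Layer (NAL).
     * Each transport (TCP, Quadrics/QSW, Myrinet, ...) supplies one small
     * operations table; the portals library above it stays transport-neutral.
     * Names and signatures are invented for illustration. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef uint64_t nal_nid_t;                 /* network-wide node id */

    struct nal_ops {
        /* move small & large buffers (large ones typically via remote DMA) */
        int  (*send)(nal_nid_t peer, const void *buf, size_t len);
        int  (*rdma_get)(nal_nid_t peer, uint64_t match, void *dst, size_t len);
        /* generate completion events for the layer above */
        void (*event)(nal_nid_t peer, int event_type);
        /* routing hook: which next hop reaches this peer on a heterogeneous net */
        int  (*route)(nal_nid_t peer, nal_nid_t *next_hop);
    };

    /* --- a toy "TCP" NAL --------------------------------------------------- */
    static int tcp_send(nal_nid_t peer, const void *buf, size_t len)
    {
        (void)buf;
        printf("tcp NAL: send %zu bytes to nid %llu\n", len, (unsigned long long)peer);
        return 0;
    }
    static int tcp_rdma_get(nal_nid_t peer, uint64_t match, void *dst, size_t len)
    {
        (void)match; (void)dst;
        printf("tcp NAL: emulate RDMA of %zu bytes from nid %llu\n",
               len, (unsigned long long)peer);
        return 0;
    }
    static void tcp_event(nal_nid_t peer, int event_type)
    {
        printf("tcp NAL: event %d from nid %llu\n", event_type, (unsigned long long)peer);
    }
    static int tcp_route(nal_nid_t peer, nal_nid_t *next_hop)
    {
        *next_hop = peer;                       /* directly reachable */
        return 0;
    }

    static const struct nal_ops tcp_nal = { tcp_send, tcp_rdma_get, tcp_event, tcp_route };

    int main(void)
    {
        char req[64] = { 0 };
        nal_nid_t hop;
        tcp_nal.route(12, &hop);                /* portals layer asks for a route */
        tcp_nal.send(hop, req, sizeof req);     /* then moves the request buffer  */
        tcp_nal.event(12, 1);
        return 0;
    }

Because each transport only has to fill in this small table, the same request-processing and recovery code can run over TCP, Elan, or Myrinet, and the routing hook is what lets Lustre bridge heterogeneous networks.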
Devices and APIs

Lustre Devices & APIs
- Lustre has numerous driver modules
- One API, very different implementations
- A driver binds to a named device
- Stacking devices is key
- Generalized "object devices"
- Drivers currently export several APIs:
  - Infrastructure (a mandatory API)
  - Object storage
  - Metadata handling
  - Locking
  - Recovery

Lustre Clients & APIs
[Diagram: the Lustre file system (Linux) or Lustre library (Win, Unix, micro-kernels) sits on two stacks: data objects & locks go through the Logical Object Volume (LOV driver) to OSC1 ... OSCn; metadata & locks go through the clustered MD driver to the MDCs.]

Object Storage API
- Objects are (usually) unnamed files
- Improves on the block device API: create, destroy, setattr, getattr, read, write
- The OBD driver does block/extent allocation
- Implementation: Linux drivers, using a file system backend

Bringing it all together
[Diagram: the Lustre client file system provides system & parallel file I/O and file locking through OSCs, an MDC and a lock client, over Portal NALs and devices (Elan, TCP, ...). The MDS server combines a lock server and lock client, a metadata write-back cache, recovery, load balancing and an ext3/Reiser/XFS backend on Fibre Channel. The OST runs the object-based disk (OBD) server, a lock server, recovery and request processing over the NIO API and Portal library, also with an ext3/Reiser/XFS backend on Fibre Channel.]

File I/O: Write Operation
- Open the file on the metadata server
- Get information on all objects that are part of the file:
  - Object IDs
  - Which storage controllers (OSTs) hold them
  - Which part of the file (offset)
  - Striping pattern
- Create LOV and OSC drivers
- Use the connection to the OST
- Object writes go to the OST, with no MDS involvement at all (see the code sketch after the Metadata slides)

[Diagram: the client file system sends the file-open request to the MDS through the MDC; the MDS returns the file metadata for inode A, {(OST1, obj1), (OST3, obj2)}; the LOV then issues write(obj1) through OSC 1 to OST 1 and write(obj2) through OSC 2 to OST 3.]

I/O bandwidth
- 100's of GB/sec means saturating many 100's of OSTs
- OSTs: ext3 extent allocation, non-caching direct I/O; lock management spread over the cluster
- Achieve 90-95% of network throughput
- Single client, single thread (Elan3): writes at 269 MB/sec
- OSTs handle up to 260 MB/sec without the extent code, on a 2-way 2.4 GHz Xeon

Metadata

Intent locks & Write-Back caching
- Clients to MDS: protocol adaptation
- Low concurrency: write-back caching
  - Client updates in memory, with delayed replay to the MDS
- High concurrency (mostly merged in 2.6):
  - Single network request per transaction
  - No lock revocations to clients
  - Intent-based lock includes the complete request

[Diagram: (a) a conventional mkdir needs several client/server round trips (lookup, lookup, mkdir) before the server creates the directory; (b) a Lustre mkdir (lustre_mkdir) sends one lookup carrying a mkdir intent, the lock module exercises the intent, and mds_mkdir completes it in a single round trip.]

Lustre 1.0
- Only has the high-concurrency model
- Aggregate throughput (1,000 clients):
  - ~5000 file creations (open/close) per second
  - ~7800 stats per second in 10 x 1M-file directories
- Single client: around 1500 creations or stats per second
- Handling 10M-file directories is effortless
- Many changes to ext3 (all merged in 2.6)

Metadata Future
- Lustre 2.0, 2004
- Metadata clustering: common operations will parallelize 100%
- Write-back caching in memory or on disk, like AFS
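To make the write operation described in the File I/O slides concrete, here is a minimal user-space sketch of the idea: the MDS open returns the file's object layout (which object lives on which OST, plus the stripe size), and subsequent writes go straight to the OSTs with no further MDS involvement. All names (struct layout, ost_write, file_write) are invented; the real client does this striping inside the LOV/OSC kernel drivers, and the example layout mirrors the inode A case from the diagram.

    /* Sketch of the Lustre client write path: stripe a file write across the
     * objects named in the layout returned by the MDS at open time. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct ost_object { uint32_t ost_idx; uint64_t obj_id; };

    struct layout {                      /* returned by the MDS on open */
        uint32_t stripe_count;
        uint32_t stripe_size;            /* bytes per stripe chunk */
        struct ost_object objs[8];
    };

    /* Object storage call: write `len` bytes at `obj_off` into one object. */
    static int ost_write(const struct ost_object *o, uint64_t obj_off,
                         const void *buf, size_t len)
    {
        (void)buf;
        printf("OST %u: write obj %llu off %llu len %zu\n",
               (unsigned)o->ost_idx, (unsigned long long)o->obj_id,
               (unsigned long long)obj_off, len);
        return 0;                        /* pretend the RPC succeeded */
    }

    /* LOV-style striping: map each file offset to (object, object offset). */
    static int file_write(const struct layout *lo, uint64_t off,
                          const char *buf, size_t len)
    {
        while (len > 0) {
            uint64_t chunk    = off / lo->stripe_size;
            uint32_t stripe   = chunk % lo->stripe_count;
            uint64_t in_chunk = off % lo->stripe_size;
            size_t n = lo->stripe_size - in_chunk;
            if (n > len) n = len;
            /* object offset skips the chunks held by the other stripes */
            uint64_t obj_off = (chunk / lo->stripe_count) * lo->stripe_size + in_chunk;
            if (ost_write(&lo->objs[stripe], obj_off, buf, n))
                return -1;
            off += n; buf += n; len -= n;
        }
        return 0;
    }

    int main(void)
    {
        /* Layout as the MDS might return it for inode A: {(OST1,obj1),(OST3,obj2)} */
        struct layout lo = { 2, 65536, { {1, 1}, {3, 2} } };
        static char data[200000];
        memset(data, 'x', sizeof data);
        return file_write(&lo, 0, data, sizeof data);
    }

The point of the sketch is the division of labour: the MDS is touched once at open, and everything after that is pure object I/O, which is why aggregate bandwidth scales with the number of OSTs.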
Recovery

Recovery approach
- Keep it simple!
- Based on failover circles
  - At HP we use failover pairs
  - Use existing failover software
  - Your left working neighbour is your failover node
- Simplify storage connectivity
- An I/O failure triggers (sketched in code at the end of these notes):
  - The peer node serving the failed OST
  - Retries from the client being routed to the new OST node

OST server: redundant pair
[Diagram: OST1 and OST2 form a redundant pair; each connects through two FC switches to the same back-end controllers (C1, C2), so either server can reach the other's storage.]

Configuration

Lustre 1.0
- Good tools to build the configuration
- Configuration is recorded on the MDS, or on a dedicated management server
- Configuration can be changed; 1.0 requires downtime
- Clients auto-configure: mount -t lustre -o … mds://fileset/sub/dir /mnt/pt
- SNMP support

Futures

Advanced Management
- Snapshots
- Global namespace: combine the best of AFS & autofs4
- HSM, hot migration
  - All the features you might expect
  - Driven by customer demand (we plan XDSM)
- Online, 0-downtime re-configuration
- Part of Lustre 2.0

Security
- Authentication
- POSIX-style authorization
- NASD-style OST authorization
  - Refinement: use OST ACLs and cookies
- File encryption with a group key service
  - STK secure file system

Project status

Lustre Feature Roadmap
- Lustre (Lite) 1.0 (2003): failover MDS (Linux 2.4 & 2.6); basic Unix security; very fast file I/O (~100's of OSTs); intent-based scalable metadata; POSIX compliant
- Lustre 2.0 (Linux 2.6, 2004): metadata cluster; basic Unix security; collaborative read cache; write-back metadata; parallel I/O
- Lustre 3.0 (2005): metadata cluster; advanced security; storage management; load-balanced MD; global namespace

Cluster File Systems, Inc.
- Small service company: 20-30 people
- Extremely specialized, with extreme expertise
- Software development & service (95% Lustre)
  - Contract work for Government labs
  - OSS, but defense contracts
  - We only do file systems and storage
- Investments not needed; profitable
- Partners: HP, Dell, DDN, Cray

Lustre: conclusions
- Great vehicle for advanced storage software
- Things are done differently
  - Protocols & design from Coda & InterMezzo
  - Stacking & DB recovery theory applied
  - Leverage existing components
- Initial signs promising

HP & Lustre
- Two projects:
  - ASCI PathForward (Hendrix)
  - Lustre storage product: field trial in Q1 of '04

Questions?
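As a closing illustration of the recovery behaviour on the failover-pair slides (the peer node serves the failed OST and the client's retry is routed to it), here is a small hedged sketch. Every name in it (ost_target, ost_io, client_write, the node strings) is invented, and the failure detection is faked with a flag; the real system relies on existing failover software and shared FC storage.

    /* Toy model of a failover pair: an I/O failure marks the primary dead
     * and the retry goes to the peer, which now serves the failed OST. */
    #include <stdio.h>
    #include <string.h>

    static const char *down_node = NULL;      /* simulates a crashed server */

    struct ost_target {
        const char *primary;                  /* node normally serving the OST */
        const char *peer;                     /* left working neighbour        */
        int primary_failed;
    };

    static int ost_io(const char *node, int objid)
    {
        if (down_node && strcmp(node, down_node) == 0) {
            printf("I/O for object %d to %s failed\n", objid, node);
            return -1;
        }
        printf("object %d written via %s\n", objid, node);
        return 0;
    }

    static int client_write(struct ost_target *t, int objid)
    {
        const char *node = t->primary_failed ? t->peer : t->primary;
        if (ost_io(node, objid) == 0)
            return 0;
        /* I/O failure triggers failover: the peer now serves the failed OST,
         * so the retry is simply routed to it. */
        t->primary_failed = 1;
        return ost_io(t->peer, objid);
    }

    int main(void)
    {
        struct ost_target ost1 = { "ost-node-a", "ost-node-b", 0 };
        client_write(&ost1, 42);              /* normal path: primary serves it  */
        down_node = "ost-node-a";             /* simulate the primary going down */
        client_write(&ost1, 43);              /* fails, retried against the peer */
        client_write(&ost1, 44);              /* later I/O goes straight to peer */
        return 0;
    }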