AN INTRODUCTION TO LUSTRE ARCHITECTURE
Malcolm Cowe
High Performance Data
August 2015
Legal Information
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information
to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No
computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising
from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain
the latest forecast, schedule, specifications and roadmaps.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR
ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF
EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY,
PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN,
MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or
"undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to
change without notice. Do not finalize a design with this information.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation
A brief [and inaccurate] recent history of big computers and their storage
Building blocks for Lustre scalable storage

[Diagram: the management target (MGT), metadata target (MDT) and object storage targets (OSTs) are attached to storage servers grouped into failover pairs. The management and metadata servers and the object storage servers connect to a management network and to a high performance data network (InfiniBand, 10GbE) that serves the Lustre clients (1 – 100,000+).]
Traditional network file systems vs Lustre

Distributed client, central server (e.g. NFS, SMB):
• Performance and scalability are limited by single server bandwidth
• All clients use a single network path to one server
• Adding a 2nd server can provide HA but does not increase performance; the single server remains the bottleneck
• Adding a new storage server creates a separate file system
• Storage becomes fragmented into silos

Distributed client, distributed servers (Lustre):
• Performance and capacity scale linearly as the client population grows
• Client network scalability is not limited by the file system
• All clients can access all storage servers simultaneously and in parallel
• The storage server is the building block for scalability: add storage servers to increase capacity and performance
• Single coherent name space across all servers; a single file system can have hundreds of network storage servers
• Lustre clients (1 – 100,000+)
Intel EE Lustre building blocks

Metadata is stored separately from file object data. There is usually only one MDS pair per Lustre file system.

Uniformity in building block configuration promotes uniformity in performance as the system grows.

The OSS building block is the unit of scalability: add more OSS building blocks to increase capacity and throughput bandwidth. Balance server IO with storage IO capability for best utilisation.

[Diagram: MGT, MDT and OSTs attached to storage servers grouped into failover pairs; the metadata servers and object storage servers are managed by Intel Manager for Lustre over the management network, with Lustre clients (1 – 100,000+) on a high performance data network (InfiniBand, 10GbE).]
Lustre HSM system architecture

[Diagram: the standard Lustre building blocks (MGT, MDT, OSTs on failover server pairs, Intel Manager for Lustre, management and data networks, Lustre clients 1 – 100,000+), with a policy engine and HSM agents attached to the data network moving file data to an archive tier.]
Lustre SMB system architecture

[Diagram: the standard Lustre building blocks, plus a CTDB cluster of Samba gateway nodes. The gateways are Lustre clients on the data network, coordinate over a private CTDB network, and re-export the file system to SMB clients on a public network.]
LUSTRE OVERVIEW
Lustre overview

Client-server, parallel, distributed, network file system

Servers manage presentation of storage to a network of clients

3 different classes of server:
• Management server provides configuration information, file system registries
• Metadata servers record the file system namespace, inodes. File system index
• Object storage servers record file content in distributed binary objects
• A single file is comprised of 1 or more objects, organised in concats (RAID 0 stripes) across multiple storage targets

Lustre separates metadata (inode) storage from block data storage (file content)

Clients aggregate the name space and object data to present a coherent POSIX file system to applications

Clients do not access storage directly; all I/O is sent over a network
Lustre overview: networking (1/2)

Lustre I/O is sent over a network using a protocol called LNet
• All application I/O is transacted over a network. Applications do not run directly on storage servers

Native support for most networks. The most commonly used today are:
• RDMA verbs support for OFA-derived drivers (InfiniBand, RoCE)
• TCP sockets (Ethernet)

Heterogeneous network support:
• Multi-homed servers
• Lustre network (LNet) routers
  • Provide a gateway between different LNet networks
  • Scalable resource: multiple routers can be grouped into pools
Lustre overview: networking (2/2)

The Lustre network protocol is connection-based
• End-points maintain shared, coordinated state
• Servers maintain exports for each active connection from a client to a target; clients maintain imports
• Connections are per-target and per-client
• If a server exports i targets to j clients, there will be i x j exports on the server, and each client will have i imports (see the sketch below)

Most Lustre protocol actions are initiated by clients
• The most common activity in the Lustre protocol is for a client to initiate an RPC to a specific target
• A server may also initiate an RPC to the target on another server
  • e.g. an MDS RPC to the MGS for configuration data, or an RPC to an OST to update the MDS’s state with available space data
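As a hedged illustration of the export/import bookkeeping above (the function name and example figures are invented for this sketch; it is not Lustre code):

```python
# Minimal sketch: connection state for a single server, per the slide.
def connection_state(i_targets: int, j_clients: int):
    """A server exporting i targets to j clients holds i * j exports;
    each of those clients holds i imports for that server's targets."""
    exports_on_server = i_targets * j_clients
    imports_per_client = i_targets
    return exports_on_server, imports_per_client

# Example: an OSS exporting 8 OSTs to 1,000 clients.
exports, imports = connection_state(8, 1000)
print(exports, imports)  # 8000 exports on the server, 8 imports per client
```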
Lustre and Linux

The core of Lustre runs in the Linux kernel on both servers and clients

Lustre servers using ldiskfs (derived from EXT4) OSD storage require patches to the Linux kernel
• Performance patches for the most part
• The list of patches continues to shrink as kernel development advances (compare the RHEL 6 patch list to the RHEL 7 patch list)
• Lustre servers using ZFS OSD storage do not require patched kernels

Developers merge changes upstream when possible
• EXT4 is just one example
• Lustre is considered a ‘niche’ use of the Linux kernel, so a full merge may never occur

Lustre client packages do not require a patched kernel
• The Lustre client is starting to be merged into the mainstream Linux kernel
Lustre and High Availability

Metadata and object servers support failover
• Completely application-transparent
• System calls are guaranteed to complete across failovers
• Individual storage targets are managed as failover resources (active-passive)
• Multiple resources can run on the same failover cluster for optimal utilisation

Data modifications are asynchronous (see the sketch below)
• Clients maintain a transaction log
• On failure, clients replay uncommitted transactions
• Transaction log entries are removed once committed to disk

The system recovers quickly from failed clients
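A minimal, illustrative model of the client-side replay log described above (not Lustre's implementation; the class and method names are invented):

```python
# Illustrative model only: a client keeps sent requests until the server
# confirms they are committed to stable storage, so they can be replayed
# if the server fails over before committing.
class ReplayLog:
    def __init__(self):
        self._pending = {}          # transaction number -> request

    def record(self, transno, request):
        """Track a sent request until the server reports it committed."""
        self._pending[transno] = request

    def server_committed(self, last_committed):
        """Drop entries the server has committed to disk."""
        self._pending = {t: r for t, r in self._pending.items()
                         if t > last_committed}

    def replay(self, resend):
        """After failover, resend uncommitted requests in order."""
        for transno in sorted(self._pending):
            resend(self._pending[transno])

# Usage sketch
log = ReplayLog()
log.record(101, "setattr fileA")
log.record(102, "create fileB")
log.server_committed(101)                          # entry 101 is dropped
log.replay(lambda req: print("replaying:", req))   # replays 102 only
```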
SERVER OVERVIEW
Lustre file system architecture – server overview

Each Lustre file system comprises, at a minimum:
• 1 x Management service (MGS, with MGT storage)
• 1 x Metadata service (MDS, with MDT storage)
• 1+ Object storage services (OSS, with OST storage)

For High Availability, the minimum working configuration is:
• 2 Metadata servers, running the MGS and MDS in a failover configuration
  • MGS service on one node, MDS service on the other node
  • Shared storage for the MGT and MDT
• 2 Object storage servers, running multiple OSTs in a failover configuration
  • Shared storage for the OSTs
  • All OSTs evenly balanced across the OSS servers
Lustre file system – MGS and MDS

The MGS is a global resource and can be associated with more than one Lustre file system
• Carries configuration data only, stored on the MGT
• There can only be one MGT on a metadata server or server pair

The MDS serves the metadata for a file system
• Metadata is stored on an MDT
• Each Lustre file system must have 1 MDT, served by an MDS
• Multiple MDTs can be hosted by a single metadata server pair

The MGS and MDS are usually paired into a high availability server configuration
Management and metadata server storage

The MGS has a shared storage target, the MGT
• 2 disks, RAID 1

The MDS has a high performance shared storage target, the MDT
• RAID 10
• Storage is sized for metadata IO performance and the number of inodes
• Metadata performance is dependent upon fast storage LUNs
• One pair of MDS servers can host multiple MDTs
Server example – MGS and MDS (metadata servers)

[Diagram: two metadata servers (MDS1, MDS2), each with OS disks in RAID 1, connected via HBAs to a shared dual-controller array holding the MGT (RAID 1), MDT0 (RAID 10) and a spare, with cluster communication links between the servers and connections to the Lustre data and management networks.]
Lustre file system – OSS

Object storage servers (OSS) store file content data in binary objects
• A minimum of 2 servers is required for high availability
• Provide storage capacity and bandwidth
• Add more OSS servers to increase capacity and performance
• The aggregate throughput of the file system is the sum of the throughput of each OSS
• OST LUNs should be evenly distributed across OSS servers to balance performance
OSS (object storage servers)

The object storage servers provide scalable bulk storage

Configured as pairs for high availability

The scalable unit: add more OSS server pairs to increase capacity and performance

Each OSS may have several shared object storage targets (OSTs)
• RAID 6 (8+2 is the most common configuration)
• Preferred layout options are RAID 6 (2^N + 2), e.g. 4+2, 8+2, 16+2
• Storage targets are configured for capacity, balanced with performance and resilience
• An OSS pair may have more than one storage array
• 2 arrays per OSS pair is common – effectively one per OSS

A rough sizing sketch follows below.
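As a hedged back-of-the-envelope illustration of the sizing trade-offs above (the disk size, OST count and per-OSS bandwidth are invented example figures, not numbers from this presentation):

```python
# Rough sizing sketch: usable capacity of RAID 6 OSTs and aggregate
# file system throughput. All input figures are example assumptions.
def ost_usable_tb(data_disks, parity_disks, disk_tb):
    """RAID 6 sets two disks' worth of capacity aside for parity."""
    assert parity_disks == 2, "RAID 6 uses two parity disks"
    return data_disks * disk_tb

def filesystem_totals(num_oss, osts_per_oss, ost_tb, oss_gb_per_s):
    capacity_tb = num_oss * osts_per_oss * ost_tb
    throughput_gb_per_s = num_oss * oss_gb_per_s   # sum of per-OSS throughput
    return capacity_tb, throughput_gb_per_s

# Example: 8+2 RAID 6 OSTs built from 4 TB disks, 4 OSS servers with
# 6 OSTs each, assuming roughly 5 GB/s delivered per OSS.
ost_tb = ost_usable_tb(data_disks=8, parity_disks=2, disk_tb=4)      # 32 TB
cap, bw = filesystem_totals(num_oss=4, osts_per_oss=6,
                            ost_tb=ost_tb, oss_gb_per_s=5)
print(cap, "TB usable,", bw, "GB/s aggregate")                        # 768 TB, 20 GB/s
```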
Server example – OSS (object storage servers)

Low density storage chassis based on 24 disks per tray, plus hot spares

[Diagram: two OSS servers (OSS1, OSS2), each with OS disks in RAID 1, connected via HBAs to two dual-controller arrays holding RAID 6 OSTs (OST0–OST3) and spares, with cluster communication links between the servers and connections to the Lustre data and management networks.]
Server example – OSS

High density storage chassis based on 60 disks per tray, no hot spares

[Diagram: two OSS servers (OSS1, OSS2), each with OS disks in RAID 1 and multiple HBA connections to dual-controller high density arrays holding OST0–OST11, with cluster communication links between the servers and connections to the Lustre data and management networks.]
CLIENT OVERVIEW
Lustre file system – clients

The Lustre client combines the metadata and object storage into a single, coherent POSIX file system
• Presented to the client OS as a file system mount point
• Applications access data as for any POSIX file system
• Applications therefore do not need to be re-written to run with Lustre

All Lustre client I/O is sent via RPC over a network connection
• Clients do not make use of any node-local storage and can be diskless
Lustre client services

Management Client (MGC)
• The MGC handles RPCs with the MGS
• All servers (even the MGS) run one MGC, and every Lustre client runs one MGC per MGS

Metadata Client (MDC)
• The MDC handles RPCs with the MDS
• Only Lustre clients make RPCs to the MDS
• Each client runs an MDC for each MDT

Object Storage Client (OSC)
• The OSC handles RPCs with a single OST
• Both the MDS and Lustre clients initiate RPCs to OSTs, so each runs one OSC per OST
Other Lustre client types

Lustre router nodes
• Route Lustre network traffic between networks
• Efficiently connect different network types
• Run only a limited Lustre Networking software stack
• Use RDMA network transfers for efficiency

NFS / SMB server clients
• Lustre clients that re-export the Lustre file system to non-Linux clients
PROTOCOLS OVERVIEW
Clients, File IDs, Layouts
Lustre client I/O overview

The Lustre client mounts the file system root from MDT0

When a client looks up a file name, an RPC is sent to the MDS to get a lock, either:
• a read lock with look-up intent, or
• a write lock with create intent

The MDS returns a lock plus metadata attributes and the file layout to the client
• If the file is new, the MDS also allocates OST objects for the file layout on open
• This avoids the need for further MDS communication until file close

The layout contains an access pattern (stripe info) and a list of OST objects
• This allows the client to access data directly from the OSTs
• Each file has a unique layout
Basic Lustre I/O Operation

[Diagram: a client with an LMV layer over its MDCs and an LOV layer over its OSCs, communicating with the MDS and the OSSs. File data stripes 0–8 map round-robin onto objects A, B and C held on different OSTs.]

1. File open request (client → MDS)
2. Layout EA; FID (ObjA, ObjB, ObjC) returned to the client
3. Objects written in parallel (client → OSSs)
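The three steps above can be sketched, very loosely, in Python. The MDS and OST behaviour is mocked with trivial in-memory stubs; all class and function names are invented for illustration and are not Lustre APIs:

```python
# Schematic sketch of the open/write sequence. MDS and OST behaviour is
# mocked with in-memory stubs; the real client does this in the kernel.
class OSTObject:
    def __init__(self, name):
        self.name, self.chunks = name, {}
    def write(self, offset, data):
        self.chunks[offset] = data        # direct client -> OST write

class MDSStub:
    def open(self, path, create=False):
        # Steps 1-2: open request; the MDS allocates objects for the new
        # file and returns the layout (list of OST object FIDs).
        return [OSTObject("ObjA"), OSTObject("ObjB"), OSTObject("ObjC")]

def write_file(data, mds, stripe_size=4):
    objects = mds.open("/lustre/file", create=True)   # layout EA (FIDs)
    # Step 3: stripes are written directly to the OST objects,
    # round-robin across the layout (RAID 0 pattern).
    for n, off in enumerate(range(0, len(data), stripe_size)):
        obj = objects[n % len(objects)]
        obj.write((n // len(objects)) * stripe_size,
                  data[off:off + stripe_size])
    return objects

objs = write_file(b"0123456789AB", MDSStub())
print({o.name: o.chunks for o in objs})
# {'ObjA': {0: b'0123'}, 'ObjB': {0: b'4567'}, 'ObjC': {0: b'89AB'}}
```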
LOV – Logical Object Volume

A software layer in the client stack

Aggregates multiple OSCs together

Presents a single logical volume to the client

[Diagram: the client's LOV layer sits above the OSCs. Directory operations, file open/close, metadata and concurrency go to the MDS; file I/O and file locking go to the OSS; recovery, file status and file creation pass between the MDS and the OSS.]
LMV – Logical Metadata Volume

A newer software layer in the client stack, introduced in Lustre 2.4

Aggregates multiple MDCs together

Presents a single logical metadata space to the client

[Diagram: the client's LMV layer distributes directory operations, file open/close, metadata and concurrency across multiple MDSs, while the LOV layer handles file I/O and file locking with the OSSs; recovery, file status and file creation pass between the MDSs and the OSSs.]
Locking

A distributed lock manager in the manner of OpenVMS

Cache-coherent across all clients

The metadata server uses inode bit locks for file lookup, state (modification, open r/w/x), EAs and layout
• Clients are only ever granted read locks, and can fetch multiple bit locks for an inode in a single RPC
• The MDS manages all inode modifications to avoid lock resource contention

Object storage servers provide extent-based locks for OST objects (see the sketch below)
• File data locks are managed for each OST
• Clients can be granted read extent locks for part or all of the file, allowing multiple concurrent readers of the same file
• Clients can be granted non-overlapping write extent locks for regions of the file
• Multiple Lustre clients may access a single file concurrently for both read and write, avoiding bottlenecks during file I/O
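As a hedged illustration of the extent-lock rules above (a simplified compatibility check, not the Lustre DLM implementation; all names are invented):

```python
# Simplified model of extent-lock compatibility: two locks conflict only
# if their byte ranges overlap and at least one of them is a write lock.
from typing import NamedTuple

class ExtentLock(NamedTuple):
    start: int
    end: int          # inclusive end offset; 2**64 - 1 ~ "to end of file"
    mode: str         # "read" or "write"

def compatible(a: ExtentLock, b: ExtentLock) -> bool:
    overlap = a.start <= b.end and b.start <= a.end
    if not overlap:
        return True                                   # disjoint extents never conflict
    return a.mode == "read" and b.mode == "read"      # readers can share

r1 = ExtentLock(0, 2**64 - 1, "read")
r2 = ExtentLock(0, 2**64 - 1, "read")
w1 = ExtentLock(0, 1 << 20, "write")
w2 = ExtentLock(2 << 20, 3 << 20, "write")            # non-overlapping region
print(compatible(r1, r2))   # True  – multiple concurrent readers
print(compatible(r1, w1))   # False – writer conflicts with overlapping reader
print(compatible(w1, w2))   # True  – non-overlapping writers coexist
```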
Lustre File identifier (FID)

Lustre file identifiers (FIDs) provide a device-independent replacement for UNIX inode numbers to uniquely identify files or objects

Lustre files and objects are accessed by a unique 128-bit file identifier (FID):
• 64-bit sequence number – used to locate the storage target
  • Unique across all OSTs and MDTs in a file system
• 32-bit object identifier (OID) – reference to the object within the sequence
• 32-bit version number – currently unused; reserved for future work (e.g. snapshot)

The FID-in-dirent feature stores the FID as part of the name of the file in the parent directory
• Significantly improves performance for “ls” command executions by reducing disk I/O

FID layout: sequence number (64-bit) | object ID (32-bit) | version (32-bit)
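A minimal sketch of that 128-bit layout, assuming the field order shown above (the packing helpers are invented for illustration, not a Lustre API):

```python
# Pack/unpack the 128-bit FID layout described above:
# 64-bit sequence | 32-bit object ID | 32-bit version.
import struct

def pack_fid(seq: int, oid: int, ver: int = 0) -> bytes:
    return struct.pack(">QII", seq, oid, ver)      # 16 bytes total

def unpack_fid(raw: bytes):
    return struct.unpack(">QII", raw)

def fid_str(seq: int, oid: int, ver: int = 0) -> str:
    """Conventional [seq:oid:ver] presentation of a FID."""
    return f"[0x{seq:x}:0x{oid:x}:0x{ver:x}]"

raw = pack_fid(0x200000401, 0x1, 0)
print(len(raw), unpack_fid(raw), fid_str(*unpack_fid(raw)))
# 16 (8589935617, 1, 0) [0x200000401:0x1:0x0]
```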
FID allocation

The sequence number is unique across all Lustre targets (OSTs and MDTs)
• Independent of the backend file system

The sequence controller (MDT0) allocates super-sequence ranges to sequence managers
• A super-sequence is a large contiguous range of sequence numbers

Sequence managers control the distribution of sequences to clients
• MDS and OSS servers are sequence managers

Ranges of sequence IDs are granted by managers to clients as reservations
• This allows the client to create the FID for new files using a reserved sequence ID
• When the existing allocation is exhausted, a new set of sequence numbers is provided

A given sequence ID always maps to the same storage target

FIDs are never re-used
FID Location Database (FLDB)

The FID does not contain any location information

The FLDB maps a FID sequence to a specific target (MDT or OST)
• The FLDB is cached by all clients and servers in the file system
• The complete FLDB is held on MDT0
• With DNE, every MDT also has its own local FLD, a subset of the full FLDB

Files created within the same sequence will be located on the same storage target (see the sketch below)
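A toy illustration of the sequence-to-target mapping described above (the sequence ranges and target names are invented example values; this is not the on-disk FLDB format):

```python
# Toy model: the FLDB maps contiguous ranges of FID sequence numbers to
# storage targets, so locating a file only needs its sequence number.
import bisect

class FLDB:
    def __init__(self, ranges):
        # ranges: iterable of (first_seq, last_seq, target)
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, seq):
        i = bisect.bisect_right(self._starts, seq) - 1
        if i >= 0:
            first, last, target = self._ranges[i]
            if first <= seq <= last:
                return target
        raise KeyError(f"sequence 0x{seq:x} not in FLDB")

fldb = FLDB([
    (0x200000400, 0x2000007FF, "MDT0000"),
    (0x280000400, 0x2800007FF, "OST0000"),
    (0x300000400, 0x3000007FF, "OST0001"),
])
print(fldb.lookup(0x200000401))   # MDT0000 – every FID in a sequence
print(fldb.lookup(0x280000555))   # OST0000   resolves to the same target
```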
File layout: striping

Each file in Lustre has its own unique file layout, comprised of 1 or more objects in a concat (RAID 0); a worked mapping example follows below

The file layout is allocated by the MDS

The layout is selected by the client, either:
• by policy (inherited from the parent directory), or
• by the user or application

The layout of a file is fixed once created

[Diagram: files A, B and C striped in RAID 0 fashion across objects on OST00, OST01 and OST02; file A's stripes span all three OSTs, while the smaller files occupy a single object each.]
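A small worked example of how a RAID 0 layout like the ones above maps a file offset to an object, assuming a round-robin stripe pattern (the stripe size, stripe count and function name are example choices for illustration):

```python
# Maps a byte offset in a file to (object index, offset within object)
# for a round-robin RAID 0 layout with the given stripe size and count.
def locate(offset, stripe_size, stripe_count):
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count            # which OST object
    object_offset = (stripe_number // stripe_count) * stripe_size \
                    + offset % stripe_size                 # where inside it
    return object_index, object_offset

# Example: 1 MiB stripes over 3 objects.
MiB = 1 << 20
for off in (0, 1 * MiB, 2 * MiB, 3 * MiB, int(4.5 * MiB)):
    print(f"file offset {off / MiB:>4} MiB -> object {locate(off, MiB, 3)}")
# Offsets 0, 1 and 2 MiB land on objects 0, 1 and 2; 3 MiB wraps back to
# object 0 at offset 1 MiB; 4.5 MiB lands on object 1 at offset 1.5 MiB.
```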