Ceph: A Scalable, High-Performance Distributed File System
Derek Weitzel

In the Before…
Let's go back through some of the notable distributed file systems used in HPC.

In the Before…
There were distributed filesystems like:
Lustre – RAID over storage boxes
- Recovery time after a node failure was MASSIVE! (An entire server's contents had to be copied, one to one.)
- When functional, reading/writing is EXTREMELY fast
- Used heavily in HPC

In the Before…
There were distributed filesystems like:
NFS – Network File System
- Does this really count as distributed? Single large server
- Full POSIX support, in the kernel since… forever
- Slow with even a moderate number of clients
- Dead simple

In the Current…
There are distributed filesystems like:
Hadoop – Apache project inspired by Google
- Massive throughput; throughput scales with attached HDs
- Have seen VERY LARGE production clusters: Facebook, Yahoo… Nebraska
- Doesn't even pretend to be POSIX

In the Current…
There are distributed filesystems like:
GPFS (IBM) / Panasas – proprietary file systems
- Require a closed-source kernel driver; not flexible with the newest kernels / OSes
- Good: solid support and large communities; can be treated as a black box by administrators
- HUGE installments (Panasas at LANL is HUGE!!!!)

Motivation
Ceph is an emerging technology in the production clustered environment.
Designed for:
- Performance – data striped over data servers
- Reliability – no single point of failure
- Scalability – adaptable metadata cluster

Timeline
- 2006 – Ceph paper written
- 2007 – Sage Weil earned his PhD from Ceph (largely)
- 2007–2010 – Development continued, primarily for DreamHost
- March 2010 – Linus merged the Ceph client into the mainline 2.6.34 kernel; no more patches needed for clients

Adding Ceph to the Mainline Kernel
- A huge development! It significantly lowered the cost to deploy Ceph.
- For production environments it was a little too late – 2.6.32 was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6).
Let's Talk Paper
Then I'll show a quick demo.

Ceph Overview
- Decoupled data and metadata – I/O goes directly to the object servers
- Dynamic distributed metadata management – multiple metadata servers handle different directories (subtrees)
- Reliable autonomic distributed storage – OSDs manage themselves by replicating and monitoring

Decoupled Data and Metadata
- Increases performance by limiting interaction between clients and servers
- Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas…
- In contrast to other filesystems, Ceph uses a function to calculate the block locations

Dynamic Distributed Metadata Management
- Metadata is split among a cluster of servers
- The distribution of metadata changes with the number of requests, to even the load among metadata servers
- Metadata servers can also quickly recover from failures by taking over a neighbor's data
- Improves performance by leveling the metadata load

Reliable Autonomic Distributed Storage
- Data storage servers act on events by themselves and initiate replication
- Improves performance by offloading decision making to the many data servers
- Improves reliability by removing central control of the cluster (a single point of failure)

Ceph Components
Some quick definitions before getting into the paper:
- MDS – Metadata Server
- OSD – Object Storage Device (the object data servers)
- MON – Monitor (now fully implemented)

Ceph Components
Covered in order: (1) Clients, (2) Metadata, (3) Object Storage

Client Overview
- Can be a FUSE mount (File system in USErspace) – FUSE was introduced so file systems can use a better interface than the Linux kernel VFS (virtual file system)
- Can link directly to the Ceph library
- Built into the newest OSes

Client Overview – File I/O
1. Client asks the MDS for the inode information
2. MDS responds with the inode information
3. Client calculates the data location with CRUSH
4. Client reads directly off the storage nodes

Client Overview – File I/O
- The client asks the MDS for only a small amount of information
- Performance: small bandwidth between client and MDS; only a small cache (memory) is needed because the data is small
- The client calculates the file location using a function (a sketch of this flow appears at the end of this client section)
- Reliability: saves the MDS from keeping block locations; the function is described in the data storage section

Ceph Components
Covered in order: (1) Clients, (2) Metadata, (3) Object Storage

Client Overview – Namespace
- Optimized for the common case, 'ls -l': a directory listing immediately followed by a stat of each file
- Reading a directory gives all inodes in the directory

$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros

Namespace covered in detail next!
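To make the file I/O sequence above concrete, here is a minimal sketch in Python. It is not Ceph's actual API: MDSClient, crush_like_locate, and osd_read are hypothetical names, and the hash is only a stand-in for CRUSH. The point it illustrates is that the MDS reply stays tiny and the placement computation happens locally on the client.

```python
import hashlib

class MDSClient:
    """Steps 1-2: the client asks a metadata server only for inode information.
    The reply is small -- no per-block location list is ever stored or returned."""
    def __init__(self, inode_table):
        self.inode_table = inode_table            # path -> {"ino": ..., "num_objects": ...}

    def lookup(self, path):
        return self.inode_table[path]

def crush_like_locate(inode_no, object_no, cluster_map, replicas=3):
    """Step 3: a toy stand-in for CRUSH. Given the inode number and the globally
    shared cluster map, every client computes the same placement group and OSD list."""
    key = f"{inode_no:x}.{object_no:08x}".encode()
    pg = int(hashlib.sha1(key).hexdigest(), 16) % cluster_map["num_pgs"]
    osds = cluster_map["pg_to_osds"][pg]          # placement group -> ordered OSD ids
    return pg, osds[:replicas]                    # the first OSD acts as the primary

def read_file(path, mds, cluster_map, osd_read):
    """Step 4: with the locations computed locally, read object data directly
    from the storage nodes; the MDS is never on the data path."""
    inode = mds.lookup(path)
    chunks = []
    for object_no in range(inode["num_objects"]):
        _pg, osds = crush_like_locate(inode["ino"], object_no, cluster_map)
        chunks.append(osd_read(osds[0], inode["ino"], object_no))
    return b"".join(chunks)
```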
Metadata Overview
- Metadata servers (MDS) serve out the file system attributes and directory structure
- Metadata is stored in the distributed filesystem beside the data (compare this to Hadoop, where metadata is stored only on the head nodes)
- Updates are staged in a journal and flushed occasionally to the distributed file system

MDS Subtree Partitioning
- In HPC applications it is common to have 'hot' metadata that is needed by many clients
- To be scalable, Ceph needs to distribute metadata requests among many servers
- Each MDS monitors the frequency of queries using special counters
- The MDSs compare their counters with each other and split the directory tree to evenly split the load

MDS Subtree Partitioning
- Multiple MDSs split the metadata
- Clients receive the metadata partition information from the MDS during a request

MDS Subtree Partitioning
- Busy directories (many creates or opens) are hashed across multiple MDSs

MDS Subtree Partitioning
- Clients read from a random replica
- Updates go to the primary MDS for the subtree

Ceph Components
Covered in order: (1) Clients, (2) Metadata, (3) Object Storage

Data Placement
- We need a way to evenly distribute data among the storage devices (OSDs)
- Increased performance from even data distribution
- Increased resiliency: with an even distribution, losing any one node minimally affects the status of the cluster
- Problem: we don't want to keep data locations in the metadata servers – that requires lots of memory when there are lots of data blocks

CRUSH
- CRUSH is a pseudo-random function to find the location of data in a distributed filesystem
- Summary: take a little information and plug it into a globally known function (hashing) to find where the data is stored
- Input data: the inode number (from the MDS) and the OSD cluster map (CRUSH map, from the OSDs/monitors)

CRUSH
- CRUSH maps a file to a list of servers that have the data

CRUSH
- File to object: takes the inode (from the MDS)

CRUSH
- File to placement group (PG): object ID and the number of PGs

Placement Group
- A set of OSDs that manages a subset of the objects
- An OSD will belong to many placement groups
- A placement group has R OSDs, where R is the number of replicas
- An OSD is either a primary or a replica; the primary is in charge of accepting modification requests for the placement group
- Clients write to the primary and read from a random member of the placement group

CRUSH
- PG to OSD: the PG ID and the cluster map (from the OSDs)

CRUSH
- Now we know where to write and read the data (see the sketch at the end of this part)
- Now, how do we safely handle replication and node failures?

Replication
- Data is replicated to the other nodes in the placement group

Replication
- The client writes to the placement group primary (found with the CRUSH function)

Replication
- The primary OSD replicates to the other OSDs in the placement group

Replication
- The update is committed only after the longest (slowest) replica update completes

Failure Detection
- Each autonomic OSD looks after the other nodes in its placement groups (possibly many!)
- Monitors keep the cluster map (used in CRUSH)
- Multiple monitors keep an eye on the cluster configuration and dole out cluster maps

Recovery & Updates
- Recovery is handled entirely between OSDs
- An OSD has two off modes: Down and Out
- Down: the node could come back; the primary role for its PGs is handed off
- Out: the node will not come back; its data is re-replicated
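As a companion to the data placement and replication slides above, here is a compact Python sketch of the pipeline file → object → placement group → OSDs, plus primary-copy replication. It is only illustrative: real CRUSH is a hierarchical, failure-domain-aware algorithm, and the names below (object_name, object_to_pg, pg_to_osds, FakeOSD) are hypothetical, not Ceph code.

```python
import hashlib
from dataclasses import dataclass, field

def object_name(inode_no, stripe_no):
    # File -> object: the inode number (from the MDS) plus the stripe index within the file.
    return f"{inode_no:x}.{stripe_no:08x}"

def object_to_pg(obj_name, num_pgs):
    # Object -> placement group: hash the object name modulo the number of PGs.
    return int(hashlib.sha1(obj_name.encode()).hexdigest(), 16) % num_pgs

def pg_to_osds(pg_id, cluster_map, replicas=3):
    # PG -> OSDs: a deterministic pseudo-random ranking driven by the cluster map,
    # so every client and OSD computes the same answer without a central lookup table.
    # OSDs marked "out" are skipped, which is how data moves once a node is given up on.
    candidates = [o for o in cluster_map["osds"] if o["state"] != "out"]
    ranked = sorted(candidates,
                    key=lambda o: hashlib.sha1(f"{pg_id}:{o['id']}".encode()).digest())
    return [o["id"] for o in ranked[:replicas]]   # ranked[0] plays the primary role

@dataclass
class FakeOSD:
    osd_id: int
    store: dict = field(default_factory=dict)     # object name -> (version, data)

    def primary_write(self, obj_name, data, version, replicas):
        """Primary-copy replication: the primary applies the update, forwards it to the
        replica OSDs in its placement group, and acknowledges the client only after the
        slowest replica has applied it. The versions are what recovery later compares."""
        self.store[obj_name] = (version, data)
        for replica in replicas:
            replica.store[obj_name] = (version, data)
        return "ack"
```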
Recovery & Updates
- Each object has a version number
- When an OSD comes back up, it checks the version numbers of its placement groups to see if they are current
- It then checks the version numbers of individual objects to see which need updating

Ceph Components
Covered in order: (1) Clients, (2) Metadata, (3) Object Storage (physical)

Object Storage
- The underlying filesystem can make or break a distributed one
- Filesystems have different characteristics; for example, ReiserFS is good at small files, XFS is good at REALLY big files
- Ceph keeps a lot of attributes on its inodes, so it needs a filesystem that can handle extended attributes

Object Storage
- Ceph can run on normal file systems (XFS, ext3/4, …), but slowly
- The developers created their own filesystem to handle Ceph's special object requirements: EBOFS – the Extent and B-tree based Object File System

Object Storage
- Important to note that development of EBOFS has ceased
- Ceph can still run on any normal filesystem (I have it running on ext4)
- It is hugely recommended to run on BTRFS

Object Storage – BTRFS
- Fast writes: a copy-on-write file system for Linux
- Great performance: supports small files with fast lookup using a B-tree algorithm
- Ceph requirement: supports unlimited chaining of attributes
- Integrated into the mainline kernel in 2.6.29
- Considered a next-generation file system: a peer of ZFS from Sun, a child of ext3/4

Performance and Scalability
Let's look at some graphs!

Performance & Scalability
- Write latency with different replication factors
- Remember, Ceph has to write to all replicas before it ACKs the write to the client
[Figure: write latency (ms) vs. write size (4–4096 KB) for no replication, 2x replication, and 3x replication, comparing sync writes with sync lock / async write. Paper caption: more than two replicas incurs minimal additional cost for small writes because replicated updates occur concurrently; for large synchronous writes, transmission times dominate; clients partially mask that latency for writes over 128 KB by acquiring exclusive locks and asynchronously flushing the data.]

Performance & Scalability
- X-axis is the size of the write to Ceph
- Y-axis is the latency when writing X KB

Performance & Scalability
- Notice, these are still small writes (< 1 MB)
- As you can see, the more replicas Ceph has to write, the slower the ACK to the client

Performance & Scalability
- Obviously, the async write is faster
- The latency for async writes comes from flushing buffers to Ceph

Performance and Scalability
[Figure 5 from the paper: per-OSD write performance (MB/sec) vs. write size (KB) with no replication, 2x, and 3x replication. The horizontal line indicates the upper limit imposed by the physical disk. Replication has minimal impact on OSD throughput, although if the number of OSDs is fixed, n-way replication reduces total effective throughput by a factor of n because replicated data must be written to n OSDs.]

Performance and Scalability
- Writes are bunched at the top, reads at the bottom
[Chart: per-OSD throughput (MB/sec) vs. I/O size (KB) for ebofs, ext3, reiserfs, and xfs, for both writes and reads]
Performance and Scalability
- X-axis is the KBs written to or read
- Y-axis is the throughput per OSD (node)

Performance and Scalability
- The custom ebofs does much better on both writes and reads

Performance and Scalability
- Writes for ebofs max out the throughput of the underlying HD

Performance and Scalability
- X-axis is the size of the cluster
- Y-axis is the per-OSD throughput
[Chart: per-OSD throughput (MB/sec) vs. OSD cluster size (2–26) for crush (32k PGs), crush (4k PGs), hash (32k PGs), hash (4k PGs), and linear placement]

Performance and Scalability
- Most configurations hover around HD speed
Performance and Scalability
- 32k PGs will distribute data more evenly over the cluster than 4k PGs (a small simulation sketch appears at the end of these notes)

Performance and Scalability
- Evenly splitting the data leads to a balanced load across the OSDs

Conclusions
- Very fast, POSIX-compliant file system
- General enough for many applications
- No single point of failure – important for large data centers
- Can handle HPC-like applications (lots of metadata, small files)

Demonstration
Started 3 Fedora 16 instances on HCC's private cloud.

Demonstration
Some quick things in case the demo doesn't work.
MDS log of an MDS handing off a directory to another for load balancing:

2012-02-16 18:15:17.686167 7f964654b700 mds.0.migrator nicely exporting to mds.1 [dir 10000000004 /hadoop-grid/ [2,head] auth{1=1} pv=2574 v=2572 cv=0/0 ap=1+2+3 state=1610612738|complete f(v2 m2012-02-16 18:14:21.322129 1=0+1) n(v86 rc2012-02-16 18:15:16.423535 b36440689 292=213+79) hs=1+8,ss=0+0 dirty=9 | child replicated dirty authpin 0x29a0fe0]

Demonstration
Election after a monitor was overloaded:

2012-02-16 16:23:22.920514 7fcf40904700 log [INF] : mon.gamma calling new monitor election
2012-02-16 16:23:26.167868 7fcf40904700 log [INF] : mon.gamma calling new monitor election
2012-02-16 16:23:31.558554 7fcf40103700 log [INF] : mon.gamma@1 won leader election with quorum 1,2

Lost another election (peon):

2012-02-16 17:15:36.301172 7f50b360e700 mon.gamma@1(peon).osd e26 e26: 3 osds: 2 up, 3 in

GUI Interface

Where to Find More Info
- New company sponsoring development: http://ceph.newdream.net/
- Instructions on setting up Ceph can be found on the Ceph wiki: http://ceph.newdream.net/wiki/
- Or my blog: http://derekweitzel.blogspot.com/
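Returning to the placement-group observation above (more PGs spread data more evenly): here is a small, self-contained Python simulation, not taken from Ceph, that hashes objects into PGs, maps each PG onto a couple of OSDs, and compares the resulting load imbalance for 4k vs. 32k PGs. All numbers and names are illustrative.

```python
import hashlib
import random
from collections import Counter

def simulate_imbalance(num_objects, num_pgs, num_osds, replicas=2, seed=1):
    """Toy simulation: hash objects into placement groups, pseudo-randomly map each
    PG onto `replicas` OSDs, and report how uneven the per-OSD load ends up.
    More PGs means finer-grained placement and a flatter load across the OSDs."""
    rng = random.Random(seed)
    # PG -> OSD list (a stand-in for CRUSH; deterministic once chosen)
    pg_osds = {pg: rng.sample(range(num_osds), replicas) for pg in range(num_pgs)}
    load = Counter()
    for obj in range(num_objects):
        pg = int(hashlib.sha1(f"obj-{obj}".encode()).hexdigest(), 16) % num_pgs
        for osd in pg_osds[pg]:
            load[osd] += 1
    counts = [load[osd] for osd in range(num_osds)]
    mean = sum(counts) / num_osds
    return max(counts) / mean          # 1.0 would be a perfectly even split

if __name__ == "__main__":
    # With few, fat PGs some OSDs end up noticeably hotter than average;
    # raising the PG count smooths the load out.
    for pgs in (4096, 32768):
        ratio = simulate_imbalance(num_objects=200_000, num_pgs=pgs, num_osds=26)
        print(f"{pgs:>5} PGs: max/mean OSD load = {ratio:.2f}")
```

With the larger PG count the hottest OSD ends up much closer to the average load, which is the balanced-load point made in the slides.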