Ceph: A Scalable, High-Performance Distributed File System

Derek Weitzel
In the Before…
 Let's go back through some of the notable distributed file systems used in HPC
In the Before…
 There were distributed filesystems like:
 Lustre – RAID over storage boxes
 Recovery time after a node failure was MASSIVE! (An entire server's contents had to be copied, one to one)
 When functional, reading/writing is EXTREMELY fast
 Used heavily in HPC
In the Before…
 There were distributed filesystems like:
 NFS – Network File System
 Does this really count as distributed?
 Single large server
 Full POSIX support, in kernel since…forever
 Slow with even a moderate number of clients
 Dead simple
In the Current…
 There are distributed filesystems like:
 Hadoop – Apache project inspired by Google's GFS
 Massive throughput
 Throughput scales with attached HDs
 Have seen VERY LARGE production clusters
 Facebook, Yahoo… Nebraska
 Doesn’t even pretend to be POSIX
In the Current…
 There are distributed filesystems like:
 GPFS (IBM) / Panasas – Proprietary file systems
 Requires a closed-source kernel driver
 Not flexible with the newest kernels / OSes
 Good: good support and large communities
 Can be treated as a black box by administrators
 HUGE installations (Panasas at LANL is HUGE!!!!)
Motivation
 Ceph is an emerging technology for production clustered environments
 Designed for:
 Performance – Striped data over data servers.
 Reliability – No single point of failure
 Scalability – Adaptable metadata cluster
Timeline
 2006 – Ceph Paper written
 2007 – Sage Weil earned his PhD largely for his work on Ceph
 2007 – 2010 Development continued,
primarily for DreamHost
 March 2010 – Linus merged Ceph client into
mainline 2.6.34 kernel
 No more patches needed for clients
Adding Ceph to the Mainline Kernel
 Huge development!
 Significantly lowered cost to deploy Ceph
 For production environments, it was a little
too late – 2.6.32 was the stable kernel used in
RHEL 6 (CentOS 6, SL 6, Oracle 6).
Let's talk about the paper
Then I’ll show a quick demo
Ceph Overview
 Decoupled data and metadata
 IO directly with object servers
 Dynamic distributed metadata management
 Multiple metadata servers handling different directories
(subtrees)
 Reliable autonomic distributed storage
 OSDs manage themselves by replicating and monitoring each other
Decoupled Data and Metadata
 Increases performance by limiting interaction
between clients and servers
 Decoupling is common in distributed
filesystems: Hadoop, Lustre, Panasas…
 In contrast to other filesystems, Ceph uses a function (CRUSH) to calculate data locations rather than storing block lists
Dynamic Distributed Metadata Management
 Metadata is split among cluster of servers
 Distribution of metadata changes with the
number of requests to even load among
metadata servers
 Metadata servers can also quickly recover from failures by taking over a failed neighbor's data
 Improves performance by leveling metadata
load
Reliable Autonomic Distributed Storage
 Data storage servers (OSDs) act on events by themselves
 They initiate replication and recovery on their own
 Improves performance by offloading decision making to the many data servers
 Improves reliability by removing central control of the cluster (a single point of failure)
Ceph Components
 Some quick definitions before getting into the
paper
 MDS – Meta Data Server
 OSD – Object Storage Device (data server)
 MON – Monitor (Now fully implemented)
Ceph Components
 Ordered: Clients, Metadata, Object Storage
[Architecture diagram: clients, metadata server cluster, object storage cluster]
Client Overview
 Can be a FUSE mount
 File system in user space
 Introduced so file systems can use a better interface than the Linux kernel VFS (virtual file system)
 Can link directly to the Ceph library
 Built into the newest OSes
Client Overview – File IO
 1. Asks the MDS for the inode information
Client Overview – File IO
 2. Responds with the inode information
Client Overview – File IO
 3. The client calculates the data location with CRUSH
Client Overview – File IO
 4. Client reads directly off storage nodes
Client Overview – File IO
 The client asks the MDS for only a small amount of information
 Performance: little bandwidth used between client and MDS
 Performance: small cache (memory) needed, due to the small amount of data
 The client calculates the file location using a function
 Reliability: saves the MDS from keeping block locations
 Function (CRUSH) described in the data storage section; a toy sketch of the whole read path follows
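To make the four steps concrete, here is a minimal, self-contained Python sketch of the read path. Every name in it (mds_lookup, crush_locate, the 4 MB stripe size, the OSD list) is an invented stand-in for illustration, not the real Ceph client API.

import hashlib

STRIPE_SIZE = 4 * 1024 * 1024            # assumed object/stripe size, for illustration
OSDS = [f"osd{i}" for i in range(8)]     # stand-in for the cluster map

def mds_lookup(path):
    # Steps 1-2: the client asks the MDS, which returns only the inode
    # (no per-block location lists)
    return {"ino": abs(hash(path)), "stripe_size": STRIPE_SIZE}

def crush_locate(ino, stripe_no):
    # Step 3: a CRUSH-like calculation -- a hash of the object name against
    # the cluster map, no lookup table stored anywhere
    h = hashlib.sha1(f"{ino:x}.{stripe_no:08x}".encode()).digest()
    return OSDS[h[0] % len(OSDS)]

def read(path, offset, length):
    inode = mds_lookup(path)
    stripe_no = offset // inode["stripe_size"]
    osd = crush_locate(inode["ino"], stripe_no)
    # Step 4: the client now talks to that OSD directly, bypassing the MDS
    return f"read {length} bytes of object {inode['ino']:x}.{stripe_no:08x} from {osd}"

print(read("/hadoop-grid/file.dat", offset=10 * 1024 * 1024, length=4096))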
Ceph Components
 Ordered: Clients, Metadata, Object Storage
[Architecture diagram]
Client Overview – Namespace
 Optimized for the common case, 'ls -l'
 A directory listing immediately followed by a stat of each file
 Reading a directory returns all inodes in the directory (small illustration below)
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros
 Namespace covered in detail next!
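The access pattern being optimized is a readdir immediately followed by a stat of every entry. The tiny Python below reproduces that pattern on a local directory (nothing Ceph-specific) just to show why bundling the inodes with the directory listing saves one round trip per file.

import os, stat

def ls_l(path="."):
    # readdir, then one stat per entry -- the pattern Ceph's MDS optimizes by
    # returning every entry's inode along with the directory listing
    for name in sorted(os.listdir(path)):
        st = os.stat(os.path.join(path, name))    # one extra round trip per entry
        kind = "d" if stat.S_ISDIR(st.st_mode) else "-"
        print(kind, st.st_nlink, st.st_size, name)

ls_l()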
Metadata Overview
 Metadata servers (MDS) serve out the file system attributes and directory structure
 Metadata is stored in the distributed filesystem alongside the data
 Compare this to Hadoop, where metadata is stored only on the head node (NameNode)
 Updates are staged in a journal and flushed occasionally to the distributed file system (toy sketch below)
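A toy sketch of that journaling idea, with an in-memory dict standing in for metadata objects stored out on the OSD cluster (all names here are invented for illustration):

class ToyMDS:
    def __init__(self):
        self.journal = []     # staged metadata updates (sequential, cheap to append)
        self.store = {}       # stand-in for metadata objects living on the OSDs

    def update(self, path, attrs):
        self.journal.append((path, attrs))    # 1. stage the update in the journal
        if len(self.journal) >= 128:          # 2. flush lazily, in batches
            self.flush()

    def flush(self):
        for path, attrs in self.journal:
            self.store[path] = attrs          # batched write out to the distributed FS
        self.journal.clear()

mds = ToyMDS()
mds.update("/hadoop-grid", {"mode": 0o755})
mds.flush()
print(mds.store)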
MDS Subtree Partitioning
 In HPC applications, it is common to have 'hot' metadata that is needed by many clients
 In order to be scalable, Ceph needs to distribute metadata requests among many servers
 Each MDS monitors the frequency of queries using special counters
 MDSes compare their counters with each other and split the directory tree to evenly spread the load (rough sketch below)
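A rough sketch of that balancing decision. The counters, the 2x threshold, and the class names are all invented for illustration; the real MDS balancer weighs popularity, recency, and more.

from collections import Counter

class ToyMDS:
    def __init__(self, name):
        self.name = name
        self.heat = Counter()                    # per-subtree popularity counters

    def record_request(self, subtree):
        self.heat[subtree] += 1

    def maybe_export(self, peer):
        my_load, peer_load = sum(self.heat.values()), sum(peer.heat.values())
        if self.heat and my_load > 2 * peer_load:          # invented threshold
            subtree, hits = self.heat.most_common(1)[0]    # pick the hottest subtree
            peer.heat[subtree] += self.heat.pop(subtree)   # hand it off to the peer
            print(f"{self.name} exports {subtree} ({hits} hits) to {peer.name}")

mds0, mds1 = ToyMDS("mds.0"), ToyMDS("mds.1")
for _ in range(1000):
    mds0.record_request("/hadoop-grid")
mds0.maybe_export(mds1)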
MDS Subtree Partitioning
 Multiple MDSes split the metadata
 Clients receive the metadata partition information from the MDS during a request
MDS Subtree Partitioning
 Busy directories (many creates or opens) will be hashed across multiple MDSes
MDS Subtree Partitioning
 Clients read from a random replica
 Updates go to the primary MDS for the subtree
Ceph Components
 Ordered: Clients, Metadata, Object Storage
[Architecture diagram]
Data Placement
 Need a way to evenly distribute data among storage devices (OSDs)
 Increased performance from even data distribution
 Increased resiliency: losing any single node has only a minimal effect on the cluster when data is evenly distributed
 Problem: don't want to keep data locations in the metadata servers
 That would require lots of memory with lots of data blocks (rough estimate below)
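A rough back-of-the-envelope estimate of why an explicit location table does not scale. All of the numbers (1 PB, 4 MB objects, 3 replicas, ~32 bytes per record) are assumptions for illustration:

# Assumed numbers -- purely illustrative
data_bytes = 1 * 1024**5       # 1 PB of data
object_size = 4 * 1024**2      # 4 MB objects
replicas = 3
bytes_per_entry = 32           # rough size of one (object -> OSD) record

objects = data_bytes // object_size
table_bytes = objects * replicas * bytes_per_entry
print(f"{objects:,} objects -> ~{table_bytes / 1024**3:.0f} GB of location metadata")
# CRUSH sidesteps this table entirely: locations are computed, never stored.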
CRUSH
 CRUSH is a pseudo-random function to find
the location of data in a distributed filesystem
 Summary: take a little information and plug it into a globally known function (hash-like) to find where the data is stored
 Input data is:
 inode number – From MDS
 OSD Cluster Map (CRUSH map) – From
OSD/Monitors
CRUSH
 CRUSH maps a file to a list of servers that
have the data
CRUSH
 File to Object: Takes the inode (from MDS)
CRUSH
 File to Placement Group (PG): uses the object ID and the number of PGs
Placement Group
 Sets of OSDs that manage a subset of the objects
 An OSD will belong to many Placement Groups
 A Placement Group has R OSDs, where R is the number of replicas
 An OSD is either a Primary or a Replica
 The Primary is in charge of accepting modification requests for the Placement Group
 Clients write to the Primary and read from a random member of the Placement Group
CRUSH
 PG to OSD: PG ID and Cluster Map (from the OSDs/monitors); the full pipeline is sketched below
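Putting the three mappings together (file → object → PG → OSDs), here is a compact Python sketch. The hashing below is only a stand-in for the real CRUSH algorithm, which also honors failure domains and device weights; the point is just the shape of the calculation: every client can compute locations from the same small cluster map.

import hashlib

NUM_PGS = 4096                                    # assumed number of placement groups
CLUSTER_MAP = [f"osd{i}" for i in range(12)]      # stand-in cluster map from the monitors

def object_name(ino, stripe_no):
    return f"{ino:x}.{stripe_no:08x}"             # file -> object (inode + stripe number)

def object_to_pg(oid):
    h = int(hashlib.sha1(oid.encode()).hexdigest(), 16)
    return h % NUM_PGS                            # object -> placement group

def pg_to_osds(pg, replicas=3):
    # placement group -> ordered list of OSDs; the first is the Primary.
    # Real CRUSH walks the cluster-map hierarchy instead of this flat hash.
    ranked = sorted(CLUSTER_MAP,
                    key=lambda osd: hashlib.sha1(f"{pg}:{osd}".encode()).digest())
    return ranked[:replicas]

oid = object_name(ino=0x10000000004, stripe_no=0)
pg = object_to_pg(oid)
print(oid, "-> pg", pg, "->", pg_to_osds(pg))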
CRUSH
 Now we know where to write the data / read
the data
 Now how do we safely handle replication and
node failures?
Replication
 Data is replicated to the other nodes in the Placement Group
Replication
 The client writes to the Placement Group Primary (found with the CRUSH function)
Replication
 The Primary OSD replicates the write to the other OSDs in the Placement Group
Replication
 The write is acknowledged to the client only after the slowest replica has applied the update
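A minimal sketch of this primary-copy replication: the client writes to the PG's Primary, the Primary forwards to the replicas, and the client is only acknowledged once every copy has applied the update. The classes are invented for illustration.

class ToyOSD:
    def __init__(self, name):
        self.name, self.objects = name, {}

    def apply(self, oid, data):
        self.objects[oid] = data                      # apply the update locally
        return True

class ToyPrimary(ToyOSD):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas

    def client_write(self, oid, data):
        acks = [self.apply(oid, data)]                # Primary applies first
        acks += [r.apply(oid, data) for r in self.replicas]   # then forwards to replicas
        return all(acks)                              # ACK the client only after all apply

pg = ToyPrimary("osd3", replicas=[ToyOSD("osd7"), ToyOSD("osd9")])
print("write acked:", pg.client_write("10000000004.00000000", b"hello"))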
Failure Detection
 Each autonomic OSD looks after the other nodes in its Placement Groups (possibly many!)
 Monitors keep the cluster map (used in CRUSH)
 Multiple monitors keep an eye on the cluster configuration and dole out cluster maps
Recovery & Updates
 Recovery is entirely between OSDs
 OSDs have two offline states: Down and Out
 Down means the node may come back; its Primary role for each PG is handed off
 Out means the node will not come back; its data is re-replicated
Recovery & Updates
 Each object has a version number
 When an OSD comes back up, it checks the version numbers of its Placement Groups to see if they are current
 It then checks object version numbers to see which objects need updating (sketch below)
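A sketch of that version comparison: on recovery, a returning OSD pulls only the objects whose versions are behind the Primary's. The dicts below are invented stand-ins for per-object version metadata.

def recover(returning, primary):
    # Compare per-object versions; re-replicate only stale or missing objects
    for oid, (version, data) in primary.items():
        have = returning.get(oid)
        if have is None or have[0] < version:
            returning[oid] = (version, data)     # fetch just this object

primary   = {"obj.a": (7, b"new"), "obj.b": (3, b"same")}
returning = {"obj.a": (5, b"old"), "obj.b": (3, b"same")}
recover(returning, primary)
print(returning["obj.a"])    # (7, b'new'); obj.b is untouched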
Ceph Components
 Ordered: Clients, Metadata, Object Storage (Physical)
[Architecture diagram]
Object Storage
 The underlying filesystem can make or break a
distributed one
 Filesystems have different characteristics
 Example: ReiserFS is good at small files
 XFS is good at REALLY big files
 Ceph keeps a lot of attributes on its inodes, so it needs a filesystem that can handle extended attributes (quick illustration below)
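Ceph leans heavily on extended attributes for per-object metadata, so xattr support in the backing filesystem matters. A quick, Linux-only illustration using Python's os.setxattr/os.getxattr (nothing Ceph-specific; assumes /tmp sits on a filesystem with user xattrs enabled):

import os

path = "/tmp/xattr-demo"
open(path, "w").close()

# Store and read back a user xattr -- the kind of per-file metadata Ceph keeps
# in bulk; ext3 caps xattr size, while btrfs is far more flexible.
os.setxattr(path, b"user.demo", b"some object metadata")
print(os.getxattr(path, b"user.demo"))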
Object Storage
 Ceph can run on normal file systems, but slowly
 XFS, ext3/4, …
 The Ceph authors created their own filesystem to handle Ceph's special object requirements
 EBOFS – Extent and B-tree based Object File System
Object Storage
 Important to note that development of EBOFS has ceased
 Ceph can still run on any normal filesystem (I have it running on ext4)
 Running on btrfs is hugely recommended
Object Storage - BTRFS
 Fast writes: copy-on-write file system for Linux
 Great performance: supports small files with fast lookup using B-trees
 Ceph requirement: supports unlimited chaining of attributes
 Integrated into mainline kernel 2.6.29
 Considered next generation file system
 Peer of ZFS from Sun
 Child of ext3/4
Performance and Scalability
Let's look at some graphs!
Performance & Scalability
 Write latency with different replication factors
 Remember, Ceph has to write to all replicas before ACKing the write to the client
[Figure: write latency (ms) vs. write size (KB, 4–4096) for no replication, 2x replication, and 3x replication; sync write vs. sync lock + async write]
[Paper caption for this figure: more than two replicas incurs minimal additional cost for small writes because replicated updates occur concurrently; for large synchronous writes, transmission times dominate; clients partially mask that latency for writes over 128 KB by acquiring exclusive locks and asynchronously flushing the data]
Performance & Scalability
 X-Axis is the size of the write to Ceph (KB)
 Y-Axis is the latency (ms) when writing X KB
[Same figure: write latency vs. write size]
Performance & Scalability
 Notice, these are still small writes, < 1 MB
 As you can see, the more replicas Ceph has to write, the slower the ACK back to the client
[Same figure: write latency vs. write size]
Performance & Scalability
 Obviously, the async write is faster
 The latency for async writes comes from flushing buffers to Ceph
[Same figure: write latency vs. write size]
Performance and Scalability
 Writes are bunched at the top, reads at the bottom
 X-Axis is the KBs written to or read from the OSDs
 Y-Axis is the throughput per OSD (node)
[Figure: per-OSD throughput (MB/sec) vs. I/O size (KB) for ebofs, ext3, reiserfs, and xfs, writes above and reads below; plus per-OSD write performance with 2x/3x replication]
[Paper caption, per-OSD write performance figure: the horizontal line indicates the upper limit imposed by the physical disk; replication has minimal impact on OSD throughput, although if the number of OSDs is fixed, n-way replication reduces total effective throughput by a factor of n because replicated data must be written to n OSDs]
Performance and Scalability
 The custom ebofs does much better on both writes and reads
[Same figure: per-OSD throughput vs. I/O size for ebofs, ext3, reiserfs, xfs]
Performance and Scalability
 Writes on ebofs max out the throughput of the underlying HD
[Same figure: per-OSD throughput vs. I/O size for ebofs, ext3, reiserfs, xfs]
Performance and Scalability
 X-Axis is the size of the cluster (number of OSDs)
 Y-Axis is the per-OSD throughput
[Figure: per-OSD throughput (MB/sec) vs. OSD cluster size, comparing crush (32k PGs), crush (4k PGs), hash (32k PGs), hash (4k PGs), and linear placement]
Performance and Scalability
 Most configurations hover around HD speed
[Same figure: per-OSD throughput vs. OSD cluster size]
Performance and Scalability
 32k PGs distribute data more evenly over the cluster than 4k PGs
[Same figure: per-OSD throughput vs. OSD cluster size]
Performance and Scalability
 Evenly splitting the data leads to a balanced load across the OSDs
[Same figure: per-OSD throughput vs. OSD cluster size]
Conclusions
 Very fast, POSIX-compliant file system
 General enough for many applications
 No single point of failure – important for large data centers
 Can handle HPC-like applications (lots of metadata, small files)
Demonstration
 Started 3 Fedora 16 instances on HCC’s
private cloud
Demonstration
 Some quick things in case the demo doesn't work
 MDS log of one MDS handing off a directory to another MDS for load balancing:
2012-02-16 18:15:17.686167 7f964654b700 mds.0.migrator nicely exporting to mds.1 [dir 10000000004 /hadoop-grid/ [2,head]
auth{1=1} pv=2574 v=2572 cv=0/0 ap=1+2+3 state=1610612738|complete f(v2 m2012-02-16 18:14:21.322129 1=0+1) n(v86
rc2012-02-16 18:15:16.423535 b36440689 292=213+79) hs=1+8,ss=0+0 dirty=9 | child replicated dirty authpin 0x29a0fe0]
Demonstration
 Election after a Monitor was overloaded
2012-02-16 16:23:22.920514 7fcf40904700 log [INF] : mon.gamma calling new monitor election
2012-02-16 16:23:26.167868 7fcf40904700 log [INF] : mon.gamma calling new monitor election
2012-02-16 16:23:31.558554 7fcf40103700 log [INF] : mon.gamma@1 won leader election with quorum 1,2
 Lost another election (peon):
2012-02-16 17:15:36.301172 7f50b360e700 mon.gamma@1(peon).osd e26 e26: 3 osds: 2 up, 3 in
GUI Interface
Where to Find More Info
 New company sponsoring development
 http://ceph.newdream.net/
 Instructions on setting up Ceph can be found on the Ceph wiki:
 http://ceph.newdream.net/wiki/
 Or my blog
 http://derekweitzel.blogspot.com/