Isilon Clustered Storage
OneFS
Nick Kirsch
Introduction
• Who is Isilon?
• What Problems Are We Solving? (Market Opportunity)
• Who Has These Problems? (Our Customers)
• What Is Our Solution? (Our Product)
• How Does It Work? (The Cool Stuff)
Who is Isilon Systems?
• Founded in 2000
• Located in Seattle (Queen Anne)
• IPO’d in 2006 (ISLN)
• ~400 employees
• Q3 2008 Revenue: $30 million, 40% Y/Y
• Co-founded by Paul Mikesell, UW/CSE
  • I’ve been at the company for 6+ years
What Problems Are We Solving?
Structured Data
• Small files
• Modest-size data stores
• I/O intensive
• Transactional
• Steady capacity growth
Unstructured Data
• Larger files
• Very large data stores
• Throughput intensive
• Sequential
• Explosive capacity growth
Traditional Architectures
• Data Organized in Layers of Abstraction
  • File System, Volume Manager, RAID
• Server/Storage Architecture – “Head” and “Disk”
• Scale Up (vs Scale Out)
• Islands of Storage
• Hard to Scale
• Performance Bottlenecks
• Not Highly Available
• Overly Complex
• Cost Prohibitive
[Diagram: three separate storage devices – “islands of storage”, each with its own head and disks]
Who Has These Problems?
[Chart: Worldwide File and Block Disk Storage Systems, 2005–2011 (PB)]
By 2011, 75% of all storage capacity sold will be for file-based data
• File-based: 79.3% CAGR
• Block-based: 31% CAGR
Isilon has over 850 customers today.
* Source: IDC, 2007
What is Our Solution?
Isilon IQ Clustered Storage
• OneFS™ intelligent software on enterprise-class hardware
• [Image: a 3-node Isilon IQ cluster]
• Scales to 96 nodes
• 2.3 PB in a single file system
• 20 GB/s aggregate throughput
Clustered Storage Consists Of “Nodes”
• Largely Commodity Hardware
• Quad-core 2.3 GHz CPU
• 4 GB memory read cache
• GbE and 10GbE for front-end network
• 12 disks per node
• InfiniBand for intra-cluster communication
• High-speed NVRAM journal
• Hot-swappable disks, power supplies, and fans
• NFS, CIFS, HTTP, FTP
• Integrates with Windows and UNIX
• OneFS operating system
Isilon Network Architecture
[Diagram: clients connect over Ethernet via CIFS, NFS, or either]
• Drop-in replacement for any NAS device
• No client-side drivers required, unlike Andrew FS (Coda) or Lustre
• No application changes, unlike Google FS or Amazon S3
• No changes required to adopt.
How Does It Work?
• Built on FreeBSD 6.x (originally 5.x)
  • New kernel module for OneFS
  • Modifications to the kernel proper
  • User space applications
  • Leverage open-source where possible
  • Almost all of the heavy-lifting is in the kernel
• Commodity Hardware
• A few exceptions:
  • We have a high-speed NVRAM journal for data consistency
  • We have an InfiniBand low-latency cluster inter-connect
  • We have a close-to-commodity SAS card (commodity chips)
  • A custom monitoring board (fans, temps, voltages, etc.)
  • SAS and SATA disks
OneFS architecture
• Fully Distributed
• Top Half – Initiator
  • Network Operations (TCP, NFS, CIFS)
  • VFS layer, Locking, etc.
  • FEC Calculations, Block Reconstruction
  • File-Indexed Cache
• Bottom Half – Participant
  • Journal and Disk Operations
  • Block-Indexed Cache
The OneFS architecture is basically an InfiniBand SAN
• All data access across the back-end network is block-level
• The participants act as very smart disk drives (see the sketch after this slide)
• Much of the back-end data traffic can be RDMA
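Below is a minimal sketch (not OneFS code; all class and method names are invented) of what “participants act as very smart disk drives” means: initiators keep the file-level logic and only ever address participants by block number.

```python
# Illustration only: a participant exposes block-level operations to
# initiators over the back-end network. Names are invented; the real
# system speaks an RPC protocol (RBM) over InfiniBand.

class Participant:
    """Acts like a very smart disk drive: reads/writes blocks by number."""
    def __init__(self, node_id, block_size=8192):
        self.node_id = node_id
        self.block_size = block_size
        self._blocks = {}                      # block_no -> bytes

    def write_block(self, block_no, data):
        assert len(data) <= self.block_size
        self._blocks[block_no] = data

    def read_block(self, block_no):
        return self._blocks.get(block_no, b"\0" * self.block_size)


class Initiator:
    """Owns file-level logic; only ever issues block I/O to participants."""
    def __init__(self, participants):
        self.participants = participants

    def write_stripe(self, block_no, chunks):
        # One chunk per participant, all written at the same block address.
        for node, chunk in zip(self.participants, chunks):
            node.write_block(block_no, chunk)

    def read_stripe(self, block_no):
        return [node.read_block(block_no) for node in self.participants]


if __name__ == "__main__":
    cluster = [Participant(i) for i in range(3)]
    ini = Initiator(cluster)
    ini.write_stripe(7, [b"aaa", b"bbb", b"ccc"])
    print(ini.read_stripe(7))                  # [b'aaa', b'bbb', b'ccc']
```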
OneFS architecture
• OneFS started from UFS (aka FFS)
  • Generalized for a distributed system.
  • Little resemblance in code today, but concepts are there.
  • Almost all data structures are trees
• OneFS Knows Everything – no volume manager, no RAID
  • Lack of abstraction allows us to do interesting things, but forces the file system to know a lot – everything.
• Cache/Memory Architecture Split (see the sketch after this slide)
  • “Level 1” – file cache (cached as part of the vnode)
  • “Level 2” – block cache (local or remote disk blocks)
  • Memory used for high-speed write coalescer
  • Much more resource intensive than a local FS
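To make the cache split concrete, here is a hypothetical sketch of a file-indexed “Level 1” cache backed by a block-indexed “Level 2” cache. The classes, keys, and sizes are invented for illustration; they are not the actual OneFS data structures.

```python
# Hypothetical sketch of the two-level cache split described above.
# L1 is file-indexed (logical: inode + offset); L2 is block-indexed
# (physical: device + block number). Names and structure are invented.

class BlockCache:                       # "Level 2": local or remote disk blocks
    def __init__(self, read_from_disk):
        self._blocks = {}               # (device_id, block_no) -> bytes
        self._read_from_disk = read_from_disk

    def get(self, device_id, block_no):
        key = (device_id, block_no)
        if key not in self._blocks:
            self._blocks[key] = self._read_from_disk(device_id, block_no)
        return self._blocks[key]


class FileCache:                        # "Level 1": cached as part of the vnode
    def __init__(self, block_cache, block_map):
        self._data = {}                 # (inode, offset) -> bytes
        self._block_cache = block_cache
        self._block_map = block_map     # logical -> physical mapping

    def read(self, inode, offset):
        key = (inode, offset)
        if key not in self._data:
            device_id, block_no = self._block_map(inode, offset)
            self._data[key] = self._block_cache.get(device_id, block_no)
        return self._data[key]


if __name__ == "__main__":
    # Fake "disk": block b on device d holds a recognizable payload.
    disk = lambda d, b: f"dev{d}:blk{b}".encode()
    l2 = BlockCache(disk)
    l1 = FileCache(l2, block_map=lambda inode, off: (inode % 3, off // 8192))
    print(l1.read(inode=42, offset=16384))   # b'dev0:blk2'
```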
Atomicity/Consistency Guarantees
• POSIX file system
  • Namespace operations are atomic
  • fsync/sync operations are guaranteed synchronous
• FS data is either mirrored or FEC-protected (see the sketch after this slide)
  • Meta-data is always mirrored; up to 8x
  • User-data can be mirrored (up to 8x) or FEC-protected up to +4
  • We use Reed-Solomon codes for FEC
• Protection level can be chosen on a per-file or per-directory basis.
  • Some files can be at 1x (no protection) while others can be at +4 (survive four failures).
  • Meta-data must be protected at least as high as anything it refers to.
• All writes go to the NVRAM first as part of a distributed transaction – guaranteed to commit or abort.
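To make N+M protection concrete, here is a deliberately simplified sketch of the +1 case using XOR parity (the single-parity special case of Reed-Solomon): N data blocks plus one FEC block, so any one lost block can be rebuilt. OneFS itself uses Reed-Solomon codes to go up to +4; the block contents and sizes below are invented.

```python
# Simplified illustration of +1 FEC protection: N data blocks plus one
# parity block, so any single missing block can be rebuilt. Real OneFS
# uses Reed-Solomon codes, which generalize this to +2, +3, and +4.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def protect(data_blocks):
    """Return the protection group: data blocks plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(group, missing_index):
    """Rebuild a single missing block from the surviving ones."""
    survivors = [blk for i, blk in enumerate(group) if i != missing_index]
    return xor_blocks(survivors)

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]          # N = 3 data blocks
    group = protect(data)                        # N + 1 blocks, one per node
    assert reconstruct(group, missing_index=1) == b"BBBB"
    print("rebuilt:", reconstruct(group, missing_index=1))
```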
Group Management
• Transactional way to handle state changes
• All nodes need to agree on their peers
• Group changes: split, merge, add, remove
• Group changes don’t “scale”, but are rare (see the sketch after this slide)
[Diagram: example group change among nodes 1–4]
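A toy sketch of what “transactional” group changes mean: a proposed membership either commits on every reachable member or is abandoned, so all nodes keep an identical view. The two-phase exchange and class names below are invented for illustration; this is not the real group management protocol.

```python
# Toy illustration of a transactional group change: every node must
# accept the proposed membership, otherwise nothing changes anywhere.
# Invented names; not the actual OneFS group management protocol.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.group = frozenset({node_id})     # current view of the cluster
        self.up = True

    def prepare(self, proposed):
        return self.up and self.node_id in proposed

    def commit(self, proposed):
        self.group = frozenset(proposed)

def change_group(nodes, proposed):
    """All-or-nothing membership change (split, merge, add, remove)."""
    members = [n for n in nodes if n.node_id in proposed]
    if all(n.prepare(proposed) for n in members):
        for n in members:
            n.commit(proposed)
        return True
    return False                               # abort: every view stays as it was

if __name__ == "__main__":
    nodes = [Node(i) for i in (1, 2, 3, 4)]
    print(change_group(nodes, {1, 2, 3, 4}))   # merge all four nodes: True
    nodes[3].up = False                        # node 4 disconnects
    print(change_group(nodes, {1, 2, 3}))      # remove node 4 from the group
    print(sorted(nodes[0].group))              # [1, 2, 3]
```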
Distributed Lock Manager
• Textbook-ish DLM
  • Anyone requesting a lock is an initiator.
  • The coordinator knows the definitive owner for the lock.
  • The coordinator controls access to locks.
  • The coordinator is chosen by a hash of the resource. (see the sketch after this slide)
• Split/Merge behavior
  • Locks are lost at merge time, not split time.
  • Since POSIX has no lock-revoke mechanism, advisory locks are silently dropped.
  • The coordinator renegotiates on split/merge.
• Locking optimizations – “lazy locks”
  • Locks are cached.
  • Lock-lost callbacks.
  • Lock-contention callbacks.
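A minimal sketch of the coordinator-selection idea: because the coordinator is derived from a hash of the resource, any initiator can compute it locally without asking anyone. The hash choice, lock table, and names below are invented; the real DLM layers lock caching, lock-lost callbacks, and contention callbacks on top of this.

```python
# Sketch of hash-based coordinator selection and a trivial lock table.
# Invented names and policies; illustration only.

import hashlib

def coordinator_for(resource, group):
    """Every node computes the same coordinator from the resource name."""
    nodes = sorted(group)
    digest = hashlib.sha1(resource.encode()).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]

class Coordinator:
    """Knows the definitive owner of each lock it coordinates."""
    def __init__(self):
        self.owners = {}                       # resource -> owning node

    def request(self, resource, node):
        if self.owners.setdefault(resource, node) == node:
            return True                        # granted (or already held)
        return False                           # contended: current owner keeps it

    def release(self, resource, node):
        if self.owners.get(resource) == node:
            del self.owners[resource]

if __name__ == "__main__":
    group = {1, 2, 3, 4}
    res = "inode:12345"
    print("coordinator:", coordinator_for(res, group))
    c = Coordinator()
    print(c.request(res, node=2))              # True: node 2 now owns the lock
    print(c.request(res, node=3))              # False: contended
    c.release(res, node=2)
    print(c.request(res, node=3))              # True
```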
RPC Mechanism
• Uses SDP on InfiniBand
• Batch System (see the sketch after this slide)
  • Allows you to put dependencies on the remote side.
  • i.e. send 20 messages, checkpoint, send 20 messages.
  • Messages run in parallel, then synchronize, etc.
  • Coalesces errors.
• Async messages (callback)
• Sync messages
• Update messages (no response)
• Used by DLM, RBM, etc. (everything)
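A rough, thread-based sketch of the batch pattern: the messages in a group run in parallel, a checkpoint waits for all of them, and errors are coalesced into one result. Names are invented; the real mechanism runs over SDP on InfiniBand.

```python
# Toy version of the batched RPC pattern: send a group of messages in
# parallel, checkpoint (synchronize), send the next group, and coalesce
# any errors into a single outcome. Invented names for illustration.

from concurrent.futures import ThreadPoolExecutor

def run_batch(groups, send):
    """groups: list of message lists; a checkpoint sits between groups."""
    errors = []
    with ThreadPoolExecutor() as pool:
        for group in groups:                       # checkpoint between groups
            futures = [pool.submit(send, msg) for msg in group]
            for fut in futures:                    # wait: all must finish
                try:
                    fut.result()
                except Exception as exc:           # coalesce errors
                    errors.append(exc)
            if errors:
                break                              # don't start the next group
    return errors

if __name__ == "__main__":
    def send(msg):
        if msg == "bad":
            raise IOError(f"remote failed on {msg}")
        return f"ok: {msg}"

    # Two groups of messages with a checkpoint between them.
    print(run_batch([["m1", "m2", "m3"], ["m4", "bad"]], send))
```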
Writing a file to OneFS
• Writes occur via NFS, CIFS, etc. to a single node
• That node coalesces data and initiates transactions
• Optimizing for write performance is hard
  • Lots of variables
  • Each node might have different load
  • Unusual scenarios, e.g. degraded writes
• Asynchronous Write Engine (see the sketch after this slide)
  • Build a directed acyclic graph (DAG)
  • Do work as soon as dependencies are satisfied
  • Prioritize and pipeline work for efficiency
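A minimal sketch of the “do work as soon as dependencies are satisfied” idea: tasks form a DAG and each task runs once its prerequisites finish. The task names loosely follow the plan steps on the later slides; the threading model and the exact dependencies are invented for illustration.

```python
# Toy DAG executor: run each task as soon as its dependencies complete.
# Task names, dependencies, and the threading model are illustrative only.

from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> set of prerequisite names."""
    done, futures = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # Launch every task whose prerequisites are all finished.
            for name in tasks:
                if name not in futures and deps[name] <= done:
                    futures[name] = pool.submit(tasks[name])
            # Wait for one outstanding task to finish, then loop again.
            for name, fut in list(futures.items()):
                if name not in done:
                    fut.result()
                    done.add(name)
                    break
    return done

if __name__ == "__main__":
    tasks = {
        "layout":      lambda: print("compute layout"),
        "alloc":       lambda: print("allocate blocks"),
        "write_data":  lambda: print("write data blocks"),
        "compute_fec": lambda: print("compute FEC"),
        "write_fec":   lambda: print("write FEC block"),
    }
    deps = {
        "layout": set(), "alloc": {"layout"},
        "write_data": {"alloc"}, "compute_fec": {"layout"},
        "write_fec": {"alloc", "compute_fec"},
    }
    run_dag(tasks, deps)
```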
Writing a file to OneFS
[Diagram: client servers connect to the cluster over NFS, CIFS, FTP, and HTTP (optional 2nd switch)]
Writing a file to OneFS
• Break the write into regions (see the sketch after this slide)
• Regions are protection-group aligned
• For each region:
  • Create a layout
  • Use the layout to generate a plan
  • Execute the plan asynchronously
[Diagram: example plan DAG with nodes for computing the layout, allocating blocks, computing FEC, and writing data and FEC blocks]
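A small sketch of what protection-group-aligned regions could look like: a byte-range write is cut at protection-group boundaries so no region spans two groups. The 8 KB block size and 16-block group width are invented numbers, not OneFS constants.

```python
# Illustration of splitting a write into protection-group-aligned
# regions. The 8 KB block size and 16-blocks-per-group stripe width are
# invented numbers, not OneFS constants.

BLOCK_SIZE = 8192
BLOCKS_PER_GROUP = 16
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP

def split_into_regions(offset, length):
    """Yield (offset, length) regions that never span a protection group."""
    end = offset + length
    while offset < end:
        group_end = (offset // GROUP_BYTES + 1) * GROUP_BYTES
        region_end = min(end, group_end)
        yield offset, region_end - offset
        offset = region_end

if __name__ == "__main__":
    # A 300 KiB write starting 100 KiB into the file crosses group
    # boundaries, so it is cut into four regions.
    for off, length in split_into_regions(offset=100 * 1024, length=300 * 1024):
        print(off, length)
```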
Writing a file to OneFS
• Plan executes and transaction commits
• Data and parity blocks are now on disks
[Diagram: data and parity blocks on the disks of three nodes, plus two inode mirrors (inode mirror 0, inode mirror 1)]
Reading a file from OneFS
[Diagram: client servers connect to the cluster over NFS, CIFS, FTP, and HTTP (optional 2nd switch)]
Handling Failures
• What could go wrong during a single transaction?
  • A block-level I/O request fails
  • A drive goes down
  • A node runs out of space
  • A node disconnects or crashes
• In a distributed system, things are expected to fail.
• Most of our system calls automatically restart. (see the sketch after this slide)
• Have to be able to gracefully handle all of the above, plus much more!
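A toy sketch of the “system calls automatically restart” idea: an operation that fails for a transient, cluster-level reason (say, a group change aborting the transaction) is simply retried against the new state. The exception type and retry policy are invented.

```python
# Toy restart loop: retry an operation when it fails for a transient,
# cluster-level reason (e.g. a group change aborted the transaction).
# The exception class and policy are invented for illustration.

import time

class RestartableError(Exception):
    """Raised when an operation should be retried against the new cluster state."""

def restarting(op, attempts=5, delay=0.01):
    for attempt in range(attempts):
        try:
            return op()
        except RestartableError:
            time.sleep(delay)                  # wait for the group to settle
    raise RestartableError(f"gave up after {attempts} attempts")

if __name__ == "__main__":
    state = {"failures_left": 2}

    def flaky_write():
        if state["failures_left"] > 0:
            state["failures_left"] -= 1
            raise RestartableError("group changed mid-transaction")
        return "committed"

    print(restarting(flaky_write))             # retries twice, then commits
```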
Handling Failures
• When a node goes “down”:
  • New files will use effective protection levels (if necessary)
  • Affected files will be reconstructed automatically per request.
  • That node’s IP addresses are migrated to another node.
  • Some data is orphaned and later garbage collected.
• When a node “fails”:
  • New files will use effective protection levels (if necessary)
  • Affected files will be repaired automatically across the cluster.
  • AutoBalance will automatically rebalance data.
• We can safely, proactively SmartFail nodes/drives:
  • Reconstruct data without removing the device.
  • If a multiple-component failure occurs, use the original device – minimizes WOR.
SmartConnect
[Diagram: clients connect to the cluster over Ethernet via CIFS, NFS, or either]
• Client must connect to a single IP address.
• SmartConnect is a DNS server which runs on the cluster.
• The customer delegates a DNS zone to the cluster DNS server.
• SmartConnect responds to DNS queries with only available nodes.
• SmartConnect can also be configured to respond with nodes based on load, connection count, throughput, etc. (see the sketch after this slide)
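A minimal sketch of the SmartConnect selection idea: answer lookups only with IPs of nodes that are up, optionally preferring the least-loaded one. This is a plain selection function with invented fields and policy names, not a real DNS server.

```python
# Illustration of SmartConnect-style node selection: hand out only the
# IPs of available nodes, here preferring the one with the fewest client
# connections. Fields and policies are invented; the real feature is
# implemented as a DNS server running on the cluster.

import itertools

NODES = [
    {"ip": "10.0.0.1", "up": True,  "connections": 12},
    {"ip": "10.0.0.2", "up": False, "connections": 0},   # down: never returned
    {"ip": "10.0.0.3", "up": True,  "connections": 4},
]

_round_robin = itertools.cycle(range(len(NODES)))

def resolve(policy="connections"):
    available = [n for n in NODES if n["up"]]
    if policy == "connections":                 # fewest client connections
        return min(available, key=lambda n: n["connections"])["ip"]
    # default: simple round robin over available nodes
    while True:
        node = NODES[next(_round_robin)]
        if node["up"]:
            return node["ip"]

if __name__ == "__main__":
    print(resolve())                                        # 10.0.0.3 (least loaded)
    print(resolve(policy="roundrobin"), resolve(policy="roundrobin"))
```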
We've got Lego Pieces
• Accelerator Nodes
  • Top-Half Only
  • Adds CPU and Memory – no disks or journal
  • Only has Level 1 cache… high single-stream throughput
• Storage Nodes
  • Both Top and Bottom Halves
  • In Some Workloads, Bottom Half Only Makes Sense
• Storage Expansion Nodes
  • Just a dumb extension of a Storage Node – add disks
  • Grow Capacity Without Adding Performance
SmartConnect Zones
[Diagram: SmartConnect zones for different groups (Interpreters, Processing, BizDev, Eng, Finance, IT) across subnets 10.10, 10.20, 10.30 and interfaces 10gige-1, ext-1]
• hpc.tx.com – 10 GigE dedicated; Accelerator X nodes; NFS failover required
• gg.tx.com – storage nodes; NFS clients, no failover
• bizz.tx.com – renamed sub-domain; CIFS clients (static IP)
• eng.tx.com – shared subnet; separate sub-domain; NFS failover
• fin.tx.com – VLAN (confidential traffic, isolated); same physical LAN
• it.tx.com – full access, maintenance interface; corporate DNS, no SmartConnect; static (well-known) IPs required
Initiator Software Block Diagram
[Block diagram: initiator components – front-end network; protocol heads (NFS, CIFS, HTTP, NDMP, FTP, …); Layout; BSW; Initiator Cache; DFM; IFM; LIN; STF; BAM; Btree; MDS; RBM; back-end network]
Participant Software Block Diagram
[Block diagram: participant components – back-end network; RBM; LBM; Participant Cache; Journal; NVRAM; DRV; disk subsystem]
System Software Block Diagram
[Block diagram: system view – an Accelerator (initiator components only: front-end network, NFS/CIFS/HTTP/NDMP/FTP/iSCSI protocol heads, DFM, IFM, LIN, Btree, MDS, Initiator Cache, STF, Layout, BAM, BSW, RBM) and a Storage Node (the same initiator components plus the participant components: RBM, LBM, Participant Cache, Journal, NVRAM, DRV, disk subsystem), connected over the InfiniBand back-end network]
Too much to talk about…
• Snapshots
• Quotas
• Replication
• Bit Error Protection
• Rebalancing Data
• Handling Slow Drives
• Statistics Gathering
• I/O Scheduling
• Network Failover
• Native Windows Concepts (ACLs, SIDs, etc.)
• Failed Drive Reconstruction
• Distributed Deadlock Detection
• On-the-fly Filesystem Upgrade
• Dynamic Sector Repair
• Globally Coherent Cache
Thank You!
Questions?