CSE 598D Storage Systems, Spring 2007
Object Based Storage
Presented By: Kanishk Jain
Introduction

- Object Based Storage: defined by the ANSI T10 Object-based Storage Devices standard.
- Storage object: a logical collection of bytes on a storage device, with well-known methods for access, attributes describing characteristics of the data, and security policies that prevent unauthorized access.
- "Intelligent data layout"
Object Storage Interface

- The OSD model is simply a rearrangement of existing data management functions.
- OSD sits one level higher than block access but one level below file access.
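
As a concrete illustration of that middle level, here is a toy in-memory object device: clients name objects and byte ranges within them, while block allocation stays inside the device. The method names loosely echo T10 OSD commands (CREATE, READ, WRITE, SET ATTRIBUTES), but the signatures are illustrative, not the standard's.

```python
# Toy in-memory OSD: clients address objects by id, not blocks or paths.
# Space allocation and on-media layout are the device's concern.
class ObjectStorageDevice:
    def __init__(self):
        self._data = {}   # object_id -> bytearray
        self._attrs = {}  # object_id -> dict of attributes

    def create(self, object_id: str) -> None:
        self._data[object_id] = bytearray()
        self._attrs[object_id] = {}

    def write(self, object_id: str, offset: int, payload: bytes) -> None:
        buf = self._data[object_id]
        if len(buf) < offset + len(payload):          # grow object on demand
            buf.extend(b"\0" * (offset + len(payload) - len(buf)))
        buf[offset:offset + len(payload)] = payload

    def read(self, object_id: str, offset: int, length: int) -> bytes:
        return bytes(self._data[object_id][offset:offset + length])

    def set_attribute(self, object_id: str, name: str, value) -> None:
        self._attrs[object_id][name] = value

osd = ObjectStorageDevice()
osd.create("obj1")
osd.write("obj1", 0, b"hello")
assert osd.read("obj1", 0, 5) == b"hello"
```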
Background – NAS sharing

- NAS is used to share files among a number of clients.
- The files themselves may be stored on a fast SAN.
- The file server intermediates all requests and thus becomes the bottleneck!
Background – SAN sharing

- The files themselves are stored on a fast SAN (e.g., iSCSI) to which the clients are also attached.
- While the file server is removed as a bottleneck, security is a concern!
Object-based storage security architecture

- Metadata managers grant capabilities to clients; clients present these capabilities to the devices on every I/O to ensure security.
- Secure separation of the control and data paths!
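
A minimal sketch of how such capabilities can work, assuming (as in the NASD-style design) that the metadata manager and each device share a secret key; all names here are illustrative:

```python
import hmac, hashlib

# The MDS and the OSD share a secret key, provisioned out of band.
# The MDS signs a capability describing what a client may do; the OSD
# recomputes the MAC on every I/O before honoring the request.
SHARED_KEY = b"mds-osd-shared-secret"

def grant_capability(object_id: str, rights: str) -> tuple[bytes, bytes]:
    """MDS side: issue a signed capability for (object_id, rights)."""
    cap = f"{object_id}:{rights}".encode()
    tag = hmac.new(SHARED_KEY, cap, hashlib.sha256).digest()
    return cap, tag

def verify_capability(cap: bytes, tag: bytes, object_id: str, op: str) -> bool:
    """OSD side: check the MAC, then check the requested op is allowed."""
    expected = hmac.new(SHARED_KEY, cap, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return False
    obj, rights = cap.decode().split(":")
    return obj == object_id and op in rights

cap, tag = grant_capability("obj42", "rw")        # client obtains this from the MDS
assert verify_capability(cap, tag, "obj42", "r")  # presented to the OSD per I/O
```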
Development of OSD

- Most initial work on object storage devices (OSD) was done at the Parallel Data Lab at CMU, focused on developing the underlying concepts in two closely related areas: NASD and Active Disks.
- OSD was proposed as part of the same project as NASD.
- Standardized by the Storage Networking Industry Association (SNIA) in 2004.
OSD vs. Active Disks

- The OSD standard only specifies the interface; it does not assume anything about the processing power at the disk.
- OSD intelligence is software/firmware running at the disk (the standard gives no specifications for this).
- The processing power of an OSD can be scaled to meet the requirements of the functions of an active disk.
File System – Application Side (User Component only)

- Since the OSD has the intelligence to perform basic data management functions such as space allocation and free-space management, those functions are no longer part of the application-side file system.
- Thus the application-side file system is reduced to a manager: an abstraction layer between the user application and the OSD.
- It only provides security and backward compatibility.
File System – On the Device (Storage Component)

- The workload offered to OSDs may be quite different from that of general-purpose file systems.
- At the OSD level, objects typically have no logical relationship, presenting a flat name space.
- General-purpose file systems, which are usually optimized for workloads exhibiting relatively small variable-sized files, relatively small hierarchical directories, and some degree of locality, are not effective in this case.
Object-based File System

- Separation of metadata and data paths: separate metadata servers (MDS) manage the directory hierarchy, permissions, and the file-to-object mapping.
- Distribution and replication of a file across a sequence of objects on many OSDs (see the sketch below).
- Example file systems: Lustre, Panasas, Ceph.
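
A minimal sketch of the file-to-object mapping, assuming a file is striped over a sequence of fixed-size objects (the 1 MB limit echoes the Ceph example on the next slide; names are illustrative):

```python
# Minimal sketch: map a byte offset in a file to (object index in the
# file's sequence, offset within that object), assuming fixed-size objects.
OBJECT_SIZE = 1 << 20  # 1 MB maximum object size

def locate(file_offset: int) -> tuple[int, int]:
    """Return (index of the object in the sequence, offset inside it)."""
    return file_offset // OBJECT_SIZE, file_offset % OBJECT_SIZE

obj_index, obj_offset = locate(5 * (1 << 20) + 123)  # byte 5 MB + 123
assert (obj_index, obj_offset) == (5, 123)           # object 5, offset 123
```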
Some Optimizations in Ceph

- Partitioning the directory tree: to efficiently balance load, the MDSs partition the directory tree across the cluster. A client guesses which metadata server is responsible for a file and contacts that server to open the file; that MDS will forward the request to the correct MDS if necessary.
- Limit on object size and use of regions: Ceph limits objects to a maximum size (e.g., 1 MB), so files are a sequence of bytes broken into chunks on the maximum-object-size boundary. Since only the MDSs hold the directory tree, OSDs have no directory information to suggest layout hints for file data. Instead, the OSDs organize objects into small and large object regions, using small block sizes (e.g., 4 KB or 8 KB) for small objects and large block sizes (e.g., 50–100% of the maximum object size) for large objects.
- Use of a specialized mapping algorithm: a file handle returned by the metadata server describes which objects on which OSDs contain the file data. A special algorithm, RUSH, maps a sequence index to the OSD holding the object at that position in the sequence, distributing the objects in a uniform way (a rough stand-in is sketched below).
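
RUSH itself also handles weighted devices and cluster growth with minimal data movement; as a rough, hypothetical stand-in that only illustrates the interface (sequence index in, OSD id out), a simple hash suffices:

```python
import hashlib

# Rough stand-in for a placement function: deterministically map a
# (file id, sequence index) pair to one of N OSDs. This is NOT the real
# RUSH algorithm; it only shows the shape of the mapping.
NUM_OSDS = 16

def place(file_id: str, seq_index: int) -> int:
    digest = hashlib.sha256(f"{file_id}/{seq_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_OSDS

# Consecutive objects of one file scatter roughly uniformly across OSDs:
print([place("inode-1234", i) for i in range(8)])
```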
Possible Performance Results

- OBFS outperforms Linux Ext2 and Ext3 by a factor of two or three; while OBFS is 1/25 the size of XFS, it provides only slightly lower read performance and 10%–40% higher write performance.
Possible Performance Results (contd.)
Database Storage Management

- Object attributes are also the key to giving storage devices an awareness of how objects are being accessed, so that they can use this information to optimize the disk layout for the specific application.
- Database software often has very little detailed information about the storage subsystem.
- Previous research took the view that a storage device can provide relevant characteristics to applications.
- Device-specific information is known to the storage subsystem, which is thus better equipped to manage low-level storage tasks.
Database Storage Management (contd.)

- Object attributes can contain information about the expected behavior of an object, such as the expected read/write ratio, access pattern (sequential vs. random), or the expected size, dimension, and content of the object (see the sketch below).
- Using OSD, a DBMS can inform the storage subsystem of the geometry of a relation, thereby passing responsibility for low-level data layout to the storage device.
- The dependency between the metadata and the storage system/application is removed, which assists with data sharing between different storage applications.
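
A hypothetical sketch of how a DBMS might pass such hints as object attributes; the attribute names and the client stub are invented for illustration, not the T10 attribute pages:

```python
# Hypothetical sketch: a DBMS attaches layout hints to an object as OSD
# attributes so the device can choose a layout suited to the access pattern.
class OsdClientStub:
    """Stand-in for an OSD client; records SET ATTRIBUTES requests."""
    def __init__(self):
        self.attrs = {}

    def set_attribute(self, object_id, name, value):
        self.attrs.setdefault(object_id, {})[name] = value

relation_hints = {
    "expected_rw_ratio": 0.9,        # mostly reads
    "access_pattern": "sequential",  # scans, not point lookups
    "row_width_bytes": 256,          # "geometry" of the relation
    "expected_rows": 1_000_000,      # expected size
}

osd = OsdClientStub()
for name, value in relation_hints.items():
    osd.set_attribute("relation-orders", name, value)
```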
OSD Objects and Attributes
Scalability

Scalability – what does that word really mean?

- Capacity: number of bytes, number of objects, number of files, etc. OSD aggregation techniques will allow for hierarchical representations of more complex objects that consist of larger numbers of smaller objects.
- Performance: bandwidth, transaction rate, latency. OSD performance management can be used in conjunction with OSD aggregation techniques to more effectively scale each of these three performance metrics and maintain required QoS levels on a per-object basis.
- Connectivity: number of disks, hosts, arrays, etc. Since the OSD model requires self-managed devices and is transport agnostic, the number of OSDs and hosts can grow to the size limits of the transport network.
- Geographic: LAN, SAN, WAN, etc. Again, since the OSD model is transport agnostic and there is a security model built into the OSD architecture, geographic scalability is not bounded.
- Processing power: OSD processing power can be scaled.
Other Advantages

- Manageability: the OSD management model relies on self-managed, policy-driven storage devices that can be centrally managed and locally administered (i.e., central policies, local execution).
- Density: OSDs on individual storage devices can optimize densities by abstracting the physical characteristics of the underlying storage medium.
- Cost: addresses issues such as $/MB, $/sqft, $/IOP, $/MB/sec, TCO, etc.
- Adaptability to changing applications: can the OSD be repurposed to different uses, such as from a film-editing station to mail serving?
- Capability: can additional functionality be added to an OSD to increase its usefulness for different applications?
Other Advantages (contd.)

- Availability: fail-over capabilities between cooperating OSD devices. 2-way failover versus N-way failover?
- Reliability: connection-integrity capabilities.
- Serviceability: remote monitoring, remote servicing, hot-plug capability, "genocidal" sparing. When an OSD dies and a new one is put in its place, how does it get "rebuilt"? How automated is the service process?
- Interoperability: supported by many OS vendors, file system vendors, storage vendors, and middleware vendors.
- Power: decrease the power per unit volume by relying on policy-driven self-management schemes to "power down" objects (i.e., move them to disks and spin those disks down).
Cluster Computing

- Traditionally a 'divide-and-conquer' approach: the problem to be solved is decomposed into thousands of independently executed tasks by exploiting the problem's inherent data parallelism, identifying the data partitions that comprise the individual tasks, then distributing each task and its corresponding partition to the compute nodes for processing.
- Data from a shared storage system is staged (copied) to the compute nodes, processing is performed, and results are de-staged from the nodes back to shared storage when done. In many applications, the staging setup time can be appreciable, up to several hours for large clusters.
OSD for Cluster Computing

- Object-based storage clustering is useful in unlocking the full potential of these Linux compute clusters.
- Intrinsic ability to scale linearly in capacity and performance to meet the demands of supercomputing applications.
- High-bandwidth parallel data access between thousands of Linux cluster nodes and a unified storage cluster over standard TCP/IP networks.
Commercial Products
OSD Commands