CSE 598D Storage Systems, Spring 2007
Object Based Storage
Presented By: Kanishk Jain

Introduction
- Object-based storage is defined by the ANSI T10 Object-based Storage Devices standard.
- A storage object is a logical collection of bytes on a storage device, with well-known methods for access, attributes describing characteristics of the data, and security policies that prevent unauthorized access.
- The goal is "intelligent data layout."

Object Storage Interface
- The OSD model is simply a rearrangement of existing data management functions.
- The OSD interface sits one level above block access but one level below file access.

Background – NAS sharing
- NAS is used to share files among a number of clients.
- The files themselves may be stored on a fast SAN.
- The file server intermediates all requests and thus becomes the bottleneck!

Background – SAN sharing
- The files are stored on a fast SAN (e.g., iSCSI) to which the clients are also attached.
- While the file server is removed as a bottleneck, security becomes a concern!

Object-based Storage Security Architecture
- Metadata managers grant capabilities to clients; clients present these capabilities to the devices on every I/O to ensure security.
- This gives a secure separation of the control and data paths!

Development of OSD
- Most initial work on object storage devices (OSD) was done at the Parallel Data Lab at CMU.
- The work focused on developing underlying concepts in two closely related areas: NASD and Active Disks.
- Active Disks were proposed as part of the same project as NASD.
- Standardized by the Storage Networking Industry Association (SNIA) in 2004.

OSD vs. Active Disks
- The OSD standard only specifies the interface; it assumes nothing about the processing power at the disk.
- OSD intelligence is software/firmware running at the disk (the standard does not specify it).
- The processing power of an OSD can therefore be scaled to meet the requirements of the functions of an active disk.

File System – Application Side (User Component only)
- Because the OSD has the intelligence to perform basic data management functions such as space allocation and free-space management, those functions are no longer part of the application-side file system.
- The application-side file system is thus reduced to a manager: an abstraction layer between the user application and the OSD.
- It only provides security and backward compatibility.

File System – On the Device (Storage Component)
- The workload offered to OSDs may be quite different from that of general-purpose file systems.
- At the OSD level, objects typically have no logical relationship, presenting a flat name space.
- General-purpose file systems, which are usually optimized for workloads exhibiting relatively small variable-sized files, relatively small hierarchical directories, and some degree of locality, are not effective in this case.

Object-based File Systems
- Separation of metadata and data paths: separate metadata servers (MDS) manage the directory hierarchy, permissions, and the file-to-object mapping.
- A file is distributed and replicated across a sequence of objects on many OSDs.
- Example file systems: Lustre, Panasas, Ceph.

Some Optimizations in Ceph
- Partitioning the directory tree: to balance load efficiently, the MDS partition the directory tree across the cluster. A client guesses which metadata server is responsible for a file and contacts that server to open the file; that MDS forwards the request to the correct MDS if necessary.
- Limit on object size and use of regions: Ceph limits objects to a maximum size (e.g., 1 MB), so a file is a sequence of bytes broken into chunks on the maximum-object-size boundary. Since only the MDS hold the directory tree, OSDs have no directory information to suggest layout hints for file data. Instead, the OSDs organize objects into small and large object regions, using small block sizes (e.g., 4 KB or 8 KB) for small objects and large block sizes (e.g., 50–100% of the maximum object size) for large objects.
- Use of a specialized mapping algorithm: a file handle returned by the metadata server describes which objects on which OSDs contain the file data. A special algorithm, RUSH, maps a sequence index to the OSD holding the object at that position in the sequence, distributing the objects uniformly; see the sketch after this list.
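To make the chunking and placement concrete, here is a minimal Python sketch. The hash-based placement is a simplified stand-in for the actual RUSH algorithm (which additionally supports weighted devices and incremental cluster growth), and OBJECT_SIZE and NUM_OSDS are made-up parameters for illustration.

```python
import hashlib

OBJECT_SIZE = 1 << 20   # 1 MB maximum object size, as in the Ceph example above
NUM_OSDS = 8            # hypothetical cluster size

def file_to_objects(file_id: int, file_size: int):
    """Break a file into fixed-size chunks and map each chunk to an OSD.

    The hash-based placement below is a simplified stand-in for RUSH:
    it distributes objects uniformly and deterministically, so any
    client can recompute the mapping without asking the MDS, but it
    lacks RUSH's support for device weighting and cluster growth.
    """
    num_chunks = (file_size + OBJECT_SIZE - 1) // OBJECT_SIZE
    layout = []
    for index in range(num_chunks):
        key = f"{file_id}:{index}".encode()
        osd = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_OSDS
        layout.append((index, osd))
    return layout

# A 3.5 MB file becomes four objects spread across the OSD cluster.
print(file_to_objects(file_id=42, file_size=3_500_000))
```

Because the mapping is a deterministic function of the file identity, every client computes object locations locally, keeping the MDS out of the data path.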
Possible Performance Results
- OBFS outperforms Linux Ext2 and Ext3 by a factor of two or three, and while OBFS is 1/25 the size of XFS, it provides only slightly lower read performance and 10%–40% higher write performance.

Possible Performance Results (contd..)

Database Storage Management
- Object attributes are also the key to giving storage devices an awareness of how objects are being accessed, so that the device can use this information to optimize disk layout for the specific application.
- Database software often has very little detailed information about the storage subsystem.
- Previous research took the view that a storage device can provide relevant characteristics to applications.
- Device-specific information is known to the storage subsystem, which is thus better equipped to manage low-level storage tasks.

Database Storage Management (contd..)
- Object attributes can contain information about the expected behavior of an object, such as the expected read/write ratio, access pattern (sequential vs. random), or the expected size, dimension, and content of the object.
- Using OSD, a DBMS can inform the storage subsystem of the geometry of a relation, thereby passing responsibility for low-level data layout to the storage device (see the sketch after this list).
- The dependency between the metadata and the storage system/application is removed, which assists with data sharing between different storage applications.
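As a sketch of how such hints might be attached to an object: the T10 standard organizes attributes into numbered pages, but the page number and attribute numbers below are hypothetical, invented only to show the shape of the DBMS-to-OSD hint passing described above.

```python
from dataclasses import dataclass, field

@dataclass
class StorageObject:
    """Minimal model of an OSD object: data plus a dictionary of
    (page, attribute_number) -> value pairs, mirroring the T10
    attribute-page structure in spirit only."""
    data: bytes = b""
    attributes: dict = field(default_factory=dict)

# Hypothetical attribute page for access hints; the T10 standard
# reserves page ranges for vendor- and application-specific use,
# but these particular numbers are illustrative only.
HINTS_PAGE = 0xFFFF0001
ATTR_RW_RATIO = 1        # expected read/write ratio
ATTR_ACCESS_PATTERN = 2  # "sequential" or "random"
ATTR_ROW_WIDTH = 3       # bytes per row of the relation

table = StorageObject()
# A DBMS describing the geometry and expected access pattern of a
# relation, so the OSD can choose block sizes and on-disk layout.
table.attributes[(HINTS_PAGE, ATTR_RW_RATIO)] = 0.9
table.attributes[(HINTS_PAGE, ATTR_ACCESS_PATTERN)] = "sequential"
table.attributes[(HINTS_PAGE, ATTR_ROW_WIDTH)] = 512
```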
OSD Objects and Attributes

Scalability
- Scalability: what does that word really mean?
- Capacity: number of bytes, number of objects, number of files, etc. OSD aggregation techniques allow hierarchical representations of more complex objects that consist of larger numbers of smaller objects.
- Performance: bandwidth, transaction rate, latency. OSD performance management can be used in conjunction with OSD aggregation techniques to scale each of these three performance metrics more effectively and to maintain required QoS levels on a per-object basis.
- Connectivity: number of disks, hosts, arrays, etc. Since the OSD model requires self-managed devices and is transport-agnostic, the number of OSDs and hosts can grow to the size limits of the transport network.
- Geographic: LAN, SAN, WAN, etc. Again, since the OSD model is transport-agnostic and a security model is built into the OSD architecture, geographic scalability is not bounded.
- Processing power: OSD processing power can be scaled.

Other Advantages
- Manageability: the OSD management model relies on self-managed, policy-driven storage devices that can be centrally managed and locally administered (i.e., central policies, local execution).
- Density: an OSD on an individual storage device can optimize density by abstracting the physical characteristics of the underlying storage medium.
- Cost: addresses issues such as $/MB, $/sqft, $/IOP, $/MB/sec, TCO, etc.
- Adaptability: to changing applications. Can the OSD be repurposed to different uses, such as from a film-editing station to mail serving?
- Capability: functionality can be added for different applications. Can additional functionality be added to an OSD to increase its usefulness?

Other Advantages (contd..)
- Availability: fail-over capabilities between cooperating OSD devices. 2-way failover versus N-way failover?
- Reliability: connection-integrity capabilities.
- Serviceability: remote monitoring, remote servicing, hot-plug capability, and sparing. When an OSD dies and a new one is put in its place, how does it get "rebuilt"? How automated is the service process?
- Interoperability: supported by many OS vendors, file system vendors, storage vendors, and middleware vendors.
- Power: decrease the power per unit volume by relying on policy-driven self-management schemes to "power down" objects (i.e., move them to certain disks and spin those disks down).

Cluster Computing
- Traditionally a 'divide-and-conquer' approach: the problem to be solved is decomposed into thousands of independently executed tasks using the problem's inherent data parallelism, identifying the data partitions that comprise each individual task, then distributing each task and its corresponding partition to the compute nodes for processing.
- Data from a shared storage system is staged (copied) to the compute nodes, processing is performed, and results are de-staged from the nodes back to shared storage when done.
- In many applications, the staging setup time can be appreciable: up to several hours for large clusters.

OSD for Cluster Computing
- Object-based storage clustering is useful in unlocking the full potential of these Linux compute clusters.
- It has an intrinsic ability to scale linearly in capacity and performance to meet the demands of supercomputing applications.
- It provides high-bandwidth parallel data access between thousands of Linux cluster nodes and a unified storage cluster over standard TCP/IP networks.

Commercial Products

OSD Commands
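The T10 command set addresses objects rather than blocks, with commands such as FORMAT OSD, CREATE, REMOVE, READ, WRITE, LIST, and GET/SET ATTRIBUTES. The toy client below is an illustrative stand-in, not the SCSI wire protocol; its Python interface is invented, and it also echoes the capability check from the security-architecture slide earlier.

```python
class OSDClient:
    """Toy in-memory stand-in for a T10 OSD target. The command names
    follow the standard (CREATE, READ, WRITE, ...); the Python
    interface and the capability check are illustrative only."""

    def __init__(self):
        self.objects = {}      # object_id -> bytearray
        self.next_id = 0x10000

    def create(self, capability):
        self._check(capability, "create")
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = bytearray()
        return oid

    def write(self, capability, oid, offset, data):
        # Toy semantics: no sparse-write handling beyond the buffer end.
        self._check(capability, "write")
        buf = self.objects[oid]
        buf[offset:offset + len(data)] = data

    def read(self, capability, oid, offset, length):
        self._check(capability, "read")
        return bytes(self.objects[oid][offset:offset + length])

    def _check(self, capability, op):
        # In the real protocol the capability is a cryptographically
        # signed token from the metadata manager, validated by the
        # device on every I/O; here it is just a dict of rights.
        if op not in capability["rights"]:
            raise PermissionError(op)

osd = OSDClient()
cap = {"rights": {"create", "write", "read"}}  # granted by an MDS in practice
oid = osd.create(cap)
osd.write(cap, oid, 0, b"hello object storage")
print(osd.read(cap, oid, 0, 5))  # b'hello'
```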