Survey of Leveraging Capabilities of Intelligent Storage Devices for

advertisement
Survey of techniques used to reduce the Semantic Gap between
Database Management Systems and Storage Subsystems
Problem Description:
One of the most important modules of any database management system is the storage
manager module. This module essentially controls the way the data is allocated, accessed,
and stored on storage devices. Storage subsystems are typically virtualized for the
purposes of consolidation, easy of management, reducing interdependence, etc. Due to
this virtualization now the database storage managers neither have a strict control over
the physical layout of data nor are they aware of the internal characteristics of storage
subsystems, and apply some coarse rules of thumb to optimize its access. On the other
hand, storage subsystems do not have semantic knowledge of the data that they are
storing, again relying on their own rules of thumb to manage such workload-dependent
tasks as pre-fetching, caching, and data layout. The end result is both sides are working
blindly to optimize their performance without not knowing what other side is doing.
Various studies [1][2] emphasize the importance reducing information gap between
applications and underlying storage devices. Over the years, various techniques have
been developed to reduce the semantic gap between database management systems and
storage systems. In this project we plan to survey various approaches used by researchers
or implementers in order to reduce this semantic gap.
The idea of placing intelligence in storage systems to help database operation was
explored extensively in the context of database machines in late 1970s and late 1980s.
Database machines can be classified into four categories depending on disk processing
[8]:
1. Processor per head: DBC, SURE
2. Processor per track: CAFS, RARES, RAP, CASSM
3. Processor per disk: SURE
4. Multi-processor cache: DIRECT, RDBM, DBMAC, INFOPLEX, RAP.2
In all of the architectures, there was a central processor which pushed simple database
operations (e.g., scan) closer to disk, and achieved a dramatic performance improvement
for these operations. The main counter-arguments are summarized by Boral and Dewitt
[11]. First, most database machine use special-purpose hardware, such as associative
disks, associative CCD devices, and magnetic bubble memory which increased the design
time and cost of these machines. Again, the performance gain was not enough to justify
the additional cost incurred by these hardwares. Second, although the performance was
impressive for scan operations, but for the complex database operations, such as sorts and
joins did not provide significant benefits. Third, the performance offered by database
machines can be easily achievable by smart indexing techniques. Fourth, CPU processing
speed was improving much faster than the disk transfer rates improve, so CPU was sitting
idle. Fifth, the communication overhead between processing elements were high. Finally,
database vendors did not agree to rewrite their legacy code base to take advantage of
features offered by these new hardwares.
Storage technology has evolved much in the intervening years and disk-resident
processing got attention from the research community again in 1990s. The biggest change
was widespread use of disk arrays that use a large number of disks working in parallel.
Special-purpose silicon cores in database machines are replaced by general-purpose
embedded processing and increased memory cores. Numerous parallel algorithms for
database operations, such as joins and sorts have been developed for different
architectures, such as shared-nothing, shared-memory, and shared disk since the
inception of specialized database machines. Serial communications were able to provide
enough bandwidth to disk to overcome the message passing overheads.
Description of Candidate Solutions:
To take advantage of this technology innovations, Archrya et al. [9] and Riedel et al. [6]
[10] explored the benefits of mapping applications form database, data mining, image
processing, sorting, and data cubes onto storage devices to enable application specific
processing close to data. Their work mostly focuses on how to partition applications
across host and storage devices to minimize communication over overhead due to data
transfer. Keeton et al., [8] explored using a analytical model of speed up to offload
portions of SMP databases to use the processing power in storage devices.
The Fates database storage project at CMU [3][5], explores how to extract disk
characteristics to map efficient database access patterns. Logical Volume Manager
(LVM) exposes information about disk layout to the database storage manager, which
uses this information to efficiently write data to disks so as to improve later readings
from storage devices.
The database aware semantically storage disk project [4], explores the possibility of a
database aware storage device. It uses WAL entries to find the access patterns of block
access for database and use thin information for efficient pre-fetching, caching, and datalayout inside disk. Host is transparent to what the disk does.
[7] makes use of object-based storage devices (OSD). It examines the approach of
passing semantic information from a database system to the storage subsystem to bridge
the information gap between these two levels. Recently-standardized OSD interface
moves low level storage functionalities close to data and provides an object interface to
access the data. This paper leverages OSD interface for communicating semantic
information database to the storage device. It discusses how we can map relation of a
database to an OSD object; and how we can read and write database relation efficiently
taking advantage of geometry aware data-layout through additional OSD interface. This
paper only scratches the surface but neither builds any prototype system nor shows any
performance improvement compared to the traditional approaches.
Methodology and Objective:
In this work we plan to survey these above-mentioned techniques to provide a clear
picture of the state of the art when it comes to current database management systems. We
plan to compare these approaches using very common illustrative examples. Our final
goal is to clearly identify the advantages and disadvantages of each approach and
hopefully conclude as to which approach is the most appropriate or promising one for
future database systems.
References:
[1] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Bridging the
information gap in storage protocol stacks. Summer USENIX Technical Conference
(Monterey, CA, 10–15 June 2002), pages 177–190, 2002.
[2] G. R. Ganger. Blurring the line between OSs and storage devices. Technical report
CMU–CS–01–166. Carnegie Mellon University, December 2001.
[3] Jiri Schindler, Steven W Schlosser, et al, Atropos: A Disk Array Volume Manager for
Orchestrated Use of Disks, In 3rd USENIX Conference on File and Storage Technologies
FAST 04, CA. March 2004
[4] Muthian Sivathanu, Lakshmi N Bairavasundaram, et al, Database Aware Sematicaly
Smart Storage, FAST 2005.
[5]Minglong Shao, et al, Clotho: Decoupling Memory Page Layout from Storage
Organization, VLDB 2004.
[6] E. Riedel, C. Faloutsos, and D. Nagle. Active Disk Archittecture for Databases.
Technical Report CMU-CS-00-145, Carniegie Mellon University, April 2000.
[7] Steve Schlosser, Sami Iren. Database Storage Management with Object-Based
Storage Devices, DAMON 2005.
[8] K. Keeton. Computer Architecture Support for Database Applications, PhD thesis,
University of California at Berkeley, 1999.
[9] A. Acharya, M. Uysal, and J. Saltz. Active Disk: Programming Model, Algorithms,
and Evaluation, ASPLOS VIII, 1998.
[10] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large Scale Data Mining
and Multimedia, VLDB 1998.
[11] H. Boral and D. J. Dewitt. Database Machines: An Idea whose time has passed?, In
3rd Workshop on Database Machines, 1983.
Download