Survey of Techniques Used to Reduce the Semantic Gap between Database Management Systems and Storage Subsystems

Problem Description: One of the most important modules of any database management system is the storage manager. This module controls the way data is allocated, accessed, and stored on storage devices. Storage subsystems are typically virtualized for the purposes of consolidation, ease of management, reduced interdependence, and so on. Because of this virtualization, database storage managers neither have strict control over the physical layout of data nor are they aware of the internal characteristics of the storage subsystem, so they apply coarse rules of thumb to optimize their accesses. Storage subsystems, on the other hand, have no semantic knowledge of the data they store, and rely on their own rules of thumb to manage such workload-dependent tasks as prefetching, caching, and data layout. The end result is that both sides work blindly to optimize their performance, neither knowing what the other is doing. Various studies [1][2] emphasize the importance of reducing this information gap between applications and the underlying storage devices. Over the years, various techniques have been developed to reduce the semantic gap between database management systems and storage systems. In this project we plan to survey the approaches that researchers and implementers have used to reduce it.

The idea of placing intelligence in storage systems to help database operation was explored extensively in the context of database machines in the late 1970s and 1980s. Database machines can be classified into four categories depending on where disk processing is performed [8]:

1. Processor per head: DBC, SURE
2. Processor per track: CAFS, RARES, RAP, CASSM
3. Processor per disk: SURE
4. Multi-processor cache: DIRECT, RDBM, DBMAC, INFOPLEX, RAP.2

In all of these architectures, a central processor pushed simple database operations (e.g., scans) closer to the disk, achieving dramatic performance improvements for those operations (a sketch of this idea appears below). The main counter-arguments are summarized by Boral and DeWitt [11]. First, most database machines used special-purpose hardware, such as associative disks, associative CCD devices, and magnetic bubble memory, which increased their design time and cost, and the performance gains were not enough to justify this additional expense. Second, although performance was impressive for scans, complex database operations such as sorts and joins did not benefit significantly. Third, the performance offered by database machines could often be matched by smart indexing techniques. Fourth, CPU speeds were improving much faster than disk transfer rates, so the processors sat idle waiting for data. Fifth, the communication overhead between processing elements was high. Finally, database vendors were unwilling to rewrite their legacy code bases to take advantage of the features offered by this new hardware.

Storage technology has evolved considerably in the intervening years, and disk-resident processing attracted attention from the research community again in the 1990s. The biggest change was the widespread use of disk arrays, which employ large numbers of disks working in parallel. The special-purpose silicon of the database machines has been replaced by general-purpose embedded processors and larger memories.
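To make concrete both the operation these machines offloaded and the data-transfer savings that motivated them, the following is a minimal sketch, in C, of a selection scan executed host-side versus device-side. The in-memory disk array, the record layout, and the host_scan/device_scan functions are hypothetical illustrations rather than any real device interface; the point is only that the device-side variant ships matching records across the host/device boundary instead of every block.

    /* Hypothetical sketch: a toy in-memory "disk" of fixed-size records,
     * used to contrast the two data paths for a selection scan. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NRECS 1000

    typedef struct { int key; char payload[60]; } record_t;

    static record_t disk[NRECS];       /* stand-in for the storage device */

    /* Host-side scan: every record crosses the host/device boundary
     * before the predicate is applied. */
    static size_t host_scan(int lo, int hi, record_t *out) {
        size_t transferred = 0, matched = 0;
        for (size_t i = 0; i < NRECS; i++) {
            record_t r = disk[i];      /* models a block read over the bus */
            transferred++;
            if (r.key >= lo && r.key <= hi)
                out[matched++] = r;
        }
        printf("host-side scan:   %zu records transferred, %zu matched\n",
               transferred, matched);
        return matched;
    }

    /* Device-side scan: the predicate runs "inside" the device, so only
     * matching records are shipped back -- the processor-per-disk idea. */
    static size_t device_scan(int lo, int hi, record_t *out) {
        size_t shipped = 0;
        for (size_t i = 0; i < NRECS; i++)
            if (disk[i].key >= lo && disk[i].key <= hi)
                out[shipped++] = disk[i];  /* only matches cross the bus */
        printf("device-side scan: %zu records transferred\n", shipped);
        return shipped;
    }

    int main(void) {
        for (int i = 0; i < NRECS; i++)
            disk[i].key = rand() % 10000;
        record_t out[NRECS];
        host_scan(100, 200, out);
        device_scan(100, 200, out);
        return 0;
    }

In the host-side variant all 1,000 records cross the (simulated) bus; in the device-side variant only the handful of matches do, which is the essential appeal of processor-per-disk designs for scan-heavy workloads.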
Since the inception of specialized database machines, numerous parallel algorithms for database operations such as joins and sorts have been developed for different architectures (shared-nothing, shared-memory, and shared-disk). Serial communication links are now able to provide enough bandwidth to the disks to overcome message-passing overheads.

Description of Candidate Solutions: To take advantage of these technological innovations, Acharya et al. [9] and Riedel et al. [6][10] explored the benefits of mapping applications from databases, data mining, image processing, sorting, and data cubes onto storage devices, enabling application-specific processing close to the data. Their work focuses mostly on how to partition applications across the host and the storage devices so as to minimize the communication overhead of data transfer. Keeton [8] used an analytical speedup model to explore offloading portions of SMP database workloads onto the processing power embedded in storage devices.

The Fates database storage project at CMU [3][5] explores how to extract disk characteristics and map database access patterns onto them efficiently. Its logical volume manager (LVM) exposes information about the disk layout to the database storage manager, which uses this information to write data to the disks in a way that makes later reads from the storage devices efficient.

The database-aware semantically-smart disk project [4] explores the possibility of a database-aware storage device. The disk uses WAL entries to infer the database's block access patterns and exploits this information for efficient prefetching, caching, and data layout inside the disk; all of this remains transparent to the host.

Schlosser and Iren [7] make use of object-based storage devices (OSDs). They examine the approach of passing semantic information from the database system to the storage subsystem to bridge the information gap between the two levels. The recently standardized OSD interface moves low-level storage functionality close to the data and provides an object interface for accessing it. The paper leverages this interface to communicate semantic information from the database to the storage device: it discusses how a database relation can be mapped to an OSD object, and how relations can be read and written efficiently by exploiting geometry-aware data layout through additional OSD attributes (a sketch of this mapping appears after the next paragraph). The paper only scratches the surface, however: it neither builds a prototype system nor shows any performance improvement over traditional approaches.

Methodology and Objective: In this work we plan to survey the above-mentioned techniques to provide a clear picture of the state of the art in current database management systems. We plan to compare the approaches using common illustrative examples. Our final goal is to clearly identify the advantages and disadvantages of each approach and, we hope, to conclude which approach is the most appropriate or promising one for future database systems.
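As one such illustrative example, the following is a minimal sketch, in C, of how a fixed-width relation might be mapped onto an OSD object in the style of [7]. The osd_obj_t and osd_relation_attrs_t types and the osd_create/osd_set_attr/osd_write calls are hypothetical stand-ins, not the actual T10 OSD command set: a real object-based device would persist the attributes and use them for geometry-aware placement of the object's extents, whereas here the commands are merely logged.

    /* Hypothetical sketch of mapping a relation to an OSD object and
     * attaching relation-level semantic attributes to it. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {                  /* hypothetical OSD object handle */
        uint64_t partition_id;
        uint64_t object_id;
    } osd_obj_t;

    typedef struct {                  /* hypothetical attribute page with
                                         relation-level semantic hints */
        uint32_t tuple_size;          /* fixed tuple width in bytes */
        uint32_t tuples_per_page;     /* DBMS page packing */
        char     access_hint[16];     /* e.g. "sequential" or "random" */
    } osd_relation_attrs_t;

    static void osd_create(osd_obj_t *o, uint64_t pid, uint64_t oid) {
        o->partition_id = pid;
        o->object_id = oid;
        printf("CREATE   object %llu in partition %llu\n",
               (unsigned long long)oid, (unsigned long long)pid);
    }

    static void osd_set_attr(const osd_obj_t *o,
                             const osd_relation_attrs_t *a) {
        printf("SET_ATTR object %llu: tuple_size=%u tuples_per_page=%u "
               "hint=%s\n", (unsigned long long)o->object_id,
               a->tuple_size, a->tuples_per_page, a->access_hint);
    }

    static void osd_write(const osd_obj_t *o, uint64_t off,
                          const void *buf, size_t len) {
        (void)buf;                    /* a real device would store the data */
        printf("WRITE    object %llu: %zu bytes at offset %llu\n",
               (unsigned long long)o->object_id, len,
               (unsigned long long)off);
    }

    int main(void) {
        /* Map the relation Employee(id INT, name CHAR(28)) -- 32-byte
         * tuples, 256 tuples per 8 KB page -- onto one OSD object. */
        osd_obj_t employee;
        osd_create(&employee, 1, 42);

        osd_relation_attrs_t attrs = {
            .tuple_size      = 32,
            .tuples_per_page = 256,
            .access_hint     = "sequential",  /* table is mostly scanned */
        };
        osd_set_attr(&employee, &attrs);      /* semantic hint to device */

        unsigned char tuple[32] = {0};        /* one (id, name) tuple */
        tuple[0] = 7;                         /* id = 7 (little-endian) */
        memcpy(tuple + 4, "Alice", 5);
        osd_write(&employee, 0, tuple, sizeof tuple);
        return 0;
    }

The design point the sketch captures is that the semantic information (tuple width, page packing, access hint) travels with the object itself, so the device can lay out and prefetch the relation intelligently without the host having to dictate physical block addresses.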
References:

[1] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Bridging the Information Gap in Storage Protocol Stacks. In Summer USENIX Technical Conference (Monterey, CA, 10–15 June 2002), pages 177–190, 2002.
[2] G. R. Ganger. Blurring the Line Between OSs and Storage Devices. Technical Report CMU-CS-01-166, Carnegie Mellon University, December 2001.
[3] J. Schindler, S. W. Schlosser, et al. Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks. In 3rd USENIX Conference on File and Storage Technologies (FAST '04), CA, March 2004.
[4] M. Sivathanu, L. N. Bairavasundaram, et al. Database-Aware Semantically-Smart Storage. In 4th USENIX Conference on File and Storage Technologies (FAST '05), 2005.
[5] M. Shao, et al. Clotho: Decoupling Memory Page Layout from Storage Organization. In VLDB 2004.
[6] E. Riedel, C. Faloutsos, and D. Nagle. Active Disk Architecture for Databases. Technical Report CMU-CS-00-145, Carnegie Mellon University, April 2000.
[7] S. W. Schlosser and S. Iren. Database Storage Management with Object-Based Storage Devices. In DaMoN 2005.
[8] K. Keeton. Computer Architecture Support for Database Applications. PhD thesis, University of California at Berkeley, 1999.
[9] A. Acharya, M. Uysal, and J. Saltz. Active Disks: Programming Model, Algorithms and Evaluation. In ASPLOS VIII, 1998.
[10] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large-Scale Data Mining and Multimedia. In VLDB 1998.
[11] H. Boral and D. J. DeWitt. Database Machines: An Idea Whose Time Has Passed? In 3rd International Workshop on Database Machines, 1983.