Chapter 10
Storage, Basic File Structure and Indexing

Storage Categories
  - A storage medium is required to store information/data.
  - Primary memory
      - Can be accessed by the CPU directly
      - Fast, expensive and limited in capacity
      - Volatile
  - Secondary memory
      - Data on secondary memory cannot be processed by the CPU directly; it must first be copied into primary memory
      - Slower, larger in capacity, less expensive
      - Non-volatile
  - Secondary storage is the medium used for database storage.

Disks and Files
  - A DBMS stores information on ("hard") disks.
  - This has major implications for DBMS design!
      - READ: transfer data from disk to main memory (RAM).
      - WRITE: transfer data from RAM to disk.
      - Both are high-cost operations relative to in-memory operations, so they must be planned carefully!

Disks
  - Secondary storage device of choice.
  - Data is stored and retrieved in units called disk blocks or pages.
  - Unlike RAM, the time to retrieve a disk page varies depending upon its location on disk.
      - Therefore, the relative placement of pages on disk has a major impact on DBMS performance!

Components of a Disk
  [Figure: components of a disk, labeled disk head, tracks, sector, arm movement, platters, arm assembly]

Records and Files
  - A record consists of a collection of related data values or items (also called fields or columns).
  - A file is a sequence of records, made up of either
      - Fixed-length records
      - Variable-length records
  - A database is stored as a collection of files.

Record Formats: Fixed Length
  [Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address B; the address of field F3 is B + L1 + L2]
  - Information about field types is the same for all records in a file; it is stored in the system catalogs.
  - Finding the i'th field does not require a scan of the record.

Fixed Length Records
  - Store record i starting from byte n * (i - 1), where n is the size of each record.
  - Record access is simple, but records may cross blocks.
  - Deletion of record i: alternatives
      - move records i + 1, ..., n to i, ..., n - 1 (here n is the number of the last record, not the record size)
      - move record n to i
      - do not move records, but link all free records on a free list

Variable-Length Records

Record Organization (on Disks)
  (a) Unspanned.
  (b) Spanned

File Organization & Access Method
  - File organization refers to the physical arrangement of the data in a file into records and pages on secondary storage.
  - Access method refers to the steps involved in storing and retrieving records from a file.
  - Some common file organizations and access methods are discussed next.

Unordered File
  - Also called a heap or a pile file.
  - Simplest file structure: contains records in no particular order.
  - As the file grows and shrinks, disk pages are allocated and de-allocated.
  - New records are inserted at the end of the file.
  - To search for a record, a linear search through the file records is necessary.
      - This requires reading and searching half the file blocks on average (all of them when no record matches), and is hence quite expensive.
  - Record insertion is quite efficient.
  - Reading the records in order of a particular field requires sorting the file records.

Ordered Files
  - Also called a sequential file.
  - File records are kept sorted by the values of an ordering field.
  - Insertion is expensive: records must be inserted in the correct order.
      - It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.
  - A binary search can be used to search for a record on its ordering field value.
      - This requires reading and searching about log2(b) blocks for a file of b blocks, an improvement over linear search.
  - Reading the records in order of the ordering field is quite efficient.

Hash Files
  - Hashing for disk files is called external hashing.
  - The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1.
      - Typically, a bucket corresponds to one disk block (or a fixed number of disk blocks).
  - One of the file fields is designated to be the hash key of the file.
  - The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.
  - Search is very efficient on the hash key.
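
The placement rule i = h(K) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: hash keys are integers, h(K) = K mod M is used as the hashing function (one common choice, not something the slides prescribe), and each in-memory list stands in for the disk block(s) of one bucket. The names `buckets`, `insert` and `search` are hypothetical, and bucket capacity is deliberately ignored here.

```python
from typing import List, Optional, Tuple

M = 4  # number of buckets (arbitrary value chosen for the illustration)

def h(key: int) -> int:
    """Hashing function: maps a hash key K to a bucket number in 0..M-1.
    K mod M is an assumed choice, not one given in the text."""
    return key % M

# Each list stands in for one bucket, i.e. one (or a fixed number of) disk block(s).
buckets: List[List[Tuple[int, str]]] = [[] for _ in range(M)]

def insert(key: int, record: str) -> None:
    """Store the record with hash key value K in bucket h(K)."""
    buckets[h(key)].append((key, record))

def search(key: int) -> Optional[str]:
    """Look only in bucket h(K); no other bucket is read."""
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    return None

insert(10, "Ann")   # h(10) = 2
insert(14, "Bob")   # h(14) = 2, same bucket as key 10
insert(7, "Cara")   # h(7)  = 3

print(search(14))   # found by scanning bucket 2 only
print(search(99))   # bucket 3 is scanned, no match
```

A search touches a single bucket regardless of how many buckets the file has, which is why access on the hash key is cheap compared with a linear scan of a heap file.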
Hash Files (continued)
  - Collisions occur when a new record hashes to a bucket that is already full.
  - An overflow file is kept for storing such records.
  - Overflow records that hash to each bucket can be linked together.

Indexing Structures for Files
  - An index is a data structure that allows the DBMS to locate particular records in a file more quickly and thereby speed up the response to user queries.
  - An index file consists of records (called index entries) of the form <search-key, pointer>.
      - Any subset of the fields of a relation can be the search key for an index on the relation.
      - Search key is not the same as key (a minimal set of fields that uniquely identifies a record in a relation).
  - Index files are typically much smaller than the original file.

Types of Indexes
  - Single-level indexes
      - Primary indexes
      - Clustering indexes
      - Secondary indexes
  - Multilevel indexes

Single Level Index
  - A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file.
  - The index is usually specified on one field of the file (although it could be specified on several fields).
  - One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value.

Primary Index
  - Defined on an ordered data file.
  - The data file is ordered on a key field.
  - Includes one index entry for each block in the data file; the index entry has the key field value of the first record in the block, which is called the block anchor.
  - A similar scheme can use the last record in a block.
  - A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file, holding the key of its anchor record, rather than an entry for every search value.

Clustering Index
  - Defined on an ordered data file.
  - The data file is ordered on a non-key field, unlike a primary index, which requires that the ordering field of the data file have a distinct value for each record.
Clustering Index (continued)
  - Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.

Secondary Index
  - A secondary index provides a secondary means of accessing a file for which some primary access already exists.
  - The secondary index may be on a field which is a candidate key and has a unique value in every record, or on a nonkey field with duplicate values.
  - The index is an ordered file with two fields.
      - The first field is of the same data type as some nonordering field of the data file that is an indexing field.
      - The second field is either a block pointer or a record pointer.
  - There can be many secondary indexes (and hence, indexing fields) for the same file.
  - Includes one entry for each record in the data file; hence, it is a dense index.
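
To make the sparse/dense distinction concrete, the following Python sketch builds a primary index and a secondary index over the same small data file. It is a simplified illustration, not a description of any particular DBMS. The sample records, the field names `id` and `dept`, and the helpers `primary_lookup` and `secondary_lookup` are all hypothetical. Blocks are modelled as in-memory lists, and a record pointer is modelled as a (block number, slot number) pair.

```python
from bisect import bisect_left, bisect_right

# Ordered data file: each inner list is one block of (id, dept) records,
# kept sorted on the key field id.
blocks = [
    [(1, "HR"), (3, "IT")],
    [(5, "IT"), (8, "HR")],
    [(12, "OPS"), (15, "IT")],
]

# Primary index (sparse): one entry per block, holding the key of the
# block anchor (the first record in the block).
primary_keys = [block[0][0] for block in blocks]

def primary_lookup(key):
    """Binary-search the anchors, then scan the single candidate block."""
    pos = bisect_right(primary_keys, key) - 1  # last anchor <= key
    if pos < 0:
        return None  # key is smaller than every anchor
    for record in blocks[pos]:
        if record[0] == key:
            return record
    return None

# Secondary index (dense) on the nonordering field dept: one entry per
# record, ordered by field value, each with a (block, slot) record pointer.
secondary = sorted(
    (record[1], (b, s))
    for b, block in enumerate(blocks)
    for s, record in enumerate(block)
)
secondary_values = [value for value, _ in secondary]

def secondary_lookup(dept):
    """Binary-search the ordered index, then follow each record pointer."""
    lo = bisect_left(secondary_values, dept)
    hi = bisect_right(secondary_values, dept)
    return [blocks[b][s] for _, (b, s) in secondary[lo:hi]]

print(len(primary_keys), len(secondary))  # index sizes: per block vs per record
print(primary_lookup(8))
print(secondary_lookup("IT"))
```

The primary index needs only as many entries as there are blocks because the data file itself is ordered on `id`. The secondary index cannot rely on that ordering, so in this sketch it keeps an entry for every record.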