Part 2 - File Organizations and Indexing

Files and Records
• A file is a sequence of records, where each record is a collection of data values.
• Records are stored on disk blocks. The blocking factor for a file is the (average) number of file records stored in a disk block.
• A file can have fixed-length records or variable-length records.
• File records can be unspanned (no record can span two blocks) or spanned (a record can be stored in more than one block).
• The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or indexed.
• In a file of fixed-length records, all records have the same structure. Usually, unspanned blocking is used with such files.
• Files of variable-length records require additional information to be stored in each record, such as separator characters and field types. Usually, spanned blocking is used with such files.

Basic File Operations:
• OPEN: Prepares the file for access, and associates a pointer that will refer to a current file record at each point in time.
• READ: Reads the current file record into a program variable.
• INSERT: Inserts a new record into the file, and makes it the current file record.
• DELETE: Removes the current file record from the file, usually by marking the record to indicate that it is no longer valid.
• UPDATE: Changes the values of some fields of the current file record.
• CLOSE: Terminates access to the file.
• FIND: Searches for the first file record that satisfies a certain condition, and makes it the current file record.
• FINDNEXT: Searches for the next file record (from the current record) that satisfies a certain condition, and makes it the current file record.
• REORGANIZE: Reorganizes the file records. For example, the records marked deleted are physically removed from the file, or a new organization of the file records is created.

Sequential File Organization
• Developed in the early 1950s to store data on magnetic tapes. (Magnetic tapes were the first secondary storage devices available.)
• An unordered sequential file is also called a heap file.
• To search for a record in an unordered file, a linear search is necessary. On average, this requires reading and searching half the file's records.
• Maintenance of a sequential file:
  • Addition:
    • Unordered file:
      • Requires appending to the end of the file. O(1).
    • Ordered file:
      • More complex, since the order (ascending or descending) must be preserved. O(n).
      • Requires 3 steps:
        1) Copy the records up to the location where the new record must be inserted into a new file.
        2) Write the new record.
        3) Copy the rest of the file to the new file.
  • Deletion:
    • Same as addition to an ordered file, except that the file is rewritten leaving the deleted record out. O(n).
  • Update:
    • Same as addition to an ordered file, except that one or more of the records are changed before rewriting. O(n).
  • Search:
    • Unordered file: n/2 comparisons on average.
    • Ordered file: O(log n), using a binary search.

Sequential Search:
• Processes the records of a file in their order of occurrence until it either locates the record or reaches the EOF.
• O(n) complexity.
• An average search requires n/2 comparisons.
• The worst case requires n comparisons (i.e., the record is the last one or does not exist).
• Best if n is small.
• Simple algorithm.
• Sorting the records in the file improves the average cost of an unsuccessful search to about n/2 comparisons: once you pass the point at which the record should appear and it is not there, you know that it is not in the file.
Advantages:
• Simple
Disadvantages:
• Slow

Binary Search:
Basic Requirement:
• Requires a sorted file!
• Must be a random access file on a random access device!
• Must have equal-sized records!
Basic Algorithm:
• Same as binary search in an array.
• Search the file by repeatedly splitting it in half, until the record is found or until you can no longer split the file.
• O(log2 n) complexity.
• Both the average search and the worst case require O(log2 n) comparisons.
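
The basic algorithm above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the 16-byte record layout (4-byte key plus 12-byte name), the helper names, and the in-memory file standing in for a random access device are all assumptions made for the example.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte unsigned key + 12-byte name.
RECORD = struct.Struct("<I12s")

def build_sorted_file(rows):
    """Return an in-memory file of fixed-length records sorted by key."""
    buf = io.BytesIO()
    for key, name in sorted(rows):
        buf.write(RECORD.pack(key, name.encode("ascii")))
    return buf

def binary_search(f, key):
    """Return (record, probes) for `key`, or (None, probes) if it is absent."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // RECORD.size - 1   # first and last record numbers
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)             # random access: jump to record `mid`
        k, name = RECORD.unpack(f.read(RECORD.size))
        probes += 1
        if k == key:
            return (k, name.rstrip(b"\x00").decode("ascii")), probes
        if k < key:
            lo = mid + 1                      # continue in the upper half
        else:
            hi = mid - 1                      # continue in the lower half
    return None, probes

# 256 records with even keys 0, 2, ..., 510
f = build_sorted_file((k, f"rec{k}") for k in range(0, 512, 2))
hit, probes_hit = binary_search(f, 300)       # key present
miss, probes_miss = binary_search(f, 301)     # key absent
print(hit, probes_hit, miss, probes_miss)
```

The seek offset `mid * RECORD.size` is only valid because every record has the same size, which is why the equal-sized-records requirement appears in the list above. On a real disk, each probe may cost a seek and a rotational delay.
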
Advantages:
• Faster than sequential search.
Disadvantages:
• Still slow for files: acceptable for in-memory searches, but on disk each comparison may cost a seek plus a rotational delay, so binary search in a file is best when n is small. (Example: with n=256, we still need about log2 256 = 8 comparisons to locate a record, possibly 8 seeks + 8 rotations.)
• Overhead: the file must be initially sorted, and the order must be maintained after insertions and deletions.

Random Access File Organization:
• Direct access.
• Insertion, deletion and searching of records can be done at random.
• Fixed record size (or buckets).
• The location of a record is computed from its unique key (identifier).
• The logical ordering of records may or may not correspond to their physical sequence.
• Random access files can be updated in place.
• Often a HASH function is used to locate a record.
• Fast.

3) Index Sequential File Organization:
• Provides both sequential and direct access.
• Combines sequential access and ordering with random access capabilities.
• Contains two parts:
  1) A sequential file:
    - Maintains the actual data.
    - Sometimes ordered on a key.
    - Variable-sized records.
  2) An index:
    - A linear or hierarchical index structure: (Key, Block #) or (Key, Absolute file location)
• Variable-length records within fixed-size blocks.
• One or more records can be placed in the same block.
• Some space may be wasted.
• Variable-length records waste less space, but are slower.
• Fixed-length records waste space but provide faster access.

4) Multi-Key:
• Provides both sequential and direct access.
• Allows access to a data file by several different key fields.
• Example:
  • A library file which requires access by author and title.
• Index sequential file organization provides random access by one key field only.

Hashing
Internal Hashing
• In-memory hashing, typically implemented through the use of an array of records.
• The record with hash key value K is stored in record i, where i=h(K), and h is the hashing function.
• Collisions are handled using methods such as open addressing, chaining, or multiple hashing. (The original slide illustrates a chaining method.)

Hashed Files
Static External Hashing
• The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1. Typically, for efficiency reasons, a bucket corresponds to one disk block (or a fixed number of disk blocks).
• One of the fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i=h(K), and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is already full. An overflow area is kept for storing such records.
• To reduce overflow records, a hash file is typically kept 70-80% full.
• The hash function h should distribute the records uniformly among the buckets; otherwise, search time will increase because many overflow records will exist.
• Main disadvantages of static external hashing:
  - The fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
  - Ordered access on the hash key is quite inefficient (requires sorting the records).

Dynamic and Extendible Hashing Techniques
- Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records.
- Extendible hashing uses the binary representation of the hash value h(K) in order to access a directory. In extendible hashing the directory is an array of size 2^d, where d is called the global depth.
- The directories can be stored on disk, and they expand or shrink dynamically. Directory entries point to the disk blocks that contain the stored records.
- Each bucket has a local depth which is less than or equal to the global depth.
- An insertion in a disk block that is full causes the block to split into two blocks, and the records are redistributed between the two blocks. The directory is updated appropriately.
- If a bucket that overflows and is split had a local depth d' equal to the global depth d of the directory, then the size of the directory must be doubled (one extra bit) to accommodate the overflow.
- Extendible hashing does not require an overflow area.

Index Structures
- A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file.
- The index is usually specified on one field of the file (although it could be specified on several fields).
- One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value.
- The index is called an access path on the field.
- The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller.
- A binary search (or a hash) on the index yields a pointer to the file record.

Example: Given the following data file:
  EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
  record size R = 150 bytes
  block size B = 512 bytes
  r = 30000 records
Then, we get:
  blocking factor Bfr = B div R = 512 div 150 = 3 records/block
  number of file blocks b = ⌈r/Bfr⌉ = ⌈30000/3⌉ = 10000 blocks
For an index on the SSN field, assume the field size VSSN = 9 bytes and the record pointer size PR = 7 bytes. Then:
  index entry size RI = (VSSN + PR) = (9 + 7) = 16 bytes
  index blocking factor BfrI = B div RI = 512 div 16 = 32 entries/block
  number of index blocks bI = ⌈r/BfrI⌉ = ⌈30000/32⌉ = 938 blocks
  binary search on the index needs ⌈log2 bI⌉ = ⌈log2 938⌉ = 10 block accesses
This is compared to an average linear search cost on the data file of:
  (b/2) = 10000/2 = 5000 block accesses
If the file records are ordered, the binary search cost on the data file would be:
  ⌈log2 b⌉ = ⌈log2 10000⌉ = 14 block accesses

Types of Single-Level Indexes
1. Primary Index
- Defined on an ordered data file.
- The data file is ordered on a key field.
- Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor.
- A similar scheme can use the last record in a block.
- A primary index is a non-dense index: it includes an entry for each disk block of the data file rather than for every record.

2. Clustering Index
- A clustering index is defined on a data file that is ordered on a non-key field.
- Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.
- A clustering index is another example of a non-dense index: it includes an entry for each distinct value of the field rather than for every record.

3. Secondary Index
- Defined on an unordered data file.
- Can be defined on a key field or a non-key field.
- Includes one entry for each record in the data file; hence, it is called a dense index.

Multi-Level Indexes
- Because a single-level index is an ordered file, we can create a primary index to the index itself; in this case, the original index file is called the first-level index and the index to the index is called the second-level index.
- We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block.
- A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block.
- Such a multi-level index is a form of search tree; however, insertion and deletion of index entries is a severe problem because every level of the index is an ordered file.
- Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries.

Using B-Trees and B+-Trees as Dynamic Multi-level Indexes
- These data structures are variations of search trees that allow efficient insertion and deletion of search values.
- In B-tree and B+-tree data structures, each node corresponds to a disk block.
- Each node is kept between half full and completely full.
- An insertion into a node that is not full is quite efficient; if a node is full, the insertion causes a split into two nodes.
- Splitting may propagate to other tree levels.
- A deletion is quite efficient if a node does not become less than half full.
- If a deletion causes a node to become less than half full, it must be merged with neighboring nodes.
Difference between B-tree and B+-tree:
- In a B-tree, pointers to data records exist at all levels of the tree.
- In a B+-tree, all pointers to data records exist at the leaf-level nodes.
- A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree.
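
The capacity difference in the last point can be made concrete with a small calculation. This is a sketch under stated assumptions: it reuses B = 512, a 9-byte key and a 7-byte record pointer from the index example, assumes a 6-byte block (tree) pointer that is not given in these notes, and assumes every node is completely full, so the results are upper bounds. The function names are illustrative only.

```python
B = 512   # block size in bytes (from the index example)
V = 9     # search-key size in bytes (the SSN field)
PR = 7    # record-pointer size in bytes
P = 6     # block (tree) pointer size in bytes: an assumed value

def btree_order(B, V, PR, P):
    # A B-tree node holds p tree pointers, p-1 keys and p-1 record pointers:
    # p*P + (p-1)*(V+PR) <= B
    return (B + V + PR) // (P + V + PR)

def bplus_internal_order(B, V, P):
    # A B+-tree internal node holds p tree pointers and p-1 keys only:
    # p*P + (p-1)*V <= B
    return (B + V) // (P + V)

def bplus_leaf_order(B, V, PR, P):
    # A B+-tree leaf holds p_leaf (key, record pointer) pairs plus one next-leaf pointer:
    # p_leaf*(V+PR) + P <= B
    return (B - P) // (V + PR)

def btree_capacity(p, levels):
    # Every node at every level stores p-1 search values.
    nodes = sum(p ** i for i in range(levels))
    return nodes * (p - 1)

def bplus_capacity(p, p_leaf, levels):
    # Only leaves store record pointers; levels-1 internal levels fan out by p.
    return p ** (levels - 1) * p_leaf

p_b = btree_order(B, V, PR, P)
p_int = bplus_internal_order(B, V, P)
p_leaf = bplus_leaf_order(B, V, PR, P)
print("B-tree order:", p_b, "capacity with 3 levels:", btree_capacity(p_b, 3))
print("B+-tree orders:", p_int, p_leaf, "capacity with 3 levels:", bplus_capacity(p_int, p_leaf, 3))
```

Because internal B+-tree nodes carry no record pointers, more keys fit in each block, the fan-out is larger, and the same number of levels indexes more records. Real nodes are between half full and full, so actual capacities fall below these figures.
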