Indexed Files

Hashing is a computational technique for organizing files. Records in hashed files can be stored and retrieved quickly. However, hashed files are difficult to process in key order, which matters when you want to access all records with keys in a certain range.

Indexing is a data-structure-based technique for accessing records in a file. Indexes are auxiliary access structures that speed up the retrieval of records in response to certain search conditions. A main file of records can be supplemented by one or more indexes. Index structures provide secondary access paths: alternative ways of accessing the records without affecting their physical placement on the disk. Indexes may be part of the main file or be separate files, and may be created and destroyed as required without affecting the main file. Indexes allow efficient access to records based on the indexing fields used to construct the index. Any field of the file can be used to create an index, and multiple indexes on different fields can be constructed on the same file.

Single Level Ordered Indexes

Ordered access structures are similar to indexes in textbooks. A textbook index lists the important terms in alphabetical order, along with the page numbers where the information can be found. You can search the index to find the address (page numbers in this case), and locate the term by searching the appropriate page. The alternative is a linear search of the textbook, looking through each page one at a time.

For a file consisting of several fields, an index access structure is usually defined on a single field, called an indexing field. An index is usually composed of two components: the index field value and a list of pointers to all disk blocks that contain records with that field value. The values in the index are ordered, so we can do a binary search on the index.
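As a rough illustration of the lookup just described, the Python sketch below keeps index entries as (field value, list of block numbers) pairs sorted by field value and finds an entry with a binary search. The sample names, the block numbers and the function name lookup are assumptions made for illustration; they do not come from these notes.

```python
# Hypothetical index: entries are sorted by indexing-field value.
# Each entry pairs a field value with the numbers of the disk blocks
# that contain records with that value (sample data, for illustration).
index = [
    ("Adams", [4]),
    ("Baker", [1, 7]),
    ("Clark", [2]),
    ("Davis", [3, 5]),
    ("Evans", [6]),
]


def lookup(index, value):
    """Binary-search the ordered index; return the block list, or [] if absent."""
    lo, hi = 0, len(index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key, blocks = index[mid]
        if key == value:
            return blocks      # found the entry for this field value
        if key < value:
            lo = mid + 1       # value can only be in the upper half
        else:
            hi = mid - 1       # value can only be in the lower half
    return []                  # no entry with this field value


print(lookup(index, "Baker"))
print(lookup(index, "Brown"))
```

Each probe halves the remaining range, so the number of entries examined grows with the logarithm of the index size rather than with the size itself.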
The index file is much smaller than the data file, so the binary search is much faster.

There are different types of indexes, including:
- Primary index
- Clustering index
- Secondary index

Primary Indexes

A primary index is an ordered file whose records are of fixed length with two fields. The first field is the primary key of the main data file, and the second is a pointer to a disk block. There is one index entry in the index file for each block in the data file. Each index entry has the value of the primary key field for the first record in the block, and a pointer to the block.

[Figure: primary index on the Employee # field. Data file columns: Employee #, Name, Sex, Job, Salary, Birth Date. Index file columns: Primary Key Value, Block Pointer. Employee # values shown run from 125-777 to 820-358.]

Each entry in the index has an employee id value and a pointer; the first entry, for example, is <K(1) = (125-777), P(1) = address of block 1>. The total number of entries in the index is the same as the number of disk blocks in the ordered data file. The first record in each block of the data file is called the anchor record or the block anchor.

Indexes can be characterised as dense or sparse:
- Dense index: has an index entry for every search key value (i.e. every record) in the data file.
- Sparse (nondense) index: has index entries for only some of the search key values.

A primary index is a sparse (nondense) index because it includes an entry for each disk block of the data file rather than for every search value.

The index file for a primary index needs fewer blocks than the data file, because
a) there are fewer index entries than there are records of data, and
b) each index entry is smaller than a data record because it has only two fields, so more index entries than data records fit in one block.
This means that a binary search on the index file requires fewer block accesses than a binary search on the data file.

Block Accesses Required

Remember that a binary search on an ordered data file of b blocks requires approximately log2 b block accesses. If the primary index file contains bi blocks, then locating a record with a given search key value requires a binary search of the index (bi blocks) plus an access to the block containing the record: log2 bi + 1 block accesses.

A record whose primary key value is K is in the block whose address is P(i), where K(i) <= K < K(i+1). To retrieve a record, we do a binary search of the index file to find the appropriate entry i, then retrieve the data block whose address is P(i).

Example from Text

We have an ordered file with r = 30,000 records and block size B = 1024 bytes. Records are fixed length and unspanned, with R = 100 bytes. What is the blocking factor (bfr)? How many blocks are needed? How many block accesses are needed to locate a record?

Now assume the ordering key field of the file is V = 9 bytes long, a block pointer is P = 6 bytes long, and a primary index has been constructed for the file. What is the size of each index entry? What is the blocking factor of the index? How many index blocks are needed? How many block accesses are needed to locate a record?

Insertion and deletion of records are a problem, as with any ordered file. The problem is worse with a primary index because, when we insert a record into its correct position in the data file, not only must space be made in the blocks for the new record, but index entries may also need to change because the anchor records of some blocks will change. As seen previously, these problems can be reduced by using an unordered overflow file, or a linked list of overflow records for each block in the data file.

Clustering Indexes

A clustering index is used when records are physically ordered on a nonkey field, that is, a field that does not have a distinct value for each record. The field is called the clustering field.
A clustering index can be created to speed the retrieval of records that have the same value for the clustering field. A clustering index differs from a primary index, which requires that the ordering field of the data file has a distinct value for each record.

A clustering index is an ordered file with two fields: the first is of the same type as the clustering field of the data file, and the second is a block pointer. There is one entry in the clustering index for each distinct value of the clustering field, containing the value and a pointer to the first block in the data file that has a record with that value for its clustering field.

[Figure: clustering example in which records with different indexing field values are stored in the same block. Data file columns: Dept No., Name, Job, BirthDate, Salary. Dept No. values shown: 1 to 5.]

Insertion and deletion still cause problems because the records in the data file are still physically ordered. To reduce the problem, it is common to reserve a whole block or cluster of blocks for each value of the clustering field. All records with that value are placed in the block (or cluster). See the diagram below for an example.

[Figure: clustering example with each distinct key value stored in a separate block. Data file columns: Dept No., Name, Job, BirthDate, Salary. Dept No. values shown: 1 to 4. The blocks carry block pointers and null pointers.]

A clustering index is a nondense index, because it has an entry for every distinct value of the indexing field (which is a nonkey by definition and so has duplicate values) rather than for every record in the file.

An index is similar to the directory structures used for extendible hashing. Both are searched to find a pointer to the data block containing the desired record.
The main difference is that an index search uses the values of the search field itself, whereas a hash directory search uses the hash value calculated by applying the hash function to the search field.

Secondary Indexes

A secondary index provides a secondary means of accessing a file for which some primary access already exists. The secondary index may be on a field that is a candidate key and has a unique value in every record, or on a nonkey field with duplicate values. The index is an ordered file with two fields. The first field is of the same data type as some nonordering field of the data file that is an indexing field. The second is either a block pointer or a record pointer. There can be many secondary indexes for the same file.

Consider first a secondary index on a key field that has a distinct value for every record; such a field is sometimes called a secondary key. There is one index entry for each record in the data file, so the index is dense. The index entries are ordered by value K(i), so a binary search can be performed. Because the records in the data file are not physically ordered by the values of the secondary key field, block anchors cannot be used; that is why an index entry is created for each record in the data file rather than for each block. A block pointer points to the block containing the record with the desired key value. The appropriate block is transferred to main memory, where a search for the desired record within the block can be performed.

A secondary index needs more storage space and longer search time than a primary index, because it has a larger number of entries. However, the improvement in search time from a secondary index is greater than from a primary index, since a linear search would have to be performed if the secondary index did not exist, whereas an ordered data file can still be searched with a binary search without its primary index.

See diagram from text (14.4).

Example from Text

Consider the file from the previous example, with r = 30,000 fixed length records of size R = 100 bytes, stored on a disk with block size B = 1024 bytes.
The file has b = 3000 blocks. How many block accesses would be required to do a linear search on the file?

Suppose we construct a secondary index on a nonordering key field of the file that is V = 9 bytes long. A block pointer is P = 6 bytes long. How big is each index entry? What is the blocking factor? In a dense secondary index, the total number of index entries equals the number of records in the data file. How many blocks are needed for the index? How many block accesses are needed to do a binary search?

As you can see, there is a large improvement over the number of block accesses needed for the average linear search, but it is slightly worse than the seven block accesses required for the primary index.

Non Key Secondary Indexes

A secondary index can also be created on a nonkey field of a file. In this case, many records in the data file can have the same value for the indexing field. There are several ways an index like this can be created:
- Include several index entries with the same key value, one for each record. This is a dense index.
- Use variable length records for the index entries, with a repeating field for the pointer. A list of pointers is kept in the index entry for the key, one to each block that contains a record whose indexing field equals the key value. Binary search can be used, but must be modified.
- Keep the index entries at a fixed length, with a single entry for each index field value, but create a level of indirection to handle the multiple pointers. This is a nondense scheme.
  o The pointer P in the index entry points to a block of record pointers.
  o Each record pointer in that block points to one of the data file records with the value K for the indexing field.
  o If a value of K occurs in too many records, so that the record pointers cannot fit in a single disk block, a cluster or linked list of blocks is used.
  o Retrieval using the index requires one or more additional block accesses because of the extra level, but the algorithms for searching the index and for inserting new records are straightforward.
  o See diagram from text (14.5).

A secondary index provides a logical ordering on the records by the indexing field.
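
The questions posed in the two worked examples above can be checked with a few lines of arithmetic. The Python sketch below uses the values given in the examples (r = 30,000, B = 1024, R = 100, V = 9, P = 6). It assumes that blocking factors round down (records are unspanned), that block counts and logarithms round up, and that an average linear search reads half the blocks. The function and variable names are illustrative only.

```python
from math import ceil, log2

r, B, R = 30_000, 1024, 100   # records, block size (bytes), record size (bytes)
V, P = 9, 6                   # key field size and block pointer size (bytes)


def blocks_needed(entries, entry_size, block_size):
    """Return (blocking factor, block count) for unspanned fixed-length entries."""
    bfr = block_size // entry_size        # whole entries that fit in one block
    return bfr, ceil(entries / bfr)


# Ordered data file with no index.
bfr, b = blocks_needed(r, R, B)
data_binary = ceil(log2(b))               # binary search on the data file
linear_avg = b // 2                       # average linear search (assumed b/2)

# Primary index: one entry per data block.
Ri = V + P                                # size of one index entry
bfr_i, b_i = blocks_needed(b, Ri, B)
primary = ceil(log2(b_i)) + 1             # index search plus one data block

# Dense secondary index on a key field: one entry per record.
_, b_s = blocks_needed(r, Ri, B)
secondary = ceil(log2(b_s)) + 1           # index search plus one data block

print("data file:", bfr, b, data_binary, linear_avg)
print("primary index:", Ri, bfr_i, b_i, primary)
print("secondary index:", b_s, secondary)
```

Under these assumptions the sketch gives 12 block accesses for a binary search on the data file, 7 with the primary index and 10 with the secondary index, against an average of 1,500 for a linear search.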