Topic 3: Indexing Structures for Files

Learning Objectives
• Discuss single-level indexes
  – Primary index
  – Clustering index
  – Secondary index
• Explain the need for multilevel indexing
• Discuss dynamic multilevel indexing using B-trees and B+-trees
• Explain the insertion and deletion processes for B-trees and B+-trees
• Discuss indexing using hashing techniques and buckets
• Develop a hashing table given a hashing function

Reference Book
• Fundamentals of Database Systems, Sixth Edition, Ramez Elmasri & Shamkant B. Navathe (pp. 660-697)

Basic Concepts and Introduction

Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  – E.g., the author catalog in a library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form <search-key, pointer>, as shown in the diagram.
• Index files are typically much smaller than the original file.

Indexes
• Extra access structures, or indexes, can be created on a file.
• They are stored on disk just like the main data file, and provide alternative access paths that can be quicker for certain operations on certain fields.
• Any field (or combination of fields) can be used to create an index, but the index type depends on whether the field is a key (unique) and whether the main file is sorted by it.
• There can be multiple indexes on one file.
• Indexes speed up access on the indexed field but slow down updates: almost every update on the main table must also update every index.

Sparse and Dense Indexes
– The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller.
– A binary search on the index yields a pointer to the file record.
– Indexes can also be characterized as dense or sparse.
  • A dense index has an index entry for every search key value (and hence every record) in the data file.
  • A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.

Monday, June 14, 2021

Dense Index
• In a dense index, there is an index record for every search key value in the database.
• This makes searching faster but requires more space to store the index records themselves.
• Index records contain the search key value and a pointer to the actual record on disk.

Sparse Index
• In a sparse index, index records appear for only some of the search key values in the file.

Types of Single-Level Ordered Indexes

1. Primary Indexes
• An index on the ordering key (often the primary key) of a sorted file.
• The index file is a table of <key, block_pointer> pairs.
• These are called the index entries; each one repeats the ordering key of the first record of the block it points to.
• The first record of each block is called the anchor record.
• To retrieve a record given its ordering key value, the system does a binary search in the index file to find the last index entry whose key value is ≤ the search key value, then retrieves the pointed-to block from the data file.

Example 1
• Consider an ordered file with these parameters:
    Block size (B)        1024 bytes
    Record count (r)      30,000
    Record length (R)     100 bytes, fixed size, unspanned
• Blocking factor: bfr = ⌊B/R⌋ = ⌊1024/100⌋ = 10 records per block.
• Number of blocks needed by the file: b = ⌈r/bfr⌉ = ⌈30,000/10⌉ = 3000 blocks.
• A binary search on the data file would need to access approximately ⌈log2 b⌉ = ⌈log2 3000⌉ = 12 blocks.

Example 1 Continued
• Now consider a primary index on that file with these parameters:
    Ordering key length (V)     9 bytes
    Block pointer length (P)    6 bytes
    Index entry length (Ri)     15 bytes (9 + 6)
• Calculate the following:
  i. The blocking factor of the index file, bfri
  ii. The total number of index entries, ri
  iii. The number of blocks needed by the index file, bi
  iv. The number of block accesses when doing a binary search on the index

Solution
i. The blocking factor of the index file is bfri = ⌊B/Ri⌋ = ⌊1024/15⌋ = 68 entries per block.
ii. The total number of index entries equals the number of blocks in the main file: ri = 3000.
iii. Thus the number of blocks needed by the index file is bi = ⌈ri/bfri⌉ = ⌈3000/68⌉ = 45 blocks.
iv. A binary search on the index needs about ⌈log2 bi⌉ = ⌈log2 45⌉ = 6 block accesses, plus 1 access to the data block, for 7 block accesses in total.

Problems with Primary Indexing
i. Inserting and deleting records in the main file must move other records, since the file is ordered.
ii. Some insertions and deletions must also change index entries, if the anchor records change.
iii. There can be only one primary index on a file.

Types of Single-Level Indexes
2. Clustering Index
– Defined on an ordered data file.
– The data file is ordered on a non-key field, unlike a primary index, which requires that the ordering field of the data file have a distinct value for each record.
– Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.
– It is another example of a nondense (sparse) index. Insertion and deletion are relatively straightforward when a separate block cluster is reserved for each value of the clustering field.

Example
• Suppose a company has several employees in each department. Suppose we use a clustering index, where all employees who belong to the same Dept_ID are treated as a single cluster, and index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.

FIGURE 2 Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.

Types of Single-Level Indexes
3. Secondary Index
– A secondary index provides a secondary means of accessing a file for which some primary access already exists.
– The secondary index may be on a field that is a candidate key and has a unique value in every record, or on a non-key field with duplicate values.
– The index is an ordered file with two fields:
  i. The first field is of the same data type as some non-ordering field of the data file that is an indexing field.
  ii. The second field is either a block pointer or a record pointer.
– There can be many secondary indexes (and hence, indexing fields) for the same file.
– When defined on a key field, it includes one entry for each record in the data file; hence, it is a dense index.

[Figure: Secondary index]

[Slide: Summary of single-level indexes]

Multilevel Indexes
• The previous approaches accessed records with a binary search on the index file, then direct access to the block containing the record.
• How about creating an index on the index? And an index on that?
• Why not keep going until the top-level index fits in a single block?
• Since the first-level index is ordered and has a distinct key value in every entry, the remaining levels can all be non-dense.

[Figure: Two-level index]

Multilevel Indexing
• Here the blocking factor of the index entries, bfri, is called the fan-out, fo.
• Searching the multilevel index then needs only about log_fo(bi) block accesses, where bi is the number of first-level index blocks, instead of about log2(bi) for a binary search.
• If the first level has r1 index entries, it needs b1 = ⌈r1/fo⌉ blocks, which is also r2, the number of index entries in the next level up.
• The second level needs b2 = ⌈r2/fo⌉ blocks, which is also r3. And so on.
• Eventually some level t fits in a single block (bt = 1); t is the top level.
• It can be calculated as t = ⌈log_fo(r1)⌉.

Example 2
• Recall the parameters given in Example 1; they are tabulated below.
• If the primary index is extended into a multilevel index, the index blocking factor is also the fan-out (bfri = fo = 68), and the first level has b1 = 45 blocks.
• The number of blocks at level 2 is b2 = ⌈b1/fo⌉ = ⌈45/68⌉ = 1, so two levels are enough (t = 2).
• Accessing a record by Ssn requires 3 block reads (number of levels + 1), and the index needs 45 + 1 = 46 blocks in total (about 1.5% of the size of the data file).
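The arithmetic in Examples 1 and 2 can be checked with a short script. This is only a sketch: the function name index_stats and the dictionary keys are illustrative choices, not from the textbook, and it assumes fixed-length unspanned records and a primary index with one first-level entry per data block.

```python
import math

def index_stats(block_size, record_count, record_len, key_len, ptr_len):
    """Size a primary index and its multilevel extension (unspanned blocks)."""
    bfr = block_size // record_len            # records per data block
    b = math.ceil(record_count / bfr)         # number of data file blocks
    fo = block_size // (key_len + ptr_len)    # index entries per block (fan-out)

    # Primary index: one first-level entry per data block.
    entries = b
    level_blocks = []
    while True:
        blocks = math.ceil(entries / fo)
        level_blocks.append(blocks)
        if blocks == 1:                       # top level fits in one block
            break
        entries = blocks                      # next level indexes these blocks

    return {
        "bfr": bfr,
        "data_blocks": b,
        "fan_out": fo,
        "level_blocks": level_blocks,
        # binary search directly on the data file
        "binary_search_data": math.ceil(math.log2(b)),
        # binary search on the first-level index, plus one data block read
        "binary_search_index": math.ceil(math.log2(level_blocks[0])) + 1,
        # one block per index level, plus one data block read
        "multilevel_accesses": len(level_blocks) + 1,
        "total_index_blocks": sum(level_blocks),
    }

stats = index_stats(block_size=1024, record_count=30_000, record_len=100,
                    key_len=9, ptr_len=6)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With the Example 1 parameters this reproduces bfr = 10, b = 3000, fo = 68, index levels of 45 and 1 blocks, 12 accesses for a binary search on the data file, 7 with the single-level index, and 3 with the two-level index.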
Parameters from Example 1

Data file
    Block size (B)                 1024 bytes
    Record count (r)               30,000
    Record length (R)              100 bytes, fixed size, unspanned
    Blocking factor (bfr)          10 records/block
    Number of file blocks (b)      3000 blocks

Primary index
    Ordering key length (V)        9 bytes
    Block pointer length (P)       6 bytes
    Index entry length (Ri)        15 bytes (9 + 6)
    Blocking factor (bfri)         68 entries/block
    Number of index blocks (bi)    45 blocks

Class Exercise
• Consider a disk with block size B = 1024 bytes. A block pointer is P = 9 bytes long, and a record pointer is PR = 8 bytes long. A file has r = 30,000 EMPLOYEE records of fixed length. Each record has the following fields: Name (20 bytes), Ssn (15 bytes), Department_code (10 bytes), Address (30 bytes), Phone (11 bytes), Birth_date (8 bytes), Sex (1 byte), Job_code (4 bytes), and Salary (4 bytes, real number). An additional byte is used as a deletion marker.
a) Calculate the record size R in bytes.
b) Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization.
c) Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate:
  i. The index blocking factor bfri (which is also the index fan-out fo);
  ii. The number of first-level index entries and the number of first-level index blocks;
  iii. The number of levels needed if we make it into a multilevel index;
  iv. The total number of blocks required by the multilevel index;
  v. The number of block accesses needed to search for and retrieve a record from the file, given its Ssn value, using the primary index.
d) Suppose the file is not ordered by the key field Ssn and we want to construct a secondary index on Ssn. Repeat part (c) for the secondary index and compare with the primary index.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees
• The previous multilevel index is great for searching, but inserting or deleting records has to happen on the underlying sorted file, which is costly. It may also require modifying the index file(s), which is costly as well.
• Recall the (binary) tree structure. If the tree is kept in storage and occupies more than one block, there is a risk of thrashing or heavy caching.

Terms Used in a Tree
• A tree is formed of nodes. Each node in the tree, except for a special node called the root, has one parent node and zero or more child nodes.
• The root node has no parent. A node that does not have any child nodes is called a leaf node; a nonleaf node is called an internal node.
• A subtree of a node consists of that node and all its descendant nodes: its child nodes, the child nodes of its child nodes, and so on.
• In Figure 18.7 (shown in the previous slide) the root node is A, and its child nodes are B, C, and D. Nodes E, J, C, G, H, and K are leaf nodes. Since the leaf nodes are at different levels of the tree, this tree is called unbalanced.

Search Trees and B-Trees
• Each search field has the search value (index value) and a pointer to the associated record in storage (a record pointer or a block pointer, depending on the index).
• In practice, each node is one block of storage.
• Need for balancing?

B-trees
• A B-tree is a balanced, sorted tree.
• It is used for external searching, i.e., searching data that resides on disk because the data values cannot fit into main memory.
• To reduce disk accesses, several conditions must hold:
  i. The height of the tree must be kept minimal.
  ii. There must be no empty subtrees above the leaves of the tree.
  iii. Leaves of the tree must be at the same level.
  iv. All nodes except the leaves must have some minimum number of children.

Properties of a B-tree
A B-tree of order M has the following properties:
i. Each internal node has a maximum of M children and a minimum of ⌈M/2⌉ children (the root is exempt from the minimum).
ii. Each node has at most (M − 1) keys.
iii. Keys are arranged in ascending order. All the keys in the subtree to the left of a key are smaller than the key (predecessors); all the keys in the subtree to the right are greater than the key (successors).
iv. When a key is inserted into a full node, the node is split into two nodes, and the median value is inserted in the parent node.
v. All leaves are on the same level, i.e., there is no empty subtree above the leaf nodes.

Example
• Construct a B-tree of order 5 with the following set of data: 10, 22, 9, 6, 4, 12, 40, 60, 2, 1, 8, 32, 36, 41

Class Exercise
1) Construct a B-tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6.
2) Construct a B-tree of order 4 with the following set of data: 5, 3, 21, 9, 1, 13, 2, 7, 10, 12, 4, 8.

Solution?
1) Construct a B-tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6.

B-Trees Summary
• Each node has at most m children and at most m − 1 keys.
• Nodes may leave some empty space, allowing for inserting new entries. Each node other than the root has at least ⌊(m − 1)/2⌋ keys.
• Insertion into a non-full node is very efficient.
• If the node is full, the insertion causes a split.
• Splitting the root node creates two children, with only the middle value left in the original root.
• Splitting a branch node divides it into two branch nodes and moves the middle value to the parent node.
• Splitting can propagate to other tree levels.
• Deletion from a node that is more than half full is very efficient.
• If deletion would leave the node less than half full, it must be merged with neighboring nodes. Specific approaches to deletion and merging vary.
• After many random insertions and deletions, nodes settle at about 69% full.

B+Trees
• A B+ tree is a balanced tree structure where the leaf nodes contain the data entries and the nodes above contain entries that direct the search.
  » A distinction is made in a B+ tree between the entries in leaf nodes and the entries in non-leaf nodes.
  » The entries in the leaf nodes contain pointers to the data records. The leaf level is a dense index.
• The non-leaf nodes form a sparse index.

B+Trees Compared with B-Trees:
– Only leaf nodes store record pointers.
– All search values are stored in leaf nodes. (Some may also be repeated at higher levels to direct the search.)
– Leaf nodes are usually linked into a sorted list (a sequential/linked list).

Insertion and Deletion in B+ Trees

Example
• Construct a B+ tree of order 3 from the data set 45, 77, 60, 33, 100, 14, 11, 55.

B+Trees
• B+ trees support equality and range searches, multiple-attribute keys, and partial key searches.
• They respond to dynamic changes in the table.
• Sibling pointers at the leaf level support range searches as well as equality searches; they link the leaves into a sequential set (linked list).
• E.g., SELECT * FROM employees WHERE salary BETWEEN 2000 AND 30000;

Class Exercise: Assignment 3
a) A B+ tree is of order 4. Construct the B+ tree given the following data set: 1, 4, 10, 17, 21, 31, 25, 19, 20, 28, 42.
b) Construct a B+ tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6. Then delete 5, 12, and 9 from the tree.

[Figures: B+ tree insertion; B+ tree deletion sequence]

Hashed Files
• Hashing for disk files is called external hashing.
• The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M−1. Typically, a bucket corresponds to one (or a fixed number of) disk blocks.
• One of the file fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is already full. An overflow file is kept for storing such records. Overflow records that hash to each bucket can be linked together.

Hash Indexes
• The hash index is a secondary structure to access the file by using hashing on a search key other than the one used for the primary data file organization.
• The index entries are of the type <K, Pr> or <K, P>, where Pr is a pointer to the record containing the key, or P is a pointer to the block containing the record for that key.
• Searching for an entry uses the hash search algorithm on K. Once an entry is found, the pointer Pr (or P) is used to locate the corresponding record in the data file.
• In a practical application, there may be thousands of buckets; the bucket number, which may be several bits long, would be subjected to the directory schemes.

Example
• An employee file with EmployeeID as the hash key includes records with the following EmployeeID values: 2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115, 1620, 2428, 3943, 4750, 6975, 4981, and 9208.
• The file uses eight buckets, numbered 0 to 7. Each bucket is one disk block and holds two records. Load these records into the file in the given order, using the hash function h(K) = K mod 8. Clearly show the hashing table and the buckets.

Solution
Hash function: h(K) = K mod 8.

    EmployeeID    K mod 8
    2369          1
    3760          0
    4692          4
    4871          7
    5659          3
    1821          5
    1074          2
    7115          3
    1620          4
    2428          4
    3943          7
    4750          6
    6975          7
    4981          5
    9208          0

Hashing Table (buckets 0 to 7)

    Bucket    Records       Overflow
    0         3760, 9208
    1         2369
    2         1074
    3         5659, 7115
    4         4692, 1620    2428
    5         1821, 4981
    6         4750
    7         4871, 3943    6975

Hashed Files (cont.)
There are numerous methods for collision resolution, including the following:
• Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
• Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
• Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary.
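The EmployeeID example (h(K) = K mod 8, eight buckets of two records each) can be replayed in a few lines. This is a minimal in-memory sketch, not a disk-based implementation: the function name load_buckets is hypothetical, and it assumes that records arriving at a full bucket go to a per-bucket overflow list, in the spirit of the chaining method described above.

```python
def load_buckets(keys, num_buckets=8, capacity=2):
    """Place keys into fixed-capacity buckets; extra keys go to overflow lists."""
    buckets = [[] for _ in range(num_buckets)]
    overflow = [[] for _ in range(num_buckets)]   # one overflow chain per bucket
    for key in keys:
        i = key % num_buckets                     # h(K) = K mod M
        if len(buckets[i]) < capacity:
            buckets[i].append(key)
        else:
            overflow[i].append(key)               # collision on a full bucket
    return buckets, overflow

employee_ids = [2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115,
                1620, 2428, 3943, 4750, 6975, 4981, 9208]

buckets, overflow = load_buckets(employee_ids)
for i in range(8):
    print(i, buckets[i], overflow[i])
```

Loading the keys in the given order sends 2428 to the overflow chain of bucket 4 and 6975 to the overflow chain of bucket 7; every other record fits in its home bucket.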
Employee hashing
Figure 18.15 illustrates a hash index on the Emp_id field for a file that has been stored as a sequential file ordered by Name. The Emp_id is hashed to a bucket number by using a hashing function: the sum of the digits of Emp_id modulo 10. For example, to find Emp_id 51024, the hash function results in bucket number 2; that bucket is accessed first. It contains the index entry <51024, Pr>; the pointer Pr leads us to the actual record in the file.
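The digit-sum hash index described for Figure 18.15 can be sketched as follows. The employee names and all Emp_id values other than 51024 are made up for illustration, the function names are hypothetical, and a list position stands in for the record pointer Pr.

```python
def digit_sum_hash(emp_id, num_buckets=10):
    """Bucket number = (sum of the decimal digits of emp_id) mod num_buckets."""
    return sum(int(d) for d in str(emp_id)) % num_buckets

# Data file ordered by Name (hypothetical records).
records = [("Adams", 13646), ("Baker", 51024), ("Clark", 21124)]

# Hash index on Emp_id: each bucket holds <K, Pr> entries,
# where Pr is the record's position in the data file.
index = [[] for _ in range(10)]
for pr, (_, emp_id) in enumerate(records):
    index[digit_sum_hash(emp_id)].append((emp_id, pr))

def lookup(emp_id):
    """Hash to a bucket, scan its entries, then follow Pr to the record."""
    for key, pr in index[digit_sum_hash(emp_id)]:
        if key == emp_id:
            return records[pr]
    return None

print(digit_sum_hash(51024))   # 5 + 1 + 0 + 2 + 4 = 12, and 12 mod 10 = 2
print(lookup(51024))
```

The data file stays ordered by Name; only the index is organized by the hash of Emp_id, which is what makes it a secondary access path.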