Indexed File Organization

Indexing allows access to records based on a key, on which the file is stored and accessed.
The address of a record is some function of the key.
Student ID, social security ID, citizen ID, etc. are good candidate keys for an indexed file organization.

Simple indexing maintains a separate index file in addition to the data file.
In this case, the index file is generally at most as big as the available memory, so that it can be held in memory.
The index file is handled as a sorted file (or array) of fixed-length records.
The data file is kept unsorted, possibly with variable-length records.
This method becomes cumbersome when the index file becomes too big to fit in memory and when too many updates are needed, because deleting records from and inserting records into the index file becomes costly.

Mbozyigit 1

The problem is how to find a search method that does better than binary search, and thus better than simple indexing.
Techniques such as binary search trees or AVL trees can also be used, but they need on the order of log N accesses per search. This may be considered unaffordable for a large file.
As an alternative to binary or AVL tree based processing, multilevel indexing or hashing is suggested.
o In this category, three potential file organization methods for very large files can be mentioned: ISAM, B Trees, and Hashing.
Historically, an indexing method known as ISAM (Indexed Sequential Access Method) was used by major database management system vendors such as IBM.

ISAM

ISAM (Indexed Sequential Access Method) is generally based on a cylinder index and a block index.
o The cylinder index contains the highest record key in each cylinder.
o The block index contains the highest record key in each block.
o To access a record:
First, the cylinder index is accessed, generally once for each disk, to find the cylinder on which the record is.
Second, the index block (containing pointers to the data blocks) on that cylinder is accessed to find the address of the block on which the target record is located.
Then, that block is accessed.

Accessing a single record requires one seek to the target cylinder containing the data, one r+btt for the index block, and one r+btt for the data block:
Tf = s + r + btt + r + btt

An overflow area is allocated at the end of each cylinder, to be used when needed.
When new records are added, the old records are shifted to open up space.
The record with the largest key in that block is moved to the overflow area, and a pointer to the moved record is placed in the block where the insertion took place.
New records cause the number of overflow records to grow if there is no space in the primary area.
After some time, with frequent insertions, there will be a long list of linked records in the overflow area.
Thus, ISAM performance degrades as new records are added to the overflow area, until the ISAM file has to be reorganized, which is a costly process.
This inefficiency is the reason the once important ISAM fell out of favor with database users.

B+ Trees as an Indexed File Organization Method (Due to Bayer and McCreight)

A B+ Tree is a multilevel indexing type of organization.
In a B+ Tree,
o Each node may have a variable number of children
o All the leaves are on the same level
The members of this family are referred to as B Trees, B* Trees, or B+ Trees, with some differences among them.
The variant with data in the leaves and indices in the internal nodes, the B+ Tree, seems to be the most common.
Thus, B+ Trees are probably the most common file organization method.

Properties of B+ Trees
o The order of a B+ Tree (v) is the minimum number of keys an internal node has. (Note that different authors may define order differently!)
o The exception is the root node, which has a minimum of 2 children unless it is a leaf.
o No internal node can have more than 2v keys.
o All the leaves are on the same level.
o Leaves contain data records (or the addresses of the data records, in the case of a secondary index).
Also note that a separate B+ Tree is maintained for each secondary key.
o Leaves may also contain the address of the next leaf for fast sequential access.
o An internal node with k keys has k+1 children.
o The keys in a node are sorted, such that a given key is the largest or the smallest key in the corresponding child node.

Searching a B+ Tree for a record, given its key value
Note that if there are k keys in a node (c1, …, ck), there are k+1 pointers (p0, p1, …, pk) in the same node, one for each descendant node.
o Given a key x, start from the root and do the following at each node until the corresponding record is reached at a leaf:
For i = 1, …, k: at the first i with x < ci, take p(i-1).
If x >= ck, take pk.

Timing computations
o Tf = index access time + data access time
o If index access time = s+r+btt and data access time = s+r+dtt, then Tf = 2s + 2r + btt + dtt
o Note that dtt (data transfer time) implies a cluster (or bucket) holding the data, which is generally several blocks.
This computation assumes that the rest of the B+ Tree is kept in memory, except for the level just above the leaf nodes and the leaf nodes themselves.
For most files, a B+ Tree based fetch takes at most two disk accesses.
For small files, only one access per fetch is needed; for very small files no access is required, as the leaves as well as the index nodes can all be resident in memory during the application lifetime.
Generally, it is arranged such that the B+ Tree nodes (internal and leaf) are ln2 (about 0.69) full.
For example, if the maximum number of keys is 200, the average occupancy would be about 140 (= 0.7 x 200).
Given the size of available memory, it is possible to compute the number of buckets that can be supported with only two disk accesses.
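
The search rule given earlier (take p(i-1) at the first i with x < ci, otherwise pk) can be sketched in Python as follows. This is only an illustration, not part of the original slides: the `Node` class, the `child_index` and `search` names, and the three-leaf example tree are all assumptions, and the sketch assumes that each key in an internal node is the smallest key of the child to its right.

```python
from bisect import bisect_right


class Node:
    """Hypothetical node: internal nodes have children, leaves have records."""

    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted keys c1..ck
        self.children = children    # pointers p0..pk (internal nodes only)
        self.records = records      # key -> record mapping (leaves only)


def child_index(keys, x):
    # Index of the pointer to follow: the first i with x < c_i gives
    # p_(i-1); if x >= c_k the result is k, i.e. p_k.
    return bisect_right(keys, x)


def search(root, x):
    node = root
    while node.children is not None:      # descend until a leaf is reached
        node = node.children[child_index(node.keys, x)]
    return node.records.get(x)            # None if the key is absent


# Example tree (assumed data): one root over three leaves.
leaves = [
    Node([5, 10], records={5: "rec5", 10: "rec10"}),
    Node([20, 30], records={20: "rec20", 30: "rec30"}),
    Node([40, 50], records={40: "rec40", 50: "rec50"}),
]
root = Node([20, 40], children=leaves)
print(search(root, 30), search(root, 25))
```

Using a binary search inside each node is a design choice of this sketch; a linear scan over c1, …, ck follows the rule just as well, since the cost that matters in the timing formulas is the number of blocks read, not the in-memory comparisons.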

An example for forming a B+ Tree:
o Assuming that a block holds an average of 140 keys and there are k data clusters (or blocks or buckets), the total number of internal nodes (blocks), excluding the bottom-most two levels, is equal to the sum of k/140^i for i = 2, …, log_p k, where p = 140.
o If the memory size is limited to b blocks and the target is only two accesses, then there will be three levels above the bottom-most two levels. Thus,
b = k/140^3 + k/140^2 + 1
and there will be k/140 blocks in the level above the leaves.
One can solve for k if b is given, for b if k is given, or for the blocking factor m if both b and k are given.
Note that m in the above example is 140/0.7 = 200.
The memory size is approximated as k/p^2.

B+ Trees and secondary key

B+ Trees are also appropriate for implementing secondary keys.
In this case, a separate B+ Tree is formed for each secondary key.
The difference is that the bottom-most level contains pointers to the data records, rather than the clusters themselves.

Time considerations

The time to read the whole file (exhaustive read) in the order of the primary key, ignoring in-memory processing:
Txp = b*(s+r+dtt)
where b = (1/ln2)*(n/m) and n is the number of records. This assumes that leaf nodes have links to the next leaf, but that the next leaf is not contiguously located.
The time to read the whole file in the order of the secondary key:
Txs = n*(s+r+dtt) + b*(s+r+btt)
where the first term is reading the file record by record (a record may be in any cluster) and the second term is reading the secondary key's B+ Tree, which has b blocks.
This is too slow!
Accessing the next record is fast in the primary key case:
TN = [(1/ln2)(1/m)]*(s+r+dtt)
where the first factor is the probability that the record is not in the current cluster.

B+ Tree Insertion Algorithm

Search top-down to find the leaf node in which to insert the new record.
If there is room in the leaf node, insert the record and terminate.
If there is no room in the leaf, allocate a new leaf node and split the records in the middle. Place the first half (ceiling) in the first leaf and the rest in the second leaf.
Place the smallest key value of the second (new) leaf in the immediate parent internal node.
o If the parent internal node is already full, split it into two internal nodes, each with half of the keys.
o Carry the middle key value to the next level up (the parent).
o If no parent exists while the bottom-up process continues, create a new node (the root in this case)!

Primary key case: Insertion

If the new record fits into the data block, the insertion time is the sum of the fetch and update times:
TIp = TF + 2r
If the record does not fit in the data block, a data block split is required. Weighting by the probability of this happening, the expected insertion time is:
TIp = TF + 2r + (2/m)[(s+r+dtt) + (s+r+btt) + (2/2v)(s+r+btt)]
where 2/m (= 1/(m/2)) is the probability that the data block is full, as a block has to be at least half full anyway, and 2/2v (= 1/(2v/2)) is the probability that the parent of the leaf is full.
Meaning of each term:
TF + 2r: fetch and write back the original data cluster
(s+r+dtt): write the new cluster created by the split
(s+r+btt): write the parent internal node block
(2/2v)(s+r+btt): write the split parent internal node

Notes:
(1) The minimum block occupancy is 50%, i.e., m/2. So the insertion will be in one of the positions from m/2+1 to m in the data block; this is the reason for 2/m.
(2) m is assumed to be the maximum blocking factor for the leaves, and 2v the maximum blocking factor for the internal nodes. You may choose m and 2v to be the same.

Whenever a data record is inserted in the primary B+ Tree, the secondary key B+ Tree needs to be updated as well.
Note that this time the maximum blocking factors for both leaves and internal nodes are the same, say m.
Assuming that all the internal nodes, including the parents of the leaves, are in memory,
the time to insert a secondary key is:
TIs = (s+r+btt) + 2r + (2/m)(s+r+btt)
The first term is the time to read, the second term is the time to write back after modifications, and the third term is the time required if a split is also considered.

If a data node in a primary B+ Tree is split, all the secondary indexes of this file may need to be updated.
To lessen this update problem, the secondary keys may be associated with (pinned to) the primary keys rather than the record addresses.
In this case, the secondary key B+ Tree does not have to change when the record addresses change.

Deletion of Records

When the minimum occupancy criterion is still met, deletion is no problem: just remove the related entry from the node, for both the primary and secondary key cases.
It is not a problem if a record key in a parent node no longer exists in any leaf.
If the leaf in question is at its minimum occupancy, a deletion will cause consolidation with a sibling node.
Consolidation may mean coalescence of two sibling nodes, if the sum of their entries is less than the maximum, or redistribution of the entries between adjacent siblings, if the sum is more than the maximum.
If two siblings have an equal number of entries, choose the left one.

Algorithm
1. Find the block containing the record, say X.
2. Delete the record and terminate if the block limits are OK.
3. Otherwise, if a sibling block has more than the minimum, take the sibling that exceeds the minimum the most and redistribute the entries between it and X; change the parent accordingly to record the correct key.
4. If neither sibling has more than the minimum, coalesce (combine) X with one of them and modify the parent to reflect the change.
5. If an internal node has less than the minimum after the modifications, redistribute its content with a sibling if the total is more than the maximum, and modify the parent. Coalesce the content with a sibling and modify the parent if the total is less than the maximum.
After the modification, if the parent is too sparse, repeat this step (5).

Timing considerations for the record deletion

In the most usual case, deletion does not cause a change in the tree:
TDp = TF + 2r
If the leaf falls below the limit and the deletion requires redistribution of the values with the siblings, the two adjacent siblings need to be read.
The two siblings and the original leaf will make two leaves to be written back to the disk.
The parent of the siblings (which is already in memory) needs to be modified and rewritten. Thus,
TDp = TF + 4(s+r+dtt) + s+r+btt
If one sibling is involved, then
TDp = TF + 2(s+r+dtt) + s+r+btt
approximated to
TDp = TF + 2TF + 2r
If we take into account the probability 1/(m/2) = 2/m that we have to read two siblings and write a parent and a sibling,
TDp = TF + 2(2/m)TF + 2r
For large m,
TDp = TF + 2r
For the secondary key deletion, if the probability term is ignored for a large blocking factor, we have
TDs = TF + 2r

Construction of a B+ Tree for an existing file

The most reasonable method is bottom-up B+ Tree construction:
1. Sort the file on the disk, with clusters ln2 full.
2. Read in the sorted file cluster by cluster, and enter the addresses in the parent node until it is ln2 full.
3. If the index node is ln2 full, create a new entry in the parent node if that is not full; otherwise go one level up. Note that a new root may be created if all lower level nodes are full.
4. Each time a new entry is created in a node, all lower level nodes need to be created as well.
5. The process continues until all the sorted leaves are consumed.
6. There may be sparse nodes on the rightmost side of the tree, which need to be fixed.
Note that a B+ Tree can also be constructed by successive insertions, but this would be very inefficient. Why?
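
The bottom-up construction steps above can be sketched in Python roughly as follows. This is a simplified in-memory illustration under several assumptions that are not in the original slides: nodes are plain dictionaries, the names `bulk_load`, `chunk`, `max_leaf_keys`, and `max_children` are hypothetical, the tree is built level by level rather than entry by entry, and the sparse rightmost nodes mentioned in step 6 are left unfixed.

```python
import math

FILL = math.log(2)  # target occupancy ln2, about 0.69


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def bulk_load(sorted_keys, max_leaf_keys, max_children):
    """Build a tree bottom-up from sorted keys; returns (root, height)."""
    if not sorted_keys:
        raise ValueError("cannot build a tree from an empty file")
    leaf_size = max(1, int(FILL * max_leaf_keys))   # keys per leaf
    fanout = max(2, int(FILL * max_children))       # children per index node

    # Step 1-2: pack the sorted keys into leaves that are about ln2 full.
    level = [{"leaf": True, "keys": ks, "low": ks[0], "next": None}
             for ks in chunk(sorted_keys, leaf_size)]
    # Link each leaf to the next one for sequential access.
    for left, right in zip(level, level[1:]):
        left["next"] = right

    # Steps 3-5: build index levels until a single root remains.
    height = 1
    while len(level) > 1:
        level = [{"leaf": False,
                  # separator = smallest key under each child but the first
                  "keys": [child["low"] for child in group[1:]],
                  "children": group,
                  "low": group[0]["low"]}
                 for group in chunk(level, fanout)]
        height += 1
    return level[0], height


# Example with assumed sizes: 100 keys, at most 10 keys or children per node.
root, height = bulk_load(list(range(1, 101)), max_leaf_keys=10, max_children=10)
print(height, root["keys"])
```

Because the input is already sorted, every node is written once and never split, which is the main reason this approach is cheaper than building the same tree by successive insertions.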