B-Trees

B-Trees are a special case of the tree data structure. We first review tree structures and search trees, then discuss B-Trees and, later, B+ Trees.

Tree Structure Terminology Review

A tree is formed of nodes.
Each node except the root has exactly one parent, and every node has zero or more child nodes.
The root is the only node without a parent node.
A node that does not have any child nodes is a leaf node.
A non-leaf node is an internal node.
The level of a node is one more than the level of its parent, with the root node at level 0.
A subtree of a node consists of the node and all its descendant nodes (its child nodes, their children, and so on).

[Figure: example tree with nodes A through K arranged on levels 0 to 3]

The root node is A, and its child nodes are B, C, and D. Nodes E, C, G, H, J, and K are leaf nodes.

A common way to implement a tree is to have as many pointers in a node as there are children of the node. A parent pointer can also be stored in each node. Nodes usually contain some type of stored information. When a multilevel index is implemented as a tree structure, this information includes the values of the file's indexing field that are used to guide the search for a record.

Multilevel Indexes as Special Search Trees

A multilevel index can be thought of as a variation of a search tree, a special type of tree that is used to guide the search for a record, given the value V of one of the record's fields. Each node can have as many as fo pointers and fo key values, where fo is the index fan-out (the blocking factor of the index).

The index values in each node guide us to the next node, until we reach the data file block that contains the required records. By following a pointer, the search is restricted at each level to one subtree of the search tree, and the nodes outside that subtree are ignored.
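The tree terminology reviewed above (parent, child, leaf, level, subtree) can be illustrated with a minimal Python sketch. The `Node` class and its method names are hypothetical, and the edges below level 1 are an assumption: the notes only state that A is the root with children B, C, and D, and that E, C, G, H, J, and K are leaves, so the shape built here is merely one tree consistent with those statements.

```python
class Node:
    """A tree node with a parent pointer and one pointer per child."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent      # None only for the root
        self.children = []        # empty list means this is a leaf

    def add_child(self, label):
        child = Node(label, parent=self)
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children

    def level(self):
        # The root is at level 0; every other node is one below its parent.
        return 0 if self.parent is None else self.parent.level() + 1

    def subtree(self):
        # The node itself, followed by all of its descendants.
        yield self
        for child in self.children:
            yield from child.subtree()


# Assumed shape: only "A has children B, C, D" and the leaf set come from the notes.
root = Node("A")
b, c, d = root.add_child("B"), root.add_child("C"), root.add_child("D")
e, f = b.add_child("E"), b.add_child("F")
g, h, i = d.add_child("G"), d.add_child("H"), d.add_child("I")
j = f.add_child("J")
k = i.add_child("K")

leaves = sorted(n.label for n in root.subtree() if n.is_leaf())
levels = {n.label: n.level() for n in root.subtree()}
subtree_of_d = [n.label for n in d.subtree()]

print(leaves)        # ['C', 'E', 'G', 'H', 'J', 'K']
print(levels["K"])   # 3
print(subtree_of_d)  # ['D', 'G', 'H', 'I', 'K']
```

Storing both child pointers and a parent pointer, as the notes suggest, makes it cheap to walk in either direction; here the parent pointer is what lets `level()` be computed without a traversal from the root.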
Search Trees

[Figure: a search tree node with pointers $P_1, \ldots, P_i, \ldots, P_q$ and search values $K_1, \ldots, K_{i-1}, K_i, \ldots, K_{q-1}$. The subtree under $P_1$ holds values $X < K_1$, the subtree under $P_i$ holds values $K_{i-1} < X < K_i$, and the subtree under $P_q$ holds values $K_{q-1} < X$.]

A search tree of order $p$ is a tree in which each node contains at most $p-1$ search values and $p$ pointers, in the order $\langle P_1, K_1, P_2, K_2, \ldots, P_{q-1}, K_{q-1}, P_q \rangle$, where $q \le p$; each $P_i$ is a pointer to a child node, or a null pointer; and each $K_i$ is a search value from some ordered set of values. Two constraints must hold on the search tree:

1. Within each node, the key values are ordered: $K_1 < K_2 < \ldots < K_{q-1}$.
2. For all values $X$ in the subtree pointed to by $P_i$:
   for $1 < i < q$, $K_{i-1} < X < K_i$;
   for $i = 1$, $X < K_1$;
   for $i = q$, $K_{q-1} < X$.

When searching for a value $X$, you follow the pointers $P_i$ according to the conditions above.

[Figure: example search tree; node values as extracted: 5, 3, 1, 6, 9, 7, 8, 12]

The values in the tree can be the values of one of the fields of the file, called the search field. This is the same as the index field of the file. Each key value is associated with a pointer, either to the record in the data file having that search key value, or to the block containing that record.

The tree in the first diagram is not balanced, meaning that leaf nodes can be found at different levels. This is not an efficient organization, because some nodes may be at very deep levels, requiring many block accesses to reach. The B-Tree addresses this problem by specifying additional constraints.

B-Trees

The B-Tree has additional constraints that ensure the tree is always balanced and that the space wasted by deletion never becomes excessive.

[Figure: a B-Tree node with tree pointers $P_1, P_2, \ldots, P_i, \ldots, P_q$, search values $K_1, \ldots, K_{i-1}, K_i, \ldots, K_{q-1}$, and data pointers $Pr_1, \ldots, Pr_{i-1}, Pr_i, \ldots, Pr_{q-1}$. The subtree under $P_1$ holds values $X < K_1$, the subtree under $P_i$ holds values $K_{i-1} < X < K_i$, and the subtree under $P_q$ holds values $K_{q-1} < X$.]

The formal definition of a B-Tree of order $p$, when used as an access structure on a key field to search for a record, is as follows:

1. Each internal node in the tree is of the form $\langle P_1, \langle K_1, Pr_1 \rangle, P_2, \langle K_2, Pr_2 \rangle, \ldots, \langle K_{q-1}, Pr_{q-1} \rangle, P_q \rangle$, where $q \le p$. Each $P_i$ is a tree pointer, a pointer to another node in the tree, and each $Pr_i$ is a data pointer, a pointer to the record whose search key field value is equal to $K_i$.
2. The key values are ordered within each node: $K_1 < K_2 < \ldots < K_{q-1}$.
3. For all search key values $X$ in the subtree pointed at by $P_i$ (the $i$th subtree), we have:
   for $1 < i < q$, $K_{i-1} < X < K_i$;
   for $i = 1$, $X < K_i$;
   for $i = q$, $K_{i-1} < X$.
4. Each node has at most $p$ tree pointers.
5. Each node except the root has at least $\lceil p/2 \rceil$ tree pointers. The root node has at least two tree pointers unless it is the only node in the tree.
6. A node with $q$ tree pointers, $q \le p$, has $q-1$ search key field values and hence $q-1$ data pointers.
7. All leaf nodes are at the same level. Leaf nodes have the same structure as internal nodes except that all of their tree pointers $P_i$ are null.

[Figure: example B-Tree of order 3; node values as extracted: 1, 3, 5, 8, 6, 7, 9, 12]

The above example shows a B-Tree of order 3. The example assumes the B-Tree access structure is on a key field, so the values are unique. If the B-Tree is used on a non-key field, each data pointer would instead point to a block, or cluster of blocks, containing pointers to the file records, similar to option 3 for secondary indexes.

A B-Tree starts with a single root node, which is also a leaf node, at level 0. Once the root node is full with $p-1$ search key values and another entry is inserted, the root node splits evenly into two nodes at level 1. Only the middle value is kept in the root.

When a non-root node is full and a new entry is inserted into it, the node is split into two nodes at the same level, and the middle entry is moved to the parent node along with two pointers to the split nodes. If the parent node is full, it is also split. Splitting can propagate all the way to the root, creating a new level if the root is split.

If deletion of a value causes a node to be less than half full, it is combined with its neighboring nodes. This can also propagate all the way to the root.

After numerous random insertions and deletions, the nodes are approximately 69 percent full when the number of values in the tree stabilizes. Once this happens, node splitting and combining occur only rarely.

Insert the following values into a B-Tree: 5, 7, 4, 6, 8, 14, 2

Calculating the order p of a B-Tree stored on disk

Example 4 from Text: Suppose the search field is $V = 9$ bytes long, the disk block size is $B = 512$ bytes, a record pointer is $Pr = 7$ bytes, and a block pointer is $P = 6$ bytes. Each B-Tree node can have at most $p$ tree pointers, $p-1$ data pointers, and $p-1$ search key field values. These must fit into a single disk block if each B-Tree node is to correspond to a disk block. To calculate $p$:

$6p + (p-1)(9 + 7) \le 512$
$6p + 9p + 7p - 9 - 7 \le 512$
$22p - 16 \le 512$
$22p \le 528$
$p \le 24$

Although $p$ can be at most 24, we choose $p = 23$, because B-Tree nodes may contain additional information used to manipulate the tree, such as the number of entries $q$ in the node and a pointer to the parent. In general, the block size should first be reduced by the amount of extra space needed before $p$ is calculated as above.

Example 5 from Text: Suppose that the search field of Example 4 is a nonordering key field, and we construct a B-Tree on this field. Assume that each node of the B-Tree is 69 percent full. Each node will then have, on average, $p \times 0.69 = 23 \times 0.69$, or approximately 16, pointers and hence 15 search key values. The average fan-out is fo = 16. The number of nodes, entries, and pointers at each level is then:

Level      Nodes    Entries    Pointers
Root           1         15          16
Level 1       16        240         256
Level 2      256      3,840       4,096
Level 3    4,096     61,440

At each level, the number of entries is calculated by multiplying the total number of pointers at the previous level by 15, the average number of entries in each node. Hence, for the given block size, pointer size, and search key field size, a two-level B-Tree holds 3840 + 240 + 15 = 4095 entries on average. A three-level B-Tree holds 65,535 entries on average.
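The arithmetic of Examples 4 and 5 can be checked with a short Python sketch. The function names `max_order` and `average_capacity` and the `overhead` parameter are hypothetical; the 8-byte overhead used below is an assumed figure chosen only to illustrate how reserving space for bookkeeping lowers the order, and does not come from the text.

```python
def max_order(block_size, key_size, record_ptr_size, block_ptr_size, overhead=0):
    """Largest p satisfying p*P + (p-1)*(V + Pr) <= B - overhead."""
    usable = block_size - overhead
    entry = key_size + record_ptr_size           # one <K, Pr> pair
    # Rearranged: p * (P + V + Pr) <= usable + (V + Pr)
    return (usable + entry) // (block_ptr_size + entry)


def average_capacity(order, fill, levels):
    """Per-level (level, nodes, entries, pointers), assuming every node
    holds the same average number of pointers (order * fill, rounded)."""
    fan_out = round(order * fill)                # 23 * 0.69 = 15.87 -> 16
    keys_per_node = fan_out - 1                  # q pointers -> q-1 keys
    rows = []
    nodes = 1                                    # the root
    for level in range(levels):
        entries = nodes * keys_per_node
        pointers = nodes * fan_out
        rows.append((level, nodes, entries, pointers))
        nodes = pointers                         # each pointer leads to one child
    return rows


p_max = max_order(512, 9, 7, 6)                  # Example 4 values
p_with_overhead = max_order(512, 9, 7, 6, overhead=8)  # assumed 8 bytes of bookkeeping
rows = average_capacity(23, 0.69, 4)             # root plus levels 1 to 3

print(p_max)                                     # 24
print(p_with_overhead)                           # 23
for level, nodes, entries, pointers in rows:
    print(level, nodes, entries, pointers)
print(sum(r[2] for r in rows[:3]))               # 15 + 240 + 3840 = 4095
print(sum(r[2] for r in rows))                   # 65535
```

The leaf level's pointer count is still computed by the sketch, although in a real B-Tree the tree pointers of leaf nodes are null.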