Access Paths
Chapter 20

Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999

[Slide: course schedule for Aug. 2 - 6, with sessions at 9:00, 11:00, 13:30, 15:30, and 18:00. The table layout did not survive extraction; the sessions named include Intro & terminology, Reliability, Fault tolerance, Transaction models, Logging, Locking, Access paths, Queueing, Replication, and Performance.]

Types of Associative Access Paths
Primary index: Given the primary key value, find the tuple.
Secondary index: Given the value of a non-unique attribute, find all qualifying tuples.
Range index: Given the value range of some attribute, find all tuples within that range.
Structure index: Given some tuple, find all structurally related tuples (CODASYL sets, object hierarchies, etc.).

Two Important Techniques
Two basic techniques dominate in modern DBMSs:
Hashing: Use a fixed transformation algorithm to convert the attribute value into a database address.
Tree search: A dynamic search structure is built that guides the search for a given attribute value to the proper database location.
Hashing supports primary indices only. Tree search is more versatile.

The Access Gap
Accessing a tuple in the buffer costs ca. 2,000 instructions.
Accessing it on disk takes 25 ms for I/O-related activities. On a 20 MIPS machine, this translates into 500,000 instructions.
Therefore, one can spend many instructions on an algorithm that saves one I/O on average.
For access paths, the dominant cost measure is the number of different pages that need to be accessed during the search.
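
The arithmetic behind the access gap can be checked with a few lines of C. This is only a sketch; the function names are made up for illustration and are not part of the lecture material.

```c
#include <assert.h>

/* Instructions a CPU rated at `mips` million instructions per second
   could execute during an I/O that takes `io_ms` milliseconds.
   (Hypothetical helper, for illustration only.) */
long instructions_per_io(long io_ms, long mips)
{
    assert(io_ms >= 0 && mips >= 0);
    /* mips * 1,000,000 instr/s * (io_ms / 1000) s = mips * io_ms * 1000 */
    return mips * io_ms * 1000L;
}

/* Number of buffer accesses, each costing `buffer_instr` instructions,
   that fit into the time of a single disk access. */
long buffer_accesses_per_io(long io_ms, long mips, long buffer_instr)
{
    assert(buffer_instr > 0);
    return instructions_per_io(io_ms, mips) / buffer_instr;
}
```

With the slide's figures (25 ms, 20 MIPS, 2,000 instructions per buffer access), one disk access costs as much as 250 buffer accesses, which is why saving a single I/O justifies a fairly elaborate algorithm.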

Hashing
[Figure: "the magic of folding" maps the range of potential key values (the shaded areas denote used key values) onto the range of positive integers; "the magic of hashing" maps these onto the tuple address space.]
Tuples should use up the available address space as evenly as possible. The white portions indicate that space utilization will be < 100%.

Folding vs. Hashing
Folding is used to turn an attribute value of arbitrary length and arbitrary data type into an unsigned integer, the maximum length of which is determined by the instruction set.
Hashing is used to transform the result of folding into the address of a page that (probably) holds the tuple with the specified key value.

Requirement
A good hash function H has to map primary key values, which are very unevenly distributed over a large value range, into a tuple address space that is much smaller (proportional to the number of existing tuples), such that the resulting addresses are evenly distributed over the address range.

Parameters of the Hash Function
K: The (folded) key value. It varies between 0 and 2**32 - 1.
B: Number of pages to be allocated for the file. Depends on the number of tuples expected.
H: Hash function that performs the mapping (0, 2**32 - 1) -> (0, B - 1).

Consequences of the Approach
Contiguous allocation: All B pages must be allocated in physical contiguity, because the relative addresses vary between 0 and B - 1.
Fixed size: The file size must be determined at creation time, because changing the size (i.e.
changing B) means changing the hash function. This in turn requires a complete reorganization.

Requirements for a Hash Function
Hash-based allocation assumes that it is possible to estimate the number of tuples T that the relation will have, and that this estimate is not drastically exceeded.
If a block has length B, and a tuple has an average length of L bytes, then we need at least S = T / (B/L) blocks to store the T tuples.
The required number of blocks (S) is allocated before the first tuple is stored. It is a good idea to allocate some more blocks (S’ > S) to allow for unexpected growth.
Then a hash function H is defined, which takes in the value of the primary key k of the relation and converts it into a number between 1 and S’; this is the block number where the tuple is to be stored. If K is the set of possible values for the primary key, we have:
H: K -> {1, 2, …, S’}
The set of potential values for the primary key attribute will be much larger than the number of blocks allocated (think, e.g., of ISBNs for the books relation). So the hash function is a compacting function.
For each primary key value, there is exactly one block it is mapped to. Many different primary key values are mapped to the same block (n:1 relationship).

Properties of Hash Functions
A hash function must be easy to compute and must not require access to any blocks in the database.
It must be able to map a generally very large set of potential primary key values (remember: primary keys can be constructed by concatenating several attributes of a relation) into a comparatively small set of block numbers in which the tuples will be stored.
It must be able to take in primary key values of different data types (integer, binary, decimal, character, etc.)
and map them to the set of integers between 1 and S’ with equal efficiency.
The formula for estimating S based on the number of tuples, average tuple length, and block length implicitly assumes that all blocks are equally filled, i.e. that the same number of tuples is mapped to each block. This is the most difficult requirement, because the primary key values in general are not equally distributed over their value range: Some parts of the value range are used, others are not used at all, keys are generated by some regular mechanism, etc.
To achieve this “hashing” property, different methods exist: table look-up, base conversion, folding, encryption, division by prime numbers, etc.

A Popular Hash Function
A hash function that most database systems use as a default (if the user does not specify one) is defined as:
H(k) := k mod d + 1
This requires that k is a positive integer. If the primary key attribute of the relation does not have the data type integer (it could, for example, be a name), then it has to be turned into an integer first. The usual way of doing this is to “fold” the binary representation of the key value such that its length does not exceed 32 bits. Then these 32 bits are interpreted as an integer; if it is negative, it is multiplied by -1. Details of folding are omitted here.
For H to be a good hash function, d must be a large enough prime number; this is explained by detailed number-theoretic analyses. We must also make sure that the number of blocks allocated is about 25% larger than the minimum requirement.
Summing it up: We first compute S. Then we compute S’ = 1.25*S. Then we compute d = next_higher_prime(S’). H(k) will then determine the block for each tuple based on the primary key value.
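
The recipe just summarized (compute S, then S’, then d, then H(k)) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not any system's actual implementation: the key is assumed to be already folded into an unsigned 32-bit integer, next_higher_prime is read as "smallest prime not below its argument", rounding is always upward, and all function names other than next_higher_prime are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Trial-division primality test; adequate for block counts. */
static int is_prime(uint32_t n)
{
    if (n < 2) return 0;
    for (uint32_t i = 2; (uint64_t)i * i <= n; i++)
        if (n % i == 0) return 0;
    return 1;
}

/* Assumption: "next higher prime" means the smallest prime >= n. */
uint32_t next_higher_prime(uint32_t n)
{
    while (!is_prime(n)) n++;
    return n;
}

/* S = T / (B/L), rounded up: minimum number of blocks for T tuples. */
uint32_t min_blocks(uint32_t tuples, uint32_t block_len, uint32_t tuple_len)
{
    assert(tuple_len > 0 && block_len >= tuple_len);
    uint32_t per_block = block_len / tuple_len;   /* tuples per block */
    return (tuples + per_block - 1) / per_block;
}

/* d = next_higher_prime(S'), with S' = 1.25 * S rounded up. */
uint32_t choose_divisor(uint32_t s)
{
    uint32_t s_prime = s + (s + 3) / 4;           /* ceil(1.25 * s) */
    return next_higher_prime(s_prime);
}

/* H(k) = k mod d + 1; the result lies between 1 and d. */
uint32_t hash_block(uint32_t k, uint32_t d)
{
    assert(d > 0);
    return k % d + 1;
}
```

For example, 1,000 tuples of 100 bytes in 4,096-byte blocks give S = 25, S’ = 32, and d = 37, so a folded key of 100 lands in block 27. Note that H yields block numbers up to d, so d blocks have to be allocated.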

Average Number of Overflow Pages
Columns: maximum number of tuples per page (bucket). Rows: U.

U      1      2      3      5      10     20     40
.50    0.500  0.177  0.087  0.031  0.005  0.000  0.000
.60    0.750  0.293  0.158  0.066  0.015  0.002  0.000
.70    1.167  0.494  0.286  0.136  0.042  0.010  0.001
.80    2.000  0.903  0.554  0.289  0.110  0.036  0.009
.85    2.832  1.316  0.827  0.449  0.185  0.069  0.022
.90    4.495  2.146  1.377  0.777  0.345  0.144  0.055
.92    5.740  2.768  1.792  1.024  0.467  0.203  0.083
.95    9.444  4.631  3.035  1.769  0.837  0.386  0.171

Hashing for Non-Unique Attributes
Let V denote the number of different attribute values. Then we can distinguish three cases:
V ~ T: The attribute is “almost” unique; a good hash function should work in that case.
V > B: There are more values than buckets. This can be made to work, but some buckets may get much higher utilization than others.
V < B: This is the case where hashing cannot be used.

Overflow Handling
[Figure: a) external overflow handling, with buckets pointing to separate overflow pages; b) internal overflow handling, with (implicit) pointers between buckets established by the overflow function. Completely filled buckets and pages are shaded.]
In this example the overflow bucket for bucket b is (b + 3) mod B.

Hashing: Summary
For unique attributes, hashing can yield one-access retrieval.
It is critical to find a good hash function to reduce collisions.
If the original estimate of the file size is wrong, reorganization is inevitable.
Synchronization at the page level is done using standard crabbing techniques.
Hashing does not support range queries.
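
The internal overflow handling from the "Overflow Handling" slide, where the overflow bucket for bucket b is (b + 3) mod B, might look like this in C. This is a sketch under assumptions: buckets are modeled only by their fill counts, and the function name, parameters, and the rule of giving up after B probes are illustrative choices, not taken from the lecture.

```c
#include <assert.h>

#define OVERFLOW_STEP 3   /* overflow bucket for b is (b + 3) mod B */

/* Places one tuple whose home bucket is `home`.
   used[i] is the number of tuples currently stored in bucket i;
   `capacity` is the maximum number of tuples per bucket.
   Returns the bucket that received the tuple, or -1 if no bucket on
   the overflow chain has room (hypothetical convention). */
int place_tuple(int used[], int nbuckets, int capacity, int home)
{
    assert(nbuckets > 0 && home >= 0 && home < nbuckets);

    int b = home;
    for (int probes = 0; probes < nbuckets; probes++) {
        if (used[b] < capacity) {
            used[b]++;            /* room found: store the tuple here */
            return b;
        }
        b = (b + OVERFLOW_STEP) % nbuckets;   /* follow implicit pointer */
    }
    return -1;   /* the chain is full; the file needs reorganization */
}
```

With five buckets of capacity 2, where buckets 0 and 3 are full, a tuple hashed to bucket 0 is probed through bucket 3 and ends up in bucket 1. Each extra probe is an extra page access, which is what the overflow table above quantifies.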

B-Trees
B-Trees consist of two types of nodes:
Leaf nodes: They contain the data, i.e. the tuples or pointers to the tuples (TIDs).
Index nodes: They contain reference keys to direct the search towards the leaves.
The data structure looks like this:

    struct {
        char * KeyValue;
        PAGEID PointerToNextNode;
    } index_node_structure[ ];

Rules for Index Nodes
Key values are in sorted order: K0 <= K1 <= ... <= Ki <= ... <= Kf (f is the max. capacity of a node).
For any two adjacent key values Ki, Ki+1 the pointer Pi points to a node covering all values in the interval [Ki, Ki+1).
If a search for value v arrives at an index node, the next node to be visited is pointed to by Pi such that Ki <= v < Ki+1.
K0 is an arbitrary “low value” (smaller than anything else in that node).

Properties of a B-Tree
Parameter f is called the fan-out of the tree.
The number of nodes visited from the root to a leaf is called the height of the tree.
A B-tree is always perfectly balanced, i.e. the height is the same for all leaves.
Storage utilization is at least 50% for all nodes except for the root. Average storage utilization is close to 70%.

A Simple B-Tree
[Figure: a B-tree with a root node (reference key 220), an index node I (reference keys 140, 168), and leaf nodes, among them L1 (140, 151), L2 (168, 170, 190), and L3 (271, 296, 299, 303).]
The shaded parts at the beginning of the index nodes represent the low value key (K0).

Some Observations
B-trees also work for non-unique attributes; implementation optimizations will be discussed later on.
The reference keys in the index nodes can be different from all “real” key values in the leaves; they only have to guide the search correctly.
The key values at the leaf level are sorted in ascending order; this supports range queries.

Inserting Into a B-Tree
[Figure: index node I and its leaf nodes before and after a split caused by inserting key value 280. Before the split, the insert leaf L2 holds 271, 296, 299, 303; after the split, a new leaf node Ln has been added and a reference key for it has been placed in index node I.]

Growing a B-Tree
If the insert leaf is full, allocate a new node, distribute the values (the new one sorted in place) evenly across the old leaf and the new node, and move the lowest key value of the new node up to the index node.
If that index node is full, split it in the same way.
If the root has to be split: Allocate two new nodes, distribute the key values evenly over them, and put the reference key in the root.

Deleting Tuples From a B-Tree
To maintain the space utilization guarantees, a leaf that becomes under-utilized (< 50%) would have to be merged with its neighbours.
This is a very costly operation; in particular, synchronization at the page level is very complicated.
Therefore, most systems let nodes become empty and discard them when that happens. Analyses show that this does not deteriorate the overall B-tree performance.
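
The leaf split described under "Growing a B-Tree" can be sketched in C. This is a simplified illustration, assuming integer keys, a tiny leaf capacity of 4 chosen to match the key values on the insertion slide, and hypothetical type and function names; posting the returned reference key into the index node (and splitting that node if needed) is left out.

```c
#include <assert.h>
#include <string.h>

#define LEAF_CAPACITY 4   /* assumed; real leaves hold far more entries */

typedef struct {
    int nkeys;
    int keys[LEAF_CAPACITY];   /* sorted in ascending order */
} leaf_t;

/* Inserts `key` into the full leaf `old` by splitting it: the lower
   half of the values stays in `old`, the upper half moves to `fresh`.
   Returns the lowest key of `fresh`, i.e. the reference key that has
   to be moved up to the index node. */
int split_leaf_insert(leaf_t *old, leaf_t *fresh, int key)
{
    int all[LEAF_CAPACITY + 1];
    int i, j;

    assert(old->nkeys == LEAF_CAPACITY);

    /* Sort the new key into place among the existing ones. */
    for (i = 0, j = 0; i < LEAF_CAPACITY && old->keys[i] < key; i++)
        all[j++] = old->keys[i];
    all[j++] = key;
    for (; i < LEAF_CAPACITY; i++)
        all[j++] = old->keys[i];

    /* Distribute evenly: old leaf keeps the lower half (rounded up). */
    old->nkeys = (LEAF_CAPACITY + 2) / 2;
    fresh->nkeys = LEAF_CAPACITY + 1 - old->nkeys;
    memcpy(old->keys, all, (size_t)old->nkeys * sizeof(int));
    memcpy(fresh->keys, all + old->nkeys,
           (size_t)fresh->nkeys * sizeof(int));

    return fresh->keys[0];
}
```

Inserting 280 into a full leaf holding 271, 296, 299, 303 leaves 271, 280, 296 in the old leaf, puts 299, 303 into the new one, and returns 299 as the reference key for the index node.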

Non-Unique Attributes
[Figure: three ways of storing entries for a non-unique attribute.]
a) One (key value, pointer) entry per tuple: (23, P1), (23, P2), (23, P3), (28, P4), (29, P5), (29, P6).
b) One entry per key value, holding a count and a pointer list: 23: 3, P1 P2 P3; 28: 1, P4; 29: 2, P5 P6.
c) One (key value, reference) entry per key value, with the pointer lists stored separately: 23: R1 -> P1 P2 P3; 28: R2 -> P4; 29: R3 -> P5 P6.

The Basic Formula of B-Tree Performance
With
N: number of tuples,
C: average number of entries in a leaf,
F: average number of entries in an index node,
the height H of a B-tree is
$H = 1 + \log_F (N/C)$

Some Performance Figures

     Key-sequenced file (C* = 43)            Secondary index (C* = 300)
H    N(max)            increase              N(max)              increase
2    12,900            -                     90,000              -
3    3,870,000         3,857,100             27,000,000          26,910,000
4    1,161,000,000     1,157,130,000         8,100,000,000       8,073,000,000
5    348,300,000,000   347,139,000,000       2,430,000,000,000   2,421,900,000,000

Tuples in the Leaves?
Assuming a tuple is x times longer than a TID, we get the following estimate:
$1 + \log_F (N/(x \cdot C)) + 1.1 \le 1 + \log_F (N/C)$
This transforms into
$1.1 \le \log_F x$
When this holds, moving the tuples out of the leaves improves performance.

Key Compression: Suffix Compression
[Figure: a) full key values (Bertolucci, Copelletti, Gambogi, ...) stored in all nodes of the B*-tree; b) suffix compression for key values in high-level nodes (Bert, Cop, Gam, ...), with the full key values (Cooperativa, Copelletti, ...) kept at the lower level.]

Synchronization on B-Trees: What Is the Problem?
B-Trees are fully redundant structures, which can be reconstructed from the tuples; therefore, no synchronization should be required at all.
However, some queries operate on the index only. This requires all operations on B-trees to be serializable with the operations on the tuples.
Standard two-phase locking with the nodes as the objects is not feasible for performance reasons.

Protecting Tree Traversal
1. Acquire a semaphore on node Q at level i.
2. Follow the search path.
3. Acquire a semaphore on node R at level i+1.
4. Release the semaphore on Q.
This is an example of the crabbing technique.

B-Trees and Value Locks
[Figure: leaf level of a B*-tree with key values and their pointer lists: 14: P1 P2; 16: P3; 17: P4 P5 P6; 20: P7; 22: P8 P9 P10.]

Making Lock Names
To implement value locking, we need to build lock names according to the following rule:
LockN := {TableName, IndexName, KeyValue}.
KeyValue in turn is a composite:
KeyValue := {AttributeValue, TupleIdentifier}.

Key Range Locking on B-Trees
Simple retrieval: k = c
Get a semaphore on the leaf page; get an S-lock on the key range defined by the largest existing value c1 with c1 <= c; hold the lock until commit.

Key Range Locking on B-Trees
Range retrieval: c1 < k < c2
Get a semaphore on the first leaf page; get an S-lock on the key range defined by the largest existing value c3 with c3 < c1; proceed sequentially along the leaf level; request a key range S-lock for each new attribute value up to and including c2; do careful crabbing across leaf pages; hold the S-locks until commit.
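
Both retrieval cases start by finding the largest existing key value that does not exceed the search value, since that value names the key range to be locked. A minimal C sketch of this lookup within one leaf follows; it assumes integer keys sorted in ascending order, and the function name and the -1 convention are hypothetical. Semaphores and the lock manager call itself are omitted.

```c
#include <assert.h>

/* Returns the index of the largest key in keys[0..nkeys-1] that is
   <= c, i.e. the existing value c1 whose key range covers c.
   Returns -1 if every key in the leaf is larger than c; the caller
   would then have to look at the preceding leaf (not shown). */
int range_lock_key(const int keys[], int nkeys, int c)
{
    int lo = 0, hi = nkeys - 1, found = -1;

    assert(nkeys >= 0);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] <= c) {
            found = mid;      /* candidate; look for a larger one */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return found;
}
```

With the key values 14, 16, 17, 20, 22 from the value-lock figure, a retrieval for k = 18 would request its S-lock on the range starting at 17, which also blocks a concurrent insert of 18.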

Key Range Locking on B-Trees
Insert: [c, ki]
Get an X-semaphore on the leaf page; find the largest existing value c1 with c1 < c; request an instant IX-lock on c1; request a long X-lock on c.

Key Range Locking on B-Trees
Delete: [c, kd]
Get an X-semaphore on the leaf page; find the largest existing value c1 with c1 < c; request long IX-lock on c; else request long X-lock on c and c1.

B-Tree Recovery
[Figure: index page Pi and leaf page Pa before the split, and Pi, Pa, and the new page Pa' after it. Transaction T1 inserts into page Pa, causing a split. Transaction T2 inserts into page Pa'.]

B-Tree Recovery Based on Physiological Logging
Cover all B-tree operations with semaphores on all affected pages.
For each logical update, a log record with the logical UNDO operation must be moved to the log.
While the update operation is being performed, physical REDO log records are written.
After all REDO records are safely in the log, the exclusive semaphores can be released.

The Two Phases of B-Tree Recovery
Phase 1: Go forward through the log up to its current end, applying all REDO records to the tree.
Phase 2: Go backward to the begin-of-transaction record of the oldest incomplete transaction, executing the UNDO operations on the tree for all losers along the way.
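
The two phases can be illustrated with a toy model in C. It is deliberately simplified and rests on several assumptions: the "tree" is just a set of keys, every update is an insert whose logical UNDO is a delete of the same key, and the backward pass scans the whole log instead of stopping at the begin record of the oldest incomplete transaction. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TXN 8
#define MAX_KEY 64

typedef enum { LOG_BEGIN, LOG_INSERT, LOG_COMMIT } log_kind_t;

typedef struct {
    log_kind_t kind;
    int txn;   /* transaction id, 0 <= txn < MAX_TXN       */
    int key;   /* inserted key (used by LOG_INSERT only)   */
} log_rec_t;

/* Rebuilds the key set `present` from the log records. */
void recover(const log_rec_t records[], int nrecs, bool present[MAX_KEY])
{
    bool committed[MAX_TXN] = { false };
    int i;

    /* Phase 1: forward pass, REDO every logged insert and note
       which transactions reached their commit record. */
    for (i = 0; i < nrecs; i++) {
        if (records[i].kind == LOG_INSERT)
            present[records[i].key] = true;
        else if (records[i].kind == LOG_COMMIT)
            committed[records[i].txn] = true;
    }

    /* Phase 2: backward pass, UNDO the inserts of all losers
       (transactions without a commit record). */
    for (i = nrecs - 1; i >= 0; i--) {
        if (records[i].kind == LOG_INSERT && !committed[records[i].txn])
            present[records[i].key] = false;
    }
}
```

For a log in which T1 inserts key 10 and commits while T2 inserts key 20 and never commits, recovery leaves key 10 in the tree and removes key 20.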

Other Access Path Methods: Extendible Hashing
[Figure: "the magic of hashing" maps a key value to an entry in the pointer table (directory), whose entries point to the buckets (pages).]

New Techniques
Grid files: Symmetric multi-dimensional point access. Can become very unbalanced depending on correlation in the data.
R-Trees: Symmetric multi-dimensional access. Can deteriorate depending on insertion strategy.
hb-Tree: Symmetric multi-dimensional access. Can turn into a DAG depending on deletion order.