Reading and Review Chapter 12: Indexing and Hashing – 12.1 Basic Concepts – 12.2 Ordered Indices – 12.3 B+-tree Index Files • lots of stuff – 12.4 B-tree Index Files – 12.5 Static Hashing • lots of stuff – 12.6 Dynamic Hashing (Extendable Hashing) • lots of stuff – 12.7 Comparison of Ordered Indexing and Hashing – 12.9 Multiple-Key Access • Multiple-key indices, Grid files, Bitmap Indices – 12.10 Summary Reading and Review Chapter 13: Query Processing – – – – – 13.1 Overview 13.2 Measures of Query Cost 13.3 Selection Operation 13.4 Sorting (Sort-Merge Algorithm) 13.5 Join Operation • • • • Nested-loop Join (regular, block, indexed) Merge Join Hash Join Complex Joins – 13.6 Other Operations – 13.7 Evaluation of Expressions – 13.8 Summary Reading and Review Chapter 14: Query Optimization – 14.1 Overview – 14.2 Estimating Statistics of Expression Results • • • • • Catalog information Selection size estimation Join size estimation Size estimation for other operations Estimation of number of distinct values – 14.3 Transformation of Relational Expressions • Equivalence rules and examples • Join ordering, enumeration of equivalent expressions – 14.4 Choice of Evaluation Plans – from 14.4 on, ignore any section with a “**” at the end of the section title – 14.6 Summary Reading and Review Chapter 15: Transaction Management – 15.1 Transaction Concept • ACID properties – 15.2 Transaction State • state diagram of a transaction – (15.3 Implementation of Atomicity and Durability) • not important -- we covered this material better in section 17.4 – 15.4 Concurrent Execution and Scheduling – 15.5 Serializability (Conflict Serializability) • (ignore 15.5.2 View Serializability) – – – – 15.6 Recoverability (15.7 Isolation -- not important, see 16.1 on locking instead) 15.9 Testing for Serializability (Precedence Graphs) 15.10 Summary Reading and Review Chapters 16 and 17 – 16.1: Lock-Based Protocols • • • • granting locks deadlocks two-phase locking protocol ignore 16.1.4 and 16.1.5 – 16.6.3 Deadlock Detection and Recovery • Wait-for Graph – 17.1, 17.2.1 -- skim these sections. Material should be familiar to you as background, but I won’t be testing on definitions or memorization of it – 17.4: Log-Based Recovery • logs, redo/undo, basic concepts • checkpoints Indexing and Hashing: Motivation • Query response speed is a major issue in database design • Some queries only need to access a very small proportion of the records in a database – “Find all accounts at the Perryridge bank branch” – “Find the balance of account number A-101” – “Find all accounts owned by Zephraim Cochrane” • Checking every single record for the queries above is very inefficient and slow. • To allow fast access for those sorts of queries, we create additional structures that we associate with files: indices (index files). Basic Concepts • Indexing methods are used to speed up access to desired data – e.g. Author card catalog in a library • Search key -- an attribute or set of attributes used to look up records in a file. This use of the word key differs from that used before in class. 
• An index file consists of records (index entries) of the form: (search-key, pointer) where the pointer is a link to a location in the original file • Index files are typically much smaller than the original file • Two basic types of index: – ordered indices: search keys are stored in sorted order – hash indices: search keys are distributed uniformly across “buckets” using a “hash function” Index Evaluation Metrics • We will look at a number of techniques for both ordered indexing and hashing. No one technique is best in all circumstances -- each has its advantages and disadvantages. The factors that can be used to evaluate different indices are: • Access types supported efficiently. – Finding records with a specified attribute value – Finding records with attribute values within a specified range of values • • • • • Access (retrieval) time. Finding a single desired tuple in the file. Insertion time Deletion time Update time Space overhead for the index Index Evaluation Metrics (2) • Speed issues are – – – – Access time Insertion time Deletion time Update time • Access is the operation that occurs the most frequently, because it is also used for insert, delete, update • For insert, delete, and update operations we must consider not only the time for the operation itself (inserting a new record into the file) but also any time required to update the index structure to reflect changes. • We will often want to have more than one index for a file – e.g., card catalogs for author, subject, and title in a library Ordered Indices – An ordered index stores the values of the search keys in sorted order – records in the original file may themselves be stored in some sorted order (or not) – the original file may have several indices, on different search keys – when there are multiple indices, a primary index is an index whose search key also determines the sort order of the original file. Primary indices are also called clustering indices – secondary indices are indices whose search key specifies an order different from the sequential order of the file. Also called nonclustering indices – an index-sequential file is an ordered sequential file with a primary index. 
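As a small illustration of the (search-key, pointer) index entries described above, here is a minimal Python sketch of lookup in an ordered (dense) index kept as a sorted list; the keys, pointer values, and file layout are illustrative, not taken from the text.

  import bisect

  # a tiny ordered (dense) index: (search-key, pointer) entries kept in sorted key order;
  # here the "pointer" is just a block number in the original file
  index_entries = [(101, 0), (205, 1), (333, 1), (412, 2)]
  keys = [k for k, _ in index_entries]

  def index_lookup(search_key):
      """Return the pointer for search_key, or None if it is absent (dense index)."""
      i = bisect.bisect_left(keys, search_key)
      if i < len(keys) and keys[i] == search_key:
          return index_entries[i][1]
      return None

  print(index_lookup(333))  # -> 1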
Dense and Sparse Indices • A dense index is where an index record appears for every search-key value in the file • A sparse index contains index records for only some search-key values – applicable when records are sequentially ordered on the search key – to locate a record with search-key value K we must: • find index record with largest search-key value <=K • search file sequentially starting at the location pointed to • stop (fail) when we hit a record with search-key value >K – less space and less maintenance overhead for insert, delete – generally slower than dense index for locating records Example of a Dense Index Example of a Sparse Index Problems with Index-Sequential Files • Retrieve: search until the key value or a larger key value is found – individual key access BAD – scan the file in order of the key GOOD • Insert is hard -- all higher key records must be shifted to place the new record • Delete may leave holes • Update is equivalent to a combined delete and insert – updating a search key field may cause the combined disadvantages of an insert and a delete by shifting the record’s location in the sorted file • differential files are often used to hold the recent updates until the database can be reorganized off-line Multi-level Index • • • • If the primary index does not fit in memory, access becomes expensive (disk reads cost) To reduce the number of disk accesses to index records, we treat the primary index kept on disk as a sequential file and construct a sparse index on it If the outer index is still too large, we can create another level of index to index it, and so on Indices at all levels must be updated on insertion or deletion from the file Index Update: Deletion • When deletions occur in the primary file, the index will sometimes need to change • If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also • Single-level index deletion: – dense indices -- deletion of search-key is similar to record deletion – sparse indices -- if an entry for the search key exists in the index, it is replaced by the next search key value in the original file (taken in search-key order) – Multi-level index deletion is a simple extension of the above Index Update: Insertion • As with deletions, when insertions occur in the primary file, the index will sometimes need to change • Single-level index insertion: – first perform a lookup using the search-key value appearing in the record to be inserted – dense indices -- if the search-key value does not appear in the index, insert it – sparse indices -- depends upon the design of the sparse index. If the index is designed to store an entry for each block of the file, then no change is necessary to the index unless the insertion creates a new block (if the old block overflows). If that happens, the first search-key value in the new block is inserted in the index – multi-level insertion is a simple extension of the above Secondary Index Motivation • Good database design is to have an index to handle all common (frequently requested) queries. • Queries based upon values of the primary key can use the primary index • Queries based upon values of other attributes will require other (secondary) indices. Secondary Indices vs. Primary Indices • Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every record in the file. A primary index may be dense or sparse. (why?) 
• A secondary index on a candidate key looks just like a dense primary index, except that the records pointed to by successive values of the index are not sequential • when a primary index is not on a candidate key it suffices if the index points to the first record with a particular value for the search key, as subsequent records can be found with a sequential scan from that point Secondary Indices vs. Primary Indices (2) • When the search key of a secondary index is not a candidate key (I.e., there may be more than one tuple in the relation with a given search-key value) it isn’t enough to just point to the first record with that value -- other records with that value may be scattered throughout the file. A secondary index must contain pointers to all the records – so an index record points to a bucket that contains pointers to every record in the file with that particular search-key value Secondary Index on balance Primary and Secondary Indices • As mentioned earlier: – secondary indices must be dense – Indices offer substantial benefits when searching for records – When a file is modified, every index on the file must be updated • this overhead imposes limits on the number of indices – relatively static files will reasonably permit more indices – relatively dynamic files make index maintenance quite expensive • Sequential scan using primary index is efficient; sequential scan on secondary indices is very expensive – each record access may fetch a new block from disk (oy!) Disadvantages of Index-Sequential Files • The main disadvantage of the index-sequential file organization is that performance degrades as the file grows, both for index lookups and for sequential scans through the data. Although this degradation can be remedied through file reorganization, frequent reorgs are expensive and therefore undesirable. • Next we’ll start examining index structures that maintain their efficiency despite insertion and deletion of data: the B+-tree (section 12.3) B+- Tree Index Files • Main disadvantage of ISAM files is that performance degrades as the file grows, creating many overflow blocks and the need for periodic reorganization of the entire file • B+- trees are an alternative to indexed-sequential files – used for both primary and secondary indexing – B+- trees are a multi-level index • B+- tree index files automatically reorganize themselves with small local changes on insertion and deletion. – No reorg of entire file is required to maintain performance – disadvantages: extra insertion, deletion, and space overhead – advantages outweigh disadvantages. B+-trees are used extensively B+- Tree Index Files (2) Definition: A B+-tree of order n has: • • • • • All leaves at the same level balanced tree (“B” in the name stands for “balanced”) logarithmic performance root has between 1 and n-1 keys all other nodes have between n/2 and n-1 keys (>= 50% space utilization) • we construct the tree with order n such that one node corresponds to one disk block I/O (in other words, each disk page read brings up one full tree node). 
B+- Tree Index Files (3) A B+-tree is a rooted tree satisfying the following properties: • All paths from the root to a leaf are the same length • Search for an index value takes time according to the height of the tree (whether successful or unsuccessful) B+- Tree Node Structure • The B+-tree is constructed so that each node (when full) fits on a single disk page – parameters: B: size of a block in bytes (e.g., 4096) K: size of the key in bytes (e.g., 8) P: size of a pointer in bytes (e.g., 4) – an internal node must have n such that: (n-1)*K + n*P <= B, so n <= (B+K)/(K+P) – with the example values above, this becomes n <= (4096+8)/(8+4) = 4104/12, so n <= 342 B+- Tree Node Structure (2) • Typical B+-tree Node: Ki are the search-key values, Pi are the pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes) • the search keys in a node are ordered: K1 < K2 < K3 < … < Kn-1 Non-Leaf Nodes in B+-Trees • Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with n pointers: – all the search keys in the subtree to which P1 points are less than K1 – for 2 <= i <= n-1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki-1 and less than Ki Leaf Nodes in B+-Trees • As mentioned last class, primary indices may be sparse indices. So B+-trees constructed on a primary key (that is, where the search key order corresponds to the sort order of the original file) can have the pointers of their leaf nodes point to an appropriate position in the original file that represents the first occurrence of that key value. • Secondary indices must be dense indices. B+-trees constructed as a secondary index must have the pointers of their leaf nodes point to a bucket storing all locations where a given search key value occurs; this set of buckets is often called an occurrence file Example of a B+-tree • B+-tree for the account file (n=3) Another Example of a B+-tree • B+-tree for the account file (n=5) • Leaf nodes must have between 2 and 4 values ((n-1)/2 and (n-1), with n=5) • Non-leaf nodes other than the root must have between 3 and 5 children (n/2 and n, with n=5) • Root must have at least 2 children Observations about B+-trees • Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close • The non-leaf levels of the B+-tree form a hierarchy of sparse indices • The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), thus searches can be conducted efficiently • Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall examine later in class) Queries on B+-trees • Find all records with a search-key value of k – start with the root node (assume it has m pointers) • examine the node for the smallest search-key value > k • if we find such a value, say at Kj, follow the pointer Pj to its child node • if no such value exists, then k >= Km-1, so follow Pm – if the node reached is not a leaf node, repeat the procedure above and follow the corresponding pointer – eventually we reach a leaf node. If we find a matching key value (our search value k = Ki for some i) then we follow Pi to the desired record or bucket. If we find no matching value, the search is unsuccessful and we are done.
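As a concrete illustration of the lookup procedure just described, here is a minimal Python sketch. It assumes a simplified in-memory node (sorted keys, one more child pointer than keys for internal nodes, and per-key record/bucket pointers for leaves), not a disk-based implementation; the names are illustrative.

  class BPlusNode:
      def __init__(self, keys, pointers, is_leaf):
          self.keys = keys          # K1 < K2 < ... sorted search-key values
          self.pointers = pointers  # children (internal) or record/bucket pointers (leaf)
          self.is_leaf = is_leaf

  def bplus_lookup(root, k):
      node = root
      while not node.is_leaf:
          # find the smallest search-key value > k and follow its pointer;
          # if no such value exists, k >= Km-1, so follow the last pointer Pm
          for i, key in enumerate(node.keys):
              if key > k:
                  node = node.pointers[i]
                  break
          else:
              node = node.pointers[len(node.keys)]
      # at the leaf: if k matches some Ki, follow Pi to the record or bucket
      for i, key in enumerate(node.keys):
          if key == k:
              return node.pointers[i]
      return None  # unsuccessful search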
Queries on B+-trees (2) • Processing a query traces a path from the root node to a leaf node • If there are K search-key values in the file, the path is no longer than log⌈n/2⌉(K) • A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry) • With 1 million search key values and n=100, at most log50(1,000,000) = 4 nodes are accessed in a lookup • In a balanced binary tree with 1 million search key values, around 20 nodes are accessed in a lookup – the difference is significant since every node access might need a disk I/O, costing around 20 milliseconds Insertion on B+-trees • Find the leaf node in which the search-key value would appear • If the search key value is already present, add the record to the main file and (if necessary) add a pointer to the record to the appropriate occurrence file bucket • If the search-key value is not there, add the record to the main file as above (including creating a new occurrence file bucket if necessary). Then: – if there is room in the leaf node, insert (key-value, pointer) in the leaf node – otherwise, overflow. Split the leaf node (along with the new entry) Insertion on B+-trees (2) • Splitting a node: – take the n (search-key-value, pointer) pairs, including the one being inserted, in sorted order. Place half in the original node, and the other half in a new node. – Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. – If the parent becomes full by this new insertion, split it as described above, and propagate the split as far up as necessary • The splitting of nodes proceeds upwards until a node that is not full is found. In the worst case the root node may be split, increasing the height of the tree by 1. Insertion on B+-trees Example Deletion on B+-trees • Find the record to be deleted, and remove it from the main file and the bucket (if necessary) • If there is no occurrence-file bucket, or if the deletion caused the bucket to become empty, then delete (key-value, pointer) from the B+-tree leaf node • If the leaf node now has too few entries, underflow has occurred. If the active leaf node has a sibling with few enough entries that the combined entries can fit in a single node, then – combine all the entries of both nodes in a single one – delete the (K,P) pair pointing to the deleted node from the parent. Follow this procedure recursively if the parent node underflows. Deletion on B+-trees (2) • Otherwise, if no sibling node is small enough to combine with the active node without causing overflow, then: – Redistribute the pointers between the active node and the sibling so that both of them have sufficient pointers to avoid underflow – Update the corresponding search key value in the parent node – No deletion occurs in the parent node, so no further recursion is necessary in this case. • Deletions may cascade upwards until a node with n/2 or more pointers is found. If the root node has only one pointer after deletion, it is removed and the sole child becomes the root (reducing the height of the tree by 1) Deletion on B+-trees Example 1 Deletion on B+-trees Example 2 Deletion on B+-trees Example 3 B+-tree File Organization • B+-Tree Indices solve the problem of index file degradation. The original data file will still degrade upon a stream of insert/delete operations.
• Solve data-file degradation by using a B+-tree file organization • Leaf nodes in a B+-tree file organization store records, not pointers into a separate original data file – since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a non-leaf node – leaf nodes must still be maintained at least half full – insert and delete are handled in the same way as insert and delete for entries in a B+-tree index B+-tree File Organization Example • Records are much bigger than pointers, so good space usage is important • To improve space usage, involve more sibling nodes in redistribution during splits and merges (to avoid split/merge when possible) – involving one sibling guarantees 50% space use – involving two guarantees at least 2/3 space use, etc. B-tree Index Files • B-trees are similar to B+-trees, but search-key values appear only once in the index (eliminates redundant storage of key values) – search keys in non-leaf nodes don’t appear in the leaf nodes, so an additional pointer field for each search key in a non-leaf node must be stored to point to the bucket or record for that key value – leaf nodes look like B+-tree leaf nodes: (P1, K1, P2, K2, …, Pn) – non-leaf nodes look like so: (P1, B1, K1, P2, B2, K2, …, Pn) where the Bi are pointers to buckets or file records. B-tree Index File Example B-tree and B+-tree B-tree Index Files (cont.) • Advantages of B-tree Indices (vs. B+-trees) – May use fewer tree nodes than a B+-tree on the same data – Sometimes possible to find a specific key value before reaching a leaf node • Disadvantages of B-tree Indices – Only a small fraction of key values are found early – Non-leaf nodes are larger, so fanout is reduced, and B-trees may be slightly taller than B+-trees on the same data – Insertion and deletion are more complicated than on B+-trees – Implementation is more difficult than B+-trees • In general, advantages don’t outweigh disadvantages Hashing • We’ve examined Ordered Indices (design based upon sorting or ordering search key values); the other major type of indexing technique is Hashing • Underlying concept is very simple: – observation: small files don’t require indices or complicated search methods – use some clever method, based upon the search key, to split a large file into a lot of little buckets – each bucket is sufficiently small – use the same method to find the bucket for a given search key Hashing Basics – A bucket is a unit of storage containing one or more records (typically a bucket is one disk block in size) – In a hash file organization we find the bucket for a record directly from its search-key value using a hash function – A hash function is a function that maps from the set of all search-key values K to the set of all bucket addresses B – The hash function is used to locate records for access, insertion, and deletion – Records with different search-key values may be mapped to the same bucket • the entire bucket must be searched to find a record • buckets are designed to be small, so this task is usually not onerous Hashed File Example – So we: • divide the set of disk blocks that make up the file into buckets • devise a hash function that maps each key value into a bucket V: set of key values B: number of buckets H: hashing function H: V--> (0, 1, 2, 3, …, B-1) Example: V= 9 digit SS#; B=1000; H= key modulo 1000 Hash Functions • To search/insert/delete/modify a key do: – compute H(k) to get the bucket number – search sequentially in the bucket
(heap organization within each bucket) • Choosing H: almost any function that generates “random” numbers in the range [0, B-1] – try to distribute the keys evenly into the B buckets – one rule of thumb when using MOD -- use a prime number Hash Functions (2) • Collision is when two or more key values go to the same bucket – too many collisions increases search time and degrades performance – no or few collisions means that each bucket has only one (or very few) key(s) • Worst-case hash functions map all search keys to the same bucket Hash Functions (3) • Ideal hash functions are uniform – each bucket is assigned the same number of search-key values from the set of all possible values • Ideal hash functions are random – each bucket has approximately the same number of records assigned to it irrespective of the actual distribution of search-key values in the file • Finding a good hash function is not always easy Examples of Hash Functions • Given 26 buckets and a string-valued search key, consider the following possible hash functions: – – – – – Hash based upon the first letter of the string Hash based upon the last letter of the string Hash based upon the middle letter of the string Hash based upon the most common letter in the string Hash based upon the “average” letter in the string: the sum of the letters (using A=0, B=1, etc) divided by the number of letters – Hash based upon the length of the string (modulo 26) • Typical hash functions perform computation on the internal binary representation of the search key – example: searching on a string value, hash based upon the binary sum of the characters in the string, modulo the number of buckets Overflow • Overflow is when an insertion into a bucket can’t occur because it is full. • Overflow can occur for the following reasons: – too many records (not enough buckets) – poor hash function – skewed data: • multiple records might have the same search key • multiple search keys might be assigned the same bucket Overflow (2) • Overflow is handled by one of two methods – chaining of multiple blocks in a bucket, by attaching a number of overflow buckets together in a linked list – double hashing: use a second hash function to find another (hopefully non-full) bucket – in theory we could use the next bucket that had space; this is often called open hashing or linear probing. This is often used to construct symbol tables for compilers • useful where deletion does not occur • deletion is very awkward with linear probing, so it isn’t useful in most database applications Hashed File Performance Metrics • An important performance measure is the loading factor (number of records)/(B*f) B is the number of buckets f is the number of records that will fit in a single bucket • when loading factor too high (file becomes too full), double the number of buckets and rehash Hashed File Performance • • • • • (Assume that the hash table is in main memory) Successful search: best case 1 block; worst case every chained bucket; average case half of worst case Unsuccessful search: always hits every chained bucket (best case, worst case, average case) With loading factor around 90% and a good hashing function, average is about 1.2 blocks Advantage of hashing: very fast for exact queries Disadvantage: records are not sorted in any order. 
As a result, it is effectively impossible to do range queries Hash Indices • Hashing can be used for index-structure creation as well as for file organization • A hash index organizes the search keys (and their record pointers) into a hash file structure • strictly speaking, a hash index is always a secondary index – if the primary file was stored using the same hash function, an additional, separate primary hash index would be unnecessary – We use the term hash index to refer both to secondary hash indices and to files organized using hashing file structures Example of a Hash Index Hash index into file account, on search key account-number; Hash function computes sum of digits in account number modulo 7. Bucket size is 2 Static Hashing • We’ve been discussing static hashing: the hash function maps search-key values to a fixed set of buckets. This has some disadvantages: – databases grow with time. Once buckets start to overflow, performance will degrade – if we attempt to anticipate some future file size and allocate sufficient buckets for that expected size when we build the database initially, we will waste lots of space – if the database ever shrinks, space will be wasted – periodic reorganization avoids these problems, but is very expensive • By using techniques that allow us to modify the number of buckets dynamically (“dynamic hashing”) we can avoid these problems – Good for databases that grow and shrink in size – Allows the hash function to be modified dynamically Dynamic Hashing • One form of dynamic hashing is extendable hashing – hash function generates values over a large range -- typically b-bit integers, with b being something like 32 – At any given moment, only a prefix of the hash value is used to index into a table of bucket addresses – With the prefix length at a given moment being i, with 0 <= i <= 32, the bucket address table size is 2^i – The value of i grows and shrinks as the size of the database grows and shrinks – Multiple entries in the bucket address table may point to a bucket – Thus the actual number of buckets is <= 2^i – the number of buckets also changes dynamically due to coalescing and splitting of buckets General Extendable Hash Structure Use of Extendable Hash Structure • Each bucket j stores a value ij; all the entries that point to the same bucket have the same values on the first ij bits • To locate the bucket containing search key Kj: – compute H(Kj) = X – Use the first i high-order bits of X as a displacement into the bucket address table and follow the pointer to the appropriate bucket • To insert a record with search-key value Kj – follow the lookup procedure to locate the bucket, say j – if there is room in bucket j, insert the record – Otherwise the bucket must be split and insertion reattempted • in some cases we use overflow buckets instead (as explained shortly) Splitting in Extendable Hash Structure To split a bucket j when inserting a record with search-key value Kj • if i > ij (more than one pointer into bucket j) – allocate a new bucket z – set ij and iz to the old value ij incremented by one – update the bucket address table (change the second half of the set of entries pointing to j so that they now point to z) – remove all the entries in j and rehash them so that they either fall in z or j – reattempt the insert (Kj). If the bucket is still full, repeat the above.
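A minimal Python sketch of the lookup procedure just described, assuming a b-bit hash value and a bucket address table with 2^i entries indexed by the first i high-order bits; the function names and the use of Python's built-in hash are illustrative only.

  def extendable_lookup(key, i, bucket_address_table, b=32, hash_fn=hash):
      """Locate the bucket for key using the first i bits of a b-bit hash value."""
      x = hash_fn(key) & ((1 << b) - 1)    # H(K) = X, truncated to b bits
      prefix = x >> (b - i)                # the i high-order bits of X
      return bucket_address_table[prefix]  # several table entries may share one bucket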
Splitting in Extendable Hash Structure (2) To split a bucket j when inserting a record with search-key value Kj • if i= ij (only one pointer in to bucket j) – increment i and double the size of the bucket address table – replace each entry in the bucket address table with two entries that point to the same bucket – recompute new bucket address table entry for Kj – now i> ij so use the first case described earlier • When inserting a value, if the bucket is still full after several splits (that is, i reaches some preset value b), give up and create an overflow bucket rather than splitting the bucket entry table further – how might this occur? Deletion in Extendable Hash Structure To delete a key value Kj • locate it in its bucket and remove it • the bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table) • coalescing of buckets is possible – can only coalesce with a “buddy” bucket having the same value of ij and same ij -1prefix, if one such bucket exists • decreasing bucket address table size is also possible – very expensive – should only be done if the number of buckets becomes much smaller than the size of the table Extendable Hash Structure Example Hash function on branch name Initial hash table (empty) Extendable Hash Structure Example (2) Hash structure after insertion of one Brighton and two Downtown records Extendable Hash Structure Example (3) Hash structure after insertion of Mianus record Extendable Hash Structure Example (4) Hash structure after insertion of three Perryridge records Extendable Hash Structure Example (5) Hash structure after insertion of Redwood and Round Hill records Extendable Hashing vs. Other Hashing • Benefits of extendable hashing: – hash performance doesn’t degrade with growth of file – minimal space overhead • Disadvantages of extendable hashing – extra level of indirection (bucket address table) to find desired record – bucket address table may itself become very big (larger than memory) • need a tree structure to locate desired record in the structure! – Changing size of bucket address table is an expensive operation • Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows Comparison: Ordered Indexing vs. Hashing • Each scheme has advantages for some operations and situations. To choose wisely between different schemes we need to consider: – cost of periodic reorganization – relative frequency of insertions and deletions – is it desirable to optimize average access time at the expense of worst-case access time? – What types of queries do we expect? 
• Hashing is generally better at retrieving records for a specific key value • Ordered indices are better for range queries Index Definition in SQL • Create an index create index <index-name> on <relation-name> (<attribute-list>) e.g., create index br-index on branch(branch-name) • Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key – not really required if SQL unique integrity constraint is supported • To drop an index drop index <index-name> Multiple-Key Access • With some queries we can use multiple indices • Example: select account-number from account where branch-name=“Perryridge” and balance=1000 • Possible strategies for processing this query using indices on single attributes: – use index on balance to find accounts with balances =1000, then test them individually to see if branch-name=“Perryridge” – use index on branch-name to find accounts with branchname=“Perryridge”, then test them individually to see if balances =1000 – use branch-name index to find pointers to all records of the Perryridge branch, and use balance index similarly, then take intersection of both sets of pointers Multiple-Key Access (2) • With some queries using a single-attribute index is unnecessarily expensive – with methods (1) and (2) from the earlier slide, we might have the index we use return a very large set, even though the final result is quite small – Even with method (3) (use both indices and then find the intersection) we might have both indices return a large set, which will make for a lot of unneeded work if the final result (the intersection) is small • An alternative strategy is to create and use an index on more than one attribute -- in this example, an index on (branch-name, balance) (both attributes) Indices on Multiple Attributes • Suppose we have an ordered index on the combined search-key (branch-name, balance) • Examining the earlier query with the clause where branch-name=“Perryridge” and balance=1000 Our new index will fetch only records that satisfy both conditions -much more efficient than trying to answer the query with separate single-valued indices • We can also handle range queries like: where branch-name=“Perryridge” and balance<1000 • But our index will not work for where branch-name<“Perryridge” and balance=1000 Multi-Attribute Indexing Example: EMP(eno, ename, age, sal) lots of ways to handle this problem • • separate indices: lots of false hits combined index based on composite key – key= sal*100 + age – search for 30<sal<40 translates into 3000<key<4000 (easy) – search for 60<age<80 difficult • • • Grid files (up next) R-tree (B-tree generalization) Quad-trees, K-d trees, etc... 
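To make the composite-key trick above concrete, here is a small Python sketch; the encoding key = sal*100 + age is the one on the slide, and the assumption age < 100 and the helper names are illustrative.

  def composite_key(sal, age):
      # the slide's encoding: assumes age < 100 so the pair (sal, age) is recoverable
      return sal * 100 + age

  # A salary range maps to one contiguous key range, so an ordered index on the
  # composite key answers it with a single range scan:
  #   30 < sal < 40  ->  3000 < key < 4000 (as on the slide)
  # An age range such as 60 < age < 80 does not map to one contiguous key range:
  # matching keys are scattered across every salary band, so the index cannot
  # answer it without checking key % 100 for each candidate.
  def age_in_range(key, lo=60, hi=80):
      return lo < key % 100 < hi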
Grid Files • Structure used to speed processing of general multiple search-key queries involving one or more comparison operators • The grid file has: – a single grid array – array has number of dimensions equal to the number of search-key attributes – one linear scale for each search-key attribute • Multiple cells of the grid array can point to the same bucket • To find the bucket for a search-key value, locate the row and column of the cell using the linear scales to get the grid location, then follow the pointer in that grid location to the bucket Example Grid File for account Queries on a Grid File • A grid file on two attributes A and B can handle queries of all the following forms with reasonable efficiency: – (a1 < A < a2) – (b1 < B < b2) – (a1 < A < a2 and b1 < B < b2) • For example, to answer (a1 < A < a2 and b1 < B < b2), use the linear scales to find corresponding candidate grid array cells, and look up all the buckets pointed-to from those cells Grid Files (cont) • During insertion, if a bucket becomes full, a new bucket can be created if more than one cell points to it – idea similar to extendable hashing, but in multiple dimensions – if only one cell points to the bucket, either an overflow bucket must be created or the grid array size must be increased • Linear scales can be chosen to uniformly distribute records across buckets – not necessary to have scale uniform across the domain -- if records are distributed in some other pattern, the linear scale can mirror it (as shown in the example earlier) Grid Files (end) • Periodic re-organization to increase grid array size helps with good performance – reorganization can be very expensive, though • Space overhead of the grid array can be high • R-trees (chapter 23) are an alternative Bitmap Indices • Bitmap indices are a special type of index designed for efficient queries on multiple keys • Records in a relation are assumed to be numbered sequentially from 0 – given a number n the objective is for it to be easy to retrieve record n – very easy to achieve if we’re looking at fixed-length records • Bitmap indices are applicable on attributes that take on a relatively small number of distinct values – e.g. gender, country, state, hockey team – or an arbitrary mapping of a wider spectrum of values into a small number of categories (e.g. income level: divide income into a small number of levels such as 0-9999, 10K-19999, 20K-49999, 50K and greater) • A bitmap is simply an array of bits Bitmap Indices (cont) • In the simplest form a bitmap index on an attribute has a bitmap for each value of the attribute – bitmap has as many bits as there are records – in a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise Bitmap Indices (cont) • Bitmap indices are useful for queries on multiple attributes – not particularly useful for single-attribute queries • Queries are answered using bitmap operations – intersection (and) – union (or) – complementation (not) • Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap – e.g. 
100110 and 110011 = 100010 100110 or 110011 = 110111 not 100110 = 011001 Bitmap Indices (cont) • Every 1 bit in the result marks a desired tuple – can compute its location (since records are numbered sequentially and are all of the same size) and retrieve the records – counting number of tuples in the result (SQL count aggregate) is even faster • Bitmap indices are generally very small compared to relation size – e.g. if record is 100 bytes, space for a single bitmap on all tuples of the relation takes 1/800 of the size of the relation itself – if bitmap allows for 8 distinct values, bitmap is only 1% of the size of the whole relation Bitmap Indices (cont) • Deletion needs to be handled properly – can’t just store “zero” values at deleted locations (why?) – need existence bitmap to note if the record at location X is valid or not • existence bitmap necessary for complementation not(A=v): (not bitmap-A-v) and ExistenceBitmap • Should keep bitmaps for all values, even “null” – to correctly handle SQL null semantics for not (A=v) must compute: not(bitmap-A-Null) and (not bitmap-A-v) and ExistenceBitmap Efficient Implementation of Bitmap Operations • Bitmaps are packed into words; a single word-wide and is a basic CPU instruction that computes the and of 32 or 64 bits at once – e.g. one-million-bit maps can be anded with just 31,250 instructions • Counting number of 1s can be done fast by a trick: – use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation – add up the retrieved counts – can use pairs of bytes to speed up further in a similar way, but at a higher memory cost (64K values in the precomputed array for 2 bytes) Bitmaps and B+-trees • Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records – assuming a tuple-id is 64 bits, this becomes worthwhile if >1/64 of the records have a given value • This technique merges benefits of bitmap and B+-tree indices – useful if some values are uncommon, and some are quite common – why not useful if there are only a small number of different values total (16, 20, something less than 64)? • That covers all of chapter 12. Query Processing • SQL is good for humans, but not as an internal (machine) representation of how to calculate a result • Processing an SQL (or other) query requires these steps: – parsing and translation • turning the query into a useful internal representation in the extended relational algebra – optimization • manipulating the relational algebra query into the most efficient form (one that gets results the fastest) – evaluation • actually computing the results of the query Query Processing Diagram Query Processing Steps 1. parsing and translation – details of parsing are covered in other places (texts and courses on compilers). We’ve already covered SQL and relational algebra; translating between the two should be relatively familiar ground 2. optimization – This is the meat of chapter 13. How to figure out which plan, among many, is the best way to execute a query 3. evaluation – actually computing the results of the query is mostly mechanical (doesn’t require much cleverness) once a good plan is in place. Query Processing Example • Initial query: select balance from account where balance < 2500 • Two different relational algebra expressions could represent this query: – sel balance<2500(Pro balance(account)) – Pro balance(sel balance<2500(account)) • which choice is better?
It depends upon metadata (data about the data) and what indices are available for use on these operations. Query Processing Metadata • Cost parameters (some are easy to maintain; some are very hard -- this is statistical info maintained in the system’s catalog) – n(r): number of tuples in relation r – b(r): number of disk blocks containing tuples of relation r – s(r): average size of a tuple of relation r – f(r): blocking factor of r: how many tuples fit in a disk block – V(A,r): number of distinct values of attribute A in r. (V(A,r) = n(r) if A is a candidate key) – SC(A,r): average selection cardinality for attribute A of r. Equivalent to n(r)/V(A,r). (1 if A is a key) – min(A,r): minimum value of attribute A in r – max(A,r): maximum value of attribute A in r Query Processing Metadata (2) • Cost parameters are used in two important computations: – I/O cost of an operation – the size of the result • In the following examination we’ll find it useful to differentiate three important operations: – Selection (search) for equality (R.A1=c) – Selection (search) for inequality (R.A1>c) (range queries) – Projection on attribute A1 Selection for Equality (no indices) • Selection (search) for equality (R.A1=c) – cost (sequential search on a sorted relation) = b(r)/2 average unsuccessful, b(r)/2 + SC(A1,r) - 1 average successful – cost (binary search on a sorted relation) = log2(b(r)) average unsuccessful, log2(b(r)) + SC(A1,r) - 1 average successful – size of the result: n(select(R.A1=c)) = SC(A1,r) = n(r)/V(A1,r) Selection for Inequality (no indices) • Selection (search) for inequality (R.A1>c) – cost (file unsorted) = b(r) – cost (file sorted on A1) = b(r)/2 + b(r)/2 (if we assume that half the tuples qualify); b(r) in general (regardless of the number of tuples that qualify. Why?) – size of the result = depends upon the query; unpredictable Projection on A1 • Projection on attribute A1 – cost = b(r) – size of the result: n(Pro(R,A1)) = V(A1,r) Selection (Indexed Scan) for Equality Primary Index on key: cost = (height+1) unsuccessful; cost = (height+1) + 1 successful Primary (clustering) Index on non-key: cost = (height+1) + SC(A1,r)/f(r) all tuples with the same value are clustered Secondary Index cost = (height+1) + SC(A1,r) tuples with the same value are scattered Selection (Indexed Scan) for Inequality Primary Index on key: search for first value and then pick tuples >= value cost = (height+1) + 1 + size of the result (in disk pages) = height+2 + n(r) * (max(A,r)-c)/(max(A,r)-min(A,r))/f(r) Primary (clustering) Index on non-key: cost as above (all tuples with the same value are clustered) Secondary (non-clustering) Index cost = (height+1) + B-treeLeaves/2 + size of result (in tuples) = height+1 + B-treeLeaves/2 + n(r) * (max(A,r)-c)/(max(A,r)-min(A,r)) Complex Selections • Conjunction (select where theta1 and theta2) (s1 = # of tuples satisfying selection condition theta1) combined SC = (s1/n(r)) * (s2/n(r)) = s1*s2/n(r)^2, assuming independence of predicates • Disjunction (select where theta1 or theta2) combined SC = 1 - (1 - s1/n(r)) * (1 - s2/n(r)) = s1/n(r) + s2/n(r) - s1*s2/n(r)^2 • Negation (select where not theta1) n(not theta1) = n(r) - n(theta1) Complex Selections with Indices GOAL: apply the most restrictive condition first and use multiple indices in combination to reduce the intermediate results as early as possible • Why? No index will be available on intermediate results!
• Conjunctive selection using one index B: – select using B and then apply remaining predicates on intermediate results • Conjunctive selection using a composite key index (R.A1, R.A2): – create a composite key or range from the query values and search directly (range search on the first attribute (MSB of the composite key) only) • Conjunctive selection using two indices B1 and B2: – search each separately and intersect the tuple identifiers (TIDs) • Disjunctive selection using two indices B1 and B2: – search each separately and union the tuple identifiers (TIDs) Sorting? • What does sorting have to do with query processing? – SQL queries can specify the output be sorted – several relational operations (such as joins) can be implemented very efficiently if the input data is sorted first – as a result, query processing is often concerned with sorting temporary (intermediate) and final results – creating a secondary index on the active relation (“logical” sorting) isn’t sufficient -- sequential scans through the data on secondary indices are very inefficient. We often need to sort the data physically into order Sorting • We differentiate two types of sorting: – internal sorting: the entire relation fits in memory – external sorting: the relation is too large to fit in memory • Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort) • In databases, the most commonly used method for external sorting is the sort-merge algorithm (based upon Mergesort) Sort-merge Algorithm • create runs phase: – Load in M consecutive blocks of the relation (M is the number of blocks that will fit easily in main memory) – Use some internal sorting algorithm to sort the tuples in those M blocks – Write the sorted run to disk – Continue with the next M blocks, etcetera, until finished • merge runs phase (assuming that the number of runs, N, is less than M) – load the first block of each run into memory – grab the first tuple (lowest value) from all the runs and write it to an output buffer page – when the last tuple of a block is read, grab a new block from that run – when the output buffer page is full, write it to disk and start a new one – continue until all buffer pages are empty Sort-merge Algorithm (2) • Merge-runs phase (N>M) – operate on M runs at a time, creating runs of length M^2, and continue in multiple passes of the Merge operation • Cost of sorting: b(r) is the number of blocks occupied by relation r – runs phase does one read, one write on each block of r: cost 2b(r) – total number of runs (N): b(r)/M – number of passes in merge operation: 1 if N<M; otherwise logM-1(b(r)/M) – during each pass in the merge phase we read the whole relation and write it all out again: cost 2b(r) per pass – total cost of merge phase is therefore 2b(r)(logM-1(b(r)/M)+1) – if only one merge pass is required (N<M) the cost is 4b(r); – if M>b(r) then there is only one run (internal sorting) and the cost is b(r) Join Operation • Join is perhaps the most important operation combining two relations • Algorithms computing the join efficiently are crucial to the optimization phase of query processing • We will examine a number of algorithms for computing joins • An important metric for estimating efficiency is the size of the result: as mentioned last class, the best strategy on complex (multi-relation) queries is to cut down the size of the intermediate results as quickly as possible.
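Here is a minimal in-memory Python sketch of the two-phase sort-merge (external sort) algorithm described above; the "file" is just a list and M counts records per run rather than disk blocks, so this is only an illustration of the structure (create runs, then merge), not an actual external sort.

  import heapq

  def sort_merge(records, M):
      # create runs phase: take M records at a time, sort each group internally, keep the sorted runs
      runs = [sorted(records[start:start + M]) for start in range(0, len(records), M)]
      # merge runs phase: assumes the number of runs N <= M so a single merge pass suffices;
      # otherwise merge M runs at a time and repeat in multiple passes
      return list(heapq.merge(*runs))

  print(sort_merge([7, 3, 9, 1, 8, 2, 6], M=3))  # -> [1, 2, 3, 6, 7, 8, 9]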
Join Operation: Size Estimation • 0 <= size <= n(r) * n(s) (between 0 and size of cartesian product) • If A = R ∩ S is a key of R, then size <= n(s) • If A = R ∩ S is a key of R and a foreign key of S, then size = n(s) • If A = R ∩ S is not a key, then each value of A in R appears no more than n(s)/V(A,s) times in S, so n(r) tuples of R produce: size <= n(r)*n(s)/V(A,s); symmetrically, size <= n(s)*n(r)/V(A,r); if the two values are different we use: size <= min{n(s)*n(r)/V(A,r), n(r)*n(s)/V(A,s)} Join Methods: Nested Loop • Simplest algorithm to compute a join: nested for loops – requires no indices • tuple-oriented: for each tuple t1 in r do begin for each tuple t2 in s do begin join t1, t2 and append the result to the output • block-oriented: for each block b1 in r do begin for each block b2 in s do begin join b1, b2 and append the result to the output • reverse inner loop – as above, but we alternate counting up and down in the inner loop. Why? Cost of Nested Loop Join • Cost depends upon the number of buffers and the replacement strategy – pin 1 block from the outer relation, k for the inner and LRU cost: b(r) + b(r)*b(s) (assuming b(s)>k) – pin 1 block from the outer relation, k for the inner and MRU cost: b(r) + b(s) + (b(s) - (k-1))*(b(r)-1) = b(r)(2-k) + k - 1 + b(r)*b(s) (assuming b(s)>k) – pin k blocks from the outer relation, 1 for the inner • read k blocks from the outer (cost k) • for each block of s, join it with the k outer blocks (cost b(s)) • repeat with the next k blocks of r until done (repeated b(r)/k times) cost: (k+b(s)) * b(r)/k = b(r) + b(r)*b(s)/k – which relation should be the outer one? Join Methods: Sort-Merge Join • Two phases: – Sort both relations on the join attributes – Merge the sorted relations sort R on joining attribute sort S on joining attribute merge (sorted-R, sorted-S) • cost with k buffers: – b(r)(2 logk-1(b(r)/k) + 1) to sort R – b(s)(2 logk-1(b(s)/k) + 1) to sort S – b(r) + b(s) to merge – total: b(r)(2 logk-1(b(r)/k) + 1) + b(s)(2 logk-1(b(s)/k) + 1) + b(r) + b(s) Join Methods: Hash Join • Two phases: – Hash both relations into hashed partitions – Bucket-wise join: join tuples of the same partitions only Hash R on joining attribute into H(R) buckets Hash S on joining attribute into H(S) buckets nested-loop join of corresponding buckets • cost (assuming pairwise buckets fit in the buffers) – 2b(r) to hash R (read and write) – 2b(s) to hash S (as above) – b(r) + b(s) to merge – total: 3(b(r) + b(s)) Join Methods: Indexed Join • Inner relation has an index (clustering or not): for each block of R do begin for each tuple t in the block do begin search the index on S with the value t.A of the joining attribute and join with the resulting tuples of S • cost = b(r) + n(r) * cost(select(S.A=c)) where cost(select(S.A=c)) is as described before for indexed selection – What if R is sorted on A? (hint: use V(A,r) in the above) 3-way Join Suppose we want to compute R(A,B) |X| S(B,C) |X| T(C,D) • 1st method: pairwise.
– First compute temp(A,B,C) = R(A,B) |X| S(B,C) cost b(r) + b(r)*b(s); size of temp: b(temp) = (n(r)*n(s)/V(B,s)) / f(r+s), where f(r+s) is the blocking factor for the joined tuples – then compute result temp(A,B,C) |X| T(C,D) cost b(t) + b(t)*b(temp) • 2nd method: scan S and do simultaneous selections on R and T – cost = b(s) + b(s)*(b(r) + b(t)) – if R and T are indexed we could do the selections through the indices cost = b(s) + n(s)*[cost(select(R.B=S.B)) + cost(select(T.C=S.C))] Transformation of Relational Expressions • (Section 14.3) • Two relational algebra expressions are equivalent if they generate the same set of tuples on every legal database instance. – Important for optimization – Allows one relational expression to be replaced by another without affecting the results – We can choose among equivalent expressions according to which are lower cost (size or speed depending upon our needs) – An equivalence rule says that expression A is equivalent to B; we may replace A with B and vice versa in the optimizer. Equivalence Rules (1) In the discussion that follows: Ex represents a relational algebra expression, θy represents a predicate, Lz represents a particular list of attributes, and σ and Π are selection and projection as usual • cascade of selections: a conjunction of selections can be deconstructed into a series of them: σθ1∧θ2(E) = σθ1(σθ2(E)) • selections are commutative: σθ1(σθ2(E)) = σθ2(σθ1(E)) Equivalence Rules (2) • cascade of projections: only the final projection in a series of projections is necessary: ΠL1(ΠL2(…(ΠLn(E))…)) = ΠL1(E) • selections can be combined with Cartesian products and theta joins: σθ(E1 X E2) = E1 |X|θ E2 and σθ1(E1 |X|θ2 E2) = E1 |X|θ1∧θ2 E2 • joins and theta-joins are commutative (if you ignore the order of attributes, which can be corrected by an appropriate projection): E1 |X| E2 = E2 |X| E1 Equivalence Rules (3) • joins and cross products are associative: (E1 |X|θ1 E2) |X|θ2∧θ3 E3 = E1 |X|θ1∧θ3 (E2 |X|θ2 E3) (where θ2 involves attributes from only E2 and E3.) (E1 |X| E2) |X| E3 = E1 |X| (E2 |X| E3) (E1 X E2) X E3 = E1 X (E2 X E3) (the above two equivalences are useful special cases of the first one) Equivalence Rules (4) • selection distributes over join in two cases: – if all the attributes in θ1 appear only in one expression (say E1): σθ1(E1 |X| E2) = (σθ1(E1)) |X| E2 – if all the attributes in θ1 appear only in E1 and all the attributes in θ2 appear only in E2: σθ1∧θ2(E1 |X| E2) = (σθ1(E1)) |X| (σθ2(E2)) (note that the first case is a special case of the second) Equivalence Rules (5) • projection distributes over join. Given L1 and L2 being attributes of E1 and E2 respectively: – if θ involves attributes entirely from L1 ∪ L2 then ΠL1∪L2(E1 |X|θ E2) = (ΠL1(E1)) |X|θ (ΠL2(E2)) – if θ also involves attribute lists L3 and L4 (both not in L1 ∪ L2) from E1 and E2 respectively: ΠL1∪L2(E1 |X|θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) |X|θ (ΠL2∪L4(E2))) Equivalence Rules (6) • union and intersection are commutative (difference is not): E1 ∪ E2 = E2 ∪ E1, E1 ∩ E2 = E2 ∩ E1 • union and intersection are associative (difference is not): (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3), (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3) • selection distributes over union, intersection, and set difference: σθ(E1 ∪ E2) = σθ(E1) ∪ σθ(E2), σθ(E1 ∩ E2) = σθ(E1) ∩ σθ(E2) = σθ(E1) ∩ E2, σθ(E1 - E2) = σθ(E1) - σθ(E2) = σθ(E1) - E2 Chapter 15: Transactions • A transaction is a single logical unit of database work -- for example, a transfer of funds from checking to savings account. It is a set of operations delimited by statements of the form begin transaction and end transaction • To ensure database integrity, we require that transactions have the ACID properties – Atomic: all or nothing gets done.
– Consistent: preserves the consistency of the database – Isolated: unaware of any other concurrent transactions (as if there were none) – Durable: after completion the results are permanent even through system crashes ACID Transactions • Transactions access data using two operations: – read (X) – write (X) • ACID property violations: – Consistency is the responsibility of the programmer – System crash half-way through: atomicity issues – Another transaction modifies the data half-way through: isolation violation – example: transfer $50 from account A to account B T: read(A); A:=A-50; write (A); read (B); B:=B+50; write (B) Transaction States • Five possible states for a transaction: – active (executing) – partially committed (after last statement’s execution) – failed (can no longer proceed) – committed (successful completion) – aborted (after the transaction has been rolled back) Transaction States • Committed or Aborted transactions are called terminated • Aborted transactions may be – restarted as a new transaction – killed if it is clear that it will fail again • Rollbacks – can be requested by the transaction itself (go back to a given execution state) – some actions can’t be rolled back (e.g., a printed message, or an ATM cash withdrawal) Concurrent Execution • Transaction processing systems allow multiple transactions to run concurrently – improves throughput and resource utilization (I/O on transaction A can be done in parallel with CPU processing on transaction B) – reduced waiting time (short transactions need not wait for long ones operating on other parts of the database) – however, this can cause problems with the “I” (Isolation) property of ACID Serializability • Scheduling transactions so that they are serializable (equivalent to some serial schedule of the same transactions) ensures database consistency (and the “I” property of ACID) • a serial schedule is equivalent to having the transactions execute one at a time • a non-serial or interleaved schedule permits concurrent execution Serial Schedule • To the right T1 and T2 are on a serial schedule: T2 begins after T1 finishes. No problems. Interleaved Schedule • This is an example of an interleaved concurrent schedule that raises a number of database consistency concerns. Serial and Interleaved Schedules Serial schedule Interleaved schedule Another Interleaved Schedule The previous serial and interleaved schedules ensured database consistency (A+B before execution = A+B after execution) The interleaved schedule on the right is only slightly different, but does not ensure a consistent result. – assume A=1000 and B=2000 before (sum = 3000) – after execution, A=950 and B=2150 (sum = 3100) Inconsistent Transaction Schedules • So what caused the problem? What makes one concurrent schedule consistent, and another one inconsistent?
Inconsistent Transaction Schedules
• So what caused the problem? What makes one concurrent schedule consistent, and another one inconsistent?
  – Operations on data within a transaction are not relevant, as they are run on copies of the data residing in local buffers of the transaction
  – For scheduling purposes, the only significant operations are read and write
• Given two transactions, Ti and Tj, both attempting to access data item Q:
  – if Ti and Tj are both executing read(Q) statements, order does not matter
  – if Ti is doing write(Q), and Tj read(Q), then order does matter
  – same if Tj is writing and Ti reading
  – if both Ti and Tj are executing write(Q), then order might matter if there are any subsequent operations in Ti or Tj accessing Q
Transaction Conflicts
• Two operations on the same data item by different transactions are said to be in conflict if at least one of the operations is a write
• If two consecutive operations of different transactions in a schedule S are not in conflict, then we can swap the two to produce another schedule S’ that is conflict equivalent with S
• A schedule S is serializable if it is conflict equivalent (after some series of swaps) to a serial schedule.
Transaction Conflict Example
• Example: read/write(B) in T0 do not conflict with read/write(A) in T1
• Interleaved schedule: T0:read(A), T0:write(A), T1:read(A), T0:read(B), T1:write(A), T0:write(B), T1:read(B), T1:write(B)
• After repeatedly swapping adjacent non-conflicting operations, we obtain the conflict-equivalent serial schedule: T0:read(A), T0:write(A), T0:read(B), T0:write(B), T1:read(A), T1:write(A), T1:read(B), T1:write(B)
Serializability Testing (15.9) and Precedence Graphs
• So we need a simple method to test a schedule S and discover whether it is serializable.
• The simple method involves constructing a directed graph called a Precedence Graph from S (a small code sketch of this test appears below, after the Concurrency Control slide)
• Construct a precedence graph as follows:
  – a vertex labelled Ti for every transaction in S
  – an edge from Ti to Tj if any of these three conditions holds:
    • Ti executes write(Q) before Tj executes read(Q)
    • Ti executes read(Q) before Tj executes write(Q)
    • Ti executes write(Q) before Tj executes write(Q)
  – if the graph has a cycle, S is not serializable
Precedence Graph Example 1
• Compute a precedence graph for schedule B (right)
• three vertices (T1, T2, T3)
• edge from Ti to Tj if
  – Ti writes Q before Tj reads Q
  – Ti reads Q before Tj writes Q
  – Ti writes Q before Tj writes Q
Precedence Graph Example 2
• Slightly more complicated example
• Compute a precedence graph for schedule A (right)
• three vertices (T1, T2, T3)
• edge from Ti to Tj if
  – Ti writes Q before Tj reads Q
  – Ti reads Q before Tj writes Q
  – Ti writes Q before Tj writes Q
Concurrency Control
• So now we can recognize when a schedule is serializable. In practice, it is often difficult and inefficient to determine a schedule in advance, much less examine it for serializability.
• Lock-based protocols are a common mechanism used to prevent transaction conflicts on the fly (i.e., without knowing what operations are coming later)
• The basic concept is simple: to prevent transaction conflict (two transactions working on the same data item with at least one of them writing), we implement a lock system -- a transaction may only access an item if it holds the lock on that item.
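As a concrete illustration of the precedence-graph test, here is a small sketch. The encoding of a schedule as (transaction, action, item) triples and the function names are my own; the edge rule and the cycle test follow the slide.

    # Sketch: build a precedence graph from a schedule and test it for cycles.
    from collections import defaultdict

    def precedence_graph(schedule):
        edges = defaultdict(set)
        for i, (ti, op_i, q_i) in enumerate(schedule):
            for tj, op_j, q_j in schedule[i + 1:]:
                # Conflicting ops on the same item, Ti before Tj: add edge Ti -> Tj.
                if ti != tj and q_i == q_j and (op_i == "write" or op_j == "write"):
                    edges[ti].add(tj)
        return edges

    def has_cycle(edges):
        # Depth-first search with colors: 0 = unvisited, 1 = on stack, 2 = done.
        color = defaultdict(int)
        def visit(u):
            color[u] = 1
            for v in edges[u]:
                if color[v] == 1 or (color[v] == 0 and visit(v)):
                    return True
            color[u] = 2
            return False
        return any(color[u] == 0 and visit(u) for u in list(edges))

    # The bad interleaving from the earlier slides.
    schedule = [("T1", "read", "A"), ("T2", "read", "A"), ("T2", "write", "A"),
                ("T1", "write", "A"), ("T1", "read", "B"), ("T1", "write", "B"),
                ("T2", "read", "B"), ("T2", "write", "B")]

    g = precedence_graph(schedule)
    print(dict(g))                             # {'T1': {'T2'}, 'T2': {'T1'}}
    print("serializable?", not has_cycle(g))   # False: the graph has a cycle

The cycle T1 -> T2 -> T1 is exactly why that interleaving could not be rearranged into a serial schedule.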
Lock-based Protocols
• We recognize two modes of locks:
  – shared: if Ti has a shared-mode (“S”) lock on data item Q, then Ti may read, but not write, Q
  – exclusive: if Ti has an exclusive-mode (“X”) lock on Q, then Ti can both read and write Q
• Transactions must be granted a lock before accessing data
  – A concurrency-control manager handles the granting of locks
  – Multiple S locks are permitted on a single data item, but only one X lock
  – this allows multiple reads (which don’t create serializability conflicts) but prevents any R/W, W/R, or W/W interactions (which create conflicts)
Lock-based Protocols
• Transactions must request a lock before accessing a data item; they may release the lock at any time when they no longer need access
• If the concurrency-control manager does not grant a requested lock, the transaction must wait until the data item becomes available later on.
• Unfortunately, this can lead to a situation called a deadlock
  – suppose T1 holds a lock on item R and requests a lock on Q, but transaction T2 holds an exclusive lock on Q. So T1 waits. Then T2 gets to where it requests a lock on R (still held by the waiting T1). Now both transactions are waiting for each other. Deadlock.
Deadlocks
• To detect a deadlock situation we use a wait-for graph
  – one node for each transaction
  – a directed edge Ti --> Tj if Ti is waiting for a resource locked by Tj
• a cycle in the wait-for graph implies a deadlock (see the lock-manager sketch below)
  – The system checks periodically for deadlocks
  – If a deadlock exists, one of the transactions in the cycle must be aborted
• 95% of deadlocks are between two transactions
• deadlocks are a necessary evil
  – preferable to allowing the database to become inconsistent
  – deadlocks can be rolled back; inconsistent data is much worse
Two-phase Locking Protocol
• A locking protocol is a set of rules for placing and releasing locks
  – a protocol restricts the number of possible schedules
  – lock-based protocols are pessimistic
• Two-phase locking is (by far) the most common protocol
  – growing phase: a transaction may only obtain locks (never release any of its locks)
  – shrinking phase: a transaction may only release locks (never obtain any new locks)
Two-phase with Lock Conversion
• Two-phase locking with lock conversion:
  – S locks can be upgraded to X during the growing phase
  – X locks can be downgraded to S during the shrinking phase (this only works if the transaction has already written any changed data value while holding the X lock, of course)
  – The idea is that during the growing phase, instead of holding an X lock on an item it does not yet need to write, a transaction holds an S lock on it (allowing other transactions to read the old value for longer) and upgrades only when modifications to the old value begin.
  – Similarly, in the shrinking phase, once a transaction downgrades an X lock, other transactions can begin reading the new value earlier.
Variants on Two-phase Locking
• Strict two-phase locking
  – additionally requires that all X locks are held until commit time
  – prevents any other transaction from seeing uncommitted data
  – viewing uncommitted data can lead to cascading rollbacks
• Rigorous two-phase locking
  – requires that ALL locks are held until commit time
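The sketch below ties two of these ideas together: S/X lock compatibility and the wait-for graph. It is a deliberately tiny illustration, not a real concurrency-control manager; the class and method names are made up, lock upgrades are ignored, and the deadlock check only looks for the two-transaction mutual waits that the slide says are the common case (a full detector would search for cycles of any length).

    # Toy lock manager: S/S requests are compatible, anything involving X conflicts.
    class LockManager:
        def __init__(self):
            self.locks = {}      # item -> (mode, set of holders), mode in {"S", "X"}
            self.waiting = {}    # txn  -> set of txns it is waiting for

        def compatible(self, held_mode, requested_mode):
            return held_mode == "S" and requested_mode == "S"

        def request(self, txn, item, mode):
            held = self.locks.get(item)
            if held is None or self.compatible(held[0], mode):
                holders = held[1] if held else set()
                holders.add(txn)
                self.locks[item] = (mode if held is None else held[0], holders)
                return True                       # lock granted
            # Not granted: txn now waits for every current holder of the item.
            self.waiting.setdefault(txn, set()).update(h for h in held[1] if h != txn)
            return False

        def deadlocked(self):
            # Only detects mutual (two-transaction) waits in this toy version.
            return any(t in self.waiting.get(u, ()) for t in self.waiting
                       for u in self.waiting[t])

    lm = LockManager()
    lm.request("T1", "R", "X")        # granted
    lm.request("T2", "Q", "X")        # granted
    lm.request("T1", "Q", "X")        # denied: T1 waits for T2
    lm.request("T2", "R", "X")        # denied: T2 waits for T1
    print("deadlock?", lm.deadlocked())   # True -- the scenario from the slide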
Database Recovery
• Computers may crash, stall, or lock up
• We require that transactions are Durable (the D in ACID)
  – A completed transaction makes a permanent change to the database that will not be lost
  – How do we ensure durability, given the possibility of computer crash, hard disk failure, power surges, software bugs locking the system, or all the myriad bad things that can happen?
  – How do we recover from failure (i.e., get back to a consistent state that includes all recent changes)?
• The most widely used approach is log-based recovery
Backups
• Regular backups take a snapshot of the database status at a particular moment in time
  – used to restore data in case of catastrophic failure
  – an expensive operation (writing out the whole database)
    • usually done no more than once a week, over the weekend when system usage is low
  – smaller daily backups store only records that have been modified since the last weekly backup; done overnight
  – backups allow us to recover the database to a fairly recent consistent state (yesterday’s), but are far too expensive to be used to save running database modifications
  – So how do we ensure transaction durability (D)?
Log-Based Recovery
• We store a record of recent modifications: a log.
  – The log is a sequence of log records, recording all update activities in the database. A log record records a single database write. It has these fields:
    • transaction identifier: which transaction performed the write
    • data-item identifier: unique ID of the data item (typically the location on disk)
    • old value (what was overwritten)
    • new value (value after the write)
  – The log is a write-ahead log -- log records are written before the database updates its records
Log-Based Recovery (2)
• Other log records:
  – <T-ID, start-time>: transaction becomes active
  – <T-ID, D-ID, V-old, V-new>: transaction makes a write
  – <T-ID, commit-time>: transaction commits
  – <T-ID, abort-time>: transaction aborts
• The log contains a complete record of all database activity since the last backup
• Logs must reside on stable storage
  – Assume each log record is written to the end of the log on stable storage as soon as it is created.
Log-Based Recovery (3)
• The recovery operation uses two primitives (sketched in code below):
  – redo: reapply the logged update
    • write V-new into D-ID
  – undo: reverse the logged update
    • write V-old into D-ID
  – both primitives ignore the current state of the data item -- they don’t bother to read the value first
  – applying several of these writes to the same data item is equivalent to applying just the last one -- no harm is done as long as we apply them in the correct order, even if the correct result is already written into stable storage
Checkpoints
• When a system failure occurs, we examine the log to determine which transactions need to be redone, and which need to be undone.
• In theory we need to search the entire log
  – time consuming
  – most of the transactions in the log have already written their output to stable storage. It won’t hurt the database to redo their results, but every unnecessary redo wastes time.
• To reduce this overhead, database systems introduce checkpoints.
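Here is a minimal sketch of an update log record and the two primitives, with a plain dictionary standing in for the database. The field layout follows the slides’ <T-ID, D-ID, V-old, V-new> record; the Python names (UpdateRecord, redo, undo) are illustrative.

    from dataclasses import dataclass

    @dataclass
    class UpdateRecord:
        txn: str        # transaction identifier
        item: str       # data-item identifier
        v_old: object   # old value (what was overwritten)
        v_new: object   # new value (value after the write)

    def redo(record, db):
        # Reapply the logged update: write V-new into D-ID, without reading first.
        db[record.item] = record.v_new

    def undo(record, db):
        # Reverse the logged update: write V-old into D-ID, without reading first.
        db[record.item] = record.v_old

    db = {"A": 950, "B": 2000}                    # state found on disk after a crash
    log = [UpdateRecord("T1", "A", 1000, 950),
           UpdateRecord("T1", "B", 2000, 2050)]   # T1 committed before the crash

    for rec in log:            # redo T1's committed work in log order
        redo(rec, db)
    print(db)                  # {'A': 950, 'B': 2050}

Because both primitives just install a value without reading first, repeating them (in the correct order) is harmless, which is exactly why recovery can blindly reapply work that may already be on disk.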
Log-Based Recovery with Checkpoints
• So we have a crash and need to recover. What do we do?
• Three passes through the log between the checkpoint and the failure (see the sketch at the end of this section):
  – go forward from the checkpoint to the failure to create the redo and undo lists
    • redo everything that committed before the failure
    • undo everything that failed to commit before the failure
  – go backward from the failure to the checkpoint, doing the undos in order
  – go forward from the checkpoint to the failure, doing the redos in sequence
  – expensive -- three sequential scans of the active log
Recovery Example
• First pass: T2, T4 commit; T3, T5 are uncommitted
• Second pass: undo T5, then T3
• Third pass: redo T4, then T2
Almost Final Stuff on Checkpoints
• Checkpointing usually speeds up recovery
  – the log prior to the checkpoint can be archived
  – without a checkpoint the log may be very long; three sequential passes through it could be very expensive
• During checkpointing:
  – stop accepting new transactions and wait until all active transactions commit
  – flush the log to stable storage
  – flush all dirty disk pages in the buffer to disk
  – mark the stable-storage log with a <checkpoint> record
Final Stuff on Checkpoints
• Better checkpointing
  – don’t wait for active transactions to finish, but don’t let them make updates to the buffers or the update log during checkpointing
  – make the checkpoint log record include a list L of the active transactions: <checkpoint, L>
  – on recovery we need to go further back through previous checkpoints to find all the changes of all transactions listed in L so we can undo or redo them
  – an even more elaborate scheme (called fuzzy checkpointing) allows transactions to keep making updates while the checkpoint is being taken
Deferred vs. Immediate Modification
• Immediate Database Modification
  – basically what we’ve been discussing so far -- uncommitted transactions may write values to disk during their execution
• Deferred Database Modification
  – no writes to the database before the transaction is partially committed (i.e., after the execution of its last statement)
  – since uncommitted transactions never write to the database itself, there is no need for the undo pass on recovery
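Tying the pieces together, here is a small sketch of the three passes over the portion of the log after the last checkpoint. The log-entry tuples mirror the record formats from the earlier slides, but the function name, variable names, and example transactions are illustrative, not taken from the notes.

    # Entry formats, following the slides:
    #   ("start", T), ("update", T, item, v_old, v_new), ("commit", T)

    def recover(log, db):
        # Pass 1: scan forward from the checkpoint to build the redo and undo lists.
        started = [e[1] for e in log if e[0] == "start"]
        committed = {e[1] for e in log if e[0] == "commit"}
        redo_list = [t for t in started if t in committed]
        undo_list = [t for t in started if t not in committed]

        # Pass 2: scan backward, undoing every update of uncommitted transactions.
        for kind, t, *rest in reversed(log):
            if kind == "update" and t in undo_list:
                item, v_old, v_new = rest
                db[item] = v_old

        # Pass 3: scan forward, redoing every update of committed transactions.
        for kind, t, *rest in log:
            if kind == "update" and t in redo_list:
                item, v_old, v_new = rest
                db[item] = v_new
        return db

    db = {"A": 900, "B": 2000}
    log = [("start", "T2"), ("update", "T2", "A", 1000, 950), ("commit", "T2"),
           ("start", "T3"), ("update", "T3", "A", 950, 900)]   # T3 never committed
    print(recover(log, db))     # {'A': 950, 'B': 2000}: T3 undone, T2 redone

The tiny example log loosely echoes the recovery example above: the committed transaction (T2) is redone, while the uncommitted one (T3) is undone, leaving the database in a state that reflects exactly the committed work.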