Database Systems
Lecture #9, November 8, 2004
Hao-hua Chu (朱浩華)

Announcements
• Midterm exam: November 20 (Sat), 2:30 PM in CSIE 101/103.
• Assignment #6 is available on the course homepage.
  – It is due on 11/24.
  – It is very difficult; I suggest you do it before the midterm exam.
• Assignment #7 will be available on the course homepage later this afternoon.
  – It is due on 11/16 (next Tuesday).
  – It is easy, and it will help you prepare for the midterm exam.

Cool Ubicomp Project: Counter Intelligence (MIT)
• Smart kitchen and kitchenware.
• Talking Spoon: salty, sweet, hot?
• Talking Cutlery: bacteria?
• Smart fridge and counters
  – RFID tags.
  – Tracking food from the fridge to your mouth.

Hash-Based Indexing (Chapter 11)

Introduction
• Recall that hash-based indexes are best for equality selections.
  – They cannot support range searches.
  – Equality selections are useful for join operations.
• Both static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.
  – One static hashing technique.
  – Two dynamic hashing techniques: Extendible Hashing and Linear Hashing.

Static Hashing
• The number of primary pages is fixed; they are allocated sequentially and never deallocated. Overflow pages are used when needed.
• h(key) mod N is the bucket to which the data entry with key k belongs (N = number of buckets).
[Figure: h(key) mod N directs a key to one of buckets 0 .. N-1; each bucket is a primary page plus a chain of overflow pages.]

Static Hashing (Contd.)
• Buckets contain data entries.
• The hash function works on the search key field of record r.
  – Ideally it distributes values uniformly over the range 0 .. N-1.
  – h(key) = (a * key + b) usually works well; a and b are constants, and a lot is known about how to tune h.
• Cost of insert/delete/search: two/two/one disk page I/Os (with no overflow chains).
• Long overflow chains can develop and degrade performance.
  – Why poor performance? A search must scan an overflow chain linearly.
  – Extendible and Linear Hashing are dynamic techniques that fix this problem.

Simple Solution
• Avoid creating overflow pages: when a bucket (primary page) becomes full, double the number of buckets and re-organize the file.
• What's wrong with this simple solution? Cost: reading and writing all pages is expensive!

Extendible Hashing
• The basic idea (another level of abstraction):
  – Use a directory of pointers to buckets (hash to a directory entry).
  – Double the number of buckets by doubling the directory, splitting only the bucket that overflowed.
  – The directory is much smaller than the file, so doubling it is much cheaper.
  – Only one page of data entries is split: the page that overflows is rehashed into two pages.
• The trick lies in how the hash function is adjusted!
  – Before doubling the directory, h(r) maps to buckets 0 .. N-1.
  – After doubling the directory, h(r) maps to buckets 0 .. 2N-1.

Example
• The directory is an array of size 4.
• To find the bucket for r, take the last `global depth' bits of h(r).
  – Example: if h(r) = 5 = binary 101, r belongs to the bucket pointed to by directory entry 01 (see the sketch below).
• Global depth: the number of bits used to index into the directory.
• Local depth of a bucket: the number of bits used to hash the entries in that bucket.
• When can the global depth differ from a local depth?
[Figure: directory with global depth 2, entries 00/01/10/11 pointing to Bucket A {4*, 12*, 32*, 16*}, Bucket B {1*, 5*, 21*, 13*}, Bucket C {10*}, and Bucket D {15*, 7*, 19*}, each with local depth 2.]
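To make the last-bits lookup concrete, here is a minimal sketch (names and code are illustrative, not from the lecture) of how a directory entry is chosen from the last `global depth' bits of h(r):

```python
# Minimal sketch of the extendible-hashing directory lookup.
# `directory` is a list of 2**global_depth bucket references.

def bucket_for(key, directory, global_depth, h=hash):
    hv = h(key)
    index = hv & ((1 << global_depth) - 1)   # keep the last global_depth bits
    return directory[index]

# The slide's example: global depth 2, h(r) = 5 = binary 101,
# so the last two bits are 01 and directory entry 01 is followed.
directory = ["Bucket A", "Bucket B", "Bucket C", "Bucket D"]  # entries 00..11
print(bucket_for(5, directory, 2, h=lambda k: k))  # -> Bucket B
```

Because several directory entries can point to the same bucket, the global depth (bits used by the directory) can exceed a bucket's local depth (bits that actually distinguish its entries); the two are equal only when exactly one directory entry points to the bucket.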
Insert h(r) = 20 (Causes Doubling)
[Figure: 20 = binary 10100 hashes to Bucket A {4*, 12*, 32*, 16*}, which is full and has local depth = global depth = 2, so the directory doubles to global depth 3. Bucket A keeps {32*, 16*} and its `split image' Bucket A2 gets {4*, 12*, 20*}; both now have local depth 3. (4 = 00100, 12 = 01100, 20 = 10100, 16 = 10000, 32 = 100000.)]

Extendible Hashing Insert
• Check if the bucket is full. If not, insert the entry; done!
• Otherwise, check whether local depth = global depth.
  – If local depth < global depth: split the bucket, rehash its entries into the two resulting buckets, and increment the local depth.
  – If local depth = global depth: double the directory first, then split the bucket and rehash its entries into the two resulting buckets.
• The directory is doubled by copying it over and `fixing' the pointer to the split image page.
  – This copying trick works only if the directory is indexed by the least significant bits of the hash value.
  – (A runnable sketch of this procedure appears at the end of this section.)

Insert 9
[Figure: 9 = binary 1001 hashes to Bucket B {1*, 5*, 21*, 13*}, which is full, but its local depth (2) is less than the global depth (3), so no doubling is needed. Bucket B splits into Bucket B {1*, 9*} and `split image' Bucket B2 {5*, 13*, 21*}, both with local depth 3. (1 = 0001, 5 = 0101, 21 = 10101, 13 = 1101, 9 = 1001.)]

Directory Doubling
• Why use the least significant bits in the directory? It allows doubling via copying!
[Figure: doubling a directory of size 4, illustrated with 6 = binary 110. Using least significant bits, the new directory is the old directory followed by a copy of itself, so pointers can be copied wholesale. Using most significant bits, each old entry must instead be expanded in place, which cannot be done with a simple copy.]

Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; otherwise two.
  – Example: a 100 MB file with 100-byte records holds 1M data entries. A 4 KB page (one bucket) holds about 40 data entries, so roughly 25,000 directory elements are needed; chances are high that the directory will fit in memory.
  – If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large.
• Delete: if the removal of a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
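Following up on the Extendible Hashing Insert procedure above, here is a compact sketch (illustrative class and names, assuming small integer keys hashed by identity) of the split-or-double decision driven by comparing local and global depth:

```python
# Sketch of extendible-hashing insert (illustrative; integer keys, identity hash).

class Bucket:
    def __init__(self, local_depth, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # entries 0 and 1

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.entries) < bucket.capacity:
            bucket.entries.append(key)            # bucket not full: done
            return
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2   # double by copying
            self.global_depth += 1
        # Split the full bucket: one more bit now distinguishes its entries.
        old = bucket.entries + [key]
        bucket.entries = []
        image = Bucket(bucket.local_depth + 1, bucket.capacity)
        bucket.local_depth += 1
        bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):    # fix pointers to split image
            if b is bucket and (i & bit):
                self.directory[i] = image
        for k in old:                             # redistribute the entries
            self.insert(k)
```

Starting from this small directory, inserting 4, 12, 32, 16 and then 20 ends with one bucket holding {32, 16} and its split image holding {4, 12, 20}, mirroring the doubling example above.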
Linear Hashing (LH)
• Another dynamic hashing scheme, and an alternative to Extendible Hashing.
  – LH fixes the problem of long overflow chains (as in static hashing) without using a directory (as in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, ...
  – Each function's range is twice that of its predecessor.
  – Pages are split when overflows occur, but not necessarily the overflowing page. (Splitting occurs in turn, in round-robin fashion.)
  – Buckets are added gradually, one bucket at a time.
  – When all the pages at one level (under the current hash function) have been split, a new level begins.
  – Primary pages are allocated consecutively.

Levels of Linear Hashing
• Initial stage.
  – The initial level distributes entries into N0 buckets; call the hash function that does this h0.
• Splitting buckets.
  – If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
  – Also, when a bucket overflows, some bucket is split:
    • The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
    • The next bucket to be split is the second bucket in the file, and so on, until the N0-th bucket has been split.
    • When a bucket is split, its entries (including those in overflow pages) are redistributed using h1.
  – To access split buckets, the next level's hash function (h1) is applied; h1 maps entries to 2N0 (= N1) buckets.

Levels of Linear Hashing (Contd.)
• Level progression:
  – Once all Ni buckets of the current level i have been split, the hash function hi is replaced by hi+1.
  – The splitting process then starts again at the first bucket, and hi+2 is applied to find entries in split buckets.

Linear Hashing Example
[Figure: four buckets under h0 (00, 01, 10, 11) holding {64, 36}, {1, 17, 5}, {6}, and {31, 15}; `next' points to the first bucket.]
• Initially the level equals 0 and N0 = 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets; h0 is in use and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted: it will not fit in the second bucket.
• Note that `next' indicates which bucket is to split next (round-robin).

Linear Hashing Example 2
[Figure: the bucket indicated by `next' is split using h1 into bucket 000 {64} and new bucket 100 {36}; `next' advances to the second bucket. 9 goes into the second bucket {1, 17, 5}, which chains an overflow page {9}.]
• The page indicated by `next' is split (the first one), and `next' is incremented.
• An overflow page is chained to the primary page to hold the inserted value.
• If h0 maps a value to a bucket between zero and next - 1 (just the first bucket in this case), h1 must be used instead to locate the entry.
• Note how the new page falls naturally into the sequence as the fifth page.

Linear Hashing Example 3
[Figure: the file after the further inserts 8, 7, 18, 14, 11, 32, 16, 10, 13, 23; each overflow advances `next' by splitting another bucket in turn.]
• After the remaining level-0 buckets are split, the level becomes 1 (N1 = 8) and h1 is used everywhere.
• Subsequent splits will use h2 for entries in buckets between the first bucket and next - 1.

Linear Hashing vs. Extendible Hashing
• What is the similarity?
  – One full round-robin round of splitting in LH corresponds to one one-step doubling of the directory in EH.
• What are the differences?
  – Directory overhead (EH) vs. none (LH).
  – Overflow pages (LH) vs. none (EH).
  – Gradual splitting of pages (LH) vs. one-step doubling of the directory (EH).
  – Pages allocated in order (LH) vs. not in order (EH).
  – Splitting non-overflowing pages (LH) vs. splitting overflowing pages (EH).

Summary
• Hash-based indexes are best for equality searches; they cannot support range searches.
• Static hashing can lead to long overflow chains.
• Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may still require overflow pages.) A directory keeps track of buckets and doubles periodically.
  – The directory can get large with skewed data, and costs an additional I/O when it does not fit in main memory.
  – A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!

Summary (Contd.)
• Linear hashing avoids a directory by splitting buckets round-robin and using overflow pages.
  – Overflow chains are not likely to be long.
  – Space utilization could be lower than with extendible hashing, since splits are not concentrated on `dense' data areas.
  – The criterion for triggering splits can be tuned to trade slightly longer overflow chains for better space utilization.
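To make the round-robin splitting in the examples above concrete, here is a minimal sketch (illustrative names; integer keys with h_i(key) = key mod (N0 * 2^i), and a bucket modeled as a plain list so an overflow is simply the list growing past its capacity):

```python
# Sketch of linear-hashing insert with round-robin splits (illustrative).

class LinearHash:
    def __init__(self, n0=4, capacity=3):
        self.n0 = n0                  # number of buckets at level 0
        self.level = 0
        self.next = 0                 # next bucket to split (round-robin)
        self.capacity = capacity      # entries per primary page
        self.buckets = [[] for _ in range(n0)]

    def _bucket_index(self, key):
        n = self.n0 * (2 ** self.level)
        idx = key % n                 # h_level
        if idx < self.next:           # bucket already split this round,
            idx = key % (2 * n)       # so use h_{level+1} instead
        return idx

    def insert(self, key):
        idx = self._bucket_index(key)
        self.buckets[idx].append(key)
        if len(self.buckets[idx]) > self.capacity:  # overflow: trigger a split
            self._split()

    def _split(self):
        n = self.n0 * (2 ** self.level)
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])       # new primary page at position next + n
        for k in old:                 # redistribute with h_{level+1}
            self.buckets[k % (2 * n)].append(k)
        self.next += 1
        if self.next == n:            # the whole level has been split:
            self.level += 1           # move to the next level and start over
            self.next = 0
```

Replaying the first example: after inserting 64, 36, 1, 17, 5, 6, 31, 15, inserting 9 overflows the second bucket, which splits the first bucket into pages 0 {64} and 4 {36} and advances `next', just as in Example 2. The split trigger used here (every overflow) is the simplest choice; as noted above, the criterion can be tuned to trade chain length against space utilization.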