CS 405G: Introduction to Database Systems
13 Hashing
Chen Qian
University of Kentucky

Searching
Consider the problem of searching an array for a given value.
  If the array is not sorted, the search requires O(n) time.
  If the array is sorted, we can do a binary search.
    A binary search requires O(log n) time.
  It doesn’t seem like we could do much better.
How about an O(1), that is, constant-time, search?
  We can do it if the array is organized in a particular way.

Hashing algorithm
Suppose we were to come up with a “magic function” that, given a value to search for, would tell us exactly where in the array to look.
  If it’s in that location, it’s in the array.
  If it’s not in that location, it’s not in the array.
This function is called a hash function because it “makes hash” of its inputs.

Example (ideal) hash function
Suppose our hash function gave us the following values:
  hashCode("apple") = 5
  hashCode("watermelon") = 3
  hashCode("grapes") = 8
  hashCode("cantaloupe") = 7
  hashCode("kiwi") = 0
  hashCode("strawberry") = 9
  hashCode("mango") = 6
  hashCode("banana") = 2

  0  kiwi
  1
  2  banana
  3  watermelon
  4
  5  apple
  6  mango
  7  cantaloupe
  8  grapes
  9  strawberry

Sets and tables
Sometimes we just want a set of things: objects are either in it, or they are not in it.
Sometimes we want a map: a way of looking up one thing based on the value of another.
  We use a key to find a place in the map.
  The associated value is the information we are trying to look up.

       key      value
  ...
  141
  142  robin    robin info
  143  sparrow  sparrow info
  144  hawk     hawk info
  145  seagull  seagull info
  146
  147  bluejay  bluejay info
  148  owl      owl info

Finding the hash function
How can we come up with this magic function?
In general, we cannot: there is no such magic function.
What is the next best thing?
In a few specific cases, where all the possible values are known in advance, it has been possible to compute a perfect hash function.
A perfect hash function would tell us exactly where to look.
In general, the best we can do is a function that tells us where to start looking!

Example imperfect hash function
Suppose our hash function gave us the following values:
  hash("apple") = 5
  hash("watermelon") = 3
  hash("grapes") = 8
  hash("cantaloupe") = 7
  hash("kiwi") = 0
  hash("strawberry") = 9
  hash("mango") = 6
  hash("banana") = 2
  hash("honeydew") = 6

  0  kiwi
  1
  2  banana
  3  watermelon
  4
  5  apple
  6  mango
  7  cantaloupe
  8  grapes
  9  strawberry

• Now what?

Collisions
When two values hash to the same array location, this is called a collision.
Collisions are normally treated as “first come, first served”: the first value that hashes to the location gets it.
We have to find something to do with the second and subsequent values that hash to this same location.

Handling collisions
What can we do when two different values attempt to occupy the same place in an array?
  Solution #1: Search from there for an empty location.
    Can stop searching when we find the value or an empty location.
    Search must be end-around.
  Solution #2: Use a second hash function
    ...and a third, and a fourth, and a fifth, ...
  Solution #3: Use the array location as the header of a linked list of values that hash to this location.
All these solutions work, provided:
  We use the same technique to add things to the array as we use to search for things in the array.

Insertion, I
Suppose you want to add seagull to this hash table.
Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is empty
Therefore, put seagull at location 145.

  ...
  141
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146
  147  bluejay
  148  owl
  ...
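The probing rule of Solution #1 can be sketched in Python as follows. This is a minimal illustration, not a production hash table: the class name `LinearProbeTable`, the table size of 149 (slots 0 to 148), and the lookup dictionary `codes` that stands in for a real hash function are all assumptions made for the example. The hash codes themselves are the ones used in the bird example of these slides.

```python
from typing import Callable, List, Optional


class LinearProbeTable:
    """Open-addressing hash table using linear probing (Solution #1)."""

    def __init__(self, size: int, hash_fn: Callable[[str], int]) -> None:
        self.slots: List[Optional[str]] = [None] * size
        self.hash_fn = hash_fn  # hypothetical: supplied by the caller

    def _probe(self, value: str) -> Optional[int]:
        """Return the slot holding value, or the first empty slot reached.

        Returns None only when the table is full and value is absent.
        """
        size = len(self.slots)
        start = self.hash_fn(value) % size
        for step in range(size):
            i = (start + step) % size  # end-around: after the last slot comes 0
            if self.slots[i] is None or self.slots[i] == value:
                return i
        return None

    def insert(self, value: str) -> int:
        """Store value (if not already present) and return its slot."""
        i = self._probe(value)
        if i is None:
            raise OverflowError("hash table is full")
        self.slots[i] = value  # no change if value was already there
        return i

    def find(self, value: str) -> Optional[int]:
        """Return the slot of value, or None if an empty slot is hit first."""
        i = self._probe(value)
        if i is None or self.slots[i] is None:
            return None
        return i


# Hash codes copied from the bird example; a real hash function would compute them.
codes = {"robin": 142, "sparrow": 143, "hawk": 143, "seagull": 143,
         "bluejay": 147, "owl": 148, "cow": 144, "cardinal": 147}
table = LinearProbeTable(149, codes.__getitem__)
for bird in ("robin", "sparrow", "hawk", "bluejay", "owl"):
    table.insert(bird)

seagull_slot = table.insert("seagull")    # probes 143, 144, then lands in 145
cow_slot = table.find("cow")              # probes 144, 145, then hits empty 146
cardinal_slot = table.insert("cardinal")  # probes 147, 148, then wraps to 0
```

Insertion and search share the same `_probe` routine, which is exactly the condition under which the collision-handling schemes work.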
Searching, I
Suppose you want to look up seagull in this hash table.
Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is not empty
  table[145] == seagull !
We found seagull at location 145.

  ...
  141
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146
  147  bluejay
  148  owl
  ...

Searching, II
Suppose you want to look up cow in this hash table.
Also suppose:
  hashCode(cow) = 144
  table[144] is not empty
  table[144] != cow
  table[145] is not empty
  table[145] != cow
  table[146] is empty
If cow were in the table, we should have found it by now.
Therefore, it isn’t here.

  ...
  141
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146
  147  bluejay
  148  owl
  ...

Insertion, II
Suppose you want to add hawk to this hash table.
Also suppose:
  hashCode(hawk) = 143
  table[143] is not empty
  table[143] != hawk
  table[144] is not empty
  table[144] == hawk
hawk is already in the table, so do nothing.

  ...
  141
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146
  147  bluejay
  148  owl
  ...

Insertion, III
Suppose:
  You want to add cardinal to this hash table.
  hashCode(cardinal) = 147
  The last location is 148.
  147 and 148 are occupied.
Solution:
  Treat the table as circular; after 148 comes 0.
  Hence, cardinal goes in location 0 (or 1, or 2, or ...).

  ...
  141
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146
  147  bluejay
  148  owl

Efficiency
Hash tables are actually surprisingly efficient.
Until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3.
Sophisticated mathematical analysis is required to prove that the expected cost of inserting into a hash table, or looking something up in the hash table, is O(1).
Even if the table is nearly full (leading to occasional long searches), efficiency is usually still quite high.

Hash for database indexing
Hash-based indexes
  are best for equality selections
  cannot support range searches
A good hash function is:
  Random
  Uniform
Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.
7/1/2016 Chen Qian @ University of Kentucky

Static Hashing
Each record (entry) has a key k.
# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets)
[Figure: key -> h -> h(key) mod N selects a bucket in 0 ... N-1. Labels: Primary bucket pages, Overflow pages.]

Static Hashing (Contd.)
Buckets contain data records.
Hashing works on search key field of record r. Must distribute values over range 0 ... N-1.
  h(key) = a binary number with 32/64/128/256 bits.
  h(k) mod N is a way to distribute values.
Long overflow chains can develop and degrade performance.
  Extendible and Linear Hashing: Dynamic techniques to fix this problem.

Extendible Hashing
Situation: Bucket (primary page) becomes full. Why not reorganize file by doubling # of buckets?
  Reading and writing all pages is expensive!
Idea: Use directory of pointers to buckets; double # of buckets by doubling the directory, splitting just the bucket that overflowed!
  Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
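The directory idea above can be sketched in Python. This is a simplified in-memory illustration under stated assumptions, not a disk-based index: the names `Bucket` and `ExtendibleHash` are hypothetical, the hash function is assumed to be the identity (h(key) = key), buckets hold 4 entries, and there are no overflow pages, so many keys with identical hash values would make the retry loop split forever. The keys are the ones used in the extendible hashing example of these slides.

```python
from typing import List


class Bucket:
    def __init__(self, local_depth: int) -> None:
        self.local_depth = local_depth
        self.keys: List[int] = []


class ExtendibleHash:
    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity          # entries per bucket (assumed)
        self.global_depth = 0
        self.directory: List[Bucket] = [Bucket(0)]

    def _slot(self, key: int) -> int:
        # Assumption: h(key) = key. Use the last `global_depth` bits.
        return key & ((1 << self.global_depth) - 1)

    def contains(self, key: int) -> bool:
        return key in self.directory[self._slot(key)].keys

    def insert(self, key: int) -> None:
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        # Bucket is full. Double the directory only if the bucket already
        # uses as many bits as the directory does.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # copy it over
            self.global_depth += 1
        # Split just this bucket; the split image takes the entries whose
        # newly examined bit is 1.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image  # fix pointers to the split image
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(key)  # retry; may split again (no overflow pages here)


eh = ExtendibleHash(capacity=4)
for k in (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19):
    eh.insert(k)
depth_before = eh.global_depth   # directory of size 4, four buckets
eh.insert(20)                    # bucket for 00 is full and local == global
depth_after = eh.global_depth
```

Note that the insert of 10 and the insert of 20 both double the directory, while the insert of 15 splits a bucket without doubling because that bucket's local depth was still below the global depth.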
Example
Directory is array of size 4.
To find bucket for r, take last `global depth’ # bits of h(r).
  If h(r) = 5 = binary 101, it is in bucket pointed to by 01.

  GLOBAL DEPTH = 2
  DIRECTORY   LOCAL DEPTH   DATA PAGES
  00          2             4* 12* 32* 16*    Bucket A
  01          2             1* 5* 21* 13*     Bucket B
  10          2             10*               Bucket C
  11          2             15* 7* 19*        Bucket D

Insert: If bucket is full, split it (allocate new page, re-distribute).
If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

Suppose we now insert 20, whose hash is 10100.
  Bucket A is full.
  Split A into A1 and A2.
    A1: values ending with 000
    A2: values ending with 100

Insert h(r)=20 (Causes Doubling)

  GLOBAL DEPTH = 3
  DIRECTORY   LOCAL DEPTH   DATA PAGES
  000         3             32* 16*           Bucket A
  001         2             1* 5* 21* 13*     Bucket B
  010         2             10*               Bucket C
  011         2             15* 7* 19*        Bucket D
  100         3             4* 12* 20*        Bucket A2 (`split image' of Bucket A)
  101         (points to Bucket B)
  110         (points to Bucket C)
  111         (points to Bucket D)

Points to Note
20 = binary 10100. The last 2 bits (00) tell us whether r belongs in A or in one of the other buckets. The last 3 bits are needed to tell us whether r belongs in A1 or A2 (A1 is labelled Bucket A in the figure).
  Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to.
  Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
When does bucket split cause directory doubling?
  Before insert, local depth of bucket = global depth.
  Insert causes local depth to become > global depth; directory is doubled by copying it over and `fixing’ the pointer to the split image page.

Comments on Extendible Hashing
Static hash: one disk access for equality search.
If directory fits in memory, equality search answered with one disk access; else two.
  A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements (about 40 entries fit on a page, so 1,000,000 / 40 = 25,000 pages); chances are high that directory will fit in memory.
  Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
  Multiple entries with same hash value cause problems!

Linear Hashing
This is another dynamic hashing scheme, an alternative to Extendible Hashing.
Linear hashing handles the problem of long overflow chains without using a directory, and handles duplicates.
Idea: Use a family of hash functions h_0, h_1, h_2, ...
  h_i(key) = h(key) mod (2^i N); N = initial # buckets
  h is some hash function (range is not 0 to N-1)
  h_0 ranges in 0 to N-1. (If N = 2^d, looking at the last d bits)
  h_1 ranges in 0 to 2N-1. (looking at the last d+1 bits)
  h_{i+1} doubles the range of h_i (similar to directory doubling)

Linear Hashing (Contd.)
Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin.
  Splitting proceeds in `rounds’. Round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R - 1 yet to be split.
  Current round number is Level.
Search: To find bucket for data entry r, find h_Level(r):
  If h_Level(r) is in range `Next to N_R - 1’, r belongs here.
  Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out.

Overview of LH File
In the middle of a round.
[Figure: layout of an LH file in the middle of a round. Labels:
  Next: bucket to be split.
  Buckets that existed at the beginning of this round: this is the range of h_Level.
  Buckets split in this round: if h_Level(search key value) is in this range, must use h_{Level+1}(search key value) to decide if entry is in `split image' bucket.
  `split image' buckets: created (through splitting of other buckets) in this round.]

Linear Hashing (Contd.)
Insert: Find bucket by applying h_Level and h_{Level+1}:
  If bucket to insert into is full:
    Add overflow page and insert data entry.
    Split Next bucket and increment Next.
Since buckets are split round-robin, long overflow chains don’t develop!

Example of Linear Hashing
Level=0, N=4, Next=0. (Each row is a primary bucket page; 5* denotes the data entry r with h(r)=5.)

  h_1   h_0   PRIMARY PAGES
  000   00    32* 44* 36*
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*

Now adding 43: Level=0, Next=1.

  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*      43*
  100   00    44* 36*

Example: End of a Round
Before: Level=0, Next=3.

  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25*
  010   10    66* 18* 10* 34*
  011   11    31* 35* 7* 11*      43*
  100   00    44* 36*
  101   01    5* 37* 29*
  110   10    14* 30* 22*

After: Level=1, Next=0.

  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25*
  010   10    66* 18* 10* 34*     50*
  011   11    43* 35* 11*
  100   00    44* 36*
  101   01    5* 37* 29*
  110   10    14* 30* 22*
  111   11    31* 7*

LH Described as a Variant of EH
The two schemes are actually quite similar:
  Begin with an EH index where directory has N elements.
  Use overflow pages, split buckets round-robin.
  So, directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding i’th is easy), we actually don’t need a directory! Voila, LH.

Summary
Hash-based indexes: best for equality searches, cannot support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
  Directory to keep track of buckets, doubles periodically.
  Can get large with skewed data; additional I/O if this does not fit in main memory.

Summary (Contd.)
Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.
  Overflow chains not likely to be long.
  Duplicates handled easily.
  Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas.
For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!

Quiz 5 this Wed.
Homework due 4/25
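As a closing illustration, the linear hashing rules summarized above (round-robin splitting, h_Level and h_{Level+1}, overflow pages) can be sketched in Python. This is a simplified in-memory sketch under stated assumptions: the class name `LinearHash` is hypothetical, h(key) = key is assumed, each bucket is a single list whose entries beyond `capacity` stand in for overflow pages, and a split is triggered whenever an insert overflows a bucket. The data reproduces the "adding 43" example from the slides.

```python
from typing import List


class LinearHash:
    def __init__(self, n_initial: int = 4, capacity: int = 4) -> None:
        self.n_initial = n_initial   # N: initial number of buckets
        self.capacity = capacity     # entries per primary page (assumed)
        self.level = 0               # current round number
        self.next_split = 0          # Next: bucket to be split
        self.buckets: List[List[int]] = [[] for _ in range(n_initial)]

    def _h(self, level: int, key: int) -> int:
        # h_level(key) = h(key) mod (2^level * N), assuming h(key) = key
        return key % ((2 ** level) * self.n_initial)

    def bucket_of(self, key: int) -> int:
        b = self._h(self.level, key)
        if b < self.next_split:
            # This bucket was already split in this round, so the entry may
            # live in it or in its split image: h_{level+1} decides.
            b = self._h(self.level + 1, key)
        return b

    def contains(self, key: int) -> bool:
        return key in self.buckets[self.bucket_of(key)]

    def insert(self, key: int) -> None:
        bucket = self.buckets[self.bucket_of(key)]
        overflowed = len(bucket) >= self.capacity
        bucket.append(key)           # beyond capacity = overflow page
        if overflowed:
            self._split_next()

    def _split_next(self) -> None:
        old = self.buckets[self.next_split]
        self.buckets[self.next_split] = []
        self.buckets.append([])      # split image at index Next + N_R
        for k in old:
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next_split += 1
        if self.next_split == (2 ** self.level) * self.n_initial:
            self.level += 1          # round ends: all N_R buckets were split
            self.next_split = 0


lh = LinearHash(n_initial=4, capacity=4)
for k in (32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11):
    lh.insert(k)                     # fills the four buckets, no overflow yet
lh.insert(43)                        # bucket 11 overflows, bucket 0 is split
```

The bucket that is split (bucket 0) is not the one that overflowed (bucket 3); the overflowing bucket keeps 43 on an overflow page until its own turn comes later in the round.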