Hash-Based Indexes (From Chapter 11) Database Management Systems, R. Ramakrishnan and Johannes Gehrke Introduction As for any index, 3 alternatives for data entries k*: Hash-based indexes are best for equality selections. Static and dynamic hashing techniques exist Database Management Systems, R. Ramakrishnan and Johannes Gehrke Static Hashing # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets) h(key) mod N key 0 2 h N-1 Primary bucket pages Overflow pages Database Management Systems, R. Ramakrishnan and Johannes Gehrke Static Hashing (Contd.) Buckets contain data entries. Hash fn works on search key field(s) of record r. Must distribute values over range 0 ... N-1. h(key) = Long overflow chains can develop and degrade performance Database Management Systems, R. Ramakrishnan and Johannes Gehrke Extendible Hashing Main idea: If bucket (primary page) becomes full, why not re-organize file by doubling # of buckets? But reading and writing all buckets is expensive! Idea: Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=14 LOCAL DEPTH GLOBAL DEPTH 2 00 2 4* 12* 32*16* Bucket A 2 1* 5* 21*13* Bucket B 01 10 2 11 10* Bucket C 2 DIRECTORY 15* 7* 19* Bucket D Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=20 2 LOCAL DEPTH 4* 12* 32* 16* GLOBAL DEPTH Bucket A 2 2 1* 5* 21*13* Bucket B 00 01 10 2 11 10* Bucket C 2 DIRECTORY 15* 7* 19* Bucket D Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=20 2 LOCAL DEPTH 32* 16* GLOBAL DEPTH Bucket A 2 2 1* 5* 21*13* Bucket B 00 01 10 2 11 10* Bucket C 2 DIRECTORY 15* 7* 19* Bucket D 2 4* 12* 20* Bucket A2 (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=32 LOCAL DEPTH GLOBAL DEPTH 0 0 4* 12* 1* 10* Bucket A DIRECTORY Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=16 LOCAL DEPTH GLOBAL DEPTH 1 1 4* 12* 10* 32* Bucket A 0 1 1 DIRECTORY 1* Bucket B Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=20 LOCAL DEPTH GLOBAL DEPTH 2 2 4* 12* 32* 16* 1 1* 00 Bucket A Bucket B 01 10 2 11 10* Bucket C DIRECTORY Database Management Systems, R. Ramakrishnan and Johannes Gehrke Insert h(r)=5, 15, 7, 19 LOCAL DEPTH GLOBAL DEPTH 3 000 3 32* 16* Bucket A 1 1* 5* 15* 7* Bucket B 001 010 2 011 10* Bucket C 100 101 110 111 3 DIRECTORY 4* 12* 20* Bucket A2 (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and Johannes Gehrke Deletions Inverse of insertion If removal of data entry makes bucket empty, merge with ‘split image’ If each directory element points to same bucket as its split image, can halve directory Database Management Systems, R. Ramakrishnan and Johannes Gehrke Comments on Extendible Hashing If directory fits in memory, equality search answered with _____ I/O; else _____ 100MB file, 100 bytes/rec, 4K pages contain 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. Directory grows in spurts, and, if the distribution of hash values is ________, directory can grow large Database Management Systems, R. Ramakrishnan and Johannes Gehrke Linear Hashing This is another dynamic hashing scheme, an alternative to Extendible Hashing LH handles the problem of long overflow chains without using a directory, and handles duplicates Main idea: Database Management Systems, R. Ramakrishnan and Johannes Gehrke Inserting h(r) = 43 Level=2, N=4 h PRIMARY Next=0 PAGES h 3 2 000 00 001 01 010 10 011 11 32*44* 36* 9* 25* 5* 14* 18*10*30* 31*35* 7* 11* (This info is for illustration only!) (The actual contents of the linear hashed file) Database Management Systems, R. Ramakrishnan and Johannes Gehrke Example (Inserting h(r) = 43) Level=2 h h 3 PRIMARY 2 Next=0 PAGES 000 00 001 01 010 10 011 11 OVERFLOW PAGES 32* 44* 36* 9* 25* 5* 14* 18*10*30* 31*35* 7* 11* 43* Database Management Systems, R. Ramakrishnan and Johannes Gehrke Inserting h(r) = 50 (End of a Round) Level=2 h3 h2 000 00 001 01 010 10 PRIMARY PAGES OVERFLOW PAGES 32* 9* 25* 66*18* 10* 34* Next=3 31*35* 7* 11* 011 11 100 00 101 01 5* 37*29* 110 10 14*30*22* 43* 44*36* Database Management Systems, R. Ramakrishnan and Johannes Gehrke Overview of LH File In the middle of a round. Bucket to be split Next Buckets that existed at the beginning of this round: this is the range of Buckets split in this round: If h Level ( search key value ) is in this range, must use h Level+1 ( search key value ) to decide if entry is in `split image' bucket. hLevel `split image' buckets: created (through splitting of other buckets) in this round Database Management Systems, R. Ramakrishnan and Johannes Gehrke Summary Hash-based indexes: best for ______ searches, cannot support _____ searches. Static Hashing can lead to ________________. Extendible Hashing uses directory doubling to avoid ___________ Duplicates may require ________________ Linear hashing avoids directory by splitting in rounds Naturally handles ______________ Uses overflow buckets (but not very long in practice) Database Management Systems, R. Ramakrishnan and Johannes Gehrke