Hashing CS 245: Database System Principles Notes 5: Hashing and More <key> key h(key) Hector Garcia-Molina CS 245 . . . Notes 5 1 Two alternatives Notes 5 2 Two alternatives . . . (2) key h(key) records (1) key h(key) CS 245 CS 245 Buckets (typically 1 disk block) key 1 . . . record Index Notes 5 3 Two alternatives CS 245 Notes 5 4 Example hash function (2) key h(key) key 1 • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h: add x1 + x2 + ….. xn record – Index compute sum modulo b • Alt (2) for “secondary” search key CS 245 Notes 5 5 CS 245 Notes 5 6 1 This may not be best function … Read Knuth Vol. 3 if you really need to select a good function. This may not be best function … Read Knuth Vol. 3 if you really need to select a good function. Good hash function: CS 245 Notes 5 7 Expected number of keys/bucket is the same for all buckets CS 245 Notes 5 8 Next: example to illustrate inserts, overflows, deletes Within a bucket: • Do we keep keys sorted? h(K) • Yes, if CPU time critical & Inserts/Deletes not too frequent CS 245 Notes 5 9 EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 CS 245 Notes 5 10 EXAMPLE 2 records/bucket 0 INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 1 2 3 0 d 1 a c 2 b 3 h(e) = 1 CS 245 Notes 5 11 CS 245 Notes 5 12 2 EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 EXAMPLE: deletion Delete: e f d 1 a c 2 b e 3 0 a 1 b c 2 e 3 f g h(e) = 1 CS 245 Notes 5 13 EXAMPLE: deletion Delete: e f c 0 b c 2 e 3 f g CS 245 Notes 5 14 EXAMPLE: deletion Delete: e f c a 1 CS 245 d d a 1 b c d e 2 3 maybe move “g” up Notes 5 0 15 CS 245 f g d maybe move “g” up Notes 5 Rule of thumb: Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit 16 • If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket CS 245 Notes 5 17 CS 245 Notes 5 18 3 How do we cope with growth? How do we cope with growth? • Overflows and reorganizations • Dynamic hashing • Overflows and reorganizations • Dynamic hashing • Extensible • Linear CS 245 Notes 5 19 Extensible hashing: two ideas Notes 5 20 (b) Use directory (a) Use i of b bits output by hash function b h(K) CS 245 . . . h(K)[i ] to bucket . . . 00110101 use i grows over time…. CS 245 Notes 5 21 Example: h(k) is 4 bits; 2 keys/bucket i= 1 CS 245 22 Example: h(k) is 4 bits; 2 keys/bucket 1 0001 i= 1 1 1001 1100 1 0001 1 1001 1010 1100 Insert 1010 CS 245 Notes 5 Insert 1010 Notes 5 23 CS 245 1 1100 Notes 5 24 4 Example continued Example: h(k) is 4 bits; 2 keys/bucket i= 1 i= 2 i=2 1 0001 00 00 01 01 Insert 1010 10 10 1 2 1001 1010 1100 11 11 Insert: New directory 1 2 1100 0111 1 0001 2 1001 1010 2 1100 0000 CS 245 Notes 5 25 Example continued i= 2 00 01 10 11 Insert: 0111 CS 245 Example continued 0000 0001 i= 2 00 1 0001 0111 0111 01 10 2 1001 11 1010 Insert: 2 1100 0111 0000 CS 245 Notes 5 27 CS 245 1 2 0001 0111 0111 2 1001 1010 2 1100 Notes 5 0000 2 0001 i= 2 00 0111 2 0001 0111 2 01 01 10 10 1001 11 11 1001 1001 2 Insert: 1001 CS 245 28 Example continued 0000 2 00 2 0000 0001 0000 Notes 5 Example continued i= 2 26 1010 1001 2 1010 Insert: 1100 2 Notes 5 1001 29 CS 245 1010 1100 2 Notes 5 30 5 Example continued i=3 0000 2 000 0001 i= 2 00 0111 2 011 10 1001 3 11 1001 100 1010 1001 2 3 101 1010 110 1100 2 111 1001 CS 245 • No merging of blocks • Merge blocks and cut directory if possible (Reverse insert procedure) 010 01 Insert: Extensible hashing: deletion 001 Notes 5 31 CS 245 Notes 5 32 Note: Still need overflow chains Deletion example: • Example: many records with duplicate keys • Run thru insert example in reverse! if we split: insert 1100 2 1 1101 1100 2 1100 1100 CS 245 Notes 5 33 Solution: overflow chains CS 245 Summary + add overflow block: insert 1100 CS 245 1 1101 1 1101 1100 1101 Notes 5 Notes 5 34 Extensible hashing Can handle growing files - with less wasted space - with no full reorganizations 1100 35 CS 245 Notes 5 36 6 Summary + Linear hashing Extensible hashing • Another dynamic hashing scheme Can handle growing files - with less wasted space - with no full reorganizations - Indirection - Directory doubles in size Two ideas: (a) Use i low order bits of hash b 01110101 grows i (Not bad if directory in memory) (Now it fits, now it does not) CS 245 Notes 5 37 CS 245 Notes 5 Example b=4 bits, Linear hashing 38 i =2, 2 keys/bucket • Another dynamic hashing scheme Two ideas: (a) Use i low order bits of hash b 01110101 grows i 0000 0101 1010 1111 00 01 Notes 5 Example b=4 bits, 10 11 m = 01 (max used block) (b) File grows linearly CS 245 Future growth buckets 39 i =2, 2 keys/bucket CS 245 Notes 5 Example b=4 bits, 40 i =2, 2 keys/bucket • insert 0101 0000 0101 1010 1111 00 Future growth buckets 01 10 11 m = 01 (max used block) Rule CS 245 0101 1010 1111 00 01 Future growth buckets 10 11 m = 01 (max used block) If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2i -1 Notes 5 0000 Rule 41 CS 245 If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2i -1 Notes 5 42 7 Example b=4 bits, i =2, 2 keys/bucket • insert 0101 0101 0000 0101 1010 1111 00 01 Note • In textbook, n is used instead of m • n=m+1 • can have overflow chains! Future growth buckets 10 n=10 11 m = 01 (max used block) Rule Notes 5 Example b=4 bits, 0000 0101 1010 1111 00 01 10 11 m = 01 (max used block) 0000 0101 1010 1111 CS 245 1111 11 Notes 5 Example b=4 bits, 0000 0101 1010 1111 01 44 i =2, 2 keys/bucket Future growth buckets 1010 10 11 10 45 CS 245 Notes 5 Example b=4 bits, 0101 • insert 0101 Future growth buckets 01 10 11 m = 01 (max used block) 10 Notes 5 10 m = 01 (max used block) i =2, 2 keys/bucket 1010 01 CS 245 00 Notes 5 0101 00 43 Future growth buckets Example b=4 bits, 1010 Future growth buckets m = 01 (max used block) i =2, 2 keys/bucket CS 245 0101 00 If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2i -1 CS 245 0000 0000 0101 1010 1111 00 01 46 i =2, 2 keys/bucket • insert 0101 Future growth buckets 1010 10 11 m = 01 (max used block) 10 11 47 CS 245 Notes 5 48 8 Example b=4 bits, 0101 Example Continued: How to grow beyond this? i =2, 2 keys/bucket • insert 0101 i=2 0000 1010 00 0101 1010 1111 Future growth buckets 0000 0101 1111 01 0101 1010 1111 10 11 0101 10 11 00 01 ... m = 01 (max used block) 10 11 m = 11 (max used block) CS 245 Notes 5 49 CS 245 Example Continued: How to grow beyond this? 0101 i=2 3 1010 1111 0000 0101 0 00 100 0 01 101 50 Example Continued: How to grow beyond this? i=2 3 0000 Notes 5 0101 1010 1111 0101 010 110 0 11 111 ... 0 00 100 m = 11 (max used block) 0 01 101 010 110 0 11 111 100 ... m = 11 (max used block) 100 CS 245 Notes 5 51 Example Continued: How to grow beyond this? CS 245 Notes 5 52 When do we expand file? i=2 3 0000 0101 1010 0 01 101 010 110 0 11 111 # used slots total # of slots CS 245 Notes 5 0101 0101 1111 0101 0 00 100 • Keep track of: 100 =U 101 ... m = 11 (max used block) 100 101 CS 245 Notes 5 53 54 9 Summary When do we expand file? • Keep track of: # used slots total # of slots + Can handle growing files - with less wasted space - with no full reorganizations + No indirection like extensible hashing =U • If U > threshold then increase m (and maybe i ) Can still have overflow chains - CS 245 Notes 5 55 Example: BAD CASE Need to move Would waste space... Notes 5 Notes 5 57 CS 245 Notes 5 Next: Indexing vs Hashing • Indexing vs Hashing • Index definition in SQL • Multiple key access • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 CS 245 Notes 5 56 Hashing - How it works - Dynamic hashing - Extensible - Linear m here… CS 245 CS 245 Summary Very full Very empty Linear Hashing 59 CS 245 Notes 5 58 60 10 Indexing vs Hashing Index definition in SQL • INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 • Create index name on rel (attr) • Create unique index name on rel (attr) CS 245 CS 245 Notes 5 61 Note CANNOT SPECIFY TYPE OF INDEX defines candidate key • Drop INDEX name Notes 5 62 Note ATTRIBUTE LIST MULTIKEY INDEX (next) e.g., CREATE INDEX foo ON R(A,B,C) (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL... CS 245 Notes 5 63 CS 245 Notes 5 Multi-key Index Strategy I: Motivation: Find records where DEPT = “Toy” AND SAL > 50k • Use one index, say Dept. • Get all Dept = “Toy” records and check their salary 64 I1 CS 245 Notes 5 65 CS 245 Notes 5 66 11 Strategy II: Strategy III: • Use 2 Indexes; Manipulate Pointers • Multiple Key Index I2 Toy Sal > 50k CS 245 Notes 5 Example Art Sales Toy Dept Index 67 One idea: I1 CS 245 I3 Notes 5 68 For which queries is this index good? 10k 15k 17k 21k Example Record Name=Joe DEPT=Sales SAL=15k 12k 15k 15k 19k Find Find Find Find RECs RECs RECs RECs Dept = “Sales” Dept = “Sales” Dept = “Sales” SAL = 20k SAL=20k SAL > 20k Salary Index CS 245 Notes 5 69 CS 245 Notes 5 Interesting application: Queries: • Geographic Data • What city is at <Xi,Yi>? • What is within 5 miles from <Xi,Yi>? • Which is closest point to <Xi,Yi>? y DATA: 70 <X1,Y1, Attributes> <X2,Y2, Attributes> ... x CS 245 Notes 5 71 CS 245 Notes 5 72 12 Example i e h n l m Example i e n f h b f o j k a d 10 20 l c g m c g 10 CS 245 Notes 5 Example 73 40 10 25 30 20 15 35 i 20 e h n 20 l 10 j k o Example 25 15 35 20 i e n f h 30 20 c 10 74 40 10 g 20 Notes 5 b f m a d CS 245 b o j k a d 20 l o 10 j k m 20 a d b c g 10 20 5 15 15 CS 245 Notes 5 Example 75 40 10 25 30 20 15 35 i 20 e h n 20 l 10 j k o 10 5 h i CS 245 f d e c l m Example 25 20 e n f 20 l o 10 j k m 20 10 5 a b h i n o j k 77 CS 245 g f 15 15 Notes 5 15 35 i h 30 20 c g 76 40 10 15 15 j k g Notes 5 b f m a d CS 245 l m d e c a d b c g 20 a b • Search points near f • Search points near b n o Notes 5 78 13 Queries • • • • Find Find Find Find points points points points • Many types of geographic index structures have been suggested • • • • with Yi > 20 with Xi < 5 “close” to i = <12,38> “close” to b = <7,24> CS 245 Notes 5 79 Two more types of multi key indexes kd-Trees (very similar to what we described here) Quad Trees R Trees ... CS 245 Notes 5 80 Grid Index Key 2 • Grid • Partitioned hash V1 V2 X1 X2 …… Xn Key 1 Vn To records with key1=V3, key2=X2 CS 245 Notes 5 81 CS 245 Notes 5 CLAIM CLAIM • Can quickly find records with • Can quickly find records with – key 1 = Vi Key 2 = Xj – key 1 = Vi – key 2 = Xj 82 – key 1 = Vi Key 2 = Xj – key 1 = Vi – key 2 = Xj • And also ranges…. – E.g., key 1 Vi key 2 < Xj CS 245 Notes 5 83 CS 245 Notes 5 84 14 • How do we find entry i,j in linear structure? max number of i values N=4 i, j CS 245 max number of i values N=4 position S+0 position S+1 0, 0 0, 1 0, 2 0, 3 1, 0 1, 1 1, 2 1, 3 2, 0 2, 1 2, 2 2, 3 3, 0 pos(i, j) = position S+2 position S+3 pos(i, j) = S + iN + j position S+4 Issue: Cells must be same size, and N must be constant! position S+9 Issue: Some cells may overflow, some may be sparse... Notes 5 85 Solution: Use Indirection V1 V2 V3 V4 X1 X2 X3 ---- Buckets CS 245 ---------- CS 245 87 CS 245 8 3 - position S+9 86 88 Good for multiple-key search Space, management overhead (nothing is free) 1 Linear Scale CS 245 position S+4 Notes 5 + Grid 50K- position S+3 contains pointers to buckets Salary 2 position S+2 *Grid only Grid files 20K-50K position S+1 Notes 5 Can also index grid on value ranges 1 position S+0 • Grid can be regular without wasting space • We do have price of indirection Notes 5 0-20K i, j 0, 0 0, 1 0, 2 0, 3 1, 0 1, 1 1, 2 1, 3 2, 0 2, 1 2, 2 2, 3 3, 0 With indirection: Buckets ---- • How do we find entry i,j in linear structure? 2 Need partitioning ranges that evenly split keys 3 Toy Sales Personnel Notes 5 89 CS 245 Notes 5 90 15 Partitioned hash function Idea: Key1 EX: h1(toy) h1(sales) h1(art) . . h2(10k) h2(20k) h2(30k) h2(40k) . . 010110 1110010 h1 h2 Key2 Insert CS 245 Notes 5 91 CS 245 =0 =1 =1 =01 =11 =01 =00 000 001 010 011 100 101 110 111 <Fred,toy,10k>,<Joe,sales,10k> <Sally,art,30k> Notes 5 92 EX: h1(toy) h1(sales) h1(art) . . h2(10k) h2(20k) h2(30k) h2(40k) . . Insert CS 245 =0 =1 =1 =01 =11 =01 =00 000 001 010 011 100 101 110 111 h1(toy) =0 000 <Fred> h1(sales) =1 001 <Joe><Jan> h1(art) =1 010 <Mary> . 011 . <Sally> h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 <Tom><Bill> <Andy> h2(40k) =00 111 . . • Find Emp. with Dept. = Sales Sal=40k <Fred> <Joe><Sally> <Fred,toy,10k>,<Joe,sales,10k> <Sally,art,30k> Notes 5 93 h1(toy) =0 000 <Fred> h1(sales) =1 001 <Joe><Jan> h1(art) =1 010 <Mary> . 011 . <Sally> h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 <Tom><Bill> <Andy> h2(40k) =00 111 . . • Find Emp. with Dept. = Sales Sal=40k CS 245 Notes 5 CS 245 Notes 5 h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Sal=30k 95 CS 245 Notes 5 94 <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> 96 16 h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Sal=30k CS 245 <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> look here Notes 5 97 h1(toy) =0 000 <Fred> h1(sales) =1 001 <Joe><Jan> h1(art) =1 010 <Mary> . 011 . <Sally> h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 <Tom><Bill> <Andy> h2(40k) =00 111 . . • Find Emp. with Dept. = Sales CS 245 Notes 5 98 Summary h1(toy) =0 000 <Fred> h1(sales) =1 001 <Joe><Jan> h1(art) =1 010 <Mary> . 011 . <Sally> h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 <Tom><Bill> <Andy> h2(40k) =00 111 . . • Find Emp. with Dept. = Sales look here CS 245 Notes 5 Post hashing discussion: - Indexing vs. Hashing - SQL Index Definition - Multiple Key Access - Multi Key Index Variations: Grid, Geo Data - Partitioned Hash 99 CS 245 Notes 5 100 BIG picture…. Reading Chapter 5 The • Skim the following sections: • Chapters 11 & 12 [13]: Storage, records, blocks... • Chapters 13 & 14 [14]: Access Mechanisms - Indexes - B trees - Hashing - Multi key • Chapters 15 & 16 [15, 16]: Query Processing – Sections 14.3.6, 14.3.7, 14.3.8 [Second Ed: 14.6.6, 14.6.7, 14.6.8] – Sections 14.4.2, 14.4.3, 14.4.4 [Second Ed: 14.7.2, 14.7.3, 14.7.4] • Read the rest NEXT CS 245 Notes 5 101 CS 245 Notes 5 102 17