Hashing, Part Two: Better Collision Resolution
Small parts of this material stolen from "File Organization and Access" by Austing and Cassel

Review
◦ A hash function converts a key into a file address
◦ A collision occurs when two or more keys hash to the same address

Collision Avoidance
◦ Good hash function: spreads the keys evenly over the whole address space
◦ Non-dense file: decreases the chance of collisions and decreases the number of probes after a collision

Linear Probing
◦ Very simple collision resolution: if H(key) = A and A is already in use, try A+1, then A+2, etc.
◦ Advantages: easy to implement; guaranteed to use all addresses
◦ Disadvantage: clustering / clumping

Given the following hashes and linear probing:
1. adams = 20
2. bates = 22
3. cole = 20
4. dean = 21
5. evans = 23

These collisions are the result of either
◦ a poor hash function, or
◦ a dense file

Address   Data
20        Adams
21        Cole
22        Bates
23        Dean
24        Evans
25
26

Random Probing
◦ Instead of adding 1, spread out by a random amount
◦ True random would not work (we could never find the record again); instead use a pseudo-random step:

    While A is in use:  A = (A + R) mod T
    A = address, R = a prime, T = table size

Given the same keys and insertion order:
1. adams = 20
2. bates = 22
3. cole = 20
4. dean = 21
5. evans = 23

Linear Probing              Random Probing, R = 5
Address   Data              Address   Data
20        Adams             20        Adams
21        Cole              21        Dean
22        Bates             22        Bates
23        Dean              23        Evans
24        Evans             24
25                          25        Cole
26                          26

But what if 25 and 30 already had keys directly hashed to those locations? Cole would end up at 35 -- four probes in all. (A code sketch after the bucket discussion below reproduces both tables.)

Chaining
◦ Assume a better hash function and a less dense file are not options, and assume linear and random probing lead to coalesced lists...
◦ Chaining: maintain a linked list of collisions, with one list head per address
◦ Example, after the addition of Adams and Cole, with R = 5 (and 25 and 30 already occupied, so Cole is stored at 35):
    19 : null
    20 : 35 -> null
    21 : null
◦ Advantage: faster at resolving collisions
◦ Disadvantage: space

File Reads
◦ File read time = seek time + latency + data read time
◦ Smallest readable portion = 1 cluster = 4 KB (usually)
◦ To access a portion of a file, most of the time goes to seek time and latency, not to the data read
    ◦ so the number of file reads matters more than the size of each read, until the size gets really big
◦ SO... reading a few records from a file takes no more time than reading just one record

Buckets
◦ Given that collisions will occur, why not read 2, or 3, or 4 records instead of just 1 on each read operation?
◦ "Bucket" - a group of records stored at the same address
◦ "Hash file of buckets" - hashed keys collide into small arrays of records in the data file

Sizing the buckets
◦ Use the average number of collisions and the standard deviation?
    ◦ with 1000 records and 200 addresses, the average is 5.0, but the standard deviation might be 1.0
◦ Start by determining how many records fit in one or more disk clusters, then design a good hash function to match that address space

Advantages
◦ Can achieve relatively fast access
    ◦ Remember, the hash function tells us where the record is located, so only 1 read operation is needed. Even with collisions, the whole list of candidate records is read into memory, where searching is fast.
    ◦ Search time = time to read the bucket + time to search the array

Disadvantages
◦ What do we do when the bucket is full?
    ◦ the solutions are similar to collision resolution
    ◦ we end up reading multiple sets of records
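The following is a minimal sketch, not from the slides, comparing linear probing with the pseudo-random step A = (A + R) mod T. The keys and home addresses are the Adams..Evans example above; the table size T = 50 is an assumption for illustration, and only keys (not full records) are stored.

T = 50   # assumed table size
R = 5    # prime step for pseudo-random probing

# home addresses from the example above
home = {"adams": 20, "bates": 22, "cole": 20, "dean": 21, "evans": 23}

def insert(table, key, step):
    # start at the home address; while A is in use, A = (A + step) mod T
    a = home[key]
    while table[a] is not None:
        a = (a + step) % T
    table[a] = key

linear = [None] * T
pseudo = [None] * T
for k in ("adams", "bates", "cole", "dean", "evans"):
    insert(linear, k, 1)   # linear probing: A+1, A+2, ...
    insert(pseudo, k, R)   # pseudo-random probing: A+R, A+2R, ... (mod T)

print({a: k for a, k in enumerate(linear) if k})  # cole lands at 21; dean is pushed to 23, evans to 24
print({a: k for a, k in enumerate(pseudo) if k})  # cole lands at 25; dean and evans stay home

And a minimal sketch of a hash file of buckets, with bucket capacity derived from the cluster size. The 64-byte record size, the 200-bucket address space, the use of Python's built-in hash() as a stand-in hash function, and the spill-to-the-next-bucket policy for full buckets are all assumptions, not anything prescribed by the lecture.

CLUSTER_BYTES = 4096                      # smallest readable portion: 1 cluster (usually 4 KB)
RECORD_BYTES = 64                         # assumed fixed-length record size
CAPACITY = CLUSTER_BYTES // RECORD_BYTES  # 64 records per bucket
N_BUCKETS = 200                           # assumed address space

buckets = [[] for _ in range(N_BUCKETS)]  # in-memory stand-in for the bucket file

def home_bucket(key):
    return hash(key) % N_BUCKETS          # stand-in hash; a real one must spread keys evenly

def add(key, record):
    b = home_bucket(key)
    while len(buckets[b]) >= CAPACITY:    # bucket full: resolve it like a collision
        b = (b + 1) % N_BUCKETS           # spill into the next bucket
    buckets[b].append((key, record))

def find(key):
    b = home_bucket(key)
    while True:
        # one "read" pulls the whole bucket into memory; scanning the small array is fast
        for k, rec in buckets[b]:
            if k == key:
                return rec
        if len(buckets[b]) < CAPACITY:    # a non-full bucket ends the probe chain (no deletions assumed)
            return None
        b = (b + 1) % N_BUCKETS

Search time here is the time to read the bucket plus the time to scan its array; most lookups cost a single bucket read unless the home bucket has overflowed.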
Collisions will happen!

The Poisson function
◦ p(x) gives the probability that a given address will have exactly x records assigned to it:

    p(x) = ((r/N)^x * e^(-r/N)) / x!

    N = number of available addresses
    r = number of records to be stored
    x = number of records assigned to a given address

Given N = 1000 and r = 1000, the probability that a given address has exactly one, two, or three keys hashed to it:
    p(1) = 0.368
    p(2) = 0.184
    p(3) = 0.061

Given N = 10,000 and r = 10,000, how many addresses should have one, two, or three keys hashed to them?
    10,000 x p(1) = 10,000 x 0.3679 = 3679
    10,000 x p(2) = 10,000 x 0.1839 = 1839
    10,000 x p(3) = 10,000 x 0.0613 = 613
So at 1839 addresses one collision will occur, and at 613 addresses at least two will. Many of those collisions will disrupt probing.

Given r = 500, N = 1000, and one record per address:
    Records that never collide                           = 303
    Records that cannot go at their home address         = 107
    Records at their home address that cause collisions  =  90
    Total                                                = 500

Addresses with exactly one record?
    N x p(1) = 1000 x 0.303 = 303

How many overflow records?
    1 x N x p(2) + 2 x N x p(3) + 3 x N x p(4) + ...
    = N x [1 x p(2) + 2 x p(3) + 3 x p(4) + ...]
    = 1000 x [1 x 0.076 + 2 x 0.013 + 3 x 0.002]
    ≈ 107

Percentage of records NOT stored at their home address:
    107 / 500 = 21.4%

Packing Density (%)    Synonyms as % of records
10                      4.8
30                     13.6
50                     21.4
70                     28.1
100                    36.8

(The sketch at the end of these notes reproduces these figures.)

We must balance many factors:
• file size
    • e.g., wasted space in hashed files
    • e.g., extra space for index files
• disk access times
• available memory
• frequency of additions and deletions compared to searches

Best solution of all? Probably a combination of indexed files, hashing, and buckets.

Schedule
◦ Thursday, April 14: no class
◦ Tuesday, April 19: B-Trees
◦ Thursday, April 21: review
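Finally, a minimal sketch (names and structure are my own) that evaluates the Poisson function defined above and reproduces the figures quoted in these notes: p(1)-p(3) for r = N = 1000, the roughly 107 overflow records for r = 500 and N = 1000, and the packing-density table.

import math

def p(x, r, N):
    # Poisson probability that an address receives exactly x of the r records hashed over N addresses
    lam = r / N
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

# r = N = 1000: probability an address gets exactly 1, 2, or 3 keys
print([round(p(x, 1000, 1000), 3) for x in (1, 2, 3)])     # [0.368, 0.184, 0.061]

# r = 500, N = 1000, one record per address: expected overflow records
# (truncate the infinite sum; terms beyond x = 20 are negligible here)
N, r = 1000, 500
overflow = N * sum((x - 1) * p(x, r, N) for x in range(2, 20))
synonyms = round(overflow)                                 # ~107 records cannot go at home
print(synonyms, round(100 * synonyms / r, 1))              # 107, 21.4 (% not at home)

# packing density vs. records not stored at their home address
for density in (10, 30, 50, 70, 100):
    r = N * density // 100
    ov = N * sum((x - 1) * p(x, r, N) for x in range(2, 20))
    print(density, round(100 * ov / r, 1))
# prints 4.8, 13.6, 21.3, 28.1, 36.8 -- the table's 21.4 reflects the rounded 107/500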