Comp 335 File Structures

Hashing

What is Hashing?

A process used with record files that tries to achieve O(1) (i.e., constant-time) access to a record's location in the file.

An algorithm, called a hash function (h), is given a primary key as input; the resulting output is the location of the record within the file:

    h(key) = address

Hashing Example

Assume you want to store 5,000 data records in a file. You want this to be a hashed file for quick access.

Each record is fixed in length, and the primary key for each record is an 8-digit employee number.

A common hash function is modulo arithmetic:

    h(key) = key mod n, with n = 5000
    h(82461792) = 82461792 mod 5000 = 1792

The address (relative record number, RRN) of the record with this key is 1792.

Other Hashing Methods: Folding

Folding requires extracting certain groupings from the key and then adding or multiplying the groupings in some fashion to form the hash address.

Example: Key = "BISON", Address Space = 101

    Step 1: Get the ASCII value of each character in the string: B(66), I(73), S(83), O(79), N(78)
    Step 2: Add the values at even positions (I, O): 73 + 79 = 152
    Step 3: Add the values at odd positions (B, S, N): 66 + 83 + 78 = 227
    Step 4: Multiply the results: 227 * 152 = 34504
    Step 5: Take the result modulo the address space: 34504 mod 101 = 63 (hash address)

Other Hashing Methods: Mid-Square

Involves squaring the "numeric" form of a key and extracting some of the digits from the "middle of the square".

Example: Assume the address space is 1000

    Key (4-digit integer) = 2973
    2973 * 2973 = 8838729
    Extract the "middle" digits = 387 (hash address)

Other Hashing Methods: Radix Transformation

Convert the key to a different base and then use modulo arithmetic.

Example: Address space is 100

    Key is 453 (base 10)
    Conversion: 382 (base 11)
    382 mod 100 = 82 (hash address)

Other Hashing Methods: Multiplicative Function

Involves multiplying the key by some constant less than one; the hash function returns some of the digits of the fractional part of the result.
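The hash functions described so far can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names are made up for this sketch, the folding function hard-codes the odd/even split used in the BISON example, and the radix function assumes that a converted digit larger than 9 simply carries into the next decimal place.

```python
def modulo_hash(key: int, n: int) -> int:
    """h(key) = key mod n."""
    return key % n


def folding_hash(key: str, n: int) -> int:
    """Sum odd-position and even-position character codes, multiply, then mod n."""
    codes = [ord(ch) for ch in key]
    odd_sum = sum(codes[0::2])   # 1st, 3rd, 5th ... characters
    even_sum = sum(codes[1::2])  # 2nd, 4th ... characters
    return (odd_sum * even_sum) % n


def mid_square_hash(key: int, digits: int) -> int:
    """Square the key and keep `digits` digits from the middle of the square."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


def radix_hash(key: int, base: int, n: int) -> int:
    """Rewrite the key in `base`, read those digits as a decimal number, then mod n."""
    converted, place = 0, 1
    while key > 0:
        key, digit = divmod(key, base)
        converted += digit * place  # assumption: digits above 9 carry over
        place *= 10
    return converted % n


def multiplicative_hash(key: int, multiplier: float, digits: int) -> int:
    """Multiply by a constant below one and keep leading digits of the fraction."""
    fraction = (key * multiplier) % 1.0
    return int(fraction * 10 ** digits)


print(modulo_hash(82461792, 5000))             # 1792
print(folding_hash("BISON", 101))              # 63
print(mid_square_hash(2973, 3))                # 387
print(radix_hash(453, 11, 100))                # 453 is 382 in base 11, so 82
print(multiplicative_hash(82165, 0.39731, 3))  # 976
```

Each function takes the address space (or digit count) as a parameter, so the same sketch can be tried against other file sizes.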
Example: Address space = 1000

    Key (5-digit integer): 82165
    Multiplier: 0.39731
    82165 * 0.39731 = 32644.97615
    The first three digits of the fractional part form the hash address = 976

Major Problem with Hashing

Given a random set of keys and a hash function (h), it is highly probable that some keys in the set will be hash synonyms. In other words, different keys in the set can produce the same hash function output.

A hashing algorithm can yield three different types of address distributions:

- Perfect: no synonyms for a given set of keys; the probability of obtaining a perfect distribution from a large set of unknown keys is very, very low (textbook: 1 out of 10^120,000)
- Random: "few" synonyms generated; what we strive for!
- Scud: many synonyms generated

If the set of keys is known beforehand, it is possible to generate a perfect hashing algorithm (Pearson, Cichelli).

Collisions

When two or more keys hash to the same address, this is called a collision. Collisions have to be accounted for with random hashing algorithms.

The handling of collisions becomes a critical issue in the overall search efficiency of a given file. Remember that each step of a search could mean a "disk access".

Decreasing the Probability of Collisions

- Increase the address space. A common technique: allocate more addresses in the file than there are records to store. This can greatly decrease the probability of collisions, assuming the hashing algorithm is random. The obvious disadvantage is wasted space.
- Place more than one record at an address. Such addresses are commonly referred to as buckets. A single address can store an array of records. This has been shown to increase search efficiency.

Collision Resolution

Even if you have tried to decrease the probability of collisions, they still can and will happen.

Ways to resolve collisions:

- Linear Probing
- Double Hashing
- Prime area with overflow
- Chaining

Linear Probing

If a key is hashed to an address that is already occupied or full, search the address space linearly until the first free space is found.
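The probing rule can be sketched in Python as follows. This is a minimal in-memory sketch under stated assumptions: the class name `LinearProbingTable` is made up for illustration, the home address is computed with modulo arithmetic as in the earlier example, each address holds one record, and the search wraps around to address 0 at the end of the address space. Deletion is deliberately left out.

```python
class LinearProbingTable:
    """Toy hashed file: one (key, record) pair per address, None means free."""

    def __init__(self, size: int):
        self.slots = [None] * size

    def _home(self, key: int) -> int:
        # Home address by modulo arithmetic (assumption for this sketch).
        return key % len(self.slots)

    def insert(self, key: int, record) -> int:
        """Store the record at the first free address at or after home."""
        n = len(self.slots)
        home = self._home(key)
        for step in range(n):
            addr = (home + step) % n  # wrap around at the end of the file
            slot = self.slots[addr]
            if slot is None or slot[0] == key:
                self.slots[addr] = (key, record)
                return addr
        raise OverflowError("no free address left")

    def search(self, key: int):
        """Return the record for key, or None if the key is not present."""
        n = len(self.slots)
        home = self._home(key)
        for step in range(n):
            addr = (home + step) % n
            slot = self.slots[addr]
            if slot is None:      # a free address ends the probe sequence
                return None
            if slot[0] == key:
                return slot[1]
        return None               # every address was inspected


table = LinearProbingTable(7)
print(table.insert(10, "a"))  # home address 3
print(table.insert(17, "b"))  # also hashes to 3, lands at 4
print(table.insert(24, "c"))  # also hashes to 3, lands at 5
print(table.search(17))
```

In this run the keys 17 and 24 end up outside their home address, so finding them costs more than one access.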
Easy to implement; however, this technique can lead to poor search efficiency.

This technique can take away home addresses from other keys, resulting in more collision handling. It can also take many accesses to determine that a key does not exist.

What happens if a key is deleted using this technique? If the slot is simply marked free, a later search can stop early and miss records stored past it, so deletion has to be handled carefully.

Double Hashing

Upon a collision, the key is re-hashed using a different algorithm; the result determines the increment used to search for an open address.

The same problems exist as with linear probing. Research has shown that this technique gives better performance than linear probing.

Prime area with Overflow

Usually used with buckets. A bucket holds x records in the prime address space and also contains a pointer to an overflow area of the file, which is entry-sequenced. This pointer points to the first overflow record, and each overflow record contains a pointer to the next overflow record.

This is a common technique and gives excellent search efficiency.

Chaining

The file consists of a hash table, which is simply an array of pointers. When a key is hashed, the result is an index into the hash table. At this location is a pointer to the first record that has this hash address.

All the records with the same hash address are then "chained" together as a linked list. The data record portion of the file can be entry-sequenced.

Hash Address Distributions

Assuming you have a random hash function, the Poisson function can be used to compute various probabilities, such as:

- How many empty hash slots will there be?
- What percentage of the time will finding a key take more than one access?
- What is the probability that a certain hash address will have x keys assigned to it?

Poisson Function

$p(x) = \frac{(r/n)^{x} \, e^{-r/n}}{x!}$
n: the address space (number of addresses)
r: the number of keys to hash
x: the number of records assigned to a given address
r/n: packing density (load factor)

Poisson Function Example

Assume 1,000 records are to be hashed into a 1,000-address hashed file.

1) What is the probability that a given address will have two keys hashed to it?

$p(2) = \frac{(1000/1000)^{2} \, e^{-1000/1000}}{2!} = \frac{e^{-1}}{2} = \frac{0.368}{2} = 0.184$

2) 1,000 (number of addresses) * 0.184 = 184

Therefore approximately 184 addresses will have exactly 2 keys hashed to them. With one record stored per address, these addresses contribute 184 overflow records; addresses that receive three or more keys add more.

Poisson Function Example

Assume 1,000 records are to be hashed into a 1,500-address hashed file.

1) What is the probability that a given address will have two keys hashed to it?

$p(2) = \frac{(1000/1500)^{2} \, e^{-1000/1500}}{2!} \approx \frac{(0.67)^{2} \, e^{-0.67}}{2!} = \frac{(0.449)(0.512)}{2} = \frac{0.230}{2} = 0.115$

2) 1,500 (number of addresses) * 0.115 = 172.5 (about 173)

Therefore approximately 173 addresses will have exactly 2 keys hashed to them. With one record stored per address, these addresses contribute 173 overflow records; addresses that receive three or more keys add more.
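The two worked examples can be checked with a few lines of Python. This is a minimal sketch; the function names `poisson` and `expected_addresses` are illustrative and do not come from the slides. Note that the second example rounds r/n to 0.67 before computing, so the unrounded count printed here comes out slightly lower, at roughly 171 addresses.

```python
import math


def poisson(x: int, r: int, n: int) -> float:
    """Probability that a given address receives exactly x of r keys (n addresses)."""
    load = r / n  # packing density
    return (load ** x) * math.exp(-load) / math.factorial(x)


def expected_addresses(x: int, r: int, n: int) -> float:
    """Expected number of addresses that receive exactly x keys."""
    return n * poisson(x, r, n)


for r, n in [(1000, 1000), (1000, 1500)]:
    print(f"r={r}, n={n}")
    print(f"  p(2) = {poisson(2, r, n):.3f}")
    print(f"  addresses with 2 keys = {expected_addresses(2, r, n):.1f}")
    print(f"  empty addresses = {expected_addresses(0, r, n):.1f}")
```

Calling `expected_addresses(0, r, n)` gives one way to answer the earlier question about how many hash slots stay empty.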