Hashing and Hash Tables

Hashing is key. Hash tables are one of the great inventions of computer science: a combination of lists, arrays, and a clever idea. The usual use is a symbol table: associate a value (data) with a string (key). Countless other applications exist; we'll talk about some.

Hashing

The insight is to create from arbitrary data a small integer key (or just hash) that can be used to identify the data efficiently. Usually (but not always) a subset of the data is used in making the key: the string in a string/value pair, for example. Store the complete data under the key, usually in a hash table that uses the key as an index. This makes the key a form of lookup address (an index into a table). Some hashing schemes eliminate the table altogether and make the key the machine address itself.

Hash function

Why is it called a hash? Because the key is usually generated by mashing all the bits of the data together to make as random a mess as possible. OED: "A mixture of mangled and incongruous fragments; a medley; a spoiled mixture; a mess, jumble. Often in phr. to make a hash of, to mangle and spoil in attempting to deal with."

This is the job of the hash function: produce a hash from the data, to be used as the key. The hash should be randomly distributed through a modest integer range; that range is the size of the hash table that stores the data. For example, add up the bytes in a character string and take the result modulo 128, to give an index into a 128-entry array of string/value pairs.

Hashing styles

What to do if two items hash to the same key?

1. Perfect hashing: arrange that there are no collisions. Can be done if the data is known in advance: choose the right hash function, or compute one. Rare, but nice when it can work.

2. Open hashing: put the colliding entry in the next slot of the table, or chain colliding entries together with pointers. When the table fills, rehash. Adds complexity; avoids memory allocation. Old-fashioned.

3. Hash buckets: each hash table entry is the head of a list of items that hash to that value. This is almost always the way to go these days. Can still rehash if the lists get long, but we can often arrange that that won't happen.

(Diagrams of open hashing and hash buckets omitted.)

Hash tables have their limitations. If the hash function is poor or the table size is too small, the lists can grow long. Since they're unsorted, this leads to O(n²) behaviour. There's also no easy (let alone efficient) way to extract the elements in order.

Types of Hashing

• There are two types of hashing:
  1. Static hashing: the set of keys is fixed and given in advance.
  2. Dynamic hashing: the set of keys can change dynamically.
• The load factor of a hash table is the ratio of the number of keys in the table to the size of the hash table.
• As the load factor gets closer to 1.0, the likelihood of collisions increases.
• The load factor is a typical example of a space/time trade-off.

A good hash function should:
• Minimize collisions.
• Be easy and quick to compute.
• Distribute key values evenly in the hash table.
• Use all the information provided in the key.
• Allow a high load factor for a given set of keys.

Hashing Methods

1. Prime-Number Division Remainder
• Computes the hash value from the key using the % (remainder) operator.
• Table sizes that are powers of 2, such as 32 or 1024, should be avoided, since they lead to more collisions.
• Likewise, powers of 10 are poor table sizes when the keys are decimal integers.
• Prime numbers not close to powers of 2 make better table sizes.
• This method is best when combined with truncation or folding, which are discussed below.
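To make the byte-summing example and the division-remainder method concrete, here is a minimal Python sketch; the function names and the table size 101 are our own illustrative choices, not from the notes.

    def byte_sum_hash(key: str, table_size: int = 128) -> int:
        # Add up the bytes of the string, then reduce modulo the table size.
        return sum(key.encode("utf-8")) % table_size

    def division_remainder_hash(key: int, table_size: int = 101) -> int:
        # Division-remainder method: table_size is a prime not close to a power of 2.
        return key % table_size

    print(byte_sum_hash("example"))           # an index in 0..127
    print(division_remainder_hash(25936715))  # an index in 0..100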
2. Truncation or Digit/Character Extraction
• Works based on the distribution of digits or characters in the key.
• The more evenly distributed digit positions are extracted and used for hashing.
• For instance, student IDs or ISBN codes may contain common subsequences, which increase the likelihood of collisions.
• Very fast, but the distribution of digits or characters in the keys may not be even.

3. Folding
• Involves splitting the key into two or more parts and then combining the parts to form the hash address.
• To map the key 25936715 to a range between 0 and 9999, we can split the number into 2593 and 6715 and add them to obtain 9308 as the hash value.
• Very useful if the keys are very large.
• Fast and simple, especially with bit patterns.
• A great advantage is the ability to transform non-integer keys into integer values.

4. Radix Conversion
• Transforms the key into another number base to obtain the hash value.
• Typically uses a number base other than base 10 or base 2 to calculate the hash address.
• To map the key 38652 into the range 0 to 9999 using base 11, we compute 3×11^4 + 8×11^3 + 6×11^2 + 5×11^1 + 2×11^0 = 55354.
• We then truncate the high-order 5 to yield 5354 as our hash address within 0 to 9999.

5. Mid-Square
• The key is squared and the middle part of the result is taken as the hash value.
• To map the key 3121 into a hash table of size 1000, we square it (3121² = 9740641) and extract the middle digits, 406, as the hash value.
• Can be more efficient with a power of 2 as the hash table size.
• Works well if the keys do not contain many leading or trailing zeros.
• Non-integer keys have to be preprocessed to obtain corresponding integer values.

6. Use of a Random-Number Generator
• Given a seed as parameter, the method generates a random number.
• The algorithm must ensure that:
  • it always generates the same random value for a given key;
  • it is unlikely for two keys to yield the same random value.
• The random number produced can be transformed to produce a valid hash value.
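The worked examples above translate directly into code. Here is a minimal sketch reproducing the numbers from methods 3 to 5; the function names and parameter defaults are our own assumptions.

    def folding_hash(key: int, digits: int = 4) -> int:
        # Split the decimal digits of the key into chunks and add them:
        # 25936715 -> 2593 + 6715 = 9308.
        chunk = 10 ** digits
        total = 0
        while key > 0:
            total += key % chunk
            key //= chunk
        return total % chunk              # keep the result within 0..chunk-1

    def radix_hash(key: int, base: int = 11, digits: int = 4) -> int:
        # Reinterpret the decimal digits of the key in another base:
        # 38652 read in base 11 is 55354; keeping the low-order digits gives 5354.
        value = 0
        for d in str(key):
            value = value * base + int(d)
        return value % (10 ** digits)

    def mid_square_hash(key: int, table_size: int = 1000) -> int:
        # Square the key and take the middle digits: 3121**2 = 9740641 -> 406.
        squared = str(key * key)
        width = len(str(table_size - 1))  # digits needed to cover the table range
        start = (len(squared) - width) // 2
        return int(squared[start:start + width])

    print(folding_hash(25936715))   # 9308
    print(radix_hash(38652))        # 5354
    print(mid_square_hash(3121))    # 406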
Hash collision

In computer science, a hash collision is a situation that occurs when two distinct inputs to a hash function produce identical outputs. Most hash functions have potential collisions, but with good hash functions they occur less often than with bad ones. In certain specialized applications, where a relatively small number of possible inputs are all known ahead of time, it is possible to construct a perfect hash function that maps all inputs to different outputs. But for a function that accepts input of arbitrary length and content and returns a hash of fixed length, there will always be collisions, because any given hash can correspond to an infinite number of possible inputs.

In searching

An efficient method of searching is to process a lookup key with a hash function and then use the resulting hash value as an index into an array of data. The resulting data structure is called a hash table. As long as different keys map to different indices, lookup can be performed in constant time. When multiple lookup keys map to identical indices, however, a hash collision occurs. The most popular ways of dealing with this are chaining (building a linked list of values for each array index) and open addressing (searching other array indices nearby for an empty space).

Collision resolution

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, the birthday paradox implies there is a 95% chance of at least two of the keys being hashed to the same slot. Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Load factor

The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increase. On the other hand, as the load factor approaches zero, the size of the hash table increases with little improvement in search cost, and memory is wasted.

Separate chaining

In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to that location. Lookup requires scanning the list for an entry with the given key. Insertion requires appending a new entry record to either end of the list in the hashed slot. Deletion requires searching the list and removing the element. (The technique is also called open hashing or closed addressing, which should not be confused with "open addressing" or "closed hashing".)

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods. The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, that is, on the load factor.

Chained hash tables remain effective even when the number of entries n is much higher than the number of slots. Their performance degrades more gracefully (linearly) with the load factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1), but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree.

For separate chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often implemented as ordered lists, sorted by the key field; this choice approximately halves the average cost of unsuccessful lookups, compared to an unordered list.
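As a minimal sketch of separate chaining as just described, using Python lists for the chains; the class and its method names are our own illustration, not a reference implementation.

    class ChainedHashTable:
        # Each slot of the bucket array holds a list of (key, value) pairs.

        def __init__(self, slots: int = 8):
            self.buckets = [[] for _ in range(slots)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:                  # key already present: replace value
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))       # append new entry to the chain

        def lookup(self, key):
            for k, v in self._bucket(key):    # scan the chain for the key
                if k == key:
                    return v
            return None

        def delete(self, key):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    return True
            return False

    table = ChainedHashTable()
    table.insert("John Smith", "521-1234")
    table.insert("Sandra Dee", "521-9655")
    print(table.lookup("John Smith"))         # 521-1234

Python's built-in hash stands in here for the hash functions discussed earlier; swapping the inner lists for sorted lists or balanced trees gives the variants discussed next.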
However, if some keys are much more likely to come up than others, an unordered list with a move-to-front heuristic may be more effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in the worst case. However, using a larger table and/or a better hash function may be even more effective in those cases.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

Separate chaining with list heads

Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase the cache efficiency of hash table access. To save memory, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

(Figure: hash collision resolved by separate chaining, with head records in the bucket array.)

Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g. in a real-time application), or if one expects to have many entries hashed to the same slot (e.g. if one expects extremely non-uniform or even malicious key distributions).

The variant called array hashing uses a dynamic array to store all the entries that hash to the same bucket.[6] Each inserted entry gets appended to the end of the dynamic array assigned to the hashed slot. This variation makes more effective use of CPU caching, since the bucket entries are stored in sequential memory positions. It also dispenses with the next pointers required by linked lists, which saves space when the entries are small, such as pointers or single-word integers.

An elaboration on this approach is the so-called dynamic perfect hashing, where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries, in the worst case), this variant has guaranteed constant worst-case lookup time and low amortized time for insertion.

Open addressing

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found or an unused array slot is found, which indicates that there is no such key in the table.[9] The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing", which usually mean separate chaining.)

(Figure: hash collision resolved by open addressing with linear probing, interval = 1. Note that "Ted Baker" has a unique hash but nevertheless collides with "Sandra Dee", who had previously collided with "John Smith".)
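A minimal sketch of open addressing with linear probing (probe interval 1), as described above; the class layout and names are our own. Deletion is omitted deliberately: removing an entry from the middle of a probe sequence requires tombstones or re-insertion, which the text does not cover.

    class LinearProbingTable:
        # All entries live in the bucket array itself; collisions are
        # resolved by scanning forward for the next free slot.

        _EMPTY = object()                        # sentinel for unused slots

        def __init__(self, slots: int = 16):
            self.keys = [self._EMPTY] * slots
            self.values = [None] * slots

        def insert(self, key, value):
            i = hash(key) % len(self.keys)
            for _ in range(len(self.keys)):      # probe at most every slot
                if self.keys[i] is self._EMPTY or self.keys[i] == key:
                    self.keys[i], self.values[i] = key, value
                    return
                i = (i + 1) % len(self.keys)     # move to the next slot
            raise RuntimeError("table full; a real table would resize here")

        def lookup(self, key):
            i = hash(key) % len(self.keys)
            for _ in range(len(self.keys)):
                if self.keys[i] is self._EMPTY:  # unused slot: key is absent
                    return None
                if self.keys[i] == key:
                    return self.values[i]
                i = (i + 1) % len(self.keys)
            return None

    table = LinearProbingTable()
    table.insert("John Smith", "521-1234")
    print(table.lookup("John Smith"))            # 521-1234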
Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee" which had previously collided with "John Smith". Coalesced hashing A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots. Robin Hood hashing One interesting variation on double-hashing collision resolution is that of Robin Hood hashing.The idea is that a key already inserted may be displaced by a new key if its probe count is larger than the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except the criteria for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variance on the number of probes is reduced dramatically an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions. External Robin Hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed sized page or bucket with B records. Cuckoo hashing Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup time in the worst case, and constant amortized time for insertions and deletions. Hopscotch hashing Another alternative open-addressing solution is hopscotch hashing,which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table. The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops (in a manner reminiscent of cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot). Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.