Hashing
• Hashing
  – Definition of hashing
  – Example of hashing
  – Hashing functions
  – Deleting elements
  – Dynamic resizing
  – Hashing in the Java Collections API
  – Hash table applications
• Reading: L&C 3rd: 14.1 – 14.5; 2nd: 17.1 – 17.5

Definition of Hashing
• Data elements are stored in a hash table (usually implemented as an indexed array)
• The location for storing each element in the table is determined by some function of the data itself or of a key within the data
• This function is called a hashing function
• Each location in the hash table is referred to as a cell or bucket
• Hashing attempts to make access to each element independent of the number n of elements, i.e. O(1) instead of O(n)

Example (Poor/Oversimplified)
• A hash table has 26 cells or buckets
• The hash function creates an index value based on the first letter of the data
  – A data element beginning with ‘A’ would be stored in cell or bucket 0
  – A data element beginning with ‘Z’ would be stored in cell or bucket 25
• The hash function’s value can be calculated in O(1) time, i.e. independent of n

L&C Example: Hash Table
[Figure: the hash function produces a value 0 – 25 based only on the first letter of the name; the names shown are Ann, Doug, Elizabeth, Hal, Mary, Tim, Walter, and Young]

More Definitions for Hashing
• The efficiency is only fully realized if each element maps to a unique table position
• A perfect hashing function maps each element to a unique location, but it is not always feasible
• When two elements map to the same table location, we get a collision!
• The efficiency depends on using a reasonable collision resolution algorithm

More Definitions for Hashing
• The size of the hash table is another major factor in the efficiency
• The mathematics work best if the size of the table is a prime number, e.g.
101
• If a perfect hashing function is not available but the size of the data set is known, we can make the table 150% of the data size
• If the size of the data set is not known, we can use our old “standby” expandCapacity based on a load factor such as 50%

Hashing Functions
• In many cases, the key itself is too large to reasonably use as the hash value
• There are many ways to calculate a suitable integer value from a data element’s key
• Note that these algorithms are not likely to produce perfect hash functions, but that is not really required
• A reasonably good hash function will do!

Hashing Functions
• One approach, such as using the first letter of the key in the previous example, is called extraction
  – For phone or Social Security numbers, we can convert the last four digits to an integer and produce a reasonable hash value
  – For cars, we could use a portion of the license plate number to extract a hash value

Hashing Functions
• Another approach is called division
  – For phone or Social Security numbers, we can divide the value of the number by p (a positive integer) and use the remainder as a hash value
• The mathematics of designing a good hash function is very complex and we’ll avoid it
• The Java Object class provides a basic hash function, hashCode, that is inherited by all of our classes
• However, you can override that method

Hashing Functions
• We may need to manipulate a hash value further to be able to use it as an index into the hash table
• Usually this is done by modulo division of the hash value by the size of the table.
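The extraction and division approaches, and the final reduction of a hash value to a table index, can be sketched in Java roughly as follows. This is a minimal illustration rather than code from the L&C text: the class name HashFunctionSketch, the method names, the sample phone number, and the table size of 101 are all assumptions chosen for the example. Math.floorMod is used instead of % because hashCode can return a negative value.

```java
// Illustrative sketch only; all names here are hypothetical.
class HashFunctionSketch {

    // Extraction: use the last four digits of a digit string as the hash value.
    static int extractLastFour(String digits) {
        return Integer.parseInt(digits.substring(digits.length() - 4));
    }

    // Division: divide the numeric key by p and use the remainder as the hash value.
    static int divisionHash(long key, int p) {
        return (int) Math.floorMod(key, (long) p);
    }

    // Reduce any hash value (possibly negative) to an index in [0, tableSize).
    static int toIndex(int hashValue, int tableSize) {
        return Math.floorMod(hashValue, tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 101; // a prime table size, as suggested above

        System.out.println(extractLastFour("6175551234"));          // 1234
        System.out.println(divisionHash(6175551234L, tableSize));   // 63
        System.out.println(toIndex("Ann".hashCode(), tableSize));   // index from Object-style hashCode
        System.out.println(toIndex(-7, tableSize));                 // 94, never negative
    }
}
```

Note that the plain % operator would give -7 for the last case, which is not a valid array index; that is the reason for floorMod in this sketch.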
Resolving Collisions
• Since we usually cannot create a perfect hashing function, we need to handle the collisions that result from the function used
• Obviously, we need to use the chosen collision resolution algorithm when adding elements to the table
• Less obviously, our choice of a collision resolution algorithm affects our removal of data elements from the table as well

Resolving Collisions
• The chaining method for handling collisions treats the hash table as an array of object references to some other type of collection such as an ordered or unordered list

Resolving Collisions
• The chaining method can use an overflow area in an array of size (n + overflow)
[Figure: an array divided into a sub-array of size n (n is the modulo divisor) followed by a sub-array used as the overflow area]

Deleting – Chained Implementation
• If we store a reference to another collection in each hash table entry, it is easy to use that collection’s remove method to delete an element in a chained hash implementation
• Otherwise, we must manipulate references appropriately to unlink the object being removed and maintain the chain based at the original hash table entry

Resolving Collisions
• Open addressing looks for another open position in the table other than the one to which the element is originally hashed
• Three variations of open addressing:
  – Linear probing
  – Quadratic probing
  – Double hashing

Open Addressing – Linear Probing
• Open addressing with linear probing means that if an element hashes to position p and position p is already occupied, we simply try: position p = (p + 1) % tableSize
• If that position is also occupied, we keep adding 1 modulo table size until either:
  – We find an empty position and use it, OR
  – We are back at the original position p and the table is full (expanding capacity is an option)
• Problem: Tends to create dense occupied clusters, which affects the efficiency of adds and searches

Open Addressing – Quadratic Probing
• Open addressing with quadratic probing means that if an element hashes to
position p and position p is already occupied, we use a formula such as: $\text{new hash} = \text{original hash} + (-1)^{i-1} \lfloor (i+1)/2 \rfloor^{2}$ (for i = 1, 2, …)
• We must reduce the new hash modulo tableSize
• This tries a sequence of positions such as: p, p+1, p-1, p+4, p-4, p+9, p-9, …
• If we get back to the original position, the table is full (expanding capacity is still an option)
• This does not have as strong a tendency to create dense clusters as linear probing

Open Addressing – Double Hashing
• Open addressing with double hashing means that if an element hashes to position p and position p is already occupied, we use a formula such as: $\text{new hash}(x) = \text{original hash}(x) + i \cdot \text{secondary hash}(x)$
• We must reduce the new hash modulo tableSize
• This tries a sequence of positions that depends on the definition of the secondary hash function
• If we get back to the original position, the table is full (expanding capacity is still an option)
• It is somewhat more costly to compute a second hash function, but this tends to further reduce the density of clustering compared with quadratic probing

Deleting – Open Addressing
• If we actually delete any element from an open addressing implementation, we may make it impossible to find another entry
• If we start a search at the hashed position and find any empty space (including one created by a deletion), we will stop searching
• Therefore, we remove an element by marking it as deleted without actually removing it
• Then, if we start a search at the hashed position and find a marked deletion, we know to continue searching
• We can reuse deleted positions when possible or rehash the table to recover wasted space

Dynamic Table Resizing
• To dynamically expand the capacity of the hash table, we create a new larger array
• However, we note that the size of the table is used to select locations in the array
• We can’t just copy the smaller array into a sub-array area within the new larger array
• We must take all elements from the smaller array and rehash them
into the larger array

Java Collections API: Hash Tables
• There are seven implementations of hashing, including:
  – Hashtable
  – HashMap
  – HashSet
  – LinkedHashSet
  – LinkedHashMap

Hash Table Applications
• Compilers use a hash table called a symbol table for the variable names in source code
• Game programs use a hash table called a transposition table for positions previously encountered to “remember” the move made in that position
• An online spell checker can pre-hash its dictionary of valid words
• In each case, an O(1) key lookup is achieved
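
As one illustration of the spell checker application using the Java Collections API, the dictionary can be pre-hashed into a java.util.HashSet so that each word check is a single contains call, which is expected to take constant time on average. This is only a sketch: the class name SpellCheckSketch, its methods, and the tiny three-word dictionary are assumptions made for the example, and a real checker would also need to handle punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; class and method names are hypothetical.
class SpellCheckSketch {
    // The dictionary is hashed once, up front.
    private final Set<String> dictionary = new HashSet<>();

    SpellCheckSketch(List<String> validWords) {
        for (String word : validWords) {
            dictionary.add(word.toLowerCase());
        }
    }

    // One hash lookup per word, independent of dictionary size on average.
    boolean isValid(String word) {
        return dictionary.contains(word.toLowerCase());
    }

    // Returns the whitespace-separated tokens that are not in the dictionary.
    List<String> misspelled(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !isValid(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SpellCheckSketch checker =
                new SpellCheckSketch(Arrays.asList("hash", "table", "bucket"));
        System.out.println(checker.misspelled("hash tabel bucket")); // [tabel]
    }
}
```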