CHAPTER 14: Hashing Java Software Structures: Designing and Using Data Structures Third Edition John Lewis & Joseph Chase Addison Wesley is an imprint of © 2010 Pearson Addison-Wesley. All rights reserved. Chapter Objectives • Define hashing • Examine various hashing functions • Examine the problem of collisions in hash tables • Explore the Java Collections API implementations of hashing 1-2 © 2010 Pearson Addison-Wesley. All rights reserved. 1-2 Hashing • In the collections we have discussed thus far, we have made one of three assumptions regarding order: – Order is unimportant – Order is determined by the way elements are added – Order is determined by comparing the values of the elements 1-3 © 2010 Pearson Addison-Wesley. All rights reserved. 1-3 Hashing • Hashing is the concept that order is determined by some function of the value of the element to be stored – More precisely, the location within the collection is determined by some function of the value of the element to be stored 1-4 © 2010 Pearson Addison-Wesley. All rights reserved. 1-4 Hashing • In hashing, elements are stored in a hash table with their location in the table determined by a hashing function • Each location in the hash table is called a cell or a bucket 1-5 © 2010 Pearson Addison-Wesley. All rights reserved. 1-5 Hashing • Consider a simple example where we create an array that will hold 26 elements • Wishing to store names in our array, we create a simple hashing function that uses the first letter of each name 1-6 © 2010 Pearson Addison-Wesley. All rights reserved. 1-6 A simple hashing example 1-7 © 2010 Pearson Addison-Wesley. All rights reserved. 1-7 Hashing • Notice that using a hashing approach results in the access time to a particular element being independent of the number of elements in the table • This means that all of the operations on an element of a hash table should be O(1) • However, this efficiency is only realized if each element maps to a unique position in the table 1-8 © 2010 Pearson Addison-Wesley. All rights reserved. 1-8 Hashing • A collision occurs when two or more elements map to the same location • A hashing function that maps each element to a unique location is called perfect hashing function • While it is possible to develop a perfect hashing function, a function that does a good job of distributing the elements in the table will still result in O(1) operations 1-9 © 2010 Pearson Addison-Wesley. All rights reserved. 1-9 Hashing • Another issue surrounding hashing is how large the table should be • If we are assured of a dataset of size n and a perfect hashing function, then the table should be of size n • If a perfect hashing function is not available but we know the size of the dataset (n), then a good rule of thumb is to make the table 150% the size of the dataset 1-10 © 2010 Pearson Addison-Wesley. All rights reserved. 1-10 Hashing • The third case is the case where we do not know the size of the dataset • In this case, we depend upon dynamic resizing • Dynamic resizing of a hash table involves creating a new, larger, hash table and then inserting all of the elements of the old table into the new one 1-11 © 2010 Pearson Addison-Wesley. All rights reserved. 1-11 Hashing • Deciding when to resize is the key to dynamic resizing • One possibility is simply to resize when the table is full – However, it is the nature of hash tables that their performance seriously degrades as they become full • A better approach is to use a load factor • The load factor of a hash table is the percentage occupancy of the table at which the table will be resized 1-12 © 2010 Pearson Addison-Wesley. All rights reserved. 1-12 Hashing Functions - Extraction • There are a wide variety of hashing functions that provide good distribution of various types of data • The method we used in our earlier example is called extraction • Extraction involves using only a part of an element’s value or key to compute the location at which to store the element • In our previous example, we simply extracted the first letter of a string and computed its value relative to the letter A 1-13 © 2010 Pearson Addison-Wesley. All rights reserved. 1-13 Hashing Functions - Division • Creating a hashing function by division simply means we will use the remainder of the key divided by some positive integer p Hashcode(key) = Math.abs(key)%p • This function will yield a result in the range 0 to p-1 • If we use our table size as p, we then have an index that maps directly to the table 1-14 © 2010 Pearson Addison-Wesley. All rights reserved. 1-14 Hashing Functions - Folding • In the folding method, the key is divided into two parts that are then combined or folded together to create an index into the table • This is done by first dividing the key into parts where each of the parts of the key will be the same length as the desired key • In the shift folding method, these parts are then added together to create the index – Using the SSN 987-65-4321 we divide into three parts 987, 654, 321, and then add these together to get 1962 – We would then use either division or extraction to get a three digit index 1-15 © 2010 Pearson Addison-Wesley. All rights reserved. 1-15 Hashing Functions - Folding • In the boundary folding method, some parts of the key are reversed before adding • Using the same example of a SSN 987-65-4321 – – – – divide the SSN into parts 987, 654, 321 reverse the middle element 987, 456, 321 add yielding 1764 then use either division or extraction to get a three digit index • Folding may also be used on strings 1-16 © 2010 Pearson Addison-Wesley. All rights reserved. 1-16 Hashing Functions - Mid-square • In the mid-square method, the key is multiplied by itself and then the extraction method is used to extract the needed number of digits from the middle of the result • For example, if our key is 4321 – Multiply the key by itself yielding 18671041 – Extract the needed three digits • It is critical that the same three digits by extract each time • We may also extract bits and then reconstruct an index from the bits 1-17 © 2010 Pearson Addison-Wesley. All rights reserved. 1-17 Hashing Functions - Radix Transformation • In the radix transformation method, the key is transformed into another numeric base • For example, if our key is 23 in base 10, we might convert it to 32 in base 7 • We then use the division method and divide the converted key by the table size and use the remainder as our index 1-18 © 2010 Pearson Addison-Wesley. All rights reserved. 1-18 Hashing Functions - Digit Analysis • In the digit analysis method, the index is formed by extracting, and then manipulating specific digits from the key • For example, if our key is 1234567, we might select the digits in positions 2 through 4 yielding 234 • The manipulation can then take many forms – – – – Reversing the digits (432) Performing a circular shift to the right (423) Performing a circular shift to the left (342) Swapping each pair of digits (324) 1-19 © 2010 Pearson Addison-Wesley. All rights reserved. 1-19 Hashing Functions - LengthDependent • In the length-dependent method, the key and the length of the key are combined in some way to form either the index itself or an intermediate version • For example, if our key is 8765, we might multiply the first two digits by the length and then divide by the last digit yielding 69 • If our table size is 43, we would then use the division method resulting in an index of 26 1-20 © 2010 Pearson Addison-Wesley. All rights reserved. 1-20 Hashing Functions in the Java Language • The java.lang.Object class defines a method called hashcode that returns an integer based on the memory location of the object • This is generally not very useful • Class that are derived from Object often override the inherited definition of hashcode to provide their own version • For example, the String and Integer classes define their own hashcode methods • These more specific hashcode functions are more effective 1-21 © 2010 Pearson Addison-Wesley. All rights reserved. 1-21 Hashing - Resolving Collisions • If we are able to develop a perfect hashing function, then we do not need to be concerned about collisions or table size • However, often we do not know the size of the dataset and are not able to develop a perfect hashing function • In these cases, we must choose a method for handling collisions 1-22 © 2010 Pearson Addison-Wesley. All rights reserved. 1-22 Resolving Collisions - Chaining • The chaining method simply treats the hash table conceptually as a table of collections rather than a table of individual elements • Thus each cell is a pointer to the collection associated with that location in the table • Usually, this internal collection is either an unordered list or an ordered list • Chaining can be accomplished using a linked structure or an array based structure with an overflow area 1-23 © 2010 Pearson Addison-Wesley. All rights reserved. 1-23 The chaining method of collision handling 1-24 © 2010 Pearson Addison-Wesley. All rights reserved. 1-24 Chaining using an overflow area 1-25 © 2010 Pearson Addison-Wesley. All rights reserved. 1-25 Resolving Collisions - Open Addressing • The open addressing method looks for another open position in the table other than the one to which the element is originally hashed • The simplest of these methods is linear probing • In linear probing, if an element hashes to position p and that position is occupied we simply try position (p+1)%s where s is the size of the table • One problem with linear probing is the development of clusters of occupied cells called primary clusters 1-26 © 2010 Pearson Addison-Wesley. All rights reserved. 1-26 Open addressing using linear probing 1-27 © 2010 Pearson Addison-Wesley. All rights reserved. 1-27 Resolving Collisions - Open Addressing • A second form of open addressing is quadratic probing • In this method, once we have a collision, the next index to try is computed by newhashcode(x) = hashcode(x) + (-1)i-1((i+1)/2)2 • Quadratic probing helps to eliminate the problem of primary clusters 1-28 © 2010 Pearson Addison-Wesley. All rights reserved. 1-28 Open addressing using quadratic probing 1-29 © 2010 Pearson Addison-Wesley. All rights reserved. 1-29 Resolving Collisons - Open Addressing • A third form of the open addressing method is double hashing • In this method, collisions are resolved by supplying a secondary hashing function • For example, if a key x hashes to a position p that is already occupied, then the next position p’ will be p’ = p + secondaryhashcode(x) • Given the same example, if we use a secondary hashing function that uses the length of the string, we get the following result 1-30 © 2010 Pearson Addison-Wesley. All rights reserved. 1-30 Open addressing using double hashing 1-31 © 2010 Pearson Addison-Wesley. All rights reserved. 1-31 Deleting Elements from a Hash Table • Removing an element from a chained implementation falls into one of five cases • If the element we are attempting to remove is the only one mapped to the location, we simply remove the element by setting the table position to null 1-32 © 2010 Pearson Addison-Wesley. All rights reserved. 1-32 Deleting Elements from a Hash Table • If the element we are attempting to remove is stored in the table but has an index into the overflow area – Replace the element and the next index value in the table with the element and next index value of the array position pointed to by the element to be removed – We then must also set the position in the overflow area to null and add it back to our list of free overflow cells 1-33 © 2010 Pearson Addison-Wesley. All rights reserved. 1-33 Deleting Elements from a Hash Table • If the element we are attempting to remove is at the end of the list of elements stored at that location the table – Set its position in the table to null – Set the next pointer of the previous element to null – Add that position to the list of free overflow cells 1-34 © 2010 Pearson Addison-Wesley. All rights reserved. 1-34 Deleting Elements from a Hash Table • If the element we are attempting to remove is in the middle of the list of elements stored at that location – Set its position in the overflow area to null – Set the next pointer of the previous element to point to the next element of the element being removed – Add the position back to the list of free overflow cells • If the element we are attempting to remove is not in the table, then we must throw an exception 1-35 © 2010 Pearson Addison-Wesley. All rights reserved. 1-35 Deleting Elements from a Hash Table • If we have chosen to implement our table using open addressing, the deletion creates more of a challenge • This is because we cannot remove a deleted item with possibly affecting our ability to find items that collided with it on entry • The solution to this problem is to mark items as deleted but not actually remove them from the table until some future point when either the deleted item is overwritten or the entire table is rehashed 1-36 © 2010 Pearson Addison-Wesley. All rights reserved. 1-36 Open addressing and deletion 1-37 © 2010 Pearson Addison-Wesley. All rights reserved. 1-37 Hash Tables in the Java Collections API • The Java Collections API provides seven implementations of hashing (click on each for more information): – – – – – – – Hashtable HashMap HashSet IdentityHashMap LinkedHashSet LinkedHashMap WeakHashMap 1-38 © 2010 Pearson Addison-Wesley. All rights reserved. 1-38