CHAPTER 14:
Hashing
Java Software Structures:
Designing and Using Data Structures
Third Edition
John Lewis & Joseph Chase
Addison-Wesley is an imprint of Pearson
© 2010 Pearson Addison-Wesley. All rights reserved.
Chapter Objectives
• Define hashing
• Examine various hashing functions
• Examine the problem of collisions in hash
tables
• Explore the Java Collections API
implementations of hashing
Hashing
• In the collections we have discussed thus
far, we have made one of three
assumptions regarding order:
– Order is unimportant
– Order is determined by the way elements are
added
– Order is determined by comparing the values
of the elements
Hashing
• Hashing is the concept that order is
determined by some function of the value
of the element to be stored
– More precisely, the location within the
collection is determined by some function of
the value of the element to be stored
Hashing
• In hashing, elements are stored in a hash
table with their location in the table
determined by a hashing function
• Each location in the hash table is called a
cell or a bucket
Hashing
• Consider a simple example where we
create an array that will hold 26 elements
• Wishing to store names in our array, we
create a simple hashing function that uses
the first letter of each name
Figure: A simple hashing example
Hashing
• Notice that using a hashing approach results in
the access time to a particular element being
independent of the number of elements in the
table
• This means that all of the operations on an
element of a hash table should be O(1)
• However, this efficiency is only realized if each
element maps to a unique position in the table
Hashing
• A collision occurs when two or more elements
map to the same location
• A hashing function that maps each element to a
unique location is called a perfect hashing function
• While it is sometimes possible to develop a perfect
hashing function, it is not required: a function that
does a good job of distributing the elements in the
table will still result in O(1) operations on average
Hashing
• Another issue surrounding hashing is how large
the table should be
• If we are assured of a dataset of size n and a
perfect hashing function, then the table should
be of size n
• If a perfect hashing function is not available but
we know the size of the dataset (n), then a good
rule of thumb is to make the table 150% the size
of the dataset
Hashing
• The third case is the case where we do not know
the size of the dataset
• In this case, we depend upon dynamic resizing
• Dynamic resizing of a hash table involves
creating a new, larger, hash table and then
inserting all of the elements of the old table into
the new one
Hashing
• Deciding when to resize is the key to dynamic resizing
• One possibility is simply to resize when the table is full
– However, it is the nature of hash tables that their performance
seriously degrades as they become full
• A better approach is to use a load factor
• The load factor of a hash table is the percentage
occupancy of the table at which the table will be resized
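The resizing policy described here can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name ResizingTable, the initial size of 8, the 0.75 load factor, the doubling policy, and the use of division hashing with linear probing are all choices made for the example, and duplicates are not handled.

```java
class ResizingTable {
    private static final double LOAD_FACTOR = 0.75; // assumed resize threshold
    private Integer[] cells = new Integer[8];       // assumed initial size
    private int count = 0;

    void add(int key) {
        // Resize before the table gets fuller than the load factor allows.
        if (count + 1 > cells.length * LOAD_FACTOR) {
            resize();
        }
        insert(cells, key);
        count++;
    }

    // Create a new, larger table and insert every old element into it.
    private void resize() {
        Integer[] bigger = new Integer[cells.length * 2];
        for (Integer key : cells) {
            if (key != null) {
                insert(bigger, key);
            }
        }
        cells = bigger;
    }

    // Division hashing with linear probing; assumes a free cell exists.
    private static void insert(Integer[] table, int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null) {
            i = (i + 1) % table.length;
        }
        table[i] = key;
    }

    boolean contains(int key) {
        int i = Math.floorMod(key, cells.length);
        while (cells[i] != null) {
            if (cells[i] == key) {
                return true;
            }
            i = (i + 1) % cells.length;
        }
        return false;
    }

    int capacity() {
        return cells.length;
    }
}
```

Because each element's index depends on the table size, resizing has to reinsert every element rather than copy the array.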
Hashing Functions - Extraction
• There are a wide variety of hashing functions that provide
good distribution of various types of data
• The method we used in our earlier example is called
extraction
• Extraction involves using only a part of an element’s
value or key to compute the location at which to store the
element
• In our previous example, we simply extracted the first
letter of a string and computed its value relative to the
letter A
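A minimal Java sketch of the first-letter extraction described here. The class and method names are hypothetical, and the sketch assumes names are non-empty and begin with a letter A to Z.

```java
class ExtractionHash {
    // Uses only the first letter of the name, relative to 'A', as the index.
    static int firstLetterIndex(String name) {
        char first = Character.toUpperCase(name.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("name must start with a letter A-Z");
        }
        return first - 'A';
    }

    public static void main(String[] args) {
        String[] table = new String[26];
        for (String name : new String[] {"Ann", "Doug", "Zack"}) {
            table[firstLetterIndex(name)] = name;
        }
        System.out.println(firstLetterIndex("Doug")); // 3
    }
}
```

Any two names that share a first letter map to the same cell, which is why extraction alone tends to collide often.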
Hashing Functions - Division
• Creating a hashing function by division
simply means we will use the remainder of
the key divided by some positive integer p
Hashcode(key) = Math.abs(key)%p
• This function will yield a result in the range 0
to p-1
• If we use our table size as p, we then have
an index that maps directly to the table
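The division method can be sketched in Java as below. The class and method names are hypothetical. One caveat worth knowing: Math.abs(Integer.MIN_VALUE) is still negative in Java, so the formula as written can produce a negative result for that single key. Math.floorMod is shown as one possible alternative that always yields 0 to p-1 for positive p, although it maps negative keys to different indices than the Math.abs form does.

```java
class DivisionHash {
    // The formula from the slide: remainder of |key| divided by p.
    static int absHash(int key, int p) {
        return Math.abs(key) % p;
    }

    // Alternative sketch: floorMod is never negative when p is positive.
    static int floorModHash(int key, int p) {
        return Math.floorMod(key, p);
    }

    public static void main(String[] args) {
        System.out.println(absHash(1234, 43));      // 30
        System.out.println(floorModHash(1234, 43)); // 30
    }
}
```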
Hashing Functions - Folding
• In the folding method, the key is divided into
parts that are then combined or folded together to
create an index into the table
• This is done by first dividing the key into parts where
each of the parts of the key will be the same length
as the desired index
• In the shift folding method, these parts are then
added together to create the index
– Using the SSN 987-65-4321 we divide into three parts 987,
654, 321, and then add these together to get 1962
– We would then use either division or extraction to get a
three digit index
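The SSN example can be reproduced with a short Java sketch. The names are hypothetical, the key is assumed to be a non-negative int, and using division by 1000 to reach a three-digit index is just one of the two options mentioned above.

```java
class ShiftFolding {
    // Splits the key into three-digit parts and adds them together.
    static int foldSum(int key) {
        int sum = 0;
        while (key > 0) {
            sum += key % 1000; // take the lowest three digits
            key /= 1000;       // drop them and continue
        }
        return sum;
    }

    // Division step: reduce the folded sum to an index in 0..tableSize-1.
    static int index(int key, int tableSize) {
        return foldSum(key) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(foldSum(987654321));     // 1962
        System.out.println(index(987654321, 1000)); // 962
    }
}
```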
Hashing Functions - Folding
• In the boundary folding method, some parts of the
key are reversed before adding
• Using the same example of an SSN 987-65-4321
– Divide the SSN into parts 987, 654, 321
– Reverse the middle element 987, 456, 321
– Add yielding 1764
– Then use either division or extraction to get a three digit
index
• Folding may also be used on strings
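A Java sketch of the boundary folding steps for the SSN example. The names are hypothetical, and the sketch assumes a nine-digit non-negative key split into exactly three parts, with only the middle part reversed.

```java
class BoundaryFolding {
    // Reverses a three-digit group, for example 654 becomes 456.
    static int reverse3(int part) {
        return (part % 10) * 100 + (part / 10 % 10) * 10 + part / 100;
    }

    // Adds the left part, the reversed middle part, and the right part.
    static int foldSum(int nineDigitKey) {
        int left = nineDigitKey / 1_000_000;
        int middle = nineDigitKey / 1000 % 1000;
        int right = nineDigitKey % 1000;
        return left + reverse3(middle) + right;
    }

    public static void main(String[] args) {
        System.out.println(foldSum(987654321));        // 1764
        System.out.println(foldSum(987654321) % 1000); // 764
    }
}
```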
Hashing Functions - Mid-square
• In the mid-square method, the key is multiplied by itself and
then the extraction method is used to extract the needed
number of digits from the middle of the result
• For example, if our key is 4321
– Multiply the key by itself yielding 18671041
– Extract the needed three digits
• It is critical that the same three digits be extracted each time
• We may also extract bits and then reconstruct an index from
the bits
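A possible Java sketch of the mid-square method. The slide does not say which three digits to extract, so the choice here (drop the two lowest digits, then keep the next three) is an assumption made for illustration. What matters is that the positions are fixed, so the same digits are extracted every time.

```java
class MidSquare {
    // Squares the key, then extracts three digits from fixed positions.
    static int index(int key) {
        long squared = (long) key * key;      // long avoids int overflow
        return (int) (squared / 100 % 1000);  // assumed digit positions
    }

    public static void main(String[] args) {
        System.out.println((long) 4321 * 4321); // 18671041
        System.out.println(index(4321));        // 710
    }
}
```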
Hashing Functions - Radix
Transformation
• In the radix transformation method, the key
is transformed into another numeric base
• For example, if our key is 23 in base 10, we
might convert it to 32 in base 7
• We then use the division method and divide
the converted key by the table size and use
the remainder as our index
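The radix transformation can be sketched in Java with Integer.toString(int, int), which renders a number in another base. The names and the table size of 11 are hypothetical. The sketch assumes a small non-negative key and a radix between 2 and 10, so that the converted digits can be read back as a decimal number.

```java
class RadixTransformation {
    static int index(int key, int radix, int tableSize) {
        String digits = Integer.toString(key, radix); // 23 in base 7 is "32"
        int transformed = Integer.parseInt(digits);   // read the digits as base 10
        return transformed % tableSize;               // division method
    }

    public static void main(String[] args) {
        System.out.println(Integer.toString(23, 7)); // 32
        System.out.println(index(23, 7, 11));        // 10
    }
}
```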
Hashing Functions - Digit Analysis
• In the digit analysis method, the index is formed by
extracting, and then manipulating specific digits from
the key
• For example, if our key is 1234567, we might select
the digits in positions 2 through 4 yielding 234
• The manipulation can then take many forms
– Reversing the digits (432)
– Performing a circular shift to the right (423)
– Performing a circular shift to the left (342)
– Swapping each pair of digits (324)
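The four manipulations can be checked with a small Java sketch that works on the selected digits as a string. The class and method names are hypothetical, positions are counted from 1 as in the example, and the selected digit string is assumed to be non-empty.

```java
class DigitAnalysis {
    // Selects the digits in positions from..to (1-based, inclusive).
    static String select(int key, int from, int to) {
        return Integer.toString(key).substring(from - 1, to);
    }

    static String reverse(String d) {
        return new StringBuilder(d).reverse().toString();
    }

    // Circular shift right: the last digit moves to the front.
    static String shiftRight(String d) {
        int n = d.length();
        return d.charAt(n - 1) + d.substring(0, n - 1);
    }

    // Circular shift left: the first digit moves to the end.
    static String shiftLeft(String d) {
        return d.substring(1) + d.charAt(0);
    }

    // Swaps each adjacent pair; a trailing unpaired digit stays in place.
    static String swapPairs(String d) {
        char[] c = d.toCharArray();
        for (int i = 0; i + 1 < c.length; i += 2) {
            char t = c[i];
            c[i] = c[i + 1];
            c[i + 1] = t;
        }
        return new String(c);
    }

    public static void main(String[] args) {
        String d = select(1234567, 2, 4);
        System.out.println(d + " " + reverse(d) + " " + shiftRight(d)
                + " " + shiftLeft(d) + " " + swapPairs(d));
    }
}
```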
Hashing Functions - Length-Dependent
• In the length-dependent method, the key and the
length of the key are combined in some way to form
either the index itself or an intermediate version
• For example, if our key is 8765, we might multiply
the first two digits by the length and then divide by
the last digit yielding 69
• If our table size is 43, we would then use the division
method resulting in an index of 26
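The arithmetic in this example can be traced with a short Java sketch. The names are hypothetical, and the sketch assumes a positive key with at least two digits whose last digit is not zero. Integer division is what turns 348 / 5 into 69.

```java
class LengthDependentHash {
    // (first two digits * number of digits) / last digit, using integer division.
    static int intermediate(int key) {
        String digits = Integer.toString(key);
        int firstTwo = Integer.parseInt(digits.substring(0, 2));
        int lastDigit = key % 10;
        return firstTwo * digits.length() / lastDigit;
    }

    // Division method applied to the intermediate value.
    static int index(int key, int tableSize) {
        return intermediate(key) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(intermediate(8765)); // 69
        System.out.println(index(8765, 43));    // 26
    }
}
```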
Hashing Functions in the Java
Language
• The java.lang.Object class defines a method called
hashCode that returns an integer, typically based on the
memory location of the object
• This is generally not very useful
• Classes that are derived from Object often override the
inherited definition of hashCode to provide their own
version
• For example, the String and Integer classes define
their own hashCode methods
• These more specific hashCode functions are more
effective
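As an illustration of overriding hashCode, here is a sketch of a hypothetical Student class (not from the text). It overrides equals together with hashCode, since the Object contract requires that equal objects return equal hash codes. The multiplier 31 is a conventional choice, not a requirement.

```java
class Student {
    private final String name;
    private final int id;

    Student(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Student)) {
            return false;
        }
        Student s = (Student) other;
        return id == s.id && name.equals(s.name);
    }

    // Based on the field values rather than on the object's identity.
    @Override
    public int hashCode() {
        return 31 * name.hashCode() + id;
    }

    public static void main(String[] args) {
        Student a = new Student("Ann", 7);
        Student b = new Student("Ann", 7);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
        System.out.println(Integer.valueOf(42).hashCode());              // 42
    }
}
```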
Hashing - Resolving Collisions
• If we are able to develop a perfect hashing function,
then we do not need to be concerned about
collisions or table size
• However, often we do not know the size of the
dataset and are not able to develop a perfect
hashing function
• In these cases, we must choose a method for
handling collisions
Resolving Collisions - Chaining
• The chaining method simply treats the hash table
conceptually as a table of collections rather than a
table of individual elements
• Thus each cell is a pointer to the collection
associated with that location in the table
• Usually, this internal collection is either an
unordered list or an ordered list
• Chaining can be accomplished using a linked
structure or an array based structure with an
overflow area
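A minimal Java sketch of chaining with a linked structure, where each cell refers to an unordered list of the elements that hash there. The class name and the choice of String.hashCode with Math.floorMod as the hashing function are assumptions for the example, and duplicates are not checked.

```java
class ChainedTable {
    // One node in the unordered linked list stored at a cell.
    private static class Node {
        final String value;
        final Node next;

        Node(String value, Node next) {
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] cells;

    ChainedTable(int size) {
        cells = new Node[size];
    }

    private int indexFor(String value) {
        return Math.floorMod(value.hashCode(), cells.length);
    }

    // A collision simply adds another node to the front of the cell's list.
    void add(String value) {
        int i = indexFor(value);
        cells[i] = new Node(value, cells[i]);
    }

    boolean contains(String value) {
        for (Node n = cells[indexFor(value)]; n != null; n = n.next) {
            if (n.value.equals(value)) {
                return true;
            }
        }
        return false;
    }
}
```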
Figure: The chaining method of collision handling
Figure: Chaining using an overflow area
Resolving Collisions - Open
Addressing
• The open addressing method looks for another open
position in the table other than the one to which the
element is originally hashed
• The simplest of these methods is linear probing
• In linear probing, if an element hashes to position p and
that position is occupied we simply try position (p+1)%s
where s is the size of the table
• One problem with linear probing is the development of
clusters of occupied cells called primary clusters
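Linear probing can be sketched in Java as follows. The class name is hypothetical, and the first-letter hashing function from the earlier name example is reused here as an assumption, so names are assumed to be non-empty.

```java
class LinearProbingTable {
    private final String[] cells;

    LinearProbingTable(int size) {
        cells = new String[size];
    }

    private int home(String value) {
        return Math.floorMod(Character.toUpperCase(value.charAt(0)) - 'A', cells.length);
    }

    // Stores the value and returns the index where it ended up.
    int add(String value) {
        int s = cells.length;
        int p = home(value);
        for (int tries = 0; tries < s; tries++) {
            if (cells[p] == null) {
                cells[p] = value;
                return p;
            }
            p = (p + 1) % s; // occupied: try the next position, wrapping around
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(26);
        System.out.println(table.add("Ann"));  // 0
        System.out.println(table.add("Andy")); // 1
        System.out.println(table.add("Bob"));  // 2
    }
}
```

In the usage above, "Bob" is pushed out of its own home position by an element that hashed elsewhere, which is how a primary cluster grows.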
Figure: Open addressing using linear probing
Resolving Collisions - Open
Addressing
• A second form of open addressing is
quadratic probing
• In this method, once we have a collision,
the next index to try is computed by
newhashcode(x) = hashcode(x) + (-1)^(i-1) * ((i+1)/2)^2
• Quadratic probing helps to eliminate the
problem of primary clusters
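The probe offsets produced by this formula can be computed with a short Java sketch. Two assumptions are made here that the slide does not state: (i+1)/2 is treated as integer division, which gives the offsets +1, -1, +4, -4, +9 for i = 1 to 5, and the resulting position is wrapped into the table with a modulo. The names are hypothetical.

```java
class QuadraticProbing {
    // Offset for the i-th probe (i starts at 1): (-1)^(i-1) * ((i+1)/2)^2.
    static int offset(int i) {
        int half = (i + 1) / 2;            // integer division (assumed)
        int sign = (i % 2 == 1) ? 1 : -1;  // (-1)^(i-1)
        return sign * half * half;
    }

    // Position of the i-th probe, wrapped into 0..tableSize-1 (assumed).
    static int probe(int home, int i, int tableSize) {
        return Math.floorMod(home + offset(i), tableSize);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 5; i++) {
            System.out.print(offset(i) + " "); // 1 -1 4 -4 9
        }
        System.out.println();
    }
}
```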
Figure: Open addressing using quadratic probing
Resolving Collisions - Open
Addressing
• A third form of the open addressing method is double hashing
• In this method, collisions are resolved by supplying a secondary
hashing function
• For example, if a key x hashes to a position p that is already
occupied, then the next position p’ will be
p’ = p + secondaryhashcode(x)
• Given the same example, if we use a secondary hashing function
that uses the length of the string, we get the following result
Figure: Open addressing using double hashing
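Double hashing with the string length as the secondary hashing function can be sketched in Java as below. The class name, the first-letter primary hashing function, and the wrap-around with a modulo are assumptions for the example. The sketch also gives up after as many tries as there are cells, because a step that shares a factor with the table size does not visit every cell.

```java
class DoubleHashingTable {
    private final String[] cells;

    DoubleHashingTable(int size) {
        cells = new String[size];
    }

    private int primary(String value) {
        return Math.floorMod(Character.toUpperCase(value.charAt(0)) - 'A', cells.length);
    }

    // Stores the value and returns the index where it ended up.
    int add(String value) {
        int s = cells.length;
        int p = primary(value);
        int step = value.length(); // secondary hashing function
        for (int tries = 0; tries < s; tries++) {
            if (cells[p] == null) {
                cells[p] = value;
                return p;
            }
            p = (p + step) % s; // p' = p + secondaryhashcode(x), wrapped
        }
        throw new IllegalStateException("no free cell found along the probe sequence");
    }

    public static void main(String[] args) {
        DoubleHashingTable table = new DoubleHashingTable(26);
        System.out.println(table.add("Ann"));  // 0
        System.out.println(table.add("Andy")); // 4
        System.out.println(table.add("Alan")); // 8
    }
}
```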
Deleting Elements from a Hash Table
• Removing an element from a chained
implementation falls into one of five cases
• If the element we are attempting to remove
is the only one mapped to the location, we
simply remove the element by setting the
table position to null
Deleting Elements from a Hash Table
• If the element we are attempting to remove is
stored in the table but has an index into the
overflow area
– Replace the element and the next index value in the
table with the element and next index value of the
array position pointed to by the element to be removed
– We then must also set the position in the overflow
area to null and add it back to our list of free overflow
cells
Deleting Elements from a Hash Table
• If the element we are attempting to remove is at
the end of the list of elements stored at that
location in the table
– Set its position in the overflow area to null
– Set the next pointer of the previous element to null
– Add that position to the list of free overflow cells
Deleting Elements from a Hash Table
• If the element we are attempting to remove is in the
middle of the list of elements stored at that location
– Set its position in the overflow area to null
– Set the next pointer of the previous element to point to the next
element of the element being removed
– Add the position back to the list of free overflow cells
• If the element we are attempting to remove is not in the
table, then we must throw an exception
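The five removal cases can be sketched in Java for a chained table that keeps its overflow cells in the same array as the table. Everything about the layout is an assumption made for illustration: the class name, the parallel elements and next arrays, the NONE marker, the free-cell stack, and the use of NoSuchElementException for the "not in the table" case.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

class OverflowChainedTable {
    private static final int NONE = -1;
    private final String[] elements; // table cells followed by overflow cells
    private final int[] next;        // index of the next element in the chain
    private final int tableSize;
    private final Deque<Integer> freeOverflow = new ArrayDeque<>();

    OverflowChainedTable(int tableSize, int overflowSize) {
        this.tableSize = tableSize;
        elements = new String[tableSize + overflowSize];
        next = new int[tableSize + overflowSize];
        for (int i = 0; i < next.length; i++) {
            next[i] = NONE;
            if (i >= tableSize) freeOverflow.push(i);
        }
    }

    private int home(String v) { return Math.floorMod(v.hashCode(), tableSize); }

    void add(String value) {
        int i = home(value);
        if (elements[i] == null) { elements[i] = value; return; }
        int cell = freeOverflow.pop();       // throws if the overflow area is full
        elements[cell] = value;
        while (next[i] != NONE) i = next[i]; // walk to the end of the chain
        next[i] = cell;
    }

    void remove(String value) {
        int i = home(value);
        if (elements[i] == null) throw new NoSuchElementException(value);
        if (elements[i].equals(value)) {
            int n = next[i];
            if (n == NONE) { elements[i] = null; return; } // only element here
            elements[i] = elements[n];                     // pull successor into the table
            next[i] = next[n];
            release(n);
            return;
        }
        int prev = i;
        int cur = next[i];
        while (cur != NONE && !elements[cur].equals(value)) { prev = cur; cur = next[cur]; }
        if (cur == NONE) throw new NoSuchElementException(value); // not in the table
        next[prev] = next[cur]; // NONE at the end of the list, the successor in the middle
        release(cur);
    }

    // Clears an overflow cell and returns it to the list of free cells.
    private void release(int cell) {
        elements[cell] = null;
        next[cell] = NONE;
        freeOverflow.push(cell);
    }

    boolean contains(String value) {
        for (int i = home(value); i != NONE && elements[i] != null; i = next[i]) {
            if (elements[i].equals(value)) return true;
        }
        return false;
    }

    int freeOverflowCells() { return freeOverflow.size(); }
}
```

In this sketch the end-of-list and middle-of-list cases share one line of code, because assigning the removed element's next index to its predecessor covers both.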
Deleting Elements from a Hash Table
• If we have chosen to implement our table using open
addressing, the deletion creates more of a challenge
• This is because we cannot remove a deleted item without
possibly affecting our ability to find items that collided
with it on entry
• The solution to this problem is to mark items as deleted
but not actually remove them from the table until some
future point when either the deleted item is overwritten or
the entire table is rehashed
Figure: Open addressing and deletion
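Marking items as deleted can be sketched in Java with a sentinel object placed in the cell. The class name, the DELETED sentinel, the first-letter hashing function, and the use of linear probing are all assumptions for the example, and duplicates are not checked.

```java
class TombstoneTable {
    // Sentinel compared by identity, so it never equals a stored value.
    private static final String DELETED = new String("<deleted>");
    private final String[] cells;

    TombstoneTable(int size) {
        cells = new String[size];
    }

    private int home(String value) {
        return Math.floorMod(Character.toUpperCase(value.charAt(0)) - 'A', cells.length);
    }

    void add(String value) {
        int p = home(value);
        for (int tries = 0; tries < cells.length; tries++) {
            // A deleted cell may be overwritten by a new element.
            if (cells[p] == null || cells[p] == DELETED) {
                cells[p] = value;
                return;
            }
            p = (p + 1) % cells.length;
        }
        throw new IllegalStateException("table is full");
    }

    private int find(String value) {
        int p = home(value);
        for (int tries = 0; tries < cells.length; tries++) {
            if (cells[p] == null) {
                return -1; // a truly empty cell ends the search
            }
            if (cells[p] != DELETED && cells[p].equals(value)) {
                return p;
            }
            p = (p + 1) % cells.length; // deleted cells are skipped, not treated as empty
        }
        return -1;
    }

    boolean contains(String value) {
        return find(value) >= 0;
    }

    // Marks the cell as deleted instead of setting it to null.
    boolean remove(String value) {
        int p = find(value);
        if (p < 0) {
            return false;
        }
        cells[p] = DELETED;
        return true;
    }
}
```

If remove set the cell to null instead, a later search for an element that had probed past that cell would stop early and report it missing.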
Hash Tables in the Java Collections
API
• The Java Collections API provides seven
implementations of hashing:
– Hashtable
– HashMap
– HashSet
– IdentityHashMap
– LinkedHashSet
– LinkedHashMap
– WeakHashMap
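A brief Java sketch using three of these classes. The class name CollectionsHashingDemo and its helper methods are hypothetical; the library calls shown (the HashMap constructor taking an initial capacity and load factor, Map.merge, and Set.add) are part of java.util.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class CollectionsHashingDemo {
    // Counts words; LinkedHashMap keeps the keys in insertion order.
    static Map<String, Integer> wordCounts(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // HashSet.add returns false when the element is already present.
    static boolean hasDuplicate(String[] words) {
        Set<String> seen = new HashSet<>();
        for (String w : words) {
            if (!seen.add(w)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Initial capacity 16 and load factor 0.75 are also HashMap's defaults.
        Map<String, Integer> ages = new HashMap<>(16, 0.75f);
        ages.put("Ann", 30);
        System.out.println(ages.get("Ann"));                            // 30
        System.out.println(wordCounts(new String[] {"a", "b", "a"}));   // {a=2, b=1}
        System.out.println(hasDuplicate(new String[] {"a", "b", "a"})); // true
    }
}
```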