Hashing
• Hashing
– Definition of hashing
– Example of hashing
– Hashing functions
– Deleting elements
– Dynamic resizing
– Hashing in the Java Collections API
– Hash Table Applications
• Reading: L&C 3rd ed.: 14.1 – 14.5; 2nd ed.: 17.1 – 17.5
Definition of Hashing
• Data elements are stored in a hash table (usually
implemented as an indexed array)
• The location for storing each element in the table is
determined by some function of the data itself or a
key within the data
• This function is called a hashing function
• Each location in the hash table is referred to as a
cell or bucket
• Hashing attempts to make access to each element
independent of the number n of elements, i.e. O(1)
instead of O(n)
Example (Poor/Oversimplified)
• A hash table has 26 cells or buckets
• The hash function creates an index value
based on the first letter of the data
– A data element beginning with ‘A’ would be
stored in the 0th cell or bucket
– A data element beginning with ‘Z’ would be
stored in the 25th cell or bucket
• The hash function’s value can be calculated
in O(1) time, i.e. independent of n
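A minimal Java sketch of this first-letter scheme (the class and method names are illustrative; it assumes each name starts with an uppercase letter A-Z and ignores collisions):

```java
// A sketch of the oversimplified first-letter hash described above.
public class FirstLetterHash {
    public static int hash(String name) {
        return name.charAt(0) - 'A';              // maps 'A'..'Z' to 0..25
    }

    public static void main(String[] args) {
        String[] table = new String[26];          // 26 cells or buckets
        for (String name : new String[] {"Ann", "Doug", "Tim"}) {
            table[hash(name)] = name;             // O(1), independent of n
        }
        System.out.println(table[0]);             // prints Ann
    }
}
```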
L&C Example: Hash Table
[Figure: a hash table holding the names Ann, Doug, Elizabeth, Hal, Mary, Tim, Walter, and Young; the hash function produces a value 0 - 25 based only on the first letter of each name]
More Definitions for Hashing
• The efficiency is only fully realized if each
element maps to a unique table position
• A perfect hashing function maps each
element to a unique location, but it is not
always feasible
• When two elements map to the same table
location, we get a collision!
• The efficiency is dependent on using a
reasonable collision resolution algorithm
More Definitions for Hashing
• The size of the hash table is another major
factor in the efficiency
• The mathematics work best if the size of
the table is a prime number, e.g. 101
• If a perfect hashing function is not available
but the size of the data set is known, we
can make the table 150% of the data size
• If the size of the data set is not known, we
can use our old “standby” expandCapacity
based on a load factor such as 50%
Hashing Functions
• In many cases, the key itself is too large to
reasonably use it as the hash value
• There are many ways to calculate a suitable
integer value from a key to the data element
• Note that these algorithms are not likely to
produce perfect hash functions, but that is
not really required
• A reasonably good hash function will do!
Hashing Functions
• One approach such as using the first letter
of the key in the previous example is called
extraction
– For phone or Social Security numbers, we can
convert the last four digits to an integer and
produce a reasonable hash value
– For cars, we could use a portion of the license
plate number to extract a hash value
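A small sketch of the extraction approach, hashing a phone number by its last four digits (the class name, method name, and input format are assumptions for this example):

```java
// Extraction: hash a phone number by converting its last four digits to an int.
public class ExtractionHash {
    public static int hash(String phoneNumber) {
        String digits = phoneNumber.replaceAll("\\D", "");   // keep digits only
        return Integer.parseInt(digits.substring(digits.length() - 4));
    }

    public static void main(String[] args) {
        System.out.println(hash("410-555-1234"));            // prints 1234
    }
}
```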
Hashing Functions
• Another approach is called division
– For phone or Social Security numbers, we can
divide the value of the number by p (a positive
integer) and use the remainder as a hash value
• The mathematics of designing a good hash
function is very complex and we’ll avoid it
• The Java Object class provides a basic hash
function that is inherited by all of our classes
• However, you can override that method
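As an illustration only (not the textbook's code), a class might override hashCode using the division approach on an assumed numeric key field:

```java
// Overriding Object.hashCode() with the division approach: the key is
// reduced modulo a prime p. The key field here is hypothetical.
public class Employee {
    private final long socialSecurityNumber;      // hypothetical key field

    public Employee(long ssn) {
        this.socialSecurityNumber = ssn;
    }

    @Override
    public int hashCode() {
        final int p = 101;                        // a prime divisor
        return (int) (socialSecurityNumber % p);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Employee          // equals must agree with hashCode
                && ((Employee) other).socialSecurityNumber == socialSecurityNumber;
    }
}
```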
Hashing Functions
• We may need to manipulate a hash value
further to be able to use it as an index into
the hash table
• Usually this is done by modulo division of
the hash value by the size of the table.
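A small sketch of that final step, mapping an arbitrary hash value onto a non-negative table index (names are illustrative):

```java
// Modulo division of a hash value by the table size; masking the sign bit
// keeps the index non-negative even if hashCode() is negative.
public class IndexFor {
    public static int indexFor(Object key, int tableSize) {
        return (key.hashCode() & 0x7fffffff) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(indexFor("Elizabeth", 101));   // some value in 0..100
    }
}
```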
Resolving Collisions
• Since we usually cannot create a perfect
hashing function, we need to handle the
collisions that result from the function used
• Obviously, we need to use the chosen
collision resolution algorithm when adding
elements to the table
• Less obviously, our choice of a collision
resolution algorithm affects our removal of
data elements from the table as well
Resolving Collisions
• The chaining method for handling collisions
treats the hash table as an array of object
references to some other type of collection
such as an ordered or unordered list
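A minimal sketch of a chained hash table along these lines (illustrative names, not the L&C implementation):

```java
import java.util.LinkedList;

// Each cell of the table holds a reference to a small collection (here a
// LinkedList) of all the elements that hash to that cell.
public class ChainedHashTable<T> {
    private final LinkedList<T>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
    }

    private int indexFor(T element) {
        return (element.hashCode() & 0x7fffffff) % table.length;
    }

    public void add(T element) {
        table[indexFor(element)].add(element);          // collisions chain in the list
    }

    public boolean contains(T element) {
        return table[indexFor(element)].contains(element);
    }

    public boolean remove(T element) {
        return table[indexFor(element)].remove(element); // delegate to the chain
    }
}
```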
Resolving Collisions
• The chaining method can use an overflow
area in an array of size (n + overflow)
[Figure: an array divided into a sub-array of size n (n is the modulo divisor) followed by a sub-array used as an overflow area]
Deleting – Chained Implementation
• If we store a reference to another collection
in each hash table entry, it is easy to use that
collection’s remove method to delete an
element in a chained hash implementation
• Otherwise, we must manipulate references
appropriately to unlink the object being
removed and maintain the chain based at
the original hash table entry
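A short usage example of the ChainedHashTable sketch above, showing deletion delegating to the chain's own remove method (names are illustrative):

```java
// Reuses the ChainedHashTable sketch from earlier in these notes.
public class ChainedRemoveDemo {
    public static void main(String[] args) {
        ChainedHashTable<String> names = new ChainedHashTable<>(101);
        names.add("Ann");
        names.add("Doug");
        names.remove("Ann");                          // the chain unlinks the element
        System.out.println(names.contains("Ann"));    // false
        System.out.println(names.contains("Doug"));   // true
    }
}
```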
Resolving Collisions
• Open Addressing looks for another open
position in the table other than the one to
which the element is originally hashed
• Three variations of open addressing:
– Linear probing
– Quadratic probing
– Double hashing
Open Addressing – Linear Probing
• Open Addressing with linear probing means that if
an element hashes to position p and position p is
already occupied, we simply try:
position p = (p + 1) % tableSize
• If that position is also occupied, we keep adding 1
modulo the table size until we find:
– An empty position, which we use, OR
– We are back at the original position p and the table is full
(expanding capacity is an option)
• Problem: Tends to create dense occupied clusters
which affects the efficiency of adds and searches
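A minimal sketch of adding with linear probing (illustrative names; no resizing or duplicate handling):

```java
// Open addressing with linear probing: step through positions one at a time,
// wrapping around with % tableSize, until an empty cell is found.
public class LinearProbingTable {
    private final String[] table;

    public LinearProbingTable(int size) {
        table = new String[size];
    }

    private int indexFor(String element) {
        return (element.hashCode() & 0x7fffffff) % table.length;
    }

    public boolean add(String element) {
        int start = indexFor(element);
        int p = start;
        do {
            if (table[p] == null) {        // found an empty position: use it
                table[p] = element;
                return true;
            }
            p = (p + 1) % table.length;    // try the next position
        } while (p != start);
        return false;                      // back at the start: table is full
    }
}
```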
Open Addressing – Quadratic Probing
• Open Addressing with quadratic probing means
that if an element hashes to position p and position
p is already occupied, we use a formula such as:
new hash = original hash + (-1)^(i-1) * ⌊(i+1)/2⌋^2   (for i = 1, 2, …)
• We must divide the new hash modulo tableSize
• This tries a sequence of positions such as:
p, p+1, p-1, p+4, p-4, p+9, p-9, …
• If we get back to the original position, table is full
(expanding capacity is still an option)
• This does not have as strong a tendency to create
dense clusters as linear probing
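A small sketch of generating that quadratic probe sequence (illustrative names; Math.floorMod keeps negative offsets inside the table):

```java
// Quadratic probing offsets +1, -1, +4, -4, +9, -9, ... applied to the
// original hash position.
public class QuadraticProbeSequence {
    public static int probe(int originalPosition, int i, int tableSize) {
        int step = (i + 1) / 2;                             // 1, 1, 2, 2, 3, 3, ...
        int offset = (i % 2 == 1 ? 1 : -1) * step * step;   // +1, -1, +4, -4, ...
        return Math.floorMod(originalPosition + offset, tableSize);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 6; i++) {
            System.out.print(probe(10, i, 101) + " ");      // 11 9 14 6 19 1
        }
    }
}
```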
Open Addressing – Double Hashing
• Open Addressing with double hashing means that
if an element hashes to position p and position p is
already occupied, we use a formula such as:
new hash(x) = original hash(x) + i*secondary hash(x)
• We must divide the new hash modulo tableSize
• This tries a sequence of positions that depends on
the definition of the secondary hash function
• If we get back to the original position, table is full
(expanding capacity is still an option)
• It is somewhat more costly to compute a second
hash function, but this tends to further reduce the
density of clustering over quadratic probing
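A small sketch of a double-hashing probe sequence; the particular secondary hash used here (q - (key mod q) for a prime q) is just one common illustrative choice, not one prescribed by the text:

```java
// Double hashing: the probe step size comes from a secondary hash function,
// so different keys that collide follow different probe sequences.
public class DoubleHashingProbe {
    private static final int Q = 97;          // assumed prime smaller than the table size

    public static int secondaryHash(int key) {
        return Q - (key % Q);                 // never zero, so the probe always advances
    }

    public static int probe(int originalPosition, int i, int key, int tableSize) {
        return (originalPosition + i * secondaryHash(key)) % tableSize;
    }

    public static void main(String[] args) {
        int key = 12345;                      // assumed non-negative key
        int home = key % 101;                 // original hash into a table of size 101
        for (int i = 1; i <= 3; i++) {
            System.out.print(probe(home, i, key, 101) + " ");
        }
    }
}
```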
Deleting – Open Addressing
• If we actually delete any element from an open
addressing implementation, we may make it
impossible to find another entry
• If we start a search at the hashed position and
find any empty space (including one created by
a deletion), we will stop searching
• Therefore, we remove an element by marking it
as deleted without actually removing it
• Then, if we start a search at the hashed position
and find a marked deletion, we know to continue
searching
• We can reuse deleted elements when possible
or rehash the table to recover wasted space
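A minimal sketch of removal by marking, assuming linear probing (the DELETED sentinel and the names are illustrative):

```java
// Open addressing deletion: a removed element is replaced by a DELETED marker
// so that later searches keep probing past it instead of stopping early.
public class OpenAddressingDelete {
    private static final String DELETED = "<deleted>";   // illustrative sentinel
    private final String[] table = new String[101];

    private int indexFor(String element) {
        return (element.hashCode() & 0x7fffffff) % table.length;
    }

    public boolean remove(String element) {
        int start = indexFor(element);
        int p = start;
        do {
            if (table[p] == null) {
                return false;                 // a truly empty cell ends the search
            }
            if (element.equals(table[p])) {
                table[p] = DELETED;           // mark, but do not empty, the cell
                return true;                  // add() could later reuse this slot
            }
            p = (p + 1) % table.length;       // linear probing assumed here
        } while (p != start);
        return false;
    }
}
```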
Dynamic Table Resizing
• To dynamically expand the capacity of the
hash table, we create a new larger array
• However, we note that the size of the table
is used to select locations in the array
• We can’t just copy the smaller array into a
sub-array area within the new larger array
• We must take all elements from the smaller
array and rehash them into the larger array
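A minimal sketch of rehashing into a larger array, assuming linear probing for collisions (illustrative names):

```java
// Every element must be rehashed into the new, larger array because the
// index depends on the table size.
public class Rehashing {
    public static String[] expandCapacity(String[] oldTable, int newSize) {
        String[] newTable = new String[newSize];
        for (String element : oldTable) {
            if (element == null) {
                continue;                               // skip empty cells
            }
            int p = (element.hashCode() & 0x7fffffff) % newSize;
            while (newTable[p] != null) {
                p = (p + 1) % newSize;                  // linear probing assumed
            }
            newTable[p] = element;                      // re-inserted at its new index
        }
        return newTable;
    }
}
```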
Java Collections API: Hash Tables
• There are seven implementations of hashing
– Hashtable
– HashMap
– HashSet
– LinkedHashSet
– LinkedHashMap
– IdentityHashMap
– WeakHashMap
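A short example using two of these hashing-based collections:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CollectionsHashingDemo {
    public static void main(String[] args) {
        Map<String, Integer> extensions = new HashMap<>();
        extensions.put("Ann", 4511);                 // key is hashed to find its bucket
        extensions.put("Doug", 4512);
        System.out.println(extensions.get("Ann"));   // O(1) expected lookup: 4511

        Set<String> names = new HashSet<>();
        names.add("Tim");
        System.out.println(names.contains("Tim"));   // true
    }
}
```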
Hash Table Applications
• Compilers use a hashtable called a symbol
table for the variable names in source code
• Game programs use a hashtable called a
transposition table for positions previously
encountered to “remember” the move made
in that position
• An online spell checker can pre-hash its
dictionary of valid words
• In each case, an O(1) key lookup is achieved
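A tiny sketch of the spell-checker idea, with the dictionary pre-loaded into a hash-based set (the words and names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// The dictionary of valid words goes into a hash-based set, so each word
// lookup is O(1) on average.
public class SpellCheckDemo {
    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>();
        dictionary.add("hash");
        dictionary.add("table");

        for (String word : new String[] {"hash", "tabel"}) {
            if (!dictionary.contains(word)) {
                System.out.println("Possible misspelling: " + word);
            }
        }
    }
}
```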