10. Hashing

241-423 Advanced Data Structures and Algorithms, Semester 2, 2013-2014
Objectives
– introduce hashing, hash functions, hash tables, collisions, linear probing, double hashing, bucket hashing, JDK hash classes
Contents
1. Searching an Array
2. Hashing
3. Creating a Hash Function
4. Solution #1: Linear Probing
5. Solution #2: Double Hashing
6. Solution #3: Bucket Hashing
7. Table Resizing
8. Java's hashCode()
9. Hash Tables in Java
10. Efficiency
11. Ordered/Unordered Sets and Maps
1. Searching an Array
• If the array is not sorted, a search requires O(n) time.
• If the array is sorted, a binary search requires O(log n) time.
• If the array is organized using hashing, then it is possible to have constant-time search: O(1).
2. Hashing
• A hash function takes a search item and returns its array location (its index position).
• The array + hash function is called a hash table.
• Hash tables support efficient insertion and search in O(1) time
  – but the hash function must be carefully chosen
A Simple Hash Function
• The simple hash function maps each item directly to a table index:
  hashCode("kiwi") = 0
  hashCode("banana") = 2
  hashCode("watermelon") = 3
  hashCode("apple") = 5
  hashCode("mango") = 6
  hashCode("cantaloupe") = 7
  hashCode("grapes") = 8
  hashCode("strawberry") = 9
• Resulting table:
  0: kiwi, 2: banana, 3: watermelon, 5: apple, 6: mango, 7: cantaloupe, 8: grapes, 9: strawberry
  (locations 1 and 4 are empty)
A Table with Key-Value Pairs
• A hash table for a map storing (ID, Name) items
  – ID is a nine-digit integer
• The hash table is an array of size N = 10,000
• The hash function is hashCode(ID) = last four digits of the ID key
• Example entries:
  index 1:    (025610001, jim)
  index 2:    (981100002, ad)
  index 4:    (451220004, tim)
  ...
  index 9998: (200759998, tim)
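Taking the last four digits of a nine-digit ID is the same as ID mod 10,000. A minimal sketch (the method name is ours, not from the slides):

// Hedged sketch: map a nine-digit ID to an index in a table of size 10,000.
public static int idHash(int id)
{
    return id % 10000;      // e.g. 025610001 -> 1, 200759998 -> 9998
}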
Applications of Hash Tables
• Small databases
• Compilers
• Web Browser caches
3. Creating a Hash Function
• The hash function should return the table index where an item is to be placed
  – but it's not possible to write a perfect hash function
  – the best we can usually do is to write a hash function that tells us where to start looking for a location to place the item
A Real Hash Function
• A more realistic hash function produces:
  hash("apple") = 5
  hash("watermelon") = 3
  hash("grapes") = 8
  hash("cantaloupe") = 7
  hash("kiwi") = 0
  hash("strawberry") = 9
  hash("mango") = 6
  hash("banana") = 2
  hash("honeydew") = 6
• "mango" already occupies location 6. Now what?
Collisions
• A collision occurs when two items hash to the same array location
  – e.g. "mango" and "honeydew" both hash to 6
• Where should we place the second and subsequent items that hash to this same location?
• Three popular solutions:
  – linear probing, double hashing, bucket hashing (chaining)
4. Solution #1: Linear Probing
• Linear probing handles collisions by placing the colliding item in the next empty table cell (perhaps after cycling around to the start of the table).
  Example: key 32 is to be added to a table with N = 13, and hash(32) = 6.
  Location 2 holds 41, locations 5, 6, 7 hold 18, 44, 59, and location 9 holds 22.
  Cells 6 and 7 are occupied and cell 8 is empty, so 32 is placed at location 8.
• Each inspected table cell is called a probe
  – in the above example, three probes were needed to insert 32
Example 1
• A hash table with N = 13 and h(k) = k mod 13
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
• Total number of probes: 19

  k     h(k)   probes
  18     5     5
  41     2     2
  22     9     9
  44     5     5 6
  59     7     7
  32     6     6 7 8
  31     5     5 6 7 8 9 10
  73     8     8 9 10 11

• Final table: location 2: 41; locations 5-11: 18, 44, 59, 32, 22, 31, 73 (all other locations empty)
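A minimal Java sketch of this insertion process (the method and the Integer[] table layout are our assumptions; it expects nonnegative keys and a table that is not full):

// Hedged sketch: insert a key into an Integer[] table with linear probing,
// returning the number of probes used, counted as in the example above.
public static int linearProbeInsert(Integer[] table, int key)
{
    int N = table.length;
    int i = key % N;                 // h(k) = k mod N
    int probes = 1;                  // the first cell inspected counts as a probe
    while (table[i] != null) {       // keep probing until an empty cell is found
        i = (i + 1) % N;             // next cell, wrapping around the table
        probes++;
    }
    table[i] = key;
    return probes;
}

Inserting 18, 41, 22, 44, 59, 32, 31, 73 into a table of size 13 and summing the returned probe counts gives 19, as in the example.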
Example 2: Insertion
• Suppose we want to add "seagull":
  – hash(seagull) = 143
  – 3 probes:
    table[143] is not empty;
    table[144] is not empty;
    table[145] is empty
  – put seagull at location 145
• The table now contains:
  ...
  141: (empty)
  142: robin
  143: sparrow
  144: hawk
  145: seagull
  146: (empty)
  147: bluejay
  148: owl
  ...
Searching
• Look up "seagull":
– hash(seagull) = 143
– 3 probes:
table[143] != seagull;
table[144] != seagull;
table[145] == seagull
– found seagull at location 145
...
141
142
144
robin
sparrow
hawk
145
seagull
143
146
147
bluejay
148
owl
...
Searching Again
• Look up "cow":
– hash(cow) = 144
...
141
142
– 3 probes:
table[144] != cow;
table[145] != cow;
table[146] is empty
– "cow" is not in the table
since the probes reached
an empty cell
ADSA: Hashing/10
144
robin
sparrow
hawk
145
seagull
143
146
147
bluejay
148
owl
...
15
Insertion Again
• Add "hawk":
– hash(hawk) = 143
...
141
142
– 2 probes:
table[143] != hawk;
table[144] == hawk
– hawk is already in the table, so
do nothing
144
robin
sparrow
hawk
145
seagull
143
146
147
bluejay
148
owl
...
Insertion Again
• Add "cardinal":
– hash(cardinal) = 147
...
141
142
– 3 or more probes:
147 and 148 are occupied;
"cardinal" goes in location 0
(or 1, or 2, or ...)
ADSA: Hashing/10
144
robin
sparrow
hawk
145
seagull
143
146
147
bluejay
148
owl
Search with Linear Probing
• To search hash table A for key k, get(k):
  – start at cell h(k)
  – probe consecutive locations until:
    • an item with key k is found, or
    • an empty cell is found, or
    • N cells have been unsuccessfully probed (N is the table size)
Algorithm get(k)
  i ← h(k)
  p ← 0                      // count the number of probes
  repeat
    c ← A[i]
    if c = ∅ then             // empty cell
      return null
    else if c.key() = k then
      return c.element()
    else                      // linear probing
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return null
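A direct Java rendering of this pseudocode (the Entry class, its field names, and the use of k mod N as h(k) are our assumptions, not part of the slides):

// Hedged sketch: get(k) over an open-addressed Entry[] table A, using linear probing.
static class Entry {
    int key;
    Object element;
}

public static Object get(Entry[] A, int k)
{
    int N = A.length;
    int i = Math.floorMod(k, N);     // start at cell h(k) = k mod N
    int p = 0;                       // count the number of probes
    do {
        Entry c = A[i];
        if (c == null)               // empty cell: k cannot be in the table
            return null;
        if (c.key == k)              // found the item with key k
            return c.element;
        i = (i + 1) % N;             // linear probing: move to the next cell
        p++;
    } while (p < N);                 // stop after N unsuccessful probes
    return null;
}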
Lazy Deletion
• Deletions are done by marking a table cell as deleted, rather than emptying the cell.
• Deleted locations are treated as empty when inserting and as occupied during a search.
Updates with Lazy Deletion
• delete(k)
  – start at cell h(k)
  – probe consecutive cells until:
    • a cell with key k is found:
      put DELETED in the cell; return true
    • or an empty cell is found:
      return false
    • or N cells have been probed:
      return false
• insert(k, v)
  – start at cell h(k)
  – probe consecutive cells until:
    • a cell i is found that is either empty or contains DELETED:
      put v in cell i; return true
    • or N cells have been probed:
      return false
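A minimal key-only sketch of both operations using sentinel values (the EMPTY and DELETED constants and the nonnegative-key assumption are ours, not the slides'):

// Hedged sketch: lazy deletion in an open-addressed table of nonnegative int keys.
// EMPTY marks a never-used cell, DELETED marks a removed key.
static final int EMPTY = -1, DELETED = -2;

public static boolean delete(int[] table, int k)
{
    int N = table.length;
    int i = k % N;                              // start at cell h(k) = k mod N
    for (int p = 0; p < N; p++) {               // at most N probes
        if (table[i] == EMPTY)                  // empty cell: k is not in the table
            return false;
        if (table[i] == k) {                    // found k: mark the cell as deleted
            table[i] = DELETED;
            return true;
        }
        i = (i + 1) % N;                        // linear probing
    }
    return false;
}

public static boolean insert(int[] table, int k)
{
    int N = table.length;
    int i = k % N;
    for (int p = 0; p < N; p++) {
        if (table[i] == EMPTY || table[i] == DELETED) {   // reusable cell
            table[i] = k;
            return true;
        }
        i = (i + 1) % N;
    }
    return false;                               // probed all N cells: table is full
}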
Clustering
• Linear probing tends to form "clusters".
  – a cluster is a sequence of non-empty array locations with no empty cells among them
    • e.g. the cluster at locations 5-11 in Example 1 above
• The bigger a cluster gets, the more likely it is that new items will hash into that cluster, making it ever bigger.
• Clusters reduce hash table efficiency
  – searching becomes sequential (O(n))
• If the size of the table is large relative to the number of items, linear probing is fast (≈ O(1))
  – a good hash function generates indices that are evenly distributed over the table range, and collisions will be minimal
• As the ratio of the number of items (n) to the table size (N) approaches 1, hashing slows down to the speed of a sequential search (≈ O(n)).
5. Solution #2: Double Hashing
• In the event of a collision, compute a second 'offset' hash function
  – this is used as an offset from the collision location
• Linear probing always uses an offset of 1, which contributes to clustering. Double hashing makes the offset depend on the key, so it is more "random".
Example of Double Hashing
• A hash table with N = 13, h(k) = k mod 13, and d(k) = 7 - (k mod 7)
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
• Total number of probes: 11

  k     h(k)   d(k)   probes
  18     5      3     5
  41     2      1     2
  22     9      6     9
  44     5      5     5 10
  59     7      4     7
  32     6      3     6
  31     5      4     5 9 0
  73     8      4     8

• Final table: 0: 31, 2: 41, 5: 18, 6: 32, 7: 59, 8: 73, 9: 22, 10: 44 (all other locations empty)
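In the example above, the j-th probe for key k is (h(k) + j*d(k)) mod N. A hedged sketch of the insertion step (the method name and Integer[] layout are ours):

// Hedged sketch: insert a key with double hashing, returning the probe count.
// h(k) = k mod N, d(k) = 7 - (k mod 7), as in the example above; keys are nonnegative.
public static int doubleHashInsert(Integer[] table, int key)
{
    int N = table.length;
    int h = key % N;                       // primary hash
    int d = 7 - (key % 7);                 // secondary 'offset' hash
    for (int j = 0; j < N; j++) {
        int i = (h + j * d) % N;           // j-th probe location
        if (table[i] == null) {            // empty cell: place the key here
            table[i] = key;
            return j + 1;                  // number of probes used
        }
    }
    return -1;                             // no empty cell found
}

Inserting the eight keys above into a table of size 13 and summing the returned probe counts gives 11, matching the example. Because N = 13 is prime and 1 <= d(k) <= 7, every probe sequence eventually visits all cells; this is one reason prime table sizes are preferred.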
6. Solution #3: Bucket Hashing
• The previous solutions use open addressing: all items are stored directly in the array.
• In bucket hashing (also called chaining), an array cell points to a linked list of all the items that hash to that cell
  – e.g. robin, sparrow, hawk, and seagull could all sit in the list attached to one cell
• Chaining is generally faster than linear probing:
  – searching only examines items that hash to the same table location
• With linear probing and double hashing, the number of table items (n) is limited to the table size (N), whereas the linked lists in chaining can keep growing.
• To delete an element, just erase it from its list.
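A minimal bucket-hashing sketch using an array of linked lists (the class and method names are illustrative, not Ford & Topp's or the JDK's):

import java.util.LinkedList;

// Hedged sketch: each table cell holds a linked list of the items that hash to it.
public class BucketHashSet {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    public BucketHashSet(int n) {
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++)
            buckets[i] = new LinkedList<String>();
    }

    private int index(String key) {
        // nonnegative table index from hashCode() (see the String hash slides below)
        return (key.hashCode() & Integer.MAX_VALUE) % buckets.length;
    }

    public void add(String key) {
        LinkedList<String> list = buckets[index(key)];
        if (!list.contains(key))        // search only this cell's list
            list.addFirst(key);
    }

    public boolean contains(String key) {
        return buckets[index(key)].contains(key);
    }

    public boolean remove(String key) {
        return buckets[index(key)].remove(key);   // just erase it from its list
    }
}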
7. Table Resizing
• As the number of items in the hash table increases, search speed goes down.
• Increase the hash table size when the number of items in the table reaches a specified percentage of its size
  – this works with chaining also.
• Create a new table with the specified size and cycle through the items in the original table.
• For each item, use the hash() value modulo the new table size to compute its new index.
• Insert the item at the front of its new linked list.
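A hedged sketch of that resizing step for a chained table, reusing the illustrative array-of-lists representation from the bucket-hashing sketch above (assumes java.util.LinkedList is imported):

// Hedged sketch: rebuild a chained table into a larger one.
public static LinkedList<String>[] resize(LinkedList<String>[] old, int newSize)
{
    @SuppressWarnings("unchecked")
    LinkedList<String>[] bigger = new LinkedList[newSize];
    for (int i = 0; i < newSize; i++)
        bigger[i] = new LinkedList<String>();

    for (LinkedList<String> list : old)            // cycle through the original items
        for (String item : list) {
            int j = (item.hashCode() & Integer.MAX_VALUE) % newSize;  // new index
            bigger[j].addFirst(item);              // insert at the front of the list
        }
    return bigger;
}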
8. Java's hashCode()
• public int hashCode() is defined in Object
  – by default it returns a value typically derived from the object's memory address
• hashCode() does not know the size of the hash table
  – the returned value must be adjusted
    • e.g. hashCode() % N
• hashCode() can be overridden in your classes
Coding your own hashCode()
• Your hashCode() must:
  – always return the same value for the same item
    • it can't use random numbers, or the time of day
  – always return the same value for equal items
    • if o1.equals(o2) is true, then hashCode() for o1 and o2 must return the same number
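For instance, a hedged sketch of a class that keeps equals() and hashCode() consistent (the Point class is purely illustrative):

// Equal Points always produce the same hash value, satisfying the rules above.
public class Point {
    private final int x, y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y;     // deterministic, and consistent with equals()
    }
}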
• A good hashCode() should:
  – be fast to evaluate
  – produce uniformly distributed hash values
    • this spreads the hash table indices around the table, which helps minimize collisions
  – not assign similar hash values to similar items
String Hash Function
• In the majority of hash table applications, the key is a string.
  – combine the string's characters to form an integer

// essentially what java.lang.String.hashCode() does
public int hashCode()
{
    int hash = 0;
    for (int i = 0; i < s.length(); i++)   // s is the underlying string
        hash = 31*hash + s.charAt(i);
    return hash;
}
String strA = "and";
String strB = "uncharacteristically";
String strC = "algorithm";
hashVal = strA.hashCode();
// hashVal = 96727
hashVal = strB.hashCode();
// hashVal = -2112884372
hashVal = strC.hashCode();
// hashVal = 225490031
A hash function might overflow and return a negative
number. The following code insures that the table index is
nonnegative.
tableIndex = (hashVal & Integer.MAX_VALUE) % tableSize
Time24 Hash Function
• For the Time24 class, the hash value for an object is its time converted to minutes.
• Since an hour contributes 60 minutes and the minute field is between 0 and 59, each distinct time gets a distinct hash value.

public int hashCode()
{  return hour*60 + minute;  }
9. Hash Tables in Java
• Java provides HashSet, Hashtable, and HashMap in java.util
  – HashSet is a set
  – Hashtable and HashMap are maps
• Hashtable is synchronized; it can be accessed safely from multiple threads
  – Hashtable uses a bucket (chained) hash, and has rehash() for resizing the table
• HashMap is newer, faster, and usually better, but it is not synchronized
  – HashMap also uses a bucket hash, and has a remove() method
Hash Table Operations
• HashSet, Hashtable, and HashMap have no-argument constructors, and constructors that take an integer table size.
• HashSet has add(), contains(), remove(), iterator(), etc.
• Hashtable and HashMap include:
  – public Object put(Object key, Object value)
    • returns the previous value for this key, or null
  – public Object get(Object key)
  – public void clear()
  – public Set keySet()
    • the returned set dynamically reflects changes in the hash table
  – many others
Using HashMap
• A HashMap with Strings as keys and values: a telephone book
  "Charles Nguyen"   -> "(531) 9392 4587"
  "Lisa Jones"       -> "(402) 4536 4674"
  "William H. Smith" -> "(998) 5488 0123"
Coding a Map
HashMap<String, String> phoneBook =
    new HashMap<String, String>();

phoneBook.put("Charles Nguyen", "(531) 9392 4587");
phoneBook.put("Lisa Jones", "(402) 4536 4674");
phoneBook.put("William H. Smith", "(998) 5488 0123");

String phoneNumber = phoneBook.get("Lisa Jones");
System.out.println(phoneNumber);    // prints: (402) 4536 4674
HashMap<String, String> h =
    new HashMap<String, String>(100,     /* capacity */
                                0.75f);  /* load factor */

h.put("WA", "Washington");
h.put("NY", "New York");
h.put("RI", "Rhode Island");
h.put("BC", "British Columbia");
Capacities and Load Factors
• HashMap rounds capacities up to powers of two
  – e.g. 100 --> 128
  – the default capacity is 16; the default load factor is 0.75
• The load factor is used to decide when it is time to double the size of the table
  – just after you have added 96 elements in this example
  – 128 * 0.75 == 96
• Hash tables in general work best with capacities that are prime numbers.
10. Efficiency
• Hash tables are efficient
  – until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
• The cost of insertion / access is O(1)
• Even if the table is nearly full (leading to long searches), efficiency often remains quite high.
• Hash tables work best when the table size (N) is a prime number. HashMap uses powers of 2 for N.
• In the worst case, searches, insertions, and removals on a hash table take O(n) time (n = number of items)
  – the worst case occurs when all the keys inserted into the map collide
• The load factor λ = n/N affects the performance of a hash table (N = table size)
  – for linear probing, 0 ≤ λ ≤ 1
  – for chaining with lists, it is possible that λ > 1
• Assume that the hash function uniformly distributes indices around the hash table.
  – we can then expect λ = n/N elements in each cell
  – on average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure
  – mathematical analysis shows that the average number of probes for a successful search is approximately 1 + λ/2
  – so keep λ small!
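• For example (illustrative numbers, not from the slides): with n = 7,500 items in a chained table of size N = 10,000, λ = 0.75, so an unsuccessful search makes about 0.75 comparisons on average and a successful search needs roughly 1 + 0.75/2 ≈ 1.4 probes.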
11. Ordered/Unordered Sets and Maps
• Use an ordered set or map if an iteration should return items in order
  – average search time: O(log2 n)
• Use an unordered set or map with hashing when fast access and updates are needed, without caring about the ordering of items
  – average search time: O(1)
Timing Tests
• SearchComp.java:
  – read a file of 25,025 randomly ordered words and insert each word into a TreeSet and a HashSet
  – report the amount of time required to build both data structures
  – shuffle the file input and time a search of the TreeSet and HashSet for each shuffled word
  – report the total time required for both search techniques
Ford & Topp's HashSet build and search times
are much better than TreeSet.
• SearchJComp.java:
  – replace Ford & Topp's TreeSet and HashSet with the ones in the JDK
  – the JDK HashSet and TreeSet are much the same speed for searching