Ch. 5: Hashing

Chapter 5: Hashing
• Hash Table ADT
• Hash Functions
• Collision Resolution
• Rehashing
CS 340
Hashing
Hashing is a technique for performing
searches, insertions, and deletions on a
collection in constant average time.
A particular component of each data element
being stored is used as a key which is
mapped to a particular cell in a hash table.
Problems arise when collisions occur, i.e.,
when multiple data elements are mapped to
the same cell.
The term “hash” was coined to
illustrate the analogy between
hashing and the culinary practice of
chopping and mixing ingredients to
make a hash.
Essentially, the input domain is
“chopped” into several subdomains,
which are then “mixed” into the
output range to improve the
uniformity of their distribution.
The Hash Table Abstract Data Type
A hash table is a list of keys, mapped to particular cells via a hash
function.
• The table is implemented as a fixed-size array.
• The table size and hash function are strategically chosen to avoid
collisions.
Example: A hash table to hold the CS Department faculty and staff
Hash function: length of last name
Table size: 11
Collision-free keys: 9
Ave. # comparisons per name: 1.50
[Figure: resulting table. Cells hold Yu; Wang; Klein-Mayer-White; Stefik;
Bouvier-Cummins-Ehlmann; Fujinoki; Tornaritis; Bartholomew.]
Hash function: ((Sum of office room # digits) *
(# of vowels in last name) +
(Last 2 digits of office phone #)) % 25
Table size: 25
Collision-free keys: 24
Ave. # comparisons per name: 1.08
Hash function: (Office room #) % 15
Table size: 15
Collision-free keys: 12
Ave. # comparisons per name: 1.42
[Figures: resulting tables for the second and third hash functions. Shared
cells shown include Fujinoki-Yu, Ehlmann-Tornaritis, Bouvier-Mayer, and
Bartholomew-Cummins-Wang.]
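The length-of-last-name example can be checked with a short sketch. It assumes that the hash value is reduced modulo the table size (so an 11-letter name wraps to cell 0) and that finding the k-th name stored in a cell costs k comparisons; neither detail is stated on the slide, and the function names are illustrative only.

```python
from collections import defaultdict

NAMES = ["Yu", "Wang", "Klein", "Mayer", "White", "Stefik", "Bouvier",
         "Cummins", "Ehlmann", "Fujinoki", "Tornaritis", "Bartholomew"]
TABLE_SIZE = 11


def name_length_hash(name, table_size=TABLE_SIZE):
    # Assumption: the name length is reduced modulo the table size.
    return len(name) % table_size


def build_table(names, table_size=TABLE_SIZE):
    # Map each cell index to the list of names that hash to it.
    cells = defaultdict(list)
    for name in names:
        cells[name_length_hash(name, table_size)].append(name)
    return cells


def average_comparisons(cells):
    # Assumption: the k-th name in a cell costs k comparisons to find.
    total = sum(k for chain in cells.values()
                for k in range(1, len(chain) + 1))
    count = sum(len(chain) for chain in cells.values())
    return total / count


cells = build_table(NAMES)
print(cells[5])                    # ['Klein', 'Mayer', 'White']
print(average_comparisons(cells))  # 1.5
```

Under these assumptions the average works out to 18 comparisons over 12 names, which agrees with the 1.50 quoted for this hash function.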
Choosing The Hash Table Size
Define the load factor, λ, of a hash table to be the ratio of the
number of elements in the hash table to the table size.
• If λ > 1, then collisions are inevitable, so it is wise to
choose a table size greater than the number of anticipated
elements.
• If λ << 1, then there will be a very large number of empty
slots, lessening the probability of a collision, but wasting a
lot of memory.
Example: A hash table to hold the 2011 holidays
Table size: 12
Hash function: Month #
Load factor: 1.5
[Figure: resulting table, one line per occupied cell]
New Year’s Day - Martin Luther King, Jr. Day
Lincoln’s Birthday - Valentine’s Day - Washington’s Birthday
St. Patrick’s Day
Easter Sunday
Mother’s Day - Memorial Day
Flag Day - Father’s Day
Independence Day
Labor Day
Columbus Day - Halloween
Veterans Day - Thanksgiving
Christmas
Table size: 43
Hash function: (Month #) + (Day #)
Load factor: 0.419
[Figure: resulting table, one line per occupied cell]
New Year’s Day
Independence Day
Labor Day
Mother’s Day
Lincoln’s Birthday
Valentine’s Day
Martin Luther King, Jr. Day
St. Patrick’s Day - Flag Day - Columbus Day
Veterans Day
Washington’s Birthday
Father’s Day
Easter Sunday
Memorial Day - Thanksgiving
Christmas
Halloween
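The two load factors quoted for the holiday tables follow directly from the definition. A minimal sketch, where the function name is illustrative and the count of 18 holidays is taken from the lists on the slide:

```python
def load_factor(num_elements, table_size):
    # Ratio of stored elements to available cells.
    return num_elements / table_size


NUM_HOLIDAYS = 18  # number of holidays listed in the example

print(round(load_factor(NUM_HOLIDAYS, 12), 3))  # 1.5
print(round(load_factor(NUM_HOLIDAYS, 43), 3))  # 0.419
```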
Choosing The Hash Function
Given a particular hash table size and a particular type of
data, the hash function should be chosen to minimize the
number of collisions.
This usually requires an in-depth analysis of the keys
expected to go in the table.
Example: A hash table to hold CS undergraduate course enrollment
statistics
There are 24 active courses: 108, 140, 145, 150, 240, 275, 312, 314,
321, 325, 330, 340, 390, 423, 425, 434, 438, 447, 454, 456, 482,
490, 495, and 499.
Summing the digits yields: 9, 5, 10, 6, 6, 14, 6, 8, 6, 10, 6, 7,
12, 9, 11, 11, 15, 15, 13, 15, 14, 13, 18, and 22.
(Table size: 18; 6 non-collisions.)
Summing (one’s digit) + 2*(ten’s digit) + 4*(100’s digit) yields:
12, 12, 17, 14, 16, 27, 16, 18, 17, 21, 18, 20, 30, 23, 25, 26, 30,
31, 30, 32, 34, 34, 39, and 43.
(Table size: 32; 11 non-collisions.)
Summing (one’s digit) + 3*(ten’s digit) + 9*(100’s digit) yields:
17, 21, 26, 24, 30, 44, 32, 34, 34, 38, 36, 39, 54, 45, 47, 49, 53,
55, 55, 57, 62, 63, 68, and 72.
(Table size: 56; 20 non-collisions.)
Summing (100’s digit) + 3*(ten’s digit) + 9*(one’s digit) yields:
73, 13, 58, 16, 14, 68, 24, 42, 18, 54, 12, 15, 30, 37, 55, 49, 85,
79, 55, 73, 46, 31, 76, and 112.
(Table size: 100; 20 non-collisions.)
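Candidate hash functions of this weighted-digit form are easy to compare mechanically. The sketch below is one way to do so; the function names are illustrative, and it assumes that the table size is the span between the smallest and largest hash value and that a "non-collision" is a course whose hash value is shared with no other course.

```python
from collections import Counter

COURSES = [108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340,
           390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, 499]


def weighted_digit_hash(course, w_hundreds, w_tens, w_ones):
    # Split a three-digit course number into its digits and weight them.
    hundreds, tens, ones = course // 100, (course // 10) % 10, course % 10
    return w_hundreds * hundreds + w_tens * tens + w_ones * ones


def summarize(weights):
    # Return (table size, number of non-colliding courses) for the weights.
    values = [weighted_digit_hash(c, *weights) for c in COURSES]
    counts = Counter(values)
    non_collisions = sum(1 for v in values if counts[v] == 1)
    table_size = max(values) - min(values) + 1  # assumed sizing rule
    return table_size, non_collisions


# (one's digit) + 3*(ten's digit) + 9*(100's digit)
print(summarize((9, 3, 1)))  # (56, 20)
```

Passing other weight triples, such as (1, 1, 1) for the plain digit sum, gives a quick way to check how evenly a proposed function spreads a known set of keys.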
Collision Resolution
What should be done when a collision does occur?
There are two main strategies: separate chaining and probing.
Separate Chaining
With separate chaining, the hash table is an array of linked lists,
with each linked list containing all of the elements that map to the
same value.
[Figure: separate chaining table for the 2011 holidays, one linked list per line]
New Year’s Day -> Martin Luther King, Jr. Day
Lincoln’s Birthday -> Valentine’s Day -> Washington’s Birthday
St. Patrick’s Day
Easter Sunday
Mother’s Day -> Memorial Day
Flag Day -> Father’s Day
Independence Day
Labor Day
Columbus Day -> Halloween
Veterans Day -> Thanksgiving Day
Christmas Day
Disadvantages:
Average successful search: 1 + (λ/2) comparisons
Average unsuccessful search: λ comparisons
Worst case search: n comparisons (for a …)
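A minimal sketch of separate chaining follows. It uses Python lists in place of linked lists, and the class name, method names, and the (month, day) keys are assumptions made for illustration rather than anything taken from the slides.

```python
class ChainedHashTable:
    """Hash table whose cells each hold a chain of (key, value) pairs."""

    def __init__(self, table_size, hash_function):
        self.cells = [[] for _ in range(table_size)]
        self.hash_function = hash_function

    def _chain(self, key):
        # Select the chain for the cell this key maps to.
        return self.cells[self.hash_function(key) % len(self.cells)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:      # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        # Walk the chain; the cost grows with the chain length.
        for existing_key, value in self._chain(key):
            if existing_key == key:
                return value
        return None

    def load_factor(self):
        return sum(len(chain) for chain in self.cells) / len(self.cells)


# Keys are (month, day) pairs hashed by month, as in the holiday example.
table = ChainedHashTable(12, lambda date: date[0] - 1)
table.insert((2, 12), "Lincoln's Birthday")
table.insert((2, 14), "Valentine's Day")
table.insert((3, 17), "St. Patrick's Day")

print(len(table.cells[1]))     # 2
print(table.search((2, 14)))   # Valentine's Day
print(table.search((2, 29)))   # None
```

Because every key that maps to the same cell shares one chain, a search never looks outside that chain, which is why the average costs depend only on the load factor.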
Collision Resolution (Continued)
Probing
With probing, the hash table is an array of values,
with a whole series of cells probed until no
collision occurs (i.e., cells h0(x), h1(x), h2(x),… are
tried, where hi(x) = (Hash(x) + f(i)) mod tablesize,
with f(0) = 0).
Linear Probing: f(i) is a linear function
Example: f(i) = 3i and Hash(x) = x
(The slots shown correspond to a table size of 10.)
insert 1492 (slot 2)
insert 1776 (slot 6)
insert 1812 (slot 2 -> slot 5)
insert 1945 (slot 5 -> slot 8)
insert 1968 (slot 8 -> slot 1)
insert 1992 (slot 2 -> slot 5 -> slot 8 -> slot 1 -> slot 4)
[Figure: final table. Slot 1: 1968; slot 2: 1492; slot 4: 1992;
slot 5: 1812; slot 6: 1776; slot 8: 1945.]
Problems With Linear Probing:
• Coefficient and table size must be relatively
prime or free cells may not be found.
• Bad tendency to experience primary
clustering, resulting in many collisions.
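The linear probing trace for f(i) = 3i can be reproduced with a few lines. This is a sketch only: the function name is illustrative, and the table size of 10 is inferred from the slot numbers in the example rather than stated on the slide.

```python
def linear_probe_insert(table, key, step=3):
    """Insert key using h_i(x) = (x + step*i) mod tablesize; return the slot."""
    size = len(table)
    for i in range(size):
        slot = (key + step * i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * 10  # assumed table size, inferred from the slots shown
slots = [linear_probe_insert(table, year)
         for year in (1492, 1776, 1812, 1945, 1968, 1992)]
print(slots)  # [2, 6, 5, 8, 1, 4]
```

With step=5 and the same table size, the probe sequence from any home slot would visit only two distinct cells, which illustrates why the coefficient and the table size should be relatively prime.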
Collision Resolution (Continued)
Quadratic Probing: f(i) is a quadratic function
Example: f(i) = 2i^2 and Hash(x) = x
(The slots shown correspond to a table size of 10.)
insert 1492 (slot 2)
insert 1776 (slot 6)
insert 1812 (slot 2 -> slot 4)
insert 1945 (slot 5)
insert 1968 (slot 8)
insert 1992 (slot 2 -> slot 4 -> slot 0)
[Figure: final table. Slot 0: 1992; slot 2: 1492; slot 4: 1812;
slot 5: 1945; slot 6: 1776; slot 8: 1968.]
Problems With Quadratic Probing:
• Coefficient and table size must be carefully
chosen or free cells may be ignored.
• Bad tendency to experience secondary
clustering, since keys with the same original
hashed value will follow the same sequence
of cells through the table.
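Both problems can be observed by listing probe sequences for f(i) = 2i^2. The helper names below are illustrative, and the table size of 10 is again inferred from the example's slot numbers.

```python
def probe_sequence(key, table_size, f, num_probes):
    # Cells h_0(x), h_1(x), ... where h_i(x) = (x + f(i)) mod tablesize.
    return [(key + f(i)) % table_size for i in range(num_probes)]


def quadratic(i):
    return 2 * i * i


seq_1812 = probe_sequence(1812, 10, quadratic, 10)
seq_1992 = probe_sequence(1992, 10, quadratic, 10)

print(seq_1812)               # [2, 4, 0, 0, 4, 2, 4, 0, 0, 4]
print(sorted(set(seq_1812)))  # [0, 2, 4]
print(seq_1812 == seq_1992)   # True
```

Only three of the ten cells are ever probed from home slot 2, so free cells elsewhere are ignored, and the two keys that share that home slot follow identical sequences, which is the secondary clustering described above.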
Collision Resolution (Continued)
Double Hashing: f(i) is a second hash function,
multiplied by an iterative value
Example: f(i) = iHash2(x), where Hash2(x) = 7 - x
mod 7, and Hash(x) = x
insert
1492
(slot 2)
1492
insert
1776
1492
insert
1812
(slot 2 
slot 3)
(slot 6)
1776
1492
1812
insert
1945
1492
1812
(slot 5)
1492
1812
(slot 8)
1945
1776
1776
insert
1968
1945
1776
insert
1992
(slot 2
slot 5 
slot 8 
slot 1)
1968
1992
1492
1812
1945
1776
1968
Problems With Double Hashing:
• A strategic choice must be made for both
hashing functions.
• Calculation will be much more expensive
in the event of a collision.
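The double hashing trace can be sketched in the same style. The function names are illustrative, the table size of 10 is inferred from the slots in the example, and Hash2 is read as 7 - (x mod 7), which is the reading consistent with those slots.

```python
def hash2(x):
    # Second hash function from the example; always between 1 and 7.
    return 7 - (x % 7)


def double_hash_insert(table, key):
    """Insert key using h_i(x) = (x + i*hash2(x)) mod tablesize."""
    size = len(table)
    step = hash2(key)  # computed once per key, reused for every probe
    for i in range(size):
        slot = (key + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * 10  # assumed table size, inferred from the slots shown
slots = [double_hash_insert(table, year)
         for year in (1492, 1776, 1812, 1945, 1968, 1992)]
print(slots)  # [2, 6, 3, 5, 8, 1]
```

Keys 1812 and 1992 share home slot 2 but get different step sizes (1 and 3), so their probe sequences diverge, unlike in the quadratic case.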
Rehashing
When a hash table starts getting too full, with
many delays caused by repeated collisions,
rehashing the values into a new, larger table with a
new hash function may alleviate the problem.
Insert:
Bush2 (2001)
Clinton (1993)
Bush (1989)
Reagan (1981)
Carter (1977)
Ford (1974)
Nixon (1969)
Johnson (1963)
Kennedy (1961)
Eisenhower (1953)
Hash (president) =
first_year_in_office mod 11
[Figure: 11-cell table, in cell order: Nixon, Reagan, Clinton, Kennedy,
Ford, Johnson, Eisenhower, Carter, Bush, Bush2.]
Inserting Obama (2009)
would cause a collision in
slot 7, so…
REHASH
Hash (president) =
first_year_in_office mod 23
[Figure: 23-cell table after rehashing, holding Bush2, Reagan, Obama,
Kennedy, Johnson, Bush, Nixon, Clinton, Ford, Eisenhower, and Carter.]
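A sketch of the rehashing step for this example is given below. The slides do not say which collision strategy the tables use, so the code assumes linear probing with a step of 1, which is consistent with slot 7 already being occupied when Obama is inserted; the function names are illustrative.

```python
PRESIDENTS = [("Bush2", 2001), ("Clinton", 1993), ("Bush", 1989),
              ("Reagan", 1981), ("Carter", 1977), ("Ford", 1974),
              ("Nixon", 1969), ("Johnson", 1963), ("Kennedy", 1961),
              ("Eisenhower", 1953)]


def insert(table, name, year):
    """Hash by year mod table size; assumed linear probing with step 1."""
    size = len(table)
    for i in range(size):
        slot = (year + i) % size
        if table[slot] is None:
            table[slot] = (name, year)
            return slot
    raise RuntimeError("table is full")


def rehash(old_table, new_size):
    # Reinsert every stored entry into a fresh, larger table.
    new_table = [None] * new_size
    for entry in old_table:
        if entry is not None:
            insert(new_table, *entry)
    return new_table


table = [None] * 11
for name, year in PRESIDENTS:
    insert(table, name, year)

print(2009 % 11)          # 7, Obama's home slot in the old table
print(table[2009 % 11])   # ('Eisenhower', 1953)

table = rehash(table, 23)
insert(table, "Obama", 2009)
stored = sum(entry is not None for entry in table)
print(stored, len(table))  # 11 23
```

Rehashing lowers the load factor from 10/11 to 11/23, so later insertions are less likely to run into occupied cells.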
Associative Arrays
Hashing is used extensively in modern
programming, particularly in database
management, network security, and operating
systems.
Consequently, most modern programming
languages have built-in mechanisms for
implementing associative arrays, i.e., dictionaries
based on the key-value concept of hash tables.
C++ “map”:
    #include <map>
    #include <string>
    using namespace std;
    // Note: std::map is an ordered, tree-based container;
    // std::unordered_map is its hash-table counterpart.
    int main()
    {
        map<string, string> phone_book;
        phone_book["Sally Smart"] = "555-9999";
        phone_book["John Doe"] = "555-1212";
        phone_book["J. Random Hacker"] = "553-1337";
        return 0;
    }
Java “map”:
    import java.util.HashMap;
    import java.util.Map;
    Map<String, String> phoneBook = new HashMap<String, String>();
    phoneBook.put("Sally Smart", "555-9999");
    phoneBook.put("John Doe", "555-1212");
    phoneBook.put("J. Random Hacker", "555-1337");
Lua “table”:
    phone_book = {
        ["Sally Smart"] = "555-9999",
        ["John Doe"] = "555-1212",
        ["J. Random Hacker"] = "553-1337", -- Trailing comma is OK
    }
    aTable = {
        -- Table as value
        subTable = { 5, 7.5, k = true }, -- key is "subTable"
        -- Function as value
        ['John Doe'] = function (age) if age < 18 then return "Young" else return "Old!" end end,
        -- Table and function (and other types) can also be used as keys
    }
Perl “hash”:
    my %phone_book = (
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337',
    );
Python “dictionary”:
    phonebook = {
        'Sally Smart' : '555-9999',
        'John Doe' : '555-1212',
        'J. Random Hacker' : '553-1337'
    }
Ruby “hash”:
    phonebook = {
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337'
    }
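The snippets above only construct a phone book. As a brief illustration of the search, insertion, and deletion operations that hashing is meant to speed up, here is how they look on a Python dictionary; the "Jane Roe" entry is a made-up example and not part of the original data.

```python
phonebook = {
    "Sally Smart": "555-9999",
    "John Doe": "555-1212",
    "J. Random Hacker": "553-1337",
}

# Search: look a key up directly, or test for membership first.
print(phonebook["John Doe"])          # 555-1212
print("Jane Roe" in phonebook)        # False
print(phonebook.get("Jane Roe"))      # None

# Insertion: assigning to a new key adds an entry (hypothetical data).
phonebook["Jane Roe"] = "555-0100"

# Deletion: remove an entry by key.
del phonebook["Sally Smart"]

print(sorted(phonebook))  # ['J. Random Hacker', 'Jane Roe', 'John Doe']
```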