Hash Tables

HASH TABLES

The crucial disadvantage of arrays is that we
must allocate the size of the structure in advance.

We tend to overestimate the size and end up
with a very sparse structure.
1
STORING BIG DATA

We tend to think that the actual number of keys
to be stored is equal to the size of the universe of
possible keys
2
HASH TABLES

Often the number of keys to be stored is smaller
than the number in the universe of keys.

In this case, a hash table may save us a lot of space.
3
HASH TABLES

How can you store all possible SSNs in an array?

Use an array with range 0 - 999,999,999: a billion
possible locations!
This will give you O(1) access time, but
considering there are approximately

308,000,000 people in the USA, you waste
1,000,000,000 - 308,000,000 = 692,000,000 array entries!

4
PROBLEM - WASTED SPACE

Problem:
The range of key values we are mapping is too large
(0 - 999,999,999) when compared to
the number of actual keys (US citizens)
5
HASH TABLES

All search structures so far

Relied on a comparison operation

Performance O(n) or O(log n) for input of size n

WE CAN DO BETTER WITH HASHING
6

Simplest case:
 Assume we have keys with values in the range 1 .. M

Use a hash method to compute an int from the key;
that int selects the slot in a direct access table in
which to store the item
7
HASH(KEY)


To search for an item with key k,

look in slot hash(k), which
produces an int that maps to
an index in the array.

If there’s an item there,
you’ve found it.

If the slot is empty (its tag is 0), the item is missing.
8
CONSTANT TIME SEARCH

This produces a Constant time search
O(1)
9
EXAMPLE (IDEAL) HASH FUNCTION


Suppose we now have Strings and
must hash them to an integer.
Our hash function maps the
following values:

hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

Resulting table:

Index  Key
0      kiwi
1
2      banana
3      watermelon
4
5      apple
6      mango
7      cantaloupe
8      grapes
9      strawberry
10
WHY HASH TABLES?

We use key/value pairs to store
an Entry in the table

Index  Key      Value
...
141
142    robin    robin info
143    sparrow  sparrow info
144    hawk     hawk info
145    seagull  seagull info
146
147    bluejay  bluejay info
148    owl      owl info
...

We use a hash function
to map a key such as "hawk"
to an integer: hash(hawk)

The value column holds the
data we are actually
interested in
11
HASH FUNCTIONS

Hash tables normally provide O(1) time (constant
time) to access an element

In a direct access table, an element whose key is k
(an integer value) is normally stored in slot k.

In hash tables, this element is stored in
slot = hash(key).
12
HASH FUNCTIONS

hash(k) is a hash function.

It maps the universe U of keys into the slots of a
hash table (smaller than the universe),

thus reducing the size of the space we need to use.
13
PICTORIAL VIEW OF HASH TABLES
UNIVERSE OF VALUES ARE MAPPED TO
A SMALLER NUMBER OF SLOTS
[Figure: keys k1, k2, k3, k4 from the universe mapped into table slots]
14
HASHING

Assume I have a hash function where the key is a String

e.g. a label which represents a city in our HPAir project

hash( key ) → integer
i.e. the function maps the key to an integer:
a String (the city name) is mapped to an int,
which is an index into the HashMap

What performance (Big-O) do I get?
15
HASH TABLES - CONSTRAINTS

Initial constraints: hash a key to an integer

The hash code of a key must be unique

Keys must lie in a small range for storage
efficiency

Keys must be dense in the range

If they’re sparse (lots of gaps between values),
a lot of space is used to obtain speed
16
HASH TABLES 
Hashing keys produces integers; therefore

we need a hash function
hash( key ) → integer
i.e. one that maps (hashes) a key to
an integer

Applying this function to the key produces a
unique address
17
PROBLEMS WITH A UNIQUE ADDRESS FOR EACH KEY

If hash(key) maps each key to a unique
integer in the range 0 .. m-1
then search is O(1) -
BUT THIS IS HARD TO DO!!!!!
18

Example: using an n-character key, e.g. a String

n = number of characters in the String.

Use a String class method (toCharArray) to change the String to a
character array.
Call a method with the array name and the number of
chars in the String:
hash(char array, # of characters)
19
HASHING A STRING OF CHARACTERS

// n = number of chars in the String
int hash( char[] sarray, int n )
{
    int sum = 0, i = 0;
    // sum the ASCII (Unicode) values of the characters
    while ( n-- > 0 )
        sum = sum + sarray[i++];   // a char is used as its numeric code
    return sum % 256;   // 256 = number of extended ASCII codes
}
// returns a value in 0 .. 255
20
EVALUATION
int hash( char[] sarray, int n )
{
    int sum = 0, i = 0;
    while ( n-- > 0 )   // get the ASCII value of each character
                        // and sum them
        sum = sum + sarray[i++];
    return sum % 256;
}
// returns a value in 0 .. 255

The hash function itself takes time proportional to the
number of characters in the String. If key length is
bounded by a constant, this is treated as O(1): the cost
does not grow with the number of items in the table
21
HASH TABLES – PROBLEM -COLLISIONS

With this hash function
int hash( char[] s, int n ) {
    int sum = 0, i = 0;
    while ( n-- > 0 ) sum = sum + s[i++];
    return sum % 256;
}
the calls

hash( "AB".toCharArray(), 2 ) and
hash( "BA".toCharArray(), 2 )
return the same value!
The Unicode (ASCII) value of A is 65, and of B is 66.
Add them together in either order and they equal 131.
This is called a collision
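
The collision can be checked directly. The following is a minimal sketch of the additive hash in compilable Java; the class name AdditiveHashDemo is only an illustration and is not part of the slides.

```java
// Sketch of the additive string hash, written as compilable Java.
class AdditiveHashDemo {
    // Sums the character codes of the first n characters, then reduces mod 256.
    static int hash(char[] s, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += s[i]; // a char widens to its numeric (Unicode) code
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        // Both calls add 65 and 66, so both print 131.
        System.out.println(hash("AB".toCharArray(), 2));
        System.out.println(hash("BA".toCharArray(), 2));
    }
}
```

Any hash that only adds character codes ignores character order, so every rearrangement of the same letters collides.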
22
COLLISIONS

Because we're mapping a larger universe into
a smaller set of slots, collisions occur.

A variety of techniques are used for resolving
collisions

Therefore giving every key a unique address is HARD TO
DO.
23
PICTORIAL VIEW OF COLLISION
Sometimes keys map to the
same memory location: a COLLISION
[Figure: keys k1 to k5 mapped into table slots, with two keys landing in the same slot]
24
HASH TABLES – COLLISION
SOLUTIONS I
We need to store the actual key with the item in
the hash table
We compute the address
index = hash( key )
Next,
look at that index in the table
if the location is occupied by a different key, then we
try the next entry until we find an open one
25
COLLISION RESOLUTION & OPEN HASHING

The most common resolution mechanism for collisions
is called chaining .


This is also called Open Hashing.

Being "open", the Hashtable will store a linked list of
entries whose keys hash to the same value

Chaining incorporates the concepts of linked lists and
direct access structures like arrays

Each slot of a hash table will be a pointer to a linked
list
26
CHAINING OR OPEN HASHING

When hashing a key, if a collision happens

the new key is stored in the linked list in that
location

E.g., suppose that we're mapping the universe of
integers to a hash table of size 10
27
OPEN HASH TABLE
[Figure: keys hashed to buckets, each bucket pointing to a list of entries]
John Smith and Sandra map to the same location, so a
linked list is started from John to Sandra
28
HASH TABLES - LINKED LISTS

Collisions - Resolution
A linked list is attached
to each primary table slot
// Three entries map to the same location

h(k) == h(k1) == h(k2)

Searching for k1
  Calculate hash(k1)
  If the first item doesn’t match,
  follow the linked list to k1

If NULL is found, the key isn’t in the table
29
HASH TABLES - LINKED LISTS
If a search can be satisfied
by any item with key k,
performance is still O(1),
but
if the key values are different
we get O( 1 * max ),
where
max is the largest number
of duplicates, i.e. the length of the
longest chain (linked list)
30
 TECHNIQUE TWO
- USE AN OVERFLOW AREA
The linked list is constructed in a special area of the table
called the OVERFLOW AREA
If two keys map to the same location,
hash(k) == hash(j),
k is stored first
Adding j:
when hash(j) maps to the same slot as hash(k),
find k, THEN
go to the first free slot in the overflow area and
put j in it
Searching - same as for a linked list

31
HASHING(103)
hash(103) = 103 mod 10
hash(103) = 3
Our hash function is
based on the division
method for creating
hash functions:
hash(k) = k mod size
32
HASHING(103)
hash(n) = 103 mod 10
hash(n) = 3
slot 3: 103 /
33
HASHING(69)
hash(n) = 69 mod 10
hash(n) = 9
slot 3: 103 /
slot 9: 69 /
34
HASHING(20)
hash(n) = 20 mod 10
hash(n) = 0
slot 0: 20 /
slot 3: 103 /
slot 9: 69 /
35
HASHING(13)
hash(n) = 13 mod 10
hash(n) = 3
slot 0: 20 /
slot 3: 103 -> 13 /
slot 9: 69 /
36
HASHING(110)
hash(n) = 110 mod 10
hash(n) = 0
slot 0: 20 -> 110 /
slot 3: 103 -> 13 /
slot 9: 69 /
37
HASHING(53)
hash(n) = 53 mod 10
hash(n) = 3
slot 0: 20 -> 110 /
slot 3: 103 -> 13 -> 53 /
slot 9: 69 /
38
FINAL HASH TABLE
slot 0: 20 -> 110 /
slot 3: 103 -> 13 -> 53 /
slot 9: 69 /
39
SEARCHING FOR 53 USING CHAINING
hash(53) = 53 mod 10 = 3, so only the chain in slot 3 is searched.
slot 0: 20 -> 110 /
slot 3: 103 -> 13 -> 53 /
slot 9: 69 /
A pointer temp starts at the head of the chain in slot 3 and moves
103 -> 13 -> 53, stopping when it reaches 53.
44
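
The chaining walk-through above (keys 103, 69, 20, 13, 110, 53 in a table of size 10) can be sketched in Java as follows. The class ChainedTable and its method names are hypothetical, and new keys are appended at the tail of each chain, which is an assumption made to match the order drawn on the slides.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of a chained (open hashing) table of int keys using the division method.
class ChainedTable {
    private final List<LinkedList<Integer>> slots = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) {
            slots.add(new LinkedList<>()); // one (initially empty) chain per slot
        }
    }

    // Division method: hash(k) = k mod size.
    private int hash(int key) {
        return Math.floorMod(key, slots.size());
    }

    void insert(int key) {
        LinkedList<Integer> chain = slots.get(hash(key));
        if (!chain.contains(key)) {
            chain.addLast(key); // assumption: append at the tail of the chain
        }
    }

    // Only the chain in slot hash(key) is searched.
    boolean contains(int key) {
        return slots.get(hash(key)).contains(key);
    }

    String chain(int slot) {
        return slots.get(slot).toString();
    }

    static ChainedTable slideExample() {
        ChainedTable t = new ChainedTable(10);
        for (int k : new int[] {103, 69, 20, 13, 110, 53}) {
            t.insert(k);
        }
        return t;
    }

    public static void main(String[] args) {
        ChainedTable t = slideExample();
        System.out.println("slot 0: " + t.chain(0));
        System.out.println("slot 3: " + t.chain(3));
        System.out.println("slot 9: " + t.chain(9));
        System.out.println("contains 53? " + t.contains(53));
    }
}
```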
CLOSED HASHING - RE-HASH FUNCTIONS
Closed hashing is a method of collision resolution
in hash tables.
With this method, a hash collision is resolved by

probing, or
searching through other locations in the array.
45
+1 SOLUTION - LINEAR PROBING
In one variation, the probing sequence
is called
(+1) – Linear Probing
Continue probing adjacent locations
until an unused array slot is found.
Then put the Entry in that location.
46
CLOSED HASHING - E.G. LINEAR PROBING
Closed Hashing keeps keys in the main table and uses
a re-hash function, which has many variations.
Linear probing (the previous example) is the most
common form of Closed Hashing.
It uses the main table, or flat area, to find
another location
47
REHASH FUNCTION - LINEAR PROBING
The rehash function for Linear Probing is
hash’(x) = +1, i.e. the next slot tried is
(current slot + 1) mod table size
Keep going to the next slot until you find
an empty one
48
INSERTION, I

Suppose you want to add seagull
to this hash table

Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is empty
Therefore, put seagull at location
145

Table after the insertion:
Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
...
49
SEARCHING, I



Suppose you want to look up
seagull in this hash table
Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is not empty
  table[145] == seagull !
We found seagull at location 145

Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
...
50
SEARCHING, II




Suppose you want to look up cow in
this hash table
Also suppose:
  hashCode(cow) = 144
  table[144] is not empty
  table[144] != cow
  table[145] is not empty
  table[145] != cow
  table[146] is empty
If cow were in the table, we should
have found it by now
Therefore, it isn’t here

Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
...
51
INSERTION, II

Suppose you want to add hawk to
this hash table

Also suppose
  hashCode(hawk) = 143
  table[143] is not empty
  table[143] != hawk
  table[144] is not empty
  table[144] == hawk
hawk is already in the table, so do
nothing

Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
...
52
INSERTION, III

Suppose:
  You want to add cardinal to
  this hash table
  hashCode(cardinal) = 147
  The last location is 148
  147 and 148 are occupied

Solution:
  Treat the table as circular; after
  148 comes 0
  Hence, cardinal goes in
  location 0 (or 1, or 2, or ...)

Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
53
LINEAR PROBING – REVIEW:

Closed Hashing uses Linear Probing (among
others)

Linear Probing: If position h(key) is occupied, do
a linear search in the table until you find an empty
slot.

The slots are searched in this order:

h(key), h(key)+1, h(key)+2, ..., h(key)+c
(each taken mod the table size, since the table is treated as circular)
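
The bird examples from the insertion and searching slides can be reproduced with a short linear probing sketch. Everything named here is an assumption for illustration: the class LinearProbingTable, the table size of 149 (indices 0 to 148), and the hash codes for robin, bluejay and owl, which the slides do not state and which are simply taken to be their table positions.

```java
import java.util.Map;
import java.util.function.ToIntFunction;

// Sketch of closed hashing with linear probing over a circular table of Strings.
class LinearProbingTable {
    private final String[] slots;
    private final ToIntFunction<String> hashCodeOf; // supplied hash function

    LinearProbingTable(int size, ToIntFunction<String> hashCodeOf) {
        this.slots = new String[size];
        this.hashCodeOf = hashCodeOf;
    }

    // Returns the slot holding key, or the first empty slot on its probe path,
    // or -1 if every slot was probed without finding either.
    private int findSlot(String key) {
        int i = Math.floorMod(hashCodeOf.applyAsInt(key), slots.length);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null || slots[i].equals(key)) {
                return i;
            }
            i = (i + 1) % slots.length; // +1 probing, wrapping to 0 at the end
        }
        return -1;
    }

    void add(String key) {
        int i = findSlot(key);
        if (i < 0) {
            throw new IllegalStateException("table is full");
        }
        slots[i] = key; // no change if the key was already there
    }

    // Index of key in the table, or -1 if an empty slot is reached first.
    int indexOf(String key) {
        int i = findSlot(key);
        return (i >= 0 && slots[i] != null) ? i : -1;
    }

    static LinearProbingTable birdExample() {
        // Assumed hash codes; only seagull, hawk, cow and cardinal come from the slides.
        Map<String, Integer> codes = Map.of(
                "robin", 142, "sparrow", 143, "hawk", 143, "seagull", 143,
                "bluejay", 147, "owl", 148, "cardinal", 147, "cow", 144);
        LinearProbingTable t = new LinearProbingTable(149, key -> codes.get(key));
        String[] birds = {"robin", "sparrow", "hawk", "seagull", "bluejay", "owl", "cardinal"};
        for (String bird : birds) {
            t.add(bird);
        }
        return t;
    }

    public static void main(String[] args) {
        LinearProbingTable t = birdExample();
        System.out.println("seagull at " + t.indexOf("seagull"));
        System.out.println("cardinal at " + t.indexOf("cardinal"));
        System.out.println("cow at " + t.indexOf("cow"));
    }
}
```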
54
EXPANDING THE TABLE

If the table becomes full, an exception can be
thrown or


we can expand the capacity.

This process is involved because if we double the
size,


we risk a “sparse” structure that can impact the
efficiency we seek.

One solution is to rehash the table using the
new table size.
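
Rehashing into a larger table can be sketched as below. The class RehashDemo is hypothetical; it assumes linear probing, int keys stored as Integer with null marking an empty slot, and a new size larger than the number of stored keys.

```java
import java.util.Arrays;

// Sketch: move every key from an old linear probing table into a larger one.
class RehashDemo {
    // Each key is hashed again with the NEW size, because slot = key mod size.
    static Integer[] rehash(Integer[] old, int newSize) {
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key == null) {
                continue; // empty slot in the old table
            }
            int i = Math.floorMod(key, newSize);
            while (bigger[i] != null) {
                i = (i + 1) % newSize; // linear probing in the new table
            }
            bigger[i] = key;
        }
        return bigger;
    }

    public static void main(String[] args) {
        // Old table of size 5: 20 in slot 0, 13 in slot 3, 8 probed on to slot 4.
        Integer[] old = {20, null, null, 13, 8};
        System.out.println(Arrays.toString(rehash(old, 11)));
    }
}
```

Copying the old array into the first slots of the new one would not work: a key's slot depends on the table size, so lookups would start at the wrong home position.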
55
CLOSED HASHING - BUCKETS

One implementation for closed hashing groups hash
table slots into buckets.

The M slots of the hash table are divided into B
buckets, with each bucket consisting of M/B slots.

The hash function assigns each record to the first
slot within one of the buckets.
56
BUCKET HASHING - USES MAIN TABLE

If this slot is already occupied,

then the bucket slots are searched sequentially until an
open slot is found.
57
BUCKETS ON THE TABLE

If a bucket is entirely full,

then the record is stored in an overflow bucket of
infinite capacity at the end of the table.

All buckets share the same overflow bucket.
See this link for a fuller explanation:

http://research.cs.vt.edu/AVresearch/hashing/buckethash.php
58
SLOTS OR BUCKETS – 4 BUCKETS
59
BUCKET HASHING

To search, hash the key to determine which bucket
should contain the record.

The records in this bucket are then searched.

How is this better than linear probing? -- +1
60
BUCKET HASHING
If the desired key value is not found and the
bucket still has free slots, then the search is
complete.

If the bucket is full, then the search goes to
the overflow bucket.

If many records are in the overflow bucket,
this will be an expensive process.
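
A minimal in-memory sketch of bucket hashing follows. The class BucketTable is hypothetical; it assumes int keys, a bucket chosen by key mod B, no deletions, and an ArrayList standing in for the shared overflow bucket.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bucket hashing: M = B * bucketSize slots plus one shared overflow bucket.
class BucketTable {
    private final Integer[] slots;
    private final int buckets;
    private final int bucketSize;
    private final List<Integer> overflow = new ArrayList<>();

    BucketTable(int buckets, int bucketSize) {
        this.buckets = buckets;
        this.bucketSize = bucketSize;
        this.slots = new Integer[buckets * bucketSize];
    }

    // Index of the first slot of the bucket this key hashes to.
    private int bucketStart(int key) {
        return Math.floorMod(key, buckets) * bucketSize;
    }

    void insert(int key) {
        int start = bucketStart(key);
        for (int j = 0; j < bucketSize; j++) {
            if (slots[start + j] == null) {
                slots[start + j] = key; // first open slot in the bucket
                return;
            }
        }
        overflow.add(key); // bucket is full
    }

    boolean contains(int key) {
        int start = bucketStart(key);
        for (int j = 0; j < bucketSize; j++) {
            Integer stored = slots[start + j];
            if (stored == null) {
                return false; // free slot reached: the search is complete
            }
            if (stored == key) {
                return true;
            }
        }
        return overflow.contains(key); // full bucket: search the overflow bucket
    }

    int overflowSize() {
        return overflow.size();
    }

    public static void main(String[] args) {
        BucketTable t = new BucketTable(4, 2);
        for (int k : new int[] {4, 8, 12}) {
            t.insert(k); // all three hash to bucket 0, which holds only two
        }
        System.out.println(t.contains(12) + " " + t.overflowSize());
    }
}
```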
61
BUCKET HASHING ADVANTAGE
Bucket methods are good for implementing
hash tables stored on disk, because the
bucket size can be set to the size of a disk
block.
Whenever search or insertion occurs, the
entire bucket is read into memory.
62
USING BUCKETS
Because the entire bucket is then in memory,
processing an insert or search operation
requires only one disk access, unless the
bucket is full.
If the bucket is full, then the overflow bucket
must be retrieved from disk as well.
63
CLUSTERING

Even with a good hash function, linear probing has its
problems:

The position of the initial mapping of key k is
called the home position of k.

When several insertions map to the same home
position, they end up placed contiguously in the
table.

This collection of keys with the same home
position is called a cluster.
64
CLUSTERS
A cluster is a group of items not containing
any open slots
Clusters cause efficiency to degrade
65
CLUSTERING
As clusters grow, the probability
increases that a
key will map to the middle of a cluster,
increasing the rate of the cluster’s growth.
66
CLUSTERS
This tendency of linear probing to place
items together is known as primary
clustering.
As these clusters grow, they merge with
other clusters, forming even bigger clusters
which grow even faster.
67
OTHER COLLISION TECHNIQUES

We have looked at
chaining (linked lists), i.e. Open Hashing, and
linear probing and bucket hashing, i.e. Closed Hashing.
Let us look at some other collision techniques
68
Other Closed Hashing techniques are:

Quadratic probing: a variant of the above where
the term being added to the hash result is squared.
h(key) + c^2, where c = 1, 2, 3, ... is the probe number

Random probing: the term being added to the hash
function is a random number.
h(key) + random()
69
REHASH FUNCTIONS

Rehashing is a technique where a
sequence of hashing functions is defined
(h1, h2, ... hk).

If a collision occurs, the functions are used
in this order
70
Use a second hash function
- Re-Hashing
Suppose hash(k) == hash(j)
and k is stored first

Adding j:
Calculate hash(j)
Find k already there

Calculate hash2(j), where
hash2 is some
other (second) hash function
Repeat until we find an empty slot
Put j in it
71
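
One common way to use a second hash function is double hashing, where the second function sets the step between probes. The sketch below is an assumption rather than the method on the slide: the class DoubleHashingDemo, the choice h2(k) = 1 + (k mod (m - 1)), and the table size 11 are all illustrative.

```java
// Sketch of double hashing: the i-th probe for key k is (h1(k) + i * h2(k)) mod m.
class DoubleHashingDemo {
    static int h1(int key, int m) {
        return Math.floorMod(key, m);
    }

    // Assumed second hash; it is never 0, so every probe moves to a new slot.
    static int h2(int key, int m) {
        return 1 + Math.floorMod(key, m - 1);
    }

    static int probe(int key, int i, int m) {
        return (h1(key, m) + i * h2(key, m)) % m;
    }

    public static void main(String[] args) {
        int m = 11;
        // 14 and 25 share home slot 3 but follow different probe sequences.
        for (int key : new int[] {14, 25}) {
            StringBuilder path = new StringBuilder("key " + key + ":");
            for (int i = 0; i < 4; i++) {
                path.append(' ').append(probe(key, i, m));
            }
            System.out.println(path);
        }
    }
}
```

Because colliding keys usually get different step sizes, they stop following each other through the table.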
HASH TABLES - RE-HASH FUNCTIONS
The re-hash function has many variations
Quadratic probing:
h’(x) is squared
Avoids primary clustering
Secondary clustering occurs:

all keys which collide on h(x) follow the same sequence

First
a = h(j)
Then
a + c, a + 4c, a + 9c, a + 16c, ....

72
QUADRATIC PROBING
Some versions use:

p(K, i) = c1 i^2 + c2 i + c3 for some choice of
constants c1, c2, and c3.

Secondary clustering is generally less of a problem
than primary clustering
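
A short sketch of a quadratic probe sequence follows, assuming the general form p(i) = c1 i^2 + c2 i + c3 added to the home position and taken mod the table size. The class QuadraticProbeDemo, the table size 11, and the constants c1 = 1, c2 = 0, c3 = 0 are illustrative choices.

```java
// Sketch: slot visited on the i-th probe under quadratic probing.
class QuadraticProbeDemo {
    static int probe(int home, int i, int m, int c1, int c2, int c3) {
        int offset = c1 * i * i + c2 * i + c3;
        return Math.floorMod(home + offset, m);
    }

    public static void main(String[] args) {
        // Home slot 3, table size 11, offsets 0, 1, 4, 9, 16.
        for (int i = 0; i < 5; i++) {
            System.out.println("probe " + i + " -> slot " + probe(3, i, 11, 1, 0, 0));
        }
    }
}
```

Every key with home slot 3 visits the same slots in the same order, which is the secondary clustering described above.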
73
SEARCHING IN A HASH TABLE


We have already seen how searching works with chaining.
With Closed Hashing, we use the following steps

Given a target, hash the target

Take the value of the hash of target and go to the
slot.

If the target exists, it must be in this slot or in a
later slot of the probe sequence

Search from the current slot using a linear
search until the target or an empty slot is found.
74
LOOK UP A KEY
public lookup(key)
{
    int i;
    i = find_slot(key);          // method to find the key's slot in the table
    if (slot[i] is occupied)     // key is in table
        return slot[i].value;    // return value in slot
    else
        // key is not in table
        return not found;
}
75
LINEAR PROBING AND SINGLE-SLOT STEP
public find_slot(key)
{
    int i;
    i = hash(key);   // use a hash method to hash the key
    // search until we either find the key, or find an empty slot
    // (assumes the table always keeps at least one empty slot)
    while ( (slot[i] is occupied) and ( slot[i].key ≠ key ) )
    {
        i = (i + 1) mod table_size;   // wrap around: the table is circular
    }
    return i;
}
76
Deleting in a table – Closed Hashing


Suppose you want to look up cow
in this hash table
Also suppose:
hashCode(cow) = 144
– table[144] is not empty
– table[144] != cow
– table[145] is not empty
– table[145] != cow
– table[146] is empty
If cow were in the table, we should
have found it by now
Therefore it is not there.

Index  Key
...
141
142    robin
143    sparrow
144    hawk
145    seagull
146
147    bluejay
148    owl
...
77
DELETING FROM A TABLE
Problem:
When an empty slot is reached, we assume the
item we are searching for is not there.

Deletion leaves an empty slot.

When we next search for an item using linear
probing,
we assume the item is not there when we
reach the empty slot.
78
TOMBSTONES
We assume the item is not there when
we reach the empty slot.
In fact, the item could be
AFTER the empty slot.
79
TOMBSTONES
Therefore, straight deletion of an item would
not work.
Instead, the cell is marked (usually by use of a
boolean variable) when an item is deleted.
The marked slot is often termed a “tombstone”.
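
Deletion with tombstones can be sketched as follows. The class TombstoneTable is hypothetical; it assumes int keys, linear probing with hash(k) = k mod size, and a boolean array as the deleted marker.

```java
// Sketch of linear probing with tombstones marking deleted slots.
class TombstoneTable {
    private final Integer[] keys;
    private final boolean[] deleted; // true means this slot is a tombstone

    TombstoneTable(int size) {
        keys = new Integer[size];
        deleted = new boolean[size];
    }

    // Returns the index of key, or -1 if a never-used slot is reached first.
    int find(int key) {
        int i = Math.floorMod(key, keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null && !deleted[i]) {
                return -1; // truly empty slot: the key cannot be further on
            }
            if (keys[i] != null && keys[i] == key) {
                return i;
            }
            i = (i + 1) % keys.length; // occupied by another key, or a tombstone
        }
        return -1;
    }

    void add(int key) {
        if (find(key) >= 0) {
            return; // already present
        }
        int i = Math.floorMod(key, keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) { // empty slot or tombstone: both can be reused
                keys[i] = key;
                deleted[i] = false;
                return;
            }
            i = (i + 1) % keys.length;
        }
        throw new IllegalStateException("table is full");
    }

    boolean remove(int key) {
        int i = find(key);
        if (i < 0) {
            return false;
        }
        keys[i] = null;
        deleted[i] = true; // leave a tombstone so later searches keep probing
        return true;
    }

    public static void main(String[] args) {
        TombstoneTable t = new TombstoneTable(10);
        t.add(13);
        t.add(23);
        t.add(33); // 13, 23, 33 land in slots 3, 4, 5
        t.remove(23);
        System.out.println("33 found at " + t.find(33));
    }
}
```

Without the tombstone in slot 4, the search for 33 would stop at the empty slot and wrongly report that 33 is missing.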
80
HASH TABLES - SUMMARY SO FAR ...

Potential O(1) search time

If a suitable function hash(key) → integer can
be found

Space for speed trade-off
 “Full” hash tables don’t work (more later!)

Collisions
 Inevitable
81
Various resolution strategies looked at so far:
Linked lists
Overflow areas
Re-hash functions
  Linear probing: h’ is +1
  Quadratic probing: h’ is + i^2
  Any other hash function!
  or even a sequence of functions!
82
COMPARISON OF COLLISION TECHNIQUES
[Figure: expected number of probes versus load factor (n/size) for linear probing, random probing, and chaining]
83
HASHING WITH CHAINING
What is the running time to insert/search/delete?

Insert: It takes O(1) time to compute the hash
function and insert at the head of the linked list

Search: It is proportional to the maximum linked list
length

Delete: Same as search
84
EFFICIENCY OF CHAINING

Therefore, if we have a “bad” hash function,
all n keys may hash to the same table index
giving an O(n) run-time!
So how can we create a “good” hash
function?
85
HASH TABLES - CHOOSING THE HASH FUNCTION

Some functions are definitely better than
others!
Key criterion:
Minimum number of collisions
Keeps chains short
Maintains O(1) on average

86
WRITING YOUR OWN HASHCODE METHOD

A hashCode method must:

Return a value that is a legal array index

Always return the same value for the same input

It can’t use random numbers, or the time of day

Return the same value for equal inputs

Must be consistent with your equals method
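
As a sketch of these rules, the hypothetical City class below overrides equals and hashCode together, using java.util.Objects.hash over the same fields that equals compares. The helper slotFor, which reduces the hash code to a legal index for an assumed table size, is also only an illustration.

```java
import java.util.Objects;

// Sketch: a key class whose hashCode is consistent with its equals method.
class City {
    private final String name;
    private final String state;

    City(String name, String state) {
        this.name = name;
        this.state = state;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof City)) {
            return false;
        }
        City that = (City) other;
        return name.equals(that.name) && state.equals(that.state);
    }

    // Built only from the fields used by equals, with no randomness or clock,
    // so equal cities always produce the same value.
    @Override
    public int hashCode() {
        return Objects.hash(name, state);
    }

    // Reduces any hash code (possibly negative) to an index in 0 .. tableSize-1.
    static int slotFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        City a = new City("Austin", "TX");
        City b = new City("Austin", "TX");
        System.out.println(a.equals(b) + " " + (a.hashCode() == b.hashCode()));
        System.out.println("slot: " + slotFor(a, 13));
    }
}
```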
87
HASHCODE FUNCTION


It does not need to return different values for
different inputs – some collisions are inevitable.
A good hashCode method should:

Be efficient to compute

Give a uniform distribution of array indices

so NO SPARSE ARRAYS!
88
OTHER CONSIDERATIONS
The hash table might fill up; we need to be
prepared for that
Generally speaking, hash tables work best
when the table size is a prime number
89
HASH TABLES IN JAVA

Java provides two classes, Hashtable and
HashMap, which implement the Map
interface

Both are maps: they associate keys with values

Hashtable is synchronized; it can be accessed
safely from multiple threads

Hashtable uses an open hash (chaining) and has a
rehash method to increase the size of the
table
90
HASHMAP

HashMap is newer, faster, and usually
better,
but it is not synchronized
HashMap (by default) uses a bucket hash (each bucket is a linked list)
and has a remove method

91
HASH TABLE OPERATIONS
Both Hashtable and HashMap are in java.util
Both have no-argument constructors, as well as
constructors that take an integer table size (the initial capacity)
Both have methods as listed in the next slide
92
METHODS

// put the entry in the table; returns the previous value for this key, or null
public V put(K key, V value)

// returns the value for this key, or null
public V get(Object key)

public void clear() // clears the table

public Set<K> keySet() // returns the keys in the table as a Set
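
A brief usage sketch of these methods with java.util.HashMap follows, reusing the bird entries from the earlier slides. The class name BirdMapDemo and the helper build are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: put, get, keySet and clear on a HashMap of bird names to bird info.
class BirdMapDemo {
    static Map<String, String> build() {
        Map<String, String> birds = new HashMap<>();
        birds.put("robin", "robin info");
        birds.put("hawk", "hawk info");
        birds.put("owl", "owl info");
        return birds;
    }

    public static void main(String[] args) {
        Map<String, String> birds = build();
        System.out.println(birds.get("hawk"));   // value stored for this key
        System.out.println(birds.get("cow"));    // null: key is not in the map
        System.out.println(birds.keySet().contains("owl"));
        birds.clear();
        System.out.println(birds.isEmpty());
    }
}
```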
93
HASH TABLES - REDUCING THE RANGE TO [ 0, M )

We’ve mapped the keys to a range of integers
0 ≤ key < r,
where r is the total number of possible keys.
For social security numbers, r = 1,000,000,000 (keys 0 - 999,999,999)

Now we must reduce this range to [ 0, m )
// from 0 up to, but not including, m
where m is a reasonable size for the
hash table
94
HASH TABLES – HASH FUNCTIONS
Some typical functions

Division: Use a mod function
hash(k) = abs( k mod m )
where m is the table size,
which yields a range between 0 and m-1
95
Some typical functions

Choice of m?

Powers of 2 are generally not good!
h(k) = k mod 2^n
simply selects the n low-order bits of k

Prime numbers close to 2^n are good
choices
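
The effect of the table size can be seen with keys that share their low-order bits. In this sketch the class TableSizeDemo and the key set (multiples of 16) are chosen purely for illustration; 17 is used as a prime close to 2^4 = 16.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: count how many different slots a set of keys reaches under k mod m.
class TableSizeDemo {
    static int distinctSlots(int[] keys, int m) {
        Set<Integer> used = new HashSet<>();
        for (int k : keys) {
            used.add(Math.floorMod(k, m));
        }
        return used.size();
    }

    // Returns 0, step, 2*step, ... (count values in total).
    static int[] multiplesOf(int step, int count) {
        int[] keys = new int[count];
        for (int i = 0; i < count; i++) {
            keys[i] = step * i;
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = multiplesOf(16, 17);
        System.out.println("m = 16: " + distinctSlots(keys, 16) + " slot(s) used");
        System.out.println("m = 17: " + distinctSlots(keys, 17) + " slot(s) used");
    }
}
```

With m = 16 every key has the same four low bits, so all of them collide in one slot; with the prime 17 the same keys spread over all 17 slots.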
96
CHOOSING A VIABLE VALUE FOR M

Prime numbers close to 2^n are good choices
E.g. if you want a ~4000-entry table,
choose m = 4093
Other methods are in your text.
97
PERFORMANCE ANALYSIS
If n slots in a table of size m are occupied, the
load factor α is defined as:

α = n / m

n = number of items
m = number of slots
α = 1 means the table is full, and α = 0
means the table is empty.
It is generally good to keep α < 1, near
0.8.
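
The load factor is a one-line calculation. In the sketch below the class LoadFactorDemo and the helper needsExpansion are hypothetical, and the 0.8 threshold is simply the figure quoted above.

```java
// Sketch: load factor alpha = n / m, with an assumed expansion threshold.
class LoadFactorDemo {
    static double loadFactor(int items, int slots) {
        return (double) items / slots; // cast first to avoid integer division
    }

    // Assumption: expand the table once alpha rises above the threshold.
    static boolean needsExpansion(int items, int slots, double threshold) {
        return loadFactor(items, slots) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(6, 10));          // 6 items in 10 slots
        System.out.println(needsExpansion(6, 10, 0.8));
        System.out.println(needsExpansion(9, 10, 0.8));
    }
}
```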
98
[Figure: Successful search. Average number of probes (0 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining]
99
[Figure: Unsuccessful search. Average number of probes (0 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining]
100
HASH TABLES - COLLISION RESOLUTION SUMMARY

Chaining
+ Unlimited number of elements
+ Unlimited number of collisions
- Overhead of multiple linked lists

Re-hashing
+ Fast re-hashing
+ Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable - CLUSTERING!

Overflow area
+ Fast access
+ Collisions don't use primary table space
101
TERMS TO KNOW
• Open Addressing looks for another open position in the
table other than the one to which the element is originally
hashed. Requires that the load factor be < 1.

Open Addressing using Linear Probing seeks the next
available position; it creates clusters. Alternative methods include quadratic probing etc.

Separate Chaining: if two keys map to the same address,
separate chaining creates a linked list of keys that map to
that address.
102
HASHCODE FUNCTION IN JAVA

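Hash function - has two parts:
Map key k to an integer
Reduce that integer to a table index in the range [ 0, m )

There is a default hashCode() in Java (Object.hashCode); the method
maps each object to an integer.

It returns a 32-bit integer, which may be derived from where the
object is in memory.

The default would work poorly with Strings, as two Strings could be
in different locations in memory and contain the
same data; the String class therefore overrides hashCode()
to compute the value from the characters.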
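
The point about Strings can be checked with two separately created objects holding the same characters. The class StringHashDemo is hypothetical. System.identityHashCode reports the default, identity-based hash code; its values differ from run to run, so no output is shown for it.

```java
// Sketch: String.hashCode depends on the characters, not on the object's identity.
class StringHashDemo {
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = new String("hawk"); // two distinct objects,
        String b = new String("hawk"); // same character data
        System.out.println(a == b);          // false: different objects
        System.out.println(a.equals(b));     // true: same contents
        System.out.println(sameHash(a, b));  // true: hashCode uses the contents
        // Identity-based values, as the default Object.hashCode would give:
        System.out.println(System.identityHashCode(a));
        System.out.println(System.identityHashCode(b));
    }
}
```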
103
HASH TABLES - REVIEW
•
If you can meet the constraints of a hash function
that gives O(1) performance:

Hash Tables will generally give good performance

O(1) search
104
 BUT:
 not advisable for unknown data
 If collection size is relatively static – few
insertions and deletions - memory
management is actually simpler –
105
UNIVERSAL OR PERFECT HASHING

“Dynamic perfect hashing" involves using a second
hash table as the data structure to store multiple
values within a particular bucket.

How do we find the next location with this
approach?
106
UNIVERSAL HASHING
What advantages does it have over linear
probing?
What are possible problems with the
approach?
Perfect hashing means that read access takes
constant time even in the worst case.
107
UNIVERSAL OR PERFECT HASHING
For inserting, the time bounds are only true
on average.
To make insertion fast enough,
the second level hash table is very large for
the number of keys k (about k^2 slots),
large enough so that collisions become
unlikely.
108

SECOND LEVEL HASH TABLES

This is not a problem with table size because the first
level hash distributes keys evenly

so that on average second level hash tables are still
relatively small.

The hash functions for the second level tables are
chosen at random from a set of parameterized hash
functions.
109
UNIVERSAL HASHING

It is possible when you know exactly what set of
keys you are going to be hashing when you design
your hash function.

It's popular for hashing keywords for compilers

Minimal perfect hashing guarantees that n keys
will map to 0..n-1 with no collisions at all.
110
CHAINED BUCKET

Note: when using chaining,
each linked list attached to a slot is called a bucket
 - this is called “chained bucket hashing”


However, there is also “bucket hashing” done on
the main table - just to make things real clear.
111