Hashing
Chapter 10

The search time of each algorithm discussed so far depends on the number n of
elements in the collection S of data.
 A searching technique called hashing, or hash addressing, is
essentially independent of the number n.
 Hashing uses a data structure called a hash table. Although hash
tables provide fast insertion, deletion, and retrieval, operations
that involve searching, such as finding the minimum or
maximum value, are not performed very quickly.
 Hashing is also used in many encryption algorithms.
 A hash table is a data structure in which keys are mapped to
array positions by a hash function. The table can be searched
for an item in O(1) time on average, using the hash function to form an
address from the key.
 A hash function is a function that, when applied to a key,
produces an integer that can be used as an address in a hash
table.

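 Perfect hash function: maps every key to a distinct table position, so no collisions occur.
 Good hash function: is fast to compute and spreads the keys evenly over the table, so collisions are rare.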
 When more than one element tries to occupy the same array
position, we have a collision.
 Collision is a condition resulting when two or more keys
produce the same hash location.
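
A collision is easy to reproduce with any concrete hash function. The sketch below assumes, purely for illustration, a table of 9 positions and the remainder as the hash function; the names `TABLE_SIZE` and `hash_position` are hypothetical.

```python
TABLE_SIZE = 9  # assumed table size, chosen only for illustration


def hash_position(key: int) -> int:
    """Map an integer key to an array position (remainder hashing)."""
    return key % TABLE_SIZE


# 12 and 21 are different keys, yet both map to position 3: a collision.
print(hash_position(12), hash_position(21))
print(hash_position(12) == hash_position(21))
```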
 Comparison of keys was the main operation used by the previously
discussed searching methods.
 A different way of searching is to calculate the position of the
key in the table from the value of the key itself.
 The search time is then reduced to O(1), from O(n) or O(log n).
 We need to find a function h that can transform a key K (string,
number, record, etc.) into an index in the table used for storing
items of the same type as K.
 This function is called a hash function.
 Example:
Suppose we want to store a sequence of randomly
generated numbers, keys: 5, 17, 37, 20, 42, 3. The array A,
the hash table, where we want to store the numbers:
  0    1    2    3    4    5    6    7    8
|    |    |    |    |    |    |    |    |    |
We need a way of mapping the numbers to the array
indexes, a hash function, that will let us store the
numbers and later recompute the index when we want
to retrieve them. There is a natural choice for this.
 Our hash table has 9 fields, and the mod function, which
sends every integer to its remainder modulo 9, maps an
integer to a number between 0 and 8.
5 mod 9 = 5
17 mod 9 = 8
37 mod 9 = 1
20 mod 9 = 2
42 mod 9 = 6
3 mod 9 = 3
We store the values:
  0    1    2    3    4    5    6    7    8
|    | 37 | 20 |  3 |    |  5 | 42 |    | 17 |
In this case, computing the hash value n mod 9 of a number n to
be stored costs a constant amount of time, and
so does the actual storage, because n is stored directly in
an array field.
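
The worked example can be reproduced in a few lines. This is a minimal sketch rather than a full hash table: it assumes that no two keys collide (true for these six keys), and the names `table`, `store`, and `contains` are illustrative.

```python
TABLE_SIZE = 9
table = [None] * TABLE_SIZE  # the array A, one slot per index 0..8


def h(key: int) -> int:
    """Hash function from the example: remainder modulo the table size."""
    return key % TABLE_SIZE


def store(key: int) -> None:
    """Store the key at its hashed index (assumes the slot is free)."""
    table[h(key)] = key


def contains(key: int) -> bool:
    """Recompute the index and look in that single slot."""
    return table[h(key)] == key


for key in (5, 17, 37, 20, 42, 3):
    store(key)

print(table)  # [None, 37, 20, 3, None, 5, 42, None, 17]
```

Both `store` and `contains` touch exactly one slot, which is why their cost does not grow with the number of stored keys.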
Hash Functions
1. Division
A hash function must guarantee that the number it returns is a valid index to one of the table
entries.
The simplest way is to use division modulo the table size:
h(K) = K mod TSize, where TSize = sizeof(table).
It is best if TSize is a prime number.
Advantages:
 simple
 useful if we don't know much about the keys
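
The following sketch illustrates the division method and why a prime table size tends to help. The keys and the two table sizes (10 and 11) are assumptions picked for the demonstration.

```python
def division_hash(key: int, t_size: int) -> int:
    """h(K) = K mod TSize; the result is always a valid index 0..t_size-1."""
    return key % t_size


keys = [10, 20, 30, 40, 50]  # keys that share a common factor with 10

# Non-prime table size: every key lands in slot 0.
print([division_hash(k, 10) for k in keys])  # [0, 0, 0, 0, 0]

# Prime table size: the same keys spread over different slots.
print([division_hash(k, 11) for k in keys])  # [10, 9, 8, 7, 6]
```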
2. Extraction
 Idea: use only part of the key to compute the hash value (address, index).

Example: the key is the SSN 123456789.
This method might use, for example, the first four digits (1234), the last four (6789), or
the first two combined with the last two (1289) as the index.
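
The three extraction variants from the example can be sketched as follows. The key is handled as a string of digits, and the function names are hypothetical.

```python
def extract_first_four(key: str) -> int:
    """Use the first four digits of the key as the index."""
    return int(key[:4])


def extract_last_four(key: str) -> int:
    """Use the last four digits of the key as the index."""
    return int(key[-4:])


def extract_first_two_last_two(key: str) -> int:
    """Combine the first two digits with the last two."""
    return int(key[:2] + key[-2:])


ssn = "123456789"
print(extract_first_four(ssn))          # 1234
print(extract_last_four(ssn))           # 6789
print(extract_first_two_last_two(ssn))  # 1289
```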
3. Folding
Idea: divide the key into parts, then combine (“fold”) the parts to create
the index.
The key is first divided into several parts, each the same length as the
desired index. These parts are then combined, or folded, together and are
usually transformed in a certain way to create the index (address) into
the table.
Note: if the index that results from combining the parts is greater than
the desired length, you can apply either division (which is usually
used) or extraction.
There are two types of folding:
1) Shift folding

The key is divided into several parts, then these parts are added together to create the index.

Example: the key is the SSN 123456789.
The SSN 123-45-6789 can be divided into three parts, 123, 456, and 789, and then these parts can be added.
The resulting 1,368 can be divided modulo TSize.
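
Shift folding of the SSN example might be sketched like this. The table size of 1,000 is an assumption made only to show the final modulo step, and `split_parts` is a hypothetical helper.

```python
def split_parts(key: str, part_len: int) -> list[str]:
    """Cut the key's digits into consecutive parts of part_len digits."""
    return [key[i:i + part_len] for i in range(0, len(key), part_len)]


def shift_fold(key: str, part_len: int, t_size: int) -> int:
    """Add the parts together, then reduce modulo the table size."""
    total = sum(int(part) for part in split_parts(key, part_len))
    return total % t_size


print(split_parts("123456789", 3))       # ['123', '456', '789']
print(shift_fold("123456789", 3, 1000))  # 1368 mod 1000 = 368
```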
2) Boundary folding
 Same as shift folding, except that every other part is written backwards.
 Example: the key is the SSN 123456789.
The SSN is divided into three parts: 123, 456, 789.
The first part is taken in the same order.
The second part is taken in reverse order.
The third part is taken in the same order.
The result is 123 + 654 + 789 = 1,566, followed by division modulo TSize.
 Example: the key is 23459087632.
Boundary folding: 234 + 095 + 876 + 23 = 1,228
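
A possible sketch of boundary folding, reproducing both examples. It returns the raw sum and leaves the final division modulo TSize to the caller; the function name is hypothetical.

```python
def boundary_fold(key: str, part_len: int) -> int:
    """Sum the parts, reversing the digits of every other part."""
    parts = [key[i:i + part_len] for i in range(0, len(key), part_len)]
    total = 0
    for position, part in enumerate(parts):
        if position % 2 == 1:   # second, fourth, ... part
            part = part[::-1]   # written backwards
        total += int(part)
    return total


print(boundary_fold("123456789", 3))    # 123 + 654 + 789 = 1566
print(boundary_fold("23459087632", 3))  # 234 + 095 + 876 + 23 = 1228
```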

Folding is simple and fast, especially when bit patterns are used instead of numerical
values; in that case, replace the addition in the previous examples with XOR.
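
With bit patterns, the folding step could look like the sketch below. The 8-bit part width and the sample key are assumptions; the result always fits in `part_bits` bits, so it is a valid index for a table of 2^part_bits slots.

```python
def xor_fold(key: int, part_bits: int = 8) -> int:
    """Fold a non-negative integer by XOR-ing its part_bits-wide chunks."""
    mask = (1 << part_bits) - 1
    result = 0
    while key:
        result ^= key & mask  # take the lowest chunk and fold it in
        key >>= part_bits     # move on to the next chunk
    return result


print(hex(xor_fold(0x12345678)))  # 0x12 ^ 0x34 ^ 0x56 ^ 0x78 = 0x8
```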
4. Mid-Square function
 Idea: square the key (multiply the key by itself), then use the middle (mid) part of the result
as the address.
 Note: extraction could be used to extract the mid part.
 Example: the key is 3121.
Square the key: 3,121^2 = 9,740,641
Then use the mid part as the address (406).
Here, for a 1,000-cell table, h(3,121) = 406.
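
The mid-square computation can be sketched as below. Taking the middle digits of the decimal representation is one possible reading of "mid part"; when the square has no exact middle, this version leans to the left, which is an assumption of the sketch.

```python
def mid_square(key: int, digits: int = 3) -> int:
    """Square the key and extract `digits` digits from the middle."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


print(mid_square(3121))  # 3121 * 3121 = 9740641, middle digits 406
```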
5. Radix transformation

Idea: convert the key into another number base; division (modulo) can then be applied to the result.
So, the key is expressed in a numerical system using a different radix.
Example: convert 23456 to base 7, which gives 125246, and use this as the hash value.
This method may cause collisions.
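
A sketch of radix transformation for the example key. The table size of 1,000 used in the modulo step is an assumption, and `to_base` is a hypothetical helper limited to bases 2 through 10.

```python
def to_base(key: int, base: int) -> str:
    """Return the digits of a non-negative integer in the given base (2..10)."""
    if key == 0:
        return "0"
    digits = []
    while key > 0:
        digits.append(str(key % base))
        key //= base
    return "".join(reversed(digits))


def radix_hash(key: int, base: int, t_size: int) -> int:
    """Read the converted digits as a decimal number, then divide modulo t_size."""
    return int(to_base(key, base)) % t_size


print(to_base(23456, 7))           # 125246
print(radix_hash(23456, 7, 1000))  # 125246 mod 1000 = 246
```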
Detecting and resolving collisions
 Even with the methods introduced previously, collisions may still
occur.
 Two keys cannot simply be stored in the same table slot, so we must find a
way to resolve collisions.
 Choice of hash function and choice of table size may reduce
collisions, but will not eliminate them.
 Methods for resolving collisions:
 open addressing: find another empty position
 chaining: use linked lists
 bucket addressing: store elements at same location
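
Of the three methods, chaining is perhaps the shortest to sketch. The version below assumes a table of 9 slots with the remainder as hash function, and uses a Python list per slot as a stand-in for a linked list; the class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, size: int = 9) -> None:
        self.size = size
        # One chain per slot; a Python list stands in for a linked list.
        self.slots = [[] for _ in range(size)]

    def _h(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int) -> None:
        chain = self.slots[self._h(key)]
        if key not in chain:  # ignore duplicates
            chain.append(key)

    def contains(self, key: int) -> bool:
        return key in self.slots[self._h(key)]


table = ChainedHashTable()
for key in (5, 14, 23, 17):  # 5, 14 and 23 all hash to slot 5
    table.insert(key)

print(table.slots[5])                          # [5, 14, 23]
print(table.contains(14), table.contains(32))  # True False
```

Colliding keys share one chain, so a lookup only scans the keys that hashed to the same slot rather than the whole table.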