hashfunctions_Module3_

advertisement
Introduction

Remember although in most of our classroom examples we are using simple lists of
integers as our data sets. In practice it is not like that. The “elements’ or “items” in
realistic situations are records where each record might consist of several fields each
field perhaps being of several digits or characters.

However, one of the fields must be considered to be the way in which a record is
identified (hence called the IDENTIFIER). Typically this might be an employee’s
payroll number or stock item’s product code. This identifier is also known as the
record KEY.

Let us revert to our classroom practice where the elements are simple integers
representing the identifiers (rather than records with keys). Consider the following
sequence of ‘keys’:
Array index
Key
0
1
2
3
4
254 378 631 794 827

Our search techniques to date are: sequential search or binary search.

Both assumed that the elements were held in contiguous locations in an array (as
above) and the binary search required them to be sorted.

Suppose we could generate the index of the key from the key itself!!!
Hashing

With hashing techniques you attempt to find a HASH FUNCTION that is able to
convert the keys (usually integers or short character strings) into integers in the
range 0 … N-1 where N is the number of records that can fit into the amount of
memory available.

In our classroom examples we will just consider a record to consist solely of a key
that is an integer and that we want to convert the key into the index of an array.
This may or may not be simple.

The terms HASH and HASHING come from the very fact that the conversion from
key to index really does ‘hash’ the key as in many cases the index resulting from
‘hashing’ a key bears absolutely no resemblance to the original key at all.

Suppose memory was limitless then we could have the simplest hash function of all:
index = key.

In the example above this would mean declaring an array of 1000 elements and
putting the keys in those elements with the same index.

In the example this would mean that 5 elements of the array contained data and
that 995 did not! For this to be viable proposition memory would have to be limitless
(and wastable). In practice you cannot afford such extravagance.
19

The hash function you choose depends very much on the distributions of actual keys
within the range of the possible keys.
Hash function – Truncation

With truncation part of the keys is ignored and the remainder used directly.

Suppose in a certain situation there were just 6 keys and they were:
12302

23303
12305
12307
12308
12309
The hash function here could be:
index = last digit of key
(or index = key mod 123000)

So provided we had declared an array of 10 elements (index 0 to 9) then the hash
function would be suitable and with (only?) 40% wastage. Of course the above hash
function would not work if the range of keys was:
2134 4134 5134 7134 9134
or

2560 4670 6124 8435 9200
Here you would require a hash function that removed the last 3 digits. This is why it
is necessary for you to know something about the distribution of actual keys before
deciding on a hash function.
Hash function – Folding


Here the key is partitioned and the parts combined in some way (may be by adding
or multiplying them) to obtain the index. Suppose we have 1000 records but 8 digits
keys then perhaps:
The 8 digit key such as
62538194
62538195
may be partitioned into:
625 381 94
625 381 95
the groups now added:
1100
1101
and the result truncated:
100
101
Since all the information in the key is used in generating the index folding often
achieves a better spread than truncation just by itself.
Hash function – Modular arithmetic

The key is converted into an integer, the integer is then divided by the size of the
index range and the remainder is taken as the index position of the record.
index =
key mod size
29

As an example, suppose we have a 5 digit integer as a key and that there are 1000
records (and room for 1000 records).
The hash function would then be: index =
key mod 1000

If we are very lucky! Our keys might be such that there is only 1 key that maps to
each index. Of course we might still have the situation where two keys map to the
same index (e.g. 23456, 43456) – this is called a COLLISION.

In practice it always turns out that it is better to have an index range that is a prime
number. This way you do not get so may COLLISIONS.

In the above, it would be better to have an index range of 997 or 1009. However,
collisions do and will occur.
Hashing Functions
Some of the methods of defining a hash function are discussed in the following paragraphs.
Modular Arithmetic
In modular arithmetic, first the key is converted to an integer, then it is divided by the size of the
index range, and the remainder is taken to be the hash value. The spread achieved depends very
much on the modulus. If the modulus is the power of small integers such as 2 or 10, then many
keys tend to map into the same index, while other indices remain unused. The best choice for the
modulus is often, but not always, a prime number, which usually has the effect of spreading the
keys quite uniformly.
Truncation
Truncation ignores part of the key, and uses the remainder directly as the hash value (using
numeric code to represent non-numeric field data). If the keys, for example, are eight-digit
numbers and the hash table has 1000 entries, then the first, second, and fifth digits from the right
side might make the hash value. So, 62538194 maps to 394. It is a fast method, but it often fails
to distribute keys evenly.
Folding
In folding, the identifier is partitioned into several parts, all but the last part being of the same
length. These parts are then added together to obtain the hash value. For example, an eight-digit
integer can be divided into groups of three, three, and two digits. The groups are then added
39
together, and truncated, if necessary, to be in the proper range of indices. So 62538149 maps to
625 + 381 + 94 = 1100, truncated to 100. Since all information in the key can affect the value of
the function, folding often achieves a better spread of indices than truncation.
Mid-square method
In this method, the identifier is squared (using numeric code to represent non- numeric field data),
and then the appropriate number of bits from the middle of the square are used to get the hash
value. Since the middle bits of the square usually depend on all the characters in the identifier, it is
expected that different identifiers will result in different values. The number of middle bits that we
select depends on table size. Therefore, if r is the number of middle bits used to form the hash
value, then the table size will be 2r. So when we use the mid- square method, the table size should
be a power of 2.
Program
A complete C program for implementation of a hash table is given here:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define SIZE 50
#define MAX 10
typedef struct node
{
char symbol[MAX];
int value;
struct node *next;
} entry;
typedef entry *entry_ptr;
int hash_value(char * name)
{
int sum=0;
while( *name != '\0')
{
sum += *name;
name++;
}
return(sum % SIZE);
}
void initialize( entry_ptr table[])
{
int i=0;
for(i=0; i<SIZE; i++)
table[i] = NULL;
}
void insert( entry_ptr table[], char *name, int val)
{
49
int h, flag = 1;
entry_ptr temp;
h = hash_value(name);
temp = table[h];
while( temp != NULL && flag )
{
if( strcmp(temp->symbol,name) == 0)
{
printf("The symbol %s is already present in the
table\n",name);
flag =0;;
}
temp=temp->next;
}
if(flag)
{
temp = (entry_ptr) malloc(sizeof( entry));
if(temp == NULL)
{
printf("ERRRR .......\n");
exit(0);
}
strcpy(temp->symbol,name);
temp->value = val;
temp->next = table[h];
table[h]=temp;
}
}
void retrieve( entry_ptr table[],char *name)
{
int h,flag =1;
entry_ptr temp;
h = hash_value(name);
temp = table[h];
while( temp != NULL && flag)
{
if( strcmp(temp->symbol,name) == 0)
{
printf("The symbol %s is present in the table and having value =
%d\n",
name,temp->value);
flag =0;
}
temp=temp->next;
}
if(flag == 1)
printf("The symbol %s is not present in the table
\n",name);
59
}
void main()
{
entry_ptr table[SIZE];
char name[MAX];
int value,n;
initialize(table);
do
{
do
{
printf("Enter the symbol and value pair to be inserted\n");
scanf("%s %d",name,&value);
insert(table,name,value);
printf("Enter 1 to continue\n");
scanf("%d",&n);
}
while(n == 1);
do
{
printf("Enter the symbol whose value is to be retrieved\n");
scanf("%s",name);
retrieve(table,name);
printf("Enter 1 to continue\n");
scanf("%d",&n);
}while( n == 1);
printf("Eneter 1 to continue\n");
scanf("%d",&n);
}while(n == 1);
}
Example
Input and Output
Enter the symbol and value pair to be inserted
ogk10
Enter 1 to continue
1
Enter the symbol and value pair to be inserted
psd20
Enter 1 to continue
0
Enter the symbol whose value is to be retrieved
69
ogk
The symbol ogk is present in the table with the value = 10
Enter 1 to continue
1
Enter the symbol whose value is to be retrieved
psd
The symbol psd is present in the table with the value = 20
Enter 1 to continue
1
Enter the symbol whose value is to be retrieved
asg
The symbol asg is not present in the table
Enter 1 to continue
0
Eneter 1 to continue
1
Enter the symbol and value pair to be inserted
asg30
Enter 1 to continue
0
Enter the symbol whose value is to be retrieved
asg
The symbol asg is present in the table with the value = 30
Enter 1 to continue
0
Eneter 1 to continue
0
What is Hashing?
Question: I want to know about direct addressing, dictionary lookup, and indirect
addressing. Can you explain about division remainder, digit analysis/extraction,
folding, mid-square, and radix conversion? What is collision and what are the
differences between double hashing, chaining, bucket addressing, perfect hashing,
79
and dynamic hashing? What is linear probing and what are synonyms?
anonymous
Answer: Hash tables are one of the most efficient means of searching for data. A
hash table establishes a mapping between all of the possible keys and the position
where the data associated with those keys is stored. I don't know exactly how all of
the hashing methods that you are asking about work but the following should go at
least partway towards answering your questions.
Direct Addressing is where the hashing algorithm guarantees that no two keys will
generate the exact same hash value. This is the ideal situation but is impractical in
most situations. Instead most hash functions map multiple keys to the same has
value and provide indirect addressing mechanisms to handle the situation where
two keys map to the same value. The situation where this occurs is called a
collision. Two keys that hash to the same value like this are called synonyms.
A hash function attempts to distribute keys in a random manner so as to minimize
the need to resolve collisions. One of the simplest hashing algorithms is division
remainder where you divide the key value by some fixed number and take the
remainder as the key location (also known as modulo arithmetic). For example let's
say we have space for one thousand entries, so we divide the key of each value by
1000 and take the remainder so the entry with key 5078950770 will occupy
position 770 in our array (5078950770 modulo 1000 is 770). Of course with this
particular example every 1000th key will map to the same location.
A Chained hash table is one where the data is stored in linked lists which contain all
of the data entries whose keys map to the same hash value. The linked list (or
bucket) can grow to contain as many entries as required but as the list grows it
becomes gradually less efficient in the speed at which the entries can be accessed.
An open addressed hash table stores the data in a table instead of in a linked list
and can use a variety of probing methods to locate the required entry within the
table. Linear probing is the least efficient method as it involves searching
sequentially through the table until the desired entry is found. Where only a few
entries map to the same hash value this may be good enough but where lots of
entries map to the same hash value, a more efficient method is required.
Double hashing is one of the more effective methods of probing an open addressed
hash table. It works by adding the hash codings of two auxiliary hash functions.
The function for performing double hashing is h(k,i) = (h1(k) + ih2(k)) mod m
where 0 <= i <m, m is the number of entries in the table, and h1 and h2 are the
hash functions. To ensure that all entries in the table are filled before collisions
occur we can either make m a power of 2 and have h2 always return an odd
number or we make m prime and have h2 always return a value less than m. For
example if h1(k) = k mod m and h2(k) = (k mod (m - 1)) and we set m to 1699 (a
prime number) then when we hash 15385 we probe position 94 first and then every
113th position through the table as i increases until we find the entry we're looking
for.
89
Dynamic or Universal hashing involves using a random number generator to
dynamically generate the hashing algorithm at run time. This method reduces the
chance that your keys will produce a bad distribution across the table.
As you have noticed from the above examples, hashing algorithms work on numeric
keys. digit analysis/extraction is the method that you use to convert the actual keys
into numerical values that cal be fed through your hashing algorithm. To get the
best results you should aim to base your hashing values on those parts of the
actual keys most likely to vary from one record to another.
99
Download