Introduction Remember although in most of our classroom examples we are using simple lists of integers as our data sets. In practice it is not like that. The “elements’ or “items” in realistic situations are records where each record might consist of several fields each field perhaps being of several digits or characters. However, one of the fields must be considered to be the way in which a record is identified (hence called the IDENTIFIER). Typically this might be an employee’s payroll number or stock item’s product code. This identifier is also known as the record KEY. Let us revert to our classroom practice where the elements are simple integers representing the identifiers (rather than records with keys). Consider the following sequence of ‘keys’: Array index Key 0 1 2 3 4 254 378 631 794 827 Our search techniques to date are: sequential search or binary search. Both assumed that the elements were held in contiguous locations in an array (as above) and the binary search required them to be sorted. Suppose we could generate the index of the key from the key itself!!! Hashing With hashing techniques you attempt to find a HASH FUNCTION that is able to convert the keys (usually integers or short character strings) into integers in the range 0 … N-1 where N is the number of records that can fit into the amount of memory available. In our classroom examples we will just consider a record to consist solely of a key that is an integer and that we want to convert the key into the index of an array. This may or may not be simple. The terms HASH and HASHING come from the very fact that the conversion from key to index really does ‘hash’ the key as in many cases the index resulting from ‘hashing’ a key bears absolutely no resemblance to the original key at all. Suppose memory was limitless then we could have the simplest hash function of all: index = key. In the example above this would mean declaring an array of 1000 elements and putting the keys in those elements with the same index. In the example this would mean that 5 elements of the array contained data and that 995 did not! For this to be viable proposition memory would have to be limitless (and wastable). In practice you cannot afford such extravagance. 19 The hash function you choose depends very much on the distributions of actual keys within the range of the possible keys. Hash function – Truncation With truncation part of the keys is ignored and the remainder used directly. Suppose in a certain situation there were just 6 keys and they were: 12302 23303 12305 12307 12308 12309 The hash function here could be: index = last digit of key (or index = key mod 123000) So provided we had declared an array of 10 elements (index 0 to 9) then the hash function would be suitable and with (only?) 40% wastage. Of course the above hash function would not work if the range of keys was: 2134 4134 5134 7134 9134 or 2560 4670 6124 8435 9200 Here you would require a hash function that removed the last 3 digits. This is why it is necessary for you to know something about the distribution of actual keys before deciding on a hash function. Hash function – Folding Here the key is partitioned and the parts combined in some way (may be by adding or multiplying them) to obtain the index. Suppose we have 1000 records but 8 digits keys then perhaps: The 8 digit key such as 62538194 62538195 may be partitioned into: 625 381 94 625 381 95 the groups now added: 1100 1101 and the result truncated: 100 101 Since all the information in the key is used in generating the index folding often achieves a better spread than truncation just by itself. Hash function – Modular arithmetic The key is converted into an integer, the integer is then divided by the size of the index range and the remainder is taken as the index position of the record. index = key mod size 29 As an example, suppose we have a 5 digit integer as a key and that there are 1000 records (and room for 1000 records). The hash function would then be: index = key mod 1000 If we are very lucky! Our keys might be such that there is only 1 key that maps to each index. Of course we might still have the situation where two keys map to the same index (e.g. 23456, 43456) – this is called a COLLISION. In practice it always turns out that it is better to have an index range that is a prime number. This way you do not get so may COLLISIONS. In the above, it would be better to have an index range of 997 or 1009. However, collisions do and will occur. Hashing Functions Some of the methods of defining a hash function are discussed in the following paragraphs. Modular Arithmetic In modular arithmetic, first the key is converted to an integer, then it is divided by the size of the index range, and the remainder is taken to be the hash value. The spread achieved depends very much on the modulus. If the modulus is the power of small integers such as 2 or 10, then many keys tend to map into the same index, while other indices remain unused. The best choice for the modulus is often, but not always, a prime number, which usually has the effect of spreading the keys quite uniformly. Truncation Truncation ignores part of the key, and uses the remainder directly as the hash value (using numeric code to represent non-numeric field data). If the keys, for example, are eight-digit numbers and the hash table has 1000 entries, then the first, second, and fifth digits from the right side might make the hash value. So, 62538194 maps to 394. It is a fast method, but it often fails to distribute keys evenly. Folding In folding, the identifier is partitioned into several parts, all but the last part being of the same length. These parts are then added together to obtain the hash value. For example, an eight-digit integer can be divided into groups of three, three, and two digits. The groups are then added 39 together, and truncated, if necessary, to be in the proper range of indices. So 62538149 maps to 625 + 381 + 94 = 1100, truncated to 100. Since all information in the key can affect the value of the function, folding often achieves a better spread of indices than truncation. Mid-square method In this method, the identifier is squared (using numeric code to represent non- numeric field data), and then the appropriate number of bits from the middle of the square are used to get the hash value. Since the middle bits of the square usually depend on all the characters in the identifier, it is expected that different identifiers will result in different values. The number of middle bits that we select depends on table size. Therefore, if r is the number of middle bits used to form the hash value, then the table size will be 2r. So when we use the mid- square method, the table size should be a power of 2. Program A complete C program for implementation of a hash table is given here: #include <stdio.h> #include <string.h> #include <stdlib.h> #define SIZE 50 #define MAX 10 typedef struct node { char symbol[MAX]; int value; struct node *next; } entry; typedef entry *entry_ptr; int hash_value(char * name) { int sum=0; while( *name != '\0') { sum += *name; name++; } return(sum % SIZE); } void initialize( entry_ptr table[]) { int i=0; for(i=0; i<SIZE; i++) table[i] = NULL; } void insert( entry_ptr table[], char *name, int val) { 49 int h, flag = 1; entry_ptr temp; h = hash_value(name); temp = table[h]; while( temp != NULL && flag ) { if( strcmp(temp->symbol,name) == 0) { printf("The symbol %s is already present in the table\n",name); flag =0;; } temp=temp->next; } if(flag) { temp = (entry_ptr) malloc(sizeof( entry)); if(temp == NULL) { printf("ERRRR .......\n"); exit(0); } strcpy(temp->symbol,name); temp->value = val; temp->next = table[h]; table[h]=temp; } } void retrieve( entry_ptr table[],char *name) { int h,flag =1; entry_ptr temp; h = hash_value(name); temp = table[h]; while( temp != NULL && flag) { if( strcmp(temp->symbol,name) == 0) { printf("The symbol %s is present in the table and having value = %d\n", name,temp->value); flag =0; } temp=temp->next; } if(flag == 1) printf("The symbol %s is not present in the table \n",name); 59 } void main() { entry_ptr table[SIZE]; char name[MAX]; int value,n; initialize(table); do { do { printf("Enter the symbol and value pair to be inserted\n"); scanf("%s %d",name,&value); insert(table,name,value); printf("Enter 1 to continue\n"); scanf("%d",&n); } while(n == 1); do { printf("Enter the symbol whose value is to be retrieved\n"); scanf("%s",name); retrieve(table,name); printf("Enter 1 to continue\n"); scanf("%d",&n); }while( n == 1); printf("Eneter 1 to continue\n"); scanf("%d",&n); }while(n == 1); } Example Input and Output Enter the symbol and value pair to be inserted ogk10 Enter 1 to continue 1 Enter the symbol and value pair to be inserted psd20 Enter 1 to continue 0 Enter the symbol whose value is to be retrieved 69 ogk The symbol ogk is present in the table with the value = 10 Enter 1 to continue 1 Enter the symbol whose value is to be retrieved psd The symbol psd is present in the table with the value = 20 Enter 1 to continue 1 Enter the symbol whose value is to be retrieved asg The symbol asg is not present in the table Enter 1 to continue 0 Eneter 1 to continue 1 Enter the symbol and value pair to be inserted asg30 Enter 1 to continue 0 Enter the symbol whose value is to be retrieved asg The symbol asg is present in the table with the value = 30 Enter 1 to continue 0 Eneter 1 to continue 0 What is Hashing? Question: I want to know about direct addressing, dictionary lookup, and indirect addressing. Can you explain about division remainder, digit analysis/extraction, folding, mid-square, and radix conversion? What is collision and what are the differences between double hashing, chaining, bucket addressing, perfect hashing, 79 and dynamic hashing? What is linear probing and what are synonyms? anonymous Answer: Hash tables are one of the most efficient means of searching for data. A hash table establishes a mapping between all of the possible keys and the position where the data associated with those keys is stored. I don't know exactly how all of the hashing methods that you are asking about work but the following should go at least partway towards answering your questions. Direct Addressing is where the hashing algorithm guarantees that no two keys will generate the exact same hash value. This is the ideal situation but is impractical in most situations. Instead most hash functions map multiple keys to the same has value and provide indirect addressing mechanisms to handle the situation where two keys map to the same value. The situation where this occurs is called a collision. Two keys that hash to the same value like this are called synonyms. A hash function attempts to distribute keys in a random manner so as to minimize the need to resolve collisions. One of the simplest hashing algorithms is division remainder where you divide the key value by some fixed number and take the remainder as the key location (also known as modulo arithmetic). For example let's say we have space for one thousand entries, so we divide the key of each value by 1000 and take the remainder so the entry with key 5078950770 will occupy position 770 in our array (5078950770 modulo 1000 is 770). Of course with this particular example every 1000th key will map to the same location. A Chained hash table is one where the data is stored in linked lists which contain all of the data entries whose keys map to the same hash value. The linked list (or bucket) can grow to contain as many entries as required but as the list grows it becomes gradually less efficient in the speed at which the entries can be accessed. An open addressed hash table stores the data in a table instead of in a linked list and can use a variety of probing methods to locate the required entry within the table. Linear probing is the least efficient method as it involves searching sequentially through the table until the desired entry is found. Where only a few entries map to the same hash value this may be good enough but where lots of entries map to the same hash value, a more efficient method is required. Double hashing is one of the more effective methods of probing an open addressed hash table. It works by adding the hash codings of two auxiliary hash functions. The function for performing double hashing is h(k,i) = (h1(k) + ih2(k)) mod m where 0 <= i <m, m is the number of entries in the table, and h1 and h2 are the hash functions. To ensure that all entries in the table are filled before collisions occur we can either make m a power of 2 and have h2 always return an odd number or we make m prime and have h2 always return a value less than m. For example if h1(k) = k mod m and h2(k) = (k mod (m - 1)) and we set m to 1699 (a prime number) then when we hash 15385 we probe position 94 first and then every 113th position through the table as i increases until we find the entry we're looking for. 89 Dynamic or Universal hashing involves using a random number generator to dynamically generate the hashing algorithm at run time. This method reduces the chance that your keys will produce a bad distribution across the table. As you have noticed from the above examples, hashing algorithms work on numeric keys. digit analysis/extraction is the method that you use to convert the actual keys into numerical values that cal be fed through your hashing algorithm. To get the best results you should aim to base your hashing values on those parts of the actual keys most likely to vary from one record to another. 99