CS 321 HM          LANGUAGES & COMPILER DESIGN          PSU HW 4

HomeWork 4, DHash, 100 Points (1/22/2005)

Due Date: Monday February 7th 2005 at start of class
Subject:  A hashing mechanism named DHash

General Rules:

Implement Homework in C or C++. Any flavor of C will do: K&R C, ANSI C, or C++. Hand in the complete listing of all C/C++ source files plus include files, if any, plus all inputs, and generated outputs. Write your name, the HW number, completion date, and the current PSU term into the header of each source file. Designs and discussions are to be provided in electronic form, not in long-hand, and not using hand-drawn pictures.

Abstract:

Design (20), implement (40), debug and test (20) the double hashing function DHash. DHash is part of a simple, general purpose scanner. Only identifiers are of interest for this HW. Once an identifier is scanned, it is to be entered into a table named hash[], using a hash function. This function uses a prime modifier. In case of a collision, a secondary hash function is used.

Discuss in a brief essay variations of how DHash could work. To this end, implement alternatives by varying the prime modifier such that for different measurements with identical input, you observe different collision numbers and different collision chains. In your discussion, show also the collisions and chain lengths as a function of the varying prime modifier (20 points).

Detailed Specification:

DHash reads tokens from standard input, a character file. The input file consists of source lines, each no longer than 255 characters. Scanned identifiers are entered for retrieval into a string table, more precisely a table of pointers to character strings. This hash table is named char * hash[ MAX_HASH ]. Use a small table, for example: #define MAX_HASH 109.

If some identifier id has already been scanned before, a renewed attempt to enter that id into the table will actually retrieve it. The bucket size per entry will be 1.
Thus, if the source contains more distinct identifiers than slots in hash[], the table is full and DHash aborts with a suitable error message. Include such a test in your assignment and show the correct result.

For any source file, track the total number of collisions. Also, in case of a collision, track the length of the collision chain. In the end, print the total number of collisions, the length of the longest collision chain, the total number of tokens scanned, the total number of identifiers, the number of distinct identifiers, and compute the fill factor of the hash table. Do all of this only once, for one fixed size table and one prime number.

For each used entry in hash[], print the index and the entered identifier in increasing order. Also, for each used entry, print the number of times that entry was visited in index order at the end of the complete scan. Note that the number of entries used plus the number of collisions must equal the sum of these visits. Also print the number of instances the same id was found in the input source.

Include a test case, in which the table fills up and consequently DHash aborts. Prove by test that if the number of distinct identifiers equals the hash table size, all entries will indeed be filled, thus your collision handler does not skip entries. However, the total number of collisions may be very large.

Implementation Hints and Requirements:

Use a small prime number (suggest 109) as the table size of hash[]. Note that 109 is prime, and has a companion prime 107, i.e. 107 and 109 form a twin-prime pair (a double prime).

Possible hash functions (primary + secondary) for all letters in an identifier, with hash_x and hash2_x initially 0:

    hash_x  = ( ( hash_x  + identifier[ i ] ) * PRIME ) % MAX_HASH;
    hash2_x = ( ( hash2_x + identifier[ i ] ) * PRIME ) % ( MAX_HASH - 2 ) + 1;

Sample Input to DHash:

The sample input below just used the C implementation of the dhash.c homework itself.
Do the same with your implementation. Skip erroneous input.

    hash[ 75] = include
    hash[103] = stdio
    hash[ 33] = h
    hash[ 67] = ctype
    hash[ 18] = stdlib
    hash[ 54] = string
    hash[104] = unsigned
    hash[ 41] = line_no
    hash[  1] = col_no
    . . .
    hash[ 50] = scan_ident
    hash[ 74] = scan_special
    hash[ 38] = save
    hash[ 40] = WHITE
    hash[  0] = skip_blanks
    hash[ 90] = scan
    hash[ 37] = int
    hash[100] = main

Corresponding Output of DHash for above Sample Input:

The second output section below is a dump of the hash table, after the complete input has been read.

    hash[  0] had  2 visits,  2 occurrences of 'skip_blanks'
    hash[  1] had  3 visits,  3 occurrences of 'col_no'
    hash[  2] had 33 visits,  8 occurrences of 'else'
    hash[  3] had  1 visits,  1 occurrences of 'malloc'
    hash[  4] had  8 visits,  5 occurrences of 'for'
    hash[  5] had 10 visits,  5 occurrences of 'token_no'
    hash[  8] had 26 visits, 26 occurrences of 'void'
    hash[  9] had  9 visits,  7 occurrences of 'printf'
    hash[ 11] had  5 visits,  3 occurrences of 'NULL'
    . . .
    hash[ 94] had 10 visits, 10 occurrences of 'buff'
    hash[ 95] had 11 visits,  3 occurrences of 'slots_used'
    hash[ 96] had  1 visits,  1 occurrences of 'strcmp'
    hash[ 97] had  7 visits,  7 occurrences of 'c'
    hash[100] had  1 visits,  1 occurrences of 'main'
    hash[103] had  7 visits,  1 occurrences of 'stdio'
    hash[104] had 30 visits, 22 occurrences of 'unsigned'
    hash[107] had  2 visits,  2 occurrences of 'search_ident'
    hash[108] had  7 visits,  7 occurrences of 'while'

    Total tokens scanned: 1365
    Total collisions = 173, Max chain = 6
    Number of entries in hash[109] used: 77
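
Illustrative Sketch of the Probe Loop:

The two hash functions and the collision handling described under Implementation Hints can be sketched as below. This is a minimal sketch under stated assumptions, not the reference solution. The value PRIME 31 is an arbitrary choice for illustration (the handout leaves the prime modifier open), and the names primary, secondary, enter, visits, occurrences, collisions, max_chain and slots_used are hypothetical. The sketch counts a collision on every probe that lands on a slot holding a different identifier, including probes made while retrieving an identifier entered earlier; other counting conventions are possible.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HASH 109  /* table size suggested in the handout (prime) */
#define PRIME    31   /* ASSUMPTION: illustrative prime modifier, vary it for the essay */

static char *hash[MAX_HASH];        /* table of pointers to identifier strings */
static int   visits[MAX_HASH];      /* number of probes that touched each slot */
static int   occurrences[MAX_HASH]; /* how often the stored identifier was seen */
static int   collisions, max_chain, slots_used;

/* Primary hash: start index in 0 .. MAX_HASH-1. */
unsigned primary(const char *id)
{
    unsigned h = 0;
    for (size_t i = 0; id[i] != '\0'; i++)
        h = ((h + (unsigned char)id[i]) * PRIME) % MAX_HASH;
    return h;
}

/* Secondary hash: probe step in 1 .. MAX_HASH-2, so it is never 0. */
unsigned secondary(const char *id)
{
    unsigned h = 0;
    for (size_t i = 0; id[i] != '\0'; i++)
        h = ((h + (unsigned char)id[i]) * PRIME) % (MAX_HASH - 2) + 1;
    return h;
}

/* Enter id, or retrieve it if already present.
   Returns the slot index, or -1 if the table is full and id is not in it. */
int enter(const char *id)
{
    unsigned idx  = primary(id);
    unsigned step = secondary(id);
    int chain = 0;

    /* A nonzero step below a prime table size reaches every slot in MAX_HASH probes. */
    assert(step >= 1 && step <= MAX_HASH - 2);

    for (int probes = 0; probes < MAX_HASH; probes++) {
        visits[idx]++;
        if (hash[idx] == NULL) {                 /* free slot: insert a copy */
            size_t n = strlen(id) + 1;
            hash[idx] = malloc(n);
            if (hash[idx] == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
            }
            memcpy(hash[idx], id, n);
            occurrences[idx] = 1;
            slots_used++;
            return (int)idx;
        }
        if (strcmp(hash[idx], id) == 0) {        /* seen before: retrieve */
            occurrences[idx]++;
            return (int)idx;
        }
        collisions++;                            /* slot holds another identifier */
        if (++chain > max_chain)
            max_chain = chain;
        idx = (idx + step) % MAX_HASH;           /* double hashing: advance by step */
    }
    return -1;                                   /* every slot probed, table full */
}
```

A scanner would call enter() once per scanned identifier and abort with an error message when it returns -1. Because MAX_HASH is prime and the step is never 0, the probe sequence cannot skip entries, which is the property the full-table test is meant to demonstrate.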