Homework 4

CS 321
HM
LANGUAGES & COMPILER DESIGN
PSU
HW 4
HomeWork 4, DHash, 100 Points (1/22/2005)
Due Date: Monday February 7th 2005 at start of class
Subject: A hashing mechanism named DHash
General Rules: Implement Homework in C or C++. Any flavor of C will do: K&R C, ANSI C, or
C++. Hand in the complete listing of all C/C++ source files plus include files, if any, plus all inputs,
and generated outputs. Write your name, the HW number, completion date, and the current PSU term
into the header of each source file. Designs and discussions are to be provided in electronic form, not
in long-hand, and not using hand-drawn pictures.
Abstract: Design (20), implement (40), debug and test (20) the double hashing function DHash.
DHash is part of a simple, general purpose scanner. Only identifiers are of interest for this HW. Once
an identifier is scanned, it is to be entered into a table named hash[], using a hash function. This
function uses a prime modifier. In case of a collision, a secondary hash function is used. Discuss in a
brief essay variations of how DHash could work. To this end, implement alternatives by varying the
prime modifier such that for different measurements with identical input, you observe different
collision numbers and different collision chains. In your discussion, show also the collisions and
chain lengths as a function of the varying prime modifier (20 points).
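One way to run the measurement described above is sketched here. It enters a fixed list of identifiers into a fresh table and reports the collisions for one prime modifier, so calling it with several primes on identical input gives the comparison asked for. The function name count_collisions and its parameters are hypothetical, and the sketch assumes that every probe landing on a slot occupied by a different identifier counts as one collision, that the chain length is the number of such probes for one lookup, and that the input holds fewer than MAX_HASH distinct, non-empty identifiers.

```c
#include <stddef.h>
#include <string.h>

#define MAX_HASH 109   /* table size suggested in the handout */

/* Enter n identifiers into a fresh table using the given prime modifier.
   Returns the total number of collisions; *max_chain receives the longest
   collision chain. Assumes fewer than MAX_HASH distinct, non-empty ids. */
int count_collisions(const char *ids[], int n, unsigned prime, int *max_chain)
{
    const char *table[MAX_HASH] = { NULL };   /* NULL marks an empty slot */
    int collisions = 0;
    int k;

    *max_chain = 0;
    for (k = 0; k < n; k++) {
        unsigned h1 = 0, h2 = 0;
        int chain = 0;
        const char *p;

        /* primary and secondary hash over all letters of the identifier */
        for (p = ids[k]; *p != '\0'; p++) {
            h1 = ((h1 + (unsigned char)*p) * prime) % MAX_HASH;
            h2 = ((h2 + (unsigned char)*p) * prime) % (MAX_HASH - 2) + 1;
        }
        /* probe until an empty slot or the same identifier is found */
        while (table[h1] != NULL && strcmp(table[h1], ids[k]) != 0) {
            h1 = (h1 + h2) % MAX_HASH;   /* collision: step by secondary hash */
            chain++;
        }
        if (table[h1] == NULL)
            table[h1] = ids[k];          /* first occurrence: enter it */
        collisions += chain;
        if (chain > *max_chain)
            *max_chain = chain;
    }
    return collisions;
}
```

A degenerate choice shows how strongly the modifier matters: with the modifier equal to MAX_HASH itself (109), every primary hash value is 0, so every new distinct identifier collides at slot 0 and is placed by the secondary hash alone.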
Detailed Specification: DHash reads tokens from standard input, a character file. The input file
consists of source lines, each no longer than 255 characters. Scanned identifiers are entered for retrieval
into a string table, more precisely a table of pointers to character strings. This hash table is named
char * hash[ MAX_HASH ]. Use a small table, for example: #define MAX_HASH 109. If
some identifier id has already been scanned before, a renewed attempt to enter that id into the table
will actually retrieve it. The bucket size per entry will be 1. Thus, if the source contains more distinct
identifiers than slots in hash[], the table is full and DHash aborts with a suitable error message.
Include such a test in your assignment and show the correct result.
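The enter-or-retrieve behavior and the table-full abort could look roughly like the following sketch. The name enter_ident is hypothetical, the wording of the error message is only an example, and the start and step arguments are assumed to come from the primary and secondary hash functions (start in 0..MAX_HASH-1, step in 1..MAX_HASH-1).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HASH 109

static char *hash[MAX_HASH];   /* NULL marks an empty slot */
static int slots_used = 0;     /* number of distinct identifiers entered */

/* Returns the index at which id is stored, entering it first if absent.
   Aborts with an error message when no free slot is left. */
int enter_ident(const char *id, unsigned start, unsigned step)
{
    unsigned index = start % MAX_HASH;
    int probes;

    for (probes = 0; probes < MAX_HASH; probes++) {
        if (hash[index] == NULL) {               /* free slot: enter id */
            char *copy = malloc(strlen(id) + 1);
            if (copy == NULL) {
                fprintf(stderr, "DHash: out of memory\n");
                exit(EXIT_FAILURE);
            }
            strcpy(copy, id);
            hash[index] = copy;
            slots_used++;
            return (int)index;
        }
        if (strcmp(hash[index], id) == 0)        /* seen before: retrieve */
            return (int)index;
        index = (index + step) % MAX_HASH;       /* collision: next probe */
    }
    /* MAX_HASH probes found neither id nor a free slot */
    fprintf(stderr, "DHash: hash table full (%d entries), cannot enter '%s'\n",
            MAX_HASH, id);
    exit(EXIT_FAILURE);
}
```

Because MAX_HASH is prime, any step between 1 and MAX_HASH-1 reaches every slot within MAX_HASH probes, so falling out of the loop means the table really is full.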
For any source file, track the total number of collisions. Also, in case of a collision, track the
length of the collision chain. In the end, print the total number of collisions, the length of the longest
collision chain, the total number of tokens scanned, the total number of identifiers, the number of distinct
identifiers, and compute the fill factor of the hash table. Do all of this only once, for one fixed size
table and one prime number.
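The fill factor is the number of used entries divided by the table size. A minimal sketch of the final summary follows; the function and parameter names are hypothetical, and the wording of the two identifier lines and of the fill-factor line is an assumption, while the other lines imitate the sample output of this handout.

```c
#include <stdio.h>

#define MAX_HASH 109

/* Fill factor: fraction of the table's slots that hold an identifier. */
double fill_factor(int slots_used)
{
    return (double)slots_used / MAX_HASH;
}

/* Print the end-of-run statistics once, after the complete scan. */
void print_summary(int tokens, int identifiers, int distinct,
                   int collisions, int max_chain)
{
    printf("Total tokens scanned: %d\n", tokens);
    printf("Total identifiers: %d\n", identifiers);
    printf("Distinct identifiers: %d\n", distinct);
    printf("Total collisions = %d, Max chain = %d\n", collisions, max_chain);
    printf("Number of entries in hash[%d] used: %d\n", MAX_HASH, distinct);
    printf("Fill factor: %.3f\n", fill_factor(distinct));
}
```

With 77 of 109 entries used, the fill factor comes out at about 0.706.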
For each used entry in hash[], print the index and the entered identifier in increasing order.
Also, for each used entry, print the number of times that entry was visited in index order at the end of
the complete scan. Note that the number of entries used plus the number of collisions must equal the
sum of these visits. Also print the number of instances the same id was found in the input source.
Include a test case, in which the table fills up and consequently DHash aborts. Prove by test that if the
number of distinct identifiers equals the hash table size, all entries will indeed be filled, thus your
collision handler does not skip entries. However, the total number of collisions may be very large.
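Besides the full-table test run, the claim that the collision handler skips no entries can be checked directly on the probe sequence. The sketch below uses a hypothetical helper, slots_covered, and assumes the handler advances by a fixed step modulo MAX_HASH, as in double hashing. It counts how many distinct slots are touched in MAX_HASH probes.

```c
#define MAX_HASH 109

/* Counts the distinct slots touched by the probe sequence
   start, start+step, start+2*step, ... (mod MAX_HASH) in MAX_HASH probes. */
int slots_covered(unsigned start, unsigned step)
{
    char seen[MAX_HASH] = { 0 };
    unsigned index = start % MAX_HASH;
    int covered = 0;
    int i;

    for (i = 0; i < MAX_HASH; i++) {
        if (!seen[index]) {
            seen[index] = 1;
            covered++;
        }
        index = (index + step) % MAX_HASH;
    }
    return covered;
}
```

Since MAX_HASH is prime, every step from 1 to MAX_HASH-1 should yield MAX_HASH covered slots. A step of 0 covers only the start slot, which is why the secondary hash must never return 0.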
Implementation Hints and Requirements: Use a small prime number (suggest 109) as the table size
of hash[]. Note that 109 is prime and has a companion prime 107, i.e. it is part of a twin prime pair (107, 109).
Possible hash functions (primary + secondary) for all letters in an identifier, with hash_x and
hash2_x initially 0:
hash_x = ( ( hash_x + identifier[ i ] ) * PRIME ) % MAX_HASH;
hash2_x = ( ( hash2_x + identifier[ i ] ) * PRIME ) % ( MAX_HASH - 2 ) + 1;
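Wrapped into functions, the two formulas might look like this. The value 31 for PRIME is only an assumption for illustration, since the handout leaves the modifier open, and the function names are hypothetical. Characters are cast to unsigned char so that the result does not depend on whether plain char is signed.

```c
#define MAX_HASH 109   /* table size from the handout */
#define PRIME    31    /* assumed prime modifier, vary it for the essay */

/* Primary hash: start slot in 0 .. MAX_HASH-1. */
unsigned primary_hash(const char *identifier)
{
    unsigned hash_x = 0;
    int i;

    for (i = 0; identifier[i] != '\0'; i++)
        hash_x = ((hash_x + (unsigned char)identifier[i]) * PRIME) % MAX_HASH;
    return hash_x;
}

/* Secondary hash: probe step in 1 .. MAX_HASH-2 for any non-empty
   identifier, so the step is never 0. */
unsigned secondary_hash(const char *identifier)
{
    unsigned hash2_x = 0;
    int i;

    for (i = 0; identifier[i] != '\0'; i++)
        hash2_x = ((hash2_x + (unsigned char)identifier[i]) * PRIME)
                  % (MAX_HASH - 2) + 1;
    return hash2_x;
}
```

For the identifier a (ASCII 97) and PRIME 31 this gives 97 * 31 = 3007, hence a primary hash of 3007 % 109 = 64 and a secondary hash of 3007 % 107 + 1 = 12.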
Sample Input to DHash
The sample input is the C source of the dhash.c homework implementation itself. Do the same
with your implementation. Skip erroneous input.
hash[ 75] = include
hash[ 103] = stdio
hash[ 33] = h
hash[ 67] = ctype
hash[ 18] = stdlib
hash[ 54] = string
hash[ 104] = unsigned
hash[ 41] = line_no
hash[ 1] = col_no
. . .
hash[ 50] = scan_ident
hash[ 74] = scan_special
hash[ 38] = save
hash[ 40] = WHITE
hash[ 0] = skip_blanks
hash[ 90] = scan
hash[ 37] = int
hash[ 100] = main
Corresponding Output of DHash for above Sample Input:
The second output section below is a dump of the hash table, after the complete input has been read.
hash[  0] had  2 visits,  2 occurrences of 'skip_blanks'
hash[  1] had  3 visits,  3 occurrences of 'col_no'
hash[  2] had 33 visits,  8 occurrences of 'else'
hash[  3] had  1 visits,  1 occurrences of 'malloc'
hash[  4] had  8 visits,  5 occurrences of 'for'
hash[  5] had 10 visits,  5 occurrences of 'token_no'
hash[  8] had 26 visits, 26 occurrences of 'void'
hash[  9] had  9 visits,  7 occurrences of 'printf'
hash[ 11] had  5 visits,  3 occurrences of 'NULL'
. . .
hash[ 94] had 10 visits, 10 occurrences of 'buff'
hash[ 95] had 11 visits,  3 occurrences of 'slots_used'
hash[ 96] had  1 visits,  1 occurrences of 'strcmp'
hash[ 97] had  7 visits,  7 occurrences of 'c'
hash[100] had  1 visits,  1 occurrences of 'main'
hash[103] had  7 visits,  1 occurrences of 'stdio'
hash[104] had 30 visits, 22 occurrences of 'unsigned'
hash[107] had  2 visits,  2 occurrences of 'search_ident'
hash[108] had  7 visits,  7 occurrences of 'while'
Total tokens scanned: 1365
Total collisions = 173, Max chain = 6
Number of entries in hash[109] used: 77