cosc2007ass4

COSC 2007 –Data Structures II Assignment #4 Due: April, 6, 2011 Hashing Program to write The data structure we will implement for this assignment is called a hash table. Hash tables are among the most common and most useful general-purpose data structures. Your program will use a hash table to count the number of occurrences of words in a text file. The basic algorithm goes as follows. For each word in the file (there might be several files): 1. Look up the word in the hash table. 2. If it isn't there, add it to the hash table with a value of 1. In this case you'll have to create a new node and link it to the hash table as described above. 3. If it is there, add 1 to its value. In this case you don't create a new node. At the end of the program, you should output all the words in the file, one per line, with the count next to the word e.g. cat 2 hat 4 green 12 eggs 3 ham 5 algorithmic 14 deoxyribonucleic 3 dodecahedron 400 The hash function The hash function you will use is extremely simple: it will go through the string a character at a time, convert each character to an integer, add up the integer values, and return the sum modulo 128 (but don't use a magic number for this). This will give an integer in the range of [0, 127]. This is a poor hash function; for one thing, anagrams will always hash to the same value. However, it's very simple to implement. Please notice that you'll need a loop to go through the string character-by-character. Other details Since a value associated with a key will always be positive, if you search for a key and don't find it in a hash table, just return 0. That's the special "not found" value for this hash table. The program assumes that the input file has several words per line. The order of the words you output isn't important, since the test script will sort them. The corpus is written text-including paragraphs, blank lines, and punctuation-so we need to scan through it to pull out the individual words. Specifically:    A word is a sequence of alphabetic characters delimited by non-alphabetic characters, the beginning of line, the end of line, or the end of file. The exception to this is the apostrophe (‘), which should always be considered part of the word it’s in (treat the apostrophe as an alphabetic character.) All words should be treated as case insensitive. Store all words in lower case. o Hello “hello” hello, **hello *** Hello HELLO are all the same word o Isn’t ‘twas students’ are all single, complete words Testing your program Your program should perform the following steps: 1. Prompt the user for the name of the file (a string). Use the string input by the user as an argument to open file: 2. Open the file on disk, and process its contents, adding unique words to the Hash table and increasing the counts of existing words. 3. Repeat steps 1&2 until the user enters some sentinel value. 4. Write a print method to print out all the words in the hash table specified as above 5. Prompt user for the test file name, perform the search method and display the information about the searching result. If the word is in the hash table, print the word and count, otherwise, print a message that indicates the word (word name) is not in the hash table. The input files and the test file will be posted. Hand in your source code and the output after running your program.

cosc2007ass4

Related documents

Products

Support

cosc2007ass4

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib