cosc2007ass2

COSC 2007 – Data Structures II
Assignment #2 (hard one)
Due: Feb. 15, 2010
Building a BST
In order to write a spell checker (the kind used by any word-processing program), a
spelling dictionary of correctly spelled words needs to be created. This spelling
dictionary has to be stored in some nice, easy-to-search structure, so we are going to use a
binary search tree (which is not necessarily the best structure, but it is the best one we
know!)
Once we have decided on a data structure, there are two ways to create our spelling
dictionary. One is to go out and buy a pre-made dictionary from some dictionary
manufacturing company. The other is to take a wide variety of text in electronic form, in
which we already know the spelling is correct, and store all of the unique words in it. The
text that we take the words from is called a corpus. Since we don’t have any money to
spend, we’re going to use the second method.
Our program is going to read in text from a number of different input files and store the
individual, unique words in a BST. Because it might be handy later to know how
common each word is (for example, if our spell checker is going to suggest a replacement
for a misspelled word, it might want to suggest more common words first), we’re also
going to keep a count of how many times that word has appeared in the corpus.
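The idea of storing each unique word once and counting repeats can be sketched as follows. This is only a minimal illustration, not a required design: the names Node and insert are made up for this example, and it assumes words are compared with ordinary string ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One BST node: a word plus how many times it has been seen."""
    word: str
    count: int = 1
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, word):
    """Add word to the tree rooted at root, or bump its count; return the root."""
    if root is None:
        return Node(word)
    node = root
    while True:
        if word == node.word:
            node.count += 1          # word already stored: just count it
            return root
        if word < node.word:
            if node.left is None:
                node.left = Node(word)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(word)
                return root
            node = node.right
```

The loop is iterative on purpose: a corpus that happens to arrive in sorted order produces a very tall tree, and an iterative insert does not depend on the recursion limit.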
Each node in the BST will store a string containing a word from our corpus, along with
an integer count indicating how many times that word appeared in the corpus. The corpus
is written text (including paragraphs, blank lines, and punctuation), so we need to scan
through it to pull out the individual words. Specifically:



• A word is a sequence of alphabetic characters delimited by non-alphabetic
  characters, the beginning of line, the end of line, or the end of file.
• The exception to this is the apostrophe (’), which should always be considered
  part of the word it’s in (treat the apostrophe as an alphabetic character).
• All words should be treated as case insensitive. Store all words in lower case.
  o Hello “hello” hello, **hello *** Hello HELLO are all the same word
  o Isn’t ‘twas students’ are all single, complete words
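One way to apply the word rules above is a single regular expression over each lower-cased line. The sketch below is an illustration under stated assumptions: "alphabetic" is taken to mean the ASCII letters a to z, both the straight apostrophe and the typographic ones (’ and ‘) are accepted as part of a word, and the names WORD_RE and extract_words are invented for this example.

```python
import re

# Assumption: letters are ASCII a-z; the straight apostrophe and the
# typographic apostrophes are all treated as alphabetic characters.
WORD_RE = re.compile(r"[a-z'’‘]+")


def extract_words(line):
    """Return the lower-cased words found in one line of text, in order."""
    return WORD_RE.findall(line.lower())
```

For example, extract_words("Isn't 'twas students'") returns the three words isn't, 'twas and students' with their apostrophes intact. Because the apostrophe always counts as a letter, a word wrapped in single quotation marks keeps those marks too.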
Your program should perform the following steps:
1. Prompt the user for the name of the file (a string). Use the string input by the
user as the argument when opening the file.
2. Open the file on disk, and process its contents, adding unique words to the BST and
increasing the counts of existing words.
3. Repeat steps 1 and 2 until the user enters some sentinel value.
4. Write functions to print out three important pieces of information about your BST:
o Its maximum depth (the length of the longest path between the root and leaf
nodes)
o Its minimum depth (the length of the shortest path between the root and leaf
nodes)
o The total number of nodes in the BST (the number of unique words in the corpus)
All three of these functions will need to traverse the entire tree, so they will probably be
recursive.
5. Since this sort of dictionary takes up a lot of memory and disk space, one of the
designers of the spell-checker project has decided to remove any word with three or
fewer letters since they’re already easy to spell. Delete every node from your BST
that contains a word that is 3 or fewer letters long (note that you must explicitly make
these deletions, not fail to insert these words in the first place).
6. Print out the total number of nodes in your tree after making these deletions.
7. Write another function to print out the contents of a tree using indentation to show the
tree’s hierarchy. For example:
sent
    enough
        before
        ground
    this
        unknown
            won’t
would be the contents of a BST that had “sent” as its root node, with two children,
“enough” and “this”. “enough” has two children, “before” and “ground”, both of which
are leaf nodes. “this” has a single child, “unknown” (which has to be the right child since
this is a BST), and “unknown” also has a single right child, “won’t”. DO NOT PRINT
OUT YOUR ENTIRE TREE USING THIS FUNCTION! Instead, follow the path to the
node in your tree with the greatest depth, and find the node 4 higher than that one in the tree
(its great-great-grandparent node). Print the subtree that starts at that node.
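Steps 4 to 7 can be sketched roughly as below. This is one possible shape, not a reference solution. Every function name is invented for the illustration. It assumes depth is counted in edges (a single-node tree has depth 0), that an apostrophe counts toward a word's length, and that a tree fewer than four levels deep falls back to printing from the root.

```python
class Node:
    def __init__(self, word):
        self.word, self.count = word, 1
        self.left = self.right = None

def insert(node, word):
    # Standard BST insert; a repeated word only bumps its count.
    if node is None:
        return Node(word)
    if word == node.word:
        node.count += 1
    elif word < node.word:
        node.left = insert(node.left, word)
    else:
        node.right = insert(node.right, word)
    return node

def size(node):
    return 0 if node is None else 1 + size(node.left) + size(node.right)

def max_depth(node):
    # Assumption: depth is counted in edges, so an empty tree gives -1.
    if node is None:
        return -1
    return 1 + max(max_depth(node.left), max_depth(node.right))

def min_depth(node):
    # Shortest root-to-leaf path; a missing child does not count as a leaf.
    if node is None:
        return -1
    if node.left is None:
        return 1 + min_depth(node.right)
    if node.right is None:
        return 1 + min_depth(node.left)
    return 1 + min(min_depth(node.left), min_depth(node.right))

def delete(node, word):
    # Classic BST delete; a node with two children takes over the word and
    # count of its in-order successor, which is then removed from the right.
    if node is None:
        return None
    if word < node.word:
        node.left = delete(node.left, word)
    elif word > node.word:
        node.right = delete(node.right, word)
    elif node.left is None:
        return node.right
    elif node.right is None:
        return node.left
    else:
        succ = node.right
        while succ.left is not None:
            succ = succ.left
        node.word, node.count = succ.word, succ.count
        node.right = delete(node.right, succ.word)
    return node

def remove_short_words(root, max_len=3):
    # Collect the short words first, then delete them one at a time, so the
    # tree is never restructured in the middle of a traversal.
    short = []
    def collect(node):
        if node is not None:
            collect(node.left)
            if len(node.word) <= max_len:
                short.append(node.word)
            collect(node.right)
    collect(root)
    for word in short:
        root = delete(root, word)
    return root

def deepest_path(node):
    # Nodes from `node` down to one deepest leaf (ties go to the left).
    if node is None:
        return []
    left, right = deepest_path(node.left), deepest_path(node.right)
    return [node] + (left if len(left) >= len(right) else right)

def subtree_root_for_printing(root, levels_up=4):
    # The node `levels_up` above the deepest node, or the root if too shallow.
    path = deepest_path(root)
    return path[max(0, len(path) - 1 - levels_up)] if path else None

def format_tree(node, level=0, indent="    "):
    # One line per node, parents before children, indented by depth.
    if node is None:
        return []
    return ([indent * level + node.word]
            + format_tree(node.left, level + 1, indent)
            + format_tree(node.right, level + 1, indent))
```

A call such as print("\n".join(format_tree(subtree_root_for_printing(root)))) would then print only the requested subtree. The recursive functions here rely on the tree not being extremely tall; a heavily unbalanced tree could require an iterative version.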
Hand in your source code and the output after running your program on a corpus consisting of
two text files: a2text.1.text, a2text2.txt