Object-Oriented Programming in Java
MISM/MSIT 95-712
Summer 2002
Homework 7

1. In "The Adventure of the Dancing Men," Mr. Sherlock Holmes was asked by a Mr. Hilton Cubitt of Riding Thorpe Manor, Norfolk, to deduce the meaning of his wife's strange behavior, which seemed to stem from several notes left on the sundial in their garden. The notes contained drawings of stick figures, little dancing men in a variety of positions. After accumulating a number of these notes, Holmes concluded that they were a cipher of some sort. Since Holmes was "fairly familiar with all forms of secret writings, and [was himself] the author of a trifling monograph upon the subject," he had no difficulty in breaking the code. In fact, he wrote a message of his own, in the very same code, which ultimately led to the capture of the perp. (See http://www.webfic.com/dancmen/danc.htm)

Holmes' solution used the frequencies with which the 26 letters of the English alphabet appear in normal prose. Write a Java program that tallies letter frequencies in any plain text file the user chooses. Your program should prompt the user to enter a filename, open the file, scan the text character by character, and count the occurrences of each of the 26 letters. You should not distinguish between upper and lower case letters, and punctuation should be ignored. You will probably want to use the isLetter() method of the Character class, and the toLowerCase() method of the String class. Your program should print out each letter and the number of occurrences in the file. Of course, we will test your program on the original Conan Doyle story!

You should submit a clasp-type manila envelope (with your name on it) containing:
- A printed listing of your code, with your name as the first line at the top.
- A 3½” floppy disk labeled “Java Homework 7 <your name>”, completely blank except for two folders. The first folder, named Problem7-1, should contain your .java and .class files for this problem.
Your program should run when a TA sets the current directory to a:\Problem7-1 and types java Sherlock at the command prompt.

2. The next step in the genetic programming project is to read and store a data file (such as the one a criminologist might collect on recidivism). Each of the trees in a generation of trees will be tested against this data file, and its “fitness” thus measured.

Here is a simple example. Suppose there is just a single independent variable x0. A (very short) data file might look like this:

y     x0
1.2   1.0
4.1   2.0
8.8   3.0

Now suppose the tree is ((x0 + 0.25) * x0). We want to find out how close the tree value, for each x0 value, is to the given y value in the data. One standard way of doing this is to add up (over the rows of data) the square of the “deviation”. For this example, the result is

Fitness = (((1.0 + 0.25) * 1.0) - 1.2)^2 + (((2.0 + 0.25) * 2.0) - 4.1)^2 + (((3.0 + 0.25) * 3.0) - 8.8)^2
        = 0.0025 + 0.16 + 0.9025
        = 1.065

The work you’ve done so far allows you to evaluate any tree on a double[], an array of values for x0, x1, ..., corresponding to a single data row. The next step is to evaluate over multiple rows (subtracting and squaring for each row, as above) and sum the results.

To do this, create two new classes, DataRow and DataSet. A DataRow object will hold a y value and an array (or ArrayList, if you like) of x values. (For the example above, this array would only be of length 1, but other data sets may have more than one x variable, so we really do need an array.) A DataSet will hold an array (or ArrayList) of DataRow objects. Both of these classes should provide member functions to add and access their elements.

The DataSet class also has the responsibility of reading the data from a data file. Write a constructor that takes a String as its argument. The String holds the name of the data file.
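As a rough illustration of the two container classes just described, here is a minimal sketch. It is only one possible shape, not the required design: the wrapper class DataSetSketch and the method names addRow, getRow, size, getY, and getX are hypothetical, the classes are nested only so the example fits in one file, and the file-reading constructor is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper so the whole sketch lives in a single file.
class DataSetSketch {

    // One row of data: a y value plus the x values x0, x1, ...
    static class DataRow {
        private final double y;
        private final double[] x;

        DataRow(double y, double[] x) {
            this.y = y;
            this.x = x.clone(); // copy, so later changes by the caller do not affect the row
        }

        double getY() { return y; }

        // Returns a copy that can be handed to an eval(double[]) method.
        double[] getX() { return x.clone(); }
    }

    // An ordered collection of DataRow objects (reading from a file is omitted).
    static class DataSet {
        private final List<DataRow> rows = new ArrayList<>();

        void addRow(DataRow row) { rows.add(row); }

        DataRow getRow(int i) { return rows.get(i); }

        int size() { return rows.size(); }
    }

    public static void main(String[] args) {
        // Build the three-row example by hand.
        DataSet data = new DataSet();
        data.addRow(new DataRow(1.2, new double[] {1.0}));
        data.addRow(new DataRow(4.1, new double[] {2.0}));
        data.addRow(new DataRow(8.8, new double[] {3.0}));

        for (int i = 0; i < data.size(); i++) {
            DataRow row = data.getRow(i);
            System.out.println("y = " + row.getY() + ", x0 = " + row.getX()[0]);
        }
    }
}
```

Storing the x values as a double[] keeps each row directly usable by an eval() method that already accepts a double[]; an ArrayList of Double would work too, at the cost of a conversion step.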
To keep things simple, assume that the data file contains, as its first entry, the number of independent variables, and as its second entry, the number of rows of data. Following this, the file contains the first y value and its x value(s), then the next y value and the corresponding x values, and so forth. The file for the example data above might look like this:

1 3
1.2 1.0
4.1 2.0
8.8 3.0

The only real requirement is that the data file have spaces between the entries. You can use the SimpleInput class for this job, because it automatically skips whitespace when you call nextInt() or nextDouble(). Assume for now that the data file will be correct; that is, don’t go wild trying to anticipate and recover from errors in the file.

Once these classes are working correctly, the next step is to modify the GPTree class so that its eval() method accepts a DataSet object as its argument. Since you already have an eval() method that takes a double[], it shouldn’t be too hard to extract each DataRow’s array of x values and feed it to your existing method (code reuse at work). The GPTree eval() method should run through each of the DataRows, evaluate the tree, subtract the y value, and square the result, all the while keeping a running sum of the squared differences. The final sum is the GPTree’s fitness value.

The final step is to fix up the Generation class, so its evalAll() method takes a DataSet as its argument.

As in the last homework, write a test class that demonstrates your stuff. Call it TestGeneration. Have this class’s main() method create a generation of 500 GPTrees. Then prompt the user for a data file. Get the data into a DataSet object, and evaluate each GPTree. Sort the GPTrees according to their fitnesses, then print out the GPTree with the smallest fitness. After all, this is the tree that best fits the data.

Submit:
- A printed listing of your code, with your name as the first line at the top.
- A folder, named Problem7-2, containing your .java and .class files for this problem. Your program should run when a TA sets the current directory to a:\Problem7-2 and types java TestGeneration at the command prompt.

I provide a test data file below for you to try your work on. Testing your program can be difficult because it depends on random numbers, and because of the number crunching involved. My plan was to use a very small data file, a small number of trees, and small (but still randomly generated) trees, which are easy to produce by setting maxDepth to 2 or 3.

Although I’m not going to ask you to do this (because it is pretty hard!), you might consider how a “high-class” version of the project could simplify the tree expressions. One tree I got when testing my code was

((0.692 - X0) - (((((X0 / X0) / 0.679) / (0.976 / 0.261)) + (((0.278 * X0) + (X0 * X0)) - ((X0 + X0) / (0.187 * X0)))) + 0.221))

Successive algebraic reductions (assuming X0 is not 0) give

((0.692 - X0) - ((((1 / 0.679) / (0.976 / 0.261)) + (((0.278 * X0) + X0^2) - (2*X0 / (0.187 * X0)))) + 0.221))
((0.692 - X0) - (((1.473 / 3.739) + (((0.278 * X0) + X0^2) - 10.695)) + 0.221))
((0.692 - X0) - ((0.394 + (X0^2 + 0.278*X0 - 10.695)) + 0.221))
((0.692 - X0) - (X0^2 + 0.278*X0 - 10.080))
-X0^2 - 1.278*X0 + 10.772 (I think!)

So, these horrid expression trees aren’t always so bad after all!

Here is a simple dataset to try.

1 31
0 0
0.19 0.1
0.36 0.2
0.51 0.3
0.64 0.4
0.75 0.5
0.84 0.6
0.91 0.7
0.96 0.8
0.99 0.9
1 1
0.99 1.1
0.96 1.2
0.91 1.3
0.84 1.4
0.75 1.5
0.64 1.6
0.51 1.7
0.36 1.8
0.19 1.9
0 2
-0.21 2.1
-0.44 2.2
-0.69 2.3
-0.96 2.4
-1.25 2.5
-1.56 2.6
-1.89 2.7
-2.24 2.8
-2.61 2.9
-3 3
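
To make the data-file layout and the sum-of-squared-deviations fitness from problem 2 concrete, here is a minimal sketch that parses the sample dataset and scores one hand-written candidate against it. Several things here are assumptions rather than part of the assignment: the class name FitnessSketch and the methods parse and fitness are hypothetical, java.util.Scanner stands in for the course's SimpleInput class, the data is read from a String rather than a file, and a lambda stands in for a GPTree's eval(double[]) method.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Scanner;
import java.util.function.ToDoubleFunction;

class FitnessSketch {

    // Parses text laid out as: numVars numRows, then y x0 [x1 ...] for each row.
    // Returns one array per row, of the form {y, x0, x1, ...}.
    static double[][] parse(String text) {
        Scanner in = new Scanner(text).useLocale(Locale.US); // '.' is the decimal point
        int numVars = in.nextInt();
        int numRows = in.nextInt();
        double[][] rows = new double[numRows][numVars + 1];
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c <= numVars; c++) {
                rows[r][c] = in.nextDouble();
            }
        }
        return rows;
    }

    // Sum over all rows of (tree value - y)^2; a smaller sum means a better fit.
    static double fitness(double[][] rows, ToDoubleFunction<double[]> tree) {
        double sum = 0.0;
        for (double[] row : rows) {
            double[] x = Arrays.copyOfRange(row, 1, row.length); // drop the leading y
            double deviation = tree.applyAsDouble(x) - row[0];
            sum += deviation * deviation;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The sample dataset, given inline as a String.
        String sample =
            "1 31 0 0 0.19 0.1 0.36 0.2 0.51 0.3 0.64 0.4 0.75 0.5 0.84 0.6 "
          + "0.91 0.7 0.96 0.8 0.99 0.9 1 1 0.99 1.1 0.96 1.2 0.91 1.3 0.84 1.4 "
          + "0.75 1.5 0.64 1.6 0.51 1.7 0.36 1.8 0.19 1.9 0 2 -0.21 2.1 -0.44 2.2 "
          + "-0.69 2.3 -0.96 2.4 -1.25 2.5 -1.56 2.6 -1.89 2.7 -2.24 2.8 -2.61 2.9 -3 3";
        double[][] rows = parse(sample);

        // Hand-written candidate standing in for a tree: (X0 * (2 - X0)).
        double f = fitness(rows, x -> x[0] * (2.0 - x[0]));
        System.out.println("rows = " + rows.length + ", fitness = " + f);
    }
}
```

Every row of the sample dataset satisfies y = 2*X0 - X0^2, so this particular candidate should score a fitness very close to zero (not necessarily exactly zero, because of floating-point rounding). That makes it a handy reference point when judging the best tree your random generation produces.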