Homework7

Object-Oriented Programming in Java
MISM/MSIT 95-712
Summer 2002
Homework 7
1. In "The Adventure of the Dancing Men," Mr. Sherlock Holmes was asked by a Mr.
Hilton Cubitt of Riding Thorpe Manor, Norfolk, to deduce the meaning of his wife's
strange behavior, which seemed to stem from several notes left on the sundial in their
garden. The notes contained drawings of stick figures, little dancing men in a variety of
positions.
After accumulating a number of these notes, Holmes concluded that they were a cipher of
some sort. Since Holmes was "fairly familiar with all forms of secret writings, and [was
himself] the author of a trifling monograph upon the subject," he had no difficulty in
breaking the code. In fact, he wrote a message of his own, in the very same code, which
ultimately led to the capture of the perp. (See
http://www.webfic.com/dancmen/danc.htm)
Holmes' solution used the frequencies with which the 26 letters of the English alphabet
appear in normal prose. Write a Java program which will tally letter frequencies in any
plain text file a user wants. Your program should prompt the user to enter a filename,
open the file, scan the text character-by-character, and count the occurrences of each of
the 26 letters. You should not distinguish between upper and lower case letters, and
punctuation should be ignored. You will probably want to use the isLetter() method
of the Character class, and the toLowerCase() method of the String class.
Your program should print out each letter and the number of occurrences in the file. Of
course, we will test your program on the original Conan Doyle story!
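The counting step described above might be sketched as follows. This is only one possible shape, not the expected solution: the class name LetterTally and the method name tally are made up for illustration, and the filename prompt is left out. The sketch restricts counting to a-z after lower-casing, because Character.isLetter() also accepts accented and non-Latin letters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; the assignment itself asks for a class named Sherlock.
class LetterTally {
    // Reads character-by-character and counts a-z, ignoring case and everything else.
    static int[] tally(Reader in) throws IOException {
        int[] counts = new int[26];
        int c;
        while ((c = in.read()) != -1) {
            char ch = Character.toLowerCase((char) c);
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;
            }
        }
        return counts;
    }

    // Convenience overload for trying the tally on a string instead of a file.
    static int[] tally(String text) {
        try {
            return tally(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = tally("The Dancing Men!");
        for (int i = 0; i < 26; i++) {
            System.out.println((char) ('a' + i) + ": " + counts[i]);
        }
    }
}
```

To count a real file, a java.io.FileReader (ideally wrapped in a BufferedReader) could be passed to tally(Reader) in place of the StringReader.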
You should submit a clasp-type manila envelope (with your name on it)
containing
• A printed listing of your code, with your name as the first line at the top.
• A 3½” floppy disk labeled “Java Homework 7 <your name>”, completely
blank except for two folders. The first folder, named Problem7-1, should
contain your .java and .class files for this problem. Your program
should run when a TA sets the current directory to a:\Problem7-1 and
types
java Sherlock
at the command prompt.
2. The next step in the genetic programming project is to read and store a data file (such
as the one a criminologist might collect on recidivism). Each of the trees in a generation
of trees will be tested against this data file, and its “fitness” thus measured. Here is a
simple example.
Suppose there is just a single independent variable x0. A (very short) data file might look
like this:
y    x0
1.2  1.0
4.1  2.0
8.8  3.0
Now suppose the tree is ((x0 + 0.25) * x0). We want to find out how close the tree value,
for each x0 value, is to the given y value in the data. One standard way of doing this is to
add up (over the rows of data) the square of the “deviation”. For this example, the result
is
Fitness = (((1.0 + 0.25) * 1.0) - 1.2)^2
+ (((2.0 + 0.25) * 2.0) - 4.1)^2
+ (((3.0 + 0.25) * 3.0) - 8.8)^2
= 0.0025 + 0.16 + 0.9025
= 1.065
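As a quick check of this arithmetic, here is a small sketch that hard-codes the example tree ((x0 + 0.25) * x0) and the three data rows. The class and method names are invented for illustration; a real GPTree would of course evaluate its own expression rather than a fixed formula.

```java
// Hypothetical stand-alone check; not part of the GPTree project classes.
class FitnessExample {
    // The example tree ((x0 + 0.25) * x0), written out by hand.
    static double tree(double x0) {
        return (x0 + 0.25) * x0;
    }

    // Sums, over all rows, the square of (tree value - y value).
    static double sumSquaredDeviations(double[] y, double[] x0) {
        double sum = 0.0;
        for (int i = 0; i < y.length; i++) {
            double deviation = tree(x0[i]) - y[i];
            sum += deviation * deviation;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] y = {1.2, 4.1, 8.8};
        double[] x0 = {1.0, 2.0, 3.0};
        // The three terms are 0.0025, 0.16 and 0.9025.
        System.out.printf("%.3f%n", sumSquaredDeviations(y, x0)); // prints 1.065
    }
}
```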
The work you’ve done so far allows you to evaluate any tree on a double[], an array
of values for x0, x1,…, corresponding to a single data row. The next step is to be able to
evaluate over multiple rows (subtracting and squaring for each row, as above) and sum
the results for each row.
To do this, create two new classes, DataRow and DataSet. A DataRow object will
hold a y value and an array (or ArrayList, if you like) of x values. (For the example
above, this array would only be of length 1, but other data sets may have more than one x
variable, so we really do need an array.) A DataSet will hold an array (or
ArrayList) of DataRow objects. Both of these classes should provide member
functions to add and access their elements.
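One possible skeleton for these two classes is sketched below. The accessor names (getY, getX, addRow, getRow, getRowCount) are assumptions chosen for illustration; the assignment only requires that some add and access methods exist, and the file-reading constructor is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Holds one y value and the x values that go with it.
class DataRow {
    private final double y;
    private final double[] x;

    DataRow(double y, double[] x) {
        this.y = y;
        this.x = x.clone(); // defensive copy so callers cannot alter the row later
    }

    double getY() { return y; }
    double getX(int i) { return x[i]; }
    double[] getXValues() { return x.clone(); }
    int getXCount() { return x.length; }
}

// Holds the rows of a data file in the order they were added.
class DataSet {
    private final List<DataRow> rows = new ArrayList<>();

    void addRow(DataRow row) { rows.add(row); }
    DataRow getRow(int i) { return rows.get(i); }
    int getRowCount() { return rows.size(); }
}

class DataSetDemo {
    public static void main(String[] args) {
        DataSet data = new DataSet();
        data.addRow(new DataRow(1.2, new double[] {1.0}));
        data.addRow(new DataRow(4.1, new double[] {2.0}));
        data.addRow(new DataRow(8.8, new double[] {3.0}));
        System.out.println(data.getRowCount() + " rows, first y = " + data.getRow(0).getY());
    }
}
```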
The DataSet class also has the responsibility of reading the data from a data file. Write
a constructor that takes a String as its argument. The String holds the name of the
data file. To keep things simple, assume that the data file contains, as its first entry, the
number of independent variables, and as its second entry, the number of rows of data.
Following this, the file contains the first y value, the x value(s), then the next y value and
the corresponding x values, and so forth. The file for the example data above might look
like this:
1 3
1.2 1.0
4.1 2.0
8.8 3.0
The only real requirement is that the data file have spaces between the entries. You can
use the SimpleInput class for this job, because it automatically skips whitespace
when you call nextInt() or nextDouble(). Assume for now that the data file will
be correct, that is, don’t go wild trying to anticipate and recover from errors in the file.
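The file layout just described could be read along these lines. Since the course's SimpleInput class is not shown here, this sketch substitutes java.util.Scanner, which likewise skips whitespace in nextInt() and nextDouble(); that substitution, the class name DataFileSketch, and the choice to return plain arrays are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Scanner;

// Hypothetical reader for the "numVars numRows, then y x0 x1 ..." layout.
class DataFileSketch {
    // Returns one array per data row: element 0 is y, the rest are x0, x1, ...
    static double[][] readRows(Scanner in) {
        in.useLocale(Locale.US); // so "1.2" parses the same way on every machine
        int numVars = in.nextInt();
        int numRows = in.nextInt();
        double[][] rows = new double[numRows][numVars + 1];
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c <= numVars; c++) {
                rows[r][c] = in.nextDouble();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // The example file contents, supplied as a string to keep the sketch self-contained.
        Scanner in = new Scanner("1 3\n1.2 1.0\n4.1 2.0\n8.8 3.0\n");
        for (double[] row : readRows(in)) {
            System.out.println("y = " + row[0] + ", x0 = " + row[1]);
        }
    }
}
```

In a DataSet constructor, the Scanner would instead be built from the named file, for example with new Scanner(new java.io.File(filename)), which can throw FileNotFoundException.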
Once these classes are working correctly, the next step is to modify the GPTree class so
that its eval() method accepts a DataSet object as its argument. Since you already
have an eval() method that takes a double[], it shouldn’t be too hard to extract each
DataRow’s array of x values and feed it to your existing method (code reuse at work).
The GPTree eval() method should run through each of the DataRows, evaluate the
tree, subtract the y value, and square the result, all the while keeping a running sum of the
squared differences. The final sum is the GPTree’s fitness value.
The final step is to fix up the Generation class, so its evalAll() method takes a
DataSet as its argument.
As in the last homework, write a test class that demonstrates your stuff. Call it
TestGeneration. Have this class’s main() method create a generation of 500
GPTrees. Then prompt the user for a data file. Get the data into a DataSet object,
and evaluate each GPTree. Sort the GPTrees according to their fitnesses, then print out
the GPTree with the smallest fitness. After all, this is the tree that best fits the data.
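The sort-and-pick step might look something like the sketch below. ScoredTree is a made-up stand-in for an already-evaluated GPTree, and the three fitness values were worked out by hand against the three-row example data (y = 1.2, 4.1, 8.8 for x0 = 1.0, 2.0, 3.0); how your own GPTree stores and exposes its fitness is up to you.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByFitnessSketch {
    // Hypothetical stand-in for a GPTree whose fitness has already been computed.
    static class ScoredTree {
        final String expression;
        final double fitness;

        ScoredTree(String expression, double fitness) {
            this.expression = expression;
            this.fitness = fitness;
        }
    }

    // Sorts the list in place, smallest fitness first, and returns the best-fitting tree.
    static ScoredTree sortAndPickBest(List<ScoredTree> trees) {
        trees.sort(Comparator.comparingDouble(t -> t.fitness));
        return trees.get(0);
    }

    public static void main(String[] args) {
        List<ScoredTree> trees = new ArrayList<>();
        trees.add(new ScoredTree("((x0 + 0.25) * x0)", 1.065));
        trees.add(new ScoredTree("x0", 38.09));
        trees.add(new ScoredTree("(x0 * x0)", 0.09));
        ScoredTree best = sortAndPickBest(trees);
        System.out.println(best.expression + " with fitness " + best.fitness);
    }
}
```

An alternative would be to have GPTree implement Comparable and compare on fitness, so that Collections.sort() works on the generation directly.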
Submit
• A printed listing of your code, with your name as the first line at the top.
• A folder, named Problem7-2, containing your .java and .class files for
this problem. Your program should run when a TA sets the current directory
to a:\Problem7-2 and types
java TestGeneration
at the command prompt. I provide a test data file below for you to try your
work on.
Testing your program can be difficult because it depends on random numbers, and
because of the number crunching involved. My plan was to use a very small data file, a
small number of trees, and small (but still randomly generated) trees, which are easy to
produce by setting maxDepth to 2 or 3.
Although I’m not going to ask you to do this (because it is pretty hard!), you might
consider how a “high-class” version of the project could simplify the tree expressions.
One tree I got when testing my code was
((0.692 - X0) - (((((X0 / X0) / 0.679) / (0.976 / 0.261)) + (((0.278 * X0) + (X0 * X0)) - ((X0 + X0) / (0.187 * X0)))) + 0.221))
Successive algebraic reductions give
((0.692 - X0) - ((((1 / 0.679) / (0.976 / 0.261)) + (((0.278 * X0) + X0^2) - (2*X0 / (0.187 * X0)))) + 0.221))
((0.692 - X0) - (((1.473 / 3.739) + (((0.278 * X0) + X0^2) - 10.695)) + 0.221))
((0.692 - X0) - ((0.394 + (X0^2 + 0.278*X0 - 10.695)) + 0.221))
((0.692 - X0) - (X0^2 + 0.278*X0 - 10.080))
-X0^2 - 1.278*X0 + 10.772 (I think! And only where X0 is nonzero, since the original tree divides by X0.)
So, these horrid expression trees aren’t always so bad after all!
Here is a simple dataset to try.
1 31
0 0
0.19 0.1
0.36 0.2
0.51 0.3
0.64 0.4
0.75 0.5
0.84 0.6
0.91 0.7
0.96 0.8
0.99 0.9
1 1
0.99 1.1
0.96 1.2
0.91 1.3
0.84 1.4
0.75 1.5
0.64 1.6
0.51 1.7
0.36 1.8
0.19 1.9
0 2
-0.21 2.1
-0.44 2.2
-0.69 2.3
-0.96 2.4
-1.25 2.5
-1.56 2.6
-1.89 2.7
-2.24 2.8
-2.61 2.9
-3 3