Dr. Eick
COSC 6397 “Data Mining”
Homework 1, Fall 2005
Due: Th., Sept. 22 in class; this is a graded homework!

1) Tan book; problem 3 on page 89

2) Tan book; problem 4 on page 90

3) Compare the box plots of sepal length and petal length that were generated for all 150 IRIS flowers (Figure 3.11). What differences can be observed in the two plots? What do these differences tell you about the distribution of the attributes sepal length and petal length?

4) Assume an integer-valued attribute A with the following values is given:

   0, 0, 0, 1, 1, 1, 2, 2, 5, 6, 7, 17, 18, 19, 20, 25, 28, 29, 33, 39, 43, 44, 44, 46, 51, 58, 59, 60, 61, 65, 77, 78, 81, 99, 120

Attribute A has to be summarized in a histogram with 5 buckets using the following 3 methods:
   1) Equidepth
   2) V-Optimal
   3) MaxDiff
Give the 3 histograms that would be obtained using the 3 methods. Also explain how your histograms were derived. Compare the 3 histograms; which histogram(s) do you prefer (if you prefer any of the three)? Give reasons for your answer!

5) Apply min-max normalization and z-score normalization to attribute A (from the previous exercise) and plot the distribution of attribute A in the two normalized spaces. If a value of attribute A is 0.25 after min-max normalization, what does this tell you about the location of that value relative to the other values of attribute A? If a value of attribute A is -2 after z-score normalization, what does this tell you about the location of that value relative to the other values of attribute A?

6) Assume the following dataset, which contains 16 examples with attributes A1,...,A5 and a class label C, is given:

   A1 A2 A3 A4 A5 C
   1  1  2  0  1  X
   1  1  3  0  2  X
   1  1  2  0  1  X
   1  1  3  0  2  X
   0  0  2  0  1  X
   0  0  3  1  2  X
   0  0  2  1  1  X
   0  0  3  1  2  X
   0  1  3  0  2  Y
   0  1  3  0  2  Y
   0  1  2  0  1  Y
   0  1  1  0  0  Y
   1  0  1  0  0  Y
   1  0  1  0  0  Y
   1  0  1  1  0  Y
   1  0  1  1  0  Y

Your task is to learn a classifier for the class attribute (we assume there are only 2 classes, X and Y), and you are forced to reduce the dataset from 5 to 2 attributes!
Use information gain to rank the 5 attributes. Are the two attributes selected by using information gain the best choice for the problem at hand or not? If your answer is no, what two attributes would you use and why? In general, is information gain sufficient to identify relevant attributes in preprocessing, or should this technique be augmented by other techniques when creating a dataset? If your answer to the last question is no, what other techniques should be used for constructing a dataset in preprocessing?
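
Note on computing information gain: the ranking asked for in problem 6 can be checked mechanically. The following is a minimal sketch in Python, not a prescribed solution method. It assumes the usual definition, i.e. the entropy of the class label minus the weighted entropy of the label after splitting on a single attribute. The function names and the toy data (columns B1, B2) are hypothetical and are not the homework dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, label_index=-1):
    """Class entropy minus the weighted class entropy after splitting on one attribute."""
    labels = [row[label_index] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[label_index])
    remainder = sum(len(part) / len(rows) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data (NOT the homework dataset): columns are (B1, B2, class).
toy_rows = [
    (0, 0, "X"),
    (0, 1, "X"),
    (1, 0, "Y"),
    (1, 1, "Y"),
]

# Rank the attributes by information gain, highest first.
gains = {name: information_gain(toy_rows, i) for i, name in enumerate(["B1", "B2"])}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

In the toy data B1 determines the class and B2 carries no information about it, so B1 is ranked first. The sketch scores each attribute on its own, which is the limitation the last questions of problem 6 ask you to think about.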