Homework1

Dr. Eick
COSC 6397 “Data Mining” Homework1 Fall 2005
Due: Th., Sept. 22 in class; this is a graded homework!
1) Tan book; problem 3 on page 89
2) Tan book; problem 4 on page 90
3) Compare the box plots of sepal length and petal length that were generated
for all 150 IRIS flowers (figure 3.11). What differences can be observed
between the two plots? What do these differences tell you about the
distributions of the attributes sepal length and petal length?
4) Assume an integer-valued attribute A, whose values are distributed as
follows:
0,0,0,1,1,1,2,2,5,6,7,17,18,19,20,25,28,29,33,39,43,44,44,46,51,58,59,60,
61,65,77,78,81,99,120
has to be summarized in a histogram with 5 buckets using the following 3
methods:
1) Equidepth
2) V-Optimal
3) MaxDiff
Give the 3 histograms that would be obtained using the 3 methods. Also
explain how your histograms were derived. Compare the 3 histograms;
which histogram(s) do you prefer (if you prefer any of the three)? Give
reasons for your answer!
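The three bucketing methods can be sketched in Python as shown below. This is only an illustrative sketch under several assumptions, not the expected solution: the value list assumes the first entries are three zeros, MaxDiff is taken to mean "place boundaries at the largest gaps between adjacent sorted values", and V-Optimal is taken to mean "minimize the total within-bucket squared error". The function names equidepth, maxdiff and v_optimal are made up for this example.

```python
# Values of attribute A (assumption: the list starts with three zeros).
A = [0, 0, 0, 1, 1, 1, 2, 2, 5, 6, 7, 17, 18, 19, 20, 25, 28, 29, 33, 39,
     43, 44, 44, 46, 51, 58, 59, 60, 61, 65, 77, 78, 81, 99, 120]

def equidepth(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    vals = sorted(values)
    n = len(vals)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [vals[bounds[i]:bounds[i + 1]] for i in range(k)]

def maxdiff(values, k):
    """Cut at the k-1 largest gaps between adjacent sorted values
    (assumed definition of MaxDiff)."""
    vals = sorted(values)
    gaps = [(vals[i + 1] - vals[i], i) for i in range(len(vals) - 1)]
    # Positions after which a bucket ends: the k-1 largest gaps.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    buckets, start = [], 0
    for c in cuts:
        buckets.append(vals[start:c + 1])
        start = c + 1
    buckets.append(vals[start:])
    return buckets

def v_optimal(values, k):
    """Choose k contiguous buckets that minimize the total within-bucket
    squared error (assumed definition of V-Optimal), by dynamic programming."""
    vals = sorted(values)
    n = len(vals)
    # Prefix sums of values and squared values give the error of any slice in O(1).
    s, ss = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(vals):
        s[i + 1] = s[i] + v
        ss[i + 1] = ss[i] + v * v

    def sse(i, j):
        """Squared error of vals[i:j] around its mean."""
        return (ss[j] - ss[i]) - (s[j] - s[i]) ** 2 / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for b in range(1, k + 1):           # number of buckets used so far
        for j in range(b, n + 1):       # first j values are covered
            for i in range(b - 1, j):   # last bucket is vals[i:j]
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], back[b][j] = c, i
    # Walk the back-pointers to recover the buckets.
    buckets, j = [], n
    for b in range(k, 0, -1):
        i = back[b][j]
        buckets.append(vals[i:j])
        j = i
    return buckets[::-1], cost[k][n]

if __name__ == "__main__":
    for name, buckets in [("Equidepth", equidepth(A, 5)),
                          ("MaxDiff", maxdiff(A, 5)),
                          ("V-Optimal", v_optimal(A, 5)[0])]:
        print(name, [(b[0], b[-1], len(b)) for b in buckets])
```

Each bucket is printed as (smallest value, largest value, count). If the course uses a different definition of MaxDiff or V-Optimal (for example one based on value frequencies), the corresponding function has to be adapted.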
5) Apply min-max normalization and z-score normalization to attribute A
(from the previous exercise) and plot the distribution of attribute A in the
two normalized spaces. If a value of attribute A that has been normalized
by min-max normalization is 0.25, what does this tell you about the
location of that value relative to the other values of attribute A? If a
value of attribute A that has been normalized using z-score normalization
is -2, what does this tell you about the location of that value relative to
the other values of attribute A?
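The two normalizations can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the value list assumes the first entries are three zeros, min-max normalization maps to the range [0, 1], and the z-score uses the population standard deviation (a course may use the sample standard deviation instead). The function names min_max and z_score are made up for this example.

```python
from statistics import mean, pstdev

# Values of attribute A (assumption: the list starts with three zeros).
A = [0, 0, 0, 1, 1, 1, 2, 2, 5, 6, 7, 17, 18, 19, 20, 25, 28, 29, 33, 39,
     43, 44, 44, 46, 51, 58, 59, 60, 61, 65, 77, 78, 81, 99, 120]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Express each value as its distance from the mean in standard deviations
    (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

if __name__ == "__main__":
    for original, mm, z in zip(A, min_max(A), z_score(A)):
        print(f"{original:4d}  min-max={mm:.3f}  z={z:+.3f}")
```

The printed columns can be fed to any plotting tool to draw the two normalized distributions side by side.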
6) Assume the following dataset, which contains 16 examples with attributes
A1,...,A5 and a class label C, is given:
A1 A2 A3 A4 A5 C
1  1  2  0  1  X
1  1  3  0  2  X
1  1  2  0  1  X
1  1  3  0  2  X
0  0  2  0  1  X
0  0  3  1  2  X
0  0  2  1  1  X
0  0  3  1  2  X
0  1  3  0  2  Y
0  1  3  0  2  Y
0  1  2  0  1  Y
0  1  1  0  0  Y
1  0  1  0  0  Y
1  0  1  0  0  Y
1  0  1  1  0  Y
1  0  1  1  0  Y
Your task is to learn a classifier for the class attribute (we assume there are
only 2 classes, X and Y), and you are forced to reduce the dataset from 5 to 2
attributes! Use information gain to rank the 5 attributes. Are the two
attributes selected by using information gain the best choice for the problem
at hand or not? If your answer is no, what two attributes would you use and
why? In general, is information gain sufficient to identify relevant attributes
in preprocessing, or should this technique be augmented by other techniques
when creating a dataset? If your answer to the last question is no, what other
techniques should be used for constructing a dataset in preprocessing?
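The information gain of each attribute can be computed with a short Python sketch such as the one below. It assumes that every row of the dataset consists of five single-digit attribute values followed by the class label, and that entropy is measured in bits (base-2 logarithm). The names rows, entropy and info_gain are made up for this example.

```python
from collections import Counter
from math import log2

# The 16 examples, one string per row: five attribute digits, then the class.
rows = ["11201X", "11302X", "11201X", "11302X",
        "00201X", "00312X", "00211X", "00312X",
        "01302Y", "01302Y", "01201Y", "01100Y",
        "10100Y", "10100Y", "10110Y", "10110Y"]
data = [([int(ch) for ch in r[:5]], r[5]) for r in rows]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Entropy of the class minus the weighted entropy after splitting on
    the attribute at position attr (0-based)."""
    labels = [c for _, c in examples]
    groups = {}
    for x, c in examples:
        groups.setdefault(x[attr], []).append(c)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Information gain per attribute, keyed by attribute name.
gains = {f"A{i + 1}": info_gain(data, i) for i in range(5)}

if __name__ == "__main__":
    for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {gain:.4f}")
```

Note that information gain scores each attribute on its own; the sketch does not evaluate pairs of attributes jointly, which is part of what the question asks you to think about.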