YILDIZ TECHNICAL UNIVERSITY
COMPUTER ENGINEERING DEPARTMENT
0114850 DATA MINING: Assignment 3
Due: Friday, March 30, 2007

About Feature Selection

1- Given the data set X, with three input features and one output feature representing the classification of the samples:

   X:  I1   I2   I3   O
       2.5  1.6  5.9  0
       7.2  4.3  2.1  1
       3.4  5.8  1.6  1
       5.6  3.6  6.8  0
       4.8  7.2  3.1  1
       8.1  4.9  8.3  0
       6.3  4.8  2.4  1

Rank the features using a comparison of means and variances.

2- Given the four-dimensional samples below, where the first two dimensions are numeric and the last two are categorical:

   X1  X2  X3  X4
   3   3   1   A
   3   6   2   A
   5   3   1   B
   5   6   2   B
   7   3   1   A
   5   4   2   B

apply a method for unsupervised feature selection based on an entropy measure to remove one dimension from the given data set.

3- Apply the ChiMerge technique to reduce the number of values for the numeric attributes in Problem 1.
a) Reduce the number of numeric values for feature I1 and find the final reduced number of intervals.
b) Reduce the number of numeric values for feature I2 and find the final reduced number of intervals.

About Naïve Bayes

4-

About Decision Trees

5- The goal of this exercise is to become familiar with decision trees, so a simple data set about whether or not to go skiing is provided. The decision to go skiing depends on the attributes snow, weather, season, and physical condition, as shown in the table below.

   snow     weather  season  physical condition  go skiing
   sticky   foggy    low     rested              no
   fresh    sunny    low     injured             no
   fresh    sunny    low     rested              yes
   fresh    sunny    high    rested              yes
   fresh    sunny    mid     rested              yes
   frosted  windy    high    tired               no
   sticky   sunny    low     rested              yes
   frosted  foggy    mid     rested              no
   fresh    windy    low     rested              yes
   fresh    windy    low     rested              yes
   fresh    foggy    low     rested              yes
   fresh    foggy    low     rested              yes
   sticky   sunny    mid     rested              yes
   frosted  foggy    low     injured             no

Build the decision tree from the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree, including all calculation steps.
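For Problem 1, the ranking by comparison of means and variances can be sanity-checked with a short Python sketch. It assumes the usual test statistic |mean1 − mean2| / sqrt(var1/n1 + var2/n2) with sample variances; if the lecture uses a slightly different formula, adapt accordingly — this is an illustrative check, not a substitute for the required hand calculation.

```python
import math

# Data set X from Problem 1: three input features and the class label O.
X = {
    "I1": [2.5, 7.2, 3.4, 5.6, 4.8, 8.1, 6.3],
    "I2": [1.6, 4.3, 5.8, 3.6, 7.2, 4.9, 4.8],
    "I3": [5.9, 2.1, 1.6, 6.8, 3.1, 8.3, 2.4],
}
O = [0, 1, 1, 0, 1, 0, 1]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)  # sample variance

def score(values, labels):
    """|mean difference| between the two classes, scaled by the standard error."""
    a = [x for x, y in zip(values, labels) if y == 0]
    b = [x for x, y in zip(values, labels) if y == 1]
    se = math.sqrt(var(a) / len(a) + var(b) / len(b))
    return abs(mean(a) - mean(b)) / se

scores = {f: score(v, O) for f, v in X.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # features ordered from most to least discriminative
```

A larger score means the class means are farther apart relative to the within-class spread, so the feature separates the classes better.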
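For Problem 2, one common entropy-based unsupervised selection scheme (similarity S = e^(−αD) between sample pairs, entropy E = −Σ [S log S + (1−S) log(1−S)], remove the feature whose removal changes E the least) can be sketched as below. The mixed-distance handling (range-normalized differences for numeric features, 0/1 mismatch for categorical ones) and the choice α = −ln 0.5 / D̄ are assumptions of this sketch, not necessarily the exact variant taught in the lecture.

```python
import math

# Samples from Problem 2: X1, X2 numeric; X3, X4 categorical.
DATA = [
    (3, 3, 1, "A"),
    (3, 6, 2, "A"),
    (5, 3, 1, "B"),
    (5, 6, 2, "B"),
    (7, 3, 1, "A"),
    (5, 4, 2, "B"),
]
FEATURES = ["X1", "X2", "X3", "X4"]
NUMERIC = {"X1", "X2"}

def distance(a, b, feats):
    """Mixed distance: range-normalized difference for numeric features,
    0/1 mismatch for categorical ones, combined Euclidean-style."""
    total = 0.0
    for f in feats:
        i = FEATURES.index(f)
        if f in NUMERIC:
            col = [row[i] for row in DATA]
            rng = max(col) - min(col)
            total += ((a[i] - b[i]) / rng) ** 2
        else:
            total += 0.0 if a[i] == b[i] else 1.0
    return math.sqrt(total / len(feats))

def entropy(feats):
    """Entropy of the pairwise-similarity distribution for a feature subset."""
    pairs = [(DATA[i], DATA[j])
             for i in range(len(DATA)) for j in range(i + 1, len(DATA))]
    dbar = sum(distance(a, b, feats) for a, b in pairs) / len(pairs)
    alpha = -math.log(0.5) / dbar  # the average distance maps to S = 0.5
    e = 0.0
    for a, b in pairs:
        s = math.exp(-alpha * distance(a, b, feats))
        if 0.0 < s < 1.0:  # the terms vanish at s = 0 and s = 1
            e -= s * math.log2(s) + (1 - s) * math.log2(1 - s)
    return e

full = entropy(FEATURES)
# Remove the feature whose absence changes the entropy the least.
change = {f: abs(full - entropy([g for g in FEATURES if g != f])) for f in FEATURES}
removed = min(change, key=change.get)
print(removed, change)
```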
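For Problem 3, the ChiMerge loop (start with one interval per distinct value, repeatedly merge the adjacent pair with the smallest χ², stop when every adjacent χ² exceeds a significance threshold) can be sketched as follows for feature I1. The stopping threshold 2.706 (χ² at 90% significance, 1 degree of freedom) is an assumption, since the problem does not fix one; rerun with I2 for part b).

```python
# Feature I1 from Problem 1 paired with the class label O, sorted by value.
PAIRS = sorted(zip([2.5, 7.2, 3.4, 5.6, 4.8, 8.1, 6.3],
                   [0, 1, 1, 0, 1, 0, 1]))

def chi2(a, b):
    """Chi-square statistic for two adjacent intervals (lists of class labels)."""
    classes = sorted(set(a) | set(b))
    total = len(a) + len(b)
    stat = 0.0
    for part in (a, b):
        for c in classes:
            expected = len(part) * (a + b).count(c) / total
            observed = part.count(c)
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(pairs, threshold=2.706):  # assumed: chi2 at 90%, 1 degree of freedom
    # Start with one interval per value; each interval keeps (lower bound, labels).
    intervals = [(v, [c]) for v, c in pairs]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = stats.index(min(stats))
        if stats[i] >= threshold:
            break  # all adjacent pairs differ significantly; stop merging
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals = intervals[:i] + [merged] + intervals[i + 2:]
    return intervals

result = chimerge(PAIRS)
print(len(result), [v for v, _ in result])  # final interval count and lower bounds
```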
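For Problem 5, the information gains at the root can be cross-checked programmatically. The sketch below assumes the standard ID3-style gain (class entropy minus the weighted entropy after splitting on an attribute); the exercise still asks for the full hand calculation and the complete tree.

```python
import math

# Skiing data from Problem 5: (snow, weather, season, physical condition, go skiing).
ROWS = [
    ("sticky", "foggy", "low", "rested", "no"),
    ("fresh", "sunny", "low", "injured", "no"),
    ("fresh", "sunny", "low", "rested", "yes"),
    ("fresh", "sunny", "high", "rested", "yes"),
    ("fresh", "sunny", "mid", "rested", "yes"),
    ("frosted", "windy", "high", "tired", "no"),
    ("sticky", "sunny", "low", "rested", "yes"),
    ("frosted", "foggy", "mid", "rested", "no"),
    ("fresh", "windy", "low", "rested", "yes"),
    ("fresh", "windy", "low", "rested", "yes"),
    ("fresh", "foggy", "low", "rested", "yes"),
    ("fresh", "foggy", "low", "rested", "yes"),
    ("sticky", "sunny", "mid", "rested", "yes"),
    ("frosted", "foggy", "low", "injured", "no"),
]
ATTRS = ["snow", "weather", "season", "physical condition"]

def entropy(labels):
    e = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        e -= p * math.log2(p)
    return e

def gain(col):
    """Information gain of splitting the full data set on attribute column `col`."""
    labels = [r[-1] for r in ROWS]
    g = entropy(labels)
    for v in set(r[col] for r in ROWS):
        subset = [r[-1] for r in ROWS if r[col] == v]
        g -= len(subset) / len(ROWS) * entropy(subset)
    return g

gains = {a: gain(i) for i, a in enumerate(ATTRS)}
root = max(gains, key=gains.get)
print(root, {a: round(g, 3) for a, g in gains.items()})
```

The attribute with the highest gain becomes the root; the same computation is then repeated on each branch's subset to grow the tree.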
6- Given a training data set Y:

   A   B  C  Class
   15  1  A  C1
   20  3  B  C2
   25  2  A  C1
   30  4  A  C1
   35  2  B  C2
   25  4  A  C1
   15  2  B  C2
   20  3  B  C2

a) Find the best threshold (for the maximal gain) for attribute A.
b) Find the best threshold (for the maximal gain) for attribute B.
c) Find a decision tree for data set Y.
d) If the testing set is

   A   B  C  D
   10  2  A  C2
   20  1  B  C1
   30  3  A  C2
   40  2  B  C2
   15  1  B  C1

classify the test samples using the tree from c) and compare the predictions with the given class D.