YILDIZ TECHNICAL UNIVERSITY
COMPUTER ENGINEERING DEPARTMENT
0114850 DATA MINING: Assignment 3
Due: Friday, March 30, 2007
About Feature Selection
1- Given the data set X with three input features and one output feature representing the classification of samples:

X:
I1     I2     I3     O
2.5    1.6    5.9    0
7.2    4.3    2.1    1
3.4    5.8    1.6    1
5.6    3.6    6.8    0
4.8    7.2    3.1    1
8.1    4.9    8.3    0
6.3    4.8    2.4    1
Rank the features using a comparison of means and variances.
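The means-and-variances comparison can be sketched in code. One common form of the test (assumed here) scores each feature by |mean0 - mean1| / sqrt(var0/n0 + var1/n1), i.e. the separation of the two class means relative to their pooled standard error; a larger score indicates a more discriminative feature:

```python
import statistics

# Data set X from Problem 1: rows are (I1, I2, I3, O)
samples = [
    (2.5, 1.6, 5.9, 0),
    (7.2, 4.3, 2.1, 1),
    (3.4, 5.8, 1.6, 1),
    (5.6, 3.6, 6.8, 0),
    (4.8, 7.2, 3.1, 1),
    (8.1, 4.9, 8.3, 0),
    (6.3, 4.8, 2.4, 1),
]

def mean_variance_score(feature_index):
    """|mean(class 0) - mean(class 1)| / sqrt(var0/n0 + var1/n1)."""
    c0 = [s[feature_index] for s in samples if s[3] == 0]
    c1 = [s[feature_index] for s in samples if s[3] == 1]
    se = (statistics.variance(c0) / len(c0)
          + statistics.variance(c1) / len(c1)) ** 0.5
    return abs(statistics.mean(c0) - statistics.mean(c1)) / se

# Rank the features I1, I2, I3 by descending score
scores = {f"I{i + 1}": mean_variance_score(i) for i in range(3)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

This is only a sketch of one scoring convention; the hand calculation asked for above should follow whichever test statistic was given in the lecture.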
2- Given six four-dimensional samples, where the first two dimensions are numeric and the last two are categorical:

X1   X2   X3   X4
3    3    1    A
3    6    2    A
5    3    1    B
5    6    2    B
7    3    1    A
5    4    2    B
Apply a method for unsupervised feature selection based on the entropy measure to remove one dimension from the given data set.
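The entropy-based procedure can be sketched as follows: measure pairwise sample similarities, compute the entropy E = -Σ [S·ln S + (1-S)·ln(1-S)] over all pairs, then remove each feature in turn and see whose removal changes E the least. Several conventions in the sketch below are assumptions: numeric similarity is taken as e^(-α·D) with D a range-normalized Euclidean distance and α = -ln(0.5)/D̄, categorical similarity as the fraction of matching values, and the two kinds are simply averaged for this mixed data set.

```python
import math
from itertools import combinations

# Samples from Problem 2: X1, X2 numeric; X3, X4 treated as categorical
data = [
    (3, 3, 1, "A"),
    (3, 6, 2, "A"),
    (5, 3, 1, "B"),
    (5, 6, 2, "B"),
    (7, 3, 1, "A"),
    (5, 4, 2, "B"),
]
NUMERIC = {0, 1}  # indices of the numeric features

def entropy(features):
    """Pairwise-similarity entropy of the data restricted to `features`."""
    num = [f for f in features if f in NUMERIC]
    cat = [f for f in features if f not in NUMERIC]
    pairs = list(combinations(range(len(data)), 2))
    if num:
        # Range-normalized Euclidean distance over the numeric features
        rng = {f: max(s[f] for s in data) - min(s[f] for s in data) for f in num}
        d = {p: math.sqrt(sum(((data[p[0]][f] - data[p[1]][f]) / rng[f]) ** 2
                              for f in num))
             for p in pairs}
        alpha = -math.log(0.5) / (sum(d.values()) / len(pairs))
    sims = []
    for i, j in pairs:
        parts = []
        if num:
            parts.append(math.exp(-alpha * d[(i, j)]))
        if cat:
            parts.append(sum(data[i][f] == data[j][f] for f in cat) / len(cat))
        s = sum(parts) / len(parts)       # averaging the two kinds (assumption)
        s = min(max(s, 1e-9), 1 - 1e-9)   # clamp to keep the logs finite
        sims.append(s)
    return -sum(s * math.log(s) + (1 - s) * math.log(1 - s) for s in sims)

# Remove each feature in turn; the feature whose removal changes the
# entropy least is the best candidate to drop.
full = entropy({0, 1, 2, 3})
deltas = {f"X{f + 1}": abs(full - entropy({0, 1, 2, 3} - {f})) for f in range(4)}
least_relevant = min(deltas, key=deltas.get)
print(deltas, least_relevant)
```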
3- Apply the ChiMerge technique to reduce the number of values for the numeric attributes in Problem 1.
a) Reduce the number of numeric values for feature I1 and find the final reduced number of intervals.
b) Reduce the number of numeric values for feature I2 and find the final reduced number of intervals.
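ChiMerge starts with one interval per distinct value and repeatedly merges the adjacent pair with the smallest chi-square statistic until every remaining pair is significantly different. The stopping threshold is an assumption in this sketch: 3.841, the 0.05 significance level for one degree of freedom (two classes):

```python
# ChiMerge sketch for feature I1 from Problem 1 (two classes, 0 and 1)
values = [2.5, 7.2, 3.4, 5.6, 4.8, 8.1, 6.3]
labels = [0, 1, 1, 0, 1, 0, 1]

def chi2(a, b):
    """Chi-square statistic for two adjacent intervals' class counts."""
    total = [a[0] + b[0], a[1] + b[1]]
    n = sum(total)
    stat = 0.0
    for row in (a, b):
        for c in (0, 1):
            expected = sum(row) * total[c] / n
            if expected:
                stat += (row[c] - expected) ** 2 / expected
    return stat

# One interval per value, sorted ascending: [low, high, [count0, count1]]
points = sorted(zip(values, labels))
intervals = [[v, v, [1 - y, y]] for v, y in points]

THRESHOLD = 3.841  # chi-square at the 0.05 level, 1 degree of freedom
while len(intervals) > 1:
    stats = [chi2(intervals[i][2], intervals[i + 1][2])
             for i in range(len(intervals) - 1)]
    k = min(range(len(stats)), key=stats.__getitem__)
    if stats[k] > THRESHOLD:
        break  # all adjacent pairs differ significantly; stop merging
    lo, _, ca = intervals[k]
    _, hi, cb = intervals.pop(k + 1)
    intervals[k] = [lo, hi, [ca[0] + cb[0], ca[1] + cb[1]]]

print([(iv[0], iv[1], iv[2]) for iv in intervals])
```

The same loop applied to the I2 column answers part b); the threshold used in the lecture may differ and changes the final number of intervals.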
About Naïve Bayes
4-
About Decision Tree
5- The goal of this exercise is to become familiar with decision trees. Therefore, a simple data set about whether or not to go skiing is provided. The decision to go skiing depends on the attributes snow, weather, season, and physical condition, as shown in the table below.
snow     weather   season   physical condition   go skiing
sticky   foggy     low      rested               no
fresh    sunny     low      injured              no
fresh    sunny     low      rested               yes
fresh    sunny     high     rested               yes
fresh    sunny     mid      rested               yes
frosted  windy     high     tired                no
sticky   sunny     low      rested               yes
frosted  foggy     mid      rested               no
fresh    windy     low      rested               yes
fresh    windy     low      rested               yes
fresh    foggy     low      rested               yes
fresh    foggy     low      rested               yes
sticky   sunny     mid      rested               yes
frosted  foggy     low      injured              no
Build the decision tree based on the data above by calculating the information gain for each possible
node, as shown in the lecture. Please hand in the resulting tree including all calculation steps.
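The information-gain calculation behind each root-candidate choice can be checked programmatically. This sketch uses the standard definitions, Gain(S, A) = Entropy(S) - Σ (|Sv|/|S|)·Entropy(Sv) with base-2 logarithms; it only scores the first split, whereas the exercise asks for the full recursive tree:

```python
import math
from collections import Counter

# Ski data set from Problem 5: (snow, weather, season, condition, go skiing)
rows = [
    ("sticky",  "foggy", "low",  "rested",  "no"),
    ("fresh",   "sunny", "low",  "injured", "no"),
    ("fresh",   "sunny", "low",  "rested",  "yes"),
    ("fresh",   "sunny", "high", "rested",  "yes"),
    ("fresh",   "sunny", "mid",  "rested",  "yes"),
    ("frosted", "windy", "high", "tired",   "no"),
    ("sticky",  "sunny", "low",  "rested",  "yes"),
    ("frosted", "foggy", "mid",  "rested",  "no"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("sticky",  "sunny", "mid",  "rested",  "yes"),
    ("frosted", "foggy", "low",  "injured", "no"),
]
ATTRS = ["snow", "weather", "season", "physical condition"]

def entropy(subset):
    counts = Counter(r[-1] for r in subset)
    n = len(subset)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(subset, attr_index):
    """Information gain of splitting `subset` on the given attribute."""
    n = len(subset)
    by_value = {}
    for r in subset:
        by_value.setdefault(r[attr_index], []).append(r)
    remainder = sum(len(part) / n * entropy(part)
                    for part in by_value.values())
    return entropy(subset) - remainder

gains = {ATTRS[i]: gain(rows, i) for i in range(4)}
print(gains)  # the attribute with the largest gain becomes the root
```

Repeating the same computation on each resulting subset yields the lower levels of the tree.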
About Decision Tree
6- Given a training data set Y:

A    B   C   Class
15   1   A   C1
20   3   B   C2
25   2   A   C1
30   4   A   C1
35   2   B   C2
25   4   A   C1
15   2   B   C2
20   3   B   C2
a) Find the best threshold (for the maximal gain) for attribute A.
b) Find the best threshold (for the maximal gain) for attribute B.
c) Find a decision tree for data set Y.
d) If the testing set is

A    B   C   D
10   2   A   C2
20   1   B   C1
30   3   A   C2
40   2   B   C2
15   1   B   C1
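For parts a) and b), the usual procedure (assumed here) is to try a binary split A ≤ t at the midpoint between each pair of adjacent distinct values and keep the threshold with the largest information gain:

```python
import math
from collections import Counter

# Training set Y from Problem 6: attributes A and B with the class labels
A = [15, 20, 25, 30, 35, 25, 15, 20]
B = [1, 3, 2, 4, 2, 4, 2, 3]
cls = ["C1", "C2", "C1", "C1", "C2", "C1", "C2", "C2"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (gain, threshold) over midpoints of adjacent distinct values."""
    base, n = entropy(labels), len(labels)
    distinct = sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [c for v, c in zip(values, labels) if v <= t]
        right = [c for v, c in zip(values, labels) if v > t]
        g = (base - len(left) / n * entropy(left)
                  - len(right) / n * entropy(right))
        if g > best[0]:
            best = (g, t)
    return best

print("A:", best_threshold(A, cls))
print("B:", best_threshold(B, cls))
```

The better of the two splits (together with the categorical attribute C) is the natural starting point for the tree in part c).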