CIS 4930 Data Mining
Spring 2016, Assignment 4
Instructor: Peixiang Zhao
TA: Yongjiang Liang
Due date: Wednesday, 04/20/2016, during class
Problem 1
[15 points]
Consider the following data set for a binary class problem.
Table 1: Problem 1
A   B   Class Label
T   F   +
T   T   +
T   T   +
T   F   -
T   T   +
F   F   -
F   F   -
F   F   -
T   T   -
T   F   -
1. [5 points] Calculate the information gain when splitting on A and B, respectively. Which attribute would the decision tree induction algorithm choose?
2. [5 points] Calculate the gain in the Gini index when splitting on A and B, respectively. Which attribute would the decision tree induction algorithm choose?
3. [5 points] Note that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and both monotonically decreasing on the range [0.5, 1] (course slide 39). Is it possible for information gain and the gain in the Gini index to favor different attributes? Explain.
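
Hint: the following minimal Python sketch (not part of the required submission) can be used to check the hand calculations in parts 1 and 2. It hard-codes the rows of Table 1 and assumes binary attributes and class labels.

    from collections import Counter
    from math import log2

    # Rows of Table 1, hard-coded as (A, B, class label).
    data = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
            ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
            ("T", "T", "-"), ("T", "F", "-")]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gain(attr_idx, impurity):
        # Impurity of the parent minus the weighted impurity of the children.
        labels = [row[-1] for row in data]
        children = 0.0
        for v in set(row[attr_idx] for row in data):
            subset = [row[-1] for row in data if row[attr_idx] == v]
            children += len(subset) / len(data) * impurity(subset)
        return impurity(labels) - children

    for name, idx in [("A", 0), ("B", 1)]:
        print(name, "information gain:", round(gain(idx, entropy), 4),
              "Gini gain:", round(gain(idx, gini), 4))
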
Problem 2
[30 points]
The following table summarizes a data set with three attributes A, B, C and two class labels +, -. Build a two-level decision tree.

Table 2: Problem 2
A   B   C   # of + instances   # of - instances
T   T   T   5                  0
F   T   T   0                  20
T   F   T   20                 0
F   F   T   0                  5
T   T   F   0                  0
F   T   F   25                 0
T   F   F   0                  0
F   F   F   0                  25
1. [10 points] According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gain in classification error rate.
2. [15 points] Repeat for the two children of the root node.
3. [5 points] How many instances are misclassified by the resulting decision tree?
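
Hint: a minimal Python sketch for checking the root-level computation in part 1. It hard-codes the counts from Table 2, prints each attribute's contingency table, and reports the gain in weighted classification error rate.

    # Rows of Table 2, hard-coded as (A, B, C, number of +, number of -).
    rows = [("T", "T", "T", 5, 0), ("F", "T", "T", 0, 20),
            ("T", "F", "T", 20, 0), ("F", "F", "T", 0, 5),
            ("T", "T", "F", 0, 0), ("F", "T", "F", 25, 0),
            ("T", "F", "F", 0, 0), ("F", "F", "F", 0, 25)]

    def error_rate(n_plus, n_minus):
        # Error of predicting the majority class; 0 for an empty node.
        n = n_plus + n_minus
        return 0.0 if n == 0 else min(n_plus, n_minus) / n

    total_p = sum(r[3] for r in rows)
    total_m = sum(r[4] for r in rows)
    n_total = total_p + total_m
    parent = error_rate(total_p, total_m)

    for idx, name in [(0, "A"), (1, "B"), (2, "C")]:
        weighted = 0.0
        for v in ("T", "F"):
            p = sum(r[3] for r in rows if r[idx] == v)
            m = sum(r[4] for r in rows if r[idx] == v)
            print(f"  {name} = {v}: {p} positive, {m} negative")
            weighted += (p + m) / n_total * error_rate(p, m)
        print(name, "gain in error rate:", parent - weighted)
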
Problem 3
[15 points]
Consider the data set shown below.
Table 3: Problem 3
Instance   A   B   C   Class
1          0   0   1   -
2          1   0   1   +
3          0   1   0   -
4          1   0   0   -
5          1   0   1   +
6          0   0   1   +
7          1   1   0   -
8          0   0   0   -
9          0   1   0   +
10         1   1   1   +
1. [10 points] Estimate the conditional probabilities Pr(A = 1|+), Pr(B = 1|+), Pr(C = 1|+), Pr(A = 1|−), Pr(B = 1|−), and Pr(C = 1|−).
2. [5 points] Predict the class label for a test sample (A = 1, B = 1, C = 1) using
the naive Bayes approach.
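
Hint: a minimal Python sketch of the naive Bayes computation, with the rows of Table 3 hard-coded. It estimates each conditional probability by simple counting (no smoothing, matching part 1) and scores the test sample under both class labels.

    # Rows of Table 3, hard-coded as (A, B, C, class).
    data = [(0, 0, 1, "-"), (1, 0, 1, "+"), (0, 1, 0, "-"), (1, 0, 0, "-"),
            (1, 0, 1, "+"), (0, 0, 1, "+"), (1, 1, 0, "-"), (0, 0, 0, "-"),
            (0, 1, 0, "+"), (1, 1, 1, "+")]

    def cond_prob(attr_idx, label):
        # Pr(attribute = 1 | class = label), estimated by simple counting.
        rows = [r for r in data if r[3] == label]
        return sum(r[attr_idx] for r in rows) / len(rows)

    test = (1, 1, 1)
    scores = {}
    for label in ("+", "-"):
        prior = sum(r[3] == label for r in data) / len(data)
        score = prior
        for i, v in enumerate(test):
            p1 = cond_prob(i, label)
            score *= p1 if v == 1 else 1 - p1
        scores[label] = score
        print(f"score({label}) = {score}")
    print("predicted label:", max(scores, key=scores.get))
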
Problem 4
[20 points]
The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}. Create two clusters with the k-means method (k = 2), given each of the following sets of initial centroids. What are the final centroids of the two clusters? How many iterations (excluding the final verification iteration) are needed for k-means to stabilize? Note that if a data point v is equidistant from the two centroids, v stays in its current cluster instead of being reassigned to the other one.
1. [10 points] {6, 12};
2. [10 points] {6, 48}.
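
Hint: a minimal Python sketch of 1-D k-means with the tie rule stated above. The iteration count it returns (passes in which some point changes cluster, excluding the final verification pass) is one plausible reading of the problem's convention.

    points = [6, 12, 18, 24, 30, 42, 48]

    def kmeans_1d(points, centroids):
        # Initial assignment to the given centroids (a first-round tie would
        # go to the first centroid; no point in this data is equidistant).
        assign = [0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
                  for p in points]
        iterations = 0
        while True:
            # Update step: each centroid becomes the mean of its cluster
            # (assumes neither cluster empties, which holds for these inputs).
            centroids = [sum(p for p, a in zip(points, assign) if a == j) /
                         sum(1 for a in assign if a == j) for j in (0, 1)]
            # Assignment step, keeping the current cluster on ties.
            new_assign = []
            for p, a in zip(points, assign):
                d0, d1 = abs(p - centroids[0]), abs(p - centroids[1])
                new_assign.append(a if d0 == d1 else (0 if d0 < d1 else 1))
            if new_assign == assign:   # nothing moved: verification pass only
                return centroids, iterations
            assign, iterations = new_assign, iterations + 1

    for init in ([6, 12], [6, 48]):
        print(init, "->", kmeans_1d(points, init))
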
Problem 5
[20 points]
Use the similarity matrix shown in Table 4 to perform single-link and complete-link hierarchical clustering. Show your results by drawing dendrograms. The dendrograms should clearly show the order in which the points are merged.

Table 4: Problem 5
      P1     P2     P3     P4     P5
P1    1.00   0.10   0.41   0.55   0.35
P2    0.10   1.00   0.64   0.47   0.98
P3    0.41   0.64   1.00   0.44   0.85
P4    0.55   0.47   0.44   1.00   0.76
P5    0.35   0.98   0.85   0.76   1.00
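
Hint: a minimal Python sketch of agglomerative clustering driven directly by the similarity matrix. Because the entries are similarities rather than distances, single link merges on the maximum pairwise similarity between clusters and complete link on the minimum. The printed merge order and levels are what the dendrograms should show.

    # Table 4 as a dictionary of pairwise similarities (symmetric).
    sim = {("P1", "P2"): 0.10, ("P1", "P3"): 0.41, ("P1", "P4"): 0.55,
           ("P1", "P5"): 0.35, ("P2", "P3"): 0.64, ("P2", "P4"): 0.47,
           ("P2", "P5"): 0.98, ("P3", "P4"): 0.44, ("P3", "P5"): 0.85,
           ("P4", "P5"): 0.76}

    def s(a, b):
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def agglomerate(link):
        # link = max gives single link, link = min gives complete link,
        # because the matrix holds similarities rather than distances.
        clusters = [("P1",), ("P2",), ("P3",), ("P4",), ("P5",)]
        while len(clusters) > 1:
            # Merge the pair of clusters with the highest linkage similarity.
            i, j = max(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: link(s(a, b) for a in clusters[ij[0]]
                                           for b in clusters[ij[1]]))
            level = link(s(a, b) for a in clusters[i] for b in clusters[j])
            print("merge", clusters[i], "and", clusters[j], "at", level)
            merged = clusters[i] + clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)

    print("single link:")
    agglomerate(max)
    print("complete link:")
    agglomerate(min)
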