CIS 4930 Data Mining, Spring 2016, Assignment 4
Instructor: Peixiang Zhao
TA: Yongjiang Liang
Due date: Wednesday, 04/20/2016, during class

Problem 1 [15 points]

Consider the following data set for a binary class problem.

Table 1: Problem 1

    A   B   Class Label
    T   F   +
    T   T   +
    T   T   +
    T   F   -
    T   T   +
    F   F   -
    F   F   -
    F   F   -
    T   T   -
    T   F   -

1. [5 points] Calculate the information gain when splitting on A and on B, respectively. Which attribute would the decision tree induction algorithm choose?

2. [5 points] Calculate the gain in the Gini index when splitting on A and on B, respectively. Which attribute would the decision tree induction algorithm choose?

3. [5 points] Note that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and both monotonically decreasing on the range [0.5, 1] (course slide 39). Is it possible for information gain and the gain in the Gini index to favor different attributes? Explain.

Problem 2 [30 points]

The following table summarizes a data set with three attributes A, B, C and two class labels +, -. Build a two-level decision tree.

Table 2: Problem 2

                 # Instances
    A   B   C      +     -
    T   T   T      5     0
    F   T   T      0    20
    T   F   T     20     0
    F   F   T      0     5
    T   T   F      0     0
    F   T   F     25     0
    T   F   F      0     0
    F   F   F      0    25

1. [10 points] According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gain in classification error rate.

2. [15 points] Repeat for the two children of the root node.

3. [5 points] How many instances are misclassified by the resulting decision tree?

Problem 3 [15 points]

Consider the data set shown below.

Table 3: Problem 3

    Instance   A   B   C   Class
       1       0   0   1     -
       2       1   0   1     +
       3       0   1   0     -
       4       1   0   0     -
       5       1   0   1     +
       6       0   0   1     +
       7       1   1   0     -
       8       0   0   0     -
       9       0   1   0     +
      10       1   1   1     +

1. [10 points] Estimate the conditional probabilities Pr(A = 1 | +), Pr(B = 1 | +), Pr(C = 1 | +), Pr(A = 1 | -), Pr(B = 1 | -), and Pr(C = 1 | -).

2. [5 points] Predict the class label for a test sample (A = 1, B = 1, C = 1) using the naive Bayes approach.

Problem 4 [20 points]

The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}. Create two clusters with the k-means method (k = 2) for each of the following sets of initial centroids. What are the final centroids of the two clusters? How many iterations (excluding the final iteration used for verification) does k-means need to stabilize? Note that if a data point v is equidistant from the two centroids, v stays in its previous cluster rather than being reassigned to the other one.

1. [10 points] {6, 12};
2. [10 points] {6, 48}.

Problem 5 [20 points]

Use the similarity matrix shown in Table 4 to perform single-link and complete-link hierarchical clustering. Show your results by drawing dendrograms. The dendrograms should clearly show the order in which the points are merged.

Table 4: Problem 5

          P1     P2     P3     P4     P5
    P1   1.00   0.10   0.41   0.55   0.35
    P2   0.10   1.00   0.64   0.47   0.98
    P3   0.41   0.64   1.00   0.44   0.85
    P4   0.55   0.47   0.44   1.00   0.76
    P5   0.35   0.98   0.85   0.76   1.00
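
For Problem 1, a reference sketch of the standard impurity definitions (the usual textbook forms; the course slides remain the authoritative statement). For a node $t$ with class proportions $p(i \mid t)$:

\[
\mathrm{Entropy}(t) = -\sum_i p(i \mid t)\,\log_2 p(i \mid t), \qquad
\mathrm{Gini}(t) = 1 - \sum_i p(i \mid t)^2.
\]

The gain of splitting $t$ into children $t_1, \dots, t_k$, where child $t_j$ receives $n_j$ of the $n$ records, is

\[
\Delta = I(t) - \sum_{j=1}^{k} \frac{n_j}{n}\, I(t_j),
\]

with $I$ the chosen impurity measure (entropy for information gain, Gini for the gain in the Gini index).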
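
For Problem 2, the same gain formula applies with the classification error rate as the impurity measure (again a sketch of the standard form):

\[
\mathrm{Error}(t) = 1 - \max_i p(i \mid t), \qquad
\Delta_{\mathrm{Error}} = \mathrm{Error}(t) - \sum_{j} \frac{n_j}{n}\,\mathrm{Error}(t_j).
\]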
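
For Problem 3, the naive Bayes decision rule in its standard form, assuming A, B, and C are conditionally independent given the class:

\[
\Pr(y \mid A, B, C) \propto \Pr(y)\,\Pr(A \mid y)\,\Pr(B \mid y)\,\Pr(C \mid y), \qquad y \in \{+, -\},
\]

and the predicted label is the $y$ with the larger product.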
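
For Problem 4, a minimal Python sketch of one-dimensional k-means that follows the tie rule stated in the problem (an equidistant point stays in its previous cluster). The function name kmeans_1d and the convention of counting every pass except the final verification pass are illustrative assumptions, not part of the handout:

    def kmeans_1d(points, centroids):
        # Initial assignment: each point joins its nearest initial centroid.
        assignment = {v: min(range(len(centroids)),
                             key=lambda j: abs(v - centroids[j]))
                      for v in points}
        iterations = 0
        while True:
            # Update step: each centroid becomes the mean of its cluster.
            # (Assumes no cluster goes empty, which holds for this data.)
            clusters = [[v for v in points if assignment[v] == j]
                        for j in range(len(centroids))]
            new_centroids = [sum(c) / len(c) for c in clusters]
            # Assignment step; on a distance tie, keep the previous cluster.
            new_assignment = {}
            for v in points:
                d = [abs(v - c) for c in new_centroids]
                best = min(range(len(d)), key=lambda j: d[j])
                if d[best] == d[assignment[v]]:
                    best = assignment[v]  # tie: stay in the previous cluster
                new_assignment[v] = best
            iterations += 1
            if new_assignment == assignment and new_centroids == centroids:
                break  # nothing changed, so this pass only verified stability
            assignment, centroids = new_assignment, new_centroids
        return sorted(new_centroids), iterations - 1  # exclude verification

    print(kmeans_1d([6, 12, 18, 24, 30, 42, 48], [6, 12]))
    print(kmeans_1d([6, 12, 18, 24, 30, 42, 48], [6, 48]))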
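
For Problem 5, note that Table 4 holds similarities rather than distances, so each agglomeration step merges the pair of clusters with the highest inter-cluster similarity. The standard linkage definitions (a reference sketch):

\[
\mathrm{sim}_{\mathrm{single}}(C_a, C_b) = \max_{p \in C_a,\, q \in C_b} \mathrm{sim}(p, q), \qquad
\mathrm{sim}_{\mathrm{complete}}(C_a, C_b) = \min_{p \in C_a,\, q \in C_b} \mathrm{sim}(p, q).
\]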