Exam Selected Topics in Data and Knowledge Engineering: Data Mining 2010 This exam is open book and notes. You have 100 minutes to complete it. During the exam it is forbidden to communicate with other persons participating in the exam. It is also forbidden to use mobile phones during the exam. First and last name: __________________________________ Student signature: __________________________________ Problem 1 2 3 4 5 6 Total Grade Maximum 10 10 10 10 10 10 60 Points 1. Web mining [10 points] 1 4 2 5 3 1. Set up the equations to compute pagerank (matrix M) for the web graph shown on the figure above. Calculate the Pagerank of pages 1, 2, …, 5 assuming no tax (perform 1 iteration) [5 points] 2. Define the matrix A (HITS algorithm) for the web graph shown on the figure above. Calculate the hubbiness and authority of pages 1, 2, …, 5 using HITS algorithm (perform 1 iteration). [5 points] 2. Clustering [10 points] Given the Euclidean distance matrix for objects p1, p2, …, p5. P1 P2 P3 P4 P5 P1 0 0,9 0,59 0,45 0,65 P2 0,9 0 0,36 0,53 0,02 P3 0,59 0,36 0 0,56 0,15 P4 0,45 0,53 0,56 0 0,24 P5 0,65 0,02 0,15 0,24 0 Apply the hierarchical agglomerative clustering algorithm to cluster objects p1, p2, …, p5. Show the result of the clustering by drawing a dendrogram. The dendrogram should clearly show the order in which the objects are merged. Assume the “maximum distance” measure as the distance measure between clusters. 3. Association and multilevel association analysis [10 points] Part 1 [6 points] 1. Let A, A1, A2 and B denote items in an item hierarchy (taxonomy), and let A1 and A2 denote 2 children of A in the item hierarchy (see the figure below). A A1 A2 Circle the correct answer. Justify your answer (1 sentence) (a) sup(A B) = sup(A1 B) + sup(A2 B) T F (b) if sup(A1 B) > minsup, then sup(A B) > minsup T F (c) if sup(A B) > minsup, then sup(A1 B) > minsup T F 2. Let p and q denote a pair of items, and P and Q are their corresponding ancestors (parents) in the item hierarchy. Assume the confidence of the association rule conf(p q) >minconf. Which of the following rules are guaranteed to have confidence higher than minconf: conf(p Q) > minconf T F conf(P q) > minconf T F conf(P Q) > minconf T F Part 2 [4 points] Given a frequent itemset ABCD. Assume the confidence of the association rule conf(AB CD) < minconf. Which of the following rules are guaranteed to have confidence lower than minconf: conf(BCD A) < minconf T F conf(A BCD) < minconf T F 4. Classification [10 points] Given the following training set of examples. A B Decision attribute 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 No No Yes No Yes No Yes Yes Yes No Generate a decision tree from the given training set of examples using SPRINT algorithm (based in the gini split index). 5. Multidimensional association rules [10 points] Given a dataset shown below. RecId Temperature Pressure Alarm1 1 2 3 4 5 6 7 8 9 95 85 103 97 80 101 83 86 102 1105 1041 1090 1085 1030 1087 1026 1028 1101 No Yes Yes Yes No Yes Yes Yes Yes First two attributes are continuous (Temperature, Pressure), while the attribute Alarm1 is the symmetric binary attribute. Suppose we apply the following equiwidth discretization schema to the attributes Temperature and Pressure: Temperature: (80-89), (90 – 99), (100 – 109) Pressure: (1000 – 1049), (1050 – 1099), (1100 – 1149) 1. Construct a binarized version of the data set [2 points]. 2. Derive all frequent itemsets with support 30% [6 points]. 3. Generate one multidimensional association rule from the derived 3-itemset and compute its confidence [2 points]. 6. Clustering [10 points] (A) Given a list of restaurants in Poznan with positive and negative ratings of some of them given by students. Apply the hierarchical agglomerative clustering algorithm to cluster students. Show the result of the clustering by drawing a dendrogram, but stop to build the dendrogram when the number of clusters is k=3. The dendrogram should clearly show the order in which the objects are merged. Assume the “minimum distance” measure as the distance measure between clusters. [7 points] (B) For the cluster of students including Dave define the ranking of restaurants. Which restaurant should you recommend to Dave? [3 points] Casa Mia Alice Bob Cindy Dave Estie Fred No No Bazar Yes Yes Dorota No Piano Yes Azja Yes No No No Yes Yes Rodeo No No Farma Yes Yes Sfinks No No No Haveli Yes Yes No