Final Exam
COSC 6335 Data Mining
December 6, 2011

Your Name:
Your student id:

Problem 1 --- Clustering [17]
Problem 2 --- Association and Sequence Mining [14]
Problem 3 --- kNN and Support Vector Machines [16]
Problem 4 --- Preprocessing [11]
Problem 5 --- Spatial Data Mining [5]
Problem 6 --- Data Mining in General [7]

Grade:

The exam is “open books” and you have 95 minutes to complete the exam. The exam will count approx. 33% towards the course grade.

1) Clustering [17]

a) How do hierarchical clustering algorithms, such as AGNES, form dendrograms? [2]

Starting with a clustering in which each object forms its own cluster, the two clusters which are closest to each other with respect to a distance metric (such as the average or minimum distance between clusters) are merged repeatedly, until a cluster is obtained which contains all objects of the dataset. Clusters which are merged are connected in the dendrogram. (A sketch using scipy appears after this problem.)

b) Compute the Silhouette for the following clustering that consists of 2 clusters: {(0,0), (0,1), (3,3)}, {(4,3), (4,4)}; use Manhattan distance for the distance computations. Compute each point's silhouette; interpret the results (what do they say about the clustering of the 5 points and about the overall clustering?)! [6]

With s(o) = (b(o) - a(o)) / max(a(o), b(o)), where a(o) is the average distance of o to the other points of its own cluster and b(o) is the average distance of o to the points of the other cluster:
(0,0): a = (1+6)/2 = 3.5, b = (7+8)/2 = 7.5, s ≈ 0.53
(0,1): a = (1+5)/2 = 3.0, b = (6+7)/2 = 6.5, s ≈ 0.54
(3,3): a = (6+5)/2 = 5.5, b = (1+2)/2 = 1.5, s ≈ -0.73
(4,3): a = 1, b = (7+6+1)/3 ≈ 4.67, s ≈ 0.79
(4,4): a = 1, b = (8+7+2)/3 ≈ 5.67, s ≈ 0.82
(A verification script appears after this problem.)
Computing the silhouettes: 4 points; one error: at most 2 points; two errors: 0 points. Interpretation: should say that the silhouette for (3,3) is bad [0.5], that the silhouette for cluster 1 is not good [0.5], and that all other silhouettes are kind of okay [1].

c) Assume you apply k-means to a dataset which contains outliers; e.g., assume you apply k-means to the 2D dataset Y = {(-100,-100), (0,0), (1,1), (0,1), (1,2), (5,4), (5,5), (5,6)} with k=2. How do outliers (e.g. the point (-100,-100)) impact the k-means clustering result? Propose an approach that alleviates the strong influence of outliers on clustering results. If your approach were used to cluster dataset Y, how would the result change? [6]

The outlier leads to a clustering in which it forms a cluster by itself, lumping all remaining points into the other cluster [2]. Method, e.g., remove outliers (points which are unusually far from the rest of the dataset) before clustering, or use a representative-based algorithm such as k-medoids which is less sensitive to outliers [2.5]. After the method is applied, points 2-5 form one cluster, and the last 3 points form the other cluster [1.5]. (A sketch of the outlier-removal variant appears after this problem.)

d) The Top Ten Data Mining Algorithms article suggests using single-link hierarchical clustering to post-process K-means clustering results, by running the hierarchical clustering algorithm for a few iterations on the K-means clustering result. What advantage(s) do you see in this approach? [3]

Clusters with non-convex shapes can be obtained, because single link can merge neighboring K-means clusters into clusters of arbitrary shape!
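For part a), a minimal sketch of AGNES-style bottom-up clustering using scipy (the five points from part b) and the average-link/Manhattan combination are arbitrary illustrative choices):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # toy data: the five points from part b)
    X = np.array([[0, 0], [0, 1], [3, 3], [4, 3], [4, 4]])

    # bottom-up merging: start with singleton clusters and repeatedly merge
    # the two closest clusters (average distance, Manhattan/cityblock metric)
    Z = linkage(X, method='average', metric='cityblock')

    # every merge becomes an internal node of the dendrogram
    dendrogram(Z, labels=['(0,0)', '(0,1)', '(3,3)', '(4,3)', '(4,4)'])
    plt.show()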
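For part b), a short script that double-checks the silhouette values (sklearn's silhouette_samples with metric='manhattan' would give the same numbers):

    import numpy as np
    from scipy.spatial.distance import cdist

    pts = np.array([[0, 0], [0, 1], [3, 3], [4, 3], [4, 4]])
    labels = np.array([0, 0, 0, 1, 1])       # the given 2-clustering
    D = cdist(pts, pts, metric='cityblock')  # pairwise Manhattan distances

    for i in range(len(pts)):
        same = labels == labels[i]
        a = D[i, same].sum() / (same.sum() - 1)  # avg distance within own cluster
        b = D[i, ~same].mean()                   # avg distance to the other cluster
        print(pts[i], round((b - a) / max(a, b), 2))
    # prints approximately 0.53, 0.54, -0.73, 0.79, 0.82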
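For part c), one possible method (an assumption, since the rubric leaves the choice open): flag points whose average distance to the rest of the dataset is far above the typical value, then cluster only the remaining points; k-medoids would be an alternative.

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    Y = np.array([[-100, -100], [0, 0], [1, 1], [0, 1],
                  [1, 2], [5, 4], [5, 5], [5, 6]])

    # average distance of each point to all other points
    avg = cdist(Y, Y).sum(axis=1) / (len(Y) - 1)

    # flag points whose average distance is far above the median of these
    # averages (the factor 3 is an arbitrary illustrative threshold)
    inliers = avg <= 3 * np.median(avg)          # only (-100,-100) is flagged

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y[inliers])
    print(labels)  # points 2-5 end up in one cluster, the last 3 points in the other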
2) Association Rule and Sequence Mining [14]

a) Assume the Apriori-style sequence mining algorithm described on pages 429-435 of our textbook is used and the algorithm generated the 3-sequences listed below:

Frequent 3-sequences:
<(1) (2 3)>
<(1) (2) (4)>
<(1) (3) (4)>
<(1 2) (3)>
<(2 3 4)>
<(2 3) (4)>
<(2) (3) (4)>

What candidate 4-sequences are generated from this 3-sequence set? Which of the generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the textbook on page 435 to describe your answer! [5]

Candidate generation (s1 and s2 merge when dropping the first item of s1 and the last item of s2 yields the same 3-sequence):
<(1) (2 3)> + <(2 3 4)>     → <(1) (2 3 4)>
<(1) (2 3)> + <(2 3) (4)>   → <(1) (2 3) (4)>
<(1 2) (3)> + <(2) (3) (4)> → <(1 2) (3) (4)>
Candidates that survived pruning:
<(1) (2 3) (4)>: all of its 3-subsequences (<(2 3) (4)>, <(1) (3) (4)>, <(1) (2) (4)>, <(1) (2 3)>) are frequent. <(1) (2 3 4)> is pruned because <(1) (3 4)> is infrequent, and <(1 2) (3) (4)> is pruned because <(1 2) (4)> is infrequent. (A script that mechanically checks this appears after Problem 4.)

b) Give a sketch of an algorithm which creates association rules from the frequent itemsets (which have been computed using APRIORI)! [5]

For each frequent itemset f with at least two items and for each non-empty proper subset a of f, output the rule a → (f - a) if its confidence sup(f)/sup(a) is at least the confidence threshold. All required supports are already known, because every subset of a frequent itemset is frequent. Efficient implementations enumerate the consequents levelwise and prune: if a → (f - a) violates the confidence threshold, then every rule whose antecedent is a subset of a violates it as well. (A code sketch appears after Problem 4.)

c) Assume you run APRIORI with a given support threshold on a supermarket transaction database and there is a 40-itemset which is frequent. What can be said about how many frequent itemsets APRIORI needs to compute for this example (hint: answer this question by giving a lower bound for the number of itemsets which are frequent in this case)? Based on your answer to the previous question, do you believe it is computationally feasible to compute all frequent itemsets? [4]

As the 40-itemset is frequent, all of its non-empty subsets are frequent; consequently, there are at least 2^40 - 1 ≈ 10^12 frequent itemsets [3]; it will not be possible to compute such a large number of frequent itemsets [1].

3) kNN, SVM, & Ensembles [16]

a) The soft margin support vector machine solves the following optimization problem:

minimize  ||w||^2 / 2 + C Σi ξi
subject to  yi (w · xi + b) ≥ 1 - ξi  and  ξi ≥ 0  for all training examples (xi, yi)

What does the first term minimize? Depict all non-zero ξi in the figure below! What is the advantage of the soft margin approach over the linear SVM approach? [5]

[figure omitted]

Minimizing ||w||^2 / 2 maximizes the width of the margin between the class-1 and class-2 hyperplanes [1]. Depicting the non-zero ξi: [2; 2 errors = 0 points]. The soft margin approach can deal with classification problems in which the examples are not linearly separable [2].

b) Referring to the figure above, explain how examples are classified by SVMs! What is the relationship between ξi and example i being classified correctly? [4]

Examples which are above the straight-line hyperplane belong to the round class, and examples below the line belong to the square class [1.5]. An example is classified correctly if its slack ξi is less than half of the margin width, where the margin width is the distance between the class-1 and class-2 hyperplanes [2.5].

c) How does kNN (k-nearest-neighbor) predict the class label of a new example? [2]

Find the k nearest neighbors of the example which needs to be classified; take a majority vote based on the class labels of the k nearest neighbors found.

d) Assume you want to use a nearest neighbor classifier for a particular classification task. How would you approach the problem of choosing the parameter k of a nearest neighbor classifier? [3]

Use N-fold cross-validation to assess the testing accuracy of the kNN classifier for different values of k; choose for your final classifier the k for which the testing accuracy is highest! (A sketch appears after Problem 4.)

e) What can be said about the number of decision boundaries a kNN classifier uses? [2]

Many [2]; as many as N-1, where N is the number of training examples [1 extra point].

4) Preprocessing [11]

a) What is the goal of dimensionality reduction techniques, such as PCA, and for what are they used in practice? [4]

– Avoid the curse of dimensionality: it is much more difficult to find interesting patterns in high-dimensional spaces [1]
– Reduce the amount of time and memory required by data mining algorithms [1]
– Facilitate data visualization [1]
– Eliminate irrelevant features [0.5]
– Reduce noise [0.5]
– Reduce redundancies [0.5]
At most 4 points!

b) Assume you have to mine association rules for a very large transaction database which contains 9,000,000 transactions. How could sampling be used to speed up association rule mining? Give a sketch of an approach which speeds up association rule mining using sampling! [5]

One solution:
1. Run the association rule mining algorithm on a much smaller sample (e.g. 500,000 transactions) with slightly lowered support and confidence thresholds [-1 if the same thresholds are used], obtaining a set of association rules R.
2. For each rule in R, go through the complete transaction database, compute the rule's support and confidence, and prune the rules which violate the original confidence or support thresholds.
3. Return the surviving rules. (A sketch of the verification pass appears after this problem.)

c) What does it mean if an attribute is irrelevant for a classification problem? [2]

It does not provide any useful information for distinguishing between the different classes.
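A script that mechanically checks the answer to Problem 2a. It encodes the merge rule from the textbook (s1 and s2 merge when dropping the first item of s1 and the last item of s2 yields the same 3-sequence) and the pruning rule (every 3-subsequence of a candidate must be frequent); sequences are represented as tuples of element tuples:

    F3 = [((1,), (2, 3)), ((1,), (2,), (4,)), ((1,), (3,), (4,)),
          ((1, 2), (3,)), ((2, 3, 4),), ((2, 3), (4,)), ((2,), (3,), (4,))]

    def drop_first(s):                 # remove the first item of the first element
        head = s[0][1:]
        return ((head,) if head else ()) + s[1:]

    def drop_last(s):                  # remove the last item of the last element
        tail = s[-1][:-1]
        return s[:-1] + ((tail,) if tail else ())

    def merge(s1, s2):
        # append the last item of s2 to s1: into s1's last element if that item
        # shared an element in s2, otherwise as a new one-item element
        last = s2[-1][-1]
        return s1[:-1] + (s1[-1] + (last,),) if len(s2[-1]) > 1 else s1 + ((last,),)

    def subseqs(s):
        # all 3-subsequences obtained by deleting a single item
        out = []
        for i, elem in enumerate(s):
            for j in range(len(elem)):
                e = elem[:j] + elem[j + 1:]
                out.append(s[:i] + ((e,) if e else ()) + s[i + 1:])
        return out

    cands = [merge(a, b) for a in F3 for b in F3 if drop_first(a) == drop_last(b)]
    print(cands)  # <(1)(2 3 4)>, <(1)(2 3)(4)>, <(1 2)(3)(4)>
    print([c for c in cands if all(t in F3 for t in subseqs(c))])  # <(1)(2 3)(4)>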
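For Problem 2b, a minimal sketch of the rule-generation step. It tries every non-empty proper subset of each frequent itemset as an antecedent; the levelwise confidence-based pruning used by efficient implementations is omitted, and the freq table is a made-up example:

    from itertools import combinations

    def gen_rules(freq, min_conf):
        # freq: dict mapping frozenset itemset -> support, as computed by APRIORI
        rules = []
        for itemset, sup in freq.items():
            for r in range(1, len(itemset)):       # skips 1-itemsets automatically
                for ante in map(frozenset, combinations(itemset, r)):
                    conf = sup / freq[ante]  # every subset of a frequent itemset is frequent
                    if conf >= min_conf:
                        rules.append((set(ante), set(itemset - ante), conf))
        return rules

    freq = {frozenset({'milk'}): 0.6, frozenset({'bread'}): 0.5,
            frozenset({'milk', 'bread'}): 0.4}
    print(gen_rules(freq, min_conf=0.6))
    # milk -> bread (conf 0.67) and bread -> milk (conf 0.80)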
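For Problem 3d, a sketch of the k-selection procedure with scikit-learn; the iris data and the candidate range for k are arbitrary stand-ins:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # N-fold (here 10-fold) cross-validated accuracy for each candidate k;
    # odd k values avoid ties in the majority vote
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
              for k in range(1, 22, 2)}

    best_k = max(scores, key=scores.get)  # the k with the highest testing accuracy
    print(best_k, round(scores[best_k], 3))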
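Finally, for Problem 4b, a sketch of the verification pass (step 2 of the solution). Mining the sample itself is not shown; any Apriori implementation could produce candidate_rules, and the toy database stands in for the 9,000,000 transactions:

    import random

    def support(itemset, db):
        return sum(itemset <= t for t in db) / len(db)

    def verify(candidate_rules, db, min_sup, min_conf):
        # one scan over the full database keeps only the rules that still
        # satisfy the original support and confidence thresholds
        kept = []
        for ante, cons in candidate_rules:
            sup = support(ante | cons, db)
            if sup >= min_sup and sup / support(ante, db) >= min_conf:
                kept.append((ante, cons))
        return kept

    db = [frozenset(t) for t in [('milk', 'bread'), ('milk', 'bread', 'eggs'),
                                 ('bread', 'eggs'), ('milk', 'eggs'),
                                 ('milk', 'bread')] * 1000]  # toy stand-in
    sample = random.sample(db, 500)  # step 1 would mine this sample with
                                     # slightly lowered thresholds
    candidate_rules = [(frozenset({'milk'}), frozenset({'bread'})),
                       (frozenset({'eggs'}), frozenset({'milk', 'bread'}))]
    print(verify(candidate_rules, db, min_sup=0.3, min_conf=0.6))
    # keeps milk -> bread, prunes eggs -> milk, bread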
5) Spatial Data Mining [5]

What are the main challenges in mining spatial datasets; how does mining spatial datasets differ from mining business datasets?

– Autocorrelation / Tobler's first law [1.5]
– The attribute space is continuous [0.5]
– No clearly defined transactions [0.5]
– Complex spatial datatypes, such as polygons [1]
– Separation between spatial and non-spatial attributes [1]
– Importance of putting results on maps [0.5]
– Regional knowledge [0.5]
– A lot of objects / a lot of patterns [0.5]
At most 5 points!

6) Data Mining in General [7]

Assume you own an online book store which sells books over the internet. How can your business benefit from data mining? Limit your answer to 7-10 sentences!

No solution given!