Review 1, COSC 6335, Spring 2013

1) Similarity Assessment
Design a distance function to assess the similarity of customers of a supermarket; each customer in a supermarket is characterized by the following attributes¹:
a) Ssn
b) Items_Bought (the set of items they bought last month)
c) Age (integer; assume that the mean age is 40, the standard deviation is 10, the maximum age is 96, and the minimum age is 6)
d) Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 40.00, a standard deviation of 30, a minimum of 0.02, and a maximum of 398)
Assume that Items_Bought and Amount_spend are of major importance and Age is of minor importance when assessing the similarity of the customers. Use your designed distance function to compute the distance between the following 3 examples: c1=(111111111, {a,b,c,d,e}, 43, 123.00), c2=(222222222, {d,e,f,g}, 26, 35.50) and c3=(333333333, {a,d,g,h,i}, 30, 26.50)! [8]

Example solution, using z-scores (there are many other correct solutions): Let u and v be two customers; their distance d is measured as follows²:

d(u,v) = (1*(1 - |u.Items_Bought ∩ v.Items_Bought| / |u.Items_Bought ∪ v.Items_Bought|)
          + 0.2*|(u.Age-40)/10 - (v.Age-40)/10|
          + 1*|(u.Amount_spend-40)/30 - (v.Amount_spend-40)/30|) / 2.2

d(c1,c2) = (1*(1-2/7) + 0.2*|3/10+14/10| + |83/30+4.5/30|)/2.2 = (5/7 + 3.4/10 + 87.5/30)/2.2 ≈ 1.81; quite large, due to the "large" amount spent by one customer! Analogously, d(c1,c3) = (6/8 + 2.6/10 + 96.5/30)/2.2 ≈ 1.92 and d(c2,c3) = (5/7 + 0.8/10 + 9/30)/2.2 ≈ 0.50.

See: http://en.wikipedia.org/wiki/68-95-99.7_rule

¹ E.g., (111234232, {Coke, 2%-milk, apple}, 42, 3.39) is an example of a customer description.
² '|S|' computes the number of elements in a set S!

2) Clustering and Classification
Compare clustering with classification; what are the main differences between the two data mining tasks?

Classification
• The goal is to predict classes from object properties/attribute values; the objective is to learn a model from training examples
• Pre-defined classes
• Datasets consist of attributes and a class label
• Supervised (the class label is known)
• Classifiers are learned from sets of classified examples
• Important: classifiers need to have a high accuracy

Clustering
• The goal is to identify groups of similar objects
• A set of objects is partitioned into subsets
• Groups (clusters, new classes) are discovered
• Datasets consist of attributes only
• Unsupervised (no class labels are given; the groups have to be discovered)
• Important: similarity assessment, which derives a "distance function", is critical, because clusters are discovered based on distances/density

3) K-Means and K-Medoids/PAM [16]
a) What are the characteristics of the clusters K-Medoids/K-Means are trying to find? What can be said about the optimality of the clusters they find? Both algorithms are sensitive to initialization; explain why this is the case! [5]

They look for compact clusters [1] which minimize the MSE/SE fitness function [1]. The clusters found are suboptimal: local minima of the fitness function [1]. Both algorithms employ hill-climbing procedures which climb up the hill the initial solution belongs to; therefore, using different seeds which lie at the foot of different hills (valleys) will lead to different solutions which usually differ in quality [2].

b) K-Means is probably the most popular clustering algorithm; why do you believe this is the case?
• Fast; its runtime complexity is basically O(n) in the number of objects
• Easy to use; no complex parameter values have to be selected
• Its properties are well understood
• Can deal with high-dimensional datasets
• The properties of the clusters it finds can be "tweaked" by using different kinds of distance functions/kernel approaches
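To make the assign/recompute step concrete, here is a minimal Python sketch of a single K-Means iteration with Manhattan distance; it can be used to check the result of part c) below. The names (kmeans_step, centroid, ...) are illustrative and not part of the review:

```python
# Minimal sketch of one K-Means iteration with Manhattan distance.
# Illustrative only; function names are not part of the review.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def centroid(cluster):
    # Component-wise mean of the points in a cluster.
    n, dim = len(cluster), len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / n for i in range(dim))

def kmeans_step(clusters, points):
    # 1. Compute the centroid of each current cluster.
    centroids = [centroid(c) for c in clusters]
    # 2. Reassign every point to its closest centroid; ties go to the
    #    centroid with the smallest index.
    new_clusters = [[] for _ in centroids]
    for p in points:
        best = min(range(len(centroids)),
                   key=lambda i: manhattan(p, centroids[i]))
        new_clusters[best].append(p)
    return centroids, new_clusters

points = [(2,2), (4,4), (5,5), (6,6), (8,8), (9,9), (0,4), (4,0)]
initial = [[(2,2), (4,4), (6,6)], [(0,4), (4,0)], [(5,5), (9,9)], [(8,8)]]
centroids, new_clusters = kmeans_step(initial, points)
print(centroids)     # [(4.0, 4.0), (2.0, 2.0), (7.0, 7.0), (8.0, 8.0)]
print(new_clusters)  # (0,4) and (4,0) are equidistant to (4,4) and (2,2);
                     # this tie-breaking rule assigns them to the first one
```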
c) Assume the following dataset is given: (2,2), (4,4), (5,5), (6,6), (8,8), (9,9), (0,4), (4,0). K-Means is used with k=4 to cluster the dataset. Moreover, Manhattan distance is used as the distance function (formula below) to compute distances between centroids and objects in the dataset. K-Means's initial clusters C1, C2, C3, and C4 are as follows:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
C4: {(8,8)}
Now K-Means is run for a single iteration; what are the new clusters and what are their centroids? [5]

d((x1,x2),(x1',x2')) = |x1-x1'| + |x2-x2'|

Centroids of the initial clusters:
c1: (4, 4)
c2: (2, 2)
c3: (7, 7)
c4: (8, 8)

New clusters:
C1 = {(4, 4), (5, 5)}
C2 = {(2, 2), (0, 4), (4, 0)} (assigning (0,4) and (4,0) to cluster C1 is also correct, since they are equidistant to c1 and c2!)
C3 = {(6, 6)}
C4 = {(8, 8), (9, 9)}

d) Compute the silhouette for the following clustering that consists of 2 clusters: {(0,0), (0,1), (2,3)}, {(3,3), (3,4)}; use Manhattan distance for distance computations. Compute each point's silhouette; interpret the results (what do they say about the clustering of the 5 points, the quality of the two clusters, the overall clustering?)! [5]

For each point i, calculate:
a = average distance of i to the points in its own cluster
b = min(average distance of i to the points in another cluster)
s = (b-a)/max(a,b)

Point   a                b                 s
(0,0)   (1+5)/2 = 3      (6+7)/2 = 6.5     3.5/6.5 = 7/13 ≈ 0.538
(0,1)   (1+4)/2 = 2.5    (5+6)/2 = 5.5     3/5.5 = 6/11 ≈ 0.545
(2,3)   (5+4)/2 = 4.5    (1+2)/2 = 1.5     -3/4.5 = -2/3 ≈ -0.667
(3,3)   1/1 = 1          (6+5+1)/3 = 4     3/4 = 0.75
(3,4)   1/1 = 1          (7+6+2)/3 = 5     4/5 = 0.8

How to interpret the results? [2.5 points]
Cluster 2 is good; Cluster 1 is not that good: its average silhouette is only about 0.14. This is mainly caused by assigning the point (2,3) to the first cluster although it is closer to the second cluster; this pathological assignment is indicated by the negative silhouette value of -2/3 for this point.

4) DBSCAN
a) What is a border point in DBSCAN?
A non-core point which lies within the radius (epsilon) of a core point!

b) How does DBSCAN form clusters?
DBSCAN takes an unprocessed core point p and forms a cluster for this core point that contains all core and border points that are density-reachable from p; this process continues until all core points have been assigned to clusters. In summary, it forms clusters by recursively computing the points within the radius of a core point.

c) Assume I run DBSCAN with MinPoints=6 and epsilon=0.1 for a dataset, and I obtain 4 clusters and 5% of the objects in the dataset are classified as outliers. Now I run DBSCAN with MinPoints=8 and epsilon=0.1. How do you expect the clustering results to change?
The graph whose nodes are the objects of the dataset and whose edges connect core points to the objects within their radius will have fewer edges; as a result:
1. There will be more outliers.
2. Some clusters that exist under the first parameter setting will be deleted or split into several smaller sub-clusters. Therefore the number of clusters could increase or decrease!
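To make parts a) to c) concrete, here is a minimal Python sketch of DBSCAN. It assumes that a point's neighborhood includes the point itself and reuses Manhattan distance (any metric works); the names (dbscan, neighbors, ...) are illustrative and not part of the review:

```python
# Minimal DBSCAN sketch: core points have at least min_points neighbors
# within eps (the point itself counts as its own neighbor here); clusters
# grow by recursively absorbing the neighborhoods of core points; non-core
# points within a core point's radius become border points; everything
# else stays an outlier. Names are illustrative, not from the review.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def neighbors(points, i, eps):
    return [j for j in range(len(points))
            if manhattan(points[i], points[j]) <= eps]

def dbscan(points, eps, min_points):
    UNVISITED, OUTLIER = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_points:
            labels[i] = OUTLIER   # may be relabeled later as a border point
            continue
        labels[i] = cluster_id    # i is a core point: grow a new cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == OUTLIER:    # non-core point in a core point's
                labels[j] = cluster_id  # radius -> border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighborhood = neighbors(points, j, eps)
            if len(j_neighborhood) >= min_points:  # j is also a core point:
                seeds.extend(j_neighborhood)       # keep expanding the cluster
        cluster_id += 1
    return labels  # one cluster id per point; -1 marks outliers

labels = dbscan([(0,0), (0,1), (1,1), (5,5)], eps=1, min_points=3)
print(labels)  # [0, 0, 0, -1]: (0,0) is first marked an outlier, then
               # relabeled as a border point of cluster 0; (5,5) stays an outlier
```

With epsilon fixed, raising MinPoints can only remove core points, never create new ones, so the neighborhood graph loses edges; this is exactly why the second parameter setting in part c) produces more outliers and possibly split or deleted clusters.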