Review1 COSC 6335 Spring 2013
1) Similarity Assessment
Design a distance function to assess the similarity of customers of a supermarket; each customer is
characterized by the following attributes (e.g., (111234232, {Coke, 2%-milk, apple}, 42, 3.39) is an
example of a customer description):
a) Ssn
b) Items_Bought (the set of items they bought last month)
c) Age (integer, assume that the mean age is 40, the standard deviation is 10, the maximum age is 96 and the
minimum age is 6)
d) Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 40.00, a standard
deviation of 30, a minimum of 0.02 and a maximum of 398)
Assume that Items_Bought and Amount_spend are of major importance and Age is of minor
importance when assessing the similarity of the customers. Use your designed distance function to
compute the distance between the following 3 examples: c1=(111111111, {a,b,c,d,e}, 43, 123.00),
c2=(222222222, {d,e,f,g}, 26, 35.50) and c3=(333333333, {a,d,g,h,i}, 30, 26.50)! [8]
Example Solution (uses z-scores; there are many other correct solutions):
Let u and v be two customers; their distance d is measured as follows, where |S| denotes the number
of elements in a set S:
d(u,v) = (1*(1 - |u.Items_Bought ∩ v.Items_Bought| / |u.Items_Bought ∪ v.Items_Bought|)
+ 0.2*|(u.Age-40)/10 - (v.Age-40)/10| + 1*|(u.Amount_spend-40)/30 - (v.Amount_spend-40)/30|) / 2.2
d(c1,c2) = (1*(1-2/7) + 0.2*|3/10+14/10| + 1*|83/30+4.5/30|)/2.2
= (5/7 + 3.4/10 + 87.5/30)/2.2 ≈ 1.80, which is quite large, mainly due to the "large" amount
spent per purchase by one of the two customers!
See: http://en.wikipedia.org/wiki/68-95-99.7_rule
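A small Python sketch of this distance function (the tuple layout and the helper name d are
illustrative, not part of the review) confirms the hand computation and yields the two remaining
pairwise distances:

def d(u, v):
    # u, v: (ssn, items_bought, age, amount_spend)
    items = 1 - len(u[1] & v[1]) / len(u[1] | v[1])        # set (Jaccard) distance, weight 1
    age = 0.2 * abs((u[2] - 40) / 10 - (v[2] - 40) / 10)   # z-scored age, weight 0.2
    amount = abs((u[3] - 40) / 30 - (v[3] - 40) / 30)      # z-scored amount, weight 1
    return (items + age + amount) / 2.2                    # normalize by the sum of the weights

c1 = (111111111, {'a','b','c','d','e'}, 43, 123.00)
c2 = (222222222, {'d','e','f','g'}, 26, 35.50)
c3 = (333333333, {'a','d','g','h','i'}, 30, 26.50)
print(d(c1, c2), d(c1, c3), d(c2, c3))                     # ~1.80, ~1.92, ~0.50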
2) Clustering and Classification
Compare clustering with classification; what are the main differences between the two data mining
tasks?
Classification
• Goal is to predict classes from object properties/attribute values; a model is learned from
training examples
• Pre-defined classes
• Datasets consist of attributes and a class label
• Supervised (class label is known)
• Classifiers are learnt from sets of classified examples
• Important: classifiers need to have a high accuracy
Clustering
• Goal is to identify similar groups of objects
• A set of objects is partitioned into subsets
• Groups (clusters, new classes) are discovered
• Dataset consists of attributes
• Unsupervised (class label has to be learned)
• Important: Similarity assessment which derives a “distance function” is critical, because clusters
are discovered based on distances/density.
3) K-Means and K-Medoids/PAM [16]
a) What are the characteristics of clusters K-Medoids/K-means are trying to find? What can be said
about the optimality of the clusters they find? Both algorithms are sensitive to initialization; explain
why this is the case! [5]
Looking for: compact clusters [1] which minimize the MSE/SE fitness function [1]
Suboptimal; local minima of the fitness function [1]
Both employ hill-climbing procedures which climb up the hill the initial solution belongs to; therefore,
using different seeds which lie at the foot of different hills (valleys) will lead to obtaining
different solutions which usually differ in quality [2].
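A minimal Lloyd's-algorithm sketch illustrates this seed sensitivity: for four points at the corners
of a 4x1 rectangle (a made-up example, not from the review), one seeding converges to the global
optimum and another gets stuck in a much worse local minimum. The sketch assumes no cluster ever
becomes empty:

def kmeans(points, cents):
    # Lloyd's algorithm with squared Euclidean distance.
    while True:
        clusters = [[] for _ in cents]
        for p in points:                       # assignment step
            j = min(range(len(cents)),
                    key=lambda j: (p[0] - cents[j][0])**2 + (p[1] - cents[j][1])**2)
            clusters[j].append(p)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]              # update step
        if new == cents:                       # converged: a (local) minimum of the SSE
            sse = sum((p[0] - new[j][0])**2 + (p[1] - new[j][1])**2
                      for j, c in enumerate(clusters) for p in c)
            return clusters, sse
        cents = new

pts = [(0, 0), (0, 1), (4, 0), (4, 1)]
print(kmeans(pts, [(0, 0.5), (4, 0.5)]))   # left/right split, SSE = 1.0 (global optimum)
print(kmeans(pts, [(2, 0), (2, 1)]))       # top/bottom split, SSE = 16.0 (local minimum)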
b) K-means is probably the most popular clustering algorithm; why do you believe this is the case?
• Fast; runtime complexity is basically O(n)
• Easy to use; no complex parameter values have to be selected…
• Its properties are well understood!
• Can deal with high-dimensional datasets
• The properties of clusters can be "tweaked" by using different kinds of distance functions/kernel approaches
• …
c) Assume the following dataset is given: (2,2), (4,4), (5,5), (6,6), (8,8), (9,9), (0,4), (4,0).
K-Means is used with k=4 to cluster the dataset, and Manhattan distance is used as the distance
function (formula below) to compute distances between centroids and objects in the dataset.
Moreover, K-Means's initial clusters C1, C2, C3, and C4 are as follows:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
C4: {(8,8)}
Now K-means is run for a single iteration; what are the new clusters and what are their
centroids? [5]
d((x1,x2),(x1',x2')) = |x1-x1'| + |x2-x2'|
Centroids of the initial clusters:
c1: (4, 4)
c2: (2, 2)
c3: (7, 7)
c4: (8, 8)
New clusters after reassigning each point to its closest centroid:
C1 = {(4, 4), (5, 5)}
C2 = {(2, 2), (0, 4), (4, 0)} (assigning (0,4) and (4,0) to cluster C1 is also correct; both points are ties between C1 and C2!)
C3 = {(6, 6)}
C4 = {(8, 8), (9, 9)}
New centroids: c1: (4.5, 4.5), c2: (2, 2), c3: (6, 6), c4: (8.5, 8.5)
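The reassignment step can be verified with a short Python sketch (illustrative code; note that
Python's min breaks the (0,4)/(4,0) ties toward the first centroid, i.e., toward C1, which is the
alternative correct assignment mentioned above):

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def centroid(cluster):
    return (sum(x for x, _ in cluster) / len(cluster),
            sum(y for _, y in cluster) / len(cluster))

init = [[(2,2), (4,4), (6,6)], [(0,4), (4,0)], [(5,5), (9,9)], [(8,8)]]
cents = [centroid(c) for c in init]        # (4,4), (2,2), (7,7), (8,8)
points = [(2,2), (4,4), (5,5), (6,6), (8,8), (9,9), (0,4), (4,0)]

new = [[] for _ in cents]                  # assign every point to its closest centroid
for p in points:
    j = min(range(len(cents)), key=lambda j: manhattan(p, cents[j]))
    new[j].append(p)
print(new)
print([centroid(c) for c in new])          # centroids of the new clusters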
d) Compute the Silhouette for the following clustering that consists of 2 clusters:
{(0,0), (0,1), (2,3)}, {(3,3), (3,4)}; use Manhattan distance for distance computations. Compute each
point's silhouette; interpret the results (what do they say about the clustering of the 5 points, the
quality of the two clusters, and the overall clustering?)! [5]
Calculate a = average distance of i to the points in its own cluster
Calculate b = min (average distance of i to the points in another cluster)
s = (b-a)/max(a,b)

point    a              b                s
(0,0)    (1+5)/2=3      (6+7)/2=6.5      3.5/6.5=7/13≈0.538
(0,1)    (1+4)/2=2.5    (5+6)/2=5.5      3/5.5=6/11≈0.545
(2,3)    (5+4)/2=4.5    (1+2)/2=1.5      -3/4.5=-2/3≈-0.667
(3,3)    1/1=1          (6+5+1)/3=4      3/4=0.75
(3,4)    1/1=1          (7+6+2)/3=5      4/5=0.8
How to interpret the results? [2.5 points]
Cluster 2 is good; Cluster 1 is not that good: its average silhouette is only about 0.14. This is
mainly caused by assigning the point (2,3) to the first cluster although it is closer to the second
cluster; this pathological assignment is indicated by the negative silhouette value of -2/3 for this point.
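The table above can be reproduced with a few lines of Python (a sketch using the clusters as given):

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def avg_dist(p, cluster):
    return sum(manhattan(p, q) for q in cluster) / len(cluster)

clusters = [[(0,0), (0,1), (2,3)], [(3,3), (3,4)]]
for i, own in enumerate(clusters):
    for p in own:
        a = avg_dist(p, [q for q in own if q != p])        # cohesion: own cluster
        b = min(avg_dist(p, other)                         # separation: closest other cluster
                for j, other in enumerate(clusters) if j != i)
        print(p, round((b - a) / max(a, b), 3))            # silhouette s(p)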
4) DBSCAN
a) What is a border point in DBSCAN?
A non-core point which lies within the radius of a core point!
b) How does DBSCAN form clusters?
DBSCAN takes an unprocessed core point p and forms a cluster for this core point that contains all core and
border points that are density-reachable from p; this process continues until all core points have been assigned to
clusters. In summary, it forms clusters by recursively computing the points in the radius of a core point.
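A minimal Python sketch of this expansion process (all names are illustrative assumptions of this
sketch; dist, eps, and min_points are not from the review, and a real implementation would use a
spatial index rather than scanning all points in region_query):

def dist(p, q):
    return ((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5   # Euclidean distance

def region_query(points, p, eps):
    return [q for q in points if dist(p, q) <= eps]       # eps-neighborhood of p (includes p)

def dbscan(points, eps, min_points):
    labels, cid = {}, 0                                   # point -> cluster id; -1 marks outliers
    for p in points:
        if p in labels:
            continue
        if len(region_query(points, p, eps)) < min_points:
            labels[p] = -1                                # noise for now; may later become a border point
            continue
        cid += 1                                          # unprocessed core point: grow a new cluster
        labels[p] = cid
        frontier = region_query(points, p, eps)
        while frontier:                                   # collect all points density-reachable from p
            q = frontier.pop()
            if labels.get(q, -1) == -1:                   # unlabeled, or previously marked as noise
                labels[q] = cid
                qn = region_query(points, q, eps)
                if len(qn) >= min_points:                 # q is itself a core point: expand further
                    frontier.extend(qn)
    return labels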
c) Assume I run DBSCAN with MinPoints=6 and epsilon=0.1 for a dataset and I obtain 4 clusters and 5%
of the objects in the dataset are classified as outliers. Now I run DBSCAN with MinPoints=8 and
epsilon=0.1. How do you expect the clustering results to change?
The graph whose nodes contain the objects of the dataset and whose edges connect core points to
objects in the dataset within the radius of a core point will have fewer edges; as a result:
1. There will be more outliers
2. Some clusters that exist under the first parameter setting will be deleted or split into
several smaller sub-clusters. Therefore, the number of clusters could increase or decrease!