2014 Midterm Exam with Solution Sketches

COSC 6335 Data Mining (Dr. Eick)
Solution Sketches
Midterm Exam
November 6, 2014
Your Name:
Your Student id:
Problem 1 --- K-Means/PAM [18]
Problem 2 --- DBSCAN [9]
Problem 3 --- Similarity Assessment [11]
Problem 4 --- R-Programming [11]
Problem 5 --- Decision/Regression Trees [9]
Problem 6 --- APRIORI [7]
Problem 7 --- Exploratory Data Analysis [5]
Grade:
The exam is “open book” and you have 85 minutes to complete it. The exam is
slightly too long; you are expected to solve 90% of the problems. The exam
will count approximately 26% towards the course grade.
1) K-Means and K-Medoids/PAM[18]
a) K-means is a very efficient clustering algorithm; why is this the case? Which of K-means’ properties contribute to its efficiency? [3]
The loop—which assigns objects to clusters and then recomputes centroids until
there is no more change—only takes a few iterations and the complexity of each
iteration is only O(k*n), where k is the number of clusters and n is the number
of objects in the dataset. [2]
K-means minimizes the mean-squared objective function without ever
computing the mean-squared error (implicit objective function)[1]
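For illustration (not part of the exam solution), the following R sketch of one K-means iteration shows where the O(k*n) cost comes from; the function and variable names (one_kmeans_iteration, centroids) are illustrative assumptions.
# Sketch of a single K-means iteration. For each of the n objects we compute
# its distance to each of the k centroids, so the assignment step costs O(k*n);
# the centroid update is one additional pass over the data.
one_kmeans_iteration <- function(data, centroids) {
  # assignment step: nearest centroid for every object
  assignment <- apply(data, 1, function(obj) {
    which.min(apply(centroids, 1, function(cent) sum((obj - cent)^2)))
  })
  # update step: recompute each centroid as the mean of its assigned objects
  new_centroids <- t(sapply(1:nrow(centroids), function(i) {
    colMeans(data[assignment == i, , drop = FALSE])
  }))
  list(assignment = assignment, centroids = new_centroids)
}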
b) Assuming a Euclidean distance function is used, the shapes of clusters K-means can
discover are limited to convex polygons; what does this mean? What are the implications
of this statement? [3]
K-means cannot discover clusters whose shape cannot be represented as a convex
polygon, such as clusters that have the shapes of the letters ‘T’, ‘V’ or ‘O’. Consequently,
such clusters cannot be found directly with K-means, and the best we can do is to
approximate those clusters using unions of K-means clusters.
c) However, in the “Top 10 Data Mining Algorithms” article, the authors say: “So, K-means
will falter whenever there are non-convex shaped clusters in the data. This problem may
be alleviated by rescaling the data to ‘whiten’ it before clustering, or by using a
different distance measure that is more appropriate for the dataset”. Explain what the
authors have in mind to enable K-means to find non-convex clusters (see Footnote 1)? [3]
The authors suggest alleviating the problem by modifying the distance function. This
can be done in two ways: either by transforming the dataset using a scaling function
(or a kernel function) that is then used in conjunction with the original distance
function (e.g. Euclidean distance), which indirectly changes the distance measure, or by
directly changing the distance measure. The key idea is that convex-shaped clusters in
the modified distance space frequently correspond to non-convex clusters in the
original space, enabling K-means to find non-convex shaped clusters.
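As one concrete illustration (an assumption in the spirit of the hint below, not the authors’ specific procedure), the R sketch below maps 2-D points onto their distance from the origin; two concentric rings, which are non-convex in the original space, become two convex one-dimensional intervals in the transformed space, so kmeans() can separate them.
# Illustrative sketch: recover two ring-shaped clusters by clustering a
# transformed feature instead of the raw coordinates.
set.seed(42)
theta <- runif(200, 0, 2 * pi)
inner <- cbind(cos(theta[1:100]), sin(theta[1:100]))               # ring of radius ~1
outer <- cbind(3 * cos(theta[101:200]), 3 * sin(theta[101:200]))   # ring of radius ~3
rings <- rbind(inner, outer) + matrix(rnorm(400, sd = 0.1), ncol = 2)

# K-means on the raw coordinates cannot recover the rings (non-convex shapes),
# but the radius sqrt(x^2 + y^2) turns each ring into a convex 1-D interval.
radius <- sqrt(rings[, 1]^2 + rings[, 2]^2)
km <- kmeans(radius, centers = 2)
table(km$cluster, rep(c("inner", "outer"), each = 100))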
d) K-means has difficulties clustering datasets that contain categorical attributes. What is
this difficulty? [2]
Categorical attributes have no ordering; consequently, it is difficult to define the mean
(or even the median) of a set of values of a categorical attribute; e.g. what is the
mean of {yellow, blue, blue, red, red}?
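For illustration (not part of the exam solution), the snippet below shows that R cannot compute a meaningful mean of categorical values, and that the most one can compute is the mode, which variants such as k-modes use in place of the mean.
# Why the "mean" of categorical values is undefined:
colors <- c("yellow", "blue", "blue", "red", "red")
# mean(colors) returns NA with a warning, since the data are not numeric;
# the most one can compute is the mode, i.e. the most frequent value:
counts <- table(colors)
names(counts)[which.max(counts)]   # "blue" (tie with "red", broken by table order)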
Footnote 1: Hint: Recall the approach we used in Task 7 of Project 2 to enable K-means to find different clusters!
e) Assume we cluster a dataset with 5 objects with PAM/K-medoids for k=2 and this is
the distance matrix for the 5 objects (e.g. the entry 4 in the distance matrix indicates
that the distance between objects 1 and 3 is 4):
      1   2   3   4   5
1     0   2   4   5   6
2         0   2   3   3
3             0   5   5
4                 0   2
5                     0
PAM’s initial set of representatives is R={1,5}. What is the initial clustering PAM computes
from R and what is its fitness (squared error)? [2]
Initial clustering: {1,2,3} and {4,5}
Fitness/Sum of Squared Distances from Cluster Objects to their Representative:
2^2 + 4^2 + 2^2 = 24
What computations does PAM perform in the next iteration? [2]
Every representative is replaced in turn with a non-representative point, keeping the
rest of the representatives in place, and the SSE of the resulting clustering is
calculated. The representative set with the lowest SSE is chosen for the next iteration.
What is the next clustering PAM obtains and what is its fitness? [3]
Representative Set   SSE
{2,5}                2^2 + 2^2 + 2^2 = 12
{3,5}                4^2 + 2^2 + 2^2 = 24
{4,5}                5^2 + 3^2 + 5^2 = 59
{1,2}                2^2 + 3^2 + 3^2 = 22
{1,3}                2^2 + 5^2 + 5^2 = 54
{1,4}                2^2 + 4^2 + 2^2 = 24
Next clustering: {1,2,3} and {4,5}. Representatives are {2,5}
Fitness: 2^2 + 2^2 + 2^2 = 12
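The swap evaluation above can be reproduced mechanically; the following R sketch (variable and function names are illustrative) builds the 5x5 distance matrix from the exam and computes the sum of squared distances for each candidate representative set.
# Sketch: evaluate PAM's candidate representative swaps for the exam's matrix.
D <- matrix(0, 5, 5)
D[1, 2:5] <- c(2, 4, 5, 6)
D[2, 3:5] <- c(2, 3, 3)
D[3, 4:5] <- c(5, 5)
D[4, 5]   <- 2
D <- D + t(D)                       # make the matrix symmetric

# SSE of a clustering: each non-representative contributes the squared
# distance to its closest representative.
sse <- function(reps) sum(apply(D[-reps, reps, drop = FALSE], 1, min)^2)

candidates <- list(c(2, 5), c(3, 5), c(4, 5), c(1, 2), c(1, 3), c(1, 4))
sapply(candidates, sse)             # 12 24 59 22 54 24; {2,5} is best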
2) DBSCAN [9]
a) Assume you run DBSCAN with MinPoints=6 and epsilon=0.1 for a dataset and we
obtain 4 clusters and 20% of the objects in the dataset are classified as outliers. Now we
run DBSCAN with MinPoints=5 and epsilon=0.1. How do you expect the clustering results
to change? [4]
Fewer outliers [1], clusters may increase in size [1], some clusters may merge to form
larger clusters [1], more core points and border points [1]
b) What is a noise point/outlier for DBSCAN? [2]
A noise point is any point that is neither a core point [1] nor within the epsilon radius
of a core point [1].
c) Assume you run DBSCAN in R and you obtain the following clustering results for a
dataset containing 180 objects.
          0    1
border   90    2
seed      0   88
total    90   90
What does the displayed result mean with respect to the number of clusters, outliers, border
points and core points? [3]
There is 1 cluster with 88 core points and 2 border points. There are 90 outliers.
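For reference (not part of the exam solution), such a table can be derived directly from the result object of fpc’s dbscan; the sketch below assumes the $cluster and $isseed fields that the fpc package returns and cross-tabulates core (“seed”) versus border points per cluster, with cluster 0 collecting the noise points.
# Sketch: reproduce a border/seed-by-cluster table from an fpc::dbscan result.
library(fpc)
dbs <- dbscan(iris[, 1:4], eps = 0.4, MinPts = 5)
table(ifelse(dbs$isseed, "seed", "border"), dbs$cluster)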
3) Similarity Assessment [11]
Design a distance function to assess the similarity of customers of a supermarket; each
customer in a supermarket is characterized by the following attributes (see Footnote 2):
a) Ssn
b) Items_Bought (The set of items they bought last month)
c) Amount_spend (Average amount spent per purchase in dollars and cents; it has a
mean of 50.00, a standard deviation of 40, a minimum of 0.05 and a maximum
of 600)
d) Age (is an ordinal attribute taking 5 values: child, young, medium, old, very_old)
Assume that Items_Bought and Amount_Spend are of major importance and Age is of a
minor importance when assessing the similarity of the customers. [8]
MANY POSSIBLE ANSWERS!!!
One possible answer: Ignore SSN as it is not important.
Find distance between items bought using Jaccard: 1 – Jaccard
Index(u.Items_Bought,v.Items_Bought)
Normalize Amount_spend using its z-score and compute the distance with the L1 norm.
Map the 5 Age values to numbers: child = 0, young = 1, medium = 2, old = 3,
very_old = 4; compute the distance with the L1 norm and divide by the range, i.e. 4.
Assign weights 0.4 to Items_Bought, 0.4 to Amount_spend and 0.2 to Age.
So the distance between 2 customers u and v is:
d(u,v) = 0.4*(1 – Jaccard(u.Items_Bought, v.Items_Bought)) + 0.4*|(u.Amount – 50)/40 –
(v.Amount – 50)/40| + 0.2*|u.Age – v.Age|/4
with Jaccard(A,B) = |A∩B|/|A∪B|
Compute the distance of the following 2 customers
c1=(111111111, {A,B,C,D}, 40.00, ‘old’) and
c2=(222222222, {D,E,F}, 100.00, ‘young’)
with your distance function [3]:
d(c1,c2) = 0.4*(1 – 1/6) + 0.4*|(40-50)/40 – (100-50)/40| + 0.2*|3-1|/4
         = 0.33 + 0.6 + 0.1
         = 1.03
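A minimal R sketch of this particular distance function (one of many acceptable designs; the helper names jaccard, age_code and customer_dist are illustrative) applied to the two customers:
# Sketch of the distance function proposed above.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
age_code <- c(child = 0, young = 1, medium = 2, old = 3, very_old = 4)

customer_dist <- function(u_items, v_items, u_amount, v_amount, u_age, v_age) {
  0.4 * (1 - jaccard(u_items, v_items)) +
  0.4 * abs((u_amount - 50) / 40 - (v_amount - 50) / 40) +
  0.2 * abs(age_code[u_age] - age_code[v_age]) / 4
}

customer_dist(c("A", "B", "C", "D"), c("D", "E", "F"),
              40, 100, "old", "young")   # approx. 1.03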
Footnote 2: E.g. (111234232, {Coke, 2%-milk, apple}, 42.42, ‘medium’) is an example of a customer description.
4) R-Programming [11]
Write a generic function highest_percentage(x,y,n,class) which returns the number of
the cluster with the highest percentage of samples from class in clustering x, where n is
the highest cluster number in the clustering x. The actual classes of the samples are in y.
For reference, you can use the iris dataset.
    sepal length   sepal width   petal length   petal width   class
1        5.1           3.5           1.4            0.2       Setosa
2        4.4           2.9           1.4            0.2       Setosa
3        7.0           3.2           4.7            1.4       Versicolor
…        …             …             …              …         …
For example, if the function is called for a clustering result for the Iris Flower dataset
such that the percentage of Setosa in cluster 1 is 20%, in cluster 2 is 25% and in
cluster 3 is 22%, the value 2 will be assigned to the variable Z in the code below:
Y<-kmeans(iris[1:4],3)
Z<-highest_percentage(Y$cluster,iris[,5],3,’Setosa’)
Your function should work for any dataset and any clustering result and should not
consider cluster 0 in case the clustering result supports outliers.
highest_percentage <- function(x, y, n, class) {
  # if cluster 0 (outliers) is present, skip it when searching for the maximum
  outliers <- 0 %in% x
  if (outliers) {
    start <- 2
    end <- n + 1
  } else {
    start <- 1
    end <- n
  }
  t <- table(x, y)                    # cluster-by-class contingency table
  tot <- apply(t, 1, sum)             # number of samples per cluster
  for (i in start:end) {
    t[i, ] <- t[i, ] * 100 / tot[i]   # convert counts to percentages per cluster
  }
  return(which.max(t[start:end, class]))   # cluster number with the highest percentage
}
You can test this using the following:
km <- kmeans(iris[,1:4],3)
require(fpc)
dbs <- dbscan(iris[,1:4],eps=0.4,MinPts=5)
highest_percentage(km$cluster,iris[,5],3,'versicolor')
highest_percentage(dbs$cluster,iris[,5],4,'setosa')
If you are not aware of which.max() you can write a few lines of code to find the
index of maximum percentage.
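For completeness, a manual replacement along the lines the note suggests (my_which_max is an illustrative name, not required by the exam) could look like this:
# Sketch of a hand-written which.max(): scan the percentage vector and
# remember the position of the largest value seen so far.
my_which_max <- function(v) {
  best <- 1
  for (i in seq_along(v)) {
    if (v[i] > v[best]) best <- i
  }
  best
}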
5) Decision Trees/Regression Trees [9]
a) What are the characteristics of overfitting when learning decision trees? Assume you
observe overfitting, what could be done to learn a “better” decision tree? [5]
Training error is low [1] but testing error is high/not optimal [1]. The model is too
complex [1] and has probably learnt noise [1].
We can use pruning [1] – post-pruning [0.5] or pre-pruning [0.5]
b) What interestingness function does the regression tree induction algorithm minimize
and why do you believe this interestingness function was chosen? [4]
Variance or SSE [2]
If variance/SSE is low, prediction error is low and we get a regression tree which
better fits the data. [2]
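As a reference point (not part of the exam solution), R’s rpart package grows regression trees with method = "anova", whose splitting criterion is the reduction in the sum of squared errors; a minimal sketch on a built-in dataset:
# Sketch: a regression tree on a built-in dataset; the "anova" method chooses
# splits that maximize the reduction in SSE.
library(rpart)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
printcp(fit)   # complexity table; the error columns are SSE-based relative errors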
6) APRIORI [7]
a) What is the APRIORI property? Where is it used in the APRIORI algorithm? [3]
Let i be an interestingness measure (e.g. support) and X and Y be two itemsets; then
X ⊆ Y ⇒ i(X) ≥ i(Y) [1]. This property is used when creating candidate (k+1)-itemsets
from frequent k-itemsets [1] and for subset pruning of the candidate (k+1)-itemsets [1].
b) Assume the APRIORI algorithm identified the following five 4-itemsets that satisfy a
user-given support threshold: abcd, abce, abcf, bcde, and bcdf; what initial
candidate 5-itemsets are created by the APRIORI algorithm, and which of those survive
subset pruning? [4]
abcde, abcdf, abcef, bcdef
abcde pruned as acde is not frequent
abcdf pruned as acdf is not frequent
abcef pruned as acef is not frequent
bcdef pruned as bdef is not frequent
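A small R sketch (helper names frequent4, is_frequent and survives are illustrative) that reproduces the join and prune steps for these five frequent 4-itemsets:
# Sketch: APRIORI candidate generation (join) and subset pruning.
frequent4 <- list(c("a","b","c","d"), c("a","b","c","e"), c("a","b","c","f"),
                  c("b","c","d","e"), c("b","c","d","f"))

# Join step: merge two 4-itemsets that agree on their first 3 items.
candidates <- list()
for (i in seq_along(frequent4)) {
  for (j in seq_along(frequent4)) {
    if (i < j && identical(frequent4[[i]][1:3], frequent4[[j]][1:3])) {
      candidates <- c(candidates, list(sort(union(frequent4[[i]], frequent4[[j]]))))
    }
  }
}

# Prune step: a candidate survives only if every 4-item subset is frequent.
is_frequent <- function(s) any(sapply(frequent4, function(f) setequal(f, s)))
survives <- sapply(candidates, function(cand)
  all(sapply(seq_along(cand), function(k) is_frequent(cand[-k]))))

sapply(candidates, paste, collapse = "")   # "abcde" "abcdf" "abcef" "bcdef"
survives                                   # all FALSE: every candidate is pruned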
7. Exploratory Data Analysis [5]
a. What is the purpose of exploratory data analysis? [3]
Create background knowledge about the task at hand [1], assess difficulty [1],
provide knowledge to help select appropriate tools for the task [1], assess the quality
of the data [1], validate the data [1], help form hypotheses [1], find issues, patterns
and errors in the data [1]
At most 3 points + 1 bonus
b. What does the box in a box plot describe? [2]
The box depicts the 25th-percentile-to-75th-percentile range of the distribution of the
attribute; the horizontal line inside the box is at the median, i.e. the 50th percentile.
The size of the box measures the ‘spread’ of the data. Using the word ‘variance’ is not
appropriate here!
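A quick R illustration (not part of the exam solution) relating the box to the quartiles:
# The box drawn by boxplot() spans the first to the third quartile,
# with the line inside at the median.
quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75))
boxplot(iris$Sepal.Length, main = "Sepal.Length")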