Review1 COSC 6335 Spring 2014

1) Similarity Assessment
Design a distance function to assess the similarity of objects in the following variation of the Forest Fire
dataset, limiting your attention to the following attributes:
1. month - month of the year: 'jan' to 'dec'
2. FFMC - FFMC index from the FWI system: 18.7 to 96.20 with mean value 44 and standard deviation 20
3. temp - temperature in Celsius degrees: 2.2 to 33.30 with mean value 10 and standard deviation 10
4. RH – relative humidity; ordinal attribute with values: very low, low, medium, high and very high
Assume that temp and FFMC are of major importance, and the other two attributes are of minor importance.
Compute the distances between the following 3 examples: (March, 18, 30, low), (June, 20, 28, very high),
and (December, 80, 10, medium) using your designed distance function.
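One possible design (a sketch, not the unique correct answer): a cyclic distance for month, z-score-style differences for FFMC and temp based on the given standard deviations, a normalized rank distance for RH, and weight 2 for the major attributes versus 1 for the minor ones. All of these per-attribute choices and weights are assumptions of this sketch:

# A possible distance function for the Forest Fire variation (Python sketch;
# the per-attribute distances and the 2:1 weighting are design choices).
MONTHS = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
RH_RANKS = {'very low': 0, 'low': 1, 'medium': 2, 'high': 3, 'very high': 4}

def d_month(m1, m2):
    # cyclic distance on the 12-month circle, normalized to [0,1]
    diff = abs(MONTHS.index(m1) - MONTHS.index(m2))
    return min(diff, 12 - diff) / 6

def d_interval(v1, v2, sd):
    # difference of z-scored interval values: |v1 - v2| / sd
    return abs(v1 - v2) / sd

def d_rh(r1, r2):
    # ordinal distance: rank difference normalized to [0,1]
    return abs(RH_RANKS[r1] - RH_RANKS[r2]) / 4

def dist(o1, o2):
    # objects are (month, FFMC, temp, RH); major attributes get weight 2
    (m1, f1, t1, r1), (m2, f2, t2, r2) = o1, o2
    return (1 * d_month(m1, m2) + 2 * d_interval(f1, f2, 20)
            + 2 * d_interval(t1, t2, 10) + 1 * d_rh(r1, r2))

a = ('mar', 18, 30, 'low')        # (March, 18, 30, low)
b = ('jun', 20, 28, 'very high')  # (June, 20, 28, very high)
c = ('dec', 80, 10, 'medium')     # (December, 80, 10, medium)
for x, y in [(a, b), (a, c), (b, c)]:
    print(x[0], y[0], round(dist(x, y), 2))

Under these particular choices the pairwise distances come out to d(a,b) = 1.85, d(a,c) = 10.95, and d(b,c) = 11.1; that is, the March and June examples are by far the most similar pair.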
2) Clustering and Classification
Compare clustering with classification; what are the main differences between the two data mining
tasks?
Classification
• Goal is to predict classes from the object properties/attribute values; the objective is to learn a
model from training examples
• Pre-defined classes
• Datasets consist of attributes and a class label
• Supervised (class label is known)
• Classifiers are learnt from sets of classified examples
• Important: classifiers need to have a high accuracy
Clustering
• Goal is to identify similar groups of objects
• A set of objects is partitioned into subsets
• Groups (clusters, new classes) are discovered
• Dataset consists of attributes
• Unsupervised (class labels are not given and have to be discovered)
• Important: Similarity assessment which derives a “distance function” is critical, because clusters
are discovered based on distances/density.
3) K-Means and K-Medoids/PAM[16]
a) What are the characteristics of clusters K-Medoids/K-means are trying to find? What can be said
about the optimality of the clusters they find? Both algorithms are sensitive to initialization; explain
why this is the case! [5]
Looking for: compact clusters[1] which minimize the MSE/SE fitness function[1]
Suboptimal, local minima of the fitness function [1]
Both employ hill-climbing procedures which climb up the hill the initial solution belongs to; therefore,
using different seeds which lie at the foot of different hills (valleys) will lead to different solutions,
which usually differ in quality [2].
b) K-means is probably the most popular clustering algorithm; why do you believe this is the case?
• Fast; runtime complexity is basically O(n)
• Easy to use; no complex parameter values have to be selected…
• Its properties are well understood!
• Can deal with high-dimensional datasets
• The properties of clusters can be “tweaked” by using different kinds of distance functions/kernel approaches
• …
c) Assume the following dataset is given: (2,2), (4,4), (5,5), (6,6), (8,8), (9,9), (0,4), (4,0). K-Means
is used with k=4 to cluster the dataset, and Manhattan distance is used as the distance function
(formula below) to compute distances between centroids and objects in the dataset. K-Means’s initial
clusters C1, C2, C3, and C4 are as follows:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
C4: {(8,8)}
Now K-means is run for a single iteration; what are the new clusters and what are their
centroids? [5]
d((x1,x2),(x1’,x2’)) = |x1-x1’| + |x2-x2’|
Centroids:
c1: (4, 4)
c2: (2, 2)
c3: (7, 7)
c4: (8, 8)
Clusters:
C1 = {(4, 4), (5, 5)}
C2 = {(2, 2), (0, 4), (4, 0)} assigning (0,4) and (4,0) to cluster C1 is also correct!
C3 = {(6, 6)}
C4 = {(8, 8), (9, 9)}
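A small sketch (plain Python; helper names made up) that reproduces this iteration: compute each initial cluster's centroid, then reassign every point to the nearest centroid under Manhattan distance. Ties are broken toward the lowest cluster index here, which yields the alternative assignment of (0,4) and (4,0) noted above:

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

initial = [[(2,2),(4,4),(6,6)], [(0,4),(4,0)], [(5,5),(9,9)], [(8,8)]]
centroids = [centroid(cl) for cl in initial]   # (4,4), (2,2), (7,7), (8,8)

points = [(2,2),(4,4),(5,5),(6,6),(8,8),(9,9),(0,4),(4,0)]
new_clusters = [[] for _ in centroids]
for p in points:
    # min picks the first index on ties, e.g. for (0,4) and (4,0)
    i = min(range(4), key=lambda j: manhattan(p, centroids[j]))
    new_clusters[i].append(p)
print(new_clusters)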
d) Compute the Silhouette for the following clustering that consists of 2 clusters:
{(0,0), (0,1), (2,3)}, {(3,3), (3,4)}; use Manhattan distance for distance computations. Compute each
point’s silhouette; interpret the results (what do they say about the clustering of the 5 points, the
quality of the two clusters, and the overall clustering?)! [5]
Calculate a = average distance of i to the points in its own cluster
Calculate b = min over all other clusters of (average distance of i to the points in another cluster)
s = (b-a)/max(a,b)

point   a               b               s
(0,0)   (1+5)/2 = 3     (6+7)/2 = 6.5   3.5/6.5 = 7/13 = 0.538
(0,1)   (1+4)/2 = 2.5   (5+6)/2 = 5.5   3/5.5 = 6/11 = 0.545
(2,3)   (5+4)/2 = 4.5   (1+2)/2 = 1.5   -3/4.5 = -2/3 = -0.667
(3,3)   1/1 = 1         (6+5+1)/3 = 4   3/4 = 0.75
(3,4)   1/1 = 1         (7+6+2)/3 = 5   4/5 = 0.8
How to interpret the results? [2.5 points]
Cluster 2 is good; Cluster 1 is not that good: its average silhouette is only about 0.14. This is mainly
caused by assigning the point (2,3) to the first cluster, although it is closer to the second cluster; this
pathological assignment is indicated by the negative silhouette value of -2/3 for this point.
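A short sketch that reproduces the table above (Manhattan distance; since there are only two clusters, b is simply the average distance to the points of the other cluster):

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

clusters = [[(0,0),(0,1),(2,3)], [(3,3),(3,4)]]

def silhouette(p, own, other):
    # a: average distance to the rest of p's own cluster
    a = sum(manhattan(p, q) for q in own if q != p) / (len(own) - 1)
    # b: average distance to the (single) other cluster
    b = sum(manhattan(p, q) for q in other) / len(other)
    return (b - a) / max(a, b)

for i, cl in enumerate(clusters):
    for p in cl:
        print(p, round(silhouette(p, cl, clusters[1 - i]), 3))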
4) DBSCAN
a) What is a border point in DBSCAN?
A non-core point which lies within the epsilon-radius of a core point!
b) How does DBSCAN form clusters?
DBSCAN takes an unprocessed core point p and forms a cluster for this core point that contains all core
and border points that are density-reachable from p; this process continues until all core points have
been assigned to clusters. In summary, DBSCAN forms clusters by recursively computing the points within
the radius of a core point.
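A minimal sketch of this process (assumptions: Euclidean distance, a point's neighborhood includes the point itself, and the function/variable names are made up for illustration):

def dbscan(points, eps, min_points):
    # points: list of coordinate tuples (must be hashable)
    def neighbors(p):
        return [q for q in points
                if sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= eps]

    labels = {}   # point -> cluster id; -1 marks a (tentative) outlier
    cid = 0
    for p in points:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_points:   # not a core point; may become a border point later
            labels[p] = -1
            continue
        cid += 1                     # p is an unprocessed core point: start a new cluster
        labels[p] = cid
        frontier = list(nbrs)
        while frontier:              # collect everything density-reachable from p
            q = frontier.pop()
            if labels.get(q, -1) == -1:        # unvisited, or previously marked outlier
                labels[q] = cid                # q joins the cluster (core or border point)
                q_nbrs = neighbors(q)
                if len(q_nbrs) >= min_points:  # only core points spread the cluster further
                    frontier.extend(q_nbrs)
    return labels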
c) Assume I run DBSCAN with MinPoints=6 and epsilon=0.1 for a dataset, obtaining 4 clusters with 5%
of the objects in the dataset classified as outliers. Now I run DBSCAN with MinPoints=8 and
epsilon=0.1. How do you expect the clustering results to change?
The graph whose nodes contain the objects of the dataset and whose edges connect core points to
objects in the dataset within the radius of a core point will have fewer edges; as a result:
1. There will be more outliers.
2. Some clusters that exist under the first parameter setting will be deleted or split into
several smaller sub-clusters. Therefore, the number of clusters could increase or decrease!
5) Decision and Regression Trees
a) Compute the values of H(1/4,1/4,1/2,0) and H(1/4,1/4,1/4,1/4), where H is the entropy
function!
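As a check (assuming base-2 logarithms and the convention 0·log2(0) = 0):
H(1/4,1/4,1/2,0) = -2·(1/4)·log2(1/4) - (1/2)·log2(1/2) = 1 + 0.5 = 1.5 bits
H(1/4,1/4,1/4,1/4) = -4·(1/4)·log2(1/4) = 2 bits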
b) Compute the Entropy-gain1 for the following decision tree split[4] (just giving the formula is
fine!):
(5,3,3)
(4,0,3)
(1,3,0)
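With 11 examples at the root (5,3,3) and 7 and 4 examples in the children (4,0,3) and (1,3,0), the formula is:
Entropy-gain = H(5/11, 3/11, 3/11) - [ (7/11)·H(4/7, 0, 3/7) + (4/11)·H(1/4, 3/4, 0) ]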
c) Assume you learn a decision tree for a dataset that solely contains numerical attributes. What can be
said about the shape of the decision boundaries that the learnt decision tree model employs? [2]
d) Are decision trees capable of modeling the ‘either-or’ operator; for example,
IF EITHER A OR B THEN class1 ELSE class2?
Give a reason for your answer!
e) Compute the variance gain for a regression tree whose numerical observations for the
output variable contain the following values:
root= {2,2,4,4,4,6,6}
left-node={2,2,4,4}
right-node={4,6,6}
Approach: take the variance of the root and subtract from it the sum of the variances of the left node
and the right node, each weighted by the percentage of the examples it contains.
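Worked out under this approach (using the population variance, i.e., dividing by n):
Var(root): mean = 4, Var = (2·(2-4)^2 + 3·(4-4)^2 + 2·(6-4)^2)/7 = 16/7 ≈ 2.29
Var(left): mean = 3, Var = (2·(2-3)^2 + 2·(4-3)^2)/4 = 1
Var(right): mean = 16/3, Var = ((4-16/3)^2 + 2·(6-16/3)^2)/3 = 8/9 ≈ 0.89
Variance gain = 16/7 - (4/7·1 + 3/7·8/9) = 16/7 - 20/21 = 4/3 ≈ 1.33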
1 (Entropy before the split) minus (entropy after the split)