KSE525 - Data Mining Lab

advertisement
Data Mining and Knowledge Discovery (KSE525)
Assignment #4 (May 21, 2013, due: June 4)
1. [10 points] The effectiveness of the SVM depends on the selection of kernels.
trick?
(b) Consider the quadratic kernel K(u, v) = (u • v + 1)2.
= Φ(u) • Φ(v) for some Φ.
(a) What is the kernel
Show that this is a kernel, i.e., K(u, v)
[Hint: I did the proof for K(u, v) = (u • v)2 in class.]
2. [5 points] What is boosting?
Why does boosting improve classification accuracy?
3. [10 points] Discuss the advantages and disadvantages of the four clustering methods: k-means, EM,
BIRCH, and DBSCAN.
You had better fill out the table below.
Advantages
Disadvantages
k-means
EM
BIRCH
DBSCAN
4. [10 points] Suppose that the data mining task is to cluster points (with (x, y) representing a location)
into three clusters, where the points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is the Euclidean distance.
center of each cluster, respectively.
after the first round of execution.
Suppose initially we assign A1, B1, and C1 as the
Use the k-means algorithms.
(a) Show the three cluster centers
(b) Show the final three clusters.
5. [15 points] LIBSVM is one of the most popular tools for the SVM.
Download the Wine data set available at the URL below.
Let’s practice to use LIBSVM.
Then, arbitrarily divide wine.scale into the
training set and the test set of approximately the same size.
Run svm-train to build a classification
model using the training set and run svm-predict to test the accuracy of the model using the test set.
Identify the misclassified objects in the test set and report the accuracy of the model.
You need to
mention which kernel is used (the default is the Gaussian kernel).

LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Wine data set: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#wine
Download