CSCI 4390/6390 Database Mining Assignment 4 Instructor: Wei Liu Due on Dec. 10th

Problem 1. Hierarchical K-Means Clustering (30 points)
We want to run a Hierarchical K-Means Clustering algorithm to cluster a data set $\{x_i\}_{i=1}^{n}$ into 1,048,576
clusters. Please answer:
1) If K = 2 for all K-Means clustering actions, what is the height (or depth) of the clustering tree constructed
by the algorithm?
2) Suppose K = 2. To cluster a new data point x, how many distances between x and the obtained clustering
centers should be computed?
3) Please design the data structure for the clustering tree. What should be saved in the internal nodes
(including the root, but not the leaves)? What should be saved in the leaf nodes?
(Hint: The total number of clustering centers obtained by the Hierarchical K-Means Clustering algorithm is
larger than 1,048,576. Many parent clusters or intermediate clusters exist. )
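To make the setting concrete, here is a minimal toy sketch of a binary (K = 2) clustering tree built to depth 3 instead of the full problem size. The names `Node`, `two_means`, `build`, and `assign` are hypothetical, the 2-means routine is plain Lloyd iteration with random initialization, and the node layout is only one possible design, not a reference answer.

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import List, Optional

Point = List[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points: List[Point], rng: random.Random, iters: int = 10):
    """Lloyd iterations with K = 2; returns (centers, groups)."""
    centers = rng.sample(points, 2)
    groups = [points, []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split (e.g. duplicate points); caller makes a leaf
        centers = [mean(g) for g in groups]
    return centers, groups


@dataclass
class Node:
    # Internal node (assumed layout): the centers of its child clusters
    # plus one child subtree per center.
    centers: List[Point] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    # Leaf node (assumed layout): only the id of the final cluster.
    cluster_id: Optional[int] = None


def build(points: List[Point], depth: int, ids, rng: random.Random) -> Node:
    """Recursively split `points` with 2-means until `depth` levels are built."""
    if depth == 0 or len(points) < 2:
        return Node(cluster_id=next(ids))
    centers, groups = two_means(points, rng)
    if not groups[0] or not groups[1]:
        return Node(cluster_id=next(ids))
    children = [build(g, depth - 1, ids, rng) for g in groups]
    return Node(centers=centers, children=children)


def assign(root: Node, x: Point):
    """Descend from the root; return (cluster_id, number of distances computed)."""
    node, n_dist = root, 0
    while node.children:
        dists = [sq_dist(x, c) for c in node.centers]
        n_dist += len(dists)
        node = node.children[dists.index(min(dists))]
    return node.cluster_id, n_dist


# Toy usage: 64 points on a line, tree of depth 3 (at most 2**3 leaf clusters).
data = [[float(i)] for i in range(64)]
ids = itertools.count()
root = build(data, depth=3, ids=ids, rng=random.Random(0))
n_leaves = next(ids)  # the counter's next value equals the number of leaves created
cluster, n_dist = assign(root, [17.0])
```

In this sketch a new point is compared only against the centers stored along one root-to-leaf path, so the work per level is K distance computations rather than one per leaf cluster.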
Problem 2. Stochastic Gradient Descent (20 points)
We want to train a linear learning model for classification. The model takes the form $f(x) = w^\top x$, where
$x \in \mathbb{R}^d$ is an input data sample, and $w \in \mathbb{R}^d$ is the weight vector. The classification framework is to minimize the
following objective function
$$Q(w) = \sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i), y_i\big), \tag{1}$$
in which $x_i$ is the i-th training sample, and its label is $y_i \in \{1, -1\}$.
1) If the loss function is $\mathrm{Loss}(a, b) = (a - b)^2$, derive the update formula of the Stochastic Gradient Descent
method that is used to optimize Eq. (1).
2) If the loss function is $\mathrm{Loss}(a, b) = \ln\big(1 + \exp(-ab)\big)$, derive the update formula of the Stochastic Gradient
Descent method that is used to optimize Eq. (1).
(Hint: The update formula should be expressed as $w_{k+1} \leftarrow g(w_k)$, in which $g(w_k)$ is some function of
$w_k$.)
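A derived update formula can be sanity-checked numerically. The sketch below is one way to do that, under stated assumptions: it runs a generic SGD loop on Eq. (1) for the linear model, but replaces the analytic per-sample gradient with a central-difference approximation. The names `numeric_grad` and `sgd`, the fixed learning rate `lr`, the zero initialization, and the four-point toy data set are all hypothetical choices made for illustration.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def dot(w: Vector, x: Vector) -> float:
    """Linear model f(x) = w^T x."""
    return sum(a * b for a, b in zip(w, x))


def squared_loss(a: float, b: float) -> float:
    return (a - b) ** 2


def logistic_loss(a: float, b: float) -> float:
    return math.log(1.0 + math.exp(-a * b))


def numeric_grad(loss: Callable[[float, float], float],
                 w: Vector, x: Vector, y: float, eps: float = 1e-6) -> Vector:
    """Central-difference estimate of d/dw Loss(w^T x, y) for one sample."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = loss(dot(w_plus, x), y) - loss(dot(w_minus, x), y)
        grad.append(diff / (2.0 * eps))
    return grad


def sgd(loss, X: List[Vector], Y: List[float],
        lr: float = 0.05, epochs: int = 50, seed: int = 0) -> Vector:
    """SGD on Q(w): one randomly ordered sample per step, fixed step size."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            g = numeric_grad(loss, w, X[i], Y[i])
            # Generic step: w_{k+1} <- w_k - lr * (gradient of the i-th loss term)
            w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


def objective(loss, w: Vector, X: List[Vector], Y: List[float]) -> float:
    """Q(w) from Eq. (1)."""
    return sum(loss(dot(w, x), y) for x, y in zip(X, Y))


# Toy, linearly separable data (illustrative only).
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
Y = [1.0, 1.0, -1.0, -1.0]

w_sq = sgd(squared_loss, X, Y)
w_log = sgd(logistic_loss, X, Y)
```

Swapping `numeric_grad` for a hand-derived gradient and comparing the resulting weight trajectories is a quick check that the derivation is consistent with the loss definition.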
Problem 3. Support Vector Machines (30 points)
Solve Q1. (a)(b)(c) of 21.7 EXERCISES in the textbook (page 546).