CSCI 4390/6390 Database Mining
Assignment 4
Instructor: Wei Liu
Due on Dec. 10th

Problem 1. Hierarchical K-Means Clustering (30 points)

We want to run a Hierarchical K-Means Clustering algorithm to cluster a data set $\{x_i\}_{i=1}^{n}$ into 1,048,576 clusters. Please answer:

1) If K = 2 for all K-Means clustering actions, what is the height (or depth) of the clustering tree constructed by the algorithm?

2) Suppose K = 2. To cluster a new data point $x$, how many distances between $x$ and the obtained clustering centers should be computed?

3) Please design the data structure for the clustering tree. What should be saved in the non-leaf nodes (including the root)? What should be saved in the leaf nodes?

(Hint: The total number of clustering centers obtained by the Hierarchical K-Means Clustering algorithm is larger than 1,048,576. Many parent clusters or intermediate clusters exist.)

Problem 2. Stochastic Gradient Descent (20 points)

We want to train a linear learning model for classification. The model takes the form $f(x) = w^\top x$, where $x \in \mathbb{R}^d$ is an input data sample and $w \in \mathbb{R}^d$ is the weight vector. The classification framework is to minimize the following objective function

$$Q(w) = \sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i), y_i\big), \qquad (1)$$

in which $x_i$ is the $i$-th training sample, and its label is $y_i \in \{1, -1\}$.

1) If the loss function is $\mathrm{Loss}(a, b) = (a - b)^2$, derive the update formula of the Stochastic Gradient Descent method that is used to optimize Eq. (1).

2) If the loss function is $\mathrm{Loss}(a, b) = \ln\big(1 + \exp(-ab)\big)$, derive the update formula of the Stochastic Gradient Descent method that is used to optimize Eq. (1).

(Hint: The update formula should be expressed as $w_{k+1} \leftarrow g(w_k)$, in which $g(w_k)$ is some function of $w_k$.)

Problem 3. Support Vector Machines (30 points)

Solve Q1 (a), (b), and (c) of 21.7 EXERCISES in the textbook (page 546).
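
(Note for Problem 2: The objective in Eq. (1) can be evaluated numerically, which may help when checking a hand-derived gradient against a finite-difference estimate. The following is a minimal sketch, not part of the assignment. It assumes samples are stored as plain Python lists, and the names `objective`, `numerical_gradient`, and the toy data are hypothetical and chosen only for illustration.)

```python
import math


def f(w, x):
    """Linear model f(x) = w^T x for two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def squared_loss(a, b):
    """Loss(a, b) = (a - b)^2."""
    return (a - b) ** 2


def logistic_loss(a, b):
    """Loss(a, b) = ln(1 + exp(-a*b)); may overflow for very negative a*b."""
    return math.log(1.0 + math.exp(-a * b))


def objective(w, samples, labels, loss):
    """Q(w) = sum_i Loss(f(x_i), y_i), as in Eq. (1)."""
    return sum(loss(f(w, x), y) for x, y in zip(samples, labels))


def numerical_gradient(w, samples, labels, loss, eps=1e-6):
    """Central finite-difference estimate of the gradient of Q at w."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = objective(w_plus, samples, labels, loss) - objective(
            w_minus, samples, labels, loss
        )
        grad.append(diff / (2.0 * eps))
    return grad


if __name__ == "__main__":
    # Made-up toy data: two samples in R^2 with labels in {1, -1}.
    samples = [[1.0, 2.0], [-1.0, 0.5]]
    labels = [1, -1]
    w = [0.0, 0.0]
    print(objective(w, samples, labels, squared_loss))
    print(objective(w, samples, labels, logistic_loss))
    print(numerical_gradient(w, samples, labels, squared_loss))
```

A derived gradient of a single term of Eq. (1) can be compared with `numerical_gradient` by passing a one-sample list, since the stochastic update uses one training sample at a time.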