Dr. Eick COSC 6342 "Machine Learning" Homework 3/4 Spring 2011
Due: Submit problems 15-18 and problem 10b from Homework 2 by Mo., April 25, 11p; submit all other problems by Fr., April 29, 11p.
Remark: More problems might be added, but those will not be graded.

10 ii) Compute the first principal component for the covariance matrix of Problem 9 (use any procedure you like). What percentage of the variance does the first principal component capture? Interpret your result!

Covariance matrix:

    1  0  0
    0  4 -3
    0 -3  9

det(A - λI) = 0:

    | 1-λ   0     0  |
    |  0   4-λ   -3  |  =  0
    |  0   -3   9-λ  |

(1-λ)[(4-λ)(9-λ) - 9] = 0
(1-λ)[(36 - 13λ + λ²) - 9] = 0
(1-λ)[λ² - 13λ + 27] = 0

λ1 = 1
λ2 = (-(-13) + ((-13)² - 4·1·27)^(1/2)) / (2·1) = (13 + 61^(1/2))/2 ≈ 10.405   ***** max eigenvalue *****
λ3 = (-(-13) - ((-13)² - 4·1·27)^(1/2)) / (2·1) = (13 - 61^(1/2))/2 ≈ 2.595

First principal component (eigenvector of λ2): (0, -0.42415539624972365, 0.9055894212236802)

Proportion of the variance captured by the first principal component
= λ2 / (λ1 + λ2 + λ3)
= ((13 + 61^(1/2))/2) / ((13 + 61^(1/2))/2 + (13 - 61^(1/2))/2 + 1)
= 10.405 / 14 ≈ 0.74322

Interpretation: the first principal component captures about 74.3% of the total variance (the trace of the covariance matrix, 1 + 4 + 9 = 14), so projecting onto this single direction already preserves most of the variability; note that the first attribute (eigenvalue 1) does not contribute to this component at all.

15) Regression Trees
Assume you create a regression tree for a function with one input variable x and the output variable y, and the current node contains (in (x y) form) the following 4 examples: (1.1 7), (1.2 9), (1.4 19), (1.5 21). What test condition would you generate, assuming the node is split?

Mean(y)_current = (7+9+19+21)/4 = 14
E_current = ((7-14)² + (9-14)² + (19-14)² + (21-14)²)/4 = (49+25+25+49)/4 = 148/4 = 37

x ≤ 1.1 (or x < 1.2): left node {7}, right node {9, 19, 21} with mean 49/3
E1 = 0/1 + ((9-49/3)² + (19-49/3)² + (21-49/3)²)/3 ≈ 0 + 27.6 = 27.6

x ≤ 1.2: left node {7, 9} with mean 8, right node {19, 21} with mean 20
E2 = ((7-8)² + (9-8)²)/2 + ((19-20)² + (21-20)²)/2 = 1 + 1 = 2

x ≤ 1.4: left node {7, 9, 19} with mean 35/3, right node {21}
E3 = ((7-35/3)² + (9-35/3)² + (19-35/3)²)/3 + 0/1 ≈ 27.6 + 0 = 27.6

Since the split at x ≤ 1.2 yields by far the smallest error, the test condition x ≤ 1.2 would be generated.

16) DBSCAN
How does DBSCAN form clusters? What are the characteristics of an outlier in DBSCAN?

(See the definitions of "core point", "border point", and "density-connected".) DBSCAN grows a cluster from a core point by collecting all points that are density-reachable from it. A cluster, which is a subset of the points of the database, therefore satisfies two properties:
1. All points within the cluster are mutually density-connected.
2. If a point is density-connected to any point of the cluster, it is part of the cluster as well.
An outlier (noise point) is neither a core point nor a border point; it is not density-connected to any point in a cluster.

17) Ensemble Methods
a) What is the key idea of ensemble methods? Why have they been successfully used in many applications to obtain classifiers with high accuracies?

Key idea of ensemble methods: generate a group of base learners such that different base models misclassify different training examples. Because the combined vote of the ensemble corrects many of the individual mistakes, high accuracies can be accomplished even if the accuracy of each individual base classifier is low.

b) In which case does AdaBoost increase the weight of an example? Why does the AdaBoost algorithm modify the weights of examples?

AdaBoost increases the weight of an example when the example is misclassified by the current base classifier. AdaBoost modifies the weights of the examples so that the resampling (or reweighting) in the next round is more likely to pick the misclassified examples and less likely to pick the correctly classified ones; a model trained in the next round therefore focuses more on the examples that were previously misclassified.
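To make the reweighting step concrete, here is a minimal Python sketch of an AdaBoost-style update. It is an illustration added to these notes, not part of the original solution; the decision-stump base learner, the function name adaboost_weights, the number of rounds, and the toy data are all assumptions made for the example.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_weights(X, y, n_rounds=3):
        """Illustrative reweighting loop; y must contain labels -1/+1."""
        n = len(y)
        w = np.full(n, 1.0 / n)                    # start with uniform example weights
        for t in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)       # base learner trained on weighted data
            pred = stump.predict(X)
            err = float(np.sum(w[pred != y]))      # weighted training error of this round
            err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero / log of zero
            alpha = 0.5 * np.log((1 - err) / err)  # importance of this base learner
            w = w * np.exp(-alpha * y * pred)      # misclassified examples get larger weights
            w = w / w.sum()                        # renormalize so the weights sum to 1
        return w

    # Toy usage: after a few rounds the hardest examples carry the largest weights.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([-1, -1, 1, -1, 1, 1])
    print(adaboost_weights(X, y))

After each round the weights of the misclassified examples grow by a factor of e^alpha while the others shrink, which is exactly the behavior described in 17b.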
18) Support-Vector Machines
a) The Top Ten Data Mining Algorithms article says about SVM: "The reason why SVM insists on finding the maximum margin hyperplanes is that it offers the best generalization ability." Explain why having wide margins is good!

The margin of a hyperplane in SVM is based on the distances from the hyperplane to the nearest examples (support vectors) of the two classes in the training data. If a new example lies closer to the hyperplane than the support vectors, it will still be classified correctly as long as it falls on the correct side of the hyperplane within the margin area. Therefore, having wide margins increases the chance of correctly predicting new examples and thus offers the best generalization ability.

b) The soft margin support vector machine solves the following optimization problem:

    minimize   (1/2)||w||² + C Σi ξi
    subject to yi (w·xi - b) ≥ 1 - ξi  and  ξi ≥ 0  for all i

What does the first term minimize? What does ξi measure (refer to the figure given below if helpful)? What role does the parameter C play?

1. As shown in the figure, w is a normal vector that is perpendicular to the hyperplane w·x - b = 0. The distance between the two hyperplanes w·x - b = 1 and w·x - b = -1 is 2/||w||, so minimizing the first term is equivalent to maximizing the margin of the hyperplane.
2. ξi is the slack variable which measures the degree of misclassification of data point i; the term C Σ ξi is the penalty for examples lying on the wrong side of their support plane.
3. The parameter C determines the importance (weight) of the second term in the formula: a small C puts more emphasis on finding a wide margin, while a large C places more importance on reducing the training error.

c) Why do most support vector machines map examples into a new feature space which is usually higher dimensional? What is the motivation for doing that?

Many problems that are not linearly separable become linearly separable after the examples are mapped into a suitable higher-dimensional feature space. The main motivation for doing so is therefore to reduce the classification error for a given problem.

d) The following questions refer to SV regression. Why does the approach use two error variables ξi and ξi*, which measure the error, and two constraints, instead of one as the support vector machine classification approach does? What are the characteristics of examples that have an error of zero, i.e. examples for which both error variables are 0? What are the characteristics of the regression functions the SV regression approach generates? (The following answers are based on the Smola/Schölkopf tutorial on Support Vector Regression.)

1. In regression an example can deviate from the regression function in either direction, so one slack variable measures how far the example lies above the ε-tube and the other how far it lies below it, which requires one constraint per direction. With these two sets of error variables the objective function can be viewed as a primal problem whose dual form is obtained by constructing the Lagrange function and introducing a set of (dual) variables; using this dual formulation of the convex optimization problem (a convex quadratic program) is the key to extending SV regression to nonlinear functions, and it is also easier to solve in most cases.
2. The examples that have an error of zero (both error variables equal to 0) are located within the ε-tube, i.e. their predictions deviate from their targets by at most ε.
3. The goal is to find a function f(x) that has at most ε deviation from the actually obtained targets yi for all the training data, and that at the same time is as flat as possible. In other words, we do not care about errors as long as they are smaller than ε, but we do not accept any deviation larger than this.

19) Kernels
a) Assume you want to use kernels to generalize a machine learning technique T; what property must T have so that this is possible?

T must be expressible in "dot-product form": the technique has to access the examples only through dot products between pairs of examples, so that every dot product can be replaced by a kernel evaluation.
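To illustrate what "dot-product form" buys you, the following small sketch (an example added to these notes, not part of the original solution; the quadratic kernel and the feature map phi are standard textbook choices used here as assumptions) checks numerically that the kernel k(u,v) = (u·v)² equals an ordinary dot product after an explicit feature mapping:

    import numpy as np

    def phi(x):
        # explicit feature map for the homogeneous quadratic kernel on R^2
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

    def k(u, v):
        # kernel evaluated directly in the original 2-dimensional space
        return float(np.dot(u, v)) ** 2

    u = np.array([1.0, 2.0])
    v = np.array([3.0, -1.0])
    print(k(u, v))                        # 1.0
    print(float(np.dot(phi(u), phi(v))))  # 1.0 as well: the same dot product, without computing phi in k()

Any learner that touches the data only through such dot products can substitute k(u, v) for u·v and thereby work implicitly in the higher-dimensional space.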
b) How is kernel PCA different from ordinary PCA?

Kernel PCA performs PCA on the kernel matrix, which is equivalent to doing PCA in the mapped feature space and selecting some orthogonal eigenvectors in the mapped space as the new coordinate system. It is a kind of PCA that uses non-linear transformations of the original space; moreover, the vectors of the chosen new coordinate system are usually not orthogonal in the original space.

c) The Vasconcelos lecture shows how K-means can be generalized to use kernels. What is the advantage of a version of K-means which supports kernels? What is lost? Are there things you can no longer do with the new version of K-means, or things which have to be done differently from traditional K-means?

The advantage of a kernelized K-means is that the algorithm can now find clusters of non-convex shape in the original feature space (the clusters are convex only in the transformed feature space). What is lost is that the new version of K-means no longer has direct access to the prototype of each cluster, because the cluster centroids live in the transformed space. For applications where the prototypes are important, one can try to replace each prototype by the closest data vector, but this is not necessarily optimal.

d) Assume k1 and k2 are kernel functions. Show that k(u,v) = k1(u,v) + k2(u,v) is a kernel function! (Idea: construct a suitable feature map for k.) This problem is potentially challenging; try your best to solve it!

See the 2009 Exam 3 Solution Sketches posted on Dr. Eick's course webpage.

20) ROC Curves
a) Create the ROC curve for the following classifier, based on the results of applying a support vector machine classifier¹ to the following 4 examples [4]:

    Example#   Classifier Output   Selected Class   Correct Class
       1            -0.25               -
       2            -0.02               -
       3            +0.03               +                 +
       4            +0.23               +                 +

(A small computational sketch for this part is appended at the end of this document.)

b) Interpret the ROC curve! Does the curve suggest that the classifier is a good classifier?

Answer is not given; it will be updated later. (The ROC curve shows the relation between TPR and FPR obtained when choosing different cutoff points for a classifier; don't we need more than one threshold, and the corresponding TPR/FPR values, in order to draw an ROC curve?)

¹ We assume that the classifier assigns class '+' to an example if the returned value is greater than 0, and class '-' otherwise.
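The following sketch for part 20a is added for illustration and is not the official answer. It sweeps a decision threshold over the classifier outputs and collects the (FPR, TPR) points that make up the ROC curve; since the correct classes of examples 1 and 2 are not listed in the table above, they are assumed to be '-' here purely for illustration, and the helper name roc_points is likewise an assumption.

    # Classifier outputs for examples 1-4 from the table above.
    scores = [-0.25, -0.02, 0.03, 0.23]
    # ASSUMPTION for illustration only: examples 1 and 2 are taken to belong to class '-';
    # the table only states that examples 3 and 4 belong to class '+'.
    labels = [-1, -1, +1, +1]          # +1 = class '+', -1 = class '-'

    def roc_points(scores, labels):
        """(FPR, TPR) pairs obtained by sweeping the decision threshold over the scores."""
        pos = sum(1 for y in labels if y == +1)
        neg = len(labels) - pos
        points = [(0.0, 0.0)]          # threshold above every score: nothing predicted '+'
        for cut in sorted(set(scores), reverse=True):
            tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == +1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == -1)
            points.append((fp / neg, tp / pos))
        return points

    print(roc_points(scores, labels))
    # With the assumed labels this prints:
    # [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]

Each threshold corresponds to one cutoff point mentioned in the note under 20b; plotting the resulting (FPR, TPR) pairs gives the ROC curve.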