Solution Sketches 2011 Homework3

Dr. Eick
COSC 6342 “Machine Learning” Homework 3/4 Spring 2011
Due: Submit problems 15-18 and problem 10b from Homework 2 by Mo., April 25, 11pm; submit
all other problems by Fr., April 29, 11pm.
Remark: More problems might be added, but those will not be graded.
10 ii) Compute the first principal component for the covariance matrix of Problem 9 (use
any procedure you like). What percentage of the variance does the first principal
component capture? Interpret your result!
The covariance matrix from Problem 9 is

    A = |  1   0   0 |
        |  0   4  -3 |
        |  0  -3   9 |

Setting det(A - λI) = 0:

        | 1-λ   0     0   |
    det |  0   4-λ   -3   | = 0
        |  0   -3    9-λ  |

(1-λ)[(4-λ)(9-λ) - 9] = 0
(1-λ)[(36 - 13λ + λ^2) - 9] = 0
(1-λ)[λ^2 - 13λ + 27] = 0

λ1 = 1
λ2 = (13 + ((-13)^2 - 4(1)(27))^(1/2)) / 2 = (13 + 61^(1/2)) / 2 ≈ 10.405   ***** max eigenvalue *****
λ3 = (13 - ((-13)^2 - 4(1)(27))^(1/2)) / 2 = (13 - 61^(1/2)) / 2 ≈ 2.595
Principal component (0, -0.42415539624972365, 0.9055894212236802)
Proportion of the variance captured by the first principal component
= λ2 / (λ1 + λ2 + λ3)
= ((13 + 61^(1/2)) / 2) / (1 + (13 + 61^(1/2)) / 2 + (13 - 61^(1/2)) / 2)
= 10.405 / 14
= 0.74322
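The numbers can be checked numerically; a minimal sketch using NumPy (an addition of this write-up, not part of the original solution):

import numpy as np

# Covariance matrix from Problem 9
A = np.array([[1.0,  0.0,  0.0],
              [0.0,  4.0, -3.0],
              [0.0, -3.0,  9.0]])

# Eigen-decomposition of the symmetric matrix (eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(A)

idx = np.argmax(eigvals)            # index of the largest eigenvalue
print(eigvals)                      # approx. [1.0, 2.595, 10.405]
print(eigvecs[:, idx])              # first principal component, approx. (0, -0.424, 0.906) up to sign
print(eigvals[idx] / eigvals.sum()) # proportion of variance, approx. 0.743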
15) Regression Trees
Assume you create a regression tree for a function with one input variable x and output
variable y, and the current node contains the following 4 examples (in (x y) form):
(1.1 7), (1.2 9), (1.4 19), (1.5 21). What test condition would you generate
assuming the node is split?
Mean(y)current = (7+9+19+21)/4 = 14
Ecurrent = ((7-14)^2 + (9-14)^2 + (19-14)^2 + (21-14)^2)/4 = (49+25+25+49)/4 = 148/4 = 37

x <= 1.1 (or x < 1.2):
E1 = 0/1 + ((9-49/3)^2 + (19-49/3)^2 + (21-49/3)^2)/3 = …

x <= 1.2:
E2 = ((7-8)^2 + (9-8)^2)/2 + ((19-20)^2 + (21-20)^2)/2 = …

x <= 1.4:
E3 = ((7-35/3)^2 + (9-35/3)^2 + (19-35/3)^2)/3 + 0/1 = …
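A minimal sketch for evaluating the candidate splits numerically (the data and the squared-error criterion are the ones used above; the helper names node_mse and split_error are illustrative):

# Data for the current node, taken from the problem statement above
xs = [1.1, 1.2, 1.4, 1.5]
ys = [7.0, 9.0, 19.0, 21.0]

def node_mse(values):
    """Mean squared deviation from the node mean (0 for an empty or singleton node)."""
    if len(values) <= 1:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_error(threshold):
    """Error of the split 'x <= threshold' as the sum of the two children's errors."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return node_mse(left) + node_mse(right)

for t in (1.1, 1.2, 1.4):
    print(f"x <= {t}: error ~ {split_error(t):.2f}")
# x <= 1.2 gives the smallest error (2.0 vs. roughly 27.6 for the other candidates),
# so the test condition x <= 1.2 (or x < 1.3) would be generated.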
16) DBSCAN
How does DBSCAN form clusters? What are the characteristics of an outlier in
DBSCAN?
Recall the definitions of “core point”, “border point”, and “density-connected”.
A cluster, which is a subset of the points of the database, satisfies two properties:
1. All points within the cluster are mutually density-connected.
2. If a point is density-connected to any point of the cluster, it is part of the cluster as well.
An outlier (noise) point is neither a core point nor a border point; it is not density-connected
to any point in a cluster.
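A minimal sketch of DBSCAN forming clusters and flagging an outlier, using scikit-learn (the data set and the eps/min_samples values are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point (an outlier)
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [9.0, 0.0]])

# Points with at least min_samples points (including themselves) within eps become
# core points; clusters grow by adding density-reachable points.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

print(labels)  # e.g. [0 0 0 1 1 1 -1]; the label -1 marks the outlier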
17) Ensemble Methods
a) What is the key idea of ensemble methods? Why have they been successfully used in
many applications to obtain classifiers with high accuracies?
Key idea of ensemble methods: generate a group of base learners such that different
base models misclassify different training examples; high accuracies can then be
accomplished even if the accuracy of each individual base classifier is low.
b) In which case, does AdaBoost increase the weight of an example? Why does the
AdaBoost algorithm modify the weights of examples?
AdaBoost increases the weight of an example when the example is misclassified by the
current base classifier. The AdaBoost algorithm modifies the weights of examples so
that the resampling process in the next round will be more likely to pick the
misclassified examples and less likely to pick the correctly classified examples;
a model trained in the next round will therefore focus more on these misclassified
examples.
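A minimal sketch of one AdaBoost-style weight update round illustrating this (the function and variable names are illustrative; the update shown is the standard exponential re-weighting):

import numpy as np

def update_weights(weights, misclassified, error):
    """One boosting round: raise weights of misclassified examples,
    lower weights of correctly classified ones, then renormalize."""
    alpha = 0.5 * np.log((1.0 - error) / error)   # importance of the current classifier
    signs = np.where(misclassified, 1.0, -1.0)    # +1 if wrong, -1 if right
    new_w = weights * np.exp(alpha * signs)
    return new_w / new_w.sum()

w = np.full(4, 0.25)                          # uniform initial weights
miss = np.array([False, True, False, False])  # example 2 was misclassified
print(update_weights(w, miss, error=0.25))
# -> the misclassified example now has weight 0.5 and the others 1/6 each,
#    so resampling in the next round focuses on the hard example.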
18) Support-Vector Machines
a) The Top Ten Data Mining Algorithms article says about SVM: “The reason why SVM
insists on finding the maximum margin hyperplanes is that it offers the best generalization
ability.” Explain why having wide margins is good!
The margin of a hyperplane in SVM is based on the distances from the hyperplane to the
nearest examples (support vectors) of the two classes in the training data. If a new
example lies closer to the hyperplane than the support vectors, it will still be classified
correctly as long as it falls inside the margin area on the correct side of the hyperplane.
Therefore, having wide margins increases the chance of correctly predicting a new example
and thus offers the best generalization ability.
b) The soft margin support vector machine solves the following optimization problem:
What does the first term minimize? What does ξi measure (refer to the figure given below
if helpful)? What role does the parameter C play?
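The optimization problem and the figure are not reproduced in this extract; the answers below assume the usual soft-margin primal formulation:

\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad y_i\,(w \cdot x_i - b) \ge 1 - \xi_i, \qquad \xi_i \ge 0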
1. As shown in the figure, w is a normal vector, i.e., it is perpendicular to the hyperplane wx-b=0.
The distance between the two hyperplanes wx-b=1 and wx-b=-1 is 2/||w||, so minimizing the
first term is equivalent to maximizing the margin of the hyperplane.
2. ξi is the slack variable which measures the degree of misclassification of data point i.
The term C Σ ξi is the penalty for examples being on the wrong side of their support plane.
3. The parameter C determines the importance (weight) of the second term in the formula; a
small C puts more emphasis on finding a wide margin, while a large C places more importance
on reducing the training error.
c) Why do most support vector machines map examples into a new, usually higher-dimensional,
feature space? What is the motivation for doing that?
Many problems that are not linearly separable in the original feature space become linearly
separable after mapping the examples into a suitable higher-dimensional feature space. So the
main motivation for doing so is to reduce the classification error for a given problem.
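A minimal sketch of this idea (the XOR-style data set and the added product feature x1*x2 are illustrative assumptions, not part of the homework):

import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Map into 3-D by adding the product feature x1*x2
X_mapped = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# A linear SVM separates the mapped data perfectly
clf = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(clf.score(X_mapped, y))  # 1.0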
d) The following questions refer to SV regression. Why does the approach use two error (slack)
variables ξt and ξt*, which measure error, and two constraints, instead of one as the support
vector machine classification approach does? What are the characteristics of examples that
have an error of zero (both error variables are 0)? What are the characteristics of the
regression functions the SV regression approach generates?
(The following answers are based on the Smola/Schoelkopf Tutorial on Support Vector
Regression)
1. Using the dual formulation of the convex optimization problem (a dual convex quadratic
program) is the key to extending SVM regression to nonlinear functions. With the two error
variables, the objective function can be viewed as a primal problem, and its dual form can
be obtained by constructing the Lagrange function and introducing a set of (dual) variables.
It is also easier to solve the convex optimization problem in its dual formulation in most
cases.
2. The examples that have an error of zero are located within the error tube (ε-tube).
3. Our goal is to find a function f(x) that has at most ε deviation from the actually
obtained targets yi for all the training data, and at the same time is as flat as possible.
In other words, we do not care about errors as long as they are less than ε, but will not
accept any deviation larger than this.
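For reference, a sketch of the standard ε-insensitive formulation (following the Smola/Schoelkopf tutorial cited above); the two slack variables penalize deviations above and below the ε-tube, which is why two constraints per example are needed:

\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_t (\xi_t + \xi_t^*)
\quad \text{subject to} \quad
y_t - (w \cdot x_t + b) \le \varepsilon + \xi_t, \quad
(w \cdot x_t + b) - y_t \le \varepsilon + \xi_t^*, \quad
\xi_t,\ \xi_t^* \ge 0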
19) Kernels
a) Assume you want to use kernels to generalize a machine learning technique T; what
property must T have so that this is possible?
T must be able to be written in "dot product form", i.e., T must access the examples only
through dot products, which can then be replaced by kernel evaluations.
b) How is kernel PCA different from ordinary PCA?
- Kernel PCA does PCA on the kernel matrix (equivalent to doing PCA in the mapped
space, selecting some orthogonal eigenvectors in the mapped space as the new
coordinate system).
- It is a kind of PCA using non-linear transformations of the original space; moreover,
the vectors of the chosen new coordinate system are usually not orthogonal in the
original space.
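A minimal sketch contrasting the two with scikit-learn (the two-circles data set, the RBF kernel, and the gamma value are arbitrary illustrative choices):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the interesting structure is non-linear
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Ordinary PCA: orthogonal directions of maximum variance in the original space
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA: PCA on the kernel (Gram) matrix, i.e. PCA in the implicitly mapped
# feature space; the directions are generally non-linear in the input space
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # (200, 2) (200, 2)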
c) The Vasconcelos lecture shows how K-means can be generalized to use kernels. What
is the advantage of a version of K-means which supports kernels? What is lost: are there
things you can no longer do with the new version of K-means, or things which have to be
done differently from traditional K-means?
The advantage of a K-means version which supports kernels is that the algorithm can now
find clusters of non-convex shape in the original feature space (convex shape in the
transformed feature space).
The new version of K-means no longer has access to an explicit prototype (centroid) for
each cluster. For applications where the prototypes are important, we can try replacing
the prototype by the closest vector, but this is not necessarily optimal.
d) Assume k1 and k2 are kernel functions. Show that
k(u,v) = k1(u,v) + k2(u,v)
is a kernel function! (Idea: construct a feature mapping Φ with Φ(u)·Φ(v) = k(u,v).)
This problem is potentially challenging; try your best to solve it!
See 2009 Exam3 Solution Sketches posted on Dr. Eick's course webpage
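A brief sketch of one standard argument (not the posted solution): if Φ1 and Φ2 are feature maps with k1(u,v) = Φ1(u)·Φ1(v) and k2(u,v) = Φ2(u)·Φ2(v), define Φ as their concatenation:

\Phi(u) = \bigl(\Phi_1(u),\ \Phi_2(u)\bigr), \qquad
\Phi(u) \cdot \Phi(v) = \Phi_1(u)\cdot\Phi_1(v) + \Phi_2(u)\cdot\Phi_2(v)
= k_1(u,v) + k_2(u,v) = k(u,v)

Since k can be written as a dot product in this concatenated feature space, it is a kernel function.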
20) ROC Curves
a) Create the ROC-curve for the following classifier, based on the results of applying a
support vector machine classifier (see footnote 1) to the following 4 examples [4]:
Example# | Classifier Output | Selected Class | Correct Class
1        | -0.25             | -              | -
2        | -0.02             | -              | -
3        | +0.03             | +              | +
4        | +0.23             | +              | +
b) Interpret the ROC curve! Does the curve suggest that the classifier is a good classifier?
The answer is not given here; it will be updated later.
(The ROC curve shows the relation between TPR and FPR when choosing different cutoff
points for a classifier; don't we need more than one threshold and the corresponding
TPR/FPR values in order to draw a ROC curve?)
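A minimal sketch of how the ROC points could be computed from the four scores above by sweeping the threshold (an illustration, not the official answer; it takes '+' as the positive class and uses the scores and correct classes from the table):

import numpy as np

scores = np.array([-0.25, -0.02, 0.03, 0.23])  # classifier outputs
labels = np.array([0, 0, 1, 1])                # correct class: 1 = '+', 0 = '-'

# Sweep the decision threshold over the observed scores (plus an extreme value)
for thresh in np.concatenate(([np.inf], np.sort(scores)[::-1])):
    pred = (scores >= thresh).astype(int)
    tpr = (pred & labels).sum() / labels.sum()              # true positive rate
    fpr = (pred & (1 - labels)).sum() / (1 - labels).sum()  # false positive rate
    print(f"threshold {thresh:>6}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
# The (FPR, TPR) pairs traced out this way form the ROC curve; for these scores the
# curve passes through (0, 0), (0, 0.5), (0, 1), (0.5, 1), (1, 1).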
1 We assume that the classifier assigns class ‘+’ to an example if the returned value is greater than 0, and class ‘-’ otherwise.