Machine Learning 1 — WS2014 — Module IN2064
Machine Learning Worksheet 7
Linear Classification
1 Linear separability
Problem 1: Given a set of data points $\{x_n\}$, we can define the convex hull to be the set of all points $x$ given by
$$x = \sum_n \alpha_n x_n$$
where $\alpha_n \geq 0$ and $\sum_n \alpha_n = 1$. Consider a second set of points $\{y_n\}$ together with their corresponding convex hull. By definition, the two sets of points will be linearly separable if there exists a vector $w$ and a scalar $w_0$ such that $w^T x_n + w_0 > 0$ for all $x_n$, and $w^T y_n + w_0 < 0$ for all $y_n$. Show that if their convex hulls intersect, the two sets of points cannot be linearly separable, and conversely that if they are linearly separable, their convex hulls do not intersect.
We prove: convex hulls intersect $\Rightarrow$ not linearly separable. We have two sets of points, $\{x_n\}$ and $\{y_n\}$. Assume there is a $z$ that lies in both convex hulls, i.e. there are $\{\alpha_n\}$ and $\{\beta_n\}$ such that
$$z = \sum_n \alpha_n x_n = \sum_n \beta_n y_n.$$
Now, assume the two sets of points are linearly separable, i.e. there are $w$ and $w_0$ with $w^T x_n + w_0 > 0$ for all $x_n$ and $w^T y_n + w_0 < 0$ for all $y_n$. For $w^T z$ we then get
$$w^T z = \sum_n \alpha_n w^T x_n > \sum_n \alpha_n (-w_0) = -w_0,$$
but also
$$w^T z = \sum_n \beta_n w^T y_n < \sum_n \beta_n (-w_0) = -w_0,$$
which is a contradiction. Hence intersecting convex hulls rule out linear separability; the converse (linearly separable $\Rightarrow$ disjoint convex hulls) is just the contrapositive of this statement.
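The statement can also be checked numerically. The following Python sketch is my own addition (not part of the worksheet); the helper name linearly_separable and the margin-1 formulation are assumptions. It decides linear separability of two finite point sets by solving a feasibility LP with scipy.optimize.linprog; two sets whose convex hulls share a point come out as non-separable, exactly as the proof above predicts.

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X_pos, X_neg):
    # Feasibility LP: find (w, w0) with t_i (w^T x_i + w0) >= 1 for all points.
    # For finite point sets this is equivalent to strict separability.
    X = np.vstack([X_pos, X_neg])
    t = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    d = X.shape[1]
    A_ub = -t[:, None] * np.hstack([X, np.ones((len(X), 1))])   # rows: -t_i [x_i, 1]
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0   # feasible <=> separable

# Intersecting convex hulls (the point (0,0) belongs to both sets): not separable.
print(linearly_separable(np.array([[0., 0.], [1., 0.]]),
                         np.array([[0., 0.], [0., 1.]])))      # False
# Disjoint convex hulls: separable.
print(linearly_separable(np.array([[1., 1.]]),
                         np.array([[-1., -1.]])))              # True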
Problem 2: Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector $w$ whose decision boundary $w^T \phi(x) = 0$ separates the classes and then taking the magnitude of $w$ to infinity.
Finding a separating hyperplane puts all training points on the correct side of the decision boundary (and thus ensures posterior class probabilities greater than 0.5 for every training point). Taking the magnitude of $w$ to infinity then assigns each training point a posterior class probability of 1; this happens because of the shape of the logistic sigmoid function, which becomes a Heaviside step function in this limit. Since every factor of the likelihood then attains its maximal value of 1, this is the maximum likelihood solution.
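A short numeric illustration of this limit (my own sketch, not part of the worksheet; the feature values in phi and the weight w = 1 are arbitrary assumptions, chosen so that the classes are split at 0): scaling up the magnitude of the weight drives the predicted posteriors towards 0 and 1.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

phi = np.array([-2.0, -0.5, 0.5, 2.0])   # two features per class, split at 0
w = 1.0                                   # any weight of the correct sign separates them
for scale in (1, 10, 100):
    # posteriors sigma(scale * w * phi) approach {0, 1} as the scale grows
    print(scale, sigmoid(scale * w * phi))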
2 Multiclass classification
Problem 3: Consider a generative classification model for $K$ classes defined by prior class probabilities $p(C_k) = \pi_k$ and general class-conditional densities $p(\phi \mid C_k)$ where $\phi$ is the input feature vector. Suppose we are given a training data set $\{\phi_n, t_n\}$ where $n = 1, \ldots, N$, and $t_n$ is a binary target vector of length $K$ that uses the 1-of-K coding scheme, so that it has components $t_{nj} = I_{jk}$ if pattern $n$ is from class $C_k$. Assuming that the data points are drawn independently from this model, show that the maximum-likelihood solution for the prior probabilities is given by $\pi_k = \frac{N_k}{N}$, where $N_k$ is the number of data points assigned to class $C_k$.
We want to find parameters that maximize the complete-data likelihood $\prod_n p(\phi_n, t_n)$. We use a simple trick (from the lecture) to write the joint probabilities compactly:
$$p(\phi_n, t_n) = \prod_k \left[ p(\phi_n \mid C_k)\, p(C_k) \right]^{t_{nk}}$$
Taking negative logs we get $-\sum_n \sum_k t_{nk} \big( \log p(\phi_n \mid C_k) + \log p(C_k) \big)$. We want to optimize with respect to $\pi_k$ and thus have to look at the following objective function (with the constraint added via a Lagrange multiplier):
$$-\sum_n \sum_k t_{nk} \log \pi_k + \lambda \Big( \sum_k \pi_k - 1 \Big)$$
Setting the derivatives of this function with respect to $\pi_k$ and $\lambda$ equal to zero we get
$$-\sum_n \frac{t_{nk}}{\pi_k} + \lambda = 0 \qquad (1)$$
$$\sum_k \pi_k = 1 \qquad (2)$$
Thus (from the first condition) $\pi_k = \frac{\sum_n t_{nk}}{\lambda} = \frac{N_k}{\lambda}$, and from the second we get $\lambda = N$. Thus one has $\pi_k = \frac{N_k}{N}$.
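As a quick numeric illustration (my own sketch, not from the worksheet; the 1-of-K target matrix T below is made up): the maximum-likelihood priors are simply the observed class frequencies.

import numpy as np

T = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])      # t_nk in 1-of-K coding, N = 5, K = 3
N_k = T.sum(axis=0)            # N_k: number of points assigned to class C_k
pi_ml = N_k / T.shape[0]       # pi_k = N_k / N
print(pi_ml)                   # [0.6 0.2 0.2]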
3 Bounds
Problem 4: Suppose we test a classification method on a set of $n$ new test cases. Let $X_i = 1$ if the classification is wrong and $X_i = 0$ if it is correct. Then $\hat{X} = n^{-1} \sum_i X_i$ is the observed error rate. If we regard each $X_i$ as a Bernoulli variable with unknown mean $p$, then $p$ is the true, but unknown, error rate of our method. How likely is $\hat{X}$ to not be within $\varepsilon$ of $p$? How many test cases are necessary to ensure that the observed error rate is farther than 0.01 away from the true one with probability at most 5%?
This is an application of the Chebyshev inequality: $P(|\hat{X} - p| \geq \varepsilon) \leq \frac{\mathrm{Var}[\hat{X}]}{\varepsilon^2}$. Assuming the $n$ new test cases are i.i.d., we know that $\mathrm{Var}[\hat{X}] = \mathrm{Var}[X_1]/n = p(1-p)/n$, which has an upper bound of
$1/(4n)$. So
$$P(|\hat{X} - p| \geq \varepsilon) \leq \frac{p(1-p)}{n\varepsilon^2} \leq \frac{1}{4n\varepsilon^2}.$$
And the right-hand side should be equal to 5%, so $n = \frac{20}{4\varepsilon^2} = \frac{5}{\varepsilon^2}$, which is 50000 for $\varepsilon = 0.01$.
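A one-line numeric check of this value (my own sketch; it simply inverts the bound $1/(4n\varepsilon^2) = \delta$ for $n$):

eps, delta = 0.01, 0.05
n = 1.0 / (4 * delta * eps ** 2)   # from 1/(4 n eps^2) = delta
print(n)                           # 50000.0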
4 The perceptron
An important example of a so-called linear discriminant model is the perceptron of Rosenblatt. The following questions will look more closely at this algorithm. We will assume the following:
• The parameters of the perceptron learning algorithm are called weights and are denoted by $w$.
• The training set consists of training inputs $x_i$ with labels $t_i \in \{+1, -1\}$.
• The learning rate is 1.
• Let $k$ denote the number of weight updates the algorithm has performed at some point in time and $w^k$ the weight vector after $k$ updates (initially, $k = 0$ and $w^0 = 0$).
• All training inputs have bounded Euclidean norms, i.e. $\|x_i\| < R$ for all $i$ and some $R \in \mathbb{R}^+$.
• There is some $\gamma > 0$ such that $t_i \tilde{w}^T x_i > \gamma$ for all $i$ and some suitable $\tilde{w}$ ($\gamma$ is called a finite margin).
Problem 5: Write down the perceptron learning algorithm.
while there is at least one misclassified x_i do
    pick one misclassified x_i
    update the weight: w = w + t_i * x_i
done
Problem 6: Given the following training set D of labeled 2D training inputs, find a separating hyperplane using
the perceptron learning rule. Illustrate the consecutive updates of the weight w with a series of plots (do not plot
the bias weight)!
D = {((−0.7, 0.8), +1), ((−0.9, 0.6), +1), ((−0.3, −0.2), +1), ((−0.6, 0.7), +1)}
∪ {((0.6, −0.8), −1), ((0.2, −0.5), −1), ((0.3, 0.2), −1)}
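The following is a runnable Python sketch of the update rule from Problem 5, applied to the data set D above (my own illustration, not part of the worksheet; handling the bias by appending a constant 1 to each input is an assumption, as is always picking the first misclassified point). The weight vectors it prints after each update, with the bias component omitted, are the ones to plot.

import numpy as np

X = np.array([[-0.7, 0.8], [-0.9, 0.6], [-0.3, -0.2], [-0.6, 0.7],
              [0.6, -0.8], [0.2, -0.5], [0.3, 0.2]])
t = np.array([+1, +1, +1, +1, -1, -1, -1])
Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant bias feature

w = np.zeros(3)                               # initial weight, k = 0
updates = 0
while True:
    wrong = [i for i in range(len(Xb)) if t[i] * (w @ Xb[i]) <= 0]
    if not wrong:                             # no misclassified point left
        break
    i = wrong[0]                              # pick one misclassified x_i
    w = w + t[i] * Xb[i]                      # perceptron update, learning rate 1
    updates += 1
    print(updates, w[:2])                     # weight after each update (bias omitted)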
You will now show that the perceptron algorithm converges in a finite number of updates (if the training data is
linearly separable).
Problem 7: Let $w^k$ be the $k$-th update of the weight during the perceptron algorithm. Show that $\tilde{w}^T w^k \geq k\gamma$. (Hint: How are $\tilde{w}^T w^k$ and $\tilde{w}^T w^{k-1}$ related?)
If the $k$-th update is triggered by a misclassified input $x_i$, then
$$\tilde{w}^T w^k = \tilde{w}^T w^{k-1} + t_i \tilde{w}^T x_i > \tilde{w}^T w^{k-1} + \gamma.$$
Starting from $\tilde{w}^T w^0 = 0$ (because $w^0 = 0$) and applying this relation $k$ times gives $\tilde{w}^T w^k \geq k\gamma$.
Use induction to formally prove the statement.
Problem 8: Show that $\|w^k\|^2 < kR^2$. Note that the algorithm updates the weights only in response to a mistake (i.e., $t_i x_i^T w^{k-1} \leq 0$ for some $i$). (Hint: Triangle inequality for the Euclidean norm.)
$$\begin{aligned}
\|w^k\|^2 &= \|w^{k-1} + t_i x_i\|^2 \\
&= \|w^{k-1}\|^2 + 2 t_i x_i^T w^{k-1} + \|x_i\|^2 \\
&\leq \|w^{k-1}\|^2 + \|x_i\|^2 \\
&< \|w^{k-1}\|^2 + R^2
\end{aligned}$$
The first inequality uses that $x_i$ was misclassified, i.e. $t_i x_i^T w^{k-1} \leq 0$, and the second uses $\|x_i\| < R$. Starting from $\|w^0\|^2 = 0$ and applying this chain $k$ times gives $\|w^k\|^2 < kR^2$.
Again, use induction to formally prove the statement.
Problem 9: Consider the cosine of the angle between $\tilde{w}$ and $w^k$ and derive
$$k \leq \frac{R^2 \|\tilde{w}\|^2}{\gamma^2}.$$
Now consider a new data set, D′ (again 2D inputs and two different classes):
D′ = {((0, 0), +1), ((−0.1, 0.1), +1), ((−0.3, −0.2), +1), ((0.2, 0.1), +1)}
    ∪ {((0.2, −0.1), +1), ((−1.1, −1.0), −1), ((−1.3, −1.2), −1), ((−1, −1), −1)}
    ∪ {((1, 1), −1), ((0.9, 1.2), −1), ((1.1, 1.0), −1)}
Using Problem 7 for the numerator and Problem 8 for the denominator,
$$1 \geq \cos(\tilde{w}, w^k) = \frac{\tilde{w}^T w^k}{\|\tilde{w}\|\,\|w^k\|} \geq \frac{k\gamma}{\|\tilde{w}\|\,\|w^k\|} \geq \frac{k\gamma}{\sqrt{kR^2\|\tilde{w}\|^2}} = \frac{\sqrt{k}\,\gamma}{R\,\|\tilde{w}\|},$$
and rearranging $\sqrt{k}\,\gamma \leq R\|\tilde{w}\|$ gives $k \leq \frac{R^2\|\tilde{w}\|^2}{\gamma^2}$.
Problem 10: Can you separate this data with the perceptron algorithm? Why/why not?
No. This data set is not linearly separable: the positive point $(0, 0)$ lies on the line segment between the negative points $(-1, -1)$ and $(1, 1)$, so the convex hulls of the two classes intersect and, by Problem 1, no separating hyperplane exists. Consequently the perceptron algorithm never converges on this data.
Problem 11: Transform every input $x_i \in D'$ to $x_i'$ with $x_{i1}' = \exp\!\big(-\tfrac{\|x_i\|^2}{2}\big)$ and $x_{i2}' = \exp\!\big(-\tfrac{\|x_i - (1,1)\|^2}{2}\big)$. If the labels stay the same, are the $x_i'$ now linearly separable? Why/why not?
Yes, the transformed points are linearly separable. All positive points lie close to the origin, so their first feature $x_{i1}' = \exp(-\|x_i\|^2/2)$ is above 0.9, while all negative points lie close to $(1,1)$ or $(-1,-1)$, so their first feature is below 0.4. A decision boundary such as $x_1' = 0.65$ therefore separates the classes, which is exactly what a plot of the transformed inputs shows.
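This is easy to verify numerically (my own sketch, not part of the worksheet): after the transformation, the first feature alone already separates the two classes of D′.

import numpy as np

pos = np.array([[0, 0], [-0.1, 0.1], [-0.3, -0.2], [0.2, 0.1], [0.2, -0.1]])
neg = np.array([[-1.1, -1.0], [-1.3, -1.2], [-1, -1],
                [1, 1], [0.9, 1.2], [1.1, 1.0]])

def transform(X):
    # x'_1 = exp(-||x||^2 / 2), x'_2 = exp(-||x - (1,1)||^2 / 2)
    f1 = np.exp(-np.sum(X ** 2, axis=1) / 2)
    f2 = np.exp(-np.sum((X - np.array([1.0, 1.0])) ** 2, axis=1) / 2)
    return np.stack([f1, f2], axis=1)

print(transform(pos)[:, 0].min())   # about 0.94: every positive point has x'_1 > 0.9
print(transform(neg)[:, 0].max())   # about 0.37: every negative point has x'_1 < 0.4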
Submit to homework@class.brml.org with subject line homework sheet 6 by 2014/11/24, 23:59 CET