Machine Learning 1 — WS2014 — Module IN2064
Machine Learning Worksheet 7
Linear Classification
1 Linear separability
Problem 1: Given a set of data points $\{x_n\}$, we can define the convex hull to be the set of all points $x$ given by
$$x = \sum_n \alpha_n x_n$$
where $\alpha_n \geq 0$ and $\sum_n \alpha_n = 1$. Consider a second set of points $\{y_n\}$ together with their corresponding convex hull. By definition, the two sets of points will be linearly separable if there exists a vector $w$ and a scalar $w_0$ such that $w^T x_n + w_0 > 0$ for all $x_n$, and $w^T y_n + w_0 < 0$ for all $y_n$. Show that if their convex hulls intersect, the two sets of points cannot be linearly separable, and conversely that if they are linearly separable, their convex hulls do not intersect.
We prove: convex hulls intersect $\Rightarrow$ not linearly separable. We have two sets of points, $\{x_n\}$ and $\{y_n\}$. Assume there is a $z$ that lies in both convex hulls, i.e. there are $\{\alpha_n\}$ and $\{\beta_n\}$ such that
$$z = \sum_n \alpha_n x_n = \sum_n \beta_n y_n.$$
Now, assume the two sets of points are linearly separable, i.e. there are $w$ and $w_0$ with $w^T x_n + w_0 > 0$ for all $x_n$ and $w^T y_n + w_0 < 0$ for all $y_n$. For $w^T z$ we then get
$$w^T z = \sum_n \alpha_n w^T x_n > \sum_n \alpha_n (-w_0) = -w_0,$$
but also
$$w^T z = \sum_n \beta_n w^T y_n < \sum_n \beta_n (-w_0) = -w_0,$$
which is a contradiction. Hence intersecting convex hulls rule out linear separability; the converse (linearly separable $\Rightarrow$ disjoint convex hulls) is just the contrapositive of this statement.
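The statement can also be checked numerically. The following Python sketch is my own addition (not part of the worksheet); the helper name linearly_separable and the margin-1 formulation are assumptions. It decides linear separability of two finite point sets by solving a feasibility LP with scipy.optimize.linprog; two sets whose convex hulls share a point come out as non-separable, exactly as the proof above predicts.

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X_pos, X_neg):
    # Feasibility LP: find (w, w0) with t_i (w^T x_i + w0) >= 1 for all points.
    # For finite point sets this is equivalent to strict separability.
    X = np.vstack([X_pos, X_neg])
    t = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    d = X.shape[1]
    A_ub = -t[:, None] * np.hstack([X, np.ones((len(X), 1))])   # rows: -t_i [x_i, 1]
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0   # feasible <=> separable

# Intersecting convex hulls (the point (0,0) belongs to both sets): not separable.
print(linearly_separable(np.array([[0., 0.], [1., 0.]]),
                         np.array([[0., 0.], [0., 1.]])))      # False
# Disjoint convex hulls: separable.
print(linearly_separable(np.array([[1., 1.]]),
                         np.array([[-1., -1.]])))              # True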
Problem 2: Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector $w$ whose decision boundary $w^T \phi(x) = 0$ separates the classes and then taking the magnitude of $w$ to infinity.
Finding a separating hyperplane puts all training points on the correct side of the decision boundary (and thus ensures posterior class probabilities greater than 0.5 for every training point). Taking the magnitude of $w$ to infinity then assigns each training point a posterior class probability of 1; this happens because of the shape of the logistic sigmoid function, which becomes a Heaviside step function in this limit. Since every factor of the likelihood then attains its maximal value of 1, this is the maximum likelihood solution.
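A short numeric illustration of this limit (my own sketch, not part of the worksheet; the feature values in phi and the weight w = 1 are arbitrary assumptions, chosen so that the classes are split at 0): scaling up the magnitude of the weight drives the predicted posteriors towards 0 and 1.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

phi = np.array([-2.0, -0.5, 0.5, 2.0])   # two features per class, split at 0
w = 1.0                                   # any weight of the correct sign separates them
for scale in (1, 10, 100):
    # posteriors sigma(scale * w * phi) approach {0, 1} as the scale grows
    print(scale, sigmoid(scale * w * phi))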
2 Multiclass classification
Problem 3: Consider a generative classification model for $K$ classes defined by prior class probabilities $p(C_k) = \pi_k$ and general class-conditional densities $p(\phi \mid C_k)$ where $\phi$ is the input feature vector. Suppose we are given a training data set $\{\phi_n, t_n\}$ where $n = 1, \ldots, N$, and $t_n$ is a binary target vector of length $K$ that uses the 1-of-K coding scheme, so that it has components $t_{nj} = I_{jk}$ if pattern $n$ is from class $C_k$. Assuming that the data points are drawn independently from this model, show that the maximum-likelihood solution for the prior probabilities is given by $\pi_k = \frac{N_k}{N}$, where $N_k$ is the number of data points assigned to class $C_k$.
We want to find parameters that maximize the complete-data likelihood $\prod_n p(\phi_n, t_n)$. We use a simple trick (from the lecture) to write the joint probabilities compactly:
$$p(\phi_n, t_n) = \prod_k \left[ p(\phi_n \mid C_k)\, p(C_k) \right]^{t_{nk}}$$
Taking negative logs we get $-\sum_n \sum_k t_{nk} \big( \log p(\phi_n \mid C_k) + \log p(C_k) \big)$. We want to optimize with respect to $\pi_k$ and thus have to look at the following objective function (with the constraint added via a Lagrange multiplier):
$$-\sum_n \sum_k t_{nk} \log \pi_k + \lambda \Big( \sum_k \pi_k - 1 \Big)$$
Setting the derivatives of this function with respect to $\pi_k$ and $\lambda$ equal to zero we get
$$-\sum_n \frac{t_{nk}}{\pi_k} + \lambda = 0 \qquad (1)$$
$$\sum_k \pi_k = 1 \qquad (2)$$
Thus (from the first condition) $\pi_k = \frac{\sum_n t_{nk}}{\lambda} = \frac{N_k}{\lambda}$, and from the second we get $\lambda = N$. Thus one has $\pi_k = \frac{N_k}{N}$.
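As a quick numeric illustration (my own sketch, not from the worksheet; the 1-of-K target matrix T below is made up): the maximum-likelihood priors are simply the observed class frequencies.

import numpy as np

T = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])      # t_nk in 1-of-K coding, N = 5, K = 3
N_k = T.sum(axis=0)            # N_k: number of points assigned to class C_k
pi_ml = N_k / T.shape[0]       # pi_k = N_k / N
print(pi_ml)                   # [0.6 0.2 0.2]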
3 Bounds
Problem 4: Suppose we test a classification method on a set of $n$ new test cases. Let $X_i = 1$ if the classification is wrong and $X_i = 0$ if it is correct. Then $\hat{X} = n^{-1} \sum_i X_i$ is the observed error rate. If we regard each $X_i$ as a Bernoulli variable with unknown mean $p$, then $p$ is the true, but unknown, error rate of our method. How likely is $\hat{X}$ to not be within $\varepsilon$ of $p$? How many test cases are necessary to ensure that the observed error rate is farther than 0.01 away from the true one with probability at most 5%?
This is an application of the Chebyshev inequality: $P(|\hat{X} - p| \geq \varepsilon) \leq \frac{\mathrm{Var}[\hat{X}]}{\varepsilon^2}$. Assuming the $n$ new test cases are i.i.d., we know that $\mathrm{Var}[\hat{X}] = \mathrm{Var}[X_1]/n = p(1-p)/n$, which has an upper bound of
$1/(4n)$. So
$$P(|\hat{X} - p| \geq \varepsilon) \leq \frac{p(1-p)}{n\varepsilon^2} \leq \frac{1}{4n\varepsilon^2}.$$
And the right-hand side should be equal to 5%, so $n = \frac{20}{4\varepsilon^2} = \frac{5}{\varepsilon^2}$, which is 50000 for $\varepsilon = 0.01$.
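A one-line numeric check of this value (my own sketch; it simply inverts the bound $1/(4n\varepsilon^2) = \delta$ for $n$):

eps, delta = 0.01, 0.05
n = 1.0 / (4 * delta * eps ** 2)   # from 1/(4 n eps^2) = delta
print(n)                           # 50000.0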
4 The perceptron
An important example of a so-called linear discriminant model is the perceptron of Rosenblatt. The following questions will look more closely at this algorithm. We will assume the following:
• The parameters of the perceptron learning algorithm are called weights and are denoted by $w$.
• The training set consists of training inputs $x_i$ with labels $t_i \in \{+1, -1\}$.
• The learning rate is 1.
• Let $k$ denote the number of weight updates the algorithm has performed at some point in time and $w^k$ the weight vector after $k$ updates (initially, $k = 0$ and $w^0 = 0$).
• All training inputs have bounded Euclidean norms, i.e. $\|x_i\| < R$ for all $i$ and some $R \in \mathbb{R}^+$.
• There is some $\gamma > 0$ such that $t_i \tilde{w}^T x_i > \gamma$ for all $i$ and some suitable $\tilde{w}$ ($\gamma$ is called a finite margin).
Problem 5: Write down the perceptron learning algorithm.
while there is at least one misclassified x_i do
    pick one misclassified x_i
    update the weight: w = w + t_i * x_i
done
Problem 6: Given the following training set D of labeled 2D training inputs, find a separating hyperplane using
the perceptron learning rule. Illustrate the consecutive updates of the weight w with a series of plots (do not plot
the bias weight)!
D = {((−0.7, 0.8), +1), ((−0.9, 0.6), +1), ((−0.3, −0.2), +1), ((−0.6, 0.7), +1)}
∪ {((0.6, −0.8), −1), ((0.2, −0.5), −1), ((0.3, 0.2), −1)}
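The following is a runnable Python sketch of the update rule from Problem 5, applied to the data set D above (my own illustration, not part of the worksheet; handling the bias by appending a constant 1 to each input is an assumption, as is always picking the first misclassified point). The weight vectors it prints after each update, with the bias component omitted, are the ones to plot.

import numpy as np

X = np.array([[-0.7, 0.8], [-0.9, 0.6], [-0.3, -0.2], [-0.6, 0.7],
              [0.6, -0.8], [0.2, -0.5], [0.3, 0.2]])
t = np.array([+1, +1, +1, +1, -1, -1, -1])
Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant bias feature

w = np.zeros(3)                               # initial weight, k = 0
updates = 0
while True:
    wrong = [i for i in range(len(Xb)) if t[i] * (w @ Xb[i]) <= 0]
    if not wrong:                             # no misclassified point left
        break
    i = wrong[0]                              # pick one misclassified x_i
    w = w + t[i] * Xb[i]                      # perceptron update, learning rate 1
    updates += 1
    print(updates, w[:2])                     # weight after each update (bias omitted)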
You will now show that the perceptron algorithm converges in a finite number of updates (if the training data is
linearly separable).
Problem 7: Let $w^k$ be the $k$-th update of the weight during the perceptron algorithm. Show that $\tilde{w}^T w^k \geq k\gamma$. (Hint: How are $\tilde{w}^T w^k$ and $\tilde{w}^T w^{k-1}$ related?)
If the $k$-th update is triggered by a misclassified input $x_i$, then
$$\tilde{w}^T w^k = \tilde{w}^T w^{k-1} + t_i \tilde{w}^T x_i > \tilde{w}^T w^{k-1} + \gamma.$$
Starting from $\tilde{w}^T w^0 = 0$ (because $w^0 = 0$) and applying this relation $k$ times gives $\tilde{w}^T w^k \geq k\gamma$.
Use induction to formally prove the statement.
Problem 8: Show that $\|w^k\|^2 < kR^2$. Note that the algorithm updates the weights only in response to a mistake (i.e., $t_i x_i^T w^{k-1} \leq 0$ for some $i$). (Hint: Triangle inequality for the Euclidean norm.)
$$\begin{aligned}
\|w^k\|^2 &= \|w^{k-1} + t_i x_i\|^2 \\
&= \|w^{k-1}\|^2 + 2 t_i x_i^T w^{k-1} + \|x_i\|^2 \\
&\leq \|w^{k-1}\|^2 + \|x_i\|^2 \\
&< \|w^{k-1}\|^2 + R^2
\end{aligned}$$
The first inequality uses that $x_i$ was misclassified, i.e. $t_i x_i^T w^{k-1} \leq 0$, and the second uses $\|x_i\| < R$. Starting from $\|w^0\|^2 = 0$ and applying this chain $k$ times gives $\|w^k\|^2 < kR^2$.
Again, use induction to formally prove the statement.
Problem 9: Consider the cosine of the angle between $\tilde{w}$ and $w^k$ and derive
$$k \leq \frac{R^2 \|\tilde{w}\|^2}{\gamma^2}.$$
Now consider a new data set, D′ (again 2D inputs and two different classes):
D′ = {((0, 0), +1), ((−0.1, 0.1), +1), ((−0.3, −0.2), +1), ((0.2, 0.1), +1)}
    ∪ {((0.2, −0.1), +1), ((−1.1, −1.0), −1), ((−1.3, −1.2), −1), ((−1, −1), −1)}
    ∪ {((1, 1), −1), ((0.9, 1.2), −1), ((1.1, 1.0), −1)}
Using Problem 7 for the numerator and Problem 8 for the denominator,
$$1 \geq \cos(\tilde{w}, w^k) = \frac{\tilde{w}^T w^k}{\|\tilde{w}\|\,\|w^k\|} \geq \frac{k\gamma}{\|\tilde{w}\|\,\|w^k\|} \geq \frac{k\gamma}{\sqrt{kR^2\|\tilde{w}\|^2}} = \frac{\sqrt{k}\,\gamma}{R\,\|\tilde{w}\|},$$
and rearranging $\sqrt{k}\,\gamma \leq R\|\tilde{w}\|$ gives $k \leq \frac{R^2\|\tilde{w}\|^2}{\gamma^2}$.
Problem 10: Can you separate this data with the perceptron algorithm? Why/why not?
No. This data set is not linearly separable: the positive point $(0, 0)$ lies on the line segment between the negative points $(-1, -1)$ and $(1, 1)$, so the convex hulls of the two classes intersect and, by Problem 1, no separating hyperplane exists. Consequently the perceptron algorithm never converges on this data.
Problem 11: Transform every input $x_i \in D'$ to $x_i'$ with $x_{i1}' = \exp\!\big(-\tfrac{\|x_i\|^2}{2}\big)$ and $x_{i2}' = \exp\!\big(-\tfrac{\|x_i - (1,1)\|^2}{2}\big)$. If the labels stay the same, are the $x_i'$ now linearly separable? Why/why not?
Yes, the transformed points are linearly separable. All positive points lie close to the origin, so their first feature $x_{i1}' = \exp(-\|x_i\|^2/2)$ is above 0.9, while all negative points lie close to $(1,1)$ or $(-1,-1)$, so their first feature is below 0.4. A decision boundary such as $x_1' = 0.65$ therefore separates the classes, which is exactly what a plot of the transformed inputs shows.
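This is easy to verify numerically (my own sketch, not part of the worksheet): after the transformation, the first feature alone already separates the two classes of D′.

import numpy as np

pos = np.array([[0, 0], [-0.1, 0.1], [-0.3, -0.2], [0.2, 0.1], [0.2, -0.1]])
neg = np.array([[-1.1, -1.0], [-1.3, -1.2], [-1, -1],
                [1, 1], [0.9, 1.2], [1.1, 1.0]])

def transform(X):
    # x'_1 = exp(-||x||^2 / 2), x'_2 = exp(-||x - (1,1)||^2 / 2)
    f1 = np.exp(-np.sum(X ** 2, axis=1) / 2)
    f2 = np.exp(-np.sum((X - np.array([1.0, 1.0])) ** 2, axis=1) / 2)
    return np.stack([f1, f2], axis=1)

print(transform(pos)[:, 0].min())   # about 0.94: every positive point has x'_1 > 0.9
print(transform(neg)[:, 0].max())   # about 0.37: every negative point has x'_1 < 0.4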
Submit to homework@class.brml.org with subject line homework sheet 6 by 2014/11/24, 23:59 CET