
CM146: Introduction to Machine Learning
Fall 2018 Final Exam

• This is a closed-book exam. Everything you need in order to solve the problems is supplied in the body of this exam.
• This exam booklet contains six problems.
• You have 150 minutes to earn a total of 120 points.
• Besides giving the correct answer, being concise and clear is very important. To get full credit, you must show your work and explain your answers.

Good Luck!

Name and ID:

True/False               /15
Naive Bayes              /14
Expectation Maximization /18
Kernels and SVM          /26
Learning Theory          /15
Short Answer Questions   /30
Total                    /120

1 True or False [15 pts]

Choose either True or False for each of the following statements. (If your answer is incorrect, partial credit may be given if your explanation is reasonable.)

(a) (3 pts) After mapping the features into a high-dimensional space with a proper kernel, a Perceptron may be able to achieve better classification performance on instances it wasn't able to classify before.
Solution: True.

(b) (3 pts) Given two classifiers A and B, if A has a lower VC dimension than B, then conceptually A is more likely to overfit the data.
Solution: False. A lower VC dimension means A is less expressive, and hence less likely to overfit.

(c) (3 pts) If two variables X, Y are conditionally independent given Z, then the variables X, Y are independent as well.
Solution: False. Conditional independence does not imply independence.

(d) (3 pts) The SGD algorithm for the soft-SVM optimization problem does not converge if the training samples are not linearly separable.
Solution: False. Soft-SVM is a convex optimization problem, and SGD can converge to the global optimum.

(e) (3 pts) In the AdaBoost algorithm, at each iteration we increase the weights of misclassified examples.
Solution: True.

2 Naive Bayes [14 pts]

Imagine you are given the following set of training examples. Each feature can take one of three nominal values: a, b, or c.
F1  F2  F3  C
a   c   a   +
c   a   c   +
a   a   c   -
b   c   a   -
c   c   b   -

(a) (3 pts) What modeling assumptions does the Naive Bayes model make?
Solution: The joint probability can be written as P(C, F1, F2, F3) = P(C) ∏_{i=1}^{3} P(Fi | C). Alternatively, you can say: given the class, the features are independent, hence P(F1, F2, F3 | C) = ∏_{i=1}^{3} P(Fi | C).

(b) (3 pts) How many parameters do we need to learn from the data in the model you defined in (a)?
Solution: For a Naive Bayes classifier, we need to estimate P(C = +), and for each class and each feature two of the three value probabilities, e.g., P(F1 = a | C = +), P(F1 = b | C = +), ..., P(F3 = b | C = −). The remaining probabilities can be obtained from the constraint that probabilities sum to 1. So we need to estimate (3 − 1) × 3 × 2 + 1 = 13 parameters.

(c) (3 pts) How many parameters do we have to estimate if we do not make the assumption as in Naive Bayes?
Solution: Without the conditional independence assumption, we still need to estimate P(C = +). For C = +, we need the probability of every assignment of (F1, F2, F3), i.e., 3^3 possibilities. With the constraint that the probabilities sum to 1, that is 3^3 − 1 = 26 parameters for C = +. Therefore the total number of parameters is 1 + 2(3^3 − 1) = 53.

(d) (3 pts) We use the maximum likelihood principle to estimate the model parameters in (a). Show the numerical solutions of any three of the model parameters.
Solution: The parameters can be estimated by counting.
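As an illustration, this counting can be sketched in Python. The dataset literal below is the training table (with class labels reconstructed so that P(C = +) = 2/5), and the helper names `prior` and `cond` are ours, not part of the exam:

```python
from fractions import Fraction

# The 5 training examples: (F1, F2, F3, C); labels assume P(C=+) = 2/5.
data = [
    ("a", "c", "a", "+"),
    ("c", "a", "c", "+"),
    ("a", "a", "c", "-"),
    ("b", "c", "a", "-"),
    ("c", "c", "b", "-"),
]

def prior(c):
    # MLE of P(C = c): fraction of examples with class c.
    return Fraction(sum(1 for row in data if row[3] == c), len(data))

def cond(feature_idx, value, c):
    # MLE of P(F_{feature_idx+1} = value | C = c): count within class c.
    rows = [row for row in data if row[3] == c]
    return Fraction(sum(1 for row in rows if row[feature_idx] == value), len(rows))

print(prior("+"))         # 2/5
print(cond(0, "a", "+"))  # P(F1 = a | C = +) = 1/2
print(cond(0, "b", "-"))  # P(F1 = b | C = -) = 1/3
```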
You can write any three of the following parameters:
P(C = +) = 2/5
P(F1 = a | C = +) = 1/2,  P(F1 = a | C = −) = 1/3
P(F1 = b | C = +) = 0,    P(F1 = b | C = −) = 1/3
P(F1 = c | C = +) = 1/2,  P(F1 = c | C = −) = 1/3
P(F2 = a | C = +) = 1/2,  P(F2 = a | C = −) = 1/3
P(F2 = c | C = +) = 1/2,  P(F2 = c | C = −) = 2/3
P(F3 = a | C = +) = 1/2,  P(F3 = a | C = −) = 1/3
P(F3 = c | C = +) = 1/2,  P(F3 = c | C = −) = 1/3

(e) (2 pts) In two sentences, describe the difference between logistic regression and Naive Bayes.
Solution: They are both probabilistic models. However, Naive Bayes is a generative model (modeling P(x, y)), while logistic regression is a discriminative model (modeling P(y | x)).

3 Expectation Maximization [18 pts]

Suppose that somebody gave you a bag with two biased coins, having probabilities p1 and p2 of coming up heads, respectively. You are supposed to figure out p1 and p2 by tossing the coins repeatedly. You will repeat the following experiment n times: pick a coin uniformly at random from the bag (i.e., each coin has probability 1/2 of being picked) and toss it, recording the outcome. The coin is then returned to the bag. Assume that each experiment is independent.

(a) (4 pts) Suppose the two coins have different colors: the white coin has probability p1 of showing heads, while the yellow coin has probability p2. Based on color, you know which coin you tossed during the experiments. After n tosses, the numbers of heads and tails of the white coin are H1 and T1, respectively, and the numbers of heads and tails of the yellow coin are H2 and T2. Write down the likelihood function.
Solution: C(H1+T1, H1) p1^H1 (1 − p1)^T1 · C(H2+T2, H2) p2^H2 (1 − p2)^T2, where C(n, k) denotes the binomial coefficient. The constant terms can be ignored: p1^H1 (1 − p1)^T1 p2^H2 (1 − p2)^T2.

(b) (5 pts) Based on the above likelihood function,
derive the maximum likelihood estimators for the two parameters p1 and p2 (please provide the detailed derivations).
Solution: Intuitively, p1 is the frequency of observing heads when tossing the white coin, and p2 is the frequency of observing heads when tossing the yellow coin. Mathematically, taking the partial derivatives of the log-likelihood with respect to p1 and p2 and setting them to zero gives

H1/p1 − T1/(1 − p1) = 0  and  H2/p2 − T2/(1 − p2) = 0.

Therefore, p1 = H1/(H1 + T1) and p2 = H2/(H2 + T2).

(c) (4 pts) Now suppose both coins look identical, hence the identity of the coin is missing from your data. After n tosses, the numbers of heads and tails you got are H and T, respectively. Write down the likelihood function. Describe the challenge of maximizing this likelihood function.
Solution: (1/2) p1^H (1 − p1)^T + (1/2) p2^H (1 − p2)^T. The challenge is that the log-likelihood now contains the logarithm of a sum, so setting its derivatives to zero does not yield a closed-form solution.

(d) (5 pts) Describe how to optimize the likelihood function in the previous question with the EM algorithm (please provide sufficient details).
Solution:
Step 1 (Initialization): Guess the probability that a given data point came from coin 1 or coin 2; generate fictional labels, weighted according to this probability.
Step 2 (Maximum conditional likelihood): Compute the most likely values of the parameters p1, p2 under these weighted labels.
Step 3 (Likelihood estimation): Compute the likelihood of the data given this model, and repeat from Step 1 with the updated probabilities until convergence.

4 Kernels and SVM [26 pts]

(a) In this question we will define kernels, study some of their properties, and develop one specific kernel.

i. (2 pts) Circle the correct option below: A function K(x, z) is a valid kernel if it corresponds to the {"inner (dot) product", "sum"}, in some feature space, of the feature representations that correspond to x and z.
Solution: inner (dot) product.

ii. (10 pts) In the next few questions we guide you to prove the following property of kernels:
Linear Combination Property: if k_i(x, x′) are valid kernels for all i, and c_i > 0 are constants, then k(x, x′) = Σ_i c_i k_i(x, x′) is also a valid kernel.
• (5 pts) Given a valid kernel k1(x, x′) and a constant c > 0, use the definition above to show that k(x, x′) = c·k1(x, x′) is also a valid kernel.
Solution: k1(x, x′) = φ(x) · φ(x′) for some feature map φ. We can define a new feature map φ′(x) = √c · φ(x); then k(x, x′) = c·k1(x, x′) = φ′(x) · φ′(x′).

• (5 pts) Given valid kernels k1(x, x′) and k2(x, x′), use the definition above to show that k(x, x′) = k1(x, x′) + k2(x, x′) is also a valid kernel.
Solution: k1(x, x′) = φ1(x) · φ1(x′), where φ1 is a feature map φ1(x) = <f1(x), f2(x), ..., fn(x)>. Similarly, k2(x, x′) = φ2(x) · φ2(x′), where φ2(x) = <g1(x), g2(x), ..., gm(x)>. Now we define a new feature map of dimensionality n + m by concatenation: φ3(x) = <f1(x), ..., fn(x), g1(x), ..., gm(x)>. We get φ3(x) · φ3(x′) = φ1(x) · φ1(x′) + φ2(x) · φ2(x′) = k1(x, x′) + k2(x, x′) = k(x, x′).

(b) Let {(xi, yi)}_{i=1}^{l} be a set of l training pairs of feature vectors and labels. We consider binary classification and assume yi ∈ {−1, +1} for all i. The following is the primal formulation of L2-SVM, a variant of the standard SVM obtained by squaring the hinge loss:

min_{w, b, ξ}  (1/2) wᵀw + (C/2) Σ_i ξi²
s.t.  yi(wᵀxi + b) ≥ 1 − ξi,  ∀i ∈ {1, ..., l}                (1)
      ξi ≥ 0,  ∀i ∈ {1, ..., l}

i. (4 pts) Derive an unconstrained optimization problem that is equivalent to Eq. (1).
Solution:
min_{w, b}  (1/2) wᵀw + (C/2) Σ_i max(0, 1 − yi(wᵀxi + b))²

ii. (6 pts) Show that removing the last set of constraints {ξi ≥ 0, ∀i} does not change the optimal solution to the primal problem.
Solution: Let w*, ξ*, b* be the optimal solution to the problem without the last set of constraints. It suffices to show that ξi* ≥ 0 for all i ∈ {1, ..., l}. Suppose this is not the case; then there exists some index j with ξj* < 0.
Then we have yj(w*ᵀxj + b*) ≥ 1 − ξj* > 1, implying that ξj′ = 0 is feasible and yet gives a smaller objective, since (ξj′)² = 0 < (ξj*)², a contradiction to the assumption that ξj* is optimal.

iii. (4 pts) Given the following dataset in 1-d space, which consists of 4 positive data points {0, 1, 2, 3} and 3 negative data points {−3, −2, −1}:
• If C = 0, please list all the support vectors.
Solution: {−3, −2, −1, 0, 1, 2, 3}
• If C → ∞, please list all the support vectors.
Solution: {−1, 0}

5 PAC Learning and VC Dimension [15 pts]

(a) (6 pts) Consider a learning problem in which x ∈ R is a real number, and the hypothesis space is a set of intervals H = {(a < x < b) | a, b ∈ R}. Note that a hypothesis labels points inside its interval as positive, and negative otherwise. What is the VC dimension of H?
Solution: The VC dimension is 2.
Proof: Any two points x1 < x2 can be shattered by H, no matter how they are labeled:
(a) x1 positive and x2 negative: choose a < x1 < b < x2;
(b) x1 negative and x2 positive: choose x1 < a < x2 < b;
(c) both x1 and x2 positive: choose a < x1 < x2 < b;
(d) both x1 and x2 negative: choose a < b < x1 < x2.
However, three points x1 < x2 < x3 labeled x1 (positive), x2 (negative), x3 (positive) cannot be shattered by H.

(b) (5 pts) The sample complexity of a PAC-learnable hypothesis class H is given by

m ≥ log(|H|/δ) / ε.                (2)

In three sentences, explain the meaning of ε and δ and the meaning of the inequality.
Solution: For every distribution D over X and a fixed ε, given m examples sampled i.i.d., the algorithm L produces, with probability at least 1 − δ, a hypothesis h ∈ H that has error at most ε. The inequality is an upper bound on the sample complexity of learning any finite function class.

(c) (4 pts) Now suppose we have a training set with 25 examples and our model has an error of 0.32 on the test set. Based on Eq.
(2), how many training examples may we need to reduce the error rate to 0.15? (You only need to write the formula.)
Solution: 25 × 0.32/0.15, since m scales as 1/ε.

6 Short Answer Questions [30 pts]

Most of the following questions can be answered in one or two sentences. Please make your answers concise and to the point.

(a) (2 pts) Multiple choice: for a neural network, which one of the following design choices affects the trade-off between underfitting and overfitting the most?
i. The learning rate
ii. The number of hidden nodes
iii. The initialization of model weights
Solution: The number of hidden nodes.

(b) (4 pts) Describe the difference between the maximum likelihood (MLE) and maximum a posteriori (MAP) principles, and state under what condition MAP reduces to MLE.
Solution: MLE maximizes the likelihood function, while MAP maximizes the posterior distribution P(model | data), equivalently P(model) P(data | model). MAP reduces to MLE when the prior distribution over models is uniform.

(c) (6 pts) If a data point y follows the Poisson distribution with rate parameter θ, then the probability of a single observation y is given by

p(y; θ) = θ^y e^{−θ} / y!

You are given data points y1, y2, ..., yn independently drawn from a Poisson distribution with parameter θ. What is the MLE of θ? (Hint: write down the log-likelihood as a function of θ.)
Solution: The log-likelihood is

Σ_{i=1}^{n} (yi log θ − θ − log yi!) = (log θ) Σ_{i=1}^{n} yi − nθ − Σ_{i=1}^{n} log yi!

Its first derivative with respect to θ is (Σ_{i=1}^{n} yi)/θ − n. Setting it to 0, we get θ = (Σ_{i=1}^{n} yi)/n.

(d) (6 pts) Given vectors x and z in R³, define the kernel Kβ(x, z) = (β + x · z)² for any value β > 0. Find the corresponding feature map φβ(·).
Solution: Write x = [x1, x2, x3]ᵀ and z = [z1, z2, z3]ᵀ.
Now,

Kβ(x, z) = (β + x · z)²
         = (β + (x1 z1 + x2 z2 + x3 z3))²
         = β² + 2β(x1 z1 + x2 z2 + x3 z3) + (x1 z1 + x2 z2 + x3 z3)²
         = β² + 2β x1 z1 + 2β x2 z2 + 2β x3 z3 + x1² z1² + x2² z2² + x3² z3² + 2 x1 z1 x2 z2 + 2 x1 z1 x3 z3 + 2 x2 z2 x3 z3.

Comparing this with Kβ(x, z) = φβ(x) · φβ(z),

φβ(x) = [β, √(2β) x1, √(2β) x2, √(2β) x3, √2 x1 x2, √2 x1 x3, √2 x2 x3, x1², x2², x3²]ᵀ.

(e) (2 pts) Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications, using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or a generative classifier? Why?
Solution: For density estimation, we need to model P(X | Y; θ), so we need a generative classifier.

(f) (5 pts) We consider Boolean functions in the class L_{10,30,100}. This is the class of "10 out of 30 out of 100", defined over {x1, x2, ..., x100}. Recall that a function in the class L_{10,30,100} is defined by a set of 30 relevant variables. An example x ∈ {0, 1}^{100} is positive if and only if at least 10 of these 30 variables are on. In the following discussion, for the sake of simplicity, whenever we consider a member of L_{10,30,100}, we will consider the function f in which the first 30 coordinates are the relevant coordinates. Show a linear threshold function h that behaves just like f ∈ L_{10,30,100} on {0, 1}^{100}.
Solution: sign(Σ_{i=1}^{30} xi − 10)

(g) (5 pts) Describe the model assumptions of the Gaussian Mixture Model (GMM). Is GMM a discriminative model or a generative model?
Solution: GMM is a generative model. It assumes each data point comes from one of K Gaussian distributions. Let z be a random variable indicating which cluster x belongs to:

P(x) = Σ_{k=1}^{K} wk N(x | μk, Σk),

where wk ≥ 0 and Σ_k wk = 1; wk is the prior probability p(z = k) of cluster k, and N(x | μk, Σk) is a Gaussian distribution modeling P(x | z = k).
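The GMM generative story above can be sketched in 1-D Python code. This is a minimal illustration, not part of the exam: the weights, means, and standard deviations below are made-up values, and `sample`/`density` are hypothetical helper names.

```python
import math
import random

# Illustrative 1-D GMM; weights/means/stds are assumed, made-up values.
weights = [0.3, 0.7]   # w_k >= 0 and sum to 1: the prior P(z = k)
means = [-2.0, 3.0]
stds = [1.0, 0.5]

def sample():
    # Generative process: first draw a cluster z from the prior,
    # then draw x | z from N(mu_z, sigma_z^2).
    z = random.choices(range(len(weights)), weights=weights)[0]
    return random.gauss(means[z], stds[z])

def density(x):
    # P(x) = sum_k w_k * N(x | mu_k, sigma_k^2): marginalize out the latent z.
    return sum(
        w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
```

Note that `density` never reveals which component generated a point; that latent assignment z is exactly what the E-step of EM estimates when fitting a GMM.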