
CM146: Introduction to Machine Learning
Fall 2018
Final Exam
• This is a closed book exam. Everything you need in order to solve the problems is supplied
in the body of this exam.
• This exam booklet contains six problems.
• You have 150 minutes to earn a total of 120 points.
• Besides giving the correct answer, being concise and clear is very important. To receive
full credit, you must show your work and explain your answers.
Good Luck!
Name and ID: ______________________________

Section                        Score
Name and ID                     /2
True/False                      /15
Naive Bayes                     /14
Expectation Maximization        /18
Kernels and SVM                 /26
Learning Theory                 /15
Short Answer Questions          /30
Total                           /120
1 True or False [15 pts]
Choose either True or False for each of the following statements. (If your answer is incorrect, partial
credit may be given if your explanation is reasonable.)
(a) (3 pts) After mapping the features into a high-dimensional space with a proper kernel, a Perceptron
may be able to achieve better classification performance on instances it wasn't able to classify
before.
Solution: True
(b) (3 pts) Given two classifiers A and B, if A has a lower VC-dimension than B, then conceptually A is more likely to overfit the data.
Solution: False. Lower VC-dimension means A is less expressive and is less likely to overfit.
(c) (3 pts) If two variables X, Y are conditionally independent given Z, then the variables X, Y
are independent as well.
Solution: False. Conditional independence does not imply (marginal) independence.
(d) (3 pts) The SGD algorithm for the soft-SVM optimization problem does not converge if the
training samples are not linearly separable.
Solution: False. Soft-SVM is a convex optimization problem, and SGD can converge to the
global optimum even when the data are not linearly separable.
(e) (3 pts) In the AdaBoost algorithm, at each iteration, we increase the weight for misclassified
examples.
Solution: True
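For reference, a minimal sketch of the AdaBoost reweighting step referred to in (e). This is not part of the original exam; it assumes labels and predictions in {−1, +1} and a weighted error strictly between 0 and 1.

    # AdaBoost reweighting step (illustrative sketch, assuming labels in {-1, +1}).
    import numpy as np

    def reweight(D, y, pred):
        """One round of AdaBoost reweighting.
        D: current example weights (sums to 1); y, pred: true/predicted labels in {-1, +1}.
        Assumes the weighted error is strictly between 0 and 1."""
        eps = np.sum(D[pred != y])                 # weighted training error
        alpha = 0.5 * np.log((1 - eps) / eps)      # weight of the current weak classifier
        D_new = D * np.exp(-alpha * y * pred)      # misclassified points (y * pred = -1) are up-weighted
        return D_new / D_new.sum(), alpha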
2 Naive Bayes [14 pts]
Imagine you are given the following set of training examples. Each feature can take one of three
nominal values: a, b, or c.
F1   F2   F3   C
a    c    a    +
c    a    c    +
a    a    c    -
b    c    a    -
c    c    b    -
(a) (3 pts) What modeling assumptions does the Naive Bayes model make?
Solution: The joint probability factorizes as P(C, F1, F2, F3) = P(C) ∏_{i=1}^{3} P(Fi | C).
Alternatively, you can say: given the class, the features are conditionally independent, hence
P(F1, F2, F3 | C) = ∏_{i=1}^{3} P(Fi | C).
(b) (3 pts) How many parameters do we need to learn from the data in the model you defined
in (a)?
Solution: For a naive Bayes classifier, we need to estimate P(C = +), P(F1 = a | C = +),
P(F1 = b | C = +), P(F2 = a | C = +), P(F2 = b | C = +), P(F3 = a | C = +),
P(F3 = b | C = +), P(F1 = a | C = −), P(F1 = b | C = −), P(F2 = a | C = −),
P(F2 = b | C = −), P(F3 = a | C = −), P(F3 = b | C = −). The remaining probabilities
follow from the constraint that the probabilities sum up to 1. So we need to estimate
(3 − 1) ∗ 3 ∗ 2 + 1 = 13 parameters.
(c) (3 pts) How many parameters do we have to estimate if we do not make the assumption as
in Naive Bayes?
Solution: Without the conditional independence assumption, we still need to estimate
P(C = +). For C = +, we need to know the probability of every assignment of (F1, F2, F3),
i.e., 3^3 = 27 possible values of (F1, F2, F3). With the constraint that the probabilities sum
up to 1, we need to estimate 3^3 − 1 = 26 parameters for C = +, and likewise for C = −.
Therefore the total number of parameters is 1 + 2(3^3 − 1) = 53.
(d) (3 pts) We use the maximum likelihood principle to estimate the model parameters in (a).
Show the numerical solutions of any three of the model parameters.
Solution: The parameters can be estimated by counting. You can write any three of the
following parameters:
P(C = +) = 2/5
P(F1 = a | C = +) = 1/2,  P(F1 = a | C = −) = 1/3
P(F1 = b | C = +) = 0,    P(F1 = b | C = −) = 1/3
P(F1 = c | C = +) = 1/2,  P(F1 = c | C = −) = 1/3
P(F2 = a | C = +) = 1/2,  P(F2 = a | C = −) = 1/3
P(F2 = c | C = +) = 1/2,  P(F2 = c | C = −) = 2/3
P(F3 = a | C = +) = 1/2,  P(F3 = a | C = −) = 1/3
P(F3 = c | C = +) = 1/2,  P(F3 = c | C = −) = 1/3
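For reference, the counting above can be reproduced with a short script. This is a minimal sketch, not part of the original exam; the data list simply re-encodes the five training rows from the table.

    # Naive Bayes MLE by counting (illustrative sketch).
    from collections import Counter

    data = [("a", "c", "a", "+"), ("c", "a", "c", "+"),
            ("a", "a", "c", "-"), ("b", "c", "a", "-"), ("c", "c", "b", "-")]

    labels = Counter(row[-1] for row in data)
    print("P(C=+) =", labels["+"] / len(data))            # 2/5

    def cond_prob(feature_idx, value, label):
        """MLE of P(F_{feature_idx+1} = value | C = label) as a relative frequency."""
        rows = [r for r in data if r[-1] == label]
        return sum(r[feature_idx] == value for r in rows) / len(rows)

    print("P(F1=a|C=+) =", cond_prob(0, "a", "+"))        # 1/2
    print("P(F2=c|C=-) =", cond_prob(1, "c", "-"))        # 2/3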
(e) (2 pts) In two sentences, describe the difference between logistic regression and Naive Bayes.
Solution: They are both probabilistic models. However, Naive Bayes is a generative model
(modeling P (x, y)), while logistic regression is a discriminative model (modeling P (y|x)).
3 Expectation Maximization [18 pts]
Suppose that somebody gave you a bag with two biased coins, which show heads with probability
p1 and p2, respectively. You are supposed to figure out p1 and p2 by tossing the coins
repeatedly. You will repeat the following experiment n times: pick a coin uniformly at random
from the bag (i.e., each coin has probability 1/2 of being picked) and toss it, recording the outcome.
The coin is then returned to the bag. Assume that each experiment is independent.
(a) (4 pts) Suppose the two coins have different colors: the white coin shows heads with probability p1,
while the yellow coin shows heads with probability p2. Based on color, you know which
coin you tossed during the experiments. After n tosses, the numbers of heads and tails of the
white coin are H1 and T1, respectively, and the numbers of heads and tails of the yellow coin
are H2 and T2. Write down the likelihood function.
Solution: L(p1, p2) = C(H1+T1, H1) p1^{H1} (1 − p1)^{T1} · C(H2+T2, H2) p2^{H2} (1 − p2)^{T2},
where C(n, k) denotes a binomial coefficient. The constant terms can be ignored:
p1^{H1} (1 − p1)^{T1} p2^{H2} (1 − p2)^{T2}.
(b) (5 pts) Based on the above likelihood function, derive the maximum likelihood estimators
for the two parameters p1 and p2 (please provide the detailed derivations).
Solution: Intuitively, p1 is the frequency of observing heads when tossing the white coin,
and p2 is the frequency of observing heads when tossing the yellow coin. Mathematically, taking
the partial derivatives of the log-likelihood with respect to p1 and p2 and setting them to zero gives

    H1/p1 − T1/(1 − p1) = 0,

and

    H2/p2 − T2/(1 − p2) = 0.

Therefore, p1 = H1/(H1 + T1) and p2 = H2/(H2 + T2).
(c) (4 pts) Now suppose both coins look identical, hence the identity of the coin is missing from
your data. After n tosses, the numbers of heads and tails we got are H and T, respectively.
Write down the likelihood function, and describe the challenge of maximizing this likelihood
function.
Solution: (1/2) p1^{H} (1 − p1)^{T} + (1/2) p2^{H} (1 − p2)^{T}. The challenge is that the
identity of the coin is latent: the likelihood involves a sum over the hidden choice of coin, so
its logarithm does not decompose and there is no closed-form maximizer.
(d) (5 pts) Describe how to optimize the likelihood function in the previous question by the EM
algorithm (please provide sufficient details).
Solution:
Step 1 (Initialization): Guess the probability that a given data point (toss) came from coin 1 or coin 2;
generate fictional labels, weighted according to this probability.
Step 2 (Maximum likelihood): Using these weighted labels, compute the most likely values of the parameters p1, p2.
Step 3 (Likelihood evaluation): Compute the likelihood of the data given this model, then return to Step 1
(recomputing the label probabilities under the new parameters) and repeat until convergence.
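A minimal numerical sketch of these steps (not part of the exam). It assumes, as in the problem setup, that each toss independently picks one of the two coins with probability 1/2, and encodes each recorded toss as 1 for heads and 0 for tails:

    # EM for a mixture of two biased coins (illustrative sketch).
    import numpy as np

    def em_two_coins(tosses, iters=100):
        """tosses: array of 0/1 outcomes; returns estimates of (p1, p2)."""
        x = np.asarray(tosses, dtype=float)
        p1, p2 = 0.6, 0.4                                   # arbitrary initialization
        for _ in range(iters):
            # E-step: posterior probability that each toss came from coin 1
            l1 = 0.5 * p1**x * (1 - p1)**(1 - x)
            l2 = 0.5 * p2**x * (1 - p2)**(1 - x)
            r1 = l1 / (l1 + l2)
            # M-step: weighted frequency of heads for each coin
            p1 = np.sum(r1 * x) / np.sum(r1)
            p2 = np.sum((1 - r1) * x) / np.sum(1 - r1)
        return p1, p2

    rng = np.random.default_rng(0)
    coin = rng.integers(0, 2, size=2000)                    # which coin was picked each time
    data = rng.binomial(1, np.where(coin == 0, 0.9, 0.3))   # simulated tosses
    print(em_two_coins(data))                               # near (0.9, 0.3), up to label swap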
4 Kernels and SVM [26 pts]
(a) In this question we will define kernels, study some of their properties and develop one specific
kernel.
i. (2 pts) Circle the correct option below:
A function K(x, z) is a valid kernel if it corresponds to the {"inner (dot) product", "sum"},
in some feature space, of the feature representations that correspond to x and z.
Solution: inner (dot) product
ii. (10 pts) In the next few questions we guide you to prove the following property of
kernels:
Linear Combination Property: if k_i(x, x′) are valid kernels for all i, and c_i > 0 are constants,
then k(x, x′) = Σ_i c_i k_i(x, x′) is also a valid kernel.
• (5 pts) Given a valid kernel k1(x, x′) and a constant c > 0, use the definition above
to show that k(x, x′) = c k1(x, x′) is also a valid kernel.
Solution: k1(x, x′) = φ(x) · φ(x′). We can define a new feature mapping
φ′(x) = √c φ(x); then k(x, x′) = c k1(x, x′) = φ′(x) · φ′(x′).
• (5 pts) Given valid kernels k1(x, x′) and k2(x, x′), use the definition above to show
that k(x, x′) = k1(x, x′) + k2(x, x′) is also a valid kernel.
Solution: k1(x, x′) = φ1(x) · φ1(x′), where φ1 is a feature mapping function:
φ1(x) = <f1(x), f2(x), ..., fn(x)>.
Similarly, k2(x, x′) = φ2(x) · φ2(x′), where φ2(x) = <g1(x), g2(x), ..., gm(x)>.
Now define a new feature mapping of dimensionality n + m by concatenation:
φ3(x) = <f1(x), ..., fn(x), g1(x), ..., gm(x)>.
Then φ3(x) · φ3(x′) = φ1(x) · φ1(x′) + φ2(x) · φ2(x′) = k1(x, x′) + k2(x, x′) = k(x, x′).
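A quick numerical check of the concatenation argument (an illustrative sketch, not from the exam; the two feature maps below are arbitrary examples with known kernels):

    # Check: concatenating feature maps implements the sum of the two kernels (sketch).
    import numpy as np

    phi1 = lambda x: x                        # linear kernel k1(x, z) = x . z
    phi2 = lambda x: np.append(x**2, 1.0)     # a second explicit feature map

    def k_sum(x, z):
        return phi1(x) @ phi1(z) + phi2(x) @ phi2(z)

    def phi3(x):                              # concatenated (n + m)-dimensional map
        return np.concatenate([phi1(x), phi2(x)])

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=3), rng.normal(size=3)
    assert np.isclose(phi3(x) @ phi3(z), k_sum(x, z))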
(b) Let {(xi, yi)}_{i=1}^{l} be a set of l training pairs of feature vectors and labels. We consider binary
classification and assume yi ∈ {−1, +1} for all i. The following is the primal formulation of L2-SVM,
a variant of the standard SVM obtained by squaring the hinge loss:

    min_{w, b, ξ}   (1/2) w^T w + (C/2) Σ_i ξi^2
    s.t.            yi(w^T xi + b) ≥ 1 − ξi,   ∀i ∈ {1, ..., l}        (1)
                    ξi ≥ 0,                    ∀i ∈ {1, ..., l}
i. (4 pts) Derive an unconstrained optimization problem that is equivalent to Eq. (1).
Solution:

    min_{w, b}   (1/2) w^T w + (C/2) Σ_i max(0, 1 − yi(w^T xi + b))^2
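For illustration (not part of the exam), the unconstrained objective above is convex and can be minimized directly by gradient descent; a minimal sketch, where X and y are an assumed (n, d) feature matrix and (n,) label vector in {−1, +1}:

    # Gradient descent on the unconstrained L2-SVM objective (sketch).
    import numpy as np

    def l2_svm_gd(X, y, C=1.0, lr=0.01, iters=1000):
        """X: (n, d) features; y: (n,) labels in {-1, +1}. Returns (w, b)."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(iters):
            margin = y * (X @ w + b)
            xi = np.maximum(0.0, 1.0 - margin)        # squared-hinge slacks
            # gradient of (1/2)||w||^2 + (C/2) * sum(xi^2)
            grad_w = w - C * (xi * y) @ X
            grad_b = -C * np.sum(xi * y)
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b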
ii. (6 pts) Show that removing the last set of constraints {ξi ≥ 0, ∀i} does not change the
optimal solution to the primal problem.
Solution: Let (w∗, b∗, ξ∗) be the optimal solution to the problem without the last set of
constraints. It suffices to show that ξj∗ ≥ 0 for all j ∈ {1, ..., l}. Suppose this is not the case;
then there exists some j with ξj∗ < 0, and

    yj(w∗^T xj + b∗) ≥ 1 − ξj∗ > 1,

so replacing ξj∗ with ξj′ = 0 remains feasible and yields a strictly smaller objective, since
(ξj′)^2 = 0 < (ξj∗)^2, a contradiction to the optimality of (w∗, b∗, ξ∗).
iii. (4 pts) Consider the following dataset in 1-d space, which consists of 4 positive data points
{0, 1, 2, 3} and 3 negative data points {−3, −2, −1}.
• if C = 0, please list all the support vectors. Solution: {−3, −2, −1, 0, 1, 2, 3}
• if C → ∞, please list all the support vectors.
Solution: {−1, 0}
5 PAC learning and VC dimension [15 pts]
(a) (6 pts) Consider a learning problem in which x ∈ R is a real number, and the hypothesis
space is a set of intervals H = {(a < x < b)|a, b ∈ R}. Note that the hypothesis labels points
inside the interval as positive, and negative otherwise. What is the VC dimension of H?
Solution: The VC dimension is 2. Proof: Suppose we have two points x1 and x2 , and
x1 < x2 . They can always be shattered by H, no matter how they are labeled.
(a) if x1 positive and x2 negative, choose a < x1 < b < x2 ;
(b) if x1 negative and x2 positive, choose x1 < a < x2 < b;
(c) if both x1 and x2 positive, choose a < x1 < x2 < b;
(d) if both x1 and x2 negative, choose a < b < x1 < x2 ;
However, if we have three points x1 < x2 < x3 and if they are labeled as x1 (positive) x2
(negative) and x3 (positive), then they cannot be shattered by H.
(b) (5 pts) The sample complexity of a PAC-learnable hypothesis class H is given by

    m ≥ (1/ε) log(|H|/δ).        (2)

In three sentences, explain the meaning of ε and δ and the meaning of the inequality.
Solution: For every distribution D over X and a fixed ε, given m examples sampled i.i.d. from D,
the learning algorithm L produces, with probability at least 1 − δ, a hypothesis h ∈ H that has error
at most ε. The inequality gives an upper bound on the sample complexity of learning any
finite hypothesis class.
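For concreteness, a small sketch (not from the exam) that evaluates the bound in Eq. (2) for assumed values of |H|, ε, and δ:

    # Evaluate the sample-complexity bound m >= (1/eps) * log(|H| / delta)  (sketch).
    import math

    def sample_bound(H_size, eps, delta):
        return math.ceil((1.0 / eps) * math.log(H_size / delta))

    print(sample_bound(H_size=1000, eps=0.1, delta=0.05))   # assumed example values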
(c) (4 pts) Now suppose we have a training set with 25 examples and our model has an error of
0.32 on the test set. Based on Eq. (2), how many training examples may we need to reduce
the error rate to 0.15? (You only need to list the formulation.)
Solution: 25 ∗ 0.32/0.15 ≈ 53, since by Eq. (2) the required number of examples scales as 1/ε.
6 Short Answer Questions [30 pts]
Most of the following questions can be answered in one or two sentences. Please make your answer
concise and to the point.
(a) (2 pts) Multiple choice: for a neural network, which one of the following design choices
affects the trade-off between underfitting and overfitting the most?
i. The learning rate
ii. The number of hidden nodes
iii. The initialization of model weights
Solution: The number of hidden nodes
(b) (4 pts) Describe the difference between the maximum likelihood (MLE) and maximum a posteriori
(MAP) principles. Under what condition does MAP reduce to MLE?
Solution:
MLE maximizes the likelihood function, while MAP maximizes the posterior distribution
(i.e., P(model | data), which is proportional to P(model) P(data | model)). MAP reduces to
MLE when the prior distribution over models is uniform.
(c) (6 pts) If a data point y follows the Poisson distribution with rate parameter θ, then the
probability of a single observation y is given by

    p(y; θ) = θ^y e^{−θ} / y!

You are given data points y1, y2, ..., yn independently drawn from a Poisson distribution with
parameter θ. What is the MLE of θ? (Hint: write down the log-likelihood as a function of θ.)
Solution: The log-likelihood is

    Σ_{i=1}^{n} (yi log θ − θ − log yi!) = (log θ) Σ_{i=1}^{n} yi − nθ − Σ_{i=1}^{n} log yi!.

Its derivative with respect to θ is

    (Σ_{i=1}^{n} yi) / θ − n.

Setting it to 0, we get

    θ = (Σ_{i=1}^{n} yi) / n.
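As a sanity check (not from the exam), a short sketch that maximizes the Poisson log-likelihood numerically on assumed toy data and compares the maximizer to the sample mean:

    # Numerical check that the sample mean maximizes the Poisson log-likelihood (sketch).
    import numpy as np
    from math import lgamma

    y = np.array([2, 4, 1, 3, 5, 0, 2])                       # assumed toy data

    def loglik(theta):
        return np.sum(y * np.log(theta) - theta - np.array([lgamma(v + 1) for v in y]))

    thetas = np.linspace(0.5, 6.0, 200)
    best = thetas[np.argmax([loglik(t) for t in thetas])]
    print(best, y.mean())                                      # both close to the MLE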
(d) (6 pts) Given vectors x and z in R^3, define the kernel Kβ(x; z) = (β + x · z)^2 for any value
β > 0. Find the corresponding feature map φβ(·).
Solution: x = [x1, x2, x3]^T, z = [z1, z2, z3]^T.
Now,

    Kβ(x; z) = (β + x · z)^2
             = (β + (x1z1 + x2z2 + x3z3))^2
             = β^2 + 2β(x1z1 + x2z2 + x3z3) + (x1z1 + x2z2 + x3z3)^2
             = β^2 + 2βx1z1 + 2βx2z2 + 2βx3z3 + x1^2 z1^2 + x2^2 z2^2 + x3^2 z3^2
               + 2x1z1x2z2 + 2x1z1x3z3 + 2x2z2x3z3

Comparing this with Kβ(x; z) = φβ(x) · φβ(z),

    φβ(x) = [β, √(2β) x1, √(2β) x2, √(2β) x3, √2 x1x2, √2 x1x3, √2 x2x3, x1^2, x2^2, x3^2]^T
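A quick numerical verification of this feature map (an illustrative sketch, not from the exam; the vectors and the value of β are arbitrary):

    # Check that the explicit feature map reproduces the kernel (sketch).
    import numpy as np

    def phi(x, beta):
        x1, x2, x3 = x
        s2, s2b = np.sqrt(2.0), np.sqrt(2.0 * beta)
        return np.array([beta, s2b*x1, s2b*x2, s2b*x3,
                         s2*x1*x2, s2*x1*x3, s2*x2*x3, x1**2, x2**2, x3**2])

    rng = np.random.default_rng(1)
    x, z, beta = rng.normal(size=3), rng.normal(size=3), 0.7
    assert np.isclose(phi(x, beta) @ phi(z, beta), (beta + x @ z) ** 2)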
(e) (2 pts) Your billionaire friend needs your help. She needs to classify job applications into
good/bad categories, and also to detect job applicants who lie in their applications using
density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why?
Solution: For density estimation, we need to model P(X | Y; θ), so we need a generative
classifier.
(f) (5 pts) We consider Boolean functions in the class L10,30,100. This is the class of "10 out of
30 out of 100" functions, defined over {x1, x2, ..., x100}. Recall that a function in the class L10,30,100 is
defined by a set of 30 relevant variables. An example x ∈ {0, 1}^100 is positive if and only if at
least 10 out of these 30 variables are on. In the following discussion, for the sake of simplicity,
whenever we consider a member of L10,30,100, we will consider the function f in which the first
30 coordinates are the relevant coordinates. Show a linear threshold function h that behaves
just like f ∈ L10,30,100 on {0, 1}^100.
Solution: h(x) = sign(Σ_{i=1}^{30} xi − 10), taking sign(0) as positive; i.e., predict positive iff
Σ_{i=1}^{30} xi ≥ 10.
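A brute-force spot check of this threshold function (an illustrative sketch, not from the exam), using random examples from {0, 1}^100:

    # Spot-check the linear threshold against the 10-of-30 definition (sketch).
    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x = rng.integers(0, 2, size=100)
        f = int(x[:30].sum() >= 10)             # true "10 out of 30 out of 100" label
        h = int(x[:30].sum() - 10 >= 0)         # linear threshold, sign(0) taken as positive
        assert f == h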
(g) (5 pts) Describe the model assumptions in the Gaussian Mixture Model (GMM).
Is GMM a discriminative model or a generative model?
Solution:
GMM is a generative model. It assumes each data point is generated from one of K Gaussian
distributions. Let z be a random variable indicating which cluster x belongs to; then

    P(x) = Σ_{k=1}^{K} wk N(x | μk, Σk),

where wk ≥ 0, Σ_k wk = 1 is the prior probability p(z = k) of cluster k, and N(x | μk, Σk)
is a Gaussian distribution modeling P(x | z = k).
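A minimal sketch of the generative process just described (not from the exam; the mixture weights, means, and standard deviations are assumed toy values):

    # Sample from a 1-D Gaussian mixture: pick a cluster z, then draw x | z (sketch).
    import numpy as np

    w   = np.array([0.5, 0.3, 0.2])        # mixture weights, sum to 1
    mu  = np.array([-2.0, 0.0, 3.0])       # component means
    sig = np.array([0.5, 1.0, 0.8])        # component standard deviations

    rng = np.random.default_rng(3)
    z = rng.choice(len(w), size=1000, p=w)          # latent cluster assignments, p(z = k) = w_k
    x = rng.normal(loc=mu[z], scale=sig[z])         # observations x | z = k ~ N(mu_k, sig_k^2)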