
PRML end sem

(1) (Bayes Classifier.)
i. Consider the following cost matrices for a 3-class classification problem.
$$
L_{\text{zo}} = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},
\qquad
L_{\text{ordinal}} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{bmatrix},
\qquad
L_{\text{abstain}} = \begin{bmatrix} 0 & 1 & 1 & \tfrac{1}{2} \\ 1 & 0 & 1 & \tfrac{1}{2} \\ 1 & 1 & 0 & \tfrac{1}{2} \end{bmatrix}
$$
where the (i, j)-th entry in the cost matrix corresponds to the loss of predicting j
when the truth is i. The abstain loss has 4 possible predictions for this three
class problem corresponding to the three classes, and an ‘abstain’ option. Derive
the Bayes classifier for all three cost matrices. In other words, give a mapping
which takes as input a 3-dimensional probability vector corresponding to the
class conditional probabilities and outputs one of the 3 classes for the zero-one
and ordinal losses, and outputs one of the 4 possibilities for the abstain loss.
Denote the option of ‘abstaining’ by the symbol ⊥.
ii. Consider the following distribution of (X, Y) over R × {1, 2, 3}, with P(Y = 1) =
P(Y = 2) = P(Y = 3) = 1/3.

$$
f_X(x \mid Y = 1) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [0, 2] \\ \tfrac{1}{14} & \text{if } x \in [2, 10] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 2) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [4, 6] \\ \tfrac{1}{14} & \text{if } x \in [0, 4] \cup [6, 10] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 3) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [8, 10] \\ \tfrac{1}{14} & \text{if } x \in [0, 8] \\ 0 & \text{otherwise} \end{cases}
$$
where f_X(x | Y = a) is the conditional density of X given that Y = a. Give the
Bayes classifier for all three cost matrices for this distribution.
iii. Repeat the previous sub-problem for the distribution below, with P(Y = 1) =
P(Y = 2) = P(Y = 3) = 1/3.

$$
f_X(x \mid Y = 1) = \begin{cases} \tfrac{x-1}{9} & \text{if } x \in [1, 4] \\ \tfrac{7-x}{9} & \text{if } x \in [4, 7] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 2) = \begin{cases} \tfrac{x-2}{9} & \text{if } x \in [2, 5] \\ \tfrac{8-x}{9} & \text{if } x \in [5, 8] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 3) = \begin{cases} \tfrac{x-3}{9} & \text{if } x \in [3, 6] \\ \tfrac{9-x}{9} & \text{if } x \in [6, 9] \\ 0 & \text{otherwise} \end{cases}
$$
(2+2+2 points)
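For a general cost matrix L (rows indexed by the true class, columns by the prediction), the Bayes-optimal prediction for a probability vector p is the column j minimising the expected loss Σ_i p_i L_{ij}. A minimal numerical sketch of this rule, using the three cost matrices above and an arbitrary illustrative probability vector (not part of the question):

```python
import numpy as np

# Cost matrices from the question: rows = true class, columns = prediction.
# The 4th column of L_abstain is the 'abstain' option, denoted ⊥ in the question.
L_zo      = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L_ordinal = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
L_abstain = np.array([[0, 1, 1, 0.5], [1, 0, 1, 0.5], [1, 1, 0, 0.5]], dtype=float)

def bayes_predict(p, L):
    """Return the prediction (0-indexed column of L) minimising the expected loss (p @ L)[j]."""
    return int(np.argmin(p @ L))

# Arbitrary illustrative probability vector.
p = np.array([0.4, 0.35, 0.25])
for name, L in [("zero-one", L_zo), ("ordinal", L_ordinal), ("abstain", L_abstain)]:
    print(name, bayes_predict(p, L))
```

Under the abstain matrix, column 3 is ⊥, and abstaining is the minimiser exactly when every class has probability below 1/2 (so that the cost 1/2 beats 1 − p_j for every j).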
(2) (EM algorithm.) Consider the following crowd-sourcing problem. Each data-point
1 ≤ i ≤ n has a true-label Zi ∈ {0, 1}, and P (Zi = 1) = α. Each data point also has
crowd-sourced labels Xi1 , . . . , Xid given by d annotators, where Xij ∈ {0, 1}. Each
annotator j is assumed to have an independent error rate µ_j, i.e.
P(X_{ij} = 1 | Z_i = 0) = µ_j
P(X_{ij} = 0 | Z_i = 1) = µ_j
The joint distribution of all the random variables involved is given by:


$$
P(X_{11}, \dots, X_{1d}, \dots, X_{n1}, \dots, X_{nd}, Z_1, \dots, Z_n) = \prod_{i=1}^{n} P(Z_i) \prod_{j=1}^{d} P(X_{ij} \mid Z_i)
$$
You have access to the dataset of n points annotated by d annotators, i.e. x1 , x2 , . . . , xn ,
where xi ∈ {0, 1}d . But the true labels z1 , z2 , . . . , zn ∈ {0, 1} are hidden. Use the EM
algorithm to estimate the error rates of the annotators µj and the prior probability
parameter α.
i. For a given current value of the parameters α and µ1 , . . . µd , give the E-step where
you will compute γi = P (Zi = 1|Xi = xi ).
ii. Consider the estimation problem for annotator j. Given the complete data
(z1 , x1j ), . . . , (zn , xnj ), with zi ∈ {0, 1} and xij ∈ {0, 1}, give the ML estimate
of the probability of error µj .
iii. Use the γ_i from part i, along with the solution to the maximum likelihood
problem above, to give an expression for the update of the parameters α, µ_1, . . . , µ_d
in the M-step.
(Hint: It might be helpful to use γi to hallucinate complete data.)
iv. Consider the annotation data below with 20 points and 4 labellers. Apply the
E-step and M-step derived above for 2 iterations. In particular, give the result of
the E-step γ_1, . . . , γ_20 at the end of each of the two iterations. Also give the result
of the M-step α, µ_1, µ_2, µ_3, µ_4 at the end of each of the two iterations. Initialize
with α = 0.5, µ_1 = µ_2 = µ_3 = µ_4 = 0.1. Interpret the likely true labels Z for the
20 points, and the quality of the 4 labellers based on the final γ and µ values.
How does the final γ_i differ from a simple majority vote of the annotator labels?
Instructions: Feel free to write and use computer code to get these answers (a
minimal sketch is given after the data table). If you are using computer code, you
can run 10 steps of the EM algorithm, instead of 2, for a much easier interpretation
of the final answer. Give answers only up to 3 significant digits. Give the numerical
results for the E-step and M-step in two tables of size 20×2 and 5×2 respectively.
  i   x_{i,1}   x_{i,2}   x_{i,3}   x_{i,4}
  1      1         0         1         1
  2      0         0         0         1
  3      0         1         0         1
  4      0         0         1         1
  5      0         0         1         1
  6      0         0         1         0
  7      0         0         0         1
  8      0         0         1         1
  9      0         0         0         1
 10      0         0         0         1
 11      1         1         1         0
 12      1         1         0         0
 13      1         1         1         0
 14      1         1         0         1
 15      1         0         1         0
 16      1         1         1         0
 17      0         1         0         0
 18      1         1         0         0
 19      1         1         1         0
 20      1         1         0         0
(1+1+2+2 points)
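A minimal EM sketch for part iv, as invited in the instructions. The E-step and M-step below are the standard updates one derives for this model in parts i–iii (posterior responsibility γ_i, then the expected fraction of positives for α and the expected error fraction of each annotator for µ_j); the data matrix is the 20×4 table above.

```python
import numpy as np

# 20 x 4 annotation matrix from the table above (rows = points, columns = annotators).
X = np.array([
    [1,0,1,1], [0,0,0,1], [0,1,0,1], [0,0,1,1], [0,0,1,1],
    [0,0,1,0], [0,0,0,1], [0,0,1,1], [0,0,0,1], [0,0,0,1],
    [1,1,1,0], [1,1,0,0], [1,1,1,0], [1,1,0,1], [1,0,1,0],
    [1,1,1,0], [0,1,0,0], [1,1,0,0], [1,1,1,0], [1,1,0,0],
], dtype=float)

alpha, mu = 0.5, np.full(4, 0.1)          # initialisation given in the question
for it in range(10):                      # use range(2) for the 2-iteration version
    # E-step: gamma_i = P(Z_i = 1 | x_i).
    like1 = alpha * np.prod(np.where(X == 1, 1 - mu, mu), axis=1)        # Z_i = 1
    like0 = (1 - alpha) * np.prod(np.where(X == 1, mu, 1 - mu), axis=1)  # Z_i = 0
    gamma = like1 / (like1 + like0)
    # M-step: alpha = mean responsibility; mu_j = expected disagreement with the true label.
    alpha = gamma.mean()
    mu = (gamma[:, None] * (1 - X) + (1 - gamma)[:, None] * X).mean(axis=0)

print(np.round(gamma, 3))
print(round(float(alpha), 3), np.round(mu, 3))
```

The printed γ and µ values, rounded to 3 digits, are what the two requested tables should contain.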
(3) (Neural Networks.)
i. Consider this 2-dimensional binary classification dataset, with the class label Y
taking values ±1.
x1   x2    y
 0    1   −1
 1    0   −1
 1    1   +1
 0    0   +1
Give a neural network with one hidden layer containing two nodes and an output
layer with just one output node, such that the sign of the output on all the 4
inputs x above matches the sign of the label y. Use a ReLU activation function.
Use a bias on all the intermediate nodes and the output node. Explain and
illustrate your model using a figure. No optimization required, just look at the
data and come up with a neural network that fits the simple data.
ii. Repeat the above problem for the below dataset, and use 3 hidden nodes in one
hidden layer.
x1   x2    y
 1    1   −1
 1   −1   −1
−1    0   −1
 5    5   +1
 5   −5   +1
−5    5   +1
−5   −5   +1
iii. How many parameters are used in a neural network for a regression task with
10-dimensional input, with 5 hidden layers of sizes 6, 5, 4, 3, 2 respectively? Assume
biases are used in all the intermediate and output nodes.
(2+3+1 points)
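For part i, a candidate network can be checked numerically. The weights below are one illustrative hand-picked choice (h_1 fires only on (1, 1) and h_2 fires only on (0, 0)), not the unique answer; the same script counts parameters for part iii using the per-layer rule of n_in · n_out weights plus n_out biases.

```python
import numpy as np

# Part i: a candidate ReLU network with one hidden layer of two nodes (hand-picked guess).
W1 = np.array([[1.0, 1.0], [-1.0, -1.0]])   # hidden weights (rows = hidden units)
b1 = np.array([-1.5, 0.5])                  # hidden biases
w2 = np.array([4.0, 4.0])                   # output weights
b2 = -1.0                                   # output bias

def net(x):
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    return w2 @ h + b2

for x, y in [((0, 1), -1), ((1, 0), -1), ((1, 1), +1), ((0, 0), +1)]:
    out = net(np.array(x, dtype=float))
    print(x, y, out, np.sign(out) == y)     # check that the sign matches the label

# Part iii: parameter count for widths 10 -> 6 -> 5 -> 4 -> 3 -> 2 -> 1, biases everywhere.
widths = [10, 6, 5, 4, 3, 2, 1]
print(sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:])))
```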
(4) (Support vector classifier.) Consider the following hard-margin SVM problem
with the dataset below
x1   x2    y
 1    1   +1
 1    3   +1
 3    1   +1
 5    1   +1
 6    6   −1
 5    6   −1
 5    7   −1
Guess the support vectors, and use this to get a dual solution candidate α∗ . Check
whether this candidate is indeed a valid solution for the dual using KKT conditions.
Use the solution α∗ to get the primal solution w∗ and b∗ . Illustrate the resulting
classifier.
(5 points)
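The intended solution here is by hand (guess the support vectors, check the KKT conditions, then recover w^* and b^*), but the guess can be sanity-checked numerically. The sketch below approximates the hard-margin SVM with scikit-learn's SVC at a very large C and prints the support vectors, dual coefficients and primal parameters.

```python
import numpy as np
from sklearn.svm import SVC

# Dataset from the table above.
X = np.array([[1, 1], [1, 3], [3, 1], [5, 1], [6, 6], [5, 6], [5, 7]], dtype=float)
y = np.array([+1, +1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin
print(clf.support_)                           # indices of the support vectors
print(clf.dual_coef_)                         # y_i * alpha_i for the support vectors
print(clf.coef_, clf.intercept_)              # primal w* and b*
```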
(5) (Kernel Regression) Consider the following kernel regression problem. The data
matrix containing 3 points with one dimension is given by X^⊤ = [−1, 0, 2]. The
regression targets are given by y^⊤ = [1, 2, 0]. Consider the feature vector regression
problem given by the objective:
$$
R(w) = \sum_{i=1}^{3} \left( w^\top \phi(x_i) - y_i \right)^2
$$
where φ : R → R^d is a feature map corresponding to the kernel k(u, v) = sin(u) sin(v) + cos(u) cos(v) + 1.
i. Solve the kernel regression problem and give the solution α_1^*, α_2^*, α_3^*. Does the
problem have a unique solution or multiple solutions?
ii. Use the above α^* to make predictions at the 11 points ranging from x = −5 to
x = 5 in steps of 1. Plot this as a curve.
iii. Give any feature mapping φ : R → R^d such that φ(u)^⊤ φ(v) = k(u, v).
iv. Give the solution w^* to the feature vector regression problem using the feature
mapping φ obtained above.
(5 points)
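A numerical sketch for parts i and ii, assuming the representer-theorem form f(x) = Σ_i α_i k(x_i, x), under which the in-sample objective becomes ‖Kα − y‖² with K the 3×3 Gram matrix (lstsq is used so the sketch still works if K turns out to be singular):

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])
y = np.array([1.0, 2.0, 0.0])

def k(u, v):
    return np.sin(u) * np.sin(v) + np.cos(u) * np.cos(v) + 1.0

K = k(x[:, None], x[None, :])                  # 3x3 Gram matrix
alpha, *_ = np.linalg.lstsq(K, y, rcond=None)  # solve K alpha = y (least squares)
print(np.round(alpha, 3))

# Part ii: predictions at x = -5, -4, ..., 5.
xs = np.arange(-5, 6)
print(np.round(k(xs[:, None], x[None, :]) @ alpha, 3))
```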
(6) (Maximum Likelihood.) Consider the following parameter estimation problem.
Let σ_1^2, σ_2^2, . . . , σ_n^2 be known constants. The d-dimensional instance vectors X_1, X_2, . . . , X_n
are drawn from some distribution in an i.i.d. fashion. The real-valued targets
Y_1, . . . , Y_n are such that Y_i is drawn from a Normal distribution with mean x_i^⊤ w^*
and variance σ_i^2, for some fixed but unknown parameter w^* ∈ R^d. Derive the maximum
likelihood estimate of w^*.
Assume you have access to an equation-solver sub-routine that takes in A ∈ R^{d×d} and
b ∈ R^d and returns a solution to Ax = b (if a solution exists). How will you use
this solver for this parameter estimation problem, for a given dataset with instances
x_1, . . . , x_n, targets y_1, . . . , y_n and noise variances σ_1^2, σ_2^2, . . . , σ_n^2?
(5 points)
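A sketch of how the solver would be used, assuming the standard weighted least-squares derivation: setting the gradient of the negative log-likelihood to zero gives the linear system (Σ_i x_i x_i^⊤ / σ_i^2) w = Σ_i (y_i / σ_i^2) x_i. Below, np.linalg.solve stands in for the given Ax = b sub-routine, and the data is made up purely for illustration.

```python
import numpy as np

def ml_estimate(X, y, sigma2, solve=np.linalg.solve):
    """Weighted least squares: A w = b with A = sum_i x_i x_i^T / sigma_i^2 and
    b = sum_i (y_i / sigma_i^2) x_i. `solve` stands in for the provided sub-routine."""
    weights = 1.0 / sigma2                 # per-point weights 1 / sigma_i^2
    A = (X * weights[:, None]).T @ X       # d x d matrix
    b = X.T @ (weights * y)                # d-vector
    return solve(A, b)

# Made-up data, only to show the call pattern.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma2 = rng.uniform(0.5, 2.0, size=50)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2))
print(np.round(ml_estimate(X, y, sigma2), 3))
```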
(7) (Constrained and penalised versions of regularization.) Let f, g be functions
from Rd to R. Let λ > 0. Define two objects below:
$$
w^*(\lambda) = \operatorname*{argmin}_{w} \; f(w) + \lambda g(w)
$$
$$
\widehat{w}(B) = \operatorname*{argmin}_{w \,:\, g(w) \le B} \; f(w)
$$
For simplicity let us assume both the problems defined above have unique minimisers,
and thus w^*(λ) and ŵ(B) are well defined.
i. Show that for every λ > 0, there exists B ∈ R such that w^*(λ) = ŵ(B).
ii. What is the qualitative relation between λ and B? More concretely, let λ_1 < λ_2,
and B_1, B_2 ∈ R be such that w^*(λ_1) = ŵ(B_1) and w^*(λ_2) = ŵ(B_2). What is the
relationship between B_1 and B_2?
(2+3 points)
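A small numerical illustration of the λ–B relationship (not a proof), using the concrete pair f(w) = ‖Xw − y‖² and g(w) = ‖w‖² (ridge regression) with made-up data: for each λ, the printed value g(w^*(λ)) is the budget B at which the constrained and penalised problems share a minimiser, and how it varies with λ is exactly what part ii asks about.

```python
import numpy as np

# f(w) = ||Xw - y||^2, g(w) = ||w||^2 (ridge). Data made up for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)

def w_star(lam):
    """Closed-form minimiser of f(w) + lam * g(w)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.1, 1.0, 10.0, 100.0]:
    w = w_star(lam)
    print(lam, round(float(w @ w), 4))   # g(w*(lam)): the matching budget B
```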
(8) (VC Dimension.) Let the instance space X = R2 . The set of binary classifiers
H we consider is the set of all functions from R^2 → {−1, +1} that can be represented
using a depth-2 decision tree (i.e. one root node, two internal nodes and 4 leaf nodes).
Give a lower bound on the VC-dimension of H, by giving a set of points in R2 and
arguing/constructing classifiers for all the bit patterns as labels on these points.
(5 points)
(9) (AdaBoost.) Consider the following binary classification dataset. Run AdaBoost
for 3 iterations on the dataset, with the weak learner returning a best decision stump
(equivalently a decision tree with one node) (equivalently a horizontal or vertical
separator). Ties can be broken arbitrarily. Give the objects asked for below. Highlight
your answer by boxing it.
x1   x2    y
 1    1   +1
 1    2   −1
 1    3   +1
 2    1   −1
 2    2   −1
 2    3   −1
 3    1   +1
 3    2   +1
 3    3   +1

i. Give the weak learners h_t for t = 1, 2, 3.
ii. Give the “edge over random” γ_t, and the multiplicative factor β_t for t = 1, 2, 3.
iii. Give the predictions of the final weighted classifier h on the training points.
(2+2+1 points)
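A sketch of the boosting loop with an exhaustive stump search, as a numerical cross-check of the hand computation. Conventions for γ_t and β_t differ across textbooks; the sketch assumes γ_t = 1/2 − ε_t and the Freund–Schapire factor β_t = ε_t / (1 − ε_t), with the standard exponential reweighting, so adjust to whichever convention was used in class.

```python
import numpy as np

# Dataset from the table above.
X = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3]], dtype=float)
y = np.array([+1, -1, +1, -1, -1, -1, +1, +1, +1])

def best_stump(X, y, w):
    """Exhaustive search over stumps sign(s * (x_f - thr)), minimising the weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in (0.5, 1.5, 2.5, 3.5):
            for s in (+1, -1):
                pred = s * np.sign(X[:, f] - thr)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, s, pred)
    return best

w = np.ones(len(y)) / len(y)              # uniform initial weights
F = np.zeros(len(y))                      # running weighted vote for part iii
for t in range(3):
    err, f, thr, s, pred = best_stump(X, y, w)
    gamma = 0.5 - err                     # "edge over random" (assumed convention)
    beta = err / (1 - err)                # multiplicative factor (assumed convention)
    alpha = 0.5 * np.log((1 - err) / err) # vote weight of the weak learner
    print(t + 1, f, thr, s, round(err, 3), round(gamma, 3), round(beta, 3))
    F += alpha * pred
    w = w * np.exp(-alpha * y * pred)     # up-weight misclassified points
    w /= w.sum()

print(np.sign(F))                         # predictions of the final weighted classifier
```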
(10) (PCA.)
i. Let n be a large number. Let x_1, . . . , x_n be i.i.d. samples from
$$
N\!\left( \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} \right).
$$
Let S_{a,b,c} = {x ∈ R^2 : a x_1 + b x_2 + c = 0} be a line in R^2. Let the approximation
error of this line be
$$
R(a, b, c) = \min_{y_1, \dots, y_n \in S_{a,b,c}} \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y_i \rVert^2 .
$$
Give a, b, c which minimises the above error.
ii. Let n be a large number and x_1, . . . , x_n be i.i.d. samples from D. Let the distribution
D be a mixture of two Gaussians over R^d, i.e.
$$
D = \tfrac{1}{2} N(0, A \Sigma A^\top) + \tfrac{1}{2} N(0, B \Lambda B^\top)
$$
where A, B are orthonormal matrices and Σ, Λ are diagonal matrices, given as follows:
$$
\Sigma = \operatorname{diag}\!\left(8, 4, 2, 1, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, 0, 0, 0, 0, 0, 0\right)
$$
$$
\Lambda = \operatorname{diag}\!\left(9, 5, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\right)
$$
Let L_k be the approximation error of the datapoints x_i with the best k-dimensional
hyperplane, i.e.
$$
L_k = \min_{u_1, \dots, u_d} \; \min_{\substack{z_1, \dots, z_n \\ b_{k+1}, \dots, b_d}} \; \frac{1}{n} \sum_{i=1}^{n} \Big\lVert x_i - \sum_{j=1}^{k} z_{i,j} u_j - \sum_{j=k+1}^{d} b_j u_j \Big\rVert^2
$$
where u_j ∈ R^d, z_i ∈ R^k, and the b values are scalars.
Show the following:
(a) L2 ≤ 7.5
(b) L5 ≤ 2.
(c) Also show that the above two bounds are tight, i.e. there exist orthonormal
matrices A and B such that L2 = 7.5 and L5 = 2.
iii. PCA finds a k-dimensional affine space that minimizes the approximation error
for the data. Show that the approximation error corresponding to the best (k+1)-dimensional
vector space is less than or equal to the approximation error of the
best k-dimensional affine space.
(2+3+2 points)
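A quick numerical check for part i, using the mean and covariance stated in the question: as n grows, R(a, b, c) concentrates around the expected squared distance to the line, which is minimised by the line through the mean pointing along the leading eigenvector of the covariance.

```python
import numpy as np

mean = np.array([3.0, 1.0])
cov = np.array([[9.0, 3.0], [3.0, 4.0]])

evals, evecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
u = evecs[:, -1]                     # leading principal direction
print(np.round(evals, 3), np.round(u, 3))

# The line {mean + t * u : t in R} written as a*x1 + b*x2 + c = 0 has normal (a, b) = (-u[1], u[0]).
a, b = -u[1], u[0]
c = -(a * mean[0] + b * mean[1])
print(round(float(a), 3), round(float(b), 3), round(float(c), 3))
```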