(1) (Bayes Classifier.)

i. Consider the following cost matrices for a 3-class classification problem:

\[
L_{\mathrm{zo}} = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
L_{\mathrm{ordinal}} = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix},
\qquad
L_{\mathrm{abstain}} = \begin{pmatrix} 0 & 1 & 1 & \tfrac{1}{2} \\ 1 & 0 & 1 & \tfrac{1}{2} \\ 1 & 1 & 0 & \tfrac{1}{2} \end{pmatrix},
\]

where the (i, j)-th entry of a cost matrix is the loss of predicting j when the truth is i. The abstain loss has 4 possible predictions for this three-class problem, corresponding to the three classes and an 'abstain' option. Derive the Bayes classifier for all three cost matrices. In other words, give a mapping which takes as input a 3-dimensional probability vector corresponding to the class conditional probabilities and outputs one of the 3 classes for the zero-one and ordinal losses, and one of the 4 possibilities for the abstain loss. Denote the option of 'abstaining' by the symbol ⊥.

ii. Consider the following distribution of (X, Y) over ℝ × {1, 2, 3}, with P(Y = 1) = P(Y = 2) = P(Y = 3) = 1/3:

\[
f_X(x \mid Y = 1) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [0, 2] \\ \tfrac{1}{14} & \text{if } x \in [2, 10] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 2) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [4, 6] \\ \tfrac{1}{14} & \text{if } x \in [0, 4] \cup [6, 10] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 3) = \begin{cases} \tfrac{3}{14} & \text{if } x \in [8, 10] \\ \tfrac{1}{14} & \text{if } x \in [0, 8] \\ 0 & \text{otherwise} \end{cases}
\]

where f_X(x | Y = a) is the conditional density of X given that Y = a. Give the Bayes classifier for all three cost matrices for this distribution.

iii. Repeat the previous sub-problem for the distribution below, with P(Y = 1) = P(Y = 2) = P(Y = 3) = 1/3:

\[
f_X(x \mid Y = 1) = \begin{cases} \tfrac{x-1}{9} & \text{if } x \in [1, 4] \\ \tfrac{7-x}{9} & \text{if } x \in [4, 7] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 2) = \begin{cases} \tfrac{x-2}{9} & \text{if } x \in [2, 5] \\ \tfrac{8-x}{9} & \text{if } x \in [5, 8] \\ 0 & \text{otherwise} \end{cases}
\qquad
f_X(x \mid Y = 3) = \begin{cases} \tfrac{x-3}{9} & \text{if } x \in [3, 6] \\ \tfrac{9-x}{9} & \text{if } x \in [6, 9] \\ 0 & \text{otherwise} \end{cases}
\]

(2+2+2 points)

(2) (EM algorithm.) Consider the following crowd-sourcing problem. Each data point 1 ≤ i ≤ n has a true label Z_i ∈ {0, 1}, with P(Z_i = 1) = α. Each data point also has crowd-sourced labels X_{i1}, ..., X_{id} given by d annotators, where X_{ij} ∈ {0, 1}. Each annotator j is assumed to have an independent error rate µ_j, i.e.

P(X_{ij} = 1 | Z_i = 0) = µ_j,   P(X_{ij} = 0 | Z_i = 1) = µ_j.

The joint distribution of all the random variables involved is given by

\[
P(X_{11}, \ldots, X_{1d}, \ldots, X_{n1}, \ldots, X_{nd}, Z_1, \ldots, Z_n) = \prod_{i=1}^{n} P(Z_i) \prod_{j=1}^{d} P(X_{ij} \mid Z_i).
\]

You have access to the dataset of n points annotated by d annotators, i.e. x_1, x_2, ..., x_n, where x_i ∈ {0, 1}^d. But the true labels z_1, z_2, ..., z_n ∈ {0, 1} are hidden. Use the EM algorithm to estimate the error rates µ_j of the annotators and the prior probability parameter α.

i. For a given current value of the parameters α and µ_1, ..., µ_d, give the E-step, where you compute γ_i = P(Z_i = 1 | X_i = x_i).

ii. Consider the estimation problem for annotator j. Given the complete data (z_1, x_{1j}), ..., (z_n, x_{nj}), with z_i ∈ {0, 1} and x_{ij} ∈ {0, 1}, give the ML estimate of the probability of error µ_j.

iii. Use the γ_i from part i, along with the solution to the maximum likelihood problem above, to give an expression for the update of the parameters α, µ_1, ..., µ_d in the M-step. (Hint: it might be helpful to use the γ_i to hallucinate complete data.)

iv. Consider the annotation data below with 20 points and 4 labellers. Apply the E-step and M-step derived above for 2 iterations. In particular, give the result of the E-step γ_1, ..., γ_20 at the end of each of the two iterations. Also give the result of the M-step α, µ_1, µ_2, µ_3, µ_4 at the end of each of the two iterations. Initialize with α = 0.5 and µ_1 = µ_2 = µ_3 = µ_4 = 0.1. Interpret the likely true labels Z for the 20 points, and the quality of the 4 labellers, based on the final γ and µ values. How does the final γ_i differ from a simple majority vote of the annotator labels?

Instructions: Feel free to write and use computer code to get these answers. If you are using computer code, you can run 10 steps of the EM algorithm, instead of 2, for a much easier interpretation of the final answer. Give answers only up to 3 significant digits. Give the numerical results for the E-step and M-step in two tables of size 20×2 and 5×2 respectively.

x_{:,1}  x_{:,2}  x_{:,3}  x_{:,4}
  1        0        1        1
  0        0        0        1
  0        1        0        1
  0        0        1        1
  0        0        1        1
  0        0        1        0
  0        0        0        1
  0        0        1        1
  0        0        0        1
  0        0        0        1
  1        1        1        0
  1        1        0        0
  1        1        1        0
  1        1        0        1
  1        0        1        0
  1        1        1        0
  0        1        0        0
  1        1        0        0
  1        1        1        0
  1        1        0        0

(1+1+2+2 points)
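Since part iv explicitly invites computer code, here is a minimal NumPy sketch of the computation: the data matrix transcribes the table above, and the loop runs the 10 iterations the instructions allow. It implements the standard EM updates for this model, i.e. the quantities parts i-iii ask you to derive, so treat it as a way to check your numbers rather than as the derivation itself.

```python
import numpy as np

# 20 x 4 annotation matrix transcribed from the table in part iv
# (rows = data points, columns = annotators).
X = np.array([
    [1,0,1,1],[0,0,0,1],[0,1,0,1],[0,0,1,1],[0,0,1,1],
    [0,0,1,0],[0,0,0,1],[0,0,1,1],[0,0,0,1],[0,0,0,1],
    [1,1,1,0],[1,1,0,0],[1,1,1,0],[1,1,0,1],[1,0,1,0],
    [1,1,1,0],[0,1,0,0],[1,1,0,0],[1,1,1,0],[1,1,0,0],
])

alpha, mu = 0.5, np.full(4, 0.1)      # initialization given in part iv

for _ in range(10):                   # 2 iterations as asked, or 10 as the instructions allow
    # E-step: gamma_i = P(Z_i = 1 | X_i = x_i) by Bayes rule, using
    # P(x_ij | Z_i = 1) = (1 - mu_j)^x_ij * mu_j^(1 - x_ij) and the mirror image for Z_i = 0.
    p1 = alpha * np.prod(np.where(X == 1, 1 - mu, mu), axis=1)
    p0 = (1 - alpha) * np.prod(np.where(X == 1, mu, 1 - mu), axis=1)
    gamma = p1 / (p1 + p0)

    # M-step: plug gamma in as soft ("hallucinated") labels in the complete-data ML estimates.
    alpha = gamma.mean()
    mu = (gamma[:, None] * (1 - X) + (1 - gamma)[:, None] * X).mean(axis=0)

print("gamma:", np.round(gamma, 3))
print("alpha:", np.round(alpha, 3), " mu:", np.round(mu, 3))
```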
(3) (Neural Networks.)

i. Consider this 2-dimensional binary classification dataset, with the class label Y taking values ±1.

x1  x2   y
 0   1  −1
 1   0  −1
 1   1  +1
 0   0  +1

Give a neural network with one hidden layer containing two nodes and an output layer with just one output node, such that the sign of the output on all 4 inputs x above matches the sign of the label y. Use a ReLU activation function. Use a bias on all the intermediate nodes and the output node. Explain and illustrate your model using a figure. No optimization is required; just look at the data and come up with a neural network that fits this simple dataset.

ii. Repeat the above problem for the dataset below, using 3 hidden nodes in one hidden layer.

x1  x2   y
 1   1  −1
 1  −1  −1
−1   0  −1
 5   5  +1
 5  −5  +1
−5   5  +1
−5  −5  +1

iii. How many parameters are used in a neural network for a regression task with 10-dimensional input and 5 hidden layers of sizes 6, 5, 4, 3, 2 respectively? Assume biases are used in all the intermediate and output nodes.

(2+3+1 points)

(4) (Support vector classifier.) Consider the following hard-margin SVM problem with the dataset below.

x1  x2   y
 1   1  +1
 1   3  +1
 3   1  +1
 5   1  +1
 6   6  −1
 5   6  −1
 5   7  −1

Guess the support vectors, and use this to get a dual solution candidate α*. Check whether this candidate is indeed a valid solution for the dual using the KKT conditions. Use the solution α* to get the primal solution w* and b*. Illustrate the resulting classifier.

(5 points)

(5) (Kernel Regression.) Consider the following kernel regression problem. The data matrix, containing 3 one-dimensional points, is given by X⊤ = [−1, 0, 2]. The regression targets are given by y⊤ = [1, 2, 0]. Consider the feature-vector regression problem given by the objective

\[
R(w) = \sum_{i=1}^{3} \left( w^\top \phi(x_i) - y_i \right)^2,
\]

where φ : ℝ → ℝ^d is a feature vector corresponding to the kernel k(u, v) = sin(u) sin(v) + cos(u) cos(v) + 1.

i. Solve the kernel regression problem and give the solution α*_1, α*_2, α*_3. Does the problem have a unique solution or multiple solutions?

ii. Use the above α* to make predictions at the 11 points ranging from x = −5 to x = 5 in steps of 1. Plot this as a curve.

iii. Give any feature mapping φ : ℝ → ℝ^d such that φ(u)⊤ φ(v) = k(u, v).

iv. Give the solution w* to the feature-vector regression problem, assuming the feature function φ obtained above.

(5 points)
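A small NumPy sketch of the numerical side of parts i and ii: it builds the 3×3 Gram matrix for the given kernel, solves Kα = y, and evaluates the kernel predictor f(x) = Σ_j α_j k(x, x_j) on the grid from −5 to 5. Using the pseudo-inverse here is an assumption made to sidestep the uniqueness question raised in part i; the written answer should still address that question.

```python
import numpy as np

def k(u, v):
    # kernel from the problem statement: k(u, v) = sin(u)sin(v) + cos(u)cos(v) + 1
    return np.sin(u) * np.sin(v) + np.cos(u) * np.cos(v) + 1.0

x = np.array([-1.0, 0.0, 2.0])     # training inputs
y = np.array([1.0, 2.0, 0.0])      # regression targets

K = k(x[:, None], x[None, :])      # 3 x 3 Gram matrix
alpha = np.linalg.pinv(K) @ y      # a minimiser of sum_i (sum_j alpha_j k(x_j, x_i) - y_i)^2

x_test = np.arange(-5.0, 6.0)      # the 11 query points -5, -4, ..., 5
y_pred = k(x_test[:, None], x[None, :]) @ alpha   # f(x) = sum_j alpha_j k(x, x_j)

print("alpha:", np.round(alpha, 3))
print("predictions:", np.round(y_pred, 3))
```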
(6) (Maximum Likelihood.) Consider the following parameter estimation problem. Let σ_1², σ_2², ..., σ_n² be known constants. The d-dimensional instance vectors X_1, X_2, ..., X_n are drawn from some distribution in an i.i.d. fashion. The real-valued targets Y_1, ..., Y_n are such that Y_i is drawn from a Normal distribution with mean x_i⊤ w* and variance σ_i², for some fixed but unknown parameter w* ∈ ℝ^d. Derive the maximum likelihood estimate of w*.

Assume you have access to an equation-solver sub-routine that takes in A ∈ ℝ^{d×d} and b ∈ ℝ^d and returns a solution to Ax = b (if a solution exists). How would you use this solver for this parameter estimation problem, for a given dataset with instances x_1, ..., x_n, targets y_1, ..., y_n and noise variances σ_1², σ_2², ..., σ_n²?

(5 points)

(7) (Constrained and penalised versions of regularization.) Let f, g be functions from ℝ^d to ℝ. Let λ > 0. Define the two objects below:

\[
w^*(\lambda) = \operatorname*{argmin}_{w} \; f(w) + \lambda g(w),
\qquad
\widehat{w}(B) = \operatorname*{argmin}_{w : g(w) \le B} \; f(w).
\]

For simplicity, let us assume both the problems defined above have unique minimisers, and thus w*(λ) and ŵ(B) are well defined.

i. Show that for every λ > 0, there exists B ∈ ℝ such that w*(λ) = ŵ(B).

ii. What is the qualitative relation between λ and B? More concretely, let λ_1 < λ_2, and let B_1, B_2 ∈ ℝ be such that w*(λ_1) = ŵ(B_1) and w*(λ_2) = ŵ(B_2). What is the relationship between B_1 and B_2?

(2+3 points)

(8) (VC Dimension.) Let the instance space be X = ℝ². The set of binary classifiers H we consider is the set of all functions from ℝ² to {−1, +1} that can be represented using a depth-2 decision tree (i.e. one root node, two internal nodes and 4 leaf nodes). Give a lower bound on the VC dimension of H by giving a set of points in ℝ² and arguing/constructing classifiers for all the bit patterns as labels on these points.

(5 points)

(9) (AdaBoost.) Consider the following binary classification dataset. Run AdaBoost for 3 iterations on the dataset, with the weak learner returning a best decision stump (equivalently a decision tree with one node, i.e. a horizontal or vertical separator). Ties can be broken arbitrarily. Give the objects asked for below. Highlight your answer by boxing it.

x1  x2   y
 1   1  +1
 1   2  −1
 1   3  +1
 2   1  −1
 2   2  −1
 2   3  −1
 3   1  +1
 3   2  +1
 3   3  +1

i. Give the weak learners h_t for t = 1, 2, 3.

ii. Give the "edge over random" γ_t and the multiplicative factor β_t for t = 1, 2, 3.

iii. Give the predictions of the final weighted classifier h on the training points.

(2+2+1 points)

(10) (PCA.)

i. Let n be a large number. Let x_1, ..., x_n be i.i.d. samples from

\[
N\!\left( \begin{pmatrix} 3 \\ 1 \end{pmatrix}, \; \begin{pmatrix} 9 & 3 \\ 3 & 4 \end{pmatrix} \right).
\]

Let S_{a,b,c} = {x ∈ ℝ² : a x_1 + b x_2 + c = 0} be a line in ℝ². Let the approximation error of this line be

\[
R(a, b, c) = \min_{y_1, \ldots, y_n \in S_{a,b,c}} \; \frac{1}{n} \sum_{i=1}^{n} \|x_i - y_i\|^2.
\]

Give a, b, c which minimise the above error.

ii. Let n be a large number and let x_1, ..., x_n be i.i.d. samples from D. Let the distribution D be a mixture of two Gaussians over ℝ^d, i.e.

\[
D = \tfrac{1}{2} N(0, A \Sigma A^\top) + \tfrac{1}{2} N(0, B \Lambda B^\top),
\]

where A, B are orthonormal matrices and Σ, Λ are diagonal matrices, given as follows:

\[
\Sigma = \mathrm{diag}\!\left(8, 4, 2, 1, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, 0, 0, 0, 0, 0, 0\right),
\qquad
\Lambda = \mathrm{diag}\!\left(9, 5, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\right).
\]

Let L_k be the approximation error of the data points x_i with the best k-dimensional hyperplane, i.e.

\[
L_k = \min_{u_1, \ldots, u_d} \; \min_{\substack{z_1, \ldots, z_n \\ b_{k+1}, \ldots, b_d}} \; \frac{1}{n} \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} z_{i,j} u_j - \sum_{j=k+1}^{d} b_j u_j \Big\|^2,
\]

where u_j ∈ ℝ^d, z_i ∈ ℝ^k and the b values are scalars. Show the following:

(a) L_2 ≤ 7.5.
(b) L_5 ≤ 2.
(c) Also show that the above two bounds are tight, i.e. there exist orthonormal matrices A and B such that L_2 = 7.5 and L_5 = 2.

iii. PCA finds a k-dimensional affine space that minimizes the approximation error for the data. Show that the approximation error corresponding to the best (k+1)-dimensional vector space is less than or equal to the approximation error of the best k-dimensional affine space.

(2+3+2 points)
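For numerical intuition on part ii of the PCA problem, here is a Monte-Carlo sketch (assuming NumPy). It uses the illustrative, non-worst-case choice A = B = I, samples from the stated mixture, and computes L_k as the sum of the d − k smallest eigenvalues of the sample covariance; constructing A and B that make the bounds tight is still left to the written argument.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 100_000

# Spectra from the problem statement; A = B = I is an illustrative (not worst-case) choice.
Sigma = np.diag([8, 4, 2, 1, .5, .5, .25, .25, .25, .25] + [0.0] * 6)
Lam   = np.diag([9, 5, .25, .25, .25, .25] + [0.0] * 10)

# Draw from the mixture: each point comes from one of the two zero-mean Gaussians.
pick = rng.random(n) < 0.5
X = np.where(pick[:, None],
             rng.multivariate_normal(np.zeros(d), Sigma, n),
             rng.multivariate_normal(np.zeros(d), Lam, n))

# The best k-dimensional affine fit leaves an error equal to the sum of the
# d - k smallest eigenvalues of the (mean-centred) sample covariance.
evals = np.sort(np.linalg.eigvalsh(np.cov(X.T, bias=True)))[::-1]
for kdim in (2, 5):
    print(f"L_{kdim} with A = B = I: {evals[kdim:].sum():.3f}")
```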