1. Suppose you train a classifier and test it on a held-out validation set. It gets 80% classification accuracy on the training set and 20% classification accuracy on the validation set. From what problem is your model most likely suffering?

   [ ] Underfitting  [ ] Overfitting  [ ] Underfitting and Overfitting  [ ] None of the above

2. Suppose you train a classifier and test it on a held-out validation set. It gets 30% classification accuracy on the training set and 30% classification accuracy on the validation set. From what problem is your model most likely suffering?

   [ ] Underfitting  [ ] Overfitting  [ ] Underfitting and Overfitting  [ ] None of the above

3. (10pt) Consider the following classifier:

   $$g(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } (x_i, y_i) \in D_{\text{train}} \\ 1 & \text{otherwise} \end{cases}$$

   Assume $Y = \{1, 2, 3, 4\}$ and your dataset contains 1000 samples, equally distributed over the four classes. You split your data into $D_{\text{train}}$ (70%) and $D_{\text{test}}$ (30%), preserving class distributions.

   (a) What is the 0/1 loss for the entire training set?
   (b) What is the 0/1 loss for the entire test set?
   (c) What is the squared loss for the entire test set?
   (d) What is the absolute loss for the entire test set?

4. Suppose you have a perceptron with weight vector $w = r \cdot [w_0, w_1, w_2]^T$. You are given the weights $[w_0, w_1, w_2]^T$ but are not told what the constant $r > 0$ is. Assuming that ties are broken as positive, can you determine the perceptron's classification of a new example $x$? Justify your answer.

   [ ] Always  [ ] Sometimes  [ ] Never

5. We would like to use a perceptron to train a classifier on the following training data:

   $([-1, 2]^T, +1),\ ([3, -1]^T, -1),\ ([1, 2]^T, -1),\ ([3, 1]^T, +1)$

   (a) Starting with the weight vector $w^T = [0, 0, 0]$, what will be the updated values for these weights after processing the first point with the perceptron algorithm? (A sketch of the update rule appears after problem 9.)
   (b) After how many steps will the perceptron algorithm converge? Write "never" if it will never converge. Note: for this question, one step means processing one point. Points are processed in order and then repeated, until convergence.

6. (15pt) Assuming $x_1, x_2, \ldots, x_n$ are independent samples from $P(x \mid \theta) = \theta x^{-\theta - 1}$, where $\theta > 1$ and $x \geq 1$, find the MLE estimator of $\theta$.

7. Assume that you are given instances $(x, y) \in \mathbb{R}^2 \times \{\pm 1\}$ in the following order:

   $([0, 0]^T, +1),\ ([-3, 3]^T, +1),\ ([2, -2]^T, +1),\ ([0, 2]^T, -1),\ ([3, 2]^T, -1),\ ([3, -1]^T, -1)$

   (a) (10pt) Starting from $w^T = [0, 0, 0]$, indicate the status of the weight vector after processing each data instance (in the order given). Call your vectors $w_{s1}, w_{s2}, w_{s3}, w_{s4}, w_{s5}, w_{s6}$.
   (b) (15pt) Mark the data instances in the plot below, then calculate and draw, in the same plot, the hyperplane corresponding to your weight vector $w_{s6}$.

8. You are working on a classification task and applying 10-fold cross-validation to perform model selection. You are training linear models and trying different hyperparameters.

   (a) How would you recognize when a given model is overfitting the data?
   (b) Assuming that overfitting is happening, indicate one clear action that will help decrease it.

9. Assume you have a neural network with one input layer (two units), one hidden layer (two units), and one output layer (one unit). Use tanh as the nonlinear function and squared error as the loss function.

   (a) Draw the network and include the bias units. Give proper labels to all weights using the notation used in class ($w^{(l)}_{i,j}$).
   (b) Initialize all weights to zero and calculate the output of the network after propagating the labeled data point $([1, 0]^T, +1)$. (A forward-pass sketch follows below.)
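Problems 5 and 7 both step through the perceptron's mistake-driven update. As a reference for the mechanics only (not the answers), here is a minimal Python sketch. It assumes a learning rate of 1, an augmented bias feature, and the ties-broken-as-positive convention from problem 4; the function name and the toy data are placeholders, not the data from the problems.

```python
import numpy as np

def perceptron(points, labels, max_epochs=100):
    """Mistake-driven perceptron with an explicit bias term.

    Each input x is augmented to [1, x1, x2], so w = [w0, w1, w2]
    and the prediction is sign(w . x), with ties broken as +1.
    """
    w = np.zeros(3)                            # start from w = [0, 0, 0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):       # process points in the given order
            x_aug = np.array([1.0, *x])        # prepend the bias feature
            pred = 1 if w @ x_aug >= 0 else -1 # tie (w . x = 0) -> positive
            if pred != y:
                w += y * x_aug                 # update only on a mistake
                mistakes += 1
        if mistakes == 0:                      # a mistake-free pass => converged
            return w
    return w                                   # may not have converged

# Hypothetical toy data (not the data from problems 5 or 7):
w = perceptron([(-1.0, 2.0), (3.0, -1.0)], [+1, -1])
print(w)
```

The update fires only on misclassified points; convergence in problem 5(b)'s sense means reaching a full pass through the data with zero mistakes.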
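For problem 9(b), the forward pass can be traced numerically. A minimal sketch, assuming a 2-2-1 architecture in which tanh is applied at both the hidden and output layers and the loss is squared error; the matrix shapes and variable names here are illustrative, not the $w^{(l)}_{i,j}$ notation from class.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-2-1 network with tanh activations."""
    h = np.tanh(W1 @ x + b1)        # hidden layer: two tanh units
    o = np.tanh(W2 @ h + b2)        # output layer: one tanh unit (assumed)
    return o

# All weights and biases initialized to zero, as in problem 9(b).
W1 = np.zeros((2, 2)); b1 = np.zeros(2)
W2 = np.zeros((1, 2)); b2 = np.zeros(1)

x, y = np.array([1.0, 0.0]), 1.0
o = forward(x, W1, b1, W2, b2)[0]
loss = (o - y) ** 2                 # squared-error loss
print(o, loss)                      # every pre-activation is 0 and tanh(0) = 0
```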
10. Derive a gradient descent update rule that minimizes the loss function below:

    $$L(w) = \frac{1}{n} \sum_{i=1}^{n} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$$

    where

    $$h_w(x) = w_0 + w_1 x_1 + w_1 x_1^3 + w_2 x_2 + w_2 x_2^3 + \cdots + w_d x_d + w_d x_d^3$$
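One consistent reading of $h_w$ in problem 10 is that each weight $w_j$ (for $j \geq 1$) multiplies the combined feature $x_j + x_j^3$, so the model is linear in $w$ over $\phi(x) = [1,\ x_1 + x_1^3,\ \ldots,\ x_d + x_d^3]$ and the gradient is the usual least-squares one, $\frac{\partial L}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} \left( h_w(x^{(i)}) - y^{(i)} \right) \phi_j(x^{(i)})$. Below is a minimal batch gradient descent sketch under that reading; the learning rate, data, and function names are placeholders.

```python
import numpy as np

def features(X):
    """Map raw inputs to phi(x) = [1, x_1 + x_1^3, ..., x_d + x_d^3]."""
    return np.hstack([np.ones((X.shape[0], 1)), X + X**3])

def gradient_step(w, X, y, lr=0.01):
    """One batch gradient descent step on L(w) = (1/n) sum (h_w(x) - y)^2."""
    Phi = features(X)
    residual = Phi @ w - y                    # h_w(x^(i)) - y^(i)
    grad = (2.0 / len(y)) * Phi.T @ residual  # dL/dw over the batch
    return w - lr * grad

# Placeholder data: n = 3 points in d = 2 dimensions.
X = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(3)                               # [w0, w1, w2]
for _ in range(100):
    w = gradient_step(w, X, y)
print(w)
```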