Machine Learning (Extended) Resit paper, August 2014. Answer all questions. A non-alphanumeric calculator may be used.

Question 1 – Probabilistic generative classifiers

a) What is the Naive Bayes assumption, and why is it called "naive"? [5%]

b) State an advantage and a disadvantage of using the Naive Bayes assumption in classification. [5%]

c) How many parameters do you need to specify P(A,B|C) if A, B and C are discrete random variables that can take on a, b and c different values respectively? [5%]

d) Assume we have a data set described by the following three variables:
Hair = {B, D}, where B = blonde, D = dark.
Height = {T, S}, where T = tall, S = short.
Country = {G, P}, where G = Greenland, P = Poland.
You are given the following training data set of (Hair, Height, Country) triples:
(B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G), (B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G), (B,T,P), (B,T,P), (B,T,P), (D,T,P), (D,T,P), (D,S,P), (B,S,P), (D,S,P).
Now suppose you observe a new individual who is tall with blonde hair, and you want to use these training data to determine the most likely country of origin.

i) Give the maximum a posteriori (MAP) answer to the above question, using the Naive Bayes assumption. Show all of your working. [5%]

ii) Give the maximum likelihood (ML) answer to the above question, using the Naive Bayes assumption, and explain how it differs from the method used in i). [5%]

iii) Explain how you would solve i) or ii) if, instead of blonde/dark, we were given continuous-valued measurements of hair colour, and, instead of tall/short, we measured the height in centimetres. [5%]

e) Consider a 1-dimensional Gaussian classifier, that is, a classifier that models each class by a Gaussian with its own mean and variance. Assume that the class prior probabilities are equal for both classes. Draw an example of a 1-dimensional Gaussian classifier with two classes, and indicate where the decision boundary is on your plot. [5%]
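As an illustration of the counting behind parts d) i) and ii), a minimal Python sketch is given below: it estimates the class priors and the per-class frequencies of blonde hair and tallness from the training triples listed in d), then combines them under the Naive Bayes assumption. The variable names and printed summary are illustrative only and are not part of the required working.

# Tally the quantities needed for the Naive Bayes MAP/ML comparison in Question 1 d).
data = [  # (Hair, Height, Country) triples as listed in the question
    ("B", "T", "G"), ("D", "T", "G"), ("D", "T", "G"), ("D", "T", "G"),
    ("B", "T", "G"), ("B", "S", "G"), ("B", "S", "G"), ("D", "S", "G"),
    ("B", "T", "G"), ("D", "T", "G"), ("D", "T", "G"), ("D", "T", "G"),
    ("B", "T", "G"), ("B", "S", "G"), ("B", "S", "G"), ("D", "S", "G"),
    ("B", "T", "P"), ("B", "T", "P"), ("B", "T", "P"), ("D", "T", "P"),
    ("D", "T", "P"), ("D", "S", "P"), ("B", "S", "P"), ("D", "S", "P"),
]

for country in ("G", "P"):
    rows = [(hair, height) for hair, height, c in data if c == country]
    prior = len(rows) / len(data)                           # P(Country)
    p_blonde = sum(h == "B" for h, _ in rows) / len(rows)   # P(Hair = B | Country)
    p_tall = sum(t == "T" for _, t in rows) / len(rows)     # P(Height = T | Country)
    likelihood = p_blonde * p_tall  # Naive Bayes: features independent given the class
    # ML compares the likelihoods; MAP additionally weighs in the class prior.
    print(country, "likelihood:", likelihood, "posterior (unnormalised):", likelihood * prior)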
Question 2 – Non-probabilistic classifiers

a) Consider the following training data from two categories:
Class 1: (1,1)'
Class 2: (-1,-1)', (1,0)', (0,1)'

i) Plot these four points, draw the linear separation boundary that an SVM would give for these data, and list the support vectors. [5%]

ii) Consider training data of 1-dimensional points from two classes:
Class 1: -5, 5
Class 2: -2, 1

A) Consider the transformation f: R → R², f(x) = (x, x²). Transform the data and plot the transformed points. Are they linearly separable? [5%]

B) Draw the optimal separating hyperplane in the transformed space, and explain in one or two sentences how this linear boundary helps us to separate the original data points. [5%]

b) Consider the following data set with two real-valued inputs x (i.e. the coordinates of the points) and one binary output y (taking values + or -). We want to use k-nearest neighbours (K-NN) with Euclidean distance to predict y from x.

i) Calculate the leave-one-out cross-validation error of 1-NN on this data set. That is, for each point in turn, try to predict its label y using the rest of the points, and count up the number of misclassification errors. [5%]

ii) Calculate the leave-one-out cross-validation error of 3-NN on the same data set. [5%]

iii) Describe how you would choose the number of neighbours K in K-NN in general. [5%]

c) Suppose you have a data set to classify and several classification methods that you can try. Explain how you would decide which of these methods to choose. [5%]

Question 3 – Learning theory

a) State three questions that are studied by learning theory. [5%]

b) A learning theory framework is the PAC model of learning, where PAC stands for Probably Approximately Correct. Explain in plain English when we say that a concept class is PAC-learnable. [5%]

Question 4 – Unsupervised learning: Clustering

a) What methods do you know for data clustering? [5%]

b) Describe how you would use a clustering method to do image segmentation. [5%]

c) Describe two limitations of the K-means clustering algorithm. [5%]

d) Suppose you have run K-means clustering on a data set and you later get more data points added to the same data set. How would you cluster the new points without re-running the algorithm? (An illustrative sketch of this idea appears at the end of the paper.) [5%]

Link to learning outcomes assessed by the examination:
1. Demonstrate a knowledge and understanding of the main approaches to machine learning. Q1 a, c, e; Q2 a, b; Q3; Q4 a.
2. Demonstrate the ability to apply the main approaches to unseen examples. Q1 d; Q2 b; Q3; Q4 d.
3. Demonstrate an understanding of the differences, advantages and problems of the main approaches in machine learning. Q1 d iii; Q2 c.
4. Demonstrate an understanding of the main limitations of current approaches to machine learning, and be able to discuss possible extensions to overcome these limitations. Q1 b; Q2 c; Q4 c.
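For Question 4 d), one common answer is to keep the centroids from the earlier K-means run and assign each new point to its nearest centroid. The minimal Python sketch below illustrates this; the centroid coordinates and new points are made up purely for demonstration, and NumPy is assumed to be available.

import numpy as np

# Centroids obtained from a previous K-means run (made-up values for illustration).
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
# New points that arrived after clustering was done (also made up).
new_points = np.array([[1.0, 0.5], [4.0, 6.0], [9.0, 1.0]])

# Squared Euclidean distance from every new point to every existing centroid.
dists = ((new_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
# Each new point joins the cluster whose centroid is closest; K-means is not re-run.
labels = dists.argmin(axis=1)
print(labels)  # cluster index assigned to each new point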