Machine Learning (Extended)
Exam paper for May 2014.
Answer all questions. A non-alpha calculator may be used.

Question 1. Probabilistic generative classifiers

a) Assume we have a data set described by the following three variables:
Hair = {B,D}, where B=blond, D=dark.
Height = {T,S}, where T=tall, S=short.
Country = {G,P}, where G=Greenland, P=Poland.
You are given the following training data set (Hair, Height, Country):
(B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G),
(B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G),
(B,T,P), (B,T,P), (B,T,P), (D,T,P), (D,T,P), (D,S,P), (B,S,P), (D,S,P).
Now, suppose you observe a new individual who is tall with blond hair, and you want to use these training data to determine the most likely country of origin.
i) Compute the maximum a posteriori (MAP) answer to the above question, using the Naive Bayes assumption. Show all of your working. [10%]
ii) In general, would you use the Naive Bayes assumption in either of the following two situations? Justify your answer.
Case 1: a large number of training examples described by a small number of attributes.
Case 2: a small number of training examples described by a large number of attributes.
[10%]

b) Suppose we have 1-dimensional data in 2 classes, and a Gaussian classifier has been estimated from an equal number of points from each class. Each class has its own mean and variance, as shown in the diagram below (the continuous line and the dashed line depict the distributions of the two classes). Further, we are given 5 test points (shown with crosses on the diagram). These points are: -4, 1, 3, 8, 12.
i) Using the classifier given in the diagram below, decide which of these test points belong to Class 1 and which belong to Class 2. Briefly justify your answers. [5%]
ii) By looking at the classifier in the above figure, determine whether its decision boundary is linear or non-linear. Justify your answer in one sentence.
[5%]
iii) If the variances of the two classes were equal, would such a classifier produce a linear or a non-linear decision boundary? [5%]

Question 2. Non-probabilistic classifiers

a) Consider training data of 1-dimensional points from two classes:
Class 1: -2, 3
Class 2: -1, 0
i) Plot these points. Are they linearly separable? [2%]
ii) Consider the transformation f: R-->R^2, f(x)=(x,x^2). Transform the data and plot the transformed points. Are these linearly separable? [2%]
iii) Draw the optimal separation boundary in the transformed space. [5%]
iv) For the transformation given in part ii), write down the kernel that implements the inner product in the transformed space. [1%]

b) Consider the 2-dimensional data set in the figure below, where the two markers ('*' and 'o') indicate the class labels of the points.
i) Give the leave-one-out error of the 3-NN classifier. [5%]
ii) Explain what goes wrong on this data set if you choose k too small (e.g. k=1) or too large (e.g. k=13). [5%]

c)
i) Explain (in plain English) how to use the AdaBoost algorithm for training a classifier, and how to use the resulting classifier at test time. [5%]
ii) Describe three advantages and three disadvantages of AdaBoost. [5%]
iii) It is known that the training error converges to zero with the AdaBoost iterations. Is this always a good thing? Justify your answer. [5%]

Question 3 – Learning theory

a) State three questions that are studied by learning theory. [5%]

b) Consider a finite set of functions, H, that map an input set X into the set of labels {0,1}. Let L be an algorithm that, for any function c from H and any training set S of N training points drawn independently from some unknown distribution D over X, returns a hypothesis, hS, that is consistent with the given training set.
In this case it is known (as proved in class) that for any choice of ε > 0 the generalisation error of hS is upper bounded by ε with probability of at least 1 - |H| exp(-Nε). Here, |H| denotes the cardinality of the set H (that is, the number of elements in H). Using this result, determine how large N needs to be if we want to ensure that the mentioned upper bound holds with probability 0.95. [5%]

Question 4 – Unsupervised learning: Clustering

a) What is the goal of clustering, and how does it differ from classification? [5%]
b) Write down the pseudo-code of K-means clustering. [5%]
c) State three application areas of K-means clustering. [5%]
d) Describe a situation in which K-means would fail, and suggest a modification to get around the problem. [5%]

Link to the learning outcomes assessed by examination

1. Demonstrate a knowledge and understanding of the main approaches to machine learning. Q1 a i; Q2 a; Q2 a,b; Q2 c i; Q3; Q4 b,c.
2. Demonstrate the ability to apply the main approaches to unseen examples. Q1 a i; Q1 b i; Q2 b i; Q2 c ii; Q3.
3. Demonstrate an understanding of the differences, advantages and problems of the main approaches in machine learning. Q1 a ii; Q1 b ii,iii; Q2 b ii; Q2 c ii; Q4 a.
4. Demonstrate an understanding of the main limitations of current approaches to machine learning, and be able to discuss possible extensions to overcome these limitations. Q1 a ii; Q1 b iii; Q2 c iii; Q4 d.
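
Supplementary sketch for Question 1 a) i). The Python snippet below is one possible way to cross-check a hand calculation of the Naive Bayes MAP answer; it is not part of the paper or of any model solution. It assumes that all probabilities are estimated as plain relative frequencies with no smoothing, and the names TRAIN and naive_bayes_posterior are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Training data from Question 1 a) as (hair, height, country) tuples.
# The paper lists the same eight Greenland examples twice, hence the "* 2".
GREENLAND = [("B", "T", "G"), ("D", "T", "G"), ("D", "T", "G"), ("D", "T", "G"),
             ("B", "T", "G"), ("B", "S", "G"), ("B", "S", "G"), ("D", "S", "G")]
POLAND = [("B", "T", "P"), ("B", "T", "P"), ("B", "T", "P"), ("D", "T", "P"),
          ("D", "T", "P"), ("D", "S", "P"), ("B", "S", "P"), ("D", "S", "P")]
TRAIN = GREENLAND * 2 + POLAND


def naive_bayes_posterior(train, hair, height):
    """Return {country: P(country | hair, height)} under the Naive Bayes assumption.

    Assumption: every probability is a relative frequency (no smoothing).
    """
    class_counts = Counter(country for _, _, country in train)
    hair_counts = Counter((h, c) for h, _, c in train)
    height_counts = Counter((t, c) for _, t, c in train)
    total = len(train)

    # Unnormalised score per class: P(country) * P(hair | country) * P(height | country).
    scores = {}
    for country, n in class_counts.items():
        prior = Fraction(n, total)
        p_hair = Fraction(hair_counts[(hair, country)], n)
        p_height = Fraction(height_counts[(height, country)], n)
        scores[country] = prior * p_hair * p_height

    # Normalise so that the scores sum to one over the classes.
    evidence = sum(scores.values())
    return {country: score / evidence for country, score in scores.items()}


# New individual from the question: blond hair (B) and tall (T).
posterior = naive_bayes_posterior(TRAIN, hair="B", height="T")
map_country = max(posterior, key=posterior.get)
print(map_country, posterior)
```

Exact fractions are used instead of floating point so that the intermediate values can be compared directly with working done by hand.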