University of Wisconsin – Madison
Computer Sciences Department

CS 760 – Machine Learning
Spring 2010 Exam
11am–12:30pm, Monday, April 26, 2010
Room 1240 CS

CLOSED BOOK (one sheet of notes and a calculator allowed)

Write your answers on these pages and show your work. If you feel that a question is not fully specified, state any assumptions you need to make in order to solve the problem. You may use the backs of these sheets for scratch work; if you use the back for any of your final answers, be sure to clearly mark that on the front side of the sheet. Neatly write your name on this and all other pages of this exam.

Name ________________________________________________________________

 Problem    Score    Max Score
    1       ______      20
    2       ______      20
    3       ______      20
    4       ______      20
    5       ______      20
  TOTAL     ______     100

Problem 1 – Learning from Labeled Examples (20 points)

You have a dataset that involves three features. Feature C's values are in [0, 1000]; the other two features are Boolean-valued.

        A   B    C    Category
 Ex1    F   T   115    false
 Ex2    T   F   890    false
 Ex3    T   T   257    true
 Ex4    F   F   509    true
 Ex5    T   T   753    true

a) How much information about the category is gained by knowing whether or not the value of feature C is less than 333?

b) How much information is there in knowing whether or not features A and B have the same value?

c) A knowledgeable reviewer says that the above dataset was not very well pre-processed for nearest-neighbor algorithms. Briefly explain why a reviewer might say that.

d) Assume a one-norm SVM puts weight = -3 on feature A, weight = 2 on feature B, and weight = 0 on feature C. What would the cost of this solution be, based on this question's five training examples (shown in the table above)? If you need to make any additional assumptions, be sure to state and briefly justify them.

Problem 2 – Aspects of Supervised Learning (20 points)

a) Explain what active learning means. Also briefly describe how you might use Bagging to address the task of active learning.

b) Assume we have a supervised-learning task where the examples are represented by 26 Boolean features, A–Z. We guess that the true concept is of the form

        Literal1 ∧ Literal2 ∧ Literal3

where each Literal_i is one of the features A–Z or its negation, and where a given feature can appear at most once in the concept (so "C ∧ ¬M ∧ A" is a valid concept, but "C ∧ ¬M ∧ M" is not). If 90% of the time we want to learn a concept whose accuracy is at least 95%, how many training examples should we collect?

c) Assume that our learning algorithm is to simply (and stupidly) learn the model f(x) = the maximum output value seen in the training set. We want to estimate the error due to bias (in the bias-variance sense) of this algorithm, so we collect a number of possible training sets, using the notation N→M to mean that for input N the output is M (i.e., there is one input feature and the output is a single number):

        { 1→3, 2→2 }
        { 4→5, 3→0 }
        { 2→2, 4→5 }
        { 3→0, 3→0 }
        { 2→2, 1→3 }

Based on this sample of possible training sets, what is the estimated error, due to this algorithm's bias, for the input value of 2? Be sure to show your work and explain your answer.
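For parts a) and b) of Problem 1, the arithmetic reduces to Shannon entropies over the five category labels. A minimal Python sketch (with the table above hard-coded) that computes the information gain for the "C < 333" split:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Dataset from the table: (A, B, C, category)
examples = [
    ('F', 'T', 115, 'false'),
    ('T', 'F', 890, 'false'),
    ('T', 'T', 257, 'true'),
    ('F', 'F', 509, 'true'),
    ('T', 'T', 753, 'true'),
]

labels = [cat for (_, _, _, cat) in examples]
below  = [cat for (_, _, c, cat) in examples if c < 333]
above  = [cat for (_, _, c, cat) in examples if c >= 333]

remainder = (len(below) / len(labels)) * entropy(below) \
          + (len(above) / len(labels)) * entropy(above)
gain = entropy(labels) - remainder
print(f"Info gain from 'C < 333': {gain:.3f} bits")  # ~0.020
```

The same entropy helper can be reused for part b) by partitioning the examples on whether A == B instead of on the value of C.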
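For Problem 2b, one standard approach is the PAC bound for a finite hypothesis space, m ≥ (1/ε)(ln|H| + ln(1/δ)). The sketch below assumes the concept is an unordered conjunction of three distinct, possibly negated features, so |H| = C(26,3) · 2³; if your counting of |H| differs, the same arithmetic still applies:

```python
import math

# PAC bound for a finite hypothesis space:
#   m >= (1/epsilon) * (ln|H| + ln(1/delta))
# Assumed counting: unordered conjunctions of 3 distinct literals over
# 26 features, each feature used at most once, possibly negated.
H = math.comb(26, 3) * 2**3        # 2600 * 8 = 20800 hypotheses
epsilon, delta = 0.05, 0.10        # accuracy >= 95%, confidence >= 90%
m = (1 / epsilon) * (math.log(H) + math.log(1 / delta))
print(math.ceil(m))                # ~245 examples under these assumptions
```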
Problem 3 – Reinforcement Learning (20 points)

Consider the deterministic reinforcement-learning environment drawn below (let γ = 0.5). The numbers on the arcs indicate the immediate rewards. Once the agent reaches the 'end' state, the current episode ends and the agent is magically transported to the 'start' state. The probability of an exploration step is 0.02.

[Figure: a graph with states start, a, b, c, and end; the arcs carry the immediate rewards -3, 9, -5, 2, 7, 4, and -1000.]

a) A one-step, Q-table learner follows the path start → b → end. On the graph below, show the Q values that have changed, and show your work to the right of the graph. Assume that for all legal actions, the initial values in the Q table are 6.

[Figure: blank copy of the graph for your answer.]

b) Starting with the Q table you produced in Part a, again follow the path start → b → end and show the Q values below that have changed. Show your work to the right.

[Figure: blank copy of the graph for your answer.]

c) State and informally explain the optimal path from start to end that a Q-table learner will learn after a large number of trials in this environment. (You do not need to show the score of every possible path.)

d) Repeat Part c, but this time assume the SARSA algorithm is being used.

e) In class and in the text, a convergence proof for Q learning was presented. If we use a function approximator, this proof no longer applies. Briefly explain why.

Problem 4 – Experimental Methodology (20 points)

a) Assume that on some Boolean-prediction task you train a perceptron on 1000 examples and get 850 correct, then test your learned model on a fresh set of 100 examples and find it predicts 80 correctly. Give an estimate, including the 95% confidence interval, for the expected accuracy on the next 100 randomly drawn examples.

b) Sketch a pair of learning curves that might result from an experiment where one evaluated whether or not a given feature-selection algorithm helped. Be sure to label the axes and informally explain what your curves show. Why would a learning curve even be used for an experiment like this?

c) Assume you have trained a Bayesian network for a Boolean-valued task. For each of the test-set examples below, the second column reports the probability the trained Bayesian network computed for this example, while the third column lists the correct category.

     Example    Probability(Output is True)    Correct Category
        1                  0.99                    positive
        3                  0.81                    negative
        2                  0.53                    positive
        4                  0.26                    negative
        5                  0.04                    negative

Draw to the right of this table the ROC curve for this model (it is fine to simply 'connect the dots,' that is, to make your curve piece-wise linear). Be sure to label your axes.

Problem 5 – Miscellaneous Short Answers (20 points)

Briefly define and discuss the importance in machine learning of each of the following:

weight decay
    definition:
    importance:

kernels that compute the distance between graph-based examples ['graph' here is in the sense of arcs and nodes, as opposed to plots of x vs. f(x)]
    definition:
    importance:

structure search
    definition:
    importance:

State and briefly explain two ways that the Random Forest algorithm reduces the chances of overfitting a training set.

i)

ii)
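For Problems 3a and 3b, a minimal sketch of the deterministic one-step Q-learning update (learning rate 1 since the environment is deterministic, γ = 0.5, all legal actions initialized to 6, as the problem states). The action sets and reward values below are placeholders, since they must be read off the figure:

```python
GAMMA = 0.5

def q_update(q, state, action, reward, next_state):
    """Deterministic one-step Q-learning: Q(s,a) = r + gamma * max_a' Q(s',a')."""
    future = max(q[next_state].values()) if q.get(next_state) else 0.0
    q[state][action] = reward + GAMMA * future

# Initial Q table: every legal action starts at 6.
q = {
    'start': {'b': 6.0},   # placeholder action set; add the other legal actions
    'b':     {'end': 6.0},
    'end':   {},            # terminal: no outgoing actions
}

R_START_B = -3.0   # placeholder reward; substitute the value from the graph
R_B_END   = 7.0    # placeholder reward; substitute the value from the graph

# Follow the path start -> b -> end, updating as we go (Part a).
q_update(q, 'start', 'b', R_START_B, 'b')
q_update(q, 'b', 'end', R_B_END, 'end')
print(q)
```

Running the same two updates again from the resulting table gives the Part b answer.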
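For Problem 4a, the usual normal-approximation confidence interval is computed from the fresh test set (the 80-out-of-100 figure), not from the training-set accuracy. A quick sketch:

```python
import math

def accuracy_ci(correct, n, z=1.96):
    """Normal-approximation 95% confidence interval for test-set accuracy."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (p - half, p + half)

p, (lo, hi) = accuracy_ci(80, 100)
print(f"accuracy = {p:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")  # [0.722, 0.878]
```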
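For Problem 4c, 'connecting the dots' amounts to sweeping the classification threshold down through the predicted probabilities and recording one point per example: false-positive rate on the x-axis, true-positive rate on the y-axis. A sketch using the table above:

```python
def roc_points(scored):
    """scored: list of (probability, is_positive); returns ROC points
    obtained by lowering the threshold past each example in turn."""
    scored = sorted(scored, key=lambda t: -t[0])
    pos = sum(1 for _, y in scored if y)
    neg = len(scored) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in scored:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Test-set predictions from the table in Problem 4c.
preds = [(0.99, True), (0.81, False), (0.53, True), (0.26, False), (0.04, False)]
print(roc_points(preds))
# [(0, 0), (0, 0.5), (0.33, 0.5), (0.33, 1.0), (0.67, 1.0), (1.0, 1.0)]
```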