University of Wisconsin – Madison
Computer Sciences Department
CS 760 - Machine Learning
Spring 2010
Exam
11am-12:30pm, Monday, April 26, 2010
Room 1240 CS
CLOSED BOOK
(one sheet of notes and a calculator allowed)
Write your answers on these pages and show your work. If you feel that a question is not fully
specified, state any assumptions you need to make in order to solve the problem. You may use
the backs of these sheets for scratch work. If you use the back for any of your final answers, be
sure to clearly mark that on the front side of the sheets.
Neatly write your name on this and all other pages of this exam.
Name
________________________________________________________________
Problem     Score     Max Score
1           ______    20
2           ______    20
3           ______    20
4           ______    20
5           ______    20
TOTAL       ______    100
Problem 1 – Learning from Labeled Examples (20 points)
You have a dataset that involves three features. Feature C’s values are in [0, 1000]. The other
two features are Boolean-valued.
        A    B    C      Category
Ex1     F    T    115    false
Ex2     T    F    890    false
Ex3     T    T    257    true
Ex4     F    F    509    true
Ex5     T    T    753    true
a) How much information about the category is gained by knowing whether or not the value of
feature C is less than 333?
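For checking part a) numerically, here is a minimal sketch (not the official solution) that computes the information gain of the C < 333 split directly from the table above:

    import math

    def entropy(pos, neg):
        """Binary entropy, in bits, of a class distribution with pos/neg counts."""
        total = pos + neg
        h = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                h -= p * math.log2(p)
        return h

    # (C value, category) pairs taken from the table above.
    examples = [(115, False), (890, False), (257, True), (509, True), (753, True)]

    below = [cat for c, cat in examples if c < 333]   # Ex1, Ex3
    above = [cat for c, cat in examples if c >= 333]  # Ex2, Ex4, Ex5

    h_prior = entropy(3, 2)  # three true, two false overall
    h_cond = (len(below) / 5) * entropy(sum(below), len(below) - sum(below)) \
           + (len(above) / 5) * entropy(sum(above), len(above) - sum(above))
    print(f"InfoGain(C < 333) = {h_prior - h_cond:.3f} bits")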
b) How much information is there in knowing whether or not features A and B have the same
value?
c) A knowledgeable reviewer says that the above data set was not very well pre-processed for
nearest-neighbor algorithms. Briefly explain why a reviewer might say that.
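One concrete illustration of the preprocessing issue, a sketch assuming min-max scaling is the kind of fix the reviewer has in mind:

    # Feature C spans [0, 1000] while A and B are 0/1, so an unscaled Euclidean
    # (or Manhattan) distance between examples is dominated almost entirely by C.
    # Min-max scaling maps C onto [0, 1], putting all three features on one scale.
    c_values = [115, 890, 257, 509, 753]
    c_scaled = [(c - 0) / (1000 - 0) for c in c_values]  # stated range is [0, 1000]
    print(c_scaled)  # [0.115, 0.89, 0.257, 0.509, 0.753]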
d) Assume a one-norm SVM puts weight = -3 on feature A, weight = 2 on feature B, and
weight = 0 on feature C. What would the cost of this solution be, based on this question’s
five training examples? If you need to make any additional assumptions, be sure to state and
briefly justify them.
The training examples repeated for your convenience:
        A    B    C      Category
Ex1     F    T    115    false
Ex2     T    F    890    false
Ex3     T    T    257    true
Ex4     F    F    509    true
Ex5     T    T    753    true
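For part d), a minimal sketch of how such a cost could be evaluated. Assumptions are flagged explicitly: the one-norm SVM cost is taken to be the sum of hinge losses plus an L1 penalty on the weights, T/F are encoded as 1/0, C is used raw, and the threshold b and trade-off constant lam are free choices not given in the problem (both hypothetical).

    # Sketch only: one-norm (L1-regularized) SVM cost = sum of hinge losses
    # plus lam * ||w||_1. The encoding (T=1, F=0), raw C values, the bias b,
    # and lam are all assumptions, not part of the problem statement.
    def one_norm_svm_cost(w, b, X, y, lam=1.0):
        hinge = sum(max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                    for x, yi in zip(X, y))
        return hinge + lam * sum(abs(wi) for wi in w)

    X = [[0, 1, 115], [1, 0, 890], [1, 1, 257], [0, 0, 509], [1, 1, 753]]
    y = [-1, -1, +1, +1, +1]  # category: false -> -1, true -> +1
    print(one_norm_svm_cost(w=[-3, 2, 0], b=0.0, X=X, y=y))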
Problem 2 – Aspects of Supervised Learning (20 points)
a) Explain what active learning means. Also briefly describe how you might use Bagging to
address the task of active learning.
b) Assume we have a supervised-learning task where the examples are represented by 26
Boolean features, A-Z. We guess that the true concept is of the form:
Literal1 ∧ Literal2 ∧ Literal3
where Literali is one of the features A-Z or its negation, and where a given feature can
appear at most once in the concept (so "C ∧ ¬M ∧ A" is a valid concept, but "C ∧ ¬M ∧ M"
is not).
If 90% of the time we want to learn a concept whose accuracy is at least 95%, how many
training examples should we collect?
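One standard way to set this up, sketched under the usual assumptions from class: apply the bound for a consistent learner, m ≥ (1/ε)(ln|H| + ln(1/δ)), with ε = 0.05 and δ = 0.10, and count hypotheses as unordered triples of literals over distinct features (the count is the same whichever connective joins the literals):

    |H| = C(26, 3) · 2^3 = 2600 · 8 = 20800
    m ≥ (1/0.05) · (ln 20800 + ln(1/0.10)) = 20 · (9.94 + 2.30) ≈ 245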
c) Assume that our learning algorithm is to simply (and stupidly) learn the model
f(x) = maximum output value seen in the training set.
We want to estimate the error due to bias (in the bias-variance sense) of this algorithm, so we
collect a number of possible training sets, where the notation N→M means that for input N the
output is M (i.e., there is one input feature and the output is a single number):
{ 1→3, 2→2 }  { 4→5, 3→0 }  { 2→2, 4→5 }  { 3→0, 3→0 }  { 2→2, 1→3 }
Based on this sample of possible training sets, what is the estimated error, due to this
algorithm’s bias, for the input value of 2? Be sure to show your work and explain your
answer.
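A minimal sketch of one way to organize the computation (assuming, as the sample suggests, that the true output for input 2 is 2, and measuring the error due to bias as the squared gap between the average prediction and the truth):

    # Five sampled training sets, written as (input, output) pairs.
    training_sets = [
        [(1, 3), (2, 2)], [(4, 5), (3, 0)], [(2, 2), (4, 5)],
        [(3, 0), (3, 0)], [(2, 2), (1, 3)],
    ]
    # The learner ignores x: f(x) = maximum output value seen in the training set.
    predictions = [max(out for _, out in ts) for ts in training_sets]
    mean_pred = sum(predictions) / len(predictions)  # average model at x = 2
    true_at_2 = 2  # assumption: every occurrence of input 2 in the sample outputs 2
    print("estimated bias^2 at x = 2:", (mean_pred - true_at_2) ** 2)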
Problem 3 – Reinforcement Learning (20 points)
Consider the deterministic reinforcement environment drawn below (let γ=0.5). The numbers on
the arcs indicate the immediate rewards. Once the agent reaches the ‘end’ state the current
episode ends and the agent is magically transported to the ‘start’ state. The probability of an
exploration step is 0.02.
[Figure: a deterministic RL graph with states start, a, b, c, and end; the arcs carry
the immediate rewards -3, 9, -5, 2, 7, 4, and -1000.]
a) A one-step, Q-table learner follows the path start → b → end. On the graph below, show
the Q values that have changed, and show your work to the right of the graph. Assume that
for all legal actions, the initial values in the Q table are 6.
[Answer figure: the same states start, a, b, c, and end, without reward labels; write the
updated Q values on the arcs.]
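For reference, the one-step tabular update for a deterministic world (learning rate 1) is

    Q(s, a) ← r(s, a) + γ · max over a' of Q(s', a')

applied at each step of the path. Along start → b → end with γ = 0.5 and every entry initialized to 6, only the entries for the two actions actually taken, Q(start, b) and Q(b, end), can change.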
b) Starting with the Q table you produced in Part a, again follow the path start → b → end and
show the Q values below that have changed. Show your work to the right.
[Answer figure: the same states start, a, b, c, and end, without reward labels; write the
updated Q values on the arcs.]
c) State and informally explain the optimal path from start to end that a Q-table learner will
learn after a large number of trials in this environment. (You do not need to show the score
of every possible path. The original RL graph appears below for convenience.)
start → ____________________________ → end
d) Repeat Part c but this time assume the SARSA algorithm is being used.
start → ____________________________ → end
e) In class and in the text, a convergence proof for Q learning was presented. If we use a
function approximator, this proof no longer applies. Briefly explain why.
Here again is the version of the RL graph with the immediate rewards shown.
[Figure: the RL graph repeated, with states start, a, b, c, and end and immediate rewards
-3, 9, -5, 2, 7, 4, and -1000 on the arcs.]
Problem 4 – Experimental Methodology (20 points)
a) Assume on some Boolean-prediction task, you train a perceptron on 1000 examples and get
850 correct, then test your learned model on a fresh set of 100 examples and find it predicts
80 correctly. Give an estimate, including the 95% confidence interval, for the expected
accuracy on the next 100 randomly drawn examples.
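A sketch of the standard normal-approximation interval on the fresh test set (the 850/1000 training accuracy is an optimistically biased estimate, so only the 80/100 test result enters):

    p̂ ± 1.96 · sqrt(p̂(1 - p̂)/n) = 0.80 ± 1.96 · sqrt(0.80 · 0.20 / 100) ≈ 0.80 ± 0.078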
b) Sketch a pair of learning curves that might result from an experiment where one evaluated
whether or not a given feature-selection algorithm helped. Be sure to label the axes and
informally explain what your curves show.
Why would a learning curve even be used for an experiment like this?
c) Assume you have trained a Bayesian network for a Boolean-valued task. For each of the
test-set examples below, the second column reports the probability the trained Bayesian
network computed for this example, while the third column lists the correct category.
Example    Probability(Output is True)    Correct Category
1          0.99                           positive
3          0.81                           negative
2          0.53                           positive
4          0.26                           negative
5          0.04                           negative
Draw to the right of this table the ROC curve
for this Bayes net (it is fine to simply 'connect the dots,'
that is, to make your curve piecewise linear).
Be sure to label your axes.
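A minimal sketch of how the piecewise-linear curve can be traced from the table: sweep a threshold down through the computed probabilities and record (false-positive rate, true-positive rate) after each example.

    # (probability, is_positive) pairs from the table, one per test example.
    scored = [(0.99, 1), (0.81, 0), (0.53, 1), (0.26, 0), (0.04, 0)]
    P = sum(label for _, label in scored)   # number of positives (2)
    N = len(scored) - P                     # number of negatives (3)

    tp = fp = 0
    points = [(0.0, 0.0)]                   # ROC starts at the origin
    for _, label in sorted(scored, reverse=True):  # descending probability
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))     # x = FPR, y = TPR after this example
    print(points)                           # vertices of the 'connect the dots' ROC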
Problem 5 – Miscellaneous Short Answers (20 points)
Briefly define and discuss the importance in machine learning of each of the following:
weight decay
definition:
importance:
kernels that compute the distance between graph-based examples
[‘graph’ here is in the sense of arcs and nodes, as opposed to plots of x vs. f(x)]
definition:
importance:
structure search
definition:
importance:
State and briefly explain two ways that the Random Forest algorithm reduces the chances of
overfitting a training set.
i)
ii)
Feel free to tear off this page and use it for ‘scratch’ paper.