Machine Learning
Lecture 4: Greedy Local Search
(Hill Climbing)
Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349
Local search algorithms
• We’ve discussed ways to select a hypothesis h
that performs well on training examples, e.g.
– Candidate-Elimination
– Decision Trees
• Another technique that is quite general:
– Start with some (perhaps random) hypothesis h
– Incrementally improve h
• Known as local search
Example: n-queens
• Put n queens on an n × n board with no
two queens on the same row, column, or
diagonal
Hill-climbing search
• "Like climbing Everest in thick fog with
amnesia“
h = initialState
loop:
h’ = highest valued Successor(h)
if Value(h) >= Value(h’)
return h
else
h = h’
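A minimal Python sketch of this loop (the successors and value functions are placeholders you would supply for your own hypothesis space; nothing below is specific to the slides):

def hill_climb(initial_state, successors, value):
    # Greedy local search: move to the best neighbor until no neighbor improves.
    h = initial_state
    while True:
        neighbors = successors(h)
        if not neighbors:
            return h
        best = max(neighbors, key=value)
        if value(best) <= value(h):
            return h   # local maximum (or plateau): no improving move
        h = best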
Hill-climbing search
• Problem: depending on initial state, can
get stuck in local maxima
Underfitting
• Overfitting: Performance on test examples
is much lower than on training examples
• Underfitting: Performance on training
examples is low
Two leading causes of underfitting:
– Hypothesis space is too small/simple
– Training algorithm (i.e., hypothesis search
algorithm) stuck in local maxima
Hill-climbing search: 8-queens problem
(Figure: a board configuration with v = 17)
• v = number of pairs of queens that are attacking
each other, either directly or indirectly
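This evaluation function is easy to write down in Python, assuming the common one-queen-per-column encoding (state[c] = row of the queen in column c); the encoding itself is an assumption, not from the slide:

from itertools import combinations

def attacking_pairs(state):
    # v = number of pairs of queens attacking each other (same row or diagonal).
    v = 0
    for (c1, r1), (c2, r2) in combinations(enumerate(state), 2):
        if r1 == r2 or abs(r1 - r2) == abs(c1 - c2):
            v += 1
    return v

# All eight queens in the same row: every one of the 28 pairs attacks.
print(attacking_pairs((0, 0, 0, 0, 0, 0, 0, 0)))  # 28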
Hill-climbing search: 8-queens problem
• A local minimum with v = 1
Simulated annealing search
• Idea: escape local maxima by allowing some
"bad" moves but gradually decrease their
frequency
h = initialState
T = initialTemperature
loop:
    h' = random Successor(h)
    ΔV = Value(h') - Value(h)
    if ΔV > 0
        h = h'
    else
        h = h' with probability e^(ΔV / T)
    decrease T; if T == 0, return h
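One way this might look in Python; the geometric cooling schedule, temperature bounds, and the successors/value helpers are illustrative assumptions rather than part of the slide:

import math, random

def simulated_annealing(initial_state, successors, value,
                        initial_temp=1.0, cooling=0.995, min_temp=1e-3):
    # Accept downhill moves with probability exp(delta_v / T),
    # so "bad" moves become rarer as T decreases.
    h = initial_state
    T = initial_temp
    while T > min_temp:
        h_new = random.choice(successors(h))
        delta_v = value(h_new) - value(h)
        if delta_v > 0 or random.random() < math.exp(delta_v / T):
            h = h_new
        T *= cooling
    return h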
Properties of simulated annealing
• One can prove: If T decreases slowly enough,
then simulated annealing search will find a
global optimum with probability approaching 1
• Widely used in VLSI layout, airline scheduling, etc.
Local beam search
• Keep track of k states rather than just one
• Start with k randomly generated states
• At each iteration, all the successors of all k
states are generated
• If any one is a goal state, stop; else select the k
best successors from the complete list and
repeat.
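A compact Python sketch of this procedure (the is_goal, successors, and value helpers and the iteration cap are assumptions for illustration):

import heapq

def local_beam_search(initial_states, successors, value, is_goal, max_iters=1000):
    # Keep the k best states, expand all of them, keep the k best successors.
    states = list(initial_states)
    k = len(states)
    for _ in range(max_iters):
        pool = [s for state in states for s in successors(state)]
        if not pool:
            break
        for s in pool:
            if is_goal(s):
                return s
        states = heapq.nlargest(k, pool, key=value)
    return max(states, key=value)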
Gradient Descent
• Hill Climbing and Simulated Annealing are
“generate and test” algorithms
– Successor function generates candidates,
Value function helps select
• In some cases, we can do much better:
– Define: Error(training data D, hypothesis h)
– If h is represented by parameters w1, …, wn and
∂Error/∂wi is known, we can compute the error
gradient and descend in the (locally) steepest
direction, as in the sketch below
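For instance, a minimal gradient-descent sketch for a linear hypothesis with squared error; the hypothesis class, error function, learning rate, and step count are all illustrative choices, not the lecture's:

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, n_steps=1000):
    # Minimize sum-of-squared-errors for h(x) = X @ w by stepping
    # against the error gradient dError/dw = 2 * X^T (X w - y).
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ w - y)
        w -= learning_rate * grad
    return w

# Tiny usage example: fit y = 2x from three training points.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(X, y))  # close to [2.0]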
About distance….
• Clustering requires distance measures.
• Local methods require a measure of “locality”
• Search engines require a measure of similarity
• So… when are two things close to each other?
Euclidean Distance
• What people intuitively think of as "distance"

d(A, B) = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}

(Figure: points A and B plotted against dimension 1 (x) and dimension 2 (y))
Generalized Euclidean Distance
d(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i|^2 \right)^{1/2}

where A = {a_1, a_2, …, a_n}, B = {b_1, b_2, …, b_n}, and a_i, b_i ∈ ℝ for all i
Weighting Dimensions
• Apparent clusters at one scaling of X are
not so apparent at another scaling
Weighted Euclidean Distance
• You can, of course, compensate by weighting your dimensions…
d(A, B) = \left( \sum_{i=1}^{n} w_i |a_i - b_i|^2 \right)^{1/2}
More Generalization: Minkowski metric
• My three favorites are special cases of this:
d(A, B) = \left( \sum_{i=1}^{n} w_i |a_i - b_i|^p \right)^{1/p}

Manhattan Distance: p = 1
Euclidean Distance: p = 2
Hamming Distance: p = 1 and a_i, b_i ∈ {0, 1}
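A small Python sketch covering all three special cases (the function name and defaults are mine, not from the slide):

def minkowski(a, b, p=2, weights=None):
    # Weighted Minkowski distance between two equal-length vectors.
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * abs(x - y) ** p
               for w, x, y in zip(weights, a, b)) ** (1.0 / p)

# The same pair of points under different values of p:
print(minkowski([0, 0], [3, 4], p=2))  # Euclidean: 5.0
print(minkowski([0, 0], [3, 4], p=1))  # Manhattan: 7.0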
What is a “metric”?
• A metric has these four qualities.
d(A, B) = 0 iff A = B (reflexivity)
d(A, B) ≥ 0 (non-negativity)
d(A, B) = d(B, A) (symmetry)
d(A, B) + d(B, C) ≥ d(A, C) (triangle inequality)
• …otherwise, call it a “measure”
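One way to probe these axioms in code is to test them on a handful of points; a finite check like this can only refute the properties, never prove them, and the helper below is my own sketch:

from itertools import product

def looks_like_a_metric(d, points, tol=1e-9):
    # Empirically check the four metric axioms on a finite set of points.
    for a, b, c in product(points, repeat=3):
        if d(a, b) < -tol:                         # non-negativity
            return False
        if (d(a, b) < tol) != (a == b):            # reflexivity
            return False
        if abs(d(a, b) - d(b, a)) > tol:           # symmetry
            return False
        if d(a, b) + d(b, c) + tol < d(a, c):      # triangle inequality
            return False
    return True

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(looks_like_a_metric(manhattan, [(0, 0), (1, 0), (0, 2)]))  # True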
Metric, or not?
• Driving distance with 1-way streets
• Categorical Stuff :
– Is the distance Jazz -> Blues -> Rock always no less
than the distance Jazz -> Rock?
What about categorical variables?
• Consider feature vectors for genre &
vocals
– Genre: {Blues, Jazz, Rock, Zydeco}
– Vocals: {vocals, no vocals}
s1 = {rock, vocals}
s2 = {jazz, no vocals}
s3 = {rock, no vocals}
• Which two songs are more similar?
Binary Features + Hamming distance
                        Blues  Jazz  Rock  Zydeco  Vocals
s1 = {rock, vocals}       0     0     1      0       1
s2 = {jazz, no vocals}    0     1     0      0       0
s3 = {rock, no vocals}    0     0     1      0       0
Hamming Distance = number of bits different
between binary vectors
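Under a one-hot encoding like the table above, Hamming distance answers the question from the previous slide; a sketch (the encode helper and the genre ordering are mine):

GENRES = ["blues", "jazz", "rock", "zydeco"]

def encode(genre, vocals):
    # Turn a (genre, vocals) pair into a binary feature vector.
    return [1 if g == genre else 0 for g in GENRES] + [1 if vocals else 0]

def hamming(a, b):
    # Number of positions where two binary vectors differ.
    return sum(x != y for x, y in zip(a, b))

s1 = encode("rock", True)    # {rock, vocals}
s2 = encode("jazz", False)   # {jazz, no vocals}
s3 = encode("rock", False)   # {rock, no vocals}
print(hamming(s1, s2), hamming(s1, s3), hamming(s2, s3))  # 3 1 2 -> s1 and s3 are closest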
Hamming Distance
d(A, B) = \sum_{i=1}^{n} |a_i - b_i|

where A = {a_1, a_2, …, a_n}, B = {b_1, b_2, …, b_n}, and a_i, b_i ∈ {0, 1} for all i
Other approaches…
• Define your own distance:
f(a, b) = quote frequency:

             Beethoven  Beatles  Liz Phair
Beethoven        7         0         0
Beatles          4         5         0
Liz Phair        ?         1         2
Missing data
• What if, for some category, on some
examples, there is no value given?
• Approaches:
– Discard all examples missing the category
– Fill in the blanks with the mean value
– Only use a category in the distance measure if
both examples give a value
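The first two approaches are easy to sketch with NumPy (the toy array below is made up); the third is formalized on the next slide:

import numpy as np

# Rows are examples, columns are features, np.nan marks a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# Approach 1: discard every example that has a missing value.
complete = X[~np.isnan(X).any(axis=1)]

# Approach 2: fill each blank with that feature's mean over the defined values.
means = np.nanmean(X, axis=0)
filled = np.where(np.isnan(X), means, X)

print(complete)  # only the first row survives
print(filled)    # missing entries replaced by the column means [2.0, 3.0]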
Dealing with missing data
w_i = 0 if both a_i and b_i are defined, and 1 otherwise

d(A, B) = \frac{n}{n - \sum_{i=1}^{n} w_i} \sum_{i=1}^{n} f(a_i, b_i)

where the sum of f(a_i, b_i) runs only over the features defined for both examples, and the leading factor rescales that sum back to n features.