Machine Learning
Topic 10: Computational Learning Theory
Bryan Pardo, Machine Learning: EECS 349 Fall 2009
Overview
• Are there general laws that govern learning?
– Sample Complexity: How many training examples are needed to learn a successful hypothesis?
– Computational Complexity: How much computational effort is needed to learn a successful hypothesis?
– Mistake Bound: How many training examples will the learner misclassify before converging to a successful hypothesis?
Definition
• The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D:
$\mathrm{error}_D(h) \equiv \Pr_{x \sim D}[\, c(x) \neq h(x) \,]$
• In a perfect world, we’d like the true error to be 0
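Since D is generally unknown, the true error can only be estimated by sampling. A minimal sketch (the particular concept, hypothesis, and distribution are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target concept and hypothesis on X = [0, 1) (assumed for this example)
def c(x): return (x >= 0.5).astype(int)   # target concept
def h(x): return (x >= 0.6).astype(int)   # learned hypothesis

# Draw instances according to D (here: uniform on [0, 1))
x = rng.random(100_000)

# Monte Carlo estimate of error_D(h) = P_{x~D}[ c(x) != h(x) ]
print("estimated true error:", np.mean(c(x) != h(x)))   # close to 0.10
```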
The world isn’t perfect
• If we can't provide every instance for training, a consistent hypothesis may have error on unobserved instances.
[Figure: instance space X, showing the target concept c, the hypothesis h, and the training set]
• How many training examples do we need to bound the likelihood of error to a reasonable level? When is our hypothesis Probably Approximately Correct (PAC)?
Some terms
X is the set of all possible instances
C is the set of all possible concepts c, where $c : X \to \{0,1\}$
H is the set of hypotheses considered by a learner, $H \subseteq C$
L is the learner
D is a probability distribution over X that generates the observed instances
Definition: ε-exhausted
IN ENGLISH:
The set of hypotheses consistent with the training data T is ε-exhausted if, when you test them on the actual distribution of instances, all consistent hypotheses have error below ε.
IN MATHESE:
$VS_{H,T}$ is $\varepsilon$-exhausted for concept $c$ and sample distribution $D$ if...
$\forall h \in VS_{H,T},\ \mathrm{error}_D(h) < \varepsilon$
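As a concrete illustration (an assumption, not from the slides), the definition is easy to check once the true error of each consistent hypothesis is known:

```python
def is_eps_exhausted(version_space_errors, eps):
    """True if every hypothesis in the version space (represented here
    by its true error under D) has error below eps."""
    return all(err < eps for err in version_space_errors)

print(is_eps_exhausted([0.02, 0.05, 0.01], eps=0.1))  # True
print(is_eps_exhausted([0.02, 0.15, 0.01], eps=0.1))  # False
```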
A Theorem
If the hypothesis space H is finite, and the training set T contains m independent, randomly drawn examples of concept c,
THEN, for any $0 \leq \varepsilon \leq 1$...
$P(VS_{H,T} \text{ is NOT } \varepsilon\text{-exhausted}) \leq |H|\, e^{-\varepsilon m}$
Proof of Theorem
If hypothesis h has true error at least ε, the probability of it getting a single random example right is:
$P(h \text{ got 1 example right}) \leq 1 - \varepsilon$
Ergo, the probability of h getting m examples right is:
$P(h \text{ got } m \text{ examples right}) \leq (1 - \varepsilon)^m$
Proof of Theorem
If there are k hypotheses in H with error at least ε, call the probability that at least one of those k hypotheses got all m instances right P(at least one bad h looks good).
This probability is BOUNDED by $k(1-\varepsilon)^m$:
$P(\text{at least one bad } h \text{ looks good}) \leq k(1-\varepsilon)^m$
Think about why this is a bound and not an equality.
Proof of Theorem (continued)
Since $k \leq |H|$, it follows that $k(1-\varepsilon)^m \leq |H|(1-\varepsilon)^m$.
If $0 \leq \varepsilon \leq 1$, then $(1-\varepsilon)^m \leq e^{-\varepsilon m}$.
Therefore...
$P(\text{at least one bad } h \text{ looks good}) \leq k(1-\varepsilon)^m \leq |H|(1-\varepsilon)^m \leq |H|\, e^{-\varepsilon m}$
Proof complete!
We now have a bound on the likelihood that a hypothesis consistent with the training data will have error $\geq \varepsilon$.
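To see the bound in action, here is a small simulation (a sketch under assumed settings, not part of the slides): X = {0, ..., 99} with D uniform, H the set of threshold hypotheses $h_t(x) = 1[x \geq t]$, and the target concept one of them. The empirical probability that the version space is not ε-exhausted stays well below $|H| e^{-\varepsilon m}$:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100                              # instance space X = {0, ..., N-1}, D uniform
thresholds = np.arange(N + 1)        # H: h_t(x) = 1 iff x >= t, so |H| = N + 1
t_star = 50                          # target concept c = h_{t_star}
eps, m, trials = 0.2, 30, 10_000

# True error of h_t w.r.t. c under uniform D is |t - t_star| / N
true_err = np.abs(thresholds - t_star) / N

failures = 0
for _ in range(trials):
    x = rng.integers(0, N, size=m)               # m i.i.d. instances drawn from D
    y = (x >= t_star).astype(int)                # labels given by the target concept
    # Version space: thresholds consistent with every training example
    consistent = np.array([np.all((x >= t).astype(int) == y) for t in thresholds])
    # Not eps-exhausted if some consistent hypothesis has true error >= eps
    failures += np.any(consistent & (true_err >= eps))

print("empirical P(not eps-exhausted):", failures / trials)
print("bound |H| * exp(-eps * m):     ", (N + 1) * np.exp(-eps * m))
```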
Using the theorem
Let's rearrange to set a bound δ on the likelihood that our true error is ≥ ε, and to see how many training examples we need:
$|H|\, e^{-\varepsilon m} \leq \delta$
$\ln\!\left(|H|\, e^{-\varepsilon m}\right) \leq \ln \delta$
$\ln|H| - \varepsilon m \leq \ln \delta$
$\ln|H| - \ln \delta \leq \varepsilon m$
$\frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) \leq m$
Probably Approximately Correct (PAC)
$m \geq \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
where:
m = the number of training examples
ε = the worst error we'll tolerate
|H| = the hypothesis space size
δ = the likelihood that a hypothesis consistent with the training data will have error ≥ ε
Using the bound
$m \geq \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
Plug in ε, δ, and |H| to get a number of training examples m that will "guarantee" your learner will generate a hypothesis that is Probably Approximately Correct.
NOTE: This assumes that the concept is actually IN H, that H is finite, and that your training set is drawn using distribution D.
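A minimal calculator for this bound (a sketch; the example hypothesis space, conjunctions over 10 boolean features, is an illustrative assumption):

```python
import math

def pac_sample_size(eps, delta, hypothesis_space_size):
    """Smallest m with m >= (1/eps) * (ln|H| + ln(1/delta)),
    assuming a finite H that contains the target concept."""
    return math.ceil((math.log(hypothesis_space_size)
                      + math.log(1 / delta)) / eps)

# e.g. conjunctions over 10 boolean features: each feature is required
# true, required false, or ignored, giving |H| = 3**10
print(pac_sample_size(eps=0.1, delta=0.05, hypothesis_space_size=3 ** 10))  # 140
```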
Problems with PAC
• The PAC Learning framework has 2 disadvantages:
1) It can lead to weak bounds
2) Sample Complexity bounds cannot be established for infinite hypothesis spaces
• We introduce the VC dimension for dealing with these problems
Shattering
Def: A set of instances S is shattered by hypothesis set H iff for every possible concept c on S there exists a hypothesis h in H that is consistent with that concept.
Can a linear separator shatter this?
NO!
The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over those instances.
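The slide's figure is not reproduced here, but the standard example is four points in an XOR arrangement, which no line can shatter, whereas three non-collinear points can be. A brute-force check (a sketch; the point sets and the perceptron-based separability test are illustrative assumptions, not from the slides):

```python
import itertools
import numpy as np

def linearly_separable(points, labels, epochs=2000):
    """Heuristic test: run a perceptron; convergence means a separating
    line exists, failure to converge is treated as 'not separable'."""
    X = np.hstack([points, np.ones((len(points), 1))])   # append bias term
    y = np.where(np.asarray(labels) == 1, 1, -1)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # misclassified -> perceptron update
                w += yi * xi
                updated = True
        if not updated:                   # all points classified correctly
            return True
    return False

def shattered_by_lines(points):
    """True iff every possible labeling of the points is linearly separable."""
    return all(linearly_separable(points, labeling)
               for labeling in itertools.product([0, 1], repeat=len(points)))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # non-collinear
xor_points = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # unit square
print(shattered_by_lines(three_points))  # True
print(shattered_by_lines(xor_points))    # False: the XOR labeling has no separating line
```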
Can a quadratic separator shatter this?
This sounds like a homework problem….
Vapnik-Chervonenkis Dimension
Def: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets can be shattered by H, then VC(H) is infinite.
How many training examples needed?
• Upper bound on m, using VC(H) (this uses the HYPOTHESIS space):
$m \geq \frac{1}{\varepsilon}\left(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\varepsilon)\right)$
• Lower bound on m, using VC(C) (this uses the CONCEPT space):
$m \geq \max\!\left[\frac{1}{\varepsilon}\log\!\left(\frac{1}{\delta}\right),\ \frac{VC(C)-1}{32\,\varepsilon}\right]$
There's more on this in the textbook.
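For comparison with the finite-|H| bound earlier, a quick calculator for the VC-based upper bound (a sketch; the example, linear separators in the plane with VC dimension 3, is illustrative):

```python
import math

def vc_sample_size(eps, delta, vc_dim):
    """Smallest m with
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Linear separators in the plane have VC dimension 3
print(vc_sample_size(eps=0.1, delta=0.05, vc_dim=3))  # 1899
```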
Mistake bound model of learning
• How many mistakes (misclassified examples) will a learner make before it converges to the correct hypothesis?
No time for this in lecture.
Read the book!