Machine Learning Topic 10: Computational Learning Theory
Bryan Pardo, Machine Learning: EECS 349, Fall 2009

Overview

• Are there general laws that govern learning?
  – Sample Complexity: How many training examples are needed to learn a successful hypothesis?
  – Computational Complexity: How much computational effort is needed to learn a successful hypothesis?
  – Mistake Bound: How many training examples will the learner misclassify before converging to a successful hypothesis?

Definition

• The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D:

  $error_D(h) = \Pr_{x \sim D}[c(x) \neq h(x)]$

• In a perfect world, we'd like the true error to be 0.

The world isn't perfect

• If we can't provide every instance for training, a consistent hypothesis may still have error on unobserved instances.

  [Figure: instance space X, showing the target concept c, a hypothesis h, and the training set]

• How many training examples do we need to bound the likelihood of error to a reasonable level? When is our hypothesis Probably Approximately Correct (PAC)?

Some terms

• X is the set of all possible instances
• C is the set of all possible concepts c, where $c : X \to \{0,1\}$
• H is the set of hypotheses considered by a learner, $H \subseteq C$
• L is the learner
• D is a probability distribution over X that generates observed instances

Definition: ε-exhausted

IN ENGLISH: The set of hypotheses consistent with the training data T is ε-exhausted if, when you test them on the actual distribution of instances, all consistent hypotheses have error below ε.

IN MATHESE: $VS_{H,T}$ is ε-exhausted for concept c and sample distribution D if

  $\forall h \in VS_{H,T},\ error_D(h) < \epsilon$

A Theorem

IF hypothesis space H is finite, and training set T contains m independent, randomly drawn examples of concept c,
THEN, for any $0 \leq \epsilon \leq 1$,

  $P(VS_{H,T}\ \text{is NOT } \epsilon\text{-exhausted}) \leq |H|\,e^{-\epsilon m}$

Proof of Theorem

If hypothesis h has true error at least ε, the probability of it getting a single random example right is

  $P(\text{h got 1 example right}) \leq 1 - \epsilon$

Ergo, because the m examples are drawn independently, the probability of h getting m examples right is

  $P(\text{h got } m \text{ examples right}) \leq (1 - \epsilon)^m$

Proof of Theorem (continued)

If there are k hypotheses in H with error at least ε, call the probability that at least one of those k hypotheses got all m instances right P(at least one bad h looks good). By the union bound, this probability is BOUNDED by $k(1-\epsilon)^m$:

  $P(\text{at least one bad h looks good}) \leq k(1-\epsilon)^m$

Think about why this is a bound and not an equality.

Proof of Theorem (continued)

Since $k \leq |H|$, it follows that $k(1-\epsilon)^m \leq |H|(1-\epsilon)^m$.
If $0 \leq \epsilon \leq 1$, then $(1-\epsilon)^m \leq e^{-\epsilon m}$. Therefore,

  $P(\text{at least one bad h looks good}) \leq k(1-\epsilon)^m \leq |H|(1-\epsilon)^m \leq |H|\,e^{-\epsilon m}$

Proof complete! We now have a bound on the likelihood that a hypothesis consistent with the training data will have error at least ε.
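To make the theorem tangible, here is a minimal simulation sketch in Python. The setup is an assumption, not from the slides: instances are the integers 0..99 under a uniform D, H is the class of integer threshold hypotheses, and the target concept, trial count, and names like `true_error` and `prob_not_exhausted` are all illustrative choices. It estimates $P(VS_{H,T}$ is NOT ε-exhausted$)$ empirically and compares it with the $|H|e^{-\epsilon m}$ bound.

```python
import math
import random

# Toy setup (an illustrative assumption): instances are 0..99 under a
# uniform D; H is the set of threshold hypotheses h_t(x) = 1 iff x >= t.
X_SIZE = 100
H = range(X_SIZE + 1)
TARGET = 50                      # the true concept c = h_50

def label(t, x):
    """Threshold hypothesis h_t applied to instance x."""
    return x >= t

def true_error(t):
    """error_D(h_t): fraction of X where h_t disagrees with c = h_TARGET."""
    return abs(t - TARGET) / X_SIZE

def prob_not_exhausted(epsilon, m, trials=2000):
    """Estimate P(version space is NOT epsilon-exhausted) by simulation."""
    bad = 0
    for _ in range(trials):
        # Draw m labeled examples i.i.d. from D
        sample = [(x, label(TARGET, x))
                  for x in (random.randrange(X_SIZE) for _ in range(m))]
        # Version space: hypotheses consistent with every training example
        vs = [t for t in H if all(label(t, x) == y for x, y in sample)]
        # Not epsilon-exhausted iff some consistent h has error >= epsilon
        if any(true_error(t) >= epsilon for t in vs):
            bad += 1
    return bad / trials

epsilon, m = 0.1, 50
print("empirical:", prob_not_exhausted(epsilon, m))
print("bound |H|e^(-eps*m):", len(H) * math.exp(-epsilon * m))
```

On this toy class the empirical probability typically comes out far below the bound, a first hint of the weak-bounds issue raised under "Problems with PAC" below.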
Using the theorem

Let's rearrange, to set a bound δ on the likelihood our true error is at least ε and to see how many training examples we need:

  $|H|\,e^{-\epsilon m} \leq \delta$
  $\ln|H| - \epsilon m \leq \ln\delta$
  $\ln|H| + \ln\tfrac{1}{\delta} \leq \epsilon m$
  $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

Probably Approximately Correct (PAC)

  $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

where:
  m = the number of training examples
  ε = the worst error we'll tolerate
  δ = the likelihood a hypothesis consistent with the training data will have error at least ε
  |H| = the hypothesis space size

Using the bound

  $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

Plug in ε, δ, and |H| to get a number of training examples m that will "guarantee" your learner will generate a hypothesis that is Probably Approximately Correct. (A small numerical sketch of this bound, and of the VC-based bound below, appears at the end of this handout.)

NOTE: This assumes that the concept is actually IN H, that H is finite, and that your training set is drawn using distribution D.

Problems with PAC

• The PAC learning framework has 2 disadvantages:
  1) It can lead to weak bounds
  2) The sample complexity bound cannot be established for infinite hypothesis spaces
• We introduce the VC dimension to deal with these problems.

Shattering

Def: A set of instances S is shattered by hypothesis set H iff for every possible concept on S (i.e., every way of labeling the instances in S) there exists a hypothesis h in H that is consistent with that concept.

Can a linear separator shatter this?

  [Figure: a set of labeled points in the plane]

NO! The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over those instances.

Can a quadratic separator shatter this?

  [Figure: a set of labeled points in the plane]

This sounds like a homework problem...

Vapnik-Chervonenkis Dimension

Def: The Vapnik-Chervonenkis dimension VC(H) of hypothesis space H, defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets can be shattered by H, then VC(H) is infinite.

How many training examples are needed?

• Upper bound on m using VC(H) (this uses the HYPOTHESIS space):

  $m \geq \frac{1}{\epsilon}\left(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon}\right)$

• Lower bound on m using VC(C) (this uses the CONCEPT space):

  $m \geq \max\left[\frac{1}{\epsilon}\log\frac{1}{\delta},\ \frac{VC(C)-1}{32\,\epsilon}\right]$

There's more on this in the textbook.

Mistake bound model of learning

• How many mistakes (misclassified examples) will a learner make before it converges to the correct hypothesis?

No time for this in lecture. Read the book!
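As promised above, here is a minimal sketch of how one might plug numbers into both sample complexity bounds. The example values (ε = 0.1, δ = 0.05, conjunctions over 10 boolean attributes with |H| = 3^10 and VC(H) = 10) are illustrative assumptions, not values from the lecture.

```python
import math

def pac_sample_bound(epsilon, delta, h_size):
    """Finite-H PAC bound: m >= (1/eps)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

def vc_upper_bound(epsilon, delta, vc_h):
    """VC-based bound: m >= (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return math.ceil((4 * math.log2(2.0 / delta)
                      + 8 * vc_h * math.log2(13.0 / epsilon)) / epsilon)

# Example (illustrative): conjunctions over 10 boolean attributes, so
# |H| = 3^10 (each attribute positive, negated, or absent); VC(H) = 10.
print(pac_sample_bound(epsilon=0.1, delta=0.05, h_size=3**10))  # ~140 examples
print(vc_upper_bound(epsilon=0.1, delta=0.05, vc_h=10))
```

Both are worst-case guarantees: they hold for any concept in H and any distribution D, so in practice far fewer examples are often enough.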