Machine Learning Week 2 Lecture 2 Hand In • It is online. • Web board forum for Matlab questions • Comments and corrections very welcome. I will upload new versions as we go along. Currently we are at version 3 • Your data is coming. We might change it over time. Quiz • Go through all Questions Recap Impossibility of Learning! What is f? There are 256 potential functions 8 of them has in sample error 0 Assumptions are needed x1 0 1 0 1 0 1 0 1 x2 0 0 1 1 0 0 1 1 x3 0 0 0 0 1 1 1 1 f(x) 1 0 1 1 0 ? ? ? No Free Lunch "All models are wrong, but some models are useful.” George Box Machine Learning has many different models and algorithms There is no single best model that works best for all problems (No Free Lunch Theorem) Assumptions that works well in one domain may fail in another Probabilistic Approach Repeat N times independently μ is unknown Sample mean: ν #heads/N Sample: h,h,h,t,t,h,t,t,h Hoeffdings Inequality Sample mean is probably approximately correct PAC Classification Connection Testing a Hypothesis Unknown Target Fixed Hypothesis Probability Distribution over x is probability of picking x such that f(x) ≠ h(x) is probability of picking x such that f(x) = h(x) μ is the sum of the probability of all the points X where hypothesis is wrong Sample Mean - true error rate μ Learning? • Only Verification not Learning • For finite hypothesis sets we used union bound • Make sure is close to and minimize Error Functions Walmart. Discount for a given person Error Function h(x)/f(x) Lying True Est. Lying 0 1000 Est. True 1 0 CIA Access (Friday bar stock) Error Function h(x)/f(x) Lying Est. Lying 0 Est. True True 1 1000 Point being. Depends on application 0 Final Diagram Unknown Probability Distribution P(x) Unknown Target P(y | x) Learn Importance Data Set Error Measure e Learning Algorithm Hypothesis Set Final Hypothesis Today • We are still only talking classification • Test Sets • Work towards learning with infinite size hypothesis spaces for classification – Reinvestigate Union Bound – Dichotomies – Break Points The Test Set Fixed hypothesis h, N independent data points, and any ε>0 • • • • Split your data into two parts D-train,D-test Train on D-train and select hypothesis h Test h on D-test, error Apply Hoeffding bound to Test Set • Strong Bound: 1000 points then with 98% probability, in sample error will be within 5% of out of sample error • Unbiased – Just as likely to better than worse • Problem lose data for training • If Error is high it is not a help that it will also be high in practice • Can NOT be used to select h (contamination) Learning Pick a tolerance (risk) δ of failing you can accept Set RHS equal to δ and solve for ε = With Probability 1-δ Generalization Bound Why we minimize in sample error. Union Bound Union Bound Learning Learning algorithm pick hypothesis hl P(hl is bad) is less than the probability that some hypothesis is bad We did not subtract overlapping events!!! Hypotheses seem correlated h2 Change h1 if h1 is bad (poor generalization) then probably so is h2 Hope to improve union bound result Goal • Replace M with something like effective number of hypotheses • General bound. E.g. independent, target function and input distribution • Simple would be nice. Look at finite point sets Dichotomy bit string of length N Fixed set of N points X = (x1,..,xN) Hypothesis set Each gives a dichotomy How Many Different Dichotomies do we get? At Most Capturing the “expressiveness” of the hypothesis set on X Growth Function Fixed set of N points X = (x1,..,xN) Hypothesis set Example 1: Positive Rays 1-Dimensional input space (points on the real line) a Only Change When a moves to different interval Example 2: Intervals 1-Dimensional input space (points on the real line) a1 a1,a2 in separate parts a2 + Put in same Example 3: Convex Sets 2-Dimensional input space (points in the plane) Goal Continued Generalization Bound Imagine we can replace M with growth function RHS is dropping exponentially fast in N If Growth function is a polynomial in N then RHS still drops exponentially in N Bright Idea. Prove Growth function is polynomial in N Prove we can replace M with growth function Bounding Growth Function • Might be hard to compute • Instead of computing the exact value • Prove that it is bounded by a polynomial Shattering and Break Point If then we say that shatters (x1,…,xN) If no data set of size K can be shattered by then K is a break point for If K is a break point for then so is all numbers larger than K? Why? Revisit Examples • Positive Rays • Intervals a1 • Convex sets a a2 2D Linear Classification (Hyperplanes) 2D Linear Classification 3 Points on a line For 2D Linear Classification Hypothesis set 4 is a break point Break Points and Growth Function If has a break point then the growth function is polynomial (needs proof) If not then it is not! By definition of break point: Break Point Game Has Break Point 2 x x x 1 0 20 30 x x2 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 Impossible for 1 1 1 1 0 0 1 1 0 1 0 1 Row 1,2,3,4 Row 6,5,2,1 Row 7,5,3,2 Row 8,5,3,2 Proof Coming If has a break point then the growth function is polynomial Definition: B(n,k) is the maximal number of dichotomies possible on N points such that no subset of k points can be shattered by the dichotomies. More general than hypothesis sets If no data set of size K can be shattered by then K is a break point for for any with break point k Computing B(n,k) – Boundary Cases Cannot shatter set of size 1. There is no way of picking dichotomies that gives different classes for a point. There is only one dichotomy since a different dichotomy would give different class for at least one point There is only one point, this only 2 dichotomies are possible Compute B(N,k)- Recursion N,k >1 List L with all dichotomies in B(n,k) Recursion Consider the first n_1 points, there are α+β different (S2 sets are identical here) They can still at most shatter k points, e.g. B(N-1,k) is an upper bound Consider the first n_1 points in S2. If they can shatter k-1 points we can extend with last point where we have both combinations for all dichotomies. This gives k points we can shatter a contradiction. Proof Coming Base Cases: Induction Step Show for N0+1 for k>1 (k=1 was base case) should be 0 change parameter Continue Make it into one sum Recurrence for binomials Add in zero index again QED