Learning From Data, Lecture 4
Real Learning is Feasible

• Real learning vs. verification
• The two-step solution to learning
• Closer to reality: error and noise

M. Magdon-Ismail, CSCI 4100/6100


Recap: Verification

Fix a single hypothesis h before seeing the data. Hoeffding's inequality then guarantees that, with high probability, Eout(h) ≈ Ein(h): for any ε > 0,

    P[ |Ein(h) − Eout(h)| > ε ] ≤ 2e^(−2ε²N).

[Figure: credit-approval data (age vs. income) with a fixed hypothesis h and its out-of-sample error Eout(h); on the sample D, Ein(h) = 2/9.]


Recap: Selection Bias Illustrated with Coins

A coin-tossing example:

• Toss one coin N times and get no heads; that is very surprising, since it happens with probability P = 1/2^N. We suspect the coin is biased, with P[heads] ≈ 0.
• Toss 70 coins N times each and find one with no heads. Is that surprising? It happens with probability P = 1 − (1 − 1/2^N)^70, which can be large. Do we still expect P[heads] ≈ 0 for the selected coin? No.
• This is similar to the "birthday problem": among just 30 people, two will likely share the same birthday.
• This effect is called selection bias. Selection bias is a very serious trap; medical screening is one example.

Search causes selection bias.
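The selection-bias effect is easy to check numerically. Below is a minimal simulation sketch (not from the lecture); the choices of 10 tosses per coin, 70 coins, and 5000 trials are illustrative assumptions.

    import random

    # Selection-bias sketch: toss many fair coins and always report the
    # "best" one. All parameter values here are illustrative assumptions.
    N_TOSSES = 10    # tosses per coin
    N_COINS  = 70    # number of coins we search over
    N_TRIALS = 5000  # repetitions of the whole experiment

    one_coin_no_heads  = 0  # a single pre-chosen coin shows no heads
    some_coin_no_heads = 0  # at least one of the 70 coins shows no heads

    for _ in range(N_TRIALS):
        heads = [sum(random.randint(0, 1) for _ in range(N_TOSSES))
                 for _ in range(N_COINS)]
        if heads[0] == 0:    # verification: coin fixed before tossing
            one_coin_no_heads += 1
        if min(heads) == 0:  # learning: coin selected after tossing
            some_coin_no_heads += 1

    print(f"P[fixed coin, no heads] ~ {one_coin_no_heads / N_TRIALS:.4f}"
          f"  (theory: {0.5**N_TOSSES:.4f})")
    print(f"P[some coin, no heads]  ~ {some_coin_no_heads / N_TRIALS:.4f}"
          f"  (theory: {1 - (1 - 0.5**N_TOSSES)**N_COINS:.4f})")

The fixed coin almost never shows all tails, but among 70 coins one often does: reporting only the best coin found by search badly biases the estimate of P[heads].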
Real Learning – Finite Learning Models

In real learning, the data is used to select a hypothesis from a finite model H = {h1, h2, . . . , hM}.

[Figure: M candidate hypotheses on the credit data (age vs. income), each with its own out-of-sample error Eout(hm); on the sample D, Ein(h1) = 2/9, Ein(h2) = 0, Ein(h3) = 5/9, . . . , Ein(hM) = 6/9.]

We pick the hypothesis g with minimum Ein. Will Eout(g) be small?


Hoeffding Says that Ein(g) ≈ Eout(g) for Finite H

For any ε > 0,

    P[ |Ein(g) − Eout(g)| > ε ] ≤ 2|H| e^(−2ε²N),

or equivalently,

    P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 2|H| e^(−2ε²N).

We don't care how g was obtained, as long as it is from H.

Some basic probability, for events A and B:

• Implication bound: if A ⟹ B (that is, A ⊆ B), then P[A] ≤ P[B].
• Union bound: P[A or B] = P[A ∪ B] ≤ P[A] + P[B].
• Bayes' rule: P[A | B] = P[B | A] · P[A] / P[B].

Proof. Let M = |H|. The event "|Ein(g) − Eout(g)| > ε" implies

    "|Ein(h1) − Eout(h1)| > ε"  OR . . . OR  "|Ein(hM) − Eout(hM)| > ε".

So, by the implication and union bounds,

    P[ |Ein(g) − Eout(g)| > ε ] ≤ P[ OR_{m=1..M} |Ein(hm) − Eout(hm)| > ε ]
                                ≤ Σ_{m=1..M} P[ |Ein(hm) − Eout(hm)| > ε ]
                                ≤ 2M e^(−2ε²N),

where the last inequality applies the single-hypothesis Hoeffding bound to each summand.


Interpreting the Hoeffding Bound for Finite |H|

Theorem. With probability at least 1 − δ,

    Eout(g) ≤ Ein(g) + sqrt( (1/2N) · ln(2|H|/δ) ).

Proof. Set δ = 2|H| e^(−2ε²N), so that P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − δ. In words: with probability at least 1 − δ, |Ein(g) − Eout(g)| ≤ ε, which in particular implies Eout(g) ≤ Ein(g) + ε. Solving the definition of δ for ε gives

    ε = sqrt( (1/2N) · ln(2|H|/δ) ).

Again, we don't care how g was obtained, as long as g ∈ H.


Ein Reaches Outside to Eout when |H| is Small

    Eout(g) ≤ Ein(g) + sqrt( (1/2N) · ln(2|H|/δ) ).

If N ≫ ln |H|, then Eout(g) ≈ Ein(g).

• The bound does not depend on X, P(x), f, or on how g is found.
• It only requires P(x) to generate the data points independently, and the test point too.
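To get a feel for how this error bar scales, here is a minimal numerical sketch; the particular values of N, |H|, and δ below are illustrative assumptions, not from the lecture.

    from math import log, sqrt

    def hoeffding_error_bar(N, M, delta):
        """With probability at least 1 - delta, Eout(g) <= Ein(g) + bar,
        where bar = sqrt((1/2N) * ln(2M/delta)) and M = |H|."""
        return sqrt(log(2 * M / delta) / (2 * N))

    delta = 0.05  # illustrative confidence parameter (assumption)
    for M in (1, 1000, 10**6):
        for N in (100, 1000, 10000):
            bar = hoeffding_error_bar(N, M, delta)
            print(f"|H| = {M:>7}, N = {N:>5}:  Eout <= Ein + {bar:.3f}")

The bar grows only like sqrt(ln |H|) but shrinks like 1/sqrt(N); that is the quantitative content of "N ≫ ln |H| implies Eout(g) ≈ Ein(g)".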
What about Eout ≈ 0?


The 2-Step Approach to Getting Eout ≈ 0

(1) Eout(g) ≈ Ein(g).
(2) Ein(g) ≈ 0.

Together, these ensure Eout ≈ 0.

How do we verify (1), given that we do not know Eout? We cannot verify it directly; we must ensure it theoretically, and Hoeffding does exactly that. We can ensure (2) empirically (for example, PLA drives Ein to 0 on separable data), modulo our ability to guarantee (1).

There is a tradeoff in the choice of model:

• Small |H| ⟹ Ein ≈ Eout (the error bar sqrt((1/2N) ln(2|H|/δ)) is small).
• Large |H| ⟹ Ein ≈ 0 is more likely.

[Figure: error versus model complexity. As |H| grows, the error bar rises while the in-sample error falls; the out-of-sample error is smallest at some intermediate |H|*.]


Feasibility of Learning (Finite Models)

• No free lunch: we can't know anything outside D for sure.
• We can "learn" with high probability if D is i.i.d. from P(x): Eout ≈ Ein (Ein can reach outside the data set to Eout).
• We want Eout ≈ 0.
• The two-step solution: we trade the single goal Eout ≈ 0 for two goals, (i) Eout ≈ Ein and (ii) Ein ≈ 0. We know Ein, not Eout, but we can ensure (i) if |H| is small. This is a big step!
• What about infinite H, like the perceptron?


"Complex" Target Functions are Harder to Learn

What happened to the "difficulty" (complexity) of f?

• Simple f ⟹ a small H suffices to get Ein ≈ 0 (need smaller N).
• Complex f ⟹ we need a large H to get Ein ≈ 0 (need larger N).


Revising the Learning Problem: Adding in Probability

[Diagram: the unknown target function f : X → Y and the unknown input distribution P(x) generate the training examples (x1, y1), . . . , (xN, yN) with yn = f(xn); the learning algorithm A selects the final hypothesis g from the hypothesis set H, with the goal g(x) ≈ f(x) on a test point x also drawn from P(x).]


Error and Noise

Two refinements bring the setup closer to reality:

• Error measure: how to quantify that h ≈ f.
• Noise: yn ≠ f(xn).


Finger Print Recognition

There are two types of error:

                      f = +1 (you)     f = −1 (intruder)
    h = +1            no error         false accept
    h = −1            false reject     no error

In any application you need to think about how to penalize each type of error. For example:

    Supermarket:      f = +1   f = −1          CIA:      f = +1   f = −1
       h = +1            0        1              h = +1      0      1000
       h = −1           10        0              h = −1      1         0

(A cost-weighted error for this example is sketched in code at the end.)

Take away: the error measure is specified by the user. If it is not, choose one that is
• plausible (conceptually appealing), or
• friendly (practically appealing).


Almost All Error Measures are Pointwise

Compare h and f on individual points x using a pointwise error e(h(x), f(x)):

    Binary error:   e(h(x), f(x)) = [[ h(x) ≠ f(x) ]]    (classification)
    Squared error:  e(h(x), f(x)) = (h(x) − f(x))²       (regression)

In-sample error:

    Ein(h) = (1/N) Σ_{n=1..N} e(h(xn), f(xn)).

Out-of-sample error:

    Eout(h) = E_x[ e(h(x), f(x)) ].

(These definitions are sketched in code at the end of the lecture.)


Noisy Targets

    age        gender   salary   debt     years in job   years at home   ...
    32 years   male     40,000   26,000   1 year         3 years         ...

Approve for credit? Consider two customers with the same credit data. They can have different behaviors. So the target "function" is not a deterministic function but a stochastic one:

    'f(x)' = P(y | x).


Learning Setup with Error Measure and Noisy Targets

[Diagram: the unknown target distribution P(y | x) (a target function f plus noise) and the unknown input distribution P(x) generate the training examples (x1, y1), . . . , (xN, yN) with yn ∼ P(y | xn); the learning algorithm A, guided by the error measure, selects the final hypothesis g from the hypothesis set H, with the goal g(x) ≈ f(x).]
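Here is a small sketch of the pointwise error measures and the in-sample error defined above; the five data points and the particular f and h are made-up illustrations.

    import numpy as np

    # Pointwise errors, as defined in the lecture.
    def binary_error(h_x, f_x):        # classification: [[ h(x) != f(x) ]]
        return (h_x != f_x).astype(float)

    def squared_error(h_x, f_x):       # regression: (h(x) - f(x))^2
        return (h_x - f_x) ** 2

    def E_in(e, h, f, X):
        """Ein(h) = (1/N) * sum_n e(h(x_n), f(x_n))."""
        return np.mean(e(h(X), f(X)))

    # Tiny illustrative example (assumed data, not from the lecture):
    X = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
    f = lambda x: np.sign(x)           # target
    h = lambda x: np.sign(x - 0.75)    # hypothesis; disagrees at x = 0.5
    print("binary  Ein(h) =", E_in(binary_error, h, f, X))   # 0.2
    print("squared Ein(h) =", E_in(squared_error, h, f, X))  # 0.8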
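A cost-weighted pointwise error for the fingerprint example might look as follows. The helper weighted_error and the four-point toy data are hypothetical; only the supermarket and CIA costs are taken from the tables above.

    import numpy as np

    # Cost-weighted pointwise error: the cost of each error type is chosen
    # by the user of the system, not by the learning algorithm.
    def weighted_error(h_x, f_x, c_false_accept, c_false_reject):
        fa = (h_x == +1) & (f_x == -1)   # intruder accepted
        fr = (h_x == -1) & (f_x == +1)   # legitimate user rejected
        return c_false_accept * fa + c_false_reject * fr

    # One toy point of each kind (hypothetical data):
    h_x = np.array([+1, +1, -1, -1])
    f_x = np.array([+1, -1, +1, -1])
    print("supermarket Ein:", weighted_error(h_x, f_x, 1, 10).mean())
    print("CIA Ein:        ", weighted_error(h_x, f_x, 1000, 1).mean())

The same h and f give very different errors under the two cost matrices, which is the point: the error measure belongs to the application, not to the data.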
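Finally, a sketch of a noisy target: y is drawn from P(y | x) rather than computed by a deterministic f. The logistic form of P(y = +1 | x) below is purely an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy target sketch: two customers with the same credit profile x
    # can receive different labels y, because y ~ P(y | x).
    def sample_y(x):
        p_approve = 1.0 / (1.0 + np.exp(-x))  # assumed P(y = +1 | x)
        return +1 if rng.random() < p_approve else -1

    x = 0.4                                   # one credit profile
    print([sample_y(x) for _ in range(10)])   # same x, differing y's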