Learning From Data
Lecture 7: Approximation Versus Generalization
    The VC Dimension
    Approximation Versus Generalization
    Bias and Variance
    The Learning Curve
M. Magdon-Ismail, CSCI 4100/6100

Recap: The Vapnik-Chervonenkis Bound (VC Bound)

For any ε > 0,
    P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−ε²N/8)          (finite H:  ≤ 2|H| e^(−2ε²N))
    P[|Ein(g) − Eout(g)| ≤ ε] ≥ 1 − 4 mH(2N) e^(−ε²N/8)      (finite H:  ≥ 1 − 2|H| e^(−2ε²N))

With probability at least 1 − δ,
    Eout(g) ≤ Ein(g) + sqrt( (8/N) log( 4 mH(2N)/δ ) )        (finite H:  Eout(g) ≤ Ein(g) + sqrt( (1/(2N)) log( 2|H|/δ ) ))

If k is a break point,
    mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) ≤ N^(k−1) + 1.

The VC Dimension dvc

mH(N) ~ N^(k−1), and the tightest bound is obtained with the smallest break point k*.

Definition [VC Dimension]: dvc = k* − 1.
The VC dimension is the largest N that can be shattered (mH(N) = 2^N).

    N ≤ dvc:  H could shatter your data (H can shatter some set of N points).
    N > dvc:  N is a break point for H; H cannot possibly shatter your data.

    mH(N) ≤ N^dvc + 1 ~ N^dvc
    Eout(g) ≤ Ein(g) + O( sqrt( dvc log N / N ) )

The VC Dimension is an Effective Number of Parameters

                          mH(N) for N = 1, 2, 3, 4, 5, ...      #Param    dvc
    2-D perceptron        2   4   8   14   ...                     3        3
    1-D pos. ray          2   3   4    5   ...                     1        1
    2-D pos. rectangles   2   4   8   16   < 2^5  ...              4        4
    pos. convex sets      2   4   8   16   32   ...                ∞        ∞

There are models with few parameters but infinite dvc.
There are models with redundant parameters but small dvc.

VC Dimension of the Perceptron in R^d is d + 1

This can be shown in two steps.

1. dvc ≥ d + 1. What needs to be shown?
    (a) There is a set of d + 1 points that can be shattered.
    (b) There is a set of d + 1 points that cannot be shattered.
    (c) Every set of d + 1 points can be shattered.
    (d) Every set of d + 1 points cannot be shattered.
    Answer: (a) ✓ — a numerical check of this construction is sketched below.
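Not part of the original slides: a minimal numerical check of statement (a), assuming NumPy. It uses the textbook construction — the origin together with the d standard basis vectors, with a bias coordinate x0 = 1 prepended — and verifies that every ±1 labelling of these d + 1 points is realized by some perceptron, because the augmented data matrix is square and invertible.

```python
import numpy as np
from itertools import product

d = 3  # dimension of the input space (any d >= 1 works; 3 is just an illustration)

# Standard construction for step 1: the origin plus the d unit vectors.
# With a bias coordinate prepended, the (d+1) x (d+1) data matrix is invertible.
X = np.vstack([np.zeros(d), np.eye(d)])          # shape (d+1, d)
X_aug = np.hstack([np.ones((d + 1, 1)), X])      # prepend the bias coordinate x0 = 1

# Because X_aug is invertible, any +/-1 labelling y is realized by the
# perceptron with weights w = X_aug^{-1} y, since then sign(X_aug w) = sign(y) = y.
shattered = all(
    np.array_equal(np.sign(X_aug @ np.linalg.solve(X_aug, np.array(y))), np.array(y))
    for y in product([-1.0, 1.0], repeat=d + 1)
)
print(f"d = {d}: all {2 ** (d + 1)} dichotomies realized -> {shattered}")
```

Invertibility is the whole argument: w = X_aug⁻¹ y reproduces any desired labelling exactly, so this particular set of d + 1 points is shattered — which is precisely statement (a).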
2. dvc ≤ d + 1. What needs to be shown?
    (a) There is a set of d + 1 points that can be shattered.
    (b) There is a set of d + 2 points that cannot be shattered.
    (c) Every set of d + 2 points can be shattered.
    (d) Every set of d + 1 points cannot be shattered.
    (e) Every set of d + 2 points cannot be shattered.
    Answer: (e) ✓

A Single Parameter Characterizes Complexity

    Finite H:   Eout(g) ≤ Ein(g) + sqrt( (1/(2N)) log( 2|H|/δ ) )
    General H:  Eout(g) ≤ Ein(g) + sqrt( (8/N) log( 4((2N)^dvc + 1)/δ ) )
(the second term is the penalty for model complexity, Ω(dvc))

(Figure: Error versus |H| (left) and versus VC dimension dvc (right). The in-sample error falls and the model-complexity penalty rises as the model grows; the out-of-sample error bound is minimized at an intermediate |H|*, respectively d*vc.)

Sample Complexity: How Many Data Points Do You Need?

Set the error bar at ε:
    ε = sqrt( (8/N) ln( 4((2N)^dvc + 1)/δ ) )
Solve for N:
    N = (8/ε²) ln( 4((2N)^dvc + 1)/δ ) = O(dvc ln N)

Example. dvc = 3; error bar ε = 0.1; confidence 90% (δ = 0.1). A simple iterative method works well (a sketch appears at the end of this part). Trying N = 1000 gives
    N ≈ (8/0.1²) ln( (4(2000)³ + 4)/0.1 ) ≈ 21192.
We continue iteratively and converge to N ≈ 30000. If dvc = 4, N ≈ 40000; for dvc = 5, N ≈ 50000. (N ∝ dvc, but these are gross overestimates.)

Practical rule of thumb: N = 10 × dvc.

Theory Versus Practice

The VC analysis allows us to reach outside the data for general H:
    – a single parameter characterizes the complexity of H;
    – dvc depends only on H;
    – Ein can reach outside D to Eout when dvc is finite.

In practice . . .
    • The VC bound is loose:
        – Hoeffding;
        – mH(N) is a worst-case number of dichotomies, not the average or likely case;
        – the polynomial bound on mH(N) is loose.
    • It is a good guide – models with small dvc are good.
    • Roughly 10 × dvc examples are needed to get good generalization.

The Test Set

    • Another way to estimate Eout(g) is to use a test set, obtaining Etest(g).
    • Etest is better than Ein: you don't pay the price for fitting. You can use |H| = 1 in the Hoeffding bound with Etest.
    • Both the test and training estimates have variance. The training estimate has an optimistic bias due to selection – fitting the data. The test estimate has no bias.
    • The price for a test set is fewer training examples. (Why is this bad?) Etest ≈ Eout, but now Eout itself may be bad.

VC Bound Quantifies Approximation Versus Generalization

The best H for approximation is H = {f} – but your chance of being handed that H is worse than a lottery ticket's.

    dvc ↑  ⇒  better chance of approximating f (Ein ≈ 0).
    dvc ↓  ⇒  better chance of generalizing out of sample (Ein ≈ Eout).
    Eout ≤ Ein + Ω(dvc).

VC analysis depends only on H – it is independent of f, P(x), and the learning algorithm.
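Before moving on to bias and variance, here is the iterative sample-complexity calculation referred to above — a minimal sketch assuming NumPy; the function name sample_complexity, the starting point n0 = 1000, and the fixed 50 iterations are my own choices, not part of the lecture.

```python
import numpy as np

def sample_complexity(dvc, eps, delta, n0=1000.0, iters=50):
    """Fixed-point iteration for N = (8/eps^2) * ln(4*((2N)^dvc + 1) / delta)."""
    n = n0
    for _ in range(iters):
        n = (8.0 / eps ** 2) * np.log(4.0 * ((2.0 * n) ** dvc + 1.0) / delta)
    return n

# The example from the sample-complexity slide: eps = 0.1, delta = 0.1.
for dvc in (3, 4, 5):
    print(f"dvc = {dvc}: N ~ {sample_complexity(dvc, eps=0.1, delta=0.1):,.0f}")
```

Starting from N = 1000, the dvc = 3 case passes through roughly 21,000 on the first step and settles near 30,000; dvc = 4 and dvc = 5 settle near 40,000 and 50,000, in line with the slide's estimates (all gross overestimates — hence the 10 × dvc rule of thumb).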
Bias-Variance Analysis

Another way to quantify the tradeoff:
1. How well can the learning approximate f?
   (as opposed to how well did the learning approximate f in sample, Ein)
2. How close can you get to that approximation with a finite data set?
   (as opposed to how close is Ein to Eout)

Bias-variance analysis applies to squared error (classification and regression).
Bias-variance analysis can take the learning algorithm into account: different learning algorithms can have different Eout when applied to the same H!

A Simple Learning Problem

Two data points. Two hypothesis sets:
    H0:  h(x) = b
    H1:  h(x) = ax + b
(Figure: the H0 and H1 fits to one two-point data set.)

Let's Repeat the Experiment Many Times

For each data set D you get a different g^D. So, for a fixed x, g^D(x) is a random value, depending on D.
(Figure: the fits g^D for many data sets, overlaid, for H0 and for H1.)

What's Happening on Average

(Figure: the average hypothesis ḡ(x) versus sin(x), for H0 and for H1.)

We can define:
    g^D(x)                                        ← random value, depending on D
    ḡ(x) = E_D[ g^D(x) ]                          ← your average prediction on x
          ≈ (1/K)( g^{D1}(x) + · · · + g^{DK}(x) )
    var(x) = E_D[ (g^D(x) − ḡ(x))² ]
           = E_D[ g^D(x)² ] − ḡ(x)²               ← how variable is your prediction?

Eout on a Test Point x for Data D

    E^D_out(x) = (g^D(x) − f(x))²                 ← squared error, a random value depending on D
    Eout(x) = E_D[ E^D_out(x) ]                   ← expected Eout(x), before seeing D
(Figure: f(x), g^D(x) and the squared error between them at the test point x.)

The Bias-Variance Decomposition

    Eout(x) = E_D[ (g^D(x) − f(x))² ]
            = E_D[ g^D(x)² − 2 g^D(x) f(x) + f(x)² ]
            = E_D[ g^D(x)² ] − 2 ḡ(x) f(x) + f(x)²                      ← understand this step; the rest is just algebra
            = E_D[ g^D(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
            = ( E_D[ g^D(x)² ] − ḡ(x)² ) + ( ḡ(x) − f(x) )²
            =            var(x)          +       bias(x)

    Eout(x) = bias(x) + var(x)

(Figure: a very small model H has small var but possibly large bias relative to f; a very large model has small bias but large var.)

If you take the average over x:  Eout = bias + var.

Back to H0 and H1; and, our winner is . . .

(Figure: ḡ(x) and its variability versus sin(x), for H0 and for H1, with 2 data points.)

    H0:  bias = 0.50,  var = 0.25,  Eout = 0.75   ✓
    H1:  bias = 0.21,  var = 1.69,  Eout = 1.90

Match Learning Power to Data, . . . Not to f

    2 data points:   H0: bias = 0.50, var = 0.25, Eout = 0.75  ✓      H1: bias = 0.21, var = 1.69, Eout = 1.90
    5 data points:   H0: bias = 0.50, var = 0.10, Eout = 0.60         H1: bias = 0.21, var = 0.21, Eout = 0.42  ✓

Learning Curves: When Does the Balance Tip?

    Eout = E_x[ Eout(x) ]
(Figure: expected error versus number of data points N, for a simple model and a complex model; each panel shows the Ein and Eout curves.)

Decomposing the Learning Curve

VC analysis:             Eout = in-sample error + generalization error.
                         Pick H that can generalize and has a good chance of fitting the data.
Bias-variance analysis:  Eout = bias + variance.
                         Pick (H, A) to approximate f and not behave wildly after seeing the data.
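To close, the H0-versus-H1 numbers quoted above can be reproduced, roughly, by direct simulation. Below is a Monte Carlo sketch assuming NumPy; it also assumes the standard setup behind those numbers — target f(x) = sin(πx) with x drawn uniformly from [−1, 1], two data points per data set, H1 fit by interpolating the two points — which the slides show only as figures.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)   # assumed target: sin(pi*x), x uniform on [-1, 1]

n_datasets, x_test = 20_000, np.linspace(-1, 1, 201)

# Each data set is two points (x1, f(x1)), (x2, f(x2)).
X = rng.uniform(-1, 1, size=(n_datasets, 2))
Y = f(X)

# H0: best constant fit under squared error (the mean of the two y values).
G0 = Y.mean(axis=1, keepdims=True) * np.ones_like(x_test)
# H1: the line through the two points (the x's coincide with probability zero).
a = (Y[:, 1] - Y[:, 0]) / (X[:, 1] - X[:, 0])
b = Y[:, 0] - a * X[:, 0]
G1 = a[:, None] * x_test + b[:, None]

for name, G in (("H0", G0), ("H1", G1)):
    gbar = G.mean(axis=0)                      # average hypothesis g_bar(x)
    bias = np.mean((gbar - f(x_test)) ** 2)    # bias(x), averaged over x
    var = np.mean(G.var(axis=0))               # var(x), averaged over x
    print(f"{name}: bias ~ {bias:.2f}  var ~ {var:.2f}  Eout ~ {bias + var:.2f}")
```

With these choices the printout lands near bias 0.50 / var 0.25 for H0 and bias 0.21 / var 1.69 for H1, the values on the "our winner is" slide; increasing the number of points per data set tips the balance toward H1, as on the five-point comparison.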