The VC Dimension, Bias and Variance

Learning From Data
Lecture 7
Approximation Versus Generalization
The VC Dimension
Approximation Versus Generalization
Bias and Variance
The Learning Curve
M. Magdon-Ismail
CSCI 4100/6100
recap: The Vapnik-Chervonenkis Bound (VC Bound)

P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^(−ε²N/8),   for any ε > 0.        ← finite H:  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2|H| e^(−2ε²N)

P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 4 mH(2N) e^(−ε²N/8),   for any ε > 0.    ← finite H:  P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 2|H| e^(−2ε²N)

Eout(g) ≤ Ein(g) + √( (8/N) log( 4 mH(2N) / δ ) ),   w.p. at least 1 − δ.   ← finite H:  Eout(g) ≤ Ein(g) + √( (1/(2N)) log( 2|H| / δ ) )

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) ≤ N^(k−1) + 1,   where k is a break point.
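The last line is the polynomial bound on the growth function. As a quick numerical illustration (my own sketch, not part of the lecture; the function names are mine), both sides can be evaluated for a given break point k:

    from math import comb

    def growth_bound(N, k):
        """Bound on the growth function when k is a break point: m_H(N) <= sum_{i=0}^{k-1} C(N, i)."""
        return sum(comb(N, i) for i in range(k))

    def poly_bound(N, k):
        """The looser polynomial form used above: N^(k-1) + 1."""
        return N ** (k - 1) + 1

    # 1-D positive rays have break point k = 2 (and in fact m_H(N) = N + 1);
    # the 2-D perceptron has break point k = 4.
    for k in (2, 4):
        for N in (5, 10, 100):
            print(f"k={k}, N={N}:  sum bound = {growth_bound(N, k)},  poly bound = {poly_bound(N, k)}")

For k = 2 the two bounds coincide at N + 1; for larger k the polynomial form is looser but still polynomial in N, which is all the VC bound needs.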
The VC Dimension dvc

mH(N) ∼ N^(k−1). The tightest bound is obtained with the smallest break point k*.

Definition [VC Dimension]:  dvc = k* − 1.
The VC dimension is the largest N which can be shattered (mH(N) = 2^N).

N ≤ dvc :  H could shatter your data (H can shatter some N points).
N > dvc :  N is a break point for H; H cannot possibly shatter your data.

mH(N) ≤ N^dvc + 1 ∼ N^dvc

Eout(g) ≤ Ein(g) + O( √( dvc log N / N ) )
The VC-dimension is an Effective Number of Parameters

Entries in the table are the growth function mH(N):

                           N = 1    2    3    4      5     · · ·    # Param    dvc
    2-D perceptron             2    4    8   14            · · ·       3        3
    1-D pos. ray               2    3    4    5             · · ·       1        1
    2-D pos. rectangles        2    4    8   16    < 2^5    · · ·       4        4
    pos. convex sets           2    4    8   16      32     · · ·       ∞        ∞
There are models with few parameters but infinite dvc .
There are models with redundant parameters but small dvc.
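The 1-D positive ray row can be checked by brute force. A small sketch (my own illustration, not from the lecture): enumerate the dichotomies that h(x) = sign(x − a) generates on N points and count them:

    import random

    def ray_dichotomies(xs):
        """Dichotomies that positive rays h(x) = sign(x - a) generate on the points xs."""
        xs = sorted(xs)
        thresholds = [xs[0] - 1] + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)] + [xs[-1] + 1]
        return {tuple(+1 if x > a else -1 for x in xs) for a in thresholds}

    for N in range(1, 6):
        xs = random.sample(range(100), N)
        print(N, len(ray_dichotomies(xs)))          # prints N + 1, matching the table row

    print(len(ray_dichotomies([3])) == 2 ** 1)      # True:  1 point can be shattered, so dvc >= 1
    print(len(ray_dichotomies([3, 7])) == 2 ** 2)   # False: 2 is a break point, so dvc = 1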
VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.
VC-dimension of the Perceptron in R^d is d + 1: the answers

1. dvc ≥ d + 1.  What needs to be shown?
 X (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
 X (e) Every set of d + 2 points cannot be shattered.
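For step 1, the usual construction (not spelled out on this slide) takes the d + 1 points x0 = 0, x1 = e1, . . . , xd = ed; after prepending the bias coordinate, the (d+1)×(d+1) matrix of these points is invertible, so any dichotomy y can be realized by the weights w = X⁻¹y. A brute-force check of this for small d (my own sketch):

    import itertools
    import numpy as np

    def perceptron_shatters(points):
        """Check by enumeration that sign(w . [1, x]) can realize every dichotomy of the points."""
        X = np.hstack([np.ones((len(points), 1)), np.asarray(points, dtype=float)])
        for y in itertools.product([-1.0, 1.0], repeat=len(points)):
            y = np.array(y)
            w, *_ = np.linalg.lstsq(X, y, rcond=None)   # exact solve here, since X is square and invertible
            if not np.array_equal(np.sign(X @ w), y):
                return False
        return True

    d = 3
    points = np.vstack([np.zeros(d), np.eye(d)])   # d + 1 points: the origin and the unit vectors
    print(perceptron_shatters(points))             # True, so dvc >= d + 1

Step 2 uses the usual linear-dependence argument: any d + 2 points in R^(d+1) (with the bias coordinate) are linearly dependent, and the dependence dictates one dichotomy that no perceptron can realize, so dvc ≤ d + 1.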
A Single Parameter Characterizes Complexity

Finite H:        Eout(g) ≤ Ein(g) + √( (1/(2N)) log( 2|H| / δ ) )           (second term: model complexity)

VC dimension:    Eout(g) ≤ Ein(g) + √( (8/N) log( 4((2N)^dvc + 1) / δ ) )    (second term: penalty for model complexity, Ω(dvc))

[Figure: Error versus |H| (left panel) and versus the VC dimension dvc (right panel). In each panel the in-sample error decreases while the model-complexity term increases, so the bound on the out-of-sample error is minimized at an intermediate |H|* or d*vc.]
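To get a feel for the penalty term, here is a small sketch (my own code, not the lecture's) that evaluates Ω for the VC form of the bound:

    import math

    def vc_penalty(N, dvc, delta=0.1):
        """Model-complexity term: sqrt( (8/N) * ln( 4*((2N)^dvc + 1) / delta ) )."""
        return math.sqrt(8.0 / N * math.log(4 * ((2 * N) ** dvc + 1) / delta))

    # The penalty grows with dvc ...
    for dvc in (1, 3, 5, 10):
        print("dvc =", dvc, " penalty =", round(vc_penalty(N=10_000, dvc=dvc), 3))

    # ... and shrinks (slowly) with N.
    for N in (100, 1_000, 10_000, 100_000):
        print("N =", N, " penalty =", round(vc_penalty(N=N, dvc=3), 3))

The trade-off in the figure comes from Ein typically decreasing as dvc grows while this penalty increases.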
Sample Complexity: How Many Data Points Do You Need?

Set the error bar at ε:

    ε = √( (8/N) ln( 4((2N)^dvc + 1) / δ ) )

Solve for N:

    N = (8/ε²) ln( 4((2N)^dvc + 1) / δ ) = O(dvc ln N)

Example. dvc = 3; error bar ε = 0.1; confidence 90% (δ = 0.1).
A simple iterative method works well. Trying N = 1000 we get

    N ≈ (8/0.1²) ln( (4(2000)³ + 4) / 0.1 ) ≈ 21192.

We continue iteratively, and converge to N ≈ 30000.
If dvc = 4, N ≈ 40000; for dvc = 5, N ≈ 50000.      (N ∝ dvc, but gross overestimates)

Practical Rule of Thumb: N = 10 × dvc
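The iteration is easy to run. A minimal sketch (my code; it simply iterates N ← (8/ε²) ln(4((2N)^dvc + 1)/δ) to a fixed point):

    import math

    def sample_complexity(dvc, eps=0.1, delta=0.1, N=1000.0):
        """Iterate N = (8/eps^2) * ln( 4*((2N)^dvc + 1) / delta ) until it stops changing."""
        for _ in range(100):
            N_new = 8.0 / eps ** 2 * math.log(4 * ((2 * N) ** dvc + 1) / delta)
            if abs(N_new - N) < 1:
                break
            N = N_new
        return N_new

    for dvc in (3, 4, 5):
        print(dvc, round(sample_complexity(dvc)))   # roughly 30000, 40000, 50000, as on the slide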
Theory Versus Practice
The VC analysis allows us to reach outside the data for general H.
– a single parameter characterizes complexity of H
– dvc depends only on H.
– Ein can reach outside D to Eout when dvc is finite.
In Practice . . .
• The VC bound is loose.
– the Hoeffding inequality itself has slack;
– mH (N ) is a worst case # of dichotomies, not average case or likely case.
– The polynomial bound on mH (N ) is loose.
• It is a good guide – models with small dvc generalize well.
• Roughly 10 × dvc examples needed to get good generalization.
The Test Set
• Another way to estimate Eout(g) is using a test set to obtain Etest(g).
• Etest is better than Ein: you don’t pay the price for fitting.
You can use |H| = 1 in the Hoeffding bound with Etest .
• Both a test and training set have variance.
The training set has optimistic bias due to selection – fitting the data.
A test set has no bias.
• The price for a test set is fewer training examples.
(why is this bad?)
Fewer training examples means learning produces a worse g: Etest ≈ Eout, but now Eout itself may be bad.
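With |H| = 1, the Hoeffding bound gives a tight error bar for Etest. A small sketch (my code; the formula is the single-hypothesis Hoeffding bound ε = √( ln(2/δ) / (2 Ntest) ), which assumes a bounded error measure such as binary error):

    import math

    def test_error_bar(n_test, delta=0.05):
        """Hoeffding error bar for a test set (|H| = 1):
        |Etest - Eout| <= sqrt( ln(2/delta) / (2 * n_test) ) with probability >= 1 - delta."""
        return math.sqrt(math.log(2 / delta) / (2 * n_test))

    for n_test in (100, 1_000, 10_000):
        print(n_test, round(test_error_bar(n_test), 3))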
VC Bound Quantifies Approximation Versus Generalization
The best H would be H = {f}, but you have a better chance of winning the lottery than of being handed that H.
dvc ↑ =⇒ better chance of approximating f (Ein ≈ 0).
dvc ↓ =⇒ better chance of generalizing to out of sample (Ein ≈ Eout).
Eout ≤ Ein + Ω(dvc).
VC analysis only depends on H.
Independent of f, P (x), learning algorithm.
Bias-Variance Analysis
Another way to quantify the tradeoff:
1. How well can the learning approximate f?
. . . as opposed to how well did the learning approximate f in-sample (Ein).
2. How close can you get to that approximation with a finite data set?
. . . as opposed to how close is Ein to Eout .
Bias-variance analysis applies to squared errors (classification and regression)
Bias-variance analysis can take into account the learning algorithm
Different learning algorithms can have different Eout when applied to the same H!
A Simple Learning Problem

2 Data Points. 2 hypothesis sets:

    H0 :  h(x) = b
    H1 :  h(x) = ax + b

[Figure: two panels (x versus y), one for H0 and one for H1, showing the 2-point data set and the fitted hypothesis.]
Let's Repeat the Experiment Many Times

[Figure: the fitted hypotheses from many different data sets overlaid, for H0 (left) and H1 (right).]

For each data set D, you get a different g^D.
So, for a fixed x, g^D(x) is a random value, depending on D.
What's Happening on Average

[Figure: the average hypothesis ḡ(x) plotted against the target sin(x), for H0 (left) and H1 (right).]

We can define:

    g^D(x)                                          ← random value, depending on D

    ḡ(x) = E_D[ g^D(x) ]                            ← your average prediction on x
         ≈ (1/K) ( g^D1(x) + · · · + g^DK(x) )

    var(x) = E_D[ (g^D(x) − ḡ(x))² ]                ← how variable is your prediction?
           = E_D[ g^D(x)² ] − ḡ(x)²
Eout on Test Point x for Data D

[Figure: at a test point x, the squared distance between the prediction g^D(x) and the target f(x) is the error E^D_out(x), shown for H0 (left) and H1 (right).]

    E^D_out(x) = (g^D(x) − f(x))²        ← squared error, a random value depending on D

    Eout(x) = E_D[ E^D_out(x) ]          ← expected Eout(x) before seeing D
The Bias-Variance Decomposition

    Eout(x) = E_D[ (g^D(x) − f(x))² ]
            = E_D[ g^D(x)² − 2 g^D(x) f(x) + f(x)² ]
            = E_D[ g^D(x)² ] − 2 ḡ(x) f(x) + f(x)²                  ← understand this; the rest is just algebra
            = E_D[ g^D(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
            = ( E_D[ g^D(x)² ] − ḡ(x)² ) + ( ḡ(x) − f(x) )²
            =             var(x)         +      bias(x)

    Eout(x) = bias(x) + var(x)

If you take the average over x:   Eout = bias + var

[Figure: for a very small model, H sits far from f, giving a large bias; for a very large model, H reaches f but the hypotheses are spread widely around it, giving a large var.]
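The identity can be sanity-checked numerically (my own sketch, not from the lecture): take samples standing in for the predictions g^D(x) at one fixed x, fix a target value f(x), and compare the two sides of Eout(x) = bias(x) + var(x):

    import numpy as np

    rng = np.random.default_rng(0)

    # Pretend these are the predictions g^D(x) over many data sets D at one fixed test point x,
    # and f_x is the target value f(x) there.
    g = rng.normal(loc=0.3, scale=1.2, size=100_000)
    f_x = 0.8

    g_bar = g.mean()                       # ḡ(x): the average prediction
    var_x = np.mean((g - g_bar) ** 2)      # var(x)
    bias_x = (g_bar - f_x) ** 2            # bias(x)
    eout_x = np.mean((g - f_x) ** 2)       # Eout(x)

    print(eout_x, bias_x + var_x)          # the two numbers agree up to floating-point error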
Back to H0 and H1; and, our winner is . . .

[Figure: the average hypothesis ḡ(x) against the target sin(x), for H0 (left) and H1 (right).]

    H0:  bias = 0.50,  var = 0.25,  Eout = 0.75  X
    H1:  bias = 0.21,  var = 1.69,  Eout = 1.90
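These numbers can be reproduced by simulation. A minimal sketch (my own code; the setup is assumed to be the textbook's: noiseless target f(x) = sin(πx), x uniform on [−1, 1], least-squares fits, squared error):

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.sin(np.pi * x)

    def fit_H0(x, y):
        """Constant model h(x) = b: the least-squares constant is the mean of the y's."""
        return np.array([0.0, np.mean(y)])            # (slope, intercept) with slope 0

    def fit_H1(x, y):
        """Line h(x) = ax + b, fit by least squares (an exact fit when N = 2)."""
        a, b = np.polyfit(x, y, 1)
        return np.array([a, b])

    def bias_var(fit, N=2, n_datasets=10_000, n_test=1000):
        x_test = np.linspace(-1, 1, n_test)
        preds = np.empty((n_datasets, n_test))
        for k in range(n_datasets):
            x = rng.uniform(-1, 1, N)
            a, b = fit(x, f(x))                        # g^D
            preds[k] = a * x_test + b                  # g^D(x) on a test grid
        g_bar = preds.mean(axis=0)                     # ḡ(x)
        bias = np.mean((g_bar - f(x_test)) ** 2)
        var = np.mean((preds - g_bar) ** 2)
        return bias, var

    for name, fit in [("H0", fit_H0), ("H1", fit_H1)]:
        bias, var = bias_var(fit, N=2)
        print(f"{name}: bias = {bias:.2f}, var = {var:.2f}, Eout = {bias + var:.2f}")

Running the same sketch with N = 5 should reproduce the numbers on the next slide, where the line wins.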
Match Learning Power to Data, . . . Not to f

2 Data Points:
    H0:  bias = 0.50,  var = 0.25,  Eout = 0.75  X
    H1:  bias = 0.21,  var = 1.69,  Eout = 1.90

5 Data Points:
    H0:  bias = 0.50,  var = 0.1,   Eout = 0.6
    H1:  bias = 0.21,  var = 0.21,  Eout = 0.42  X

[Figure: ḡ(x) against sin(x) for H0 and H1, with 2 data points (left column) and 5 data points (right column).]
Learning Curves: When Does the Balance Tip?

[Figure: expected error (Ein and Eout) versus the number of data points N, for a simple model (left) and a complex model (right).]

    Eout = E_x[ Eout(x) ]
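The curves can be generated by simulation. A small sketch (mine, not the lecture's; same assumed setup as before, with least-squares line fits) that estimates the expected Ein and Eout as a function of N:

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.sin(np.pi * x)

    def expected_errors(N, n_datasets=2000, n_test=1000):
        """Average Ein and Eout of a least-squares line over many data sets of size N."""
        x_test = np.linspace(-1, 1, n_test)
        ein, eout = [], []
        for _ in range(n_datasets):
            x = rng.uniform(-1, 1, N)
            y = f(x)
            a, b = np.polyfit(x, y, 1)                               # g^D: the least-squares line
            ein.append(np.mean((a * x + b - y) ** 2))                # Ein(g^D)
            eout.append(np.mean((a * x_test + b - f(x_test)) ** 2))  # Eout(g^D) on a dense grid
        return np.mean(ein), np.mean(eout)

    for N in (2, 5, 10, 20, 50, 100):
        Ein, Eout = expected_errors(N)
        print(f"N = {N:3d}:  E[Ein] = {Ein:.3f}   E[Eout] = {Eout:.3f}")
    # Ein rises toward, and Eout falls toward, the same asymptote as N grows.

Plotting these against N traces out a learning curve like the complex-model panel; replacing the line with the constant model H0 gives the simple-model curve.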
Decomposing The Learning Curve

[Figure: the learning curve (expected error versus the number of data points N), decomposed two ways.]

VC Analysis:  Eout = in-sample error (Ein) + generalization error.
    Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis:  Eout = bias + variance.
    Pick (H, A) to approximate f and not behave wildly after seeing the data.