The VC Dimension, Bias and Variance

Learning From Data
Lecture 7
Approximation Versus Generalization
The VC Dimension
Approximation Versus Generalization
Bias and Variance
The Learning Curve
M. Magdon-Ismail
CSCI 4100/6100
recap: The Vapnik-Chervonenkis Bound (VC Bound)

P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^(−ε²N/8),   for any ε > 0.        ← finite H:  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2|H| e^(−2ε²N)

P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 4 mH(2N) e^(−ε²N/8),   for any ε > 0.    ← finite H:  P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 2|H| e^(−2ε²N)

Eout(g) ≤ Ein(g) + √( (8/N) log( 4 mH(2N) / δ ) ),   w.p. at least 1 − δ.   ← finite H:  Eout(g) ≤ Ein(g) + √( (1/(2N)) log( 2|H| / δ ) )

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) ≤ N^(k−1) + 1,   where k is a break point.
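The last line is the polynomial bound on the growth function. As a quick numerical illustration (my own sketch, not part of the lecture; the function names are mine), both sides can be evaluated for a given break point k:

    from math import comb

    def growth_bound(N, k):
        """Bound on the growth function when k is a break point: m_H(N) <= sum_{i=0}^{k-1} C(N, i)."""
        return sum(comb(N, i) for i in range(k))

    def poly_bound(N, k):
        """The looser polynomial form used above: N^(k-1) + 1."""
        return N ** (k - 1) + 1

    # 1-D positive rays have break point k = 2 (and in fact m_H(N) = N + 1);
    # the 2-D perceptron has break point k = 4.
    for k in (2, 4):
        for N in (5, 10, 100):
            print(f"k={k}, N={N}:  sum bound = {growth_bound(N, k)},  poly bound = {poly_bound(N, k)}")

For k = 2 the two bounds coincide at N + 1; for larger k the polynomial form is looser but still polynomial in N, which is all the VC bound needs.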
The VC Dimension dvc

mH(N) ∼ N^(k−1). The tightest bound is obtained with the smallest break point k*.

Definition [VC Dimension]:  dvc = k* − 1.
The VC dimension is the largest N which can be shattered (mH(N) = 2^N).

N ≤ dvc :  H could shatter your data (H can shatter some N points).
N > dvc :  N is a break point for H; H cannot possibly shatter your data.

mH(N) ≤ N^dvc + 1 ∼ N^dvc

Eout(g) ≤ Ein(g) + O( √( dvc log N / N ) )
The VC-dimension is an Effective Number of Parameters

Entries in the table are the growth function mH(N):

                           N = 1    2    3    4      5     · · ·    # Param    dvc
    2-D perceptron             2    4    8   14            · · ·       3        3
    1-D pos. ray               2    3    4    5             · · ·       1        1
    2-D pos. rectangles        2    4    8   16    < 2^5    · · ·       4        4
    pos. convex sets           2    4    8   16      32     · · ·       ∞        ∞
There are models with few parameters but infinite dvc .
There are models with redundant parameters but small dvc.
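The 1-D positive ray row can be checked by brute force. A small sketch (my own illustration, not from the lecture): enumerate the dichotomies that h(x) = sign(x − a) generates on N points and count them:

    import random

    def ray_dichotomies(xs):
        """Dichotomies that positive rays h(x) = sign(x - a) generate on the points xs."""
        xs = sorted(xs)
        thresholds = [xs[0] - 1] + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)] + [xs[-1] + 1]
        return {tuple(+1 if x > a else -1 for x in xs) for a in thresholds}

    for N in range(1, 6):
        xs = random.sample(range(100), N)
        print(N, len(ray_dichotomies(xs)))          # prints N + 1, matching the table row

    print(len(ray_dichotomies([3])) == 2 ** 1)      # True:  1 point can be shattered, so dvc >= 1
    print(len(ray_dichotomies([3, 7])) == 2 ** 2)   # False: 2 is a break point, so dvc = 1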
VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.
VC-dimension of the Perceptron in R^d is d + 1: the answers

1. dvc ≥ d + 1.  What needs to be shown?
 X (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1.  What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
 X (e) Every set of d + 2 points cannot be shattered.
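For step 1, the usual construction (not spelled out on this slide) takes the d + 1 points x0 = 0, x1 = e1, . . . , xd = ed; after prepending the bias coordinate, the (d+1)×(d+1) matrix of these points is invertible, so any dichotomy y can be realized by the weights w = X⁻¹y. A brute-force check of this for small d (my own sketch):

    import itertools
    import numpy as np

    def perceptron_shatters(points):
        """Check by enumeration that sign(w . [1, x]) can realize every dichotomy of the points."""
        X = np.hstack([np.ones((len(points), 1)), np.asarray(points, dtype=float)])
        for y in itertools.product([-1.0, 1.0], repeat=len(points)):
            y = np.array(y)
            w, *_ = np.linalg.lstsq(X, y, rcond=None)   # exact solve here, since X is square and invertible
            if not np.array_equal(np.sign(X @ w), y):
                return False
        return True

    d = 3
    points = np.vstack([np.zeros(d), np.eye(d)])   # d + 1 points: the origin and the unit vectors
    print(perceptron_shatters(points))             # True, so dvc >= d + 1

Step 2 uses the usual linear-dependence argument: any d + 2 points in R^(d+1) (with the bias coordinate) are linearly dependent, and the dependence dictates one dichotomy that no perceptron can realize, so dvc ≤ d + 1.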
A Single Parameter Characterizes Complexity

Finite H:        Eout(g) ≤ Ein(g) + √( (1/(2N)) log( 2|H| / δ ) )           (second term: model complexity)

VC dimension:    Eout(g) ≤ Ein(g) + √( (8/N) log( 4((2N)^dvc + 1) / δ ) )    (second term: penalty for model complexity, Ω(dvc))

[Figure: Error versus |H| (left panel) and versus the VC dimension dvc (right panel). In each panel the in-sample error decreases while the model-complexity term increases, so the bound on the out-of-sample error is minimized at an intermediate |H|* or d*vc.]
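To get a feel for the penalty term, here is a small sketch (my own code, not the lecture's) that evaluates Ω for the VC form of the bound:

    import math

    def vc_penalty(N, dvc, delta=0.1):
        """Model-complexity term: sqrt( (8/N) * ln( 4*((2N)^dvc + 1) / delta ) )."""
        return math.sqrt(8.0 / N * math.log(4 * ((2 * N) ** dvc + 1) / delta))

    # The penalty grows with dvc ...
    for dvc in (1, 3, 5, 10):
        print("dvc =", dvc, " penalty =", round(vc_penalty(N=10_000, dvc=dvc), 3))

    # ... and shrinks (slowly) with N.
    for N in (100, 1_000, 10_000, 100_000):
        print("N =", N, " penalty =", round(vc_penalty(N=N, dvc=3), 3))

The trade-off in the figure comes from Ein typically decreasing as dvc grows while this penalty increases.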
Sample Complexity: How Many Data Points Do You Need?

Set the error bar at ε:

    ε = √( (8/N) ln( 4((2N)^dvc + 1) / δ ) )

Solve for N:

    N = (8/ε²) ln( 4((2N)^dvc + 1) / δ ) = O(dvc ln N)

Example. dvc = 3; error bar ε = 0.1; confidence 90% (δ = 0.1).
A simple iterative method works well. Trying N = 1000 we get

    N ≈ (8/0.1²) ln( (4(2000)³ + 4) / 0.1 ) ≈ 21192.

We continue iteratively, and converge to N ≈ 30000.
If dvc = 4, N ≈ 40000; for dvc = 5, N ≈ 50000.      (N ∝ dvc, but gross overestimates)

Practical Rule of Thumb: N = 10 × dvc
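The iteration is easy to run. A minimal sketch (my code; it simply iterates N ← (8/ε²) ln(4((2N)^dvc + 1)/δ) to a fixed point):

    import math

    def sample_complexity(dvc, eps=0.1, delta=0.1, N=1000.0):
        """Iterate N = (8/eps^2) * ln( 4*((2N)^dvc + 1) / delta ) until it stops changing."""
        for _ in range(100):
            N_new = 8.0 / eps ** 2 * math.log(4 * ((2 * N) ** dvc + 1) / delta)
            if abs(N_new - N) < 1:
                break
            N = N_new
        return N_new

    for dvc in (3, 4, 5):
        print(dvc, round(sample_complexity(dvc)))   # roughly 30000, 40000, 50000, as on the slide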
Theory Versus Practice
The VC analysis allows us to reach outside the data for general H.
– a single parameter characterizes complexity of H
– dvc depends only on H.
– Ein can reach outside D to Eout when dvc is finite.
In Practice . . .
• The VC bound is loose.
– the Hoeffding inequality itself has slack;
– mH (N ) is a worst case # of dichotomies, not average case or likely case.
– The polynomial bound on mH (N ) is loose.
• It is a good guide – models with small dvc generalize well.
• Roughly 10 × dvc examples needed to get good generalization.
The Test Set
• Another way to estimate Eout(g) is using a test set to obtain Etest(g).
• Etest is better than Ein: you don’t pay the price for fitting.
You can use |H| = 1 in the Hoeffding bound with Etest .
• Both a test and training set have variance.
The training set has optimistic bias due to selection – fitting the data.
A test set has no bias.
• The price for a test set is fewer training examples.
(why is this bad?)
Fewer training examples means learning produces a worse g: Etest ≈ Eout, but now Eout itself may be bad.
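With |H| = 1, the Hoeffding bound gives a tight error bar for Etest. A small sketch (my code; the formula is the single-hypothesis Hoeffding bound ε = √( ln(2/δ) / (2 Ntest) ), which assumes a bounded error measure such as binary error):

    import math

    def test_error_bar(n_test, delta=0.05):
        """Hoeffding error bar for a test set (|H| = 1):
        |Etest - Eout| <= sqrt( ln(2/delta) / (2 * n_test) ) with probability >= 1 - delta."""
        return math.sqrt(math.log(2 / delta) / (2 * n_test))

    for n_test in (100, 1_000, 10_000):
        print(n_test, round(test_error_bar(n_test), 3))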
VC Bound Quantifies Approximation Versus Generalization
The best H would be H = {f}, but you have a better chance of winning the lottery than of being handed that H.
dvc ↑ =⇒ better chance of approximating f (Ein ≈ 0).
dvc ↓ =⇒ better chance of generalizing to out of sample (Ein ≈ Eout).
Eout ≤ Ein + Ω(dvc).
VC analysis only depends on H.
Independent of f, P (x), learning algorithm.
Bias-Variance Analysis
Another way to quantify the tradeoff:
1. How well can the learning approximate f?
. . . as opposed to how well did the learning approximate f in-sample (Ein).
2. How close can you get to that approximation with a finite data set?
. . . as opposed to how close is Ein to Eout .
Bias-variance analysis applies to squared errors (classification and regression)
Bias-variance analysis can take into account the learning algorithm
Different learning algorithms can have different Eout when applied to the same H!
A Simple Learning Problem

2 Data Points. 2 hypothesis sets:

    H0 :  h(x) = b
    H1 :  h(x) = ax + b

[Figure: two panels (x versus y), one for H0 and one for H1, showing the 2-point data set and the fitted hypothesis.]
Let's Repeat the Experiment Many Times

[Figure: the fitted hypotheses from many different data sets overlaid, for H0 (left) and H1 (right).]

For each data set D, you get a different g^D.
So, for a fixed x, g^D(x) is a random value, depending on D.
What's Happening on Average

[Figure: the average hypothesis ḡ(x) plotted against the target sin(x), for H0 (left) and H1 (right).]

We can define:

    g^D(x)                                          ← random value, depending on D

    ḡ(x) = E_D[ g^D(x) ]                            ← your average prediction on x
         ≈ (1/K) ( g^D1(x) + · · · + g^DK(x) )

    var(x) = E_D[ (g^D(x) − ḡ(x))² ]                ← how variable is your prediction?
           = E_D[ g^D(x)² ] − ḡ(x)²
Eout on Test Point x for Data D

[Figure: at a test point x, the squared distance between the prediction g^D(x) and the target f(x) is the error E^D_out(x), shown for H0 (left) and H1 (right).]

    E^D_out(x) = (g^D(x) − f(x))²        ← squared error, a random value depending on D

    Eout(x) = E_D[ E^D_out(x) ]          ← expected Eout(x) before seeing D
The Bias-Variance Decomposition

    Eout(x) = E_D[ (g^D(x) − f(x))² ]
            = E_D[ g^D(x)² − 2 g^D(x) f(x) + f(x)² ]
            = E_D[ g^D(x)² ] − 2 ḡ(x) f(x) + f(x)²                  ← understand this; the rest is just algebra
            = E_D[ g^D(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
            = ( E_D[ g^D(x)² ] − ḡ(x)² ) + ( ḡ(x) − f(x) )²
            =             var(x)         +      bias(x)

    Eout(x) = bias(x) + var(x)

If you take the average over x:   Eout = bias + var

[Figure: for a very small model, H sits far from f, giving a large bias; for a very large model, H reaches f but the hypotheses are spread widely around it, giving a large var.]
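The identity can be sanity-checked numerically (my own sketch, not from the lecture): take samples standing in for the predictions g^D(x) at one fixed x, fix a target value f(x), and compare the two sides of Eout(x) = bias(x) + var(x):

    import numpy as np

    rng = np.random.default_rng(0)

    # Pretend these are the predictions g^D(x) over many data sets D at one fixed test point x,
    # and f_x is the target value f(x) there.
    g = rng.normal(loc=0.3, scale=1.2, size=100_000)
    f_x = 0.8

    g_bar = g.mean()                       # ḡ(x): the average prediction
    var_x = np.mean((g - g_bar) ** 2)      # var(x)
    bias_x = (g_bar - f_x) ** 2            # bias(x)
    eout_x = np.mean((g - f_x) ** 2)       # Eout(x)

    print(eout_x, bias_x + var_x)          # the two numbers agree up to floating-point error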
Back to H0 and H1; and, our winner is . . .

[Figure: the average hypothesis ḡ(x) against the target sin(x), for H0 (left) and H1 (right).]

    H0:  bias = 0.50,  var = 0.25,  Eout = 0.75  X
    H1:  bias = 0.21,  var = 1.69,  Eout = 1.90
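These numbers can be reproduced by simulation. A minimal sketch (my own code; the setup is assumed to be the textbook's: noiseless target f(x) = sin(πx), x uniform on [−1, 1], least-squares fits, squared error):

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.sin(np.pi * x)

    def fit_H0(x, y):
        """Constant model h(x) = b: the least-squares constant is the mean of the y's."""
        return np.array([0.0, np.mean(y)])            # (slope, intercept) with slope 0

    def fit_H1(x, y):
        """Line h(x) = ax + b, fit by least squares (an exact fit when N = 2)."""
        a, b = np.polyfit(x, y, 1)
        return np.array([a, b])

    def bias_var(fit, N=2, n_datasets=10_000, n_test=1000):
        x_test = np.linspace(-1, 1, n_test)
        preds = np.empty((n_datasets, n_test))
        for k in range(n_datasets):
            x = rng.uniform(-1, 1, N)
            a, b = fit(x, f(x))                        # g^D
            preds[k] = a * x_test + b                  # g^D(x) on a test grid
        g_bar = preds.mean(axis=0)                     # ḡ(x)
        bias = np.mean((g_bar - f(x_test)) ** 2)
        var = np.mean((preds - g_bar) ** 2)
        return bias, var

    for name, fit in [("H0", fit_H0), ("H1", fit_H1)]:
        bias, var = bias_var(fit, N=2)
        print(f"{name}: bias = {bias:.2f}, var = {var:.2f}, Eout = {bias + var:.2f}")

Running the same sketch with N = 5 should reproduce the numbers on the next slide, where the line wins.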
Match Learning Power to Data, . . . Not to f

2 Data Points:
    H0:  bias = 0.50,  var = 0.25,  Eout = 0.75  X
    H1:  bias = 0.21,  var = 1.69,  Eout = 1.90

5 Data Points:
    H0:  bias = 0.50,  var = 0.1,   Eout = 0.6
    H1:  bias = 0.21,  var = 0.21,  Eout = 0.42  X

[Figure: ḡ(x) against sin(x) for H0 and H1, with 2 data points (left column) and 5 data points (right column).]
Learning Curves: When Does the Balance Tip?

[Figure: expected error (Ein and Eout) versus the number of data points N, for a simple model (left) and a complex model (right).]

    Eout = E_x[ Eout(x) ]
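The curves can be generated by simulation. A small sketch (mine, not the lecture's; same assumed setup as before, with least-squares line fits) that estimates the expected Ein and Eout as a function of N:

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.sin(np.pi * x)

    def expected_errors(N, n_datasets=2000, n_test=1000):
        """Average Ein and Eout of a least-squares line over many data sets of size N."""
        x_test = np.linspace(-1, 1, n_test)
        ein, eout = [], []
        for _ in range(n_datasets):
            x = rng.uniform(-1, 1, N)
            y = f(x)
            a, b = np.polyfit(x, y, 1)                               # g^D: the least-squares line
            ein.append(np.mean((a * x + b - y) ** 2))                # Ein(g^D)
            eout.append(np.mean((a * x_test + b - f(x_test)) ** 2))  # Eout(g^D) on a dense grid
        return np.mean(ein), np.mean(eout)

    for N in (2, 5, 10, 20, 50, 100):
        Ein, Eout = expected_errors(N)
        print(f"N = {N:3d}:  E[Ein] = {Ein:.3f}   E[Eout] = {Eout:.3f}")
    # Ein rises toward, and Eout falls toward, the same asymptote as N grows.

Plotting these against N traces out a learning curve like the complex-model panel; replacing the line with the constant model H0 gives the simple-model curve.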
Decomposing The Learning Curve

[Figure: the learning curve (expected error versus the number of data points N), decomposed two ways.]

VC Analysis:  Eout = in-sample error (Ein) + generalization error.
    Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis:  Eout = bias + variance.
    Pick (H, A) to approximate f and not behave wildly after seeing the data.