Vapnik-Chervonenkis Dimension
Definition and Lower Bound
Adapted from Yishai Mansour
PAC Learning model
• There exists a distribution D over the domain X
• Examples: <x, c(x)>
  – use c for the target function (rather than c_t)
• Goal:
  – With high probability (1 − δ)
  – find h in H such that
  – error(h, c) < ε
  – ε arbitrarily small.
VC: Motivation
• Handle infinite classes.
• VC-dim “replaces” finite class size.
• Previous lecture (on PAC):
– specific examples
– rectangle.
– interval.
• Goal: develop a general methodology.
The VC Dimension
• C is a collection of subsets of a universe U
• VC(C) = VC dimension of C:
  the size of the largest subset T ⊆ U shattered by C
• T is shattered if every subset T' ⊆ T is expressible as
  T ∩ (an element of C)
• Example:
  C = {{a}, {a, c}, {a, b, c}, {b, c}, {b}}
  VC(C) = 2
  {b, c} is shattered by C
• Plays an important role in learning theory, finite automata,
  comparability theory, computational geometry
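As a quick sanity check on the definition, here is a small Python sketch (my own, not from the original slides) that verifies {b, c} is shattered by the example class above, by checking that every subset of {b, c} arises as {b, c} ∩ c for some c ∈ C:

    from itertools import chain, combinations

    C = [{"a"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
    T = {"b", "c"}

    def subsets(s):
        """All subsets of s, from the empty set up to s itself."""
        s = list(s)
        return [set(x) for x in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

    # T is shattered by C iff every T' of T equals T ∩ c for some c in C
    shattered = all(any(T & c == Tp for c in C) for Tp in subsets(T))
    print(shattered)  # True: {b, c} is shattered, so VC(C) >= 2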
Definitions: Projection
• Given a concept c over X
  – associate it with a set (all positive examples)
• Projection (sets)
  – For a concept class C and a subset S
  – P_C(S) = { c ∩ S | c ∈ C }
• Projection (vectors)
  – For a concept class C and S = {x_1, …, x_m}
  – P_C(S) = { <c(x_1), …, c(x_m)> | c ∈ C }
Definition: VC-dim
• Clearly |P_C(S)| ≤ 2^m
• C shatters S if |P_C(S)| = 2^m
  (S is shattered by C)
• VC dimension of a class C:
  – The size d of the largest set S that is shattered by C.
  – Can be infinite.
• For a finite class C
  – VC-dim(C) ≤ log |C|
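For a finite class these definitions can be checked directly. The following Python sketch (my own illustration; the helper names are not from the slides) computes P_C(S) as label vectors, finds the VC dimension of the earlier example class by brute force, and confirms the bound VC-dim(C) ≤ log |C|:

    from itertools import combinations
    from math import log2

    C = [{"a"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
    U = ["a", "b", "c"]

    def projection(C, S):
        """P_C(S) as a set of label vectors <c(x_1), ..., c(x_m)> for x_i in S."""
        return {tuple(int(x in c) for x in S) for c in C}

    def vc_dim(C, U):
        """Size of the largest S (a subset of U) with |P_C(S)| = 2^|S| (brute force)."""
        d = 0
        for r in range(1, len(U) + 1):
            if any(len(projection(C, S)) == 2 ** r for S in combinations(U, r)):
                d = r
        return d

    print(vc_dim(C, U))                   # 2, witnessed e.g. by S = ("b", "c")
    print(vc_dim(C, U) <= log2(len(C)))   # True: VC-dim(C) <= log2 |C|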
Example: S is shattered by C
VC: a combinatorial measure of the complexity of a function class
Calculating the VC dimension
• The VC dimension is at least d if there exists some sample S with
  |S| = d which is shattered by C.
• This does not mean that all samples of size d are shattered by C.
  (E.g., three points on a single line in the plane are not shattered by half-planes.)
• Conversely, in order to show that the VC dimension is at most d,
  one must show that no sample of size d + 1 is shattered.
• Naturally, proving an upper bound is more difficult than proving
  a lower bound on the VC dimension.
Example 1: Interval
C_1 = { c_z | z ∈ [0, 1] }
c_z(x) = 1 ⇔ x ≤ z
VC-dim(C_1) = 1: a single point is shattered, but for x_1 < x_2 no c_z produces
the labeling (0, 1).
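A brief Python check of this example (my own sketch, assuming the c_z(x) = 1 ⇔ x ≤ z reading above): on a finite sample, only thresholds placed at the sample points, plus one below the minimum, produce distinct label vectors, so the projection can be enumerated exactly.

    def projection_thresholds(S):
        """Label vectors realizable on sample S by c_z(x) = 1 iff x <= z."""
        candidates = [min(S) - 1.0] + sorted(S)   # one representative z per behavior
        return {tuple(int(x <= z) for x in S) for z in candidates}

    print(projection_thresholds([0.4]))        # {(0,), (1,)}: one point is shattered
    print(projection_thresholds([0.3, 0.7]))   # only 3 vectors, (0, 1) is missing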
Example 2: Line
C_2 = { c_w | w = (a, b, c) }
c_w(x, y) = 1 ⇔ ax + by ≥ c
Line (hyperplane in the plane): VC-dim ≥ 3
VC-dim < 4
4 points cannot be shattered
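To make the two bounds concrete, here is an illustrative Python sketch (mine, not from the slides) that uses a linear-programming feasibility test for strict linear separability: three points in general position admit every labeling, while the four points of the XOR configuration do not.

    import itertools
    from scipy.optimize import linprog

    def separable(points, labels):
        """Is there a line a*x + b*y >= c realizing the labels (with a strict margin)?
        Feasibility LP: find (a, b, c) with s_i * (a*x_i + b*y_i - c) >= 1,
        where s_i = +1 for label 1 and s_i = -1 for label 0."""
        A_ub, b_ub = [], []
        for (x, y), lab in zip(points, labels):
            s = 1 if lab == 1 else -1
            # s*(a*x + b*y - c) >= 1   <=>   -s*x*a - s*y*b + s*c <= -1
            A_ub.append([-s * x, -s * y, s])
            b_ub.append(-1.0)
        res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
        return res.success

    three = [(0, 0), (1, 0), (0, 1)]           # three points in general position
    four = [(0, 0), (1, 1), (1, 0), (0, 1)]    # XOR configuration
    print(all(separable(three, lab) for lab in itertools.product([0, 1], repeat=3)))  # True
    print(all(separable(four, lab) for lab in itertools.product([0, 1], repeat=4)))   # False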
Example 3: Axis-parallel Rectangle
VC Dim of Rectangles
• VC-dim = 4: four points in a "diamond" position are shattered, while for any
  five points the one inside the bounding box of the other four cannot get label 0
  when the other four get label 1.
Example 4: Finite union of intervals
Any finite set of points can be labeled arbitrarily (cover exactly the positive
points with small disjoint intervals)
Thus VC-dim = ∞
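A minimal Python sketch of this argument (my own illustration): given any labeled sample on the line, a union of tiny intervals around the positive points realizes the labeling, so every finite sample is shattered.

    def union_of_intervals_for(points, labels, eps=1e-6):
        """A finite union of intervals realizing the labeling (assumes distinct
        sample points are farther than eps apart)."""
        return [(x - eps, x + eps) for x, lab in zip(points, labels) if lab == 1]

    def classify(x, intervals):
        return int(any(lo <= x <= hi for lo, hi in intervals))

    points, labels = [0.1, 0.4, 0.5, 0.9], [1, 0, 1, 0]
    ivs = union_of_intervals_for(points, labels)
    print([classify(x, ivs) for x in points] == labels)  # True, for any labeling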
Example 5: Parity
• n Boolean input variables
• T ⊆ {1, …, n}
• f_T(x) = ⊕_{i∈T} x_i
• Lower bound: n unit vectors
• Upper bound
  – Number of concepts (2^n, so VC-dim ≤ log 2^n = n)
  – Linear dependency
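The lower bound is constructive and easy to verify mechanically; this small Python sketch (mine, not from the slides) checks that for every labeling b of the unit vectors e_1, …, e_n, the parity over T = {i : b_i = 1} agrees with b, so the unit vectors are shattered and VC-dim ≥ n.

    from itertools import product

    n = 4
    unit_vectors = [[int(i == j) for j in range(n)] for i in range(n)]

    def parity(T, x):
        """f_T(x) = XOR of x_i over i in T."""
        acc = 0
        for i in T:
            acc ^= x[i]
        return acc

    ok = all(
        [parity([i for i, bi in enumerate(b) if bi], e) for e in unit_vectors] == list(b)
        for b in product([0, 1], repeat=n)
    )
    print(ok)  # True: every labeling of the n unit vectors is realized by some parity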
Example 6: OR
• n Boolean input variables
• P and N subsets of {1, …, n}
• f_{P,N}(x) = (∨_{i∈P} x_i) ∨ (∨_{i∈N} ¬x_i)
• Lower bound: n unit vectors
• Upper bound
  – Trivial: 2n (= log of the number of concepts)
  – Use ELIM (get n + 1)
  – Show the second vector removes 2 (get n)
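As with parity, the lower bound can be checked directly; in this brief Python sketch (my own), the disjunction with P = {i : b_i = 1} and N = ∅ outputs exactly the labeling b on the unit vectors.

    from itertools import product

    n = 4
    unit_vectors = [[int(i == j) for j in range(n)] for i in range(n)]

    def f(P, N, x):
        """Disjunction of the positive literals in P and the negated literals in N."""
        return int(any(x[i] for i in P) or any(1 - x[i] for i in N))

    ok = all(
        [f([i for i, bi in enumerate(b) if bi], [], e) for e in unit_vectors] == list(b)
        for b in product([0, 1], repeat=n)
    )
    print(ok)  # True: the n unit vectors are shattered, so VC-dim >= n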
Example 7: Convex polygons
• Any number of points on a circle can be shattered (take the polygon whose
  vertices are exactly the positive points), so VC-dim = ∞.
Example 8: Hyper-plane
C8={cw,c | wd}
cw,c(x) = 1  <w,x>  c
• VC-dim(C8) = d+1
• Lower bound
– unit vectors and zero vector
• Upper bound!
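The lower bound can again be made explicit. The following Python sketch (my own, not from the slides) constructs, for every labeling of {0, e_1, …, e_d}, a pair (w, c) with c_{w,c}(x) = 1 ⇔ <w, x> ≥ c that realizes it.

    from itertools import product
    import numpy as np

    d = 3
    points = [np.zeros(d)] + [np.eye(d)[i] for i in range(d)]  # 0, e_1, ..., e_d

    def halfspace_for(labels):
        """labels = (b_0, b_1, ..., b_d): desired labels of 0, e_1, ..., e_d."""
        c = -0.5 if labels[0] == 1 else 0.5    # label of the origin fixes the threshold
        w = np.array([1.0 if b == 1 else -1.0 for b in labels[1:]])
        return w, c

    ok = all(
        [int(np.dot(w, x) >= c) for x in points] == list(b)
        for b in product([0, 1], repeat=d + 1)
        for (w, c) in [halfspace_for(b)]
    )
    print(ok)  # True: the d+1 points are shattered, so VC-dim(C_8) >= d+1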
Complexity Questions
Given C, compute VC(C)
• since VC(C) ≤ log |C|, it can be computed in n^O(log n) (quasi-polynomial) time
  (Linial-Mansour-Rivest 88)
• probably can’t do better: the problem is LOGNP-complete
  (Papadimitriou-Yannakakis 96)
Often C has a small implicit representation:
C(i, x) is a polynomial-size circuit such that
C(i, x) = 1 iff x belongs to set i
• the implicit version is Σ_3^p-complete (Schaefer 99)
  (as hard as deciding ∃a ∀b ∃c φ(a, b, c) for a CNF formula φ)
Sampling Lemma
Lemma: Let W X be chosen randomly such that |W| ε|X|.
A set of O(1/ε ln(1/δ)) points sampled independently and uniformly at
random from X intersects W with probability at least (1- δ)
Proof: Any sample x is in W with probability at least ε. Thus, the
probability that all samples do not intersect with W is at most δ:
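A quick numeric check of this calculation (my own sketch): with m = ⌈(1/ε) ln(1/δ)⌉ independent uniform samples, the miss probability (1 − ε)^m indeed stays below δ.

    import math

    for eps, delta in [(0.1, 0.05), (0.01, 0.01), (0.2, 0.001)]:
        m = math.ceil((1 / eps) * math.log(1 / delta))   # sample size from the lemma
        miss = (1 - eps) ** m                             # P[all m samples avoid W]
        print(eps, delta, m, miss, miss <= delta)         # the last column is always True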
ε-Net Theorem
Theorem: Let the VC dimension of (X, C) be d ≥ 2 and let 0 < ε ≤ ½. There
exists an ε-net for (X, C) of size at most O(d/ε ln(1/ε)).

If we choose O(d/ε ln(d/ε) + 1/ε ln(1/δ)) points at
random from X, then the resulting set N is an ε-net
with probability at least 1 − δ.
Exercise 3, Submission next week
A polynomial bound on the sample size for PAC learning
Radon Theorem
• Definitions:
  – Convex set.
  – Convex hull: conv(S)
• Theorem:
  – Let T be a set of d + 2 points in R^d.
  – There exists a subset S of T such that
    conv(S) ∩ conv(T \ S) ≠ ∅
• Proof!
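The proof is constructive, and the construction is easy to run. The Python sketch below (my own illustration) finds coefficients λ with Σ_i λ_i x_i = 0 and Σ_i λ_i = 0, splits the d + 2 points by the sign of λ_i, and returns a point lying in both convex hulls.

    import numpy as np

    def radon_partition(points):
        """points: array of shape (d+2, d). Returns (S, T_minus_S, common_point)."""
        pts = np.asarray(points, dtype=float)
        m, d = pts.shape
        assert m == d + 2
        M = np.vstack([pts.T, np.ones(m)])     # (d+1) x (d+2): coordinate rows plus a row of 1s
        lam = np.linalg.svd(M)[2][-1]          # a nonzero vector in the null space of M
        pos, neg = lam > 1e-12, lam < -1e-12
        s = lam[pos].sum()
        common = (lam[pos] / s) @ pts[pos]     # lies in conv(S) and in conv(T \ S)
        return np.where(pos)[0], np.where(neg)[0], common

    # Four points in the plane (d = 2): three triangle vertices and one interior point
    S, rest, common = radon_partition([[0, 0], [4, 0], [0, 4], [1, 1]])
    print(S, rest, common)   # the partition isolates the interior point; common point = (1, 1)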
Hyper-plane: Finishing the proof
• Assume a set T of d + 2 points can be shattered.
• Use Radon's Theorem to find S such that
  – conv(S) ∩ conv(T \ S) ≠ ∅
• Assign the points in S label 1
  – and the points not in S label 0
• There is a separating hyper-plane
• How will it label a point of conv(S) ∩ conv(T \ S)? Contradiction.
Lower bounds: Setting
• Static learning algorithm:
  – asks for a sample S of size m(ε, δ)
  – based on S selects a hypothesis
Lower bounds: Setting
• Theorem:
  – if VC-dim(C) = ∞ then C is not learnable.
• Proof:
  – Let m = m(0.1, 0.1)
  – Find 2m points which are shattered (set T)
  – Let D be the uniform distribution on T
  – Set c_t(x_i) = 1 with probability ½
• Expected error at least ¼ (at least m of the 2m points are unseen, and their labels are random)
• Finish the proof!
Lower Bound: Feasible
• Theorem
  – If VC-dim(C) = d + 1, then m(ε, δ) = Ω(d/ε)
• Proof:
  – Let T = {z_0, z_1, …, z_d} be a set of d + 1 points which is shattered.
  – D samples:
    • z_0 with prob. 1 − 8ε
    • each z_i (i ≥ 1) with prob. 8ε/d
Continued
  – Set c_t(z_0) = 1 and c_t(z_i) = 1 with probability ½
• Expected error 2ε
• Bound the confidence
  – for accuracy ε
Lower Bound: Non-Feasible
• Theorem
  – Even with two hypotheses, m(ε, δ) = Ω((log 1/δ)/ε²)
• Proof:
  – Let H = {h_0, h_1}, where h_b(x) = b
  – Two distributions over labeled examples:
    – D_0: <x, 1> with prob. ½ − γ and <x, 0> with prob. ½ + γ
    – D_1: <x, 1> with prob. ½ + γ and <x, 0> with prob. ½ − γ