
Vapnik-Chervonenkis Dimension and its use in machine learning

We have some data consisting of real-world observations. For each input, the observed output is determined by a target (concept) function c(x) ∈ C. By training a model on the data, we try to define a function h(x) ∈ H that approximates this concept function. The objective is to minimize the generalization error of the chosen h(x), so that it is as close as possible to the function c(x) that generated the data. We want to fix a threshold ε on the error that a hypothesis in the class H may have, and to bound the probability of obtaining an h(x) ∈ H whose error exceeds that threshold. For a finite hypothesis class, this is captured by the sample complexity bound:
$$ m \ge \frac{1}{\epsilon}\Big(\ln|H| + \ln\frac{1}{\delta}\Big), $$
where m indicates the number of training data points needed to guarantee a generalization error of at most ε with probability at least 1 − δ. This bound is valid for finite hypothesis classes, but it breaks down for infinite hypothesis classes, since ln|H| is then infinite. This is where the Vapnik-Chervonenkis (VC) dimension is used.
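As a quick numerical illustration of the finite-class bound, the sketch below (in Python, with illustrative values of |H|, ε and δ that are not taken from the text) computes the smallest m satisfying the inequality.

```python
import math

def sample_complexity_finite(h_size, epsilon, delta):
    """Sample complexity bound for a finite hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)).
    Returns the smallest integer m satisfying the bound."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative numbers (not from the text): |H| = 1000 hypotheses,
# generalization error at most 0.05 with probability at least 0.95.
print(sample_complexity_finite(h_size=1000, epsilon=0.05, delta=0.05))  # -> 199
```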
The VC dimension is defined as the size of the largest finite set of points X that can be shattered by the hypothesis space H. Intuitively, it measures how many data points the hypotheses in H can label in every possible way. A dichotomy of a set X is a partition of X into two disjoint subsets (for example, a labelling of the data into two distinct classes). X is then shattered by the hypothesis space H if and only if, for every dichotomy of X, there exists some hypothesis h(x) ∈ H that is consistent with that dichotomy (for example, a hypothesis that separates all the positively labelled data points from the negative ones).
If no set of n data points can be shattered (for every set of n points there is some dichotomy that no hypothesis in H realizes), while some set of n − 1 points can be, then the VC dimension is n − 1. The VC dimension of the class of linear separators (hyperplanes) in d dimensions is d + 1; thus for linear classifiers in the plane (d = 2), at most 3 points can be shattered.
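The following minimal sketch makes the shattering argument concrete for linear classifiers in the plane. It brute-forces every dichotomy of a point set and checks linear separability via an LP feasibility problem with scipy.optimize.linprog; the point sets and the use of SciPy are illustrative choices, not something prescribed by the text.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether a +1/-1 labelling of 2D points can be realized by a
    linear classifier sign(w.x + b), via LP feasibility:
    find w, b with y_i * (w.x_i + b) >= 1 for every point i."""
    # Variables: w1, w2, b.  Constraints: -y_i*(w.x_i + b) <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)],
                    dtype=float)
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered_by_lines(points):
    """A point set is shattered by 2D linear classifiers iff every
    dichotomy (every +1/-1 labelling) is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

# Three points in general position can be shattered (VC dimension 3 in 2D)...
print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))          # True
# ...but no four points can be; e.g. the "XOR" labelling of a square fails.
print(shattered_by_lines([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False
```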
To bound the generalization error in the case of an infinite hypothesis class, the sample complexity can instead be expressed in terms of the VC dimension:
$$ m \ge \frac{1}{\epsilon}\Big(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon}\Big), $$
which answers the question: how many training data points suffice to guarantee a generalization error of at most ε with probability at least 1 − δ?
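As with the finite case, the bound can be evaluated numerically. The sketch below plugs in illustrative values (VC(H) = 3 for linear classifiers in the plane, ε = 0.1, δ = 0.1, none of which are prescribed by the text).

```python
import math

def sample_complexity_vc(vc_dim, epsilon, delta):
    """Sample complexity bound for a hypothesis class of VC dimension VC(H):
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

# Illustrative numbers: lines in the plane (VC(H) = 3), error at most 0.1
# with probability at least 0.9.
print(sample_complexity_vc(vc_dim=3, epsilon=0.1, delta=0.1))  # -> 1859
```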