Vapnik-Chervonenkis Dimension and its use in machine learning

We have some data with real-world observations. Given an input from the data, the resulting observations depend on a function $c(x) \in C$, the concept function. By training a model on the data, we try to define a function $h(x) \in H$, an approximation of the concept function. The objective is to minimize the generalization error of the chosen $h(x)$, so that it is as close as possible to the function $c(x)$ that generated the data.

We want to fix a threshold $\epsilon$ on the error that a hypothesis in the class $H$ may have, and bound the probability of selecting an $h(x) \in H$ whose error exceeds that threshold. For a finite hypothesis class, this is captured by the sample complexity bound

$$m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),$$

where $m$ indicates the number of training data points needed so that the generalization error is at most $\epsilon$ with probability at least $1-\delta$. This bound is valid for finite hypothesis classes, but not for infinite ones, since $\ln|H|$ cannot be computed. This is where the Vapnik-Chervonenkis dimension is used.

The VC dimension is defined as the size of the largest finite subset of the instance space $X$ that can be shattered by the hypothesis space $H$. It indicates how many data points a hypothesis $h(x) \in H$ can distinguish under any dichotomy of that subset. A dichotomy of a set is defined as a partition of it into two disjoint subsets (for example, labeling the data points with two distinct classes). A subset of $X$ is shattered by the hypothesis space $H$ if and only if for every dichotomy of that subset there exists some hypothesis $h(x) \in H$ that is consistent with the dichotomy (for example, a hypothesis that separates all the positively labeled data points from the negative ones). If no set of $n$ data points can be shattered by $H$, but some set of $n-1$ points can, then the VC dimension is $n-1$.

The VC dimension of linear separators in $d$ dimensions is $d+1$; thus for lines in the plane ($d = 2$) the VC dimension is 3: there exist 3 points (in general position) for which every dichotomy can be realized by a line, but no set of 4 points can be shattered.

To bound the generalization error in the case of infinite hypothesis classes, the sample complexity can then be defined as

$$m \geq \frac{1}{\epsilon}\left(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon}\right),$$

which answers the question: how many training data points suffice to guarantee a generalization error of at most $\epsilon$ with probability at least $1-\delta$?
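To see how these two bounds behave numerically, here is a minimal Python sketch that evaluates them directly; the function names `finite_class_bound` and `vc_bound` are my own, and the printed values simply instantiate the formulas above for one choice of $\epsilon$ and $\delta$.

```python
import math

def finite_class_bound(h_size, epsilon, delta):
    """Sample complexity for a finite hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

def vc_bound(vc_dim, epsilon, delta):
    """VC-based sample complexity for an infinite hypothesis class:
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon))."""
    return math.ceil((4 * math.log2(2.0 / delta)
                      + 8 * vc_dim * math.log2(13.0 / epsilon)) / epsilon)

# Example: error at most 0.1 with probability at least 0.95 (delta = 0.05).
print(finite_class_bound(h_size=1000, epsilon=0.1, delta=0.05))  # -> 100
print(vc_bound(vc_dim=3, epsilon=0.1, delta=0.05))               # -> 1899, lines in the plane
```

Note how the VC-based bound is much looser than the finite-class bound for comparable settings; it must hold uniformly over an infinite class, so it pays for generality with more required samples.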
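The shattering definition can also be verified by brute force. Below is an illustrative Python sketch (the helper names are my own, and it assumes SciPy is available) that enumerates every dichotomy of a point set and solves a small feasibility linear program to decide whether some line is consistent with it. It confirms the geometric claim above: 3 points in general position are shattered by lines, while 4 points such as the corners of a square are not, because the "XOR" dichotomy that pairs opposite corners admits no separating line.

```python
from itertools import product
from scipy.optimize import linprog

def linearly_separable(pos, neg):
    """Feasibility LP: does a line w1*x + w2*y + b = 0 strictly separate
    the sets (>= 1 on pos, <= -1 on neg)? Margin 1 is w.l.o.g. by scaling."""
    rows, rhs = [], []
    for x, y in pos:
        rows.append([-x, -y, -1.0])  # -(w1*x + w2*y + b) <= -1
        rhs.append(-1.0)
    for x, y in neg:
        rows.append([x, y, 1.0])     # w1*x + w2*y + b <= -1
        rhs.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=rows, b_ub=rhs,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered_by_lines(points):
    """A set is shattered iff every dichotomy admits a consistent line."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab == 1]
        neg = [p for p, lab in zip(points, labels) if lab == 0]
        if not linearly_separable(pos, neg):
            return False
    return True

print(shattered_by_lines([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered_by_lines([(0, 0), (1, 1), (0, 1), (1, 0)]))  # False: the XOR dichotomy fails
```

This brute-force check only scales to small point sets (there are $2^n$ dichotomies), but it makes the definition concrete: shattering requires every labeling to be realizable, so a single failing dichotomy rules the set out.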