An Introduction to Statistical Learning Theory
Jian Zhang
Department of Statistics, Purdue University
April, 2010

1 Theory of Learning
Goal: to study the theoretical properties of learning algorithms from a statistical viewpoint. The theory should be able to model real/artificial data so that we can better understand and predict them. Statistical Learning Theory (SLT) is just one approach to understanding/studying learning systems. SLT results are often given in the form of generalization error bounds, although other types of results are possible.

2 Basic Concepts
Setup:
• IID observations (x1, y1), . . . , (xn, yn) ∼ P_{X,Y}
• the yi's are binary: yi ∈ {0, 1} or {±1}
• Dn = {(x1, y1), . . . , (xn, yn)} is also known as the training data
• P_{X,Y} is some unknown distribution
A learning algorithm A is a procedure which takes the training data and produces a predictor ĥn = A(Dn) as its output. Typically the learning algorithm searches over a space of functions H known as the hypothesis space, so it can also be written as ĥn = A(Dn, H).

3 Generalization of a Learner
For any predictor/classifier h : X → {0, 1}:
• its performance can be measured by a loss function ℓ(y, p);
• the most popular loss for classification is the 0/1 loss: ℓ(y, p) = 1{y ≠ p};
• generalization error (a.k.a. risk): R(h) := E_{X,Y}[ℓ(Y, h(X))] = P(Y ≠ h(X)).
The generalization error measures how well the classifier predicts, on average, on future observations. Note that it cannot be calculated in practice since P_{X,Y} is unknown.

4 Bayes Risk and Bayes Classifier
The Bayes risk R∗ is defined as the minimum risk over all measurable functions:
    R∗ = inf_{h measurable} R(h)
and the Bayes classifier h∗ is the one which attains the Bayes risk: R∗ = R(h∗). It is easy to show that h∗(x) takes the following form:
    h∗(x) = 1 if P(Y = 1 | X = x) > 0.5, and h∗(x) = 0 otherwise.

5 Risk Consistency
Given IID observations Dn = {(x1, y1), . . .
, (xn, yn)}, the learning algorithm outputs a classifier ĥn = A(Dn):
• ĥn is a function of Dn;
• R(ĥn) = E_{X,Y}[1{Y ≠ ĥn(X)}] is therefore also a random variable, through its dependence on Dn.
A learning procedure is called risk consistent if E_{Dn}[R(ĥn)] → R∗ as n → ∞. In other words, as more and more training data is provided, the risk of the output classifier converges to the minimum possible risk.

6 Empirical Risk Minimization
In general R(h) is not computable since P_{X,Y} is unknown. In practice, we use empirical risk minimization (ERM) within a function class H:
    ĥn = arg min_{h∈H} R̂n(h) = arg min_{h∈H} (1/n) Σ_{i=1}^n ℓ(yi, h(xi))
where
• H is some specified function class;
• R̂n(h) is the same as R(h) except that P_{X,Y} is replaced by the empirical distribution;
• by the law of large numbers, R̂n(h) → R(h) for any fixed h;
• uniform convergence over H needs further conditions.

7 Overfitting
Overfitting refers to the situation where we have a small empirical risk but still a relatively large true risk. It can happen especially when
• the sample size n is small;
• the hypothesis space H is large.
Recall that the empirical risk R̂n(h) is a random variable whose mean is R(h). Pure empirical risk minimization may eventually lead to fitting the noise, i.e. overfitting.

8 Overfitting: An Example
Consider the following example. Let ℓ(y, p) = (y − p)² and obtain the predictor ĥ by ERM:
    ĥ = arg min_{h∈H} (1/n) Σ_{i=1}^n (yi − h(xi))².
[Figure: ERM fits with polynomials of degree 1, 2, 3 and 5; higher-degree polynomials track the training points more closely but oscillate away from the underlying curve.]

9 Overfitting vs. Model Complexity
The following figure shows the tradeoff between model complexity and empirical risk:
[Figure: true/empirical risk vs. model complexity. The empirical risk decreases as complexity grows, while the true risk first decreases and then increases; the best model in theory sits at the minimum of the true risk.]

10 Approximation Error vs. Estimation Error
Suppose that the learning algorithm chooses the predictor from the hypothesis space H, and re-define h∗ = arg min_{h∈H} R(h), i.e.
h∗ is the best predictor in H. Then the excess risk of the output ĥn of the learning algorithm can be decomposed as follows:
    R(ĥn) − R∗ = [R(h∗) − R∗] + [R(ĥn) − R(h∗)]
where the first term is the approximation error and the second term is the estimation error.

11 Approximation Error vs. Estimation Error
    R(ĥn) − R∗ = [R(h∗) − R∗] (approximation error) + [R(ĥn) − R(h∗)] (estimation error)
Such a decomposition reflects a trade-off similar to the bias-variance tradeoff:
• approximation error: deterministic, caused by the restriction to H;
• estimation error: caused by the use of a finite sample, which cannot completely represent the underlying distribution.
The approximation error behaves like a squared-bias term, and the estimation error behaves like the variance. Basically, if H is large then we have a small approximation error but a relatively large estimation error, and vice versa.

12 Generalization Error Bound
Consistency only states that as n → ∞, E_{Dn}[R(ĥn)] converges to the minimum risk. However, sometimes knowing only this is not enough. We are also interested in the following:
• How fast is the convergence?
• R(ĥn) is a function of Dn, and thus a random variable. What is the distribution of this random variable?
• In particular, what is the behavior of the tail of the distribution?
A tail bound on the risk can often give us more information, including consistency and the rate of convergence.

13 Generalization Error Bound
Recall that the excess risk can be decomposed into approximation error and estimation error, i.e.
    R(ĥn) − R∗ = [inf_{h∈H} R(h) − R∗] (approximation error) + [R(ĥn) − inf_{h∈H} R(h)] (estimation error).
We would like to bound the estimation error.
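The decomposition above can be made concrete with a small simulation. The sketch below is illustrative and not from the slides: the distribution, the threshold classifier class, and all function names are my own choices. It estimates both error terms for ERM over a grid of 1-D threshold classifiers that deliberately excludes the Bayes threshold, so the approximation error is strictly positive.

```python
# Toy illustration of the approximation/estimation error decomposition.
# Data: X ~ Uniform(0,1), Y = 1{X > 0.5} with the label flipped w.p. 0.1,
# so the Bayes risk is R* = 0.1 and the Bayes classifier thresholds at 0.5.
# Hypothesis space H: threshold classifiers h_t(x) = 1{x > t}, t on a grid.
import random

def true_risk(t):
    # h_t disagrees with the Bayes classifier on an interval of length
    # |t - 0.5|, where its error probability is 0.9 instead of 0.1.
    return 0.1 + 0.8 * abs(t - 0.5)

def sample(n, rng):
    xs = [rng.random() for _ in range(n)]
    ys = [(1 if x > 0.5 else 0) ^ (1 if rng.random() < 0.1 else 0) for x in xs]
    return xs, ys

def erm(xs, ys, grid):
    # Empirical risk minimization over the threshold grid.
    def emp_risk(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, ys)) / len(xs)
    return min(grid, key=emp_risk)

rng = random.Random(0)
grid = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]     # note: 0.5 is NOT in the grid
best_in_H = min(true_risk(t) for t in grid)

approx_error = best_in_H - 0.1            # deterministic: restriction to H
t_hat = erm(*sample(200, rng), grid)
est_error = true_risk(t_hat) - best_in_H  # random: depends on the sample Dn
print(approx_error, est_error)
```

Enlarging the grid toward a finer one containing 0.5 drives the approximation error to zero while making the estimation error (for fixed n) no smaller, which is exactly the trade-off described above.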
Specifically, if we use ERM to obtain our predictor ĥn = arg min_{h∈H} R̂n(h), and assume that inf_{h∈H} R(h) = R(h∗) for some h∗ ∈ H, then we have
    R(ĥn) − inf_{h∈H} R(h) = R(ĥn) − R(h∗)
                           ≤ R(ĥn) − R(h∗) + R̂n(h∗) − R̂n(ĥn)    [since R̂n(ĥn) ≤ R̂n(h∗) by ERM]
                           = (R(ĥn) − R̂n(ĥn)) − (R(h∗) − R̂n(h∗))
                           ≤ 2 sup_{h∈H} |R(h) − R̂n(h)|.

14 PAC Learning
Thus if we can obtain a uniform bound on sup_{h∈H} |R(h) − R̂n(h)| then the estimation error can be bounded. This again justifies the use of the ERM method. The probably approximately correct (PAC) learning model is typically stated as follows: we say that ĥn is ε-accurate with probability 1 − δ if
    P(R(ĥn) − inf_{h∈H} R(h) > ε) < δ.
In other words, we have R(ĥn) − inf_{h∈H} R(h) ≤ ε with probability at least 1 − δ.

15 PAC Learning: Example I
Consider the special case H = {h}, i.e. we only have a single function. Furthermore, assume that it achieves 0 training error over Dn, i.e. R̂n(h) = 0. Then what is the probability that its generalization error satisfies R(h) ≥ ε? We have
    P(R̂n(h) = 0, R(h) ≥ ε) ≤ (1 − R(h))^n ≤ (1 − ε)^n ≤ exp(−nε).
Setting the right-hand side to δ and solving for ε gives ε = (1/n) log(1/δ). Thus
    P(R̂n(h) = 0, R(h) ≥ (1/n) log(1/δ)) ≤ δ,
i.e. with probability at least 1 − δ, zero training error implies R(h) < (1/n) log(1/δ).

16 PAC Learning: Example I
Note that we can also utilize Hoeffding's inequality to obtain
    P(|R̂n(h) − R(h)| ≥ ε) ≤ 2 exp(−2nε²),
which leads to
    P(|R̂n(h) − R(h)| ≥ √(log(2/δ) / (2n))) ≤ δ.
This is more general but not as tight as the previous bound, since it does not utilize the fact that R̂n(h) = 0. Example I essentially says that for each fixed function h, there is a set S of samples (whose measure satisfies P(S) ≥ 1 − δ) for which |R̂n(h) − R(h)| is bounded. However, such sets S can be different for different functions. To handle this issue we need to obtain uniform deviations, since
    R(ĥn) − R̂n(ĥn) ≤ sup_{h∈H} (R(h) − R̂n(h)).

17 PAC Learning: Example II
Consider the case H = {h1, . . . , hm}, a finite class of functions. Let
    Bk := {(x1, y1), . . . , (xn, yn) : R(hk) − R̂n(hk) ≥ ε}, k = 1, .
. . , m. Each Bk is the set of all bad samples for hk, i.e. the samples for which the bound fails for hk. In other words, it contains all misleading samples. If we want to measure the probability of the samples which are bad for any hk (k = 1, . . . , m), we can apply the Bonferroni (union bound) inequality to simply obtain:
    P(B1 ∪ . . . ∪ Bm) ≤ Σ_{k=1}^m P(Bk).

18 PAC Learning: Example II
Thus we have
    P(∃h ∈ H : R(h) − R̂n(h) ≥ ε) = P(∪_{k=1}^m {R(hk) − R̂n(hk) ≥ ε})
                                  ≤ Σ_{k=1}^m P(R(hk) − R̂n(hk) ≥ ε)    [union bound]
                                  ≤ m exp(−2nε²).
Hence for H = {h1, . . . , hm}, with probability at least 1 − δ,
    ∀h ∈ H, R(h) − R̂n(h) ≤ √((log m + log(1/δ)) / (2n)).
Since this is a uniform upper bound, it can be applied to ĥn ∈ H.

19 PAC Learning: Example II
From the above PAC learning examples we can see that:
• They require assumptions on data generation, i.e. samples are iid. The error bounds are valid with respect to repeated samples of training data.
• For a fixed function we roughly have R(h) − R̂n(h) ≈ 1/√n.
• If |H| = m then sup_{h∈H} (R(h) − R̂n(h)) ≈ √(log m / n). The term log m can be thought of as the complexity of the hypothesis space H.
• There are several things which can be improved:
  • Hoeffding's inequality does not utilize variance information.
  • The union bound can be as bad as if all the functions in H were independent.
  • The supremum over H might be too conservative.

20 From Finite to (Uncountably) Infinite: Basic Idea
If H contains infinitely many functions, we want to group functions into clusters in such a way that the functions in each cluster behave similarly.
• Since we aim to bound the risk, the grouping criterion should be related to a function's output.
• We need to relate a function's output on a finite sample to its output on the whole population.
• If the number of clusters is finite, we can then apply the union bound to obtain the generalization error bound.

21 Growth Function
Given a sample Dn = {(x1, y1), . . . , (xn, yn)}, define S = {x1, . . . , xn}.
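Before the formal definition, here is a concrete sketch of the "finitely many behaviors" idea (illustrative; the class, sample, and names are my own, not from the slides). For 1-D threshold classifiers h_t(x) = 1{x > t}, the class is uncountable, yet on any fixed n points it realizes at most n + 1 distinct label vectors (h(x1), . . . , h(xn)):

```python
# Enumerate the distinct labelings of a fixed sample by a function class.
# H = 1-D threshold classifiers h_t(x) = 1{x > t}; on n distinct points they
# realize at most n + 1 labeling patterns, far below the worst case 2^n.
def labelings(points, thresholds):
    return {tuple(1 if x > t else 0 for x in points) for t in thresholds}

points = [0.15, 0.40, 0.62, 0.83]
# A fine grid of thresholds stands in for the (uncountable) class.
thresholds = [i / 1000 for i in range(1001)]
H_S = labelings(points, thresholds)
print(len(H_S))   # 5 = n + 1 patterns on n = 4 points
```

The union bound can then be taken over these finitely many patterns rather than over the functions themselves, which is exactly what the growth function defined next makes precise.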
Consider the set
    H_S = H_{x1,...,xn} = {(h(x1), . . . , h(xn)) : h ∈ H}.
The size of this set is the total number of possible ways that S = {x1, . . . , xn} can be classified. For binary classification the cardinality of this set is always finite, no matter how large H is.
Definition (Growth Function). The growth function is the maximum number of ways into which n points can be classified by the function class:
    G_H(n) = sup_{x1,...,xn} |H_S|.

22 Growth Function
The growth function can be thought of as a measure of the "size" of the class of functions H. Several facts about the growth function:
• When H is finite, we always have G_H(n) ≤ |H| = m.
• Since h(x) ∈ {0, 1}, we have G_H(n) ≤ 2^n. If G_H(n) = 2^n, then there is a set of n points such that the class of functions H can generate any possible classification result on these points.

23 Symmetrization
We define Zi = (Xi, Yi). For notational simplicity, we will use
    Pf = E[f(X, Y)],    Pn f = (1/n) Σ_{i=1}^n f(xi, yi).
Here f(X, Y) can be thought of as ℓ(Y, h(X)). The key idea is to upper bound the true risk by an estimate from an independent sample, often known as the "ghost" sample. We use Z′1, . . . , Z′n to denote the ghost sample and write
    P′n f = (1/n) Σ_{i=1}^n f(x′i, y′i).
Then we can project the functions in H onto this double sample and apply the union bound with the help of the growth function G_H(·) of H.

24 Symmetrization
Lemma (Symmetrization). For any t > 0 such that t ≥ √(2/n), we have
    P(sup_{f∈F} |Pf − Pn f| ≥ t) ≤ 2 P(sup_{f∈F} |P′n f − Pn f| ≥ t/2).

25 Generalization Bound for Infinite Function Class
Theorem (Vapnik–Chervonenkis). For any δ > 0, with probability at least 1 − δ,
    ∀h ∈ H, R(h) ≤ R̂n(h) + 2 √((2 log G_H(2n) + 2 log(2/δ)) / n).
Proof: Let F = {f : f(x, y) = ℓ(y, h(x)), h ∈ H}. First note that G_H(n) = G_F(n).

26 Generalization Bound for Infinite Function Class
    P(sup_{h∈H} (R(h) − R̂n(h)) ≥ ε) = P(sup_{f∈F} (Pf − Pn f) ≥ ε)
                                     ≤ 2 P(sup_{f∈F} (P′n f − Pn f) ≥ ε/2)
                                     = 2 P(sup_{f∈F_{Dn,D′n}} (P′n f − Pn f) ≥ ε/2)
                                     ≤ 2 G_F(2n) P((P′n f − Pn f) ≥ ε/2)
                                     ≤ 2 G_F(2n) exp(−nε²/8).
The last inequality follows from Hoeffding's inequality, since P(P′n f − Pn f ≥ t) ≤ exp(−nt²/2). Setting δ = 2 G_F(2n) exp(−nε²/8) we have the claimed result.

27 Consistency and Rate of Convergence
It is often easy to obtain consistency and rate-of-convergence results from the generalization error bound.
• Consistency: use the fact that E[Z] = ∫₀^∞ P(Z > t) dt for Z ≥ 0 and plug in the bound.
• Rate of convergence: can often be read off the error bound.

28 Data-Dependent Error Bound
Definition. Let μ be a probability measure on X and assume that X1, . . . , Xn are independent random variables distributed according to μ. Let F be a class of functions mapping from X to R. Define the random variable
    R̂n(F) := E[sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) | X1, . . . , Xn],
where σ1, . . . , σn are independent uniform {±1}-valued random variables. R̂n(F) is called the empirical Rademacher average of F. Note that it depends on the sample and can actually be computed. Essentially it measures the correlation, in the supremum sense, between a random noise (labeling) and the functions in the class F. The Rademacher average of F is Rn(F) = E[R̂n(F)].

29 Data-Dependent Error Bound
Theorem. Let F be a set of {0, 1}-valued functions. For all δ > 0, with probability at least 1 − δ,
    ∀f ∈ F, Pf ≤ Pn f + 2 Rn(F) + √(log(1/δ) / (2n)),
and also with probability at least 1 − δ,
    ∀f ∈ F, Pf ≤ Pn f + 2 R̂n(F) + C √(log(2/δ) / n),
where C = √2 + 1/√2.
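As noted above, the empirical Rademacher average is conditioned on the sample and can actually be computed. The sketch below estimates it by Monte Carlo over the random signs σ for a small, arbitrarily chosen class of {0, 1}-valued threshold functions; the sample, the class, and all names are illustrative, not from the slides.

```python
# Monte Carlo estimate of the empirical Rademacher average
#   Rhat_n(F) = E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(X_i) ],
# conditioned on a fixed sample X_1, ..., X_n.
import random

def empirical_rademacher(values, n_draws, rng):
    # values: one tuple (f(x_1), ..., f(x_n)) per function f in F
    n = len(next(iter(values)))
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, fv)) / n for fv in values)
    return total / n_draws

rng = random.Random(0)
xs = [rng.random() for _ in range(50)]   # a fixed sample of size n = 50
# F: three fixed {0,1}-valued threshold functions evaluated on the sample
F_values = [tuple(1 if x > t else 0 for x in xs) for t in (0.25, 0.5, 0.75)]
r_hat = empirical_rademacher(F_values, n_draws=2000, rng=rng)
print(round(r_hat, 3))   # small and positive; shrinks as n grows
```

Plugging r_hat into the second bound of the theorem gives a fully data-dependent upper bound on Pf, since every quantity on the right-hand side is computable from the sample alone.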