Regularization
Instructor: Dr. Saeed Shiry

Hypothesis Space
The hypothesis space H is the space of functions that we allow our algorithm to provide, i.e. the space the algorithm is allowed to search.
It is often important to choose the hypothesis space as a function of the amount of data available.

Learning As Function Approximation From Samples: Regression and Classification
The basic goal of supervised learning is to use the training set S to "learn" a function f_S that, for a new x value, predicts the associated value of y.
Regression: y is a real-valued random variable.
Pattern classification: y takes values from an unordered finite set. In two-class pattern classification problems, we assign one class a y value of 1 and the other class a y value of −1.

Loss Functions
In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Common Loss Functions For Regression
The most common loss function is the square loss or L2 loss:
$V(f(x), y) = (f(x) - y)^2$
L1 loss:
$V(f(x), y) = |f(x) - y|$
Vapnik's more general ε-insensitive loss:
$V(f(x), y) = (|f(x) - y| - \epsilon)_+$

Problem of risk minimization
In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy L(y, f(x, α)) between the response y of the supervisor to a given input x and the response f(x, α) provided by the learning machine.
Consider the expected value of the loss, given by the risk functional
$R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$   (1.2)
The goal is to find the function f(x, α0) which minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ, in the situation where the joint probability distribution P(x, y) is unknown and the only available information is contained in the training set.

Three Main Learning Problems: 1. Pattern Recognition
Let the supervisor's output y take only two values, y ∈ {0, 1}, and let f(x, α), α ∈ Λ, be a set of indicator functions (functions which take only two values: zero and one). Consider the following loss function:
$L(y, f(x, \alpha)) = 0$ if $y = f(x, \alpha)$, and $1$ if $y \neq f(x, \alpha)$
For this loss function, the functional (1.2) determines the probability of different answers given by the supervisor and by the indicator function f(x, α). We call the case of different answers a classification error.
The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure P(x, y) is unknown, but the data are given.

Three Main Learning Problems: 2. Regression Estimation
Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a set of real functions that contains the regression function
$f(x, \alpha_0) = \int y \, dP(y \mid x)$
It is known that the regression function is the one that minimizes the functional (1.2) with the following loss function:
$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2$
Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x, y) is unknown but the data are given.

Three Main Learning Problems: 3. Density Estimation (Fisher-Wald Setting)
Finally, consider the problem of density estimation from the set of densities p(x, α), α ∈ Λ. For this problem we consider the following loss function:
$L(p(x, \alpha)) = -\log p(x, \alpha)$
It is known that the desired density minimizes the risk functional (1.2) with the above loss function. Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data are given.

Expected error, empirical error
Given a function f, a loss function V, and a probability distribution μ over Z, the expected or true error of f is
$I[f] = \int_Z V(f, z) \, d\mu(z)$
which is the expected loss on a new example drawn at random from μ.
We would like to make I[f] small, but in general we do not know μ.
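The regression losses and the expected error above can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the toy distribution in `toy_sampler` (x uniform on [-1, 1], y = 2x plus Gaussian noise), the function names, and the use of Monte Carlo sampling to approximate I[f]. In a real learning problem μ is unknown, so this computation is not available; the sketch only shows what quantity the expected error refers to.

```python
import random


def square_loss(fx, y):
    """Square (L2) loss: (f(x) - y)^2."""
    return (fx - y) ** 2


def l1_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return abs(fx - y)


def eps_insensitive_loss(fx, y, eps=0.1):
    """Epsilon-insensitive loss: zero inside a tube of half-width eps."""
    return max(abs(fx - y) - eps, 0.0)


def toy_sampler(rng):
    """ASSUMED toy distribution mu: x ~ U(-1, 1), y = 2x + N(0, 0.5^2)."""
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.5)
    return x, y


def expected_error_mc(f, loss, sampler, n_draws=20000, seed=0):
    """Monte Carlo approximation of I[f] = E_mu[V(f(x), y)].

    Only possible here because the toy distribution is known.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x, y = sampler(rng)
        total += loss(f(x), y)
    return total / n_draws


if __name__ == "__main__":
    f = lambda x: 2.0 * x  # the noise-free relationship of the toy model
    for name, loss in [("square", square_loss),
                       ("L1", l1_loss),
                       ("eps-insensitive", eps_insensitive_loss)]:
        print(name, expected_error_mc(f, loss, toy_sampler))
```

For the square loss and f(x) = 2x, the estimate should land near the noise variance of the toy model (0.25), since that is the part of y no function of x can predict.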
Given a function f, a loss function V, and a training set S consisting of n data points z_1, ..., z_n, the empirical error of f is
$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$

A reminder: convergence in probability
Let {X_n} be a sequence of bounded random variables. We say that X_n converges to X in probability if, for every ε > 0,
$\lim_{n \to \infty} P\{|X_n - X| \geq \epsilon\} = 0$

Generalization
A learning algorithm generalizes if |I_S[f_S] − I[f_S]| converges to zero in probability as n increases.

A learning algorithm should be well-posed, e.g. stable
In addition to the key property of generalization, a "good" learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity.

General definition of Well-Posed and Ill-Posed problems
A problem is well-posed if its solution:
exists;
is unique;
depends continuously on the data (i.e. it is stable).
A problem is ill-posed if it is not well-posed.
Well-posedness is mainly used to mean stability of the solution.

Theory of Solving Ill-Posed Problems
In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations
$Af = F, \quad f \in \mathcal{F}$
(finding f ∈ 𝓕 that satisfies the equality) is ill-posed: even if there exists a unique solution to this equation, a small deviation on the right-hand side of this equation (F_δ instead of F, where ||F − F_δ|| < δ is arbitrarily small) can cause large deviations in the solutions (it can happen that ||f_δ − f|| is large).
In this case, if the right-hand side F of the equation is not exact (e.g., it equals F_δ, where F_δ differs from F by some level δ of noise), the functions f_δ that minimize the functional
$R(f) = \|Af - F_\delta\|^2$
do not guarantee a good approximation to the desired solution even if δ tends to zero.

Real-life problems were found to be ill-posed
Hadamard thought that ill-posed problems are a purely mathematical phenomenon and that all real-life problems are "well-posed." However, in the second half of the century a number of very important real-life problems were found to be ill-posed.
In particular, one of the main problems of statistics, estimating the density function from the data, is ill-posed.

Regularization theory
Regularization theory was one of the first signs of the existence of intelligent inference.
In the middle of the 1960s it was discovered that if, instead of the functional R(f), one minimizes another, so-called regularized functional
$R^*(f) = \|Af - F_\delta\|^2 + \gamma(\delta) \, \Omega(f)$
where Ω(f) is some functional (that belongs to a special type of functionals) and γ(δ) is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as δ tends to zero.

ERM
Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as
$f_S = \arg\min_{f \in H} I_S[f]$
For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.

THE EMPIRICAL RISK MINIMIZATION (ERM) INDUCTIVE PRINCIPLE
In order to minimize the risk functional for an unknown probability measure P(z), the following induction principle is usually employed.
The expected risk functional R(α) is replaced by the empirical risk functional
$R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)$   (1.8)
constructed on the basis of the training set z_1, ..., z_l.
The principle is to approximate the function Q(z, α0) which minimizes the risk by the function Q(z, α_l) which minimizes the empirical risk (1.8). This principle is called the Empirical Risk Minimization induction principle (ERM principle).

Generalization and Well-posedness of Empirical Risk Minimization
For ERM to represent a "good" class of learning algorithms, the solution should generalize, exist, be unique and, especially, be stable (well-posedness).

ERM and generalization
Given a certain number of samples, suppose this is the "true" solution, but suppose ERM gives this solution.
Under which conditions does the ERM solution converge to the true solution as the number of examples increases? In other words, what are the conditions for generalization of ERM?
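The linear-regression instance of ERM mentioned above (square loss, H the space of linear functions f = ax) can be sketched in a few lines. The function names, the toy data generator and its slope of 3 are assumptions made for illustration; the closed form a = Σ x_i y_i / Σ x_i² is simply what setting the derivative of the empirical error to zero gives for this one-parameter family.

```python
import random


def empirical_risk(a, samples):
    """Empirical error I_S[f] of f(x) = a*x under the square loss."""
    return sum((a * x - y) ** 2 for x, y in samples) / len(samples)


def erm_linear(samples):
    """ERM over H = {f(x) = a*x}: minimize the empirical square loss.

    d/da sum (a*x_i - y_i)^2 = 0  gives  a = sum(x_i*y_i) / sum(x_i^2).
    """
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    if sxx == 0.0:
        # Every a gives the same empirical error: existence holds,
        # uniqueness fails.
        raise ValueError("all x are zero; the minimizer is not unique")
    return sxy / sxx


def make_samples(n, seed=0):
    """ASSUMED toy data: y = 3x + Gaussian noise, x ~ U(-1, 1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        samples.append((x, 3.0 * x + rng.gauss(0.0, 0.3)))
    return samples


if __name__ == "__main__":
    for n in (5, 50, 500):
        s = make_samples(n)
        a_hat = erm_linear(s)
        print(n, a_hat, empirical_risk(a_hat, s))
```

The `ValueError` branch is a small reminder that even this simple ERM problem is not automatically well-posed: when the data carry no information about a, the minimizer is not unique.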

ERM and stability
Given 10 samples, we can find the smoothest interpolating polynomial (which degree?).
But if we perturb the points slightly, the solution changes a lot!
If we restrict ourselves to degree-two polynomials, the solution varies only a small amount under a small perturbation.

ERM: conditions for well-posedness (stability) and predictivity (generalization)
Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H.
For example, compactness of H guarantees stability.
It seems intriguing that the classical conditions for consistency of ERM, which is quite a different property, also consist of appropriately restricting H.

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases.
Note that the above requirement is NOT the law of large numbers; the requirement for a fixed f that |I_S[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.

The theorem says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa).
A separate theorem also guarantees stability (defined in a specific way) of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.
Other results characterize uniform Glivenko-Cantelli (uGC) classes in terms of measures of complexity or capacity of H (such as VC dimension).
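The stability illustration with ten samples (interpolating polynomial versus degree-two polynomials) can be checked numerically. This is a rough sketch under stated assumptions, not the figure from the slides: the sample locations (0, 1, ..., 9), the target sin(x), the alternating perturbation of size `eps`, and all function names are chosen here for illustration. The degree-nine interpolant is evaluated in Lagrange form, and the degree-two fit solves the least-squares normal equations with a small hand-written Gaussian elimination.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the unique interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_poly_least_squares(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    k = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear(ata, aty)


def poly_eval(coeffs, x):
    """Evaluate sum_i coeffs[i] * x^i."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def stability_demo(eps=0.01):
    """Return (max change of interpolant, max change of degree-2 fit)."""
    xs = [float(i) for i in range(10)]           # ASSUMED sample locations
    ys = [math.sin(x) for x in xs]               # ASSUMED target values
    ys_pert = [y + eps * (-1) ** i for i, y in enumerate(ys)]
    grid = [0.1 * i for i in range(91)]          # evaluation points on [0, 9]

    quad = fit_poly_least_squares(xs, ys, 2)
    quad_pert = fit_poly_least_squares(xs, ys_pert, 2)

    change_interp = max(abs(lagrange_eval(xs, ys, x) -
                            lagrange_eval(xs, ys_pert, x)) for x in grid)
    change_quad = max(abs(poly_eval(quad, x) - poly_eval(quad_pert, x))
                      for x in grid)
    return change_interp, change_quad


if __name__ == "__main__":
    print(stability_demo())
```

With these choices the interpolant moves by much more than the size of the perturbation between the sample points, while the degree-two fit moves by less than the perturbation itself, which is the qualitative behaviour the slides describe.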
Thus the two desirable conditions for a learning algorithm, generalization and stability, are equivalent (and they correspond to the same constraints on H).

Regularization
Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems.
The basic idea in the treatment of ill-conditioned problems is to use some a priori knowledge about solutions to disqualify meaningless ones. Such knowledge can be:
some regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives;
some localization condition, such as a bound on the support of the solution or on its behavior at infinity.
Tikhonov's regularization penalizes undesired solutions by adding a term called a stabilizer.

Regularization
Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates the solution of the original ill-posed problem.
The well-posedness is achieved by implementing one or more of the following basic ideas:
restriction of the data;
change of the space and/or topologies;
modification of the operator itself;
the concept of regularization operators;
well-posed stochastic extensions of ill-posed problems.

Regularization
Regularized cost function = empirical cost function + regularization parameter × regularizer function

Image restoration: an ill-posed problem
Degradation model:
$G(u, v) = H(u, v) F(u, v) + N(u, v)$
Restored estimate:
$\hat{F}(u, v) = \frac{G(u, v)}{H(u, v)} = F(u, v) + \frac{N(u, v)}{H(u, v)}$
H is ill-conditioned, which makes the image restoration problem an ill-posed problem: the solution is not stable.

Tikhonov's Regularization Theory
Proposed by Tikhonov in 1963.
Proposes the use of prior knowledge to regularize mappings.
Most common application: exploit the smoothness property, i.e. an input-output mapping is smooth if similar inputs produce similar outputs.

Ivanov and Tikhonov Regularization
As we will see in future classes:
Tikhonov regularization ensures well-posedness, i.e. existence, uniqueness and especially stability (in a very strong form) of the solution;
Tikhonov regularization ensures generalization;
Tikhonov regularization is closely related to, but different from, Ivanov regularization, e.g. ERM on a hypothesis space H which is a ball in an RKHS.
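A small numerical sketch may help connect ill-conditioning, instability and the stabilizer term. The 2×2 system below is an assumption chosen for illustration, as are the function names, the regularization parameter `LAMBDA` and the choice of the squared Euclidean norm as the regularizer (the simplest stabilizer, not necessarily the one intended in the slides). The unregularized solve reacts strongly to a tiny change in the right-hand side, while minimizing ||Af − F||² + λ||f||², which leads to the linear system (AᵀA + λI) f = AᵀF, does not.

```python
import math


def solve_2x2(m, b):
    """Solve a 2x2 linear system m f = b by Cramer's rule."""
    (a11, a12), (a21, a22) = m
    det = a11 * a22 - a12 * a21
    if det == 0.0:
        raise ValueError("singular matrix")
    return ((b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det)


def tikhonov_solve_2x2(a, rhs, lam):
    """Minimize ||A f - F||^2 + lam * ||f||^2.

    The minimizer satisfies (A^T A + lam * I) f = A^T F.
    """
    (a11, a12), (a21, a22) = a
    ata_reg = ((a11 * a11 + a21 * a21 + lam, a11 * a12 + a21 * a22),
               (a11 * a12 + a21 * a22, a12 * a12 + a22 * a22 + lam))
    atb = (a11 * rhs[0] + a21 * rhs[1], a12 * rhs[0] + a22 * rhs[1])
    return solve_2x2(ata_reg, atb)


def distance(u, v):
    """Euclidean distance between two 2-vectors."""
    return math.hypot(u[0] - v[0], u[1] - v[1])


# ASSUMED illustrative problem: nearly parallel rows make A ill-conditioned.
A = ((1.0, 1.0), (1.0, 1.0001))
F_EXACT = (2.0, 2.0001)   # equals A applied to the "true" solution (1, 1)
F_NOISY = (2.0, 2.0002)   # right-hand side perturbed by 1e-4
LAMBDA = 1e-3             # ASSUMED regularization parameter

if __name__ == "__main__":
    naive_exact = solve_2x2(A, F_EXACT)
    naive_noisy = solve_2x2(A, F_NOISY)
    reg_exact = tikhonov_solve_2x2(A, F_EXACT, LAMBDA)
    reg_noisy = tikhonov_solve_2x2(A, F_NOISY, LAMBDA)
    print("naive change:      ", distance(naive_exact, naive_noisy))
    print("regularized change:", distance(reg_exact, reg_noisy))
    print("regularized error: ", distance(reg_exact, (1.0, 1.0)))
```

The price of stability is a small bias: the regularized solution for the exact right-hand side is close to, but not exactly, (1, 1). Choosing the regularization parameter is a trade-off between that bias and sensitivity to noise.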