Regularization

Instructor: Dr. Saeed Shiry
Hypothesis Space

The hypothesis space H is the space of functions that we allow our algorithm to provide as solutions; it is the space in which the algorithm is allowed to search.
It is often important to choose the hypothesis space as a function of the amount of data available.
Learning As Function Approximation From
Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to "learn" a function f_S that, for a new x value, predicts the associated value of y.
Regression: y is a real-valued random variable.
Pattern classification: y takes values from an unordered finite set.
In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of −1.
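As a rough sketch of this setup (the training data, the placeholder learner, and the query point below are invented purely for illustration), the workflow is: build a training set S of (x, y) pairs, produce a function f_S from it, then query f_S at a new x.

```python
from typing import Callable, List, Tuple

def learn(S: List[Tuple[float, float]]) -> Callable[[float], float]:
    # Placeholder learner for illustration only: predict the mean of the training y values.
    mean_y = sum(y for _, y in S) / len(S)
    return lambda x: mean_y

S = [(0.0, -1.0), (1.0, 1.0), (2.0, 1.0)]   # training set of (x, y) pairs
f_S = learn(S)                               # "learn" a function from S
print(f_S(3.0))                              # predict the y value for a new x
```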
Loss Functions


In order to measure the goodness of our function, we need a loss function V.
In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.
Common Loss Functions For
Regression

The most common loss function is the square loss, or L2 loss:
V(f(x), y) = (f(x) − y)^2

The L1 loss:
V(f(x), y) = |f(x) − y|

Vapnik's more general ε-insensitive loss:
V(f(x), y) = max(|f(x) − y| − ε, 0)
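As a concrete illustration, here is a minimal NumPy sketch of the three losses above (the sample values and the default ε are invented for the example):

```python
import numpy as np

def square_loss(f_x, y):
    return (f_x - y) ** 2                      # L2 loss: (f(x) - y)^2

def l1_loss(f_x, y):
    return np.abs(f_x - y)                     # L1 loss: |f(x) - y|

def eps_insensitive_loss(f_x, y, epsilon=0.1):
    # Vapnik's epsilon-insensitive loss: errors smaller than epsilon cost nothing.
    return np.maximum(np.abs(f_x - y) - epsilon, 0.0)

f_x = np.array([1.0, 2.0, 3.0])                # predictions f(x)
y = np.array([1.05, 2.5, 2.0])                 # observed y values
print(square_loss(f_x, y), l1_loss(f_x, y), eps_insensitive_loss(f_x, y))
```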
Problem of risk minimization

In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy L(y, f(x, α)) between the response y of the supervisor to a given input x and the response f(x, α) provided by the learning machine. Consider the expected value of the loss, given by the risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y)    (1.2)

The goal is to find the function f(x, α₀) that minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ, in the situation where the joint probability distribution P(x, y) is unknown and the only available information is contained in the training set.
Three Main Learning Problems
1. Pattern Recognition

Let the supervisor's output y take only two values, y ∈ {0, 1}, and let f(x, α), α ∈ Λ, be a set of indicator functions (functions which take only the two values zero and one).
Consider the following loss function:

L(y, f(x, α)) = 0 if y = f(x, α), and L(y, f(x, α)) = 1 if y ≠ f(x, α).

For this loss function, the functional (1.2) determines the probability of different answers being given by the supervisor and by the indicator function f(x, α). We call the case of different answers a classification error.
The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure F(x, y) is unknown, but the data are given.
Three Main Learning Problems
2. Regression Estimation

Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a set of real functions that contains the regression function

f(x, α₀) = ∫ y dF(y|x).

It is known that the regression function is the one that minimizes the functional (1.2) with the following loss function:

L(y, f(x, α)) = (y − f(x, α))^2

Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x, y) is unknown but the data are given.
Three Main Learning Problems
3. Density Estimation (Fisher–Wald Setting)

Finally, consider the problem of density estimation from the set of densities p(x, α), α ∈ Λ. For this problem we consider the following loss function:

L(p(x, α)) = − log p(x, α)

It is known that the desired density minimizes the risk functional (1.2) with the above loss function.
Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data are given.
Expected error, empirical error

Given a function f, a loss function V, and a probability distribution μ over Z, the expected or true error of f is

I[f] = E_z V(f, z) = ∫_Z V(f(x), y) dμ(z),

the expected loss on a new example drawn at random from μ. We would like to make I[f] small, but in general we do not know μ.
Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is

I_S[f] = (1/n) Σ_{i=1}^n V(f(x_i), y_i).
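A minimal sketch of the empirical error I_S[f] in Python (the candidate function, the loss, and the training set are placeholders for illustration):

```python
import numpy as np

def empirical_error(f, V, S):
    # I_S[f] = (1/n) * sum_i V(f(x_i), y_i) over the n points of the training set S.
    return float(np.mean([V(f(x), y) for x, y in S]))

square_loss = lambda fx, y: (fx - y) ** 2      # the L2 loss from above
f = lambda x: 2.0 * x                          # some candidate hypothesis
S = [(0.0, 0.1), (1.0, 2.2), (2.0, 3.9)]       # training set of n = 3 points

print(empirical_error(f, square_loss, S))      # a proxy for the unknown true error I[f]
```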
A reminder: convergence in
probability

Let {X_n} be a sequence of bounded random variables. We say that X_n converges to X in probability, written lim_{n→∞} X_n = X in probability, if for every ε > 0, lim_{n→∞} P(|X_n − X| ≥ ε) = 0.
Generalization

A learning algorithm generalizes if the empirical error of its solution converges in probability, as n increases, to its true error; this is made precise below.

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a "good" learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity.
General definition of Well-Posed
and Ill-Posed problems
A problem is well-posed if its solution:
exists,
is unique, and
depends continuously on the data (e.g. it is stable).
A problem is ill-posed if it is not well-posed.
Here, well-posedness is mainly used to mean stability of the solution.
Theory of Solving Ill-Posed
Problems

In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations

A f = F,  f ∈ F

(finding f ∈ F that satisfies the equality) is ill-posed: even if there exists a unique solution to this equation, a small deviation on the right-hand side (F_δ instead of F, where ||F − F_δ|| < δ is arbitrarily small) can cause large deviations in the solution (it can happen that ||f_δ − f|| is large).
In this case, if the right-hand side F of the equation is not exact (e.g., it equals F_δ, where F_δ differs from F by some level δ of noise), the functions f_δ that minimize the functional

R(f) = ||A f − F_δ||^2

do not guarantee a good approximation to the desired solution even if δ tends to zero.
Real-life problems
were found to be ill-posed


Hadamard thought that ill-posed problems are a purely mathematical phenomenon and that all real-life problems are "well-posed."
However, in the second half of the century a number of very important real-life problems were found to be ill-posed.
In particular, one of the main problems of statistics, estimating the density function from the data, is ill-posed.
Regularization theory



Regularization theory was one of the first signs of the existence of intelligent inference.
In the middle of the 1960s it was discovered that if, instead of the functional R(f), one minimizes another, so-called regularized, functional

R*(f) = R(f) + γ(δ) Ω(f),

where Ω(f) is some functional (belonging to a special class of functionals) and γ(δ) is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as δ tends to zero.
ERM

Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as

f_S = arg min_{f ∈ H} I_S[f].

For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
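A minimal sketch of this example, assuming one-dimensional inputs and the hypothesis space H = {f(x) = a·x}: ERM with the square loss reduces to ordinary least squares, which here has a closed form.

```python
import numpy as np

def erm_linear(S):
    # ERM over H = {f(x) = a*x} with square loss:
    # pick a minimizing (1/n) * sum_i (a*x_i - y_i)^2 (closed-form minimizer).
    x = np.array([xi for xi, _ in S])
    y = np.array([yi for _, yi in S])
    a = np.dot(x, y) / np.dot(x, x)
    return lambda x_new: a * x_new

S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # invented training set
f_S = erm_linear(S)
print(f_S(4.0))                             # prediction for a new x
```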
THE EMPIRICAL RISK MINIMIZATION
(ERM) INDUCTIVE PRINCIPLE





In order to minimize the risk functional for an unknown probability measure P(z), the following induction principle is usually employed.
The expected risk functional R(α) is replaced by the empirical risk functional

R_emp(α) = (1/l) Σ_{i=1}^l Q(z_i, α),    (1.8)

constructed on the basis of the training set.
The principle is to approximate the function Q(z, α₀) which minimizes the risk by the function Q(z, α_l) which minimizes the empirical risk (1.8).
This principle is called the Empirical Risk Minimization induction principle (ERM principle).
Generalization and Well-posedness of
Empirical Risk Minimization
For ERM to represent a "good" class of learning algorithms, the solution should:
generalize;
exist, be unique and, especially, be stable (well-posedness).
ERM and generalization

Given a certain number of samples, suppose one function is the "true" solution, but ERM returns a different one.
Under which conditions does the ERM solution converge to the true solution as the number of examples increases? In other words, what are the conditions for generalization of ERM?
ERM and stability

Given 10 samples, we can find the smoothest interpolating polynomial (of which degree?). But if we perturb the points slightly, the solution changes a lot!
If we restrict ourselves to degree-two polynomials, the solution varies only a small amount under a small perturbation, as the sketch below illustrates.
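A small numerical sketch of this phenomenon (the sample values, noise level, and perturbation size are invented for the illustration): fit a degree-9 interpolating polynomial and a degree-2 polynomial to 10 points, perturb one point slightly, and compare how much each fitted curve moves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # 10 noisy samples

y_pert = y.copy()
y_pert[1] += 0.05                  # perturb a single training point slightly

grid = np.linspace(0, 1, 200)
for degree in (9, 2):              # interpolating fit vs. restricted hypothesis space
    f_fit  = np.polyval(np.polyfit(x, y,      degree), grid)
    f_pert = np.polyval(np.polyfit(x, y_pert, degree), grid)
    print(f"degree {degree}: max change of the fitted curve = "
          f"{np.max(np.abs(f_fit - f_pert)):.3f}")
# The degree-9 interpolant typically moves far more than the degree-2 fit.
```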
ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability.
It seems intriguing that the classical conditions for consistency of ERM, quite a different property, consist of appropriately restricting H.
ERM: conditions for well-posedness (stability) and predictivity (generalization)

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases.
Note that the above requirement is NOT the law of large numbers; the requirement for a fixed f that |I_S[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.
ERM: conditions for well-posedness (stability) and predictivity (generalization)

The key theorem, stated informally: ERM generalizes if and only if the hypothesis space H is a uniform Glivenko–Cantelli (uGC) class of functions.





The theorem says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa).
A separate theorem also guarantees stability (defined in a specific way) of ERM.
Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.
Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as the VC dimension).
Thus the two desirable conditions for a learning algorithm, generalization and stability, are equivalent (and they correspond to the same constraints on H).
Regularization


Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems.
The basic idea in the treatment of ill-conditioned problems is to use some a priori knowledge about solutions to disqualify meaningless ones. Such knowledge can be:
some regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives;
some localization condition, such as a bound on the support of the solution or its behavior at infinity.
Tikhonov's regularization penalizes undesired solutions by adding a term called a stabilizer.
Regularization


Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates that of the original ill-posed problem.
Well-posedness is achieved by implementing one or more of the following basic ideas:
restriction of the data;
change of the space and/or topologies;
modification of the operator itself;
the concept of regularization operators; and
well-posed stochastic extensions of ill-posed problems.
Regularization

Regularized cost function = empirical cost function + regularization parameter × regularizer function
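As a concrete sketch of this recipe, ridge regression in NumPy: the empirical cost is the mean squared error, the regularizer is the squared norm of the weights, and lam plays the role of the regularization parameter (the data below are invented for illustration).

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimize (1/n)*||X w - y||^2 + lam*||w||^2 (Tikhonov / ridge regression).
    # Closed form: w = (X^T X / n + lam*I)^(-1) (X^T y / n).
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(20)

for lam in (0.0, 0.1, 10.0):        # lam = 0 recovers plain ERM (least squares)
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```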
Image restoration – an ill-posed problem

Degradation model (in the frequency domain):

G(u, v) = H(u, v) F(u, v) + N(u, v)

Inverse filtering gives the estimate

F̂(u, v) = G(u, v) / H(u, v) = F(u, v) + N(u, v) / H(u, v)

H is ill-conditioned, which makes the image restoration problem an ill-posed problem: the solution is not stable.
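A one-dimensional NumPy sketch of this instability and of a regularized fix (the signal, blur kernel, noise level, and λ are invented; the regularized formula divides by |H|² + λ instead of H, a Tikhonov-style variant of inverse filtering rather than the exact method of this course):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
f = np.zeros(n); f[100:140] = 1.0                       # true signal
h = np.exp(-0.5 * ((np.arange(n) - n // 2) / 4.0) ** 2)
h /= h.sum()                                            # Gaussian blur kernel

F, H = np.fft.fft(f), np.fft.fft(np.fft.ifftshift(h))
G = H * F + np.fft.fft(0.01 * rng.standard_normal(n))   # G = H*F + N

# Naive inverse filter: divide by H (amplifies noise wherever |H| is tiny).
f_inv = np.real(np.fft.ifft(G / H))

# Tikhonov-style regularized inverse: F_hat = conj(H)*G / (|H|^2 + lam).
lam = 1e-3
f_reg = np.real(np.fft.ifft(np.conj(H) * G / (np.abs(H) ** 2 + lam)))

print("inverse filter error    :", np.linalg.norm(f_inv - f))
print("regularized filter error:", np.linalg.norm(f_reg - f))
```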
Tikhonov's Regularization Theory

Proposed by Tikhonov in 1963.
Proposes the use of prior knowledge to regularize mappings.
Most common application: exploit the smoothness property: for an input-output mapping to be smooth, similar inputs should produce similar outputs.
Ivanov and Tikhonov
Regularization
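The two formulations presumably contrasted here, written in their standard form (I_S is the empirical error from above; R is a radius and λ a regularization parameter):

```latex
% Ivanov regularization: ERM restricted to a ball of radius R in the hypothesis space H
\min_{f \in H} I_S[f] \quad \text{subject to} \quad \|f\|_H^2 \le R^2

% Tikhonov regularization: penalized ERM with regularization parameter \lambda > 0
\min_{f \in H} \; I_S[f] + \lambda \|f\|_H^2
```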
Tikhonov Regularization
As we will see in future classes:
Tikhonov regularization ensures well-posedness, i.e. existence, uniqueness and, especially, stability (in a very strong form) of the solution.
Tikhonov regularization ensures generalization.
Tikhonov regularization is closely related to, but different from, Ivanov regularization, i.e. ERM on a hypothesis space H which is a ball in an RKHS.