T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi:
General Conditions for Predictivity in Learning Theory
Michael Pfeiffer
pfeiffer@igi.tugraz.at
25.11.2004
Motivation

- Supervised Learning
  - learn functional relationships from a finite set of labelled training examples
- Generalization
  - How well does the learned function perform on unseen test examples?
  - the central question in supervised learning
What you will hear

- New idea: stability implies predictivity
  - a learning algorithm is stable if small perturbations of the training set do not change the hypothesis much
- Conditions for generalization are placed on the learning map rather than on the hypothesis space
  - in contrast to VC-analysis
Agenda

- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion
Some Definitions 1/2

- Training data: $S = \{z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n)\}$
  - $Z = X \times Y$
  - unknown distribution $\mu(x, y)$ on $Z$
- Hypothesis space $H$
  - hypothesis $f_S \in H$, $f_S: X \to Y$
- Learning algorithm $L: \bigcup_{n \ge 1} Z^n \to H$
  - $L(S) = L\big(z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n)\big) = f_S$
  - regression: $f_S$ is real-valued / classification: $f_S$ is binary
  - $L$ is symmetric (the ordering of the training examples is irrelevant)
Some Definitions 2/2

- Loss function $V(f, z)$
  - $V: H \times Z \to \mathbb{R}$
  - e.g. $V(f, z) = (f(x) - y)^2$
  - assume that $V$ is bounded
- Empirical error (training error): $I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$
- Expected error (true error): $I[f] = \int_Z V(f, z) \, d\mu(z)$
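The two error functionals are easy to make concrete. Below is a minimal sketch (assuming Python with NumPy, squared loss, a polynomial least-squares fit and a toy distribution, none of which appear on the slides) that computes the empirical error $I_S[f]$ on a training set and a Monte-Carlo estimate of the expected error $I[f]$.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)                  # the unknown regression function

def sample(n):
    x = rng.uniform(0, 1, n)
    y = target(x) + 0.1 * rng.normal(size=n)      # draws from the distribution mu(x, y)
    return x, y

def fit_least_squares(x, y, degree=3):
    # f_S: polynomial least-squares fit on the training set S (a stand-in learning map L)
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coeffs, x_new)

def empirical_error(f, x, y):
    # I_S[f] = (1/n) * sum_i V(f, z_i) with squared loss V(f, z) = (f(x) - y)^2
    return np.mean((f(x) - y) ** 2)

def expected_error(f, n_mc=100_000):
    # Monte-Carlo estimate of I[f] = integral of V(f, z) d mu(z)
    x, y = sample(n_mc)
    return np.mean((f(x) - y) ** 2)

x_train, y_train = sample(20)
f_S = fit_least_squares(x_train, y_train)
print("I_S[f_S] =", empirical_error(f_S, x_train, y_train))
print("I[f_S]  ~=", expected_error(f_S))
```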
Generalization and Consistency

- Convergence in probability:
  $$\lim_{n \to \infty} P\big(|X_n - X| \ge \varepsilon\big) = 0 \quad \forall \varepsilon > 0$$
- Generalization
  - performance on the training examples must be a good indicator of performance on future examples
  - $\lim_{n \to \infty} \left| I[f_S] - I_S[f_S] \right| = 0$ in probability
- Consistency
  - the expected error converges to the best achievable error in $H$
  - $\forall \mu \; \forall \varepsilon > 0: \; \lim_{n \to \infty} P\left( I[f_S] - \inf_{f \in H} I[f] > \varepsilon \right) = 0$
Agenda

- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion
Empirical Risk Minimization (ERM)

- The focus of classical learning theory research
- Minimize the training error over $H$: $I_S[f_S] = \min_{f \in H} I_S[f]$
  - take the best hypothesis on the training data
  - exact and almost ERM
- For ERM: Generalization ⇔ Consistency
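To make the definition concrete, here is a minimal sketch of exact ERM over a finite hypothesis space (my own toy example, not from the slides): $H$ is a small set of threshold classifiers on $[0, 1]$, and the algorithm returns a hypothesis achieving $I_S[f_S] = \min_{f \in H} I_S[f]$ under the 0-1 loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite hypothesis space H: threshold classifiers f_t(x) = sign(x - t)
thresholds = np.linspace(0.0, 1.0, 101)

def predict(t, x):
    return np.where(x >= t, 1, -1)

def empirical_error(t, x, y):
    # I_S[f_t] under the 0-1 loss
    return np.mean(predict(t, x) != y)

# Toy training data: true threshold at 0.3, labels flipped with probability 0.1
n = 50
x = rng.uniform(0, 1, n)
y = np.where(x >= 0.3, 1, -1)
y[rng.uniform(size=n) < 0.1] *= -1

# ERM: return a hypothesis with minimal training error
training_errors = np.array([empirical_error(t, x, y) for t in thresholds])
f_S = thresholds[np.argmin(training_errors)]
print("ERM threshold:", f_S, "with I_S[f_S] =", training_errors.min())
```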
What algorithms are ERM?

- All of these belong to the class of ERM algorithms:
  - Least Squares Regression
  - Decision Trees
  - ANN Backpropagation (?)
  - ...
- Are all learning algorithms ERM? NO!
  - Support Vector Machines
  - k-Nearest Neighbour
  - Bagging, Boosting
  - Regularization
  - ...
Vapnik asked

What property must the hypothesis space H have to ensure good generalization of ERM?
Classical Results for ERM¹

- Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that $H$ is a uniform Glivenko-Cantelli (uGC) class:
  $$\forall \varepsilon > 0: \; \lim_{n \to \infty} \sup_{\mu} P_S\left( \sup_{f \in H} \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_X f(x) \, d\mu(x) \right| > \varepsilon \right) = 0$$
  - convergence of the empirical mean to the true expected value
  - uniform convergence in probability for the loss functions induced by $H$ and $V$

¹ e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997
VC-Dimension

- Binary functions $f: X \to \{0, 1\}$
- VC-dim(H) = size of the largest finite set in $X$ that can be shattered by $H$
  - e.g. linear separation in 2D yields VC-dim = 3
- Theorem: Let $H$ be a class of binary-valued hypotheses. Then $H$ is a uGC class if and only if VC-dim(H) is finite.¹

¹ Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997
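As a quick illustration of the 2D example, the sketch below (my own addition, not from the slides) shows that three non-collinear points are shattered by affine separators: for each of the $2^3$ labellings it solves $w \cdot x + b = y$ exactly, so the sign of the resulting affine function reproduces the labelling.

```python
import itertools
import numpy as np

# Three non-collinear points in the plane
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A = np.hstack([points, np.ones((3, 1))])      # rows are [x1, x2, 1]

for labels in itertools.product([-1.0, 1.0], repeat=3):
    y = np.array(labels)
    # Solve [x, 1] @ [w1, w2, b] = y exactly; sign(w . x + b) then equals y,
    # so this labelling is realized by a linear separator.
    wb = np.linalg.solve(A, y)
    assert np.all(np.sign(A @ wb) == y)

print("All 8 labellings of the 3 points are realized -> VC-dim of 2D linear separation >= 3")
```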
Achievements of Classical Learning Theory

- Complete characterization of the necessary and sufficient conditions for generalization and consistency of ERM
- Remaining questions:
  - What about non-ERM algorithms?
  - Can we establish criteria not only for the hypothesis space?
Agenda

- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion
Poggio et al. asked

What property must the learning map L have to ensure good generalization of general algorithms?
Can a new theory subsume the classical results for ERM?
Stability

- Small perturbations of the training set should not change the hypothesis much
  - especially deleting one training example: $S^i = S \setminus \{z_i\}$
- How can this be defined mathematically?

[Figure: the learning map applied to the original training set S and to the perturbed training set S^i, with both resulting hypotheses shown in the hypothesis space]
Uniform Stability¹

- A learning algorithm L is uniformly stable if
  $$\forall S \in Z^n, \; \forall i \in \{1, \ldots, n\}: \quad \sup_{z \in Z} \left| V(f_S, z) - V(f_{S^i}, z) \right| \le \frac{K}{n}$$
  - after deleting one training example, the change must be small at all points $z \in Z$
- Uniform stability implies generalization
- But the requirement is too strong
- Most algorithms (e.g. ERM) are not uniformly stable

¹ Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001
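To get a feel for the quantity in this definition, here is a sketch (my own illustration, assuming squared loss and one-dimensional regularized least squares, the kind of algorithm Bousquet and Elisseeff analyse) that estimates $\sup_z |V(f_S, z) - V(f_{S^i}, z)|$ over a grid of test points after deleting single training examples.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_ridge(x, y, lam=1.0):
    # Regularized least squares (ridge) with features [1, x]
    X = np.column_stack([np.ones_like(x), x])
    w = np.linalg.solve(X.T @ X + lam * len(x) * np.eye(2), X.T @ y)
    return lambda x_new: w[0] + w[1] * x_new

def sq_loss(f, x, y):
    return (f(x) - y) ** 2

n = 100
x = rng.uniform(-1, 1, n)
y = 2 * x + 0.3 * rng.normal(size=n)
f_S = fit_ridge(x, y)

# A grid of test points z = (x, y) standing in for "all z in Z"
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test

worst = 0.0
for i in range(n):
    f_Si = fit_ridge(np.delete(x, i), np.delete(y, i))   # train on S^i = S \ {z_i}
    gap = np.abs(sq_loss(f_S, x_test, y_test) - sq_loss(f_Si, x_test, y_test))
    worst = max(worst, gap.max())

print("estimated sup_z |V(f_S, z) - V(f_S^i, z)| =", worst)   # small for ridge
```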
CVloo Stability¹

- Cross-validation leave-one-out (CVloo) stability:
  $$\lim_{n \to \infty} \; \sup_{i \in \{1, \ldots, n\}} \left| V(f_{S^i}, z_i) - V(f_S, z_i) \right| = 0 \quad \text{in probability}$$
  - considers only the errors at the removed training points
  - strictly weaker than uniform stability

[Figure: remove z_i from the training set and measure the change of the error at x_i]

¹ Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
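The CVloo quantity can be estimated with the same leave-one-out loop, now looking only at the deleted point itself. A minimal sketch (again my own ridge-regression toy setup with squared loss, not taken from the slides); the gap should shrink as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_ridge(x, y, lam=1.0):
    # Regularized least squares (ridge) with features [1, x]
    X = np.column_stack([np.ones_like(x), x])
    w = np.linalg.solve(X.T @ X + lam * len(x) * np.eye(2), X.T @ y)
    return lambda x_new: w[0] + w[1] * x_new

def cvloo_gap(x, y):
    # sup_i |V(f_S^i, z_i) - V(f_S, z_i)| for the squared loss
    f_S = fit_ridge(x, y)
    gaps = []
    for i in range(len(x)):
        f_Si = fit_ridge(np.delete(x, i), np.delete(y, i))
        gaps.append(abs((f_Si(x[i]) - y[i]) ** 2 - (f_S(x[i]) - y[i]) ** 2))
    return max(gaps)

for n in (25, 100, 400):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + 0.3 * rng.normal(size=n)
    print("n =", n, " CVloo gap =", cvloo_gap(x, y))   # shrinks as n grows
```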
Equivalence for ERM¹

- Theorem: For "good" loss functions the following statements are equivalent for ERM:
  - L is distribution-independent CVloo stable
  - ERM generalizes and is universally consistent
  - H is a uGC class
- Question: Does CVloo stability ensure generalization for all learning algorithms?

¹ Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
CVloo Counterexample¹

- $X$ uniform on $[0, 1]$, $Y = \{-1, +1\}$
- Target: $f^*(x) = 1$
- Learning algorithm L:
  $$f_S(x) = \begin{cases} (-1)^n & \text{if } x \text{ is a training point} \\ (-1)^{n+1} & \text{otherwise} \end{cases}$$
- Deleting $z_i$ gives
  $$f_{S^i}(x) = \begin{cases} f_S(x) & \text{if } x = x_i \\ -f_S(x) & \text{otherwise} \end{cases}$$
- No change at the removed training point ⇒ CVloo stable
- But the algorithm does not generalize at all!

¹ Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
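The counterexample is easy to check numerically. The sketch below (my own verification of the construction above, assuming the 0-1 loss) confirms that the loss at the removed point never changes, while the gap between training error and expected error stays at 1.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(train_x, n, x):
    # f_S(x) = (-1)^n at training points, (-1)^(n+1) everywhere else
    return np.where(np.isin(x, train_x), (-1.0) ** n, (-1.0) ** (n + 1))

def zero_one(pred, y):
    return (pred != y).astype(float)

n = 10
x_train = rng.uniform(0, 1, n)
y_train = np.ones(n)                                  # target f*(x) = 1

# CVloo: the loss at the removed point before and after deleting z_i
cv_gaps = []
for i in range(n):
    x_loo = np.delete(x_train, i)
    before = zero_one(f(x_train, n, x_train[i:i + 1]), 1.0)
    after = zero_one(f(x_loo, n - 1, x_train[i:i + 1]), 1.0)
    cv_gaps.append(abs(after - before).item())
print("max CVloo gap:", max(cv_gaps))                 # 0 -> CVloo stable

# Generalization gap: training error vs. (Monte-Carlo) expected error
I_S = zero_one(f(x_train, n, x_train), y_train).mean()
I = zero_one(f(x_train, n, rng.uniform(0, 1, 100_000)), 1.0).mean()
print("|I - I_S| =", abs(I - I_S))                    # 1 -> no generalization at all
```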
Additional Stability Criteria

- Error (Eloo) stability:
  $$\lim_{n \to \infty} \; \sup_{i \in \{1, \ldots, n\}} \left| I[f_S] - I[f_{S^i}] \right| = 0 \quad \text{in probability}$$
- Empirical error (EEloo) stability:
  $$\lim_{n \to \infty} \; \sup_{i \in \{1, \ldots, n\}} \left| I_S[f_S] - I_{S^i}[f_{S^i}] \right| = 0 \quad \text{in probability}$$
- Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM)
- Not sufficient for generalization on their own
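These two quantities can be estimated in the same leave-one-out loop as before. A short sketch (same hedged ridge-regression toy setup as above; the expected error is replaced by a Monte-Carlo estimate):

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_ridge(x, y, lam=1.0):
    X = np.column_stack([np.ones_like(x), x])
    w = np.linalg.solve(X.T @ X + lam * len(x) * np.eye(2), X.T @ y)
    return lambda x_new: w[0] + w[1] * x_new

def mse(f, x, y):
    return np.mean((f(x) - y) ** 2)

n = 100
x = rng.uniform(-1, 1, n)
y = 2 * x + 0.3 * rng.normal(size=n)
x_mc = rng.uniform(-1, 1, 50_000)                     # Monte-Carlo sample for I[.]
y_mc = 2 * x_mc + 0.3 * rng.normal(size=50_000)

f_S = fit_ridge(x, y)
I_fS, IS_fS = mse(f_S, x_mc, y_mc), mse(f_S, x, y)

eloo_gap, eeloo_gap = 0.0, 0.0
for i in range(n):
    xi, yi = np.delete(x, i), np.delete(y, i)
    f_Si = fit_ridge(xi, yi)
    eloo_gap = max(eloo_gap, abs(I_fS - mse(f_Si, x_mc, y_mc)))   # |I[f_S] - I[f_S^i]|
    eeloo_gap = max(eeloo_gap, abs(IS_fS - mse(f_Si, xi, yi)))    # |I_S[f_S] - I_S^i[f_S^i]|

print("Eloo gap :", eloo_gap)
print("EEloo gap:", eeloo_gap)
```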
CVEEEloo Stability

- A learning map L is CVEEEloo stable if it is
  - CVloo stable
  - and Eloo stable
  - and EEloo stable
- Question: Does this imply generalization for all L?
CVEEEloo implies Generalization¹

- Theorem: If L is CVEEEloo stable and the loss function is bounded, then $f_S$ generalizes
- Remarks:
  - none of the conditions (CVloo, Eloo, EEloo) is sufficient by itself
  - Eloo and EEloo stability together are not sufficient
  - for ERM, CVloo stability alone is necessary and sufficient for generalization and consistency

¹ Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
Consistency

- CVEEEloo stability in general does NOT guarantee consistency
- Good generalization does NOT necessarily mean good prediction
  - but poor expected performance is indicated by poor training performance
CVEEEloo stable algorithms

- Support Vector Machines and Regularization
- k-Nearest Neighbour (k increasing with n)
- Bagging (number of regressors increasing with n)
- More results to come (e.g. AdaBoost)

- For some of these algorithms a "VC-style" analysis is impossible (e.g. k-NN)
- For all of these algorithms generalization is guaranteed by the theorems above!
Agenda

- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion
Implications

- Classical "VC-style" conditions → Occam's Razor: prefer simple hypotheses
- CVloo stability → incremental change
  - online algorithms
- Inverse problems: stability ⇔ well-posedness
  - condition numbers characterize stability
- Stability-based learning may have more direct connections with the brain's learning mechanisms
  - condition on the learning machinery
Language Learning

- Goal: learn grammars from sentences
- Hypothesis space: the class of all learnable grammars
- What is easier to characterize and gives more insight into real language learning:
  - the language learning algorithm,
  - or the class of all learnable grammars?
- A focus on algorithms shifts the focus to stability
Conclusion

- Stability implies generalization
  - intuitive (CVloo) and technical (Eloo, EEloo) criteria
  - the theory subsumes the classical ERM results
  - generalization criteria also for non-ERM algorithms
- Restrictions are placed on the learning map rather than on the hypothesis space
- A new approach for designing learning algorithms
Open Questions

- Easier / other necessary and sufficient conditions for generalization
- Conditions for general consistency
- Tight bounds on sample complexity
- Applications of the theory to new algorithms
- Stability proofs for existing algorithms

Thank you!
Sources

- T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature 428, pp. 419-422, 2004
- S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
- T. Mitchell: Machine Learning, McGraw-Hill, 1997
- C. Tomasi: Past performance and future results, Nature 428, p. 378, 2004
- N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997