T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory
Michael Pfeiffer, pfeiffer@igi.tugraz.at, 25.11.2004

Motivation
- Supervised learning: learn functional relationships from a finite set of labelled training examples.
- Generalization: how well does the learned function perform on unseen test examples?
- This is the central question in supervised learning.

What you will hear
- New idea: stability implies predictivity.
- A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much.
- Conditions for generalization are placed on the learning map rather than on the hypothesis space, in contrast to VC-analysis.

Agenda
- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion

Some Definitions 1/2
- Training data: $S = \{z_1 = (x_1, y_1), \dots, z_n = (x_n, y_n)\}$ with $z_i \in Z = X \times Y$.
- Hypothesis space: $H$.
- The examples are drawn from an unknown distribution $\mu$ on $Z$.
- Hypothesis: $f_S \in H$, $f_S : X \to Y$.
- Learning algorithm (learning map): $L : \bigcup_{n \ge 1} Z^n \to H$, $\; L(S) = L\big(z_1 = (x_1, y_1), \dots, z_n = (x_n, y_n)\big) = f_S$.
- Regression: $f_S$ is real-valued; classification: $f_S$ is binary.
- $L$ is assumed to be symmetric (the ordering of the training examples is irrelevant).

Some Definitions 2/2
- Loss function: $V : H \times Z \to \mathbb{R}$, written $V(f, z)$, e.g. $V(f, z) = (f(x) - y)^2$.
- Assume that $V$ is bounded.
- Empirical error (training error): $I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$.
- Expected error (true error): $I[f] = \int_Z V(f, z) \, d\mu(z)$.

Generalization and Consistency
- Convergence in probability: $X_n \to X$ in probability iff $\lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0$ for every $\varepsilon > 0$.
- Generalization: performance on the training examples must be a good indicator of performance on future examples, i.e. $\lim_{n \to \infty} |I[f_S] - I_S[f_S]| = 0$ in probability.
- Consistency: the expected error converges to the best attainable in $H$: for every distribution $\mu$ and every $\varepsilon > 0$, $\lim_{n \to \infty} P\big(I[f_S] > \inf_{f \in H} I[f] + \varepsilon\big) = 0$.

Agenda
- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion

Empirical Risk Minimization (ERM)
- The focus of classical learning theory research.
- Minimize the training error over $H$ (exact and almost ERM): take the best hypothesis on the training data, $I_S[f_S] = \min_{f \in H} I_S[f]$.
- For ERM: generalization and consistency are equivalent.

What algorithms are ERM?
- These belong to the class of ERM algorithms: least squares regression, decision trees, ANN backpropagation (?), ...
- Are all learning algorithms ERM? No! Support vector machines, k-nearest neighbour, bagging, boosting, regularization, ...

Vapnik asked
- What property must the hypothesis space $H$ have to ensure good generalization of ERM?

Classical Results for ERM [1]
- Theorem: a necessary and sufficient condition for generalization and consistency of ERM is that $H$ is a uniform Glivenko-Cantelli (uGC) class:
  $\forall \varepsilon > 0: \; \lim_{n \to \infty} \sup_{\mu} \, P_S\Big\{ \sup_{f \in H} \Big| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_X f(x) \, d\mu(x) \Big| > \varepsilon \Big\} = 0$
- That is: the empirical mean converges to the true expected value, uniformly over $H$.
- More precisely, uniform convergence in probability is required for the class of loss functions induced by $H$ and $V$.
[1] e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44(4), 1997

VC-Dimension
- Binary functions $f : X \to \{0, 1\}$.
- VC-dim$(H)$ = size of the largest finite set in $X$ that can be shattered by $H$.
- E.g. linear separation in 2D yields VC-dim = 3.
- Theorem: let $H$ be a class of binary-valued hypotheses; then $H$ is a uGC class if and only if VC-dim$(H)$ is finite [1].
[1] Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44(4), 1997

Achievements of Classical Learning Theory
- Complete characterization of the necessary and sufficient conditions for generalization and consistency of ERM.
- Remaining questions: What about non-ERM algorithms? Can we establish criteria not only for the hypothesis space?
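To make the definitions above concrete, here is a minimal numerical sketch (my own illustration in Python, not material from the paper or the talk): it runs exact ERM over a small finite hypothesis space of 1-D threshold classifiers and compares the empirical error $I_S[f_S]$ with a Monte-Carlo estimate of the expected error $I[f_S]$. The data distribution, the noise level, and all function names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): empirical risk minimization over a small
# finite hypothesis class, comparing the training error I_S[f_S] with a
# Monte-Carlo estimate of the expected error I[f_S].
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Assumed "true" labelling rule: +1 above 0.3, -1 below.
    return np.where(x > 0.3, 1, -1)

def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = target(x)
    flip = rng.random(n) < 0.1          # 10% label noise (assumption)
    y[flip] *= -1
    return x, y

# Finite hypothesis space H: threshold classifiers f_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 101)

def zero_one_loss(t, x, y):
    pred = np.where(x > t, 1, -1)
    return np.mean(pred != y)           # bounded loss V(f, z) in [0, 1]

def erm(x, y):
    # Pick the hypothesis with minimal empirical error I_S[f].
    errors = [zero_one_loss(t, x, y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

x_train, y_train = sample(50)
t_hat = erm(x_train, y_train)

x_test, y_test = sample(100_000)        # large sample approximates I[f_S]
emp = zero_one_loss(t_hat, x_train, y_train)
exp_ = zero_one_loss(t_hat, x_test, y_test)
print(f"ERM threshold: {t_hat:.2f}")
print(f"empirical error I_S[f_S] = {emp:.3f}, expected error I[f_S] ~ {exp_:.3f}")
print(f"generalization gap ~ {abs(exp_ - emp):.3f}")
```

Because this hypothesis space has VC-dimension 1, the classical theory already predicts that the gap printed at the end shrinks as the training set grows.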
Agenda
- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion

Poggio et al. asked
- What property must the learning map $L$ have for good generalization of general algorithms?
- Can a new theory subsume the classical results for ERM?

Stability
- Small perturbations of the training set should not change the hypothesis much.
- Especially: deleting one training example, $S^i = S \setminus \{z_i\}$.
- (Figure: the original training set $S$ and the perturbed training set $S^i$ are mapped by the learning map into nearby hypotheses in the hypothesis space.)
- How can this be mathematically defined?

Uniform Stability [1]
- A learning algorithm $L$ is uniformly stable if there is a constant $K$ such that
  $\forall S \in Z^n, \; \forall i \in \{1, \dots, n\}: \; \sup_{z \in Z} |V(f_S, z) - V(f_{S^i}, z)| \le \frac{K}{n}$
- After deleting one training sample, the change must be small at all points $z \in Z$.
- Uniform stability implies generalization.
- But the requirement is too strong: most algorithms (e.g. ERM) are not uniformly stable.
[1] Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2002

CVloo Stability [1]
- Cross-validation leave-one-out stability:
  $\lim_{n \to \infty} \sup_{i \in \{1, \dots, n\}} |V(f_{S^i}, z_i) - V(f_S, z_i)| = 0$ in probability
- Considers only the errors at the removed training points (remove $z_i$, look at the error at $x_i$).
- Strictly weaker than uniform stability.
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Equivalence for ERM [1]
- Theorem: for "good" loss functions, the following statements are equivalent for ERM:
  1. $L$ is distribution-independent CVloo stable.
  2. ERM generalizes and is universally consistent.
  3. $H$ is a uGC class.
- Question: does CVloo stability ensure generalization for all learning algorithms?
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

CVloo Counterexample [1]
- Let $X$ be uniform on $[0, 1]$, $Y = \{-1, +1\}$, target $f^*(x) = 1$ for all $x$.
- Learning algorithm $L$:
  $f_S(x) = \begin{cases} +1 & \text{if } x \text{ is a training point} \\ -1 & \text{otherwise} \end{cases}$
  $f_{S^i}(x) = \begin{cases} +1 & \text{if } x = x_i \\ f_S(x) & \text{otherwise} \end{cases}$
- No change of the loss at the removed training point, so the algorithm is CVloo stable.
- Yet the algorithm does not generalize at all: the empirical error is 0 while the expected error is maximal.
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Additional Stability Criteria
- Error (Eloo) stability:
  $\lim_{n \to \infty} \sup_{i \in \{1, \dots, n\}} |I[f_S] - I[f_{S^i}]| = 0$ in probability
- Empirical error (EEloo) stability:
  $\lim_{n \to \infty} \sup_{i \in \{1, \dots, n\}} |I_S[f_S] - I_{S^i}[f_{S^i}]| = 0$ in probability
- Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM).
- Not sufficient for generalization on their own.

CVEEEloo Stability
- The learning map $L$ is CVEEEloo stable if it is CVloo stable and Eloo stable and EEloo stable.
- Question: does this imply generalization for all $L$?
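The CVloo quantity can also be estimated numerically. The following minimal sketch (my own illustration, assuming square loss and two toy algorithms, ridge regression and a pure "memorizer"; none of the code or names come from the paper) retrains on each $S^i = S \setminus \{z_i\}$ and records $|V(f_{S^i}, z_i) - V(f_S, z_i)|$. A regularized algorithm is expected to show a small, shrinking term, the memorizer a large, persistent one.

```python
# Minimal sketch (my own illustration): empirically estimate the CVloo term
#   max_i |V(f_{S^i}, z_i) - V(f_S, z_i)|
# for ridge regression and for a memorization algorithm, with square loss.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 0.8 * x + 0.1 * rng.normal(size=n)       # assumed toy distribution
    return x, y

def ridge_fit(x, y, lam=1.0):
    # 1-D ridge regression; returns a prediction function.
    X = np.column_stack([x, np.ones_like(x)])
    w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    return lambda q: w[0] * q + w[1]

def memorizer_fit(x, y):
    # Returns y_i exactly at a training point, 0 everywhere else.
    table = {float(xi): float(yi) for xi, yi in zip(x, y)}
    return lambda q: table.get(float(q), 0.0)

def cvloo_term(fit, x, y):
    f_S = fit(x, y)
    diffs = []
    for i in range(len(x)):
        x_i, y_i = np.delete(x, i), np.delete(y, i)   # perturbed set S^i
        f_Si = fit(x_i, y_i)
        v_S  = (f_S(x[i])  - y[i]) ** 2               # V(f_S,   z_i)
        v_Si = (f_Si(x[i]) - y[i]) ** 2               # V(f_S^i, z_i)
        diffs.append(abs(v_Si - v_S))
    return max(diffs)

for n in (20, 80, 320):
    x, y = make_data(n)
    print(f"n={n:4d}  ridge CVloo term: {cvloo_term(ridge_fit, x, y):.4f}  "
          f"memorizer CVloo term: {cvloo_term(memorizer_fit, x, y):.4f}")
```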
CVEEEloo implies Generalization [1]
- Theorem: if $L$ is CVEEEloo stable and the loss function is bounded, then $f_S$ generalizes.
- Remarks:
  - None of the three conditions (CVloo, Eloo, EEloo) is sufficient by itself.
  - Eloo and EEloo stability together are not sufficient.
  - For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency.
[1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Consistency
- CVEEEloo stability in general does NOT guarantee consistency.
- Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance.

CVEEEloo Stable Algorithms
- Support vector machines and regularization.
- k-nearest neighbour (with k increasing with n).
- Bagging (with the number of regressors increasing with n).
- More results to come (e.g. AdaBoost).
- For some of these algorithms a "VC-style" analysis is impossible (e.g. k-NN).
- For all of these algorithms generalization is guaranteed by the theorems shown.

Agenda
- Introduction
- Problem Definition
- Classical Results
- Stability Criteria
- Conclusion

Implications
- Classical "VC-style" conditions correspond to CVloo stability (for ERM).
- Incremental change: relevant for online algorithms.
- Inverse problems: stability corresponds to well-posedness.
- Occam's razor: prefer simple hypotheses; condition numbers characterize stability.
- Stability-based learning may have more direct connections with the brain's learning mechanisms, since the condition is placed on the learning machinery.

Language Learning
- Goal: learn grammars from sentences.
- Hypothesis space: the class of all learnable grammars.
- What is easier to characterize and gives more insight into real language learning: the language learning algorithm, or the class of all learnable grammars?
- Focusing on algorithms shifts the focus to stability.

Conclusion
- Stability implies generalization: an intuitive criterion (CVloo) plus technical ones (Eloo, EEloo).
- The theory subsumes the classical ERM results.
- It gives generalization criteria also for non-ERM algorithms.
- Restrictions are placed on the learning map rather than on the hypothesis space.
- A new approach for designing learning algorithms.

Open Questions
- Easier or other necessary and sufficient conditions for generalization.
- Conditions for general consistency.
- Tight bounds on sample complexity.
- Applications of the theory to new algorithms.
- Stability proofs for existing algorithms.

Thank you!

Sources
- T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature 428, pp. 419-422, 2004
- S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
- T. Mitchell: Machine Learning, McGraw-Hill, 1997
- C. Tomasi: Past performance and future results, Nature 428, p. 378, 2004
- N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive dimensions, uniform convergence, and learnability, Journal of the ACM 44(4), 1997
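As a closing illustration of the three leave-one-out quantities behind CVEEEloo stability, here is a minimal sketch (my own, assuming a k-nearest-neighbour regressor with $k \approx \sqrt{n}$, square loss, and Monte-Carlo estimates of the expected errors; all names are my own choices, not taken from the sources above).

```python
# Minimal sketch (my own illustration): numerically estimate the CVloo, Eloo and
# EEloo terms for a k-NN regressor whose k grows with n, plus the resulting
# generalization gap |I[f_S] - I_S[f_S]|.
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)   # assumed toy target
    return x, y

def knn_predict(x_train, y_train, x_query, k):
    # Average of the k nearest training targets for each query point.
    d = np.abs(x_query[:, None] - x_train[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

def sq_err(x_tr, y_tr, x_ev, y_ev, k):
    # Pointwise square loss V(f, z) of the k-NN fit on (x_tr, y_tr).
    return (knn_predict(x_tr, y_tr, x_ev, k) - y_ev) ** 2

n = 100
k = int(round(np.sqrt(n)))                          # k grows with n
x, y = make_data(n)
x_big, y_big = make_data(5_000)                     # held-out sample for I[.]

I_S = sq_err(x, y, x, y, k).mean()                  # empirical error I_S[f_S]
I   = sq_err(x, y, x_big, y_big, k).mean()          # expected error I[f_S] (MC)

cv, e, ee = [], [], []
for i in range(n):
    x_i, y_i = np.delete(x, i), np.delete(y, i)     # perturbed set S^i
    cv.append(abs(sq_err(x_i, y_i, x[i:i+1], y[i:i+1], k)[0]
                  - sq_err(x, y, x[i:i+1], y[i:i+1], k)[0]))
    e.append(abs(I - sq_err(x_i, y_i, x_big, y_big, k).mean()))
    ee.append(abs(I_S - sq_err(x_i, y_i, x_i, y_i, k).mean()))

print(f"n={n}, k={k}")
print(f"CVloo term {max(cv):.4f} | Eloo term {max(e):.4f} | EEloo term {max(ee):.4f}")
print(f"generalization gap |I[f_S] - I_S[f_S]| = {abs(I - I_S):.4f}")
```

Rerunning the sketch with larger n should shrink all three leave-one-out terms together with the generalization gap, which is the qualitative behaviour the CVEEEloo theorem describes.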