National Research University Higher School of Economics
Faculty of Economics
Department of Statistics and Data Analysis

Course title: Data Mining, with a focus on Statistical Learning Theory
Instructor: Quentin Paris, Assistant Professor
Email: qparis@hse.ru
2015-2016

MOTIVATION AND GOAL

When we talk informally about learning, we usually refer to a process based on information (or experience of some sort) from which one may design an adequate solution, in a broad sense, when facing a new situation. Statistical Learning Theory (SLT) is a scientific field of study that aims to understand and formalize the concept of learning, based on data, for practical applications. This subject of research is very closely related to Artificial Intelligence and Computer Science. The main purpose of SLT is to produce and study learning algorithms that can be implemented on computers to imitate intelligent behavior. The field of research that focuses on the practical implementation of SLT algorithms on computers is called Machine Learning (ML). In this course, we focus on SLT and occasionally dive into ML considerations.

At the basis of any learning process comes information. Another formal and widely used term for information is data. Being an attempt to formalize the learning process, SLT is founded on the central idea that good learning algorithms should be designed from available data. In this context, one is usually interested in producing performance guarantees for given algorithms, which requires a preliminary modelling of the data. Here, by modelling the data we mean building a model of how the data has been produced. Hence, SLT also has a very strong connection to Statistics and Probability Theory.

In the context of SLT, the data points considered in order to build learning algorithms usually take the form of n pairs (x_1, y_1), ..., (x_n, y_n). The points x_1, ..., x_n are called feature points (or features) and the points y_1, ..., y_n are called labels. This is a simple way to formalize the general idea that our learning process is based on some number of observations (represented by the n feature points x_1, ..., x_n) for each of which we have a feeling, an evaluation or a feedback (represented by the labels y_1, ..., y_n). These data, often referred to as the learning sample (or the training sample), should be considered as a set of examples from which one can learn how to attribute a label y to any new unlabeled feature point x. In SLT, one is precisely interested in the process of finding the "right label y" for any new unlabeled feature point x in the case where the data points (x_1, y_1), ..., (x_n, y_n) and the pair (x, y) are supposed to share some kind of "similarity". The precise definition and study of the words "right label y" and "similarity" will be of great interest in this course. This is precisely where Statistics and Probability Theory come into play. However, the very intuitive and basic idea that we will try to formalize is that feature points that look alike should have comparable labels.

The focus of this course will be to understand the basic ideas of data modelling and to construct SLT classification algorithms whose performance will be studied in the context of Probability Theory.

CONTENT

PART I – BASICS

Chapter 1 – Basic tools from Probability theory and Statistics

In this first chapter, we will briefly review the basic notions of Probability theory needed for the course.
The notions of random variables, classical probability distributions, expectation and variance computation will be briefly recalled. Some attention will be devoted to the notion of independence and to the different modes of convergence for sequences of random variables. The important and general notion of conditional expectation will be introduced from a geometric point of view and related to previous and well-known formulas in the case of discrete and continuous random variables.

Chapter 2 – Data modelling and related issues

Different data modelling perspectives, relevant for the rest of the course, will be discussed in this chapter. Topics such as classical tests of normality, tests of independence and tests of a specific distribution will be studied. Additional topics such as mixture modelling as well as classical regression and classification models will be introduced. The main goal of this chapter is to provide insight into how the available information in different contexts (economics, biology, medical or computer sciences) leads to the choice of specific models.

PART II – TOPICS FROM STATISTICAL LEARNING THEORY

Chapter 3 – Introduction to supervised classification

In this chapter, we introduce the general problem of binary classification and motivate its study with classical examples. The notions of classification risk, optimal classifier (Bayes classifier) and data-based classifiers will be presented. We will define several criteria (consistency, rates of convergence) for evaluating the performance of data-based classifiers.

Chapter 4 – Linear classification

A great deal of attention will be devoted to linear classification methods. After providing several practical and theoretical motivations for considering such methods, we will present classical topics such as logistic and probit regression as well as linear discrimination. Finally, we will show how these specific models emerge as special cases of more general modelling procedures for the posterior probability function or the intra-class densities. Additional related topics, such as generalised linear models or neural networks, may also be presented.

Chapter 5 – Nearest neighbours, Kernel and Tree classifiers

After explaining the limitations of linear methods, this chapter introduces several non-linear methods such as nearest neighbour (NN), kernel and tree classifiers. We will explain how, for instance, NN and kernel classifiers are based on the common and natural principle of local averaging. We will present Stone's Theorem, which guarantees the universal consistency of certain NN and kernel classifiers.

Chapter 6 – Empirical risk minimization for classification

This chapter stands as an introduction to the idea of empirical risk minimization (ERM). This general principle will be put in perspective as a natural extension of the well-known least-squares method. To avoid complications, attention will be restricted to finite classes of candidate classifiers. The Hoeffding Inequality will be presented and shown to be a powerful tool for evaluating the performance of ERM classifiers in this context. Computational issues related to the practical computation of ERM classifiers will be considered and shown to lead naturally to convex methods such as boosting.

PART III – PRACTICAL CASE STUDIES

This third part will be devoted to the study of real data sets. We will implement in practice, using the R software, different methods studied earlier. Focus will be put mainly on NN, kernel and tree classifiers. A few illustrative sketches related to the chapters above are collected below.
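As a pointer to the geometric viewpoint on conditional expectation mentioned in Chapter 1, the following standard characterization may help fix ideas; it is stated here only as a sketch and is not taken from the lecture notes themselves.

    % For a square-integrable random variable Y, E[Y | X] is the orthogonal projection
    % of Y onto the space of square-integrable functions of X:
    \[
      \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^{2}\bigr]
      \;=\; \min_{h} \, \mathbb{E}\bigl[(Y - h(X))^{2}\bigr],
    \]
    % where the minimum runs over all measurable h with E[h(X)^2] finite; equivalently,
    % the error Y - E[Y | X] is orthogonal to every such h(X):
    \[
      \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])\, h(X)\bigr] \;=\; 0 .
    \]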
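As a small illustration of the kind of tests discussed in Chapter 2, the following R sketch runs a classical normality test and an independence test on simulated data; the data, variable names and sample sizes are purely illustrative assumptions, and only standard functions from the base stats package are used.

    # Minimal sketch of two classical tests in base R (stats package).
    set.seed(1)

    # Shapiro-Wilk test of normality on a simulated Gaussian sample.
    x <- rnorm(100, mean = 0, sd = 1)
    shapiro.test(x)          # typically a large p-value here: no evidence against normality

    # Chi-squared test of independence on a simulated contingency table.
    a <- sample(c("yes", "no"), 200, replace = TRUE)
    b <- sample(c("low", "high"), 200, replace = TRUE)
    chisq.test(table(a, b))  # tests independence of the two categorical variables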
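To fix the notation behind Chapter 3, the classification risk and the Bayes classifier are usually written as follows; this is the standard formulation of binary classification, given here for orientation only.

    % Binary classification: (X, Y) is a random pair with Y taking values in {0, 1},
    % and eta(x) = P(Y = 1 | X = x) denotes the posterior probability function.
    \[
      R(g) \;=\; \mathbb{P}\bigl(g(X) \neq Y\bigr),
      \qquad
      g^{*}(x) \;=\;
      \begin{cases}
        1 & \text{if } \eta(x) \geq 1/2, \\
        0 & \text{otherwise.}
      \end{cases}
    \]
    % The Bayes classifier g^* minimizes the risk R over all classifiers; a data-based
    % classifier is called consistent when its risk converges to R(g^*).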
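As a pointer to the linear methods of Chapter 4, here is a minimal R sketch of logistic and probit regression fitted with glm; the simulated data, the coefficient values and the threshold at 1/2 are illustrative assumptions, not material from the course notes.

    # Minimal sketch: logistic and probit regression with glm (stats package).
    set.seed(2)
    n <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    p <- plogis(0.5 + 1.5 * x1 - 1.0 * x2)      # posterior probability P(Y = 1 | X = x)
    y <- rbinom(n, size = 1, prob = p)

    fit_logit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
    fit_probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))

    # Classify a new feature point by thresholding the estimated posterior at 1/2.
    new_point <- data.frame(x1 = 0.2, x2 = -0.3)
    as.integer(predict(fit_logit, newdata = new_point, type = "response") >= 0.5)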
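To illustrate the local averaging principle behind the nearest neighbour and kernel classifiers of Chapter 5, the following hand-written R sketch classifies a query point by a (weighted) vote of nearby labels. It is a toy one-dimensional version with arbitrary choices of k and bandwidth, not the estimators studied in the course.

    # Toy one-dimensional local averaging classifiers (illustrative only).
    set.seed(3)
    x <- runif(200)                                       # features
    y <- rbinom(200, 1, prob = plogis(10 * (x - 0.5)))    # labels

    # k-nearest-neighbour rule: majority vote among the k closest feature points.
    knn_classify <- function(x0, x, y, k = 7) {
      idx <- order(abs(x - x0))[1:k]
      as.integer(mean(y[idx]) >= 0.5)
    }

    # Kernel rule: weighted vote with a Gaussian kernel of bandwidth h.
    kernel_classify <- function(x0, x, y, h = 0.1) {
      w <- dnorm((x - x0) / h)
      as.integer(sum(w * y) / sum(w) >= 0.5)
    }

    knn_classify(0.7, x, y)     # both rules average the labels of feature points
    kernel_classify(0.7, x, y)  # that look like the query point x0 = 0.7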
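For orientation on Chapter 6, the Hoeffding inequality and the finite-class bound for empirical risk minimization that it yields can be stated as follows; these are standard results, reproduced here only as a sketch.

    % Hoeffding's inequality: if Z_1, ..., Z_n are i.i.d. and take values in [0, 1], then
    \[
      \mathbb{P}\Bigl( \Bigl| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \Bigr| \geq t \Bigr)
      \;\leq\; 2 e^{-2 n t^{2}}, \qquad t > 0 .
    \]
    % Applying it to Z_i = 1{g(X_i) != Y_i} for each g in a finite class G, together with a
    % union bound, controls the ERM classifier \hat g_n = argmin_{g in G} \hat R_n(g):
    \[
      \mathbb{P}\Bigl( R(\hat g_{n}) - \min_{g \in \mathcal{G}} R(g) \geq 2t \Bigr)
      \;\leq\; 2 \, |\mathcal{G}| \, e^{-2 n t^{2}} .
    \]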
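Finally, as a taste of the Part III case studies, the sketch below uses the knn function from the class package (shipped with standard R distributions) on the built-in iris data set; the train/test split and the choice k = 5 are arbitrary and serve only to show the workflow.

    # Sketch of a Part III-style experiment: nearest-neighbour classification in R.
    library(class)                         # provides knn()

    set.seed(4)
    data(iris)
    test_idx <- sample(nrow(iris), 50)     # hold out 50 observations for testing
    train_x <- iris[-test_idx, 1:4]; train_y <- iris$Species[-test_idx]
    test_x  <- iris[ test_idx, 1:4]; test_y  <- iris$Species[ test_idx]

    pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
    mean(pred != test_y)                   # empirical test error of the 5-NN classifier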
SCHEDULE

The course consists of 36 hours (36 sessions of 40 minutes each). Sessions will alternate, depending on the pace of the lectures, between lectures (1 lecture = 2 hours) and seminars (1 seminar = 1 hour). In the lectures, new material will be presented and discussed. Seminar sessions will be devoted to exercises on topics studied in the lectures or to complementary comments. Formally, the course will be organised as follows.

Topics                                               Total hours   Lectures   Seminars   Self work (1=60 min)
Basic tools from Probability theory and Statistics        3            1          1               4
Data modelling and related issues                         6            2          2               8
Introduction to supervised classification                 6            2          2               8
Linear classification                                     6            2          2               8
Nearest neighbours, Kernel and Tree classifiers           6            2          2               8
Empirical risk minimization for classification            6            2          2               8
Programming                                               3            0          3               -

NB: 1h is understood here as 40 min; 1 lect. = 2h; 1 sem. = 1h.

GRADING CRITERIA

Students will be asked to attend the course regularly and to work on two to three homework projects in groups of two people at most. The grades obtained for the homework projects will be averaged to form the HM grade. At the end of the semester, students will take an exam. Denoting by E the grade obtained in the exam, the final grade F will be computed using the formula F = 0.5*HM + 0.5*E. For the exam, no lecture notes, phones, computers or other electronic devices will be allowed. A single sheet of paper with handwritten or typed notes will be allowed, as well as a calculator.

NOTES AND BIBLIOGRAPHY

As a basis for their work, students will be provided with a comprehensive set of lecture notes as well as exercise sheets. For further reading and complements, students are invited to consult the following bibliography.

Albert, J. and Rizzo, M. (2012). R by Example. Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.
Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Hastie, T., Tibshirani, R. and Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.
Ritz, C. and Streibig, J. C. (2008). Nonlinear Regression with R. Springer.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.