National Research University
Higher School of Economics
Faculty of Economics
Department of Statistics and Data Analysis
Course title:
Data Mining
With a focus on Statistical Learning Theory
Instructor: Quentin Paris, Assistant Professor
Email: qparis@hse.ru
2015-2016
MOTIVATION AND GOAL
When we talk informally about learning, we usually refer to a process based
on information (or experience of some sort) from which one may design an
adequate solution, in a broad sense, when facing a new situation. Statistical
Learning Theory (SLT) is a scientific field that aims to understand and
formalize the concept of learning from data, with practical applications in
mind. This subject of research is very closely related to Artificial
Intelligence and Computer Science. The main purpose of SLT is to produce and
study learning algorithms that can be implemented on computers to imitate
intelligent behavior. The field of research that focuses on the practical
implementation of such algorithms is called Machine Learning (ML). In this
course, we focus on SLT and occasionally dive into ML considerations.
At the basis of any learning process comes information. Another formal and
widely used term for information is data. Being an attempt to formalize the
learning process, SLT is founded on the central idea that good learning
algorithms should be designed from available data. In this context, one is
usually interested in producing performance guarantees for given algorithms,
which requires a preliminary modelling of the data. Here, by modelling the
data we mean building a model of how the data has been produced. Hence, SLT
also has a very strong connection to Statistics and Probability Theory.
In the context of SLT, the data points considered, in order to build learning
algorithms, usually take the form of n pairs (x_{1},y_{1}),…,(x_{n},y_{n}).
The points x_{1},…,x_{n} are called feature points (or features) and the
points y_{1},…,y_{n} are called labels. This is a simple way to formalize the
general idea that our learning process is based on some number of
observations (represented by the n feature points x_{1},…,x_{n}) for each of
which we have a feeling, an evaluation or some feedback (represented by the labels
y_{1},…,y_{n}). This data, often referred to as the learning sample (or the
training sample), should be considered as a set of examples from which one
can learn how to attribute a label y to any new unlabeled feature point x. In
SLT, one is precisely interested in the process of finding the "right label y"
for any new unlabeled feature point x in the case where the data points
(x_{1},y_{1}),…,(x_{n},y_{n}) and the pair (x,y) are supposed to share some
kind of "similarity". Making the notions of "right label y" and "similarity"
precise will be of great interest in this course. This is precisely where
Statistics and Probability Theory come into play. However, the very intuitive
and basic idea that we will try to formalize is that feature points that look
alike should have comparable labels.
The focus of this course will be to understand the basic ideas of data
modelling and to construct SLT classification algorithms whose performance
will be studied in the context of Probability theory.
CONTENT
PART I - BASICS
Chapter 1 – Basic tools from Probability theory and Statistics
In this first chapter, we will briefly review the basic notions of Probability
theory needed for the course. The notions of random variables, classical
probability distributions, expectation and variance computation will be
recalled. Some attention will be devoted to the notion of independence and
to the different modes of convergence for sequences of random variables.
The important and general notion of conditional expectation will be
introduced from a geometric point of view and related to the well-known
formulas for discrete and continuous random variables.
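For instance, the geometric point of view mentioned above can be summarized,
for a square-integrable random variable Y, by the characterization (stated
here only as a preview of the lectures)
E[(Y − E[Y|X])^{2}] = min_{g} E[(Y − g(X))^{2}],
where the minimum is taken over all measurable functions g such that g(X) is
square-integrable: the conditional expectation E[Y|X] is the orthogonal
projection of Y onto the space of square-integrable functions of X.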
Chapter 2 – Data modelling and related issues
Different data modelling perspectives, relevant for the rest of the course, will
be discussed in this chapter. Topics such as classical tests of normality, tests
of independence and tests of fit to a specific distribution will be studied.
Additional topics such as mixture modelling as well as classical regression and
classification models will be introduced. The main goal of this chapter is to
provide insight into how the available information in different contexts
(economics, biology, medicine or computer science) leads to the choice of
specific models.
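As a brief illustration of the kind of procedures mentioned above, the
following R sketch applies some classical tests to small simulated samples
(the samples x, y, a and b are purely illustrative, not the data sets used in
class):
    # Hypothetical samples; in practice these would come from the data under study.
    set.seed(1)
    x <- rnorm(100)                                     # numeric sample
    y <- rexp(100)                                      # second numeric sample
    a <- sample(c("yes", "no"), 100, replace = TRUE)    # categorical variable
    b <- sample(c("low", "high"), 100, replace = TRUE)  # categorical variable
    shapiro.test(x)            # Shapiro-Wilk test of normality
    ks.test(y, "pexp")         # Kolmogorov-Smirnov test of fit to an exponential law
    chisq.test(table(a, b))    # chi-squared test of independence between a and b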
PART II – TOPICS FROM STATISTICAL LEARNING THEORY
Chapter 3 – Introduction to supervised classification
In this chapter, we introduce the general problem of binary classification and
motivate its study through classical examples. The notions of
classification risk, optimal classifier (Bayes classifier) and data-based classifiers
will be presented. We will define several criteria (consistency, rates of
convergence) for evaluating the performance of data-based classifiers.
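As a preview of the notation used in the lectures, in the standard binary
setting with labels in {0,1}, the classification risk of a classifier g is
R(g) = P(g(X) ≠ Y), and, writing η(x) = P(Y = 1 | X = x), the Bayes classifier
is g^{*}(x) = 1 if η(x) ≥ 1/2 and 0 otherwise; it satisfies
R(g^{*}) ≤ R(g) for every classifier g.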
Chapter 4 – Linear classification
A great deal of attention will be devoted to linear classification methods.
After providing several practical and theoretical motivations for considering
such methods, we will present classical topics such as logistic and probit
regression as well as linear discriminant analysis. Finally, we will show how
these specific models emerge as special cases of more general procedures for
modelling the posterior probability function or the intra-class densities.
Additional related topics such as generalised linear models or neural networks
may also be presented, time permitting.
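As a minimal sketch of one of these models, the following R code fits a
logistic regression on a small simulated data set (the data and variable
names are illustrative only, not part of the course material):
    # Hypothetical training sample with a binary label y and two features x1, x2.
    set.seed(2)
    train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    train$y <- rbinom(200, 1, plogis(1 + 2 * train$x1 - train$x2))
    # Logistic regression: models P(Y = 1 | X = x) through the logistic link.
    fit <- glm(y ~ x1 + x2, family = binomial, data = train)
    # Estimated posterior probabilities for two new feature points.
    new_points <- data.frame(x1 = c(0, 1), x2 = c(0, -1))
    predict(fit, newdata = new_points, type = "response")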
Chapter 5 – Nearest neighbours, Kernel and Tree classifiers
After explaining the limitations of linear methods, this chapter introduces
several non-linear methods such as nearest neighbours (NN), kernel and tree
classifiers. We will explain how, for instance, NN and kernel classifiers are
based on the common and natural principle of local averaging. We will
present Stone's Theorem, which guarantees the universal consistency of
certain NN and kernel classifiers.
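As a small preview of the local averaging idea, here is a sketch of a nearest
neighbour classifier in R, using the knn function from the class package on
the built-in iris data restricted to two species (the data set and the choice
k = 5 are illustrative only):
    library(class)  # provides the knn classifier
    # Binary problem: keep two species of the built-in iris data set.
    data <- iris[iris$Species != "setosa", ]
    data$Species <- droplevels(data$Species)
    # Split into a training sample and a test sample.
    set.seed(3)
    idx   <- sample(nrow(data), 70)
    train <- data[idx, 1:4]
    test  <- data[-idx, 1:4]
    cl    <- data$Species[idx]
    # Each test point is labelled by a majority vote among its 5 nearest training points.
    pred <- knn(train, test, cl, k = 5)
    mean(pred == data$Species[-idx])   # empirical accuracy on the test sample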
Chapter 6 – Empirical risk minimization for classification
This chapter stands as an introduction to the interesting idea of empirical risk
minimization (ERM). This general principle will be put in perspective as a
natural extension of the well-known least-squares method. To avoid
complications, attention will be restricted to finite classes of candidate
classifiers. The Hoeffding Inequality will be presented and shown to be a
powerful tool in this context for evaluating the performance of ERM
classifiers. Computational issues related to the practical computation of ERM
classifiers will be considered and shown to lead naturally to convex methods
such as boosting.
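To give a flavour of the type of guarantee obtained in this setting, for a
finite class G of candidate classifiers and an ERM classifier ĝ_{n} selected
from G using n data points, the Hoeffding Inequality combined with a union
bound yields the following standard result: for every δ in (0,1), with
probability at least 1 − δ,
R(ĝ_{n}) ≤ min_{g ∈ G} R(g) + 2 √( log(2|G|/δ) / (2n) ),
where R denotes the classification risk introduced in Chapter 3.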
PART III – PRACTICAL CASE STUDIES
This third part will be devoted to the study of real data sets. We will
implement in practice, using the R software, the different methods studied
earlier. The focus will mainly be on NN, kernel and tree classifiers.
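As an indication of the kind of session planned, here is a sketch in R
fitting a classification tree with the rpart package on the built-in iris
data (the actual data sets studied in class may differ):
    library(rpart)  # recursive partitioning for classification trees
    # Split the built-in iris data into a training sample and a test sample.
    set.seed(4)
    idx   <- sample(nrow(iris), 100)
    train <- iris[idx, ]
    test  <- iris[-idx, ]
    # Classification tree predicting the species from the four measurements.
    fit  <- rpart(Species ~ ., data = train, method = "class")
    pred <- predict(fit, newdata = test, type = "class")
    mean(pred == test$Species)   # proportion of correctly classified test points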
SCHEDULE
The course consists of 36 academic hours (one academic hour corresponding to
40 minutes). Sessions will alternate (depending on the pace of the lectures)
between lectures (1 lecture = 2 hours) and seminars (1 seminar = 1 hour). In
the lectures, new material will be presented and discussed. Seminar sessions
will be devoted to exercises on topics studied in the lectures or to
complementary comments. Formally, the course will be organised as follows.
Topics                                              | Total hours | Lectures | Seminars | Self work (1 = 60 min)
Basic tools from Probability theory and Statistics  | 3           | 1        | 1        | 4
Data modelling and related issues                   | 6           | 2        | 2        | 8
Introduction to supervised classification           | 6           | 2        | 2        | 8
Linear classification                               | 6           | 2        | 2        | 8
Nearest neighbours, Kernel and Tree classifiers     | 6           | 2        | 2        | 8
Empirical risk minimization for classification      | 6           | 2        | 2        | 8
Programming                                         | 3           | 0        | 3        | 4
NB: 1 h is understood here as 40 min; 1 lecture = 2 h; 1 seminar = 1 h.
GRADING CRITERIA
Students will be asked to attend the course regularly and to work on two to
three homework projects in groups of at most two people. The grades obtained
for the homework projects will be averaged to form the HM grade. At the end
of the semester, students will take an exam. Denoting by E the grade obtained
in the exam, the final grade F will be computed using the formula:
F=0.5*HM+0.5*E.
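For instance, with purely hypothetical grades HM = 7 and E = 8, the formula
gives F = 0.5*7 + 0.5*8 = 7.5.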
For the exam, no lecture notes, phones, computers or other electronic devices
will be allowed. A single sheet of paper with handwritten or typed notes will
be allowed, as well as a calculator.
NOTES AND BIBLIOGRAPHY
As a basis for their work, students will be provided with a comprehensive set
of lecture notes as well as exercise sheets. For further reading and
complements, students are invited to consult the following bibliography.
Albert, J. and Rizzo, M. (2012). R by Example. Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning.
Springer.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning and Games.
Cambridge University Press.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of
Pattern Recognition. Springer.
Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free
Theory of Nonparametric Regression. Springer.
Hastie, T., Tibshirani, R. and Friedman, J. (2008). The Elements of
Statistical Learning: Data Mining, Inference and Prediction. Springer.
Ritz, C. and Streibig, J. C. (2008). Nonlinear Regression with R. Springer.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical
Inference. Springer.