Introduction to
Machine Learning & Deep Learning
Lecture 1: Introduction
Nicolas Schreuder
CNRS
L2 IA & Sciences des Organisations
Université Paris Dauphine
1
Practical organisation of the course
• 13 classes of 3 hours (~one class per week) organised as
• 1h30 of theory;
• 1h30 of practice/lab (Python).
• Contact: Teams or by email schreuder.nicolas@gmail.com.
• Evaluation: TBD (project? written exam? both?)
2
Plan for this semester
• Goal: present a general overview of the main concepts and
methods in machine learning.
• Two parts:
• “Traditional” machine learning: linear regression,
regularisation, Generalised Linear Models, decision trees,
random forests, SVMs…
• Deep learning: fully-connected NN, convolutional NN,
attention models…
3
General objectives
At the end of the course you should:
• Understand the main concepts in statistical learning theory;
• Be able to implement classical machine learning algorithms;
• Know when machine learning can be useful for a given problem.
Q: What are your expectations for this course?
4
Plan for today
• Supervised learning paradigm;
• Regression and classification settings;
• Concept of generalisation, bias-variance trade-off;
• A first learning algorithm: k-nearest neighbours;
• Model selection: train-test split, cross-validation.
5
Two learning paradigms
Are the data labelled?
Credits: @Ciaraioch
6
Supervised Learning
7
Supervised Learning
• Training dataset: a collection of (input, outcome) pairs;
• Inductive learning: learning from examples (as opposed to deductive learning);
• Goal: predict the outcomes associated with new/future inputs;
• Two main settings: classification and regression.
8
Classification
- Input: images;
- Output: a digit.
9
Classification
- Input: images;
- Output: muffin or chihuahua.
10
Classification: more examples
• A person arrives at the emergency room with a set of symptoms that could possibly be
attributed to one of three medical conditions. Which of the three conditions does the
individual have?
• An online banking service must be able to determine whether or not a transaction being
performed on the site is fraudulent, on the basis of the user’s IP address, past transaction
history, and so forth.
• On the basis of DNA sequence data for a number of patients with and without a given
disease, a biologist would like to figure out which DNA mutations are deleterious
(disease-causing) and which are not.
11
Regression
- Input: house characteristics;
- Output: house price.
12
Formal setting
Data/observations: (X1, Y1), …, (Xn, Yn) ∈ 𝒳 × 𝒴.
Two main settings:
• Regression: 𝒴 ⊂ ℝ;
• Classification: 𝒴 = {0,1}.
General goal: using the data, find a function f̂ : 𝒳 → 𝒴 such that f̂(Xnew) ≈ Ynew for a new observation Xnew.
13
The two stages of supervised learning
1. Training stage: the learning algorithm is fed the training dataset (examples) and outputs a prediction function;
2. Prediction stage: the prediction function maps new (unlabelled) inputs to predicted outputs.
14
Formal definition
The features that we observe belong to the feature space 𝒳 (e.g., vector space of characteristics, space of images, text, DNA sequences).
The outcome belongs to the outcome set 𝒴 (e.g., categories, house prices).
A learning algorithm 𝒜 is a mapping from the space of datasets to a space of prediction functions ℱ ⊂ {f : 𝒳 → 𝒴}:
𝒜 : ∪n≥1 (𝒳 × 𝒴)n → ℱ.
17
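To make the definition concrete, here is a minimal sketch (not from the slides) of a learning algorithm as a map from a dataset to a prediction function — the deliberately trivial "mean predictor", which ignores the inputs and always returns the average training label:

```python
# A learning algorithm maps a dataset to a prediction function.
# Illustrative sketch: a (hypothetical) "mean predictor" algorithm.

def mean_predictor_algorithm(dataset):
    """A : union over n of (X x Y)^n -> F; here F contains constant functions."""
    labels = [y for (_, y) in dataset]
    mean = sum(labels) / len(labels)

    def f_hat(x_new):
        # The returned prediction function ignores its input.
        return mean

    return f_hat

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
f_hat = mean_predictor_algorithm(data)
print(f_hat(10.0))  # 4.0, the average of the training labels
```

Any real algorithm (kNN, linear regression, a neural network) has the same shape: it eats a dataset and returns a function f̂.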
Generalisation
In machine learning, we are primarily interested in making predictions, i.e., for a new observation Xnew, we want to correctly predict its label.
Let’s build our intuition with an example.
18
19
20
21
22
23
24
Bias-variance trade-off
• Bias: how well does the model fit the training data?
• Variance: how sensitive is the model to the training data?
• We want a model that fits the training data well, but not too
closely!
25
High bias, low variance
26
Low bias, high variance
27
Medium bias, medium variance
28
Bias-variance trade-off
• High bias: the model is too simple (under-fitting);
• High variance: the model is too flexible (over-fitting).
How can we measure the bias and the variance of a model?
29
Train and test error
• Train error: how well does the model fit already seen data?
• Test error: how well does the algorithm generalise to new data?
30
Train and test error
Credits: The Elements of Statistical Learning (Hastie et al.)
31
Train and test error
A prediction algorithm could have:
• a low train error and a high test error (over-fitting);
• a high train error and a high test error (under-fitting);
The key challenge is to find the sweet spot between under-fitting
and over-fitting!
32
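The train/test gap can be seen numerically with a small sketch (my own illustration, not from the slides): fit polynomials of increasing degree to noisy data and compare train and test mean squared errors. Higher degrees always drive the train error down, while the test error eventually goes back up.

```python
import numpy as np

# Illustrative sketch: polynomial regression on noisy sine data.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)
x_test = np.linspace(0, 1, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 50)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

for degree in (1, 5, 10):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
```

Degree 1 under-fits (both errors high); large degrees over-fit (train error keeps shrinking while test error grows).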
Model selection: train-test split
Given a collection of models, which one should we pick?
• Choose the model which has the smallest train error? No!
• Choose the model which has the smallest test error? Yes, but how?
Train the models on the training data, evaluate performance on testing data.
33
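A minimal train-test split can be sketched as follows (assumption: NumPy arrays and a random shuffle; in the labs you would typically use scikit-learn's `train_test_split` instead):

```python
import numpy as np

# Sketch of a random train-test split (hypothetical helper).
def train_test_split(X, y, test_ratio=0.25, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle before splitting
    n_test = int(len(X) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling matters: if the data are ordered (e.g., by date or by class), a naive contiguous split gives train and test sets with different distributions.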
K-nearest neighbours
algorithm
37
k-nearest neighbours algorithm
We are given n data points (X1, Y1), …, (Xn, Yn) and we are asked to
predict a label for a new observation Xnew.
Starting point: neighbouring points should have similar labels.
→ for the label of Xnew, predict the average of the labels of its k
nearest neighbours, where k ∈ ℕ is a parameter to select.
38
k-nearest neighbours algorithm
Credits: scikit-learn
39
k-nearest neighbours algorithm
• When k = 1, kNN predicts the label of the nearest neighbour;
• When k = n, kNN constantly predicts the average output label
over the whole dataset.
40
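Both extremes can be checked with a minimal kNN sketch (assumptions: Euclidean distance, regression by averaging; the labs use scikit-learn's implementation):

```python
import numpy as np

# Minimal k-nearest-neighbours regressor (illustrative sketch).
def knn_predict(X_train, y_train, x_new, k):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distances to x_new
    nearest = np.argsort(dists)[:k]                  # indices of k closest points
    return y_train[nearest].mean()                   # average their labels

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 10.0, 20.0, 30.0])

# k = 1: copies the label of the single nearest training point.
print(knn_predict(X, y, np.array([0.9]), k=1))    # 10.0
# k = n: the global average, whatever the input.
print(knn_predict(X, y, np.array([0.9]), k=4))    # 15.0
print(knn_predict(X, y, np.array([100.0]), k=4))  # 15.0
```

For classification, the same idea applies with a majority vote over the k neighbours instead of an average.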
Decision frontier (training data)
41
Decision frontier (testing data)
42
Cross-validation
How many neighbours should we pick?
We could compare the performance of different models characterised by
different values for the hyperparameters (number of neighbours).
Problem: we might overfit the testing data.
We need to use a more subtle approach: cross-validation.
43
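K-fold cross-validation can be sketched in a few lines (my own minimal version, assuming NumPy arrays; in the labs you would use scikit-learn's `cross_val_score` or `GridSearchCV` instead):

```python
import numpy as np

# Split the n indices into n_folds roughly equal random folds.
def k_fold_indices(n, n_folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, n_folds)

# Average validation MSE over the folds, for a model given as a
# (fit, predict) pair of functions (hypothetical interface).
def cross_val_mse(fit, predict, X, y, n_folds=5):
    errors = []
    for fold in k_fold_indices(len(X), n_folds):
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False                    # held-out validation fold
        model = fit(X[mask], y[mask])         # train on the other folds
        pred = predict(model, X[fold])
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))

# Toy baseline model: always predict the training mean.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20, dtype=float)
score = cross_val_mse(lambda X, y: y.mean(),
                      lambda m, X: np.full(len(X), m), X, y)
print(round(score, 2))
```

To select a hyperparameter (e.g., the number of neighbours k), compute this cross-validated score for each candidate value and keep the best one; the final evaluation still happens on a separate test set.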
Cross-validation
Credits: scikit-learn
44
Take-home message
• Never use the same data to train/optimize a model and evaluate its
performance (to avoid data leakage).
• Instead use k-fold cross-validation to train/optimize the models on train
and validation sets, then evaluate the performance of the chosen model
on a separate test set.
45
Some useful references
47
Some useful references (DL)
48
Inspirations/extra references
• Machine Learning Preparatory Week @ PSL;
• MOOC scikit-learn.
49
Data matrix 𝕏 and target vector y
Data matrix 𝕏 = (x1ᵀ; x2ᵀ; …; xnᵀ) ∈ ℝ^(n×d), with xi ∈ ℝ^d the feature vector of the i-th observation.
Target vector y = (y1, y2, …, yn)ᵀ ∈ ℝ^n, where yi is the target of the i-th observation in 𝒴 (e.g., 𝒴 = {0,1} for binary classification, 𝒴 = ℝ for regression).
50
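In NumPy (the convention used by scikit-learn), this layout is simply a 2-D array with one row per observation; a small sketch with made-up numbers:

```python
import numpy as np

# Data matrix X (n x d): the n feature vectors stacked as rows.
n, d = 4, 3
X = np.arange(n * d, dtype=float).reshape(n, d)  # row i is x_i^T
y = np.array([0.0, 1.0, 0.0, 1.0])               # one target per observation

print(X.shape, y.shape)  # (4, 3) (4,)
print(X[2])              # feature vector of the 3rd observation
```

Keeping observations as rows and features as columns is the layout that scikit-learn estimators expect in `fit(X, y)`.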