Machine Learning
Section 10: Model selection
Stefan Harmeling
10. November 2021
Overview
▸ Evaluating on test error / cross-validation
▸ Bayesian model selection
Evaluating on test error / cross-validation
Example: linear regression
Data:
▸ dataset D of scalar-valued input/output pairs (x1, y1), …, (xn, yn)
▸ the outputs are noisy
Model:
▸ basis function φd(x) = [1, x, x², …, x^d]ᵀ with hyperparameter d
▸ model the function as f(x, w) = φd(x)ᵀw, with parameter w ∈ R^(d+1)
Loss and fit:
▸ mean-squared error
MSE(D, w, d) = (1/∣D∣) ∑_(x,y)∈D (y − f(x, w))² = (1/∣D∣) ∑_(x,y)∈D (y − φd(x)ᵀw)²
▸ minimize the MSE (to get the maximum likelihood estimator):
wML(D, d) = arg min_w MSE(D, w, d)
Model selection / hyperparameter tuning
▸ What is the best choice for d?
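As a concrete sketch (not from the slides), this is how wML could be computed with NumPy for a given degree d; the helper names polynomial_features, fit_wml and mse are our own and are reused in the examples below.

```python
import numpy as np

def polynomial_features(x, d):
    """Design matrix whose rows are phi_d(x_i) = [1, x_i, x_i^2, ..., x_i^d]."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_wml(x, y, d):
    """Maximum likelihood estimate w_ML(D, d) = argmin_w MSE(D, w, d)."""
    Phi = polynomial_features(x, d)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def mse(x, y, w, d):
    """Mean-squared error of f(x, w) = phi_d(x)^T w on the pairs (x, y)."""
    return np.mean((y - polynomial_features(x, d) @ w) ** 2)
```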
Hyperparameter tuning: 1st attempt
Idea:
▸ choose d such that the MSE is minimal:
d1 = arg min_d MSE(D, wML(D, d), d)
Problem:
▸ this overfits the data
▸ the best fit will be a large d, large enough that every single noisy data point is hit
▸ however, what we actually want to minimize is the test error, which is the MSE on unseen data
▸ idea: split data into seen (training) and unseen (evaluation) data
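A quick illustration of the failure mode, reusing the helpers above on hypothetical toy data: the training MSE can only decrease as d grows, so d1 is pushed toward large degrees.

```python
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)                             # toy inputs (our choice)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(20)   # noisy outputs

# The training MSE is (weakly) decreasing in d, so arg min_d picks a large d.
for d in [1, 3, 5, 10, 15]:
    print(d, mse(x, y, fit_wml(x, y, d), d))
```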
Hyperparameter tuning: 2nd attempt
Idea:
▸ split dataset D into two disjoint sets Dtrain and Deval
▸ 1. train on Dtrain (choose parameter w)
2. evaluate on Deval (choose hyperparameter d)
▸ choose d such that the MSE on the evaluation set is minimal for
the parameter w chosen on the training set:
d2 = arg min_d MSE(Deval, wML(Dtrain, d), d)
Note:
▸ much better! works quite well!
▸ no overfitting while learning the parameter
Problems/challenges:
▸ how good is the parameter/hyperparameter choice on unseen data? i.e. what is the test error (the MSE on unseen data)?
▸ the estimator MSE(Deval, wML(Dtrain, d2), d2) of the test error is too optimistic, since d2 was chosen to minimize exactly this quantity (similar to underestimating the variance when using the sample mean)
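A sketch of this 2nd attempt on the toy data above; the split sizes are an arbitrary choice, and in practice one would shuffle before splitting.

```python
# Disjoint train/eval split (sizes are an arbitrary illustrative choice).
x_tr, y_tr = x[:12], y[:12]
x_ev, y_ev = x[12:], y[12:]

degrees = range(16)
d2 = min(degrees, key=lambda d: mse(x_ev, y_ev, fit_wml(x_tr, y_tr, d), d))
```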
Hyperparameter tuning: 3rd attempt
Idea:
▸ split dataset D into three disjoint sets Dtrain , Deval and Dtest
▸ 1. train on Dtrain (choose parameter w)
2. evaluate on Deval (choose hyperparameter d)
3. test on Dtest (calculate the MSE for unseen data)
▸ choose d such that the MSE on the evaluation set is minimal for
the parameter w chosen on the training set:
d3 = arg min_d MSE(Deval, wML(Dtrain, d), d)
▸ calculate the test error (the MSE on unseen data) on the test set:
MSE(Dtest, wML(Dtrain, d3), d3)
Note:
▸ this is the gold standard (if we have enough data)!
▸ no overfitting of the parameter or of the hyperparameter
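Continuing the toy sketch with a three-way split: the test set is touched exactly once, after d3 and the parameter are fixed.

```python
# Disjoint train/eval/test split (again arbitrary illustrative sizes).
x_tr, y_tr = x[:10], y[:10]
x_ev, y_ev = x[10:15], y[10:15]
x_te, y_te = x[15:], y[15:]

d3 = min(degrees, key=lambda d: mse(x_ev, y_ev, fit_wml(x_tr, y_tr, d), d))
w3 = fit_wml(x_tr, y_tr, d3)          # parameter chosen on the training set
test_error = mse(x_te, y_te, w3, d3)  # MSE on unseen data
```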
Hyperparameter tuning: 3rd attempt - more thoughts
Idea:
▸ split dataset D into three disjoint sets Dtrain , Deval and Dtest
▸ 1. train on Dtrain (choose parameter w)
2. evaluate on Deval (choose hyperparameter d)
3. test on Dtest (calculate the MSE for unseen data)
Problems
▸ we are not using all data for estimating the parameter
▸ we are not using all data for estimating the hyperparameter
Hack/Solution
▸ Cross-validation (see next slide)
▸ does improve the estimate
Cross-validation (2-fold)
Simpler setup w/o hyperparameter
▸ given data D
▸ model with parameter w
▸ but no hyperparameter d
Goal
▸ estimate the test error
Solution
▸ split D into two sets D1 and D2
1. train w1 on D1, estimate MSE ε1 on D2 for w1
2. train w2 on D2, estimate MSE ε2 on D1 for w2
▸ the cross-validation estimate of the test error is the average: (ε1 + ε2)/2
More general:
Algorithm 10.1 (k-fold cross-validation (CV))
Split your data D into k disjoint subsets of approximately equal size:
D = D1 ∪ … ∪ Dk
For i = 1 to k
1. Di plays the role of the test set
2. train parameter w on D ∖ Di (i.e. “D without Di ”, set minus)
3. calculate MSE εi on Di
The cross-validation estimate of the test error ε is the average
validation error:
εCV = (1/k) ∑_{i=1}^{k} εi
For k = n this is called leave-one-out cross validation (LOOCV).
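A minimal sketch of Algorithm 10.1 for the polynomial model above, with d fixed; building the folds from a random permutation is our implementation choice, not prescribed by the slides.

```python
def kfold_cv_mse(x, y, d, k, seed=0):
    """k-fold cross-validation estimate of the test error for fixed d."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]                       # D_i plays the role of the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # D \ D_i
        w = fit_wml(x[train], y[train], d)
        errors.append(mse(x[test], y[test], w, d))
    return np.mean(errors)                    # eps_CV = (1/k) * sum_i eps_i
```

Calling kfold_cv_mse(x, y, d, k=len(x)) gives LOOCV.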
Nested cross validation for the gold-standard
Setup
▸ given some data set D
▸ model with parameter w and hyper-parameter d.
Goal:
▸ estimate the test error
Outer k-fold cross-validation
1. split D into D1, …, Dk
2. for i in 1 … k:
2.1 split D into D ∖ Di and Di
2.2 train w and d on D ∖ Di, using an inner k′-fold cross-validation to estimate the MSE for each hyperparameter choice
2.3 estimate εi on Di using w and d
3. estimate the test error ε as the average of ε1, …, εk
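A sketch of this nested procedure built on kfold_cv_mse above; selecting d by the inner CV and then retraining on all of D ∖ Di is a common implementation choice, not prescribed by the slide.

```python
def nested_cv_test_error(x, y, degrees, k=5, k_inner=5):
    """Nested CV: the inner loop chooses d, the outer loop estimates the test error."""
    folds = np.array_split(np.random.default_rng(1).permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Inner k'-fold CV on D \ D_i selects the hyperparameter d.
        d = min(degrees,
                key=lambda d: kfold_cv_mse(x[train], y[train], d, k_inner))
        w = fit_wml(x[train], y[train], d)
        errors.append(mse(x[test], y[test], w, d))
    return np.mean(errors)
```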
Bayesian model selection
Bayesian inference
Model specification
p(w) = …            prior
p(D ∣ w) = …        likelihood
Inference given data
p(w ∣ D) = p(D ∣ w) p(w) / p(D)        posterior
p(D) = ∫ p(D ∣ w) p(w) dw              evidence
▸ The posterior
▸ combines prior and likelihood via product rule (Bayes rule)
▸ describes what the parameter might be after seeing data
▸ The evidence
▸ integrates out the parameter via sum rule
▸ is the expected likelihood under the prior
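Because the evidence is the expected likelihood under the prior, it can be approximated by simple Monte Carlo; this generic sketch (with user-supplied log_likelihood and prior_sample callables, our naming) is only an illustration, not a method from the slides.

```python
import numpy as np

def evidence_mc(log_likelihood, prior_sample, n=10_000):
    """Monte Carlo estimate of p(D) = E_{w ~ p(w)}[p(D | w)]."""
    ll = np.array([log_likelihood(prior_sample()) for _ in range(n)])
    m = ll.max()                       # log-sum-exp trick for numerical stability
    return np.exp(m) * np.mean(np.exp(ll - m))
```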
Bayesian inference with hyper-parameters
Model specification
p(θ) = …            hyper-prior (level 2)
p(w ∣ θ) = …        prior (level 1)
p(D ∣ w) = …        likelihood
Level 1: Inference of parameters
p(w ∣ D, θ) = p(D ∣ w) p(w ∣ θ) / p(D ∣ θ)        posterior
p(D ∣ θ) = ∫ p(D ∣ w) p(w ∣ θ) dw                 evidence
Level 2: Inference of hyper-parameters
p(θ ∣ D) = p(D ∣ θ) p(θ) / p(D)        hyper-posterior
p(D) = ∫ p(D ∣ θ) p(θ) dθ              hyper-evidence
Bayesian inference with hyper-parameters and models
Model specification
p(H) = …                  model-prior (level 3)
p(θ ∣ H) = …              hyper-prior (level 2)
p(w ∣ θ, H) = …           prior (level 1)
p(D ∣ w, H) = …           likelihood
Level 1, 2, 3: Inference of parameters, hyper-parameters and model given data
p(w ∣ D, θ, H) = p(D ∣ w, H) p(w ∣ θ, H) / p(D ∣ θ, H)        posterior
p(θ ∣ D, H) = p(D ∣ θ, H) p(θ ∣ H) / p(D ∣ H)                 hyper-posterior
p(H ∣ D) = p(D ∣ H) p(H) / p(D)                               model-posterior
Bayesian inference with hyper-parameters and models
Level 1: Inference of parameters
p(w ∣ D, θ, H) = p(D ∣ w, H) p(w ∣ θ, H) / p(D ∣ θ, H)        posterior
p(D ∣ θ, H) = ∫ p(D ∣ w, H) p(w ∣ θ, H) dw                    marginal likelihood
Level 2: Inference of hyper-parameters
p(θ ∣ D, H) = p(D ∣ θ, H) p(θ ∣ H) / p(D ∣ H)        hyper-posterior
p(D ∣ H) = ∫ p(D ∣ θ, H) p(θ ∣ H) dθ                 hyper-evidence
Level 3: Inference of model
p(H ∣ D) = p(D ∣ H) p(H) / p(D)        model-posterior
p(D) = ∑_H p(D ∣ H) p(H)
Model selection for regression
Instead of a model for p(D), we specify a model for p(y ∣ X) with the input locations collected in the matrix X and the targets in y.
Bayesian model selection for regression
Model specification
p(H) = …                   model-prior
p(θ ∣ H) = …               hyper-prior
p(w ∣ θ, H) = …            prior
p(y ∣ X, w, H) = …         likelihood (this is different)
Level 1, 2, 3: Inference of parameters, hyper-parameters and model given data
p(w ∣ y, X, θ, H) = p(y ∣ X, w, H) p(w ∣ θ, H) / p(y ∣ X, θ, H)        posterior
p(θ ∣ y, X, H) = p(y ∣ X, θ, H) p(θ ∣ H) / p(y ∣ X, H)                 hyper-posterior
p(H ∣ y, X) = p(y ∣ X, H) p(H) / p(y ∣ X)                              model-posterior
Bayesian model selection for Linear Regression
Level 1:
w ∼ N(0d, τ² Id)
y ∣ X, w ∼ N(Xw, σ² In)
We infer the posterior:
w ∣ X, y ∼ N(wn, Vn)        posterior
with posterior mean wn and posterior covariance Vn.
wn = (XᵀX + (σ²/τ²) Id)⁻¹ Xᵀy
Vn = ((1/σ²) XᵀX + (1/τ²) Id)⁻¹
Note that on this slide σ² and τ² are assumed to be known (constant).
Level 2: Let’s add random variables for the hyper-parameters!
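A minimal sketch of this level-1 computation in NumPy (the names wn, Vn follow the slide; blr_posterior is a hypothetical helper name):

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    """Posterior N(w_n, V_n) for Bayesian linear regression, sigma2/tau2 known."""
    d = X.shape[1]
    wn = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)
    Vn = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
    return wn, Vn
```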
Bayesian model selection for Linear Regression
Level 1:
θ ∼ ...
hyper-prior
w ∣ θ ∼ N (0d , τ Id )
2
y ∣ X , w, θ ∼ N (Xw, σ In )
2
prior
likelihood
where θ = (σ 2 , τ 2 , d) collects all hyperparameter.
We infer the posterior:
w ∣ X, y, θ ∼ N(wn, Vn)        posterior
with posterior mean wn and posterior covariance Vn, which both depend on θ.
Level 2: We need the evidence from level 1:
p(y ∣ X, θ) = ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw = N(…)
This is also called the marginal likelihood.
Bayesian inference for level 2
Level 2: Infer the hyper-posterior:
p(θ ∣ y, X) = p(y ∣ X, θ) p(θ) / p(y ∣ X)
This quickly gets too complicated! (the integrals become too difficult)
Instead we only maximize the marginal likelihood
p(y ∣ X, θ) = ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw = N(…)
with respect to the hyper-parameter θ.
▸ this is called “type II maximum likelihood”, “empirical Bayes”, “generalized maximum likelihood”, or “evidence approximation”
Marginal likelihood for linear regression
We can derive the form of the logarithm of the marginal likelihood:
log p(y ∣ X, θ) = log ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw
= log ∫ N(y ∣ Xw, σ² In) N(w ∣ 0d, τ² Id) dw
= log ∫ N(y ∣ 0n, σ² In + τ² XXᵀ) N(w ∣ …, …) dw
= log N(y ∣ 0n, σ² In + τ² XXᵀ) ∫ N(w ∣ …, …) dw
= log N(y ∣ 0n, σ² In + τ² XXᵀ)
= −(n/2) log(2π) − (1/2) log∣σ² In + τ² XXᵀ∣ − (1/2) yᵀ(σ² In + τ² XXᵀ)⁻¹y
Notes:
▸ N(w ∣ …, …) is a normalized density in w, so its integral is one and the last factor drops out.
▸ The third equality follows from some fancy rule for the Gaussians (see next slide).
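The closed form transcribes directly into NumPy; a sketch (for large n one would prefer a Cholesky factorization over the dense solve written here):

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma2, tau2):
    """log N(y | 0_n, sigma^2 I_n + tau^2 X X^T)."""
    n = len(y)
    A = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(A)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(A, y))
```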
Another product formula for Gaussians
copied from Bishop’s book Sec. 2.3.3, page 93
Given a marginal Gaussian for x and a conditional Gaussian for y given x,
x ∼ N(µ, Λ⁻¹)
y ∣ x ∼ N(Ax + b, L⁻¹)
we have the following marginal Gaussian for y and conditional Gaussian for x given y:
y ∼ N(Aµ + b, L⁻¹ + AΛ⁻¹Aᵀ)
x ∣ y ∼ N(Σ(AᵀL(y − b) + Λµ), Σ)
where Σ = (Λ + AᵀLA)⁻¹
Bayesian model selection for linear regression
Setting the hyperparameters for linear regression
The logarithm of the marginal likelihood
log p(y ∣ X, θ) = −(n/2) log(2π) − (1/2) log∣σ² In + τ² XXᵀ∣ − (1/2) yᵀ(σ² In + τ² XXᵀ)⁻¹y
is a function of all hyperparameters θ and can be maximized using gradient descent in the hyperparameters (or, for discrete hyperparameters like d, by trying different values).
Can we use Matrix Differential Calculus?
Derivative of log marginal likelihood
Let’s write A = σ² In + τ² XXᵀ. Then we get:
d log p(y ∣ X, θ) = d(−(n/2) log(2π) − (1/2) log det A − (1/2) yᵀA⁻¹y)
= −(1/2) d log det A − (1/2) yᵀ(dA⁻¹)y
= −(1/2) tr(A⁻¹ dA) + (1/2) yᵀA⁻¹(dA)A⁻¹y
= (1/2) tr((A⁻¹yyᵀA⁻¹ − A⁻¹) dA)
Replacing dA with d(σ² In + τ² XXᵀ) and continuing a bit, we get:
∂/∂σ² log p(y ∣ X, θ) = (1/2) tr(A⁻¹yyᵀA⁻¹ − A⁻¹)
∂/∂τ² log p(y ∣ X, θ) = (1/2) tr((A⁻¹yyᵀA⁻¹ − A⁻¹) XXᵀ)
Use these derivatives for gradient descent (see the Jupyter notebook, and the sketch below), or to visualize the log marginal likelihood.
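A sketch of the two derivatives and plain gradient steps in NumPy (ascent on the log marginal likelihood, i.e. descent on its negative; step size, iteration count, and the clipping to keep σ², τ² positive are our arbitrary illustrative choices; X and y are assumed given, e.g. X = polynomial_features(x, d)):

```python
import numpy as np

def lml_gradients(X, y, sigma2, tau2):
    """Partial derivatives of log p(y | X, theta) w.r.t. sigma^2 and tau^2."""
    n = len(y)
    A = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    Ainv = np.linalg.inv(A)
    B = Ainv @ np.outer(y, y) @ Ainv - Ainv   # A^{-1} y y^T A^{-1} - A^{-1}
    return 0.5 * np.trace(B), 0.5 * np.trace(B @ (X @ X.T))

# Gradient steps on the log marginal likelihood; X, y assumed given.
sigma2, tau2, lr = 1.0, 1.0, 1e-3
for _ in range(200):
    g_s, g_t = lml_gradients(X, y, sigma2, tau2)
    sigma2 = max(sigma2 + lr * g_s, 1e-8)   # keep the variances positive
    tau2 = max(tau2 + lr * g_t, 1e-8)
```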
End of Section 10
Literature on Bayesian model selection / linear regression:
▸ Chapter 3.5 in Bishop’s book “Pattern Recognition and Machine Learning”