Machine Learning
Section 10: Model selection
Stefan Harmeling
10. November 2021

Overview
▸ Evaluating on test error / cross-validation
▸ Bayesian model selection

Evaluating on test error / cross-validation

Example: linear regression
Data:
▸ dataset D of scalar-valued input/output pairs (x_1, y_1), …, (x_n, y_n)
▸ the outputs are noisy
Model:
▸ basis function φ_d(x) = [1, x, x², …, x^d]ᵀ with hyperparameter d
▸ model the function as f(x, w) = φ_d(x)ᵀ w, with parameter w ∈ R^{d+1}
Loss and fit:
▸ mean-squared error:
  MSE(D, w, d) = (1/∣D∣) ∑_{(x,y)∈D} (y − f(x, w))² = (1/∣D∣) ∑_{(x,y)∈D} (y − φ_d(x)ᵀ w)²
▸ minimize the MSE (to get the maximum likelihood estimator):
  w_ML(D, d) = arg min_w MSE(D, w, d)
Model selection / hyperparameter tuning:
▸ What is the best choice for d?

Hyperparameter tuning: 1st attempt
Idea:
▸ choose d such that the MSE is minimal:
  d_1 = arg min_d MSE(D, w_ML(D, d), d)
Problem:
▸ this overfits the data
▸ the best fit will be a large d, such that every single noisy data point is hit
▸ however, we actually want to minimize the test error, i.e. the MSE on unseen data
▸ idea: split the data into seen (training) and unseen (evaluation) data

Hyperparameter tuning: 2nd attempt
Idea:
▸ split dataset D into two disjoint sets D_train and D_eval
▸ 1. train on D_train (choose parameter w)
  2. evaluate on D_eval (choose hyperparameter d)
▸ choose d such that the MSE on the evaluation set is minimal for the parameter w chosen on the training set:
  d_2 = arg min_d MSE(D_eval, w_ML(D_train, d), d)
Note:
▸ much better! works quite well!
▸ no overfitting while learning the parameter
Problems/challenges:
▸ how good is the parameter/hyperparameter choice on unseen data? I.e., what is the test error (the MSE on unseen data)?
▸ the estimator MSE(D_eval, w_ML(D_train, d_2), d_2) of the test error is too optimistic (this is similar to underestimating the variance with a sample mean)

Hyperparameter tuning: 3rd attempt
Idea:
▸ split dataset D into three disjoint sets D_train, D_eval and D_test
▸ 1. train on D_train (choose parameter w)
  2. evaluate on D_eval (choose hyperparameter d)
  3. test on D_test (calculate the MSE for unseen data)
▸ choose d such that the MSE on the evaluation set is minimal for the parameter w chosen on the training set:
  d_3 = arg min_d MSE(D_eval, w_ML(D_train, d), d)
▸ calculate the test error (the MSE on unseen data) on the test set:
  MSE(D_test, w_ML(D_train, d_3), d_3)
Note:
▸ this is the gold standard (if we have enough data)!
▸ no overfitting of the parameter or of the hyperparameter

Hyperparameter tuning: 3rd attempt - more thoughts
Idea:
▸ split dataset D into three disjoint sets D_train, D_eval and D_test
▸ 1. train on D_train (choose parameter w)
  2. evaluate on D_eval (choose hyperparameter d)
  3. test on D_test (calculate the MSE for unseen data)
Problems:
▸ we are not using all data for estimating the parameter
▸ we are not using all data for estimating the hyperparameter
Hack/solution:
▸ cross-validation (see next slide)
▸ does improve the estimate

Cross-validation (2-fold)
Simpler setup w/o hyperparameter:
▸ given data D
▸ model with parameter w
▸ but no hyperparameter d
Goal:
▸ estimate the test error
Solution:
▸ split D into two sets D_1 and D_2
  1. train w_1 on D_1, estimate MSE ε_1 on D_2 for w_1
  2. train w_2 on D_2, estimate MSE ε_2 on D_1 for w_2
▸ the cross-validation estimate of the test error is the average: (ε_1 + ε_2)/2
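The 2-fold scheme can be sketched in a few lines of Python. This is an illustrative example, not code from the lecture: the polynomial model, the toy data, and all function names are assumptions.

```python
import numpy as np

def fit_poly(x, y, d):
    """Least-squares fit of phi_d(x)^T w, i.e. w_ML for a degree-d polynomial."""
    Phi = np.vander(x, d + 1)                     # basis expansion phi_d(x)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def mse(x, y, w, d):
    """Mean-squared error of the fitted polynomial on the pairs (x, y)."""
    return np.mean((y - np.vander(x, d + 1) @ w) ** 2)

def two_fold_cv(x, y, d):
    """Average of eps_1 and eps_2: train on one half, evaluate on the other."""
    n = len(x) // 2
    (x1, y1), (x2, y2) = (x[:n], y[:n]), (x[n:], y[n:])
    eps1 = mse(x2, y2, fit_poly(x1, y1, d), d)    # w_1 from D_1, error on D_2
    eps2 = mse(x1, y1, fit_poly(x2, y2, d), d)    # w_2 from D_2, error on D_1
    return (eps1 + eps2) / 2

# noisy toy data (an assumption for the demo)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(40)
cv_error = two_fold_cv(x, y, d=3)
```

Comparing `two_fold_cv(x, y, d)` across several values of `d` gives a test-error estimate per hyperparameter choice without setting aside a separate evaluation set.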
More general:

Algorithm 10.1 (k-fold cross-validation (CV))
Split your data D into k disjoint subsets of approximately equal size: D = D_1 ∪ … ∪ D_k
For i = 1 to k:
  1. D_i plays the role of the test set
  2. train parameter w on D ∖ D_i (i.e. "D without D_i", set minus)
  3. calculate the MSE ε_i on D_i
The cross-validation estimate of the test error ε is the average validation error:
  ε_CV = (1/k) ∑_{i=1}^k ε_i
For k = n this is called leave-one-out cross-validation (LOOCV).

Nested cross-validation for the gold standard
Setup:
▸ given some dataset D
▸ model with parameter w and hyper-parameter d
Goal:
▸ estimate the test error
Outer k-fold cross-validation:
1. split D into D_1, …, D_k
2. for i in 1…k:
   2.1 split D into D ∖ D_i and D_i
   2.2 train w and d on D ∖ D_i; use an inner k′-fold cross-validation to estimate the MSE for the hyperparameter choice
   2.3 estimate ε_i on D_i using w and d
3. estimate the test error ε as the average of ε_1, …, ε_k

Bayesian model selection

Bayesian inference
Model specification:
  p(w) = . . .              prior
  p(D ∣ w) = . . .          likelihood
Inference given data:
  p(w ∣ D) = p(D ∣ w) p(w) / p(D)          posterior
  p(D) = ∫ p(D ∣ w) p(w) dw                evidence
▸ The posterior
  ▸ combines prior and likelihood via the product rule (Bayes' rule)
  ▸ describes what the parameter might be after seeing data
▸ The evidence
  ▸ integrates out the parameter via the sum rule
  ▸ is the expected likelihood under the prior

Bayesian inference with hyper-parameters
Model specification:
  p(θ) = . . .              hyper-prior (level 2)
  p(w ∣ θ) = . . .          prior (level 1)
  p(D ∣ w) = . . .
likelihood

Level 1: Inference of parameters
  p(w ∣ D, θ) = p(D ∣ w) p(w ∣ θ) / p(D ∣ θ)       posterior
  p(D ∣ θ) = ∫ p(D ∣ w) p(w ∣ θ) dw                evidence
Level 2: Inference of hyper-parameters
  p(θ ∣ D) = p(D ∣ θ) p(θ) / p(D)                  hyper-posterior
  p(D) = ∫ p(D ∣ θ) p(θ) dθ                        hyper-evidence

Bayesian inference with hyper-parameters and models
Model specification:
  p(H) = . . .              model-prior (level 3)
  p(θ ∣ H) = . . .          hyper-prior (level 2)
  p(w ∣ θ, H) = . . .       prior (level 1)
  p(D ∣ w, H) = . . .       likelihood
Levels 1, 2, 3: Inference of parameters, hyper-parameters and model given data:
  p(w ∣ D, θ, H) = p(D ∣ w, H) p(w ∣ θ, H) / p(D ∣ θ, H)     posterior
  p(θ ∣ D, H) = p(D ∣ θ, H) p(θ ∣ H) / p(D ∣ H)              hyper-posterior
  p(H ∣ D) = p(D ∣ H) p(H) / p(D)                            model-posterior

Bayesian inference with hyper-parameters and models
Level 1: Inference of parameters
  p(w ∣ D, θ, H) = p(D ∣ w, H) p(w ∣ θ, H) / p(D ∣ θ, H)     posterior
  p(D ∣ θ, H) = ∫ p(D ∣ w, H) p(w ∣ θ, H) dw                 marginal likelihood
Level 2: Inference of hyper-parameters
  p(θ ∣ D, H) = p(D ∣ θ, H) p(θ ∣ H) / p(D ∣ H)              hyper-posterior
  p(D ∣ H) = ∫ p(D ∣ θ, H) p(θ ∣ H) dθ
Level 3: Inference of model
  p(H ∣ D) = p(D ∣ H) p(H) / p(D)                            model-posterior
  p(D) = ∑_H p(D ∣ H) p(H)

Model selection for regression
Instead of a model for p(D) we specify a model for p(y ∣ X) with the matrix of input locations X and the targets y.

Bayesian model selection for regression
Model specification:
  p(H) = . . .              model-prior
  p(θ ∣ H) = . . .          hyper-prior
  p(w ∣ θ, H) = . . .       prior
  p(y ∣ X, w, H) = . . .    likelihood (this is different)
Levels 1, 2, 3: Inference of parameters, hyper-parameters and model given data:
  p(w ∣ y, X, θ, H) = p(y ∣ X, w, H) p(w ∣ θ, H) / p(y ∣ X, θ, H)     posterior
  p(θ ∣ y, X, H) = p(y ∣ X, θ, H) p(θ ∣ H) / p(y ∣ X, H)              hyper-posterior
  p(H ∣ y, X) = p(y ∣ X, H) p(H) / p(y ∣ X)                           model-posterior
Bayesian model selection for linear regression
Level 1:
  w ∼ N(0_d, τ² I_d)
  y ∣ X, w ∼ N(Xw, σ² I_n)
We infer the posterior:
  w ∣ X, y ∼ N(w_n, V_n)        posterior
with posterior mean w_n and posterior covariance V_n:
  V_n = ((1/σ²) Xᵀ X + (1/τ²) I_d)⁻¹
  w_n = (Xᵀ X + (σ²/τ²) I_d)⁻¹ Xᵀ y
Note that on this slide σ² and τ² are assumed to be known (constant).
Level 2: Let's add random variables for the hyper-parameters!

Bayesian model selection for linear regression
Level 1:
  θ ∼ . . .                         hyper-prior
  w ∣ θ ∼ N(0_d, τ² I_d)            prior
  y ∣ X, w, θ ∼ N(Xw, σ² I_n)       likelihood
where θ = (σ², τ², d) collects all hyperparameters.
We infer the posterior:
  w ∣ X, y, θ ∼ N(w_n, V_n)         posterior
with posterior mean w_n and posterior covariance V_n, which do depend on θ.
Level 2: We need the evidence from level 1:
  p(y ∣ X, θ) = ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw = N(. . .)
This is also called the marginal likelihood.

Bayesian inference for level 2
Level 2: Infer the hyper-posterior:
  p(θ ∣ y, X) = p(y ∣ X, θ) p(θ) / p(y ∣ X)
This quickly gets too complicated! (the integrals become too difficult)
Instead we only maximize the marginal likelihood
  p(y ∣ X, θ) = ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw = N(. . .)
with respect to the hyper-parameters θ.
▸ this is called "type II maximum likelihood", "empirical Bayes", "generalized maximum likelihood" and "evidence approximation"

Marginal likelihood for linear regression
We can derive the form of the logarithm of the marginal likelihood:
  log p(y ∣ X, θ) = log ∫ p(y ∣ X, w, θ) p(w ∣ θ) dw
                  = log ∫ N(y ∣ Xw, σ² I_n) N(w ∣ 0_d, τ² I_d) dw
                  = log ∫ N(y ∣ 0_n, σ² I_n + τ² X Xᵀ) N(w ∣ . . . , . . .) dw
                  = log N(y ∣ 0_n, σ² I_n + τ² X Xᵀ) ∫ N(w ∣ . . . , . . .)
dw
                  = log N(y ∣ 0_n, σ² I_n + τ² X Xᵀ)
                  = −(n/2) log(2π) − (1/2) log∣σ² I_n + τ² X Xᵀ∣ − (1/2) yᵀ (σ² I_n + τ² X Xᵀ)⁻¹ y
Notes:
▸ N(w ∣ . . . , . . .) does not depend on y and integrates to 1.
▸ The third equality follows from a fancy rule for Gaussians (see next slide).

Another product formula for Gaussians
(copied from Bishop's book, Sec. 2.3.3, page 93)
Given a marginal Gaussian for x and a conditional Gaussian for y given x,
  x ∼ N(µ, Λ⁻¹)
  y ∣ x ∼ N(Ax + b, L⁻¹)
we have the following marginal Gaussian for y and conditional Gaussian for x given y:
  y ∼ N(Aµ + b, L⁻¹ + A Λ⁻¹ Aᵀ)
  x ∣ y ∼ N(Σ (Aᵀ L (y − b) + Λ µ), Σ)
where Σ = (Λ + Aᵀ L A)⁻¹.

Bayesian model selection for linear regression
Setting the hyperparameters for linear regression: the logarithm of the marginal likelihood
  log p(y ∣ X, θ) = −(n/2) log(2π) − (1/2) log∣σ² I_n + τ² X Xᵀ∣ − (1/2) yᵀ (σ² I_n + τ² X Xᵀ)⁻¹ y
is a function of all hyperparameters θ and can be maximized using gradient descent in the hyperparameters (or, for discrete hyperparameters like d, by trying different values). Can we use matrix differential calculus?

Derivative of the log marginal likelihood
Let's write A = σ² I_n + τ² X Xᵀ. Then we get:
  d log p(y ∣ X, θ) = d(−(n/2) log(2π) − (1/2) log det A − (1/2) yᵀ A⁻¹ y)
                    = −(1/2) d log det A − (1/2) yᵀ (dA⁻¹) y
                    = −(1/2) tr(A⁻¹ dA) + (1/2) yᵀ A⁻¹ (dA) A⁻¹ y
                    = (1/2) tr((A⁻¹ y yᵀ A⁻¹ − A⁻¹) dA)
Replacing dA with d(σ² I_n + τ² X Xᵀ) and continuing a bit, we get:
  ∂ log p(y ∣ X, θ) / ∂σ² = (1/2) tr(A⁻¹ y yᵀ A⁻¹ − A⁻¹)
  ∂ log p(y ∣ X, θ) / ∂τ² = (1/2) tr((A⁻¹ y yᵀ A⁻¹ − A⁻¹) X Xᵀ)
Use these derivatives for gradient descent (see the Jupyter notebook) (or visualize the log marginal likelihood).

End of Section 10
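The derivatives of the log marginal likelihood can be plugged into a simple gradient-based optimizer. The sketch below is illustrative, not the lecture's notebook: the data are made up, and the ascent is done in log σ² and log τ² (a reparameterization assumed here to keep both variances positive).

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma2, tau2):
    """log p(y | X, theta) = log N(y | 0_n, sigma^2 I_n + tau^2 X X^T)."""
    n = len(y)
    A = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(A)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(A, y))

def gradients(X, y, sigma2, tau2):
    """The derivatives w.r.t. sigma^2 and tau^2 from the derivation above."""
    n = len(y)
    A = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    Ainv = np.linalg.inv(A)
    Ainv_y = Ainv @ y
    M = np.outer(Ainv_y, Ainv_y) - Ainv           # A^{-1} y y^T A^{-1} - A^{-1}
    return 0.5 * np.trace(M), 0.5 * np.trace(M @ (X @ X.T))

def empirical_bayes(X, y, lr=1e-2, steps=500):
    """Type II maximum likelihood: gradient ascent in (log sigma^2, log tau^2)."""
    u, v = 0.0, 0.0                               # u = log sigma^2, v = log tau^2
    for _ in range(steps):
        s2, t2 = np.exp(u), np.exp(v)
        g_s, g_t = gradients(X, y, s2, t2)
        u += lr * s2 * g_s                        # chain rule: d/d(log s2) = s2 * d/d(s2)
        v += lr * t2 * g_t
    return np.exp(u), np.exp(v)

# made-up linear data y = X w + noise (all values are assumptions for the demo)
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(50)
sigma2_hat, tau2_hat = empirical_bayes(X, y)
```

A finite-difference check of `gradients` against `log_marginal_likelihood` is a quick way to validate the derivation; a discrete hyperparameter such as the degree d would still be selected by comparing the maximized marginal likelihoods of the candidate values.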
Literature on Bayesian model selection / linear regression:
▸ Chapter 3.5 in Bishop's book "Pattern Recognition and Machine Learning"