Linear Regression
Libin Jiao, Dalian University of Technology
November 7, 2024

Outline
1 Problem formulation
2 Parameter Estimation
3 Bayesian Linear Regression
4 Maximum Likelihood as Orthogonal Projection

9.1 Problem formulation

Given a set of training inputs x_n and corresponding noisy observations y_n = f(x_n) + ε, where ε is an i.i.d. random variable that describes measurement/observation noise.

[Figure: (a) Dataset. (b) A possible solution.]

The task is to infer the function f that generated the data and that generalizes well to function values at new input locations.

Regression is a basic problem in ML:
- Prediction and control
- Classification
- Time-series analysis
- Deep learning

Finding a regression function involves:
- Choice of the model (type) and its parametrization.
- Finding (estimating) good parameters.
- Over-fitting and model selection.
- The relationship between loss functions and parameter priors.
- Uncertainty modeling.

Finding optimal model parameters:
- Maximum likelihood estimation
- Maximum a posteriori (MAP) estimation
- Generalization errors and over-fitting
- Bayesian linear regression
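To make the setup concrete, the following sketch generates such a dataset; the function f and the noise level are borrowed from Example 9.5 later in these notes, and the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed (assumption)

def f(x):
    # "True" function, unknown to the learner; taken from Example 9.5.
    return -np.sin(x / 5.0) + np.cos(x)

N, sigma = 10, 0.2                          # dataset size and noise level as in Example 9.5
x = rng.uniform(-5.0, 5.0, size=N)          # training inputs x_n
y = f(x) + sigma * rng.standard_normal(N)   # noisy observations y_n = f(x_n) + eps
```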
The functional relationship between x and y is given as

    y = f(x) + ε,

where ε ~ N(0, σ²) is independent, identically distributed (i.i.d.) Gaussian measurement noise with mean 0 and variance σ². The likelihood function is

    p(y | x) = N(y | f(x), σ²).

We focus on parametric models, i.e., we choose a parameterized function and find parameters θ that "work well" for modeling the data.

Parametric models: f(x) is a parameterized function with parameters θ and can be denoted by f(x; θ).

Linear regression: the parameters θ appear linearly in the model,

    f(x; θ) = x⊺θ   or   f(x; θ) = φ⊺(x)θ,

where φ is a nonlinear transformation of the inputs (features).
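A minimal sketch of a model that is linear in the parameters but nonlinear in the input through a feature map; the polynomial feature map anticipates Example 9.3, and the parameter values are purely illustrative.

```python
import numpy as np

def phi(x, K):
    """Polynomial feature map phi: R -> R^K with phi(x) = (1, x, ..., x^{K-1})."""
    return x ** np.arange(K)

theta = np.array([0.5, -1.0, 0.1, 0.02])    # example parameters (illustrative)

def f(x, theta):
    # Linear in the parameters theta, nonlinear in the input x via phi.
    return phi(x, K=theta.size) @ theta

print(f(1.5, theta))                        # model prediction at x = 1.5
```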
Example 9.1: [Figure: (a) Example functions. (b) Training set. (c) Maximum likelihood fit.]

9.2 Parameter Estimation

Model that is linear in θ and in x:

    p(y | x, θ) = N(y | x⊺θ, σ²)  ⇔  y = x⊺θ + ε,  ε ~ N(0, σ²),

where we assume for now that σ² is known.

Given a training set D := {(x_1, y_1), ..., (x_N, y_N)} with x_n ∈ R^D and y_n ∈ R (n = 1, ..., N), the likelihood factorizes as

    p(Y | X, θ) = p(y_1, ..., y_N | x_1, ..., x_N, θ) = ∏_{n=1}^N p(y_n | x_n, θ) = ∏_{n=1}^N N(y_n | x_n⊺θ, σ²).

[Figure: Graphical model.]
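The factorized likelihood is most conveniently evaluated in log form; a minimal numpy sketch (the function name, array shapes, and the data in the example call are my own illustrative choices).

```python
import numpy as np

def log_likelihood(theta, X, y, sigma2):
    """log p(Y | X, theta) = sum_n log N(y_n | x_n^T theta, sigma^2)."""
    resid = y - X @ theta                      # residuals y_n - x_n^T theta
    N = y.size
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

# Illustrative call: three points in R^2 and a candidate theta.
X = np.array([[1.0, 0.5], [2.0, -1.0], [0.0, 1.5]])
y = np.array([0.9, 2.1, -0.4])
print(log_likelihood(np.array([1.0, -0.2]), X, y, sigma2=0.25))
```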
Maximum Likelihood Estimation:

    θ_ML = argmax_θ p(Y | X, θ)                                   (maximize the likelihood)
         = argmin_θ {−log p(Y | X, θ)}                            (minimize the negative log-likelihood)
         = argmin_θ {−log ∏_{n=1}^N p(y_n | x_n, θ)}
         = argmin_θ {−∑_{n=1}^N log N(y_n | x_n⊺θ, σ²)}           using N(y_n | x_n⊺θ, σ²) = (2πσ²)^{−1/2} exp(−(y_n − x_n⊺θ)²/(2σ²))
         = argmin_θ {(1/(2σ²)) ∑_{n=1}^N (y_n − x_n⊺θ)² + const.}
         =: argmin_θ L(θ),   where L is the negative log-likelihood.

Negative log-likelihood:

    L(θ) := (1/(2σ²)) ∑_{n=1}^N (y_n − x_n⊺θ)² = (1/(2σ²)) (y − Xθ)⊺(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖₂².

Maximum likelihood estimator (independent of σ²):

    θ_ML = (X⊺X)⁻¹X⊺y,

with design matrix X := [x_1, ..., x_N]⊺ ∈ R^{N×D} and y := [y_1, ..., y_N]⊺ ∈ R^N.

Example 9.2 (Fitting Lines): [Figure: (a) Example functions. (b) Training set. (c) Maximum likelihood fit.]
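A minimal sketch of computing θ_ML; solving the least-squares problem is numerically preferable to forming (X⊺X)⁻¹ explicitly, and the data below is illustrative only.

```python
import numpy as np

def theta_ml(X, y):
    """Maximum likelihood estimator theta_ML = (X^T X)^{-1} X^T y,
    computed via a least-squares solve instead of an explicit inverse."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

X = np.array([[1.0, 0.5], [2.0, -1.0], [0.0, 1.5], [1.5, 2.0]])   # illustrative design matrix
y = np.array([0.9, 2.1, -0.4, 1.3])
print(theta_ml(X, y))
```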
Exercise
- Derive the MLE under the Laplacian noise distribution ε ~ L(0, b).
- Derive the MLE under the uniform noise distribution ε ~ U[−a, a].

Maximum Likelihood Estimation with Features:

    p(y | x, θ) = N(y | φ⊺(x)θ, σ²)  ⇔  y = φ⊺(x)θ + ε = ∑_{k=0}^{K−1} θ_k φ_k(x) + ε,

where φ: R^D → R^K and φ(x_n) is the feature vector of x_n.

Negative log-likelihood:  L(θ) = (1/(2σ²)) (y − Φθ)⊺(y − Φθ) = (1/(2σ²)) ‖y − Φθ‖₂².

Maximum likelihood estimator:  θ_ML = (Φ⊺Φ)⁻¹Φ⊺y.

The feature matrix (design matrix) is

    Φ := [φ⊺(x_1); ⋯ ; φ⊺(x_N)] = [φ_0(x_1), ..., φ_{K−1}(x_1); ⋯ ; φ_0(x_N), ..., φ_{K−1}(x_N)] ∈ R^{N×K},

i.e., row n of Φ is φ⊺(x_n).

Example 9.3 (polynomial regression):

    φ(x) = [φ_0(x), φ_1(x), ..., φ_{K−1}(x)]⊺ = [1, x, x², ..., x^{K−1}]⊺ ∈ R^K.

Example 9.4 (feature matrix for second-order polynomials):

    Φ := [1, x_1, x_1²; 1, x_2, x_2²; ⋯ ; 1, x_N, x_N²] ∈ R^{N×3}.
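For polynomial features, the design matrix Φ is a Vandermonde matrix, so the fit of Examples 9.3/9.4 can be sketched as follows (the data points are illustrative).

```python
import numpy as np

def poly_design_matrix(x, K):
    """Feature/design matrix Phi with rows (1, x_n, x_n^2, ..., x_n^{K-1})."""
    return np.vander(x, N=K, increasing=True)      # shape (N, K)

# Second-order example as in Example 9.4: columns 1, x_n, x_n^2.
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.0, 4.9, 10.2])
Phi = poly_design_matrix(x, K=3)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # theta_ML = (Phi^T Phi)^{-1} Phi^T y
print(theta_ml)
```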
Example 9.5 (maximum likelihood polynomial fit): N = 10, x_n ~ U[−5, 5], y_n = −sin(x_n/5) + cos(x_n) + ε with ε ~ N(0, 0.2²). Fit a polynomial of degree 4 using maximum likelihood estimation.

Estimating the Noise Variance:

    log p(Y | X, θ, σ²) = ∑_{n=1}^N log N(y_n | φ⊺(x_n)θ, σ²)
                        = ∑_{n=1}^N (−(1/2) log(2π) − (1/2) log σ² − (1/(2σ²)) (y_n − φ⊺(x_n)θ)²)
                        = −(N/2) log σ² − (1/(2σ²)) s + const.,   where s := ∑_{n=1}^N (y_n − φ⊺(x_n)θ)².

Setting the derivative with respect to σ² to zero,

    ∂ log p(Y | X, θ, σ²)/∂σ² = −N/(2σ²) + s/(2σ⁴) = 0  ⇔  σ² = s/N.

The second derivative, ∂² log p(Y | X, θ, σ²)/∂(σ²)² = N/(2σ⁴) − s/σ⁶, vanishes only at σ² = 2s/N and is negative at σ² = s/N, so this stationary point is indeed a maximum. Hence

    σ²_ML = s/N = (1/N) ∑_{n=1}^N (y_n − φ⊺(x_n)θ)².

Evaluating the quality of the model:
- The negative log-likelihood: (1/(2σ²)) ‖y − Φθ‖₂².
- The squared-error loss: ‖y − Φθ‖₂².
- The root mean square error (RMSE):

      √((1/N) ‖y − Φθ‖₂²) = √((1/N) ∑_{n=1}^N (y_n − φ⊺(x_n)θ)²),

  which (a) allows us to compare errors of datasets with different sizes and (b) has the same scale and units as the observed function values y_n.

For model selection, we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial.
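Both quantities are one-liners on top of a fitted θ; a minimal sketch (function and argument names are my own).

```python
import numpy as np

def sigma2_ml(Phi, y, theta):
    """Maximum likelihood noise variance: (1/N) sum_n (y_n - phi(x_n)^T theta)^2."""
    resid = y - Phi @ theta
    return np.mean(resid ** 2)

def rmse(Phi, y, theta):
    """Root mean square error sqrt((1/N) ||y - Phi theta||_2^2)."""
    return np.sqrt(np.mean((y - Phi @ theta) ** 2))
```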
[Figure: Maximum likelihood polynomial fits for N = 10 and various degrees M.]

Over-fitting occurs for large M. The RMSE on the training data alone is not enough: we need test data.

Generalization performance: evaluate the RMSE for both the training data and a test set (here 200 test points). The best generalization is obtained for a polynomial of degree M = 4.
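A minimal sketch of this model-selection procedure; it assumes a held-out test set (the slides use 200 test points drawn in the same way as the training data).

```python
import numpy as np

def rmse_by_degree(x_train, y_train, x_test, y_test, degrees):
    """Training and test RMSE of the maximum likelihood polynomial fit for each degree M."""
    scores = {}
    for M in degrees:
        Phi_tr = np.vander(x_train, N=M + 1, increasing=True)
        Phi_te = np.vander(x_test,  N=M + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
        scores[M] = (np.sqrt(np.mean((y_train - Phi_tr @ theta) ** 2)),
                     np.sqrt(np.mean((y_test  - Phi_te @ theta) ** 2)))
    return scores   # pick the M with the smallest test RMSE
```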
Maximum A Posteriori (MAP) Estimation:
- Maximum likelihood estimation is prone to over-fitting. It is observed that the magnitude of the parameter values becomes relatively large when we run into over-fitting.
- To mitigate the effect of huge parameter values, we can place a prior distribution p(θ) on the parameters.
- Maximize the posterior distribution

      p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X),

  or, equivalently, minimize the negative log-posterior:

      θ_MAP = argmin_θ {−log p(Y | X, θ) − log p(θ) + const.}.
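The MAP objective can always be minimized numerically; the sketch below does so for a Gaussian prior N(0, b²I), dropping additive constants. The closed-form solution for this case is given next, so the numerical route is only an illustration of the general recipe, and the choice of optimizer is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def theta_map_numerical(Phi, y, sigma2, b2):
    """Minimize the negative log-posterior -log p(Y|X,theta) - log p(theta)
    for a Gaussian prior N(0, b^2 I); additive constants are dropped."""
    def neg_log_posterior(theta):
        nll = 0.5 * np.sum((y - Phi @ theta) ** 2) / sigma2   # negative log-likelihood (+ const.)
        nlp = 0.5 * np.sum(theta ** 2) / b2                   # negative log-prior (+ const.)
        return nll + nlp
    theta0 = np.zeros(Phi.shape[1])
    return minimize(neg_log_posterior, theta0, method="L-BFGS-B").x
```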
If p(y | x, θ) = N(y | φ⊺(x)θ, σ²) and p(θ) = N(0, b²I), then

    −log p(θ | X, Y) = (1/(2σ²)) (y − Φθ)⊺(y − Φθ) + (1/(2b²)) θ⊺θ + const.
                     = (1/(2σ²)) (‖y − Φθ‖₂² + (σ²/b²) ‖θ‖₂²) + const.,

and therefore

    θ_MAP = (Φ⊺Φ + (σ²/b²) I)⁻¹ Φ⊺y.

MAP Estimation as Regularization

L2-regularized (Tikhonov-regularized) least squares, also known as ridge regression:

    min_θ {‖y − Φθ‖₂² + λ‖θ‖₂²}.

With λ = σ²/b² this coincides with MAP estimation under the prior p(θ) = N(0, b²I), and

    θ_RLS = (Φ⊺Φ + λI)⁻¹ Φ⊺y.
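A minimal sketch of the ridge estimator; a linear solve is used instead of an explicit inverse (function and variable names are my own).

```python
import numpy as np

def theta_ridge(Phi, y, lam):
    """theta_RLS = (Phi^T Phi + lam I)^{-1} Phi^T y, via a linear solve
    rather than an explicit matrix inverse."""
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# With lam = sigma^2 / b^2 this is exactly the MAP estimator theta_MAP above.
```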
Exercise
- Derive the MAP estimate with Laplacian prior p(θ_i) = L(0, b) under Gaussian noise ε ~ N(0, σ²).

L1-regularized least squares (LASSO, least absolute shrinkage and selection operator; a sparsity-inducing regularization):

    min_θ {‖y − Φθ‖₂² + λ‖θ‖₁}.

There is no closed-form solution in general. (What happens when Φ is diagonal?) The L1 penalty is a convex relaxation of the L0 regularization.

L0-regularized least squares (sparsity pursuit):

    min_θ {‖y − Φθ‖₂² + λ‖θ‖₀}.

Lp-regularized least squares (0 < p < 1, sparsity inducing):

    min_θ {‖y − Φθ‖₂² + λ‖θ‖_p^p},   e.g., p = 1/2?

[Figure: L2 vs. L1 penalties; Lp-balls.]
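Because the LASSO has no closed form in general, it is solved iteratively in practice. Below is a minimal sketch of the classical iterative soft-thresholding algorithm (ISTA), one standard proximal-gradient method for this objective; it is not discussed on the slides and stands in here only as an illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1: elementwise sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(Phi, y, lam, n_iter=500):
    """ISTA for min_theta ||y - Phi theta||_2^2 + lam * ||theta||_1."""
    L = 2.0 * np.linalg.norm(Phi, ord=2) ** 2     # Lipschitz constant of the smooth part's gradient
    t = 1.0 / L                                   # step size
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ theta - y)    # gradient of ||y - Phi theta||_2^2
        theta = soft_threshold(theta - t * grad, t * lam)
    return theta
```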
9.3 Bayesian Linear Regression

MLE (e.g., least squares), min_θ {‖y − Φθ‖₂²}, can lead to severe over-fitting, in particular in the small-data regime. MAP estimation (e.g., L2-regularized least squares), min_θ {‖y − Φθ‖₂² + λ‖θ‖₂²}, addresses this issue by placing a prior on the parameters.

Bayesian linear regression pushes the idea of the parameter prior a step further:
- It does not compute a point estimate of the parameters;
- The full posterior distribution over the parameters is taken into account when making predictions;
- It averages over all plausible parameter settings (according to the posterior).

Consider the model

    (∗)   y = φ⊺(x)θ + ε,   with prior p(θ) = N(m_0, S_0) and likelihood p(y | x, θ) = N(y | φ⊺(x)θ, σ²).

The full probabilistic model is p(y, θ | x) = p(y | x, θ) p(θ).

The average predictive distribution according to the prior p(θ):

    p(y | x) = ∫ p(y | x, θ) p(θ) dθ = E_θ[p(y | x, θ)].

With the prior p(θ) = N(m_0, S_0), and using that the Gaussian prior is conjugate, that the marginal of a Gaussian is again Gaussian, and that the Gaussian noise is independent, we obtain

    p(y | x) = N(φ⊺(x)m_0, φ⊺(x)S_0φ(x) + σ²).

Prior over (noise-free) functions:

    p(f(x)) = N(φ⊺(x)m_0, φ⊺(x)S_0φ(x)).

[Figure: Polynomials of degree 5, f_i(·) = θ_i⊺φ(·) with θ_i ~ p(θ) = N(0, (1/4)I).]
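A minimal sketch of evaluating the prior predictive mean and variance for polynomial features; the prior N(0, (1/4)I) matches the figure, while the noise level σ = 0.2 is an assumption carried over from Example 9.5.

```python
import numpy as np

def prior_predictive(x, m0, S0, sigma2, K):
    """Mean and variance of p(y|x) = N(phi(x)^T m0, phi(x)^T S0 phi(x) + sigma^2)
    for polynomial features phi(x) = (1, x, ..., x^{K-1})."""
    phi = x ** np.arange(K)
    return phi @ m0, phi @ S0 @ phi + sigma2

K = 6                                   # degree-5 polynomial -> K = 6 features
m0, S0 = np.zeros(K), 0.25 * np.eye(K)  # prior N(0, (1/4) I) as in the figure
print(prior_predictive(2.0, m0, S0, sigma2=0.2 ** 2, K=K))
```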
Posterior Distribution: with X = {x_n | n = 1, ..., N} and Y = {y_n | n = 1, ..., N},

    p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X),

where the marginal likelihood (evidence) is

    p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ = E_θ[p(Y | X, θ)].

Theorem 9.1 (Parameter Posterior).

    p(θ | X, Y) = N(θ | m_N, S_N) = (2π)^{−D/2} |S_N|^{−1/2} exp(−(1/2)(θ − m_N)⊺S_N⁻¹(θ − m_N)),

with

    S_N = (S_0⁻¹ + σ⁻²Φ⊺Φ)⁻¹,
    m_N = S_N (S_0⁻¹m_0 + σ⁻²Φ⊺y).

Proof of Theorem 9.1. The posterior is p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X) with likelihood p(Y | X, θ) = N(y | Φθ, σ²I) and prior p(θ) = N(m_0, S_0). The log-posterior is

    log p(θ | X, Y) = log N(y | Φθ, σ²I) + log N(θ | m_0, S_0) + const.
                    = −(1/2)(σ⁻²(y − Φθ)⊺(y − Φθ) + (θ − m_0)⊺S_0⁻¹(θ − m_0)) + const.
                    = −(1/2)(θ⊺(σ⁻²Φ⊺Φ + S_0⁻¹)θ − 2(σ⁻²Φ⊺y + S_0⁻¹m_0)⊺θ) + const.
                    = −(1/2)(θ⊺S_N⁻¹θ − 2m_N⊺S_N⁻¹θ + m_N⊺S_N⁻¹m_N) + const.
                    = −(1/2)(θ − m_N)⊺S_N⁻¹(θ − m_N) + const.

Posterior Predictive Distribution:

    p(y | X, Y, x) = ∫ p(y | x, θ) p(θ | X, Y) dθ = E_{θ|X,Y}[p(y | x, θ)]
                   = ∫ N(y | φ⊺(x)θ, σ²) N(θ | m_N, S_N) dθ
                   = N(y | φ⊺(x)m_N, φ⊺(x)S_Nφ(x) + σ²).

Posterior over (noise-free) functions:

    p(f(x) | X, Y) = N(φ⊺(x)m_N, φ⊺(x)S_Nφ(x)).

[Figure: Posterior over functions for polynomials of degree 5, f_i(·) = θ_i⊺φ(·), θ_i ~ p(θ) = N(0, (1/4)I).]

[Figure: Posterior over functions for polynomials of degree 3, 5 and 7, f_i(·) = θ_i⊺φ(·), θ_i ~ p(θ) = N(0, (1/4)I).]
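A minimal sketch of the quantities in Theorem 9.1 and of the posterior predictive moments; explicit inverses are used for brevity, although a Cholesky-based solve would be preferable numerically (function names are my own).

```python
import numpy as np

def posterior_params(Phi, y, m0, S0, sigma2):
    """Theorem 9.1: S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1},
                    m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

def posterior_predictive(phi_x, mN, SN, sigma2):
    """Mean and variance of p(y | X, Y, x) = N(phi(x)^T m_N, phi(x)^T S_N phi(x) + sigma^2)."""
    return phi_x @ mN, phi_x @ SN @ phi_x + sigma2
```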
Computing the Marginal Likelihood: with prior θ ~ N(m_0, S_0) and likelihood y_n | x_n, θ ~ N(x_n⊺θ, σ²), the marginal likelihood is

    p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ = ∫ N(y | Xθ, σ²I) N(θ | m_0, S_0) dθ.

Since

    E_θ[Y | X] = E_θ[Xθ + ε] = X E_θ[θ] = Xm_0,
    Cov_θ[Y | X] = Cov_θ[Xθ] + σ²I = X Cov_θ[θ] X⊺ + σ²I = XS_0X⊺ + σ²I,

we obtain

    p(Y | X) = N(y | Xm_0, XS_0X⊺ + σ²I).

9.4 Maximum Likelihood as Orthogonal Projection

Consider y = xθ + ε with ε ~ N(0, σ²), σ² known, and a training set D := {(x_1, y_1), ..., (x_N, y_N)}. The maximum likelihood estimator is

    θ_ML = (X⊺X)⁻¹X⊺y = X⊺y / (X⊺X) ∈ R,

where X = (x_1, ..., x_N)⊺ ∈ R^N and y = (y_1, ..., y_N)⊺ ∈ R^N. The optimal (maximum likelihood) reconstruction of the training targets at the training inputs X is

    Xθ_ML = X (X⊺y) / (X⊺X) = (XX⊺ / (X⊺X)) y,

i.e., the orthogonal projection of y onto the one-dimensional subspace of R^N spanned by X; θ_ML is the coordinate of this projection in the basis X.

General case: y = φ⊺(x)θ + ε, ε ~ N(0, σ²). The maximum likelihood estimator is θ_ML = (Φ⊺Φ)⁻¹Φ⊺y, and the reconstruction of the training targets is

    Φθ_ML = Φ(Φ⊺Φ)⁻¹Φ⊺y.

When Φ⊺Φ = I,

    Φθ_ML = ΦΦ⊺y = (∑_{k=1}^K φ_k φ_k⊺) y.

9.4 Logistic regression and deep neural networks

Logistic regression: y = σ(f(x)) + ε, ε ~ N(0, σ²), where f(x) = φ⊺(x)θ and σ(f) = 1/(1 + exp(−f)) is the logistic sigmoid function. Maximum likelihood estimation:

    min_θ L(θ) ≜ ‖y − σ(Φθ)‖₂².

Deep neural network: y = f(x; W, b) + ε, ε ~ N(0, σ²), where f(x; W, b) is defined by

    x^(0) = x,   x^(l) = σ^(l)(W^(l) x^(l−1) + b^(l)),   l = 1, 2, ..., L,   f(x; W, b) = x^(L),

with W^(l) ∈ R^{k_l × k_{l−1}}, k_0 = d and k_L = 1. Maximum likelihood estimation:

    min_{W,b} L(W, b) ≜ ∑_{n=1}^N (y_n − f(x_n; W, b))².

Exercises
- Derive the MLE under the Laplacian noise distribution ε ~ L(0, b).
- Derive the MLE under the uniform noise distribution ε ~ U[−a, a].
- Derive the MAP estimate with Laplacian prior p(θ_i) = L(0, b) under Gaussian noise ε ~ N(0, σ²).
- Prove that θ_MAP = m_N.
- Give the closed-form solution of the L1-regularized least squares problem min_θ {‖y − θ‖₂² + λ‖θ‖₁}.