Linear Regression
Libin Jiao
Dalian University of Technology
November 7, 2024
Outline
1. Problem formulation
2. Parameter Estimation
3. Bayesian Linear Regression
4. Maximum Likelihood as Orthogonal Projection
Problem formulation
9.1 Problem formulation
Given a set of training inputs $x_n$ and corresponding noisy observations
$y_n = f(x_n) + \epsilon$,
where $\epsilon$ is an i.i.d. random variable that describes measurement/observation noise.
[Figure: (a) Dataset. (b) Possible solution.]
The task is to infer the function f that generated the data and generalizes
well to function values at new input locations.
Regression is a basic problem in ML:
Prediction and control
Classification.
Time-series analysis
Deep-learning
Finding a regression function:
Choice of the model (type) and the parametrization.
Finding (estimating) good parameters.
Over-fitting and model selection.
Relationship between loss functions and parameter priors.
Uncertainty modeling.
Find optimal model parameters
Maximum likelihood estimation
Maximum a posteriori (MAP) estimation
Generalization errors and over-fitting
Bayesian linear regression
The functional relationship between $x$ and $y$ is given as
$y = f(x) + \epsilon$.
Suppose $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent, identically distributed (i.i.d.)
Gaussian measurement noise with mean 0 and variance $\sigma^2$.
The likelihood function:
$p(y \mid x) = \mathcal{N}(y \mid f(x), \sigma^2)$.
We focus on parametric models, i.e., we choose a parameterized function
and find parameters $\theta$ that "work well" for modeling the data.
Parametric models: $f(x)$ is a parameterized function with parameters $\theta$,
and can be denoted by $f(x; \theta)$.
Linear regression: the parameters $\theta$ appear linearly in the model:
$f(x; \theta) = x^\top \theta$;
$f(x; \theta) = \phi^\top(x)\theta$, where $\phi$ is a nonlinear transformation (features).
Example 9.1:
[Figure: (a) Example functions. (b) Training set. (c) MLE.]
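The distinction between "linear in the parameters" and "linear in the inputs" can be checked numerically. The following is a minimal sketch, assuming scalar inputs and a polynomial feature map; the helper names `poly_features` and `f` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def poly_features(x, K):
    """Hypothetical feature map: phi(x) = [1, x, ..., x^(K-1)] for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, K, increasing=True)  # shape (N, K)

def f(x, theta):
    """Model f(x; theta) = phi(x)^T theta, evaluated for every input in x."""
    return poly_features(x, len(theta)) @ theta

x = np.linspace(-1.0, 1.0, 5)
theta_a = np.array([1.0, -2.0, 0.5])   # illustrative parameter vectors
theta_b = np.array([0.0, 3.0, 1.0])

# Linearity in theta: f(x; 2*theta_a + 3*theta_b) == 2*f(x; theta_a) + 3*f(x; theta_b)
lhs = f(x, 2.0 * theta_a + 3.0 * theta_b)
rhs = 2.0 * f(x, theta_a) + 3.0 * f(x, theta_b)
is_linear_in_theta = np.allclose(lhs, rhs)

# The same model is not linear in x: f(2x; theta) differs from 2*f(x; theta) here
is_linear_in_x = np.allclose(f(2.0 * x, theta_a), 2.0 * f(x, theta_a))
```

With these values `is_linear_in_theta` is true and `is_linear_in_x` is false, which is why nonlinear features still lead to a linear regression problem.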
Parameter Estimation
9.2 Parameter Estimation
Linear in $\theta$ and in $x$: $p(y \mid x, \theta) = \mathcal{N}(y \mid x^\top\theta, \sigma^2) \iff y = x^\top\theta + \epsilon$,
$\epsilon \sim \mathcal{N}(0, \sigma^2)$; suppose first that $\sigma^2$ is known.
Given a training set $\mathcal{D} := \{(x_1, y_1), \dots, (x_N, y_N)\}$, $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$
($n = 1, 2, \dots, N$);
Likelihood:
$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = p(y_1, \dots, y_N \mid x_1, \dots, x_N, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top\theta, \sigma^2)$$
[Figure: Graphical model.]
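Because the likelihood factorizes over the data points, its logarithm is a sum of Gaussian log-densities. A minimal sketch of that computation follows, assuming synthetic data; the function name `log_likelihood`, the seed, and the values of `theta_true` and `sigma2` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma2):
    """Sum over n of log N(y_n | x_n^T theta, sigma2)."""
    residuals = y - X @ theta
    N = y.shape[0]
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(residuals**2) / sigma2

rng = np.random.default_rng(0)             # arbitrary seed
N, D, sigma2 = 50, 3, 0.25
theta_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground truth
X = rng.normal(size=(N, D))                # rows are x_n^T
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

ll_true = log_likelihood(theta_true, X, y, sigma2)
ll_zero = log_likelihood(np.zeros(D), X, y, sigma2)

# Same quantity computed term by term, mirroring the product over n
per_point = -0.5 * np.log(2.0 * np.pi * sigma2) - (y - X @ theta_true) ** 2 / (2.0 * sigma2)
```

Working with the log-likelihood rather than the product itself avoids numerical underflow when N is large.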
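Maximum Likelihood Estimation:
$$
\begin{aligned}
\theta_{\mathrm{ML}} &= \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta) && \text{(maximize the likelihood)} \\
&= \arg\min_{\theta}\, \{-\log p(\mathcal{Y} \mid \mathcal{X}, \theta)\} && \text{(minimize the negative log-likelihood)} \\
&= \arg\min_{\theta}\, \Big\{-\log \prod_{n=1}^{N} p(y_n \mid x_n, \theta)\Big\} \\
&= \arg\min_{\theta}\, \Big\{-\sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top\theta, \sigma^2)\Big\} && \leftarrow\ \mathcal{N}(y_n \mid x_n^\top\theta, \sigma^2) = \tfrac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\tfrac{(y_n - x_n^\top\theta)^2}{2\sigma^2}\Big) \\
&= \arg\min_{\theta}\, \Big\{\tfrac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top\theta)^2 + \text{const.}\Big\} \\
&= \arg\min_{\theta}\, L(\theta) && \text{(negative log-likelihood)}
\end{aligned}
$$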
Negative log-likelihood:
$$L(\theta) := \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top\theta)^2 = \frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta) = \frac{1}{2\sigma^2} \|y - X\theta\|_2^2.$$
Maximum likelihood estimator (does not depend on $\sigma$):
$$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y,$$
with the design matrix $X := [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $y := [y_1, \dots, y_N]^\top \in \mathbb{R}^N$.
Example 9.2 (Fitting Lines):
[Figure: (a) Example functions. (b) Training set. (c) MLE.]
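The closed-form estimator can be evaluated directly with NumPy. The sketch below assumes synthetic data with a hypothetical `theta_true`; it solves the normal equations instead of forming the inverse explicitly, and compares the result with `np.linalg.lstsq`, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(1)               # arbitrary seed
N = 200
theta_true = np.array([0.5, -1.5])           # hypothetical parameters, D = 2
X = rng.uniform(-3.0, 3.0, size=(N, 2))      # design matrix, rows are x_n^T
y = X @ theta_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) theta = X^T y rather than inverting X^T X
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver based on the SVD; also copes with rank-deficient X
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the optimum the residual is orthogonal to the columns of X
gradient = X.T @ (y - X @ theta_ml)
```

Note that the noise variance never enters the computation, in line with the remark that the estimator does not depend on $\sigma$.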
Exercise
Do MLE under the Laplacian noise distribution $\epsilon \sim \mathcal{L}(0, b)$.
Do MLE under the uniform noise distribution $\epsilon \sim \mathcal{U}[-a, a]$.
Maximum Likelihood Estimation with Features:
$$p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2) \iff y = \phi^\top(x)\theta + \epsilon = \sum_{k=0}^{K-1} \theta_k \phi_k(x) + \epsilon,$$
where $\phi: \mathbb{R}^D \to \mathbb{R}^K$ and $\phi(x_n)$ is the feature vector of $x_n$.
Negative log-likelihood: $L(\theta) = \frac{1}{2\sigma^2} (y - \Phi\theta)^\top (y - \Phi\theta) = \frac{1}{2\sigma^2} \|y - \Phi\theta\|_2^2$
Maximum likelihood estimator: $\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y$
The feature matrix (design matrix):
$$
\Phi := \begin{bmatrix} \phi^\top(x_1) \\ \vdots \\ \phi^\top(x_N) \end{bmatrix}
= \begin{bmatrix}
\phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\
\phi_0(x_2) & \cdots & \phi_{K-1}(x_2) \\
\vdots & & \vdots \\
\phi_0(x_N) & \cdots & \phi_{K-1}(x_N)
\end{bmatrix} \in \mathbb{R}^{N \times K}
$$
Example 9.3 (polynomial regression)
$$
\phi(x) = \begin{bmatrix} \phi_0(x) \\ \phi_1(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix}
= \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^{K-1} \end{bmatrix} \in \mathbb{R}^K.
$$
Example 9.4 (feature matrix for second-order polynomials)
$$
\Phi := \begin{bmatrix}
1 & x_1 & x_1^2 \\
1 & x_2 & x_2^2 \\
\vdots & \vdots & \vdots \\
1 & x_N & x_N^2
\end{bmatrix}
$$
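The polynomial feature matrix is a Vandermonde matrix, so it can be built in one call. A small sketch, assuming scalar inputs; the function name `polynomial_design_matrix` and the three sample inputs are made up for illustration.

```python
import numpy as np

def polynomial_design_matrix(x, K):
    """Return Phi in R^{N x K} with Phi[n, k] = x_n ** k for scalar inputs x_n."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, K, increasing=True)

x = np.array([-1.0, 0.0, 2.0])              # illustrative inputs
Phi = polynomial_design_matrix(x, K=3)      # second-order polynomial: columns 1, x, x^2
# Phi is
# [[ 1. -1.  1.]
#  [ 1.  0.  0.]
#  [ 1.  2.  4.]]
```

Passing `increasing=True` orders the columns from the constant term upward, matching the layout of the matrix in Example 9.4.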
Example 9.5 (maximum likelihood polynomial fit):
$N = 10$, $x_n \sim \mathcal{U}[-5, 5]$, $y_n = -\sin(x_n/5) + \cos(x_n) + \epsilon$, $\epsilon \sim \mathcal{N}(0, 0.2^2)$.
Fit a polynomial of degree 4 using maximum likelihood estimation.
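The setup of Example 9.5 can be reproduced in a few lines. This is a sketch under assumptions: the random seed is arbitrary, so the sampled data (and hence the fitted coefficients) will differ from the figure in the slides, and the helper `predict` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, not the one behind the slides
N, degree, noise_std = 10, 4, 0.2

x = rng.uniform(-5.0, 5.0, size=N)
y = -np.sin(x / 5.0) + np.cos(x) + rng.normal(scale=noise_std, size=N)

Phi = np.vander(x, degree + 1, increasing=True)       # N x (degree + 1) feature matrix
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least squares coincides with MLE here

def predict(x_new):
    """Evaluate the fitted polynomial at new inputs."""
    return np.vander(np.atleast_1d(x_new), degree + 1, increasing=True) @ theta_ml

train_rmse = np.sqrt(np.mean((y - Phi @ theta_ml) ** 2))
```

Calling `predict(np.linspace(-5, 5, 100))` gives the fitted curve on a grid, which can be plotted against the training points.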
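Estimating the Noise Variance
$$
\begin{aligned}
\log p(\mathcal{Y} \mid \mathcal{X}, \theta, \sigma^2) &= \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid \phi^\top(x_n)\theta, \sigma^2) \\
&= \sum_{n=1}^{N} \Big(-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2\sigma^2}(y_n - \phi^\top(x_n)\theta)^2\Big) \\
&= -\tfrac{N}{2}\log\sigma^2 - \tfrac{1}{2\sigma^2} \underbrace{\sum_{n=1}^{N} (y_n - \phi^\top(x_n)\theta)^2}_{=:s} + \text{const.}
\end{aligned}
$$
$$\frac{\partial \log p(\mathcal{Y} \mid \mathcal{X}, \theta, \sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{s}{2\sigma^4} = 0 \iff \sigma^2 = \frac{s}{N}.$$
$$\frac{\partial^2 \log p(\mathcal{Y} \mid \mathcal{X}, \theta, \sigma^2)}{\partial (\sigma^2)^2} = \frac{N}{2\sigma^4} - \frac{s}{\sigma^6} = 0 \iff \sigma^2 = \frac{2s}{N}.$$
At $\sigma^2 = s/N$ the second derivative equals $-N^3/(2s^2) < 0$, so the stationary point is a maximum.
$$\sigma^2_{\mathrm{ML}} = \frac{s}{N} = \frac{1}{N} \sum_{n=1}^{N} (y_n - \phi^\top(x_n)\theta)^2.$$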
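The noise-variance estimate is simply the mean squared residual. The sketch below illustrates this on synthetic data; the sample size, `sigma_true`, `theta_true`, and the helper `log_lik` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)                  # arbitrary seed
N, sigma_true = 5000, 0.3                       # hypothetical sample size and noise level
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, 3, increasing=True)          # features 1, x, x^2
theta_true = np.array([0.5, -1.0, 2.0])
y = Phi @ theta_true + rng.normal(scale=sigma_true, size=N)

theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
s = np.sum((y - Phi @ theta_ml) ** 2)           # sum of squared residuals
sigma2_ml = s / N                               # maximum likelihood noise variance

def log_lik(sigma2):
    """Log-likelihood as a function of sigma^2 with theta fixed at theta_ml."""
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - s / (2.0 * sigma2)
```

Evaluating `log_lik` slightly below and slightly above `sigma2_ml` gives smaller values in both cases, consistent with the stationary point being a maximum.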
Evaluate the quality of the model
The negative log-likelihood: $\frac{1}{2\sigma^2} \|y - \Phi\theta\|_2^2$
The squared-error loss function: $\|y - \Phi\theta\|^2$
The root mean square error (RMSE):
$$\sqrt{\frac{1}{N} \|y - \Phi\theta\|^2} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \phi^\top(x_n)\theta)^2}$$
(a) Allows us to compare errors of datasets with different sizes;
(b) Has the same scale and units as the observed function values $y_n$.
For model selection, we can use the RMSE (or the negative
log-likelihood) to determine the best degree of the polynomial.
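A short sketch of the RMSE and of property (a): duplicating a dataset doubles the squared-error loss but leaves the RMSE unchanged. The function name `rmse` and the toy numbers are chosen for illustration only.

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y - y_pred) ** 2))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
error = rmse(y, y_pred)  # sqrt((0 + 0 + 0 + 4) / 4) = 1.0

# Same residuals repeated twice: the sum of squares doubles, the RMSE does not change
error_doubled = rmse(np.tile(y, 2), np.tile(y_pred, 2))
```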
[Figure: Maximum likelihood polynomial fits for N = 10 and various polynomial degrees M.]
Over-fitting occurs for large M!
The RMSE on the training data is not enough ⇐ use test data!
Generalization performance:
Evaluate the RMSE for both the training data and the test data (200 points).
The best generalization is obtained for a polynomial of degree M = 4.
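The comparison of training and test error can be sketched as follows, assuming data drawn from the generating process of Example 9.5. The seed and the range of degrees are arbitrary choices, so the degree selected by this run need not match the value reported on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def sample(n):
    """Draw n points from the generating process of Example 9.5."""
    x = rng.uniform(-5.0, 5.0, size=n)
    y = -np.sin(x / 5.0) + np.cos(x) + rng.normal(scale=0.2, size=n)
    return x, y

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

x_train, y_train = sample(10)
x_test, y_test = sample(200)

train_rmse, test_rmse = [], []
for M in range(8):  # polynomial degrees 0, ..., 7
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    train_rmse.append(rmse(y_train, Phi_train @ theta))
    test_rmse.append(rmse(y_test, Phi_test @ theta))

best_degree = int(np.argmin(test_rmse))  # degree with the lowest test RMSE in this run
```

The training RMSE can only improve as the degree grows, because each model contains the previous one; the test RMSE is the quantity that reveals over-fitting.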
Maximum A Posteriori Estimation
Maximum likelihood estimation is prone to over-fitting.
It is observed that the magnitude of the parameter values becomes
relatively large if we run into over-fitting.
To mitigate the effect of huge parameter values, we can place a prior
distribution $p(\theta)$ on the parameters.
Maximize the posterior distribution (probability)
$$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)}{p(\mathcal{Y} \mid \mathcal{X})}.$$
Minimize the negative log-posterior distribution (probability):
$$\theta_{\mathrm{MAP}} = \arg\min_{\theta}\, \{-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) - \log p(\theta) + \text{const.}\}.$$
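If $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$ and $p(\theta) = \mathcal{N}(0, b^2 I)$, then
$$
\begin{aligned}
-\log p(\theta \mid \mathcal{X}, \mathcal{Y}) &= \frac{1}{2\sigma^2} (y - \Phi\theta)^\top (y - \Phi\theta) + \frac{1}{2b^2} \theta^\top \theta + \text{const.} \\
&= \frac{1}{2\sigma^2} \Big(\|y - \Phi\theta\|_2^2 + \frac{\sigma^2}{b^2} \|\theta\|_2^2\Big) + \text{const.}
\end{aligned}
$$
$$\theta_{\mathrm{MAP}} = \Big(\Phi^\top \Phi + \frac{\sigma^2}{b^2} I\Big)^{-1} \Phi^\top y.$$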
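The closed-form MAP estimate differs from the maximum likelihood estimate only by the term $\frac{\sigma^2}{b^2} I$. A minimal sketch, assuming a small synthetic dataset; the function name `map_estimate`, the target function, the seed, and the prior variance `b2 = 1` are all illustrative assumptions.

```python
import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """Closed-form MAP estimate under a N(0, b2 * I) prior and N(0, sigma2) noise."""
    K = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / b2) * np.eye(K)
    return np.linalg.solve(A, Phi.T @ y)   # solves A theta = Phi^T y

rng = np.random.default_rng(7)                        # arbitrary seed
x = rng.uniform(-1.0, 1.0, size=10)                   # hypothetical inputs
y = np.cos(3.0 * x) + rng.normal(scale=0.2, size=10)  # hypothetical targets
Phi = np.vander(x, 6, increasing=True)                # degree-5 polynomial features

sigma2, b2 = 0.2**2, 1.0                              # b2 = 1 is an arbitrary prior variance
theta_map = map_estimate(Phi, y, sigma2, b2)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Gradient of ||y - Phi theta||^2 + (sigma2 / b2) ||theta||^2, up to a factor of 2
grad_at_map = Phi.T @ (Phi @ theta_map - y) + (sigma2 / b2) * theta_map
```

Comparing `np.linalg.norm(theta_map)` with `np.linalg.norm(theta_ml)` shows the shrinkage effect of the prior: the MAP parameters have the smaller norm.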
MAP Estimation as Regularization
L2-regularized least squares (Tikhonov regularization, ridge regression):
$$\min_{\theta}\, \{\|y - \Phi\theta\|_2^2 + \lambda \|\theta\|_2^2\}$$
$\lambda = \frac{\sigma^2}{b^2}$ ⇒ MAP with prior $p(\theta) = \mathcal{N}(0, b^2 I)$.
$$\theta_{\mathrm{RLS}} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y.$$
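One way to see that the regularizer acts like extra data pulling the parameters toward zero is to rewrite the ridge problem as ordinary least squares on an augmented system, stacking $\sqrt{\lambda} I$ under $\Phi$ and zeros under $y$. The sketch below checks that both routes agree; the function names, the random data, and the values of `sigma2` and `b2` are assumptions for illustration.

```python
import numpy as np

def ridge_closed_form(Phi, y, lam):
    """theta = (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

def ridge_augmented(Phi, y, lam):
    """Same minimizer, obtained as ordinary least squares on an augmented system."""
    K = Phi.shape[1]
    Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(K)])
    y_aug = np.concatenate([y, np.zeros(K)])
    theta, *_ = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)
    return theta

rng = np.random.default_rng(11)   # arbitrary seed
Phi = rng.normal(size=(30, 4))    # hypothetical feature matrix
y = rng.normal(size=30)           # hypothetical targets

sigma2, b2 = 0.5, 2.0             # illustrative noise and prior variances
lam = sigma2 / b2                 # regularization strength implied by the prior
theta_a = ridge_closed_form(Phi, y, lam)
theta_b = ridge_augmented(Phi, y, lam)

# Stronger regularization shrinks the parameter norm
norms = [np.linalg.norm(ridge_closed_form(Phi, y, l)) for l in (0.01, 1.0, 100.0)]
```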
Exercise
Do MAP estimation with the Laplacian prior $p(\theta_i) = \mathcal{L}(0, b)$ under the Gaussian noise
distribution $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
L1-regularized least squares (LASSO, least absolute shrinkage and
selection operator; sparsity-inducing regularization)
$$\min_{\theta}\, \{\|y - \Phi\theta\|_2^2 + \lambda \|\theta\|_1\}$$
⇒ No closed-form solution in general! What if $\Phi$ is diagonal?
Convex relaxation of the L0 regularization.
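Since the L1 problem has no closed-form solution in general, it is usually solved iteratively. The sketch below uses proximal gradient descent (ISTA), which is one common choice and not a method named in the slides; the function names, iteration count, and test data are assumptions. As a sanity check it uses a feature matrix with orthonormal columns, a special case in which the minimizer is known to be soft-thresholding of $\Phi^\top y$ at $\lambda/2$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, y, lam, n_iter=2000):
    """Minimize ||y - Phi theta||_2^2 + lam * ||theta||_1 by proximal gradient descent."""
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2      # Lipschitz constant of the smooth part's gradient
    step = 1.0 / L
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ theta - y)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(5)                      # arbitrary seed
Q, _ = np.linalg.qr(rng.normal(size=(20, 5)))       # feature matrix with orthonormal columns
theta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # hypothetical sparse parameters
y = Q @ theta_true + rng.normal(scale=0.1, size=20)

lam = 1.0
theta_lasso = lasso_ista(Q, y, lam)
theta_expected = soft_threshold(Q.T @ y, lam / 2.0)  # known solution when Q^T Q = I
```

In this example the entries of `theta_lasso` that correspond to zero entries of `theta_true` are set exactly to zero, which is the sparsity-inducing behavior that the L2 penalty does not provide.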
L0-regularized least squares (sparsity pursuit)
$$\min_{\theta}\, \{\|y - \Phi\theta\|_2^2 + \lambda \|\theta\|_0\}$$
Lp-regularized least squares ($0 < p < 1$, sparsity inducing)
$$\min_{\theta}\, \{\|y - \Phi\theta\|_2^2 + \lambda \|\theta\|_p^p\}, \qquad p = \tfrac{1}{2}?$$
Parameter Estimation
9.2 Parameter Estimation
[Figures: $L_2$ vs. $L_1$ regularization; $L_p$-balls]
Bayesian Linear Regression
9.3 Bayesian Linear Regression
MLE (e.g., least squares):
$\min_{\theta} \{\|y - \Phi\theta\|_2^2\}$
can lead to severe overfitting, in particular in the small-data regime.
MAPE (e.g., $L_2$-regularized least squares):
$\min_{\theta} \{\|y - \Phi\theta\|_2^2 + \lambda\|\theta\|_2^2\}$
addresses this issue by placing a prior on the parameters.
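The contrast between the two estimators can be sketched numerically. The example below assumes polynomial features, a synthetic sine-shaped target, and $\lambda = 0.1$; all of these, together with the function names, are illustrative choices rather than settings taken from the slides.

```python
import numpy as np

def poly_features(x, degree):
    """Feature matrix Phi with rows (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(Phi, y):
    """Minimizer of ||y - Phi theta||_2^2 (MLE under Gaussian noise)."""
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def fit_ridge(Phi, y, lam):
    """Minimizer of ||y - Phi theta||_2^2 + lam * ||theta||_2^2."""
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=8)        # small-data regime: only 8 points
y = np.sin(3.0 * x) + 0.2 * rng.normal(size=8)
Phi = poly_features(x, degree=7)          # 8 parameters for 8 data points

theta_mle = fit_least_squares(Phi, y)
theta_map = fit_ridge(Phi, y, lam=0.1)
print(np.linalg.norm(theta_mle), np.linalg.norm(theta_map))
```

With as many parameters as data points the least-squares fit can interpolate the noise, whereas the regularized solution never has a larger parameter norm.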
Bayesian linear regression pushes the idea of the parameter prior a step further:
It does not compute a point estimate of the parameters;
The full posterior distribution over the parameters is taken into account when making predictions;
Predictions are computed as a mean over all plausible parameter settings (weighted according to the posterior).
Consider the model (∗):
prior: $p(\theta) = \mathcal{N}(m_0, S_0)$,
likelihood: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$.
The full probabilistic model: $p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta)$.
The average predictive distribution according to the prior $p(\theta)$:
$p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta = \mathbb{E}_{\theta}[p(y \mid x, \theta)]$
With prior distribution $p(\theta) = \mathcal{N}(m_0, S_0)$, from
Gaussian is conjugate;
The marginal of a Gaussian is also a Gaussian;
Gaussian noise is independent;
$y = \phi^\top(x)\theta + \epsilon$
we obtain: $p(y \mid x) = \mathcal{N}(\phi^\top(x) m_0, \phi^\top(x) S_0 \phi(x) + \sigma^2)$
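The mean and variance of this prior predictive distribution can be read off directly from the linear model. A short worked derivation, assuming the noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent of $\theta$:

```latex
\begin{align*}
\mathbb{E}[y \mid x] &= \phi^\top(x)\,\mathbb{E}[\theta] + \mathbb{E}[\epsilon]
                      = \phi^\top(x)\, m_0, \\
\mathbb{V}[y \mid x] &= \phi^\top(x)\,\mathrm{Cov}[\theta]\,\phi(x) + \mathbb{V}[\epsilon]
                      = \phi^\top(x)\, S_0\, \phi(x) + \sigma^2 .
\end{align*}
```

Since $y$ is a linear function of the jointly Gaussian pair $(\theta, \epsilon)$, it is itself Gaussian, so these two moments determine $p(y \mid x)$ completely.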
Prior over (noise-free) function:
$p(f(x)) = \mathcal{N}(\phi^\top(x) m_0, \phi^\top(x) S_0 \phi(x))$
Polynomials of degree 5, $f_i(\cdot) = \theta_i^\top \phi(\cdot)$, $\theta_i \sim p(\theta) = \mathcal{N}(0, \tfrac{1}{4} I)$.
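Sampling from the prior over functions can be sketched as follows: draw parameter vectors $\theta_i$ from the prior and evaluate the corresponding polynomials. The evaluation grid, the sample count, and the helper `poly_features` are assumptions made for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Rows (1, x, ..., x^degree) for each input in x."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

degree = 5
m0 = np.zeros(degree + 1)
S0 = 0.25 * np.eye(degree + 1)            # prior N(0, (1/4) I) as on the slide

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 5)            # hypothetical evaluation grid
Phi = poly_features(xs, degree)           # shape (5, 6)

# Each row of `thetas` is one sample theta_i and defines one function f_i.
thetas = rng.multivariate_normal(m0, S0, size=20000)
f_samples = thetas @ Phi.T                # f_samples[i, j] = f_i(xs[j])

# Closed-form prior over function values at the grid points.
prior_mean = Phi @ m0
prior_var = np.einsum("nk,kl,nl->n", Phi, S0, Phi)
```

The empirical mean and variance of `f_samples` along the sample axis should approach `prior_mean` and `prior_var` as the number of samples grows.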
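Posterior distribution: $\mathcal{X} = \{x_n \mid n = 1, \dots, N\}$, $\mathcal{Y} = \{y_n \mid n = 1, \dots, N\}$,
$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)}{p(\mathcal{Y} \mid \mathcal{X})}$.
Marginal likelihood/evidence:
$p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta = \mathbb{E}_{\theta}[p(\mathcal{Y} \mid \mathcal{X}, \theta)]$
Theorem 9.1 (Parameter Posterior).
$p(\theta \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\theta \mid m_N, S_N) = (2\pi)^{-D/2}\, |S_N|^{-1/2} \exp\left(-\tfrac{1}{2} (\theta - m_N)^\top S_N^{-1} (\theta - m_N)\right)$,
$S_N = (S_0^{-1} + \sigma^{-2} \Phi^\top \Phi)^{-1}$,
$m_N = S_N (S_0^{-1} m_0 + \sigma^{-2} \Phi^\top y)$.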
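The two formulas of Theorem 9.1 translate directly into a few lines of linear algebra. The sketch below is one possible implementation; the function name `posterior`, the cubic polynomial features, and the synthetic data (including `theta_true`) are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Return (m_N, S_N) of Theorem 9.1 for prior N(m0, S0) and noise variance sigma2."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi / sigma2
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

rng = np.random.default_rng(0)
N, degree, sigma2 = 30, 3, 0.04
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, degree + 1, increasing=True)
theta_true = np.array([0.5, -1.0, 0.0, 2.0])   # hypothetical ground truth
y = Phi @ theta_true + np.sqrt(sigma2) * rng.normal(size=N)

m0 = np.zeros(degree + 1)
S0 = 0.25 * np.eye(degree + 1)
mN, SN = posterior(Phi, y, m0, S0, sigma2)
print(mN)
print(np.diag(SN))
```

For larger problems one would typically avoid explicit inverses and solve the linear system for `mN` instead; explicit inverses are used here only to mirror the theorem.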
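Proof of Theorem 9.1.
Posterior: $p(\theta \mid \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)}{p(\mathcal{Y} \mid \mathcal{X})}$;
likelihood: $p(\mathcal{Y} \mid \mathcal{X}, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$;
prior: $p(\theta) = \mathcal{N}(m_0, S_0)$.
log-posterior:
$\log p(\theta \mid \mathcal{X}, \mathcal{Y}) = \log \mathcal{N}(y \mid \Phi\theta, \sigma^2 I) + \log \mathcal{N}(\theta \mid m_0, S_0) + \text{const.}$
$= -\tfrac{1}{2} \left(\sigma^{-2} (y - \Phi\theta)^\top (y - \Phi\theta) + (\theta - m_0)^\top S_0^{-1} (\theta - m_0)\right) + \text{const.}$
$= -\tfrac{1}{2} \left(\theta^\top (\sigma^{-2} \Phi^\top \Phi + S_0^{-1})\theta - 2 (\sigma^{-2} \Phi^\top y + S_0^{-1} m_0)^\top \theta\right) + \text{const.}$
$= -\tfrac{1}{2} \left(\theta^\top S_N^{-1} \theta - 2 m_N^\top S_N^{-1} \theta + m_N^\top S_N^{-1} m_N\right) + \text{const.}$
$= -\tfrac{1}{2} (\theta - m_N)^\top S_N^{-1} (\theta - m_N) + \text{const.}$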
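The completing-the-square argument can be checked numerically: the unnormalized log-posterior and the Gaussian quadratic form in $m_N$, $S_N$ should differ only by a constant that does not depend on $\theta$. A minimal sketch on random data (all sizes and values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, sigma2 = 20, 3, 0.5
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)
m0 = rng.normal(size=K)
A = rng.normal(size=(K, K))
S0 = A @ A.T + np.eye(K)            # arbitrary symmetric positive definite prior covariance

S0_inv = np.linalg.inv(S0)
SN_inv = S0_inv + Phi.T @ Phi / sigma2
mN = np.linalg.solve(SN_inv, S0_inv @ m0 + Phi.T @ y / sigma2)

def log_joint(theta):
    """log-likelihood + log-prior, dropping terms that do not depend on theta."""
    r = y - Phi @ theta
    d = theta - m0
    return -0.5 * (r @ r / sigma2 + d @ S0_inv @ d)

def log_gauss_quadratic(theta):
    """-(1/2) (theta - m_N)^T S_N^{-1} (theta - m_N)."""
    d = theta - mN
    return -0.5 * (d @ SN_inv @ d)

thetas = rng.normal(size=(5, K))
gaps = np.array([log_joint(t) - log_gauss_quadratic(t) for t in thetas])
print(gaps)   # all entries should coincide up to rounding
```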
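Posterior predictive distribution:
$p(y \mid \mathcal{X}, \mathcal{Y}, x) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{X}, \mathcal{Y})\, d\theta = \mathbb{E}_{\theta \mid \mathcal{X}, \mathcal{Y}}[p(y \mid x, \theta)]$
$= \int \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)\, \mathcal{N}(\theta \mid m_N, S_N)\, d\theta$
$= \mathcal{N}(y \mid \phi^\top(x) m_N, \phi^\top(x) S_N \phi(x) + \sigma^2)$
Posterior over (noise-free) function:
$p(f(x) \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\phi^\top(x) m_N, \phi^\top(x) S_N \phi(x))$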
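Predictions with error bars follow from these two Gaussians. The sketch below assumes polynomial features of degree 5, the prior $\mathcal{N}(0, \tfrac{1}{4} I)$, and a synthetic sine-shaped data set; the function names and test inputs are hypothetical.

```python
import numpy as np

def features(x, degree):
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

def posterior(Phi, y, m0, S0, sigma2):
    """Parameter posterior (m_N, S_N) as in Theorem 9.1."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

def predict(x_star, mN, SN, sigma2, degree):
    """Predictive mean, noise-free function variance, and predictive variance."""
    Phi_star = features(x_star, degree)
    mean = Phi_star @ mN
    var_f = np.einsum("nk,kl,nl->n", Phi_star, SN, Phi_star)
    return mean, var_f, var_f + sigma2

rng = np.random.default_rng(0)
degree, sigma2 = 5, 0.04
x = rng.uniform(-1.0, 1.0, size=15)
y = np.sin(3.0 * x) + np.sqrt(sigma2) * rng.normal(size=15)

mN, SN = posterior(features(x, degree), y,
                   np.zeros(degree + 1), 0.25 * np.eye(degree + 1), sigma2)
x_star = np.array([0.0, 0.5, 3.0])    # 3.0 lies far outside the training range
mean, var_f, var_y = predict(x_star, mN, SN, sigma2, degree)
print(mean, var_f, var_y)
```

The predictive variance exceeds the function variance by exactly $\sigma^2$, and in this example the uncertainty at the extrapolation point is much larger than inside the training range.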
Posterior over (noise-free) function: polynomials of degree 5, $f_i(\cdot) = \theta_i^\top \phi(\cdot)$, $\theta_i \sim p(\theta) = \mathcal{N}(0, \tfrac{1}{4} I)$.
Polynomials of degree 3, 5 and 7, $f_i(\cdot) = \theta_i^\top \phi(\cdot)$, $\theta_i \sim p(\theta) = \mathcal{N}(0, \tfrac{1}{4} I)$.
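Computing the Marginal Likelihood:
prior: $\theta \sim \mathcal{N}(m_0, S_0)$,
likelihood: $y_n \mid x_n, \theta \sim \mathcal{N}(x_n^\top \theta, \sigma^2)$.
The marginal likelihood:
$p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \theta)\, p(\theta)\, d\theta$
$= \int \mathcal{N}(y \mid X\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid m_0, S_0)\, d\theta$
$\mathbb{E}_{\theta}[\mathcal{Y} \mid \mathcal{X}] = \mathbb{E}_{\theta}[X\theta + \epsilon] = X\, \mathbb{E}_{\theta}[\theta] = X m_0$
$\mathrm{Cov}_{\theta}[\mathcal{Y} \mid \mathcal{X}] = \mathrm{Cov}_{\theta}[X\theta] + \sigma^2 I = X\, \mathrm{Cov}_{\theta}[\theta]\, X^\top + \sigma^2 I = X S_0 X^\top + \sigma^2 I$
$p(\mathcal{Y} \mid \mathcal{X}) = \mathcal{N}(y \mid X m_0, X S_0 X^\top + \sigma^2 I)$.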
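Given this closed form, the log marginal likelihood is a single Gaussian log-density evaluation. The sketch below writes the density out with NumPy; the helper names and the synthetic data are assumptions, and the log-determinant is used instead of the determinant for numerical stability.

```python
import numpy as np

def log_gaussian(v, mean, cov):
    """Log-density of N(v | mean, cov) for a symmetric positive definite cov."""
    d = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(v) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def log_marginal_likelihood(X, y, m0, S0, sigma2):
    """log p(Y | X) = log N(y | X m0, X S0 X^T + sigma2 * I)."""
    N = X.shape[0]
    return log_gaussian(y, X @ m0, X @ S0 @ X.T + sigma2 * np.eye(N))

rng = np.random.default_rng(0)
N, D, sigma2 = 12, 3, 0.1
X = rng.normal(size=(N, D))
m0 = np.zeros(D)
S0 = 0.25 * np.eye(D)
# Generate targets from the model itself: theta ~ prior, then add Gaussian noise.
y = X @ rng.multivariate_normal(m0, S0) + np.sqrt(sigma2) * rng.normal(size=N)

log_evidence = log_marginal_likelihood(X, y, m0, S0, sigma2)
print(log_evidence)
```

Evaluating `log_marginal_likelihood` for several candidate feature sets or prior settings is one common way to compare models, since the evidence integrates out the parameters.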
Maximum Likelihood as Orthogonal Projection
9.4 Maximum Likelihood as Orthogonal Projection
$y = x\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$; suppose first that $\sigma^2$ is known.
Given a training set $\mathcal{D} := \{(x_1, y_1), \dots, (x_N, y_N)\}$.
Maximum likelihood estimator:
$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y = \frac{X^\top y}{X^\top X} \in \mathbb{R}$
where $X = (x_1, \dots, x_N)^\top \in \mathbb{R}^N$, $y = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$.
Optimal (maximum likelihood) reconstruction of the training targets for the training inputs $X$:
$X\theta_{\mathrm{ML}} = X \frac{X^\top y}{X^\top X} = \frac{X X^\top}{X^\top X}\, y$
⇒ the orthogonal projection of $y$ onto the one-dimensional subspace of $\mathbb{R}^N$ spanned by $X$.
$\theta_{\mathrm{ML}}$ is the coordinate of the projection with respect to the basis $X$.
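The projection interpretation is easy to verify numerically in the one-dimensional case. A minimal sketch on synthetic data (the slope 1.5 and the noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
X = rng.normal(size=N)              # inputs stacked into one vector in R^N
y = 1.5 * X + 0.3 * rng.normal(size=N)

theta_ml = (X @ y) / (X @ X)        # scalar maximum likelihood estimate
y_hat = X * theta_ml                # reconstruction X * theta_ML

P = np.outer(X, X) / (X @ X)        # projection matrix X X^T / (X^T X)
residual = y - y_hat
print(theta_ml, X @ residual)       # the residual is orthogonal to X up to rounding
```

Here `P @ y` reproduces `y_hat`, and `P` is symmetric and idempotent, which are the defining properties of an orthogonal projection matrix.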
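General case: $y = \phi^\top(x)\theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Maximum likelihood estimator
$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y$.
Maximum likelihood reconstruction of the training targets
$\Phi\theta_{\mathrm{ML}} = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top y$
When $\Phi^\top \Phi = I$,
$\Phi\theta_{\mathrm{ML}} = \Phi \Phi^\top y = \left(\sum_{k=1}^{K} \phi_k \phi_k^\top\right) y$,
where $\phi_k$ denotes the $k$-th column of $\Phi$.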
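Both forms of the projection can be compared numerically. In the sketch below an orthonormal basis of the column space of $\Phi$ is obtained with a reduced QR factorization, which is an assumption made here to produce a matrix with orthonormal columns; the data are random.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 3
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)

theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ theta_ml              # projection of y onto the column space of Phi

# Orthonormal basis Q of the same column space (Q^T Q = I), via reduced QR.
Q, _ = np.linalg.qr(Phi)
# Sum of rank-one projections onto the orthonormal basis vectors.
y_hat_q = sum(np.outer(Q[:, k], Q[:, k]) for k in range(K)) @ y

print(np.max(np.abs(y_hat - y_hat_q)))
```

Because `Q` spans the same subspace as `Phi`, both expressions give the same projection, and the residual `y - y_hat` is orthogonal to every column of `Phi`.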
9.4 Logistic regression and deep neural network
Logistic regression: $y = \sigma(f(x)) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where
$f(x) = \phi^\top(x)\theta$;
$\sigma(f) = \frac{1}{1 + \exp(-f)}$ is the logistic sigmoid function.
Maximum likelihood estimation: $\min_{\theta} L(\theta) \triangleq \|y - \sigma(\Phi\theta)\|_2^2$
Deep neural network: $y = f(x; W, b) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $f(x; W, b)$ is defined as follows
$x^{(0)} = x$,
$x^{(l)} = \sigma^{(l)}(W^{(l)} x^{(l-1)} + b^{(l)})$, $l = 1, 2, \dots, L$,
$f(x; W, b) = x^{(L)}$,
in which $W^{(l)} \in \mathbb{R}^{k_l \times k_{l-1}}$, $k_0 = d$ and $k_L = 1$.
Maximum likelihood estimation:
$\min_{W, b} L(W, b) \triangleq \sum_{n=1}^{N} (y_n - f(x_n; W, b))^2$.
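The layer recursion and the squared-error objective can be sketched in a few lines. The network shape, the tanh hidden activation, the identity output activation, and the random data below are all assumptions for illustration; the slides leave the activations $\sigma^{(l)}$ unspecified.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """x^(0) = x;  x^(l) = sigma^(l)(W^(l) x^(l-1) + b^(l));  returns x^(L)."""
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)
    return h

def squared_loss(xs, ys, weights, biases, activations):
    """sum_n (y_n - f(x_n; W, b))^2 for a scalar network output (k_L = 1)."""
    preds = np.array([forward(x, weights, biases, activations)[0] for x in xs])
    return float(np.sum((ys - preds) ** 2))

rng = np.random.default_rng(0)
d, hidden = 2, 4                          # layer widths k_0 = d, k_1 = 4, k_2 = 1
weights = [rng.normal(size=(hidden, d)), rng.normal(size=(1, hidden))]
biases = [np.zeros(hidden), np.zeros(1)]
activations = [np.tanh, lambda z: z]      # hypothetical choice of sigma^(1), sigma^(2)

xs = rng.normal(size=(10, d))
ys = rng.normal(size=10)
loss = squared_loss(xs, ys, weights, biases, activations)

# Logistic regression as the special case L = 1 with a sigmoid activation and phi(x) = x.
theta = np.array([0.5, -1.0])
logistic_out = forward(xs[0], [theta[None, :]], [np.zeros(1)], [sigmoid])
```

Minimizing `squared_loss` over the weights and biases would require a gradient-based optimizer, which is outside the scope of this sketch.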
Exercises
Derive the MLE under Laplacian noise $\epsilon \sim \mathcal{L}(0, b)$.
Derive the MLE under uniform noise $\epsilon \sim \mathcal{U}[-a, a]$.
Derive the MAPE with Laplacian prior $p(\theta_i) = \mathcal{L}(0, b)$ under Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Prove that $\theta_{\mathrm{MAP}} = m_N$.
Give the closed-form solution of the $L_1$-regularized least squares problem $\min_{\theta} \{\|y - \theta\|_2^2 + \lambda\|\theta\|_1\}$.