
A presentation on linear regression: problem formulation, parameter estimation, and the Bayesian approach

Linear Regression
Libin Jiao
Dalian University of Technology
November 7, 2024
Outline
1. Problem formulation
2. Parameter Estimation
3. Bayesian Linear Regression
4. Maximum Likelihood as Orthogonal Projection

9.1 Problem formulation
Given a set of training inputs x_n and corresponding noisy observations
y_n = f(x_n) + ε,
where ε is an i.i.d. random variable that describes measurement/observation noise.
[Figure: (a) Dataset. (b) Possible solution.]
The task is to infer the function f that generated the data and that generalizes well to function values at new input locations.

9.1 Problem formulation
Regression is a basic problem in ML:
Prediction and control
Classification
Time-series analysis
Deep learning
Finding a regression function:
Choice of the model (type) and the parametrization.
Finding (estimating) good parameters.
Over-fitting and model selection.
Relationship between loss functions and parameter priors.
Uncertainty modeling.
Finding optimal model parameters:
Maximum likelihood estimation
Maximum a posteriori (MAP) estimation
Generalization errors and over-fitting
Bayesian linear regression

9.1 Problem formulation
The functional relationship between x and y is given as
y = f(x) + ε,
where ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) Gaussian measurement noise with mean 0 and variance σ².
The likelihood function:
p(y | x) = N(y | f(x), σ²).
We focus on parametric models, i.e., we choose a parameterized function and find parameters θ that "work well" for modeling the data.

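As a concrete illustration of this noise model, the following minimal NumPy sketch generates a training set of noisy observations y_n = f(x_n) + ε with ε ∼ N(0, σ²); the specific test function and settings are borrowed from Example 9.5 later in the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def f_true(x):
        # hypothetical "true" function, taken from Example 9.5 later in the slides
        return -np.sin(x / 5.0) + np.cos(x)

    N, sigma = 10, 0.2
    x = rng.uniform(-5.0, 5.0, size=N)               # training inputs x_n
    y = f_true(x) + rng.normal(0.0, sigma, size=N)   # noisy observations y_n = f(x_n) + eps
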
9.1 Problem formulation
Parametric models: f(x) is a parameterized function with parameters θ, and can be denoted by f(x; θ).
Linear regression: the parameters θ appear linearly in the model:
f(x; θ) = x⊺θ;
f(x; θ) = φ⊺(x)θ, where φ is a nonlinear transformation (features).
Example 9.1:
[Figure: (a) Example functions. (b) Training set. (c) MLE.]

9.2 Parameter Estimation
Linear in θ and in x: p(y | x, θ) = N(y | x⊺θ, σ²) ⇔ y = x⊺θ + ε, ε ∼ N(0, σ²); suppose first that σ² is known.
Given a training set D := {(x_1, y_1), . . . , (x_N, y_N)}, x_n ∈ R^D and y_n ∈ R (n = 1, 2, . . . , N).
Likelihood:
p(Y | X, θ) = p(y_1, . . . , y_N | x_1, . . . , x_N, θ) = ∏_{n=1}^N p(y_n | x_n, θ) = ∏_{n=1}^N N(y_n | x_n⊺θ, σ²)
[Figure: graphical model.]

9.2 Parameter Estimation
Maximum Likelihood Estimation:
θ_ML = arg max_θ p(Y | X, θ)   (maximize the likelihood)
= arg min_θ {− log p(Y | X, θ)}   (minimize the negative log-likelihood)
= arg min_θ {− log ∏_{n=1}^N p(y_n | x_n, θ)}
= arg min_θ {− ∑_{n=1}^N log N(y_n | x_n⊺θ, σ²)}   ← N(y_n | x_n⊺θ, σ²) = (1/√(2πσ²)) exp(−(y_n − x_n⊺θ)²/(2σ²))
= arg min_θ {(1/(2σ²)) ∑_{n=1}^N (y_n − x_n⊺θ)² + const.}
= arg min_θ L(θ)   (negative log-likelihood)

9.2 Parameter Estimation
Negative log-likelihood:
L(θ) := (1/(2σ²)) ∑_{n=1}^N (y_n − x_n⊺θ)² = (1/(2σ²)) (y − Xθ)⊺(y − Xθ) = (1/(2σ²)) ∥y − Xθ∥₂².
Maximum likelihood estimator (it does not depend on σ²):
θ_ML = (X⊺X)⁻¹X⊺y,
with design matrix X := [x_1, . . . , x_N]⊺ ∈ R^{N×D} and y := [y_1, . . . , y_N]⊺ ∈ R^N.
Example 9.2 (Fitting Lines):
[Figure: (a) Example functions. (b) Training set. (c) MLE.]

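A minimal sketch of the estimator θ_ML = (X⊺X)⁻¹X⊺y in NumPy; a least-squares solver is used instead of forming the inverse explicitly, which is equivalent when X has full column rank (the toy data below are illustrative, not from the slides):

    import numpy as np

    def mle_linear(X, y):
        # theta_ML = (X^T X)^{-1} X^T y, computed by solving min ||y - X theta||^2
        theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
        return theta_ml

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                     # design matrix, N = 20, D = 3
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=20)   # noisy targets
    print(mle_linear(X, y))                          # close to theta_true
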
9.2 Parameter Estimation
Exercise
Do MLE under the Laplacian noise distribution ε ∼ L(0, b).
Do MLE under the uniform noise distribution ε ∼ U[−a, a].

9.2 Parameter Estimation
Maximum Likelihood Estimation with Features:
p(y | x, θ) = N(y | φ⊺(x)θ, σ²) ⇔ y = φ⊺(x)θ + ε = ∑_{k=0}^{K−1} θ_k φ_k(x) + ε,
where φ : R^D → R^K and φ(x_n) is the feature vector of x_n.
Negative log-likelihood: L(θ) = (1/(2σ²)) (y − Φθ)⊺(y − Φθ) = (1/(2σ²)) ∥y − Φθ∥₂²
Maximum likelihood estimator: θ_ML = (Φ⊺Φ)⁻¹Φ⊺y
The feature matrix (design matrix):
Φ := [φ⊺(x_1); . . . ; φ⊺(x_N)] =
[ φ_0(x_1)   ⋯   φ_{K−1}(x_1) ]
[ φ_0(x_2)   ⋯   φ_{K−1}(x_2) ]
[    ⋮       ⋯        ⋮       ]
[ φ_0(x_N)   ⋯   φ_{K−1}(x_N) ]   ∈ R^{N×K}

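A sketch of the feature-based estimator θ_ML = (Φ⊺Φ)⁻¹Φ⊺y for the polynomial features of Examples 9.3 and 9.4 below (scalar inputs assumed; np.vander builds the design matrix Φ):

    import numpy as np

    def poly_features(x, K):
        # design matrix Phi with rows phi(x_n)^T = (1, x_n, ..., x_n^{K-1}); shape (N, K)
        return np.vander(x, N=K, increasing=True)

    def mle_with_features(x, y, K):
        # theta_ML = (Phi^T Phi)^{-1} Phi^T y, via a least-squares solve
        Phi = poly_features(x, K)
        theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return theta_ml, Phi
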
9.2 Parameter Estimation
Example 9.3 (polynomial regression):
φ(x) = [φ_0(x), φ_1(x), . . . , φ_{K−1}(x)]⊺ = [1, x, x², . . . , x^{K−1}]⊺ ∈ R^K.
Example 9.4 (feature matrix for second-order polynomials):
Φ := [ 1   x_1   x_1² ]
     [ 1   x_2   x_2² ]
     [ ⋮    ⋮     ⋮   ]
     [ 1   x_N   x_N² ]

9.2 Parameter Estimation
Example 9.5 (maximum likelihood polynomial fit):
N = 10, x_n ∼ U[−5, 5], y_n = −sin(x_n/5) + cos(x_n) + ε, ε ∼ N(0, 0.2²).
Fit a polynomial of degree 4 using maximum likelihood estimation.

9.2 Parameter Estimation
Estimating the Noise Variance:
log p(Y | X, θ, σ²) = ∑_{n=1}^N log N(y_n | φ⊺(x_n)θ, σ²)
= ∑_{n=1}^N (−(1/2) log(2π) − (1/2) log σ² − (1/(2σ²)) (y_n − φ⊺(x_n)θ)²)
= −(N/2) log σ² − (1/(2σ²)) s + const.,   where s := ∑_{n=1}^N (y_n − φ⊺(x_n)θ)².
∂ log p(Y | X, θ, σ²)/∂σ² = −N/(2σ²) + s/(2σ⁴) = 0  ⇔  σ² = s/N.
∂² log p(Y | X, θ, σ²)/∂(σ²)² = N/(2σ⁴) − s/σ⁶ = 0  ⇔  σ² = 2s/N;
in particular, at σ² = s/N the second derivative equals −N/(2σ⁴) < 0, so σ² = s/N is indeed a maximum.
σ²_ML = s/N = (1/N) ∑_{n=1}^N (y_n − φ⊺(x_n)θ)².

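The ML estimate of the noise variance is just the mean squared residual of the fitted model; a one-function sketch (Phi and theta are assumed to come from a fit like the one sketched earlier):

    import numpy as np

    def noise_variance_ml(Phi, y, theta):
        # sigma^2_ML = (1/N) sum_n (y_n - phi(x_n)^T theta)^2
        residuals = y - Phi @ theta
        return np.mean(residuals ** 2)
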
9.2 Parameter Estimation
Evaluate the quality of the model:
The negative log-likelihood: (1/(2σ²)) ∥y − Φθ∥₂²
The squared-error-loss function: ∥y − Φθ∥₂²
The root mean square error (RMSE):
√((1/N) ∥y − Φθ∥₂²) = √((1/N) ∑_{n=1}^N (y_n − φ⊺(x_n)θ)²)
(a) Allows us to compare errors of datasets with different sizes;
(b) Has the same scale and units as the observed function values y_n.
For model selection, we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial.

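A sketch of RMSE-based model selection over polynomial degrees; the degree range and the train/test split are illustrative choices, not taken from the slides:

    import numpy as np

    def rmse(Phi, y, theta):
        # sqrt((1/N) ||y - Phi theta||^2)
        return np.sqrt(np.mean((y - Phi @ theta) ** 2))

    def select_degree(x_train, y_train, x_test, y_test, max_degree=9):
        # fit polynomials of increasing degree M and pick the one with the lowest test RMSE
        scores = {}
        for M in range(max_degree + 1):
            Phi_tr = np.vander(x_train, N=M + 1, increasing=True)
            theta, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
            Phi_te = np.vander(x_test, N=M + 1, increasing=True)
            scores[M] = rmse(Phi_te, y_test, theta)
        return min(scores, key=scores.get), scores
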
9.2 Parameter Estimation
[Figure: maximum likelihood polynomial fits for N = 10 and various degrees M.]

9.2 Parameter Estimation
Over-fitting for large M!
The RMSE on the training data alone is not enough ⇐ use test data!
Generalization performance:
Evaluate the RMSE for both the training data and the test data (200 data points).
The best generalization is obtained for a polynomial of degree M = 4.

9.2 Parameter Estimation
Maximum A Posteriori (MAP) Estimation:
Maximum likelihood estimation is prone to over-fitting.
It is observed that the magnitude of the parameter values becomes relatively large if we run into over-fitting.
To mitigate the effect of huge parameter values, we can place a prior distribution p(θ) on the parameters.
Maximize the posterior distribution (probability):
p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X).
Equivalently, minimize the negative log-posterior:
θ_MAP = arg min_θ {− log p(Y | X, θ) − log p(θ) + const.}.

9.2 Parameter Estimation
If p(y | x, θ) = N(y | φ⊺(x)θ, σ²) and p(θ) = N(0, b²I), then
− log p(θ | X, Y) = (1/(2σ²)) (y − Φθ)⊺(y − Φθ) + (1/(2b²)) θ⊺θ + const.
= (1/(2σ²)) (∥y − Φθ∥₂² + (σ²/b²) ∥θ∥₂²) + const.
θ_MAP = (Φ⊺Φ + (σ²/b²) I)⁻¹ Φ⊺y.

9.2 Parameter Estimation
MAP Estimation as Regularization:
L2-regularized (Tikhonov regularization) least squares, also known as ridge regression:
min_θ {∥y − Φθ∥₂² + λ∥θ∥₂²}
With λ = σ²/b², this is MAP estimation with prior p(θ) = N(0, b²I).
θ_RLS = (Φ⊺Φ + λI)⁻¹ Φ⊺y.

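A sketch of the regularized estimator θ_RLS = (Φ⊺Φ + λI)⁻¹Φ⊺y (equivalently θ_MAP with λ = σ²/b²); np.linalg.solve is used rather than an explicit inverse:

    import numpy as np

    def ridge(Phi, y, lam):
        # theta_RLS = (Phi^T Phi + lam * I)^{-1} Phi^T y
        K = Phi.shape[1]
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)
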
9.2 Parameter Estimation
Exercise
Do MAP estimation with Laplacian prior p(θ_i) = L(0, b) under Gaussian noise ε ∼ N(0, σ²).

9.2 Parameter Estimation
L1-regularized least squares (LASSO, least absolute shrinkage and selection operator; a sparsity-inducing regularization):
min_θ {∥y − Φθ∥₂² + λ∥θ∥₁} ⇒ no closed-form solution in general! What about a diagonal Φ?
It is a convex relaxation of the L0 regularization.

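For the hint above: if Φ⊺Φ = I (e.g. Φ = I), the L1 problem decouples over the coordinates and is solved by soft-thresholding; the sketch below covers only that special case (for a general Φ one resorts to iterative solvers such as coordinate descent or proximal gradient methods):

    import numpy as np

    def soft_threshold(y, lam):
        # coordinate-wise minimizer of ||y - theta||_2^2 + lam * ||theta||_1:
        # theta_i = sign(y_i) * max(|y_i| - lam/2, 0)
        return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

When Φ⊺Φ = I, completing the square shows that the minimizer of ∥y − Φθ∥₂² + λ∥θ∥₁ is soft_threshold(Φ⊺y, λ).
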
9.2 Parameter Estimation
L0-regularized least squares (sparsity pursuit):
min_θ {∥y − Φθ∥₂² + λ∥θ∥₀}
Lp-regularized least squares (0 < p < 1, sparsity inducing):
min_θ {∥y − Φθ∥₂² + λ∥θ∥_p^p}   What about p = 1/2?

9.2 Parameter Estimation
[Figure: L2 vs. L1 regularization; Lp-balls.]

9.3 Bayesian Linear Regression
MLE (e.g., least squares):
min_θ {∥y − Φθ∥₂²}
can lead to severe overfitting, in particular in the small-data regime.
MAP estimation (e.g., L2-regularized least squares):
min_θ {∥y − Φθ∥₂² + λ∥θ∥₂²}
addresses this issue by placing a prior on the parameters.

9.3 Bayesian Linear Regression
Bayesian linear regression pushes the idea of the parameter prior a step further:
It does not compute a point estimate of the parameters;
The full posterior distribution over the parameters is taken into account when making predictions;
Predictions average over all plausible parameter settings (according to the posterior).

9.3 Bayesian Linear Regression
Consider the model y = φ⊺(x)θ + ε with
(∗)   prior:       p(θ) = N(m_0, S_0),
      likelihood:  p(y | x, θ) = N(y | φ⊺(x)θ, σ²).
The full probabilistic model: p(y, θ | x) = p(y | x, θ) p(θ).
The average predictive distribution according to the prior p(θ):
p(y | x) = ∫ p(y | x, θ) p(θ) dθ = E_θ[p(y | x, θ)]
With the prior distribution p(θ) = N(m_0, S_0), using that
the Gaussian prior is conjugate;
the marginal of a Gaussian is also a Gaussian;
the Gaussian noise is independent;
y = φ⊺(x)θ + ε,
we obtain: p(y | x) = N(φ⊺(x)m_0, φ⊺(x)S_0φ(x) + σ²)

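A sketch that evaluates the prior predictive p(y | x) = N(φ⊺(x)m_0, φ⊺(x)S_0φ(x) + σ²) on a grid and draws a few functions from the prior; the degree-5 polynomial features and S_0 = (1/4)I follow the example on the next slide, while the value of sigma is an assumption:

    import numpy as np

    rng = np.random.default_rng(1)
    K, sigma = 6, 0.2                              # degree-5 polynomial -> K = 6 features; sigma assumed
    m0, S0 = np.zeros(K), 0.25 * np.eye(K)         # prior p(theta) = N(0, (1/4) I)

    x_plot = np.linspace(-5.0, 5.0, 200)
    Phi = np.vander(x_plot, N=K, increasing=True)  # rows phi(x)^T

    pred_mean = Phi @ m0                                     # phi(x)^T m0
    pred_var = np.sum((Phi @ S0) * Phi, axis=1) + sigma**2   # phi(x)^T S0 phi(x) + sigma^2

    thetas = rng.multivariate_normal(m0, S0, size=5)         # theta_i ~ N(m0, S0)
    f_samples = Phi @ thetas.T                               # columns are sampled functions f_i(x)
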
9.3 Bayesian Linear Regression
Prior over (noise-free) functions:
p(f(x)) = N(φ⊺(x)m_0, φ⊺(x)S_0φ(x))
[Figure: polynomials of degree 5, f_i(⋅) = θ_i⊺φ(⋅), θ_i ∼ p(θ) = N(0, (1/4)I).]

9.3 Bayesian Linear Regression
Posterior distribution: X = {x_n | n = 1, . . . , N}, Y = {y_n | n = 1, . . . , N},
p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X).
Marginal likelihood / evidence:
p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ = E_θ[p(Y | X, θ)]
Theorem 9.1 (Parameter Posterior).
p(θ | X, Y) = N(θ | m_N, S_N) = (2π)^{−D/2} |S_N|^{−1/2} exp(−(1/2)(θ − m_N)⊺S_N⁻¹(θ − m_N)),
S_N = (S_0⁻¹ + σ⁻²Φ⊺Φ)⁻¹,
m_N = S_N (S_0⁻¹m_0 + σ⁻²Φ⊺y).

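Theorem 9.1 transcribes directly into code; a sketch computing m_N and S_N (np.linalg.inv is used for readability; a Cholesky-based solve would be preferable numerically):

    import numpy as np

    def parameter_posterior(Phi, y, m0, S0, sigma2):
        # S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1},  m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y)
        S0_inv = np.linalg.inv(S0)
        SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
        mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)
        return mN, SN
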
9.3 Bayesian Linear Regression
Proof of Theorem 9.1.
Posterior: p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X);
likelihood: p(Y | X, θ) = N(y | Φθ, σ²I);
prior: p(θ) = N(m_0, S_0).
Log-posterior:
log p(θ | X, Y) = log N(y | Φθ, σ²I) + log N(θ | m_0, S_0) + const.
= −(1/2)(σ⁻²(y − Φθ)⊺(y − Φθ) + (θ − m_0)⊺S_0⁻¹(θ − m_0)) + const.
= −(1/2)(θ⊺(σ⁻²Φ⊺Φ + S_0⁻¹)θ − 2(σ⁻²Φ⊺y + S_0⁻¹m_0)⊺θ) + const.
= −(1/2)(θ⊺S_N⁻¹θ − 2m_N⊺S_N⁻¹θ + m_N⊺S_N⁻¹m_N) + const.
= −(1/2)(θ − m_N)⊺S_N⁻¹(θ − m_N) + const.

9.3 Bayesian Linear Regression
Posterior predictive distribution:
p(y | X, Y, x) = ∫ p(y | x, θ) p(θ | X, Y) dθ = E_{θ|X,Y}[p(y | x, θ)]
= ∫ N(y | φ⊺(x)θ, σ²) N(θ | m_N, S_N) dθ
= N(y | φ⊺(x)m_N, φ⊺(x)S_Nφ(x) + σ²)
Posterior over (noise-free) functions:
p(f(x) | X, Y) = N(φ⊺(x)m_N, φ⊺(x)S_Nφ(x))

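And the posterior predictive for a single test input, as a sketch (phi_x is the feature vector φ(x); mN and SN as computed in the earlier posterior sketch):

    import numpy as np

    def posterior_predictive(phi_x, mN, SN, sigma2):
        # p(y | x, X, Y) = N(phi(x)^T m_N, phi(x)^T S_N phi(x) + sigma^2)
        mean = phi_x @ mN
        var = phi_x @ SN @ phi_x + sigma2
        return mean, var
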
9.3 Bayesian Linear Regression
[Figure: posterior over (noise-free) functions; polynomials of degree 5, f_i(⋅) = θ_i⊺φ(⋅), θ_i ∼ p(θ) = N(0, (1/4)I).]

9.3 Bayesian Linear Regression
[Figure: polynomials of degree 3, 5 and 7, f_i(⋅) = θ_i⊺φ(⋅), θ_i ∼ p(θ) = N(0, (1/4)I).]

9.3 Bayesian Linear Regression
Computing the Marginal Likelihood:
prior:       θ ∼ N(m_0, S_0),
likelihood:  y_n | x_n, θ ∼ N(x_n⊺θ, σ²).
The marginal likelihood:
p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ = ∫ N(y | Xθ, σ²I) N(θ | m_0, S_0) dθ
E_θ[Y | X] = E_θ[Xθ + ε] = X E_θ[θ] = Xm_0
Cov_θ[Y | X] = Cov_θ[Xθ] + σ²I = X Cov_θ[θ] X⊺ + σ²I = XS_0X⊺ + σ²I
p(Y | X) = N(y | Xm_0, XS_0X⊺ + σ²I).

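A sketch of the log marginal likelihood log N(y | Xm_0, XS_0X⊺ + σ²I), evaluated with a log-determinant and a linear solve rather than an explicit density call:

    import numpy as np

    def log_marginal_likelihood(X, y, m0, S0, sigma2):
        # log p(Y | X) = log N(y | X m0, X S0 X^T + sigma^2 I)
        N = y.shape[0]
        cov = X @ S0 @ X.T + sigma2 * np.eye(N)
        diff = y - X @ m0
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)
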
9.4 Maximum Likelihood as Orthogonal Projection
Consider y = xθ + ε, ε ∼ N(0, σ²), with scalar θ; suppose first that σ² is known.
Given a training set D := {(x_1, y_1), . . . , (x_N, y_N)}.
Maximum likelihood estimator:
θ_ML = (X⊺X)⁻¹X⊺y = X⊺y / (X⊺X) ∈ R,
where X = (x_1, . . . , x_N)⊺ ∈ R^N and y = (y_1, . . . , y_N)⊺ ∈ R^N.
Optimal (maximum likelihood) reconstruction of the training targets for the training inputs X:
Xθ_ML = X (X⊺y)/(X⊺X) = (XX⊺/(X⊺X)) y
⇒ the orthogonal projection of y onto the one-dimensional subspace of R^N spanned by X.
θ_ML is the coordinate of this projection with respect to the basis X.

9.4 Maximum Likelihood as Orthogonal Projection
General case: y = φ⊺(x)θ + ε, ε ∼ N(0, σ²).
Maximum likelihood estimator:
θ_ML = (Φ⊺Φ)⁻¹Φ⊺y.
Maximum likelihood reconstruction of the training targets:
Φθ_ML = Φ(Φ⊺Φ)⁻¹Φ⊺y
When Φ⊺Φ = I,
Φθ_ML = ΦΦ⊺y = (∑_{k=1}^K φ_k φ_k⊺) y,
where φ_k denotes the k-th column of Φ.

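A small numerical check of the projection view (random data, purely illustrative): Φθ_ML coincides with the orthogonal projection of y onto the column space of Φ, and the residual is orthogonal to that subspace:

    import numpy as np

    rng = np.random.default_rng(2)
    N, K = 20, 4
    Phi = rng.normal(size=(N, K))
    y = rng.normal(size=N)

    theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    y_hat = Phi @ theta_ml                              # reconstruction Phi theta_ML
    P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T        # projection onto span of the columns of Phi

    print(np.allclose(y_hat, P @ y))                    # True
    print(np.allclose(Phi.T @ (y - y_hat), 0.0))        # True: residual orthogonal to the subspace
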
9.4 Logistic regression and deep neural networks
Logistic regression: y = σ(f(x)) + ε, ε ∼ N(0, σ²), where
f(x) = φ⊺(x)θ;
σ(f) = 1/(1 + exp(−f)) is the logistic sigmoid function.
Maximum likelihood estimation: min_θ L(θ) ≜ ∥y − σ(Φθ)∥₂²
Deep neural network: y = f(x; W, b) + ε, ε ∼ N(0, σ²), where f(x) is defined as follows:
x^(0) = x,
x^(l) = σ^(l)(W^(l) x^(l−1) + b^(l)),   l = 1, 2, . . . , L,
f(x; W, b) = x^(L),
in which W^(l) ∈ R^{k_l × k_{l−1}}, k_0 = d and k_L = 1.
Maximum likelihood estimation:
min_{W,b} L(W, b) ≜ ∑_{n=1}^N (y_n − f(x_n; W, b))².

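A sketch of the forward pass f(x; W, b) defined above; the layer widths are illustrative, and the logistic sigmoid is used as the activation σ^(l) in every layer for simplicity:

    import numpy as np

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    def forward(x, Ws, bs):
        # x^(0) = x;  x^(l) = sigma(W^(l) x^(l-1) + b^(l)), l = 1, ..., L;  f(x; W, b) = x^(L)
        h = x
        for W, b in zip(Ws, bs):
            h = sigmoid(W @ h + b)
        return h

    rng = np.random.default_rng(3)
    Ws = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]   # d = 3 inputs, hidden width 5, k_L = 1
    bs = [np.zeros(5), np.zeros(1)]
    print(forward(rng.normal(size=3), Ws, bs))                # scalar-valued output f(x; W, b)
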
Exercises
Do MLE under the Laplacian noise distribution ε ∼ L(0, b).
Do MLE under the uniform noise distribution ε ∼ U[−a, a].
Do MAP estimation with Laplacian prior p(θ_i) = L(0, b) under Gaussian noise ε ∼ N(0, σ²).
Prove that θ_MAP = m_N.
Give the closed-form solution of the L1-regularized least squares problem
min_θ {∥y − θ∥₂² + λ∥θ∥₁}.