Introduction to Machine Learning
CSE474/574: Linear Regression
Varun Chandola <chandola@buffalo.edu>
Outline
1. Inference in Machine Learning
2. Joint Distributions
3. Linear Gaussian Systems
   - Example
4. Linear Regression
   - Problem Formulation
   - Geometric Interpretation
   - Learning Parameters
5. Recap
   - Issues with Linear Regression
6. Handling Non-linear Relationships
   - Handling Overfitting via Regularization
7. Bayesian Regression
   - Estimating Bayesian Regression Parameters
   - Prediction with Bayesian Regression
What is Inference?

Definition: Given a joint distribution p(x1, x2), how to compute the marginals, p(x1), and conditionals, p(x1|x2).

Discrete Random Variables
X = length of a word, Y = number of vowels in the word

Joint distribution p(x, y), with the marginals in the last row and column:

              x=2     x=3     x=4     x=5   |  Σ_x p(x,y)
  y=0         0       0.03    0       0     |  0.03
  y=1         0.34    0.16    0.30    0     |  0.80
  y=2         0       0       0.03    0.14  |  0.17
  Σ_y p(x,y)  0.34    0.19    0.33    0.14  |
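As a quick numeric check, here is a minimal numpy sketch that re-enters the table above and computes the marginals and one conditional (the array values simply copy the table):

```python
import numpy as np

# Joint p(x, y): rows are y = 0, 1, 2; columns are x = 2, 3, 4, 5
# (values re-entered from the table above).
p_xy = np.array([
    [0.00, 0.03, 0.00, 0.00],   # y = 0
    [0.34, 0.16, 0.30, 0.00],   # y = 1
    [0.00, 0.00, 0.03, 0.14],   # y = 2
])

p_x = p_xy.sum(axis=0)               # marginal p(x): [0.34, 0.19, 0.33, 0.14]
p_y = p_xy.sum(axis=1)               # marginal p(y): [0.03, 0.80, 0.17]
p_y_given_x3 = p_xy[:, 1] / p_x[1]   # conditional p(y | x = 3)
print(p_x, p_y, p_y_given_x3)
```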
Joint Distributions
- Multiple random variables may be jointly modeled
- Joint probability distribution: p(x1, x2, . . . , xD)
- For example, x = (x1, x2, . . . , xD) follows a joint (multivariate) Gaussian distribution if p(x) = N(µ, Σ)
- Marginal Distribution: probability distribution of one (or more) variable(s), "marginalized" over all others
- Conditional Distribution: probability distribution of one (or more) variable(s), given a particular value taken by the others
Inference for Multivariate Normal Random Variables

Let x = (x1, x2) be jointly Gaussian with parameters:

µ = [µ1; µ2],  Σ = [Σ11  Σ12; Σ21  Σ22],  Λ = Σ^{-1} = [Λ11  Λ12; Λ21  Λ22]

Both marginals and conditionals are also Gaussian!

p(x1) = N(µ1, Σ11)
p(x1|x2) = N(µ_{1|2}, Σ_{1|2})

where

µ_{1|2} = Σ_{1|2} (Λ11 µ1 - Λ12 (x2 - µ2))
Σ_{1|2} = Λ11^{-1}
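A minimal numpy sketch of the conditioning formulas above; the parameter values are made up for illustration:

```python
import numpy as np

# Partitioned 2-d Gaussian x = (x1, x2); example parameters are made up.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)      # precision matrix Λ = Σ^{-1}

x2 = 3.0                        # observed value of x2

# Conditional p(x1 | x2) = N(µ_{1|2}, Σ_{1|2}) via the formulas above
Sigma_1g2 = 1.0 / Lam[0, 0]                                       # Λ11^{-1}
mu_1g2 = Sigma_1g2 * (Lam[0, 0] * mu[0] - Lam[0, 1] * (x2 - mu[1]))
print(mu_1g2, Sigma_1g2)        # 1.8, 1.36 for these numbers
```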
Example: Locating an Airplane

[Figure: scatter plot of radar blips in the latitude-longitude plane (both axes 0 to 5); successive frames reveal more noisy blips around the airplane's true location.]
Bayesian Approach
- Let the exact but unknown location be a (2-d) random variable that takes the value y
- Each radar blip is a random variable as well, taking the value x
- We assume that x is generated from y (by adding noise)
- However, we can only observe x
- Can we estimate the true location y, given the noisy observations?

Assumptions
1. y has a Gaussian prior: y ∼ N(µ0, Σ0)
2. Each xi ∼ N(y, Σx), i.e., each radar blip is a noisy version of the true location

Problem Statement: Given samples x1, . . . , xN, infer the posterior for y.
[Figure: the same radar-blip plot with the prior mean µ0 marked; the Gaussian prior over the location is centered at µ0.]
Linear Gaussian Systems

What if each x is a noisy linear function of a hidden variable y?

p(y) = N(µ_y, Σ_y)
p(x|y) = N(Ay + b, Σ_x)

- A is a matrix and b is a vector
- x is a linear combination of y (plus Gaussian noise)
- Inference Problem: can we infer y given x?

Example
- In the univariate case, x can be a noisy measurement of an actual quantity y
- x can be measurements from many sensors, and y the actual phenomenon being measured
- x and y can be vectors of different lengths (sensor fusion)
Applying Bayes Rule

Bayes Rule for Linear Gaussian Systems:

p(y|x) = N(µ_{y|x}, Σ_{y|x})
Σ_{y|x}^{-1} = Σ_y^{-1} + A^T Σ_x^{-1} A
µ_{y|x} = Σ_{y|x} [A^T Σ_x^{-1} (x - b) + Σ_y^{-1} µ_y]
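A small numpy sketch applying these formulas directly; A, b, the covariances, and the observed x are all made-up example values:

```python
import numpy as np

# Posterior for a linear Gaussian system, from the formulas above.
mu_y = np.array([0.0, 0.0])
Sigma_y = np.eye(2)                  # prior covariance of y
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # x is a 3-d noisy linear function of 2-d y
b = np.zeros(3)
Sigma_x = 0.5 * np.eye(3)            # observation noise covariance
x = np.array([0.9, 1.1, 2.2])        # one observed x

Lam_post = np.linalg.inv(Sigma_y) + A.T @ np.linalg.inv(Sigma_x) @ A
Sigma_post = np.linalg.inv(Lam_post)
mu_post = Sigma_post @ (A.T @ np.linalg.inv(Sigma_x) @ (x - b)
                        + np.linalg.inv(Sigma_y) @ mu_y)
print(mu_post, Sigma_post)
```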
Bayesian Approach for Finding the Airplane
- Assume y and x1, x2, . . . , xN are jointly Gaussian
- A = I, b = 0
- Assume the same noise covariance Σx for each radar blip (same sensor)

Posterior for y:

p(y|x1, x2, . . . , xN) = N(µ_N, Σ_N)
Σ_N^{-1} = Σ_0^{-1} + N Σ_x^{-1}
µ_N = Σ_N (Σ_x^{-1} (N x̄) + Σ_0^{-1} µ_0)

where x̄ is the mean of the observed blips.
Linear Regression
- There is one scalar target variable y (instead of a hidden variable)
- There is one vector input variable x

Linear Regression Learning Task: learn w given training examples ⟨X, y⟩.
Two Interpretations

1. Probabilistic Interpretation
- y is assumed to be normally distributed: y ∼ N(w^T x, σ^2)
- Or, equivalently: y = w^T x + ε, where ε ∼ N(0, σ^2)
- y is a linear combination of the input variables
- Given w and σ^2, one can find the probability distribution of y for a given x
Two Interpretations

2. Geometric Interpretation
- Fitting a straight line (hyperplane) to d-dimensional data:
  y = w^T x = w1 x1 + w2 x2 + . . . + wd xd
- This line passes through the origin
- Add an intercept: y = w0 + w1 x1 + w2 x2 + . . . + wd xd
- Equivalent to adding another column of 1s to X
Learning Parameters - MLE Approach

Find w and σ^2 that maximize the likelihood of the training data:

ŵ_MLE = (X^T X)^{-1} X^T y
σ̂^2_MLE = (1/N) (y - Xŵ)^T (y - Xŵ)
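A minimal numpy sketch of these estimators on synthetic (made-up) data; the same ŵ formula reappears as the least-squares estimate below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = Xw + Gaussian noise.
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])  # intercept column
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + rng.normal(scale=0.3, size=N)

# MLE (equivalently, least-squares) estimates from the formulas above
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
resid = y - X @ w_hat
sigma2_hat = resid @ resid / N              # (1/N)(y - Xŵ)^T (y - Xŵ)
print(w_hat, sigma2_hat)
```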
Learning Parameters - Least Squares Approach

Minimize the squared loss:

J(w) = (1/2) Σ_{i=1}^{N} (yi - w^T xi)^2

Make the prediction (w^T xi) as close to the target (yi) as possible.

Least squares estimate:

ŵ = (X^T X)^{-1} X^T y
Gradient Descent Based Method

Minimize the squared loss using gradient descent:

J(w) = (1/2) Σ_{i=1}^{N} (yi - w^T xi)^2

Why? The closed-form solution requires inverting the D x D matrix X^T X, which is expensive (and may be ill-conditioned) in high dimensions; gradient descent avoids the inversion.
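A sketch of gradient descent on this loss, using the gradient ∇J(w) = -X^T (y - Xw) and made-up data; the step size is a tuning choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of synthetic setup as before (made-up data).
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

w = np.zeros(d)
lr = 0.01                                  # step size (problem-dependent)
for _ in range(5000):
    grad = -X.T @ (y - X @ w)              # ∇J(w) for the squared loss above
    w -= lr * grad / N                     # averaged gradient for a stable step
print(w)                                   # approaches the closed-form solution
```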
Recap - Linear Regression

Geometric view:
1. y = w^T x
2. Least Squares: ŵ = (X^T X)^{-1} X^T y
3. Gradient Descent: J(w) = (1/2) Σ_{i=1}^{N} (yi - w^T xi)^2

Bayesian view:
1. p(y) = N(w^T x, σ^2)
2. Maximum Likelihood Estimation:
   ŵ = (X^T X)^{-1} X^T y
   σ̂^2_MLE = (1/N) (y - Xŵ)^T (y - Xŵ)
Issues with Linear Regression
1. Susceptible to outliers
   - Robust regression: use a "fat-tailed" distribution for y, such as a Student's t or Laplace distribution
2. Too simplistic - prone to underfitting
3. Unstable in the presence of correlated input attributes
4. Gets "confused" by unnecessary attributes

What if the relationship is non-linear?
Handling Non-linear Relationships
- Replace x with non-linear functions φ(x):
  p(y|x, θ) = N(w^T φ(x), σ^2)
- The model is still linear in w
- Also known as basis function expansion
- Example: φ(x) = [1, x, x^2, . . . , x^d]
- Increasing d results in more complex fits (see the sketch below)
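A minimal sketch of polynomial basis expansion on made-up 1-d data; the fit is still ordinary least squares, just on the expanded features:

```python
import numpy as np

# Polynomial basis φ(x) = [1, x, x^2, ..., x^d] for scalar x,
# then least squares on the expanded features (made-up data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)   # non-linear target

d = 5
Phi = np.vander(x, d + 1, increasing=True)       # columns: x^0, x^1, ..., x^d
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # still linear in w
print(w_hat)
```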
How to Control Overfitting?
- Use simpler models (linear instead of polynomial)
  - Might have poor results (underfitting)
- Use regularized complex models:
  Θ̂ = arg min_Θ J(Θ) + α R(Θ)
- R(·) corresponds to the penalty paid for the complexity of the model
Examples of Regularization

Ridge Regression:
ŵ = arg min_w J(w) + α ||w||_2^2
- Also known as l2 or Tikhonov regularization
- Helps in reducing the impact of correlated inputs

Least Absolute Shrinkage and Selection Operator (LASSO):
ŵ = arg min_w J(w) + α ||w||_1
- Also known as l1 regularization
- Helps in feature selection; favors sparse solutions
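One way to try both penalties is via scikit-learn's Ridge and Lasso estimators; a sketch on synthetic data where only two features matter (the alpha values are arbitrary, and note that scikit-learn scales the loss term slightly differently from the J(w) above):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([2.0, -1.0] + [0.0] * 8)       # only 2 informative features
y = X @ w_true + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # l1: drives irrelevant coefficients to 0
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))          # sparse: mostly exact zeros
```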
Parameter Estimation for Ridge Regression

Exact Loss Function:

J(w) = (1/2) Σ_{i=1}^{N} (yi - w^T xi)^2 + (1/2) λ ||w||_2^2

MAP Estimate of w:

ŵ_MAP = (λ I_D + X^T X)^{-1} X^T y
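A one-line numpy check of the MAP formula on made-up data (λ is arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=0.2, size=N)

lam = 1.0
w_map = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)  # (λI_D + X^T X)^{-1} X^T y
print(w_map)
```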
Putting a Prior on w
- "Penalize" large values of w
- A zero-mean Gaussian prior:
  p(w) = Π_j N(wj | 0, τ^2)
- What is the posterior of w?
  p(w|D) ∝ Π_i N(yi | w0 + w^T xi, σ^2) p(w)
- The posterior is also Gaussian
Posterior Estimates of the Weight Vector

MAP estimate of w:

arg max_w Σ_{i=1}^{N} log N(yi | w^T xi, σ^2) + Σ_{j=1}^{D} log N(wj | 0, τ^2)

Equivalent to ridge regression, with λ = σ^2/τ^2.
Parameter Estimation for Bayesian Regression

Prior for w:

w ∼ N(0, τ^2 I_D)

Posterior for w:

p(w|y, X) = p(y|X, w) p(w) / p(y|X) = N(w̄, Σ̄)

where

w̄ = (X^T X + (σ^2/τ^2) I_D)^{-1} X^T y
Σ̄ = σ^2 (X^T X + (σ^2/τ^2) I_D)^{-1}

- The posterior distribution for w is also Gaussian
- What will be the MAP estimate for w? (For a Gaussian, the mode equals the mean, so ŵ_MAP = w̄.)
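A minimal numpy sketch of this posterior on synthetic data, assuming σ^2 and τ^2 are known:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

sigma2, tau2 = 0.3 ** 2, 1.0          # assumed known noise and prior variances
M = X.T @ X + (sigma2 / tau2) * np.eye(D)
w_bar = np.linalg.solve(M, X.T @ y)   # posterior mean (= MAP estimate)
Sigma_bar = sigma2 * np.linalg.inv(M) # posterior covariance
print(w_bar)
```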
Prediction with Bayesian Regression

For a new input x*, predict y*.

Point estimate of y*:

y* = ŵ_MLE^T x*

Treating y* as a Gaussian random variable:

p(y*|x*) = N(ŵ_MLE^T x*, σ̂^2_MLE)
p(y*|x*) = N(ŵ_MAP^T x*, σ̂^2_MAP)
Full Bayesian Treatment

Treating both y* and w as random variables:

p(y*|x*) = ∫ p(y*|x*, w) p(w|X, y) dw

This predictive distribution is also Gaussian!
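For this Gaussian model the integral has a standard closed form, p(y*|x*) = N(w̄^T x*, σ^2 + x*^T Σ̄ x*), combining observation noise and parameter uncertainty. A small sketch, with w̄, Σ̄, and σ^2 as computed in the earlier snippet and made-up example values:

```python
import numpy as np

# Closed-form Gaussian predictive for the model above (a standard result):
# p(y*|x*) = N(w̄^T x*, σ^2 + x*^T Σ̄ x*).
def predict(x_star, w_bar, Sigma_bar, sigma2):
    mean = w_bar @ x_star
    var = sigma2 + x_star @ Sigma_bar @ x_star   # noise + parameter uncertainty
    return mean, var

# Tiny made-up example
w_bar = np.array([1.0, -2.0])
Sigma_bar = 0.01 * np.eye(2)
print(predict(np.array([0.5, 1.5]), w_bar, Sigma_bar, sigma2=0.09))
```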