Introduction to Machine Learning
CSE474/574: Linear Regression
Varun Chandola <chandola@buffalo.edu>

Contents

1 Inference in Machine Learning
2 Joint Distributions
3 Linear Gaussian Systems
  3.1 Example
4 Linear Regression
  4.1 Problem Formulation
  4.2 Geometric Interpretation
  4.3 Learning Parameters
5 Recap
  5.1 Issues with Linear Regression
6 Handling Non-linear Relationships
  6.1 Handling Overfitting via Regularization
7 Bayesian Regression
  7.1 Estimating Bayesian Regression Parameters
  7.2 Prediction with Bayesian Regression
1 Inference in Machine Learning

Definition 1. Given a joint distribution p(x1, x2), how do we compute the marginals, p(x1), and the conditionals, p(x1|x2)?

Example: Discrete Random Variables
Let X be the length of a word and Y the number of vowels in the word. The joint distribution p(x, y), together with its marginals, is:

    x \ y          0      1      2      ∑_y p(x, y)
    2              0      0.34   0      0.34
    3              0.03   0.16   0      0.19
    4              0      0.30   0.03   0.33
    5              0      0      0.14   0.14
    ∑_x p(x, y)    0.03   0.80   0.17

The main question that arises here is: why? Why are we interested in computing marginals and conditionals? Sometimes we wish to remove the "influence" of one or more variables from a probability. This may be because some variables are less reliable, uninteresting, or irrelevant to some analysis. In those cases we marginalize out the effect of those variables and compute the marginal distribution over the remaining, "more interesting" variables.
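For the discrete example above, marginalization is just summing the joint table along one axis, and conditioning is a renormalized slice. A minimal sketch using the word-length/vowel-count table (variable names are mine):

```python
import numpy as np

# Joint distribution p(x, y): rows are word lengths x in {2, 3, 4, 5},
# columns are vowel counts y in {0, 1, 2}.
p_xy = np.array([
    [0.00, 0.34, 0.00],
    [0.03, 0.16, 0.00],
    [0.00, 0.30, 0.03],
    [0.00, 0.00, 0.14],
])

p_x = p_xy.sum(axis=1)               # marginal p(x): [0.34, 0.19, 0.33, 0.14]
p_y = p_xy.sum(axis=0)               # marginal p(y): [0.03, 0.80, 0.17]
p_x_given_y1 = p_xy[:, 1] / p_y[1]   # conditional p(x | y = 1)

print(p_x, p_y, p_x_given_y1)
```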
2 Joint Distributions

• Multiple random variables may be jointly modeled.
• Joint probability distribution: p(x1, x2, ..., xD).
• Example: x = {x1, x2, ..., xD} has a joint Gaussian distribution if p(x) = N(µ, Σ).
• Marginal Distribution: probability distribution of one (or more) variable(s), "marginalized" over all others.
• Conditional Distribution: probability distribution of one (or more) variable(s), given a particular value taken by the others.

• Let x = (x1, x2) be jointly Gaussian with parameters:

    µ = [µ1]     Σ = [Σ11  Σ12]     Λ = Σ^{-1} = [Λ11  Λ12]
        [µ2]         [Σ21  Σ22]                  [Λ21  Λ22]

• Both marginals and conditionals are also Gaussian:

    p(x1) = N(µ1, Σ11)
    p(x1|x2) = N(µ1|2, Σ1|2)

  where

    Σ1|2 = Λ11^{-1}
    µ1|2 = Σ1|2 (Λ11 µ1 − Λ12 (x2 − µ2))
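As a quick numerical check of these formulas, here is a minimal sketch that computes the marginal and the conditional of a 2-d Gaussian from the partitioned parameters (the specific numbers are made up for illustration):

```python
import numpy as np

mu = np.array([1.0, 2.0])                 # [mu1, mu2]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # [[S11, S12], [S21, S22]]
Lam = np.linalg.inv(Sigma)                # precision matrix Lambda

x2 = 3.0                                  # observed value of x2

# Marginal: p(x1) = N(mu1, Sigma11)
mu1, S11 = mu[0], Sigma[0, 0]

# Conditional: p(x1 | x2) = N(mu_1g2, S_1g2)
S_1g2 = 1.0 / Lam[0, 0]
mu_1g2 = S_1g2 * (Lam[0, 0] * mu[0] - Lam[0, 1] * (x2 - mu[1]))

print(mu1, S11, mu_1g2, S_1g2)
```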
3 Linear Gaussian Systems

• What if each x is a noisy linear combination of a hidden variable y?

    p(y) = N(µy, Σy)
    p(x|y) = N(Ay + b, Σx)

• A is a matrix and b is a vector.
• x is a (noisy) linear combination of y.
• Inference Problem: can we infer y given x?

Example 2.
• In the univariate case, x can be a noisy measurement of an actual quantity y.
• x can be measurements from many sensors and y can be the actual phenomenon being measured.
• x and y can be vectors of different lengths (sensor fusion).

Bayes Rule for Linear Gaussian Systems

    p(y|x) = N(µy|x, Σy|x)
    Σy|x^{-1} = Σy^{-1} + A^T Σx^{-1} A
    µy|x = Σy|x [A^T Σx^{-1} (x − b) + Σy^{-1} µy]
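The Bayes rule above amounts to a few matrix operations. A minimal sketch, with A, b, and the covariances chosen arbitrarily for illustration:

```python
import numpy as np

# Prior p(y) = N(mu_y, Sigma_y) and likelihood p(x|y) = N(A y + b, Sigma_x)
mu_y = np.array([0.0, 0.0])
Sigma_y = np.eye(2)
A = np.eye(2)
b = np.zeros(2)
Sigma_x = 0.5 * np.eye(2)

x = np.array([1.0, 2.0])                  # a single observation

Sx_inv = np.linalg.inv(Sigma_x)
Sy_inv = np.linalg.inv(Sigma_y)

# Posterior p(y|x) = N(mu_post, Sigma_post)
Sigma_post = np.linalg.inv(Sy_inv + A.T @ Sx_inv @ A)
mu_post = Sigma_post @ (A.T @ Sx_inv @ (x - b) + Sy_inv @ mu_y)

print(mu_post, Sigma_post)
```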
3.1 Example

Radar Blips

[Figure: radar blips plotted on a latitude–longitude grid; a second panel marks the prior mean µ0.]

The most straightforward approach here would be to take the mean location of the blips as the estimated location. A simple probabilistic approach would be to treat the radar blips as samples from a probability distribution; one can then estimate the distribution parameters using the MLE and MAP methods discussed earlier. A richer treatment models the blips as a linear Gaussian system:

• Let the exact, but unknown, location be a (2-d) random variable that takes the value y.
• Each radar blip is a random variable as well, taking the value x.
• Assume y and x1, x2, ..., xN are jointly Gaussian.
• We assume that each x is generated from y by adding noise; however, we can only observe x.
• Can we estimate the true location y, given the noisy observations?

Assumptions
1. y has a Gaussian prior, y ~ N(µ0, Σ0).
2. Each xi ~ N(y, Σx), i.e., each radar blip is a noisy version of the true location.

Here A = I and b = 0, and we assume the same noise covariance Σx for each radar blip (same sensor).

Problem Statement
Given samples x1, x2, ..., xN, infer the posterior for y.

Posterior for y

    p(y|x1, x2, ..., xN) = N(µN, ΣN)
    ΣN^{-1} = Σ0^{-1} + N Σx^{-1}
    µN = ΣN (Σx^{-1} (N x̄) + Σ0^{-1} µ0)
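This posterior is simple to compute once the blips are in hand. A minimal sketch with made-up blip coordinates and covariances:

```python
import numpy as np

# Noisy radar blips (latitude, longitude), assumed xi ~ N(y, Sigma_x)
X = np.array([[1.0, 2.1],
              [1.4, 2.4],
              [0.9, 1.8],
              [1.2, 2.2]])
N = X.shape[0]
x_bar = X.mean(axis=0)

mu0 = np.zeros(2)            # prior mean for the true location y
Sigma0 = 10.0 * np.eye(2)    # broad prior covariance
Sigma_x = 0.25 * np.eye(2)   # per-blip noise covariance (same sensor)

Sx_inv = np.linalg.inv(Sigma_x)
S0_inv = np.linalg.inv(Sigma0)

# Posterior p(y | x1..xN) = N(mu_N, Sigma_N)
Sigma_N = np.linalg.inv(S0_inv + N * Sx_inv)
mu_N = Sigma_N @ (Sx_inv @ (N * x_bar) + S0_inv @ mu0)

print(mu_N, Sigma_N)
```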
4 Linear Regression

4.1 Problem Formulation

• There is one scalar target variable y (instead of a hidden variable).
• There is one vector input variable x.

Linear Regression Learning Task
Learn w given training examples ⟨X, y⟩.

The training data is denoted ⟨X, y⟩, where X is an N × D data matrix consisting of N data examples, each a D-dimensional vector, and y is an N × 1 vector of the corresponding target values.

Probabilistic Interpretation
• y is assumed to be normally distributed:

    y ~ N(w^T x, σ²)

• or, equivalently:

    y = w^T x + ε,  where ε ~ N(0, σ²)

• y is a linear combination of the input variables.
• Given w and σ², one can find the probability distribution of y for a given x (see the sketch below).
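A minimal sketch of this generative view, with made-up values for w and σ: for each input x the model assigns y a Gaussian distribution, which we can sample from.

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([2.0, -1.0, 0.5])     # assumed weights
sigma = 0.3                        # assumed noise standard deviation

x = np.array([1.0, 0.5, 2.0])      # a single input vector

mean_y = w @ x                                   # E[y | x] = w^T x
y_samples = rng.normal(mean_y, sigma, size=5)    # draws from N(w^T x, sigma^2)

print(mean_y, y_samples)
```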
4.2 Geometric Interpretation

• Fitting a straight line to d-dimensional data:

    y = w^T x = w1 x1 + w2 x2 + ... + wd xd

• This line will pass through the origin.
• Add an intercept:

    y = w0 + w1 x1 + w2 x2 + ... + wd xd

• This is equivalent to adding another column of 1s to X (see the sketch below).
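A minimal sketch of the intercept trick (the data values are made up):

```python
import numpy as np

X = np.array([[1.5, 2.0],
              [0.3, 4.0],
              [2.2, 1.1]])          # N x D data matrix

# Prepend a column of ones so that w0 acts as the intercept.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # N x (D + 1)

print(X_aug)
```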
4.3 Learning Parameters

• Find w and σ that maximize the likelihood of the training data.

The derivation of the MLE estimates can be done by maximizing the log-likelihood of the data set. The likelihood of the training data set is given by:

    L(w) = ∏_{i=1}^{N} (1 / (√(2π) σ)) exp(−(yi − w^T xi)² / (2σ²))

The log-likelihood is given by:

    LL(w) = −(N/2) log 2π − N log σ − (1/(2σ²)) ∑_{i=1}^{N} (yi − w^T xi)²

This can be rewritten in matrix notation as:

    LL(w) = −(N/2) log 2π − N log σ − (1/(2σ²)) (y − Xw)^T (y − Xw)

To maximize the log-likelihood, we first compute its derivative with respect to w and σ:

    ∂LL(w)/∂w = −(1/(2σ²)) ∂/∂w (y − Xw)^T (y − Xw)
              = −(1/(2σ²)) ∂/∂w (y^T y + w^T X^T X w − 2 y^T X w)
              = −(1/(2σ²)) (2 w^T X^T X − 2 y^T X)

Note that we use the fact that (Xw)^T y = y^T Xw, since both quantities are scalars and the transpose of a scalar is equal to itself. Setting the derivative to 0, we get:

    2 w^T X^T X − 2 y^T X = 0
    w^T X^T X = y^T X
    (X^T X)^T w = X^T y      (taking the transpose of both sides)
    (X^T X) w = X^T y
    w = (X^T X)^{-1} X^T y

In a similar fashion, one can set the derivative with respect to σ to 0 and plug in the optimal value of w. The MLE estimates are therefore:

    ŵ_MLE = (X^T X)^{-1} X^T y
    σ̂²_MLE = (1/N) (y − X ŵ_MLE)^T (y − X ŵ_MLE)

Least Squares
An equivalent geometric view is to minimize the squared loss

    J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)²

i.e., make each prediction w^T xi as close to the target yi as possible. Why does this give the same answer? The derivation of the least squares estimate follows the exact computation as the MLE derivation above, once the expression of the squared loss is converted into matrix notation, i.e.,

    J(w) = (1/2) (y − Xw)^T (y − Xw)

which yields the same estimate, ŵ = (X^T X)^{-1} X^T y.
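A minimal sketch of the closed-form estimate on synthetic data (values made up for illustration). In practice one solves the normal equations rather than forming the inverse explicitly; `np.linalg.lstsq` does this robustly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + noise
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# MLE / least squares estimate: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, numerically more robust:
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

sigma2_hat = np.mean((y - X @ w_hat) ** 2)   # MLE of the noise variance

print(w_hat, w_hat_lstsq, sigma2_hat)
```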
Gradient Descent
The analytical approach discussed earlier involves inverting X^T X, a (D+1) × (D+1) matrix (alternatively, one could solve the corresponding system of linear equations). When D is large, this inversion can be computationally expensive (O(D³) for standard matrix inversion). Moreover, the linear system may have singularities, and inverting it or solving the system of equations can then yield numerically unstable results.

An alternative is to minimize the squared loss using gradient descent. To compute the gradient update rule, one differentiates the error with respect to each entry of w:

    ∂J(w)/∂wj = (1/2) ∂/∂wj ∑_{i=1}^{N} (yi − w^T xi)²
              = ∑_{i=1}^{N} (w^T xi − yi) xij

Using the above result, one can perform repeated updates of the weights:

    wj := wj − α ∂J(w)/∂wj
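A minimal sketch of batch gradient descent for the squared loss (the step size and iteration count are arbitrary choices, not from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.001, n_iters=5000):
    """Minimize J(w) = 0.5 * sum_i (y_i - w^T x_i)^2 by batch gradient descent."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)    # j-th entry is sum_i (w^T x_i - y_i) x_ij
        w -= alpha * grad
    return w

# Usage on made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(gradient_descent(X, y))
```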
5 Recap

Probabilistic
• Model: p(y|x) = N(w^T x, σ²)
• Maximum Likelihood Estimation:

    ŵ_MLE = (X^T X)^{-1} X^T y
    σ̂²_MLE = (1/N) (y − X ŵ_MLE)^T (y − X ŵ_MLE)

Geometric
• Model: y = w^T x
• Minimize the squared loss J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)², i.e., make each prediction w^T xi as close to the target yi as possible.
1. Least Squares: ŵ = (X^T X)^{-1} X^T y
2. Gradient Descent: minimize the squared loss by repeated gradient updates.

Bayesian
• Treat w as a random variable with a prior and compute its posterior (Section 7).
5.1 Issues with Linear Regression

1. Susceptible to outliers.
   • Robust Regression: use a "fat-tailed" distribution for y, e.g., the Student t-distribution or the Laplace distribution.
2. Too simplistic – underfitting.
   • What if the relationship is non-linear?
3. Unstable in the presence of correlated input attributes.
4. Gets "confused" by unnecessary attributes.
6 Handling Non-linear Relationships

• Replace x with non-linear functions φ(x):

    p(y|x, θ) = N(w^T φ(x), σ²)

• The model is still linear in w.
• Also known as basis function expansion.

Example 3.

    φ(x) = [1, x, x², ..., x^d]

• Increasing d results in more complex fits (see the sketch below).
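A minimal sketch of polynomial basis function expansion for scalar inputs (the degree and data are arbitrary choices for illustration):

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to phi(x) = [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([x ** p for p in range(degree + 1)])

x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = poly_features(x, degree=3)       # N x (d + 1) design matrix

# Linear regression in the expanded space is still linear in w:
# w_hat = (Phi^T Phi)^{-1} Phi^T y
y = np.sin(x)                          # made-up targets
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(Phi, w_hat)
```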
6.1 Handling Overfitting via Regularization

How to Control Overfitting?
• Use simpler models (linear instead of polynomial).
  – Might give poor results (underfitting).
• Use regularized complex models:

    Θ̂ = arg min_Θ J(Θ) + α R(Θ)

• R(·) corresponds to the penalty paid for the complexity of the model.

Examples of Regularization

Ridge Regression

    ŵ = arg min_w J(w) + α ||w||²

• Also known as l2 or Tikhonov regularization.
• Helps in reducing the impact of correlated inputs.

Least Absolute Shrinkage and Selection Operator (LASSO)

    ŵ = arg min_w J(w) + α ||w||₁

• Also known as l1 regularization.
• Helps in feature selection – favors sparse solutions.
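Ridge regression has a closed form analogous to least squares. A minimal sketch (the LASSO has no closed form, so it is omitted here; the data and α are made up):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge estimate: w_hat = (alpha * I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(alpha * np.eye(D) + X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(ridge_fit(X, y, alpha=1.0))
```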
7 Bayesian Regression

• "Penalize" large values of w by placing a zero-mean Gaussian prior on them:

    p(w) = ∏_j N(wj | 0, τ²)

• What is the posterior of w?

    p(w|D) ∝ ∏_i N(yi | w^T xi, σ²) p(w)

• The posterior is also Gaussian.
• MAP estimate of w:

    arg max_w ∑_{i=1}^{N} log N(yi | w^T xi, σ²) + ∑_{j=1}^{D} log N(wj | 0, τ²)

Exact Loss Function

    J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)² + (λ/2) ||w||²₂    (here λ = σ²/τ²)

MAP Estimate of w

    ŵ_MAP = (λ I_D + X^T X)^{-1} X^T y

• This is equivalent to Ridge Regression.
7.1 Estimating Bayesian Regression Parameters

• Prior for w:

    w ~ N(0, τ² I_D)

• Posterior for w:

    p(w|y, X) = p(y|X, w) p(w) / p(y|X)
              = N(w̄, σ² (X^T X + (σ²/τ²) I_D)^{-1})

  where w̄ = (X^T X + (σ²/τ²) I_D)^{-1} X^T y.

The denominator term in the posterior above can be computed as the marginal likelihood of the data by marginalizing out w:

    p(y|X) = ∫ p(y|X, w) p(w) dw

One can compute the posterior for w as follows. We first show that the likelihood of y, i.e., of all target values in the training data, can be jointly modeled as a Gaussian:

    p(y|X, w) = ∏_{i=1}^{N} (1/√(2πσ²)) exp(−(1/(2σ²)) (yi − w^T xi)²)
              = (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) ||y − Xw||²)
              = N(Xw, σ² I_N)

Ignoring the denominator, which does not depend on w:

    p(w|y, X) ∝ exp(−(1/(2σ²)) (y − Xw)^T (y − Xw)) exp(−(1/(2τ²)) w^T w)
             ∝ exp(−(1/2) (w − w̄)^T ((1/σ²) X^T X + (1/τ²) I_D) (w − w̄))

where w̄ = (X^T X + (σ²/τ²) I_D)^{-1} X^T y, as given above.
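A minimal sketch that computes the posterior mean and covariance of w directly from these formulas (σ² and τ² are assumed known, and the data are made up):

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(w_bar, S) for w under the prior w ~ N(0, tau2 * I)."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X.T @ y          # posterior mean
    S = sigma2 * A_inv               # posterior covariance
    return w_bar, S

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
w_bar, S = bayes_linreg_posterior(X, y, sigma2=0.09, tau2=1.0)
print(w_bar, S)
```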
7.2 Prediction with Bayesian Regression

• For a new input x*, predict y*.
• Point estimate of y*:

    y* = ŵ_MLE^T x*

• Treating y* as a Gaussian random variable:

    p(y*|x*) = N(ŵ_MLE^T x*, σ̂²_MLE)
    p(y*|x*) = N(ŵ_MAP^T x*, σ̂²_MAP)

• Treating both y* and w as random variables:

    p(y*|x*) = ∫ p(y*|x*, w) p(w|X, y) dw

• This is also Gaussian!
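Because both the likelihood and the posterior over w are Gaussian, this predictive distribution has a closed form; the standard result (stated here without derivation, not taken from these notes) has mean w̄^T x* and variance σ² + x*^T S x*, where S is the posterior covariance from Section 7.1. A minimal, self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data (made up) and assumed noise/prior variances
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
sigma2, tau2 = 0.09, 1.0

# Posterior over w (as in Section 7.1)
A_inv = np.linalg.inv(X.T @ X + (sigma2 / tau2) * np.eye(X.shape[1]))
w_bar = A_inv @ X.T @ y
S = sigma2 * A_inv

# Predictive distribution for a new input x*:
# p(y*|x*) = N(w_bar^T x*, sigma2 + x*^T S x*)   (standard Gaussian result)
x_star = np.array([0.5, -1.0, 2.0])
pred_mean = w_bar @ x_star
pred_var = sigma2 + x_star @ S @ x_star
print(pred_mean, pred_var)
```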