Statistical Techniques in Robotics (STR, S16)
Lecturer: Byron Boots
Lecture#14 (Wednesday, March 2)
Gaussian Properties, Bayesian Linear Regression
1 Bayesian Linear Regression (BLR)
In linear regression, the goal is to predict a continuous outcome variable. In particular, let:
• θ = parameter vector of the learned model,
• x_i ∈ R^n is the i-th feature vector in the dataset,
• y_i ∈ R is the true outcome given x_i.
The problem we are solving is to find a θ that makes the best prediction of the output y given an input x. Our model is then
\[
y_i = \theta^T x_i + \epsilon_i,
\]
where ε_i is noise, independent of everything else, with ε_i ∼ N(0, σ²). Equivalently, y_i ∼ N(θ^T x_i, σ²).
Thus, if θ is known, the likelihood is
\[
P(y \mid x, \theta) = \frac{1}{Z} \exp\left\{-\frac{(y - \theta^T x)^2}{2\sigma^2}\right\}.
\]
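As a concrete illustration (the numbers and variable names here are my own, not from the lecture), the generative model above can be simulated directly: fix a θ, draw inputs x_i, and generate noisy outputs y_i = θ^T x_i + ε_i.

# A small illustrative sketch of the generative model y_i = theta^T x_i + eps_i,
# with eps_i ~ N(0, sigma^2). All values are made up for demonstration.
import numpy as np

rng = np.random.default_rng(0)
n, t, sigma = 3, 100, 0.2           # feature dimension, number of points, noise std
theta = np.array([0.5, -1.0, 2.0])  # "true" weights for this simulation only

X = rng.normal(size=(t, n))                      # inputs x_1, ..., x_t
y = X @ theta + rng.normal(scale=sigma, size=t)  # noisy outputs y_i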
In BLR, we maintain a distribution over the weight vector θ to represent our beliefs about what θ
is likely to be. The math is easiest if we restrict this distribution to be a Gaussian. The problem
can be represented by the following graphical model:
Figure 1: Bayesian linear regression model. The x_i's are known.
1.1 Prior Distribution of θ
We assume that the prior distribution of θ is a normal distribution N (µ, Σ) with mean µ and
covariance matrix Σ, and the probability of θ is given by
\[
P(\theta) = \frac{1}{Z} \exp\left\{-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right\}, \tag{1}
\]
where $\mu = E_{P(\theta)}[\theta]$ and $\Sigma = E_{P(\theta)}[(\theta - \mu)(\theta - \mu)^T]$.
2 Parameterizations for Gaussians
There are two common parameterizations for Gaussians, the moment parameterization and the natural parameterization. It is often most practical to switch back and forth between the two representations, depending on which calculations are needed. Equation (1) is called the moment parameterization of θ since it consists of the first moment (µ) and the second central moment (Σ) of the variable θ. Z is a normalization factor with the value $\sqrt{(2\pi)^n \det(\Sigma)}$, where n is the dimension of θ. To prove this, one can translate the distribution to center it at the origin and perform a change of variables so that the distribution has the form $P(\theta') = \frac{1}{Z}\exp\{-\frac{1}{2}\theta'^T\theta'\}$. Then, express θ' in polar coordinates and integrate over the space to compute Z.
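As a sanity check (this snippet is my own and not part of the notes), the value of Z can also be verified numerically by brute-force grid integration of the unnormalized density in two dimensions:

# Numerical check that Z = sqrt((2*pi)^n * det(Sigma)) normalizes the Gaussian.
# Grid integration in n = 2 dimensions; mu and Sigma are arbitrary examples.
import numpy as np

mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
Z = np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))

xs = np.linspace(mu[0] - 8, mu[0] + 8, 400)
ys = np.linspace(mu[1] - 8, mu[1] + 8, 400)
X, Y = np.meshgrid(xs, ys)
D = np.stack([X - mu[0], Y - mu[1]], axis=-1)           # displacements from the mean
quad = np.einsum('...i,ij,...j->...', D, Sigma_inv, D)  # (theta - mu)^T Sigma^{-1} (theta - mu)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
total = np.exp(-0.5 * quad).sum() * dx * dy
print(total / Z)  # should be close to 1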
2.1 Moment Parameterization
\[
\mathcal{N}(\mu, \Sigma) = p(\theta) = \frac{1}{Z} \exp\left\{-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right\} \tag{2}
\]
Given:
\[
\mathcal{N}\!\left(\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right) \tag{3}
\]
Marginal:
\[
\mu_1^{\text{marg}} = \mu_1 \tag{4}
\]
\[
\Sigma_1^{\text{marg}} = \Sigma_{11} \tag{5}
\]
Conditional:
\[
\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\theta_2 - \mu_2) \tag{6}
\]
\[
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{7}
\]
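The marginalization and conditioning rules (4)-(7) translate directly into code. The following is a minimal sketch (function names are my own) for a joint Gaussian split into a first block of k dimensions and a second block holding the rest:

# Marginalize and condition a joint Gaussian in the moment parameterization,
# following Eqs. (4)-(7). Helper names are illustrative, not from the notes.
import numpy as np

def marginal(mu, Sigma, k):
    """Marginal over the first k dimensions: just drop the rest (Eqs. 4-5)."""
    return mu[:k], Sigma[:k, :k]

def conditional(mu, Sigma, k, theta2):
    """Condition the first k dimensions on observing the rest at theta2 (Eqs. 6-7)."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    S22_inv = np.linalg.inv(S22)
    mu_cond = mu1 + S12 @ S22_inv @ (theta2 - mu2)
    Sigma_cond = S11 - S12 @ S22_inv @ S21
    return mu_cond, Sigma_cond

# Example: 3-d joint Gaussian, condition the first two dims on the third being 1.5.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])
print(conditional(mu, Sigma, 2, np.array([1.5])))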
2.2 Natural Parameterization
The normal distribution can also be expressed as
\[
\tilde{\mathcal{N}}(J, P) = \tilde{p}(\theta) = \frac{1}{Z} \exp\left\{J^T \theta - \frac{1}{2}\theta^T P \theta\right\} \tag{8}
\]
The natural parameterization simplifies the multiplication of normal distributions, since it becomes addition of the J vectors and P matrices of the different distributions. Transforming the moment parameterization to the natural parameterization is achieved by first expanding the exponent:
\[
-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu) = -\frac{1}{2}\theta^T \Sigma^{-1}\theta + \mu^T \Sigma^{-1}\theta - \frac{1}{2}\mu^T \Sigma^{-1}\mu \tag{9}
\]
The last term in equation (9) does not depend on θ and can therefore be absorbed into the normalizer. Comparing (8) and (9),
\[
J = \Sigma^{-1}\mu \tag{10}
\]
\[
P = \Sigma^{-1}
\]
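The conversion in (10), and the product rule mentioned above (multiplying Gaussians adds their J's and P's), can be sketched as follows (helper names are my own):

# Moment <-> natural conversion from Eq. (10), and the product of two Gaussian
# factors over the same theta as addition of their J vectors and P matrices.
import numpy as np

def to_natural(mu, Sigma):
    P = np.linalg.inv(Sigma)
    return P @ mu, P            # J = Sigma^{-1} mu,  P = Sigma^{-1}

def to_moment(J, P):
    Sigma = np.linalg.inv(P)
    return Sigma @ J, Sigma     # mu = P^{-1} J,  Sigma = P^{-1}

# Two Gaussian factors over the same theta.
J1, P1 = to_natural(np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]))
J2, P2 = to_natural(np.array([0.0, 2.0]), np.array([[1.0, 0.0], [0.0, 3.0]]))

# Their (unnormalized) product is again Gaussian, with parameters J1+J2, P1+P2.
mu_prod, Sigma_prod = to_moment(J1 + J2, P1 + P2)
print(mu_prod, Sigma_prod)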
Given:
\[
\tilde{\mathcal{N}}\!\left(\begin{bmatrix}J_1\\ J_2\end{bmatrix}, \begin{bmatrix}P_{11} & P_{12}\\ P_{21} & P_{22}\end{bmatrix}\right) \tag{11}
\]
Marginal:
\[
J_1^{\text{marg}} = J_1 - P_{12}P_{22}^{-1}J_2 \tag{12}
\]
\[
P_1^{\text{marg}} = P_{11} - P_{12}P_{22}^{-1}P_{21} \tag{13}
\]
Conditional:
\[
J_{1|2} = J_1 - P_{12}\theta_2 \tag{14}
\]
\[
P_{1|2} = P_{11} \tag{15}
\]
2.3 Gaussian MRFs
The matrix P of the natural parameterization has a graphical model interpretation: there is a link between z_i and z_j in the MRF corresponding to Ñ(J, P) if and only if the entry of P for (z_i, z_j) is non-zero. For example,
\[
P = \begin{bmatrix} \times & \times & 0\\ \times & \times & \times\\ 0 & \times & \times \end{bmatrix}
\]
(Figure: the corresponding MRF over z_1, z_2, z_3 has edges z_1–z_2 and z_2–z_3, but no edge between z_1 and z_3.)
Following the graphical model interpretation, P is in many cases highly structured. Consider for example the graphical model of a Markov chain:
(Figure: a Markov chain over x_1, x_2, x_3, x_4, x_5.)
The corresponding matrix P will be non-zero only on the diagonal and the first off-diagonals:
\[
P = \begin{bmatrix}
\times & \times & 0 & 0 & 0\\
\times & \times & \times & 0 & 0\\
0 & \times & \times & \times & 0\\
0 & 0 & \times & \times & \times\\
0 & 0 & 0 & \times & \times
\end{bmatrix} \tag{16}
\]
Note: P describes which variables directly affect each other.
Note: P^{-1} is, in general, not sparse! (This makes intuitive sense, since P^{-1} = Σ is the covariance matrix, and two states along the Markov chain are in general not independent, so their covariance is non-zero.)
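A quick numerical illustration (my own, with arbitrary values) of this point: a tridiagonal precision matrix P like (16) has a dense inverse Σ.

# A tridiagonal precision matrix, as for the 5-state Markov chain above,
# has a dense inverse (the covariance matrix). Values are arbitrary.
import numpy as np

n = 5
P = 2.0 * np.eye(n) - 0.8 * np.eye(n, k=1) - 0.8 * np.eye(n, k=-1)

Sigma = np.linalg.inv(P)
print(np.count_nonzero(np.abs(P) > 1e-12))      # 13 non-zero entries: sparse
print(np.count_nonzero(np.abs(Sigma) > 1e-12))  # 25 non-zero entries: dense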
3 Return to Bayesian Linear Regression

3.1 Prediction
With the Bayesian linear regression model, we would like to know the probability of an output y_{t+1} given a new input x_{t+1} and the set of data D = {(x_i, y_i)}_{i=1,...,t}. To compute the probability P(y_{t+1} | x_{t+1}, D), we introduce θ into this expression and marginalize over it:
\[
P(y_{t+1} \mid x_{t+1}, D) = \int_{\theta \in \Theta} P(y_{t+1} \mid x_{t+1}, \theta, D)\, P(\theta \mid x_{t+1}, D)\, d\theta \tag{17}
\]
Because D tells us no more about y_{t+1} than θ does, P(y_{t+1} | x_{t+1}, θ, D) = P(y_{t+1} | x_{t+1}, θ). From the Markov assumption and the graphical model, equation (17) becomes
\[
P(y_{t+1} \mid x_{t+1}, D) = \int_{\theta \in \Theta} P(y_{t+1} \mid x_{t+1}, \theta)\, P(\theta \mid D)\, d\theta \tag{18}
\]
Computing (18) is hard with the moment parametrization of normal distributions but not with the
natural parametrization.
3.2 Posterior Distribution P(θ|D)
Using Bayes' rule, the posterior probability P(θ|D) can be expressed as
\[
P(\theta \mid D) \propto P(y_{1:t} \mid x_{1:t}, \theta)\, P(\theta) \propto \left(\prod_{i=1}^{t} P(y_i \mid x_i, \theta)\right) P(\theta) \tag{19}
\]
If θ is known, the data are conditionally independent; that is, $P(y_{1:t} \mid x_{1:t}, \theta) = \prod_{i=1}^{t} P(y_i \mid x_i, \theta)$. We will see that this product can be computed by a simple update rule. First, let's look at the product P(y_i | x_i, θ)P(θ):
\begin{align*}
P(y_i \mid x_i, \theta)\,P(\theta)
&\propto \exp\left\{-\frac{1}{2\sigma^2}(y_i - \theta^T x_i)^2\right\} \exp\left\{J^T\theta - \frac{1}{2}\theta^T P \theta\right\}\\
&\propto \exp\left\{-\frac{1}{2\sigma^2}\left(-2 y_i \theta^T x_i + \theta^T x_i x_i^T \theta\right)\right\} \exp\left\{J^T\theta - \frac{1}{2}\theta^T P \theta\right\}\\
&= \exp\left\{\frac{1}{\sigma^2} y_i x_i^T \theta - \frac{1}{2\sigma^2}\theta^T x_i x_i^T \theta\right\} \exp\left\{J^T\theta - \frac{1}{2}\theta^T P \theta\right\}\\
&= \exp\left\{\left(J + \frac{1}{\sigma^2} y_i x_i\right)^T \theta - \frac{1}{2}\theta^T\left(P + \frac{1}{\sigma^2} x_i x_i^T\right)\theta\right\}\\
&= \exp\left\{J'^T \theta - \frac{1}{2}\theta^T P' \theta\right\}
\end{align*}
The step from the first to the second line holds because any term that does not involve θ can be absorbed into the normalizer. Now, we can apply this result repeatedly to (19) and derive
\[
P(\theta \mid D) \propto \exp\left\{\left(J + \frac{\sum_i y_i x_i}{\sigma^2}\right)^T \theta - \frac{1}{2}\theta^T\left(P + \frac{\sum_i x_i x_i^T}{\sigma^2}\right)\theta\right\} \tag{20}
\]
So P(θ|D) is also a normal distribution, with $J_{\text{final}} = J + \frac{1}{\sigma^2}\sum_i y_i x_i$ and $P_{\text{final}} = P + \frac{1}{\sigma^2}\sum_i x_i x_i^T$.
The mean and the covariance of this distribution can be recovered with the relations given earlier (here assuming a zero prior mean, µ = 0, so that J = 0; see note 1 below):
\[
\mu_{\text{final}} = \left(\Sigma^{-1} + \frac{\sum_i x_i x_i^T}{\sigma^2}\right)^{-1} \frac{\sum_i y_i x_i}{\sigma^2}
\]
\[
\Sigma_{\text{final}} = \left(\Sigma^{-1} + \frac{\sum_i x_i x_i^T}{\sigma^2}\right)^{-1}
\]
P_final is the precision matrix of the posterior normal distribution, and as the number of data points x_i increases, the terms in this matrix grow. Since P_final is the inverse of the covariance, the variance shrinks as the number of samples grows. It is a characteristic of the Gaussian model that a new data point always lowers the variance, but this shrinking of the variance does not always make sense: if you believe that there are outliers in your dataset, this model will not work well for you.
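In code, the update in (20) is a one-line accumulation per data point, and the moment parameterization is recovered at the end by inverting P. This is a minimal sketch under my own naming, not the course's reference implementation:

# Online BLR update in the natural parameterization: each observation (x_i, y_i)
# adds y_i*x_i/sigma^2 to J and x_i*x_i^T/sigma^2 to P, as in Eq. (20).
import numpy as np

def blr_update(J, P, x, y, sigma2):
    """Incorporate one observation (x, y) with noise variance sigma2."""
    return J + (y / sigma2) * x, P + np.outer(x, x) / sigma2

# Prior: theta ~ N(0, I) in 2 dimensions, i.e. J = 0, P = I.
n, sigma2 = 2, 0.1
J, P = np.zeros(n), np.eye(n)

true_theta = np.array([1.0, -2.0])   # used only to simulate data
rng = np.random.default_rng(0)
for _ in range(50):
    x = rng.normal(size=n)
    y = true_theta @ x + rng.normal(scale=np.sqrt(sigma2))
    J, P = blr_update(J, P, x, y, sigma2)

# Recover the moment parameterization of the posterior.
Sigma_post = np.linalg.inv(P)
mu_post = Sigma_post @ J
print(mu_post)  # should be close to true_theta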
A few more facts about the BLR update:
1. If µ_0 ≠ 0, the update for µ_θ will be more complicated.
2. $\sum_{t=1}^{N} \frac{y_t x_t}{\sigma^2}$ is the gradient of Bayes online linear regression.
3. This looks just like Newton's method.
4. Computation time: O(n^2) for the update, O(n^3) for the mean (which can be reduced to O(n^2) with tricks).
3.3 Probability Distribution of the Prediction
The next step in computing (18) is to carry out the integral over θ. Since a linear function of a Gaussian random variable plus independent Gaussian noise is again Gaussian, the predictive distribution should be of the form $\frac{1}{Z}\exp\{-\frac{1}{2\Sigma_{y_{t+1}}}(y_{t+1} - \mu_{y_{t+1}})^2\}$, where
\[
\mu_{y_{t+1}} = E[y_{t+1}] = E[\theta^T x_{t+1} + \epsilon] = E[\theta^T x_{t+1}] + E[\epsilon] = E[\theta]^T x_{t+1} + 0 = \mu_\theta^T x_{t+1},
\]
and
\[
\Sigma_{y_{t+1}} = x_{t+1}^T \Sigma_\theta x_{t+1} + \sigma^2.
\]
The term x^T Σ_θ x measures how large the posterior variance is in the direction of x. If no input like x has been observed before, the variance in that direction is large. Also, the predictive variance is not a function of y_{t+1}: the precision is affected only by the inputs, not the outputs. This is a consequence of assuming the same observation noise σ everywhere in the input space.
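Putting the pieces together, the predictive mean and variance above can be computed directly from the posterior moments. A short sketch (my own names, illustrative numbers):

# Predictive distribution at a new input: mean mu_theta^T x and variance
# x^T Sigma_theta x + sigma^2, as derived above.
import numpy as np

def predict(mu_theta, Sigma_theta, x_new, sigma2):
    mean = mu_theta @ x_new                       # mu_theta^T x
    var = x_new @ Sigma_theta @ x_new + sigma2    # x^T Sigma_theta x + sigma^2
    return mean, var

# Example posterior over 2-d weights; numbers are illustrative only.
mu_theta = np.array([1.0, -2.0])
Sigma_theta = np.array([[0.05, 0.01],
                        [0.01, 0.20]])
print(predict(mu_theta, Sigma_theta, np.array([0.5, 0.5]), sigma2=0.1))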