Statistical Techniques in Robotics (STR, S16)
Lecture #14 (Wednesday, March 2): Gaussian Properties, Bayesian Linear Regression
Lecturer: Byron Boots

1 Bayesian Linear Regression (BLR)

In linear regression, the goal is to predict a continuous outcome variable. In particular, let:

• θ be the parameter vector of the learned model,
• x_i ∈ R^n be the i-th feature vector in the dataset,
• y_i ∈ R be the true outcome given x_i.

The problem we are solving is to find a θ that makes the best prediction of the output y given an input x. Our model is

    y_i = θ^T x_i + ε_i,

where ε_i is noise independent of everything else, so that y_i ∼ N(θ^T x_i, σ²). Thus the likelihood, if θ is known, is

    P(y | x, θ) = (1/Z) exp{ −(y − θ^T x)² / (2σ²) }.

In BLR, we maintain a distribution over the weight vector θ to represent our beliefs about what θ is likely to be. The math is easiest if we restrict this distribution to be a Gaussian. The problem can be represented by the graphical model in Figure 1.

[Figure 1: Bayesian linear regression model. The x_i's are known.]

1.1 Prior Distribution of θ

We assume that the prior distribution of θ is a normal distribution N(µ, Σ) with mean µ and covariance matrix Σ, so the probability of θ is given by

    P(θ) = (1/Z) exp{ −(1/2) (θ − µ)^T Σ^{-1} (θ − µ) },    (1)

where µ = E_{P(θ)}[θ] and Σ = E_{P(θ)}[(θ − µ)(θ − µ)^T].

2 Parameterizations for Gaussians

There are two common parameterizations for Gaussians, the moment parameterization and the natural parameterization. It is often most practical to switch back and forth between the two, depending on which calculations are needed.

Equation (1) is called the moment parametrization of θ since it consists of the first moment (µ) and the second central moment (Σ) of the variable θ. Z is a normalization factor with value sqrt((2π)^n det(Σ)), where n is the dimension of θ. To prove this, translate the distribution so that it is centered at the origin and change variables so that it has the form P(θ′) = (1/Z) exp{ −(1/2) θ′^T θ′ }; then express θ′ in polar coordinates and integrate over the space to compute Z.

2.1 Moment Parameterization

    N(µ, Σ) = p(θ) = (1/Z) exp{ −(1/2) (θ − µ)^T Σ^{-1} (θ − µ) }    (2)

Given the joint distribution

    N( [µ_1]   [Σ_11  Σ_12]
       [µ_2] , [Σ_21  Σ_22] ),    (3)

the marginal of the first block is

    µ_1^marg = µ_1,    (4)
    Σ_1^marg = Σ_11,    (5)

and the conditional given θ_2 is

    µ_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (θ_2 − µ_2),    (6)
    Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.    (7)

2.2 Natural Parameterization

The normal distribution can also be expressed as

    Ñ(J, P) = p̃(θ) = (1/Z) exp{ J^T θ − (1/2) θ^T P θ }.    (8)

The natural parametrization simplifies the multiplication of normal distributions, which becomes addition of the J vectors and P matrices of the different distributions. Transforming the moment parametrization into the natural parametrization is achieved by expanding the exponent:

    −(1/2) (θ − µ)^T Σ^{-1} (θ − µ) = −(1/2) θ^T Σ^{-1} θ + µ^T Σ^{-1} θ − (1/2) µ^T Σ^{-1} µ.    (9)

The last term in equation (9) does not involve θ and can therefore be absorbed into the normalizer. Comparing (8) and (9),

    J = Σ^{-1} µ,    P = Σ^{-1}.    (10)

Given the joint distribution

    Ñ( [J_1]   [P_11  P_12]
       [J_2] , [P_21  P_22] ),    (11)

the marginal of the first block is

    J_1^marg = J_1 − P_12 P_22^{-1} J_2,    (12)
    P_1^marg = P_11 − P_12 P_22^{-1} P_21,    (13)

and the conditional given θ_2 is

    J_{1|2} = J_1 − P_12 θ_2,    (14)
    P_{1|2} = P_11.    (15)
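As a concrete illustration of switching between the two parameterizations and applying the conditioning rules, here is a minimal NumPy sketch. The function names and the convention of passing the two blocks as integer index arrays (idx1, idx2) are my own choices, not part of the notes.

    import numpy as np

    def moment_to_natural(mu, Sigma):
        """Moment parameters (mu, Sigma) -> natural parameters (J, P), eq. (10)."""
        P = np.linalg.inv(Sigma)   # precision matrix, P = Sigma^{-1}
        J = P @ mu                 # information vector, J = Sigma^{-1} mu
        return J, P

    def natural_to_moment(J, P):
        """Natural parameters (J, P) -> moment parameters (mu, Sigma)."""
        Sigma = np.linalg.inv(P)
        return Sigma @ J, Sigma

    def condition_moment(mu, Sigma, idx1, idx2, theta2):
        """Condition block 1 on an observed value theta2 of block 2,
        using the moment-form rules (6) and (7)."""
        mu1, mu2 = mu[idx1], mu[idx2]
        S11 = Sigma[np.ix_(idx1, idx1)]
        S12 = Sigma[np.ix_(idx1, idx2)]
        S21 = Sigma[np.ix_(idx2, idx1)]
        S22 = Sigma[np.ix_(idx2, idx2)]
        mu_cond = mu1 + S12 @ np.linalg.solve(S22, theta2 - mu2)
        Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
        return mu_cond, Sigma_cond

Conditioning in the natural parameterization, by contrast, needs no matrix inversion at all: by (14) and (15), J_{1|2} = J_1 − P_12 θ_2 and P_{1|2} = P_11.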
2.3 Gaussian MRFs

The matrix P of the natural parameterization has a graphical model interpretation: there is a link between z_i and z_j in the MRF corresponding to Ñ(J, P) if and only if the entry of P for (z_i, z_j) is non-zero. For example, the matrix

    P = [ X  X  0 ]
        [ X  X  X ]
        [ 0  X  X ]

corresponds to the chain z_1 — z_2 — z_3. Following this graphical model interpretation, P is in many cases highly structured. Consider, for example, the graphical model of a Markov chain:

    x_1 — x_2 — x_3 — x_4 — x_5

The corresponding matrix P will be non-zero only on the diagonal and the first off-diagonals:

    P = [ X  X  0  0  0 ]
        [ X  X  X  0  0 ]
        [ 0  X  X  X  0 ]
        [ 0  0  X  X  X ]
        [ 0  0  0  X  X ]    (16)

Note: P describes which variables directly affect each other.

Note: P^{-1} is, in general, not sparse! (This makes intuitive sense, since P^{-1} = Σ is the covariance matrix, and two states along the Markov chain are not independent, so their covariance is generally non-zero.)

3 Return to Bayesian Linear Regression

3.1 Prediction

With the Bayesian linear regression model, we would like to know the probability of an output y_{t+1} given a new input x_{t+1} and the dataset D = {(x_i, y_i)}_{i=1,...,t}. To compute the probability P(y_{t+1} | x_{t+1}, D), we introduce θ into this expression and marginalize over it:

    P(y_{t+1} | x_{t+1}, D) = ∫_{θ∈Θ} P(y_{t+1} | x_{t+1}, θ, D) P(θ | x_{t+1}, D) dθ.    (17)

Because D tells us no more than θ does, P(y_{t+1} | x_{t+1}, θ, D) = P(y_{t+1} | x_{t+1}, θ). From the Markov assumption encoded in the graphical model, equation (17) becomes

    P(y_{t+1} | x_{t+1}, D) = ∫_{θ∈Θ} P(y_{t+1} | x_{t+1}, θ) P(θ | D) dθ.    (18)

Computing (18) is hard with the moment parametrization of normal distributions but easy with the natural parametrization.

3.2 Posterior Distribution P(θ | D)

Using Bayes rule, the posterior probability P(θ | D) can be expressed as

    P(θ | D) ∝ P(y_{1:t} | x_{1:t}, θ) P(θ) = ( ∏_{i=1}^t P(y_i | x_i, θ) ) P(θ).    (19)

If θ is known, the data are independent; that is, P(y_{1:t} | x_{1:t}, θ) = ∏_{i=1}^t P(y_i | x_i, θ). We will see that this product can be computed by a simple update rule. First, consider the product P(y_i | x_i, θ) P(θ):

    P(y_i | x_i, θ) P(θ) ∝ exp{ −(1/(2σ²)) (y_i − θ^T x_i)² } exp{ J^T θ − (1/2) θ^T P θ }
                         ∝ exp{ −(1/(2σ²)) (−2 y_i θ^T x_i + θ^T x_i x_i^T θ) } exp{ J^T θ − (1/2) θ^T P θ }
                         = exp{ (1/σ²) y_i x_i^T θ − (1/(2σ²)) θ^T x_i x_i^T θ } exp{ J^T θ − (1/2) θ^T P θ }
                         = exp{ (J + (1/σ²) y_i x_i)^T θ − (1/2) θ^T (P + (1/σ²) x_i x_i^T) θ }
                         = exp{ J′^T θ − (1/2) θ^T P′ θ }.

The first line implies the second because any term that does not involve θ (here the y_i² term) can be absorbed into the normalizer. Applying this update once per data point in (19) gives

    P(θ | D) ∝ exp{ ( J + (1/σ²) ∑_i y_i x_i )^T θ − (1/2) θ^T ( P + (1/σ²) ∑_i x_i x_i^T ) θ }.    (20)

So P(θ | D) is also a normal distribution, with J_final = J + (1/σ²) ∑_i y_i x_i and P_final = P + (1/σ²) ∑_i x_i x_i^T. The mean and covariance of this distribution can be recovered with the relations given earlier (taking the prior mean to be zero, so J = 0; see fact 1 below):

    µ_final = Σ_final (1/σ²) ∑_i y_i x_i = ( Σ^{-1} + (1/σ²) ∑_i x_i x_i^T )^{-1} (1/σ²) ∑_i y_i x_i,
    Σ_final = P_final^{-1} = ( Σ^{-1} + (1/σ²) ∑_i x_i x_i^T )^{-1}.

P_final is the precision matrix of the normal distribution, and as the number of x_i's grows, the entries of this matrix become larger. Since P_final is the inverse of the covariance, the variance shrinks as the number of samples grows. It is a characteristic of the Gaussian model that a new data point always lowers the variance, but this shrinking of the variance does not always make sense: if you believe that there are outliers in your dataset, this model will not work for you.

A few more facts about the BLR update:

1. If µ_0 ≠ 0, the update for µ_θ is more complicated.
2. ∑_{t=1}^N y_t x_t / σ² is the gradient of Bayes online linear regression.
3. The update looks just like Newton's method.
4. Computation time: O(n²) for the update, O(n³) for recovering the mean (which can be reduced to O(n²) with tricks).
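The update rule behind (20) is easy to implement directly in the natural parameterization. The following is a minimal NumPy sketch under the assumptions of a zero-mean prior and a known noise variance σ²; the function and variable names (blr_update, theta_true, etc.) are illustrative, not from the notes.

    import numpy as np

    def blr_update(J, P, x, y, sigma2):
        """Fold one observation (x, y) into the natural parameters (J, P)
        of the Gaussian over theta, following equation (20)."""
        J_new = J + (y / sigma2) * x            # J <- J + y_i x_i / sigma^2
        P_new = P + np.outer(x, x) / sigma2     # P <- P + x_i x_i^T / sigma^2
        return J_new, P_new

    # Hypothetical usage: start from a prior N(0, I) in natural form,
    # then fold in simulated data one point at a time.
    n, sigma2 = 3, 0.1
    P = np.linalg.inv(np.eye(n))     # prior precision, Sigma^{-1}
    J = P @ np.zeros(n)              # prior information vector, Sigma^{-1} mu = 0
    rng = np.random.default_rng(0)
    theta_true = np.array([1.0, -2.0, 0.5])
    for _ in range(50):
        x = rng.normal(size=n)
        y = theta_true @ x + rng.normal(scale=np.sqrt(sigma2))
        J, P = blr_update(J, P, x, y, sigma2)   # O(n^2) per data point

    # Recovering the moment parameters costs O(n^3) for the inversion.
    Sigma_final = np.linalg.inv(P)
    mu_final = Sigma_final @ J

Each call to blr_update costs O(n²), and recovering µ_final requires a single O(n³) inversion, matching fact 4 above.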
3.3 Probability Distribution of the Prediction

The next step in computing (18) is to combine P(y_{t+1} | x_{t+1}, θ) with the posterior P(θ | D). Since a linear combination of Gaussian random variables is again Gaussian, the resulting predictive distribution has the form

    P(y_{t+1} | x_{t+1}, D) = (1/Z) exp{ −(y_{t+1} − µ_{y_{t+1}})² / (2 Σ_{y_{t+1}}) },

where

    µ_{y_{t+1}} = E[y_{t+1}] = E[θ^T x_{t+1} + ε] = E[θ]^T x_{t+1} + E[ε] = µ_θ^T x_{t+1},
    Σ_{y_{t+1}} = x_{t+1}^T Σ_θ x_{t+1} + σ².

The term x^T Σ_θ x measures how large the posterior variance is along the direction of x: if nothing like x has been observed before, the variance in that direction is large. Also, the predictive variance is not a function of y_{t+1}; the precision is affected only by the inputs, not the outputs. This is a consequence of assuming the same observation noise σ² everywhere in the space.
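The predictive moments translate directly into code. A minimal sketch, assuming the posterior mean µ_θ and covariance Σ_θ have already been computed as in Section 3.2 (the names are mine):

    import numpy as np

    def blr_predict(mu_theta, Sigma_theta, x_new, sigma2):
        """Predictive mean and variance of y_{t+1} for a new input x_{t+1},
        given the posterior N(mu_theta, Sigma_theta) over theta."""
        mean = mu_theta @ x_new                          # mu_theta^T x_{t+1}
        var = x_new @ Sigma_theta @ x_new + sigma2       # x^T Sigma_theta x + sigma^2
        return mean, var

For an input direction that was never observed in the data, x_new^T Σ_θ x_new is large, so the predictive variance is dominated by parameter uncertainty rather than by the observation noise σ².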