Bayesian Learning & Estimation Theory

Maximum likelihood estimation
• Example: For a Gaussian likelihood P(x|θ) = N(x|μ, σ²), the log-likelihood is L(μ, σ²) = Σ_n ln N(x_n|μ, σ²)
• Objective of regression: minimize the error E(w) = ½ Σ_n (t_n − y(x_n, w))²

A probabilistic view of linear regression
• Model the target as y(x, w) plus Gaussian noise with precision β = 1/σ², so the likelihood is p(t|x, w, β) = ∏_n N(t_n | y(x_n, w), β⁻¹)
• Compare to the error function: E(w) = ½ Σ_n (t_n − y(x_n, w))²
• Since argmin_w E(w) = argmax_w p(t|x, w, β), regression is equivalent to ML estimation of w

Bayesian learning
• View the data D and parameter θ as random variables (for regression, D = (x, t) and θ = w)
• The data induces a distribution over the parameter: P(θ|D) = P(D, θ) / P(D) ∝ P(D, θ)
• Substituting P(D, θ) = P(D|θ) P(θ), we obtain Bayes' theorem:
  P(θ|D) ∝ P(D|θ) P(θ)   (Posterior ∝ Likelihood × Prior)

Bayesian prediction
• Predictions (e.g., predict t from x using data D) are mediated through the parameter:
  P(prediction|D) = ∫ P(prediction|θ) P(θ|D) dθ
• Maximum a posteriori (MAP) estimation: θ_MAP = argmax_θ P(θ|D), giving P(prediction|D) ≈ P(prediction|θ_MAP)
  – Accurate when P(θ|D) is concentrated on θ_MAP

A probabilistic view of regularized regression
• E(w) = ½ Σ_n (t_n − y(x_n, w))² + (λ/2) Σ_m w_m²
  – The first term corresponds to −ln p(t|x, w) and the second to −ln p(w)
• Prior: the w's are IID Gaussian, p(w) = ∏_m (λ/2π)^½ exp{−λ w_m² / 2}
• Since argmin_w E(w) = argmax_w p(t|x, w) p(w), regularized regression is equivalent to MAP estimation of w

Bayesian linear regression
• Likelihood: p(t|x, w, β) = ∏_n N(t_n | y(x_n, w), β⁻¹)
  – β specifies the precision of the data noise
• Prior: p(w) = ∏_{m=0}^{M} N(w_m | 0, α⁻¹)
  – α specifies the precision of the weights
• Posterior: p(w|x, t) is an (M+1)-dimensional Gaussian density, computed using linear algebra (see textbook; a NumPy sketch appears at the end of this part)
• Prediction: p(t|x, D) = ∫ p(t|x, w) p(w|D) dw

Example: y(x) = w₀ + w₁x
[Figure: likelihood, posterior, and y(x) sampled from the posterior, shown with no data, after the 1st point, 2nd point, ..., 20th point]

Example: y(x) = w₀ + w₁x + … + w_M x^M
• M = 9, α = 5×10⁻³: gives a reasonable range of functions
• β = 11.1: known precision of the noise
[Figure: mean and one standard deviation of the predictive distribution]

Example: y(x) = w₀ + w₁φ₁(x) + … + w_M φ_M(x)
• Gaussian basis functions φ_m(x)
[Figure: Gaussian basis functions]

How are we doing on the pass sequence?
• Least squares regression has three problems:
  – Cross validation reduced the training data, so the red line isn't as accurate as it should be
  – Choosing a particular M and w seems wrong – we should hedge our bets
  – The red line doesn't reveal different levels of uncertainty in predictions
• Bayesian regression addresses all three
[Figure: hand-labeled horizontal coordinate t versus feature x, under least squares and under Bayesian regression]

Estimation theory
• Provided with a predictive distribution p(t|x), how do we estimate a single value for t?
  – Example: In the pass sequence, Cupid must aim at and hit the man in the white shirt, without hitting the man in the striped shirt
• Define L(t, t*) as the loss incurred by estimating t* when the true value is t
• Assuming p(t|x) is correct, the expected loss is E[L] = ∫ L(t, t*) p(t|x) dt
• The minimum loss estimate is found by minimizing E[L] with respect to t*
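The posterior and predictive computations referenced above have a closed form. The following is a minimal NumPy sketch of them, using the standard Gaussian-linear algebra S_N⁻¹ = αI + βΦᵀΦ, m_N = βS_NΦᵀt and predictive variance 1/β + φ(x)ᵀS_Nφ(x). The function names, polynomial features, and sin(2πx) toy data are illustrative assumptions, not from the slides; only M = 9, α = 5×10⁻³, β = 11.1 come from the example above.

```python
import numpy as np

def poly_features(x, M):
    """Design matrix Phi with rows [1, x, x^2, ..., x^M]."""
    return np.vander(x, M + 1, increasing=True)

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^-1 I)
    and Gaussian observation noise of precision beta."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(x_new, m_N, S_N, beta, M):
    """Predictive mean phi(x)^T m_N and variance 1/beta + phi(x)^T S_N phi(x)."""
    Phi = poly_features(x_new, M)
    mean = Phi @ m_N
    var = 1.0 / beta + np.sum((Phi @ S_N) * Phi, axis=1)
    return mean, var

# Toy data (an assumption for illustration): noisy samples of sin(2*pi*x),
# with the slide's settings M = 9, alpha = 5e-3, beta = 11.1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(11.1), x.size)

m_N, S_N = posterior(poly_features(x, 9), t, alpha=5e-3, beta=11.1)
mean, var = predictive(np.linspace(0, 1, 5), m_N, S_N, beta=11.1, M=9)
print(np.c_[mean, np.sqrt(var)])  # predictive mean and one std dev
```

Because the posterior is Gaussian, its mean m_N is also its mode, so the MAP estimate of w coincides with the posterior mean here.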
Squared loss
• A common choice: L(t, t*) = (t − t*)², so E[L] = ∫ (t − t*)² p(t|x) dt
  – Not appropriate for Cupid's problem
• To minimize E[L], set its derivative to zero:
  dE[L]/dt* = −2 ∫ (t − t*) p(t|x) dt = 0  ⟹  −∫ t p(t|x) dt + t* = 0
• Minimum mean squared error (MMSE) estimate: t* = E[t|x] = ∫ t p(t|x) dt
  – For regression: t* = y(x, w)

Other loss functions
• Absolute loss: L(t, t*) = |t − t*|
[Figure: squared loss and absolute loss; samples t₁, …, t₇ on a line, with the estimate t* between t₆ and t₇]
• Example with seven samples: L = |t* − t₁| + |t* − t₂| + |t* − t₃| + |t* − t₄| + |t* − t₅| + |t* − t₆| + |t* − t₇|
• Consider moving t* to the left by ε
  – L decreases by 6ε and increases by ε
  – Changes in L are balanced when t* = t₄, the middle sample
• The median of t under p(t|x) minimizes absolute loss
• Important: the median is invariant to monotonic transformations of t
[Figure: mean and median of a skewed distribution p(t|x)]

D-dimensional estimation
• Suppose t is D-dimensional, t = (t₁, …, t_D)
  – Example: 2-dimensional tracking
• Approach 1: minimum marginal loss estimation
  – Find t_d* that minimizes ∫ L(t_d, t_d*) p(t_d|x) dt_d
• Approach 2: minimum joint loss estimation
  – Define a joint loss L(t, t*)
  – Find t* that minimizes ∫ L(t, t*) p(t|x) dt

Questions?

How are we doing on the pass sequence?
• Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data
• Can we track the man in the white shirt? Not very well.
  – The man in the white shirt is occluded: computing the 1st moment of the feature (fraction of pixels in each column with intensity > 0.9) gives x = 224, while the hand-labeled horizontal coordinate is t = 290
  – Regression fails to identify that there really are two classes of solution
[Figure: hand-labeled horizontal coordinate t versus feature x; the occluded frames split the data into two branches]
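To make the loss-minimizing estimates concrete, here is a small sketch that computes them from samples of a predictive distribution. The bimodal distribution, its mode locations, and the histogram-based mode estimate are hypothetical choices meant to echo the "two classes of solution" problem; the facts that the mean minimizes squared loss and the median minimizes absolute loss come from the material above.

```python
import numpy as np

# Hypothetical samples from a bimodal predictive distribution p(t|x):
# two "classes of solution", as in the occluded pass-sequence frames.
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(150, 8, 500),
                          rng.normal(260, 8, 500)])

t_mmse = samples.mean()     # minimizes expected squared loss: E[t|x]
t_med = np.median(samples)  # minimizes expected absolute loss

# Neither is appropriate for Cupid's problem: both land between the
# modes, where p(t|x) is nearly zero. A histogram mode (a MAP-style
# estimate) picks one of the two solutions instead.
counts, edges = np.histogram(samples, bins=60)
i = np.argmax(counts)
t_mode = 0.5 * (edges[i] + edges[i + 1])

print(f"MMSE: {t_mmse:.1f}  median: {t_med:.1f}  mode: {t_mode:.1f}")
```

Running this shows the mean and median falling in the low-probability gap between the two modes, which is exactly why a single regression curve fails on the occluded frames.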