Bayesian Learning & Estimation Theory
Maximum likelihood estimation
• Example: For a Gaussian likelihood P(x|θ) = N(x|μ, σ²), the likelihood of i.i.d. data x1, …, xN is
L = Πn N(xn|μ, σ²), maximized by μML = (1/N) Σn xn and σ²ML = (1/N) Σn (xn − μML)²
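As a quick check of these formulas, here is a minimal NumPy sketch (my own illustration, not from the slides) that computes the closed-form ML estimates on synthetic data and evaluates the log-likelihood at them:

```python
# Sketch: ML estimation for a 1-D Gaussian on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # synthetic data

# Closed-form ML estimates (note the 1/N variance, not 1/(N-1))
mu_ml = x.mean()
var_ml = ((x - mu_ml) ** 2).mean()

def log_likelihood(mu, var):
    return -0.5 * len(x) * np.log(2 * np.pi * var) - ((x - mu) ** 2).sum() / (2 * var)

print(mu_ml, var_ml, log_likelihood(mu_ml, var_ml))
```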
Objective of regression: Minimize error
E(w) = ½ Σn ( tn − y(xn, w) )²
A probabilistic view of linear regression
• Model the target as y(x, w) plus Gaussian noise with precision β = 1/σ²:
p(t|x, w, β) = Πn N( tn | y(xn, w), β⁻¹ )
• Compare to the error function: E(w) = ½ Σn ( tn − y(xn, w) )²
• Since argminw E(w) = argmaxw p(t|x, w, β), regression is equivalent to ML estimation of w
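To make the equivalence concrete, the following sketch (my own illustration; the data, noise level, and cubic features are arbitrary choices) shows that the Gaussian log-likelihood is a decreasing affine function of E(w), so the least-squares weights also maximize the likelihood:

```python
# Sketch: log p(t|x,w,beta) = const - beta * E(w), so argmax = argmin.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
beta = 1.0 / 0.2**2                             # assumed noise precision

Phi = np.vander(x, 4, increasing=True)          # cubic polynomial features
w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]   # least-squares / ML weights

def E(w):                                       # squared-error objective
    return 0.5 * np.sum((t - Phi @ w) ** 2)

def log_lik(w):                                 # Gaussian log-likelihood
    return 0.5 * len(t) * np.log(beta / (2 * np.pi)) - beta * E(w)

# Any perturbation of w_ls increases E and decreases the log-likelihood.
w_other = w_ls + 0.1
print(E(w_ls) < E(w_other), log_lik(w_ls) > log_lik(w_other))
```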
Bayesian learning
• View the data D and parameter θ as random variables (for regression, D = (x, t) and θ = w)
• The data induces a distribution over the parameter:
P(θ|D) = P(D, θ) / P(D) ∝ P(D, θ)
• Substituting P(D, θ) = P(D|θ) P(θ), we obtain Bayes' theorem:
P(θ|D) ∝ P(D|θ) P(θ)
Posterior ∝ Likelihood × Prior
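A small grid-based sketch (my own illustration; the Gaussian model and the N(0, 1) prior are assumptions made for the example) makes the "Posterior ∝ Likelihood × Prior" recipe concrete for a single parameter θ:

```python
# Sketch: Bayes' theorem on a grid for one parameter theta
# (here, the unknown mean of Gaussian data with known unit variance).
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=1.0, size=10)

theta = np.linspace(-4, 4, 1001)                       # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)   # N(0, 1) prior
# Likelihood of all data points at each grid value of theta
lik = np.prod(np.exp(-0.5 * (data[:, None] - theta)**2), axis=0)

posterior = lik * prior                                # Posterior ∝ Likelihood × Prior
posterior /= posterior.sum() * dtheta                  # normalize by P(D)

print(theta[np.argmax(posterior)])                     # MAP estimate on the grid
```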
Bayesian prediction
• Predictions (e.g., predict t from x using data D) are mediated through the parameter:
P(prediction|D) = ∫ P(prediction|θ) P(θ|D) dθ
• Maximum a posteriori (MAP) estimation:
θMAP = argmaxθ P(θ|D)
P(prediction|D) ≈ P(prediction|θMAP)
– Accurate when P(θ|D) is concentrated on θMAP
A probabilistic view of regularized regression
• E(w) = ½ Σn ( tn − y(xn, w) )² + λ/2 Σm wm²
– The first term corresponds to −ln p(t|x, w) and the second to −ln p(w), up to additive constants
• Prior: the w's are IID Gaussian
p(w) = Πm (1/√(2πλ⁻¹)) exp{ −λ wm² / 2 }
• Since argminw E(w) = argmaxw p(t|x, w) p(w), regularized regression is equivalent to MAP estimation of w
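A brief sketch of this equivalence (my own illustration; unit noise precision and the value of λ are assumed): the ridge-regression solution also maximizes the log posterior ln p(t|x, w) + ln p(w):

```python
# Sketch: regularized least squares and MAP estimation give the same weights.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vander(x, 6, increasing=True)
lam = 1e-3

# Minimizer of E(w) = 1/2 ||t - Phi w||^2 + lam/2 ||w||^2  (ridge regression)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

def log_posterior(w):     # ln p(t|x,w) + ln p(w), dropping constants
    return -0.5 * np.sum((t - Phi @ w) ** 2) - 0.5 * lam * np.sum(w ** 2)

# Perturbing w_ridge can only lower the log posterior: ridge solution = MAP.
w_other = w_ridge + rng.normal(scale=0.1, size=w_ridge.size)
print(log_posterior(w_ridge) >= log_posterior(w_other))
```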
Bayesian linear regression
• Likelihood: p(t|x, w, β) = Πn N( tn | y(xn, w), β⁻¹ )
– β specifies the precision of the data noise
• Prior: p(w|α) = Πm=0…M N( wm | 0, α⁻¹ )
– α specifies the precision of the weights
• Posterior: p(w|x, t, α, β) ∝ p(t|x, w, β) p(w|α)
– This is an (M+1)-dimensional Gaussian density, computed using linear algebra (see textbook)
• Prediction: p(t|x, D) = ∫ p(t|x, w, β) p(w|D) dw
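The sketch below (my own code, following the standard closed-form Gaussian-posterior formulas rather than anything given in the slides) computes the posterior over w and the predictive mean and variance for polynomial features, using the α and β values from the example that follows:

```python
# Sketch: closed-form posterior and predictive distribution for Bayesian
# linear regression with polynomial features (alpha = weight precision,
# beta = noise precision, as in the notation above).
import numpy as np

def design_matrix(x, M):
    return np.vander(x, M + 1, increasing=True)      # [1, x, ..., x^M]

def posterior(x, t, M, alpha, beta):
    Phi = design_matrix(x, M)
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)                          # posterior covariance
    m = beta * S @ Phi.T @ t                          # posterior mean
    return m, S

def predictive(x_new, m, S, M, beta):
    phi = design_matrix(np.atleast_1d(x_new), M)
    mean = phi @ m                                    # predictive mean
    var = 1.0 / beta + np.sum(phi @ S * phi, axis=1)  # predictive variance
    return mean, var

# Example usage with the hyperparameters quoted below (alpha = 5e-3, beta = 11.1)
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=np.sqrt(1 / 11.1), size=x.size)
m, S = posterior(x, t, M=9, alpha=5e-3, beta=11.1)
mean, var = predictive(np.linspace(0, 1, 5), m, S, M=9, beta=11.1)
print(mean, np.sqrt(var))
```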
Example: y(x) = w0 + w1x
[Figure: sequential Bayesian learning for y(x) = w0 + w1x, shown with no data, then after the 1st point, 2nd point, …, 20th point; columns show the likelihood of the newest data point, the prior/posterior over (w0, w1), and samples of y(x) drawn from the posterior]
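A sequential version of the same computation (my own sketch; the precisions and the true weights −0.3 and 0.5 used to generate data are arbitrary choices) treats the current posterior as the prior for each new point, mirroring the figure:

```python
# Sketch: sequential Bayesian updating for the model y(x) = w0 + w1*x.
import numpy as np

alpha, beta = 2.0, 25.0                       # assumed precisions
m = np.zeros(2)                               # prior mean
S = np.eye(2) / alpha                         # prior covariance

rng = np.random.default_rng(5)
for _ in range(20):                           # process 20 points one at a time
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + rng.normal(scale=1 / np.sqrt(beta))
    phi = np.array([1.0, x])                  # features for y(x) = w0 + w1*x
    S_inv_new = np.linalg.inv(S) + beta * np.outer(phi, phi)
    S_new = np.linalg.inv(S_inv_new)
    m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
    S = S_new

print(m)   # posterior mean approaches the true weights (-0.3, 0.5)
```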
Example: y(x) = w0 + w1x + … + wMxM
• M = 9, α = 5×10⁻³: gives a reasonable range of functions
• β = 11.1: known precision of the noise
[Plot: mean and one standard deviation of the predictive distribution]
Example: y(x) = w0 + w1φ1(x) + … + wMφM(x)
Gaussian basis functions:
[Plot: Gaussian basis functions φj(x) spread across the input range 0 to 1]
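A short sketch of a Gaussian-basis design matrix (my own choice of centers and width s) that can stand in for the polynomial features in the Bayesian regression code above:

```python
# Sketch: design matrix of Gaussian basis functions
# phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant column for w0.
import numpy as np

def gaussian_design_matrix(x, centers, s):
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

x = np.linspace(0, 1, 50)
Phi = gaussian_design_matrix(x, centers=np.linspace(0, 1, 9), s=0.1)
print(Phi.shape)   # (50, 10): bias term plus M = 9 basis functions
```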
How are we doing on the pass sequence?
• Least squares regression has several shortcomings:
– Cross validation reduced the training data, so the red line isn't as accurate as it should be
– Choosing a particular M and w seems wrong – we should hedge our bets
– The red line doesn't reveal different levels of uncertainty in predictions
[Plot: least-squares fit (red line) of the hand-labeled horizontal coordinate, t, against the feature x]
How are we doing on the pass sequence?
• Bayesian regression addresses these shortcomings
[Plot: Bayesian predictive distribution for the hand-labeled horizontal coordinate, t, against the feature x]
Estimation theory
• Provided with a predictive
distribution p(t|x), how do we
estimate a single value for t?
– Example: In the pass sequence,
Cupid must aim at and hit the
man in the white shirt, without
hitting the man in the striped shirt
• Define L(t,t*) as the loss incurred by estimating t*
when the true value is t
• Assuming p(t|x) is correct, the expected loss is
E[L] = ∫ L(t, t*) p(t|x) dt
• The minimum loss estimate is found by minimizing
E[L] w.r.t. t*
Squared loss
• A common choice: L(t, t*) = ( t − t* )²
E[L] = ∫ ( t − t* )² p(t|x) dt
– Not appropriate for Cupid's problem
• To minimize E[L], set its derivative to zero:
dE[L]/dt* = −2 ∫ ( t − t* ) p(t|x) dt = 0
⟹ −∫ t p(t|x) dt + t* = 0
• Minimum mean squared error (MMSE) estimate:
t* = E[t|x] = ∫ t p(t|x) dt
For regression: t* = y(x,w)
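The following numeric sketch (my own illustration; the bimodal p(t|x) is chosen to echo Cupid's problem) confirms that the minimizer of the expected squared loss is the mean, which here falls between the two modes:

```python
# Sketch: E[L] = integral of (t - t*)^2 p(t|x) dt is minimized at the mean of p(t|x).
import numpy as np

t = np.linspace(-5, 10, 3001)                      # grid over t
dt = t[1] - t[0]
# An example bimodal p(t|x): the mean lands between the modes (a poor aim point)
p = 0.7 * np.exp(-0.5 * (t - 0.0)**2) + 0.3 * np.exp(-0.5 * ((t - 6.0) / 1.5)**2)
p /= p.sum() * dt                                  # normalize on the grid

def expected_sq_loss(t_star):
    return np.sum((t - t_star) ** 2 * p) * dt

candidates = np.linspace(-2, 8, 1001)
best = candidates[np.argmin([expected_sq_loss(c) for c in candidates])]
print(best, np.sum(t * p) * dt)                    # numerical minimizer ≈ E[t|x]
```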
Other loss functions
[Figure: squared loss and absolute loss as functions of t − t*; below, a number line with samples t1, …, t7 and an estimate t* placed between t6 and t7]
L = |t*-t1| + |t*-t2| + |t*-t3| + |t*-t4| + |t*-t5| + |t*-t6| + |t*-t7|
• Consider moving t* to the left by ε
– L decreases by 6ε and increases by ε
– Changes in L are balanced when t* = t4
• The median of t under p(t|x) minimizes absolute loss
• Important: The median is invariant to monotonic
transformations of t
[Figure: a distribution p(t|x) with its median and mean marked on the t axis]
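A small sketch (my own example values, with t7 placed as an outlier) confirming that the sum of absolute deviations is minimized at the median rather than the mean:

```python
# Sketch: the median minimizes the sum of absolute deviations,
# matching the counting argument above with 7 sample points.
import numpy as np

samples = np.array([1.0, 2.2, 2.9, 3.5, 4.1, 4.8, 9.0])   # t1..t7 (t7 an outlier)

def abs_loss(t_star):
    return np.sum(np.abs(t_star - samples))

candidates = np.linspace(0, 10, 10001)
best = candidates[np.argmin([abs_loss(c) for c in candidates])]
print(best, np.median(samples), np.mean(samples))   # minimizer ≈ median, not mean
```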
D-dimensional estimation
• Suppose t is D-dimensional, t = (t1,…,tD)
– Example: 2-dimensional tracking
• Approach 1: Minimum marginal loss estimation
– Find td* that minimizes ∫ L(td, td*) p(td|x) dtd
• Approach 2: Minimum joint loss estimation
– Define joint loss L(t,t*)
– Find t* that minimizes ∫ L(t, t*) p(t|x) dt
Questions?
How are we doing on the pass sequence?
• Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data
• Can we track the man in the white shirt?
– The man in the white shirt is occluded
– Computing the 1st moment of the feature gives x = 224
[Figure: video frame and plot of the hand-labeled horizontal coordinate, t (t = 290; horizontal location 0 to 320), against the feature, x: fraction of pixels in a column with intensity > 0.9]
How are we doing on the pass sequence?
• Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data
• Can we track the man in the white shirt? Not very well.
– Regression fails to identify that there really are two classes of solution
[Plot: hand-labeled horizontal coordinate, t, against the feature, x]