Uploaded by SAMIR ORUJOV

Bayesian Estimation - Statistics

Bayesian Estimation
• Bayesian estimators differ from all classical estimators studied
so far in that they consider the parameters as random
variables instead of unknown constants.
• As such, the parameters also have a PDF, which needs to
be taken into account when seeking an estimator.
• The PDF of the parameters can be used to incorporate
any prior knowledge we may have about their values.
• For example, we might know that the normalized
frequency f0 of an observed sinusoid cannot be greater
than 0.1. This is ensured by choosing
$$p(f_0) = \begin{cases} 10, & \text{if } 0 \le f_0 \le 0.1 \\ 0, & \text{otherwise} \end{cases}$$
as the prior PDF in the Bayesian framework.
• Differentiable PDFs are usually easier to work with, so we could
approximate the uniform PDF with, e.g., the Rayleigh PDF.
[Figure: two candidate priors plotted against the normalized frequency f0 (horizontal axis, 0 to 1). Panels: "Uniform density" and "Rayleigh density with σ = 0.035".]
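The two priors can be compared numerically. The following is a minimal sketch, assuming the standard Rayleigh density (f0/σ²)·exp(−f0²/(2σ²)) with σ = 0.035; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def uniform_prior(f0, upper=0.1):
    """Uniform prior on [0, upper]: height 1/upper inside, 0 outside."""
    f0 = np.asarray(f0, dtype=float)
    return np.where((f0 >= 0.0) & (f0 <= upper), 1.0 / upper, 0.0)

def rayleigh_prior(f0, sigma=0.035):
    """Rayleigh density (f0 / sigma^2) * exp(-f0^2 / (2 sigma^2)) for f0 >= 0."""
    f0 = np.asarray(f0, dtype=float)
    pdf = (f0 / sigma**2) * np.exp(-f0**2 / (2.0 * sigma**2))
    return np.where(f0 >= 0.0, pdf, 0.0)

def rayleigh_mass_below(limit, sigma=0.035):
    """Closed-form Rayleigh CDF: P(f0 <= limit) = 1 - exp(-limit^2 / (2 sigma^2))."""
    return 1.0 - np.exp(-limit**2 / (2.0 * sigma**2))

# Riemann-sum check that both curves are proper densities on [0, 1].
f0 = np.linspace(0.0, 1.0, 100001)
df = f0[1] - f0[0]
print("uniform integrates to ", np.sum(uniform_prior(f0)) * df)
print("Rayleigh integrates to", np.sum(rayleigh_prior(f0)) * df)
print("Rayleigh mass on [0, 0.1]:", rayleigh_mass_below(0.1))
```

With σ = 0.035 the Rayleigh prior places roughly 98% of its probability mass below 0.1, so it acts as a smooth, differentiable stand-in for the hard constraint rather than an exact replacement.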
Prior and Posterior estimates
• One of the key properties of the Bayesian approach is that it
can also be used for small data records, and the estimate
can be improved sequentially as new data arrives.
• For example, consider tossing a coin and estimating the
probability of a head, µ.
• As we saw earlier, the ML estimate is the number of
observed heads divided by the total number of tosses:
$$\hat{\mu} = \frac{\#\text{heads}}{\#\text{tosses}}.$$
• However, if we cannot afford to make more than, say,
three experiments, we may end up seeing three heads and
no tails. Thus, we are forced to infer that $\hat{\mu}_{\text{ML}} = 1$, i.e., that the coin
always lands heads.
• The Bayesian approach can circumvent this problem,
because the prior regularizes the likelihood and avoids
overfitting to the small amount of data.
• The pictures below illustrate this. The one on the top is the
likelihood function
$$p(x \mid \mu) = \mu^{\#\text{heads}} (1 - \mu)^{\#\text{tails}}$$
with #heads = 3 and #tails = 0. The maximum of the
function is at unity.
• The second curve is the prior density p(µ) of our choice. It
was selected to reflect our assumption that the coin
is probably quite fair.
• The third curve is the posterior density p(µ | x) after
observing the samples, which can be evaluated using the
Bayes formula
$$p(\mu \mid x) = \frac{p(x \mid \mu)\, p(\mu)}{p(x)} = \frac{\text{likelihood} \cdot \text{prior}}{p(x)}$$
• Thus, the third curve is the product of the first two (with
normalization), and one Bayesian alternative is to use its
maximum as the estimate.
[Figure: three curves over µ from 0 to 1. Top: "Likelihood function after three tosses resulting in a head", p(x | µ). Middle: "Prior density before observing any data", p(µ). Bottom: "Posterior density after observing 3 heads", p(µ | x).]
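The sequential use of Bayes' formula can be sketched on a grid. This is only an illustration under stated assumptions: the prior is taken to be a Gaussian-shaped bump centred at 0.5 with σ = 0.1 and restricted to [0, 1], and the helper names `normalize` and `update` are hypothetical.

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 1001)   # grid over the head probability
dmu = mu[1] - mu[0]

def normalize(density):
    """Scale a nonnegative curve so that it integrates to one on the grid."""
    return density / (np.sum(density) * dmu)

def update(prior, toss):
    """One Bayes step: the posterior is proportional to likelihood times prior."""
    likelihood = mu if toss == "H" else 1.0 - mu
    return normalize(likelihood * prior)

# Assumed prior: Gaussian-shaped bump around a fair coin, restricted to [0, 1].
prior = normalize(np.exp(-(mu - 0.5) ** 2 / (2.0 * 0.1 ** 2)))

# Sequential use: the posterior after each toss is the prior for the next one.
posterior = prior
for toss in "HHH":
    posterior = update(posterior, toss)
    print(toss, "-> posterior peaks at mu =", mu[np.argmax(posterior)])

# Batch use: apply the three-head likelihood mu^3 in a single step.
batch = normalize(mu ** 3 * prior)
print("sequential == batch:", np.allclose(posterior, batch))
```

Processing the tosses one at a time gives the same posterior as processing them all at once, which is what makes the sequential refinement possible.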
Cost Functions
• Bayesian estimators are defined by a minimization
problem
$$\hat{\theta} = \arg\min_{\hat{\theta}} \iint C(\theta - \hat{\theta})\, p(x, \theta)\, dx\, d\theta$$
which seeks the value of θ̂ that minimizes the average
cost.
• The cost function C(x) is typically one of the following:
1. Quadratic: $C(x) = x^2$
2. Absolute: $C(x) = |x|$
3. Hit-or-miss: $C(x) = \begin{cases} 0, & |x| < \delta \\ 1, & |x| > \delta \end{cases}$
• Additional cost functions include Huber's robust loss and the
ε-insensitive loss.
• These three cost functions are favoured, because we can
find the minimum cost solution in closed form. We will
introduce the solutions next.
• Functions 1 and 3 are slightly easier to use than 2. Thus,
we’ll concentrate on those.
• Regardless of the cost function, the above double integral
can be evaluated and minimized using the rule for joint
probabilities:
p(x, θ) = p(θ | x)p(x).
• This results in
$$\iint C(\theta - \hat{\theta})\, p(\theta \mid x)\, p(x)\, dx\, d\theta = \int \underbrace{\left[ \int C(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta \right]}_{(*)} p(x)\, dx$$
• Because p(x) is always nonnegative, it suffices to minimize
the multiplier inside the brackets, (∗):¹
$$\hat{\theta} = \arg\min_{\hat{\theta}} \int C(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta$$
¹ Note that there is a slight shift in the paradigm. The double integral results in the theoretical estimate that
requires the knowledge of p(x). When minimizing only the inner integral, we get the optimum for a particular
realization, not all possible realizations.
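1. Quadratic Cost Solution (or the MMSE estimator)
• If we select the quadratic cost, then the Bayesian estimator
is defined by
$$\arg\min_{\hat{\theta}} \int (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta$$
• Simple differentiation gives:
$$\frac{\partial}{\partial \hat{\theta}} \int (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta = \int \frac{\partial}{\partial \hat{\theta}} (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta = \int -2(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta$$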
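• Setting this equal to zero gives
$$\begin{aligned}
& \int -2(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta = 0 \\
\Leftrightarrow\quad & \int 2\hat{\theta}\, p(\theta \mid x)\, d\theta = \int 2\theta\, p(\theta \mid x)\, d\theta \\
\Leftrightarrow\quad & \hat{\theta} \underbrace{\int p(\theta \mid x)\, d\theta}_{=1} = \int \theta\, p(\theta \mid x)\, d\theta \\
\Leftrightarrow\quad & \hat{\theta} = \int \theta\, p(\theta \mid x)\, d\theta
\end{aligned}$$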
• Thus, we have the minimum:
$$\hat{\theta}_{\text{MMSE}} = \int \theta\, p(\theta \mid x)\, d\theta = E(\theta \mid x),$$
i.e., the mean of the posterior PDF, p(θ | x).²
• This is called the minimum mean square error estimator
(MMSE estimator), because it minimizes the average
squared error.
² The prior PDF, p(θ), refers to the parameter distribution before any observations are made. The posterior PDF, p(θ | x),
refers to the parameter distribution after observing the data.
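2. Absolute Cost Solution
• If we choose the absolute value as the cost function, we
have to solve
$$\arg\min_{\hat{\theta}} \int |\theta - \hat{\theta}|\, p(\theta \mid x)\, d\theta$$
• This can be shown to be equivalent to the following
condition
$$\int_{-\infty}^{\hat{\theta}} p(\theta \mid x)\, d\theta = \int_{\hat{\theta}}^{\infty} p(\theta \mid x)\, d\theta$$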
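One way to see the equivalence is sketched below, assuming the posterior is regular enough that the Leibniz rule applies. The symbol g is introduced here only for the illustration. Split the absolute value at θ̂ and differentiate; the boundary terms vanish because both integrands are zero at θ = θ̂.

```latex
\begin{aligned}
g(\hat{\theta}) &= \int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta \mid x)\, d\theta
                 + \int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta, \\
\frac{d g}{d \hat{\theta}} &= \int_{-\infty}^{\hat{\theta}} p(\theta \mid x)\, d\theta
                 - \int_{\hat{\theta}}^{\infty} p(\theta \mid x)\, d\theta = 0 .
\end{aligned}
```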
• In other words, the estimate is the value which divides the
probability mass into equal proportions:
$$\int_{-\infty}^{\hat{\theta}} p(\theta \mid x)\, d\theta = \frac{1}{2}$$
• Thus, we have arrived at the definition of the median of the
posterior PDF.
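3. Hit-or-miss Cost Solution (or the MAP estimator)
• For the hit-or-miss case, we also need to minimize the
inner integral:
$$\hat{\theta} = \arg\min_{\hat{\theta}} \int C(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta$$
with
$$C(x) = \begin{cases} 0, & |x| < \delta \\ 1, & |x| > \delta \end{cases}$$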
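• The integral becomes
$$\int C(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta = \int_{-\infty}^{\hat{\theta} - \delta} 1 \cdot p(\theta \mid x)\, d\theta + \int_{\hat{\theta} + \delta}^{\infty} 1 \cdot p(\theta \mid x)\, d\theta$$
or in a simplified form
$$\int C(\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta = 1 - \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} 1 \cdot p(\theta \mid x)\, d\theta$$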
• This is minimized by maximizing
$$\int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} p(\theta \mid x)\, d\theta$$
• For small δ and smooth p(θ | x) the maximum of the
integral occurs at the maximum of p(θ | x).
• Therefore, the estimator is the mode (the location of the highest value) of
the posterior PDF. Hence the name Maximum a Posteriori
(MAP) estimator.
• Note that the MAP estimator
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid x)$$
is calculated as (using Bayes' rule):
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$
• Since p(x) does not depend on θ, it is equivalent to
maximize only the numerator:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(x \mid \theta)\, p(\theta)$$
• Incidentally, this is close to the ML estimator:
$$\hat{\theta}_{\text{ML}} = \arg\max_{\theta}\, p(x \mid \theta)$$
The only difference is the inclusion of the prior PDF.
Summary
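• To summarize, the three most widely used Bayesian
estimators are
1. The MMSE, $\hat{\theta}_{\text{MMSE}} = E(\theta \mid x)$
2. The median, i.e., the θ̂ with $\int_{-\infty}^{\hat{\theta}} p(\theta \mid x)\, d\theta = \tfrac{1}{2}$
3. The MAP, $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(x \mid \theta)\, p(\theta)$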
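When the posterior is only available numerically, all three estimators can be read off a grid. The sketch below is an illustration, not a prescribed method: the function name `grid_estimators` is hypothetical, the posterior is assumed to be sampled on a uniform grid, and the demonstration uses the coin posterior µ³·p(µ) with an assumed Gaussian-shaped prior (mean 0.5, σ = 0.1).

```python
import numpy as np

def grid_estimators(theta, unnormalized_posterior):
    """Return (MMSE, median, MAP) of a posterior sampled on a uniform grid."""
    dtheta = theta[1] - theta[0]
    post = unnormalized_posterior / (np.sum(unnormalized_posterior) * dtheta)
    mmse = np.sum(theta * post) * dtheta            # posterior mean
    cdf = np.cumsum(post) * dtheta                  # running probability mass
    idx = min(int(np.searchsorted(cdf, 0.5)), len(theta) - 1)
    median = theta[idx]                             # first grid point with mass >= 1/2
    map_estimate = theta[np.argmax(post)]           # posterior mode
    return mmse, median, map_estimate

# Coin example: three heads, Gaussian-shaped prior around 0.5 (assumed sigma = 0.1).
mu = np.linspace(0.0, 1.0, 10001)
prior = np.exp(-(mu - 0.5) ** 2 / (2.0 * 0.1 ** 2))
posterior = mu ** 3 * prior

mmse, median, map_estimate = grid_estimators(mu, posterior)
print("MMSE  :", mmse)
print("median:", median)
print("MAP   :", map_estimate)
```

For a symmetric, unimodal posterior the three values coincide; for a skewed posterior they generally differ, which is why the choice of cost function matters.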
Example
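• Consider the case of tossing a coin three times, resulting
in three heads.
• In the example, we used the Gaussian prior
$$p(\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right].$$
• Now the µ̂MAP becomes
$$\begin{aligned}
\hat{\mu}_{\text{MAP}} &= \arg\max_{\mu}\, p(x \mid \mu)\, p(\mu) \\
&= \arg\max_{\mu}\, \mu^{\#\text{heads}} (1 - \mu)^{\#\text{tails}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right]
\end{aligned}$$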
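• Let's simplify the arithmetic by setting #heads = 3 and
#tails = 0:
$$\hat{\mu}_{\text{MAP}} = \arg\max_{\mu}\, \mu^3 \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right]$$
• Equivalently, we can maximize its logarithm:
$$\arg\max_{\mu} \left[ 3 \ln \mu - \ln \sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2} (\mu - 0.5)^2 \right]$$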
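• Now,
$$\frac{\partial}{\partial \mu} \ln \left[ p(x \mid \mu)\, p(\mu) \right] = \frac{3}{\mu} - \frac{(\mu - 0.5)}{\sigma^2} = 0,$$
when
$$\mu^2 - 0.5\mu - 3\sigma^2 = 0.$$
This happens when
$$\mu = \frac{0.5 \pm \sqrt{0.25 - 4 \cdot 1 \cdot (-3\sigma^2)}}{2} = 0.25 \pm \frac{\sqrt{0.25 + 12\sigma^2}}{2}.$$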
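• If we substitute the value used in the example, σ = 0.1, and take the positive root (the other root is negative),
$$\hat{\mu}_{\text{MAP}} = 0.25 + \frac{\sqrt{0.37}}{2} \approx 0.554.$$
• Thus, we have found the analytical solution for the
maximum of the posterior curve shown earlier.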
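The closed-form root can be checked numerically. This small sketch evaluates the formula above for a given σ and verifies that the derivative of the log-posterior vanishes there; the function names are hypothetical.

```python
import math

def coin_map_three_heads(sigma):
    """Positive root of mu^2 - 0.5*mu - 3*sigma^2 = 0 (three heads, no tails)."""
    return 0.25 + math.sqrt(0.25 + 12.0 * sigma ** 2) / 2.0

def log_posterior_slope(mu, sigma):
    """Derivative of 3*ln(mu) - (mu - 0.5)^2 / (2*sigma^2) with respect to mu."""
    return 3.0 / mu - (mu - 0.5) / sigma ** 2

mu_map = coin_map_three_heads(0.1)
print(round(mu_map, 3))                       # 0.554, as in the text
print(abs(log_posterior_slope(mu_map, 0.1)))  # numerically zero at the maximum
```

A smaller σ expresses a stronger belief in a fair coin and pulls the estimate towards 0.5; a larger σ lets the three observed heads pull it further towards 1.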
Vector Parameter Case for MMSE
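• In the vector parameter case, the MMSE estimator is
$$\hat{\theta}_{\text{MMSE}} = E(\theta \mid x)$$
or more explicitly
$$\hat{\theta}_{\text{MMSE}} = \begin{pmatrix} \int \theta_1\, p(\theta \mid x)\, d\theta \\ \int \theta_2\, p(\theta \mid x)\, d\theta \\ \vdots \\ \int \theta_p\, p(\theta \mid x)\, d\theta \end{pmatrix}$$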
• In the linear model case, there exists a straightforward
solution:
If the observed data can be modeled as
$$x = H\theta + w,$$
where $\theta \sim N(\mu_\theta, C_\theta)$ and $w \sim N(0, C_w)$ are independent, then
$$E(\theta \mid x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H\mu_\theta)$$
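• It is possible to derive an alternative form resembling the
LS estimator (exercise):
$$E(\theta \mid x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta).$$
• Note that with µθ = 0, Cθ = I and $C_w = \sigma_w^2 I$ this becomes
$(H^T H + \sigma_w^2 I)^{-1} H^T x$, i.e., the LS estimator with an additional regularization term $\sigma_w^2 I$.
The plain LS estimator $(H^T H)^{-1} H^T x$ is approached as the prior becomes uninformative ($C_\theta^{-1} \to 0$).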
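The agreement of the two forms can be checked numerically before attempting the exercise. The following sketch uses made-up matrices; the function names (`mmse_gain_form`, `mmse_information_form`) are labels chosen here, not terminology from the slides.

```python
import numpy as np

def mmse_gain_form(x, H, mu, C_theta, C_w):
    """mu + C_theta H^T (H C_theta H^T + C_w)^{-1} (x - H mu)."""
    S = H @ C_theta @ H.T + C_w
    return mu + C_theta @ H.T @ np.linalg.solve(S, x - H @ mu)

def mmse_information_form(x, H, mu, C_theta, C_w):
    """mu + (C_theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H mu)."""
    Cw_inv = np.linalg.inv(C_w)
    A = np.linalg.inv(C_theta) + H.T @ Cw_inv @ H
    return mu + np.linalg.solve(A, H.T @ Cw_inv @ (x - H @ mu))

# Made-up problem: N = 6 observations, p = 2 parameters.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 2))
mu = np.array([0.5, -1.0])
C_theta = np.array([[2.0, 0.3], [0.3, 1.0]])
C_w = 0.5 * np.eye(6)
x = rng.normal(size=6)

est_a = mmse_gain_form(x, H, mu, C_theta, C_w)
est_b = mmse_information_form(x, H, mu, C_theta, C_w)
print(est_a, est_b, np.allclose(est_a, est_b))
```

The first form solves an N × N system and the second a p × p system, so the second is usually the cheaper one when there are many observations and few parameters.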
Vector Parameter Case for the MAP
• The MAP estimator can also be extended to vector
parameters:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid x)$$
or, using Bayes' rule,
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(x \mid \theta)\, p(\theta)$$
• Note that in general this is different from p scalar MAPs.
A scalar MAP estimator would maximize the marginal posterior of each parameter θi
individually, whereas the vector MAP seeks the global
maximum of the joint posterior over the whole parameter space.
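Example: MMSE Estimation of Sinusoidal Parameters
• Consider the data model
$$x[n] = a \cos 2\pi f_0 n + b \sin 2\pi f_0 n + w[n], \quad n = 0, 1, \ldots, N-1$$
or in vector form
$$x = H\theta + w,$$
where
$$H = \begin{pmatrix} 1 & 0 \\ \cos 2\pi f_0 & \sin 2\pi f_0 \\ \cos 4\pi f_0 & \sin 4\pi f_0 \\ \vdots & \vdots \\ \cos(2(N-1)\pi f_0) & \sin(2(N-1)\pi f_0) \end{pmatrix} \quad \text{and} \quad \theta = \begin{pmatrix} a \\ b \end{pmatrix}.$$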
• We depart from the classical model by assuming that a and
b are random variables with prior PDF $\theta \sim N(0, \sigma_\theta^2 I)$. Also,
w is assumed Gaussian, $w \sim N(0, \sigma^2 I)$, and independent of θ. The noise variance is written both as $\sigma^2$ and as $\sigma_w^2$ in what follows.
• Using the second version of the formula for the linear
model (given earlier), we get the MMSE estimator:
$$E(\theta \mid x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta)$$
or, in our case,³
$$\begin{aligned}
E(\theta \mid x) &= \left( \frac{1}{\sigma_\theta^2} I + H^T \frac{1}{\sigma_w^2} I H \right)^{-1} H^T \frac{1}{\sigma_w^2} I x \\
&= \left( \frac{1}{\sigma_\theta^2} I + \frac{1}{\sigma_w^2} H^T H \right)^{-1} H^T \frac{1}{\sigma_w^2} x
\end{aligned}$$
³ Note the correspondence with ridge regression: ridge regression is equivalent to the Bayesian (MAP)
estimator with a Gaussian prior for the coefficients. Likewise, the LASSO is equivalent to the Bayesian (MAP)
estimator with a Laplacian prior.
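• In earlier examples we have seen that the columns of H are
nearly orthogonal (exactly orthogonal if f0 = k/N):
$$H^T H \approx \frac{N}{2} I$$
• Thus,
$$\begin{aligned}
E(\theta \mid x) &\approx \left( \frac{1}{\sigma_\theta^2} I + \frac{N}{2\sigma_w^2} I \right)^{-1} H^T \frac{1}{\sigma_w^2} x \\
&= \frac{\frac{1}{\sigma_w^2}}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma_w^2}}\, H^T x.
\end{aligned}$$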
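• In all, the MMSE estimates become
$$\hat{a}_{\text{MMSE}} = \frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n \right]$$
$$\hat{b}_{\text{MMSE}} = \frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n \right]$$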
• For comparison, recall that the classical MVU estimator is
$$\hat{a}_{\text{MVU}} = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n, \qquad \hat{b}_{\text{MVU}} = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n$$
• The difference can be interpreted as a weighting between
the prior knowledge and the data.
• If the prior knowledge is unreliable ($\sigma_\theta^2$ large), then
$\frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}} \approx 1$ and the two estimators are almost equal.
• If the data is unreliable ($\sigma^2$ large), then the coefficient
$\frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}}$ is small, making the estimate close to the mean of
the prior PDF (zero here).
• An example run is illustrated below. In this case, N = 100,
f0 = 15/N, $\sigma_\theta^2 = 0.48566$ and $\sigma^2 = 4.1173$. Altogether
M = 500 tests were performed.
• Since the prior PDF has a small variance, the estimator
gains a lot from using it. This is seen as a clear
difference between the MSEs of the two estimators.
[Figure: four panels with horizontal axis from −1 to 1. "Classical estimator of a. MSE=0.072474"; "Classical estimator of b. MSE=0.092735"; "Bayesian estimator of a. MSE=0.061919"; "Bayesian estimator of b. MSE=0.076355".]
• If the prior has a higher variance, the Bayesian approach
does not perform that much better. In the pictures below,
$\sigma_\theta^2 = 2.1937$ and $\sigma^2 = 1.9078$. The difference in performance
between the two approaches is negligible.
[Figure: four panels with horizontal axis from −1 to 1. "Classical estimator of a. MSE=0.040066"; "Classical estimator of b. MSE=0.034727"; "Bayesian estimator of a. MSE=0.03951"; "Bayesian estimator of b. MSE=0.034477".]
• The program code is available at
http://www.cs.tut.fi/courses/SGN-2606/BayesSinusoid.m
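For readers without access to the linked program, a rough Python sketch of a similar experiment follows. It is not the MATLAB code referenced above and its results will not match the reported MSE values digit for digit. The function name `simulate_sinusoid`, the random seed, and the choice to draw a fresh (a, b) from the prior in every trial are assumptions of this sketch.

```python
import numpy as np

def simulate_sinusoid(n_trials=500, N=100, k=15,
                      var_theta=0.48566, var_w=4.1173, seed=0):
    """Compare the classical and Bayesian (MMSE) estimates of (a, b)."""
    rng = np.random.default_rng(seed)
    n = np.arange(N)
    f0 = k / N
    H = np.column_stack([np.cos(2 * np.pi * f0 * n),
                         np.sin(2 * np.pi * f0 * n)])

    # Draw (a, b) from the prior and noise from N(0, var_w) for every trial.
    theta = rng.normal(0.0, np.sqrt(var_theta), size=(n_trials, 2))
    w = rng.normal(0.0, np.sqrt(var_w), size=(n_trials, N))
    x = theta @ H.T + w                       # one data record per row

    # Classical estimator: (2/N) * H^T x.
    classical = (2.0 / N) * (x @ H)

    # Bayesian estimator: (I/var_theta + H^T H/var_w)^{-1} H^T x / var_w.
    A = np.eye(2) / var_theta + (H.T @ H) / var_w
    bayes = np.linalg.solve(A, (H.T @ x.T) / var_w).T

    mse_classical = np.mean((classical - theta) ** 2, axis=0)
    mse_bayes = np.mean((bayes - theta) ** 2, axis=0)
    return mse_classical, mse_bayes, classical, bayes

mse_c, mse_b, _, _ = simulate_sinusoid()
print("classical MSE (a, b):", mse_c)
print("Bayesian  MSE (a, b):", mse_b)
```

Because f0 = k/N makes the columns of H exactly orthogonal, the Bayesian estimate in this sketch is simply the classical estimate multiplied by the shrinkage factor derived earlier.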
Example: MAP Estimator
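• Assume that
$$p(x[n] \mid \theta) = \begin{cases} \theta \exp(-\theta x[n]), & \text{if } x[n] > 0 \\ 0, & \text{if } x[n] < 0 \end{cases}$$
with x[n] conditionally IID and the prior of θ:
$$p(\theta) = \begin{cases} \lambda \exp(-\lambda\theta), & \text{if } \theta > 0 \\ 0, & \text{if } \theta < 0 \end{cases}$$
• Now, θ is the unknown RV and λ is known.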
• Then the MAP estimator is found by maximizing p(θ | x)
or equivalently p(x | θ)p(θ).
• Because both PDFs have an exponential form, it is easier to
maximize the logarithm instead:
$$\hat{\theta} = \arg\max_{\theta} \left( \ln p(x \mid \theta) + \ln p(\theta) \right).$$
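• Now,
$$\begin{aligned}
\ln p(x \mid \theta) + \ln p(\theta) &= \ln \left[ \prod_{n=0}^{N-1} \theta \exp(-\theta x[n]) \right] + \ln \left[ \lambda \exp(-\lambda\theta) \right] \\
&= \ln \left[ \theta^N \exp\left( -\theta \sum_{n=0}^{N-1} x[n] \right) \right] + \ln \left[ \lambda \exp(-\lambda\theta) \right] \\
&= N \ln \theta - N\theta\bar{x} + \ln \lambda - \lambda\theta
\end{aligned}$$
where x̄ is the sample mean of x[n].
• Differentiation produces
$$\frac{d}{d\theta} \left[ \ln p(x \mid \theta) + \ln p(\theta) \right] = \frac{N}{\theta} - N\bar{x} - \lambda$$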
• Setting it equal to zero produces the MAP estimator:
$$\hat{\theta} = \frac{1}{\bar{x} + \frac{\lambda}{N}}$$
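The closed-form result can be sanity-checked against a brute-force search over the log-posterior. This is a minimal sketch with made-up data and a made-up value of λ; the function names are hypothetical.

```python
import numpy as np

def exponential_map(x, lam):
    """MAP estimate of the rate theta: 1 / (mean(x) + lam / N)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (x.mean() + lam / x.size)

def log_posterior(theta, x, lam):
    """N ln(theta) - N theta mean(x) + ln(lam) - lam theta, for theta > 0."""
    x = np.asarray(x, dtype=float)
    return x.size * np.log(theta) - theta * x.sum() + np.log(lam) - lam * theta

x = np.array([1.0, 2.0, 3.0])   # made-up observations
lam = 2.0                       # made-up prior parameter
theta_hat = exponential_map(x, lam)

# Brute-force check: evaluate the log-posterior on a fine grid of theta values.
grid = np.linspace(0.01, 5.0, 50000)
theta_grid = grid[np.argmax(log_posterior(grid, x, lam))]
print(theta_hat, theta_grid)
```

As N grows, the term λ/N vanishes and the MAP estimate approaches the ML estimate 1/x̄, so the prior matters most for short data records.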
Example: Deconvolution
• Consider the situation where a signal s[n] passes through a
channel with impulse response h[n] and is further
corrupted by noise w[n]:
$$x[n] = h[n] * s[n] + w[n] = \sum_{k=0}^{K} h[k]\, s[n-k] + w[n], \quad n = 0, 1, \ldots, N-1$$
• Since convolution commutes, we can write this as
$$x[n] = \sum_{k=0}^{n_s - 1} h[n-k]\, s[k] + w[n]$$
• In matrix form this is expressed by
$$\begin{pmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{pmatrix} = \begin{pmatrix} h[0] & 0 & \cdots & 0 \\ h[1] & h[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h[N-1] & h[N-2] & \cdots & h[N-n_s] \end{pmatrix} \begin{pmatrix} s[0] \\ s[1] \\ \vdots \\ s[n_s-1] \end{pmatrix} + \begin{pmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{pmatrix}$$
• Thus, we have again the linear model
$$x = Hs + w$$
where the unknown parameter θ is the original signal s.
• The noise is assumed Gaussian: $w[n] \sim N(0, \sigma^2)$.
• A reasonable assumption for the signal is that $s \sim N(0, C_s)$
with $[C_s]_{ij} = r_{ss}[i - j]$, where $r_{ss}$ is the autocorrelation
function of s.
• According to the linear-model result given earlier, the MMSE estimator is
$$\begin{aligned}
E(s \mid x) &= \mu_s + C_s H^T (H C_s H^T + C_w)^{-1} (x - H\mu_s) \\
&= C_s H^T (H C_s H^T + \sigma^2 I)^{-1} x
\end{aligned}$$
• In general, the form of the estimator varies a lot between
different cases. However, one special case is worth noting:
• When H = I, the channel is the identity and only noise is
present. In this case
$$\hat{s} = C_s (C_s + \sigma^2 I)^{-1} x$$
This case is called the Wiener filter. For example, in the single
data point case,
$$\hat{s}[0] = \frac{r_{ss}[0]}{r_{ss}[0] + \sigma^2}\, x[0]$$
Thus, the variance of the noise acts as a parameter
that tells how reliable the data is relative to the prior.
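The deconvolution estimator can be sketched in a few lines of NumPy. The channel, autocorrelation values and data below are made up for illustration, and the helper names (`convolution_matrix`, `autocorr_matrix`, `mmse_deconvolve`) are hypothetical; white noise with covariance σ²I and a zero-mean signal are assumed, as in the text.

```python
import numpy as np

def convolution_matrix(h, N, n_s):
    """N x n_s matrix with entries H[i, j] = h[i - j] (zero outside the support of h)."""
    h = np.asarray(h, dtype=float)
    H = np.zeros((N, n_s))
    for i in range(N):
        for j in range(n_s):
            if 0 <= i - j < len(h):
                H[i, j] = h[i - j]
    return H

def autocorr_matrix(r_ss, n_s):
    """Signal covariance with entries [C_s]_ij = r_ss[|i - j|] (zero beyond the given lags)."""
    r = np.zeros(n_s)
    m = min(n_s, len(r_ss))
    r[:m] = np.asarray(r_ss, dtype=float)[:m]
    lags = np.abs(np.arange(n_s)[:, None] - np.arange(n_s)[None, :])
    return r[lags]

def mmse_deconvolve(x, h, r_ss, noise_var, n_s):
    """s_hat = C_s H^T (H C_s H^T + noise_var * I)^{-1} x."""
    x = np.asarray(x, dtype=float)
    N = x.size
    H = convolution_matrix(h, N, n_s)
    C_s = autocorr_matrix(r_ss, n_s)
    S = H @ C_s @ H.T + noise_var * np.eye(N)
    return C_s @ H.T @ np.linalg.solve(S, x)

# Made-up example: two-tap channel, four signal samples, five observations.
h = [1.0, 0.5]
r_ss = [1.0, 0.5, 0.25, 0.125]
x = np.array([0.9, 1.4, 0.2, -0.6, -0.3])
print(mmse_deconvolve(x, h, r_ss, noise_var=0.1, n_s=4))

# Special case H = I with a single data point: r_ss[0] / (r_ss[0] + sigma^2) * x[0].
print(mmse_deconvolve([2.0], [1.0], [3.0], noise_var=1.0, n_s=1))  # 3/(3+1) * 2 = 1.5
```

Increasing `noise_var` shrinks the estimate towards the prior mean (zero), while a very small `noise_var` makes the estimator trust the data and approach a plain inversion of the channel.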