Bayesian Estimation
• Bayesian estimators differ from all classical estimators studied so far in that they treat the parameters as random variables instead of unknown constants.
• As such, the parameters also have a PDF, which must be taken into account when deriving an estimator.
• The PDF of the parameters can be used to incorporate any prior knowledge we may have about their values.

Bayesian Estimation
• For example, we might know that the normalized frequency f0 of an observed sinusoid cannot be greater than 0.1. This is ensured by choosing

  p(f0) = 10, if 0 ≤ f0 ≤ 0.1
  p(f0) = 0,  otherwise

as the prior PDF in the Bayesian framework.
• Usually differentiable PDFs are easier to work with, and we could approximate the uniform PDF with, e.g., the Rayleigh PDF.

[Figure: the uniform prior on 0 ≤ f0 ≤ 0.1 and an approximating Rayleigh prior with σ = 0.035, both plotted against the normalized frequency f0.]

Prior and Posterior estimates
• One of the key properties of the Bayesian approach is that it can be used also for small data records, and the estimate can be improved sequentially as new data arrives.
• For example, consider tossing a coin and estimating the probability of a head, µ.
• As we saw earlier, the ML estimate is the number of observed heads divided by the total number of tosses:

  µ̂ = #heads / #tosses.

• However, if we cannot afford more than, say, three experiments, we may end up seeing three heads and no tails. We are then forced to infer that µ̂_ML = 1, i.e., that the coin always lands heads.

Prior and Posterior estimates
• The Bayesian approach can circumvent this problem, because the prior regularizes the likelihood and avoids overfitting to the small amount of data.
• The pictures below illustrate this. The first is the likelihood function

  p(x | µ) = µ^#heads (1 − µ)^#tails

with #heads = 3 and #tails = 0. The maximum of the function is at unity.
• The second curve is the prior density p(µ) of our choice. It was selected to reflect our assumption that the coin is probably quite fair.

Prior and Posterior estimates
• The third curve is the posterior density p(µ | x) after observing the samples, which can be evaluated using the Bayes formula

  p(µ | x) = p(x | µ) · p(µ) / p(x) = likelihood · prior / p(x)

• Thus, the third curve is the product of the first two (with normalization), and one Bayesian alternative is to use its maximum as the estimate.

Prior and Posterior estimates
[Figure, three panels: the likelihood function p(x | µ) after three tosses resulting in a head; the prior density p(µ) before observing any data; and the posterior density p(µ | x) after observing 3 heads, each plotted against µ.]

Cost Functions
• Bayesian estimators are defined by a minimization problem

  θ̂ = arg min_θ̂ ∫∫ C(θ − θ̂) p(x, θ) dx dθ,

which seeks the value of θ̂ that minimizes the average cost.

Cost Functions
• The cost function C(x) is typically one of the following:
  1. Quadratic: C(x) = x²
  2. Absolute: C(x) = |x|
  3. Hit-or-miss: C(x) = 0 if |x| < δ, and C(x) = 1 if |x| > δ
• Additional cost functions include Huber's robust loss and the ε-insensitive loss.

Cost Functions
• These three cost functions are favoured because the minimum-cost solution can be found in closed form. We will introduce the solutions next; a numerical sketch of the three costs is given below.
• Functions 1 and 3 are slightly easier to use than 2, so we will concentrate on those.
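The following is a small Python sketch (my own addition, not part of the original slides): it defines the three cost functions and brute-force searches for the estimate that minimizes the average cost under a fixed, discretized example posterior. The skewed toy posterior, the grid, and the value of δ are arbitrary choices; the closed-form minimizers are derived on the slides that follow.

```python
import numpy as np

# Sketch of the three standard Bayesian cost functions, plus a brute-force
# search for the estimate minimizing the average cost under a fixed,
# discretized example posterior. Toy posterior, grid, and delta are arbitrary.

def quadratic_cost(e):
    return e ** 2

def absolute_cost(e):
    return np.abs(e)

def hit_or_miss_cost(e, delta=0.05):
    return np.where(np.abs(e) < delta, 0.0, 1.0)

theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
posterior = theta ** 3 * np.exp(-theta / 0.1)      # skewed toy posterior shape
posterior /= posterior.sum() * dtheta              # normalize so it integrates to 1

for name, cost in [("quadratic", quadratic_cost),
                   ("absolute", absolute_cost),
                   ("hit-or-miss", hit_or_miss_cost)]:
    # average cost for every candidate estimate theta_hat on the grid
    avg = [np.sum(cost(theta - th) * posterior) * dtheta for th in theta]
    print(f"{name:12s}: theta_hat = {theta[np.argmin(avg)]:.3f}")
```

Because the toy posterior is skewed, the three cost functions give three different estimates; the closed-form characterizations of these minimizers are derived next.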
Cost Functions
• Regardless of the cost function, the double integral above can be evaluated and minimized using the rule for joint probabilities: p(x, θ) = p(θ | x) p(x).

Cost Functions
• This results in

  ∫∫ C(θ − θ̂) p(θ | x) p(x) dx dθ = ∫ [ ∫ C(θ − θ̂) p(θ | x) dθ ] p(x) dx,

where the bracketed term is denoted (∗).
• Because p(x) is always nonnegative, it suffices to minimize the bracketed term (∗) for each x:¹

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ

¹ Note that there is a slight shift in the paradigm. The double integral results in the theoretical estimate that requires knowledge of p(x). When minimizing only the inner integral, we get the optimum for a particular realization, not for all possible realizations.

1. Quadratic Cost Solution (or the MMSE estimator)
• If we select the quadratic cost, the Bayesian estimator is defined by

  arg min_θ̂ ∫ (θ − θ̂)² p(θ | x) dθ

• Simple differentiation gives:

  ∂/∂θ̂ ∫ (θ − θ̂)² p(θ | x) dθ = ∫ ∂/∂θ̂ (θ − θ̂)² p(θ | x) dθ = ∫ −2(θ − θ̂) p(θ | x) dθ

1. Quadratic Cost Solution (or the MMSE estimator)
• Setting this equal to zero gives

  ∫ −2(θ − θ̂) p(θ | x) dθ = 0
  ⇔ 2θ̂ ∫ p(θ | x) dθ = 2 ∫ θ p(θ | x) dθ
  ⇔ θ̂ ∫ p(θ | x) dθ = ∫ θ p(θ | x) dθ      (the integral multiplying θ̂ equals 1)
  ⇔ θ̂ = ∫ θ p(θ | x) dθ

1. Quadratic Cost Solution (or the MMSE estimator)
• Thus, we have the minimum:

  θ̂_MMSE = ∫ θ p(θ | x) dθ = E(θ | x),

i.e., the mean of the posterior PDF p(θ | x).²
• This is called the minimum mean square error (MMSE) estimator, because it minimizes the average squared error.

² The prior PDF, p(θ), refers to the parameter distribution before any observations are made. The posterior PDF, p(θ | x), refers to the parameter distribution after observing the data.

2. Absolute Cost Solution
• If we choose the absolute value as the cost function, we have to minimize

  arg min_θ̂ ∫ |θ − θ̂| p(θ | x) dθ

• This can be shown to be equivalent to the condition

  ∫_{−∞}^{θ̂} p(θ | x) dθ = ∫_{θ̂}^{∞} p(θ | x) dθ

2. Absolute Cost Solution
• In other words, the estimate is the value which divides the probability mass into equal halves:

  ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2

• Thus, we have arrived at the definition of the median of the posterior PDF.

3. Hit-or-miss Cost Solution (or the MAP estimator)
• For the hit-or-miss case, we also need to minimize the inner integral:

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ,   with C(x) = 0 for |x| < δ and C(x) = 1 for |x| > δ.

3. Hit-or-miss Cost Solution (or the MAP estimator)
• The integral becomes

  ∫ C(θ − θ̂) p(θ | x) dθ = ∫_{−∞}^{θ̂−δ} 1 · p(θ | x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ | x) dθ,

or in a simplified form

  ∫ C(θ − θ̂) p(θ | x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} 1 · p(θ | x) dθ

3. Hit-or-miss Cost Solution (or the MAP estimator)
• This is minimized by maximizing

  ∫_{θ̂−δ}^{θ̂+δ} p(θ | x) dθ

• For small δ and a smooth p(θ | x), the maximum of the integral occurs at the maximum of p(θ | x).
• Therefore, the estimator is the mode (the location of the highest value) of the posterior PDF; hence the name maximum a posteriori (MAP) estimator.

3. Hit-or-miss Cost Solution (or the MAP estimator)
• Note that the MAP estimator

  θ̂_MAP = arg max_θ p(θ | x)

is calculated (using the Bayes rule) as

  θ̂_MAP = arg max_θ p(x | θ) p(θ) / p(x)

3. Hit-or-miss Cost Solution (or the MAP estimator)
• Since p(x) does not depend on θ, it is equivalent to maximize only the numerator:

  θ̂_MAP = arg max_θ p(x | θ) p(θ)

• Incidentally, this is close to the ML estimator:

  θ̂_ML = arg max_θ p(x | θ)

The only difference is the inclusion of the prior PDF.

Summary
• To summarize, the three most widely used Bayesian estimators are
  1. The MMSE: θ̂_MMSE = E(θ | x)
  2. The median: θ̂ such that ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2
  3. The MAP: θ̂_MAP = arg max_θ p(x | θ) p(θ)
A numerical sketch of all three follows below.
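To tie these closed forms back to the coin-toss illustration, here is a minimal numerical sketch (my own addition). It assumes the Gaussian prior with mean 0.5 and σ = 0.1 that is also used in the analytical example on the next slides, evaluates the posterior on a grid, and reads off its mean, median, and mode.

```python
import numpy as np

# Sketch: the three Bayesian estimates for the coin-toss posterior on a grid.
# Likelihood mu^3 (1 - mu)^0 for three heads and no tails; Gaussian prior with
# mean 0.5 and sigma = 0.1, the values used in the analytical example below.
mu = np.linspace(0.0, 1.0, 100001)
dmu = mu[1] - mu[0]
sigma = 0.1

likelihood = mu ** 3                               # 3 heads, 0 tails
prior = np.exp(-0.5 * ((mu - 0.5) / sigma) ** 2)   # Gaussian prior (unnormalized)
posterior = likelihood * prior
posterior /= posterior.sum() * dmu                 # normalize

mmse = np.sum(mu * posterior) * dmu                # posterior mean
cdf = np.cumsum(posterior) * dmu
median = mu[np.searchsorted(cdf, 0.5)]             # posterior median
map_est = mu[np.argmax(posterior)]                 # posterior mode

print(f"MMSE (mean)  = {mmse:.3f}")
print(f"Median       = {median:.3f}")
print(f"MAP  (mode)  = {map_est:.3f}  # analytical value below is about 0.554")
```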
Example
• Consider the case of tossing a coin three times, resulting in three heads.
• In the example, we used the Gaussian prior

  p(µ) = (1 / √(2πσ²)) exp(−(µ − 0.5)² / (2σ²)).

• Now µ̂_MAP becomes

  µ̂_MAP = arg max_µ p(x | µ) p(µ)
        = arg max_µ µ^#heads (1 − µ)^#tails · (1 / √(2πσ²)) exp(−(µ − 0.5)² / (2σ²))

Example
• Let's simplify the arithmetic by setting #heads = 3 and #tails = 0:

  µ̂_MAP = arg max_µ µ³ · (1 / √(2πσ²)) exp(−(µ − 0.5)² / (2σ²))

• Equivalently, we can maximize its logarithm:

  arg max_µ [ 3 ln µ − ln √(2πσ²) − (µ − 0.5)² / (2σ²) ]

Example
• Now,

  ∂/∂µ ln[ p(x | µ) p(µ) ] = 3/µ − (µ − 0.5)/σ² = 0

when µ² − 0.5µ − 3σ² = 0. This happens when

  µ = ( 0.5 ± √(0.25 − 4 · 1 · (−3σ²)) ) / 2 = 0.25 ± √(0.25 + 12σ²) / 2.

Example
• If we substitute the value used in the example, σ = 0.1,

  µ̂_MAP = 0.25 + √0.37 / 2 ≈ 0.554.

• Thus, we have found the analytical solution for the maximum of the posterior curve shown earlier (slide 5).

Vector Parameter Case for MMSE
• In the vector parameter case, the MMSE estimator is

  θ̂_MMSE = E(θ | x),

or, more explicitly,

  θ̂_MMSE = [ ∫ θ₁ p(θ | x) dθ,  ∫ θ₂ p(θ | x) dθ,  . . . ,  ∫ θ_p p(θ | x) dθ ]ᵀ

Vector Parameter Case for MMSE
• In the linear model case, there exists a straightforward solution: if the observed data can be modeled as

  x = Hθ + w,

where θ ~ N(µ_θ, C_θ) and w ~ N(0, C_w), then

  E(θ | x) = µ_θ + C_θ Hᵀ (H C_θ Hᵀ + C_w)⁻¹ (x − Hµ_θ)

Vector Parameter Case for MMSE
• It is possible to derive an alternative form resembling the LS estimator (exercise):

  E(θ | x) = µ_θ + (C_θ⁻¹ + Hᵀ C_w⁻¹ H)⁻¹ Hᵀ C_w⁻¹ (x − Hµ_θ).

• Note that with µ_θ = 0, C_θ = I, and C_w = σ²_w I this reduces to the regularized (ridge) LS form (σ²_w I + HᵀH)⁻¹ Hᵀ x, and it approaches the ordinary LS estimator as the prior becomes noninformative (C_θ⁻¹ → 0). A small numerical sketch of the two forms is given below.
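The linear-model formulas above are straightforward to implement. Below is a minimal sketch (my own illustration; the dimensions, prior, and noise covariances are arbitrary) showing that the two forms give the same estimate.

```python
import numpy as np

# Sketch of the two equivalent forms of the linear-model MMSE estimator:
#   E(theta|x) = mu + C_th H^T (H C_th H^T + C_w)^{-1} (x - H mu)
#              = mu + (C_th^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H mu)
rng = np.random.default_rng(0)

N, p = 20, 3
H = rng.standard_normal((N, p))
mu_theta = np.zeros(p)
C_theta = 0.5 * np.eye(p)                      # prior covariance of theta
C_w = 0.1 * np.eye(N)                          # noise covariance

theta = mu_theta + np.linalg.cholesky(C_theta) @ rng.standard_normal(p)
x = H @ theta + np.linalg.cholesky(C_w) @ rng.standard_normal(N)

# Form 1: requires an N x N inverse.
form1 = mu_theta + C_theta @ H.T @ np.linalg.solve(
    H @ C_theta @ H.T + C_w, x - H @ mu_theta)

# Form 2: requires only a p x p inverse (cheaper when p << N).
A = np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_w, H)
form2 = mu_theta + np.linalg.solve(A, H.T @ np.linalg.solve(C_w, x - H @ mu_theta))

print(np.allclose(form1, form2))               # True: the two forms agree
print("estimate:", form1, " true theta:", theta)
```

The second form inverts only a p × p matrix, which is convenient when the number of parameters is much smaller than the number of observations; it is also the form applied in the sinusoidal example below.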
Vector Parameter Case for the MAP
• The MAP estimator can also be extended to vector parameters:

  θ̂_MAP = arg max_θ p(θ | x)

or, using the Bayes rule,

  θ̂_MAP = arg max_θ p(x | θ) p(θ)

• Note that in general this is different from p separate scalar MAPs: a scalar MAP would maximize the posterior of each parameter θᵢ individually, whereas the vector MAP seeks the joint maximum over the whole parameter space.

Example: MMSE Estimation of Sinusoidal Parameters
• Consider the data model

  x[n] = a cos 2πf0 n + b sin 2πf0 n + w[n],   n = 0, 1, . . . , N − 1,

or in vector form x = Hθ + w, where

  H = [ 1                0
        cos 2πf0         sin 2πf0
        cos 4πf0         sin 4πf0
        ...              ...
        cos 2(N−1)πf0    sin 2(N−1)πf0 ]

and θ = [a, b]ᵀ.

Example: MMSE Estimation of Sinusoidal Parameters
• We depart from the classical model by assuming that a and b are random variables with prior PDF θ ~ N(0, σ²_θ I). Also w is assumed Gaussian, w ~ N(0, σ²_w I), and independent of θ.
• Using the second version of the linear-model formula (slide 28), we get the MMSE estimator:

  E(θ | x) = µ_θ + (C_θ⁻¹ + Hᵀ C_w⁻¹ H)⁻¹ Hᵀ C_w⁻¹ (x − Hµ_θ)

Example: MMSE Estimation of Sinusoidal Parameters
or, in our case,³

  E(θ | x) = ( (1/σ²_θ) I + Hᵀ (1/σ²_w) I H )⁻¹ Hᵀ (1/σ²_w) I x
           = ( (1/σ²_θ) I + (1/σ²_w) Hᵀ H )⁻¹ (1/σ²_w) Hᵀ x

³ Note the correspondence with ridge regression: ridge regression is equivalent to the Bayesian estimator with a Gaussian prior on the coefficients. Similarly, the LASSO is equivalent to the Bayesian estimator with a Laplacian prior.

Example: MMSE Estimation of Sinusoidal Parameters
• In earlier examples we have seen that the columns of H are nearly orthogonal (exactly orthogonal if f0 = k/N):

  Hᵀ H ≈ (N/2) I

• Thus,

  E(θ | x) ≈ ( (1/σ²_θ) I + (N/(2σ²_w)) I )⁻¹ (1/σ²_w) Hᵀ x
           = [ 1 / (1/σ²_θ + N/(2σ²_w)) ] (1/σ²_w) Hᵀ x

Example: MMSE Estimation of Sinusoidal Parameters
• In all, the MMSE estimates become

  â_MMSE = [ 1 / (1 + 2σ²_w/(N σ²_θ)) ] · (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n
  b̂_MMSE = [ 1 / (1 + 2σ²_w/(N σ²_θ)) ] · (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n

Example: MMSE Estimation of Sinusoidal Parameters
• For comparison, recall that the classical MVU estimator is

  â_MVU = (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n
  b̂_MVU = (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n

Example: MMSE Estimation of Sinusoidal Parameters
• The difference can be interpreted as a weighting between the prior knowledge and the data.
• If the prior knowledge is unreliable (σ²_θ large), then 1 / (1 + 2σ²_w/(N σ²_θ)) ≈ 1 and the two estimators are almost equal.
• If the data is unreliable (σ²_w large), then the coefficient 1 / (1 + 2σ²_w/(N σ²_θ)) is small, making the estimate close to the mean of the prior PDF (here zero).

Example: MMSE Estimation of Sinusoidal Parameters
• An example run is illustrated below. In this case, N = 100, f0 = 15/N, σ²_θ = 0.48566, and σ²_w = 4.1173. Altogether M = 500 tests were performed.
• Since the prior PDF has a small variance, the estimator gains a lot from using it. This is seen as a significant difference between the MSEs of the two estimators.

Example: MMSE Estimation of Sinusoidal Parameters
[Figure: histograms of the estimates of a and b over the 500 tests. Classical estimator of a: MSE = 0.072474; classical estimator of b: MSE = 0.092735; Bayesian estimator of a: MSE = 0.061919; Bayesian estimator of b: MSE = 0.076355.]

Example: MMSE Estimation of Sinusoidal Parameters
• If the prior has a higher variance, the Bayesian approach does not perform that much better. In the run below, σ²_θ = 2.1937 and σ²_w = 1.9078, and the difference in performance between the two approaches is negligible.

Example: MMSE Estimation of Sinusoidal Parameters
[Figure: histograms of the estimates of a and b. Classical estimator of a: MSE = 0.040066; classical estimator of b: MSE = 0.034727; Bayesian estimator of a: MSE = 0.03951; Bayesian estimator of b: MSE = 0.034477.]

Example: MMSE Estimation of Sinusoidal Parameters
• The program code is available at http://www.cs.tut.fi/courses/SGN-2606/BayesSinusoid.m

Example: MAP Estimator
• Assume that

  p(x[n] | θ) = θ exp(−θ x[n]), if x[n] > 0
  p(x[n] | θ) = 0,              if x[n] < 0

with the x[n] conditionally IID, and the prior of θ:

  p(θ) = λ exp(−λθ), if θ > 0
  p(θ) = 0,          if θ < 0

• Now, θ is the unknown random variable and λ is known.

Example: MAP Estimator
• Then the MAP estimator is found by maximizing p(θ | x) or, equivalently, p(x | θ) p(θ).
• Because both PDFs have an exponential form, it is easier to maximize the logarithm instead:

  θ̂ = arg max_θ ( ln p(x | θ) + ln p(θ) ).

Example: MAP Estimator
• Now,

  ln p(x | θ) + ln p(θ) = ln [ Π_{n=0}^{N−1} θ exp(−θ x[n]) ] + ln[ λ exp(−λθ) ]
                        = ln [ θ^N exp(−θ Σ_{n=0}^{N−1} x[n]) ] + ln[ λ exp(−λθ) ]
                        = N ln θ − Nθx̄ + ln λ − λθ

• Differentiation produces

  d/dθ ( ln p(x | θ) + ln p(θ) ) = N/θ − Nx̄ − λ

Example: MAP Estimator
• Setting this equal to zero produces the MAP estimator:

  θ̂ = 1 / (x̄ + λ/N)

A quick numerical check is given below.
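As a quick check of this result (my own sketch; the true θ, λ, and N below are arbitrary choices), we can draw exponential data and compare θ̂_MAP = 1/(x̄ + λ/N) with the ML estimate 1/x̄, which ignores the prior.

```python
import numpy as np

# Sketch: numerical check of the MAP estimator theta_hat = 1/(xbar + lambda/N)
# for conditionally IID exponential data with an exponential prior on theta.
# The true theta and the prior parameter lambda are arbitrary choices.
rng = np.random.default_rng(1)

theta_true = 2.0          # the rate to be estimated
lam = 0.5                 # known prior parameter lambda
N = 10                    # a small record, where the prior matters most

x = rng.exponential(scale=1.0 / theta_true, size=N)   # p(x[n]|theta) = theta exp(-theta x[n])
xbar = x.mean()

theta_map = 1.0 / (xbar + lam / N)
theta_ml = 1.0 / xbar     # ML estimate of an exponential rate, for comparison

print(f"true = {theta_true},  MAP = {theta_map:.3f},  ML = {theta_ml:.3f}")
```

For small N the prior term λ/N pulls the MAP estimate away from the ML estimate; as N grows the two coincide, since λ/N vanishes.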
Example: Deconvolution
• Consider the situation where a signal s[n] passes through a channel with impulse response h[n] and is further corrupted by noise w[n]:

  x[n] = h[n] ∗ s[n] + w[n] = Σ_{k=0}^{K} h[k] s[n − k] + w[n],   n = 0, 1, . . . , N − 1

Example: Deconvolution
• Since convolution commutes, we can write this as

  x[n] = Σ_{k=0}^{ns−1} h[n − k] s[k] + w[n]

• In matrix form this is expressed by

  [ x[0]     ]   [ h[0]      0         ...  0          ] [ s[0]      ]   [ w[0]     ]
  [ x[1]     ]   [ h[1]      h[0]      ...  0          ] [ s[1]      ]   [ w[1]     ]
  [ ...      ] = [ ...       ...       ...  ...        ] [ ...       ] + [ ...      ]
  [ x[N − 1] ]   [ h[N − 1]  h[N − 2]  ...  h[N − ns]  ] [ s[ns − 1] ]   [ w[N − 1] ]

Example: Deconvolution
• Thus, we again have the linear model

  x = Hs + w,

where the unknown parameter θ is the original signal s.
• The noise is assumed Gaussian: w[n] ~ N(0, σ²).
• A reasonable assumption for the signal is that s ~ N(0, C_s) with [C_s]ᵢⱼ = r_ss[i − j], where r_ss is the autocorrelation function of s.
• According to the linear-model MMSE formula (slide 28), the MMSE estimator is

  E(s | x) = µ_s + C_s Hᵀ (H C_s Hᵀ + C_w)⁻¹ (x − Hµ_s) = C_s Hᵀ (H C_s Hᵀ + σ² I)⁻¹ x

Example: Deconvolution
• In general, the form of the estimator varies a lot between different cases. However, as a special case:
• When H = I, the channel is the identity and only noise is present. In this case

  ŝ = C_s (C_s + σ² I)⁻¹ x.

This case is called the Wiener filter. For example, with a single data point,

  ŝ[0] = ( r_ss[0] / (r_ss[0] + σ²) ) x[0].

Thus, the variance of the noise acts as a parameter describing the reliability of the data relative to the prior. A numerical sketch of the deconvolution estimator is given below.
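To make the deconvolution estimator concrete, here is a minimal sketch (my own addition; the channel h, the autocorrelation model r_ss[m] = ρ^|m|, the noise variance, and the equal lengths of x and s are all arbitrary assumptions, not taken from the slides).

```python
import numpy as np

# Sketch of the Bayesian deconvolution estimator
#   s_hat = C_s H^T (H C_s H^T + sigma^2 I)^{-1} x
# for a short channel h and an autocorrelation-based prior covariance for s.
rng = np.random.default_rng(2)

N = 64                                         # here: signal length = number of observations
h = np.array([1.0, 0.6, 0.3])                  # assumed channel impulse response
H = np.zeros((N, N))
for n in range(N):
    for k in range(N):
        if 0 <= n - k < len(h):
            H[n, k] = h[n - k]                 # lower-triangular Toeplitz convolution matrix

rho = 0.9                                      # assumed autocorrelation r_ss[m] = rho^|m|
Cs = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
sigma2 = 0.1                                   # noise variance

s = np.linalg.cholesky(Cs + 1e-9 * np.eye(N)) @ rng.standard_normal(N)
x = H @ s + np.sqrt(sigma2) * rng.standard_normal(N)

s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma2 * np.eye(N), x)
print("MSE of the MMSE deconvolution:", np.mean((s_hat - s) ** 2))

# Special case H = I (no channel): the Wiener filter s_hat = C_s (C_s + sigma^2 I)^{-1} x.
```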