Expectation Maximization – Introduction to the EM algorithm
TLT-5906 Advanced Course in Digital Transmission
Jukka Talvitie, M.Sc. (Eng), jukka.talvitie@tut.fi
Department of Communication Engineering, Tampere University of Technology
5.12.2013

Outline
• Expectation Maximization (EM) algorithm
  – Motivation and background
  – Where can the EM algorithm be used?
• EM principle
  – Formal definition
  – How does the algorithm really work?
  – Coin toss example
  – About some practical issues
• More advanced examples
  – Line fitting with the EM algorithm
  – Parameter estimation of a multivariate Gaussian mixture
• Conclusions

Motivation
• Consider the classical line fitting problem:
  – Assume the measurements shown below follow the linear model y = ax + b + n, where the line parameters are a and b and n is zero-mean noise.
[Figure: scatter plot of the measurements; x from 0 to 1, y roughly from 1.5 to 2.2]

Motivation
• We use LS (least squares) to find the best fit.
• Is this the best solution?
[Figure: measurements together with the LS fit]

Motivation
• LS would be the best linear unbiased estimator if the noise were uncorrelated with a fixed variance.
• Here the noise term is actually correlated, and the true linear model of this realization is shown as the black line in the figure.
  – LS gives too much weight to a group of samples in the middle.
[Figure: measurements, LS fit, and the correct line]

Motivation
• Taking the correlation of the noise term into account, we can use the generalized LS method, and the result improves considerably.
• However, in many cases we do not know the correlation model.
  – It is hidden in the observations and we cannot access it directly.
  – Therefore, e.g. here we would need to estimate the covariance and the line parameters simultaneously.
• This sort of problem can quickly become very complicated.
  – How do we estimate the covariance without knowing the line parameters, and vice versa?
• Intuitive (heuristic) solution:
  – Iteratively estimate one parameter, then the other, and continue…
  – There is no performance guarantee in this case (e.g. compared to the maximum likelihood (ML) solution).
• The EM algorithm provides the ML solution for this sort of problem.
[Figure: measurements, LS fit, correct line, and generalized LS fit]

Expectation Maximization Algorithm
• Presented by Dempster, Laird and Rubin in [1] in 1977.
  – Essentially the same principle had already been proposed earlier by other authors in specific circumstances.
• The EM algorithm is an iterative estimation algorithm that can derive maximum likelihood (ML) estimates in the presence of missing/hidden data ("incomplete data").
  – e.g. the classical case is the Gaussian mixture, where we have a set of unknown Gaussian distributions (see the example later on).
• Many-to-one mapping [2]:
  – X: underlying space, x: complete data (required for ML)
  – Y: observation space, y: observation
  – x is observed only by means of y(x); X(y) is the subset of X determined by y.
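To make the many-to-one mapping concrete, the density of the observed (incomplete) data is obtained by integrating the complete-data density over all complete data consistent with the observation. This relation is written out here for clarity, in the spirit of the notation used in [1] and [2]:

$$ g(y \mid \theta) = \int_{X(y)} f(x \mid \theta)\, dx $$

The EM algorithm works with the complete-data density f, which is typically much easier to handle than the incomplete-data density g.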
Expectation Maximization Algorithm
• The basic functioning of the EM algorithm can be divided into two steps (the parameter to be estimated is θ):
  – Expectation step (E-step)
    • Take the expected value of the complete-data log-likelihood given the observation and the current parameter estimate $\hat{\theta}_k$:
      $Q(\theta, \hat{\theta}_k) = E\{\log f(x \mid \theta) \mid y, \hat{\theta}_k\}$
  – Maximization step (M-step)
    • Maximize the Q-function of the E-step (basically, the data of the E-step is used as if it were measured observations):
      $\hat{\theta}_{k+1} = \arg\max_{\theta} Q(\theta, \hat{\theta}_k)$
• The likelihood of the parameter increases at every iteration.
  – EM converges towards some local maximum of the likelihood function.

An example: ML estimation vs. EM algorithm [3]
• We wish to estimate the variance of S:
  – observation Y = S + N
    • S and N are normally distributed with zero means and variances θ and 1, respectively.
  – Now Y is also normally distributed (zero mean, variance θ + 1).
• The ML estimate can be derived easily:
  $\hat{\theta}_{ML} = \arg\max_{\theta} p(y \mid \theta) = \max\{0, y^2 - 1\}$
• The zero in the above result comes from the fact that we know the variance is always non-negative.

An example: ML estimation vs. EM algorithm
• The same with the EM algorithm:
  – The complete data now consists of S and N.
  – The E-step is then
    $Q(\theta, \hat{\theta}_k) = E\big[\ln p(s, n \mid \theta) \mid y, \hat{\theta}_k\big]$
  – The logarithmic probability distribution of the complete data is
    $\ln p(s, n \mid \theta) = \ln p(n) + \ln p(s \mid \theta) = C - \tfrac{1}{2}\ln\theta - \tfrac{s^2}{2\theta}$
    (C contains all the terms independent of θ), so that
    $Q(\theta, \hat{\theta}_k) = C - \tfrac{1}{2}\ln\theta - \tfrac{E[S^2 \mid Y, \hat{\theta}_k]}{2\theta}$

An example: ML estimation vs. EM algorithm
• M-step:
  – Maximize the Q-function of the E-step.
  – Setting the derivative with respect to θ to zero and using results from math tables (conditional means and variances, law of total variance) gives
    $\hat{\theta}_{k+1} = E\big[S^2 \mid Y, \hat{\theta}_k\big] = \big(E[S \mid Y, \hat{\theta}_k]\big)^2 + \mathrm{var}\big[S \mid Y, \hat{\theta}_k\big] = \left(\frac{\hat{\theta}_k}{\hat{\theta}_k + 1}\, Y\right)^2 + \frac{\hat{\theta}_k}{\hat{\theta}_k + 1}$
• At the steady state ($\hat{\theta}_{k+1} = \hat{\theta}_k$) we get the same value for the estimate as in ML estimation, max{0, y² − 1}.
• What about the convergence? What happens if we choose the initial value $\hat{\theta}_0 = 0$?

An example: ML estimation vs. EM algorithm
• In the previous example the ML estimate could be solved in closed form.
  – In that case there was no need for the EM algorithm, since the ML estimate is obtained in a straightforward manner (we just showed that the EM algorithm converges to the peak of the likelihood function).
• Next we consider a coin toss example:
  – The target is to figure out the probability of heads for each of two coins.
  – The ML estimate can be calculated directly from the results.
• We then raise the bets a little higher and assume that we do not even know which of the coins produced each sample set.
  – i.e. we estimate the coin probabilities without knowing which of the coins is being tossed.
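Before moving on to the coin toss example, the variance-estimation iteration derived above is easy to check numerically. The following short Python sketch (added here for illustration; the observation value y = 2 is an arbitrary assumption) iterates the M-step update and compares its fixed point with the closed-form ML estimate max{0, y² − 1}. It also shows the convergence issue raised above: the initial guess θ̂₀ = 0 is a degenerate fixed point and the iteration never leaves it.

```python
def em_variance(y, theta0, n_iter=50):
    """EM iteration for the variance of S in Y = S + N, where N has unit variance."""
    theta = theta0
    for _ in range(n_iter):
        # E[S | Y] = theta/(theta+1) * y,  var[S | Y] = theta/(theta+1)
        theta = (theta / (theta + 1.0) * y) ** 2 + theta / (theta + 1.0)
    return theta

y = 2.0                       # assumed observation value (illustrative only)
ml = max(0.0, y**2 - 1.0)     # closed-form ML estimate

print("ML estimate:     ", ml)                      # 3.0
print("EM, theta0 = 1.0:", em_variance(y, 1.0))     # converges to ~3.0
print("EM, theta0 = 0.0:", em_variance(y, 0.0))     # stays at 0.0
```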
An example: Coin toss [4]
• We have two coins, A and B.
• The probabilities of heads are $\theta_A$ and $\theta_B$.
• We have 5 measurement sets with 10 coin tosses in each set:
  1. HTTTHHTHTH (5H, 5T)
  2. HHHHTHHHHH (9H, 1T)
  3. HTHHHHHTHH (8H, 2T)
  4. HTHTTTHHTT (4H, 6T)
  5. THHHTHHHTH (7H, 3T)
• The binomial distribution is used to calculate the probabilities: $\binom{n}{k} p^k (1-p)^{n-k}$.
• Maximum likelihood (if we know which coin is tossed in each set): here sets 2, 3 and 5 were tossed with coin A (24H, 6T in total) and sets 1 and 4 with coin B (9H, 11T in total), so the ML estimates are simply the head fractions:
  $\hat{\theta}_A = \frac{24}{24+6} = 0.80, \qquad \hat{\theta}_B = \frac{9}{9+11} = 0.45$
• If we do not know which of the coins is tossed in each set, the ML estimates cannot be calculated directly → EM algorithm.

Expectation Maximization (the coin identities are unknown):
1. Initialization: $\hat{\theta}_A^{(0)} = 0.6$, $\hat{\theta}_B^{(0)} = 0.5$.
2. E-step: for each set, compute the probability that it was produced by coin A or coin B, and split the observed counts accordingly. Example calculation for the first set (5H, 5T):
   $\binom{10}{5} \cdot 0.6^5 \cdot 0.4^5 \approx 0.201, \qquad \binom{10}{5} \cdot 0.5^5 \cdot 0.5^5 \approx 0.246$
   Normalizing, $P(A \mid \text{set 1}) \approx 0.201/(0.201+0.246) \approx 0.45$ and $P(B \mid \text{set 1}) \approx 0.55$. The resulting responsibilities and expected counts are:

   Set   | P(A) | P(B) | Coin A counts | Coin B counts
   1     | 0.45 | 0.55 | ≈2.2H, 2.2T   | ≈2.8H, 2.8T
   2     | 0.80 | 0.20 | ≈7.2H, 0.8T   | ≈1.8H, 0.2T
   3     | 0.73 | 0.27 | ≈5.9H, 1.5T   | ≈2.1H, 0.5T
   4     | 0.35 | 0.65 | ≈1.4H, 2.1T   | ≈2.6H, 3.9T
   5     | 0.65 | 0.35 | ≈4.5H, 1.9T   | ≈2.5H, 1.1T
   Total |      |      | ≈21.3H, 8.6T  | ≈11.7H, 8.4T

3. M-step: update the estimates from the expected counts:
   $\hat{\theta}_A^{(1)} = \frac{21.3}{21.3+8.6} \approx 0.71, \qquad \hat{\theta}_B^{(1)} = \frac{11.7}{11.7+8.4} \approx 0.58$
4. After 10 iterations the estimates have converged to $\hat{\theta}_A^{(10)} \approx 0.80$ and $\hat{\theta}_B^{(10)} \approx 0.52$.
   (A code sketch of this iteration is given after the Further examples slide.)

About some practical issues
• Although many examples in the literature show excellent results with the EM algorithm, reality is often less glamorous.
  – As the number of uncertain parameters in the modeled system increases, even the best available guess (in the ML sense) might not be adequate.
  – NB! This is not the algorithm's fault; it still provides the best possible solution in the ML sense.
• Depending on the form of the likelihood function (provided in the E-step), the convergence rate of the EM can vary considerably.
• Notice that the algorithm converges towards a local maximum.
  – To locate the global peak, one must try different initial guesses for the estimated parameters or use some other, more advanced methods.
  – With multiple unknown (hidden/latent) parameters the number of local peaks usually increases.

Further examples
• Line fitting (shown only in the lecture)
• Parameter estimation of a multivariate Gaussian mixture
  – See the additional pdf file for the
    • problem definition
    • equations: definition of the log-likelihood function, E-step and M-step
  – See the additional Matlab m-file for an illustration of
    • the example in numerical form (dimensions and value spaces of each parameter)
    • the iterative nature of the EM algorithm (study how the parameters change at each iteration)
    • how the initial guesses for the estimated parameters affect the final result
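Referring back to the coin toss example above, the iteration can be written out in a few lines of Python. This is an illustrative sketch added to these notes (not part of the original slides or the course files); it uses the five toss sequences and the initial guesses θ_A = 0.6, θ_B = 0.5 from the example, and reproduces the values 0.71/0.58 after one iteration and roughly 0.80/0.52 after ten.

```python
from math import comb

# The five measurement sets from the example (H = heads, T = tails)
sets = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]
counts = [(s.count("H"), s.count("T")) for s in sets]

def em_coins(theta_a, theta_b, n_iter=10):
    for _ in range(n_iter):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in counts:
            # E-step: binomial likelihood of the set under each coin,
            # normalized into the responsibility of coin A
            like_a = comb(h + t, h) * theta_a**h * (1 - theta_a)**t
            like_b = comb(h + t, h) * theta_b**h * (1 - theta_b)**t
            p_a = like_a / (like_a + like_b)
            # Split the observed counts according to the responsibilities
            heads_a += p_a * h
            tails_a += p_a * t
            heads_b += (1 - p_a) * h
            tails_b += (1 - p_a) * t
        # M-step: re-estimate the head probabilities from the expected counts
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

print(em_coins(0.6, 0.5, n_iter=1))   # ~ (0.71, 0.58)
print(em_coins(0.6, 0.5, n_iter=10))  # ~ (0.80, 0.52)
```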
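The multivariate Gaussian mixture example itself is given in the separate pdf and Matlab files referenced above, which are not reproduced here. As a rough, self-contained stand-in (a sketch with assumed synthetic data, not the course's m-file), the following NumPy snippet runs EM for a simple one-dimensional, two-component Gaussian mixture: the E-step computes the component responsibilities of each sample and the M-step re-estimates the weights, means and variances from those soft counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two Gaussians (illustrative values, not from the course files)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guesses for the weights, means and variances of the two components
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: posterior probability (responsibility) of each component for each sample
    joint = w * gauss(x[:, None], mu, var)            # shape (N, 2)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters using the responsibilities as soft counts
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:  ", w)    # ~ [0.6, 0.4]
print("means:    ", mu)   # ~ [-2.0, 3.0]
print("variances:", var)  # ~ [1.0, 0.25]
```

Changing the initial guesses for mu shows the behaviour discussed on the practical-issues slide: the algorithm always converges, but possibly to a different local maximum of the likelihood.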
Conclusions
• The EM algorithm iteratively finds ML estimates in estimation problems with hidden (incomplete) data.
  – The likelihood increases at every step of the iteration process.
• The algorithm consists of two iteratively taken steps:
  – Expectation step (E-step)
    • Take the expected value of the complete data given the observation and the current parameter estimate.
  – Maximization step (M-step)
    • Maximize the Q-function of the E-step (basically, the data of the E-step is used as if it were measured observations).
• The algorithm converges to a local maximum.
  – The global maximum can be elsewhere.
• See the reference list for literature on use cases of the EM algorithm in communications.
  – These are references [5]-[16] (not mentioned on the previous slides).

References
1. Dempster, A.P.; Laird, N.M.; Rubin, D.B., "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
2. Moon, T.K., "The Expectation Maximization Algorithm," IEEE Signal Processing Magazine, vol. 13, pp. 47-60, Nov. 1996.
3. Do, C.B.; Batzoglou, S., "What is the Expectation Maximization algorithm?" [Online]. Not available anymore; was originally available at courses.ece.illinois.edu/ece561/spring08/EM.pdf
4. The Expectation-Maximization Algorithm. [Online]. Not available anymore; was originally available at ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf

Some communications-related papers using the EM algorithm:
5. Borran, M.J.; Nasiri-Kenari, M., "An efficient detection technique for synchronous CDMA communication systems based on the expectation maximization algorithm," IEEE Transactions on Vehicular Technology, vol. 49, no. 5, pp. 1663-1668, Sep. 2000.
6. Cozzo, C.; Hughes, B.L., "The expectation-maximization algorithm for space-time communications," in Proc. IEEE International Symposium on Information Theory, p. 338, 2000.
7. Rad, K.R.; Nasiri-Kenari, M., "Iterative detection for V-BLAST MIMO communication systems based on expectation maximisation algorithm," Electronics Letters, vol. 40, no. 11, pp. 684-685, 27 May 2004.
8. Barembruch, S.; Scaglione, A.; Moulines, E., "The expectation and sparse maximization algorithm," Journal of Communications and Networks, vol. 12, no. 4, pp. 317-329, Aug. 2010.
9. Panayirci, E., "Advanced signal processing techniques for wireless communications," in Proc. Fifth International Workshop on Signal Design and its Applications in Communications (IWSDA), p. 1, 10-14 Oct. 2011.
10. O'Sullivan, J.A., "Message passing expectation-maximization algorithms," in Proc. IEEE/SP 13th Workshop on Statistical Signal Processing, pp. 841-846, 17-20 July 2005.
11. Etzlinger, B.; Haselmayr, W.; Springer, A., "Joint Detection and Estimation on MIMO-ISI Channels Based on Gaussian Message Passing," in Proc. 9th International ITG Conference on Systems, Communication and Coding (SCC 2013), pp. 1-6, 21-24 Jan. 2013.
12. Groh, I.; Staudinger, E.; Sand, S., "Low Complexity High Resolution Maximum Likelihood Channel Estimation in Spread Spectrum Navigation Systems," in Proc. IEEE Vehicular Technology Conference (VTC Fall 2011), pp. 1-5, 5-8 Sept. 2011.
13. Wei Wang; Jost, T.; Dammann, A., "Estimation and Modelling of NLoS Time-Variant Multipath for Localization Channel Model in Mobile Radios," in Proc. IEEE Global Telecommunications Conference (GLOBECOM 2010), pp. 1-6, 6-10 Dec. 2010.
14. Nasir, A.A.; Mehrpouyan, H.; Blostein, S.D.; Durrani, S.; Kennedy, R.A., "Timing and Carrier Synchronization With Channel Estimation in Multi-Relay Cooperative Networks," IEEE Transactions on Signal Processing, vol. 60, no. 2, pp. 793-811, Feb. 2012.
15. Tsang-Yi Wang; Jyun-Wei Pu; Chih-Peng Li, "Joint Detection and Estimation for Cooperative Communications in Cluster-Based Networks," in Proc. IEEE International Conference on Communications (ICC '09), pp. 1-5, 14-18 June 2009.
16. Xie, Y.; Georghiades, C.N., "Two EM-type channel estimation algorithms for OFDM with transmitter diversity," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, pp. III-2541-III-2544, 13-17 May 2002.