Learning Hidden Markov Models using Probabilistic Matrix Factorization
April 30, 2011
Ashutosh Tewari
Decision Support and Machine Intelligence
M. Giering, M. Shashanka (collaborators)

CONTENT
1. Hidden Markov Models
2. Parameter Estimation (Baum-Welch)
3. Probabilistic Matrix Factorization
4. HMM Parameter Estimation using PMF
5. Experiments

HIDDEN MARKOV PROCESS
[Figure: a three-state hidden Markov process with states S1, S2, S3, initial probabilities π1, π2, π3, transition probabilities a12, a21, a13, a31, a23, a32, and emission probabilities b1(·), b2(·), b3(·) over the symbols V1, V2, V3. Example observed sequence: V3 V2 V2 V1 V2 V3 V2 V1 V1.]

HIDDEN MARKOV MODEL
Generative Model
The observed symbols are emitted by hidden states. Model parameters: the transition probabilities, the emission probabilities, and the initial state distribution.
Applications:
1. Speech recognition
2. Character recognition
3. Intrusion detection
4. Bioinformatics
5. Remaining useful life prediction
6. Source separation

HIDDEN MARKOV MODELS
Problem Classes
Class 1: Given the model parameters, compute the probability of an observed symbol sequence (forward/backward recursive algorithm).
Class 2: Given the model parameters and an observed symbol sequence, determine the most likely hidden state sequence (Viterbi algorithm).
Class 3: Given the observed symbol sequence, compute the model parameters (Baum-Welch EM algorithm). We address this problem.

BAUM-WELCH ALGORITHM
Expectation Maximization
The EM algorithm gives the maximum-likelihood estimate of the parameters when the generative model has hidden variables**.
Initialization: choose initial model parameters.
Iterations (till convergence):
E step: estimate the distribution of the hidden variables given the model parameters and the observed data.
M step: re-estimate the model parameters given the distribution computed in the E step and the observed data.
Termination: the procedure guarantees local maximization of the observed-data likelihood with respect to the model parameters.
Complexity: O(Iter_EM * N^2 * T), where N is the number of hidden states and T is the sequence length.
**R. M. Neal and G. E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants."

PROBABILISTIC MATRIX FACTORIZATION
D ≈ W H, where the entries of all matrices are non-negative:
D = matrix whose columns are the data vectors,
W = matrix whose columns are the basis vectors,
H = matrix whose columns are the mixing weights.
An observed data vector is represented as a linear combination of basis vectors.
Which matrix is factorized?
If D is a doubly stochastic matrix, it can be factorized symmetrically (PLCA).
If D is a left stochastic matrix, it can be factorized asymmetrically (PLSA).
Generative model of PLSA (k = hidden topic): a document d is drawn with probability P(d), a topic k with probability P(k|d), and a word w with probability P(w|k), so that
P(w|d) = Σ_k P(w|k) P(k|d).
M. V. S. Shashanka et al., "Probabilistic Latent Variable Models as Non-Negative Factorizations," Computational Intelligence and Neuroscience, May 2008.
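The PLSA factorization above can be written out as a short EM loop. Below is a minimal NumPy sketch of those updates; the function name plsa_em, the random initialization, and the dense (V x K x D) posterior array are illustrative choices made here for clarity, not details taken from the talk.

```python
import numpy as np

def plsa_em(C, K, n_iter=100, seed=0):
    """Minimal EM sketch for the PLSA model P(w|d) = sum_k P(w|k) P(k|d).

    C : (V, D) word-by-document count matrix, non-negative.
    Returns W (V, K) with columns P(w|k) and H (K, D) with columns P(k|d).
    """
    rng = np.random.default_rng(seed)
    V, D = C.shape
    W = rng.random((V, K)); W /= W.sum(axis=0, keepdims=True)   # P(w|k)
    H = rng.random((K, D)); H /= H.sum(axis=0, keepdims=True)   # P(k|d)
    for _ in range(n_iter):
        # E step: posterior P(k|w,d) proportional to P(w|k) P(k|d)
        post = W[:, :, None] * H[None, :, :]                    # (V, K, D)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M step: re-estimate the factors from expected counts C(w,d) P(k|w,d)
        Cq = C[:, None, :] * post                               # (V, K, D)
        W = Cq.sum(axis=2); W /= W.sum(axis=0, keepdims=True) + 1e-12
        H = Cq.sum(axis=0); H /= H.sum(axis=0, keepdims=True) + 1e-12
    return W, H
```

In the D ≈ W H picture above, the columns of W play the role of the basis vectors and the columns of H the mixing weights; the left-stochastic data matrix being factorized is the column-normalized count matrix C.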
APPLICATIONS OF PMF
Topic Modeling
Goal: identify hidden topics in a document corpus.
Popular models:
1. Probabilistic Latent Semantic Analysis (Hofmann)
2. Latent Dirichlet Allocation (Blei et al.)
Example: topic modeling of a corpus with 16,333 news articles and 23,075 unique words, with P(w|d) = Σ_k P(k|d) P(w|k). Top words of four recovered topics:
Topic 1: NEW, FILM, SHOW, MUSIC, MOVIE, PLAY, MUSICAL, BEST, ACTOR, FIRST, YORK, OPERA, THEATER, ACTRESS, LOVE
Topic 2: MILLION, TAX, PROGRAM, BUDGET, BILLION, FEDERAL, YEAR, SPENDING, NEW, STATE, PLAN, MONEY, PROGRAMS, GOVERNMENT, CONGRESS
Topic 3: CHILDREN, WOMEN, PEOPLE, CHILD, YEARS, FAMILIES, WORK, PARENTS, SAYS, FAMILY, WELFARE, MEN, PERCENT, CARE, LIFE
Topic 4: SCHOOL, STUDENTS, SCHOOLS, EDUCATION, TEACHERS, HIGH, PUBLIC, TEACHER, BENNETT, MANIGAT, NAMPHY, STATE, PRESIDENT, ELEMENTARY, HAITI

HMM ESTIMATION USING PMF
Which Matrix to Factorize?
Convert the observed sequence into a count matrix of consecutive symbol pairs. The normalized count matrix approximates the joint distribution P(v_t, v_{t+1}), which is doubly stochastic and can therefore be factorized symmetrically. How is a pair generated? A hidden state is drawn and the symbols of the pair are emitted from the corresponding emission distributions, so the emission probabilities and the hidden-state distribution are read off the factors. How do you estimate the transition probabilities? (See the improved generative model below.)

PARAMETER ESTIMATION
Expectation Maximization Algorithm
Initialize the factors; iterate the E step (compute the posterior over hidden states for each observed symbol pair) and the M step (re-estimate the factors from the expected counts) until convergence; terminate. The procedure maximizes the likelihood of the count matrix.
Computational complexity:
1) Count matrix generation: O(T)
2) EM algorithm: O(Iter_EM * M^2 * N)
Overall: O(T + Iter_EM * M^2 * N) ~ O(T) for long sequences, where M is the number of observed symbols. Compare with the Baum-Welch complexity O(Iter_EM * N^2 * T).

HMM ESTIMATION USING PMF
Improved Generative Model
The above generative model does not include the transition probabilities explicitly. The generative model is improved by including the transition probabilities**.
**M. Shashanka, "A Fast Algorithm for Discrete HMM Parameter Estimation using Observed Transitions," ICASSP, 2011.

EXPERIMENT
A three-state, three-symbol HMM with known emission probabilities and the state transition matrix

       S1    S2    S3
S1     0     0.9   0.1
S2     0     0     1
S3     1     0     0

[Figure: observed sequence, count matrix (color map), factorization of the count matrix using PMF, and the estimated emission probabilities after rounding.]

COMPARISON WITH BAUM-WELCH ALGORITHM
Accuracy and Time Complexity
Sequence length = 10^4.

Method   Time taken (s)   log P(O|λ)
PMF      0.03             -29910
BW       28.41            -24397

The PMF was faster by almost three orders of magnitude.
[Figure: estimated emission probabilities.]
B. Lakshminarayanan and R. Raich, "Non-negative Matrix Factorization for Parameter Estimation in Hidden Markov Models," IEEE Workshop on Machine Learning for Signal Processing, 2010.

EFFECT OF SEQUENCE LENGTH
The empirically estimated joint distribution is poor at short sequence lengths, resulting in a bad model.
[Figure: Hellinger distance between the estimated and true emission probabilities as a function of sequence length.]

APPROXIMATING P(v_t, v_{t+1})
Different options when the observations are continuous:
1. Soft discretization.
2. Non-parametric estimation.
3. Parametric mixture model.
We used Gaussian Mixture Copula Models (GMCM)**, an extension of GMMs that handles non-Gaussian component densities.
**A. Tewari, A. Raghunathan, "Gaussian Mixture Copula Models," in preparation, 2011.
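To make the estimation procedure on the preceding slides concrete, here is a minimal NumPy sketch that builds the pair-count matrix from a discrete symbol sequence and factorizes it with EM under the improved generative model, P(v_t=i, v_{t+1}=j) = Σ_{k,l} b_k(i) θ_{kl} b_l(j), where θ_{kl} approximates P(s_t=k, s_{t+1}=l). The function name hmm_from_pairs, the random initialization, and the dense posterior array are illustrative assumptions, not code from the talk.

```python
import numpy as np

def hmm_from_pairs(seq, N, M, n_iter=200, seed=0):
    """Sketch: estimate HMM parameters by factorizing the pair-count matrix.

    Model: P(v_t=i, v_{t+1}=j) = sum_{k,l} b_k(i) * theta_{kl} * b_l(j),
    with theta_{kl} ~ P(s_t=k, s_{t+1}=l) = pi_k * a_{kl} (stationarity assumed).

    seq : sequence of symbol indices in {0, ..., M-1}
    N   : number of hidden states; M : number of observed symbols
    Returns A (N, N) transitions, B (N, M) emissions, pi (N,) state marginal.
    """
    rng = np.random.default_rng(seed)
    seq = np.asarray(seq)
    # 1) Count matrix of consecutive symbol pairs: O(T)
    C = np.zeros((M, M))
    np.add.at(C, (seq[:-1], seq[1:]), 1.0)
    # 2) Random initialization of the emission rows b_k(.) and the joint theta
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    theta = rng.random((N, N)); theta /= theta.sum()
    for _ in range(n_iter):
        # E step: posterior over state pairs (k, l) for each symbol pair (i, j)
        q = np.einsum('ki,kl,lj->ijkl', B, theta, B)          # (M, M, N, N)
        q /= q.sum(axis=(2, 3), keepdims=True) + 1e-12
        Cq = C[:, :, None, None] * q                          # expected counts
        # M step: re-estimate theta and the (tied) emission distributions
        theta = Cq.sum(axis=(0, 1)); theta /= theta.sum() + 1e-12
        B_new = Cq.sum(axis=(1, 3)).T + Cq.sum(axis=(0, 2)).T  # both pair positions
        B = B_new / (B_new.sum(axis=1, keepdims=True) + 1e-12)
    pi = theta.sum(axis=1)                    # marginal P(s_t = k)
    A = theta / (pi[:, None] + 1e-12)         # P(s_{t+1} = l | s_t = k)
    return A, B, pi
```

As in the slides, building the count matrix costs O(T); this dense sketch spends O(M^2 * N^2) per EM iteration, whereas the talk's O(Iter_EM * M^2 * N) figure presumably reflects a more factored update. On a long sequence sampled from the three-state HMM of the experiment, one would expect hmm_from_pairs(seq, N=3, M=3) to recover the emission and transition probabilities up to a permutation of the hidden states, while short sequences give a poor empirical joint distribution and hence a poor model, as noted above.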