Two New Approaches for Learning Hidden Markov Models

by Hyun Soo Kim

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2010

© Hyun Soo Kim, MMX. All rights reserved.

ARCHIVES — MASSACHUSETTS INSTITUTE OF TECHNOLOGY, AUG 24 2010, LIBRARIES

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, September 5, 2009

Certified by: Leslie P. Kaelbling, Professor of Computer Science and Engineering, MIT, Thesis Supervisor

Accepted by: Dr. Christopher J. Terman, Chairman, Department Committee on Graduate Theses

THIS PAGE INTENTIONALLY LEFT BLANK

Two New Approaches for Learning Hidden Markov Models

by Hyun Soo Kim

Submitted to the Department of Electrical Engineering and Computer Science on September 5, 2009, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Hidden Markov Models (HMMs) are ubiquitously used in applications such as speech recognition and gene prediction that involve inferring latent variables given observations. For the past few decades, the predominant technique used to infer these hidden variables has been the Baum-Welch algorithm. This thesis utilizes insights from two related fields. The first insight is from Angluin's seminal paper on learning regular sets from queries and counterexamples, which produces a simple and intuitive algorithm that efficiently learns deterministic finite automata.
The second insight follows from a careful analysis of the representation of HMMs as matrices, and from realizing that these matrices hold deeper meaning than simply being entities used to represent the HMMs. This thesis takes Angluin's approach and nonnegative matrix factorization and applies them to learning HMMs. Angluin's approach fails, and the reasons for its failure are discussed. The matrix factorization approach is successful, allowing us to produce a novel method of learning HMMs. The new method is combined with Baum-Welch into a hybrid algorithm. We evaluate the algorithm by comparing its performance in learning selected HMMs to that of the Baum-Welch algorithm. We empirically show that our algorithm performs better than the Baum-Welch algorithm for HMMs with at most six states that have dense output and transition matrices. For these HMMs, our algorithm performs 22.65% better on average by the Kullback-Leibler measure.

Thesis Supervisor: Leslie P. Kaelbling
Title: Professor of Computer Science and Engineering, MIT

Acknowledgments

This thesis was prepared at the CSAIL 4th Floor Laboratory in the Stata building. Publication of this thesis does not constitute approval by the CSAIL Laboratory or any sponsor of the findings or conclusions contained herein.

I want to thank Professor Leslie Kaelbling for giving me the guidance with which to complete this thesis. She graciously agreed to take me under her wing in response to what must have been a pretty abrupt request to join her lab. Ever since I became her student in the fall of 2008, she has always been on top of my research and available for consultation. Throughout the year, I met with her dozens of times to tell her of my progress, to ask about papers I should read, and, perhaps most importantly, to hear her tell me not to get discouraged about not making much progress. I have learned invaluable lessons on how to really dig down into an idea and become an effective researcher.
There were many times when I was stuck on a certain approach, only to be nudged toward a better direction by her help. Sometimes, it just took time and contemplation to understand a difficult idea fully.

Contents

1 Introduction
  1.1 Hidden Markov Models
  1.2 Motivation
  1.3 Thesis Approach
  1.4 Thesis Structure
2 Background
  2.1 Hidden Markov models
    2.1.1 Definitions and Notation
    2.1.2 How to Use HMMs: A Brief Guide
3 Related Work
  3.1 The Baum-Welch Algorithm
  3.2 Angluin's Algorithm
    3.2.1 Learning Nondeterministic Finite Automata
    3.2.2 Equivalence of pNFAs and HMMs
  3.3 The Spectral Algorithm
4 The Two New Approaches
  4.1 The Extended Angluin's Algorithm
    4.1.1 Motivation
    4.1.2 Issues with Learning HMMs
      4.1.2.1 No Agent in HMMs
      4.1.2.2 Deterministic Actions in DFAs
      4.1.2.3 No Accepting States
      4.1.2.4 Faux-Regular Sets
    4.1.3 Probabilistic Angluin: A Possible Extension
    4.1.4 The HMM Learning Algorithm
  4.2 The Nonnegative Matrix Factorization Algorithm
    4.2.1 Motivation
      4.2.1.1 The Observation Matrix
      4.2.1.2 Factorization of the Observation Matrix
    4.2.2 Recovering O, T, and π from an Observation Matrix
    4.2.3 Issues with Learning HMMs
      4.2.3.1 Stochastic Factorization
      4.2.3.2 Constructing the Observation Matrix
      4.2.3.3 Factorization into C and OT
      4.2.3.4 Factorization Measure
      4.2.3.5 Trivial Factorizations
      4.2.3.6 Non-Uniqueness of Factorization
      4.2.3.7 Difficulty of Factorization
    4.2.4 Sparse Observation Matrices
    4.2.5 Algorithms for Factoring
      4.2.5.1 Lee and Seung's Algorithm
      4.2.5.2 The ALS Algorithm
      4.2.5.3 Our NNMF Algorithm
    4.2.6 The HMM Learning Algorithm
5 Methodology
  5.1 Implementation Issues
    5.1.1 Baum-Welch Termination Protocol
    5.1.2 NNMF Observation Matrix
  5.2 Training and Testing
  5.3 Measures of Accuracy
    5.3.1 Euclidean Distance
    5.3.2 Kullback-Leibler Divergence
  5.4 The HMMs
    5.4.1 Simple HMM 3 3
    5.4.2 Simple HMM 3 4
    5.4.3 Simple HMM 4 3
    5.4.4 Separated HMM 3 4
    5.4.5 Separated HMM 3 4 #2
    5.4.6 Dense HMM 3 3
    5.4.7 Dense HMM 4 4
    5.4.8 Dense HMM 5 5
    5.4.9 Separated HMM 5 5
    5.4.10 Dense HMM 6 6
    5.4.11 Sparse HMM 6 6
    5.4.12 Diverse Sparse HMM 6 6
6 Results and Analysis
  6.1 On the Failure of the Extended Angluin Algorithm
    6.1.1 Stopping State Approach: Version 1
    6.1.2 Stopping State Approach: Version 2
    6.1.3 Accept All Approach
    6.1.4 Summary
  6.2 Comparison of Algorithms
    6.2.1 Evaluation Tables
    6.2.2 Evaluation Table Analysis
      6.2.2.1 Average KL Ratios
      6.2.2.2 NNMF vs. NNMF + B-W
      6.2.2.3 Larger Training Sets
      6.2.2.4 Dense HMMs
      6.2.2.5 Separated HMMs
      6.2.2.6 Sparse HMMs
      6.2.2.7 Standard Deviations
      6.2.2.8 Factorization vs. Effectiveness
      6.2.2.9 NNMF + B-W Runtimes
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
  7.3 Summary
A Angluin's Algorithm Example
B HMMs for Testing
  B.1 Simple HMM 3 3
  B.2 Simple HMM 3 4
  B.3 Simple HMM 4 3
  B.4 Separated HMM 3 4
  B.5 Separated HMM 3 4 #2
  B.6 Dense HMM 3 3
  B.7 Dense HMM 4 4
  B.8 Dense HMM 5 5
  B.9 Separated HMM 5 5
  B.10 Dense HMM 6 6
  B.11 Sparse HMM 6 6
  B.12 Diverse Sparse HMM 6 6

List of Figures

2-1 A Markov model.
3-1 An example of a pNFA.
3-2 An example of an equivalent HMMT.
3-3 An example of an equivalent HMM.
3-4 Converting an HMM into an equivalent pNFA.
4-1 The Extended Angluin HMM Learning Algorithm.
4-2 An example of how output distributions are determined by O and the distribution of possible states.
4-3 An example of an observation matrix.
4-4 Factorization of an observation matrix.
4-5 Trivial factorization of an observation matrix.
4-6 The Lee-Seung algorithm.
4-7 The ALS algorithm.
4-8 The modified Lee-Seung algorithm.
4-9 The modified ALS algorithm.
4-10 The NNMF HMM Learning Algorithm.
6-1 State transitions of Simple HMM 3 3.
6-2 NFA learned from 10000 output sequences from Simple HMM 3 3.
6-3 NFA learned from 50 output sequences from Simple HMM 3 3.
6-4 NFA learned from 10000 output sequences from Separated HMM 3 4.
6-5 Simple HMM 3 3 with only one possible initial state.
6-6 NFA learned from 10000 output sequences from Simple HMM 3 3 with only one possible initial state.

List of Tables

6.1 Simple HMM 3 3 results.
6.2 Simple HMM 3 4 results.
6.3 Simple HMM 4 3 results.
6.4 Separated HMM 3 4 results.
6.5 Separated HMM 3 4 #2 results.
6.6 Dense HMM 3 3 results.
6.7 Dense HMM 4 4 results.
6.8 Dense HMM 5 5 results.
6.9 Separated HMM 5 5 results.
6.10 Dense HMM 6 6 results.
6.11 Sparse HMM 6 6 results.
6.12 Diverse Sparse HMM 6 6 results.
6.13 Average KL ratios.
6.14 Average KL ratios for separated HMMs.
A.1 Initial table in Angluin's algorithm.
.. . . . . . . . . . . . . . . . . 91 A.2 Closed table in Angluin's algorithm. 92 .. .. . . . . . . . . . . . . . . . . . . . .. ... . .. .. .. .. .. A.3 Third iteration table in Angluin's algorithm. .. .. .. .. .. . .. .. .. 92 A.4 Fourth iteration table in Angluin's algorithm. . . . . . . . . . . . . . . . . . 92 A.5 Fifth iteration table in Angluin's algorithm. . . . . . . . . . . . . . . . . . . 93 A.6 Finished table in Angluin's algorithm. . . . . THIS PAGE INTENTIONALLY LEFT BLANK Chapter 1 Introduction 1.1 Hidden Markov Models An HMM is a statistical model that models a Markov process with unobserved states. Typically, the parameters in question are state transition and output probabilities. In a regular Markov model, the state of the system is plainly visible to the observer, and so the only parameters are transition probabilities between the states. In an HMM, the state of the system is hidden, but the outputs influenced by the state are visible. Output tokens are emitted according to a distribution that depends on the state. The sequence of tokens generated by an HMM are indicators of the state transitions of the system. HMMs can be used for pattern recognition, either by using repeated examples and the tools of inference to deduce parameters of a system, or by using given model parameters to infer future sequences of outputs. As such, HMMs are widely used in speech recognition, handwriting, gesture recognition, musical score following, and bioinformatics. There are three canonical problems in HMM theory. The first is to compute the probability of a particular output sequence given the parameters of the model. The second is to compute the most likely sequence of hidden states that could have given rise to a given output sequence given the parameters of the model. The third is to compute the most likely state transition probabilities, output probabilities, and initial state distribution, given output sequences. 
The first two problems have been completely solved and given thorough treatment in Rabiner's paper [18]. The third problem is given consideration as well, but Rabiner [18] admits the difficulty in learning HMMs; it is not a surprise that the most useful problem would turn out to be the hardest. This thesis is concerned with the third problem, learning an HMM solely from the observable data. We clarify our goal a bit, to emphasize that we are not so much interested in recovering the exact HMM parameters as producing an HMM that produces the same likelihoods for output sequences as the actual HMM. There are numerous results that imply that HMM learning is provably hard. Under reasonable assumptions, Terwijn showed in [20] that the HMM learning problem is not solvable in polynomial time in the error, confidence parameter, and size of the HMM. This result should serve as an unfortunate reminder of the limitations of our abilities in this field; we might get some answers, but we will probably never get an optimal answer. Practitioners typically resort to local search heuristics. In this vein, the Baum-Welch / EM algorithm has become the predominant learning algorithm as noted in [3] and [7]. Nonetheless, in practical applications, we typically find that we can relax constraints in the model and add strong assumptions that allow us to achieve much better results. For example, under the assumption that observation distributions arising from distinct hidden states are distinct, Hsu et al. [13] showed that there exists a polynomial time algorithm for approximating the conditional distribution of a future observation conditioned on some history of observations. 1.2 Motivation Currently, the Baum-Welch algorithm is the prevailing method used to learn HMMs. It is an expectation-maximization (EM) algorithm, which works by iterating an estimation step followed by a modification step. The algorithm does not come with many guarantees. 
The most one could say is that it is guaranteed, asymptotically, to converge to a local, not necessarily global, optimum. Moreover, results are highly sensitive to initialization. Nonetheless, the algorithm has been used extensively for decades by HMM researchers.

Within the past few years, there have been some interesting results on learning HMMs that branch away from the Baum-Welch approach and introduce completely new concepts. One such result is the spectral algorithm discovered by Hsu et al. [13]. This thesis attempts to follow suit and investigate new methods of learning HMMs in the following two ways.

Angluin's paper demonstrates the effectiveness of having a teacher that answers simple but illuminating questions regarding the system [1]. To our knowledge, there has not been an attempt to adapt Angluin's method to learn HMMs, although a recent paper has extended Angluin's method to learn nondeterministic finite automata. The general HMM learning problem is difficult, but reducing the problem by approaching it from a practical perspective, introducing strong assumptions, and relaxing constraints should make it more tractable, just as it did for learning regular sets in Angluin's paper. For example, in the realm of HMMs, it seems reasonable to assume the existence of a teacher that informs the algorithm how well it is doing and how well it could be doing if it worked better. Although learning finite automata does not carry over to HMMs in an obvious way, the methodology is certainly relevant, as we are trying to uncover hidden parameters given observables and a teacher helping the algorithm. Therefore, one of the goals of this thesis is to investigate the feasibility of applying Angluin's insight to the realm of learning HMMs.

The other area that this thesis investigates is nonnegative matrix factorization. The idea is to investigate whether the use of matrices to represent HMMs is merely notational.
If an HMM gives rise to a number of meaningful matrices, could it be the case that, starting with meaningful matrices, we may reconstruct the HMM? HMMs necessarily give rise to matrices with certain properties. Perhaps if we extracted matrices with those properties from the observable data, we could reconstruct the HMM.

1.3 Thesis Approach

This thesis tackles the HMM learning problem using novel approaches inspired by related areas of study. To do so, we first perform an in-depth analysis of the approaches in question, describing them and giving examples of their use. Our goal is to fully understand the underlying assumptions and context of these insights so that we have a good idea of how they could be of use in HMM learning. This thesis then faces the problem of applying these insights to HMM learning.

This thesis also investigates the effectiveness of these new approaches, comparing them to the tried-and-true Baum-Welch algorithm. Our goal is to show whether any of these new approaches shows significant promise. Specifically, this thesis uses the insight gathered from considering various novel approaches to conclude that the nonnegative matrix factorization (NNMF) approach is the most promising. Moreover, this thesis combines NNMF with Baum-Welch to produce a hybrid algorithm in hopes of outperforming the original Baum-Welch algorithm. The theoretical analysis of this approach leads us to consider various algorithms for nonnegative factorization and issues of linear programming that are outside the scope of this thesis. However, the thesis does illuminate the interconnectedness of these two fields by showing that a good factorization is likely to result in a better learned HMM.

1.4 Thesis Structure

This thesis is structured as follows. Chapter 2 describes the background necessary for understanding the HMM learning problem.
It includes the definitions and notation conventions necessary to understand the rest of the thesis. We describe what an HMM is and how it can be used to calculate meaningful probabilities.

Chapter 3 lists and explains previous work related to HMM learning. Baum-Welch, Angluin's method for learning finite automata, and the spectral algorithm are described here. This chapter aims to provide an overview of the different approaches to HMM learning.

Chapter 4 is original work. The first section in this chapter describes an extension to Angluin's algorithm to learn HMMs. The second section describes how nonnegative matrix factorization can be used to learn HMMs. These two ideas are built up into two HMM learning algorithms.

Chapter 5 describes our methodology in implementing and evaluating our HMM learning algorithms. We discuss measures of accuracy and set up a variety of HMMs.

Chapter 6 describes the results of implementing and evaluating our two HMM learning algorithms. We find that only the nonnegative matrix factorization approach is feasible. We compare this new HMM learning algorithm with Baum-Welch and comment on interesting trends in the resulting data.

Chapter 7 wraps everything together, culminating in a conclusion, avenues for future work, and a summary.

Chapter 2

Background

2.1 Hidden Markov models

A Markov chain is a stochastic process that obeys the Markov property. The Markov property states that given the present state, future states are independent of past states. If X_1, X_2, ... is a sequence of random variables with the Markov property, with the indices increasing with the passage of time, then

    Pr(X_{n+1} = x | X_n = x_n, X_{n-1} = x_{n-1}, ..., X_1 = x_1) = Pr(X_{n+1} = x | X_n = x_n).

A Markov model comprises states and transitions between pairs of states. In the Markov models we are studying, each state also emits an output that is dependent only on that state. Figure 2-1 is an example of a Markov model.
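As a concrete illustration, a two-state weather model like the one in Figure 2-1 can be simulated in a few lines. The Python sketch below is mine, not part of the thesis; the transition probabilities (sunny stays sunny with probability 0.8, cloudy turns sunny with probability 0.5) and rain probabilities (5% on sunny days, 80% on cloudy days) are the ones shown in the figure, and the emission happens from the current state before the transition, matching the convention adopted in this thesis.

```python
import random

# Hypothetical sketch (not from the thesis): simulate the two-state
# weather model of Figure 2-1.
T = {"sunny":  {"sunny": 0.8, "cloudy": 0.2},   # transition probabilities
     "cloudy": {"sunny": 0.5, "cloudy": 0.5}}
RAIN_PROB = {"sunny": 0.05, "cloudy": 0.80}     # emission probabilities

def simulate(days, start="sunny", seed=0):
    rng = random.Random(seed)
    state, outputs = start, []
    for _ in range(days):
        # Emit from the current state first (the thesis's convention)...
        outputs.append("Rain" if rng.random() < RAIN_PROB[state] else "No Rain")
        # ...then transition according to the current state's distribution.
        r, acc = rng.random(), 0.0
        for nxt, p in T[state].items():
            acc += p
            if r < acc:
                state = nxt
                break
    return outputs

print(simulate(6))
```

An observer of such a simulation sees only the list of Rain / No Rain outputs, which is exactly the hidden-state setting described next.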
At each time step, the Markov model progresses forward in time, producing an output from its current state and moving to the next state according to the transition probabilities. Under some conventions, the output is produced only after transitioning to the next state. In this thesis, we employ the convention that an output is produced from the current state before the transition occurs. Many Markov models also specify a probability distribution over the initial state.

[Figure 2-1: A Markov model. Each state represents whether the day starts off sunny or cloudy: a sunny day stays sunny with probability 80% and turns cloudy with probability 20%, while a cloudy day turns sunny or stays cloudy with probability 50% each. Each state's emission corresponds to whether it rains on that day (5% on sunny days, 80% on cloudy days). The Markov property is satisfied because the transition and output from one state depend only on that state.]

In a hidden Markov model, the state sequence is unknown to the observer, and only the sequence of emissions is observable. In our example above, we might observe a sequence {Rain, No Rain, No Rain, Rain, Rain, Rain} across a span of six days. We would have no information about the number of states, the transition matrix, the output matrix, or the initial state distribution.

2.1.1 Definitions and Notation

We can organize the transition probabilities into a transition matrix, which holds information about how likely it is to move from one state to another. In the example above, if the sunny state is state 1 and the cloudy state is state 2, the transition matrix is

    [ 0.8  0.5 ]
    [ 0.2  0.5 ]

The more popular convention is to have the rows sum to 1; for our convenience, we will take the convention that the columns sum to 1. We can also organize the output probabilities into an output matrix. An output matrix describes the distribution of outputs emitted from a state.
In the example above, the output matrix is

    [ 0.05  0.80 ]
    [ 0.95  0.20 ]

We can also organize the initial starting probabilities into a column vector:

    [ 0.7 ]
    [ 0.3 ]

Under our notation, a transition matrix T's rows represent the destination states in some fixed order and its columns represent the source states in some fixed order. An output matrix O's rows represent the outputs in some fixed order and its columns represent the states in some fixed order. Lastly, π is a column vector representing the initial state distribution.

2.1.2 How to Use HMMs: A Brief Guide

Given the parameters of an HMM, we can compute the following quantities. We give only a brief overview in this section, just enough to lay the foundations for our use of HMMs in the rest of this thesis. Please refer to [18] for a more in-depth treatment of how to use HMMs.

(i) The probability of encountering a certain output sequence given a model. Let A_x be the matrix such that entry (i, j) is the probability of starting from state j, emitting x, and then moving to state i. Then A_x = T O_x, where O_x = diag(O(x, :)). The probability of encountering output sequence x_t x_{t-1} ... x_1 is then

    1 A_{x_t} A_{x_{t-1}} ... A_{x_1} π,

where 1 is a row vector of all ones of the appropriate size.

(ii) The probability of encountering a certain future output sequence given a past output sequence. If the future output sequence is y_s y_{s-1} ... y_1 and the past output sequence is x_t x_{t-1} ... x_1, the answer is

    1 A_{y_s} A_{y_{s-1}} ... A_{y_1} A_{x_t} ... A_{x_1} π
    -------------------------------------------------------
              1 A_{x_t} A_{x_{t-1}} ... A_{x_1} π

(iii) The probability of being in a certain state given a certain output sequence. If the output sequence is x_t x_{t-1} ... x_1, then the probability of being in state i is proportional to entry i of the column vector A_{x_t} A_{x_{t-1}} ... A_{x_1} π (normalizing by the probability of the sequence gives the conditional probability).

Chapter 3

Related Work

In this section, we describe previous work related to learning HMMs. The Baum-Welch algorithm is a well-known HMM learning algorithm that has been in use for decades. Angluin's algorithm is used to learn deterministic finite automata, but we will attempt to apply it to learning HMMs.
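Before turning to specific algorithms, quantity (i) of Section 2.1.2 can be illustrated concretely. The Python sketch below is mine, not from the thesis; it builds A_x = T O_x for the two-state weather example under the column-stochastic convention, and `seq_prob` is a hypothetical helper name.

```python
import numpy as np

# Sketch (assumed, not from the thesis) of quantity (i): the probability
# of an output sequence, with columns of T and O summing to 1.
T  = np.array([[0.8, 0.5],
               [0.2, 0.5]])       # T[i, j] = P(next state i | state j)
O  = np.array([[0.05, 0.80],      # row 0 = "Rain", row 1 = "No Rain"
               [0.95, 0.20]])     # O[x, j] = P(output x | state j)
pi = np.array([0.7, 0.3])         # initial state distribution

def A(x):
    # A_x = T O_x, where O_x = diag(O(x, :)).
    return T @ np.diag(O[x, :])

def seq_prob(xs):
    """P(x_t ... x_1), with xs given oldest-first, so A_{x_1} acts first."""
    v = pi
    for x in xs:
        v = A(x) @ v
    return v.sum()   # multiplying by the all-ones row vector

# Sanity check: the probabilities of all length-1 sequences sum to 1.
print(seq_prob([0]) + seq_prob([1]))   # approximately 1.0
```

The same matrix products give quantities (ii) and (iii): the conditional probability of a future sequence is a ratio of two such values, and the (unnormalized) state distribution after a sequence is the vector `v` before the final sum.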
3.1 The Baum-Welch Algorithm

The Baum-Welch algorithm outlined in [3] is a classic algorithm used to find the unknown parameters of an HMM. In general, if we have access to labeled data (that is, we know the sequence of states), we can find the parameters of the HMM using maximum likelihood estimators. The real problem arises when we do not know the state sequence.

The algorithm is an expectation-maximization (EM) algorithm. Given only emissions from an HMM, it computes maximum likelihood estimates and posterior probabilities for the parameters of the HMM. These two computations are closely linked in the algorithm: it alternates between two stages, known as the E-step and the M-step. In the E-step, the algorithm estimates the likelihood of the data under the current parameters. In the M-step, these likelihoods are used to re-estimate the parameters. These steps are iterated until the algorithm converges to a local maximum, although it is not easy to tell when to stop. The Baum-Welch algorithm can get close to a local maximum, but not necessarily to a global maximum.

Intuitively, the algorithm takes a guess, looks at the data the guess would produce, compares it to the data actually produced, and updates its guess. It iterates until it cannot make any more improvements. Without a teacher, the algorithm makes incremental improvements by evaluating its own performance. Depending on the initial setting of the parameters, the algorithm may converge to a local maximum that is not a global maximum. Typically, to learn an HMM using Baum-Welch, we run the algorithm a few times, using random seeds each time. For each run, we stop when improvements to the likelihood measure are deemed too small. However, it is difficult to know when to stop, because improvements to the likelihood measure can be flat for a long time before jumping up.
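The stopping difficulty just described can be made concrete. The sketch below is an assumption of mine, not the thesis's termination protocol; `em_step` and `loglik` are caller-supplied placeholders (e.g. one Baum-Welch E+M pass and a likelihood evaluation). Because the log-likelihood can stay flat for a while before jumping, a naive "stop at the first small improvement" rule is fragile; a patience window requires several consecutive near-flat steps before stopping.

```python
# Hypothetical convergence loop (not from the thesis).
def run_until_converged(em_step, loglik, params, tol=1e-6, patience=5,
                        max_iters=1000):
    """Iterate em_step until loglik improves by less than tol for
    `patience` consecutive iterations, or max_iters is reached."""
    prev, flat = loglik(params), 0
    for _ in range(max_iters):
        params = em_step(params)
        cur = loglik(params)
        # Count consecutive near-flat improvements; reset on a real jump.
        flat = flat + 1 if cur - prev < tol else 0
        if flat >= patience:
            break
        prev = cur
    return params
```

This does not eliminate the problem (a plateau longer than the patience window still causes a premature stop), which is why the thesis combines such a rule with multiple randomly seeded restarts.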
Because the Baum-Welch algorithm is so widespread, there are plenty of implementations that one can find easily on the web. In this thesis, we use a MATLAB implementation developed by [15].

3.2 Angluin's Algorithm

Angluin's algorithm for learning regular sets via queries and counterexamples is an important result [1]. The algorithm reversed a previous dismissal of the topic. By introducing the concept of a teacher that is able to answer queries made by the algorithm, it also shed light on the nature of learning itself.

A regular language is a language recognized by a deterministic finite automaton (DFA). A regular set is the set of words in a regular language. The problem of learning any deterministic finite automaton is equivalent to that of learning any regular set.

The existence of a teacher is an assumption made by the paper. There is no known polynomial-time algorithm for learning a regular language without the kind of teacher specified by the paper. Specifically, the teacher is capable of two things: answering membership queries and providing counterexamples. A membership query is a yes-or-no question regarding the membership of a certain word in the regular set. Making an analogy to human learning, if the problem were learning how to play tennis, membership queries would be akin to asking questions such as "am I swinging my arm correctly?" and "is my racquet grip correct?" A counterexample is a word, not yet in the guess set that the algorithm builds, that belongs in the desired regular set. Because the algorithm never adds words that are not in the desired regular set, there is no counterexample involving a word that should be removed from the guess set. In the tennis analogy, providing counterexamples would be akin to asking, "what am I missing?"
Using these intuitive concepts of a teacher, the algorithm builds a regular set. The nature and extent of the similarity seem to be an interesting research topic on human intelligence. Before we continue, let us remark on how realistic it is to assume that such a teacher is available. Assuming that membership queries are still available, a machine that is trying to predict a sequence that occurs in nature could find its own counterexamples by checking randomly produced sequences. After trying sufficiently many random sequences, it could conclude that its model is probably correct. In this way, we could obviate the need for a teacher that provides counterexamples. As for membership queries, if data is constantly provided by nature and the machine has been observing the data for a sufficiently long time, the machine could conclude that it has seen most of the members of the desired regular set. These two observations suggest that the abstract teacher concept is not very far-fetched. In any case, the algorithm utilizes an observation table during its run. It is a table that keeps track of membership queries it has already made. Running along the vertical axis of the table are prefixes and running along the horizontal axis are suffixes. Prefixes satisfy the constraint that any shortened version of the words (shortened by removing characters from the end) are still prefixes. A similar constraint holds for suffixes. The row-space is partitioned further into S, a "basis" set, and S -A, the set of prefixes formed by appending a letter of the alphabet A to the end of elements in S. Where the row and column for a particular prefix p and suffix s meet is a boolean value indicating whether the word p - s is in the regular set or not. The algorithm builds this table until it satisfies two conditions. The first condition is closedness. An observation table is closed if and only if for each t in S - A, there exists an s in S such that row(t) = row(s). 
Intuitively, this condition says that a letter addition to any word in S should be just like something we have seen before. Appending the letter should put us in no new territory. The second condition is consistency. An observation table is consistent if and only if for every two elements s1 and s2 in S such that row(s1) = row(s2), we have row(s1·a) = row(s2·a) for every letter a. Intuitively, this condition says that states that we think are the same should react in the same way regardless of how we prod them. The algorithm performs membership queries in order to satisfy these conditions and then asks for a counterexample. If a counterexample exists, it adds it to the observation table and repeats the process. The paper shows that the algorithm always terminates and does so in polynomial time. The remarkable aspect of this algorithm is the simplicity brought on by assuming the existence of a teacher. The assumption is very reasonable; as humans, we expect a lot more from our teachers. This usage of a teacher is the main motivator for one of the two new approaches to learning HMMs in this thesis. We provide a sample run of Angluin's algorithm in Appendix A.

3.2.1 Learning Nondeterministic Finite Automata

Recently, researchers have shown in [5] that Angluin's teacher idea can be extended beyond learning regular sets represented as DFAs. This result is encouraging for us because it means that the idea holds more merit than we previously perceived. Like Angluin's algorithm, the NFA algorithm uses an observation table and the conditions of closedness and consistency. Instead of deterministic finite automata, however, it targets residual finite-state automata, a class of nondeterministic finite automata (NFA) that is more general than deterministic finite automata. Nondeterministic automata can express regular sets in an exponentially more compact way than deterministic automata.
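The closedness and consistency conditions are easy to state in code. Below is a minimal sketch, assuming the observation table is stored as a dictionary from whole words to membership booleans; the helper names `row`, `is_closed`, and `is_consistent` are ours, not Angluin's:

```python
from itertools import product

def row(table, prefix, suffixes):
    # The row of the observation table for a prefix: one boolean per suffix column.
    return tuple(table[prefix + s] for s in suffixes)

def is_closed(table, S, A, suffixes):
    # Closed: every row labeled by S·A equals the row of some element of S.
    s_rows = {row(table, s, suffixes) for s in S}
    return all(row(table, s + a, suffixes) in s_rows for s in S for a in A)

def is_consistent(table, S, A, suffixes):
    # Consistent: prefixes with equal rows remain equal after appending any letter.
    for s1, s2 in product(S, repeat=2):
        if row(table, s1, suffixes) == row(table, s2, suffixes):
            if any(row(table, s1 + a, suffixes) != row(table, s2 + a, suffixes)
                   for a in A):
                return False
    return True
```

For instance, for the language over {a} of words with an even number of a's, the table {'': True, 'a': False, 'aa': True} with S = {'', 'a'} and the single empty suffix is both closed and consistent.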
A classic result in complexity theory says that any nondeterministic automaton can be replaced by a deterministic one with at most exponentially many states. Being able to learn NFAs directly thus means being able to efficiently learn languages whose smallest deterministic representations are exponentially larger. This result is encouraging because HMMs are more related to NFAs than they are to DFAs, since the sequence of states and emissions in an HMM is probabilistic. The connection between HMMs and NFAs is still not obvious, but this recent result suggests that the teacher assumption can be a powerful tool.

3.2.2 Equivalence of pNFAs and HMMs

NFAs allow multiple transitions from one state per action, but there is no notion of probability. Probabilistic NFAs (pNFAs) bring us a step closer to utilizing Angluin's algorithm to learn HMMs. In pNFAs, each transition caused by an action has an associated probability of occurring. Figure 3-1 shows a pNFA, taken from [8]. In fact, Lemma 5 in [5] states that pNFAs without accepting states and HMMs can be transformed into each other. The equivalence comes from the fact that both structures utilize states, outputs, transition probabilities, and output probabilities to uncover a model that best explains the distribution of outputs that are observed. The process involves utilizing HMMs with transition emissions (HMMTs). Figure 3-2, taken from [8], shows an HMMT equivalent to the pNFA in Figure 3-1. For completeness, Figure 3-3, taken from [8], shows an HMM equivalent to the HMMT and pNFA above.

Figure 3-1: An example of a pNFA, taken from [8]. Each action has an associated probability of occurrence given the state.

Figure 3-2: An example of an equivalent HMMT, taken from [8]. Unlike an HMM, HMMTs emit outputs during state transitions.
Figure 3-3: An example of an equivalent HMM, taken from [8].

Note that converting a pNFA into an HMM required increasing the number of states. Such is not always the case. Nonetheless, the process of converting an HMM into a pNFA is much simpler, preserving the number of states. Figure 3-4 shows an example.

3.3 The Spectral Algorithm

Hsu et al. [13] note that learning HMMs is provably hard. However, they go on to write that the difficulty is divorced from what we are likely to encounter in practical applications and that making reasonable assumptions helps tremendously. They make the following assumptions. (i) π > 0 element-wise. (ii) O and T are rank m, where m is the number of states. (iii) UᵀO is invertible, where U "preserves the state dynamics". Details are in [13].

Figure 3-4: Converting an HMM into an equivalent pNFA, taken from [8].

The first assumption, that π > 0, says that every state can serve as the initial state. This assumption may seem arbitrary, but it is necessary because of the way that the algorithm works. It seems to impose an unreasonable requirement. To our knowledge, it is not a trivial task to convert an HMM into one where all states can be initial states. The second assumption is rather interesting, as it is phrased in linear algebraic terms that seem unrelated to HMMs. It is actually a very powerful assumption that says that not only can no two states have the same output distribution, but also no state can have a distribution that is a mixture of other states' distributions.
Also, a more subtle requirement that follows from the second assumption is that the number of distinct outputs must be at least the number of states, since the rank of O is at most the number of its rows, which is the number of distinct outputs. Nonetheless, it is still possible to produce an accurate model; it would simply have more states than the most concise model. The punch line in [13] is that its algorithm can calculate probabilities of output sequences without knowing the latent parameters. The paper mentions that the parameters can be recovered with a little more work, but their determination is unnecessary. We include this summary of their algorithm to illuminate the role linear algebra plays in HMM learning.

Chapter 4 The Two New Approaches

4.1 The Extended Angluin Algorithm

4.1.1 Motivation

The strategies employed in the related work discussed earlier suggest that perhaps learning HMMs is hard because we are asking the wrong question. Our strategy is to glean from that related work and put the HMM learning problem on a more practical footing by introducing assumptions. Namely, we wish to continue in the style of Angluin and assume the existence of a teacher that is able to provide explicit performance guarantees that our algorithm should be able to achieve. This capability should allow our algorithm to measure how well it is doing and make improvements accordingly. We are inspired by the success that other researchers have had in this area, but we find Angluin's idea intriguing and especially appropriate for this learning problem, because human learning seems to closely resemble the learning style in Angluin's paper. As humans, we learn from queries and counterexamples, formulating the simplest theory that explains everything we observe. Only by seeing counterexamples do we make amendments to our theory.
Also, we make our observations under uncertainty, and realize that what has happened is not necessarily indicative of the underlying circumstances. While DFAs are agnostic to transition, output, and initial state distributions, Angluin's treatment of states in a DFA as equivalence classes of pasts that yield the same future is a great insight into how we can distinguish states in an HMM. Perhaps we can utilize Angluin's DFA learning algorithm to at least learn the bare-bones structure of an HMM.

4.1.2 Issues with Learning HMMs

There are a number of issues with applying Angluin's idea to learning HMMs that we address below.

4.1.2.1 No Agent in HMMs

In DFAs and NFAs, outputs are actually actions made at states. An action is determined by the agent traversing the automaton. In an HMM, the outputs at each state are determined randomly according to an output distribution. Unlike in a finite automaton, we have no say in deciding where we are going in an HMM. This issue is a problem since we want to be able to prod the HMM to see whether certain paths are possible, as in Angluin's algorithm with DFAs. We can get around this problem by assuming that we can sample enough sequences from the HMM to be able to find the sequence we wish to execute.

4.1.2.2 Deterministic Actions in DFAs

Each action made by an agent in a DFA moves the agent into exactly one state. That is, the action taken by the agent at a given state uniquely determines the next state. In an HMM, a state may emit a certain output but may travel to one of many states. We can get around this problem by utilizing an NFA, which allows an agent to transition to multiple states at the same time given an action. As mentioned earlier, the recent paper [5] shows us that it is possible to extend Angluin's algorithm to learn NFAs.

4.1.2.3 No Accepting States

In DFAs and NFAs, there are accepting states, which signal ends of action sequences. There is no equivalent concept of an accepting state in HMMs.
The significance of an accepting state is not so much that it signals the end of an action sequence as that it forces action sequences to be finite. The accepting states basically allow us to say that certain finite sequences are valid sets of actions in a DFA or NFA. The appropriate analogue for HMMs would be to have a state that ends the emissions. There are a number of ways we can implement this idea.

1. (Stopping State Approach) Assume that we can ask the HMM to stop automatically once it reaches a certain state.

2. (Accept All Approach) Assume that every state in the HMM is an accept state. In this approach, every output sequence generated by the HMM is an accepted sequence.

3. Assume that we can transform the HMM into an equivalent pNFA and assign states to be the starting states and accepting states.

We remark that option 1 is identical to option 3, because the process of transforming an HMM into a pNFA preserves all of the states, as noted by example in Figure 3-4. So really, we have two options. Note that in the first case, we are assuming that we can get more information from an HMM than we normally could.

4.1.2.4 Faux-Regular Sets

HMMs with accepting states produce finite output sequences. We gather these sequences into a set and train the NFA on this set. We are implicitly assuming that this set is a regular set. The NFA learner can only learn from regular sets, and it is not guaranteed that sequences sampled from HMMs will form a regular set, especially if we limit the number of samples. If everything that the learner queries for is in the set, then there is no problem. The problem arises when the learner queries for something that should be in the set but is not, leading to contradictions. It is likely that the NFA learner will fail to terminate in certain cases.
4.1.3 Probabilistic Angluin: A Possible Extension

Another way to apply Angluin's algorithm to learn HMMs is to use a softer version of membership querying, in which we populate the table with probabilities, not ones and zeroes. Since we can sample many sequences from the HMM, why not take advantage of the distribution of outputs? For example, instead of simply saying that the sequence created by appending suffix '011' to '0' is accepted, we can write in the probability that '0' follows '011', write the probability of '0110' occurring, or write the probability of seeing '0110' among all four-letter output sequences. In terms of finding a map of the HMM, we can treat any entry greater than 0 as a valid entry, so that we can still recover the transition structure. This idea seems to have potential, but we have not been able to address the following issue that arises. Once we populate a table with probability values, we cannot claim that rows that look similar are in the same state. A row in Angluin's table represents the possible futures from the state reached by a sequence of actions (the prefix). Because the actions uniquely determine the resulting state, the outputs from that state are always the same. However, in the case of probabilistic outputs, given a sequence of outputs (the prefix), there is a probability distribution over the states that the HMM could be in. The HMM is not in any one single state, but a mixture of different states. For example, suppose we are trying to learn the relatively simple HMM from Figure 3-3, and suppose we have the ability to convert it into the pNFA from Figure 3-1. The initial state distribution is (0.4 0.6), where the states are enumerated from left to right. The output distributions of the states are ( 0.29 0.71 ; 0.83 0.17 ), where each row represents one state. Hence, the row corresponding to the initial state would look like 0.4(0.29 0.71) + 0.6(0.83 0.17) = (0.614 0.386).
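This mixture computation is easy to check numerically. Below is a small sketch with numpy; `table_row` is a hypothetical helper name of ours:

```python
import numpy as np

# Output distributions of the two states (one row per state), from above.
state_outputs = np.array([[0.29, 0.71],
                          [0.83, 0.17]])

def table_row(alpha):
    # Expected output distribution when the HMM is in state 1 with
    # probability alpha and in state 2 with probability 1 - alpha.
    belief = np.array([alpha, 1.0 - alpha])
    return belief @ state_outputs

initial_row = table_row(0.4)   # initial state distribution (0.4, 0.6)
# initial_row is (0.614, 0.386), as computed above.
```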
In general, given a sequence of outputs that make a prefix, the corresponding row would look like a(0.29 0.71) + (1 − a)(0.83 0.17), where a is the probability of being in the first state given that prefix of outputs. Because the system is probabilistic, practically any value of a could be a valid probability of being in the first state. It is possible for the resulting table to have no two rows that are similar in Euclidean distance. These considerations lead us to concepts that we describe in Section 4.2. Basically, while the fact that every row in the table looks like a(0.29 0.71) + (1 − a)(0.83 0.17) is a problem, it is also an insight, since it says that every row in the table can be expressed as a linear combination of two basis vectors. This realization suggests a perhaps more linear algebraic approach, which we describe in Section 4.2, the nonnegative matrix factorization approach. Angluin's paper shows that a state can be recognized by looking at the decisions made after entering the state. Our analysis suggests that the same does not hold in the probabilistic case, where more sophisticated tools are required.

4.1.4 The HMM Learning Algorithm

We choose to use the NFA learning algorithm to discern the transition map and not the probabilities associated with the transitions and outputs. The insight is that even though NFAs do not recognize probabilities, a sequence of actions in an NFA is possible if and only if that sequence of actions is a possible sequence of outputs in the corresponding pNFA and HMM. Thus, our strategy is to take sequences of outputs from the HMM, treat them as the regular set, run the NFA learning algorithm to learn that regular set, find out the transitions between states, and then assign probabilities to them through some other method, say Baum-Welch. Because we would not be given any probabilistic information, it can be argued that we are not being given too much information.
Nonetheless, by eliminating certain transitions altogether, we may be able to decrease the number of latent parameters drastically. Figure 4-1 is a diagram of the entire Angluin approach to learning HMMs. We will refer to this algorithm as the extended Angluin algorithm.

Figure 4-1: The Extended Angluin HMM Learning Algorithm. The diagram shows the pipeline: (1a) request stopping states in the HMM, or (1b) work with the unaltered HMM; (2) gather samples, which form the set that represents the regular set the NFA will learn; (3) use Angluin's algorithm to learn the NFA; (4) learn probabilities via Baum-Welch. 1a and 1b represent the two options we have in remedying the accept state problem.

4.2 The Nonnegative Matrix Factorization Algorithm

4.2.1 Motivation

Consider the representation of HMMs in terms of matrices. The initial output is distributed according to the vector Oπ. The output distribution, given an initial output τ, is

O T O_τ π / (1ᵀ O_τ π),

where O_τ denotes the diagonal matrix formed from row τ of O, so that O_τ π is the vector of joint probabilities of starting in each state and emitting τ. Now, note that T O_τ π is a vector whose elements are proportional to the probability of being in a certain state given the initial output τ. The denominator 1ᵀ O_τ π just normalizes the probabilities to sum to 1. From this representation, we can interpret output distributions as follows. Given a certain output sequence, we first compute the distribution of states that the HMM is in, say v. Then, the output distribution is vOᵀ. An example is given in Figure 4-2.

4.2.1.1 The Observation Matrix

We define the observation matrix of an HMM as follows. The rows of the matrix represent prefixes and the columns represent unit-length suffixes. The entry at which row r intersects column c is the probability of seeing the output represented by c given the output sequence represented by r. We stipulate that the prefixes and suffixes are ordered lexicographically, from shorter to longer.
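A measured observation matrix can be estimated from sampled output sequences by counting. A minimal sketch, assuming outputs are integers 0..n_outputs−1 and prefixes are given as tuples (the function name is ours):

```python
import numpy as np
from collections import Counter, defaultdict

def observation_matrix(sequences, prefixes, n_outputs):
    # Entry (r, c) estimates P(next output = c | outputs so far = prefixes[r]).
    counts = defaultdict(Counter)
    for seq in sequences:
        for t in range(len(seq)):
            counts[tuple(seq[:t])][seq[t]] += 1
    A = np.zeros((len(prefixes), n_outputs))
    for i, p in enumerate(prefixes):
        total = sum(counts[p].values())
        if total > 0:
            for c in range(n_outputs):
                A[i, c] = counts[p][c] / total
    return A
```

The more sequences are sampled, the closer each row gets to the true conditional output distribution, which is what allows the observation matrix to be measured to any degree of accuracy.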
Oᵀ = ( 0.2 0.2 0.6 ; 0.4 0.4 0.2 ; 0 0.5 0.5 ),   πᵀ = (0.5 0.25 0.25).

In states 1, 2, and 3, the output distributions are (0.2 0.2 0.6), (0.4 0.4 0.2), and (0 0.5 0.5) respectively. Initially, the probabilities of being in states 1, 2, and 3 are 0.5, 0.25, and 0.25 respectively. Hence, the distribution of the initial output is 0.5(0.2 0.2 0.6) + 0.25(0.4 0.4 0.2) + 0.25(0 0.5 0.5) = πᵀOᵀ = (0.2 0.325 0.475). Given that the first output is 1, the HMM's state distribution is proportional to T O_1 π and normalizes to (0 0.5 0.5)ᵀ. It follows that the distribution of the second output, given that the first output is 1, is 0(0.2 0.2 0.6) + 0.5(0.4 0.4 0.2) + 0.5(0 0.5 0.5) = λ (πᵀ O_1 Tᵀ) Oᵀ = (0.2 0.45 0.35), where λ is a normalization constant.

Figure 4-2: An example of how output distributions are determined by O and the distribution of possible states. The outputs are 1, 2, and 3.

Figure 4-3 is an example of an observation matrix (values rounded to a suitable number of digits), where the HMM is the one from Figure 4-2. Note that these observation matrices are exact. Measured observation matrices are likely to have a significant amount of noise. However, we emphasize that observation matrices can be measured to any degree of accuracy, simply by extracting more output sequences from the HMM.

Observation matrix for HMM from Figure 4-2:

        1       2       3
  ε     0.2     0.325   0.475
  1     0.2     0.45    0.35
  2     0.2     0.3538  0.4462
  3     0.3053  0.3579  0.3368
  11    0       0.5     0.5

Figure 4-3: An example of an observation matrix.

4.2.1.2 Factorization of the Observation Matrix

Consider the observation matrix as a matrix A. We must have the matrix equality

C Oᵀ = A,

where O is the familiar output matrix and C is the coefficient matrix, every row of which holds the distribution of possible states that the HMM could be in. For the example given in Figure 4-3, the equality would look like Figure 4-4. We have truncated the table to include only five rows.
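The computations in Figure 4-2 can be reproduced with a few lines of numpy (variable names are ours; the transition step is left implicit because Figure 4-2 does not list T explicitly):

```python
import numpy as np

# Output matrix O: rows are outputs 1..3, columns are states 1..3,
# so column j is state j's output distribution.
O = np.array([[0.2, 0.4, 0.0],
              [0.2, 0.4, 0.5],
              [0.6, 0.2, 0.5]])
pi = np.array([0.5, 0.25, 0.25])

initial_output_dist = O @ pi       # (0.2, 0.325, 0.475)

O1 = np.diag(O[0])                 # O_1: diagonal matrix from row 1 of O
post = O1 @ pi
post = post / post.sum()           # state belief after emitting output 1,
                                   # before applying the transition matrix T
```

Applying the (unspecified) transition matrix T to `post` would yield the figure's (0 0.5 0.5)ᵀ.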
( πᵀ
  λ1 πᵀ O_1 Tᵀ
  λ2 πᵀ O_2 Tᵀ
  λ3 πᵀ O_3 Tᵀ
  λ4 πᵀ O_1 Tᵀ O_1 Tᵀ )  Oᵀ  =  ( 0.2     0.325   0.475
                                  0.2     0.45    0.35
                                  0.2     0.3538  0.4462
                                  0.3053  0.3579  0.3368
                                  0       0.5     0.5 ),

where the λi are normalizing constants.

Figure 4-4: Factorization of an observation matrix.

This representation of the observation matrix is the motivation for using nonnegative matrix factorization to learn HMMs. We have just shown that every HMM admits a factorization C Oᵀ = A of the observation matrix. Thinking conversely, perhaps we can recover C and Oᵀ given just the measured observation matrix A.

4.2.2 Recovering O, T, and π from an Observation Matrix

In this section, we describe how to recover O, T, and π to produce an HMM from a measured observation matrix.

(i) Recovering O

Once we factor A into C and Oᵀ, the matrix O is readily apparent. Note that because the factorization is not unique, we are likely to get different values for O depending on the factoring algorithm.

(ii) Recovering π

Note that the first row of the measured observation matrix is the distribution of outputs given no prefix. In other words, it is the distribution of outputs given that the HMM is in its initial state. It follows that the first row of C, which is the corresponding distribution of possible states, is the initial distribution of states. We have π = C(1,:)ᵀ.

(iii) Recovering T

The task of recovering T is the most cumbersome. T is not evident at first glance, but it is clear that implicitly, C involves T. We present two methods. The first one relies on conditions that may not always hold, but is more precise. The second always works, but it is usually not as precise. Denote by O' and π' the parameters determined by the factorization, and let T' be the transition matrix we are looking to find.

Method 1

First, we require O' to be invertible. One way to understand this requirement intuitively is that no state's output distribution can be a linear combination of the other states' output distributions.
Second, we require the following matrix to be invertible, where m is the number of distinct outputs:

( O'_1 π'  O'_2 π'  ⋯  O'_m π' ).

Note that each O'_i π' is an m × 1 column, so the matrix is an m × m matrix. Let M be the submatrix of A obtained by taking rows 2 to m + 1. That is, M = A(2 : m + 1, :). Now note that M is measuring

( O' T' O'_1 π'  O' T' O'_2 π'  ⋯  O' T' O'_m π' )ᵀ,

up to the normalization of each row. It is a matrix of output distributions given an output sequence of length 1. In other words,

Mᵀ = O' T' ( λ1 O'_1 π'  λ2 O'_2 π'  ⋯  λm O'_m π' ),

where λi = 1/(1ᵀ O'_i π') are the normalizing constants, so that

T' = O'⁻¹ Mᵀ ( λ1 O'_1 π'  λ2 O'_2 π'  ⋯  λm O'_m π' )⁻¹.   (4.1)

Method 2

Let C_trunc be the submatrix of C obtained by taking rows 2 to m + 1. We call this matrix the C submatrix. Since C_trunc represents

( λ1 π'ᵀ O'_1 T'ᵀ
  λ2 π'ᵀ O'_2 T'ᵀ
  ⋯
  λm π'ᵀ O'_m T'ᵀ ),

we basically have another factorization problem, except with one of the factors already known. That is, we wish to find the T' that minimizes

‖ ( λ1 π'ᵀ O'_1 ; λ2 π'ᵀ O'_2 ; ⋯ ; λm π'ᵀ O'_m ) T'ᵀ − C_trunc ‖_F.

If the stacked matrix ( λ1 π'ᵀ O'_1 ; ⋯ ; λm π'ᵀ O'_m ) is invertible, we simply have

T'ᵀ = ( λ1 π'ᵀ O'_1 ; ⋯ ; λm π'ᵀ O'_m )⁻¹ C_trunc.

Otherwise, we use a variant of the nonnegative matrix factorization algorithm (discussed in Section 4.2.5) that fixes one of the two factor matrices.

4.2.3 Issues with Learning HMMs

There are a number of issues that we have to address before we can feasibly use matrix factorization to recover the HMM.

4.2.3.1 Stochastic Factorization

If we take A and split it into C and Oᵀ using today's popular factorization algorithms, the factors will not necessarily be row-stochastic. Our algorithm must enforce the row-stochasticity constraint.

4.2.3.2 Constructing the Observation Matrix

In order to construct A, we need to know the total number of outputs, which is the number of columns of A. Assuming that we have access to an HMM and can generate as many samples from it as we would like, this assumption is not unreasonable. From here on, we will assume that we know the number of outputs. Note that we are free to determine the number of rows of A. The number is how many sample output sequences we have.

4.2.3.3 Factorization into C and Oᵀ

We must specify the row count of Oᵀ.
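Method 1 is mechanical enough to sketch in numpy. The sketch below assumes an exactly measured A whose rows 2..m+1 correspond to the length-1 prefixes, a square invertible O', and as many states as outputs; `recover_T_method1` is our name for it:

```python
import numpy as np

def recover_T_method1(A, O_p, pi_p):
    # Recover T' via Equation 4.1 from the observation matrix A,
    # given O' (rows = outputs, columns = states) and pi'.
    m = len(pi_p)
    # Columns lambda_i * O'_i pi': normalized state beliefs just after
    # emitting output i, before the transition.
    cols = []
    for i in range(m):
        v = np.diag(O_p[i]) @ pi_p
        cols.append(v / v.sum())
    B = np.column_stack(cols)
    M = A[1:m + 1, :]              # rows for the length-1 prefixes
    return np.linalg.inv(O_p) @ M.T @ np.linalg.inv(B)
```

On an exactly measured A this recovers T' exactly; on a noisy A the result may need to be projected back onto the set of stochastic matrices.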
The interpretation of the row count of Oᵀ is simple: it is the number of states in the HMM. Unfortunately, knowing the number of states is not a reasonable assumption in learning HMMs. However, the number of rows in Oᵀ can be specified as part of the factorization approach. We can factor the observation matrix into Oᵀ's with different row counts and take the one that yields the best factorization, in some sense. For instance, we can measure the likelihood on held-out data.

4.2.3.4 Factorization Measure

We describe two interesting measures that we can use to gauge the accuracy of our factorization. One is the Kullback-Leibler (KL) divergence, a non-symmetric measure of the difference between two probability distributions P and Q. The KL divergence is non-symmetric and therefore not a true metric; it does not even satisfy the triangle inequality. For two matrices A and B, the divergence is

D(A‖B) = Σij ( Aij log(Aij/Bij) − Aij + Bij ).

The other is the Frobenius norm, which is the more intuitive measure of distance between two matrices. For A and B, the square of the Frobenius difference is

‖A − B‖²_F = Σij (Aij − Bij)².

The Frobenius distance is simply the Euclidean distance between A and B and is a true metric, satisfying the triangle inequality unlike the KL divergence. We will be factoring the observation matrix A using the Frobenius norm because most factorization techniques today utilize it. We can now express our problem mathematically: it is to find row-stochastic matrices C and Oᵀ that minimize ‖C Oᵀ − A‖²_F. Note that we will not necessarily find C and Oᵀ that perfectly factor A. In fact, because we cannot sample infinitely many sequences, it would be virtually impossible to perfectly factor the measured observation matrix. Perhaps it would be more accurate to call this method nonnegative matrix approximation instead of factorization, but we will use the term factorization for convenience.
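Both measures are short numpy functions. A sketch (function names ours), using the 0 log 0 = 0 convention and assuming B is strictly positive:

```python
import numpy as np

def kl_divergence(A, B):
    # Generalized KL divergence D(A || B) between nonnegative matrices,
    # with the 0 log 0 = 0 convention; B is assumed strictly positive.
    mask = A > 0
    logs = np.where(mask, A * np.log(np.where(mask, A, 1.0) / B), 0.0)
    return np.sum(logs - A + B)

def frobenius_sq(A, B):
    # Squared Frobenius distance between A and B.
    return np.sum((A - B) ** 2)
```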
It is important to note that under this measure, having many identical rows in A is not redundant, because deviation of the factorization from any of those rows is multiplied.

4.2.3.5 Trivial Factorizations

If the number of rows is greater than or equal to the number of columns in Oᵀ, then there exists a trivial factorization that yields ‖C Oᵀ − A‖²_F = 0. The reason is that in such a case, Oᵀ can simply be the identity matrix with arbitrary stochastic rows to pad the bottom, as shown in Figure 4-5. This result is not a problem with the factorization, since after all, we produced row-stochastic C and Oᵀ that minimize the required distance measure. The problem is that in terms of learning the HMM, it does not make sense to have a state that is never reached (as would be the case for the state represented by the last column of C in our example in Figure 4-5).

( A  0 )  ( 1    0    0    0
            0    1    0    0
            0    0    1    0
            0    0    0    1
            0.1  0.2  0.3  0.4 )  =  A.

Figure 4-5: Trivial factorization of an observation matrix. In this example, A has 4 columns. 0 is a column vector full of zeroes.

Nonetheless, we are interested not so much in recovering the exact parameters of the HMM as in being able to correctly estimate output sequence probabilities. If having a useless but benign extra state does not adversely affect the HMM's ability to accurately calculate probabilities, we should not be worried. To remove the extra state, we can try re-running the algorithm with a different seed. A more interesting situation occurs when the number of rows equals the number of columns in Oᵀ. Then the factorization with C = A and Oᵀ the identity matrix is exact, and the interpretation, that each state emits only one kind of output, is entirely plausible. As before, this result is not a problem with the factorization. Also as before, we should not be too worried, since our goal in learning HMMs is to find parameters that produce output sequences with the same likelihoods as the actual HMM.
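The padding construction of Figure 4-5 is easy to verify numerically; a toy sketch with a 2 × 4 row-stochastic A:

```python
import numpy as np

A = np.array([[0.2, 0.3, 0.4, 0.1],
              [0.25, 0.25, 0.25, 0.25]])

# C pads A with a zero column; O^T is the identity padded with an
# arbitrary stochastic row, as in Figure 4-5.
C = np.hstack([A, np.zeros((2, 1))])
OT = np.vstack([np.eye(4), [[0.1, 0.2, 0.3, 0.4]]])

exact = np.allclose(C @ OT, A)   # the factorization is exact, yet the
                                 # fifth "state" is never reached
```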
It might well be the case that there exists a set of parameters with O as the identity matrix that satisfies that criterion. Nonetheless, in an effort to prevent the algorithm from always selecting the identity matrix for Oᵀ and possibly confining itself to a local optimum in learning the HMM, there is a need to randomize the seeds in the factorization algorithm so that O does not necessarily come out to be the identity matrix.

4.2.3.6 Non-Uniqueness of Factorization

This issue is related to trivial factorization. We ask ourselves, how special is a factorization?

Lemma 1. Suppose we have a factorization C Oᵀ = A and let V be a matrix such that the following hold.
- V is row-stochastic
- V is invertible
- V⁻¹ is nonnegative
Then C' = CV, O'ᵀ = V⁻¹Oᵀ is another factorization with C'O'ᵀ = COᵀ.

Proof. First, note that C'O'ᵀ = CVV⁻¹Oᵀ = COᵀ. It suffices to show that C' and O'ᵀ are both row-stochastic. C' = CV is row-stochastic because row-stochastic matrices are closed under multiplication, and O'ᵀ = V⁻¹Oᵀ is row-stochastic because V⁻¹ is nonnegative and satisfies V⁻¹1 = V⁻¹(V1) = 1.

Nonetheless, we claim that it is not trivial to produce different factorizations from a given one.

Lemma 2. There does not exist an n × n matrix V with the following properties.
- V is not a permutation matrix
- V is row-stochastic
- V is invertible
- V⁻¹ is nonnegative

Proof. Suppose such a matrix V exists. We have VV⁻¹ = I_n. Since V is row-stochastic, V1 = 1, where 1 is a column vector of ones of appropriate size. Thus, 1 = V⁻¹1. Since V⁻¹ is nonnegative, we have just shown that it is row-stochastic. Denote the entries of V by r_ij and the entries of V⁻¹ by c_ij. The diagonal of VV⁻¹ = I_n gives Σ_i r_ji c_ij = 1 for all j. By the AM-GM inequality, 2 r_ji c_ij ≤ r_ji² + c_ij², so

2 = 2 Σ_i r_ji c_ij ≤ Σ_i r_ji² + Σ_i c_ij².

Moreover, since the entries are nonnegative,

Σ_i r_ji² ≤ (Σ_i r_ji)² = 1,   (4.2)

and every c_ij lies in [0, 1], so that c_ij² ≤ c_ij and

2 ≤ 1 + Σ_i c_ij², which gives Σ_i c_ij ≥ Σ_i c_ij² ≥ 1.   (4.3)

Summing over all j, we get

Σ_j Σ_i c_ij ≥ n.   (4.4)

But Σ_j Σ_i c_ij = Σ_i Σ_j c_ij = Σ_i 1 = n, so it follows that all of the inequalities used in the equations above must be equalities.
In particular, equality in Equation 4.2 means (Σ_i r_ji)² = Σ_i r_ji², so that r_ja r_jb = 0 for all a ≠ b, or equivalently r_ji(1 − r_ji) = 0 for all i, j, using the fact that each row of V sums to 1. Since V is row-stochastic, every entry of V is in [0, 1], so every entry r_ji of V is either 0 or 1. Since V is row-stochastic, it follows that V must be a permutation matrix, a contradiction.

Note that when V is a permutation matrix, we do get a new factorization. However, the factors produce the same HMM, since permuting the rows of Oᵀ amounts to relabeling the states. On the other hand, it is not impossible to generate a new factorization. We can sometimes find a matrix V such that the following hold.
- V is row-stochastic
- V is invertible
- V⁻¹Oᵀ is nonnegative
In such a case, we can say the following.

Lemma 3. Suppose we have a factorization C Oᵀ = A and let V be a matrix such that the following hold.
- V is row-stochastic
- V is invertible
- V⁻¹Oᵀ is nonnegative
Then C' = CV, O'ᵀ = V⁻¹Oᵀ is another factorization with C'O'ᵀ = COᵀ. We omit the proof.

Proposition 1. There exist a row-stochastic matrix Oᵀ and a matrix V, not a permutation matrix, such that the following properties hold.
- V is row-stochastic
- V is invertible
- V⁻¹Oᵀ is nonnegative

Proof. We simply need to provide an example. Let Oᵀ = ( 0.45 0.55 ; 0.55 0.45 ) and V = ( 0.4924 0.5076 ; 0.6060 0.3940 ). Then V⁻¹Oᵀ = ( 0.8967 0.1033 ; 0.0167 0.9833 ), which is nonnegative. Letting C' = CV and O'ᵀ = V⁻¹Oᵀ, the matrix C' is row-stochastic since C and V are both row-stochastic, and O'ᵀ = V⁻¹Oᵀ is row-stochastic since it is nonnegative and its rows sum to 1.

We conjecture that if there exists at least one such solution, there are an infinite number of them.

Conjecture 1. Let M be a row-stochastic matrix.
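The example in the proof of Proposition 1 can be checked directly; the small discrepancies against the printed values come only from the rounding of V's entries:

```python
import numpy as np

OT = np.array([[0.45, 0.55],
               [0.55, 0.45]])
V = np.array([[0.4924, 0.5076],
              [0.6060, 0.3940]])

new_OT = np.linalg.inv(V) @ OT   # the transformed factor V^{-1} O^T
# new_OT is approximately (0.8967 0.1033; 0.0167 0.9833): nonnegative
# and row-stochastic, so it is a valid alternative output matrix.
```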
If there exists at least one matrix V with the following properties, then there are infinitely many of them.
- V is row-stochastic
- V is invertible
- V⁻¹M is nonnegative

Empirically, this conjecture seems to be true. Perhaps most importantly, we make the following proposition about generated factorizations. When we say that two HMMs are equivalent, we mean that they assign the same probabilities to all output sequences.

Proposition 2. Let an HMM be parameterized by O, π, and T. Suppose V is a row-stochastic invertible matrix such that V⁻¹Oᵀ is nonnegative. Then the HMM parameterized by O' = O(Vᵀ)⁻¹ and π' = Vᵀπ is not necessarily equivalent to the original HMM. In other words, there may not exist a transition matrix T' such that the HMM parameterized by O', π', and T' is equivalent to the original HMM.

Proof. We only need to provide an example of such an HMM with parameters O, π, and T. Choose an HMM such that O' is invertible. For example, we can start with an invertible O. Let 1, 2, ..., m be the outputs. Suppose that T' is a transition matrix such that the HMM parameterized by O', π', and T' is equivalent to the original HMM. Then we must have

O' T' O'_1 π' = O T O_1 π,   O' T' O'_2 π' = O T O_2 π,   ...,   O' T' O'_m π' = O T O_m π.

Since O' is invertible, we have

T' O'_1 π' = O'⁻¹ O T O_1 π,   T' O'_2 π' = O'⁻¹ O T O_2 π,   ...,   T' O'_m π' = O'⁻¹ O T O_m π.

We can rearrange this system of equations into the following matrix form.

T' ( O'_1 π'  O'_2 π'  ⋯  O'_m π' ) = O'⁻¹ O T ( O_1 π  O_2 π  ⋯  O_m π ).

Note that the O'_i π' are m × 1 column vectors, so the matrix multiplying T' is an m × m matrix. By choosing our HMM carefully, we can make this m × m matrix invertible and be able to solve for T' as follows.

T' = O'⁻¹ O T ( O_1 π  O_2 π  ⋯  O_m π ) ( O'_1 π'  O'_2 π'  ⋯  O'_m π' )⁻¹   (4.5)

We emphasize that if indeed the m × m matrix in question is invertible, Equation 4.5 is a necessary requirement for the two HMMs to be equivalent.
Let

    O = | 0.4  0.5  0.1 |,    T = | 0  0  1 |,    π = | 1 |.
        | 0.4  0.3  0.3 |        | 1  0  0 |         | 0 |
        | 0.1  0.6  0.3 |        | 0  1  0 |         | 0 |

Let

    V = | 0.5749  0.1159  0.3092 |.
        | 0.4011  0.5630  0.0359 |
        | 0.0433  0.0419  0.9148 |

Then

    O' = | 0.2628  0.5046  0.2924 |.
         | 0.5663  0.4808  0.0605 |
         | 0.1709  0.0146  0.6471 |

We can solve for T' using Equation 4.5. We get

    π' = | 0.5749 |,    T' = | 0.0433  0.0433  0.0433 |.
         | 0.1159 |         | 0.0419  0.0419  0.0419 |
         | 0.3092 |         | 0.9148  0.9148  0.9148 |

We are left to verify whether O', T', and π' produce an HMM equivalent to the one produced by O, T, and π. It is shown to be false, since OTO_1TO_1π is significantly different from O'T'O'_1T'O'_1π'. Thus, this set of parameters shows that newly generated factorizations do not necessarily lead to equivalent HMMs. □

Proposition 2 suggests that by generating new factorizations, we may be able to improve the accuracy of the recovered HMM. Note that the newly generated factorizations do not improve the factorization metric at all - recall that CO^T = C'O'^T. This proposition merely says that these new factorizations may represent distinct HMMs. In addition, Proposition 2, via Equation 4.5, gives us an algorithm, which, under certain conditions, can test whether a generated factorization leads to a non-equivalent HMM.

4.2.3.7 Difficulty of Factorization

While finding a good factorization C, O^T does not necessarily imply that we have found the best value of O^T to represent the HMM, we would still expect the problem of finding C, O^T to be quite hard. Indeed, [21] indicates that finding the globally optimal solution minimizing ||AB - F||_F for nonnegative matrices A, B, F is NP-hard. As we mentioned earlier, for this reason, many researchers refer to the NNMF problem as the nonnegative matrix approximation problem, as the exact solution is often hard or impossible to find because of algorithmic inadequacies or noisy data. We suspect our problem to be no easier, since we require A and B to be row-stochastic on top of being nonnegative.
Nonetheless, there are methods that can provide reasonably good local solutions to the nonnegative matrix factorization problem, which we describe in Section 4.2.5.

4.2.4 Sparse Observation Matrices

As we mentioned earlier, an observation matrix A is likely to have many, possibly infinitely many, different factorizations into row-stochastic matrices C and O^T. However, in the case that A is sparse, intuition tells us that we should expect far fewer factorization possibilities.

Lemma 4 (The Zero Lemma). Let A be a row-stochastic matrix with A = XY, such that X and Y are row-stochastic matrices. We define a subset S of A's entries to be null if it satisfies the following properties.

* a = 0 for all a ∈ S
* No two entries in S share the same row or column in A

Then the union of the entries of X and Y must have at least |S|·k zeroes, where k is the number of rows in Y.

Proof. Let a ∈ S. Then a = rc, where r is a row of X and c is a column of Y. Since X and Y are nonnegative matrices, we must have r_1c_1 = r_2c_2 = ... = r_kc_k = 0. It follows that for each i, at least one of r_i and c_i must equal zero. Thus, between r and c, there must be at least k zeroes. Since entries in S do not share rows or columns, each entry adds at least k zeroes to the total number of joint zeroes among X and Y. □

The Zero Lemma describes the extent to which sparsity aids factorization. It states that an observation matrix with a large null set forces the factors to have many zeroes, vastly reducing the dimensionality of the factors. Thus, we expect sparse observation matrices to yield very good factorization results.

4.2.5 Algorithms for Factoring

4.2.5.1 Lee and Seung's Algorithm

The article [14] by Lee and Seung started a flurry of research into nonnegative matrix factorization. Their multiplicative update algorithm for NNMF, which minimizes the Frobenius norm, is outlined in Figure 4-6.
W = rand(n, k);
H = rand(k, m);
for i = 1 : maxiter
    H = H .* (W'*A) ./ (W'*W*H + 1e-9);
    W = W .* (A*H') ./ (W*H*H' + 1e-9);
end

Figure 4-6: The Lee-Seung algorithm, written in the syntax of MATLAB. The 1e-9 in each update is added to avoid division by zero.

[4] notes that contrary to [14], the Lee-Seung (LS) algorithm does not necessarily converge to a local optimum, and may converge to a saddle point. [4] also notes that the LS algorithm is in the spirit of a more general class of algorithms called gradient descent algorithms.

Benefits. There are a few benefits of using LS over ALS. In fact, we need LS because it can be used to fix one of W and H and solve for the other. Also, LS works even in the case k > m, whereas ALS does not.

Drawbacks. The main drawback is that LS is slow to converge, if it converges at all. Also, once an element in W or H becomes 0, it remains 0. ALS is much faster and does not suffer from the zero element problem.

4.2.5.2 The ALS Algorithm

[4] notes that the other large class of NNMF algorithms is the alternating least squares (ALS) class. In these algorithms, a least squares step is followed by another least squares step in an alternating fashion. The algorithm is given in Figure 4-7.

W = rand(n, k);
for i = 1 : maxiter
    Solve for H in W'*W*H = W'*A.
    Set all negative elements in H to 0.
    Solve for W in H*H'*W' = H*A'.
    Set all negative elements in W to 0.
end

Figure 4-7: The ALS algorithm.

[4] notes that the ALS algorithm can be very fast, depending on the implementation. MATLAB has an NNMF function that primarily utilizes the ALS algorithm.

Benefits. The main benefit of using ALS over LS is that ALS has fast convergence. Also, as mentioned earlier, zeroes that appear in W or H are not final in ALS.

Drawbacks. The biggest drawback to using ALS is that it cannot be used when k > m. Also, ALS cannot be used to fix one of W and H and solve for the other.
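For reference, both classes of algorithm are easy to sketch outside MATLAB. The following Python/NumPy versions (a sketch, not the thesis's implementation) mirror Figures 4-6 and 4-7, reading the ALS normal equations as least-squares solves:

```python
import numpy as np

def ls_nnmf(A, k, maxiter=200, eps=1e-9, seed=0):
    # Lee-Seung multiplicative updates (Figure 4-6), minimizing ||A - WH||_F
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    errs = []
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # eps avoids division by zero
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errs.append(np.linalg.norm(A - W @ H))
    return W, H, errs

def als_nnmf(A, k, maxiter=200, seed=0):
    # Alternating least squares (Figure 4-7): solve, then clip negatives to 0
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k))
    for _ in range(maxiter):
        H = np.maximum(np.linalg.lstsq(W, A, rcond=None)[0], 0)
        W = np.maximum(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0)
    return W, H

rng = np.random.default_rng(1)
A = rng.random((6, 3)) @ rng.random((3, 5))   # a nonnegative matrix of rank 3
W, H, errs = ls_nnmf(A, 3)
assert errs[-1] <= errs[0] + 1e-8             # LS objective is non-increasing
W2, H2 = als_nnmf(A, 3)
assert np.all(W2 >= 0) and np.all(H2 >= 0)    # clipping keeps ALS nonnegative
```

The assertions only check the properties discussed in the text (monotone LS error, nonnegative ALS factors); neither check guarantees a good factorization.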
4.2.5.3 Our NNMF Algorithm

It is clear that in the case k > m, or when the number of states is greater than the number of outputs, we must use LS. Also, in the case that we wish to fix one of W and H and solve for the other, we must use LS. In the case that k < m, we can run both LS and ALS. It is likely that ALS will give better results.

Nonetheless, there is an important issue that neither algorithm addresses. The issue in our case is that the matrix factors must be row-stochastic. The LS and ALS algorithms do not guarantee that the factors are row-stochastic. We propose the following modified LS and ALS algorithms. Figure 4-8 describes the modified LS algorithm and Figure 4-9 describes the modified ALS algorithm.

W = normalize(rand(n, k), 2);  % initialize as a random row-stochastic matrix
H = normalize(rand(k, m), 2);  % initialize as a random row-stochastic matrix
for i = 1 : maxiter
    H = H .* (W'*A) ./ (W'*W*H + 1e-9);
    W = W .* (A*H') ./ (W*H*H' + 1e-9);
    H = normalize(H, 2);
    W = normalize(W, 2);
end

Figure 4-8: The modified Lee-Seung algorithm, written in the syntax of MATLAB. The 1e-9 in each update is added to avoid division by zero.

W = normalize(rand(n, k), 2);  % initialize as a random row-stochastic matrix
for i = 1 : maxiter
    Solve for H in W'*W*H = W'*A.
    Set all negative elements in H to 0.
    Solve for W in H*H'*W' = H*A'.
    Set all negative elements in W to 0.
    H = normalize(H, 2);
    W = normalize(W, 2);
end

Figure 4-9: The modified ALS algorithm, written in the syntax of MATLAB.

The changes we added are simple. At the end of each iteration, we normalize the rows of W and H so that W and H become row-stochastic. Unfortunately, it is not clear whether these altered algorithms are likely to yield good solutions. These altered algorithms may or may not converge to local optima. In fact, empirically, the modified ALS algorithm either did very poorly or very well, while the modified LS algorithm did moderately well across the board.
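The modified LS algorithm of Figure 4-8 can likewise be sketched in Python/NumPy (an illustrative translation with hypothetical matrix sizes; as noted above, the added normalization means convergence is not guaranteed):

```python
import numpy as np

def normalize_rows(M, eps=1e-12):
    # Scale each row to sum to 1 (eps guards against all-zero rows)
    return M / (M.sum(axis=1, keepdims=True) + eps)

def modified_ls(A, k, maxiter=500, eps=1e-9, seed=0):
    # Figure 4-8: Lee-Seung updates followed by row normalization,
    # so that W and H stay row-stochastic throughout
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = normalize_rows(rng.random((n, k)))
    H = normalize_rows(rng.random((k, m)))
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        H = normalize_rows(H)
        W = normalize_rows(W)
    return W, H

# A target built from known row-stochastic factors (hypothetical sizes)
rng = np.random.default_rng(2)
C = normalize_rows(rng.random((8, 3)))
OT = normalize_rows(rng.random((3, 4)))
A = C @ OT
W, H = modified_ls(A, 3)
assert np.allclose(W.sum(axis=1), 1.0)   # factors are row-stochastic
assert np.allclose(H.sum(axis=1), 1.0)
print(np.linalg.norm(A - W @ H))
```

The printed residual is left unasserted on purpose: the text above makes no claim about the quality of the local solution, only that the factors remain row-stochastic.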
As a result, although we used both modified algorithms, we mainly sided with the modified LS algorithm to factor matrices.

4.2.6 The HMM Learning Algorithm

In the previous section, we have described how we can take an observation matrix and extract O, T, and π from it. We now input these values as seeds into Baum-Welch to get an even better model. Figure 4-10 is a diagram of the entire NNMF HMM learning algorithm. It is a graphical summary of the steps we discussed in previous sections. We will refer to this algorithm, including the final Baum-Welch step, as the NNMF algorithm.

1. Gather samples (e.g. 1231121231..., 1213212213..., 2132131211..., 1112232113..., 3122332132...).
2. Compile the measured observation matrix A.
3. Factor the measured observation matrix: A ≈ CO^T (LS/ALS).
4. Extract O and π.
5. Factor the C submatrix (LS).
6. Extract T.
7. Seed O, T, π into Baum-Welch.
8. Recover the optimized parameters O', T', π'.

Figure 4-10: The NNMF HMM Learning Algorithm.

Chapter 5
Methodology

In this chapter, we discuss issues regarding how we will implement, train, and evaluate HMM learning algorithms.

5.1 Implementation Issues

5.1.1 Baum-Welch Termination Protocol

Because of the nature of Baum-Welch, the algorithm continues to run until the user terminates it. The algorithm continues to make progress in maximizing the likelihood of the training data, but as it gets closer to a local optimum, it may slow down. We terminate the algorithm when the likelihood increments are smaller than some threshold, which we arbitrarily set at 0.1%.

5.1.2 NNMF Observation Matrix

As we mentioned earlier, the measured observation table is likely to have a lot of noise. In particular, if there are not enough data points for a given prefix, the conditional output distribution will be very coarse. To prevent this phenomenon, we omit rows in the measured observation matrix that do not have enough data points.
We arbitrarily set the threshold at 250 data points for a given prefix.

5.2 Training and Testing

In training the algorithms, we considered the following parameters.

(i) Number of output sequences. We used 5000 and 10000.
(ii) Length of each output sequence. We chose the values 4, 6, and 8.
(iii) Number of states. The HMM's states are fixed, but the algorithms being trained can be initialized with different numbers of states. For our evaluation, we give the learning algorithms the correct number of states.
(iv) Number of repetitions. Algorithms such as Baum-Welch are heavily affected by random seeds. We chose to train each such algorithm five times per training instance in order to take variance into account.

For testing, we generated a set of sequences from the HMM and considered the following parameters. For different runs of the same HMM, we fixed the test set.

(i) Number of output sequences. We always chose 10000 output sequences.
(ii) Length of each output sequence. The length of an output sequence depended on the number of states in the HMM. For HMMs with six states, we chose a length of 12. For HMMs with fewer states, we chose a length of 10.

5.3 Measures of Accuracy

We measure an algorithm's effectiveness by measuring how accurately it calculates the occurrence probability of output sequences. If S is the set of output sequences, we generate A(S), the list of the algorithm's calculations of probabilities, and find its distance, in some sense, to H(S), the list of the original HMM's calculations of probabilities. We emphasize that it is not so important that an HMM learning algorithm recovers the parameters exactly as in the original HMM.

5.3.1 Euclidean Distance

In this measure, the accuracy of the algorithm is

    Σ_i |A(S)_i - H(S)_i|²,

where A(S)_i and H(S)_i are corresponding probabilities.
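As printed, the measure is the sum of squared differences between the two probability lists; a minimal sketch with hypothetical probability values:

```python
import numpy as np

def euclidean_accuracy(A_probs, H_probs):
    # Sum of squared differences between the algorithm's probabilities A(S)
    # and the original HMM's probabilities H(S), as in Section 5.3.1
    A_probs = np.asarray(A_probs)
    H_probs = np.asarray(H_probs)
    return float(np.sum((A_probs - H_probs) ** 2))

# Hypothetical probability lists over three output sequences
H_probs = [0.5, 0.25, 0.25]   # original HMM
A_probs = [0.4, 0.35, 0.25]   # learned model
d = euclidean_accuracy(A_probs, H_probs)
assert np.isclose(d, 0.02)    # (0.1)^2 + (0.1)^2 + 0^2
```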
5.3.2 Kullback-Leibler Divergence

In this measure, the accuracy of the algorithm is

    Σ_i ( lg H(S)_i - lg A(S)_i ),

where A(S)_i and H(S)_i are corresponding probabilities and S is assumed to be generated not arbitrarily, but from the original HMM.

5.4 The HMMs

In this section, we list the HMMs that we will use to train and test the algorithms. Many of the HMMs have too many states to represent graphically. Hence, we will simply be writing down matrices to describe the HMMs. After each HMM name are two integers that describe the number of states and outputs in the HMM, in that order. These HMMs are reproduced in Appendix B. We used our own implementation of HMMs in MATLAB.

5.4.1 Simple HMM 3 3

This HMM is not meant to be hard to learn. There is a state transition involving no randomness and only one of the states emits all possible outputs.

    O = 0.5 0.25 0 0.5 0.25 0 , T = 0.5 0 0.5 0.5 , π = 0.25 0.25 0.25 1 0 0 0.25 0.75 0.75 0

5.4.2 Simple HMM 3 4

In this HMM, there are more outputs than states. The HMM was designed so that there is no randomness in state transitions. Note the similarity to Simple HMM 3 3.

    O = 0 0.25 1 0.2 0.55 0 0.05 0.2 0 0.75 0 0.25 0.75 0 0 0.5 0.5 0.25 , T = 0.75 0 0.5 0.25 1 0.2 1 0.25 0

5.4.3 Simple HMM 4 3

In this HMM, there are more states than outputs. There is a deterministic transition from one of the states and only one of the states emits all possible outputs. Note the similarity to Simple HMM 3 3.

    O = 0.5 0.5 0.25 0.3 0.5 0.25 0 0.4 0 0.25 0.75 0.3 , T = 0 0.2 0.5 0.5 0 0.5 0.25 1 0 0.25 0 0 0 0.7 0 0.3 , π = 0.15

5.4.4 Separated HMM 3 4

In this HMM, there are more outputs than states. We call this HMM "separated" because no state can transition to itself.

    O = 0.25 0 0 0.7 0.05 0.5 0.75 0.25 0.25 0.25

5.4.5 Separated HMM 3 4 #2

    0 , π = 0 0.1 0.65 0 0.35 0.9 0

5.4.6 Dense HMM 3 3

In this HMM, the defining factor is that there are not many zeroes in O or T.
    O = 0.5 0.25 0.25 0.6 0.25 0.15 0.25 0.3 0.5 0.4 0.25 0.3 , π = 0.5 0.25 0.25

5.4.7 Dense HMM 4 4

    O = 0.5 0.2 0.25 0.05 0.15 0.6 0.1 0.15 , T = 0.25 0.1 0.5 0.2 0.2 0.3 0.05 0.4 , π = 0.5 0.2 0.2 0.1

5.4.8 Dense HMM 5 5

    O = 0.5 0.2 0.2 0.4 0 0.25 0.3 0.05 0.1 0 0.1 0.1 0.4 0.2 0.4 0.05 0.3 0.2 0.2 0.2 0.1 0.1 0.15 0.1 0.4 , T = 0.25 0.1 0.2 0.5 0.1 0.1 0.1 0.2 0.1 0.4 0 0.5 0.1 0.2 0.5 0.1 0.2 0.05 0.3 0.3 0.2 0.1 , π = 0.1 0.2 0.2 0.1

5.4.9 Separated HMM 5 5

In this HMM, there are no states transitioning to themselves. There are five states and five outputs.

    0 0.2 0.2 0.25 0.1 0.2 0 0.5 0.2 0.1 0.2 0.2 0.3 0.3 0.1 0.3 0.3 0.55 0 0.3 0.5 0.3 0.15 0.1 0.2 0 0.5 0.4 0.1 0.3 0 0.3 0.4 0.1 0.2 0.6 0 0 0.1 0.1 0.4 0 0.1 0.4 0.3 0.1 0.05 0.1 0.1 0.25 0.1 0.1 0.1 0 0.15 , π = 0.1

5.4.10 Dense HMM 6 6

In this HMM, there are no zeroes in O or T.

    O = 0.2415 0.2713 0.1071 0.1737 0.2427 0.1662 0.2506 0.1447 0.0155 0.2044 0.0285 0.1662 0.2242 0.1634 0.4442 0.1843 0.1074 0.1506 0.0147 0.0120 0.2776 0.2000 0.3164 0.1083 0.1097 0.1999 0.0325 0.1555 0.2822 0.2331 0.1593 0.2087 0.1231 0.0821 0.0228 0.1756

    T = 0.2533 0.1997 0.2377 0.0668 0.0271 0.0965 0.2 0.2086 0.0782 0.0522 0.0356 0.1973 0.0657 0.25 0.1511 0.2776 0.0468 0.1560 0.2680 0.2587 0.15 0.1984 0.0767 0.2484 0.1271 0.1145 0.0867 0.1 0.0226 0.2425 0.1518 0.4391 0.2511 0.2333 0.2 0.1660 0.1253 0.2631 0.1754 0.1420 0.2591 0.1

5.4.11 Sparse HMM 6 6

This HMM has many zeroes in the transition and output matrices. There are only two possible initial states.

    0.5 0 0 0 0.5 0.75 0 0.3 0 0.9 0 0 0.5 0 0 0 0 0.25 0 0 0 0 0 0 0.5 0 0.7 0 0 0 0 0.7 0 0 0.3 0.1

5.4.12 Diverse Sparse HMM 6 6

In this HMM, everything is the same as Sparse HMM 6 6, with the exception that every state can be the initial state.
    O = 0 0.5 0 0 0.5 0 0 0.3 0.1 0 0.3 0 0 0 0.3 0.5 0 0 0 0.9 0 0.2 0 1 0 0.5 0 0 0.8 0 0 0 0 0.1 0 0.25 0 0 0 0 0 0.2 0.5 0 0 0.2 0.1 0 0 0 0 0.5 0 0 0 0 0.7 0 0.8 0.2 0 0 0 0.5 0 0.7 0 0 0.5 0 0 0 0.1 0.5 0 0.5 0.75 0

Chapter 6
Results and Analysis

In this chapter, we discuss the results of implementing and evaluating the extended Angluin and NNMF HMM learning algorithms. We begin by reporting that the extended Angluin method failed to produce interesting results. We then compare the NNMF algorithm to the Baum-Welch algorithm by training and testing both of them across the variety of HMMs. Finally, we analyze some interesting trends we uncovered from our data.

6.1 On the Failure of the Extended Angluin Algorithm

In this section, we find that the extended Angluin algorithm is not feasible and attempt to explain why. We implemented Angluin's DFA learning algorithm in MATLAB according to [1]. We implemented the NFA learning algorithm in MATLAB according to [5]. In Section 4.1.2.3, we mentioned that pNFAs and HMMs do not have accepting states. We proposed the following two options to remedy this problem.

1. (Stopping State Approach) Assume that we can ask the HMM to stop automatically once it reaches a certain state.
2. (Accept All Approach) Assume that every state in the HMM is an accept state.

We proceed to experiment with both of them.

6.1.1 Stopping State Approach: Version 1

We start with an effort to learn Simple HMM 3 3. The state transitions are diagrammed in Figure 6-1.

Figure 6-1: State transitions of Simple HMM 3 3.

We sampled 10000 sequences from the HMM, with the requirement that the HMM stop a sequence if and only if the third state is reached. We set this set of sequences as the regular set for the NFA learner to learn. The NFA in Figure 6-2 was returned.

Figure 6-2: NFA learned from 10000 output sequences from Simple HMM 3 3. The NFA has two states. One state is the initial state and the other is an accepting state.
It can be seen that every sequence of outputs is accepted by this NFA. Moreover, because any action from any state results in every transition being taken, this NFA reveals nothing about the transitions between states in Simple HMM 3 3. We repeated the NFA learning with 5000, 1000, 500, 100, and 50 sequences from the HMM. All of them yielded the same NFA, except for the NFA learned from 50 sequences. With very few sequences, we obtained the NFA in Figure 6-3.

Figure 6-3: NFA learned from 50 output sequences from Simple HMM 3 3. This NFA is less open-ended than before.

From these observations, we can conclude that supplying too many varieties of output sequences confuses the NFA learner into being too accepting. However, there are still some anomalies. First, note that state 3 is a trapping state. Once the NFA enters, it cannot escape. We believe the reason is that, because the HMM always stops at the third state, no transitions out of the third state are observed. The results are similar for Separated HMM 3 4. Setting the third state to be the accepting state, we get the NFA in Figure 6-4 after training with 10000 output sequences.

Figure 6-4: NFA learned from 10000 output sequences from Separated HMM 3 4.

The resulting NFA is the same for 5000, 1000, and 500 output sequences. For 100 output sequences, it was difficult to get the NFA learner to terminate. We had to resample a number of times. When the NFA learner finally terminated, we got an NFA with 46 states - 13 initial states and 20 accept states. We speculated that there would be a way to reduce the NFA to a simpler form, as many states were probably redundant, but we did not investigate the issue any further. We believe the reason for the disappointing results may be the following. When we sample too many output sequences, we get too much variety, making the resulting NFA too accepting.
Fundamentally, the NFA learner has no motivation to distinguish between states that have the same possible outputs but different distributions. On the other hand, when we sample too few sequences, the topology of the sequences is rough. Because samples are missing, the sample set may not have the necessary elements to be regular. We were not able to find a happy medium between these two extremes, having tried varying the number of output sequences between 100 and 500. It seems that the NFA learner is very sensitive to the regularity of the set it is learning, if regularity can be measured in some sense.

Another reason may be that the different starting states confuse the NFA learner. That is, it is ambiguous whether the empty sequence should be accepted or not. Because it is possible to start in the accepting state, the empty sequence should be accepted. On the other hand, because it is possible to not start in an accepting state, it should not be accepted. To remedy this problem and the trapping state problem, we adopted a second version of the stopping state approach, described in Section 6.1.2.

6.1.2 Stopping State Approach: Version 2

We utilize the fact that every HMM is equivalent to another HMM with exactly one starting state. Basically, a new state is created, and the HMM always starts in that state. It emits a dummy output and transitions to states depending on the original initial state distribution. A converted version of Simple HMM 3 3 is shown in Figure 6-5. In addition, to remedy the trapping state problem, we allowed the HMM to sometimes keep going after reaching an accepting state. That is, we specified a parameter p so that whenever the HMM reaches an accept state, it continues to transition and produce outputs with probability p. If 1123 was an output sequence from the original HMM, then 41123 would be the output sequence from the converted HMM, with 4 being the dummy output.
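The conversion just described can be sketched directly; the following Python/NumPy function (an illustrative sketch with an assumed array layout, omitting the continue-with-probability-p sampling tweak) prepends the start state and the dummy output:

```python
import numpy as np

def single_start_hmm(T, O, pi):
    # Prepend a start state that always begins the run, emits a new dummy
    # output (index 0), and transitions according to pi.
    # T is n x n, O is n x m (both row-stochastic), pi has length n.
    n, m = O.shape
    T2 = np.zeros((n + 1, n + 1))
    T2[0, 1:] = pi               # leave the start state according to pi
    T2[1:, 1:] = T               # original transitions are unchanged
    O2 = np.zeros((n + 1, m + 1))
    O2[0, 0] = 1.0               # start state emits only the dummy output
    O2[1:, 1:] = O               # original outputs shifted up by one index
    pi2 = np.zeros(n + 1)
    pi2[0] = 1.0                 # the start state is the unique initial state
    return T2, O2, pi2

# A hypothetical two-state HMM, for illustration only
T = np.array([[0.5, 0.5], [1.0, 0.0]])
O = np.array([[1.0, 0.0], [0.25, 0.75]])
pi = np.array([0.4, 0.6])
T2, O2, pi2 = single_start_hmm(T, O, pi)
assert pi2[0] == 1.0
assert np.allclose(T2.sum(axis=1), 1.0)   # still a valid transition matrix
assert np.allclose(O2.sum(axis=1), 1.0)   # still a valid output matrix
```

Since no state transitions back to the new start state, the dummy output can only ever appear as the first symbol of a sequence, which is exactly what makes the empty sequence unambiguous for the NFA learner.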
Once converted into single initial state form, the only difference in output sequences is the dummy output at the very beginning of the sequence. In this way, the empty sequence is unambiguously not accepted by the NFA.

Figure 6-5: Simple HMM 3 3 with only one possible initial state. N represents the initial state.

After training on the output sequences from the converted version of Simple HMM 3 3 with p = 0.3, we get the NFA in Figure 6-6.

Figure 6-6: NFA learned from 10000 output sequences from Simple HMM 3 3 with only one possible initial state. Parameter p was set to 0.3. '0' is the dummy output.

The transitions are not as trivial as in Figure 6-2. In fact, this NFA resembles the NFA in Figure 6-3. However, there are still a number of peculiarities. As in Figure 6-3, which depicts an NFA trained on 50 sample sequences, the NFA in Figure 6-6 has a trapping state. Also, upon careful consideration, we find that the NFA in Figure 6-6 basically accepts all output sequences. The reason is that the dummy output can never appear after the first output, so the NFA will start from state 1, transition to state 2, and stay there forever. We have seen this liberal acceptance phenomenon before, and it is due to having too many varieties of output sequences. So we trained on fewer output sequences, repeating the NFA learning with 1000, 500, 100, 50, and 10 output sequences. The results were the same until we tried 10 output sequences. There, we had to resample a number of times for the NFA learner to terminate. Once it terminated, we got an NFA with 43 states - 2 initial states and 3 accept states. We considered this outcome uninteresting.

6.1.3 Accept All Approach

After trying to learn some of the HMMs listed in Section 5.4, we quickly realized that this approach was not fruitful.
The reason is that it is not uncommon for an HMM to emit every possible sequence of outputs, as the more interesting aspect of the sequences is how frequently they occur. The NFA learner always returned an NFA that accepted every possible sequence of outputs. As a result, the resulting NFA structure was uninteresting.

6.1.4 Summary

In the end, we were unable to solve the issues that came up in trying to apply Angluin's insight to learn HMMs. We chose to conclude our endeavor into the Angluin approach prematurely. This algorithm will not be present in the comparison section, Section 6.2. The major issues are as follows.

(i) Faux-Regular Sets. The language that an HMM generates is regular. However, the set of output sequences generated by an HMM is not guaranteed to be regular if there are not enough outputs. For example, false negatives in the membership query may result in the NFA learner not terminating. If every sequence that the learner queries for is in the sample set, then the algorithm returns the correct NFA. However, if the learner queries for a sequence that should be in the sample set but is not, the learner may either not terminate or return an overly complicated NFA.

(ii) Liberal Acceptance. HMMs generate output sequences that are too diverse. Added to the fact that the set of output sequences is unlikely to be regular, the NFA ends up making sense of everything by accepting every single possible sequence of outputs. This problem is exacerbated by the fact that the structure of the minimal DFA is generally smaller than the structure of the relevant HMM. Not only does the resulting NFA accept too many sequences, it also has too few states.

6.2 Comparison of Algorithms

In this section, we evaluate the NNMF HMM learning algorithm.

6.2.1 Evaluation Tables

For each set of training parameters, we trained the algorithms five times to account for variance.
Thus, for each training instance, we list the min, mean, and standard deviation of the performance results. There are three columns representing the performance, in KL divergence from the original HMM, of Baum-Welch, NNMF, and NNMF + B-W, in that order. NNMF + B-W is the full NNMF HMM learning algorithm that is depicted in Figure 4-10. NNMF is the NNMF algorithm with steps 7 and 8 removed from Figure 4-10. In each row, the best value is bolded. The "Ratio" column represents NNMF + B-W's score divided by B-W's score in that row. We also present data about how the observation matrix factored. The values D and tD represent the Euclidean distances incurred factoring the observation matrix and coefficient submatrix respectively. N 4 L 5000 4 10000 6 5000 6 10000 8 5000 8 10000 Min Mean B-W 997.9004 1711.5612 NNMF 3073.2533 3794.798 NNMF + B-W 1978.2321 2175.145 STD 586.737 502.6523 204.0401 Min Mean STD Min Mean STD Min Mean STD Min Mean 1267.9861 2100.4884 510.4714 826.4295 1585.9159 504.7816 1594.6401 2578.7567 1452.4929 1708.4988 4631.9174 3136.5893 3938.9514 879.3699 1420.958 2208.8319 511.2215 2016.3892 2252.6751 301.7404 2744.2626 3733.8713 1955.4402 2145.6534 238.2645 623.9327 1165.6193 553.8159 1069.3176 1276.236 221.3511 1710.2501 2074.3237 STD 1637.6752 1172.2853 443.7356 Min Mean STD 3468.8953 4852.2966 811.0101 2296.8081 3019.2137 695.2137 1455.7559 1684.2082 229.0714 Ratio 1.9824 1.2709 1.5422 1.0215 0.75497 0.73498 0.67057 0.4949 1.001 0.44783 0.41966 0.3471 D 1.7844e-009 1.7628e-005 tD 0.74466 0.76402 2.5448e-005 0.024138 1.0659e-006 1.3482e-005 1.8301e-005 2.1212e-009 2.8605e-009 1.1112e-009 7.100le-009 2.8051e-006 5.9191e-006 2.1543e-009 2.8588e-009 0.77694 0.88881 0.082829 0.70404 0.72441 0.018199 0.70054 0.71762 0.013932 0.70267 0.7223 8.3373e-010 0.017803 6.0036e-009 6.7917e-005 9.2086e-005 0.70998 0.72934 0.020071 Table 6.1: Simple HMM 3 3 results. 
N L 4 5000 4 10000 6 5000 6 10000 8 5000 8 10000 Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean STD B-W NNMF NNMF + B-W Ratio D tD 442.938 7887.1676 6788.0769 1012.371 2224.1061 1045.096 655.6046 11421.9997 6373.5161 541.8756 7028.8504 5674.2161 144.0686 3307.4869 6712.3864 417.0339 9521.8506 8316.5538 4753.5918 10246.8763 4827.6747 2851.4064 3746.6054 694.6626 6106.9601 16116.4206 7509.9886 3694.1697 6215.7834 2559.1328 3996.8902 15454.9652 12911.146 1514.4711 5267.715 2483.7431 708.4322 3501.47 3295.6424 437.3121 1793.1714 986.1501 3573.9303 6693.3662 2675.7591 482.6687 3537.7063 2530.7667 1183.3128 5313.8906 3137.5246 160.2441 2780.6991 1863.0491 1.5994 0.44395 0.027698 0.037517 0.0090011 0.027648 0.03735 0.0078252 0.05575 0.06715 0.015662 0.058799 0.081687 0.01486 0.038503 0.049135 0.014305 0.061117 0.079614 0.013846 1.0225 1.0964 0.061877 1.0606 1.1083 0.037473 0.97981 1.0256 0.047698 0.99653 1.0625 0.057243 0.99639 1.0565 0.06685 1.0654 1.112 0.040776 0.43197 0.80624 5.4514 0.58601 0.89074 0.50331 8.2135 1.6066 0.38425 0.29203 Table 6.2: Simple HMM 3 4 results. 
N 4 L 5000 4 10000 6 5000 6 10000 8 5000 8 10000 Min Mean B-W 246.5829 586.1898 NNMF 842.7965 1698.7183 NNMF + B-W 443.8406 826.1375 STD 225.8949 919.3675 419.7274 Min Mean 211.106 646.1992 339.9976 1361.0068 186.1485 484.2997 STD 272.5914 574.4871 238.0156 Min Mean STD Min Mean STD Min Mean 352.618 457.0649 84.6659 343.1575 494.9053 110.9837 466.1999 540.8028 655.8777 1141.2995 561.7039 665.5298 1152.7888 637.1897 448.8728 1548.3818 284.9661 373.3327 88.8709 309.2473 442.8703 150.966 233.7224 278.5806 STD 63.7271 877.4006 55.7901 Min Mean STD 358.4193 506.8186 97.3148 584.7148 1008.9154 424.3164 264.213 325.1176 64.9346 Ratio 1.8 1.4093 0.88178 0.74946 0.80814 0.8168 0.90118 0.89486 0.50134 0.51512 0.73716 0.64149 D 5.7293e-010 6.2605e-010 4.7446e-011 5.3266e-010 5.6806e-010 tD 0.69686 0.7363 0.023273 0.69125 0.82138 3.3612e-011 0.085894 5.1956e-010 6.7974e-010 9.2742e-011 6.9685e-010 7.9776e-010 7.6083e-011 5.7822e-010 6.6059e-010 0.72194 0.78777 0.067581 0.67417 0.72921 0.051518 0.68173 0.7519 5 .2148e-011 0.050575 7.355e-010 7.9107e-010 3.3875e-011 0.73071 0.76864 0.048385 Table 6.3: Simple HMM 4 3 results. N 4 L 5000 4 10000 6 5000 6 10000 8 5000 8 10000 Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean B-W 774.1461 831.63 55.1058 595.3184 807.6597 153.2688 603.711 702.9934 74.9966 699.3138 832.8425 162.4642 777.5568 846.3584 76.2612 777.8909 803.1075 NNMF 282.6304 944.5411 910.9561 447.9371 817.8496 322.9873 290.8432 568.0166 385.066 327.6702 1092.0584 692.6045 233.5061 999.8632 720.4316 195.4436 522.9961 NNMF + B-W 185.3625 540.9019 332.8036 264.6648 558.8738 185.5608 222.583 413.0648 205.271 250.3639 625.7018 328.0724 186.5281 511.1768 236.4639 147.1445 349.5934 STD 26.126 296.3821 216.4107 Ratio 0.23944 0.65041 0.44458 0.69197 0.36869 0.58758 0.35801 0.75128 0.23989 0.60397 0.18916 0.4353 Table 6.4: Separated HMM 3 4 results. 
D 0.04569 0.071385 0.016214 0.058159 0.068454 0.01265 0.054501 0.082582 0.023082 0.082378 0.11404 0.020373 0.059738 0.074771 0.013265 0.089566 0.10795 tD 0.94962 1.1127 0.12489 0.95463 1.0959 0.087527 0.94433 1.0727 0.084295 0.95939 1.025 0.073348 0.9511 1.0162 0.076826 0.97484 1.0367 0.016187 0.063611 N 4 L 5000 4 10000 6 5000 6 10000 8 5000 8 10000 Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean STD Min Mean STD B-W 692.1613 2473.7839 1006.716 2735.9597 2890.7895 113.5361 732.463 1872.572 1016.8428 2241.1899 2700.3719 322.4136 535.6466 2428.3011 1081.8397 529.2151 1075.3632 1085.128 NNMF 1256.8264 1806.0243 368.8792 517.3888 1138.4731 582.0805 517.5172 1617.3526 1078.2141 839.7055 1497.4537 864.2749 948.9102 1334.5147 548.5989 498.5396 1196.3787 471.1914 NNMF + B-W 755.3028 1146.3861 335.291 321.3234 903.6067 602.5364 342.6098 915.422 598.9476 500.7266 677.3107 159.0724 625.3597 831.5564 159.8957 269.039 622.7002 255.5409 Ratio 1.0912 0.46341 0.11744 0.31258 0.46775 0.48886 0.22342 0.25082 1.1675 0.34244 0.50837 0.57906 D 0.032212 0.040905 0.010054 0.026285 0.032148 0.0067635 0.037237 0.05287 0.01208 0.06965 0.07775 0.0063779 0.032745 0.046569 0.011163 0.059658 0.074978 0.010959 tD 1.0695 1.1485 0.065821 1.0245 1.1576 0.10029 0.96306 1.107 0.13131 0.98875 1.0943 0.076641 1.1095 1.1529 0.0479 1.0153 1.1223 0.090085 Table 6.5: Separated HMM 3 4 #2 results. 
N  L      Stat  B-W       NNMF       NNMF + B-W  Ratio    D           tD
4  5000   Min   80.4189   94.5875    41.4701     0.51568  6.4591e-10  0.71901
          Mean  150.9328  552.8904   242.1683    1.6045   6.8283e-10  0.74776
          STD   61.8605   631.8636   221.5841             3.6702e-11  0.029972
4  10000  Min   58.9067   236.9763   89.1558     1.5135   6.3935e-10  0.69212
          Mean  159.6925  442.1492   142.7398    0.89384  6.6786e-10  0.73319
          STD   77.5463   150.1581   55.465               2.9599e-11  0.024494
6  5000   Min   48.0383   152.2607   89.2594     1.8581   6.748e-10   0.69282
          Mean  170.5184  1490.2956  257.8672    1.5123   8.0858e-10  0.71207
          STD   122.6708  1104.5111  113.7077             9.4004e-11  0.020377
6  10000  Min   132.9425  78.0167    59.1967     0.44528  1.1926e-9   0.68546
          Mean  188.458   706.9429   171.1029    0.90791  1.7731e-9   0.69798
          STD   38.1928   641.2314   113.7074             3.5914e-10  0.011143
8  5000   Min   92.6143   136.2905   41.8444     0.45181  7.1963e-10  0.67914
          Mean  233.5022  909.6215   128.769     0.55147  8.2595e-10  0.71082
          STD   127.2543  816.2267   79.7246              8.4619e-11  0.020134
8  10000  Min   105.4114  97.7251    42.6303     0.40442  9.0435e-10  0.68462
          Mean  213.1205  915.404    94.565      0.44372  1.3236e-9   0.69924
          STD   71.694    757.5031   41.1597              5.2534e-10  0.013975

Table 6.6: Dense HMM 3 3 results.

N  L      Stat  B-W       NNMF      NNMF + B-W  Ratio    D           tD
4  5000   Min   83.3714   277.3044  114.158     1.3693   1.2384e-9   0.82839
          Mean  181.9737  711.6227  258.3594    1.4198   1.3402e-9   0.86026
          STD   61.1875   522.0941  205.2391             7.0844e-11  0.03275
4  10000  Min   63.9092   189.6736  96.3222     1.5072   1.1977e-9   0.78397
          Mean  140.7707  371.2808  179.7176    1.2767   1.2774e-9   0.80446
          STD   54.9844   301.9523  88.4172              7.0287e-11  0.026186
6  5000   Min   101.6815  102.5564  76.2079     0.74948  1.156e-9    0.78691
          Mean  135.0095  894.5842  147.343     1.0914   1.3079e-9   0.80793
          STD   38.4957   942.4778  58.8077              1.0002e-10  0.018923
6  10000  Min   61.9132   57.2777   17.3461     0.28017  1.2862e-9   0.77687
          Mean  156.9878  451.8523  105.6887    0.67323  1.9384e-9   0.81945
          STD   70.7718   433.9315  57.5969              5.7841e-10  0.028493
8  5000   Min   75.3198   161.9649  104.6016    1.3888   1.2501e-9   0.78606
          Mean  141.9259  289.2854  128.1162    0.9027   1.3593e-9   0.82481
          STD   65.3842   119.1106  26.204               1.1055e-10  0.026001
8  10000  Min   68.9887   78.6174   46.7607     0.6778   1.3571e-9   0.78007
          Mean  137.7862  908.0051  156.1644    1.1334   2.1287e-9   0.79759
          STD   57.3042   776.9401  125.1327             7.9323e-10  0.012031

Table 6.7: Dense HMM 4 4 results.

N  L      Stat  B-W       NNMF       NNMF + B-W  Ratio    D           tD
4  5000   Min   768.8807  929.9388   513.3167    0.66762  1.2602e-9   0.90708
          Mean  892.2086  2811.2549  1422.7241   1.5946   1.3536e-9   0.95607
          STD   175.5267  1888.2821  729.7105             6.9306e-11  0.047694
4  10000  Min   674.2995  762.9955   465.0274    0.68965  1.7015e-9   0.85738
          Mean  893.7868  2127.3043  914.2332    1.0229   2.0942e-9   0.88954
          STD   164.0421  768.2717   285.0277             3.1202e-10  0.024311
6  5000   Min   597.7398  485.9444   230.0795    0.38492  1.3328e-9   0.87323
          Mean  758.7097  1380.4805  480.5185    0.63334  1.4374e-9   0.90387
          STD   148.7744  1454.1388  247.766              1.1779e-10  0.018549
6  10000  Min   470.6347  1060.2761  378.8173    0.80491  2.033e-9    0.85049
          Mean  694.9101  2692.5426  556.4525    0.80075  2.451e-9    0.85728
          STD   152.8592  2331.265   249.5042             3.6034e-10  0.0043956
8  5000   Min   657.4062  907.2523   381.6071    0.58047  1.2627e-9   0.87328
          Mean  773.2348  2426.4336  458.7711    0.59331  1.3305e-9   0.93007
          STD   179.5491  1185.7498  68.7337              5.7614e-11  0.033536
8  10000  Min   544.9122  244.7488   90.8653     0.16675  1.9591e-9   0.852
          Mean  674.6258  2855.0982  407.605     0.60419  2.1503e-9   0.87485
          STD   105.5129  2900.5149  199.1646             1.5444e-10  0.019105

Table 6.8: Dense HMM 5 5 results.
N  L      Stat  B-W        NNMF       NNMF + B-W  Ratio    D           tD
4  5000   Min   486.542    662.8812   338.9175    0.69658  1.3432e-9   0.92152
          Mean  1034.5862  2124.0776  789.752     0.76335  1.4281e-9   0.94495
          STD   403.1681   973.9133   393.8942             6.214e-11   0.017812
4  10000  Min   420.8795   615.4777   272.6408    0.64779  1.9405e-9   0.89313
          Mean  1029.4356  1516.8439  486.9774    0.47305  2.0783e-9   0.91715
          STD   417.7536   1307.855   263.5107             9.6254e-11  0.017116
6  5000   Min   451.4029   951.6361   386.6114    0.85647  1.182e-9    0.90857
          Mean  1022.2118  1720.0668  501.2916    0.4904   1.3141e-9   0.94016
          STD   358.1582   624.1885   156.4701             9.832e-11   0.042036
6  10000  Min   918.071    703.2973   205.122     0.22343  1.5739e-9   0.88931
          Mean  1184.2618  1316.8456  437.5169    0.36944  1.9942e-9   0.90858
          STD   198.9643   494.8713   165.7033             2.4562e-10  0.015319
8  5000   Min   957.1142   735.5595   256.5674    0.26806  1.3106e-9   0.91643
          Mean  1128.347   1162.6925  426.3246    0.37783  1.3977e-9   0.94822
          STD   116.854    289.9418   142.0576             6.653e-11   0.021727
8  10000  Min   1025.08    479.7215   178.4787    0.17411  1.6641e-9   0.89418
          Mean  1131.0833  1273.8234  397.3082    0.35126  1.948e-9    0.91254
          STD   100.2022   859.4713   198.2706             1.9176e-10  0.01955

Table 6.9: Separated HMM 5 5 results.

N  L      Stat  B-W       NNMF       NNMF + B-W  Ratio    D           tD
4  5000   Min   312.0537  445.5445   300.8148    0.96398  1.3495e-9   0.90942
          Mean  431.1328  1333.194   444.026     1.0299   1.488e-9    0.96484
          STD   69.9133   981.255    153.6569             1.3063e-10  0.042955
4  10000  Min   436.7778  343.7505   221.0042    0.50599  2.419e-9    0.89407
          Mean  500.9244  1540.9268  808.1103    1.6132   2.7577e-9   0.91546
          STD   55.7257   1464.199   812.1967             3.2359e-10  0.020014
6  5000   Min   222.4566  1109.1188  215.7601    0.9699   1.3883e-9   0.9145
          Mean  325.2532  1832.6603  410.0196    1.2606   1.4538e-9   0.93948
          STD   103.0878  726.0979   140.4542             4.7518e-11  0.02215
6  10000  Min   313.3454  321.3539   220.0137    0.70214  1.943e-9    0.86953
          Mean  381.0247  932.8144   295.6257    0.77587  2.4883e-9   0.90903
          STD   53.6513   749.699    93.5043              4.1628e-10  0.03621
8  5000   Min   338.4901  290.0426   214.5423    0.63382  1.3363e-9   0.92092
          Mean  377.7764  1359.1341  315.3194    0.83467  1.4341e-9   0.95033
          STD   48.6695   964.752    104.1627             6.9626e-11  0.020038
8  10000  Min   157.0316  294.1124   142.0052    0.90431  2.0438e-9   0.87551
          Mean  320.7461  1366.0707  250.1859    0.78001  2.4754e-9   0.91048
          STD   181.4441  1397.739   91.5897              2.8924e-10  0.025662

Table 6.10: Dense HMM 6 6 results.

N  L      Stat  B-W       NNMF        NNMF + B-W  Ratio    D           tD
4  5000   Min   3332.37   453566.01   35296.40    10.592   0.00051148  0.93474
          Mean  20399.01  912240.58   68192.60    3.3429   0.00059111  1.1154
          STD   16466.48  323910.41   40021.43             7.4622e-5   0.28157
4  10000  Min   4127.54   495135.17   29257.04    7.0883   0.0037584   0.90685
          Mean  15506.81  1038233.75  85209.60    5.495    0.013027    0.98516
          STD   15250.92  478704.46   81990.15             0.01041     0.046258
6  5000   Min   2124.79   438551.80   32129.16    15.1211  0.00020088  0.91459
          Mean  2381.96   559502.59   46716.27    19.6126  0.00034919  0.98949
          STD   275.89    138113.32   11542.60             0.00011676  0.054984
6  10000  Min   1507.81   410360.34   43283.21    28.7059  0.00038973  0.8651
          Mean  2082.44   462766.32   53974.37    25.9188  0.00078397  0.8907
          STD   327.63    30252.51    6145.84              0.0003145   0.038996
8  5000   Min   2136.06   202754.66   26450.17    12.3827  0.00015219  0.9522
          Mean  9691.49   522528.68   42213.78    4.3558   0.00068915  0.98703
          STD   10223.34  241131.87   9541.81              0.00060151  0.042136
8  10000  Min   605.78    439488.17   34255.16    56.5471  0.00030934  0.88864
          Mean  1796.89   621973.93   44376.92    24.6964  0.0018096   0.91587
          STD   691.75    206364.20   9572.20              0.0023095   0.028501

Table 6.11: Sparse HMM 6 6 results.
N  L      Stat  B-W       NNMF        NNMF + B-W  Ratio    D           tD
4  5000   Min   597.90    13545.2732  4904.1275   8.2022   0.00018761  1.3278
          Mean  7411.79   16888.8874  8702.5271   1.1741   0.0004331   1.4541
          STD   12023.72  2903.0062   4055.0472            0.00037467  0.10004
4  10000  Min   615.28    11806.9636  5256.2945   8.5429   0.00023915  1.4105
          Mean  7171.64   19473.2593  8626.3141   1.2028   0.0010111   1.4525
          STD   11298.65  6944.1741   4851.2726            0.0012683   0.046306
6  5000   Min   483.64    18360.2393  4487.7336   9.2791   0.00030951  1.3575
          Mean  3363.79   20905.4379  11374.5472  3.3815   0.00089552  1.4359
          STD   2835.32   3869.7666   4263.1374            0.00056024  0.092502
6  10000  Min   357.46    13190.4853  2661.972    7.447    0.00055679  1.3986
          Mean  2938.83   19585.8435  7905.5441   2.69     0.0022575   1.47
          STD   2568.83   6893.4983   4977.6338            0.0016912   0.06264
8  5000   Min   235.06    14353.7449  2684.0319   11.4184  0.00035097  1.3799
          Mean  2100.87   19603.7602  5888.9426   2.8031   0.0016097   1.4806
          STD   1105.23   3493.2808   3845.1267            0.0011892   0.063915
8  10000  Min   409.26    10129.4971  2558.5785   6.2518   0.00024118  1.3902
          Mean  1708.51   17785.2752  6528.3364   3.8211   0.001287    1.4173
          STD   1130.33   5433.6811   5346.3046            0.0012471   0.033055

Table 6.12: Diverse Sparse HMM 6 6 results.

6.2.2 Evaluation Table Analysis

6.2.2.1 Average KL Ratios

HMM                     Min      Mean
Simple HMM 3 3          1.0618   0.7195
Simple HMM 3 4          2.8285   0.7064
Simple HMM 4 3          0.9383   0.8378
Separated HMM 3 4       0.3066   0.6021
Separated HMM 3 4 #2    0.5959   0.4062
Dense HMM 3 3           0.8648   0.9856
Dense HMM 4 4           0.9955   1.0829
Dense HMM 5 5           0.5491   0.8748
Separated HMM 5 5       0.4777   0.4709
Dense HMM 6 6           0.7800   1.0491
Sparse HMM 6 6          21.7395  13.9036
Diverse Sparse HMM 6 6  8.5236   2.5121

Table 6.13: Average KL ratios.

Table 6.13 lists the average KL divergence ratios achieved for each HMM across all training instances. We will refer to this table in the sections below.

6.2.2.2 NNMF vs. NNMF + B-W

In every case, NNMF + B-W performs better than NNMF. This result is expected: our NNMF algorithm initially produces HMM parameters O, T, and π, and using these values as seeds in Baum-Welch can only improve the parameters.
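The aggregates quoted later in this chapter can be recomputed directly from the Min and Mean columns of Table 6.13. The following is an illustrative Python sketch (not part of the thesis code) that averages the ratios over the ten non-sparse HMMs:

```python
# Min and Mean KL ratios from Table 6.13 for the ten dense HMMs,
# i.e., every row except Sparse HMM 6 6 and Diverse Sparse HMM 6 6.
dense_ratios = {
    "Simple HMM 3 3":       (1.0618, 0.7195),
    "Simple HMM 3 4":       (2.8285, 0.7064),
    "Simple HMM 4 3":       (0.9383, 0.8378),
    "Separated HMM 3 4":    (0.3066, 0.6021),
    "Separated HMM 3 4 #2": (0.5959, 0.4062),
    "Dense HMM 3 3":        (0.8648, 0.9856),
    "Dense HMM 4 4":        (0.9955, 1.0829),
    "Dense HMM 5 5":        (0.5491, 0.8748),
    "Separated HMM 5 5":    (0.4777, 0.4709),
    "Dense HMM 6 6":        (0.7800, 1.0491),
}

# Average the Min and Mean KL ratios over the dense rows.
avg_min = sum(r[0] for r in dense_ratios.values()) / len(dense_ratios)
avg_mean = sum(r[1] for r in dense_ratios.values()) / len(dense_ratios)
# avg_min comes out near 0.9398 and avg_mean near 0.7735, i.e., the mean
# KL ratio corresponds to roughly a 22.65% improvement over Baum-Welch.
```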
6.2.2.3 Larger Training Sets

In Table 6.2, Baum-Welch beats NNMF + B-W only in the L = 5000 cases. It seems that NNMF + B-W benefits greatly from having more samples in the training set. One reason could be the threshold for omitting rows in the measured observation matrix: more data means a bigger measured observation matrix. Having a bigger observation matrix does not necessarily make the factorization easier, but it can lead to more accurate HMMs.

6.2.2.4 Dense HMMs

NNMF + B-W performs close to or better than Baum-Welch on all of the HMMs with dense O and T matrices, which is every HMM except the two sparse ones. The notable exception, row 2 of Table 6.13, is explained in Section 6.2.2.3. Over these rows of Table 6.13, the average of the min KL ratios is 0.9398 and the average of the mean KL ratios is 0.7735. Thus, we can say that on the dense HMMs that we have tested, NNMF + B-W performs an average of 22.65% better in KL measure than Baum-Welch per training instance.

6.2.2.5 Separated HMMs

NNMF + B-W performs especially well on the separated HMMs, which have no states that transition to themselves. On separated HMMs, NNMF + B-W achieves an average min KL ratio as low as 0.3066 and an average mean KL ratio as low as 0.4062. Table 6.14 is provided for reference.

HMM                   Min     Mean
Separated HMM 3 4     0.3066  0.6021
Separated HMM 3 4 #2  0.5959  0.4062
Separated HMM 5 5     0.4777  0.4709

Table 6.14: Average KL ratios for separated HMMs.

6.2.2.6 Sparse HMMs

Baum-Welch absolutely dominated our algorithm on the sparse HMMs with six states; NNMF + B-W does especially badly on Sparse HMM 6 6. The reason may be intrinsic to the NNMF approach: the way in which we determine T is very local, in that we only look at the value of T that yields an accurate second output given the first output. In other words, T is determined through a sort of greedy approach that may suffer in the face of complexity.
For this reason, NNMF + B-W has a much easier time with Diverse Sparse HMM 6 6, although it is still orders of magnitude worse than Baum-Welch. For Diverse Sparse HMM 6 6, because the first state can be any state, when our NNMF algorithm determines T it captures the interplay among all states instead of just a few.

6.2.2.7 Standard Deviations

Although not particularly conspicuous, the performance of NNMF + B-W is often more consistent than that of B-W for HMMs with fewer than six states. For the two six-state HMMs, the performance of NNMF + B-W is almost always less consistent than that of B-W.

6.2.2.8 Factorization vs. Effectiveness

There does not seem to be a correlation between the accuracy of the factorization and the accuracy of the resulting HMM. Our algorithm achieves factorization distances on the order of 0.01 for Separated HMM 3 4, which is high compared to the other HMMs, yet it performs very well on Separated HMM 3 4, achieving low KL measures in all training cases. On the other hand, our algorithm achieves factorization distances on the order of 10^-10 for Simple HMM 4 3, but performs only slightly better in terms of KL measure than it did for Separated HMM 3 4. We surmise that easy-to-factor observation matrices do not necessarily lead to accurate HMMs.

6.2.2.9 NNMF + B-W Runtimes

Although we did not collect statistics, once seeded with NNMF, Baum-Welch often converged much more quickly than when started from a random seed. This phenomenon makes sense intuitively: NNMF provides parameters O, T, and π that already fit the training data to some extent, so Baum-Welch often starts close to a local optimum.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

The central goal of this thesis was to draw from Angluin's DFA learning algorithm and nonnegative matrix factorization to provide novel algorithms for HMM learning. Unfortunately, we failed to produce a learning algorithm using Angluin's insights.
The main problem was the incompatibility between HMMs and NFAs. If we sample too few sequences for training, the set of output sequences that an HMM generates is not regular; if we sample too many, the NFA ends up accepting too liberally.

Nonetheless, we were able to produce an HMM learning algorithm using the nonnegative matrix factorization approach. The algorithm was derived from the insight that every output distribution is a linear combination of the columns of the HMM's output matrix. Using this insight, we recast the problem of HMM learning as a problem in linear algebra. By using nonnegative matrix factorization, we were able to extract the columns of the HMM's output matrix from a table of output distributions.

To evaluate the algorithm, we measured its performance on a number of HMMs. The results show that for the dense HMMs we tested, the algorithm performs 22.65% better on average in KL measure than Baum-Welch. On the other hand, on HMMs with sparse O and T matrices, our algorithm performed significantly worse than Baum-Welch.

7.2 Future Work

While we were unable to make Angluin's work apply to learning HMMs, perhaps others can. We still believe that the approach is promising and deserving of further study.

A theoretical question that deserves attention is determining when the factorization approach helps. We have empirically shown some cases in which it outperforms Baum-Welch, but we also saw many cases in which it performed far worse. There is a need for a theoretical basis for when this approach is most effective. A related theoretical question is how exactly the accuracy of the factorization affects the accuracy of the resulting HMM.

The potential of this nonnegative matrix factorization idea rests partly on advances in algorithms for factorization. In order to factor a matrix into row-stochastic matrices, we made alterations to existing factoring algorithms without justifying the changes.
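For reference, the standard multiplicative updates of Lee and Seung [14], which a row-stochastic variant such as ours would modify, can be sketched in plain Python as follows. This is an illustrative sketch, not the thesis implementation; the function names and the small test matrix V are our own:

```python
# Sketch of Lee-Seung multiplicative updates for NNMF: V ~= W*H with all
# entries nonnegative. A row-stochastic variant would renormalize the
# factors' rows after each update.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nnmf(V, k, iters=500, eps=1e-9, seed=0):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        num, den = matmul(transpose(W), V), matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        num, den = matmul(V, transpose(H)), matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

# A rank-2 nonnegative matrix should factor almost exactly with k = 2.
V = matmul([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]])
W, H = nnmf(V, 2)
```

The updates never change the sign of an entry, which is what keeps the factors nonnegative throughout.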
Deriving and proving an efficient row-stochastic factoring algorithm would be a crucial step before applying this algorithm to larger, more complicated HMMs.

We have seen that for sparse HMMs with six states, our NNMF algorithm is unimpressive. An important investigation would be to improve the algorithm to work better for such HMMs. We surmise that the determination of T is the culprit: O and π are derived from the observation matrix, which encompasses all of the HMM's information, whereas T is determined solely by looking at the distribution of the second output given the first output.

Also, as discussed in Proposition 2, factorizations generated via invertible matrices may yield new HMMs. It would be useful to know how to pick among these generated factorizations to find the best HMM.

7.3 Summary

This thesis shows that nonnegative matrix factorization can be used to provide an algorithm for learning HMMs. We empirically demonstrate that the algorithm is effective for dense HMMs with at most six states, beating Baum-Welch in many cases. This work also attempts to apply Angluin's DFA learning algorithm to learning HMMs, but is unable to produce an algorithm and discusses the reasons. We see that invoking different but related fields, such as linear algebra and DFA learning, can offer new perspectives on HMM learning and lead to novel and effective algorithms.

Appendix A

Angluin's Algorithm Example

Suppose we are trying to learn the regular set of all finite binary sequences that contain an even number of zeroes and an even number of ones. The algorithm begins by querying the teacher whether '', '0', and '1' are in the set. Table A.1 is provided for reference. The top row represents the set of suffixes. The left column represents the set of prefixes. The prefixes are separated by a line that separates S, on top, from S·A. The entry where a row meets a column indicates whether the word formed by prefix followed by suffix is in the regular set.
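For this running example, the teacher's membership query reduces to two parity checks, and each table cell is just a query on prefix plus suffix. The sketch below is illustrative Python (not code from the thesis):

```python
# Membership oracle for the target language of the example: binary strings
# with an even number of 0s and an even number of 1s.
def is_member(word: str) -> bool:
    return word.count('0') % 2 == 0 and word.count('1') % 2 == 0

# One observation-table cell: the entry for prefix s and suffix e is
# simply the membership answer for the concatenation s + e.
def table_cell(prefix: str, suffix: str) -> int:
    return int(is_member(prefix + suffix))
```

For instance, the empty string is a member (zero is even), while '0' and '1' are not, which matches the first column of the tables below.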
        ''
''      1
------------
'0'     0
'1'     0

Table A.1: Initial table in Angluin's algorithm.

Now the algorithm notices that the table is not closed, since the row labeled by '0' in S·A is different from every row in the top part of the table. After resolving the issue, the table becomes Table A.2.

        ''
''      1
'0'     0
------------
'1'     0

Table A.2: Closed table in Angluin's algorithm.

In accordance with the algorithm, we make queries to see whether S·'0' and S·'1' are in the regular set. The table becomes Table A.3.

        ''
''      1
'0'     0
------------
'1'     0
'00'    1
'01'    0

Table A.3: Third iteration table in Angluin's algorithm.

Now this table is closed and consistent. We submit the conjecture to the teacher, which tells us that the sequence '10' is a counterexample: our conjecture accepts it, but it is not in the regular set. Indeed, our algorithm thinks that after the sequence '1' the DFA is in the same state as after the sequence '0'; hence it treats '10' like '00'. After resolving the issue, the table becomes Table A.4.

        ''
''      1
'0'     0
'1'     0
'10'    0
------------
'00'    1
'01'    0
'11'    1
'100'   0
'101'   0

Table A.4: Fourth iteration table in Angluin's algorithm.

The table is not consistent, since '0' and '1' have identical rows but '00' and '10' do not. After resolving the issue, the table becomes Table A.5.

        ''  '0'
''      1   0
'0'     0   1
'1'     0   0
'10'    0   0
----------------
'00'    1   0
'01'    0   0
'11'    1   0
'100'   0   0
'101'   0   1

Table A.5: Fifth iteration table in Angluin's algorithm.

The table is still not consistent, since '1' and '10' have identical rows but '11' and '101' do not. After resolving the issue, the table becomes Table A.6.

        ''  '0'  '1'
''      1   0    0
'0'     0   1    0
'1'     0   0    1
'10'    0   0    0
---------------------
'00'    1   0    0
'01'    0   0    0
'11'    1   0    0
'100'   0   0    1
'101'   0   1    0

Table A.6: Finished table in Angluin's algorithm.

Now the table is closed and consistent. We submit our conjecture to our teacher, and our teacher validates our conjecture. The number of states is the number of rows in the S part of our table, which is 4. Each row represents a distinct state. The initial state is the '' row.
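The closedness and consistency conditions exercised above can be checked mechanically. The following sketch (illustrative Python, not the thesis code) verifies both conditions for the finished table, using S = {'', '0', '1', '10'}, suffixes E = {'', '0', '1'}, and the even-zeroes/even-ones language as the oracle:

```python
# Membership oracle for the example language.
def member(w):
    return w.count('0') % 2 == 0 and w.count('1') % 2 == 0

S = ['', '0', '1', '10']   # prefixes in the top part of the table
E = ['', '0', '1']         # distinguishing suffixes

def row(s):
    # The row of the observation table labeled by prefix s.
    return tuple(member(s + e) for e in E)

def closed():
    # Every row of S*A must appear among the rows of S.
    top = {row(s) for s in S}
    return all(row(s + a) in top for s in S for a in '01')

def consistent():
    # Prefixes with identical rows must stay identical after any letter.
    return all(row(s1 + a) == row(s2 + a)
               for s1 in S for s2 in S if row(s1) == row(s2)
               for a in '01')
```

Both checks succeed on the finished table, and the four distinct rows of S correspond to the four states of the learned DFA.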
To determine whether a sequence is in the regular set, use the following procedure. (i) Locate the state that the DFA is in; it is one of the rows represented by an element s ∈ S. (ii) Let a be the next letter in the given sequence. If a is the last letter in the sequence, look up s·a in the table and retrieve the answer. Otherwise, go to step (iii). (iii) Find s·a in S ∪ S·A. If s·a ∈ S, go back to step (i); otherwise, go to step (iv). (iv) Observe the row represented by s·a ∈ S·A and find the row in the top part of the table identical to it. That is the state the DFA is in after processing a. Go back to step (i).

Appendix B

HMMs for Testing

B.1 Simple HMM 3 3

0.5 0.25 0 0.5 0.25 0.5 0.75 0 0.5 0 0.25 0.75 0.25 1 0.5 0.5 0.25 0.25 0

B.2 Simple HMM 3 4

O = 0.75 0 0.0 0.2 0.05 0 0 0.2 0.25 0 0.25 0.75 0 0 0.5, T = 0.5 0.75 0 0.5, π = 0.25 0.25 0.25 1 0 0.5 0 0 0.2

B.3 Simple HMM 4 3

O = 0.5 0.5 0.25 0.3 0.5 0.25 0 0.4 0 0.25 0.75 0.3 0.5 0.5 0.25 1 0 0.25 0 0 0 0.3 0.15 0 0.1 0 0.75 0.5

B.4 Separated HMM 3 4

0.5 0.5 0.25 0.1 0.2 0.1 0.2 0.7 0.1 0.3 0.05 0.25 0.1 0 0 0.3

B.5 Separated HMM 3 4 #2

O = 0.2 0.6 0.4 0.5 0.2 0.1 0.1, T = 0.5 0.5 0.75 0 0.5 0.25 0.5 0 0.25 0 0.1 0.65 0.4 0.4 0 0.35 0.1, π = 0.9 0 0.5

B.6 Dense HMM 3 3

O = 0.5 0.6 0.25 0 0.25 0.3 0.6 0.25 0.3 0.2, T = 0.5 0.4 0.2 0.5 0.25 0.25 0.25 0.3 0.6 0.25 0.1 0.15

B.7 Dense HMM 4 4

O = 0.5 0.5 0.2 0.4
    0.25 0.1 0.05 0.3
    0.15 0.3 0.6 0.2
    0.1 0.1 0.15 0.1

T = 0.25 0.1 0.2 0.1
    0.5 0.2 0.2 0.2
    0.2 0.4 0.3 0.3
    0.05 0.3 0.3 0.4

π = 0.5 0.2 0.2 0.1

B.8 Dense HMM 5 5

0 0.5 0.6 0.5 0.2 0.1 0 0 0.1 0 0.5 0.2 0.2 0 0.5 0.25 0.1 0.1 0.2 0.5 0.25 0.3 0.05 0 0.5 0.1 0.1 0.1 0.4 0.05 0.3 0.1 0.2 0.1 0.2 0.15 0.1 0.1 0.2 0.4 0 0.05 0.4 0.4 0.1 0.1 0.1 0.2 0.3 0.3 0.2 0.2 0.2

B.9 Separated HMM 5 5

0 0.2 0.2 0.25 0.1 0.2 0 0.5 0.2 0.1 0.2 0.2 0.3 0.3 0.1 0.3 0.3 0.55 0 0.3 0.5 0.3 0.15 0.1 0.2 0 0.5 0.4 0.1 0.3 0 0.3 0.4 0.1 0.2 0.6 0 0 0.1 0.1 0.4 0 0.1 0.4 0.3 0.1 0.05 0.1 0.1 0.25 0.1 0.1 0.1 0 0.15 T = 0.1

B.10 Dense HMM 6 6

0.2415
0.2713 0.1071 0.1737 0.2427 0.1662 0.2506 0.1447 0.0155 0.2044 0.0285 0.1662 0.2242 0.1634 0.4442 0.1843 0.1074 0.1506 0.0147 0.0120 0.2776 0.2000 0.3164 0.1083 0.1097 0.1999 0.0325 0.1555 0.2822 0.2331 0.1593 0.2087 0.1231 0.0821 0.0228 0.1756 0.2533 0.1997 0.2377 0.0668 0.0271 0.0965 0.2 0.2086 0.0782 0.0522 0.0356 0.1973 0.0657 0.25 0.1511 0.2776 0.0468 0.1560 0.2680 0.2587 0.15 0.1984 0.0767 0.2484 0.1271 0.1145 0.0867 0.1 0.0226 0.2425 0.1518 0.4391 0.2511 0.2333 0.2 0.1660 0.1253 0.2631 0.1754 0.1420 0.2591 0.1

B.11 Sparse HMM 6 6

0 0.5 0 0 0.5 0 0 0.3 0.1 0 0.7 0.75 0 0 0 0.3 0.5 0 0 0 0.9 0 0.3 0 0 1 0 0.5 0 0 0.8 0 0 0 0 0 0 0.25 0 0 0 0 0 0.2 0.5 0 0 0.2 0 0 0 0 0 0.5 0 0 0 0 0.7 0 0.8 0 0 0 0 0.5 0 0.7 0 0 0.5 0 0 0 0 0.5 0 0.5

B.12 Diverse Sparse HMM 6 6

0 0.5 0 0 0.5 0 0 0.3 0.1 0 0.3 0 0 0 0.3 0.5 0 0 0 0.9 0 0.2 0 1 0 0.5 0 0 0.8 0 0 0 0 0.1 0 0.25 0 0 0 0 0 0.2 0.5 0 0 0.2 0.1 0 0 0 0 0.5 0 0 0 0 0.7 0 0.8 0.2 0 0 0 0.5 0 0.7 0 0 0.5 0 0 0 0.1 0.5 0 0.5 0.75 0

Bibliography

[1] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106, 1987.

[2] Kenneth Basye, Thomas Dean, and Leslie Pack Kaelbling. Learning dynamics: system identification for perceptually challenged agents. Artificial Intelligence, 72(1-2):139-171, January 1995.

[3] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1):164-171, 1970.

[4] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, 2007.

[5] Benedikt Bollig, Peter Habermehl, Carsten Kern, and Martin Leucker. Angluin-style learning of NFA. In Proceedings of IJCAI 2009, pages 164-171, 2009. To appear.
Full version as Research Report LSV-08-28, Laboratoire Specification et Verification, ENS Cachan, France.

[6] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Hierarchical ALS algorithms for nonnegative matrix and 3d tensor factorization. In Lecture Notes in Computer Science, Vol. 4666, Springer, pages 169-176, 2007.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[8] P. Dupont, F. Denis, and Y. Esposito. Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38:1349-1371, 2005.

[9] Lorenzo Finesso and Peter Spreij. Approximation of stationary processes by hidden Markov models. arXiv:math/0606591v2, February 2008.

[10] Yoav Freund, Michael Kearns, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. Efficient learning of typical finite automata from random walks. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing, pages 315-324, 1993.

[11] Omri Guttman, S. V. N. Vishwanathan, and Robert C. Williamson. Learnability of probabilistic automata via oracles. ALT, pages 171-182, 2005.

[12] Ngoc-Diep Ho and Paul Van Dooren. Non-negative matrix factorization with fixed row and column sums. Accepted for publication in Linear Algebra and Its Applications, 2006.

[13] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Preprint, February 2009.

[14] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556-562, 2001.

[15] Kevin Murphy. Hidden Markov model toolbox for MATLAB, 2005. Available at http://people.cs.ubc.ca/~murphyk/Software/HMM/hmm.html.

[16] Luis E. Ortiz and Leslie Pack Kaelbling. Accelerating EM: An empirical study.
In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence, pages 512-521, 1999.

[17] V. Paul Pauca, J. Piper, and Robert J. Plemmons. Nonnegative matrix factorization for spectral analysis. In Linear Algebra and Its Applications, 2005. In press, available at http://www.wfu.edu/~plemmons.

[18] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.

[19] Farial Shahnaz, Michael W. Berry, V. Paul Pauca, and Robert J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2):373-386, 2006.

[20] Sebastiaan A. Terwijn. On the learnability of hidden Markov models. In International Colloquium on Grammatical Inference, 2002.

[21] Stephen Vavasis. On the complexity of nonnegative matrix factorization. arXiv:0708.4149v2, 2007.