Two New Approaches for Learning ... Hyun Soo Kim

advertisement
Two New Approaches for Learning Hidden Markov Models
by
Hyun Soo Kim
Submitted to the
Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
~SACHUSETTS
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2010
© Hyun Soo Kim, MMX. All rights reserved.
ARCHINVES
INSTffUE
OF TECHNOLOGY
AUG 2 42010
LIBRARIES
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
. ./ ........ .....................
A uthor ..........................................
Department of Electrical Engineering and Computer Science
September 5, 2009
C ertified by ...............................
Leslie P.
Kaelblin
Professor of Computer Science and Engineering, MIT
Thesis Supervisor
A ccepted by ..............................
Drdh'istopher J. Terman
Chairman, Department Committee on Graduate Theses
THIS PAGE INTENTIONALLY LEFT BLANK
Two New Approaches for Learning Hidden Markov Models
by
Hyun Soo Kim
Submitted to the Department of Electrical Engineering and Computer Science
on September 5, 2009, in Partial Fulfillment of the
Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Hidden Markov Models (HMMs) are ubiquitously used in applications such as speech
recognition and gene prediction that involve inferring latent variables given observations.
For the past few decades, the predominant technique used to infer these hidden variables
has been the Baum-Welch algorithm.
This thesis utilizes insights from two related fields. The first insight is from Angluin's
seminal paper on learning regular sets from queries and counterexamples, which produces
a simple and intuitive algorithm that efficiently learns deterministic finite automata. The
second insight follows from a careful analysis of the representation of HMMs as matrices
and realizing that matrices hold deeper meaning than simply entities used to represent the
HMMs.
This thesis takes Angluin's approach and nonnegative matrix factorization and applies
them to learning HMMs. Angluin's approach fails and the reasons are discussed. The matrix
factorization approach is successful, allowing us to produce a novel method of learning
HMMs. The new method is combined with Baum-Welch into a hybrid algorithm. We
evaluate the algorithm by comparing its performance in learning selected HMMs to the
Baum-Welch algorithm. We empirically show that our algorithm is able to perform better
than the Baum-Welch algorithm for HMMs with at most six states that have dense output
and transition matrices. For these HMMs, our algorithm is shown to perform 22.65% better
on average by the Kullback-Liebler measure.
Thesis Supervisor: Leslie P. Kaelbling
Title: Professor of Computer Science and Engineering, MIT
THIS PAGE INTENTIONALLY LEFT BLANK
Acknowledgments
This thesis was prepared at the CSAIL 4th Floor Laboratory in the Stata building.
Publication of this thesis does not constitute approval by the CSAIL Laboratory or any
sponsor of the building or conclusions contained herein.
I want to thank Professor Leslie Kaelbling for giving me the guidance with which to
complete this thesis. She has gracefully accepted to take me under her wing in response
to what must have been a pretty abrupt request to join her lab. Ever since becoming
her student in the fall of 2008, she has always been on top of my research and available
for consultation. Throughout the year, I met with her dozens of times to tell her of my
progress, ask about papers I should read, and, perhaps most importantly, to hear her tell
me not to get discouraged about not making much progress.
I have learned invaluable lessons on how to really dig down into an idea and become an
effective researcher. There were many times when I was stuck doing a certain approach,
only to be nudged towards a better direction by her help. Sometimes, it just really took
some time and contemplating to understand a difficult idea fully.
THIS PAGE INTENTIONALLY LEFT BLANK
Contents
1 Introduction
15
1.1
Hidden Markov Models
1.2
Motivation...................
1.3
Thesis A pproach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
1.4
T hesis Structure
18
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.. ...
. . . . . . . . . . ..
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Background
2.1
16
21
Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
2.1.1
Definitions and Notation.
. . . . . . . . . . . . . . . . . . . . . . . .
22
2.1.2
How to Use HMMs: A Brief Guide . . . . . . . . . . . . . . . . . . .
23
3 Related Work
25
3.1
The Baum-Welch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
3.2
Angluin's Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.2.1
Learning Nondeterministic Finite Automata . . . . . . . . . . . . . .
28
3.2.2
Equivalence of pNFAs and HMMs
29
3.3
4
15
The Spectral Algorithm.....
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . ..
31
The Two New Approaches
33
4.1
The Extended Angluin's Algorithm . . . . . . . . . . . . . . . . . . . . . . .
33
4.1.1
M otivation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
4.1.2
Issues with Learning HMMs . . . . . . . . . . . . . . . . . . . . . . .
34
4.2
4.1.2.2
Deterministic Actions in DFAs..
4.1.2.3
No Accepting States . . . . . . . . . . . . . . . . . . . . . .
35
4.1.2.4
Faux-Regular Sets
..
35
. .
36
.. .
37
..
39
.. . . . . .
39
.
. .
39
.
. .
40
.. .
42
.. . .
. . . ..
...............
Probabilistic Angluin: A Possible Extension . . . . . .
4.1.4
The HMM Learning Algorithm . . . . .........
.
The Nonnegative Matrix Factorization Algorithm . . . . ..
Motivation......... . . . . . . . . . . . . . .
.
4.2.1.1
The Observation Matrix..... . .
. ..
4.2.1.2
Factorization of the Observation Matrix . . .
Recovering 0, T, and 7r from an Observation Matrix
4.2.3
Issues with Learning HMMs . . . . . . . . . . . . . . . . . . . . . . .
44
..
44
.
45
.. . . . . .
45
.. . . . . .
45
4.2.3.1
Stochastic Factorization . . . . . .......
4.2.3.2
Constructing the Observation Matrix . . . .
4.2.3.3
Factorization into C and OT . . . . . ....
4.2.3.4
Factorization Measure.... . . . .
4.2.3.5
Trivial Factorizations
4.2.3.6
Non-Uniqueness of Factorization .......
4.2.3.7
Difficulty of Factorization . . . . . . . . . . . . . . . . . . .
.
.
. . .
. . . . . . . . . . . . . . . . . . . . .
4.2.4
Sparse Observation Matrices... . . . . . .
4.2.5
Algorithms for Factoring . . . . . . .
.. . . . . .
. . ..
.
. . . .
48
55
56
57
4.2.5.1
Lee and Seung's Algorithm . . . . . . . . . . . . . . . . . .
57
4.2.5.2
The ALS Algorithm.. . . . . . . . .
4.2.5.3
Our NNMF Algorithm..... . .
The HMM Learning Algorithm. . . .
Implementation Issues
5.1.1
46
. . . . . . . .
. . . . . . ..
...
. . . .
.
. . . .
. . . . . . . .
. . . . . . .
.
. . . .
Methodology
5.1
34
4.2.2
4.2.6
5
No Agent in HMMs . . . . . . . . . . . . . . . . . . . . . .
4.1.3
4.2.1
34
4.1.2.1
. . .
Baum-Welch Termination Protocol
. . . . . .
. .
. . .
. . .
. . . . .
. . . . . . . .
57
58
60
5.1.2
NNMF Observation Matrix
5.2
Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
5.3
Measures of Accuracy
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5.3.1
Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5.3.2
Kullback-Leibler Divergence. .
65
5.4
. . . . . . . . . . . . . . . . . . .
The HMMs . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
65
5.4.1
Simple HMM 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
5.4.2
Simple HMM 3 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
5.4.3
Simple HMM 4 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
5.4.4
Separated HMM 3 4 . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
5.4.5
Separated HMM 3 4 #2
67
5.4.6
Dense HMM 3 3.........
5.4.7
Dense HMM 4 4...
5.4.8
Dense HMM 5 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
68
5.4.9
Separated HMM 5 5 . . . . . . . . . . . . . . . . . . . . . . . . . . .
68
5.4.10 Dense HMM 6 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
5.4.11 Sparse HMM 6 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
5.4.12 Diverse Sparse HMM 6 6
69
. . . . . . . . . . . . . . . . . . . . . . . . .
.. ..... .
. . . .
. . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
6 Results and Analysis
6.1
6.2
67
67
71
On the Failure of the Extended Angluin Algorithm . . . . . . . . . . . . . .
71
6.1.1
Stopping State Approach: Version 1 . . . . . . . . . . . . . . . . . .
72
6.1.2
Stopping State Approach: Version 2 . . . . . . . . . . . . . . . . . .
74
6.1.3
Accept All Approach . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
6.1.4
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
Comparison of Algorithms.. . . . . . . . . . . . . . . . . . .
. . . . . . .
77
6.2.1
Evaluation Tables
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
6.2.2
Evaluation Table Analysis . . . . . . . . . . . . . . . . . . . . . . . .
84
6.2.2.1
84
Average KL Ratios
. . . . . . . . . . . . . . . . . . . . . .
. .
6.2.2.2
NNMF vs. NNMF + B-W
6.2.2.3
Larger Training Sets ......
6.2.2.4
Dense HMMs . . . . . . . . . .
6.2.2.5
Separated HMMs
6.2.2.6
Sparse HMMs
6.2.2.7
Standard Deviations......
6.2.2.8
Factorization vs. Effectiveness
6.2.2.9
NNMF + B-W Runtimes . . .
. . . . . . .
. . . . . . . . .
7 Conclusion and Future Work
. . . . . . . ...
..........................
. . . . . . ...
..........................
7.1
Conclusion
7.2
Future Work
7.3
Summary . . . . . . . . . . . ..........................
A Angluin's Algorithm Example
95
B HMMs for Testing
Simple HMM 3 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
B.2 Simple HMM 3 4
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
B.3 Simple HMM 4 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
B.1
.
.
B.4 Separated HMM 3 4 . . . . . .......
B.5 Separated HMM 3 4 #2
. .
... . . ...
. . . . . . . . .....
B.7 Dense HMM 4 4
. . . . . . .....
. . . . . . . . . . . . . . ...
..
96
... . . . ...
96
.
96
....
. . . . . . . ...
97
. ...
... ..
. . . . . . . . . . . . . . ......
...
B.9 Separated HMM 5 5 . . . . .......
96
96
.............................
B.6 Dense HMM 3 3
B.8 Dense HMM 5 5
...
.
...
B.10 Dense HMM 6 6
... . . ......
..
. .. .
. . . . . . . ...
97
B.11 Sparse HMM 6 6
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
B.12 Diverse Sparse HMM 6 6. .
List of Figures
2-1
A M arkov m odel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
3-1
An example of a pNFA.
. . . . . . . . .
30
3-2
An example of an equivalent HMMT. . . . . ..
. . . . . . . . .
30
3-3
An example of an equivalent HMM.
. . . . . . . . .
31
3-4
Converting an HMM into an equivalent pNFA.
. . . . . . . . .
32
4-1
The Extended Angluin HMM Learning Algorithm.. . . . . . . .
4-2
An example of how output distributions are determined by 0 and the distribution
. . . . . . . . . . . . . .
of possible states.....
. . . ...
.
. . . .
. . . . . . . . . . . . . . . . . . . .
38
40
4-3
An example of an observation matrix. . . . . .
.
4-4
Factorization of an observation matrix. . . . . .
. . . . . . . . . . . .
41
4-5
Trivial factorization of an observation matrix. .
. . . . . . . . . . . .
47
4-6
The Lee-Seung algorithm. . . . . . . . . . . . .
. . . . . . . . . . . .
57
4-7
The ALS algorithm.
. . . . . . . . . . .
58
4-8
The modified Lee-Seung algorithm. . . . . . . . . . . . . . . . . . . . . . . .
59
4-9
The modified ALS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . .
59
. . . . . . . . . . . . . . .
4-10 The NNMF HMM Learning Algorithm.....
.
.
.
.
.
.
.
.
.
.
.
.
.
41
. . . . . . . . . . . .
61
6-1
State transitions of Simple HMM 3 3.
. . . .
72
6-2
NFA learned from 10000 output sequences from Simple HMM 3 3.
. . . .
72
6-3
NFA learned from 50 output sequences from Simple HMM 3 3.
. . . .
73
. .
6-4
NFA learned from 10000 output sequences from Separated HMM 3 4.
. . .
73
6-5
Simple HMM 3 3 with only one possible initial state. . . . . . . . . . . . . .
75
6-6
NFA learned from 10000 output sequences from Simple HMM 3 3 with only
one possible initial state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
List of Tables
6.1
Simple HM M 3 3 results.
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
6.2
Simple HMM 3 4 results.
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
6.3
Simple HMM 4 3 results.
. . . . . . . . . . . ... .. .. .. . .. .. .. .
79
6.4
Separated HMM 3 4 results. . . . . . . . . . . . . . . . . . . . . . . . . . . .
79
6.5
Separated HMM 3 4 #2 results.. . .
80
6.6
Dense HMM 3 3 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
6.7
Dense HMM 4 4 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
6.8
Dense HMM 5 5 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
6.9
Separated HMM 5 5 results. . . . . . . . . . . ..
.. .. .. . ... . .. .
82
6.10 Dense HMM 6 6 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
6.11 Sparse HMM 6 6 results..
. . . . . . . . . . .. ... .. .. . ... . .. ..
83
6.12 Diverse Sparse HMM 6 6 results. . . . . . . . . . . . . . . . . . . . . . . . .
83
6.13 Average KL ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
6.14 Average KL ratios for separated HMMs. . . .. . . . . . . . . . . . . . . . .
85
A.1 Initial table in Angluin's algorithm. . . . . . .. . . . . . . . . . . . . . . . .
91
A.2 Closed table in Angluin's algorithm.
92
..
.. . .
. . . . . . . . . . .
. . . . . . .. ... . .. .. .. .. ..
A.3 Third iteration table in Angluin's algorithm.
.. .. .. .. .. . .. .. ..
92
A.4 Fourth iteration table in Angluin's algorithm. . . . . . . . . . . . . . . . . .
92
A.5 Fifth iteration table in Angluin's algorithm. . . . . . . . . . . . . . . . . . .
93
A.6 Finished table in Angluin's algorithm.
. . . .
THIS PAGE INTENTIONALLY LEFT BLANK
Chapter 1
Introduction
1.1
Hidden Markov Models
An HMM is a statistical model that models a Markov process with unobserved states.
Typically, the parameters in question are state transition and output probabilities. In a
regular Markov model, the state of the system is plainly visible to the observer, and so the
only parameters are transition probabilities between the states. In an HMM, the state of
the system is hidden, but the outputs influenced by the state are visible. Output tokens
are emitted according to a distribution that depends on the state. The sequence of tokens
generated by an HMM are indicators of the state transitions of the system.
HMMs can be used for pattern recognition, either by using repeated examples and the
tools of inference to deduce parameters of a system, or by using given model parameters to
infer future sequences of outputs. As such, HMMs are widely used in speech recognition,
handwriting, gesture recognition, musical score following, and bioinformatics.
There are three canonical problems in HMM theory.
The first is to compute the
probability of a particular output sequence given the parameters of the model. The second
is to compute the most likely sequence of hidden states that could have given rise to a
given output sequence given the parameters of the model. The third is to compute the
most likely state transition probabilities, output probabilities, and initial state distribution,
given output sequences. The first two problems have been completely solved and given
thorough treatment in Rabiner's paper [18]. The third problem is given consideration as
well, but Rabiner [18] admits the difficulty in learning HMMs; it is not a surprise that the
most useful problem would turn out to be the hardest. This thesis is concerned with the
third problem, learning an HMM solely from the observable data. We clarify our goal a bit,
to emphasize that we are not so much interested in recovering the exact HMM parameters
as producing an HMM that produces the same likelihoods for output sequences as the actual
HMM.
There are numerous results that imply that HMM learning is provably hard. Under
reasonable assumptions, Terwijn showed in [20] that the HMM learning problem is not
solvable in polynomial time in the error, confidence parameter, and size of the HMM. This
result should serve as an unfortunate reminder of the limitations of our abilities in this field;
we might get some answers, but we will probably never get an optimal answer. Practitioners
typically resort to local search heuristics. In this vein, the Baum-Welch
/
EM algorithm
has become the predominant learning algorithm as noted in [3] and [7].
Nonetheless, in practical applications, we typically find that we can relax constraints in
the model and add strong assumptions that allow us to achieve much better results. For
example, under the assumption that observation distributions arising from distinct hidden
states are distinct, Hsu et al. [13] showed that there exists a polynomial time algorithm
for approximating the conditional distribution of a future observation conditioned on some
history of observations.
1.2
Motivation
Currently, the Baum-Welch algorithm is the prevailing method used to learn HMMs. It
is an expectation-maximization (EM) algorithm, which works by iterating an estimation
step followed by a modification step. The algorithm does not come with many guarantees.
The most one could say is that it is guaranteed, asymptotically, to converge to a local,
not necessarily global, optimum. Moreover, results are highly sensitive to initialization.
Nonetheless, the algorithm has been used extensively for decades by HMM researchers.
Within the past few years, there have been some interesting results on learning HMMs
that branch away from the Baum-Welch approach and introduce completely new concepts.
One such result is the spectral algorithm discovered by Hsu et al. [13].
This thesis attempts to follow suit and investigate new methods of learning HMMs in
the following two ways.
Angluin's paper demonstrates the effectiveness of having a teacher that answers simple
but illuminating questions regarding the system [1]. To our knowledge, there has not been
an attempt to adapt Angluin's method to learn HMMs, although there has been a recent
paper that extended Angluin's method to learn nondeterministic finite automata.
The
general HMM learning problem is difficult, but reducing the problem by approaching it
from a practical perspective and introducing strong assumptions and relaxing constraints
should make it more tractable, just as it did for learning regular sets in Angluin's paper. For
example, in the realm of HMMs, it seems reasonable to assume the existence of a teacher
that informs the algorithm how well it is doing and how well it could be doing if it worked
better. Although learning finite automata does not carry over to HMMs in an obvious way,
the methodology is certainly relevant, as we are trying to uncover hidden parameters given
observables and a teacher helping the algorithm. Therefore, one of the goals of this thesis
will be to investigate the feasibility of applying Angluin's insight into the realm of learning
HMMs.
The other area that this thesis investigates is nonnegative matrix factorization. The
idea is to investigate whether the use of matrices to represent HMMs is merely notational.
If an HMM gives rise to a number of meaningful matrices, could it be the case that starting
with meaningful matrices, we may reconstruct the HMM? HMMs necessarily give rise to
matrices with certain properties. Perhaps if we extracted matrices with those properties
from the observable data, we could reconstruct the HMM.
1.3
Thesis Approach
This thesis tackles the HMM learning problem using novel approaches inspired by related
areas of study. To do so, we first do an in-depth analysis of the approaches in question, by
describing them and giving examples of them being used. Our goal is to fully understand
the underlying assumptions and context of these insights so that we have a good idea of how
they could be of use in HMM learning. This thesis will then face the problem of applying
these insights to HMM learning.
This thesis will also investigate the effectiveness of these new approaches, comparing
them to the tried-and-true effectiveness of the Baum-Welch algorithm. Our goal is to show
whether any of these new approaches shows significant promise. Specifically, this thesis
uses the insight gathered from considering various novel approaches to conclude that the
nonnegative matrix factorization (NNMF) approach is the most promising. Moreover, this
thesis will combine NNMF with Baum-Welch to produce a hybrid algorithm in hopes of
outperforming the original Baum-Welch algorithm.
The theoretical analysis of this approach leads us to consider various algorithms for
nonnegative factorization and issues of linear programming that are outside the scope of
this paper. However, the thesis does illuminate the interconnectedness of these two fields
by showing that a good factorization is likely to result in a better learned HMM.
1.4
Thesis Structure
This thesis is structured as follows.
Chapter 2 describes some background that is necessary in understanding the HMM
learning problem. It includes definitions that are necessary to understand the rest of the
paper and notation conventions. We describe what an HMM is and how it can be used to
calculate meaningful probabilities.
Chapter 3 lists and explains previous work related to HMM learning. Baum-Welch,
Angluin's method for learning finite automata, and the spectral algorithm are described
here. This chapter aims to provide an overview of the different approaches to HMM learning.
Chapter 4 is original work. The first section in this chapter describes an extension to
Angluin's algorithm to learn HMMs. The second section describes how nonnegative matrix
factorization can be used to learn HMMs.
These two ideas are built up into two HMM
learning algorithms.
Chapter 5 describes our methodology in implementing and evaluating our HMM learning
algorithms. We discuss measures of accuracy and set up a variety of HMMs.
Chapter 6 describes the results of implementing and evaluating our two HMM learning
algorithms. We find that only the nonnegative matrix factorization approach is feasible.
We compare this new HMM learning algorithm with Baum-Welch and make comments on
interesting trends in the resulting data.
Chapter 7 wraps everything together, culminating in a conclusion, avenues for future
work, and a summary.
THIS PAGE INTENTIONALLY LEFT BLANK
Chapter 2
Background
2.1
Hidden Markov models
A Markov chain is a stochastic process that obeys the Markov property. The Markov
property states that given the present state, future states are independent of past states.
If X 1 , X 2 , .. . is a sequence of random variables with the Markov property, with the indices
increasing with the passage of time, then
Pr(Xn+1 = z|Xn = zX , Xn=1 = Xni, . .. , X
1
= xi) = Pr(Xn+1 =
X
= Xz).
A Markov model comprises states and transitions between pairs of states. In the Markov
models we are studying, each state also emits an output that is dependent only on that
state. Figure 2-1 is an example of a Markov model.
At each time step, the Markov model progresses forward in time, producing an output
from its current state and moving to the next state according to the transition probabilities.
Under some conventions, the output is produced only after transitioning to state. In this
thesis, we choose to employ the convention that an output is produced from its current
state before the transition occurs. Many Markov models specify a probability distribution
Rain
Rain No Rain
No Rain
5% 95%
Sunny
Day
80%
20%
20%
Cloudy
Day
50%
80%
50%
Figure 2-1: A Markov model. Each state represents whether the day starts off sunny or
cloudy. Each state's emission corresponds to whether it rains on that day. The Markov
property is satisfied because the transition and output from one state depend only on that
state.
over the initial.
In a hidden Markov model, the state sequence is unknown to the observer, and only the
sequence of emissions is observable. In our example above, we might observe a sequence
{ Rain,
No Rain, No Rain, Rain, Rain, Rain} across a span of six days. We would have
no information about the number of states, transition matrix, output matrix, or the initial
state distribution.
2.1.1
Definitions and Notation
We can organize the transition probabilities into a transition matrix. A transition matrix
holds information about how likely it is to move from one state to another. In the example
above, if the sunny state is state 1 and the cloudy state is state 2, the transition matrix is
0.8 0.5
0.2 0.5
The more popular convention is to have rows sum to 1. For our convenience, we will
be taking the convention that the columns sum to 1. We can also organize the output
probabilities into an output matrix. An output matrix describes the distribution of outputs
emitted from a state. In the example above, the output matrix is
0.05 0.80
0.95
0.20
We can also organize the initial starting probabilities into a column vector
0.7
0.3
Under our notation, a transition matrix T's rows represent the destination states in
some fixed order and columns represent the source states in some fixed order. An output
matrix O's rows represent the outputs in some fixed order and columns represent the states
in some fixed order. Lastly, 7r is a column vector representing the initial state distribution.
2.1.2
How to Use HMMs: A Brief Guide
Given the parameters of an HMM, we can compute the following quantities. We only give
a brief overview in this section, just enough to lay the foundations for our use of HMMs
for the rest of this thesis. Please refer to [18] for a more in-depth treatment of how to use
HMMs.
(i) The probability of encountering a certain output sequence given a model.
Let A, be the matrix such that entry (i,
j)
is the probability of starting from state
emitting x, and then moving to state i.
Then Ax
=
TO, where O
= diag(O(x,:)).
The probability of encountering output sequence xtxtli ... x 1 is then
10soAt_
appoAp
ca
7r,
where 1 is a row vector of all ones of appropriate size.
j,
(ii) The probability of encountering a certain output sequence given a certain output
sequence.
If the future output sequence is ysys_1
...
y1 and the past output sequence is xtzt 1 .. .z,
the answer is
10,AYS-
1
AY
--.
Ax17r
10,, Ax,_, - -*-Ax 1 ,r
(iii) The probability of being in a certain state given a certain output sequence.
If the output sequence is xtxt_1 ... xi, then the probability of being in state i is entry
i in the column vector
Ax, Ax_- 1
Ax7r.
Chapter 3
Related Work
In this section, we describe previous work related to learning HMMs.
The Baum-Welch
algorithm is a well-known HMM learning algorithm that's been in use for decades. Angluin's
algorithm is used to learn deterministic finite automata, but we will attempt to apply it to
learning HMMs.
3.1
The Baum-Welch Algorithm
The Baum-Welch algorithm outlined in [3] is a classic algorithm used to find unknown
parameters to an HMM.
In general, if we have access to labeled data (that is, we know the sequence of states),
we can find parameters to the HMM by using maximum likelihood estimators. The real
problem arises when we do not know the state sequence.
The algorithm is an expectation-maximization (EM) algorithm. Given only emissions
from an HMM, it computes maximum likelihood estimates and posterior probabilities for the
parameters of the HMM. The two components - computing maximum likelihood estimates
and posterior probabilities - are closely linked in the algorithm. In fact, there are two
stages, known as the E-step and M-step in the algorithm, that alternate between these two
concepts.
In the E-step, the algorithm estimates the likelihood of the data under current parameters.
In the M-step, these likelihoods are used to re-estimate the parameters. These steps are
iterated until the algorithm converges to a local maximum, although it is not easy to tell
when to stop the algorithm.
The Baum-Welch algorithm can get close to a local maximum, but not necessarily to a
global maximum.
Intuitively, the algorithm takes a guess, looks at the data it produces, compares it
to data actually produced, and updates its guess. It iterates until it cannot make any
more improvements. Without the existence of a teacher, the algorithm makes incremental
improvements by evaluating its own performance.
Depending on the initial setting of
parameters, the algorithm may converge to a local maximum and not a global maximum.
Typically, to learn an HMM using Baum-Welch, we run the algorithm a few times,
using random seeds each time. For each run, we stop when improvements to the likelihood
measure are deemed too small.
However, it is difficult to know when to stop because
improvements to likelihood measure can be flat for a long time before jumping up.
Because the Baum-Welch algorithm is so wide-spread, there are plenty of implementations
that one can find easily on the web. In this thesis, we use a MATLAB implementation
developed by [15].
3.2
Angluin's Algorithm
Angluin's algorithm that learns regular sets via queries and counterexamples is an important
result [1]. The algorithm reversed a previous dismissal of the topic. By introducing the
concept of a teacher that is able to answer queries made by the algorithm, it also shed light
on the nature of learning itself.
A regular language is a language recognized by a deterministic finite automaton (DFA).
A regular set is the set of words in a regular language. The problem of being able to learn
any deterministic finite automaton is equivalent to being able to learn any regular set.
The existence of a teacher is an assumption made by the paper. There is no known
polynomial-time algorithm for learning a regular language without the kind of teacher
specified by the paper. Specifically, the teacher is capable of two things: answering membership
queries and providing counterexamples.
A membership query is a yes or no question regarding the membership of a certain word
in the regular set. Making an analogy to human learning, if the problem were learning
how to play tennis, membership queries would be akin to asking questions such as "am I
swinging my arm correctly?" and "is my racquet grip correct?"
A counterexample is a word that is not yet in the guess set that the algorithm builds into
the desired regular set. Because the algorithm never adds words that are not in the desired
regular set, there is no counterexample involving a word that should be removed from the
guess set. In the tennis analogy, providing counterexamples would be akin to asking, "what
am I missing?"
If the teacher tells the algorithm that there are no counterexamples, then the algorithm
has succeeded in determining the regular set.
Using these intuitive concepts of a teacher, the algorithm builds a regular set. The
nature and extent of the similarity seem to be an interesting research topic on human
intelligence.
Before we continue, let us remark on how realistic it is to assume that such a teacher
is available. Assuming that membership queries are still available, a machine that is trying
to predict a sequence that occurs in nature could find its own counterexamples by checking
randomly produced sequences. After trying sufficiently many random sequences, it could
conclude that its model is probably correct. In this way, we could obviate the need for a
teacher that provides counterexamples. As for membership queries, if data is constantly
provided by nature and the machine has been observing the data for a sufficiently long time,
the machine could conclude that it has seen most of the members of the desired regular set.
These two observations suggest that the abstract teacher concept is not very far-fetched.
In any case, the algorithm utilizes an observation table during its run. It is a table that
keeps track of membership queries it has already made. Running along the vertical axis of
the table are prefixes and running along the horizontal axis are suffixes. Prefixes satisfy
the constraint that any shortened version of the words (shortened by removing characters
from the end) are still prefixes. A similar constraint holds for suffixes. The row-space is
partitioned further into S, a "basis" set, and S -A, the set of prefixes formed by appending
a letter of the alphabet A to the end of elements in S. Where the row and column for a
particular prefix p and suffix s meet is a boolean value indicating whether the word p - s is
in the regular set or not. The algorithm builds this table until it satisfies two conditions.
The first condition is closedness. An observation table is closed if and only if for each t
in S - A, there exists an s in S such that row(t) = row(s). Intuitively, this condition says
that a letter addition to any word in S should be just like something we have seen before.
Appending the letter should put us in no new territory.
The second condition is consistency. An observation table is consistent if and only if
for every two elements si and 82 in S such that row(si) = roW(s2), we have row(si -a) =
row(s 2 - a)
for every letter a. Intuitively, this condition says that states that we think are
same should react in the same way regardless of how we prod it.
The algorithm performs membership queries in order to satisfy these conditions and
asks for a counterexample. If a counterexample exists, it adds that to the observation table
and repeats the process. The paper shows that the algorithm always terminates and does
so in polynomial time.
The remarkable aspect of this algorithm is the simplicity brought on by assuming the
existence of a teacher. The assumption is very reasonable; as humans, we expect a lot more
from our teachers. This usage of a teacher is the main motivator for one of the two new
approaches to learning HMMs in this thesis.
We provide a sample run of Angluin's algorithm in Appendix A.
3.2.1
Learning Nondeterministic Finite Automata
Recently, researchers have shown in [5] that Angluin's teacher idea can be extended to learn
not only regular sets. This result is encouraging for us because it means that the idea holds
more merit than we previously perceived.
Like Anguin's algorithm, the NFA algorithm uses an observation table and the conditions
of closedness and consistency. Instead of deterministic finite automata, however, it targets
residual finite-state automata, a class of nondeterministic finite automata (NFA) that is
more general than the deterministic finite automata.
Nondeterministic automata can express regular sets in an exponentially more compact
way than deterministic automata.
A classic result in complexity theory says that any
nondeterministic automata can be replaced by a deterministic one, with at most exponentially
many states. Being able to learn NFAs means being able to learn languages exponentially
more complex than regular sets. This result is encouraging because HMMs are more related
to NFAs than they are to DFAs, since the sequence of states and emissions in an HMM are
probabilistic. The connection between HMMs and NFAs is still not obvious, but this recent
result suggests that the teacher assumption can be a powerful tool.
3.2.2
Equivalence of pNFAs and HMMs
NFAs allow multiple transitions from one state per action, but there is no notion of
probability.
Probabilistic NFAs (pNFAs) bring us a step closer to utilizing Angluin's
algorithm to learn HMMs. In pNFAs, each transition caused by an action has an associated
probability of occurring.
Figure 3-1 shows a pNFA, taken from [8].
In fact, Lemma 5 in [5] states that pNFA without accepting states and HMMs can be
transformed into each other. The equivalence comes from the fact that both structures
utilize states, outputs, transition probabilities, and output probabilities to uncover a model
that best explains the distribution of outputs that are observed.
The process involves utilizing HMMs with transition emissions (HMMTs). Figure 3-2,
taken from [8], shows an HMMT equivalent to the pNFA in Figure 3-1.
For completeness, Figure 3-3, taken from [8], shows an HMM equivalent to the HMMT
and pNFA above.
0.4
0.6
0.27
1
a 0.02(
Figure 3-1: An example of a pNFA, taken from [8]. Each action has an associated probability
of occurrence given the state.
0.4
0.1(
[a 0.3]
[b 0.7]
0.9
[a 0-2]
[b 0.81
0.6
0.3
[a 0.91
[b 0.1]
[a 0.8]
[b 0-2]
Figure 3-2: An example of an equivalent HMMT, taken from [8]. Unlike an HMM, HMMTs
emit outputs during state transitions.
0.36
0.1
12
11
[a 0-2]
[b 0.8]
[a. 0.3]
[b 0-7]
0.7
0.1
21
[a 0.8]
[b 0.2]
0.42
0-9
07
_
0-3
22
0.3
[ 0..9]
[ 0.1]
0.18
Figure 3-3: An example of an equivalent HMM, taken from [8].
Note that converting a pNFA into an HMM required increasing the number of states.
Such is not always the case. Nonetheless, the process of converting an HMM into a pNFA
is much simpler, preserving the number of states. Figure 3-4 shows an example.
3.3
The Spectral Algorithm
Hsu et. al [13] note that learning HMMs is provably hard. However, they go on to write
that the difficulty is divorced from what we are likely to encounter in practical applications
and that making reasonable assumptions helps tremendously. They make the following
assumptions.
(i) 7r > 0 element-wise.
(ii) 0 and T are rank m, where m is the number of states.
(iii) UTO is invertible, where U "preserves the state dynamics". Details are in [13].
0.04
0.1
0.04
0.36
1
1
b00
12
00.3
lb 0.81
03
[b 0.71]
0.36
M
a0
-0
0.42
21
a 0-
-
1
b 0.2
0.7
0.14
08
0.9
22
03
[a~~~
0021
lb21~~
[a1
[b
0A42
0.7
00.21
b.0 7
0.0 0.7227.
b 011b
0.8
01
0.18
0
0.0
b
aa0.97
0.42
0.18
Figure 3-4: Converting an HMM into an equivalent pNFA, taken from [8].
The first statement, that 7r > 0, says that every state can serve as the initial state. This
assumption may seem arbitrary, but it is necessary because of the way that the algorithm
works. It seems to impose an unreasonable requirement.
To our knowledge, it is not
a trivial task to convert an HMM into one where all states can be initial states. The
second assumption is rather interesting, as it is phrased in linear algebraic terms that seem
unrelated to HMMs. It is actually a very powerful assumption that says that not only can
no two states have the same output distributions, but also no state can have a distribution
that is a mixture of other states' distributions. Also, a more subtle requirement that follows
from the second assumption is that the number of distinct outputs must be at least the
number of states, since the rank of 0 is at most the number of its rows, which is the number
of distinct outputs. Nonetheless, it is still possible to produce an accurate model - it would
simply have more states than the most concise model.
The punch line in [13] is that its algorithm can calculate probabilities of output sequences
without knowing the latent parameters. The paper mentions that the parameters can be
recovered with a little more work, but their determination is unnecessary. We include this
summary of their algorithm to illuminate the role linear algebra plays in HMM learning.
Chapter 4
The Two New Approaches
4.1
4.1.1
The Extended Angluin's Algorithm
Motivation
The strategy that others have employed in the related work section suggests that perhaps
learning HMMs is hard because we are asking the wrong question. Our strategy is to glean
from the related work mentioned earlier and put the HMM learning problem on a more
practical footing by introducing assumptions. Namely, we wish to continue in the style of
Angluin and assume the existence of a teacher that is able to provide explicit performance
guarantees that our algorithm should be able to achieve. This capability should allow our
algorithm to measure how well it is doing and make improvements accordingly.
We are inspired by the success that other researchers have had in this area, but we
find Angluin's idea intriguing and especially appropriate for this learning problem, because
human learning seems to closely resemble the learning style in Angluin's paper. As humans,
we learn from queries and counterexamples, formulating the simplest theory that explains
everything we observe. Only by seeing counterexamples do we make amendments to our
theory. Also, we make our observations under uncertainty, and realize that what has
happened is not necessarily indicative of the underlying circumstances.
While DFAs are agnostic to transition, output, and initial state distributions, Angluin's
treatment of states in a DFA as equivalence classes of pasts that yield the same future
is a great insight into how we can distinguish states in an HMM. Perhaps we can utilize
Angluin's DFA learning algorithm to at least learn the bare-bones structure of an HMM.
4.1.2
Issues with Learning HMMs
There are a number of issues with applying Angluin's idea to learning HMMs that we
address below.
4.1.2.1
No Agent in HMMs
In DFAs and NFAs, outputs are actually actions made at states. An action is determined by
the agent traversing the automaton. In an HMM, the outputs at each state are determined
randomly according to an output distribution. Unlike a finite automaton, we have no say
in deciding where we are going in an HMM. This issue is a problem since we want to be
able to prod the HMM to see whether certain paths are possible, as in Angluin's algorithm
with DFAs.
We can get around this problem by assuming that we can sample enough sequences from
the HMM to be able to find the sequence we wish to execute.
4.1.2.2
Deterministic Actions in DFAs
Each action made by an agent in a DFA moves him into exactly one state. That is, the
action taken by the agent at a given state uniquely determines the next state. In an HMM,
a state may emit a certain output but may travel to one of many states.
We can get around this problem by utilizing an NFA, which allows an agent to transition
to multiple states at the same time given an action. As mentioned earlier, the recent paper
[5] shows us that it is possible to extend Angluin's algorithm to learn NFAs.
4.1.2.3
No Accepting States
In a DFA and NFA, there are accepting states, which signal ends of action sequences. There
is no equivalent concept of an accepting state in HMMs. The significance of an accepting
state is not so much that it signals the end of an action as that it forces action sequences
to be finite. The accepting states basically allow us to say that certain finite sequences are
valid sets of actions in a DFA or NFA. The appropriate analogue for HMMs would be to
have a state that ends the emissions. There are a number of ways we can implement this
idea.
1. (Stopping State Approach) Assume that we can ask the HMM to stop automatically
once it reaches a certain state.
2. (Accept All Approach) Assume that every state in the HMM is an accept state. In
this approach, every output sequence generated by the HMM is an accepted sequence.
3. Assume that we can transform the HMM into an equivalent pNFA and assign states
to be the starting states and accepting states.
We remark that option 1 is identical to option 3, because the process of transforming
an HMM into a pNFA preserves all of the states, as noted by example in Figure 3-4. So
really, we have two options.
Note that in the first case, we are assuming that we can get more information from an
HMM than we normally could.
4.1.2.4
Faux-Regular Sets
HMMs with accepting states produce finite output sequences. We gather these sequences
into a set and train the NFA on this set. We are implicitly assuming that this set is a
regular set. The NFA learner can only learn from regular sets and it is not guaranteed that
sequences sampled from HMMs will form a regular set, especially if we limit the number of
samples. If everything that the learner queries for is in the set, then there is no problem.
The problem arises when the learner queries for something that should be in the set but
is not, leading to contradictions. It is likely that the NFA learner will fail to terminate in
certain cases.
Probabilistic Angluin: A Possible Extension
4.1.3
Another way to apply Angluin's algorithm to learn HMMs is to use a softer version of
membership querying, in which we populate the table with probabilities, not ones and
zeroes.
Since we can sample many sequences from the HMM, why not take advantage
of the distribution of outputs? For example, instead of simply saying that the sequence
created by appending suffix '011' to '0' is accepted, we can write in the probability that
'0' follows '011', write the probability of '0110' occurring, or write the probability of seeing
'0110' among all four-letter output sequences.
In terms of finding a map of the HMM, we can treat any entry greater than 0 as a valid
entry, so that we can still recover the transition structure.
This idea seems to have potential, but we have not been able to address the following
issue that arises. Once we populate a table with probability values, we cannot claim that
rows that look similar are in the same state. A row in Angluin's table represents the possible
futures from a state given a sequence of actions (the suffix). Because the actions uniquely
determine the resulting state, the outputs from that state are always the same. However,
in the case of probabilistic outputs, given a sequence of outputs (the suffix), there is a
probability distribution of states that the HMM could be in. The HMM is not in any one
single state, but a mixture of different states.
For example, suppose we are trying to learn the relatively simple HMM from Figure
3-3.
Suppose we have the ability to convert it into the pNFA from Figure 3-1.
The
initial state distribution is (0.4 0.6), where the states are enumerated from left to right.
The output distributions of the states are
0.29 0.71
0.83 0.17
/
, where each row represents
each state respectively. Hence, the row corresponding to the initial state would look like
0.4(0.29 0.71) + 0.6(0.83 0.17) = (0.614 0.386). In general, given a sequence of outputs that
make a suffix, the corresponding row would look like a(0.29 0.71) + (1 - a)(0.83 0.17),
where a is the probability of being in the first state given the prefix of outputs. Because
the system is probabilistic, practically any value of a could be a valid probability of being
in the first state. It is possible for the resulting table to have no two rows that similar in
Euclidean distance.
These considerations lead us to concepts that we describe in Section 4.2. Basically,
while the fact that every row in the table looks like a(0.29 0.71) + (1 - a)(0.83 0.17) is a
problem, it is also an insight, since it says that every row in the table can be expressed as
a linear combination of two basis vectors. This realization suggests a perhaps more linear
algebraic approach, which we describe in Section 4.2, the nonnegative matrix factorization
approach. Angluin's paper shows that a state can be recognized by looking at what decisions
made after entering the state. This realization suggests that the same does not hold in the
probabilistic case, where more sophisticated tools are required.
4.1.4
The HMM Learning Algorithm
We choose to use the NFA learning algorithm to discern the transition map and not the
probabilities associated with the transitions and outputs. The insight is that even though
NFAs do not recognize probabilities, a sequence of actions in an NFA are possible if and
only if that sequence of actions is a possible sequence of outputs in the corresponding pNFA
and HMM.
Thus, our strategy is to take sequences of outputs from the HMM, assume them as the
regular set, run the NFA learning algorithm to learn that regular set, find out the transitions
between states, and then assign probabilities to them through some other method, say
Baum-Welch.
Because we would not be given any probabilistic information, it can be
argued that we are not learning too much information. Nonetheless, by eliminating certain
transitions altogether, we may be able to decrease the number of latent parameters drastically.
Figure 4-1 is a diagram of the entire Angluin approach to learning HMMs.
We will refer to this algorithm as the extended Angluin algorithm.
1 a. Request stopping states in HMM.
I. R t s1b.
Work with unaltered HMM.
a 0]
4
* b 1]
[a 0.5]
(b0.5]
1
[ a21
]
1
[b0]
[b 0.5]
0.5
2
b 1]
C.
1
[a 1]
[b 0]
0
[a 0]
[a 0.5]
0.5
,-
2. Gather samples. This set represents the regular set that the NFA will learn.
ab a a b
aa
abababa
I
3. Use Angluin's algorithm to learn NFA.
[a, b]
a, b]
4. Learn probabilities via Baum-Welch.
All
Baum-Welch
HMM
Figure 4-1: The Extended Angluin HMM Learning Algorithm. la and lb represent the two
options we have in remedying the accept state problem.
4.2
4.2.1
The Nonnegative Matrix Factorization Algorithm
Motivation
Consider the representation of HMMs in terms of matrices. The initial output is distributed
according to the vector
07r.
The output distribution, given an initial output T, is
OTOW
10,7r
0TOTW
O TWr
10T7
Now, note that TOT7r is a vector whose elements describe the probability of being
in a certain state given the initial output r. The denominator 10,7r just normalizes the
probabilities to sum to 1.
From this representation, we can interpret output distributions as follows.
Given a
certain output sequence, we first compute the distribution of states that the HMM is in,
say v. Then, the output distribution is vOT.
An example is given in Figure 4-2.
4.2.1.1
The Observation Matrix
We define the observation matrix of an HMM as follows.
The rows of the matrix represent prefixes and the columns represent unit-length suffixes.
The entry at which row r intersects column c is the probability of seeing the output sequence
represented by c given the output sequence represented by r.
We stipulate that the prefixes and suffixes are ordered in lexicographical order, from
shorter to longer.
(
0.2
0.2
0.4
oTy oT
07=0.4
0
0.2
0.4
0.6
0.2
0.4
0.2
0.5
0.5
(0.5 0.25 0.25).
, TT
) IrT
= (0. 5 0.2 5 0.2 5).
In states 1, 2, and 3, the output distributions are (0.2 0.2 0.6), (0.4 0.4 0.2), and (0 0.5 0.5)
respectively. Initially, the probability of being in state 1, 2, and 3 are 0.5, 0.25, and 0.25
respectively. Hence, the distribution of the initial output is
0.5(0.2 0.2 0.6) + 0.25(0.4 0.4 0.2) + 0.25(0 0.5 0.5)
=
FT oT
= (0.2 0.325 0.475).
Given that the first output is 1, the HMM's state distribution is TO 17 = (0 0.5 0.5)T. It
follows that the distribution of the second output, given that the first output is 1, is
0(0.2 0.2 0.6) + 0.5(0.4 0.4 0.2) + 0.5(0 0.5 0.5)
= A (7rTOT TTT) OT
(0.2 0.45 0.35),
where A is a normalization constant.
Figure 4-2: An example of how output distributions are determined by 0
distribution of possible states. The outputs are 1, 2, and 3.
and the
Figure 4-3 is an example, where the HMM is from Figure 4-2.
Note that these observation matrices are exact. Measured observation matrices are likely
to have a significant amount of noise. However, we emphasize that observation matrices
can be measured to any degree of accuracy, simply by extracting more output sequences
from the HMM.
Figure 4-3 is an example of an observation matrix (values rounded to a suitable number
of digits).
4.2.1.2
Factorization of the Observation Matrix
Consider the observation matrix as a matrix A. We must have the following matrix equality.
Observation matrix
for HMM from Figure 4-2
'
1
2
3
11
1
0.2
0.2
0.2
0.3053
0
2
3
0.325
0.475
0.45
0.3538
0.3579
0.5
0.35
0.4462
0.3368
0.5
Figure 4-3: An example of an observation matrix.
COT = A,
where 0 is the familiar output matrix and C is the coefficient matrix, every row of which
holds the distributions of possible states that the HMM could be in.
For the example given in Figure 4-3, the equality would look like Figure 4-4. We have
truncated the table to include only five rows.
,rT
\
A1,rTOT T
2
A3,r TOTTT
A4,rTOITTOITT
=
0.2
0.325
0.475
0.2
0.45
0.35
0.2
0.3053
\ 0
0.3538 0.4462
0.3579
0.5
0.3368
0.5
I,
where Ai are normalizing constants.
Figure 4-4: Factorization of an observation matrix.
This representation of the observation matrix is the motivation for using nonnegative
matrix factorization to learn HMMs.
We have just shown that every HMM admits a
factorization COT = A of the observation matrix. Thinking conversely, perhaps we can
recover C and OT given just the measured observation matrix A.
4.2.2
Recovering 0, T, and
7r
from an Observation Matrix
In this section, we describe how to recover 0, T, and 7r to produce an HMM from a measured
observation matrix.
(i) Recovering 0
Once we factor A into C and OT, the matrix 0 is readily apparent. Note that because
the factorization is not unique, we are likely to get different values for 0 depending
on the factoring algorithm.
(ii) Recovering 7r
Note that the first row of the measured observation matrix is a distribution of outputs
given no prefix. In other words, it is the distribution of outputs given that the HMM
is in its initial state. It follows that the first row of C, which is the distribution of
possible states, is the initial distribution of states. We have 7r = C(1,:)T.
(iii) Recovering T
The task of recovering T is the most cumbersome. T is not evident at first glance,
but it is clear that implicitly, C involves T.
We present two methods. The first one relies on conditions that may not always hold,
but is more precise. The second always works, but it is usually not as precise.
Denote 0' and 7r' to be the parameters determined by the factorization. Let T' be
the transition matrix we are looking to find.
Method 1
First, we require 0' to be invertible. One way to understand this requirement intuitively
is that no state's output distribution can be a linear combination of other states'
output distributions. Second, we require the following matrix to be invertible, where
m is the number of distinct outputs.
O17r'
O27r'
-
,7r')
Note that O'r is an m x 1 column, so the matrix is an m x m matrix.
Let M be the submatrix of A obtained by taking rows 2 to m + 1.
That is, M = A(2 : m + 1,:). Now note that M is measuring
( O'T'O'r'
O'T'O'27r' -.-.O'T'Or'
-
It is a matrix of output distributions given an output of length 1.
In other words,
O'T'
( O'r'
O' - r'
O21r'
so that
T' = O'-M
O21r'
--
/'
1
(4.1)
Method 2
Let Ctrunc be the submatrix of C obtained by taking rows 2 to m + 1. We call this
matrix the C submatrix.
Since Ctrunc represents
wITOI
ITiT
7T 2
T'IT,
7F Toi)
we basically have another factorization problem, except with one of the factors already
known. That is, we wish to find T' that minimizes
T
If
*
-
Ctrunc
F
is invertible, we simply have
(
wTO
T =
\
-
Ctrunc.
Otherwise, we use a variant of the nonnegative matrix factorization algorithm (discussed
in Section 4.2.5) that fixes one of the two factor matrices.
4.2.3
Issues with Learning HMMs
There are a number of issues that we have to address before we can feasibility use matrix
factorization to recover the HMM.
4.2.3.1
Stochastic Factorization
If we take A and split it into C and OT using today's popular factorization algorithms, the
factors will not necessarily be row-stochastic. Our algorithm must enforce the row-stochasticity
constraint.
4.2.3.2
Constructing the Observation Matrix
In order to construct A, we need to know the total number of outputs, which is the number
of columns of A. Assuming that we have access to an HMM and can generate as many
samples from it as we would like, this assumption is not unreasonable. From here on, we
will assume that we know the number of outputs.
Note that we are free to determine the number of rows of A. The number is how many
sample output sequences we have.
4.2.3.3
Factorization into C and OT
We must specify the row count of OT. The interpretation of the row count of OT is simple
- it is the number of states in the HMM. Unfortunately, knowing the number of states is
not a reasonable assumption in learning HMMs. However, the number of rows in OT can
be specified as part of the factorization approach. We can factor the observation matrix
into OT's with different row counts and take the one that yields the best factorization, in
some sense. For instance, we can measure the likelihood on held-out data.
4.2.3.4
Factorization Measure
We describe two interesting measures that we can use to gauge the accuracy of our factorization.
One is the Kullback-Liebler (KL) divergence, a non-symmetric measure of the difference
between two probability distributions P and Q. The KL divergence is non-symmetric and
therefore not a true metric. It does not even satisfy the triangle inequality. For two matrices
A and B, the divergence is
D(A||B) =
(Aij
log
- Aij + Bij
The other is the Frobenius norm, which is the more intuitive measure of distance between
two matrices. For A and B, the square of the Frobenius difference is
|A - B||' =Z(Aij - Big)2
ii
It is simply the Euclidean distance between A and B and is a true metric, satisfying the
triangle inequality unlike the KL divergence.
We will be factoring the observation matrix A using the Frobenius norm because most
factorization techniques today utilize it.
We can now express our problem mathematically - it is to find row-stochastic matrices
C and OT that minimize
||COT
-
A|.
Note that we will not necessarily find C and OT that perfectly factor A. In fact, because
we cannot sample infinitely many sequences, it would be virtually impossible to perfectly
factor the measured observation matrix. Perhaps it would be more accurate to call this
method nonnegative matrix approximation instead of factorization, but we will use the
term factorization for convenience.
It is important to note that under this measure, having many identical rows in A is not
redundant, because deviation of the factorization from any of those rows is multiplied.
4.2.3.5
Trivial Factorizations
If the number of rows is greater than or equal to the number of columns in OT, then there
exists a trivial factorization that yields 1COT - AI2 = 0. The reason is that in such a case,
OT can simply be the identity matrix with arbitrary numbers to pad the bottom, as shown
in Figure 4-5.
This result is not a problem with the factorization, since after all, we produced row-stochastic
C and OT that minimize the required distance measure. The problem is that in terms of
learning the HMM, it does not make sense to have a state that is never reached (as would
be the case for the state represented by the last column in C in our example above).
/1
0
0
0\
0
1
0
0
0
0
0
0
1
0
0
1
\j0.1
0.2
0.3
(A 0)
=A.
0.4j/
Figure 4-5: Trivial factorization of an observation matrix. In this example, A has 4 columns.
o is a column vector full of zeroes.
Nonetheless, we are interested not so much in recovering the exact parameters of the HMM
as being able to correctly estimate output sequence probabilities. If having a useless but
benign extra state does not adversely affect the HMM's ability to accurately calculate
probabilities, we should not be worried. To remove the extra state, we can try re-running
the algorithm with a different seed.
A more interesting situation occurs when the number of rows equals the number of
columns in OT. Then the factorization
A
1 0 0
0
0
0
1 0
0 0 1 0
0
0 0
1
is exact and the interpretation, that each state emits only one kind of output, is entirely
plausible. As with before, this result is not a problem with the factorization. Also, as
with before, we should not be too worried, since our goal in learning HMMs is to find the
parameters that produce output sequences with the same likelihoods as the actual HMM.
It might well be the case that there exists a set of parameters with 0 as the identity matrix
that satisfies that criterion.
Nonetheless, in an effort to prevent the algorithm from always selecting the identity
matrix for OT and possibly confining itself to a local maximum in learning the HMM, there
is a need to randomize the seeds in the factorization algorithm so that 0 does not necessarily
come out to be the identity matrix.
Non-Uniqueness of Factorization
4.2.3.6
This issue is related to trivial factorization. We ask ourselves, how special is a factorization?
Lemma 1. Suppose we have a factorization COT = A and let V be a matrix such that the
following hold.
" V is row-stochastic
* V is invertible
* V-1 is nonnegative
Then C' = CV, OIT = V --1 0 T is another factorization with C'O'T = COT.
Proof. First, note that C'O'T = COT is easy to verify. It suffices to show that C' and
OiT are both row-stochastic, which is clear since row-stochastic matrices are closed under
multiplication.
E
Nonetheless, we claim that it is not trivial to produce different factorizations from a
given one.
Lemma 2. There does not exist an n x n matrix V with the following properties.
" V is not a permutation matrix
" V is row-stochastic
" V is invertible
* V-
1
is nonnegative
Proof. Suppose such a matrix V exists.
We have VV- 1
In. Since V is row-stochastic, V1 = 1, where 1 is a column vector of
ones of appropriate size. Thus, 1 = V-11. Since V-1 is nonnegative, we have just shown
that it is row-stochastic.
Denote the entries of V as rij and the entries of VWe have
>i rjcij
= 1 for all
j.
1
as cij.
By the AM-GM inequality, we have
= 2.
Z2rjcij
Zr7-Fc
Moreover,
)2
2Zr i
(4.2)
2
(4.3)
and
2
c
c y,
so that we have
2
1+
cij
2
2
+
= (Erji)
)2
so that
>cij >
1. Summing over all
c
j,
2rji
+ c?. 2
2rjicij = 2,
(4.4)
we get
E ) cij > ni.
J
But
EJ Ei c
i
= EE cu =E 1 = n, so it follows that all of the inequalities used in the
equations above must be equalities.
In particular, from Equation 4.2,
rji)2 =
(
Ei r3
, so that
rjarjb = 0,
afb
or
rji
jk / 0,
or
rji(1 - rji) = 0.
Since V is row-stochastic, every entry in V is in [0, 11. Thus, rji(1 - rji)
0 for all i, j.
But now every entry rji in V is either 0 or 1. Since V is row-stochastic, it follows that V
must be a permutation matrix, a contradiction.
D
Note that when V is a permutation matrix, we do get a new factorization. However,
the factors produce the same HMM, since permuting the rows of
0
T amounts to relabeling
the states.
On the other hand, it is not impossible to generate a new factorization.
We can
sometimes find a matrix V such that the following hold.
" V is row-stochastic
" V is invertible
* V-1OT is nonnegative
In such a case, we can say the following.
Lemma 3. Suppose we have a factorization COT = A and let V be a matrix such that the
following hold.
* V is row-stochastic
*
V is invertible
SV-1OT is nonnegative
Then C' = CV, O'T = V-1OT is anotherfactorization with C'OiT = COT.
We omit the proof.
Proposition 1.
There exists a row-stochastic matrix OT and a matrix V such that the
following properties hold.
e V is row-stochastic
* V is invertible
e V--oT is nonnegative
Proof. We simply need to provide an example.
Let
OT
0.45
0.55
0.4924
0.5076
0.55
0.45
0.6060
0.3940
Let C' = CV and OIT = V--lOT. Now C' is row-stochastic since C and V are both
row-stochastic.
OIT = V-OT is row-stochastic since
V-lOT
Below is another example.
Let
(
0.8967
0.1033
0.0167
0.9833
0.5307 0.2907 0.1786
0.2241
0.0455 0.2284 0.5020
0.6462
0.4418
0.3316
0.0222
0.0095
0.1182
0.8622
0.0101
0.4493
0.1089
0.1089
0.4577 0.1653
0.2681
0.4502
0.3376 0.2122
0.3589
0.0828
0.3559
0.2024
Then
V--OT
0.1210
0.3798
0.4992
0.2985
0.6995
0.6590
0.0425
0.2875
0.0130
0.6579
0.2191 0.1230
We conjecture that if there exists at least one solution, there are an infinite number of
them.
Conjecture 1. Let M be a row-stochastic matrix. If there exists at least one matrix V with
the following properties, then there are infinitely many of them.
" V is row-stochastic
" V is invertible
" V-1M is nonnegative
Empirically, this conjecture seems to be true.
Perhaps most importantly, we make the following proposition about generated factorizations.
When we say that two HMMs are equivalent, we mean that they assign the same probabilities
to all output sequences.
Proposition 2. Let an HMM be parameterizedby 0, ,r, and T. Suppose V is a row-stochastic
invertible matrix such that V-1
0
T
is nonnegative. Then the HMM parameterized by 0'
O(VT)-1 and 7' = VT 7r is not necessarily equivalent to the original HMM. In other words,
there may not exist a transition matrix T' such that the HMM parameterized by 0', ,r', and
T' is equivalent to the original HMM.
Proof. We only need to provide an example of such an HMM with parameters 0, 7r, and T.
Choose an HMM such that 0' is invertible. For example, we can start with an invertible
0. Let 1, 2,
...
, m be the outputs.
Suppose that T' is a transition matrix such that the HMM parameterized by 0', 7r', and
T' is equivalent to the original HMM. Then we must have
O'T'O'7r' = OTO 1 7,
O'T'O'7r' = OTO 2 7,
O'T'O 7r' =- OTOm7r.
Since 0' is invertible, we have
T'O'7r' = O'-0OTO1 7r,
T'O7r' = O'-OT027r,
T'O, 7r' = O'-1OTOm7r.
We can rearrange this system of equations to the following matrix form.
T'(Oir O27r
r)
. -- O
=O'
'OT (017r 027r - - - Omr).
Note that O/r' are m x 1 column vectors, so the matrix being multiplied to T' is an
m x m matrix.
By choosing our HMM carefully, we can make the m x m invertible and be able to solve
for T' as follows.
V=
O'10T
(017
O27r
Om7r)
(O'r' o'r
--
r')
(4.5)
We emphasize that if indeed the m x m matrix in question is invertible, Equation 4.5 is
a necessary requirement for the two HMMs to be equivalent.
Let
o=<
0.4
0.4
0.5 0.1
0.3
0.1 0.6
Let
V
0
0.3
0.3
00 1
,§
1
1 0
,
00
1 0 0J
0.5749
0.1159
0.3092
0.4011
0.5630
0.0359
0.0433
0.0419 0.9148
Then
0'=
0.2628
0.5046
0.2924
0.5663
0.4808
0.0605
0.1709
0.0146 0.6471
We can solve for T' using Equation 4.5. We get
(
0.5749
7'
0.1159
0.3092
0.0433 0.0433 0.0433
T'
0.0419
0.0419
0.0419
0.9148 0.9148 0.9148
We are left to verify whether 0', T', and ir' produce an HMM equivalent to the one
produced by 0, T, and ir. It is shown to be false, since OTO1TO 1 7r is significantly different
from O'T'O'T'O'7'. Thus, this set of parameters shows that newly generated factorizations
do not necessarily lead to equivalent HMMs.
E
Proposition 2 suggests that by generating new factorizations, we may be able to improve
the accuracy of the recovered HMM. Note that the newly generated factorizations do not
improve the factorization metric at all - recall that CO = C'O'. This proposition merely
says that these new factorizations may represent distinct HMMs.
In addition, Proposition 2, via Equation 4.5, gives us an algorithm, which, under certain
conditions, can test whether a generated factorization leads to an non-equivalent HMM.
4.2.3.7
Difficulty of Factorization
While finding a good factorization C, OT does not necessarily imply that we have found the
best value of OT to represent the HMM, we would expect the problem of finding C, OT to
be pretty hard. Indeed, [21] indicates that finding the globally optimal solution minimizing
||AB - FIF for nonnegative matrices A, B, F is NP-hard. As we mentioned earlier, for this
reason, many researchers call the NNMF problem as the nonnegative matrix approximation
problem, as the exact solution is often hard or impossible to find because of algorithmic
inadequacies or noisy data. We suspect our problem to be no less easy, since we require A
and B to be row-stochastic on top of being nonnegative. Nonetheless, there are methods
that can provide reasonably good local solutions to the nonnegative matrix factorization
problem, which we describe in Section 4.2.5.
4.2.4
Sparse Observation Matrices
As we mentioned earlier, an observation matrix A is likely to have many, possibly infinitely
many, different factorizations into row-stochastic matrices C and OT. However, in the case
that A is sparse, intuition tells us that we should expect far fewer factorization possibilities.
Lemma 4 (The Zero Lemma). Let A be a row-stochastic matrix with A = XY, such that
X and Y are row-stochastic matrices. We define a subset of A's entries S to be null if it
satisfies the following properties.
" a= 0 for all a E S
" No two entries in S share the same row or column in A
Then the union of the entries of X and Y must have at least
|Sik
zeroes, where k is the
number of rows in Y.
Proof. Let a E S. Then a = rc, where r is a row of X and c is a column of Y. Since X and
Y are nonnegative matrices, we must have rici = r 2c 2 = - - = rkck = 0. It follows that for
each i, at least one of ri and ci must equal zero. Thus, between r and c, there must be at
least k zeroes.
Since entries in S do not share rows or columns, each entry adds at least k zeroes to the
total number of joint zeroes among X and Y.
D
The Zero Lemma describes the extent to which sparsity aids factorization. It states that
an observation matrix with a large null set forces the factors to have many zeroes, vastly
reducing the dimensionality of the factors. Thus, we expect sparse observation matrices to
yield very good factorization results.
4.2.5
4.2.5.1
Algorithms for Factoring
Lee and Seung's Algorithm
The article [14] written by Lee and Seung started a flurry of research into nonnegative
matrix factorization. The multiplicative update algorithm for NNMF, which minimizes the
Frobenius norm, is outlined in Figure 4-6.
W = rand(n, k);
H = rand(k, m);
for i 1 : maxiter
H H.* (WT A)./(WTWH +10-9)
W =W.* (AHT)./(WHHT + 10-9)
end
Figure 4-6: The Lee-Seung algorithm, written in the syntax of MATLAB. The 10-9 in each
update is added to avoid division by zero.
[4] notes that contrary to [14], the Lee-Seung (LS) algorithm does not necessarily
converge to a local optimum, and may converge to a saddle point. [4] also notes that the
LS algorithm is in the spirit of a more general class of algorithms called gradient descent
algorithms.
Benefits
There are a few benefits of using LS over ALS. In fact, we need LS because it can be
used to fix one of W and H and solve for the other. Also, LS works even in the case k > m,
whereas ALS does not.
Drawbacks
The main drawback is that LS is slow to converge, if it converges at all. Also, once an
element in W or H becomes 0, it remains 0. ALS is much faster and does not suffer from
the zero element problem.
4.2.5.2
The ALS Algorithm
[4] notes that the other large class of NNMF algorithm is the alternating least squares (ALS)
class. In these algorithms, a least squares step is followed by another least squares step in
an alternating fashion. The algorithm is given in Figure 4-7.
W = rand(n,k);
for i =1 : maxiter
Solve for H in WTWH = WTA.
Set all negative elements in H to 0.
Solve for W in HHTWT = HAT.
Set all negative elements in W to 0.
end
Figure 4-7: The ALS algorithm.
[4] notes that the ALS algorithm can be very fast, depending on the implementation.
MATLAB has a NNMF function that primarily utilizes the ALS algorithm.
Benefits
The main benefit to using ALS over LS is that ALS has fast convergence. Also, as
mentioned earlier, zeroes that appear in W or H are not final in ALS.
Drawbacks
The biggest drawback to using ALS is that it cannot be used when k > m. Also, ALS
cannot be used to fix one of W and H and solve for the other.
4.2.5.3
Our NNMF Algorithm
It is clear that in the case k > m, or when the number of states is greater than the number
of outputs, we must use LS. Also, in the case that we wish to fix one of W and H and solve
for the other, we must use LS.
In the case that k < m, we can run both LS and ALS. It is likely that ALS will give
better results.
Nonetheless, there is an important issue that neither algorithm addresses. The issue in
our case is that the matrix factors must be row-stochastic. The LS and ALS algorithms do
not guarantee that the factors are row-stochastic.
We propose the following modified LS and ALS algorithms. Figure 4-8 describe the
modified LS algorithm and Figure 4-9 describes the modified ALS algorithm.
normalize(rand(n, k), 2); % initialize as a random row-stochastic matrix
normalize(rand(k,m), 2); % initialize as a random row-stochastic matrix
W
H
for i = 1 : maxiter
H = H.* (WTA)./(WTWH + 10-');
W
H
W
end
W.* (AHT)./(WHHT + 10-9)
normalize(H,2);
normalize(W, 2);
Figure 4-8: The modified Lee-Seung algorithm, written in the syntax of MATLAB. The
10-9 in each update is added to avoid division by zero.
W = normalize(rand(n,k), 2); % initialize as a random row-stochastic matrix
for i = 1 : maxiter
Solve for H in WTWH
= WT A.
Set all negative elements in H to 0.
Solve for W in HHTWT = HAT.
Set all negative elements in W to 0.
H = normalize(H,2);
W = normalize(W,2);
end
Figure 4-9: The modified ALS algorithm, written in the syntax of MATLAB.
The changes we added are simple. At the end of each iteration, we normalize the rows of
W and H so that W and H become row-stochastic. Unfortunately, it is not clear whether
these altered algorithms are likely to yield good solutions. These altered algorithms may or
may not converge to local optima. In fact, empirically, the modified ALS algorithm either
did very poorly or very well while the modified LS algorithm did moderately well across the
board.
As a result, although we used both modified algorithms, we mainly sided with the
modified LS algorithm to factor matrices.
4.2.6
The HMM Learning Algorithm
In the previous section, we have described how we can take an observation matrix and
extract 0, T, and 7r from it. We now input these values as seeds into Baum-Welch to get
an even better model.
Figure 4-10 is a diagram of the entire NNMF HMM learning algorithm. It is a graphical
summary of the steps we discussed in previous sections.
We will refer to this algorithm, including the final Baum-Welch step, as the NNMF
algorithm.
1. Gather samples.
2. Compile measured observation matrix.
1231121231...
1213212213...
2132131211...
1112232113 ...
3122332132...
5. Factor Csubmatrix.
A
4. Extract 0 and n.
3. Factor measured observation matrix.
C
A
LS/ ALS
LS
I
6. Extract T.
T
'
7. Seed into Baum-Welch.
0, T,n
8. Recover optimized parameters.
B'aum-WTch
0', T', n,
Figure 4-10: The NNMF HMM Learning Algorithm.
THIS PAGE INTENTIONALLY LEFT BLANK
Chapter 5
Methodology
In this chapter, we discuss issues regarding how we will implement, train, and evaluate
HMM learning algorithms.
5.1
Implementation Issues
5.1.1
Baum-Welch Termination Protocol
Because of the nature of Baum-Welch, the algorithm continues to run until the user terminates
it. The algorithm continues to make progress in maximizing the likelihood of the training
data, but as it gets closer to a local optimum, it may slow down. We terminate the algorithm
when the likelihood increments are smaller than some threshold, which we arbitrarily decide
as 0.1%.
5.1.2
NNMF Observation Matrix
As we mentioned earlier, the measured observation table is likely to have a lot of noise. In
particular, if there are not enough data points for a given prefix, the conditional output
distribution will be very coarse. To prevent this phenomenon, we omit rows in the measured
observation matrix that do not have enough data points. We arbitrarily decide the threshold
to be 250 data points for a given prefix.
5.2
Training and Testing
In training the algorithms, we considered the following parameters.
(i) Number of output sequences
We used 5000 and 10000.
(ii) Length of each output sequence
We choose to use the values 4, 6, and 8.
(iii) Number of states
The HMM's states are fixed, but the algorithms being trained can be initialized with
different numbers of states. For our evaluation, we give the learning algorithms the
correct number of states.
(iv) Number of repetitions
Algorithms such as Baum-Welch are heavily affected by random seeds. We choose to
train each such algorithm five times per training instance in order to take variance
into account.
For testing, we generated a set of sequences from the HMM and considered the following
parameters. For different runs of the same HMM, we fixed the test set.
(i) Number of output sequences
We always chose 10000 output sequences.
(ii) Length of each output sequence
The length of an output sequence depended on the number of states in the HMM.
For HMMs with six states, we chose a length of 12. For HMMs with fewer states, we
chose a length of 10.
5.3
Measures of Accuracy
We measure an algorithm's effectiveness by measuring how accurately it calculates the
occurrence probability of output sequences. If S is the set of output sequences, we generate
A(S), the list of the algorithm's calculations of probabilities, and find its distance, in some
sense, to H(S), the list of the original HMM's calculations of probabilities. We emphasize
that it is not so important that an HMM learning algorithm recovers the parameters exactly
as in the original HMM.
5.3.1
Euclidean Distance
In this measure, the accuracy of the algorithm is
S |A(S)i - H(S);|2
where A(S)i and H(S)i are corresponding probabilities.
5.3.2
Kullback-Leibler Divergence
In this measure, the accuracy of the algorithm is
lgH (S ji
lgA(S)i '
where A(S)i and H(S)i are corresponding probabilities and S is assumed to be generated
not arbitrarily, but from the original HMM.
5.4
The HMMs
In this section, we list the HMMs that we will use to train and test the algorithms. Many of
the HMMs have too many states to represent graphically. Hence, we will simply be writing
down matrices to describe the HMMs. After each HMM name are two integers that describe
the number of states and outputs in the HMM, in that order. These HMMs are reproduced
in Appendix B.
We used our own implementation of HMMs in MATLAB.
5.4.1
Simple HMM 3 3
There is a state transition involving no
This HMM is not meant to be hard to learn.
randomness and only one of the states emits all possible outputs.
O=
0.5
0.25
0
0.5
0.25
0
,T=
0.5
0
0.5
0.5
,r=
0.25
.
0.25
0.25 1 0
0 0.25 0.75
5.4.2
0.75
0
Simple HMM 3 4
In this HMM, there are more outputs than states. The HMM was designed so that there is
no randomness in state transitions. Note the similarity to Simple HMM 3 3.
0
0.25
1 0.2
0.55
0
0.05
0.2
0
0.75
0
5.4.3
10502
0.25
0.75
0
0
0.5
0.5
)0.25 )
T=
0.75 0 0.5
0.25
1
02
1
0.25
0
Simple HMM 4 3
In this HMM, there are more states than outputs. There is a deterministic transition from
one of the states and only one of the states emits all possible outputs. Note the similarity
to Simple HMM 3 3.
O
0.5
0.5
0.25
0.3
0.5
0.25
0
0.4
0
0.25
0.75
0.3
,
0 0.2
0.5
0.5
0
0.5
0.25
1
0
0.25
0
0
0
07
0
0.3
.
0.15
0T=,00=
.
5.4.4
Separated HMM 3 4
In this HMM, there are more outputs than states. We call this HMM "separated" because
no state can transition to itself.
0.25
0
0
0.7
0.05
5.4.5
I,
0.5
0.75
0.25
0.25
0.25
Separated HMM 3 4 #2
0
5.4.6
r=
0
0.1
0.65
0
0.35
0.9
0
r
Dense HMM 3 3
In this HMM, the defining factor is that there are not many zeroes in 0 or T.
0=
5.4.7
0.5
0.25
0.25
0.6
0.25
0.15
)
0.25
0.3
0.5
0.4
0.25
0.3
=4V
0.5
0.25
0.25
Dense HMM 4 4
0.5
0.2
0.25
0.05
0.15
0.6
0.1
0.15
, T=
0.25
0.1
0.5
0.2
0.2
0.3
0.05
0.4
I, I.
0.5
0.2
0.2
0.1
5.4.8
o
=
5.4.9
Dense HMM 5 5
0.5
0.2
0.2
0.4
0
0.25
0.3
0.05
0.1
0
0.1
0.1
0.4
0.2
0.4
0.05
0.3
0.2
0.2
0.2
0.1
0.1
0.15
0.1
0.4
(
,T
=
0.25
0.1
0.2
0.5
0.1 0.1 0.1
0.2
0.1
0.4
0
0.5
0.1 0.2
0.5
0.1
0.2
0.05 0.3 0.3 0.2
0.1
, 7rF
J
0.1
.
0.2
0.2
0.1
Separated HMM 5 5
In this HMM, there are no states transitioning to themselves. There are five states and five
outputs.
0
0.2
0.2
0.25
0.1
0.2
0
0.5
0.2
0.1
0.2
0.2
0.3
0.3
0.1
0.3
0.3
0.55
0
0.3
0.5
0.3
0.15
0.1
0.2
0
0.5
0.4
0.1
0.3
0
0.3
0.4
0.1
0.2
0.6
0
0
0.1
0.1
0.4
0
0.1
0.4
0.3
0.1
0.05
0.1
0.1
0.25
0.1
0.1
0.1
0
0.15
,T=
,7r=
0.1
5.4.10
Dense HMM 6 6
In this HMM, there are no zeroes in 0 or T.
0=
0.2415
0.2713
0.1071
0.1737
0.2427
0.1662
0.2506
0.1447
0.0155
0.2044
0.0285
0.1662
0.2242
0.1634
0.4442
0.1843
0.1074
0.1506
0.0147
0.0120
0.2776
0.2000
0.3164
0.1083
0.1097
0.1999
0.0325
0.1555
0.2822
0.2331
0.1593 0.2087 0.1231 0.0821 0.0228 0.1756
5.4.11
0.2533 0.1997 0.2377 0.0668 0.0271 0.0965
0.2
0.2086
0.0782
0.0522
0.0356
0.1973
0.0657
0.25
0.1511
0.2776
0.0468
0.1560
0.2680
0.2587
0.15
0.1984
0.0767
0.2484
0.1271
0.1145
0.0867
0.1
0.0226
0.2425
0.1518
0.4391
0.2511
0.2333
0.2
0.1660
0.1253
0.2631
0.1754
0.1420
0.2591
0.1
Sparse HMM 6 6
This HMM has many zeroes in the transition and output matrices. There are only two
possible initial states.
5.4.12
0.5
0
0
0
0.5
0.75
0
0.3
0
0.9
0
0
0.5
0
0
0
0
0.25
0
0
0
0
0
0
0.5
0
0.7
0
0
0
0
0.7
0
0
0.3 0.1
Diverse Sparse HMM 6 6
In this HMM, everything is the same as Sparse HMM 6 6, with the exception that every
state can be the initial state.
0-
0 0.5
0
0
0.5
0
0
0.3
0.1
0
0.3
0
0
0
0.3
0.5
0
0
0
0.9
0
0.2
0
1
0
0.5
0
0
0.8
0
0
0
0
0.1
0
0.25
0
0
0
0
0
0.2
0.5
0
0
0.2
0.1
0
0
0
0
0.5
0
0
0
0
0.7
0
0.8
0.2
0
0
0 0.5
0
0.7
0
0
0.5
0
0
0
0.1
0.5
0
0.5
0.75
0
Chapter 6
Results and Analysis
In this chapter, we discuss the results of implementing and evaluating the extended Angluin
and NNMF HMM learning algorithms. We begin by writing that the extended Angluin
method failed to produce interesting results. We then compare the NNMF algorithm to
Baum-Welch algorithm by training and testing both of them across the variety of HMMs.
Finally, we analyze some interesting trends we uncovered from our data.
6.1
On the Failure of the Extended Angluin Algorithm
In this section, we find that the Extended Angluin algorithm is not feasible and attempt to
explain why.
We implemented Angluin's DFA learning algorithm in MATLAB according to [1]. We
implemented the NFA learning algorithm in MATLAB according to [5].
In Section 4.1.2.3, we mentioned that pNFAs and HMMs do not have accepting states.
We proposed the following two options to remedy this problem.
1. (Stopping State Approach) Assume that we can ask the HMM to stop automatically
once it reaches a certain state.
2. (Accept All Approach) Assume that every state in the HMM is an accept state.
We proceed to experiment with both of them.
6.1.1
Stopping State Approach: Version 1
We start with an effort to learn Simple HMM 3 3. The state transitions are diagrammed
in Figure 6-1.
0.25
1
0.75 -40
05 0.5
Figure 6-1: State transitions of Simple HMM 3 3.
We sampled 10000 sequences from the HMM, with the requirement that the HMM stop
a sequence if and only if the third state is reached. We set this set of sequences as the
regular set for the NFA learner to learn. The NFA in Figure 6-2 was returned.
1,2
C(
02
1,2
Figure 6-2: NFA learned from 10000 output sequences from Simple HMM 3 3.
The NFA has two states. One state is the initial state and the other is an accepting
state. It can be seen that every sequence of outputs is accepted by this NFA. Moreover,
because any action from any state results in every transition being taken, this NFA reveals
nothing about the transitions between states in the Simple HMM 3 3.
We repeated the NFA learning with 5000, 1000, 500, 100, and 50 sequences from the
HMM. All of them yielded the same NFA, except for the NFA learned from 50 sequences.
With very few sequences, we obtained the NFA in Figure 6-3.
1,1,2,
1,22
3 1
2
1,2,3
12, 3
1,2
1, 2, 3
Figure 6-3: NFA learned from 50 output sequences from Simple HMM 3 3.
This NFA is less open-ended than before. From these observations, we can conclude
that supplying too many varieties of output sequences confuses the NFA into being too
accepting. However, there are still some anomalies. First, note that state 3 is a trapping
state. Once the NFA enters, it cannot escape. We believe the reason is that the HMM
always stops at the third state, no transitions out of the third state are observed.
The results are similar for Separated HMM 3 4. Setting the third state to be the
accepting state, we get the NFA in Figure 6-4 after training with 10000 output sequences.
1,2,31,2,3
1,12,23
Figure 6-4: NFA learned from 10000 output sequences from Separated HMM 3 4.
The resulting NFA is the same for 5000, 1000, and 500 output sequences.
For 100
output sequences, it was difficult to get the NFA learner to terminate. We had to resample
a number of times. When the NFA learner finally terminated, we got an NFA with 46
states - 13 initial states and 20 accept states. We speculated that there would be a way
to reduce the NFA to a simpler form, as many states were probably redundant, but we did
not investigate the issue any further.
We believe the reason for the disappointing results may be the following. When we
sample too many output sequences, we get too much variety, making the resulting NFA too
accepting. Fundamentally, the NFA learner has no motivation to distinguish between states
that have the same possible outputs but different distributions. On the other hand, when
we sample too few sequences, the topology of the sequences is rough. Because samples are
missing, the sample set may not have the necessary elements to be regular. We were not
able to find a happy medium between these two extremes, having tried varying the number
of output sequences between 100 and 500. It seems that the NFA learner is very sensitive
to the regularity of the set it is learning, if regularity can be measured in some sense.
Another reason may be that the different starting states confuse the NFA learner. That
is, it is ambiguous whether empty sequence should be accepted or not. Because it is possible
to start in the accepting state, the empty sequence should be accepted. On the other hand,
because it is possible to not start in an accepting state, it should not be accepted. To
remedy this problem and the trapping state problem, we have adopted a second version of
the stopping state approach, described in Section 6.1.2.
6.1.2
Stopping State Approach: Version 2
We utilize the fact that all HMMs are equivalent to another state with exactly one starting
state. Basically, a new state is created, and the HMM always starts in that state. It emits a
dummy output and transitions to states depending on the original initial state distribution.
A converted version of Simple HMM 3 3 is shown in Figure 6-5. In addition, to remedy the
trapping state problem, we allowed the HMM to sometimes keep going after reaching an
accepting state. That is, we specified a parameter p so that whenever the HMM reaches an
accept state, it continues to transition and produce outputs with probability p.
If 1123 was an output sequence from the original HMM, then 41123 would be the
output sequence from the converted HMM, with 4 being the dummy output. In general,
0.25
Figure 6-5: Simple HMM 3 3 with only one possible initial state. N represents the initial
state.
once converted into single initial state form, the only difference in output sequences is the
dummy output at the very beginning of the sequence.
In this way, the empty sequence is unambiguously not accepted by the NFA. After
training the output sequences from the converted version of Simple HMM 3 3 with p
0.3,
we get the NFA in Figure 6-6.
1,12, 3
0,1, 2,3
Figure 6-6: NFA learned from 10000 output sequences from Simple HMM 3 3 with only one
possible initial state. Parameter p was set to 0.3. '0' is the dummy output.
The transitions are not as trivial as in Figure 6-2. In fact, this NFA resembles the NFA
in Figure 6-3. However, there are still a number of peculiarities. As in Figure 6-3, which
depicts an NFA trained on 50 sample sequences, the NFA in Figure 6-6 has a trapping state.
Also, upon careful consideration, we find that the NFA in Figure 6-6 basically accepts
all output sequences. The reason is that the dummy output can never appear after the first
output, so the NFA will start from state 1, transition to state 2, and stay there forever.
We have seen this liberal acceptance phenomenon before, and it is due to having too
many varieties of output sequences. So we trained on fewer output sequences, repeating
the NFA learning with 1000, 500, 100, 50, and 10 output sequences. The results were the
same until we tried 10 output sequences. There, we had to resample a number of times for
the NFA learner to terminate. Once it terminated, we got an NFA with 43 states - 2 initial
states and 3 accept states. We considered this outcome uninteresting.
6.1.3
Accept All Approach
After trying to learn some of the HMMs listed in Section 5.4, we quickly realized that this
approach was not fruitful. The reason is that it is not uncommon for an HMM to emit
every possible sequence of outputs, as the more interesting aspect about the sequences are
how frequently they occur. The NFA learner always returned an NFA that accepted every
possible sequence of outputs. As a result, the resulting NFA structure was uninteresting.
6.1.4
Summary
In the end, we were unable to solve the issues that came up in trying to apply Angluin's
insight to learn HMMs. We choose to prematurely conclude our endeavor into the Angluin
approach to learning HMMs. This algorithm will not be present in the comparison section,
Section 6.2. The major issues are as follows.
(i) Faux-Regular Sets
The language that an HMM generates is regular. However, the set of output sequences
generated by an HMM is not guaranteed to be regular if there are not enough outputs.
For example, false negatives in the membership query may result in the NFA learner
not terminating. If every sequence that the learner queries for is in the sample set, then
the algorithm returns the correct NFA. However, if the learner queries for a sequence
that should be in the sample set but is not, the NFA may either not terminate or
become overly complicated.
(ii) Liberal Acceptance
HMMs generate output sequences that are too diverse. Added to the fact that the
set of output sequences is unlikely to be regular, the NFA ends up making sense of
everything by accepting every single possible sequence of outputs. This problem is
exacerbated by the fact that the structure of the minimal DFA is generally smaller
than the structure of the relevant HMM. Not only does the resulting NFA accept too
many sequences, it also has too few states.
6.2
Comparison of Algorithms
In this section, we evaluate the NNMF HMM learning algorithm.
6.2.1
Evaluation Tables
For each set of training parameters, we trained the algorithms five times to account for
variance. Thus, for each training instance, we list the min, mean, and standard deviation
of the performance results.
There are three columns representing the performance, in KL divergence from the
original HMM, of Baum-Welch, NNMF, and NNMF + B-W, in that order. NNMF +
B-W is the full NNMF HMM learning algorithm that is depicted in Figure 4-10. NNMF is
the NNMF algorithm with steps 7 and 8 removed from Figure 4-10. In each row, the best
value is bolded. The "Ratio" column represents NNMF + B-W's score divided by B-W's
score in that row.
We also present data about how the observation matrix factored. The values D and tD
represent the Euclidean distances incurred factoring the observation matrix and coefficient
submatrix respectively.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
B-W
997.9004
1711.5612
NNMF
3073.2533
3794.798
NNMF + B-W
1978.2321
2175.145
STD
586.737
502.6523
204.0401
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
1267.9861
2100.4884
510.4714
826.4295
1585.9159
504.7816
1594.6401
2578.7567
1452.4929
1708.4988
4631.9174
3136.5893
3938.9514
879.3699
1420.958
2208.8319
511.2215
2016.3892
2252.6751
301.7404
2744.2626
3733.8713
1955.4402
2145.6534
238.2645
623.9327
1165.6193
553.8159
1069.3176
1276.236
221.3511
1710.2501
2074.3237
STD
1637.6752
1172.2853
443.7356
Min
Mean
STD
3468.8953
4852.2966
811.0101
2296.8081
3019.2137
695.2137
1455.7559
1684.2082
229.0714
Ratio
1.9824
1.2709
1.5422
1.0215
0.75497
0.73498
0.67057
0.4949
1.001
0.44783
0.41966
0.3471
D
1.7844e-009
1.7628e-005
tD
0.74466
0.76402
2.5448e-005
0.024138
1.0659e-006
1.3482e-005
1.8301e-005
2.1212e-009
2.8605e-009
1.1112e-009
7.100le-009
2.8051e-006
5.9191e-006
2.1543e-009
2.8588e-009
0.77694
0.88881
0.082829
0.70404
0.72441
0.018199
0.70054
0.71762
0.013932
0.70267
0.7223
8.3373e-010
0.017803
6.0036e-009
6.7917e-005
9.2086e-005
0.70998
0.72934
0.020071
Table 6.1: Simple HMM 3 3 results.
N
L
4
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
B-W
NNMF
NNMF + B-W
Ratio
D
tD
442.938
7887.1676
6788.0769
1012.371
2224.1061
1045.096
655.6046
11421.9997
6373.5161
541.8756
7028.8504
5674.2161
144.0686
3307.4869
6712.3864
417.0339
9521.8506
8316.5538
4753.5918
10246.8763
4827.6747
2851.4064
3746.6054
694.6626
6106.9601
16116.4206
7509.9886
3694.1697
6215.7834
2559.1328
3996.8902
15454.9652
12911.146
1514.4711
5267.715
2483.7431
708.4322
3501.47
3295.6424
437.3121
1793.1714
986.1501
3573.9303
6693.3662
2675.7591
482.6687
3537.7063
2530.7667
1183.3128
5313.8906
3137.5246
160.2441
2780.6991
1863.0491
1.5994
0.44395
0.027698
0.037517
0.0090011
0.027648
0.03735
0.0078252
0.05575
0.06715
0.015662
0.058799
0.081687
0.01486
0.038503
0.049135
0.014305
0.061117
0.079614
0.013846
1.0225
1.0964
0.061877
1.0606
1.1083
0.037473
0.97981
1.0256
0.047698
0.99653
1.0625
0.057243
0.99639
1.0565
0.06685
1.0654
1.112
0.040776
0.43197
0.80624
5.4514
0.58601
0.89074
0.50331
8.2135
1.6066
0.38425
0.29203
Table 6.2: Simple HMM 3 4 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
B-W
246.5829
586.1898
NNMF
842.7965
1698.7183
NNMF + B-W
443.8406
826.1375
STD
225.8949
919.3675
419.7274
Min
Mean
211.106
646.1992
339.9976
1361.0068
186.1485
484.2997
STD
272.5914
574.4871
238.0156
Min
Mean
STD
Min
Mean
STD
Min
Mean
352.618
457.0649
84.6659
343.1575
494.9053
110.9837
466.1999
540.8028
655.8777
1141.2995
561.7039
665.5298
1152.7888
637.1897
448.8728
1548.3818
284.9661
373.3327
88.8709
309.2473
442.8703
150.966
233.7224
278.5806
STD
63.7271
877.4006
55.7901
Min
Mean
STD
358.4193
506.8186
97.3148
584.7148
1008.9154
424.3164
264.213
325.1176
64.9346
Ratio
1.8
1.4093
0.88178
0.74946
0.80814
0.8168
0.90118
0.89486
0.50134
0.51512
0.73716
0.64149
D
5.7293e-010
6.2605e-010
4.7446e-011
5.3266e-010
5.6806e-010
tD
0.69686
0.7363
0.023273
0.69125
0.82138
3.3612e-011
0.085894
5.1956e-010
6.7974e-010
9.2742e-011
6.9685e-010
7.9776e-010
7.6083e-011
5.7822e-010
6.6059e-010
0.72194
0.78777
0.067581
0.67417
0.72921
0.051518
0.68173
0.7519
5
.2148e-011
0.050575
7.355e-010
7.9107e-010
3.3875e-011
0.73071
0.76864
0.048385
Table 6.3: Simple HMM 4 3 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
B-W
774.1461
831.63
55.1058
595.3184
807.6597
153.2688
603.711
702.9934
74.9966
699.3138
832.8425
162.4642
777.5568
846.3584
76.2612
777.8909
803.1075
NNMF
282.6304
944.5411
910.9561
447.9371
817.8496
322.9873
290.8432
568.0166
385.066
327.6702
1092.0584
692.6045
233.5061
999.8632
720.4316
195.4436
522.9961
NNMF + B-W
185.3625
540.9019
332.8036
264.6648
558.8738
185.5608
222.583
413.0648
205.271
250.3639
625.7018
328.0724
186.5281
511.1768
236.4639
147.1445
349.5934
STD
26.126
296.3821
216.4107
Ratio
0.23944
0.65041
0.44458
0.69197
0.36869
0.58758
0.35801
0.75128
0.23989
0.60397
0.18916
0.4353
Table 6.4: Separated HMM 3 4 results.
D
0.04569
0.071385
0.016214
0.058159
0.068454
0.01265
0.054501
0.082582
0.023082
0.082378
0.11404
0.020373
0.059738
0.074771
0.013265
0.089566
0.10795
tD
0.94962
1.1127
0.12489
0.95463
1.0959
0.087527
0.94433
1.0727
0.084295
0.95939
1.025
0.073348
0.9511
1.0162
0.076826
0.97484
1.0367
0.016187
0.063611
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
B-W
692.1613
2473.7839
1006.716
2735.9597
2890.7895
113.5361
732.463
1872.572
1016.8428
2241.1899
2700.3719
322.4136
535.6466
2428.3011
1081.8397
529.2151
1075.3632
1085.128
NNMF
1256.8264
1806.0243
368.8792
517.3888
1138.4731
582.0805
517.5172
1617.3526
1078.2141
839.7055
1497.4537
864.2749
948.9102
1334.5147
548.5989
498.5396
1196.3787
471.1914
NNMF + B-W
755.3028
1146.3861
335.291
321.3234
903.6067
602.5364
342.6098
915.422
598.9476
500.7266
677.3107
159.0724
625.3597
831.5564
159.8957
269.039
622.7002
255.5409
Ratio
1.0912
0.46341
0.11744
0.31258
0.46775
0.48886
0.22342
0.25082
1.1675
0.34244
0.50837
0.57906
D
0.032212
0.040905
0.010054
0.026285
0.032148
0.0067635
0.037237
0.05287
0.01208
0.06965
0.07775
0.0063779
0.032745
0.046569
0.011163
0.059658
0.074978
0.010959
tD
1.0695
1.1485
0.065821
1.0245
1.1576
0.10029
0.96306
1.107
0.13131
0.98875
1.0943
0.076641
1.1095
1.1529
0.0479
1.0153
1.1223
0.090085
Table 6.5: Separated HMM 3 4 #2 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
B-W
80.4189
150.9328
NNMF
94.5875
552.8904
NNMF + B-W
41.4701
242.1683
STD
61.8605
631.8636
221.5841
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
58.9067
159.6925
77.5463
48.0383
170.5184
122.6708
132.9425
188.458
38.1928
92.6143
233.5022
127.2543
105.4114
213.1205
71.694
236.9763
442.1492
150.1581
152.2607
1490.2956
1104.5111
78.0167
706.9429
641.2314
136.2905
909.6215
816.2267
97.7251
915.404
757.5031
89.1558
142.7398
55.465
89.2594
257.8672
113.7077
59.1967
171.1029
113.7074
41.8444
128.769
79.7246
42.6303
94.565
41.1597
Ratio
0.51568
1.6045
1.5135
0.89384
1.8581
1.5123
0.44528
0.90791
0.45181
0.55147
0.40442
0.44372
Table 6.6: Dense HMM 3 3 results.
D
6.4591e-010
6.8283e-010
tD
0.71901
0.74776
3.6702e-011
0.029972
6.3935e-010
6.6786e-010
2.9599e-011
6.748e-010
8.0858e-010
9.4004e-011
1.1926e-009
1.7731e-009
3.5914e-010
7.1963e-010
8.2595e-010
8.4619e-011
9.0435e-010
1.3236e-009
5.2534e-010
0.69212
0.73319
0.024494
0.69282
0.71207
0.020377
0.68546
0.69798
0.011143
0.67914
0.71082
0.020134
0.68462
0.69924
0.013975
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
B-W
83.3714
181.9737
61.1875
63.9092
140.7707
54.9844
101.6815
135.0095
NNMF
277.3044
711.6227
522.0941
189.6736
371.2808
301.9523
102.5564
894.5842
NNMF + B-W
114.158
258.3594
205.2391
96.3222
179.7176
88.4172
76.2079
147.343
STD
38.4957
942.4778
58.8077
Min
Mean
61.9132
156.9878
57.2777
451.8523
17.3461
105.6887
STD
70.7718
433.9315
57.5969
Min
Mean
75.3198
141.9259
161.9649
289.2854
104.6016
128.1162
STD
65.3842
119.1106
26.204
Min
Mean
STD
68.9887
137.7862
57.3042
78.6174
908.0051
776.9401
46.7607
156.1644
125.1327
Ratio
1.3693
1.4198
0.74948
1.0914
D
1.2384e-009
1.3402e-009
7.0844e-011
1.1977e-009
1.2774e-009
7.0287e-011
1.156e-009
1.3079e-009
tD
0.82839
0.86026
0.03275
0.78397
0.80446
0.026186
0.78691
0.80793
1.0002e-010
0.018923
0.28017
0.67323
1.2862e-009
1.9384e-009
0.77687
0.81945
5.7841e-010
0.028493
1.2501e-009
1.3593e-009
0.78606
0.82481
1.1055e-010
0.026001
1.3571e-009
2.1287e-009
7.9323e-010
0.78007
0.79759
0.012031
D
1.2602e-009
1.3536e-009
6.9306e-011
1.7015e-009
2.0942e-009
tD
0.90708
0.95607
0.047694
0.85738
0.88954
3.1202e-010
0.024311
1.3328e-009
1.4374e-009
1.1779e-010
2.033e-009
2.451e-009
3.6034e-010
1.2627e-009
1.3305e-009
5.7614e-011
1.9591e-009
2.1503e-009
0.87323
0.90387
0.018549
0.85049
0.85728
0.0043956
0.87328
0.93007
0.033536
0.852
0.87485
1.5444e-010
0.019105
1.5072
1.2767
1.3888
0.9027
0.6778
1.1334
Table 6.7: Dense HMM 4 4 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
B-W
768.8807
892.2086
175.5267
674.2995
893.7868
NNMF
929.9388
2811.2549
1888.2821
762.9955
2127.3043
NNMF + B-W
513.3167
1422.7241
729.7105
465.0274
914.2332
STD
164.0421
768.2717
285.0277
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
597.7398
758.7097
148.7744
470.6347
694.9101
152.8592
657.4062
773.2348
179.5491
544.9122
674.6258
485.9444
1380.4805
1454.1388
1060.2761
2692.5426
2331.265
907.2523
2426.4336
1185.7498
244.7488
2855.0982
230.0795
480.5185
247.766
378.8173
556.4525
249.5042
381.6071
458.7711
68.7337
90.8653
407.605
STD
105.5129
2900.5149
199.1646
Ratio
0.66762
1.5946
0.68965
1.0229
0.38492
0.63334
0.80491
0.80075
0.58047
0.59331
0.16675
0.60419
Table 6.8: Dense HMM 5 5 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
B-W
486.542
1034.5862
NNMF
662.8812
2124.0776
NNMF + B-W
338.9175
789.752
STD
403.1681
973.9133
393.8942
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
420.8795
1029.4356
417.7536
451.4029
1022.2118
358.1582
918.071
1184.2618
198.9643
957.1142
1128.347
116.854
1025.08
1131.0833
100.2022
615.4777
1516.8439
1307.855
951.6361
1720.0668
624.1885
703.2973
1316.8456
494.8713
735.5595
1162.6925
289.9418
479.7215
1273.8234
859.4713
272.6408
486.9774
263.5107
386.6114
501.2916
156.4701
205.122
437.5169
165.7033
256.5674
426.3246
142.0576
178.4787
397.3082
198.2706
Ratio
0.69658
0.76335
D
1.3432e-009
1.4281e-009
tD
0.92152
0.94495
6.214e-011
0.017812
0.64779
0.47305
1.9405e-009
2.0783e-009
9.6254e-011
1.182e-009
1.3141e-009
9.832e-011
1.5739e-009
1.9942e-009
2.4562e-010
1.3106e-009
1.3977e-009
6.653e-011
1.6641e-009
1.948e-009
1.9176e-010
0.89313
0.91715
0.017116
0.90857
0.94016
0.042036
0.88931
0.90858
0.015319
0.91643
0.94822
0.021727
0.89418
0.91254
0.01955
D
1.3495e-009
1.488e-009
1.3063e-010
2.419e-009
2.7577e-009
3.2359e-010
1.3883e-009
1.4538e-009
4.7518e-011
1.943e-009
2.4883e-009
4.1628e-010
1.3363e-009
1.4341e-009
tD
0.90942
0.96484
0.042955
0.89407
0.91546
0.020014
0.9145
0.93948
0.02215
0.86953
0.90903
0.03621
0.92092
0.95033
6.9626e-011
0.020038
2.0438e-009
2.4754e-009
0.87551
0.91048
2.8924e-010
0.025662
0.85647
0.4904
0.22343
0.36944
0.26806
0.37783
0.17411
0.35126
Table 6.9: Separated HMM 5 5 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
NNMF
445.5445
1333.194
981.255
343.7505
1540.9268
1464.199
1109.1188
1832.6603
726.0979
321.3539
932.8144
749.699
290.0426
1359.1341
NNMF + B-W
300.8148
444.026
153.6569
221.0042
808.1103
812.1967
215.7601
410.0196
140.4542
220.0137
295.6257
93.5043
214.5423
315.3194
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
B-W
312.0537
431.1328
69.9133
436.7778
500.9244
55.7257
222.4566
325.2532
103.0878
313.3454
381.0247
53.6513
338.4901
377.7764
STD
48.6695
964.752
104.1627
Min
Mean
157.0316
320.7461
294.1124
1366.0707
142.0052
250.1859
STD
181.4441
1397.739
91.5897
Ratio
0.96398
1.0299
0.50599
1.6132
0.9699
1.2606
0.70214
0.77587
0.63382
0.83467
0.90431
0.78001
Table 6.10: Dense HMM 6 6 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
B-W
3332.37
20399.01
16466.48
4127.54
15506.81
15250.92
2124.79
2381.96
275.89
1507.81
2082.44
327.63
2136.06
9691.49
10223.34
605.78
1796.89
691.75
NNMF
453566.01
912240.58
323910.41
495135.17
1038233.75
478704.46
438551.80
559502.59
138113.32
410360.34
462766.32
30252.51
202754.66
522528.68
241131.87
439488.17
621973.93
206364.20
NNMF + B-W
35296.40
68192.60
40021.43
29257.04
85209.60
81990.15
32129.16
46716.27
11542.60
43283.21
53974.37
6145.84
26450.17
42213.78
9541.81
34255.16
44376.92
9572.20
Ratio
10.592
3.3429
7.0883
5.495
15.1211
19.6126
28.7059
25.9188
12.3827
4.3558
56.5471
24.6964
D
0.00051148
0.00059111
7.4622e-005
0.0037584
0.013027
0.01041
0.00020088
0.00034919
0.00011676
0.00038973
0.00078397
0.0003145
0.00015219
0.00068915
0.00060151
0.00030934
0.0018096
0.0023095
tD
0.93474
1.1154
0.28157
0.90685
0.98516
0.046258
0.91459
0.98949
0.054984
0.8651
0.8907
0.038996
0.9522
0.98703
0.042136
0.88864
0.91587
0.028501
Table 6.11: Sparse HMM 6 6 results.
N
4
L
5000
4
10000
6
5000
6
10000
8
5000
8
10000
Min
Mean
STD
Min
Mean
STD
Min
Mean
STD
Min
Mean
B-W
597.90
7411.79
12023.72
615.28
7171.64
11298.65
483.64
3363.79
2835.32
357.46
2938.83
NNMF
13545.2732
16888.8874
2903.0062
11806.9636
19473.2593
6944.1741
18360.2393
20905.4379
3869.7666
13190.4853
19585.8435
NNMF + B-W
4904.1275
8702.5271
4055.0472
5256.2945
8626.3141
4851.2726
4487.7336
11374.5472
4263.1374
2661.972
7905.5441
STD
Min
Mean
2568.83
6893.4983
4977.6338
235.06
2100.87
14353.7449
19603.7602
2684.0319
5888.9426
STD
Min
Mean
1105.23
3493.2808
3845.1267
409.26
1708.51
10129.4971
17785.2752
2558.5785
6528.3364
STD
1130.33
5433.6811
5346.3046
Ratio
8.2022
1.1741
7.447
2.69
D
0.00018761
0.0004331
0.00037467
0.00023915
0.0010111
0.0012683
0.00030951
0.00089552
0.00056024
0.00055679
0.0022575
tD
1.3278
1.4541
0.10004
1.4105
1.4525
0.046306
1.3575
1.4359
0.092502
1.3986
1.47
0.0016912
0.06264
11.4184
2.8031
0.00035097
0.0016097
1.3799
1.4806
0.0011892
0.063915
6.2518
3.8211
0.00024118
0.001287
1.3902
1.4173
0.0012471
0.033055
8.5429
1.2028
9.2791
3.3815
Table 6.12: Diverse Sparse HMM 6 6 results.
6.2.2
6.2.2.1
Evaluation Table Analysis
Average KL Ratios
Simple HMM 3 3
Simple HMM 3 4
Simple HMM 4 3
Separated HMM 3 4
Separated HMM 3 4 #2
Dense HMM 3 3
Dense HMM 4 4
Dense HMM 5 5
Separated HMM 5 5
Dense HMM 6 6
Sparse HMM 6 6
Diverse Sparse HMM 6 6
Min
1.0618
2.8285
0.9383
0.3066
0.5959
0.8648
0.9955
0.5491
0.4777
0.7800
21.7395
8.5236
Mean
0.7195
0.7064
0.8378
0.6021
0.4062
0.9856
1.0829
0.8748
0.4709
1.0491
13.9036
2.5121
Table 6.13: Average KL ratios.
Table 6.13 lists the average KL divergence ratios achieved for each HMM across all
training instances. We will be referring to this table in sections below.
6.2.2.2
NNMF vs. NNMF + B-W
In every single case, NNMF + B-W performs better than NNMF. This result is expected;
our NNMF algorithm initially produces HMM parameters 0, T, and
7r.
Using these values
as seeds in Baum-Welch can only improve the parameters.
6.2.2.3
Larger Training Sets
In Table 6.2, Baum-Welch beats NNMF + B-W only in the cases L = 5000. It seems that
NNMF + B-W benefits greatly from having more samples in the training set. One of the
reasons could be due to the threshold for omitting rows in the measured observation matrix.
Having more data means having a bigger measured observation matrix. Having a bigger
observation matrix does not necessarily make the factorization easier, but it could lead to
more accurate HMMs.
6.2.2.4
Dense HMMs
NNMF + B-W performs close to or better than Baum-Welch on all of the HMMs with dense
o and
T matrices, which are all but the two sparse matrices. The notable exception, row 2
of Table 6.13, is explained in Section 6.2.2.3. The average of the min KL ratios from Table
6.13 is 0.9398. The average of the mean KL ratios from Table 6.13 is 0.7735. Thus, we can
say that on the dense HMMs that we have tested, NNMF + B-W performs an average of
22.65% better in KL measure than Baum-Welch per training instance.
6.2.2.5
Separated HMMs
NNMF + B-W performs especially well on the separated HMMs, which do not have states
that transition to themselves. On separated HMMs, NNMF + B-W achieves an average
min KL ratio as low as 0.3066 and an average mean KL ratio as low as 0.4062. Table 6.14
is provided for reference.
Separated HMM 3 4
Separated HMM 3 4 #2
Separated HMM 5 5
Min
0.3066
0.5959
0.4777
Mean
0.6021
0.4062
0.4709
Table 6.14: Average KL ratios for separated HMMs.
6.2.2.6
Sparse HMMs
Baum-Welch absolutely dominated our algorithm for the sparse HMMs with six states.
NNMF + B-W does especially badly on Sparse HMM 6 6. The reason may be intrinsic to
the NNMF approach. the way in which we determine T is very local; we only look at the
value of T that yields an accurate second output given the first output. In other words,
T is determined through a sort of greedy algorithm approach that may suffer in the face
of complexity. Because of this reason, NNMF + B-W has a much better time with HMM
Diverse Sparse HMM 6 6. although it is orders of magnitude worse than Baum-Welch. For
Diverse Sparse HMM 6 6, because the first state can be any state, when T is determined
by our NNMF algorithm, it gets a better grasp of the interplay between all states instead
of just a few.
6.2.2.7
Standard Deviations
Although not particularly conspicuous, the performance of NNMF + B-W is often more
consistent than that of B-W for HMMs with fewer than six states. For the two six state
HMMs, the performance of NNMF + B-W is almost always less consistent than that of
B-W.
6.2.2.8
Factorization vs. Effectiveness
There does not seem to be a correlation between the accuracy of factorization and the
accuracy of the resulting HMM. Our algorithm achieves factorization distances on the order
of 0.01 for HMM Separated HMM 3 4, which is high compared to other HMMs, but the
algorithm performs very well on Separated HMM 3 4, achieving low KL measures in all
training cases. On the other hand, our algorithm achieves factorization distances on the
order of 10-10 for Simple HMM 4 3, but the algorithm performs only slightly better in
terms of KL measures than it did for Separated HMM 3 4. We surmise that easy-to-factor
observation matrices do not necessarily lead to accurate HMMs.
6.2.2.9
NNMF + B-W Runtimes
Although we did not provide the statistics, once seeded with NNMF, Baum-Welch often
converged much more quickly than starting with a random seed. This phenomenon makes
sense intuitively; NNMF provides parameters 0, T, and 7r that already fit the training data
to some extent, so that Baum-Welch often starts close to a local optimum.
Chapter 7
Conclusion and Future Work
7.1
Conclusion
The central goal of this thesis was to draw from Angluin's DFA learning algorithm and
nonnegative matrix factorization to provide novel algorithms for HMM learning.
Unfortunately, we failed to produce a learning algorithm using Angluin's insights. The
main problem was the incompatibility between HMMs and NFAs. If we sample too few
sequences for training, the set of output sequences that an HMM generates is not regular.
If we sample too many, the NFA ends up accepting too liberally.
Nonetheless, we were able to produce an HMM learning algorithm using the nonnegative
matrix factorization approach.
The algorithm was derived from the insight that every
output distribution is a linear combination of the columns of the HMM's output matrix.
Using this insight, we rephrased the problem of HMM learning to a problem in linear
algebra. By using nonnegative matrix factorization, we were able to extract the columns of
the HMM's output matrix from a table of output distributions.
To evaluate the algorithm, we measured its performance on a number of HMMs. The
results show that for the dense HMMs we tested, the algorithm performs 22.65% better on
average in KL measure than Baum-Welch. On the other hand, on HMMs with sparse 0
and T matrices, our algorithm performed significantly worse than Baum-Welch.
Future Work
7.2
While we were unable to make Angluin's work apply to learning HMMs, perhaps others
can. We still believe that the approach is promising and deserving of more study.
A theoretical question that deserves attention is determining when the factorization
approach helps. We have empirically shown some cases in which it outperforms Baum-Welch
but we also saw many cases in which it performed far worse. There is a need to have a
theoretical basis for when this approach is most effective. A related interesting theoretical
question is how exactly the accuracy of the factorization affects the accuracy of the resulting
HMM.
The potentials of this nonnegative matrix factorization idea rest partly on advances in
algorithms for factorization. In order to factor a matrix into row-stochastic matrices, we
made alterations to existing factoring algorithms without justifying the changes. Deriving
and proving an efficient row-stochastic factoring algorithm would be a crucial step before
applying this algorithm to larger, more complicated HMMs.
We have seen that for sparse HMMs with six states, our NNMF algorithm is unimpressive.
An important investigation would be to improve the algorithm to work better for such
HMMs. We surmise that the determination of T is the culprit. 0 and 7Tare derived from
the observation matrix, which encompasses all of the HMM's information. T, on the other
hand, is determined solely by looking at the distribution of the second output given the
first output.
Also, as discussed in Proposition 2, factorizations generated via invertible matrices may
yield new HMMs. It would be nice to know how to pick among these generated factorizations
to find the best HMM.
7.3
Summary
This thesis shows that nonnegative matrix factorization can be used to provide an algorithm
to learn HMMs. We empirically demonstrate that the algorithm is effective for dense HMMs
with at most six states, beating Baum-Welch in many cases. This work also attempts
to apply Angluin's DFA learning algorithm to learn HMMs but is unable to produce an
algorithm and provides a list of reasons. We see that invoking different but related fields,
such as linear algebra and DFA learning, can offer new perspectives into HMM learning and
lead to novel and effective algorithms.
THIS PAGE INTENTIONALLY LEFT BLANK
Appendix A
Angluin's Algorithm Example
Suppose we are trying to learn the regular set of all finite binary sequences that contain an
even number of zeroes and an even number of ones.
The algorithm begins by querying the teacher if
",
'0', and '1' are in the set. Table
A.1 is provided for reference. The top row represents the set of suffixes. The left column
represents the set of prefixes. The prefixes are separated by a line that separates S on top
from S -A. Where row meets column indicates whether the word formed by prefix plus suffix
is in the regular set.
''
1
'0'
'1'
0
0
Table A.1: Initial table in Angluin's algorithm.
Now, the algorithm notices that the table is not closed, since the row labeled by '0'
E S- A is different from any rows in the top part of the table. After resolving the issue, the
table becomes Table A.2.
In accordance with the algorithm, we make queries to see if S-'0' and S-'1' are in the
regular set. The table becomes Table A.3.
Now this table is closed and consistent. We submit the conjecture to the teacher, which
"~
1
'0'
'1'
0
0
Table A.2: Closed table in Angluin's algorithm.
"
'0'
'1'
'00'
'01'
1
0
0
1
0
Table A.3: Third iteration table in Angluin's algorithm.
tells us that the sequence '10' is a counterexample - that it should be in the regular set but
our algorithm does not think so. Indeed, our algorithm thinks that after the sequence '1',
the DFA is in the same state as after the sequence '0'. Hence, it treats '10' like '11'.
After resolving the issue, the table becomes Table A.4.
"
1
'0'
'1'
'10'
'00'
'01'
'11'
'100'
'101'
0
0
0
1
0
1
0
0
Table A.4: Fourth iteration table in Angluin's algorithm.
The table is not consistent since '0' and '1' have identical rows but '00' and '10' do not.
After resolving the issue, the table becomes Table A.5.
The table is still not consistent since '1' and '10' have identical rows but '11' and '101'
do not. After resolving the issue, the table becomes Table A.6.
Now the table is closed and consistent. We submit our conjecture to our teacher, and
'
1
0
'0'
'1'
'10'
'00'
'01'
'11'
'100'
'101'
0
0
0
1
0
1
0
0
1
0
0
0
0
0
0
1
Table A.5: Fifth iteration table in Angluin's algorithm.
'0'
'1'
'10'
'00'
'01'
'11'
'100'
'101'
''
1
0
0
0
1
0
1
0
0
'0'
0
1
0
0
0
0
0
0
1
'1'
0
0
1
0
0
0
0
1
0
Table A.6: Finished table in Angluin's algorithm.
our teacher validates our conjecture.
The number of states is the number of rows in the S part of our table, which is 4. Each
row represents a distinct state. The initial state is the
"
row.
To see whether a sequence is in the regular set, use the following procedure.
(i) Locate the state that the DFA is in. It is one of the rows represented by an element
S E S.
(ii) Let a be the next letter in the given sequence. If a is the last letter in the sequence,
look up s - a in the table and retrieve the answer. Otherwise, go to step (iii).
(iii) Find s - a in S U S -A. If s - a E S, go back to step (i).
(iv) Observe the row represented by s - a E S - A, and look for the row in the top part of
the table identical to it. That is the state that the DFA is in after processing a. Go
back to step (i).
Appendix B
HMMs for Testing
B.1
B.2
Simple HMM 3 3
0.5
0.25
0
( 0.5
0.25
05
0.75
0 0.5
0
0.25
0.75
0.25
1
0.5
0.5
=
.
0.25
0.25
0
Simple HMM 3 4
=
0.75
0
0.0
0.2
0.05
0
B.3
0
0.2
0.25
0
0.250.75)
0 0 0.5
, T=
0.5
0.75 0 0.5
7r
0.25
0.25
0.25
1
0
0.5
0
0 0.2
Simple HMM 4 3
0.5 0.5
0.25 0.3
0= ( 0.5
0.25
0
0.4
0
0.25
0.75
0.3
0.5
0.5 0.25 1 0
0.25
0
0 0 0.3
0.15
0
0.1
0
0.75
0.5
B.4
Separated HMM 3 4
0.5
0.5 0.25
0.1
0.2
0.1
0.2 0.7
.1
0.3 .050.25
0.1 00
03
B.5
O=
O=
0.2 0.6
0.4 0.5
0.2
0.1
0.1
, T=
0.5
0.5
0.75
0
0.5
0.25
0.5
0
0.25
0
0.1
0.65
0.4
0.4
0
0.35
0..1=
0.9
0
0.5
Dense HMM 3 3
0.5 0.6 0.25
0
0.25
0.3
0.6
0.25 0.3 0.2
T=
0.5
0.4
0.2
0.5
,0.25
0.25
0.25 0.3 0.6
0.25 0.1 0.15
Dense HMM 4 4
0=
0.5
0.5
0.2
0.4
0.25
0.1
0.05
0.3
0.15
0.3
0.6
0.2
0.1
0.1
0.15
0.1
0.25 0.1 0.2 0.1
0.5 0.2 0.2 0.2
,T=
0.2
0.4 0.3
0.3
0.05
0.3 0.3
0.4
,i=
0.5
0.2
0.2
0.1
Dense HMM 5 5
B.8
0
0.5
)0.6
0.5 0.2 0.1
0
B.7
0
Separated HMM 3 4 #2
0.1
B.6
0
=
0.5
0.2
0.2
0
0.5 0.25 0.1 0.1 0.2
0.5
0.25
0.3
0.05
0
0.5
0.1
0.1
0.1
0.4
0.05
0.3
0.1
0.2
0.1
0.2
0.15
0.1
0.1
0.2
0.4
0
0.05
0.4
0.4
0.1
0.1
0.1
0.2
0.3
0.3
0.2
0.2
0.2
B.9
Separated HMM 5 5
0
B.10
0.2
0.2 0.25
0.1
0.2
0
0.5
0.2
0.1
0.2
0.2
0.3
0.3
0.1
0.3
0.3
0.55
0
0.3
0.5
0.3
0.15
0.1
0.2
0
0.5
0.4
0.1
0.3
0
0.3
0.4
0.1
0.2
0.6
0
0
0.1
0.1
0.4
0
0.1
0.4
0.3
0.1 0.05
0.1
0.1
0.25
0.1
0.1
0.1
0
0.15
T
=
0.1
Dense HMM 6 6
B.11
0.2415
0.2713
0.1071
0.1737
0.2427
0.1662
0.2506
0.1447
0.0155
0.2044
0.0285
0.1662
0.2242
0.1634
0.4442
0.1843
0.1074
0.1506
0.0147
0.0120
0.2776
0.2000
0.3164
0.1083
0.1097
0.1999
0.0325
0.1555
0.2822
0.2331
0.1593
0.2087
0.1231
0.0821
0.0228
0.1756
0.2533
0.1997
0.2377
0.0668
0.0271
0.0965
0.2
0.2086
0.0782
0.0522
0.0356
0.1973
0.0657
0.25
0.1511
0.2776
0.0468
0.1560
0.2680
0.2587
0.15
0.1984
0.0767
0.2484
0.1271
0.1145
0.0867
0.1
0.0226
0.2425
0.1518
0.4391
0.2511
0.2333
0.2
0.1660
0.1253
0.2631
0.1754
0.1420
0.2591
0.1
Sparse HMM 6 6
0 0.5
0
0
0.5
0
0
0.3
0.1
0
0.7
0.75
0
0
0
0.3
0.5
0
0
0
0.9
0
0.3
0
0
1
0
0.5
0
0
0.8
0
0
0
0
0
0
0.25
0
0
0
0
0
0.2
0.5
0
0
0.2
0
0
0
0
0
0.5
0
0
0
0
0.7
0
0.8
0
0
0
0 0.5
0
0.7
0
0
0.5
0
0
0
0
0.5
0
0.5
Diverse Sparse HMM 6 6
B.12
0 0.5
0
0
0.5
0
0
0.3
0.1
0
0.3
0
0
0
0.3
0.5
0
0
0
0.9
0
0.2
0
1
0
0.5
0
0
0.8
0
0
0
0
0.1
0
0.25
0
0
0
0
0
0.2
0.5
0
0
0.2
0.1
0
0
0
0
0.5
0
0
0
0
0.7
0
0.8
0.2
0
0
0 0.5
0
0.7
0
0
0.5
0
0
0
0.1
0.5
0
0.5
0.75
0
Bibliography
[1] Dana Angluin. Learning regular sets from queries and counterexamples. Information
and Computation, 75:87-106, 1987.
[2] Kenneth Basye, Thomas Dean, and Leslie Pack Kaelbling. Learning dynamics:
system identification for perceptually challenged agents.
Artificial Intelligence,
72(1-2):139-171, January 1995.
[3] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization
technique occurring in the statistical analysis of probabilistic functions of Markov
chains. Annals of Mathematical Statistics, 41(1):164-171, 1970.
[4] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and
Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix
factorization. Computational Statistics and Data Analysis, 52(1):155-173, 2007.
[5] Benedikt Bollig, Peter Habermehl, Carsten Kern, and Martin Leucker. Angluin-style
learning of NFA. In Proceedings of IJCAI 2009, pages 164-171, 2009. To appear. Full
version as Research Report LSV-08-28, Laboratoire Specification et Verification, ENS
Cachan, France.
[6] Andrzej Cichocki, Rafal Zdunek, and Shun ichi Amari. Hierarchical ALS algorithms
for nonnegative matrix and 3d tensor factorization. In Lecture Notes in Comp. Sc.,
Vol. 4666, Springer,pages 169-176, 2007.
[7] A. P. Dempster, N. M. Laird, and D. ZB. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B,
39(1):1-38, 1977.
[8] P. Dupont, F. Denis, and Y. Esposito. Links between probabilistic automata and hidden
Markov models: probability distributions, learning models and induction algorithms.
Pattern Recognition, 38:1349-1371, 2005.
[9] Lorenzo Finesso and Peter Spreij. Approximation of stationary processes by hidden
Markov models. arXiv:math/0606591v2, February 2008.
[10] Yoav Freund, Michael Kearns, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and
Linda Sellie. Efficient learning of typical finite automata from random walks. In
Proceedings of the 24th Annual ACM Symposium on Theory of Computing, pages
315-324, 1993.
[11] Omri Guttman, S. V. N. Vishwanathan, and Robert C. Williamson. Learnability of
probabilistic automata via oracles. ALT, pages 171-182, 2005.
[12] Ngoc-Diep Ho and Paul Van Dooren. Non-negative matrix factorization with fixed row
and column sums. Accepted for publication in Linear Algebra and Its Applications,
2006.
[13] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning
hidden Markov models. Preprint, February 1988.
[14] Daniel D. Lee and H. Sebastian Seung.
Algorithms for non-negative matrix
factorization. Advances in Neural Information Processing Systems, 13:556-562, 2001.
[15] Kevin Murphy. Hidden Markov model toolbox for MATLAB.
http://people.cs.ubc.ca/ murphyk/Software/HMM/hmm.html.
2005.
Available at
[16] Luis E. Ortiz and Leslie Pack Kaelbling. Accelerating EM: An empirical study. In
Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence,
pages 512-521, 1999.
[17] V. Paul Pauca, J. Piper, and Robert J. Plemmons. Nonnegative matrix factorization
for spectral analysis. In Linear Algebra and Its Applications, 2005. In press, available
at http://www.wfu.edu/ plemmons.
[18] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.
[19] Farial Shahnaz, Michael W. Berry, V. Paul Pauca, and Robert J. Plemmons.
Document clustering using nonnegative matrix factorization. Information Processing
and Management, 42(2):373-386, 2006.
[20] Sebastiaan A. Terwijn. On the learnability of hidden Markov models. In International
Colloquium on Grammatical Inference, 2002.
[21] Stephen Vavasis.
On the
arXiv:0708.4149v2, 2007.
complexity of nonnegative
100
matrix factorization.
Download