HELSINKI UNIVERSITY OF TECHNOLOGY
Department of Engineering Physics and Mathematics

Antti Honkela

Nonlinear Switching State-Space Models

Master's thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Technology

Espoo, May 29, 2001

Supervisor: Professor Juha Karhunen
Instructor: Harri Valpola, D.Sc. (Tech.)

Helsinki University of Technology — Abstract of Master's Thesis
Department of Engineering Physics and Mathematics

Author: Antti Honkela
Department: Department of Engineering Physics and Mathematics
Major subject: Mathematics
Minor subject: Computer and Information Science
Title: Nonlinear Switching State-Space Models
Title in Finnish: Epälineaariset vaihtuvat tila-avaruusmallit
Chair: Tik-61 Computer and Information Science
Supervisor: Prof. Juha Karhunen
Instructor: Harri Valpola, D.Sc. (Tech.)

Abstract:
The switching nonlinear state-space model (switching NSSM) is a combination of two dynamical models. The nonlinear state-space model (NSSM) is continuous by nature, whereas the hidden Markov model (HMM) is discrete. In the switching model, the NSSM is used to model the short-term dynamics of the data. The HMM describes the longer-term changes and controls the NSSM. This thesis describes the development of a switching NSSM and a learning algorithm for its parameters. The learning algorithm is based on Bayesian ensemble learning, in which the true posterior distribution is approximated with a tractable ensemble.
The approximation is fitted to the probability mass to avoid overlearning. The implementation is based on an earlier NSSM by Dr. Harri Valpola. It uses multilayer perceptron networks to model the nonlinear functions of the NSSM. The computational complexity of the NSSM algorithm sets serious limitations for the switching model. Only one dynamical model can be used. Hence, the HMM is only used to model the prediction errors of the NSSM. This approach is computationally efficient but makes little use of the HMM. The algorithm is tested with real-world speech data. The switching NSSM is found to be better at modelling the data than other standard models. It is also demonstrated how the algorithm can find a reasonable segmentation into different phonemes when only the correct sequence of phonemes is known in advance.

Number of pages: 93
Keywords: switching model, hybrid model, nonlinear state-space model, hidden Markov model, ensemble learning

Preface

This work has been funded by the European Commission research project BLISS. It has been done at the Laboratory of Computer and Information Science at the Helsinki University of Technology. It would not have been possible without the vast computational resources available at the laboratory. I am grateful to all my colleagues at the laboratory and the Neural Networks Research Centre for the wonderful environment they have created.

I wish to thank my supervisor, Professor Juha Karhunen, for the opportunity to work at the laboratory and in the Bayesian research group. I am grateful to my instructor, Dr. Harri Valpola, for suggesting the subject of my thesis and giving a lot of technical advice along the way. They both helped in making the presentation of the work much clearer. I also wish to thank Professor Timo Eirola for sharing his expertise on dynamical systems. I am grateful to Mr. Vesa Siivola for his help in using the speech data in the experiments. Mr. Tapani Raiko and Mr. Jaakko Peltonen gave many useful suggestions in the numerous discussions we had. Finally, I wish to thank Maija for all her support during my project.

Otaniemi, May 29, 2001

Antti Honkela

Contents

List of abbreviations
List of symbols

1 Introduction
  1.1 Problem setting
  1.2 Aim of the thesis
  1.3 Structure of the thesis
  1.4 Contributions of the thesis

2 Mathematical preliminaries of time series modelling
  2.1 Theory
    2.1.1 Dynamical systems
    2.1.2 Linear systems
    2.1.3 Nonlinear systems and chaos
    2.1.4 Tools for time series analysis
  2.2 Practical considerations
    2.2.1 Delay coordinates in practice
    2.2.2 Prediction algorithms

3 Bayesian methods for data analysis
  3.1 Bayesian statistics
    3.1.1 Constructing probabilistic models
    3.1.2 Hierarchical models
    3.1.3 Conjugate priors
  3.2 Posterior approximations
    3.2.1 Model selection
    3.2.2 Stochastic approximations
    3.2.3 Laplace approximation
    3.2.4 EM algorithm
  3.3 Ensemble learning
    3.3.1 Information theoretic approach

4 Building blocks of the model
  4.1 Hidden Markov models
    4.1.1 Markov chains
    4.1.2 Hidden states
    4.1.3 Continuous observations
    4.1.4 Learning algorithms
  4.2 Nonlinear state-space models
    4.2.1 Linear models
    4.2.2 Extension from linear to nonlinear
    4.2.3 Multilayer perceptrons
    4.2.4 Nonlinear factor analysis
    4.2.5 Learning algorithms
  4.3 Previous hybrid models
    4.3.1 Switching state-space models
    4.3.2 Other hidden Markov model hybrids

5 The model
  5.1 Bayesian continuous density hidden Markov model
    5.1.1 The model
    5.1.2 The approximating posterior distribution
  5.2 Bayesian nonlinear state-space model
    5.2.1 The generative model
    5.2.2 The probabilistic model
    5.2.3 The approximating posterior distribution
  5.3 Combining the two models
    5.3.1 The structure of the model
    5.3.2 The approximating posterior distribution

6 The algorithm
  6.1 Learning algorithm for the continuous density hidden Markov model
    6.1.1 Evaluating the cost function
    6.1.2 Optimising the cost function
  6.2 Learning algorithm for the nonlinear state-space model
    6.2.1 Evaluating the cost function
    6.2.2 Optimising the cost function
    6.2.3 Learning procedure
    6.2.4 Continuing learning with new data
  6.3 Learning algorithm for the switching model
    6.3.1 Evaluating and optimising the cost function
    6.3.2 Learning procedure
    6.3.3 Learning with known state sequence

7 Experimental results
  7.1 Speech data
    7.1.1 Preprocessing
    7.1.2 Properties of the data set
  7.2 Comparison with other models
    7.2.1 The experimental setting
    7.2.2 The results
  7.3 Segmentation of annotated data
    7.3.1 The training procedure
    7.3.2 The results
8 Discussion

A Standard probability distributions
  A.1 Normal distribution
  A.2 Dirichlet distribution

B Probabilistic computations for MLP networks

References

List of abbreviations

CDHMM  Continuous density hidden Markov model
EM     Expectation maximisation
HMM    Hidden Markov model
ICA    Independent component analysis
MAP    Maximum a posteriori
MDL    Minimum description length
ML     Maximum likelihood
MLP    Multilayer perceptron (network)
NFA    Nonlinear factor analysis
NSSM   Nonlinear state-space model
PCA    Principal component analysis
pdf    Probability density function
RBF    Radial basis function (network)
SSM    State-space model

List of symbols

A = (a_ij)        Transition probability matrix of a Markov chain or a hidden Markov model
A, B, a, b        The weight matrices and bias vectors of the MLP network f
C, D, c, d        The weight matrices and bias vectors of the MLP network g
D(q(θ)||p(θ))     The Kullback–Leibler divergence between q(θ) and p(θ)
diag[x]           A diagonal matrix with the elements of the vector x on the diagonal
Dirichlet(p; u)   Dirichlet distribution for variable p with parameters u
E[x]              The expectation or mean of x
f                 Mapping from latent space to the observations, or an MLP network modelling such a mapping
g                 Mapping modelling the continuous dynamics of an NSSM, or an MLP network modelling such a mapping
h(s)              A measurement function h : M → R
H_i               A model, hypothesis
M_t               Discrete hidden state or HMM state at time instant t
M                 The set of discrete hidden states or HMM states
N(µ, σ²)          Gaussian or normal distribution with parameters µ and σ²
N(x; µ, σ²)       As N(µ, σ²) but for variable x
p(x)              The probability of event x, or the probability density function evaluated at point x
q(x)              Approximating probability density function used in ensemble learning
s(t)              Continuous hidden states or sources at time instant t
s_k(t)            The kth component of vector s(t)
s̄_k(t)            The estimated posterior mean of s_k(t)
s̊_k(t)            The estimated conditional posterior variance of s_k(t) given s_k(t−1)
s̆_k(t−1, t)       The estimated posterior linear dependence between s_k(t−1) and s_k(t)
S                 The set of continuous hidden states or source values
Var[x]            The variance of x
x(t)              A sample of observed data
X                 The set of observed data
θ_i, θ            A scalar parameter of the model
θ̄_i               The estimated posterior mean of parameter θ_i
θ̃_i               The estimated posterior variance of parameter θ_i
θ                 The set of all the parameters of the model
π = (π_i)         Initial distribution vector of a Markov chain or a hidden Markov model
φ_t(x)            The flow of a differential equation

Chapter 1
Introduction

1.1 Problem setting

Most data analysis methods are based on developing a model that could be used to recreate the studied data set. Speech recognition systems, for example, are often built around a model that could in principle be used as a speech generator. The success of the recogniser depends heavily on how well the generator can generate realistic speech data.

The speech generators used by most modern speech recognition systems are based on the hidden Markov model (HMM). The HMM is a discrete model. It has a finite number of different internal states that produce different kinds of output.
Typically there are a couple of states for each phoneme or pair of phonemes. The whole dynamical process of producing speech is thus modelled by discrete transitions between the states corresponding to the different phonemes.

The model of human speech implied by the HMM is not a very realistic one. The dynamics of the mouth and the vocal cords used to produce the speech are continuous. The discrete model is only a very crude approximation of the "true" model. A more realistic approach would be to model the data with a continuous model. The process of producing speech is clearly nonlinear, and this should be reflected by its model. A good candidate for the task is the nonlinear state-space model (NSSM). The NSSM can be described as the continuous counterpart of the HMM. The problem with models like the NSSM is that they concentrate on modelling the short-term structure of the data. Therefore they are not, as such, very well suited for speech recognition.

There are speech recognition systems that try to get the best of both worlds by combining the two different kinds of models into one hybrid structure. Such systems have performed well in several difficult real-world problems, but they are often rather specialised. The training algorithms for such models are usually based on heuristic measures rather than on generally accepted mathematical principles.

In this work, a hybrid model structure that combines the HMM with another dynamical model, the continuous NSSM, is studied. The resulting model is called the switching nonlinear state-space model (switching NSSM). The resulting hybrid model has the power of a continuous NSSM to model the short-term dynamics of the data. However, above the NSSM there is still the familiar HMM to divide the data into different discrete states corresponding, for example, to the different phonemes.

1.2 Aim of the thesis

The aim of this thesis has been to develop a Bayesian formulation for the switching NSSM and a learning algorithm for the parameters of the model. The term learning algorithm in this context means a procedure for optimising the parameters of the model to best describe the given data. The learning algorithm is based on the approximation method called ensemble learning. It provides a principled approach for global optimisation of the performance of the model. Similar switching models exist, but they use only linear state-space models (SSMs). In practice the switching model has been implemented by extending the existing NSSM model and learning algorithm developed by Dr. Harri Valpola [58, 60].

The performance of the developed model has been verified by applying it to a data set of Finnish speech in two different experiments. In the first experiment, the switching NSSM has been compared with a plain HMM, a standard NSSM without switching, and a static nonlinear factor analysis model which completely ignores the temporal structure of the data. In the second experiment, patches of speech with known annotation, i.e. the sequence of phonemes in the word, have been used. The correct segmentation of the words into individual phonemes was not known, however, and had to be learnt by the model.

Even though the development of the model has been motivated here by speech recognition examples, the purpose of this thesis has not been to develop a working speech recognition system. Such a system could probably be developed by extending the work presented here.
1.3 Structure of the thesis

The thesis begins with a review of different methods of time series modelling in Chapters 2–4. Chapter 2 presents a mathematical point of view on the subject. These ideas form the foundation for what follows, but most of the results are not used directly. Statistical methods for learning the parameters of the different models are presented in Chapter 3. Chapter 4 introduces the building blocks of the switching NSSM and discusses previous work on HMMs and SSMs, as well as some of their combinations.

The second half of the thesis (Chapters 5–7) consists of the development of the switching NSSM and experimental verification of its operation. In Chapter 5, the exact structures of the parts of the switching model, the HMM and the NSSM, are defined. It is also described how the two models are combined. The ensemble learning based learning algorithms for all the models of the previous chapter are derived in Chapter 6. A series of experiments was conducted to verify the performance of the model. Chapter 7 discusses these experiments and their results. Finally, the results of the thesis are summarised in Chapter 8.

The thesis also has two appendices. Appendix A presents two important probability distributions, the Gaussian and the Dirichlet distribution, and some of their most important properties. Appendix B contains a derivation of a result needed in the learning algorithm in Chapter 6.

1.4 Contributions of the thesis

The first part of the thesis consists of a literature review of important background knowledge for developing the switching NSSM. The model is presented in Chapter 5. The continuous density HMM in Section 5.1 has been developed by the author, although it is heavily based on earlier work by Dr. David MacKay [39]. A Bayesian NSSM by Dr. Harri Valpola is presented in Section 5.2. The modifications needed to add switching to the NSSM model, as presented in Section 5.3, have been developed by the author. The same division applies to the learning algorithms for the models, found in Chapter 6.

The developed learning algorithm has been tested extensively. The author has implemented the switching NSSM learning algorithm in Matlab. The code is based on an implementation of the NSSM, which is in turn based on an implementation of the nonlinear factor analysis (NFA) algorithm. Dr. Harri Valpola wrote the extensions from NFA to NSSM. The author has written most of the rest of the code over a period of two years. All the experiments reported in Chapter 7, and the HMM implementation used in the comparisons, have been done by the author.

Chapter 2
Mathematical preliminaries of time series modelling

As this thesis deals with modelling time series data, this first chapter of the review discusses the mathematical background of the subject. In Section 2.1, a brief introduction to the theory of dynamical systems and some useful tools for solving the related inverse problem, i.e. finding a suitable dynamical system to model a given time series, is presented. Section 2.2 discusses some of the practical consequences of the results. The concepts presented in this chapter form the basis for the rest of the thesis. Most of them will, however, not be used directly.

2.1 Theory

2.1.1 Dynamical systems

The theory of dynamical systems is the basic mathematical tool for analysing time series. This section presents a brief introduction to the basic concepts. For a more extensive treatment, see for example [1].
The general form for an autonomous discrete-time dynamical system is the map

    x_{n+1} = f(x_n)    (2.1)

where x_n, x_{n+1} ∈ R^n and f : R^n → R^n is a diffeomorphism, i.e. a smooth mapping with a smooth inverse. It is important that the mapping f is independent of time, meaning that it depends only on the argument point x_n. Such mappings are often generated by flows of autonomous differential equations. For a general autonomous differential equation

    x'(t) = f(x(t)),    (2.2)

we define the flow by [1]

    \phi_t(x_0) := x(t)    (2.3)

where x(t) is the unique solution of Equation (2.2) with the initial condition x(0) = x_0, evaluated at time t. The function f in Equation (2.2) is called the vector field corresponding to the flow φ. Setting g(x) := φ_τ(x), where τ > 0, gives an autonomous discrete-time dynamical system as in Equation (2.1). The discrete system defined in this way samples the values of the continuous system at constant intervals τ. Thus it is a discretisation of the continuous system.

2.1.2 Linear systems

Let us assume that the mapping f in Equation (2.1) is linear, i.e.

    x_{n+1} = A x_n.    (2.4)

Iterating the system for a given initial vector x_0 leads to a sequence

    x_n = A^n x_0.    (2.5)

The possible courses of evolution of such sequences can be characterised by the eigenvalues of the matrix A. If there are eigenvalues that are greater than one in absolute value, almost all of the orbits will diverge to infinity. If all the eigenvalues are less than one in absolute value, the orbits will rapidly converge to the origin. Complex eigenvalues with absolute value of unity will lead to closed circular or elliptic orbits [1]. An affine map x_{n+1} = A x_n + b behaves in essentially the same way. This shows that the autonomous linear system is too simple to describe any interesting dynamical phenomena, because in practice the only stable linear systems converge exponentially to a constant value.

2.1.3 Nonlinear systems and chaos

While the possible dynamics of linear systems are rather restricted, even very simple nonlinear systems can have very complex dynamical behaviour. Nonlinearity is closely associated with chaos, even though not all nonlinear mappings produce chaotic dynamics [49].

A dynamical system is chaotic if its future behaviour is very sensitive to the initial conditions. In such systems, two orbits starting close to each other will rather soon behave very differently. This makes any long-term prediction of the evolution of the system impossible: even for a deterministic system that is perfectly known, the initial conditions would have to be known arbitrarily precisely, which is of course impossible in practice. There is, however, great variation in how far ahead different systems can be predicted.

A classical example of a chaotic system is the weather in the atmosphere of the Earth. It is said that a butterfly flapping its wings in the Caribbean can cause a great storm in Europe a few weeks later. Whether this is actually true or not, it gives a good example of the futility of trying to model a chaotic system and predict its evolution for long periods of time. Even though long-term prediction of a chaotic system is impossible, it is still often possible to predict statistical features of its behaviour. While modelling the weather has proven troublesome, modelling the general long-term behaviour, the climate, is possible. It is also possible to find certain invariant features of chaotic systems that provide a qualitative description of the system.
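The contrast between the restricted linear dynamics of Section 2.1.2 and chaotic nonlinear dynamics is easy to verify numerically. The following Python sketch (illustrative only; the thesis implementation was written in Matlab, and the logistic map is a standard textbook example rather than one used in this work) iterates a stable linear map, whose eigenvalues lie inside the unit circle, and then shows the sensitivity of a chaotic map to its initial conditions.

```python
import numpy as np

# Stable linear map x_{n+1} = A x_n: all eigenvalues inside the unit
# circle, so every orbit converges exponentially to the origin.
A = np.array([[0.9, 0.2],
              [-0.2, 0.9]])
print("eigenvalue magnitudes:", np.abs(np.linalg.eigvals(A)))  # ~0.92

x = np.array([1.0, 1.0])
for n in range(50):
    x = A @ x
print("linear orbit after 50 steps:", x)  # close to the origin

# Logistic map x_{n+1} = 4 x_n (1 - x_n): a very simple nonlinear map
# that is chaotic. Two orbits starting 1e-10 apart soon become unrelated.
def logistic(x, steps):
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)
    return x

x1, x2 = 0.3, 0.3 + 1e-10
print("after 50 steps:", logistic(x1, 50), logistic(x2, 50))
```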
2.1.4 Tools for time series analysis

In a typical mathematical model [9], a time series (x(t_1), ..., x(t_n)) is generated by the flow of a smooth dynamical system on a d-dimensional smooth manifold M:

    s(t) = \phi_t(s(0)).    (2.6)

The original d-dimensional states of the system cannot be observed directly. Instead, the observations consist of the possibly noisy values x(t) of a one-dimensional measurement function h, which are related to the original states by

    x(t) = h(s(t)) + n(t)    (2.7)

where n(t) denotes some noise process corrupting the observations. We shall for now assume that the observations are noiseless, i.e. n(t) = 0, and deal with the noisy case later on.

Reconstructing the state-space

The first problem in modelling a time series like the one described by Equations (2.6) and (2.7) is to try to reconstruct the original state-space or its equivalent, i.e. to find the structure of the manifold M. Two spaces are topologically equivalent if there exists a continuous mapping with a continuous inverse between them. In this case it is sufficient that the equivalent structure is part of a larger entity; the rest can easily be ignored. Thus the interesting concept is embedding, which is defined as follows.

Definition 1. A function f : X → Y is an embedding if it is a continuous mapping with a continuous inverse f⁻¹ : f(X) → X from its range to its domain.

Whitney showed in 1936 [62] that any d-dimensional manifold can be embedded into R^{2d+1}. The theorem can be extended to show that, with a proper definition of almost all for an infinite-dimensional function space, almost all smooth mappings from a given d-dimensional manifold to R^{2d+1} are embeddings [53].

Having only a single time series produced by Equation (2.7), how does one get those 2d+1 different coordinates? This problem can usually be solved by introducing so-called delay coordinates [53].

Definition 2. Let φ be a flow on a manifold M, τ > 0, and h : M → R a smooth measurement function. The delay coordinate map with embedding delay τ, F(h, φ, τ) : M → R^n, is defined by

    s \mapsto (h(s), h(\phi_{-\tau}(s)), h(\phi_{-2\tau}(s)), \ldots, h(\phi_{-(n-1)\tau}(s))).

Takens proved in 1980 [55] that such mappings can indeed be used to reconstruct the state-space of the original dynamical system. This result is known as Takens' embedding theorem.

Theorem 3 (Takens). Let M be a compact manifold of dimension d. For triples (f, h, τ), where f is a smooth vector field on M with flow φ, h : M → R is a smooth measurement function and the embedding delay τ > 0, it is a generic property that the delay coordinate map F(h, φ, τ) : M → R^{2d+1} is an embedding.

Takens' theorem states that in the general case, the dynamics of the system recovered by delay coordinate embedding are the same as the dynamics of the original system. The exact mathematical definition of this "in general" is, however, somewhat more complicated [46].

Definition 4. A subset U ⊂ X of a topological space is residual if it contains a countable intersection of open dense subsets. A property is called generic if it holds in a residual set.

According to Baire's theorem, a residual set cannot be empty. Unfortunately, that is about all that can be said about it: even in Euclidean spaces, a residual set can be of arbitrarily small measure. With a proper definition of almost all in infinite-dimensional spaces and slightly different assumptions, Sauer et al. showed in 1991 that the claim of Takens' theorem actually applies almost always [53].
2.2 Practical considerations

According to Takens' embedding theorem, almost all dynamical systems can be reconstructed from just one noiseless observation sequence. Unfortunately, such observations of real-life systems are very hard to get. There is always some measurement noise, and even if it could be eliminated, the results must be stored in a computer with finite precision to do any practical calculations with them. In practice, the observation noise can be a serious hindrance to the reconstruction. Low-dimensional observations from a high-dimensional chaotic system with a high noise level can easily become indistinguishable from a random time series.

2.2.1 Delay coordinates in practice

Even though Takens' theorem does not give any guarantees of the success of the embedding procedure in the noisy case, the delay coordinate embedding has been found useful in practice and is used widely [22]. In the noisy case, having more than one-dimensional measurements of the same process can help very much in the reconstruction, even though Takens' theorem achieves the goal with just a single time series. The embedding dimension must also be chosen carefully. Too low a dimensionality may cause problems with noise amplification, but too high a dimensionality will inflict other problems and is computationally expensive [9].

Assuming there is access to the original continuous measurement stream, there is also another free parameter in the procedure, namely the embedding delay τ. Takens' theorem applies for almost all delays, at least as long as an infinitely long series of noiseless observations is used. According to Haykin and Principe [22], the delay should be chosen long enough that consecutive observations are essentially independent, but not so long that they become completely unrelated. In practice, a good value can often be found at the first minimum of the mutual information between consecutive samples.

2.2.2 Prediction algorithms

Assume we are analysing the scalar time series x(1), ..., x(T). Using the time-delay embedding with delay d transforms the series to vectors of the form y(t) = (x(t − d + 1), ..., x(t)). Prediction in these coordinates corresponds to predicting the next sample x(t + 1) from the d previous ones, as all the other components of y(t + 1) are directly available in y(t). Thus the problem reduces to finding a predictor of the form

    x(t+1) = f(y(t)).    (2.8)

Using a simple linear function for f leads to a linear auto-regressive (AR) model [20]. There are also several generalisations of the same model leading to other algorithms for the same purpose. Farmer and Sidorowich [14] propose a locally linear predictor in which there are several linear predictors for different areas of the embedded state-space. The achieved performance is comparable with the global linear predictor for small embedding dimensionalities, but as the embedding dimension grows, the local method clearly outperforms the global one. The problem with the locally linear predictor is that it is not continuous. Casdagli [8] presents a review of several different predictors, including a globally linear predictor, a locally linear predictor and a global nonlinear predictor using a radial basis function (RBF) network [21] as the nonlinearity. According to his results, the nonlinear methods are best for small data sets, whereas the locally linear method of Farmer and Sidorowich gives the best results for large data sets.
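A minimal sketch of the delay embedding and the linear predictor of Equation (2.8) is given below. This is purely illustrative (the function names and the toy sine-wave signal are my own, not from the thesis): the delay vectors y(t) are assembled from the scalar series, and the AR-type predictor is fitted by ordinary least squares.

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Delay-coordinate vectors y(t) = (x(t-(dim-1)*tau), ..., x(t))."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

# Stand-in for noisy observations h(s(t)) + n(t) of a hidden system.
t = np.arange(1000)
x = np.sin(0.07 * t) + 0.01 * np.random.randn(len(t))

Y = delay_embed(x, dim=5)   # rows are the embedded states y(t)
X_in = Y[:-1]               # y(t) for t = d-1, ..., T-2
y_next = x[5:]              # the next samples x(t+1)

# Fit the linear predictor x(t+1) ~ w . y(t) + b (a simple AR model)
# by least squares.
A_ls = np.hstack([X_in, np.ones((len(X_in), 1))])
w, *_ = np.linalg.lstsq(A_ls, y_next, rcond=None)
pred = A_ls @ w
print("RMS prediction error:", np.sqrt(np.mean((pred - y_next) ** 2)))
```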
For noisy data, the choice of coordinates can make a big difference to the success of the prediction algorithm. The embedding approach is by no means the only possible alternative and, as Casdagli et al. note in [9], there are often clearly better alternatives. There are many approaches in the field of neural computation where the choice is left to the algorithm. This corresponds to trying to blindly invert Equation (2.7). Such models are called (nonlinear) state-space models, and they are studied in detail in Section 4.2.

Chapter 3
Bayesian methods for data analysis

Inclusion of the effects of noise into the model of a time series leads to the world of statistics — it is no longer possible to talk about exact events, only their probabilities. The Bayesian framework offers the mathematically soundest basis for doing statistical work. In this chapter, a brief review of the most important results and tools of the field is presented. Section 3.1 concentrates on the basic ideas of Bayesian statistics. Unfortunately, exact application of those methods is usually not possible. Therefore Section 3.2 discusses some practical approximation methods that allow getting reasonably good results with limited computational resources. The learning algorithms presented in this work are based on the approximation method called ensemble learning, which is presented in Section 3.3.

This chapter contains many formulas involving probabilities. The notation p(x) is used both for the probability of a discrete event x and for the value of the probability density function (pdf) of a continuous variable at x, depending on what x is. All the theoretical results presented apply equally to both cases, at least when integration over a discrete variable is interpreted in the Lebesgue sense as summation. Some authors use subscripts to separate different pdfs, but here they are omitted to simplify the notation. All pdfs are identified only by the argument of p.

Two important probability distributions, the Gaussian or normal distribution and the Dirichlet distribution, are presented in Appendix A. The notation p(x) = N(x; µ, σ²) is used to denote that x is normally distributed with mean µ and variance σ².

3.1 Bayesian statistics

In Bayesian probability theory, the probability of an event describes the observer's degree of belief in the occurrence of the event [36]. This allows evaluating, for instance, the probability that a certain parameter in a complex model lies in a certain fixed interval.

The Bayesian way of estimating the parameters of a given model centres around the Bayes theorem. Given some data X and a model (or hypothesis) H for it that depends on a set of parameters θ, the Bayes theorem gives the posterior probability of the parameters

    p(\theta|X, H) = \frac{p(X|\theta, H) p(\theta|H)}{p(X|H)}.    (3.1)

In Equation (3.1), the term p(θ|X, H) is called the posterior probability of the parameters. It gives the probability of the parameters when the data and the model are given. Therefore it contains all the information about the values of the parameters that can be extracted from the data. The term p(X|θ, H) is called the likelihood of the data. It is the probability of the data when the model and its parameters are given, and therefore it can usually be evaluated rather easily from the definition of the model. The term p(θ|H) is the prior probability of the parameters. It must be chosen beforehand to reflect one's prior belief about the possible values of the parameters.
The last term p(X|H) is called the evidence of the model H. It can be written as

    p(X|H) = \int p(X|\theta, H) p(\theta|H) \, d\theta    (3.2)

and it ensures that the right-hand side of the equation is properly scaled. In any case it is just a constant, independent of the values of the parameters, and can thus usually be ignored when inferring the values of the parameters of the model. This way the Bayes theorem can be written in the more compact form

    p(\theta|X, H) \propto p(X|\theta, H) p(\theta|H).    (3.3)

The evidence is, however, very important when comparing different models.

The key idea in Bayesian statistics is to work with full distributions of parameters instead of single values. In calculations that require a value for a certain parameter, instead of choosing a single "best" value, one must use all the values and weight the results according to the posterior probabilities of the parameter values. This is called marginalising over the parameter.

3.1.1 Constructing probabilistic models

To use the techniques of probabilistic modelling, one must usually somehow specify the likelihood of the data given some parameters. Not all models are, however, probabilistic in nature and have naturally emerging likelihoods. Many models are defined as generative models by stating how the observed data could be generated from unknown parameters or latent variables. For given data vectors x(t), a simple linear generative model could be written as

    x(t) = A s(t)    (3.4)

where A is an unobserved transformation matrix and the vectors s(t) are unobserved hidden or latent variables. In some application areas the latent variables are also called sources or factors [6, 35].

One way to turn such a generative model into a probabilistic model is to add a noise term to Equation (3.4). This yields

    x(t) = A s(t) + n(t)    (3.5)

where the noise can, for example, be assumed to be zero-mean and Gaussian, as in

    n(t) \sim N(0, \Sigma_n).    (3.6)

Here Σ_n denotes the covariance matrix of the noise; it is usually assumed to be diagonal to make the different components of the noise independent. This implies a likelihood for the data given by

    p(x(t)|A, s(t), \Sigma_n, H) = N(x(t); A s(t), \Sigma_n).    (3.7)

For a complete probabilistic model, one also needs priors for all the parameters of the model. Hierarchical models and conjugate priors, for instance, can be useful mathematical tools here, but in the end the priors should be chosen to represent true prior knowledge about the solution of the problem at hand.
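The generative model of Equations (3.5)–(3.7) is easy to simulate and its likelihood easy to evaluate. A minimal sketch (dimensions and parameter values are arbitrary illustrations of my own, not taken from the thesis):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

n_obs, n_src, T = 4, 2, 100

A = rng.standard_normal((n_obs, n_src))   # mixing matrix A
S = rng.standard_normal((n_src, T))       # latent sources s(t)
Sigma_n = np.diag([0.1] * n_obs)          # diagonal noise covariance

# Generate data according to x(t) = A s(t) + n(t), Equation (3.5).
N = rng.multivariate_normal(np.zeros(n_obs), Sigma_n, size=T).T
X = A @ S + N

# Log-likelihood log p(X | A, S, Sigma_n) from Equation (3.7):
# the x(t) are conditionally independent given the parameters.
loglik = sum(
    multivariate_normal.logpdf(X[:, t], mean=A @ S[:, t], cov=Sigma_n)
    for t in range(T))
print("log-likelihood:", loglik)
```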
3.1.2 Hierarchical models

Many models include groups of parameters that are somehow related or connected. This connection should be reflected in the prior chosen for them. Hierarchical models provide a useful tool for building priors for such groups. This is done by giving the parameters a common prior distribution, which is parameterised with new higher-level hyperparameters [16]. Such a group would typically include parameters that have a somehow similar status in the model. Hierarchical models are well suited for neural network related problems because such connected groups emerge naturally, for example the different elements of a weight matrix. The definitions of the components of the Bayesian nonlinear switching state-space model in Chapter 5 contain several examples of hierarchical priors.

3.1.3 Conjugate priors

For a given class of likelihood functions p(X|θ), the class P of priors p(θ) is called conjugate if the posterior p(θ|X) is of the same class P. This is a very useful property if the class P consists of a set of probability densities with the same functional form. In such a case, the posterior distribution will also have the same functional form. For instance, the conjugate prior for the mean of a Gaussian distribution is Gaussian. In other words, if the prior of the mean and the functional form of the likelihood are Gaussian, the posterior of the mean will also be Gaussian [16].

3.2 Posterior approximations

Even though Bayesian statistics gives the optimal method for performing statistical inference, exact use of those tools is impossible for all but the simplest models. Even if the likelihood and prior can be evaluated to give the unnormalised posterior of Equation (3.3), the integral needed for the scaling term of Equation (3.2) is usually intractable. This makes analytical evaluation of the posterior impossible.

As exact Bayesian inference is usually impossible, there are many algorithms and methods that approximate it. The simplest method is to approximate the posterior with a discrete distribution concentrated at the maximum of the posterior density given by Equation (3.3). This gives a single value for all the parameters. The method is called maximum a posteriori (MAP) estimation. It is closely related to the classical technique of maximum likelihood (ML) estimation, in which the contribution of the prior is ignored and only the likelihood term p(X|θ, H) is maximised [57].

The MAP estimate is troublesome because, especially in high-dimensional spaces, high probability density does not necessarily have anything to do with high probability mass, which is the quantity of interest. A narrow spike can have very high density, but because of its very small width, the actual probability of the studied parameter belonging to it is small. In high-dimensional spaces the width of the mode is much more important than its height.

As an example, let us consider a simple linear model for data x(t),

    x(t) = A s(t)    (3.8)

where both A and s(t) are unknown. This is the generative model of standard PCA and ICA [27]. Assuming that both A and s(t) have unimodal prior distributions centred at the origin, the MAP solution will typically give very small values for s(t) and very large values for A. This is because there are so many more parameters in s(t) than in A that it pays off to make the sources very close to their prior most probable value, even at the cost of A having huge values. Of course such a solution cannot make any sense, because the source values must then be specified very precisely in order to describe the data. In simple linear models such behaviour can be suppressed by restricting the values of A suitably. In more complex models there usually is no way to restrict the values of the parameters, and using better approximation methods is essential.

The same problem in a two-dimensional case is illustrated in Figure 3.1.¹ The mean of the two-dimensional distribution in the figure lies near the centre of the square where most of the probability mass is concentrated. The narrow spike has high density but it is not very massive. Using a gradient-based algorithm to find the maximum of the distribution (the MAP estimate) would inevitably lead to the top of the spike. The situation in the figure may not look that bad, but the problem gets much worse when the dimensionality is higher.

¹ The basic idea of the figure is due to Barber and Bishop [2].

Figure 3.1: High probability density does not always imply high probability mass. The spike on the right has the highest density even though most of the mass is near the centre.
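The point of Figure 3.1 can be verified numerically in one dimension. In the following sketch (a toy construction of my own, not from the thesis), a mixture density has a narrow spike whose density exceeds that of the broad bump, while the probability mass around the spike stays small.

```python
import numpy as np
from scipy.stats import norm

# Mixture of a broad bump (most of the mass) and a narrow spike.
broad, spike = norm(loc=0.0, scale=1.0), norm(loc=4.0, scale=0.01)
p = lambda x: 0.95 * broad.pdf(x) + 0.05 * spike.pdf(x)

print("density at broad mode :", p(0.0))   # ~0.38
print("density at spike mode :", p(4.0))   # ~2.0, much higher

# Probability mass within +-0.5 of each mode: the broad mode wins.
half = 0.5
mass = lambda c: 0.95 * (broad.cdf(c + half) - broad.cdf(c - half)) \
               + 0.05 * (spike.cdf(c + half) - spike.cdf(c - half))
print("mass near broad mode  :", mass(0.0))  # ~0.36
print("mass near spike mode  :", mass(4.0))  # ~0.05
```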
3.2.1 Model selection

According to the marginalisation principle, the correct way to compare different models in the Bayesian framework is to always use all the models, weighting their results by the respective posterior probabilities. This is a computationally demanding approach, and it may be desirable to use only one model even if doing so does not lead to optimal results. The rest of this section concentrates on finding suitable criteria for choosing just one "best" model.

Occam's razor is an old scientific principle which states that when trying to explain some phenomenon, the simplest model that can adequately explain it should be chosen. There is no point in choosing an overly complex model when even a much simpler one would do. A very complex model will be able to fit the given data almost perfectly, but it will not be able to generalise very well. On the other hand, very simple models will not be able to describe the essential features of the data. One must make a compromise and choose a model with enough, but not too much, complexity [57, 10].

MacKay showed in [37] how this can be done in the Bayesian framework by evaluating and comparing model evidences. The evidence of a model H_i is defined as the probability p(X|H_i) of the data given the model. This is just the scaling factor in the denominator of the Bayes theorem in Equation (3.1), and it can be evaluated as shown in Equation (3.2). True Bayesian comparison of the models would require using the Bayes theorem to get the posterior probability of the model as

    p(H_i|X) \propto p(X|H_i) p(H_i).    (3.9)

If, however, there is no reason to prefer one model over another, one can choose a uniform prior over all the models. This way the posterior probability of a model is directly proportional to the evidence. There is no need to use a prior on the models to penalise complex ones; the evidence will do that automatically.

Figure 3.2 shows how the model evidence can be used to choose the right model.² In the figure, the horizontal axis ranges through all the possible data sets. The curves show the values of the evidence for different models and different data sets. As the distributions p(X|H_i) are all normalised, the area under each curve is equal. A simple model like H_1 can only describe a small range of possible data sets. It gives a high evidence for those but nothing for the rest. A very complex model like H_2 can describe a much larger variety of data sets. Therefore it has to spread its predictions more thinly than model H_1, and it gives lower evidence for simple data sets. For a data set like X_1 lying in the middle, both extremes lose to model H_3, which is just good enough for that data and therefore exactly the model called for by Occam's razor.

Figure 3.2: How model evidence embodies Occam's razor [37]. For a given data set X_1, the model with just the right complexity will have the highest evidence.

After this point, the explicit references to the model H in expressions for different probabilities are omitted. This is done purely to simplify the notation. It should be noted that in the Bayesian framework all probabilities are conditional on some assumptions, because they are always subjective.

² The idea of the figure is due to Dr. David MacKay [37].
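The automatic Occam's razor of the evidence can be seen in a tiny worked example (my own illustration, not from the thesis): comparing a fair-coin model against a model with a free bias parameter under a uniform prior, where the evidence integral of Equation (3.2) is analytic.

```python
import numpy as np
from scipy.special import betaln

def log_evidence_fixed(heads, n):
    # H1: the coin is exactly fair; no free parameters.
    return n * np.log(0.5)

def log_evidence_free(heads, n):
    # H2: unknown bias with a uniform prior; integrating the likelihood
    # over the parameter (Equation (3.2)) gives the Beta function:
    # p(X|H2) = B(heads + 1, n - heads + 1).
    return betaln(heads + 1, n - heads + 1)

for heads in (50, 65, 90):  # out of n = 100 tosses
    e1 = log_evidence_fixed(heads, 100)
    e2 = log_evidence_free(heads, 100)
    best = "fair coin" if e1 > e2 else "free bias"
    print(f"heads={heads}: log-evidences {e1:.1f} vs {e2:.1f} -> {best}")
```

For balanced data the simpler model wins even though the complex model contains it as a special case; only clearly biased data justify the extra parameter.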
3.2.2 Stochastic approximations

The basic idea in stochastic approximations is to somehow get samples from the true posterior distribution. This is usually done by constructing a Markov chain for the model parameters θ whose stationary distribution corresponds to the posterior distribution p(θ|X). Simulation algorithms doing this are called Markov chain Monte Carlo (MCMC) methods.

The most important such algorithm is the Metropolis–Hastings algorithm. To use it, one must be able to compute the unnormalised posterior of Equation (3.3). In addition, one must specify a jumping distribution for the parameters. This can be almost any reasonable distribution that models possible transitions between different parameter values. The algorithm works by drawing random samples from the jumping distribution and then either accepting the suggested transition or rejecting it, depending on the ratio of the posterior probabilities of the two values in question. The normalising factor of the posterior is not needed, as only the ratio of the posterior probabilities is used [16].

Neal [44] discusses the use of MCMC methods for neural networks in detail. He also presents some modifications to the basic algorithms to improve their performance for large neural network models.
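A minimal sketch of the Metropolis–Hastings algorithm follows (the bimodal target density is an arbitrary stand-in of my own for an unnormalised posterior; with the symmetric Gaussian jumping distribution used here, the algorithm reduces to the plain Metropolis variant).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # Unnormalised log-posterior, as in Equation (3.3); here a toy
    # bimodal density standing in for p(theta|X).
    return np.logaddexp(-0.5 * (theta - 2.0) ** 2,
                        -0.5 * (theta + 2.0) ** 2)

def metropolis_hastings(n_samples, step=1.0):
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal()  # jumping dist.
        # Accept with probability min(1, posterior ratio); the unknown
        # normalising constant p(X) cancels in the ratio.
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples.append(theta)
    return np.array(samples)

samples = metropolis_hastings(20000)
print("posterior mean ~", samples.mean())  # ~0 for the symmetric target
```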
3.2.3 Laplace approximation

Laplace's method approximates the integral of a function, ∫ f(w) dw, by fitting a Gaussian at the maximum ŵ of f(w) and computing the volume under the Gaussian. The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(w) at the maximum point ŵ [40].

The same name is also used for the method of approximating the posterior distribution with a Gaussian centred at the maximum a posteriori estimate. This can be justified by the fact that under certain regularity conditions, the posterior distribution approaches a Gaussian distribution as the number of samples grows [16].

Despite using a full distribution to approximate the posterior, Laplace's method still suffers from most of the problems of MAP estimation. Estimating the variances at the end does not help if the procedure has already led to an area of low probability mass.

3.2.4 EM algorithm

The EM algorithm was originally presented by Dempster et al. in 1977 [13] as a method for doing maximum likelihood estimation for incomplete data. Their work formalised and generalised several older ideas; for example, Baum et al. [3] used very similar methods with hidden Markov models several years earlier.

The term "incomplete data" refers to a case where the complete data consist of two parts, only one of which is observed. The missing part can only be inferred from how it affects the observed part. A typical example would be the generative model in Section 3.1.1, where the x(t) are the observed data X and the sources s(t) are the missing data S.

The same algorithm can easily be interpreted in the Bayesian framework. Let us assume that we are trying to estimate the posterior probability p(θ|X) of some model parameters θ. In such a case, the Bayesian version gives a point estimate at the posterior maximum of θ.

The algorithm starts from some initial estimate θ̂₀. The estimate is improved iteratively by first computing the conditional probability distribution p(S|θ̂ᵢ, X) of the missing data, given the current estimate of the parameters. After this, a new estimate θ̂ᵢ₊₁ for the parameters is computed by maximising the expectation of log p(θ|S, X) over the previously calculated p(S|θ̂ᵢ, X). The first step is called the expectation step (E-step) and the latter the maximisation step (M-step) [16, 57].

An important feature of the EM algorithm is that it uses the complete probability distribution of the missing data and point estimates only for the model parameters. It should therefore give better results than simple MAP estimation.

Neal and Hinton showed in [45] that the EM algorithm can be interpreted as minimisation of the cost function

    C = E_{q(S)}\left[\log \frac{q(S)}{p(X, S|\theta)}\right] = E_{q(S)}\left[\log \frac{q(S)}{p(S|X, \theta)}\right] - \log p(X|\theta)    (3.10)

where at step t, q(S) = p(S|θ̂_{t−1}, X). The notation E_{q(S)}[···] means that the expectation is taken over the distribution q(S). This interpretation can be used to justify many interesting variants of the algorithm.

3.3 Ensemble learning

Ensemble learning is a relatively new concept, suggested by Hinton and van Camp in 1993 [23]. It allows approximating the true posterior distribution with a tractable approximation and fitting it to the actual probability mass with no intermediate point estimates.

The posterior distribution of the model parameters θ, p(θ|X, H), is approximated with another distribution, the approximating ensemble q(θ). The objective function chosen to measure the quality of the approximation is essentially the same cost function as the one of the EM algorithm in Equation (3.10) [38]:

    C(\theta) = E_{q(\theta)}\left[\log \frac{q(\theta)}{p(\theta, X|H)}\right] = \int q(\theta) \log \frac{q(\theta)}{p(\theta, X|H)} \, d\theta.    (3.11)

Ensemble learning is based on finding an optimal function to approximate another function. Such optimisation methods are called variational methods, and therefore ensemble learning is sometimes also called variational learning [30].

A closer look at the cost function C(θ) shows that it can be represented as a sum of two simple terms:

    C(\theta) = \int q(\theta) \log \frac{q(\theta)}{p(\theta|X, H) p(X|H)} \, d\theta
              = \int q(\theta) \log \frac{q(\theta)}{p(\theta|X, H)} \, d\theta - \log p(X|H).    (3.12)

The first term in Equation (3.12) is the Kullback–Leibler divergence between the approximate posterior q(θ) and the true posterior p(θ|X, H). A simple application of Jensen's inequality [52] shows that the Kullback–Leibler divergence D(q||p) between two distributions q(θ) and p(θ) is always nonnegative:

    -D(q(\theta)||p(\theta)) = \int q(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta
                             \le \log \int q(\theta) \frac{p(\theta)}{q(\theta)} \, d\theta = \log \int p(\theta) \, d\theta = \log 1 = 0.    (3.13)

Since the logarithm is a strictly concave function, equality holds if and only if p(θ)/q(θ) is constant, i.e. p(θ) = q(θ). The Kullback–Leibler divergence is not symmetric and it does not obey the triangle inequality, so it is not a metric. Nevertheless, it can be considered a kind of distance measure between probability distributions [12].
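These properties of the Kullback–Leibler divergence are easy to check numerically. A minimal sketch with discretised Gaussians (an illustration of my own, not from the thesis):

```python
import numpy as np

# Discretised KL divergence D(q||p) = sum q log(q/p) dx on a grid.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(q, p):
    return np.sum(q * np.log(q / p)) * dx

q = gaussian(x, 0.0, 1.0)
p = gaussian(x, 1.0, 2.0)
print("D(q||p) =", kl(q, p))   # nonnegative, ~0.44 analytically
print("D(p||q) =", kl(p, q))   # different value: KL is not symmetric
print("D(q||q) =", kl(q, q))   # zero if and only if the two agree
```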
This allows using the values of the cost function for model selection as presented in Section 3.2.1 [35]. An important feature for practical use of ensemble learning is that the cost function and its derivatives with respect to the parameters of the approximating distribution can be easily evaluated for many models. Hinton and van Camp [23] used a separable Gaussian approximating distribution for a single hidden layer MLP network. After that many authors have used the method for different applications. 3.3.1 Information theoretic approach In their original paper [23], Hinton and van Camp approached ensemble learning from an information theoretic point of view by using the Minimum Description Length (MDL) Principle [61]. They developed a new coding method for 3.3. Ensemble learning 22 noisy parameter values which led to the cost function of Equation (3.11). This allows interpreting the cost C(θ) in Equation (3.11) as a description length for the data using the chosen model. The MDL principle asserts that the best model for given data is the one that attains the shortest description of the data. The description length can be evaluated in bits and it represents the length of the message needed to transmit the data. The idea is that one builds a model for the data and then sends the description of that model and the residual of the data that could not be modelled. Thus the total description length is L(data) = L(model) + L(error). The code length is related to probability because according to the coding theorem, an event x1 having probability p(x1 ) can be coded using − log2 p(x1 ) bits, assuming both the sender and the receiver know the distribution p. In their article Hinton and van Camp developed a method for encoding the parameters of the model in such a way, that the expected code length is the one given by Equation (3.11). Derivation of this result can be found in the original paper by Hinton and van Camp [23] or in the doctoral thesis [57] by Harri Valpola. Chapter 4 Building blocks of the model The nonlinear switching state-space model is a combination of two well known models, the hidden Markov model (HMM) and the nonlinear state-space model (NSSM). In this chapter, a brief review of previous work on these models and their combinations is presented. Section 4.1 deals with HMMs and Section 4.2 with NSSMs. Section 4.3 discusses some of the combinations of the two models. The HMM and linear state-space model (SSM) are actually rather closely related. They can both be interpreted as linear Gaussian models [50]. The greatest difference between the models is that the SSM has continuous hidden states whereas the HMM has only discrete states. Roweis and Ghahramani have written a review [50] that nicely shows the common properties of the models. This thesis will nevertheless follow the more standard formulations, which tend to hide these connections. 4.1 Hidden Markov models Hidden Markov models (HMMs) are latent variable models based on the Markov chains. HMMs are widely used especially in speech recognition and there is an extensive literature on them. The tutorial by Rabiner [48] is often considered a definitive guide to HMMs in speech recognition. Ghahramani [17] gives another introduction with some more recent extensions to the basic model. Before going into details of HMMs, some of the basic properties of Markov chains are first reviewd briefly. 23 4.1. 
4.1.1 Markov chains

A Markov chain is a discrete stochastic process with discrete states and discrete transitions between them. At each time instant the system is in one of N possible states, numbered from one to N. At regularly spaced discrete times, the system switches its state, possibly back to the same state. The initial state of the chain is denoted M_1 and the states after each time of change are M_2, M_3, .... A standard first-order Markov chain has the additional property that the probabilities of the future states depend only on the current state and not on the ones before it [48]. Formally this means that

    p(M_{t+1} = j | M_t = i, M_{t-1} = k, \ldots) = p(M_{t+1} = j | M_t = i) \quad \forall t, i, j, k.    (4.1)

This is called the Markov property of the chain.

Because of the Markov property, the complete probability distribution of the states of a Markov chain is defined by the initial distribution π_i = p(M_1 = i) and the state transition probability matrix

    a_{ij} = p(M_{t+1} = j | M_t = i), \quad 1 \le i, j \le N.    (4.2)

Let us denote π = (π_i) and A = (a_ij). In the general case the transition probabilities could be time dependent, i.e. a_ij = a_ij(t), but in this thesis only the time-independent case is considered. This allows the evaluation of the probability of a state sequence M = (M_1, M_2, ..., M_T), given the model parameters θ = (A, π), as

    p(M|\theta) = \pi_{M_1} \prod_{t=1}^{T-1} a_{M_t M_{t+1}}.    (4.3)

4.1.2 Hidden states

A hidden Markov model is basically a Markov chain whose internal state cannot be observed directly but only through some probabilistic function. That is, the internal state of the model only determines the probability distribution of the observed variables.

Let us denote the observations by X = (x(1), ..., x(T)). For each state, the distribution p(x(t)|M_t) is defined and independent of the time index t. The exact form of this conditional distribution depends on the application. In the simplest case there is only a finite number of different observation symbols. In this case the distribution can be characterised by the point probabilities

    b_i(m) = p(x(t) = m | M_t = i), \quad \forall i, m.    (4.4)

Letting B = (b_i(m)) and θ = (A, B, π), the joint probability of an observation sequence and a state sequence can be evaluated by a simple extension of Equation (4.3):

    p(X, M|\theta) = \pi_{M_1} \left[ \prod_{t=1}^{T-1} a_{M_t M_{t+1}} \right] \left[ \prod_{t=1}^{T} b_{M_t}(x(t)) \right].    (4.5)

The posterior probability of a state sequence can be derived from this as

    p(M|X, \theta) = \frac{1}{p(X|\theta)} \pi_{M_1} \left[ \prod_{t=1}^{T-1} a_{M_t M_{t+1}} \right] \left[ \prod_{t=1}^{T} b_{M_t}(x(t)) \right]    (4.6)

where p(X|\theta) = \sum_M p(X, M|\theta).

4.1.3 Continuous observations

An obvious extension of the basic HMM is to allow a continuous observation space instead of a finite number of discrete symbols. In this model the parameters B cannot be described as a simple matrix of point probabilities but rather as a complete pdf over the continuous observation space for each state. Therefore the values of b_i(m) in Equation (4.4) must be replaced with a continuous probability distribution

    b_i(x(t)) = p(x(t) | M_t = i), \quad \forall x(t), i.    (4.7)

This model is called the continuous density hidden Markov model (CDHMM). The probability of an observation sequence, evaluated in Equation (4.5), stays the same. The conditional distributions b_i(x(t)) can in principle be arbitrary, but usually they are restricted to be finite mixtures of simple parametric distributions, like Gaussians [48].
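The generative structure of Equation (4.5) is easy to simulate: draw the state sequence from the Markov chain and then an observation from b_i for each state. A minimal CDHMM sketch (the two-state parameter values are arbitrary illustrations of my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy CDHMM: transition matrix A, initial distribution pi, and a
# one-dimensional Gaussian observation model b_i for each state.
A_trans = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
pi = np.array([1.0, 0.0])
means, stds = np.array([-1.0, 2.0]), np.array([0.5, 0.5])

def sample_hmm(T):
    """Draw (M, X) from p(X, M | theta) of Equation (4.5)."""
    M = np.empty(T, dtype=int)
    X = np.empty(T)
    for t in range(T):
        M[t] = rng.choice(2, p=pi if t == 0 else A_trans[M[t - 1]])
        X[t] = means[M[t]] + stds[M[t]] * rng.standard_normal()
    return M, X

M, X = sample_hmm(20)
print("states:      ", M)
print("observations:", np.round(X, 2))
```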
4.1.4 Learning algorithms

Rabiner [48] lists three basic problems for HMMs:

1. Given the observation sequence X and the model θ, how can the probability p(X | θ) be computed efficiently?
2. Given the observation sequence X and the model θ, how can the corresponding optimal state sequence M be estimated?
3. Given the observation sequence X, how can the model parameters θ be optimised to better describe the observations?

Problem 1 is not a learning problem but rather one needed in evaluating the model. Its difficulty comes from the fact that one needs to calculate the sum over all possible state sequences. This can be avoided by calculating the probabilities recursively with respect to the length of the observation sequence. This operation forms one half of the forward–backward algorithm and is very typical of HMM calculations. The algorithm uses an auxiliary variable \alpha_i(t) defined as

    \alpha_i(t) = p(x_{1:t}, M_t = i | \theta).    (4.8)

Here we have used the notation x_{1:t} = (x(1), \ldots, x(t)). The algorithm to evaluate p(X | θ) proceeds as follows:

1. Initialisation: \alpha_j(1) = b_j(x(1)) \pi_j
2. Iteration: \alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t) a_{ij} \right] b_j(x(t+1))
3. Termination: p(X | \theta) = \sum_{i=1}^{N} \alpha_i(T).

The solution of Problem 2 is much more difficult. Rabiner uses classical statistics and interprets it as a maximum likelihood estimation problem for the state sequence, the solution of which is given by the Viterbi algorithm. A Bayesian solution to the problem can be found with a simple modification of the solution of the next problem, which essentially uses the posterior of the hidden states.

Rabiner's statement of Problem 3 is more precise than ours and asks for the maximum likelihood solution optimising p(X | θ) with respect to θ. This problem is solved by the Baum–Welch algorithm, which is an application of the EM algorithm to the problem.

The Baum–Welch algorithm uses the complete forward–backward procedure with a backward pass to compute \beta_i(t) = p(x_{t:T} | M_t = i, \theta). This can be done very much like the forward pass:

1. Initialisation: \beta_i(T) = b_i(x(T))
2. Iteration: \beta_i(t) = b_i(x(t)) \left[ \sum_{j=1}^{N} a_{ij} \beta_j(t+1) \right].

The M-step of the Baum–Welch algorithm can be expressed in terms of \eta_{ij}(t), the posterior probability that there was a transition from state i to state j at time step t, given X and θ. This probability can easily be calculated with the forward and backward variables α and β:

    \eta_{ij}(t) = p(M_t = i, M_{t+1} = j | X, \theta) = \frac{1}{Z_n} \alpha_i(t) a_{ij} \beta_j(t+1)    (4.9)

where Z_n is a normalising constant such that \sum_{i,j=1}^{N} \eta_{ij}(t) = 1. The M-step for a_{ij} is then

    a_{ij} = \frac{1}{Z_n'} \sum_{t=1}^{T-1} \eta_{ij}(t)    (4.10)

where Z_n' is another constant needed to normalise the transition probabilities. There are similar update formulas for the other parameters. The algorithm can also easily be extended to take into account possible priors of the different variables [39].
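The following numpy sketch (an illustration under the conventions of this text, not the thesis code) implements the forward–backward recursions and the Baum–Welch transition update of Equations (4.9)–(4.10) for a discrete-observation HMM. The scaling or log-space tricks needed for long sequences are omitted for clarity.

```python
import numpy as np

def forward_backward(pi, A, B, X):
    T, N = len(X), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = B[:, X[0]] * pi                        # initialisation
    for t in range(T - 1):                            # forward iteration
        alpha[t + 1] = (alpha[t] @ A) * B[:, X[t + 1]]
    beta[T - 1] = B[:, X[T - 1]]                      # convention of the text
    for t in range(T - 2, -1, -1):                    # backward iteration
        beta[t] = B[:, X[t]] * (A @ beta[t + 1])
    return alpha, beta, alpha[-1].sum()               # last value is p(X|theta)

def baum_welch_A(pi, A, B, X):
    alpha, beta, _ = forward_backward(pi, A, B, X)
    eta = np.zeros_like(A)
    for t in range(len(X) - 1):
        e = alpha[t][:, None] * A * beta[t + 1][None, :]
        eta += e / e.sum()                            # Eq. (4.9), normalised per t
    return eta / eta.sum(axis=1, keepdims=True)       # Eq. (4.10)
```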
Ensemble learning for hidden Markov models

MacKay showed in [39] how ensemble learning can be applied to learning HMMs with discrete observations. With suitable priors for all the variables, the problem can be solved analytically, and the resulting algorithm turns out to be a rather simple modification of the Baum–Welch algorithm.

MacKay uses Dirichlet distributions as priors for the model parameters θ. (See Appendix A for the definition of the Dirichlet distribution and some of its properties.) The likelihood of the model is discrete with respect to all the parameters, and the Dirichlet distribution is the conjugate prior of such a distribution. Because of this, the posterior distributions of all the parameters are also Dirichlet. The update rule for the parameters W_{ij} of the Dirichlet distribution of the transition probabilities is

    W_{ij} = \sum_{t=1}^{T-1} \sum_{M} q(M) \delta(M_t = i, M_{t+1} = j) + u^{(a)}_{ij}    (4.11)

where u^{(a)}_{ij} are the parameters of the prior and q(M) is the approximate posterior used in ensemble learning. The update rules for the other parameters are similar.

The posterior distribution of the hidden state probabilities turns out to be of exactly the same form as in Equation (4.6), but with the likelihood a_{ij} replaced by a^*_{ij} = \exp( E_{q(A)}[\ln a_{ij}] ), and similarly for b_i(x(t)) and \pi_i. The required expectations over the Dirichlet distribution can be evaluated as in Equation (A.13) of Appendix A.

4.2 Nonlinear state-space models

The fundamental goal of this work has been to extend the nonlinear state-space model (NSSM) and the learning algorithm for it developed by Dr. Harri Valpola [58, 60] by adding the possibility of several distinct states for the dynamical system, as governed by the HMM. The NSSM algorithm is an extension of the earlier nonlinear factor analysis (NFA) algorithm [34]. In this section, some of the previous work on these models is reviewed briefly. This complements the discussion in Section 2.2.2, as the problem to be solved is essentially the same. But first, let us begin with an introduction to linear state-space models (SSMs), of which the nonlinear version is a generalisation.

4.2.1 Linear models

The linear state-space model (SSM) is built from a standard linear dynamical system as in Equation (2.4). However, the state is not observed directly but only through another linear mapping. The basic model can be described by the pair of equations

    s(t+1) = B s(t) + m(t)
    x(t) = A s(t) + n(t)    (4.12)

where x(t) \in R^n, t = 1, \ldots, T are the observations and s(t) \in R^m, t = 1, \ldots, T are the hidden internal states of the dynamical system. These internal states lie in the so-called state-space, hence the name state-space model. The vectors m(t) and n(t) are the process and observation noise, respectively. The mappings A and B are the linear observation and prediction mappings.

As we saw in Section 2.1.2, the possible dynamics of such a linear model are very restricted. The process noise may keep the whole system from converging to a single point, but the system still cannot explain any interesting complex phenomena. Linear SSMs are nevertheless widely used because they are very efficient and provide a good enough short-term approximation in many cases, even if the true dynamics are nonlinear.

The most famous variant of the linear SSM is the Kalman filter [31, 41, 32]. It gives a learning algorithm for the model defined by Equation (4.12), assuming all the parameters are Gaussian and certain independence assumptions are met. Kalman filtering is used in several applications because it is easy to implement and efficient.

4.2.2 Extension from linear to nonlinear

The nonlinear state-space model is in principle a very simple extension of the linear model. All that needs to be done is to replace the linear mappings A and B of Equation (4.12) with general nonlinear mappings to get

    s(t+1) = g(s(t)) + m(t)
    x(t) = f(s(t)) + n(t),    (4.13)

where f and g are sufficiently smooth nonlinear mappings and all the other variables are as in Equation (4.12).
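To make the generative model (4.13) concrete, the following sketch simulates data from an NSSM. The particular choices of f and g, the dimensions and the noise levels are arbitrary illustrations, not anything prescribed by the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(s):            # nonlinear state dynamics (illustrative choice)
    return 0.95 * s + 0.3 * np.tanh(s)

def f(s):            # nonlinear observation mapping (illustrative choice)
    return np.array([np.sin(s[0]), s[0] * s[1], np.cos(s[1])])

T, state_dim, obs_dim = 200, 2, 3
S = np.zeros((T, state_dim))
X = np.zeros((T, obs_dim))
for t in range(T):
    if t > 0:
        S[t] = g(S[t - 1]) + 0.05 * rng.standard_normal(state_dim)  # m(t)
    X[t] = f(S[t]) + 0.1 * rng.standard_normal(obs_dim)             # n(t)
```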
One of the greatest difficulties with the nonlinear model is that while a linear mapping A : R^m \to R^n can be uniquely determined by m \cdot n real numbers (the elements of the corresponding matrix), there is no way to do the same for nonlinear mappings. Representing an arbitrary nonlinear mapping with even moderate accuracy requires many more parameters. While the matrix form provides a natural parameterisation for linear mappings, no such evident approach exists for nonlinear mappings.

There are several ways to approximate a given nonlinear mapping through different kinds of series decompositions. Unfortunately, such parameterisations are often best suited to some special classes of functions, or are very sensitive to the data and thus useless in the presence of noise. For example, in polynomial approximations the coefficients become very sensitive to the data as the degree of the approximating polynomial is raised. Trigonometric expansions (Fourier series) are well suited only for modelling periodic functions.

In the neural network literature there are two competing nonlinear function approximation methods that are both widely used: radial basis function (RBF) networks and multilayer perceptron (MLP) networks. Both are universal function approximators, meaning that given enough "neurons" they can model any function to the desired degree of accuracy [21]. In this work only MLP networks are used, so they are now discussed in greater detail.

4.2.3 Multilayer perceptrons

An MLP is a network of simple neurons called perceptrons. The basic concept of a single perceptron was introduced by Rosenblatt in 1958. The perceptron computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then possibly putting the output through some nonlinear activation function. Mathematically this can be written as

    y = \varphi\left( \sum_{i=1}^{n} w_i x_i + b \right) = \varphi(w^T x + b)    (4.14)

where w denotes the vector of weights, x the vector of inputs, b the bias and \varphi the activation function. A signal-flow graph of this operation is shown in Figure 4.1 [21, 5].

[Figure 4.1: Signal-flow graph of the perceptron.]

The original Rosenblatt perceptron used a Heaviside step function as the activation function \varphi. Nowadays, and especially in multilayer networks, the activation function is often chosen to be the logistic sigmoid 1/(1 + e^{-x}) or the hyperbolic tangent tanh(x). The two are related by (tanh(x) + 1)/2 = 1/(1 + e^{-2x}). These functions are used because they are mathematically convenient: they are close to linear near the origin while saturating rather quickly away from it. This allows MLP networks to model well both strongly and mildly nonlinear mappings.

A single perceptron is not very useful because of its limited mapping ability: no matter what activation function is used, the perceptron can only represent an oriented ridge-like function. Perceptrons can, however, be used as building blocks of a larger, much more practical structure. A typical multilayer perceptron (MLP) network consists of a set of source nodes forming the input layer, one or more hidden layers of computation nodes, and an output layer of nodes. The input signal propagates through the network layer by layer. The signal-flow graph of such a network with one hidden layer is shown in Figure 4.2 [21].

The computations performed by such a feedforward network with a single hidden layer of nonlinear activation functions and a linear output layer can be written mathematically as

    x = f(s) = B \varphi(A s + a) + b    (4.15)

where s is the vector of inputs and x the vector of outputs. A is the weight matrix of the first layer and a its bias vector; B and b are, respectively, the weight matrix and the bias vector of the second layer. The function \varphi denotes an elementwise nonlinearity. The generalisation of the model to more hidden layers is obvious.
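The forward pass of Equation (4.15) is only a few lines of code. The sketch below is a minimal illustration with tanh hidden units; the dimensions and random weights are arbitrary.

```python
import numpy as np

def mlp_forward(s, A, a, B, b):
    """One-hidden-layer MLP of Eq. (4.15): x = B * tanh(A s + a) + b."""
    return B @ np.tanh(A @ s + a) + b

rng = np.random.default_rng(1)
in_dim, hidden_dim, out_dim = 2, 10, 3
A = rng.standard_normal((hidden_dim, in_dim))
a = rng.standard_normal(hidden_dim)
B = rng.standard_normal((out_dim, hidden_dim))
b = rng.standard_normal(out_dim)
print(mlp_forward(rng.standard_normal(in_dim), A, a, B, b))
```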
While single-layer networks composed of parallel perceptrons are rather limited in what kinds of mappings they can represent, the power of an MLP network with only one hidden layer is surprisingly large. As Hornik et al. and Funahashi showed in 1989 [26, 15], such networks, like the one in Equation (4.15), are capable of approximating any continuous function f : R^n \to R^m to any given accuracy, provided that sufficiently many hidden units are available.

[Figure 4.2: Signal-flow graph of an MLP with one hidden layer of nonlinear neurons and a linear output layer.]

MLP networks are typically used in supervised learning problems, where there is a training set of input–output pairs and the network must learn to model the dependency between them. Training means adapting all the weights and biases (A, B, a and b in Equation (4.15)) to their optimal values for the given pairs (s(t), x(t)). The criterion to be optimised is typically the squared reconstruction error \sum_t ||f(s(t)) - x(t)||^2.

The supervised learning problem of the MLP can be solved with the back-propagation algorithm, which consists of two steps. In the forward pass, the predicted outputs corresponding to the given inputs are evaluated as in Equation (4.15). In the backward pass, the partial derivatives of the cost function with respect to the different parameters are propagated back through the network. The chain rule of differentiation gives computational rules for the backward pass that are very similar to the ones in the forward pass. The network weights can then be adapted using any gradient-based optimisation algorithm, and the whole process is iterated until the weights have converged [21].
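As a sketch of the two passes of back-propagation for the network of Equation (4.15) with squared reconstruction error, the following function performs one plain gradient descent step. It is a textbook illustration under these assumptions, not the training procedure used later in this thesis.

```python
import numpy as np

def backprop_step(s, x, A, a, B, b, lr=0.01):
    # Forward pass, Equation (4.15).
    z = A @ s + a
    h = np.tanh(z)
    x_hat = B @ h + b
    # Backward pass: chain rule from the squared error back to each parameter.
    e = 2.0 * (x_hat - x)             # d(cost)/d(x_hat)
    dB = np.outer(e, h)
    db = e
    dz = (B.T @ e) * (1.0 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dA = np.outer(dz, s)
    da = dz
    # Simple gradient descent update of all weights and biases.
    return A - lr * dA, a - lr * da, B - lr * dB, b - lr * db
```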
The MLP network can also be used for unsupervised learning by using the so-called auto-associative structure. This is done by setting the same values for both the inputs and the outputs of the network; the extracted sources emerge as the values of the hidden neurons [24]. This approach is computationally rather intensive: the MLP network must have at least three hidden layers for any reasonable representation, and training such a network is a time-consuming process.

4.2.4 Nonlinear factor analysis

The NSSM implementation in [58] uses MLP networks to model the two nonlinear mappings of Equation (4.13). The learning procedure for the mappings is essentially the same as in the simpler NFA model [34], so it is presented first. The NFA model is also used in some of the experiments to explore the properties of the data set used.

Just as the NSSM is a generalisation of the linear SSM, the NFA is a generalisation of the well-known linear factor analysis. The NFA can be defined by the generative data model

    x(t) = f(s(t)) + n(t)
    s(t) = m(t)    (4.16)

where the noise terms n(t) and m(t) are assumed to be Gaussian. The explicit statement that the factors are modelled as white noise is included to emphasise the difference between the NFA and the NSSM.

The Gaussianity assumptions are made for mathematical convenience, even though the Gaussianity of the factors is a serious limitation. In the corresponding linear model the posterior will be an uncorrelated Gaussian, and in the nonlinear model it is approximated with a similar Gaussian. However, an uncorrelated Gaussian with equal variances in each direction is spherically symmetric and thus invariant with respect to all possible rotations of the factors. Even if the variances are not equal, there is a similar rotation invariance. In traditional approaches to linear factor analysis this invariance is resolved by fixing the rotation with different heuristic criteria for choosing the optimal one. In the neural computation field, this line of research has led to algorithms like independent component analysis (ICA) [27], which uses basically the same linear generative model but with non-Gaussian factors. Similar techniques can also be applied to the nonlinear model, as discussed in [34].

The NFA algorithm uses an MLP network as the model of the nonlinearity f. The model is learnt using ensemble learning. Most of the expectations needed in ensemble learning for such a model can be evaluated analytically; only the terms involving the nonlinearity f must be approximated, using a Taylor series approximation of the function about the posterior mean of the input.

The weights of the MLP network are updated with a back-propagation-like algorithm using the ensemble learning cost function. Contrary to standard supervised back-propagation, the unknown inputs of the network are also updated similarly. All the parameters of the model are assumed to be Gaussian, with hierarchical priors for most of them. The technical details of the model and the learning algorithm are covered in Chapters 5 and 6. Those chapters deal with the more general NSSM, but the NFA emerges as a special case when all the temporal structure is ignored.

4.2.5 Learning algorithms

The most important NSSM learning algorithm for this work is the one by Dr. Harri Valpola [58, 60]. It uses MLP networks to model the nonlinearities and ensemble learning to optimise the model. It is discussed in detail in Sections 5.2 and 6.2 and is therefore skipped for now.

Even though it is not exactly a learning algorithm for complete NSSMs, the extended Kalman filter (EKF) is an important building block of many such algorithms. The EKF extends standard Kalman filtering to nonlinear models. The nonlinear functions f and g of Equation (4.13) must be known in advance. The algorithm works by linearising the functions about the estimated posterior mean. The posterior probability of the state variables is evaluated with a forward–backward type iteration by assuming the posterior of each sample to be Gaussian [42, 28, 32].
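The sketch below illustrates one EKF step (predict and update) for the model (4.13), under the assumptions stated in the text: the mappings g and f are known, and here additionally their Jacobian functions Gj and Fj as well as the noise covariances Q and R are assumed to be given. This is a generic textbook formulation, not the specific filter of any reference above.

```python
import numpy as np

def ekf_step(mean, cov, x, g, f, Gj, Fj, Q, R):
    # Predict: linearise the dynamics g about the current posterior mean.
    G = Gj(mean)
    mean_pred = g(mean)
    cov_pred = G @ cov @ G.T + Q
    # Update: linearise the observation mapping f about the predicted mean.
    F = Fj(mean_pred)
    S = F @ cov_pred @ F.T + R               # innovation covariance
    K = cov_pred @ F.T @ np.linalg.inv(S)    # Kalman gain
    mean_new = mean_pred + K @ (x - f(mean_pred))
    cov_new = (np.eye(len(mean)) - K @ F) @ cov_pred
    return mean_new, cov_new
```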
Briegel and Tresp [7] present an NSSM that uses MLP networks as the model of the nonlinearities. The learning algorithm is based on a Monte Carlo generalised EM algorithm, i.e. an EM algorithm with stochastic estimates of the conditional posteriors at the different steps. The Monte Carlo E-step is further optimised by generating the samples of the hidden states from an approximate Gaussian distribution instead of the true posterior. The approximate posterior is found using either the EKF or an alternative similar method.

Roweis and Ghahramani [51, 19] use RBF networks to model the nonlinearities. They use the standard EM algorithm with the EKF for the approximate E-step. The parameterisation of the RBF network allows an exact M-step for some of the parameters. All the parameters of the network cannot, however, be adapted this way.

4.3 Previous hybrid models

There are numerous different hybrid models combining the HMM and a continuous model. As HMMs have been used most extensively in speech recognition, most of the hybrids have also been created for that purpose. Trentin and Gori [56] present a survey of speech recognition systems combining the HMM with some neural network model. These models are motivated by the end result of recognising speech by finding the correct HMM state sequence for the sequence of uttered phonemes. This is certainly one worthy goal, but a good switching SSM can also be very useful in other time series modelling tasks.

The basic idea of all the hybrids is to use the SSM, or another continuous dynamical model, to describe the short-term dynamics of the data, while the HMM describes longer-term changes. In speech modelling, for instance, the HMM would govern the desired sequence of phonemes while the SSM models the actual production of the sound: the dynamics of the mouth, the vocal cords, and so on.

The models presented here can be divided into two classes. First there are the true switching SSMs, i.e. combinations of the HMM and a linear SSM. The second class consists of models that combine the HMM with another continuous dynamical model which is not a true SSM.

4.3.1 Switching state-space models

There are several possible architectures for switching SSMs. Figure 4.3 shows some of the most basic ones [43]. The first subfigure corresponds to the case where the function g, and possibly the model for the noise m(t) in Equation (4.13), are different for different states. In the second subfigure, the function f and the noise n(t) depend on the switching variable. Some combination of these two approaches is of course also possible. The third subfigure shows an interesting architecture proposed by Ghahramani and Hinton [18], in which there are several completely separate SSMs and the switching variable chooses between them. Their model is especially interesting as it uses ensemble learning to infer the model parameters.

[Figure 4.3: Bayesian network representations of some switching state-space model architectures: (a) switching dynamics, (b) switching observations, (c) independent dynamical models [18]. The round nodes represent Gaussian variables and the square nodes are discrete. The shaded nodes are observed while the white ones are hidden.]

One of the problems with switching SSMs is that the exact E-step of the EM algorithm is intractable, even if the individual continuous hidden states are Gaussian. Assuming the HMM has N states, the posterior of a single state variable s(1) will be a mixture of N Gaussians, one for each HMM state M_1. When this is propagated forward according to the dynamical model, the mixture grows exponentially as the number of possible HMM state sequences increases. Finally, when the full observation sequence of length T is taken into account, the posterior of each s(t) will be a mixture of N^T Gaussians. For instance, with N = 10 states and a sequence of length T = 100, the exact posterior would already have 10^100 components.
Ensemble learning is a very useful method for developing a tractable algorithm for this problem, although there are also heuristic methods for the same purpose. The other methods typically use some greedy procedure for collapsing the distribution, which may cause inaccuracies. This is not a problem with ensemble learning: it considers the whole sequence and minimises the Kullback–Leibler divergence, which in this case has no local minima.

4.3.2 Other hidden Markov model hybrids

To complete the review, two models that combine the HMM with another dynamical model that is not an SSM are now presented.

The hidden Markov ICA algorithm by Penny et al. [47] is an interesting variant of the switching SSMs. It has both switching dynamics and a switching observation mapping, but the dynamics are not modelled with a linear dynamical system. Instead, the observations are transformed to independent components, which are then each predicted separately with a generalised autoregressive (GAR) model. This allows using samples from further back for the prediction, but everything is done strictly separately for all the components.

The MLP/HMM hybrid by Chung and Un [11] uses an MLP network for nonlinear prediction and an HMM to model its prediction errors. The predictor operates in the observation space, so the model is not an NSSM. The prediction errors are modelled with a mixture of Gaussians for each HMM state. The model is trained using maximum likelihood and a discriminative criterion for minimal classification error in a speech recognition application.

Chapter 5

The model

In this chapter it is shown how the two separate models of the previous chapter, the hidden Markov model (HMM) and the nonlinear state-space model (NSSM), can be combined into one switching NSSM. The chapter begins with Bayesian versions of the two individual models, the HMM in Section 5.1 and the NSSM in Section 5.2. The switching model is introduced in Section 5.3.

5.1 Bayesian continuous density hidden Markov model

In Section 4.1.4, a Bayesian formulation of the HMM based on the work of MacKay [39] was presented. This model is now extended to continuous observations instead of discrete ones. Normally, continuous density hidden Markov models (CDHMMs) use a mixture model, such as a mixture of Gaussians, as the distribution of the observations for a given state. This is, however, unnecessarily complicated for the HMM/NSSM hybrid, so single Gaussians will be used as the observation model.

5.1.1 The model

The basic model is the same as the one presented in Section 4.1. The hidden state sequence is denoted by M = (M_1, \ldots, M_T) and the other parameters by θ; the exact form of θ will be specified later. The observations X = (x(1), \ldots, x(T)), given the corresponding hidden states, are assumed to be Gaussian with a diagonal covariance matrix. Given the HMM state sequence M, the individual observations are assumed to be independent. Therefore the likelihood of the data can be written as

    P(X | M, \theta) = \prod_{t=1}^{T} \prod_{k=1}^{N} p(x_k(t) | M_t).    (5.1)

Because of the Markov property, the prior distribution of the probabilities of the hidden states can also be written in factorial form:

    P(M | \theta) = p(M_1 | \theta) \prod_{t=1}^{T-1} p(M_{t+1} | M_t, \theta).    (5.2)

The factors of Equations (5.1) and (5.2) are defined to be

    p(x_k(t) | M_t = i) = N(x_k(t); m_k(i), \exp(2 v_k(i)))    (5.3)
    p(M_{t+1} = j | M_t = i, \theta) = a_{ij}    (5.4)
    p(M_1 = i | \theta) = \pi_i.    (5.5)
The priors of all the parameters defined above are

    p(a_{i,\cdot}) = Dirichlet(a_{i,\cdot}; u_i^{(A)})    (5.6)
    p(\pi) = Dirichlet(\pi; u^{(\pi)})    (5.7)
    p(m_k(i)) = N(m_k(i); m_{m_k}, \exp(2 v_{m_k}))    (5.8)
    p(v_k(i)) = N(v_k(i); m_{v_k}, \exp(2 v_{v_k})).    (5.9)

These should be written as distributions conditional on the parameters of the hyperprior, but the conditioning variables have been dropped to simplify the notation.

The parameters u^{(\pi)} and u_i^{(A)} of the Dirichlet priors are fixed. Their values should be chosen to reflect true prior knowledge of the possible initial states and transition probabilities of the chain. In our example of speech recognition, where the states of the HMM represent different phonemes, these values could, for instance, be estimated from textual data.

All the other parameters m_{m_k}, v_{m_k}, m_{v_k} and v_{v_k} have higher hierarchical priors. As the number of parameters in such priors grows, only the full structure of the hierarchical prior of m_{m_k} is given. It is

    p(m_{m_k}) = N(m_{m_k}; m_{m_m}, \exp(2 v_{m_m}))    (5.10)
    p(m_{m_m}) = N(m_{m_m}; 0, 100^2)    (5.11)
    p(v_{m_m}) = N(v_{m_m}; 0, 100^2).    (5.12)

The hierarchical prior of, for example, m(i) can be summarised as follows:

• The different components of m(i) have different priors, whereas the vectors corresponding to different states of the HMM share a common prior, which is parameterised with m_{m_k}.
• The parameters m_{m_k} corresponding to different components of the original vector m(i) share a common prior parameterised with m_{m_m}.
• The parameter m_{m_m} has a fixed noninformative prior.

The set of model parameters θ consists of all these parameters and all the parameters of the hierarchical priors.

In the hierarchical structure formulated above, the Gaussian prior for the mean m of a Gaussian is a conjugate prior; thus the posterior will also be Gaussian. The parameterisation of the variance with \sigma^2 = \exp(2v), v \sim N(\alpha, \beta), is somewhat less conventional. The conjugate prior for the variance of a Gaussian is the inverse gamma distribution, but adding a new level of hierarchy for the parameters of such a distribution would be significantly more difficult. The present parameterisation allows adding similar layers of hierarchy for the parameters of the priors of m and v. In this parameterisation the posterior of v is not exactly Gaussian, but it may be approximated with one. The exponential function ensures that the variance is always positive, which makes the posterior closer to a Gaussian.

5.1.2 The approximating posterior distribution

The approximating posterior distribution needed in ensemble learning is over all the possible hidden state sequences M and the parameter values θ. The approximation is chosen to be of the factorial form

    q(M, \theta) = q(M) q(\theta).    (5.13)

The approximation q(M) is a discrete distribution which factorises as

    q(M) = q(M_1) \prod_{t=1}^{T-1} q(M_{t+1} | M_t).    (5.14)

The parameters of this distribution are the discrete probabilities q(M_1 = i) and q(M_{t+1} = i | M_t = j).

The distribution q(θ) is also formed as a product of independent distributions for the different parameters. The parameters with Dirichlet priors have posterior approximations consisting of a single Dirichlet distribution, as for \pi,

    q(\pi) = Dirichlet(\pi; \hat{\pi}),    (5.15)

or a product of Dirichlet distributions, as for A,

    q(A) = \prod_i Dirichlet(a_i; \hat{a}_i).    (5.16)

These are actually the optimal choices among all possible distributions, assuming the factorisation q(M, \pi, A) = q(M) q(\pi) q(A).
The parameters with Gaussian priors have Gaussian posterior approximations of the form

    q(\theta_i) = N(\theta_i; \bar{\theta}_i, \tilde{\theta}_i).    (5.17)

All these parameters are assumed to be independent.

5.2 Bayesian nonlinear state-space model

In this section, the NSSM by Dr. Harri Valpola [58, 60] is presented. The corresponding learning algorithm for the model will be discussed in Section 6.2.

5.2.1 The generative model

The Bayesian model follows the standard NSSM defined in Equation (4.13):

    s(t+1) = g(s(t)) + m(t)
    x(t) = f(s(t)) + n(t).    (5.18)

The basic structure of the model can be seen in Figure 5.1.

[Figure 5.1: The nonlinear state-space model.]

The nonlinear functions f and g of the NSSM are modelled by MLP networks. The networks have one hidden layer and can thus be written as in Equation (4.15):

    f(s(t)) = B \tanh[A s(t) + a] + b    (5.19)
    g(s(t)) = s(t) + D \tanh[C s(t) + c] + d    (5.20)

where A, B, C and D are the weight matrices of the networks and a, b, c and d are the bias vectors. Because the data are assumed to be generated by a continuous dynamical system, it is reasonable to assume that the values of the hidden states s(t) do not change very much from one time index to the next. Therefore the MLP network representing the function g is used to model only the change.

5.2.2 The probabilistic model

Let us denote the observed data by X = (x(1), \ldots, x(T)), the hidden state values by S = (s(1), \ldots, s(T)) and all the other model parameters by θ. These other parameters consist of the weights and biases of the MLP networks and the hyperparameters defining the prior distributions of other parameters.

The likelihood

Let us assume that the noise terms m(t) and n(t) in Equation (5.18) are independent at different time instants and Gaussian with zero mean. The different components of the vector are, however, allowed to have different variances, as governed by their hyperparameters. Following the developments of Section 3.1.1, the model implies a likelihood for the data

    p(x(t) | s(t), \theta) = N(x(t); f(s(t)), diag[\exp(2 v_n)])    (5.21)

where diag[\exp(2 v_n)] denotes a diagonal matrix with the elements of the vector \exp(2 v_n) on the diagonal. The vector v_n is a hyperparameter that defines the variances of the different components of the noise. As the values of the noise at different time instants are independent, the full data likelihood can be written as

    p(X | S, \theta) = \prod_{t=1}^{T} p(x(t) | s(t), \theta) = \prod_{t=1}^{T} \prod_{k=1}^{n} p(x_k(t) | s(t), \theta)    (5.22)

where the individual factors are of the form

    p(x_k(t) | s(t), \theta) = N(x_k(t); f_k(s(t)), \exp(2 v_{n_k})).    (5.23)

The prior of the states S

Similarly, the prior of the states can be written as

    p(S | \theta) = p(s(1) | \theta) \prod_{t=1}^{T-1} p(s(t+1) | s(t), \theta) = \prod_{k=1}^{m} \left[ p(s_k(1) | \theta) \prod_{t=1}^{T-1} p(s_k(t+1) | s(t), \theta) \right]    (5.24)

where

    p(s_k(t+1) | s(t), \theta) = N(s_k(t+1); g_k(s(t)), \exp(2 v_{m_k}))    (5.25)

and

    p(s_k(1) | \theta) = N(s_k(1); m_{s_0 k}, \exp(2 v_{s_0 k})).    (5.26)

The prior of the parameters θ

Let us denote the elements of the weight matrices of the MLP networks by A = (A_{ij}), B = (B_{ij}), C = (C_{ij}) and D = (D_{ij}). The bias vectors similarly consist of elements a = (a_i), b = (b_i), c = (c_i) and d = (d_i). All the elements of the weight matrices and the bias vectors are assumed to be independent and Gaussian.
Their priors are as follows:

    p(A_{ij}) = N(A_{ij}; 0, 1)    (5.27)
    p(B_{ij}) = N(B_{ij}; 0, \exp(2 v_{B_j}))    (5.28)
    p(a_i) = N(a_i; m_a, \exp(2 v_a))    (5.29)
    p(b_i) = N(b_i; m_b, \exp(2 v_b))    (5.30)
    p(C_{ij}) = N(C_{ij}; 0, \exp(2 v_{C_i}))    (5.31)
    p(D_{ij}) = N(D_{ij}; 0, \exp(2 v_{D_j}))    (5.32)
    p(c_i) = N(c_i; m_c, \exp(2 v_c))    (5.33)
    p(d_i) = N(d_i; m_d, \exp(2 v_d)).    (5.34)

These distributions should again be written conditional on the corresponding hyperparameters, but the conditioning variables have been omitted to keep the notation simpler. Each of the bias vectors has a hierarchical prior that is shared among the different elements of that particular vector. The hyperparameters m_a, m_b, m_c, m_d, v_a, v_b, v_c and v_d all have zero mean Gaussian priors with standard deviation 100, which is a flat, essentially noninformative prior.

The structure of the priors of the weight matrices is much more interesting. The prior of A is fixed in order to resolve a scaling indeterminacy between the hidden states s(t) and the weights of the MLP networks. This is evident from Equation (5.19), where any scaling of one of these parameters could be compensated by the other without affecting the results in any way. The other weight matrices B, C and D have zero mean priors with a common variance for all the weights related to a single hidden neuron.

The remaining variance parameters from the priors of the weight matrices and from Equations (5.23), (5.25) and (5.26) again have hierarchical priors, defined as

    p(v_{B_j}) = N(v_{B_j}; m_{v_B}, \exp(2 v_{v_B}))    (5.35)
    p(v_{C_i}) = N(v_{C_i}; m_{v_C}, \exp(2 v_{v_C}))    (5.36)
    p(v_{D_j}) = N(v_{D_j}; m_{v_D}, \exp(2 v_{v_D}))    (5.37)
    p(v_{n_k}) = N(v_{n_k}; m_{v_n}, \exp(2 v_{v_n}))    (5.38)
    p(v_{m_k}) = N(v_{m_k}; m_{v_m}, \exp(2 v_{v_m}))    (5.39)
    p(v_{s_0 k}) = N(v_{s_0 k}; m_{v_{s_0}}, \exp(2 v_{v_{s_0}})).    (5.40)

The prior distributions of the parameters of these distributions are again zero mean Gaussians with standard deviation 100.

5.2.3 The approximating posterior distribution

As with the HMM, the approximating posterior distribution is chosen to have the factorial form

    q(S, \theta) = q(S) q(\theta) = q(S) \prod_i q(\theta_i).    (5.41)

The independent distributions of the parameters \theta_i are all Gaussian with

    q(\theta_i) = N(\theta_i; \bar{\theta}_i, \tilde{\theta}_i)    (5.42)

where \bar{\theta}_i and \tilde{\theta}_i are the variational parameters whose values must be optimised to minimise the cost function.

Because of the strong temporal correlations between the source values at consecutive time instants, the same approach cannot be applied to form q(S). Therefore the approximation is chosen to be of the form

    q(S) = \prod_{k=1}^{n} \left[ q(s_k(1)) \prod_{t=1}^{T-1} q(s_k(t+1) | s_k(t)) \right]    (5.43)

where the factors are again Gaussian. The distributions of s_k(1) can be handled as before, with

    q(s_k(1)) = N(s_k(1); \bar{s}_k(1), \tilde{s}_k(1)).    (5.44)

The conditional distribution q(s_k(t+1) | s_k(t)) must be modified slightly to include the contribution of the previous state value. Reserving the notation \tilde{s}_k(t) for the marginal variance of q(s_k(t)), the variance of the conditional distribution is denoted by \mathring{s}_k(t+1). The mean of the conditional distribution depends linearly on the previous state value s_k(t):

    \mu_{s_k}(t+1) = \bar{s}_k(t+1) + \breve{s}_k(t, t+1) [s_k(t) - \bar{s}_k(t)],    (5.45)

which yields

    q(s_k(t+1) | s_k(t)) = N(s_k(t+1); \mu_{s_k}(t+1), \mathring{s}_k(t+1)).    (5.46)

The variational parameters of the distribution are thus the mean \bar{s}_k(t+1), the linear dependence \breve{s}_k(t, t+1) and the variance \mathring{s}_k(t+1). It should be noted that this dependence is only on the same component of the previous state value.
The posterior dependence between the different components is neglected.

The marginal distribution of the states at time instant t may now be evaluated inductively, starting from the beginning. Assuming q(s_k(t-1)) = N(s_k(t-1); \bar{s}_k(t-1), \tilde{s}_k(t-1)), this yields

    q(s_k(t)) = \int q(s_k(t) | s_k(t-1)) q(s_k(t-1)) ds_k(t-1)
              = N(s_k(t); \bar{s}_k(t), \mathring{s}_k(t) + \breve{s}_k^2(t-1, t) \tilde{s}_k(t-1)).    (5.47)

Thus the marginal mean is the same as the conditional mean, and the marginal variances can be computed using the recursion

    \tilde{s}_k(t) = \mathring{s}_k(t) + \breve{s}_k^2(t-1, t) \tilde{s}_k(t-1).    (5.48)
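The recursion (5.48) is straightforward to implement. The tiny sketch below (an illustration, not the thesis code) computes the marginal variances of one state component from the conditional variances and the linear dependences; the input values are arbitrary.

```python
import numpy as np

def marginal_variances(cond_var, dep):
    """cond_var[t]: conditional variance at time t; dep[t]: dependence of s(t) on s(t-1)."""
    marg = np.empty_like(cond_var)
    marg[0] = cond_var[0]                      # s(1) has no predecessor, Eq. (5.44)
    for t in range(1, len(cond_var)):
        marg[t] = cond_var[t] + dep[t] ** 2 * marg[t - 1]   # Eq. (5.48)
    return marg

print(marginal_variances(np.full(5, 0.1), np.full(5, 0.5)))
```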
5.3 Combining the two models

As we saw in Section 4.3, there are several possible ways to combine the HMM and the NSSM. The approach with switching dynamics seems the most reasonable, so we shall concentrate on it. Hence the illustration of the NSSM in Figure 5.1 is augmented with the discrete HMM states to get Figure 5.2.

[Figure 5.2: The switching nonlinear state-space model.]

In linear models, switching dynamics are often implemented by having a completely separate dynamical model for every HMM state. This is of course the most general approach. In the linear case such an approach is feasible because of the simplicity of the individual linear models needed for each state. Unfortunately this is not so in the nonlinear case: using more than just a few nonlinear models makes the system computationally far too heavy for any practical use. To make things at least somewhat more tractable, we use only one nonlinear mapping to describe the dynamics of all the states. Instead of its own dynamics, every HMM state has its own characteristic model for the innovation process, i.e. the description error of the dynamical mapping.

5.3.1 The structure of the model

The probabilistic interpretation of the dynamics of the sources in the standard NSSM is

    p(s(t) | s(t-1), \theta) = N(s(t); g(s(t-1)), diag[\exp(2 v_m)])    (5.49)

where g is the nonlinear dynamical mapping and diag[\exp(2 v_m)] is the covariance matrix of the zero mean innovation process i(t) = s(t) - g(s(t-1)). The typical linear approach, applied to the nonlinear case, would use a different g and diag[\exp(2 v_m)] for each state of the HMM. The simplified model uses only one dynamical mapping g but has a separate covariance matrix diag[\exp(2 v_{M_j})] for each HMM state j. In addition, the innovation process is not assumed to be zero mean, but has a mean depending on the HMM state. Mathematically this means that

    p(s(t) | s(t-1), M_t = j, \theta) = N(s(t); g(s(t-1)) + m_{M_j}, diag[\exp(2 v_{M_j})])    (5.50)

where M_t is the HMM state, and m_{M_j} and diag[\exp(2 v_{M_j})] are, respectively, the mean and the covariance matrix of the innovation process for that state. The prior model of s(1) remains unchanged.

Equation (5.50) summarises the differences between the switching NSSM and its components, as they were presented in Sections 5.1 and 5.2. The HMM "output" distribution is the one defined in Equation (5.50), not the data likelihood as in the stand-alone model. Similarly, the model of the continuous hidden states of the NSSM is slightly different from the one specified in Equation (5.25).

5.3.2 The approximating posterior distribution

The approximating posterior distribution is again chosen to be of the factorial form

    q(M, S, \theta) = q(M) q(S) q(\theta).    (5.51)

The distributions of the state variables are as they were in the individual models: q(M) as in Equation (5.14) and q(S) as in Equations (5.43)–(5.46). The distribution of the parameters θ is the product of the corresponding distributions of the individual models, i.e. q(θ) = q(θ_HMM) q(θ_NSSM).

There is one additional approximation in the choice of the form of q. This can be clearly seen from Equation (5.50). After marginalising over M_t, the conditional probability p(s(t) | s(t-1), θ) will be a mixture of as many Gaussians as there are states in the HMM. Marginalising out the past will result in an exponentially growing mixture, thus making the problem intractable. Our ensemble learning approach solves this problem by using only a single Gaussian as the posterior approximation q(s(t) | s(t-1)). This is of course just an approximation, but it allows a tractable way to solve the problem.

Chapter 6

The algorithm

In this chapter, the learning algorithm used for optimising the parameters of the models defined in the previous chapter is presented. The structure of this chapter is essentially the same as that of the previous chapter: the HMM is discussed in Section 6.1, the NSSM in Section 6.2 and finally the switching model in Section 6.3. All the learning algorithms are based on ensemble learning, which was introduced in Section 3.3. The sections of this chapter are mostly divided into two parts, one for evaluating the ensemble learning cost function and the other for optimising it.

6.1 Learning algorithm for the continuous density hidden Markov model

In this section, the ensemble learning cost function for the CDHMM defined in Section 5.1 is derived, and it is shown how its value can be optimised.

6.1.1 Evaluating the cost function

The general cost function of ensemble learning, as given in Equation (3.11), is

    C(M, \theta) = C_q + C_p = E[\log q(M, \theta)] + E[-\log p(M, \theta, X)]
                 = E[\log q(M) + \log q(\theta)] + E[-\log p(X | M, \theta)] + E[-\log p(M | \theta) - \log p(\theta)]    (6.1)

where all the expectations are taken over q(M, θ). This will be true of all similar formulas in this section unless explicitly stated otherwise.

The terms originating from the parameters θ

Assuming the parameters are θ = {θ_1, \ldots, θ_N} and the approximation is of the form

    q(\theta) = \prod_{i=1}^{N} q(\theta_i),    (6.2)

the terms of Equation (6.1) originating from the parameters θ can be written as

    E[\log q(\theta) - \log p(\theta)] = \sum_{i=1}^{N} ( E[\log q(\theta_i)] - E[\log p(\theta_i)] ).    (6.3)

In the case of the Dirichlet distributions, one θ_i in the previous equation must of course consist of the vector of parameters of the single distribution.

There are two kinds of parameters in θ: those with a Gaussian distribution and those with a Dirichlet distribution. In the Gaussian case, the expectation E[\log q(\theta_i)] over q(\theta_i) = N(\theta_i; \bar{\theta}_i, \tilde{\theta}_i) gives the negative entropy of a Gaussian, -\frac{1}{2}(1 + \log(2\pi \tilde{\theta}_i)), as derived in Equation (A.5) of Appendix A. The expectation of -\log p(\theta_i) can also be evaluated using the formulas of Appendix A. Assuming

    p(\theta_i) = N(\theta_i; m, \exp(2v))    (6.4)

where q(\theta_i) = N(\theta_i; \bar{\theta}_i, \tilde{\theta}_i), q(m) = N(m; \bar{m}, \tilde{m}) and q(v) = N(v; \bar{v}, \tilde{v}), the expectation becomes

    C_p(\theta_i) = E[-\log p(\theta_i)] = E\left[ \frac{1}{2} \log(2\pi \exp(2v)) + \frac{1}{2} (\theta_i - m)^2 \exp(-2v) \right]
                  = \frac{1}{2} \log(2\pi) + E[v] + \frac{1}{2} E[(\theta_i - m)^2] E[\exp(-2v)]
                  = \frac{1}{2} \log(2\pi) + \bar{v} + \frac{1}{2} \left[ (\bar{\theta}_i - \bar{m})^2 + \tilde{\theta}_i + \tilde{m} \right] \exp(2\tilde{v} - 2\bar{v})    (6.5)

where we have used the results of Equations (A.4) and (A.6).
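As a numerical illustration of these Gaussian terms, the sketch below (not from the thesis) evaluates the negative entropy of q(θ_i) and the expectation C_p(θ_i) of Equation (6.5). The variable names follow the text: (theta_bar, theta_tilde) parameterise q(θ_i), (m_bar, m_tilde) parameterise q(m) and (v_bar, v_tilde) parameterise q(v); the values passed in are arbitrary.

```python
import numpy as np

def cq_gaussian(theta_tilde):
    """Negative entropy of q(theta_i): -(1/2)(1 + log(2 pi theta_tilde))."""
    return -0.5 * (1.0 + np.log(2.0 * np.pi * theta_tilde))

def cp_gaussian(theta_bar, theta_tilde, m_bar, m_tilde, v_bar, v_tilde):
    """E[-log p(theta_i)] for p(theta_i) = N(m, exp(2v)), Equation (6.5)."""
    return (0.5 * np.log(2.0 * np.pi) + v_bar
            + 0.5 * ((theta_bar - m_bar) ** 2 + theta_tilde + m_tilde)
            * np.exp(2.0 * v_tilde - 2.0 * v_bar))

print(cq_gaussian(0.1) + cp_gaussian(0.5, 0.1, 0.0, 0.01, 0.0, 0.01))
```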
For Dirichlet distributed parameters the procedure is similar. Let us assume that the parameter c \in \theta has p(c) = Dirichlet(c; u^{(c)}) and q(c) = Dirichlet(c; \hat{c}). Using the notation of Appendix A, the negative entropy of the Dirichlet distribution q(c), E[\log q(c)], can be evaluated as in Equation (A.14) to yield

    C_q(c) = E[\log q(c)] = -\log Z(\hat{c}) + \sum_{i=1}^{n} (\hat{c}_i - 1) [\Psi(\hat{c}_i) - \Psi(\hat{c}_0)]    (6.6)

where \hat{c}_0 = \sum_i \hat{c}_i. The special function required in these terms is \Psi(x) = \frac{d}{dx} \ln \Gamma(x), where \Gamma(x) is the gamma function. The psi function \Psi(x) is also known as the digamma function, and it can be evaluated efficiently, for example using the techniques described in [4]. The term Z(\hat{c}) is the normalising constant of the Dirichlet distribution, as defined in Appendix A.

The expectation of -\log p(c) can be evaluated similarly:

    C_p(c) = -E[\log p(c)] = -E\left[ \log \frac{1}{Z(u^{(c)})} \prod_{i=1}^{n} c_i^{u_i^{(c)} - 1} \right]
           = \log Z(u^{(c)}) - \sum_{i=1}^{n} (u_i^{(c)} - 1) E[\log c_i]
           = \log Z(u^{(c)}) - \sum_{i=1}^{n} (u_i^{(c)} - 1) [\Psi(\hat{c}_i) - \Psi(\hat{c}_0)].    (6.7)

The likelihood term

The likelihood term of Equation (6.1) is rather easy to evaluate:

    C_p(X) = -E[\log p(X | M, \theta)] = \sum_{t=1}^{T} \sum_i \sum_{k=1}^{N} q(M_t = i) E[-\log p(x_k(t) | M_t = i)]
           = \sum_{t=1}^{T} \sum_i \sum_{k=1}^{N} q(M_t = i) \left( \frac{1}{2} \log(2\pi) + \bar{v}_k(i) + \frac{1}{2} \left[ (x_k(t) - \bar{m}_k(i))^2 + \tilde{m}_k(i) \right] \exp(2\tilde{v}_k(i) - 2\bar{v}_k(i)) \right)    (6.8)

where we have used the result of Equation (6.5) for the expectation.

The terms originating from the hidden state sequence M

The term C_q(M) is just a sum over the discrete distribution. It can be simplified to

    C_q(M) = \sum_M q(M) \log q(M) = \sum_M q(M) \log\left( q(M_1) \prod_{t=1}^{T-1} q(M_{t+1} | M_t) \right)
           = \sum_{i=1}^{N} q(M_1 = i) \log q(M_1 = i) + \sum_{t=1}^{T-1} \sum_{i,j=1}^{N} q(M_{t+1} = j, M_t = i) \log q(M_{t+1} = j | M_t = i).    (6.9)

The other term, C_p(M), can be split into

    C_p(M) = E[-\log p(M | \theta)] = E[-\log p(M_1 | \theta)] + \sum_{t=1}^{T-1} E[-\log a_{M_t M_{t+1}}]
           = -\sum_{i=1}^{N} q(M_1 = i) E[\log \pi_i] - \sum_{t=1}^{T-1} \sum_{i,j=1}^{N} q(M_{t+1} = j, M_t = i) E[\log a_{ij}]    (6.10)

where, according to Equation (A.13), E[\log \pi_i] = \Psi(\hat{\pi}_i) - \Psi(\sum_{j=1}^{N} \hat{\pi}_j) and similarly E[\log a_{ij}] = \Psi(\hat{a}_{ij}) - \Psi(\sum_{k=1}^{N} \hat{a}_{ik}).

The above equations give the value of the cost function for a given approximating distribution q(θ, M). This value is important because it can be used to compare different models, as shown in Section 3.3. Additionally, it can be used to monitor whether the iterative optimisation procedure has converged.
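A small sketch of the Dirichlet terms (6.6)–(6.7) follows. Since Appendix A is not reproduced here, the code assumes the conventional choice that Z(u) is the multinomial beta function, i.e. \log Z(u) = \sum_i \log \Gamma(u_i) - \log \Gamma(\sum_i u_i); the digamma and log-gamma functions come from scipy.

```python
import numpy as np
from scipy.special import digamma, gammaln

def log_z(u):
    """Log normalising constant of the Dirichlet (assumed multinomial beta)."""
    return np.sum(gammaln(u)) - gammaln(np.sum(u))

def dirichlet_cost(u_prior, c_hat):
    """Cq(c) + Cp(c) for q(c) = Dirichlet(c_hat), p(c) = Dirichlet(u_prior)."""
    e_log_c = digamma(c_hat) - digamma(np.sum(c_hat))   # E[log c_i], Eq. (A.13)
    cq = -log_z(c_hat) + np.sum((c_hat - 1.0) * e_log_c)   # Eq. (6.6)
    cp = log_z(u_prior) - np.sum((u_prior - 1.0) * e_log_c)  # Eq. (6.7)
    return cq + cp

print(dirichlet_cost(np.ones(3), np.array([2.0, 1.0, 1.0])))
```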
6.1.2 Optimising the cost function

After defining the model and finding a way to evaluate the cost function, the next problem is optimising the cost function. This is done in a way that is very similar to the EM algorithm, i.e. by updating one part of the model at a time while keeping all the other parameters fixed. All the optimisation steps aim at finding a minimum of the cost function given the current values of the fixed parameters. Since all the steps decrease the value of the cost function, the learning algorithm is guaranteed to converge.

Finding the optimal q(M)

Assuming q(θ) is fixed, the cost function can be written, up to an additive constant, in the form

    C(M) = \sum_M q(M) \left[ \log q(M) - \int q(\theta) \log p(M | \theta) d\theta - \int q(\theta) \log p(X | M, \theta) d\theta \right]
         = \sum_M q(M) \left[ \log q(M) - \int q(\pi) \log \pi_{M_1} d\pi - \sum_{t=1}^{T-1} \int q(A) \log a_{M_t M_{t+1}} dA + \sum_{t=1}^{T} C(x(t) | M_t) \right]    (6.11)

where C(x(t) | M_t) is the value of E[-\log p(x(t) | M_t, \theta)], i.e. the "cost" of the current data sample given the HMM state. By defining

    \pi_i^* = \exp\left( \int q(\pi) \log \pi_i d\pi \right)  and  a_{ij}^* = \exp\left( \int q(A) \log a_{ij} dA \right),    (6.12)

Equation (6.11) can be written in the form

    C(M) = \sum_M q(M) \log \frac{q(M)}{\pi_{M_1}^* \left[ \prod_{t=1}^{T-1} a_{M_t M_{t+1}}^* \right] \left[ \prod_{t=1}^{T} \exp(-C(x(t) | M_t)) \right]}.    (6.13)

An expression of the form \int q(x) \log \frac{q(x)}{p^*(x)} dx is minimised with respect to q(x) by setting q(x) = \frac{1}{Z} p^*(x), where Z is the appropriate normalising constant [39]. This can be proved with reasoning similar to that of Equation (3.13). The cost in Equation (6.13) can thus be minimised by setting

    q(M) = \frac{1}{Z_M} \pi_{M_1}^* \left[ \prod_{t=1}^{T-1} a_{M_t M_{t+1}}^* \right] \left[ \prod_{t=1}^{T} \exp(-C(x(t) | M_t)) \right]    (6.14)

where Z_M is the appropriate normalising constant.

The derived optimal approximation is very similar in form to the exact posterior of Equation (4.6). Therefore the point probabilities q(M_1 = i) and q(M_t = j | M_{t-1} = i) can be evaluated with a modified forward–backward iteration. The result is the same as in Equation (4.9), except that in the iteration \pi_i is replaced with \pi_i^*, a_{ij} is replaced with a_{ij}^* and b_i(x) is replaced with \exp(-C(x(t) | M_t = i)).

Finding the optimal q(θ) for the Dirichlet parameters

Let us now assume that q(M) is fixed and optimise q(A) and q(π). Assuming that everything else is fixed, the cost function can be written as a functional of q(A), up to an additive constant:

    C(A) = \int q(A) \left[ \log q(A) - \sum_{i,j=1}^{N} (u_{ij}^{(A)} - 1) \log a_{ij} - \sum_{t=1}^{T-1} \sum_M q(M) \log a_{M_t M_{t+1}} \right] dA
         = \int q(A) \log \frac{q(A)}{\prod_{i,j=1}^{N} a_{ij}^{(W_{ij} - 1)}} dA    (6.15)

where W_{ij} = u_{ij}^{(A)} + \sum_{t=1}^{T-1} q(M_t = i, M_{t+1} = j). As before, the optimal q(A) is of the form q(A) = \frac{1}{Z_A} \prod_{i,j} a_{ij}^{(W_{ij} - 1)}. The update rule for the parameters \hat{a}_{ij} of q(A) is thus

    \hat{a}_{ij} \leftarrow W_{ij} = u_{ij}^{(A)} + \sum_{t=1}^{T-1} q(M_t = i, M_{t+1} = j).    (6.16)

Similar reasoning for π gives the update rule

    \hat{\pi}_i \leftarrow u_i^{(\pi)} + q(M_1 = i).    (6.17)

Finding the optimal q(θ) for the Gaussian parameters

As an example of the Gaussian parameters we shall consider m_k(i) and v_k(i). All the others are handled in essentially the same way, except that there are no weights for different states. To simplify the notation, all the indices of m_k(i) and v_k(i) are dropped for the remainder of this section. The relevant terms of the cost function are now, up to an additive constant,

    C(m, v) = \sum_{t=1}^{T} q(M_t = i) \left( \bar{v} + \frac{1}{2} \left[ (x(t) - \bar{m})^2 + \tilde{m} \right] \exp(2\tilde{v} - 2\bar{v}) \right)
            + \frac{1}{2} \left[ (\bar{m} - \bar{m}_m)^2 + \tilde{m} \right] \exp(2\tilde{v}_m - 2\bar{v}_m) - \frac{1}{2} \log \tilde{m}
            + \frac{1}{2} \left[ (\bar{v} - \bar{m}_v)^2 + \tilde{v} \right] \exp(2\tilde{v}_v - 2\bar{v}_v) - \frac{1}{2} \log \tilde{v}.    (6.18)

Let us denote \sigma_{eff}^2 = \exp(2\bar{v} - 2\tilde{v}), \sigma_{m,eff}^2 = \exp(2\bar{v}_m - 2\tilde{v}_m) and \sigma_{v,eff}^2 = \exp(2\bar{v}_v - 2\tilde{v}_v). The derivative of this expression with respect to \tilde{m} is easy to evaluate:

    \frac{\partial C}{\partial \tilde{m}} = \sum_{t=1}^{T} q(M_t = i) \frac{1}{2\sigma_{eff}^2} + \frac{1}{2\sigma_{m,eff}^2} - \frac{1}{2\tilde{m}}.    (6.19)

Setting this to zero gives

    \tilde{m} = \left( \sum_{t=1}^{T} q(M_t = i) \frac{1}{\sigma_{eff}^2} + \frac{1}{\sigma_{m,eff}^2} \right)^{-1}.    (6.20)

The derivative with respect to \bar{m} is

    \frac{\partial C}{\partial \bar{m}} = \sum_{t=1}^{T} q(M_t = i) \frac{1}{\sigma_{eff}^2} [\bar{m} - x(t)] + \frac{1}{\sigma_{m,eff}^2} [\bar{m} - \bar{m}_m]    (6.21)

which has a zero at

    \bar{m} = \left[ \sum_{t=1}^{T} q(M_t = i) \frac{1}{\sigma_{eff}^2} x(t) + \frac{1}{\sigma_{m,eff}^2} \bar{m}_m \right] \tilde{m}    (6.22)

where \tilde{m} is given by Equation (6.20).

The solutions for the parameters of q(m) are exact: the true posterior of these parameters is also Gaussian, so the approximation equals it. This is not the case for the parameters of q(v). The true posterior of v is not Gaussian.
The best Gaussian approximation with respect to the chosen criterion can still be found by solving for the zero of the derivative of the cost function with respect to the parameters of q(v). This is done using Newton's iteration. The derivatives with respect to \bar{v} and \tilde{v} are

    \frac{\partial C}{\partial \bar{v}} = \sum_{t=1}^{T} q(M_t = i) \left( 1 - \left[ (x(t) - \bar{m})^2 + \tilde{m} \right] \exp(2\tilde{v} - 2\bar{v}) \right) + \frac{\bar{v} - \bar{m}_v}{\sigma_{v,eff}^2}    (6.23)

    \frac{\partial C}{\partial \tilde{v}} = \sum_{t=1}^{T} q(M_t = i) \left[ (x(t) - \bar{m})^2 + \tilde{m} \right] \exp(2\tilde{v} - 2\bar{v}) + \frac{1}{2\sigma_{v,eff}^2} - \frac{1}{2\tilde{v}}.    (6.24)

These are set to zero and solved with Newton's iteration.

6.2 Learning algorithm for the nonlinear state-space model

In this section, the learning procedure for the NSSM defined in Section 5.2 is presented. The structure of the section is essentially the same as in the previous section on the HMM: it is first shown how to evaluate the cost function and then how to optimise it.

6.2.1 Evaluating the cost function

As before, the general cost function of ensemble learning, as given in Equation (3.11), is

    C(S, \theta) = C_q + C_p = E[\log q(S, \theta)] + E[-\log p(S, \theta, X)]
                 = E[\log q(S) + \log q(\theta)] + E[-\log p(X | S, \theta)] + E[-\log p(S | \theta) - \log p(\theta)]    (6.25)

where the expectations are taken over q(S, θ). This will be the case for the rest of the section unless stated otherwise.

In the NSSM, all the probability distributions involved are Gaussian, so most of the terms resemble the corresponding ones of the CDHMM. For the parameters θ,

    C_q(\theta_i) = E[\log q(\theta_i)] = -\frac{1}{2} (1 + \log(2\pi \tilde{\theta}_i)).    (6.26)

The term C_q(S) is a little more complicated:

    C_q(S) = E[\log q(S)] = \sum_{k=1}^{n} \left( E[\log q(s_k(1))] + \sum_{t=1}^{T-1} E[\log q(s_k(t+1) | s_k(t))] \right).    (6.27)

The first term reduces to Equation (6.26), but the second term is a little different:

    E_{q(s_k(t), s_k(t+1))}[\log q(s_k(t+1) | s_k(t))] = E_{q(s_k(t))}\{ E_{q(s_k(t+1) | s_k(t))}[\log q(s_k(t+1) | s_k(t))] \}
        = E_{q(s_k(t))}\left\{ -\frac{1}{2} (1 + \log(2\pi \mathring{s}_k(t+1))) \right\} = -\frac{1}{2} (1 + \log(2\pi \mathring{s}_k(t+1))).    (6.28)

The expectation of -\log p(\theta_i | m, v) has been evaluated in Equation (6.5), so the only remaining terms are E[-\log p(X | S, \theta)] and E[-\log p(S | \theta)]. They both involve the nonlinear mappings f and g, so they cannot be evaluated exactly. The formulas for approximating the distribution of the outputs of an MLP network f are presented in Appendix B. As a result, we get the posterior mean of the outputs \bar{f}_k(s) and the posterior variance, decomposed as

    \tilde{f}_k(s) \approx \tilde{f}_k^*(s) + \sum_j \tilde{s}_j \left[ \frac{\partial f_k(s)}{\partial s_j} \right]^2.    (6.29)

With these results, the remaining terms of the cost function are relatively easy to evaluate. The likelihood term is a standard Gaussian and yields

    C_p(x_k(t)) = E[-\log p(x_k(t) | s(t), \theta)]
                = \frac{1}{2} \log(2\pi) + \bar{v}_{n_k} + \frac{1}{2} \left[ (x_k(t) - \bar{f}_k(s(t)))^2 + \tilde{f}_k(s(t)) \right] \exp(2\tilde{v}_{n_k} - 2\bar{v}_{n_k}).    (6.30)
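The sketch below gives a rough flavour of the second term of Equation (6.29): the input variance is pushed through the network via the Jacobian of a linearisation about the input mean. It deliberately ignores the term \tilde{f}_k^*(s) arising from the uncertainty of the network weights and hidden units, which is handled in Appendix B, so this is only a partial illustration under that simplification.

```python
import numpy as np

def propagate(s_mean, s_var, A, a, B, b):
    """Propagate input mean and diagonal variance through a tanh MLP by linearisation."""
    h = np.tanh(A @ s_mean + a)
    f_mean = B @ h + b                        # output mean approximated by f(s_mean)
    J = B @ np.diag(1.0 - h ** 2) @ A         # Jacobian df/ds at s_mean
    f_var = (J ** 2) @ s_var                  # sum_j s_var_j * (df_k/ds_j)^2
    return f_mean, f_var
```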
The source term is more difficult. The problematic expectation is

    \alpha_k(t) = E[ (s_k(t) - g_k(s(t-1)))^2 ]
                = E[ (\bar{s}_k(t) + \breve{s}_k(t-1, t)(s_k(t-1) - \bar{s}_k(t-1)) - g_k(s(t-1)))^2 ] + \mathring{s}_k(t)
                = (\bar{s}_k(t) - \bar{g}_k(s(t-1)))^2 + \tilde{s}_k(t) + \tilde{g}_k(s(t-1)) - 2 \breve{s}_k(t-1, t) \frac{\partial g_k(s(t-1))}{\partial s_k(t-1)} \tilde{s}_k(t-1)    (6.31)

where the terms have been collected using Equation (5.48) and the additional approximation

    E[ \breve{s}_k(t-1, t) (s_k(t-1) - \bar{s}_k(t-1)) g_k(s(t-1)) ] \approx \breve{s}_k(t-1, t) \frac{\partial g_k(s(t-1))}{\partial s_k(t-1)} \tilde{s}_k(t-1).    (6.32)

Using Equation (6.31), the remaining term of the cost function can be written as

    C_p(s_k(t)) = E[-\log p(s_k(t) | s(t-1), \theta)] = \frac{1}{2} \log(2\pi) + \bar{v}_{m_k} + \frac{1}{2} \alpha_k(t) \exp(2\tilde{v}_{m_k} - 2\bar{v}_{m_k}).    (6.33)

6.2.2 Optimising the cost function

The most difficult part of optimising the cost function of the NSSM is updating the hidden states and the weights of the MLP networks. All the hyperparameters can be handled in exactly the same way as in the CDHMM case presented in Section 6.1.2, but without the additional weights caused by the HMM state probabilities.

Updating the states and the weights is carried out in two steps. First, the value of the cost function is evaluated using the current estimates of all the variables. This is called the forward computation because it consists of a forward pass through the MLP networks. The second, backward computation step consists of evaluating the partial derivatives of the C_p part of the cost function with respect to the different parameters. This is done by moving backward in the network, starting from the outputs and proceeding toward the inputs. The standard back-propagation calculations are done in the same way. In our case, however, all the parameters are described by their own posterior distributions, which are characterised by their means and variances, the cost function is very different from that of standard back-propagation, and the learning is unsupervised. This means that all the computational formulas are different.

Updating the network weights

Assume θ_i is a weight of one of the MLP networks and we have evaluated the partial derivatives \partial C_p / \partial \bar{\theta}_i and \partial C_p / \partial \tilde{\theta}_i. The variance \tilde{\theta}_i is easy to update with a fixed point update rule derived by setting the derivative to zero:

    0 = \frac{\partial C}{\partial \tilde{\theta}_i} = \frac{\partial C_p}{\partial \tilde{\theta}_i} + \frac{\partial C_q}{\partial \tilde{\theta}_i} = \frac{\partial C_p}{\partial \tilde{\theta}_i} - \frac{1}{2\tilde{\theta}_i}  \Rightarrow  \tilde{\theta}_i = \left( 2 \frac{\partial C_p}{\partial \tilde{\theta}_i} \right)^{-1}.    (6.34)

By looking at the form of the cost function for the Gaussian terms, an approximation for the second derivative with respect to the mean is found to be [34]

    \frac{\partial^2 C}{\partial \bar{\theta}_i^2} \approx 2 \frac{\partial C_p}{\partial \tilde{\theta}_i} = \frac{1}{\tilde{\theta}_i}.    (6.35)

This allows using an approximate Newton's iteration to update the mean:

    \bar{\theta}_i \leftarrow \bar{\theta}_i - \frac{\partial C}{\partial \bar{\theta}_i} \left( \frac{\partial^2 C}{\partial \bar{\theta}_i^2} \right)^{-1} \approx \bar{\theta}_i - \frac{\partial C}{\partial \bar{\theta}_i} \tilde{\theta}_i.    (6.36)

There are some minor corrections to these update rules, as explained in [34].
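In code, the update rules (6.34)–(6.36) for a single weight reduce to a few lines. The sketch below is a schematic illustration: the two gradients are assumed to have been computed in the backward pass, and the minor corrections of [34] are omitted.

```python
def update_weight(mean, var, dC_dmean, dCp_dvar):
    """One update of q(theta_i) = N(mean, var) following Eqs. (6.34)-(6.36)."""
    var_new = 1.0 / (2.0 * dCp_dvar)      # Eq. (6.34): fixed point of dC/dvar = 0
    mean_new = mean - dC_dmean * var_new  # Eq. (6.36): Newton step using Eq. (6.35)
    return mean_new, var_new
```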
Updating the hidden states

The basic setting for updating the hidden states is the same as above for the network weights. The correlations between consecutive states cause some changes to the formulas and require new ones for the adaptation of the correlation coefficients.

All the feedforward computations use the marginal variances \tilde{s}_k(t), which are not actual variational parameters. This affects the derivatives with respect to the other parameters of the state distribution. Let us use the notation C_p(\tilde{s}_k(t)) to indicate that the C_p part of the cost function is considered a function of the intermediate variables \tilde{s}_k(1), \ldots, \tilde{s}_k(t) in addition to the variational parameters. This and Equation (5.48) yield the following rules for evaluating the derivatives of the true cost function:

    \frac{\partial C}{\partial \mathring{s}_k(t)} = \frac{\partial C_q}{\partial \mathring{s}_k(t)} + \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} \frac{\partial \tilde{s}_k(t)}{\partial \mathring{s}_k(t)} = \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} - \frac{1}{2 \mathring{s}_k(t)}    (6.37)

    \frac{\partial C}{\partial \breve{s}_k(t-1, t)} = \frac{\partial C_p(\tilde{s}_k(t))}{\partial \breve{s}_k(t-1, t)} + \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} \frac{\partial \tilde{s}_k(t)}{\partial \breve{s}_k(t-1, t)}
        = \frac{\partial C_p(\tilde{s}_k(t))}{\partial \breve{s}_k(t-1, t)} + 2 \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} \breve{s}_k(t-1, t) \tilde{s}_k(t-1).    (6.38)

The term \partial C_p(\tilde{s}_k(t)) / \partial \tilde{s}_k(t) in the above equations cannot be evaluated directly but requires, again, the use of new intermediate variables. This leads to the recursive formula

    \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} = \frac{\partial C_p(\tilde{s}_k(t+1))}{\partial \tilde{s}_k(t)} + \frac{\partial C_p(\tilde{s}_k(t+1))}{\partial \tilde{s}_k(t+1)} \frac{\partial \tilde{s}_k(t+1)}{\partial \tilde{s}_k(t)}
        = \frac{\partial C_p(\tilde{s}_k(t+1))}{\partial \tilde{s}_k(t)} + \frac{\partial C_p(\tilde{s}_k(t+1))}{\partial \tilde{s}_k(t+1)} \breve{s}_k^2(t, t+1).    (6.39)

The terms \partial C_p(\tilde{s}_k(t+1)) / \partial \tilde{s}_k(t) are now the ones that can be evaluated with the backward computations through the MLPs as usual. The term \partial C_p(\tilde{s}_k(t)) / \partial \breve{s}_k(t-1, t) is easy to evaluate from Equation (6.33), giving

    \frac{\partial C_p(\tilde{s}_k(t))}{\partial \breve{s}_k(t-1, t)} = -\frac{\partial g_k(s(t-1))}{\partial s_k(t-1)} \tilde{s}_k(t-1) \exp(2\tilde{v}_{m_k} - 2\bar{v}_{m_k}).    (6.40)

Equations (6.38) and (6.40) yield a fixed point update rule for \breve{s}_k(t-1, t):

    \breve{s}_k(t-1, t) = \frac{\partial g_k(s(t-1))}{\partial s_k(t-1)} \exp(2\tilde{v}_{m_k} - 2\bar{v}_{m_k}) \left( 2 \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} \right)^{-1}.    (6.41)

The result depends, for instance, on \breve{s}_k(t, t+1) through Equation (6.39), so the updates must be done starting from the last time instant and proceeding backward in time. The fixed point update rule for the variances \mathring{s}_k(t) can be solved from Equation (6.37):

    \mathring{s}_k(t) = \left( 2 \frac{\partial C_p(\tilde{s}_k(t))}{\partial \tilde{s}_k(t)} \right)^{-1}.    (6.42)

The update rule for the means is similar to that of the weights in Equation (6.36), but it includes a correction which tries to compensate for the simultaneous updates of the sources. The correction is explained in detail in [60].

6.2.3 Learning procedure

At the beginning of learning for a new data set, the posterior means of the network weights are initialised to random values and the variances to small constant values. The original data is augmented with delay coordinate embedding, which was presented in Section 2.1.4, so that it consists of multiple time-shifted copies. The hidden states are initialised with a principal component analysis (PCA) [27] projection of the augmented data, which is also used in training at the beginning of learning.

The learning procedure of the NSSM consists of sweeps. During one sweep, all the parameters of the model are updated as outlined above. There are, however, different phases in learning, so that not all the parameters are updated at the very beginning. These phases are summarised in Table 6.1.

Table 6.1: The different phases of NSSM learning.

    Sweeps        Updates
    0–50          Only the weights of the MLPs are updated; the hidden states and
                  hyperparameters are kept fixed at their initial values.
    50–100        Only the weights and the hidden states are updated; the
                  hyperparameters are still kept fixed.
    100–1000      Everything is updated using the augmented data.
    1000          The original data is restored and the parts of the observation
                  network f corresponding to the extra data are pruned away.
    1000–10000    Everything is updated using the original data.
6.2.3 Learning procedure

At the beginning of learning for a new data set, the posterior means of the network weights are initialised to random values and the variances to small constant values. The original data is augmented with delay coordinate embedding, which was presented in Section 2.1.4, so that it consists of multiple time-shifted copies of itself. The hidden states are initialised with a principal component analysis (PCA) [27] projection of the augmented data. The augmented data is also used in training at the beginning of the learning.

The learning procedure of the NSSM consists of sweeps. During one sweep, all the parameters of the model are updated as outlined above. There are, however, different phases in learning, so that not all the parameters are updated from the very beginning. These phases are summarised in Table 6.1.

Table 6.1: The different phases of NSSM learning.

Sweeps        Updates
0–50          Only the weights of the MLPs are updated; the hidden states and hyperparameters are kept fixed at their initial values.
50–100        Only the weights and the hidden states are updated; the hyperparameters are still kept fixed.
100–1000      Everything is updated using the augmented data.
1000          The original data is restored, and the parts of the observation network f corresponding to the extra data are pruned away.
1000–10000    Everything is updated using the original data.

6.2.4 Continuing learning with new data

Continuing the learning process with the old model but new data requires initial estimates for the new hidden states. If the new data is a direct continuation of the old, the predictions of the old states provide a reasonable initial estimate for the new ones, and the algorithm can continue the adaptation from there.

If the new data forms an entirely separate sequence, the problem is more difficult. Knowing the model, we can still do much better than starting at random or using the same initialisation as in the very beginning. One way to find the estimates is to use an auxiliary MLP network to model the inverse of the observation mapping f [59]. This MLP can be trained with standard supervised backpropagation, using the estimated means of s(t) and x(t) as the training set. Their roles are of course inverted, so that x(t) are the inputs and s(t) the outputs. The auxiliary MLP cannot give perfect estimates for the states s(t), but its estimates can usually be adapted very quickly by using the standard learning algorithm to update only the hidden states.

6.3 Learning algorithm for the switching model

In this section, the learning procedure for the switching model is presented.

6.3.1 Evaluating and optimising the cost function

The switching model is a combination of the two models, as shown in Section 5.3. Therefore most of the terms of the cost function are exactly as they were in the individual models. The only difference is in the term $C_p(s_k(t)) = \mathrm{E}[-\log p(s_k(t) \mid \cdots)]$. The term $\alpha_k(t)$ of Equation (6.31) changes to $\alpha_{k,i}(t)$ for the case $M_t = i$ and has the value

\[
\begin{aligned}
\alpha_{k,i}(t) &= \mathrm{E}\left[ (s_k(t) - g_k(\mathbf{s}(t-1)) - m_{M_i,k})^2 \right] \\
&= (\overline{s}_k(t) - \overline{g}_k(\mathbf{s}(t-1)) - \overline{m}_{M_i,k})^2 + \widetilde{s}_k(t) + \widetilde{m}_{M_i,k} \\
&\qquad + \widetilde{g}_k(\mathbf{s}(t-1)) - 2\breve{s}_k(t-1,t)\frac{\partial \overline{g}_k(\mathbf{s}(t-1))}{\partial \overline{s}_k(t-1)}\widetilde{s}_k(t-1).
\end{aligned}
\tag{6.43}
\]

With this result, the expectation of Equation (6.33) involving the source prior becomes

\[
\begin{aligned}
C_p(S) &= \mathrm{E}\left[ -\log p(s_k(t) \mid \mathbf{s}(t-1), M_t, \boldsymbol{\theta}) \right] \\
&= \frac{1}{2}\log(2\pi) + \sum_{i=1}^{N} q(M_t = i)\left( \overline{v}_{M_i,k} + \frac{1}{2}\alpha_{k,i}(t)\exp(2\widetilde{v}_{M_i,k} - 2\overline{v}_{M_i,k}) \right).
\end{aligned}
\tag{6.44}
\]

The update rules mostly stay the same, except that the values of the parameters of $v_{m_k}$ must be replaced with properly weighted averages of the corresponding parameters of $v_{M_i,k}$. The HMM prototype vectors are taught using $\mathbf{s}(t) - \mathbf{g}(\mathbf{s}(t-1))$ as the data.
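As an illustration of Equation (6.44), the following sketch evaluates the source prior term of the cost for one component k and one time instant, given the state probabilities q(M_t = i); the variable names are hypothetical.

```python
import numpy as np

def switching_source_cost(alpha, q, v_mean, v_var):
    """Evaluate C_p(S) of Eq. (6.44) for one component k and time t.

    alpha  : array of alpha_{k,i}(t) for the N HMM states, Eq. (6.43)
    q      : posterior probabilities q(M_t = i), summing to one
    v_mean : posterior means of the log-std parameters v_{M_i,k}
    v_var  : posterior variances of the same parameters
    """
    per_state = v_mean + 0.5 * alpha * np.exp(2.0 * v_var - 2.0 * v_mean)
    return 0.5 * np.log(2.0 * np.pi) + np.sum(q * per_state)
```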
6.3.2 Learning procedure

The progress of learning in the switching NSSM is almost the same as in the plain NSSM. The parameters are updated in similar sweeps, and the data are used in exactly the same way. The HMM prototype means are initialised to relatively small random values, and the prototype variances to suitable constant values. The phases in learning the switching model are presented in Table 6.2.

Table 6.2: The different phases of switching NSSM learning.

Sweeps        Updates
0–50          Only the weights of the MLPs and the HMM states are updated; the continuous hidden states, HMM prototypes and hyperparameters are kept fixed at their initial values.
50–100        Only the weights and both sets of hidden states are updated; the HMM prototypes and hyperparameters are still kept fixed.
100–500       Everything but the HMM prototypes is updated.
500–1000      Everything is updated using the augmented data.
1000          The original data is restored, and the parts of the observation network f corresponding to the extra data are pruned away.
1000–10000    Everything is updated using the original data.

In each sweep of the learning algorithm, the following computations are performed (a schematic skeleton of one sweep is sketched after this list):

• The distributions of the outputs of the MLP networks f and g are evaluated as presented in Appendix B.
• The HMM state probabilities are updated as in Equation (6.14).
• The partial derivatives of the cost function with respect to the weights and inputs of the MLP networks are evaluated by inverting the computations of Appendix B and using Equations (6.37)–(6.39).
• The parameters of the continuous hidden states s(t) are updated using Equations (6.36), (6.41) and (6.42).
• The parameters of the MLP network weights are updated using Equations (6.34) and (6.36).
• The HMM output parameters are updated using Equations (6.20) and (6.22), and the results from solving Equations (6.23)–(6.24).
• The hyperparameters of the HMM are updated using Equations (6.16) and (6.17).
• All the other hyperparameters are updated using a similar procedure as with the HMM output parameters.
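The skeleton below arranges the steps above into one sweep. It is schematic only: the method names are hypothetical and the bodies are deliberately left as placeholders.

```python
class SwitchingNSSMSweep:
    """Schematic skeleton of one learning sweep of the switching NSSM.

    Each method corresponds to one step of the list above; the names
    are illustrative, not those of the actual implementation.
    """

    def run(self, model, data):
        # Forward computation: distributions of MLP outputs (Appendix B).
        f_out, g_out = self.propagate_mlps(model, data)
        # HMM state probabilities, Eq. (6.14).
        q = self.update_hmm_states(model, f_out, g_out)
        # Backward computation: derivatives of C_p, Eqs. (6.37)-(6.39).
        grads = self.backward_pass(model, f_out, g_out, q)
        self.update_hidden_states(model, grads)      # Eqs. (6.36), (6.41), (6.42)
        self.update_weights(model, grads)            # Eqs. (6.34), (6.36)
        self.update_hmm_outputs(model, q)            # Eqs. (6.20), (6.22)-(6.24)
        self.update_hmm_hyperparameters(model, q)    # Eqs. (6.16), (6.17)
        self.update_other_hyperparameters(model)

    def propagate_mlps(self, model, data): ...
    def update_hmm_states(self, model, f_out, g_out): ...
    def backward_pass(self, model, f_out, g_out, q): ...
    def update_hidden_states(self, model, grads): ...
    def update_weights(self, model, grads): ...
    def update_hmm_outputs(self, model, q): ...
    def update_hmm_hyperparameters(self, model, q): ...
    def update_other_hyperparameters(self, model): ...
```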
6.3.3 Learning with known state sequence

Sometimes we want to use the switching NSSM to model data about which we already know something. With speech data, for instance, we may know the "correct" sequence of phonemes in the utterance. This does not make learning the HMM part unnecessary: the correct segmentation still requires determining the times of the transitions between the states, as only the states the model has to pass through and their order are given.

Such problems can be solved by estimating the HMM states for a modified model, namely one that only allows transitions in the correct sequence. These probabilities can then be transformed back to the true state probabilities for the adaptation of the other model parameters. The forward–backward procedure must also be modified slightly, as the first and the last state of a sequence are now known for certain.

When the correct state sequences are known, the different orderings of the HMM states are no longer equivalent. Therefore the HMM output distribution parameters can, and actually should, all be initialised to zeros. A random initialisation could make the output model of a certain state very different from the true output, thus making the learning much more difficult.

Chapter 7

Experimental results

A series of experiments was conducted to verify that the switching NSSM model and the learning algorithm derived in Chapters 5 and 6 really work. The results of these experiments are presented in this chapter. All the experiments were done with different parts of one large data set of speech. In Section 7.1, the data set used and some of its properties are introduced. In Section 7.2, the developed switching NSSM is compared with standard HMM, NSSM and NFA models. The results of a segmentation experiment are presented in Section 7.3.

7.1 Speech data

The data collection used in the experiments consisted of individual Finnish words, spoken by 59 different speakers. The data has been collected at the Laboratory of Computer and Information Science. A detailed description of the data and its collection process can be found in Vesa Siivola's Master's thesis [54].

Due to the complexity of the algorithms used, only a very small fraction of the whole collection was ever used. The time needed for learning increases linearly with the amount of data, and with more data even a single experiment would have taken months to complete. The data sets used in the experiments were selected by randomly taking individual words from the whole collection. In a typical experiment, where the data set consisted of only a few dozen words, this meant that practically every word in the set was spoken by a different person.

7.1.1 Preprocessing

The preprocessing performed to turn the digitised speech samples into the observation vectors for the algorithms was as follows:

1. The signal was high-pass filtered to emphasise the important higher frequencies. This was done with a first order FIR filter having the transfer function $H(z) = 1 - 0.95z^{-1}$.
2. A 256-point Fourier transform with Hamming windowing was calculated for short overlapping segments. Two consecutive segments overlapped by half a segment.
3. The frequencies were transformed to the Mel scale to emphasise the features important for understanding speech. This gave a 30-component vector for each segment.
4. The logarithm of the energies on the Mel scale was used as the observations.

These steps form a rather standard preprocessing procedure for speech recognition [54, 33]. The Mel scale of frequencies has been designed to model the frequency response of the human ear. The scale was constructed by asking naïve listeners to judge when a heard sound had double or half the frequency of a reference tone. The resulting scale is close to linear at frequencies below 1000 Hz and nearly logarithmic above that [54]. Figure 7.1 shows an example of what the preprocessed data looks like.
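For concreteness, the preprocessing chain could be sketched roughly as follows. This is not the implementation actually used: the helper mel_filterbank, which would return a matrix of triangular Mel filters, is hypothetical, and details such as the segment length being equal to the transform length are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def preprocess(signal, n_fft=256, n_mel=30):
    """Turn a speech signal into log Mel-spectrogram observation vectors."""
    # 1. First order FIR pre-emphasis, H(z) = 1 - 0.95 z^-1.
    emphasized = lfilter([1.0, -0.95], [1.0], signal)
    # 2. 256-point FFT of Hamming-windowed, half-overlapping segments
    #    (segment length = n_fft is an assumption).
    window = np.hamming(n_fft)
    hop = n_fft // 2
    n_frames = (len(emphasized) - n_fft) // hop + 1
    frames = np.stack([emphasized[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # 3. Map the spectrum to a 30-channel Mel-scale filterbank;
    #    mel_filterbank (hypothetical) returns an (n_mel, n_fft//2 + 1)
    #    matrix of triangular filters.
    mel_energies = power @ mel_filterbank(n_fft, n_mel).T
    # 4. Logarithm of the Mel-scale energies as observations.
    return np.log(mel_energies + 1e-10)
```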
7.1.2 Properties of the data set

Figure 7.1: An example of the preprocessed spectrogram of a speech segment. Time increases from left to right and frequency from down to up. White areas correspond to low energy of the signal and dark areas to high energy. The word in the segment is "JOHTOPÄÄTÖKSIÄ", meaning "conclusions". Every letter in the written word corresponds to one phoneme in speech. The silent areas in the middle correspond to the consonants t, p, t and k, thus revealing the segmentation of the utterance into phonemes.

One distinctive property of the data is that it is not very continuous. This is due to the bad frequency resolution of the relatively short-time Fourier transform used in the preprocessing.

The same data set was used with the static NFA model in [25]. The part used in these experiments consisted of spectrograms of 24 individual words, spoken by 20 different speakers. The preprocessed data consisted of 2547 spectrogram vectors with 30 components.

For studying the dimensionality of the data, linear and nonlinear factor analysis were applied to it. The results are shown in Figure 7.2. All the NFA experiments used an MLP network with 30 hidden neurons. The data manifold is clearly nonlinear, because nonlinear factor analysis is able to explain it equally well with fewer components than linear factor analysis. The difference is especially clear when the number of components is relatively small. Even though the analysis only uses static models, it can be used to estimate a lower bound for the number of continuous hidden states needed in the experiments with dynamical models.

Figure 7.2: The remaining energy of the speech data as a function of the number of extracted components using linear and nonlinear factor analysis.

A small segment of the original data and its reconstructions with eight nonlinear and linear components are shown in Figure 7.3. The reconstructed spectrograms are somewhat smoother than the original ones. Still, all the discriminative features of the original spectrum are well preserved in the nonlinear reconstruction. This means that the dropped components mostly correspond to noise. The linear reconstruction is not as good, especially at the beginning.

The extracted nonlinear factors, rotated with linear ICA, are shown in Figure 7.4. They seem rather smooth, so it seems plausible that the dynamic models would be able to model the data better. The representation of the data given by the nonlinear factors is, however, somewhat more difficult to interpret: it is rather difficult to see how the different factors affect the predicted outputs.

7.2 Comparison with other models

In this section, a comparative test between the switching NSSM and some other standard models is presented. The goodness of the models is measured with the model evidence, i.e. the likelihood p(X|Hi) for given data set X and model Hi. The theory behind this comparison is presented in Section 3.2.1. The values of the evidence are evaluated using the variational approximation given by ensemble learning.

Figure 7.3: A short fragment of the data used in the experiment. The first subfigure shows the original data, the second shows the reconstruction from 8 nonlinear components and the last shows the reconstruction from 8 linear components. Both models were static and did not use any temporal information in the signals; the results would have been exactly the same for any permutation of the data vectors.

7.2.1 The experimental setting

All the models used the same preprocessed data set as in Section 7.1.2. The individual words were processed separately in the preprocessing, and all the dynamical models were instructed to treat each word individually, i.e. not to make predictions across word boundaries.

Figure 7.4: Extracted NFA factors corresponding to the data fragment in Figure 7.3, rotated with linear ICA.

The models used in the comparison were:

• A continuous density HMM with a Gaussian mixture observation model. This was a simple extension of the model presented in Section 5.1 that replaced the Gaussian observation model with mixtures of Gaussians. The number of Gaussians was the same for all the states, but it was optimised by running the algorithm with several values and using the best result. The model was initialised with sufficiently many states, and unused extra states were pruned.
• The nonlinear factor analysis model, as presented in Section 4.2.4. The model had 15-dimensional factors and 30 hidden neurons in the observation MLP network.
• The nonlinear SSM, as presented in Section 5.2. The model had a 15-dimensional state-space and 30 hidden neurons in both MLP networks.
• The switching NSSM, as presented in Section 5.3. The model was essentially a combination of the HMM and NSSM models, except that it used a plain Gaussian observation model for the HMM.

The parameters of the HMM priors for the initial distribution u(π) and the transition matrix u(A) were all set to ones. This corresponds to a flat, noninformative prior.
The choice does not affect the performance of the switching NSSM very much. The HMM, on the other hand, is very sensitive to the prior.

The data used with the plain HMM was additionally decorrelated with principal component analysis (PCA) [27]. This improved the performance considerably compared to the situation without the decorrelation, as the prototype Gaussians were restricted to be uncorrelated. The other algorithms can include the same transformation in the output mapping, so it was not necessary to do it by hand. Using the non-decorrelated data has the advantage that it is "human readable", whereas the decorrelated data is much more difficult to interpret.

7.2.2 The results

The results reached by the different models are summarised in Table 7.1. The static NFA model gets the worst score in describing the data. It is a little faster than the NSSM and switching NSSM but significantly slower than the HMM. The HMM is a little better, and it is clearly the fastest of the algorithms. The NSSM is significantly better than the HMM but takes quite a lot more time. The switching NSSM is a clear winner in describing the data, but it is also the slowest of all the algorithms. The difference in speed between the NSSMs with and without switching is relatively small.

Table 7.1: The results of the model comparison experiment. The second column contains the attained values of the ensemble learning cost function; lower values are better, and they translate to probabilities as p(X|H) ≈ e^{−C}. The third column contains a rough estimate of the time needed to run one simulation with Matlab on a single relatively fast RISC processor.

Model             Cost function value   Time needed
NFA               111 041               A few days
HMM               104 654               About an hour
NSSM              90 955                About a week
Switching NSSM    82 410                More than a week

The simulations with NFA, NSSM and switching NSSM were run using a 15-dimensional latent space. The results would probably have been slightly better with a larger dimensionality, but unfortunately the current optimisation algorithm used for the models is somewhat unstable above that limit. Optimisation of the structures of the MLP networks would also have helped, but it would have taken too much time to be practical. The present results are thus the best of a few simulations with the same fixed structure but different random initialisations.

7.3 Segmentation of annotated data

In this section, an experiment dealing with data segmentation is presented. The correct phoneme sequence for each segment of the speech data is known, but the correct segmentation, i.e. the times of the transitions between the phonemes, is not. Therefore it is reasonable to see whether the model can learn to find the correct segmentation.

The HMM used in this experiment had four states for each phoneme of the data. These states were linked together in a chain. Transitions from "outside" were allowed only to the first state, and then forward in the chain until the last internal state of the phoneme was reached. This is a standard procedure in speech recognition to model the duration of the phonemes.
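Such a chain structure can be encoded as a constraint on the allowed HMM transitions. A minimal sketch follows, in which the exact rule for leaving the last state of a phoneme is an assumption made for illustration.

```python
import numpy as np

def left_to_right_mask(n_phonemes, states_per_phoneme=4):
    """Allowed-transition mask for a chain-structured HMM.

    Within a phoneme, a state may stay put or move one step forward in
    the chain; the last state of a phoneme may additionally enter the
    first state of any phoneme (the exit rule is an assumption made
    for this illustration).
    """
    n = n_phonemes * states_per_phoneme
    mask = np.zeros((n, n), dtype=bool)
    for p in range(n_phonemes):
        base = p * states_per_phoneme
        for i in range(states_per_phoneme):
            s = base + i
            mask[s, s] = True                  # self transition
            if i + 1 < states_per_phoneme:
                mask[s, s + 1] = True          # forward in the chain
            else:
                # Exit: to the first state of any phoneme.
                mask[s, ::states_per_phoneme] = True
    return mask
```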
7.3.1 The training procedure

In this experiment, the switching NSSM is used in a partially supervised mode, i.e. it is trained as explained in Section 6.3.3. Using only this modification does not, however, lead to good results. The computational complexity of the NSSM learning allows using only a couple of thousand samples of data. With the preprocessing used in this work, this means that the NSSM must be trained with only a few dozen words. This training material is by no means enough to learn models for all the phonemes, as many of them appear only a few times in the training data set.

To circumvent this problem, the training of the model was split into several phases. In the first phase, the complete model was trained with a data set of 23 words forming 5069 sample vectors. This should be enough for the NSSM part to find a reasonable representation for the data. The number of sweeps for this training was 10000. After this step, no NSSM parameters were adapted any more.

The number of words in the training set of the first phase is small compared to the data set of the previous experiment. This is due to the inclusion of significant segments of silence in the samples. It is important for the model to also learn a good model for silence, since in real life, speech is always embedded in silence or plain background noise.

In the second phase, the training data set was changed to a new one for continuing the training of the HMM. The new data set consisted of 786 words forming 100064 sample vectors. The continuous hidden states for the new data were initialised with an auxiliary MLP, as presented in Section 6.2.4. Then the standard learning algorithm was used for 10 sweeps to adapt only the continuous states.

Updating just the HMM and leaving the NSSM out saved a significant amount of time. As Table 7.1 shows, HMM training is about two orders of magnitude faster than NSSM training. Using just the HMM part takes about 20 % of the time needed by the complete switching NSSM; the rest of the improvement is due to the fact that the HMM training converges in far fewer iterations. These savings allowed using a much larger data set to efficiently train the HMM part.

7.3.2 The results

After the first phase of training with the small data set, the segmentations produced by the algorithm seem fairly random, as can be seen from Figure 7.5. The model has not even learnt to separate the speech signal from the silence at the beginning and end of the data segments.

After the full training, the segmentations seem rather good, as Figures 7.6 and 7.7 show. This is a very encouraging result, considering that the segmentations are performed using only the innovation process (the last subfigure) of the NSSM, which consists mostly of the leftovers of the other parts of the model. The results should be significantly better with a model that gives the HMM a larger part in predicting the data.

Figure 7.5: An example of the segmentation given by the algorithm after the first phase of learning. The states '>' and '<' correspond to silence at the beginning and the end of the utterance. The first subfigure shows the marginal probabilities of the HMM states for each sample. The second subfigure shows the data, the third shows the continuous hidden states s(t) and the last shows the innovation process s(t) − g(s(t − 1)). The HMM does its segmentation solely based on the values of the innovation process, i.e. the last subfigure. The word in the figure is "VASEN", meaning "left".
Figure 7.6: An example of the segmentation given by the algorithm after complete learning. The data and the meanings of the different parts of the figure are the same as in Figure 7.5. The results are significantly better, though not yet quite perfect.

Figure 7.7: Another example of a segmentation given by the algorithm after complete learning. The meanings of the different parts of the figure are the same as in Figures 7.5 and 7.6. The figure illustrates the segmentation of a longer word. The result shows several relatively probable paths, not just one as in the previous figures. The word in the figure is "POHJANMAALLA". The double phonemes are treated as one in the segmentation.

Chapter 8

Discussion

In this work, a Bayesian switching nonlinear state-space model (switching NSSM) and a learning algorithm for its parameters were developed. The switching model combines two dynamical models, a hidden Markov model (HMM) and a nonlinear state-space model (NSSM). The HMM models the long-term behaviour of the data and controls the NSSM, which describes the short-term dynamics of the data.

In order to be usable in practice, the switching NSSM, like any other model, needs an efficient method to learn its parameters. In this work, the Bayesian approximation method called ensemble learning was used to derive a learning algorithm for the model. The requirements of the learning algorithm set some limitations on the structure of the model. The learning algorithm for the NSSM, which was used as the starting point for this work, is computationally intensive. Therefore the additions needed for the switching model had to be designed very carefully to avoid making the model computationally intractable. Most existing switching state-space model structures use entirely different dynamical models for each state of the HMM. Such an approach would have resulted in a more powerful model, but the computational burden would have been too great to be practical. Therefore the developed switching model uses the same NSSM for all the HMM states, and the HMM is only used to model the prediction errors of the NSSM.

The ensemble learning approach seems well suited for the switching NSSM. Models of this kind usually suffer from the problem that the exact posterior of the hidden states is an exponentially growing mixture of Gaussian distributions. Ensemble learning solves this problem elegantly by finding the best approximate posterior consisting of only a single Gaussian distribution. It would also be possible to approximate the posterior with a mixture of a few Gaussian distributions instead of a single one. This would, however, increase the computational burden quite a lot while achieving little gain.

Despite the suboptimal model structure, the switching model performs remarkably well in the experiments. It outperforms all the other tested models in the speech modelling problem by a large margin. Additionally, the switching model yields a segmentation of the data into different discrete dynamical states. This allows using the model for many different segmentation tasks.
For a segmentation problem, the closest competitor of the switching NSSM is the plain HMM. The speech modelling experiment shows that the switching NSSM is significantly better at modelling the speech data than the HMM. This means that the switching model can get more out of the same data.

The greatest problem of the switching model is the computational cost: it is approximately two orders of magnitude slower to train than the HMM. The difference in actually using the fully trained models should, however, be smaller. Much of the difference in training times is caused by the fact that the weights of the MLP networks take very many iterations of the training algorithm to converge. When the system is in operation and the MLPs are not trained, the other parts needed for recognition purposes converge much more quickly. The usage of the switching model in such a recognition system has not been thoroughly studied, so it might be possible to find ways of optimising the usage of the model to make it comparable with the HMM.

In the segmentation experiment, the switching model learnt the desired segmentation of an annotated data set of individual words of speech into different phonemes. However, the initial experiments on using the model for recognition, i.e. finding the segmentation without the annotation, are not as promising. The poor recognition performance is probably due to the fact that the HMM part of the model only uses the prediction error of the continuous NSSM. When the state of the dynamics changes, the prediction error grows. Thus it is easy for the HMM to see that something is changing, but not how it is changing. The HMM would need more influence on the description of the dynamics of the data to know which state the system is going to.

The design of the model was motivated by computational efficiency, and the resulting algorithm seems successful in that sense. The learning algorithm for the switching model is only about 25 % slower than the corresponding algorithm for the plain NSSM. This could probably still be improved, as Matlab is far from an optimal programming environment for HMM calculations. The forward and backward calculations of the MLP networks, which require evaluating the Jacobian matrix of the network at each input point, constitute the slowest part of the NSSM training. Those parts remain unchanged in the switching model and still take most of the time.

There are at least two important lines of future work: giving the model more expressive power by making better use of the HMM part, and optimising the speed of the NSSM part. Improving the model structure to give the HMM a larger role would be important to make the model work in recognition problems. The HMM should be given control over at least some of the parameters of the dynamical mapping. It is, however, difficult to do this without increasing the computational burden of using the model. Time usually helps in this respect, as computers become faster. Joining the two component models together more tightly would also make the learning algorithm even more complicated. Nevertheless, it might be possible to find a better compromise between the computational burden and the power of the model.

Another way to address essentially the same problem would be to optimise the speed of the NSSM. The present algorithm is very good at modelling the data, but it is also rather slow. Using a different kind of structure for the model might help with the speed problem.
This would, however, probably require a completely new architecture for the model.

All in all, nonlinear switching state-space models form an interesting field of study. At present, the algorithms seem computationally too intensive for most practical uses, but this is likely to change in the future.

Appendix A

Standard probability distributions

A.1 Normal distribution

The normal distribution, also known as the Gaussian distribution, is ubiquitous in statistics. By the central limit theorem, the averages of identically distributed random variables are approximately normally distributed, regardless of their original distribution [16]. This section concentrates on the univariate normal distribution, as the general multivariate distribution is not needed in this thesis.

The probability density of the normal distribution is given by

\[
p(x) = N(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).
\tag{A.1}
\]

The parameters of the distribution directly yield its mean and variance: E[x] = μ, Var[x] = σ². The multivariate case is very similar:

\[
p(\mathbf{x}) = N(\mathbf{x};\, \boldsymbol{\mu}, \boldsymbol{\Sigma})
= (2\pi)^{-n/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\tag{A.2}
\]

where μ is the mean vector and Σ the covariance matrix of the distribution. For our purposes it is sufficient to note that when the covariance matrix Σ is diagonal, the multivariate normal distribution reduces to a product of independent univariate normal distributions.

By the definition of the variance,

\[
\sigma^2 = \mathrm{Var}[x] = \mathrm{E}[(x-\mu)^2] = \mathrm{E}[x^2 - 2x\mu + \mu^2]
= \mathrm{E}[x^2] - 2\mu\,\mathrm{E}[x] + \mu^2 = \mathrm{E}[x^2] - \mu^2.
\tag{A.3}
\]

This gives

\[
\mathrm{E}[x^2] = \mu^2 + \sigma^2.
\tag{A.4}
\]

The negative differential entropy of the normal distribution can be evaluated simply as

\[
\mathrm{E}[\log p(x)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\,\mathrm{E}\left[ \frac{(x-\mu)^2}{\sigma^2} \right]
= -\frac{1}{2}\left( \log(2\pi\sigma^2) + 1 \right).
\tag{A.5}
\]

Another important expectation for our purposes is [35]

\[
\begin{aligned}
\mathrm{E}[\exp(-2x)] &= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \exp(-2x)\, dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2 + 4x\sigma^2}{2\sigma^2} \right) dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2 - 2\mu x + \mu^2 + 4x\sigma^2}{2\sigma^2} \right) dx \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{[x + (2\sigma^2 - \mu)]^2 + 4\mu\sigma^2 - 4(\sigma^2)^2}{2\sigma^2} \right) dx \\
&= \exp\left( 2\sigma^2 - 2\mu \right) \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{[x + (2\sigma^2 - \mu)]^2}{2\sigma^2} \right) dx \\
&= \exp\left( 2\sigma^2 - 2\mu \right).
\end{aligned}
\tag{A.6}
\]

A plot of the probability density function of the normal distribution is shown in Figure A.1.

Figure A.1: Plot of the probability density function of the zero mean, unit variance normal distribution N(0, 1).
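The identity (A.6) is easy to verify numerically; a quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.3, 0.5
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Monte Carlo estimate of E[exp(-2x)] against the closed form (A.6).
print(np.exp(-2 * x).mean())        # ~ exp(0.4) ~ 1.49
print(np.exp(2 * sigma2 - 2 * mu))  # exp(2*0.5 - 2*0.3) = exp(0.4) ~ 1.49
```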
A.2 Dirichlet distribution

The multinomial distribution is a discrete distribution which gives the probability of choosing a given collection of m items from a set of n items with repetitions, the probabilities of each choice being given by p1, . . . , pn. These probabilities are the parameters of the multinomial distribution [16].

The Dirichlet distribution is the conjugate prior of the parameters of the multinomial distribution. The probability density of the Dirichlet distribution for variables p = (p1, . . . , pn) with parameters u = (u1, . . . , un) is defined by

\[
p(\mathbf{p}) = \mathrm{Dirichlet}(\mathbf{p};\, \mathbf{u}) = \frac{1}{Z(\mathbf{u})} \prod_{i=1}^{n} p_i^{u_i - 1}
\tag{A.7}
\]

when $p_1, \ldots, p_n \ge 0$, $\sum_{i=1}^{n} p_i = 1$ and $u_1, \ldots, u_n > 0$. The parameters $u_i$ can be interpreted as "prior observation counts" for events governed by $p_i$. The normalisation constant $Z(\mathbf{u})$ is

\[
Z(\mathbf{u}) = \frac{\prod_{i=1}^{n} \Gamma(u_i)}{\Gamma\left( \sum_{i=1}^{n} u_i \right)}.
\tag{A.8}
\]

Let $u_0 = \sum_{i=1}^{n} u_i$. The mean and variance of the distribution are [16]

\[
\mathrm{E}[p_i] = \frac{u_i}{u_0}
\tag{A.9}
\]

and

\[
\mathrm{Var}[p_i] = \frac{u_i (u_0 - u_i)}{u_0^2 (u_0 + 1)}.
\tag{A.10}
\]

When $u_i \to 0$, the distribution becomes noninformative. The means of all the $p_i$ stay the same if all the $u_i$ are scaled by the same multiplicative constant, but the variances get smaller as the parameters $u_i$ grow. The pdfs of the Dirichlet distribution for certain parameter values are shown in Figure A.2.

Figure A.2: Plots of one component of a two-dimensional Dirichlet distribution. The parameters are chosen such that u1 = u2 = u, with u taking the values 0.5, 1, 2 and 16. Because both parameters of the distribution are equal, the distribution of the other component is exactly the same.

In addition to the standard statistics given above, using ensemble learning for parameters with a Dirichlet distribution requires the evaluation of the expectation E[log pi] and the negative differential entropy E[log p(p)]. The first expectation can be reduced to an expectation over a two-dimensional Dirichlet distribution for

\[
(p,\, 1-p) \sim \mathrm{Dirichlet}(u_i,\, u_0 - u_i)
\tag{A.11}
\]

which is given by the integral

\[
\mathrm{E}[\log p_i] = \int_0^1 \frac{\Gamma(u_0)}{\Gamma(u_i)\,\Gamma(u_0 - u_i)}\, p^{u_i - 1} (1-p)^{u_0 - u_i - 1} \log p\; dp.
\tag{A.12}
\]

This can be evaluated analytically to yield

\[
\mathrm{E}[\log p_i] = \Psi(u_i) - \Psi(u_0)
\tag{A.13}
\]

where $\Psi(x) = \frac{d}{dx} \ln \Gamma(x)$ is known as the digamma function. Using this result, the negative differential entropy can be evaluated as

\[
\begin{aligned}
\mathrm{E}[\log p(\mathbf{p})] &= \mathrm{E}\left[ \log\left( \frac{1}{Z(\mathbf{u})} \prod_{i=1}^{n} p_i^{u_i - 1} \right) \right] \\
&= -\log Z(\mathbf{u}) + \sum_{i=1}^{n} (u_i - 1)\, \mathrm{E}[\log p_i] \\
&= -\log Z(\mathbf{u}) + \sum_{i=1}^{n} (u_i - 1)\left[ \Psi(u_i) - \Psi(u_0) \right].
\end{aligned}
\tag{A.14}
\]
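The result (A.13) can likewise be checked numerically against samples drawn from the distribution; a small sketch:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
u = np.array([2.0, 3.0, 5.0])
p = rng.dirichlet(u, size=1_000_000)

# Monte Carlo estimate of E[log p_i] against Eq. (A.13).
print(np.log(p).mean(axis=0))
print(digamma(u) - digamma(u.sum()))
```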
Appendix B

Probabilistic computations for MLP networks

In this appendix, it is shown how to evaluate the distribution of the outputs of an MLP network f, assuming that the inputs s and all the network weights are independent and Gaussian. These equations are needed in evaluating the ensemble learning cost function of the Bayesian NSSM in Section 6.2.

The exact model for our single-hidden-layer MLP network f is

\[
\mathbf{f}(\mathbf{s}) = \mathbf{B}\,\varphi(\mathbf{A}\mathbf{s} + \mathbf{a}) + \mathbf{b}.
\tag{B.1}
\]

The structure is shown in Figure B.1. The distributions of all the parameters, which are assumed to be independent, are

\[
\begin{aligned}
s_i &\sim N(\overline{s}_i,\, \widetilde{s}_i) & \text{(B.2)} \\
A_{ij} &\sim N(\overline{A}_{ij},\, \widetilde{A}_{ij}) & \text{(B.3)} \\
B_{ij} &\sim N(\overline{B}_{ij},\, \widetilde{B}_{ij}) & \text{(B.4)} \\
a_i &\sim N(\overline{a}_i,\, \widetilde{a}_i) & \text{(B.5)} \\
b_i &\sim N(\overline{b}_i,\, \widetilde{b}_i). & \text{(B.6)}
\end{aligned}
\]

Figure B.1: The structure of an MLP network with one hidden layer, mapping the inputs (sources) to the outputs (data).

The computations in the first layer of the network can be written as $y_i = a_i + \sum_j A_{ij} s_j$. Since all the parameters involved are independent, the mean and the variance of $y_i$ are

\[
\overline{y}_i = \overline{a}_i + \sum_j \overline{A}_{ij}\, \overline{s}_j
\tag{B.7}
\]

\[
\widetilde{y}_i = \widetilde{a}_i + \sum_j \left[ \overline{A}_{ij}^2\, \widetilde{s}_j + \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right) \right].
\tag{B.8}
\]

Equation (B.8) follows from the identity

\[
\mathrm{Var}[\alpha] = \mathrm{E}[\alpha^2] - \mathrm{E}[\alpha]^2.
\tag{B.9}
\]

The nonlinear activation function is handled with a truncated Taylor series approximation about the mean $\overline{y}_i$ of the inputs. Using a second order approximation for the mean and a first order one for the variance yields

\[
\overline{\varphi}(y_i) \approx \varphi(\overline{y}_i) + \frac{1}{2} \varphi''(\overline{y}_i)\, \widetilde{y}_i
\tag{B.10}
\]

\[
\widetilde{\varphi}(y_i) \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i.
\tag{B.11}
\]

These approximations are used because they are the best ones that can be expressed in terms of the input mean and variance.

The computations of the output layer are given by $f_i(\mathbf{s}) = b_i + \sum_j B_{ij} \varphi(y_j)$. This may look the same as for the first layer, but there is the big difference that the $y_j$ are no longer independent. Their dependence does not, however, affect the evaluation of the mean of the outputs, which is

\[
\overline{f}_i(\mathbf{s}) = \overline{b}_i + \sum_j \overline{B}_{ij}\, \overline{\varphi}(y_j).
\tag{B.12}
\]

The dependence between the $y_j$ arises from the fact that each $s_i$ may potentially affect all of them. Hence, the variances of the $s_i$ would be taken into account incorrectly if the $y_j$ were assumed independent. Two possibly interfering paths can be seen in Figure B.1. Let us assume that the net weight of the left path is 1 and the weight of the right path −1, so that the two paths cancel each other out. If the outputs of the hidden layer are incorrectly assumed to be independent, the estimated variance of the output will be greater than zero. The same effect can also happen the other way round, when constructive interference of the two paths leads to an underestimated output variance.

The effects of the different components of the inputs on the outputs can be measured using the Jacobian matrix $\partial \mathbf{f}(\mathbf{s}) / \partial \mathbf{s}$ of the mapping f, with elements $(\partial f_i(\mathbf{s}) / \partial s_j)$. This leads to the approximation for the output variance

\[
\widetilde{f}_i(\mathbf{s}) \approx \sum_j \left( \frac{\partial \overline{f}_i(\mathbf{s})}{\partial \overline{s}_j} \right)^2 \widetilde{s}_j + \widetilde{b}_i
+ \sum_j \left[ \overline{B}_{ij}^2\, \widetilde{\varphi}^*(y_j) + \widetilde{B}_{ij} \left( \overline{\varphi}^2(y_j) + \widetilde{\varphi}(y_j) \right) \right]
\tag{B.13}
\]

where $\widetilde{\varphi}^*(y_j)$ denotes the posterior variance of $\varphi(y_j)$ without the contribution of the input variance. It can be computed as

\[
\widetilde{y}_i^* = \widetilde{a}_i + \sum_j \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right)
\tag{B.14}
\]

\[
\widetilde{\varphi}^*(y_i) \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i^*.
\tag{B.15}
\]

The needed partial derivatives can be evaluated efficiently at the mean of the inputs with the chain rule

\[
\frac{\partial \overline{f}_i(\mathbf{s})}{\partial \overline{s}_j}
= \sum_k \frac{\partial \overline{f}_i(\mathbf{s})}{\partial \overline{\varphi}_k} \frac{\partial \overline{\varphi}_k}{\partial \overline{y}_k} \frac{\partial \overline{y}_k}{\partial \overline{s}_j}
= \sum_k \overline{B}_{ik}\, \varphi'(\overline{y}_k)\, \overline{A}_{kj}.
\tag{B.16}
\]
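The complete forward computation of Equations (B.7)–(B.16) can be summarised in a short function. The sketch below assumes the activation function φ = tanh, for which φ′ = 1 − φ² and φ″ = −2φ(1 − φ²); it is an illustration of the equations, not the actual implementation, and the argument names are hypothetical.

```python
import numpy as np

def mlp_output_distribution(s_mean, s_var, A_mean, A_var, a_mean, a_var,
                            B_mean, B_var, b_mean, b_var):
    """Propagate means and variances through f(s) = B tanh(A s + a) + b."""
    # First layer, Eqs. (B.7)-(B.8); (B.14) omits the input variance.
    y_mean = a_mean + A_mean @ s_mean
    y_var = a_var + A_mean**2 @ s_var + A_var @ (s_mean**2 + s_var)
    y_var_star = a_var + A_var @ (s_mean**2 + s_var)
    # Activation function, Eqs. (B.10)-(B.11) and (B.15), with phi = tanh.
    p = np.tanh(y_mean)
    dphi = 1.0 - p**2                 # phi'(y)
    ddphi = -2.0 * p * dphi           # phi''(y)
    phi_mean = p + 0.5 * ddphi * y_var
    phi_var = dphi**2 * y_var
    phi_var_star = dphi**2 * y_var_star
    # Output layer mean, Eq. (B.12).
    f_mean = b_mean + B_mean @ phi_mean
    # Jacobian at the input mean, Eq. (B.16).
    J = (B_mean * dphi) @ A_mean
    # Output variance, Eq. (B.13): the input variance enters through the
    # Jacobian to account for the dependence between the hidden units.
    f_var = (J**2 @ s_var + b_var
             + B_mean**2 @ phi_var_star + B_var @ (phi_mean**2 + phi_var))
    return f_mean, f_var
```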
Bibliography

[1] D. K. Arrowsmith and C. M. Place. An Introduction to Dynamical Systems. Cambridge University Press, Cambridge, 1990.

[2] David Barber and Christopher M. Bishop. Ensemble learning for multilayer networks. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems 10, NIPS*97, pages 395–401, Denver, Colorado, USA, Dec. 1–6, 1997, 1998. The MIT Press.

[3] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

[4] J. M. Bernardo. Psi (digamma) function. Applied Statistics, 25(3):315–317, 1976.

[5] Christopher Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[6] Christopher Bishop. Latent variable models. In Jordan [29], pages 371–403.

[7] Thomas Briegel and Volker Tresp. Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models. In Michael S. Kearns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, NIPS*98, pages 403–409, Denver, Colorado, USA, Nov. 30–Dec. 5, 1998, 1999. The MIT Press.

[8] Martin Casdagli. Nonlinear prediction of chaotic time series. Physica D, 35(3):335–356, 1989.

[9] Martin Casdagli, Stephen Eubank, J. Doyne Farmer, and John Gibson. State space reconstruction in the presence of noise. Physica D, 51(1–3):52–98, 1991.

[10] Vladimir Cherkassky and Filip Mulier. Learning from Data: Concepts, Theory, and Methods. John Wiley & Sons, New York, 1998.

[11] Y. J. Chung and C. K. Un. An MLP/HMM hybrid model using nonlinear predictors. Speech Communication, 19(4):307–316, 1996.

[12] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley series in telecommunications. John Wiley & Sons, New York, 1991.

[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[14] J. Doyne Farmer and John J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59(8):845–848, 1987.

[15] Ken-ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

[16] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, 1995.

[17] Zoubin Ghahramani. An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(1):9–42, 2001.

[18] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.

[19] Zoubin Ghahramani and Sam T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Michael S. Kearns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, NIPS*98, pages 599–605, Denver, Colorado, USA, Nov. 30–Dec. 5, 1998, 1999. The MIT Press.

[20] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, New York, 1996.

[21] Simon Haykin. Neural Networks – A Comprehensive Foundation, 2nd ed. Prentice-Hall, Englewood Cliffs, 1998.

[22] Simon Haykin and Jose Principe. Making sense of a complex world. IEEE Signal Processing Magazine, 15(3):66–81, 1998.

[23] Geoffrey E. Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the COLT'93, pages 5–13, Santa Cruz, California, USA, July 26–28, 1993.

[24] Sepp Hochreiter and Jürgen Schmidhuber. Feature extraction through LOCOCODE. Neural Computation, 11(3):679–714, 1999.

[25] Antti Honkela and Juha Karhunen. An ensemble learning approach to nonlinear independent component analysis. In Proceedings of the 15th European Conference on Circuit Theory and Design (ECCTD'01), Espoo, Finland, August 28–31, 2001. To appear.

[26] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[27] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. John Wiley & Sons, 2001. In press.

[28] O. L. R. Jacobs. Introduction to Control Theory. Oxford University Press, Oxford, second edition, 1993.

[29] Michael I. Jordan, editor. Learning in Graphical Models. The MIT Press, Cambridge, Massachusetts, 1999.

[30] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. In Jordan [29], pages 105–161.

[31] Rudolf E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82:35–45, 1960.

[32] Edward W. Kamen and Jonathan K. Su. Introduction to Optimal Estimation. Springer, London, 1999.

[33] Mikko Kurimo. Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, 1997. Published in Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 87.

[34] Harri Lappalainen and Antti Honkela. Bayesian nonlinear independent component analysis by multi-layer perceptrons. In Mark Girolami, editor, Advances in Independent Component Analysis, pages 93–121. Springer, Berlin, 2000.
[35] Harri Lappalainen and James W. Miskin. Ensemble learning. In Mark Girolami, editor, Advances in Independent Component Analysis, pages 76–92. Springer, Berlin, 2000.

[36] Peter M. Lee. Bayesian Statistics: An Introduction. Arnold, London, second edition, 1997.

[37] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[38] David J. C. MacKay. Developments in probabilistic modelling with neural networks—ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14–15 September 1995, pages 191–198, Berlin, 1995. Springer.

[39] David J. C. MacKay. Ensemble learning for hidden Markov models. Available from http://wol.ra.phy.cam.ac.uk/mackay/, 1997.

[40] David J. C. MacKay. Choice of basis for Laplace approximation. Machine Learning, 33(1):77–86, 1998.

[41] Peter S. Maybeck. Stochastic Models, Estimation, and Control, volume 1. Academic Press, New York, 1979.

[42] Peter S. Maybeck. Stochastic Models, Estimation, and Control, volume 2. Academic Press, New York, 1982.

[43] Kevin Murphy. Switching Kalman filters. Technical report, Department of Computer Science, University of California Berkeley, 1998.

[44] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, New York, 1996.

[45] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan [29], pages 355–368.

[46] Jacob Palis, Jr and Welington de Melo. Geometric Theory of Dynamical Systems. Springer, New York, 1982.

[47] William D. Penny, Richard M. Everson, and Stephen J. Roberts. Hidden Markov independent component analysis. In Mark Girolami, editor, Advances in Independent Component Analysis, pages 3–22. Springer, Berlin, 2000.

[48] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[49] S. Neil Rasband. Chaotic Dynamics of Nonlinear Systems. John Wiley & Sons, New York, 1990.

[50] Sam T. Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[51] Sam T. Roweis and Zoubin Ghahramani. An EM algorithm for identification of nonlinear dynamical systems. Submitted for publication. Preprint available from http://www.gatsby.ucl.ac.uk/~roweis/publications.html, 2000.

[52] Walter Rudin. Real and Complex Analysis. McGraw-Hill, Singapore, third edition, 1987.

[53] Tim Sauer, James A. Yorke, and Martin Casdagli. Embedology. Journal of Statistical Physics, 65(3/4):579–616, 1991.

[54] Vesa Siivola. An adaptive method to achieve speaker independence in a speech recognition system. Master's thesis, Helsinki University of Technology, Espoo, 1999.

[55] Floris Takens. Detecting strange attractors in turbulence. In David Rand and Lai-Sang Young, editors, Dynamical systems and turbulence, Warwick 1980, volume 898 of Lecture Notes in Mathematics, pages 366–381. Springer, Berlin, 1981.

[56] Edmondo Trentin and Marco Gori. A survey of hybrid ANN/HMM models for automatic speech recognition. Neurocomputing, 37:91–126, 2001.

[57] Harri Valpola. Bayesian Ensemble Learning for Nonlinear Factor Analysis. PhD thesis, Helsinki University of Technology, Espoo, 2000. Published in Acta Polytechnica Scandinavica, Mathematics and Computing Series No. 108.
[58] Harri Valpola. Unsupervised learning of nonlinear dynamic state-space models. Technical Report A59, Helsinki University of Technology, Espoo, 2000.

[59] Harri Valpola, Xavier Giannakopoulos, Antti Honkela, and Juha Karhunen. Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Petteri Pajunen and Juha Karhunen, editors, Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation, ICA 2000, pages 351–356, Espoo, Finland, June 19–22, 2000.

[60] Harri Valpola and Juha Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. A manuscript to be submitted to a journal, 2001.

[61] Christopher S. Wallace and D. M. Boulton. An information measure for classification. The Computer Journal, 11(2):185–194, 1968.

[62] Hassler Whitney. Differentiable manifolds. Annals of Mathematics, 37(3):645–680, 1936.