The Acoustic/Lexical Model: Exploring the Phonetic Units; Triphones/Senones in Action
Ofer M. Shir
Speech Recognition Seminar, 15/10/2003
Leiden Institute of Advanced Computer Science

Theoretical Background – Unit Selection
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable. Words are good units for small-vocabulary SR, but a poor choice for large-vocabulary continuous SR:
• Each word is treated individually, so no data is shared between words; this implies a large amount of training data and storage.
• The recognition vocabulary may contain words that never appeared in the training data.
• Modeling interword coarticulation effects is expensive.

Theoretical Background – Phonemes
The alternative unit is the phoneme. Phonemes are more trainable (there are only about 50 phonemes in English, for example) and more generalizable (vocabulary independent). However, a word is not a sequence of independent phonemes! Our articulators move continuously from one position to another, so the realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc. Different realizations of a phoneme are called allophones.

Theoretical Background – Triphones
The triphone model is a phonetic model that takes both the left and the right neighbouring phonemes into consideration. Triphones are an example of allophones. This model captures the most important coarticulatory effects, which makes it a very powerful model. The cost: since context-dependent models generally increase the number of parameters, trainability becomes much harder. Notice that English has more than 100,000 triphones! So far, however, we have assumed that every triphone context is different; we are therefore motivated to find instances of similar contexts and merge them.

Theoretical Background – Senones
Recall that each allophone model is an HMM, made of states, transitions and probability distributions; the bottom line is that some of these distributions can be tied. The basic idea is clustering, but rather than clustering the HMM models themselves, we cluster only the HMM states. Each cluster represents a set of similar Markov states and is called a senone. Senones provide not only improved recognition accuracy but also a pronunciation-optimization capability.

Theoretical Background – Senonic Trees
Reminder: a decision tree is a binary tree that classifies target objects by asking Yes/No questions in a hierarchical manner. The senonic decision tree classifies the Markov states of the triphones represented in the training data by asking linguistic questions. => The leaves of the senonic trees are the possible senones.

Sphinx III, A Short Review – Front-End Feature Extraction
[Diagram] A 7-frame speech window around the current frame yields a 39-element feature vector:
• 12 elements – cepstrum
• 12 elements – time-derivative cepstrum
• 12 elements – second-time-derivative cepstrum
• 3 elements – power
The feature vectors and their analysis are the inputs to the Gaussian-mixture fitting process (mean, variance, determinant). The phonetic data (senones! – the scoring table) is then fetched from these Gaussian mixtures, using the well-trained machine.

Sphinx III – the Implementation
Handling a single word: evaluate each HMM against the input, using the Viterbi search. Every word gets an HMM composed of its phones (ONE = W AH N, TWO = T UW, THREE = TH R IY), each phone being a 5-state HMM.

The Viterbi Search – Basics
• Instantaneous score: how well a given HMM state matches the feature vector.
• Path: a sequence of HMM states traversed during a given segment of feature vectors.
• Path-score: the product of the instantaneous scores and the state transition probabilities corresponding to a given path.
• The Viterbi search: an efficient lattice structure and algorithm for computing the best path-score for a given segment of feature vectors.

The Viterbi Search – Demo
[Trellis diagrams: the initial state is initialized with path-score = 1.0, and the lattice is filled column by column along the time axis; the legend distinguishes the state with the best path-score, states with path-score < best, and states without a valid path-score.]
The recursion:
    P_j(t) = max_i [ P_i(t-1) · a_ij · b_j(t) ]
where a_ij is the state transition probability from state i to state j, b_j(t) is the score for state j given the input at time t, and P_j(t) is the total path-score ending up at state j at time t. A small sketch of this recursion follows.
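To make the recursion concrete, here is a minimal illustrative sketch in Python (not Sphinx III code, which is written in C). The 3-state left-to-right HMM, its transition matrix a and its state scores b are hypothetical placeholders; a real recognizer would take b_j(t) from the senone scoring table and compute in log-space to avoid numerical underflow.

# Minimal Viterbi sketch for one left-to-right HMM (illustrative only).
# Implements P[j][t] = max_i( P[i][t-1] * a[i][j] * b[j][t] ).

def viterbi(a, b):
    """a[i][j]: transition probability from state i to state j.
    b[j][t]: instantaneous score of state j for the frame at time t.
    Returns the best path-score ending in the final state."""
    n_states = len(a)
    n_frames = len(b[0])
    # P[j][t]: best path-score ending at state j at time t.
    P = [[0.0] * n_frames for _ in range(n_states)]
    P[0][0] = 1.0 * b[0][0]          # initial state starts with path-score 1.0
    for t in range(1, n_frames):
        for j in range(n_states):
            P[j][t] = max(P[i][t - 1] * a[i][j] * b[j][t]
                          for i in range(n_states))
    return P[n_states - 1][n_frames - 1]

# Hypothetical 3-state left-to-right HMM with self-loops, over 4 frames:
a = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
b = [[0.9, 0.2, 0.1, 0.1],   # b[j][t], e.g. fetched from the senone scoring table
     [0.1, 0.7, 0.6, 0.2],
     [0.0, 0.1, 0.3, 0.8]]
print(viterbi(a, b))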
Continuous Speech Recognition
Add transitions from word ends to word beginnings, and run the Viterbi search. [Diagram: the word HMMs for ONE (W AH N), TWO (T UW) and THREE (TH R IY), with transitions looping from each word's end back to every word's beginning.]

Cross-Word Triphone Modeling
Sphinx III uses "triphone" or "phoneme-in-context" HMMs, so remember to inject the left context into the entry state. [Diagram, for ONE = W AH N: a context-dependent AH HMM; the inherited left context is propagated along with the path-scores and dynamically modifies the state model, while separate N HMM instances are kept for each possible right context.]

Sphinx-III – Lexical Tree Structure
Nodes are shared if their triphone Senone-Sequence-IDs (SSIDs) are identical:
START     S T AA R TD
STARTING  S T AA R DX IX NG
STARTED   S T AA R DX IX DD
STARTUP   S T AA R T AX PD
START-UP  S T AA R T AX PD
[Diagram: a tree rooted at S, with the shared prefix S → T → AA → R branching into TD (start), DX → IX → NG (starting), DX → IX → DD (started) and T → AX → PD (startup, start-up).]

Cross-Word Triphones (left context)
[Diagram: the root nodes of the lextree are replicated for each left context – S-models for the different left contexts all feed the rest of the lextree. As before, nodes are shared if their SSIDs are identical.]

Cross-Word Triphones (right context)
[Diagram: a leaf node has triphones for all possible right contexts. Their HMM states are picked and combined into a composite-SSID model, whose composite states are the averages of the component states.]

Sphinx III, the Acoustic Model – File List Summary
mdef.c – definition of the basic phone and triphone HMMs, and the mapping of each HMM state to a senone and to its transition matrix.
dict.c – the pronunciation-dictionary structure.
hmm.c – HMM evaluation using the Viterbi search, i.e. fetching the best senone score. Note that the HMM data structures, defined in hmm.h, are hardwired to 2 possible HMM topologies: 3- or 5-state left-to-right HMMs.
lextree.c – the lexical tree search.

Presentation Resources:
• Xuedong Huang, Alex Acero, Hsiao-Wuen Hon and Raj Reddy: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, 1st edition, April 25, 2001 (hardcover, 980 pages; ISBN 0130226165). Chapters 9 and 13.
• Hwang, M., Huang, X., Alleva, F.: "Predicting Unseen Triphones with Senones", 1993.
• Hwang et al.: "Shared Distribution Hidden Markov Models for Speech Recognition", 1993.
• Hwang et al.: "Subphonetic Modeling with Markov States – Senones", 1992.
• Sphinx-III documentation – a presentation by Mosur Ravishankar, found in the /doc/ folder of the Sphinx-III package.
• "Sphinx-III bible" – a presentation by Edward Lin; http://www.ece.cmu.edu/~ffang/sis/documents/S3Bible.ppt

"I shall never believe that God plays dice with the world, but maybe machines should play dice with human capabilities…" – John Doe
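Appendix – to make the lexical-tree sharing concrete, here is a small illustrative Python sketch (not Sphinx III code). It builds a phone-level prefix tree for the START family above; note that the real lextree merges nodes only when their triphone SSIDs are identical, which this toy version approximates by sharing nodes with identical base phones.

# Toy lexical tree (prefix tree) over base phones (illustrative only).
# Sphinx III shares nodes only when the triphone SSIDs are identical;
# here we approximate that by sharing nodes with identical base phones.

lexicon = {
    "START":    ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP":  ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}

def build_lextree(lexicon):
    """Build a nested-dict prefix tree; leaves collect the word identities."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault("<words>", []).append(word)
    return root

def dump(node, depth=0):
    """Print the tree; shared prefixes (S T AA R ...) appear only once."""
    for key, child in node.items():
        if key == "<words>":
            print("  " * depth + "-> " + ", ".join(child))
        else:
            print("  " * depth + key)
            dump(child, depth + 1)

dump(build_lextree(lexicon))

Running the sketch prints the shared S → T → AA → R prefix once, with the TD, DX and T branches below it; STARTUP and START-UP, having identical pronunciations, end at the same leaf.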