Speech Recognition Seminar The Sphinx III Acoustic/Lexical model

The Acoustic/Lexical model:
Exploring the phonetic units;
Triphones/Senones in action.
Ofer M. Shir
Speech Recognition Seminar, 15/10/2003
Leiden Institute of Advanced Computer Science
Theoretical Background – Unit Selection
When selecting the basic unit of acoustic information, we
want it to be accurate, trainable and generalizable.
Words are good units for small-vocabulary SR – but not
a good choice for large-vocabulary continuous SR:
• Each word is treated individually, with no data sharing,
which implies a large amount of training data and storage.
• The recognition vocabulary may consist of words
which have never been given in the training data.
• Expensive to model interword coarticulation effects.
Theoretical Background - Phonemes
The alternative unit is a Phoneme.
Phonemes are more trainable (there are only about 50 phonemes in
English, for example) and generalizable (vocabulary independent).
However, each word is not a sequence of independent phonemes!
Our articulators move continuously from one position to another.
The realization of a particular phoneme is affected by its phonetic
neighbourhood, as well as by local stress effects etc.
Different realizations of a phoneme are called allophones.
Theoretical Background - Triphones
The Triphone model is a phonetic model which takes into
consideration both the left and the right neighbouring phonemes.
Triphones are an example of allophones.
This model captures the most important coarticulatory effects,
which makes it a very powerful model.
The cost: context-dependent models generally increase the
number of parameters, so training becomes much harder.
Notice that in English there are more than 100,000 triphones!
So far, however, we have assumed that every triphone context is
different.
We are motivated to find instances of similar contexts and merge
them.
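The triphone count can be checked with a little arithmetic: with about 50 phonemes, the number of (left, base, right) combinations is bounded by 50³ = 125,000. The sketch below shows that bound and how a phone sequence expands into triphones. The function name `word_to_triphones` and the use of a silence symbol `SIL` as the context at word edges are assumptions made for illustration, not part of Sphinx III.

```python
NUM_PHONES = 50  # "about 50 phonemes in English"

def word_to_triphones(phones, left="SIL", right="SIL"):
    """Expand a phone sequence into (left, base, right) triphones.

    The boundary contexts default to a hypothetical silence symbol.
    """
    padded = [left] + list(phones) + [right]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Upper bound on distinct triphones: every phone in every left/right context.
print(NUM_PHONES ** 3)  # 125000

# The word ONE (W AH N) becomes three context-dependent units.
print(word_to_triphones(["W", "AH", "N"]))
# [('SIL', 'W', 'AH'), ('W', 'AH', 'N'), ('AH', 'N', 'SIL')]
```

Not every combination occurs in real speech, so the bound is loose, but it shows why training a separate model per context quickly becomes impractical.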
Theoretical Background - Senones
Recall that each allophone model is an HMM, made of states,
transitions and probability distributions; the bottom line is that
some distributions can be tied.
The basic idea is clustering, but rather than clustering the HMM
models themselves, we shall cluster only the HMM states.
Each cluster will represent a set of similar Markov states, and is
called a Senone.
Senones provide not only improved recognition accuracy,
but also a pronunciation-optimization capability.
Theoretical Background – Senonic Trees
Reminder: a decision tree is a binary tree which classifies target
objects by asking Yes/No questions in a hierarchical manner.
The senonic decision tree classifies Markov states of triphones,
represented in the training data, by asking linguistic questions.
=> The leaves of the senonic trees are the possible senones.
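A minimal sketch of such a tree follows, assuming the questions ask about the left and right phonetic context of a triphone state. The phone classes (`NASALS`, `VOWELS`), the two questions and the senone numbers are invented for illustration; a real tree is grown from training data.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical phonetic classes, for illustration only.
NASALS = {"M", "N", "NG"}
VOWELS = {"AA", "AH", "AX", "IX", "IY", "UW"}

@dataclass
class Node:
    # Internal nodes carry a Yes/No question about (left, right) context.
    question: Optional[Callable[[str, str], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    # Leaves carry a senone ID instead of a question.
    senone: Optional[int] = None

def classify(node, left, right):
    """Follow Yes/No answers from the root until a leaf (senone) is reached."""
    while node.senone is None:
        node = node.yes if node.question(left, right) else node.no
    return node.senone

tree = Node(
    question=lambda left, right: left in NASALS,      # "Is the left phone a nasal?"
    yes=Node(senone=0),
    no=Node(
        question=lambda left, right: right in VOWELS,  # "Is the right phone a vowel?"
        yes=Node(senone=1),
        no=Node(senone=2),
    ),
)

print(classify(tree, "N", "T"))   # 0
print(classify(tree, "T", "AH"))  # 1
print(classify(tree, "T", "S"))   # 2
```

Because the questions depend only on the context, a context that never appeared in training still reaches some leaf and so still receives a senone.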
Sphinx III, A Short Review –
Front End Feature Extraction
[Figure: front-end pipeline. A 7-frame speech window yields a 39-element feature vector; the vector is scored against the Gaussian mixtures (mean, variance, determinant) to produce the senones data (scoring table).]
The 39-element feature vector consists of:
• 12 elements: cepstrum of the current frame
• 12 elements: time-derivative cepstrum
• 12 elements: second time-derivative cepstrum
• 3 elements: power
Feature vectors and their analysis are inputs into the
Gaussian Mixtures Fitting Process.
Fetch phonetic data (senones!) from these Gaussian Mixtures,
using the well-trained machine.
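The 12 + 12 + 12 + 3 = 39 layout and the mixture scoring can be sketched as follows. The difference formulas over the 7-frame window, the assumption that the 3 power elements are the power and its two time derivatives, and the names `assemble_feature`, `log_gaussian` and `senone_score` are all illustrative; they are not taken from the Sphinx III sources.

```python
import math

def assemble_feature(frames, t):
    """Build a 39-element vector for frame t from frames t-3 .. t+3.

    frames: list of per-frame lists [power, c1, ..., c12] (13 values each).
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]

    cur = frames[t]
    delta = diff(frames[t + 2], frames[t - 2])          # first time derivative
    delta2 = diff(diff(frames[t + 3], frames[t + 1]),
                  diff(frames[t - 1], frames[t - 3]))   # second time derivative
    power = [cur[0], delta[0], delta2[0]]               # 3 power elements (assumed)
    return cur[1:] + delta[1:] + delta2[1:] + power     # 12 + 12 + 12 + 3

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    log_det = sum(math.log(v) for v in var)  # log-determinant, can be precomputed
    dist = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + dist)

def senone_score(x, mixture):
    """Log-likelihood of x under a mixture of (weight, mean, var) components."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in mixture]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))  # log-sum-exp

# Toy input: 7 frames whose 13 values all equal the frame index.
frames = [[float(i)] * 13 for i in range(7)]
vec = assemble_feature(frames, 3)
print(len(vec))  # 39

# One-component mixture centred on the vector itself, unit variances.
score = senone_score(vec, [(1.0, vec, [1.0] * 39)])
```

Evaluating every senone's mixture on the current feature vector gives the per-frame scoring table that the search then consults.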
Sphinx III – the implementation
Handling a single word; evaluating each HMM according to the
input, using the Viterbi Search.
Every phone gets an HMM (a 5-state HMM per phone):
ONE: W AH N
TWO: T UW
THREE: TH R IY
The Viterbi Search - basics
• Instantaneous score: how well a given HMM
state matches the feature vector.
• Path: A sequence of HMM states traversed during
a given segment of feature vectors.
• Path-score: Product of instantaneous scores and
state transition probabilities corresponding to a
given path.
• The Viterbi search: An efficient lattice structure
and algorithm for computing the best path score
for a given segment of feature vectors.
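The path-score definition can be written out directly. This is a minimal sketch assuming the path starts in an initial state with score 1.0 and that each subsequent frame contributes one transition probability and one instantaneous score; the function name `path_score` and the toy numbers are illustrative.

```python
def path_score(path, inst_score, trans):
    """Score of one given path through the HMM.

    path[0] is the initial state (score 1.0, no frame consumed);
    path[t] is the state occupied at frame t, for t = 1 .. T.
    inst_score[t - 1][j] is the instantaneous score of state j at frame t.
    trans[i][j] is the transition probability from state i to state j.
    """
    score = 1.0
    for t in range(1, len(path)):
        i, j = path[t - 1], path[t]
        score *= trans[i][j] * inst_score[t - 1][j]
    return score

# Two states, two frames.
trans = [[0.6, 0.4],
         [0.0, 1.0]]
inst_score = [[0.5, 0.1],
              [0.4, 0.3]]

# Stay in state 0 for frame 1, then move to state 1 for frame 2:
# (0.6 * 0.5) * (0.4 * 0.3), approximately 0.036.
example = path_score([0, 0, 1], inst_score, trans)
```

Products of many probabilities underflow quickly, so practical decoders usually add log scores instead of multiplying raw values.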
The Viterbi Search - demo
Initial state initialized with path-score = 1.0
[Figure: lattice of HMM states against time.]
The Viterbi Search (demo-contd.)
[Figure legend: state with best path-score; state with path-score < best; state without a valid path-score.]
$$P_j(t) = \max_i \left[ P_i(t-1)\, a_{ij}\, b_j(t) \right]$$
where
• P_j(t): total path-score ending up at state j at time t
• a_ij: state transition probability, i to j
• b_j(t): score for state j, given the input at time t
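The recurrence P_j(t) = max_i [P_i(t-1) a_ij b_j(t)] can be sketched as a short dynamic program. This is an illustrative version, not the Sphinx III code: it uses plain probabilities rather than log scores, does no pruning, and assumes the initial state holds path-score 1.0 before the first frame is consumed.

```python
def viterbi(inst_score, trans, start=0):
    """Return (best path-score, best state path) for a sequence of frames.

    inst_score[t][j]: score b_j for state j at the t-th frame.
    trans[i][j]: transition probability a_ij.
    The returned path begins with the initial state.
    """
    n = len(trans)
    prev = [0.0] * n
    prev[start] = 1.0              # initial state initialized with path-score 1.0
    backptrs = []
    for b in inst_score:
        cur, bp = [0.0] * n, [0] * n
        for j in range(n):
            # Best predecessor i for state j: maximise P_i(t-1) * a_ij.
            best_i = max(range(n), key=lambda i: prev[i] * trans[i][j])
            cur[j] = prev[best_i] * trans[best_i][j] * b[j]
            bp[j] = best_i
        backptrs.append(bp)
        prev = cur
    # Trace back from the best final state.
    j = max(range(n), key=lambda s: prev[s])
    best = prev[j]
    path = [j]
    for bp in reversed(backptrs):
        j = bp[j]
        path.append(j)
    path.reverse()
    return best, path

# 3-state left-to-right HMM and three frames of instantaneous scores.
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
inst_score = [[0.9, 0.1, 0.1],
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.7]]

best, path = viterbi(inst_score, trans)
print(path)  # [0, 0, 1, 2]
```

Only the best score per state is kept at each time step, so the cost grows with the number of frames times the number of transitions instead of with the number of possible paths.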
Continuous Speech Recognition
Add transitions from word ends to beginnings, and
run the Viterbi Search.
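[Figure: the word HMMs for ONE (W AH N), TWO (T UW) and THREE (TH R IY), with transitions added from the word ends back to the word beginnings.]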
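Adding the word-end to word-beginning transitions amounts to joining the word models into one looped graph. The sketch below is a simplification under stated assumptions: each phone is a single state instead of a 5-state HMM, every state has a fixed self-loop probability, and the next word is chosen uniformly (no language model). The function name `build_loop_graph` is hypothetical.

```python
def build_loop_graph(lexicon, self_loop=0.5):
    """Join word models into one transition matrix with word-end loops.

    lexicon: dict mapping word -> list of phones.
    Returns (states, trans), where states[k] is a (word, phone) label.
    """
    states, first, last = [], {}, {}
    for word, phones in lexicon.items():
        first[word] = len(states)
        states.extend((word, p) for p in phones)
        last[word] = len(states) - 1

    n = len(states)
    trans = [[0.0] * n for _ in range(n)]
    for word in lexicon:
        for s in range(first[word], last[word] + 1):
            trans[s][s] = self_loop
            if s < last[word]:
                trans[s][s + 1] = 1.0 - self_loop   # move on inside the word
        # Word end: transition to the beginning of every word.
        for nxt in lexicon:
            trans[last[word]][first[nxt]] += (1.0 - self_loop) / len(lexicon)
    return states, trans

lexicon = {
    "ONE": ["W", "AH", "N"],
    "TWO": ["T", "UW"],
    "THREE": ["TH", "R", "IY"],
}
states, trans = build_loop_graph(lexicon)
print(len(states))  # 8
```

Running the Viterbi Search over this combined matrix lets the best path pass through any sequence of words.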
Cross-Word Triphone Modeling
Sphinx III uses “triphone” or “phoneme-in-context” HMMs;
remember to inject the left context into the entry state.
[Figure: the word ONE (W AH N).]
• Context-dependent AH HMM.
• The inherited left context is propagated along with the path-scores,
and dynamically modifies the state model.
• Separate N HMM instances for each possible right context.
Sphinx-III - Lexical Tree Structure
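Nodes are shared if their triphone Senone-Sequence-IDs (SSIDs) are identical:
START      S-T-AA-R-TD
STARTING   S-T-AA-R-DX-IX-NG
STARTED    S-T-AA-R-DX-IX-DD
STARTUP    S-T-AA-R-T-AX-PD
START-UP   S-T-AA-R-T-AX-PD
[Figure: lexical tree for the five words above. The common prefix S-T-AA is shared; IX appears separately before NG and before DD; the branches end in the word-specific leaf nodes TD (start), NG (starting), DD (started), PD (startup) and PD (start-up).]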
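The sharing rule can be sketched as a prefix tree in which a node is reused only when its key matches. Real SSIDs come from the trained model definition, so the sketch takes a caller-supplied function `ssid_of(left, phone, right)` as a stand-in; that function, `build_lextree` and `count_nodes` are hypothetical names. Leaf nodes are kept separate per word, as in the list above where STARTUP and START-UP share a pronunciation.

```python
def build_lextree(lexicon, ssid_of):
    """Build a lexical tree; internal nodes are shared when their keys match."""
    root = {"children": {}, "leaves": []}
    for word, phones in lexicon.items():
        node = root
        for i in range(len(phones) - 1):
            left = phones[i - 1] if i > 0 else None
            key = ssid_of(left, phones[i], phones[i + 1])
            node = node["children"].setdefault(key, {"children": {}, "leaves": []})
        node["leaves"].append((phones[-1], word))   # word-specific leaf
    return root

def count_nodes(node):
    """Return (number of internal nodes, number of leaf nodes) below node."""
    internal, leaves = len(node["children"]), len(node["leaves"])
    for child in node["children"].values():
        ci, cl = count_nodes(child)
        internal, leaves = internal + ci, leaves + cl
    return internal, leaves

lexicon = {
    "START":    ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP":  ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}

# Stand-in 1: treat each distinct triphone as its own SSID (no state tying).
by_triphone = count_nodes(build_lextree(lexicon, lambda l, p, r: (l, p, r)))
# Stand-in 2: ignore context entirely (maximal sharing).
by_phone = count_nodes(build_lextree(lexicon, lambda l, p, r: p))
print(by_triphone, by_phone)  # (11, 5) (8, 5)
```

The two stand-ins bracket the real case: actual SSIDs tie some triphones together, so the amount of sharing falls between the two counts.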
Cross-Word Triphones (left context)
[Figure: the same lexical tree, with separate S-models at the root for different left contexts, all leading to the rest of the lextree.]
Root nodes are replicated for each left context.
Nodes are shared if their SSIDs are identical.
Cross-Word Triphones (right context)
[Figure: a leaf node and its triphones for all right contexts. Labels in the figure: picking states; HMM states for triphones; composite SSID model.]
Composite states: the average of the component states.
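A composite state that averages its component states can be sketched as below. Whether the average is taken over log scores or raw likelihoods is not stated here, so averaging the given score values directly is an assumption, as are the function name `composite_scores` and the toy senone IDs.

```python
def composite_scores(senone_score, ssids):
    """Score each state of a composite model as the mean of its components.

    senone_score: dict mapping senone ID -> score for the current frame.
    ssids: one senone-ID sequence per right-context triphone of the leaf;
           all sequences have the same number of states.
    """
    n_states = len(ssids[0])
    return [sum(senone_score[seq[k]] for seq in ssids) / len(ssids)
            for k in range(n_states)]

# Two right-context triphones with three states each (toy values).
senone_score = {10: -2.0, 11: -4.0, 20: -1.0, 21: -3.0, 30: -6.0}
ssids = [[10, 20, 30],
         [11, 21, 30]]

print(composite_scores(senone_score, ssids))  # [-3.0, -2.0, -6.0]
```

The benefit is that one composite model is evaluated at the leaf instead of one HMM per possible right context.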
Sphinx III, the Acoustic Model –
File List Summary
mdef.c – definition of the basic phone and triphone HMMs, and the
mapping of each HMM state to a senone and to its transition
matrix.
dict.c – pronunciation dictionary structure.
hmm.c – implements HMM evaluation using the Viterbi Search,
which means fetching the best senone score. Note that the HMM
data structures, defined in hmm.h, are hardwired to 2 possible
HMM topologies: 3-state and 5-state left-to-right HMMs.
lextree.c – lexical tree search.
Presentation Resources:
• Spoken Language Processing: A Guide to Theory, Algorithm and System
Development, by Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Raj Reddy (hardcover,
980 pages; publisher: Prentice Hall PTR; ISBN: 0130226165; 1st edition, April 25, 2001).
Chapters 9, 13.
• Hwang, M., Huang, X., Alleva, F.: “Predicting Unseen Triphones with Senone”,
1993.
• Hwang et al.: “Shared Distribution Hidden Markov Models for Speech Recognition”,
1993.
• Hwang et al.: “Subphonetic Modeling with Markov States – Senones”, 1992.
• Sphinx-III documentation: a presentation made by Mosur Ravishankar; found in the
/doc/ folder of the sphinx-III package.
• “Sphinx-III bible”: a presentation made by Edward Lin;
http://www.ece.cmu.edu/~ffang/sis/documents/S3Bible.ppt
“I shall never believe that God
plays dice with the world,
but maybe machines should
play dice with human
capabilities…”
John Doe