Automatic Speech Recognition

Introduction
The Human Dialogue System
Computer Dialogue Systems

[Pipeline: signal → Audition → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management → Planning → Natural Language Generation → words → Text-to-speech → signal]
Parameters of ASR Capabilities
• Different types of tasks with different difficulties:
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small: < 20 words / large: > 20,000 words)
– Language model (finite-state / context-sensitive)
– Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model (Shannon)

message → noisy channel → signal

Message + Channel = Signal
Decoding model: find Message* = argmax P(Message | Signal)
But how do we represent each of these things?
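As a toy illustration of the decoding model (all messages, signals, and probabilities below are invented for the example; real recognizers search an enormous hypothesis space rather than enumerating messages):

# Toy noisy-channel decoder. All probabilities are invented for illustration.
P_MESSAGE = {"go home": 0.6, "no foam": 0.4}              # prior P(Message)
P_SIGNAL = {("go home", "g-ow-hh-ow-m"): 0.7,             # channel P(Signal | Message)
            ("no foam", "g-ow-hh-ow-m"): 0.1}

def decode(signal):
    # Message* = argmax_M P(M | Signal) = argmax_M P(Signal | M) * P(M)
    return max(P_MESSAGE, key=lambda m: P_SIGNAL.get((m, signal), 0.0) * P_MESSAGE[m])

print(decode("g-ow-hh-ow-m"))   # -> "go home"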
What are the basic units for acoustic information?

When selecting the basic unit of acoustic information, we want it to be accurate, trainable, and generalizable.
Words are good units for small-vocabulary speech recognition, but a poor choice for large-vocabulary, continuous speech recognition:
• Each word is treated individually, which implies a large amount of training data and storage.
• The recognition vocabulary may contain words that never appeared in the training data.
• It is expensive to model interword coarticulation effects.
Why phones are better units than words: an example

[Figure: recorded sound and spectrogram of "SAY BITE AGAIN" spoken so that the phonemes are separated in time, and spoken normally]
And why phones are still not the perfect choice

Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).
However, a word is not a sequence of independent phonemes! Our articulators move continuously from one position to another. The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects, etc.
Different realizations of a phoneme are called allophones.

[Figure: different spectrograms for "eh"]
Triphone model

Each triphone captures facts about the preceding and following phone:
• Monophone: p, t, k
• Triphone: iy-p+aa
• a-b+c means "phone b, preceded by phone a, followed by phone c"

In practice, systems use on the order of 100,000 triphones, and the triphone model is the one currently used (e.g., in Sphinx).
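A small sketch of the a-b+c naming convention in Python (the function names are illustrative, not from any toolkit):

def make_triphone(left, phone, right):
    # Name a triphone in the a-b+c convention: phone 'phone',
    # preceded by 'left' and followed by 'right'.
    return f"{left}-{phone}+{right}"

def split_triphone(name):
    # Inverse: "iy-p+aa" -> ("iy", "p", "aa")
    left, rest = name.split("-")
    phone, right = rest.split("+")
    return left, phone, right

print(make_triphone("iy", "p", "aa"))   # iy-p+aa
print(split_triphone("iy-p+aa"))        # ('iy', 'p', 'aa')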
Parts of an ASR System

• Feature Calculation: produces acoustic vectors (x_t)
• Acoustic Modeling: maps acoustic vectors to triphones (e.g., k, @)
• Pronunciation Modeling: maps triphones to words (e.g., cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)
• Language Modeling: strings words together (e.g., cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)
Feature calculation

[Figure: spectrogram interpretations of the signal, frequency vs. time]
• Find the energy at each time step in each frequency channel
• Take the Inverse Discrete Fourier Transform to decorrelate the frequencies
Feature calculation

Input: the signal. Output: a sequence of acoustic observation vectors, e.g.
(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …) (0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …) (0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …) (-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …) …
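A condensed sketch of this pipeline in Python/NumPy (the frame length, hop size, and coefficient count are typical values assumed here, and a DCT stands in for the inverse transform of the log spectrum):

import numpy as np
from scipy.fftpack import dct

def features(signal, frame_len=400, hop=160, n_coeffs=13):
    # Slice the waveform into overlapping frames, one per time step.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    vectors = []
    for frame in frames:
        # Energy in each frequency channel: magnitude spectrum of the frame.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len)))
        # Log-compress, then decorrelate the channels (cepstral coefficients).
        cepstrum = dct(np.log(spectrum + 1e-10), norm='ortho')[:n_coeffs]
        vectors.append(cepstrum)
    return np.array(vectors)   # one acoustic vector per frame

# Example: 1 second of random "audio" at 16 kHz -> about 98 13-dim vectors
print(features(np.random.randn(16000)).shape)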
Robust Speech Recognition

• Different schemes have been developed for dealing with noise and reverberation:
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)

Cepstrum: the Fourier transform of the LOGARITHM of the spectrum
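Cepstral mean subtraction itself is a one-liner: a linear channel filter adds a roughly constant offset to every frame's cepstrum, so subtracting the per-coefficient average over time removes it (a minimal sketch; `feats` is assumed to be a frames-by-coefficients array such as the one computed above):

import numpy as np

def cepstral_mean_subtraction(feats):
    # Subtract the mean over time from each cepstral coefficient.
    return feats - feats.mean(axis=0, keepdims=True)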
How do we map from vectors to word sequences?

(-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …) (0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …) (0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …) (-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …) … → ??? → "That you" …
HMM (again)!

[The same sequence of acoustic vectors] → pattern recognition with HMMs → "That you" …
ASR using HMMs

• Try to solve P(Message|Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (triphones)
– Assume that triphones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
Creating HMMs for word sequences: context-independent units

[Figure: the word "Need" modeled as a sequence of triphones]

Hierarchical system of HMMs

[Figure: a higher-level HMM of a word, with the language model above it; each node expands into the HMM of a triphone]

To simplify, let's now ignore the lower-level HMMs: each phone node has a "hidden" HMM (H2MM).
HMMs for ASR

[Figure: "go" (g o) and "home" (h o m) as a Markov-model backbone, aligned against the acoustic observations x0 x1 x2 x3 x4 x5 x6 x7 x8 x9]

The Markov-model backbone is composed of sequences of triphones (hidden because we don't know the correspondences). Each line in the figure represents a probability estimate (more later).
HMMs for ASR

[Figure: the same "go home" backbone, composed of phones, aligned differently against x0 … x9]

Even with the same word hypothesis, we can have different alignments (red arrows). We also have to search over all word hypotheses.
For every HMM (in the hierarchy): compute the maximum-probability sequence

X = acoustic observations, (tri)phones, phone sequences
W = (tri)phones, phone sequences, word sequences

COMPUTE:
argmax_W P(W|X)
= argmax_W P(X|W)·P(W) / P(X)
= argmax_W P(X|W)·P(W)   (P(X) is the same for every W, so it can be dropped)

[Figure: competing phone paths for "that he" and "that you", weighted by the bigram probabilities p(he|that) and p(you|that)]
Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
– All possible (triphone, word, etc.) sequences
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W
– Viterbi search: a dynamic-programming approach that looks for the most likely path (see the sketch below)
– A* search: an alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
• We also need to estimate transition probabilities
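A compact Viterbi sketch over a toy HMM (the states, observations, and probabilities are invented; real decoders work in log space and prune aggressively):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[t][s] = probability of the best path that ends in state s at time t
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda r: delta[t-1][r] * trans_p[r][s])
            delta[t][s] = delta[t-1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back the most likely state sequence.
    best = max(states, key=lambda s: delta[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["x1", "x2", "x2"], ["g", "o"],
              start_p={"g": 0.9, "o": 0.1},
              trans_p={"g": {"g": 0.5, "o": 0.5}, "o": {"g": 0.0, "o": 1.0}},
              emit_p={"g": {"x1": 0.8, "x2": 0.2}, "o": {"x1": 0.1, "x2": 0.9}}))
# -> ['g', 'o', 'o']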
Training: speech corpora

• Have a speech corpus at hand
– It should have word (and preferably phone) transcriptions
– Divide it into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar, lexical trees
• Train acoustic models
– Possibly realigning the corpus phonetically
Acoustic Model

[Figure: a sequence of acoustic observation vectors labeled with the phones dh, a, a, t]

N_a(μ, Σ): the model for P(X | state = a)
• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., neural networks)
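A minimal sketch of this labeling-and-fitting idea with diagonal-covariance Gaussians (the function names are illustrative; `vectors` and `labels` would come from a phonetically aligned corpus):

import numpy as np

def fit_gaussians(vectors, labels):
    # Group all vectors sharing a phone label and fit one Gaussian per phone.
    models = {}
    for phone in set(labels):
        X = np.array([v for v, l in zip(vectors, labels) if l == phone])
        models[phone] = (X.mean(axis=0), X.var(axis=0) + 1e-6)  # mean, diag var
    return models

def log_likelihood(x, model):
    # log P(x | state) under a diagonal Gaussian N(mean, var)
    mean, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)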
Pronunciation model

• The pronunciation model gives the connections between phones and words

[Figure: phone chain dh → a → t; each phone advances with probability p_dh, p_a, p_t and self-loops with probability 1 − p_dh, 1 − p_a, 1 − p_t]

• Multiple pronunciations (tomato):

[Figure: a branching phone network for "tomato" over the phones t, ow, m, ah, ey]
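In its simplest form, the word-to-phone part of the model is just a lexicon with alternative pronunciations (a sketch; the phone strings below are illustrative, not taken from a real dictionary):

# Word -> list of alternative phone sequences (multiple pronunciations).
LEXICON = {
    "tomato": [["t", "ah", "m", "ey", "t", "ow"],
               ["t", "ah", "m", "aa", "t", "ow"]],
    "the":    [["dh", "ah"], ["dh", "iy"]],
}

def pronunciations(word):
    return LEXICON.get(word.lower(), [])

print(pronunciations("tomato"))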
Training models for a sound unit
Language Model

• The language model gives the connections between words (e.g., bigrams: the probability of two-word sequences)

[Figure: phone paths for "that he" and "that you", joined by the bigram probabilities p(he|that) and p(you|that)]
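Bigram probabilities can be estimated by counting adjacent word pairs in a corpus; a minimal maximum-likelihood sketch (the corpus is invented, and real language models add smoothing and back-off):

from collections import Counter

def train_bigrams(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = train_bigrams([["that", "he", "left"],
                       ["that", "you", "left"],
                       ["that", "he", "came"]])
print(probs[("that", "he")])   # p(he|that) = 2/3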
Lexical trees

START      S-T-AA-R-TD
STARTING   S-T-AA-R-DX-IX-NG
STARTED    S-T-AA-R-DX-IX-DD
STARTUP    S-T-AA-R-T-AX-PD
START-UP   S-T-AA-R-T-AX-PD

[Figure: a lexical tree sharing the prefix S → T → AA → R, then branching to TD (start), DX → IX → NG (starting), DX → IX → DD (started), and T → AX → PD (startup, start-up)]
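A lexical tree is essentially a trie over phone sequences, so all words sharing the prefix S-T-AA-R share one path; a minimal sketch (entries copied from the slide above):

def build_lexical_tree(lexicon):
    # Nested dicts keyed by phone; the "" key marks a complete word.
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node[""] = word
    return root

tree = build_lexical_tree({
    "start":    ["S", "T", "AA", "R", "TD"],
    "starting": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "started":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "startup":  ["S", "T", "AA", "R", "T", "AX", "PD"],
})
# All four words share the single prefix path S -> T -> AA -> R.
print(list(tree["S"]["T"]["AA"]["R"]))   # ['TD', 'DX', 'T']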
Judging the quality of a system

• Usually, ASR performance is judged by the word error rate:

ErrorRate = 100 × (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I

100 × (1S + 1I + 1D) / 5 = 60%
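Word error rate is computed from a Levenshtein alignment between the reference and recognized word strings; a standard dynamic-programming sketch:

def wer(ref, rec):
    r, h = ref.split(), rec.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j recognized words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i          # deletions
    for j in range(len(h) + 1): d[0][j] = j          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])   # substitution or match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))   # 60.0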
Judging the quality of a system
• Usually, ASR performance is judged by the
word error rate
• This assumes that all errors are equal
– Also, a bit of a mismatch between optimization
criterion and error measurement
• Other (task specific) measures sometimes
used
– Task completion
– Concept error rate
Sphinx4
http://cmusphinx.sourceforge.net
Sphinx4 Implementation

Frontend
• Feature extractor
• Produces Mel-Frequency Cepstral Coefficient (MFCC) feature vectors

Hidden Markov Models (HMMs)
• Acoustic observations
• Hidden states
• Acoustic observation likelihoods

[Figure: an HMM for the word "Six"]
Sphinx4 Implementation

Linguist
• Constructs the search graph of HMMs from:
– Acoustic model
– Statistical language model ~or~ grammar
– Dictionary
Acoustic Model
• Constructs the HMMs for units of speech
• Produces observation likelihoods
• Sampling rate is critical! WSJ vs. WSJ_8k
• TIDIGITS, RM1, AN4, HUB4
Language Model
• Word likelihoods
• ARPA format example (each line: log10 probability, n-gram, optional back-off weight):

\1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
\2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
\3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
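Since the first column is a base-10 log probability, a stored bigram is recovered as below (a sketch of the lookup only, ignoring the back-off recursion a full ARPA reader needs):

# "-0.7782 as the" stores log10 P(the | as), so P(the | as) = 10 ** -0.7782.
BIGRAMS = {("as", "the"): -0.7782, ("at", "all"): -0.4771, ("at", "the"): -0.7782}

def bigram_prob(w1, w2):
    return 10 ** BIGRAMS[(w1, w2)]

print(round(bigram_prob("as", "the"), 3))   # 0.167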
Grammar (example: command language), in JSGF format:

#JSGF V1.0;
grammar basicCmd;   // standard JSGF header; the grammar name is illustrative

public <basicCmd> = <startPolite> <command> <endPolite>;
public <startPolite> = (please | kindly | could you) *;
public <endPolite> = [ please | thanks | thank you ];
<command> = <action> <object>;
<action> = (open | close | delete | move);
<object> = [the | a] (window | file | menu);
Dictionary
• Maps words to phoneme sequences
• Example from cmudict.0.6d:

POULTICE   P OW L T AH S
POULTICES  P OW L T AH S IH Z
POULTON    P AW L T AH N
POULTRY    P OW L T R IY
POUNCE     P AW N S
POUNCED    P AW N S T
POUNCEY    P AW N S IY
POUNCING   P AW N S IH NG
POUNCY     P UW NG K IY
Sphinx4 Implementation

Search Graph
• Can be statically or dynamically constructed

Decoder
• Maps feature vectors to the search graph
Search Manager
• Searches the graph for the "best fit"
• P(sequence of feature vectors | word/phone), a.k.a. P(O|W): "how likely is the input to have been generated by the word"

Example: different alignments of the phones of "five" (f, ay, v) across ten frames:

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…
Viterbi Search

[Figure: Viterbi trellis over time, with observations O1, O2, O3]
Pruner
• Uses algorithms to weed out low scoring
paths during decoding
Result
• Words!
Word Error Rate
• The most common metric
• Measures the number of modifications needed to transform the recognized sentence into the reference sentence

• Reference: "This is a reference sentence."
• Result: "This is neuroscience."
• Requires 2 deletions and 1 substitution (alignment: C C D S D)

WER = 100 × (deletions + substitutions + insertions) / Length
    = 100 × (2 + 1 + 0) / 5 = 100 × 3 / 5 = 60%
Installation details
• http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4
• Student report on the NLP course web site