Automatic Speech Recognition
Julia Hirschberg
CS 6998
What is speech recognition?
Transcribing words?
Understanding meaning?
It’s hard to recognize speech...
People speak in very different ways
Across-speaker variation
Within-speaker variation
Speech sounds vary according to the speech context
Environments vary with respect to noise
The transcription task must handle all of this and produce a transcript of the spoken words
Success: low Word Error Rate (WER) = (S + I + D)/N * 100, where S, I, and D are the substitutions, insertions, and deletions in the hypothesis and N is the number of words in the reference
Example: hypothesis “Thesis test” vs. reference “This is a test”: 1 substitution + 2 deletions over 4 reference words = 75% WER
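A minimal sketch of the WER computation via edit-distance alignment (illustrative code, not from the original slides):

```python
def wer(reference, hypothesis):
    """Word error rate: (S + I + D) / N * 100, via edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("This is a test", "Thesis test"))  # 75.0
```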
Progress:
Very large training corpora
Fast machines and cheap storage
Bake-offs
Market for real-time systems
New representations and algorithms: finite state transducers
Varieties of Speech Recognition
Mode: isolated words → continuous speech
Style: read, prepared, spontaneous
Enrollment: speaker-dependent or speaker-independent
Vocabulary size: <20 → 5K → 60K → ~1M words
Language model: finite state, n-grams, CFGs, CSGs
Perplexity: <10 → >100
SNR: >30 dB (high) → <10 dB (low)
Input device: telephone, microphones
ASR and the Noisy Channel Model
Source --> noisy channel --> Hypothesis
Find the most likely input to have generated the (observed) “noisy” sentence: the most likely sentence W in the language L given the acoustic input O
W’ = argmax_{W ∈ L} P(W | O)
Bayes’ rule: P(x | y) = P(y | x) P(x) / P(y)
So W’ = argmax_{W ∈ L} P(O | W) P(W) / P(O)
P(O) is the same for all hypothesized W, so
W’ = argmax_{W ∈ L} P(O | W) P(W)
P(W) is the prior (the language model); P(O | W) is the (acoustic) likelihood
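A toy sketch of this decision rule in log space; the candidate set and scoring functions are illustrative stand-ins for real acoustic and language models:

```python
import math

def decode(observations, candidates, acoustic_score, language_score):
    """W' = argmax over W of P(O|W) P(W), summed as log-probabilities
    to avoid numerical underflow."""
    return max(
        candidates,
        key=lambda w: math.log(acoustic_score(observations, w))
                      + math.log(language_score(w)),
    )
```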
Simple Isolated Digit Recognition
Train 10 acoustic templates M_i, one per digit
Compare input x with each template
Select the most similar template j according to some comparison function f, minimizing the difference:
j = argmin_i f(x, M_i)
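A minimal sketch of template matching, assuming dynamic time warping as the comparison function f (a common choice, since input and template lengths differ) over precomputed per-frame feature vectors:

```python
import numpy as np

def dtw_distance(x, m):
    """Dynamic time warping distance between two feature sequences
    (one row per frame)."""
    n, k = len(x), len(m)
    d = np.full((n + 1, k + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            cost = np.linalg.norm(x[i - 1] - m[j - 1])   # local frame distance
            d[i, j] = cost + min(d[i - 1, j],            # skip a template frame
                                 d[i, j - 1],            # skip an input frame
                                 d[i - 1, j - 1])        # advance both
    return d[n, k]

def recognize_digit(x, templates):
    """j = argmin_i f(x, M_i): index of the closest template."""
    return min(range(len(templates)), key=lambda i: dtw_distance(x, templates[i]))
```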
Scaling Up: Continuous Speech Recognition
Collect training and test corpora of:
Speech + word transcription
Speech + phonetic transcription (built by hand or using TTS)
Text corpus
Determine a representation for the signal
Build probabilistic models:
Acoustic model: signal to phones
Pronunciation model: phones to words
Language model: words to sentences
Select search procedures to decode new input given these trained models
Representing the Signal
What parameters (features) of the waveform:
Can be extracted automatically?
Will preserve phonetic identity and distinguish one phone from another?
Will be independent of speaker variability and channel conditions?
Will not take up too much space?
…the power spectrum
Speech is captured by a microphone and digitized
The signal is divided into frames
A power spectrum is computed to represent the energy in different frequency bands of the signal
LPC spectrum, cepstra, PLP
Each frame’s spectral features are represented by a small set of numbers
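A minimal sketch of the framing and power-spectrum step in NumPy; the frame and hop sizes are illustrative (roughly 25 ms and 10 ms at a 16 kHz sampling rate), and real front ends derive cepstra, PLP, etc. from this:

```python
import numpy as np

def power_spectrum_frames(signal, frame_len=400, hop=160):
    """Divide a digitized signal into overlapping frames and compute
    each frame's power spectrum."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window  # taper frame edges
        spectrum = np.abs(np.fft.rfft(frame)) ** 2        # energy per frequency band
        frames.append(spectrum)
    return np.array(frames)   # one spectral feature vector per frame
```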
Why it works: different phonemes have different spectral characteristics
Why it doesn’t always work: phonemes can have different properties in different acoustic contexts, when spoken by different people, ...
Acoustic Models
Model the likelihood of a phone given the spectral features and prior context
Usually represented as an HMM:
A set of states representing phones or other subword units
Transition probabilities on states: how likely is it to see one phone after another?
Observation/output likelihoods: how likely is a given spectral feature vector to be observed in state i?
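A toy sketch of these two ingredients; all numbers are made up, and real systems model output likelihoods with Gaussian mixtures or neural networks:

```python
import numpy as np

# Toy HMM over three phone states; the probabilities are illustrative.
states = ["ih", "t", "s"]
# Transition probabilities: P(next state | current state) --
# how likely is it to see one phone after another?
trans = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.1, 0.8],
])

def emission(state_idx, feature_vec):
    """Observation likelihood: how likely is this spectral feature vector
    to be observed in the given state? Stand-in for a real density model
    (e.g., a Gaussian mixture)."""
    means = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    diff = np.asarray(feature_vec) - means[state_idx]
    return float(np.exp(-0.5 * diff @ diff))   # unnormalized Gaussian score
```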
Train an initial model on a small hand-labeled corpus to get estimates of the transition and observation probabilities
Tune the parameters on a large corpus with only word transcriptions
Iterate until no further improvement
Pronunciation Model
Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
Allophones: the flapped /t/ in butter vs. the /t/ in but
The lexicon may be an HMM or a simple dictionary
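A toy dictionary-style lexicon with ARPAbet-like phone symbols (entries illustrative); multiple pronunciations capture allophonic variation such as the flap:

```python
# Toy pronunciation lexicon: word -> list of candidate phone sequences.
lexicon = {
    "but":    [["b", "ah", "t"]],
    "butter": [["b", "ah", "dx", "er"],   # "dx" = flap, as in fluent speech
               ["b", "ah", "t", "er"]],   # careful/citation pronunciation
    "test":   [["t", "eh", "s", "t"]],
}
```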
Language Models
Model the likelihood of a word sequence given candidate word hypotheses
Grammars
Finite state or CFG
N-grams
Trained on a corpus
Smoothing issues
Out-of-vocabulary (OOV) problem
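A minimal sketch of a corpus-trained bigram model with add-one smoothing (a crude fix for unseen word pairs; real systems use better smoothing and still face the OOV problem):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Return a smoothed bigram probability function P(word | prev)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))   # count adjacent word pairs
    vocab_size = len(unigrams)
    def prob(prev, word):
        # Add-one smoothing: unseen pairs get a small nonzero probability
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

p = train_bigram_lm(["this is a test", "this is another test"])
print(p("this", "is"))    # seen pair: relatively high
print(p("test", "this"))  # unseen pair: smoothed, low
```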
Search
Find the best hypothesis given:
A lattice of subword units (AM)
Segmentations of all paths into possible words (PM)
Probabilities of word sequences (LM)
Huge search space, so:
Viterbi decoding
Beam search (sketched below)
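A minimal sketch of Viterbi decoding with beam pruning; trans and emit are illustrative probability functions, with trans(None, s) standing in for the initial state distribution:

```python
import math

def viterbi_beam(observations, states, trans, emit, beam_width=5):
    """Viterbi decoding with beam pruning. Exact Viterbi keeps the best
    path into every state at every frame; the beam keeps only the
    beam_width best-scoring states to tame the search space."""
    beam = {None: (0.0, [])}              # state -> (log score, best path)
    for obs in observations:
        scores = {}
        for prev, (score, path) in beam.items():
            for s in states:
                step = math.log(trans(prev, s)) + math.log(emit(s, obs))
                cand = (score + step, path + [s])
                if s not in scores or cand[0] > scores[s][0]:
                    scores[s] = cand      # best path into state s so far
        # prune: keep only the beam_width highest-scoring states
        best = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(best[:beam_width])
    return max(beam.values(), key=lambda v: v[0])   # (log score, state path)
```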
Challenges for Transcription
Robustness to channel characteristics and noise
Portability to new applications
Adaptation: to speakers, to environments
LMs: simple n-grams need help
Confidence measures
OOV words
New speaking styles/genres
New applications
Challenges for Understanding
Recognizing communicative ‘problems’
ASR errors
User corrections
Disfluencies and self-repairs
An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: ...to Chicago.... Hello?
S: You can say the name of your departure city, such as “New York City.”
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using the AT&T Amtrak train timetable system. See you next time.
U: I nev...
Disfluencies and Self-Repairs
Disfluencies abound in spontaneous speech
One every 4.6 s in radio call-in speech (Blackmer & Mitton ’91)
Hesitation: Ch- change strategy.
Filled pause: Um Baltimore.
Self-repair: Ba- uh Chicago.
Hard to recognize:
Ch- change strategy. --> to D C D C today ten fifteen.
Um Baltimore. --> From Baltimore ten.
Ba- uh Chicago. --> For Boston Chicago.
Possibilities for Understanding
Recognizing speaker emotion
Identifying speech acts: okay
Locating topic boundaries for topic tracking,
audio browsing, speech data mining
Next Week