Automatic Speech Recognition CS 4705

• Opportunity to participate in a new user study for
Newsblaster and earn $25-$30 for 2.5-3 hours of your
time, respectively.
• http://www1.cs.columbia.edu/~delson/study.html
• More opportunities will be coming….
What is speech recognition?
• Transcribing words?
• Understanding meaning?
• Today:
– Overview of ASR issues
– Building an ASR system
– Using an ASR system
– Future research
“It’s hard to ... recognize speech / wreck a nice beach”
• Speaker variability: within and across speakers
• Recording environment varies with respect to noise
• Transcription task must handle all of this and
produce a transcript of what was said, from
limited, noisy information in the speech signal
– Success: low word error rate (WER)
• WER = (S+I+D)/N * 100: S substitutions, I insertions,
D deletions, against N reference words
– E.g., “This is a test” recognized as “Thesis test”:
S=1, D=2, I=0, N=4, so WER = 75%
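A minimal sketch of WER as normalized edit distance (Python; the function and variable names are illustrative, not from any standard toolkit):

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance over reference length, as a percent."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = min edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution or match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(r)][len(h)] / len(r) * 100

print(wer("This is a test", "Thesis test"))  # 75.0
```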
• Understanding task must do more: from words to
meaning
– Measure concept accuracy (CA): how accurately the
domain concepts mentioned in the string, and their
values, are recognized
I want to go from Boston to Baltimore on September 29
– Domain concepts and their values:
– source city: Boston
– target city: Baltimore
– travel date: September 29
– Score recognized string “Go from Boston to
Washington on December 29”: only the source city is
correct, so 1/3 = 33% CA
– “Go to Boston from Baltimore on September 29”: also
33% CA, since the cities are swapped, even though the
content words are all present
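A minimal sketch of CA scoring, assuming the domain concepts are represented as slot-value pairs (the slots come from the example above; the function itself is illustrative):

```python
def concept_accuracy(reference, recognized):
    """Percent of reference concepts whose values were recognized correctly."""
    correct = sum(1 for slot, value in reference.items()
                  if recognized.get(slot) == value)
    return correct / len(reference) * 100

reference = {"source city": "Boston",
             "target city": "Baltimore",
             "travel date": "September 29"}
recognized = {"source city": "Boston",       # correct
              "target city": "Washington",   # wrong
              "travel date": "December 29"}  # wrong
print(concept_accuracy(reference, recognized))  # 33.3...
```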
Again, the Noisy Channel Model
Source --> Noisy Channel --> Decoder
– Input to channel: spoken sentence s
– Output from channel: an observation O
– Decoding task: find s’ = argmax_{s ∈ V} P(s|O)
– Using Bayes Rule: s’ = argmax_{s ∈ V} P(O|s) P(s) / P(O)
– And since P(O) doesn’t change for any hypothetical s’:
s’ = argmax_{s ∈ V} P(O|s) P(s)
– P(O|s) is the observation likelihood, or Acoustic Model,
and P(s) is the prior, or Language Model
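A toy illustration of this decoding rule: pick the sentence maximizing log P(O|s) + log P(s). All probabilities below are made up, purely to show the acoustic model and language model trading off:

```python
import math

# Made-up scores for two competing hypotheses for the same observation O
acoustic = {"recognize speech": 0.0020,    # P(O|s), acoustic model
            "wreck a nice beach": 0.0025}
prior = {"recognize speech": 0.00010,      # P(s), language model
         "wreck a nice beach": 0.00001}

# s' = argmax_{s in V} P(O|s) P(s), computed in log space
best = max(acoustic, key=lambda s: math.log(acoustic[s]) + math.log(prior[s]))
print(best)  # "recognize speech": the LM prior outweighs the acoustic edge
```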
What do we need to build and use an ASR system?
• Corpora for training and testing of components
• Feature extraction component
• Pronunciation Model
• Acoustic Model
• Language Model
• Algorithms to search hypothesis space efficiently
Training and Test Corpora
• Collect corpora appropriate for recognition task at
hand
– Small speech + phonetic transcription to associate
sounds with symbols (Acoustic Model)
– Large (>= 60 hrs) speech + orthographic transcription
to associate words with sounds (Acoustic Model)
– Very large text corpus to identify unigram and bigram
probabilities (Language Model)
Representing the Signal
• What parameters (features) of the speech input
– Can be extracted automatically
– Will preserve phonetic identity and distinguish one
phone from another
– Will be independent of speaker variability and channel
conditions
– Will not take up too much space
• Speech representations (for [ae] in had):
– Waveform: change in sound pressure over time
– LPC Spectrum: component frequencies of a waveform
– Spectrogram: overall view of how frequencies change
from phone to phone
• Speech captured by microphone and sampled
(digitized) -- may not capture all vital information
• Signal divided into frames
• Power spectrum computed to represent energy in
different bands of the signal
– LPC spectrum, Cepstra, PLP
– Each frame’s spectral features represented by small set
of numbers
• Frames clustered into ‘phone-like’ groups (phones
in context) -- Gaussian or other models
• Why does this work?
– Different phonemes have different spectral
characteristics
• Why doesn’t it always work?
– Phonemes can have different properties in different
acoustic contexts, spoken by different people …
– Nice white rice
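A minimal sketch of the framing and spectral steps above, assuming 16 kHz audio in a NumPy array; the frame and shift sizes are typical but illustrative choices:

```python
import numpy as np

sample_rate = 16000
signal = np.random.randn(sample_rate)      # stand-in for 1 s of speech samples
frame_len = int(0.025 * sample_rate)       # 25 ms frames
hop = int(0.010 * sample_rate)             # 10 ms frame shift

spectra = []
for start in range(0, len(signal) - frame_len + 1, hop):
    frame = signal[start:start + frame_len]
    windowed = frame * np.hamming(frame_len)        # taper frame edges
    power = np.abs(np.fft.rfft(windowed)) ** 2      # energy per frequency band
    spectra.append(power)
# Each entry of `spectra` is one frame's power spectrum; a real front end
# would compress it further into a small set of cepstral or PLP features.
```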
Pronunciation Model
• Models likelihood of word given network of
candidate phone hypotheses (weighted phone
lattice)
• Allophones: the /t/ in butter (a flap) vs. but
• Multiple pronunciations for each word
• Lexicon may be weighted automaton or simple
dictionary
• Words come from all corpora; pronunciations
from pronouncing dictionary or TTS system
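A toy weighted lexicon along these lines, with ARPAbet-style pronunciations; the entries and weights are illustrative, not taken from a real pronouncing dictionary:

```python
# Each word maps to (phone string, pronunciation probability) pairs
lexicon = {
    "butter": [("b ah dx er", 0.9),    # flapped /t/ allophone
               ("b ah t er", 0.1)],
    "but":    [("b ah t", 1.0)],
    "to":     [("t uw", 0.7),
               ("t ax", 0.3)],         # reduced vowel in fast speech
}

def pronunciation_prob(word, phones):
    """Probability that `word` is realized as the given phone string."""
    return dict(lexicon.get(word, [])).get(phones, 0.0)

print(pronunciation_prob("butter", "b ah dx er"))  # 0.9
```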
Acoustic Models
• Model likelihood of phones or subphones given
spectral features and prior context
• Use pronunciation models
• Usually represented as HMM
– Set of states representing phones or other subword units
– Transition probabilities on states: how likely is it to see
one phone after seeing another?
– Observation/output likelihoods: how likely is a given
spectral feature vector to be observed in phone state i?
• Initial estimates for:
– Transition probabilities between phone states
– Observation probabilities associating phone states
with acoustic examples
• Re-estimate both probabilities by feeding the
HMM the transcribed speech training corpus
(forced alignment)
• I.e., we tell the HMM the ‘right’ answers -- which
words to associate with which sequences of sounds
• Iteratively retrain the transition and observation
probabilities by running the training data through
the model and scoring output until no
improvement
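A toy picture of the parameters being (re-)estimated, for one word's phone states; all numbers are made up, and training would replace them with estimates from forced alignments of the transcribed corpus:

```python
states = ["b", "ah", "t"]                  # subword (phone) states
# Transition probabilities: P(next state | current state)
trans = {"b":  {"b": 0.2, "ah": 0.8},
         "ah": {"ah": 0.3, "t": 0.7},
         "t":  {"t": 1.0}}
# Observation likelihoods: P(acoustic label | phone state),
# here over a toy discrete set of spectral cluster labels c1..c3
emit = {"b":  {"c1": 0.7, "c2": 0.2, "c3": 0.1},
        "ah": {"c1": 0.1, "c2": 0.8, "c3": 0.1},
        "t":  {"c1": 0.2, "c2": 0.1, "c3": 0.7}}
```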
Language Model
• Models likelihood of a word given the prior word(s),
and of the entire sentence
• Ngram models:
– Build the LM by calculating bigram or trigram
probabilities from text training corpus
– Smoothing issues very important for real systems
• Grammars
– Finite state grammar or Context Free Grammar (CFG)
or semantic grammar
• Out of Vocabulary (OOV) problem
• Entropy H(X): the amount of information in a LM or
grammar
– How many bits will it take on average to encode a
choice or a piece of information?
– More likely things will take fewer bits to encode
• Perplexity 2^H: a measure of the weighted mean
number of choice points in, e.g., a language model
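A minimal bigram LM with add-one smoothing, plus perplexity computed as 2^H; the toy corpus is purely illustrative:

```python
import math
from collections import Counter

corpus = "go from boston to chicago . go to baltimore from boston .".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # Add-one (Laplace) smoothing: unseen bigrams still get some mass
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def perplexity(words):
    # 2**H, where H is the average negative log2 bigram probability
    h = -sum(math.log2(p(w, prev)) for prev, w in zip(words, words[1:]))
    return 2 ** (h / (len(words) - 1))

print(perplexity("go from boston to chicago .".split()))
```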
Search/Decoding
• Find the best hypothesis P(O|s) P(s) given
– Lattice of subword units (Acoustic Model)
– Segmentation of all paths into possible words
(Pronunciation Model)
– Probabilities of word sequences (Language Model)
• Produces a huge search space: How to reduce?
– Lattice minimization and determinization
– Forward algorithm: sum of all paths leading to a state
– Viterbi algorithm: max of all paths leading to a state
– Forward-backward (Baum-Welch, Expectation-Maximization)
algorithm: computes probability of a sequence at any
state in the search space
– Beam search: prune the lattice
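A minimal Viterbi sketch over a toy HMM, taking the max over paths into each state (the forward algorithm would sum instead); the model numbers are made up:

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for the observation sequence `obs`."""
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]   # time 0
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # Best predecessor on the best path ending in s at time t
            prev = max(states, key=lambda q: v[t - 1][q] * trans[q][s])
            v[t][s] = v[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["b", "ah", "t"]
start = {"b": 0.8, "ah": 0.1, "t": 0.1}
trans = {"b":  {"b": 0.2, "ah": 0.7, "t": 0.1},
         "ah": {"b": 0.1, "ah": 0.3, "t": 0.6},
         "t":  {"b": 0.1, "ah": 0.1, "t": 0.8}}
emit = {"b":  {"c1": 0.7, "c2": 0.2, "c3": 0.1},
        "ah": {"c1": 0.1, "c2": 0.8, "c3": 0.1},
        "t":  {"c1": 0.2, "c2": 0.1, "c3": 0.7}}
print(viterbi(["c1", "c2", "c3"], states, start, trans, emit))  # ['b', 'ah', 't']
```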
Varieties of Speech Recognition
Mode: isolated words --> continuous
Style: read, prepared, spontaneous
Enrollment: speaker-dependent or speaker-independent
Vocabulary size: <20 --> 5K --> 60K --> ~1M
Language Model: finite state, ngrams, CFGs, CSGs
Perplexity: <10 --> >100
SNR: >30 dB (high) --> <10 dB (low)
Input device: telephone, microphones
Challenges for Transcription
• Robustness to channel characteristics and noise
• Portability to new applications
• Adaptation: to speakers, to environments
• LMs: simple ngrams need help
• Confidence measures
• OOV words
• New speaking styles/genres
• New applications
Challenges for Understanding
• Recognizing communicative ‘problems’
– ASR errors
– User corrections
– Disfluencies and self-repairs
• Possibilities:
– Recognizing speaker emotion
– Identifying speech acts: e.g., the many uses of okay
– Locating topic boundaries for topic tracking, audio
browsing, speech data mining
An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is
TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from
Baltimore.
S: Sorry, I can't understand you. Please repeat your
utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the
morning.” Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as “New
York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New
York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore...
S: ...I heard you s...
U: ...to Chicago.... Hello?
S: You can say the name of your departure city, such
as "New York City."
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of
trains?
U: Yes.
S: You can say "yes" or "I am done here" to find
a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time
table system.
See you next time.
U: I nev-
Summary
• ASR technology relies upon a large number of
phenomena and techniques we’ve already seen to
convert sound into words
– Phonetic/phonological, morphological, and lexical
events
– FSAs, Ngrams, dynamic programming algorithms
• Better modeling of linguistic phenomena will be
needed to improve performance on transcription
and especially on understanding
• For next class: we’ll start talking about larger
structures in language above the word (Ch 8)
Disfluencies and Self-Repairs
• Disfluencies abound in spontaneous speech
– every 4.6s in radio call-in (Blackmer & Mitton ‘91)
hesitation: Ch- change strategy.
filled pause: Um Baltimore.
self-repair: Ba- uh Chicago.
• Hard to recognize (recognized output shown after the arrow)
Ch- change strategy. --> to D C D C today ten fifteen.
Um Baltimore. --> From Baltimore ten.
Ba- uh Chicago. --> For Boston Chicago.