Lecture 7
Automatic Speech Recognition
CS 4705

What is speech recognition?
• Transcribing words?
• Understanding meaning?
• Today:
  – Overview of ASR issues
  – Building an ASR system
  – Using an ASR system
  – Future research

“It’s hard to ... recognize speech/wreck a nice beach”
• Speaker variability: within and across speakers
• Recording environment varies wrt noise
• Transcription task must handle all of this and produce a transcript of what was said, from limited, noisy information in the speech signal
  – Success: low word error rate (WER)
    • WER = (S+I+D)/N * 100, where S, I, and D count substituted, inserted, and deleted words and N is the number of words in the reference transcript
    • “Thesis test” vs. “This is a test”: 1 substitution + 2 deletions over 4 reference words = 75% WER
• Understanding task must do more: from words to meaning
  – Measure concept accuracy (CA) of string in terms of accuracy of recognition of domain concepts mentioned in string and their values
    I want to go from Boston to Baltimore on September 29
  – Domain concepts    Values
  – source city        Boston
  – target city        Baltimore
  – travel date        September 29
  – Score recognized string “Go from Boston to Washington on December 29” (1/3 = 33% CA)
  – “Go to Boston from Baltimore on September 29”

Again, the Noisy Channel Model
Source --> Noisy Channel --> Decoder
• Input to channel: spoken sentence s
  – Output from channel: an observation O
  – Decoding task: find $s' = \arg\max_{s \in V} P(s \mid O)$
  – Using Bayes Rule: $s' = \arg\max_{s \in V} \frac{P(O \mid s)\, P(s)}{P(O)}$
  – And since P(O) is the same for every hypothesized s: $s' = \arg\max_{s \in V} P(O \mid s)\, P(s)$
  – P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model

What do we need to build/use an ASR system?
• Corpora for training and testing of components
• Feature extraction component
• Pronunciation Model
• Acoustic Model
• Language Model
• Algorithms to search hypothesis space efficiently

Training and Test Corpora
• Collect corpora appropriate for recognition task at hand
  – Small speech + phonetic transcription to associate sounds with symbols (Acoustic Model)
  – Large (>= 60 hrs) speech + orthographic transcription to associate words with sounds (Acoustic Model)
  – Very large text corpus to identify unigram and bigram probabilities (Language Model)

Representing the Signal
• What parameters (features) of the speech input
  – Can be extracted automatically
  – Will preserve phonetic identity and distinguish it from other phones
  – Will be independent of speaker variability and channel conditions
  – Will not take up too much space
• Speech representations (for [ae] in had):
  – Waveform: change in sound pressure over time
  – LPC Spectrum: component frequencies of a waveform
  – Spectrogram: overall view of how frequencies change from phone to phone
• Speech captured by microphone and sampled (digitized) -- may not capture all vital information
• Signal divided into frames
• Power spectrum computed to represent energy in different bands of the signal
  – LPC spectrum, Cepstra, PLP
  – Each frame’s spectral features represented by small set of numbers
• Frames clustered into ‘phone-like’ groups (phones in context) -- Gaussian or other models
• Why this works?
  – Different phonemes have different spectral characteristics
• Why it doesn’t work?
  – Phonemes can have different properties in different acoustic contexts, spoken by different people …
  – Nice white rice

Pronunciation Model
• Models likelihood of word given network of candidate phone hypotheses (weighted phone lattice)
• Allophones: butter vs.
but
• Multiple pronunciations for each word
• Lexicon may be weighted automaton or simple dictionary
• Words come from all corpora; pronunciations from pronouncing dictionary or TTS system

Acoustic Models
• Model likelihood of phones or subphones given spectral features and prior context
• Use pronunciation models
• Usually represented as HMM
  – Set of states representing phones or other subword units
  – Transition probabilities on states: how likely is it to see one phone after seeing another?
  – Observation/output likelihoods: how likely is spectral feature vector to be observed from phone state i, given phone state i-1?
• Initial estimates for
  – Transition probabilities between phone states
  – Observation probabilities associating phone states with acoustic examples
• Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
• I.e., we tell the HMM the ‘right’ answers -- which words to associate with which sequences of sounds
• Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring output until no improvement

Language Model
• Models likelihood of word given prior word and of entire sentence
• Ngram models:
  – Build the LM by calculating bigram or trigram probabilities from text training corpus
  – Smoothing issues very important for real systems
• Grammars
  – Finite state grammar or Context Free Grammar (CFG) or semantic grammar
• Out of Vocabulary (OOV) problem
• Entropy H(X): the amount of information in a LM, grammar
  – How many bits will it take on average to encode a choice or a piece of information?
  – More likely things will take fewer bits to encode
• Perplexity $2^H$: a measure of the weighted mean number of choice points in e.g.
a language model

Search/Decoding
• Find the hypothesis s that maximizes P(O|s) P(s), given
  – Lattice of subword units (Acoustic Model)
  – Segmentation of all paths into possible words (Pronunciation Model)
  – Probabilities of word sequences (Language Model)
• Produces a huge search space: How to reduce?
  – Lattice minimization and determinization
  – Forward algorithm: sum of all paths leading to a state
  – Viterbi algorithm: max of all paths leading to a state
  – Forward-backward (Baum-Welch, Expectation-Maximization) algorithm: computes probability of sequence at any state in search space
  – Beam search: prune the lattice

Varieties of Speech Recognition
  Mode              Isolated words --> continuous
  Style             Read, prepared, spontaneous
  Enrollment        Speaker-dependent or independent
  Vocabulary size   <20 --> 5K --> 60K --> ~1M
  Language Model    Finite state, ngrams, CFGs, CSGs
  Perplexity        <10 --> >100
  SNR               >30dB (high) --> <10dB (low)
  Input device      Telephone, microphones

Challenges for Transcription
• Robustness to channel characteristics and noise
• Portability to new applications
• Adaptation: to speakers, to environments
• LMs: simple ngrams need help
• Confidence measures
• OOV words
• New speaking styles/genres
• New applications

Challenges for Understanding
• Recognizing communicative ‘problems’
  – ASR errors
  – User corrections
  – Disfluencies and self-repairs
• Possibilities:
  – Recognizing speaker emotion
  – Identifying speech acts: okay
  – Locating topic boundaries for topic tracking, audio browsing, speech data mining

An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago.
S: You can say the arrival city name, such as “New York City.”
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ...
S: ... I heard you s...
U: to Chicago.... Hello?
S: You can say the name of your departure city, such as “New York City.”
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say “yes” or “I am done here” to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-

Summary
• ASR technology relies upon a large number of phenomena and techniques we’ve already seen to convert sound into words
  – Phonetic/phonological, morphological, and lexical events
  – FSA’s, Ngrams, Dynamic programming algorithms
• Better modeling of linguistic phenomena will be needed to improve performance on transcription and especially on understanding
• For next class: we’ll start talking about larger structures in language above the word (Ch 8)

Disfluencies and Self-Repairs
• Disfluencies abound in spontaneous speech
  – every 4.6s in radio call-in (Blackmer & Mitton ‘91)
    hesitation: Ch- change strategy.
    filled pause: Um Baltimore.
    self-repair: Ba- uh Chicago.
• Hard to recognize
    Ch- change strategy. --> to D C D C today ten fifteen.
    Um Baltimore. --> From Baltimore ten.
    Ba- uh Chicago. --> For Boston Chicago.
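
Worked example: word error rate
The WER formula from the transcription slide, (S+I+D)/N * 100, can be computed with a word-level edit distance. The sketch below is one minimal way to do that, not a standard scoring tool. The function name word_error_rate is hypothetical, and it assumes whitespace tokenization, case-insensitive matching, and punctuation already stripped.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (S + I + D) / N * 100, N = reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of word edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)

    return 100.0 * dist[-1][-1] / len(ref)


# The slide's example: one substitution and two deletions over four words.
print(word_error_rate("This is a test", "Thesis test"))  # 75.0
```

Because N counts reference words only, a hypothesis with many insertions can score above 100% WER.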
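
Worked example: Forward vs. Viterbi
The Search/Decoding slide contrasts the Forward algorithm (sum over all paths into a state) with the Viterbi algorithm (max over all paths into a state). The sketch below runs both on a toy discrete HMM. Everything in it is an illustrative assumption: the two phone states, the observation symbols "x" and "y" (standing in for quantized spectral frames), and all probabilities are made up, and a real recognizer would work in log space over continuous features.

```python
from typing import Dict, List, Tuple

Table = Dict[str, Dict[str, float]]


def forward(obs: List[str], states: List[str], start: Dict[str, float],
            trans: Table, emit: Table) -> float:
    """Total probability of obs: SUM over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())


def viterbi(obs: List[str], states: List[str], start: Dict[str, float],
            trans: Table, emit: Table) -> Tuple[float, List[str]]:
    """Probability and state sequence of the single best path: MAX over paths."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor of state s.
            prob, path = max(((best[p][0] * trans[p][s], best[p][1])
                              for p in states), key=lambda c: c[0])
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Hypothetical toy model: two phone states, two observation symbols.
states = ["ae", "d"]
start = {"ae": 0.8, "d": 0.2}
trans = {"ae": {"ae": 0.6, "d": 0.4}, "d": {"ae": 0.1, "d": 0.9}}
emit = {"ae": {"x": 0.7, "y": 0.3}, "d": {"x": 0.2, "y": 0.8}}
obs = ["x", "x", "y"]

total = forward(obs, states, start, trans, emit)
best_prob, best_path = viterbi(obs, states, start, trans, emit)
print(best_path)          # ['ae', 'ae', 'd']
print(best_prob <= total)  # True: the best path is one term of the sum
```

The two functions differ only in replacing sum with max (plus the bookkeeping needed to recover the path), which is why the best-path probability can never exceed the Forward total.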
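
Worked example: bigram language model and perplexity
The Language Model slide builds an Ngram LM from bigram counts, notes that smoothing matters, and defines perplexity as $2^H$. The sketch below ties these together under loud assumptions: the three-sentence training corpus is invented, add-one (Laplace) smoothing is used only because it is the simplest choice, and H is estimated as the average negative log2 probability per predicted token. The helper names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def train_bigram(sentences: List[str]) -> Tuple[Counter, Counter, Set[str]]:
    """Count bigrams and their left contexts, with <s> and </s> markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}  # words the model can predict
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(sent.split())
    return contexts, bigrams, vocab


def bigram_prob(prev: str, word: str, contexts: Counter,
                bigrams: Counter, vocab: Set[str]) -> float:
    """Add-one smoothed P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(sentence: str, contexts: Counter,
               bigrams: Counter, vocab: Set[str]) -> float:
    """2 ** H, with H the average negative log2 probability per token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    log2_sum = sum(math.log2(bigram_prob(p, w, contexts, bigrams, vocab))
                   for p, w in pairs)
    entropy = -log2_sum / len(pairs)
    return 2 ** entropy


# Hypothetical training text in the travel domain used on the slides.
corpus = [
    "i want to go to boston",
    "i want to go to baltimore",
    "go from boston to baltimore",
]
contexts, bigrams, vocab = train_bigram(corpus)

seen = perplexity("i want to go to boston", contexts, bigrams, vocab)
scrambled = perplexity("boston to want go i to", contexts, bigrams, vocab)
print(seen < scrambled)  # True: familiar word order has fewer "choice points"
```

This sketch does not handle the OOV problem: a test word outside the training vocabulary still receives a nonzero score, but the distribution is then no longer properly normalized.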