Automatic Speech Recognition
Julia Hirschberg, CS 6998, 7/15/2016

What is speech recognition?
- Transcribing words?
- Understanding meaning?

It's hard to recognize speech...
- People speak in very different ways
  - Across-speaker variation
  - Within-speaker variation
- Speech sounds vary according to the speech context
- Environments vary with respect to noise
- A transcription system must handle all of this and produce a transcript of the spoken words

Success: Low Word Error Rate (WER)
- WER = (S + I + D) / N * 100, where S, I, and D are the numbers of substituted, inserted, and deleted words, and N is the number of words in the reference
- Example: "Thesis test" for the reference "This is a test" is 1 substitution plus 2 deletions over 4 reference words, i.e. 75% WER (see the sketch below)

Progress
- Very large training corpora
- Fast machines and cheap storage
- Bake-offs
- A market for real-time systems
- New representations and algorithms: finite-state transducers

Varieties of Speech Recognition
- Mode: isolated words --> continuous speech
- Style: read, prepared, spontaneous
- Enrollment: speaker-dependent or speaker-independent
- Vocabulary size: <20 --> 5K --> 60K --> ~1M
- Language model: finite-state, n-grams, CFGs, CSGs
- Perplexity: <10 --> >100
- SNR: >30 dB (high) --> <10 dB (low)
- Input device: telephone, microphones

ASR and the Noisy Channel Model
- Source --> noisy channel --> hypothesis
- Find the most likely input to have generated the observed "noisy" sentence, i.e. the most likely sentence W in the language L given the acoustic input O:
  W' = argmax_{W ∈ L} P(W | O)
- By Bayes' rule, P(x | y) = P(y | x) P(x) / P(y), so
  W' = argmax_{W ∈ L} P(O | W) P(W) / P(O)
- P(O) is the same for every hypothesized W, so
  W' = argmax_{W ∈ L} P(O | W) P(W)
- P(W) is the prior (the language model); P(O | W) is the (acoustic) likelihood

Simple Isolated Digit Recognition
- Train 10 acoustic templates M_i, one per digit
- Compare the input x with each template
- Select the most similar template j according to some comparison function f, minimizing the difference: j = argmin_i f(x, M_i)

Scaling Up: Continuous Speech Recognition
- Collect training and test corpora:
  - Speech + word transcription
  - Speech + phonetic transcription (built by hand or using TTS)
  - A text corpus
- Determine a representation for the signal
- Build probabilistic models:
  - Acoustic model: signal to phones
  - Pronunciation model: phones to words
  - Language model: words to sentences
- Select search procedures to decode new input given these trained models

Representing the Signal
- What parameters (features) of the waveform:
  - Can be extracted automatically?
  - Will preserve phonetic identity and distinguish one phone from another?
  - Will be independent of speaker variability and channel conditions?
  - Will not take up too much space?
- ...the power spectrum
- Speech is captured by a microphone and digitized
- The signal is divided into frames
- A power spectrum is computed to represent the energy in different frequency bands of the signal: LPC spectrum, cepstra, PLP (see the sketch below)
- Each frame's spectral features are represented by a small set of numbers

Why it works / why it doesn't
- Why it works: different phonemes have different spectral characteristics
- Why it doesn't: phonemes can have different properties in different acoustic contexts, when spoken by different people, ...

Acoustic Models
- Model the likelihood of a phone given the spectral features and prior context
- Usually represented as an HMM (see the toy sketch below):
  - A set of states representing phones or other subword units
  - Transition probabilities on states: how likely is it to see one phone after another?
  - Observation/output likelihoods: how likely is a spectral feature vector to be observed from state i, given state i-1?
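To make the WER formula from the "Success" slide concrete, here is a minimal Python sketch, assuming the standard minimum-edit-distance alignment between hypothesis and reference words; the function name and example strings echo the slide, but the implementation itself is illustrative, not part of the lecture.

```python
def wer(ref, hyp):
    """Word error rate: (S + I + D) / N * 100, from the minimum-edit-distance
    alignment of hypothesis words to reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first j hyp words into the first i ref words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,                # substitution (or match)
                          d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1)    # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

# The slide's example: 1 substitution + 2 deletions over 4 reference words.
print(wer("this is a test", "thesis test"))  # 75.0
```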
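A minimal sketch of the framing and power-spectrum step just described, assuming NumPy. The 25 ms frames, 10 ms hop, Hamming window, and the synthetic tone are conventional illustrative choices, not values from the slides; a real front end would go on to derive LPC, cepstral, or PLP features from these spectra.

```python
import numpy as np

def power_spectra(signal, sr, frame_ms=25, hop_ms=10):
    """Split a digitized signal into overlapping windowed frames and return
    each frame's power spectrum (energy per frequency band)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hamming(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    # |FFT|^2 of each windowed frame = energy in each frequency band
    return np.array([np.abs(np.fft.rfft(f)) ** 2 for f in frames])

# One second of a synthetic 440 Hz tone at 16 kHz, just for illustration.
sr = 16000
t = np.arange(sr) / sr
spec = power_spectra(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (number of frames, frame_length // 2 + 1)
```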
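A toy sketch of the HMM machinery just described, using Viterbi decoding (named on the Search slide later) to recover the best state path. All states, probabilities, and the discretized "lo"/"hi" observations are invented for illustration; a real acoustic model scores continuous spectral feature vectors, typically with Gaussian mixtures.

```python
import math

# Toy HMM over two phone states; every number here is invented for illustration.
states = ["ih", "t"]
log_init  = {"ih": math.log(0.6), "t": math.log(0.4)}
log_trans = {("ih", "ih"): math.log(0.7), ("ih", "t"): math.log(0.3),
             ("t", "ih"):  math.log(0.4), ("t", "t"):  math.log(0.6)}
# Observation likelihoods over discretized frames ("lo"/"hi" energy); a real
# system would score continuous cepstral vectors instead.
log_emit = {("ih", "lo"): math.log(0.8), ("ih", "hi"): math.log(0.2),
            ("t", "lo"):  math.log(0.3), ("t", "hi"):  math.log(0.7)}

def viterbi(obs):
    """Return the most likely state sequence for the observation sequence."""
    v = [{s: log_init[s] + log_emit[s, obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this frame
            prev = max(states, key=lambda p: v[-1][p] + log_trans[p, s])
            col[s] = v[-1][prev] + log_trans[prev, s] + log_emit[s, o]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi"]))  # ['ih', 'ih', 't']
```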
Acoustic Models (cont'd)
- Train an initial model on a small hand-labeled corpus to get estimates of the transition and observation probabilities
- Tune the parameters on a large corpus with only a word transcription
- Iterate until no further improvement

Pronunciation Model
- Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
- Allophones: "butter" vs. "but"
- The lexicon may be an HMM or a simple dictionary

Language Models
- Model the likelihood of a word sequence given candidate word hypotheses
- Grammars: finite-state or CFG
- N-grams: trained on a corpus; smoothing issues; the out-of-vocabulary (OOV) problem (see the sketch at the end)

Search
- Find the best hypothesis given:
  - A lattice of subword units (acoustic model)
  - Segmentations of all paths into possible words (pronunciation model)
  - Probabilities of word sequences (language model)
- Huge search space
- Viterbi decoding, beam search

Challenges for Transcription
- Robustness to channel characteristics and noise
- Portability to new applications
- Adaptation: to speakers, to environments
- LMs: simple n-grams need help
- Confidence measures
- OOV words
- New speaking styles/genres
- New applications

Challenges for Understanding
- Recognizing communicative 'problems':
  - ASR errors
  - User corrections
  - Disfluencies and self-repairs

An Unsuccessful Dialogue
S: Hi, this is the AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore...
S: ...I heard you s...
U: ...to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using the AT&T Amtrak train timetable system. See you next time.
U: I nev-

Disfluencies and Self-Repairs
- Disfluencies abound in spontaneous speech: one every 4.6 s in radio call-in (Blackmer & Mitton '91)
  - Hesitation: "Ch- change strategy."
  - Filled pause: "Um Baltimore."
  - Self-repair: "Ba- uh Chicago."
- They are hard to recognize:
  - "Ch- change strategy." --> recognized as "to D C D C today ten fifteen."
  - "Um Baltimore." --> recognized as "From Baltimore ten."
  - "Ba- uh Chicago." --> recognized as "For Boston Chicago."

Possibilities for Understanding
- Recognizing speaker emotion
- Identifying speech acts (e.g., the many uses of "okay")
- Locating topic boundaries for topic tracking, audio browsing, speech data mining

Next Week
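Finally, a companion sketch for the Language Models slide: a corpus-trained bigram model with add-one smoothing and an <unk> token standing in for OOV words. The tiny corpus and all resulting probabilities are invented for illustration; real systems train on far larger corpora and use more sophisticated smoothing (e.g. Katz or Kneser-Ney).

```python
import math
from collections import Counter

# A tiny invented training corpus; real LMs train on millions of words.
corpus = [["<s>", "i", "want", "a", "train", "</s>"],
          ["<s>", "i", "want", "a", "ticket", "</s>"],
          ["<s>", "book", "a", "train", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab = set(unigrams) | {"<unk>"}   # <unk> stands in for OOV words
V = len(vocab)

def logprob(w, prev):
    """Add-one-smoothed log P(w | prev); OOV words map to <unk>."""
    w = w if w in vocab else "<unk>"
    prev = prev if prev in vocab else "<unk>"
    return math.log((bigrams[prev, w] + 1) / (unigrams[prev] + V))

def sentence_logprob(words):
    """Log probability of a word sequence under the bigram model."""
    sent = ["<s>"] + words + ["</s>"]
    return sum(logprob(w, p) for p, w in zip(sent, sent[1:]))

print(sentence_logprob(["i", "want", "a", "train"]))   # seen bigrams: higher
print(sentence_logprob(["i", "want", "a", "flight"]))  # OOV word: lower
```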