Components of Spoken Dialogue Systems
Julia Hirschberg
LSA 353

Recreating the Speech Chain
[Figure: the speech chain and its machine counterparts. Labels: dialog, semantics, syntax, lexicon, morphology, phonetics; vocal-tract articulators; inner ear, acoustic nerve; speech recognition, spoken language understanding, dialog management, speech synthesis]

The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Figure: one utterance ("MY NUMBER IS" followed by a string of digit words) shown at several levels: semantic frame (user:Roberto (attribute:telephone-num value:7360474)), parse (VP, NP), word string, and phone string]
• Ellipses and anaphors
• Limited vocabulary
• Multiple interpretations
• Speaker dependency
• Word variations
• Word confusability
• Context-dependency
• Coarticulation
• Noise/reverberation
• Intra-speaker variability

1980s -- The Statistical Approach
• Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
• Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
  $\hat{W} = \arg\max_{W} P(A \mid W)\, P(W)$
• Foundations of modern speech recognition engines
  – Acoustic HMMs [Figure: three-state left-to-right HMM with states S1, S2, S3, self-loops a11, a22, a33 and transitions a12, a23]
  – Word tri-grams: $P(w_t \mid w_{t-1}, w_{t-2})$
• “No Data Like More Data”
• “Whenever I fire a linguist, our system performance improves” (1988)
• “Some of my best friends are linguists” (2004)
[Photos: Fred Jelinek, Jim Baker]

1980-1990 – Statistical approach becomes ubiquitous
• Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
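The decision rule on the statistical-approach slide, choose the word string W that maximizes P(A | W) P(W), with P(W) built from trigram probabilities P(w_t | w_{t-1}, w_{t-2}), can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the three-sentence toy corpus, the made-up acoustic log scores, and the crude probability floor that stands in for real smoothing. It is not the method of any particular recognizer.

```python
import math
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx

def trigram_prob(tri, ctx, w_prev2, w_prev1, w):
    """Unsmoothed relative-frequency estimate of P(w | w_prev1, w_prev2)."""
    c = ctx[(w_prev2, w_prev1)]
    return tri[(w_prev2, w_prev1, w)] / c if c else 0.0

def lm_logprob(tri, ctx, words, floor=1e-6):
    """log P(W) under the trigram model; `floor` is a hypothetical stand-in for smoothing."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        p = trigram_prob(tri, ctx, padded[i - 2], padded[i - 1], padded[i])
        total += math.log(max(p, floor))
    return total

def decode(candidates, tri, ctx):
    """Pick the candidate maximizing log P(A|W) + log P(W).

    `candidates` is a list of (words, acoustic_logprob) pairs; the acoustic
    scores are supplied by hand here rather than computed from audio.
    """
    return max(candidates, key=lambda c: c[1] + lm_logprob(tri, ctx, c[0]))[0]

# Toy corpus and two competing hypotheses (all values invented).
corpus = [
    ["my", "number", "is", "seven"],
    ["my", "number", "is", "three"],
    ["the", "number", "is", "nine"],
]
tri, ctx = train_trigrams(corpus)
candidates = [
    (["my", "number", "is", "seven"], -10.0),
    (["mine", "umber", "is", "seven"], -9.5),  # slightly better acoustic score
]
best = decode(candidates, tri, ctx)
print(" ".join(best))
```

In this toy setup the second hypothesis has the better acoustic score, but the language model prior outweighs it, which is the point of combining the two terms.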
1980s-Today – The Power of Evaluation
[Figure: timeline, 1995-2004, of the spoken dialog industry. Labels: MIT, SpeechWorks, SRI, Nuance; technology vendors, platform integrators, application developers, hosting, tools, standards]
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”

Today’s State of the Art
• Low noise conditions
• Large vocabulary
  – ~20,000-64,000 words (or more…)
• Speaker independent (vs. speaker-dependent)
• Continuous speech (vs. isolated-word)
• World’s best research systems:
  – Human-human speech: ~13-20% Word Error Rate (WER)
  – Human-machine or monologue speech: ~3-5% WER

Components of an ASR System
• Corpora for training and testing of components
• Representation for input and method of extracting
• Pronunciation Model
• Acoustic Model
• Language Model
• Feature extraction component
• Algorithms to search hypothesis space efficiently

Training and Test Corpora
• Collect corpora appropriate for recognition task at hand
  – Small speech + phonetic transcription to associate sounds with symbols (Initial Acoustic Model)
  – Large (>= 60 hrs) speech + orthographic transcription to associate words with sounds (Acoustic Model)
  – Very large text corpus to identify unigram and bigram probabilities (Language Model)

Building the Acoustic Model
• Model likelihood of phones or subphones given spectral features, pronunciation models, and prior context
• Usually represented as HMM
  – Set of states representing phones or other subword units
  – Transition probabilities on states: how likely is it to see one phone after seeing another?
  – Observation/output likelihoods: how likely is spectral feature vector to be observed from phone state i, given phone state i-1?
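To make the HMM description above concrete, here is a minimal Viterbi sketch that finds the most likely phone-state sequence for a short observation sequence. It rests on simplifying assumptions: two phone states, discrete observation symbols ("hiss", "voiced") in place of spectral feature vectors, emission probabilities that depend only on the current state, and invented probability values. The function and variable names are hypothetical.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best state path, its log probability) for a discrete HMM.

    All probabilities used here must be non-zero, since the sketch takes logs
    directly without special-casing zeros.
    """
    # best[t][s] = (log prob of the best path ending in state s at time t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for o in obs[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][o]), prev)
        best.append(column)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Toy model: phone states "s" and "eh" (all numbers invented).
states = ["s", "eh"]
start_p = {"s": 0.9, "eh": 0.1}
trans_p = {"s": {"s": 0.6, "eh": 0.4}, "eh": {"s": 0.1, "eh": 0.9}}
emit_p = {"s": {"hiss": 0.9, "voiced": 0.1}, "eh": {"hiss": 0.2, "voiced": 0.8}}

path, logp = viterbi(["hiss", "hiss", "voiced", "voiced"],
                     states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

A real acoustic model would replace the discrete emission table with a continuous density over feature vectors; the dynamic-programming recursion stays the same.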
• Initial estimates for
  – Transition probabilities between phone states
  – Observation probabilities associating phone states with acoustic examples
• Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
• I.e., we tell the HMM the ‘right’ answers: which phones to associate with which sequences of sounds
• Iteratively retrain transition and observation probabilities by running training data through model and scoring output (we know the right answers) until no improvement

HMMs for speech
[Figure]

Building the Pronunciation Model
• Models likelihood of word given network of candidate phone hypotheses
  – Multiple pronunciations for each word
  – May be weighted automaton or simple dictionary
• Words come from all corpora; pronunciations from pronouncing dictionary or TTS system

ASR Lexicon: Markov Models for Pronunciation
[Figure]

Language Model
• Models likelihood of word given previous word(s)
• Ngram models:
  – Build the LM by calculating bigram or trigram probabilities from (very large) text training corpus
  – Smoothing issues
• Grammars
  – Finite state grammar or Context Free Grammar (CFG) or semantic grammar
• Out of Vocabulary (OOV) problem

Search/Decoding
• Find the best hypothesis given
  – Lattice of phone units (AM)
  – Lattice of words (segmentation of phone lattice into all possible words via Pronunciation Model)
  – Probabilities of word sequences (LM)
• How to reduce this huge search space?
  – Lattice minimization and determinization
  – Pruning: beam search
  – Calculating most likely paths

Evaluating Success
• Transcription
  – Low WER: (S+I+D)/N * 100, where S, I and D are the numbers of substituted, inserted and deleted words and N is the number of words in the reference
    • Recognized “Thesis test” vs. reference “This is a test.”: 75% WER
    • Or recognized “That was the dentist calling.”: 125% WER
• Understanding
  – High concept accuracy
    • How many domain concepts were correctly recognized?
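The WER formula from the transcription bullet, (S+I+D)/N * 100, can be checked against the two slide examples with a word-level edit distance. This is a minimal sketch under stated assumptions: the `wer` helper is hypothetical, the strings are lowercased with punctuation removed before scoring, and substitutions, insertions and deletions all cost 1.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: minimum (S + I + D) over N reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return 100.0 * d[-1][-1] / len(ref)

print(wer("this is a test", "thesis test"))                   # 75.0
print(wer("this is a test", "that was the dentist calling"))  # 125.0
```

The second example shows why WER can exceed 100%: the hypothesis has more words than the reference, and insertions count as errors while N stays fixed at the reference length.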
I want to go from Boston to Baltimore on September 29

    Domain concepts    Values
    source city        Boston
    target city        Baltimore
    travel date        September 29

  – Score recognized string “Go from Boston to Washington on December 29” vs. “Go to Boston from Baltimore on September 29”
  – (1/3 = 33% CA)

Summary
• ASR today
  – Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words
  – Relies upon many approximate techniques to ‘translate’ a signal
• ASR future
  – Can we include more language phenomena in the model?

Synthesizers Then and Now
• The task: produce human-sounding speech from some orthographic or semantic representation of an input
  – An online text
  – A semantic representation of a response to a query
• What are the obstacles?
• What is the state of the art?

Possibly the First ‘Speaking Machine’
• Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (still in the Deutsches Museum, and playable)
• First to produce whole words, phrases, in many languages

Joseph Faber’s Euphonia, 1846
• Constructed 1835 w/pedal and keyboard control
  – Whispered and ordinary speech
  – Model of tongue, pharyngeal cavity with changeable shape
  – Singing too: “God Save the Queen”
• Modern Articulatory Synthesis: Dennis Klatt (1987)

• World’s Fair in NY, 1939
• Requires much training to ‘play’
• Purpose: reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line

• Answers:
  – These days a chicken leg is a rare dish.
  – It’s easy to tell the depth of a well.
  – Four hours of steady work faced us.
• ‘Automatic’ synthesis from spectrogram, but can also use hand-painted spectrograms as input
• Purpose: understand perceptual effect of spectral details

Formant/Resonance/Acoustic Synthesis
• Parametric or resonance synthesis
  – Specify minimal parameters, e.g.
    f0 and first 3 formants
  – Pass electronic source signal thru filter
    • Harmonic tone for voiced sounds
    • Aperiodic noise for unvoiced
    • Filter simulates the different resonances of the vocal tract
• E.g.
  – Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants
  – Gunnar Fant’s Orator Verbis Electris (1953) for vowels
  – Formant synthesis download (demo)

Concatenative Synthesis
• Most common type today
• First practical application in 1936: British Phone company’s Talking Clock
  – Optical storage for words, part-words, phrases
  – Concatenated to tell time
• E.g.
• And a ‘similar’ example
• Bell Labs TTS (1977) (1985)

Variants of Concatenative Synthesis
• Inventory units
  – Diphone synthesis (e.g. Festival)
  – Microsegment synthesis
  – “Unit Selection”: large, variable units
• Issues
  – How well do units fit together?
  – What is the perceived acoustic quality of the concatenated units?
  – Is post-processing on the output possible, to improve quality?

TTS Production Levels: Back End and Front End
• Orthographic input: The children read to Dr. Smith
• Knowledge levels: world knowledge, semantics, syntax, lexical, phonology, acoustics
• Processing steps: text normalization, word pronunciation, intonation assignment, intonation realization (F0, amplitude, duration), synthesis

Text Normalization
• Reading is what W. hates most.
• Reading is what Wilde hated most.
• Have the students read the questions.
• In 1996 she sold 1995 shares and deposited $42 in her 401(k).
• The duck dove supply.

Pronunciation in Context
• Homograph disambiguation
  – E.g. bass/bass, desert/desert
• Dictionaries vs letter-to-sound rules
  – Frequent or exceptional words: dictionary
    • Bellcore business name pronunciation
  – ‘New’ words: rules
    • Inferring language origin, e.g. Infiniti, Gomez, Paris
    • Pronunciation by analogy, e.g.
      universary
    • Learning rules automatically from dictionaries or small seed-sets + active learning

Intonation Assignment: Phrasing
• Traditional: hand-built rules
  – Punctuation: 234-5682
  – Context/function word: no breaks after function word (He went to dinner)
  – Parsing? (She favors the nuts and bolts approach)
• Current: statistical analysis of large labeled corpus
  – Punctuation, POS window, utterance length,…
  – ~94% ‘accuracy’

Intonation Assignment: Accent
• Hand-built rules
  – Function/content distinction
    • Accent content words; deaccent function words
    • But many exceptions
  – ‘Given’ items, focus, contrast
  – There are presidents and there are good presidents
  – Complex nominals:
    • Main Street/Park Avenue
    • city hall parking lot
• Today: Statistical procedures trained on large labeled corpora

Intonation Assignment: Contours
• Simple rules
  – ‘.’ = declarative contour
  – ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence
    • Well, how did he do it? And what do you know?
• Open problem
  – But realization even of simple variation is quite difficult in current TTS systems

Phonological and Acoustic Realization
• Task:
  – Produce a phonological representation from phonemes and intonational assignment
    • Pitch contour aligned with text
    • Durations, intensity
  – Select best concatenative units from inventory
  – Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units
  – Produce acoustic waveform as output

TTS: Where are we now?
• Natural sounding speech for some utterances
  – Where good match between input and database
• Still…hard to vary prosodic features and retain naturalness
  – Yes-no questions: Do you want to fly first class?
• Context-dependent variation still hard to infer from text and hard to realize naturally:
  – Appropriate contours from text
  – Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
  – Variation in pitch range, rate, pausal duration to convey topic structure
• Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative….
• How to mimic real voices?
• ScanSoft/Nuance demo

Next Class
• J&M 22.4, 22.8
• Walker et al. ‘97