This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

CS 479, section 1: Natural Language Processing
Lecture #16: Speech Recognition Overview (cont.)

Thanks to Alex Acero (Microsoft Research), Jeff Adams (Nuance), Simon Arnfield (Sheffield), Dan Klein (UC Berkeley), and Mazin Rahim (AT&T Research) for many of the materials used in this lecture.

Announcements
- Reading Report #6 on Young's overview: due now
- Reading Report #7 on M&S chapter 7: due Friday
- Review questions: a typed list of 5 questions for mid-term exam review, due next Wednesday

Objectives
- Continue our overview of an approach to speech recognition, picking up at acoustic modeling
- See other examples of the source/channel (noisy channel) paradigm for modeling interesting processes
- Apply language models

Recall: Front End
[Diagram: a source emits text w = w1 w2 ... wm with prior P(w); the noisy channel turns it into speech x = x1 x2 ... xn with likelihood P(x | w); the front end (FE) extracts features y = y1 y2 ... yq; the recognizer (ASR) outputs text w*.]
We want to predict a sentence w* given a feature vector y = FE(x):

    w* = argmax_w P(w) P(y | w)

Acoustic Modeling
Goal: map acoustic feature vectors into distinct linguistic units, such as phones, syllables, words, etc.
[Diagram: Feature Extraction, Acoustic Model, Language Model, and Word Lexicon all feed the Decoder (Search), which computes w* = argmax_w P(w) P(y | w).]

Acoustic Trajectories
[Figure: trajectories of phones (AA, IH, UH, EH, AY, OW, AW, OY, EE, ...) through a two-dimensional projection of the acoustic feature space.]

Acoustic Models: Neighborhoods are not Points
- How do we describe which points in our "feature space" are likely to come from a given phoneme?
- It is clearly more complicated than identifying a single point, and the boundaries are not "clean."
- Use the normal distribution:
  - Points are likely to lie near the center.
  - We describe the distribution with its mean and variance.
  - It is easy to compute with.

Acoustic Models: Neighborhoods are not Points (2)
- Normal distributions in M dimensions are analogous; they are also known as "Gaussians."
- Specify the mean point in M dimensions: an M-dimensional "hill" centered around the mean point.
- Specify the variances as a covariance matrix:
  - The diagonal gives the "widths" of the distribution in each direction.
  - The off-diagonal values describe the "orientation": "full covariance" distributions can be "tilted"; "diagonal covariance" distributions cannot.

AMs: Gaussians don't really cut it
- Consider the "AY" frames in our example. How can we describe these with a single (elliptical) Gaussian?
- A single diagonal Gaussian is too big to be helpful, and full-covariance Gaussians are hard to train.
- So we often use multiple Gaussians, a.k.a. Gaussian mixture models (GMMs).
[Figure: one-dimensional Gaussian mixture models.]

AMs: Phonemes are a path, not a destination
- Phonemes, like stories, have beginnings, middles, and ends. This may be clearest if you think of how the "AY" sound moves from a sort of "EH" to an "EE"; even non-diphthongs show these properties.
- We often represent a phoneme with multiple "states." E.g., our AY model might have 4 states, and each state is modeled by a mixture of Gaussians.
[Figure: a four-state model of "AY" (states 1 through 4).]

AMs: Whence and Whither
- It matters where you come from (whence) and where you are going (whither): phonetic contextual effects.
- One way to model this is to use triphones, i.e., models that depend on the previous and following phonemes. E.g., our "AY" model should really be a silence-AY-S model. (Or pentaphones: use 2 phonemes before and after.)
- So what we really need for our "AY" model is a mixture of Gaussians, for each of multiple states, for each possible pair of predecessor and successor phonemes.

Hidden Markov Model (HMM)
- Captures:
  - transitions between hidden states
  - feature emissions as mixtures of Gaussians
  - spectral properties modeled by a parametric random process
- I.e., a directed graphical model!
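The diagonal-covariance Gaussians and mixtures described above are easy to sketch in code. The following is a minimal illustration only; the mixture weights, means, and variances are made up, not trained acoustic-model parameters:

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a mixture of diagonal Gaussians.

    Uses the log-sum-exp trick for numerical stability, since per-frame
    densities in high-dimensional feature spaces underflow easily.
    """
    logs = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# A made-up 2-component mixture over a 2-dimensional feature space:
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 1.0]]
variances = [[1.0, 2.0], [0.5, 0.5]]
score = gmm_loglik([2.5, 0.8], weights, means, variances)
```

In a real acoustic model, one such mixture would serve as the emission distribution of one HMM state of one context-dependent phone.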
- Advantages:
  - a powerful statistical method for a wide range of data and conditions
  - highly reliable for recognizing speech
- Use a collection of HMMs, one for each:
  - sub-word unit type
  - extraneous event: cough, um, sneeze, ...
- More on HMMs coming up in the course after classification!

Anatomy of an HMM
HMM for /AY/ in the context of preceding silence, followed by /S/:
[Diagram: a three-state left-to-right HMM with states sil-AY+S [1], [2], [3]; self-loop probabilities 0.2, 0.3, 0.2 and forward transition probabilities 0.8, 0.7, 0.8.]

HMMs as Phone Models
[Diagram: the same three-state sil-AY+S HMM, with each state emitting feature vectors from its own Gaussian mixture.]

Words and Phones

    w* = argmax_w P(w) P(y | w1, w2, ..., wn)
       = argmax_w P(w) P(y | p_{1,1} p_{1,2} p_{1,3}, p_{2,1} p_{2,2} ... p_{2,5}, ..., p_{n,1} p_{n,2} ... p_{n,4})

How do we know how to segment words into phones?

Word Lexicon
Goal: map sub-word units into words; the usual sub-word units are phone(me)s.
[Diagram: the Word Lexicon feeds the Decoder (Search), alongside Feature Extraction, the Acoustic Model, and the Language Model.]

Lexicon (CMUDict, ARPABET):

    Phoneme   Example   Translation
    AA        odd       AA D
    AE        at        AE T
    AH        hut       HH AH T
    AO        ought     AO T
    AW        cow       K AW
    AY        hide      HH AY D
    B         be        B IY
    CH        cheese    CH IY Z
    ...

Properties:
- simple
- typically knowledge-engineered, not learned (shock!)

Decoder
[Diagram: the source/noisy-channel pipeline again: text w with prior P(w), speech x with likelihood P(x | w), features y = FE(x), recognized text w*.]
Predict a sentence w* given a feature vector y = FE(x):

    w* = argmax_w P(w) P(y | w)

Decoding as State-Space Search
[Diagram: Feature Extraction, Acoustic Model, Language Model, and Word Lexicon feed Pattern Classification.]

Decoding as Search
- Viterbi: dynamic programming
- multi-pass
- A* ("stack decoding")
- N-best
- ...

Viterbi: DP
[Figure: dynamic-programming trellis for Viterbi decoding.]

Noisy Channel Applications
- Speech recognition (dictation, commands, etc.): text -> (neurons, acoustic signal, transmission) -> acoustic waveforms -> text
- OCR: text -> (print, smudge, scan) -> image -> text
- Handwriting recognition: text -> (neurons, muscles, ink, smudge, scan) -> image -> text
- Spelling correction: text -> (your spelling) -> mis-spelled text -> text
- Machine translation (?): text in target language -> (translation in head) -> text in source language -> text in target language

Noisy-Channel Models
- OCR: P(text | strokes) ∝ P(text) P(strokes | text)
- Handwriting recognition: P(text | pixels) ∝ P(text) P(pixels | text)
- Spelling correction: P(text | mis-spelled text) ∝ P(text) P(mis-spelled text | text)
- Translation (?): P(english | french) ∝ P(english) P(french | english)

What's Next
Upcoming lectures:
- classification / categorization
- Naïve Bayes models
- class-conditional language models

Extra: Milestones in Speech Recognition
[Timeline figure, 1962-2003:
- small vocabulary, acoustic-phonetics-based (isolated words): filter-bank analysis, time normalization, dynamic programming
- medium vocabulary, template-based (isolated words, connected digits, continuous speech): pattern recognition, LPC analysis, clustering algorithms, level building
- large vocabulary, statistical-based (connected words, continuous speech): hidden Markov models, stochastic language modeling
- large vocabulary with syntax and semantics (continuous speech, speech understanding): stochastic language understanding, finite-state machines, statistical learning
- very large vocabulary with semantics and multimodal dialog (spoken dialog, multiple modalities): concatenative synthesis, machine learning, mixed-initiative dialog]

Dragon Dictate Progress
WERR* from Dragon NaturallySpeaking version 7 to version 8, and from version 8 to version 9:

    Domain       v7 -> v8   v8 -> v9
    US English   27%        23%
    UK English   21%        10%
    German       16%        10%
    French       24%        14%
    Dutch        27%        18%
    Italian      22%        14%
    Spanish      26%        17%

* WERR means relative word error rate reduction on an in-house evaluation set.
Results from Jeff Adams, ca. 2006.

Crazy Speech Marketplace
[Diagram, ca. 1980 to ca. 2004: many speech companies (Philips, Inso, IBM, Articulate, MedRemote, Kurzweil, ScanSoft, L&H, Dragon, Dictaphone, SpeechWorks, Voice Signal, Tegic, Nuance, etc.) merging and consolidating over time, largely into Nuance.]

Speech vs. text: tokens vs. characters
Speech recognition recognizes a sequence of "tokens" taken from a discrete and finite set, called the lexicon.
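At its simplest, the word lexicon described above is a table from words to ARPABET phone strings, and the decoder scores the phone sequence obtained by concatenating the entries for a word sequence. A minimal sketch, using only the entries from the slide's table (the helper `phones_for_sentence` is a hypothetical name, not part of any real toolkit):

```python
# Toy CMUDict-style pronunciation lexicon (entries from the slide's table).
LEXICON = {
    "odd": ["AA", "D"],
    "at": ["AE", "T"],
    "hut": ["HH", "AH", "T"],
    "ought": ["AO", "T"],
    "cow": ["K", "AW"],
    "hide": ["HH", "AY", "D"],
    "be": ["B", "IY"],
    "cheese": ["CH", "IY", "Z"],
}

def phones_for_sentence(words, lexicon=LEXICON):
    """Expand a word sequence into the phone sequence the decoder scores."""
    phones = []
    for w in words:
        if w not in lexicon:
            raise KeyError(f"out-of-vocabulary word: {w}")
        phones.extend(lexicon[w])
    return phones
```

A real lexicon also stores multiple pronunciations per word, and a real decoder maps each phone (in context) onto its triphone HMM states.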
- Informally, tokens correspond to words, but the correspondence is inexact.
- In dictation applications, where we have to worry about converting between speech and text, we need to sort out a "token philosophy":
  - Do we recognize "forty-two" or "forty two" or "42" or "40 2"?
  - Do we recognize "millimeters" or "mms" or "mm"?
  - What about common words which can also be names, e.g. "Brown" vs. "brown"?
  - What about capitalized phrases like "Nuance Communications" or "The White House" or "Main Street"?
  - What multi-word tokens should be in the lexicon, like "of_the"?
  - What do we do with complex morphologies or compounding?

Converting between tokens and text
[Diagram: the token philosophy mediates between TEXT and TOKENS via the lexicon. Tokenization maps text to tokens; inverse text normalization (ITN) maps tokens back to text. E.g.:
    TEXT: Profits rose to $28 million. See fig. 1a on p. 124.
    TOKENS: profits rose to twenty eight million dollars .\period see figure one a\a on page one twenty four .\period]

Three examples (Tokenization)

TEXT:
1. P.J. O'Rourke said, "Giving money and power to government is like giving whiskey and car keys to teenage boys."
2. The 18-speed I bought sold on www.eBay.com for $611.00, including 8.5% sales tax.
3. From 1832 until August 15, 1838 they lived at No. 235 Main Street, "opposite the Academy," and from there they could see it all.

TOKENS:
1. PJ O'Rourke said ,\comma "\open-quotes giving money and power to government is like giving whiskey and car keys to teenage boys .\period "\close-quotes
2. the eighteen speed I bought sold on www.\WWW_dot eBay .com\dot_com for six hundred and eleven dollars zero cents ,\comma including eight .\point five percent sales tax .\period
3. from one eight three two until the fifteenth of August eighteen thirty eight they lived at number two thirty five Main_Street ,\comma "\open-quotes opposite the Academy ,\comma "\close-quotes and from there they could see it all .\period

Missing from speech: punctuation
As spoken (no punctuation):

    when people speak they don't explicitly indicate phrase and section boundaries instead listeners rely on prosody and syntax to know where these boundaries belong in dictation applications we normally rely on speakers to speak punctuation explicitly how can we remove that requirement

As written:

    When people speak, they don't explicitly indicate phrase and section boundaries. Instead, listeners rely on prosody and syntax to know where these boundaries belong. In dictation applications, we normally rely on speakers to speak punctuation explicitly. How can we remove that requirement?

Punctuation Guessing
- As currently shipping in Dragon
- Targeted toward free, unpunctuated speech

Example (recognizer output, errors included):

    My personal experience with camping has been rather limited. Having lived overseas in a very urban situation in which camping in the wilderness is not really possible. My only chances at camping came when I returned to the United States. My most memory, I had two most memorable camping trips both with my father. My first one was when I was a preteen, and we went hiking on Bigalow mountain in Maine, central western Maine. We went hiking for a day took a trail that leads off of the Appalachian Trail and goes down to the village of Stratton in the township of Eustis, just north and west of Sugarloaf U.S.A., the ski area.
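The token-to-text direction (ITN) can be illustrated with a toy detokenizer for the spoken-form escape notation used in the examples above (e.g. ".\period", ",\comma", "Main_Street"). This is a sketch of the idea only; real ITN also rewrites numbers, dates, currency amounts, and capitalization:

```python
def detokenize(tokens):
    """Toy inverse text normalization (ITN) for the slides' escape notation.

    Each escaped token carries its written form before the backslash
    (e.g. '.\\period' -> '.'); sentence-internal punctuation is attached
    to the preceding word, and underscores in multi-word tokens become
    spaces.  A sketch only, not a full ITN system.
    """
    out = []
    for tok in tokens:
        if "\\" in tok:
            tok = tok.split("\\", 1)[0]   # written form precedes the escape
            if out and tok in {".", ",", "?", "!", ";", ":"}:
                out[-1] += tok            # no space before punctuation
                continue
        out.append(tok.replace("_", " "))  # e.g. Main_Street -> Main Street
    return " ".join(out)

text = detokenize(["profits", "rose", ".\\period"])
```

Running this on the spoken form "profits rose .\period" yields "profits rose." A full ITN pass would go further and turn "twenty eight million dollars" into "$28 million".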