Automatic Speech Recognition: Introduction
Jan Odijk
Utrecht, Dec 9, 2010

Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications

ASR
• Automatic speech recognition is the process by which a computer maps an acoustic signal containing speech to text.
• Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.

ASR-related
• Automatic speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples.
• Automatic speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.

Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications

Why is ASR difficult?
• All occurrences of a speech sound differ from each other
  – even when part of the same word type
  – and when pronounced by the same person
  – (the 'b' in 'boom' is never pronounced twice in exactly the same way)
• Each speaker has his own voice characteristics

Why is ASR difficult?
• Other problems caused by:
  – Language: Dutch vs. English vs. ...
  – Accent/dialect: Flemish vs. NL Dutch, etc.
  – Gender: male vs. female
  – Age: child vs. adult vs. senior
  – Health: cold, flu, sore throat, etc.

Why is ASR difficult?
• Other problems caused by:
  – Environment: home, office, in-car, in station, etc.
  – Channel: fixed telephone, mobile phone, multimedia channel, etc.
  – Microphone(s): telephone mike, close-talk mike, far mike, array microphone, etc.; different mike qualities

Why is ASR difficult?
• Confusables:
  – zeven vs. negen (Dutch 'seven' vs. 'nine')
• Ambiguity:
  – [sã] = cent, (je) sens, sans (French)
• Variation:
  – Yes, yeah, yep, ok, okido, fine, etc.

Why is ASR difficult?
• Assimilation, deletions, etc.
  – Een => [n], [m], [ŋ] (Dutch 'een' before auto, boek, kast)
  – Natuurlijk => tuurlijk
• Coarticulation
  – Pronunciation of a sound depends on its environment (the sounds preceding/following it)
  – koel vs. kiel: [k] vs. [k']
• Filled pauses, stuttering, repetitions

Why is ASR difficult?
• Other sounds
  – Background noise, music, other people talking, channel noise
• Reverberation, echo
• Speakers of language X pronouncing words from language Y
  – Esp. with names (persons, places, ...)

How are these problems reduced?
• Separate ASR system
  – for each language
  – for each accent/dialect (Dutch / Flemish)
  – for each environment
  – for each channel and microphone(s)
    • Use a close-talk mike to reduce other sounds and the influence of the environment
  – for each speaker (speaker-adaptive/dependent ASR)

How are these problems reduced?
• Restricted vocabulary
  – Only a limited number of words can be 'recognized' by any specific system
  – Ranging from a dozen to 64k different word forms
  – Dozen: applications in which digits, yes/no and simple commands are sufficient (banking applications, number dialing)

How are these problems reduced?
• Restricted vocabulary
  – In between: reverse directory applications
    • employee name => phone number
  – 64k: 'large vocabulary systems'
    • dictation
    • (topographic) name recognition

How are these problems reduced?
• Small vocabularies
  – Is that enough? No, generally not!
  – Use dialogue to change the restricted vocabulary in each dialogue state (dynamic active vocabularies)
    • Yes/no answer expected => activate the yes/no vocabulary
    • Digit expected => activate the digit vocabulary
    • Name expected => activate the name vocabulary

How are these problems reduced?
• 64k vocabulary ("large vocabulary")
  – Is that enough? No, generally not:
    • languages with compounds
    • languages with a lot of inflection
    • agglutinative languages
  – => require special measures

Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications

How does ASR work?
• Not possible (yet?) to characterize the different sounds by (hand-crafted) rules
• Instead:
  – A large set of recordings of each sound is made
  – Using statistical methods, a model for each sound is derived (the acoustic model)
  – Incoming sound is compared, using statistics, with the acoustic model of each sound

Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using Acoustic Model + Language Model) → Post Processing → Display text / Natural Language Understanding → Meaning → Action]

Elements of a Recognizer
[Diagram repeated, as above]

Feature Extraction
• Turning the speech signal into something more manageable
• Sampling of the signal: transforming it into digital form
• For each short piece of speech (10 ms)
• Compression

Feature Extraction
• Extract the relevant parameters from the signal
  – Spectral information, energy, frequency, ...
• Eliminate undesirable elements (normalization)
  – Noise
  – Channel properties
  – Speaker properties (gender)

Feature Extraction: Vectors
• The signal is chopped into small pieces (frames)
• Spectral analysis of a speech frame produces a vector representing the signal properties
• => result = a stream of vectors
[Figure: a speech frame and the resulting feature vector, e.g. (10.3, 1.2, -0.9, ..., 0.2)]

Elements of a Recognizer
[Diagram repeated, as above]

Acoustic Model (AM)
• Split the utterance into basic units, e.g.
phonemes
• The acoustic model describes the typical spectral shape (or typical vectors) for each unit
• For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme
• Must cope with pronunciation variability (see earlier)
  – Utterances of the same word by the same speaker are never identical
  – Differences between speakers
  – Identical phonemes sound different in different words
• => statistical techniques: models are created from a lot of examples

Acoustic Model (AM)
• Representation of the speech signal
• Waveform
  – Horizontal: time
  – Vertical: amplitude
• Spectrogram
  – Horizontal: time
  – Vertical: frequency
  – Color: amplitude of the frequency
[Figure: speech signal of "friendly computers", segmented into units S1-S13]

Acoustic Model: Units
• Phoneme: words share units that model the same sound
  – S T O P ("stop") and S T A R T ("start") share the units S and T
• Word: series of units specific to the word
  – "stop" = S1 S2 S3 S4 S5; "start" = S6 S7 S8 S9 S10

Acoustic Model: Units
• Context-dependent phoneme: S|,|T  T|S|O  O|T|P  P|O|, ("stop")
• Diphone: ,S  ST  TO  OP  P, ("stop")
• Other sub-word units, e.g. consonant clusters: ST O P ("stop")

Acoustic Model: Units
• Other possible units
  – Words
  – Multi-words, for example: "it is", "going to"
• Combinations of all of the above

Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using Acoustic Model + Language Model) → Post Processing → Display text / Natural Language Understanding → Meaning → Action]

Pattern matching
• The Acoustic Model returns a score for each incoming feature vector, indicating how well the feature corresponds to the model
  = the local score
• Calculate the score of a word, indicating how well the word matches the string of incoming features
• Search algorithm: looks for the best-scoring word or word sequence
  – increase [, I n k R+ I s ,]
  – include [, I n k l u: d ,]

Elements of a Recognizer
[Diagram repeated, as above]

Language Model (LM)
• Describes how words are connected to form a sentence
• Limits the possible word sequences
• Reduces the number of recognition errors by eliminating unlikely sequences
• Increases the speed of the recognizer => real-time implementations

Language Model (LM)
• Two major types
  – Grammar based:
      !start <sentence>;
      <sentence>: <yes> | <no>;
      <yes>: yes | yep | yes please ;
      <no>: no | no thanks | no thank you ;
  – Statistical:
    • Probabilities of single words and of 2/3-word sequences
    • Derived from frequencies in a large text corpus

Active Vocabulary
• Lists the words that can be recognized by the acoustic model and that are allowed to occur given the language model
• Each word is associated with a phonetic transcription
  – Enumerated, and/or
  – Generated by a Grapheme-to-Phoneme (G2P) module

Result
• N-best list: a list of word sequences, each with a score
  – Based on the AM and the LM
  – Sorted descending by this score
  – Contains at most N word sequences

Post Processing
• Re-ordering of the N-best list using other criteria: e.g. valid credit card numbers, telephone numbers
• If a single result is needed, select the top element
• Applying NLP techniques that fall outside the scope of the statistical language model
  – E.g. "three dollars fifty cents" => "$ 3.50"
  – "doctor Jones" => "Dr. Jones"
  – Etc.

Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications

How to get an AM and an LM
• AM
  – Annotated speech database, and
  – Pronunciation dictionary
• LM
  – Handwritten grammar, or
  – Large text corpus

Training of Acoustic Models
[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]

Annotated Speech Database
• Must contain speech covering
  – all units: phonemes, context-dependent phonemes
  – the population (region, dialect, age, gender, ...)
  – the relevant environment(s) (car, office, ...)
  – the relevant channel(s) (fixed phone, mobile phone, desktop computer, ...)

Annotated Speech Database
• Must contain a transcription of the speech
  – At least orthographic
• Must include markers for
  – Speech by others
  – Other non-speech sounds
  – Unfinished words, mispronunciations, stuttering, etc.

Pronunciation Dictionary
• List of all words occurring in the speech database
  – With one or more phonetic transcriptions
• Or: a Grapheme-to-Phoneme (G2P) module
  – Graphemes => phonemes
  – E.g. boek => [, b u k ,]

Training of Acoustic Models
• For all utterances in the database:
  – Make a phonetic transcription of the utterance
  – Use the models to segment the utterance file: assign a phoneme to each speech frame
  – Collect statistical information: count prototype-phoneme occurrences
  – Create new models

Language Model
• Large text corpus
  – Relevant for the intended application(s)
• Count frequencies
  – of individual words (unigrams)
  – of sequences of two words (bigrams)
  – of sequences of three words (trigrams)
• Derive probabilities from the frequencies

Spoken and Written Data
• Produce them yourself
• ELRA:
  – http://catalog.elra.info/index.php?cPath=37
  – http://catalog.elra.info/index.php?cPath=42
• LDC: http://www.ldc.upenn.edu/
• TST-Centrale
  – http://www.inl.nl/nl/producten?task=view
  – E.g. the Corpus Gesproken Nederlands (Spoken Dutch Corpus)
• Usually pretty expensive!
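The acoustic-model training loop sketched above (segment the utterance, assign a phoneme to each frame, collect statistics, create new models) can be illustrated in a few lines. This is a toy sketch, not a real trainer: the frame data and function names are made up, and a simple per-phoneme average prototype with a distance-based local score stands in for full HMM/GMM training.

```python
def train_prototypes(segmented_frames):
    """Average the feature vectors assigned to each phoneme.

    segmented_frames: list of (phoneme, feature_vector) pairs, i.e. the
    output of the segmentation step. Returns one prototype vector per phoneme.
    """
    sums, counts = {}, {}
    for phoneme, vector in segmented_frames:
        acc = sums.setdefault(phoneme, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[phoneme] = counts.get(phoneme, 0) + 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}

def local_score(vector, prototype):
    """How well a feature vector matches a prototype (higher = better)."""
    return -sum((v - p) ** 2 for v, p in zip(vector, prototype))

# Toy "annotated database": two phonemes with two frames each (made-up data).
frames = [("b", [1.0, 0.0]), ("b", [1.2, 0.2]),
          ("u", [5.0, 3.0]), ("u", [4.8, 2.8])]
model = train_prototypes(frames)

# Score an incoming frame against every phoneme model and pick the best.
best = max(model, key=lambda p: local_score([1.1, 0.1], model[p]))
print(best)  # "b": the frame lies closest to the 'b' prototype
```

A real recognizer models each unit with Gaussian mixtures over many states and re-runs segmentation and re-estimation iteratively, but the counting-and-averaging idea is the same.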
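The statistical language-model step above (count unigrams and bigrams in a corpus, derive probabilities from the frequencies) can be sketched as follows. The corpus and function name are illustrative; real systems also smooth the estimates so that unseen word pairs do not get probability zero.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Derive bigram probabilities P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Toy corpus; in practice a large text corpus relevant for the application.
corpus = ["yes please", "no thank you", "yes", "no thanks"]
lm = train_bigram_lm(corpus)
print(lm[("yes", "please")])  # 0.5: "yes" is followed by "please" in 1 of its 2 occurrences
```

During recognition, the product of these probabilities over a hypothesized word sequence is combined with the acoustic scores, so that unlikely sequences are eliminated early.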
Key Element in ASR
• ASR is based on learning from observations
  – A huge amount of spoken data is needed for making acoustic models
  – A huge amount of text data is needed for making language models
• => Lots of statistics, few rules

Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications

Applications
• Dictation
• Audio Mining
• Subtitling
• Telephone Services
• Destination entry on GPS systems
• PDA/smartphone services

Dictation
• Convert speech into correctly written text
• Typically a desktop application in a (silent) office, using a close-talk microphone
• Used very often in medical environments (pathologists, radiologists)
• Also in legal domains, police reports, etc.
  – Using dedicated language models

Dictation
• Contains a speaker-adaptive acoustic model
  – Adapts to the user's speech during use
• Major vendor: Nuance http://www.nuance.com
• Product: Dragon NaturallySpeaking http://www.nuance.com/for-individuals/by-product/dragon-for-pc/index.htm

Dictation
• MS dictation system
  – Included in each MS Windows OS
  – http://www.microsoft.com/enable/products/windowsvista/speech.aspx
  – But hardly used
• Other earlier vendors (Philips, IBM, L&H) stopped or were acquired by Nuance

Audiomining
• Recognize speech
  – Create a text transcription (possibly imperfect), or
  – Align to an existing text transcription
• Make an index of the recognized words
  – With links to the relevant speech fragments
• => the speech is now searchable on the basis of entered keywords

Audiomining
• Examples
  – Journaaldemo
    • Search in TV news
    • http://hmi.ewi.utwente.nl/showcases/Broadcast-news-demo
  – Buchenwald
    • Search in interviews with Buchenwald survivors
    • http://www.buchenwald.nl/

Audiomining
• Examples (cont.)
  – Radio Oranje
    • Search in Queen Wilhelmina's speeches (1940-45)
    • Uses alignment to an existing transcription text
    • http://niod.al-m.nl/nl/thema/10/

Subtitling
• Example
  – NEON (NEderlandstalige ONdertiteling)
    • Cooperation between NPO and VRT
    • Uses ASR to align speech with transcripts to create subtitles efficiently
    • http://www.kennislink.nl/publicaties/computer-gaat-tv-programmas-ondertitelen
    • http://www.youtube.com/watch?v=l0wf8gptic&feature=player_embedded#!
    • Local backup

Destination Entry
• Example
  – TomTom GO 520 / 720 / 930
    • http://www.youtube.com/watch?v=z01zyfB0CrA
    • Say city, say street, say/enter number
      – Returns the 10 best candidates; select the correct one
      – Saves a lot of clumsy typing
      – Works in a (driving) car

Telephone services
• Police 0900-8844
  – http://www.telecats.nl/nieuws/spraakherkenning-politie-0900-8844/
• AEGON
  – http://www.telecats.nl/nieuws/spraakherkenning-bij-aegon/

PDA/Smartphone
• Examples
  – Dragon Dictate
    • SMS and e-mail dictation
    • http://www.dragonmobileapps.com/applications.html
  – Jibbigo http://www.jibbigo.com
    • Speech-to-speech translation (iPhone, Android)
    • http://www.phonedog.com/2009/10/30/iphone-app-jibbigo-speech-translator/

PDA/Smartphone
• Examples
  – Dragon Search
    • Search using speech input
    • http://www.dragonmobileapps.com/apple/search.html
  – Google Voice Search
    • Search using speech input (also in Dutch)
    • Also uses location information
    • http://www.google.com/mobile/

• Read more? Kennislink!
  – http://www.kennislink.nl/publicaties/taal-en-spraaktechnologie
  – http://www.kennislink.nl/publicaties/het-luisterend-oor-van-de-computer
• Want to work with speech recognition yourself?
  – http://www.spraak.org/

Thanks for your attention