Automatic Speech Recognition
Introduction
Jan Odijk
Utrecht, Dec 9, 2010
Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications
ASR
• Automatic Speech Recognition is the
process by which a computer maps an
acoustic signal containing speech to text.
• Automatic speech understanding is the
process by which a computer maps an
acoustic speech signal to some form of
abstract meaning of the speech.
ASR-related
• Automatic speaker recognition is the
process by which a computer recognizes the
identity of the speaker based on speech
samples.
• Automatic speaker verification is the
process by which a computer checks the
claimed identity of the speaker based on
speech samples.
Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications
Why is ASR difficult?
• All occurrences of a speech sound differ
from each other
– even when part of the same word type
– and when pronounced by the same person
– (the ‘b’ in ‘boom’ is never pronounced twice in exactly the same way)
• Each speaker has his own voice
characteristics
Why is ASR difficult?
• Other problems caused by:
– Language: Dutch vs. English vs. …
– Accent/Dialect: Flemish vs. NL Dutch, etc.
– Gender: male vs. female
– Age: child vs. adult vs. senior
– Health: cold, flu, sore throat, etc.
Why is ASR difficult?
• Other problems caused by:
– Environment: home, office, in-car, in station,
etc.
– Channel: fixed telephone, mobile phone, multimedia channel, etc.
– Microphone(s): telephone mike, close-talk
mike, far mike, array microphone, etc.;
different mike qualities
Why is ASR difficult?
• Confusables:
– Zeven vs. negen
• Ambiguity
– [sã] = cent, (je) sens, sans (French)
• Variation
– Yes, yeah, yep, ok, okido, fine, etc.
Why is ASR difficult?
• Assimilation, deletions, etc.
– Een => [n], [m], [ŋ] (auto, boek, kast)
– Natuurlijk => tuurlijk
• Coarticulation
– Pronunciation of a sound depends on its
environment (sounds preceding/following)
– Koel vs. kiel [k] vs. [k’]
• Filled pauses, stuttering, repetitions
Why is ASR difficult?
• Other sounds
– Background noise, music, other people talking,
channel noise
• Reverberation, echo
• Speaker of language X pronouncing words
from language Y
– Esp. with names (persons, places, …)
How are these problems reduced?
• Separate ASR system
– for each language
– for each accent/dialect (Dutch / Flemish)
– for each environment
– for each channel and microphone(s)
• Use close-talk mike to reduce other sounds and influence of environment
– for each speaker (speaker-adaptive/dependent ASR)
How are these problems reduced?
• Restricted Vocabulary
– Only a limited number of words can be
‘recognized’ by any specific system
– Ranging from a dozen to 64k different word
forms
– Dozen: application in which digits, yes/no and
simple commands are sufficient (banking
applications, number dialing)
How are these problems reduced?
• Restricted Vocabulary
– In between: reverse directory application
• employee name => phone number
– 64k: ‘large vocabulary systems’
• dictation,
• (topographic) name recognition
How are these problems reduced?
• Small Vocabularies
– Is that enough? No, generally not!
– Use dialogue to change restricted vocabulary in
each dialogue state (dynamic active
vocabularies)
• Yes/no answer is expected => activate yes/no
vocabulary
• Digit expected => activate digit vocabulary
• Name expected => activate name vocabulary
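As a sketch of how dynamic active vocabularies might be wired up: the dialogue state selects which small word set the recognizer is allowed to match against. The state names and word lists below are invented for illustration, not taken from any real ASR toolkit.

```python
# Sketch of dynamic active vocabularies: each dialogue state activates
# its own small vocabulary. All names here are illustrative.

VOCABULARIES = {
    "confirm": {"yes", "no", "yep", "no thanks"},
    "digit": {"zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"},
    "name": {"jones", "smith", "jansen"},
}

def active_vocabulary(dialogue_state):
    """Return the word set the recognizer should activate in this state."""
    return VOCABULARIES[dialogue_state]

def in_vocabulary(word, dialogue_state):
    """Is this word recognizable in the current dialogue state?"""
    return word.lower() in active_vocabulary(dialogue_state)

print(in_vocabulary("Yes", "confirm"))    # True
print(in_vocabulary("seven", "confirm"))  # False: digits not active here
```

Restricting the search to a handful of words per state is what makes small-vocabulary systems robust: confusable words that can never occur in the current state are simply excluded.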
How are these problems reduced?
• 64k Vocabulary (“Large Vocabulary”)
– Is that enough?
– No, generally not:
– languages with compounds
– languages with a lot of inflection
– agglutinative languages
– => require special measures
Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications
How does ASR work?
• Not possible (yet?) to characterize the
different sounds by (hand-crafted) rules
• Instead:
– A large set of recordings of each sound is made
– Using statistical methods a model for each
sound is derived (acoustic model)
– Incoming sound is compared, using statistics,
with acoustic model of a sound
Elements of a Recognizer
[Pipeline diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and the Language Model) → Post Processing → Display text, or → Natural Language Understanding → Meaning → Action]
Elements of a Recognizer
[Recognizer pipeline diagram, repeated]
Feature Extraction
• Turning speech signal into something more
manageable
• Sampling of a signal: transforming into a
digital form
• For each short piece of speech (10ms)
• Compression
Feature Extraction
• Extract relevant parameters from the signal
– Spectral information, energy, frequency,...
• Eliminate undesirable elements
(normalization)
– Noise
– Channel properties
– Speaker properties (gender)
Feature Extraction: Vectors
• Signal is chopped into small pieces (frames)
• Spectral analysis of a speech frame produces a vector representing the signal properties
• => result = stream of vectors
[Figure: a frame of the waveform mapped to a feature vector, e.g. (10.3, 1.2, -0.9, …, 0.2)]
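The framing step can be sketched as follows. The two toy features computed here (log energy and zero-crossing rate) merely stand in for the richer spectral features, such as MFCCs, that real front-ends compute; the sample rate and frame length are chosen to match the 10 ms frames mentioned above.

```python
import math

def frames(signal, frame_len=80):
    """Chop a sampled signal (list of floats) into fixed-size frames.
    At 8 kHz, 80 samples correspond to one 10 ms frame."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def feature_vector(frame):
    """Toy 2-dimensional feature vector: log energy and zero-crossing
    rate. Real recognizers use richer spectral features (e.g. MFCCs)."""
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:])
                         if (a < 0) != (b < 0))
    return [log_energy, zero_crossings / len(frame)]

# A 0.1 s, 8 kHz sine wave becomes a stream of ten 2-dimensional vectors.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
vectors = [feature_vector(f) for f in frames(signal)]
print(len(vectors))  # 10
```

The output of this stage, the stream of vectors, is what the acoustic model scores; the raw waveform is never compared directly.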
Elements of a Recognizer
[Recognizer pipeline diagram, repeated]
Acoustic Model (AM)
• Split utterance into basic units, e.g. phonemes
• The acoustic model describes the typical spectral shape (or typical vectors) for each unit
• For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme
• Must cope with pronunciation variability (see earlier)
– Utterances of the same word by the same speaker are never identical
– Differences between speakers
– Identical phonemes sound different in different words
• => statistical techniques: models created from a lot of examples
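One way to picture “a statistical model per unit” is a single Gaussian (mean, variance) per phoneme over a one-dimensional feature. The phonemes and parameter values below are invented for illustration; real acoustic models use mixtures of multivariate Gaussians, or neural networks, over full feature vectors.

```python
import math

# Toy acoustic model: one 1-D Gaussian per phoneme. The parameters
# (mean, variance) are invented; in practice they are estimated from
# many recorded examples of each sound.

MODEL = {
    "b": (-1.0, 0.5),
    "u": (2.0, 0.8),
    "k": (0.5, 0.4),
}

def log_likelihood(x, mean, var):
    """Log density of a 1-D Gaussian: the acoustic 'local score'."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def best_phoneme(x):
    """Which phoneme model explains this feature value best?"""
    return max(MODEL, key=lambda p: log_likelihood(x, *MODEL[p]))

print(best_phoneme(1.9))   # "u": closest to the 'u' mean of 2.0
print(best_phoneme(-1.0))  # "b"
```

Because the model is probabilistic, it never answers “this is a /u/” outright; it returns a score for every phoneme, and the later search stage decides.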
Acoustic Model (AM)
• Representation of speech signal
• Waveform
– Horizontal: time
– Vertical: amplitude
• Spectrogram
– Horizontal: time
– Vertical: frequency
– Color: amplitude of frequency
[Figure: segmentation of the words “friendly” and “computers” into phoneme-sized states S1–S13]
Acoustic Model: Units
• Phoneme: words share the units that model the same sound
[Figure: STOP = S T O P, START = S T A R T; both words reuse the shared S and T phoneme units]
• Word: series of units specific to the word
[Figure: Stop = S1 S2 S3 S4; Start = S6 S7 S8 S9 S10]
Acoustic Model: Units
• Context-dependent phoneme
[Figure: Stop = S|,|T  T|S|O  O|T|P  P|O|, — each phoneme modeled together with its left and right neighbors]
• Diphone
[Figure: Stop = ,S  ST  TO  OP  P, — units spanning the transitions between phonemes]
• Other sub-word units: consonant clusters
[Figure: Stop = ST  O  P]
Acoustic Model: Units
• Other possible units
– Words
– Multi words: example: “it is”, “going to”
• Combinations of all of the above
Elements of a Recognizer
[Recognizer pipeline diagram, repeated]
Pattern matching
• Acoustic Model: returns a score for each
incoming feature vector indicating how well
the feature corresponds to the model.
= Local score
• Calculate score of a word, indicating how
well the word matches the string of
incoming features
• Search algorithm: looks for the best scoring word or word sequence
– increase: [, I n k R+ I s ,]
– include: [, I n k l u: d ,]
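The scoring idea can be sketched as follows: sum the per-frame local scores along each word's phonetic transcription and pick the best-scoring word. For simplicity the sketch assigns frames to phonemes evenly; a real recognizer finds the best alignment with a Viterbi search, and the lexicon and scores below are invented.

```python
# Sketch of word scoring: combine per-frame local scores (as produced
# by the acoustic model) along each word's transcription, then return
# the best-scoring word. Even frame-to-phoneme alignment is a
# simplification; real recognizers use Viterbi search.

LEXICON = {
    "increase": ["I", "n", "k", "r", "I", "s"],
    "include":  ["I", "n", "k", "l", "u", "d"],
}

def word_score(frame_scores, phonemes):
    """Sum log local scores, assigning frames evenly to phonemes."""
    n = len(frame_scores)
    total = 0.0
    for i, scores in enumerate(frame_scores):
        phoneme = phonemes[min(i * len(phonemes) // n, len(phonemes) - 1)]
        total += scores.get(phoneme, -10.0)  # unseen phoneme: penalty
    return total

def recognize(frame_scores):
    """Search: return the lexicon word with the highest total score."""
    return max(LEXICON, key=lambda w: word_score(frame_scores, LEXICON[w]))

# Six frames whose best local matches spell out [I n k l u d]:
frames = [{p: (0.0 if p == target else -5.0) for p in "InklurRsd"}
          for target in ["I", "n", "k", "l", "u", "d"]]
print(recognize(frames))  # "include"
```

Note that “increase” still receives a score here; it is merely worse. This is why confusable words (zeven/negen) are dangerous: their scores can end up very close.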
Elements of a Recognizer
[Recognizer pipeline diagram, repeated]
Language Model (LM)
• Describes how words are connected to form
a sentence
• Limit possible word sequences
• Reduce number of recognition errors by
eliminating unlikely sequences
• Increase speed of recognizer => real time
implementations
Language Model (LM)
• Two major types
– Grammar based
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please ;
<no>: no | no thanks | no thank you ;
– Statistical
• Probability of single words, 2/3-word sequences
• Derived from frequencies in a large text corpus
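A grammar like the yes/no grammar on this slide can be expanded into the finite set of sentences it licenses, which is one simple way a recognizer can enforce it. The dictionary encoding below is an illustrative sketch, not the syntax of any real grammar compiler.

```python
# Sketch: expand the yes/no grammar above into the set of sentences it
# allows, then test whether a recognized word sequence is licensed.

GRAMMAR = {
    "<sentence>": [["<yes>"], ["<no>"]],
    "<yes>": [["yes"], ["yep"], ["yes", "please"]],
    "<no>": [["no"], ["no", "thanks"], ["no", "thank", "you"]],
}

def expand(symbol):
    """All word sequences a symbol can produce (finite, non-recursive
    grammar assumed)."""
    if symbol not in GRAMMAR:          # terminal: an actual word
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        seqs = [[]]
        for part in alternative:       # concatenate expansions in order
            seqs = [s + tail for s in seqs for tail in expand(part)]
        results.extend(seqs)
    return results

SENTENCES = {tuple(s) for s in expand("<sentence>")}
print(("yes", "please") in SENTENCES)  # True
print(("maybe",) in SENTENCES)         # False
```

A grammar-based LM gives a hard yes/no on word sequences, whereas the statistical LM described next assigns graded probabilities instead.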
Active Vocabulary
• Lists words that can be recognized by the
acoustic model
• That are allowed to occur given the
language model
• Each word associated with a phonetic
transcription
– Enumerated, and/or
– Generated by a Grapheme-to-Phoneme (G2P)
module
Result
• N-Best List:
– Lists word sequences with a score, based on AM and LM
– Sorted descending by this score
– At most N entries
Post Processing
• Re-ordering of N-best list using other
criteria: e.g. credit card numbers, telephone
numbers
• If one result is needed, select top element
• Applying NLP techniques that fall outside
the scope of the statistical language model
– E.g. “three dollars fifty cents” => “$ 3.50”
– “doctor Jones” => “Dr. Jones”
– Etc.
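Rewrite rules of this kind are often implemented as simple pattern substitutions applied to the recognizer output. The sketch below covers only the two examples on this slide; a production post-processor would need a much larger, carefully ordered rule set.

```python
import re

# Sketch of post-processing rewrite rules applied to recognized text,
# outside the scope of the statistical language model. The two rules
# mirror the examples above and are not exhaustive.

RULES = [
    (re.compile(r"\bdoctor (\w+)", re.IGNORECASE), r"Dr. \1"),
    (re.compile(r"\bthree dollars fifty cents\b"), "$ 3.50"),
]

def post_process(text):
    """Apply each rewrite rule in order to the recognized text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(post_process("that will be three dollars fifty cents"))
# -> "that will be $ 3.50"
print(post_process("doctor Jones will see you"))
# -> "Dr. Jones will see you"
```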
Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications
How to get AM and LM
• AM
– Annotated speech database, and
– Pronunciation dictionary
• LM
– Handwritten grammar, or
– Large text corpus
Training of Acoustic Models
[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Annotated Speech Database
• Must contain speech covering
– all units: phonemes, context dependent
phonemes
– population
• region, dialect, age, gender, …
– relevant environment(s)
• car, office,..
– Relevant channel(s)
• Fixed phone, mobile phone, desktop computer, …
Annotated Speech Database
• Must contain transcription of speech
– At least orthographic
• Must include markers for
– Speech by others
– Other non-speech sounds
– Unfinished words, mispronunciations,
stuttering, etc.
Pronunciation Dictionary
• List of all words occurring in speech
database
– With one or more phonetic transcriptions
• Or: Grapheme-To-Phoneme (G2P) module
– Graphemes => phonemes
– E.g. boek => [,b u k ,]
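A G2P module can be sketched as greedy longest-match rewriting of spelling into phonemes. The rule table below contains only enough rules for the example ‘boek’; a real G2P module needs far larger rule sets, or a model trained on a pronunciation dictionary.

```python
# Toy grapheme-to-phoneme sketch: greedy longest-match rewriting of
# graphemes into phonemes. Rules cover only the example word 'boek'.

RULES = {"oe": ["u"], "b": ["b"], "k": ["k"]}

def g2p(word):
    """Rewrite a spelled word into a phoneme list using RULES."""
    phonemes, i = [], 0
    while i < len(word):
        # Try the longest grapheme first (here: 2 letters, then 1).
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return phonemes

print(g2p("boek"))  # ['b', 'u', 'k']
```

The longest-match order matters: trying ‘o’ and ‘e’ separately before ‘oe’ would produce the wrong transcription for Dutch spelling.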
Training of Acoustic Models
For all utterances in the database:
– Make a phonetic transcription of the utterance
– Use the current models to segment the utterance file: assign a phoneme to each speech frame
– Collect statistical information: count prototype–phoneme occurrences
– Create new models
Language Model
• Large text corpus
– Relevant for the intended application(s)
• Count frequencies
– of individual words (unigrams)
– of sequences of two words (bigrams)
– of sequences of three words (trigrams)
• Derive probabilities from the frequencies
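The counting steps above can be sketched on a tiny invented corpus; a real language model is trained on millions of words and needs smoothing to handle sequences never seen in the corpus.

```python
from collections import Counter

# Sketch of n-gram counting for a statistical language model.
# The corpus is invented and far too small for real use.

corpus = "yes please no thanks yes please no thank you yes".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_bigram(w1, w2):
    """Derive a probability from frequencies:
    P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# "yes" occurs 3 times, "yes please" twice, so P(please | yes) = 2/3.
print(p_bigram("yes", "please"))
```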
Spoken and Written Data
• Produce them yourself
• ELRA:
– http://catalog.elra.info/index.php?cPath=37
– http://catalog.elra.info/index.php?cPath=42
• LDC http://www.ldc.upenn.edu/
• TST-Centrale
– http://www.inl.nl/nl/producten?task=view
– E.g. the Corpus Gesproken Nederlands (Spoken Dutch Corpus)
• Usually pretty expensive!
Key Element in ASR
• ASR is based on learning from observations
– Huge amount of spoken data needed for making
acoustic models
– Huge amount of text data needed for making
language models
• => Lots of statistics, few rules
Overview
• What is ASR?
• Why is it difficult?
• How does it work?
• How to make a speech recognizer?
• Example Applications
Applications
•
•
•
•
•
•
Dictation
Audio Mining
Subtitling
Telephone Services
Destination entry on GPS systems
PDA/smartphone services
Dictation
• Convert speech into correctly written text
• Typically desktop application in (silent)
office, using close-talk microphone
• Used very often in medical environment
(pathologists, radiologists)
• Also in legal domains, police reports, etc.
– Using dedicated language models
Dictation
• Contains a speaker-adaptive Acoustic
model
• Adapts to the user’s speech upon use
• Major Vendor: Nuance
http://www.nuance.com
• Product: Dragon NaturallySpeaking
http://www.nuance.com/for-individuals/byproduct/dragon-for-pc/index.htm
Dictation
• MS Dictation system
– in each MS Windows OS
– http://www.microsoft.com/enable/products/windowsvista/speech.aspx
– But hardly used
• Other earlier vendors (Philips, IBM, L&H)
stopped or were acquired by Nuance
Audiomining
• Recognize speech
– Create text transcription (possibly imperfectly),
or
– Align to existing text transcription
• Make index of recognized words
– With links to the relevant speech fragments
• => speech is now searchable on the basis of
keywords entered
Audiomining
• Examples
– Journaaldemo
• Search in TV news
• http://hmi.ewi.utwente.nl/showcases/Broadcastnews-demo
– Buchenwald
• Search in interviews with Buchenwald survivors
• http://www.buchenwald.nl/
Audiomining
• Examples (cont.)
– Radio Oranje
• Search in Queen Wilhelmina’s speeches (1940-45)
• Uses alignment to existing transcription text
• http://niod.al-m.nl/nl/thema/10/
Subtitling
• Example
– NEON (NEderlandstalige ONdertiteling)
• Cooperation between NPO and VRT
• Use ASR to align speech with transcripts to
efficiently create subtitles
• http://www.kennislink.nl/publicaties/computer-gaattv-programmas-ondertitelen
• http://www.youtube.com/watch?v=l0wf8gptic&feature=player_embedded#!
• Local backup
Destination Entry
• Example
– TomTom GO 520 / 720 / 930
• http://www.youtube.com/watch?v=z01zyfB0CrA
• Say city, say street, say or enter number
– Returns 10 best candidates, select correct one
– Saves a lot of clumsy typing
– works in a (driving) car
Telephone services
• Police 0900 8844
– http://www.telecats.nl/nieuws/spraakherkenning-politie-0900-8844/
• AEGON
– http://www.telecats.nl/nieuws/spraakherkenning-bij-aegon/
PDA/Smartphone
• Examples
– Dragon Dictate
• SMS, e-mail dictation
• http://www.dragonmobileapps.com/applications.html
– Jibbigo http://www.jibbigo.com
• Speech-to-speech translation (iPhone, Android)
• http://www.phonedog.com/2009/10/30/iphone-appjibbigo-speech-translator/
PDA/Smartphone
• Examples
– Dragon Search
• Search using speech input
• http://www.dragonmobileapps.com/apple/search.html
– Google Voice Search
• Search using speech input (also in Dutch)
• Also uses location information
• http://www.google.com/mobile/
• Read more? Kennislink!
– http://www.kennislink.nl/publicaties/taal-enspraaktechnologie
– http://www.kennislink.nl/publicaties/hetluisterend-oor-van-de-computer
• Work with speech recognition yourself?
– http://www.spraak.org/
Thanks for your attention