Introduction to Automatic Speech Recognition Lectures: Jim Glass & guest lecturers

advertisement
Lecture # 1
Session 2003
Introduction to Automatic Speech Recognition
• Lectures: Jim Glass & guest lecturers
• Introduction to ASR
– Problem definition
– State of the art examples
• Course overview
–
–
–
–
Lecture outline
Assignments
Term Project
Grading
6.345 Automatic Speech Recognition
Introduction 1
Communication via Spoken Language
Human
Input
Output
Speech
Speech
Recognition
Synthesis
Computer
Text
Text
Generation
Understanding
Meaning
Meaning
6.345 Automatic Speech Recognition
Introduction 2
Virtues of Spoken Language
Natural:
Flexible:
Efficient:
Economical:
Requires no special training
Leaves hands and eyes free
Has high data rate
Can be transmitted/received inexpensively
Speech
Speech interfaces
interfaces are
are ideal
ideal for
for information
information access
access and
and
management
management when:
when:
•• The
The information
information space
space isis broad
broad and
and complex,
complex,
•• The
The users
users are
are technically
technically naive,
naive, or
or
•• Only
Only telephones
telephones are
are available.
available.
6.345 Automatic Speech Recognition
Introduction 3
Diverse Sources of Constraint for
Spoken Language Communication
Acoustic:
human vocal tract
Phonetic:
let us pray
lettuce spray
Phonological:
gas shortage
fish sandwich
Phonotactic:
blit vnuk
Syntactic:
I am flying to Chicago tomorrow
tomorrow I flying Chicago am to
Semantic:
Is the baby crying
Is the bay bee crying
Contextual:
It is easy to recognize speech
It is easy to wreck a nice beach
6.345 Automatic Speech Recognition
Introduction 4
Automatic Speech Recognition
ASR
ASR
System
System
Speech
Signal
Recognized
Words
• An ASR system converts the speech signal into words
• The recognized words can be
– The final output, or
– The input to natural language processing
6.345 Automatic Speech Recognition
Introduction 5
Application Areas for Speech Based Interfaces
• Mostly input (recognition only)
– Simple command and control
– Simple data entry (over the phone)
– Dictation
• Interactive conversation (understanding needed)
– Information kiosks
– Transactional processing
– Intelligent agents
6.345 Automatic Speech Recognition
Introduction 6
Basic Speech Recognition Challenges
• Co-articulation
• Speaker independence
– Dialect variations
– Non-native speakers
• Spontaneous speech
– Disfluencies
– Out-of-vocabulary words
• Language modelling
• Noise robustness
6.345 Automatic Speech Recognition
Introduction 7
Phonological Variation Example
• The acoustic realization of a phoneme depends strongly on
the context in which it occurs
Frequency
Time
TEA
6.345 Automatic Speech Recognition
TREE
STEEP
CITY
BEATEN
Introduction 8
Examples Contrasting
Read and Spontaneous Speech (Navigation Domain)
Filled and unfilled pauses:
Lengthened words:
False starts:
6.345 Automatic Speech Recognition
read, spontaneous
read, spontaneous
read, spontaneous
Introduction 9
Sometimes Real Data will Dictate
Technology Requirements (City Name Domain)
Technology Required
Simple word spotting
Complex word spotting
Speech understanding
6.345 Automatic Speech Recognition
Example
Um, Braintree
Eh yes, Avis rent-a-car in
Boston
Hello, please Brighton,
uh, can I have the number
of Earthscape, in, uh, on
Nonantum Street
Woburn, uh, Somerville.
I'm sorry
Introduction 10
Parameters that Characterize the
Capabilities of ASR Systems
Parameters
Parameters
Speaking
Speaking Mode:
Mode:
Speaking
Speaking Style:
Style:
Enrollment:
Enrollment:
Vocabulary:
Vocabulary:
Language
Language Model:
Model:
Perplexity:
Perplexity:
SNR:
SNR:
Transducer:
Transducer:
6.345 Automatic Speech Recognition
Range
Range
Isolated
Isolated word
word to
to continuous
continuous speech
speech
Read
Read speech
speech to
to spontaneous
spontaneous speech
speech
Speaker-dependent
Speaker-dependent to
to speaker-independent
speaker-independent
Small
Small (<20
(<20 words)
words) to
to large
large (>50,000
(>50,000 words)
words)
Finite-state
Finite-state to
to context-sensitive
context-sensitive
Small
Small (<10)
(<10) to
to large
large (>200)
(>200)
High
High (>30dB)
(>30dB) to
to low
low (<10dB)
(<10dB)
Noise-cancelling
Noise-cancelling microphone
microphone to
to cell
cell phone
phone
Introduction 11
ASR Trends*: Then and Now
before
before mid
mid 70's
70's
mid
mid 70’s
70’s -- mid
mid 80’s
80’s
after
after mid
mid 80’s
80’s
Recognition
Recognition
Units:
Units:
whole-word
whole-word and
and
sub-word
sub-word units
units
sub-word
sub-word units
units
sub-word
sub-word units
units
Modeling
Modeling
Approaches:
Approaches:
heuristic
heuristic and
and
ad
ad hoc
hoc
template
template matching
matching
mathematical
mathematical
and
and formal
formal
rule-based
rule-based and
and
declarative
declarative
deterministic
deterministic and
and
data-driven
data-driven
probabilistic
probabilistic
and
and data-driven
data-driven
Knowledge
Knowledge
Representation:
Representation:
heterogeneous
heterogeneous
and
and complex
complex
homogeneous
homogeneous
and
and simple
simple
homogeneous
homogeneous
and
and simple
simple
Knowledge
Knowledge
Acquisition:
Acquisition:
intense
intense knowledge
knowledge
engineering
engineering
embedded
embedded in
in
simple
simple structure
structure
automatic
automatic
learning
learning
* There are, of course, many exceptions.
6.345 Automatic Speech Recognition
Introduction 12
Speech Recognition: Where Are We Now?
• High performance, speaker-independent speech recognition
is now possible
– Large vocabulary (for cooperative speakers in benign
environments)
– Moderate vocabulary (for spontaneous speech over the phone)
• Commercial recognition systems are now available
– Dictation (e.g., Dragon, IBM, L&H, Philips) Scansoft
– Telephone transactions (e.g., AT&T, Nuance, Philips,
SpeechWorks, TellMe, etc.)
• When well-matched to applications, technology is able to
help perform real work
6.345 Automatic Speech Recognition
Introduction 13
Examples of ASR Performance
1K, Read
2K, Sponaneous
20K, Read
64K, Broadcast
10K, Conversational
100
Word Error Rate (%)
10
1
2001
1999
1997
1995
1993
1991
1989
0.1
1987
• Speaker-independent, continuousspeech ASR now possible
• Digit recognition over the telephone
with word error rate of 0.3%
• Error rate cut in half every two years
for moderate vocabulary tasks
• Error for spontaneous speech more
than twice that of read speech
• Conversational speech, involving
multiple speakers and poor acoustic
environment, remains a challenge
• Tens of hours of training data to
port to a different domain
• Statistical modeling using automatic
training achieves significant
advances
Digits
Year
6.345 Automatic Speech Recognition
Introduction 14
Important Lessons Learned
• Statistical modeling and data-driven approaches have
proved to be powerful
• Research infrastructure is crucial:
– Large amounts of linguistic data
– Evaluation methodologies
• Availability and affordability of computing power lead to
shorter technology development cycles and real-time
systems
• Performance-driven paradigm accelerates technology
development
• Interdisciplinary collaboration produces enhanced
capabilities (e.g., spoken language understanding)
6.345 Automatic Speech Recognition
Introduction 15
Major Components in a Speech Recognition System
Training Data
Applying
Acoustic
Models
Constraints
Lexical
Models
Speech
Signal
Language
Models
Recognized
Words
Representation
Search
• Speech recognition is the problem of deciding on
– How to represent the signal
– How to model the constraints
– How to search for the most optimal answer
6.345 Automatic Speech Recognition
Introduction 16
Demo: Continuous Dictation
• IBM ViaVoice running on a ThinkPad
• Trained for a quiet office (classroom performance not optimal)
6.345 Automatic Speech Recognition
Introduction 17
Demo: Simple Telephone Transactions
• Developed by SpeechWorks International (there are others)
• Shipping cost information for Fedex (1-800-GO-FEDEX)
– Provides information on:
* Package types
* Source and destination zip codes
* Weight, size, value
* Service type
– Handles all US rate information calls
• Automated Brokerage System for E*Trade
– Supports quotes and trades
* Using symbols or names
* For stocks, options, and mutual funds
– Users can “barge in” at any time
– Nationwide deployment for over 450,000 customers
6.345 Automatic Speech Recognition
Introduction 18
Conversational Interfaces: The Next Generation
• Enables us to converse with machines (in much the same
way we communicate with one another) in order to create,
access, and manage information and to solve problems
• Augments speech recognition technology with natural
language technology in order to understand the verbal input
• Can engage in a dialogue with a user during the interaction
• Uses natural language to speak the desired response
• Is what Hollywood and every “futurist” says we should
have!
6.345 Automatic Speech Recognition
Introduction 19
A Conversational System Architecture
SPEECH
SYNTHESIS
Sentence
LANGUAGE
GENERATION
Speech
Graphs
& Tables
Speech
DIALOGUE
MANAGEMENT
DISCOURSE
CONTEXT
DATABASE
Meaning
Representation
Meaning
SPEECH
RECOGNITION
6.345 Automatic Speech Recognition
Words
LANGUAGE
UNDERSTANDING
Introduction 20
Demo: Conversational Interface
• Jupiter weather information system
– Access through telephone
– 500 cities worldwide
– Harvest weather information from the Web several times daily
Jupiter
A conversational interface for on-line
weather information over the phone.
1-888-573-8255
(outside the USA: 1-617-258-0300)
http://www.sls.lcs.mit.edu/jupiter
Spoken Language Systems Group,
MIT Laboratory for Computer Science
6.345 Automatic Speech Recognition
Introduction 21
45
40
35
30
25
20
15
10
5
0
100
Word
Data
10
‘97
Apr
‘98
May
Jun
Jul
Aug
Nov
Apr
‘99
Nov
1
Training Data (x1000)
Error Rate (%)
(Real) Data Improves Performance (Weather Domain)
May
• Longitudinal evaluations show improvements
• Collecting real data improves performance:
– Enables increased complexity and improved robustness for
acoustic and language models
– Better match than laboratory recording conditions
• Users come in all kinds
6.345 Automatic Speech Recognition
Introduction 22
But We Are Far from Done!
Corpus
Corpus
Digit
Digit Strings
Strings (phone)
(phone)
Speech
Speech
Type
Type
10
10
0.3
0.3
0.009
0.009
Resource
Resource Management
Management read
read
1000
1000
3.6
3.6
0.1
0.1
ATIS
ATIS
spontaneous
spontaneous
2000
2000
22
----
Wall
Wall Street
Street Journal
Journal
read
read
64000
64000
6.6
6.6
11
Radio
Radio News
News
mixed
mixed
64000
64000
13.5
13.5
----
Switchboard
Switchboard (phone)
(phone)
conversation
conversation
10000
10000
19.3
19.3
44
Call
Call Home
Home (phone)
(phone)
conversation
conversation
10000
10000
30
30
----
6.345 Automatic Speech Recognition
spontaneous
spontaneous
Lexicon
Lexicon Word
Word Error
Error Human
Human Error
Error
Size
Rate
Rate
Size
Rate (%)
(%)
Rate (%)
(%)
Introduction 23
Course Outline
Paralinguistic
Paralinguistic Information
Information
Speech
Understanding
Speech Understanding
Multi-Modal
Multi-Modal Interfaces
Interfaces
Acoustic
Acoustic Theory
Theory of
of
Speech
Speech Production
Production
AcousticAcousticPhonetic
Phonetic
Modeling
Modeling
Pattern
Pattern
Recognition
Recognition
Finite-State
Finite-State
Transducers
Transducers
Language
Language
Modeling
Modeling
Robust
Robust
ASR
ASR
Acoustic
Models
Lexical
Models
Language
Models
Adaptation
Adaptation
Speech
Signal
Properties
Properties of
of
Speech
Speech Sounds
Sounds
Representation
Search
Signal
Signal
Representation
Representation
Search
Search
Algorithms
Algorithms
Vector
Vector Quantization
Quantization
&
Clustering
& Clustering
6.345 Automatic Speech Recognition
Recognized
Words
Hidden
Hidden Markov
Markov
Modeling
Modeling
Graphical
Graphical
Models
Models
Segmental
Segmental
Models
Models
Introduction 24
Course Logistics
• Lectures:Two sessions/week, 1.5 hours/session
• Labs: All week during school hours
Grading
• 9 Assignments
• 2 Quizzes
• Term Project (about 4 weeks)
6.345 Automatic Speech Recognition
45%
30%
25%
Introduction 25
Assignments
• There will be 9 weekly assignments
– Problems that expand on the lecture material
– Lab assignments to reinforce the lecture material
– Assignments are due the following week on Wednesday
• Lab work will be done in the computer lab
• Lab sign-up (on the course web page) is necessary
• Solutions will be provided
6.345 Automatic Speech Recognition
Introduction 26
Term Project
• Investigate a contrasting condition in an ASR experiment
• We will provide different recognizers and domains for you to
select from, and will work with you to select a topic
• You choose:
–
–
–
–
Evaluation condition: e.g., phonetic classification, word recognition)
Database (e.g., TIMIT, RM, Jupiter, Aurora, …)
Recognizer (e.g., Sphinx, Summit, GMTK, …)
Contrasting condition (e.g., signal representation, acoustic model,
language model)
• Requirements:
–
–
–
–
Proposal
Experiments (the bulk of the work)
Write-up
Presentation on extended last day of class
6.345 Automatic Speech Recognition
Introduction 27
References (on reserve at Barker)
• Huang, Acero, & Hon, Spoken Language Processing,
Prentice-Hall, 2001.
• Jelinek, Statistical Methods for Speech Recognition, MIT
Press, 1997.
• Rabiner & Juang, Fundamentals of Speech Recognition,
Prentice-Hall, 1983.
• Duda, Hart, & Stork, Pattern Classification, Wiley & Sons,
2001.
• Stevens, Acoustic Phonetics, MIT Press, 1998.
6.345 Automatic Speech Recognition
Introduction 28
Download