Automatic Speaker Recognition

Seminar
Speech Recognition
2003
E.M. Bakker
LIACS Media Lab
Leiden University
LIACS Media Lab Leiden University
Speech Recognition 2003
Outline
• Introduction and State of the Art
• A Speech Recognition Architecture
– Acoustic modeling
– Language modeling
– Practical issues
• Applications
NB Some of the slides are adapted from the presentation “Can Advances in Speech
Recognition make Spoken Language as Convenient and as Accessible as Online Text?”, an
excellent presentation by Dr. Patti Price, Speech Technology Consulting, Menlo Park,
California 94025, and Dr. Joseph Picone, Institute for Signal and Information Processing,
Dept. of Elect. and Comp. Eng., Mississippi State University.
Research Areas
•Speech Analysis (Production, Perception, Parameter Estimation)
•Speech Coding/Compression
•Speech Synthesis (TTS)
•Speaker Identification/Recognition/Verification (Sprint, TI)
•Language Identification (Transparent Dialogue)
•Speech Recognition (Dragon, IBM, ATT)
Speech recognition sub-categories:
•Discrete/Connected/Continuous Speech/Word Spotting
•Speaker Dependent/Independent
•Small/Medium/Large/Unlimited Vocabulary
•Speaker-Independent Large Vocabulary Continuous Speech Recognition (or LVCSR
for short :)
Introduction
What is Speech Recognition?
Goal: Automatically extract the string of words spoken from the speech signal
Speech Signal → Speech Recognition → Words (“How are you?”)
• Other interesting areas:
– Who the talker is (speaker recognition, identification)
– Speech output (speech synthesis)
– What the words mean (speech understanding, semantics)
Introduction
Applications
•Command and control
–Manufacturing
–Consumer products
http://www.speech.philips.com
• Database query
– Resource management
– Air travel information
– Stock quote
Nuance, American Airlines:
1-800-433-7300, touch 1
•Dictation
–http://www.lhsl.com/contacts/
–http://www-4.ibm.com/software/speech
–http://www.microsoft.com/speech/
Introduction: State of the Art
Speech-recognition software
• IBM (Via Voice, Voice Server Applications,...)
– Speaker independent, continuous command recognition
– Large vocabulary recognition
– Text-to-speech confirmation
– Barge in (the ability to interrupt an audio prompt as it is playing)
• Dragon Systems, Lernout & Hauspie (L&H Voice Xpress™ (:( )
• Philips
– Dictation
– Telephone
– Voice Control (SpeechWave, VoCon SDK, chip-sets)
• Microsoft (Whisper, Dr Who)
Introduction: State of the Art
Speech over the telephone:
• AT&T Bell Labs pioneered the use of speech-recognition systems for telephone transactions.
• Companies such as Nuance, Philips and SpeechWorks have been active in this field for some years now.
• IBM applications over the telephone:
– request news, internet pages, e-mail
– stock quotes, traveling info
– weather information
Introduction: State of the Art
Speech over the telephone (Philips):
• SpeechPearl®: large vocabulary natural language recognition (up to 200,000 words)
• SpeechMania®: mixed initiative dialog gives the caller the impression of a truly natural dialogue: full replacement of the human operator.
• SpeechWave™: relatively small vocabularies (up to hundreds of words) available in nearly 40 languages
• Voice ReQuest: the system recognizes the request and routes the call to the appropriate extension, all without the intervention of an operator.
Introduction: State of the Art
Speech over the telephone:
• Brokerage: E*Trade,...
• Banking
• Travel: United Airlines, Continental Airlines
• BellSouth
• HP
• Federal Express
• Foodline
Introduction: State of the Art
The following is a transcript of a telephone conversation
between Victor Zue and the Jupiter system for weather
information. Jupiter's toll-free number is 1-888-573-8255:
• [Phone ringing]
• Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab for Computer Science Weather Information System. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
• V.Z.: What is the forecast for Boston today?
• Jupiter: In Boston this afternoon, partly sunny and windy. High around 50. Tonight, clear, lows around 35. What other information can I give you?
• V.Z.: Will it rain tomorrow?
• Jupiter: To my knowledge, the forecast calls for no rain in Boston tomorrow. Is there anything else?
Introduction: State of the Art
• V.Z.: How about Seattle?
• Jupiter: To my knowledge, the forecast calls for no rain in Seattle tomorrow. Is there something else you would like to know?
• V.Z.: Is it sunny anywhere in the Caribbean?
• Jupiter: Sorry, I am not sure what you said.
• V.Z.: Is it sunny anywhere in the Caribbean?
• Jupiter: I am sorry, I may not know the city you are asking about. Try asking for the state or country. For example, what cities do you know about in Massachusetts?
• V.Z.: What is the temperature in Montego Bay?
• Jupiter: In Montego Bay today, high 86 and low 73. Is there something else?
• V.Z.: Good-bye.
Factors that Affect Performance
of Speech Recognition Systems
[Figure: capability levels for the years 1985, 1995, 2000 and 2005 along four dimensions]
• NOISE ENVIRONMENT: quiet room, fixed high-quality mic; normal office, various microphones, telephone; vehicle noise, radio, cell phones; wherever speech occurs
• USER POPULATION: speaker-dependent; speaker-independent and adaptive; regional accents, native speakers, competent foreign speakers; all speakers of the language, including foreign
• SPEECH STYLE: careful reading; planned speech; natural human-machine dialog (user can adapt); all styles, including human-human (unaware)
• COMPLEXITY: application-specific speech and language, expert years to create an app-specific language model; some application-specific data and one engineer year; application-independent or adaptive
How Do You Measure the Performance?
USC, October 15, 1999: “the world's first machine system that
can recognize spoken words better than humans can.”
“In benchmark testing using just a few spoken words, USC's
Berger-Liaw … System not only bested all existing
computer speech recognition systems but outperformed
the keenest human ears.”
• What benchmarks?
• What was training?
• What was the test?
• Were they independent?
• How large was the vocabulary and the sample size?
• Did they really test all existing systems?
• Is that different from chance?
• Was the noise added or coincident with speech?
• What kind of noise? Was it independent of the speech?
Evaluation Metrics
Word Error Rate (WER)
[Figure: Word Error Rate (0% to 40%) versus level of difficulty. Tasks shown: Digits, Continuous Digits, Letters and Numbers, Command and Control, Read Speech, Broadcast News, Conversational Speech]
• Spontaneous telephone speech is still a “grand challenge”.
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.
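A minimal sketch of how a WER figure can be computed, assuming the usual definition (substitutions + deletions + insertions, divided by the number of reference words) and a word-level edit-distance alignment. The function name and the example sentences are illustrative and not taken from any benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits that turn the first i reference
    # words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                  # deletion
                dist[i][j - 1] + 1,                  # insertion
                dist[i - 1][j - 1] + substitution,   # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# One substitution (boston -> austin) and one deletion (today): 2 errors / 7 words.
wer = word_error_rate("what is the forecast for boston today",
                      "what is the forecast for austin")
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.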
Evaluation Metrics
Human Performance
[Figure: Word Error Rate (0% to 20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, Quiet) on Wall Street Journal with additive noise. Curves: Machines; Human Listeners (Committee)]
• Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).
Evaluation Metrics
Machine Performance
[Figure: Word Error Rate (log scale, 1% to 100%) by year, 1988 to 2003. Labels in the figure include: Read Speech (1k, 5k and 20k vocabularies), Spontaneous Speech, Conversational Speech, Broadcast Speech, Varied Microphones, Noisy, (Foreign), 10 X]
• A Word Error Rate (WER) below 10% is considered acceptable.
• Performance in the field is typically 2x to 4x worse than performance on an evaluation.
What does a speech signal look like?
Spectrogram
Speech Recognition
Recognition Architectures
Why Is Speech Recognition So Difficult?
• Comparison of “aa” in “lOck” vs. “iy” in “bEAt” for conversational speech (SWB)
[Figure: scatter plot of Feature No. 1 versus Feature No. 2 with overlapping classes Ph_1, Ph_2, Ph_3]
• Measurements of the signal are ambiguous.
• Region of overlap represents classification errors.
• Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
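The link between overlap and classification error can be made concrete with a small sketch. It assumes a deliberately simplified setting that is not in the slides: one feature, two phone classes with equal priors, and Gaussian distributions sharing the same standard deviation. Under those assumptions the minimum achievable error is the area of the overlap on the wrong side of the midpoint between the means.

```python
import math


def bayes_error(mean_a: float, mean_b: float, sigma: float) -> float:
    """Minimum error for two equal-prior 1-D Gaussians with a shared sigma.

    The optimal decision boundary is the midpoint between the means, so the
    error equals Phi(-d / (2 * sigma)) with d the distance between the means.
    """
    distance = abs(mean_a - mean_b)
    return 0.5 * math.erfc(distance / (2.0 * math.sqrt(2.0) * sigma))


# Broad class distributions: large overlap, many errors.
broad = bayes_error(0.0, 2.0, sigma=1.0)
# Narrower distributions around the same means: much less overlap.
narrow = bayes_error(0.0, 2.0, sigma=0.5)
```

Halving the spread in this toy setting cuts the error from roughly 16% to roughly 2%. Modeling phones in context aims at a comparable effect, since each context-specific model has to cover less variation than one model for all contexts.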
Overlap in the cepstral space
(alphadigits)
Female “aa”
Female “iy”
Male “aa”
Male “iy”
Overlap in the cepstral space
(alphadigits)
Male “aa” (green) vs.
Female “aa” (black)
Male “iy” (blue) vs.
Female “iy” (red)
•Combined Comparisons:
•Male "aa" (green)
•Female "aa" (black)
•Male "iy" (blue)
•Female "iy" (red)
OVERLAP IN THE CEPSTRAL SPACE
(SWB-All)
The following plots demonstrate overlap of recognition features in the cepstral
space. These plots consist of all vowels excised from tokens in the
SWITCHBOARD conversational speech corpus.
[Figure panels: All Male Vowels, All Female Vowels, All Vowels]
Recognition Architectures
A Communication Theoretic Approach
Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel
Observable: Message → Words → Sounds → Features
Bayesian formulation for speech recognition:
• P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate
Approach: maximize P(W|A) during training
Components:
• P(A|W) : acoustic model (hidden Markov models, mixtures)
• P(W) : language model (statistical, finite state networks, etc.)
The language model typically predicts a small set of next words based on
knowledge of a finite number of previous words (N-grams).
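The Bayesian formulation can be illustrated with a toy decoder. Since P(A) is the same for every candidate word sequence W, choosing the W that maximizes P(W|A) amounts to maximizing P(A|W) P(W), usually computed as a sum of log probabilities. The two candidate sentences and all probabilities below are made-up values for illustration only.

```python
import math


def decode(acoustic_loglik, language_logprob):
    """Return argmax over W of log P(A|W) + log P(W).

    P(A) is omitted because it does not depend on W.
    """
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + language_logprob[w])


# Hypothetical scores: the acoustics slightly prefer the wrong sentence,
# but the language model finds it far less plausible.
acoustic = {
    "recognize speech": math.log(0.0020),
    "wreck a nice beach": math.log(0.0025),
}
language = {
    "recognize speech": math.log(0.0100),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(acoustic, language)
acoustic_only = max(acoustic, key=acoustic.get)
```

Here `acoustic_only` picks "wreck a nice beach", while combining both models selects "recognize speech", which is the role the language model plays in the formulation.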
Recognition Architectures
Incorporating Multiple Knowledge Sources
[Diagram: Input Speech → Acoustic Front-end → Search → Recognized Utterance; the Search block draws on the Acoustic Models P(A|W) and the Language Model P(W)]
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• The language model predicts the next set of words, and controls which models are hypothesized.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
Acoustic Modeling
Feature Extraction
[Diagram: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Time Derivative → Time Derivative. Outputs: Energy + Mel-Spaced Cepstrum; Delta Energy + Delta Cepstrum; Delta-Delta Energy + Delta-Delta Cepstrum]
• Typical: 512 samples (16 kHz sampling rate) =>
• Incorporate knowledge of the nature of speech sounds in measurement of the features.
• Utilize rudimentary models of human perception.
• Use a ~30 msec window for frequency domain analysis.
• Include absolute energy and 12 spectral measurements.
• Time derivatives to model spectral change.
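A rough sketch of this kind of front-end, under several simplifying assumptions: a naive DFT stands in for the FFT, the mel warping and perceptual weighting stages are left out, frames do not overlap, and the derivative is a plain two-frame difference. All function names are hypothetical. At 16 kHz, a 512-sample frame lasts 512 / 16000 = 32 ms, in line with the ~30 msec window; energy plus 12 coefficients, with first and second derivatives, gives 39 values per frame.

```python
import cmath
import math


def frame_features(frame, num_ceps=12):
    """Log energy plus `num_ceps` cepstrum-like coefficients for one frame."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    energy = math.log(sum(s * s for s in windowed) + 1e-10)
    # Naive DFT magnitude for bins 0..n/2 (an FFT would be used in practice).
    half = n // 2 + 1
    log_spec = []
    for k in range(half):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        log_spec.append(math.log(abs(acc) + 1e-10))
    # DCT-II of the log spectrum; keep coefficients 1..num_ceps.
    ceps = [sum(log_spec[m] * math.cos(math.pi * q * (m + 0.5) / half)
                for m in range(half))
            for q in range(1, num_ceps + 1)]
    return [energy] + ceps


def deltas(vectors):
    """Time derivative as a two-frame difference, clamped at the edges."""
    out = []
    for t in range(len(vectors)):
        prev = vectors[max(t - 1, 0)]
        nxt = vectors[min(t + 1, len(vectors) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


SAMPLE_RATE = 16000
FRAME = 512  # 32 ms at 16 kHz
# Synthetic 440 Hz tone as stand-in input, three frames long.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(FRAME * 3)]
frames = [signal[i:i + FRAME] for i in range(0, len(signal), FRAME)]

static = [frame_features(f) for f in frames]
delta = deltas(static)
delta_delta = deltas(delta)
features = [s + d + dd for s, d, dd in zip(static, delta, delta_delta)]
```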
Acoustic Modeling
Hidden Markov Models
• Acoustic models encode the
temporal evolution of the
features (spectrum).
• Gaussian mixture distributions
are used to account for
variations in speaker, accent,
and pronunciation.
• Phonetic model topologies are
simple left-to-right structures.
• Skip states (time-warping) and
multiple paths (alternate
pronunciations) are also
common features of models.
• Sharing model parameters is a
common strategy to reduce
complexity.
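The points above can be sketched with a tiny left-to-right model. The sketch assumes a three-state topology with self-loops, forward transitions and one skip transition, one-dimensional observations, and Gaussian mixture emissions; every number is made up for illustration. The forward recursion sums over all state paths to obtain the likelihood of an observation sequence.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mixture(x, components):
    """Gaussian mixture density; components are (weight, mean, variance)."""
    return sum(w * gaussian(x, m, v) for w, m, v in components)


# Left-to-right topology: self-loop, next state, and a skip from state 0 to 2.
TRANS = [
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# Hypothetical emission mixtures, one per state.
EMIT = [
    [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],
    [(1.0, 3.0, 1.0)],
    [(1.0, 6.0, 1.0)],
]


def forward_likelihood(obs):
    """P(obs | model), starting in state 0 and ending in the last state."""
    n = len(TRANS)
    alpha = [0.0] * n
    alpha[0] = mixture(obs[0], EMIT[0])
    for x in obs[1:]:
        alpha = [sum(alpha[i] * TRANS[i][j] for i in range(n)) * mixture(x, EMIT[j])
                 for j in range(n)]
    return alpha[-1]


in_order = forward_likelihood([0.2, 3.1, 5.9])
reversed_order = forward_likelihood([5.9, 3.1, 0.2])
```

Because the topology only moves left to right, the same observations in reversed order score far lower, which is how the model encodes temporal evolution.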
Acoustic Modeling
Parameter Estimation
[Diagram: Initialization → Single Gaussian Estimation → 2-Way Split → Mixture Distribution Reestimation → 4-Way Split → Reestimation → •••]
• Word-level transcription
• Supervises a closed-loop data-driven modeling
• Initial parameter estimation
• The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) are crucial.
• Batch mode parameter updates are typically preferred.
• Decision trees and the use of additional linguistic knowledge are used to optimize parameter-sharing and system complexity.
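The split-and-reestimate schedule can be sketched on one-dimensional data. This is a simplified illustration, not the training procedure of any particular system: it fits a plain Gaussian mixture with batch EM rather than HMM state distributions with Forward-Backward, and the split offset (0.2 standard deviations), the variance floor, the iteration count, and the data are all assumptions.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, comps):
    """One batch EM iteration; comps is a list of (weight, mean, variance)."""
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian(x, m, v) for w, m, v in comps]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate weight, mean and variance from the posteriors.
    updated = []
    for k in range(len(comps)):
        occupancy = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / occupancy
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / occupancy
        updated.append((occupancy / len(data), mean, max(var, 1e-3)))
    return updated


def split(comps, eps=0.2):
    """Replace each component by two copies with slightly shifted means."""
    out = []
    for w, m, v in comps:
        shift = eps * math.sqrt(v)
        out += [(w / 2, m - shift, v), (w / 2, m + shift, v)]
    return out


def train(data, num_splits=1, iters=200):
    # Single Gaussian estimation.
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    comps = [(1.0, mean, var)]
    # Each pass doubles the number of components, then re-estimates them.
    for _ in range(num_splits):
        comps = split(comps)
        for _ in range(iters):
            comps = em_step(data, comps)
    return comps


DATA = [-2.2, -2.0, -1.8, -2.1, -1.9, 1.9, 2.0, 2.1, 1.8, 2.2]
mixture = train(DATA, num_splits=1)
means = sorted(m for _, m, _ in mixture)
```

After the 2-way split, the two components settle on the two clusters in the toy data; calling `train` with `num_splits=2` would continue to a 4-way split.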
Language Modeling
Is A Lot Like Wheel of Fortune
Language Modeling
N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB):
• Most Common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least Common: “Abraham”, “Alastair”, “Acura”
Bigrams (SWB):
• Most Common: “you know”, “yeah SENT!”,
“!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least Common: “raw fish”, “moisture content”,
“Reagan Bush”
Trigrams (SWB):
• Most Common: “!SENT um-hum SENT!”,
“a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least Common: “you have parents”,
“you seen Brooklyn”
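A small sketch of how such N-gram statistics are gathered, assuming maximum-likelihood estimates P(w2 | w1) = count(w1, w2) / count(w1) and the sentence-boundary tokens `!SENT` and `SENT!` that appear in the lists above. The four-sentence corpus is invented for illustration and is not SWITCHBOARD data.

```python
from collections import Counter


def ngrams(words, n):
    """All n-grams of one sentence, with boundary tokens added."""
    padded = ["!SENT"] + words + ["SENT!"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def train_bigram(sentences):
    """Maximum-likelihood bigram probabilities P(w2 | w1)."""
    history = Counter()
    pairs = Counter()
    for words in sentences:
        for w1, w2 in ngrams(words, 2):
            pairs[(w1, w2)] += 1
            history[w1] += 1
    return {pair: count / history[pair[0]] for pair, count in pairs.items()}


CORPUS = [
    ["you", "know"],
    ["um-hum"],
    ["i", "think", "you", "know"],
    ["yeah"],
]
model = train_bigram(CORPUS)
p_know_given_you = model[("you", "know")]
p_you_at_start = model[("!SENT", "you")]
p_unseen = model.get(("raw", "fish"), 0.0)
```

Any bigram absent from the training data gets probability zero under this estimate, which is why rare pairs such as “raw fish” are a problem and why practical systems smooth their estimates.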
Language Modeling
Integration of Natural Language
• Natural language constraints
can be easily incorporated.
• Lack of punctuation and search
space size pose problems.
• Speech recognition typically
produces a word-level
time-aligned annotation.
• Time alignments for other levels
of information also available.
Implementation Issues
Dynamic Programming-Based Search
• Dynamic programming is used
to find the most probable path
through the network.
• Beam search is used to
control resources.
• Search is time synchronous
and left-to-right.
• Arbitrary amounts of silence
must be permitted between
each word.
• Words are hypothesized
many times with different
start/stop times, which
significantly increases
search complexity.
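These ideas can be sketched as a time-synchronous Viterbi search with beam pruning. The network below is a hypothetical toy: each word is a single state, a `sil` state with a self-loop permits any amount of silence between the words, and the per-frame scores are invented. At every frame, hypotheses scoring more than `beam` below the best one are discarded.

```python
import math


def viterbi_beam(frames, log_trans, start, beam):
    """Return (log score, state path) of the best surviving hypothesis.

    frames: list over time of {state: log p(observation | state)}
    log_trans: {state: {next_state: log transition probability}}
    """
    active = {start: (frames[0][start], [start])}
    for frame in frames[1:]:                      # left to right, frame by frame
        candidates = {}
        for state, (score, path) in active.items():
            for nxt, log_p in log_trans[state].items():
                new_score = score + log_p + frame[nxt]
                # Dynamic programming: keep only the best path into each state.
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        best = max(score for score, _ in candidates.values())
        # Beam pruning: drop hypotheses too far below the current best.
        active = {s: c for s, c in candidates.items() if c[0] >= best - beam}
    return max(active.values(), key=lambda c: c[0])


def logs(probabilities):
    return {k: math.log(v) for k, v in probabilities.items()}


LOG_TRANS = {
    "how": logs({"how": 0.5, "sil": 0.25, "are": 0.25}),
    "sil": logs({"sil": 0.5, "are": 0.5}),   # self-loop: silence of any length
    "are": logs({"are": 1.0}),
}
FRAMES = [logs(f) for f in [
    {"how": 0.90, "sil": 0.05, "are": 0.05},
    {"how": 0.10, "sil": 0.80, "are": 0.10},
    {"how": 0.10, "sil": 0.80, "are": 0.10},
    {"how": 0.05, "sil": 0.05, "are": 0.90},
]]

score, path = viterbi_beam(FRAMES, LOG_TRANS, "how", beam=10.0)
narrow_score, narrow_path = viterbi_beam(FRAMES, LOG_TRANS, "how", beam=1.0)
```

In this example the narrow beam keeps a single hypothesis at most frames and still finds the same path. In general a beam that is too tight can prune the path that would have turned out best, which is the trade-off behind using it to control resources.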
Implementation Issues
Cross-Word Decoding Is Expensive
• Cross-word Decoding: since word boundaries don’t occur in spontaneous
speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.
Implementation Issues
Search Is Resource Intensive
Megabytes of Memory: Feature Extraction (1M), Acoustic Modeling (10M), Language Modeling (30M), Search (150M)
Percentage of CPU: Feature Extraction 1%, Acoustic Modeling 59%, Language Modeling 15%, Search 25%
• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.
Applications
Conversational Speech
• Conversational speech collected over the telephone contains background
noise, music, fluctuations in the speech rate, laughter, partial words,
hesitations, mouth noises, etc.
• WER (Word Error Rate) has decreased from 100% to 30% in six years.
• Laughter
• Singing
• Unintelligible
• Spoonerism
• Background Speech
• No pauses
• Restarts
• Vocalized Noise
• Coinage
Applications
Audio Indexing of Broadcast News
Broadcast news offers some unique
challenges:
• Lexicon: important information in
infrequently occurring words
• Acoustic Modeling: variations in
channel, particularly within the same
segment (“in the studio” vs. “on
location”)
• Language Model: must adapt (“Bush,”
“Clinton,” “Bush,” “McCain,” “???”)
• Language: multilingual systems?
language-independent acoustic
modeling?
Applications
Automatic Phone Centers
• Portals: BeVocal, TellMe, HeyAnita
• VoiceXML 2.0
• Automatic Information Desk
• Reservation Desk
• Automatic Help-Desk
• With Speaker identification
• bank account services
• e-mail services
• corporate services
Applications
Real-Time Translation
• From President Clinton’s State of the Union address (January 27, 2000):
“These kinds of innovations are also propelling our remarkable prosperity...
Soon researchers will bring us devices that can translate foreign languages
as fast as you can talk... molecular computers the size of a tear drop with the
power of today’s fastest supercomputers.”
• Imagine a world where:
• You book a travel reservation from your cellular phone while driving in
your car without ever talking to a human (database query)
• You converse with someone in a foreign country and neither speaker
speaks a common language (universal translator)
• You place a call to your bank to inquire about your bank account and
never have to remember a password (transparent telephony)
• You can ask questions by voice and your Internet browser returns
answers to your questions (intelligent query)
• Human Language Engineering: a sophisticated integration of many speech and
language related technologies... a science for the next millennium.
Technology
Future Directions
[Timeline, 1960 to 2000: Analog Filter Banks, Dynamic Time-Warping, Hidden Markov Models]
Conclusions:
• supervised training is a good machine learning technique
• large databases are essential for the development of robust statistics
Challenges:
• discrimination vs. representation
• generalization vs. memorization
• pronunciation modeling
• human-centered language modeling
The algorithmic issues for the next decade:
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision Trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?