Components of Spoken Dialogue Systems
Julia Hirschberg
LSA 353
1
Recreating the Speech Chain
[Figure: the speech chain, pairing human levels (dialog, semantics, syntax, lexicon, morphology, phonetics, inner ear, acoustic nerve, vocal tract and articulators) with system components (spoken language understanding, speech recognition, speech synthesis, dialog management)]
2
The Illusion of Segmentation... or...
Why Speech Recognition is so Difficult
[Figure: a spoken phone number ("my number is ..."), annotated with its semantic frame (user:Roberto (attribute:telephone-num value:7360474)), a syntactic parse (VP, NPs), the word string, and the unsegmented phone sequence]
3
The Illusion of Segmentation... or...
Why Speech Recognition is so Difficult
[Figure: the phone-number example again, annotated with the sources of difficulty:]
• Limited vocabulary
• Ellipses and anaphors
• Multiple interpretations
• Speaker dependency
• Word variations
• Word confusability
• Context-dependency
• Coarticulation
• Noise/reverberation
• Intra-speaker variability
4
1980s -- The Statistical Approach
• Based on work on Hidden Markov Models
done by Leonard Baum at IDA, Princeton, in
the late 1960s
• Purely statistical approach pursued by Fred
Jelinek and Jim Baker, IBM T.J. Watson
Research: Ŵ = argmax_W P(A|W) P(W)
• Foundations of modern speech recognition
engines
– Acoustic HMMs
– Word trigrams: P(w_t | w_{t-1}, w_{t-2})
[Figure: a three-state left-to-right HMM with self-loop
probabilities a11, a22, a33 and transitions a12, a23;
photos of Fred Jelinek and Jim Baker]
• “No data like more data”
• “Whenever I fire a linguist, our system
performance improves” (1988)
• “Some of my best friends are linguists” (2004)
5
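The decision rule Ŵ = argmax_W P(A|W) P(W) can be illustrated with made-up numbers: the acoustic model may slightly prefer one transcription while the language-model prior P(W) overrules it. All probabilities below are invented for illustration.

```python
# Toy illustration of the noisy-channel decision rule
# W_hat = argmax_W P(A|W) * P(W). Numbers are invented.

# Hypothetical acoustic likelihoods P(A|W) for an ambiguous input
acoustic = {"recognize speech": 0.30, "wreck a nice beach": 0.35}
# Hypothetical language-model priors P(W)
prior = {"recognize speech": 0.010, "wreck a nice beach": 0.001}

def decode(candidates):
    """Pick the word string maximizing P(A|W) * P(W)."""
    return max(candidates, key=lambda w: acoustic[w] * prior[w])

best = decode(list(acoustic))
print(best)  # "recognize speech": 0.30*0.010 beats 0.35*0.001
```

Even though the acoustic model prefers "wreck a nice beach", the prior makes "recognize speech" win, which is exactly why the purely statistical formulation combines both terms.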
1980-1990 – Statistical approach becomes
ubiquitous
• Lawrence Rabiner, A Tutorial on
Hidden Markov Models and Selected
Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No.
2, February 1989.
6
1980s-Today – The Power
of Evaluation
[Figure: timeline, 1995-2004 and beyond, of the spoken
dialog industry: technology vendors (MIT, SRI,
SpeechWorks, Nuance), platform integrators, application
developers, hosting, tools, and standards]
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
7
Today’s State of the Art
• Low noise conditions
• Large vocabulary
– ~20,000-64,000 words (or more…)
• Speaker independent (vs. speaker-dependent)
• Continuous speech (vs. isolated-word)
• World’s best research systems:
– Human-human speech: ~13-20% Word Error
Rate (WER)
– Human-machine or monologue speech: ~3-5%
WER
8
Components of an ASR System
• Corpora for training and testing of components
• Representation for input and method of
extracting it
• Pronunciation Model
• Acoustic Model
• Language Model
• Feature extraction component
• Algorithms to search hypothesis space efficiently
9
Training and Test Corpora
• Collect corpora appropriate for recognition task
at hand
– Small speech + phonetic transcription to
associate sounds with symbols (Initial
Acoustic Model)
– Large (>= 60 hrs) speech + orthographic
transcription to associate words with sounds
(Acoustic Model)
– Very large text corpus to identify unigram and
bigram probabilities (Language Model)
10
Building the Acoustic Model
• Model likelihood of phones or subphones given
spectral features, pronunciation models, and
prior context
• Usually represented as HMM
– Set of states representing phones or other
subword units
– Transition probabilities on states: how likely is
it to see one phone after seeing another?
– Observation/output likelihoods: how likely is a
spectral feature vector to be observed from
phone state i, given phone state i-1?
11
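A toy version of such an acoustic HMM, with invented states, transition probabilities, and observation likelihoods (real systems emit spectral feature vectors, not two symbols), might look like:

```python
# Minimal sketch of an acoustic HMM for one phone, with invented
# numbers: states are subphones, transitions allow self-loops
# (a phone can span several frames) and left-to-right moves.
states = ["n1", "n2", "n3"]            # e.g. subphones of [n]
trans = {                               # a_ij = P(state_j | state_i)
    "n1": {"n1": 0.6, "n2": 0.4},
    "n2": {"n2": 0.5, "n3": 0.5},
    "n3": {"n3": 0.7, "end": 0.3},
}
# b_i(o) = P(observation | state); real systems use Gaussian
# mixtures over spectral features, here just two toy symbols.
emit = {
    "n1": {"lo": 0.8, "hi": 0.2},
    "n2": {"lo": 0.3, "hi": 0.7},
    "n3": {"lo": 0.5, "hi": 0.5},
}

def sequence_likelihood(path, obs):
    """P(path, obs) for one fixed state path (no summing over paths)."""
    p = emit[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

print(sequence_likelihood(["n1", "n1", "n2"], ["lo", "lo", "hi"]))
```

The self-loop probabilities are what let a three-state model account for a phone of any duration in frames.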
• Initial estimates for
– Transition probabilities between phone states
– Observation probabilities associating phone
states with acoustic examples
• Re-estimate both probabilities by feeding the
HMM the transcribed speech training corpus
(forced alignment)
• I.e., we tell the HMM the ‘right’ answers --
which phones to associate with which
sequences of sounds
• Iteratively retrain transition and observation
probabilities by running training data through
model and scoring output (we know the right
answers) until no improvement
12
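The step of finding the best state sequence for known transcribed speech (forced alignment) rests on the Viterbi algorithm, which recovers the single most likely state path for an observation sequence. A minimal sketch, with a two-state model and invented probabilities:

```python
# Sketch of Viterbi: find the single best state path for an
# observation sequence. All transition/emission numbers invented.
trans = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.1, "s2": 0.9}}
emit = {"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.2, "b": 0.8}}
start = {"s1": 0.5, "s2": 0.5}

def viterbi(obs):
    # best[s] = (prob of best path ending in s, that path)
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in start}
    for o in obs[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit[s][o], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in start
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["a", "b", "b"]))
```

Keeping only the best predecessor per state at each step is what makes the search linear in the number of observations rather than exponential.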
HMMs for speech
13
Building the Pronunciation Model
• Models likelihood of word given network of
candidate phone hypotheses
– Multiple pronunciations for each word
– May be weighted automaton or simple
dictionary
• Words come from all corpora; pronunciations
from pronouncing dictionary or TTS system
14
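A dictionary-style pronunciation model with weighted variants can be sketched as follows; the words, phone symbols, and probabilities are all invented for illustration (a weighted automaton generalizes this by sharing sub-paths between variants):

```python
# Sketch of a weighted pronunciation dictionary: each word maps
# to phone strings with probabilities (numbers invented).
pron = {
    "either": [(("iy", "dh", "er"), 0.6), (("ay", "dh", "er"), 0.4)],
    "the":    [(("dh", "ah"), 0.8), (("dh", "iy"), 0.2)],
}

def pron_prob(word, phones):
    """P(phones | word) under the dictionary, 0.0 if absent."""
    return dict(pron.get(word, [])).get(tuple(phones), 0.0)

print(pron_prob("either", ["ay", "dh", "er"]))  # 0.4
```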
ASR Lexicon: Markov Models for
Pronunciation
15
Language Model
• Models likelihood of word given previous word(s)
• N-gram models:
– Build the LM by calculating bigram or trigram
probabilities from (very large) text training
corpus
– Smoothing issues
• Grammars
– Finite state grammar or Context Free
Grammar (CFG) or semantic grammar
• Out of Vocabulary (OOV) problem
16
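A minimal bigram model with add-one (Laplace) smoothing over a toy corpus shows both the counting and the smoothing issue; real systems train on far larger corpora and use better smoothing methods:

```python
# Minimal bigram language model with add-one smoothing over a
# tiny invented corpus.
from collections import Counter

corpus = "the cat sat on the mat . the cat ran .".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    """P(w | w_prev) with add-one smoothing, so unseen bigrams
    get small but nonzero probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

print(p_bigram("the", "cat"))   # seen bigram
print(p_bigram("cat", "the"))   # unseen bigram, still > 0
```

Note that the smoothing only softens the out-of-vocabulary problem within the known vocabulary; a word absent from `vocab` entirely still cannot be hypothesized.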
Search/Decoding
• Find the best hypothesis given
– Lattice of phone units (AM)
– Lattice of words (segmentation of phone
lattice into all possible words via
Pronunciation Model)
– Probabilities of word sequences (LM)
• How to reduce this huge search space?
– Lattice minimization and determinization
– Pruning: beam search
– Calculating most likely paths
17
Evaluating Success
• Transcription
– Low WER: (S+I+D)/N * 100
“Thesis test” vs. “This is a test”: 75% WER
Or “That was the dentist calling”: 125% WER
• Understanding
– High concept accuracy
• How many domain concepts were correctly
recognized?
I want to go from Boston to Baltimore on September 29
18
Domain concepts: Values
– source city: Boston
– target city: Baltimore
– travel date: September 29
– Score recognized string “Go from Boston to
Washington on December 29” vs. “Go to
Boston from Baltimore on September 29”
– (1/3 = 33% CA)
19
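The WER figures above can be reproduced with a standard word-level minimum edit distance; a minimal sketch with uniform edit costs:

```python
# Word error rate via minimum edit distance over words:
# WER = (S + I + D) / N * 100, N = number of reference words.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits turning first i ref words into first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("this is a test", "thesis test"))                   # 75.0
print(wer("this is a test", "that was the dentist calling"))  # 125.0
```

The second example shows why WER can exceed 100%: insertions are errors too, and N counts only reference words.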
Summary
• ASR today
– Combines many probabilistic phenomena:
varying acoustic features of phones, likely
pronunciations of words, likely sequences
of words
– Relies upon many approximate techniques
to ‘translate’ a signal
• ASR future
– Can we include more language
phenomena in the model?
20
Synthesizers Then and Now
• The task: produce human-sounding speech
from some orthographic or semantic
representation of an input
– An online text
– A semantic representation of a response to a
query
• What are the obstacles?
• What is the state of the art?
21
Possibly the First ‘Speaking Machine’
• Wolfgang von Kempelen, Mechanismus der menschlichen
Sprache nebst Beschreibung einer sprechenden Maschine,
1791 (still in the Deutsches Museum, and playable)
• First to produce whole words, phrases – in many languages
22
Joseph Faber’s Euphonia, 1846
23
• Constructed 1835 w/pedal and keyboard control
– Whispered and ordinary speech
– Model of tongue, pharyngeal cavity with
changeable shape
– Singing too: “God Save the Queen”
• Modern Articulatory Synthesis: Dennis Klatt
(1987)
24
25
• World’s Fair in NY, 1939
• Requires much training to ‘play’
• Purpose: reduce bandwidth needed to transmit speech,
so many phone calls can be sent over single line
26
27
• Answers:
– These days a chicken leg is a rare dish.
– It’s easy to tell the depth of a well.
– Four hours of steady work faced us.
• ‘Automatic’ synthesis from spectrogram – but
can also use hand-painted spectrograms as
input
• Purpose: understand perceptual effect of
spectral details
28
Formant/Resonance/Acoustic Synthesis
• Parametric or resonance synthesis
– Specify minimal parameters, e.g. f0 and first 3
formants
– Pass electronic source signal through filter
• Harmonic tone for voiced sounds
• Aperiodic noise for unvoiced
• Filter simulates the different resonances of the vocal tract
• E.g.
– Walter Lawrence’s Parametric Artificial Talker (1953)
for vowels and consonants
– Gunnar Fant’s Orator Verbis Electris (1953) for
vowels
– Formant synthesis download (demo)
29
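The source-filter idea above can be sketched in a few lines: a periodic pulse train (the harmonic tone for voiced sounds) is passed through second-order resonators, one per formant. All frequencies and bandwidths below are illustrative, not taken from PAT or OVE:

```python
# Sketch of source-filter formant synthesis: a pulse-train source
# filtered by one two-pole resonator per formant. Parameter values
# are invented for illustration.
import math

def resonator(signal, freq, bw, sr):
    """Second-order IIR resonance at `freq` Hz, bandwidth `bw` Hz."""
    r = math.exp(-math.pi * bw / sr)              # pole radius
    a1 = 2 * r * math.cos(2 * math.pi * freq / sr)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

sr = 16000
f0 = 120                                          # pitch of the source
source = [1.0 if n % (sr // f0) == 0 else 0.0 for n in range(1600)]
wave = source
for f, bw in [(700, 60), (1200, 90), (2600, 120)]:   # /a/-like formants
    wave = resonator(wave, f, bw, sr)
```

For unvoiced sounds the pulse train would be replaced by aperiodic noise, exactly as the bullet points describe.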
Concatenative Synthesis
• Most common type today
• First practical application in 1936: British Phone
company’s Talking Clock
– Optical storage for words, part-words,
phrases
– Concatenated to tell time
• E.g.
• And a ‘similar’ example
• Bell Labs TTS (1977) (1985)
30
Variants of Concatenative Synthesis
• Inventory units
– Diphone synthesis (e.g. Festival)
– Microsegment synthesis
– “Unit Selection” – large, variable units
• Issues
– How well do units fit together?
– What is the perceived acoustic quality of the
concatenated units?
– Is post-processing on the output possible, to
improve quality?
31
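The "how well do units fit together" question is usually formalized as a target cost (match between a candidate unit and the desired specification) plus a join cost (smoothness at the concatenation point). A sketch with invented features and weights:

```python
# Sketch of unit-selection scoring: total cost = target cost +
# weighted join cost. Feature names, values, and the weight are
# invented for illustration.
def target_cost(spec, unit):
    """How far the unit is from the desired pitch and duration."""
    return abs(spec["pitch"] - unit["pitch"]) + abs(spec["dur"] - unit["dur"])

def join_cost(prev_unit, unit):
    """Pitch discontinuity at the concatenation boundary."""
    return abs(prev_unit["end_pitch"] - unit["pitch"])

def best_unit(spec, prev_unit, candidates, w_join=0.5):
    return min(
        candidates,
        key=lambda u: target_cost(spec, u) + w_join * join_cost(prev_unit, u),
    )

spec = {"pitch": 110, "dur": 80}
prev = {"end_pitch": 100}
cands = [
    {"pitch": 105, "dur": 70, "end_pitch": 108},
    {"pitch": 130, "dur": 82, "end_pitch": 125},
]
print(best_unit(spec, prev, cands))
```

Full unit-selection systems minimize this combined cost over the whole utterance (e.g. with Viterbi search) rather than greedily per unit as here.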
TTS Production Levels: Back End and Front
End
• Orthographic input: The children read to Dr. Smith
• Knowledge levels: world knowledge, semantics,
syntax, lexicon, phonology, acoustics
• Processing steps: text normalization → word
pronunciation → intonation assignment →
intonation realization (F0, amplitude, duration)
→ synthesis
32
Text Normalization
• Reading is what W. hates most.
• Reading is what Wilde hated most.
• Have the students read the questions.
• In 1996 she sold 1995 shares and deposited
$42 in her 401(k).
• The duck dove supply.
33
Pronunciation in Context
• Homograph disambiguation
– E.g. bass/bass, desert/desert
• Dictionaries vs letter-to-sound rules
– Frequent or exceptional words: dictionary
• Bellcore business name pronunciation
– ‘New’ words: rules
• Inferring language origin, e.g. Infiniti, Gomez, Paris
• Pronunciation by analogy, e.g. universary
• Learning rules automatically from dictionaries or
small seed-sets + active learning
34
Intonation Assignment: Phrasing
• Traditional: hand-built rules
– Punctuation: 234-5682
– Context/function word: no breaks after a
function word: He went to dinner
– Parsing? She favors the nuts and bolts
approach
• Current: statistical analysis of large labeled
corpus
– Punctuation, pos window, utt length,…
– ~94% `accuracy’
35
Intonation Assignment: Accent
• Hand-built rules
– Function/content distinction
• Accent content words; deaccent function words
• But many exceptions
– ‘Given’ items, focus, contrast
– There are presidents and there are good presidents
– Complex nominals:
• Main Street/Park Avenue
• city hall parking lot
• Today: Statistical procedures trained on large
labeled corpora
36
Intonation Assignment: Contours
• Simple rules
– ‘.’ = declarative contour
– ‘?’ = yes-no-question contour unless wh-word
present at/near front of sentence
• Well, how did he do it? And what do you know?
• Open problem
– But realization even of simple variation is
quite difficult in current TTS systems
37
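The simple rules above can be sketched as a function; the contour labels and the "wh-word at/near the front" check are illustrative choices, not from any particular system:

```python
# Sketch of rule-based contour assignment: '.' -> declarative fall;
# '?' -> yes-no rise, unless a wh-word appears at/near the front.
WH = {"who", "what", "when", "where", "why", "how", "which"}

def assign_contour(sentence):
    words = [w.strip(".,?!").lower() for w in sentence.split()]
    if sentence.rstrip().endswith("?") and not (set(words[:2]) & WH):
        return "yes-no-rise"
    return "declarative-fall"

print(assign_contour("Do you want to fly first class?"))  # yes-no-rise
print(assign_contour("Well, how did he do it?"))          # declarative-fall
```

Checking the first two words handles cases like "Well, how did he do it?", where the wh-word is near but not at the front.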
Phonological and Acoustic Realization
• Task:
– Produce a phonological representation from
phonemes and intonational assignment
• Pitch contour aligned with text
• Durations, intensity
– Select best concatenative units from inventory
– Post-process if needed/possible to smooth
joins, modify pitch, duration, intensity, rate
from original units
– Produce acoustic waveform as output
38
TTS:
Where are we now?
• Natural sounding speech for some utterances
– Where there is a good match between the
input and the database
• Still…hard to vary prosodic features and
retain naturalness
– Yes-no questions: Do you want to fly
first class?
• Context-dependent variation still hard to infer
from text and hard to realize naturally:
39
– Appropriate contours from text
– Emphasis, de-emphasis to convey
focus, given/new distinction: I own a
cat. Or, rather, my cat owns me.
– Variation in pitch range, rate, pausal
duration to convey topic structure
• Characteristics of ‘emotional speech’ little
understood, so hard to convey: …a voice that
sounds friendly, sympathetic, authoritative….
• How to mimic real voices?
• ScanSoft/Nuance demo
40
Next Class
• J&M 22.4, 22.8
• Walker et al ‘97
41