Spoken Language Processing: Summing Up
Julia Hirschberg
CS 4706

7/15/2016
What We’ve Studied
• Speech phenomena
– What can people convey by varying the way
they say something?
– How do we identify this kind of variation?
– What tools do we have for analysis?
• Speech generation (TTS)
• Speech recognition (ASR) and understanding
(ASRU)
• Applications for speech technologies
What phenomena vary in speech?
• Intonational contours (ToBI)
– Phrasing: scope
– Accent: focus, given/new
– Overall contour: speech acts
• Pitch range, timing
– Topic structure
• Voice quality, intensity, …
– Emotion
– Deception?
– Charisma?
Analyzing Speech: At the Acoustic Level
• How do we capture speech data for analysis?
– Digitizing: sampling, quantization, filtering
• How can we distinguish one speech sound from
another?
– Periodic vs. aperiodic waveforms
• Characterizing periodic waveforms: cycle, period,
phase
– Displaying and analyzing spectra, pitch tracks
– Comparing intensity (dB)
• Tools to do all this and more: Praat
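The digitizing and measurement steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration, not how Praat works internally: the sample rate, tone frequency, bit depth, and all function names are assumptions chosen for the example, and the period estimate uses simple zero crossings rather than a real pitch tracker.

```python
import math

SAMPLE_RATE = 16000   # samples per second (assumed for this sketch)
FREQ_HZ = 200.0       # frequency of the synthetic test tone (assumed)
BITS = 16             # quantization depth (assumed)


def sample_sine(freq_hz, sample_rate, n_samples, amplitude=0.5):
    """Sampling: evaluate a sine wave at the discrete times n / sample_rate."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def quantize(samples, bits):
    """Quantization: round each sample in [-1, 1] to a signed integer level."""
    max_level = 2 ** (bits - 1) - 1
    return [round(s * max_level) for s in samples]


def rms_db(samples, reference=1.0):
    """Intensity as RMS amplitude in decibels relative to `reference`."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / reference)


def period_in_samples(samples):
    """Estimate the period as the mean spacing of upward zero crossings."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0 <= samples[i]]
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    return sum(gaps) / len(gaps)


signal = sample_sine(FREQ_HZ, SAMPLE_RATE, SAMPLE_RATE // 10)  # 100 ms of audio
pcm = quantize(signal, BITS)                                   # integer samples
f0 = SAMPLE_RATE / period_in_samples(signal)                   # estimated frequency in Hz
louder_by_db = rms_db([2 * s for s in signal]) - rms_db(signal)
```

Doubling the amplitude raises the level by about 6 dB, which is why intensity comparisons are usually made on the logarithmic dB scale rather than on raw amplitude.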
Analyzing Speech: At the Phonetic Level
• Can we distinguish different languages in terms
of their phoneme sets? Are there universal
constraints on possible speech sounds?
– Articulatory constraints
• How do we characterize the sounds of a given
language:
– Acoustic differences associated with place
and manner of articulation distinguish
consonants
– Vowels differ in their formant frequencies
• Do we use such information in speech
technologies?
Articulators in action
(Sample from the Queen’s University / ATR Labs
X-ray Film Database)
“Why did Ken set the soggy net on top of his deck?”
Articulatory parameters for English consonants (in ARPAbet)
MANNER OF      PLACE OF ARTICULATION
ARTICULATION   bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop           p b                                 t d                k g    q
fric.                    f v          th dh        s z       sh zh           h
affric.                                                      ch jh
nasal          m                                   n                  ng
approx         w                                   l/r       y
flap                                               dx
VOICING: in each space-separated pair, voiceless first, voiced second
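One way such a chart might be used in a speech tool is as a feature lookup table. The sketch below encodes the ARPAbet consonant chart as a Python dictionary; the names `CONSONANTS`, `describe`, and `differing_features` are hypothetical, and the voicing values are filled in from standard phonetic descriptions as an assumption, since the flattened chart does not state voicing for each symbol.

```python
# (place, manner, voicing) for each ARPAbet consonant in the chart.
# Voicing values are an assumption based on standard phonetic descriptions.
CONSONANTS = {
    "p": ("bilabial", "stop", "voiceless"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "k": ("velar", "stop", "voiceless"),
    "g": ("velar", "stop", "voiced"),
    "q": ("glottal", "stop", "voiceless"),
    "f": ("labiodental", "fricative", "voiceless"),
    "v": ("labiodental", "fricative", "voiced"),
    "th": ("interdental", "fricative", "voiceless"),
    "dh": ("interdental", "fricative", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
    "sh": ("palatal", "fricative", "voiceless"),
    "zh": ("palatal", "fricative", "voiced"),
    "h": ("glottal", "fricative", "voiceless"),
    "ch": ("palatal", "affricate", "voiceless"),
    "jh": ("palatal", "affricate", "voiced"),
    "m": ("bilabial", "nasal", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
    "ng": ("velar", "nasal", "voiced"),
    "w": ("bilabial", "approximant", "voiced"),
    "l": ("alveolar", "approximant", "voiced"),
    "r": ("alveolar", "approximant", "voiced"),
    "y": ("palatal", "approximant", "voiced"),
    "dx": ("alveolar", "flap", "voiced"),
}

FEATURES = ("place", "manner", "voicing")


def describe(symbol):
    """Return a readable description such as 'voiced velar nasal'."""
    place, manner, voicing = CONSONANTS[symbol]
    return f"{voicing} {place} {manner}"


def differing_features(a, b):
    """List the articulatory features on which two consonants differ."""
    return [name for name, x, y in zip(FEATURES, CONSONANTS[a], CONSONANTS[b])
            if x != y]
```

For example, `differing_features("p", "b")` reports only voicing, while `differing_features("t", "s")` reports only manner, which mirrors how the chart separates consonants along its rows and columns.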
American English vowel space
[Figure: vowel chart with axes HIGH to LOW and FRONT to BACK, plotting the ARPAbet vowels iy, uw, ix, ih, ux, ax, eh, ah, ae, uh, ao, aa]
Analyzing Speech: At the Phonological Level
• How do people develop models of intonation?
• ToBI
– Tones: Pitch accents, phrase accents,
boundary tones
– Break indices
• Hand labeling vs. automatic analysis
– Which provides more useful information?
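As a small illustration of automatic handling of ToBI tone labels, the sketch below sorts a label into the three tone categories named above. It is an assumption-laden toy: the set of pitch accents is limited to a handful of common labels (H*, L*, L*+H, L+H*, H+!H*, !H*), edge tones are assumed to be written as an optional phrase accent (H- or L-) followed by an optional boundary tone (H% or L%), and the names `PITCH_ACCENTS`, `EDGE_TONES`, and `classify` are hypothetical.

```python
import re

# Pitch accents recognized by this sketch (an assumed, incomplete inventory).
PITCH_ACCENTS = {"H*", "L*", "L*+H", "L+H*", "H+!H*", "!H*"}

# Optional phrase accent (H- or L-) followed by an optional boundary tone (H% or L%).
EDGE_TONES = re.compile(r"([HL]-)?([HL]%)?")


def classify(label):
    """Split one ToBI tone label into (category, tone) pairs."""
    if label in PITCH_ACCENTS:
        return [("pitch accent", label)]
    match = EDGE_TONES.fullmatch(label)
    if label and match:
        phrase_accent, boundary_tone = match.groups()
        parts = []
        if phrase_accent:
            parts.append(("phrase accent", phrase_accent))
        if boundary_tone:
            parts.append(("boundary tone", boundary_tone))
        return parts
    raise ValueError(f"unrecognized ToBI label: {label!r}")
```

A label such as `L-H%` comes back as a phrase accent `L-` plus a boundary tone `H%`, while `L+H*` is returned whole as a single pitch accent.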
[Figure: ToBI pitch contours for the pitch accents H*, L*, and L*+H, each with the phrase-final tones L-L%, L-H%, H-L%, and H-H%]
[Figure: ToBI pitch contours for the pitch accents L+H*, H+!H*, and H* !H*, each with the phrase-final tones L-L%, L-H%, H-L%, and H-H%]
Speech Generation
• Synthesis then and now
• Open problems in TTS:
– Pronunciation modeling: OOV words,
homographs, abbreviations
– Predicting pitch accents and phrase
boundaries: corpus-based approaches
– Information status: focus, given/new
– Modeling discourse structure
– Producing emotional speech
– Evaluation
Speech Recognition/Understanding
• ASR then and now: From speaker-dependent digit
recognition using analog circuits to HMM-based speaker-independent
recognition of spontaneous speech by computer
• Open problems
– Segmentation: sentence, speaker, topic
– OOV recognition
– Handling disfluencies
– Evaluation: transcription, semantic, task-based?
– Recognizing emotion and other types of speaker state
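For transcription-level evaluation, one widely used metric is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The slide does not name a specific metric, so treat the sketch below as one common option; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "why did ken set the soggy net on top of his deck"
hypothesis = "why did ken set the foggy net on top of deck"
wer = word_error_rate(reference, hypothesis)  # one substitution, one deletion
```

WER treats every word as equally important, which is one reason semantic and task-based evaluation remain open alternatives.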
Spoken Dialogue Systems
• Integrating TTS and ASR with dialogue
management and task-based components
• Open questions:
– Improving ASR accuracy
– Recognizing dialogue acts
– Turn-taking behavior
– Confirmation strategies and initiative
– Entrainment and ‘personality’
– Evaluation
Recognizing Speaker State and Diagnosis
• Emotional speech
– Voice quality
• Deceptive speech
• Charismatic speech
• Customer care rep evaluation
• Medical diagnosis
– Paranoia and other psychiatric disorders
– Cancer patient prognosis
Take-Home Final
• Due: May 14 by 4:10 pm
• Submission instructions:
– This examination is designed to test your ability to synthesize
information and to perform critical analysis of published research.
Choose 3 of the following 4 questions to answer. Each question should
be answered with specific reference to the readings specified, all of
which are linked to the syllabus for the class on the date given. (I.e.,
cite articles with page numbers to support claims about authors’ findings
or claims, as “McLeod et al. (1998) claim that existing Spoken
Dialogue Systems’ major drawback is their lack of delightful
personalities (p. 4).”) Do not attempt to answer the questions until you
have read and understood the specified articles. Essays that do not
show evidence of this understanding will not receive high marks.
– Each essay will be worth 33 1/3 points. Each essay should be no more
than 1200 words in length; only the first 1200 words of each essay will
be graded, so please do not exceed this limit. If you can answer the
question in a shorter essay, feel free to do so. Please use plain ASCII or
Word and report word-counts for each essay.
Sample Question
Agree or disagree: “It is more difficult to recognize
deception automatically from acoustic/prosodic and
lexical cues than from visual cues obtained from face or
body gesture.” Use the readings assigned for April 28
to support your answer.
1. Show that you understand the question and are
answering it
• E.g. “I believe that it is more difficult to recognize deception
automatically from visual cues than from
acoustic/prosodic and lexical cues.”
2. For agree/disagree questions, decide whether you
basically agree or disagree
• E.g. “While there are difficulties recognizing deception from
both types of cues, I believe it is more difficult to recognize
deception from visual cues than from language-based
cues.”
3. Provide evidence on both sides of the question
• “While both audio and visual cues require high quality
recordings, audio recordings must be obtained in a quiet
environment whereas video recordings can be obtained
in a wider variety of situations, providing that equipment
is available.”
• “While Mehrabian (1971) found significant effects for
both visual and language-based cues, the particular
language cues he identified in this study would seem to
be easier to recognize automatically than the visual
cues: For example, it should be easier to identify
amount of speech and speaking rate than features such
as ‘rocking gestures’ and ‘leg and foot movements’.”
4. Support your statements with specific reference
to your sources
• E.g. “DePaulo et al. (1983) find that…”
• Or, “Motivation greatly influences subjects’ ability to
control their verbal cues (DePaulo et al., 1983).”
• When in doubt, cite