MultiModal Dialogue Systems and
Other Research Projects at KTH
Rolf Carlson, CTT, KTH
www.speech.kth.se
KTH - Kungliga tekniska högskolan
Department of Speech, Music and Hearing
Kungliga tekniska högskolan, KTH
• 10,000 undergraduate students
• 1,500 graduate students
• 3,000 staff
• 800 professors and teachers
The KTH speech group
- Early days
Gunnar Fant and OVE I, 1953
OVE II, 1958
1961
1962
Show OVE I
CTT - Centrum för talteknologi (Centre for Speech Technology)
Research Areas
• Speech Production
• Speech Perception
• Communication Aids
• Multimodal Synthesis
• Speech Understanding
• Speaker Characterization
• Spoken Language
• Interactive Dialog Systems
KTH/TTS history
• 1967, Digitally controlled OVE III
• 1974, Rule-based system RULSYS
– transformation rules
• 1979, Mobile text-to-speech system
– used by a non-vocal child
• 1982, Portable TTS
• 2004
– Data-driven multimodal synthesis
– Synthesis of emotions
– Synthesis of breaks
Synthesis
[Audio demos: original recording vs. formant synthesis; emotion synthesis with natural and synthesized versions of neutral, happy, sad and angry speech.]
Multimodal Synthesis
Combining interior and exterior
registration
Example of resynthesis
Show multimodal synthesis
Results for VCV-words (hearing-impaired subjects)
[Bar chart: % correct for the conditions audio alone, synthetic face and natural face, with rule-synthesis and natural voice.]
The Synface telephone
Real user tests
Multi-modal dialog systems
The Waxholm Project
• tourist information for the Stockholm archipelago
– time-tables, hotels, hostels, camping and dining possibilities
• mixed initiative dialogue
• speech recognition
• multimodal synthesis
• graphic information
– pictures, maps, charts and time-tables
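To make the interplay of these components concrete, the sketch below shows one turn of a hypothetical mixed-initiative loop of this kind. The function and slot names are invented for illustration and do not come from the Waxholm implementation.

# Hypothetical sketch of one turn in a mixed-initiative dialogue loop; the
# helper callables are placeholders, not the actual Waxholm modules.
def dialogue_turn(state, recognize, parse, speak, show):
    words = recognize()                     # speech recognition
    frame = parse(words)                    # semantic frame extracted from the utterance
    state.update(frame)                     # merge new information into the dialogue state

    missing = [slot for slot in ("destination", "weekday") if slot not in state]
    if missing:
        # System initiative: ask for the first missing piece of information.
        speak(f"Which {missing[0]} do you want?")
    else:
        # User goal satisfied: answer and present graphic information.
        speak(f"Here are the boats to {state['destination']} on {state['weekday']}.")
        show("time-table", state)           # maps, charts and time-tables on screen
    return state

# Toy run with stub components:
dialogue_turn({}, recognize=lambda: "I want to go to Vaxholm on Friday".split(),
              parse=lambda words: {"destination": "Vaxholm", "weekday": "Friday"},
              speak=print, show=lambda kind, state: print("showing", kind))

A user who volunteers extra information ("I want to go to Vaxholm on Friday") fills several slots at once, while a terse answer fills only one and triggers a follow-up question: the mixed-initiative behaviour the project studied.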
User answers to questions?
The answers to the question: "What weekday do you want to go?" (Vilken veckodag vill du åka?)
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• – are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Examples of questions and answers

"Hur ofta åker du utomlands på semestern?" (How often do you go abroad on vacation?)
• jag åker en gång om året kanske (I go maybe once a year)
• jag åker ganska sällan utomlands på semester (I quite rarely go abroad on vacation)
• jag åker nästan alltid utomlands under min semester (I almost always go abroad during my vacation)
• jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on vacation about twice a year)
• jag åker utomlands nästan varje år (I go abroad almost every year)
• jag åker utomlands på semestern varje år (I go abroad on vacation every year)
• jag åker utomlands ungefär en gång om året (I go abroad about once a year)
• jag är nästan aldrig utomlands (I am almost never abroad)
• en eller två gånger om året (once or twice a year)
• en gång per semester (once per vacation)
• kanske en gång per år (maybe once a year)
• ungefär en gång per år (about once a year)
• åtminståne en gång om året (at least once a year)
• nästan aldrig (almost never)

"Hur ofta reser du utomlands på semestern?" (How often do you travel abroad on vacation?)
• jag reser en gång om året utomlands (I travel abroad once a year)
• jag reser inte ofta utomlands på semester det blir mera i arbetet (I do not often travel abroad on vacation, it is more for work)
• jag reser reser utomlands på semestern vartannat år (I travel abroad on vacation every other year)
• jag reser utomlands en gång per semester (I travel abroad once per vacation)
• jag reser utomlands på semester ungefär en gång per år (I travel abroad on vacation about once a year)
• jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on vacation at least once a year)
• en gång per år kanske (once a year maybe)
• en gång vart annat år (once every other year)
• varje år (every year)
• vart tredje år ungefär (about every third year)
• nu för tiden inte så ofta (not so often nowadays)
• varje år brukar jag åka utomlands (every year I usually go abroad)
Results
[Pie chart of answer types: reuse 52%, other 24%, ellipsis 18%, no reuse 4%, no answer 2%.]
The Waxholm system
[Screenshot of a sample Waxholm dialogue: the user asks about boats to Waxholm and hotels in Waxholm; the system shows time-tables, hotel information and a map, and asks follow-up questions such as which day and at what time the user wants to go.]
Dialog systems at KTH
The August system
• Stockholm (events and general information)
• Yellow pages
• KTH and speech technology
• August Strindberg
• Greetings and social utterances
• Comments about the system capabilities and the discourse
Knowledge sources - Evaluation
Acoustic analysis
Syntactic analysis
Semantic analysis
Dialog state
Dialog Context
Confidence
Expectation
Filter
Shallow semantic analysis
• Input
– word sequences
– semantic features from lexicon
• Output
– Acceptable utterance? yes/no
– Predicted domain
• strindberg, stockholm, yellow pages…..
– Feature:value representation
• object:restaurant, place:mariatorget
• Trained on tagged N-best lists and lexicon
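As a rough illustration of this feature:value output, here is a minimal Python sketch. It is a toy stand-in, not the trained component described above: the lexicon, feature names and accept/reject rule are invented for the example.

# Illustrative sketch only -- not the actual August/KTH implementation.
# It mimics the shallow semantic analysis on the slide: a word sequence in,
# an accept/reject decision, a predicted domain and feature:value pairs out.

# Hypothetical toy lexicon: word -> semantic features.
LEXICON = {
    "restaurant":  {"domain": "yellow_pages", "object": "restaurant"},
    "hotel":       {"domain": "yellow_pages", "object": "hotel"},
    "mariatorget": {"domain": "yellow_pages", "place": "mariatorget"},
    "strindberg":  {"domain": "strindberg"},
    "stockholm":   {"domain": "stockholm"},
}

def shallow_semantic_analysis(words):
    """Return (acceptable, predicted_domain, feature_value_pairs)."""
    features = {}
    domain_votes = {}
    for w in words:
        entry = LEXICON.get(w.lower())
        if not entry:
            continue
        for key, value in entry.items():
            if key == "domain":
                domain_votes[value] = domain_votes.get(value, 0) + 1
            else:
                features[key] = value
    acceptable = bool(domain_votes)   # reject utterances with no known content words
    domain = max(domain_votes, key=domain_votes.get) if acceptable else None
    return acceptable, domain, features

# Example: "is there a restaurant near Mariatorget"
print(shallow_semantic_analysis("is there a restaurant near mariatorget".split()))
# -> (True, 'yellow_pages', {'object': 'restaurant', 'place': 'mariatorget'})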
The set-up in
Kulturhuset
A sample video of the
system environment
[Screenshot of an August dialogue at Kulturhuset: the user greets the system, asks where they are and about restaurants nearby; August answers that they are in Kulturhuset in the centre of Stockholm, mentions that there are over 24,000 islands in the archipelago and some 700 restaurants in Stockholm, asks the user to specify a street, shows restaurants on a map, and remarks that his life cannot be measured in terms of days and years.]
The August database
September 1998 - February 1999:
10,058 utterances (approximately 15 hours of speech)
were manually checked, transcribed and analyzed
[Two pie charts: 2,685 users (children 24%, men 50%, women 26%) and 10,058 utterances (children 22%, men 55%, women 23%).]
What do you say to August?
• Child
• Woman 1
• Woman 2
What …….. ?
• 334 utterances include “what”
– only 75 have “what” in initial position
• 99 “what is your name”
– all in final utterance position
– only 13 initiate an utterance
intro
what ……...
An example of
a repetitive sequence
The utterance ”Vad heter kungen?” (What is the name of the king?)
as original input (top) and repeated twice by the same user
Features in repetition
[Bar chart: percentage of all repetitions, adults vs. children, for the features more clearly articulated, increased loudness and shifting of focus.]
The August system
[Screenshot of an August dialogue: the user asks August his name, when he was born and whether he lives in Stockholm; August answers as Strindberg (born in 1849, married three times), notes that over a million people live in Stockholm (shown on a map), mentions the Royal Institute of Technology and the Department of Speech, Music and Hearing, quips that people who live in glass houses should not throw stones, and closes with thanks and a goodbye.]
Dialog systems at KTH
The HIGGINS domain
Error handling in dialog systems
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error
handling behaviour
[Setup: the user speaks → ASR → the operator reads the recognition output; the operator speaks → vocoder → the user listens.]
Non-understanding error recovery
• Results show that humans tend not to signal non-understanding:
O: Do you see a wooden house in front of you?
U: YES CROSSING ADDRESS NOW (I pass the wooden house now)
O: Can you see a restaurant sign?
• This leads to
– Increased experience of task success
– Faster recovery from non-understanding
• Skantze, G. (2003). Exploring human error
handling strategies: implications for spoken
dialogue systems.
Prediction of Upcoming Swedish Prosodic
Boundaries by Swedish and American Listeners
Rolf Carlson, Julia Hirschberg, Marc Swerts
Questions
• Are listeners able to predict the occurrence of upcoming boundaries based on acoustic and prosodic features?
• Are listeners able to distinguish different degrees of upcoming boundary strength?
Experiment
• Spontaneous utterance fragments
presented to listeners
• Questions:
– followed by a prosodic break?
– strength of the break?
• Answers compared to labeled data
Database
• The speech corpus: an interview of a
Swedish female politician (Swedish Radio)
• The interview was prosodically labeled by
three independent researchers for boundary
presence and strength
• Majority voting strategy used to resolve
disagreements.
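A minimal sketch of such a majority vote over the three labelers (illustrative only, not the original labeling tool):

# Majority voting over three independent prosodic boundary labels.
from collections import Counter

def resolve(labels_per_word):
    """labels_per_word: list of (label_A, label_B, label_C) tuples,
    each label being 'no', 'weak' or 'strong'."""
    resolved = []
    for labels in labels_per_word:
        label, votes = Counter(labels).most_common(1)[0]
        # With three labelers a 2-of-3 (or 3-of-3) majority always decides;
        # a three-way split would need manual adjudication.
        resolved.append(label if votes >= 2 else "disagreement")
    return resolved

print(resolve([("no", "no", "weak"), ("strong", "strong", "strong"),
               ("weak", "strong", "weak")]))
# -> ['no', 'strong', 'weak']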
Stimuli
• 60 utterance fragments, each of which preceded the word "och" (and) in its original context (each about 2 seconds long)
• an additional 60 shortened versions consisting of only the final word of each fragment
Subjects
• 13 Swedish subjects (SW), students of
logopedics from Umeå University, Sweden
• 29 American subjects (AM), staff and
students at Columbia University, USA
Perceived upcoming boundary strength
[Chart: subject scores on a 5-point scale for one-word and 2-second stimuli, American (AM) and Swedish (SW) subjects, grouped according to expert-labeled boundary strength (no break, weak break, strong break).]
Word in Isolation and 2-Second Fragment
[Scatter plots: perceived boundary strength for the 2-second fragment vs. the word in isolation, on 5-point scales, for American subjects (AM) and Swedish subjects (SW).]
Correlation between perceived upcoming boundary strength for each word in isolation and in a 2-second fragment for the Swedish and American subjects. Regression coefficient r = 0.89 (SW) and r = 0.80 (AM).
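The reported coefficients are correlations between the two sets of judgments; a minimal sketch of the computation (with made-up rating values, not the study's data):

# Correlation between ratings of the isolated word and the 2-second fragment.
import numpy as np

one_word = np.array([1.2, 2.0, 3.1, 4.4, 2.8, 1.5])  # mean 5-point ratings, word in isolation (made up)
two_secs = np.array([1.0, 2.3, 3.0, 4.6, 3.1, 1.4])  # mean ratings, 2-second fragment (made up)

r = np.corrcoef(one_word, two_secs)[0, 1]             # Pearson correlation coefficient
print(f"r = {r:.2f}")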
Language Difference
[Chart: perceived upcoming boundary strength (5-point scale) for word and 2-second stimuli, American (AM) vs. Swedish (SW) subjects.]
Perceived upcoming boundary strength by stimulus length; data grouped according to subjects' native language, American (AM) and Swedish (SW).
Results
• Both native speakers of Swedish and of standard American English were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment,
• suggesting that prosodic rather than lexico-grammatical information was being used as a primary cue.
Creaky Voice
[Bar chart: percentage of stimuli with creaky voice for the judged boundary strength intervals 1–1.5, 1.5–2.5, 2.5–3.5 and 3.5–5.0 (one-word stimuli), American and Swedish subjects.]
Number of stimuli with creaky voice (in %) for different judged boundary strength intervals (one word).
Perceptual results correlate with:
• F0 median in last syllable
• F0 slope in last syllable
• Creaky voice
• Durational cues
Turn taking / Interruption
CHIL
"Computers in the Human interaction Loop"
• Integrated Project under the European
Commission's Sixth Framework Programme.
• Coordinated by Universität Karlsruhe (TH) and the
Fraunhofer Institute IITB.
• CHIL was launched on January 1st, 2004.
http://chil.server.de/
• DaimlerChrysler AG, Group Dialogue Systems, Germany
• ELDA, Evaluations and Language resources Distribution Agency, France
• IBM Ceska Republika, Czech Republic
• RESIT, Research and Education Society in Information Technologies, Greece
• INRIA (Institut National de Recherche en Informatique et en Automatique), Project GRAVIR, France
• IRST (Istituto Trentino di Cultura), Italy
• KTH (Kungl Tekniska Högskolan), Sweden
• CNRS, LIMSI (Centre National de la Recherche Scientifique through its Laboratoire d'Informatique pour la mécanique et les sciences de l'ingénieur), France
• TUE (Technische Universiteit Eindhoven), The Netherlands
• IPD, Universität Karlsruhe (TH) through its Institute IPD, Germany
• UPC, Universitat Politècnica de Catalunya, Spain
• Universität Karlsruhe (TH), Interactive Systems Labs, Germany
• Fraunhofer Institut für Informations- und Datenverarbeitung (IITB), Karlsruhe, Germany
• Stanford University, USA
• CMU, Carnegie Mellon University, USA
Challenge
• The objective
– to create environments in which computers serve humans
who focus on interacting with other humans as opposed to
having to attend to and being preoccupied with the
machines themselves.
– Instead of computers operating in an isolated manner, with humans in the loop of computers, we will put Computers in the Human Interaction Loop (CHIL).
• Computer Services
– models of humans and the state of their activities and intentions. Based on this understanding of the human perceptual context, CHIL computers can provide helpful assistance implicitly, requiring a minimum of human attention or interruptions.
CHIL - Services
• Memory Jog (MJ).
– It helps the attendees by providing information related to
the development of the event (meeting/lecture) and to
the participants. MJ provides context- and content-aware
information pull and push, both personalized and public.
• Attention Cockpit (AC).
– AC monitors the attention and interest level of
participants, supporting individuals who want more or less
involvement in the discussion. It can also inform the
Socially-Supportive Workspaces about the attentional
state of the participants.
• Connector.
– Context-aware connecting services ensure that two parties
are connected with each other at the right place, time and
by the best media, when it is most appropriate and
desirable for both parties to be connected.
Thank you
• Natural speech combined with a synthetic head controlled by the audio
The End
Perceptual Judgments of Pitch Range
Rolf Carlson, Kjell Elenius, Marc Swerts
• Question
– To what extent are listeners able to judge where a particular utterance fragment is located in a speaker's pitch range?
• Corpus
– 498 speakers from the Swedish SpeeCon
database
– A cumulative distribution of the F0 measurements for each speaker was calculated based on 314 prompted utterances.
[Two charts: (left) a speaker's cumulative F0 distribution (%) over F0 (Hz), with the 25%, 50% and 75% stimulus points marked; (right) median F0 (semitones) across the 498 speakers.]
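A minimal sketch of how such per-speaker cumulative F0 distributions and the 25%/50%/75% points could be computed (illustrative assumptions only: frame-wise F0 values in Hz, a 100 Hz semitone reference and a synthetic contour; not the original analysis code):

# Build a speaker's cumulative F0 distribution and pick the F0 values at the
# 25%, 50% and 75% points, which define the stimuli described above.
import numpy as np

def f0_percentile_points(f0_hz, points=(25, 50, 75)):
    """f0_hz: voiced-frame F0 values (Hz) pooled over a speaker's utterances."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                              # drop unvoiced frames
    return {p: float(np.percentile(f0, p)) for p in points}

def hz_to_semitone(f_hz, ref_hz=100.0):
    """Convert Hz to semitones relative to a reference (assumed 100 Hz here)."""
    return 12.0 * np.log2(np.asarray(f_hz) / ref_hz)

# Toy example with synthetic F0 values for one speaker
rng = np.random.default_rng(0)
f0_values = rng.normal(190, 25, 5000)            # made-up F0 contour samples
pts = f0_percentile_points(f0_values)
print(pts)                                       # F0 (Hz) at the 25%, 50% and 75% points
print(hz_to_semitone(pts[50]))                   # median F0 expressed in semitones re 100 Hz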
Experiment
• 100 stimuli, selected from a group of 50 different speakers whose F0 median distribution was representative of the distribution in the database as a whole.
• The fragments were presented to subjects, who were asked to estimate whether the fragment was located in the lower or higher part of that speaker's range.
Hypothesis
Hypothesis H1: Listeners can make an estimate of a speaker's range and of where an utterance is positioned in this range.
Hypothesis H2: Listeners cannot make an estimate of a speaker's range and instead make an absolute judgment of an utterance's F0, irrespective of speaker characteristics.
Hypothesis H3: Listeners can estimate the speaker's gender and make an estimate of where an utterance is positioned in the gender range.
[Chart: predicted percept (1–5) for the 25% and 75% stimuli as a function of speaker median F0 (25–50 semitones) under hypotheses H1, H2 and H3.]
Judgments of pitch range
[Two charts: (left) average range judgments (percept, 1–5) for the 25% and 75% stimuli; (right) judgments of pitch range for the 25% and 75% stimuli arranged per speaker, against speaker median F0 (25–50 semitones).]
Conclusion
[Chart: percept (1–5) for the 25% and 75% stimuli against stimulus median F0 (25–50 semitones).]
Listeners' judgments are dependent on the gender of the speaker, but within a gender they tend to hear differences in gender range.