Flexible Turn-Taking for Spoken Dialog Systems
PhD Thesis Defense
Antoine Raux
Language Technologies Institute, CMU
December 12, 2008
Thesis Committee
Maxine Eskenazi (chair)
Alan W Black
Reid Simmons
Diane J. Litman
Spoken Dialog Systems
Spoken dialog systems have long promised to
improve human-machine interaction
Speech is a natural means of communication
Recent improvements in underlying technologies
have made such systems a reality
2
Sometimes they work…
S: What can I do for you?
U: I’d like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children’s Hospital at 5:16 PM.
3
…but not always…
S: What can I do for you?
U: ‘kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61…
S: If you want…
S: Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving…
S: Leaving from McKeesport. Is…
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?
4
Key Definitions
(Conversational) Floor
“The right to address an assembly” (Merriam-Webster)
The interactional state that describes which participant in
a dialog has the right to provide or request information at
any point.
Turn-Taking
The process by which participants in a conversation
alternately own the conversational floor.
5
Thesis Statement
Incorporating different levels of knowledge using
a data-driven decision model will improve the
turn-taking behavior of spoken dialog systems.
Specifically, turn-taking can be modeled as a
finite-state decision process operating under
uncertainty.
6
Floor, Intentions and Beliefs
The floor is not an observable state.
Rather, participants have:
• intentions to claim the floor or not
• beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps
and overlaps. [Sacks et al 1974, Clark 1996]
7
Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns
in turn-taking:
• Cut-ins
• Latency
• Barge-in latency
• Self interruptions
8
Turn-Taking Errors by System
Cut-ins: system grabs floor before user releases it.
U: ‘kay. 51C Carrick from Century Square (…)
S: The 61…
Latency: system waits after user has released floor.
S: (…) Is this correct?
U: Yeah.
S: Alright (…)
9
Turn-Taking Errors by System
Barge-in latency: system keeps floor while user is claiming it.
S: For example, you can say “When is the next 28X from downtown to the airport?” or “I’d like to go from McKee…
U: When is the next 54…
S: Leaving from Atwood. Is this correct?
Self interruptions: system releases floor while user not claiming it.
S: What can I do for you?
U: 61A.
S: For example, you can say when is…
S: Where would you li…
S: Let’s proceed step by step. Which neighb…
S: Leaving from North Side. Is this correct?
10
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
Conclusion
11
Pipeline Architectures
[Diagram: Speech Recognition → Natural Language Understanding → Dialog Management (↔ Backend) → Natural Language Generation → Speech Synthesis]
Turn-taking imposed by full-utterance-based processing
Sequential processing → lack of reactivity
No sharing of information across modules
Hard to extend to multimodal/asynchronous events
12
Multi-layer Architectures
• Separate reactive from deliberative behavior
– turn-taking vs dialog act planning
• Different layers work asynchronously
[Thorisson 1996, Allen et al 2001, Lemon et al 2003]
• But no previous work:
– addressed how conversational floor interacts with
dialog management
– successfully deployed a multi-layer architecture in a
broadly used system
13
Proposed Architecture: Olympus 2
[Architecture diagram: Sensors, Speech Recognition, Natural Language Understanding, Interaction Management, Dialog Management, Backend, Natural Language Generation, Speech Synthesis, Actuators]
14
Olympus 2 Architecture
[Same architecture diagram as on the previous slide]
Explicitly models turn-taking
Integrates dialog features from both low and high levels
Operates on generalized events and actions
Uses floor state to control planning of conversational acts
15
Olympus 2 Deployment
• Ported Let’s Go to Olympus 2
– publicly deployed telephone-based bus information system
– originally built using Olympus 1
• New version has processed about 30,000 dialogs since deployment
– no performance degradation
• Allows research on turn-taking models to be guided by real users’ behavior
16
Outline
Introduction
An event-driven architecture for spoken dialog systems
Using dialog features to inform turn-taking
End-of-turn detection
Decision tree-based thresholds
Batch evaluation
Live evaluation
17
End-of-Turn Detection
S: What can I do for you?
U: I’d like to go to the airport.
Detecting when the user releases the floor.
Potential problems:
• Cut-ins
• Latency
18
End-of-Turn Detection
S: What can I do for you?
U: I’d like to go to the airport. [end of turn]
19
Latency / Cut-in Tradeoff
S: What can I do for you?
U: I’d like to go to the airport.
Long threshold → few cut-ins, long latency
20
Latency / Cut-in Tradeoff
S: What can I do for you?
U: I’d like to go to the airport.
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Can we exploit dialog information to get the best of both worlds?
21
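The tradeoff can be made concrete with a tiny simulation: for a fixed threshold, any internal pause that reaches the threshold causes a cut-in, and the response latency is exactly the threshold. The pause-duration data below is synthetic and for illustration only.

```python
import random

def simulate(threshold_ms, turns):
    """Estimate cut-in rate and average latency for a fixed endpointing
    threshold. Each turn is a list of internal pause durations (ms);
    the system cuts in if any internal pause reaches the threshold,
    otherwise it responds threshold_ms after the final pause begins."""
    cut_ins = sum(1 for pauses in turns if any(p >= threshold_ms for p in pauses))
    return cut_ins / len(turns), threshold_ms  # latency equals the threshold

random.seed(0)
# hypothetical turns: 0-3 internal pauses drawn from an exponential (mean 300 ms)
turns = [[random.expovariate(1 / 300) for _ in range(random.randint(0, 3))]
         for _ in range(1000)]
for t in (400, 700, 1200):
    rate, latency = simulate(t, turns)
    # longer threshold -> fewer cut-ins, longer latency
```

Raising the threshold can only remove cut-ins, never add them, which is why the two quantities trade off directly.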
End-of-Turn Detection as Classification
• Classify pauses as internal/final based on
words, syntax, prosody [Sato et al, 2002]
• Repeat classification every n milliseconds until
pause ends or end-of-turn is detected
[Ferrer et al, 2003, Takeuchi et al, 2004]
• But no previous work:
– successfully combined a wide range of features
– tested model in a real dialog system
22
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
End-of-turn detection
Decision tree-based thresholds
Batch evaluation
Live evaluation
23
Using Variable Thresholds
S: What can I do for you?
U: I’d like to go to the airport.
[Diagram: dialog state (open question, specific question, confirmation) and whether the partial hyp matches current expectations select the threshold]
Discourse (dialog state)
Semantics (partial ASR)
Prosody (F0, duration)
Timing (pause start)
Speaker (avg # pauses)
24
Example Decision Tree
[Decision tree figure: internal nodes test features such as “utterance duration < 2000 ms”, “partial ASR matches expectations”, “average non-understanding ratio < 15%”, “dialog state is open question”, “average pause duration < 200/300 ms”, “partial ASR has ‘YES’”, “partial ASR has less than 3 words”, “partial ASR is available”, and “consecutive user turns w/o system prompt”; leaves are thresholds ranging from 200 ms to 1440 ms]
Trained on 1326 dialogs with the Let’s Go public dialog system
25
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
End-of-turn detection
Decision tree-based thresholds
Batch evaluation
Live evaluation
28
Performance per Feature Set
[Plot: average latency (ms) vs turn cut-in rate (%); curves for fixed threshold, state-specific threshold, and decision tree]
38% cut-in rate reduction
22% latency reduction
29
Performance per Feature Set
[Plot: average latency (ms) vs turn cut-in rate (%); fixed threshold vs decision trees using discourse, speaker, prosody, timing, semantics, and all features]
Semantics is the most useful feature type
30
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
End-of-turn detection
Decision tree-based thresholds
Batch evaluation
Live evaluation
31
Live Evaluation
• Implemented the decision tree in the Let’s Go IM
• Operating point: 3% cut-in rate, 635 ms average latency
• 1061 dialogs collected in May ’08
– 548 control dialogs (fixed threshold = 700 ms)
– 513 treatment dialogs (decision tree)
32
Cut-in Rate per Dialog State
[Bar chart: cut-in rate (0–14%) for control vs decision tree, by state: overall, open question, specific question, confirmation]
Fewer cut-ins overall (p < 0.05)
Largest improvement: after open requests
33
Average Latency per State
[Bar chart: average latency (500–1200 ms) for control vs decision tree, by state: overall, open question, specific question, confirmation]
Slower on answers to open questions
Faster on confirmations
34
Non-Understanding Rate per State
[Bar chart: non-understanding rate (0–35%) for control vs decision tree, by state: overall, open question, specific question, confirmation]
Significant reduction of non-understandings after confirmations (p < 0.01)
35
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
The Finite-State Turn-Taking Machine
Application to end-of-turn detection
In pauses
Anytime
Application to barge-in detection
36
The Finite-State Turn-Taking Machine
[State diagram built up over slides 37–40: System and User states, plus the intermediate states FreeS, FreeU (floor free) and BothS, BothU (both claim the floor)]
Similar models were proposed by Brady (1969) and Jaffe and Feldstein (1970) for analysis of human conversations.
40
Uncertainty in the FSTTM
• System:
– knows whether it is claiming the floor or not
– holds probabilistic beliefs over whether the user is
• Probability distribution over the state
• In some (useful) cases, approximations allow us to reduce uncertainty to two states:
– User vs FreeU during user utterances
– System vs BothS during system prompts
41
Making Decisions with the FSTTM
• Actions
– YIELD, KEEP if system is currently holding the floor
– GRAB, WAIT if it is not
– Different costs in different states
• Decision-theoretic action selection
– Pick action with lowest expected cost given the
belief distribution over the states
42
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
The Finite-State Turn-Taking Machine
Application to end-of-turn detection
In pauses
Anytime
Application to barge-in detection
43
End-of-Turn Detection in the FSTTM
[State diagram: during a user utterance, the system chooses between WAIT and GRAB while the state is User or FreeU]
45
Action/State Cost Matrix in Pauses

System action    Floor state: User    Floor state: FreeU
WAIT             0                    CG·t (time in pause)
GRAB             CU (constant)        0

• Latency cost increases linearly with time
• Constant cut-in cost
46
Action Selection
• At time t in a pause, take the action with minimal expected cost:
Ct(GRAB) = Pt(User) · CU
Ct(WAIT) = Pt(FreeU) · CG · t
47
Estimating State Probabilities
Pt(FreeU) = P(FreeU)
(probability that user releases floor, estimated at the beginning of the pause)
Pt(User) = (1 − P(FreeU)) · e^(−t/μ)
(probability that user keeps floor decays exponentially with time in the pause)
48
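The expected-cost rule and the exponential belief decay above fit in a few lines. A minimal sketch; the cost values CU = 5000 and CG = 1 per ms are illustrative, not the thesis's tuned parameters:

```python
import math

def p_freeu_at(t_ms, p_freeu_0, mu_ms):
    """Beliefs after t ms of silence: P_t(FreeU) stays at its pause-onset
    estimate, while P_t(User) decays exponentially with mean internal-pause
    duration mu."""
    p_user = (1.0 - p_freeu_0) * math.exp(-t_ms / mu_ms)
    return p_freeu_0, p_user

def select_action(t_ms, p_freeu_0, mu_ms, c_u=5000.0, c_g=1.0):
    """Pick GRAB or WAIT by minimal expected cost:
    E[GRAB] = P_t(User) * CU, E[WAIT] = P_t(FreeU) * CG * t."""
    p_freeu, p_user = p_freeu_at(t_ms, p_freeu_0, mu_ms)
    cost_grab = p_user * c_u          # cut-in risk
    cost_wait = p_freeu * c_g * t_ms  # latency accrues linearly
    return "GRAB" if cost_grab < cost_wait else "WAIT"
```

Early in a pause the cut-in risk dominates and the system waits; as silence lengthens, the belief that the user still holds the floor decays and GRAB becomes the cheaper action.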
Estimating P(FreeU)
• Step-wise logistic regression
• Selected features:
– boundary LM score, “YES” in ASR hyp
– energy, F0 before pause
– barge-in

                       Baseline    Logistic Regression
Classification Error   21.9%       21.7%
Log Likelihood         −0.52       −0.44
49
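A from-scratch stand-in for the step-wise logistic regression above (the features and data here are synthetic illustrations; the real model is trained on Let's Go corpus features such as the boundary LM score):

```python
import math, random

def train_logreg(X, y, lr=0.1, epochs=200):
    """Plain stochastic-gradient logistic regression estimating P(FreeU)
    from pause features. Returns feature weights plus a bias term."""
    w = [0.0] * (len(X[0]) + 1)  # weights + bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + math.exp(-z))
            g = p - yi  # gradient of log loss w.r.t. z
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict(w, xi):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 / (1 + math.exp(-z))

random.seed(1)
# synthetic pauses: [boundary LM score, "YES" in hyp, barge-in]
X = [[random.random(), random.randint(0, 1), random.randint(0, 1)] for _ in range(400)]
# end of turn more likely with a high boundary score or a "YES"
y = [1 if x[0] + 0.8 * x[1] + 0.3 * random.random() > 0.9 else 0 for x in X]
w = train_logreg(X, y)
```

The step-wise part (greedy feature selection) is omitted here; only the fitting step is sketched.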
In-Pause Detection Results
[Plot: average latency (ms) vs cut-in rate (%); curves for fixed threshold, decision tree, and FSTTM]
28% latency reduction
50
Outline
Introduction
An event-driven architecture for spoken dialog systems
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
The Finite-State Turn-Taking Machine
Application to end-of-turn detection
At pauses
Anytime
Application to barge-in detection
51
Delays in Pause Detection
U: I’d like to go to the airport.
• About 200 ms between pause start and VAD change of state
• In some cases, we can make the decision before VAD detection:
– partial hypotheses during speech
– previous model once a pause is detected
⇒ Anytime end-of-turn detection
52
End-of-Turn Detection in Speech
• Cost matrix:

System action    Floor state: User    Floor state: FreeU
WAIT             0                    CW (constant)
GRAB             CU (constant)        0

• Leads to a fixed threshold on P(FreeU)
53
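With both costs constant, expected costs are E[GRAB] = (1 − p) · CU and E[WAIT] = p · CW where p = P(FreeU), so GRAB is cheaper exactly when p exceeds CU / (CU + CW). A short sketch with invented cost values:

```python
def grab_threshold(c_u, c_w):
    """Constant-cost case: solving (1 - p) * c_u < p * c_w for p
    gives a fixed decision threshold on P(FreeU)."""
    return c_u / (c_u + c_w)

def select_action_in_speech(p_freeu, c_u=5000.0, c_w=1000.0):
    # illustrative costs; the thesis tunes them to a target cut-in rate
    return "GRAB" if p_freeu > grab_threshold(c_u, c_w) else "WAIT"
```

This is why the slide can say the matrix "leads to a fixed threshold": time no longer appears in either expected cost.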
Estimating P(FreeU) in Speech
• Step-wise logistic regression
• Features:
– boundary LM score, “YES”/“NO” in hyp
– number of words
– barge-in

                       Baseline    Logistic Regression
Classification Error   38.9%       19.2%
Log Likelihood         −0.67       −0.45
54
Anytime Detection Results
[Plot: average latency (ms) vs cut-in rate (%); curves for fixed threshold, in-pause FSTTM, and anytime FSTTM]
35% latency reduction
55
Histogram of Turn Latencies
[Histogram: % of turns by latency (0–1500 ms) for the in-pause FSTTM, with regions marked for highly predictable and less predictable ends of turns]
56
Histogram of Turn Latencies
[Histogram: % of turns by latency (0–1500 ms), in-pause FSTTM vs anytime FSTTM]
40% of highly predictable cases get predicted during speech
No change to less predictable cases
10% of turn ends detected during speech
57
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A generic, trainable turn-taking model
The Finite-State Turn-Taking Machine
Application to end-of-turn detection
At pauses
Anytime
Application to barge-in detection
58
Barge-in Detection in the FSTTM
[State diagram: during a system prompt, the system chooses between KEEP and YIELD while the state is System or BothS]
60
Cost Matrix during System Prompts

System action    Floor state: System    Floor state: BothS
KEEP             0                      CO (constant)
YIELD            CS (constant)          0

• Constant costs
• Equivalent to setting a threshold on P(BothS)
61
Estimating P(BothS)
• Estimated at each new partial ASR hypothesis
• Logistic regression
• Features:
– partial hyp matches expectations
– cue words in the hypothesis
• selected using mutual information on a previous corpus
• e.g. “When” in a state where “When is the next/previous bus” is expected
62
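Cue words are selected by mutual information with the barge-in label on a previous corpus. A minimal sketch of that selection criterion; the toy corpus and word sets below are invented for illustration:

```python
import math
from collections import Counter

def mutual_information(word_turns, bargein_labels, word):
    """I(word present ; barge-in) over a corpus of partial hypotheses.
    word_turns: list of sets of words; bargein_labels: 1 if the turn was
    a genuine barge-in, 0 otherwise."""
    n = len(word_turns)
    joint = Counter((word in t, lab) for t, lab in zip(word_turns, bargein_labels))
    mi = 0.0
    for (present, lab), c in joint.items():
        pxy = c / n
        px = sum(v for (p, _), v in joint.items() if p == present) / n
        py = sum(v for (_, l), v in joint.items() if l == lab) / n
        mi += pxy * math.log2(pxy / (px * py))
    return mi

# toy corpus: "when" perfectly predicts barge-in, "the" carries no information
turns = [{"when", "is", "the", "next"}, {"when", "is"}, {"i", "want"}, {"uh", "the"}]
labels = [1, 1, 0, 0]
```

Words with the highest mutual information would be kept as cue words for the barge-in detector.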
Barge-in Detection Results
[Plot: average latency (400–700 ms) vs self-interruption rate (0.5–2.5%); curves for “hyp matches” and “hyp matches + cues”]
63
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A generic, trainable turn-taking model
Conclusion
64
Thesis Statement
Incorporating different levels of knowledge using
a data-driven decision model will improve the
turn-taking behavior of spoken dialog systems.
• latency and/or cut-in rate reduced by both decision
tree and FSTTM
• semantic features most useful
Specifically, turn-taking can be modeled as a
finite-state decision process operating under
uncertainty.
• FSTTM
65
Contributions
• An architecture for spoken dialog systems that
incorporates dialog and interaction management
• Analysis of dialog features underlying turn-taking
• The Finite State Turn-Taking Machine
– domain-independent turn-taking model
– data-driven
– improves end-of-turn and barge-in detection
66
Extending the FSTTM
• A framework to organize turn-taking
• Extensions
– generalized FSTTM topology
• multi-party conversation
– richer cost functions
• non-linear latency cost, non-uniform cut-in cost, etc.
– better tracking of uncertainty
• priors
• Partially Observable Markov Decision Processes
67
FSTTM Dialog
S: What can I do for you?
U: Next bus from Fifth and Negley to Fifth and Craig.
S: Leaving from Fifth and Negley. Is this correct?
U: Yes.
S: Alright. Going to Fifth and Craig. Is this correct?
U: Yes.
S: Alright. I think you want the next bus. Am I…
U: Yes.
S: Right. Just a minute. I’ll look that up. The next 71D leaves Fifth Avenue at Negley at 10:54 AM.
68
68
Thank you!
Questions?
Extra Slides
70
Building Threshold Decision Trees
1. Cluster pauses using automatically extracted
features from discourse, semantics, prosody,
timing and speaker.
2. Set one threshold for each cluster so as to
minimize overall latency.
71
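Step 2 can be sketched as follows, assuming step 1 has already tagged each pause with a cluster id. The data layout and the percentile criterion are simplifications of the latency-minimization described above:

```python
def cluster_thresholds(pauses, percentile=0.95):
    """Set one endpointing threshold per cluster of pauses.
    pauses: iterable of (cluster_id, duration_ms, is_final) triples, where
    is_final marks true end-of-turn pauses. Each cluster's threshold is
    placed at a duration percentile of its *internal* pauses, since only
    internal pauses that outlast the threshold cause cut-ins."""
    by_cluster = {}
    for cluster_id, duration, is_final in pauses:
        if not is_final:
            by_cluster.setdefault(cluster_id, []).append(duration)
    thresholds = {}
    for cid, durs in by_cluster.items():
        durs.sort()
        idx = min(len(durs) - 1, int(percentile * len(durs)))
        thresholds[cid] = durs[idx]
    return thresholds
```

Clusters whose internal pauses are short get short thresholds (low latency), while hesitation-prone clusters get long ones (few cut-ins).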
Learning Curve
[Plot: average turn-final latency (450–700 ms) and tree size (0–25 internal nodes) vs number of dialogs in the training set (0–1200)]
72
Estimating Parameters
• μ
– overall mean pause duration
– state-specific mean pause duration
– predicted using dialog features through step-wise generalized linear regression
• Correl: 0.42
• Feats: barge-in, dialog state, LM score, “YES”
[Scatter plot: predicted vs actual pause duration (0–1500 ms)]
73
Endpointing Threshold
• Threshold is the solution of:
P0(FreeU) · t = (1 − P0(FreeU)) · e^(−t/μ) · K
[Plot: threshold (0–2000 ms) vs P0(FreeU), from 0 to 1; the threshold shrinks as P0(FreeU) grows]
74
User Barge-in Time Distribution
[Histogram: distribution of user barge-in times (0–20% of barge-ins) over the prompt “Going to <destination>. Is this correct?”]
75
The First Step
S: What can I do for you?
U: When is the next 54C coming to
18th street?
S: The 54C. Did I get that right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Going to Carson. Is this correct?
U: Yes.
S: Okay. Let me check that for you.

S: What can I do for you?
U: When is the next 54C coming to
18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
76
The First Step
Prompt design
Prosody
Turn-Taking
Incremental Processing
S: What can I do for you?
U: When is the next 54C coming to
18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
77
Spoken Dialog Systems
S: What can I do for you?
U: I’d like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children’s Hospital at 5:16pm.
80
Turn Endpointing
S: What can I do for you?
U: I’d like to go to the airport.
[VAD timeline: Speech Detected → Silence Detected → Speech Detected → Silence Detected → (threshold elapses) Endpoint]
81
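The threshold-based endpointer in the diagram can be sketched over a boolean VAD stream; the frame-level representation is an assumption for illustration:

```python
def endpoint(vad_frames, threshold_frames):
    """Fixed-threshold endpointer over a VAD stream: declare end of turn
    once silence has lasted threshold_frames consecutive frames.
    vad_frames: booleans, True = speech. Returns the frame index at which
    the endpoint fires, or None if the turn never ends."""
    silence_run = 0
    for i, is_speech in enumerate(vad_frames):
        silence_run = 0 if is_speech else silence_run + 1
        if silence_run >= threshold_frames:
            return i
    return None
```

With a long threshold the internal silence in the middle of the utterance is ridden out; with a short one the same silence fires the endpoint early, which is exactly the cut-in shown on the next slide.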
Endpointing Issues
S: What can I do for you?
U: I’d like to go to the airport.
[VAD timeline: the threshold elapses during an internal silence → cut-in]
82
End-of-Turn Detection Issues
S: What can I do for you?
U: I’d like to go to the airport.
[VAD timeline: the threshold delays the endpoint after the final silence → latency]
83
The Endpointing Trade Off
S: What can I do for you?
U: I’d like to go to the airport.
[VAD timeline with threshold]
Long threshold → few cut-ins, long latency
84
The Endpointing Trade Off
S: What can I do for you?
U: I’d like to go to the airport.
[VAD timeline with threshold]
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
85
Using Variable Thresholds
S: What can I do for you?
U: I’d like to go to the airport.
• Discourse (dialog state)
• Semantics (partial ASR)
• Prosody (F0, duration)
• Timing (pause start)
• Speaker (avg # pauses)
[VAD timeline with variable threshold]
86
Standard Approach to Turn-Taking in
Spoken Dialog Systems
• Typically not explicitly modeled
• Rules based on low-level features
– threshold-based end-of-utterance detection
– (optionally) barge-in detection
• Fixed behavior
• Not integrated in the overall dialog model
87
The Finite-State Turn-Taking Machine
[State diagram animations, slides 88–94:]
• Smooth transition: USER YIELDS (User → FreeU), then SYSTEM GRABS (FreeU → System)
• Latency: SYSTEM WAITS and USER WAITS (floor stays in FreeU)
• Cut-in: SYSTEM GRABS while the user holds the floor (User → BothU)
• Time out: SYSTEM GRABS after its own turn (FreeS → System)
• Barge-in: USER GRABS during a system prompt (System → BothS), then SYSTEM YIELDS (BothS → User)
94
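The transitions illustrated on these slides can be written down as a small table; the table below is an illustrative reconstruction of the named transitions, not the thesis's full machine:

```python
# Six FSTTM states: System, User, FreeS, FreeU, BothS, BothU.
# Keys are (state, action); unlisted pairs leave the floor unchanged.
TRANSITIONS = {
    ("User",   "USER YIELDS"):   "FreeU",   # smooth transition, part 1
    ("FreeU",  "SYSTEM GRABS"):  "System",  # smooth transition, part 2
    ("User",   "SYSTEM GRABS"):  "BothU",   # cut-in
    ("FreeS",  "SYSTEM GRABS"):  "System",  # time out
    ("System", "USER GRABS"):    "BothS",   # barge-in, part 1
    ("BothS",  "SYSTEM YIELDS"): "User",    # barge-in, part 2
}

def step(state, action):
    """Apply one named transition; actions with no entry (e.g. WAIT)
    leave the floor state unchanged."""
    return TRANSITIONS.get((state, action), state)
```

In the full model the system only controls its own actions and maintains a belief over the state, but the deterministic skeleton is this table.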
Optimal CW
[Plot: average latency (350–490 ms) vs CW (0–2000)]
CU is set to maintain an overall cut-in rate of 5%
95
Estimating State Probabilities
Pt(FreeU, sil) = Pt(FreeU) · Pt(sil | FreeU)
             = P0(FreeU) · 1
(user remains silent indefinitely at the end of the turn: no transition FreeU → User)
             = P0(FreeU)
(without knowledge of silence duration, Pt(FreeU) = P0(FreeU))

Pt(User, sil) = Pt(User) · Pt(sil | User)
            = P0(User) · P(dur(sil) > t | User)
(probability that the user is still silent at time t, given that they haven’t finished their turn)
            = (1 − P0(FreeU)) · e^(−t/μ)
(assuming an exponential distribution on internal silence durations; μ is the mean pause duration)
103
Reducing Uncertainty
Different levels of information can help reduce
uncertainty over the floor:
Immediate information
syntax, semantics, prosody of current turn…
Discourse information
dialog state, task structure, expectations…
Environment information
acoustic conditions, user characteristics…
104
Endpointing Threshold
• Threshold is the solution of:
P(FreeU) · t = (1 − P(FreeU)) · e^(−t/μ) · K
• Parameter K set empirically (typically 10,000 ms)
[Plot: expected cost of GRAB and WAIT vs time (0–1000 ms); the endpointing threshold is where the two curves cross]
105
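The threshold equation has no closed form, but since the left side grows with t while the right side decays, the root is unique and bisection finds it. A sketch; the P(FreeU) and μ values used below are illustrative, with K defaulting to the 10,000 ms mentioned above:

```python
import math

def endpointing_threshold(p_freeu, mu_ms, k_ms=10000.0, hi=20000.0):
    """Solve P(FreeU)*t = (1 - P(FreeU))*exp(-t/mu)*K for t by bisection.
    f(t) is negative before the crossing and positive after it."""
    f = lambda t: p_freeu * t - (1 - p_freeu) * math.exp(-t / mu_ms) * k_ms
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Consistent with the plot on slide 74, a higher P(FreeU) yields a shorter threshold: when the system is confident the user is done, it endpoints quickly.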
Pause Endpointing Results
[Plot: average latency (300–900 ms) vs cut-in rate (2–6%); curves for baseline, threshold optimization, FSTTM with only Pf estimated, FSTTM with P(U|O) & μ estimated, and FSTTM with P(U|O) estimated and oracle μ]
106