Dialogue as a Decision Making Process

Nicholas Roy
Challenges of Autonomy in the Real World
Wide range of sensors
Noisy sensors
World dynamics
Adaptability
Incomplete information
Robustness under uncertainty
[Photos: the Minerva and Pearl robots]
Predicted Health Care Needs
By 2008, need 450,000 additional nurses:
Monitoring and walking assistance
30% of adults 65 years and older have fallen this year
Cost of preventable falls: $32 Billion US/year (Alexander 2001)
Intelligent reminding
Cost of medication non-compliance: $1 Billion US/year (Dunbar-Jacobs 2000)
Spoken Dialogue Management
We want...
Natural dialogue...
With untrained (and untrainable) users...
In an uncontrolled environment...
Across many unrelated domains
Cost of errors...
Medication is not taken, or taken incorrectly
Robot behaves inappropriately
User becomes frustrated; robot is ignored and becomes useless
How to generate such a policy?
Perception and Control
[Diagram: probabilistic perception and control over the world state]
Probabilistic Methods for Dialogue Management
Markov Decision Processes model action uncertainty
(Levin et al. 1998, Goddeau & Pineau, 2000)
Many techniques for learning optimal policies, especially reinforcement learning (Singh et al. 1999, Litman et al. 2000, Walker 2000)
Markov Decision Processes
A Markov Decision Process is given formally by the following:
a set of states S = {s1, s2, ..., sn}
a set of actions A = {a1, a2, ..., am}
a set of transition probabilities T(si, a, sj) = p(sj | a, si)
a reward function R: S × A → ℝ
a discount factor γ ∈ [0, 1]
an initial state s0 ∈ S
Bellman's equation (Bellman, 1957) computes the expected reward for each state recursively,
V(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
and determines the policy that maximises the expected, discounted reward.
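As a concrete illustration of the definitions and Bellman's equation above, here is a minimal value-iteration sketch in Python; the toy transition and reward arrays are placeholders, not the dialogue model from the talk.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. T[s, a, s'] = p(s' | a, s), R[s, a] = reward.
# These numbers are illustrative placeholders only.
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.7, 0.3], [0.0, 0.2, 0.8]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])
gamma = 0.95

def value_iteration(T, R, gamma, tol=1e-6):
    """Iterate V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * np.einsum("sap,p->sa", T, V)  # Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)            # values and greedy policy
        V = V_new

values, policy = value_iteration(T, R, gamma)
```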
The MDP in Dialogue Management
State: Represents desire of user
e.g. want_tv, want_meds
Assume utterances from the speech recogniser give the state
e.g. "I want to take my pills now."
Markov Decision Processes
Model the world as different states the system can be in
e.g. current state of completion of a form
Each action moves from state i to some new state j with probability p(i, j)
Observation from user determines posterior state
Actions are: robot motion, speech acts
Reward: maximised for satisfying user task
Markov Decision Processes
Optimal policy maximizes expected future (discounted) reward
Policy found using value iteration
Since we can compute a policy that maximises the expected reward...
then if we have...
a reasonable reward function
a reasonable transition model
Do we get behaviour that satisfies the user?
Fully Observable State Representation
Advantage: No state identification/tracking problems
Disadvantage: What if the observation is noisy or false?
Perception and Control
[Diagram: probabilistic perception model produces P(x); robust control acts on argmax P(x); world state]
Talk Outline
Robots in the real world
Partially Observable Markov Decision Processes
Solving large POMDPs
Deployed POMDPs
Control Models
POMDPs (Sondik, 1971)
Markov Decision Processes
[Diagram (MDP): a probabilistic perception model produces P(x); control acts on argmax P(x); brittle]
[Diagram (POMDP): the controller maintains beliefs b1, b2 over the world state and chooses actions a1 according to transitions T(sj | ai, si)]
Partially Observable Markov Decision Processes
[Diagram: hidden states emit observable observations Z1 with probability O(zj | si); the probabilistic perception model maintains P(x); control selects actions and receives rewards from the world state]
Navigation as a POMDP
[Diagram: hidden states S1, S2 with observation Z2 and reward R1]
The POMDP in Dialogue Management
State: Represents desire of user
e.g. want_tv, want_meds
This state is unobservable to the dialogue system
Observations: Utterances from speech recogniser
e.g. "I want to take my pills now."
The system must infer the user's state from the possibly noisy or ambiguous observations
Where do the emission probabilities come from?
At planning time, from a prior model
At run time, from the speech recognition engine
The POMDP in Dialogue Management
[Diagram: the controller chooses actions based on probability distributions over the state space, via an action/observation loop]
State is hidden from the controller
Actions are still robot motion, speech acts
Reward: maximised for satisfying user task
The POMDP in Dialogue Management
[Diagram: belief bar chart over states want_ABC, want_NBC, want_CBS]
Probability still distributed among multiple states
The POMDP in Dialogue Management
[Diagram: after the observation, the system infers "He wants the schedule for NBC!"; belief over want_ABC, want_NBC, want_CBS]
Probability mass still distributed among multiple states, but mostly centered on the true state now
The POMDP in Dialogue Management
[Diagram: the system acts, saying "Dateline is on NBC right now."; belief moves toward request_done]
Probability mass shifts to a new state as a result of the action.
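The two pictures above can be made concrete with a small Bayes-filter sketch. The states, observation likelihoods, utterance, and transition matrix below are illustrative placeholders, not the actual Nursebot model.

```python
import numpy as np

states = ["want_ABC", "want_NBC", "want_CBS", "request_done"]
belief = np.array([0.30, 0.40, 0.25, 0.05])   # current belief b(s)

# Placeholder likelihoods p(z | s) for a recognised utterance such as
# "what is on nbc"; at run time these would come from the speech recogniser.
p_z_given_s = np.array([0.05, 0.80, 0.05, 0.10])

# Placeholder transition model p(s' | s, a) for a system action such as
# "say nbc": once the request is answered, mass moves to request_done.
T_action = np.array([[0.9, 0.0, 0.0, 0.1],
                     [0.0, 0.1, 0.0, 0.9],
                     [0.0, 0.0, 0.9, 0.1],
                     [0.0, 0.0, 0.0, 1.0]])

# 1. Observation update: probability mass concentrates on the true state.
belief = belief * p_z_given_s
belief /= belief.sum()

# 2. Action update: probability mass shifts to a new state (request_done).
belief = T_action.T @ belief
print(dict(zip(states, belief.round(3))))
```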
POMDP Advantages
Models information gathering
Computes trade-off between:
Getting reward
Being uncertain
MDP makes decisions based on uncertain foreknowledge
POMDP makes decisions based on uncertain knowledge
A Simple POMDP
[Diagram: a belief p(s) over hidden states s1, s2, s3: "Am I here, or over there?"; the state is hidden]
POMDP Policies
[Diagram: value function over belief space; the current belief p(s) maps to an optimal action]
Scaling POMDPs
[Figure: a simple 20-state maze with a goal]
This simple 20 state maze problem takes 24 hours for 7 steps of value iteration (Littman et al. 1997); 1 hour (Zhang & Zhang 2001); see also Hauskrecht 2000.
The Real World
Maps with 20,000 states
600 state dialogues
Structure in POMDPs
Factored models (Boutilier & Poole, 1996; Guestrin, Koller & Parr, 2001)
Information Bottleneck models (Poupart & Boutilier, 2002)
Hierarchical POMDPs (Pineau & Thrun, 2000; Mahadevan & Theocharous, 2002)
Many others
Belief Space Structure
The controller may be globally uncertain...
but not usually.
Belief Compression
If uncertainty has few degrees of freedom, belief space should have few dimensions
Each mode has few degrees of freedom
[Figure: a multimodal belief P(x)]
Control Models
Previous models:
[Diagram (POMDP): the perception model passes the full belief P(x) to control; intractable]
[Diagram (MDP): control acts on argmax P(x); brittle]
Compressed POMDPs
[Diagram: the perception model's belief P(x) is compressed to a low-dimensional P(x), which is passed to control; world state]
The Augmented MDP
Represent beliefs using the statistic
b̃ = ⟨argmax_s b(s), H(b)⟩
where the entropy is
H(b) = −Σ_{i=1..N} p(si) log2 p(si)
Discretise into a 2-dimensional belief space MDP
Model Parameters
Reward function
[Figure: a belief p(s) over states s1, s2, s3]
Back-project to high-dimensional belief
Compute expected reward from belief:
R(b̃) = E_b[R(s)] = Σ_{s∈S} p(s) R(s)
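A minimal sketch of the statistic and the expected reward above; the example belief and per-state rewards are placeholders.

```python
import numpy as np

def augmented_state(b):
    """Summarise a belief b(s) as <most likely state, entropy H(b)>."""
    nz = b[b > 0]
    entropy = float(-np.sum(nz * np.log2(nz)))
    return int(np.argmax(b)), entropy

def expected_reward(b, R):
    """R(b~) = E_b[R(s)] = sum_s p(s) R(s)."""
    return float(b @ R)

b = np.array([0.7, 0.2, 0.1])      # example belief over s1, s2, s3
R = np.array([10.0, -1.0, -1.0])   # placeholder per-state rewards
print(augmented_state(b), expected_reward(b, R))
```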
Model Parameters
Use forward model (a sketch follows the list below)
[Diagram: the low-dimensional belief b̃i is back-projected to the full belief pi(s); action a and observation z1 give pj(s), which is projected back down to b̃j (other observations lead to b̃k, ...). The full-dimensional update is a deterministic process; the low-dimensional transition over possible observations is a stochastic process]
Augmented MDP
1. Discretize state-entropy space
2. Compute reward function and transition function
3. Solve belief state MDP
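One way the forward model could be used for step 2, sketched below: back-project each ⟨state, entropy⟩ grid cell to a representative full belief, simulate an action and a sampled observation, update the belief, and count which cell it lands in. The models, representative beliefs, and entropy binning here are illustrative assumptions, not the construction used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def belief_update(b, a, z, T, O):
    """Standard POMDP belief update for action a and observation z."""
    b_pred = T[:, a, :].T @ b          # prediction: sum_s p(s'|s,a) b(s)
    b_new = O[:, z] * b_pred           # correction with p(z | s')
    return b_new / b_new.sum()

def entropy(b):
    nz = b[b > 0]
    return -np.sum(nz * np.log2(nz))

def estimate_transitions(cell_beliefs, a, T, O, n_bins, n_samples=1000):
    """Monte Carlo estimate of T(b~_i, a, b~_j) between <argmax, entropy-bin> cells.

    cell_beliefs maps each cell index i to a representative full belief
    (an assumed back-projection, not the talk's construction)."""
    n_states = T.shape[0]
    counts = {}
    for i, b in cell_beliefs.items():
        for _ in range(n_samples):
            s = rng.choice(n_states, p=b)            # sample a true state
            s2 = rng.choice(n_states, p=T[s, a])     # simulate the action
            z = rng.choice(O.shape[1], p=O[s2])      # simulate an observation
            b2 = belief_update(b, a, z, T, O)
            h_bin = min(int(entropy(b2) / np.log2(n_states) * n_bins), n_bins - 1)
            j = (int(np.argmax(b2)), h_bin)
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return {key: n / n_samples for key, n in counts.items()}
```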
Nursebot Domain
Medication scheduling
Time and place tracking
Appointment scheduling
Simple outside knowledge, e.g. weather
Simple entertainment, e.g. TV schedules
Sphinx speech recognition, Festival speech synthesis
MDP Graph
[Figure: graph of the dialogue MDP]
Accumulation of Reward – Simulated 7 State Domain
[Plot: accumulated reward over time]
An Example Dialogue

Observation                     | True State          | Belief Entropy | Action              | Reward
hello                           | request begun       | 0.406          | say hello           | 100
what is like                    | start meds          | 2.735          | ask repeat          | -100
what time is it for will the    | want time           | 0.490          | say time            | 100
was on abc                      | want tv             | 1.176          | ask which station   | -1
was on abc                      | want abc            | 0.886          | say abc             | 100
what is on nbc                  | want nbc            | 1.375          | confirm channel nbc | -1
yes                             | want nbc            | 0.062          | say nbc             | 100
go to the that pretty good what | send robot          | 0.864          | ask robot where     | -1
that that hello be              | send robot bedroom  | 1.839          | confirm robot place | -1
the bedroom any i               | send robot bedroom  | 0.194          | go to bedroom       | 100
go it eight a hello             | send robot          | 1.110          | ask robot where     | -1
the kitchen hello               | send robot kitchen  | 1.184          | go to kitchen       | 100
Larger Slope == Better Performance
Accumulation of Reward – Simulated 17 State Domain
[Plot: accumulated reward over time]
POMDP Dialogue Manager Performance
[Performance figures]
Finding Dimensionality
E-PCA will indicate the appropriate number of bases, depending on the beliefs encountered
Example Reduction
Planning
[Diagram: the original POMDP (states S1, S2, S3) is compressed with E-PCA to a low-dimensional belief space, which is discretised into a belief space MDP with its own model parameters]
Model Parameters
Reward function
Back-project to high-dimensional belief p(s)
Compute expected reward from belief:
R(b̃) = E_b[R(s)] = Σ_{s∈S} p(s) R(s)
Transition function
T(b̃i, a, b̃j) = ?
[Diagram: grid beliefs b̃i, b̃j over states s1, s2, s3]
Model Parameters
Use forward model
[Diagram: low dimension vs. full dimension. The low-dimensional belief b̃i is back-projected to the full belief pi(s); action a and observation z1 give pj(s), which is projected back down to b̃j; other observations (e.g. z2) lead to b̃k, b̃q. The full-dimensional update is a deterministic process; the low-dimensional transition over observations is a stochastic process]
T(b̃i, a, b̃j) ≈ Σ_s p(z | s) bi(s | a)   if bj(s) = bi(s | a, z)
             = 0                          otherwise
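A sketch of the transition formula above: enumerate the observations z, update the belief for each, and assign p(z | b_i, a) to the grid cell that the updated belief projects to. The `project` function, which maps a full belief back to a low-dimensional grid point (e.g. by nearest neighbour), is a placeholder.

```python
import numpy as np

def belief_transitions(b_i, a, T, O, project):
    """Return {j: T(b~_i, a, b~_j)} by enumerating observations.

    b_i:     full belief reconstructed from grid point i
    T, O:    T[s, a, s'] transition model and O[s', z] observation model
    project: maps a full belief to the index of its grid belief b~_j (placeholder)
    """
    b_pred = T[:, a, :].T @ b_i                 # b_i(s | a)
    trans = {}
    for z in range(O.shape[1]):
        p_z = float(O[:, z] @ b_pred)           # sum_s p(z | s) b_i(s | a)
        if p_z == 0.0:
            continue
        b_j = (O[:, z] * b_pred) / p_z          # b_i(s | a, z)
        j = project(b_j)
        trans[j] = trans.get(j, 0.0) + p_z      # probability mass for that grid cell
    return trans
```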
E-PCA POMDPs
1. Collect sample beliefs
2. Find low-dimensional belief representation (see the E-PCA sketch after this list)
3. Discretize
4. Compute reward function and transition function
5. Solve belief state MDP
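A rough sketch of step 2 under the exponential-family (Poisson-link) PCA formulation, in which the matrix of sampled beliefs B is approximated as exp(U X). The plain gradient descent, learning rate, and iteration count are simplifications of the real optimisation.

```python
import numpy as np

def epca(B, k, iters=5000, lr=1e-3, seed=0):
    """Find bases U (n_states x k) and coordinates X (k x n_beliefs) with
    B ~= exp(U X), by gradient descent on the unnormalised KL-style loss
    sum( exp(U X) - B * (U X) )."""
    n, m = B.shape
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((n, k))
    X = 0.01 * rng.standard_normal((k, m))
    for _ in range(iters):
        G = np.exp(U @ X) - B          # gradient of the loss w.r.t. (U @ X)
        U -= lr * (G @ X.T)
        X -= lr * (U.T @ G)
    return U, X

def back_project(U, x):
    """Map low-dimensional coordinates x back to a normalised belief."""
    b = np.exp(U @ x)
    return b / b.sum()
```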
Robot Navigation Example
[Figures: initial belief distribution over robot position, with the true robot position and goal position marked]
People Finding as a POMDP
Factored state space
2 dimensions: fully-observable robot position
6 dimensions: distribution over person positions
Regular grid gives ≈ 10^16 states
[Figure: map with the true robot position and goal position marked]
Variable Resolution Discretization
Variable resolution:
Variable Resolution Dynamic Programming (1991)
Variable Resolution Discretization (Munos & Moore, 2000)
POMDP Grid-based Approximations (Hauskrecht, 2001)
Improved POMDP Grid-based Approximations (Zhou & Hansen, 2001)
Instance-based:
Parti-game (Moore, 1993): nearest-neighbour state representation, deterministic
Utile Distinction Trees: suffix tree representation, reward statistics splitting criterion, stochastic
Combine the two approaches: "Stochastic Parti-Game"
Refining the Grid
Non-regular grid using samples
[Diagram: grid beliefs b̃1 ... b̃5 with values V(b̃1), V(b̃'1) and transitions T(b̃1, a1, b̃2), T(b̃1, a2, b̃5)]
Sample beliefs according to the optimal policy
Construct new model
Keep the new belief if V(b̃'1) > V(b̃1)
Compute model parameters using nearest-neighbour (sketch below)
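A minimal sketch of the refinement loop above; the value-estimation and distance functions are placeholders for however values and belief-space distances are actually computed.

```python
def refine_grid(grid, values, sampled_beliefs, value_estimate, distance):
    """Keep a sampled belief as a new grid point if its estimated value
    exceeds that of its nearest existing grid point.

    grid:            list of low-dimensional grid beliefs b~
    values:          list of V(b~) for each grid point
    sampled_beliefs: beliefs encountered while executing the current policy
    value_estimate:  callable estimating V for a candidate belief (placeholder)
    distance:        callable giving a distance between two beliefs (placeholder)
    """
    for b_new in sampled_beliefs:
        nearest = min(range(len(grid)), key=lambda i: distance(grid[i], b_new))
        v_new = value_estimate(b_new)
        if v_new > values[nearest]:
            grid.append(b_new)
            values.append(v_new)
    return grid, values
```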
Policy Comparison
[Figure: original distribution vs. reconstruction using E-PCA and 6 bases, with robot position and true person position marked]
[Bar chart: average time to find person for True MDP, Closest, Densest, Maximum Likelihood, E-PCA, and Var. Res. E-PCA policies]
E-PCA: 72 states
Var. Res. E-PCA: 260 states
Summary
POMDPs for robotic control improve system performance
POMDPs can scale to real problems
Belief spaces are structured
Compress to low-dimensional statistics
Find controller for low-dimensional space
Open Problems
Better integration and modelling of people
Better spatial and temporal models
Integrating learning into control models
Integrating control into learning models