Model-based Bayesian Reinforcement Learning for Dialogue Management
Pierre Lison
Language Technology Group, University of Oslo
Interspeech, August 26, 2013
Motivation
• Hand-crafting dialogue policies is hard!
  • Noise & uncertainty (e.g. speech recognition errors)
  • Large number of possible trajectories
• Alternative: automatically optimise dialogue policies from (real or simulated) experience
• Two types of approaches:
  • Model-free reinforcement learning
  • Model-based reinforcement learning (focus of this talk)
Motivation
• Model-based reinforcement learning:
  • Collect interactions and use them to estimate explicit models of the domain
  • Use the resulting models to plan the best action
• Key advantage: can exploit prior knowledge to structure the domain models
• We present an experiment showing the benefits of this approach
POMDPs
[Figure: POMDP unrolled as a dynamic Bayesian network over two time slices, with state St, observation Ot, action At and reward Rt at time t, followed by St+1, Ot+1, At+1 and Rt+1 at time t+1]
POMDPs with model uncertainty
[Figure: the same POMDP network, extended with unknown parameter nodes θT (transition model), θO (observation model) and θR (reward model)]
Approach
• After each observation, the parameters are updated via Bayesian inference (sketched below)
  • Parameter distributions are gradually narrowed down to the values that best fit the observed data
• Forward planning is used to select the next action to execute at runtime
  • Three sources of uncertainty: state uncertainty, stochastic action effects, and model uncertainty
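As a minimal illustration of the Bayesian update step, the sketch below assumes a single rule parameter with a Dirichlet prior over a small set of effects, updated with counts of observed outcomes. The effect names and counts are hypothetical and only illustrate the mechanism, not the actual system.

```python
import numpy as np

# Hypothetical effects governed by one rule parameter, e.g. possible user reactions
effects = ["Confirm", "Disconfirm", "Other"]

# Dirichlet prior over the effect probabilities (uniform pseudo-counts)
prior_counts = np.ones(len(effects))

def posterior_counts(counts, observations):
    """Bayesian update: add the observed effect counts to the Dirichlet pseudo-counts."""
    updated = counts.copy()
    for obs in observations:
        updated[effects.index(obs)] += 1
    return updated

# Each new observation narrows the distribution towards the values that fit the data
observed = ["Disconfirm", "Disconfirm", "Confirm"]
post = posterior_counts(prior_counts, observed)

# Posterior mean estimate of the rule parameters
theta_mean = post / post.sum()
print(dict(zip(effects, theta_mean.round(3))))
```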
Abstraction
• Dialogue domains often have large, complex state and action spaces
• Need generalisation/abstraction techniques to avoid the «curse of dimensionality»
• The framework of probabilistic rules offers such an abstraction language:
  • Captures domain structure through (parametrised) rules mapping conditions to probabilistic effects
  • Drastic reduction in the number of parameters
Probabilistic rules
• Structured if...then...else cases associating conditions to distributions over effects (code sketch below):

  if (condition1 holds) then
    P(effect1) = θ1, P(effect2) = θ2, ...
  else if (condition2 holds) then
    P(effect3) = θ3, ...
  ...

• Probabilistic rules serve as high-level templates for a Bayesian network
[P. Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012]
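A minimal sketch of how such a rule could be represented in code, assuming a dictionary-based dialogue state. The Rule class, its field names, and the example condition are illustrative assumptions, not the actual implementation behind the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    """A probabilistic rule: a condition mapped to a distribution over effects."""
    condition: Callable[[dict], bool]   # predicate over the current dialogue state
    effects: Dict[str, float]           # effect -> probability (the theta parameters)

    def apply(self, state: dict) -> Dict[str, float]:
        # If the condition holds, return the effect distribution; otherwise leave the state unchanged
        return self.effects if self.condition(state) else {"no-change": 1.0}

# Illustrative rule: when confidence in the user intention is low, prefer asking for confirmation
rule = Rule(
    condition=lambda s: s.get("confidence", 1.0) < 0.5,
    effects={"AskConfirm": 0.9, "AskRepeat": 0.1},
)
print(rule.apply({"confidence": 0.3}))   # -> {'AskConfirm': 0.9, 'AskRepeat': 0.1}
```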
Probabilistic rules: example
r1: ∀ X:
  if (am = AskConfirm(X) ∧ iu ≠ X) then
    [P(au' = Disconfirm) = θ1]

[Figure: Bayesian network fragment generated by r1, with the rule node r1 conditioned on am and iu, the parameter node θ1, and the output variable au']
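To make the example concrete, the sketch below spells out the conditional probability that r1 assigns to au' = Disconfirm given am and iu. It follows the slide's notation, but the function and the example values are assumptions for illustration only.

```python
from typing import Optional

def r1_p_disconfirm(a_m: str, i_u: str, x: str, theta_1: float) -> Optional[float]:
    """P(au' = Disconfirm) according to rule r1.

    r1 applies only when am = AskConfirm(X) and iu != X; otherwise it makes
    no prediction about au' and returns None.
    """
    if a_m == f"AskConfirm({x})" and i_u != x:
        return theta_1
    return None

# Illustrative call with assumed values (none of these come from the slides)
print(r1_p_disconfirm("AskConfirm(PickUpBox)", "MoveForward", "PickUpBox", theta_1=0.85))
```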
Evaluation
• Evaluation of the learning approach in a simulated environment:
  • Human-robot interaction domain (with a Nao robot)
  • Simulator constructed from Wizard-of-Oz data
  • Goal: estimate the transition model of the domain (the reward model is given)
Simulator
• Simulation models:
  • User modelling: how the user is expected to react to the system actions
  • Context modelling: how the system actions change the state of the environment
  • Error modelling: how understanding errors can occur
• Collected and annotated Wizard-of-Oz data to empirically estimate these models
Experimental setup
• Two alternative formalisations of the transition model:
  • Baseline: classical (factored) categorical distributions
  • Our approach: model structured with probabilistic rules
[Figure: two graphical models of the transition from (St-1, At-1) to St — the baseline with a single parameter node θ over the categorical distributions, and the rule-structured model with rule nodes r1 ... rn and their parameters θr1 ... θrn]
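A back-of-the-envelope comparison of the two formalisations' parameter counts, using purely hypothetical domain sizes (none of these numbers come from the slides or the experiment):

```python
# Back-of-the-envelope parameter counts (all sizes below are hypothetical)
n_states, n_actions = 1000, 20        # assumed sizes of the state and action spaces
n_rules, effects_per_rule = 15, 3     # assumed rule-structured transition model

full_categorical = n_states * n_actions * (n_states - 1)   # one distribution per (state, action)
rule_structured = n_rules * (effects_per_rule - 1)          # one small distribution per rule

print(full_categorical, rule_structured)   # -> 19980000 30 (free parameters)
```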
Results: average return
[Figure: average return (0 to 6) as a function of the number of simulated interactions (2 to 146), comparing the baseline with categorical distributions against the model with probabilistic rules]
Conclusion
• Hybrid approach to dialogue policy optimisation:
  • Domain models structured with probabilistic rules
  • Rule parameters estimated via model-based Bayesian RL
• Experiment shows that the rule-structured model outperforms a classical factored model
• Future work:
  • Evaluate the approach with real interactions
  • Combine offline and online planning