Managing Uncertainty within the KTD Framework

JMLR: Workshop and Conference Proceedings 16 (2011) 157–168    Workshop on Active Learning and Experimental Design

Matthieu Geist (matthieu.geist@supelec.fr)
Olivier Pietquin (olivier.pietquin@supelec.fr)
IMS Research Group, Supélec, Metz, France

Editors: I. Guyon, G. Cawley, G. Dror, V. Lemaire, and A. Statnikov
Abstract
The dilemma between exploration and exploitation is an important topic in reinforcement learning (RL). Most successful approaches to this problem use some uncertainty information about the values estimated during learning. On the other hand, scalability is a known weakness of RL algorithms, and value function approximation has become a major topic of research. Both problems arise in real-world applications, yet few approaches allow approximating the value function while maintaining uncertainty information about the estimates, and even fewer use this information to address the exploration/exploitation dilemma. In this paper, we show how such uncertainty information can be derived from a Kalman-based Temporal Differences (KTD) framework, and how it can be used.

Keywords: Value function approximation, active learning, exploration/exploitation dilemma
1. Introduction

Reinforcement learning (RL) (Sutton and Barto, 1996) is the machine learning answer to the well-known problem of optimal control of dynamic systems. In this paradigm, an agent learns to control its environment (i.e., the dynamic system) through examples of actual interactions. To each of these interactions is associated an immediate reward, which is a local hint about the quality of the current control policy. More formally, at each (discrete) time step i the dynamic system to be controlled is in a state s_i. The agent chooses an action a_i, and the dynamic system is then driven to a new state, say s_{i+1}, following its own dynamics. The agent receives a reward r_{i+1} associated with the transition (s_i, a_i, s_{i+1}). The agent's objective is to maximize the expected cumulative reward, which it internally models as a so-called value or Q-function (see later). In the most challenging cases, learning has to be done online, and the agent has to control the system while trying to learn the optimal policy. A major issue is then the choice of the behavior policy and the associated dilemma between exploration and exploitation (which can be linked to active learning). Indeed, at each time step the agent can choose an optimal action according to its (maybe) imperfect knowledge of the environment (exploitation), or an action considered suboptimal so as to improve its knowledge (exploration) and subsequently its policy. The ε-greedy action selection is a popular choice, which consists in selecting the greedy action with probability 1-ε and a uniformly distributed random action with probability ε. Another popular scheme is the softmax action selection (Sutton and Barto, 1996), drawing the behavior action from a Gibbs distribution.
Most successful approaches tend to use uncertainty information to choose between exploration and exploitation, but also to drive exploration. Dearden et al. (1998) maintain a distribution for each Q-value. They propose two schemes: the first one consists in sampling the action according to the Q-value distribution, while the second one uses a myopic value of imperfect information, which approximates the utility of an information-gathering action in terms of the expected improvement of the decision quality. Strehl and Littman (2006) maintain a confidence interval for each Q-value, and the policy is greedy with respect to the upper bound of this interval. This approach allows deriving probably-approximately-correct (PAC) bounds. Sakaguchi and Takano (2004) use a Gibbs policy; however, a reliability index (actually a form of uncertainty) is used instead of the more classic temperature parameter. Most of these approaches are designed for problems where an exact (tabular) representation of the value function is possible. Nevertheless, approximating the value function in the case of large state spaces is another topic of importance in RL. There are some model-based algorithms which address this problem (Kakade et al., 2003; Jong and Stone, 2007; Li et al., 2009b); they imply approximating the model in addition to the value function. However, we focus here on pure model-free approaches (just the value function is estimated). Unfortunately, quite few value function approximators allow deriving uncertainty information about estimated values. Engel (2005) proposes such a model-free algorithm, but the actual use of value uncertainty is left as a perspective. In this paper, we show how uncertainty information about estimated values can be derived from the Kalman Temporal Differences (KTD) framework of Geist et al. (2009a,b). We also introduce a form of active learning which uses this uncertainty information in order to speed up learning, as well as some adaptations of existing schemes designed to handle the exploration/exploitation dilemma. Each contribution is illustrated and experimented, the last one on a real-world dialogue management problem.
2. Background
2.1. Reinforcement Learning

This paper is placed within the framework of Markov decision processes (MDPs). An MDP is a tuple {S, A, P, R, γ}, where S is the state space, A the action space, P : (s, a) ∈ S × A → p(·|s, a) ∈ P(S) a family of transition probabilities, R : S × A × S → ℝ the bounded reward function, and γ the discount factor. A policy π associates to each state a probability over actions, π : s ∈ S → π(·|s) ∈ P(A). The value function of a given policy is defined as V^π(s) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, π], where r_i is the immediate reward observed at time step i and the expectation is taken over all possible trajectories starting in s, given the system dynamics and the followed policy. The Q-function allows a supplementary degree of freedom for the first action and is defined as Q^π(s, a) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a, π]. RL aims at finding (through interactions) the policy π* which maximises the value function for every state: π* = argmax_π(V^π).
Two schemes among others can lead to the optimal policy. First, policy iteration involves learning the value function of a given policy and then improving the policy, the new one being greedy with respect to the learnt value function. It requires solving the Bellman evaluation equation, which is given here for the value and the Q-function:

V^π(s) = E_{s',a|π,s}[R(s, a, s') + γ V^π(s')]   and   Q^π(s, a) = E_{s',a'|π,s,a}[R(s, a, s') + γ Q^π(s', a')].

The second scheme, value iteration, aims directly at finding the optimal policy. It requires solving the Bellman optimality equation:

Q*(s, a) = E_{s'|s,a}[R(s, a, s') + γ max_{b∈A} Q*(s', b)].
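As an illustration of this last equation (not taken from the paper), the following minimal sketch solves the Bellman optimality equation by repeated backups on a small, fully known MDP; the dictionary-of-transitions encoding and the function name are our own choices.

```python
import numpy as np

def q_value_iteration(P, gamma=0.95, n_sweeps=200):
    """Tabular value iteration on the Q-function (a sketch, not the paper's code).

    P[s][a] is a list of (probability, next_state, reward) triples
    describing the known dynamics of a small MDP.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        Q_new = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup: E[R + gamma * max_b Q(s', b)]
                Q_new[s, a] = sum(p * (r + gamma * Q[s_next].max())
                                  for p, s_next, r in P[s][a])
        Q = Q_new
    return Q
```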
For large state and action spaces, exact solutions are difficult to obtain, and value or Q-function approximation is required.
2.2. Kalman Temporal Differences (KTD)

Originally, the Kalman (1960) filter paradigm is a statistical method aiming at online tracking of the hidden state of a non-stationary dynamic system through indirect observations of this state. The idea behind KTD is to cast value function approximation into such a filtering paradigm: considering a function approximator based on a family of parameterized functions, the parameters are the hidden state to be tracked, the observation being the reward, linked to the parameters through one of the classical Bellman equations. Thereby value function approximation can benefit from the advantages of Kalman filtering, and particularly from uncertainty management thanks to statistical modelling.
The following notations are adopted, depending on whether the aim is value function evaluation, Q-function evaluation or direct Q-function optimization:

t_i = (s_i, s_{i+1}),   (s_i, a_i, s_{i+1}, a_{i+1})   or   (s_i, a_i, s_{i+1}),   and respectively

g_{t_i}(θ_i) = V̂_{θ_i}(s_i) - γ V̂_{θ_i}(s_{i+1}),
g_{t_i}(θ_i) = Q̂_{θ_i}(s_i, a_i) - γ Q̂_{θ_i}(s_{i+1}, a_{i+1}),
g_{t_i}(θ_i) = Q̂_{θ_i}(s_i, a_i) - γ max_{b∈A} Q̂_{θ_i}(s_{i+1}, b).        (1)

V̂_θ (resp. Q̂_θ) is a parametric representation of the value (resp. Q-) function, θ being the parameter vector. A statistical point of view is adopted and the parameter vector is considered as a random variable. The problem at hand is stated in a so-called state-space formulation:

θ_i = θ_{i-1} + v_i,
r_i = g_{t_i}(θ_i) + n_i.        (2)
Using the vocabulary of Kalman filtering, the first equation is the evolution equation. It specifies that the searched parameter vector follows a random walk whose expectation corresponds to the optimal estimate of the value function at time step i. The evolution noise v_i is centered, white, independent and of variance matrix P_{v_i}. The second equation is the observation equation; it links the observed transitions and rewards to the value (or Q-) function through one of the Bellman equations. The observation noise n_i is supposed centered, white, independent and of variance P_{n_i}.

KTD is a second-order algorithm: it updates the mean parameter vector, but also the associated covariance matrix, after each interaction. It breaks down into three steps. First, predictions of the parameters' first and second order moments are obtained according to the evolution equation and using previous estimates. Then some statistics of interest are computed. The third step applies a correction to the predicted moments of the parameter vector according to the so-called Kalman gain K_i (computed thanks to the statistics obtained in the second step), the predicted reward r̂_{i|i-1} and the observed reward r_i (their difference being a form of temporal difference error).

Statistics of interest are generally not analytically computable, except in the linear case. This does not hold for nonlinear parameterizations such as neural networks, nor for the Bellman optimality equation (because of the max operator).
Nevertheless, a derivative-free approximation scheme, the unscented transform (UT) of Julier and Uhlmann (2004), allows estimating the first and second order moments of a nonlinearly mapped random vector. Let X be a random vector (typically the parameter vector) and Y = f(X) its nonlinear mapping (typically the g_{t_i} function). Let n be the dimension of the random vector X. A set of 2n+1 so-called sigma-points and associated weights are computed as follows:
x^(0) = X̄,                                         with weight  w_0 = κ / (n + κ),
x^(j) = X̄ + (√((n + κ) P_X))_j,        1 ≤ j ≤ n,      with weight  w_j = 1 / (2(n + κ)),
x^(j) = X̄ - (√((n + κ) P_X))_{j-n},    n+1 ≤ j ≤ 2n,   with weight  w_j = 1 / (2(n + κ)),

where X̄ is the mean of X, P_X is its variance matrix, κ is a scaling factor which controls the accuracy, and (√P_X)_j is the jth column of the Cholesky decomposition of P_X. Then the image of each sigma-point through the mapping f is computed: y^(j) = f(x^(j)), 0 ≤ j ≤ 2n. The set of sigma-points and their images can then be used to compute the following approximations:

Ȳ ≈ ȳ = Σ_{j=0}^{2n} w_j y^(j),
P_Y ≈ Σ_{j=0}^{2n} w_j (y^(j) - ȳ)(y^(j) - ȳ)^T,
P_{XY} ≈ Σ_{j=0}^{2n} w_j (x^(j) - X̄)(y^(j) - ȳ)^T.
Thanks to the UT, practical algorithms can be derived. At time step i, a set of sigma-points is computed from the predicted random parameters characterized by mean θ̂_{i|i-1} and variance P_{i|i-1}. Predicted rewards are then computed as images of these sigma-points using one of the observation functions (1). Then sigma-points and their images are used to compute the statistics of interest. This gives rise to a generic algorithm, valid for any of the three Bellman equations and any parametric representation of V or Q, summarized in Alg. 1, p being the number of parameters. More details as well as theoretical results (such as proofs of convergence) about KTD are provided by Geist and Pietquin (2010).
Algorithm 1: KTD

Initialization: priors θ̂_{0|0} and P_{0|0}.
for i = 1, 2, ... do
    Observe transition t_i and reward r_i.

    Prediction step:
        θ̂_{i|i-1} = θ̂_{i-1|i-1}
        P_{i|i-1} = P_{i-1|i-1} + P_{v_i}

    Sigma-points computation:
        Θ_{i|i-1} = {θ̂^(j)_{i|i-1}, 0 ≤ j ≤ 2p}                               /* from θ̂_{i|i-1} and P_{i|i-1} */
        W = {w_j, 0 ≤ j ≤ 2p}
        R_{i|i-1} = {r̂^(j)_{i|i-1} = g_{t_i}(θ̂^(j)_{i|i-1}), 0 ≤ j ≤ 2p}      /* see Eq. (1) */

    Compute statistics of interest:
        r̂_{i|i-1} = Σ_{j=0}^{2p} w_j r̂^(j)_{i|i-1}
        P_{θr_i} = Σ_{j=0}^{2p} w_j (θ̂^(j)_{i|i-1} - θ̂_{i|i-1})(r̂^(j)_{i|i-1} - r̂_{i|i-1})
        P_{r_i} = Σ_{j=0}^{2p} w_j (r̂^(j)_{i|i-1} - r̂_{i|i-1})² + P_{n_i}

    Correction step:
        K_i = P_{θr_i} P_{r_i}^{-1}
        θ̂_{i|i} = θ̂_{i|i-1} + K_i (r_i - r̂_{i|i-1})
        P_{i|i} = P_{i|i-1} - K_i P_{r_i} K_i^T
end for
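To make Alg. 1 concrete, here is a minimal sketch of one KTD iteration; the variable names, the signature and the choice of κ are ours, not the authors' reference implementation.

```python
import numpy as np

def ktd_update(theta, P, transition, g, P_evolution, P_obs_noise, kappa=1.0):
    """One KTD iteration (a sketch of Alg. 1).

    theta, P     : current parameter mean and covariance
    transition   : pair (t_i, r_i) of observed transition and reward
    g            : observation function g_{t_i}(theta), one of Eq. (1)
    P_evolution  : evolution noise covariance P_v_i (matrix)
    P_obs_noise  : observation noise variance P_n_i (scalar)
    """
    t_i, r_i = transition
    p = theta.shape[0]

    # Prediction step: random-walk evolution model.
    theta_pred = theta.copy()
    P_pred = P + P_evolution

    # Sigma-points from the predicted parameter distribution.
    sqrt_P = np.linalg.cholesky((p + kappa) * P_pred)
    sigmas = [theta_pred] \
        + [theta_pred + sqrt_P[:, j] for j in range(p)] \
        + [theta_pred - sqrt_P[:, j] for j in range(p)]
    w = np.full(2 * p + 1, 1.0 / (2.0 * (p + kappa)))
    w[0] = kappa / (p + kappa)

    # Predicted rewards: images of the sigma-points through g (Eq. (1)).
    r_imgs = np.array([g(t_i, s) for s in sigmas])

    # Statistics of interest.
    r_pred = np.dot(w, r_imgs)
    d_theta = np.array(sigmas) - theta_pred
    d_r = r_imgs - r_pred
    P_theta_r = (w[:, None] * d_theta * d_r[:, None]).sum(axis=0)
    P_r = np.dot(w, d_r ** 2) + P_obs_noise

    # Correction step (Kalman gain).
    K = P_theta_r / P_r
    theta_new = theta_pred + K * (r_i - r_pred)
    P_new = P_pred - P_r * np.outer(K, K)
    return theta_new, P_new
```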
3. Computing Uncertainty over Values

The parameters being modeled as random variables, the parameterized value of any given state is itself a random variable, whose mean and associated uncertainty can be computed. Let V̂_θ be the approximated value function parameterized by the random vector θ of mean θ̄ and variance matrix P_θ, and let V̄_θ(s) and σ̂²_{V_θ}(s) be the associated mean and variance for a given state s. To propagate the uncertainty from the parameters to the approximated value function, a first step is to compute the sigma-points associated to the parameter vector, that is Θ = {θ^(j), 0 ≤ j ≤ 2p}, as well as the corresponding weights, from θ̄ and P_θ as described before.
Figure 1: Uncertainty computation.

Figure 2: Dialogue management results (average discounted sum of rewards vs. number of transitions, for KTD-SARSA with the bonus-greedy and ε-greedy policies, and for LSPI).
Then the images of these sigma-points are computed using the parameterized value function: V_Θ(s) = {V̂^(j)_θ(s) = V̂_{θ^(j)}(s), 0 ≤ j ≤ 2p}. Knowing these images and the corresponding weights, the statistics of interest are computed:

V̄_θ(s) = Σ_{j=0}^{2p} w_j V̂^(j)_θ(s)   and   σ̂²_{V_θ}(s) = Σ_{j=0}^{2p} w_j (V̂^(j)_θ(s) - V̄_θ(s))².

This is illustrated in Fig. 1. The extension to the Q-function is straightforward. So, at each time step, uncertainty information can be computed within the KTD framework.
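A minimal sketch of this propagation for a single state, assuming a parameterized value function value_fn(theta, state); the names are ours.

```python
import numpy as np

def value_mean_and_std(theta_mean, P_theta, value_fn, state, kappa=1.0):
    """Mean and standard deviation of V_theta(state) when theta is a random
    vector of mean theta_mean and covariance P_theta (sigma-point propagation)."""
    p = theta_mean.shape[0]
    sqrt_P = np.linalg.cholesky((p + kappa) * P_theta)
    sigmas = [theta_mean] \
        + [theta_mean + sqrt_P[:, j] for j in range(p)] \
        + [theta_mean - sqrt_P[:, j] for j in range(p)]
    w = np.full(2 * p + 1, 1.0 / (2.0 * (p + kappa)))
    w[0] = kappa / (p + kappa)
    values = np.array([value_fn(th, state) for th in sigmas])   # V^(j)(s)
    v_mean = np.dot(w, values)
    v_var = np.dot(w, (values - v_mean) ** 2)
    return v_mean, np.sqrt(v_var)
```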
4. A Form of Active Learning
4.1. Principle
It is shown here how this available uncertainty information can be used in a form of active learning. The KTD algorithm derived from the Bellman optimality equation, that is Alg. 1 with the third equation of Eq. (1), is named KTD-Q. It is an off-policy algorithm: it learns the optimal policy π* while following a different behavioral policy b. A natural question is: what behavioral policy should be chosen so as to speed up learning? Let i be the current temporal index. The system is in a state s_i, and the agent has to choose an action a_i. The predictions θ̂_{i|i-1} and P_{i|i-1} are available and can be used to approximate the uncertainty of the Q-function parameterized by θ_{i|i-1} in the state s_i and for any action a. Let σ̂²_{Q_{i|i-1}}(s_i, a) be the corresponding variance. The action a_i is chosen according to the following heuristic:

b(a|s_i) = σ̂_{Q_{i|i-1}}(s_i, a) / Σ_{b∈A} σ̂_{Q_{i|i-1}}(s_i, b).        (4)

This totally explorative policy favours uncertain actions. The corresponding algorithm, Alg. 1 with the third equation of (1) and policy (4), is called active KTD-Q.
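A minimal sketch of the behavior policy (4), where q_std[a] is assumed to hold the estimated standard deviation σ̂_{Q_{i|i-1}}(s_i, a) computed as in Section 3; the names are ours.

```python
import numpy as np

def active_behavior_policy(q_std, rng=None):
    """Draw an action with probability proportional to its estimated
    standard deviation, as in the exploration heuristic (4)."""
    rng = rng or np.random.default_rng()
    q_std = np.asarray(q_std, dtype=float)
    probs = q_std / q_std.sum()          # b(a|s_i) of Eq. (4)
    return int(rng.choice(len(q_std), p=probs))
```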
4.2. Experiment

The second experiment is the inverted pendulum benchmark. This task requires maintaining a pendulum of unknown length and mass at the upright position by applying forces to the cart it is attached to. It is fully described by Lagoudakis and Parr (2003), and we use the same parameterization (a mixture of Gaussian kernels). The goal here is to compare two value-iteration-like algorithms, namely KTD-Q and Q-learning, which aim at learning directly the optimal policy from suboptimal trajectories (off-policy learning).
Figure 3: Optimal policy learning (steps per test episode vs. number of learning episodes, for KTD-Q and Q-learning).

Figure 4: Random and active learning (steps per test episode vs. number of learning episodes).
As far as we know, KTD-Q is the first second-order algorithm for Q-function approximation in a value iteration scheme, the difficulty being to handle the max operator (Yu and Bertsekas (2007) also propose such an algorithm, however for a restrictive class of MDPs). That is why we compare it to a first-order algorithm. The active learning scheme is also experimented: it uses the uncertainty computed by KTD to speed up convergence.

For Q-learning, the learning rate is set to α_i = α_0 (n_0 + 1)/(n_0 + i) with α_0 = 0.5 and n_0 = 200, according to Lagoudakis and Parr (2003). For KTD-Q, the parameters are set to P_{0|0} = 10I, P_{n_i} = 1 and P_{v_i} = 0I. For all algorithms the initial parameter vector is set to zero.
Training samples are first collected online with a random behavior policy. The agent starts in a randomly perturbed state close to the equilibrium. Performance is measured as the average number of steps in a test episode (a maximum of 3000 steps is allowed). Results are averaged over 100 trials. Fig. 3 compares KTD-Q and Q-learning (the same random samples are used to train both algorithms). Fig. 4 adds active KTD-Q, for which actions are sampled according to (4). The average length of episodes with the totally random policy is 10, whereas it is 11 for policy (4). Consequently, the increase in length can only slightly help to improve the speed of convergence (at most 10%, much less than the real improvement, which is about 100%, at least at the beginning).

According to Fig. 3, KTD-Q learns an optimal policy (that is, balancing the pole for the maximum number of steps) asymptotically, and near-optimal policies are learned after only a few tens of episodes (notice that these results are comparable to the LSPI algorithm). With the same number of learning episodes, Q-learning with the same linear parameterization fails to learn a policy which balances the pole for more than a few tens of time steps. Similar results for Q-learning are obtained by Lagoudakis and Parr (2003). According to Fig. 4, it is clear that sampling actions according to uncertainty speeds up convergence: it is almost doubled in the first 100 episodes. Notice that this active learning scheme could not have been used for Q-learning with value function approximation, as this algorithm cannot provide uncertainty information.
5. Exploration/Exploitation Dilemma

In this section, we present several approaches designed to handle the dilemma between exploration and exploitation (which can be linked to active learning). The first one is the well-known ε-greedy policy, and it serves as a baseline. The other approaches are inspired from the literature and use the available uncertainty information (see Sec. 3 for its computation).
The corresponding algorithms are a combination of KTD-SARSA (Alg. 1 with the second equation of (1)) with policies (5)-(8).

5.1. ε-greedy Policy

With an ε-greedy policy (Sutton and Barto, 1996), the agent chooses a greedy action with respect to the currently estimated Q-function with probability 1-ε, and a random action with probability ε (δ being the Kronecker symbol):

π(a_{i+1}|s_{i+1}) = (1 - ε) δ(a_{i+1} = argmax_{b∈A} Q̂_{i|i-1}(s_{i+1}, b)) + ε δ(a_{i+1} ≠ argmax_{b∈A} Q̂_{i|i-1}(s_{i+1}, b)).        (5)
This policy is perhaps the most basic one, and it does not use any uncertainty information. An arbitrary Q-function for a given state and 4 different actions is illustrated in Fig. 6. For each action, it gives the estimated Q-value as well as the associated uncertainty (that is, the estimated standard deviation). For example, action 3 has the highest value and the lowest uncertainty, and action 1 the lowest value but the highest uncertainty. The probability distribution associated to the ε-greedy policy is illustrated in Fig. 5.a. The highest probability is associated to action 3, and the other actions have the same (low) probability, despite their different estimated values and standard deviations.
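For completeness, a minimal sketch of the ε-greedy baseline (5); the names are ours.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```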
5.2. Confident-greedy Policy

The second approach we propose consists in acting greedily according to the upper bound of an estimated confidence interval. The approach is not novel (Kaelbling, 1993); however, some PAC (probably approximately correct) guarantees have been given recently by Strehl and Littman (2006) for a tabular representation (for which the confidence interval is proportional to the inverse of the square root of the number of visits to the considered state-action pair). In our case, we postulate that the width of the confidence interval is proportional to the estimated standard deviation (which is true if the parameter distribution is assumed to be Gaussian).
Let α be a free positive parameter; we define the confident-greedy policy as:

π(a_{i+1}|s_{i+1}) = δ(a_{i+1} = argmax_{b∈A} [Q̂_{i|i-1}(s_{i+1}, b) + α σ̂_{Q_{i|i-1}}(s_{i+1}, b)]).        (6)
The same arbitrary Q-values are considered (see Fig. 6), and the confident-greedy policy is illustrated in Fig. 5.b, which represents the upper bound of the confidence interval. Action 1 is chosen because it has the highest score (despite the fact that it has the lowest estimated value). Notice that action 3, which is greedy with respect to the estimated Q-function, has only the third score.
5.3. Bonus-greedy Policy

The third approach we propose is inspired from the method of Kolter and Ng (2009). The policy they use is greedy with respect to the estimated Q-function plus a bonus, this bonus being proportional to the inverse of the number of visits to the state-action pair of interest (which can be interpreted as a variance, instead of the square root of this quantity for interval-estimation-based approaches, which can be interpreted as a standard deviation).
Figure 5: Policies. a. ε-greedy. b. Confident-greedy. c. Bonus-greedy. d. Thompson.
The bonus-greedy policy we propose uses the variance rather than the standard deviation, and is defined as (β_0 and β being two free parameters):

π(a_{i+1}|s_{i+1}) = δ(a_{i+1} = argmax_{b∈A} [Q̂_{i|i-1}(s_{i+1}, b) + β σ̂²_{Q_{i|i-1}}(s_{i+1}, b) / (β_0 + σ̂²_{Q_{i|i-1}}(s_{i+1}, b))]).        (7)
The bonus-greedy policy is illustrated in Fig. 5.c, still using the arbitrary Q-values and associated standard deviations of Fig. 6. Action 2 has the highest score, so it is chosen. Notice that the three other actions have approximately the same score, despite the fact that they have quite different Q-values.
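A minimal sketch of the bonus-greedy rule (7); the defaults β = 10 and β_0 = 1 mirror the bandit experiment of Sec. 5.5, and the names are ours.

```python
import numpy as np

def bonus_greedy(q_values, q_std, beta=10.0, beta_0=1.0):
    """Greedy with respect to Q plus a saturating variance-based bonus."""
    var = np.asarray(q_std) ** 2
    score = np.asarray(q_values) + beta * var / (beta_0 + var)
    return int(np.argmax(score))
```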
5.4. Thompson Policy

Recall that the KTD algorithm maintains the parameters' mean vector and variance matrix. Assuming that the parameter distribution is Gaussian, we propose to sample a set of parameters from this distribution, and then to act greedily according to the resulting sampled Q-function. This type of scheme was first proposed by Thompson (1933) for a bandit problem, and it has recently been introduced into the reinforcement learning community in the tabular case (Dearden et al., 1998; Strens, 2000). Let the Thompson policy be:

π(a_{i+1}|s_{i+1}) = δ(a_{i+1} = argmax_{b∈A} Q̂_θ(s_{i+1}, b))   with   θ ∼ N(θ̂_{i|i-1}, P_{i|i-1}).        (8)
We illustrate the Thompson policy in Fig. 5.d by showing the distribution of the greedy action (recall that the parameters are random, and thus the greedy action too). The highest probability is associated to action 3. However, notice that a higher probability is associated to action 1 than to action 4: the former has a lower estimated Q-value, but it is less certain.
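A minimal sketch of the Thompson policy (8), assuming a hypothetical parameterized Q-function q_fn(theta, state, action); the names are ours.

```python
import numpy as np

def thompson_action(theta_mean, P_theta, q_fn, state, actions, rng=None):
    """Sample parameters from the (assumed Gaussian) posterior and act
    greedily with respect to the sampled Q-function."""
    rng = rng or np.random.default_rng()
    theta_sample = rng.multivariate_normal(theta_mean, P_theta)
    q_sampled = [q_fn(theta_sample, state, a) for a in actions]
    return actions[int(np.argmax(q_sampled))]
```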
5.5. Experiment

The bandit problem is an MDP with one state and N actions. Each action a yields a reward of 1 with probability p_a, and a reward of 0 with probability 1 - p_a. For one action a* (randomly chosen at the beginning of each experiment), the probability is set to p_{a*} = 0.6. For all other actions, the associated probability is uniformly and randomly sampled between 0 and 0.5: p_a ∼ U[0, 0.5], ∀a ≠ a*. Presented results are averaged over 1000 experiments.
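A minimal sketch of this bandit setup (the names are ours):

```python
import numpy as np

def make_bandit(n_actions=10, p_best=0.6, rng=None):
    """One optimal arm with success probability 0.6, the others drawn
    uniformly in [0, 0.5]; returns a pull function and the optimal arm."""
    rng = rng or np.random.default_rng()
    probs = rng.uniform(0.0, 0.5, size=n_actions)
    best = int(rng.integers(n_actions))
    probs[best] = p_best
    # Bernoulli reward for pulling arm a.
    pull = lambda a: float(rng.random() < probs[a])
    return pull, best
```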
Figure 6: Q-values and associated uncertainty.

Figure 7: Bandit results.
The performance of a method is measured as the percentage of time the optimal action has been chosen, given the number of interactions between the agent and the bandit. A tabular representation is adopted for KTD-SARSA, and the following parameters are used: N = 10, P_{0|0} = 0.1I, θ_{0|0} = 1, P_{n_i} = 1, ε = 0.1, α = 0.3, β_0 = 1 and β = 10 (for an empirical study of the sensitivity of the proposed policies to the parameter setting, see Geist, 2009). As the considered bandit has N = 10 arms, a random policy has a performance of 0.1. Notice also that a purely greedy policy would systematically choose the first action for which the agent has observed a reward.
Results presented in Fig. 7 compare the four schemes. The ε-greedy policy serves as a baseline, and all proposed schemes using the available uncertainty perform better. The Thompson policy and the confident-greedy policy perform approximately equally well, and the best results are obtained by the bonus-greedy policy. Of course, these quite preliminary results do not allow concluding on convergence guarantees for the proposed schemes. However, they tend to show that the computed uncertainty information is meaningful and that it can prove useful for the dilemma between exploration and exploitation.
6. Dialogue management application
In this section, an application to a real-world problem is proposed: spoken dialogue management. A spoken dialogue system (SDS) generally aims at providing information to a user through natural language-based interactions. An SDS has roughly three modules: a speech understanding component (speech recognizer and semantic parser), a dialogue manager, and a speech generation component (natural language generator and speech synthesis). Dialogue management is a sequential decision making problem where a dialogue manager has to select which information should be asked or provided to the user in a given situation. It can thus be cast into the MDP framework (Levin et al., 2000; Singh et al., 1999; Pietquin and Dutoit, 2006). The set of actions a dialogue manager can select is defined by so-called dialogue acts. There can be different dialogue acts, such as: greeting the user, asking for a piece of information, providing a piece of information, asking for confirmation about a piece of information, closing the dialogue, etc. The state of a dialogue is usually represented efficiently by the Information State paradigm (Larsson and Traum, 2000).
In this paradigm, the dialogue state contains a compact representation of the history of the dialogue in terms of dialogue acts and user responses. It summarizes the information exchanged between the user and the system until the considered state is reached. A dialogue management strategy is therefore a mapping between dialogue states and dialogue acts. According to the MDP framework, a reward function has to be defined. The immediate reward is often modeled as the contribution of each action to the user's satisfaction (Singh et al., 1999). This is a subjective reward, which is usually approximated by a linear combination of objective measures.
The considered system is a form-filling spoken dialogue system. It is oriented toward tourism information, similarly to the one described by Lemon et al. (2006). Its goal is to provide information about restaurants based on specific user preferences. There are three slots in this dialogue problem, namely the location of the restaurant, the cuisine type of the restaurant and its price range. Given past interactions with the user, the agent asks a question so as to propose the best choice according to the user preferences. The goal is to provide the correct information to the user with as few interactions as possible. The corresponding MDP's state has 3 continuous components ranging from 0 to 1, each representing the average of the filling and confirmation confidence scores (provided by the automatic speech recognition system) of the respective slots. There are 13 possible actions: ask for a slot (3 actions), explicit confirmation of a slot (3 actions), implicit confirmation of a slot and ask for another slot (6 actions), and close the dialogue by proposing a restaurant (1 action). The corresponding reward is always 0, except when the dialogue is closed. In this case, the agent is rewarded 25 per correct slot filling, -75 per incorrect slot filling and -300 per empty slot. The discount factor is set to γ = 0.95. Even if the ultimate goal is to implement RL on a real dialogue management problem, in this experiment a user simulation technique was used to generate data (Pietquin and Dutoit, 2006). The user simulator was plugged into the DIPPER dialogue management system (Lemon et al., 2006) to generate dialogue samples. The Q-function is represented using one RBF network per action. Each RBF network has three equi-spaced Gaussian functions per dimension, each one with a standard deviation of σ = 1/3 (state variables ranging from 0 to 1). Therefore, there are 351 (i.e., 3³ × 13) parameters.
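For concreteness, a minimal sketch of the closure reward and of one action's RBF feature vector; the names, the exact center placement and the 3×3×3 grid layout of the Gaussians are our assumptions.

```python
import numpy as np

def closure_reward(slot_status):
    """Reward granted when the dialogue is closed; slot_status is a list of
    'correct', 'incorrect' or 'empty' (a hypothetical encoding)."""
    per_slot = {"correct": 25.0, "incorrect": -75.0, "empty": -300.0}
    return sum(per_slot[s] for s in slot_status)

def rbf_features(state, centers_per_dim=3, sigma=1.0 / 3.0):
    """RBF features for one action: 3 equi-spaced Gaussians per dimension
    over [0, 1], i.e. 27 features for the 3-dimensional state."""
    state = np.asarray(state)                   # 3 confidence scores in [0, 1]
    centers = np.linspace(0.0, 1.0, centers_per_dim)
    grids = np.meshgrid(*[centers] * len(state), indexing="ij")
    c = np.stack([g.ravel() for g in grids], axis=1)   # all center combinations
    return np.exp(-np.sum((state - c) ** 2, axis=1) / (2.0 * sigma ** 2))
```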
KTD-SARSA with the ε-greedy and bonus-greedy policies are compared in Fig. 2 (results are averaged over 8 independent trials, and each point is averaged over the 100 past episodes: a stable curve means a low standard deviation). LSPI, a batch and off-policy approximate policy iteration algorithm (Lagoudakis and Parr, 2003), serves as a baseline. It was trained in an off-policy and batch manner using random trajectories, and this algorithm provides competitive results among the state of the art (Li et al., 2009a; Chandramohan et al., 2010). Both algorithms provide good results (a positive cumulative reward, which means that the user is generally satisfied after few interactions). However, one can observe that the bonus-greedy scheme provides faster convergence as well as better and more stable policies than the uninformed ε-greedy policy. Moreover, results for the informed KTD-SARSA are very close to LSPI after few learning episodes. Therefore, KTD-SARSA is sample-efficient (it provides good policies while the insufficient number of transitions prevents using LSPI because of numerical stability problems), and the provided uncertainty information is useful on this dialogue management task.
7. Conclusion
In this paper, we have shown how uncertainty information about estimated values can be derived from KTD. We have also introduced an active learning scheme aiming at improving the speed of convergence by sampling actions according to their relative uncertainty, as well as some adaptations of existing schemes for exploration/exploitation. Three experiments have been proposed. The first one showed that KTD-Q, a second-order value-iteration-like algorithm, is sample-efficient. The improvement gained by using the proposed active learning scheme was also demonstrated. The proposed schemes for exploration/exploitation were also successfully experimented on a bandit problem, and the bonus-greedy policy on a real-world problem. This is a first step toward combining the dilemma between exploration and exploitation with value function approximation.

The next step is to adapt more existing approaches dealing with the exploration/exploitation dilemma, designed for tabular representations of the value function, to the KTD framework, and to provide some theoretical guarantees for the proposed approaches. This paper focused on model-free reinforcement learning, and we plan to compare our approach to model-based RL approaches.
Acknowledgments
The authors thank the European Community (FP7/2007-2013, grant agreement 216594, CLASSiC project: www.classic-project.org) and the Région Lorraine for financial support.
References
S. Chandramohan, M. Geist, and O. Pietquin. Sparse Approximate Dynamic Programming for Dialog Management. In Proceedings of the 11th SIGDial Conference on Discourse and Dialogue, pages 107–115, Tokyo (Japan), September 2010. ACL.

R. Dearden, N. Friedman, and S. J. Russell. Bayesian Q-Learning. In AAAI/IAAI, pages 761–768, 1998.

Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, April 2005.

M. Geist. Optimisation des chaînes de production dans l'industrie sidérurgique : une approche statistique de l'apprentissage par renforcement. PhD thesis in mathematics, Université Paul Verlaine de Metz (in collaboration with Supélec, ArcelorMittal and INRIA), November 2009.

M. Geist and O. Pietquin. Kalman Temporal Differences. Journal of Artificial Intelligence Research (JAIR), 2010.

M. Geist, O. Pietquin, and G. Fricout. Kalman Temporal Differences: the deterministic case. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), Nashville, TN, USA, April 2009a.

M. Geist, O. Pietquin, and G. Fricout. Tracking in reinforcement learning. In International Conference on Neural Information Processing (ICONIP 2009), Bangkok (Thailand), December 2009b. Springer.

N. Jong and P. Stone. Model-Based Exploration in Continuous State Spaces. In Symposium on Abstraction, Reformulation, and Approximation, July 2007.

S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.

L. P. Kaelbling. Learning in embedded systems. MIT Press, 1993.

S. Kakade, M. J. Kearns, and J. Langford. Exploration in Metric State Spaces. In International Conference on Machine Learning (ICML 03), pages 306–312, 2003.

R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME – Journal of Basic Engineering, 82 (Series D):35–45, 1960.

J. Z. Kolter and A. Y. Ng. Near-Bayesian Exploration in Polynomial Time. In International Conference on Machine Learning (ICML 09), New York, NY, USA, 2009. ACM.

M. G. Lagoudakis and R. Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

S. Larsson and D. R. Traum. Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering, 2000.

O. Lemon, K. Georgila, J. Henderson, and M. Stuttle. An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In Meeting of the European chapter of the Association for Computational Linguistics (EACL'06), Morristown, NJ, USA, 2006.

E. Levin, R. Pieraccini, and W. Eckert. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23, 2000.

L. Li, S. Balakrishnan, and J. Williams. Reinforcement Learning for Dialog Management using Least-Squares Policy Iteration and Fast Feature Selection. In Proceedings of the International Conference on Speech Communication and Technologies (InterSpeech'09), Brighton (UK), 2009a.

L. Li, M. Littman, and C. Mansley. Online exploration in least-squares policy iteration. In Conference for research in autonomous agents and multi-agent systems (AAMAS-09), Budapest, Hungary, 2009b.

O. Pietquin and T. Dutoit. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing, 14(2):589–599, March 2006.

Y. Sakaguchi and M. Takano. Reliability of internal prediction/estimation and its application: I. Adaptive action selection reflecting reliability of value function. Neural Networks, 17(7):935–952, 2004.

S. Singh, M. Kearns, D. Litman, and M. Walker. Reinforcement learning for spoken dialogue systems. In Conference on Neural Information Processing Systems (NIPS'99), Denver, USA. Springer, 1999.

A. L. Strehl and M. L. Littman. An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 2006.

M. Strens. A Bayesian Framework for Reinforcement Learning. In International Conference on Machine Learning, pages 943–950. Morgan Kaufmann, San Francisco, CA, 2000.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1996.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, (25):285–294, 1933.

H. Yu and D. P. Bertsekas. Q-Learning Algorithms for Optimal Stopping Based on Least Squares. In European Control Conference, Kos, Greece, 2007.