JMLR: Workshop and Conference Proceedings 16 (2011) 157–168    Workshop on Active Learning and Experimental Design

Managing Uncertainty within the KTD Framework

Matthieu Geist    matthieu.geist@supelec.fr
Olivier Pietquin    olivier.pietquin@supelec.fr
IMS Research Group, Supélec, Metz, France

Editor: I. Guyon, G. Cawley, G. Dror, V. Lemaire, and A. Statnikov

© 2011 M. Geist & O. Pietquin.

Abstract

The dilemma between exploration and exploitation is an important topic in reinforcement learning (RL). Most successful approaches to this problem use some uncertainty information about the values estimated during learning. On the other hand, scalability is a known weakness of RL algorithms, and value function approximation has become a major topic of research. Both problems arise in real-world applications, yet few approaches allow approximating the value function while maintaining uncertainty information about the estimates, and even fewer use this information to address the exploration/exploitation dilemma. In this paper, we show how such uncertainty information can be derived from the Kalman Temporal Differences (KTD) framework and how it can be used.

Keywords: Value function approximation, active learning, exploration/exploitation dilemma

1. Introduction

Reinforcement learning (RL) (Sutton and Barto, 1996) is the machine learning answer to the well-known problem of optimal control of dynamic systems. In this paradigm, an agent learns to control its environment (i.e., the dynamic system) through examples of actual interactions. To each of these interactions is associated an immediate reward, which is a local hint about the quality of the current control policy. More formally, at each (discrete) time step i the dynamic system to be controlled is in a state s_i. The agent chooses an action a_i, and the dynamic system is then driven into a new state, say s_{i+1}, following its own dynamics. The agent receives a reward r_{i+1} associated with the transition (s_i, a_i, s_{i+1}). The agent's objective is to maximize the expected cumulative reward, which it internally models as a so-called value or Q-function (see later). In the most challenging cases, learning has to be done online and the agent has to control the system while trying to learn the optimal policy. A major issue is then the choice of the behavior policy and the associated dilemma between exploration and exploitation (which can be linked to active learning). Indeed, at each time step the agent can choose an optimal action according to its (possibly) imperfect knowledge of the environment (exploitation), or an action considered suboptimal so as to improve its knowledge (exploration) and subsequently its policy. The ε-greedy action selection is a popular choice, which consists in selecting the greedy action with probability 1−ε and an equally distributed random action with probability ε. Another popular scheme is the softmax action selection (Sutton and Barto, 1996), which draws the behavior action from a Gibbs distribution.

Most successful approaches tend to use uncertainty information not only to choose between exploration and exploitation but also to drive exploration. Dearden et al. (1998) maintain a distribution for each Q-value. They propose two schemes: the first one consists in sampling the action according to the Q-value distribution; the second one uses a myopic value of imperfect information, which approximates the utility of an information-gathering action in terms of the expected improvement of the decision quality. Strehl and Littman (2006) maintain a confidence interval for each Q-value, and the policy is greedy with respect to the upper bound of this interval.
This approach allows deriving probably-approximately-correct (PAC) bounds. Sakaguchi and Takano (2004) use a Gibbs policy; however, a reliability index (actually a form of uncertainty) is used instead of the more classic temperature parameter. Most of these approaches are designed for problems where an exact (tabular) representation of the value function is possible. Nevertheless, approximating the value function in the case of large state spaces is another topic of importance in RL. There are some model-based algorithms which address this problem (Kakade et al., 2003; Jong and Stone, 2007; Li et al., 2009b); they imply approximating the model in addition to the value function. However, we focus here on purely model-free approaches (only the value function is estimated). Unfortunately, quite few value function approximators allow deriving uncertainty information about the estimated values. Engel (2005) proposes such a model-free algorithm, but the actual use of value uncertainty is left as a perspective. In this paper, we show how some uncertainty information about estimated values can be derived from the Kalman Temporal Differences (KTD) framework of Geist et al. (2009a,b). We also introduce a form of active learning which uses this uncertainty information in order to speed up learning, as well as some adaptations of existing schemes designed to handle the exploration/exploitation dilemma. Each contribution is illustrated and experimented, the last one on a real-world dialogue management problem.

2. Background

2.1. Reinforcement Learning

This paper is placed in the framework of Markov decision processes (MDP). An MDP is a tuple {S, A, P, R, γ}, where S is the state space, A the action space, P : (s,a) ∈ S×A → p(·|s,a) ∈ P(S) a family of transition probabilities, R : S×A×S → ℝ the bounded reward function, and γ the discount factor. A policy π associates to each state a probability over actions, π : s ∈ S → π(·|s) ∈ P(A). The value function of a given policy is defined as V^π(s) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, π], where r_i is the immediate reward observed at time step i, and the expectation is taken over all possible trajectories starting in s, given the system dynamics and the followed policy. The Q-function allows a supplementary degree of freedom for the first action and is defined as Q^π(s,a) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a, π]. RL aims at finding (through interactions) the policy π* which maximises the value function for every state: π* = argmax_π V^π. Two schemes among others can lead to the optimal policy. First, policy iteration involves learning the value function of a given policy and then improving the policy, the new one being greedy with respect to the learnt value function. It requires solving the Bellman evaluation equation, which is given here for the value and Q-functions:

V^π(s) = E_{s',a | π,s}[R(s,a,s') + γ V^π(s')]   and   Q^π(s,a) = E_{s',a' | π,s,a}[R(s,a,s') + γ Q^π(s',a')].

The second scheme, value iteration, aims directly at finding the optimal policy. It requires solving the Bellman optimality equation:

Q*(s,a) = E_{s' | s,a}[R(s,a,s') + γ max_{b∈A} Q*(s',b)].

For large state and action spaces, exact solutions are tricky to obtain, and value or Q-function approximation is required.

2.2. Kalman Temporal Differences - KTD

Originally, the Kalman (1960) filter paradigm is a statistical method aiming at online tracking of the hidden state of a non-stationary dynamic system through indirect observations of this state. The idea behind KTD is to cast value function approximation into such a filtering paradigm: considering a function approximator based on a family of parameterized functions, the parameters are the hidden state to be tracked, the observation being the reward, linked to the parameters through one of the classical Bellman equations. Thereby, value function approximation can benefit from the advantages of Kalman filtering, and particularly from uncertainty management, thanks to the statistical modelling.
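To preview this casting in concrete terms before the formal development below, here is a minimal sketch (not from the paper) of the implied generative model, assuming a linear value function V̂_θ(s) = θᵀφ(s), the Bellman evaluation equation and arbitrary noise levels; the feature map and all names are illustrative assumptions.

```python
import numpy as np

def features(s):
    # Hypothetical feature map phi(s); a tiny polynomial example.
    return np.array([1.0, s, s ** 2])

def v_hat(theta, s):
    # Linear value function approximator V_theta(s) = theta^T phi(s).
    return theta @ features(s)

def evolution(theta_prev, rng, evolution_std=1e-3):
    # Evolution equation: the parameters follow a random walk,
    # theta_i = theta_{i-1} + v_i, with centered white noise v_i.
    return theta_prev + evolution_std * rng.standard_normal(theta_prev.shape)

def observation(theta, transition, gamma=0.95, obs_std=1.0, rng=None):
    # Observation equation: the observed reward is linked to the parameters
    # through the Bellman evaluation equation,
    # r_i = V_theta(s_i) - gamma * V_theta(s_{i+1}) + n_i.
    s, s_next = transition
    noise = 0.0 if rng is None else obs_std * rng.standard_normal()
    return v_hat(theta, s) - gamma * v_hat(theta, s_next) + noise
```

The two functions correspond to the evolution and observation equations of the state-space formulation (2) introduced next.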
The following notations are adopted, depending on whether the aim is value function evaluation, Q-function evaluation or direct Q-function optimization:

t_i = (s_i, s_{i+1})              and  g_{t_i}(θ_i) = V̂_{θ_i}(s_i) − γ V̂_{θ_i}(s_{i+1}),
t_i = (s_i, a_i, s_{i+1}, a_{i+1})   and  g_{t_i}(θ_i) = Q̂_{θ_i}(s_i, a_i) − γ Q̂_{θ_i}(s_{i+1}, a_{i+1}),
t_i = (s_i, a_i, s_{i+1})           and  g_{t_i}(θ_i) = Q̂_{θ_i}(s_i, a_i) − γ max_{b∈A} Q̂_{θ_i}(s_{i+1}, b).     (1)

Here V̂_θ (resp. Q̂_θ) is a parametric representation of the value (resp. Q-) function, θ being the parameter vector. A statistical point of view is adopted, and the parameter vector is considered as a random variable. The problem at hand is then stated in a so-called state-space formulation:

θ_i = θ_{i−1} + v_i   and   r_i = g_{t_i}(θ_i) + n_i.     (2)

Using the vocabulary of Kalman filtering, the first equation is the evolution equation. It specifies that the searched parameter vector follows a random walk whose expectation corresponds to the optimal estimation of the value function at time step i. The evolution noise v_i is centered, white, independent and of variance matrix P_{v_i}. The second equation is the observation equation; it links the observed transitions and rewards to the value (or Q-) function through one of the Bellman equations. The observation noise n_i is supposed centered, white, independent and of variance P_{n_i}.

KTD is a second-order algorithm: it updates the mean parameter vector, but also the associated covariance matrix, after each interaction. It breaks down into three steps. First, predictions of the parameters' first and second order moments are obtained according to the evolution equation, using the previous estimates. Then some statistics of interest are computed. The third step applies a correction to the predicted moments of the parameter vector according to the so-called Kalman gain K_i (computed thanks to the statistics obtained in the second step), the predicted reward r̂_{i|i−1} and the observed reward r_i (their difference being a form of temporal difference error).

The statistics of interest are generally not analytically computable, except in the linear case. This does not hold for nonlinear parameterizations such as neural networks, nor for the Bellman optimality equation (because of the max operator). Nevertheless, a derivative-free approximation scheme, the unscented transform (UT) of Julier and Uhlmann (2004), allows estimating the first and second order moments of a nonlinearly mapped random vector. Let X be a random vector (typically the parameter vector) and Y = f(X) its nonlinear mapping (typically the g_{t_i} function). Let n be the dimension of the random vector X. A set of 2n+1 so-called sigma-points and associated weights are computed as follows:

x^(0) = X̄,                      w_0 = κ / (n+κ),
x^(j) = X̄ + (√((n+κ) P_X))_j,      w_j = 1 / (2(n+κ)),   1 ≤ j ≤ n,
x^(j) = X̄ − (√((n+κ) P_X))_{j−n},    w_j = 1 / (2(n+κ)),   n+1 ≤ j ≤ 2n,

where X̄ is the mean of X, P_X is its variance matrix, κ is a scaling factor which controls the accuracy of the approximation, and (√((n+κ) P_X))_j is the jth column of the Cholesky decomposition of (n+κ) P_X. Then the image of each sigma-point through the mapping f is computed: y^(j) = f(x^(j)), 0 ≤ j ≤ 2n. The set of sigma-points and their images can then be used to compute the following approximations:

Ȳ ≈ Σ_{j=0}^{2n} w_j y^(j),
P_Y ≈ Σ_{j=0}^{2n} w_j (y^(j) − Ȳ)(y^(j) − Ȳ)^T,
P_{XY} ≈ Σ_{j=0}^{2n} w_j (x^(j) − X̄)(y^(j) − Ȳ)^T.

Thanks to the UT, practical algorithms can be derived. At time step i, a set of sigma-points is computed from the predicted random parameters characterized by mean θ̂_{i|i−1} and variance P_{i|i−1}. Predicted rewards are then computed as images of these sigma-points using one of the observation functions (1). Then the sigma-points and their images are used to compute the statistics of interest.
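As an illustration of the unscented transform just described, here is a minimal Python sketch (not part of the original paper); the function names and the default value of κ are assumptions made for the example.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Compute the 2n+1 sigma-points and weights of the unscented transform."""
    n = mean.shape[0]
    # Column-wise square root of (n + kappa) * cov via Cholesky decomposition.
    sqrt_cov = np.linalg.cholesky((n + kappa) * cov)
    points = [mean]
    points += [mean + sqrt_cov[:, j] for j in range(n)]
    points += [mean - sqrt_cov[:, j] for j in range(n)]
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    return np.array(points), weights

def unscented_transform(mean, cov, f, kappa=1.0):
    """Approximate mean, covariance and cross-covariance of Y = f(X)."""
    points, weights = sigma_points(mean, cov, kappa)
    images = np.array([np.atleast_1d(f(x)) for x in points])   # y^(j) = f(x^(j))
    y_mean = weights @ images                                   # approx. E[Y]
    dy = images - y_mean
    dx = points - mean
    p_y = (weights[:, None] * dy).T @ dy                        # approx. Cov[Y]
    p_xy = (weights[:, None] * dx).T @ dy                       # approx. Cov[X, Y]
    return y_mean, p_y, p_xy
```

In KTD, X would be the predicted parameter vector (mean θ̂_{i|i−1}, covariance P_{i|i−1}) and f one of the observation functions g_{t_i} of Eq. (1), so that the returned statistics correspond to the predicted reward, its variance, and the parameters–reward cross-covariance used in the algorithm below.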
This gives rise to a generic algorithm, valid for any of the three Bellman equations and any parametric representation of V or Q, summarized in Alg. 1, p being the number of parameters. More details as well as theoretical results (such as proofs of convergence) about KTD are provided by Geist and Pietquin (2010).

Algorithm 1: KTD
  Initialization: priors θ̂_{0|0} and P_{0|0}
  for i = 1, 2, ... do
    Observe transition t_i and reward r_i
    Prediction step:
      θ̂_{i|i−1} = θ̂_{i−1|i−1}
      P_{i|i−1} = P_{i−1|i−1} + P_{v_i}
    Sigma-points computation:
      Θ_{i|i−1} = {θ̂^(j)_{i|i−1}, 0 ≤ j ≤ 2p}   /* from θ̂_{i|i−1} and P_{i|i−1} */
      W = {w_j, 0 ≤ j ≤ 2p}
      R_{i|i−1} = {r̂^(j)_{i|i−1} = g_{t_i}(θ̂^(j)_{i|i−1}), 0 ≤ j ≤ 2p}   /* see Eq. (1) */
    Compute statistics of interest:
      r̂_{i|i−1} = Σ_{j=0}^{2p} w_j r̂^(j)_{i|i−1}
      P_{θr_i} = Σ_{j=0}^{2p} w_j (θ̂^(j)_{i|i−1} − θ̂_{i|i−1})(r̂^(j)_{i|i−1} − r̂_{i|i−1})
      P_{r_i} = Σ_{j=0}^{2p} w_j (r̂^(j)_{i|i−1} − r̂_{i|i−1})² + P_{n_i}
    Correction step:
      K_i = P_{θr_i} P_{r_i}^{−1}
      θ̂_{i|i} = θ̂_{i|i−1} + K_i (r_i − r̂_{i|i−1})
      P_{i|i} = P_{i|i−1} − K_i P_{r_i} K_i^T
  end for

3. Computing Uncertainty over Values

The parameters being modeled as random variables, the parameterized value for any given state is a random variable. This model allows computing the mean and the associated uncertainty. Let V̂_θ be the approximated value function parameterized by the random vector θ of mean θ̄ and variance matrix P_θ, and let V̄_θ(s) and σ̂²_{V_θ}(s) be the associated mean and variance for a given state s. To propagate the uncertainty from the parameters to the approximated value function, a first step is to compute the sigma-points associated to the parameter vector, that is Θ = {θ^(j), 0 ≤ j ≤ 2p}, as well as the corresponding weights, from θ̄ and P_θ as described before. Then the images of these sigma-points are computed using the parameterized value function: V_Θ(s) = {V̂^(j)_θ(s) = V̂_{θ^(j)}(s), 0 ≤ j ≤ 2p}. Knowing these images and the corresponding weights, the statistics of interest are computed:

V̄_θ(s) = Σ_{j=0}^{2p} w_j V̂^(j)_θ(s)   and   σ̂²_{V_θ}(s) = Σ_{j=0}^{2p} w_j (V̂^(j)_θ(s) − V̄_θ(s))².

This is illustrated on Fig. 1. The extension to the Q-function is straightforward. So, at each time step, uncertainty information can be computed in the KTD framework.

Figure 1: Uncertainty computation.
Figure 2: Dialogue management results (average discounted sum of rewards versus number of transitions, for KTD-SARSA with the bonus-greedy and ε-greedy policies and for LSPI; discussed in Section 6).

4. A Form of Active Learning

4.1. Principle

It is shown here how this available uncertainty information can be used in a form of active learning. The KTD algorithm derived from the Bellman optimality equation, that is Alg. 1 with the third equation of Eq. (1), is named KTD-Q. It is an off-policy algorithm: it learns the optimal policy π* while following a different behavior policy b. A natural question is: what behavior policy should be chosen so as to speed up learning? Let i be the current temporal index. The system is in a state s_i, and the agent has to choose an action a_i. The predictions θ̂_{i|i−1} and P_{i|i−1} are available and can be used to approximate the uncertainty of the Q-function parameterized by θ_{i|i−1} in the state s_i and for any action a. Let σ̂²_{Q_{i|i−1}}(s_i, a) be the corresponding variance. The action a_i is chosen according to the following heuristic:

b(·|s_i) = σ̂_{Q_{i|i−1}}(s_i, ·) / Σ_{a∈A} σ̂_{Q_{i|i−1}}(s_i, a).     (4)

This totally explorative policy favours uncertain actions. The corresponding algorithm, which is called active KTD-Q, combines Alg. 1 (with the third equation of (1)) and policy (4).
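The following minimal Python sketch (not from the paper) illustrates how the Q-value uncertainty of Section 3 and the explorative policy of Eq. (4) could be computed; the linear feature map `q_features` and all names are illustrative assumptions, and the sigma-point construction repeats the one sketched earlier.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    # Same construction as in the earlier unscented-transform sketch.
    n = mean.shape[0]
    sqrt_cov = np.linalg.cholesky((n + kappa) * cov)
    points = [mean] + [mean + sqrt_cov[:, j] for j in range(n)] \
                    + [mean - sqrt_cov[:, j] for j in range(n)]
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    return np.array(points), weights

def q_features(s, a, n_actions):
    # Hypothetical linear Q-function features: one polynomial block per action.
    phi_s = np.array([1.0, s, s ** 2])
    phi = np.zeros(phi_s.size * n_actions)
    phi[a * phi_s.size:(a + 1) * phi_s.size] = phi_s
    return phi

def q_mean_and_std(theta_mean, theta_cov, s, a, n_actions, kappa=1.0):
    """Propagate parameter uncertainty to one Q-value with sigma-points (Sec. 3)."""
    points, weights = sigma_points(theta_mean, theta_cov, kappa)
    values = np.array([p @ q_features(s, a, n_actions) for p in points])
    mean = weights @ values
    var = weights @ (values - mean) ** 2
    return mean, np.sqrt(var)

def active_behavior_policy(theta_mean, theta_cov, s, n_actions, rng, kappa=1.0):
    """Sample an action with probability proportional to its Q std, as in Eq. (4)."""
    stds = np.array([q_mean_and_std(theta_mean, theta_cov, s, a, n_actions, kappa)[1]
                     for a in range(n_actions)])
    return rng.choice(n_actions, p=stds / stds.sum())
```

Note that for a linear parameterization the variance could equivalently be obtained in closed form as φ(s,a)ᵀ P_θ φ(s,a); the sigma-point route is shown here because it also covers nonlinear representations.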
4.2. Experiment

The second experiment is the inverted pendulum benchmark. This task requires maintaining a pendulum of unknown length and mass at the upright position by applying forces to the cart it is attached to. It is fully described by Lagoudakis and Parr (2003), and we use the same parameterization (a mixture of Gaussian kernels). The goal here is to compare two value-iteration-like algorithms, namely KTD-Q and Q-learning, which aim at learning directly the optimal policy from suboptimal trajectories (off-policy learning). As far as we know, KTD-Q is the first second-order algorithm for Q-function approximation in a value-iteration scheme, the difficulty being to handle the max operator (Yu and Bertsekas (2007) also propose such an algorithm, however for a restrictive class of MDPs). That is why we compare it to a first-order algorithm. The active learning scheme is also experimented: it uses the uncertainty computed by KTD to speed up convergence.

Figure 3: Optimal policy learning (number of steps per test episode versus number of training episodes, for KTD-Q and Q-learning).
Figure 4: Random and active learning (number of steps per test episode versus number of training episodes, for KTD-Q, Q-learning and active KTD-Q).

For Q-learning, the learning rate is set to α_i = α_0 (n_0 + 1)/(n_0 + i) with α_0 = 0.5 and n_0 = 200, according to Lagoudakis and Parr (2003). For KTD-Q, the parameters are set to P_{0|0} = 10I, P_{n_i} = 1 and P_{v_i} = 0I. For all algorithms the initial parameter vector is set to zero. Training samples are first collected online with a random behavior policy. The agent starts in a randomly perturbed state close to the equilibrium. Performance is measured as the average number of steps in a test episode (a maximum of 3000 steps is allowed). Results are averaged over 100 trials. Fig. 3 compares KTD-Q and Q-learning (the same random samples are used to train both algorithms). Fig. 4 adds active KTD-Q, for which actions are sampled according to (4). The average length of episodes with a totally random policy is 10, whereas it is 11 for policy (4). Consequently, the increase in length can only slightly help to improve the speed of convergence (at most 10%, much less than the real improvement, which is about 100%, at least at the beginning).

According to Fig. 3, KTD-Q learns an optimal policy (that is, balancing the pole for the maximum number of steps) asymptotically, and near-optimal policies are learned after only a few tens of episodes (notice that these results are comparable to the LSPI algorithm). With the same number of learning episodes, Q-learning with the same linear parameterization fails to learn a policy which balances the pole for more than a few tens of time steps. Similar results for Q-learning are obtained by Lagoudakis and Parr (2003). According to Fig. 4, it is clear that sampling actions according to uncertainty speeds up convergence: it is almost doubled in the first 100 episodes. Notice that this active learning scheme could not have been used for Q-learning with value function approximation, as this algorithm cannot provide uncertainty information.

5. Exploration/Exploitation Dilemma

In this section, we present several approaches designed to handle the dilemma between exploration and exploitation (which can be linked to active learning). The first one is the well-known ε-greedy policy, and it serves as a baseline. The other approaches are inspired from the literature and use the available uncertainty information (see Sec. 3 for its computation). The corresponding algorithms are combinations of KTD-SARSA (Alg. 1 with the second equation of (1)) with policies (5)-(8).

5.1. ε-greedy Policy

With an ε-greedy policy (Sutton and Barto, 1996), the agent chooses a greedy action with respect to the currently estimated Q-function with probability 1−ε, and a random action with probability ε (δ is the Kronecker symbol):

π(a_{i+1}|s_{i+1}) = (1−ε) δ(a_{i+1} = argmax_{b∈A} Q̂_{i|i−1}(s_{i+1}, b)) + ε δ(a_{i+1} ≠ argmax_{b∈A} Q̂_{i|i−1}(s_{i+1}, b)).     (5)

This policy is perhaps the most basic one, and it does not use any uncertainty information. An arbitrary Q-function for a given state and 4 different actions is illustrated on Fig. 6.
For each action, it gives the estimated Q-value as well as the associated uncertainty (that is, the estimated standard deviation). For example, action 3 has the highest value and the lowest uncertainty, and action 1 the lowest value but the highest uncertainty. The probability distribution associated to the ε-greedy policy is illustrated on Fig. 5.a. The highest probability is associated to action 3, and the other actions have the same (low) probability, despite their different estimated values and standard deviations.

5.2. Confident-greedy Policy

The second approach we propose consists in acting greedily according to the upper bound of an estimated confidence interval. The approach is not novel (Kaelbling, 1993); however, some PAC (probably approximately correct) guarantees have been given recently by Strehl and Littman (2006) for a tabular representation (for which the confidence interval is proportional to the inverse of the square root of the number of visits to the considered state-action pair). In our case, we postulate that the confidence interval width is proportional to the estimated standard deviation (which is true if the parameter distribution is assumed to be Gaussian). Let α be a free positive parameter; we define the confident-greedy policy as:

π(a_{i+1}|s_{i+1}) = δ(a_{i+1} = argmax_{b∈A} [Q̂_{i|i−1}(s_{i+1}, b) + α σ̂_{Q_{i|i−1}}(s_{i+1}, b)]).     (6)

The same arbitrary Q-values are considered (see Fig. 6), and the confident-greedy policy is illustrated on Fig. 5.b, which represents the upper bound of the confidence interval. Action 1 is chosen because it has the highest score (despite the fact that it has the lowest estimated value). Notice that action 3, which is greedy with respect to the estimated Q-function, has only the third score.

5.3. Bonus-greedy Policy

The third approach we propose is inspired from the method of Kolter and Ng (2009). The policy they use is greedy with respect to the estimated Q-function plus a bonus, this bonus being proportional to the inverse of the number of visits to the state-action pair of interest (which can be interpreted as a variance, instead of the square root of this quantity for interval-estimation-based approaches, which can be interpreted as a standard deviation). The bonus-greedy policy we propose uses the variance rather than the standard deviation, and is defined as (β_0 and β being two free parameters):

π(a_{i+1}|s_{i+1}) = δ(a_{i+1} = argmax_{b∈A} [Q̂_{i|i−1}(s_{i+1}, b) + β σ̂²_{Q_{i|i−1}}(s_{i+1}, b) / (β_0 + σ̂²_{Q_{i|i−1}}(s_{i+1}, b))]).     (7)

The bonus-greedy policy is illustrated on Fig. 5.c, still using the arbitrary Q-values and associated standard deviations of Fig. 6. Action 2 has the highest score, and it is thus chosen. Notice that the three other actions have approximately the same score, despite the fact that they have quite different Q-values.

5.4. Thompson Policy

Recall that the KTD algorithm maintains the parameters' mean vector and variance matrix. Assuming that the parameter distribution is Gaussian, we propose to sample a set of parameters from this distribution, and then to act greedily according to the resulting sampled Q-function. This type of scheme was first proposed by Thompson (1933) for a bandit problem, and it has been recently introduced into the reinforcement learning community in the tabular case (Dearden et al., 1998; Strens, 2000). Let the Thompson policy be:

π(a_{i+1}|s_{i+1}):  a_{i+1} = argmax_{b∈A} Q̂_θ(s_{i+1}, b)   with   θ ~ N(θ̂_{i|i−1}, P_{i|i−1}).     (8)

We illustrate the Thompson policy on Fig. 5.d by showing the distribution of the greedy action (recall that the parameters are random, and thus the greedy action too). The highest probability is associated to action 3. However, notice that a higher probability is associated to action 1 than to action 4: the first one has a lower estimated Q-value, but it is less certain.

Figure 5: Policies (a. ε-greedy, b. confident-greedy, c. bonus-greedy, d. Thompson).
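To make the four action-selection schemes of this section concrete, here is a minimal Python sketch (not from the paper), written for a single state. It takes the estimated Q-values, their standard deviations (computed as in Section 3) and, for the Thompson policy, the parameter posterior; the function names and the helper `q_of_theta` are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # Eq. (5): greedy with probability 1 - epsilon, uniformly random otherwise.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def confident_greedy(q_values, q_stds, alpha):
    # Eq. (6): greedy w.r.t. the upper confidence bound Q + alpha * std.
    return int(np.argmax(q_values + alpha * q_stds))

def bonus_greedy(q_values, q_stds, beta, beta0):
    # Eq. (7): greedy w.r.t. Q plus a saturating variance-based bonus.
    var = q_stds ** 2
    return int(np.argmax(q_values + beta * var / (beta0 + var)))

def thompson(theta_mean, theta_cov, q_of_theta, rng):
    # Eq. (8): sample parameters from N(theta_mean, theta_cov) and act greedily
    # w.r.t. the sampled Q-function. q_of_theta(theta) is assumed to return the
    # vector of Q-values of the current state for every action.
    theta = rng.multivariate_normal(theta_mean, theta_cov)
    return int(np.argmax(q_of_theta(theta)))
```

In the experiments below, `q_values` and `q_stds` would correspond to Q̂_{i|i−1}(s_{i+1}, ·) and σ̂_{Q_{i|i−1}}(s_{i+1}, ·).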
5.5. Experiment

The bandit problem is an MDP with one state and N actions. Each action a implies a reward of 1 with probability p_a, and a reward of 0 with probability 1 − p_a. For one action a* (randomly chosen at the beginning of each experiment), the probability is set to p_{a*} = 0.6. For all other actions, the associated probability is uniformly and randomly sampled between 0 and 0.5: p_a ~ U[0, 0.5], ∀a ≠ a*. The presented results are averaged over 1000 experiments. The performance of a method is measured as the percentage of time the optimal action has been chosen, given the number of interactions between the agent and the bandit.¹

Figure 6: Q-values and associated uncertainty.
Figure 7: Bandit results.

A tabular representation is adopted for KTD-SARSA, and the following parameters are used: N = 10, P_{0|0} = 0.1I, θ_{0|0} = 1, P_{n_i} = 1, ε = 0.1, α = 0.3, β_0 = 1 and β = 10. As the considered bandit has N = 10 arms, a random policy has a performance of 0.1. Notice also that a purely greedy policy would systematically choose the first action for which the agent has observed a reward.

The results presented in Fig. 7 compare the four schemes. The ε-greedy policy serves as a baseline, and all the proposed schemes using the available uncertainty perform better. The Thompson policy and the confident-greedy policy perform approximately equally well, and the best results are obtained by the bonus-greedy policy. Of course, these quite preliminary results do not allow concluding about guarantees of convergence of the proposed schemes. However, they tend to show that the computed uncertainty information is meaningful and that it can prove useful for the dilemma between exploration and exploitation.

1. For an empirical study of the sensitivity of the performance of the proposed policies as a function of the parameter setting, see Geist (2009).

6. Dialogue Management Application

In this section is proposed an application to a real-world problem: spoken dialogue management. A spoken dialog system (SDS) generally aims at providing information to a user through natural language-based interactions. An SDS has roughly three modules: a speech understanding component (speech recognizer and semantic parser), a dialogue manager and a speech generation component (natural language generator and speech synthesis). Dialogue management is a sequential decision making problem where a dialogue manager has to select which information should be asked or provided to the user when in a given situation. It can thus be cast into the MDP framework (Levin et al., 2000; Singh et al., 1999; Pietquin and Dutoit, 2006). The set of actions a dialog manager can select is defined by so-called dialog acts. There can be different dialog acts such as: greeting the user, asking for a piece of information, providing a piece of information, asking for confirmation about a piece of information, closing the dialog, etc. The state of a dialog is usually represented efficiently by the Information State paradigm (Larsson and Traum, 2000). In this paradigm, the dialogue state contains a compact representation of the history of the dialogue in terms of dialog acts and user responses. It summarizes the information exchanged between the user and the system until the considered state is reached. A dialogue management strategy is therefore a mapping between dialogue states and dialogue acts. According to the MDP framework, a reward function has to be defined. The immediate reward is often modeled as the contribution of each action to the user's satisfaction (Singh et al., 1999). This is a subjective reward which is usually approximated by a linear combination of objective measures.

The considered system is a form-filling spoken dialog system. It is oriented toward tourism information, similarly to the one described by Lemon et al. (2006). Its goal is to provide information about restaurants based on specific user preferences.
There are three slots in this dialog problem, namely the location of the restaurant, the cuisine type of the restaurant and its price range. Given past interactions with the user, the agent asks a question so as to propose the best choice according to the user preferences. The goal is to provide the correct information to the user with as few interactions as possible. The corresponding MDP's state has 3 continuous components ranging from 0 to 1, each representing the average of the filling and confirmation confidence scores (provided by the automatic speech recognition system) of the respective slots. There are 13 possible actions: ask for a slot (3 actions), explicit confirmation of a slot (3 actions), implicit confirmation of a slot and ask for another slot (6 actions), and close the dialog by proposing a restaurant (1 action). The corresponding reward is always 0, except when the dialog is closed. In this case, the agent is rewarded 25 per correct slot filling, -75 per incorrect slot filling and -300 per empty slot. The discount factor is set to γ = 0.95. Even if the ultimate goal is to implement RL on a real dialog management problem, in this experiment a user simulation technique was used to generate data (Pietquin and Dutoit, 2006). The user simulator was plugged into the DIPPER dialogue management system (Lemon et al., 2006) to generate dialogue samples. The Q-function is represented using one RBF network per action. Each RBF network has three equi-spaced Gaussian functions per dimension, each one with a standard deviation of σ = 1/3 (state variables ranging from 0 to 1). Therefore, there are 351 (i.e., 3³ × 13) parameters.

KTD-SARSA with the ε-greedy and bonus-greedy policies are compared on Fig. 2 (results are averaged over 8 independent trials, and each point is averaged over the 100 past episodes: a stable curve means a low standard deviation). LSPI, a batch and off-policy approximate policy iteration algorithm (Lagoudakis and Parr, 2003), serves as a baseline. It was trained in an off-policy and batch manner using random trajectories, and this algorithm provides competitive results among the state of the art (Li et al., 2009a; Chandramohan et al., 2010). Both algorithms provide good results (a positive cumulative reward, which means that the user is generally satisfied after a few interactions). However, one can observe that the bonus-greedy scheme provides faster convergence as well as better and more stable policies than the uninformed ε-greedy policy. Moreover, the results for the informed KTD-SARSA are very close to LSPI after a few learning episodes. Therefore, KTD-SARSA is sample efficient (it provides good policies even when the number of transitions is too small for LSPI to be used, because of numerical stability problems), and the provided uncertainty information is useful on this dialogue management task.
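To make the Q-function representation used in this experiment concrete, here is a minimal sketch (not from the paper) of such Gaussian RBF features: three equi-spaced Gaussians per dimension on [0, 1] with standard deviation 1/3, giving 27 features per action and 3³ × 13 = 351 parameters in total. The exact placement of the centers and the handling of the action index are assumptions made for the example.

```python
import itertools
import numpy as np

CENTERS_1D = np.array([0.0, 0.5, 1.0])   # assumed equi-spaced centers on [0, 1]
SIGMA = 1.0 / 3.0
CENTERS = np.array(list(itertools.product(CENTERS_1D, repeat=3)))  # 27 centers

def rbf_features(state):
    """Gaussian RBF features of a 3-dimensional dialogue state."""
    state = np.asarray(state, dtype=float)
    sq_dist = np.sum((CENTERS - state) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * SIGMA ** 2))           # shape (27,)

def q_value(theta, state, action, n_actions=13):
    """Linear Q-function: one RBF network (weight block) per action."""
    phi = rbf_features(state)
    block = theta.reshape(n_actions, phi.size)[action]     # theta has 351 entries
    return block @ phi
```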
7. Conclusion

In this paper, we have shown how uncertainty information about estimated values can be derived from KTD. We have also introduced an active learning scheme aiming at improving the speed of convergence by sampling actions according to their relative uncertainty, as well as some adaptations of existing schemes for exploration/exploitation. Three experiments have been proposed. The first one showed that KTD-Q, a second-order value-iteration-like algorithm, is sample efficient. The improvement gained by using the proposed active learning scheme was also demonstrated. The proposed schemes for exploration/exploitation were also successfully experimented on a bandit problem, and the bonus-greedy policy on a real-world problem. This is a first step toward combining the dilemma between exploration and exploitation with value function approximation.

The next step is to adapt more existing approaches dealing with the exploration/exploitation dilemma, designed for tabular representations of the value function, to the KTD framework, and to provide some theoretical guarantees for the proposed approaches. This paper focused on model-free reinforcement learning, and we plan to compare our approach to model-based RL approaches.

Acknowledgments

The authors thank the European Community (FP7/2007-2013, grant agreement 216594, CLASSiC project: www.classic-project.org) and the Région Lorraine for financial support.

References

S. Chandramohan, M. Geist, and O. Pietquin. Sparse Approximate Dynamic Programming for Dialog Management. In Proceedings of the 11th SIGDial Conference on Discourse and Dialogue, pages 107–115, Tokyo (Japan), September 2010. ACL.

R. Dearden, N. Friedman, and S. J. Russell. Bayesian Q-Learning. In AAAI/IAAI, pages 761–768, 1998.

Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, April 2005.

M. Geist. Optimisation des chaînes de production dans l'industrie sidérurgique : une approche statistique de l'apprentissage par renforcement. PhD thesis in mathematics, Université Paul Verlaine de Metz (in collaboration with Supélec, ArcelorMittal and INRIA), November 2009.

M. Geist and O. Pietquin. Kalman Temporal Differences. Journal of Artificial Intelligence Research (JAIR), 2010.

M. Geist, O. Pietquin, and G. Fricout. Kalman Temporal Differences: the deterministic case. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), Nashville, TN, USA, April 2009a.

M. Geist, O. Pietquin, and G. Fricout. Tracking in reinforcement learning. In International Conference on Neural Information Processing (ICONIP 2009), Bangkok (Thailand), December 2009b. Springer.

N. Jong and P. Stone. Model-Based Exploration in Continuous State Spaces. In Symposium on Abstraction, Reformulation, and Approximation, July 2007.

S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.

L. P. Kaelbling. Learning in embedded systems. MIT Press, 1993.

S. Kakade, M. J. Kearns, and J. Langford. Exploration in Metric State Spaces. In International Conference on Machine Learning (ICML 03), pages 306–312, 2003.

R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME–Journal of Basic Engineering, 82 (Series D):35–45, 1960.

J. Z. Kolter and A. Y. Ng. Near-Bayesian Exploration in Polynomial Time. In International Conference on Machine Learning (ICML 09), New York, NY, USA, 2009. ACM.

M. G. Lagoudakis and R. Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

S. Larsson and D. R. Traum. Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering, 2000.

O. Lemon, K. Georgila, J. Henderson, and M. Stuttle. An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In Meeting of the European chapter of the Association for Computational Linguistics (EACL'06), Morristown, NJ, USA, 2006.

E. Levin, R. Pieraccini, and W. Eckert. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23, 2000.

L. Li, S. Balakrishnan, and J. Williams. Reinforcement Learning for Dialog Management using Least-Squares Policy Iteration and Fast Feature Selection. In Proceedings of the International Conference on Speech Communication and Technologies (InterSpeech'09), Brighton (UK), 2009a.

L. Li, M. Littman, and C. Mansley. Online exploration in least-squares policy iteration. In Conference for research in autonomous agents and multi-agent systems (AAMAS-09), Budapest, Hungary, 2009b.
O. Pietquin and T. Dutoit. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing, 14(2):589–599, March 2006.

Y. Sakaguchi and M. Takano. Reliability of internal prediction/estimation and its application: I. Adaptive action selection reflecting reliability of value function. Neural Networks, 17(7):935–952, 2004.

S. Singh, M. Kearns, D. Litman, and M. Walker. Reinforcement learning for spoken dialogue systems. In Conference on Neural Information Processing Systems (NIPS'99), Denver, USA. Springer, 1999.

A. L. Strehl and M. L. Littman. An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 2006.

M. Strens. A Bayesian Framework for Reinforcement Learning. In International Conference on Machine Learning, pages 943–950. Morgan Kaufmann, San Francisco, CA, 2000.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1996.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, (25):285–294, 1933.

H. Yu and D. P. Bertsekas. Q-Learning Algorithms for Optimal Stopping Based on Least Squares. In European Control Conference, Kos, Greece, 2007.