Solving Maze Problem with Abstraction Selection

Xiaodong Wang and Chi Zhang
University of Tennessee, Knoxville, USA
{xwang33, czhang24}@utk.edu

Abstract—Abstraction has been extensively studied in the field of artificial intelligence. It is especially useful for high-dimensional continuous domains, since abstraction can reduce the large state and action spaces of such problems. In this project, we solve the multi-task maze problem with temporal difference learning under the abstraction selection framework. Simulation results show that the multi-task maze problem can be solved efficiently with abstraction selection. We further compare the algorithm proposed in the existing literature with our modified algorithm. Results show that the multi-task maze problem can be solved more efficiently using tabular TD learning than with the function approximation framework proposed in [1].

I. INTRODUCTION

High-dimensional, continuous domains are a class of reinforcement learning problems that remains difficult to solve. A key approach for such problems is to use an abstraction that reduces the number of state variables in the solution. However, a single abstraction usually cannot be applied effectively to the whole problem. Much recent research on such problems has focused on hierarchical reinforcement learning, which divides an intrinsically high-dimensional problem into several sub-problems, each of which is much easier and can be solved using only a small set of state variables. For example, in the problem of learning to drive home, the entire task can be broken into several small tasks, including getting to the parking lot, opening the car, starting the car, and driving home. We take advantage of breaking the large problem down into a series of sub-problems and solve each sub-problem using its own abstraction. If the agent has a library of abstractions available to it, it can select one from the library and apply the selected abstraction to aid new skill learning.

In this project, we use the abstraction selection algorithm proposed in [1] to solve a multi-task maze problem. In the multi-task problem, the agent needs to go through a series of sub-goals before finishing the learning task at the final goal. This problem fits well in the hierarchical reinforcement learning domain: we can break the entire multi-task maze problem apart at the sub-goals. After breaking the multi-task maze problem into sub-tasks, we select an appropriate abstraction for each sub-goal task and use temporal difference (TD) learning to solve it. The results show that the agent selects an appropriate abstraction using very little sample data and therefore significantly improves skill learning performance in large real-valued reinforcement learning domains.

The remainder of this paper is organized as follows. Section II introduces the background of abstraction and the option (sub-task) framework. Section III elaborates on the details of using abstraction to solve a large learning problem. Section IV presents two application scenarios in which abstraction selection can be applied. Section V shows the evaluation results of using abstraction selection for the multi-task maze problem. Section VI concludes the paper.

II. BACKGROUND

In this section, we introduce the background of abstraction selection, including the options framework and the definition of abstraction.
A. The Options Framework

The options framework [2] is a hierarchical reinforcement learning framework that provides methods for learning and planning by adding temporally extended actions (called options) to the standard reinforcement learning setting. Options are closed-loop policies for taking action over a period of time. An option o consists of three components: a policy π_o, an initiation set I_o, and a termination condition ς_o:

$$\pi_o : S \times A \to [0,1], \qquad I_o : S \to \{0,1\}, \qquad \varsigma_o : S \to [0,1]. \qquad (1)$$

The initiation set I_o is an indicator function that is 1 for states where the option can be executed and 0 elsewhere; an option is available in state s_t if and only if I_o(s_t) = 1. In our multi-task maze problem, the initiation set I_o can contain any state in the state space, since walking toward a specific sub-goal can begin from any state. ς_o is the termination condition of the option. Option creation and termination are usually driven by the identification of sub-goal states. In our multi-task maze problem, the termination condition of the current option is reaching its sub-goal; after the current sub-goal is reached, the algorithm can create a new option for the next sub-goal by defining a new termination condition.

B. Abstraction

In the example of learning motor skills, although humans have many sensory inputs and degrees of freedom, which together constitute a very large state space, specific sensorimotor skills almost always involve a small number of sensor features and ignore most of the sensor and motor features in the environment. This motivates applying abstraction to high-dimensional problems. In reinforcement learning, the use of a smaller set of variables to solve a large problem is modeled by the notion of abstraction [3]. Instead of working in the ground state and action spaces, the decision maker can find solutions much faster in an abstract state and action space by treating groups of states and actions as a unit and ignoring irrelevant information.

We define an abstraction M_i to be a pair of functions (σ_i, τ_i), where σ_i : S → S' maps the overall state space S to a smaller state space S', and τ_i : A → A' maps the full action space A to a smaller action space A'. In addition, each abstraction has an associated vector of basis functions Φ_i defined over S', which can be used to approximate value functions.

In the hierarchical reinforcement learning setting, the agent tries to build as many abstractions as it has skills. An agent solving many problems over its lifetime may thus accumulate a library of abstractions, which can later be used to solve new problems. Combining abstractions with options, we require that when an agent creates a new option it creates it with an accompanying abstraction; the agent can select an abstraction from its library and refine the selected abstraction through experience.

An agent creates an option to reach a particular sub-goal (state) only after that sub-goal has first been reached. Therefore, a set of sample interactions ending at the new sub-goal is available, which we treat as a sample trajectory for the option. A trajectory with m steps consists of a sequence of m state-action-reward tuples.
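As a simple illustration (our own sketch, not code from [1]), an abstraction can be held as a pair of mapping handles and applied step by step to such a recorded trajectory. The maze-specific choice of σ here, keeping only the offset to one hypothetical sub-goal, and all names are assumptions made for the example:

    sub = [5, 5];                         % hypothetical sub-goal position (example only)
    M.sigma = @(s) s - sub;               % sigma_i : S -> S', keep only the offset to the sub-goal
    M.tau   = @(a) a;                     % tau_i : A -> A', the four moves are kept unchanged

    S    = [1 1; 1 2; 2 2; 3 2];          % a toy recorded trajectory: states (one row per step),
    Acts = [4; 3; 3; 4];                  % actions,
    R    = [-1; -1; -1; -1];              % and rewards for m = 4 steps

    Sabs = zeros(size(S));                % project every step through the abstraction
    for k = 1:size(S, 1)
        Sabs(k, :) = M.sigma(S(k, :));
        Acts(k)    = M.tau(Acts(k));
    end
    % the rewards R are carried over unchanged, giving the abstract trajectory of Eq. (2)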
Given a library of abstractions, applying each abstraction to the sample trajectory in this way yields

$$\left\{\left(s_1^i, a_1^i, r_1\right), \left(s_2^i, a_2^i, r_2\right), \ldots, \left(s_m^i, a_m^i, r_m\right)\right\}, \qquad (2)$$

where $(s_k^i, a_k^i, r_k) = (\sigma_i(s_k), \tau_i(a_k), r_k)$ is the state-action-reward tuple obtained from abstraction i for the kth step of the trajectory.

III. SELECTING AN APPROPRIATE ABSTRACTION

Our goal is to break a large task into small tasks and then choose an appropriate abstraction for learning the skill of each sub-task. In this section, we first introduce linear function approximation, the basic tool used in abstraction selection. We then elaborate on how to choose an appropriate abstraction for each sub-task.

A. Linear Function Approximation

Representing value estimates as a table with one entry for each state or each state-action pair is particularly clear and instructive, but it is limited to tasks with small numbers of states and actions. For high-dimensional state representations the situation is very different: the problem is not just the memory needed for large tables, but also the time and data needed to fill them accurately. Function approximation provides a way to estimate the value of each state while avoiding this resource overhead. It takes examples from a desired function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire function. One of the most important forms is linear function approximation, which approximates V by a weighted sum of basis functions Φ:

$$\bar{V}(s) = w \cdot \Phi(s) = \sum_{i=1}^{n} w_i \phi_i(s), \qquad (3)$$

where φ_i is the ith basis function. A widely used basis for function approximation is the Fourier basis [4]. The Fourier expansion of a multivariate function F(x) with period T in m dimensions is

$$\bar{F}(x) = \sum_{c} \left[a_c \cos\!\left(\frac{2\pi}{T}\, c \cdot x\right) + b_c \sin\!\left(\frac{2\pi}{T}\, c \cdot x\right)\right], \qquad (4)$$

where $c = [c_1, \ldots, c_m]$ and $c_i \in \{0, \ldots, n\}$ for $1 \le i \le m$. This results in $2(n+1)^m$ basis functions for an nth-order full Fourier approximation to a value function in m dimensions, which can be reduced to $(n+1)^m$ if we drop either the sin or the cos terms for each variable. We thus define the kth-order Fourier basis for m variables as

$$\phi_i(x) = \cos\!\left(\pi\, c^i \cdot x\right), \qquad (5)$$

where $c^i = [c_1, \ldots, c_m]$ and $c_j \in \{0, \ldots, k\}$ for $1 \le j \le m$. Each basis function thus has a coefficient vector c that attaches an integer coefficient (at most k) to each variable in x; the basis set is obtained by systematically varying these coefficients. This basis has the benefit of being easy to compute accurately even for high orders, since cos is bounded in [-1, 1] and its arguments are formed by multiplication and summation rather than exponentiation.

B. Abstraction Selection

The objective of abstraction selection is efficient skill learning. The key idea of our design is to combine abstraction with hierarchical reinforcement learning in high-dimensional continuous problems. One scenario where abstraction selection can be applied is the continuous playroom problem, which appears easy to humans but is difficult for an agent because of the large number of variables and the interactions between variables (e.g., between the ∆x and ∆y values for an object-effector pair) that cannot all be included in the overall task's function approximation. For example, an O(1) Fourier basis over 120 variables that does not treat each variable as independent results in $2^{120}$ features.
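To make the construction in (5) concrete, the following MATLAB sketch (our own illustration; the function name and the assumption that states are scaled to [0,1]^m are ours, not from [1] or [4]) builds either the full order-k basis or the "independent" variant in which each coefficient vector has at most one non-zero entry, so that the feature count grows linearly rather than exponentially in the number of variables:

    % Illustrative sketch: order-k Fourier basis features for a state x in [0,1]^m.
    % Full basis:        every coefficient vector c in {0,...,k}^m -> (k+1)^m features.
    % Independent basis: at most one non-zero entry in c          -> k*m + 1 features.
    function phi = fourierFeatures(x, k, independent)
        m = numel(x);
        if independent
            C = zeros(k*m + 1, m);             % one row per coefficient vector
            row = 2;                           % row 1 stays all-zero (constant feature)
            for j = 1:m
                for c = 1:k
                    C(row, j) = c;             % only variable j is "active"
                    row = row + 1;
                end
            end
        else
            grids = cell(1, m);                % all (k+1)^m coefficient vectors
            [grids{:}] = ndgrid(0:k);          % (only feasible for small m)
            C = zeros((k+1)^m, m);
            for j = 1:m
                C(:, j) = grids{j}(:);
            end
        end
        phi = cos(pi * (C * x(:)));            % Eq. (5): phi_i(x) = cos(pi c^i . x)
    end

For example, fourierFeatures([0.2 0.7], 3, true) returns the 7 independent features for two variables, whereas the full variant returns (3+1)^2 = 16.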
Options and abstractions are therefore used to greatly improve performance in such domains. In this work, we implement a multi-task maze.

If we already have an entire trajectory at once, we can fit an approximation for each abstraction and then select the best abstraction as in a regression problem. A common model selection criterion is the Bayesian Information Criterion (BIC) [5]:

$$\ln p(D \mid M_i) \approx \ln p(D \mid \theta_{MAP}, M_i) - \frac{1}{2}\, |M_i| \ln m, \qquad (6)$$

where D is the data, M_i is abstraction i, $p(D \mid \theta_{MAP}, M_i)$ is the likelihood of D given the maximum a posteriori value function parameters θ_MAP for abstraction i, |M_i| is the number of parameters of abstraction i, and m is the sample size.

    function SensorimotorAbstractionFit(i, ρ, η):
      1. Initialization: set A_0, b_0, z_0, R_c and R_z to 0, and g to 1.
      2. For each incoming sample (s_t, a_t, r_t):
           A_t = ρ A_{t-1} + Φ_i(s_t) Φ_i(s_t)^T
           b_t = ρ b_{t-1} + ρ r_t z_{t-1} + r_t Φ_i(s_t)
           z_t = ρ z_{t-1} + Φ_i(s_t)
           R_c = ρ R_c + g r_t^2 + ρ r_t R_z
           R_z = ρ R_z + 2 g r_t
           g   = ρ g + 1
      3. After m samples, compute weights, error and variance:
           w = (A_m + η I)^{-1} b_m
           e = w^T A_m w - 2 w · b_m + R_c
           β = m / e
      4. Compute the log likelihood and BIC (quantities constant across abstractions are ignored):
           ll = -(β/2) e + (m/2) ln β
           return ll - (1/2) |Φ_i| ln m

Fig. 1. An incremental algorithm for computing the BIC value of an abstraction i, using weight factor ρ and regularization parameter η, given a successful sample trajectory.

In this work, we use a linear regression model as the statistical model for the data. The log likelihood of this model is

$$\ln p(D \mid M_i, w, \beta) = -\frac{\beta}{2}\, e_i + \frac{m}{2} \ln\frac{\beta}{2\pi} + \frac{m^2 - m}{4} \ln\rho, \qquad (7)$$

where $\beta^{-1}$ is the variance, w is the function approximation weight vector, and $e_i = \sum_{j=1}^{m} \rho^{(m-j)} \left[w \cdot \Phi_i(s_j) - R_j\right]^2$ is the weighted sum of squared errors.

We use the incremental algorithm given in Figure 1, following [1]. The algorithm is run simultaneously for each abstraction while the agent is interacting with the environment. Whenever the agent creates an option, the algorithm computes the associated log likelihood for each abstraction in one step, and the agent then selects the abstraction with the highest log likelihood.

More than one sample trajectory may be available, or may be required to produce a robust selection. Given p sample trajectories, we can modify the algorithm to run steps 1 and 2 separately for each trajectory and sum the resulting A, b and R_c. Steps 3 and 4 then use the summed quantities to perform a fit over all p trajectories simultaneously.
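For concreteness, the incremental fit of Figure 1 can be written directly in MATLAB. The sketch below is our own rendering (function and variable names are ours), assuming trajectory is a struct array with fields s and r and that phiOf returns the basis vector Φ_i(s):

    % Sketch of the incremental fit of Fig. 1 for a single abstraction (our rendering).
    % phiOf(s) returns Phi_i(s); rho and eta are the weight factor and the
    % regularization parameter of Fig. 1.
    function bic = abstractionFitBIC(trajectory, phiOf, rho, eta)
        n  = numel(phiOf(trajectory(1).s));   % number of basis functions |Phi_i|
        A  = zeros(n);  b = zeros(n, 1);  z = zeros(n, 1);
        Rc = 0;  Rz = 0;  g = 1;
        m  = numel(trajectory);

        for t = 1:m                           % step 2: handle incoming samples
            phi = phiOf(trajectory(t).s);
            r   = trajectory(t).r;
            A   = rho * A + phi * phi';
            b   = rho * b + rho * r * z + r * phi;
            z   = rho * z + phi;
            Rc  = rho * Rc + g * r^2 + rho * r * Rz;
            Rz  = rho * Rz + 2 * g * r;
            g   = rho * g + 1;
        end

        w    = (A + eta * eye(n)) \ b;        % step 3: weights, error, variance
        e    = w' * A * w - 2 * (w' * b) + Rc;
        beta = m / e;

        ll   = -(beta / 2) * e + (m / 2) * log(beta);   % step 4: log likelihood
        bic  = ll - 0.5 * n * log(m);                   % BIC penalty term
    end

Running this routine over the same trajectory for every abstraction in the library and taking the abstraction with the largest returned value implements the selection step described above.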
IV. APPLICATIONS WITH ABSTRACTION SELECTION

In this section, we introduce two applications that can be solved using the abstraction framework: the continuous playroom and the multi-task maze problem. We also elaborate on the extension to the algorithm with function approximation introduced in the previous section.

Fig. 2. Playing room.

A. Continuous Playroom

The Continuous Playroom is a real-valued version of the Playroom domain [6]. It consists of an agent and a number of objects: a light switch, a ball, a bell, two movable blocks that are also buttons for turning music on and off, and a toy monkey that can make sounds. The agent has three effectors: an eye, a hand, and a visual marker. The agent's sensors tell it which objects (if any) are under the eye, hand, and marker. The agent lives in a 1×1 room and may move any of its effectors 0.05 units in one of the usual four directions. When both its eye and hand are over an object it may additionally interact with it, but only if the light is on (unless the object is the light switch). Interacting with the green button switches the music on, while the red button switches the music off. The switch toggles the light. If both the hand and the eye are on the light switch, the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, the action of kicking the ball becomes available (the ball then moves in a straight line to the marker). Finally, if the agent interacts with the ball while its marker is over the bell, the ball hits the bell. Hitting the bell frightens the monkey if the light is on and the music is on, causing it to squeak, whereupon the agent receives a reward of 100,000 and the episode ends. All other actions give the agent a reward of -1. Every time an object is interacted with by any effector, it is relocated randomly in the room, so that the objects do not overlap in any episode. Note that if the agent has already learned how to turn the light on and off, how to turn the music on, and how to make the bell ring, those learned skills are of obvious use in simplifying the process of engaging the toy monkey.

[1] implemented the continuous playroom with an O(3) independent Fourier basis, with learning performed using Sarsa(λ). The results show that agents learning with an abstraction start better and are able to obtain better overall solutions. Moreover, the initial value function obtained by abstraction selection lets agents finish episodes in fewer steps than learning from scratch.

TABLE I. Learning-curve improvement for an option with an abstraction, and with an abstraction using the initial value function fit obtained from random and optimal trajectory samples, normalized to no abstraction [1].

    Episode        1         10        20       30
    Abstraction    78.57%    -6.67%    75%      84.21%
    Fit(Random)    67.14%    40.00%    77.5%    89.12%
    Fit(Given)     87.86%    44.44%    77.6%    73.68%

Fig. 3. Multi-task maze: the agent starts at Start and must pass through sub-goals 1 and 2 before reaching Goal.

Fig. 4. Comparison between applying abstraction selection with function approximation and with the tabular form on the multi-task maze problem (steps per episode over 100 episodes).

Table I summarizes the results reported in [1] and shows that abstraction selection can improve performance greatly. Normalized to learning with no abstraction, learning with an abstraction obtains better overall solutions in almost every episode. In addition, the quality of the trajectory data used for the fit significantly affects the resulting policy: policies obtained by fitting optimal sample trajectories (Fit(Given)) perform much better than those obtained from random sample trajectories (Fit(Random)).

B. Multi-task Maze

The second application, and the main target of this project, is the multi-task maze problem. Figure 3 shows an example of the multi-task maze. The agent starts at the Start point, and the final goal is to reach the Goal point. The task carries a specific requirement: the agent must first go through blocks 1 and 2, in sequence.

The maze problem is a special case for which the tabular form of value prediction can be applied naturally. Although function approximation with the Fourier basis introduced in the previous sections can reduce the state space effectively, it may not work well in the maze context (as we will see in the evaluation section). The main reasons are, first, that the maze problem has a discrete state space, which is hard to approximate linearly with the Fourier basis, and second, that linear approximation with the Fourier basis makes the predicted value highly sensitive to small changes in the features, which can result in biased value prediction and therefore worse learning. We therefore propose to use the tabular form for solving this specific maze problem with abstraction. Specifically, we keep a table inside each abstraction in the abstraction selection framework and select an appropriate abstraction for the next option (sub-goal). We then use tabular Sarsa(λ) at the option learning stage for each sub-goal; a sketch is given below, and the full implementation code is in the appendix.
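A minimal tabular Sarsa(λ) episode for a single option might look as follows. This is our illustrative sketch, not the appendix code: mazeStep is an assumed environment helper returning the successor cell and reward, and the grid sizes and four actions merely mirror the appendix implementation.

    % Illustrative tabular Sarsa(lambda) for one option (sub-goal) in the maze.
    % Q and E are (rows x cols x 4) tables over grid positions and the four moves.
    function Q = sarsaLambdaEpisode(Q, goal, alpha, gamma, lambda, epsilon)
        E = zeros(size(Q));                              % eligibility traces
        s = [randi(size(Q,1)), randi(size(Q,2))];        % random start cell
        a = epsilonGreedy(Q, s, epsilon);
        while ~isequal(s, goal)
            [s2, r] = mazeStep(s, a, goal);              % assumed environment step
            a2 = epsilonGreedy(Q, s2, epsilon);
            delta = r + gamma * Q(s2(1), s2(2), a2) - Q(s(1), s(2), a);
            E(s(1), s(2), a) = E(s(1), s(2), a) + 1;     % accumulating trace
            Q = Q + alpha * delta * E;
            E = gamma * lambda * E;
            s = s2;  a = a2;
        end
    end

    function a = epsilonGreedy(Q, s, epsilon)
        if rand < epsilon
            a = randi(4);                                % explore
        else
            [~, a] = max(Q(s(1), s(2), :));              % exploit
        end
    end

In the abstraction selection setting, the Q table kept inside the selected abstraction plays the role of its value representation, as described above.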
V. MULTI-TASK MAZE PROBLEM RESULTS

In this section, we show the results of using abstraction selection on the multi-task maze problem.

A. Is Fourier Basis Function Approximation Good for the Maze?

As introduced in the previous section, linear function approximation is well suited to continuous state spaces. Since the maze problem has a discrete state space, function approximation may not work well in this context. In this experiment we implement both the function approximation Sarsa(λ) and the tabular Sarsa(λ). Figure 4 shows that abstraction selection works better with tabular learning on this multi-task maze problem; the results are averaged over 100 runs of each episode. This demonstrates that abstraction selection also works with tabular learning, and is more effective for discrete-space problems.

B. Learning Performance

We perform further experiments to evaluate the impact of the ϵ-greedy parameter in the option learning stage. We vary ϵ from 0.01 to 0.09 and observe its effect. Figure 5 shows that with a higher ϵ, the average number of steps over 100 episodes decreases. This is because the agent explores more with a higher ϵ and is therefore more likely to find a better solution to the problem.

Fig. 5. Impact of different ϵ values in the option learning stage (average steps over 100 episodes vs. ϵ).

VI. CONCLUSION

In the context of small discrete domains, acquired skill hierarchies have proved to be beneficial, but in high-dimensional continuous domains there are difficulties due to the large state and action spaces. Abstraction selection provides a further advantage for skill acquisition in high-dimensional continuous domains by allowing an agent to exploit abstractions. In an environment where an agent may acquire many skills over its lifetime, this represents a great potential efficiency improvement, which, in conjunction with a good skill acquisition algorithm, could enable reinforcement learning agents to scale up to higher-dimensional domains. Additionally, abstraction selection opens up the possibility of abstraction transfer, where an agent that has learned a set of skills may benefit from the abstractions refined for each, even if it never uses those skills again.

In this work, we implemented abstraction selection with function approximation TD learning and with tabular TD learning. The results show that, under the abstraction framework, the tabular form outperforms function approximation for the multi-task maze problem.
REFERENCES

[1] G. Konidaris and A. Barto, "Efficient skill learning using abstraction selection," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009.
[2] R. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, 1999.
[3] L. Li, T. Walsh, and M. Littman, "Towards a unified theory of state abstraction for MDPs," in Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
[4] G. Konidaris and S. Osentoski, "Value function approximation in reinforcement learning using the Fourier basis," University of Massachusetts, Amherst, Tech. Rep., 2008.
[5] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[6] S. Singh, A. Barto, and N. Chentanez, "Intrinsically motivated reinforcement learning," in Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.

APPENDIX A

    function [totalsteps] = simplified_abstraction(income)
    % SIMPLIFIED_ABSTRACTION  Abstraction selection on a 20x20 multi-task maze.
    %   Runs 100 episodes in which the agent must reach the sub-goal before the
    %   final goal, selecting between a sub-goal-relative and a goal-relative
    %   Fourier abstraction, and returns the number of steps per episode.
    %   The input argument is unused.

    States = zeros(20,20) + 65;        % 65 marks a free cell (all cells are free here)
    States(18,18) = 65;

    destx = 10; desty = 10;            % final goal position
    subx  = 1;  suby  = 1;             % sub-goal position
    up = 1; left = 2; down = 3; right = 4;

    gamma   = 1;                       % discount factor
    lamda   = 0.9;                     % eligibility trace decay
    alpha   = 0.001;                   % learning rate
    epsilon = 0.01;                    % epsilon-greedy exploration
    rho     = 0.99;                    % weight factor of Fig. 1 (value assumed)
    eta     = 1e-6;                    % regularization of Fig. 1 (value assumed)

    episode = 0;
    record  = 0;
    a = randi(4,1);
    totalsteps = zeros(1,100);

    % value tables, policy and traces, indexed by target-relative offsets
    % shifted by +10 (indices range over 1..29)
    Q      = zeros(30,30,4,2);
    policy = zeros(30,30,4,2) + 0.25;
    e      = zeros(30,30,4,2);

    % incremental-fit statistics (Fig. 1) for the two candidate abstractions
    A  = zeros(4,4,2);  b = zeros(4,2);  z = zeros(4,2);
    Rc = zeros(1,2);    Rz = zeros(1,2); g = ones(1,2);
    weight = zeros(4,2); w = zeros(4,2);
    err = zeros(1,2); beta = zeros(1,2); likelihood = zeros(1,2);
    phi = zeros(4,2);

    while (episode < 100)
        episode = episode + 1;
        subfinish = 1;                 % 1: heading to the sub-goal, 2: to the goal
        m = 0;                         % samples seen since the last fit

        while (1)                      % choose a random free start cell
            sx = randi(20,1);
            sy = randi(20,1);
            if States(sx,sy) == 65
                break;
            end
        end
        if mod(episode,10) == 0        % pin the start every 10th episode
            sx = 5; sy = 5;
            record = record + 1;
        end

        stepcount = 0;
        deltax = zeros(2,1); deltay = zeros(2,1);
        nextdeltax = zeros(2,1); nextdeltay = zeros(2,1);

        while (sx ~= destx || sy ~= desty)
            stepcount = stepcount + 1;
            m = m + 1;

            % table indices: offsets to the sub-goal (1) and goal (2), shifted by +10
            deltax(1) = (sx - subx)  + 10;
            deltay(1) = (sy - suby)  + 10;
            deltax(2) = (sx - destx) + 10;
            deltay(2) = (sy - desty) + 10;

            % Fourier features of the two candidate abstractions
            dxs = sx - subx;  dys = sy - suby;     % sub-goal-relative offsets
            dxg = sx - destx; dyg = sy - desty;    % goal-relative offsets
            phi(:,1) = [1; cos(pi*dxs*0.05); cos(pi*dys*0.05); cos(0.05*pi*(dxs+dys))];
            phi(:,2) = [1; cos(pi*dxg*0.05); cos(pi*dyg*0.05); cos(0.05*pi*(dxg+dyg))];

            % next state under action a, kept inside the 20x20 grid
            if a == up
                nextsx = sx - 1; nextsy = sy;
                if nextsx < 1, nextsx = 1; end
            elseif a == left
                nextsx = sx; nextsy = sy - 1;
                if nextsy < 1, nextsy = 1; end
            elseif a == down
                nextsx = sx + 1; nextsy = sy;
                if nextsx > 20, nextsx = 20; end
            elseif a == right
                nextsx = sx; nextsy = sy + 1;
                if nextsy > 20, nextsy = 20; end
            end
            nextdeltax(1) = nextsx - subx  + 10;
            nextdeltay(1) = nextsy - suby  + 10;
            nextdeltax(2) = nextsx - destx + 10;
            nextdeltay(2) = nextsy - desty + 10;

            % reward: large bonus on reaching the current option's target
            if subfinish == 2
                if (nextsx == destx) && (nextsy == desty)
                    r = 10000;
                else
                    r = -1;
                end
            else
                if (nextsx == subx) && (nextsy == suby)
                    r = 10000;
                else
                    r = -1;
                end
            end

            % epsilon-greedy policy over the successor state
            count = 0;
            qmax = max(Q(nextdeltax(subfinish), nextdeltay(subfinish), :, subfinish));
            for i = 1:4
                if qmax == Q(nextdeltax(subfinish), nextdeltay(subfinish), i, subfinish)
                    count = count + 1;
                end
            end
            probability = rand();
            sumprob = 0;
            nexta = 4;                 % fallback in case of round-off in the cumulative sum
            for i = 1:4
                sumprob = sumprob + policy(nextdeltax(subfinish), nextdeltay(subfinish), i, subfinish);
                if sumprob > probability
                    nexta = i;
                    break;
                end
            end
            for i = 1:4
                if qmax == Q(nextdeltax(subfinish), nextdeltay(subfinish), i, subfinish)
                    policy(nextdeltax(subfinish), nextdeltay(subfinish), i, subfinish) = (1 - epsilon) / count;
                else
                    policy(nextdeltax(subfinish), nextdeltay(subfinish), i, subfinish) = epsilon / (4 - count);
                end
            end

            % Sarsa(lambda) temporal-difference error
            delta = r + gamma * Q(nextdeltax(subfinish), nextdeltay(subfinish), nexta, subfinish) ...
                      - Q(deltax(subfinish), deltay(subfinish), a, subfinish);
            % eligibility-trace update for the current option's table
            e(deltax(subfinish), deltay(subfinish), a, subfinish) = ...
                e(deltax(subfinish), deltay(subfinish), a, subfinish) + 1;
            Q(:,:,:,subfinish) = Q(:,:,:,subfinish) + alpha * delta * e(:,:,:,subfinish);
            e(:,:,:,subfinish) = gamma * lamda * e(:,:,:,subfinish);

            % update the incremental-fit statistics (Fig. 1) for both abstractions
            for i = 1:2
                A(:,:,i) = rho * A(:,:,i) + phi(:,i) * phi(:,i)';
                b(:,i)   = rho * b(:,i) + rho * r * z(:,i) + r * phi(:,i);
                z(:,i)   = rho * z(:,i) + phi(:,i);
                Rc(:,i)  = rho * Rc(:,i) + g(:,i) * r * r + rho * r * Rz(:,i);
                Rz(:,i)  = rho * Rz(:,i) + 2 * g(:,i) * r;
                g(:,i)   = rho * g(:,i) + 1;
            end

            % sub-goal reached during the first option: select an abstraction,
            % seed the value table of the next option, and reset the statistics
            if (nextsx == subx) && (nextsy == suby) && subfinish == 1
                for i = 1:2
                    w(:,i) = (A(:,:,i) + eta * eye(4)) \ b(:,i);
                    err(:,i) = w(:,i)' * A(:,:,i) * w(:,i) - 2 * w(:,i)' * b(:,i) + Rc(:,i);
                    beta(:,i) = m / err(:,i);
                    likelihood(:,i) = -beta(:,i) * err(:,i) / 2 ...
                        + m * 0.5 * log(beta(:,i)) - 0.5 * size(w,1) * log(m);
                end
                disp(m);               % number of samples used for the fit
                m = 0;
                [~, position] = max(likelihood);
                subfinish = 2;
                weight(:,subfinish) = w(:,position);

                % value of every cell under the selected abstraction's fit
                V = zeros(20,20);
                for i = 1:20
                    for j = 1:20
                        deltaxtemp = i - destx;
                        deltaytemp = j - desty;
                        phitemp = [1; cos(pi*deltaxtemp*0.05); cos(pi*deltaytemp*0.05); ...
                                   cos(0.05*pi*(deltaytemp + deltaxtemp))];
                        V(i,j) = weight(:,subfinish)' * phitemp;
                    end
                end

                % seed Q for the new option from V with an epsilon-greedy split
                for i = 1:20
                    for j = 1:20
                        Qtemp = zeros(4,1);
                        if i ~= destx || j ~= desty
                            for k = 1:4
                                if k == up
                                    nexti = i - 1; nextj = j;
                                    if nexti < 1, nexti = 1; end
                                elseif k == left
                                    nexti = i; nextj = j - 1;
                                    if nextj < 1, nextj = 1; end
                                elseif k == down
                                    nexti = i + 1; nextj = j;
                                    if nexti > 20, nexti = 20; end
                                elseif k == right
                                    nexti = i; nextj = j + 1;
                                    if nextj > 20, nextj = 20; end
                                end
                                if nexti == destx && nextj == desty
                                    Qtemp(k) = V(nexti,nextj) + 10000;
                                else
                                    Qtemp(k) = V(nexti,nextj) - 1;
                                end
                            end
                            [~, maxindex] = max(Qtemp);
                            for k = 1:4
                                if k == maxindex
                                    Q(i,j,k,subfinish) = (1 - epsilon) * Qtemp(k);
                                else
                                    Q(i,j,k,subfinish) = epsilon * Qtemp(k) / 3;
                                end
                            end
                        else
                            for k = 1:4
                                Q(i,j,k,subfinish) = 0;
                            end
                        end
                    end
                end

                % reset the fit statistics for the next option
                A  = zeros(4,4,2);  b = zeros(4,2);  z = zeros(4,2);
                Rc = zeros(1,2);    Rz = zeros(1,2); g = ones(1,2);
                weight = zeros(4,2); w = zeros(4,2);
                err = zeros(1,2); beta = zeros(1,2); likelihood = zeros(1,2);
                % note: what value to use when seeding Q here is an open question
            end

            % final goal reached during the second option: fit again and
            % refresh the value table
            if (nextsx == destx) && (nextsy == desty) && subfinish == 2
                for i = 1:2
                    w(:,i) = (A(:,:,i) + eta * eye(4)) \ b(:,i);
                    err(:,i) = w(:,i)' * A(:,:,i) * w(:,i) - 2 * w(:,i)' * b(:,i) + Rc(:,i);
                    beta(:,i) = m / err(:,i);
                    likelihood(:,i) = -beta(:,i) * err(:,i) / 2 ...
                        + m * 0.5 * log(beta(:,i)) - 0.5 * size(w,1) * log(m);
                end
                disp(m);               % number of samples used for the fit
                m = 0;
                [~, position] = max(likelihood);
                subfinish = 1;
                weight(:,subfinish) = w(:,position);

                V = zeros(20,20);
                for i = 1:20
                    for j = 1:20
                        deltaxtemp = i - destx;
                        deltaytemp = j - desty;
                        phitemp = [1; cos(pi*deltaxtemp*0.05); cos(pi*deltaytemp*0.05); ...
                                   cos(0.05*pi*(deltaytemp + deltaxtemp))];
                        V(i,j) = weight(:,subfinish)' * phitemp;
                    end
                end
                for i = 1:20
                    for j = 1:20
                        Qtemp = zeros(4,1);
                        for k = 1:4
                            if k == up
                                nexti = i - 1; nextj = j;
                                if nexti < 1, nexti = 1; end
                            elseif k == left
                                nexti = i; nextj = j - 1;
                                if nextj < 1, nextj = 1; end
                            elseif k == down
                                nexti = i + 1; nextj = j;
                                if nexti > 20, nexti = 20; end
                            elseif k == right
                                nexti = i; nextj = j + 1;
                                if nextj > 20, nextj = 20; end
                            end
                            Qtemp(k) = V(nexti,nextj) - 1;
                        end
                        [~, maxindex] = max(Qtemp);
                        for k = 1:4
                            if k == maxindex
                                Q(i,j,k,subfinish) = (1 - epsilon) * Qtemp(k);
                            else
                                Q(i,j,k,subfinish) = epsilon * Qtemp(k) / 3;
                            end
                        end
                    end
                end
                subfinish = 2;
            end

            % move on to the next step
            sx = nextsx;
            sy = nextsy;
            a = nexta;
            if subfinish == 1
                if nextsx == subx && nextsy == suby
                    subfinish = 2;
                    disp(m);
                    m = 0;
                    a = randi(4,1);
                end
            else
                if (nextsx == destx) && (nextsy == desty)
                    disp(m);
                    m = 0;
                end
            end
        end

        totalsteps(episode) = stepcount;
    end
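The appendix function can be driven as follows to reproduce an averaged learning curve. This is a usage sketch of our own; only the number of runs follows the averaging described in Section V-A, and the plotting commands are our choice.

    % Usage sketch: average the per-episode step counts over 100 runs.
    nruns = 100;
    allsteps = zeros(nruns, 100);
    for k = 1:nruns
        allsteps(k, :) = simplified_abstraction(0);   % the input argument is unused
    end
    plot(1:100, mean(allsteps, 1));
    xlabel('Episode'); ylabel('Average steps');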