Using Prior Knowledge to Improve Reinforcement Learning in Mobile Robotics

David L. Moreno, Carlos V. Regueiro†, Roberto Iglesias and Senén Barro
Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, 15782 Spain
† Departamento de Electrónica y Sistemas, Universidad da Coruña, 15071 Spain
dave@dec.usc.es

Abstract

Reinforcement learning (RL) is thought to be an appropriate paradigm for acquiring control policies in mobile robotics. However, in its standard formulation (tabula rasa) RL must explore and learn everything from scratch, which is neither realistic nor effective in real-world tasks. In this article we propose a new strategy, called Supervised Reinforcement Learning (SRL), for taking advantage of external knowledge within this type of learning, and we validate it on a wall-following behaviour.

1. Introduction

Reinforcement learning (RL) is an interesting strategy for the automatic resolution of tasks in different domains, such as game playing (Tesauro, 1994), robotics (Schaal and Atkeson, 1994) and even computer networks (Boyan and Littman, 1994). One of the main advantages of RL is that it does not need a set of inputs paired with their correct answers for training, which are often hard to produce in dynamic and unknown environments. Instead, it only requires a measurement of the quality of the system's behaviour, the so-called reinforcement. This feature, together with its incremental nature and adaptive capabilities, makes it suitable for use in mobile robotics.

Nevertheless, tabula rasa RL has strong limitations. The main one is that RL assumes that the environment, as perceived by the system, is a Markov Decision Process (MDP), which implies that the system only needs to know the current state of the process in order to predict its future behaviour. Moreover, there must not be any perceptual aliasing; i.e., the agent cannot perceive as equal two situations that require different actions. Another limitation is the exploitation/exploration dilemma; i.e., the need to establish the strategy the agent must follow when deciding between trying new actions and using previously acquired knowledge. These limitations cause a lack of stability during learning, low robustness in the learnt behaviours, and slow convergence. These problems become even more evident in real applications, especially in mobile robotics (Wyatt, 1995), where environments are complex, dynamic, and usually cannot be fully modelled. However, in these systems there usually exists prior knowledge about the task, in the form of human expertise or previously developed controllers. This information can be used to improve the learning process, so that the RL agent does not start from scratch.

In this paper we present an application of tabula rasa RL in the mobile robotics field, as well as a new strategy for taking advantage of prior knowledge in this type of learning, which we have called Supervised Reinforcement Learning (SRL) (Iglesias et al., 1998a, Iglesias et al., 1998b). We carried out our experiments within the Nomad 200 simulator, and we selected the wall-following behaviour as the task to be learnt.

The remainder of this paper is organized as follows: Section 2 reviews the basics of reinforcement learning. Section 3 presents the main ideas behind the SRL model. Sections 4 and 5 report our experimental setup and the experiments we have carried out. Section 6 briefly reviews related work on the use of prior knowledge to improve RL. Finally, Section 7 discusses the contributions of this paper.
2. Reinforcement Learning

In the RL paradigm (Kaelbling et al., 1996), an agent interacts with the environment through a set of actions. The environment is then modified and the agent perceives the new state through its sensors. Furthermore, at each step the agent receives an external reward signal (see Figure 1). The objective of the RL agent is to maximize the amount of reward received in the long term. In this learning strategy an objective is defined and the learning process takes place through trial-and-error interactions in a dynamic environment. The agent is rewarded or punished on the basis of the actions it carries out.

Figure 1: Basic diagram of an RL agent.

There are many algorithms that implement RL principles; among the most used are Sarsa, Dyna, Prioritized Sweeping and Q-learning. In this work we have employed the last of these (Watkins, 1989), since its simplicity and ease of implementation make it the most commonly used. We have used a tabular representation, in which a Q-value Q(s, a) is stored for every state-action pair and updated over the learning process. This Q-value represents the usefulness of performing action a when the robot is in state s. Q-learning directly approximates the optimal action-value function, independently of the policy currently being followed. Its update rule is:

    Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) \right],    (1)

where s_t is the current state, a_t is the action taken at the current instant, r_t is the reward received after executing a_t, and Q(s_t, a_t) is the evaluation of action a_t in state s_t. The only parameters that need to be adjusted are the learning rate α and the discount factor γ.

The Q-learning algorithm requires a good balance between the exploitation of the information learnt so far and the exploration of different actions (Sutton and Barto, 1998). To achieve this we have used the Softmax algorithm (Bridle, 1990), in which the probability of taking action a_i in state s is given by the following equation:

    \Pr(s, a_i) = \frac{e^{Q(s, a_i)/T(s)}}{\sum_{j=1}^{n} e^{Q(s, a_j)/T(s)}},    i = 1, \dots, n,    (2)

where {a_1, ..., a_n} is the set of possible actions in state s. The parameter T (temperature) makes it possible to regulate the distribution of probabilities over the actions. By varying T we force intense exploration at the onset of the learning phase, gradually focusing on the selection of the best-evaluated actions. Each state regulates its exploration level independently.

3. Embedding prior knowledge: SRL

The objective of the SRL model is to establish a working framework for the development of systems that integrate prior knowledge into RL processes by focalizing the exploration of an RL algorithm. SRL comprises several basic blocks (Figure 2): the reinforcement learning module, the prior knowledge sources (PKSs), and the control module, which regulates the knowledge transfer between them.

The RL module houses the RL algorithm. This cannot be an on-policy algorithm, since in such methods the action to be carried out must be determined by the algorithm itself, whereas in the SRL model the control module makes this decision (Figure 2). For this application we use the Q-learning algorithm.
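To make the role of the RL module concrete, the following is a minimal sketch (not the original implementation) of the tabular Q-learning update of Equation (1) and the per-state Softmax selection of Equation (2). The data structures (a dictionary of Q-values and a per-state temperature table) and default values are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

class TabularQLearner:
    """Sketch of the tabular RL module: Eq. (1) update, Eq. (2) Softmax."""

    def __init__(self, actions, alpha=0.2, gamma=0.99, initial_temperature=1.0):
        self.actions = list(actions)
        self.alpha = alpha                      # learning rate (0.2 in the paper)
        self.gamma = gamma                      # discount factor (0.99 in the paper)
        self.q = defaultdict(float)             # Q(s, a), initialised to 0
        self.temperature = defaultdict(lambda: initial_temperature)  # T(s), per state

    def update(self, s, a, reward, s_next):
        """Q-learning update rule, Eq. (1)."""
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = ((1.0 - self.alpha) * self.q[(s, a)]
                          + self.alpha * (reward + self.gamma * best_next))

    def softmax_probabilities(self, s):
        """Action probabilities, Eq. (2), using the temperature of state s."""
        t = self.temperature[s]
        prefs = [self.q[(s, a)] / t for a in self.actions]
        m = max(prefs)                          # subtract the max for numerical stability
        exps = [math.exp(p - m) for p in prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def select_action(self, s):
        return random.choices(self.actions, weights=self.softmax_probabilities(s))[0]
```

In the SRL agent described below, the action choice of select_action is taken over by the control module, while the update rule is unchanged; this is why an off-policy algorithm such as Q-learning is required.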
The PKSs supply their advice (recommended actions) for the current state of the RL module, s. More specifically, they produce a vector of utilities u which contains a value u(s, a_i) ∈ [0, 1] for each action a_i that can be carried out in the current state s. This value indicates to the control module how advisable this action is according to the PKS that supplies the vector.

3.1 Control module

The control module has the task of amalgamating the utilities that the PKSs supply for each action (the pieces of advice) with the information learnt so far (Figure 2). This module gives priority to the knowledge transfer over the exploration of new actions. The control module is divided into two blocks: the credit assignment block and the decision block.

Figure 2: SRL block diagram.

3.1.1 Credit assignment block

The credit assignment block first decides whether the advice obtained from the PKSs is of sufficient quality to be used. If this is not the case, alternative actions to the ones recommended will need to be explored. The inputs to the credit assignment block (Figure 2) are the Q-values Q(s, a_i) from the RL module and the advice from the PKSs, {u_1, ..., u_m}. The first step consists of amalgamating the different pieces of advice into one single vector and then normalizing it in order to obtain the exploitation policy, w(a), so called because following it implies exploiting the knowledge stored in the sources of prior knowledge:

    w(a_i) = \frac{\sum_{j=1}^{m} u_j(a_i)}{\max_k \left\{ \sum_{j=1}^{m} u_j(a_k) \right\}},    (3)

where m is the number of PKSs. Then the exploration policy, e(a), is constructed, so called because it recommends taking those actions that the PKSs consider unadvisable:

    e(a_i) = 1 - w(a_i),    i = 1, \dots, n.    (4)

The next step is to decide which of the two policies should be followed. In order to do so, we define the function Ω(s, x), where s is the current state and x is a generic utility vector, i.e., a normalized vector with a value for each action of the state s. This function is defined as:

    \Omega(s, x) = \max_i \left\{ x(a_i) \cdot \left[ Q(s, a_i) - \min_a Q(s, a) \right] \right\}.    (5)

The value of Ω gives us a measurement of the compatibility of the utility vector x with the Q-values learned for state s, which tells us whether the actions suggested by x are well evaluated by the learning process. In order to decide the most suitable policy, g(a), we establish the following criterion:

    g(a) = \begin{cases} e(a) & \text{if } \Omega(s, e) - \delta > \Omega(s, u_j) \quad \forall j = 1, \dots, m, \\ w(a) & \text{otherwise,} \end{cases}    (6)

i.e., e(a) is chosen whenever its compatibility Ω is higher, by a margin δ, than that of all the suggestions u_j in the current state s. In this situation the suggestions either have no information to supply, or are recommending actions that are poorly evaluated by the experience accumulated in the RL module; thus the advice is not trustworthy and the system must search for alternatives: it must explore. The parameter δ, the exploration threshold, is a positive value that makes it possible to regulate the tolerance the system will have towards bad advisors.

Lastly, a vector of utilities, h(s, a), is drawn up with a value between 0 and 1 for each action:

    h(s, a_i) = \frac{g(a_i) \cdot \left[ Q(s, a_i) - \min_a Q(s, a) \right]}{\Omega(s, g)}.    (7)

This vector is called the decision vector and indicates which actions are the most suitable for the current state. In this vector, if a piece of advice coincides with a high value for the action, it is reinforced; on the other hand, if a piece of advice recommends an action with a low evaluation, it is moderated. This vector goes to the decision block.
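As a concrete illustration of the credit assignment block, the sketch below implements Equations (3) to (7) under the assumption that the Q-values and advice vectors are supplied as plain Python lists indexed by action; the function name and the guards for all-zero advice and a zero Ω(s, g) are our own assumptions, not part of the original model.

```python
from typing import List

def credit_assignment(q_values: List[float],
                      advice: List[List[float]],
                      delta: float = 0.1) -> List[float]:
    """Sketch of the SRL credit assignment block, Eqs. (3)-(7).

    q_values : Q(s, a_i) for every action a_i of the current state s.
    advice   : one utility vector u_j (values in [0, 1]) per PKS.
    delta    : exploration threshold.
    Returns the decision vector h(s, a), which feeds the decision block.
    """
    n = len(q_values)
    q_min = min(q_values)

    def omega(x: List[float]) -> float:
        # Eq. (5): compatibility of a utility vector with the learnt Q-values.
        return max(x[i] * (q_values[i] - q_min) for i in range(n))

    # Eq. (3): exploitation policy w(a) -- amalgamated, normalized advice.
    sums = [sum(u[i] for u in advice) for i in range(n)]
    max_sum = max(sums)
    w = [s_i / max_sum if max_sum > 0 else 0.0 for s_i in sums]

    # Eq. (4): exploration policy e(a).
    e = [1.0 - w_i for w_i in w]

    # Eq. (6): explore only if e is more compatible, by delta, than every advisor.
    g = e if all(omega(e) - delta > omega(u) for u in advice) else w

    # Eq. (7): decision vector h(s, a); the zero-denominator guard is an
    # assumption of this sketch, not specified in the paper.
    denom = omega(g)
    if denom == 0.0:
        return g
    return [g[i] * (q_values[i] - q_min) / denom for i in range(n)]
```

The resulting vector h(s, a) is then used by the decision block described next, exactly as Q(s, a) is used in Equation (2).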
3.1.2 Decision block

This block has the task of selecting the action to be implemented. The action is chosen taking into account the vector h(s, a). In order to maintain a certain degree of exploration over the policy that is finally selected, we use the Softmax algorithm. Thus the probability of taking action a_i in state s is:

    \Pr(s, a_i) = \frac{e^{h(s, a_i)/T(s)}}{\sum_{j=1}^{n} e^{h(s, a_j)/T(s)}},    i = 1, \dots, n,    (8)

where n is the number of actions available in state s. Again, each state has its own temperature.

4. Application

The control of mobile robots is highly complex and requires the use of control architectures (Regueiro et al., 2002). The majority of these are based on low-level behaviours that efficiently resolve a specific task. We have chosen the wall-following behaviour as the task to be learnt, as it is one of the most used in mobile robotics.

4.1 RL Agent

The three elements of an RL system are the state representation, the reward function and the actions that the robot can take in each state.

Our state representation uses only information from the robot's ultrasonic sensors, which are divided into four overlapping groups (Figure 3(a)) associated to four state variables: right-hand distance (R), left-hand distance (L), frontal distance (F) and right-frontal distance (RF). The value of each variable is the minimum measurement of its associated sensors, which is discretized according to the values shown in Figure 3(b). After several failed experiments, we had to define a fifth state variable: the relative orientation (O) between the robot and the wall. To obtain it, the reference wall is approximated by a straight line by means of linear regression over the latest measurements from sensors 11, 12 and 13 (Figure 3(a)). The angle between this straight line and the forward direction of the robot is discretized into 4 values: approaching, moving away, parallel and no direction. The last value is assigned when the regression is not reliable, the frontal wall is very close, or the robot is too close to or too far from the wall being followed. Thus, the state is defined as the quintuple s = <R, RF, F, L, O>. Theoretically, there are 188 possible states, although not all of them are met during task development in a normal environment.

Figure 3: State representation (except orientation): a) ultrasound sensors associated to each state variable. Sensor number 0 marks the advancing direction; b) discretization of the distance variables. Values that give rise to a negative reward are shaded.

The second component of the RL system is the reward function. As in the wall-following behaviour the objective of the robot is to keep a constant distance from the right-hand wall, we define the reward function as:

    r = \begin{cases} -1 & \text{if } R = \text{too close} \lor R = \text{too far} \lor F = \text{too close} \lor L = \text{close}, \\ 0 & \text{in any other situation,} \end{cases}    (9)

where ∨ represents the logical operator "OR". Figure 3(b) shows the values of the state variables (shaded) that give rise to a negative reward.

The actions that the robot can carry out constitute the final component. In order to simplify the learning process we set a constant linear velocity for the robot (20 cm/s), so that it is only necessary to learn to control the angular velocity. As the maximum angular velocity of the robot is 45°/s, we have discretized the space of actions as follows:

    A = \{-40, -20, -10, -0.3, 0, 0.3, 10, 20, 40\} \; (°/s).    (10)

The values used for the parameters of the RL and SRL agents can be seen in Table 1.

    Parameter          Description            Value
    α                  Learning rate          0.2
    γ                  Discount factor        0.99
    δ (only in SRL)    Exploration threshold  0.1

Table 1: Learning parameters for RL and SRL.
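The sketch below illustrates how this state, reward and action space might be encoded. It is only indicative: the class and constant names are our own, and the numeric bin boundaries of the distance discretization come from Figure 3(b), which is not reproduced here, so states are represented directly by their discrete labels.

```python
from typing import NamedTuple

# Action set of Eq. (10): angular velocities in degrees per second,
# applied with a constant linear velocity of 20 cm/s.
ACTIONS = (-40, -20, -10, -0.3, 0, 0.3, 10, 20, 40)

# The four discrete values of the orientation variable O named in the paper.
ORIENTATION_LABELS = ("approaching", "moving away", "parallel", "no direction")

class State(NamedTuple):
    """Quintuple state s = <R, RF, F, L, O> of Section 4.1.

    Each distance field holds a discretization label from Figure 3(b);
    only the labels used in Eq. (9) are referenced below.
    """
    R: str    # right-hand distance
    RF: str   # right-frontal distance
    F: str    # frontal distance
    L: str    # left-hand distance
    O: str    # relative orientation to the wall

def reward(s: State) -> int:
    """Reward function of Eq. (9)."""
    if s.R == "too close" or s.R == "too far" or s.F == "too close" or s.L == "close":
        return -1
    return 0
```

For instance, any state with R = "too far" or F = "too close" receives a reward of -1, regardless of the remaining variables.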
4.2 Prior knowledge

In our system we have used one PKS. To build it, a human expert decided on the action to be carried out in a number of representative positions within the environment. This PKS (called Ad-Hoc) supplies reasonably good advice in 12 states, corresponding to the straight-wall and open-corner situations. For each advised state, a single action with maximum utility is recommended. It is important to stress that, on its own, the advisor is not capable of completely resolving the task.
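A PKS of this kind can be represented very simply. The sketch below is a hypothetical reconstruction: the paper does not list the 12 advised states or their recommended actions, and returning an all-zero utility vector for states without advice is an assumption of this sketch.

```python
# Illustrative Ad-Hoc-style PKS: a lookup table that, for a handful of advised
# states, gives maximum utility (1.0) to a single recommended action and zero
# to every other action.  The table is left empty here; in practice its keys
# would be the 12 advised states (as produced by the state representation)
# and its values the indices of the recommended actions.
ADVISED_ACTIONS = {}

def adhoc_pks(state, n_actions: int):
    """Return the utility vector u(s, a) for the current state."""
    utilities = [0.0] * n_actions
    if state in ADVISED_ACTIONS:
        utilities[ADVISED_ACTIONS[state]] = 1.0   # single action with maximum utility
    return utilities
```

With only 12 advised states out of 188, most states yield an uninformative vector, and in them the control module has to rely on the exploration mechanism described in Section 3.1.1.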
5. Experimental results

5.1 Experimental procedure

In our experiments we differentiate between control cycles and learning steps. The former are carried out every 1/3 of a second; in each one the system updates the sensorial measurements, determines the current state, and selects the action to be implemented. The learning steps, on the other hand, only take place when the agent changes state, and are therefore asynchronous.

Our experimental procedure comprises two phases: learning and testing. In the former, the action to be carried out in each control cycle is selected on the basis of the probabilities given by Equation (8). Furthermore, the Q-values from the RL module are stored periodically for the test phase. The test phase is used to measure the performance of the system throughout the learning process and to determine when the learning process has converged. During this phase, the stored Q-values are loaded and the best-valued action is always selected. The PKS is never used. A test finishes when the robot makes an error (collides with the environment or strays too far away from the wall being followed) or when it is considered that the task has been satisfactorily learnt. As the measurement of the quality of what the agent has learnt, we have selected the time during which the agent carries out the task before committing an error. Given that the duration of each control cycle and the linear velocity of the robot are constant, this measurement is equivalent to the number of control cycles issued before an error occurs. We have not used the amount of reinforcement received throughout the learning process (though this is precisely the value that the agent attempts to maximize), as in these systems it is not always possible to make out whether obtaining little reinforcement is due to a total failure or to low performance in carrying out the task. The difference is crucial for us: in the former case the task is not attained, while in the latter it is. In our experiments, we consider the task to have been accurately learnt when the system is able to carry it out for 75,000 control cycles without making an error. Our convergence criterion is that the task should be accurately learnt over 5 consecutive tests.

5.2 Convergence time

Figure 4 shows the results for the RL agent (Q-learning). This figure shows the reward obtained during the learning stage: each point represents the reinforcement accumulated in the previous 1,000 learning steps. It can be seen how the learning process stabilizes at around 19,000 learning steps. This figure also shows the results of the test phase (control cycles without failure). As can be seen, the agent is capable of resolving the task on the basis of 16,000 learning steps, even though it still receives some negative rewards. There is a small disruption between 27,000 and 29,000 learning steps (4 negative rewards), which is due to the correct action of a critical state being devalued. The system is destabilized until it relearns the correct action, which is reflected in the test phase. These fluctuations are common in RL algorithms. Figure 5(a) shows the trajectory of the robot during the test phase after 35,000 learning steps.

Figure 4: Reward obtained by Q-learning during the learning phase and number of control cycles without error during the test phase (max. 75,000 cycles).

A comparison of the RL and SRL agents' convergence times can be seen in Figure 5(b). The incorporation of knowledge into the reinforcement learning process by means of SRL significantly accelerates learning convergence. The Ad-Hoc advisor reduces the learning convergence time to the region of 2,000 learning steps, which is an eighth of the time required by tabula rasa RL.

5.3 Robustness of the learnt behaviour

The final aim of the learning process is for the robot to succeed in learning to follow the reference wall. But it is also important to verify just how robust the final learnt behaviour is, and one way of verifying this is to study what occurs when the environment changes. If it is robust, it will not be affected to any great degree, and the robot will be capable of carrying out the task.

In this experiment we took the behaviours learnt by tabula rasa RL and SRL with Ad-Hoc in the environment of Figure 5(a) and we tested them in new environments (Figures 6(a) and 6(c)). We carried out the tests at different stages of the learning process, and the results obtained can be seen in Figures 6(b) and 6(d), respectively. The knowledge acquired by tabula rasa RL in the original environment is not sufficient to resolve the task in other, much more complicated environments (Figure 6(c)). RL learning is not sufficiently robust to obtain a satisfactory behaviour. On the other hand, the behaviour learnt with SRL shows a great degree of robustness, carrying out the task correctly in the new environments from the onset (4,000 learning steps). Figures 6(a) and 6(c) show the path of the robot carrying out the task in the new environments with the policy acquired by means of SRL (Ad-Hoc SC) after 20,000 learning steps in the original environment.

This lack of robustness of the RL-learnt behaviour is due to the fact that the selected action (the one with the highest Q-value) for the most frequent states keeps changing over the learning process for a longer period, with the same learning parameters, in RL than in SRL. Hence, RL convergence is slow and weak. In the original environment these changes in the selected actions have no effect on the performance after 16,000 learning steps, but they prove critical in complex environments such as the one shown in Figure 6(c).
On the other hand, thanks to the prior knowledge, the selected actions for these states are settled sooner in SRL. This gives rise to a solid convergence and a much more robust behaviour.

Figure 5: RL and SRL results for the wall-following task: a) environment and robot trajectory performing the task after 35,000 RL learning steps (test phase); b) comparison of the SRL (PKS Ad-Hoc) and tabula rasa RL results.

6. Related work

Efforts aimed at including an agent's prior knowledge in RL in the field of mobile robotics have been based on three techniques. The first is the design of complex reinforcements (Matarić, 1994), which suffers from a lack of generality and from the difficulty of devising the functions that supply the variable reinforcement, which will probably not be reusable in any other task. Dynamic, knowledge-oriented creation of the state space (Hailu, 2001) is the second technique. Its main drawbacks are that the stability of the system during learning is not taken into account, and that there is a great dependency on the quality of the knowledge: if this is incorrect, the agent cannot learn.

The third and most important approach is the focalization of exploration. There are various approaches, one being to focalize exploration only at the onset of learning (del R. Millán et al., 2002). The statistical nature of RL results in fluctuations in the learning process, which may lead to the devaluation of an initially recommended good action. In these cases, the initial focalization of the exploration does not help to stabilize the convergence of the RL algorithm. Another approach is that of Lin (Lin, 1992), who proposes systems that use external knowledge expressed in the form of sequences of states and actions supplied by a teacher. The limitation of these systems is that they can only make use of knowledge that can be expressed as sequences of states and actions characteristic of the system that is learning. In a similar strategy, Clouse presents a system in which advice is administered asynchronously (Clouse and Utgoff, 1992): a teacher may take part in the learning process whenever he considers it necessary. In general, example- or teacher-based systems do not allow the use of knowledge that has not been drawn up specifically for the system, of incorrect knowledge, or of several simultaneous teachers.

The latest exploration-focalization technique, the one used in SRL, is to employ a control module to regulate the transfer of information between the knowledge sources and the RL algorithm, which is located at the same level. Dixon (Dixon et al., 2000) presents a system of this type, which is similar to SRL. However, his work lacks the generality of SRL, which is designed to allow a broader range of prior knowledge sources and to deal with partial or even erroneous advice. In SRL, the balancing of prior knowledge and learnt information, which is carried out in the credit assignment block, allows the designer to use several PKSs without the need for complex elaboration or any testing of their correctness. Dixon's control modules are far simpler. SRL also contemplates the possibility of the control module imposing its decisions or assigning overall control to a knowledge source, transforming it into a teacher.
Thus, SRL includes the aforementioned approaches while at the same time permitting the use of a much broader set of knowledge sources. Furthermore, we believe that SRL is suitable for use on real systems in mobile robotics.

Figure 6: SRL-learnt wall-following behaviour robustness: a) and c) robot trajectory performing the task with Q-values learnt after 20,000 SRL (Ad-Hoc) learning steps in the environment of Fig. 5(a); b) and d) control cycles without failure during the test phase using Q-values learnt in the environment of Fig. 5(a), SRL (Ad-Hoc) against tabula rasa RL.

7. Conclusion and future work

The use of reinforcement learning (RL) in real systems has highlighted the limitations of these algorithms, the main one being slowness of convergence. On the other hand, in real systems there often exists prior knowledge about the task being learnt that can be used to improve the learning process. In this paper we propose a new strategy for making use of external knowledge within RL, which we call Supervised Reinforcement Learning (SRL). SRL is based on using prior knowledge about the task to focalize the RL algorithm's exploration towards the most promising areas of the state space. Thanks to SRL, knowledge can be used to speed up the convergence of RL algorithms, yielding at the same time more robust controllers and improving the agent's stability during the learning process.

In order to demonstrate the viability of the proposed methodology, it has been applied to the resolution of a basic task in mobile robotics, the wall-following behaviour, and it has been compared with a classical tabula rasa RL algorithm (Q-learning). Thanks to SRL a significant reduction in learning convergence time has been achieved, even using a simple, intuitive prior knowledge source. Thus, just by advising on the best action to be implemented in 12 of the 188 possible states, there is an 84% reduction in the SRL convergence time with respect to RL, which is reduced from 16,000 to 2,000 learning steps. Furthermore, we have verified that the reduction in convergence time does not imply a loss of robustness in the final behaviour that is obtained. In fact, the behaviour learnt by SRL proves to be correct even in more complex environments than those used during the learning phase, in which the policy learnt by RL does not always succeed in carrying out the task safely.

The SRL model permits the existence of several prior knowledge sources, even ones with incorrect advice. Our next objective is to study the performance of the model in such situations. We also aim to explore new methods for improving knowledge transfer, concentrating on the theoretical aspects of the model. Lastly, we intend to apply SRL directly on a real robot, and not on a simulator. The characteristics of SRL with regard to stability during the learning phase and convergence time lead us to believe that on-line learning will be feasible and safe.

Acknowledgment

This work was supported by CICYT project TIC2003-09400-C04-03. David L. Moreno's research was also supported by MECD grant FPU-AP2001-3350.

References
Boyan, J. A. and Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Cowan, J. D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems, volume 6, pages 671-678. Morgan Kaufmann.

Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Touretzky, D. (Ed.), Advances in Neural Information Processing Systems: Proc. 1989 Conf., pages 211-217. Morgan Kaufmann.

Clouse, J. A. and Utgoff, P. E. (1992). A teaching method for reinforcement learning. In Machine Learning: Proc. 9th Int. Workshop (ML92), pages 92-101. Morgan Kaufmann.

del R. Millán, J., Posenato, D., and Dedieu, E. (2002). Continuous-action Q-learning. Machine Learning, 49:247-265.

Dixon, K. R., Malak, R. J., and Khosla, P. K. (2000). Incorporating prior knowledge and previously learned information into reinforcement learning agents. Technical report, Carnegie Mellon University, Institute for Complex Engineered Systems.

Hailu, G. (2001). Symbolic structures in numeric reinforcement for learning optimum robot trajectory. Robotics and Autonomous Systems, 37:53-68.

Iglesias, R., Regueiro, C., Correa, J., and Barro, S. (1998a). Supervised reinforcement learning: Application to a wall following behaviour in a mobile robot. In Tasks and Methods in Applied Artificial Intelligence, volume 1416, pages 300-309. Springer Verlag.

Iglesias, R., Regueiro, C., Correa, J., Sánchez, E., and Barro, S. (1998b). Improving wall following behaviour in a mobile robot using reinforcement learning. In Proc. International ICSC Symposium on Engineering of Intelligent Systems, volume 3, pages 531-537, Tenerife, Spain.

Kaelbling, L. P., Littman, M. L., and Moore, A. P. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4):293-321.

Matarić, M. J. (1994). Reward functions for accelerated learning. In Int. Conf. on Machine Learning, pages 181-189.

Regueiro, C., Rodríguez, M., Correa, J., Moreno, D., Iglesias, R., and Barro, S. (2002). A control architecture for mobile robotics based on specialists. In Intelligent Systems: Technology and Applications, volume 6, pages 337-360. CRC Press.

Schaal, S. and Atkeson, C. (1994). Robot juggling: Implementation of memory-based learning. IEEE Control Systems, 14(1):57-71.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219.

Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.

Wyatt, J. (1995). Issues in putting reinforcement learning onto robots. In 10th Biennial Conference of the AISB, Sheffield, UK.