Intelligent Online Case-Based Planning Agent Model for Real-Time Strategy Games

Omar Enayet 1, Abdelrahman Ogail 2, Ibrahim Moawad 3, and Mostafa Aref 4
Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
1,2 first.last@hotmail.com   3 first_last@hotmail.com   4 aref_99@yahoo.com

Abstract – Research in learning and planning for real-time strategy (RTS) games is of interest to several industries, such as the military industry, robotics, and, most importantly, the game industry. Recent work on online case-based planning in RTS games lacks the capability of online learning from experience, so the certainty of the knowledge remains constant, which leads to inefficient decisions. In this paper, an intelligent agent model based on both online case-based planning and reinforcement learning techniques is proposed. In addition, the proposed model has been evaluated using empirical simulation on Wargus (an open-source clone of the well-known real-time strategy game Warcraft 2). This evaluation shows that the proposed model increases the certainty factor of the case base by learning from experience, and hence improves the decision-making process so that more efficient, effective and successful plans are selected.

Keywords: Case-Based Reasoning, Reinforcement Learning, Online Case-Based Planning, Real-Time Strategy Games, SARSA(λ) Learning, Intelligent Agent.

I. INTRODUCTION

A. Real-Time Strategy Games

RTS games constitute well-defined environments for conducting experiments and offer straightforward, objective ways of measuring performance. Moreover, strong game AI will likely make a difference in future commercial games, because graphics improvements are beginning to saturate. RTS game AI is also interesting for the military, which uses battle simulations in training programs [1]. RTS games offer challenging opportunities for research in adversarial planning under uncertainty, learning and opponent modeling, and spatial and temporal reasoning. They feature hundreds or even thousands of interacting objects, imperfect information, and fast-paced micro-actions [1].

B. Online Case-Based Planning [3]

The CBR cycle rests on two assumptions that are not suited for strategic real-time domains involving on-line planning. First, problem solving is modelled as a single-shot process, i.e. a single loop through the CBR cycle solves the problem. In case-based planning, solving a problem might involve solving several sub-problems and monitoring their execution (potentially having to solve new problems along the way). Second, problem solving and plan execution are decoupled, i.e. the CBR cycle produces a solution, but executing that solution is delegated to some external module. In strategic real-time domains, executing the solution is part of solving the problem, especially when the internal model of the world is not completely accurate, and ensuring that the execution of the solution succeeds is an important part of problem solving. For instance, while executing a solution the system might discover low-level details about the world that render the proposed solution wrong, so another solution has to be proposed. The OLCBP (On-Line Case-Based Planning) cycle is an extension of the CBR cycle with two added processes, plan expansion and plan execution, needed to deal with planning and execution of solutions in real-time domains.

II. BACKGROUND

Case-based planning has so far been applied to computer games only in the Darmok system [3], an online planning system that interleaves the expansion of plans with the execution of the ready actions in those plans.
A plan is selected through a retrieval algorithm after assessing the situation using a novel situation assessment model [4]. The system learns plans in an offline stage from human demonstration [5]. Finally, before any plan is executed, delayed adaptation is applied to it to make it applicable in the current state of the environment [6]. Our system is inspired by Darmok, with the addition of online learning: Darmok does not include a principled specification of online learning for its plans; it only evaluates a plan through an output parameter computed heuristically.

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning [7].

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. From the theoretical view, they are a bridge from TD to Monte Carlo methods. From the mechanistic view, an eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. The trace marks the memory parameters associated with the event as eligible for undergoing learning changes. When a TD error occurs, only the eligible states or actions are assigned credit or blame for the error. Like TD methods themselves, eligibility traces are a basic mechanism for temporal credit assignment [7].

Sarsa(λ) combines the temporal-difference learning technique one-step Sarsa with eligibility traces to learn state-action pair values Qt(s,a) effectively [7]. It is an on-policy algorithm in that it approximates the state-action values of the current policy and then gradually improves the policy based on those approximate values [7].

In this paper we introduce an approach that hybridizes online case-based planning with reinforcement learning. Section II reviews the background and related work, Section III introduces the architecture of the agent, Section IV describes the algorithm used in the hybridization, Section V presents the experiment and results, and Section VI presents the conclusion and future work.

III. INTELLIGENT OLCBP AGENT MODEL

A novel AI agent capable of online planning and online learning is introduced. Basically, the agent goes through two main phases: an offline learning phase and an online planning and learning phase. In the offline phase, the agent learns plans by observing gameplay and strategies between two different opponents (human or computer) and then deduces and forms several cases. The agent receives raw traces of the game played by the human. Using goal matrix generation, these traces are converted into raw plans. Further, a dependency graph is constructed for each raw plan. This helps the system determine 1) the dependencies between the raw plan steps, and 2) which steps are suitable for parallel execution (a minimal sketch of this step is given below).
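For concreteness, the following is a minimal sketch of the dependency-graph step, not the system's actual implementation: it assumes each raw plan step exposes hypothetical `consumes`/`produces` sets (standing in for its pre- and postconditions) and marks two steps as parallelizable when neither depends, directly or transitively, on the other.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """One step of a raw plan (hypothetical structure, for illustration only)."""
    name: str
    consumes: set = field(default_factory=set)   # resources/conditions the step needs
    produces: set = field(default_factory=set)   # resources/conditions the step provides

def build_dependency_graph(steps):
    """Map each step index to the indices of earlier steps it depends on."""
    deps = {i: set() for i in range(len(steps))}
    for i, step in enumerate(steps):
        for j in range(i):
            if steps[j].produces & step.consumes:   # step j provides something step i needs
                deps[i].add(j)
    return deps

def parallelizable(deps, i, j):
    """Two steps may run in parallel when neither (transitively) depends on the other."""
    def reaches(src, dst, seen=frozenset()):
        return dst in deps[src] or any(
            reaches(p, dst, seen | {src}) for p in deps[src] if p not in seen)
    return not reaches(i, j) and not reaches(j, i)

# Example: both training steps only depend on the barracks step, so they can run in parallel.
raw_plan = [
    PlanStep("build_barracks", consumes={"gold"}, produces={"barracks"}),
    PlanStep("train_footmen", consumes={"barracks"}, produces={"footmen"}),
    PlanStep("train_archers", consumes={"barracks"}, produces={"archers"}),
]
deps = build_dependency_graph(raw_plan)
print(deps)                        # {0: set(), 1: {0}, 2: {0}}
print(parallelizable(deps, 1, 2))  # True
```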
Finally, hierarchical composition is applied to each raw plan in order to shrink its size by substituting a group of related steps with a single step (of type goal). All learnt cases are retained in the case base, and with this the offline stage is complete.

Figure 1: Case Representation

A case consists of the following (shown in Figure 1):
1. Goal: the goal that this case satisfies.
2. Situation: the situation in which the case is applicable.
3. Shallow features: a set of sensed environment features that are computationally cheap.
4. Deep features: a set of sensed environment features that are computationally expensive.
5. Success rate: a decimal number between zero and one indicating the success rate of the case.
6. Eligibility trace: an integer representing the frequency of use of the case.
7. Prior confidence: a decimal number between zero and one, set by an expert, indicating the confidence of success when using the case.
8. Behavior: contains the actual plan to be used.

After the agent has learnt the cases that serve as the basic ingredients for playing the game, it is ready to play against different opponents. We call the phase in which the agent plays against opponents the online planning and learning phase. Planning comes from the expansion and execution of the current plan, whereas learning comes from the revision of the applied plans. The online phase consists of the expansion, execution, retrieval, adaptation, reviser, and current-plan modules.

The expansion module expands ready open goals in the current plan: ready means that all snippets before that goal were executed successfully, and open means that the goal has no assigned behavior. As shown in Figure 2, a behavior consists of preconditions, which must be satisfied before the plan can execute; alive conditions, which must be satisfied while the plan is being executed; success conditions, which must be satisfied after the plan has been executed; and a snippet, a set of plan steps executed serially, where each plan step consists of a set of parallel actions/goals. The snippet forms the plan.

The retrieval module selects the most efficient plan using situation assessment to obtain the most suitable plans and an E-greedy selection policy to determine whether to explore or exploit, as shown in Equation 1 in Figure 2. In exploitation, the predicted performance of all cases is computed and the best case is selected using Equation 2 in Figure 2. In our experiments, the exploration parameter is set to 0.3.

Figure 2: Retrieval Equations

The situation assessment module is built by capturing the most representative shallow features of the environment, building a situation model that maps a set of shallow features to a situation, building a situation-case model to classify each case under a specific situation, and building a situation-deep-feature model to provide the set of deep features important for the predicted situation.

After a complete snippet execution, the reviser module starts its mission. The importance of revision stems from the fact that interactive, intelligent agents must get feedback from the environment in order to improve their performance: the learnt plans were based on specific situations that might not always be suitable, and the human demonstrator himself could have been playing with insufficiently good plans and strategies. The reviser adjusts the case performance using temporal-difference learning with the SARSA(λ) on-policy learning algorithm.
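Before turning to retention and adaptation, the sketch below illustrates the case structure of Figure 1 and the E-greedy retrieval just described. It is an illustration only: the field names, the `Case` class, and the use of the learned success rate as the exploitation criterion are our simplifications (the predicted-performance equations of Figure 2 are not reproduced here).

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Case:
    """Illustrative encoding of the case structure of Figure 1."""
    goal: str                       # 1. goal the case satisfies, e.g. "BuildArmy"
    situation: str                  # 2. situation where the case is applicable
    shallow_features: dict = field(default_factory=dict)  # 3. cheap sensed features
    deep_features: dict = field(default_factory=dict)     # 4. expensive sensed features
    success_rate: float = 0.0       # 5. Q(C), learned online, in [0, 1]
    eligibility: float = 0.0        # 6. e(C), trace of recent use
    prior_confidence: float = 0.5   # 7. expert-assigned confidence in [0, 1]
    behavior: Optional[object] = None  # 8. the actual plan (conditions + snippet)

def retrieve(case_base: List[Case], goal: str, situation: str,
             epsilon: float = 0.3) -> Case:
    """E-greedy retrieval: with probability epsilon explore a random applicable
    case, otherwise exploit the applicable case with the best success rate."""
    candidates = [c for c in case_base
                  if c.goal == goal and c.situation == situation]
    if not candidates:
        raise LookupError("no case matches the assessed goal and situation")
    if random.random() < epsilon:                          # exploration
        return random.choice(candidates)
    return max(candidates, key=lambda c: c.success_rate)   # exploitation
```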
The adjusted plan is retained in the case base by the retainer module, and the cycle starts over. The selected behavior is passed to the adaptation module to adapt the plan to the current situation; in particular, adaptation means removing unnecessary actions from the plan and then adding satisfaction actions. The execution module executes plan actions whenever their preconditions are satisfied. To execute plans, the execution module searches for ready snippets, sends these ready snippets for execution (to the game), updates the status of the executing snippets as succeeded or failed, and finally updates the status of the executing actions of each snippet.

Figure 3: Intelligent Online Case-Based Planning Agent Model

IV. HYBRID OLCBP/RL ALGORITHM USING SARSA(λ)

We introduce our approach, which hybridizes online case-based planning and reinforcement learning, using the SARSA(λ) algorithm, in a novel algorithm (see Figure 5). To show how SARSA(λ) was customized, Table 1 maps the symbols of the original SARSA(λ) algorithm (see Figure 3) to the new symbols used in the novel algorithm.

Every time the agent retrieves a case to satisfy a certain goal, the agent is considered, in RL terms, to have moved to a new state, and it applies the algorithm, which goes through the following steps:

1) It increments the eligibility of the retrieved case:

e(Cr) = e(Cr) + 1

where Cr is the retrieved case.

2) It then updates the success rates of all cases in its case base: for each case C in the case base,

Q(C) = Q(C) + α δ e(C)

where α is the learning rate, e(C) is the eligibility of case C, and δ is the temporal-difference error, defined as

δ = R + r + γ Q(Cr) − Q(Cu)

where:
R: the global reward. Its value depends on the ratio between the player's power and the enemy's power. It is very important for the agent to be aware of the global consequences of its actions.
r: the case-specific reward due to the success or failure of the last used case Cu. It ranges between −1 and 1 and is computed by a heuristic that determines how effective the plan was:

r(c) = −1, if c failed
r(c) = r, where −1 < r < 1, if c succeeded

γ Q(Cr) − Q(Cu): the difference between the success rate of the retrieved case Cr (multiplied by the discount rate γ) and the success rate of the last used case Cu.

Note that the whole case base is updated, since different types of cases affect each other's performance; for example, a case achieving the "BuildArmy" goal will certainly affect the performance of the next used case, which achieves the "Attack" goal.

3) It retrieves all cases whose S (goal and situation) is similar to the S of the retrieved case Cr and stores the result in E.

4) It updates the eligibility of all cases in E:

e(C) = γ λ e(C)

where λ is the trace decay parameter, which controls the rate of decay of the eligibility traces of all cases. As λ increases, the cases preserve more of their eligibility and are thus affected more by subsequent rewards or punishments.

Note that only cases with a similar goal and situation have their eligibility updated, since cases with a similar goal and situation constitute a pool of states (in RL terms) that must take responsibility for the choice of the current case.
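A compact sketch of steps 1-4 is given below, reusing the illustrative `Case` structure sketched earlier. It is not the authors' code; the global reward R and the case-specific reward r are passed in, since the paper computes them heuristically from the game state.

```python
def revise(case_base, used_case, retrieved_case, global_reward, case_reward,
           alpha=0.1, gamma=0.5, lam=0.8):
    """Illustrative SARSA(lambda) revision after the last used case Cu finishes
    and a new case Cr is retrieved (steps 1-4 above; Case as sketched earlier)."""
    # Temporal-difference error: delta = R + r + gamma * Q(Cr) - Q(Cu)
    delta = (global_reward + case_reward
             + gamma * retrieved_case.success_rate - used_case.success_rate)

    # Step 1: increment the eligibility of the retrieved case Cr.
    retrieved_case.eligibility += 1.0

    # Step 2: update the success rate Q(C) of every case in the case base.
    for case in case_base:
        case.success_rate += alpha * delta * case.eligibility

    # Steps 3-4: decay the eligibility of the pool E of cases that share
    # the retrieved case's goal and situation.
    pool = [c for c in case_base
            if c.goal == retrieved_case.goal
            and c.situation == retrieved_case.situation]
    for case in pool:
        case.eligibility *= gamma * lam
```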
Table 1: Mapping of the original SARSA(λ) symbols and meanings to the symbols and meanings of the proposed algorithm

Symbol | General meaning | New symbol | Customized meaning
s | State | S | Situation and goal
a | Action | P | Plan (case snippet)
(s,a) | State-action pair | (S,P) or C | Case
Q(s,a) | Value of state-action pair | Q(S,P) or Q(C) | Success rate of the case
r | Reward | R | General reward
α | Learning rate parameter | α | Learning rate parameter
δ | Temporal-difference error | δ | Temporal-difference error
e(s,a) | Eligibility trace for state-action pair | e(S,P) or e(C) | Eligibility trace for the case
γ | Discount rate | γ | Discount rate
λ | Trace decay parameter | λ | Trace decay parameter
- | - | r | Goal-specific reward

Note that, in online case-based planning, multiple last used cases may have been executed in parallel; in this condition the total temporal-difference error relative to all last used cases is

δ = R + ∑ᵢ rᵢ + γ Q(Cr) − Q(Cᵢ)

where i ranges over the n last used cases.

Figure 5: Online learning algorithm used by the intelligent agent to evaluate (revise) cases:
    Observe whether the last used case Cu failed or succeeded
    Compute R, r
    Retrieve case Cr via the retrieval policy (E-greedy)
    δ = R + r + γ Q(Cr) − Q(Cu)
    e(Cr) = e(Cr) + 1
    For each case C in the case base: Q(C) = Q(C) + α δ e(C)
    Retrieve the set of cases E
    For each case C in E: e(C) = γ λ e(C)

V. EXPERIMENT AND RESULTS

To make clear the significance of extending online case-based planning with online learning through reinforcement learning, consider the simple case in which the case acquisition module (offline learning from human traces) has just learnt the four cases in Table 2, with their success rates initialized to 0. It is known in the game of Wargus that attacking a towers defense with heavy units, such as ballistae and knights, is more effective than attacking with light units, such as footmen and archers. This means that it is highly preferable to use case BuildArmy2 instead of case BuildArmy1, and then Attack2, which will cause the agent to destroy more of the enemy's units and thus approach winning the game.

The experiment consists of tracing the agent's evaluation of the cases (after achieving the goals "Build Army" and then "Attack", in order) over 40 successive evaluations. The learning rate was set to 0.1, the discount rate to 0.5, the trace decay rate to 0.8, and the exploration rate to 0.1. The success rates and eligibilities of all cases were initialized to zero. Table 3 shows the ranges of the rewards gained after executing each of the four cases. Note that BuildArmy1 and BuildArmy2 are rewarded similarly, whereas the rewards of Attack1 and Attack2 differ greatly due to their different results.

In order to win, the agent has to fulfil the two goals "Build Army" and "Attack", in order, by choosing one case for each goal respectively. Note that cases BuildArmy1 and BuildArmy2 share identical game states, although they contain different plans for achieving the same goal "Build Army".
On the other hand, cases Attack1 and Attack2 achieve the same goal but with different plans and different game states, namely the game states reached after executing BuildArmy1 and BuildArmy2 respectively. Using BuildArmy1 forces the agent to use Attack1, since BuildArmy1 trains the army needed by Attack1; similarly, using BuildArmy2 forces the agent to use Attack2. Figure 6 compares the success rates of the two cases BuildArmy1 and BuildArmy2 along with their eligibility traces, where E1 denotes the eligibility of BuildArmy1 and E2 the eligibility of BuildArmy2. Similarly, Figure 7 compares Attack1 and Attack2.

Table 2: Cases evaluated using the SARSA(λ) algorithm in the experiment

BuildArmy1
  Goal: Build Army
  State: enemy has a towers defense (identical to BuildArmy2)
  Plan: train 15 grunts; train 5 archers
  Success rate: 0

BuildArmy2
  Goal: Build Army
  State: enemy has a towers defense (identical to BuildArmy1)
  Plan: train 2 catapults; train 6 knights
  Success rate: 0

Attack1
  Goal: Attack
  State: enemy has a towers defense; agent has 15 grunts and 5 archers
  Plan: attack the towers defense with 15 grunts and 5 archers
  Success rate: 0

Attack2
  Goal: Attack
  State: enemy has a towers defense; agent has 2 catapults and 6 knights
  Plan: attack the towers defense with 2 catapults and 6 knights
  Success rate: 0

Table 3: Ranges of the rewards gained on successfully executing the cases

Case | Case-specific reward | Global reward
BuildArmy1 | 0 to 0.2 | 0.2 to 0.3
BuildArmy2 | 0 to 0.2 | 0.2 to 0.3
Attack1 | 0 to 0.2 | -0.8 to -0.6
Attack2 | 0.1 to 0.2 | 0.4 to 0.6

Figure 6: Tracing the success rates and eligibility values (E1, E2) of BuildArmy1 and BuildArmy2 during 40 evaluations (y-axis: success rate/eligibility value; x-axis: number of evaluations)

Figure 7: Tracing the success rates and eligibility values (E1, E2) of Attack1 and Attack2 during 40 evaluations (y-axis: success rate/eligibility value; x-axis: number of evaluations)

In order to win, the agent has to fulfil the two goals "Build Army" and "Attack" in order. Since the two cases of the "Build Army" goal share the same state and their success rates are initially equal, one of the two cases is chosen at random. Assume that both cases will be executed successfully. If BuildArmy1 is chosen, the agent retrieves Attack1 as the most suitable case for execution (since the agent has 15 grunts and 5 archers). The low success rate of case Attack1 then affects the revision (evaluation) of the last used case BuildArmy1, causing its success rate to become 0.4 instead of 0.5. If BuildArmy2 is chosen, the agent retrieves Attack2 as the most suitable case for execution (since the agent has 2 catapults and 6 knights rather than 15 grunts and 5 archers). The choice of this case leads to the choice of a better case, with a success rate of 0.8, which affects the revision (evaluation) of the last used case BuildArmy2, causing its success rate to increase to 0.6 instead of 0.5. As the agent plays in the same game or in multiple successive games, it learns that using BuildArmy2 is definitely better than using BuildArmy1, although the two cases appeared identical to the agent when they were first learnt during the offline learning process.
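To make the dynamics behind Figures 6 and 7 easier to follow, here is a small, self-contained simulation of our own, not the actual Wargus runs: rewards are drawn uniformly from the ranges of Table 3, the Build case is chosen E-greedily, and the matching Attack case is forced by the army that was built.

```python
import random

ALPHA, GAMMA, LAM, EPSILON = 0.1, 0.5, 0.8, 0.1
Q = {"BuildArmy1": 0.0, "BuildArmy2": 0.0, "Attack1": 0.0, "Attack2": 0.0}
e = {name: 0.0 for name in Q}
REWARDS = {  # (case-specific low/high, global low/high), taken from Table 3
    "BuildArmy1": (0.0, 0.2, 0.2, 0.3), "BuildArmy2": (0.0, 0.2, 0.2, 0.3),
    "Attack1": (0.0, 0.2, -0.8, -0.6), "Attack2": (0.1, 0.2, 0.4, 0.6),
}

def revise(used, retrieved):
    """One SARSA(lambda) revision: case `used` has finished, `retrieved` was just chosen."""
    r_lo, r_hi, R_lo, R_hi = REWARDS[used]
    delta = (random.uniform(R_lo, R_hi) + random.uniform(r_lo, r_hi)
             + GAMMA * Q[retrieved] - Q[used])
    e[retrieved] += 1.0
    for name in Q:
        Q[name] += ALPHA * delta * e[name]
    pool = ("BuildArmy1", "BuildArmy2") if retrieved.startswith("Build") else ("Attack1", "Attack2")
    for name in pool:                       # decay the traces of the competing pool
        e[name] *= GAMMA * LAM

previous = None
for step in range(80):                      # 40 Build/Attack episodes = 80 retrievals
    if step % 2 == 0:                       # E-greedy choice between the two Build cases
        current = (random.choice(("BuildArmy1", "BuildArmy2"))
                   if random.random() < EPSILON
                   else max(("BuildArmy1", "BuildArmy2"), key=Q.get))
    else:                                   # the built army forces the matching Attack case
        current = "Attack1" if previous == "BuildArmy1" else "Attack2"
    if previous is not None:
        revise(previous, current)
    previous = current

# BuildArmy2 and Attack2 should end up with clearly higher success rates.
print({name: round(value, 2) for name, value in Q.items()})
```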
The result of an experiment in which the algorithm of Figure 5 was applied 10 successive times (possibly spread over one or multiple game episodes) is summarized below, with the following parameter settings:

Learning rate = 0.1.
Decay rate = 0.8. It is set to a moderate value so that the last used cases retain a moderate share of the responsibility for the choice of the currently retrieved case.
Exploration rate = 0.1. It is set low because, due to the small number of available cases (four), any exploration will probably lead to the choice of the worst case, and choosing the worst case would have an undesirable negative effect on the cases with high success rates.
Discount rate = 0.5. It is set to a moderate value to maintain a moderate degree of bootstrapping.

After applying the algorithm 10 successive times, BuildArmy1 gains a low success rate compared with BuildArmy2. This shows that the agent has learned that, in this situation (the existence of a towers defense), building a smaller heavy army is preferable to building a larger light army.

VI. CONCLUSION AND FUTURE WORK

In this paper, online case-based planning was hybridized with reinforcement learning. This is the first attempt to do so with the aim of introducing an intelligent agent capable of planning and learning online using temporal difference with eligibility traces, i.e. the SARSA(λ) algorithm. Learning online biases the agent's decisions towards selecting more efficient, effective and successful plans, and it saves the agent the time it would otherwise spend retrieving inefficient, failed plans. As a result, the agent takes its history into account when acting in the environment (i.e. playing a real-time strategy game). In the future, we plan to develop a strategy/case-base visualization tool capable of visualizing the agent's preferred playing strategy according to its learning history. This will help in tracking the learning curve of the agent. Once the agent's learning curve can be tracked, we will be able to apply other learning algorithms and find out which one is the most suitable and effective.

REFERENCES

[1] Buro, M. 2003. Real-time strategy games: A new AI research challenge. In IJCAI 2003, 1534-1535. Morgan Kaufmann.
[2] Aamodt, A., and Plaza, E. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7(1):39-59.
[3] Ontañón, S., Mishra, K., Sugandh, N., and Ram, A. 2010. On-line case-based planning. Computational Intelligence 26(1):84-119.
[4] Mishra, K., Ontañón, S., and Ram, A. 2008. Situation assessment for plan retrieval in real-time strategy games. In ECCBR 2008.
[5] Ontañón, S., Mishra, K., Sugandh, N., and Ram, A. 2008. Learning from demonstration and case-based planning for real-time strategy games. In Soft Computing Applications in Industry, 293-310.
[6] Sugandh, N., Ontañón, S., and Ram, A. 2008. On-line case-based plan adaptation for real-time strategy games. In AAAI 2008.
[7] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2005.