Intelligent Online Case Based Planning Agent Model for Real-Time Strategy Games

Omar Enayet (1), Abdelrahman Ogail (2), and Ibrahim Moawad (3)
Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt
(1,2) first.last@hotmail.com, (3) first_last@hotmail.com

Abstract – Research in learning and planning for real-time strategy (RTS) games is of interest to several industries, such as the military industry, robotics, and, most importantly, the game industry. Recent work on online case-based planning in RTS games lacks the capability of online learning from experience. In this paper we present a hybrid architecture that uses both online case-based planning and reinforcement learning techniques. Our architecture reuses related existing work on online case-based planning and extends it with online reinforcement learning. We introduce a model of the I-Strategizer (Intelligent Strategizer) agent, which implements this architecture in order to play Wargus (an open-source clone of the well-known real-time strategy game Warcraft 2). We present an empirical evaluation of the performance of I-Strategizer and show that it successfully approaches human player behavior.

I. INTRODUCTION

A. Real-Time Strategy Games

RTS games constitute well-defined environments in which to conduct experiments, and they offer straightforward, objective ways of measuring performance. In addition, strong game AI will likely make a difference in future commercial games, because graphics improvements are beginning to saturate. RTS game AI is also of interest to the military, which uses battle simulations in training programs [1]. RTS games offer challenging opportunities for research in adversarial planning under uncertainty, learning and opponent modeling, and spatial and temporal reasoning. They feature hundreds or even thousands of interacting objects, imperfect information, and fast-paced micro-actions [1].

B. Case-Based Reasoning

Case-Based Reasoning (CBR) is a machine learning technique in which problems and their solutions are stored in a knowledge base as cases. These cases may be retrieved later should a similar problem arise, as the solution to this problem will already be in the knowledge base, called the case base. Aamodt and Plaza [2] define CBR as: "To solve a new problem by remembering a previous similar situation and by reusing information and knowledge of that situation." CBR may be described as four processes [2]:
1) RETRIEVE the case or cases from the knowledge base that are most similar to the current problem.
2) REUSE the information from the retrieved cases to solve the current problem. If no exact match is found, the solution to the new problem must be adapted from one or more cases.
3) REVISE the proposed solution if it failed.
4) RETAIN the experience of solving this case in the knowledge base.

Figure 1: The Case-Based Reasoning cycle
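To make the four processes concrete, the following is a minimal sketch of one pass through the retrieve-reuse-revise-retain loop. It is illustrative only: the `Case` structure, the `distance` measure, and the `evaluate`/`repair` stubs are our own assumptions, not part of any CBR library or of I-Strategizer itself.

```python
from dataclasses import dataclass

@dataclass
class Case:
    problem: dict        # feature vector describing the problem
    solution: str        # stored solution
    successful: bool = True

def distance(a: dict, b: dict) -> float:
    """Euclidean distance over the features the two problems share."""
    keys = set(a) & set(b)
    return sum((a[k] - b[k]) ** 2 for k in keys) ** 0.5

def evaluate(case: Case) -> bool:
    return True          # placeholder: a real system tests the solution in the environment

def repair(solution: str) -> str:
    return solution      # placeholder: a real system adapts or fixes the failed solution

def cbr_cycle(case_base: list, new_problem: dict) -> Case:
    # RETRIEVE: the stored case most similar to the current problem
    retrieved = min(case_base, key=lambda c: distance(c.problem, new_problem))
    # REUSE: adapt the retrieved solution to the new problem (identity adaptation here)
    proposed = Case(problem=new_problem, solution=retrieved.solution)
    # REVISE: check the proposed solution and repair it if it failed
    proposed.successful = evaluate(proposed)
    if not proposed.successful:
        proposed.solution = repair(proposed.solution)
    # RETAIN: store the new experience for future problems
    case_base.append(proposed)
    return proposed
```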
C. Case-Based Planning [3]

The CBR cycle, shown in Figure 1, makes two assumptions that are not suited to strategic real-time domains involving on-line planning. The first assumption is that problem solving is modeled as a single-shot process, i.e. a "single loop" of the CBR cycle solves a problem. In case-based planning, solving a problem might involve solving several subproblems and monitoring their execution (potentially having to solve new problems along the way). The second assumption is that execution and problem solving are decoupled, i.e. the CBR cycle produces a solution, but execution of that solution is delegated to some external module. In strategic real-time domains, executing a solution is part of solving the problem, especially when the internal model of the world is not completely accurate, and ensuring that the execution of the solution succeeds is an important part of problem solving. For instance, while executing a solution the system might discover low-level details about the world that render the proposed solution wrong, so another solution has to be proposed. Figure 2 presents an extension of the CBR cycle, called the OLCBP (On-Line Case-Based Planning) cycle, with two added processes needed to deal with planning and execution of solutions in real-time domains, along with some other small variations.

Figure 2: The on-line case-based planning cycle

D. Temporal Difference Learning

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning [8].

E. Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. Almost any temporal-difference (TD) method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. From the theoretical view, they are considered a bridge from TD to Monte Carlo methods. From the mechanistic view, an eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. The trace marks the memory parameters associated with the event as eligible for undergoing learning changes. When a TD error occurs, only the eligible states or actions are assigned credit or blame for the error. Like TD methods themselves, eligibility traces are a basic mechanism for temporal credit assignment [8].

F. Sarsa(λ) Learning

Sarsa(λ) combines the temporal-difference learning technique one-step Sarsa with eligibility traces to learn state-action pair values Qt(s,a) effectively [8]. It is an on-policy algorithm in that it approximates the state-action values of the current policy and then gradually improves that policy based on the approximate values [8] (see Figure 3 for the algorithm listing).

Figure 3: The Sarsa(λ) algorithm
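The following is a minimal tabular Sarsa(λ) sketch with accumulating eligibility traces, following the standard formulation in [8]. The environment interface (`env.reset`, `env.step`, `env.actions`) is an assumption made purely for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_lambda(env, episodes=100, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces (cf. Figure 3)."""
    Q = defaultdict(float)                    # Q[(s, a)] -> estimated value
    for _ in range(episodes):
        e = defaultdict(float)                # eligibility traces, reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, env.actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, env.actions, epsilon)
            delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]   # TD error
            e[(s, a)] += 1.0                              # mark the visited pair as eligible
            for key in list(e):
                Q[key] += alpha * delta * e[key]          # credit/blame eligible pairs
                e[key] *= gamma * lam                     # decay every trace
            s, a = s2, a2
    return Q
```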
In this paper we introduce an approach that hybridizes online case-based planning with reinforcement learning. Section II discusses related work, Section III introduces the architecture of the agent, Section IV describes the algorithm used in the hybridization, Section V presents testing and results, and Section VI gives the conclusion and future work.

II. RELATED WORK

To our knowledge, case-based planning has been applied to computer games only in the Darmok system of Santiago Ontañón et al. [3], an online planning system that interleaves expansion of the current plan with execution of the actions of that plan that are ready. A plan is selected by a retrieval algorithm after assessing the situation using a novel situation assessment model [5]. The system learns plans in an offline stage from human demonstration [6]. Finally, before any plan is executed, delayed adaptation is applied to make it applicable in the current state of the environment [7]. Our system is inspired by Darmok, with the addition of online learning: Darmok does not include any explicit mechanism for online learning over its plans; it only evaluates a plan through an output parameter computed heuristically. We point out that we examine only online case-based planning with reinforcement learning. This is different from related work on case-based reasoning with reinforcement learning [4].

III. ARCHITECTURE

A novel AI agent capable of online planning and online learning is introduced. The agent goes through two main phases: an offline learning phase and an online phase that combines planning with learning. In the offline phase, the agent learns plans by observing gameplay and strategies between two opponents (human or computer) and then deduces and forms several cases. The agent receives raw traces of the game played by the human. Using goal matrix generation, these traces are converted into a raw plan. A dependency graph is then constructed for each raw plan, which helps the system determine 1) the dependencies between the raw plan steps and 2) which steps are suitable to parallelize. Finally, a hierarchical composition is performed for each raw plan in order to shrink its size and substitute a group of related steps with one step (of type goal). All the learnt cases are retained in the case base, and the offline stage is then complete.

A case consists of:
1. Goal: the goal that the case satisfies.
2. Situation: the situation in which the case is applicable.
3. Shallow features: a set of features sensed from the environment that are computationally cheap.
4. Deep features: a set of features sensed from the environment that are computationally expensive.
5. Success rate: a decimal number between zero and one indicating the success rate of the case.
6. Eligibility trace: a number representing the frequency of recent use of the case.
7. Prior confidence: a decimal number between zero and one, set by an expert, indicating the confidence of success when using the case.
8. Behavior: contains the actual plan to be used.
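As a concrete illustration of the case structure above (and of the behavior structure described below), here is a plain data-structure sketch. Field names and types are our own illustrative choices, not taken from the actual I-Strategizer implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Behavior:
    preconditions: list       # must be satisfied before the snippet starts
    alive_conditions: list    # must hold while the snippet is executing
    success_conditions: list  # must be satisfied after the snippet has executed
    snippet: list             # serial plan steps, each a set of parallel actions/goals

@dataclass
class Case:
    goal: str                      # 1. goal the case satisfies
    situation: str                 # 2. situation where the case is applicable
    shallow_features: dict = field(default_factory=dict)  # 3. cheap sensed features
    deep_features: dict = field(default_factory=dict)     # 4. expensive sensed features
    success_rate: float = 0.0      # 5. Q(C), in [0, 1]
    eligibility: float = 0.0       # 6. e(C), frequency of recent use
    prior_confidence: float = 0.5  # 7. expert-assigned confidence, in [0, 1]
    behavior: Behavior = None      # 8. the actual plan to be used
```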
After the agent learns the cases that serve as the basic ingredients for playing the game, it is ready to play against different opponents. We call the phase in which the agent plays against opponents the online planning and learning phase. The planning comes from the expansion and execution of the current plan, whereas the learning comes from the revision of the applied plans. A deeper look at this phase follows (the retrieval computation is also sketched in code after this walkthrough):

- WinWargus is the initial goal for the agent.
- The Expansion module searches for ready open goals.
  o Ready means that all snippets before that goal were executed successfully.
  o Open means that the goal has no assigned behavior.
- A behavior consists of:
  o Preconditions: conditions that must be satisfied before the plan can execute.
  o Alive conditions: conditions that must be satisfied while the plan is being executed.
  o Success conditions: conditions that must be satisfied after the plan has been executed.
  o Snippet: a set of plan steps executed serially, where each plan step consists of a set of parallel actions/goals. This snippet forms the plan.
- The Retrieval module selects the most efficient plan using:
  o Situation assessment, to get the most suitable set of plans. The Situation Assessment module is built by:
    - Capturing the most representative shallow features of the environment.
    - Building a Situation Model that maps a set of shallow features to a situation.
    - Building a Situation-Case Model, to classify each case under a specific situation.
    - Building a Situation-Deep Feature Model, to provide the set of deep features that are important for the predicted situation.
  o ε-greedy selection, to determine whether to explore or exploit. The exploration parameter is set to 30% in our experiments:

    SP(RC, E) = Explore(RC) with probability P(E), or Exploit(RC) with probability 1 − P(E),

    where RC is the set of relevant cases.
  o In exploitation, the predicted performance of the cases is computed and the best case is selected, through the equations:

    Case Predicted Performance(C) = (1 + Case Similarity(C) × C.ObservedPerformance) / (2 + Case Similarity(C) + λ(C))

    Case Similarity(C) = α × GS(C.G) + (1 − α) × SS(C.S)

    λ(C) = C.Confidence × C.Performance

    where goal similarity (GS) and state similarity (SS) are computed using Euclidean distance.
- The selected behavior is passed to the Plan Adaptation module to adapt the plan to the current situation.
  o The adaptation removes unnecessary actions from the plan and then adds satisfaction actions.
- The Expansion module expands the current plan with the selected behavior.
- The Execution module starts to execute plan actions whenever their preconditions are satisfied.
  o To execute plans, the Execution module:
    - Searches for ready snippets and then sends them for execution (to the game).
    - Updates the current status of executing snippets, whether succeeded or failed.
    - Updates the status of the executing actions of each snippet.
- After a complete snippet execution, the Reviser starts its mission. The importance of revision originates from the fact that interactive, intelligent agents must get feedback from the environment in order to improve their performance. The learnt plans were based on specific situations that might not always be suitable; moreover, the human demonstrator could have been playing with insufficiently good plans and strategies. The Reviser adjusts the case performance according to temporal-difference learning with the SARSA(λ) on-policy learning algorithm. Finally, the adjusted plan is retained in the case base and the cycle starts over (see Figure 4).

Figure 4: I-Strategizer architecture
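The sketch below shows one way the retrieval step could be computed, under our reading of the (extraction-damaged) formulas above. The helper names (`euclidean`, `goal_features`, `state_features`, assumed to be numeric encodings) and the use of the case's success rate as its observed performance are assumptions for illustration, not the actual I-Strategizer code.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def case_similarity(case, goal, state, alpha=0.5):
    """CaseSimilarity(C) = alpha * GS(C.G) + (1 - alpha) * SS(C.S);
    GS/SS turn Euclidean distances into similarities in (0, 1]."""
    gs = 1.0 / (1.0 + euclidean(case.goal_features, goal))
    ss = 1.0 / (1.0 + euclidean(case.state_features, state))
    return alpha * gs + (1 - alpha) * ss

def predicted_performance(case, goal, state):
    """Predicted performance under our reading of the Section III formula."""
    sim = case_similarity(case, goal, state)
    lam = case.prior_confidence * case.success_rate        # lambda(C)
    return (1 + sim * case.success_rate) / (2 + sim + lam)

def select_case(relevant_cases, goal, state, explore_p=0.3):
    """Epsilon-greedy retrieval: explore with probability P(E), otherwise exploit."""
    if random.random() < explore_p:
        return random.choice(relevant_cases)               # Explore(RC)
    return max(relevant_cases,
               key=lambda c: predicted_performance(c, goal, state))  # Exploit(RC)
```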
IV. I-STRATEGIZER CUSTOMIZED ALGORITHM USING SARSA(λ)

We introduce our approach, which hybridizes online case-based planning and reinforcement learning (using the SARSA(λ) algorithm) in a novel algorithm (see Figure 5). To show how SARSA(λ) was customized, a table that maps the old symbols of the original SARSA(λ) algorithm to the new symbols used in the novel algorithm was constructed (see Table 1).

Table 1: Mapping of symbols and their meanings
Symbol | General meaning | New symbol | Customized meaning
s | State | S | State and goal
a | Action | P | Plan (case snippet)
(s,a) | State-action pair | (S,P) or C | Case
Q(s,a) | Value of state-action pair | Q(S,P) or Q(C) | Success rate of case
r | Reward | R | General reward
α | Learning rate parameter | α | Learning rate parameter
δ | Temporal-difference error | δ | Temporal-difference error
e(s,a) | Eligibility trace for state-action pair | e(S,P) or e(C) | Eligibility trace for case
γ | Discount rate | γ | Discount rate
λ | Trace decay parameter | λ | Trace decay parameter
- | - | r | Goal-specific reward

Every time the agent retrieves a case to satisfy a certain goal, it goes through the following steps:

1) It increments the eligibility of the retrieved case:

   e(Cr) = e(Cr) + 1

   where Cr is the retrieved case.

2) It then updates the success rates of all cases in its case base:

   for each case C in the case base: Q(C) = Q(C) + α δ e(C)

   where α is the learning rate, e(C) is the eligibility of case C, and δ is the temporal-difference error:

   δ = R + r + γ Q(Cr) − Q(Cu)

   where:
   - R, the global reward: its value is equal to the ratio between the player's power and the enemy's power. It resembles batch learning.
   - r, the case-specific reward: a reward or punishment due to the success or failure of the last used case. It ranges between −1 and 1 and is computed with a heuristic that determines how effective the plan was.
   - γ Q(Cr) − Q(Cu): the difference in success rate between the retrieved case Cr (multiplied by the discount rate γ) and the last used case Cu.

   Notice that in online case-based planning there can be multiple last used cases executed in parallel; in this condition the total temporal-difference error relative to all last used cases is:

   δ = R + Σ(i=1..n) ri + γ Q(Cr) − Q(Ci)

   where n is the number of last used cases.

3) It retrieves all cases whose S (goal and state) is similar to the S of the retrieved case Cr and stores the result in E. Only the cases in E have their eligibility traces updated.

4) It updates the eligibility of all cases in E:

   e(C) = γ λ e(C)

   where λ is the trace decay parameter, which controls the rate of decay of the eligibility traces of all cases. As λ increases, the cases preserve more of their eligibility and thus are affected more by any rewards or punishments.

The complete online learning step is listed in Figure 5:

   Observe the failed or succeeded case Cu
   Compute R, r
   Retrieve case Cr via the retrieval policy (ε-greedy)
   δ = R + r + γ Q(Cr) − Q(Cu)
   e(Cr) = e(Cr) + 1
   For each case C in the case base:
       Q(C) = Q(C) + α δ e(C)
   Retrieve the set of cases E
   For each case C in E:
       e(C) = γ λ e(C)

Figure 5: Online learning algorithm for I-Strategizer
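The following is a minimal sketch of the Figure 5 step as we read it. The reward callbacks, the `retrieve` policy, and the `similar_cases` helper stand in for the corresponding I-Strategizer modules and are assumptions made for illustration; the case fields match the earlier data-structure sketch.

```python
def online_learning_step(case_base, last_used, compute_global_reward,
                         compute_case_reward, retrieve, similar_cases,
                         alpha=0.1, gamma=0.5, lam=0.4):
    """One pass of the Figure 5 update (our reading of it)."""
    R = compute_global_reward()            # ratio of the player's power to the enemy's power
    r = compute_case_reward(last_used)     # in [-1, 1], success/failure of the last used case Cu
    retrieved = retrieve(case_base)        # case Cr chosen by the epsilon-greedy retrieval policy
    # temporal-difference error: delta = R + r + gamma*Q(Cr) - Q(Cu)
    delta = R + r + gamma * retrieved.success_rate - last_used.success_rate
    retrieved.eligibility += 1.0           # e(Cr) = e(Cr) + 1
    for case in case_base:                 # Q(C) = Q(C) + alpha * delta * e(C)
        case.success_rate += alpha * delta * case.eligibility
    for case in similar_cases(case_base, retrieved):   # set E: similar goal and state
        case.eligibility *= gamma * lam                # e(C) = gamma * lambda * e(C)
    return retrieved
```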
V. TESTING AND RESULTS

To make the significance of embedding reinforcement learning into online case-based planning clear, consider the simple situation in which there exist 2 cases containing 2 similar plans (snippets) for a certain goal, with similar game states. Consider the 2 cases in Table 2:

Table 2: Two cases for the "Build Army" goal
  | Case1 | Case2
Goal | Build Army | Build Army
State | Enemy has a tower defense (identical to Case2) | Enemy has a tower defense (identical to Case1)
Plan | Train 15 grunts; train 5 archers | Train 2 catapults; train 6 knights
Success rate | 0.5 | 0.5

Now consider another 2 cases containing 2 different plans for attacking (Table 3):

Table 3: Two cases for the "Attack" goal
  | Case3 | Case4
Goal | Attack | Attack
State | 15 grunts and 5 archers exist | 2 catapults and 6 knights exist
Plan | Attack the tower defense with 15 grunts and 5 archers | Attack the tower defense with 2 catapults and 6 knights
Success rate | 0.2 | 0.8

In order to win, the agent has to fulfill the 2 goals "Build Army" and "Attack", in that order. Since the 2 cases for the "Build Army" goal share the same state and their success rates are currently equal, either of the 2 cases is chosen at random. Assume that the 2 cases will be executed successfully.

If Case1 is chosen, the agent retrieves Case3 as the most suitable case for execution (as it does not have 2 catapults and 6 knights). The low success rate of Case3 will affect the revision (or evaluation) of the last used case Case1, causing its success rate to drop to 0.4 instead of 0.5. If Case2 is chosen, the agent retrieves Case4 as the most suitable case for execution (as it does not have 15 grunts and 5 archers). This choice leads to a better case with a success rate of 0.8, which affects the revision (or evaluation) of the last used case Case2, causing its success rate to increase to 0.6 instead of 0.5.

As the agent plays in the same game or in multiple successive games, it will learn that using Case2 is definitely better than using Case1, although the two cases seemed identical to the agent when they were first learned during the offline learning process.

The table below shows the result of applying the algorithm of Figure 5 ten times (in one or in multiple game episodes), where:
- Learning rate = 0.1.
- Trace decay rate = 0.4. It is set to a moderate value to maintain moderate responsibility of the last used cases for the choice of the currently retrieved case.
- Exploration rate = 0.1. It is set low because, due to the small number of available cases (4 cases), any exploration will probably lead to the choice of the worst case, and choosing the worst case would have an undesirable negative effect on cases with high success rates.
- Discount rate = 0.5. It is set to a moderate value to maintain moderate bootstrapping.

The column "CH" stands for the chosen cases. Q1, Q2, Q3, and Q4 stand for the success rate values of the 4 cases; E1, E2, E3, and E4 stand for their eligibility traces. RB, rB, and δB stand for the global reward, case-specific reward, and temporal-difference error of the case chosen for the goal "Build Army"; similarly, RA, rA, and δA are the same for the goal "Attack".

CH | Q1 | E1 | Q2 | E2 | RB | rB | δB | Q3 | E3 | Q4 | E4 | RA | rA | δA
- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
C1–C3 | -0.02 | 0.2 | 0 | 0 | 0.11 | 0.56 | 0.67 | 0.04 | 0.2 | 0 | 0 | -0.8 | -0.8 | -1.6
C2–C4 | -0.02 | 0.04 | 0.02 | 0.2 | 0.02 | 0.51 | 0.53 | 0.05 | 0.2 | 0.06 | 0.20 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0.01 | 0.04 | 0.24 | 0.06 | 0.57 | 0.64 | 0.07 | 0.2 | 0.14 | 0.24 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.07 | 0.25 | 0.08 | 0.52 | 0.63 | 0.09 | 0.2 | 0.23 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.09 | 0.25 | 0.06 | 0.57 | 0.68 | 0.11 | 0.2 | 0.32 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.12 | 0.25 | 0.11 | 0.50 | 0.67 | 0.13 | 0.2 | 0.41 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.16 | 0.25 | 0.05 | 0.59 | 0.72 | 0.15 | 0.2 | 0.51 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.19 | 0.25 | 0.11 | 0.51 | 0.71 | 0.17 | 0.2 | 0.61 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.23 | 0.25 | 0.04 | 0.57 | 0.71 | 0.2 | 0.2 | 0.71 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.27 | 0.25 | 0.07 | 0.51 | 0.69 | 0.22 | 0.2 | 0.81 | 0.25 | 0.1 | 0.2 | 0.3

After applying the algorithm 10 successive times, C1 ends up with a low success rate compared with C2. This shows that the agent has learned that, in that situation (the existence of a tower defense), building a smaller heavy army is preferable to building a larger light army.
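As a rough illustration of the dynamics in this experiment, the short self-contained driver below replays the Build Army / Attack scenario with the Figure 5 update. The rewards, the random build choice, and the uniform trace decay are stand-in simplifications, so the numbers will not match the table above; only the trend (Case2 overtaking Case1) is meant to carry over.

```python
import random

# Stand-in replay of the Build Army / Attack scenario; values are illustrative only.
alpha, gamma, lam = 0.1, 0.5, 0.4
Q = {"C1": 0.5, "C2": 0.5, "C3": 0.2, "C4": 0.8}   # initial success rates
e = dict.fromkeys(Q, 0.0)                          # eligibility traces

for _ in range(10):
    build = random.choice(["C1", "C2"])            # equal success rates: pick either build case
    attack = "C3" if build == "C1" else "C4"       # the trained army dictates the attack case
    e[build] += 1.0                                # the build case was retrieved and executed
    r = 0.2 if attack == "C4" else -0.8            # heavy army succeeds, light army fails (stand-in)
    delta = 0.1 + r + gamma * Q[attack] - Q[build] # delta = R + r + gamma*Q(Cr) - Q(Cu), R = 0.1
    e[attack] += 1.0
    for c in Q:                                    # credit every eligible case, then decay traces
        Q[c] += alpha * delta * e[c]
        e[c] *= gamma * lam

print(Q)   # Q["C2"] should now exceed Q["C1"]: C1 is punished whenever the light-army attack fails
```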
VI. CONCLUSION AND FUTURE WORK

In this paper, online case-based planning was hybridized with reinforcement learning. This was the first attempt to do so in order to introduce an intelligent agent capable of planning and learning online using temporal-difference learning with eligibility traces: the SARSA(λ) algorithm. Learning online biases the agent's decisions toward selecting more efficient, effective, and successful plans. It also saves the agent the time it would otherwise spend retrieving inefficient, failed plans. As a result, the agent takes history into account when acting in the environment (i.e. playing a real-time strategy game). Further, we plan to develop a strategy/case-base visualization tool capable of visualizing the agent's preferred playing strategy according to its learning history. This will help in tracking the learning curve of the agent. After tracking the agent's learning curve, we will be able to apply other learning algorithms and find out which one is the most suitable and effective.

REFERENCES

[1] Buro, M. 2003. Real-time strategy games: A new AI research challenge. In IJCAI 2003, 1534-1535. Morgan Kaufmann.
[2] Aamodt, A., and Plaza, E. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communications 7(1):39-59.
[3] Ontañón, S.; Mishra, K.; Sugandh, N.; and Ram, A. 2010. On-line case-based planning. Computational Intelligence 26(1):84-119.
[4] Sharma, M.; Holmes, M.; Santamaria, J.; Irani, A.; Isbell, C.; and Ram, A. 2007. Transfer learning in real-time strategy games using hybrid CBR/RL. In IJCAI 2007. Morgan Kaufmann.
[5] Mishra, K.; Ontañón, S.; and Ram, A. 2008. Situation assessment for plan retrieval in real-time strategy games. In ECCBR 2008.
[6] Ontañón, S.; Mishra, K.; Sugandh, N.; and Ram, A. 2008. Learning from demonstration and case-based planning for real-time strategy games. In Soft Computing Applications in Industry, 293-310. Springer.
[7] Sugandh, N.; Ontañón, S.; and Ram, A. 2008. On-line case-based plan adaptation for real-time strategy games. In AAAI 2008.
[8] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.