Intelligent Online Case Based Planning Agent Model for
Real-Time Strategy Games
Omar Enayet (1), Abdelrahman Ogail (2) and Ibrahim Moawad (3)
Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt
(1,2) first.last@hotmail.com, (3) first_last@hotmail.com
Abstract – Research in learning and planning in real-time strategy (RTS) games is of interest to several industries, such as the military industry, robotics and, most importantly, the game industry. Recent work on online case-based planning in RTS games lacks the capability of online learning from experience. In this document we present a hybrid architecture which uses both online case-based planning and reinforcement learning techniques. Our architecture reuses related existing work on online case-based planning and extends it with online reinforcement learning. A model of the I-Strategizer (Intelligent Strategizer) agent, which implements the latter architecture in order to play Wargus (an open-source clone of the well-known real-time strategy game Warcraft 2), is introduced. We present an empirical evaluation of the performance of I-Strategizer and show that it successfully approaches the human player's behavior.
I. INTRODUCTION
A. Real-Time Strategy Games
RTS games constitute well-defined environments in which to conduct experiments, and they offer straightforward, objective ways of measuring performance. Also, strong game AI will likely make a difference in future commercial games because graphics improvements are beginning to saturate. RTS game AI is also of interest to the military, which uses battle simulations in training programs [1].
RTS games offer challenging opportunities for research in adversarial planning under uncertainty, learning and opponent modeling, and spatial and temporal reasoning. RTS games feature hundreds or even thousands of interacting objects, imperfect information and fast-paced micro-actions [1].
B. Case-Based Reasoning
Case-Based Reasoning (CBR) is a machine learning technique in which problems and their solutions are stored in a knowledge base as cases. These cases may be retrieved later should a similar problem arise, as the solution to that problem will already be in the knowledge base, called the case base. Aamodt and Plaza [2] define CBR as: "To solve a new problem by remembering a previous similar situation and by reusing information and knowledge of that situation." CBR may be described as four processes [2] (a minimal code sketch of one pass through the cycle is given after Figure 1):
1) RETRIEVE the case or cases from the knowledge
base that are most similar to the current problem.
2) REUSE the information from the retrieved cases to
solve the current problem. If no exact match is
found, the solution to the new problem must be adapted from
one or more cases.
3) REVISE the proposed solution if it failed.
4) RETAIN the experience for solving this case in the
knowledge base.
Figure 1: The Case Based Reasoning Cycle
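To make the four processes concrete, here is a minimal sketch of one pass through the cycle in Python. It is only an illustration of the loop structure: the Case fields, the toy similarity measure and the adapt/execute callbacks are our own assumptions, not part of any cited system.

```python
from dataclasses import dataclass

@dataclass
class Case:
    problem: dict       # features describing the stored problem
    solution: object    # the stored solution
    succeeded: bool = True

def similarity(a: dict, b: dict) -> float:
    """Toy similarity: fraction of features on which the two problems agree."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)

def cbr_cycle(case_base: list, problem: dict, adapt, execute):
    # RETRIEVE the most similar stored case
    retrieved = max(case_base, key=lambda c: similarity(c.problem, problem))
    # REUSE its solution, adapting it to the new problem
    solution = adapt(retrieved.solution, problem)
    # REVISE: execute/evaluate the proposed solution (repair on failure is domain-specific)
    succeeded = execute(solution)
    # RETAIN the new experience in the case base
    case_base.append(Case(problem, solution, succeeded))
    return solution, succeeded
```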
C. Case-Based Planning [3]
The CBR cycle, shown in Figure 1, makes two
assumptions that are not suited for strategic real-time domains
involving on-line planning. The first assumption is that
problem solving is modeled as a single-shot process, i.e. a
“single loop” in the CBR cycle solves a problem. In Case-Based Planning, solving a problem might involve solving
several subproblems, and also monitoring their execution
(potentially having to solve new problems along the way). The
second assumption is that execution and problem solving are
decoupled, i.e. the CBR cycle produces a solution, but the
solution execution is delegated to some external module. In
strategic real-time domains, executing the solution is part of
solving the problem, especially when the internal model of the world is
not 100% accurate, and ensuring that the execution of the
solution succeeds is an important part of solving problems.
For instance, while executing a solution the system might
discover low level details about the world that render the
proposed solution wrong, and thus another solution has to be
proposed.
Figure 2 presents an extension of the CBR cycle, called the
OLCBP (On-Line Case-Based Planning) cycle, with two
added processes needed to deal with planning and execution of
solutions in real-time domains, and some other small
variations.
Figure 2: The on-line case-based planning cycle
Figure 3: The Sarsa(λ) Algorithm
D. Temporal Difference Learning
If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of
Monte Carlo ideas and dynamic programming (DP) ideas.
Like Monte Carlo methods, TD methods can learn directly
from raw experience without a model of the environment's
dynamics. Like DP, TD methods update estimates based in
part on other learned estimates, without waiting for a final
outcome (they bootstrap). The relationship between TD, DP,
and Monte Carlo methods is a recurring theme in the theory of
reinforcement learning [8].
E. Eligibility Traces
Eligibility traces are one of the basic mechanisms of
reinforcement learning. Almost any temporal-difference (TD)
method, such as Q-learning or Sarsa, can be combined with
eligibility traces to obtain a more general method that may
learn more efficiently. From the theoretical view, they are
considered a bridge from TD to Monte Carlo methods. From
the mechanistic view, an eligibility trace is a temporary record
of the occurrence of an event, such as the visiting of a state or
the taking of an action. The trace marks the memory
parameters associated with the event as eligible for
undergoing learning changes. When a TD error occurs, only
the eligible states or actions are assigned credit or blame for
the error. Like TD methods themselves, eligibility traces are a
basic mechanism for temporal credit assignment. [8]
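The mechanics are compact enough to show directly. The following generic, tabular sketch (our illustration; the constants are arbitrary) bumps the trace of the visited state-action pair, spreads each TD error over all eligible pairs, and lets the traces decay:

```python
from collections import defaultdict

Q = defaultdict(float)             # value estimates for state-action pairs
e = defaultdict(float)             # eligibility traces
alpha, gamma, lam = 0.1, 0.9, 0.8  # learning rate, discount rate, trace decay

def td_update(s, a, reward, s_next, a_next):
    """One bootstrapped TD update that assigns credit to all eligible pairs."""
    delta = reward + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    e[(s, a)] += 1.0                                          # mark the visited pair as eligible
    for sa in list(e):
        Q[sa] += alpha * delta * e[sa]   # credit/blame proportional to eligibility
        e[sa] *= gamma * lam             # traces fade between events
```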
F. Sarsa(λ) Learning
Sarsa(λ) combines the temporal-difference learning technique "one-step Sarsa" with eligibility traces to learn state-action pair values Qt(s,a) effectively [8]. It is an on-policy algorithm, in that it approximates the state-action values for the current policy and then gradually improves the policy based on those approximate values [8] (see Figure 3 for the algorithm listing).

In this paper we introduce an approach that hybridizes online case-based planning with reinforcement learning. Section II discusses related work, Section III introduces the architecture of the agent, Section IV describes the algorithm used in the hybridization, Section V presents testing and results, and Section VI concludes and outlines future work.
II. RELATED WORK
Case-based planning has been applied to computer games in the Darmok system [3] of Santiago Ontañón et al.: an online planning system that interleaves the expansion of the plan with the execution of the actions that are ready in that plan. Plans are selected by a retrieval algorithm after assessing the situation using a novel situation-assessment model [5]. The system learns plans in an offline stage from human demonstration [6]. Finally, before any plan is executed, delayed adaptation is applied to it to make it applicable in the current state of the environment [7]. Our system is inspired by Darmok, with the addition of online learning: Darmok does not specify any online learning mechanism for its plans; it only evaluates plans through an output parameter computed heuristically.
We point out that we examine only the combination of Online Case-Based Planning with Reinforcement Learning. This is different from related work on combining Case-Based Reasoning with Reinforcement Learning [4].
III. ARCHITECTURE
A novel AI agent capable of online planning and online learning is introduced. Basically, the agent goes through two main phases: an offline learning phase and an online phase that combines planning with learning.
In the offline phase, the agent starts to learn plans by observing the gameplay and strategies of two different opponents (human or computer) and then deduces and forms several cases. The agent receives raw traces of the game played by the human. Using Goal Matrix Generation, these traces are converted into raw plans. Further, a Dependency Graph is constructed for each raw plan. This helps the system to know:
1) the dependencies between the raw plan steps, and
2) which steps are suitable to parallelize.
Finally, a Hierarchical Composition is performed for each raw plan in order to shrink the raw plan size by substituting each group of related steps with a single step (of type goal). All the learnt cases are retained in the case base, and the offline stage is then complete.
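As a rough illustration of what the Dependency Graph step produces, the toy sketch below derives dependencies between raw plan steps from made-up precondition/effect sets (the step names and the encoding are our own assumptions, not the paper's trace format); steps that end up with the same dependencies are candidates for parallel execution.

```python
# Toy illustration of building a dependency graph over raw plan steps.
# Step format (hypothetical): (name, preconditions, effects).
raw_plan = [
    ("build_barracks", {"town_hall"}, {"barracks"}),
    ("train_grunt",    {"barracks"},  {"grunt"}),
    ("train_archer",   {"barracks"},  {"archer"}),
    ("attack",         {"grunt", "archer"}, {"enemy_damaged"}),
]

def dependency_graph(plan):
    """Step j depends on step i if i produces something j requires."""
    deps = {name: set() for name, _, _ in plan}
    for i, (ni, _, eff_i) in enumerate(plan):
        for nj, pre_j, _ in plan[i + 1:]:
            if eff_i & pre_j:
                deps[nj].add(ni)
    return deps

deps = dependency_graph(raw_plan)
# Steps with identical dependency sets can run in parallel (here, the two train_* steps).
parallel = [n for n, d in deps.items() if d == {"build_barracks"}]
print(deps)
print("parallelizable:", parallel)
```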
A case consists of:
1. Goal: the goal that this case satisfies.
2. Situation: the situation in which the case is applicable.
3. Shallow features: a set of sensed features from the environment that are computationally cheap.
4. Deep features: a set of sensed features from the environment that are computationally expensive.
5. Success rate: a decimal number between zero and one indicating the success rate of the case.
6. Eligibility trace: a number representing the frequency of use of the case.
7. Prior confidence: a decimal number between zero and one, set by an expert, indicating the confidence of success when using that case.
8. Behavior: holds the actual plan to be used.
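Read literally, this case structure maps onto a small record type. The sketch below is our own transcription, with field names and types chosen for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    goal: str                        # the goal this case satisfies
    situation: str                   # situation in which the case is applicable
    shallow_features: dict = field(default_factory=dict)  # cheap-to-compute sensed features
    deep_features: dict = field(default_factory=dict)     # expensive-to-compute sensed features
    success_rate: float = 0.0        # Q(C), in [0, 1]
    eligibility: float = 0.0         # e(C), how frequently/recently the case was used
    prior_confidence: float = 0.5    # expert-set confidence in [0, 1]
    behavior: list = field(default_factory=list)           # the actual plan (snippet) to execute
```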
After the agent learns the cases that serve as basic ingredients for playing the game, it is ready to play against different opponents. We call the phase in which the agent plays against opponents the online planning and learning phase. The planning comes from the expansion and execution of the current plan, whereas the learning comes from the revision of the applied plans. Let us take a deeper look into this:
 WinWargus is the initial goal for the agent.
 The expansion module searches for ready open goals.
o Ready means that all snippets before that
goal were executed successfully.
o Open means that this goal has no assigned
behavior.
 A Behavior consists of:
 Preconditions: conditions that must be satisfied before the plan starts executing.
 Alive conditions: conditions that must remain satisfied while the plan is being executed.
 Success conditions: conditions that must be satisfied after the plan has been executed.
 Snippet: a set of plan steps executed serially, where each plan step consists of a set of parallel actions/goals. This snippet forms the plan.
 The retrieval module selects the most efficient plan using:
o Situation assessment: to get the most suitable set of plans.
 The Situation Assessment module is built by:
 Capturing the most representative shallow features of the environment.
 Building a Situation Model that maps a set of shallow features to a situation.
 Building a Situation-Case Model, to classify each case for a specific situation.
 Building a Situation-Deep Feature Model, to provide the set of deep features important for the predicted situation.
o E-Greedy selection: to determine whether to explore or exploit.
 The exploration parameter is set to 30% in our experiments.
SP(RC, E) = Explore(RC) with probability P(E); Exploit(RC) with probability 1 − P(E),
where RC is the set of relevant cases and P(E) is the exploration probability.
o In exploitation, the predicted performance of each relevant case is computed and the best case is selected, using:
CasePredictedPerformance(C) = 1 / (1 + (CaseSimilarity(C) / (2 + CaseSimilarity(C))) × C.ObservedPerformance) + λ(C)
CaseSimilarity(C) = α × GS(C.G) + (1 − α) × SS(C.S)
λ(C) = C.Confidence × C.Performance
where Goal Similarity (GS) and State Similarity (SS) are computed using Euclidean distance.
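As a sketch of how this retrieval policy could be coded (our illustration: the vector encodings of goal and state, the conversion of Euclidean distance into a similarity, the α weight and the pluggable scoring function are all assumptions rather than details taken from the paper):

```python
import math, random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def case_similarity(case, goal_vec, state_vec, alpha=0.5):
    # CaseSimilarity(C) = alpha * GS(C.G) + (1 - alpha) * SS(C.S),
    # with goal/state similarity derived here from Euclidean distance (assumed mapping).
    gs = 1.0 / (1.0 + euclidean(case["goal_vec"], goal_vec))
    ss = 1.0 / (1.0 + euclidean(case["state_vec"], state_vec))
    return alpha * gs + (1 - alpha) * ss

def retrieve(relevant_cases, goal_vec, state_vec, score, epsilon=0.3):
    """E-greedy retrieval: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(relevant_cases)      # Explore(RC)
    return max(relevant_cases,                    # Exploit(RC): best predicted performance
               key=lambda c: score(c, case_similarity(c, goal_vec, state_vec)))
```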
 The selected behavior is passed to the plan adaptation module to adapt the plan to the current situation.
o The adaptation removes unnecessary actions from the plan and then adds satisfaction actions.
 The expansion module expands the current plan with the selected behavior.
 The execution module starts to execute plan actions whenever their preconditions are satisfied.
o To execute plans, the Execution Module:
 Searches for ready snippets,
 Sends these ready snippets to the game for execution,
 Updates the status of executing snippets (succeeded or failed),
 Updates the status of the executing actions of each snippet.
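A skeletal version of that execution loop might look like the following; the snippet fields, status values and the game interface (issue/poll) are placeholders we invented for illustration, not Wargus or Darmok API calls.

```python
def execution_step(snippets, game):
    """Dispatch ready snippets and refresh the status of executing ones."""
    for snippet in snippets:
        if snippet.status == "pending" and all(p(game) for p in snippet.preconditions):
            game.issue(snippet.actions)           # send the ready snippet to the game
            snippet.status = "executing"
        elif snippet.status == "executing":
            snippet.status = game.poll(snippet)   # "executing", "succeeded" or "failed"
    # finished snippets are handed to the reviser for evaluation
    return [s for s in snippets if s.status in ("succeeded", "failed")]
```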
After a complete snippet execution, the reviser starts its mission. The importance of revision originates from the fact that interactive and intelligent agents must get feedback from the environment to improve their performance. The learnt plans were based on some specific situation that might not always be suitable; moreover, the human himself could be playing with insufficiently good plans and strategies. The reviser adjusts the case performance according to temporal-difference learning with the Sarsa(λ) on-policy learning algorithm.
Finally, the adjusted plan is retained in the case base and the cycle starts over.
Figure 4: I-Strategizer Architecture

IV. I-STRATEGIZER CUSTOMIZED ALGORITHM USING SARSA(λ)
We introduce our approach, which hybridizes online case-based planning and reinforcement learning using the Sarsa(λ) algorithm, as a novel algorithm (see Figure 5). In order to show how Sarsa(λ) was customized, a table was constructed that maps the old symbols of the original Sarsa(λ) algorithm (see Figure 3) to the new symbols used in the novel algorithm (see Table 1).
Every time the agent retrieves a case to satisfy a certain goal,
it goes through the following steps:
1) It increments the eligibility of the retrieved case
according to the following:
e (Cr) = e (Cr) + 1
Where: Cr is the retrieved case.
2) It then updates the success rates of all cases in its
case base according to the following:
For each case C in the case base
Q(C) = Q(C) + α δ e(C)
Where α is the learning rate, e(C) is the eligibility of the
case C and δ is the temporal difference error that depends
on the following:
δ = R + r + γ Q (Cr) – Q (Cu)
Where:

R: The Global Reward: Its value is equal to the
ratio between the player’s power and the
enemy’s power. It resembles batch learning.
 r: The case-specific reward: a reward or punishment due to the success or failure of the last used case. It ranges between -1 and 1 and is computed by a heuristic that measures how effective the plan was.
 γ Q(Cr) – Q(Cu): The difference in success rate between the retrieved case Cr (multiplied by the discount rate γ) and the last used case Cu.
Notice that, in online case based planning there
could be multiple last used cases executed in
parallel; in this condition the total temporal
difference error relative to all last used cases
should be equal to:
δ = R + ∑i=1..n ri + γ Q(Cr) – Q(Ci)
where n is the number of last used cases.
3) It retrieves all cases whose S (goal and state) is similar to the S of the retrieved case Cr and stores the result in E. Cases in E only have their eligibility traces updated.
4) It updates the eligibility of all cases in E according to
the following :
e (C) = γ λ e (C)
Where: λ is the trace decay parameter, which controls
the rate of decay of the eligibility trace of all cases.
As it increases, the cases preserve more of their eligibility and are thus more affected by any rewards or punishments.
Symbol | General Meaning | New Symbol | Customized Meaning
s | State | S | State and Goal
a | Action | P | Plan (case snippet)
(s,a) | State-action pair | (S,P) or C | Case
Q(s,a) | Value of state-action pair | Q(S,P) or Q(C) | Success rate of case
r | Reward | R | Global reward
α | Learning rate parameter | α | Learning rate parameter
δ | Temporal-difference error | δ | Temporal-difference error
e(s,a) | Eligibility trace for state-action pair | e(S,P) or e(C) | Eligibility trace for case
γ | Discount rate | γ | Discount rate
λ | Trace decay parameter | λ | Trace decay parameter
- | - | r | Case-specific reward
Table 1: Mapping of symbols to their meanings

Observe failed or succeeded case Cu
Compute R, r
Retrieve case Cr via the retrieval policy (E-greedy)
δ = R + r + γ Q(Cr) – Q(Cu)
e(Cr) = e(Cr) + 1
For each case C in the case base:
    Q(C) = Q(C) + α δ e(C)
Retrieve the set of cases E
For each case C in E:
    e(C) = γ λ e(C)
Figure 5: Online learning algorithm for I-Strategizer
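The listing in Figure 5 translates almost line for line into code. The sketch below is our reading of it, with a plain dictionary standing in for the case base, a caller-supplied similarity query for the set E, and the rewards R and r passed in rather than computed:

```python
def revise_and_retrieve(case_base, similar_to, used_case, retrieved_case,
                        R, r, alpha=0.1, gamma=0.5, lam=0.4):
    """One pass of the online learning step in Figure 5.

    case_base maps case id -> {"Q": success rate, "e": eligibility trace}.
    similar_to(case_id) returns the set E of cases with a similar goal and state.
    """
    # Temporal-difference error between the last used case and the newly retrieved one
    delta = R + r + gamma * case_base[retrieved_case]["Q"] - case_base[used_case]["Q"]
    # Mark the retrieved case as eligible
    case_base[retrieved_case]["e"] += 1.0
    # Update the success rate of every case in proportion to its eligibility
    for c in case_base.values():
        c["Q"] += alpha * delta * c["e"]
    # Decay the eligibility traces of the cases similar to the retrieved one
    for cid in similar_to(retrieved_case):
        case_base[cid]["e"] *= gamma * lam
    return delta
```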
V. TESTING AND RESULTS
In order to make the significance of embedding reinforcement learning into online case-based planning clear, consider the simple situation in which there exist two cases containing two similar plans (snippets) for a certain goal with similar game states. Consider the two cases shown in Table 2.

Case1
Goal: Build Army
State: Enemy has a tower defense (identical to Case2)
Plan: Train 15 grunts; Train 5 archers
Success rate: 0.5

Case2
Goal: Build Army
State: Enemy has a tower defense (identical to Case1)
Plan: Train 2 catapults; Train 6 knights
Success rate: 0.5

Table 2: Two cases for the "Build Army" goal
Now consider another two cases containing two different plans for the "Attack" goal:

Case3
Goal: Attack
State: 15 grunts and 5 archers exist
Plan: Attack the tower defense with 15 grunts and 5 archers
Success rate: 0.2

Case4
Goal: Attack
State: 2 catapults and 6 knights exist
Plan: Attack the tower defense with 2 catapults and 6 knights
Success rate: 0.8

Table 3: Two cases for the "Attack" goal
In order to win, the agent has to fulfill the two goals "Build Army" and "Attack", in that order. Since the two cases of the "Build Army" goal share the same state and their success rates are currently equal, one of the two cases is chosen at random. Assume that both cases will be executed successfully.
In case Case1 is chosen, the agent retrieves Case3 as the most suitable case for execution (as it does not have 2 catapults and 6 knights). The low success rate of Case3 will affect the revision (or evaluation) of the last used case, Case1, causing its success rate to become 0.4 instead of 0.5.
In case Case2 is chosen, the agent retrieves Case4 as the most suitable case for execution (as it does not have 15 grunts and 5 archers). However, the choice of this case leads to the choice of a better case with a success rate of 0.8. This will affect the revision (or evaluation) of the last used case, Case2, causing its success rate to increase to 0.6 instead of 0.5.
As the agent plays in the same game or in multiple successive games, it will surely learn that using Case2 is definitely better than using Case1, although the two cases seemed identical to the agent when they were first learned during the offline learning process.
The table below shows the result of an experiment in which the algorithm was applied 10 times (which could happen in one or in multiple game episodes), where:
 Learning rate = 0.1.
 Decay rate = 0.4. It is set to a moderate value, to maintain moderate responsibility of the last used cases for the choice of the currently retrieved case.
 Exploration rate = 0.1. It is set low because, due to the small number of available cases (4 cases), any exploration will probably lead to the choice of the worst case, and choosing the worst case would have an undesirable negative effect on cases with high success rates.
 Discount rate = 0.5. It is set to a moderate value to maintain moderate bootstrapping.
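As a quick sanity check that the table below follows the update rule of Section IV, the temporal-difference error δB of the first application can be recomputed by hand from its RB and rB entries (the success rates are still zero at that point):

```python
gamma = 0.5                 # discount rate from the settings above

Q_C1, Q_C3 = 0.0, 0.0       # success rates before the first application
R_B, r_B = 0.11, 0.56       # global and case-specific reward of the first row

delta_B = R_B + r_B + gamma * Q_C3 - Q_C1   # δ = R + r + γ Q(Cr) − Q(Cu)
print(round(delta_B, 2))                    # 0.67, matching the δB entry of the first row
```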
The column "Ch" stands for the chosen cases. Q1, Q2, Q3 and Q4 stand for the success-rate values of the four cases. E1, E2, E3 and E4 stand for the eligibility traces of the four cases. RB, rB and δB stand for the global reward, the case-specific reward and the temporal-difference error of the case chosen for the goal "Build Army". Similarly, RA, rA and δA are the same for the goal "Attack".
Ch | Q1 | E1 | Q2 | E2 | RB | rB | δB | Q3 | E3 | Q4 | E4 | RA | rA | δA
- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
C1–C3 | -0.02 | 0.2 | 0 | 0 | 0.11 | 0.56 | 0.67 | 0.04 | 0.2 | 0 | 0 | -0.8 | -0.8 | -1.6
C2–C4 | -0.02 | 0.04 | 0.02 | 0.2 | 0.02 | 0.51 | 0.53 | 0.05 | 0.2 | 0.06 | 0.20 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0.01 | 0.04 | 0.24 | 0.06 | 0.57 | 0.64 | 0.07 | 0.2 | 0.14 | 0.24 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.07 | 0.25 | 0.08 | 0.52 | 0.63 | 0.09 | 0.2 | 0.23 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.09 | 0.25 | 0.06 | 0.57 | 0.68 | 0.11 | 0.2 | 0.32 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.12 | 0.25 | 0.11 | 0.50 | 0.67 | 0.13 | 0.2 | 0.41 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.16 | 0.25 | 0.05 | 0.59 | 0.72 | 0.15 | 0.2 | 0.51 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.19 | 0.25 | 0.11 | 0.51 | 0.71 | 0.17 | 0.2 | 0.61 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.23 | 0.25 | 0.04 | 0.57 | 0.71 | 0.2 | 0.2 | 0.71 | 0.25 | 0.1 | 0.2 | 0.3
C2–C4 | -0.01 | 0 | 0.27 | 0.25 | 0.07 | 0.51 | 0.69 | 0.22 | 0.2 | 0.81 | 0.25 | 0.1 | 0.2 | 0.3
After applying the algorithm 10 successive times, C1 ends up with a low success rate compared with C2. This shows that the agent has learned that building a smaller, heavier army in that situation (the existence of a tower defense) is preferable to building a larger, lighter army.
VI. CONCLUSION AND FUTURE WORK
In this paper, online case-based planning was hybridized with reinforcement learning. This was the first attempt to do so in order to introduce an intelligent agent capable of planning and learning online using temporal-difference learning with eligibility traces: the Sarsa(λ) algorithm. Learning online biases the agent's decisions toward selecting more efficient, effective and successful plans. It also saves the time the agent would otherwise spend retrieving inefficient, failed plans. As a result, the agent takes history into account when acting in the environment (i.e. playing a real-time strategy game).
Further, we are planning to develop a strategy/case-base visualization tool capable of visualizing the agent's preferred playing strategy according to its learning history. This will help in tracking the learning curve of the agent. After tracking the agent's learning curve, we will be capable of applying other learning algorithms and finding out which one is the most suitable and effective.
REFERENCES
[1] Buro, M. 2003. Real-time strategy games: A new
AI research challenge. In IJCAI’2003, 1534–1535.
Morgan Kaufmann
[2] Aamodt, A., and Plaza, E. 1994. Case-based
reasoning: Foundational issues, methodological
variations, and system approaches. Artificial
Intelligence Communications 7(1):39–59
[3] Santiago Ontañón, Kinshuk Mishra, Neha Sugandh and Ashwin Ram (2010). On-line Case-Based Planning. Computational Intelligence, Volume 26, Issue 1, pp. 84-119.
[4] Manu Sharma, Michael Homes, Juan Santamaria,
Arya Irani, Charles Isbell, and Ashwin Ram. Transfer
learning in real time strategy games using hybrid
CBR/RL. In IJCAI'2007, page to appear. Morgan
Kaufmann, 2007
[5] Kinshuk Mishra and Santiago Ontañón and
Ashwin Ram (2008), Situation Assessment for Plan
Retrieval in Real-Time Strategy Games. ECCBR2008.
[6] Santiago Ontañón, Kinshuk Mishra, Neha Sugandh and Ashwin Ram (2008). Learning from Demonstration and Case-Based Planning for Real-Time Strategy Games. In Soft Computing Applications in Industry (ISBN 1434-9922 (Print), 1860-0808 (Online)), pp. 293-310.
[7] Neha Sugandh and Santiago Ontañón and Ashwin
Ram (2008), On-Line Case-Based Plan Adaptation
for Real-Time Strategy Games. AAAI-2008.
[8] Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning, An Introduction. MIT
press, 2005