Intelligent Online Case Based Planning Agent Model for
Real-Time Strategy Games
Omar Enayet, Abdelrahman Ogail, Ibrahim Moawad, and Mostafa Aref
Faculty of Computer and Information Sciences, Ain-Shams University, Cairo, Egypt
first.last@hotmail.com, first_last@hotmail.com, aref_99@yahoo.com
Abstract – Research in learning and planning for real-time strategy (RTS) games is of great interest in several domains, such as the military, robotics, and, most importantly, the game industry. Recent work on online case-based planning in RTS games does not include the capability of learning online from experience, so the certainty of the knowledge remains constant, which leads to inefficient decisions. In this paper, an intelligent agent model based on both online case-based planning and reinforcement learning techniques is proposed. In addition, the proposed model has been evaluated using empirical simulation on Wargus (an open-source clone of the well-known real-time strategy game Warcraft II). This evaluation shows that the proposed model increases the certainty factor of the case base by learning from experience, and hence improves the decision-making process by selecting more efficient, effective, and successful plans.
Keywords: Case-based reasoning, Reinforcement Learning,
Online Case-Based Planning, Real-Time Strategy Games,
SARSA (λ) learning, Intelligent Agent.
I. INTRODUCTION
A. Real-Time Strategy Games
RTS games constitute well-defined environments for conducting experiments and offer straightforward, objective ways of measuring performance. Moreover, strong game AI will likely make a difference in future commercial games, because graphics improvements are beginning to saturate. RTS game AI is also interesting for the military, which uses battle simulations in training programs [1].
RTS games offer challenging opportunities for research in adversarial planning under uncertainty, learning and opponent modeling, and spatial and temporal reasoning. RTS games feature hundreds or even thousands of interacting objects, imperfect information, and fast-paced micro-actions [1].
B. Online Case-Based Planning [3]
The CBR cycle rests on two assumptions that are not suited for strategic real-time domains involving on-line planning. First, problem solving is modelled as a single-shot process, i.e. a single loop of the CBR cycle solves the problem. In case-based planning, solving a problem might involve solving several sub-problems, as well as monitoring their execution (potentially having to solve new problems along the way).
Second, problem solving and plan execution are decoupled, i.e. the CBR cycle produces a solution, but the execution of that solution is delegated to some external module. In strategic real-time domains, executing the solution is part of solving the problem, especially when the internal model of the world is not completely accurate, and ensuring that the execution of the solution succeeds is an important part of problem solving. For instance, while executing a solution the system might discover low-level details about the world that render the proposed solution wrong, so another solution has to be proposed. The OLCBP (On-Line Case-Based Planning) cycle is an extension of the CBR cycle with two added processes needed to deal with planning and execution of solutions in real-time domains: plan expansion and plan execution.
II. BACKGROUND
Case-based planning has been applied to computer games in the Darmok system [3], an online planning system that interleaves the expansion of the current plan with the execution of the actions that are ready in that plan. Plans are selected by a retrieval algorithm after assessing the situation using a novel situation-assessment model [4]. The system learns plans in an offline stage from human demonstration [5]. Finally, before any plan is executed, a delayed adaptation is applied to it to make it applicable in the current state of the environment [6]. Our system is inspired by Darmok, with the addition of online learning. Darmok does not include any principled specification of online learning for its plans; it only evaluates a plan through an output parameter computed heuristically.
If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning [7].
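As a point of reference, the basic one-step TD update described above (in the standard form of [7]) nudges the value estimate of the current state towards a bootstrapped target instead of waiting for the final return:

V(s) = V(s) + α [ r + γ V(s') – V(s) ]

where s is the current state, s' the next state, r the immediate reward, α the learning rate and γ the discount rate; the bracketed term is the temporal-difference error.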
Eligibility traces are one of the basic mechanisms of
reinforcement learning. Almost any temporal-difference (TD)
method, such as Q-learning or Sarsa, can be combined with
eligibility traces to obtain a more general method that may
learn more efficiently. From the theoretical view, they are
considered a bridge from TD to Monte Carlo methods. From
the mechanistic view, an eligibility trace is a temporary record
of the occurrence of an event, such as the visiting of a state or
the taking of an action. The trace marks the memory
parameters associated with the event as eligible for
undergoing learning changes. When a TD error occurs, only
the eligible states or actions are assigned credit or blame for
the error. Like TD methods themselves, eligibility traces are a basic mechanism for temporal credit assignment [7].
Sarsa(λ) combines the temporal-difference learning technique one-step Sarsa with eligibility traces in order to learn state-action values Qt(s,a) effectively [7]. It is an on-policy algorithm: it approximates the state-action values of the current policy and then gradually improves the policy based on those approximate values [7].
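For reference, the standard Sarsa(λ) updates with accumulating eligibility traces [7], written in the notation used above, are:

δ = r + γ Q(s', a') – Q(s, a)
e(s, a) = e(s, a) + 1                 (for the state-action pair just visited)
Q(s, a) = Q(s, a) + α δ e(s, a)       (for every state-action pair)
e(s, a) = γ λ e(s, a)                 (for every state-action pair)

Section IV customizes exactly these updates, with cases playing the role of state-action pairs.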
In this paper we introduce an approach that hybridizes online case-based planning with reinforcement learning. Section II reviews the relevant background and related work, Section III introduces the architecture of the agent, Section IV describes the algorithm used in the hybridization, Section V presents the experiment and its results, and Section VI concludes and outlines future work.
III. INTELLIGENT OLCBP AGENT MODEL
A novel AI agent capable of online planning and online learning is introduced. The agent goes through two main phases: an offline learning phase and an online planning and learning phase.
In the offline phase, the agent learns plans by observing game play and strategies between two opponents (human or computer) and then deduces and forms several cases. The agent receives raw traces of the game played by the human. Using goal matrix generation, these traces are converted into raw plans. Next, a dependency graph is constructed for each raw plan. This helps the system to determine:
1) the dependencies between the steps of each raw plan, and
2) which steps are suitable for parallelization.
Finally, a hierarchical composition is performed on each raw plan in order to shrink its size and substitute a group of related steps with a single step (of type goal).
All the learnt cases are retained in the case base; with this, the offline stage is complete.
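As an illustration of the dependency-graph step, the following minimal sketch (not the paper's implementation; the step attributes and names are assumptions) links a raw plan step to the earlier steps that produce what it consumes; steps left unconnected are candidates for parallelization:

    from dataclasses import dataclass, field

    @dataclass
    class RawPlanStep:
        name: str
        consumes: set = field(default_factory=set)   # e.g. {"barracks"}
        produces: set = field(default_factory=set)   # e.g. {"footman"}

    def build_dependency_graph(steps):
        # Return edges (i, j) meaning that step j depends on step i: step j
        # consumes something that the most recent earlier step i produces.
        # Steps with no path between them can be executed in parallel.
        edges = []
        for j, later in enumerate(steps):
            for need in later.consumes:
                for i in range(j - 1, -1, -1):
                    if need in steps[i].produces:
                        edges.append((i, j))
                        break
        return edges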
Figure 1: Case Representation
A case consists of the following elements (shown in Figure 1):
1. Goal: the goal that this case satisfies.
2. Situation: the situation where the case is applicable.
3. Shallow features: a set of features sensed from the environment that are computationally cheap to compute.
4. Deep features: a set of features sensed from the environment that are computationally expensive to compute.
5. Success rate: a decimal number between zero and one indicating the success rate of the case.
6. Eligibility trace: a number representing the frequency of use of the case.
7. Prior confidence: a decimal number between zero and one, set by an expert, indicating the confidence of success when using the case.
8. Behavior: the actual plan to be used.
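The case structure above can be summarized with a small sketch; the concrete field types and default values are illustrative assumptions, not taken from the paper:

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Case:
        goal: str                      # goal this case satisfies, e.g. "BuildArmy"
        situation: str                 # situation where the case is applicable
        shallow_features: Dict[str, float] = field(default_factory=dict)  # cheap to compute
        deep_features: Dict[str, float] = field(default_factory=dict)     # expensive to compute
        success_rate: float = 0.0      # Q(C), learned online (Section IV)
        eligibility: float = 0.0       # e(C), trace of recent use
        prior_confidence: float = 0.5  # set by an expert, in [0, 1]
        behavior: object = None        # the actual plan (see the behavior sketch below)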
After the agent has learnt the cases that serve as the basic ingredients for playing the game, it is ready to play against different opponents. We call the phase in which the agent plays against opponents the online planning and learning phase. Planning comes from the expansion and execution of the current plan, whereas learning comes from the revision of the applied plans.
The online phase involves the expansion, execution, retrieval, adaptation, revision, and current-plan modules.
The expansion module expands ready open goals in the current plan. Ready means that all snippets before the goal were executed successfully, and open means that the goal has no assigned behavior. As shown in Figure 2, a behavior consists of: preconditions, which must be satisfied before the plan can execute; alive conditions, which must remain satisfied while the plan is being executed; success conditions, which must be satisfied after the plan has been executed; and a snippet, a set of plan steps executed serially, where each plan step consists of a set of parallel actions/goals. The snippet forms the plan.
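A minimal sketch of this behavior structure follows (modelling a condition as a predicate over the game state is an assumption made here for clarity; the paper does not give concrete data structures):

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    Condition = Callable[[Dict], bool]   # predicate over the game state

    @dataclass
    class PlanStep:
        parallel_elements: List[str] = field(default_factory=list)  # actions and/or sub-goals run in parallel

    @dataclass
    class Behavior:
        preconditions: List[Condition] = field(default_factory=list)       # must hold before execution
        alive_conditions: List[Condition] = field(default_factory=list)    # must keep holding during execution
        success_conditions: List[Condition] = field(default_factory=list)  # must hold after execution
        snippet: List[PlanStep] = field(default_factory=list)              # plan steps executed serially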
The retrieval module selects the most efficient plan using situation assessment, to obtain the plans that fit the current situation, and an ε-greedy selection policy to decide whether to explore or exploit, as shown in equation 1 in Figure 2. In exploitation, the predicted performance of all candidate cases is computed and the best case is selected using equation 2 in Figure 2. In our experiments, the exploration parameter is set to 0.3.
Figure 2: Retrieval Equations
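Since the equations of Figure 2 are not reproduced here, the following is only a rough sketch of the ε-greedy retrieval step; the scoring used for exploitation (mixing the learned success rate with the expert's prior confidence) is an assumption, and equation 2 in Figure 2 defines the actual predicted performance:

    import random

    def retrieve_case(candidates, epsilon=0.3):
        # candidates: cases matching the current goal and situation.
        if random.random() < epsilon:        # explore
            return random.choice(candidates)
        # exploit: pick the case with the best (assumed) predicted performance
        return max(candidates, key=lambda c: 0.5 * c.success_rate + 0.5 * c.prior_confidence)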
The situation assessment module is built by capturing the most representative shallow features of the environment, building a situation model that maps a set of shallow features to a situation, building a situation-case model that classifies each case under a specific situation, and building a situation-deep-feature model that provides the set of deep features relevant to the predicted situation.
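A schematic sketch of this three-model pipeline (the three mappings below are placeholders; in the actual system they are the learned models described in [4]):

    def assess_situation(shallow_features, situation_model,
                         situation_case_model, situation_deep_feature_model, case_base):
        situation = situation_model(shallow_features)                     # shallow features -> situation
        candidate_cases = [c for c in case_base
                           if c in situation_case_model[situation]]       # cases classified under this situation
        relevant_deep_features = situation_deep_feature_model[situation]  # deep features worth computing now
        return situation, candidate_cases, relevant_deep_features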
After a complete snippet has been executed, the reviser module starts its work. The importance of revision comes from the fact that interactive, intelligent agents must get feedback from the environment to improve their performance. The learnt plans were based on specific situations that might not always be suitable; moreover, the human demonstrator may have played with plans and strategies that were not good enough. The reviser adjusts the case performance according to temporal-difference learning with the SARSA(λ) on-policy learning algorithm.
The adjusted plan is retained in the case base by the retainer module, and the cycle starts over.
The selected behavior is passed to the adaptation module, which adapts the plan to the current situation. In particular, adaptation means removing unnecessary actions from the plan and then adding satisfaction actions.
The execution module executes plan actions whenever their preconditions are satisfied. To execute plans, the execution module searches for ready snippets, sends these ready snippets for execution (to the game), updates the status of the executing snippets (succeeded or failed), and finally updates the status of the executing actions of each snippet.
Figure 3: Intelligent Online Case-based Planning Agent Model
IV. HYBRID OLCBP/RL ALGORITHM USING SARSA(λ)
We now introduce our approach, which hybridizes online case-based planning and reinforcement learning, using the SARSA(λ) algorithm, in a novel algorithm (see Figure 5). To show how SARSA(λ) was customized, Table 1 maps the symbols of the original SARSA(λ) algorithm to the new symbols used in the proposed algorithm.
Table 1: Mapping of the original SARSA(λ) symbols and meanings to the new symbols and meanings of the proposed algorithm

Symbol    General meaning                          New symbol        Customized meaning
s         State                                    S                 Situation and goal
a         Action                                   P                 Plan (case snippet)
(s,a)     State-action pair                        (S,P) or C        Case
Q(s,a)    Value of state-action pair               Q(S,P) or Q(C)    Success rate of case
r         Reward                                   R                 Global reward
-         -                                        r                 Case-specific reward
α         Learning rate parameter                  α                 Learning rate parameter
δ         Temporal-difference error                δ                 Temporal-difference error
e(s,a)    Eligibility trace for state-action pair  e(S,P) or e(C)    Eligibility trace for case
γ         Discount rate                            γ                 Discount rate
λ         Trace decay parameter                    λ                 Trace decay parameter

Every time the agent retrieves a case to satisfy a certain goal, the agent (in RL terms) is considered to have moved to a new state, and it applies the algorithm, which goes through the following steps:

1) It increments the eligibility of the retrieved case:

e(Cr) = e(Cr) + 1

where Cr is the retrieved case.

2) It then updates the success rates of all cases in its case base:

For each case C in the case base: Q(C) = Q(C) + α δ e(C)

where α is the learning rate, e(C) is the eligibility of case C, and δ is the temporal-difference error:

δ = R + r + γ Q(Cr) – Q(Cu)

where:

• R: the global reward. Its value depends on the ratio between the player's power and the enemy's power. It is very important for the agent to be aware of the global consequences of its actions.

• r: the case-specific reward due to the success or failure of the last used case Cu. It ranges between -1 and 1 and is computed by a heuristic that determines how effective the plan was:

r(c) = -1 if c failed; r(c) = r, with -1 < r < 1, if c succeeded

• γ Q(Cr) – Q(Cu): the difference between the success rate of the retrieved case Cr (multiplied by the discount rate γ) and the success rate of the last used case Cu.

Notice that the performance of the entire case base is updated, since different types of cases affect each other's performance; for example, a case achieving the "BuildArmy" goal will certainly affect the performance of the next used case, which achieves the "Attack" goal.

Notice also that, in online case-based planning, there can be multiple last used cases executed in parallel; in this situation the total temporal-difference error relative to all last used cases should be equal to:

δ = R + Σ(i=1..n) ri + γ Q(Cr) – Q(Ci)

where n is the number of last used cases.

3) It retrieves all cases with an S (goal and situation) similar to the S of the retrieved case Cr and stores the result in E.

4) It updates the eligibility of all cases in E:

e(C) = γ λ e(C)

where λ is the trace decay parameter, which controls the rate of decay of the eligibility traces of all cases. As λ increases, the cases preserve more of their eligibility and are thus affected more by any rewards or punishments.

Notice that only cases with a similar goal and situation have their eligibility updated, as cases with a similar goal and situation constitute a pool of states (in RL terms) that take the responsibility for the choice of the current case, and thus have their eligibilities updated.

The complete algorithm is summarized below:

Observe failed or succeeded case Cu
Compute R, r
Retrieve case Cr via the retrieval policy (ε-greedy)
δ = R + r + γ Q(Cr) – Q(Cu)
e(Cr) = e(Cr) + 1
For each case C in the case base:
    Q(C) = Q(C) + α δ e(C)
Retrieve the set of cases E (similar goal and situation)
For each case C in E:
    e(C) = γ λ e(C)

Figure 5: Online learning algorithm used by the intelligent agent to evaluate (revise) cases
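The revision step of Figure 5 can be sketched in a few lines of code. This is a minimal sketch that reuses the Case fields from the earlier sketch; the reward values and the retrieval policy are assumed to be supplied by the rest of the agent, and the default parameter values mirror one of the settings used in Section V:

    def revise(case_base, last_case, global_reward, case_reward, retrieve,
               alpha=0.1, gamma=0.5, lam=0.8):
        # Retrieve the next case Cr via the retrieval policy (epsilon-greedy).
        retrieved = retrieve(case_base)

        # Temporal-difference error: delta = R + r + gamma * Q(Cr) - Q(Cu)
        delta = (global_reward + case_reward
                 + gamma * retrieved.success_rate - last_case.success_rate)

        # Increment the eligibility of the retrieved case.
        retrieved.eligibility += 1.0

        # Update the success rate of every case in the case base.
        for case in case_base:
            case.success_rate += alpha * delta * case.eligibility

        # Decay the eligibility of the cases sharing the retrieved case's goal and situation.
        for case in case_base:
            if case.goal == retrieved.goal and case.situation == retrieved.situation:
                case.eligibility *= gamma * lam

        return retrieved, delta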
V. EXPERIMENT AND RESULTS
To make clear the significance of extending online case-based planning with online learning using reinforcement learning, consider the simple case where the case acquisition module (offline learning from human traces) has just learned the four cases shown in Table 2, with their success rates initialized to 0.
It is known in the game of Wargus that using heavy units, such as ballistas and knights, to attack a towers defense is more effective than using light units such as footmen and archers. This means that it is highly preferable to use case BuildArmy2 instead of case BuildArmy1, and then to use Attack2, which will cause the agent to destroy more of the enemy's units and thus approach winning the game.
The experiment consists of tracing the agent's evaluation of the cases (after achieving the goals "Build Army" and then "Attack", in order) for 40 successive times. The learning rate was set to 0.1, the discount rate to 0.8, the decay rate to 0.5, and the exploration rate to 0.1. The success rates and eligibilities of all cases were initialized to zero. Table 3 shows the ranges of the rewards gained after executing each of the four cases. Notice that BuildArmy1 and BuildArmy2 are rewarded similarly; however, the rewards of Attack1 and Attack2 vary greatly due to their different outcomes.
In order to win, the agent has to fulfil the two goals "Build Army" and "Attack", in order, by choosing one case for each goal respectively.
Notice that cases BuildArmy1 and BuildArmy2 share identical game states, although they contain different plans for achieving the same goal, "Build Army".
On the other hand, cases Attack1 and Attack2 achieve the same goal but with different plans and different game states, which are the game states reached after executing BuildArmy1 and BuildArmy2 respectively. Using BuildArmy1 forces the agent to use Attack1, as BuildArmy1 trains the army needed by Attack1. Similarly, using BuildArmy2 forces the agent to use Attack2.
Figure 6 compares the success rates of the two cases BuildArmy1 and BuildArmy2 along with their eligibility traces (E1 denotes the eligibility of BuildArmy1 and E2 the eligibility of BuildArmy2). Similarly, Figure 7 compares Attack1 and Attack2.
Figure 6: Tracing success rates and eligibility values of BuildArmy1 and BuildArmy2 during 40 evaluations
Table 2: Cases evaluated using the SARSA(λ) algorithm in the experiment

BuildArmy1
  Goal: Build Army
  State: Enemy has a towers defense (identical to BuildArmy2)
  Plan: Train 15 grunts; train 5 archers
  Success rate: 0

BuildArmy2
  Goal: Build Army
  State: Enemy has a towers defense (identical to BuildArmy1)
  Plan: Train 2 catapults; train 6 knights
  Success rate: 0

Attack1
  Goal: Attack
  State: Enemy has a towers defense; agent has 15 grunts and 5 archers
  Plan: Attack the towers defense with 15 grunts and 5 archers
  Success rate: 0

Attack2
  Goal: Attack
  State: Enemy has a towers defense; agent has 2 catapults and 6 knights
  Plan: Attack the towers defense with 2 catapults and 6 knights
  Success rate: 0
Table 3: Ranges of the rewards gained on successfully executing the cases

Case          Case-specific reward      Global reward
              From        To            From        To
BuildArmy1    0           0.2           0.2         0.3
BuildArmy2    0           0.2           0.2         0.3
Attack1       0           0.2           -0.8        -0.6
Attack2       0.1         0.2           0.4         0.6
Figure 7: Tracing success rates and eligibility values of Attack1 and Attack2 during 40 evaluations
Since the two cases of the "Build Army" goal share the same state and their success rates are initially equal, one of the two cases is chosen at random. Assume that the chosen cases will be executed successfully.
If BuildArmy1 is retrieved, the agent subsequently retrieves Attack1 as the most suitable case for execution (since the agent has 15 grunts and 5 archers). The low success rate of case Attack1 affects the revision (evaluation) of the last used case BuildArmy1, causing its success rate to become 0.4 instead of 0.5.
If BuildArmy2 is chosen, the agent retrieves Attack2 as the most suitable case for execution (since the agent has 2 catapults and 6 knights rather than 15 grunts and 5 archers). The choice of this case leads to the choice of a better case with a success rate of 0.8, which affects the revision (evaluation) of the last used case BuildArmy2, causing its success rate to increase to 0.6 instead of 0.5.
As the agent plays in the same game or in multiple successive games, it learns that using BuildArmy2 is better than using BuildArmy1, although the two cases seemed identical to the agent when they were first learned during the offline learning process.
The table below shows the result of an experiment in which the algorithm (Figure 5) was applied 10 times (in one or multiple game episodes), with the following settings (see the illustrative call after this list):
• Learning rate = 0.1.
• Decay rate = 0.8. It is set to an average value to maintain an average responsibility of the last used cases for the choice of the currently retrieved case.
• Exploration rate = 0.1. It is set low because, due to the small number of available cases (4 cases), any exploration will probably lead to the choice of the worst case, and choosing the worst case would have an undesirable negative effect on the cases with high success rates.
• Discount rate = 0.5. It is set to an average value to maintain average bootstrapping.
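Using the hypothetical revise() and retrieve_case() sketches from earlier sections, these settings would correspond to a call along the following lines (the case names and reward values here are placeholders chosen inside the ranges of Table 3):

    from functools import partial

    policy = partial(retrieve_case, epsilon=0.1)            # exploration rate 0.1
    retrieved, delta = revise(case_base, last_case=build_army_2,
                              global_reward=0.25, case_reward=0.1,
                              retrieve=policy,
                              alpha=0.1, gamma=0.5, lam=0.8)  # learning 0.1, discount 0.5, decay 0.8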
After applying the algorithm 10 successive times, BuildArmy1 gains a low success rate compared with BuildArmy2. This shows that the agent has learned that building a smaller, heavier army in that situation (the existence of a towers defense) is preferable to building a larger, lighter army.
VI. CONCLUSION AND FUTURE WORK
In this paper, online case-based planning was hybridized with reinforcement learning. This is the first attempt to do so in order to introduce an intelligent agent capable of planning and learning online using temporal-difference learning with eligibility traces: the SARSA(λ) algorithm. Learning online biases the agent's decisions towards selecting more efficient, effective, and successful plans. It also saves the time the agent would otherwise spend retrieving inefficient, failed plans. As a result, the agent takes its history into account when acting in the environment (i.e. playing a real-time strategy game).
Furthermore, we plan to develop a strategy/case-base visualization tool capable of visualizing the agent's preferred playing strategy according to its learning history. This will help in tracking the learning curve of the agent. After tracking the agent's learning curve, we will be able to apply other learning algorithms and find out which one is the most suitable and effective.
REFERENCES
[1] Buro, M. 2003. Real-time strategy games: A new AI research challenge. In IJCAI 2003, 1534–1535. Morgan Kaufmann.
[2] Aamodt, A., and Plaza, E. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communications 7(1):39–59.
[3] Santiago Ontañón, Kinshuk Mishra, Neha Sugandh, and Ashwin Ram. 2010. On-line case-based planning. Computational Intelligence 26(1):84–119.
[4] Kinshuk Mishra, Santiago Ontañón, and Ashwin Ram. 2008. Situation assessment for plan retrieval in real-time strategy games. In ECCBR 2008.
[5] Santiago Ontañón, Kinshuk Mishra, Neha Sugandh, and Ashwin Ram. 2008. Learning from demonstration and case-based planning for real-time strategy games. In Soft Computing Applications in Industry (ISBN 1434-9922 (Print), 1860-0808 (Online)), 293–310.
[6] Neha Sugandh, Santiago Ontañón, and Ashwin Ram. 2008. On-line case-based plan adaptation for real-time strategy games. In AAAI 2008.
[7] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2005.