By Ashwinkumar Ganesan, CMSC 601

Reinforcement Learning

A method for learning from experience.

The agent or bot learns by interacting with the environment.

There is a reward attached to each action taken in a particular state.

GOAL: MAXIMIZE THE REWARD
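As a rough illustration of this interaction loop (the environment interface and the policy below are hypothetical placeholders, not part of the proposal):

```python
# Minimal sketch of the agent-environment interaction loop in reinforcement learning.
# `env` and `policy` are hypothetical placeholders for an environment and an agent.

def run_episode(env, policy, max_steps=100):
    """Run one episode and accumulate the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an action in the current state
        state, reward, done = env.step(action)   # environment returns next state and reward
        total_reward += reward                   # the goal is to maximize this accumulated reward
        if done:
            break
    return total_reward
```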

Principle of Optimality: An optimal policy has the property
that whatever the initial state and initial decision are, the
remaining decisions must constitute an optimal policy with
regard to the state resulting from the first decision [1]

U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

U(s) is the utility of a state.
R(s) is the reward of the state.
T(s, a, s') is the probability of the transition from s to s' under action a.
γ is the discount factor.
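For concreteness, a minimal value-iteration sketch of this update on a toy two-state MDP (the states, actions, transition probabilities and rewards below are purely illustrative):

```python
# Minimal value-iteration sketch for U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') * U(s').
# The toy MDP below (states, actions, T, R, gamma) is purely illustrative.

states = ["s0", "s1"]
actions = ["a0", "a1"]
# T[(s, a)] maps the next state s' to the probability T(s, a, s')
T = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.0, "s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}   # reward of each state
gamma = 0.9                  # discount factor

U = {s: 0.0 for s in states}
for _ in range(100):         # repeat the Bellman update until it (approximately) converges
    U = {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in T[(s, a)].items())
            for a in actions
        )
        for s in states
    }
print(U)                     # approximate utility of each state
```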





It is a discrete-time control process in which the next state depends only on the current state and action (the Markov property):

P(s' | a, s, s−1, …) = P(s' | a, s)

s' is the next state.
s is the current state.
a is the action.
s−1, … are the earlier states.




Q-learning is a method of maintaining reward information for each state-action pair without explicitly learning a policy.

Q(a, s) ← Q(a, s) + α [ R(s') + γ max_{a'} Q(a', s') − Q(a, s) ]

Q(a, s) is the value of the current action-state pair.
R(s') is the reward for the next state.
α is the learning rate and γ is the discount factor.
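A minimal tabular sketch of this update, assuming a hypothetical environment with reset()/step() and illustrative values for the learning rate α, discount γ and exploration rate ε:

```python
from collections import defaultdict
import random

# Tabular Q-learning sketch: Q(a,s) <- Q(a,s) + alpha * (R(s') + gamma * max_a' Q(a',s') - Q(a,s)).
# `env` is a hypothetical environment with reset()/step(); alpha, gamma and epsilon are assumed values.

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(action, state)] -> estimated value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(a2, s)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(a2, s_next)] for a2 in actions)
            # move Q(a,s) toward the reward plus the discounted best next value
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])
            s = s_next
    return Q
```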

Co-ordinated group action is the situation where a set of agents performs a single task.

GOAL: To maximize the output or the reward globally.

Problem Statement

Communication, i.e. what should an agent communicate, and how much should it communicate, with the other agents?

Optimal policy, i.e. defining an optimal policy for the entire group. Is an optimal policy a set of optimal individual policies, one for each agent?

How much of an individual agent's policy information is available to the entire group?

Create a method for efficiently implementing reinforcement learning for co-ordinated group activity.

Modify the reinforcement learning algorithm to implement group action.

Implement the proposed method and measure its efficiency in World of Warcraft.

The proposal is to research co-operative group learning under the following conditions (a sketch of the resulting information constraints follows this list):
1. The environment is assumed to be partially observable.
2. Each agent in the system knows the final action taken by the other agents.
3. Agents do not have access to the state information generated by other agents while selecting an action.
4. Agents do not have access to the policies of other agents when making a decision.
5. The rewards are not linearly separable.
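To make these information constraints concrete, one illustrative way to represent what a single agent is allowed to see (the class and field names are hypothetical, not part of the proposal):

```python
from dataclasses import dataclass
from typing import Dict

# Illustrative sketch of the information available to one agent under the conditions above:
# it sees only its own partial observation and the final actions of the other agents,
# not their state information or policies.

@dataclass
class AgentView:
    own_observation: object               # partial observation of the environment (condition 1)
    others_final_actions: Dict[str, str]  # agent id -> final action taken (condition 2)
    # No fields for other agents' state information (condition 3) or policies (condition 4):
    # that information is deliberately unavailable when selecting an action.

view = AgentView(own_observation="room_3", others_final_actions={"agent_2": "attack"})
```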

World of Warcraft is a large multi-player online game by Blizzard Entertainment.

It is a game where every player has his own character and roams the virtual world, fighting demons, observing the landscape, buying and selling items, and interacting with other players.

In short, it is a large game with many options for tools, skills and levels, making it challenging for bots.
http://world-of-warcraft.en.softonic.com/

Today, games like World of Warcraft have a large number of human players who play in groups.

Single-player as well as multi-player games have AI engines, but attacks and actions by opponents in these games are still taken one at a time.

Reinforcement learning can be implemented in real time in these games to improve the AI over a period of time and customize the games for users.

Other applications: robotics, defense.

QUICR method: this method calculates the counterfactual action, which is the action an agent did not take at a given time [2]; a rough sketch of this idea follows after these methods.

Least Squares Policy Iteration (LSPI) can be implemented. The method performs policy iteration using samples instead of an explicit policy representation [3].

FMQ algorithm: the algorithm is helpful for environments where agents have partial or no observability [4].
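As a rough illustration of the counterfactual (difference-reward) idea behind QUICR-style methods, assuming a global reward function G over the joint action and a fixed "null" action to substitute for the agent's actual choice (this sketches the general idea only, not the exact QUICR update):

```python
# Counterfactual / difference reward sketch: compare the global reward for the joint action
# actually taken with the reward when agent i's action is replaced by a default "null" action.
# G, the joint action and the null action are illustrative assumptions.

def difference_reward(G, joint_action, agent_index, null_action):
    actual = G(joint_action)
    counterfactual = list(joint_action)
    counterfactual[agent_index] = null_action     # the action the agent "did not take"
    return actual - G(tuple(counterfactual))

# Toy usage: the team reward counts how many agents attack.
team_reward = lambda actions: sum(1.0 for a in actions if a == "attack")
print(difference_reward(team_reward, ("attack", "idle"), 0, "idle"))   # 1.0: agent 0's attack mattered
```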

Inverse reinforcement learning is the exact opposite of reinforcement learning.

The input is the optimal policy or behavior that is expected from the agent.

The agent learns to find the reward function based on the values observed in the environment.
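As a very rough sketch of this input/output relationship (the frequency-based reward below is a naive stand-in for illustration, not an actual inverse reinforcement learning algorithm):

```python
from collections import Counter

# Naive illustration of the inverse RL interface: the input is expert behavior (trajectories),
# the output is an estimated reward function. Scoring states by expert visitation frequency
# is only a stand-in for a real IRL method.

def learn_reward_from_expert(expert_trajectories):
    visits = Counter(s for trajectory in expert_trajectories for s in trajectory)
    total = sum(visits.values())
    return lambda s: visits[s] / total if total else 0.0

reward = learn_reward_from_expert([["s0", "s1", "s2"], ["s0", "s2"]])
print(reward("s2"))   # states the expert visits often receive higher estimated reward
```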

Proposed Method

Implement reinforcement learning in two parts:
1. Implement inverse reinforcement learning.
2. Implement a modified reinforcement learning on the rewards learned in step 1.

Diagram: Observe expert → Calculate reward → Observe other agents → Calculate policy based on reward → Calculate the new Q(a, s) value.
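A rough, high-level sketch of how the two phases could fit together, assuming a hypothetical environment whose step() also returns the other agents' final actions; the reward step reuses the naive frequency stand-in from the inverse RL sketch above and is not the actual proposed implementation:

```python
from collections import Counter, defaultdict
import random

# High-level sketch of the two-phase pipeline. The environment interface and the
# frequency-based reward are illustrative assumptions, not the proposed implementation.

def train_group_agent(expert_trajectories, env, actions,
                      episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Phase 1: inverse RL - observe the expert and estimate a reward function
    # (visitation frequency is a naive stand-in for a real IRL step).
    visits = Counter(s for trajectory in expert_trajectories for s in trajectory)
    total = sum(visits.values())
    reward_fn = lambda s: visits[s] / total if total else 0.0

    # Phase 2: modified Q-learning driven by the learned rewards; at each step the agent
    # also receives the final actions taken by the other agents.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(a2, s)])
            s_next, done, others_actions = env.step(a)   # others_actions: teammates' final actions
            r = reward_fn(s_next)                        # reward learned in phase 1
            best_next = max(Q[(a2, s_next)] for a2 in actions)
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])   # calculate the new Q(a,s) value
            s = s_next
    return Q
```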

Calculating the iteration at which the agent is known to have an optimal policy.

Finding the point at which to switch from inverse reinforcement learning to reinforcement learning.

Working with the reward function obtained from observing the expert together with the rewards obtained from the environment.

Finding a method to observe other agents and “experts”.

The metrics will be evaluated against the known methods.

Metrics are:
1. Number of states generated
2. Number of iterations required to reach the optimal policy
3. Rewards in terms of points earned (in the game)
4. Rate of convergence to the optimal policy
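As a rough sketch of how two of these metrics could be computed from training logs (here q_snapshots is a hypothetical list of Q-tables saved after each iteration, not an artifact of the proposal):

```python
# Sketch of tracking two metrics: the number of distinct states generated, and the
# number of iterations until the greedy policy stops changing (a convergence proxy).
# `q_snapshots` is a hypothetical list of Q-tables keyed by (action, state).

def greedy_policy(Q, states, actions):
    """Greedy policy implied by a Q-table: best action in each state."""
    return {s: max(actions, key=lambda a: Q.get((a, s), 0.0)) for s in states}

def iterations_to_convergence(q_snapshots, states, actions):
    """First iteration after which the greedy policy no longer changes."""
    policies = [greedy_policy(Q, states, actions) for Q in q_snapshots]
    for i in range(len(policies) - 1):
        if all(policies[j] == policies[-1] for j in range(i, len(policies))):
            return i
    return len(policies) - 1

def number_of_states(q_snapshots):
    """Count distinct states that ever received a Q-value (states generated)."""
    return len({s for Q in q_snapshots for (_, s) in Q})
```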

Conclusions

We can use inverse reinforcement learning together with reinforcement learning methods to reduce the learning time required for bots.

Improve bot reward functions over time.
1. R. Bellman, On the Theory of Dynamic Programming, Proceedings of the National Academy of Sciences, 1952.
2. Adrian K. Agogino and Kagan Tumer, Quicker Q-Learning in Multi-Agent Systems.
3. Lihong Li, Michael L. Littman and Christopher R. Mansley, Online Exploration in Least-Squares Policy Iteration.
4. Laëtitia Matignon, Guillaume J. Laurent and Nadine Le Fort-Piat, A Study of FMQ Heuristic in Cooperative Multi-Agent Games.
5. Acknowledgement to Prof. Tim Oates for helping with the literature survey.