By Ashwinkumar Ganesan CMSC 601 Reinforcement Learning Problem Statement Proposed Method Conclusions Method for learning by experience There agent or bot learns by interacting with the environment. There is reward attached for action taken in a particular state. GOAL : MAXIMIZE THE REWARD Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision [1] 𝑈 𝑠 =𝑅 𝑠 + U(s) is Utility Of a state R(s) is the reward of the state. T(s,a,s’) is the transition from s to s’ 𝑆 ′ 𝑇 𝑠, 𝑎, 𝑠 ′ 𝑈(𝑠) It is a discrete time control process 𝑃 𝑠 ′ 𝑎, 𝑠, 𝑠 − 1 … . ) = 𝑃(𝑠 ′ | 𝑎, 𝑠) S’ is the next state S is the current state A is the action Q – learning is method of maintaining reward information without learning the policy. 𝑄 𝑎, 𝑠 = 𝑄 𝑎, 𝑠 + α(𝑅 𝑠 ′ + 𝑄𝑚𝑎𝑥 𝑎′ , 𝑠 ′ − 𝑄(𝑎, 𝑠) Q(a,s) is the current action state pair. R(s’) is the reward for the next state. Co-ordinated Group Action is the situation a set of agents perform a single task. GOAL: To maximize the output or the reward globally. Reinforcement Learning Problem Statement Proposed Method Conclusions Communication i.e. what should the agent communicate and how much should it communicate with other agents. Optimal Policy i.e. defining an optimal policy of the entire group. Is an optimal policy a set of optimal individual policies for each agent? How much of the individual policy information of a certain agent is available to the entire group. Create a method for implementing reinforcement learning on co-ordinated group activity efficiently Modify Reinforcement Learning algorithm to implement group action Implement the proposed method and measure its efficiency in World of Warcraft. 1. 2. 3. 4. 5. The proposal is to research co-operative group learning under the following conditions: The environment is assumed to partially observable. Each agent in the system is knows the final action taken by the other agent. Agents do not have access the to state information generated by other agents while selecting an action. Agents do not have access to the policies of other agents, to make the decision. The rewards are not linearly separable. World of Warcraft is a large multi-player online game by Blizzard Entertainment It is game where every player has his own and roams the virtual world, fighting demons, observing the landscape, buy and selling items and interacting with other players In short, it is a large game with lots of options for tools, skills and levels, making it challenging for Bots. http://world-of-warcraft.en.softonic.com/ Today games especially like World Of Warcraft, have a large number of human players who play in groups. Single as well as multiplayer games have AI engines, but attacks and actions in these games by opponents is still one at a time. Reinforcement Learning can be implemented real time in these games, to improve the AI over a period of time and customize the games for users Other applications – Robotics,defense QUICR Method: This method calculates the counterfactual action, which is the action an agent did not take at a time[2]. Least Square Policy Iteration (LSPI) can implemented. The method performs policy iterations using samples instead of an actual policy[3]. FMQ Algorithm. The algorithm is helpful for environments where agents have partial or no observability[4]. Inverse reinforcement learning is the exact opposite of reinforcement learning. Input is the optimal policy or behavior that is expected from the agent The agent learns to find the reward function based on observed values in the environment Reinforcement Learning Problem Statement Proposed Method Conclusions 1. 2. Implement Reinforcement Learning in 2 parts: Implement Inverse Reinforcement Learning Implement a modified Reinforcement Learning on the rewards learned in step1. Observe Expert Calculate Reward Observe other agents Calculate Policy based on reward Calculate the new Q(a,s) value Calculating the iteration when the agent is known to have optimal policy Finding the point when to switch from Inverse Learning to Reinforcement Learning Working with the reward function got from observing the expert and rewards obtained from the environment Finding a method to observe other agents and “experts”. The metrics will be evaluated against the known methods Metrics are: Number of states generated Number of iterations required to reach optimal policy Rewards in terms of points earned (in the game) Rate of Convergence to optimal Policy 1. 2. 3. 4. Reinforcement Learning Problem Statement Proposed Method Conclusions We can use Inverse reinforcement learning with reinforcement learning methods to speed up the learning time required for bots. Improve bot reward functions over time. 1. R Bellman, On the Theory of Dynamic Programming, Proceedings of the National Academy of Sciences, 1952 2. Adrian K. Agogino and Kagan Turner, QUICKER Q-Learning in Multi-agent systems. 3. Lihong Li, Micheal L. Littman, Christopher R. Mansley, Online Exploration in Least – Squares policy Iteration. 4. La¨etitia Matignon, Guillaume J. Laurent and Nadine Le FortPiat, A study of FMQ heuristic in cooperative multi-agent games 5. Acknowledgement to Prof. Tim Oates for helping with the literature survey.