Reinforcement learning of 4x4 grid matrix using OpenAI libraries

Sanjay
Department of Computer Science
University at Buffalo
Buffalo, NY 14260
sanjay@buffalo.edu

Abstract

This report studies and implements OpenAI-driven reinforcement learning algorithms. The purpose of this project is to demonstrate a reinforcement learning task in which the machine learns to reach its target point by the shortest path possible in a 4x4-state grid environment. Q-learning is used to carry out the steps of the learning process. Reward/penalty-driven learning is the core mechanism of reinforcement learning and is showcased in this project.

1 INTRODUCTION

Reinforcement learning is a machine learning technique in which an agent learns in an interactive environment through a trial-and-error approach. For every action taken by the agent, a positive or negative reward is given. Unlike supervised learning, reinforcement learning creates its own training data by exploring the environment and then exploits that experience in the future to attain rewards. The important components of reinforcement learning are:

• Environment: the physical setting in which the agent takes its actions.
• State: the current situation of the agent.
• Reward: the positive or negative feedback given by the environment.
• Policy: the sequence of control actions taken by the agent to reach the target.

Fig1: Reinforcement learning illustrated

Reinforcement learning is the study of how an agent can interact with its environment to learn a policy that maximizes the expected cumulative reward for a task. Q-Learning (Quality Learning) is a reinforcement learning method that stores an estimate of the usefulness of a given action for gaining future rewards. The decision process underlying this method is the Markov Decision Process (MDP).

A Markov Decision Process is based on the Markov property, i.e., the future is independent of the past given the present. This means that the actions taken in the future depend only on the present state and not on the past history, because the Markov property holds at every state. The state-transition probability of an MDP is written as:

P_{ss'} = P[S_{t+1} = s' | S_t = s]

OpenAI Gym [1] is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. Its accompanying whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.

1.1 METHODOLOGY AND EXPERIMENTS

The first and foremost step of our study is to set up the grid-world game environment. We then define the grid dimensions together with the step, reset, and distance functions that support traversal and rendering. Below are the main Python libraries used to aid this study:

1) NumPy
2) OpenAI Gym
3) Threading
4) Matplotlib (pyplot)

Fig2: Initial state of the grid-world environment

The agent works within an action space consisting of four actions: up, down, left, and right. At each time step, the agent takes one action and moves in the direction described by that action. The agent receives a reward of +1 for moving closer to the goal and -1 for moving away from it or remaining at the same distance.
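To make this setup concrete, the following is a minimal sketch of such a 4x4 grid environment written against the classic Gym step/reset interface (observation, reward, done, info). The class name GridWorldEnv, the fixed start at (0, 0), the goal at (3, 3), and the use of Manhattan distance are illustrative assumptions, not necessarily the exact choices used in this project.

import numpy as np
import gym
from gym import spaces


class GridWorldEnv(gym.Env):
    """A minimal 4x4 grid world: the agent starts at (0, 0) and must reach the goal.

    Rewards follow the scheme described above: +1 when a step reduces the
    distance to the goal, -1 otherwise.
    """

    def __init__(self, size=4, goal=(3, 3)):
        self.size = size
        self.goal = np.array(goal)
        self.action_space = spaces.Discrete(4)        # 0: up, 1: down, 2: left, 3: right
        self.observation_space = spaces.Discrete(size * size)
        self.pos = np.array([0, 0])

    def _distance(self, pos):
        # Manhattan distance from a cell to the goal (assumed distance function).
        return int(np.abs(self.goal - pos).sum())

    def _obs(self):
        # Flatten the (row, col) position into a single state index in [0, 15].
        return int(self.pos[0] * self.size + self.pos[1])

    def reset(self):
        self.pos = np.array([0, 0])
        return self._obs()

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
        old_dist = self._distance(self.pos)
        new_pos = np.clip(self.pos + moves[action], 0, self.size - 1)
        new_dist = self._distance(new_pos)
        self.pos = new_pos
        reward = 1 if new_dist < old_dist else -1     # closer: +1, otherwise: -1
        done = bool((self.pos == self.goal).all())    # episode ends at the goal
        return self._obs(), reward, done, {}

With this interface, env.reset() returns the starting state index and env.step(a) advances the agent by one cell, which is what the traversal and rendering functions described above build on.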
We implement Q-Learning and update the Q-table after every step, which lets the agent remember the value of the states along the paths it has learnt. Finally, we plot the trends of our findings using pyplot.

1.2 Implementation of policy function:

Our agent initially selects its action at random with a certain probability, called the exploration rate or epsilon. This is because, at first, it is better for the agent to try all kinds of actions before it starts to see patterns. When it is not choosing an action at random, the agent predicts the reward value for the current state and picks the action that gives the highest reward. We want the agent to take fewer random actions as training progresses, so we introduce an exponentially decaying epsilon that gradually shifts the agent from exploring the environment to exploiting what it has learnt.

Exploration vs Exploitation:

At each step we draw a random number and compare it with the current threshold epsilon. If the random number is lower than the threshold, we explore by taking a random action.

If the random number is greater than the threshold epsilon, we exploit the values the model has already learnt, as prescribed by Q-Learning. This switch between exploration and exploitation is key to avoiding unproductive and repetitive steps in the process.

1.3 Updating Q-Table:

Updating the Q-table is an important step of Q-Learning, since it records the rewards and penalties observed so far and serves as a lookup for future steps.

The update of the Q-table has two components: the old value obtained from the table lookup, and the learnt value, which depends on the observed reward and on hyperparameters such as the learning rate α and the discount factor γ. The update rule is:

Q(S_t, A_t) ← (1 - α) Q(S_t, A_t) + α (R_{t+1} + γ max_a Q(S_{t+1}, a))

Fig3: Q-value update equation

We have our agent follow what is called an ε-greedy strategy. Here, we introduce a parameter ε and set it initially to 1.

Fig4: Q-Learning flowchart

We repeat this process until the next state is the final state and there is no further learning or exploration to be done for our agent.

2 Training the model using OpenAI Gym

First, we reset the grid environment and store the initial state in a variable called obs. We then step through the grid until the done flag is set to true. In each step we store the current state and reward using a copy, and then move on to the next state, repeating the process.

Fig5: Training process flowchart

Training is complete after a series of gameplays carried out one after another, each attempting to reach the destination state and improving on the previous gameplay. Each such attempt is called an episode. For our implementation we have chosen 1000 episodes for the training process, and we gather intermediate results every 100 episodes.
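The pieces described in Sections 1.2 through 2 come together in a short training loop, sketched below. The sketch assumes the GridWorldEnv class sketched in Section 1.1; the hyperparameter values (learning rate, discount factor, epsilon decay rate) are illustrative placeholders rather than the exact values used in our experiments, while the episode count (1000) and reporting interval (100) follow the report.

import numpy as np

# Illustrative hyperparameters (placeholders, not the exact values used in this report).
ALPHA = 0.1          # learning rate
GAMMA = 0.95         # discount factor
EPS_START = 1.0      # initial exploration rate (epsilon)
EPS_MIN = 0.01       # lower bound on epsilon
DECAY_RATE = 0.001   # exponential decay rate for epsilon
EPISODES = 1000      # number of episodes, as in the report
REPORT_EVERY = 100   # gather intermediate results every 100 episodes

env = GridWorldEnv()
q_table = np.zeros((env.observation_space.n, env.action_space.n))

epsilon = EPS_START
rewards_per_episode = []

for episode in range(EPISODES):
    obs = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Exploration vs exploitation: explore if the random draw is below epsilon.
        if np.random.random() < epsilon:
            action = env.action_space.sample()          # explore: random action
        else:
            action = int(np.argmax(q_table[obs]))       # exploit: best known action

        next_obs, reward, done, _ = env.step(action)
        total_reward += reward

        # Q-table update: blend the old value with the newly learnt value (Fig3).
        old_value = q_table[obs, action]
        learnt_value = reward + GAMMA * np.max(q_table[next_obs])
        q_table[obs, action] = (1 - ALPHA) * old_value + ALPHA * learnt_value

        obs = next_obs

    # Exponentially decay epsilon so the agent shifts from exploration to exploitation.
    epsilon = max(EPS_MIN, EPS_START * np.exp(-DECAY_RATE * episode))
    rewards_per_episode.append(total_reward)

    if (episode + 1) % REPORT_EVERY == 0:
        print(f"episode {episode + 1}: epsilon={epsilon:.3f}, total reward={total_reward}")

The per-episode epsilon values and the accumulated rewards collected here are the quantities plotted with pyplot in the figures of the next section.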
3 RESULTS AND ANALYSIS

Epsilon vs Episode:

Fig6: Epsilon vs episode plot

We started with an ε-greedy strategy, with the initial ε value set to 1. We observe that ε decreased steadily, and approximately linearly, over the episode count, falling from 1 to about 0.4 across the episodes. This matches the intended decay schedule and thus indicates that training behaved as expected.

Fig7: Total reward vs episode plot

Our reward system gives +1 for each step that moves the agent closer to the final destination and a penalty of -1 for each step that moves it farther away. The rewards are accumulated over the steps of each episode. Our model makes many bad moves in the initial episodes, which explains the negative total rewards early on. Once it starts to avoid its earlier mistakes, it maximizes the rewards and minimizes the penalties.

Finally, once the optimal solution is reached, the agent remembers not to choose paths that incur penalties and follows only positively rewarded paths.

Our observed final total reward is 8, corresponding to an optimal run of 8 steps, each of which moves the agent closer to the goal.

Fig8: Path chosen by the agent in the optimal run

4 CONCLUSIONS

Machine learning techniques can aid in predicting the characteristics of real-world environments. In our study, the agent successfully learnt to traverse the 4x4 grid environment in the minimum number of steps, given a suitable number of episodes and learning rate. We observed that the epsilon value decreased steadily over the episodes and that our model, once it reached the maximum-reward paths, avoided penalty steps in the grid. This indicates that the learning performance of our machine was good. Factors such as too few episodes or extreme learning rates can lead to poor statistical metrics. Hence, by tuning the hyperparameters we can achieve better statistical metrics, which means better predictions of outcomes. We also observed several potential bottlenecks that can interfere with the performance of this machine, such as the processing power of the hardware (GPUs for image-based environments in particular), the size of the environment, and ambiguous reward/penalty assignments that cause the agent to get stuck in a loop.

Using machine learning to analyze and predict pathways through an environment effectively has been achieved. Expanding this research to explore more parameters and to increase the size and complexity of the environments would help obtain more accurate results in the future. Our aim is to extend this research into the domains of image processing and computer vision using machine learning algorithms aided by neural networks. In the near future, users may be able to apply the studies and methodologies used in this report on hand-held devices, on the go, while virtual reality and augmented reality simulations are running. Compression, encoding, and clustering can help speed up these computer vision processes while achieving higher accuracy. There are ample applications in the computer games industry, where machine learning is used to train game-playing models; a key example today is Dota 2, where OpenAI has defeated some of the world's best players. Beyond that, reinforcement learning is used in the automation industry and by the military. There is still scope for research in these advanced sciences, and active research is accelerating in this direction in terms of both hardware and software. Because, as they say, modern problems require modern solutions.
References

[1] Brockman, G., Cheung, V., Pettersson, J., Schneider, J., et al. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[2] Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
[3] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.
[4] Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13(3), 165-188.
[5] https://deeplizard.com/learn/video/HGeI30uATws
[6] https://blog.floydhub.com/an-introduction-to-q-learning-reinforcement-learning/
[7] http://gym.openai.com/docs/