Reinforcement learning of a 4x4 grid matrix
using OpenAI libraries
Sanjay
Department of Computer Science
University at Buffalo
Buffalo, NY 14260
sanjay@buffalo.edu
Abstract
This report studies and implements OpenAI-driven
reinforcement learning algorithms. The purpose of the project
is to demonstrate a reinforcement learning task in which the
machine learns to reach its target point along the shortest possible
path in a 4x4-state grid environment. Q-learning methods are
used to carry out the steps of the learning process.
Reward/penalty-driven learning is the core objective of
reinforcement learning and is showcased in this project.
1 INTRODUCTION

Reinforcement learning is a machine learning technique where the agent learns in an
interactive environment using a trial and error approach.
For the action taken by the agent, a positive or a negative reward
will be given. Unlike supervised learning, reinforcement
learning creates its training data by exploring the
environment and exploits it in the future to attain rewards.
The important components of reinforcement learning are:
• Environment: the physical entity where the agent takes
its actions.
• State: the current situation of the agent.
• Reward: the positive or negative feedback given by the
environment.
• Policy: the control action sequence taken by the agent
to reach the target.
Fig1: Reinforcement learning illustrated
Reinforcement learning is the study of how an agent can interact with its
environment to learn a policy which maximizes the expected cumulative reward for
a task. Q-Learning, or Quality Learning, is a reinforcement learning method which
stores the usefulness of a given action for gaining future rewards. The
decision process used in this method is known as a Markov Decision Process
(MDP).
A Markov Decision Process is based on the Markov property, i.e. the future is
independent of the past given the present. This means that the actions taken in the
future depend completely on the present state and not on the past history, because
the Markov property is followed at every state. The mathematical
representation of the Markov property is as below.
P_{ss'} = P[S_{t+1} = s' | S_t = s]
OpenAI Gym is a toolkit for reinforcement learning research. It includes a
growing collection of benchmark problems that expose a common interface, and a
website where people can share their results and compare the performance of
algorithms. The Gym whitepaper [1] discusses the components of OpenAI Gym and the
design decisions that went into the software.
1.1 METHODOLOGY AND EXPERIMENTS
The first and foremost step of our study is to set up the game grid environment.
We then define the dimensions, step, reset and distance functions that help in
traversal and rendering. Below are a few of the significant Python libraries which
have been used to aid this study:
1) NumPy
2) OpenAI Gym
3) Threading
4) matplotlib.pyplot
Fig2: Initial state of grid world environment
The agent will work within an action space consisting of four actions: up, down,
left, right. At each time step, the agent will take one action and move in the
direction described by the action. The agent will receive a reward of +1 for
moving closer to the goal and −1 for moving away or remaining the same distance
from the goal.
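A minimal sketch of such an environment is given below. The GridWorldEnv class name, the Manhattan-distance helper and the exact reward shaping are illustrative assumptions built on the classic gym.Env interface, not the exact code used in this study.

```python
import numpy as np
import gym
from gym import spaces


class GridWorldEnv(gym.Env):
    """Hypothetical 4x4 grid world: the agent starts at (0, 0) and must reach (3, 3)."""

    def __init__(self, size=4):
        super().__init__()
        self.size = size
        self.goal = np.array([size - 1, size - 1])
        self.action_space = spaces.Discrete(4)           # 0=up, 1=down, 2=left, 3=right
        self.observation_space = spaces.Discrete(size * size)
        self.pos = np.array([0, 0])

    def _distance(self, pos):
        # Manhattan distance from the current cell to the goal cell
        return int(np.abs(self.goal - pos).sum())

    def _state(self):
        # Flatten (row, col) into a single state index 0..15
        return int(self.pos[0] * self.size + self.pos[1])

    def reset(self):
        self.pos = np.array([0, 0])
        return self._state()

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        old_dist = self._distance(self.pos)
        self.pos = np.clip(self.pos + np.array(moves[action]), 0, self.size - 1)
        # +1 for moving closer to the goal, -1 for moving away or staying the same distance
        reward = 1 if self._distance(self.pos) < old_dist else -1
        done = bool((self.pos == self.goal).all())
        return self._state(), reward, done, {}
```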
We implement Q-Learning and update the Q-Table, which helps the agent
remember the values of the states along the paths it has learnt.
Finally, we plot graphs of the trends in our findings using pyplot.
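As an illustration, the per-episode trends can be plotted with pyplot roughly as follows; the plot_trend helper and the history lists are assumed names, not part of the original code.

```python
import matplotlib.pyplot as plt

def plot_trend(values, ylabel):
    """Plot a per-episode quantity (e.g. epsilon or total reward) collected during training."""
    plt.plot(values)
    plt.xlabel("Episode")
    plt.ylabel(ylabel)
    plt.title(f"{ylabel} vs Episode")
    plt.show()

# e.g. plot_trend(epsilon_history, "Epsilon") and plot_trend(reward_history, "Total reward")
```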
1.2 Implementation of policy function
Our agent will at first select its actions randomly with a certain probability,
called the 'exploration rate' or 'epsilon'. This is because, at first, it is better for
the agent to try all kinds of things before it starts to see the patterns. When it
is not deciding the action randomly, the agent will predict the reward value
based on the current state and pick the action that gives the highest
reward. We want our agent to decrease the number of random actions as it
goes, so we introduce an exponentially decaying epsilon that eventually shifts
the agent from exploring the environment to exploiting what it has learnt.
Exploration vs Exploitation:
We carry out exploration whenever a randomly generated value falls below the
threshold epsilon.
If the randomly generated value is greater than the threshold epsilon, we
exploit the values learnt by the model so far, as in standard Q-Learning. This
switch between exploration and exploitation is key to avoiding
unproductive and repetitive steps in our process.
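A sketch of this ε-greedy selection and its decay is given below; the choose_action and decay_epsilon names and the decay constants are illustrative assumptions.

```python
import numpy as np

def choose_action(q_table, state, epsilon, n_actions=4):
    """Explore with probability epsilon, otherwise exploit the current Q-values."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)       # exploration: random action
    return int(np.argmax(q_table[state]))         # exploitation: best known action

def decay_epsilon(episode, eps_min=0.01, eps_max=1.0, decay_rate=0.001):
    """Exponentially decay epsilon from eps_max toward eps_min over episodes."""
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * episode)
```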
1.3 Updating Q-Table
Updating the Q-table is an important step in Q-Learning since it lets us
record rewards and punishments and look up values for future steps.
The update of a Q-Table entry has two components: the old value obtained from the
lookup, and the newly learnt value, which depends on the reward and other
hyper-parameters. Below is an equation showing this:
Fig3: Q-Value updating equation.
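For reference, the standard tabular Q-learning update takes the form

Q(s, a) ← (1 − α) · Q(s, a) + α · [ r + γ · max_a' Q(s', a') ],

where α is the learning rate, γ the discount factor, r the received reward and s' the next state; this matches the "old value" and "learnt value" components described above.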
We have our agent follow what is called an ε-greedy strategy. Here, we
introduce a parameter ε and set it initially to 1.
Fig4: Q-Learning flowchart
We repeat this process until the next state is the final state and there is no
further learning or exploration to be done for our agent.
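As a sketch, assuming a NumPy Q-table of shape (16, 4) for the 16 grid states and 4 actions, and illustrative values for the learning rate α and discount factor γ:

```python
import numpy as np

N_STATES, N_ACTIONS = 16, 4          # 4x4 grid, four moves
ALPHA, GAMMA = 0.1, 0.9              # illustrative learning rate and discount factor

q_table = np.zeros((N_STATES, N_ACTIONS))

def update_q(q_table, state, action, reward, next_state, done):
    """Blend the old Q-value with the newly learnt value (reward plus discounted best future value)."""
    best_next = 0.0 if done else np.max(q_table[next_state])
    learnt_value = reward + GAMMA * best_next
    q_table[state, action] = (1 - ALPHA) * q_table[state, action] + ALPHA * learnt_value
```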
2 Training the model using OPENAI GYM

First, we initialize the primary state of the grid environment and assign it to a
variable called obs. We explore steps on the grid using a function that keeps
exploring until a done flag is set to true. In each step we store the current state
and reward using a copy function and then seek the next state to repeat the
process.
Fig5: Training process flowchart
Training is complete when a series of gameplays is made one
after another to reach the destination state, comparing each run against the
previous gameplay. Each such attempt is called an episode.
For our implementation we have chosen 1000 episodes for the training
process, and we gather intermediate results every 100 episodes.
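Putting these pieces together, a hedged sketch of the training loop might look as follows. It reuses the hypothetical GridWorldEnv, q_table, choose_action, decay_epsilon and update_q helpers sketched earlier; the 1000-episode count and 100-episode reporting interval follow the report, while the per-episode step cap is an illustrative safeguard.

```python
env = GridWorldEnv()
epsilon = 1.0
epsilon_history, reward_history = [], []

for episode in range(1000):                          # 1000 training episodes, as in the report
    obs = env.reset()                                # primary state of the grid environment
    total_reward, done = 0, False
    for _step in range(100):                         # illustrative safety cap on steps per episode
        action = choose_action(q_table, obs, epsilon)
        next_obs, reward, done, _ = env.step(action)
        update_q(q_table, obs, action, reward, next_obs, done)
        total_reward += reward
        obs = next_obs
        if done:                                     # goal (final) state reached
            break
    epsilon = decay_epsilon(episode)
    epsilon_history.append(epsilon)
    reward_history.append(total_reward)
    if (episode + 1) % 100 == 0:                     # gather intermediate results every 100 episodes
        print(f"episode {episode + 1}: total reward = {total_reward}, epsilon = {epsilon:.3f}")
```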
4 RESULTS AND ANALYSIS

Epsilon vs Episode:
Fig6: epsilon vs episode plot
We started with the ε-greedy strategy and the initial ε value set to 1. We observe
that ε decreased approximately linearly over the episode count, which is consistent
with the expectation that ε decreases steadily as the Q-function is learnt. Our ε
value reduced from 1 to 0.4 over the episodes; the trend matches the expected
behaviour and thus validates the training.
Fig7: Total Reward vs episode plot
Our training reward system gives a reward of +1 for each step that moves the agent
closer to the final destination and a penalty of -1 for each step that moves it
farther away. The rewards are accumulated over the steps in an episode. Our model
makes many bad moves in the initial episodes, hence the negative values of total
reward are expected. Once it starts to avoid the mistakes it has made, it tries to
maximize the rewards and minimize the penalties.
Finally, once the optimal solution is reached, the agent remembers not to choose
paths where penalties exist and moves forward along positive-reward paths only.
Our observed final total reward: 8 steps
Fig8: Path chosen by agent in optimal run
5 CONCLUSIONS
Characteristics of real-world environments can be predicted with the aid of machine learning
techniques. In our study the agent successfully navigated the 4x4 grid environment in the
minimum number of steps, given a suitable number of episodes and learning rate. We have observed
that the epsilon value decreased linearly over the episodes and that our model, once it
reached the maximum-reward paths, avoided penalty steps in the grid. This indicates that the
learning performance of our model has been good. Factors like too few episodes or extreme
learning rates can lead to poor statistical metrics; hence, by tuning the hyper-parameters we can
achieve better statistical metrics, which means better predictions of outcomes. It is also
observed that multiple bottlenecks can interfere with the performance of this system, such as the
processing power of the machine (GPUs for image inputs in particular), the size of the
environment, and ambiguous reward/penalty steps that cause the agent to get stuck in a loop.
Using machine learning to analyze and predict pathways through environments effectively has been
achieved. Expanding further research to find more parameters and increasing the size and
complexity of the datasets would help obtain more accurate results in the future. Our aim
would be to extend this research into the domains of image processing and computer vision using
machine learning algorithms aided by neural networks. In the near future users would be able to
apply the studies and methodologies used in this report on hand-held devices, on the go, while
Virtual Reality and Augmented Reality simulations are running. Compression, encoding and
clustering can help speed up these processes in computer vision with higher accuracy. There are
ample applications in the computer games industry where machine learning is used to train models;
a key example today is Dota 2, where OpenAI has successfully defeated the world's best players.
Apart from this, reinforcement learning is being used in the automation industry and the military.
There is still scope for research in these advanced sciences, and active research is accelerating
in this direction in terms of both hardware and software. Because, as they say, modern problems
require modern solutions.
References
[1] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., et al. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[2] Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction.
[3] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.
[4] Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13(3), 165-188.
[5] https://deeplizard.com/learn/video/HGeI30uATws
[6] https://blog.floydhub.com/an-introduction-to-q-learning-reinforcement-learning/
[7] http://gym.openai.com/docs/