
Reinforcement Learning — Machines learning by interacting with the world

The Perceptive Agent · Published in Analytics Vidhya · Apr 13, 2020 · 7 min read
In the past few years, the field of Artificial Intelligence has been booming. It has
made impressive advances and has enabled computers to continuously challenge
human performance in various domains. Many of us are familiar with AlphaGo
which was the first computer program to beat a professional player in the game of
Go without any handicaps. Its successor, AlphaZero, is currently perceived as the
world's top player in Go and possibly in chess as well.
But Reinforcement Learning (RL) is not just good at games. It has applications in
finance and cyber security, and can even teach machines to paint. This post is the first
one in a series of posts explaining important concepts of Reinforcement Learning.
I am writing this series of articles to cement my understanding of the concepts of
RL as I go through the book Reinforcement Learning: An Introduction by Richard
S. Sutton and Andrew Barto. I will write up my understanding of the various concepts,
complete with the associated programming tasks.
What is Reinforcement Learning?
Suppose we have an agent in an environment whose dynamics are
completely unknown to the agent. The agent can interact with the environment by
taking certain actions, and the environment, in turn, returns rewards
for those actions. The agent ought to maximize the total reward accumulated during its
episode of interaction with the environment. For example, a bot playing a game, or
a robot at a restaurant that is rewarded for cleaning tables after the customer leaves.
Figure: An agent takes actions to interact with the environment and is rewarded in turn.
The goal is to make sure the agent learns a strategy from the trials and feedback
received to maximize the rewards.
Key Concepts
Before proceeding, let us define some key concepts. The agent is acting in an
environment. The environment’s reaction to the agent’s interactions is defined by a
model of the environment. At any given point of time, the agent is in a state (s∈S)
and can take any action from a set of actions (a∈A). Upon taking an action, the
agent transitions from the state s to s’. The probability of transitioning from s to s’ is
given by the transition function. The environment gives the agent a reward drawn from a set of
rewards (r∈R). The strategy by which an agent takes an action in a state is called the
policy π(s).
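To make these ideas concrete, here is a minimal sketch of the agent-environment loop in Python. The toy environment and the random policy below are hypothetical and purely for illustration; they are not from the book or any library.

import random

# A toy environment (hypothetical): the agent starts at position 0 and must
# reach position 4. Actions: 0 = step left, 1 = step right. The reward is +1
# on reaching the goal and 0 otherwise.
class LineWorld:
    def __init__(self, goal=4):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Move left or right, never below position 0.
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == self.goal
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# One episode of interaction under a purely random policy.
env = LineWorld()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])          # the policy picks an action a ∈ A
    state, reward, done = env.step(action)  # the environment returns s' and r
    total_reward += reward

print("Total reward for this episode:", total_reward)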
While designing an RL agent, the agent may or may not be familiar with the
model of the environment. Hence, two different settings arise:
1. Model-based RL: The agent is familiar with the complete model of the
environment or learns about it during its interactions with the environment. Here,
if the complete model is known, the optimal solution can be found using Dynamic
Programming.
2. Model-free RL: The agent learns a strategy to interact with the environment
without any knowledge of the model and does not try to learn the environment's
model.
The goal of the agent is to take actions so as to maximize the total reward. Each state
is associated with a value function V(s) predicting the expected amount of future
rewards we are able to receive in this state by following the corresponding policy. In
other words, the value function quantifies how good a state is.
Figure: A taxonomy of RL approaches, with the names of algorithms for each type of approach.
A sequence of interaction between the agent and the environment is known as an
episode (also called “trajectory” or “trial”). An episode is composed of states,
actions and rewards at times t = 1, 2, …, T. At time t, the state, the action taken
and the reward observed are denoted by Sₜ, Aₜ and Rₜ, respectively. Thus an episode is
composed of the following sequence: S₁, A₁, R₁, S₂, A₂, R₂, …, S_T.
Some other key terms used commonly:
1. On-policy: Use the deterministic outcomes or samples from the target policy to
train the algorithm. The target policy is the policy that the agent will actually use
when it is put into action, as opposed to while it is being trained.
2. Off-policy: Train on a distribution of transitions or episodes produced by a
different behaviour policy rather than by the target policy.
What is a model of the environment?
Suppose we are training a robot to walk to a far off point. Formulating it as a very
basic task, suppose we want the agent to control the various parts of the robot so that
it walks upright. The robot should not deviate more than 20°
from the vertical axis. The robot is rewarded for every time step it stays within this
angular criterion, and the closer it gets to the destination, the higher the reward gets.
Here, the model of the environment reacts to every action taken by the robot while
incorporating the effects of gravity, momentum, etc., and then returns the next state
to the agent. Hence, all the factors that determine the next state that the agent
transitions to and the reward it gets are part of the model of the environment.
A model has two major parts: a transition function P and a reward function R.
Let’s say that when we are in state s, we decide to take action a, arrive in the next state
s’ and obtain reward r. This is known as one transition step, represented by the tuple
(s, a, s’, r).
The transition function P records the probability of transitioning from state s to s’
after taking action a while obtaining reward r.
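In the standard notation, this is:
P(s’, r | s, a) = ℙ[Sₜ₊₁ = s’, Rₜ₊₁ = r | Sₜ = s, Aₜ = a]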
From this, we can determine the state transition function:
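Summing the above over all possible rewards r gives the probability of landing in s’ regardless of the reward received:
P(s’ | s, a) = ℙ[Sₜ₊₁ = s’ | Sₜ = s, Aₜ = a] = Σᵣ P(s’, r | s, a)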
The reward function R is the expected reward received on taking action a in
state s.
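In symbols:
R(s, a) = 𝔼[Rₜ₊₁ | Sₜ = s, Aₜ = a] = Σᵣ r · Σₛ’ P(s’, r | s, a), summing over all rewards r and next states s’.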
Policy — The agent’s strategy
The policy (π) is what determines the agent’s behaviour, i.e. the action a the agent
takes in a state s. The policy can be:
1. Deterministic: For every state, there is a single action defined that the agent will
take in that state. π(s) = a
2. Stochastic: The policy returns the probability of taking each action for all the
possible actions in state s. (Seems like Neural Networks might be useful here?).
π(a|s) = ℙ[A=a|S=s].
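As a small illustration (my own sketch, not from the post), here is how the two kinds of policy might look in Python for a tiny discrete problem; the states, actions and preference scores are made up:

import math
import random

ACTIONS = ["left", "right", "stay"]

# 1. Deterministic policy: a simple lookup table, one action per state.
deterministic_policy = {0: "right", 1: "right", 2: "stay"}
print(deterministic_policy[0])  # π(s) = a

# 2. Stochastic policy: a probability distribution over actions for each state,
#    here obtained by a softmax over arbitrary preference scores.
preferences = {0: [0.1, 2.0, -1.0], 1: [0.0, 0.0, 0.0], 2: [1.0, 1.0, 3.0]}

def stochastic_policy(state):
    exp_scores = [math.exp(z) for z in preferences[state]]
    total = sum(exp_scores)
    probs = [z / total for z in exp_scores]  # π(a|s) for each action a
    return random.choices(ACTIONS, weights=probs)[0]

print(stochastic_policy(0))  # samples an action according to π(a|s)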
Value Function — How good is the state I am in
For every state, there is a value function that determines the total future reward that
can be obtained. The future reward, also known as the return, is the total sum of
discounted rewards going forward. The return is denoted by Gₜ.
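Written out, with discount factor γ:
Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … = Σₖ γᵏ Rₜ₊ₖ₊₁, summing over k = 0, 1, 2, …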
The discounting factor γ∈[0,1] penalizes rewards in the future, because:
Future rewards may have higher uncertainty (e.g. the stock market).
Future rewards do not provide immediate benefits.
Discounting provides mathematical convenience, i.e. we don’t need to track
future steps forever to compute the return.
State-value is the expected return that can be obtained when we are in state s at
time t.
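Under a policy π, this is:
V(s) = 𝔼[Gₜ | Sₜ = s], where the expectation is taken over trajectories generated by following π.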
Similarly, we also have an action-value. It is the expected return by taking action a
in state s at time t. It is also called the Q-value.
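Likewise:
Q(s, a) = 𝔼[Gₜ | Sₜ = s, Aₜ = a]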
There is a way by which we can determine V(s) from Q(s, a). What if we take the
action values of all the possible actions in a state and multiply them with the
probability of taking that action in the state? That is exactly what we do:
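V(s) = Σₐ π(a|s) Q(s, a), where the sum runs over all actions a∈A.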
Another cool thing is the advantage function. It is the difference between the Q
value for an action a in state s and the value of the state s. You can think of it like
this: I know that in my current state, I can expect a certain reward. Now, if I take an
action, how much better a position does it place me in?
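Formally:
A(s, a) = Q(s, a) − V(s)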
Optimal Value and Policy
Since we are talking about learning a policy to maximize our rewards, there must be
some “optimal” form. Right? Well there is.
The optimal value function is the value function associated with the policy π that
yields the maximum return.
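That is:
V*(s) = max_π V_π(s), the largest state-value achievable under any policy π.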
And similarly,
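Q*(s, a) = max_π Q_π(s, a)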
And obviously, the optimal policy is what the agent tries to learn: the policy that
takes the best possible action in every state so as to maximize the return.
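One common way to write it: once the optimal action-values Q* are known, acting greedily with respect to them is optimal, i.e. π*(s) = argmaxₐ Q*(s, a).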
Ending notes:
That is it for this tutorial. See you in the next one!
References:
Best Deep Reinforcement Learning Research of 2019 So Far (medium.com)
https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html#key-concepts