
UNIT V

Advanced Learning
Content
• Reinforcement Learning
• K-armed bandit - Elements
• Model-based Learning
• Value iteration
• Policy iteration
• Temporal difference Learning
• Exploration Strategies
• Deterministic and Non-Deterministic Rewards & Actions
• Semi-supervised Learning
• Computational Learning Theory
– VC Dimension
– PAC Learning
Reinforcement Learning
• What is Reinforcement Learning? How does it
compare with other ML techniques?
– Reinforcement Learning (RL) is a type of machine
learning technique that enables an agent to learn
in an interactive environment by trial and error,
using feedback from its own actions and
experiences.
• Though both supervised and reinforcement
learning use a mapping between input and
output, they differ in the feedback given to the
agent: in supervised learning the feedback is the
correct set of actions for performing a task,
whereas reinforcement learning uses rewards
and punishments as signals for positive and
negative behaviour.
• Compared to unsupervised learning,
reinforcement learning differs in its goal. While
the goal in unsupervised learning is to find
similarities and differences between data points,
the goal in reinforcement learning is to find a
suitable action model that maximizes the total
cumulative reward of the agent.
How to formulate a basic
Reinforcement Learning problem?
• Some key terms that describe the basic
elements of an RL problem are (a minimal
agent-environment loop sketch follows this list):
– Environment — Physical world in which the agent
operates
– State — Current situation of the agent
– Reward — Feedback from the environment
– Policy — Method to map agent’s state to actions
– Value — Future reward that an agent would
receive by taking an action in a particular state
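To make these elements concrete, here is a minimal Python sketch of a generic agent-environment loop; the reset()/step() interface, the action names and the random policy are illustrative assumptions, not part of the original material.

import random

# A minimal agent-environment loop, assuming an environment object with
# reset() and step(action) methods (a common Gym-style convention).
# All names here are illustrative.
def run_episode(env, policy, max_steps=100):
    state = env.reset()                       # State: current situation of the agent
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                # Policy: maps the agent's state to an action
        state, reward, done = env.step(action)   # Reward: feedback from the environment
        total_reward += reward                # accumulate the reward the agent receives
        if done:
            break
    return total_reward

# Example policy: choose uniformly among two illustrative actions.
def random_policy(state):
    return random.choice(["left", "right"])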
Multi-Armed Bandits and
Reinforcement Learning
• The multi-armed bandit is a classic
reinforcement learning problem in which a player
faces k slot machines (bandits), each with a
different reward distribution, and tries to
maximise the cumulative reward over a series of
trials.
Formulation
• Let’s dive into the problem directly. There are 3
key components in a reinforcement learning
problem: state, action and reward. Recall the
problem: k machines are placed in front of you,
and in each episode you choose one machine and
pull its handle; by taking this action, you get a
reward accordingly. So the state is the current
estimate of all the machines (all zeros at the
beginning), the action is the machine you decide
to choose at each episode, and the reward is the
result or payout after you pull the handle.
• Actions
– Choosing an Action
– To identify the machine with the most reward, the
straightforward way is to try each machine as many
times as possible until we have a certain confidence
in each estimate, and then stick to the best estimated
machine from then on. However, we can probably
test in a wiser way. Since our goal is to maximise the
reward along the way, we should not waste too much
time on a machine that always gives a low reward; on
the other hand, even if we come across a palatable
reward from a certain machine, we should still
explore other machines in the hope that some
under-explored machine could give us a higher
reward. A minimal sketch of estimating action values
from such trials is given below.
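As a hedged sketch of this idea, the Python snippet below simulates a k-armed bandit and keeps a running sample-average estimate of each machine's reward; the Gaussian reward model and the simple try-then-commit loop are illustrative assumptions, not part of the original material.

import random

K = 5   # number of slot machines (arms); illustrative choice

# Assumed reward model: each machine pays out a noisy reward around a
# hidden mean that is unknown to the player.
true_means = [random.uniform(0, 1) for _ in range(K)]

def pull(machine):
    return random.gauss(true_means[machine], 0.1)

# The player's state: current estimate of each machine (zeros at first)
# and how many times each machine has been tried.
estimates = [0.0] * K
counts = [0] * K

def update_estimate(machine, reward):
    # Incremental sample-average update: Q_new = Q_old + (1/n)(r - Q_old)
    counts[machine] += 1
    estimates[machine] += (reward - estimates[machine]) / counts[machine]

# Try every machine a few times, then commit to the best estimate.
for machine in range(K):
    for _ in range(10):
        update_estimate(machine, pull(machine))

best = max(range(K), key=lambda m: estimates[m])
print("best estimated machine:", best, "estimate:", round(estimates[best], 3))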
What is model-based machine
learning?
• The field of machine learning has seen the
development of thousands of learning
algorithms. Typically, scientists choose from
these algorithms to solve specific problems, and
their choices are often limited by their familiarity
with them. In this classical/traditional framework
of machine learning, scientists are constrained to
making assumptions that fit an existing
algorithm. This is in contrast to the model-based
machine learning (MBML) approach, which seeks
to create a bespoke solution tailored to each new
problem.
• The goal of MBML is "to provide a single
development framework which supports the
creation of a wide range of bespoke models".
This framework emerged from an important
convergence of three key ideas:
• the adoption of a Bayesian viewpoint,
• the use of factor graphs (a type of a probabilistic
graphical model), and
• the application of fast, deterministic, efficient and
approximate inference algorithms.
• The core idea is that all assumptions about the
problem domain are made explicit in the form
of a model. In this framework, a model is
simply a set of assumptions about the world
expressed in a probabilistic graphical format
with all the parameters and variables
expressed as random components.
• Stages of MBML
– There are 3 steps to model-based machine learning,
namely (a minimal worked example follows this list):
– Describe the Model: Describe the process that
generated the data using factor graphs.
– Condition on Observed Data: Fix the observed
variables to their known quantities.
– Perform Inference: Perform backward reasoning to
update the prior distribution over the latent variables
or parameters. In other words, calculate the posterior
probability distributions of latent variables
conditioned on observed variables.
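As a hedged illustration of these three stages, the small Python example below uses a Beta-Bernoulli model (a standard conjugate pair chosen here for simplicity, not taken from the original material): the model is described by a prior and a likelihood, it is conditioned on observed coin flips, and inference yields the posterior over the latent bias.

# Stage 1 - Describe the model: a coin with unknown bias theta,
# theta ~ Beta(a, b) prior, each flip ~ Bernoulli(theta).
prior_a, prior_b = 1.0, 1.0   # uniform prior over the bias (assumption)

# Stage 2 - Condition on observed data: fix the observed variables to
# their known quantities.
observed_flips = [1, 0, 1, 1, 0, 1, 1, 1]   # illustrative data
heads = sum(observed_flips)
tails = len(observed_flips) - heads

# Stage 3 - Perform inference: because the Beta prior is conjugate to the
# Bernoulli likelihood, the posterior over the latent theta is again a Beta.
post_a = prior_a + heads
post_b = prior_b + tails
posterior_mean = post_a / (post_a + post_b)
print(f"posterior over bias: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")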
Value Iteration
• Value iteration computes the optimal state value
function by iteratively improving the estimate of
V(s). The algorithm initializes V(s) to arbitrary
random values. It repeatedly updates the Q(s, a)
and V(s) values until they converge. Value
iteration is guaranteed to converge to the
optimal values.
Pseudo code for value-iteration algorithm. Credit: Alpaydin
Introduction to Machine Learning, 3rd edition
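Since the pseudocode figure is not reproduced here, the following Python sketch shows one common way value iteration can be implemented for a small MDP; the toy transition model, discount factor and convergence threshold are illustrative assumptions, not the textbook's exact pseudocode.

# Value iteration on a tiny MDP, assumed to be given as:
#   P[s][a] = list of (probability, next_state, reward) tuples.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)], "go": [(1.0, "s0", 2.0)]},
}
gamma, theta = 0.9, 1e-6          # discount factor and convergence threshold

V = {s: 0.0 for s in P}           # initialize V(s) (here to zeros)
while True:
    delta = 0.0
    for s in P:
        # Q(s, a) = sum over outcomes of p * (r + gamma * V(s'))
        q_values = [
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        ]
        new_v = max(q_values)     # V(s) = max over a of Q(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:             # stop when the values have converged
        break

print(V)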
Policy Iteration
• The value-iteration algorithm keeps improving the
value function at each iteration until the value
function converges. Since the agent only cares
about finding the optimal policy, the optimal
policy sometimes converges before the value
function. Therefore, another algorithm called
policy iteration, instead of repeatedly improving
the value-function estimate, re-defines the policy
at each step and computes the value function
according to this new policy until the policy
converges. Policy iteration is also guaranteed to
converge to the optimal policy, and it often takes
fewer iterations to converge than the
value-iteration algorithm.
Pseudo code for policy-iteration algorithm. Credit: Alpaydin Introduction
to Machine Learning, 3rd edition.
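Again, since the pseudocode figure is not reproduced, here is a hedged Python sketch of policy iteration for the same style of small MDP as above; the alternation between policy evaluation and greedy policy improvement follows the standard scheme, while the helper names and thresholds are illustrative.

# Policy iteration for a small MDP with the same structure as above:
#   P[s][a] = list of (probability, next_state, reward) tuples.
def policy_iteration(P, gamma=0.9, theta=1e-6):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    V = {s: 0.0 for s in P}

    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: compute V for the current policy.
        while True:
            delta = 0.0
            for s in P:
                new_v = q(s, policy[s])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                 # stop once the policy no longer changes
            return policy, V

# Usage (reusing the toy MDP P from the value-iteration sketch):
# policy, V = policy_iteration(P)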
Value-Iteration vs Policy-Iteration
• Both value-iteration and policy-iteration
algorithms can be used for offline planning, where
the agent is assumed to have prior knowledge
about the effects of its actions on the
environment (they assume the MDP model is
known). Compared with value iteration, policy
iteration is often more computationally efficient,
as it usually takes considerably fewer iterations to
converge, although each of its iterations is more
computationally expensive.
Temporal difference Learning
• Temporal difference (TD) learning is a method in
which an agent learns from an environment
through episodes, with no prior knowledge of the
environment.
– Algorithms
• TD(0)
• TD(1)
• TD(λ)
• Notation for some of the hyperparameters (Greek
letters that are sometimes intimidating); a small
TD(0) update sketch follows this list.
• Gamma (γ): the discount rate. A value between 0 and 1. The higher
the value the less you are discounting.
• Lambda (λ): the credit assignment variable. A value between 0 and
1. The higher the value the more credit you can assign to further
back states and actions.
• Alpha (α): the learning rate. How much of the error we
accept and therefore how far we adjust our estimates. A
value between 0 and 1. A higher value adjusts
aggressively, accepting more of the error, while a smaller
value adjusts conservatively, moving only a little towards
the actual values.
• Delta (δ): a change or difference in value.
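To make this notation concrete, here is a hedged Python sketch of the TD(0) update using the α, γ and δ defined above; the dictionary of state values and the example transition are illustrative assumptions.

# TD(0) update for a single observed transition (state, reward, next_state).
V = {"A": 0.0, "B": 0.0, "C": 0.0}   # illustrative value estimates
alpha, gamma = 0.1, 0.9              # learning rate and discount rate

def td0_update(state, reward, next_state):
    delta = reward + gamma * V[next_state] - V[state]   # delta: the TD error
    V[state] += alpha * delta        # adjust the estimate towards the target
    return delta

# Example: we were in "B", received reward 1, and landed in "C".
td0_update("B", 1.0, "C")
print(V)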
• TD(1) Algorithm
• The first algorithm we’ll start with is TD(1). TD(1)
makes an update to our values in the same
manner as Monte Carlo, at the end of an episode.
So back to our random walk, going left or right
randomly until landing in ‘A’ or ‘G’: once the
episode ends, the update is made to the prior
states. As mentioned above, the higher the
lambda value, the further back credit can be
assigned, and here we have the extreme case with
lambda equal to 1. This is an important distinction,
because TD(1) and MC only work in episodic
environments, meaning they need a ‘finish line’ to
make an update. A small end-of-episode update
sketch is given below.
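As a hedged sketch of this end-of-episode style of update, the snippet below computes the discounted return for each visited state once the episode has finished and moves each estimate towards that return; the episode data and step size are illustrative assumptions.

# End-of-episode (Monte Carlo / TD(1)-style) update: each visited state is
# moved towards the discounted return observed from that point onward.
V = {"B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0, "F": 0.0}   # illustrative estimates
alpha, gamma = 0.1, 1.0

# One finished episode of the random walk: (state, reward) pairs, where the
# final step reached the terminal state and paid a reward of 1.
episode = [("D", 0.0), ("E", 0.0), ("D", 0.0), ("E", 0.0), ("F", 1.0)]

G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G               # discounted return from this state on
    V[state] += alpha * (G - V[state])   # update only once the episode has ended

print(V)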
• So that’s temporal difference learning in a
simplified manner, I hope. The gist of it is that we
make an initial estimate, explore a space, and
update our prior estimate based on our
exploration efforts. The difficult part of
Reinforcement Learning tends to be where to
apply it, what the environment is, how to set up
the rewards properly, and so on; but at least for
now you understand the exploration of a state
space and making estimates with an
unsupervised, model-free approach.
Exploration Strategies
• The classical approach to any reinforcement
learning (RL) problem is to explore and to exploit:
explore until you find the most rewarding way to
reach the target, then keep on exploiting that
action. Exploration, however, is hard. Without
proper reward functions, the algorithms can end
up chasing their own tails for eternity. When we
say rewards, think of them as mathematical
functions crafted carefully to nudge the algorithm.
To be more precise, consider teaching a robotic
arm, or an AI playing a strategic game like Go or
Chess, to reach a target on its own.
Curiosity Based Exploration
• First introduced by Dr Juergen Schmidhuber in 1991,
curiosity in RL models was implemented through a
framework of curious neural controllers. This described
how a particular algorithm can be driven by curiosity
and boredom. This was done by introducing (delayed)
reinforcement for actions that increase the model
network’s knowledge about the world. This, in turn,
requires the model network to model its own
ignorance, thus showing a rudimentary form of
self-introspective behaviour.
Epsilon-greedy
• As the name suggests, the objective of this
approach is to identify a promising action and
keep on exploiting it ‘greedily’. This approach is
popularly associated with the multi-armed bandit
problem, a simplified RL problem where the
agent has to find the best slot machine to make
the most money. The agent randomly explores
with probability ϵ and takes the optimal action
most of the time, with probability 1−ϵ, as in the
sketch below.
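A hedged Python sketch of ϵ-greedy action selection, reusing the sample-average estimates idea from the bandit section; the estimates, number of arms and ϵ value here are illustrative assumptions.

import random

def epsilon_greedy(estimates, epsilon=0.1):
    # With probability epsilon, explore: pick a random arm.
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    # Otherwise, exploit: pick the arm with the highest estimated value.
    return max(range(len(estimates)), key=lambda a: estimates[a])

# Example usage with illustrative estimates for 5 arms.
estimates = [0.2, 0.5, 0.1, 0.4, 0.3]
print("chosen arm:", epsilon_greedy(estimates, epsilon=0.1))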
Deterministic and Non-Deterministic
Rewards & Actions
Semi-supervised Learning
• Every machine learning algorithm needs data to learn
from. But even with tons of data in the world, including
texts, images, time-series, and more, only a small
fraction is actually labeled, whether algorithmically or
by hand.
• Most of the time, we need labeled data to do
supervised machine learning. In particular, we use
it to predict the label of each data point with the
model. Since the data tells us what the label
should be, we can calculate the difference
between the prediction and the label, and then
minimize that difference.
• As you might know, another category of
algorithms called unsupervised algorithms
don’t need labels but can learn from
unlabeled data. Unsupervised learning often
works well to discover new patterns in a
dataset and to cluster the data into several
categories based on several features. Popular
examples are K-Means and Latent Dirichlet
Allocation (LDA) algorithms.
• Now imagine you want to train a model to
classify text documents, but you want to give
your algorithm a hint about how to construct the
categories. You want to use only a very small
portion of labeled text documents, because not
every document is labeled, and at the same time
you want your model to classify the unlabeled
documents as accurately as possible based on
the documents that are already labeled. This is
the semi-supervised setting sketched below.
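As a hedged sketch of that scenario, the snippet below uses scikit-learn's SelfTrainingClassifier (one common semi-supervised approach, not something prescribed by the original material); the toy documents and labels are made up, and unlabeled documents are marked with -1 as that API expects.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Tiny illustrative corpus: only the first two documents are labeled.
docs = [
    "win a free prize now",      # labeled spam
    "meeting at noon tomorrow",  # labeled not spam
    "cheap meds, free prize",    # unlabeled
    "lunch meeting tomorrow?",   # unlabeled
]
labels = [1, 0, -1, -1]          # 1 = spam, 0 = not spam, -1 = unlabeled

X = TfidfVectorizer().fit_transform(docs)

# Self-training: fit on the labeled documents, then iteratively pseudo-label
# confident unlabeled documents and refit the base classifier.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, labels)

print(model.predict(X))          # predicted labels for all documents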
Computational learning theory
• Computational learning theory, or statistical
learning theory, refers to mathematical
frameworks for quantifying learning tasks and
algorithms.
– Computational learning theory uses formal methods
to study learning tasks and learning algorithms.
– PAC learning provides a way to quantify the
computational difficulty of a machine learning task.
– VC Dimension provides a way to quantify the
computational capacity of a machine learning
algorithm.
PAC Learning (Theory of Learning
Problems)
• Probably approximately correct learning, or PAC learning, refers to
a theoretical machine learning framework developed by Leslie
Valiant.
• PAC learning seeks to quantify the difficulty of a learning task and
might be considered the premier sub-field of computational
learning theory.
• Consider that in supervised learning, we are trying to approximate
an unknown underlying mapping function from inputs to outputs.
We don’t know what this mapping function looks like, but we
suspect it exists, and we have examples of data produced by the
function.
• PAC learning is concerned with how much computational effort is
required to find a hypothesis (fit model) that is a close match for
the unknown target function.
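As a hedged, worked illustration (a standard textbook bound, not something stated in the original material): for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training examples, the number of examples m needed so that, with probability at least 1−δ, the chosen hypothesis has true error at most ε satisfies

m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

Here "probably" corresponds to the confidence 1−δ and "approximately correct" to the error tolerance ε, which is where the name PAC comes from.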
VC Dimension (Theory of Learning
Algorithms)
• Vapnik–Chervonenkis theory, or VC theory for short, refers to a theoretical
machine learning framework developed by Vladimir Vapnik and Alexey
Chervonenkis.
• VC theory learning seeks to quantify the capability of a learning algorithm
and might be considered the premier sub-field of statistical learning
theory.
• VC theory comprises many elements, most notably the VC
dimension.
• The VC dimension quantifies the complexity of a hypothesis space, e.g. the
models that could be fit given a representation and learning algorithm.
• One way to consider the complexity of a hypothesis space (space of
models that could be fit) is based on the number of distinct hypotheses it
contains and perhaps how the space might be navigated. The VC
dimension is a clever approach that instead measures the number of
examples from the target problem that can be discriminated by
hypotheses in the space.
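As a hedged, worked example (these are standard results, not taken from the original material): a hypothesis space shatters a set of points if it can realize every possible labelling of them, and the VC dimension is the size of the largest set it can shatter. Two classic values are

d_{VC}(\text{intervals on } \mathbb{R}) = 2, \qquad d_{VC}(\text{linear separators in } \mathbb{R}^d) = d + 1

For instance, intervals on the real line can shatter any two distinct points (all four labellings are achievable) but no set of three: the labelling +, −, + for ordered points x1 < x2 < x3 cannot be produced by a single interval, so the VC dimension of intervals is 2.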