Presentation 3: Reinforcement Learning

REINFORCEMENT LEARNING
Overview & Applications to Music
Gautam Bhattacharya
MUMT 621
Rise of the Machine
‘Let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Goals & Topics
What is Reinforcement Learning?
History, Introduction, Individuality & Examples
Elements of a Reinforcement Learning System
The Reinforcement Problem - An Example
Applications to Music
Questions & Comments
History
"heterostatic theory of adaptive systems" developed by A. Harry Klopf
‘but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behaviour in order to maximize a special signal from its environment. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Introduction
What is Reinforcement Learning?
‘Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Two characteristics:
- trial-and-error search
- delayed reward
These are the two most important distinguishing features of reinforcement learning.
Introduction
The formulation is intended to include just these three aspects:
- sensation
- action
- goal
‘Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment.’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
DIFFERENCES WITH RESPECT TO SUPERVISED LEARNING
‘Learning with a critic’ as opposed to learning with a teacher.
Reinforcement learning = INTERACTIVE learning
In interactive problems it is often impractical to obtain examples of desired behaviour that are both correct and representative of all the situations in which the agent has to act.
Reinforcement Learning looks at the bigger picture
‘For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful.’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Challenges
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation (see the sketch at the end of this slide).
The whole problem of a goal-directed agent interacting with an uncertain
environment
‘All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces.’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
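
To make the trade-off concrete, here is a minimal sketch of an epsilon-greedy action choice, one simple standard way of balancing exploration against exploitation; the value table, states, and actions are illustrative assumptions, not from the slides.

import random

def epsilon_greedy(state, actions, value, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: try an action at random
    # exploit: pick the action with the highest estimated value for this state
    return max(actions, key=lambda a: value.get((state, a), 0.0))

# e.g. value = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
# epsilon_greedy("s0", ["left", "right"], value) usually returns "right"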
Examples
A master chess player makes a move. The choice is informed both by
planning--anticipating possible replies and counterreplies--and by immediate,
intuitive judgments of the desirability of particular positions and moves.
A mobile robot decides whether it should enter a new room in search of
more trash to collect or start trying to find its way back to its battery recharging
station. It makes its decision based on how quickly and easily it has been able
to find the recharger in the past.
A gazelle calf struggles to its feet minutes after being born. Half an hour later it
is running at 20 miles per hour.
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Reinforcement learning: Elements
POLICY:
- A policy defines the learning agent's way of behaving at a given time.
- Roughly, it is a mapping from perceived states of the environment to actions to be taken when in those states.
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Reinforcement learning: Elements
REWARD FUNCTION:
- A reward function defines the goal in a reinforcement learning problem.
- It maps each perceived state (or state-action pair) of the environment to a
single number, a reward, indicating the intrinsic desirability of that state.
VALUE FUNCTION:
- Specifies what is good in the long run.
- The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
CREDIT ASSIGNMENT PROBLEM:
- Reinforcement learning algorithms learn to assign an internal value to intermediate states, reflecting how good they are at leading to the goal.
Reinforcement learning: Elements
MARKOV DECISION PROCESS:
- For discrete time t = 0, 1, 2, ...
- s_t ∈ S, where s_t is the state of the agent at time t and S is the set of all possible states
- a_t ∈ A(s_t), where a_t is the action at time t and A(s_t) is the set of all possible actions in state s_t
- The reward and next state are sampled from their probability distributions P(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t)
- In a Markov system, the state and reward at the next time step depend only on the current state and action.
- The process can be deterministic, i.e. for a given state and action there is only one possible reward and next state.
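
A minimal sketch of sampling one time step of such a process; the two states, the actions, the rewards, and the transition probabilities below are invented purely for illustration.

import random

# Hypothetical MDP: (state, action) -> list of (probability, next_state, reward)
transitions = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def step(state, action):
    """Sample (next_state, reward) from P(. | s_t, a_t).
    Markov property: the outcome depends only on the current state and action."""
    outcomes = transitions[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights)[0]
    return next_state, reward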
Reinforcement learning: Elements
The policy π defines the action to be taken in any state s_t.
The value of the policy, V^π(s_t), is the cumulative reward that will be received if the agent follows the policy starting from s_t.
Learning aims to find the policy π that maximizes V^π(s_t).
MODELS:
- A model is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward.
- Models are used for planning.
- E.g. finite-horizon & infinite-horizon models
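
As a rough sketch of what V^π(s_t) means in code, the value of a fixed policy can be estimated by averaging the cumulative reward of simulated rollouts; this reuses the hypothetical step function and transition table from the previous sketch and assumes a finite horizon.

def v_pi(state, policy, horizon=20, episodes=1000):
    """Monte Carlo estimate of V^pi(s): mean cumulative reward over rollouts."""
    total = 0.0
    for _ in range(episodes):
        s, ret = state, 0.0
        for _ in range(horizon):       # finite-horizon model of the future
            s, r = step(s, policy[s])  # always act according to the policy
            ret += r
        total += ret
    return total / episodes

# e.g. a fixed policy mapping each state to an action:
# v_pi("s0", {"s0": "go", "s1": "stay"})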
The Reinforcement Problem: An Example
http://nrm.wikipedia.org/wiki/File:Jogo_da_velha_-_tic_tac_toe.png
Tic Tac Toe
‘Let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
http://nrm.wikipedia.org/wiki/File:Jogo_da_velha_-_tic_tac_toe.png
Figure: A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing backups.
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Knowledge Update
Let s denote the state before a greedy move, and s' the state after the move. The update to the estimated value of s, V(s), is:

V(s) ← V(s) + α [ V(s') − V(s) ]

where α is the step-size parameter, which influences the rate of learning. This update rule is an example of a temporal-difference (TD) learning method.
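
A minimal sketch of this update applied to a dictionary-based table of state values, as it might run after each greedy move of the tic-tac-toe player; the neutral default of 0.5 for unseen states and the example state keys are assumptions for illustration.

def td_update(value, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * [V(s') - V(s)], the update rule above."""
    v_s = value.get(s, 0.5)         # unseen states start at a neutral 0.5
    v_next = value.get(s_next, 0.5)
    value[s] = v_s + alpha * (v_next - v_s)

# After each greedy move, back up the value of the earlier state:
# value = {}; td_update(value, board_before_move, board_after_move)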
Applications to Music
INTERACTIVE MUSIC SYSTEMS
Contemporary work in this field includes investigations into both machine listening (real-time audio analysis) and robotics; for example, Ajay Kapur's MahaDeviBot, a thirteen-armed robotic Indian percussionist that can synchronise to sensor input from a human sitarist.
Hamanaka and collaborators developed a system that trains virtual players from the performance of a trio of human guitarists, learning independent models for reaction (within-ensemble playing behaviour), phrase (a database of individual musical materials), and groove (timing preference).
François Pachet et al. developed a system that can learn the live improvisation style of a musician playing a polyphonic MIDI instrument; the machine can then continue the improvisation in the same style.
Work by Tristan Jehan includes the Perceptual Synthesis Engine and the Hyperviolin.
18
Rewarding the Music Machine
Reinforcement learning requires a musically salient notion of reward, and it is essential to derive some measure of the quality of musical actions for particular situations. A number of candidates have been proposed: such functions have been variously defined as fitness functions for genetic algorithms, utility measures, or reinforcement signals.
Murray-Rust et al. list three feedback measures on agent performance:
- Matching internal goals
- The appreciation of fellow participants (and further, of a general audience and critical authorities)
- Memetic success: the take-up of the agent's musical ideas by others (within a concert and a culture)
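
As a purely hypothetical illustration, such feedback measures could be folded into the single scalar reward that reinforcement learning needs, for instance as a weighted sum; the weights and the [0, 1] scores below are placeholders, not part of Murray-Rust et al.'s proposal.

def musical_reward(goal_match, appreciation, memetic, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three feedback measures, each assumed to lie in [0, 1]."""
    return sum(w * x for w, x in zip(weights, (goal_match, appreciation, memetic)))

# e.g. musical_reward(0.8, 0.6, 0.1) -> one scalar reinforcement signal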
Thank you