REINFORCEMENT LEARNING
Overview & Applications to Music
Gautam Bhattacharya
MUMT 621

Rise of the Machine

‘let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Goals & Topics

- What is Reinforcement Learning? History, Introduction, Individuality & Examples
- Elements of a Reinforcement Learning System
- The Reinforcement Problem: An Example
- Applications to Music
- Questions & Comments

History

The "heterostatic theory of adaptive systems" developed by A. Harry Klopf.

‘but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behaviour in order to maximize a special signal from its environment. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Introduction

What is Reinforcement Learning?

‘Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Two characteristics:
- trial-and-error search
- delayed reward
These are the two most important distinguishing features of reinforcement learning.

Introduction

The formulation is intended to include just these three aspects:
- sensation
- action
- goal

‘Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment.’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

DIFFERENCES WITH RESPECT TO SUPERVISED LEARNING

- ‘Learning with a critic’ as opposed to learning with a teacher.
- Reinforcement learning = INTERACTIVE learning.
- In interactive problems it is often impractical to obtain examples of desired behaviour that are both correct and representative of all the situations in which the agent has to act.
- Reinforcement learning looks at the bigger picture:
‘For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful.‘
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Challenges

One challenge that arises in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation.

The whole problem is that of a goal-directed agent interacting with an uncertain environment:
‘All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces.‘
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Examples

- A master chess player makes a move. The choice is informed both by planning--anticipating possible replies and counterreplies--and by immediate, intuitive judgments of the desirability of particular positions and moves.
- A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.
- A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Reinforcement Learning: Elements

POLICY:
- A policy defines the learning agent's way of behaving at a given time.
- Roughly, it is a mapping from perceived states of the environment to actions to be taken when in those states.

REWARD FUNCTION:
- A reward function defines the goal in a reinforcement learning problem.
- It maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.

VALUE FUNCTION:
- Specifies what is good in the long run.
- The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

CREDIT ASSIGNMENT PROBLEM:
- Reinforcement learning algorithms learn to generate an internal value for the intermediate states reflecting how good they are at leading to the goal.

MARKOV DECISION PROCESS:
- For discrete time t = 0, 1, 2, ...
- s_t ∈ S, where s_t is the state of the agent at time t and S is the set of all possible states.
- a_t ∈ A(s_t), where a_t is the action at time t and A(s_t) is the set of all possible actions in state s_t.
- The reward and next state are sampled from their probability distributions P(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t).
- In a Markov system, the state and reward at the next time step depend only on the current state and action.
- The process can be deterministic, i.e. for a given state and action taken there is only one possible reward and next state.

The policy π defines the action to be taken in any state s_t.
The value of the policy, Vπ(s_t), is the cumulative reward that will be received if the agent follows the policy starting from s_t.
Learning aims to optimize the policy π and its value Vπ(s_t).

MODELS:
- A model is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward.
- Models are used for planning.
- E.g. finite-horizon and infinite-horizon models.

A short code sketch of these elements follows.
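To make the elements above concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not code from the slides: the MDP table, the policy, the functions step and estimate_value, and the discount gamma are all ours (the slides mention finite- and infinite-horizon models rather than discounting). It encodes a tiny two-state MDP, a deterministic policy π, and a Monte Carlo estimate of Vπ(s) obtained by sampling P(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t):

```python
# Illustrative sketch (not from the slides): a tiny two-state MDP with a
# fixed policy, where rewards and next states are sampled from
# P(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t).
import random

# For each (state, action): a list of (probability, reward, next_state).
MDP = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "move"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
    ("s1", "move"): [(1.0, 0.0, "s0")],
}

# A deterministic policy pi: a mapping from perceived states to actions.
policy = {"s0": "move", "s1": "stay"}

def step(state, action):
    """Sample (reward, next_state) from the environment's distributions."""
    outcomes = MDP[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, reward, nxt = random.choices(outcomes, weights=weights)[0]
    return reward, nxt

def estimate_value(state, gamma=0.9, episodes=5000, horizon=50):
    """Monte Carlo estimate of V_pi(state): expected discounted return."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            r, s = step(s, policy[s])
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

print("V_pi(s0) ~", round(estimate_value("s0"), 2))
```

Note that step here plays the role of the environment; replacing its sampling with a learned model's predicted next state and reward would turn the same loop into planning, in the sense of the MODELS element above.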
The Reinforcement Problem: An Example

Tic-Tac-Toe

‘let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?’
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction
Image: http://nrm.wikipedia.org/wiki/File:Jogo_da_velha_-_tic_tac_toe.png

[Figure: a sequence of tic-tac-toe moves.] The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing backups.
- Sutton, R. S., and A. G. Barto. 1998. Reinforcement Learning: An Introduction

Knowledge Update

Let s denote the state before a greedy move, and s′ the state after the move. The update to the estimated value of s, V(s), is:

V(s) ← V(s) + α[V(s′) − V(s)]

where α is the step-size parameter, which influences the rate of learning. This update rule is an example of temporal-difference (TD) learning. A short code sketch of this update follows.
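As a minimal sketch of how this update drives the tic-tac-toe player, the Python below keeps a table V(s) of estimated winning probabilities, mixes greedy moves with occasional exploratory ones, and backs up values only after greedy moves, matching the figure caption above. The helper names, the 0.5 default value, and the exploration rate EPSILON are our assumptions, not details from the slides:

```python
# Illustrative sketch of the tic-tac-toe value updates (names and constants
# are assumptions, not from the slides).
import random

ALPHA = 0.1    # step-size parameter: the rate of learning
EPSILON = 0.1  # fraction of moves that are exploratory

values = {}    # V(s): estimated probability of winning from state s

def V(state):
    """Unknown states default to 0.5; wins are 1.0, losses/draws 0.0."""
    return values.setdefault(state, 0.5)

def choose_move(candidate_states):
    """Return (next_state, was_exploratory) over the reachable states."""
    if random.random() < EPSILON:
        return random.choice(candidate_states), True   # exploratory move
    return max(candidate_states, key=V), False         # greedy move

def td_update(s, s_next):
    """Back up the value of s: V(s) <- V(s) + alpha * [V(s') - V(s)]."""
    values[s] = V(s) + ALPHA * (V(s_next) - V(s))

# Toy usage with abstract state labels (a real player would enumerate
# board positions). A greedy move from "mid" into a known winning state
# pulls V("mid") from 0.5 toward 1.0; exploratory moves skip the backup.
values["win"] = 1.0
s_next, exploratory = choose_move(["win", "other"])
if not exploratory:
    td_update("mid", s_next)
    print(round(values["mid"], 2))  # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```

Because wins are valued at 1 and losses and draws at 0 (the quoted problem statement treats draws and losses as equally bad), repeated backups make V(s) approach the probability of winning from s under the learned policy.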
Applications to Music

INTERACTIVE MUSIC SYSTEMS
Contemporary work in this field includes investigations into both machine listening (real-time audio analysis) and robotics:
- Ajay Kapur's MahaDeviBot, a thirteen-armed Indian percussion robot that can synchronise to sensor input from a human sitarist.
- Hamanaka and collaborators developed a system that trains virtual players from the performances of a trio of human guitarists, learning independent models for reaction (within-ensemble playing behaviour), phrase (a database of individual musical materials), and groove (timing preference).
- François Pachet et al. developed a system that learns the live improvisation style of a musician playing a polyphonic MIDI instrument; the machine can then continue the improvisation in the same style.
- Work by Tristan Jehan, including the Perceptual Synthesis Engine and the Hyperviolin.

Rewarding the Music Machine

Reinforcement learning requires a musically salient notion of reward, and it is essential to derive some measure of the quality of musical actions for particular situations. A number of candidates have been proposed: such functions have been variously defined as fitness functions for genetic algorithms, utility measures, or reinforcement signals.

Murray-Rust et al. list three feedback measures of agent performance:
- Matching internal goals
- The appreciation of fellow participants (and further, of a general audience and critical authorities)
- Memetic success: the take-up of the agent's musical ideas by others (within a concert and within a culture)

Thank You