Reinforcement Learning
RS Sutton and AG Barto
Summarized by Joon Shik Kim
12.03.29.(Thu)
Computational Models of Intelligence

Introduction
• The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning.
• When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment.
• Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals.

Elements of Reinforcement Learning (1/2)
• A policy: the learning agent’s way of behaving at a given time. A mapping from perceived states of the environment to actions to be taken when in those states.
• A reward function: the goal in a reinforcement learning problem. Each perceived state of the environment is mapped into a single number, a reward, indicating the intrinsic desirability of that state.

Elements of Reinforcement Learning (2/2)
• A value function: specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.

Update Rule
• If we let $s$ denote the state before the greedy move, and $s'$ the state after the move, then the update to the estimated value of $s$, denoted $V(s)$, can be written as
  $$V(s) \leftarrow V(s) + \alpha \, [V(s') - V(s)]$$
  where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning.

Action-Value Methods (1/2)
• We denote the true (actual) value of action $a$ as $Q^*(a)$ and the estimated value at the $t$-th play as $Q_t(a)$.
• Recall that the true value of an action is the mean reward received when that action is selected.
• If at the $t$-th play action $a$ has been chosen $k_a$ times prior to $t$, yielding rewards $r_1, r_2, \ldots, r_{k_a}$, then its value is estimated to be
  $$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$

Action-Value Methods (2/2)
• As $k_a \to \infty$, by the law of large numbers $Q_t(a)$ converges to $Q^*(a)$.
• The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value, that is, to select on play $t$ one of the greedy actions, $a^*$, for which $Q_t(a^*) = \max_a Q_t(a)$.

Incremental Implementation
\begin{align*}
Q_{k+1} &= \frac{1}{k+1} \sum_{i=1}^{k+1} r_i \\
        &= \frac{1}{k+1} \left( r_{k+1} + \sum_{i=1}^{k} r_i \right) \\
        &= \frac{1}{k+1} \left( r_{k+1} + k Q_k + Q_k - Q_k \right) \\
        &= \frac{1}{k+1} \left( r_{k+1} + (k+1) Q_k - Q_k \right) \\
        &= Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]
\end{align*}
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

Reinforcement Comparison
• A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur.
• If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward.

Reinforcement Comparison
• In order to pick among the actions, reinforcement comparison methods maintain a separate measure of their preference for each action.
• Let us denote the preference for action $a$ on play $t$ by $p_t(a)$.
• The preferences might be used to determine action-selection probabilities according to a softmax relationship, such as
  $$\pi_t(a) = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}$$

Reinforcement Comparison
• In the softmax relationship, $\pi_t(a)$ denotes the probability of selecting action $a$ on the $t$-th play.
• After each play, the preference for the action selected on that play, $a_t$, is incremented by the difference between the reward, $r_t$, and the reference reward, $\bar{r}_t$:
  $$p_{t+1}(a_t) = p_t(a_t) + \beta \, [r_t - \bar{r}_t]$$
  where $\beta$ is a positive step-size parameter.

Reinforcement Comparison

Q-learning
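
As a recap of the earlier slides, the incremental update rule and the reinforcement comparison rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the book or the slides: all function and parameter names (`incremental_update`, `sample_average`, `reinforcement_comparison_step`, `run_bandit`, `alpha`, `beta`) are hypothetical, the bandit uses Gaussian rewards chosen only for the demo, and the reference reward is assumed to be tracked as an incremental average of the received rewards, which the slides do not spell out.

```python
import math
import random


def incremental_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]."""
    return old_estimate + step_size * (target - old_estimate)


def sample_average(rewards):
    """Build the sample-average estimate one reward at a time.

    Uses step size 1/(k+1) for the (k+1)-th reward, as in the
    incremental implementation derivation.
    """
    q = 0.0  # initial value is irrelevant: the first step size is 1
    for k, r in enumerate(rewards):
        q = incremental_update(q, r, 1.0 / (k + 1))
    return q


def softmax(preferences):
    """pi_t(a) = exp(p_t(a)) / sum_b exp(p_t(b))."""
    m = max(preferences)  # subtracting the max avoids overflow in exp
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def reinforcement_comparison_step(preferences, ref_reward, action, reward,
                                  beta, alpha):
    """Update the chosen action's preference and the reference reward."""
    new_prefs = list(preferences)
    # p_{t+1}(a_t) = p_t(a_t) + beta * [r_t - reference reward]
    new_prefs[action] += beta * (reward - ref_reward)
    # Assumption: the reference reward is an incremental average of rewards.
    new_ref = incremental_update(ref_reward, reward, alpha)
    return new_prefs, new_ref


def run_bandit(true_means, plays=1000, beta=0.1, alpha=0.1, seed=0):
    """Hypothetical n-armed bandit with unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)
    ref = 0.0
    for _ in range(plays):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)
        prefs, ref = reinforcement_comparison_step(
            prefs, ref, action, reward, beta, alpha)
    return prefs, ref


if __name__ == "__main__":
    print(sample_average([1.0, 2.0, 6.0]))
    print(softmax(run_bandit([0.0, 1.0])[0]))
```

With step size 1/(k+1) the incremental rule reproduces the plain sample average without storing past rewards. A constant step size, as used here for the reference reward, instead weights recent rewards more heavily.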