Today’s Topics: Some Additional ML Formulations
• Active Learning
• Transfer Learning
• Structured-Output Learning
• Reinforcement Learning (RL)
  – Q learning
  – Exploration vs exploitation
  – Generalizing across state
  – Used in clinical trials
• Advice Taking & Refinement (later lecture)

Active Learning
(Image: a reconstruction of the original Mechanical Turk; a human hid inside the table)
The ML algorithm gets to ask a human to label some examples (eg, via Amazon Mechanical Turk)
 – we don’t want to burden the human by asking for too many labels
 – the algorithm should pick the most useful examples (borderline cases?) for labeling
Sample algorithm
 1. Train an ensemble on the labeled data
 2. Find unlabeled examples on which members of the ensemble disagree the most
 3. Ask the human to label some of these
 4. Go to 1

Transfer Learning
An active topic for DEEP ANNs: reuse the lower layers in new image-processing tasks (see the survey by Torrey & Shavlik on my pubs page)
• Agent learns Task A (the source task)
• Agent encounters a related Task B (the target task)
• Agent is told, or discovers, how the tasks are related
• Agent uses knowledge from Task A to learn Task B faster

Potential Benefits of Transfer Learning
(Plot: performance on the target task vs amount of training; relative to learning without transfer, the with-transfer curve shows a higher start, a steeper slope, and a higher asymptote)

Close-Transfer Scenarios
2-on-1 BreakAway, 3-on-2 BreakAway, 4-on-3 BreakAway: the task is to score within a few seconds [Maclin et al., AAAI 2005]

Distant-Transfer Scenarios
3-on-2 KeepAway: keep the ball from the opponents [Stone & Sutton, ICML 2001]
3-on-2 MoveDownfield: advance the ball [Torrey et al., ECML 2006]
3-on-2 BreakAway: score a goal

Some Results: Transfer to 3-on-2 BreakAway
(Plot: probability of scoring a goal, 0 to 0.6, vs number of training games, 0 to 3000, comparing standard RL with skill transfer from 2-on-1 BreakAway, from 3-on-2 MoveDownfield, and from 3-on-2 KeepAway)
Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006

Structured-Output Learning
• So far we have only learned ‘scalar’ outputs
• Some tasks require learning outputs with more structure:
  – learning parsers for English (or other ‘natural’ languages)
  – learning to segment an image into objects
  – learning to predict the 3D shape of a protein given its sequence
• We won’t cover this task in CS 540, though

Expecting Less of the Teacher
• Imagine teaching robots to play soccer
• We could tell them what to do at every step (this would be supervised ML)
• What if we occasionally gave them numeric feedback instead?
  – If they scored a goal, get +1
  – If they intercepted a pass, get +0.5
  – If they ran out of bounds, get -0.5
  – If they scored a goal for the other team, get -10
  – ...
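To make the ‘reward structure’ idea above concrete, here is a minimal sketch (mine, not the lecture’s) of how a teacher might encode that numeric feedback as a reward function; the event names are hypothetical placeholders, not tied to any real soccer simulator.

```python
# A minimal sketch of a teacher-defined reward structure for the soccer example.
# The event names are hypothetical placeholders, not part of any real simulator.

REWARDS = {
    "scored_goal":        +1.0,
    "intercepted_pass":   +0.5,
    "ran_out_of_bounds":  -0.5,
    "own_goal":          -10.0,
}

def reward(events):
    """Sum the numeric feedback for the events that occurred on this time step.
    Most steps have no events, so the reward is usually 0 (rewards are sparse)."""
    return sum(REWARDS.get(e, 0.0) for e in events)

print(reward(["intercepted_pass"]))   # 0.5
print(reward([]))                     # 0.0, the common case
```

The learner never sees which behavior "should" have been chosen, only this number; that is what distinguishes the setup from supervised ML.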
Reinforcement Learning vs Supervised Learning
RL requires much less of the teacher
 – The teacher must set up a ‘reward structure’
 – The learner ‘works out the details’, ie, in effect writes a program that maximizes the rewards received

Sequential Decision Problems (courtesy of Andy Barto, pictured)
• Decisions are made in stages
• The outcome of each decision is not fully predictable, but it can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or, equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: one needs to balance the desire for immediate reward against the possibility of high future reward

Reinforcement Learning
The task of an agent embedded in an environment. Repeat forever:
 1) sense the world
 2) reason
 3) choose an action to perform
 4) get feedback (usually reward = 0)
 5) learn
Notice: RL is active, online learning (queries are asked of the environment, not of a human teacher)

RL Systems: Formalization
S_E = the set of states of the world
  eg, an N-dimensional vector of ‘sensors’ (plus memory of past sensations)
A_E = the set of possible actions an agent can perform (‘effectors’)
W = the world
R = the immediate reward structure
W and R form the environment; they can be stochastic functions
(usually most states have R = 0, ie, rewards are sparse)

Embedded Learning Systems: Formalization (cont.)
W: S_E × A_E → S_E  [here the arrow means ‘maps to’]
  The world maps a state and an action to a new state
R: S_E → reals
  Provides a reward (a number, often 0) as a function of the current state
Note: we could instead use R: S_E × A_E → reals (ie, let rewards depend on how we ENTER a state)

A Graphical View of RL
(Diagram: the agent sends an action to the real world, W; the world returns sensory information and R, a scalar reward that acts as an indirect teacher)
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions
• For CS 540, we’ll assume deterministic problems

Common Confusion
The state need not be solely the current sensor readings
 – The Markov assumption, commonly used in RL: the value of a state is independent of the path taken to reach that state
 – But we can store memory of the past in the current state; we can always create a Markovian task by remembering the entire past history

Need for Memory: Simple Example
‘Out of sight, but not out of mind’
(Diagram: at time = 1 the learning agent can see an opponent near a wall; at time = 2 the opponent has moved behind the wall and is no longer visible)
It seems reasonable to remember that an opponent was recently seen

State vs. Current Sensor Readings
Remember: state is what is in one’s head (past memories, etc), not ONLY what one currently sees/hears/smells/etc

Policies
The agent needs to learn a policy π_E : Ŝ_E → A_E
  The policy function π_E: given a world state Ŝ_E, which action A_E should be chosen?
Ŝ_E is our learner’s APPROXIMATION to the true S_E
Remember: the agent’s task is to maximize the total reward received during its lifetime
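As a concrete illustration of this formalization, here is a minimal sketch (under the deterministic assumption stated above) of W, R, and the sense-act-learn loop; the three-state chain world and the random policy are made-up placeholders, not anything from the lecture.

```python
import random

# A minimal sketch of the formalization above: W maps (state, action) to the next
# state and R maps a state to an immediate reward. The 3-state chain is a placeholder.

STATES  = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

def W(state, action):
    """Deterministic world: step left or right along the chain s0 - s1 - s2."""
    i = STATES.index(state)
    i = max(0, i - 1) if action == "left" else min(len(STATES) - 1, i + 1)
    return STATES[i]

def R(state):
    """Sparse immediate reward: only the rightmost state pays off."""
    return 1.0 if state == "s2" else 0.0

def run(policy, start="s0", steps=10):
    """The 'repeat forever' loop: sense, choose an action, act, get feedback."""
    state, total = start, 0.0
    for _ in range(steps):
        action = policy(state)      # reason and choose an action
        state  = W(state, action)   # the world produces the new state
        total += R(state)           # feedback, usually 0
    return total

print(run(lambda s: random.choice(ACTIONS)))   # total reward earned by a random policy
```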
True World States vs. the Learner’s Representation of the World State
• From here forward, S will be our learner’s approximation of the true world state
• Exceptions:
   W: S × A → S
   R: S → reals
  These are our notations for how the true world behaves when we act upon it. You can think of W and R as taking the learner’s representation of the world state as an argument and internally converting it to the ‘true’ world state(s).
Recall: we could instead use R: S_E × A_E → reals (ie, rewards might instead depend on how we ENTER a state)

Policies (cont.)
To construct π_E, we will assign a utility U (a number) to each state:
  U^π_E(s) = Σ_{t=1..∞} γ^(t−1) R(s, π_E, t)
• γ is a positive constant ≤ 1
• R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
• Note: the reward received at time t is discounted by γ^(t−1)

Why Have a Decay on Rewards?
• Getting ‘money’ in the future is worth less than money right now
  – Inflation
  – More time to enjoy what it buys
  – Risk of death before collecting
• Allows convergence proofs for the functions we’re learning

The Action-Value Function
We want to choose the ‘best’ action in the current state, so pick the one that leads to the best next state (and include any immediate reward). Let
  Q^π_E(s, a) = R(W(s, a)) + γ U^π_E(W(s, a))
The first term is the immediate reward received for going to state W(s, a) [alternatively, R(s, a)]; the second term is the future reward from further actions (discounted due to the one-step delay).

The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy: choose the action a where
  a = argmax_{a' ∈ actions} Q(s, a')
Note: x = argmax f(x) sets x to the value that leads to the maximum value of f(x)

Q vs. U Visually
(Diagram: a small graph in which states are nodes and actions are arcs; utilities such as U(1), ..., U(6) are ‘stored’ on the states, while values such as Q(1, ii) are ‘stored’ on the arcs)
U’s are stored on states; Q’s are stored on arcs

Q’s vs. U’s
Assume we’re in state S. Which action do we choose?
• U’s (model-based)
  – Need a ‘next state’ function to generate all possible next states (eg, chess)
  – Choose the next state with the highest U value
• Q’s (model-free, though one can also do model-based Q learning)
  – Need only know which actions are legal (eg, the web)
  – Choose the arc with the highest Q value

Q-Learning (Watkins PhD, 1989)
Let Q_t be our current estimate of the optimal Q.
Our current policy is
  π_t(s) = a such that Q_t(s, a) = max_{b ∈ known actions} Q_t(s, b)
Our current utility-function estimate is
  U_t(s) = Q_t(s, π_t(s))
Hence the U table is embedded in the Q table, and we don’t need to store both.
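A minimal sketch (again mine, not the lecture’s) of how a Q table can be stored and how the greedy policy π_t and the embedded utility U_t fall out of it via argmax and max; the two states, two actions, and the values in the table are placeholders.

```python
# A minimal sketch: a Q table stored as a dict keyed by (state, action).
# The states, actions, and values below are made-up placeholders.

ACTIONS = ["up", "down"]

Q = {
    ("s0", "up"): 0.4, ("s0", "down"): 0.7,
    ("s1", "up"): 0.1, ("s1", "down"): 0.0,
}

def policy(state):
    """Greedy policy: pi_t(s) = the action a with the largest Q_t(s, a)."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def utility(state):
    """U_t(s) = Q_t(s, pi_t(s)) = max_a Q_t(s, a): the U table is embedded in Q."""
    return Q[(state, policy(state))]

print(policy("s0"), utility("s0"))   # down 0.7
```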
Q-Learning (cont.)
Assume we are in state S_t
‘Run the program’* for a while (n steps)
Determine the actual reward and compare it to the predicted reward
Adjust the prediction to reduce the error
* ie, follow the current policy

Updating Q_t
Let the N-step estimate of future rewards be
  r_t^(N) = Σ_{k=1..N} γ^(k−1) R_{t+k} + γ^N U_t(S_{t+N})
The first term is the actual (discounted) reward received during the N time steps; the second is the estimate of the future reward if we continued to t = ∞.

Changing the Q Function (ie, learn a better approximation)
  Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + α [ r_t^(N) − Q_t(S_t, a_t) ]
New estimate (at time t + N) = old estimate + learning rate × error
(for deterministic worlds, set α = 1)

Pictorially (here rewards are on arcs rather than on states)
(Diagram: the actual moves made, in red, take the agent from S_1 to S_N while collecting rewards r_1, r_2, r_3; potential next states branch off of S_N)
  Q_est(S_1, a) ≈ r_1 + γ r_2 + γ² r_3 + <estimate of the remainder of the infinite sum>
    = r_1 + γ r_2 + γ² r_3 + γ³ U(S_N)
    = r_1 + γ r_2 + γ² r_3 + γ³ max_{b ∈ actions} Q(S_N, b)

How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
 – One-step Q learning
 – The most common approach

Exploration vs. Exploitation
In order to learn about better alternatives, we can’t always follow the current policy (‘exploitation’); sometimes we need to try random moves (‘exploration’)

Exploration vs. Exploitation (cont.)
Approaches
 1) p percent of the time, make a random move; could let p = 1 / (# moves made), so that exploration decays over time
 2) Prob(picking action A in state S) = const^Q(S, A) / Σ_{i ∈ actions} const^Q(S, i)
    (exponentiating gets rid of negative values)

One-Step Q-Learning Algo
 0. S ← initial state
 1. If random# < P then a ← random choice   // occasionally ‘explore’
    else a ← π_t(S)                         // else ‘exploit’
 2. S_new ← W(S, a);  R_immed ← R(S_new)    // act on the world and get the reward
 3. Error ← R_immed + γ U(S_new) − Q(S, a)  // use Q to compute U
 4. Q(S, a) ← Q(S, a) + α Error             // should also decay α over time
 5. S ← S_new
 6. Go to 1
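Here is a hedged, minimal sketch of the one-step algorithm above as runnable code. The four-state chain world, the exploration probability, and the restart-at-the-goal convention are placeholder choices of mine; only the explore/exploit test and the update rule follow the slide.

```python
import random
from collections import defaultdict

# A minimal sketch of one-step Q-learning on a tiny made-up deterministic world.

ACTIONS = ["left", "right"]
GAMMA, ALPHA, P_EXPLORE = 0.9, 1.0, 0.1     # alpha = 1 is fine in a deterministic world

def W(s, a):                                # next-state function on states 0..3
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def R(s):                                   # sparse reward: only state 3 pays off
    return 1.0 if s == 3 else 0.0

Q = defaultdict(float)                      # Q table, all entries start at 0

def greedy(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])

s = 0
for _ in range(2000):
    # Step 1: occasionally explore, otherwise exploit the current policy
    a = random.choice(ACTIONS) if random.random() < P_EXPLORE else greedy(s)
    s_new = W(s, a)                                  # Step 2: act on the world ...
    r_immed = R(s_new)                               # ... and get the reward
    u_new = max(Q[(s_new, b)] for b in ACTIONS)      # U(S_new) = max_b Q(S_new, b)
    error = r_immed + GAMMA * u_new - Q[(s, a)]      # Step 3
    Q[(s, a)] += ALPHA * error                       # Step 4
    s = 0 if s_new == 3 else s_new                   # Step 5 (restart at the goal)

print(greedy(0))   # 'right' once the Q table has settled
```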
Visualizing Q-Learning (1-step ‘lookahead’)
(Diagram: in state I the agent takes action a, receives reward R, and arrives in state J, which has actions a, b, ..., z available)
The estimate Q(I, a) should equal R + γ max_x Q(J, x)
 – train the ML system to learn a consistent set of Q values

Bellman Optimality Equation (from 1957, though stated for the U function back then)
IF, for all s and a,
  Q(s, a) = R(S_N) + γ max_{a' ∈ actions} Q(S_N, a'),  where S_N = W(s, a), ie, the next state,
THEN the resulting policy, π(s) = argmax_a Q(s, a), is optimal, ie, it leads to the highest discounted total reward
(also, any optimal policy satisfies the Bellman equation)

A Simple Example (of Q-learning, with updates after each step, ie N = 1)
Let γ = 2/3.
(Diagram: a small deterministic world with states S0 through S4 and rewards R(S0) = 0, R(S1) = 1, R(S2) = −1, R(S3) = 0, R(S4) = 3; an upper path out of S0 goes through S1, and a lower path goes through S2 and on to S4. Q values are stored on the arcs and all start at 0.)
Update rule: Q_new ← R + γ max Q_{next state}   (deterministic world, so α = 1)

A Simple Example (Step 1): the agent moves S0 → S2
  Q(S0 → S2) ← R(S2) + γ max Q(S2, ·) = −1 + (2/3)(0) = −1

A Simple Example (Step 2): the agent moves S2 → S4
  Q(S2 → S4) ← R(S4) + γ max Q(S4, ·) = 3 + (2/3)(0) = 3

A Simple Example (Step i): assume we get to the end of the game and are ‘magically’ restarted in S0; the agent again moves S0 → S2

A Simple Example (Step i+1): the update now uses the improved estimate on the arc out of S2, so
  Q(S0 → S2) ← R(S2) + γ max Q(S2, ·) = −1 + (2/3)(3) = 1

A Simple Example (Step ∞), ie, the Bellman optimum
What would the final Q values be if we explored and exploited for a long time, always returning to S0 after 5 actions?
With γ = 2/3, both arcs out of S0 end up with Q = 1 (upper arc: 1 + (2/3)(0); lower arc: −1 + (2/3)(3)), Q(S2 → S4) = 3, and the remaining arcs stay at 0.
What would happen if γ > 2/3? The lower path is better. If γ < 2/3? The upper path is better.
This shows the need for EXPLORATION, since the first action ever taken out of S0 may or may not be the optimal one.

An “On Your Own” RL HW (Solution in Class Next Tues)
Consider the deterministic reinforcement environment drawn below. Let γ = 0.5. Immediate rewards are indicated inside the nodes. Once the agent reaches the ‘end’ state, the current episode ends and the agent is magically transported to the ‘start’ state.
(Graph: nodes Start (r=0), A (r=2), B (r=5), C (r=3), and End (r=5), with an initial Q value of 4 on every legal arc)
(a) A one-step, Q-table learner follows the path Start → B → C → End. On the graph below, show the Q values that have changed, and show your work. Assume that for all legal actions (ie, for all the arcs on the graph), the initial values in the Q table are 4, as shown above (feel free to copy the above 4’s below, but somehow highlight the changed values).
(Graph for part (a): Start (r=0), A (r=2), B (r=5), C (r=3), End (r=5))

(b) Starting with the Q table you produced in Part (a), again follow the path Start → B → C → End and show the Q values below that have changed from Part (a). Show your work.
(Graph: Start (r=0), A (r=2), B (r=5), C (r=3), End (r=5))

(c) What would the final Q values be in the limit of trying all possible arcs ‘infinitely’ often? Ie, what is the Bellman-optimal Q table? Explain your answer.
(Graph: Start (r=0), A (r=2), B (r=5), C (r=3), End (r=5))

(d) What is the optimal path between Start and End? Explain.

Estimating Values ‘In Place’ (see Secs 2.6 and 2.7 of the Sutton & Barto RL textbook)
Let r_i be our i-th estimate of some Q value. Note that r_i is not the immediate reward R_i; rather,
  r_i = R_i + γ U(next state_i)
Assume we have k + 1 such measurements.

Estimating Values (cont.)
The estimate based on k + 1 trials is the average of the k + 1 measurements:
  Q_{k+1} = (1/(k+1)) Σ_{i=1..k+1} r_i
     = (1/(k+1)) [ r_{k+1} + Σ_{i=1..k} r_i ]   (pull out the last term)
     = (1/(k+1)) [ r_{k+1} + k Q_k ]            (stick in the definition of Q_k)

‘In Place’ Estimates (cont.)
     = (1/(k+1)) [ r_{k+1} + (k+1) Q_k − Q_k ]  (add and subtract Q_k)
so
  Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]
ie, the new running average is the current running average plus a step toward the latest estimate. Notice that the step size α needs to decay over time. Repeating:
  Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]

Q-Learning: The Need to ‘Generalize Across State’
Remember, conceptually we are filling in a huge table whose columns are the states S0, S1, S2, ..., Sn and whose rows are the actions a, b, c, ..., z; for example, one cell holds Q(S2, c).
Tables are a very verbose representation of a function.

Representing Q Functions More Compactly
We can use some other function representation (eg, a neural net) to compactly encode this big table.
(Diagram: an encoding of the state S is the network’s input, with each input unit encoding a property of the state, eg a sensor value; the outputs are Q(S, a), Q(S, b), ..., Q(S, z), so the second argument of Q is a constant per output. Alternatively, one could have one net for each possible action.)

Q Tables vs Q Nets
Given 100 Boolean-valued features and 10 possible actions:
  Size of the Q table: 10 × 2^100
  Size of a Q net with 100 hidden units: 100 × 100 + 100 × 10 = 11,000 weights
  (weights between the inputs and the HUs, plus weights between the HUs and the outputs)
This is a similar idea to Full Joint Probability Tables vs Bayes Nets (so-called ‘factored’ representations).

Why Use a Compact Q-Function?
 1. The full Q table may not fit in memory for realistic problems
 2. It can generalize across states, thereby speeding up convergence
    (ie, one example ‘fills’ many cells of the Q table)
Notes
 1. When generalizing across states, we cannot use α = 1
 2. The convergence proofs only apply to Q tables

Three Forward Props and a BackProp
(Diagram: the Q net is forward-propped on states S0 and S1, producing outputs Q(S0, A), ..., Q(S0, Z) and Q(S1, A), ..., Q(S1, Z))
 1. Forward prop on S0 and choose an action; execute the chosen action in the world, then ‘read’ the new sensors and the reward
 2. Forward prop on S1 to estimate U(S1) = max_{X ∈ actions} Q(S1, X)
 3. Forward prop on S0 again and compare Q(S0, A) with the new estimate; calculate the “teacher’s” output, assuming Q is ‘correct’ for the other actions
 4. Backprop to reduce the error at Q(S0, A)
Aside: we could save some forward props by caching information.
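Below is a minimal sketch of the ‘forward props and a backprop’ idea, using a linear approximator with a hand-written gradient step in place of the multi-layer Q net on the slide; the feature size, the two actions, and the learning rate are placeholder choices of mine.

```python
import numpy as np

# A minimal sketch of Q-learning with function approximation. One linear model per
# action stands in for the multi-layer Q net; the 4-feature encoding is a placeholder.

N_FEATURES, ACTIONS, GAMMA, LR = 4, ["a", "b"], 0.9, 0.01
weights = {act: np.zeros(N_FEATURES) for act in ACTIONS}

def q(state_features, action):                # one "forward prop"
    return float(weights[action] @ state_features)

def q_update(s0, action, reward, s1):
    """Forward props on S1 and S0, then one gradient step (the 'backprop')."""
    u_s1   = max(q(s1, b) for b in ACTIONS)   # U(S1) = max_b Q(S1, b)
    target = reward + GAMMA * u_s1            # the "teacher's" output for the chosen action
    error  = target - q(s0, action)
    weights[action] += LR * error * s0        # reduce the error at Q(S0, action) only

# Hypothetical usage, with random vectors standing in for encoded sensor readings:
s0, s1 = np.random.rand(N_FEATURES), np.random.rand(N_FEATURES)
q_update(s0, "a", reward=1.0, s1=s1)
print(q(s0, "a"))
```

Because one observed transition changes the weights, it also changes the predictions for similar states, which is the ‘generalize across state’ benefit noted above; as the slides warn, the Q-table convergence guarantees no longer apply.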
The Agent World
(Rough sketch, implemented in Java [by me], linked to the class home page)
(Diagram: a 2D world containing pushable ice cubes, opponents, food, and the RL agent)

Some (Ancient) Agent World Results
(Plot: mean discounted score on the test-set suite vs training-set steps, in thousands, out to about 2000K; the run took roughly 2 weeks on a CPU about 1000× slower than today’s. The curves compare Q-nets with 5, 15, 25, and 50 hidden units, a Q-table, perceptrons trained by supervised learning on 600 examples, and a hand-coded policy.)

Q-Learning Convergences
• Only applies to Q tables and deterministic, Markovian worlds
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then for all s and a
   lim_{t→∞} Q̂_t(s, a) = Q_actual(s, a)
  where Q̂ is the approximate Q table and Q_actual is the true Q table

Developing a World Model
• The world involves two functions
  – the ‘next state’ function W: S × A → S
  – the reward function R: S → reals
• How could we learn these two functions? Eg, think about chess
• Even if we knew these functions, we would still need to compute the Q table/function to have a policy

Learning World Models
We can use any supervised learning technique to learn these functions.
R: S → reals is learned by feeding a representation of state S to a supervised learner whose output is the reward.
W: S × A → S is learned by feeding representations of state S and action A to a supervised learner whose output is a representation of the next state.
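As one concrete way to do this, here is a minimal sketch that fits off-the-shelf supervised learners (scikit-learn decision-tree regressors: my choice, and a library assumption) to logged experience; the array shapes and the random ‘experience’ are placeholders for real (state, action, next state, reward) logs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A minimal sketch: learn approximate models of R and W from logged experience.
# The random arrays below stand in for real logged transitions.

rng = np.random.default_rng(0)
states      = rng.random((500, 4))               # 500 logged states, 4 sensor features each
actions     = rng.integers(0, 3, size=(500, 1))  # the action, encoded as one integer feature
next_states = rng.random((500, 4))
rewards     = rng.random(500)

# R: S -> reals, learned by an ordinary regressor
reward_model = DecisionTreeRegressor().fit(states, rewards)

# W: S x A -> S, learned as multi-output regression on (state, action) pairs
world_model = DecisionTreeRegressor().fit(np.hstack([states, actions]), next_states)

# The learned models can then be queried like R and W (eg, for the mental simulation
# discussed next under DYNA)
s, a = states[0], actions[0]
print(reward_model.predict([s])[0])
print(world_model.predict([np.concatenate([s, a])])[0])
```

Any supervised learner could be swapped in; the tree regressor is simply a convenient choice here because it handles the multi-output next-state prediction directly.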
Using a World Model (‘The DYNA Architecture’, Sutton, in From Animals to Animats, MIT Press, 1991)
If we have a good world model, we can mentally simulate exploration to produce Q-learning data
 – Faster than running in the real world
 – May need to periodically update the world model with the results of real-world experiments (a trade-off between exploration and calculation)

Advantages of Using World Models
• Interleave real and simulated actions
• Can get extra training examples quickly and cheaply
• Provides one way to incorporate prior knowledge (eg, simulators)
• Allows planning of what to explore in the real world (ie, mental simulation)

Some Other Ways to Reduce the Number of ‘Real World’ Examples Needed in RL
• Replay old examples periodically (Lin)
• Have a teacher occasionally say which action to do (Clouse & Utgoff)
• Give ‘verbal’ advice to the learner (Maclin & Shavlik)
• Transfer learning (discussed earlier)

Recap: Supervised Learners Helping the RL Learner
• Note that Q learning automatically creates I/O pairs for a supervised ML algorithm when ‘generalizing across state’
• We can also learn a model of the world (W) and of the reward function (R)
  – Simulations via learned models reduce the need for ‘acting in the physical world’

Shaping
• Allow the teacher to change the reward function over time
  – Eg, consider some team sport
  – The most basic reward is win = +1, tie = 0, lose = -1
  – But during training we might initially give rewards for catching passes, scoring points, blocking shots, etc
  – Over time the reward function might become less detailed (perhaps because shaping can lead to non-optimality)
• Has some similarities to transfer learning

Challenges in RL
• Q tables are too big, so use function approximation
  – can ‘generalize across state’ (eg, via ANNs)
  – the convergence proofs no longer apply, though
• Hidden state (‘perceptual aliasing’)
  – two different states might look the same (eg, due to ‘local’ sensors)
  – can use the theory of Partially Observable Markov Decision Problems (POMDPs)
• Multi-agent learning (the world is no longer stationary)

Could Use GAs for the RL Task
• Another approach is to use GAs to evolve good policies
  – Create N ‘agents’
  – Measure each one’s rewards over some time period
  – Discard the worst, cross over the best, and do some mutation
  – Repeat ‘forever’ (a model of biology)
• Both ‘predators’ and ‘prey’ evolve/learn, ie, co-evolution

Summary of Non-GA Reinforcement Learning
Positives
 – Requires much less ‘teacher feedback’
 – An appealing approach to learning to predict and control (eg, robotics, softbots); demo of Google’s Q learning
 – Solid mathematical foundations
   • Dynamic programming
   • Markov decision processes
   • Convergence proofs (in the limit)
 – The core of a solution to the general AI problem?

Summary of Non-GA Reinforcement Learning (cont.)
Negatives
 – Need to deal with huge state-action spaces (so convergence is very slow)
 – Hard to design the R function?
 – Learns a specific environment rather than general concepts
 – Depends on the state representation?
 – Dealing with multiple learning agents?
 – Hard to learn at multiple ‘grain sizes’ (hierarchical RL)