Bayesian Reinforcement Learning
Machine Learning RCC, 16th June 2011

Outline
• Introduction to Reinforcement Learning
• Overview of the field
• Model-based BRL
• Model-free RL

References
• ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel
• Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto

Machine Learning
• Unsupervised Learning
• Reinforcement Learning
• Supervised Learning

Definitions
• State
• Action
• Reward
• Policy
• Reward function

Markov Decision Process
(figure: agent–environment loop generating the trajectory x0, a0, r0, x1, a1, r1, ...)
• Transition probability
• Policy
• Reward function
• Value function

Optimal Policy
• Assume one optimal action per state

Value Iteration
(figure: Bellman backups computed from the known reward/transition model; a code sketch is given after the slides)

Reinforcement Learning
• RL problem: solve the MDP when the reward/transition models are unknown
• Basic idea: use samples obtained from the agent's interaction with the environment

Model-Based vs Model-Free RL
• Model-based: learn a model of the reward/transition dynamics and derive the value function/policy from it
• Model-free: learn the value function/policy directly

RL Solutions
• Value function algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy
• Examples: SARSA, Q-learning

RL Solutions
• Actor-critic
– Define a policy structure (the actor)
– Define a value function (the critic)
– Sample state-action-reward
– Update both the actor and the critic

RL Solutions
• Policy search algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy
• Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)

Online vs Offline
• Offline
– Use a simulator
– Policy fixed for each 'episode'
– Updates made at the end of the episode
• Online
– Interact directly with the environment
– Learning happens step by step

Model-Free Solutions
1. Prediction: estimate V(x) or Q(x, a)
2. Control: extract a policy
• On-policy
• Off-policy

Monte-Carlo Predictions
(driving-home example; a code sketch reproducing both tables follows the slides)

  State              V(x)   Reward   Updated V(x)
  Leave car park      -90      -13           -100
  Get out of city     -83      -15            -87
  Motorway            -55      -61            -72
  Enter Cambridge     -11      -11            -11

Temporal Difference Predictions

  State              V(x)   Reward   Updated V(x)
  Leave car park      -90      -13            -96
  Get out of city     -83      -15            -70
  Motorway            -55      -61            -72
  Enter Cambridge     -11      -11            -11

Advantages of TD
• No model of the rewards/transitions is needed
• Online and fully incremental
• Proven to converge under conditions on the step size
• "Usually" faster than MC methods

From TD to TD(λ)
(figures: backup diagrams moving from one-step TD towards full returns, showing states, rewards and the terminal state; a code sketch follows the slides)

SARSA & Q-learning
TD-learning splits into:
• On-policy: estimate the value function of the current policy → SARSA
• Off-policy: estimate the value function of the optimal policy → Q-learning

GP Temporal Difference
(figures: GP value estimates fitted to sampled points; a code sketch follows the slides)
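
The value-iteration slide can be made concrete with a minimal sketch. The toy 2-state, 2-action MDP below (transition probabilities, rewards and discount factor) is invented for illustration and is not from the slides; the point is the Bellman optimality backup applied to a known model.

```python
import numpy as np

# Minimal value iteration on a toy 2-state, 2-action MDP (model assumed known).
# P[a][s][s'] and R[s][a] below are made-up numbers for illustration only.
gamma = 0.9
P = np.array([[[0.8, 0.2],     # action 0: transition probabilities from each state
               [0.1, 0.9]],
              [[0.5, 0.5],     # action 1
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],      # R[s, a]: immediate reward for taking a in s
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)      # greedy policy extracted from the converged Q
print("V* =", V, "greedy policy =", policy)
```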
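
The Monte-Carlo and temporal-difference tables above can be reproduced directly. The sketch below uses only the numbers from the slides; the "Updated" columns correspond to a step size of 1 and no discounting, which is the assumption made here.

```python
# Reproduces the "Updated V(x)" columns of the Monte-Carlo and TD(0) tables
# from the driving-home example (step size alpha = 1, discount gamma = 1).
states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values  = [-90, -83, -55, -11]          # current value estimates V(x) from the slides
rewards = [-13, -15, -61, -11]          # reward received on leaving each state

# Monte-Carlo: move each V(x) to the observed return (sum of the remaining rewards).
mc_updated = [sum(rewards[i:]) for i in range(len(states))]

# TD(0): move each V(x) to r + V(x'); the value after the final state is 0.
td_updated = [rewards[i] + (values[i + 1] if i + 1 < len(values) else 0)
              for i in range(len(states))]

print("State             MC update   TD update")
for s, mc, td in zip(states, mc_updated, td_updated):
    print(f"{s:<18} {mc:>9} {td:>9}")
# Expected: MC -> -100, -87, -72, -11   TD -> -96, -70, -72, -11
```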
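
For the SARSA & Q-learning slide, a sketch of the two one-step tabular updates. The epsilon-greedy behaviour policy, step size and discount are assumptions for illustration; the point is the difference in the bootstrap target (on-policy action versus greedy action).

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    # Behaviour policy used to generate experience (assumed, not from the slides).
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a2 the current policy actually takes in s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the greedy action in s2, regardless of behaviour.
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = defaultdict(float)
actions = ["left", "right"]
a = epsilon_greedy(Q, "s0", actions)
sarsa_update(Q, "s0", a, -1.0, "s1", epsilon_greedy(Q, "s1", actions))
q_learning_update(Q, "s0", a, -1.0, "s1", actions)
```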
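
For the TD(λ) slides, a sketch of tabular TD(λ) prediction with accumulating eligibility traces. The episode format and the parameter values are assumptions; the usage example reuses the driving-home episode from the slides, not the numbers in its table.

```python
from collections import defaultdict

def td_lambda(episode, V=None, alpha=0.1, gamma=1.0, lam=0.8):
    # episode: list of (state, reward, next_state); next_state is None at the end.
    V = defaultdict(float) if V is None else V
    e = defaultdict(float)                       # eligibility trace per state
    for s, r, s_next in episode:
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + gamma * v_next - V[s]        # one-step TD error
        e[s] += 1.0                              # accumulate trace for the visited state
        for x in list(e):                        # credit all recently visited states
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam                  # decay traces
    return V

episode = [("Leave car park", -13, "Get out of city"),
           ("Get out of city", -15, "Motorway"),
           ("Motorway", -61, "Enter Cambridge"),
           ("Enter Cambridge", -11, None)]
print(dict(td_lambda(episode)))
```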
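
For the GP Temporal Difference slides, a minimal sketch of GPTD value estimation along a single trajectory, in the spirit of the Engel, Mannor and Meir GPTD model from the ICML-07 tutorial. The 1-D states, RBF kernel, discount and noise level are assumptions; the point is the observation model r_t = V(x_t) - gamma V(x_{t+1}) + noise, which turns value estimation into GP regression.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel over 1-D states (an assumed choice).
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

gamma, sigma = 0.9, 0.1
X = np.array([0.0, 1.0, 2.0, 3.0])          # visited states x_0 .. x_T (made up)
r = np.array([-1.0, -1.0, 10.0])            # rewards r_0 .. r_{T-1} (made up)

T = len(r)
# Observation model: r_t = V(x_t) - gamma * V(x_{t+1}) + noise  =>  r = H v + noise
H = np.zeros((T, T + 1))
for t in range(T):
    H[t, t], H[t, t + 1] = 1.0, -gamma

K = rbf(X, X)                               # GP prior covariance of V at visited states
G = H @ K @ H.T + sigma ** 2 * np.eye(T)    # covariance of the observed rewards
alpha = np.linalg.solve(G, r)

V_mean = K @ H.T @ alpha                    # posterior mean of V at the visited states
print("Posterior value estimates:", np.round(V_mean, 2))
```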