Machine Learning Quiz 3 Solutions

Question 1: Difference between Value-based and Policy-based Reinforcement Learning

Value-based Reinforcement Learning (RL): In value-based RL, the goal is to learn a value function V(s) or Q(s, a), which represents the expected return of being in state s or of taking action a in state s. The policy is derived from the value function, typically by choosing the action that maximizes the estimated value (for example, a greedy or epsilon-greedy policy over Q).

Policy-based Reinforcement Learning: In policy-based RL, the model learns a policy function π(a|s) directly, giving the probability distribution over actions for a given state. These methods can handle high-dimensional action spaces and are effective in scenarios where the action space is continuous or where a stochastic policy is needed.

Types of policy-based RL:
1. Deterministic Policy Gradient (DPG): Directly learns a policy that deterministically maps each state to an action.
2. Stochastic Policy Gradient (SPG): Learns a stochastic policy, typically represented as a probability distribution over actions.

At the beginner level, it is common to start with simpler policy-based methods, such as stochastic policy gradients, because they are easier to implement and analyze. (A short sketch contrasting the two approaches is given after these solutions.)

Question 2: Q-learning Algorithm for Soft Robot Movement

Conditions:
1. The hyperparameters of the Q-learning algorithm remain constant throughout the robot's movement.
2. The action is constant throughout the robot's movement.
3. Both state and action remain constant, with the hyperparameters held constant as in condition 1.

Q-function update for each condition:

Condition 1: With constant hyperparameters (learning rate α and discount factor γ), the Q-function update rule remains the same throughout the robot's movement. The standard update is
Q(s_t, a_t) <- Q(s_t, a_t) + α * (r_{t+1} + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t)).

Condition 2: If the action is constant, the robot consistently chooses the same action in every state. The Q-function reflects this: only the entries for that fixed action are updated, while the values of other actions remain unchanged.

Condition 3: If both state and action are constant, the Q-function repeatedly updates the value of that single state-action pair based on the received reward and the discounted maximum future value, and no other entries change, since neither the state nor the action ever varies.

Sequential execution under off-policy learning: Q-learning is off-policy, meaning it learns the value of the optimal (greedy) policy independently of the actions the agent actually takes. The three conditions can therefore be executed sequentially, although the effectiveness and learning performance may vary. The robot gathers experience with its behavior policy (exploration), while the Q-values are updated toward the greedy target max_a Q(s_{t+1}, a). (A sketch of this update appears below.)
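
To make the distinction in Question 1 concrete, the following is a minimal Python sketch (not part of the quiz itself) contrasting action selection in a value-based setting with a policy-based setting. The arrays q_values and policy_logits and the small discrete action space are illustrative assumptions, not a specific environment.

```python
import numpy as np

# --- Value-based: the policy is derived from a learned Q-function. ---
# Illustrative Q-values for one state over 3 discrete actions (assumed setup).
q_values = np.array([0.2, 1.5, -0.3])

# Greedy policy: pick the action with the highest estimated value.
greedy_action = int(np.argmax(q_values))

# Epsilon-greedy policy: mostly greedy, occasionally explore.
epsilon = 0.1
if np.random.rand() < epsilon:
    action_value_based = np.random.randint(len(q_values))
else:
    action_value_based = greedy_action

# --- Policy-based: a parameterized policy pi(a|s) is learned directly. ---
# Illustrative unnormalized preferences (logits) from a policy model (assumed).
policy_logits = np.array([0.1, 0.8, -0.5])
policy_probs = np.exp(policy_logits) / np.sum(np.exp(policy_logits))  # softmax

# Sample an action from the stochastic policy pi(a|s).
action_policy_based = int(np.random.choice(len(policy_probs), p=policy_probs))

print("value-based action:", action_value_based)
print("policy-based action:", action_policy_based, "probs:", policy_probs)
```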
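
For Question 2, the tabular Q-learning update under condition 1 (fixed α and γ) and its off-policy character can be sketched as follows. This is a generic illustration under assumed table sizes and a toy transition, not the robot's actual controller; the function names q_learning_update and epsilon_greedy are introduced here for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning update with fixed hyperparameters (condition 1).

    The target bootstraps from the greedy value max_a' Q(s_next, a'),
    regardless of which behavior policy produced the action a.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behavior policy used to gather experience (exploration)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

# Toy usage on an assumed 5-state, 2-action table.
Q = np.zeros((5, 2))
s = 0
a = epsilon_greedy(Q, s)          # action from the behavior policy
r, s_next = 1.0, 1                # assumed reward and next state
Q = q_learning_update(Q, s, a, r, s_next)
print(Q)
```

Note how the update uses np.max over the next state's values (the greedy target) even though the action itself came from an epsilon-greedy behavior policy; this is what makes Q-learning off-policy.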