Machine Learning Quiz 3 Solutions
Question 1: Difference between Value-based and Policy-based Reinforcement Learning
Value-based Reinforcement Learning (RL):
In value-based RL, the goal is to learn a value function V(s) or Q(s, a), which represents
the expected return of being in a state s or of taking an action a in state s. The policy is
derived from the value function, typically by choosing the action that maximizes the
expected value.
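As a minimal sketch of deriving the policy from the value function (the Q-table here is a
hypothetical 3-state, 2-action example, not from the quiz):

```python
import numpy as np

# Hypothetical tabular Q-function: Q[s, a] is the estimated return
# of taking action a in state s (3 states, 2 actions).
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.3, 0.3]])

def greedy_policy(Q, s):
    """Value-based control: the policy is read off the value
    estimates by picking the highest-valued action in state s."""
    return int(np.argmax(Q[s]))

print(greedy_policy(Q, 0))  # -> 1, since Q[0, 1] = 0.5 is the largest
```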
Policy-based Reinforcement Learning:
In policy-based RL, the model directly learns a policy function π(a|s) that gives a
probability distribution over actions given the state. These methods can handle
high-dimensional action spaces and are effective when the action space is continuous or
when a stochastic policy is required.
Types of Policy-based RL:
1. Deterministic Policy Gradient (DPG): Directly learns the policy function that
deterministically maps states to actions.
2. Stochastic Policy Gradient (SPG): Learns a stochastic policy, where the policy is typically
represented as a probability distribution over actions.
At the beginner level, it is common to start with the simpler stochastic policy-gradient
methods (SPG), due to their ease of implementation and analysis.
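To make the stochastic case concrete, here is a minimal sketch of a softmax policy π(a|s)
trained with the REINFORCE score-function update; the tiny state/action sizes, step size,
and observed return are illustrative assumptions, not part of the quiz:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # action preferences per state

def pi(s):
    """Stochastic policy: softmax over preferences gives P(a|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_update(s, a, G, alpha=0.1):
    """REINFORCE step: theta += alpha * G * grad log pi(a|s).
    For a softmax policy, grad log pi(a|s) = one_hot(a) - pi(s)."""
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * G * grad_log_pi

# One illustrative step: sample an action, assume we observed return G = 1.
s = 0
a = rng.choice(n_actions, p=pi(s))
reinforce_update(s, a, G=1.0)
print(pi(s))  # probability mass has shifted toward the sampled action
```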
Question 2: Q-learning Algorithm for Soft Robot Movement
Conditions:
1. The hyperparameters of the Q-learning algorithm remain constant in every movement of
the robot.
2. The action is constant throughout the robot's movement.
3. Both the state and the action remain constant, in addition to condition 1.
Q-function Update for each Condition:
Condition 1: With constant hyperparameters (learning rate α and discount factor γ), the
Q-function update rule remains the same throughout the robot's movement. The standard
update is Q(s_t, a_t) ← Q(s_t, a_t) + α * (r_{t+1} + γ * max_a Q(s_{t+1}, a) − Q(s_t, a_t)).
Condition 2: If the action is constant, the robot consistently chooses the same action in
every state. Only the Q-values for that fixed action are ever updated, so the entries for all
other actions stay at their initial values.
Condition 3: If both the state and the action are constant, the Q-function repeatedly
updates the value of that single state-action pair from the received reward and the
discounted maximum future value, without ever reflecting changes in state or action.
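For concreteness, a minimal sketch of the tabular update covering conditions 1 and 3 (the
constant α and γ, the table size, and the repeated state-action pair are assumptions for
illustration, not the robot's actual dynamics):

```python
import numpy as np

alpha, gamma = 0.1, 0.9  # constant hyperparameters (condition 1)
Q = np.zeros((5, 3))     # hypothetical 5 states x 3 actions

def q_update(s, a, r, s_next):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Condition 3: the same state-action pair is updated repeatedly.
for _ in range(100):
    q_update(s=2, a=1, r=1.0, s_next=2)
print(Q[2, 1])  # converges toward the fixed point r / (1 - gamma) = 10
```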
Sequential Execution under Off-Policy:
Off-policy learning, like Q-learning, learns the value of the optimal policy independently of
the agent's actions. It can execute the conditions sequentially, but the effectiveness and
learning performance may vary. The robot gathers experience with its behavior policy
(which explores) but updates the Q-values toward the optimal greedy policy.
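A minimal sketch of that off-policy split, assuming a standard ε-greedy behavior policy
(ε, the table size, and all values here are illustrative): actions come from the exploratory
policy, while the update bootstraps from the greedy maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, alpha, gamma = 0.1, 0.1, 0.9
Q = np.zeros((5, 3))

def behavior_action(s):
    """Behavior policy (epsilon-greedy): mostly greedy, sometimes explores."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))  # explore: random action
    return int(np.argmax(Q[s]))               # exploit: greedy action

def off_policy_update(s, a, r, s_next):
    """The target policy is greedy: the bootstrap term takes the max over
    actions, regardless of which action the behavior policy takes next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```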