GQ

1. The Multi-Armed Bandit Problem consists of
(A) Learning problem only. Feedback: We try to find out the best arm.
(B) Decision problem only. Feedback: We try to find out which arm to try next.
(C) Learning and decision problem. Feedback: We try to find out the best arm and then which arm to try next. Please refer to video w2_m2_l1_v2_part1-Multi-Armed Bandit Problem.
(D) Exploration only. Feedback: We try to explore the best arm, the one that provides the maximum amount of reward.

2. To solve dynamic pricing problems using multi-armed bandit algorithms, the actions and rewards are, respectively,
(A) Both are product prices. Feedback: An efficient algorithm indicates a suitable price to charge.
(B) Both are revenues. Feedback: Finally, the objective is to maximize the revenue.
(C) Revenues and product prices, respectively. Feedback: In a multi-armed bandit algorithm, we try to find out the best arm, the one that accumulates the maximum reward.
(D) Product prices and revenues. Feedback: In the dynamic pricing problem, by considering supply, demand, and market competition, the algorithm determines a suitable product price to charge. Please refer to video w2_m2_l1_v2_part1-Multi-Armed Bandit Problem.

3. If we apply multi-armed bandit algorithms to design a website, then the actions and rewards are, respectively,
(A) Clicks and engagement time. Feedback: MABP determines the best features by which the revenue can be maximized.
(B) Font color and page layout. Feedback: These are both features of a website.
(C) Clicks and font color. Feedback: Well-designed features of a website attract customers to click.
(D) Font color and clicks. Feedback: Access to a website leads to revenue maximization. Please refer to video w2_m2_l1_v2_part1-Multi-Armed Bandit Problem.

4. In the Multi-Armed Bandit Problem (MABP), the roles of the agent and the environment are, respectively,
(A) Returning a reward and choosing an action. Feedback: We try to explore a good algorithm which maximizes the rewards.
(B) Choosing an action and returning a reward. Feedback: The agent chooses an action, and according to that action the environment returns a reward. Please refer to video w2_m2_l1_v2_part2-Multi-Armed Bandit Problem.
(C) Choosing an action only. Feedback: The agent chooses a suitable action.
(D) Returning a reward only. Feedback: The environment returns a reward based on the action taken by the agent.

5. In the multi-armed bandit problem, the learner's objective is
(A) Receive the reward. Feedback: The environment interacts with the agent and, based on the action taken, returns a reward.
(B) Choose a random action. Feedback: This denotes choosing any arm randomly.
(C) Exploration only. Feedback: The agent explores the best arm to take as an action.
(D) Use outcomes of the past actions to choose the future action that maximizes the total reward (in expectation). Feedback: The agent tries to choose the best arm, or best action, to maximize the reward. It considers the outcomes of the past actions. Please refer to video w2_m2_l1_v3_part2-Multi-Armed Bandit Problem.

PQ

1. The Epsilon-Greedy strategy
(A) Does exploration only. Feedback: There must be a good balance between exploration and exploitation.
(B) Does exploitation only. Feedback: The agent needs to optimize the rewards using the existing results.
(C) Does exploration and exploitation together. Feedback: It performs both of these essentials together; see the sketch below.
(D) Does exploration and exploitation separately. Feedback: It performs exploitation after exploration. Please refer to video w2_m2_l1_v3_part2-Multi-Armed Bandit Problem.
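For reference on PQ 1, the following is a minimal illustrative sketch of the epsilon-greedy strategy; it is not taken from the lecture. The Bernoulli arms, the true_means values, and epsilon = 0.1 are assumptions made purely for illustration: with probability epsilon the agent explores a random arm, and otherwise it exploits the arm with the best empirical mean, so exploration and exploitation are mixed at every step.

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, horizon=1000):
    """Play a Bernoulli bandit with an epsilon-greedy policy.

    true_means: hypothetical success probability of each arm (unknown to the agent).
    epsilon: probability of exploring (choosing a random arm) at each step.
    """
    n_arms = len(true_means)
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # empirical mean reward per arm
    total_reward = 0.0

    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore: random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit: best empirical mean
        reward = 1.0 if random.random() < true_means[arm] else 0.0  # environment returns a reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]     # incremental mean update
        total_reward += reward
    return total_reward, values

# Example with three arms and hypothetical success probabilities.
print(epsilon_greedy([0.2, 0.5, 0.8]))
```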
2. Which MABP algorithm is known as "Explore then commit"?
(A) Epsilon-Greedy. Feedback: First sample an arm uniformly at random, then find the best arm based on the empirical mean, and finally sample the best arm. Please refer to video w2_m2_l1_v3_part2-Multi-Armed Bandit Problem.
(B) Thompson sampling. Feedback: Another MABP algorithm, which uses exploration and exploitation together.
(C) UCB. Feedback: Upper Confidence Bound is another MABP algorithm. Like Thompson sampling, it uses exploration and exploitation together.
(D) Exploitation. Feedback: Based on the existing results from the past actions, the agent chooses the best arm that maximizes the reward.
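The feedback on option (A) above describes the explore-then-commit pattern: sample every arm uniformly at first, pick the arm with the highest empirical mean, and then commit to it for the rest of the horizon. The sketch below illustrates that pattern under assumptions introduced only for this example (Bernoulli arms and a fixed number of exploration pulls per arm); it is not the lecture's implementation.

```python
import random

def explore_then_commit(true_means, pulls_per_arm=50, horizon=1000):
    """Explore-then-commit on a Bernoulli bandit.

    Phase 1: pull every arm `pulls_per_arm` times (uniform exploration).
    Phase 2: commit to the arm with the highest empirical mean for the
    remaining rounds (pure exploitation).
    """
    n_arms = len(true_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    total_reward = 0.0
    t = 0

    # Exploration phase: sample each arm the same number of times.
    for arm in range(n_arms):
        for _ in range(pulls_per_arm):
            reward = 1.0 if random.random() < true_means[arm] else 0.0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]
            total_reward += reward
            t += 1

    # Commit phase: always play the empirically best arm.
    best_arm = max(range(n_arms), key=lambda a: values[a])
    while t < horizon:
        total_reward += 1.0 if random.random() < true_means[best_arm] else 0.0
        t += 1
    return best_arm, total_reward

# Example with hypothetical arm means; the agent should usually commit to the last arm.
print(explore_then_commit([0.2, 0.5, 0.8]))
```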