Uploaded by Gopal Saha

Multi-Armed Bandit Problem Exam Questions

1. The Multi-Armed Bandit Problem consists of
(A) Learning problem only
Feedback : We try to find out the best arm.
(B) Decision problem only
Feedback : We try to find out which arm to try next.
(C) Learning and Decision problem
Feedback : We try to find out the best arm and then decide which arm to
try next. Please refer to video w2_m2_l1_v2_part1-Multi-Armed
Bandit Problem.
(D) Exploration only.
Feedback : We try to explore the best arm that provides the maximum
amount of rewards.
2. To solve dynamic pricing problems using multi-armed bandit algorithms,
the actions and rewards are, respectively,
(A) Both are product price.
Feedback : An efficient algorithm that indicates suitable price to
(B) Both are revenue
Feedback : Finally the objective is to maximize the revenue.
(C) Revenue and product price respectively.
Feedback : In multi-armed bandit algorithm, we try to find out the
best arm that accumulates the maximum reward.
(D) Product prices and revenues
Feedback : In a dynamic pricing problem, by considering supply-demand
and market competition, the algorithm determines a suitable product
price to charge. Please refer to video
w2_m2_l1_v2_part1-Multi-Armed Bandit Problem.
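The mapping in Question 2 (actions are product prices, rewards are revenues) can be sketched as follows. This is only an illustrative sketch: the price list `PRICES`, the purchase probabilities `BUY_PROB`, and the helper names `pull_price` and `expected_revenue` are hypothetical and do not come from the course material.

```python
import random

# Hypothetical candidate prices: each price is one arm (action).
PRICES = [5.0, 10.0, 15.0]

# Assumed purchase probability at each price. In a real problem these
# are unknown to the learner and must be estimated from feedback.
BUY_PROB = {5.0: 0.9, 10.0: 0.6, 15.0: 0.2}


def pull_price(price, rng):
    """Offer `price` to one customer; the reward is the revenue earned."""
    bought = rng.random() < BUY_PROB[price]
    return price if bought else 0.0


def expected_revenue(price):
    """Expected reward of an arm: price times purchase probability."""
    return price * BUY_PROB[price]


rng = random.Random(0)
# Five rounds of offering the same price: each reward is 0.0 or 10.0.
rewards = [pull_price(10.0, rng) for _ in range(5)]
# With full knowledge of BUY_PROB, the best arm maximizes expected revenue.
best_price = max(PRICES, key=expected_revenue)
```

Under these assumed probabilities the middle price has the highest expected revenue, which shows why the lowest or highest price is not automatically the best arm.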
3. If we apply multi-armed bandit algorithms to design a website, then
actions and rewards are respectively
(A) Clicks and engagement time
Feedback : MABP determines the best features by which the revenue
can be maximized.
(B) Font color and page layout
Feedback : These are the features of a website.
(C) Clicks and font color
Feedback : Well-designed features in a website attract customers to
(D) Font color and clicks
Feedback : Access to a website leads to revenue maximization.
Please refer to video w2_m2_l1_v2_part1-Multi-Armed Bandit
4. In Multi-Armed Bandit Problem (MABP), the roles of an agent and the
environment are respectively
(A) Returning a reward and choosing an action
Feedback : We try to explore a good algorithm which maximizes the
(B) Choosing an action and returning a reward
Feedback : The agent chooses an action and the environment returns
a reward accordingly. Please refer to video
w2_m2_l1_v2_part2-Multi-Armed Bandit Problem.
(C) Choosing an action only
Feedback : The agent chooses a suitable action.
(D) Returning a reward only
Feedback : Environment returns a reward based on the action taken by
the agent.
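The division of roles in Question 4 (the agent chooses an action, the environment returns a reward) can be sketched as a simple interaction loop. This is a minimal sketch under assumptions: the class names `BernoulliEnvironment` and `RandomAgent`, the 0/1 reward model, and the success probabilities are all hypothetical, and the agent here picks arms at random only as a placeholder for a real strategy.

```python
import random


class BernoulliEnvironment:
    """Environment: returns a 0/1 reward for the arm the agent chose."""

    def __init__(self, success_probs, seed=0):
        self.success_probs = success_probs  # assumed, hidden from the agent
        self.rng = random.Random(seed)

    def step(self, arm):
        return 1 if self.rng.random() < self.success_probs[arm] else 0


class RandomAgent:
    """Agent: chooses an action (uniformly at random, as a placeholder)."""

    def __init__(self, n_arms, seed=1):
        self.n_arms = n_arms
        self.rng = random.Random(seed)

    def choose(self):
        return self.rng.randrange(self.n_arms)


def run(agent, env, rounds):
    """Alternate between the agent's choice and the environment's reward."""
    history = []
    for _ in range(rounds):
        arm = agent.choose()      # the agent's role
        reward = env.step(arm)    # the environment's role
        history.append((arm, reward))
    return history


history = run(RandomAgent(n_arms=3), BernoulliEnvironment([0.2, 0.5, 0.8]), rounds=10)
```

A learning agent would additionally use `history` to decide which arm to try next, which is the objective described in Question 5.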
5. In the multi-armed bandit problem, the learner's objective is to
(A) Receive the reward.
Feedback : The environment interacts with the agent and returns a
reward based on the action taken.
(B) Choose a random action
Feedback : It denotes choosing any arm randomly.
(C) Exploration only.
Feedback : Agent explores the best arm to take as an action.
(D) Use outcomes of the past actions to choose the future action that
maximizes total reward (in expectation)
Feedback : The agent tries to choose the best arm or best action to
maximize the reward. It considers the outcomes of past
actions. Please refer to video w2_m2_l1_v3_part2-Multi-Armed
Bandit Problem.
1. Epsilon-Greedy strategy
(A) Only do exploration
Feedback : There must be a good balance between exploration
and exploitation.
(B) Only do exploitation
Feedback : The agent needs to optimize the rewards using the
existing results.
(C) Do exploration and exploitation together
Feedback : It performs both of these essentials together.
(D) Do exploration and exploitation separately.
Feedback : It performs exploitation after exploration. Please refer
to video w2_m2_l1_v3_part2-Multi-Armed Bandit Problem.
2. Which MABP algorithm is known as "Explore then commit"?
(A) Epsilon-Greedy 1
Feedback : First sample an arm uniformly at random. Then
find out the best arm based on the empirical mean. Finally,
sample the best arm. Please refer to video
w2_m2_l1_v3_part2-Multi-Armed Bandit Problem.
(B) Thompson’s sampling
Feedback : Another MABP algorithm which uses exploration
and exploitation together.
(C) Upper confidence bounds
Feedback : Upper confidence bounds is another MABP
algorithm. Similar to Thompson's sampling, it uses exploration
and exploitation together.
(D) Exploitation
Feedback : Based on the existing results from the past actions,
the agent chooses the best arm that maximizes the reward.
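The "Explore then commit" procedure described in the feedback to option (A) (sample arms uniformly at random, pick the arm with the best empirical mean, then keep sampling that arm) can be sketched as below. This is a minimal sketch, not the course's implementation: the function name `explore_then_commit`, the 0/1 reward model, and the parameters `explore_rounds` and `total_rounds` are assumptions made for illustration.

```python
import random


def explore_then_commit(success_probs, explore_rounds, total_rounds, seed=0):
    """Explore uniformly at random, then commit to the empirically best arm."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total reward per arm
    choices = []            # arm pulled in each round

    def pull(arm):
        # Assumed 0/1 reward with the arm's (hidden) success probability.
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)

    # Exploration phase: sample arms uniformly at random.
    for _ in range(explore_rounds):
        pull(rng.randrange(n_arms))

    # Empirical mean per arm; arms never pulled are treated as 0.0.
    means = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: means[a])

    # Commit phase: sample only the empirically best arm.
    for _ in range(total_rounds - explore_rounds):
        pull(best)

    return best, choices


best, choices = explore_then_commit([0.0, 1.0], explore_rounds=20, total_rounds=50)
```

The exploration and exploitation phases are kept strictly separate here, in contrast to algorithms such as Thompson's sampling and upper confidence bounds, which the feedback above describes as mixing the two.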