Ant Swarm Reinforcement Learning for Formulating Online
Promotion Strategies
Tuck Siong Chung
P. K. Kannan
Department of Marketing
The Robert H. Smith School of Business
University of Maryland
College Park, MD 20742
Abstract
The emergence of the online channel has rendered the retail environment more dynamic than ever before. Rapid developments in technology are allowing more variations of products and services to be offered online, thereby expanding product/service lines. Customers’ preferences have also become more dynamic, partly as a result of this proliferation of products and services. In addition, the low entry barriers to competition in many categories have contributed significantly to the dynamics. In such a dynamic environment, the task of inferring customer preferences and responding to them with appropriate actions such as product design, pricing, and promotion is becoming difficult for online/multi-channel retailers. Typically, many such strategies are developed using a static framework in which data is first collected, then analyzed, and the resulting strategies are implemented to impact the market optimally. For example, customer equity studies through which “best” customers are identified, promotional strategies designed to impact the customer base optimally, and mass customization strategies are all based on analyzing data in a static framework and implementing the resultant strategies under the assumption that the underlying preference patterns and market conditions have not changed significantly. However, if the underlying conditions are dynamic, such strategies do not achieve their intended results.
In dynamic environments, changes in preferences and response patterns have to be learned over time (see Sutton and Barto, 2000) and tracked in order to formulate appropriate strategies. In this paper, we propose a reinforcement learning approach based on the model of ant swarms (Dorigo and Di Caro, 1999). We focus on an online retailer that has various promotional options to induce customers to purchase items from its Web site. The promotional options include straight price discounts, incentives for cumulative purchasing (loyalty programs), e-mail-based coupons, banner ad discounts, and free shipping offers. The problem the retailer faces is: which promotional tools will be optimal for the different segments of customers it attracts, in terms of increasing the purchase probability? One possible way to learn about the effectiveness of these tools is to present the options one at a time and determine each tool’s impact. However, each customer’s reaction may depend on what state he/she is in, in addition to his/her underlying differences. For example, a customer who likes coupon promotions may not purchase anything from the online retailer if the time since his/her last purchase is short, whereas the same customer may buy if the time since the last purchase is long. While mathematical models can still be built to estimate the impact of such variables on purchase, if the underlying preference (coupon-proneness) changes over time, the results of the model may no longer be valid. In such cases, the learning has to be continuous and reinforcing to track the changes in the underlying dimensions of the market.
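To make this concrete, the following Python sketch (our own illustration, not part of the paper's model; the logistic coefficients and the drift magnitude are hypothetical) shows a customer whose probability of buying under an e-mail coupon depends on the time since the last purchase, and whose coupon-proneness drifts from week to week, which is exactly the situation in which a one-shot static model goes stale:

import math
import random

def purchase_prob(coupon_proneness, days_since_last_purchase, offered_coupon):
    # Illustrative logistic response; the coefficients are assumptions, not estimates.
    utility = -2.0 + 0.05 * days_since_last_purchase
    if offered_coupon:
        utility += coupon_proneness
    return 1.0 / (1.0 + math.exp(-utility))

random.seed(1)
proneness = 1.5  # initial coupon-proneness of this hypothetical customer
for week in range(12):
    proneness += random.gauss(0.0, 0.3)  # preference drift over time (the dynamic case)
    p_short = purchase_prob(proneness, days_since_last_purchase=3, offered_coupon=True)
    p_long = purchase_prob(proneness, days_since_last_purchase=30, offered_coupon=True)
    print(f"week {week:2d}: P(buy | 3 days) = {p_short:.2f}   P(buy | 30 days) = {p_long:.2f}")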
The general idea behind our reinforcement learning model is as follows. The focus is on mapping situations to actions, which is accomplished through trial-and-error search to obtain a delayed reward. The four main elements of our model are a policy, a reward function, a value function, and a model of the environment. The model of the environment mimics the behavior of the online environment: given a state and an action, it predicts the resultant next state and next reward. While the different promotional tools are the possible actions, each consumer could be in a different state depending on whether the customer has been offered a promotion previously, his/her response to it, the time since last purchase, the basket size of the last purchase, and so on. We model the trial-and-error search phase using an ant swarm approach (Gutjahr, 2000; Chang, 2004). Just as a swarm of ants sets out in multiple directions (parallel search) in search of a reward, our approach starts with a swarm of actions over a selected number of customers to learn about their responses. The longer this phase lasts, the better the learning will be; however, there is a danger: because the environment itself is dynamic, spending a long time learning may be useless from a rewards viewpoint if the environment changes in the meantime. Thus, it is better to learn quickly and start implementing the strategy based on that learning, so that the learning can be exploited.
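A minimal sketch of this ant-swarm search phase is given below, under assumptions of our own: the state space is reduced to two states (short vs. long time since last purchase), respond() is a stand-in for real customer behavior, and the response rates and evaporation rate are purely illustrative. Each episode releases a "swarm" of 50 customers in parallel; actions are drawn in proportion to pheromone levels, and pheromone is evaporated and reinforced with observed purchases so the learned mapping can adapt if the environment drifts:

import random

ACTIONS = ["price_discount", "loyalty_points", "email_coupon", "banner_discount", "free_shipping"]
STATES = ["recent_buyer", "lapsed_buyer"]  # short vs. long time since last purchase

pheromone = {s: {a: 1.0 for a in ACTIONS} for s in STATES}
EVAPORATION = 0.1  # forgetting rate; keeps the learning responsive to drift

def choose_action(state):
    # Sample an action with probability proportional to its pheromone level.
    weights = [pheromone[state][a] for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

def respond(state, action):
    # Hypothetical customer response: returns 1 if a purchase results, else 0.
    base = {"recent_buyer": 0.05, "lapsed_buyer": 0.20}[state]
    lift = 0.25 if (state == "lapsed_buyer" and action == "email_coupon") else 0.05
    return 1 if random.random() < base + lift else 0

random.seed(7)
for episode in range(200):  # one episode = one swarm of 50 customers acting in parallel
    deposits = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for state in (random.choice(STATES) for _ in range(50)):
        action = choose_action(state)
        deposits[state][action] += respond(state, action)
    # Ant-system style update: evaporate all pheromone, then add this swarm's deposits.
    for s in STATES:
        for a in ACTIONS:
            pheromone[s][a] = (1 - EVAPORATION) * pheromone[s][a] + deposits[s][a]

print({s: max(pheromone[s], key=pheromone[s].get) for s in STATES})  # best action per state

Because pheromone continuously evaporates in this sketch, a tool that stops producing purchases loses its advantage after a few swarms, which is the property that matters when the underlying preferences drift.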
We examine the application of the ant swarm model to the online promotion problem using simulation. In Case 1, we let the underlying preferences be static and examine the efficiency of learning the efficacy of the promotional tools as a function of learning time (and data collected). In Case 2, we let the underlying preferences be dynamic and examine the efficiency of the learning. We also examine the tension between “exploring” and “exploiting”. Finally, we describe how the learning model can be used in practice for setting online promotional strategies. We will present the conceptual model, the reinforcement learning model set-up, and the results of the preliminary investigation at the conference.
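For intuition only, the sketch below (again our own construction, with made-up response rates and a single mid-run preference flip standing in for "dynamic preferences") runs the kind of comparison Cases 1 and 2 are built around: a longer pure-exploration phase gives a better estimate of the best tool, but forgoes more reward, and the penalty is larger when preferences drift:

import random

def run(explore_episodes, drift, total_episodes=300, seed=3):
    random.seed(seed)
    true_rate = {"email_coupon": 0.4, "free_shipping": 0.2}  # hypothetical purchase rates
    counts = {a: 1 for a in true_rate}
    wins = {a: 0 for a in true_rate}
    reward = 0
    for t in range(total_episodes):
        if drift and t == total_episodes // 2:  # Case 2: preferences flip mid-run
            true_rate = {"email_coupon": 0.2, "free_shipping": 0.4}
        if t < explore_episodes:  # exploration phase: try tools uniformly at random
            action = random.choice(list(true_rate))
        else:  # exploitation phase: use the empirically best tool so far
            action = max(wins, key=lambda a: wins[a] / counts[a])
        hit = random.random() < true_rate[action]
        counts[action] += 1
        wins[action] += hit
        reward += hit
    return reward

for drift in (False, True):
    print("dynamic" if drift else "static ",
          {n: run(explore_episodes=n, drift=drift) for n in (10, 50, 150)})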
References:
1. H. S. Chang, “An ant system based exploration-exploitation for reinforcement learning,” in Proc. of the IEEE Conf. on Systems, Man, and Cybernetics, 2004.
2. M. Dorigo and G. Di Caro, “The ant colony optimization metaheuristic,” in New Ideas in Optimization, D. Corne and M. Dorigo (eds.), pp. 11-32, McGraw-Hill, New York, NY, 1999.
3. W. J. Gutjahr, “A graph-based ant system and its convergence,” Future Generation Computer Systems, vol. 16, pp. 873-888, 2000.
4. R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 2000.