Ant Swarm Reinforcement Learning for Formulating Online Promotion Strategies

Tuck Siong Chung
P. K. Kannan

Department of Marketing
The Robert H. Smith School of Business
University of Maryland
College Park, MD 20742

Abstract

The emergence of the online channel has made the retail environment more dynamic than ever before. Rapid developments in technology allow more variations of products and services to be offered online, thereby expanding the product/service line. Customers’ preferences have also become more dynamic, partly as a result of this proliferation of products and services. In addition, the low entry barrier for competitors in many categories has contributed significantly to the dynamics. In such an environment, inferring customer preferences and responding to them with appropriate actions such as product design, pricing and promotion is becoming difficult for online and multi-channel retailers.

Typically, such strategies are developed within a static framework: data is collected first, then analyzed, and then strategies are implemented to optimally impact the market. For example, customer equity studies that identify the “best” customers, promotional strategies designed to impact the customer base optimally, and mass customization strategies all analyze data in a static framework and implement the resulting strategies on the assumption that the underlying preference patterns and market conditions have not changed significantly. If the underlying conditions are dynamic, however, such strategies do not achieve their intended results. In dynamic environments, changes in preferences and response patterns have to be learned and tracked over time (see Sutton and Barto, 2000) in order to formulate appropriate strategies. In this paper, we propose a reinforcement learning approach based on the model of ant swarms (Dorigo and Di Caro, 1999).
We focus on an online retailer that has various promotional options to induce customers to purchase items from its Web site. The promotional options include straight price discounts, incentives for cumulative purchasing (loyalty programs), e-mail based coupons, banner ad discounts, and free shipping offers. The problem the retailer faces is: which promotional tools are optimal for the different customer segments it attracts, in terms of increasing purchase probability? One possible way to learn about the effectiveness of these tools is to present the options one at a time and determine each tool’s impact. However, each customer’s reaction may depend on the state he/she is in, in addition to underlying differences among customers. For example, a customer who likes coupon promotions may not purchase anything from the online retailer if his/her time since last purchase is short, whereas the same customer may buy if the time since last purchase is long. While mathematical models can still be built to estimate the impact of such variables on purchase, the results of such a model may not remain valid if the underlying preference (coupon-proneness) changes over time. In such cases, the learning has to be continuous and reinforcing in order to track changes in the underlying dimensions of the market.

The general idea behind our reinforcement learning model is as follows. The focus is on mapping situations to actions. This is accomplished by trial-and-error search to obtain a delayed reward. The model considers four main elements: a policy, a reward function, a value function and a model of the environment. The model of the environment mimics the behavior of the online environment: given a state and an action, it predicts the resulting next state and next reward.
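The four elements named above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses ordinary tabular Q-learning with an epsilon-greedy policy in place of the ant swarm search, and everything in it is an assumption made for illustration, including the two-level state ("short" or "long" time since last purchase), the purchase probabilities of a single hypothetical coupon-prone customer, the reward of 1 per purchase, and the parameter values.

```python
import random

# Promotional tools named in the text (the actions).
ACTIONS = ["price_discount", "loyalty_incentive", "email_coupon",
           "banner_discount", "free_shipping"]
# Assumed two-level discretization of "time since last purchase" (the state).
STATES = ["short", "long"]


def purchase_prob(state, action):
    """Hypothetical response of one coupon-prone customer (made-up numbers)."""
    if action == "email_coupon":
        return 0.9 if state == "long" else 0.05
    if action == "free_shipping" and state == "short":
        return 0.6
    return 0.2 if state == "long" else 0.1


def env_step(state, action, rng):
    """Model of the environment: (state, action) -> (next_state, reward)."""
    bought = rng.random() < purchase_prob(state, action)
    reward = 1.0 if bought else 0.0  # assumed reward function: 1 per purchase
    # A purchase resets the time since last purchase; otherwise it grows.
    next_state = "short" if bought else "long"
    return next_state, reward


def choose_action(q, state, epsilon, rng):
    """Policy: epsilon-greedy over the current value estimates."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def learn(steps=60000, alpha=0.05, gamma=0.8, epsilon=0.2, seed=0):
    """Trial-and-error learning of the value function Q(state, action)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = "long"
    for _ in range(steps):
        action = choose_action(q, state, epsilon, rng)
        next_state, reward = env_step(state, action, rng)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: move toward reward plus discounted future value.
        target = reward + gamma * best_next
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state
    return q


def greedy_policy(q):
    """Map each state to the action with the highest learned value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


if __name__ == "__main__":
    print(greedy_policy(learn()))
```

Keeping the learning rate `alpha` constant rather than decaying it is a deliberate choice in this sketch: recent experience keeps its weight, so the estimates can continue to track preferences that drift over time.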
While the different promotional tools are the possible actions, each customer could be in a different state depending on whether he/she has been offered a promotion previously, his/her response to it, time since last purchase, basket size of last purchase, and so on. We model the trial-and-error search phase using an ant swarm approach (Gutjahr, 2000; Chang, 2004). Just as a swarm of ants sets out in multiple directions (parallel search) in search of a reward, our approach starts with a swarm of actions over a selected number of customers to learn about their responses. The longer this phase lasts, the better the learning will be. There is a danger, however: because the environment itself is dynamic, spending a long time learning may be useless from a rewards viewpoint if the environment changes in the meantime. Thus, it is better to learn quickly and start implementing the resulting strategy so that the learning can be exploited.

We examine the application of the ant swarm model to the online promotion problem using simulation. In Case 1, we let the underlying preferences be static and examine the efficiency of learning the efficacy of the promotional tools as a function of learning time (and data collected). In Case 2, we let the underlying preferences be dynamic and examine the efficiency of the learning. In addition, we examine the tension between “exploring” and “exploiting”. Finally, we describe how the learning model can be used in practice in setting online promotional strategies. We will present the conceptual model, the reinforcement model set-up and the results of the preliminary investigation at the conference.

References:

1. H. S. Chang, “An Ant System Based Exploration-Exploitation for Reinforcement Learning,” in Proc. of the IEEE Conf. on Systems, Man, and Cybernetics, 2004.
2. M. Dorigo and G. Di Caro, “The Ant Colony Optimization Meta-Heuristic,” in New Ideas in Optimization, D. Corne and M. Dorigo (eds.), pp. 11-32, McGraw-Hill, NY, USA, 1999.
3. W. J. Gutjahr, “A Graph-Based Ant System and Its Convergence,” Future Generation Computer Systems, vol. 16, pp. 873-888, 2000.
4. R. Sutton and A. Barto, Reinforcement Learning. MIT Press, 2000.