Safe Policy Search for Digital Marketing
Philip S. Thomas | Summer Intern, Adobe Research | Graduate Student, UMass Amherst
© 2014 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Motivation
Researcher: “I have an algorithm that might be able to improve the policy that you are using.”
Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”
Researcher: “If the algorithm is tuned properly, it usually helps quite a bit in simulation.”
Reasonable Person: “Can you ensure that the algorithm is tuned properly for our system? Can you ensure that an improvement in simulation will correspond to an improvement in the real world? Can you guarantee that your algorithm will not cost us more than $X with 95% confidence?”
Researcher: “No, no, and no.”

Motivation
Researcher: “I have an algorithm that might be able to improve the policy that you are using.”
Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”
Researcher: “You can specify an amount $X and a confidence, 1 - α, and we will only change your existing policy if we can guarantee that our change will improve profits by more than $X with confidence 1 - α.”
Reasonable Person: “I can pick any $X and 1 - α that I want? I could select $1 and confidence 0.999?”
Researcher: “Absolutely.”

Goal
Provide an algorithm that:
Can use on- and off-policy data to compute a new policy that we expect to be better.
Will only propose a change to the policy if we can guarantee that its performance is better than X with confidence 1 - α.
The user gets to specify X and α.
Uncompromising approach to safety.

Theoretical Challenges
We have data generated using the company's current policy parameters, θ0.
We would like to guarantee that some other policy, θ, has expected return greater than X with confidence 1 - α.
Estimating the return of θ with off-policy data is called the off-policy evaluation problem. It is an active research area.
We have an even harder problem: high-confidence off-policy evaluation.

Our Approach
Three parts:
Determine whether θ are safe. (High-confidence off-policy evaluation problem.)
Search θ-space for the θ that we expect to work best. (Off-policy policy search problem.)
Determine how to apply our algorithm repeatedly.

Importance Sampling
[Figure: return axis running from "Small Return" at the minimum possible return, ai, to "Large Return" at the maximum possible return, bi.]

Ensuring Safety
Hoeffding’s Inequality
\[
\Pr\left( \frac{1}{n} \sum_{i=1}^{n} \hat{f}(\theta, \tau_i, \theta_i) - f(\theta) \ge k \right) \le \exp\left( \frac{-2 n^2 k^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
\]

Our Approach: Part 1 – Are θ safe?
Importance sampling results in samples with very different ranges and variances.
We derive a novel concentration inequality that is well suited to this particular setting.

Example: Gridworld

Digital Marketing Example
Actions:
Promote: Give information about product or company. Good shortly after a product was purchased.
Sell: Provide a special offer or other direct attempt at generating a sale. Useful after several promotions of the product.
NULL: Do not mention the product. Useful to avoid oversaturation.
Observations:
Recency: How long ago did the customer last make a purchase?
Frequency: How many purchases has the customer made?
Model of the world: The probability of the customer buying depends on the recency, frequency, and other underlying properties of the customer that cannot be directly observed.

Digital Marketing Example
The company begins with a decent, but not optimal, policy.
Goal: Maximize the expected (discounted) number of sales.
The company selects fmin to be 90% of the number of sales its current policy generated, and requires a confidence of 95%.
A degradation of more than 10% should be unlikely.
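
The safety check described on the Ensuring Safety slide, combined with an fmin and confidence level like those chosen above, can be sketched roughly as follows. This is a minimal illustration and not the deck's method: it uses plain Hoeffding's inequality, not the novel concentration inequality the slides mention, and the function names (`importance_weighted_return`, `hoeffding_lower_bound`, `is_safe`) are hypothetical. It also assumes the caller already knows the range [a_i, b_i] of each importance-weighted return.

```python
import math
from typing import Sequence, Tuple


def importance_weighted_return(
    ret: float,
    probs_new: Sequence[float],
    probs_old: Sequence[float],
) -> float:
    """Importance-weighted return of one trajectory.

    probs_new[t] and probs_old[t] are the probabilities that the candidate
    policy and the data-generating policy assign to the action actually
    taken at step t (assumed inputs; probs_old must be non-zero).
    """
    weight = 1.0
    for p_new, p_old in zip(probs_new, probs_old):
        weight *= p_new / p_old
    return weight * ret


def hoeffding_lower_bound(
    estimates: Sequence[float],
    ranges: Sequence[Tuple[float, float]],
    alpha: float,
) -> float:
    """Lower bound on the expected return that holds with confidence 1 - alpha.

    Sets exp(-2 n^2 k^2 / sum_i (b_i - a_i)^2) equal to alpha, solves for k,
    and subtracts k from the sample mean of the estimates.
    """
    n = len(estimates)
    mean = sum(estimates) / n
    squared_range_sum = sum((b - a) ** 2 for a, b in ranges)
    k = math.sqrt(math.log(1.0 / alpha) * squared_range_sum / (2.0 * n * n))
    return mean - k


def is_safe(
    estimates: Sequence[float],
    ranges: Sequence[Tuple[float, float]],
    f_min: float,
    alpha: float,
) -> bool:
    """Accept the candidate policy only if the lower bound reaches f_min."""
    return hoeffding_lower_bound(estimates, ranges, alpha) >= f_min
```

In the digital marketing setting above, `f_min` would be 0.9 times the number of sales generated by the current policy and `alpha` would be 0.05. Because the ranges of importance-weighted returns can be very wide, a bound of this kind may require a lot of data before any candidate passes the test.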