Safe Policy Search for Digital Marketing
Philip S. Thomas | Summer Intern, Adobe Research
| Graduate Student, UMass Amherst
© 2014 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Motivation

Researcher: “I have an algorithm that might be able to improve the
policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees
can you give me?”

Researcher: “If the algorithm is tuned properly, it usually helps quite a bit
in simulation.”

Reasonable Person: “Can you ensure that the algorithm is tuned properly
for our system? Can you ensure that an improvement in simulation will
correspond to an improvement in the real world? Can you guarantee that
your algorithm will not cost us more than $X with 95% confidence?”

Researcher: “No, no, and no.”
Motivation

Researcher: “I have an algorithm that might be able to improve the
policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees
can you give me?”

Researcher: “You can specify an amount $X and a confidence, 1 - α, and
we will only change your existing policy if we can guarantee that our
change will improve profits by more than $X with confidence 1 - α.”

Reasonable Person: “I can pick any $X and 1 - α that I want? I could
select $1 and confidence 0.999?”

Researcher: “Absolutely.”
Goal

Provide an algorithm that:

Can use on-policy and off-policy data to compute a new policy that we expect
to be better.

Will only propose a change to the policy if we can guarantee that its
performance is better than X with confidence 1 - α.

The user gets to specify X and α.

Uncompromising approach to safety.
Theoretical Challenges

We have data collected using the company's current policy parameters, θ0.

We would like to guarantee that some other policy, θ, has expected return
greater than X with confidence 1 - α.

Estimating the return of θ with off-policy data is called the off-policy
evaluation problem.

This is an active research area.

We have an even harder problem: high-confidence off-policy evaluation.
Our Approach

Three parts:

Determine whether θ are safe. (High-confidence off-policy evaluation problem).

Search θ-space for the θ that we expect to work best. (Off-policy policy search
problem).

Determine how to apply our algorithm repeatedly.
Importance Sampling
[Figure: each importance-weighted return lies between a minimum possible return, ai, and a maximum possible return, bi; the axis runs from "Small Return" to "Large Return".]
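
As a minimal sketch of the idea, the ordinary importance sampling estimator reweights each observed return by the ratio of action probabilities under the new and old policies. The policies, action names, and trajectories below are hypothetical and chosen only for illustration; the estimator used in the talk may differ in its details.

```python
from math import prod

def importance_weighted_return(trajectory, ret, pi_new, pi_old):
    """Reweight one observed return.

    trajectory: list of (observation, action) pairs generated by pi_old.
    ret: the return observed for that trajectory.
    pi_new, pi_old: functions mapping (observation, action) to a probability.
    """
    weight = prod(pi_new(o, a) / pi_old(o, a) for o, a in trajectory)
    return weight * ret

# Hypothetical policies that ignore the observation (assumption for brevity).
def pi_old(obs, action):
    return 0.5

def pi_new(obs, action):
    return 0.8 if action == "sell" else 0.2

# Hypothetical data: (trajectory, observed return) pairs collected under pi_old.
data = [
    ([(0, "sell"), (1, "sell")], 1.0),
    ([(0, "null"), (1, "sell")], 0.0),
    ([(0, "null"), (1, "null")], 0.5),
]

samples = [importance_weighted_return(t, r, pi_new, pi_old) for t, r in data]
estimate = sum(samples) / len(samples)  # estimated return of pi_new
print(samples, estimate)
```

Note how the first sample is scaled up well beyond its observed return while the last is scaled down: the weighted samples can have very different ranges.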
Ensuring Safety

Hoeffding’s Inequality
1
Pr 
 n
n

i 1

 2n 2 k 2

fˆ  , i , i   f    k   exp   n


  bi  ai
 i 1
Our Approach: Part 1 – Are θ safe?

Importance sampling results in samples with very different ranges and
variances.

We derive a novel concentration inequality that is well suited to this
particular setting:
Example
Example: Gridworld
Digital Marketing Example



Actions:

Promote: Give information about product or company. Good shortly after a
product was purchased.

Sell: Provide special offer or other direct attempt at generating a sale. Useful
after several promotions of the product.

NULL: Do not mention the product. Useful to avoid oversaturation.
Observations:

Recency: How long ago did the customer last make a purchase?

Frequency: How many purchases has the customer made?
Model of the world:

The probability of the customer buying depends on the recency, frequency, and
also other underlying properties of the customer that cannot be directly
observed.
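
One way to represent these actions and observations in code is sketched below. The type names, the integer encoding of recency and frequency, and the uniform placeholder policy are assumptions made for illustration; the slides do not specify them.

```python
import random
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    PROMOTE = "promote"  # give information about the product or company
    SELL = "sell"        # special offer or other direct attempt at a sale
    NULL = "null"        # do not mention the product

@dataclass(frozen=True)
class Observation:
    recency: int    # time since the last purchase (unit is an assumption)
    frequency: int  # number of purchases the customer has made

def uniform_policy(obs, rng):
    """Placeholder policy: picks an action uniformly at random, ignoring obs."""
    return rng.choice(list(Action))

obs = Observation(recency=3, frequency=1)
action = uniform_policy(obs, random.Random(0))
print(obs, action)
```

The hidden customer properties mentioned above are deliberately absent from `Observation`, since the policy cannot see them.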
Digital Marketing Example

The company begins with a decent but not optimal policy.

Goal: Maximize the expected (discounted) number of sales.

The company sets the performance threshold, fmin, to 90% of the number of
sales generated by its current policy, and requires a confidence of 95%.

A degradation of more than 10% should be unlikely.
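
The resulting decision rule can be sketched as follows: the policy is changed only when a 95% confidence lower bound on the candidate's performance exceeds fmin. The baseline value and the lower bounds are hypothetical numbers, and how the lower bound is computed is left outside this sketch.

```python
def choose_policy(current_theta, candidate_theta, lower_bound, f_min):
    """Return the candidate only if its confidence lower bound exceeds f_min."""
    if lower_bound > f_min:
        return candidate_theta
    return current_theta

baseline_sales = 2.0          # hypothetical performance of the current policy
f_min = 0.9 * baseline_sales  # tolerate at most a 10% degradation
# The lower bounds below are assumed to hold with confidence 1 - alpha = 0.95.

deployed = choose_policy("theta0", "theta", lower_bound=1.85, f_min=f_min)
kept = choose_policy("theta0", "theta", lower_bound=1.70, f_min=f_min)
print(deployed, kept)
```

When the bound falls short, the rule keeps the existing policy rather than gambling on the candidate.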