A Comparison of Approaches to Advertising Measurement:
Evidence from Big Field Experiments at Facebook∗
Brett Gordon
Kellogg School of Management
Northwestern University
Florian Zettelmeyer
Kellogg School of Management
Northwestern University and NBER
Neha Bhargava
Facebook
Dan Chapsky
Facebook
July 1, 2016
Version 1.1
WHITE PAPER—PRELIMINARY AND INCOMPLETE
Abstract
We examine how common techniques used to measure the causal impact of ad exposures on
users’ conversion outcomes compare to the “gold standard” of a true experiment (randomized
controlled trial). Using data from 12 US advertising lift studies at Facebook comprising 435
million user-study observations and 1.4 billion total impressions, we contrast the experimental
results to those obtained from observational methods, such as comparing exposed to unexposed users, matching methods, model-based adjustments, synthetic matched-markets tests,
and before-after tests. We show that observational methods often fail to produce the same
results as true experiments even after conditioning on information from thousands of behavioral
variables and using non-linear models. We explain why this is the case. Our findings suggest
that common approaches used to measure advertising effectiveness in industry fail to measure
accurately the true effect of ads.
∗ To maintain privacy, no data contained personally identifiable information (PII) that could identify consumers or advertisers. We thank Daniel
Slotwiner, Gabrielle Gibbs, Joseph Davin, Brian d’Alessandro, and seminar participants at Northwestern, Columbia,
CKGSB, ESMT, HBS, and Temple for helpful comments and suggestions. We particularly thank Meghan Busse for
extensive comments and editing suggestions. Gordon and Zettelmeyer have no financial interest in Facebook and were
not compensated in any way by Facebook or its affiliated companies for engaging in this research. E-mail addresses for
correspondence: b-gordon@kellogg.northwestern.edu, f-zettelmeyer@kellogg.northwestern.edu, nehab@fb.com, chapsky@fb.com
1 Introduction

1.1 The industry problem
Consider the situation of Jim Brown, a hypothetical senior marketing executive. Jim was awaiting
a presentation from his digital media team on the performance of their current online marketing
campaigns for the company’s newest line of jewelry. The team had examined offline purchase rates
for the new line and tied each purchase to a consumer’s exposure to online ads. Figure 1 showed
the key findings:
Figure 1: Conversion rate by ad behavior. Sales conversion rate: (a) not exposed, 0.02%; (b) exposed, not clicked, 0.9%; (c) exposed and clicked, 2.8%.
The digital media lead explained the graph to Jim: “We compared the sales conversion rate
during the last 60 days for consumers who (a) were not exposed to our ads, (b) were exposed to
our ads, (c) were exposed to our ads and clicked on the ads. The conversion rate of those who
were not exposed was only 0.02% and forms the baseline against which we measure the incremental
effect of the ads. Exposure to the ads led to a 0.9% conversion rate. When consumers clicked on
the ads, the sales conversion increased to 2.8%”. The digital media lead continued: “We can learn
two things from these data. First, our ads seem to be really working. Second, engagement with
the ads—meaning clicking—drives conversions. These findings show that clicking makes consumers
more likely to purchase by engaging them. We think that future ads should be designed to entice
consumers to click.”
Jim sat back and thought about the digital media team’s presentation. He was re-evaluating his
marketing strategy for this line of jewelry and wondered how these results fit in. Something seemed
off – Jim felt like he needed to know more about the consumers who had recently purchased these
items. Jim asked his team to delve into their CRM database and characterize the consumers
in each of the three groups in Figure 1.
The next day, the team reported their findings. There were startlingly large differences between
the groups of consumers who had seen no ads, had been exposed to ads but had not clicked, and
consumers who had both seen and clicked on ads. Almost all of the unexposed consumers were
men whereas the large majority of consumers who were exposed to the ads were women. Jim knew
that men were unlikely to buy this particular jewelry line. He was certain that even if they had
been shown the ads, very few men would have purchased. Furthermore, Jim noticed that 14.1% of
consumers who clicked on ads were loyalty club members compared to 2.3% for those who had not clicked.
Jim was no longer convinced of the digital media team’s conclusions that the ads were working
and that clicking drove purchases. He wondered whether the primary reason the sales conversion
rates differed so much between the left two columns of Figure 1 could be that most of the unexposed
consumers were men and most of the exposed non-clicker consumers were women. Also, did the
clickers have the highest purchase rate because the ad had induced them to click or because, as
members of the loyalty program, they were most likely to favor the company’s products in the first
place?
Jim Brown’s situation is typical: Marketing executives regularly have to interpret and weigh
evidence about advertising effectiveness in order to refine their marketing strategy and media
spend. The evidence used in the above example is merely one of numerous types of measurement
approaches used to link ad spending to business-relevant outcomes. But are there better and worse
measurement approaches? Can some approaches be trusted and others not?
In this paper we investigate how well commonly-used approaches for measuring ad effectiveness
perform. Specifically, do they reliably reveal whether or not ads have a causal effect on business-relevant outcomes such as purchases and site visits? Using a collection of advertising studies
conducted at Facebook, we investigate whether and why methods such as those presented to Jim
reliably measure the true, causal effect of advertising. We can do this because our advertising studies
were conducted as true experiments, the “gold standard” in measurement. We can use the outcomes
of these studies to reconstruct a set of commonly-used measurements of ad effectiveness and then
compare each of them to the advertising effects obtained from the randomized experiments.1
Three key findings emerge from this investigation:

• There is a significant discrepancy between the commonly-used approaches and the true experiments in our studies.

• While observational approaches sometimes come close to recovering the measurement from true experiments, it is difficult to predict a priori when this might occur.

• Commonly-used approaches are unreliable for lower funnel conversion outcomes (e.g., purchases) but somewhat more reliable for upper funnel outcomes (e.g., visits to key landing pages).

1 Our approach follows in the spirit of Lalonde (1986) and subsequent work by others, who compared observational methods with randomized experiments in the context of active labor market programs.
Of course, advertisers don’t always have the luxury of conducting true experiments. We hope,
however, that a conceptual and quantitative comparison of measurement approaches will arm the
reader with enough knowledge to evaluate measurement with a critical eye and to help identify the
best measurement solution.
1.2 Understanding Causality
Before we proceed with the investigation, we would like to quickly reacquaint the reader with
the concept of causal measurement as a foundation against which to judge different measurement
approaches.
In everyday life we don’t tend to think of establishing cause-and-effect as a particularly hard
problem. It is usually easy to see that an action caused an outcome because we often observe the
mechanism by which the two are linked. For example, if we drop a plate, we can see the plate
falling, hitting the floor and breaking. Answering the question “Why did the plate break?” is
straightforward. Establishing cause-and-effect becomes a hard problem when we don’t observe the
mechanism by which an action is linked to an outcome. Regrettably, this is true for most marketing
activities. For example, it is exceedingly rare that we can describe, let alone observe, the exact
process by which an ad persuades a consumer to buy. This makes the question "Why did the consumer buy my product—was it because of my ad or something else?" very tricky to answer.
Returning to Jim’s problem, he wanted to know whether his advertising campaign led to higher
sales conversions. Said another way, how many consumers purchased because they saw the
ad? The “because” is the crucial point here. It is easy to measure how many customers purchased.
But to know the effectiveness of an ad, one must know how many of them purchased because of
the ad (and would not have otherwise).
This question is hard to answer because many factors influence whether consumers purchase.
Customers are exposed to a multitude of ads on many different platforms and devices. Was it
today’s mobile ad that caused the consumer to purchase, yesterday’s sponsored search ad, or last
week’s TV ad? Isolating the impact of one particular cause (today’s mobile ad) on a specific
outcome (purchase) is the challenge of causal measurement.
Ideally, to measure the causal effect of an ad, we would like to answer: “How would a consumer
behave in two alternative worlds that are identical except for one difference: in one world they
see an ad, and in the other world they do not see an ad?” Ideally, these two “worlds” would be
identical in every possible way except for the ad exposure. If this were possible and we observed
a difference in outcomes (e.g. purchase, visits, clicks, retention, etc.), we could conclude the ad
caused the difference because otherwise the worlds were the same.
While the above serves as a nice thought experiment, the core problem in establishing causality
is that consumers can never be in two worlds at once - you cannot both see an ad and not see an ad
at the exact same time. The solution is a true experiment, or “randomized controlled trial.” The
idea is to assign consumers randomly to one of several “worlds,” or “conditions” as they are typically
referred to. But even if 100,000 or more consumers are randomly split in two conditions, the groups
may not be exactly identical because, of course, each group consists of different consumers.
The solution is to realize that randomization makes the groups “probabilistically equivalent,”
meaning that there are no systematic differences between the groups in their characteristics or
in how they would respond to the ads. Suppose we knew that the product appeals more to
women than to men. Now suppose that we find that consumers in the “see the ad” condition
are more likely to purchase than consumers in the “don’t see the ad” condition. Since the product
appeals more to women, we might not trust the results of our experiment if there were a higher
proportion of women in the “ad” condition than in the “no-ad” condition. The importance of
randomizing which customers are in which conditions is that if the sample of people in each group
is large enough, then the proportion of females present should be approximately equal in the ad
and no-ad conditions. What makes randomization so powerful is that it works on all consumer
characteristics at the same time – gender, search habits, online shopping preferences, etc. It even works on characteristics that are unobserved or that the experimenter doesn't realize are related to the outcome of interest. When the samples are large enough and have been truly
randomized, any difference in purchases between the conditions cannot be explained by differences
in the characteristics of consumers between the conditions—it has to have been caused by the
ad. Probabilistic equivalence allows us to compare conditions as if consumers were in two worlds
at once.
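To make probabilistic equivalence concrete, the following is a minimal simulation sketch of our own (all numbers are made up for illustration, not drawn from the studies in this paper). It randomly assigns simulated consumers to ad and no-ad conditions and shows that a characteristic such as gender ends up balanced across conditions, so the simple difference in conversion rates recovers the true effect of the ad.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical population: gender affects the baseline purchase propensity.
female = rng.random(n) < 0.6
base_rate = np.where(female, 0.030, 0.005)   # women more likely to buy this product
true_ad_effect = 0.004                       # ads add 0.4 percentage points for everyone

# Randomize who sees the ad (a coin flip), independent of characteristics.
sees_ad = rng.random(n) < 0.5
purchased = rng.random(n) < base_rate + true_ad_effect * sees_ad

# Balance check: the share of women is approximately equal in both conditions.
print("share female, ad:   ", female[sees_ad].mean())
print("share female, no ad:", female[~sees_ad].mean())

# Because the groups are probabilistically equivalent, the raw difference in
# conversion rates recovers the true effect of the ad.
print("estimated effect:", purchased[sees_ad].mean() - purchased[~sees_ad].mean())
print("true effect:     ", true_ad_effect)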
For example, suppose the graph in Figure 1 had been the result of a randomized controlled
trial. Say that 50% of consumers had been randomly chosen to not see campaign ads (the left
most column) and the other 50% to see campaign ads (the right two columns). Then the digital
media lead’s statement “our ads are really working” would have been unequivocally correct because
exposed and unexposed consumers would have been probabilistically equivalent. However, if the
digital marketing campaign run by our hypothetical Jim Brown had followed typical practices,
consumers would not have been randomly allocated into conditions in which they saw or did not
see ads. Instead, the platform’s ad targeting engine would have detected soon after the campaign
started that women were more likely to purchase than men. As a result, the engine would have
started exposing more women and fewer men to campaign ads. In fact, the job of an ad targeting
engine is to make consumers' ad exposure as far from random as possible: Targeting engines are
designed to show ads to precisely those consumers who are most likely to respond to them. In some
sense, the targeting engine “stacks the deck” by sending the ad to the people who are most likely to
buy, making it very difficult to tell whether the ad itself is actually having any incremental effect.
Hence, instead of proving the digital media lead’s statement that “our ads are really working,”
Figure 1 could be more accurately interpreted as showing that “consumers who are not interested
in buying the product don’t get shown ads and don’t buy (left column), while consumers who are
interested in buying the product do get shown ads and also buy (right columns).” Perhaps the ads
had some effect, but in this analysis it is impossible to tell whether high sales conversions were due
to ad exposure or preexisting differences between consumers.
The non-randomization of ad exposure may undermine Jim’s ability to draw conclusions from
the differences between consumers who are and are not exposed to ads, but what about the differences between columns (b) and (c), the non-clickers and the clickers? Does the difference in sales
conversion between the two groups show that clicks cause purchases? In order for that statement to
be true, it would have to be the case that, among consumers who are exposed, consumers who click
and don’t click are probabilistically equivalent. But why would some consumers click and others
not? Presumably because the ads appealed more to one group than the other. In fact, Jim’s team
found that consumers who clicked were more likely to be loyalty program members, suggesting that
they were already positively disposed to the firm’s products relative to those who did not click.
Perhaps the act of clicking had some effect, but in this analysis it is impossible to tell whether
higher sales conversions from clickers were due to clicking or because consumers who are already
loyal customers—and already predisposed to buy—are more likely to click.
In the remainder of this paper we will look at a variety of different ways to measure advertising
effectiveness through the lens of causal measurement and probabilistic equivalence. This will make
clear when it is and is not possible to make credible causal claims about the effect of ad campaigns.
2 Study design and measurement approach
The 12 advertising studies analyzed in this paper were chosen by two of the authors (Gordon and
Zettelmeyer) for their suitability for comparing several common ad effectiveness methodologies and
for exploring the problems and complications of each. All 12 studies were randomized controlled
trials held in the US. The studies are not representative of all Facebook advertising, nor are they
intended to be representative. Nonetheless, they cover a varied set of verticals (retail, financial
services, e-commerce, telecom, and tech). Each study was conducted recently (January 2015 or
later) on a large audience (at least 1 million users) and with “conversion tracking” in place. This
means that in each study the advertiser measured outcomes using a piece of Facebook-provided HTML
code, referred to as a “conversion pixel,” that the advertiser embeds on its web pages.2 This enables
an advertiser to measure whether a user visited that page. Conversion pixels can be embedded on
different pages, for example a landing page, or the checkout confirmation page. Depending on
the placement, the conversion pixel reports whether a user visited a desired section of the website
during the time of the study, or purchased.
To compare different measurement techniques to the “truth,” we first report the results of
each randomized controlled trial (henceforth an “RCT”). RCTs are the “gold standard” in causal
measurement because they ensure probabilistic equivalence between users in control and test groups
(within Facebook, the ad effectiveness RCTs we analyze in this paper are referred to as "lift tests"3).
2.1 RCT design
An RCT begins with the advertiser defining a new marketing campaign which includes deciding
which consumers to target. For example, the advertiser might want to reach all users that match
a certain set of demographic variables, e.g., all women between the ages of 18 and 54. This choice
determines the set of users included in the study sample. Each user in the study sample was
randomly assigned to either the control group or the test group according to some proportion
selected by the advertiser (in consultation with Facebook). Users in the test group were eligible to
see the campaign’s ads during the study. Which ad gets served for a particular impression is the
result of an auction between advertisers competing for that impression. The opportunity set is the
collection of display ads that compete in an auction for an impression.4 Whether eligible users in
the test group ended up being exposed to a campaign’s ads depended on whether the user accessed
Facebook during the study period and whether the advertiser was in the opportunity set and was
the highest bidder for at least one impression on the user’s News Feed.
2 We use "conversion pixel" to refer to two different types of conversion pixels used by Facebook. One was traditionally referred to as a "conversion pixel" and the other is referred to as a "Facebook pixel". Both types of pixels were used in the studies analyzed in this paper. For our purposes both pixels work the same way (see https://www.facebook.com/business/help/460491677335370).

3 See https://www.facebook.com/business/news/conversion-lift-measurement

4 The advertising platform determines which ads are part of the opportunity set based on a combination of factors: how recently the user was served any ad in the campaign, how recently the user saw ads from the same advertiser, the overall number of ads the user was served in the past twenty-four hours, the "relevance score" of the advertiser, and others. The relevance score attempts to adjust for whether a user is likely to be a good match for an ad (https://www.facebook.com/business/news/relevance-score).
Users in the control group were never exposed to campaign ads during the study. This raises
the question: What should users in the control group be shown in place of the focal advertiser’s
campaign ads? One possibility is not to show control group users any ads at all, i.e., to replace the
advertiser’s campaign ads with non-advertising content. However, this creates significant opportunity costs for an advertising platform, and is therefore not implemented at Facebook. Instead,
Facebook serves each control group user the ad that this user would have seen if the advertiser’s
campaign had never been run.
We illustrate how this process works using a hypothetical and stylized example in Figure 2.
Consider two users in the test and control groups, respectively. Suppose that at one particular
instant, Jasper’s Market wins the auction to display an impression for the test group user, as seen
in Figure 2a. Imagine that the control group user, who occupies a parallel world to the test user,
would have been served the same ad had this user been in the test group. However, the platform,
recognizing the user’s assignment to the control group, prevents the focal ad from being displayed.
As Figure 2b shows, instead the second-place ad in the auction is served to the control user because
it is the ad that would have won the auction in the absence of the focal campaign.
Of course, Figure 2 is a stylized view of the experimental design (clearly there are no two
identical users in parallel worlds). For users in the test group, the platform serves ads in a regular
manner, including those from the focal advertiser. The process above is only relevant for users in
the control group and if the opportunity set contains the focal ad on a given impression. If the
focal ad does not win the auction, there is no intervention—whatever ad wins the auction is served
because the same ad would have been served in the absence of the focal advertiser’s campaign.
However, if the focal ad wins the auction, the system removes it and instead displays the second-place ad. In the example, Waterford Lux Resorts is the "control ad" shown to the control user.
At another instant when Jasper’s Market would have won the auction, a different advertiser might
occupy the second-place rank in the auction. Thus, instead of there being a single control ad, users
in the control condition are shown the distribution of ads they would have seen if the advertiser’s
campaign had not run.
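As a hypothetical sketch of the logic just described (our own simplification, not Facebook's production system), the control-group intervention can be thought of as: run the auction as usual, and only when the focal campaign wins for a control-group user, fall back to the runner-up ad. The advertiser names and a ranking based purely on bid amount are invented for illustration.

from dataclasses import dataclass

@dataclass
class Bid:
    advertiser: str
    amount: float

def ad_to_serve(user_group: str, focal_advertiser: str, bids: list[Bid]) -> str:
    """Simplified single-slot auction with an RCT holdout for the focal campaign.

    user_group is "test" or "control". Ranking here is by bid amount only; the
    real platform also uses relevance, pacing, and other factors.
    """
    ranked = sorted(bids, key=lambda b: b.amount, reverse=True)
    winner = ranked[0]
    if user_group == "control" and winner.advertiser == focal_advertiser:
        # Control users are never shown the focal campaign: serve the ad that
        # would have won had the focal campaign never run (the runner-up).
        return ranked[1].advertiser
    return winner.advertiser

bids = [Bid("Jasper's Market", 2.10), Bid("Waterford Lux Resorts", 1.95), Bid("Other", 1.20)]
print(ad_to_serve("test", "Jasper's Market", bids))     # Jasper's Market
print(ad_to_serve("control", "Jasper's Market", bids))  # Waterford Lux Resorts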
A key assumption for a valid RCT is that there is no contamination between control and test
groups, meaning users in the control group cannot have been inadvertently shown campaign ads,
even if the user accessed the site multiple times from different devices or browsers. Fortunately for
us, users must log into Facebook each time they access the service on any device, meaning that
campaign ads were never shown inadvertently to users in the control group. This allowed us to
bypass a potential problem with cookie-based measurements: that different browsers and devices
cannot always be reliably identified as belonging to the same consumer.5
5 In addition, a valid RCT requires the Stable Unit Treatment Value Assumption (SUTVA). In the context of our study, this means that the potential outcome of one user should be unaffected by the particular assignment of other users to treatment and control groups (no interference). This assumption might be violated if users use the share button to share ads with other users, which was not widely observed in our studies.
Figure 2: Determination of control ads in Facebook experiments. (a) Step 1: Determine that a user in the control group would have been served the focal ad. (b) Step 2: Serve the next ad in the auction.
2.2 Ad effectiveness metrics in an RCT
To illustrate how we report the effectiveness of ad campaigns, suppose that 0.8% of users in the
control group and 1.2% of users in the test group purchased during a hypothetical study period. One
might be tempted to interpret this as “exposure to ads increased the share of consumers buying by
0.4 percentage points, or an increase in purchase likelihood of 50%.” This interpretation, however,
would be incorrect. The reason is that not all consumers who were assigned to the test group were
exposed to ads during the study. (This could be because, after they were assigned to the test group,
some users did not log into Facebook; because the advertiser did not win any ad auctions for some
users; because some users did not scroll down far enough in their newsfeed to where a particular
ad was placed, etc.). Hence, the test group contains some users who cannot have been affected by
ads because they did not see them.
Continuing the example, suppose that only half the users in the test group saw the ads. This
means that the 0.4% difference between the 0.8% conversion rate in the control group and the 1.2%
conversion rate in the test group must have been generated by the half of users in the test group
who actually were exposed to the campaign ads—the effect of an ad on a user who doesn’t see it
must be zero. To help see this, we can express the 0.4% difference as a weighted average of (i) the
effect of ads on the 50% of users who actually saw them (let’s call this the “incremental conversion
rate” due to the ads or “ICR”) and (ii) the 0 percent effect on the 50% of users who did not see
them:
0.5 ∗ ICR + 0.5 ∗ 0% = 0.4%     (1)
Solving this simple equation for ICR shows that the incremental conversion rate due to ad
exposure is 0.8%.
ICR = 0.4% / 0.5 = 0.8%     (2)
The interpretation of this incremental conversion rate is that consumers who were exposed to
campaign ads were more likely by 0.8 percentage points to purchase the product than they would
have been had they not been exposed to these ads during the study period.6
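The back-of-the-envelope calculation in equations (1) and (2) can be written out directly; the short sketch below simply reproduces the hypothetical numbers from this example.

# Hypothetical numbers from the example above.
control_rate = 0.008    # 0.8% conversion in the control group
test_rate = 0.012       # 1.2% conversion in the whole test group
exposed_share = 0.5     # half of the test group actually saw the ads

# The test-control difference is a weighted average of the effect on exposed
# users (the ICR) and a zero effect on unexposed users:
#   exposed_share * ICR + (1 - exposed_share) * 0 = test_rate - control_rate
icr = (test_rate - control_rate) / exposed_share
print(f"incremental conversion rate: {icr:.1%}")   # 0.8%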
6 The ICR is also commonly referred to as the "average treatment effect on the treated."

Continuing with our example, suppose that we examine the sales conversion rate of the consumers in our test sample who were exposed to the ads and find that it is 1.8%. At first, this might seem puzzling: If the sales conversion rate for the half of the test group that saw the ads is 1.8%,
and the sales conversion rate for the whole test group is 1.2%, then the sales conversion rate for
the half of the test group who did not see the ads must be 0.6%. (This is because 0.6% is the
number that when averaged with 1.8% equals 1.2%: 0.5 ∗ 1.8% + 0.5 ∗ 0.6% = 1.2%.) This means
that the unexposed members of the test group had a lower sales conversion rate than the control
group. What happened? Did our randomization go wrong?
The answer is no—there is nothing wrong with the randomization. Remember that we can
randomize whether an individual is in the test or control group, but we can’t randomize whether
or not a person sees the ad. This depends in part on the targeting mechanism (which, as we have
already discussed, will tend to show ads to people who are likely to respond and to avoid showing
ads to people who are unlikely to respond) and in part on the person’s own behavior (someone who
is trekking in Nepal is unlikely to be spending time on Facebook and also unlikely to be making
online purchases). Said another way, the set of people within the test group who actually see the ad
is not a random sample but a “selected” sample. There are several different “selection mechanisms”
that might be at work in determining who in the test group actually sees the ad. The targeting
engine and vacations are two we have already mentioned (we will discuss a third later).
The reason for using the incremental conversion rate as the measure of the effect of the ad is to
account for the possibility that the test group members who saw the ad are a selected (rather than random) subset of the test group. The results of our RCT were that the sales conversion rate in the test sample, at 1.2%, was higher by 0.4 percentage points than the sales conversion rate in the control sample, at 0.8%. As we described above, this difference must be driven by the half of the test group who saw the ads, which means that the incremental effect of the ads must have been to increase the sales conversion rate within this group by 0.8 percentage points. The actual sales conversion
rate in this exposed group was 1.8%, which implies that if the exposed group had not seen the ads,
their sales conversion rate would have been 1.0% (equal to the 1.8% sales conversion rate minus
the 0.8% calculated incremental conversion rate).
The power of the RCT is that it gives us a way to measure what we can’t observe, namely,
what would have happened in the alternative world in which people who actually saw the ad didn’t
see the ad (see Figure 3). This measure, 1.0% in our example, is called the "counterfactual" or "but
for” conversion rate—what the conversion rate would have been in the world counter to the factual
world, or what the conversion rate would have been but for the advertisement exposure.
The incremental conversion rate (ICR) is the actual conversion rate minus the counterfactual
conversion rate, 1.8%-1.0% in our example. The measure that we will use to summarize the outcomes of the Facebook advertising studies, which is what Facebook uses internally, is "lift."
Figure 3: Measuring the incremental conversion rate. Test group (eligible to be exposed): 50% exposed, converting at 1.8%; 50% unexposed, converting at 0.6%; overall conversion rate 1.2%. Control group (unexposed): the 50% who would have been exposed had they been in the test group convert at the counterfactual rate of 1.0%; the 50% who would not have been exposed convert at 0.6%; overall conversion rate 0.8%. ICR = 1.8% − 1.0% = 0.8%.
Lift simply expresses the incremental conversion rate as a percentage effect:

Lift = (Actual conversion rate − Counterfactual conversion rate) / Counterfactual conversion rate     (3)

In our example, the lift is (1.8% − 1.0%) / 1.0% = 0.8, or 80%. The interpretation is that exposure to the ad lifted the sales conversion rate of the kinds of consumers who were exposed to the ad by 80%.
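Putting the pieces of the example together, a short sketch of the lift calculation (again using the hypothetical rates from this section):

exposed_rate = 0.018        # observed conversion rate of exposed test-group users
icr = 0.008                 # incremental conversion rate computed above
counterfactual_rate = exposed_rate - icr   # 1.0%: what exposed users would have done without ads

lift = (exposed_rate - counterfactual_rate) / counterfactual_rate
print(f"counterfactual rate: {counterfactual_rate:.1%}")   # 1.0%
print(f"lift: {lift:.0%}")                                  # 80%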
3 Alternative measures of advertising effectiveness
In this section we will describe the various advertising effectiveness measures whose performance we
wish to assess in comparison to randomized controlled trials. In order to be able to talk about these
measurement techniques concretely rather than only generally, we will apply them to one typical
advertising study (we refer to it as “study 4”). This study was performed for the advertising
campaign of an omni-channel retailer. The campaign took place over two weeks in the first half of
2015 and comprised a total of 25.5 million users. Ads were shown on mobile and desktop Facebook
newsfeeds in the US. For this study the conversion pixel was embedded on the checkout confirmation
page. The outcome measured in this study is whether a user purchased online during the study
and up to several weeks after the study ended.7 Users were randomly split into test and control
groups in proportions of 70% and 30%, respectively.
7 Even if some users convert as a result of seeing the ads further in the future, this still implies the experiment will produce conservative estimates of advertising's effects.
3.1 Results from a Randomized Controlled Trial
We begin by presenting the outcome from the randomized controlled trial. In later sections, we will
apply alternative advertising effectiveness measures to the data generated by study 4. To preserve
confidentiality, all of the conversion rates in this section have been scaled by a random constant.
Our first step is to check whether the randomization appears to have resulted in probabilistically
equivalent test and control groups. One way to check this is to see whether the two groups are
similar on variables we observe. As Table 1 shows for a few key variables, test and control groups
match very closely and are statistically indistinguishable (the p-values are above 0.05).
Table 1: Randomization check

Variable                              Control group    Test group    p-value
Average user age                      31.7             31.7          0.33
% of users who are male               17.2%            17.2%         0.705
Length of time using FB (days)        2,288            2,287         0.24
% of users with status "married"      19.6             19.6          0.508
% of users with status "engaged"      13.8             13.8          0.0892
% of users with status "single"       14.0             14.0          0.888
# of FB friends                       485.7            485.7         0.985
# of FB uses in last 7 days           6.377            6.376         0.14
# of FB uses in last 28 days          25.5             25.5          0.172
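A randomization check like Table 1 can be run with standard two-sample tests. The following is a minimal sketch with simulated stand-ins for two of the variables (we are not distributing user-level data, so the inputs are fabricated for illustration).

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

# Simulated stand-ins for "average user age" and "% of users who are male".
age_control = rng.normal(31.7, 10, size=n)
age_test = rng.normal(31.7, 10, size=n)
male_control = rng.random(n) < 0.172
male_test = rng.random(n) < 0.172

# Continuous variable: two-sample t-test.
t_stat, p_age = stats.ttest_ind(age_control, age_test)
print(f"age: p-value = {p_age:.3f}")

# Binary variable: compare proportions via a 2x2 chi-squared test.
table = np.array([[male_control.sum(), (~male_control).sum()],
                  [male_test.sum(), (~male_test).sum()]])
chi2, p_male, _, _ = stats.chi2_contingency(table)
print(f"% male: p-value = {p_male:.3f}")

# Large p-values are consistent with the groups being probabilistically equivalent.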
As Figure 4 shows, the incremental conversion rate in this study was 0.045% (statistically
different from 0 at a 5% significance level). This is the difference between the conversion rate of
exposed users (0.104%) and their counterfactual conversion rate (0.059%), i.e., the conversion rate
of these users had they not been exposed. Hence, in this study the lift of the campaign was 77%
(=0.045%/0.059%). The 95% confidence interval for this lift is [37%, 117%].8
We will interpret the 77% lift measured by the RCT as our gold standard measure of the truth.
In the following sections we will calculate alternative measures of advertising effectiveness to see
how close they come to this 77% benchmark. These comparisons reveal how close to (or far from) the truth an advertiser who was unable (or chose not) to evaluate their campaign with an RCT would be.
8 See the technical appendix for details on how to compute the confidence interval for the lift.

Figure 4: Results from RCT. Test group (eligible to be exposed): 25% exposed, converting at 0.104%; 75% unexposed, converting at 0.020%; overall conversion rate 0.042%. Control group (unexposed): users who would have been exposed had they been in the test group convert at the counterfactual rate of 0.059%; users who would not have been exposed convert at 0.020%; overall conversion rate 0.030%. ICR = 0.045%, lift = 77%.

3.2 Individual-based Comparisons

Suppose that instead of conducting a randomized controlled trial, an advertiser had followed customary practice by choosing a target sample (such as single women aged 18-49) and
made all of them eligible to see the ad. (Note that this is equivalent to creating a test sample
without a control group held out). How might one evaluate the effectiveness of the ad? In the
rest of this section we will consider several alternatives: comparing exposed vs. unexposed users,
matching methods, and model-based adjustment techniques.
3.2.1 Exposed/Unexposed
One simple approach would be to compare the sales conversion rates of exposed vs. unexposed
users. This is possible because even if an entire set of users is eligible to see an ad not all of them
will (because the user doesn’t log into Facebook during the study period, because the advertiser
doesn’t win any auctions for this user, etc.). We can simulate being in this situation using the data
from study 4 by using data only from the test condition. Essentially, we are pretending that the
30% of users in the target segment of study 4 who were randomly selected for the control group do
not exist.
Within the test group of study 4, the conversion rate among exposed users was 0.104% and
the conversion rate among unexposed users was 0.020%, implying an ICR of 0.084% and a lift of
416%. This estimate is more than five times the true lift of 77% and therefore greatly overstates
the effectiveness of the ad.
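Computationally, the naive comparison is nothing more than the following (the rounded rates are from study 4's test group; the variable names are ours):

exposed_rate = 0.00104     # 0.104% conversion among exposed test-group users
unexposed_rate = 0.00020   # 0.020% conversion among unexposed test-group users

icr_naive = exposed_rate - unexposed_rate
lift_naive = icr_naive / unexposed_rate
print(f"naive ICR:  {icr_naive:.3%}")   # 0.084%
print(f"naive lift: {lift_naive:.0%}")  # ~420% with these rounded inputs; the study's
                                        # unrounded rates give the 416% reported in the text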
It is well known among ad measurement experts that simply comparing exposed to unexposed
consumers yields problematic results. Many of the methods we describe in the following subsection
were designed to improve on the well-known failings of this approach. In the remainder of this
subsection we will delve into why this comparison yields problematic results in the first place. The answers are helpful for understanding the other analyses in this paper.
Recall that one can attribute the difference in conversion rates between the groups solely to
advertising only if consumers who are exposed and unexposed to ads are probabilistically equivalent.
In practice, this is often not true. As we have discussed above, exposed and unexposed users likely
differ in many ways, some observed and others unobserved.
Figure 5 depicts some of the differences between the exposed and unexposed groups (within
the test group) in study 4. The figure displays percentage differences relative to the average of the
entire test group. For example, the second item in Figure 5 shows that exposed users are about
10% less likely to be male than the average for the test group as a whole, while unexposed users
are several percentage points more likely to be male than the average for the whole group. The
figure shows that exposed and unexposed users differ in other ways as well. In addition to having more
female users, the exposed group is slightly younger, more likely to be married, has more Facebook
friends, and tends to access Facebook more frequently from a mobile device than a desktop. The
two groups also own different types of phones.
These differences warn us that a comparison of exposed and unexposed users is unlikely to
satisfy probabilistic equivalence. The more plausible it is that these characteristics are correlated
with the underlying probability that a user will purchase, the more comparing groups that differ in
these characteristics will confound the effect of the ads with the differences in group composition.
The inflated lift estimate of 416% generated by the simple exposed vs. unexposed comparison suggests that there is a strong confounding effect in study 4. In short, exposed users converted more frequently than unexposed users because they were different in other ways that made them more likely to convert in the first place, irrespective of advertising exposures. This confounding effect—in which differences in outcomes that arise from differences in underlying characteristics between groups are attributed instead to differences in ad exposure—is called selection bias.
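A small simulation of our own (with made-up numbers) illustrates the mechanism: if exposure is steered toward users with a higher baseline conversion rate, the naive exposed-vs-unexposed lift far exceeds the true lift even when the effect of the ad is held fixed.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Latent purchase intent; assume the targeting engine can partially observe it.
intent = rng.random(n)
baseline = 0.0002 + 0.0018 * intent          # conversion rate without ads
true_lift = 0.77                             # ads raise exposed users' rate by 77%

# Exposure is *not* random: high-intent users are much more likely to be exposed.
p_exposed = 0.05 + 0.45 * intent
exposed = rng.random(n) < p_exposed

rate = baseline * np.where(exposed, 1 + true_lift, 1.0)
converted = rng.random(n) < rate

naive_lift = converted[exposed].mean() / converted[~exposed].mean() - 1
print(f"true lift:  {true_lift:.0%}")
print(f"naive lift: {naive_lift:.0%}")   # substantially larger than the true lift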
One remaining question is why the characteristics of exposed and unexposed users are so different. While selection effects can arise in various advertising settings, there are three features of
online advertising environments that make selection bias particularly significant.
First, an ad is delivered when the advertiser wins the underlying auction for an impression.
Winning the auction implies the advertiser out-bid the other advertisers competing for the same
impression. Advertisers usually bid more for impressions that are valuable to them, meaning impressions for users who are more likely to convert. Additionally, Facebook and some other publishers prefer to show ads to consumers who are more likely to enjoy them. This means that an advertiser's ads are more likely to be shown to users who are more likely to respond to its ads, and to users who are less likely to respond to the other
advertisers who are currently active on Facebook.

Figure 5: Comparison of exposed and unexposed users in the test group of study 4 (expressed as percentage differences relative to the average of the entire test group). Characteristics compared: Age, Male, FBAge, Married, Single, #ofFriends, FBWeb, FBMobile, PhoneA, PhoneB, PhoneC.

Even if an advertiser triggers little selection bias
based on its own advertising, it can nevertheless end up with a selected exposed group because of
what another advertiser does. For example, Figure 5 shows that in study 4, exposed users were
more likely to be women than men. This could be because the advertiser in study 4 was placing
relatively high bids for impressions to women. But it could also be because another advertiser who
was active during study 4 was bidding aggressively for men, leaving more women to be won by the
advertiser in study 4.
A second mechanism that drives selection is the optimization algorithms that exist on modern
advertising delivery platforms. Advertisers and platforms try to optimize the types of consumers
that should be shown an ad. A campaign that seeks to optimize on purchases will gradually refine
the targeting and delivery rules to identify users who are most likely to convert. For example,
suppose an advertiser initially targets female users between the ages of 18 and 55. After the
campaign’s first week, the platform observes that females between 18 and 34 are especially likely to
convert. As a result, the ad platform will increase the frequency that the ad campaign enters into
the ad auction for this set of consumers, resulting in more impressions targeted at this narrower
group. These optimization routines perpetuate an imbalance between exposed and unexposed test
group users: the exposed group will contain more 18-34 females and the unexposed group will
contain more 35-55 females. Assessing lift by comparing exposed vs. unexposed consumers will
therefore overstate the effectiveness of advertising because exposed users were specifically chosen
on the basis of their higher conversion rates.9
The final mechanism is subtle, but arises from the simple observation that a user must actually
visit Facebook during the campaign to be exposed. If conversion is purely a digital outcome (e.g.,
online purchase, registration, key landing page), exposed users will be more likely to convert simply
because they happened to be online during the campaign. Lewis, Rao, and Reiley (2011) show that
consumer choices such as this can lead to activity bias that complicates measuring causal effects
online.
We have described why selection bias is likely to plague simple exposed vs. unexposed comparisons, especially in online advertising environments. However, numerous statistical methods exist
that attempt to remedy these challenges. Some of the most popular ones are discussed next.
3.2.2 Exact Matching and Propensity Score Matching
In the previous section, we saw how comparing the exposed and unexposed groups is inappropriate
because they contain different compositions of consumers. But if we can observe how the groups
differ based on characteristics observed in the data, can we take them into account in some way
that improves our estimates of advertising effectiveness?
This is the logic behind matching methods. For starters, suppose that we believe that age and
gender alone determine whether a user is exposed or not. In other words, women might be more
likely to see an ad than men and younger people more likely than older, but among people of the
same age and gender, whether a particular user sees an ad is as good as random.
In this case a matching method is very simple to implement. Start with the set of all exposed
users. For each exposed user, choose randomly from the set of unexposed users a user of the same
age and gender. Repeat this for all the exposed users, potentially allowing the same unexposed
user to be matched to multiple exposed users. This method of matching essentially creates a
comparison set of unexposed users whose age and gender mix now matches that of the exposed
group. If it is indeed the case that within age and gender exposure is random, then we have
constructed probabilistically equivalent groups of consumers. The advertising effect is calculated
as the average difference in outcomes between the exposed users and the paired unexposed users.10
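The weighted variant of this exact-matching calculation (see footnote 10) can be sketched in a few lines. The data frame, with columns age, gender, exposed, and converted, is fabricated for illustration and is not the study data.

import pandas as pd

def exact_matching_lift(df: pd.DataFrame) -> float:
    """Weighted exact matching on age and gender.

    For each age-gender cell, unexposed users stand in for the counterfactual of
    the exposed users in that cell; cells without both groups are dropped
    (enforcing common support).
    """
    exposed = df[df["exposed"]]
    unexposed = df[~df["exposed"]]

    # Counterfactual conversion rate per cell, estimated from unexposed users.
    cf_by_cell = unexposed.groupby(["age", "gender"])["converted"].mean()

    # Attach the cell-level counterfactual to each exposed user; drop exposed
    # users whose cell has no unexposed counterpart.
    matched = exposed.join(cf_by_cell.rename("cf_rate"), on=["age", "gender"])
    matched = matched.dropna(subset=["cf_rate"])

    exposed_rate = matched["converted"].mean()
    counterfactual_rate = matched["cf_rate"].mean()
    return exposed_rate / counterfactual_rate - 1

# Tiny fabricated example.
df = pd.DataFrame({
    "age": [25, 25, 25, 40, 40, 40, 40, 25],
    "gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "exposed": [True, False, False, True, False, True, False, True],
    "converted": [1, 0, 1, 0, 0, 1, 0, 0],
})
print(f"exact-matching lift estimate: {exact_matching_lift(df):.0%}")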
9 Facebook's ad testing platform is specifically designed to account for the fact that targeting rules for a campaign change over time. This is accomplished by applying the new targeting rules both to test and control groups, even though users in the control group are never actually exposed to campaign ads.
10 In practice, the matching is not usually a simple one-to-one pairing. Instead, each exposed user is matched to a weighted average of all the unexposed users with the same age and gender combination. This makes use of all the information that can be gleaned from unexposed users, while adjusting for imbalances between exposed and unexposed users.
We applied exact matching on age and gender to study 4. Within the test group of study 4, there
were 113 unique combinations of age and gender for which there was at least one exposed and at
least one unexposed user.11 Using this method, which we’ll label “Exact Matching,” exposed users
converted at a rate of 0.104% and unexposed users at 0.032%, for a lift of 221%.12 This estimate
is roughly half the number we obtained from directly comparing the exposed and unexposed users.
Matching on age and gender has reduced a lot of the selection bias, but this estimate is still
almost three times the 77% measure from the RCT, so we haven’t eliminated all of it. An obvious
reason for this is that age and gender are not the only factors that determine advertising exposure.
(We can see this for study 4 by looking at Figure 5). In addition to age and gender, one might want
to match users on their geographic location, phone type, relationship status, number of friends,
Facebook activity on mobile vs. desktop devices, and more. The problem is that as we add new
characteristics on which to match consumers, it gets harder and harder to find exact matches. This
becomes especially difficult when we think about the large number of attributes most advertisers
observe about a given user.
Fortunately there are matching methods that do not require exact matching, many of which
are already commonly used by marketing modelers. At a basic level, matching methods try to pair
exposed users with similar unexposed users. Previously we defined “similar” as exactly matching
on age and gender. Now we just need a clever way of defining “similarity” that allows for inexact
matches.
One popular method is to match users on their so-called propensity score. The propensity score
approach uses data we already have on users’ characteristics and whether or not they were exposed
to an ad to create an estimate, based on the user's characteristics, of how likely that user is to
have been exposed to an ad.13 The idea is that the propensity score summarizes all of the relevant
11 It would have been possible to create as many as 120 age-gender combinations, but seven combinations were dropped because they were absent from either the exposed or unexposed group. A total of 15 users were dropped for this reason. A lack of comparison groups is a common problem in matching methods. There is no guarantee that a good match for each exposed user will always exist. In our case, for example, there was a 78-year-old male who was exposed to advertising but there were no 78-year-old males in the unexposed group. The general recommendation in these situations is to drop users that do not have a match in the other group, which is known as enforcing a common support between the groups. We cannot make a reliable inference about the effect of exposure on a user who does not have a match in the unexposed group. Of course, we could pair a 78-year-old male with a 77-year-old male, but the match would not be exact. Other matching methods, such as propensity score matching and nearest-neighbor matching, permit such inexact matches, and we discuss these methods shortly.

12 For these results we used the weighted version of exact matching described in footnote 10, rather than the one-to-one matching described in the main body of the text.

13 Calculating the propensity score for each user is easy and typically done using standard statistical tools, such as logistic regression. More sophisticated approaches based on machine learning algorithms, such as random forests, can also be used. Rather than matching on the propensity score directly, most researchers recommend matching on the "logit" transformation of the propensities because it linearizes values on the 0-1 interval.
information about a consumer in a single number. Said another way, propensity scores enable us
to collapse many dimensions of consumer attributes into a single scale that measures specifically
how similar consumers are in their propensity to be exposed to an ad. With this measure in hand,
we just match people using their propensity scores in much the same way as we did earlier. For
each exposed user, we find the unexposed user with the closest propensity score, discarding any
individuals that don’t have a close enough match in the other group. Advertising effectiveness
is estimated as the difference in conversion rates between the matched exposed and unexposed
groups. The key assumption underlying this approach is that, for two people with the same (or
very close) propensity scores, exposure is as good as random. Hence, for two consumers who both
had propensity scores of 65%, one of whom was actually exposed while the other was not, we are
assuming it’s as if a coin flip had determined which user ended up being exposed. By pairing users
with close propensity scores, we can once again construct probabilistically equivalent groups to
measure the causal effect of ad exposure.14
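Under these assumptions, a minimal propensity-score-matching sketch looks like the following (logistic regression for the propensity, one nearest neighbor on the logit of the score). The column names and the fabricated data are ours, invented purely for illustration.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_lift(df: pd.DataFrame, features: list) -> float:
    """Nearest-neighbor propensity score matching on the logit of the score."""
    X = df[features].to_numpy()
    exposed = df["exposed"].to_numpy()

    # Step 1: model the probability of exposure given observed characteristics.
    model = LogisticRegression(max_iter=1000).fit(X, exposed)
    p = model.predict_proba(X)[:, 1].clip(1e-6, 1 - 1e-6)
    logit = np.log(p / (1 - p))

    # Step 2: for each exposed user, find the unexposed user with the closest score.
    nn = NearestNeighbors(n_neighbors=1).fit(logit[~exposed].reshape(-1, 1))
    _, idx = nn.kneighbors(logit[exposed].reshape(-1, 1))

    y = df["converted"].to_numpy()
    exposed_rate = y[exposed].mean()
    counterfactual_rate = y[~exposed][idx.ravel()].mean()
    return exposed_rate / counterfactual_rate - 1

# Fabricated data: exposure and conversion both depend on age and gender.
rng = np.random.default_rng(3)
n = 50_000
age = rng.integers(18, 65, n)
female = rng.random(n) < 0.5
p_exp = 1 / (1 + np.exp(-(-1.0 + 0.03 * (45 - age) + 0.8 * female)))
exposed = rng.random(n) < p_exp
p_conv = 0.002 + 0.004 * female + 0.002 * exposed
df = pd.DataFrame({"age": age, "female": female.astype(int),
                   "exposed": exposed, "converted": rng.random(n) < p_conv})
print(f"PSM lift estimate: {psm_lift(df, ['age', 'female']):.0%}")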
We calculated propensity scores for the exposed and unexposed users in the test group from
study 4 using a logistic regression model and then created propensity score matched samples of
exposed and unexposed users. The upper part of panel (a) of Figure 6 shows the distributions of
the propensity scores for all exposed and unexposed users (i.e. before matching). The lower part of
panel (a) shows the distributions of the matched sample. Prior to matching, the propensity score
distribution for the exposed and unexposed users differ substantially. After matching, however,
there is no visible difference in the distributions, implying that matching did a good job of balancing
the two groups based on their likelihood of exposure.15
Propensity score matching matches users based on a combination of their characteristics. One
might wonder how well propensity-score matched samples are matched on individual characteristics.
In panel (b) of Figure 6, we show the distribution of age for exposed and unexposed users in the
unmatched samples (upper) and in the matched samples (lower). Even though we did not match
directly on age, matching on the propensity score nevertheless balanced the age distribution between
exposed and unexposed users.
14 As with the exact matching, there are propensity score methods that work by attributing greater weight to unexposed users that are more similar in propensity score to exposed users rather than implementing one-to-one matching. In the results we present, we use one of these weighted propensity score methods. See the appendix for details.

15 This comparison also helps us check that we have sufficient overlap in the propensities between the exposed and unexposed groups.

Figure 6: Comparison of Unmatched and Matched Characteristic Distributions. Panel (a) shows propensity score distributions and panel (b) shows age distributions, each for exposed and unexposed users in the unmatched (upper) and matched (lower) samples.

An important input to propensity score matching (PSM) is the set of variables used to predict the propensity score itself. We tested three different PSM specifications for study 4, each of which used a larger set of inputs.
PSM 1: In addition to age and gender, the basis of our exact matching (EM) approach, this specification uses common Facebook variables, such as how long users have been on Facebook, how
many Facebook friends they have, their reported relationship status, and their phone OS, in
addition to other user characteristics.
PSM 2: In addition to the variables in PSM 1, this specification uses Facebook’s estimate of the user’s
zip code of residence to associate with each user nearly 40 variables drawn from the most
recent Census and American Community Survey (ACS).
PSM 3: In addition to the variables in PSM 2, this specification adds a composite metric of Facebook
data that summarizes thousands of behavioral variables. This is a machine-learning based
metric used by Facebook to construct target audiences that are similar to consumers that
an advertiser has identified as desirable.16 Using this metric bases the estimation of our
propensity score on a non-linear machine-learning model with thousands of features.17
16 See https://www.facebook.com/business/help/164749007013531 for an explanation.

17 Please note that, while this specification contains a great number of user-level variables, we have no data at the user level that varies over time within the duration of each study. For example, while we know whether a user used Facebook during the week prior to the beginning of the study, we don't observe on any given day of the study whether the user used Facebook on the previous day or whether the user engaged in any shopping activity. It is possible that using such time-varying user-level information could improve our ability to match. We hope to explore this in a future version of the paper.
Table 2 presents a summary of the estimates of advertising effectiveness produced by the exact
matching and propensity score matching approaches.18 As before, the main result of interest will
be the lift. In the context of matching models, lift is calculated as the difference between the
conversion rate for matched exposed users and matched unexposed users, expressed as a percentage
of the conversion rate for matched unexposed users. Table 2 reports each of the components of
this calculation, along with the 95% confidence interval for each estimate. The bottom row reports
the AUCROC, a common measure of the accuracy of classification models (it applies only to the
propensity score models).19
Note that the conversion rate for matched exposed users barely changes across the model specifications. This is because, for the most part, we are holding on to the entire set of exposed users and changing only which unexposed users are chosen as the matches across specifications.20
What does change across specifications is the conversion rate of the matched unexposed users.
This is because different specifications choose different sets of matches from the unexposed group.
When we go from exact matching (EM) to our most parsimonious propensity score matching model
(PSM 1), the conversion rate for unexposed users increases from 0.032% to 0.042%, decreasing the
implied advertising lift from 221% to 147%. PSM 2 performs similarly to PSM 1, with an implied
lift of 154%.21 Finally, adding the composite measure of Facebook variables in PSM 3 improves
the fit of the propensity model (as measured by a higher AUCROC) and further increases the
conversion rate for matched unexposed users to 0.051%. The result is that our best performing
PSM model estimates an advertising lift of 102%.
When we naively compared exposed to unexposed users, we estimated an ad lift of 416%.
Adjusting these groups to achieve balance on age and gender, which differed in the raw sample,
suggested a lift of 221%. Matching the groups based on their propensity score, estimated with a
rich set of explanatory variables, gave us a lift of 102%. Compared to the starting point, we have
gotten much closer to the true RCT lift of 77%.
Next we apply another class of methods to this same problem, and later we will see how the
collection of methods fare across all the advertising studies.
18 See the appendix for more detail on implementation.

19 See http://gim.unmc.edu/dxtests/roc3.htm for a short explanation of AUCROC and Fawcett (2006) for a detailed one.

20 Exposed users are dropped if there is no unexposed user that has a close enough propensity score match. In study 4, the different propensity score specifications we use do not produce very different sets of exposed users who can be matched. This need not be the case in all settings.

21 As we add variables to the propensity score model, we must drop some observations in the sample with missing data. However, the decrease in sample size is fairly small and these dropped consumers do not significantly differ from the remaining sample.
Table 2: Exact Matching (EM) and Propensity Score Matching (PSM 1-3)

                                                   EM                       PSM 1                    PSM 2                    PSM 3
                                                   Est.    95% CI           Est.    95% CI           Est.    95% CI           Est.    95% CI
Conversion rates for matched exposed users (%)     0.104   [0.097, 0.109]   0.104   [0.097, 0.109]   0.104   [0.097, 0.109]   0.104   [0.097, 0.110]
Conversion rates for matched unexposed users (%)   0.032   [0.029, 0.034]   0.042   [0.041, 0.043]   0.041   [0.040, 0.042]   0.051   [0.050, 0.052]
Lift (%)                                           221     [192, 250]       147     [126, 168]       154     [132, 176]       102     [83, 121]
AUCROC                                             N/A                      0.72                     0.73                     0.81
Observations                                       7,674,114                7,673,968                7,608,447                7,432,271

∗ Slight differences in the number of observations are due to variation in missing characteristics across users in the sample. Note that the confidence intervals for PSM 1-3 on the conversion rate for matched unexposed users and the lift are approximate (consult the appendix for more details).
3.2.3 Regression Adjustment
Regression adjustment (RA) methods take an approach that is fundamentally distinct from matching methods. In colloquial terms, the idea behind matching is “If I can find users who didn’t see the
ad but who are really similar in their observable characteristics to users who did see the ad, then
I can use their conversion rate as a measure of what people who did see the ad would have done if
they had not seen the ad (the counterfactual).” The idea behind regression adjustment methods is
instead the following: “If I can figure out the relationship between the observable characteristics of
users who did not see the ad and whether they converted, then I can use that to predict for users
who did see the ad, on the basis of their characteristics, what they would have done if they had
not seen the ad (the counterfactual).” In other words, the two types of methods differ primarily in
how they construct the counterfactual.
A very simple implementation would be the following. Using data from the unexposed members
of the test group, regress the outcome measures (e.g. did the consumer purchase) on the observable
characteristics of the users. Take the estimated coefficients from this specification and use them to
extrapolate for each exposed user a predicted outcome measure (e.g., the probability of purchase).
The lift estimate is then based on the difference between the actual conversion rates of the exposed
users and their predicted conversion rates.22
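The sketch below illustrates this simple regression adjustment recipe in Python. It is a minimal illustration under assumed inputs (hypothetical arrays of characteristics and 0/1 conversion outcomes for the unexposed and exposed users in the test group), not the exact specification estimated in this paper.

```python
# A minimal sketch of the regression-adjustment idea described above. Variable
# names (X_unexposed, y_unexposed, X_exposed, y_exposed) are hypothetical
# stand-ins for the study data; the actual implementation may differ in model
# choice and details.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ra_lift(X_unexposed, y_unexposed, X_exposed, y_exposed):
    """Estimate ad lift for exposed users via regression adjustment.

    X_*: arrays of user characteristics; y_*: binary conversion outcomes (0/1).
    """
    # 1. Fit an outcome model on unexposed members of the test group only.
    outcome_model = LogisticRegression(max_iter=1000)
    outcome_model.fit(X_unexposed, y_unexposed)

    # 2. Predict the counterfactual conversion probability for each exposed
    #    user, i.e., what we expect they would have done absent exposure.
    p_counterfactual = outcome_model.predict_proba(X_exposed)[:, 1]

    # 3. Compare actual conversions of exposed users to their predicted
    #    counterfactual conversion rate to form a lift estimate.
    return y_exposed.mean() / p_counterfactual.mean() - 1.0
```

The key design choice is that the outcome model is trained only on unexposed users, so its predictions for exposed users serve as the counterfactual described in the text.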
22 More generally, one would start by constructing one model for each "treatment level" (e.g., exposed/unexposed). Next, one would use these models to predict the counterfactual outcomes necessary to estimate the causal effect. Suppose we want to predict what would have happened to an exposed user if they had instead not been exposed. For this exposed user, we would use the model estimated only on the unexposed users to predict the counterfactual outcome for the exposed user had they been unexposed. We repeat this process for all the exposed users. The causal effect of ads on exposed users is the average difference between their observed outcomes and those predicted by the model built on the unexposed users. Note that the key assumption is the following: exposure to ads is independent of the outcome after conditioning on all user characteristics. This is the standard assumption made in all regression analysis if one wishes to interpret a right-hand-side variable in a causal manner; otherwise a regression can only measure correlations between variables and one cannot draw a causal inference. In our context, this assumption is also equivalent to requiring that the exposed and unexposed groups are probabilistically equivalent after we have controlled, via the regression, for observed differences between the groups and how these differences may contribute to a user's conversion.
It turns out we can improve on the basic RA model by using propensity scores to place more
weight on some observations than others. Suppose a user with a propensity score of 0.9 is, in fact,
exposed to the ad. This is hardly a surprising outcome because the propensity model predicted
exposure was likely. However, observing an unexposed user with a propensity score of 0.9 would be
fairly surprising, occurring only about one time in ten. A class of estimation strategies leverages this feature of the data by placing relatively more weight on observations that are "more surprising." This is accomplished by weighting exposed users by the inverse of their propensity score and unexposed
users by the inverse of one minus their propensity score (i.e., the probability of being unexposed).
This approach forms the basis of inverse probability-weighted estimators (Hirano, Imbens, and
Ridder 2003), which can be combined with RA methods to yield an inverse-probability-weighted
regression adjustment (IPWRA) model (Wooldridge 2007).23
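The following sketch shows the weighting step in isolation, assuming a propensity score has already been estimated for each user. It implements only the plain inverse-probability-weighted comparison described above, not the full doubly robust IPWRA estimator, and the variable names are hypothetical.

```python
# A minimal sketch of the inverse-probability-weighting logic described above,
# assuming a fitted propensity model. This is the plain IPW comparison, not the
# doubly robust IPWRA estimator of Wooldridge (2007).
import numpy as np

def ipw_lift(exposed, converted, propensity):
    """exposed, converted: 0/1 arrays; propensity: estimated P(exposed | characteristics)."""
    exposed = exposed.astype(bool)

    # "Surprising" observations get more weight: exposed users with low
    # propensity scores and unexposed users with high propensity scores.
    w_exposed = 1.0 / propensity[exposed]
    w_unexposed = 1.0 / (1.0 - propensity[~exposed])

    # Weighted conversion rates for the exposed and unexposed groups.
    rate_exposed = np.average(converted[exposed], weights=w_exposed)
    rate_unexposed = np.average(converted[~exposed], weights=w_unexposed)
    return rate_exposed / rate_unexposed - 1.0
```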
We estimated three different regression adjustment models using the propensity scores calculated for PSM 1-3 in the previous section. Table 3 presents the results. The results are similar to those obtained using PSM: including additional variables reduces the estimated lift from 145% to 107%.
Table 3: IPWRA

                                                                     IPWRA 1                  IPWRA 2                  IPWRA 3
                                                                     Est.    CI               Est.    CI               Est.    CI
Conversion rate for exposed users if unexposed, as predicted by
the RA model (%)                                                     0.045   [0.037, 0.046]   0.045   [0.039, 0.046]   0.049   [0.044, 0.056]
Actual conversion rate of exposed users (%)                          0.104   [0.097, 0.109]   0.102   [0.096, 0.107]   0.104   [0.097, 0.110]
Lift (%)                                                             145     [120, 171]       144     [120, 169]       107     [79, 135]
23 IPWRA jointly models a user's exposure probability and conversion outcomes. IPWRA is a doubly robust estimation technique, meaning the estimates remain consistent even if one of the underlying models (either the propensity model or the outcome model) turns out to be misspecified. See Wooldridge (2007) for more details.
3.2.4 Summary of individual-based comparisons
We summarize the result of all our propensity score matching and regression methods for study 4
in Figure 7.
Figure 7: Summary of lift estimates and confidence intervals (study 4, checkout outcome). Lift point estimates: EM 221**, PSM 1 147**, PSM 2 154**, PSM 3 102, IPWRA 1 145**, IPWRA 2 144**, IPWRA 3 107, RCT 77.
** (*) Statistically significantly different from RCT lift at 1% (5%) level.
As the figure shows, propensity score matching and regression methods perform comparably. Both tend to overstate lift, although including our complete set of predictor variables (especially the composite Facebook variable) produces lift estimates that are statistically indistinguishable from the RCT lift. However, if one ignores the uncertainty represented in confidence intervals and focuses on the point estimates alone, even a model with a rich set of predictors overestimates the lift by about 50%.
3.3 Market-based comparisons
In many cases in which it would be difficult to randomize at the individual consumer level (necessary
for an RCT), it is possible to randomize at the geographic market level. This is commonly referred
to as a “matched market test.” Typically an advertiser would choose a set of, for example, 40
markets and assign 20 of these markets to the test group and the remaining 20 markets to the
control group. Users in control markets are never exposed to campaign ads during the study. Users
in the test markets are targeted with the campaign ads during the study period. The quality of this
comparison depends on how well the control markets allow the advertiser to measure what would
have happened in the test markets, had the ad campaign not taken place in those markets. Not
surprisingly, the key to the validity of such a comparison is to assign the available markets to test
and control groups such that consumers in test and control markets are as close to probabilistically
equivalent as possible.
There are two basic ways to achieve this goal. First, one can find pairs of similar or “matched”
markets based on market characteristics and then, within each pair, assign one market to the test
group and the other market to the control group. Alternatively, as long as the number of markets is sufficiently large, one can randomly assign markets to test and control groups, without first choosing matched pairs.
All the studies used in this paper assigned individual users to test and control groups. However,
we can still assess what would have happened if, instead, the assignment had been at the market
level. Since each consumer in an advertiser’s target group for a campaign was randomly assigned
to either the test or control group prior to the study, each geographic market contains both users
in the test group and users in the control group. Moreover, the randomization ensures that both
the test and and the control group users in each market are representative of targeted Facebook
users in that market. This means that the behavior of users in the control group in a market is
representative of the behavior of all users in the advertiser’s target group in that market, had the
entire market not been targeted with campaign ads. Similarly the behavior of users in the test
group in a market is representative of the behavior of all users in the advertiser’s target group in
that market, had the entire market been targeted with campaigns ads. We can therefore simulate a
matched market test by assigning each market to be either a test market or a control market, and
using the data only from the corresponding set of study users in that market.
We define geographic markets using the US Census Bureau's definition of a "Core Based Statistical Area" (CBSA). There are 929 CBSAs, 328 of which are "Metropolitan Statistical Areas"
(MSAs) and the remaining 541 of which are “Micropolitan Statistical Areas.” For example, “San
Francisco-Oakland-Hayward,” “Santa Rosa,” and “Los Angeles-Long Beach-Anaheim” are CBSAs.24
To construct our matched market test we selected the 40 largest markets in the U.S. by population (see Table 4). We picked 40 because Facebook ad researchers reported that this is a typical number of markets requested by clients for conducting matched market testing. Next, we
found pairs of similar markets based on market characteristics. We considered three sets of market
characteristics. First, we used a rich set of census demographic variables to describe consumers
in each market (we refer to this as "demographics-based matching").25 Second, we proxied for an advertiser's relative sales in each geographic market by using the conversion percentages derived from the advertiser's conversion pixel in the month prior to the beginning of the study (we refer to this as "sales-based matching").26 Third, we combined census data and conversion percentage data in calculating the best match between markets (we refer to this as "sales- and demographics-based matching").
24 See Figure A-1 in the appendix for a map of CBSAs in California. Maps for other states can be found at https://www.census.gov/geo/maps-data/maps/statecbsa.html.
We need to choose some rule for how markets should be matched before dividing them into
control and test groups. One objective we might choose is to minimize the total difference between
markets across all pairs by some metric. This is referred to as “optimal non-bipartite matching”
(Greevy, Lu, Silber, and Rosenbaum 2004). When we perform this procedure for the 40 largest
markets in the US, demographics-based matching yields the matched pairs shown in Table 4.27 To
perform a matched market test we need to assign, for each optimally matched pair, one CBSA
to the test group and one CBSA to the control group. In practice, an ad measurement researcher
randomizes one CBSA in each pair to the test group and the other CBSA in each pair to the control
group. Of course, to the degree that matching does not yield probabilistically equivalent groups
of consumers across markets, the choice of which CBSA ends up in the test and control groups
can matter for measurement. In our case we can estimate how sensitive measured lifts are to this
choice. This is because the RCT design produced consumers in both test and control groups for
each market. Hence, across all pairs, we can draw many different random allocations of CBSAs to
test and control and report the resulting lift for each allocation.
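The sketch below illustrates this construction: markets are paired by optimal non-bipartite matching on a Mahalanobis distance over market characteristics, and one market in each pair is then randomly assigned to test. The paper's matching uses the R package nbpMatching; here networkx's minimum-weight matching serves as a stand-in, and market_features is a hypothetical array with one row of characteristics per market.

```python
# A rough sketch of the matched-market construction described above: pair
# markets by Mahalanobis distance via optimal non-bipartite matching, then flip
# a coin within each pair to assign test and control markets.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def matched_market_split(market_features, rng):
    n = market_features.shape[0]                    # must be even (e.g., 40 markets)
    dist = squareform(pdist(market_features, metric="mahalanobis"))

    # Complete graph on markets; edge weight = pairwise Mahalanobis distance.
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=dist[i, j])

    # Optimal non-bipartite matching: minimize total within-pair distance.
    pairs = nx.min_weight_matching(G, weight="weight")

    # Within each matched pair, randomly assign one market to test, one to control.
    test, control = [], []
    for a, b in pairs:
        if rng.random() < 0.5:
            a, b = b, a
        test.append(a)
        control.append(b)
    return test, control

# Repeating the random within-pair assignment many times (e.g., 10,000 draws)
# and recomputing the lift each time traces out histograms like Figure 8.
```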
25 We used the % of CBSA population that is under 18, % of households in CBSA who are married-couple families, median year in which housing structures were built in CBSA, % of CBSA population with different levels of education, % of CBSA population in different occupations, % of CBSA population that classify themselves as being of different races and ethnicities, median household income in 1999 (dollars) in CBSA, average household size of occupied housing units in CBSA, median value (dollars) for all owner-occupied housing units in CBSA, average vehicles per occupied housing unit in CBSA, % of owner-occupied housing units in CBSA, % of vacant housing units in CBSA, average minutes of travel time to work outside home in CBSA, % of civilian workforce that is unemployed in CBSA, % of population 18+ who speak English less than well, % of population below poverty line in CBSA.
26 Advertisers can keep the conversion pixel on relevant outcome pages of their website, even if they are not currently running Facebook campaigns. This is the case for 7 out of the 12 studies we analyze. For these studies we observe "attributed conversions," i.e., conversions that Facebook can associate with a specific action, such as an ad view or a click.
27 We use the Mahalanobis distance metric and the R package "nbpMatching" by Beck, Lu, and Greevy (2015). See Tables A-1 and A-2 in the appendix for the equivalent tables using sales-based matching and sales- and demographics-based matching, respectively.
Table 4: Optimal matched markets using demographics-based matching (40 largest markets)

Pair      First CBSA in pair                           Second CBSA in pair
Pair 1:   Atlanta-Sandy Springs-Roswell, GA            Dallas-Fort Worth-Arlington, TX
Pair 2:   Cincinnati, OH-KY-IN                         Detroit-Warren-Dearborn, MI
Pair 3:   Austin-Round Rock, TX                        Indianapolis-Carmel-Anderson, IN
Pair 4:   Cleveland-Elyria, OH                         Kansas City, MO-KS
Pair 5:   Denver-Aurora-Lakewood, CO                   Los Angeles-Long Beach-Anaheim, CA
Pair 6:   Houston-The Woodlands-Sugar Land, TX         Miami-Fort Lauderdale-West Palm Beach, FL
Pair 7:   Chicago-Naperville-Elgin, IL-IN-WI           Milwaukee-Waukesha-West Allis, WI
Pair 8:   Charlotte-Concord-Gastonia, NC-SC            Nashville-Davidson–Murfreesboro–Franklin, TN
Pair 9:   Boston-Cambridge-Newton, MA-NH               New York-Newark-Jersey City, NY-NJ-PA
Pair 10:  Minneapolis-St. Paul-Bloomington, MN-WI      Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
Pair 11:  Orlando-Kissimmee-Sanford, FL                Portland-Vancouver-Hillsboro, OR-WA
Pair 12:  Louisville/Jefferson County, KY-IN           Providence-Warwick, RI-MA
Pair 13:  Pittsburgh, PA                               St. Louis, MO-IL
Pair 14:  Las Vegas-Henderson-Paradise, NV             San Antonio-New Braunfels, TX
Pair 15:  Phoenix-Mesa-Scottsdale, AZ                  San Diego-Carlsbad, CA
Pair 16:  Baltimore-Columbia-Towson, MD                San Francisco-Oakland-Hayward, CA
Pair 17:  Columbus, OH                                 Seattle-Tacoma-Bellevue, WA
Pair 18:  Riverside-San Bernardino-Ontario, CA         Tampa-St. Petersburg-Clearwater, FL
Pair 19:  Sacramento–Roseville–Arden-Arcade, CA        Virginia Beach-Norfolk-Newport News, VA-NC
Pair 20:  San Jose-Sunnyvale-Santa Clara, CA           Washington-Arlington-Alexandria, DC-VA-MD-WV
Recall from the discussion on page 9 that we derive the incremental conversion rate, or ICR, by dividing the difference between the conversion rate in the test group and the control group by the fraction of users who were exposed in the test group. One problem in a traditional matched market test is that the advertiser may not know how many users in each market were exposed to the ad. (For a Facebook ad, exposure could be tracked even in a matched market test, but this would not be true generally for online advertising.) To reflect this common reality, we calculate lift based on the difference between the conversion rate in the test group and the control group without adjusting for rates of exposure.28 For example, continuing the example on page 9, suppose that 0.8% of users in control markets and 1.2% of users in test markets purchased during the study period. Then the lift associated with the matched market test would be

Lift = (Conversion rate of test group − Conversion rate of control group) / (Conversion rate of control group) = (1.2% − 0.8%) / 0.8% = 50%.    (4)

If 100% of users are exposed to ads, this measure is identical to the ICR-based lift measure we have used so far and is described on page 9. If less than 100% of users are exposed to ads, this lift measure will be "diluted" (i.e., smaller than the ICR-based lift measure). Hence, the lift we will report below will always be smaller than the ICR-based lift we could calculate if we were to use information on how many users in each market were exposed to the campaign ads.
28 The difference between the conversion rate in the test group and the control group is referred to as an "intent to treat" estimate or ITT.
Figure 8: Histogram of lifts* (40 markets; true lift = 33; middle 95% of lift estimates approximately [-2, 80]).
* For top 40 largest markets, demographics-based matching, 10,000 random allocations of CBSAs in each matched market pair to test and control markets.
Recall that in study 4 the ICR-based lift measure was 77%. If we calculate lift according to equation 4 we obtain an estimate of 38%, with a 95% confidence interval of [29%, 49%]. This is the measure of lift that is analogous to what we will report for the matched market results.
In Figure 8 we show a histogram of the lifts generated from matched market tests for each of
10,000 random allocations of CBSAs to test and control markets. These allocations hold fixed the
pairings between markets and randomize over which market is assigned to test and control. If we
use the data from just these 40 markets we can calculate the true (RCT-based) lift in those 40
markets, which is 33%. The dashed lines on the left and the right of the histogram bracket 95%
of the lift measurements across the 10,000 random allocations of CBSAs. Notice that the dashed
lines do not correspond to a traditional confidence interval due to sampling of observations (we will
add those in the next subsection). Instead, they represent the middle 95% of lift estimates when
CBSAs, within matched market pairs, are randomly allocated to test and control markets.
Figure 8 shows both good news and bad news for the validity of matched market tests in the context of online advertising. The good news is that the matched market test appears to neither systematically overstate nor systematically understate the true lift (the true lift of 33% is close to the middle of the distribution in Figure 8). This is small consolation to the ad measurement researcher, however, because Figure 8 shows that any individual matched market test is likely to produce a result that is quite far off from 33%: 95% of the time the researcher will estimate a lift between -2% and 80%. If the researcher had the luxury of running 10,000 versions of the matched market test, he or she could tell from the distribution what the true lift is likely to be. But a single matched market test seems unlikely to produce a reliable result.
Figure 9: Sales-based matching (left histogram), sales- and demographics-based matching (right histogram)*. Both panels: 40 markets, true lift = 33.
* Histogram of lifts for top 40 markets, for 10,000 random allocations of CBSAs in each matched market pair to test and control markets.
Next, we explore whether we can reduce the variance of the lift estimates by using sales-based or sales- and demographics-based matching. Figure 9 shows the results for the 40 largest
markets using sales-based matching (left histogram) and sales- and demographics-based matching
(right histogram). Notice that sales-based matching and sales- and demographics-based matching
somewhat improve the lift estimates. However, the variance of the lift estimates remains large.
We also explore whether we can reduce the variance of the lift estimates by increasing the
number of markets used for the matched market test. We increase the number of markets to 80
and use sales- and demographics-based matching since it produced the smallest variance in lifts for
40 markets. The results in Figure 10 show that matched market tests for this study yield estimates
of lift that can be surprisingly far off from the result of the RCT, even with 80 markets. In summary,
this study suggests that the ad campaign lifts from matched market tests can be significantly lower
or higher than the true lift, even using sales- and demographics-based matching and a large number
of markets.29
29 Instead of matching, we can also randomly assign markets to test and control groups. The results are in the online appendix in Figure A-2. As in the matched market case, the resulting ad campaign lifts can be significantly lower or higher than the true lift.
Figure 10: Sales- and demographics-based matching for 40 and 80 markets* (40 markets: true lift = 33, middle 95% of lift estimates approximately [3, 72]; 80 markets: true lift = 32, approximately [9, 59]).
* Histogram of lifts for 10,000 random allocations of CBSAs in each matched market pair to test and control markets.
So far we have not estimated traditional confidence intervals around each individual lift estimate.
Doing so accounts for sampling-based uncertainty (the ever-present uncertainty that arises when
researchers don’t observe the entire population) in addition to the uncertainty generated by which
markets are assigned to treatment and control. We illustrate the compounded uncertainty in
Figure 11 for the top 40 markets. To read the figure, notice that the right panel of the graph is the
histogram in the left panel of Figure 10 laid on its side. The y-axis of the histogram displays the lift
for different random allocations of CBSAs in each matched market pair to test and control markets.
A horizontal line marks the true lift of 33%. The left side of the graph shows the confidence interval
for each of the 10,000 random allocations of CBSAs in each matched market pair to test and control
markets.30 To read the confidence interval, find the lift of interest on the y-axis on the right panel
of the figure and read off the corresponding confidence interval measured along the x-axis of the
left panel.
To illustrate, suppose that a researcher's random allocation of CBSAs to test and control markets yielded a lift estimate equal to the true lift of 33%. (This is purely a thought exercise, given that one cannot do this in a traditional matched market test.) The graph then shows that the
sampling-based confidence interval would be between about -10% and 75%. However, if a different
random allocation of CBSAs to test and control markets yielded a different lift, the confidence intervals would be shifted considerably, meaning that the true uncertainty associated with a matched
markets test is greater than what is indicated by a traditional sampling-based confidence interval.
30 To account for the fact that the assignment of users to treatment and control groups happened by market, we cluster the standard errors at the CBSA level.
Figure 11: Uncertainty in lift estimates due to market allocation and sampling for 40 markets* (study 4, checkout outcome; right panel: histogram of lifts, true lift = 33; left panel: confidence intervals by lift).
* Histogram of lifts for top 40 markets, for 10,000 random allocations of CBSAs in each matched market pair to test and control markets (right side of graph). Confidence intervals of lifts for different lift estimates (left side of graph).
4 Evidence from additional studies
In section 3 we presented the results of a variety of different observational approaches to estimate
the lift of study 4. In this section we summarize the findings of using the same approaches for all
12 studies. The studies were conducted for advertisers in different verticals and ranged in size from
about 2 million observations to 140 million (see Table 5).
Table 5: Summary statistics for all studies

Study  Vertical      Observations   Test    Control  Impressions    Clicks      Conversions  Outcomes*
1      Retail        2,427,494      50.0%   50.0%    39,167,679     45,401      8,767        C, R
2      Finan. serv.  86,183,523     85.0%   15.0%    577,005,340    247,122     95,305       C, P
3      E-commerce    4,672,112      50.0%   50.1%    7,655,089      48,005      61,273       C
4      Retail        25,553,093     70.0%   30.0%    14,261,207     474,341     4,935        C
5      E-commerce    18,486,000     50.0%   50.0%    7,334,636      89,649      226,817      C, R, P
6      Telecom       141,254,650    75.0%   25.0%    590,377,329    5,914,424   867,033      P
7      Retail        67,398,350     17.0%   83.0%    61,248,021     139,471     127,976      C
8      E-commerce    8,333,319      50.0%   50.1%    2,250,984      204,688     4,102        C, R
9      E-commerce    71,068,955     75.0%   25.0%    35,197,874     222,050     113,531      C
10     Tech          1,955,375      60.0%   40.0%    2,943,890      22,390      7,625        C, R
11     E-commerce    13,339,044     50.0%   50.0%    11,633,187     106,534     225,241      C
12     Finan. serv.  16,578,673     85.0%   15.0%    23,105,265     173,988     6,309        C

* C = checkout, R = registration, P = page view
Table 6: Lift for all studies and measured outcomes

Study  Outcome       Pct Exposed  RCT Lift  Confidence Interval
1      Checkout      76%          33%       [19.5%, 48.9%]
2      Checkout      46%          0.91%     [-4.3%, 7.2%]
3      Checkout      63%          6.9%      [0.02%, 14.3%]
4      Checkout      25%          77%       [55.4%, 108.2%]
5      Checkout      29%          418%      [292.8%, 633.5%]
7      Checkout      49%          3.5%      [0.6%, 6.6%]
8      Checkout      26%          -3.6%     [-20.7%, 19.3%]
9      Checkout      6%           2.5%      [0.2%, 4.8%]
10     Checkout      65%          0.6%      [-13.8%, 16.3%]
11     Checkout      40%          9.8%      [5.8%, 13.8%]
12     Checkout      21%          76%       [56.1%, 101.2%]
1      Registration  65%          789%      [696.0%, 898.4%]
5      Registration  29%          900%      [810.0%, 1001.9%]
8      Registration  29%          61%       [12.3%, 166.1%]
10     Registration  58%          8.8%      [0.4%, 18.2%]
2      Page View     76%          1617%     [1443.8%, 1805.2%]
5      Page View     46%          601%      [538.6%, 672.3%]
6      Page View     26%          14%       [12.9%, 14.9%]

RCT Lift in red: statistically different from zero at 5% level. Confidence intervals obtained via bootstrap.
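As a rough illustration of the bootstrap mentioned in the note to Table 6, the sketch below computes a percentile confidence interval for a lift statistic by resampling users. For concreteness it uses the simple test-versus-control lift of equation (4); the paper's ICR-based lift would be resampled analogously, and the inputs are hypothetical 0/1 conversion arrays.

```python
# A minimal sketch of a percentile bootstrap confidence interval for a lift
# statistic. Assumes conversions are common enough that the control mean is
# nonzero in every resample.
import numpy as np

def bootstrap_lift_ci(test_converted, control_converted, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(test_converted, size=test_converted.size, replace=True)
        c = rng.choice(control_converted, size=control_converted.size, replace=True)
        lifts[b] = t.mean() / c.mean() - 1.0   # equation (4): (CR_test - CR_ctrl) / CR_ctrl
    return np.quantile(lifts, [alpha / 2, 1 - alpha / 2])
```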
The studies also differed by the conversion outcome that the advertiser measured; some advertisers tracked multiple outcomes of interest. In all studies but one, the advertiser placed a conversion
pixel on the checkout confirmation page, therefore gaining the ability to measure whether a Facebook user purchased from the advertiser. In four studies the advertiser placed a conversion pixel to
measure whether a consumer registered with the advertiser. In three studies the advertiser placed
a conversion pixel on a (landing) page of interest to the advertiser. Table 6 presents the results of
the RCTs for all studies.
Note that lifts for registration and page view outcomes are typically higher than for checkout
outcomes. The reason is as follows: Since specific registration and landing pages are typically
tied to ad campaigns, users who are not exposed to an ad are much less likely to reach that page
than users who see the ad, simply because unexposed users may not know how to get to the page.
For checkout outcomes, however, users in the control group can trigger a checkout conversion simply by purchasing from the advertiser; it does not take special knowledge of a page to trigger a conversion pixel.31
31 One might ask why lifts for registration and page view outcomes are not infinite since, as we have just claimed, users only reach those pages in response to an ad exposure. The reason is that registration and landing pages are often shared among several ad campaigns. Therefore, users who are in our control group might have been exposed to a different ad campaign which shared the same landing or registration page.
4.1 Individual-based Comparisons
We summarize the results of the exact matching specification (EM), the three propensity score
matching specifications (PSM 1-3), and the three regression adjustment specifications (IPWRA
1-3) using the same graphical format with which we summarized study 4 (see Figure 7). Figures 12
and 13 summarize results for the eleven studies for which there was a conversion pixel on the
checkout confirmation page.
• In study 1, the exact matching specification (EM), the first two propensity score matching
specifications (PSM 1 and 2), and the first two inverse-probability-weighted regression adjustment specifications (IPWRA 1 and 2) yield lift estimates between 85% and 117%, which are
statistically higher than the RCT lift of 33%. Including the composite metric of Facebook
data that summarizes thousands of behavioral variables (PSM 3 and IPWRA 3) lowers the
lift estimate to 59%, which is not statistically different from the RCT lift. Hence, study 1
shows a similar pattern to the one we observed in study 4.
• The results for study 2 look very different. The RCT shows no significant lift. Nonetheless,
the EM and all PSM specifications yield lift estimates of 116 to 535%, all of which are
statistically higher than the RCT estimate of 0.91%. The lift estimates of the IPWRA
specifications are between 65 and 72%; however, they are also very imprecisely measured and therefore not statistically different from the RCT estimate.
• Study 3 follows yet another pattern. The RCT lift is 6.9%. EM, PSM 1, PSM 2, IPWRA 1,
and IPWRA 2 all overestimate the lift (34-73%). PSM 3 and IPWRA 3, however, significantly
underestimate the RCT lift (-12 to -14%).
• Study 4 was already discussed in section 3.
• In study 5 all estimates are statistically indistinguishable from the RCT lift of 418%. The
point estimates range from 515% for EM to 282% for PSM 3.
• Study 6 did not feature a checkout conversion pixel.
• In study 7 all estimates are different from the RCT lift of 3.5%. EM overestimates the lift
with an estimate of 38%. All other methods underestimate the lift with estimates between
-11 and -18%.
• Moving to Figure 13, study 8 finds an RCT lift of -3.6% (not statistically different from 0).
All methods overestimate the lift with estimates of 23 to 49%, except for IPWRA3 with a lift
of 16%, which is not statistically different from the RCT lift.
• The RCT lift in study 9 is 2.5%. All observational methods massively overestimate the lift;
estimates range from 1413 to 3288%.
• Study 10 estimates an RCT lift of 0.6% (not statistically different from 0). The point
estimates of different methods range from -18 to 37%, however, only the EM lift estimate
(37%) is statistically different from the RCT lift.
• Study 11 estimates an RCT lift of 9.8%. EM massively overestimates the lift at 276%.
PSM 1, PSM 2, IPWRA 1, and IPWRA 2 also overestimate the lift (22-25%), but to a much
smaller degree. PSM 3 and IPWRA 3, however, estimate a lift of 9.4 and 3.5%, respectively.
The latter estimates are not statistically different from the RCT lift. PSM 3 in study 11 is
the only case in these 12 checkout conversion studies of an observational method yielding a
lift estimate very close to that produced by the RCT.
• In study 12 we did not have access to the data that allowed us to run the “2” and “3”
specifications. The RCT lift is 76%. The observational methods we could estimate massively
overstated the lift; estimates range from 1231 to 2760%.
Figure 14 summarizes results for the four studies for which there was a conversion pixel on
a registration page. Figure 15 summarizes results for the three studies for which there was a
conversion pixel on a key landing page. The results for these studies vary across studies in how
they compare to the RCT results, just as they do for the checkout conversion studies reported in
Figures 12 and 13.
We summarize the performance of different observational approaches using two different metrics.
We want to know first how often an observational study fails to capture the truth. Said in a
statistically precise way, “For how many of the studies do we reject the hypothesis that the lift of
the observational method is equal to the RCT lift?” Table 7 reports the answer to this question.
We divide the table by outcome reported in the study (checkout is in the top section of Table 7,
followed by registration and page view). The first row of Table 7 tells us that for 10 of the 11 studies that tracked checkout conversions, we statistically reject the hypothesis that the exact matching estimate of lift equals the RCT estimate. As we go down the column, the propensity score matching and regression adjustment approaches fare a little better, but for all but one specification, we reject equality with the RCT estimate for half the studies or more.
We would also like to know how different the estimate produced by an observational method is from the RCT estimate. Said more precisely, we ask "Across evaluated studies of a given outcome, what is the average absolute deviation in percentage points between the observational method estimate of lift and the RCT lift?"
Figure 12: Results for checkout conversion event, studies 1-7. Each panel (S1, S2, S3, S4, S5, and S7 Checkout) plots the lift estimates of EM, PSM 1-3, and IPWRA 1-3 alongside the RCT lift.
[*] Lift of method significantly different from RCT lift at 5% level; [**] at 1% level.
[RCT Lift in red]: statistically different from zero at 5% level.
Figure 13: Results for checkout conversion event, studies 8-12. Each panel (S8, S9, S10, S11, and S12 Checkout) plots the lift estimates of EM, PSM 1-3, and IPWRA 1-3 alongside the RCT lift.
[*] Lift of method significantly different from RCT lift at 5% level; [**] at 1% level.
[RCT Lift in red]: statistically different from zero at 5% level.
Figure 14: Results for registration conversion event. Each panel (S1, S5, S8, and S10 Registration) plots the lift estimates of EM, PSM 1-3, and IPWRA 1-3 alongside the RCT lift.
[*] Lift of method significantly different from RCT lift at 5% level; [**] at 1% level.
[RCT Lift in red]: statistically different from zero at 5% level.
Figure 15: Results for key page view conversion event. Each panel (S2, S5, and S6 Page View) plots the lift estimates of EM, PSM 1-3, and IPWRA 1-3 alongside the RCT lift.
[*] Lift of method significantly different from RCT lift at 5% level; [**] at 1% level.
[RCT Lift in red]: statistically different from zero at 5% level.
For example, the RCT lift for study 1 (checkout outcome) is 33%. The EM lift estimate is 117%. Hence the absolute lift deviation is 84 percentage points. For study 2 (checkout outcome) the RCT lift is 0.9%, the EM lift estimate is 535%, and the absolute lift deviation is 534 percentage points. When we average over all studies, exact matching leads to an average absolute lift deviation of 661 percentage points relative to an average RCT lift of 57% across studies (see the last two columns of the first row of Table 7).
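The metric itself is simple to compute; the sketch below reproduces the two worked numbers above (the study 1 and study 2 exact-matching figures) and averages them. The table entry of 661 percentage points averages the analogous deviations over all eleven checkout studies.

```python
# A small sketch of the summary metric just described: the average absolute
# deviation, in percentage points, between each method's lift estimate and the
# RCT lift. The two illustrative inputs below are the study 1 and study 2
# exact-matching numbers quoted in the text; a full computation would include
# all studies for a given outcome.
import numpy as np

def avg_abs_lift_deviation(method_lifts, rct_lifts):
    """Both arguments in percent; returns mean |method - RCT| in percentage points."""
    return np.mean(np.abs(np.asarray(method_lifts) - np.asarray(rct_lifts)))

print(avg_abs_lift_deviation([117, 535], [33, 0.91]))   # (84 + 534.09) / 2, about 309
```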
As the table shows, inverse probability weighted regression adjustment with the most detailed
set of variables (IPWRA3) yields the smallest average absolute lift deviation across all evaluated
outcomes. For checkout outcomes, the deviation is large, namely 173 vs. an average RCT lift
of 57%. For registration and page view outcomes, however, the average absolute lift deviation is
relatively small, namely 80 vs. an average RCT lift of 440%, and 94 vs. an average RCT lift of
744%.
In general, observational methods do a better job of approximating RCT outcomes for registration and page view outcomes than for checkouts. We believe that the reason for this lies in the nature of these outcomes. Since unexposed users (in both treatment and control) are comparatively unlikely to find a registration or landing page on their own, comparing the exposed group in treatment to a subset of the unexposed group in treatment (the comparison all observational methods are based on) yields relatively similar outcomes to comparing the exposed group in treatment to the (always unexposed) control group (the comparison the RCT is based on).
4.2 Market-based comparisons
In this subsection we summarize across all studies the uncertainty in lift estimates introduced
by performing a matched market test. This uncertainty is generated by the random allocation
of markets in matched market pairs to treatment and control. We present results in two tables.
Table 8 presents the studies for which we were able to perform sales- and demographics-based
matching. Table 9 presents studies for which we did not have sales-relevant information prior to
the beginning of the study. Therefore, we could only perform demographics-based matching. Each
table describes the middle 95% of lift estimates when CBSAs, within matched market pairs, are
randomly allocated to test and control markets. We also report the true lift of the ad campaign in
the selected markets. The left three columns present this information for the largest 40 markets;
the right three columns present this information for the largest 80 markets.
As Tables 8 and 9 show, for most studies matched market testing introduces substantial uncertainty in lift estimates in addition to traditional sampling-based uncertainty (the latter is not
reported).
Table 7: Summary of performance by method for different conversion types

Method   Outcome       # of      # of studies     % of studies     Average absolute Lift      Average RCT Lift
         evaluated     studies   with Lift ≠      with Lift ≠      deviation from RCT Lift    in percent
                                 RCT Lift*        RCT Lift*        in percentage points
EM       Checkout      11        10               91               661                        57
PSM1     Checkout      11        9                82               296                        57
PSM2     Checkout      10        8                80               202                        57
PSM3     Checkout      10        5                50               184                        57
IPWRA1   Checkout      11        8                73               288                        57
IPWRA2   Checkout      10        7                70               201                        57
IPWRA3   Checkout      10        3                30               173                        57
EM       Registration  4         3                75               180                        440
PSM1     Registration  4         2                50               120                        440
PSM2     Registration  4         2                50               110                        440
PSM3     Registration  4         2                50               82                         440
IPWRA1   Registration  4         1                25               114                        440
IPWRA2   Registration  4         1                25               115                        440
IPWRA3   Registration  4         2                50               80                         440
EM       Page View     3         3                100              1012                       744
PSM1     Page View     3         2                67               250                        744
PSM2     Page View     3         1                33               174                        744
PSM3     Page View     3         1                33               159                        744
IPWRA1   Page View     3         1                33               155                        744
IPWRA2   Page View     3         2                67               165                        744
IPWRA3   Page View     3         2                67               94                         744

* Difference is statistically significant at a 5% level.
5 Additional Methods of Evaluation: PSAs and Time-Based Comparisons
This section briefly discusses two other common approaches to advertising measurement: using
public service announcements (PSAs) as control ads and comparing conversion outcomes before
and after a campaign. Both approaches have their own distinct shortcomings, which help to further
illustrate the inherent challenges of ad effectiveness measurement.
5.1 PSAs
In an RCT performed by an advertiser, test group users are shown ads from that advertiser and
control group users are not. But what should the control group users be shown in place of the ads of
the advertiser? This seemingly innocuous question actually has critical implications for interpreting
advertising measurements using randomized experiments.
One possibility noted in Section 2.1 is not to show control users any ads at all, i.e., to replace the ad with non-advertising content. However, this compares showing an ad to an unrealistic counterfactual.
Table 8: Uncertainty in lift estimates due to random allocation of matched markets to treatment and control (sales- and demographics-based matching)

                             40 markets                                       80 markets
Study  Outcome       5th percentile   True    95th percentile    5th percentile   True    95th percentile
                     of lift          Lift    of lift            of lift          Lift    of lift
                     estimates                estimates          estimates                estimates
S1     Checkout      13               31      50                 .98              29      62
S3     Checkout      -37              6.2     72                 -27              6.1     51
S4     Checkout      3.3              33      72                 9.1              32      59
S8     Checkout      -42              -6.8    50                 -37              -2.8    49
S9     Checkout      -16              .071    19                 -11              .26     12
S11    Checkout      -21              8.8     51                 -19              8.9     46
S1     Registration  534              643     769                554              653     773
S8     Registration  -40              -.61    61                 -26              5.5     49
Table 9: Uncertainty in lift estimates due to random allocation of matched markets to treatment and control (demographics-based matching)

                             40 markets                                       80 markets
Study  Outcome       5th percentile   True    95th percentile    5th percentile   True     95th percentile
                     of lift          Lift    of lift            of lift          Lift     of lift
                     estimates                estimates          estimates                 estimates
S2     Checkout      -80              -.17    391                -82              .0015    482
S5     Checkout      110              146     190                109              140      175
S7     Checkout      -11              -.2     12                 -8.4             .63      10
S10    Checkout      -12              .23     15                 -10              .73      15
S2     Key Page      396              848     2283               407              887      2533
S5     Key Page      192              217     245                199              219      239
S6     Key Page      2.8              11      21                 3.4              11       20
S5     Registration  305              336     369                308              336      365
S10    Registration  -1.3             6.9     16                 -.0037           7        14
Instead, the publisher is likely to show the ad of another advertiser. Hence,
we would like to answer the question: “How well does an ad (the “focal ad”) perform relative to
the user being shown the ad that would have appeared had the focal ad not been shown?”
One common choice is to use PSAs as the ads shown to the control group. The key feature of
a PSA is that its message is viewed as being neutral relative to the focal ad. For example, Nike
may not object to showing control group users an ad for the American Red Cross, but it certainly
wouldn’t want to pay for ad impressions from Reebok. Sometimes an agency or advertising network
partner will partly fund the cost of PSA impressions in order to deliver an estimate of the campaign’s
effectiveness.
In fact, the ad inserted as the PSA does not have to be an actual PSA—any ad from an
unrelated company could fulfill this purpose. We will refer to such ads as “PSAs” even though,
strictly speaking, they don’t have to be for a charity or non-profit.
Despite the industry’s reliance on PSAs as control ads, this approach presents at least two
problems. First, the PSA control ad does not correspond to the question of interest to the advertiser.
Comparing the test and control group measures the effect of showing the focal ad relative to showing
the PSA ad. But suppose the advertiser, say Nike, never implemented the campaign in the first
place, would the users still have seen the PSA ad? Probably not. Instead, they might have seen an
ad from Reebok, which implies that the PSA ad experiment doesn’t compare the Nike ad to the
appropriate counterfactual and therefore fails to yield a causal estimate of Nike’s ad effectiveness.
The second problem from the perspective of advertising research is that most modern advertising platforms (including Facebook) implement some type of performance optimization in their
ad delivery algorithm (also discussed at the end of section 3.2.1). These algorithms automatically
adjust the delivery of impressions for both the focal ad and the PSA depending on the types of
users who are most likely to click, visit, or purchase after exposure. The issue is that these algorithms can lead to violations of probabilistic equivalence between the test and control groups, if
not properly accounted for. Recently, Google has implemented a methodology that circumvents
these challenges on its online display network (Johnson, Lewis, and Nubbemeyer 2015).
Figure 16: Constructing a PSA Experiment
We can illustrate this measurement issue using two of our studies, study 6 and study 7. Both
studies have distinct messages and calls-to-action but they target a large overlapping set of about 50
million users. As Figure 16 shows, users can fall into one of four bins depending on their assignment
to test and control groups in each study. Our interest is in the roughly 30 million users in groups A and B, i.e., those who are in the test group of one study and the control group of the other study.
Suppose we are interested in running a PSA test for study 6. We need to compare outcomes
of users who are shown ads for study 6 to a PSA control group that is shown a neutral ad. Then
we can compare outcomes of Group A, our “test group,” to the outcomes of Group B, our “control
group” who is not shown any ads from study 6 but is shown ads from study 7. In this way the
study 7 ads effectively serve the same role as a PSA ad. A similar comparison can be made by
examining the outcomes from study 7 and reversing the roles of each group. If study 7 is the focal
advertiser, the “test group” switches to Group B and the “control” group becomes Group A.
In both cases, the advertising platform will optimize ad impression delivery to maximize the
conversion rate of each campaign. To the extent that this results in changes in the mix of users
receiving impressions, the PSA-based lifts might diverge from the RCT-based lifts.
The results appear in Table 10. As in the section on matched market tests, we calculate lift
based on the difference between the conversion rate in the test group and the control group without
adjusting for rates of exposure (see page 26). As a baseline, for both studies we report the RCT
lift based on the restricted set of users who were in both studies. The first row reports a test
using study 6 as the focal advertiser and study 7 as the PSA; we find that the PSA lift is similar
to the RCT lift. However, when the studies are flipped, there is a striking difference. The RCT
finds a positive lift of 1.8% (p-value<0.02) but the PSA test estimates a negative lift of -4.2%
(p-value<0.05). Using PSAs as control ads for this study would lead the advertiser to draw the
incorrect conclusion that the campaign actually hurt conversion rates.
Table 10: PSA Tests*

                            RCT                                        PSA
Focal Study  PSA Study      Lift    p-val    CI                        Lift    p-val    CI
6            7              12.1%   <1e-4    [10.7%, 13.5%]            11.6%   <1e-4    [8.3%, 15.0%]
7            6              1.8%    0.0183   [0.0%, 3.5%]              -4.2%   0.0492   [-7.7%, -0.7%]

* All statistics are specific to the set of users in both studies, hence the RCT conversion results may differ from those reported in Table 6. p-val is the p-value for the null hypothesis that the lift is zero. CI is the 95% confidence interval on the lift.
5.2 Time-based Comparisons
Another way to measure the effect of advertising is to perform a time-based comparison, or a
before/after. An advertiser might, for example, compare conversion outcomes during a period
immediately before the beginning of a campaign (the control period) with the outcomes during
the duration of the campaign (the test period). Time-based comparisons rely on the assumption that outcomes during the control period allow the advertiser to infer what would have happened in the test period, had the ad campaign not been run during the test period. This assumption (called "time-invariance") is met if, in the absence of ads, conversion outcomes are stable over some (reasonably) short period of time.
We can directly test this assumption in our studies by examining conversion outcomes of individuals in the control group over the duration of the study. Since these individuals were never exposed to campaign ads, if the time-invariance assumption is correct, there should be no difference in conversion outcomes for control users between the first and the second half of the campaign period.32
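A minimal version of this check is a two-proportion test on the control group's conversion counts in the two halves of the study, sketched below. The counts passed in are hypothetical placeholders; the paper's own test statistics may have been computed differently.

```python
# A minimal sketch of the time-invariance check described above: compare
# control-group conversion rates in the first and second halves of the study
# with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

def test_time_invariance(conv_first, n_first, conv_second, n_second):
    """conv_*: converters among control users; n_*: control users in each half."""
    stat, p_value = proportions_ztest(count=[conv_first, conv_second],
                                      nobs=[n_first, n_second])
    return stat, p_value

# A small p-value is evidence against time-invariance, i.e., against the key
# assumption behind a before/after comparison.
```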
Table 11 reports, for each conversion outcome and each study, the conversion rates in the first
and second halves for control users.33 As the table shows, we reject the hypothesis that there is
no difference in conversion outcomes for control users between the two halves of the study for all
but one study in each of the conversion outcomes. This suggests that for these studies a purely
time-based comparison would yield incorrect lift measures of ad effectiveness.
Table 11: Time-based comparison

Study  Outcome       Conversion rate     Conversion rate      Difference in        Difference
                     during first half   during second half   percentage points    in percent
S1     Checkout      0.113%              0.096%               -0.017%              -15%
S2     Checkout      0.104%              0.000%               -0.104%**            -100%
S3     Checkout      0.155%              0.303%               0.148%**             95%
S4     Checkout      0.012%              0.020%               0.0076%**            63%
S5     Checkout      0.004%              0.007%               0.0034%**            94%
S7     Checkout      0.135%              0.125%               -0.0091%**           -7%
S8     Checkout      0.009%              0.027%               0.018%**             200%
S9     Checkout      0.217%              0.125%               -0.091%**            -42%
S10    Checkout      0.041%              0.114%               0.073%**             177%
S1     Registration  0.058%              0.053%               -0.0053%             -9%
S5     Registration  0.033%              0.064%               0.031%**             93%
S8     Registration  0.002%              0.006%               0.0037%**            160%
S10    Registration  0.092%              0.264%               0.172%**             188%
S2     Page View     0.014%              0.012%               -0.0017%             -13%
S5     Page View     0.032%              0.072%               0.040%**             127%
S6     Page View     0.523%              0.348%               -0.175%**            -34%

* Difference is statistically significant at a 5% level. ** Difference is statistically significant at a 1% level.
32 One concern for implementing this test is that Facebook's targeting algorithm selects different users as potential targets during the first and second half of the study. To avoid this potential concern we restrict our sample to users in the control group who were selected by the targeting algorithm as potential targets during the first day of the study. We can do this because our data contains information on when a user was first selected as a target, even if that user was in the control group and therefore never received an ad.
33 All conversion rates have been scaled by the same random constant as in previous sections.
6 Summary of Results
Estimating the causal effect of advertisements, especially in digital settings, can be challenging. We
hope this article provides the reader with a better understanding of causal measurement and an
appreciation for the critical role that randomized controlled trials serve in advertising measurement.
Randomization of a treatment—whether it be an advertisement, an email message, or a new pharmaceutical drug—across probabilistically equivalent test and control groups is the gold standard
in establishing a causal relationship. In the absence of randomization, observational methods try
to estimate advertising lift by comparing exposed and unexposed users, adjusting for pre-existing
differences between these groups. Naturally, this leads to the question: How accurate and reliable
are such methods at recovering the true advertising lift?
To this end, we have explored how several common observational techniques for advertising
measurement compare to results obtained using an RCT. This comparison relies on 12 large-scale
randomized experiments from Facebook that spanned a mix of industry verticals, test/control
splits, and business-relevant outcomes. Using individual-level data, we compared conversion rates
(i) between exposed and unexposed users, (ii) after reweighting the groups by matching exactly on
age and gender, (iii) using propensity matching to flexibly model exposure likelihoods, and (iv)
with regression-based methods that incorporated both outcome and exposure models. We chose to
investigate these methods because they are well-accepted and commonly employed in the industry
and academia.
Our results present a mixed bag. In some cases, one or more of these methods obtained ad lift
estimates that were statistically indistinguishable from those of the RCT. However, even the best
method produced an average absolute deviation of 173 percentage points relative to an average RCT lift of 57%, with
most of the methods yielding upwardly biased estimates. The variance in the performance of the
observational methods across studies makes it difficult to predict when, and if, these methods would
obtain reliably accurate estimates. Given this, a manager should hesitate in trying to establish a
rule-of-thumb relationship between the observational and RCT estimates.
We have also found unsettling results when trying approaches that use aggregate data. While
matched market tests appear to neither systematically overstate nor understate the RCT lift, any individual matched market test is likely to produce a result that is quite far off from the RCT lift.
As a result, we find that a single matched market test seems unlikely to produce reliable results.
We also find evidence that before/after comparisons would have been problematic in our studies.
The time-invariance assumption on which time-based comparisons rely is violated in most studies,
even over the course of just a few weeks.
Finally, we have found some evidence that PSAs as controls would lead an advertiser to draw the incorrect conclusion about the true lift of an ad campaign.
In summary, three key findings emerge from this investigation:
• There is a significant discrepancy between the commonly-used approaches and the true experiments in our studies.
• While observational approaches sometimes come close to recovering the measurement from
true experiments, it is difficult to predict a priori when this might occur.
• Commonly-used approaches are unreliable for lower funnel conversion outcomes (e.g., purchases) but somewhat more reliable for upper funnel outcomes (e.g., key landing pages).
These findings should be interpreted with the caveat that our observational approaches are only
as good as the data at our disposal allowed us to make them. It is possible that better data, for
example time-varying user-level data on online activity and generalized shopping behavior, would
significantly improve the performance of observational methods.
We are aware that advertisers don’t always have the luxury of conducting true experiments.
We hope, however, that a conceptual and quantitative comparison of measurement approaches will
arm the reader with enough knowledge to evaluate measurement with a critical eye and to help
identify the best measurement solution.
In a future version of this article we will present some recommendations to advertisers on
advertising measurement. When firms are able to employ RCTs, we plan to offer some suggestions
on best practices to help managers get the most out of their measurement. In the case that RCTs
aren’t practical or feasible, we will highlight the circumstances under which different observational
methods might be best.
7 What RCTs can and cannot do
A common critique of RCTs is that one can only test relatively few advertising strategies relative to the enormous possible variation in advertising: Which creative? Which publisher? Which touchpoint? Which sequence? Which frequency? This critique of RCTs is not unique to advertising effectiveness research. For example, Angus Deaton, a recent winner of the Nobel Prize in Economics, made a similar point about evaluating initiatives in development economics.34
To understand the critique, consider an example given by the economist Ricardo Hausmann.35
Suppose we wanted to learn whether tablets can improve classroom learning. Any RCT would need
34 See Deaton (2010) or https://youtu.be/yiqbmiEalRU?list=PLYZdiPCblNEULWpJALSk_nNIJcn51bazU
35 https://www.project-syndicate.org/commentary/evidence-based-policy-problems-by-ricardo-hausmann-2016-02
to hypothesize a certain test condition that specifies exactly how the tablets are used: how should
teachers incorporate them into the classroom, how should students interact with them, in what
subjects, at what frequency, with what software, etc. An RCT would tell us whether a particular
treatment is better than the control of no tablets. However, the real goal is to find the best ways
to use tablets in the classroom.
This points to two problems that would equally apply to advertising effectiveness research.
First, with such a large design space, an RCT testing a small number of strategies would be too
slow to be useful. Second, whatever we learn from the RCT might not transfer well if the strategy
is extended to a different setting (this is referred to as external validity).
However, this is not really a critique of RCTs as opposed to other observational methods for
measuring advertising effectiveness—the exact same critique applies to any observational method,
including the ones we have evaluated in this paper. Instead, this critique points to the deficiency
of using data alone (through RCTs or observational methods) to discover what drives advertising
effectiveness. Instead, generalizing from such data fundamentally requires a theory of behavior,
i.e. the mechanism by which advertising works on consumers, that implies that the proposed
advertising will change outcomes a certain way. For example, theories of consumer behavior tell
us what types of advertising creative tend to produce larger responses (e.g., fear appeals do well
in political advertising). While RCTs are the gold-standard for accurately assessing the proposed
advertising, RCTs cannot tell you what advertising should be tested in the first place (of course,
neither can any observational method).
In some contexts, theories are not very important because technology allows for "brute empiricism." This applies, for example, to web site optimization, where thousands of tests allow the publisher to optimize the color, page features, button placement, etc. However, in advertising, the set of possible choices is much too large relative to the number of tests that are feasible to run.
This highlights one potential benefit of observational methods, which is that, relative to RCTs,
much more data for high-dimensional problems is typically available because the data are generated
more easily and by more actors. The usual issues of selection bias, suitable controls, etc., must
be addressed, and of course this is exactly what observational methods try to do. However, our
paper shows that in the setting of online advertising—with the data we had at our disposal—these
methods do not seem to reliably work.
Overall, we think this discussion has three important implications for advertising research:
1. We cannot rely on data alone to improve advertising. If we want generalizable results, we also
require good theories of behavior (a good understanding of the mechanisms by which advertising
works) to guide what advertising should be evaluated in the first place. In other words, theory
should help guide the experimental design.
2. Testing should aim to discover the “why,” not just the “what.” In the words of Angus
Deaton, “Learning about theory, or mechanisms, requires that the investigation be targeted
toward that theory, toward why something works, not whether it works.” (Deaton (2010), p.
442) This means that the choice of what to test should be in the service of deciding which of
multiple theories is correct. For example, if display advertising for a campaign works better
than search advertising, why is that the case? Only if we ask “why?” can we begin to use
data from one campaign to inform another campaign.
3. While data cannot replace theory, given a decision about what advertising to evaluate, RCTs
rely on much weaker assumptions than observational methods and are therefore preferable, as
long as they are feasible to implement.
References
Abadie, A., and G. W. Imbens (2015): “Matching on the Estimated Propensity Score,” Working
paper, Stanford GSB.
Abadie, A., and G. W. Imbens (2008): “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 76(6), 1537–1557.
Andrews, D. W. K., and M. Buchinsky (2000): “A three-step method for choosing the number
of bootstrap repetitions,” Econometrica, 68(1), 213–251.
Beck, C., B. Lu, and R. Greevy (2015): “nbpMatching: Functions for Optimal Non-Bipartite
Matching,” R package version 1.4.5.
Deaton, A. (2010): “Instruments, Randomization, and Learning about Development,” Journal of
Economic Literature, 48, 424–455.
Fawcett, T. (2006): “An introduction to ROC analysis,” Pattern Recognition Letters, 27, 861–
874.
Greevy, R., B. Lu, J. Silber, and P. Rosenbaum (2004): “Optimal multivariate matching
before randomization,” Biostatistics, 5(2), 263–275.
Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient Estimation of Average Treatment
Effects Using the Estimated Propensity Score,” Econometrica, 71(4), 1161–1189.
Johnson, G. A., R. A. Lewis, and E. I. Nubbemeyer (2015): “Ghost Ads: Improving the
Economics of Measuring Ad Effectiveness,” Working paper, Simon Business School.
Lalonde, R. J. (1986): “Evaluating the Econometric Evaluations of Training Programs with
Experimental Data,” American Economic Review, 76(4), 604–620.
Politis, D. N., and J. P. Romano (1994): “Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions,” The Annals of Statistics, 22(4), 2031–2050.
Stuart, A., and K. Ord (2010): Kendall’s Advanced Theory of Statistics, Distribution Theory,
vol. 1. Wiley.
Wooldridge, J. M. (2007): “Inverse probability weighted estimation for general missing data
problems,” Journal of Econometrics, 141, 1281–1301.
ONLINE APPENDIX
Matched Markets
Table A-1: Optimal matched markets using sales-based matching (40 largest markets)
Pair       First CBSA in pair | Second CBSA in pair
Pair 1:    Houston-The Woodlands-Sugar Land, TX | Louisville/Jefferson County, KY-IN
Pair 2:    Indianapolis-Carmel-Anderson, IN | Miami-Fort Lauderdale-West Palm Beach, FL
Pair 3:    Atlanta-Sandy Springs-Roswell, GA | Milwaukee-Waukesha-West Allis, WI
Pair 4:    Detroit-Warren-Dearborn, MI | Nashville-Davidson–Murfreesboro–Franklin, TN
Pair 5:    Las Vegas-Henderson-Paradise, NV | New York-Newark-Jersey City, NY-NJ-PA
Pair 6:    Dallas-Fort Worth-Arlington, TX | Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
Pair 7:    Columbus, OH | Phoenix-Mesa-Scottsdale, AZ
Pair 8:    Kansas City, MO-KS | Pittsburgh, PA
Pair 9:    Cleveland-Elyria, OH | Providence-Warwick, RI-MA
Pair 10:   Cincinnati, OH-KY-IN | Riverside-San Bernardino-Ontario, CA
Pair 11:   Chicago-Naperville-Elgin, IL-IN-WI | Sacramento–Roseville–Arden-Arcade, CA
Pair 12:   Charlotte-Concord-Gastonia, NC-SC | St. Louis, MO-IL
Pair 13:   Baltimore-Columbia-Towson, MD | San Antonio-New Braunfels, TX
Pair 14:   Portland-Vancouver-Hillsboro, OR-WA | San Diego-Carlsbad, CA
Pair 15:   Minneapolis-St. Paul-Bloomington, MN-WI | San Francisco-Oakland-Hayward, CA
Pair 16:   Los Angeles-Long Beach-Anaheim, CA | San Jose-Sunnyvale-Santa Clara, CA
Pair 17:   Denver-Aurora-Lakewood, CO | Seattle-Tacoma-Bellevue, WA
Pair 18:   Austin-Round Rock, TX | Tampa-St. Petersburg-Clearwater, FL
Pair 19:   Orlando-Kissimmee-Sanford, FL | Virginia Beach-Norfolk-Newport News, VA-NC
Pair 20:   Boston-Cambridge-Newton, MA-NH | Washington-Arlington-Alexandria, DC-VA-MD-WV
Lift Confidence Intervals
Below we have copied equation (3) from section 2, which defines lift:

\[
\text{Lift} = \frac{\text{Actual conversion rate} - \text{Counterfactual conversion rate}}{\text{Counterfactual conversion rate}}
\]

To facilitate exposition, we rewrite the above with some notation:

\[
\text{Lift} = \frac{y_e(e) - y_e(u)}{y_e(u)}
\]
where y_e(e) is the conversion rate of the exposed users given that they were actually exposed,
and y_e(u) is the conversion rate those same users would have had if they had instead been
unexposed. The former is directly observed in the data, whereas the latter requires a model to
generate the counterfactual prediction.
Table A-2: Optimal matched markets using sales- and demographics-based matching (40 largest
markets)
Pair       First CBSA in pair | Second CBSA in pair
Pair 1:    Austin-Round Rock, TX | Denver-Aurora-Lakewood, CO
Pair 2:    Boston-Cambridge-Newton, MA-NH | Detroit-Warren-Dearborn, MI
Pair 3:    Dallas-Fort Worth-Arlington, TX | Houston-The Woodlands-Sugar Land, TX
Pair 4:    Charlotte-Concord-Gastonia, NC-SC | Indianapolis-Carmel-Anderson, IN
Pair 5:    Cleveland-Elyria, OH | Kansas City, MO-KS
Pair 6:    Las Vegas-Henderson-Paradise, NV | Los Angeles-Long Beach-Anaheim, CA
Pair 7:    Atlanta-Sandy Springs-Roswell, GA | Louisville/Jefferson County, KY-IN
Pair 8:    Chicago-Naperville-Elgin, IL-IN-WI | Orlando-Kissimmee-Sanford, FL
Pair 9:    Minneapolis-St. Paul-Bloomington, MN-WI | Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
Pair 10:   New York-Newark-Jersey City, NY-NJ-PA | Phoenix-Mesa-Scottsdale, AZ
Pair 11:   Cincinnati, OH-KY-IN | Pittsburgh, PA
Pair 12:   Providence-Warwick, RI-MA | Sacramento–Roseville–Arden-Arcade, CA
Pair 13:   Nashville-Davidson–Murfreesboro–Franklin, TN | St. Louis, MO-IL
Pair 14:   Baltimore-Columbia-Towson, MD | San Diego-Carlsbad, CA
Pair 15:   Portland-Vancouver-Hillsboro, OR-WA | San Francisco-Oakland-Hayward, CA
Pair 16:   San Antonio-New Braunfels, TX | San Jose-Sunnyvale-Santa Clara, CA
Pair 17:   Columbus, OH | Seattle-Tacoma-Bellevue, WA
Pair 18:   Miami-Fort Lauderdale-West Palm Beach, FL | Tampa-St. Petersburg-Clearwater, FL
Pair 19:   Milwaukee-Waukesha-West Allis, WI | Virginia Beach-Norfolk-Newport News, VA-NC
Pair 20:   Riverside-San Bernardino-Ontario, CA | Washington-Arlington-Alexandria, DC-VA-MD-WV
Next, we can rewrite this equation another way, using the fact that the counterfactual conversion
rate is the difference between the actual conversion rate and the estimated average treatment effect
on the treated (ATT), i.e., y_e(u) = y_e(e) − ATT. This gives us:

\[
\text{Lift} = \frac{y_e(e) - y_e(u)}{y_e(u)} = \frac{y_e(e) - \left(y_e(e) - \text{ATT}\right)}{y_e(e) - \text{ATT}} = \frac{\text{ATT}}{y_e(e) - \text{ATT}}
\]
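To make the ratio concrete, consider a purely hypothetical example (the numbers below are illustrative and are not taken from any of the studies): suppose the exposed users convert at y_e(e) = 0.60% and the estimated ATT is 0.10 percentage points. Then

\[
\text{Lift} = \frac{0.0010}{0.0060 - 0.0010} = \frac{0.0010}{0.0050} = 0.20,
\]

that is, a 20% lift over the implied counterfactual conversion rate of 0.50%.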
To determine the confidence interval on the lift, we require the standard error of the numerator
and of the denominator. The standard error of the ATT is available in each of the methods we
consider. In the denominator, the standard error on y_e(e) is straightforward to calculate because,
unlike the ATT, this term does not rely on a model: given the set of relevant exposed users, we
calculate the standard error of their conversion rate using the usual formula. However, the tricky
issue is that the numerator and denominator are clearly not independent. This implies we must
calculate the covariance between the numerator and denominator to estimate the standard error
on the lift. The exception is when performing a bootstrap is feasible, in which case the standard
error can be calculated directly from the bootstrapped samples. We discuss our procedures for
estimating the standard errors for each method below.
• RCT Lift. Rather than estimating the covariance explicitly, we implement a nonparametric
bootstrap to calculate the confidence intervals for the RCT lift estimates. We use the method
in Andrews and Buchinsky (2000) to choose a suitable number of bootstrap draws to ensure
an accurate estimate of the confidence interval. This approach has the advantage that it
automatically integrates uncertainty about y_e(e), the ATT, the share of exposed users, and
the ratio statistic. (A simplified sketch of this type of bootstrap, together with the ratio-variance
approximation used below, appears after this list.)
• IPWRA. We recover the covariance for the estimates through the covariance matrix estimated
from the GMM procedure in Stata. This output contains separate estimates of the ATT and
(y_e(e) − ATT), estimates for the standard errors of each term, and the covariance estimate.
We can substitute these point estimates for the means, standard errors, and covariance into
the following approximation (based on Taylor expansions) for the variance of the ratio of two
(potentially dependent) random variables:

\[
\mathrm{Var}\!\left(\frac{x}{y}\right) \approx \left(\frac{E(x)}{E(y)}\right)^{2}\left[\frac{\mathrm{Var}(x)}{E(x)^{2}} + \frac{\mathrm{Var}(y)}{E(y)^{2}} - 2\,\frac{\mathrm{Cov}(x,y)}{E(x)E(y)}\right]
\]

The interested reader should refer to Stuart and Ord (2010).
• PSM. The standard errors for the ATT are computed using the methods explained in Abadie
and Imbens (2015) to account for the uncertainty in the propensity score estimates. The
standard error for the conversion rate of exposed matched users (y_e(e)) is calculated directly
from the data using the standard formula. However, no formal results exist to estimate the
covariance between the ATT and the conversion rate of exposed users. Instead, we implement a
subsampling procedure (Politis and Romano 1994) to generate multiple estimates of the ATT
and the conversion rate of the exposed users, since bootstrapping is invalid in the context of
matching procedures (Abadie and Imbens 2008). We calculate the covariance based on these
results and use it to construct the standard error on the lift using the approximation above.

In general, the covariance is small enough relative to the standard error of each term that both
the quantitative and qualitative conclusions of the various hypothesis tests are unaffected.
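To make these procedures concrete, the sketch below illustrates the two generic building blocks in Python: a percentile bootstrap of the lift ratio and the Taylor (delta-method) approximation to the variance of a ratio. It is a simplified illustration rather than the code used for the paper: the function names (lift_point, bootstrap_lift_ci, ratio_variance_delta), the fixed number of bootstrap draws, and the generic estimate_fn interface are our own assumptions, whereas the paper chooses the number of draws via Andrews and Buchinsky (2000) and, for IPWRA, takes the required moments from Stata's GMM output.

import numpy as np

# Illustrative sketch only; not the estimators or the bootstrap-size selection used in the paper.

def lift_point(y_e_e, att):
    """Lift = ATT / (y_e(e) - ATT): incremental conversions relative to the counterfactual rate."""
    return att / (y_e_e - att)

def bootstrap_lift_ci(data, estimate_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the lift.

    data:        NumPy array of user-level records (rows are resampled with replacement).
    estimate_fn: callable taking a resampled array and returning (y_e(e), ATT); a stand-in
                 for whichever estimator is being evaluated (RCT difference in means, IPWRA, ...).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    draws = []
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]   # resample users with replacement
        y_e_e, att = estimate_fn(sample)
        draws.append(lift_point(y_e_e, att))
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lift_point(*estimate_fn(data)), (lo, hi)

def ratio_variance_delta(mean_num, mean_den, var_num, var_den, cov):
    """Taylor-expansion approximation to Var(num/den) for two dependent random variables
    (the formula quoted above; see Stuart and Ord, 2010)."""
    return (mean_num / mean_den) ** 2 * (
        var_num / mean_num ** 2 + var_den / mean_den ** 2 - 2 * cov / (mean_num * mean_den)
    )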
Figure A-1: CBSAs in California

[Map: “CALIFORNIA - Core Based Statistical Areas (CBSAs) and Counties,” U.S. Department of
Commerce, Economics and Statistics Administration, U.S. Census Bureau. The legend distinguishes
combined statistical areas, metropolitan statistical areas, micropolitan statistical areas, and
metropolitan divisions. CBSA boundaries and names are as of February 2013; all other boundaries
and names are as of January 1, 2012.]
Figure A-2: Random matching for 40 and 80 markets*

[Two histograms of lift estimates (x-axis: Lift, y-axis: Frequency), one panel for 40 markets and
one for 80 markets; each panel is annotated with the true lift (32 and 34).]

* Histogram of lifts for 10,000 random allocations of CBSAs in each matched market pair
to test and control markets.
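The allocation exercise summarized in Figure A-2 can be sketched as follows. This outline is illustrative only: estimate_lift is a hypothetical placeholder for the matched-market lift estimator described in the body of the paper, and pairs stands for the matched CBSA pairs in Table A-1 or A-2.

import numpy as np

def random_allocation_lifts(pairs, estimate_lift, n_draws=10_000, seed=0):
    """Distribution of matched-market lift estimates under random test/control allocation.

    pairs:         list of (cbsa_a, cbsa_b) matched CBSA pairs (e.g., Table A-1 or A-2).
    estimate_lift: callable(test_markets, control_markets) -> lift; a hypothetical stand-in
                   for the matched-market estimator used in the paper.
    """
    rng = np.random.default_rng(seed)
    lifts = []
    for _ in range(n_draws):
        flips = rng.integers(0, 2, size=len(pairs))   # which member of each pair is assigned to test
        test = [a if f == 0 else b for (a, b), f in zip(pairs, flips)]
        control = [b if f == 0 else a for (a, b), f in zip(pairs, flips)]
        lifts.append(estimate_lift(test, control))
    return np.asarray(lifts)   # histogram these values to reproduce the panels of Figure A-2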