>> Kamal Jain: It's my pleasure to introduce Mohammad Mahdian. He's from
Yahoo, where they don't even recall their talks. And he has spent two years there already.
Before that he was a post-doc researcher at Microsoft for two years, and before
that he was a Microsoft fellow. So here is Mohammad.
>> Mohammad Mahdian: Thank you. Thanks, Kamal. It's good to be back. So
this is -- I'm going to talk about externalities in online advertising, and this is
based on two papers: one is joint work with Arpita Ghosh that
already appeared in the (inaudible), and the second one is joint work with David
Kempe which is going to appear in the (inaudible) workshop.
So let me start with some introduction although I'm sure that most of the people
in the room already know about these. Online advertising is a huge business.
It was already 21 billion dollars in 2007, and it's one of the fastest growing segments
of the advertising business. And the standard in this business is that advertisers
specify how much they are willing to pay for each impression or click or whatever
else that you are selling here.
And the publishers decide which ads to show on a page based on the values that
the advertisers are submitting and also based on the estimates that they have of
various quality measures of the advertisers, like the click-through rate, the probability
that the user clicks on their ad, or other measures of quality.
An important implicit assumption in all of the models that are standard in the
business is that the value of an ad only depends on the ad that's shown and
where it's shown on the page, but not on other factors, like other ads that are
shown on the same page. And obviously this is (inaudible) wrong: the different advertisers
are actually competing for the same thing. They are competing for the user's
attention, and the user's attention is a limited resource, so you should naturally expect
that the value that an ad receives is a function of the other ads that are
shown on the same page near it.
So for example, if you increase the number of ads on a page, that is going to
decrease the value to each of those advertisers. That's intuitive. And not only
that, the identity of the ads could also matter. For example, if I show an ad for
Toyota next to an ad for Honda, presumably that detracts attention from the ad for
Toyota more than if I show an ad for Toyota next to an ad for Ford, just because
Toyota and Honda are essentially targeting
the same market segment, whereas Toyota and Ford are probably
targeting different market segments.
Or a better example: if you search for Harry Potter, an ad for the Harry Potter
movie probably detracts attention from an ad for the Harry Potter book less than
a second ad for the Harry Potter book would.
Okay. So basically this becomes what is called, in the economics
literature, externalities: an effect that one agent can
have on other agents just by receiving an (inaudible). And the problem becomes
essentially the problem of mechanism design with externalities, which has been
studied in the economics literature in different contexts.
So let me just mention a few related works, most in the economics literature and the last one in the
CS literature. These are generally about designing auctions when there are
externalities. It's not entirely relevant to what I'm going to talk about, which is
mostly modeling the externalities in the context of online advertising.
The last one is much more relevant. There are a couple of papers that I'm going
to refer to later in the talk, about the effect that the different links you have on
a page have on the click-through rate of a given link.
The first one is the eye-tracking experiment by a bunch of people at Cornell, and
the second one is a click-log evaluation by people at MSR.
Okay. So in this talk I'm going to go over two models that we propose for
externalities in online advertising. The first model is based on a rational choice
model for the viewers, the users of the search engine or whatever publisher
of the advertisements. And for this model, we are going to
focus on lead generation advertising, which I'm going to define in a couple of
slides. And the second model I'm going to talk about is based on a probabilistic
model of viewer behavior. And for each of these, essentially what we are going
to do is assume a model for how users behave and then,
based on that, derive how placing one ad on the page, or sending one ad to a
user, is going to affect the other ads that the same user sees.
So for each of these models that I'm going to define, we will discuss the
computational complexity of the winner determination problem: basically, assuming
that the values follow a model like this, how should we decide which
ads to show on the same page? And following that, there's a brief discussion of
incentive compatible mechanism design.
Okay. So let me start by talking about lead generation advertising, which is
a segment of the online advertising business. I'm sure you have seen it if you
have ever tried to buy car insurance or a mortgage, or even to buy a car: there
are a number of websites that you can go to, you enter your
information, and they contact a number of mortgage companies, for example, or
car insurance companies, and each of those companies will contact you directly
with quotes for car insurance or for whatever else they're providing. This is
mostly used by segments of the market like mortgage firms, insurance
companies, auto dealers, the distance education industry like the University of Phoenix, and
so on.
So basically a lead is credible information provided by a user; the
lead generation companies collect all these leads and sell them to advertisers,
and the advertisers directly contact potential customers. By
the way, this is a huge segment of the market. In 2006, which was the latest year I
could find data for, this was 1.3 billion dollars, about eight percent
of total online advertising revenue.
There is an obvious trade-off here. For example, if your information is
sent to ten mortgage companies, each of those mortgage companies has a lower
chance of getting your business. And this is
something that in reality people have to deal with. I've talked to people who
are trying to design a lead generation business, and this is the real question: to
how many advertisers should they send these leads?
So here's an abstract model for this problem. Let's say we have n bidders,
each bidder is an advertiser, but the value
function that the bidder has depends on the set of all advertisers that
are winning in this auction, so it's a function from the set of all subsets of 1 to
n to the non-negative real numbers. And v_i(S) is the value to advertiser i, assuming that the
set of winners is S.
Now, we want to design an incentive compatible mechanism that maximizes
advertiser welfare, which is basically the sum of the values that advertisers
receive. And the classical result in the economics literature is the (inaudible)
mechanism: if you can actually find a set that maximizes this function,
then there are simple payment schemes that can induce incentive compatible
reporting of values.
So this is the abstract model at the very high level. Let me get into the specifics
of the first model I want to define for externalities. Assume we have
advertisers numbered one through n. Now, a user type, where a user is a member of the audience
of the advertising, specifies the preferences that this user has over advertisers 1
through n, as well as some outside option. Okay. So for example if you're
buying a car, you've probably already walked into a dealership, you have some
price quotes from them, but you are also researching online, you are going to
receive some quotes and you're going to compare all those quotes as well as the
outside option that you have.
I'm going to denote the outside option by zero. Now, we have a prior on user
types, a distribution over preferences. And advertiser i receives a value
v_i, which is a fixed number, if this advertiser is chosen by the user, okay,
which means that this advertiser is the one the user prefers most among
all the advertisements that this user sees.
Now, given this, the value v_i(S) can be defined as v_i
times the probability that i is preferred to everything in S union zero, the outside
option, where the probability is over the random type of the user.
Okay. And notice that in this model, the value of a set, the sum of the values to
the advertisers, is not necessarily a monotone function, so it's not necessarily best
for you to send the lead to as many advertisers as possible. And this is
intuitive, and the reason is that if you have an advertiser that a lot of the
users actually prefer but that has a very low value, adding this advertiser to your set
is going to decrease the overall value of the set.
And notice that here by the value I mean the value to the advertisers. If you want
to take the value to the users into account, that would be a whole different story.
Okay. So now, in order to look at the complexity of this problem, I have to define
an input representation, because here I'm assuming that I'm given a distribution
over user types, which in general can't be given concisely. Yes?
>>: (Inaudible) because if you are the only search engine in town, you can wipe
out let's say Southwest Airlines because they don't pay anything for the ad but if
the guy across town does show it and people like it, you will lose business.
>> Mohammad Mahdian: You're talking about the user side. So on the advertiser
side, I'm trying to model the existence of other
options by having this outside option here, right. So basically if you're a user,
you've searched other search engines, you might have physically walked into
retail stores and gotten quotes, and you have some outside option
based on all of those things. This is denoted here; it is already included in the
preferences of the user.
So in some sense, you do want to keep the users happy to
some extent as well, as much as the outside option forces you to, basically.
Okay. So yeah, that's good, feel free to interrupt me any time. If Jennifer
and Kirsten were here, I could be sure that I would be interrupted. But okay, so
now I need to talk about the input representation. We need an input
representation of the users' preferences over advertisers. And the simplest model
we could think of is an explicit representation. Assume that there are M types of users,
a fixed, small number of types, and a user is of type i with
probability p_i, so the p_i all add up to one, and the preferences of users of
type i are given by a permutation over the set of all advertisers, actually I should say
all advertisers union zero. So you have a permutation over the set of zero, 1, all
the way to n, a probability for this type, and that's it.
So now, for this explicit representation, the winner determination problem
becomes essentially this: you are given n non-negative values v_1 through
v_n, okay, these are the values to the advertisers if they receive the business
from the user. We have a number K, which is basically a bound on how many
advertisers can possibly receive the lead, and we have M permutations, pi_1
through pi_M, of 0 through n. Each permutation corresponds to one type of user, and a probability p_j is also associated with each
permutation.
And the question is to find a set S of cardinality at most K that maximizes this
function. This function is basically the same expression that you
had in the previous slide: we are summing, over all types of users, the
probability of the user type times the sum, over all advertisers in the winning
set S, of the value of the advertiser times an indicator variable which is one if and
only if this advertiser is preferred to all the other advertisers in S and to the outside option
by this particular user type, okay.
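For concreteness, here is a minimal Python sketch of my own, with a hypothetical toy instance, of how this objective can be evaluated and optimized by brute force under the explicit representation; none of the names below come from the papers.

    from itertools import combinations

    def objective(S, values, types):
        # Expected advertiser welfare if the set S of advertisers receives the lead.
        # types is a list of (probability, preference list over {0, 1, ..., n}), 0 = outside option.
        total = 0.0
        for prob, pref in types:
            for choice in pref:           # the user takes her most preferred available option
                if choice == 0:           # outside option wins: no advertiser gets value
                    break
                if choice in S:
                    total += prob * values[choice]
                    break
        return total

    def brute_force_winners(values, types, k):
        # Exact optimum by enumerating all sets of size at most k; only feasible for small n.
        ads = list(values.keys())
        best_val, best_set = 0.0, frozenset()
        for size in range(1, k + 1):
            for S in combinations(ads, size):
                val = objective(set(S), values, types)
                if val > best_val:
                    best_val, best_set = val, frozenset(S)
        return best_set, best_val

    # Hypothetical toy instance: a popular low-value advertiser (2) and a high-value one (1).
    values = {1: 10.0, 2: 1.0}
    types = [(0.6, [2, 1, 0]), (0.4, [1, 0, 2])]
    print(brute_force_winners(values, types, k=2))  # showing only advertiser 1 beats showing both

The toy instance also illustrates the non-monotonicity just mentioned: adding the popular but low-value advertiser 2 lowers the total welfare.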
So this is an optimization problem. The question is whether we can solve this
optimization problem efficiently. So first of all, it is actually not hard to see
that the winner determination problem is NP-hard, and the proof is pretty
simple. The idea is that even if all the values are equal,
okay, the problem is still essentially a weighted version of maximum K
coverage, and if you don't know what the maximum K coverage problem is, it's
basically that you're given a number of sets and you want to select K of these sets to
maximize the size of their union, okay. So you basically want to minimize their
overlap and maximize the size of their union.
So why is this problem a special case of maximum K coverage, or sorry, why
is maximum K coverage a special case of this problem? If all the values are the
same, the only thing that you care about is how much of the user population you are
actually capturing, right, since the values are all the same? So now advertiser j
is going to correspond to a set of user types: the user types i that rank j above the
outside option. Okay? So these are the only user types that, if advertiser j is
shown, are going to go with this advertiser or at least with some advertiser.
Now, the problem becomes to find a set of at most K advertisers to maximize the
weight of the set of covered user types, which is exactly the objective function of
maximum K coverage. And the maximum K coverage problem is NP-hard, so
that means that this problem is also computationally hard. But on the positive side,
for the maximum K coverage problem there is a simple, intuitive greedy algorithm that
achieves a constant factor approximation, and it also works
very well in practice.
So for a while our hope was that maybe we could do the same for this problem,
maybe there is a simple algorithm that solves this problem within a constant
approximation factor. We actually spent a lot of time on this, but eventually
came up with a hardness of approximation proof which is pretty strong: we
can show that the winner determination problem is hard to approximate within
any factor better than n to the 1 minus epsilon, which is a pretty strong negative
result.
And the proof is a very simple reduction from the maximum independent set problem. I'm going to
go through the proof, too, just because it gives some intuition about what type of hard
instances are out there, and in this respect it is actually a pretty simple reduction.
You are given a graph, G, that has X vertices, okay, and the problem is to find
the maximum independent set in this graph. And if you don't know what that
means, it's a set of vertices with no edge between any two of them, okay. And this
problem is NP-hard, and it's very hard to approximate.
Now, given this graph, I want to construct an instance of this winner
determination problem that is essentially as hard as solving the maximum
independent set problem in the graph. So if I have X vertices in the graph, I'm
going to set the number of user types and also the number of advertisers to X,
and also I'm going to set K to X. So K was the upper bound on the number of
advertisers that could win, so basically that means that there is no upper bound
on the number of advertisers that can win.
So each vertex corresponds to one advertiser and also corresponds to one user
type. Now, the advertiser corresponding
to node i I'm going to give a value of L to the power i, where L is a really large
number. Okay. So the values of different advertisers are going to be very, very
different. And also notice that here I'm essentially numbering the nodes from
1 through X, okay, so for node 1 the value of the corresponding advertiser is L to the
power 1, for node 2 it's L squared, and so on and so forth.
Also, I'm going to define this set N_i. This is the set of neighbors of i in the graph
that have index less than i. Okay? And the permutation pi_i is defined this way:
I'm going to rank all the elements of N_i before i, it doesn't matter in which order
I rank these elements of N_i, but everything in N_i comes before i, and then I
put i, and then I put the outside option. And after the outside option it
doesn't really matter, I can put anything.
And the probability of this permutation: I set the probability of this
permutation to some constant C divided by L to the i. And the constant C is set
so that the probabilities sum to one.
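To make the construction concrete, here is a small Python sketch, under my own encoding assumptions (an edge set of frozensets, a finite stand-in L), of the instance built from a graph; it is only an illustration of the reduction described above, not code from the paper.

    def reduction_instance(n, edges, L=1000.0):
        # edges: a set of frozenset({u, v}) pairs over vertices 1..n; L stands in for "really large".
        values = {i: L ** i for i in range(1, n + 1)}        # advertiser i gets value L^i
        C = 1.0 / sum(L ** (-i) for i in range(1, n + 1))    # normalizer so the type probabilities sum to one
        types = []
        for i in range(1, n + 1):
            N_i = [j for j in range(1, i) if frozenset((i, j)) in edges]   # lower-indexed neighbors of i
            rest = [j for j in range(1, n + 1) if j != i and j not in N_i]
            pref = N_i + [i, 0] + rest          # everything in N_i, then i, then the outside option, then the rest
            types.append((C * L ** (-i), pref))  # user type i has probability C / L^i
        return values, types

    # Example: values, types = reduction_instance(4, {frozenset((1, 2)), frozenset((2, 3))}), with K = 4.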
So now let's see what's happening here. I have X user types and I have X
advertisers. The value of the ith advertiser is L to the i, and the probability of the ith
user type is C divided by L to the i. And notice that in the preferences of the ith user,
essentially the only things this user ranks before the outside option are i
and the neighbors of i that have index less than i, okay.
So the expected value that this user type
contributes if she's assigned to advertiser i is going to be C divided by L to the
i times L to the i, which is C, a constant.
Well, if she's assigned to any advertiser in the set N_i, then the value is going to
be C divided by L to the i times something that's much, much smaller than L to
the i; it's L to the j for some j less than i. So if it happens that this
user is assigned to an advertiser in N_i, the value is going to be
much, much lower than if she's assigned to i.
So in the total value that we get, the only dominant terms are going to be
the ones that correspond to user types that are assigned to
the advertiser with the same index.
And as a result, we can prove this result, basically. The point is that if you have
two winning advertisers in your set that are connected by an
edge in the graph, then at least one of those advertisers is not going to be able
to derive this high value, okay? And as a result we get that the value of the
optimal set is going to be something between C times the size of the maximum
independent set in the graph and that same quantity plus some small number.
So what this shows is that the problem is essentially as hard as the problem of
solving maximum independent set in the graph, which we all know is a pretty
difficult problem. But on the positive side, notice that this reduction uses
instances where the advertisers' values have a large span, okay. So you have
advertisers that have a very, very low value for this lead
and other advertisers that have a high value, which is not exactly a realistic
situation. So actually that's the nice thing about looking at the computational
complexity of the problem and looking at the reduction, because it gives you a
feeling for what the hard instances are, and then you can try to modify your
approach and target instances that are not ruled out by those reductions.
So basically here the question would be: if we have a bound R on the maximum
value divided by the minimum value, can we get something better here? Can we
get an algorithm that has a better factor? There is a simple R times e over e
minus one approximation algorithm: just completely ignore the values, assume that all the
values are the same, and then run the greedy maximum K coverage
algorithm; that's going to give us this. Basically we're losing a factor of R because
we're ignoring the values and another factor of e over e minus one because of
the greedy maximum K coverage algorithm.
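As a concrete illustration of this simpler approach, here is a sketch of the weighted greedy maximum K coverage step in Python, using the same toy instance format as the earlier sketch; the function names are mine.

    def greedy_coverage(values, types, k):
        # Greedy weighted maximum K coverage, ignoring the advertiser values:
        # advertiser a "covers" the user types that rank a above the outside option.
        covers = {a: set() for a in values}
        for t, (prob, pref) in enumerate(types):
            for choice in pref:
                if choice == 0:
                    break
                if choice in covers:
                    covers[choice].add(t)
        chosen, covered = set(), set()
        for _ in range(min(k, len(covers))):
            best = max(covers.keys() - chosen,
                       key=lambda a: sum(types[t][0] for t in covers[a] - covered))
            chosen.add(best)
            covered |= covers[best]
        return chosen

Evaluating objective(greedy_coverage(values, types, k), values, types) then gives the welfare this set actually collects.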
But can we do better than this? And in fact, we can. There is a further technique
that we can apply here. We can divide advertisers into log R buckets. Each
bucket will correspond to advertisers that have value in some interval, in some
exponentially increasing interval. So here I've
set the ith bucket to be the set of all advertisers that have value between e to the i
minus one and e to the i, where e is the base of the natural logarithm.
Now, given this bucketing of the advertisers, notice that within each bucket the values
of the advertisers are off by at most a factor of e, and now if you look at the
optimal solution of the problem, this optimal solution is deriving some of its value
from each of these buckets, okay. So since there are log R different buckets,
there must be at least one of these buckets that, in the optimal
solution, gives at least a one over log R fraction of the value. So that means
that if I even pick a random bucket and solve the maximum K
coverage problem for this bucket, I'm going to get a factor that's at most log R plus 1
times e squared over e minus one. The log R plus one comes from the fact that
I'm only looking at one bucket, instead of all of the buckets.
There is a factor of e that comes from the fact that I'm ignoring the values within each
bucket, and the values could be off by a factor of e, and there is another e over e
minus 1 that's coming from the greedy maximum K coverage algorithm.
So here is one positive result. There is an approximation algorithm with a factor
equal to this number. And it's very easy to de-randomize this algorithm:
instead of picking a random bucket, you can just pick the bucket that gives the
maximum value. So now if you want to turn this
approximation algorithm, using the VCG, the
Vickrey-Clarke-Groves payment scheme, into an incentive
compatible mechanism, a classical result says that this is only possible if the
allocation rule is monotone.
What that means is that if I'm an advertiser and I increase my value, the algorithm
should not drop me from the set of winners, okay; it should not be the case that
increasing one's value decreases the likelihood that this person will be one of the
winners.
So this algorithm as I stated it is not monotone because when you increase your
value you might fall into a different bucket and the competition might be tougher
in that bucket, but in fact it's relatively straightforward, it's not difficult to actually
turn this algorithm into a monotone algorithm by making these buckets
essentially overlapping.
And therefore we get a monotone allocation rule, and using this and the VCG
payments we can get an incentive compatible deterministic mechanism that
approximates social welfare within a factor of the log of v_max divided by (inaudible).
And one nice thing is that at the end of the day,
when you look at the algorithm, it's actually pretty simple and intuitive. Basically
what it's doing is taking one threshold for the value, looking only at the
advertisers that fall above this threshold, and for those advertisers it essentially
solves the problem using the greedy maximum K coverage algorithm.
And the threshold is chosen, you know, in a way that maximizes the value.
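Here is a short sketch, again just my own illustration, of that final threshold-and-greedy idea, reusing the objective and greedy_coverage helpers from the earlier sketches.

    def threshold_greedy(values, types, k):
        # Try each distinct advertiser value as the threshold, run greedy coverage on the
        # advertisers at or above it, and keep the set whose true objective is best.
        best_val, best_set = 0.0, set()
        for theta in sorted(set(values.values())):
            eligible = {a: v for a, v in values.items() if v >= theta}
            S = greedy_coverage(eligible, types, k)
            val = objective(S, values, types)
            if val > best_val:
                best_val, best_set = val, S
        return best_set, best_val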
Okay. So that's one positive result. There are a couple of other positive results
that I'm just going to mention. For other special cases of the preferences we can
also get exact, exact (inaudible) algorithms. Again, the point is that if you
look at the reduction, you see that the preferences that we are giving the users are
very different from one user to another; the preferences are very, very
different.
If the preferences are correlated in some sense, then we have a
better picture. For example, if the preferences are single-peaked, I'm not going to
define this, but it essentially means that there's
a spectrum of everything from right to left, each user has an ideal point, and
essentially ranks things based on the distance between his
ideal point and the object; then in this case we can actually get an exact algorithm
for the optimization problem.
And also if all the preferences are, in some sense, perturbations of a single
ranking, then we can get an exact algorithm. Both of these algorithms use dynamic
programming; there are a lot of details there, but there is nothing fundamentally
difficult.
Okay. So that's pretty much all I wanted to say about the first model. I'm going
to get back to this at the end and have some discussion. Yes?
>>: (Inaudible) but do you assume that the value (inaudible) independent of the
user type?
>> Mohammad Mahdian: Well, I am assuming that. The reason we are
assuming that is basically that you can't necessarily observe the
user type here, okay, so we are assuming that there are these user types but
when a user comes to your search engine you don't necessarily know what user
type this is.
>>: Right.
>> Mohammad Mahdian: But obviously you might be able to use external
information, like targeting information, in order to deduce things about the user
type. And that's a very good question actually, but basically this assumption is
not losing much in generality, because if you actually have external information
that lets you target things better, you can essentially separate these
markets. Like, if based on additional information you can guess whether this user is
looking for Harry Potter the book or Harry Potter the movie, you can basically
assume that you have two different markets here and solve the optimization
problem independently for each of these markets.
And presumably your set of winners should be different if you actually have some
external information. That's what targeting means really.
>>: (Inaudible) but those users have (inaudible). So if we can assume that for a
single user type (inaudible) is the same (inaudible).
>> Mohammad Mahdian: Say it again?
>>: So the reason (inaudible) is because (inaudible) I assume that they are
intended for different user classes which have permutations of preferences.
>> Mohammad Mahdian: Okay. Let's see if I'm understanding the question
here. Basically if you have some external information that helps to classify the
users, if you can tell user 1 from user 2, then you have a separate
market for user 1 and a separate market for user 2, and for each of those you can
do the optimization over the set independently.
>>: (Inaudible). All the users have the same (inaudible) in that case?
>> Mohammad Mahdian: All the users have the same preference permutation
but the values are different. That's your question, right? The values an advertiser has
for different user types. If they all have the same user type then, sure, I mean
basically you can assume that the advertisers have the average value for these
guys and you can just merge them into one, if you can't observe them. Yes?
>>: The model of very large changes in value and very large change in
probability you don't (inaudible) optimize it (inaudible) model.
>> Mohammad Mahdian: What was that?
>>: The (inaudible) very low probability, very high gain for the crook for some
users.
>> Mohammad Mahdian: Thanks for the comment. So that was a good
question. Maybe we can discuss it afterwards. I think there might be some
interesting problem there: even if user types are
unobservable, is there a way to take advantage of differences in permutations in
order to target more profitable user types? That's a good question. Yes?
>>: (Inaudible) monotone, do you (inaudible) the benefit of the search engine?
>> Mohammad Mahdian: You mean the revenue?
>>: Yes.
>> Mohammad Mahdian: No, these algorithms, these
mechanisms, are based on VCG, so obviously they're all trying to maximize the
social welfare, which is the sum of the values to the advertisers and to the search
engine. If you want to optimize for the revenue in general -- I mean this is not a
theoretical result, but basically the general approach is to use a mechanism
like this but set the right reserve prices for different items. And that usually gets
you close to the maximum revenue. The difficulty in theoretically analyzing and
designing an algorithm like that is that you usually need to make assumptions
about the distribution of user types, or at least do some sort of sampling to
deduce things about the distribution of user types yourself, which makes the result
not as clean and probably not as directly applicable to practice.
Yes?
>>: (Inaudible) is that really the case, because if a user is looking for something,
I would think intuitively that whichever advertiser is deriving the least value
(inaudible) best choice.
>> Mohammad Mahdian: Okay. So that's also another good question. Basically
we are taking the user values and advertiser values as
exogenously given, but basically your question is that maybe these things are also
determined in the game. I have not looked at that. That is a good question.
Okay. So let's move on to the second part of the talk, which is the other type of
model for externalities. In this part we're basically looking at a probabilistic
model for user behavior and defining a model for externalities based on this
probabilistic model, and our focus is going to be on sponsored search. The main
characteristic of search ads, in comparison with the previous part, the lead
generation setting, is that sponsored search ads are listed in a column.
Like for example this is a Google search: you see that all the advertisements are
on the side and they are listed in some order, and presumably the higher
positions have higher value for the advertisers than the lower ones. So
we're going to look at a probabilistic model for how users view and click on
these ads that's motivated by the click log analysis and eye tracking
experiments in the couple of papers that I mentioned earlier.
So this model was proposed by Craswell, Zoeter, Taylor and Ramsey from MSR
for organic search results, earlier, at the last WSDM conference. And they
actually did a click log analysis that confirms that this model, the model
that I'm going to define in the next slide, is a better model for estimating click-through rates.
And also, independently, in the context of ad auctions, this has been studied by a
bunch of people at Google; they have also written a paper
which has some overlap with ours that's going to come up in the same
conference.
Okay. So let me define precisely what's going on in sponsored search advertising.
For each search the system shows K ads, typically K is something around 8, 10,
12. They show K ads in a sorted order, and the click-through rate of an ad is
the probability that this ad is clicked on, and the way it's estimated is usually
using this assumption that's known as separability. The separability
assumption says that the click-through rate of ad I placed in position J, where positions
are 1 through K, is the product of two terms. One term only depends on I, okay,
so let's call it alpha I, and the other term only depends on J; let's call it lambda J.
And so an interpretation of this assumption is that when the user is looking at the
advertisements, the user views position J with probability lambda J, so this is
the probability that this user even sees this position, and then, assuming that this
user sees this position, she's going to click on the ad in this position with a
probability that depends on the quality of this ad, okay. That's called alpha I.
So this is a standard assumption, and as far as I know, all of the auction engines
for sponsored search are basically built based on this assumption, and the click-through
rate learning, everything, is based on this assumption.
Now, what I'm going to talk about is a model that's essentially an
alternative proposal. We are assuming that each ad is specified by three
parameters. One parameter is the value, which is what we had before: this is the
value that the advertiser receives if there is a click on their ad, and notice that I've
switched from the value per lead that we had in the previous part of the talk to a value
per click. Now I'm assuming that we have some fixed value per click, which is
not an entirely accurate assumption, but I'm going to go with this assumption for
now.
So there is also another parameter which is a probability QI that if a user views
the ad, she will click on it. And finally there is a probability CI that if a user views
an ad, she will continue viewing the next ad as well.
So now this additional parameter CI allows us to model situations where you see
an ad and it's really a good ad that already satisfies your purpose, so you're going to
click on it and you're not going to look at the other ads afterwards. And it can
also model situations that are the complete opposite: you see an ad that is so
crappy that you just give up looking at the ads and you don't look at the ads
afterwards.
So now we assume that the user starts from the top position, the first position.
She looks at this first ad with probability one; that's just a matter of scaling, it's
not really an assumption. And then with probability C1, where 1 is the index of
the advertisement in that spot, she's going to look at the second ad,
and so on and so forth.
So given this model for user behavior, feel free to interrupt me if there
is anything that --
>>: (Inaudible) independent of whether or not she clicks? Is it --
>> Mohammad Mahdian: Very good question. I'm assuming that these probabilities
are independent of the previous ones, but this is an assumption without loss of
generality, because essentially I'm looking at the aggregate probability. So this
probability -- conditioned on a click it might be
different actually, but for the purpose of the optimization I only care about the
aggregate probability. Right. Since when I'm deciding which ads to
place, I don't have the information of whether this guy is going to click on an ad or
not, I can't make decisions based on that. And as a result, it's enough to
have the aggregate continuation probability. But that's a very good question. Most
likely the probability is not independent of whether you are clicking or not. Yes?
>>: (Inaudible).
>> Mohammad Mahdian: So you're saying CI should always be one minus QI, because
if you are clicking on an ad, you presumably are not going to click on future ads.
That sounds reasonable, but I'm not
going to make that assumption. Yeah. I mean, I don't need that assumption
basically.
>>: (Inaudible) if a user really wants to buy something, they can click on the first one
and see the page, and if they don't like it they go back.
>> Mohammad Mahdian: Right. Right. Exactly. Statistically it probably
happens that only a small fraction of the users do this, but
presumably maybe there's a large benefit in looking at that small fraction. I'm
not going to make an assumption like this basically. Yes?
>>: The CI could be strongly dependent on the design, I guess. If one click
opens the ad in a new tab, you could get multiple ad clicks.
>> Mohammad Mahdian: Sure. I mean, I'm sure these things depend a lot on
the user interface design, but basically here I'm focusing on the part of it that's
really a function of this particular ad that I'm placing there. I want to see what's
the effect of this particular ad that I'm placing. And actually, just to clarify the
connection with the previous work, with the Craswell results in the context of
organic search results: their model, first of all, doesn't have the values, just
because for organic search results there are no bids, you just want to
maximize the click-through rate, and they also assume that QI is actually precisely
equal to -- what is it? Oh, one minus CI, yes, that's correct.
So basically what they're assuming is that you click on an ad with
probability equal to QI, and if you don't click on this ad, you're going to
go to the next ad. So that's the assumption that they are making, which seems a
little restrictive, but the thing is that even with this assumption, they are showing
that based on the click logs this is a better fit to the click log data than the
(inaudible) model. Which is pretty surprising, actually.
Okay. So now, given this model, formally, if ads 1 through K are displayed in this
order, the probability that ad I is clicked on is going to be the product of C1
through CI minus 1, times QI. And therefore the winner determination
problem becomes to find an ordering of at most K ads that
maximizes this function: V1 times Q1, plus V2 times C1 times Q2, plus and so on and
so forth; basically the sum over all I of VI times QI times the product of the CJ for J less than I.
Now, here, as it turns out, the problem is actually much easier from a computational
complexity perspective. There is a lemma that shows that if there is no limit on
the number of ad spots that are shown, the optimal ordering is to sort all the ads
in decreasing order of VI times QI divided by one minus CI. Okay. So this is the
parameter that the ads need to be sorted by.
So you can think of this as essentially the value of the advertiser times some
squashed version of the click rate of the advertiser, okay. And that's the optimal
ordering if you don't have any limit on the number of ads that you can show on
the side of the page.
Obviously you do have a limit on the number of ads that you can show on the
side of the page, but the problem can still be solved. The proof of this lemma,
I'm not going to go through it, but it's based on a simple exchange argument. If you've
seen proofs that greedy scheduling algorithms are optimal, you know exactly
what I'm talking about. You assume that you have the optimal ordering and you
show that if this ordering is violated between two consecutive elements, then by
switching them you are going to increase the value. So it's pretty simple.
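As a quick sketch (my own, reusing the (value, q, c) tuples and the expected_value helper above), the ordering lemma for the unlimited-slots case looks like this:

    def optimal_order_unlimited(ads):
        # Sort the ads in decreasing order of v * q / (1 - c); the epsilon guards against c == 1.
        eps = 1e-12
        return sorted(ads, key=lambda a: a[0] * a[1] / max(1.0 - a[2], eps), reverse=True)

    # Example: ads = [(5.0, 0.2, 0.9), (3.0, 0.5, 0.4), (10.0, 0.05, 0.99)]
    # print(expected_value(optimal_order_unlimited(ads)))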
So if you do have a limit on the number of ad slots that you can show, obviously
the optimal ordering might be somewhat different from this, because for
example for the last ad that you are showing, in the last slot, there is nothing after
it, so you don't really care about the continuation probability. But as it turns
out, you can still solve the problem optimally, because we can show that, still, once
you select the set of ads that are shown on the page, these ads should be sorted
based on this order, okay, so now the problem becomes only to select the set of
ads that you show on this page. And this can be done using basically a dynamic
programming approach; given the (inaudible), you can do it with a dynamic programming
approach.
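A minimal sketch of such a dynamic program, assuming the stated fact that the chosen ads still appear in the v*q/(1-c) order (again my own illustration, building on the helpers above):

    from functools import lru_cache

    def best_k_value(ads, k):
        # Scan the ads in the sorted order and decide, for each, whether to place it in the
        # next available slot. f(i, slots) is the best expected value obtainable from ads[i:]
        # with `slots` slots left, conditioned on the user reaching the current slot.
        ads = optimal_order_unlimited(ads)

        @lru_cache(maxsize=None)
        def f(i, slots):
            if i == len(ads) or slots == 0:
                return 0.0
            value, q, c = ads[i]
            skip = f(i + 1, slots)                      # leave this ad out
            take = value * q + c * f(i + 1, slots - 1)  # place it: immediate value plus continuation
            return max(skip, take)

        return f(0, k)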
Okay. So that's pretty much a complete picture for this model that I defined.
There are a number of generalizations of the model. First of all, you could have
position-dependent multipliers. So in this model I was assuming that the
probability that you go from ad I to ad I plus 1, that is, that you view ad I plus
1 assuming that you already viewed ad I, is a term that only depends on
the advertiser that's displayed in slot I.
But in general, you could assume that this probability is not only dependent on
the ad but also on things that depend on the slot itself, okay. So to some
extent this addresses the earlier question: there could be things that
are dependent on the slot based on the interface design. For example, if
something falls off the page then the probability of going to that ad is significantly
lower, no matter what ad is shown there.
So for this model, assuming that you have this position dependent multiplier as
well, there's a simple 4-approximation algorithm that you can give, and the
algorithm is simple and intuitive and seems applicable in practice. Theoretically
we can also get a quasi-polynomial time approximation scheme, so basically for any factor
that you want, we can approximate the problem within that factor in time that's
not quite polynomial but something like n to the power log K. So yes?
>>: (Inaudible) similar results when (inaudible).
>> Mohammad Mahdian: I haven't thought about that. Some of the techniques
might apply there, but I have not thought about that. All right. But from a
practical perspective the quasi-polynomial time approximation is a bit too slow to be
applicable. And also, theoretically, another interesting question is that
we don't have an NP-hardness proof in this case. We've actually tried to prove
NP-hardness but haven't been able to, so it's not entirely clear what the
picture is theoretically for this problem.
Another generalization of this is when you have multiple ad slates. For example,
usually for sponsored search you have a top slate of ads and you have an east
slate of ads, and in this case you might assume that the
users behave slightly differently, so they don't necessarily jump from the bottom
ad on the north to the top ad on the east. They might be looking at different ad
slates with different probabilities. There is a generalization for this case, and in
fact we can approximate the problem pretty well if the number of different ad
slates is constant, which is pretty realistic. Usually the number of different ad
slates is at most two or three.
Okay. So that's it. I'm just going to conclude by discussing a number of
interesting open directions. First of all, our contribution was to define models for
externalities in online advertising based on assumptions about how users
behave, and we discussed the computational hardness of the winner
determination problem, which is the fundamental problem once you have
a model for externalities.
Now, this is a very interesting field, and it's pretty new. People have
not really looked at externalities much, and this is something where there is a
lot of potential, because the business is huge: as I said, in 2007 there were
21 billion dollars spent on online advertising, which is a huge number. And the
whole business is basically based on this assumption that the click-through rates are
separable, and this is obviously wrong. There is a lot that could be gained here.
There are heuristic approaches that seem to have helped. Like, for example, if
you look at the data, Google shows fewer ads than both Microsoft
and Yahoo, whereas their revenue per search is considerably higher than both
Microsoft's and Yahoo's, and that's puzzling. And one justification for that is that
there is better targeting there: even though this effect is not explicitly
taken into account in estimating the click rate, they still take into account that
showing more ads doesn't necessarily increase your total revenue.
So now, on the theoretical front we are still at the beginning; we still need to
define models that are both interesting from a practical point of view and also
make sense theoretically. First of all, there is the question of experimental
evaluation of these models. The only published result that
I know of on this is the work by Craswell et al for organic search results, and
obviously there is much more to be done here.
Now, there is also the question of whether we can come up with a more general
Markovian model for user behavior. So basically the model that we are talking
about is assuming that the users are following some sort of
Markov chain, where the parameters of this Markov chain are coming from the
advertisers that are placed in different spots. And an interesting question is
whether we can generalize this. Presumably users are not just in two states
of clicking or not clicking, viewing or not viewing; if we can come up with
something more complicated, that might be a more realistic model for how users
behave. And there is also the question of combining the models that I've talked about,
which were basically two models: one assuming that the users are acting perfectly
rationally, so they see a set of advertisements and select the one that's
maximizing their utility among these, and the other one assuming that the users
are basically just Markov chains, okay, they just transition with given
probabilities.
And there is a third class of models; there is a paper by Susan Athey
and Glenn Ellison that is looking essentially at something like this. So
there is also an element of signaling here. If you're showing an ad at the top,
you're signalling to the user that the quality of this ad is higher than the other
ones, okay. So that by itself could increase the probability of this ad being
clicked on, independent of the position and also independent of the ad identity.
So it would be interesting to look at combinations of these models, because the
reality is probably somewhere in between. Users don't look at all the ads and
select the best one, but users do care about which ads they perceive as better
than others. And also, users do look at, basically, the confidence vote that you
are giving to the different advertisements by placing them higher or lower.
So that would be a very interesting theoretical direction. As for an experimental
direction, there is the question of learning externalities. One problem with all
of these models is that the more complicated your model gets, the harder it
becomes to learn the parameters of the model and basically do anything practical
based on that, so we have to be careful not to make the models too complicated,
and there is the question of learning the parameters of these models.
For example, one very specific theoretical question is, for the cascade model, is
there any algorithm, similar to those for the multi-armed bandit problem, that would converge
over time to the optimal ranking of the advertisements? I don't know the answer
to that, and that's an interesting theoretical question there.
There's also the connection to the literature on diversity. In the web search
research literature there are basically a lot of papers that start with this
assumption that, well, we want to define an ordering of the search
results, for example, and we all know that diversity is good, so it's good to
incorporate some element of diversity in the search results; and then the question is
how we should incorporate this, and so on and so forth.
But they all start with this assumption that it's good to have diversity.
One nice feature of these models for externalities is that they don't start with this
assumption, but they can result in diversity. Like, for example, in the rational
choice model, if you assume that you have different user types,
say one user type cares about Harry Potter the book, the other user type cares
about Harry Potter the movie, and these user types actually have different
preferences, now here, in order to maximize your click-through rate, without explicitly
incorporating any element of diversity, you do have to have a diverse search
result. Okay. So that's an interesting way to look at the diversity problem, both
for web search results and also for advertising: basically looking at
diversity as a way to increase the click-through rate.
And I haven't seen anything done based on this approach. Another interesting
direction is to look at long-term externalities. Here I was looking basically at the
short-term externality: if I'm showing this ad next to another ad, how is it going to
affect the click-through rate of the second ad, okay.
But there is another effect here, and that's the long-term externality. If you keep
showing good ads to the users, the users will become much more likely to click
on your ads, okay. And that's another factor that's distinguishing Google from
Yahoo and Microsoft. I mean, that's one of the hypotheses for why they have a
higher revenue per search than Microsoft or Yahoo: they are better at
showing ads that are more relevant, and as a result the average click-through rate of
the users is higher on Google versus Microsoft and Yahoo, for example.
And for traditional forms of advertising these things have been
looked at. The difference between traditional advertising and online advertising is
that in traditional advertising there is not much that can be
measured, basically, okay, but in online advertising everything is logged, so
you can measure basically everything and can try to learn things based on this
and optimize your allocation based on this.
For traditional advertising, in the economics literature what has been looked at is
basically truth-in-advertising regulations. Sometimes advertisers can benefit
by pushing the government to essentially establish regulations that do not let
the advertisers lie in their advertisements, because even though it's limiting
themselves, in the long run it can benefit them because that would increase
the trust.
So that's -- there is a similar question in online advertising and it would be
interesting to actually theoretically look at this.
Finally, there's the example of online dating, which is another very interesting
special case of the lead generation business. It's actually one of the fastest
growing segments of the lead generation industry, because if you think of it, this is
really a lead generation business, and it hasn't really been looked at this way. I
mean, most of the online dating businesses so far are based on fixed-fee
subscriptions, but this is really a matter that people care about.
And I've actually talked to people at startups that are doing online dating,
and this is a real question for them: if somebody is searching for a possible
partner, what search results should they show them? Should they show them the
most, quote unquote, desirable people, or should they try to diversify this in order
to basically take into account these externalities that these people are
imposing on each other?
>>: (Inaudible).
>> Mohammad Mahdian: Obviously, yeah, there are all those problems as well.
Yeah. But anyway, so this issue of externalities in the context of lead generation
for online dating is an interesting research problem that has not been looked at.
Okay. That's it.
(Applause)