>> Nikhil Devanur Rangarajan: Hello, everyone. It's my great pleasure to welcome Zhiyi Huang
back. Zhiyi has been an intern here before, twice actually, and he got his PhD from UPenn,
spent a year at Stanford as a postdoc and is now a professor at the University of Hong Kong, and
he's going to tell us how to make the most of our samples, which I think is really important.
>> Zhiyi Huang: Okay. Thanks for the introduction and thanks for coming. So this is a joint
work with Yishay and Tim, and it's a forthcoming paper. So we're going to talk about auctions
and talk about revenue maximization in auctions, so basically, auction is about allocating a
bunch of resources to agents in some specific way, following some protocols, to optimize certain
objectives. And the two most studied and natural objectives in auctions are social welfare -- namely, kind of the overall happiness of all the agents -- and the second one is just revenue
maximization, like making more money out of the sale of items. So you can find tons of real-life
applications where revenue is a natural objective, say, ad auctions at Microsoft Bing or Google
or those auctions on eBay and so on. So roughly speaking, there are two ways to get good
revenue out of selling items, right? One is from the competition of these agents, potential agents,
potential buyers. If their values for the items are somewhat close to each other, the
competition is high, then just letting them compete with each other and using the second-price
auction will give you good revenue. But sometimes, maybe only one agent has high value for
the item, in which case there's not enough competition, in which case the usual practice is to use
reserve prices to get good revenue. And the good thing about using reserve prices is that they are
simple and natural: it's simple for the auctioneer to implement the auction and also easy
for the buyers to figure out what's going on and how they should behave. And also, it's kind of
optimal in many settings, so the classical result by Myerson back in the 1980s shows that
the Vickrey auction, or second-price auction, combined with a reserve price is actually optimal for
selling a single item, assuming these agents' values are drawn IID from some publicly known
distribution. So since using reserve prices to maximize revenue is natural and sometimes
optimal, so a very important question is how to set reserve prices. So here is sort of the very
basics of how to set reserve prices. So consider the following most basic scenario: we have
one item for sale and only one potential buyer, and his value is drawn from some publicly known
distribution denoted as F, so arguably the most simple setting you can consider in terms of
revenue maximization. And given any price p, we are going to let q of p denote the quantile,
which in this case is just 1 minus F of p; this is the sale probability. And then the revenue of a
certain quantile is just the revenue you can get by setting the price to be the value at that quantile.
So it's v of q times q. So we're going to adopt some standard assumptions from economics that
the distribution satisfies some regularity condition. For now, just imagine that as having this
revenue curve in terms of a function of quantile to be a concave function. And then now since
this function is concave, well, we can easily find the maximizer of this function, which will
correspond to quantile and the corresponding price that we should use if we want to maximize
revenue. Okay, so here it's all very nice and simple, but that leaves a very important question.
Where do these Bayesian priors come from? So in the literature, usually, we just assume that
magically, we have some closed form description of this prior and then suppose we have these
closed-form descriptions, then we just do the magic and come up with these optimal prices. But
in practice, these priors do not come for free, so where do they come from? One obvious
answer, which is a valid answer, is from the data of past user behavior. Say, in ad auctions, we have
tons of past behavior, and people submitted bids in the past, and if you kind of assume they are kind
of behaving the same and having the same distribution across different timespans, across
different days, then we have certain knowledge about these prior distributions. However, it's not
clear -- like, given this past user behavior, it's not clear that using the optimal reserve for some
kind of empirical distribution formed from these data is actually approximately optimal for the true
prior. And also, these past data do not give you any kind of a closed-form description. No one is
going to come tell you that, okay, this is a uniform between 0 and 1 or something like that. They
only give you some samples from the prior, even in the most optimistic case. So it seems that a
more practical model would be like we're given some IID samples from the prior, and now I
want to choose a reserve price based on these samples, and again, we want to approximately
maximize the revenue, so this is the model that we are going to focus on in this talk. Okay, so
more formally, an m-pricing algorithm will take m IID samples from the prior as input and
output some reserve price and then we will consider its approximation ratio as the expected
revenue we can get from this algorithm over the random realization of these samples, and
compare this with the optimal revenue you can get in hindsight, choosing the absolute optimal
reserve price, and this is the approximation ratio of the algorithm, and we want to make this as
large as possible, obviously.
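(For concreteness, the quantities just described can be written as follows; the notation below is a standard rendering and not taken from the slides, with Rev_F(p) the expected revenue of posting price p against the prior F, and the ratio measured in the worst case over the class of priors under consideration.)

\[
q(p) = 1 - F(p), \qquad \mathrm{Rev}_F(p) = p \cdot q(p), \qquad \mathrm{OPT}(F) = \max_{p \ge 0} \mathrm{Rev}_F(p),
\]
\[
\text{approximation ratio of an } m\text{-pricing algorithm } A \;=\;
\inf_{F}\; \frac{\mathbb{E}_{v_1, \dots, v_m \sim F}\bigl[\mathrm{Rev}_F\bigl(A(v_1, \dots, v_m)\bigr)\bigr]}{\mathrm{OPT}(F)}.
\]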
And there are multiple ways we can ask this question. For example, given a fixed number of samples, what pricing algorithm gives the best approximation
ratio? So in some sense, how this is being done in practice is that we take these samples and we
pretend that the underlying distribution is just a uniform over these samples, and then based on
this empirical distribution, we try to choose the best possible price. And maybe some hack on
the empirical distribution, as well, but that's essentially what's being done. But is that the best
thing to do? And more importantly, if the number of samples is small and you don't get to form
such an empirical distribution and do an approximation of the prior, how are we going to get
good approximation ratio? And secondly, we can also ask this question in terms of kind of a
sample complexity style, so given the target approximation ratio, how many samples does this
optimal algorithm need in order to achieve that target approximation ratio. Okay? And
surprisingly, even these very basic questions, and for the very basic setting of a single agent, a
single item, largely, we don't know the right answer. We have some kind of an upper and lower
bound, but there's a pretty large gap between them and we don't know the right answer.
>>: So [indiscernible], the single buyers, single sellers, do you have a prior or no? You are
saying there is no prior knowledge.
>> Zhiyi Huang: So we have a prior, but it's not given to the algorithm in closed form. We can
only access the prior through this IID sample, so we assume the existence of a prior, but we do
not know where it is, out of those samples we get. Okay? Okay.
>>: Do you know the distribution?
>> Zhiyi Huang: We don't know the distribution.
>>: How do you know the distribution?
>>: Yes, the next [indiscernible].
>> Zhiyi Huang: It's just Myerson.
>>: But like, okay, I guess if we can for example think about just one sample, isn't it clear, what
the right algorithm is?
>> Zhiyi Huang: So there is some result for that.
>>: Set it on reserve price, what it tends to do.
>>: So, actually, it's the best deterministic one, but this [indiscernible].
>>: And for a special class, the distributions can do better than setting that.
>>: No, but that's what I'm asking, then how are you saying you assume something about the
distribution. What I'm saying is that I don't assume --
>>: If you do a [indiscernible] then I think you say that whatever sample you have is the reserve
price.
>>: No, usually, you assume that --
>>: I mean, for that, regularity is necessary.
>>: Yes, you know that --
>>: That's why I'm saying, you should assume something that you know about the distribution
in some certain class.
>> Zhiyi Huang: Oh, okay, okay. There are some restrictions, some regularity assumptions on
the distribution that's needed, so that's for sure, but other than that, we don't know anything about
distribution. So we know that, for example, we shouldn't be like targeting an absolute sale
probability and all the revenue just comes from that small probability. For that, basically -- so here's a kind of potential bad distribution, right? So if with probability 1/h the value is h
for some very large h, and then otherwise the value is just zero. Then you cannot expect to have
anything good, right? So some kind of regularity, kind of small-tail assumption is needed, but
other than that, we don't have any knowledge of the distribution. Okay, so again, we're going to
be asking the question in the following ways and in different regimes, so for the sample
complexity, obviously, we will ask in the asymptotic regime, assuming we have access to many,
many samples, and all we want to know is how many samples are sufficient and necessary to
guarantee a 1 minus epsilon approximation. And in terms of the algorithm question, we are going to
ask this question in the few-samples regime, because if we have many samples, in some sense, the
algorithm is clear. You kind of want to use those samples and form some empirical distribution,
and at least when the number of samples goes to infinity, that's the right answer, right? But with
only a few samples, now, what's the right algorithm? That's not clear, so that's also the question
that we're trying to ask. And now we get to the regularity assumptions that we're making about
the distributions, so there are two common assumptions in the literature. One is called kind of
regular distribution. There are some technical definitions saying that the virtual value defined in
this way is non-decreasing as a function of the value v, but really, it's just saying that the revenue
curve, kind of this R of q on the quantile space, is a concave function. And there's a more
restrictive assumption called MHR, meaning the prior has a monotone hazard rate. There's a technical
assumption, that the hazard rate is non-decreasing, but for now, let's assume that it's just saying that the
revenue curve is strictly concave in some sense. We will get to the point like in what sense it's
concave, later in the talk.
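(For reference, here is one standard way to write the two assumptions; the talk states them informally, so take the exact form below as the usual textbook definitions rather than what was on the slides. Here v(q) is the value at quantile q, that is, q = 1 - F(v).)

\[
R(q) = q \cdot v(q), \qquad \varphi(v) = v - \frac{1 - F(v)}{f(v)}, \qquad h(v) = \frac{f(v)}{1 - F(v)},
\]
\[
\text{regular: } \varphi \text{ non-decreasing} \;\Longleftrightarrow\; R'(q) = \varphi(v(q)) \text{ non-increasing} \;\Longleftrightarrow\; R \text{ concave};
\qquad \text{MHR: } h \text{ non-decreasing}.
\]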
>>: So there's some virtual value. Is there a solution to that, B minus? What is the correct one?
>> Zhiyi Huang: So it kind of comes from this Myerson's analysis on revenue maximization.
It's kind of how much revenue you can get if you sell the item to an agent with this value,
basically. But roughly speaking, using the second price. Okay.
>>: So any [indiscernible] mechanism, the expected payment for the distribution, draw from the
distribution, the expected payment is like this.
>>: The expected payment is the expected version of [indiscernible], basically. The payment is
some second [indiscernible]. It's not something directly given to you, so it's sort of counting
payments. Another way of counting revenue is just counting virtual value for each value point in
the distribution, so you add all these, you'll get revenue, basically.
>>: Yes, so instead of maximizing value, you maximize this, so it is equal to maximizing
revenue.
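(The identity being discussed here is Myerson's lemma: for a single bidder with value v drawn from F, any truthful mechanism with allocation rule x and payment rule p satisfies the following, so maximizing expected revenue is the same as maximizing expected virtual welfare.)

\[
\mathbb{E}_{v \sim F}\bigl[p(v)\bigr] \;=\; \mathbb{E}_{v \sim F}\bigl[\varphi(v)\, x(v)\bigr],
\qquad \varphi(v) = v - \frac{1 - F(v)}{f(v)}.
\]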
>> Zhiyi Huang: Okay, so basically, we're going to do some combination of these two, like the
few-sample and the asymptotic regimes, and the regular and MHR assumptions. Okay, so we are going
to go over what we know about this problem across these different regimes and kind of go over
what our results are, and depending on how much time I have, I will go over some of the proof
sketches. So first of all, let's look at the asymptotic regime. So, again, we have many samples,
and we want to know how many samples are sufficient and necessary in order to get 1 minus
epsilon approximation, so this has been studied before, so we know that -- okay, so
Dhangwatnotai et al. proposed this empirical reserve algorithm, which is what you would expect:
given m samples v1 to vm, if vi is the i-th largest sample, then i times vi is basically the revenue you could get at price vi, assuming
the underlying distribution is just a uniform over v1 to vm, so just take the arg max of this.
Except there's one caveat, so it's called the alpha-guarded empirical reserve, because it rules
out the largest alpha fraction of the values. So in terms of kind of a machine-learning
kind of a perspective, it's trying to prevent overfitting, because if the arg max is actually one of
those vi that are in the alpha largest fraction, then basically your revenue is coming from a very
small part of the sample, right? So maybe you are overfitting to that very small fraction of the
sample, which may not be a good thing. So, basically, you would choose alpha to be something
like epsilon, and then you use this algorithm. And what they show is that the epsilon guarded
empirical reserve, with 1/epsilon-cubed samples is sufficient to get 1 minus epsilon
approximation for regular distributions. Okay? Yes.
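(Here is a minimal Python sketch of the guarded empirical reserve as described above; the exact guarding convention and tie-breaking in the Dhangwatnotai et al. paper may differ.)

    import math

    def guarded_empirical_reserve(samples, alpha):
        """Pick the sample maximizing empirical revenue, ignoring the top alpha fraction of samples."""
        v = sorted(samples, reverse=True)      # v[0] is the largest sample
        m = len(v)
        guard = max(1, math.ceil(alpha * m))   # only consider prices that sell to at least ~alpha*m samples
        best_price, best_rev = None, -1.0
        for i in range(guard, m + 1):          # posting price v[i-1] sells to exactly i of the m samples
            emp_rev = i * v[i - 1] / m         # empirical revenue of posting price v[i-1]
            if emp_rev > best_rev:
                best_rev, best_price = emp_rev, v[i - 1]
        return best_price

With alpha equal to epsilon this is the epsilon-guarded version discussed for regular priors; with alpha equal to 0 it is the plain empirical reserve mentioned below for MHR priors.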
>>: So if I take so many samples and I look at the empirical distribution, how far will it be away
from, let's say, total variation in the [indiscernible] distribution? In something I'll want to do is
I'll just learn the real distribution, right, so I just have to make a comparison here?
>> Zhiyi Huang: But we need to be careful to define the distance. If you take total
variation, then the total variation distance could be just 1, right, because it's a discrete distribution and so
on.
>>: Yes, so let me -- maybe some other distance. I don't know what this is?
>> Zhiyi Huang: Yes, so in some sense, at least for the purpose of getting revenue, it's close
enough, obviously, and other than total variation distance, I'm not sure what would be a good
measure of distance between the discrete distribution and the potentially continuous underlying
distribution, right? So I guess what I wanted to say is that I don't have a good measure of
distance off the top of my head for measuring the empirical distribution and true underlying
distribution for this case, but yeah, but at least they show that it's sufficient for getting good
revenue. And for the more restricted MHR distributions, they show that 1/epsilon-squared
samples is enough, and also you don't need to put the alpha guarded version. You just use the
empirical optimal price, and then that's 1 minus epsilon approximation. Okay? Yes, sure.
>>: This goes against what you just said, right? You said it makes logical sense to guard.
>> Zhiyi Huang: Yes, it makes logical sense to guard if it could be possible that one of these
guys, that one of these very high values is the arg max, right? But MHR basically is a stronger
small-tail assumption. It's unlikely that those will fool the algorithm. Okay, and more recently,
there's a paper by Cole and Roughgarden that studied essentially the same setting, but in a more
complicated scenario, where we also have a single item for sale, but where we have k bidders
instead of only one bidder. And these k bidders' prior distributions may not be identical. And the
main result is that the sample complexity has to be polynomial in the number of bidders, k. And
so in terms of the dependency on epsilon, all they show is that it's something like 1 over epsilon or
1 over the square root of epsilon, so pretty far from those upper bounds of the previous paper. Okay? So
what do we get in this work?
>>: That is for each [indiscernible]?
>> Zhiyi Huang: Yes, for each distribution. So in the model, basically, you get one sample for
each buyer, and they're saying that the number of rounds you need is -- needs to be polynomial in
k. So you need a more fine-grained knowledge about the distribution if you have non-identical
bidders. So on the algorithmic side, we give an improved MHR upper bound. We show that
using the same algorithm, actually, 1/epsilon-to-the-1.5 is enough to get 1 minus epsilon
approximation ratio. And we also show matching lower bounds. Actually, we give a lower-bound framework, and we show that 1/epsilon-to-the-1.5 is the right answer for MHR
distributions, and 1/epsilon-cubed, which is achieved by this previous analysis, is also tight, and
this is borrowing some techniques from information theory and differential privacy. And I
would like to kind of say a few more words about these results for MHR regime, because it's
kind of surprising, at least to me. So, first of all, when we first got these results, we thought
something was wrong with our analysis, because 3/2 should not be the right answer for
anything, especially in terms of like sample complexity. It's very rare that you see something
like that. And also, noticeably, this is fewer samples than what you would need in order to
estimate the optimal revenue up to a 1 minus epsilon factor. So even if I tell you what is the
optimal reserve price, you basically need 1/epsilon-squared samples to estimate the optimal
revenue up to a 1 plus minus epsilon factor, right? Basically, you want to estimate the sale
probability, and in order to estimate that to a 1 minus epsilon factor, you need 1/epsilon-squared
samples. So in this sense, using this many samples, we don't quite know what's the optimal
revenue we get, but we know what is a good price, okay? So kind of the optimization version is
easier than just estimating the objective. And we'll see some intuition why this is possible.
Now, for now, let me also go over some of the results for the single-sample regime. So this is a
classic paper by Bulow and Klemperer, and it basically shows that if you have one sample and
you use that sample as your reserve price, then this is one-half approximation, assuming the
distribution is regular. And for the more restrictive MHR version, basically, no result is known
prior to our work. People just plug in this one-half. So what we show is that actually, you can
do better for MHR distribution. If you take the sample and scale it down by some factor, 0.85, in
this case, then you can actually do a 0.589 approximation for MHR distribution. Okay. And we
also show some upper bound, like 0.68, for MHR distributions, and we also show that this one-half approximation for the regular distribution is tight for deterministic algorithms.
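(To make the two single-sample rules concrete, here is a small Python sketch that compares identity pricing with scaled-down pricing on an exponential prior, which is MHR. The 0.85 factor is the one mentioned in the talk; the Monte Carlo below only illustrates one particular prior and is not the worst-case guarantee.)

    import math
    import random

    def identity_price(sample):
        # Bulow-Klemperer style rule: post the observed sample itself as the reserve price.
        return sample

    def scaled_price(sample, c=0.85):
        # Scale the sample down by a constant factor, as suggested in the talk for MHR priors.
        return c * sample

    def estimate_ratio(pricing_rule, trials=200_000):
        # Exponential(1) prior (an MHR distribution): optimal reserve is 1, optimal revenue is 1/e.
        opt_revenue = 1.0 / math.e
        total = 0.0
        for _ in range(trials):
            s = random.expovariate(1.0)    # the one sample we observe
            v = random.expovariate(1.0)    # the buyer's actual value
            p = pricing_rule(s)
            total += p if v >= p else 0.0  # revenue from posting price p
        return (total / trials) / opt_revenue

    print(estimate_ratio(identity_price))  # about e/4 ~ 0.68 for this particular prior
    print(estimate_ratio(scaled_price))    # similar here; 0.589 is the worst case over all MHR priors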
>>: [Indiscernible].
>> Zhiyi Huang: Yes, kind of equal revenue type of distributions. Yes. Okay. And moreover,
like our positive results also have implications for more complicated settings, like the multi-agent problem, because some of these multi-agent results use the single-agent problem as a subroutine and
do some direct reduction. So we know that if we have some good pricing algorithm for the
single-agent problem, then there's a way to convert it to the multi-agent problem, as well. So, for
example, we can combine our results with the results by Dhangwatnotai et al., so what they show
is that for what they call prior-independent auctions in matroid environments with IID MHR
bidders, they can do an (n - 1)/n times 0.5 approximation, and we can directly plug in
our new results and improve the 0.5 to 0.589. And similarly, for the other result, as well, and
asymptotically, we can also plug in our new sample complexity upper bound and improve
the number of bidders they need to get 1 minus epsilon approximation from 1/epsilon-cubed to
1/epsilon-five-halves. And there's a more recent result by Tim and his student that also uses the
single-agent problem as a subroutine, and we can also plug in our new results to improve some of
their ratios. Okay, so now, let me get to the technical part and kind of sketch some of the high-level ideas behind these results. So let's first consider the asymptotic and MHR regime and
explain why we managed to get this weird 1/epsilon-to-the-three-halves upper bound. So first
of all, let me briefly go over what's the usual sample complexity upper bound technique. So
consider the empirical reserve price algorithm. We take these samples and we just use that as an
empirical distribution and pick the best price. The high-level plan would be I want to estimate
the revenue or the sale probability of each potential price, namely these vis, up to a 1 minus
epsilon factor, like use some concentration bound and so on. And then we show that if we can
do that, then choosing the best from this empirical distribution would be a 1 minus epsilon
approximation, and small quantiles could be problematic, as I just explained, and that's the
intuition behind dropping the epsilon fraction of highest values and using this epsilon-guarded version. And
with 1/epsilon-cubed samples and also concavity of the revenue curve, we can achieve exactly
that, so first of all, by concavity of the revenue curve, we can see that it's okay to drop the
highest values, because the revenue curve drops at most linearly away from the peak, right? So just
dropping an epsilon fraction in the quantile space does not hurt your optimal revenue by more
than a 1 minus epsilon factor. And second of all, since for any candidate price that we consider
in the epsilon guarded version, the sale probability is at least epsilon, so with 1/epsilon-cubed
samples, we expect to see at least 1/epsilon-squared samples that are larger, right? So that will
allow us to estimate the sale probability of every single price up to a 1 minus epsilon factor just
by a simple application of the Chernoff bound. And that's enough to finish the proof of this
1/epsilon-cubed sample complexity upper bound for regular distributions. Now,
for MHR, it's pretty much the same thing, except that we would use, say, a simple property of
MHR distributions, that the optimal sale probability is at least some constant, 1/e, and that's also a
standard result in the literature. And then it's the same argument, except that now the relevant
prices have sale probability at least a constant, so with 1/epsilon-squared samples, you expect to see 1/epsilon
sales, right? And that's enough to estimate a sale probability up to 1 minus epsilon, okay? So
that's how you get the 1/epsilon-squared upper bound, so all very straightforward. And just from
this analysis, you may think that 1/epsilon-squared is the right answer, so why is that? Okay, so
what could be a potential bad scenario here? So maybe there is some other price, p, other than
p*, which is only a 1 minus 2-epsilon approximation, but somehow, due to sampling error, you
overestimate the revenue of this price by 1 plus epsilon factor and maybe also underestimate the
revenue of the optimal price by 1 minus epsilon factor, and as a result, the algorithm mistakenly
takes this price over the optimal price, right? And indeed, this could happen if these sale events
were two independent coin flips and you wanted to see which one is larger. But a
problem here, what's missing here is that these two events are not independent, so when we try to
estimate the revenue of these prices, there are correlations. So take this as an example. Suppose q*
is the quantile of the optimal price and qi is the quantile of some other prices, so basically what
we want to estimate is the number of samples that fall between 0 and qi and the number of
samples that fall between 0 and q*, right? So you can see that the number of samples falling
between 0 and qi basically contribute to the sale probability of both, so there's some kind of
correlation here. And in particular, kind of overestimating the sale probability or the revenue of
qi doesn't matter that much, because those samples will contribute to the sale probability of q* as well, so if
you have more samples here, then you overestimate both, okay? The error mainly comes from the
samples that fall between qi and q*, and that observation will allow us to do a better analysis.
So okay, so that's the high-level intuition why we might be able to do a better analysis. Now, let
me get to the technical part. So we will need a technical lemma. So recall that when I say
MHR distributions, the MHR assumption can be interpreted as saying the revenue curve is
strictly concave. Now, this lemma will tell you that it's actually strictly concave, at least at the
optimal price, so it's saying that when we move away from the optimal price in the quantile
space, then the revenue drops at least quadratically as how far we move away in the quantile
space. So here's the intuition. There's some technical definition for MHR in terms of this virtual
value and so on, but you can also play with it and write it as a differential inequality about the
revenue function over the quantile space, so it's just this guy, and then now let me explain how to
interpret this. So on the left-hand side, I have kind of a second-order derivative of the revenue
function on the quantile space, and on the right-hand side, it's basically saying how -- kind of a
lower bound on the magnitude, so it's saying that it's negative, and the right-hand side tells you
how negative it is. So what is this R of q minus q times R prime of q? It's basically taking the tangent
line at q on the revenue curve and considering its intercept with the R axis, and then the
length of this intercept is exactly R of q minus q times R prime of q. So basically, the right-
hand side is saying that the magnitude of this second-order derivative is at least as large as the
intercept with the R axis of this tangent line.
>>: [Indiscernible].
>> Zhiyi Huang: It's not exactly the [indiscernible] thing, but I do think it's kind of similar.
>>: The [indiscernible] if you complete the function of [indiscernible].
>> Zhiyi Huang: Yes, in some sense. Yes. But in any case, like using the fact that q is between
0 and 1, it's saying that the second-order derivative is at most minus the
length of this intercept. And in particular, if you take q to be q*, then this
tangent line is kind of a flat line, and then the intercept is simply the optimal revenue, right?
Okay. So it's saying that the second-order derivative at q* is at most negative R of q*.
And then at this point, the first-order derivative is 0, right? Because that's the definition of being
optimal. Now --
>>: [Indiscernible].
>> Zhiyi Huang: Yes, it's kind of strong concavity at this particular point. Now, kind of the
intuition that Taylor expansion tells you is that R of q kind of drops at least quadratically in
this form. Now, of course, Taylor expansion only tells you this holds in a small neighborhood of
q*, and we need this for any q, so the actual proof is more complicated, but at least this is the
intuition behind this. Okay.
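(Putting the last few statements in symbols -- this is one reading of the inequality on the slide, and the exact constants in the paper may differ:)

\[
R''(q) \;\le\; -\,\frac{R(q) - q\,R'(q)}{q^{2}} \;\le\; -\bigl(R(q) - q\,R'(q)\bigr)
\qquad \text{for } 0 < q \le 1,
\]
\[
\text{so at } q = q^{*}: \quad R'(q^{*}) = 0, \quad R''(q^{*}) \le -R(q^{*}),
\qquad \text{and near } q^{*}: \quad
R(q) \;\lesssim\; R(q^{*})\Bigl(1 - \tfrac{1}{2}\,(q - q^{*})^{2}\Bigr).
\]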
Okay, so given this technical lemma, now we are ready to show this 1/epsilon-to-the-three-halves upper bound. So we are given these samples, and for simplicity,
let's assume that the optimal price is also one of the samples we take. So what we
want to show is that for any sample vi with quantile, let's say, qi smaller than q* and revenue
which is a 1 minus epsilon approximation of the optimal, any such vi
will not fool the algorithm, with high probability. Now, in general, of
course, it may not be exactly 1 minus epsilon, but this seems to be the worst-case scenario. It's
the closest.
>>: Is this bad?
>> Zhiyi Huang: Yes, this is a bad event. I want to show that this bad event cannot happen with
high probability. Okay, so what does it mean that vi fools the algorithm or the algorithm picks vi
over the optimal price p*? It's basically saying that how much we underestimate the sale
probability of the optimal price is at least 1 minus epsilon times more than how much we
underestimate the sale probability of vi, right? So now, if you kind of move around the terms and
so on and also recall this picture, basically, all the error comes from the number of
samples that fall between qi and q*, right? Because the number of samples that fall between 0 and qi
kind of contribute to both, and then they do not really matter, at least they do not matter in the
first-order sense. So if you play this out, that basically means that the number of samples that
fall between qi and q* must be very small, and it's at least epsilon -- it's smaller than its
expectation by at least epsilon times m, basically. So this is because in order to underestimate
the sale probability of v*, we need the number of samples that fall between qi and q* to produce
an error that's at least an epsilon fraction of the total number of samples that we expect to fall
between 0 and q*, and that number is at least kind of order 1 times m, because the optimal sale
probability is at least 1/e, right? Okay. So we have some lower bound on how much error we
expect to see kind of in this small interval, so what we need next is an upper bound on how large
this interval is, and that's where the technical lemma comes in handy. So the technical lemma
says that when we move away from the optimal quantile, then the revenue drops at least
quadratically, and since this revenue is 1 minus epsilon times the optimal, that means this
interval cannot be larger than roughly square root of epsilon, right? Okay. So we have some
upper bound on the size of this interval, and we also have some lower bound on how much
additive error we expect from this interval, and now it's just simple application of the Chernoff
bound, and that tells you that with 1/epsilon-to-the-three-halves samples, with high probability this
cannot happen.
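(Here is the rough arithmetic behind that last step; the constants are only indicative.)

\[
R(q_i) \ge (1 - \varepsilon)\,R(q^{*}) \;\;\text{plus the quadratic drop} \;\Longrightarrow\; |q_i - q^{*}| \lesssim \sqrt{\varepsilon},
\]
\[
\Pr\Bigl[\#\{\text{samples in } (q_i, q^{*})\} \text{ is below its mean by } \varepsilon m\Bigr]
\;\le\; \exp\!\Bigl(-\Omega\Bigl(\tfrac{(\varepsilon m)^{2}}{\sqrt{\varepsilon}\, m}\Bigr)\Bigr)
\;=\; \exp\!\bigl(-\Omega\bigl(\varepsilon^{3/2} m\bigr)\bigr),
\]
which is small once m is on the order of 1/epsilon-to-the-three-halves, with a logarithmic factor for a union bound over the candidate prices.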
>>: [Indiscernible] distributions together?
>> Zhiyi Huang: Sorry, sorry.
>>: You're using this 1/e fact for MHR distributions here, right? That there's a probability
of accepting at --
>> Zhiyi Huang: Yes, I'm using the fact that basically q* is at least 1/e.
>>: Right, so do you also do something for alpha-regular distributions in this paper?
>> Zhiyi Huang: Yes, good. So there's kind of a family of assumptions that kind of interpolate
between MHR and regular, proposed by Tim recently, called alpha-strongly regular distributions. So for
alpha strongly regular distributions, again, the right answer is 1/epsilon-to-three-halves, but the
constant kind of becomes worse as alpha goes to zero. Okay, so kind of the regular case is a
singularity point when alpha goes to zero.
>>: Asymptotic [indiscernible] to 0.
>> Zhiyi Huang: The dependency on epsilon is still 1/epsilon-to-the-3/2. Yes.
>>: [Indiscernible].
>> Zhiyi Huang: And yes, both upper and lower bound. But in that case, both the upper and
lower bounds do not match in terms of the dependency on alpha, at least for our current analysis
they did not match, but still the dependence on epsilon is 1/epsilon to the 3/2. So that's the upper
bound. So basically, it's just exploiting the fact that when we estimate a revenue or a sale
probability of these different samples, there are correlations, and this correlation is helping us.
And basically, although we don't quite estimate the revenue up to 1 minus epsilon accuracy,
that does not matter. When we overestimate, we overestimate everything, and when we
underestimate, we underestimate everything. What matters is the kind of relative relationship of
these expected revenues that we estimate, and 1/epsilon-to-the-3/2 is enough from that perspective.
Okay, so now let me try to explain the lower bound framework. So at the high level, we are
trying to reduce the sample complexity of this revenue maximization problem to the sample
complexity of a classification problem. So suppose we have two prior distributions, D1 and D2.
Both are regular or both are MHR, depending on which setting we are talking about, and suppose
they are different enough in the sense that they have disjoint 1 minus 3-epsilon
optimal price sets. So what does that mean? It means that for any price, if it's a 1 minus 3-epsilon
approximation for D1, then it cannot be a 1 minus 3-epsilon approximation for D2
simultaneously and vice versa. Now, if this is the case, then any pricing algorithm that is 1
minus epsilon approximation for both D1 and D2 will effectively distinguish these two
distributions, so basically, you run the algorithm on the samples, and then you see what's the
price that is suggested, and if the price it suggests falls into these 1 minus 3-epsilon optimal price
of D1, then you kind of assert that the underlying distribution is D1. Otherwise, you say it's D2.
And if the algorithm is good, then you will succeed with high probability. Okay. So effectively,
a good pricing algorithm can distinguish these two distributions. So the high-level plan is to
construct these two distributions to be as similar as possible, subject to having disjoint
approximately optimal price sets, and then use tools from information theory to lower bound the
number of samples we need to distinguish these two distributions, and that will translate to a
sample complexity for the kind of revenue or pricing algorithm. Okay, so many of you may be
familiar with this, but let me go over this anyway, to make sure we're on the same page, just
some information theory basics. So when it comes to distinguishing two distributions, we need
some measure of distance of the two distributions, and it turns out the right notion of distance
here is the statistical distance defined in the first line. And, basically, if you want to distinguish
these two distributions with probability significantly higher than one-half, then the statistical
distance between the two distributions has to be at least as large as some constant, so that's the
takeaway point from this. And in our case, what is P1 and P2? P1 and P2 will be the joint
distribution of kind of m IID samples of D1 and D2 that we talked about. And usually, it's kind
of pretty hard to handle or compute the statistical distance of these kind of distributions. And
that's why in the literature, usually people use KL divergence instead. So the definition of KL
divergence is in the second line, and KL divergence and statistical distance are related to each
other by what's called Pinsker's inequality. So Pinsker's inequality basically says that if the
statistical distance is at least a constant, then the KL divergence must also be at least as large as
some constant. Okay? And the nice thing about KL divergence is that if you take m IID
samples, then the KL divergence just gets multiplied by m. And so in order to show that the KL
divergence of P1 and P2 is small, it suffices to show that the KL divergence of the kind of base
distribution, D1 and D2, is small. Okay. Okay, so now if we can distinguish, we have a good
pricing algorithm using only m samples, that means we can distinguish D1 and D2 using m
samples. It means that the KL divergence between kind of D1 to the m and D2 to the m is at
least as large as the constant. Now, that means that the KL divergence between D1 and D2 must
be at least as large as some constant over m and the sample complexity -- on the flipside, it says
that the sample complexity has to be at least as large as 1 over the KL divergence of the D1 and
D2 that we construct. Okay?
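(To summarize the chain of inequalities just described, with c and c' denoting absolute constants:)

\[
\text{distinguishing requires } \mathrm{TV}\bigl(D_1^{\otimes m}, D_2^{\otimes m}\bigr) \ge c
\;\overset{\text{Pinsker}}{\Longrightarrow}\;
\mathrm{KL}\bigl(D_1^{\otimes m} \,\big\|\, D_2^{\otimes m}\bigr) \ge c',
\]
\[
\mathrm{KL}\bigl(D_1^{\otimes m} \,\big\|\, D_2^{\otimes m}\bigr) = m \cdot \mathrm{KL}(D_1 \,\|\, D_2)
\quad\Longrightarrow\quad
m \;\ge\; \frac{c'}{\mathrm{KL}(D_1 \,\|\, D_2)}.
\]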
Okay, so now the plan is to construct two distributions that have disjoint approximately optimal price sets but small KL divergence, okay? And when it
comes to construct distributions with small KL divergence, we use some lemma from the
differential privacy literature. So it's basically saying that if, pointwise, the density functions
differ by no more than a 1 plus-or-minus epsilon factor, then the KL divergence is at most epsilon
squared. So kind of the KL divergence is just the expectation of the log of this ratio, right? So
the fact that this ratio is between 1 plus minus epsilon basically says that the log is at most
epsilon, so the trivial analysis will give you that the KL divergence is at most epsilon, so why
epsilon squared. So the observation is that these signals or samples that you get not only give
you kind of positive signals but also give you false negative signals. So for example, like if two
distributions are very similar, then when you get one sample and you see that the density of D1 for
this sample is larger than the density of D2, you assert that it's more likely that the underlying is D1,
right? But it's possible that this underlying sample is actually from D2, and you are getting some
kind of false negatives, right? And kind of mathematically, it's saying that since D1 and D2 both
sum to 1, it cannot be that all these ratios are 1 plus epsilon. A lot of them will be 1 minus epsilon, as
well, and therefore the first-order term will cancel out and only the second-order term will
remain, and then you get kind of epsilon squared, right? So in any case, so now the goal is to
construct these two distributions such that point wise their density is kind of close to each other,
while keeping the approximately optimal price at this joint. So I will not bug you with the kind
of mathematical verification that they actually satisfy those conditions, but I'll kind of tell you
what the distribution looks like in terms of the revenue curve. Okay, so for regular distribution,
we are going to consider the following two distributions, so the first distribution is what's
sometimes called an equal revenue curve, so it's the blue line, basically a straight line, the
triangle that peaks at quantile q equals 0, and the second distribution is essentially the same,
except that we truncate it a little bit at the high-quantile regime, okay? So kind of from the
picture, you can see the two distributions are very similar, but also the kind of 1 minus epsilon
approximate prices are disjoint, because the approximate price set for D1 is essentially the blue
area, and the approximate price set for D2 is essentially the red area, right? Now, if you
calculate the KL divergence of these two distributions, then kind of for a majority of the samples
in the quantile space, the densities are identical. So only for a small fraction of the quantiles are
their densities different. And even for that small fraction of the samples where
they are different, they differ by at most 1 plus minus epsilon factor, and overall, you get that the
KL divergence is as small as epsilon-cubed, so that's how you get the 1/epsilon-cubed sample
complexity lower bound. And for MHR, again, very similar thing. You kind of construct two
things that kind of shift a little bit such that their prices are disjoint and then show that their KL
divergence is small. And the distribution we consider, the first one is kind of a uniform
distribution between 1 and 2, and the second distribution, we basically kind of scaled down the
density between 1 and 1 plus square root of epsilon by a little bit and then scaled up the density
in the rest of the value space by 1 plus epsilon factor, and it turns out that's enough to shift the
distribution enough to have disjoint price space, and then the KL divergence turns out to be
epsilon to the three halves, okay?
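(Here is a small Python sketch of that MHR construction as described: D1 is uniform on [1, 2], and D2 lowers the density on [1, 1 + sqrt(eps)] and raises it by a 1 + eps factor on the rest, with the lowering factor chosen here so the density integrates to 1 -- the exact construction in the paper may differ. Numerically the KL divergence indeed scales like epsilon to the three halves.)

    import math

    def kl_uniform_vs_perturbed(eps):
        """KL(D1 || D2) for D1 = Uniform[1, 2] and a slightly perturbed density D2."""
        s = math.sqrt(eps)
        # D2 density: (1 - a) on [1, 1 + s] and (1 + eps) on [1 + s, 2];
        # a is chosen so that the density integrates to 1 over [1, 2].
        a = eps * (1.0 - s) / s
        # KL(D1 || D2) = integral of f1 * log(f1 / f2), with f1 = 1 on [1, 2].
        return s * math.log(1.0 / (1.0 - a)) + (1.0 - s) * math.log(1.0 / (1.0 + eps))

    for eps in [1e-2, 1e-3, 1e-4]:
        kl = kl_uniform_vs_perturbed(eps)
        print(eps, kl, kl / eps ** 1.5)   # the last column stays roughly constant, around 0.5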
Okay, so that's all the results for the asymptotic regime, and now let me spend -- or where should I stop? Oh, five minutes. Okay. So I'll give a very high-level view of why in the single-sample regime we get better than 0.5. So how do we prove 0.5?
It's actually very simple. So recall that regularity means that the revenue curve is concave, right?
And if we use the identity pricing, meaning that we take a sample and just use it as the reserve
price, then the expected revenue you get is exactly the area under the revenue curve, because you
take the sample, and the revenue you get for that sample is just the height in that quantile space,
and then you take expectation, just the area under the curve. And it's concave, so this area is at
least as large as the area of this triangle that peaks at the optimal quantile. Now, it's basic geometry,
right? So what's the area of this triangle? It's the height, which is R*, times the base, which is 1,
and then divided by 2, and therefore the expected revenue is at least the optimal over 2.
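(In one line, using that the concave, nonnegative curve R lies above the triangle with vertices (0, 0), (q*, R*), (1, 0):)

\[
\mathbb{E}\bigl[\text{revenue of identity pricing}\bigr]
\;=\; \int_{0}^{1} R(q)\, dq
\;\ge\; \tfrac{1}{2}\, R(q^{*})
\;=\; \tfrac{1}{2}\,\mathrm{OPT}.
\]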
Okay. So since in some sense MHR distributions make this curve strictly concave, that kind of
suggests that identity pricing will get strictly better than one-half, but that's not quite the case,
because it's not strictly concave in the usual sense. So kind of this is the technical definition, and
it basically says that for the quantiles that are larger than q*, meaning for the values that are
smaller than the optimal price, indeed, this is strictly concave. And in fact, we can show that it's
at least as concave as the revenue curve of exponential distribution. Okay. However, for the
first half in the quantile space, not necessarily, and in particular, we can kind of truncate any
MHR distribution, which means that we kind of flatten that part of the curve to be a straight line, and that's still an
MHR distribution, and that part is not strictly concave. And in fact, a point mass is essentially an MHR
distribution, and for a point mass, identity pricing only gives you one half of the optimal, right?
Okay, so not exactly point mass. Let's say uniform distribution between 1 and 1 plus epsilon for
very small epsilon, then basically the revenue per sale is 1, right? But the sale probability for
identity pricing is only one-half.
>>: [Indiscernible].
>> Zhiyi Huang: Yeah. Kind of a point mass, but you kind of play with it a little bit.
>>: I see you [indiscernible].
>> Zhiyi Huang: Yes, so it seems like -- okay, so the takeaway point is this. For MHR
distribution, the revenue curve looks like something as concave as exponential plus some
truncation at the beginning. So let's say that’s the case, and let me explain how to get around this
and how to get better than 0.5. So the hard part is the truncation part, right? Because for that
part, identity pricing gets exactly one-half. So here's the thing. For the truncation part or a
point mass, we have good revenue per sale. The revenue per sale is exactly R, basically, but the
sale probability is kind of small. It's only one-half. Now, if we kind of scale down the sample
values slightly by some factor C, which is bounded away from 1, then we basically double
the sale probability, right? Now, we sell for sure. And we only lower the revenue per sale by a
factor C. So that's a big improvement for this part. Now, of course, like the truncation, it's not
strictly point mass, and we need C to be sufficiently small to handle all the cases, but that's the
takeaway point. Now, for the strictly concave part, we show that it's at least as concave as the
exponential distribution, and for exponential distribution, identity pricing is actually pretty good.
It's like 0.68, much better than one-half. So we can afford to lose something on that part. And
again, by scaling down the sample value by a factor of C, we lower the expected revenue by at
most C, so if C is sufficiently large, then for this part, our expected revenue is still much better
than one-half. So, basically, we choose C to balance the analysis for these two parts, and then
that gives us 0.589. Okay? All right, so I think I will skip the impossibility results, and so this is
kind of the summary of our results. So for the four main regimes that we considered, we
proposed a couple of new results. For the asymptotic regime, we essentially have the right
answer for both MHR and regular distributions. The sample complexity for regular distribution
is 1/epsilon-cubed, and the sample complexity for MHR distribution is 1/epsilon-to-the-1.5. And
for the single-sample regime, we get some new upper and lower bounds for MHR distribution,
and noticeably, for MHR distribution, we can get much better than one-half, but there are still
some pretty large gaps between the upper and lower bounds. Okay. So there are many
interesting future directions for this problem. One of them is to close the gap for the MHR
single-sample regime, and another one is how do you do the pricing with more than one but only
a constant number of samples? So far, our analysis and techniques are quite specific for only one
sample. For example, if I have two or three or five samples, I don't know how to do better than
this 0.589, or one-half for regular distributions. Of course, when the number of samples goes to
very large, now the asymptotic part kicks in and we can do close to 1 minus epsilon, but with a
few samples, we don't know what to do. So it would be interesting to see something in that
regard, because in many practical scenarios, we have like some data, but it's not large enough so
that those asymptotic regimes kick in and you can use concentration bounding and so on. So it
would be interesting to see some techniques here. And also, it would be interesting to analyze
the sample complexity for more complicated settings, like here, it's the absurdly simple setting of
one buyer, one item for sale. In practice, you're interested in more complicated settings, and can
we apply some of the techniques we developed in this work to these more complicated settings and
potentially get the right sample complexity answers for those settings? And also, like, beyond
IID samples, kind of having access to these priors via IID samples is slightly more realistic
than having closed-form descriptions of those priors, but still, it's a questionable assumption. As
soon as the bidders find out that you're using their bids to design future auctions, they may play
around it, and also, there may be small changes in their prior every day and so on, so it would be
interesting to see if there are some more realistic settings where we can still develop interesting
sample complexity, lower and upper bounds and algorithmic questions, as well. Okay. So that's
all. Thank you.
>>: What is your [indiscernible] samples? Is it not clear the algorithm should be clear? It
should be the same thing, no, or do you expect different algorithms?
>>: What is the same thing?
>> Zhiyi Huang: Yeah, what's the same thing for constant number of samples?
>>: Okay, I just think of this as my distribution, as my intermediate solution and then I
[indiscernible] and I calculated this for the --
>> Zhiyi Huang: Okay, so --
>>: This is [indiscernible] distribution, right?
>> Zhiyi Huang: So that's one potential, especially for regular. For regular, I think that might be
the right algorithm, but still, even for that algorithm, we don't know how to analyze it to show
that it's better than a one-half approximation until, like, the number of samples gets really large. And
second of all, for MHR, this work seems to show that you want to scale down the sample value
by some factor, and that helps the analysis, right, at least with one sample. So how do you do
that when you have multiple samples? Do you kind of compute the empirical best price and then
scale it down, or what do you do? So it's not clear, so there is also some algorithmic question
there, as well.
>>: I think that even for the regular, you might want to do [indiscernible], because if you -- the
error on either side is asymmetric. It's better to err on the lower side than to err on the upper
side.
>> Zhiyi Huang: So underpricing only kind of hurts your revenue by a little bit, but overpricing
can drop it all the way to 0, so in that sense, you kind of tend to underprice when you don't have enough
information.
>>: So you assume [indiscernible] for regular half [indiscernible].
>> Zhiyi Huang: For deterministic. So that's actually very simple. So first of all, since point
mass is one specific regular distribution, or something close to point mass, so for the
deterministic algorithm, you don't want to overprice. Basically, if you overprice, then you're
doomed for point mass. Now, for this equal-revenue curve that kind of peaks at quantile equals
zero, kind of underpricing always hurts, because basically the optimal revenue is achieved when
the price goes to infinity. Now, if you never overprice, then for this distribution, you can get at
most one-half, and identity pricing gets you that one half. Yes. But, of course, now you know
that if you use randomization, you can do 0.5 plus epsilon.
>>: But can you [indiscernible].
>> Zhiyi Huang: That's also like one of the new results in the upcoming EC paper. Yeah.
>>: But one cap.
>>: If you use randomization, you use regularity [indiscernible].
>>: [Inaudible].
>> Zhiyi Huang: Right, but --
>>: You can do better than [indiscernible].
>>: So there's still a gap between the lower bound and the upper bound for even asymptotic
samples, right? For [indiscernible]. The lower bound is 1/epsilon-cubed, and the best-known
algorithm is --
>> Zhiyi Huang: So it's off by a logarithmic factor. For all these, both regular and MHR
regime, it's off by a logarithmic factor.
>>: Okay.
>> Zhiyi Huang: But the main terms match.
>>: So do people in information theory know? Because I understand that now you have two
distributions, and you say, okay, they're very similar statistically, and that's how you
[indiscernible]. So do people in information theory, can they do this if they have more
distributions, to get the log of that?
>> Zhiyi Huang: Oh, I see what you mean. So you're saying that maybe like --
>>: Can I differentiate? So can I get a lower bound by not being able to differentiate between a
logarithmic number of distributions?
>> Zhiyi Huang: That's a good point. I haven't tried that, so that's potentially one way to get
there.
>>: [Indiscernible] do you know anything about that?
>> Zhiyi Huang: There's definitely literature doing this, like if you have m potential classes and
you want to do the classification algorithm, then what kind of condition you need, and some
other type of divergence. This is KL divergence, right? And then for multiple classes, some
other type of divergence is needed, but I don't know the literature well enough to answer you
right.
>>: For alpha regular distributions, does scaling down work, and what kind of [indiscernible] do
you get?
>> Zhiyi Huang: I don't know. We haven't tried that, actually, because even for this one, our
analysis and technique are quite primitive, and we don't think that's the right analysis and right
algorithm yet, so that's why we didn't try to extend it to alpha regular. But I would say probably
like scaling down still helps, and I -- okay, so now I remember, I thought about it a little bit.
>>: In the extreme case of alpha equal to 0, that's the equal-revenue distribution.
>> Zhiyi Huang: So let me say the following. So all the technical lemma we use for MHR and
single-sample regime has a counterpart for alpha regular distribution as well, so I'm pretty sure
we can get something like 0.5 plus epsilon, but what's the number, we don't know? And
probably the dependency on alpha is not very good as well, so we never really figure out the
answer.
>>: The 0.68, is that from the exponential distribution?
>> Zhiyi Huang: Yes, yes, that's very -- so there's a simple proof that this is e/4, basically.
This is e/4, but basically identity pricing achieves e/4 for the exponential distribution, and then
essentially the same argument for the regular case, so again, point mass is the regular
distribution, so you don't want to overprice, and then you can kind of show that if we don't
overprice, then identity pricing gives you the best approximation ratio for exponential
distribution, and that's e/4. And there's a slightly more complicated version that gives you
something like e/4 minus some small constant. So yeah, so your intuition is right. That's from
exponential distribution.
>>: Good. Thanks.
>> Zhiyi Huang: Okay.