>> Yuval Peres: So very grateful to James Lee for giving us this second installment in the description of
Talagrand's theory of majorizing measures.
>> James Lee: Thanks. So these talks are getting less and less formal. This room is very informal. So let me recall what I talked about last time, and then we'll get started on some new stuff today. So I'm going to assume a little bit of familiarity with what I'm talking about. I'm not going to go over the definitions again. Maybe I will. I just won't write them down.
So last time we had a Gaussian process. And if you weren't there, this is just a collection of jointly Gaussian random variables. So every finite linear combination of these variables has a Gaussian distribution. Then the quantity we said we were interested in, and which we motivated last time, was the expected supremum of all these random variables. And as I said last time, there's no loss in assuming this set T is finite. Assume T is finite all the time. This is an expected maximum. This is the quantity we're interested in.
And let me just say, today we're going to have to use more properties of Gaussian processes than we did last time, but the only two things we used last time were these. First, we assumed that all these random variables are centered. In general, when I say Gaussian process, I mean a centered Gaussian process. This is the last line. I hope this is visible.
Second, we used some kind of tail bound for Gaussians. So before I write this, let me add one more thing that we had last time. Whenever we have such a process, there's a canonical distance that the process carries. It's the following L2 distance on the set of random variables: d(s,t) is the square root of the expectation of (X_s minus X_t) squared. Given this distance, we can write the classical concentration inequality: the probability that X_s minus X_t is bigger than lambda is at most e to the minus lambda squared over twice d(s,t) squared. And note that d(s,t) squared is just the variance of X_s minus X_t.
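A quick numerical sanity check of this tail bound, with d(s,t) squared playing the role of the variance sigma squared of X_s minus X_t. This is my own sketch, not from the talk; the exact Gaussian tail is computed via the complementary error function:

```python
import math

def gaussian_tail(lam, sigma):
    """Exact P[Z >= lam] for a centered Gaussian Z with variance sigma^2."""
    return 0.5 * math.erfc(lam / (sigma * math.sqrt(2.0)))

def tail_bound(lam, sigma):
    """The sub-Gaussian bound exp(-lam^2 / (2 sigma^2)) used in the chaining argument."""
    return math.exp(-lam ** 2 / (2.0 * sigma ** 2))

# The bound dominates the exact tail for every lam >= 0 and every variance.
for sigma in (0.5, 1.0, 3.0):
    for lam in (0.1, 1.0, 2.0, 5.0):
        assert gaussian_tail(lam, sigma) <= tail_bound(lam, sigma)
```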
So that's the setup. Then what we did last time was ask: how can we control this quantity in terms of the geometry of the process? What I mean by the geometry of the process is this: we have the process, we have this distance, and the distance makes (T, d) into a metric space. And now we can look at the geometry of this metric space and see how the geometry affects this value.
What we did last time was the following. We said, okay, suppose we have a bunch of subsets, T0 inside T1, inside T2, all sitting inside T. And they're going to satisfy the following bounds: the size of T0 is 1, and for N bigger than 0 the size of T sub N is at most 2 to the 2 to the N.
So I'll say what these refer to; I'll remind you in a second. If we have sets satisfying these conditions, then we can use these sets to get a bound on this process as follows. So this is a result of Fernique; again, this is not how Fernique stated it, but it follows from Talagrand's work that the expected supremum of this process is at most this quantity.
Is at most -- and of course, in general, when I write A less than or equal to squiggly B, it means A is at most some constant times B, and this constant should be universal. So it doesn't depend on any of the parameters involved. This is at most ten times this, where ten is some constant.
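Written out, the bound just described reads (a sketch in the notation of the talk, with d(t, T_n) the distance from t to the nearest point of T_n):

```latex
\mathbb{E}\sup_{t\in T} X_t \;\lesssim\; \sup_{t\in T}\,\sum_{n\ge 0} 2^{n/2}\, d(t,\,T_n),
\qquad |T_0|=1,\quad |T_n|\le 2^{2^n}.
```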
Okay. This is what we proved. And the way to think about this was basically as follows. So T0 is a set of one point. So let's say that little t0 is the one point in big T0. And you can think about constructing a tree, where at the next level of the tree we have T1, and at the next level after that we have all the points of T2, and so on.
And then how did we prove this theorem? Well, we just said, okay, all the edges of the tree here, whose endpoints are pairs of points in the metric space, have lengths on them.
So the edges of the tree have lengths on them. If you look at what this quantity is, it just looks at basically the longest path in this tree, but where the nth level is weighted by 2 to the n over 2. If you want the upper bound to match, all you do is come here and count the number of edges at the nth level of the tree. There are at most 2 to the 2 to the n times 2 to the 2 to the n plus 1 edges at the nth level, because there are at most 2 to the 2 to the n points here and at most 2 to the 2 to the n plus 1 points here. So that's at most 2 to the 2 to the n plus 2.
Then if we apply this concentration inequality, we can take a union bound over all the edges -- by the way, if you didn't see the last talk this is not necessarily going to make sense. This is a refresher. I assume people saw the last one.
But basically, if we take a union bound over all these edges, we can be certain that for an edge between X_s and X_t at the nth level, X_s minus X_t is at most 2 to the n over 2 times its length. We take a union bound saying that none of these edges is stretched by more than 2 to the n over 2. If you look at the probability of being bigger than lambda times your distance, with this many events, you see that putting lambda equal to 2 to the n over 2 -- which is about the square root of the log of the number of edges at that level -- the union bound gives that none of these edges has value X_s minus X_t bigger than its length times 2 to the n over 2. In any case, we had these sets, took a union bound over the tree, and this is what we proved. This is Fernique's bound. Which is a nice bound. Go ahead.
>>: What if the capital T is only one element, so all the Ts are 0?
>> James Lee: Capital T is only one element?
>>: Yeah, what happens?
>>: Compute maximum.
>>: But the right-hand side is 0.
>> James Lee: So is the expected maximum.
>>: The expected maximum.
>> James Lee: The expected maximum of a centered Gaussian variable is 0.
>>: Right.
>> James Lee: So you get the right bound, even for when there's only one point. All right. So it's good to
check special cases. [laughter].
Okay. So now what we want to do is we got a pretty good upper bound. We'd like to consider lower
bounds on this quantity. In fact, I said what the lower bound we're going to prove is last time but let me
restate it in a second in a way which is going to be useful for what happens later. Let me restate the upper
bound in a slightly different way as follows.
Instead of thinking about subsets, let's think about -- this pen is awful so we're not going to use it.
Let's think about a sequence of partitions of T, such that it satisfies two properties. The first property is that it's a nested sequence of partitions: A sub N plus 1 is a refinement of A sub N for N greater than or equal to 0. And the second property is that these partitions -- we're taking a sequence of partitions, each one into some bounded number of pieces, and as the index increases, we get more and more pieces. They become more and more refined.
So my claim is that we can restate these bounds in the following form -- I guess I should first introduce a piece of notation. I'll use this a lot. For a partition P of the set T and a point little t in T, let P(t) be the unique set of the partition that contains little t. So in general we have a partition, and I'm just naming the set that contains t: it's P(t). I claim we can rewrite this bound now in the following way -- not rewrite it, but in fact write down a consequence of it: the expected sup over T of X_t is at most this quantity, which we'll see in a second.
So the same form: a sup over all t of the sum over n of 2 to the n over 2, but now we look at the diameter of the set A sub N of t. Okay. So this upper bound is going to be a little bit easier to work with -- I mean, this upper bounding mechanism. To see that this works, just notice that if you give me any such sequence of partitions, then from the partition A sub N I'll give you a set T sub N. This set has at most 2 to the 2 to the N points. What do I do? Take one point from every piece of the partition. Then you can just check that this quantity is upper bounded by that quantity, because the distance from any point to T sub N is at most the diameter of the piece that contains it.
Well, we know this is an upper bound. So I claim this is an upper bound on this, which is an upper bound on this. But this is going to be a little bit easier to work with. So this is the upper bound. And now the immediate question that arises -- or maybe one that arises after you stare at it for a while -- is how good an upper bound this is.
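In the partition language, the restated upper bound is (again a sketch in the talk's notation):

```latex
\mathbb{E}\sup_{t\in T} X_t \;\lesssim\; \sup_{t\in T}\,\sum_{n\ge 0} 2^{n/2}\,\operatorname{diam}\big(A_n(t)\big),
```

where A_n(t) denotes the piece of the nth partition containing t.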
So after Fernique proved this upper bound -- again, in a different form than this, using the language of measures instead of the language of partitions -- he started thinking about trying to prove that there are processes where this is not the right way to control the expected supremum. And apparently he thought about it for about ten years. At some point during that time he became convinced that this is not the best way to upper bound this quantity. And he actually started asking Talagrand to work on more and more complicated examples, trying to prove that sometimes you could do much better than this upper bound.
And as the story goes, every time he would give Talagrand some hairy example, eventually Talagrand would come back and say, no, actually there's a sequence of partitions, or a majorizing measure, that gives a good upper bound. And one day Talagrand said to himself: why don't I try to prove that in fact this upper bound is great, and is the only way to upper bound the expected supremum? Then, as he told me a couple of weeks ago, two days later he proved his majorizing measures theorem, which I'll state now.
It says that basically this is the only way to upper bound the expected supremum of a Gaussian process: by choosing these partitions and looking at this value. So for every Gaussian process like this, there exist partitions A sub N satisfying 1 and 2 over here such that this holds. So, with the diameter of A sub N of t, this holds.
Now I'll state it this way. I could put the sup here, but let me state it this way because it's a little bit clearer. We said that every sequence of partitions gives us an upper bound on the expected supremum. So Talagrand proved that, actually, given any process, there exists a sequence of partitions such that this is also a lower bound, up to a constant. This is the strength here. Maybe I should -- I'll put it so it matches here. Such that this is also a lower bound. Okay. So in particular, this shows -- if you remember what I talked about last time; okay, I understand why Yuval is staring at me in a strange way, since we talked about this last time -- this, for instance, characterizes sample continuity of Gaussian processes.
You can say exactly what geometric condition has to be satisfied for a Gaussian process to have this sample continuity property we talked about last time.
So now I want to focus on this. So basically how can we construct really good partitions of this space?
Every partition gives us an upper bound. Now I want to construct one somehow that matches the upper
bound. To do that we'll have to use more than just these two properties of Gaussian processes.
And so we'll go through the properties now. We'll prove them. It's really interesting if you want to think about extending this theory to things that are not Gaussian processes, because both of these properties are fairly lightweight. For processes whose random variables have sub-Gaussian tails all over the place, you'd hope you could prove -- and in fact you can prove -- this upper bound for all these families. But the lower bound requires, or at least is believed to require, much stronger properties of Gaussian processes. So let's talk about what they are.
So the first thing we need to do -- if people remember what we used -- is this: we proved an upper bound, and we're trying to prove a matching lower bound. We want to have something which matches the behavior of this upper bound in a fairly strong way.
So if you were in the talk last time, the test case for the analysis of this union bound, and what we would get if we applied it to many points instead of just one, was the following: we thought about taking a bunch of random variables that were uniformly spread out. Let me just draw a picture of what I mean.
So we said, look, suppose we have a bunch of random variables, and now look at this metric on them. Suppose every pairwise distance is about this quantity alpha. We said: look at this inequality and take a union bound. So suppose we have M points here. Then the union bound says that the expected supremum in this case, for these M points -- maybe we'll write it this way -- is going to be about, well, we can prove an upper bound which is something like alpha times the square root of log M. All these distances are alpha. If we take one pair, then the probability that they differ by more than some c times alpha decays like e to the minus c squared, so we should choose c to be about the square root of log M, and union bound over the M squared pairs.
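The calculation just described can be summarized as follows (with d(s,t) of order alpha for every pair):

```latex
\Pr\big[X_s - X_t \ge c\,\alpha\big] \;\le\; e^{-c^2/2},
\qquad\text{so } c \asymp \sqrt{\log M}\text{ and a union bound over }\binom{M}{2}\text{ pairs give}\quad
\mathbb{E}\max_{1\le i\le M} X_{t_i} \;\lesssim\; \alpha\sqrt{\log M}.
```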
This was the motivating case for handling one level of the tree last time. As a first step, we'd like to be able to say that this bound is tight in the simple case where our process is a bunch of random variables that are all uniformly spread out. That's what we'll concentrate on first: just proving that we can match this upper bound for this simple kind of process.
Okay. And this is actually a classical inequality, the so-called Sudakov inequality, which says this. Let's take a Gaussian process, and let's suppose we have a uniform lower bound on all the distances: d(s,t) is at least alpha for all distinct pairs in T. Then we'd like to prove the lower bound that the expected supremum is at least alpha times the square root of the log of the size of T. Okay. And it's not going to be equality; there's going to be a constant.
We're trying to prove a matching lower bound, so this seems like something we should try to do. I give you some uniform lower bound on all the distances -- I say X_s minus X_t has some definite amount of variance -- and you prove that the expected maximum has to be really big. As the number of points grows, we get growth that matches what the tail bound gives us.
Okay. So this is our first goal. Actually, it seems like it almost should be very easy. You can think about it and might start thinking about how to prove it.
Okay, it's not so easy. But at least we can imagine one very easy case of this, which is if the Gaussians are all independent. And this is not just an example: we're going to reduce eventually to the independent case. But let's look at something even simpler. Let's just take G1, G2, up to G sub M to be i.i.d. standard Gaussians -- i.i.d. normal(0,1) random variables. Then I'll leave it as an exercise to see that in this case the expected maximum of G1, G2, up to G sub M grows like the square root of log M. It's just the fact that the tail really does behave like this: there really is e to the minus c times t squared probability mass past t. So the probability that one of them is bigger than the square root of log M, if you put in the right constant, is at least say 1 over M, or 1 over the square root of M if you like. And if you have M of them, one should get there, because they're independent.
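That exercise can be checked by simulation; here is a minimal sketch (the helper name, trial count, and seed are my own choices, not from the talk):

```python
import math
import random

def avg_max_of_gaussians(m, trials=100, seed=0):
    """Monte Carlo estimate of E[max(G_1, ..., G_m)] for i.i.d. standard Gaussians."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / trials

# The expected maximum grows like sqrt(2 log m): check each estimate lands
# in a generous window around that rate.
for m in (100, 1000, 10000):
    est = avg_max_of_gaussians(m)
    target = math.sqrt(2.0 * math.log(m))
    assert 0.5 * target < est < 1.1 * target
```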
>>: [inaudible].
>>: One of the --
>> James Lee: You mean you want to say you want to count the expected number they get out there?
>>: [inaudible] the expected number. You can't do it the second moment.
>> James Lee: Good question. Even -- so the answer is, I don't think so.
>>: If you know the distance, what can you say about the probability that both variables are large? Do you have that information?
>> James Lee: Certainly. Certainly. In the case when you just have two you can write down exactly what
they are. You can write down the projection of one on the other. But it is not going to be enough.
So let me think about it while I'm talking and come back to it. I don't think it's enough to use the second moment here. I mean, it is enough to use second moments, I think -- just not with the second moment method. We'll have to use something slightly more delicate --
>>: Method by --
>> James Lee: Nonetheless, the lemma we'll present has some independent value. But what's the problem? Okay. We don't have that much time. Let's not get wrapped up in it at the moment.
Okay. So we're going to use this. And then another classical fact, called Slepian's lemma, which is the following. Let's take two Gaussian processes on the same index set T -- I'm making this notation up now: X and Y are going to be Gaussian processes -- such that one of them always has distances bigger than the other. So the expectation of (X_s minus X_t) squared is always bigger than the expectation of (Y_s minus Y_t) squared for all points in our set.
Okay. Then the claim -- and again, the claim seems fairly intuitive -- is that the expected sup of the X_t's is bigger than the expected sup of the Y_t's. So if all the distances here are bigger than all the distances there, then the same is true of the expected suprema. We'll prove this in a second. But let's first see that, given the lemma, we can reduce Sudakov's inequality to the independent case.
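In symbols, the lemma just stated is:

```latex
\mathbb{E}(X_s - X_t)^2 \;\ge\; \mathbb{E}(Y_s - Y_t)^2 \quad \text{for all } s,t\in T
\qquad\Longrightarrow\qquad
\mathbb{E}\sup_{t\in T} X_t \;\ge\; \mathbb{E}\sup_{t\in T} Y_t.
```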
The way we'll do that: we'll map all the X's to independent random variables and make sure that all the distances go down. So I have X_t. What should Y_t be? Y_t will just be alpha over the square root of 2 times G sub t, where the G's are a family of i.i.d. normal(0,1) -- i.i.d. standard Gaussians.
>>: [inaudible].
>> James Lee: I'm proving that, with this lemma, we can reduce Sudakov's inequality to this elementary calculation. So this lemma involves two families of random variables on the same index set. You give me your X_t's; here are the Y_t's: alpha over the square root of 2 times a bunch of i.i.d. Gaussians. Now you can check that the distance between any two here is exactly alpha. By assumption, over here the distance between any two is at least alpha. So we definitely satisfy that all the distances have gone down.
Therefore, the conclusion holds. So what does the conclusion say? The conclusion says that the expected
supremum of the XTs is at least the expected supremum of the YTs, but now this is exactly equal to alpha
over square root 2 times the expected supremum of the G sub Ts and we already said that an elementary
calculation gives us this.
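Spelled out, the elementary calculation being referenced is:

```latex
Y_t = \tfrac{\alpha}{\sqrt{2}}\, G_t
\;\Longrightarrow\;
\mathbb{E}(Y_s - Y_t)^2 = \tfrac{\alpha^2}{2}\,\mathbb{E}(G_s - G_t)^2 = \alpha^2 \;\le\; \mathbb{E}(X_s - X_t)^2,
```

so Slepian's lemma gives that the expected sup of the X_t's is at least alpha over the square root of 2 times the expected maximum of the G_t's, which is of order alpha times the square root of log of the size of T.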
Okay. So if we can prove Slepian's lemma, we've proven Sudakov's inequality, which at least matches the upper bound in the uniform case. Now we prove Slepian's lemma. And it's a very -- I'm going to erase this. Hopefully everyone got it. If not, apparently it's being recorded. But let me keep Sudakov's inequality here.
>>: In general, parallelize does it [inaudible].
>> James Lee: The covariances, the values -- these values of the distance over all pairs of points in the set actually determine the entire process. So the process is determined by this. Not just the supremum: the process itself, the random variables, are determined by these values. If you want to translate, they're determined exactly by the distances up to a translation.
>>: [inaudible].
>> James Lee: What's that? Yeah. So by a mean 0 Gaussian. So in other words, if I look at the process X_t, or the process X_t plus Z, where Z is a mean 0 Gaussian, these two processes have the same distances, even though they might have -- yeah, they'll have different covariances. But this actually won't affect the expected supremum, because the expected value of Z is 0. Okay. So now we're going to do maybe the first nontrivial thing that's happened, including last time, which is proving this inequality.
I like the proof. So that's why I'm presenting it. Also it's useful. This is informal, so we can prove everything we want to. So how does the proof go? It's going to go like this. First of all, I'm going to assume T is finite. Obviously we're only using it for the finite case, but if you prove this for finite T, you can just check that it immediately follows for arbitrary index sets T. So let's consider two processes, X equals X1 up to XN, and Y equals Y1 up to YN. And let's recall our assumption that --
>>: [inaudible]
>> James Lee: Yes. Good call. Let's just rewrite our assumption in a slightly different form. Okay. So I'm going to almost prove this in full generality. Here's the assumption I'm going to make -- let's make sure I go the right direction here. That the expected value of Y_I Y_J is at least the expected value of X_I X_J, for all I and J. And I'm going to assume that the expected value of X_I squared equals the expected value of Y_I squared, for all I. So certainly, given both these conditions -- this is weaker than that. These imply what's written in red, because this condition combined with this condition gives you that condition.
It's only slightly weaker. What's that?
>>: What implies what again?
>> James Lee: If this theorem --
>>: The theorem implies that.
>> James Lee: So let's say it this way. This condition implies this condition. The theorem goes the other way. But this condition implies this condition. But I'm going to make this assumption, so I'm not going to prove this in complete generality. By the way, with a small tweak, the same argument I just gave lets you use the weaker assumption: instead of writing Y sub T equal to this, you add some beta Z -- beta depending on T -- to all the variables, where Z is just some other i.i.d. normal(0,1). Then you can basically ignore all this stuff, and the reduction to the standard Gaussian case is the same because this is all independent. And you can choose beta so that this condition is satisfied. But don't worry about that. I'm going to use these conditions, which are slightly stronger than those conditions, just for ease of exposition.
Okay. So now let F be just some function from R^N to R. Okay. This is the whole thing. You should think about -- I'm not sure what you should think about. Let F be some arbitrary function. Then what we're going to do is the following. Let's define -- we're starting with X and we want to compare X to Y.
So what we're going to do is define some kind of continuous interpolation between the two processes. So let's define X of -- this is annoying, but I thought about it, and it's really the right thing to do. T is now going to denote a real number between 0 and 1; classically, it denotes a time. But it's okay, because the indices here are now I and J, so you won't be confused. So everything is written in red: the indices are I's and J's, and think of T as time. We could replace T by something like epsilon, but epsilon doesn't seem like a time. So all right.
So think about this process which at time 0 is X and at time 1 is Y. We'll move between X and Y. Then we'll try and see how some quantity we care about evolves as we move from X to Y. Think about F as the quantity we care about. Then let's finally define phi of T to be the expected value of F applied to this process at time T. So we have some value we care about, and we look at the expected value of F of this thing at time T. And if this is phi of T, what we're going to try to show -- I haven't given any conditions on F yet -- is that phi of 0 is at most phi of 1. That will imply that the expected value of F on X is at most the expected value of F on Y, because at time 0 this thing is X and at time 1 this thing is Y. It's clear what we should do: we should analyze the derivative of this thing with respect to T and see how F is changing.
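For reference, the interpolation being set up is:

```latex
X(t) \;=\; \sqrt{1-t}\,X + \sqrt{t}\,Y, \qquad
\varphi(t) \;=\; \mathbb{E}\,F\big(X(t)\big), \qquad t \in [0,1],
```

so that phi of 0 is the expected value of F on X, phi of 1 is the expected value of F on Y, and the goal is to show phi is nondecreasing.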
So let's do that.
>>: It's going to increase it.
>> James Lee: We like to show the derivative of this thing is nonnegative. That will give us this.
In case you're wondering: if F were really the maximum, we would be proving this inequality in the wrong direction. F is going to be the indicator -- stronger, F is actually going to be the indicator of some product set. So F is going to be an indicator of a set in R^N. And we're going to prove, basically, that the probability of -- let's make sure we're going the right way -- the probability of Y being inside some set is bigger than the probability of X being inside that set. And therefore the probability of X being outside of that set is bigger than the probability of Y being outside of that set, which will give us something like this. But for now you can just look at everything in red and see what's going on.
All right. So let's hope that I remember how to do calculus. Let's take the derivative and see what happens. Okay. Well, we have to use the chain rule. So let me write it down and then we'll see what's going on: D_I F of X of T, times X_I prime of T, summed over I. So for the derivative at time T, you look at how much F is changing in the Ith direction and you multiply by how much the Ith coordinate is changing. That's the chain rule. Looks good. This is the derivative in direction I. So now we analyze this thing.
So now, basically -- and I wrote this as an expectation; there should be an expectation here. Okay. Now we're going to be able to analyze this term by term. So let's just fix one term -- fix some I -- and let's prove that this term is non-negative.
All right. So how can we do that? Well, here's the main fact, which is cool and which will prove the claim. Basically you want to look at how the derivative in the Ith direction correlates with what's going on in the Jth direction. So we'll look at this thing.
Okay. So this is super -- now we have to go back. So what is it? Now I take the derivative of X_I of T. This is a simple expression. So we're going to get minus one-half times (1 minus T) to the minus one-half times X_I, plus one-half times T to the minus one-half times Y_I. That's the derivative. Okay. And then we're going to multiply by X_J of T, which is just -- let's maybe write it here -- the square root of 1 minus T times X_J, plus the square root of T times Y_J, and we're taking expectations.
So, first of all, we didn't say it explicitly, but -- look, nobody would ever say this, but I'm going to say it anyway -- we can assume these two processes are independent. Right? We're never talking about them on the same side. So we assume the two processes are independent. And then the point is that the cross terms here cancel: the XY terms go away because they all have mean 0. And we're just left with something which is great, because this thing cancels out with this and this thing cancels out with this. And we just get that this is one-half times the expected value of Y_I Y_J minus X_I X_J.
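Spelled out, the cancellation just described is:

```latex
\mathbb{E}\big[X_i'(t)\,X_j(t)\big]
= \mathbb{E}\Big[\Big(-\tfrac{1}{2}(1-t)^{-1/2}X_i + \tfrac{1}{2}\,t^{-1/2}Y_i\Big)
\Big(\sqrt{1-t}\,X_j + \sqrt{t}\,Y_j\Big)\Big]
= \tfrac{1}{2}\big(\mathbb{E}[Y_iY_j]-\mathbb{E}[X_iX_j]\big) \;\ge\; 0,
```

with the cross terms vanishing because X and Y are independent and centered, and the time factors cancelling exactly.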
Okay. Which is greater than or equal to 0 by our assumption. So our assumption was made to guarantee that these things were positively correlated: when we increase I, somehow that should be positively correlated with J. So what does it mean? Let's just write down what this means. It means we can write X_J of T as some alpha_J times X_I prime of T, plus Z sub J, where Z sub J is an independent Gaussian random variable.
The whole point is that the inner product is bigger than 0, so the alpha_J's are nonnegative. So this is my claim. This is where you should really think about Gaussians in terms of Euclidean geometry again. We can write X_J as a projection onto this random variable, plus the orthogonal component Z sub J. This is orthogonal, which is the same thing as saying it's independent.
Everybody cool? We can write this like this? Okay. And then to finish -- let's finish over here. So now we need one more argument. We're going to look at this quantity and ask: how does this quantity depend on the alpha sub J's?
>>: [inaudible] function about --
>> James Lee: I'll add another assumption about F here. I just want to see what it's going to be before -- yeah. Right. So we don't know anything about F. F could be arbitrary. So we haven't said anything. So I said "try."
Okay. We're trying right now. So we write this like this. And now -- okay. So let's see what happens. Basically I want to think about what happens to this quantity as the alpha_J's increase. So when the alpha_J's are all equal to 0 -- maybe we'll just do the 0 case right here beside it -- what we get is the expected value of D_I F of Z of T, where Z of T is the vector of the Z_J's when the alpha_J's are 0, times X_I prime of T.
But now these Z's are independent of these things. So the expectation is the product of the expectations, and the expectation of all of these is 0. Because what are these? We wrote it down: these derivatives look like this, and both terms have expectation 0, so the expectation of the derivative is 0. The point is that this is 0: if all the alpha_J's are 0, then this thing is equal to 0. What I want to happen is that as I raise the alpha_J's, this quantity increases, never gets smaller. If we can prove that, then we can prove that this is bigger than or equal to 0, which proves this is bigger than or equal to 0. So now you just ask: what condition on F do I need so that when the alpha_J's increase, this thing increases? And the condition on F, you can check, is just that all the mixed second derivatives D_I D_J F are greater than or equal to 0. So what does it mean? Let's see what happens. So alpha_J, we know, is nonnegative.
So if this is negative -- for instance, if this is negative -- then increasing alpha_J makes X sub J go down, which by this fact makes this go down. Does that make sense? I hope it makes sense. So we assume that D_I D_J F is greater than or equal to 0. If this is negative, then as alpha_J goes up, this goes down; and since D_I D_J F is greater than or equal to 0, as this goes down, this goes down too; this is negative and this is going down, so the product goes up. If this is positive -- I should have pointed to the positive case first -- then as alpha_J increases, this increases, and since D_I D_J F is greater than or equal to 0, as this increases, this increases. So this goes up.
When the alpha_J's are equal to 0, this thing is 0. When they increase, this quantity increases. So we know that this quantity is greater than or equal to 0. And this thing is just a sum of these quantities, so we know this is greater than or equal to 0. So that implies, maybe in sort of nonlinear board usage, that phi prime of T is greater than or equal to 0 for all times T, which implies that phi of 0 is at most phi of 1. So we've proved that under all of these assumptions -- there is no try, there's only do. What is it? I don't know what it is. Anybody know what Yoda says? "Do or do not, there is no try."
All right. So we have this conclusion. So let's just use -- okay. So now let's finish the proof of Slepian's lemma. By the way, this is useful in a lot of things; for instance, you can prove the hypercontractive inequality. It's a nice trick to interpolate between the two processes and then take derivatives. It's a beautiful kind of thing.
So now to finish, we just need to choose our F. So what is F going to be?
>>: The square root of T -- could you have chosen any other interpolation?
>> James Lee: The nice thing here is that the two norm of the coefficients is 1. Could I have chosen a different interpolation? Basically the answer is no. What did we use for the interpolation? We used this great fact that the derivatives cancelled out and didn't depend on T. Basically we chose the speed of the curve so that our derivative term wouldn't have any dependence on T. Not a derivative term --
But a term that looked like this, XI prime of T times XJ of T -- I mean, we didn't have to do anything interesting, because it had no dependence on T. It was exactly expressible in terms of it. You could try to parameterize it by any speed you wanted; I think you would come up with this speed when you tried to work out the details.
I mean, yeah, also -- I get the sense this is also somehow -- right. So somehow, I mean, you can think of more complicated settings where you can choose a path between these two random variables. And everything that Talagrand does in physics comes from choosing these paths carefully. Anyway, this is a simple case. It's a very cool phenomenon: you interpolate from one Gaussian to another, see what happens, and try to get some monotonicity along the way.
So we need to finish. How do we finish? Just take F to be the indicator of a subset A here, where A is a subset of R to the N, and just take A to be the product of a bunch of half-lines. So I guess we should say capital N. So the half-lines look like -- maybe I should -- like this, for any lambda Is you want, and then, okay, first of all, you have to check that this condition holds.
It holds trivially: what happens as you move to the right? If you --
>>: [inaudible]. You can have that. Is it a point-wise inequality?
>> James Lee: This is point-wise. But, of course, it doesn't matter what happens at the boundary. So in any case, if we apply it to this F, we get that the Ys being in this intersection have a greater probability than the Xs being in this intersection. That's what we get when we apply the theorem. If we restate it, it means: take the union of all the events that YI is bigger than lambda I -- the event that some YI is bigger than lambda I. The Ys have a higher probability of falling in the product, so falling outside the product, which is what this union is, has a lower probability for the Ys than for the Xs.
And this immediately implies that the expected value of the sup of the XIs is bigger than the expected value of the sup of the YIs. You agree?
>>: You can still explain this derivative.
>>: How do you take the derivative of the [inaudible].
>>: Smooth or --
>> James Lee: No, I don't want it smooth. This is really like -- okay, I didn't even think about it, because it's such -- you don't need to -- you can take the derivative. I mean, you can smooth it if you like and take a limit of smoothings that converge to these sets. I'll leave it, because I don't know how -- so this is standard -- sorry. So I mean --
>>: Can you still explain intuitively, in terms of formal derivatives, why you have the right sign?
>> James Lee: Why is the sign correct -- that's a good question.
>>: So you have 0s.
>> James Lee: I understand. Except the one time it's interesting. So, crap. Somebody tell me if they see it before I say it. Because I had to do it earlier today, and now I'm going to have to like -- so the function gets larger as we go this way. That's the right direction, right? You agree it's the right direction? So what's the reason that it holds?
>>: Write formal derivatives. Minus infinity. Just write the first derivative and then take --
>> James Lee: You want to put the derivative minus infinity.
>>: Then it starts being minus, then it starts being 0 again. 0 above. So you need to -- do you differentiate horizontally now?
>> James Lee: This was horizontal. Now we should.
>>: Now differentiate vertically; then you see the thing goes from minus infinity to 0 there.
>> James Lee: Yeah. So it's either -- you're either 0 all the time, or if you sit on this line, then eventually, when you cross to the other side, as you go -- so it's really --
>>: The line goes down all the way. Minus infinity on that the whole time.
>> James Lee: Now we have the picture correct. So the function is 0 -- the derivative here is 0, minus infinity, 0. Here it's 0, minus infinity, 0. And so if you're here and you're still --
>>: Look at the derivatives.
>> James Lee: No, this is the horizontal. So if you're here, my claim --
>>: It goes from -- right. So 0, minus infinity, 0, and then above that it's only 0.
>> James Lee: I see. Sorry. It's 0 all the time out here.
>>: Yeah.
>> James Lee: 0 all the time out here. The only time it's minus infinity -- you can only move outside the set; otherwise you're 0. I just drew the set in a horrible way. Does it make sense, Claire? So sorry, I messed it up when I was preparing, and then I messed it up backwards. So the derivative is minus infinity here. If you're not on the boundary, it's 0 all the time as you move infinitesimally. If you're on the boundary, it can only go down: minus infinity.
>>: [indiscernible].
>> James Lee: They're nonpositive. That's why we chose the half-lines to the left. I mean, as you move in the positive direction, you can only move out of the set. If we had chosen them the other way -- which is what I did initially, I now realize -- then everything switches.
But, okay. So you do this. Then you switch again, passing to the complement, and you get that the probability that this thing is big is less than the probability that this thing is big. Then this follows -- everybody's there? This line goes to this line. You can use -- I mean, if you want, you can write it like this. Write the expectations like this. You see that all these things give you that, point-wise, these for the Xs are bigger than for the Ys, and these for the Xs are smaller than for the Ys.
So we get this sort of -- I don't want to call it majorization -- we get the fact that the expected sup of the Xs is bigger than the expected sup of the Ys. We've proven that as the distances increase, the expected supremum of the process goes up. Therefore we can always reduce to the independent case.
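The comparison we just proved can be seen numerically. Here is a small Monte Carlo sketch (not part of the lecture; the dimension and correlation value are illustrative): a strongly correlated process Y has smaller pairwise distances than an independent process X with the same variances, so its expected maximum comes out smaller.

```python
import numpy as np

# Monte Carlo illustration of the comparison: if the pairwise distances of X
# dominate those of Y (with equal variances), then E[max X] >= E[max Y].
rng = np.random.default_rng(0)
n, trials = 8, 200_000

# Y: strongly correlated coordinates (small pairwise L2 distances).
rho = 0.9
shared = rng.standard_normal((trials, 1))
Y = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((trials, n))

# X: independent coordinates (the largest pairwise distances, same variances).
X = rng.standard_normal((trials, n))

e_sup_X = X.max(axis=1).mean()
e_sup_Y = Y.max(axis=1).mean()
assert e_sup_X >= e_sup_Y  # the Slepian-type conclusion, empirically
```

This is exactly the reduction to the independent case: the independent process maximizes the distances, hence dominates the expected supremum.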
So I have five minutes. I want to at least finish the stuff about Gaussian processes, so next time we can
actually finish the proof of the majorizing measure theorem. So we need something stronger -- it's okay that I erased Sudakov's inequality -- we need something stronger than Sudakov's inequality, which is right here, the following idea.
So again we have a Gaussian process. And then hopefully we'll be able to see where the final lecture is
going at the end. Again, we have a Gaussian process. And now suppose we take a bunch of points T1 through TN, which are all in T, with D of TI, TJ bigger than or equal to alpha. So far it's the same setup as before.
But now I want to get a slightly stronger inequality, which is the following. Basically, Sudakov's inequality picked up one scale: if we know these points are uniformly separated, we get a lower bound. If we build this tree, we need to pick up this scale plus the rest of the scales coming from the smaller distances.
So here's the inequality that we're going to want to use. We get alpha square root log N like before. But we're also going to get the following: the minimum of the expected sups. So let me write it, and then we'll prove it. Then I'll explain what's going on, at least.
Okay. So here's what's going on. And it's best to draw a picture. So we have T, and we have all these
points T1, T2, T3, T4 and so on, that are separated. We know that since there are so many of them, and
they're at least alpha apart, the expected supremum is at least this much. But we would like more: we put down these small balls, balls of radius alpha over C, for C some constant I haven't specified yet.
Okay. So these balls are radius alpha over C. So what we're saying is we would like to get the contribution
from these centers TI and then also this minimum is over all these balls.
And then also we would at least like to get the contribution of the expected supremum from the smallest
ball here. Okay. So we're saying: look, let's fix some center T0. We know from Sudakov's inequality that one of these is going to be large. I don't know which one it will be; I can't control that. One of them will be large. And I'm saying that if one of these is large, then I should at least be able to get a little bit more, or possibly a lot more, from the expected supremum over this ball as well. So I want to be a little bit greedy. I want to get not just this, but this plus whatever the expected supremum of this ball is too. So some kind of inequality like this.
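Before the proof, here's a quick numeric sanity check of the one-scale term alpha root log N (an illustration, not from the lecture, using iid standard Gaussians, for which the mutual distance is alpha = root 2; the constant 1/2 below is just one that works comfortably):

```python
import numpy as np

# E[max of N iid N(0,1)] grows like sqrt(2 log N): the one-scale Sudakov
# bound, up to constants, for points at mutual distance sqrt(2).
rng = np.random.default_rng(1)
trials = 100_000
for N in (4, 16, 64, 256):
    e_max = rng.standard_normal((trials, N)).max(axis=1).mean()
    assert 0.5 * np.sqrt(np.log(N)) <= e_max <= np.sqrt(2 * np.log(N)) + 1
```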
So then -- that's going to be the hard part, which we'll do next time: somehow we need to choose this tree of balls in a very clever way so that we never lose anything more than a constant factor in constructing these trees. We need to choose them very cleverly. For now, let me prove this inequality so that we have it for next time.
Let's prove this, and then the next time we'll discuss actually how to choose these to construct these trees.
This indicates we're going to prove a lower bound by some kind of iterative process. We find these balls
that are well separated. We get this contribution now. Then we get a contribution, we get to go inside one
of the balls and get the contribution there. And of course inside the next ball we'll do something similar and
keep recursing.
The key, the reason it's hard is because we only get the minimum. We don't get to choose which one of
these values is going to be large. We only get the minimum one. When we choose the balls we have to
choose them so they're as balanced as possible. It does us no good to have a bunch of the supremum in this ball and a tiny bit in this ball, because then the process could say: okay, this one was large, and now I don't get any credit for that.
So there's got to be some kind of balancing that goes on when you choose these balls, but let's just look at
how you prove this. Okay. And the basic idea is as follows: So I mean we want to take a union bound.
We'd like to say, okay, how could you prove this? One way is to bound the probability that any of these balls is much smaller than its expectation -- take a union bound over all these balls, say all these balls are close to their expectations, and then we'd be in good shape. If they're all simultaneously close to their expectations, then we don't even have to condition; this will come for free.
So the whole goal is to show that the suprema in each of these balls can all simultaneously be taken close to their expectations. So we're going to say, first of all, one of these balls is close to its expectation, and then all of them simultaneously. And the problem with doing that is -- just look at one ball.
If we look at the process restricted to this ball and look at this value, why should this value be concentrated? We have no idea what the configuration of the process is inside here. In particular, we don't know how many points are inside here. We can't take a union bound over all points in this ball. We
use the following fact. This fact, by the way, is probably the thing that makes majorizing measures the
most specialized to Gaussian. So it's the following concentration effect.
So it's a classical theorem. But here's the fact. Let's again take any process. Then the probability that the sup of this process minus the expectation of this sup is bigger than some lambda is at most -- and here's where the factor matters -- E to the minus lambda squared over something.
That something is what's nice: here we can actually put the maximum variance, so we can put the sup over T of the expectation of XT squared. So the sup is concentrated around its expectation in a way that doesn't depend on the size of the set capital T at all. It only depends on the maximum variance.
In the language over here, the maximum variance in this ball is basically proportional to the diameter -- I mean, the diameter is twice alpha over C. So what it means is that since these balls are small in the metric space, their suprema are going to be very concentrated. In fact they're only going to differ from their expectations by this much, say, while we're adding on a term here that's really huge compared to this.
So somehow the fluctuation here is not too bad. Even if we take a union bound. Now we have M balls.
Suppose we use this and take a union bound. Then the fluctuation of the worst one could be like this.
But if we choose C to be large enough, this is still tiny compared to this. So basically we can use this
inequality, take a union bound over all the balls and get that the value in each one of those only fluctuates
by this much.
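The back-of-envelope behind that union bound can be sketched as follows (illustrative numbers, not from the lecture: M = N balls of radius alpha over C, so standard deviation at most 2 alpha over C inside each ball):

```python
import math

# Worst-case fluctuation over M = N balls after a union bound is about
# sigma * sqrt(2 log M) with sigma <= 2 * alpha / C; compare to the main
# term alpha * sqrt(log N). The ratio is 2 * sqrt(2) / C, independent of N.
alpha, C = 1.0, 100
for N in (10, 1000, 10**6):
    main = alpha * math.sqrt(math.log(N))
    fluctuation = (2 * alpha / C) * math.sqrt(2 * math.log(N))
    assert fluctuation <= 0.03 * main  # negligible once C is large (C = 100)
```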
But we're getting this much from this level. So if we choose C to be large enough -- choose C to be 100, say -- this is negligible compared to this, and the fluctuations don't mess up this term.
So what we get is their expectations, possibly minus this fluctuation, but the fluctuation can be swallowed up in the main term. So let me just -- okay, let's do, in one minute, how you prove this theorem. You prove this theorem from the classical fact that if we take any function from RN to R -- and let's equip RN with the Gaussian measure --
I'm just writing down the Gaussian isoperimetric inequality. Then the probability that F of X1 through XN minus the expected value of F of X1 through XN is bigger than lambda is at most E to the minus lambda squared over something that depends on the Lipschitz constant of F. This is a classical fact.
If you want to prove it without the optimal constant, you can do it in a page or two. But this is a classical isoperimetric-type inequality in Gaussian space: if we take any Lipschitz function on RN, equipped with the Gaussian measure, the function has to be tightly concentrated around its expectation.
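Here is a Monte Carlo sketch of that concentration statement (illustrative parameters, not from the lecture; the test function is the coordinate-wise maximum of absolute values, which is 1-Lipschitz in the Euclidean norm):

```python
import numpy as np

# A 1-Lipschitz function of a standard Gaussian vector concentrates around
# its mean: P(|f(X) - E f(X)| > lam) <= 2 exp(-lam^2 / 2), whatever n is.
rng = np.random.default_rng(2)
n, trials = 50, 200_000
X = rng.standard_normal((trials, n))
f = np.abs(X).max(axis=1)            # 1-Lipschitz with respect to ell_2
dev = np.abs(f - f.mean())
for lam in (1.0, 1.5, 2.0):
    assert (dev > lam).mean() <= 2 * np.exp(-lam ** 2 / 2)
```

Note the bound on the right has no dependence on the dimension n; that dimension-freeness is exactly what the lecture uses.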
>>: [inaudible] anything for particular Gaussian.
>> James Lee: You're right, this is not particular to Gaussian. So how do we prove this? That's the last step. What F do we use it for? This is the fact we're going to use -- you're right.
>>: You're not taking F for this?
>> James Lee: No. So I mean the supremum is Lipschitz, but we need to -- what's the issue? Why is there something confusing here? I agree the supremum is Lipschitz. The supremum is in fact 1-Lipschitz. So there's something tricky here.
What's the difficulty?
>>: You're missing an equality estimate on the expected moment of that.
>> James Lee: No. No. This is.
>>: This is the proof.
>>: Why would you need anything like that -- couldn't you just take F to be the supremum?
>> James Lee: Yeah. First of all, I mean, I can give you a trivial reason: this gets better as you take the variances to 0, and this wouldn't. That's not very convincing. The question is why you can't just take F.
>>: That's because you're using the two norm here and the supremum is like the sup norm.
>> James Lee: I agree.
>>: It gives you that result.
>> James Lee: No, you want to claim that it gives you this result. But I disagree. You need to see why I
disagree with you.
>>: Because -- how about the capital --
>> James Lee: So anyway, from this inequality you can prove this, and basically what we're going to do next time is choose this tree very carefully so that applying this inequality at every level gives us something that meets the expected maximum up to a constant factor. So we'll stop now -- this will be resolved in a second if anybody is truly interested. So what's the problem here?
>>: [inaudible].
[applause]