>> Yuval Peres: So welcome to the third in this series, so all the little guns you saw in the first
two lectures are all going to be firing today. Please, James.
>> James Lee: Thanks. Okay. So right. So our goal today is actually to prove the main
theorem, and I'll remind you what it is. And of all the -- well, I think this will be pretty
interesting. It's a cute proof, and I finally figured out a way to explain it in a talk.
So first let me remind you -- okay, just the object we're working with again. And I'll try to
write -- tell me if I'm not writing big enough. I see that many people are sitting quite a distance
away. All right.
The objects we're talking about again: a collection of jointly Gaussian random variables, the
Gaussian process. We equip this with a canonical metric, which is the L2 metric here. And our
goal was to understand the quantity, which is the expected supremum of this process.
And remember the philosophy is to understand this quantity in terms of the geometry. Okay. So
the index set here is capital T. The philosophy is to understand this in terms of the geometry of this
sort of -- of this metric space T. Okay.
So now let me remind you very briefly the -- what the upper bound was, because we're going to
prove a matching lower bound today. So we'll take a sequence of -- sequence of partitions of T,
and we'll call this sequence -- so this is the sequence of partitions. It's a sequence of increasing
partitions.
So here AN plus 1 is a refinement of AN. And we'll call this sequence, so this sequence of
partitions is -- let's call it admissible if it satisfies two properties. The first property is just that -- well, we start with the whole set. So we start with a trivial partition into one piece. And the second
property -- we saw this last time -- is that we have some upper bound on the sizes of these
partitions. Okay.
So the first partition is into one piece, and the Nth partition has at most 2 to the 2 to the N pieces.
And we saw -- and we saw before, I mean, in the first talk why this number comes up naturally.
Basically -- well, eventually we're going to be considering sort of the log of the number of points
as the important thing, and if you want the log to double, then you should be
squaring the number of points, which is why I have a growth pattern like this.
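Written out, the two conditions being described are, roughly,

    \mathcal{A}_0 = \{T\}, \qquad |\mathcal{A}_N| \le 2^{2^N} \quad (N \ge 1),

with each \mathcal{A}_{N+1} a refinement of \mathcal{A}_N. Squaring the cardinality bound at each step is exactly what makes the log double: \log\big(2^{2^{N+1}}\big) = 2 \log\big(2^{2^N}\big).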
And the important upper bound we proved was due to Fernique, and sort of in the earlier version
the entropy bound due to Dudley, which says that we can bound -- given any such admissible
sequence of partitions, we can bound the expected [inaudible] of the process in this way.
So I'll remind you what this notation means in a second. But given any admissible -- okay. So
this holds for every admissible sequence.
And just to remind you of this notation, let me write it here in red. For one of these partitions -- so if T -- if little T is a point of big T, then for some partition AN, AN of T is just the set in AN
containing little T.
>> [inaudible]
>> James Lee: All right. Fine. They do this in like -- all right. Okay. Okay. So the -- this -- that's -- so I hope this is -- so this is just the diameter of the set in this partition containing T.
And we proved this; this is the chaining upper bound. Okay. And so what you can do is you can
define -- let's define a functional, which is -- which is called gamma 2 of this metric space, which
is just the best possible upper bound that this Fernique chaining argument proves. So the
functional is just take the infimum over all admissible sequences of the upper bound you get. I'm
just writing the same thing over again.
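In symbols, writing A_N(t) for the piece of \mathcal{A}_N containing t, the chaining bound and the functional it defines are, roughly,

    \mathbb{E} \sup_{t \in T} X_t \;\le\; C \, \sup_{t \in T} \sum_{N \ge 0} 2^{N/2} \, \operatorname{diam}\big(A_N(t)\big)

for every admissible sequence, and

    \gamma_2(T, d) \;=\; \inf_{(\mathcal{A}_N)\ \text{admissible}} \; \sup_{t \in T} \sum_{N \ge 0} 2^{N/2} \, \operatorname{diam}\big(A_N(t)\big).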
So Fernique's bound exactly says that the expected supremum is upper bounded by the
gamma 2 functional. The gamma 2 functional gives you the best [inaudible]. And now we can
state what's called the majorizing measures theorem, which is what we're going to prove today,
that in fact such a sequence of partitions is the only way to upper bound the expected supremum.
So in fact the expected supremum of any Gaussian process is proportional to this gamma 2
functional.
Okay. So proportional just means up to an absolute constant. It's at most C times gamma 2, and
it's at least gamma 2 over C for some constant C. And this is due to -- this was conjectured by
Fernique and then eventually proved by Talagrand that this gamma 2 functional controls the
expected supremum. And this turns out to be a fairly powerful thing.
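So the statement is that, for some universal constant C,

    \frac{1}{C}\,\gamma_2(T, d) \;\le\; \mathbb{E} \sup_{t \in T} X_t \;\le\; C\,\gamma_2(T, d),

the upper bound being Fernique's chaining bound from before, and the lower bound being what gets proved today.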
I claim today that this will be the only proof of the majorizing measures theorem ever given that
it was possible to understand in a talk. That's my claim. It's a bold claim, but...
But the presentation of the proof of this theorem is always kind of disgusting. And okay. So I
think there's a nice way to do it. But this is our goal today. We've already proved the upper
bound. The goal is to prove the lower bound.
And now I just want to remind you what we were talking about last time. We introduced some
tools to prove the lower bound. So the first tool was the Sudakov inequality, which said the
following. Okay. Again there's this -- I'm not going to keep writing down this process.
We have this Gaussian process sitting in the background all the time. This Gaussian process X
of T says that if we take a bunch of points, T1, T2, up to TM, such that the pairwise distances in
our metric are all -- are large, so they're all at least alpha for I not equal to J, then -- and we
proved this last time using Slepian's lemma, this comparison inequality, then if we look at the
expected supremum of say XT1 up to XTM, we said that this is at least -- grows like alpha times
the square root of the log of the number of points here.
This is what we proved last time. Somehow it's a lower bound that matches what the union
bound gives. If we knew all the points were at distance alpha, then sort of the Gaussian tail
inequality says the expected supremum is at most alpha times square root log M, so if we know
they're at least alpha apart, we get some kind of matching lower bound.
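In symbols, the Sudakov inequality just stated is, roughly: if d(T_i, T_j) \ge \alpha for all i \ne j, then

    \mathbb{E} \sup_{1 \le i \le M} X_{T_i} \;\ge\; \frac{\alpha}{C}\,\sqrt{\log M}

for a universal constant C.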
And from this we actually want to get a slightly stronger corollary. So let me make -- let me
make it down here. Make one definition that's going to be useful. Definition. For some subset
A of our process, let's define G of A to be the -- just the expected supremum of the subprocess
when we look at the variables in A.
Okay. So our expected supremum is just G of T. But in general for some -- sort of some subset
A, we can consider G of A. Okay.
Now here's a corollary. I mean, it's a corollary whose proof requires the Sudakov
inequality. So, again, let's say -- okay. So now we have -- let's start it this way. So under the
same assumptions. So, again, we have -- think about M points. All pairwise [inaudible] large. I
want to get a slightly better lower bound than this, which is of the following form.
I claim that the expected supremum, if we consider -- okay. So this is for some R. R is going to
be a fixed constant. So you can think about if you want R equals 20 definitely works for
everything in the talk. Little R is always going to be some fixed constant.
So the claim is that if we look at the expected supremum of not just looking at the points T1 of
the TN, but let's look at small balls around the points -- I'll draw a picture in a second -- then we
can actually get something slightly better.
So what we can get is -- we get this alpha over C times square root log M. C is some universal constant.
Whenever I write C, it's a universal constant. So this hides some universal constant, this C is a
universal constant. I claim we can get this plus something a little bit more, so we get a
contribution coming from the centers of these balls. I claim that we can also get the minimum
contribution coming from one of the balls.
Okay. So the picture is that we've got our whole space T, you give me these separated points T1,
T2, T3, T4 like this, and now they're separated by alpha. But now I come and I look at even a
much smaller ball around each point, an alpha over R ball around each point.
And the claim that we're making here is that not only do we get this large contribution coming
from one of the variables, you know, X of T1, we can also get some contribution from what's
going on inside these balls.
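In symbols, writing B(t, \rho) for the ball of radius \rho in the canonical metric and G as defined above, the claim is, roughly: if d(T_i, T_j) \ge \alpha for all i \ne j, then

    G\Big( \bigcup_{i \le M} B(T_i, \alpha/r) \Big) \;\ge\; \frac{\alpha}{C}\,\sqrt{\log M} \;+\; \min_{1 \le i \le M} G\big( B(T_i, \alpha/r) \big).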
And let me just sort of sketch the proof of that. So what we saw before is that -- I mean, we can
always make a move like this. So let's just fix some point T naught. I don't care what T naught is. We
can always make a move like this where we say the expected supremum, just because all these
variables are centered -- okay. I mean, this is -- the expected value of this is 0, so we can do this
just fine.
And so the idea is that what this Sudakov inequality says is that sort of -- well, we know that -- is
that one of these -- I'm looking at these sort of -- these values here are like XT1 minus XT0, we
know that one of these should be large. The expected supremum of one of these things will be
large. Okay.
And what we'd like to do is say, well, also I should be able to get some credit for, you know, the
supremum. Now think about the -- think about each TI as being the center of its ball. And then
I should be able to get some credit for these little black arrows as well. So I should not just get
one of these, I should get one of these plus one of these. Okay.
And the reason we take the minimum here is because we don't know which one of these we're
going to get, right? I don't know a priori which one of these variables is going to be big. I just
know that one of them should be big. So I get one of them to be big, and then I want to get sort
of this associated black arrow as well.
Now, of course the problem is that if I condition, for instance, on this one being big, I could
screw up the expectation of this ball.
So first of all let me just assume -- let me assume for the moment that essentially all the time, all
of these balls have the property that the value here, so the supremum here -- okay,
again, what is the supremum here? It's the -- we look at XT minus XT4 over all Ts, all Ts in this
ball of radius alpha over R. Look at all these things.
I claim that we can basically [inaudible] the supremum here is always at least the expectation -- the
expectation over the ball B of TI, alpha over R -- minus, okay, something that looks like some constant C
times alpha over R times square root log of M.
So let's assume I can -- I said all the time, no matter what happens, that all these balls achieve at
least [inaudible] which is the expectation minus this. Then we're done in the following way.
Because what we'll get here is instead of this inequality we'll get this minus C alpha over R
square root log M. But then by choosing R to be a large enough constant, this can be absorbed
into here.
In fact, by choosing here R equals, what, twice C squared, we'll, you know -- by choosing R to
be this, which is just some constant, this gets absorbed into here and you would get a 2 here.
Okay.
So if we could guarantee that all these balls -- these small balls -- always achieve at least the
expectation minus a little bit of loss, we get this inequality. And this is --
>> [inaudible] alpha times square root log M?
>> James Lee: Yeah, it could be much larger. Because it could be many, many more points in
there, right? So the diameter went down, but if the number of points went up by a huge amount,
then it could be larger. And, okay, so if we knew this -- now I claim that this is essentially true.
And the reason this is essentially true, we wrote down last time, is because of the following
concentration inequality, which I want to focus on the proof of the main theorem. So I won't
prove the concentration inequality now, but if someone wants to see it at the end, the proof is not
too difficult just from the classical concentration inequality for -- on the Gaussian measure in
RN.
So here's the concentration inequality, though. If we have a Gaussian process XT, then the claim
is that the probability that the supremum of XT differs from its expectation by more than lambda.
All right. It grows like this. It's exponential [inaudible] factor 2 here minus lambda squared over
something. And here's the important point. This something just depends on the maximum
variance. So the maximum expected XT squared value.
So this is a classical concentration inequality, which is somewhat surprising because it doesn't
depend on the number of points in this process. In fact, this T could be an infinite set. Could be
some kind of continuous set. And still the only thing that matters is the maximum variance.
Okay.
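In symbols, with \sigma^2 = \sup_{t \in T} \mathbb{E}[X_t^2] the maximum variance, the concentration inequality being quoted is

    \Pr\Big[ \big| \sup_{t \in T} X_t - \mathbb{E} \sup_{t \in T} X_t \big| > \lambda \Big] \;\le\; 2\, e^{-\lambda^2 / (2\sigma^2)},

with no dependence on the cardinality of T.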
Okay. So it's -- well, we can get to the proof later. But why does this finish it? Because what's
the -- if we look at any variable of the form XT here in this ball, any variable of this form, well,
we know the ball has radius alpha over R in this metric, which means that the variance of this -- the variance of all of these random variables -- is at most alpha
over R, squared.
I mean, the Euclidean distance is at most alpha over R, which means that the variance is at most
alpha over R, squared. So now, I mean, you plug alpha over R squared in here. If we want sort
of -- we want to take a union bound, we have M events, these M balls, we want to take a union
bound, we should try to get this probability to be about 1 over M. Right. Which means we should
take lambda to be about, what, C times square root log M times the maximum variance. But the
maximum variance is alpha over R -- sorry. Not the variance, because we're squaring that;
times the maximum distance, which is alpha over R.
So the point is that if we take lambda to be this, then we can basically be assured that none of the
balls will deviate by more than this, and that's exactly what we said we were getting in the first
place. So if you just make that slightly more rigorous, then you get the actual statement of the
lemma.
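Spelled out, the calculation being sketched is roughly this: inside each ball the centered variables X_t - X_{T_i} have variance at most (\alpha/r)^2, so applying the concentration inequality to each ball with

    \lambda \;=\; C\,\frac{\alpha}{r}\,\sqrt{\log M}

makes each failure probability a small power of 1/M, and a union bound over the M balls shows that, with high probability, every ball achieves at least its expected supremum minus C(\alpha/r)\sqrt{\log M}, which is then absorbed into the (\alpha/C)\sqrt{\log M} term once r is a large enough constant.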
But basically these balls -- the fluctuation of these balls can be absorbed into the main -- into this
term. All right. So that proves our concentration inequality. I mean, that proves our corollary.
And now actually you can forget everything about Gaussian processes because now that we
have -- this is the only thing -- this is what we're going to use about Gaussian processes.
So in fact let's -- now let me state this theorem. Let me restate it slightly differently, and then
we'll just be able to focus on the proof of this theorem. Okay. So let F be any functional, so just
a real-valued function, on subsets of T. Okay. Think about F as measuring the size of this subset.
The F we're going to use is actually just the expected supremum. But somehow we don't want
to -- we can now divorce ourselves from this -- from thinking about random variables because
we're just going to use this fact. Okay. Such that two properties hold. One property is that this is
a measure of size, so if A is a subset of B, then F of A should be at most F of B. Okay. This is
certainly satisfied for our -- for the expected supremum.
And the second property is just that this holds. But let me just restate it here, and then I won't
erase this for the rest of the talk. The second property is that if we have T1, T2, up to TM in our
set, and the distances between these things are pairwise at least some alpha for all I not equal to J,
then this holds.
So then the functional applied to the union of the balls is at least some constant times alpha
square root log M plus the minimum over the balls. Sorry. Of the functional applied to the balls.
Okay. So we're going to take any functional that satisfies these properties. Certainly this little G
function, which is just the expected supremum, satisfies these properties. And then, okay -- so
then I claim there exists an admissible sequence of partitions, A sub-N, such that we get exactly
1. If we apply the functional to T, then up to some constant factor, this is at least this. Okay.
So the point is for any functional on subsets satisfying this kind of growth inequality, I claim that
there exists an admissible sequence such that this lower bounds F of T -- yeah.
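Putting the whole statement in symbols, roughly: F is a functional on subsets of T such that (i) A \subseteq B implies F(A) \le F(B), and (ii) whenever d(T_i, T_j) \ge \alpha for all i \ne j,

    F\Big( \bigcup_{i \le M} B(T_i, \alpha/r) \Big) \;\ge\; c\,\alpha\,\sqrt{\log M} \;+\; \min_{1 \le i \le M} F\big( B(T_i, \alpha/r) \big)

(c a small universal constant). Then the claim is that there is an admissible sequence (\mathcal{A}_N) with

    F(T) \;\ge\; \frac{1}{C}\, \sup_{t \in T} \sum_{N \ge 0} 2^{N/2}\, \operatorname{diam}\big(A_N(t)\big) \;\ge\; \frac{1}{C}\,\gamma_2(T, d).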
>> Why does G satisfy the property number 1? So let's say I take -- like let's say A is part of the
space where things are getting really crazy, and then B is that -- is A plus like basically like a flat
area? Won't the expected supremum be the smaller?
>> James Lee: The expected supremum of a subset is always less than the expected supremum
of the whole set. I mean --
>> What's a subset?
>> James Lee: Even stronger: the supremum over a subset is always less than the supremum over
the whole set, for any realization. Yeah. This holds trivially for the expected supremum. But now does
everybody see that this finishes the proof? Because now if we just instantiate F with the
expected supremum, then we show there exists admissible sequence, and of course now by
definition this is -- I mean [inaudible] this is at least gamma 2. So that finishes the proof.
So now our entire goal is to show that if you give me F, I can give you this sequence of
partitions, such that F of T is at least this. And a lot of the power from this framework comes in
the fact that sort of this is a very general kind of thing that applies to lots of different kinds of
processes or modifications of this condition apply to lots of different kinds of processes.
All right. So this is our goal. And in fact -- so just remember, in an admissible sequence the Nth
thing has size at most 2 to the 2 to the N. That's all you need to remember from any of this. And now
let's erase this and just concentrate on -- so I hope everybody understands everything we're doing
now has nothing do with probability anymore. It's just about something about metric spaces.
So specifying this partition is actually also going to be not very difficult at all. I'm going to
use -- okay. Let's see what's going to happen. I will specify the partition here and then I'll do the
analysis here. Okay.
>> [inaudible]
>> James Lee: What's that?
>> That was already the [inaudible].
>> James Lee: Not quite that simple. At most 2 to the 2 to the N pieces. All right. So let's start the
partition. Awesome. All right. That's a good first step. There's one thing that's going to happen
is that every set in the partition is also going to have a value, which is going to upper bound the
diameter of the set. So the value for this piece will just be -- and the value will be implicit,
because it's not -- I don't need some notation for it. But the value of this piece is just the
diameter of the set.
Okay. So now -- well, let's suppose that you've come -- you've given me A sub-N, looks some
way, you partition the space in A sub-N. Let's just choose -- I'm going to get the next partition
by taking every piece C in your A sub-N. Okay. Let's blow up C just for the sake of -- okay. Here's
C.
Now I'm going to partition C into 2 to the 2 to the N pieces. So I'll take each of these pieces and
partition them further into 2 to the 2 to the N pieces. And of course if I do that, then the size of AN plus
1 is at most 2 to the 2 to the N times 2 to the 2 to the N, which is 2 to the 2 to the N plus 1. So this
will give an admissible sequence.
So how am I going to partition it. This is the whole -- all right. Let me tell you how to choose
the first piece, and then I'll tell you how to choose all the pieces.
So we choose -- okay. And this set C has some value delta. Remember, this value delta is just
an upper bound of the diameter of C. Actually, it's an upper bound on the radius of C, not the
diameter. In other words, C is contained in some ball of radius delta. But this is not a
[inaudible].
Okay. So now choose T1 in C such that the following quantity is maximized. Pick your
functional. Look at the ball around T1 of radius delta over R squared. Okay. R is some
constant, which is bigger than 20, and such that this is satisfied. Just think about R as a constant.
It is a constant. Okay. Such that this intersected with C is maximum.
Okay. In other words, cut out the biggest piece you can where big is defined by looking at this
small ball around the thing. Okay. Okay. Expect -- well, I said choose T1 such this happens.
Here's the whole trick of the proof. And set C1, which is the first piece of our partition, to be -- and this is the -- this is where all the magic happens. I mean, you won't see the magic now. But
you'll see it soon. And set C1 to be the delta over R ball around T1.
So what we do is the following. We first choose some point T1 which maximizes this amount.
Okay. So what am I looking at to maximize [inaudible]. I'm looking at this delta over R squared
ball around T1. But then once I've chosen T1, I actually cut out the delta over R ball. So I
actually cut out this bigger ball [inaudible] delta over R.
Okay. That's how I choose T1. So this is a delta over R squared ball, this is cutting out the delta
over R ball. All right. And the value of this set will be delta over R, which is of course an upper
bound on its radius because it was cut out. All right.
Okay. So now we just keep going. So in general let's let D sub-L be the amount of space that's
remaining after we've gone L steps, so it will be C minus everything we've cut out so far. Okay.
And we'll choose T sub-L, the next point in D sub-L, to again maximize the same sort of
quantity. It maximizes the delta over R squared ball intersected with what's left.
And finally you put C sub-L equals -- again, we maximize according to the delta over R squared ball
but we cut out the delta over R ball.
>> CL plus 1?
>> James Lee: Yeah. Okay. Good call. Yes. CL plus 1. And probably TL plus 1, if we -- this
is 1, this is 1. All right. Okay. Okay.
So let the [inaudible] okay, now we go to -- we select the next point T2 such that this ball is
large, cut this out, maybe the next point T3 looks like this, but this is -- happens to be pretty
large. We cut this out. And we keep going. But now we want -- we only want to cut out 2 to the
2 to the N pieces. And we might get screwed up and we might -- I mean, we might not exhaust
the space before we get 2 to the 2 to the N pieces.
So except -- so let's -- okay. So I should specify here. Let's let -- let's let M here -- I just want
to -- I don't want to write 2 to the 2 to the N over and over again. So let's let M be 2 to the 2 to
the N. So this M is number of pieces. So we keep going except that -- dot, dot, dot, dot, dot,
except that C sub-M -- okay. I'm not going to write down here. Dot, dot, dot, dot, dot. Except
that C sub-M is actually just going to be D sub-M.
So, in other words, when you get to the end, you've got nothing left to do, so we're cutting, we're
cutting, we're cutting, we went, we finally got to the Mth point, again, it was chosen to maximize
this ball. But now what can we do. We just cut out the whole set. So this is T sub-M. I mean,
this is T sub-M. This is the last set. Okay.
>> [inaudible]
>> James Lee: Good.
>> [inaudible]
>> James Lee: I thought I got it all plus 1 though. Okay. Good. Okay. All right. That's the
whole -- okay. That's it. We're done. Except I have to tell you -- okay. Obviously we can't
reduce the -- the value here is now delta. We didn't reduce the diameter, so the value [inaudible]
delta. All these pieces now have value delta over R, and this piece has value delta.
And that specifies the entire partitioning, because I told you how to break up one piece, now you
could just keep going on and on and on. Okay.
The claim is that this partition satisfies this lower bound. All right. So now we need to get to
the -- all right. That's the whole partitioning. It's really quite simple.
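Just to make the cutting step concrete, here is a minimal sketch of it in code -- purely an illustration, not anything from the talk. The names partition_piece and ball are made up; F is assumed to be given as a callable on sets of points, d as the canonical metric, and r is the constant from the talk (think r = 20).

    def ball(center, radius, points, d):
        # points of the current remaining set within the given radius of center
        return {t for t in points if d(center, t) <= radius}

    def partition_piece(C, delta, n, F, d, r=20):
        # Split a piece C of value delta at level n into at most M = 2**(2**n)
        # pieces: repeatedly pick the point whose small (delta/r**2) ball has the
        # largest F-value among what is left, but cut out the larger (delta/r) ball.
        M = 2 ** (2 ** n)
        remaining = set(C)                       # D_l: what is left after l cuts
        pieces = []
        for _ in range(M - 1):
            if not remaining:
                break
            t = max(remaining,
                    key=lambda s: F(ball(s, delta / r**2, remaining, d)))
            piece = ball(t, delta / r, remaining, d)
            pieces.append((piece, delta / r))    # cut-out balls get value delta/r
            remaining -= piece
        if remaining:
            pieces.append((remaining, delta))    # the leftover piece keeps value delta
        return pieces

Running partition_piece on every piece of A sub-N then produces A sub-N plus 1, with at most 2 to the 2 to the N times 2 to the 2 to the N, which is 2 to the 2 to the N plus 1, pieces in total.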
It's not clear right now, but it was actually chosen -- the partitioning is chosen so as to make this
tree as balanced as possible. Okay. You might not think it's balanced because you say, well,
why would you -- if you look -- you're cutting out the biggest pieces, so you might leave behind
very little. But -- well, you'll see what comes up. Actually this is going to tend to be the biggest
piece. Okay. That's not even true, but let's see what happens. Okay.
So now we're going to go on to the analysis of this partition. So Talagrand's analysis involves
defining five quantities -- I'm not going to do this; I'm just telling you -- it involves finding five
quantities that satisfy seven equations. And then verifying that with every possible choice all
these things are -- are remained satisfied and then summing something at the end, which is -- it's
hard to understand. Okay. This proof is going to be understandable hopefully. Okay.
So let's -- here's the -- here's the idea. We can of course think about this whole partition as a tree.
So looks like it's a tree. And when I draw this tree, I just want to use -- I'm going to use one
convention that the leaves of the tree -- I mean the children go from left to right.
So if we indeed cut out M pieces at this level, the last piece, this giant -- it's a giant sucker over
here -- is the -- is going to be the rightmost piece. Okay.
So okay. So that gives us a tree. And on the nodes of the tree we have -- I mean, we have like
values, so, you know, we have values like delta, delta over R, and so on. Okay. Corresponding
to what's going on. All right.
And what the -- the final thing we can do is there's a natural value to associate to every edge in
the tree. The value of this edge is delta times 2 to the N over 2. Okay. So if this is level N,
which means we're using 2 to the 2 to the N points, the value of this edge is going to be delta
times 2 to the N over 2.
If we do that, okay, then here's what I claim. Then I claim that this quantity we care about,
diameter A sub-N of T, all right, I claim that this is at most -- okay. I know this is a factor of 2,
but it doesn't matter. And I hope nobody gets really upset about this.
So first of all it should be -- I didn't say it, but let's assume that here -- again, it doesn't really
matter, but just for simplicity, let's assume that T is finite. The main thing that we're trying to
prove follows from the finite case just by an easy -- well, at least for separable processes, but -- okay. But just assume that T is finite. So eventually the leaves of this tree are just singletons.
We eventually just get singletons at the end, and we stop.
So the claim is that -- okay. I hope this is clear what it means. We've given every edge in this
tree a value. So we can look at a root leaf path in this tree, and it has some value, which is the
sum of the edge length along the path.
The sum of the edge length along the path is essentially this value. Essentially 2 to the N over 2
times the diameter, except for the fact that we said this is not actually diameter, this is just an
upper bound. So it's an -- so this upper bound is this value. Okay. So now here's the whole
game. [inaudible] told me not to use red, although the red got better.
>> Explain the [inaudible].
>> James Lee: Well, it's 2 to the N over 2, but 2 to the N over 2 is the square root of the log of the number
of points. And square root log, as you see, is an important thing for us, right? So that 2 to the N
over 2 is square root log. Okay.
I'm just really reformulating this bound in terms of this tree. If we think about this tree and we
give the edges this length, then the supremum root leaf path is bigger, is at least this.
So our goal now is to show that F of T is at least the value of any root leaf path. In other words,
my goal is this. You give me root leaf path in the tree, I show -- I prove to you that F of T is at
least that value. That will prove that it's at least a sup, which proves it's at least this. Okay. This
is the whole game.
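In symbols, writing \Delta_N(t) for the value attached to the piece of \mathcal{A}_N containing a leaf t (an upper bound on its radius, hence on half its diameter), the edge leaving that piece has length 2^{N/2}\,\Delta_N(t), so the value of the root-leaf path ending at t is roughly

    \sum_{N} 2^{N/2}\,\Delta_N(t) \;\ge\; \frac{1}{2} \sum_{N} 2^{N/2}\, \operatorname{diam}\big(A_N(t)\big),

and the goal is to show that F(T) is at least a constant fraction of the value of every root-leaf path.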
So now we need two properties. I'll stop using red. We need to make two observations. Let's
keep the -- oh, no, we have the -- we don't need the corollary anymore. We just need to make
two observations about this tree. And then -- let's see. Oh, yeah, we're in good shape. Okay.
Observation number one is that only right turns in this tree matter. So if we want to compute the
value, we only have to look at right turns. Okay. So why is that. Let's see. Okay. So right -- and when I say a right turn, I mean a turn like this which corresponds to having chosen this -- having chosen this -- just everything remaining and keeping the parameter value at delta.
Okay. So why is it -- well, look what happens anytime you make a turn that's not a right turn.
Look what happens here. At this level you've got value 2 to the N over 2. I mean, this is times
delta. Right? What's the value of this. Well, since it's not a right turn, we know that -- okay,
so -- so since -- let's say since this was not a right turn. By -- when I say right turn, I'm referring
to -- look at the board -- the rightmost child.
Since this is not a right turn, if this was delta, this goes down to delta over R, which means the
value I get here is only 2 to the N over 2 times -- sorry. 2 to the N plus 1 over 2 times delta over
R.
Okay. Now, suppose I again don't make a right turn. So let's suppose this wasn't a right turn, it
was another thing here. Well, then the value here is delta over R squared which means that at the
next level the value is going to be 2 to the N plus 2 over 2 times delta over R squared.
Now, R is a number that's bigger than 20. So taking these non-right turns is a geometrically
decreasing sequence as we go. So actually basically if we take a right turn like this and then a
sequence of non-right turns, the value you get along here is just comparable to the value you got
here. So we only need to count right turns.
>> [inaudible] look at one path [inaudible].
>> James Lee: No, no, no. Because you might -- you might venture this way in a tree because
you know that later on you're going to get to take a lot of very nice right turns.
Taking the right -- you might -- you might take a right turn and then realize you have nowhere
else to go, whereas you might want to like venture -- so you can -- optimizing all the way down
so that you take the most expensive right turns. I mean -- okay. So -- but the point is that
considering the value of a root leaf path, I claim we only need to consider the value over right
turns.
Okay. So, in other words, I'm going to just think about weight 0 being on everything except
these edges. That's the first reduction. So let's write down only right turns matter. And the
second property is that in fact if you take a sequence of right turns, only the last one matters.
Because, what, let's look what happens in a sequence of right turns. So these are all right turns,
which means that the delta parameter stays the same every time.
But now what's the value? This is delta times 2 to the N over 2. This is delta times 2 to the N
plus 1 over 2. This is delta times 2 to the N plus 2 over 2. It's a geometrically increasing
sequence so that only -- you know, okay, let's say this one is not a right turn. Only the value of
the last right turn matters. So, in other words, when I compute the value of a root leaf path, since
I'm only trying to get things right up to constants, I only need to have -- I only need to add up the
values for the last right -- for every last right turn in that path.
Again, non-right turns are geometrically decreasing, and if I take a sequence of right turns, it's
dominated by the last one. So this -- okay. So -- and in fact only the last right turn in a
sequence -- in a sequence of right turns matters. All right.
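Both observations are just geometric series, roughly like this. Starting from a piece of value \delta at level N, a run of non-right turns contributes edge lengths

    2^{(N+1)/2}\,\frac{\delta}{r} \;+\; 2^{(N+2)/2}\,\frac{\delta}{r^2} \;+\; \dots \;=\; 2^{N/2}\delta \sum_{k \ge 1} \Big( \frac{\sqrt{2}}{r} \Big)^k,

which for r \ge 20 is a small constant times the preceding edge length 2^{N/2}\delta; while a run of right turns contributes 2^{N/2}\delta + 2^{(N+1)/2}\delta + \dots, a geometrically increasing sum dominated, up to a constant, by its last term.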
With these two things set up -- all right. With these two things set up, let's -- okay. Let's just
continue here. We're ready to do the analysis. Okay. And the analysis is not going to be very
difficult, but here's -- so here it is. Okay.
And I apologize for the name of the following thing. If you know of a better name, like if you
think of something better that one could actually [inaudible] let me know. So this is the snake
poop game. You'll see why it's really the only appropriate name for this.
>> [inaudible]
>> James Lee: Okay. So again we want to prove that F of T is at least the value of any root leaf
path, and we know that to calculate this value we only need to look at the values of the last right
turns along the sequence.
Okay. So here's what we're going to do. If you give me this tree, let me define -- okay. So this
tree had values on just the edges. But I want to put values in the nodes as well, and you'll see
why this happens in a second.
So the value in a node is just -- so every node -- this is a partition tree. Every node corresponds
to a set. So the value on a node is just -- if this is set S is just the functional applied to the set.
That's the value ->> [inaudible]
>> James Lee: For the edges. Not for the nodes. We had to -- it doesn't -- I mean, it doesn't -- oh, you mean these values. These values stick around. The diameter value sticks around. But
now let's call it a reward.
>> [inaudible]
>> James Lee: No, no, it's just that we have diameter values. I agree. But these are different.
So these things still have diameter values, but let's give them -- let's call them rewards. You
want to collect these things. All right. I don't know. Some kind of other value.
And the edges, instead of giving the edges value delta times 2 to the N over 2 -- well, look, I'm not
going to get delta, I'm only going to get -- oh, by the way, we -- I mean, you could swallow it into
the R, but we should put R here just to be clear what's going on. The separation -- we didn't have
R before. Okay. So we can put C. Never mind.
So this C will in general be a small number. It could be 1 over 100. Instead of having the edges
have value 2 to the N over 2 times delta, let me put the edges to have value C times 2 to the N
over 2 times delta, where it's this C. Because I know I'm not going to be able to get this much if
I apply in inequality. I'm only going to be able to get C times this much.
So the value of an edge here is -- if this node sort of had diameter delta, then the value of an
outgoing edge is C times 2 to the N over 2 times delta.
Again, if I get a -- I mean, if I can show that F is at least the value of any root leaf path here,
again, the value just means along the edges, then I just lost a factor of C.
So, in other words, changing their edge values didn't affect much, except for the fact that it's
going to be a little bit helpful.
Okay. So now given any tree with values like this, you can, like with rewards, like think about
having a subset of the tree. So like choose some vertices and some edges. Okay. Now I can
sum up these values. So there's some reward associated with this.
My goal now is to show that F of T is at least -- so you give me some root leaf path, so your root
leaf path goes like this. I would like to show that F of T is at least -- is at least this value, which
is the value of all the right turns in the path. If we can do that we're done. What I want to be able
to do is write down inequality on trees. Like one tree with some markings is greater or equal
than some other tree with markings.
So what I'm going to prove is F of T is at least this. This is my goal. Okay. I'm not going to be
able to prove it. In fact, what I'll prove is that three times F of T is at least this. Okay. And this
is where the whole trick is going to come in. All right. So that's it. So we're going to prove that
three times F of T is at least this. You give me your path. This is what I'm going to prove.
Okay. So we need to start. So let's start somewhere. All right. So now we're at the top of the
tree. And suppose you tell me the first two steps in your path. So your path goes like this. All
right. So now I'm going to start -- I'm going to -- I get to start -- I'm going to spend my three
times F of T. I'm going to spend it in the following way. I'm going to mark this node and this
node and this node. These are the first three steps in your path.
Now I can -- now three times F of T is at least this, because this node has value F of T. And
by the subset property this node has value at most F of T and this one has value at most F of T.
So I can -- so I start the game like this. Now okay. And then the whole idea of the game is that
you're going to reveal to me the next step in your path, and I'm going to have to respond -- I'm
going to have to say that sort of I can choose different rewards such that this tree is greater than
the next tree. Okay. So let's look at an example. Okay.
So this is -- okay. So let's look at this example, first of all, which is a simple one. In this
example you can hear -- what I'm going to observe is that since this is not a rightmost -- this is
not a right edge, so I don't need to take -- I don't need to get this. I don't need to take care of this.
So in this case my move will just be the following move. I'll just go like this. By the subset
property I can make this move. I mean, this node is less -- costs less than this node. So I can
make this move. And I -- this move was easy. I didn't need to get anything because this was not
a rightmost turn. Okay.
>> [inaudible] the value of the tree [inaudible].
>> James Lee: Right. Because I'm going to have a sequence of inequalities. This sort of -- maybe I should do it this way just for this one step. This -- three times this is at least this, and
this is at least -- let's draw the same thing. So this step was easy. This was the step I did here.
Okay. And so this is the easy case. If this top edge of the -- at all times I'm going to have three
colored nodes like this. If this top edge was not a right edge, then actually I don't need to do
anything and I can just make this easy move. This move is easy because this -- this node -- the
value of this is greater [inaudible] the value of this. So this move we can make. This was an
easy case.
Let's look at the -- [inaudible] looks skeptical, so let's look at the -- let's look at the -- everybody
remembers the lessons. All right. Okay. So let's -- so this is not the hard case. The hard case is
when we need to poop. All right. Okay. So the hard case is if the path looks like this, it's the
last -- so -- okay. So there's -- there is a rightmost turn. So our current state -- again, we're
somewhere in the tree. Our current state looks like this. We've marked this, we've marked this,
and we've marked this. Okay. So now our -- again, you're specifying the path to me, and I'm
just making sure I can take care of anything.
Now, in this case -- okay, so now you specify to me the next -- the next -- okay. You want to
make this move. I have to make this move.
>> [inaudible] reward always going to be on the path?
>> James Lee: Yeah. The reward is always going to be on the path, and I'm going to -- every
time that I'm about to leave the last rightmost edge in the sequence, I'm going to have to get
credit for it. I'm going to mark that edge as well so that eventually I end up in this situation,
where all the last rightmost edges are marked.
In this step, this was not a rightmost edge, so I didn't care about marking it. I just kept sliding -- I just kept like --
>> [inaudible]
>> James Lee: Yeah, yeah, yeah. But no, no. But I need to get credit now. I need to -- I need
to move the snake so that the head -- that the head is here. And what I need to -- this is the
pooping part.
>> [inaudible]
>> James Lee: Yeah. I need to get the value of this edge.
>> So what is the value of the [inaudible]? How does F play a role in the reward? You just
collect the reward on --
>> James Lee: The -- every subset of vertices and edges has a reward, which is just the sum of the
values. So far in this picture you didn't see any edges getting a reward. Now I'm going to -- at
the end of the proof, I don't care about the vertices anymore. I just care about the edges that I
marked. But these vertices are going to help me pay for edges. So I initially invest three times F
of T in three vertices. And now as these vertices slide down the tree, they're going to help me
pay for edges.
So here's the important -- here's the -- I mean, this is really the heart of the matter. Basically you
can assume that all the rightmost -- all the last rightmost turns have been paid for inductively,
and now the snake is about to slither past this rightmost turn. We need to pay for it. That's the
pooping part. Because it's the end of the snake. Okay. Look, it still seems like the best analogy.
If you don't like it, come up with a better analogy. But here's the -- okay.
So we need to slide the snake down and also mark this edge but still have it that the next
configuration is at most the cost of this configuration. So how do we do it. Well --
>> [inaudible] configuration?
>> James Lee: It's just the sum of the marked edges and vertices.
>> Marked edges and vertices.
>> James Lee: Yeah. That's the value of the configuration. Right? We're moving from our
initial configuration here to this configuration, always decreasing the value. So at the end we
know that three times F of T is at least this.
>> [inaudible] sum of two edges and three vertices?
>> James Lee: No, no, it will be a sum of three vertices and all the edges that we've encountered
that are rightmost --
>> Last rightmost --
>> James Lee: Last rightmost edges.
>> Okay.
>> James Lee: I mean, the questions are great, because it's not -- I mean, it's still -- I mean, it's
a -- but see -- okay. Yeah. Again, in this case nothing interesting is going on. We can just keep
slithering because we don't need to mark anything. This is where all the action is going to
happen. We need to pay for this last rightmost edge.
So the first case I want to do, because it contains all the ideas, is the case when -- is when the
next place you want to go is not a rightmost edge. Okay. You want to go here. So let's say -- let's look at the values of these nodes. This one is delta. This had diameter delta. This was
rightmost edge, so it stayed at delta. This was not a rightmost edge, so it went to delta over R,
and this is not a rightmost edge, so it went to delta over R squared. Okay.
So -- okay. So let's see what happens. So first of all now I want to apply my inequality here on
these balls. So what do I get. I want to apply the inequality. So from the inequality, first of all, I
get this term, which is if you see -- I get C -- this term is C times alpha times square root log M,
square root log M is 2 to the N over 2. So I get this much. So in fact I'm going to --
>> [inaudible]
>> James Lee: What's that?
>> Alpha and delta --
>> James Lee: Oh, sorry. Yeah. This is -- yeah, I should put delta. I mean, alpha equals delta
in this demonstration. Okay. So I know that the value of this set is at least -- basically I can
make this edge and get rid of this, and I also get -- what do I also get. I also get the minimum of
B of TI, delta over R --
>> [inaudible]
>> James Lee: Okay. Delta over R squared. It's the minimum of the F, I guess. Okay. So you
have to say why is the delta over R squared. Because the separation between these points, if I'm
at delta, the separation between these points is delta over R. So that's why. So this alpha over R
is delta over R squared. So I get this edge value plus I get this. Now, the whole idea is I want to
use this to pay for this. If I can prove that this value is at least this value, then I can put the next
thing here and now I -- and I've marked this edge and I can keep going.
So now why is it the case. So the first thing to observe is that the minimum here actually applies
to this vertex T sub-M. Because the order in which we chose these vertices was in terms of these
balls being decreasing. So this little -- this small ball has more weight than this small ball weight
and this small ball has more weight than this small ball. So this minimum actually just applies -is just -- is actually BTM delta over R squared.
>> F of.
>> James Lee: F of that. Yes. Okay. All right. Okay. So that's the first thing. But this T
sub-M was chosen so that among all the pieces in this set here it had the maximum delta over R
squared value.
Since this node -- where is the -- oh, yeah. Since this node is contained in a ball of radius delta
over R squared, this value -- this F value is bigger than this F value. Because this T sub-M was
chosen so that its delta over R squared was the maximum of everything in this set. So that means
that -- that means that this value, F of this, is at least -- is at least the value here.
In fact, it's at least the value of any delta over R squared ball coming up in this tree. Also any of
the other ones. So that's how you move the -- that's how you move the token and pay for the -- I
mean, the snake moved on and left something behind. That's how you pay for this right turn.
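In symbols, and only roughly (ignoring exactly which sets everything gets intersected with): the centers T_1, \dots, T_M chosen while cutting up a piece of value \delta are \delta/r-separated, since each new center lies outside the \delta/r-balls already removed. So the growth inequality with \alpha = \delta/r and M = 2^{2^N} gives a term

    \frac{c\,\delta}{r}\,2^{N/2} \;+\; \min_{i \le M} F\big( B(T_i, \delta/r^2) \big),

where the first part is what the reweighted edge is worth. The minimum is attained by the last ball B(T_M, \delta/r^2), because the greedy choice makes the F-values of these small balls non-increasing in the order they were picked; and T_M maximized F of its \delta/r^2-ball over everything still uncut, so that ball's value dominates the F-value of the node the snake's head moves onto.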
Okay. And then there's only one more case which is I said, you know -- the other case is what if
this is a right turn. So that's not any conceptually -- any more conceptually difficult. Let's just
do the picture now. It's exactly the same thing. Now the picture is -- okay. We had a right turn
like this.
Then there's a non-right turn because we only need to pay for this if it's the last right turn. But
then you chose to go down a right turn the next time instead of not a right turn. So now I let you
keep going and I'll just -- tell me when you stop making right turns. Okay. So you keep making
right turns for a long time. Okay. Eventually you stop. Good.
So now the idea is I do the same thing. So we started in this configuration. Okay. Again, as
before, I use this to pay for this, plus I get a little bit extra, which is this value. And now I'll just
observe that, I mean, how do the delta values go. This was delta, this is delta, this one is delta
over R. Now it stays delta over R for a long time until finally you make a non-right turn and this
one is delta over R squared.
Well, now the same argument applies. This delta over R squared ball must be bigger in value
than this delta over R squared ball. So, again, we can just move this down to -- down to here.
And of course we need to get the things. But now these can be moved for just -- in the tree we
did before. You can always move these things down the tree to be here and here. Because
moving down the tree only decreases value.
>> [inaudible]
>> James Lee: It was constipated.
>> So if things -- I see. So if it actually ends before you make the right turn, then I guess there's
nothing to pay.
>> James Lee: So you're saying -- you're saying what if we -- what if eventually we just stop at
the last right turn. Okay. So the last -- it's true that the last -- the last right turn doesn't -- we can
always use one of these tokens to pay for the last right turn. I mean, if this -- if this was -- all
right. We just need to pay for this thing, how do we pay for it. Well, just move it here, and
then -- I mean, then you can just pay for it automatically. So you can always pay for the last
right turn.
Okay. So I'll draw the little box. But that's the end of the proof, that the functional is at least the
value of this partition.
And the -- and, again, let's see. We're going to finish in an hour. That's good. So we can try
to -- now that we've seen it we can try to figure out why -- you know, the whole idea was this.
We started with a space, some diameter delta, and then we partition it into pieces of diameter
delta over R. Okay. A bunch of these pieces, diameter delta over R. All right.
Now, okay, so this -- of course this partitioning gave us an upper bound at this level, but the
lower bound has a deficit, right? The lower bound has this deficit that it loses this factor of R.
So the balls that we get in our lower bound here, we don't get these -- we would love if the lower
bound was sort of like all these giant things, but the balls we get from a lower bound only look
like this.
Now, this is a really crappy state to be in if all the edges -- if there was like a ton of interesting
stuff here, because the lower bound would completely miss it. Like we could lose all the space
not contained in these blue dotted balls if we just applied a lower bound to this.
So we have to hope that someone was paying more attention at a higher scale so that if we miss
something in here somebody would have caught it beforehand. But how are we going to ensure
that happens? We do this by look -- I mean, if we want somebody at a higher scale to be paying
attention, then we should be paying attention to what's going on at lower scales. So that's
somehow what this delta over R squared versus delta over R thing is doing.
You optimize so that you make sure you're taking care of the lower scales, but of course you
have to partition -- I mean --
>> [inaudible]
>> James Lee: Where was this used in the proof? This was this -- this was this -- the masterful
step of, you know, the proof when we managed to take this to pay for this next thing down here.
This thing only gives us minimums. We got a maximum, right? We said that this value was
greater than anything that came down here, not just the minimum.
So somehow this was because sort of when we chose this vertex we were looking ahead to make
sure -- at this step sort of there could have been a lot of loss, but we made sure that we covered it
at the next step. It's exactly taking care of this situation. There could be lots of stuff -- but lower
bound at this step is only going to see what's inside the green ball. So there's a lot of -- I mean
blue ball. So there's a lot of stuff that it's missing. We need to hope that if we're missing stuff
there then somebody at an earlier level who sort of had a better viewpoint of what's going on in
the space was taking care of it.
And to do that, I mean, yeah, as I said. Sort of we make sure that we're taking care of the next
scale. So yeah. Pretty beautiful proof [inaudible].
[applause]
>> When is the movie coming out?
>> [inaudible] can be applied to any of it, to other functional [inaudible]?
>> James Lee: Okay. So let me say two things. Oh, aside from the supremum. Somehow in
this field the supremum is the most interesting thing people study. But it has been applied to
nonGaussian processes, like P stable processes, or in general sort of any kind of process where
you have some kind of exponential tail with some power. You can do something similar.
Although instead of having one distance a lot of times you get a family of distances that comes
up.
So let me just say -- I told Jeff I would say something about this. So let me just say why my
selfish motivation for understanding this proof, because this proof has the weird property that
maybe the more natural thing to do is: why do you stop at a bounded number of pieces? Just keep
cutting out delta over R balls until you exhaust this space, and then the next step could cut out
the delta over R squared balls and keep going like this.
Okay. So that's how the original proof was done. But this proof has some nice features that -- I
mean, that come up in analyzing. Let me just say this problem that Talagrand worked on for
quite a long time, which is the Bernoulli conjecture.
As we said in the first talk -- I'll stop in five minutes. As we said in the first talk, we can
consider a Gaussian process in a different way. Just take T to be a subset of L2, so just a subset
of the sequences where the sum of the squares is bounded. And then define your process in the
following way. Okay. Also take an infinite family of -- so these are IID normal 0, 1s. And then
your process is just -- does this. Okay. So for a separable Gaussian process, this is a generic
construction. This gets you anything you want. So the index here is T.
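In symbols, the construction on the board is presumably: for T \subseteq \ell^2, take (g_i) i.i.d. N(0,1) and set

    X_t \;=\; \sum_{i \ge 1} g_i\, t_i \qquad (t \in T),

so that \mathbb{E}(X_s - X_t)^2 = \|s - t\|_2^2, and the canonical metric is just the \ell^2 distance.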
So the question is what if you consider instead of Gaussians here something very natural, which
would be the Bernoulli process, where these things are just IID, you know, uniform plus/minus 1
random variables. And instead of -- so instead of trying [inaudible] controlling the expected sup
for these Gaussians, what if you tried to control the expected sup of these random sums of the
signs.
So -- so okay. So there are two observations that come up there. So, I mean, how did we start
in the Gaussian setting. We started, we came up with a natural upper bound, which is this chaining,
and then we tried to match it. So what's a natural upper bound for the Bernoulli process? Well,
one natural upper bound is that -- I mean, I guess I'll leave this as an exercise. That for some
universal constant, which is at most five, I mean, I think it's square root pi over 2. But I have to
think about it for a second.
One thing you can do just by a convexity argument is observe that the expected [inaudible] for the
Bernoulli is always bounded by some constant times the same thing for the Gaussians. It makes
sense. The Gaussians have tails and the Bernoullis don't, so they tend to be bigger.
So this is one way of bounding the process. So in fact if we -- right. Okay. So let's define -- if
we define sort of B of T in the same way we define G of T, so B of T is the expected supremum
of the sum of the epsilon ITIs, this says that -- this just says that B of T is at most a constant
times G of T. That's one way of getting control on the expected supremum. The Bernoulli
supremum is at most a constant times a Gaussian supremum.
All right. But then there's another way of upper bounding a Bernoulli process that doesn't apply
in the Gaussian setting, which is just this second way of upper bounding it, which is that this is at
most the maximum L1 norm of any vector in the set. Of course, the maximum value of this sum
is if all the signs of the epsilon Is coincide with the signs of the TIs. And then you can upper
bound it by the L1 norm.
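Writing b(T) = \mathbb{E} \sup_{t \in T} \sum_i \varepsilon_i t_i and g(T) = \mathbb{E} \sup_{t \in T} \sum_i g_i t_i, the two upper bounds just described are, roughly,

    b(T) \;\le\; \sqrt{\tfrac{\pi}{2}}\; g(T) \qquad \text{and} \qquad b(T) \;\le\; \sup_{t \in T} \|t\|_1,

the first by comparison with the Gaussian process (conditioning on the signs and using Jensen), the second because |\sum_i \varepsilon_i t_i| \le \sum_i |t_i| pointwise.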
But this doesn't -- I mean, so -- you know, okay. So these are two ways to upper bound it. And
then finally you can combine these two ways together in the following sense. If you put T inside
a set T1 plus T2, so this T1 plus T2, this is the Minkowski sum. So this is the set of all A plus B
such that A is in T1 and B is in T2. If you put T inside a set like this, then it's immediately clear
that you have this.
In particular you can mix the two kinds of bounds together. So you can -- okay. So now up to a
constant you can write them like -- okay. So you can mix the Gaussian and the -- sort of the -- oh, good name -- the L1 upper bounds together according to some decomposition like this.
And Talagrand's conjecture, the Bernoulli conjecture is that this is a universal way of upper
bounding the process. So for every Bernoulli process, so for every T there exists T1 and T2 such
that T is contained in T1 plus T2 and in fact the B of T value is precisely -- I mean, constants
given by what's going on in the Gaussian setting for T1 plus the L1 bound for T2.
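In symbols, the conjecture is roughly: for every T \subseteq \ell^2 there exist T_1, T_2 with

    T \subseteq T_1 + T_2 \qquad \text{and} \qquad g(T_1) \;+\; \sup_{t \in T_2} \|t\|_1 \;\le\; C\, b(T),

so that the mixed upper bound coming from the Minkowski-sum decomposition is always tight up to a universal constant.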
>> What's the example that the GT is not a corresponding [inaudible]?
>> James Lee: Take a family of -- I mean, take your set T to be E1, E2, E3 and so on. So now
in the Gaussian case, the expected supremum is infinite. I mean, this is -- because it's the
supremum of an infinite number of IID Gaussians, but in the Bernoulli case of course the
supremum is 1. I mean, if you sum up one term, you get -- yeah. So in fact they can be
arbitrarily different.
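Concretely, with T = \{e_1, e_2, e_3, \dots\} the standard basis vectors,

    g(T) = \mathbb{E} \sup_i g_i = \infty \qquad \text{while} \qquad b(T) = \mathbb{E} \sup_i \varepsilon_i = 1,

since the supremum of infinitely many independent standard Gaussians is almost surely infinite, while every \varepsilon_i is just plus or minus 1.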
And of course -- I mean, of course this describes the whole heart of the problem, which is that
when I take -- when I take vectors T, which is very spread out, the Bernoulli sum, you know, by
the central limit theorem, tends to behave just as does a Gaussian sum. But if I take things that
are concentrated, then it sort of -- it behaves more like this bound or it can behave more like this
bound.
And now the problem is that the process could be a mixture of these behaviors at all scales going
back and forth and being -- you know, a different way of saying is that this process is rotationally
invariant. So if you rotate the set T, the distribution here doesn't change. Whereas of course I
mean this process is crazily aligned with coordinates. It doesn't have this rotational invariance at
all.
So, anyway, when you consider -- when you consider this process, it seems that the most natural
thing to do is instead of considering one distance you consider a family of distances. What's that
family of distances. You sort of think about truncating these vectors T, so they have bounded L
infinity norm. Once these things have bounded L infinity norm, then you can start to see some
kind of comparison with the Gaussian case.
But now you sort of need to consider all the truncations, you know, you truncate, you don't
truncate, you look what happens when you sort of -- when I truncate, I just mean like sort of cap
out the coordinates. You know, like make the coordinates have some maximum value by just
cutting off the tops of them.
And you can consider it sort of -- it seems that to understand this process you have to consider what
happens as this truncation parameter goes from infinity to 0 and you get this family of distances.
And this setting where you index things by the number of points instead of the distance is much
better when you have many different distances.
Because then you're always making progress. You're getting more and more sets as opposed to
like -- if you have a bunch of distances and you have different distances in every cluster, it's not
clear like -- I mean, is your diameter going down with respect to what distance or whatever. So,
anyways, this was my motivation for understanding Talagrand's new way of proving this.
Okay. That's all.
>> Yuval Peres: One more thing. Maybe you want to spend a minute saying how these -- how the gamma 2, the Talagrand functional, serves to replace the log N [inaudible] theorem.
>> James Lee: [inaudible].
>> [inaudible]
>> James Lee: It's -- I mean, we've already seen it. It's -- you combine chaining with the
[inaudible] we already have.
>> Okay. Let's just use that as a hint for anyone that wants to pursue it. And let's thank James.
[applause]