>> Yuval Peres: Good afternoon. We're restarting a summer tradition that Nati Linial started here years ago of expository talks, and this one also has a specific aim: to prepare us for Ronen Eldan's talk tomorrow. Anyway, happy to welcome James Lee, who will tell us about discrete Ito processes.
>> James R. Lee: I'll tell you about Ito processes, not discrete Ito processes. At a broad level the talk is about correlated sampling, which is what we might call it in computer science, or path coupling, as one might call it in probability: basically a way of computing or describing a coupling between two random variables in some kind of filtered fashion. You slowly sample both random variables, and you are trying to minimize something; in computer science one is often trying to minimize things like communication between two parties, for instance in communication complexity or probabilistically checkable proofs. And as Yuval said, the point is that in the Gaussian setting this correlated sampling is in some sense given by these Ito processes, and there's a whole beautiful theory of Ito processes to analyze them. I want to talk a bit about this today, how you can use it for some geometric applications, and also set the stage for what Ronen will talk about tomorrow. Actually, let me motivate what's going on by starting with what Ronen and I proved. By the way, as Yuval said, this is supposed to be elementary, so if you have any questions please interrupt and ask. Let's start
with a function f on the hypercube, and let's define what is sometimes called the Bonami-Beckner operator T_epsilon. This operator maps f at a point x to the expectation of f over a random point x-hat of the hypercube, where x-hat is described in the following way: x-hat_i is going to be x_i with probability 1 - epsilon, and with probability epsilon it will be an independent uniformly random plus or minus one. For epsilon bigger than 0 you should think about this as some kind of smoothing operator: you average f over small balls, or you can think about averaging over short random walks in the hypercube. Just as a sanity check: if we apply no noise we get our original function back, T_0 f = f, and if we apply the maximum possible noise, then T_1 f is just the constant function equal to the expected value of f under the uniform measure on the cube. One might suspect that when you apply this operator you tend to smooth functions out, because you are doing some kind of averaging. And, in fact, that is the case, so let me remind you about
hypercontractivity, and here I will list some people in alphabetical order. In other words, the listing of their names has no bearing on anything; it's a set. The phenomenon is the following: given such a function f, for every epsilon bigger than 0 and every p bigger than one, there exists a q bigger than p such that the q norm of T_epsilon f is at most the p norm of f. The point here is that q is bigger than p, so when we apply some noise to the function, on average, the function gets smoother. What's that?
>>: [indiscernible]
>> James R. Lee: No. No. What do you mean? No. There is no constant. In fact, it's false in general that if a Markov operator satisfies this with a constant then it satisfies it with constant one. With constant one it's called hypercontractive; with a constant it's called hyperbounded. I think that's the terminology that is used. Anyway, it does hold with constant one for some q bigger than p; q depends only on epsilon and on how big p is, but the point is the function gets smoother. We went from an L^p estimate to an L^q estimate, and if you think about what this means combinatorially, it encodes something like small-set expansion in the hypercube. It encodes something about the isoperimetric profile of the cube, and this is a very useful kind of theorem. Yes?
>>: [indiscernible] correct?
>> James R. Lee: Yes. The point is, none of these quantities depend on f; that's sort of the point. There are epsilon, p, and q such that for every f this holds. What we want to do is prove something in this spirit, but there we had an L^p estimate on f. To state this conjecture of Talagrand we want to make very minimal assumptions on the function f. Let's simply assume that f is nonnegative and has expectation one with respect to the uniform measure. By the way, notice that this hypercontractivity has nothing to do with cancellation: you can replace f with the absolute value of f and the inequality only gets stronger. This is about non-negative functions; it's not about some magical cancellations in the averaging. So let's take an arbitrary nonnegative function, normalized to have expectation one on the cube. What can we say about, for instance, the tail behavior of such a function? Not very much. We have only Markov's inequality: the probability that f is bigger than alpha is at most one over alpha. So what Talagrand conjectured is that if you apply some smoothing via this noise operator, then you actually get tails that are better than Markov's inequality. Here's the conjecture, made by Talagrand in 1989. I'll start a new trend of putting the prize money up here also; it's worth a thousand dollars if you can solve it. The conjecture is the following.
For every amount of noise epsilon that you might want to apply, there exists a function phi, with phi(alpha) tending to 0 as alpha tends to infinity (this is the function that witnesses a tail better than Markov's inequality), such that for every f normalized to have expectation one, the probability that T_epsilon f is bigger than alpha is at most phi(alpha) over alpha. So this phi represents a beating (beating sounds violent), an improvement over Markov's inequality: for every fixed amount of noise, if we average by that noise, then no matter what function we start with we improve the tail.
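To play with statements like this numerically, the noise operator itself is easy to implement exactly on a small cube. This is my own illustration, not from the talk: a minimal Python sketch where f is stored as a dict over {-1,1}^n and T_eps f is computed by exhaustive enumeration, so it is only practical for tiny n.

```python
from itertools import product

def noise_operator(f, eps, n):
    """Return T_eps f as a dict over {-1, 1}^n.

    Each coordinate of x-hat independently keeps x_i with probability
    1 - eps and is resampled uniformly from {-1, 1} with probability eps,
    so x-hat_i agrees with x_i with probability 1 - eps/2.
    """
    keep, flip = 1 - eps / 2, eps / 2
    cube = list(product([-1, 1], repeat=n))
    Tf = {}
    for x in cube:
        total = 0.0
        for y in cube:
            p = 1.0
            for xi, yi in zip(x, y):
                p *= keep if xi == yi else flip
            total += p * f[y]
        Tf[x] = total
    return Tf
```

As a check, T_0 f returns f itself, and T_1 f returns the constant function equal to the mean of f, matching the sanity check from the beginning of the talk.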
This is the conjecture. One thing you might ask is what you could expect here. The optimal thing you could get is phi(alpha) being some constant, depending on epsilon, times one over the square root of log alpha. This is achieved, for instance, if f is just a delta function at a single point: a delta function in the cube, when you apply some noise, looks kind of like a half space. More generally, you can think of this bound as being tight for balls in the hypercube. That's the conjecture. Right now the conjecture is open for every epsilon; it's still open to prove it for any single epsilon, and it's open even if f is the scaled indicator of a set, so that it satisfies this [indiscernible].
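The delta-function example just mentioned can be worked out in a small computation of my own (hypothetical parameters n = 12, eps = 0.3): for f = 2^n times the indicator of a single point, T_eps f depends only on the Hamming distance d to that point, namely T_eps f = 2^n (1 - eps/2)^(n-d) (eps/2)^d, so the tail can be computed exactly and compared with Markov's 1/alpha bound.

```python
from math import comb

def delta_tail(n, eps, alpha):
    """P_x[ T_eps f (x) > alpha ] for f = 2^n * indicator of one point.

    T_eps f at a point at Hamming distance d from the spike equals
    2^n * (1 - eps/2)^(n - d) * (eps/2)^d, so we just count distances.
    """
    keep, flip = 1 - eps / 2, eps / 2
    count = 0
    for d in range(n + 1):
        val = 2 ** n * keep ** (n - d) * flip ** d
        if val > alpha:
            count += comb(n, d)
    return count / 2 ** n
```

Since E[T_eps f] = 1, Markov guarantees the tail is at most 1/alpha; for the noised delta function the tail is in fact much smaller, which is the kind of improvement the conjecture asks for in general.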
The conjecture is still open. I will say, many people here have actually worked on this problem. Ryan visited here for a month years ago and was working on the problem. Jonah Sherman was an intern who spent his whole summer working on the problem. Heal [phonetic] told me that he and Aaul [phonetic] and Jeff also worked on the problem during a visit. Anyway, if you can solve it, it would be worth more than $1000, because you minimize the future cost of Microsoft employees working on the problem. We are not going to talk about this one
today, but what Ronen will prove tomorrow is that the Gaussian version is true. Let me say what the Gaussian version is. I need some objects. One is an n-dimensional Brownian motion B_t. One can state the conjecture easily without Brownian motion, but Brownian motion will be fundamental for what comes next. That's one thing. I'll also take a function f on R^n, nonnegative, and let's assume the function has expectation one. The Brownian motion at time one has the distribution of a standard n-dimensional Gaussian, so, just to be clear and to introduce notation at the same time: gamma_n is the n-dimensional Gaussian measure, and we assume that the expected value of f with respect to gamma_n is one. Then I also want to introduce the heat semigroup P_t, where you start a Brownian motion at x and average the function f over it: P_t f(x) = E[f(x + B_t)]. With this I can state the following theorem. For any t less than one, and say alpha bigger than 2 (that's not so important), the probability that P_{1-t} f(B_t) is bigger than alpha (we will come back to this quantity in a second) is at most some constant depending on t, divided by alpha times (log alpha) to the power 1/6. There's also some log log alpha term, but since the 1/6 is not tight (it should be one half), let's not belabor the log log alpha term. This is the theorem Ronen will prove tomorrow. I'm not going to say much more about it except to notice what's
happening here. What does it look like when we first take a Brownian motion for time t and then average over a Brownian motion for time 1-t? The right picture is: you run a Brownian motion up to some time t, and then you compute the average of f over a Gaussian with variance 1-t centered where you are; or, more geometrically, you average f over the rest of the Brownian motion. You think about the Brownian motion at time one: you sample it up to time t and you average over the future all the way to time one.
>>: [indiscernible]
>> James R. Lee: Yes, B_t starts at 0. Yes?
>>: [indiscernible] you just said because you said…
>> James R. Lee: How far back?
>>: [indiscernible] the future.
>> James R. Lee: The point is, what is this quantity? If t was equal to one, if we could go all the way to time one, then we would just be looking at f(B_1). When t equals one there is no averaging at all; we're just looking at f at a random point of Gaussian space. So we can't improve on Markov for such a function, because you could choose f to be the indicator of a set and then the tail would be exactly achieved at the level of the set. But now what we say is: we don't go all the way to time one. We run the Brownian motion up to time t (you can think of t as close to one if you like), and then we average over the rest of time, average over the future.
>>: Why don't you just [indiscernible] one given?
>> James R. Lee: If you would like, that's another way you could write it. As was pointed out, you can think of this as a Doob martingale: it's the expectation of f(B_1) given the path up to time t. This is the proper analogue of the conjecture on the cube and, in fact, one can see it is a special case: if you can prove the conjecture on the discrete cube, you can prove it in Gaussian space. Just to finish talking about
this, let me mention that Ball, Barthe, Bednorz, Oleszkiewicz, and Wolff proved in 2010 that you can get some bound. This is not as good as it looks: it is some bound, but the constant depends on the dimension, and in fact it depends exponentially on the dimension. The whole point here is that this is a dimension-free phenomenon. Actually, even to prove it in the case n equals one is an exercise, and n equals two is already more than an exercise, but the point is to prove something which is independent of the dimension. That's all I'm going to say about this; Ronen will prove it tomorrow. But now what I want to talk about is this: you could think about the case when f is the indicator of a set, scaled by the measure of the set so that it has expectation one. Now we have some set sitting out there and we
measure of the set so it has expectation one. Now we have some set sitting out here and we
want to prove something about this set s. If you look at what's happening here, you know, to
study this at s we kind of take a Brownian motion for some time and then, you know, and then
average sort of average over the future and see some piece of s under this process. The
problem with doing this process like this is that this theorem is the sort of interesting when the
measure of s is very small. If we just took a Brownian motion and tried to study s by how this
Brownian motion sees s then most of the time, of course, this Brownian motion will wander
away somewhere else off in space and will not see any of s at all. That's representing the fact
that there is this sort of one over alpha here when alpha is going to infinity. Okay. Brownian
motion is nice but it's not quite appropriate to study the geometry of the set s because, you
know, it rarely comes close to s if s is small. What we would want to do is consider a different
process, call it W_t, which is Brownian motion conditioned to have law f at time one. If f is the scaled indicator of a set, this is Brownian motion conditioned to fall in the set. If I condition on falling in the set, then I get rid of all the spurious information coming from the Brownian motion wandering far away from the set. We would like to study this conditioned process. One way you could form such a process is to sample a point of S according to the Gaussian measure and then just take a Brownian bridge to that point. But this is not very useful for studying the geometry of S, because we are essentially writing S as a union of points and then doing something for every point, which is generally not a good way to study the geometry of a set. What we would like to do is use something like a Brownian motion which
approaches S slowly, so that you see what is happening along the way. What you should have in mind is a local process that approaches S; in general, the process wants to be like a Brownian motion, except for the fact that it has to hit S at time one, so the process feels some pain at every step from being pulled towards S. It turns out that by studying the amount of pain this process feels, you can say a lot about the geometry of S. For instance, the pain you are feeling at time one will turn out to be related to the surface area of S, which makes sense, because at time one you are just about to step into the set. That is what we are going to do today: study the geometry of sets by empathizing with the pain of these modified processes. Tomorrow Ronen will do what many bad people have done throughout history, which is: once you understand something, you try to hurt it. Ronen will basically be the sadist tomorrow and see what happens when the process is feeling pain and you push it a little more, see how it responds. But today, we're just being empathetic. Does the general idea make sense? We would like a process that lands in S at time one but gets there in some slow way that examines the geometry of S. If you think about it, with this averaging over the future you are looking at the geometry of S at many different scales: you see it for a while from far away, then from medium distance, and then closer. This is going to turn out to be important. The question is what this means precisely.
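The point that plain Brownian motion rarely sees a small set can be checked in a couple of lines. A hypothetical 1D example of my own: take S = [2.5, infinity), which has Gaussian measure about 0.006, and count how often B_1 (a standard Gaussian) lands in it.

```python
import random

random.seed(1)
a = 2.5                       # S = [a, infinity): Gaussian measure ~ 0.006
n_samples = 50_000
# B_1 is a standard Gaussian; count how often it lands in S.
hits = sum(1 for _ in range(n_samples) if random.gauss(0.0, 1.0) >= a)
frac = hits / n_samples
```

Well under one percent of the Brownian paths ever end in S, which is exactly why the unconditioned process is a poor tool for studying a small set.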
Now let's consider processes that do exactly what I said. Let's try to see how we can build this W. At time 0 it starts where the Brownian motion starts, and then let's see how W changes a little bit at each instant. I want it to change like a Brownian motion plus some drift aligned with this Brownian motion: dW_t = dB_t + v_t dt. The process W_t is going to be of this form. I want the drift to be predictable, so that at time t you know exactly what drift you want to apply to the process. Then we want one more property, and I'll start writing over here because the board is a little low, which is that at time one W has the law of f. So what is this thing at time one? It's the Brownian motion at time one plus the integral of the drift: W_1 = B_1 + the integral of v_t dt from 0 to 1, and I want this last condition, that W_1 is distributed according to the density f; for a set, according to the Gaussian measure conditioned on the set S. This is a description of the process. The point
is, you can come up with many such processes. You could just do nothing, just screw around until very close to time one, and then suddenly jump to a point of the set. But you should think of this v_t as the pain the process is feeling, and we would like to do this in a way that minimizes the amount of pain the process feels along the way. Obviously, if the set is very small, or if f is very far from the Gaussian measure, you will have to feel a substantial amount of pain to get there. Here is one way you can measure pain, in terms of the relative entropy. The relative entropy of f with respect to the Gaussian measure is just the expectation of f log f. This is the definition of relative entropy, and, just to make sure that everyone's sign conventions are in the right place: if f is close to the constant one, then the relative entropy is small, and if f is spiky, in the sense that it's very far from the Gaussian measure in some places, then the relative entropy is large. This is what you should be thinking about.
>>: [indiscernible] measure f [indiscernible] with respect to the Gaussian. [indiscernible] short
form.
>> James R. Lee: Yeah, here I conflated the density with the measure; this is really f d gamma_n. But people do this all the time; I'm in good company making this choice. If you think about it in discrete variables, for instance on the cube with the uniform measure, then this is like the entropy deficit of f. When f is uniform it would have relative entropy 0, because the corresponding distribution has full n bits of entropy, and when f is concentrated on a single point it would have relative entropy n, because the corresponding distribution has no Shannon entropy. The point being that Shannon entropy and relative entropy move in opposite directions. Now let me tell you a
theorem. Various parts of the theorem are due to Föllmer and to Joseph Lehec; I will say that we learned about it from Lehec's work. The theorem is the following. The relative entropy of f with respect to the Gaussian measure is exactly the minimum, over all drifts satisfying star (I didn't put a star on it earlier, but let's call the condition that W_1 is distributed according to the density f "star"), of something beautiful, namely one half times the expectation of the integral from 0 to 1 of the squared norm of v_t. In other words, the relative entropy of f with respect to the Gaussian measure is exactly equal to the minimum amount of energy you have to expend in order to push the Brownian motion to have distribution f at time one. Does the equation make sense? Even more than that, there is an explicit form for the optimal v_t. The optimum is attained, and in fact it is unique, but let's just say an optimizer is given by v_t = gradient of log P_{1-t} f, evaluated at W_t. What does
it say? It says that at every point in time, you should look in the direction in which f is increasing multiplicatively the fastest; except it's not f itself, it's the average value of f over the future whose multiplicative rate of increase you follow. In a second I will tell you about the proof of this theorem, but first: in some sense this formula looks a bit strange, or maybe it doesn't, maybe it seems like the right thing; in any case I want to stress that it is the most natural, obvious thing to do. To see that, let's go back for a second to the discrete cube, and let me tell you the same process for the discrete cube; there you will say, oh, of course, and that motivates this form for v_t. All right. Again,
we have the same setup; all the philosophy is the same. We have a nonnegative function with expectation of f equal to one, and we would like to sample a random variable according to the density f. How can you do it? Here's a very simple way. By analogy, let's sample w according to f, where w is a random string in the discrete cube. You would first sample the first bit according to its marginal distribution. Then sample the second bit according to the marginal distribution on the second bit conditioned on the choice of the first bit, and so on.
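This bit-by-bit scheme can be sketched directly. A minimal Python illustration of my own: `conditional_mean` enumerates all completions of a prefix, so this is exponential in n and only for small examples; the bias at each step is the one derived next in the talk.

```python
import random
from itertools import product
from math import prod

def conditional_mean(f, n, prefix):
    """Average of f over uniform strings whose first len(prefix) bits are fixed."""
    k = len(prefix)
    vals = [f[prefix + tail] for tail in product([-1, 1], repeat=n - k)]
    return sum(vals) / len(vals)

def biases(f, n, w):
    """The step biases v_i along a string w: compare the future average of f
    with bit i set to +1 versus -1, normalized by their average (f > 0)."""
    out = []
    for i in range(n):
        plus = conditional_mean(f, n, w[:i] + (1,))
        minus = conditional_mean(f, n, w[:i] + (-1,))
        out.append((plus - minus) / (plus + minus))
    return out

def sample_from_density(f, n, rng):
    """Sample a string bit by bit; bit i is +1 with probability (1 + v_i)/2."""
    w = ()
    for i in range(n):
        plus = conditional_mean(f, n, w + (1,))
        minus = conditional_mean(f, n, w + (-1,))
        v = (plus - minus) / (plus + minus)
        w += (1,) if rng.random() < (1 + v) / 2 else (-1,)
    return w
```

One can check that the product of (1 + w_i v_i) along any string recovers f(w) exactly, and that the empirical distribution of samples matches the density f over the uniform measure.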
Okay. So let's just compute the biases you would apply at every step. At every step you are flipping a biased coin to decide whether to set the coordinate to 1 or to -1, and you'll see that the biases are exactly the analogues of those drift values. Suppose that we have sampled w_1 through w_{i-1}, the bits of w up to i-1. How do we sample the next bit? Let's define v_i, analogously to what we are doing over there, with B a uniformly random string in the cube: take the expected value of f at a random string whose first i-1 bits are the choices we have made so far and whose ith bit equals 1, subtract the expected value of f at a random string with the same prefix but with the ith bit equal to -1, and divide by the average of these two quantities, which is the expected value if you don't condition on the ith bit at all; then put a factor of one half in front for the formula I want to write down. Then how do we sample the ith bit? The ith bit is going to be 1 with probability (1 + v_i)/2 and -1 with probability (1 - v_i)/2. What we're doing here: if we average over the future, the unconditioned expectation is exactly half of the +1 term plus half of the -1 term, because a uniform choice sets the ith bit to 1 or -1 with equal probability; and what this v_i computes is how much more density there is in the direction of the ith bit being 1 versus -1. It's exactly the marginal probability that the ith bit is 1, conditioned on the choices that
we have made so far. Now, if we define this discrete partial derivative operator, where the partial derivative of g at a point x is the value of g with the ith bit set to 1, minus the value with the ith bit set to -1, divided by 2, then this v_i is exactly a partial derivative. I'm going to use shorthand: f_{i-1} denotes f averaged over the bits that haven't been sampled yet, and v_i is the ith partial derivative of f_{i-1} divided by f_{i-1} itself. You see the formula is exactly the same as the one we put up there: the averaging over the not-yet-sampled bits plays the role of P_{1-t}, averaging over the future, so v_i is exactly the derivative of the log of the future average at the point where we are now. It measures the rate at which we should change, multiplicatively. Sorry. Does it make sense, what's going on here? These v_i's are exactly partial derivatives, so by analogy they shouldn't be surprising anymore. This is just: conditioned on the past, how should I sample the future? That's all I was trying to say.
>>: [indiscernible] the definition you are saying is still okay. It's one-to-one, but is it
[indiscernible] once [indiscernible]
>> James R. Lee: Okay. There's a reason everything becomes nicer; the answer is yes, you can make the same statement. The problem is that in the continuous setting, with Ito processes, all the expressions are much cleaner. Here the expression is actually something like e to the sum of log(1 + w_i v_i), and one doesn't need all the terms of the log, just the first one; I'll say something about that in a second. You can say the same thing in the discrete setting, it's just that the expression will not look as nice. In fact, necessarily, if these v_i's are big, if they are not epsilons, then you will get a different expression, whereas here, since you are feeling the v_t instantaneously, things work out much more nicely. Let me just say one more
thing about the setup, which is that I can tell you the value of f at the string w just by examining what happens along the way. The claim is that f(w) is the product over i of (1 + w_i v_i). The reason is just that if we happen to choose the sign that goes in the direction of v_i, the conditional average of f increases by a factor of 1 + |v_i|, and if we choose the wrong direction, it decreases by the corresponding factor. If one thinks about it for a second, this is exactly... well, both sides are random variables here, and this is exactly what we want. Everything is okay. All right. Actually, let's change the order a bit.
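Before the applications, the optimal drift can be simulated directly in one dimension. This is a sketch of my own under stated assumptions: f is the scaled indicator of the half-line [a, infinity), so P_{1-t} f(x) is proportional to the Gaussian CDF Phi((x - a)/sqrt(1-t)) and the drift v_t = gradient of log P_{1-t} f is the Gaussian hazard ratio phi/Phi divided by sqrt(1-t); the path is discretized with a crude Euler-Maruyama scheme.

```python
import random
from math import erf, exp, pi, sqrt

def drift(x, t, a):
    """v_t(x) = d/dx log P_{1-t} f (x) for f the scaled indicator of
    [a, infinity): P_{1-t} f (x) is proportional to Phi((x - a)/s),
    s = sqrt(1 - t), so the drift is the Gaussian hazard phi/Phi over s."""
    s = sqrt(1 - t)
    z = (x - a) / s
    if z < -8.0:
        hz = -z                 # asymptotic phi/Phi for very negative z
    else:
        hz = (exp(-z * z / 2) / sqrt(2 * pi)) / (0.5 * (1 + erf(z / sqrt(2))))
    return hz / s

def run_path(a, steps, rng):
    """Euler-Maruyama for dW = dB + v_t(W) dt; W_1 should land in [a, inf)."""
    dt = 1.0 / steps
    w = 0.0
    for k in range(steps):
        t = k * dt
        w += rng.gauss(0.0, sqrt(dt)) + drift(w, t, a) * dt
    return w

rng = random.Random(0)
a = 1.0
finals = [run_path(a, 500, rng) for _ in range(1000)]
```

With this drift, nearly all simulated endpoints land at or just above a (up to discretization noise), whereas unconditioned Brownian motion would end beyond a = 1 only about 16% of the time.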
Now let me give you a couple of applications of this theorem, and then I will explain the proof; well, I will prove some part of it. After I give you the applications, we'll see how much energy people have for the proof. The applications encode the usefulness of having this minimum-energy coupling between the measure that samples from f and the Brownian motion. Okay.
>>: [indiscernible] thinking about these gradients, what could be the value of the integral? So I just ask, what is [indiscernible] for v_t: is it the gradient of P_{1-t} f at W_t?
>> James R. Lee: Yeah, you can ask that. It doesn't quite make sense anymore: if you look at what happens in the discrete setting, you'll now get biases that are bigger than 1 or less than -1, which no longer define probabilities. I mean, what will happen? I don't know. But what does it even mean, what will happen?
>>: [indiscernible] some process but it's not clear what it means.
>> James R. Lee: It's not clear. It's not clear how you even interpret this process. What do
you…
>>: I mean you get some [indiscernible] that's well-defined, but it's not going to let you…
>> James R. Lee: Yeah, it's very well defined. If you ask me whether it's better in some sense, I'm not sure, because I don't actually know what it is; certainly you will not end up with law f at the end if you do that. I have no idea where you will end up. But you see, if you are not careful you will overshoot the set, I think. But now I would have to think about it, and since we are…
>>: Maybe it's [indiscernible]
>>: It's adapted, so you won't exactly overshoot, but you could do some kind of oscillatory thing due to this.
>> James R. Lee: It's not clear what will happen at time 1. I guess what you are saying is, as you approach time 1 you are insisting that you hit the set, but now you could be jumping over the set back and forth many times, especially if the set has some crazy boundary, and you lose this smoothing effect. Okay, let's pause on this question for a second. Let me tell
you one fact about this optimal v_t which will be important; if we get to the proof of the theorem, we will see why it holds. This v_t is a martingale. For instance, when you are at time t, the expected direction you will be pointing at time 1 is the direction you are pointing now. All right. Let me give some
applications, and then we can talk about the proof. The first application is the log-Sobolev inequality in Gaussian space, and I should say that both of these applications are due to Lehec; not the conclusions, which are classical, but the use of this theorem to prove them. The log-Sobolev inequality is equivalent, in the discrete cube for instance, to the hypercontractive inequality, so you should think of it as a proxy for the thing we discussed at the beginning. For
this inequality we need one quantity attached to a function, its Fisher information with respect to the Gaussian measure: the expectation, under gamma_n, of the squared norm of the gradient of f divided by f; equivalently, the expectation of f times the squared norm of the gradient of log f. If you think about f as the indicator of a set, a 0-or-1-valued function, then this is really measuring the Gaussian surface area of the set, at least in some analytic sense. That's the definition of Fisher information. And then the log-Sobolev inequality is just the following: the relative entropy of f with respect to the Gaussian measure is at most one half the Fisher information of f. For a set it tells you that if the surface area of the set is not too big, then the set cannot have too small a measure. In general, it relates the global ability of f to be different from the Gaussian measure to a local property of f. Let's now prove the log-Sobolev inequality using this theorem; it's really completely effortless. The relative entropy is exactly one half the integral from 0 to 1 of the expected squared norm of v_t, where v_t is now the optimizer.
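Before continuing, here is a numerical sanity check of my own for the inequality being proved, in one hypothetical dimension: for the exponentially tilted density f(x) = exp(a x - a^2/2), one has Ent(f) = a^2/2 and Fisher information a^2 in closed form, so the log-Sobolev inequality Ent <= (1/2) * Fisher holds with equality.

```python
from math import exp, log, pi, sqrt

def gauss_expect(g, lo=-12.0, hi=12.0, steps=100_000):
    """E[g(X)] for X ~ N(0, 1), by a midpoint Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += g(x) * exp(-x * x / 2)
    return total * h / sqrt(2 * pi)

a = 0.8
f = lambda x: exp(a * x - a * a / 2)                 # E_gamma[f] = 1
ent = gauss_expect(lambda x: f(x) * log(f(x)))       # relative entropy, a^2/2
fisher = gauss_expect(lambda x: (a * f(x)) ** 2 / f(x))  # |f'|^2 / f, a^2
```

The tilt is the equality case, which matches the fact that for it the optimal drift v_t is constant in t, so the martingale inequality in the proof below is tight.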
Now we use the fact that v_t is a martingale: the expected squared norm of v_t is at most the expected squared norm of v_1, so the relative entropy is at most one half the expected squared norm of v_1. Now, what is v_1? v_1 is exactly gradient log f at W_1, that is, the gradient of f at W_1 divided by f(W_1). So what we have is half the expectation of the squared norm of the gradient of f at W_1, divided by f(W_1) squared. But this is not quite what we want: we want the gradient of f at a random Gaussian point, not at the point W_1, which is distributed according to f. Here is the last step of the proof, which you won't be able to understand for a second, and then in the last few minutes I have I will explain it: the factor 1 over f(W_1) is exactly the change of measure that transforms W_1, actually the whole process W_t, into Brownian motion. Unfortunately, you didn't see this yet, which is a bit sad because…
>>: [indiscernible]
>> James R. Lee: What's that?
>>: [indiscernible] the W_1, and f is distributed as f times the Gaussian [indiscernible].
>> James R. Lee: Yeah, yeah. Thank you. No, no, no, of course, of course. Yeah, you should
have said it before I started to do all of this.
>>: You already said it, just [indiscernible]
>>: The formula is just formulated wrong this way? It doesn't need explanation.
>>: Yeah. Remember that one of them [indiscernible]
>> James R. Lee: Yeah. Okay. Good. Yeah, we only, okay. Yeah. Here I wrote something
stronger, which is that it holds for the whole path, but really we only want it at time one: at
time one, this is the change of measure for f, and the point is that, okay.
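[For the record, here is our transcription of the chain of inequalities on the board; the notation is ours. W_t = B_t + the integral of v_s is the drifted process with W_1 distributed according to f dγ_n, and v_t is the martingale drift of the optimizer, with v_1 = ∇log f(W_1).]

```latex
% Log-Sobolev from the entropy representation (transcriber's notation):
% W_t = B_t + \int_0^t v_s\,ds, \quad W_1 \sim f\,d\gamma_n, \quad
% v_t \text{ a martingale with } v_1 = \nabla \log f(W_1).
\begin{align*}
\mathrm{Ent}_{\gamma_n}(f)
  &= \frac12 \int_0^1 \mathbb{E}\,|v_t|^2\,dt
   \;\le\; \frac12\,\mathbb{E}\,|v_1|^2
   \;=\; \frac12\,\mathbb{E}\!\left[\frac{|\nabla f(W_1)|^2}{f(W_1)^2}\right] \\
  &= \frac12 \int \frac{|\nabla f|^2}{f^2}\, f\,d\gamma_n
   \;=\; \frac12 \int \frac{|\nabla f|^2}{f}\, d\gamma_n ,
\end{align*}
% where the last line uses that W_1 has density f with respect to
% \gamma_n (the change of measure just discussed).
```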
>>: [indiscernible] finish it at w?
>> James R. Lee: Yeah. The point is that w1 is distributed according to f times the gamma, so f
of w1 exactly gives you the scaling of the probabilities, so that you map w1 to b1. It's exactly
the change of measure from w1 to b1 by definition. Okay. That was the proof. Sorry if the
beauty was marred by my confusion. Okay. Let me give you one other proof that doesn't even
need this change of measure and is a bit simpler, which is Talagrand's entropy-transport
inequality. Okay. The inequality here is, so first let me write it; I have to remember whether
there is also a factor of two here. Let's say the inequality is the following. Okay. Here is the
inequality, which is that, I'll tell you in a second. The W2 distance from, again, this is the
shorthand for f times gamma n, okay, the W2 between any measure given by density f and the
Gaussian measure, squared, is bounded by twice the relative entropy. Very quickly, what is W2?
W2 is the earth mover's distance between the two distributions. In general, W2 between two
distributions, you can write it as, in [indiscernible] notation, like this: you look at random
variables that are distributed according to mu and nu respectively, and you take a coupling
between them, this is on a product space, and you look at the minimum distance here.
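[As a concrete numerical illustration, this is the transcriber's sketch, not from the talk: in one dimension the optimal W2 coupling simply matches quantiles, so sorting two samples and pairing them by rank estimates the distance.]

```python
import numpy as np

def w2_1d(x, y):
    """Estimate the W2 (quadratic Wasserstein) distance between two
    1-d samples: in one dimension the optimal coupling pairs points
    by rank, i.e. matches quantiles."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
n = 200_000
gaussian = rng.standard_normal(n)        # samples from the standard Gaussian
shifted = rng.standard_normal(n) + 1.0   # a measure obtained by a unit shift

# For N(0,1) vs N(1,1) the true W2 distance is exactly 1.
print(w2_1d(gaussian, shifted))          # ~ 1.0
```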
>>: [indiscernible]
>> James R. Lee: Equal to the minimum over the best coupling between mu and nu. Okay?
You can also just think about it, if you think about them as sort of piles of mass, as the least
Euclidean distance you need to move the pieces of mass, so that you move the Gaussian mass to f.
That's W2. So this says, and this is sort of an information theoretic quantity, that if a
function has small entropy with respect to the Gaussian measure, then you can move the Gaussian
measure to that function without doing much work. Let's do the proof, and I hope, this should be
really easy, so it will be hard to screw this one up. Okay. This is at most the distance between
b1 and w1, simply because this is a particular coupling of the two. This distance is exactly, I
erased the formula, unfortunately, it's exactly the distance when I sample, this is the drift
that I apply to get from b1 to w1. Now we can just use convexity to say this is at most the
integral from 0 to 1 of vt squared, I guess strictly speaking we should put norm squared here,
which is equal to twice the relative entropy by the theorem. Again, if you think about the
discrete case, this gives sort of the most natural coupling, and if you look at the amount you
transport over this coupling, it proves this inequality. All right. Now we are out of time.
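[In our transcription, the proof on the board amounts to the following chain, with the same drifted process as before; the notation is ours:]

```latex
% Talagrand's transport inequality from the coupling (B_1, W_1):
% W_1 = B_1 + \int_0^1 v_t\,dt, \quad W_1 \sim f\,d\gamma_n, \quad B_1 \sim \gamma_n.
\begin{align*}
W_2(f\,d\gamma_n,\, \gamma_n)^2
  &\le \mathbb{E}\,|W_1 - B_1|^2
   \;=\; \mathbb{E}\left|\int_0^1 v_t\,dt\right|^2 \\
  &\le \int_0^1 \mathbb{E}\,|v_t|^2\,dt
   \qquad \text{(convexity / Cauchy--Schwarz)} \\
  &= 2\,\mathrm{Ent}_{\gamma_n}(f).
\end{align*}
```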
>>: You have about seven more minutes.
>> James R. Lee: Okay, so 7 minutes. By the way, that was your chance to object to the 7
minutes. I apologize for that. Now you can ask, how does one prove such a theorem? In order
to make the best use of my time, let me just -- okay. How does one prove such a theorem? You
do exactly, I mean, you just sort of, all right. Let me remind you, because I erased it, what is
wt. It looks like this. The point is, this is what's called an Ito process. It's the Ito
calculus that we are dealing with for these things. Just at the end, let me tell you what the Ito
calculus is. It's very simple, but it's somehow very powerful, because it means that basically
all of the error terms that you wouldn't want to deal with go away in the Gaussian case. Okay.
Here is
probably the best way to understand Ito calculus. Consider the function e to the bt, so the
exponential of Brownian motion. Suppose I want to ask you whether this is a martingale. In
other words, do the infinitesimal changes of this thing have expectation 0? Yeah. If you ever
lost 10 percent in the stock market and then made 10 percent back and looked at your bank
account, you would know that this is not a martingale. Okay. But here's a proof that it is.
Let's just take the derivative with respect to time, so it's like e to the bt times dbt. This is
how one takes derivatives in calculus. Then just take expectation on both sides, and the point
is that this is 0, because dbt is independent of bt in some sense, so this is equal to 0. So
then this thing is a martingale, but of course this proof was BS. So what do we really want to
look at? We are looking at what happens when we take e to the dbt, so we can expand this by a
Taylor series. What do we get? We get 1 plus dbt plus 1/2 dbt squared, and so on. You guys
tell me when I can stop. Okay, good. We get something like this. Now if we take expectation
on both sides, what happens? We get a 1, which is what we expected; if this were a martingale,
then this would have been 1, we would have just multiplied by 1 in expectation, plus 0, right?
Ah, okay. The problem is that the expectation of this last term, when we compute expectations,
is really like half dt. Yeah. Because Brownian motion has, okay, what's the word, sort of
nontrivial quadratic variation, okay? I mean, some people believe this immediately and I don't
know if other people are concerned by this.
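[The claims just made are easy to check numerically. The following sketch is the transcriber's code, not from the talk: it builds Brownian paths from root-dt increments and verifies that the quadratic variation over the interval [0, 1] concentrates at 1, that e to the B_1 has mean e to the 1/2 rather than 1, and that the Ito-corrected exponential, which the talk gets to below, has mean 1.]

```python
import numpy as np

rng = np.random.default_rng(0)
steps, paths = 500, 20_000
dt = 1.0 / steps
# Brownian increments dB_t: Gaussian with mean 0 and variance dt.
db = np.sqrt(dt) * rng.standard_normal((paths, steps))

# Quadratic variation: the sum of dB_t^2 over [0,1] concentrates at t = 1,
# even though the plain sum of increments is only of order 1 per path.
qv = (db ** 2).sum(axis=1)
print(qv.mean())                    # ~ 1.0

# e^{B_1} is NOT a martingale: E[e^{B_1}] = e^{1/2}, not e^{B_0} = 1.
b1 = db.sum(axis=1)
print(np.exp(b1).mean())            # ~ 1.65 = e^{1/2}

# The Ito correction restores the martingale property: E[e^{B_1 - 1/2}] = 1.
print(np.exp(b1 - 0.5).mean())      # ~ 1.0
```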
>>: If the increments are less than order square root of dt, then because of the dependence
the cancellations will just make the Brownian motion constant, so you think about…
>> James R. Lee: The way that you construct Brownian motion is exactly, you know, basically,
by adding plus or minus root dt increments. This is how you construct Brownian motion from
simple random walk. Do you have a better explanation that doesn't require…
>>: [indiscernible]
>>: [indiscernible] [laughter]
>> James R. Lee: Okay. So the point is that when you try to do calculus with respect to
Brownian motion, the terms that would naturally go away, the second order terms, are actually
real things. But everything else goes away, okay? So that was the real Ito. That's Ito
calculus. The Ito calculus is really like, when you do calculus, you know dt squared equals 0,
but here you should think about dbt squared as equal to dt, and then everything else goes away:
dt times dbt equals 0 and dbt cubed equals 0. You only have to worry about the second order
term for Brownian motion. That's what the Ito calculus says, so let me just tell you, for
instance, Ito's lemma, which is that if I want to compute the derivative, say, of a function of
Brownian motion and also of time, then normally what I would do here is compute the derivative
of f with respect to time, times dt, and then the derivative of f with respect to the first
variable, times dbt, but then Ito says we have to go one more step and add half the second
derivative of f in x, times dt. Okay. So that's like the whole of Ito calculus: to compute a
derivative with respect to Brownian motion you have to include a second-order derivative term
for the quadratic variation of the Brownian motion.
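[In symbols, the one-dimensional Ito lemma just described reads, in our transcription:]

```latex
% Ito's lemma, one-dimensional form, using the multiplication table
%   (dt)^2 = 0, \quad dt\,dB_t = 0, \quad (dB_t)^2 = dt, \quad (dB_t)^3 = 0:
\[
  df(t, B_t)
  \;=\; \partial_t f(t,B_t)\,dt
  \;+\; \partial_x f(t,B_t)\,dB_t
  \;+\; \tfrac12\,\partial_x^2 f(t,B_t)\,dt .
\]
```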
Okay. As an exercise, then, you can see exactly what your martingale should be, right? If you
look at e to the bt minus t over 2 and take the derivative, okay, let's just compute what it
should be. This is the derivative with respect to time, so it's minus 1/2 dt, plus dbt, plus
half the second derivative with respect to the first variable, which is just the same thing
again, plus 1/2 dt. Oh, there's a, okay, good. So you see what happens here: this thing is the
martingale term, this thing has expectation zero, and these dt terms cancel and go away. This
is, you know, a legitimate martingale, okay? And now that I am out of time, let me just say
that by using this you can prove the theorem. I can explain it to anyone off-line if you want,
but this is really all you need, and then you can prove the theorem. Okay. So I guess I should
wrap up by saying that Ronen and I know how to, for instance, take these proofs and translate
them to discrete cubes, so you can use exactly this method of proof to prove the log-Sobolev
inequality on the discrete cube. With a little more work you can use exactly this method of
proof to prove the log-Sobolev inequality on the symmetric group, and it's unclear how much
further you can go, but it seems like a very powerful kind of method: do this correlated
sampling and then consider functional inequalities along the path of the sampling. I should say
it's not a new method; Borell was doing these things for at least 20 years, and many other
people, but okay. But I mean, you know, this is so clean that it should have a number of other
applications, and the theory goes directly to the discrete case via the sort of sampling I
talked about. Okay. So let me stop there. [applause]
>> Yuval Peres: Comments or questions?
>> James R. Lee: Yeah?
>>: I think I should have [indiscernible] but [indiscernible] if I move this [indiscernible] I actually
did f like…
>> James R. Lee: Is it trivial to see?
>>: That it could actually be one of the things that [indiscernible]
>> James R. Lee: I mean, if I have to think about it, it was not trivial, so that is sort of an easy
question, but…
>>: [indiscernible] you remember in the discrete case we actually saw this is basically the
condition, so [indiscernible] just think about Brownian motion as many discrete increments and
you will get exactly [indiscernible]
>> James R. Lee: Perhaps another way to see it is that, if you remember, before we wrote down
exactly what is f of w. It was exactly the product, i equals 1 to n, of 1 plus wi vi, where the
vi's were the partial derivatives. If you write down the same formula over there, then you will
see that at time 1 you are distributed according to f. The benefit over there is that when you
write down this formula, because of Ito calculus, if you write this as the expectation of a log,
then you only have to use the first two terms of the log and everything becomes much nicer. But
if you just write down the analog here, then, yeah, you will see that you are distributed
according to f at time one. Or, as Ronen says, you could discretely approximate all the jumps
and then believe it and then transport it here.
>>: It's a much better handler than [indiscernible] the better [indiscernible]
>> James R. Lee: Yeah. That's why I did the discrete case because…
>>: How long does the continuous proof really take?
>> James R. Lee: It doesn't take very long, but, okay, we should do it off-line. If anybody is
interested we can stop and do the proof. The only issue is that, you know, one thing I don't
want to do is calculate derivatives in front of people, because it can be mind numbing, so I
would have to hand wave over one of the derivatives. You do have to use some Ito calculus
somewhere to prove it.
>>: As far as [indiscernible] discrete work like [indiscernible]
>> James R. Lee: This here. What are you optimizing in the discrete world? Unfortunately, I
mean, you can write down a quantity, but it's not nearly as natural, and somehow that's the
benefit of being in the continuous case: there are certain things that are much cleaner here.
The discrete world is tough because here, you notice, to do this sampling there is no additional
randomness; this thing is deterministic. In the discrete world this was just a bias, and then
you need additional randomness to choose the bits. This makes things much uglier somehow. You
know, you could…
>>: [indiscernible] is that what you are [indiscernible] is that what you are making?
[indiscernible] I see you say the second time will be some [indiscernible]
>> James R. Lee: Yes. There is some, well, okay. It's hard to say when duality is not
happening because, you know, I've seen so many times that it was happening, but okay, I don't
know off the top of my head. There is a duality here because, you know, there is a duality just
in the notion of relative entropy, but I don't see how in the discrete case it would tell you
the answer immediately.
>> Yuval Peres: Okay. So we will continue this off-line so let's thank the speaker again.
[applause]