>> Yuval Peres: Good afternoon. We're restarting a summer tradition that Nati Linial started here years ago of expository talks, and this one also has a specific aim: to prepare us for Ronen Eldan's talk tomorrow. Anyway, happy to welcome James Lee, who will tell us about discrete Ito processes.
>> James R. Lee: I'll tell you about Ito processes, not discrete Ito processes. At a broad level the talk is about correlated sampling, which is what we might call it in computer science, or path coupling, as one might call it in probability: basically a way of computing or describing a coupling between two random variables in some kind of filtered fashion. You slowly sample both random variables, and you are trying to minimize something; in computer science one is often trying to minimize things like communication between two parties, for instance in communication complexity or probabilistically checkable proofs. And as Yuval said, the point is that in the Gaussian setting this correlated sampling is in some sense given by these Ito processes, and there's a whole beautiful theory of Ito processes to analyze them. I want to talk a bit about this today, how you can use it for some geometric applications, and also set the stage for what Ronen will talk about tomorrow. Actually, let me motivate what's going on by starting with what Ronen and I proved. By the way, as Yuval said, this is supposed to be elementary, so if you have any questions please interrupt and ask. Let's start
with a function f on the hypercube, and let's define what is sometimes called the Bonami-Beckner operator T_epsilon. This operator maps f at a point x to the expectation of f over a random point x-hat of the hypercube, where x-hat is described in the following way: x-hat_i is going to be x_i with probability 1 - epsilon, and with probability epsilon it will be an independent uniformly random plus or minus one. For epsilon bigger than 0 you should think about this as some kind of smoothing operator: you average f over small balls, or you can think about averaging over short random walks in the hypercube. Just as a sanity check: if we apply no noise we get our original function back, T_0 f = f, and if we apply the maximum possible noise, then T_1 f is just the constant function equal to the expected value of f under the uniform measure on the cube. One might suspect that when you apply this operator you tend to smooth functions out, because you are doing some kind of averaging. And, in fact, that is the case, so let me remind you about
hypercontractivity, and here I will list some people in alphabetical order. In other words, the listing of their names has no bearing on anything; it's a set. The phenomenon is the following: given such a function f, for every epsilon bigger than 0 and every p bigger than one, there exists a q bigger than p such that the q norm of T_epsilon f is at most the p norm of f. The point here is that q is bigger than p, so when we apply some noise to the function, on average, the function gets smoother. What's that?
>>: [indiscernible]
>> James R. Lee: No. No. What do you mean? No. There is no constant. In fact, it's false in general that if a Markov operator satisfies this with a constant then it satisfies it with constant one. With constant one it's called hypercontractive; with a constant it's called hyperbounded. I think that's the terminology that is used. Anyway, it does hold with constant one for some q bigger than p; q depends only on epsilon and on how big p is, but the point is the function gets smoother. We went from an L^p estimate to an L^q estimate, and if you think about what this means combinatorially, it encodes something like small-set expansion in the hypercube. It encodes something about the isoperimetric profile of the cube, and this is a very useful kind of theorem. Yes?
>>: [indiscernible] correct?
>> James R. Lee: Yes. The point is, none of these quantities depend on f; that's sort of the point. There are epsilon, p, and q such that for every f this holds. What we want to do is prove something in this spirit, but there we had an L^p estimate on f. To state this conjecture of Talagrand we want to make very minimal assumptions on the function f. Let's simply assume that f is nonnegative and has expectation one with respect to the uniform measure. By the way, notice that this hypercontractivity has nothing to do with cancellation: you can replace f with the absolute value of f and the inequality only gets stronger. This is about non-negative functions; it's not about some magical cancellations in the averaging. So let's take an arbitrary nonnegative function, normalized to have expectation one on the cube. What can we say about, for instance, the tail behavior of such a function? Not very much. We have only Markov's inequality: the probability that f is bigger than alpha is at most one over alpha. So what Talagrand conjectured is that if you apply some smoothing via this noise operator, then you actually get tails that are better than Markov's inequality. Here's the conjecture, made by Talagrand in 1989. I'll start a new trend of putting the prize money up here also; it's worth a thousand dollars if you can solve it. The conjecture is the following.
For every amount of noise epsilon that you might want to apply, there exists a function phi, with phi(alpha) tending to 0 as alpha tends to infinity (this is the function that witnesses a tail better than Markov's inequality), such that for every f normalized to have expectation one, the probability that T_epsilon f is bigger than alpha is at most phi(alpha) over alpha. So this phi represents a beating (beating sounds violent), an improvement over Markov's inequality: for every fixed amount of noise, if we average by that noise, then no matter what function we start with we improve the tail.
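To play with statements like this numerically, the noise operator itself is easy to implement exactly on a small cube. This is my own illustration, not from the talk: a minimal Python sketch where f is stored as a dict over {-1,1}^n and T_eps f is computed by exhaustive enumeration, so it is only practical for tiny n.

```python
from itertools import product

def noise_operator(f, eps, n):
    """Return T_eps f as a dict over {-1, 1}^n.

    Each coordinate of x-hat independently keeps x_i with probability
    1 - eps and is resampled uniformly from {-1, 1} with probability eps,
    so x-hat_i agrees with x_i with probability 1 - eps/2.
    """
    keep, flip = 1 - eps / 2, eps / 2
    cube = list(product([-1, 1], repeat=n))
    Tf = {}
    for x in cube:
        total = 0.0
        for y in cube:
            p = 1.0
            for xi, yi in zip(x, y):
                p *= keep if xi == yi else flip
            total += p * f[y]
        Tf[x] = total
    return Tf
```

As a check, T_0 f returns f itself, and T_1 f returns the constant function equal to the mean of f, matching the sanity check from the beginning of the talk.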
This is the conjecture. One thing you might ask is what you could expect here. The optimal thing you could get is phi(alpha) being some constant, depending on epsilon, times one over the square root of log alpha. This is achieved, for instance, if f is just a delta function at a single point: a delta function in the cube, when you apply some noise, looks kind of like a half space. More generally, you can think of this bound as being tight for balls in the hypercube. That's the conjecture. Right now the conjecture is open for every epsilon; it's still open to prove it for any single epsilon, and it's open even if f is the scaled indicator of a set, so that it satisfies this [indiscernible].
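The delta-function example just mentioned can be worked out in a small computation of my own (hypothetical parameters n = 12, eps = 0.3): for f = 2^n times the indicator of a single point, T_eps f depends only on the Hamming distance d to that point, namely T_eps f = 2^n (1 - eps/2)^(n-d) (eps/2)^d, so the tail can be computed exactly and compared with Markov's 1/alpha bound.

```python
from math import comb

def delta_tail(n, eps, alpha):
    """P_x[ T_eps f (x) > alpha ] for f = 2^n * indicator of one point.

    T_eps f at a point at Hamming distance d from the spike equals
    2^n * (1 - eps/2)^(n - d) * (eps/2)^d, so we just count distances.
    """
    keep, flip = 1 - eps / 2, eps / 2
    count = 0
    for d in range(n + 1):
        val = 2 ** n * keep ** (n - d) * flip ** d
        if val > alpha:
            count += comb(n, d)
    return count / 2 ** n
```

Since E[T_eps f] = 1, Markov guarantees the tail is at most 1/alpha; for the noised delta function the tail is in fact much smaller, which is the kind of improvement the conjecture asks for in general.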
The conjecture is still open. I will say, many people here have actually worked on this problem. Ryan visited here for a month years ago and was working on the problem. Jonah Sherman was an intern who spent his whole summer working on the problem. Heal [phonetic] told me that he and Aaul [phonetic] and Jeff also worked on the problem during a visit. Anyway, if you can solve it, it would be worth more than $1000, because you minimize the future cost of Microsoft employees working on the problem. We are not going to talk about this one
today, but what Ronen will prove tomorrow is that the Gaussian version is true. Let me say what the Gaussian version is. I need some objects. One is an n-dimensional Brownian motion B_t. One can state the conjecture easily without Brownian motion, but Brownian motion will be fundamental for what comes next. That's one thing. I'll also take a function f on R^n, nonnegative, and let's assume the function has expectation one. The Brownian motion at time one has the distribution of a standard n-dimensional Gaussian, so, just to be clear and to introduce notation at the same time: gamma_n is the n-dimensional Gaussian measure, and we assume that the expected value of f with respect to gamma_n is one. Then I also want to introduce the heat semigroup P_t, where you start a Brownian motion at x and average the function f over it: P_t f(x) = E[f(x + B_t)]. With this I can state the following theorem. For any t less than one, and say alpha bigger than 2 (that's not so important), the probability that P_{1-t} f(B_t) is bigger than alpha (we will come back to this quantity in a second) is at most some constant depending on t, divided by alpha times (log alpha) to the power 1/6. There's also some log log alpha term, but since the 1/6 is not tight (it should be one half), let's not belabor the log log alpha term. This is the theorem Ronen will prove tomorrow. I'm not going to say much more about it except to notice what's
happening here. What does it look like when we first take a Brownian motion for time t and then average over a Brownian motion for time 1-t? The right picture is: you run a Brownian motion up to some time t, and then you compute the average of f over a Gaussian with variance 1-t centered where you are; or, more geometrically, you average f over the rest of the Brownian motion. You think about the Brownian motion at time one: you sample it up to time t and you average over the future all the way to time one.
>>: [indiscernible]
>> James R. Lee: Yes, B_t starts at 0. Yes?
>>: [indiscernible] you just said because you said…
>> James R. Lee: How far back?
>>: [indiscernible] the future.
>> James R. Lee: The point is, what is this quantity? If t was equal to one, if we could go all the way to time one, then we would just be looking at f(B_1). When t equals one there is no averaging at all; we're just looking at f at a random point of Gaussian space. So we can't improve on Markov for such a function, because you could choose f to be the indicator of a set and then the tail would be exactly achieved at the level of the set. But now what we say is: we don't go all the way to time one. We run the Brownian motion up to time t (you can think of t as close to one if you like), and then we average over the rest of time, average over the future.
>>: Why don't you just [indiscernible] one given?
>> James R. Lee: If you would like, that's another way you could write it. As was pointed out, you can think of this as a Doob martingale: it's the expectation of f(B_1) given the path up to time t. This is the proper analogue of the conjecture on the cube and, in fact, one can see it is a special case: if you can prove the conjecture on the discrete cube, you can prove it in Gaussian space. Just to finish talking about
this, let me mention that Ball, Barthe, Bednorz, Oleszkiewicz, and Wolff proved in 2010 that you can get some bound. This is not as good as it looks: it is some bound, but the constant depends on the dimension, and in fact it depends exponentially on the dimension. The whole point here is that this is a dimension-free phenomenon. Actually, even to prove it in the case n equals one is an exercise, and n equals two is already more than an exercise, but the point is to prove something which is independent of the dimension. That's all I'm going to say about this; Ronen will prove it tomorrow. But now what I want to talk about is this: you could think about the case when f is the indicator of a set, scaled by the measure of the set so that it has expectation one. Now we have some set sitting out there and we
measure of the set so it has expectation one. Now we have some set sitting out here and we
want to prove something about this set s. If you look at what's happening here, you know, to
study this at s we kind of take a Brownian motion for some time and then, you know, and then
average sort of average over the future and see some piece of s under this process. The
problem with doing this process like this is that this theorem is the sort of interesting when the
measure of s is very small. If we just took a Brownian motion and tried to study s by how this
Brownian motion sees s then most of the time, of course, this Brownian motion will wander
away somewhere else off in space and will not see any of s at all. That's representing the fact
that there is this sort of one over alpha here when alpha is going to infinity. Okay. Brownian
motion is nice but it's not quite appropriate to study the geometry of the set s because, you
know, it rarely comes close to s if s is small. What we would want to do is consider a different
process, call it W_t, which is Brownian motion conditioned to have law f at time one. If f is the scaled indicator of a set, this is Brownian motion conditioned to fall in the set. If I condition on falling in the set, then I get rid of all the spurious information coming from the Brownian motion wandering far away from the set. We would like to study this conditioned process. One way you could form such a process is to sample a point of S according to the Gaussian measure and then just take a Brownian bridge to that point. But this is not very useful for studying the geometry of S, because we are essentially writing S as a union of points and then doing something for every point, which is generally not a good way to study the geometry of a set. What we would like to do is use something like a Brownian motion which
approaches S slowly, so that you see what is happening along the way. What you should have in mind is a local process that approaches S; in general, the process wants to be like a Brownian motion, except for the fact that it has to hit S at time one, so the process feels some pain at every step from being pulled towards S. It turns out that by studying the amount of pain this process feels, you can say a lot about the geometry of S. For instance, the pain you are feeling at time one will turn out to be related to the surface area of S, which makes sense, because at time one you are just about to step into the set. That is what we are going to do today: study the geometry of sets by empathizing with the pain of these modified processes. Tomorrow Ronen will do what many bad people have done throughout history, which is: once you understand something, you try to hurt it. Ronen will basically be the sadist tomorrow and see what happens when the process is feeling pain and you push it a little more, see how it responds. But today, we're just being empathetic. Does the general idea make sense? We would like a process that lands in S at time one but gets there in some slow way that examines the geometry of S. If you think about it, with this averaging over the future you are looking at the geometry of S at many different scales: you see it for a while from far away, then from medium distance, and then closer. This is going to turn out to be important. The question is what this means precisely.
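The point that plain Brownian motion rarely sees a small set can be checked in a couple of lines. A hypothetical 1D example of my own: take S = [2.5, infinity), which has Gaussian measure about 0.006, and count how often B_1 (a standard Gaussian) lands in it.

```python
import random

random.seed(1)
a = 2.5                       # S = [a, infinity): Gaussian measure ~ 0.006
n_samples = 50_000
# B_1 is a standard Gaussian; count how often it lands in S.
hits = sum(1 for _ in range(n_samples) if random.gauss(0.0, 1.0) >= a)
frac = hits / n_samples
```

Well under one percent of the Brownian paths ever end in S, which is exactly why the unconditioned process is a poor tool for studying a small set.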
Now let's consider processes that do exactly what I said. Let's try to see how we can build this W. At time 0 it starts where the Brownian motion starts, and then let's see how W changes a little bit at each instant. I want it to change like a Brownian motion plus some drift aligned with this Brownian motion: dW_t = dB_t + v_t dt. The process W_t is going to be of this form. I want the drift to be predictable, so that at time t you know exactly what drift you want to apply to the process. Then we want one more property, and I'll start writing over here because the board is a little low, which is that at time one W has the law of f. So what is this thing at time one? It's the Brownian motion at time one plus the integral of the drift: W_1 = B_1 + the integral of v_t dt from 0 to 1, and I want this last condition, that W_1 is distributed according to the density f; for a set, according to the Gaussian measure conditioned on the set S. This is a description of the process. The point
is, you can come up with many such processes. You could just do nothing, just screw around until very close to time one, and then suddenly jump to a point of the set. But you should think of this v_t as the pain the process is feeling, and we would like to do this in a way that minimizes the amount of pain the process feels along the way. Obviously, if the set is very small, or if f is very far from the Gaussian measure, you will have to feel a substantial amount of pain to get there. Here is one way you can measure pain, in terms of the relative entropy. The relative entropy of f with respect to the Gaussian measure is just the expectation of f log f. This is the definition of relative entropy, and, just to make sure that everyone's sign conventions are in the right place: if f is close to the constant one, then the relative entropy is small, and if f is spiky, in the sense that it's very far from the Gaussian measure in some places, then the relative entropy is large. This is what you should be thinking about.
>>: [indiscernible] measure f [indiscernible] with respect to the Gaussian. [indiscernible] short
form.
>> James R. Lee: Yeah, here I conflated the density with the measure; this is really f d gamma_n. But people do this all the time; I'm in good company making this choice. If you think about it in discrete variables, for instance on the cube with the uniform measure, then this is like the entropy deficit of f. When f is uniform it would have relative entropy 0, because the corresponding distribution has full n bits of entropy, and when f is concentrated on a single point it would have relative entropy n, because the corresponding distribution has no Shannon entropy. The point being that Shannon entropy and relative entropy move in opposite directions. Now let me tell you a
theorem. Various parts of the theorem are due to Föllmer and to Joseph Lehec; I will say that we learned about it from Lehec's work. The theorem is the following. The relative entropy of f with respect to the Gaussian measure is exactly the minimum, over all drifts satisfying star (I didn't put a star on it earlier, but let's call the condition that W_1 is distributed according to the density f "star"), of something beautiful, namely one half times the expectation of the integral from 0 to 1 of the squared norm of v_t. In other words, the relative entropy of f with respect to the Gaussian measure is exactly equal to the minimum amount of energy you have to expend in order to push the Brownian motion to have distribution f at time one. Does the equation make sense? Even more than that, there is an explicit form for the optimal v_t. The optimum is attained, and in fact it is unique, but let's just say an optimizer is given by v_t = gradient of log P_{1-t} f, evaluated at W_t. What does
it say? It says that at every point in time, you should look in the direction in which f is increasing multiplicatively the fastest; except it's not f itself, it's the average value of f over the future whose multiplicative rate of increase you follow. In a second I will tell you about the proof of this theorem, but first: in some sense this formula looks a bit strange, or maybe it doesn't, maybe it seems like the right thing; in any case I want to stress that it is the most natural, obvious thing to do. To see that, let's go back for a second to the discrete cube, and let me tell you the same process for the discrete cube; there you will say, oh, of course, and that motivates this form for v_t. All right. Again,
we have the same setup; all the philosophy is the same. We have a nonnegative function with expectation of f equal to one, and we would like to sample a random variable according to the density f. How can you do it? Here's a very simple way. By analogy, let's sample w according to f, where w is a random string in the discrete cube. You would first sample the first bit according to its marginal distribution. Then sample the second bit according to the marginal distribution on the second bit conditioned on the choice of the first bit, and so on.
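This bit-by-bit scheme can be sketched directly. A minimal Python illustration of my own: `conditional_mean` enumerates all completions of a prefix, so this is exponential in n and only for small examples; the bias at each step is the one derived next in the talk.

```python
import random
from itertools import product
from math import prod

def conditional_mean(f, n, prefix):
    """Average of f over uniform strings whose first len(prefix) bits are fixed."""
    k = len(prefix)
    vals = [f[prefix + tail] for tail in product([-1, 1], repeat=n - k)]
    return sum(vals) / len(vals)

def biases(f, n, w):
    """The step biases v_i along a string w: compare the future average of f
    with bit i set to +1 versus -1, normalized by their average (f > 0)."""
    out = []
    for i in range(n):
        plus = conditional_mean(f, n, w[:i] + (1,))
        minus = conditional_mean(f, n, w[:i] + (-1,))
        out.append((plus - minus) / (plus + minus))
    return out

def sample_from_density(f, n, rng):
    """Sample a string bit by bit; bit i is +1 with probability (1 + v_i)/2."""
    w = ()
    for i in range(n):
        plus = conditional_mean(f, n, w + (1,))
        minus = conditional_mean(f, n, w + (-1,))
        v = (plus - minus) / (plus + minus)
        w += (1,) if rng.random() < (1 + v) / 2 else (-1,)
    return w
```

One can check that the product of (1 + w_i v_i) along any string recovers f(w) exactly, and that the empirical distribution of samples matches the density f over the uniform measure.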
Okay. So let's just compute the biases you would apply at every step. At every step you are flipping a biased coin to decide whether to set the coordinate to 1 or to -1, and you'll see that the biases are exactly the analogues of those drift values. Suppose that we have sampled w_1 through w_{i-1}, the bits of w up to i-1. How do we sample the next bit? Let's define v_i, analogously to what we are doing over there, with B a uniformly random string in the cube: take the expected value of f at a random string whose first i-1 bits are the choices we have made so far and whose ith bit equals 1, subtract the expected value of f at a random string with the same prefix but with the ith bit equal to -1, and divide by the average of these two quantities, which is the expected value if you don't condition on the ith bit at all; then put a factor of one half in front for the formula I want to write down. Then how do we sample the ith bit? The ith bit is going to be 1 with probability (1 + v_i)/2 and -1 with probability (1 - v_i)/2. What we're doing here: if we average over the future, the unconditioned expectation is exactly half of the +1 term plus half of the -1 term, because a uniform choice sets the ith bit to 1 or -1 with equal probability; and what this v_i computes is how much more density there is in the direction of the ith bit being 1 versus -1. It's exactly the marginal probability that the ith bit is 1, conditioned on the choices that
we have made so far. Now, if we define this discrete partial derivative operator, where the partial derivative of g at a point x is the value of g with the ith bit set to 1, minus the value with the ith bit set to -1, divided by 2, then this v_i is exactly a partial derivative. I'm going to use shorthand: f_{i-1} denotes f averaged over the bits that haven't been sampled yet, and v_i is the ith partial derivative of f_{i-1} divided by f_{i-1} itself. You see the formula is exactly the same as the one we put up there: the averaging over the not-yet-sampled bits plays the role of P_{1-t}, averaging over the future, so v_i is exactly the derivative of the log of the future average at the point where we are now. It measures the rate at which we should change, multiplicatively. Sorry. Does it make sense, what's going on here? These v_i's are exactly partial derivatives, so by analogy they shouldn't be surprising anymore. This is just: conditioned on the past, how should I sample the future? That's all I was trying to say.
>>: [indiscernible] the definition you are saying is still okay. It's one-to-one, but is it
[indiscernible] once [indiscernible]
>> James R. Lee: Okay. There's a reason everything becomes nicer; the answer is yes, you can make the same statement. The problem is that in the continuous setting, with Ito processes, all the expressions are much cleaner. Here the expression is actually something like e to the sum of log(1 + w_i v_i), and one doesn't need all the terms of the log, just the first one; I'll say something about that in a second. You can say the same thing in the discrete setting, it's just that the expression will not look as nice. In fact, necessarily, if these v_i's are big, if they are not epsilons, then you will get a different expression, whereas here, since you are feeling the v_t instantaneously, things work out much more nicely. Let me just say one more
thing about the setup, which is that I can tell you the value of f at the string w just by examining what happens along the way. The claim is that f(w) is the product over i of (1 + w_i v_i). The reason is just that if we happen to choose the sign that goes in the direction of v_i, the conditional average of f increases by a factor of 1 + |v_i|, and if we choose the wrong direction, it decreases by the corresponding factor. If one thinks about it for a second, this is exactly... well, both sides are random variables here, and this is exactly what we want. Everything is okay. All right. Actually, let's change the order a bit.
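Before the applications, the optimal drift can be simulated directly in one dimension. This is a sketch of my own under stated assumptions: f is the scaled indicator of the half-line [a, infinity), so P_{1-t} f(x) is proportional to the Gaussian CDF Phi((x - a)/sqrt(1-t)) and the drift v_t = gradient of log P_{1-t} f is the Gaussian hazard ratio phi/Phi divided by sqrt(1-t); the path is discretized with a crude Euler-Maruyama scheme.

```python
import random
from math import erf, exp, pi, sqrt

def drift(x, t, a):
    """v_t(x) = d/dx log P_{1-t} f (x) for f the scaled indicator of
    [a, infinity): P_{1-t} f (x) is proportional to Phi((x - a)/s),
    s = sqrt(1 - t), so the drift is the Gaussian hazard phi/Phi over s."""
    s = sqrt(1 - t)
    z = (x - a) / s
    if z < -8.0:
        hz = -z                 # asymptotic phi/Phi for very negative z
    else:
        hz = (exp(-z * z / 2) / sqrt(2 * pi)) / (0.5 * (1 + erf(z / sqrt(2))))
    return hz / s

def run_path(a, steps, rng):
    """Euler-Maruyama for dW = dB + v_t(W) dt; W_1 should land in [a, inf)."""
    dt = 1.0 / steps
    w = 0.0
    for k in range(steps):
        t = k * dt
        w += rng.gauss(0.0, sqrt(dt)) + drift(w, t, a) * dt
    return w

rng = random.Random(0)
a = 1.0
finals = [run_path(a, 500, rng) for _ in range(1000)]
```

With this drift, nearly all simulated endpoints land at or just above a (up to discretization noise), whereas unconditioned Brownian motion would end beyond a = 1 only about 16% of the time.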
Now let me give you a couple of applications of this theorem, and then I will explain the proof; well, I will prove some part of it. After I give you the applications, we'll see how much energy people have for the proof. The applications encode the usefulness of having this minimum-energy coupling between the measure that samples from f and the Brownian motion. Okay.
>>: [indiscernible] thinking about these gradients, what could be the value of the integral? So I just ask, what is [indiscernible] for v_t: is it the gradient of P_{1-t} f at W_t?
>> James R. Lee: Yeah, you can ask that. It doesn't quite make sense anymore: if you look at what happens in the discrete setting, you'll now get biases that are bigger than 1 or less than -1, which no longer define probabilities. I mean, what will happen? I don't know. But what does it even mean, what will happen?
>>: [indiscernible] some process but it's not clear what it means.
>> James R. Lee: It's not clear. It's not clear how you even interpret this process. What do
you…
>>: I mean you get some [indiscernible] that's well-defined, but it's not going to let you…
>> James R. Lee: Yeah, it's very well defined. If you ask me whether it's better in some sense, I'm not sure, because I don't actually know what it is; certainly you will not end up with law f at the end if you do that. I have no idea where you will end up. But you see, if you are not careful you will overshoot the set, I think. But now I would have to think about it, and since we are…
>>: Maybe it's [indiscernible]
>>: It's adapted, so you won't exactly overshoot, but you could do some kind of oscillatory thing due to this.
>> James R. Lee: It's not clear what will happen at time 1. I guess what you are saying is, as you approach time 1 you are insisting that you hit the set, but now you could be jumping over the set back and forth many times, especially if the set has some crazy boundary, and you lose this smoothing effect. Okay, let's pause on this question for a second. Let me tell
you one fact about this optimal v_t which will be important; if we get to the proof of the theorem, we will see why it holds. This v_t is a martingale. For instance, when you are at time t, the expected direction you will be pointing at time 1 is the direction you are pointing now. All right. Let me give some
applications, and then we can talk about the proof. The first application is the log-Sobolev inequality in Gaussian space, and I should say that both of these applications are due to Lehec; not the conclusions, which are classical, but the use of this theorem to prove them. The log-Sobolev inequality is equivalent, in the discrete cube for instance, to the hypercontractive inequality, so you should think of it as a proxy for the thing we discussed at the beginning. For
this inequality we need one quantity attached to a function, its Fisher information with respect to the Gaussian measure: the expectation, under gamma_n, of the squared norm of the gradient of f divided by f; equivalently, the expectation of f times the squared norm of the gradient of log f. If you think about f as the indicator of a set, a 0-or-1-valued function, then this is really measuring the Gaussian surface area of the set, at least in some analytic sense. That's the definition of Fisher information. And then the log-Sobolev inequality is just the following: the relative entropy of f with respect to the Gaussian measure is at most one half the Fisher information of f. For a set it tells you that if the surface area of the set is not too big, then the set cannot have too small a measure. In general, it relates the global ability of f to be different from the Gaussian measure to a local property of f. Let's now prove the log-Sobolev inequality using this theorem; it's really completely effortless. The relative entropy is exactly one half the integral from 0 to 1 of the expected squared norm of v_t, where v_t is now the optimizer.
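Before continuing, here is a numerical sanity check of my own for the inequality being proved, in one hypothetical dimension: for the exponentially tilted density f(x) = exp(a x - a^2/2), one has Ent(f) = a^2/2 and Fisher information a^2 in closed form, so the log-Sobolev inequality Ent <= (1/2) * Fisher holds with equality.

```python
from math import exp, log, pi, sqrt

def gauss_expect(g, lo=-12.0, hi=12.0, steps=100_000):
    """E[g(X)] for X ~ N(0, 1), by a midpoint Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += g(x) * exp(-x * x / 2)
    return total * h / sqrt(2 * pi)

a = 0.8
f = lambda x: exp(a * x - a * a / 2)                 # E_gamma[f] = 1
ent = gauss_expect(lambda x: f(x) * log(f(x)))       # relative entropy, a^2/2
fisher = gauss_expect(lambda x: (a * f(x)) ** 2 / f(x))  # |f'|^2 / f, a^2
```

The tilt is the equality case, which matches the fact that for it the optimal drift v_t is constant in t, so the martingale inequality in the proof below is tight.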
Now we use the fact that v_t is a martingale: the expected squared norm of v_t is at most the expected squared norm of v_1, so the relative entropy is at most one half the expected squared norm of v_1. Now, what is v_1? v_1 is exactly gradient log f at W_1, that is, the gradient of f at W_1 divided by f(W_1). So what we have is half the expectation of the squared norm of the gradient of f at W_1, divided by f(W_1) squared. But this is not quite what we want: we want the gradient of f at a random Gaussian point, not at the point W_1, which is distributed according to f. Here is the last step of the proof, which you won't be able to understand for a second, and then in the last few minutes I have I will explain it: the factor 1 over f(W_1) is exactly the change of measure that transforms W_1, actually the whole process W_t, into Brownian motion. Unfortunately, you didn't see this yet, which is a bit sad because…
>>: [indiscernible]
>> James R. Lee: What's that?
>>: [indiscernible] the W_1, and f is distributed as f times the Gaussian [indiscernible].
>> James R. Lee: Yeah, yeah. Thank you. No, no, no, of course, of course. Yeah, you should
have said it before I started to do all of this.
>>: You already said it, just [indiscernible]
>>: The formula is just formulated wrong this way? It doesn't need explanation.
>>: Yeah. Remember that one of them [indiscernible]
>> James R. Lee: Yeah. Okay. Good. Yeah, we only, okay. Yeah. Here I wrote something
stronger, which is that it holds for the whole path, but really we only want it at time one: at
time one, this is the change of measure for f, and the point is that, okay.
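[For the record, here is our transcription of the chain of inequalities on the board; the notation is ours. W_t = B_t + the integral of v_s is the drifted process with W_1 distributed according to f dγ_n, and v_t is the martingale drift of the optimizer, with v_1 = ∇log f(W_1).]

```latex
% Log-Sobolev from the entropy representation (transcriber's notation):
% W_t = B_t + \int_0^t v_s\,ds, \quad W_1 \sim f\,d\gamma_n, \quad
% v_t \text{ a martingale with } v_1 = \nabla \log f(W_1).
\begin{align*}
\mathrm{Ent}_{\gamma_n}(f)
  &= \frac12 \int_0^1 \mathbb{E}\,|v_t|^2\,dt
   \;\le\; \frac12\,\mathbb{E}\,|v_1|^2
   \;=\; \frac12\,\mathbb{E}\!\left[\frac{|\nabla f(W_1)|^2}{f(W_1)^2}\right] \\
  &= \frac12 \int \frac{|\nabla f|^2}{f^2}\, f\,d\gamma_n
   \;=\; \frac12 \int \frac{|\nabla f|^2}{f}\, d\gamma_n ,
\end{align*}
% where the last line uses that W_1 has density f with respect to
% \gamma_n (the change of measure just discussed).
```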
>>: [indiscernible] finish it at w?
>> James R. Lee: Yeah. The point is that w1 is distributed according to f times the gamma, so f
of w1 exactly gives you the scaling of the probabilities, so that you map w1 to b1. It's exactly
the change of measure from w1 to b1 by definition. Okay. That was the proof. Sorry if the
beauty was marred by my confusion. Okay. Let me give you one other proof that doesn't even
need this change of measure and is a bit simpler, which is Talagrand's entropy-transport
inequality. Okay. The inequality here is, so first let me write it; I have to remember whether
there is also a factor of two here. Let's say the inequality is the following. Okay. Here is the
inequality, which is that, I'll tell you in a second. The W2 distance from, again, this is the
shorthand for f times gamma n, okay, the W2 between any measure given by density f and the
Gaussian measure, squared, is bounded by twice the relative entropy. Very quickly, what is W2?
W2 is the earth mover's distance between the two distributions. In general, W2 between two
distributions, you can write it as, in [indiscernible] notation, like this: you look at random
variables that are distributed according to mu and nu respectively, and you take a coupling
between them, this is on a product space, and you look at the minimum distance here.
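[As a concrete numerical illustration, this is the transcriber's sketch, not from the talk: in one dimension the optimal W2 coupling simply matches quantiles, so sorting two samples and pairing them by rank estimates the distance.]

```python
import numpy as np

def w2_1d(x, y):
    """Estimate the W2 (quadratic Wasserstein) distance between two
    1-d samples: in one dimension the optimal coupling pairs points
    by rank, i.e. matches quantiles."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
n = 200_000
gaussian = rng.standard_normal(n)        # samples from the standard Gaussian
shifted = rng.standard_normal(n) + 1.0   # a measure obtained by a unit shift

# For N(0,1) vs N(1,1) the true W2 distance is exactly 1.
print(w2_1d(gaussian, shifted))          # ~ 1.0
```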
>>: [indiscernible]
>> James R. Lee: Equal to the minimum over the best coupling between mu and nu. Okay?
You can also just think about it, if you think about them as sort of piles of mass, as the least
Euclidean distance you need to move the pieces of mass, so that you move the Gaussian mass to f.
That's W2. So this says, and this is sort of an information theoretic quantity, that if a
function has small entropy with respect to the Gaussian measure, then you can move the Gaussian
measure to that function without doing much work. Let's do the proof, and I hope, this should be
really easy, so it will be hard to screw this one up. Okay. This is at most the distance between
b1 and w1, simply because this is a particular coupling of the two. This distance is exactly, I
erased the formula, unfortunately, it's exactly the distance when I sample, this is the drift
that I apply to get from b1 to w1. Now we can just use convexity to say this is at most the
integral from 0 to 1 of vt squared, I guess strictly speaking we should put norm squared here,
which is equal to twice the relative entropy by the theorem. Again, if you think about the
discrete case, this gives sort of the most natural coupling, and if you look at the amount you
transport over this coupling, it proves this inequality. All right. Now we are out of time.
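[In our transcription, the proof on the board amounts to the following chain, with the same drifted process as before; the notation is ours:]

```latex
% Talagrand's transport inequality from the coupling (B_1, W_1):
% W_1 = B_1 + \int_0^1 v_t\,dt, \quad W_1 \sim f\,d\gamma_n, \quad B_1 \sim \gamma_n.
\begin{align*}
W_2(f\,d\gamma_n,\, \gamma_n)^2
  &\le \mathbb{E}\,|W_1 - B_1|^2
   \;=\; \mathbb{E}\left|\int_0^1 v_t\,dt\right|^2 \\
  &\le \int_0^1 \mathbb{E}\,|v_t|^2\,dt
   \qquad \text{(convexity / Cauchy--Schwarz)} \\
  &= 2\,\mathrm{Ent}_{\gamma_n}(f).
\end{align*}
```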
>>: You have about seven more minutes.
>> James R. Lee: Okay, so 7 minutes. By the way, that was your chance to object to the 7
minutes. I apologize for that. Now you can ask, how does one prove such a theorem? In order
to make the best use of my time, let me just -- okay. How does one prove such a theorem? You
do exactly, I mean, you just sort of, all right. Let me remind you, because I erased it, what is
wt. It looks like this. The point is, this is what's called an Ito process. It's the Ito
calculus that we are dealing with for these things. Just at the end, let me tell you what the Ito
calculus is. It's very simple, but it's somehow very powerful, because it means that basically
all of the error terms that you wouldn't want to deal with go away in the Gaussian case. Okay.
Here is
probably the best way to understand Ito calculus. Consider the function e to the bt, so the
exponential of Brownian motion. Suppose I want to ask you whether this is a martingale. In
other words, do the infinitesimal changes of this thing have expectation 0? Yeah. If you ever
lost 10 percent in the stock market and then made 10 percent back and looked at your bank
account, you would know that this is not a martingale. Okay. But here's a proof that it is.
Let's just take the derivative with respect to time, so it's like e to the bt times dbt. This is
how one takes derivatives in calculus. Then just take expectation on both sides, and the point
is that this is 0, because dbt is independent of bt in some sense, so this is equal to 0. So
then this thing is a martingale, but of course this proof was BS. So what do we really want to
look at? We are looking at what happens when we take e to the dbt, so we can expand this by a
Taylor series. What do we get? We get 1 plus dbt plus 1/2 dbt squared, and so on. You guys
tell me when I can stop. Okay, good. We get something like this. Now if we take expectation
on both sides, what happens? We get a 1, which is what we expected; if this were a martingale,
then this would have been 1, we would have just multiplied by 1 in expectation, plus 0, right?
Ah, okay. The problem is that the expectation of this last term, when we compute expectations,
is really like half dt. Yeah. Because Brownian motion has, okay, what's the word, sort of
nontrivial quadratic variation, okay? I mean, some people believe this immediately and I don't
know if other people are concerned by this.
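[The claims just made are easy to check numerically. The following sketch is the transcriber's code, not from the talk: it builds Brownian paths from root-dt increments and verifies that the quadratic variation over the interval [0, 1] concentrates at 1, that e to the B_1 has mean e to the 1/2 rather than 1, and that the Ito-corrected exponential, which the talk gets to below, has mean 1.]

```python
import numpy as np

rng = np.random.default_rng(0)
steps, paths = 500, 20_000
dt = 1.0 / steps
# Brownian increments dB_t: Gaussian with mean 0 and variance dt.
db = np.sqrt(dt) * rng.standard_normal((paths, steps))

# Quadratic variation: the sum of dB_t^2 over [0,1] concentrates at t = 1,
# even though the plain sum of increments is only of order 1 per path.
qv = (db ** 2).sum(axis=1)
print(qv.mean())                    # ~ 1.0

# e^{B_1} is NOT a martingale: E[e^{B_1}] = e^{1/2}, not e^{B_0} = 1.
b1 = db.sum(axis=1)
print(np.exp(b1).mean())            # ~ 1.65 = e^{1/2}

# The Ito correction restores the martingale property: E[e^{B_1 - 1/2}] = 1.
print(np.exp(b1 - 0.5).mean())      # ~ 1.0
```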
>>: If the increments are less than order square root of dt, then because of the dependence
the cancellations will just make the Brownian motion constant, so you think about…
>> James R. Lee: The way that you construct Brownian motion is exactly, you know, basically,
by adding plus or minus root dt increments. This is how you construct Brownian motion from
simple random walk. Do you have a better explanation that doesn't require…
>>: [indiscernible]
>>: [indiscernible] [laughter]
>> James R. Lee: Okay. So the point is that when you try to do calculus with respect to
Brownian motion, the terms that would naturally go away, the second order terms, are actually
real things. But everything else goes away, okay? So that was the real Ito. That's Ito
calculus. The Ito calculus is really like, when you do calculus, you know dt squared equals 0,
but here you should think about dbt squared as equal to dt, and then everything else goes away:
dt times dbt equals 0 and dbt cubed equals 0. You only have to worry about the second order
term for Brownian motion. That's what the Ito calculus says, so let me just tell you, for
instance, Ito's lemma, which is that if I want to compute the derivative, say, of a function of
Brownian motion and also of time, then normally what I would do here is compute the derivative
of f with respect to time, times dt, and then the derivative of f with respect to the first
variable, times dbt, but then Ito says we have to go one more step and add half the second
derivative of f in x, times dt. Okay. So that's like the whole of Ito calculus: to compute a
derivative with respect to Brownian motion you have to include a second-order derivative term
for the quadratic variation of the Brownian motion.
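[In symbols, the one-dimensional Ito lemma just described reads, in our transcription:]

```latex
% Ito's lemma, one-dimensional form, using the multiplication table
%   (dt)^2 = 0, \quad dt\,dB_t = 0, \quad (dB_t)^2 = dt, \quad (dB_t)^3 = 0:
\[
  df(t, B_t)
  \;=\; \partial_t f(t,B_t)\,dt
  \;+\; \partial_x f(t,B_t)\,dB_t
  \;+\; \tfrac12\,\partial_x^2 f(t,B_t)\,dt .
\]
```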
Okay. As an exercise, then, you can see exactly what your martingale should be, right? If you
look at e to the bt minus t over 2 and take the derivative, okay, let's just compute what it
should be. This is the derivative with respect to time, so it's minus 1/2 dt, plus dbt, plus
half the second derivative with respect to the first variable, which is just the same thing
again, plus 1/2 dt. Oh, there's a, okay, good. So you see what happens here: this thing is the
martingale term, this thing has expectation zero, and these dt terms cancel and go away. This
is, you know, a legitimate martingale, okay? And now that I am out of time, let me just say
that by using this you can prove the theorem. I can explain it to anyone off-line if you want,
but this is really all you need, and then you can prove the theorem. Okay. So I guess I should
wrap up by saying that Ronen and I know how to, for instance, take these proofs and translate
them to discrete cubes, so you can use exactly this method of proof to prove the log-Sobolev
inequality on the discrete cube. With a little more work you can use exactly this method of
proof to prove the log-Sobolev inequality on the symmetric group, and it's unclear how much
further you can go, but it seems like a very powerful kind of method: do this correlated
sampling and then consider functional inequalities along the path of the sampling. I should say
it's not a new method; Borell was doing these things for at least 20 years, and many other
people, but okay. But I mean, you know, this is so clean that it should have a number of other
applications, and the theory goes directly to the discrete case via the sort of sampling I
talked about. Okay. So let me stop there. [applause]
>> Yuval Peres: Comments or questions?
>> James R. Lee: Yeah?
>>: I think I should have [indiscernible] but [indiscernible] if I move this [indiscernible] I actually
did f like…
>> James R. Lee: Is it trivial to see?
>>: That it could actually be one of the things that [indiscernible]
>> James R. Lee: I mean, if I have to think about it, it was not trivial, so that is sort of an easy
question, but…
>>: [indiscernible] you remember in the discrete case we actually saw this is basically the
condition, so [indiscernible] just think about Brownian motion as many discrete increments and
you will get exactly [indiscernible]
>> James R. Lee: Perhaps another way to see it is that, if you remember, before we wrote down
exactly what is f of w. It was exactly the product, i equals 1 to n, of 1 plus wi vi, where the
vi's were the partial derivatives. If you write down the same formula over there, then you will
see that at time 1 you are distributed according to f. The benefit over there is that when you
write down this formula, because of Ito calculus, if you write this as the expectation of a log,
then you only have to use the first two terms of the log and everything becomes much nicer. But
if you just write down the analog here, then, yeah, you will see that you are distributed
according to f at time one. Or, as Ronen says, you could discretely approximate all the jumps
and then believe it and then transport it here.
>>: It's a much better handler than [indiscernible] the better [indiscernible]
>> James R. Lee: Yeah. That's why I did the discrete case because…
>>: How long does the continuous proof really take?
>> James R. Lee: It doesn't take very long, but, okay, we should do it off-line. If anybody is
interested we can stop and do the proof. The only issue is that, you know, one thing I don't
want to do is calculate derivatives in front of people, because it can be mind numbing, so I
would have to hand wave over one of the derivatives. You do have to use some Ito calculus
somewhere to prove it.
>>: As far as [indiscernible] discrete work like [indiscernible]
>> James R. Lee: This here. What are you optimizing in the discrete world? Unfortunately, I
mean, you can write down a quantity, but it's not nearly as natural, and somehow that's the
benefit of being in the continuous case: there are certain things that are much cleaner here.
The discrete world is tough because here, you notice, to do this sampling there is no additional
randomness; this thing is deterministic. In the discrete world this was just a bias, and then
you need additional randomness to choose the bits. This makes things much uglier somehow. You
know, you could…
>>: [indiscernible] is that what you are [indiscernible] is that what you are making?
[indiscernible] I see you say the second time will be some [indiscernible]
>> James R. Lee: Yes. There is some, well, okay. It's hard to say when duality is not
happening because, you know, I've seen so many times that it was happening, but okay, I don't
know off the top of my head. There is a duality here because, you know, there is a duality just
in the notion of relative entropy, but I don't see how in the discrete case it would tell you
the answer immediately.
>> Yuval Peres: Okay. So we will continue this off-line so let's thank the speaker again.
[applause]