Yuval Peres: So we're delighted to have Yury Polyanskiy tell us about dissipation of information in
channels with input constraints.
Yury Polyanskiy: Thanks, Yuval.
Yes, so this is joint work with my colleague from Urbana-Champaign. And, okay, so in this talk there is basically one technique somewhere in the middle, but I will start with a problem. I'm mostly excited about the technique, not the particular problem that we solved using it.
But the problem might be of some interest -- I mean, it can be easily interpreted. So we have some original message, and you are allowed to process it, to encode it into some other real number, X1. The only thing I want is that every time you encode, the expected square stays bounded. So this is to prevent you from using too large a constellation. And then after you encode the original message into something, it's perturbed by Gaussian noise, and that's what gets fed to the next encoder. And then the next encoder, again, looks -- maybe it will first try to infer the original X0, which could be, for example, a binary bit, and then re-encode, and so on. But the idea is that you are allowed to do whatever you want as long as you don't violate the power constraint. And the question is: your goal is to design these processors so that, even after a great many hops, you can still reconstruct the original message. So, yeah, we call it a chain of Gaussian channels.
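To make the setup concrete, here is a toy Monte Carlo sketch (my own illustration, not from the talk) of this chain, with hypothetical relays that simply re-encode the sign at amplitude A -- a special case of the power constraint:

```python
import math
import random

def q_func(x):
    # Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0,1)
    return 0.5 * math.erfc(x / math.sqrt(2))

def simulate(n_hops, A, trials=20000, seed=0):
    # X0 = +-1; each relay sees Y = X + Z (unit-variance Gaussian noise)
    # and re-encodes the sign at amplitude A, so E[X^2] = A^2 stays bounded
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        x0 = rng.choice([-1, 1])
        x = A * x0
        for _ in range(n_hops):
            y = x + rng.gauss(0.0, 1.0)
            x = A if y >= 0 else -A
        agree += (x > 0) == (x0 > 0)
    return agree / trials

# Each hop acts like a binary symmetric channel with flip probability Q(A),
# so the agreement probability is (1 + (1 - 2*Q(A))**n) / 2, drifting to 1/2.
print(simulate(1, 2.0), simulate(50, 2.0))
```

With A = 2, one hop preserves the bit almost perfectly, yet after 50 hops the agreement probability is already close to 1/2 -- the asymptotic decoupling discussed next.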
So initially, when I started looking at it, it was completely obvious to us that asymptotically -- I mean, as n grows large -- there must be some asymptotic independence. And I used to start this talk by saying: okay, obviously we conjecture that each stage is confined by the energy budget, so you cannot de-noise completely, and therefore after a great many hops you basically have asymptotic decoupling. So the joint distribution between the initial and the last message is approximately a product.
>>: How do you de-noise? I mean is the X -- you --
Yury Polyanskiy: Yeah, by de-noise -- suppose X0 is plus or minus 1, for example, right. So then you can encode it into plus or minus a trillion, right? Then you convolve with the noise, you get some real number, so you threshold at zero and you again encode it to plus or minus a million.
>>: If F is not [indiscernible].
Yury Polyanskiy: Oh, yeah, F can be anything. But I will return to this point later.
Okay. So, yeah, I used to say that it's obvious and that my talk is to prove this obvious fact, but then I started to sample opinions before saying that. It turns out that some people actually believe that it's possible to preserve information. And the more advanced -- I'm an information theorist by training, so the more advanced the people you ask, the more of them will say so, because they know that -- okay, and they also used to say that everything I say works for arbitrary dimensions, so it doesn't have to be scalar, it can be arbitrary dimension. And so they will of course say: but error-correcting codes work, so there must be some way to preserve information. So nevertheless, yeah, I'll spare you this kind of thing -- I will show that indeed we have asymptotic decoupling, and I will gauge it in one of three metrics. So
it's the total variation distance, or TV; then the KL divergence -- I hope some people here know what KL divergence is; basically it's some kind of distance-like function. And then also we have asymptotic decorrelation. Each of these has an equivalent interpretation. For example, convergence in KL distance is exactly equivalent, just by the definition of mutual information, to convergence of the mutual information between the first and the last guy to zero. And decorrelation, which is important for one of the applications, is exactly equivalent to saying that the minimum mean-square error estimate of the original message, given the output, becomes asymptotically trivial. So just estimating by the mean gives you the optimal thing.
And roughly speaking -- I mean, if you know a little bit of information theory, or even some general probability -- it's easy to see that KL implies both of the others: this direction by Pinsker's inequality, this one by something called rate-distortion theory. So it was very natural for us to immediately consider KL, and, well, let's just try to do KL.
So let's see what kind of standard tools we have in our arsenal. The first tool is the famous data processing inequality, which says that the divergence -- the distance between output distributions, between distributions perturbed by some stochastic transformation, maybe discrete, maybe non-discrete -- just cannot expand. It only shrinks.
Okay, so the equivalent statement is that mutual information in Markov chains does not increase.
Okay, so what does it tell us about our chain? Something, of course, very obvious: basically you can apply it repeatedly and say that, okay, the sequence of mutual informations cannot increase. But of course it doesn't tell you that it has to decrease, right? So that's the only thing you know.
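A quick numerical check of the data processing inequality (a generic sketch; the binary symmetric channel and the two input distributions are arbitrary choices of mine):

```python
import math

def kl(p, q):
    # KL divergence D(p || q) for finite distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push(p, K):
    # output distribution of channel matrix K (rows = P(y|x)) on input p
    return [sum(p[i] * K[i][j] for i in range(len(p))) for j in range(len(K[0]))]

K = [[0.9, 0.1],
     [0.1, 0.9]]              # binary symmetric channel, crossover 0.1
P, Q = [0.8, 0.2], [0.3, 0.7]
d_in = kl(P, Q)
d_out = kl(push(P, K), push(Q, K))
print(d_in, d_out)  # data processing: the output divergence can only shrink
```

For this channel the contraction is even strict, consistent with the strong version discussed next.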
So this suggested that what we need is some quantitative version of the data processing inequality. Maybe we can strengthen the statement that it cannot expand, right? Okay, so there is such a thing, and it's called the strong data processing inequality. Actually, for many people who heard this talk it was a big surprise that there is such a thing. So I want to spend some time on this inequality. What it says is that for actually most channels, you can insert a multiplicative constant here, which I denote by eta KL, the contraction coefficient. So then not only does the divergence decay multiplicatively -- this is equivalent to saying that mutual information decays multiplicatively, too. Okay.
So now we have this eta KL, which is formally just defined as the [indiscernible] ratio of output to input divergences, or mutual informations. There's a famous result by Ahlswede and Gács that for discrete indecomposable channels this coefficient is strictly less than 1. What are indecomposable channels? Basically, you don't want any zero-error effects there; you want every input to spread to all possible outputs, roughly speaking. But basically everything -- the binary symmetric channel, the binary [indiscernible] channel -- all this stuff is indecomposable.
Okay, so this coefficient, eta KL, has a lot of funny connections -- a lot of connections to some cute topics. In the original Ahlswede-Gács paper, they didn't care about this strong data processing; what they cared about was hypercontractivity, which I'll mention a little bit later. And they showed that the worst possible hypercontractivity [indiscernible] is actually exactly given by this coefficient eta KL. Okay, and then there are some connections to log-Sobolev inequalities and so on, so let me just focus on --
>>: Sorry, I'm being a little slow, but what exactly does the diagram mean? What do the two arrows on the left mean and what do the two arrows on the right mean?
Yury Polyanskiy: Yeah, sorry, so this is a good question. I apologize to everybody who watches the
video, and there are many people like this.
>>: [Inaudible]
Yury Polyanskiy: Oh, they can see the board? Oh, yeah, of course, yes.
Okay, so what I meant to say when I drew something like this is just that, okay, you have this stochastic kernel, so this just means by definition that P_Y is just the integral over dP_X of P_{Y|X=x}. Right. So that's what I mean. It's just the averaging of the distribution -- so for example, the probability of a set E is just the averaging of this function. I mean, nothing fancy. It's just the marginal of the joint: you form the joint, P_X times P_{Y|X}, and then you compute the marginal in Y.
>>: I see. And Q is just another example of the --
Yury Polyanskiy: Yeah, exactly. So that's what I said, basically the way that --
>>: Same transition rule, but different --
Yury Polyanskiy: Yeah. Because the way information theorists frequently think about channels is as something that maps distributions, right. So P_X is mapped to P_Y, or one pair is mapped to another pair. Okay, yeah, so -- all right. So now, of course, if this contraction coefficient is less than 1, then this mutual information -- again, by the same repeated application -- you get exponential convergence to zero, right?
Okay, now the sad part is that for my channel, this contraction coefficient is actually 1. The good news is that for a very slight modification -- and I'll spend some time on this -- if you replace this expected-power constraint with an absolute amplitude constraint, then this contraction coefficient will be strictly less than 1.
So let me explain how to derive this in a fancy way. This is a photo of Roland Dobrushin, who is my personal hero, and I couldn't resist inserting something here about him. So Dobrushin defined this coefficient, which is the same as eta KL, but I denote it eta TV; this is the contraction ratio of total variation. Again, this is a [indiscernible] channel. And of course it's very easy to see, since this is just an L1 norm, that the worst-case input is just a pair of delta functions. So it's very easy to compute. And the original motivation for studying this coefficient was the mixing of Markov chains, at least for Dobrushin.
Okay, so this is a slight digression, because I know that there are many people here who like Markov chains. Of course you all know that Markov chains frequently mix exponentially fast, right? And these coefficients -- eta TV, eta chi-squared, eta KL, the one I will use in this talk -- just correspond to mixing in various different distances. So you know that this one is typically obtained from log-Sobolev, this one is what is called the spectral gap, and so forth.
Yes, so then some people studied the general relations between these contraction coefficients for discrete and general channels. For example, for finite-state Markov chains there's a funny statement that eta KL is less than 1 if and only if eta TV is less than 1. And eta TV is easy to check, right, because you just need to check on delta functions. For eta KL, you don't know a priori.
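Since eta TV only needs to be checked on delta inputs, it is just the maximal TV distance between rows of the transition matrix. A small sketch (the 3-state matrix is an arbitrary example of mine):

```python
def tv(p, q):
    # total variation distance between finite distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def push(p, K):
    # one step of the chain: distribution p times transition matrix K
    return [sum(p[i] * K[i][j] for i in range(len(p))) for j in range(len(K[0]))]

def eta_tv(K):
    # Dobrushin contraction coefficient: the sup over input pairs is
    # attained at pairs of delta inputs, i.e. pairs of rows of K
    return max(tv(ri, rj) for ri in K for rj in K)

K = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
eta = eta_tv(K)
p, q = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
print(eta, tv(push(p, K), push(q, K)))  # second number <= eta * tv(p, q)
```

Iterating, any two initial distributions contract in TV at rate at most eta per step, which is exactly the mixing statement mentioned above.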
So then there is this more recent result that the contraction coefficient in chi-squared is exactly equal to the one in KL. There is some technical difference here: this statement does not, of course, imply the incorrect statement that the log-Sobolev constant is always equal to the spectral gap, which we know is not true, but it might appear like it. If you're interested, I can explain what the difference is. But let me proceed for now.
Okay, so now -- this is another slide which is tangential to my talk in some sense, but I wanted to say a little bit about this theorem. So these six people proved the result that eta KL, which is tough to compute, is actually always upper bounded by eta TV. So why is it interesting? Here is a fun implication. Let's consider a finite-state Markov chain, and let T be its transition operator. Then let me denote by E the operator that averages a function with respect to the invariant distribution -- so this is just the matrix whose rows are all P star, right. Now suppose the rows of this stochastic matrix are all closer than one-half in total variation to P star; then it implies, by the triangle inequality, that eta TV will be strictly less than 1, right? So you can insert P star here. And then this theorem, which is the same result, kind of amplifies this simple [indiscernible] inequality into the statement that eta KL is less than 1. And then by Ahlswede-Gács, this stochastic operator will be hypercontractive: you can upper bound the q norm by a slightly smaller [indiscernible]. Okay, and this tells you not just that the estimate exists -- they give an explicit estimate; they say take p to be eta KL times q. So if you replace one-half here with something slightly smaller, you get an explicit estimate for which operators are hypercontractive with constant 1.
Yes?
>>: Again, trying to slow you down -- so I understand that eta total variation, that's the worst-case contraction.
Yury Polyanskiy: Yes. Yes.
>>: What is eta KL in terms of the log-Sobolev constant?
Yury Polyanskiy: It's effectively equal to --
>>: It is the log-Sob?
Yury Polyanskiy: Yes, because you see -- there are many log-Sobolev constants, right. There are 2-log-Sobolev, 1-log-Sobolev, modified log-Sobolev, log-Sobolev for continuous time, for discrete time --
>>: What is it in terms of the log-Sobolev constant where you have a --
Yury Polyanskiy: So this is -- you mean continuous time, right?
>>: Yes.
Yury Polyanskiy: So for continuous it would be PT; right?
>>: Yes.
Yury Polyanskiy: But that is not connected, because I'm talking about discrete semigroups.
>>: [Indiscernible].
Yury Polyanskiy: So okay, so --
>>: [Indiscernible].
Yury Polyanskiy: Yeah, so there is --
>>: [Indiscernible].
Yury Polyanskiy: So if you define something like this, okay --
>>: No, [indiscernible].
Yury Polyanskiy: Uh-huh.
>>: [Indiscernible] with log E. On the right I want F.
Yury Polyanskiy: I don't think it's related to this one. So it's related to this one.
>>: To the modified.
Yury Polyanskiy: Yes, to the modified. For the modified, there is a statement which says eta KL is always upper bounded by one minus alpha and lower bounded by one minus some absolute constant times alpha. But of course for semigroups, frequently, the modified log-Sobolev equals the usual log-Sobolev, right?
>>: [Indiscernible] but not always.
Yury Polyanskiy: Not always, yes. But in this sense, yes. It's not a universal connection.
>>: For random [indiscernible] there is no log-Sobolev for the modified.
Yury Polyanskiy: Yes, yes, yes.
Anyways, log-Sobolev is not going to be mentioned here. What I wanted to say is -- you see, this connection was made without it. I think you're asking because hypercontractivity is frequently derived from log-Sobolev, right, by integrating log-Sobolev. But this was a completely discrete statement; they didn't talk about log-Sobolev at all -- it's just a different method. So anyways, the punchline here is that, in the space of stochastic matrices, there is this one matrix E whose rows are all P star, right, and there is a non-vanishing ball surrounding it: for every operator inside this ball, its Lp to Lq norm is 1. So it's a slightly funny situation: you move away from pure independence, but there is a planar face, so to speak, in this ball of matrices whose norm stays upper bounded by 1. And this effect is actually -- yes, so let's ignore this.
And this effect is funny because it was actually discovered in the Euclidean space first, by Segal and Fefferman in the '70s, and then generalized to arbitrary operators in '88. So basically it's the usual effect: there is a ball surrounding an operator that integrates with respect to a probability measure, within which the Lp to Lq norm stays constant. But here it's basically a one-line proof for the discrete case, following from that upper bound.
Okay, so let me return to my talk. We're still talking about this chain, and we're talking about the amplitude constraint, right: all X are now required to be bounded in amplitude. So now we apply that estimate, eta KL upper bounded by eta TV. And here it's very easy to estimate the TV contraction, because you just want to take the two points plus and minus A. So -- I'm in an engineering department -- I'm going to use the Q function, which is the complementary CDF of the Gaussian distribution. So I apologize to everybody who likes the Phi better.
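For this channel the extreme delta inputs are at plus and minus A, so eta TV is the TV distance between N(+A, sigma^2) and N(-A, sigma^2), which works out to 1 - 2Q(A/sigma). A small numeric cross-check of mine:

```python
import math

def q_func(x):
    # complementary CDF of the standard Gaussian
    return 0.5 * math.erfc(x / math.sqrt(2))

def tv_numeric(A, sigma=1.0, lim=20.0, grid=200000):
    # numeric TV between N(+A, sigma^2) and N(-A, sigma^2), midpoint rule
    h = 2 * lim / grid
    c = 1.0 / (sigma * math.sqrt(2 * math.pi))
    total = 0.0
    for k in range(grid):
        y = -lim + (k + 0.5) * h
        total += abs(math.exp(-(y - A) ** 2 / (2 * sigma ** 2))
                     - math.exp(-(y + A) ** 2 / (2 * sigma ** 2)))
    return 0.5 * c * total * h

A = 1.5
print(1 - 2 * q_func(A), tv_numeric(A))  # the two numbers should agree
```

Since this value is strictly below 1 for every finite A, the amplitude-constrained channel does contract in TV.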
Okay, so what we derive from this is that for every amplitude-constrained channel you have exponential convergence of mutual information to zero. And, surprisingly, this is the entire proof, right -- two lines. There is actually a sequence of papers dealing with exactly this question for so-called Gaussian line networks, and they derive a worse exponent than these two lines. So I'm just saying this theorem, that eta KL is upper bounded by [indiscernible], is very overlooked, and very few people actually know about it. I think it was rediscovered at some point, in particular by Nick Law.
All right. Okay, so what about the average power constraint? If you're an analyst, at this point you can say: okay, come on, truncation, right? Some kind of truncation. Well, it turns out, no -- in this particular case, no. That's what we thought too; basically, after the amplitude-constrained result we didn't want to work on this. But my friend actually found the counterexample. He said: wait a second, what if you put some mass at zero, and the other mass is traveling farther and farther apart? So then, when you mix them, the only thing contributing to the total variation is this moving part, right. But when you convolve with the Gaussian, you basically get two very far apart Gaussians. So as t, the total variation between the original distributions, decreases, it turns out that eta TV converges to 1. So basically the same binary distributions with two masses show that the divergences and mutual information all do not contract. So there is no contraction.
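The counterexample can be checked numerically. A sketch under my reading of the construction: mass t at a far point b (with t * b^2 equal to the power budget) mixed with mass at zero, compared against a pure point mass at zero:

```python
import math

def tv_shifted_gaussians(b):
    # TV( N(0,1), N(b,1) ) = 2*Phi(b/2) - 1 = erf( b / (2*sqrt(2)) )
    return math.erf(b / (2 * math.sqrt(2)))

P_budget = 1.0
ratios = []
for b in [2.0, 5.0, 10.0]:
    t = P_budget / b ** 2   # mass t at point b keeps E[X^2] = t*b^2 = P_budget
    tv_in = t               # TV( (1-t)*d_0 + t*d_b , d_0 ) = t
    tv_out = t * tv_shifted_gaussians(b)  # the d_0 parts cancel after convolution
    ratios.append(tv_out / tv_in)
print(ratios)  # creeps up to 1: no strong TV contraction under average power
```

As b grows (and t shrinks to respect the power budget), the output-to-input TV ratio approaches 1, so no channel-wide contraction coefficient below 1 exists.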
>>: [Indiscernible] what, do you vary the t to zero, or what?
Yury Polyanskiy: Yes, so t is the distance between P_X and Q_X -- the TV distance, right. Because this part just cancels in the subtraction, so you just get t. So as t goes to zero -- so this basically proves there is no hope of exponential convergence, right? But maybe there is sub-exponential convergence. So this is our final result: for this particular chain you have super-slow convergence, right -- I mean, we have super-slow estimates. Basically it's 1 over log n, where n is the number of steps.
>>: As you're going along, with the two signals you have, their only difference would be a translation -- by a different Gaussian.
Yury Polyanskiy: Right --
>>: And in this -- when you're looking at these ratios, you consider it too general.
Yury Polyanskiy: Right, yes.
>>: But translations are very special.
Yury Polyanskiy: Yes, so you're saying that maybe you can exploit the fact that after a certain number of iterations you are not dealing with arbitrary distributions. But notice that there is this F2, right. So, for example, Y1 is a convolution, so it's a very smooth distribution, because it's something convolved with a Gaussian. But then you apply some arbitrary function, right, and it can be some crazy stuff. So this makes the distribution of X2 something crazy again. Yeah, but obviously there is no way around this, as the result shows. And again, these estimates were so slow that it was just painful, so we actually made sure that our estimates were more or less tight: up to these log log n factors, there is a matching lower bound too. So you can't improve it by much.
Okay, so what I'm going to talk about next is how to prove it -- what's the idea. All right. So the idea is the following. The strong data processing inequality says how much the input divergence contracts, right? A more precise characterization would be: what if we just compute the full curve? We do the full thing. So we take the two-dimensional plane, and for every pair of distributions we compute the input divergence and the output divergence, put the point here, and iterate over all pairs of distributions. The data processing inequality tells us that all the points will be below the diagonal. Strong data processing would tell us that all the points are down here. We know that's not true, right. But maybe the situation is like this. And why would that be good? Well, because then what we can do is start from somewhere -- we don't know where, but from some point -- and proceed with this iteration. And if this curve stays strictly below the diagonal, then the iteration converges to zero, right? So this was our hope. So basically the goal is to find this curve. Now, this was very exciting to us, because we take the Gaussian channel, which is something that has been studied since 1948 -- it's in the very first paper of Shannon; the most famous thing is one-half [indiscernible] plus P, right? The Gaussian channel is beaten to death. Here we want to associate to the Gaussian channel some curve -- something new to say about the Gaussian channel. So we spent some time trying to make sure nobody had actually done it. Seems no. And that was surprising. So we were excited at this point, and we said, okay, let's do this, let's calculate this thing. Yeah, so basically, once this curve is curved a little bit, then we are done.
Now, then there was sad news. At this moment we tried to compute F_KL and realized nobody could have computed it before, because it's actually exactly the straight line. So there's nothing good. And I don't know why we didn't give up at this point, because it was pretty discouraging for everybody: you're trying to prove a result which obviously holds, right, and nobody cares about, and by this time we had already tried a few other things, which I won't mention, and it doesn't work. But at this point, for some reason, we also decided to try F_TV, and magically it turned out that for F_TV there actually is a contraction. And to honor our hero, Dobrushin, we decided to call F_TV the Dobrushin curve of the channel.
So first of all, this is the strategy for the proof, which I'm sure I'm not going to have time to go over. The idea is to first work with TV, with total variation, to show you cannot transmit one bit, and then upgrade it to a general input, a general message X0. Then there is a trick for going from decoupling in TV to decoupling in KL; in our case we actually use a special property of the Gaussian noise here. Then there's a general trick to go from decoupling in TV to decorrelation. Anyways, we will see if we can get there.
Okay, so let's start with transmitting one bit. Suppose X0 is plus or minus 1 equiprobably, and you're trying to show that there does not exist a test which distinguishes plus 1 from minus 1. The only thing you need to do is propagate: you introduce two measures, P and Q -- one conditioned on plus 1, the other conditioned on minus 1 -- and just look at how these two measures move along this chain, right.
Okay. So, right. So then here's the formal definition of F_TV. F_TV is just the supremum of the total variation between P_Y and Q_Y -- over all TV [indiscernible] at the output -- given that the total variation at the input is less than t, and that there is a power constraint: the average power under P and Q is less than P.
And of course the Dobrushin coefficient is just the maximal slope of the Dobrushin curve, which is attained at zero.
Okay. Right, so now here's another surprise. I told you we were able to prove that F_TV is slightly below the diagonal; we actually found it exactly. So this is not an estimate; this is the exact value. So now the interesting thing -- another interesting thing, and at this point of course we didn't know the speed of convergence -- is that once we found F_TV, we realized it's a funny curve because of this expression: it's smooth but not analytic. At zero, the first derivative is 1 and all the higher derivatives are zero. So that's why your convergence is so slow; that's why this iteration gives 1 over log n. So basically the curve [indiscernible] the straight line to a very high degree.
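The qualitative mechanism can be seen with a toy curve of my own (a stand-in, not the talk's exact F_TV): take F(t) = t(1 - e^{-1/t}), which is strictly below the diagonal for t > 0 yet agrees with it to all orders at zero, and iterate:

```python
import math

def F(t):
    # Toy Dobrushin-like curve: F(t) < t for t > 0, but F'(0) = 1 and all
    # higher derivatives vanish at 0 -- smooth yet non-analytic at the origin
    return t * (1.0 - math.exp(-1.0 / t)) if t > 0 else 0.0

t = 0.5
trace = {}
for k in range(1, 10001):
    t = F(t)
    if k in (10, 100, 1000, 10000):
        trace[k] = t
for n, v in trace.items():
    print(n, v)
```

The iterates decay to zero, but only at a 1/log n pace: after 10,000 steps the value is still roughly a tenth, matching the slow rate stated above.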
Okay, so what's the idea? The idea is the following. You give me two distributions, P_X and Q_X, whose total variation is t, right. And we are trying to bound how the total variation shrinks under convolution with the Gaussian noise. That's our job. And the idea was the following. Suppose we couple P_X to Q_X; so under this coupling, in the good situation, X minus X prime would actually be equal to zero, and then you can also couple the noise so that Y will be exactly equal to Y prime. But you don't know what to do when X and X prime are not equal to each other. Then it turns out that this cannot happen too often, because of the power constraint. Now, if I just described this to you, you'd say: okay, you can obviously convert it into some bound, but why do you get exact estimates? So there is another place where we got lucky. So this is the full proof. I don't know if you want to -- it's a very easy exercise. After being lucky so many times, it was just very easy to do this.
Yes?
>>: [Indiscernible].
Yury Polyanskiy: Okay. So let's go through it. Actually, I said I don't need a laser pointer; I believe it would be useful now. But I just have to find it. Yeah, okay, thanks. It would be cooler if it just dropped from the ceiling.
>>: Pretty cool.
Yury Polyanskiy: Yeah. So one function we need to introduce, associated to the channel, is how the total variation distance interplays with the Euclidean distance on the input space. So it's just this thing. It's very easy to compute; it's something involving the Q function.
Okay, now, here's where we were lucky the first time. TV, total variation, is the only f-divergence which is also a Wasserstein distance. If you don't know what an f-divergence is, it doesn't matter. As any Wasserstein distance, it's convex in the pair, so what we do is average over delta input distributions. This gives you this estimate: the expectation of g of X minus X prime. Now, in arbitrary dimension -- even if this is not scalar -- the total variation between shifts of the isotropic Gaussian only depends on the distance between X and X prime, because you can always rotate. So you can put the absolute value inside, then you apply Jensen: you condition on X not equal to X prime, apply Jensen, and then estimate this distance. So you have an estimate in terms of first moments, but you can estimate it by second moments, right. And that's it.
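As a sanity check on the convexity/coupling step, here is a tiny numeric example of mine: P a fair mixture of point masses at plus and minus 1, Q a point mass at 0, coupled so that |X - X'| = 1 always; the TV after adding N(0,1) noise should sit below g(1):

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def phi(y, mu=0.0):
    # standard-normal density centered at mu
    return math.exp(-(y - mu) ** 2 / 2.0) / SQRT2PI

def g(d):
    # g(d) = TV( N(0,1), N(d,1) ) = 2*Phi(d/2) - 1 = erf( d / (2*sqrt(2)) )
    return math.erf(d / (2.0 * math.sqrt(2.0)))

def tv_numeric(f1, f2, lim=15.0, grid=100000):
    # numeric TV between two densities on a grid
    h = 2.0 * lim / grid
    return 0.5 * h * sum(abs(f1(-lim + (j + 0.5) * h) - f2(-lim + (j + 0.5) * h))
                         for j in range(grid))

# P = (d_{-1} + d_{+1})/2 and Q = d_0; under the coupling X' = 0 we have
# |X - X'| = 1, so the bound reads TV(P*N, Q*N) <= E[ g(|X - X'|) ] = g(1)
lhs = tv_numeric(lambda y: 0.5 * phi(y, -1.0) + 0.5 * phi(y, 1.0), phi)
print(lhs, g(1.0))  # left side sits strictly below the bound
```

The gap between the two numbers is exactly the slack the convexity step gives away on this example.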
So basically, after we obtained all this, we asked ourselves why we were so lucky. And it turns out that, yes, one piece of luck was that this is TV, not an arbitrary f-divergence, so we can do the coupling trick. And the second is that this works for any unimodal symmetric noise -- symmetric meaning the noise density depends only on the distance to zero.
Okay, so, yeah, for the lower bound you just take the same distributions I mentioned before; they attain the bound exactly. Okay, I don't know, should I -- it seems like -- yeah, it's basically pretty trivial.
Okay, there are many more generalizations than you want to know. The most important one is that you can [indiscernible] arbitrary dimension, at least for Gaussian noise.
All right; these are two slides that I inserted here. So if PDEs are something that excite you, then you can interpret this result exactly as a statement about PDEs. So let's take the heat equation. You solve the heat equation on R^m with some initial condition f(x), and then you know, just by integration by parts, that the integral of the solution is of course preserved, but the L1 norm decays: the L1 norm of the solution at time t decays to the absolute value of the initial average. The question is how fast. So we give a best-possible estimate of this [indiscernible], which is given here. This is just an interpretation, right: convolution with a Gaussian is exactly the solution of the heat equation. And the funny thing is -- I don't know, I'm not a PDE person, so I don't know if they have studied this -- what this lets you do is trade: it gives you a 1-over-square-root-of-log-t estimate of the decay of the L1 norm. Of course, at what price? It doesn't come for free. The price is a tail bound on the initial condition. So we trade knowledge of the tails of the initial condition for the speed of decay in the L1 norm.
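A numeric sketch of this heat-equation reading (my own toy example, not from the talk: a one-dimensional initial condition with a positive and a negative bump whose integral is 0.2):

```python
import math

def bump(x, mu, s=0.5):
    # Gaussian bump of unit mass centered at mu, width s
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def l1_after_time(f_vals, xs, t):
    # convolve f with the heat kernel exp(-x^2/(4t)) / sqrt(4 pi t),
    # then return the L1 norm of the solution on the grid
    h = xs[1] - xs[0]
    c = h / math.sqrt(4 * math.pi * t)
    total = 0.0
    for x in xs:
        u = c * sum(fv * math.exp(-(x - xj) ** 2 / (4 * t))
                    for fv, xj in zip(f_vals, xs))
        total += abs(u) * h
    return total

xs = [-20.0 + 0.1 * k for k in range(401)]
f_vals = [0.6 * bump(x, -2.0) - 0.4 * bump(x, 2.0) for x in xs]  # integral 0.2
norms = [l1_after_time(f_vals, xs, t) for t in (0.1, 2.0, 20.0)]
print(norms)
```

The L1 norm starts near 1 (the two bumps barely overlap) and decays toward the preserved integral, 0.2, as the positive and negative parts cancel under diffusion.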
So I thought that was funny. So another punchline -- sorry, another corollary we will observe -- is that you can also derive something about the CLT; this is a CLT in what we call smoothed total variation.
So let me go over this. Remember that the intermediate step was to introduce this coupling between convolved versions of P and P prime, and estimate like this, right. Then we applied Jensen to get this thing. And in our proof we just estimated the expectation of X minus X prime through second moments -- the passage from the L1 to the L2 distance. But actually, why don't we take the best coupling between P and P prime, the one that minimizes the L1 distance? Then you get an estimate in terms of Wasserstein distance, right. And of course the Wasserstein distance, the W1 distance, is something we all love, because there's Stein's method for deriving quantitative estimates of convergence in Wasserstein distance. So you get an estimate like this: give me any [indiscernible] i.i.d., zero mean and unit variance, add a tiny little bit of Gaussian to the [indiscernible] sums, and then it will be close in total variation to the Gaussian with variance 1 plus sigma squared. And here's an explicit estimate of the [indiscernible]. So this is why we call it smoothed total variation: because, of course, if you said, well, I don't care about the smoothed version, what about just the original -- of course there is no convergence in TV, because [indiscernible] is discrete and you'll always have distance 1. Actually, Prokhorov showed that to have convergence in the non-smoothed, usual TV, the only thing you need is that after a few convolutions you have some non-trivial absolutely continuous component. But what is more fun is that there is a best estimate from '62. I don't know if anybody here knows about this; I was surprised there's an exact estimate of the asymptotics of the speed of convergence in total variation, which I had never heard about.
>>: [Indiscernible] smooth thing very [indiscernible].
Yury Polyanskiy: Well, [indiscernible] estimates [indiscernible] distance, right. Here I'm talking about TV. But exactly, yeah, the estimate here is of the same order.
>>: And the last one -- are you assuming that it has an absolutely continuous component?
Yury Polyanskiy: Of course. Yes. It's under the same condition, yes. And moreover, there is a mistake here: it's not the absolute third moment, just the third moment without the absolute value. So if you have zero skewness it's zero, and then the first-order term is less than 1 over square root of n. But our estimate is of the right order, typically, unless you have zero skewness. So again, just some tangent that I thought you would maybe like.
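The smoothed-TV CLT is easy to watch numerically. A sketch of mine with fair plus-or-minus-1 coins (which have zero skewness, so convergence is fast) and smoothing sigma = 0.3:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def norm_pdf(x, var):
    # density of N(0, var)
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(var) / SQRT2PI

def smoothed_sum_pdf(x, n, sigma):
    # density of (X_1 + ... + X_n)/sqrt(n) + sigma*Z for fair +-1 coins X_i:
    # a binomial mixture of Gaussian bumps
    return sum(math.comb(n, k) * 2.0 ** (-n)
               * norm_pdf(x - (2 * k - n) / math.sqrt(n), sigma * sigma)
               for k in range(n + 1))

def tv_to_gaussian(n, sigma=0.3, lim=8.0, grid=4000):
    # numeric TV distance to the matching Gaussian N(0, 1 + sigma^2)
    h = 2.0 * lim / grid
    return 0.5 * h * sum(
        abs(smoothed_sum_pdf(-lim + (j + 0.5) * h, n, sigma)
            - norm_pdf(-lim + (j + 0.5) * h, 1.0 + sigma * sigma))
        for j in range(grid))

tvs = [tv_to_gaussian(n) for n in (4, 16, 64)]
print(tvs)  # shrinks with n: the smoothed sums approach the Gaussian in TV
```

Without the sigma*Z smoothing the TV distance would stay at 1 here, since the unsmoothed sums are lattice-valued, which is exactly the point of the smoothed formulation.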
So now we're done with the convergence to zero when X0 is binary. So let me mention a few things. When am I supposed to finish? At 4:30, or something like that?
Yuval Peres: Something like that.
Yury Polyanskiy: So then let me just quickly mention how to get these arrows. First of all, to get from the binary to the general case, it's convenient to introduce this T-information. This is basically the distance to the product distribution in total variation, and then, just by some concavity, you get the correct statement: this total variation distance decays at the same speed -- it satisfies the same data processing inequality under F_TV.
Okay, so now you can upgrade T-information to Shannon information by some pretty easy coupling arguments. If X0 is finite, then you can do some coupling trick, and if it's general, then you need to use some discretization. And I'm sure you don't need to know about this. But anyways, there is some analysis here.
Now, for how to get correlation from TV: again, as I mentioned in the beginning, if you have
convergence of mutual information, then you have convergence of correlation, just by rate distortion.
But if you want the speed of convergence, if you want explicit [indiscernible], you need to bound some
higher moments. For example, if you have Gaussian X zero, then just by applying, again, some coupling
plus the Hölder inequality, you get, again, the log log n over log n estimate for the correlation.
Okay, now, this I can definitely skip. Well, one punchline here is -- remember that I showed you that
F_KL of t was actually the straight line, so it doesn't contract at all. It turns out the same is true for
all Rényi divergences of order 1 and above. So KL divergence -- Rényi of order 1 -- also does not
contract, and neither does anything above, but everything below 1 actually does contract. So I don't know.
So I have about fifteen minutes. Let me mention some applications and implications of this stuff. So
here's one version of the main result, just to recap. We have an arbitrary chain of processors, which
potentially -- I mean those links could be infinite-dimensional, right; there could be buses
connecting computers together. As long as the total L2 energy expended in the communication is
bounded, you will have mutual information converging to zero as 1 over log n.
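A toy simulation of this chain (a sketch with illustrative parameters, not the scheme from the paper): one bit is relayed through Gaussian channels by the naive linear relays that simply rescale to meet the power constraint. Linear relays lose correlation exponentially fast; the theorem says even optimally designed nonlinear relays cannot keep the mutual information above order 1 over log n.

```python
import numpy as np

rng = np.random.default_rng(0)
P, sigma2, hops, trials = 1.0, 0.25, 20, 200_000

x0 = rng.choice([-1.0, 1.0], size=trials)   # the original bit X0
x = np.sqrt(P) * x0                         # encode with full power P
for _ in range(hops):
    y = x + rng.normal(0.0, np.sqrt(sigma2), size=trials)  # Gaussian hop
    x = y * np.sqrt(P / np.mean(y * y))     # rescale so E[X^2] = P

corr = np.corrcoef(x0, x)[0, 1]
print(corr)   # decays like (P / (P + sigma2)) ** (hops / 2)
```

Each hop multiplies the correlation with X zero by the square root of P over (P plus sigma squared), and the rescaling does not change correlation, which is why the linear strategy decorrelates exponentially.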
>>: You talked about other convex functions. So you still get the same rate if you have X zero to the 1.1 or --
Yury Polyanskiy: No, the rate changes, yes. I think the rate changes.
>>: It's a different function, it's not --
Yury Polyanskiy: It's still log n, but it's a different power; it's something like minus p over 2.
>>: If you just have a first moment condition [indiscernible] --
Yury Polyanskiy: I don't remember. I mean for TV, yeah, I think it's actually still log n.
But I need to double-check the paper.
Okay, so anyways, here is the sort of philosophical implication, right. So you store something on the
hard drive, right. Every once in a while you decide to copy it, while the copier only has finite energy,
right. And what this means is that after a great many iterations, you can't store even a single bit
written here, under arbitrary error correction, right. So yeah, and of course the punchline, yes, I
understand that everybody --
>>: [Indiscernible].
Yury Polyanskiy: Yes, very good, very good, exactly. So 1 over log n, which basically killed us. But
now here's a more fun example. If you talk about a memory controller -- well, a memory controller does
exactly this: it reads memory cells, it writes memory cells, it reads, it writes, reads and writes. And that
happens pretty often; that happens 16 times a second. So in one year you get 2 to the 29 reads and
writes. So, you know, once you plug things in here, then the noise -- you know, the noise is never
exactly zero, right; there's always some thermal noise. I don't know, I didn't try to estimate what you get. I
was too bored, because all of those constants are actually hard to extract.
>>: [Indiscernible].
Yury Polyanskiy: Yeah, yeah, yeah.
>>: Specifically what happens if you allow the power to grow a bit slowly?
Yury Polyanskiy: No, then of course you're golden.
>>: But how -- well, it depends --
Yury Polyanskiy: I think square root of log n is enough. Or log n, something like this, yes.
>>: I mean that's kind of interesting philosophically, because, you know, if you want people to be able
to read your paper in a million years and you're prepared to assume that --
Yury Polyanskiy: Yeah.
>>: -- more energy available to them.
Yury Polyanskiy: Yeah. I mean you can't exceed the memory --
>>: [Indiscernible].
Yury Polyanskiy: Yeah, yeah.
Anyways, so this is one implication. Another implication -- this actually was the motivation of my
colleague, with whom we derived this originally. So there is this question that these two guys,
Lipsa and Martins in control theory, asked themselves. They said, okay, suppose I have a Gaussian
random variable, right. And my job is to preserve it intact. But I have to read and write it from time
to time. And when I read it, there is some exogenous noise which corrupts it. So what can I do?
What is the optimal way to read and write? What are the processing functions? So we call this a
memoryless controller, because it only has one real number to store, right. And when it reads it, it gets
convolved with Gaussian noise. So it's very easy to show that if you want this operation to work once,
for one read, a linear F1 is optimal. It's much less easy -- in fact, I didn't read the paper, but it looked
intimidating enough -- but for [indiscernible] still the linear controller is optimal. Why is this
intimidating? Because, you see, now when you try to optimize the mean-square error, you get a
composition of F2 and F1. And it's pretty tough to optimize over compositions. Anyways,
so then these same guys showed that actually, when you have greater
than five iterations, there is a non-linear scheme that actually outperforms any linear scheme. And
what our result shows is that, first of all, asymptotically you always decorrelate, so you can't store -- I
mean you're as good as storing zero. And the second thing is that the linear F_j's actually correspond to
a special restriction on the type of distributions -- exactly as was mentioned, they're all Gaussian.
And then it's easy to show that KL actually contracts. So therefore, you actually decay exponentially
fast. So yeah, just from that, you can immediately see that non-linear controllers eventually outperform linear ones.
So another application we discussed in the paper is -- so I don't know how many people here know
about Gibbs measures, but that is a game which is played in certain corners of science, where what you
do is -- maybe I shouldn't -- yeah, maybe here it's played often.
>>: [Indiscernible].
Yury Polyanskiy: Yeah, okay. All right. Anyways, given some conditional distributions, you try to
guess if there is a joint distribution which solves this problem, and, as I'm sure most of you know, the
idea is you try to see if there are multiple solutions; if so, it means that there is some phase transition
happening. The two-dimensional Ising model is of course the famous example. But the rule of thumb is
that if the links -- if this dependence is weak, if there is high temperature, then basically the nodes
cannot communicate too much, so there is just one solution; no phase
transition. Yeah -- like [indiscernible] temperature for [indiscernible], that's the
typical example.
All right. So one way to show uniqueness is the Dobrushin method. Dobrushin came up
with this ingenious idea. He said, well, let's do the contrapositive. Suppose there are two solutions;
let's run them side by side and let's couple all the neighbors using some coupling, right. If these
channels, which go from the neighbors to a given node, are contracting total variation, then you can
refine your coupling at this point, right. You can refine it and keep refining it until the total variation
of any finite-dimensional marginal is less than an arbitrary number, so the two solutions have to be the
same. So what can we do with our method? Basically what we're trying to do is upgrade this TV
contraction to some more general thing. So for example, we could prove some estimate like this.
Basically -- it's an ugly condition, because you try to be as general as possible -- if there is some
lower bound on all the potentials, then there cannot exist more than one Gibbs measure which does not
explode to infinity. So this is an additional qualifier: the Dobrushin method gives you the statement
for all measures, but beyond that there is an awesome kind of Schlausmann staircase trick where
something explodes to infinity.
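The coupling idea can be sketched numerically. This is an illustration with assumed parameters, not the estimate from the paper: run heat-bath Glauber dynamics for two one-dimensional Ising chains started from opposite configurations, feeding both copies the same randomness. At high temperature (here beta = 0.3, where the total influence tanh(2*beta) is below 1, so Dobrushin's condition holds) the disagreements die out and the two copies coalesce.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta, steps = 100, 0.3, 20_000   # ring of N spins; tanh(2*beta) < 1

def heat_bath(s, i, u):
    # heat-bath update of site i, driven by the shared uniform u
    h = s[(i - 1) % N] + s[(i + 1) % N]
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
    s[i] = 1 if u < p_plus else -1

a = np.ones(N, dtype=int)       # copy 1: all-plus start
b = -np.ones(N, dtype=int)      # copy 2: all-minus start
for _ in range(steps):
    i, u = rng.integers(N), rng.random()
    heat_bath(a, i, u)          # same site, same uniform in both chains:
    heat_bath(b, i, u)          # if the neighbors agree, the site must agree

print(np.mean(a != b))          # fraction of disagreeing sites
```

The design point is exactly the one in the Dobrushin argument: under the identity coupling, a disagreement at a site can only persist when its neighbors disagree, so when the influence coefficient is below 1 the expected number of disagreements contracts to zero.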
Okay, now the final application is to circuits of noisy gates. That's another area where contraction
coefficients were used very successfully to prove certain results. So what is the game here?
The game here is that suppose you have a circuit of gates with K inputs each, which computes some
binary function, for example. Then we say that f essentially depends on all its arguments, meaning
every argument influences the output at least sometimes. Then it's easy to show that the depth
should be lower bounded by log n. And then the next step was to say, okay, now suppose our gates,
instead of being ideal, have some noise added after each output, right. So for example, the
standard thing that people studied, starting from von Neumann, is Bernoulli noise. So there's a
Bernoulli built-in noise. And the question is, can you actually build an arbitrary -- because the
length of each path for a complicated function becomes log n. So as it grows, the noise actually
accumulates. So the question is: can you construct circuits such that the probability of error is
bounded away from one-half? And of course this whole field started when von Neumann showed that
actually, yes. So here is an architecture: by just using three-input majority voters, which requires
constant depth expansion and logarithmic gate expansion, he produces a circuit whose probability of
error stays bounded away from one-half.
>>: [Indiscernible].
Yury Polyanskiy: The depth -- yes, well, the depth, basically if you started with depth D, then your
depth becomes R times D. Now, for some circuits you need log n depth, yes. So then the depth becomes R
log n. Yeah. So that's von Neumann's method. And roughly speaking, it was shown that both the
constant depth expansion and the logarithmic gate expansion are actually required. Here the
function was XOR of everything -- XOR of all the inputs.
So now what is our contribution here? Well, we can replace the Bernoulli noise with Gaussian noise and
bound the energy of each gate, right. So now we don't care -- I mean maybe you use some funny
operational amplifiers, right; I don't know how they work. It was actually one of the critiques of
von Neumann's work -- he himself said maybe I should have considered analog wires as opposed to
digital wires. So one thing we can prove here is a lower bound on the probability of error:
asymptotically, as n grows to infinity, it approaches some number which is a fixed point of a certain
function. And then you can evaluate it as the SNR of each gate changes. I mean here
I chose from minus 20 to 20, and you get some lower bound on the probability of error.
And the funny thing here -- and this is the main difference from the Bernoulli noise of von Neumann
or Pippenger -- is that here my lower bound always stays bounded away from one-half. Okay, first of
all, maybe this lower bound is not tight, right. But actually there is a reason -- I don't know the
answer, but I think that you should -- yeah. For the Bernoulli noise, what actually happens is that when
delta, the noise, is strong enough, K-input gates -- let's say 3-input gates -- cannot have probability
of error different from one-half if epsilon, the noise, is greater than 1/6, for example. And under some
circumstances this is optimal. So here I don't expect anything like this, roughly speaking because of
the very, very slow contraction, 1 over log n, right. When you have exponential contraction, like fully
discrete channels, one way to overcome it is to have a lot of paths, right. Yeah, like [indiscernible]
has this fantastic paper on information flow on trees, right. So when the degree of the tree is large
enough, you basically can overcome the contraction of TV. But here there is nothing to overcome,
because it doesn't contract. So that's why, roughly, for all SNRs I expect you can compute. But I
didn't try to produce an [indiscernible] bound.
All right, so that's everything I wanted to say, and it's 4:30, so I'm on time. My takeaway message is:
if you don't have linear contraction, maybe look at something like this. Sometimes it works.
[Applause]
Yuval Peres: Any more questions?
>>: So linear [indiscernible], can you do something if the noise is Bernoulli?
Yury Polyanskiy: Noise is Bernoulli? Yeah, for Bernoulli noise it's very easy, because there is
contraction. The contraction coefficient is just (1 minus 2 delta) squared. So every time, the
mutual information contracts exponentially fast -- exponentially fast to zero.
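This is easy to check numerically. The sketch below is an illustration under the standard fact that (1 - 2*delta)**2 is the strong data-processing constant of the BSC: composing n copies of BSC(delta) gives one BSC with flip probability (1 - (1 - 2*delta)**n) / 2, and the ratio of successive mutual informations approaches that constant from below.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_after_n_bscs(delta, n):
    # n BSC(delta)'s in series act as one BSC with this flip probability
    dn = 0.5 * (1.0 - (1.0 - 2.0 * delta) ** n)
    return 1.0 - h2(dn)          # I(X0; Xn) for a uniform input bit

delta = 0.11
eta = (1.0 - 2.0 * delta) ** 2   # contraction coefficient, here 0.6084
ratios = [mi_after_n_bscs(delta, n + 1) / mi_after_n_bscs(delta, n)
          for n in range(5, 15)]
print(ratios[-1], eta)           # successive ratios approach eta from below
```

So without any processing at all, the mutual information through a chain of BSCs already decays geometrically with ratio eta, which is the exponential contraction the answer refers to.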
>>: So did you tell us the functions that achieve [indiscernible] --
Yury Polyanskiy: Yeah, I did. But it's exactly the same thing: the pair of distributions is this 1 minus t
times a point mass at zero, plus t times [indiscernible]. I mean there is nothing funny.
>>: But then the function is what?
Yury Polyanskiy: Oh, the processors, you mean.
>>: Yeah.
Yury Polyanskiy: Yeah, I forgot what it is. It's some kind of quantizer. Yeah. You mean -- you mean --
>>: Like a threshold --
Yury Polyanskiy: Yeah, yeah, yeah. It's some kind of threshold. And the idea is most of the time you
output zero, but you save your energy budget to sometimes flash a huge output. That's roughly the idea.
Honestly, I forgot.
>>: So you don't just threshold --
Yury Polyanskiy: No, no, no. You also have to have a dead zone where you map everything to zero.
>>: Yeah, outside, that's what I mean.
Yury Polyanskiy: Yeah, outside is just a binary threshold, yes.
>>: Yes. Inside is zero and outside --
Yury Polyanskiy: Yeah, something like this. Yeah, I mean we tried to look. I just forgot; it was
one of the first things we did.
Yuval Peres: Any other comments or questions?
Thank you again.