Yuval Peres: So we're delighted to have Yury Polyanskiy tell us about dissipation of information in
channels with input constraints.
Yury Polyanskiy: Thanks, Yuval.
Yes, so this is joint work with my colleague from Urbana-Champaign. And, okay, so in this talk there is basically one technique somewhere in the middle, but I will start with a problem. I'm mostly excited about the technique, not the particular problem that we solved using it.
But the problem might be of some interest -- I mean, it can be easily interpreted. So we have some original message, and you are allowed to process it, to encode it into some other real number, X1. The only thing I want is that every time you encode, the expected square stays bounded. So this is to prevent you from using too large a constellation. And then after you encode the original message into something, it's perturbed by Gaussian noise, and that's what gets fed to the next encoder. And then the next encoder, again, looks -- maybe it will first try to infer the original X0, which could be, for example, a binary bit, and then re-encode, and so on. But the idea is that you are allowed to do whatever you want as long as you don't violate the power constraint. And the question is: your goal is to design these processors so that, even after a great many hops, you can still reconstruct the original message. So, yeah, we call it a chain of Gaussian channels.
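To make the setup concrete, here is a toy Monte Carlo sketch (my own illustration, not from the talk) of this chain, with hypothetical relays that simply re-encode the sign at amplitude A -- a special case of the power constraint:

```python
import math
import random

def q_func(x):
    # Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0,1)
    return 0.5 * math.erfc(x / math.sqrt(2))

def simulate(n_hops, A, trials=20000, seed=0):
    # X0 = +-1; each relay sees Y = X + Z (unit-variance Gaussian noise)
    # and re-encodes the sign at amplitude A, so E[X^2] = A^2 stays bounded
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        x0 = rng.choice([-1, 1])
        x = A * x0
        for _ in range(n_hops):
            y = x + rng.gauss(0.0, 1.0)
            x = A if y >= 0 else -A
        agree += (x > 0) == (x0 > 0)
    return agree / trials

# Each hop acts like a binary symmetric channel with flip probability Q(A),
# so the agreement probability is (1 + (1 - 2*Q(A))**n) / 2, drifting to 1/2.
print(simulate(1, 2.0), simulate(50, 2.0))
```

With A = 2, one hop preserves the bit almost perfectly, yet after 50 hops the agreement probability is already close to 1/2 -- the asymptotic decoupling discussed next.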
So initially, when I started looking at it, it was completely obvious to us that asymptotically -- I mean, as n grows large -- there must be some asymptotic independence. And I used to start this talk by saying: okay, obviously we conjecture that each stage is confined by the energy budget, so you cannot de-noise completely, and therefore after a great many hops you basically have asymptotic decoupling. So the joint distribution between the initial and the last message is approximately a product.
>>: How do you de-noise? I mean is the X -- you --
Yury Polyanskiy: Yeah, by de-noise -- suppose X0 is plus or minus 1, for example, right. So then you can encode it into plus or minus a trillion, right? Then you convolve with the noise, you get some real number, so you threshold at zero and you again encode it to plus or minus a million.
>>: If F is not [indiscernible].
Yury Polyanskiy: Oh, yeah, F can be anything. But I will return to this point later.
Okay. So, yeah, I used to say that it's obvious and that my talk is to prove this obvious fact, but then I started to sample opinions before saying that. It turns out that some people actually believe that it's possible to preserve information. And the more advanced -- I'm an information theorist by training, so the more advanced the people you ask, the more of them will say so, because they know that -- okay, and they also used to say that everything I say works for arbitrary dimensions, so it doesn't have to be scalar, it can be arbitrary dimension. And so they will of course say: but error-correcting codes work, so there must be some way to preserve information. So nevertheless, yeah, I'll spare you this kind of thing -- I will show that indeed we have asymptotic decoupling, and I will gauge it in one of three metrics. So
it's the total variation distance, or TV; then the KL divergence -- I hope some people here know what KL divergence is; basically it's some kind of distance-like function. And then also we have asymptotic decorrelation. Each of these has an equivalent interpretation. For example, convergence in KL distance is exactly equivalent, just by the definition of mutual information, to convergence of the mutual information between the first and the last guy to zero. And decorrelation, which is important for one of the applications, is exactly equivalent to saying that the minimum mean-square error estimate of the original message, given the output, becomes asymptotically trivial. So just estimating by the mean gives you the optimal thing.
And roughly speaking -- I mean, if you know a little bit of information theory, or even some general probability -- it's easy to see that KL implies both of the others: this direction by Pinsker's inequality, this one by something called rate-distortion theory. So it was very natural for us to immediately consider KL, and, well, let's just try to do KL.
So let's see what kind of standard tools we have in our arsenal. The first tool is the famous data processing inequality, which says that the divergence -- the distance between output distributions, between distributions perturbed by some stochastic transformation, maybe discrete, maybe non-discrete -- just cannot expand. It only shrinks.
Okay, so the equivalent statement is that mutual information in Markov chains does not increase.
Okay, so what does it tell us about our chain? Something, of course, very obvious: basically you can apply it repeatedly and say that, okay, the sequence of mutual informations cannot increase. But of course it doesn't tell you that it has to decrease, right? So that's the only thing you know.
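A quick numerical check of the data processing inequality (a generic sketch; the binary symmetric channel and the two input distributions are arbitrary choices of mine):

```python
import math

def kl(p, q):
    # KL divergence D(p || q) for finite distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push(p, K):
    # output distribution of channel matrix K (rows = P(y|x)) on input p
    return [sum(p[i] * K[i][j] for i in range(len(p))) for j in range(len(K[0]))]

K = [[0.9, 0.1],
     [0.1, 0.9]]              # binary symmetric channel, crossover 0.1
P, Q = [0.8, 0.2], [0.3, 0.7]
d_in = kl(P, Q)
d_out = kl(push(P, K), push(Q, K))
print(d_in, d_out)  # data processing: the output divergence can only shrink
```

For this channel the contraction is even strict, consistent with the strong version discussed next.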
So this suggested that what we need is some quantitative version of the data processing inequality. Maybe we can strengthen the statement that it cannot expand, right? Okay, so there is such a thing, and it's called the strong data processing inequality. Actually, for many people who heard this talk it was a big surprise that there is such a thing. So I want to spend some time on this inequality. What it says is that for actually most channels, you can insert a multiplicative constant here, which I denote by eta KL, the contraction coefficient. So then not only does the divergence decay multiplicatively -- this is equivalent to saying that mutual information decays multiplicatively, too. Okay.
So now we have this eta KL, which is formally just defined as the [indiscernible] ratio of output to input divergences, or mutual informations. There's a famous result by Ahlswede and Gács that for discrete indecomposable channels this coefficient is strictly less than 1. What are indecomposable channels? Basically, you don't want any zero-error effects there; you want every input to spread to all possible outputs, roughly speaking. But basically everything -- the binary symmetric channel, the binary [indiscernible] channel -- all this stuff is indecomposable.
Okay, so this coefficient, eta KL, has a lot of funny connections -- a lot of connections to some cute topics. In the original Ahlswede-Gács paper, they didn't care about this strong data processing; what they cared about was hypercontractivity, which I'll mention a little bit later. And they showed that the worst possible hypercontractivity [indiscernible] is actually exactly given by this coefficient eta KL. Okay, and then there are some connections to log-Sobolev inequalities and so on, so let me just focus on --
>>: Sorry, I'm being a little slow, but what exactly does the diagram mean? What do the two arrows on the left mean and what do the two arrows on the right mean?
Yury Polyanskiy: Yeah, sorry, so this is a good question. I apologize to everybody who watches the
video, and there are many people like this.
>>: [Inaudible]
Yury Polyanskiy: Oh, they can see the board? Oh, yeah, of course, yes.
Okay, so what I meant to say when I drew something like this is just that, okay, you have this stochastic kernel, so this just means by definition that P_Y is just the integral over dP_X of P_{Y|X=x}. Right. So that's what I mean. It's just the averaging of the distribution -- so for example, the probability of a set E is just the averaging of this function. I mean, nothing fancy. It's just the marginal of the joint: you form the joint, P_X times P_{Y|X}, and then you compute the marginal in Y.
>>: I see. And Q is just another example of the --
Yury Polyanskiy: Yeah, exactly. So that's what I said, basically the way that --
>>: Same transition rule, but different --
Yury Polyanskiy: Yeah. Because the way information theorists frequently think about channels is as something that maps distributions, right. So P_X is mapped to P_Y, or one pair is mapped to another pair. Okay, yeah, so -- all right. So now, of course, if this contraction coefficient is less than 1, then this mutual information -- again, by the same repeated application -- you get exponential convergence to zero, right?
Okay, now the sad part is that for my channel, this contraction coefficient is actually 1. The good news is that for a very slight modification -- and I'll spend some time on this -- if you replace this expected-power constraint with an absolute amplitude constraint, then this contraction coefficient will be strictly less than 1.
So let me explain how to derive this in a fancy way. This is a photo of Roland Dobrushin, who is my personal hero, and I couldn't resist inserting something here about him. So Dobrushin defined this coefficient, which is the same as eta KL, but I denote it eta TV; this is the contraction ratio of total variation. Again, this is a [indiscernible] channel. And of course it's very easy to see, since this is just an L1 norm, that the worst-case input is just a pair of delta functions. So it's very easy to compute. And the original motivation for studying this coefficient was the mixing of Markov chains, at least for Dobrushin.
Okay, so this is a slight digression, because I know that there are many people here who like Markov chains. Of course you all know that Markov chains frequently mix exponentially fast, right? And these coefficients -- eta TV, eta chi-squared, eta KL, the one I will use in this talk -- just correspond to mixing in various different distances. So you know that this one is typically obtained from log-Sobolev, this one is what is called the spectral gap, and so forth.
Yes, so then some people studied the general relations between these contraction coefficients for discrete and general channels. For example, for finite-state Markov chains there's a funny statement that eta KL is less than 1 if and only if eta TV is less than 1. And eta TV is easy to check, right, because you just need to check on delta functions. For eta KL, you don't know a priori.
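Since eta TV only needs to be checked on delta inputs, it is just the maximal TV distance between rows of the transition matrix. A small sketch (the 3-state matrix is an arbitrary example of mine):

```python
def tv(p, q):
    # total variation distance between finite distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def push(p, K):
    # one step of the chain: distribution p times transition matrix K
    return [sum(p[i] * K[i][j] for i in range(len(p))) for j in range(len(K[0]))]

def eta_tv(K):
    # Dobrushin contraction coefficient: the sup over input pairs is
    # attained at pairs of delta inputs, i.e. pairs of rows of K
    return max(tv(ri, rj) for ri in K for rj in K)

K = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
eta = eta_tv(K)
p, q = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
print(eta, tv(push(p, K), push(q, K)))  # second number <= eta * tv(p, q)
```

Iterating, any two initial distributions contract in TV at rate at most eta per step, which is exactly the mixing statement mentioned above.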
So then there is this more recent result that the contraction coefficient in chi-squared is exactly equal to the one in KL. There is some technical difference here: this statement does not, of course, imply the incorrect statement that the log-Sobolev constant is always equal to the spectral gap, which we know is not true, but it might appear like it. If you're interested, I can explain what the difference is. But let me proceed for now.
Okay, so now -- this is another slide which is tangential to my talk in some sense, but I wanted to say a little bit about this theorem. So these six people proved the result that eta KL, which is tough to compute, is actually always upper bounded by eta TV. So why is it interesting? Here is a fun implication. Let's consider a finite-state Markov chain, and let T be its transition operator. Then let me denote by E the operator that averages a function with respect to the invariant distribution -- so this is just the matrix whose rows are all P star, right. Now suppose the rows of this stochastic matrix are all closer than one-half in total variation to P star; then it implies, by the triangle inequality, that eta TV will be strictly less than 1, right? So you can insert P star here. And then this theorem, which is the same result, kind of amplifies this simple [indiscernible] inequality into the statement that eta KL is less than 1. And then by Ahlswede-Gács, this stochastic operator will be hypercontractive: you can upper bound the q norm by a slightly smaller [indiscernible]. Okay, and this tells you not just that the estimate exists -- they give an explicit estimate; they say take p to be eta KL times q. So if you replace one-half here with something slightly smaller, you get an explicit estimate for which operators are hypercontractive with constant 1.
Yes?
>>: Again, trying to slow you down -- so I understand that eta total variation, that's the worst-case contraction.
Yury Polyanskiy: Yes. Yes.
>>: What is eta KL in terms of the log-Sobolev constant?
Yury Polyanskiy: It's effectively equal to --
>>: It is the log-Sob?
Yury Polyanskiy: Yes, because you see -- there are many log-Sobolev constants, right. There are 2-log-Sobolev, 1-log-Sobolev, modified log-Sobolev, log-Sobolev for continuous time, for discrete time --
>>: What is it in terms of the log-Sobolev constant where you have a --
Yury Polyanskiy: So this is -- you mean continuous time, right?
>>: Yes.
Yury Polyanskiy: So for continuous it would be PT; right?
>>: Yes.
Yury Polyanskiy: But that is not connected, because I'm talking about discrete semigroups.
>>: [Indiscernible].
Yury Polyanskiy: So okay, so --
>>: [Indiscernible].
Yury Polyanskiy: Yeah, so there is --
>>: [Indiscernible].
Yury Polyanskiy: So if you define something like this, okay --
>>: No, [indiscernible].
Yury Polyanskiy: Uh-huh.
>>: [Indiscernible] with log E. On the right I want F.
Yury Polyanskiy: I don't think it's related to this one. So it's related to this one.
>>: To the modified.
Yury Polyanskiy: Yes, to the modified. For the modified, there is a statement which says eta KL is always upper bounded by one minus alpha and lower bounded by one minus some absolute constant times alpha. But of course for semigroups, frequently, the modified log-Sobolev equals the usual log-Sobolev, right?
>>: [Indiscernible] but not always.
Yury Polyanskiy: Not always, yes. But in this sense, yes. It's not a universal connection.
>>: For random [indiscernible] there is no log-Sobolev for the modified.
Yury Polyanskiy: Yes, yes, yes.
Anyways, log-Sobolev is not going to be mentioned here. What I wanted to say is -- you see, this connection was made without it. I think you're asking because hypercontractivity is frequently derived from log-Sobolev, right, by integrating log-Sobolev. But this was a completely discrete statement; they didn't talk about log-Sobolev at all -- it's just a different method. So anyways, the punchline here is that, in the space of stochastic matrices, there is this one matrix E whose rows are all P star, right, and there is a non-vanishing ball surrounding it: for every operator inside this ball, its Lp to Lq norm is 1. So it's a slightly funny situation: you move away from pure independence, but there is a planar face, so to speak, in this ball of matrices whose norm stays upper bounded by 1. And this effect is actually -- yes, so let's ignore this.
And this effect is funny because it was actually discovered in the Euclidean space first, by Segal and Fefferman in the '70s, and then generalized to arbitrary operators in '88. So basically it's the usual effect: there is a ball surrounding an operator that integrates with respect to a probability measure, within which the Lp to Lq norm stays constant. But here it's basically a one-line proof for the discrete case, following from that upper bound.
Okay, so let me return to my talk. We're still talking about this chain, and we're talking about the amplitude constraint, right: all X are now required to be bounded in amplitude. So now we apply that estimate, eta KL upper bounded by eta TV. And here it's very easy to estimate the TV contraction, because you just want to take the two points plus and minus A. So -- I'm in an engineering department -- I'm going to use the Q function, which is the complementary CDF of the Gaussian distribution. So I apologize to everybody who likes the Phi better.
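For this channel the extreme delta inputs are at plus and minus A, so eta TV is the TV distance between N(+A, sigma^2) and N(-A, sigma^2), which works out to 1 - 2Q(A/sigma). A small numeric cross-check of mine:

```python
import math

def q_func(x):
    # complementary CDF of the standard Gaussian
    return 0.5 * math.erfc(x / math.sqrt(2))

def tv_numeric(A, sigma=1.0, lim=20.0, grid=200000):
    # numeric TV between N(+A, sigma^2) and N(-A, sigma^2), midpoint rule
    h = 2 * lim / grid
    c = 1.0 / (sigma * math.sqrt(2 * math.pi))
    total = 0.0
    for k in range(grid):
        y = -lim + (k + 0.5) * h
        total += abs(math.exp(-(y - A) ** 2 / (2 * sigma ** 2))
                     - math.exp(-(y + A) ** 2 / (2 * sigma ** 2)))
    return 0.5 * c * total * h

A = 1.5
print(1 - 2 * q_func(A), tv_numeric(A))  # the two numbers should agree
```

Since this value is strictly below 1 for every finite A, the amplitude-constrained channel does contract in TV.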
Okay, so what we derive from this is that for every amplitude-constrained channel you have exponential convergence of mutual information to zero. And, surprisingly, this is the entire proof, right -- two lines. There is actually a sequence of papers dealing with exactly this question for so-called Gaussian line networks, and they derive a worse exponent than these two lines. So I'm just saying this theorem, that eta KL is upper bounded by [indiscernible], is very overlooked, and very few people actually know about it. I think it was rediscovered at some point, in particular by Nick Law.
All right. Okay, so what about the average power constraint? If you're an analyst, at this point you can say: okay, come on, truncation, right? Some kind of truncation. Well, it turns out, no -- in this particular case, no. That's what we thought too; basically, after the amplitude-constrained result we didn't want to work on this. But my friend actually found the counterexample. He said: wait a second, what if you put some mass at zero, and the other mass is traveling farther and farther apart? So then, when you mix them, the only thing contributing to the total variation is this moving part, right. But when you convolve with the Gaussian, you basically get two very far apart Gaussians. So as t, the total variation between the original distributions, decreases, it turns out that eta TV converges to 1. So basically the same binary distributions with two masses show that the divergences and mutual information all do not contract. So there is no contraction.
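The counterexample can be checked numerically. A sketch under my reading of the construction: mass t at a far point b (with t * b^2 equal to the power budget) mixed with mass at zero, compared against a pure point mass at zero:

```python
import math

def tv_shifted_gaussians(b):
    # TV( N(0,1), N(b,1) ) = 2*Phi(b/2) - 1 = erf( b / (2*sqrt(2)) )
    return math.erf(b / (2 * math.sqrt(2)))

P_budget = 1.0
ratios = []
for b in [2.0, 5.0, 10.0]:
    t = P_budget / b ** 2   # mass t at point b keeps E[X^2] = t*b^2 = P_budget
    tv_in = t               # TV( (1-t)*d_0 + t*d_b , d_0 ) = t
    tv_out = t * tv_shifted_gaussians(b)  # the d_0 parts cancel after convolution
    ratios.append(tv_out / tv_in)
print(ratios)  # creeps up to 1: no strong TV contraction under average power
```

As b grows (and t shrinks to respect the power budget), the output-to-input TV ratio approaches 1, so no channel-wide contraction coefficient below 1 exists.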
>>: [Indiscernible] what, do you vary the t to zero, or what?
Yury Polyanskiy: Yes, so t is the distance between P_X and Q_X -- the TV distance, right. Because this part just cancels in the subtraction, so you just get t. So as t goes to zero -- so this basically proves there is no hope of exponential convergence, right? But maybe there is sub-exponential convergence. So this is our final result: for this particular chain you have super-slow convergence, right -- I mean, we have super-slow estimates. Basically it's 1 over log n, where n is the number of steps.
>>: As you're going along, with the two signals you have, their only difference would be a translation -- by a different Gaussian.
Yury Polyanskiy: Right --
>>: And in this -- when you're looking at these ratios, you consider it too general.
Yury Polyanskiy: Right, yes.
>>: But translations are very special.
Yury Polyanskiy: Yes, so you're saying that maybe you can exploit the fact that after a certain number of iterations you are not dealing with arbitrary distributions. But notice that there is this F2, right. So, for example, Y1 is a convolution, so it's a very smooth distribution, because it's something convolved with a Gaussian. But then you apply some arbitrary function, right, and it can be some crazy stuff. So this makes the distribution of X2 something crazy again. Yeah, but obviously there is no way around this, as the result shows. And again, these estimates were so slow that it was just painful, so we actually made sure that our estimates were more or less tight: up to these log log n factors, there is a matching lower bound too. So you can't improve it by much.
Okay, so what I'm going to talk about next is how to prove it -- what's the idea. All right. So the idea is the following. The strong data processing inequality says how much the input divergence contracts, right? A more precise characterization would be: what if we just compute the full curve? We do the full thing. So we take the two-dimensional plane, and for every pair of distributions we compute the input divergence and the output divergence, put the point here, and iterate over all pairs of distributions. The data processing inequality tells us that all the points will be below the diagonal. Strong data processing would tell us that all the points are down here. We know that's not true, right. But maybe the situation is like this. And why would that be good? Well, because then what we can do is start from somewhere -- we don't know where, but from some point -- and proceed with this iteration. And if this curve stays strictly below the diagonal, then the iteration converges to zero, right? So this was our hope. So basically the goal is to find this curve. Now, this was very exciting to us, because we take the Gaussian channel, which is something that has been studied since 1948 -- it's in the very first paper of Shannon; the most famous thing is one-half [indiscernible] plus P, right? The Gaussian channel is beaten to death. Here we want to associate to the Gaussian channel some curve -- something new to say about the Gaussian channel. So we spent some time trying to make sure nobody had actually done it. Seems no. And that was surprising. So we were excited at this point, and we said, okay, let's do this, let's calculate this thing. Yeah, so basically, once this curve is curved a little bit, then we are done.
Now, then there was sad news. At this moment we tried to compute F_KL and realized nobody could have computed it before, because it's actually exactly the straight line. So there's nothing good. And I don't know why we didn't give up at this point, because it was pretty discouraging for everybody: you're trying to prove a result which obviously holds, right, and nobody cares about, and by this time we had already tried a few other things, which I won't mention, and it doesn't work. But at this point, for some reason, we also decided to try F_TV, and magically it turned out that for F_TV there actually is a contraction. And to honor our hero, Dobrushin, we decided to call F_TV the Dobrushin curve of the channel.
So first of all, this is the strategy for the proof, which I'm sure I'm not going to have time to go over. The idea is to first work with TV, with total variation, to show you cannot transmit one bit, and then upgrade it to a general input, a general message X0. Then there is a trick for going from decoupling in TV to decoupling in KL; in our case we actually use a special property of the Gaussian noise here. Then there's a general trick to go from decoupling in TV to decorrelation. Anyways, we will see if we can get there.
Okay, so let's start with transmitting one bit. Suppose X0 is plus or minus 1 equiprobably, and you're trying to show that there does not exist a test which distinguishes plus 1 from minus 1. The only thing you need to do is propagate: you introduce two measures, P and Q -- one conditioned on plus 1, the other conditioned on minus 1 -- and just look at how these two measures move along this chain, right.
Okay. So, right. So then here's the formal definition of F_TV. F_TV is just the supremum of the total variation between P_Y and Q_Y -- over all TV [indiscernible] at the output -- given that the total variation at the input is less than t, and that there is a power constraint: the average power under P and Q is less than P.
And of course the Dobrushin coefficient is just the maximal slope of the Dobrushin curve, which is attained at zero.
Okay. Right, so now here's another surprise. I told you we were able to prove that F_TV is slightly below the diagonal; we actually found it exactly. So this is not an estimate; this is the exact value. So now the interesting thing -- another interesting thing, and at this point of course we didn't know the speed of convergence -- is that once we found F_TV, we realized it's a funny curve because of this expression: it's smooth but not analytic. At zero, the first derivative is 1 and all the higher derivatives are zero. So that's why your convergence is so slow; that's why this iteration gives 1 over log n. So basically the curve [indiscernible] the straight line to a very high degree.
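The qualitative mechanism can be seen with a toy curve of my own (a stand-in, not the talk's exact F_TV): take F(t) = t(1 - e^{-1/t}), which is strictly below the diagonal for t > 0 yet agrees with it to all orders at zero, and iterate:

```python
import math

def F(t):
    # Toy Dobrushin-like curve: F(t) < t for t > 0, but F'(0) = 1 and all
    # higher derivatives vanish at 0 -- smooth yet non-analytic at the origin
    return t * (1.0 - math.exp(-1.0 / t)) if t > 0 else 0.0

t = 0.5
trace = {}
for k in range(1, 10001):
    t = F(t)
    if k in (10, 100, 1000, 10000):
        trace[k] = t
for n, v in trace.items():
    print(n, v)
```

The iterates decay to zero, but only at a 1/log n pace: after 10,000 steps the value is still roughly a tenth, matching the slow rate stated above.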
Okay, so what's the idea? The idea is the following. You give me two distributions, P_X and Q_X, whose total variation is t, right. And we are trying to bound how the total variation shrinks under convolution with the Gaussian noise. That's our job. And the idea was the following. Suppose we couple P_X to Q_X; so under this coupling, in the good situation, X minus X prime would actually be equal to zero, and then you can also couple the noise so that Y will be exactly equal to Y prime. But you don't know what to do when X and X prime are not equal to each other. Then it turns out that this cannot happen too often, because of the power constraint. Now, if I just described this to you, you'd say: okay, you can obviously convert it into some bound, but why do you get exact estimates? So there is another place where we got lucky. So this is the full proof. I don't know if you want to -- it's a very easy exercise. After being lucky so many times, it was just very easy to do this.
Yes?
>>: [Indiscernible].
Yury Polyanskiy: Okay. So let's go through it. Actually, I said I don't need a laser pointer; I believe it would be useful now. But I just have to find it. Yeah, okay, thanks. It would be cooler if it just dropped from the ceiling.
>>: Pretty cool.
Yury Polyanskiy: Yeah. So one function we need to introduce, associated to the channel, is how the total variation distance interplays with the Euclidean distance on the input space. So it's just this thing. It's very easy to compute; it's something involving the Q function.
Okay, now, here's where we were lucky the first time. TV, total variation, is the only f-divergence which is also a Wasserstein distance. If you don't know what an f-divergence is, it doesn't matter. As any Wasserstein distance, it's convex in the pair, so what we do is average over delta input distributions. This gives you this estimate: the expectation of g of X minus X prime. Now, in arbitrary dimension -- even if this is not scalar -- the total variation between shifts of the isotropic Gaussian only depends on the distance between X and X prime, because you can always rotate. So you can put the absolute value inside, then you apply Jensen: you condition on X not equal to X prime, apply Jensen, and then estimate this distance. So you have an estimate in terms of first moments, but you can estimate it by second moments, right. And that's it.
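As a sanity check on the convexity/coupling step, here is a tiny numeric example of mine: P a fair mixture of point masses at plus and minus 1, Q a point mass at 0, coupled so that |X - X'| = 1 always; the TV after adding N(0,1) noise should sit below g(1):

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def phi(y, mu=0.0):
    # standard-normal density centered at mu
    return math.exp(-(y - mu) ** 2 / 2.0) / SQRT2PI

def g(d):
    # g(d) = TV( N(0,1), N(d,1) ) = 2*Phi(d/2) - 1 = erf( d / (2*sqrt(2)) )
    return math.erf(d / (2.0 * math.sqrt(2.0)))

def tv_numeric(f1, f2, lim=15.0, grid=100000):
    # numeric TV between two densities on a grid
    h = 2.0 * lim / grid
    return 0.5 * h * sum(abs(f1(-lim + (j + 0.5) * h) - f2(-lim + (j + 0.5) * h))
                         for j in range(grid))

# P = (d_{-1} + d_{+1})/2 and Q = d_0; under the coupling X' = 0 we have
# |X - X'| = 1, so the bound reads TV(P*N, Q*N) <= E[ g(|X - X'|) ] = g(1)
lhs = tv_numeric(lambda y: 0.5 * phi(y, -1.0) + 0.5 * phi(y, 1.0), phi)
print(lhs, g(1.0))  # left side sits strictly below the bound
```

The gap between the two numbers is exactly the slack the convexity step gives away on this example.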
So basically, after we obtained all this, we asked ourselves why we were so lucky. And it turns out that, yes, one piece of luck was that this is TV, not an arbitrary f-divergence, so we can do the coupling trick. And the second is that this works for any unimodal symmetric noise -- symmetric meaning the noise density depends only on the distance to zero.
Okay, so, yeah, for the lower bound you just take the same distributions I mentioned before; they attain the bound exactly. Okay, I don't know, should I -- it seems like -- yeah, it's basically pretty trivial.
Okay, there are many more generalizations than you want to know. The most important one is that you can [indiscernible] arbitrary dimension, at least for Gaussian noise.
All right; these are two slides that I inserted here. So if PDEs are something that excite you, then you can interpret this result exactly as a statement about PDEs. So let's take the heat equation. You solve the heat equation on R^m with some initial condition f(x), and then you know, just by integration by parts, that the integral of the solution is of course preserved, but the L1 norm decays: the L1 norm of the solution at time t decays to the absolute value of the initial average. The question is how fast. So we give a best-possible estimate of this [indiscernible], which is given here. This is just an interpretation, right: convolution with a Gaussian is exactly the solution of the heat equation. And the funny thing is -- I don't know, I'm not a PDE person, so I don't know if they have studied this -- what this lets you do is trade: it gives you a 1-over-square-root-of-log-t estimate of the decay of the L1 norm. Of course, at what price? It doesn't come for free. The price is a tail bound on the initial condition. So we trade knowledge of the tails of the initial condition for the speed of decay in the L1 norm.
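A numeric sketch of this heat-equation reading (my own toy example, not from the talk: a one-dimensional initial condition with a positive and a negative bump whose integral is 0.2):

```python
import math

def bump(x, mu, s=0.5):
    # Gaussian bump of unit mass centered at mu, width s
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def l1_after_time(f_vals, xs, t):
    # convolve f with the heat kernel exp(-x^2/(4t)) / sqrt(4 pi t),
    # then return the L1 norm of the solution on the grid
    h = xs[1] - xs[0]
    c = h / math.sqrt(4 * math.pi * t)
    total = 0.0
    for x in xs:
        u = c * sum(fv * math.exp(-(x - xj) ** 2 / (4 * t))
                    for fv, xj in zip(f_vals, xs))
        total += abs(u) * h
    return total

xs = [-20.0 + 0.1 * k for k in range(401)]
f_vals = [0.6 * bump(x, -2.0) - 0.4 * bump(x, 2.0) for x in xs]  # integral 0.2
norms = [l1_after_time(f_vals, xs, t) for t in (0.1, 2.0, 20.0)]
print(norms)
```

The L1 norm starts near 1 (the two bumps barely overlap) and decays toward the preserved integral, 0.2, as the positive and negative parts cancel under diffusion.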
So I thought that was funny. So another punchline -- sorry, another corollary we will observe -- is that you can also derive something about the CLT; this is a CLT in what we call smoothed total variation.
So let me go over this. Remember that the intermediate step was to introduce this coupling between convolved versions of P and P prime, and estimate like this, right. Then we applied Jensen to get this thing. And in our proof we just estimated the expectation of X minus X prime through second moments -- the passage from the L1 to the L2 distance. But actually, why don't we take the best coupling between P and P prime, the one that minimizes the L1 distance? Then you get an estimate in terms of Wasserstein distance, right. And of course the Wasserstein distance, the W1 distance, is something we all love, because there's Stein's method for deriving quantitative estimates of convergence in Wasserstein distance. So you get an estimate like this: give me any [indiscernible] i.i.d., zero mean and unit variance, add a tiny little bit of Gaussian to the [indiscernible] sums, and then it will be close in total variation to the Gaussian with variance 1 plus sigma squared. And here's an explicit estimate of the [indiscernible]. So this is why we call it smoothed total variation: because, of course, if you said, well, I don't care about the smoothed version, what about just the original -- of course there is no convergence in TV, because [indiscernible] is discrete and you'll always have distance 1. Actually, Prokhorov showed that to have convergence in the non-smoothed, usual TV, the only thing you need is that after a few convolutions you have some non-trivial absolutely continuous component. But what is more fun is that there is a best estimate from '62. I don't know if anybody here knows about this; I was surprised there's an exact estimate of the asymptotics of the speed of convergence in total variation, which I had never heard about.
>>: [Indiscernible] smooth thing very [indiscernible].
Yury Polyanskiy: Well, [indiscernible] estimates [indiscernible] distance, right. Here I'm talking about TV. But exactly, yeah, the estimate here is of the same order.
>>: And the last one -- are you assuming that it has an absolutely continuous component?
Yury Polyanskiy: Of course. Yes. It's under the same condition, yes. And moreover, there is a mistake here: it's not the absolute third moment, just the third moment without the absolute value. So if you have zero skewness it's zero, and then the first-order term is less than 1 over square root of n. But our estimate is of the right order, typically, unless you have zero skewness. So again, just some tangent that I thought you would maybe like.
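The smoothed-TV CLT is easy to watch numerically. A sketch of mine with fair plus-or-minus-1 coins (which have zero skewness, so convergence is fast) and smoothing sigma = 0.3:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def norm_pdf(x, var):
    # density of N(0, var)
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(var) / SQRT2PI

def smoothed_sum_pdf(x, n, sigma):
    # density of (X_1 + ... + X_n)/sqrt(n) + sigma*Z for fair +-1 coins X_i:
    # a binomial mixture of Gaussian bumps
    return sum(math.comb(n, k) * 2.0 ** (-n)
               * norm_pdf(x - (2 * k - n) / math.sqrt(n), sigma * sigma)
               for k in range(n + 1))

def tv_to_gaussian(n, sigma=0.3, lim=8.0, grid=4000):
    # numeric TV distance to the matching Gaussian N(0, 1 + sigma^2)
    h = 2.0 * lim / grid
    return 0.5 * h * sum(
        abs(smoothed_sum_pdf(-lim + (j + 0.5) * h, n, sigma)
            - norm_pdf(-lim + (j + 0.5) * h, 1.0 + sigma * sigma))
        for j in range(grid))

tvs = [tv_to_gaussian(n) for n in (4, 16, 64)]
print(tvs)  # shrinks with n: the smoothed sums approach the Gaussian in TV
```

Without the sigma*Z smoothing the TV distance would stay at 1 here, since the unsmoothed sums are lattice-valued, which is exactly the point of the smoothed formulation.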
So now we're done with the convergence to zero when X0 is binary. So let me mention a few things. When am I supposed to finish? At 4:30, or something like that?
Yuval Peres: Something like that.
Yury Polyanskiy: So then let me just quickly mention how to get these arrows. First of all, to get from the binary to the general case, it's convenient to introduce this T-information. This is basically the distance to the product distribution in total variation, and then, just by some concavity, you get the correct statement: this total variation distance decays at the same speed -- it satisfies the same data processing inequality under F_TV.
Okay, so now you can upgrade T-information to Shannon information by some pretty easy coupling arguments. If X0 is finite, then you can do some coupling trick, and if it's general, then you need to use some discretization. And I'm sure you don't need to know about this. But anyways, there is some analysis here.
Now, for how to get correlation from TV: again, as I mentioned in the beginning, if you have
convergence of mutual information, then you have convergence of correlation, just by rate distortion.
But if you want the speed of convergence, if you want explicit [indiscernible], you need to bound some
higher moments. For example, if you have Gaussian X zero, then just by applying, again, some coupling
plus the Hölder inequality, you get, again, the log log n over log n estimate for the correlation.
Okay, now, this I can definitely skip. Well, one punchline here is -- remember that I showed you that
F_KL of t was actually the straight line, so it doesn't contract at all. It turns out the same is true for
all Rényi divergences of order 1 and above. So KL divergence -- Rényi of order 1 -- also does not
contract, and neither does anything above, but everything below 1 actually does contract. So I don't know.
So I have about fifteen minutes. Let me mention some applications and implications of this stuff. So
here's one version of the main result, just to recap. We have an arbitrary chain of processors, which
potentially -- I mean those links could be infinite-dimensional, right; there could be buses
connecting computers together. As long as the total L2 energy expended in the communication is
bounded, you will have mutual information converging to zero as 1 over log n.
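A toy simulation of this chain (a sketch with illustrative parameters, not the scheme from the paper): one bit is relayed through Gaussian channels by the naive linear relays that simply rescale to meet the power constraint. Linear relays lose correlation exponentially fast; the theorem says even optimally designed nonlinear relays cannot keep the mutual information above order 1 over log n.

```python
import numpy as np

rng = np.random.default_rng(0)
P, sigma2, hops, trials = 1.0, 0.25, 20, 200_000

x0 = rng.choice([-1.0, 1.0], size=trials)   # the original bit X0
x = np.sqrt(P) * x0                         # encode with full power P
for _ in range(hops):
    y = x + rng.normal(0.0, np.sqrt(sigma2), size=trials)  # Gaussian hop
    x = y * np.sqrt(P / np.mean(y * y))     # rescale so E[X^2] = P

corr = np.corrcoef(x0, x)[0, 1]
print(corr)   # decays like (P / (P + sigma2)) ** (hops / 2)
```

Each hop multiplies the correlation with X zero by the square root of P over (P plus sigma squared), and the rescaling does not change correlation, which is why the linear strategy decorrelates exponentially.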
>>: You talked about other convex functions. So you still get the same rate if you have X zero to the 1.1 or --
Yury Polyanskiy: No, the rate changes, yes. I think the rate changes.
>>: It's a different function, it's not --
Yury Polyanskiy: It's still log n, but it's a different power; it's something like minus p over 2.
>>: If you just have a first moment condition [indiscernible] --
Yury Polyanskiy: I don't remember. I mean for TV, yeah, I think it's actually still log n.
But I need to double-check the paper.
Okay, so anyways, here is the sort of philosophical implication, right. So you store something on the
hard drive, right. Every once in a while you decide to copy it, while the copier only has finite energy,
right. And what this means is that after a great many iterations, you can't store even a single bit
written here, under arbitrary error correction, right. So yeah, and of course the punchline, yes, I
understand that everybody --
>>: [Indiscernible].
Yury Polyanskiy: Yes, very good, very good, exactly. So 1 over log n, which basically killed us. But
now here's a more fun example. If you talk about a memory controller -- well, a memory controller does
exactly this: it reads memory cells, it writes memory cells, it reads, it writes, reads and writes. And that
happens pretty often; that happens 16 times a second. So in one year you get 2 to the 29 reads and
writes. So, you know, once you plug things in here, then the noise -- you know, the noise is never
exactly zero, right; there's always some thermal noise. I don't know, I didn't try to estimate what you get. I
was too bored, because all of those constants are actually hard to extract.
>>: [Indiscernible].
Yury Polyanskiy: Yeah, yeah, yeah.
>>: Specifically what happens if you allow the power to grow a bit slowly?
Yury Polyanskiy: No, then of course you're golden.
>>: But how -- well, it depends --
Yury Polyanskiy: I think square root of log n is enough. Or log n, something like this, yes.
>>: I mean that's kind of interesting philosophically, because, you know, if you want people to be able
to read your paper in a million years and you're prepared to assume that --
Yury Polyanskiy: Yeah.
>>: -- more energy available to them.
Yury Polyanskiy: Yeah. I mean you can't exceed the memory --
>>: [Indiscernible].
Yury Polyanskiy: Yeah, yeah.
Anyways, so this is one implication. Another implication -- this actually was the motivation of my
colleague, with whom we derived this originally. So there is this question that these two guys,
Lipsa and Martins in control theory, asked themselves. They said, okay, suppose I have a Gaussian
random variable, right. And my job is to preserve it intact. But I have to read and write it from time
to time. And when I read it, there is some exogenous noise which corrupts it. So what can I do?
What is the optimal way to read and write? What are the processing functions? So we call this a
memoryless controller, because it only has one real number to store, right. And when it reads it, it gets
convolved with Gaussian noise. So it's very easy to show that if you want this operation to work once,
for one read, a linear F1 is optimal. It's much less easy -- in fact, I didn't read the paper, but it looked
intimidating enough -- but for [indiscernible] still the linear controller is optimal. Why is this
intimidating? Because, you see, now when you try to optimize the mean-square error, you get a
composition of F2 and F1. And it's pretty tough to optimize over compositions. Anyways,
so then these same guys showed that actually, when you have greater
than five iterations, there is a non-linear scheme that actually outperforms any linear scheme. And
what our result shows is that, first of all, asymptotically you always decorrelate, so you can't store -- I
mean you're as good as storing zero. And the second thing is that the linear F_j's actually correspond to
a special restriction on the type of distributions -- exactly as was mentioned, they're all Gaussian.
And then it's easy to show that KL actually contracts. So therefore, you actually decay exponentially
fast. So yeah, just from that, you can immediately see that non-linear controllers eventually outperform linear ones.
So another application we discussed in the paper is -- so I don't know how many people here know
about Gibbs measures, but that is a game which is played in certain corners of science, where what you
do is -- maybe I shouldn't -- yeah, maybe here it's played often.
>>: [Indiscernible].
Yury Polyanskiy: Yeah, okay. All right. Anyways, given some conditional distributions, you try to
guess if there is a joint distribution which solves this problem, and, as I'm sure most of you know, the
idea is you try to see if there are multiple solutions; if so, it means that there is some phase transition
happening. The two-dimensional Ising model is of course the famous example. But the rule of thumb is
that if the links -- if this dependence is weak, if there is high temperature, then basically the nodes
cannot communicate too much, so there is just one solution; no phase
transition. Yeah -- like [indiscernible] temperature for [indiscernible], that's the
typical example.
All right. So one way to show uniqueness is the Dobrushin method. Dobrushin came up
with this ingenious idea. He said, well, let's do the contrapositive. Suppose there are two solutions;
let's run them side by side and let's couple all the neighbors using some coupling, right. If these
channels, which go from the neighbors to a given node, are contracting total variation, then you can
refine your coupling at this point, right. You can refine it and keep refining it until the total variation
of any finite-dimensional marginal is less than an arbitrary number, so the two solutions have to be the
same. So what can we do with our method? Basically what we're trying to do is upgrade this TV
contraction to some more general thing. So for example, we could prove some estimate like this.
Basically -- it's an ugly condition, because you try to be as general as possible -- if there is some
lower bound on all the potentials, then there cannot exist more than one Gibbs measure which does not
explode to infinity. So this is an additional qualifier: the Dobrushin method gives you the statement
for all measures, but beyond that there is an awesome kind of Schlausmann staircase trick where
something explodes to infinity.
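The coupling idea can be sketched numerically. This is an illustration with assumed parameters, not the estimate from the paper: run heat-bath Glauber dynamics for two one-dimensional Ising chains started from opposite configurations, feeding both copies the same randomness. At high temperature (here beta = 0.3, where the total influence tanh(2*beta) is below 1, so Dobrushin's condition holds) the disagreements die out and the two copies coalesce.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta, steps = 100, 0.3, 20_000   # ring of N spins; tanh(2*beta) < 1

def heat_bath(s, i, u):
    # heat-bath update of site i, driven by the shared uniform u
    h = s[(i - 1) % N] + s[(i + 1) % N]
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
    s[i] = 1 if u < p_plus else -1

a = np.ones(N, dtype=int)       # copy 1: all-plus start
b = -np.ones(N, dtype=int)      # copy 2: all-minus start
for _ in range(steps):
    i, u = rng.integers(N), rng.random()
    heat_bath(a, i, u)          # same site, same uniform in both chains:
    heat_bath(b, i, u)          # if the neighbors agree, the site must agree

print(np.mean(a != b))          # fraction of disagreeing sites
```

The design point is exactly the one in the Dobrushin argument: under the identity coupling, a disagreement at a site can only persist when its neighbors disagree, so when the influence coefficient is below 1 the expected number of disagreements contracts to zero.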
Okay, now the final application is to circuits of noisy gates. That's another area where contraction
coefficients were used very successfully to prove certain results. So what is the game here?
The game here is that suppose you have a circuit of gates with K inputs each, which computes some
binary function, for example. Then we say that f essentially depends on all its arguments, meaning
every argument influences the output at least sometimes. Then it's easy to show that the depth
should be lower bounded by log n. And then the next step was to say, okay, now suppose our gates,
instead of being ideal, have some noise added after each output, right. So for example, the
standard thing that people studied, starting from von Neumann, is Bernoulli noise. So there's a
Bernoulli built-in noise. And the question is, can you actually build an arbitrary -- because the
length of each path for a complicated function becomes log n. So as it grows, the noise actually
accumulates. So the question is: can you construct circuits such that the probability of error is
bounded away from one-half? And of course this whole field started when von Neumann showed that
actually, yes. So here is an architecture: by just using three-input majority voters, which requires
constant depth expansion and logarithmic gate expansion, he produces a circuit whose probability of
error stays bounded away from one-half.
>>: [Indiscernible].
Yury Polyanskiy: The depth -- yes, well, the depth, basically if you started with depth D, then your
depth becomes R times D. Now, for some circuits you need log n depth, yes. So then the depth becomes R
log n. Yeah. So that's von Neumann's method. And roughly speaking, it was shown that both the
constant depth expansion and the logarithmic gate expansion are actually required. Here the
function was XOR of everything -- XOR of all the inputs.
So now what is our contribution here? Well, we can replace the Bernoulli noise with Gaussian noise and
bound the energy of each gate, right. So now we don't care -- I mean maybe you use some funny
operational amplifiers, right; I don't know how they work. It was actually one of the critiques of
von Neumann's work -- he himself said maybe I should have considered analog wires as opposed to
digital wires. So one thing we can prove here is a lower bound on the probability of error:
asymptotically, as n grows to infinity, it approaches some number which is a fixed point of a certain
function. And then you can evaluate it as the SNR of each gate changes. I mean here
I chose from minus 20 to 20, and you get some lower bound on the probability of error.
And the funny thing here -- and this is the main difference from the Bernoulli noise of von Neumann
or Pippenger -- is that here my lower bound always stays bounded away from one-half. Okay, first of
all, maybe this lower bound is not tight, right. But actually there is a reason -- I don't know the
answer, but I think that you should -- yeah. For the Bernoulli noise, what actually happens is that when
delta, the noise, is strong enough, K-input gates -- let's say 3-input gates -- cannot have probability
of error different from one-half if epsilon, the noise, is greater than 1/6, for example. And under some
circumstances this is optimal. So here I don't expect anything like this, roughly speaking because of
the very, very slow contraction, 1 over log n, right. When you have exponential contraction, like fully
discrete channels, one way to overcome it is to have a lot of paths, right. Yeah, like [indiscernible]
has this fantastic paper on information flow on trees, right. So when the degree of the tree is large
enough, you basically can overcome the contraction of TV. But here there is nothing to overcome,
because it doesn't contract. So that's why, roughly, for all SNRs I expect you can compute. But I
didn't try to produce an [indiscernible] bound.
All right, so that's everything I wanted to say, and it's 4:30, so I'm on time. My takeaway message is:
if you don't have linear contraction, maybe look at something like this. Sometimes it works.
[Applause]
Yuval Peres: Any more questions?
>>: So linear [indiscernible], can you do something if the noise is Bernoulli?
Yury Polyanskiy: Noise is Bernoulli? Yeah, for Bernoulli noise it's very easy, because there is
contraction. The contraction coefficient is just (1 minus 2 delta) squared. So every time, the
mutual information contracts exponentially fast -- exponentially fast to zero.
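This is easy to check numerically. The sketch below is an illustration under the standard fact that (1 - 2*delta)**2 is the strong data-processing constant of the BSC: composing n copies of BSC(delta) gives one BSC with flip probability (1 - (1 - 2*delta)**n) / 2, and the ratio of successive mutual informations approaches that constant from below.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_after_n_bscs(delta, n):
    # n BSC(delta)'s in series act as one BSC with this flip probability
    dn = 0.5 * (1.0 - (1.0 - 2.0 * delta) ** n)
    return 1.0 - h2(dn)          # I(X0; Xn) for a uniform input bit

delta = 0.11
eta = (1.0 - 2.0 * delta) ** 2   # contraction coefficient, here 0.6084
ratios = [mi_after_n_bscs(delta, n + 1) / mi_after_n_bscs(delta, n)
          for n in range(5, 15)]
print(ratios[-1], eta)           # successive ratios approach eta from below
```

So without any processing at all, the mutual information through a chain of BSCs already decays geometrically with ratio eta, which is the exponential contraction the answer refers to.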
>>: So did you tell us the functions that achieve [indiscernible] --
Yury Polyanskiy: Yeah, I did. But it's exactly the same thing: the pair of distributions is this 1 minus t
times a point mass at zero, plus t times [indiscernible]. I mean there is nothing funny.
>>: But then the function is what?
Yury Polyanskiy: Oh, the processors, you mean.
>>: Yeah.
Yury Polyanskiy: Yeah, I forgot what it is. It's some kind of quantizer. Yeah. You mean -- you mean --
>>: Like a threshold --
Yury Polyanskiy: Yeah, yeah, yeah. It's some kind of threshold. And the idea is most of the time you
output zero, but you save your energy budget to sometimes flash a huge output. That's roughly the idea.
Honestly, I forgot.
>>: So you don't just threshold --
Yury Polyanskiy: No, no, no. You also have to have a dead zone where you map everything to zero.
>>: Yeah, outside, that's what I mean.
Yury Polyanskiy: Yeah, outside is just a binary threshold, yes.
>>: Yes. Inside is zero and outside --
Yury Polyanskiy: Yeah, something like this. Yeah, I mean we tried to look. I just forgot; it was
one of the first things we did.
Yuval Peres: Any other comments or questions?
Thank you again.