
>> David Wilson: We are happy to have the third talk of the day. Ronen Eldan will tell us about
the Gaussian noise stability deficit.
>> Ronen Eldan: Thank you. So far I have really enjoyed the talks in this seminar. We are talking about the Gaussian noise stability deficit, so let's try to understand what Gaussian noise stability means. Our starting point is actually the Gaussian isoperimetric inequality; let's see what that is. The setting in this whole talk is just R^n equipped with the standard Gaussian measure; this is its density. The Gaussian surface area of a subset of R^n is defined as the integral, over the boundary of the set, of the Gaussian density with respect to the (n-1)-dimensional Hausdorff measure. This is roughly the first-order rate at which the Gaussian measure increases when we take an epsilon-extension of the set. Now, the Gaussian isoperimetric inequality, proved initially by Borell and by Sudakov-Tsirelson somewhere in the '70s, says that the isoperimetric minimizers are half-spaces. In other words, out of all sets whose measure is some prescribed number, the set which minimizes the surface area is a half-space.
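In symbols, roughly (using the standard notation φ and Φ for the one-dimensional Gaussian density and distribution function, which is my notation rather than the slides'):

\[
\gamma_n(A)=\int_A \frac{e^{-|x|^2/2}}{(2\pi)^{n/2}}\,dx,\qquad
\gamma_n^+(\partial A)=\int_{\partial A} \frac{e^{-|x|^2/2}}{(2\pi)^{n/2}}\,d\mathcal{H}^{n-1}(x),
\]

and the Gaussian isoperimetric inequality reads

\[
\gamma_n^+(\partial A)\ \ge\ \varphi\bigl(\Phi^{-1}(\gamma_n(A))\bigr),
\]

with equality for half-spaces.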
What we'll consider is an extension of this isoperimetric inequality: an inequality concerning noise stability. So let's first try to understand what we mean by Gaussian noise. We say that x and y are jointly standard Gaussian with some correlation rho, a parameter between 0 and 1. One way to define this is that the coordinates of x and y are all normal variables, with the covariance matrix such that each vector is a standard Gaussian and the corresponding coordinates, say x_1 and y_1, have correlation rho between them, separately for each coordinate. Another, equivalent, way to define it is the following. We take three independent standard Gaussians, and we say that x and y have the common component sqrt(rho) times Z_1; then we add an independent component to each, so to x we add this and to y we add an independent copy of it. We can think of the common part as the actual thing we want to measure and of the rest as the noise: x and y are the same thing, and when rho is close to one there is some small noise which is distinct between x and y. We define the noise stability of a subset A of R^n as just the probability that both x and y are in A. Maybe it would have been natural to divide this by the product of the probabilities that x and y are in A. In some sense it measures how stable the set is to noise: given that x was already in A, and y is some noisy version of x, how likely is it that y is also in A? So that's the Gaussian noise stability.
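In symbols, the definition is (writing S_ρ for the noise stability):

\[
X=\sqrt{\rho}\,Z_1+\sqrt{1-\rho}\,Z_2,\qquad
Y=\sqrt{\rho}\,Z_1+\sqrt{1-\rho}\,Z_3,\qquad
S_\rho(A)=\mathbb{P}\bigl(X\in A,\ Y\in A\bigr),
\]

where Z_1, Z_2, Z_3 are independent standard Gaussians in R^n.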
Now, a theorem of Christer Borell from the mid-'80s says that half-spaces are not only isoperimetric minimizers; they also maximize the noise stability. So among all sets with a given Gaussian measure, if I want to maximize the probability that x and y are both in A, I want to take my set to be a half-space.
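Just to make the statement concrete, here is a minimal Monte Carlo sketch of my own (an illustration only; the function names, the dimension, the competing set and the parameter values are all arbitrary assumptions, not anything from the talk):

import numpy as np

rng = np.random.default_rng(0)

def noise_stability(indicator, rho, n=2, samples=200_000):
    # Monte Carlo estimate of S_rho(A) = P(X in A, Y in A) for
    # rho-correlated standard Gaussian vectors X, Y in R^n.
    z1 = rng.standard_normal((samples, n))
    z2 = rng.standard_normal((samples, n))
    z3 = rng.standard_normal((samples, n))
    x = np.sqrt(rho) * z1 + np.sqrt(1 - rho) * z2
    y = np.sqrt(rho) * z1 + np.sqrt(1 - rho) * z3
    return np.mean(indicator(x) & indicator(y))

# Two sets of Gaussian measure 1/2 in the plane: a coordinate
# half-space, and a centered disc whose squared radius is the
# median of |X|^2 for X ~ N(0, I_2), namely 2*log(2).
half_space = lambda p: p[:, 0] <= 0.0
disc = lambda p: np.sum(p * p, axis=1) <= 2.0 * np.log(2.0)

print(noise_stability(half_space, rho=0.5))   # about 1/4 + arcsin(0.5)/(2*pi) = 0.333
print(noise_stability(disc, rho=0.5))         # strictly smaller, as Borell predicts

For the half-space there is even a closed form, Sheppard's classical formula: S_ρ({x_1 ≤ 0}) = 1/4 + arcsin(ρ)/(2π).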
This is then an extension of the isoperimetric inequality, and it's not hard to see why when rho is very close to one. When the noise is very small, the probability that x will be in A and y will be in the complement of A is more or less proportional to the surface area, because x and y have to be close to each other; it's just a calculation that the first-order change of the stability with respect to rho is proportional to the surface area. So this extends the isoperimetric inequality.
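A sketch of that first-order relation, for nice sets (the rate is the part I'm confident of; I won't vouch for the exact constant c):

\[
\gamma_n(A)-S_\rho(A)\ =\ \mathbb{P}\bigl(X\in A,\ Y\notin A\bigr)\ \sim\ c\,\sqrt{1-\rho}\;\gamma_n^+(\partial A)\qquad\text{as }\rho\to1,
\]

so for rho near one, maximizing the noise stability forces the surface area to be nearly minimal.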
Okay. This result has many applications; it kind of connects many areas of mathematics. It's relevant in approximation theory, in rearrangement inequalities, in concentration in high dimension, and in related inequalities.
I just want to mention one discrete application of this, to the so-called Majority is Stablest theorem. This is due to Mossel, O'Donnell and Oleszkiewicz, and it is a kind of discrete version of the same thing, which states the following. We have a function defined on the discrete cube, and we can think about this function as an election system: it takes the votes of n different people, and the outcome is just 0 or 1, say who won the election. We think of the point of the cube as uniformly random, and then we can consider a noise: we can imagine, for example, that the people counting the votes sometimes make mistakes, so for each vote being counted there is a probability epsilon that they regenerate the vote randomly. Now let's say that we want to maximize the noise stability, so we don't want these errors to affect the final outcome. What the theorem says, roughly, is that the most stable choice, under a condition of low influences, which roughly means that no single voter has a big effect on the outcome (I don't want to define this precisely), is just the majority function: we sum up all the votes and check whether the sum is bigger than some threshold.
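To give a sense of the numbers, an aside of mine: if each vote is independently rerandomized with probability ε, the correlation between a vote and its noisy copy is 1-ε, and by the central limit theorem together with Sheppard's formula the probability that the majority outcome survives the noise tends, as the number of voters grows, to

\[
\mathbb{P}\bigl(\mathrm{Maj}(x)=\mathrm{Maj}(y)\bigr)\ \longrightarrow\ 1-\frac{\arccos(1-\varepsilon)}{\pi},
\]

and the theorem says that no low-influence election system does asymptotically better.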
The best real-life application of Borell's theorem that I could come up with is the following. Say we are collecting street cats: we are wandering the streets and we see street cats which have different properties, like their size and height and how loudly they meow. These are all real-valued properties which, I guess, we could expect to have a Gaussian distribution. Let's say that our goal is not to break up families of cats: if two cats are siblings, we want a high correlation between the events that I collect each of them, so I want to maximize the expected number of collected cats that are family members. Say our space has two parameters: the weight of the cat, is it light or heavy, and the complex argument of the cat, is it an imaginary cat or a real cat. I'm collecting more and more cats, and in the end I want to decide on my criterion for keeping a cat or not. It turns out that I want to choose a criterion given by some half-space like this. If the properties are uncorrelated, it's easy to see that it would be a coordinate half-space; otherwise I'd probably have to do some PCA, and it will be some other half-space.
All right. So we know that half-spaces are the most stable sets; now we can ask ourselves, is this fact robust? Namely, if we know that a set is almost as stable as its corresponding half-space, the half-space which has the same measure, does this set in some sense look like a half-space? We could ask the same thing about the isoperimetric question: if the surface area of a set is almost like that of the corresponding half-space, does the set in some sense look like a half-space? More formally, we would like to say something like this: given that the deficit between the noise stability of the set A and the noise stability of the half-space with the same measure is small, is it true that the distance between the two sets is small with respect to some metric? And what metrics would one consider? A natural one is the total variation distance, so just the measure of the symmetric difference; another one is the Wasserstein distance between the restrictions of the Gaussian measure to the two sets.
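Let me fix a shorthand for the deficit (my notation):

\[
\delta(A)\ :=\ S_\rho(H)-S_\rho(A),\qquad\text{where } H \text{ is a half-space with } \gamma_n(H)=\gamma_n(A).
\]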
The first result we have in this direction is by Mossel and Neeman, and it says the following. We define Delta of A to be the minimum, among all half-spaces whose measure equals the measure of A, of the Gaussian measure of the symmetric difference between A and the half-space; so this is a kind of total variation distance between A and the family of admissible half-spaces. The result says that this quantity can be controlled by the deficit: if the deficit is very, very small then in some sense the set is close to a half-space, and this holds up to a constant which depends only on the measure of my set and on the parameter rho. In particular, it implies that we can only have equality if our set is a half-space up to a measure-zero change.
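Schematically, and hedging on the precise form (I'm reading the shape off the slide rather than quoting the paper), the Mossel-Neeman bound looks like

\[
\Delta(A)\ :=\ \min_{\gamma_n(H)=\gamma_n(A)}\ \gamma_n\bigl(A\,\triangle\,H\bigr),\qquad
\delta(A)\ \ge\ c\bigl(\rho,\gamma_n(A)\bigr)\,\Delta(A)^4 .
\]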
Okay. So this robust inequality admits numerous applications, basically almost wherever Borell's theorem is used. In particular, for the Majority is Stablest theorem we also get a robust version: the majority function is essentially the only function which maximizes the stability. This could also be seen as a quantitative version of Arrow's theorem, for those of you who know what that is: it implies that, under some low-influence assumptions, the only way to minimize the probability of a non-rational outcome in an election is to take the majority function. By taking rho to one, in some cases we also get a robust isoperimetric inequality; I don't want to give details about this. The conjecture is that this exponent 4 over here can be replaced by 2, and I just want to mention that there is a slightly older robust result for the isoperimetric inequality, by Cianchi, Fusco, Maggi and Pratelli, from 2011.
try to understand. Maybe there's a better way to capture the distance between a and h. I want
to try to convince you that at least when we're talking about noise stability this metric might
miss something and to do that I want to construct a very simple example. The example looks
like this. We are going to construct two sets which are slight perturbations of just the measure
one half space on the line, so let's consider the real line. And let's say that this is 0, so the
measure of all of this is one half and I want to take here an interval of measure epsilon and call
it i2 and I'll call this thing an i1 so the half space is just i1 and i2. Now I want to take this interval
and just move it slightly to the right and call it i3. And the set h would just be these two things.
That's the original half space and a perturbation of h which I call a which will be just i1 and i3
instead of i2. But now I want to consider another perturbation. Instead of taking i3 to be here,
I take this epsilon mass and move it a constant distance, so I put it here. So let's say that this
point is the inverse Gaussian cumulative distribution function of 3 over 4, so this is one half and
I put it in 3 over 4 and let's call that i4 and the set b will be just i1 and i4. Now it's pretty clear
that the distance, well the total variation distance between both of these sets and the half
space is just epsilon so Delta over these sets is the same. But on the other hand, let's try to
But on the other hand, let's try to understand what the noise stabilities of A and B are. Say A is the blue set and B is the black set. Okay. To know what the noise stability of A is, I have to consider the probability that both x and y are in A. That's the probability that both x and y are in I_1, plus the probability that x is in I_3 times the probability that y is in I_1 conditioned on x being in I_3; I have a factor of two here, because I can also exchange x and y, and I have an O(epsilon squared) term, which is the probability that both x and y are in the small interval. Now I have exactly the same thing for B, so if I want to compare the deficits of these two sets against the stability of H, this suggests looking at the difference between these two conditional terms. Now it's not so hard to realize that, given that x is in I_3, the probability that y, the noisy version of x, is in I_1 is not so different from the same probability given that x was in I_2; I didn't move I_2 very far to get I_3, and if you calculate this you will see that the difference is actually of order epsilon. On the other hand, if rho is not very close to 0, it's also easy to see that conditioning on x being in I_4, over here, diminishes the probability that y will be in I_1 by a lot, well, at least by some constant factor. If I plug these two facts into the previous formulas, what I get is that the stability of A is the stability of H minus something of order epsilon squared, while the stability of B is much smaller: because I moved the interval over here, the deficit is of order epsilon. Well, this suggests that this metric doesn't capture what's going on so well: I want to capture not only how much mass I moved, but how far I moved it.
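So, schematically, the example shows (with constants depending on rho):

\[
\Delta(A)\approx\Delta(B)\approx\varepsilon,\qquad
S_\rho(H)-S_\rho(A)=O(\varepsilon^2),\qquad
S_\rho(H)-S_\rho(B)=\Theta(\varepsilon),
\]

so the symmetric-difference metric cannot tell apart two sets whose deficits differ by an order of magnitude.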
So this gets us to the main theorem I want to introduce, and it's the following. Let's try to define a different metric. Namely, what we do is this: we take our set A, we look at all possible half-spaces whose measure is the same as that of A, and we measure the distance between the centroid of H and the centroid of A. It's pretty clear, so if this is the origin and A is somewhere here, H would probably look like this; it's not hard to see that this measures how far I moved the mass and not only how much mass I moved, and I guess the previous example could convince you that this metric is somewhat more natural.
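Concretely, a sketch of the definition as I would write it (up to the normalization convention for the centroids, which are taken with respect to the Gaussian measure restricted to the sets):

\[
\varepsilon(A)\ :=\ \min_{\gamma_n(H)=\gamma_n(A)}\ \bigl|\,b(A)-b(H)\,\bigr|,\qquad
b(S)\ :=\ \frac{1}{\gamma_n(S)}\int_S x\,d\gamma_n(x).
\]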
What we get with this metric is the following: again with a constant that depends on the measure of A and on rho, the deficit can actually be bounded from both sides by the same quantity, up to some logarithmic factor. In some sense, if we only care about knowing the deficit up to constants, this is actually enough: we don't have to calculate the noise stability of the set, we just have to calculate this quantity, which, I'm sure you'll agree with me, is simpler to calculate. It's basically a one-dimensional thing; it depends only on the marginal of A in a certain direction, right?
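Schematically, the main theorem then has the shape (I am suppressing the constants, which depend only on γ_n(A) and ρ, and I won't vouch for which side carries the logarithm):

\[
\delta(A)\ \asymp\ \varepsilon(A)\qquad\text{up to a factor of order }\log\bigl(1/\varepsilon(A)\bigr).
\]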
Now, this theorem has a few corollaries. First of all, the conjecture I mentioned is verified, since Delta squared is controlled by this metric epsilon. It also gives an improved robust Gaussian isoperimetric inequality, because by taking rho to one it turns out that you can also get the limiting case. And here is another example of what you can get from this inequality: if you know that a set has a pretty good surface area, then when rho is close to one this deficit will be small, which implies that epsilon is rather small; now use the inequality again with a larger value of rho, plug in your estimate of epsilon, and this gives you an estimate on the noise stability in terms of the surface area. So somehow we know that the noise stability cannot get much worse as we increase rho, by using this two-sided bound. Any questions so far? Because at this point I think I'll move to some ideas from the proof.
>>: I don't understand why the Delta squared is less than the [indiscernible]
>> Ronen Eldan: Okay. Well, I haven't explained why, but basically the extremal example in this case, and it's not so hard to prove it, is the set A defined here: if you take the mass and move it only a very short distance, you can see that for A, Delta squared is of the order of epsilon, and it's not hard to see that this is the worst case. Just project onto one dimension and play with it a bit. It's a very easy fact, but maybe not immediate. Glad to help.
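In symbols, the claim in question is

\[
\Delta(A)^2\ \le\ C\bigl(\rho,\gamma_n(A)\bigr)\,\varepsilon(A):
\]

moving mass Δ(A) by a distance of order Δ(A) shifts the centroid by about Δ(A)², and that is the worst case.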
Let's talk about some ideas from the proofs. What I'll do, mainly, is prove Borell's result: this is a novel proof of Borell's result based on stochastic calculus, and in this proof we will see how the centroid of the set comes up. I'm not going to really prove the robustness statement, but hopefully I'll give an idea of how to do it. All right. We are interested in this quantity, the stability of A, which is just the probability that x and y are in A. If we plug in the definition of x and y, it's the probability that sqrt(rho) Z_1 + sqrt(1-rho) Z_2 is in A and the same for y, with Z_3 in place of Z_2, where Z_1, Z_2 and Z_3, I remind you, are independent standard Gaussians.
we can do is definitely we can take expectation over z1 and inside the expectation we can
condition on z1. We did nothing here. And when we condition on z1 it's clear that this guy and
this guy will be independent. We can instead of just checking that they are both in a, we will
just check that the first one is in a and take the square of the probability. At this point what we
do is the following. Let w be just the standard twin or a process or a Brownian motion. It's
clear that w time rho, the joint distribution of w time rho and time 1 is the joint distribution of
these two guys. What I can do is I can replace all of this expression by w1 and instead of
conditioning on z1 I'll just condition on whatever happens until time rho, so what we get is the
stability is just a probability that a Brownian motion at time 1 is an a conditioned on the
filtration at time rho squared. Until now we didn't really do anything. This encourages me to
take this probability to look at the dual martingale, the probability that w1 is in the a
conditioned on ft, this then I give it a name. Let's call it mt so we are actually interested in the
expectation of m rho squared. Since mt is a martingale by definition. It's a dual martingale, this
Since m_t is a martingale by definition, a Doob martingale, this expectation is, by Ito's formula, just m_0 squared plus the expectation of the quadratic variation of the martingale between time 0 and time rho. So all we are interested in is how much this martingale really varies, and in order to know what the quadratic variation is, we want to calculate the Ito differential.
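In formulas, the reduction so far is (a sketch in my notation):

\[
m_t\ =\ \mathbb{P}\bigl(W_1\in A\ \big|\ \mathcal{F}_t\bigr),\qquad
S_\rho(A)\ =\ \mathbb{E}\bigl[m_\rho^2\bigr]\ =\ \gamma_n(A)^2\ +\ \mathbb{E}\,\langle m\rangle_\rho,
\]

so comparing stabilities of sets of equal measure amounts to comparing expected quadratic variations.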
To do this, note that m_t is just this probability, and this probability is the integral over A of some measure, the law of W_1 conditioned on W_t. Now, W_1 conditioned on W_t is just a Gaussian centered at W_t: we already used up t of our time interval [0,1], which leaves us 1-t seconds to go, so it's a Gaussian whose variance is 1-t. And m_t is just the integral of this density f_t over our set A. So now we have a process of measures f_t which begins at the standard Gaussian; the center of the Gaussian moves according to a Brownian motion while the Gaussian shrinks, the variance shrinks, and at time 1 we end up with some delta measure. We want d of m_t, which encourages us to calculate d of f_t.
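Explicitly:

\[
f_t(x)\ =\ \bigl(2\pi(1-t)\bigr)^{-n/2}\exp\Bigl(-\frac{|x-W_t|^2}{2(1-t)}\Bigr),\qquad
m_t\ =\ \int_A f_t(x)\,dx .
\]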
Well, we have a formula for f_t, so we can just use Ito's formula to calculate the differential, and it turns out we get the following thing. I don't want to bother you with the actual calculation, but I do want to give you some intuition about what we get, which is pretty simple. In each infinitesimal time step, the measure f_t gets multiplied by a linear function which is equal to 0 at the center, at x = W_t, and which has a random gradient.
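Concretely, this is just Ito's formula applied to the Gaussian density above; a sketch:

\[
df_t(x)\ =\ f_t(x)\,\Bigl\langle\frac{x-W_t}{1-t},\ dW_t\Bigr\rangle,
\]

so in each infinitesimal step f_t is multiplied by the random linear factor 1 + ⟨(x-W_t)/(1-t), dW_t⟩, which indeed vanishes at the center x = W_t.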
linear functions with random slopes, I mean randomly distributed directions and this kind of
makes sense because if we think about it in one dimension, we multiplied by many functions
which look like this, one plus epsilon x and many functions which look like one minus epsilon x.
We have many cancellations. Each cancellation looks like one minus epsilon square x square
and if we take this to some high power we get something like e to the minus, some constant x
square which is a Gaussian density. But not all of them cancel. Some of them, in the end we
still are left with some terms which don't cancel out and this gives us an exponential which
actually moves the center of the Gaussian. So this is, I mean this is a very simple fact but it
turns out to be very useful and the reason it's useful is the following. If we want to know what
If we want to know what d m_t is, we just integrate d f_t over A, and this is a linear function; if we integrate a linear function over the set A, all we care about is where the centroid of A is located. If the centroid of A is far from the origin, this will change the mass of A a lot, and if it's at the origin, multiplying by a linear function will do nothing. So the center of mass actually appears here. But the center of mass with respect to what? With respect to some random measure f_t. But it's not so hard to change variables: f_t is some Gaussian, and of course I can make it the standard Gaussian by moving the center and dividing by the standard deviation. If we do this, we get the actual Gaussian center of mass of a set, but the set is not exactly the set A; it's the set A which I moved a bit and dilated a bit. I remind you that we are interested in the quadratic variation of this process, and it will be big if those centroid vectors are big; at any given time I'm taking this vector and pairing it with an infinitesimal Gaussian increment. We finally get that the quadratic variation differential is just the norm squared of the Gaussian center of mass of some translate of my original set A. If we use the same change of variables, we actually find that the measure of the set with respect to which I am integrating is just my martingale m_t: at each point in time I have moved my set A so that its Gaussian measure is exactly m_t, and the quadratic variation is just how far its center of mass is from the origin.
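In symbols, a sketch (before the recentering just described):

\[
dm_t\ =\ \bigl\langle v_t,\ dW_t\bigr\rangle,\qquad
v_t\ =\ \int_A f_t(x)\,\frac{x-W_t}{1-t}\,dx,\qquad
\frac{d\langle m\rangle_t}{dt}\ =\ |v_t|^2,
\]

and after moving the center and rescaling, v_t becomes (up to a time-dependent factor) the Gaussian barycenter of a shifted, dilated copy A_t of A with γ_n(A_t) = m_t.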
I have five more minutes, I think; we started 5 minutes late. All right. So what we want to do now is compare the quadratic variation of this process for A, which was an arbitrary set, with the quadratic variation of the same process for a half-space whose measure is equal to the measure of A. So let's take a half-space H which satisfies this and define exactly the same process; let's call it n_t instead of m_t, and I want to see what the quadratic variation of n_t is. Here we make the simple observation that if we start from a half-space, the set will always remain a half-space: if we translate a half-space and dilate it, it remains a half-space. The expression analogous to the one before would be the same thing, but in the case of a half-space it depends only on the value of the martingale itself: we have the martingale n_t, and the quadratic variation of n_t is just some function of n_t. What is this function? We take a half-space whose measure is n_t, we look at its centroid, and we measure how far it is from the origin.
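And that centroid distance is explicit; for a half-space of Gaussian measure m it is a one-dimensional computation:

\[
\Bigl|\int_{\{x_1\le\Phi^{-1}(m)\}} x\,d\gamma_n(x)\Bigr|\ =\ \varphi\bigl(\Phi^{-1}(m)\bigr),
\]

so the driving function depends on the half-space only through its measure n_t, up to the same time-dependent rescaling as before.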
And now we just observe one very simple fact, and the fact is this: if I have two sets with the same measure, the set A and a half-space H whose measure is the same as that of A, then, say the origin is somewhere here, the centroid of H will always be farther away from the origin than the centroid of A, because to get from A to H I have to take this mass and put it here, and it's just a monotone one-dimensional thing. This is a pretty obvious fact, and it is actually the only point in the proof where we have an inequality. Using it, we see that whenever m_t and n_t are equal, this quantity must be bigger than that one, so the instantaneous quadratic variation of m_t is always smaller than the same thing we get for n_t.
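In symbols, the only inequality in the proof is:

\[
\gamma_n(A)=\gamma_n(H)\quad\Longrightarrow\quad
\Bigl|\int_A x\,d\gamma_n(x)\Bigr|\ \le\ \Bigl|\int_H x\,d\gamma_n(x)\Bigr|,
\]

which one sees by projecting onto the normal direction of H and moving mass monotonically in one dimension.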
So we have two diffusion processes, and we know that whenever they are equal, one of them moves faster than the other. That doesn't immediately tell us that the quadratic variation of n_t will be bigger than that of m_t; we still have some work to do. What we can do is couple m_t and n_t: up to a time change they are both Brownian motions, so let's make them live on the same probability space by declaring them to be the same Brownian motion. The inequality then just means that, at a given time of the underlying Brownian motion, the inner clock of m_t moves slower than the inner clock of n_t, and with this coupling it is easy to see that the quadratic variation of m_t will be dominated by that of n_t, which finishes the proof once we take expectations. In fact this gives us something stronger: we have a stochastic domination between these objects, which gives us information about higher moments, and that in itself has some more applications. I just have to mention that at least the integer-moment inequalities were already known, from a paper by Mossel and O'Donnell, but okay, this is a new proof of them.
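One clean way to phrase the coupling, a sketch via the Dambis-Dubins-Schwarz time change (my phrasing):

\[
m_t\ =\ B_{\langle m\rangle_t},\qquad n_t\ =\ B_{\langle n\rangle_t}
\]

for one and the same Brownian motion B; the speed comparison says that the clock ⟨m⟩ runs more slowly than the clock ⟨n⟩ whenever the two processes read the same value, so under this coupling ⟨m⟩_ρ is dominated by ⟨n⟩_ρ, and taking expectations finishes the proof.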
So this just gives us the inequality; let me now, in one minute, try to give you some brief ideas about how to prove the robustness. To do this we have to ask: okay, we know that the process n_t is ahead of m_t, but by how much? We know that whenever those centroid distances are quite different, n_t accumulates quadratic variation and m_t becomes lagged behind n_t. But this metric epsilon only tells us something about this difference at time 0, and we want to say that, given that it's large at time 0, it kind of remains large for quite some time. To do that we take, roughly, the second derivative of what's going on: we take the Ito differential of this process epsilon_t, which turns out to be dictated by the behavior of some random matrix related to the process, and we can analyze this random matrix. Well, it's a kind of stochastic random matrix; we can analyze it with some spectral tools, and the [indiscernible] transportation-entropy inequality is kind of the central tool in the analysis.
the half a minute I have left I just want to advertise that okay. This kind of stochastic equation
we, it was pretty simple for us to derive. We just took a very natural process and differentiated
it, but we can actually, given some initial measure mu we can actually define a new process
using these stochastic measures, so if the initial measure is not a Gaussian but something else,
we can still somehow follow to the same method and give, and get some kind of a stochastic
evolution on the space of measures and, for example, if we start with the uniform measure on
the discrete cube but embedded in RN this gives new direct proof of the majority stablest
theorem with a slightly stronger version of the conditions we need. This is joint work with E.
Mossel. The conditions we need our slightly weaker and it turns out these equations turn out
to be a pretty useful tool in high dimensional convex geometry. Yeah, I guess I'll finish here.
[applause]
>> David Wilson: Any [laughter] any other questions? Any other questions?