>> Yuval Peres: [inaudible] very happy for Ryan O'Donnell who will tell us about the striking
generalization of the famous KKL paper. KKL, Kruskal-Katona and monotone nets.
>> Ryan O'Donnell: Thanks, Yuval. It's great to be back here in Seattle at Microsoft. So I'm going to be talking about, I guess, two joint works, actually, with Karl Wimmer, formerly of Carnegie Mellon and now at Duquesne University.
Okay. So let's start with what monotone nets are. So this is -- by this I mean a net for the set of all monotone Boolean functions. So let's say a set of functions H is a gamma net for the set of monotone functions if every monotone Boolean function is kind of close to one of the functions in H.
And specifically it won't be that close, so we'll look instead at the correlation of the functions. So we want that every function little h in H should have at least gamma correlation with every -- well, every monotone function should have at least gamma correlation with something in H.
So just to remind you of the definitions, a Boolean function is monotone if changing 0s to 1s in the input can only make the output go from 0 to 1.
And by the correlation of two functions H and F I mean the probability, over a random X drawn uniformly from 0,1 to the N, that H equals F, minus the probability that they differ.
So this is a number between minus 1 and 1. It's 1 if they're equal, minus 1 if they're opposite,
and it's related to the distance between the two functions in this way.
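In symbols, the definitions just described come out to roughly the following (the names corr and script-H here are just labels for what is on the slides):
\[
\mathrm{corr}(h,f) \;=\; \Pr_{x \sim \{0,1\}^n}\bigl[h(x)=f(x)\bigr] \;-\; \Pr_{x}\bigl[h(x)\neq f(x)\bigr] \;=\; 1 - 2\,\mathrm{dist}(h,f),
\]
\[
\mathcal{H} \text{ is a } \gamma\text{-net} \;\iff\; \text{for every monotone } f \text{ there is some } h \in \mathcal{H} \text{ with } \mathrm{corr}(h,f) \ge \gamma.
\]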
But there's lots of monotone functions, so it will be hard to come up with like a net that's
covering them. So we'll actually be interested in distances that are pretty close to a half. So we
just want functions that are slightly correlated with all the monotone functions. Is that clear?
Okay. So actually this problem has been studied surprisingly a lot. So I'll tell you about the
previously known results. So this is Kearns and Valiant from '89. And they show that
this -- well, I don't have a pointer. They showed that this collection of N plus 2 Boolean
functions, the all zero function, the all one function, and the N coordinate projection functions,
are like a sort of good net for the collection of all monotone functions. So every monotone
function has at least 1 over N correlation with the all zeros, the all ones, or one of the N
coordinate projection functions.
I should say I guess that if you just take the all zeros function and the all ones function, then
that's every function has correlation at least 0 with one of those, right, because it's either more
likely to be zero or more likely to be one.
>> [inaudible]
>> Ryan O'Donnell: Yeah. So. So yeah. So I guess the idea is to keep H small. Let's say
polynomial size. Okay. So this is improved I guess six years later. Just the analysis was
improved. So [inaudible] showed that the same net is actually better than this. It achieves
correlation at least log N over N.
And this is [inaudible], some Carnegie Mellon guys, in 1998. They took an even simpler looking
net that only contains three functions: 0, 1, and the majority function. And they showed that
every monotone function is either a little bit correlated with 0, with 1, or with the majority
function. And the amount was 1 over root N.
And they also had another interesting aspect in their paper. They showed that any collection of
polynomially many functions can achieve correlation at most log N over root N.
So there's a little gap left in what they did. Well, one thing we do in this paper is close this gap.
So the natural thing is just to combine these two nets. And indeed we show that if you take these
N plus three functions, 0, 1, majority, and the N coordinate projection functions, then every
monotone function has at least log N over root N correlation with one of these functions.
Okay. So we know it can't be any better, so this is actually [inaudible].
Okay. So that's nice, I guess, but, I don't know, what is the interest in this problem? It seems more like a curiosity, really.
Well, actually, the reason that all of these papers, including ours, actually, were looking at this
problem was because of a notion in learning theory, computational learning theory. So actually
these are two heroes of learning theory [inaudible] and in this paper they address -- or they
introduce the notion of weak learning, which is like the learning theory model where instead of
striving to get like a really excellent hypothesis you only strive to get a hypothesis that's a little
bit better than trivial. And so you might do this if the set of functions you're trying to learn is
quite complicated. And they were specifically looking at trying to learn the class of all
monotone functions.
So I'll give you a quick explanation of the connection to learning theory. If you're not that
familiar with it, then you can just tune out for a bit, because I won't explain exactly everything I
say here.
So a corollary, for example, of our net result is that the set of all monotone functions is weakly
learnable with advantage this same quantity, log N over root N, under the uniform distribution.
And the proof of this fact, the algorithm, is extremely simple. You just draw a bunch of
examples. So the learning model is there's some unknown, in this case monotone, function F, and you can get examples from it, which are pairs consisting of X, where X is chosen uniformly at random, and the label F of X. And you're trying to come up with like a hypothesis that is somehow close to
the unknown function F.
So this result immediately gives you this simple algorithm, just draw a bunch of examples and
check which of the functions in the net, either 0, 1, majority or the N projection functions,
empirically seems most correlated with these labels F that you're getting, and whichever one
seems best just pick that one and let that be your hypothesis.
Okay. And it's easy to see if you just do some samplings, some [inaudible] bounds or whatever,
that the empirical correlation will be close to the true correlation. So this theorem tells you that
one of these N plus 3 functions has at least this much correlation, and that's like the learning
theoretic advantage that your hypothesis achieves.
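For concreteness, here is a minimal Python sketch of the net-based weak learner just described; the function names and the sampling details are illustrative, not code from the paper.

```python
def weak_learn_monotone(examples, n):
    """Net-based weak learner: return the net element with the highest
    empirical correlation with the labeled examples.

    `examples` is a list of (x, y) pairs, where x is a 0/1 tuple of length n
    and y = f(x) for the unknown monotone target f.
    """
    # The net: constant 0, constant 1, majority, and the n coordinate projections.
    def zero(x): return 0
    def one(x): return 1
    def majority(x): return 1 if 2 * sum(x) > len(x) else 0
    def proj(i): return lambda x: x[i]

    net = [zero, one, majority] + [proj(i) for i in range(n)]

    def empirical_correlation(h):
        # Estimate Pr[h(x) = y] - Pr[h(x) != y] on the sample.
        agree = sum(1 for (x, y) in examples if h(x) == y)
        return (2 * agree - len(examples)) / len(examples)

    return max(net, key=empirical_correlation)
```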
>> [inaudible]
>> Ryan O'Donnell: Yeah. Yeah, probably you invert this and square it, but then you need to multiply by log N because you're taking a union bound over N plus three functions.
Actually, a little quirk -- I don't mention this, but a little quirk of how we prove this is actually you can achieve the same thing with only N to the epsilon many samples. But that's taking this too far into the guts of learning theory, so I'll skip over that.
And you can see actually that just anytime that you have a net result like this where you say like
every function in this class, e.g. monotone functions, is, you know, somewhat close to one of -- a
small set of functions, then you can do the same algorithm. Just try everything in the net, pick
which one looks best.
So all of the results here have the same learning theory corollary.
In fact, I should mention that this result, which is the upper bound, was not actually an upper
bound about nets, it was -- or a -- it was an intractability result for learning theory. So this result
actually showed that any learning algorithm for monotone functions which only sees
polynomially many examples, information theoretically cannot achieve advantage better than
this amount.
So, in particular, the net kind of strategy, learning by net strategy can't do better, and that implies
this statement.
Okay. So that's the connection to learning theory, which motivated the study of monotone nets.
And, you know, it's kind of a somewhat obscure problem in learning theory maybe, and only if
you're a true like Boolean function nerd like me might you be really interested in this problem.
But as is sometimes the case, I think it turned out that the how -- the tools used to prove these results -- proved to be more interesting than the actual results themselves.
So let me tell you how each of these papers achieved their result. So this basically follows from
the expansion of the hypercube, the expansion of it as a graph. This result follows immediately
from the KKL theorem. I'll say what it is soon. This result follows quite quickly from the Kruskal-Katona theorem. And this result follows first by proving like a generalized KKL
theorem and using that to prove like a robust Kruskal-Katona theorem, and then that implies the
result.
So these are the two things that I mainly want to tell you about in this talk, and you may
remember them from the title.
Okay. So let me first remind you about the KKL theorem. We have, as was mentioned, the first
K, Jeff Kahn, sitting here in the audience. I guess I haven't even said who they are. It's Jeff Kahn, Gil Kalai, and Nati Linial from '88.
Okay. So this is a very, very famous theorem about Boolean functions, one of the most amazing
theorems in the area. And it says this. So here's the theorem. If you have any Boolean function
F, there always must exist some coordinate between 1 and N which has a somewhat high
influence on the function F.
So what does that mean? So first you kind of have to scale this statement by some factor. This
is a fraction of points where F is 0 and the fraction of points where F is 1.
We mainly are just concerned about the case where F has roughly as many 0s as 1s, so this is
about a half or at least a constant. This is about a half or at least a constant. So for the rest of the
talk just think of this factor as like an absolute constant.
So then it says just that a function F, which is sort of roughly balanced, always has a coordinate
with at least log N over N influence. Well, what is this influence? The influence of coordinate I
is just like the probability that flipping the Ith coordinate makes a difference to the function F.
So, more precisely, the experiment is you pick a random string X, a random point in the hypercube, then you flip the Ith coordinate and you see if F takes different values on the two resulting points.
Somehow the probability that this coordinate is relevant, it's somewhere between 0 and 1, and
the theorem of KKL is that there's always a coordinate that has influence at least log N over N.
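To record my reconstruction of the statement and the definition on the slide, with c an absolute constant and the scaling factor being the product of the two fractions just mentioned:
\[
\mathrm{Inf}_i(f) \;=\; \Pr_{x \sim \{0,1\}^n}\bigl[f(x^{(i \to 0)}) \neq f(x^{(i \to 1)})\bigr],
\qquad
\max_{1 \le i \le n} \mathrm{Inf}_i(f) \;\ge\; c \cdot \Pr[f=0]\,\Pr[f=1] \cdot \frac{\log n}{n}.
\]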
I mean, so far it's not clear whether that's good or bad or what. But it's good, and it's tight. We'll see -- we'll understand a little bit more in the next few slides.
Any questions so far, by the way? Right.
So before I talk a little bit more about the KKL theorem, let me just remind you that I told you
this net result with log N over N, so somewhat far away from what we end up shooting for, but
this log N over N thing follows truly immediately from the KKL theorem. So let me see why.
So this is KKL. And let's imagine now that in particular F is a monotone function. If F is a
monotone function, then you can interpret the Ith influence in another way. You can see that as
follows.
In this experiment you pick some X, right, which is some long string, and then maybe this is the
Ith coordinate, and you consider what happens when it's 0 and what happens when it's 1, and you
look at F's value. So it has two values. So it could be 0 on both of them, in which case you
don't sort of score any points in this probability.
Could be 1 on both of them, in which case again this event doesn't happen. It could be 0 and 1,
in which case you sort of get a point here, very different, or it could be 1,0. But actually it can't
be 1,0, because it's a monotone function. So if you change a 0 to 1, you know, that can't make a
monotone function go from 1 to 0.
So since these are the only three possibilities, you see it's also basically like the probability that F
of X equals XI, or, more precisely, it's the probability that F of X equals XI minus the probability
that F of X differs from XI.
Well, if you don't quite follow that calculation, it's very trivial, so take my word for it. This is
precisely the correlation of XI with F. So just to state it again, if F is a monotone function, then
you can also interpret the influence as just the correlation of the Ith coordinate with the function
F. And this is KKL. It says that there's always a coordinate with influence at least this much.
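Spelled out, the calculation just sketched is (in 0/1 notation): for monotone f, conditioned on the other coordinates, the pair of values (f(x^(i→0)), f(x^(i→1))) is (0,0), (1,1), or (0,1), never (1,0); the first two patterns contribute 0 to the correlation with x_i and the last contributes 1, so
\[
\mathrm{corr}(f, x_i) \;=\; \Pr_x[f(x)=x_i] - \Pr_x[f(x)\neq x_i] \;=\; \Pr_x\bigl[f(x^{(i\to 0)}) \neq f(x^{(i\to 1)})\bigr] \;=\; \mathrm{Inf}_i(f).
\]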
So, you know, basically one of three things can happen. F can be mostly 0, in which case, you
know, 0 is quite close to the function. F can be mostly 1, in which case 1 is quite close to the
function, or if neither happens, then both of these numbers are kind of absolute constants, KKL
tells you that there's a coordinate with influence at least log N over N, and therefore that also has
correlation at least log N over N. So that proves that 0, 1, and X1 through XN form a log N over
N net.
>> [inaudible]
>> Ryan O'Donnell: Yeah. There were some more learning theory results in there. They also gave like a 2 to the root N time algorithm for strongly learning every monotone function. That's quite a nice result. Great. So, yeah, just told you about that.
Let me actually just briefly go back and tell you about this, because it will also put the KKL
theorem in a nice context for you. So these are even earlier results. Yeah. So this is sometimes known as the expansion theorem or the [inaudible] inequality or the edge isoperimetric inequality of the hypercube. This E of F -- there's a lot of symbols, sorry about that, but this E of F represents
the average of the influences, okay, the average of the influences.
And, again, if you think about this as a constant, this is saying that the average of the influences
is always at least constant over N. So you can see already the KKL is kind of saying something
interesting. It's saying the average may be at least 1 over -- constant over N, but the max is
always at least log N over N.
And to see the connection to like expansion on the hypercube or edge isoperimetry, what is the average of the influences? Well, in the influence experiment, right, you pick a random X and
then you flip the Ith coordinate. So if we're going to also average over I, it's like saying you pick
a random X and then flip a random coordinate.
>> [inaudible] do you get the bound on the min also of the [inaudible]?
>> Ryan O'Donnell: You know, the thing is -- the min of the influences can be quite small. It can be sort of proportional to the min of these two things. In fact, yeah, it can be 0, right, if [inaudible] thank you. That's a much --
>> [inaudible] min is 0 over it.
>> Ryan O'Donnell: Yeah. Right. So the average of the influences can also be interpreted as
this. You pick a random edge and look at the probability that F labels its endpoints differently.
So, again, if you think of it geometrically, if you think of blue as 1 and this brown as 0, F kind of makes let's say a subset of the cube, maybe the blue points. And then you count the fraction of edges that are on the boundary, that go from the inside to the outside. Okay.
And so this is like an isoperimetric statement. It says that the fraction of edges on the boundary is always at least constant over N times sort of the volume of the smaller side. That's a familiar
statement. This is a very trivial statement. You can prove it by induction on N. I think the
original proof was Harper in '64. He was a math grad student at Oregon, our
neighbor to the south.
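In the notation above, the expansion statement being described is, I believe, the Poincaré-type inequality (the constant 4 is my best guess at the factor on the slide; in any case it is absolute):
\[
\mathcal{E}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{Inf}_i(f) \;\ge\; \frac{4\,\Pr[f=0]\,\Pr[f=1]}{n} \;\ge\; \frac{2\,\min\bigl(\Pr[f=0],\,\Pr[f=1]\bigr)}{n},
\]
with equality in the first inequality for a dictator f(x) = x_i.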
Okay. And this is also sharp, by the way, for these functions F that are like XI, just a coordinate
projection function. Here's like a picture of that. Because if it's this kind of function, the
probability that it's 1 is a half, the probability that it's 0 is a half, so those cancel with the four, and it's just saying that a 1 over N fraction of the edges go between the two sides, which is true, right, a 1
over N fraction of the edges in the cube go in the Ith direction.
Okay. Any questions?
Great. So, yeah, just this statement about the expansion of the hypercube. Edge isoperimetry says that the average of the influences is at least, let's say, constant over N, and KKL
says that actually the maximum is at least log N over N.
I'm actually going to use a slightly refined version of KKL, which I guess maybe first appeared in a paper of Talagrand. It looks a bit funny, but this is the version I actually want you to
remember more so than to remember this. It says that if F is a function all of whose influences
are smaller than let's say 1 over N to the .01, which is pretty big, right, 1 over N to the .01, if
they're all smaller than this, then actually their average is a bit large. It's a bit funny. If they're
all kind of small, then their average is a bit large.
And you can see this as actually stronger than the KKL. Why? How would you prove KKL
using this fact? Okay. If F has an influence that's bigger than 1 over N to the .01, then we're
certainly easily done. It's way bigger than log N over N. Otherwise the average of the
influences is at least log N over N. So certainly one of them has influence at least log N over N.
So remember this one, this version.
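So, the version to remember, as I understand it (c an absolute constant, with the same balance factor as before suppressed for roughly balanced f):
\[
\text{if } \max_{1\le i\le n} \mathrm{Inf}_i(f) \;\le\; n^{-0.01}, \quad\text{then}\quad \frac{1}{n}\sum_{i=1}^{n}\mathrm{Inf}_i(f) \;\ge\; c\,\frac{\log n}{n}.
\]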
Great. I should also mention that KKL and this theorem as well are also known to be sharp, even for monotone functions. This function introduced by [inaudible] in '86 called Tribes has
the property that actually its correlation with all of these N plus 2 functions is log N over N. So
all of the influences are log N over N.
Okay. Any questions about KKL? So a few more facts about it. It's kind of funny. You know,
these influences are combinatorial notions, but there's no known combinatorial proof of KKL.
Only analytic proofs. You might say that this log N kind of looks small. I mean, you know,
you're beating the trivial bound here by log N, it doesn't look that big, but actually the fact that it
goes to infinity is like crucial. I mean, that's like the awesome aspect of it. And it's exactly why
it has all these nice applications.
So, for example, in the original paper they used it to show that in an idealized model of a
two-party election you can always bribe a little O of one fraction of the voters and force the
outcome of the function with high probability -- or the election.
The fact that this goes to infinity is the reason why monotone graph properties have sharp
thresholds. And it's also the reason why the sparsest-cut SDP has superconstant integrality gap,
if you know what that means. If you don't, you can ask [inaudible] who is the DE right there.
Okay. So that's the beauty of KKL. And I'll just say -- I mean, I won't give the proof. It's a little
complicated. But I'll kind of give some kind of sketch of how I think of the proof. A very
sketchy sketch. Somehow the idea I feel is that this expansion theorem we mentioned is not tight
for tiny sets. So sets that are of size about a half may have only this sort of expansion
by a factor of 1 over N, but a really tiny set in the hypercube actually has a lot of edges coming
out of it.
This is -- somehow the idea is to apply this to each set, which I'll call maybe delta I, which is the
points that are on the boundary of F in the Ith direction. Remember, the hypothesis of this
refined KKL was that if all of the influences are small, then somehow their average is a bit
large.
So, I mean, if all the influences are small, it's like saying all these sets have small volume. So
then you can maybe hope to apply this idea to that fact and then somehow average it all together.
It doesn't work out so nicely as that, that's why you have to bring in these combinatorial analytic
ideas. But somehow I think of that as an idea. I'm not really sure what that picture illustrates.
So for small sets of the Hamming cube, this is also an easy fact. I mean, Harper also proved this in '64. I mean, he showed a sharp edge isoperimetric inequality. If you give me like the size of a
set in a hypercube, I'll tell you the best one for the purposes of making the edge boundary small.
And he actually showed this. If G is let's say the indicator of a subset in the cube, then its
boundary is at least 2 over N times the -- let's say the volume of it times this extra log factor.
And the sharp subsets are like subcubes. And that's somehow why the log comes in.
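The statement being gestured at here is, as far as I can reconstruct it, this form of the edge-isoperimetric inequality, where g is the indicator of a set A of density mu(A):
\[
\frac{1}{n}\sum_{i=1}^{n}\mathrm{Inf}_i(g) \;\ge\; \frac{2}{n}\,\mu(A)\,\log_2\!\frac{1}{\mu(A)},
\]
with equality when A is a subcube, which is where the extra log comes from.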
>> [inaudible]
>> Ryan O'Donnell: Pardon me?
>> [inaudible] boundary?
>> Ryan O'Donnell: Yeah. So I guess I wrote G instead of F here because I'm thinking of -- imagine G is the indicator of a smallish set. Yeah. So somehow you see there's this like extra log that comes in that is kind of what you would like to exploit. Unfortunately, as I said, the proof doesn't quite look like that. You have to get some analytic notions in.
And in particular if somehow instead of applying this to a set you want to apply it to a function
which is like imagine G of X is the probability that somehow a short random walk from X would
lie in the boundary, the Ith boundary of this set F.
Well, this is a real-valued function. So you can't -- I mean, this is about sets. You can't apply Harper's theorem. But there's a generalization of this theorem to real-valued functions, proved by I guess Gross in 1975, called the Log-Sobolev Inequality. And it's also equivalent to these things like hypercontractive inequalities or [inaudible] inequalities, which is how it was proved in the original KKL. So somehow you
use this more powerful fact instead, and somehow this is the intuition. But it's tricky to actually
do the proof.
Okay. That's all I will say about the proof. But we can also stop for questions here, because I
think this is the last thing I'll say for KKL for a while.
Okay. Great. So that's KKL. And we're going to put it aside for a little while and come back to
it at the end. So we're going to now go to the second thing in the title, which is Kruskal-Katona,
which uses some of the same concepts but is a bit different.
So Kruskal-Katona, as you probably know, is a famous theorem in combinatorics from '63 or '68,
depending if you're Kruskal or Katona. And it's usually stated in terms of set systems. But I'm
going to stick with this notion of Boolean functions. And if you know the set systems, then
you're sharp enough to do the translation on the fly to Boolean functions.
Okay. So what's the Kruskal-Katona theorem say. This is my picture and will be for the future
of the talk of the Boolean hypercube, 0,1 to the N. So I draw it in this funny way where I kind of
picture the Hamming weights going upward. So this is the set of all strings with Hamming
weight 0. This is the set of all strings with Hamming weight N, and this is the set of all strings
somewhere around here with Hamming weight N over 2. And there are many more of those,
which is why I draw the picture in this funny way.
And the Kruskal-Katona theorem is concerned with just a particular slice, just a set of all
Boolean strings that have a fixed Hamming weight K. And I'll donate -- denote that slice by like
N choose K, the set of all Boolean strings that have Hamming weight K. So there's this slice.
And you can imagine you have some function F, a Boolean function F on the entire Boolean
cube, but you'll just focus in on what's going on on this slice. And I'll introduce this notation, mu sub-K, for just the fraction of points in the slice where F is 1. It's like the density of F in this slice if you think of F as the indicator of a subset.
So the Kruskal-Katona theorem is all about comparing what's going on at one slice to what's
going on at the next slice up, or perhaps the next slice down. But we'll say next slice up.
And in fact it's all about trying to make a statement like this. Imagine you have a monotone
Boolean function on the whole cube and its density at level K is something. What can you say
about its density at level K plus 1. So by virtue of the fact that it's monotone, right, you would
know that there's some stuff -- there's some points where F is 1 up here, because anytime you
have a point on the K slice where F is 1, and then you change a 0 to a 1, by virtue of
monotonicity, the resulting string has to be one where F is 1 up here, right?
And in fact it's a very trivial exercise to prove that the density must go up as you go up the slices.
So it's very trivial to prove that the density of a monotone function at level K plus 1 is at least
that of what it is at level K.
The point of Kruskal-Katona is to give a better statement than that, to give like a sharp statement.
And that's exactly what Kruskal-Katona does. It actually tells you if you fix K and you tell me
exactly the density of a monotone function at level K it will tell you exactly like what F should
be so as to try to minimize the density at the next level. It's like the first strings in colexicographic order or something, but the point is it just tells you like the exact best
inequality that you can put here.
Now, it's actually a little complicated to state. And if I were to state it, it would look like this: mu sub K plus 1 is at least some complicated function of mu K and K. So people often quote, instead of the actual theorem, like a [inaudible] that's a bit easier to deal with. And [inaudible] has one. [inaudible] and Thompson have another.
I'll actually even further simplify them. So this is like a corollary of a corollary of Kruskal-Katona. It says that the density of a monotone function at level K plus 1 is bigger than
the density at level K plus this amount. So how should you think about this amount?
Basically I want to tell you it's something like 1 over N. So what I'm saying is I want to focus
on the setting of parameters where, first of all, the slices we're talking about are somewhere in
the middle. The K over N is bounded away from 0 and 1. So we're not talking about up here or
down here.
And also I want to talk about the case where this density is also somewhere bounded away from
0 and 1. So we're not talking about almost completely full or empty slices. So I will always just
care about these two settings of parameters.
In that case, if mu K is a constant, then this is all a constant. And if K is like a constant fraction
of N, then this is like a constant fraction of N. So it's just this. I mean this is a corollary of a
corollary of a corollary. The density goes up by at least some constant over N.
Okay. Any -- does that make sense? And this corollary of a corollary of a corollary is tight. I mean, KK is exactly tight. I mean, it is the exact best answer, but we haven't lost anything yet here. And it's very easy to see. F of X could be this very simple monotone function XI. And so what's the density of the XI function at level K? It's like asking, if I pick a random string of N bits with exactly K 1s, what's the probability that the Ith coordinate is 1?
Well, it's K over N. And so the density at level K plus 1 is K plus 1 over N and the difference is 1 over N. So Kruskal-Katona tells me that the least amount by which the density can go up for a monotone function is like 1 over N, and this is [inaudible].
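To write the corollary-of-a-corollary and the tight example in the slice-density notation (c is an unspecified constant; the hypotheses are the two boundedness conditions above):
\[
\mu_{k+1}(f) \;\ge\; \mu_k(f) + \frac{c}{n} \quad \text{for monotone } f, \text{ with } k/n \text{ and } \mu_k(f) \text{ bounded away from 0 and 1},
\]
and for the dictator f(x) = x_i,
\[
\mu_k(x_i) = \frac{k}{n}, \qquad \mu_{k+1}(x_i) = \frac{k+1}{n}, \qquad \mu_{k+1} - \mu_k = \frac{1}{n}.
\]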
So you play around with it, and you might wonder: are examples that are kind of like this the only examples where the density increase is so small?
Oh. This is the lead-in to some slide that's like three slides from now. Actually maybe in the
interest of time -- I was going to show how -- well, okay, I'll show this. So this Kruskal-Katona
corollary actually easily implies this old net result, if you remember it, that either 0,1 or majority
has at least this much correlation with -- for any given monotone function, one of these three
functions has at least 1 over root N correlation with it.
Let me quickly sketch that. So imagine F is a monotone function. And we'll sort of divide up
the slices around the middle. Let me just assume that its density at the middle slice is about a half, okay? If it's way bigger than a half, then probably 1 is quite correlated with it. If it's way smaller than a half, then probably 0 is quite correlated with it.
So let's assume that it's a half, and then we'll show that majority is quite correlated with the
function. We have some monotone function F, its density at this slice is a half. Okay.
So then Kruskal-Katona theorem tells us that the density at the next slice goes up by like some C
over N. Okay. Then you apply it again. It's still monotone, so Kruskal-Katona tells you this
density at this slice is at least 2C over N.
Okay. You keep doing this for a while. Let's say you do it for like root N slices, and therefore at
this point, at this N over 2 plus root N slice you know the density of F is at least half plus
constant over root N. You can do a symmetric thing going down and get that the density down
here is smaller than half minus C over root N. And now you're in good shape to conclude that
majority has this much correlation with F.
Because, you know, majority is 1 everywhere above the middle, and 0 everywhere below the
middle. That's the definition of majority. So it's like all 1s up here and all the 0s down here. So
in this piece you're sort of catching correlation C over root N with majority.
And on this, because it's one up here. And on this piece you're kind of catching correlation C
over root N with majority. And these two pieces occupy like a constant fraction of the
hypercube.
It's well known that a constant fraction of the hypercube is between these two levels and
therefore also outside these two levels. So that's a pretty sketchy sketch. But that's how this net
theorem quickly follows from Kruskal-Katona.
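Roughly, the arithmetic in that sketch is:
\[
\mu_{n/2}(f) \approx \tfrac12 \;\Longrightarrow\; \mu_{n/2+t}(f) \;\ge\; \tfrac12 + t\cdot\frac{c}{n}, \quad\text{so}\quad \mu_{n/2+\sqrt{n}}(f) \;\ge\; \tfrac12 + \frac{c}{\sqrt{n}},
\]
and symmetrically \(\mu_{n/2-\sqrt{n}}(f) \le \tfrac12 - \tfrac{c}{\sqrt{n}}\). Since majority is 1 above the middle level and 0 below it, and a constant fraction of the cube lies beyond these two levels, this gives correlation on the order of at least \(1/\sqrt{n}\) between majority and f.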
Okay. So you remember our, you know, monotone net theorem, which was like the final corollary of all this work: for every monotone function, one of these guys has at least log N over root N correlation. And you can imagine trying to give a similar
proof. And in fact you'll get the exact same proof if you could just conclude that instead of the
density going up by constant over N at each step it went up by log N over N at each step. And
you'll just gain a factor of log N.
And so we exactly execute that idea by proving this robust version of a Kruskal-Katona theorem.
So this is one of our main theorems. Let F be a monotone function and let's say you're
somewhere in the middle of the Hamming cube and the densities are also bounded away from 0
and 1.
Basically it says that the density, when you go from level K to level K plus 1 always goes up by
actually log N over N. Unless F is somehow strongly correlated with one of the N coordinate
functions.
So to state it in the contrapositive, if the correlation of F sort of within slice K with every XI is
smaller than 1 over N to the epsilon, then the density jumps up by log N over N.
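Written out the way I understand the theorem (with the constant c presumably depending on epsilon and on the boundedness parameters, and corr_k meaning correlation over a uniform point of the slice):
\[
\text{if } \mathrm{corr}_k(f, x_i) \;\le\; n^{-\varepsilon} \text{ for every } i, \quad\text{then}\quad \mu_{k+1}(f) \;\ge\; \mu_k(f) + c\,\frac{\log n}{n},
\]
for monotone f with k/n and \(\mu_k(f)\) bounded away from 0 and 1.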
And I think you can probably kind of imagine that like once you have this theorem to like
conclude this net result it's not too bad. I mean, it takes a couple of pages because -- well, you
know, basically the density is going up and up and up, and you're happy, unless at some level
you have like very large correlation with some function XI. And then you need to eventually
deduce that indeed this guy has good correlation with F sort of everywhere in the cube. But you
can see that they're kind of similar and a little bit of work will take you from this to this.
Okay. So this is the main theorem that we prove. And I think you can also see that it kind of
should, if you remember, remind you of KKL, right? The KKL theorem said that if F is any
function it doesn't have to be monotone and all of its -- the N influences are smaller than 1 over
N to the .01, then kind of the average of the influences, which is like the edge boundary, is at
least log N over N. And this is also some kind of statement that like the edge boundary or
somehow the boundary between the K and K plus first level is at least log N over N.
So what I'm saying is you can tell that this also kind of looks like the KKL theorem. So we
prove this robust KK, Kruskal-Katona, theorem by proving some new kind of KKL theorem. So
what's the difference? The difference is that the KKL theorem is about functions, and this thing
is kind of about functions restricted to like a single slice, which is actually a negligible -- each
individual slice is like negligible fraction of the hypercube.
So we kind of need like a KKL theorem that's like localized to a single slice.
Any questions? Feel free to ask them if you have them. Okay. Right.
So, okay, so let's imagine we wanted to now like prove KKL but somehow localized to a slice.
We want to prove something like this. Well, you can again say, okay, let F be a function
mapping this slice, N choose K, into 0,1. Now let's try to show that one of its influences is
large.
But there's an immediate problem, though, which is what is influence? Because, you see, the
normal definition of influence doesn't make sense anymore. Imagine you have a function
defined just on the set of all weight K strings. The definition of influence is like pick a random
string, okay, you can pick a random string here, but then you have to like flip its Ith coordinate.
But if you change a 0 to a 1 or a 1 to a 0, you will no longer have Hamming weight K. So you'll
get a string that's not even in the domain of F. So it doesn't make sense.
Okay. So you invent -- this is sort of already given away here, but you invent a different notion of influence where, instead of like picking a random string and flipping the Ith coordinate, you pick a random string and you swap the Ith and Jth coordinates. And that's cool because if we swap the Ith and Jth coordinates, you won't change the Hamming weight of the string.
So you at least get to something that's still in the domain. And maybe there's some chance that it actually does something -- this doesn't do anything if they have the same value, XI equals XJ, but they may differ.
Great. So we can invent this new notion of influence of a pair of coordinates on a function
whose domain is the Kth slice: just the probability that, when you pick X and swap the Ith and Jth coordinates, that changes the value of the function. Okay. So that's some new definition. You
can again define this E to be the average of all the influences. Now you're averaging over N
choose two things.
And, yeah, this is the theorem we actually prove. It's like a generalization of KKL, even the refined version: if all of the influences are smaller than 1 over N to the
.01, then their average is at least log N over N.
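In symbols, my paraphrase of the definition and the theorem (suppressing the balance/variance factor, as in the cube case):
\[
\mathrm{Inf}_{ij}(f) \;=\; \Pr_{x \sim \binom{[n]}{k}}\bigl[f(x) \neq f(x^{(i \leftrightarrow j)})\bigr],
\qquad
\max_{i<j}\mathrm{Inf}_{ij}(f) \le n^{-0.01} \;\Longrightarrow\; \binom{n}{2}^{-1}\!\sum_{i<j}\mathrm{Inf}_{ij}(f) \;\ge\; c\,\frac{\log n}{n}.
\]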
Okay. So, yeah, it looks like kind of identical to the original KKL theorem. But, just to remind
you, it's like in a different setting, functions just on the Kth slice. And, you know, when we tried
to prove this, it does seem harder than proving KKL, because the proof of KKL uses Fourier
analysis in a deep way.
And Fourier analysis is very nicely attuned to the situation where the domain is a -- carries a
product distribution. You usually associate the product distribution with 0,1 to the N, but the
uniform distribution on the set of strings of Hamming weight K is not a product distribution. So
it makes it seem like it's hard to envision using Fourier analysis. So that's why maybe it's a bit
harder than the original KKL.
>> [inaudible]
>> Ryan O'Donnell: Oh, you're like three or four slides ahead of me. I wish I had talked to you
back in the day. Great. Yeah. We'll see it in a second.
Okay. In summary I kind of told you about all of these things, and the only thing I haven't really
told you about yet is this KKL on a single Hamming slice.
Okay. So let me -- the last part of the talk will be about this. And in fact we prove a KKL
theorem in like a more generalized setting. Although I will now admit to you honestly that we
set up a nice generalized setting and then we have like two special cases in mind. A, the
Hamming cube, and B, like the slice of the Hamming cube. But, okay, we did it in a general
setting anyway.
Okay. So what is this general setting where we're going to try to prove like a generalized KKL.
The setting is what's called a Schreier graph. I had never heard of what a Schreier graph is until I started this project. But it's a simple generalization of a Cayley graph, which I think
more people have heard of.
So it's a graph. The vertex set is some set X. And you imagine you have a group G which is
acting on X in the group theoretic sense. And you also imagine, like in a Cayley graph,
that you have a subset of the group, which is like a generating set for the group, U. And we'll
also make the standard assumption that it's closed under inverses.
Okay. It's a graph, so what edges do you put in? You look at each little X in big X, and you put in an edge to X acted upon by U, for each little U in the generating set. So here's
X. You like act upon it by all of the guys in the generating set. That gives you some other guys
in capital X and you just put an edge to all of them. Is there anything else on this slide?
So just a couple of quick comments. First of all, this is an undirected graph because of this assumption that
it's closed under inverses. So if you go here, if you acted upon this guy by U1 inverse, which is
also in capital U, you come back to X. It's undirected. And it's also regular, right, because at
each vertex you put like one edge for each guy in capital U. It should also be connected, so
maybe G should act transitively on X, but never mind.
Okay. So whenever you have a graph, you also have like a natural -- oh, I should also say that
when X equals G and the group action is just group multiplication, then that's the Cayley graph.
Okay. So you have a nice graph and whenever you have a nice graph you can consider the
natural random walk where you just start at a point and go to a random neighbor and keep
walking. And luckily this is an undirected graph and it's regular so the stationary distribution for
this random walk will be the uniform distribution, which is on capital X, which is pleasant. You
could also interpret it as, you know, you start at a random little X in capital X, and then you just pick a
random generator and act on your current location. You pick another random generator and act
on your location. Yep?
>> Let's assume the action is transitive, it's just a quotient of the Cayley graph, right, and then you're essentially doing a random walk on the [inaudible] and looking at it projected down onto X.
>> Ryan O'Donnell: Everybody else is nodding, so yeah.
[laughter]
>> Ryan O'Donnell: Sorry. I'm not a mathematician really, so I'll -- yeah, I think that sounds
right. I'm just a lowly computer scientist. But, yeah, sorry, yeah. Yeah.
Okay. Yeah. So we have two examples mainly where X is the set of all strings of Hamming
weight K. The group acting on it is the symmetric group, you know, because if you permute the
bits of a string of Hamming weight K, it remains a string of Hamming weight K. And the
generating set is the set of all transpositions. These are the swaps.
Another case is a very simple Cayley graph of Z2 to the N. That's the Hamming cube. The generators are like the elementary vectors, you know, all zeros except for a 1 in one position, and the action of U -- when you add that, it's like flipping the Ith coordinate.
Great. Okay. So you can more generally define influences here, right? I mean, if you have a function on the set capital X, a Boolean function, a Boolean-valued function, you define the influence of the Uth generator to be the probability that F of X differs from F of X hit by the generator U. And it's natural, because of the random walk, to take X to have the uniform distribution on capital X. And again E would be the average of all the influences over all of the generating set.
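Here is a small illustrative Python sketch (not code from the paper) of the slice example: the Schreier graph of S_n acting on the weight-k slice with transpositions as generators, and the swap influences just defined.

```python
from itertools import combinations

def slice_schreier_graph(n, k):
    """Toy Schreier graph from the talk: vertices are the weight-k strings of
    {0,1}^n (encoded as the set of their 1-coordinates), the group S_n acts by
    permuting coordinates, and the generating set U is all transpositions."""
    vertices = [frozenset(c) for c in combinations(range(n), k)]
    generators = list(combinations(range(n), 2))

    def act(u, s):
        # Swapping coordinates i and j changes the string only if exactly one
        # of them is a 1-coordinate; the Hamming weight is always preserved.
        i, j = u
        return s ^ frozenset((i, j)) if (i in s) != (j in s) else s

    edges = {frozenset((s, act(u, s))) for s in vertices
             for u in generators if act(u, s) != s}
    return vertices, generators, act, edges

def influence(f, u, vertices, act):
    """Inf_u(f) = Pr_x[ f(x) != f(u . x) ] for x uniform on the slice."""
    return sum(f(s) != f(act(u, s)) for s in vertices) / len(vertices)

# Example: the dictator on coordinate 0, restricted to the middle slice.
n, k = 6, 3
vertices, generators, act, edges = slice_schreier_graph(n, k)
dictator = lambda s: 1 if 0 in s else 0
avg = sum(influence(dictator, u, vertices, act) for u in generators) / len(generators)
print("average swap influence:", avg)
```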
And, yeah, so once you get to the setup, then you just try to carry out the KKL proof in this
level -- higher level of generality, and you do it carefully. And like whenever you seem to be
doing something sort of specific to the Boolean cube, you try to stay at a higher level. And
eventually you just give the proof. So the proof is like very similar to the original proof.
There's one thing that you kind of need that is trivial in the hypercube case, but somehow in the proof it seems like you really need this extra fact: that the generating set U is closed under conjugation. So V U V inverse is in capital U. If I'm saying words like this, I should also know like quotient and [inaudible] quotient grid. So, yeah, I apologize.
Anyway, yeah, so you can basically just carry out the KKL proof in this setting with this extra
assumption. And so this extra assumption holds in like the two main cases that we care about.
In the hypercube case it's very easy because the group is abelian. So, I mean, conjugation just
doesn't do anything. It just leaves it the same. And it also works out fine in this case of the Kth
slice with the transpositions, because the transpositions form a conjugacy class of SN. So
if you conjugate a transposition you get another transposition. So that's great. So you can prove it in
both of these settings.
Now, what actual numbers do you get, or what quantitative statement do you get out? I'll come to that later. Let me just say a little bit about the proof, I guess, again, at a very, very high level. As before, you have some set -- maybe F is the indicator of these points -- and again you define little G of X to be the real-valued function on capital X which is, well, one for each U: the probability that a short random walk starting from X lands in the Uth boundary set of X. So
you have a random walk and then maybe it lands at a point where if you did a U step that goes
from inside F to outside F.
And hopefully in this graph small sets have large expansion, like in the hypercube graph. And that's exactly quantified by like the Log-Sobolev inequality for this Markov chain. Well, the Log-Sobolev inequality for this Markov chain would give you a statement like this. So hopefully not only should U be closed under conjugation, but this whole setup had better be in a configuration where, you know, the appropriate inequality holds, or like there's a large Log-Sobolev constant.
And just one word about this condition, why you need that U is closed under conjugation: somehow, when you're doing the proof, you need that -- if you do a U step followed by a random step or a random bunch of steps, it's the same thing as doing some random steps and then a U step. I mean, somehow you use that in the proof. And that's basically equivalent to saying that U is closed under conjugation.
Okay. Great. So here's the theorem. In this exact Schreier graph setup where U is closed under conjugation, this extra condition, we get this. This is like combining the two parts of the Talagrand thing into one. It says that the average of the influences is at least rho, which is the Log-Sobolev constant for the random walk, times log of 1 over the maximum of the influences.
Again, if you think of it: if the maximum influence is small, then this -- this is large, and so this is like log of that. So if the maximum influence is small, then you sort of gain like a log factor in the inequality.
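In symbols, the theorem as I understand it, stated for, say, balanced f (c is an absolute constant and rho is the log-Sobolev constant of the walk):
\[
\frac{1}{|U|}\sum_{u \in U} \mathrm{Inf}_u(f) \;\ge\; c\,\rho\,\log\frac{1}{\max_{u \in U}\mathrm{Inf}_u(f)}.
\]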
In particular, like the very first paper that introduced Log-Sobolev inequalities is this paper by Gross in 1975. I mean, his first and main example was the simplest case, the hypercube with this set of generators, just the standard random walk on the hypercube, and he computes by induction that the Log-Sobolev Inequality -- the Log-Sobolev constant -- is 2 over N.
So you can take this general theorem, just plug in 2 over N here, and you get like constant over N times log of 1 over the maximum influence. So that's exactly like the KKL -- the Talagrand refinement of KKL.
And then very luckily for us somebody else figured out the Log-Sobolev constant for this random walk. So this is Lee and Yau in 1998, quite a bit later. And they showed that for this random walk on the Kth slice with the transpositions as the generating set, the Log-Sobolev constant is again 1 over N, assuming that K over N is bounded away from 0 and 1.
If this does not hold, it actually picks up another factor of 1 over log N, and therefore you exactly lose the log factor that you gained.
Yeah. So that was good for us. But then if you plug 1 over N in, you get the exact same theorem that we needed for the robust Kruskal-Katona: that the average of the influences is at least 1 over N times -- well, if you do the simplified version, you get that if F is a balanced function, there exists some generator U with influence at least rho log 1 over rho.
This is like a simple implication of this statement. So that's how we get the two KKL applications that we needed.
Any questions? Okay. So I have a couple more slides I guess. That's the summary. I guess we
kind of finished all this. So this is a paper that Karl and I wrote a little while ago, and just now
we're kind of in the process of writing up a short paper where we show that some of these
things -- some of these results are tight. So I'll mention a couple updates from what we're
working on now.
We really thought that this condition was extraneous. We really, really worked to try to get rid
of it. And then one day we were like, hey, let's see if maybe it's necessary after all, and we just
found an example: a whole setup where U is not closed under conjugation, and then the theorem fails. So it's a pretty simple case. It's a Cayley graph. This is the group. It's nonabelian, as you see. It's like Z2 to the N, semidirect product with ZN under the natural operation where this just cyclically permutes the coordinates.
Take this generating set, which is not closed under conjugation, and then this simple function, F, which takes a string plus an index, just ignores the index and outputs the first coordinate. It's balanced but all of the influences are constant over N -- 1 over N maybe. So that's a shame. So this strong
condition is actually necessary.
And we also -- there's a further twist that Talagrand put on the original KKL which is actually strictly better. He showed that the average of the influences over log of 1 over the influences is at least 1 over N. And we also managed to generalize this slightly better version and stick the Log-Sobolev inequality in there.
And the proof is a bit different. You have to give this proof that uses like [inaudible] and some generalized Hölder inequality. But again it's kind of -- yeah. You just kind of take the proof of KKL, or Talagrand, and you just carefully stay at a high level and generalize it to this setting.
Okay. So the last thing I'll mention here is like one fun maybe open problem. This is more of a
quirky problem that came up for us. Go all the way back to this net thing. Remember we gave
this net that was like 0,1, majority, and the N coordinate functions, and we showed that every
monotone function has correlation at least log N over root N with one of these guys. And there's
N plus three guys in the net.
And for a long time we thought that, like, you know, you've got to have the N coordinate functions in there, and so you'd think -- and so probably any net that achieves this much has cardinality at least like N, right? So we were like, maybe we can get it down to N plus 2, N plus 1.
Well, actually it turns out you can -- I won't show it, but you can get one of cardinality N over log N even. So actually, now having seen this, I think you can probably get a net that's this good and has cardinality N to the epsilon. So one could try to prove or disprove that.
Okay. Thanks for your attention.
[applause]
>> Ryan O'Donnell: Questions? Claire?
>> Yeah. It's almost very simple. And I know it's not. So one thing I try to remember, one
piece of information from your talk [inaudible].
[laughter]
>> Ryan O'Donnell: Yeah.
>> [inaudible]
>> Ryan O'Donnell: No, I think you should remember the statement of the original KKL. It's a
cool thing. Well, I mean, if you want something that has to do with something we did, I guess I
like this corollary of the Kruskal-Katona that's like one of these sort -- you know, Kruskal-Katona
is an old result and it's like rigid in a sense that it like gives you the best optimize -- the best
function for like minimizing the density of the upper shadow, and it is what it is.
But we show that actually we can prove a theorem that says if you're not like the optimizers, if
you're kind of not similar to one of the optimizers, then actually the density jump is a lot bigger.
You can look for other cases where you know there's an example that achieves like this much, but if you somehow rule out some other things then maybe you can prove a much better bound
for whatever your problem is.
Any other questions?
>> Can you hint at what's the smaller [inaudible]?
>> Ryan O'Donnell: Oh. Yeah. We didn't try very hard on this, but the thing that Karl tells me
works, he wrote it down, is just divide an N inputs into blocks of size log N and take majority on
each of the blocks. And maybe also throw in 0 and 1.
So, yeah, somehow if you take majority on log N bits, it's kind of similar enough to each of the coordinate functions there that it kind of covers each of them. You can take the place of -- you can stick log N coordinate functions together and replace them with majority on those.
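Here's a quick Python sketch of that construction as I understood the answer (the block sizes and the rounding are illustrative, not taken from Karl's write-up):

```python
import math

def block_majority_net(n):
    """Sketch of the smaller net mentioned above: split the n coordinates into
    blocks of size about log n, take the majority of each block, and throw in
    the two constant functions.  Roughly n / log n functions in total."""
    block_size = max(1, round(math.log2(n)))
    blocks = [range(i, min(i + block_size, n)) for i in range(0, n, block_size)]

    def block_majority(block):
        return lambda x: 1 if 2 * sum(x[i] for i in block) > len(block) else 0

    return [lambda x: 0, lambda x: 1] + [block_majority(b) for b in blocks]
```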
>> [inaudible]
>> Ryan O'Donnell: Yeah. We -- it's quite possible. We didn't really try very hard.
>> Yuval Peres: Okay?
>> Ryan O'Donnell: Okay. Thanks.
>> Yuval Peres: Let's thank Ryan.
[applause]