>> Yuval Peres: Alright, good afternoon. We’re delighted to have Li-Yang Tan tell us about his depth hierarchy theorem for Boolean circuits. >> Li-Yang Tan: Thanks, Yuval. Yeah, thanks for the opportunity to be here. I’m going to speak about joint work with Ben Rossman and Rocco Servedio; Ben’s at the Simons Institute, and Rocco’s at Columbia. So this is a talk about circuit complexity, so very briefly, let’s recall that the broad goal is to derive strong lower bounds on the size of circuits computing an explicit function. So let me elaborate on two important words here. The first is explicit; for this talk, it’s not very important; just think of an informal definition: concrete, simple-to-describe functions. So a non-example would be—you know—a random function. And I guess, formally, in complexity theory, we think of an explicit function as one being in NP. And as for circuits, the holy grail is really to understand the power of polynomial-size {AND, OR, NOT} circuits—the standard basis—also known as—you know—P/poly, which is not important, but … so we’d ideally like to exhibit a function—a concrete, simple-to-describe function—that cannot be computed by such circuits, which would separate P from NP, but we’re still very far from that. So the focus of research so far—and the focus of this talk—is on restricted subclasses of P/poly, and in this talk, we will focus on one specific restricted subclass. So what’s this class? It’s the class of small-depth Boolean circuits; so this is a Boolean circuit over the standard basis of {AND, OR, NOT} gates. We can assume—it’s not hard to see—that all of the NOT gates are pushed to the bottom, although that’s, again, not important for this talk. So that’s the circuit, and we are interested in functions that require complex circuits. And so what’s complex? There are two measures that I’m interested in; one is depth, which is the number of layers you have; and the other is size, which is the number of gates you have—just these two. And we are gonna make it even simpler and fix one of them. For this talk, we’ll think of depth as a constant—say, a hundred—and the only parameter we care about is the number of gates, which is size. And we are interested in very strong lower bounds on size—exponential, okay—under this assumption that the depth is constant. Okay, so that’s our model. Small-depth circuits have been studied since the eighties, and it’s really a story of success. You know, ever since the eighties, we have had exponential lower bounds—again, against constant-depth circuits computing an explicit function. So the landmark result here is that of Håstad, which we will talk a lot about in this talk, and it builds on the work of Ajtai, Furst-Saxe-Sipser, and Yao—all in the eighties. And just as an aside, I should mention that it’s quite rare in—you know—circuit complexity or complexity theory that we have such strong lower bounds. These are really among our strongest unconditional lower bounds in all of complexity theory. Okay, so that’s … and the techniques that were developed to prove these lower bounds, they are important beyond circuit complexity; they have found applications in pseudorandomness, learning theory, proof complexity, and so on. And at a very, very high level, in this talk, I’m gonna speak about an extension of Håstad’s theorem, and we do so via a generalization of his techniques. So we’ll hear a lot about Håstad’s theorem, his techniques, and how we extend both of them. Okay, so let me go into detail about the outline of the talk.
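For concreteness, here is a minimal Python sketch (purely illustrative, not from the talk or the paper) of the model just described: an unbounded-fan-in circuit over the {AND, OR, NOT} basis, with negations only at the inputs, together with the two measures we care about, depth (the number of layers of gates) and size (the number of gates).

```python
# A minimal sketch (not from the talk or paper): an unbounded-fan-in Boolean
# circuit with AND/OR gates and negations only at the inputs, plus the two
# measures discussed above: depth (layers of gates) and size (gate count).

class Gate:
    def __init__(self, op, children):
        self.op = op              # "AND" or "OR"
        self.children = children  # sub-gates or literals

class Literal:
    def __init__(self, var, negated=False):
        self.var, self.negated = var, negated

def evaluate(node, x):
    if isinstance(node, Literal):
        bit = x[node.var]
        return (not bit) if node.negated else bool(bit)
    vals = (evaluate(c, x) for c in node.children)
    return all(vals) if node.op == "AND" else any(vals)

def depth(node):
    return 0 if isinstance(node, Literal) else 1 + max(depth(c) for c in node.children)

def size(node):
    return 0 if isinstance(node, Literal) else 1 + sum(size(c) for c in node.children)

# Example: a depth-2 circuit (an OR of ANDs, i.e. a DNF) on 4 variables.
dnf = Gate("OR", [Gate("AND", [Literal(0), Literal(1)]),
                  Gate("AND", [Literal(2), Literal(3, negated=True)])])
print(depth(dnf), size(dnf))        # 2 layers, 3 gates
print(evaluate(dnf, [1, 1, 0, 0]))  # True
```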
I’m gonna tell you about Håstad’s theorem and two of its extensions, neither of which is due to us—both are due to him back in the eighties—one is average-case hardness, and the other is a depth hierarchy theorem; I’ll explain what each of these means in a second; you can probably guess what the first means already. And our main result is that we achieve both extensions of Håstad’s theorem simultaneously. So as the title suggests, we prove—you know—an average-case depth hierarchy theorem, which again, I’ll try to explain what that means. After telling you about our main result, I will give two applications, in two fairly different areas: one in structural complexity, showing—you know—that the polynomial hierarchy is infinite relative to a random oracle—I’ll explain what that means—and the second—in a completely different area—in the Fourier analysis of Boolean functions, we answer a question that’s been floating around in the past few years about a converse to the famous Linial-Mansour-Nisan theorem, which I’ll tell you about also. And if I have time—I hope we have time—I’ll tell you about a technique, which is that of random projections, which is a generalization of Håstad’s technique, which is random restrictions. So I’ll tell you about random restrictions and what random projections are. Okay, so let’s get started with Håstad’s theorem. So Håstad’s theorem: he proved exponential lower bounds against constant-depth circuits computing an explicit function, and the function is very explicit—the parity of x1 to xn. There’s no arguing that this is explicit; it’s concrete; it’s simple to describe. Okay, so what does his theorem say? In his PhD thesis, he showed that for every depth d greater than two—think of d as a hundred—any depth-d circuit computing the n-variable parity function requires size two to the n to the one over d minus one. So in particular, if you ask me to build a depth-100 circuit and make me compute parity, I need lots and lots of gates: two to the n to the zero point zero one. So these are the results we like in circuit complexity; it’s a very strong lower bound against a very explicit function. >>: Is this tight? Like, from … >> Li-Yang Tan: Yeah, it’s tight for parity. In particular, beating this bound for a function other than parity is a big open problem, yeah. So these are our best lower bounds against depth-d circuits. Okay, so that’s Håstad’s theorem: parity requires huge circuits. >>: The fan-in and … for now, can be un … >> Li-Yang Tan: Unbounded, right? Sorry, I should have said that: unbounded fan-in, but constant depth. Yeah, if the fan-in is bounded, then you cannot do much in constant depth. Yeah, sorry. Thanks for that. Okay, so this is Håstad’s theorem—simple. Let’s talk about two extensions, both due to him. Okay, the first is average-case hardness. So in fact, he proved something stronger; you know, two slides ago, I showed that—you know—constant-depth circuits of—you know—what seems like large size cannot compute parity. In fact, he shows in his thesis that—you know—depth-d circuits of the same size—two to the n to the one over d—agree with parity on only a half plus little o of one fraction of inputs. In fact, the little o of one is very strong—an exponentially small fraction of inputs. So this is, hopefully, clearly an extension; two slides ago, I said that such circuits cannot compute parity; this gives me more information; it says that such circuits cannot even correlate with parity. Okay?
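For reference, here are the two statements just described, written out as in the talk; constants hidden in the exponents are suppressed.

```latex
% Håstad's theorem and its average-case strengthening, as stated in the talk
% (constants in the exponents suppressed).
\textbf{Theorem (Håstad).} For every depth $d \ge 2$, any depth-$d$ circuit
computing $\mathrm{PARITY}(x_1,\dots,x_n)$ requires size $2^{\,n^{1/(d-1)}}$.

\textbf{Extension 1 (average-case hardness).} Depth-$d$ circuits of size
$2^{\,n^{1/d}}$ agree with $\mathrm{PARITY}$ on at most a $\tfrac12 + o(1)$
fraction of inputs; in fact the $o(1)$ is exponentially small.
```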
So just a few notes: just the constant-one function—a trivial, you know, depth-0 circuit—has agreement one half with parity. So this sort of says that if you allow me to build a depth-100 circuit, and you allow me size two to the n to the zero point zero one—which seems like a lot—I cannot do much better than a constant. If I’m lazy, I may as well just output the constant; I cannot even do an exponentially small fraction better than a constant. So it’s a very strong statement. And this is not too relevant for the talk, but as an interesting aside, this was implicit in his thesis, but the exact relationship between d, this size, and the correlation bound was only recently pinned down by Impagliazzo et al. and by Håstad himself, thirty years later. Okay, so this is the first extension: correlation bounds against parity. Let me talk about the second extension, which is that of a depth hierarchy theorem, and this will take slightly more time. In particular, I need some notation; AC0 sub d is the set of all depth-d, poly(n)-size circuits, or the functions computable by them. So depth 2 is not very interesting; it’s the set of all polynomial-size ORs of ANDs or ANDs of ORs—also known as DNFs or CNFs—then depth-3 circuits, depth-4, and if you take the union of it all, you get AC0, which is the class of all functions computable by constant-depth, poly(n)-size circuits—in particular, parity is not one such function. Okay, so this is a cartoon picture of Håstad’s theorem; it says that you have AC0, which is all constant-depth circuits, and parity doesn’t live in this green circle. In fact, if you want to compute parity with a poly-size circuit, you need depth roughly log n. So in a cartoon, it says that parity lives outside this circle, at depth log n. So here’s a challenge: Håstad’s theorem tells us that—you know—if you make me compute parity with depth d, I need two to the n to the one over d. So here’s the challenge: I want the same lower bound against depth-d circuits, but for a function that is in AC0 at depth d plus one. ‘Kay, so intuitively, this feels like a more challenging task; you want the same lower bound against the same class of functions, but for a much simpler target function. Right, in particular, you want it to be so simple that, just by allowing me one more layer of depth, I can compute it in polynomial size. So this is—hopefully also obviously—an extension of Håstad’s theorem. And Håstad was able to do it; it’s the so-called depth hierarchy theorem; also in his PhD thesis, he showed that for every depth d greater than two, there’s a function fd—the so-called Sipser function, which I’ll tell you about—such that fd is actually quite simple, as in linear size at depth d plus one, but if you force me to use a depth-d circuit, I will require an exponential blow-up—two to the n to the one over d—and this is, again—just to recall—the same lower bound we have against parity. Okay, so a few notes: it builds on the work of Sipser—hence the Sipser function—which gave a super-polynomial separation; Yao, a few years later, gave an exponential separation, and it was sharpened by Håstad. Okay, and before Yao’s work, the monotone case was solved by Klawe et al.; it’s the same theorem, just put monotone everywhere. There’s a monotone function fd—Sipser is monotone—fd is in AC0 at depth d plus one, but the lower bound only holds against monotone circuits, so it’s a … yeah.
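And here is the second extension in the same notation (again, constants in the exponents are suppressed):

```latex
% Håstad's (worst-case) depth hierarchy theorem, as described above.
\textbf{Extension 2 (depth hierarchy).} For every depth $d \ge 2$ there is a
function $f_d$ (a Sipser function) computed by a linear-size, depth-$(d+1)$
circuit, such that any depth-$d$ circuit computing $f_d$ requires size
$2^{\,n^{1/d}}$.
```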
And the hard function for all of the above is the Sipser function, which I’ll tell you about. But before telling you about that, just very briefly, the conceptual message here—why it’s called a depth hierarchy theorem—is that it says that depth-(d plus one) circuits are much, much more powerful than depth-d circuits. You know, you give me a nice, linear-size depth-(d plus one) circuit, and you make me decrease the depth by one, I may have to blow up exponentially. Okay, so what’s the depth-d Sipser function? The formal definition is a depth-d, read-once, regular, alternating, monotone formula, but really the picture just says it all. You have alternating layers of AND, OR, AND, OR, AND, OR—depth d—the fan-in is regular—it’s n to the one over d—and it’s read-once; and it’s alternating AND/OR. It’s read-once in that—you know—the bottom-layer AND gates or OR gates touch distinct sets of variables, and it’s monotone—there are no NOT gates. Right? It’s sort of the obvious depth-d formula; if you made me write down a depth-d formula, this is maybe one of the first I’d write down. And in particular, it’s linear size, right? Any read-once formula is linear size—a very nice, regular structure. That’s the Sipser function. So let’s get some intuition as to why the Sipser function is so important for depth hierarchy theorems. So here’s a challenge: I have a depth-3 Sipser—fan-in n to the one third, alternating AND and OR layers—and it’s nice and linear size; now, you point the gun at my head, and you say, “I want you to compute this in depth two.” Okay? So here’s one thing I can do: I focus in on this AND subgate; I rewrite an AND of ORs as an OR of ANDs—right, we can do that by distributing, but not very efficiently, right? How do you rewrite an AND of ORs as an OR of ANDs? You take one from every bucket—so it’s n to the one third to the n to the one third—so it’s roughly two to the n to the one third, right? And you write this gate as an OR of ANDs—do this for every gate—and now you see that—you know—you have an OR of ORs of ANDs, and you can collapse the two ORs, because an OR of ORs is just an OR. It’s kind of—I don’t know—this is one way to do it, but not very smart, and you can ask—you know—whether there’s a better way to do it, right? And what Håstad shows is that—you know—essentially, this is the best thing you can do. If you want to compute depth-3 Sipser with depth 2, just distribute and blow up the size—that’s the best thing you can do. And the same thing with depth a hundred, computing it with depth ninety-nine—right—you ju … >>: So what you’re saying is … essentially, we really can’t do better anyway, and the lower bounds nearly match, so … >> Li-Yang Tan: Nearly match. You don’t always get two to the n to the one third, right? You get two to the n to the Omega of one over d. >>: Yeah, yeah, yeah. >> Li-Yang Tan: Right? So it’s like you have a lower bound of two to the n to the one over, say, a hundred and eighty. >>: Okay, but let’s say, from a construction point of view, you don’t know anything better than this. >> Li-Yang Tan: Oh, strictly better? Like, even just to save one gate? >>: Yeah, or not one gate, but let’s say something in … >> Li-Yang Tan: Yeah, I’m not sure. That’s a good question. For depth 3, I do not know if I can prove a precise two-to-the-n-to-the-one-third lower bound. >>: No, but the question is different. >>: No, I’m just saying upper bound. >> Li-Yang Tan: Sorry? >>: Is the question different: can you save one gate?
>> Li-Yang Tan: I don’t know, yeah. The lower bound is definitely not delicate enough to capture that you cannot, but maybe the upper bound is … yeah, I wouldn’t say that this is exactly optimal. Yeah, yeah, okay. So yeah, that’s the depth hierarchy theorem and some intuition as to why Sipser is the function to look at. Okay, so what have we shown? We have Håstad’s theorem, which is nice; it says that parity is not in AC0; in fact—you know—depth-d circuits of what seems like large size—two-to-the-n-to-the-one-over-d-size circuits—cannot compute parity. We have two extensions that feel somewhat different. One is that depth-d circuits of that size cannot even approximate parity to fifty-one percent. The second one is that—you know—we don’t consider parity—we consider a much simpler function—and we get—you know—the same kind of lower bound against the same kind of circuits. So you can ask—and Håstad asked this—you know, can I get the best of both worlds? Can I show that there’s a function at depth d plus one such that, if you force me to use depth-d circuits, and you allow me huge size, I cannot even approximate it? Right, that seems like a natural best-of-both-worlds kind of extension, and our main result in this work is: we confirm this conjecture of Håstad. Okay? So I’d like to tell you about this picture for the rest of the talk. Okay. So more precisely, it’s no surprise we have a Sipser function; we show that for every depth d greater than two, there’s a function fd—the Sipser function—which is in AC0 at depth d plus one, but depth-d circuits of size two to the n to the one over d have agreement at most a half plus little o of one with it. Okay, you cannot do fifty-one percent. And previous work: O’Donnell-Wimmer, in 2007, proved the d equals two case; they proved—you know—that the depth-3 Sipser function, you cannot even approximate it in depth 2. And that was, well, the starting point for us; we build on their techniques, and in particular, this is sort of the base case for us. So what we did is basically a reduction—a depth reduction—to the O’Donnell-Wimmer proof of the depth d equals two case. Okay, to answer a question that is probably on your minds about this correlation bound—right—a few slides ago, I said—you know—parity is very, very hard for AC0, and—you know—your correlation is at most exponentially small, and here, it seems like we’re not doing as well; we only get—you know—one over poly(n) for d being constant. So you can ask: why not exponentially small? I have two reasons: one is that it’s simply not possible for the Sipser function; the Sipser function is a monotone function, and it’s a standard result that every monotone function has one-over-poly(n) correlation with a very simple circuit—you know, either a dictator, some xi, or a constant. So you cannot hope to do this for the Sipser function. So this is not a very good excuse; you say, “Find another function—a non-monotone function—at depth d plus one for which you can prove exponential correlation bounds.” But in fact, it’s not possible for any function to begin with; if you force the function to be in depth d plus one—which it has to be, because of the name of the game—it’s a standard result that any depth-(d plus one) circuit has one-over-poly(n) correlation with one of the depth-d circuits that feed into the top gate. So that seems to be some sort of fundamental difference between depth hierarchy theorems’ correlation bounds and parity.
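For reference, here is the main result with the parameters as described in the talk (the precise exponents are in the paper):

```latex
% The average-case depth hierarchy theorem (Rossman--Servedio--Tan),
% parameters as described in the talk.
\textbf{Main theorem.} For every depth $d \ge 2$ there is a function $f_d$
(a Sipser function), computed by a linear-size, depth-$(d+1)$ circuit, such
that every depth-$d$ circuit of size $2^{\,n^{1/d}}$ agrees with $f_d$ on at
most a $\tfrac12 + o(1)$ fraction of inputs; the $o(1)$ here is of the form
$1/\mathrm{poly}(n)$, which, as just discussed, is unavoidable.
```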
In particular—and that’s the second reason; the first was the monotonicity—you know—you cannot hope for a half plus something exponentially small in n to the one over d, so what we get is essentially optimal—yeah. Okay, so that’s our theorem; let me touch on two applications of our result in two fairly different areas. One’s in structural complexity—so forget everything about circuits—let’s talk about oracles and relativization. So part of our job in complexity theory is that we want to separate complexity classes, and—you know—we haven’t been able to do so for many of them, and the famous example is whether P is equal to NP. So we haven’t been able to do it, so we can consider a twist of the question: imagine a world where algorithms have free access to some magical function; call it A, okay? You give A an input, and for free, in unit time, you get an answer—so think of A as 3-SAT; so you have this magical oracle: you give it a 3-SAT formula, and you snap your fingers, and it tells you whether it’s satisfiable or not. So in this world, you can ask whether P is equal to NP; it’s a slightly different question, but it seems like a natural question; and the notation is: does P to the A equal NP to the A—does P, given an A oracle, equal NP, given an A oracle? So a priori, it’s not clear that this is any easier than P versus NP, but actually, for this, we have made a lot of progress, ever since the seventies. The paper introducing this notion of oracles—Baker, Gill, and Solovay—noted that there exists some oracle for which you can separate P and NP, and this was improved a few years later, qualitatively, by Bennett and Gill, who showed that—you know—in fact, for almost every magical oracle A, P to the A is not equal to NP to the A. So we are still far from separating P versus NP, but at least in this—you know—oracle sense of the word, this is a pretty satisfactory solution. Right, not only does there exist one oracle; for almost every oracle, they are distinct. Okay, that’s great. >>: What is the distribution? I mean, when you say almost all … >> Li-Yang Tan: The uniform distribution—say, ninety-nine percent, or one minus little o of one. For almost every function that you can give, P is not equal to NP. >>: True. >> Li-Yang Tan: And yet, we cannot … that doesn’t imply P does not equal NP in our world. So that’s a little off-topic, but as a caveat, this shouldn’t be taken as evidence that P is not equal to NP, for various reasons that we discovered later—not we, but you know, in the eighties, yeah. Yeah, so … but that’s a caveat; it just, independently, is sort of an interesting question. Okay, so we have resolved P versus NP in these worlds; let’s move on to other questions. So here are two statements that we also believe are true; they both concern the so-called polynomial hierarchy—I’ll explain what that means in a second. One is that PH—the polynomial hierarchy—is not equal to PSPACE; the second is that PH is infinite. For this talk, it’s actually not really important to know what either of these statements means or really what the PH is; these are two statements—like P versus NP—that we’d like to prove, but we cannot prove; and two is stronger than one; and two implies P is not equal to NP, so two is very, very strong. And so we are stuck on them, but we can ask—you know—do these separations hold relative to some oracle? And if we can do that, we can ask: do they hold relative to almost all oracles? >>: If one is false, two’s also false, right? They’re … oh, okay.
>> Li-Yang Tan: Right, two implies one. Exactly, so two is a very strong statement. So let’s see; we had success on the P versus NP question with respect to oracles, and we’ve had much success here, too. Yao and Håstad showed that—you know—the weaker statement—that PH is not equal to PSPACE—holds for some oracle A; there exists a magical function such that PH is not equal to PSPACE. This was improved by Cai and Babai, who showed that PH is not equal to PSPACE for almost all oracles A. So for the weaker question—you know—we are very satisfied, again, and Yao and Håstad also proved that, in fact, PH is infinite relative to some oracle A. And again, you can ask for the strongest of all these statements—and they conjectured that, in fact, the stronger statement is true for almost all oracles: that PH is infinite for almost all oracles A. So this would imply these results, because—you know—PH being infinite for almost all oracles A implies that PH is not equal to PSPACE for almost all oracles A, and—you know—this version says it just for some oracle, and here, we are saying it for almost all oracles. And in this work, we confirm this conjecture, and in fact, I’d like to touch on why this is a direct consequence of our circuit lower bounds; as you may have guessed from the names of the people who proved these, these—you know—relativization results are established using circuit lower bounds. So let me touch on this connection between circuits and relativization. Okay, so there is actually a tight connection between—you know—the class of circuits we are interested in, which is—you know—bounded-depth circuits, which we think of as a very limited model of computation—you cannot even compute parity—and—you know—the polynomial hierarchy, which is a very, very expressive—you know—tower of models of computation. And this correspondence was noted by Furst, Saxe, and Sipser, and roughly speaking, the two sides differ by an exponential. And the connection is really very close—you know, depth-3 AC0 with, you know, an AND gate on top corresponds to Pi3; depth-10 AC0 with an OR gate on top corresponds to Sigma10—okay, there’s really a one-to-one correspondence. In particular, I believe that this was the original motivation for proving circuit lower bounds; we wanted to use circuit lower bounds to prove lower bounds against the polynomial hierarchy. Okay, so let’s take a look at the paper—the title of the paper is “Parity, Circuits, and the Polynomial-Time Hierarchy”—it says that a super-polynomial lower bound is given for the size of circuits of fixed depth computing the parity function. Okay, and they say that connections are given to relativizations of the polynomial-time hierarchy. So this is the paper, and let’s translate it; the first sentence basically says that parity’s not in AC0—they prove a super-polynomial lower bound on the size of AC0 circuits computing parity—and the second says that if you can improve it to super-quasi-polynomial, then you have that PH is not equal to PSPACE for some oracle A. Okay, and it’s quite easy; so they didn’t quite do this, but they noted that if you improve this separation to super-quasi-polynomial, then you get this separation. And this was done by Yao and Håstad in ‘86, and not surprisingly, they proved this by proving the circuit result; they proved sufficiently strong lower bounds on the size of circuits computing parity.
Okay, so in fact—you know—we saw two slides ago—y’know—you have the circuit results and the two extensions; they correspond perfectly to a picture in the relativized world. Yao and Håstad proved that—you know—depth-d circuits of—you know—large size cannot compute parity, and that corresponds exactly—and this was a motivation for them—to showing that PH is not equal to PSPACE for some oracle A. You have two extensions, one of which says that you cannot even approximate it—that corresponds exactly to the strengthening of this theorem to say that the separation is true for almost all oracles A. You have another extension that says that—you know—depth-d circuits cannot compute, not just parity, but a super simple function at depth d plus one; that corresponds exactly to the fact that PH is infinite relative to some oracle. And just like how you can ask for the best of both worlds in the circuit world, you can ask for the best of both worlds in the oracle world, and by confirming Håstad’s conjecture, we confirm the conjecture that PH is infinite relative to almost all oracles. Okay, I’m not gonna touch on this connection more, except to say that it’s really a very, very close translation—it’s really a mirror image. We didn’t even have to prove that; it just follows by standard techniques. So okay, that’s application one; it’s sort of retro and old-school. Let me switch gears to a different application now, in analysis of Boolean functions—completely different, so forget all about oracles. So let me start with a very basic fact about circuits. Fix a function f; consider the following very simple experiment: draw a uniform random x, flip a uniformly random coordinate to get y, alright? It’s that simple, and I want to know what’s the probability that f of x is not equal to f of y, and I’m gonna multiply by n—this is a matter of convention; don’t worry too much about it, but multiplying by n brings it to a number between zero and n—and it’s known as the (total) influence of f, also called the average sensitivity. And the name average sensitivity makes sense, right? It’s the average number of coordinates you’re sensitive to. So given any function f, I can ask: what’s its influence? Is it low? Is it close to zero? Is it high? Is it close to n? It’s just a measure of Boolean functions, and a famous theorem of Linial, Mansour, and Nisan, sharpened by Boppana, says that if you give me some more information about f—you promise me that f is computable by a size-s, depth-d circuit—then its influence is bounded by log s to the d minus one. And again, the regime of parameters we should think of is s being poly(n) and d being constant, in which case LMN says that—you know—AC0 circuits have polylog(n) influence, which, on a spectrum from zero to n, we should think of as low. So LMN says that—you know—small circuits of small depth have low influence. And this is a very important result. So let’s look at—you know—the usual suspects. So this is the line of all possible influences. On the left, you have low-influence functions; on the right, you have high-influence functions. So let’s start with high influence; parity is the world’s most influential function—it has influence n—a random function has influence n over two—so very influential—and majority is not hard to work out—its influence is roughly root n. For this talk, let’s think of root n as high. Okay, so those are high-influence functions.
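To make this measure concrete, here is a small Python sketch (illustrative only) that computes total influence by brute force, following exactly the experiment just described: draw a uniform x, flip a uniformly random coordinate, take the probability of disagreement, and multiply by n. On small n it reproduces the numbers on the slide: parity has influence n, a dictator has influence 1, and majority comes out around root n.

```python
from itertools import product

def total_influence(f, n):
    """n * Pr[f(x) != f(y)], where x is uniform and y is x with one
    uniformly random coordinate flipped; computed exactly by enumeration."""
    disagreements = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1
            disagreements += f(x) != f(tuple(y))
    return disagreements / 2 ** n   # the two factors of n cancel

parity   = lambda x: sum(x) % 2
dictator = lambda x: x[0]
majority = lambda x: int(sum(x) > len(x) / 2)

n = 5
print(total_influence(parity, n))    # 5.0  (influence n)
print(total_influence(dictator, n))  # 1.0
print(total_influence(majority, n))  # 1.875 (roughly sqrt(n))
```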
Let’s look at low-influence functions; you have the constant function, which is very boring—its influence is zero—and x1, which is also boring—its influence is one. You know, these are not very interesting functions. And you have the Tribes function, which is a DNF; its influence is log n. Okay, so you have this spectrum with all the usual suspects among Boolean functions lying on it. And as you can see, on the right are canonical examples of functions that do not lie in AC0; they are—you know—complex functions, and LMN says that this is not a coincidence. You know, again, for the range of parameters we should think of—you know—size poly(n) or even quasipoly(n)—if the depth is constant, LMN says that—you know—you give me such a function, it lies to the left of polylog(n). So in particular, LMN shows that majority, and a random function, and parity are not in AC0. So it’s a strong theorem about—you know—the stability of circuits. Okay, so LMN is great. A question that was asked by Benjamini, Kalai, and Schramm in a very famous paper about noise sensitivity and about influence—and it was repeated in different forms by O’Donnell, Kalai, and Hatami in the past few years—is whether the converse to LMN is true. Okay? What do I mean by that? LMN, vaguely speaking, says that small-depth, small-size circuits have low total influence; is it true that low-influence functions are basically small-depth circuits? It’s not hard to show that—you know—low-influence functions are not exactly small-depth circuits, but it was still possible that low-influence functions are essentially small-depth circuits, in the sense that you can well-approximate them with small-depth circuits. So just to be slightly more precise, LMN says that if you have this structure—you know—poly-size or even quasi-polynomial-size and constant-depth circuits—you lie to the left of polylog(n). Okay? Now, you can ask: if you lie to the left of polylog(n), do you have this structure? Are all polylog(n)-influence functions well approximated by the same class of functions? If it were true, this would be a very, very nice characterization of low-influence functions. But to spoil the suspense—you know—we disprove it; our main result gives a strong counterexample to this, which is unfortunate. And in particular, a question that came up during the times I gave this talk is: it would be nice to try to save this conjecture somehow; I’m very interested in the structure of polylog(n)-influence functions. Yeah, and roughly speaking—as an aside for the experts—log n is where Friedgut’s theorem breaks down. Like, if your influence is below log n, Friedgut’s theorem gives you a very nice structure, but at log n it doesn’t give you any information. So I’m very, very interested in the structure of log(n)-influence functions, but … I was trying to prove this all summer, but ended up disproving it instead. So yeah, okay. And again, it’s a simple consequence of our main theorem; I’m not gonna go into it; there’s nothing there; it follows quite easily. >>: So the example again? It would be the … >> Li-Yang Tan: The Sipser function, scaled up. Yeah, it’s … >>: Sipser function? >> Li-Yang Tan: The Sipser function of depth, say, square root of log n. Okay, it has some influence, and what does our main result show? It says that any circuit of depth, not just constant, but square root of log n minus one cannot even approximate it.
>>: Yeah. >> Li-Yang Tan: So by adjusting parameters, choosing—you know, instead of square root of log n—maybe something else, you get a log(n)-influence function that cannot be approximated even by super-constant-depth circuits. Okay, so yeah, again, an open problem is to rescue this somehow. Okay, so two applications in two fairly different areas; one is in structural complexity, which shows that—you know—PH is infinite relative to a random oracle, and the second is to answer this BKS conjecture—repeated by O’Donnell, Kalai, and Hatami—which is that, unfortunately, there’s no approximate converse to LMN. Okay, so let me now actually talk about the proof. So for the rest of this talk, let me tell you about Håstad’s techniques, the difficulties in applying them—you know, and by that, I mean applying them to get our result, which is an average-case depth hierarchy theorem—and how our techniques overcome these difficulties. Okay, in particular, let me give a very, very high-level, rough structure of Håstad’s proof, just in one slide and pictures, and I will try to say why extension one—the, you know, approximation version—follows easily; in fact, it’s implicit in the proof. I’ll tell you why extension two does not follow easily and why Håstad had to do extra work to prove extension two. And about the way he proved extension two, I’ll explain why it breaks extension one—why he loses average-case hardness—and—you know—I hope to convey this tension between extension one and extension two—if you wanted extension one, you could not get extension two, and if you wanted extension two, it was hard to get average-case hardness—and how our techniques somehow are able to get both. Okay, but let’s start with the very basic Håstad theorem; in a picture, let’s, like, recap his proof. So first of all, his main technique—which is that of random restrictions—is a very simple concept: you take a Boolean function, f; you apply a random restriction to it; you get a simpler Boolean function, f sub rho, where rho is your random restriction. I’ll make this more precise in the next slide, but it was a very important concept introduced by Subbotovskaya back in the sixties, and even today it is a really indispensable tool in circuit complexity. Håstad’s theorem uses it, and our theorem builds on it. So let me tell you what a random restriction is. For a random restriction, you have some parameter p—think of it as small, so zero point one or one over log n—and you generate—you know—a string, rho, in {0, 1, *} to the n. And how do you generate it? Independently, with probability p, you put down a star; otherwise, you flip a coin and put down a zero or a one. So the string you get out has roughly a p-fraction of stars, and the one-minus-p fraction is split half-half between ones and zeroes. Okay, this is what a restriction is; what does it mean to hit a function with a restriction? The restriction of a Boolean function f by this string rho is the function where—you know—you take in values for the stars, and for the non-stars, you fill in according to the template. Okay, so this is what it means to transform f into f sub rho. So as you can see, intuitively, you do make the function simpler, right? It was an n-variable function; now, it’s roughly a pn-variable function. So let’s see what this does to functions; so that’s what a random restriction is. So here’s Håstad’s theorem again as a cartoon.
It says that parity—the red dot—doesn’t live in the green circle; it lives at depth log n. So how do you prove such a statement? You hit the red dot with a random restriction, and you hit the green circle with a random restriction, and if you can argue that two different things happen to them, then clearly, the red dot cannot lie in the green circle, right? So in more detail, you argue that parity, when you hit it with a random restriction, remains complex. It basically becomes parity on fewer variables, but still very complex. You take AC0, and you hit it with a random restriction, and you are gonna collapse it to a really, really simple function—a small-depth decision tree—which, roughly speaking, lies within the bottom-level circle. And then, if you can argue these two things, then to finish it off, you just argue that—you know—simple functions cannot compute complex functions, and you’re done, right? Okay, so of these three steps, one is essentially by definition of random restrictions and parity—it’s not hard at all—three is a simple exercise—that small-depth decision trees cannot compute parity—and the main technical work that Håstad had to do was the second step—to show that if you hit the green circle with a random restriction, the whole tower really just collapses. Okay, so let me give one slide about this main technical ingredient. The famous switching lemma says that you take any function in AC0, you hit it with a random restriction—with carefully chosen p, depending on the structure and size of the circuit—and the depth of it decreases by at least one. So to prove this theorem, you hit it—you know—d times, and you can argue that—you know—under the overall restriction, it collapses to a very simple function. Okay, so there’s one slide; we’ll come back to this later, but that’s really the main technical ingredient, and this is the technique that’s seen lots of applications. Okay. So we’ve just sort of sketched the proof of the parity-not-in-AC0 result; in particular, for every depth d greater than two—you know, depth-100 circuits of large size cannot compute parity. Okay, and actually, I claim that we have implicitly also established extension one. You know, the proof implicitly gives average-case hardness: not only can you not compute parity, your agreement is a tiny, tiny fraction. So let me tell you in one slide why this follows. It’s a consequence of our random restrictions; the key fact here is that our random restrictions hide a uniform random string. What do I mean by that? The phrase we have been using is that they complete to the uniform distribution, which I guess is not very formal, but here’s the formal meaning: you generate this random restriction, rho—zeroes, ones, and stars—and now, you fill in the stars with zeroes and ones uniformly at random. Consider this experiment; at the end of it, you get a fully zero-one-valued string. The obvious-but-crucial fact behind why this gives average-case hardness is that—you know—the resulting string is a uniform random string, and this is sort of why—you know—you have proved an average-case hardness result. By hitting a function with—you know—this random restriction, you’re implicitly feeding it a uniform random input. And this is not hard to see at all; it’s because—you know—when you’re not a star, you’re split between zeroes and ones.
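Here is a small Python sketch (again illustrative, with parity as a toy target) of a p-random restriction and of the completion-to-uniform fact just described: generate rho, restrict f by it, and note that filling in the stars with uniform bits yields a uniform random input.

```python
import random

def random_restriction(n, p):
    """A p-random restriction: each coordinate is '*' with probability p,
    otherwise a uniform bit."""
    return ['*' if random.random() < p else random.randint(0, 1)
            for _ in range(n)]

def restrict(f, rho):
    """f restricted by rho: a function of the surviving (starred) coordinates."""
    stars = [i for i, v in enumerate(rho) if v == '*']
    def f_rho(assignment):            # assignment: bits for the stars, in order
        x = list(rho)
        for i, b in zip(stars, assignment):
            x[i] = b
        return f(x)
    return f_rho, stars

def complete(rho):
    """Fill in the stars uniformly at random.  The resulting string is a
    uniform random input -- this is why hardness under the restriction
    translates into average-case hardness."""
    return [random.randint(0, 1) if v == '*' else v for v in rho]

n, p = 8, 0.3
parity = lambda x: sum(x) % 2
rho = random_restriction(n, p)
f_rho, stars = restrict(parity, rho)
print(rho, "-> parity restricts to parity (or its negation) on", len(stars), "vars")
print(f_rho([0] * len(stars)), f_rho([1] * len(stars)))
print(complete(rho))   # distributed exactly like a uniform random 8-bit string
```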
I mean, so I won’t go into detail, but this is, roughly speaking, the crucial fact behind why you get average-case hardness. Okay, what have we done? We have sketched Håstad’s proof of his basic theorem; we have shown why—you know—extension one is essentially—you know—implicit in the theorem. Let me tell you the more interesting thing, which is why extension two does not follow easily from Håstad’s theorem and why, by proving extension two, he broke extension one. So here, again, is the statement of Håstad’s theorem and its proof—it says that parity is not in AC0. And how you prove it: you show that—you know—when you hit parity with a random restriction, it remains complex, whereas if you hit AC0 with a random restriction, you collapse to a simple function, and you just note that—you know—simple functions cannot compute parity. Okay? Why can’t this prove a depth hierarchy theorem? It’s because, for a depth hierarchy theorem—you know—I want to separate depth d plus one and depth d, and this is not delicate enough to do that. You know, the lightning is too powerful; it destroys all of AC0. In particular—you know—my hard function—say the Sipser function—lies in here, and you destroy it, right? You try to do the proof, and you say, “I hit my hard function with a random restriction; I hit depth-d AC0 with a random restriction.” It shows that both of them collapse to small-depth decision trees, and it doesn’t give you the contradiction you want. But this was not trouble for Håstad; he was able to do it by designing new random restrictions—not yellow in color, but blue in color—designed specifically for the Sipser function to keep it complex. Okay, so in a picture, what he does is he comes up with a new one, specifically with Sipser in mind, so that he very carefully keeps Sipser complex—he doesn’t want to destroy it—and yet, he still has to prove that anything of depth one less than Sipser still collapses to a decision tree. So he had to prove a new switching lemma for the blue random restrictions, and this blue random restriction was tailored very specifically for the Sipser function. So this is very nice. In particular, here you see you are doing something much more delicate, right? Your contradiction comes from the fact that a decision tree cannot compute a depth-2 circuit—you know, it really has to be very careful. Okay, so intuitively—just to say it again—it’s a much more delicate task, right? For parity and AC0, it’s a nice result, but—you know—in part—you know—your hard function was really hard to begin with. So you just have to—you know—destroy AC0 and show that—you know—by destroying AC0, you do not destroy your hard function by too much. Whereas here, you’re really trying to get very, very fine-grained information about the structure of circuits, right? You have to come up with something that destroys depth-d circuits but preserves—you know—your special function at depth d plus one. So this was [indiscernible] but Håstad did it; so this is parity not in AC0; this is the depth hierarchy theorem. But he paid a price; the price comes in the fact that he only gets a worst-case depth hierarchy theorem. So recall the key fact about the yellow—the usual—random restrictions: they are independent across coordinates, and in particular, they complete to the uniform distribution. So you’re hiding—implicitly hiding—a uniform random string in the random restriction.
Håstad’s new restrictions are carefully tailored for the Sipser function—you know, the coordinates are not independent; they are carefully correlated to keep Sipser complex. Okay, and the distribution is only supported on an exponentially small set of inputs, and hence, you only prove worst-case and not average-case hardness. So just to summarize the difficulty that we faced when we tried to do this project: at a high level, there are three requirements for an average-case depth hierarchy theorem; Håstad has two proofs, each of which achieves two of the three, but not all three. One, you have to keep the target function—the hard function—complex; two, your approximator, you have to destroy it; and three, you have to do so in such a way that however you’re hitting them, it completes to the uniform distribution, okay? So Håstad’s parity-not-in-AC0 proof does two and three—you know, his famous switching lemma destroys AC0 circuits, and it completes to the uniform distribution, just because it’s so simple; you know, you flip a coin independently for every coordinate—but his—you know—his yellow lightning was too powerful; it was designed to destroy all of AC0, and in particular, it destroys the hard function that you’re not supposed to destroy. Okay, so he—when faced with this—said, “No problem. I’m gonna define a new random restriction that keeps my target function, at depth d plus one, complex. I’ll still prove that my switching lemma holds—that everything else collapses.” But the price he paid was that he did it so carefully, and correlated the coordinates so carefully, that it doesn’t complete to the uniform distribution. And in this work, we design a random projection that achieves all three; it’s not hard to see that with random restrictions, you cannot achieve all three, and a key idea here was—you know—random projections. So let me, in my remaining time, tell you a bit about random projections and how they relate to random restrictions. And if I have time—I’m not sure I do—I’ll sketch how projections achieve all three. Okay, so again, our technique is random projections, which generalize this—you know—notion of random restrictions. So a restriction—just to recall—you take a Boolean function, f, over x1 to xn; you hit it with a random restriction; you get a simpler Boolean function over x1 to xn. A random projection, on the other hand: you take a Boolean function, f, over x1 to xn; you randomly project it; you get a new Boolean function over new formal variables, y1 to ym. So this feels like a generalization, because—you know—a restriction is just the case where your new formal variables are your old formal variables. Okay, let me be more precise; in a random restriction, every xi is either set to a constant—zero or one—or it survives—you know, I have been denoting that by a star, but you can think of it as, you know, xi maps to xi, right? That’s what it means to survive. In a random projection, every xi is either set to a constant like before, or it can be mapped to a brand-new formal variable, yj, where j doesn’t have to have anything to do with i. So you’re basically changing the space of variables. And how do we exploit this? Very roughly speaking, in our proof, there are many fewer y variables than x variables, so we have a lot of collisions, right?
This is something that you cannot do in the restriction world; we map many different xi’s to either zero, one, or the same yj, and we map them in a way that depends on the structure of the Sipser formula. Okay. So again, it’s easy to see that projections are a generalization of restrictions, because for every xi, instead of mapping to some yj, you can enforce that it just maps to the same xi. Okay, so hopefully, I can give you a sense of why this helps us—I’ll be fast. And I’ll do so by proving a weaker statement; I will show a separation between depth 3d and depth d. And you can see that here you have to encounter the same kinds of problems, because—you know—your hard function is in AC0. >>: Sorry, did you tell us the map or anything? No. >> Li-Yang Tan: No. I’m gonna come to that now. >>: Okay. >> Li-Yang Tan: Yeah. Hopefully, I didn’t skip … okay, good. Yeah, exactly, I went back to the same place. Okay, so let me tell you about this projection and why … okay, so it’s designed specifically with Sipser in mind, right? So your depth-3d Sipser formula has—you know—AND gates at the bottom; let’s look at the jth AND and its variables—here’s the projection; it’s gonna look a little weird. Every xi in the jth tribe, I set it to either one or yj—so again, this is something you cannot do with restrictions, right? If a variable is not set to one, you set it to the same variable, yj, and again, recall that j is the name of the tribe. So in the (j plus one)st AND, you either set a variable to one or to y(j plus one), okay? And what’s the distribution? Well, each variable goes one way or the other independently with probability one half, but we condition on not getting the all-ones assignment. Roughly speaking, why do we want to do that? We do not want the AND gate to be satisfied—right, if you put down the all-ones input, the AND gate gets satisfied—and we want to keep Sipser complex. And here, we see that we have not put down any zeroes. So for an AND gate, if you do not put down any zeroes, and you do not put down all ones, you keep it alive. In particular, it’s always the AND of a nonempty subset of inputs, which is nice. But one thing you should be skeptical about is the claim that this completes to the uniform distribution, because I only put down ones and no zeroes; but roughly speaking, what’s gonna save me is the fact that I’ve grouped all of these together. So by then hitting yj with something very likely to be zero, it’s gonna be uniform. But anyway, so this is just one projection. As is standard in these depth hierarchy theorems—you know—the overall random projection is just doing this over and over again. And if the bottom gate is an OR, do it with the dual distribution—with zeroes and yj’s. Okay, let’s see why this helps us. We have three things; we need to show that this preserves Sipser, but that’s essentially by definition, I hope. The second is that—you know—we still have to prove that AC0 circuits collapse to a simple function; and the third is that—you know—the projections complete to the uniform distribution. Okay, the first, I claim, is by design, and the third you should be skeptical about. So one, why does it remain complex? Well, it’s sort of designed with Sipser in mind; in the jth AND, every variable is mapped to one or to yj, and they’re never all ones—so it’s always the AND of a nonempty subset of yj’s, and the AND of yj, and yj, and yj is just yj.
So this is designed specifically so that—you know—the AND gate is never killed, and in particular, every AND gate just becomes a new formal variable, yj. So what happens is you go from depth-3d Sipser over x variables to depth-(3d minus one) Sipser over y variables, with probability one, right? So this is easy. The second thing is the completion to uniform, right? The coordinates are correlated, in that—you know—you condition on not getting the all-ones assignment. Okay, and the reaction should be that it does not look uniform at all—that you only have ones and yj’s—but as you’ll see, the key here is that we are grouping coordinates together into the yj’s. So this is—I mean, it’s an easy fact, but we were super happy when we found out about it—that—you know—if you put down a bunch of ones, you group the remaining coordinates into yj, and then you hit yj with something very biased toward zero, then the resulting string is uniform. This is just a calculation of the pmf—see the small sanity check sketched below—but—you know—what I really like about it is that it allows me to generate a uniform random string in this two-stage process, where I first put down a bunch of ones, group the surviving coordinates into a shared star, and then fill in the star (or, dually, put down a bunch of zeroes and do the same). So what’s really nice is that—you know—the usual way you generate a random string is to just go coordinate-by-coordinate and flip a coin; this allows me to put down a lot of ones first—you know, for my application—and then—you know—hit the remaining things as a whole. Okay, so informally, rho—which is very non-uniform—composed with this two-to-the-w-biased product distribution, pops back to the uniform distribution, okay? So that’s nice. So I’m still missing one part, but hopefully I’ve already given you quite a good picture of what happens to the Sipser formula. You have depth-3d Sipser, with its AND gates at the bottom reading x variables, and your goal is to prove hardness of approximation with respect to the uniform distribution. You hit it with a random projection; you go from depth 3d to 3d minus one very nicely; and your job—instead of proving hardness according to the uniform distribution—is to prove hardness according to the two-to-the-w-biased product distribution over your new variables, yj. And again, what’s each yj? Each of these bottom gates collapses to a new variable—this one collapses to a new variable; this one collapses to a new variable. Okay, so what’s nice is—you know—your target remains very structured, and your goal remains very structured; your goal distribution goes from—you know—product distribution to product distribution to product distribution. Okay, so the last step—that AC0 collapses to a simple function—I won’t go into detail, but as a last slide, as you would expect, we had to prove what Håstad proved, but for our random projections. So the key in Håstad’s proof was a random restriction argument, showing that—you know—any function in AC0 collapses to a simple function under random restrictions. You would hope that—you know—under one stage of what I described—just putting down ones and grouping things into yj’s—you collapse by at least one in depth, in which case we can apply it many, many times. We couldn’t do that—at least, not for this random projection—but we could prove it if you allow me to hit it three times. So roughly speaking, this three is why I’ve only sketched a 3d-versus-d separation—you know, I used three layers of my target to trade for one layer of my approximator.
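Here is the small sanity check promised above, for a single bottom AND gate of fan-in w. It computes exactly the distribution of one block under the two-stage process just described: each of the w coordinates is independently set to 1 or mapped to the shared variable yj, conditioned on not all of them being set to 1, and then yj is filled in with a heavily zero-biased bit. The bias Pr[yj = 1] = 2^(-w) is my reading of the "two-to-the-w-biased" distribution on the slide, so treat that exact parameter as an assumption; with that choice the block comes out exactly uniform.

```python
from itertools import product
from fractions import Fraction

def block_distribution(w):
    """Exact distribution of one width-w block: each coordinate is 1 or the
    shared variable y (prob 1/2 each, conditioned on not all-1s); then y is
    set to 1 with probability 2^{-w}, else 0.  The bias is chosen so that the
    result is uniform; this parameter is an assumption, not a quote."""
    p_star = Fraction(1, 2) ** w / (1 - Fraction(1, 2) ** w)  # each nonempty star-set
    p_y1 = Fraction(1, 2) ** w                                # Pr[y = 1]
    dist = {z: Fraction(0) for z in product([0, 1], repeat=w)}
    for pattern in product([1, '*'], repeat=w):               # '*' = mapped to y
        if '*' not in pattern:
            continue                                          # all-1s is excluded
        for y, py in ((0, 1 - p_y1), (1, p_y1)):
            z = tuple(y if c == '*' else 1 for c in pattern)
            dist[z] += p_star * py
    return dist

w = 3
dist = block_distribution(w)
assert all(pr == Fraction(1, 2 ** w) for pr in dist.values())
print("uniform on {0,1}^%d: OK" % w)
```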
So—you know—if given—you know—5d, surely, I can get the contradiction, but with three, I can prove that—you know—I get the collapse. So with that, I achieve all three. So improving to d versus d minus one is, as you would guess—you know—we change it from red to orange—you know, a significantly more delicate random projection—to ensure that we get a collapse in depth even under just one random projection. Okay? And of course, once you change what your random projection is, you have to ensure that the two other properties still hold. One of them was somewhat simple in my sketch—you know, that is, the fact that your target remains complex, which was really very neat in this, you know, example, but it gets more complicated—and the completion to uniform also. So a big part of the project was trying to juggle these three balls; we could only get two in the air for like five months; then somehow, one day, we got all three in the air and were super happy. Yeah, okay, yeah, so just a summary: we prove an average-case depth hierarchy theorem, which is—you know—for every d—you know, say d equals a hundred—there’s a function such that—you know—if you allow me depth a hundred and one, I can compute it in linear size, but if you force me to use depth a hundred, even if you allow me—you know—what seems like huge size, my agreement is—you know—less than fifty-one percent. And I gave two applications; one is that PH is infinite relative to a random oracle; and a different application, showing that—you know—there’s no approximate converse to this—you know—famous and useful LMN theorem. And our main technique, which we’re quite excited about, is this notion of random projections, which extends—you know—this notion of random restrictions, and it would be very nice to find further applications. So thank you. [applause] >> Yuval Peres: Any additional questions? >>: So … >> Li-Yang Tan: Yeah? >>: How do you want to save the Linial-Mansour-Nisan thing? >> Li-Yang Tan: Ah, that’s a great question. I don’t know. >>: So you want to do something like … so what you show is essentially if it’s polylog(n) in terms of sensitivity … >> Li-Yang Tan: Yeah. >>: … it could still have high complexity—I mean, high … >> Li-Yang Tan: Right. >>: … large size. >> Li-Yang Tan: Exactly. >>: So do you want to say, maybe, if it’s not … >> Li-Yang Tan: Yeah. >>: … it’s smaller than log n, then it cannot have non [indiscernible] >> Li-Yang Tan: Right, so I … yeah, at a very high level, I’d like any structural information I can prove about log(n)-influence functions. And as I sort of touched on, log n is a special number, because for anything lower than log n, we actually have quite strong structure—by a … not an easy theorem at all, it’s a famous theorem of Friedgut, which says that if your influence is k, you essentially depend only on two to the k variables. So if you told me a function has influence a hundred, I can tell you, “Oh, it’s not a very interesting function. You essentially lie in dimension two to the one hundred.” Right? [laughter] Right, if you told me the influence is square root log n—you know—you lie in dimension two to the square root log n. Where does it break down? If you tell Friedgut—you know—your influence is log n, he tells you you are—you know—close to [indiscernible] which doesn’t say much. So log n, I think, is a special number for me, because I’d really like to understand the structure of log(n)-influence functions. And this was very nice, right?
BKS, and O’Donnell, and Kalai, and Hatami, they said, “Oh, maybe log(n)-influence functions can depend on all coordinates, but maybe they are—you know—essentially a simple circuit”—that is, well-approximated by a simple circuit—but it’s not true. One way to rescue it is … >>: It’s nontrivial, or it’s not true? ‘Cause polylog—I mean—could be log to the ten. >> Li-Yang Tan: Well, that’s … and even more complex functions, right? >>: Yeah, so again, what is your result? After what—when you said polylog(n)—what …? >> Li-Yang Tan: I see. I have a log(n)-influence function—yeah—I have a log(n)-influence function such that if you allow me depth—not just constant—if you allow me depth square root log n, and if you allow me size, not just poly(n), but two to the n to the one over square root log n, I cannot approximate it. >>: Okay, so even at log … it’s not polylog(n); it’s at log n. >> Li-Yang Tan: Right. Exactly—again—my counterexample is at log n. >>: Right. >> Li-Yang Tan: So a way to rescue it is to broaden the class of circuits beyond just small-depth circuits. So a conjecture would be that log(n)-influence functions are well-approximated by poly-size circuits, period. That would be very hard to disprove, because if you disproved it, you would have—you know—separated something like NP from P/poly … yeah, right? Because in particular, our function is clearly a poly-size circuit, but … >>: Wouldn’t it be very surprising if this notion of sensitivity was a tight characterization for some notion of computational complexity? I mean … >> Li-Yang Tan: Yeah, yeah, it’d be very nice, but it could also be very surprising, yeah. And in some sense, our results show that it was too much to hope for, right? It would have been really nice if—you know—every log(n)-influence function were basically just a circuit. If you are a small-depth circuit, you’re polylog(n)-influence; if you are log(n)-influence, you’re basically a … but yeah, exactly as you said, maybe in hindsight, it was too bold, but yeah, it’s not true. But still, one can hope for some sort of structure of log(n)-influence functions. I’m not sure, yeah. At constant influence or—you know—square root log n influence, we have very good structural information, right? >> Yuval Peres: Thank you. >> Li-Yang Tan: Thanks. [applause]