>> Yuval Peres: The next talk will be by Navin. He's going to talk about some algorithms for
independent component analysis.
>> Navin Goyal: Thanks. This is joint work with Santosh Vempala and Ying Xiao from Georgia
Tech. I'll talk about independent component analysis, or ICA. It's a problem that was first
formulated in the signal processing and medical imaging communities in the 1980s. Since then it has
found connections to machine learning and statistics, and there is extensive literature on it. Before
I describe the general ICA problem, I'll start with a toy problem that gives the flavor of it. It's
called the cocktail party problem. It's a standard way to explain ICA. Suppose you have two
people speaking simultaneously and you have two microphones recording their voices. Now
you have these two recordings and from these two recordings you would like to decouple what
these people say. We would like to extract the voice of each person. Here is a possible
mathematical model for this. It's a bit idealized, but it will serve as a prototype for the ICA
problem. We model the voices of these two people by signals s1 of t and s2 of t. s1 of t is a
random variable which takes values at time steps 1, 2, 3 and so on. It has a distribution
associated with it, and at each time step it independently samples from this distribution to get
the value s1 of t, and similarly s2 of t. These are the voices of the two people speaking. We
don't observe these directly; what we observe are linear superpositions of these signals. One
microphone records one superposition and the other records another superposition, and these
are different because the microphones are located in different locations.
>>: [indiscernible]
>> Navin Goyal: I'll come to that, yes. At each time step you sample s1 of t independently from
the same distribution. s1 has one distribution and s2 has another distribution, and at each
time step you independently sample. We don't know the aij's. We don't know where the speakers
are located, so we don't know the aij's. We don't know the distribution of s1 or s2, and the
problem is to recover the aij's and s given the samples of x1 and x2. This seems like an
impossible problem. There is not sufficient data to solve it. That's true, so we have to either get
more data or make more assumptions. The crucial assumption we make is that s1 and s2 are
independent of each other, so not only is s1 of 1 independent of s1 of 2 and so on, but s1 and
s2 are also independent of each other, which is reasonable in many applications. When two
people are speaking simultaneously, what they say generally doesn't have much to do with
each other. That's the simple version of ICA.
how I see it. We can describe what's happening here pictorially. Here I'm making a further
assumption that s1 and s2 are uniform in [indiscernible] minus 1 1. If you plot them than the
situation looks like this uniform distribution is square and what we are actually observing his
linear transform of this square. It's parallel [indiscernible] and uniform distribution of this.
From this we want to recover what [indiscernible] of these points on we want to recover what
linear transform was used. At least in this situation you can see that if you sufficiently manage
points you can sort of learn the shape of this [indiscernible] and you can record the linear
transform. That's the general ICA problem. The difference now is that the number of
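To make the toy model concrete, here is a minimal numpy sketch of the generative process; the 2x2 mixing matrix is just an illustrative choice standing in for the unknown microphone geometry, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000                                   # number of time steps / samples
S = rng.uniform(-1, 1, size=(T, 2))          # two independent sources, uniform on [-1, 1]
A = np.array([[1.0, 0.6],                    # illustrative 2x2 mixing matrix (stands in
              [0.4, 1.0]])                   # for the unknown microphone geometry)
X = S @ A.T                                  # observed mixtures: x(t) = A s(t) at each time step
```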
That's essentially the general ICA problem; the difference now is that the number of
microphones is n and the number of speakers is also n. We have this model x equal to As. x is
what we observe, s is the hidden signal and its components are independent of each other, A
is a constant matrix which is non-singular, and the problem is to recover A, and possibly the
distribution of s, given the distribution of x. Even now the problem is not actually well posed
and we have to make some further restrictions to make it well posed. One thing is that suppose
each component of s is standard Gaussian and they are independent of each other. If A is a
rotation matrix, any rotation matrix, then what you get back is, again, independent Gaussian
components, so any rotation gives a valid set of independent components. The problem is not
solvable in that case. We will just assume that none of the components of s is Gaussian.
Another thing is that if I scale the first column of A by a factor of 2, let's say, and I scale s1 by a
factor of a half, I get a different A and a different s and these will still satisfy the equation x is
equal to As. I cannot learn the scalings of these columns or the scaling of s, so we will allow for
that indeterminacy. Similarly, if I permute the columns of A by some permutation and I permute s
by the same permutation, I again get x equal to As for these new A and s. Again, I cannot learn
this permutation, so we will allow for these two indeterminacies. But that's all. There is a
theorem from the 1950s, the Darmois-Skitovich theorem, that implies that given this assumption,
and if you allow these indeterminacies, then the problem is well posed: you can solve it given
the distribution of x. Let me just remark that the dictionary learning problem is very similar to
this. It has a similar model x equal to As, but the modeling assumption there is that the
components of s are not necessarily independent but they are sparse, so it's useful for a different
set of applications. I'm just restating the problem here. We want to recover A and s given a
polynomial number of independent samples of x. We don't get the distribution itself. That's the
ICA problem. Before I describe our contributions, I'll show you some approaches that are
known for this problem. One can try to use PCA, principal component analysis, for this. In this
special case, suppose your distribution over points in R^n, or R^2 in this case, is uniform on
this rectangle. Then you can try to use principal component analysis to find the direction of
maximum variance, or maximum second moment, and that gives you this direction. This is the
direction of maximum variance and it is also the direction of an independent component.
Similarly, you can find this other direction. In this particular case you can just use PCA to solve
the ICA problem. But it's not all that useful in general. If your rectangle was a square then the
second moment looks the same in all directions, so it doesn't give you any information about
the independent components. The independent components are these two. But this does suggest
trying higher moments, and interestingly, that works. If you take the fourth moment in direction
v, for v a unit vector, so the fourth moment is just the expectation of (v transpose x) to the 4,
then the local minima of f, this picture is showing the fourth moment, in this direction this is the
fourth moment, the local minima of f of v precisely correspond to the facets of this body, and
they can be efficiently found using a gradient descent type algorithm. And we can estimate the
fourth moment quite well just using the empirical moment from the samples that we have. In
this case the problem is reasonably well solved; it doesn't handle all possible distributions for s,
but it handles a large class of them. What remains to be done?
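As a rough illustration of this fourth-moment idea, here is a sketch of projected gradient descent on the empirical fourth moment over the unit sphere. It is my own simplification, not the talk's algorithm: it assumes one runs from several random starts, and that for sub-Gaussian sources like the uniform ones in the example the independent directions are local minima (for other source distributions one may need to maximize instead).

```python
import numpy as np

def fourth_moment_direction(X, steps=500, lr=0.05, seed=0):
    # Projected gradient descent on f(v) = E[(v^T x)^4] over the unit sphere.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(steps):
        p = X @ v
        grad = 4 * (X * (p ** 3)[:, None]).mean(axis=0)  # gradient of the empirical fourth moment
        v = v - lr * grad
        v /= np.linalg.norm(v)                           # project back onto the unit sphere
    return v
```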
There is a more general version of ICA which is called underdetermined ICA. The difference here
is that the dimensions of the x vector and the s vector are not necessarily the same. The
dimension of x can be smaller than the dimension of s, and otherwise it's the same thing. Now
the problem is, again, to recover A. We cannot recover s now because there is not sufficient
information. We are given, in some sense, even less data compared to what we were given
before because x has smaller dimension. So what I just showed you doesn't seem to apply here.
Here is a picture of a special case. Suppose m is equal to 3 and s1, s2, s3 are uniform on
intervals; then the joint distribution looks like a cube, the uniform distribution on a cube, and
suppose n is 2; then the data that we get is some distribution on this geometric shape. From
this shape you want to determine the orientation of the cube. Here is the main result about
underdetermined ICA. There was some previous work, but it applied only to some special cases.
Our result is pretty general, but I'll only state it in a special case for simplicity. n is the dimension
of x, at least 2, and m is the dimension of s, more than n and at most n squared over 10. Suppose
the columns of A have unit norm, which is without loss of generality because we can't learn the
scaling of the columns. Any two columns of A are linearly independent, and the si are far from
being Gaussian; there is a way to quantify this but I will not write it here. The first eight moments
of each si are bounded. If these things hold then our algorithm can estimate the columns of A
within L2 additive error epsilon using a number of samples which is polynomial in n, 1 over
epsilon and 1 over sigma min of this power of A, which I'll shortly describe. It's a polynomial
time algorithm, except for [indiscernible], and the running time is also a similar [indiscernible].
I'll describe what this power of A is. This is called the Khatri-Rao square of A. If your matrix A
looks like this, these are the columns, then the Khatri-Rao square is obtained by taking the
Khatri-Rao square of each column individually and that gives you a new matrix. What is the
Khatri-Rao square of an individual column? It is basically just the tensor square of the column,
put into a vector, so it's an n squared dimensional vector. We are taking all possible pairs of
entries here: a1 squared, a1 a2, and so on. If A was an n by m matrix, this is an n squared by m
matrix. I just restated the result here. You have sigma min of the Khatri-Rao square of A here.
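A minimal sketch of the Khatri-Rao square as just described, using a hypothetical helper name; the sigma min in the theorem is then just the smallest singular value of the resulting matrix.

```python
import numpy as np

def khatri_rao_square(A):
    # Column-wise tensor square: each column a of the n x m matrix A is replaced
    # by a tensor a flattened into an n^2-vector, giving an n^2 x m matrix.
    return np.column_stack([np.kron(a, a) for a in A.T])

# sigma_min of this power of A:
# np.linalg.svd(khatri_rao_square(A), compute_uv=False)[-1]
```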
It's still probably not clear what this means. It is a hard quantity to interpret because it involves
a funny operation on the columns of A. We have one result which I think gives some meaning to
what this means. You can think of the matrix A as being generated by a smoothed-analysis type
of distribution. Suppose you start with any matrix A of these dimensions, n by m with m at most
n squared over 10, and then you perturb it by adding independent Gaussian noise N(0, sigma
squared) to each entry independently. Let's assume that this perturbed matrix is actually the
input matrix. Then we can show that the sigma min of the Khatri-Rao square of this perturbed
matrix is small only with small probability; in general, with high probability it's not too small.
That gives us that with probability at least 1 minus 2 over n, where the probability is over the
choice of the noise N as well as the random samples, our algorithm finds the columns of A plus N
within small additive error, with a 1 over sigma squared here now. I think that makes more sense.
In the remaining time I will talk about the algorithm for underdetermined ICA. As I said before,
the local optimization idea doesn't seem to generalize to the m > n case. What we'll do is build
upon another method, due to Yeredor, which was given for m equal to n, and we'll generalize it
to m > n. I'll first describe Yeredor's algorithm for m equal to n and then I'll describe the
generalization. These are standard notions in probability theory. x and s and A are just as
before. For a random variable x we can define its moment generating function as the
expectation of e to the u transpose x. If you take the logarithm of that, that gives us the CGF,
the cumulant generating function. Similarly, we can define these functions for the random
variable s; it's the same definitions. The crucial point here is that the moment generating
function of s, because the components of s are independent, factorizes into the moment
generating functions of the individual components. Similarly, the CGF decomposes into a sum
over the individual components. This fact will be crucial in what we do next. Assume that the
variables t and u are related by the linear relation t equal to A transpose u. Then e to the u
transpose x is equal to e to the u transpose A s, just by our ICA model, and this is equal to e to
the t transpose s, by our assumption. That gives us that the CGF of x evaluated at u is equal to
the CGF of s evaluated at t. Now just by basic calculus we try to understand how the derivatives
of c change under a linear transform, so we can write this relation; it's basic calculus. The
Hessian matrix of c with respect to u is equal to the matrix A times the Hessian matrix of c with
respect to t times A transpose. You just check how the derivatives change under the linear
transform of variables, and that gives you this relation. Now we have a polynomial number of
samples of x, and using that we can estimate all of the entries of this Hessian quite accurately.
But we don't know anything on the right-hand side. We do have one piece of information
about it, which is that this Hessian matrix is a diagonal matrix, and here we are using the fact
that this c of s decomposes as a sum of the CGFs of the individual components of s; that
directly gives us that only the diagonal entries survive. We would like to exploit this
information to compute A. Here is a very simple idea. You sample u and u prime in Rn
uniformly from the unit sphere, evaluate your Hessians at u and u prime, and take the
ratio. We already saw this relation. Now since A is an invertible matrix, A transpose inverse
cancels with A transpose here and what we are left with is this quantity. Now the quantity
here is a diagonal matrix. What we get is that this thing that we can compute is equal to A
times a diagonal matrix times A inverse. And this is just an eigenvalue-eigenvector equation,
so we can compute A: its columns are the eigenvectors of this matrix. That solves the problem.
The eigendecomposition is unique if the entries here are pairwise distinct, and that's why we
use the randomness of u and u prime. We need to show that pairwise these things are distinct,
and actually far away from each other, and that allows us to find the eigendecomposition.
That's for m equal to n.
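Here is a minimal numpy sketch of this m equal to n procedure; the helper names are hypothetical, and the way the Hessian of the empirical CGF is computed, by exponentially tilting the samples, is my own realization rather than something stated in the talk.

```python
import numpy as np

def cgf_hessian(X, u):
    # Hessian of the empirical CGF c(u) = log mean(exp(u^T x)): it equals the
    # covariance of x under the exponentially tilted weights w_i proportional to exp(u^T x_i).
    z = X @ u
    w = np.exp(z - z.max())
    w /= w.sum()
    mu = X.T @ w
    return (X * w[:, None]).T @ X - np.outer(mu, mu)

def yeredor_style_ica(X, seed=0):
    # m = n case: the eigenvectors of H(u) H(u')^{-1} = A diag(...) A^{-1}
    # are the columns of A, up to permutation and scaling.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    u = rng.normal(size=n); u /= np.linalg.norm(u)
    v = rng.normal(size=n); v /= np.linalg.norm(v)
    M = cgf_hessian(X, u) @ np.linalg.inv(cgf_hessian(X, v))
    _, vecs = np.linalg.eig(M)
    return np.real(vecs)
```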
Now I would like to generalize it to underdetermined ICA. We'll try to proceed in a similar way,
but things will break down, and we will see where they break down. We still have t equal to A
transpose u and we can still write, as before, the Hessian matrix of c with respect to u, and then
we can try to take this ratio. It is fine up to this point, but this is incorrect because A is not a
square matrix. It is not an invertible matrix, so A transpose and A transpose inverse do not
cancel, so this doesn't work. We can try to take a pseudo-inverse here, but that also won't work.
Here is the main new idea. We'll use higher derivatives of c, so we can take the fourth derivative.
This is a tensor now, an order 4 tensor of dimension n in each direction, but we are writing it as
an n squared by n squared matrix, so we just flatten the tensor into a matrix, just like you can
flatten an n by n matrix into an n squared dimensional vector. This is the Khatri-Rao square of A,
then a diagonal matrix of fourth derivatives, and then the Khatri-Rao square of A transposed. It
is easy to obtain this relation. Now I will again try to proceed in a similar way. We can take the
inverse, similar to what we were doing before, but again we have this problem of this not being
a square matrix. But this time, if we take a pseudo-inverse for the second factor, then things go
through. This is indeed an identity matrix when we use the pseudo-inverse, because the
Khatri-Rao square of A has full column rank, and so what we get is very similar to what we got
before, and now we can obtain the columns of the Khatri-Rao square of A by an eigenvector
computation. Once you have the columns of this matrix you can also get the columns of A
easily. That's basically the whole algorithm. There are several technical challenges in making
this work. One of them is that we have to show that these things are far from each other in
order for the eigenvector computation to be effective, and all these diagonalizations only
happen in an approximate setting, so you have to take care of all of those errors. That's it. I will
finish with an open problem.
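For the last step, reading a column of A off an (estimated) column of its Khatri-Rao square, one possible sketch, assuming the estimate is close to a rank-one vector a tensor a, is:

```python
import numpy as np

def column_from_kr_square(w, n):
    # w is (approximately) a_i tensor a_i flattened into an n^2-vector.
    # Reshape to n x n; the top eigenvector of the symmetrized matrix
    # recovers a_i up to sign.
    M = w.reshape(n, n)
    M = (M + M.T) / 2                    # symmetrize to suppress estimation noise
    vals, vecs = np.linalg.eigh(M)
    return np.sqrt(abs(vals[-1])) * vecs[:, -1]
```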
Our algorithm is polynomial time if m is at most n squared, for any polynomial in n, but suppose
n is very small: n is a constant and m is a big number. Then our algorithm can take
super-constant time. We don't have a lower bound proving that that's essential. That's an open
problem here. Thank you. [applause]
>>: I want to ask a question. It gets a little more interesting when you consider that the matrix
is not a transfer function because then there is a fresh equation between my signal in the
microphone and then the correlation within the signal affects the result, but you also have the
same way that you have a vein of certainty, you could have widening of spectrum. I wonder if
you look more into [indiscernible] to see how these [indiscernible]
>> Navin Goyal: No, I don't know much about these. There are these models, but I don't know.
>>: There is a little bit of work here in the speech group, but I think they are not looking much into
that. That might be interesting to visit. It's still an interesting problem that we would like to
solve. [applause]
>> Yuval Peres: The last talk will be from Anup, who is coming from across the lake, I
guess. He will talk about lower bounds on multiparty communication.
>> Anup Rao: Thanks. I recognize that it's the last talk of the day and you are tired and I'm
tired, so I'm just going to show you a bunch of animations that I made in the last few days and
then we'll call it good. Before I start talking about the complicated title, I'm going to tell you a
little bit about why I care about these questions. I think it's kind of important because maybe a
lot of people don't know why those questions are important. Here is for me the biggest thorn
in my side as a complexity theorist, which is that I have no idea how to answer this kind of
question. If someone asks me, here's a problem that I'm solving, I have an algorithm for it, is
my algorithm the best you can do? I have no idea how to answer this question. You can make
this concrete any way you want. Pick any nontrivial algorithm that you know about. Is the best
algorithm that we know for finding matchings in graphs optimal? We don't know. Is the best
algorithm for matrix multiplication that we have optimal? We don't know. The same
for almost all algorithms that we have and, of course, the famous P versus NP question is also
of this format. That's just like asking is the best algorithm that we can come up with for SAT
optimal, and again, we don't really know. Here are some things that we do know about proving
when algorithms are optimal. We know that for many tasks, if you
have a linear time algorithm, then that's about as good as you can do because you need to read
the input and that takes linear time, so you can't do better than linear time. That's one thing
we know. And then there's the technique of diagonalization, which is really clever and that
allows you to say that for certain kinds of tasks which succumb to this technique the obvious
algorithm is the optimal one. For example, if you're given a program and an input and you want
to know, does this program stop in T steps, then essentially what you should do is run the
program for T steps and see if it stops, and that's about all you can do. You can't do better than
that. That's about all the techniques that we actually have for proving lower bounds on the run
time. People have worked on this question and most of the work has gone into studying all
kinds of restrictions of the concept of what an algorithm is. You can restrict it like this,
restrict it like that, and you get all kinds of questions, and then you answer those questions
and there's a long list of such results. First, I want to start by convincing you that actually this
question about algorithms is essentially a question about communication. That's because of
the following picture which took me a long time to make. You can see that the internet helped
a lot. I still had to do things. What it says is that if you have an algorithm that computes some
function f that maps n bits to 1 bit, then you actually obtain a communication protocol.
Suppose the algorithm runs in time T and computes the function f; then you can show that there
is a protocol that has about T players that participate in the protocol. Each player knows one
of the bits of the input and each player during the protocol sends 1 bit to some set of players
and every player here receives at most 2 bits. Let me just show you an example. This guy
might, based on his bit, send something to these two people. This person might
send something to these two people. Now these players know some information. They use it
to compute some bit and send that to some other people and in this way in every step
somebody speaks and sends some bit to somebody. And eventually, after all of the
communication is done, someone in the protocol magically knows the value of the function that
we are trying to compute. That's something we can show. If there is an algorithm for F then
you have a protocol that looks like this. Question?
>>: [indiscernible]
>> Anup Rao: It's a circuit. This is a circuit. This is basically a circuit. This is essentially
equivalent to the concept of what an algorithm is, but it's really hard to prove lower bounds
against this thing, and one of the reasons why it's really hard is that the channels involved here
are private channels. Somehow, the reason that this protocol is weak is that some of the parties
know some pieces of information and other people know other things, and you need to exploit
the fact that not everybody knows everything to prove that such a protocol can't work quickly.
That's something that we basically have no idea how to do; I don't know any way to exploit that
weakness. On the other hand, if I allowed all of the players to communicate
by broadcast, just shout out a bit that everybody can hear, then you can compute anything with
n players, because each person just announces one bit and then
someone knows all the bits. We're done. Then it becomes a meaningless model. I need to
somehow understand what that weakness is in the private channels to prove lower bounds
against them. But there are settings where communicating by broadcast can be something
that's meaningful to exploit. For example, here is a result that was proved by Valiant in the
‘70s, made more explicit by Rubich [phonetic] and myself in a paper that we did recently. It
says that suppose F can be computed in parallel time log n and total work n, so you have some
number of processors that are computing the function and the total number of instructions
that are executed is order n, but the total time it's taking is only order log n, so this is a log
depth circuit of linear size. Then you actually get from that a protocol with little o of n players,
where each player knows a very small fraction of the bits of the input, say a 0.1 fraction of the
bits, and the players communicate now by broadcast. The first player looks at some small
fraction of the bits, says something. The second looks at some other fraction of the bits, says
something, and now all of the communication is by broadcast, but eventually, one of these
parties knows the value of the function's output.
>>: Who decides which player [indiscernible]
>> Anup Rao: Ahead of time based on the function the players decide these are the bits that I
know.
>>: [indiscernible]
>> Anup Rao: Yeah. And then they start communicating. Everybody decides ahead of time
what bits they know.
>>: Someone must be fixed, right? One person decides [indiscernible]
>> Anup Rao: I'm not sure what you mean.
>>: Someone knows x but that person is fixed?
>> Anup Rao: That person is fixed, yes. So the last person in the sequence, these people all
speak in this order and the last person knows f of x. Yeah. Roughly speaking, these players
each correspond to a line of the program that's being run in parallel, but not all the lines,
because there are n lines and there are only n over log log n players, but there is a subset of the
lines that are kind of the important ones and those are the ones that are simulated by these
players. This is something that I can hope to try to prove lower bounds against because we
have techniques that work against communication that happens with broadcasts. Let me give
you some idea of what we know how to prove about communication. Here are some easy results
that we know how to prove. I won't tell you the proofs of the things I am going to tell you about
here, but they are all really short. One thing we can prove, for example, is: suppose you have
just two parties and each of them has a subset of 1 through n, so it's an n-bit string, and they
want to know, are the sets the same? That's something you can do obviously by having one
party send the other party his set. That will take n bits, and this is the best you can do. There is
nothing better you can do. That's easy to prove; it's an exercise. Another thing that you can
show requires n bits of communication is to compute the generalized inner product of these
vectors, mod 2, so that's the same as asking: is the intersection size of x and y even? That also
requires n bits of communication. It also requires n bits of communication to tell whether x and
y are disjoint or not. All these things are easy to prove.
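As a concrete reference point, here are the three two-party functions just mentioned, with the inputs written as 0/1 indicator vectors of the sets (a hypothetical encoding, just for illustration):

```python
def equality(x, y):
    # Are the two sets (given as 0/1 indicator vectors) equal?
    return x == y

def inner_product_mod2(x, y):
    # Generalized inner product: is the intersection size of x and y even?
    return sum(a & b for a, b in zip(x, y)) % 2 == 0

def disjointness(x, y):
    # Are x and y disjoint?
    return all(not (a & b) for a, b in zip(x, y))
```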
But those kinds of results seem quite different from where we want to get to eventually, because
we want to understand this situation. The problem is that in this situation the players have a lot
of overlap in the input. They see a lot of bits that are the same, and that's quite different from
the two-party situation, where they have independent inputs. Let me show you why that's
different.
Suppose we have now three parties and each party has a set and they want to tell if all the sets
are equal. Then again, it's really easy to show that it requires omega n bits of communication to
do this. But what if each party knows two of the sets? How many bits of communication does it
take to tell whether the sets are the same or not? Two bits, and why is that?
>>: [indiscernible]
>> Anup Rao: Yeah. You just need one person to verify that x is equal to y and another to verify
that y is equal to z, and then you know that all three must be the same. Suddenly, the
communication that you need drops down dramatically for this problem. It's hard to prove
lower bounds when the parties start to have information they can see in common.
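A sketch of that two-bit protocol, with each announced bit computed only from the inputs the announcing player can see:

```python
def three_party_equality(x, y, z):
    # Number-on-the-forehead style: each player sees the other two sets.
    bit_1 = int(x == y)   # announced by the player who sees x and y
    bit_2 = int(y == z)   # announced by the player who sees y and z
    # Two broadcast bits suffice: all three sets are equal iff both checks pass.
    return bit_1 == 1 and bit_2 == 1
```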
Nevertheless, we do have methods that work against this kind of model. The really beautiful
paper of Babai, Nisan and Szegedy in the early '90s showed that if you want to compute the
generalized inner product of x, y and z, then that still requires n bits of communication in this
model. The subject of this talk is that, for a long time, it seemed that that technique did not
work when you're trying to understand whether the sets are disjoint or not. That's exactly what
we worked on.
There are several reasons why this particular function is interesting. In theory, understanding it
has applications to proof complexity, and basically other applications in complexity that I won't
get into. The main motivation was that I want to find a technique that eventually works for the
model that's related to algorithms, and this seemed like something we have to understand first.
This seems like an easier problem. Here's kind of what the story is for this problem, and our
result is at the bottom here. Basically there has been a long line of work that used increasingly
complicated techniques, until our work, which uses basically the same idea as BNS. Eventually
what we managed to show is that if you have k parties that have k sets, and each party knows
k-1 of the sets, and they want to know whether the sets are disjoint or not, then they need to
communicate at least n over 4 to the k bits, where the universe is of size n. And that has more
or less the right dependence on n and k, because there's an upper bound. There's a really clever
upper bound of Grolmusz, who showed that you can compute this in k squared n over 2 to the k
bits of communication.
>>: Are the lower bounds subject to random [indiscernible]
>> Anup Rao: Yes. All of these lower bounds also apply to randomized communication. Our
proof of the deterministic lower bound is really simple. The proof of the randomized lower
bound is not so simple, but there we simplify the best proof before, which was the work of
Sherstov that gave a lower bound of roughly square root n. We reduce the complexity of that
proof, but we don't improve the bound. We get exactly the same bound.
>>: But this is deterministic?
>> Anup Rao: The upper bound is deterministic, yeah. The lower bound is really interesting
because there is a quantum upper bound that matches this lower bound, and the lower bound
proofs also work for quantum, so for quantum this lower bound is tight. Somehow, if you're
going to get a better lower bound, you need to distinguish quantum communication from
randomized communication. When did I start? I forget.
>>: You can go for 5 or 6 minutes.
>> Anup Rao: Okay. Now I will quickly outline how the proof goes. I won't really show you
many details. I will just give you some starting points. We are in the setting where we have k
parties. There are k sets. Each party knows k-1 of the sets and they want to know if the sets
are disjoint or not. Roughly, the way these kinds of proofs usually work is that you find some
hard distribution on the sets and argue somehow that on this distribution the players cannot
succeed with high probability. I now define a distribution on the sets. To do that, let's partition
the universe into m parts, and each part is going to be of size roughly 4 to the k. Within each
part I'll tell you how to sample all of the k sets. You sample the first k-1 of them uniformly at
random, conditioned on the fact that all of those k-1 sets intersect in exactly 1 point. The first
k-1 sets intersect in exactly 1 point, and the last set is just a completely random set. That's how
you sample those sets in the first part of the universe. You do exactly the same in every part of
the universe.
>>: What is the universe?
>> Anup Rao: The universe is n elements, and now I am describing how to sample the sets x1,
x2 through xk as subsets of these n elements. You take the n elements and break them up into m
parts. In the first part, you sample the first k-1 sets so that they intersect in exactly one of those
elements, and then the last set is random. The chance of having an intersection in the first part
is exactly 50 percent: it's the chance that the random set contains the one point of intersection
of the rest. And you do this independently in all of these m parts of the universe. That's the
distribution. It's a really strange distribution because with extremely high probability there will
be an intersection of these k sets. The chance of intersection in the first part is half, so with
good probability there is going to be an intersection somewhere.
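Here is one way to sample a single part of the universe under this distribution, a minimal sketch; the direct construction of the conditioned first k-1 sets (pick the common point, then give every other element a membership pattern that is not "in all sets") is my own realization of the conditioning, not something spelled out in the talk.

```python
import random

def sample_block(block, k, rng=random):
    # One part of the universe: the first k-1 sets are uniform conditioned on
    # having exactly one common element; the k-th set is uniformly random.
    common = rng.choice(block)
    sets = [{common} for _ in range(k - 1)]
    for e in block:
        if e == common:
            continue
        # membership pattern of e across the first k-1 sets, uniform over all
        # patterns except "in every set" (so the common point stays unique)
        pattern = rng.randrange(2 ** (k - 1) - 1)
        for i in range(k - 1):
            if (pattern >> i) & 1:
                sets[i].add(e)
    sets.append({e for e in block if rng.random() < 0.5})  # last set: uniformly random
    return sets
```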
Nevertheless, it's useful for the analysis which is I think what makes the whole proof really
counterintuitive and maybe it's why people missed this proof before.
>>: The issue is you are not allowing any probability of error.
>> Anup Rao: Yes. Although, later we will also use the same kind of distribution for the
randomized lower bound and there we will allow error, but then we will have to do something
more complicated. For now we are not allowing error, so let Di be the random variable that
indicates whether or not the sets are disjoint in the i-th part of the universe. That's a random
bit. Here's a theorem that occurs somewhere in one of the older proofs. It's due to Sherstov. He
showed that if pi is a function computed by a protocol with communication c, then this
expression holds. The expected value of pi on these sets times basically the parity of these Di's:
the absolute value of that expectation is small. It's bounded by 2 to the c minus 2m. In some
sense this kind of says the protocol pi is bad at computing the parity of the Di's. But if you are
trying to prove a lower bound, that's exactly what pi is trying to compute anyway. It's trying to
compute disjointness, which is just like computing whether all of these are set to one. What we
can prove is that it's bad at computing the parity of the Di's, but in the deterministic case
basically what we observed is that that's enough. If pi is a deterministic protocol and it makes
no errors, then when the sets are disjoint pi is equal to one, but then all of these Di's are equal
to one, so whenever pi is equal to one this sum here is exactly equal to m. And when pi is equal
to zero, everything is zero in this product. That means that this left-hand side, if pi computes
disjointness, is exactly equal to 2 to the minus m. It's just the probability that the sets are all
disjoint. The left-hand side is 2 to the minus m. The right-hand side is this. You rearrange it and
you get that the communication must be at least m. It's kind of confusing that something like
this can happen, but it's actually really simple. This is pretty much what it is.
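Spelled out, the rearrangement is roughly the following; this is a reconstruction from the talk (with Di the disjointness indicator of part i and pi the protocol's 0/1 output), and the exact exponent on the right-hand side should be treated as approximate.

\[
2^{-m} \;=\; \Pr[\text{all the sets are disjoint}] \;=\; \Big|\,\mathbb{E}\big[\pi \cdot (-1)^{D_1+\cdots+D_m}\big]\Big| \;\le\; 2^{\,c-2m}
\quad\Longrightarrow\quad c \ge m .
\]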
>>: [indiscernible]
>> Anup Rao: This thing is basically the same idea as [indiscernible], which I guess I don't have
time to get into, but it's a two-page proof and you just repeatedly apply Cauchy-Schwarz to this
expression until things become nice, is all I can say. But I don't have time to explain it now. But
it's just Cauchy-Schwarz.
>>: [indiscernible]
>> Anup Rao: Let me give you an outline of how the randomized lower bound goes. We can't
use this kind of thing anymore, because here I really used the fact that pi does not make any
errors; if pi makes an error then there could be places where pi is equal to one and this thing is
not fixed. But nevertheless, here's how the proof goes. You define this function f of j, which
takes an input j and is simply the probability that the protocol outputs 1 when the number of
parts where the blocks are disjoint is j. That's the function f of j. Then you can use the kind of
bound that we saw in the last slide to get this kind of algebraic expression for f. It's not too
hard to get, but it says that if f of j is defined as above, then it must be the case that for every r
bigger than c, this expression, f of the sum of these D1 through Dr times m over r, times this
parity, is small in expectation. This is a variant analogous to what we were seeing in the
previous slide. This is an expression that corresponds to pi, but I'm not going to tell you how
we got there. And at this point we just completely forget the fact that we are working with
protocols. It turns out that you can use this fact to show that f can be approximated by a
degree c polynomial once you have this thing. That's something that takes a couple of pages to
prove, and it's again a little bit tricky; it's not that straightforward. And once you know that
fact, if f came from a protocol that computes disjointness, then f is supposed to look something
like this, right? The probability that the protocol outputs one is supposed to be zero when the
sum of the Di's is less than m, and only when all the blocks are disjoint is it supposed to accept.
So f should look something like this. It should be close to zero at all these points and then
suddenly spike up to something close to one, but we know, it was proved by Nisan and Szegedy,
that any such polynomial must have degree square root m. We know that f can be approximated
by a degree c polynomial, and any such polynomial must have degree square root m, and that
kind of shows that c is at least roughly square root m. I guess I dropped the k terms
[indiscernible]. That's roughly
what happens in the proof. I'll stop there. I will just spend a couple of minutes telling you what
I think we should do next. I think the big open problem here is this: all the lower bounds we
know how to prove for k parties in this setup are of the type n over 2 to the k, and the big open
problem is to prove a lower bound that is really like n, or n over k, or n over a polynomial in k.
And here's a candidate problem that I think should be hard for this model. Imagine that the
input is k matchings, so there is one matching between these two sets of vertices, and another
matching between these two, and another matching between these two, and so on. What we
want to know is, if you start at this designated vertex and keep walking clockwise for n over a
hundred steps, where n is the number of vertices in each part here, then do we end up at an odd
vertex or an even vertex? The trivial way to do it is, you know, figure out where this edge is
going, then figure out where the next edge is going, and so on. And there should be essentially
no better way to do it than that with communication.
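The trivial sequential way to answer the question looks like this, a minimal sketch with a hypothetical encoding of the matchings as dictionaries:

```python
def walk_parity(matchings, start, steps):
    # matchings[i] is a dict mapping each vertex in layer i to its match in the
    # next layer (a hypothetical encoding; layers are visited "clockwise").
    v = start
    for i in range(steps):
        v = matchings[i % len(matchings)][v]
    return v % 2    # 1 if we end at an odd vertex, 0 if at an even vertex
```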
That's a candidate, but I will warn you that there is a really clever protocol that can do
something really clever. You can try and figure it out for yourself, but it's pretty hard. It turns
out that if you want to just do one round, if you want to just do three steps and there are three
players, there's a protocol that can figure out where the third step goes in little o of n
communication. There is something non-trivial that you can do there. But I don't believe that
that protocol can be extended to get something that figures out how a really long path goes.
That's also basically related to what I think is a hard problem for this question. For parallel
algorithms, it should be hard to tell, when you are given a directed graph where every vertex
has out-degree 1, whether vertex 1 is connected to vertex 2, simultaneously in log time and
linear work. I'll stop there. [applause]
>> Yuval Peres: Questions?
>>: I want to ask a question. Can you go back to [indiscernible]. Everybody has everything
except, like, player i has everything except xi?
>> Anup Rao: Yeah, sure. Every player sees basically a matching; the first player sees a
matching from this to this, and he just doesn't know what the matching is between these two
layers.
>> Yuval Peres: Okay. Let’s thank Anup again. [applause]