
>> Li Deng: Okay. Hello, everybody. It's my great pleasure to have Professor Ali
Sayed come here to give us this talk. This talk will give more depth to one of the
keynotes that he gave recently at CAS on online learning and bio-inspired signal
processing.
I'm not going to go through all of, you know, his long, outstanding research career, except to
say that Ali's work has really been outstanding. The impact of some of his
work has shown in some of our recent summer intern work: a student of his has been doing some
very exciting work, having a tremendous impact on the kind of work that we have been
doing over here.
So without further ado, I will pass the floor to Professor Sayed.
>> Ali Sayed: Okay. Thank you very much, Li, I appreciate it. Thank you very much
also for coming. That's very nice of you. Okay. So relative to [inaudible] I'll have a few
more minutes, so I can be a little bit more technical. But, of course, I don't have enough
time to go deep into the technical aspects of the work.
So I'll focus on the intuition behind the results, what the results are, the intuition behind
the results, try to motivate them.
And if you are interested in more details, I'll be glad to give references. You can e-mail
me and so forth. Okay?
So this is a talk on the general topic of learning and adaptation over networks: when
you have a collection of agents that work together, you would like them to work
together in such a way that the performance improves; they should benefit from
cooperation. Because it's not always true that you benefit from cooperation. But we
would like to do it in a way that the network performance gets enhanced.
And we would like the network to get the ability of learning and adapting in real time. So
how do you achieve these kinds of features? So I will be giving you the technicalities
behind that.
But before I get into that, let me first define what do we mean by distributed signal
processing? Because this talk is in this general area of distributed signal processing.
Okay? So, of course, many definitions are possible. This is just one of them. For the
purposes of this talk, this is good enough. Okay.
So distributed signal processing deals with the extraction of some global piece of
information from interactions among agents that are distributed over space and that can
only interact locally with each other. Okay. That's a very broad definition of what
distributed signal processing is.
Over here I'm just showing one particular example in this simulation. What you have in
this simulation is a collection of agents. Each one of these agents has a
measurement of the distance to a target, noisy distance to the target, and also a
measurement of the noisy direction towards the target.
So in principle, each one of them could try on its own to get there. At the same time,
they also have noisy distance to some dangerous area, noisy ideas of the directions
towards that area, and by cooperating with each other they would like to avoid one area
and converge to another area.
So this is an example of distributed cooperation among agents sharing information with
each other so that they can get a better estimate of where they want to go and a better
estimate of what to avoid. Okay?
>> Li Deng: So the information in this example is the distance --
>> Ali Sayed: Distance and direction. But these are noisy.
So through cooperation they -- so some of these agents may know more information
than other agents, right?
So, for example, if danger could be a shark following these agents and so the one that's
closest to the shark may have better information about that particular element than the
others. And through sharing and diffusion of information through this network, they
can behave in a better way and avoid that particular danger.
So you can see from this example -- this is just an example to motivate the idea -- that in
distributed processing you have a global objective: the agents would like to work
together to achieve something that's common to them, whether it is reaching a target or
avoiding a certain domain. And the processing has to be local. These agents cannot
collect all the information they have, send it somewhere else, and then wait for the
result. It has to be local.
And then the agents in general, they are dispersed so they can only communicate with
their immediate neighbors, okay? So these are features of the distributed solution that
we are interested in.
Now, why the interest in distributed signal processing? There is lots of work going on in
this area. And there are many good reasons. I list several of them here. For example,
there is also the power of cooperation to mine data, right? Nothing better than putting
many intelligent agents -- even if each one of them has limited capabilities, when you
put them to work together, something comes up, right? That's the power of cooperation
and multiplication.
Often the data is available in dispersed locations, right? That happens nowadays more
and more often. You have lots of data that you would like to access and process, right?
And doing that by using a collection of agents, each one doing its part and working
together, right, that will allow for better performance, right?
For example, the alternative would be for these agents to send the information they
have to a central processor for processing and then communicate back with the agents.
In many situations the agents may not be comfortable in doing that for privacy issues,
secrecy issues and so forth. Okay? So these are just some examples of why
performing things in a distributed manner would be -- would be useful. You have a
question? Yes?
>>: I want to make sure that I understand. So the distributed way, is it because you
think that distributed is more -- you get better results by using distributed systems, or is
it because you're restricted to use distributed because, for example, the data is
distributed?
>> Ali Sayed: Right. That's a very good question. Both reasons can be justified
depending on the situations. In some situations the data is already distributed, and for
reasons I listed here you would like to process them in that manner. Okay?
For example, sending the data all the time to a central processor and sending it back,
you end up with an architecture where if the central processor fails, everything fails,
whereas the distributed solution would be much more robust.
And the second question becomes if I solve these problems in a distributed manner,
can I attain or match the performance of a centralized solution with the added
robustness? And that's what I will be showing you later. You know.
>>: Just make sure.
>> Ali Sayed: Yes.
>>: So still, at least in theory, if we put aside the robustness thing, the centralized
solution, if it was possible, would be better than the distributed solution?
>> Ali Sayed: Yes. Yes.
>>: Right?
>> Ali Sayed: Okay. It would be better for the same class of algorithms -- which is the
class of stochastic-gradient algorithms that I will be discussing here. If
you limit them to using the same kind of tools, that's not necessarily a correct statement,
as I'm going to show. The distributed solution can do as well. Okay?
And actually the distributed solution adds an additional degree of freedom that you can
exploit. I'm going to show that, okay?
>> Li Deng: So related to this -- so this morning we had a discussion, since we both [inaudible].
>> Ali Sayed: Yes. Yes.
>> Li Deng: So to your first question, for training data the [inaudible] can never do
better than the function [inaudible]. So our hope is that it may help with the
generalization, for which we don't have any theory.
>>: But any -- so the reason why I'm puzzled is you can -- on a centralized solution you
can always simulate a distributed solution.
>> Ali Sayed: I know. But what I'm saying, there are many situations in practice where
the centralized solution is not an option.
>>: Right.
>> Ali Sayed: You would like to do it in a distributed manner. Right. If you want
to solve it in a centralized manner, of course you can, if you can afford to do that. But now
the question becomes: if I don't want to do that, for many reasons, okay, how close can
I get to that, and what are the advantages of doing that? This is what this talk will show,
right? Okay.
These are all valid questions, but there are reasons why you want to do things this way
or that way. Okay?
So when you do things in a distributed manner -- so motivated by these considerations
that I just mentioned, this is what we would like to do. We would like to design adaptive
networks which consist of a collection of agents linked together. These links can
change over time. Agents can turn on and off over time. I can give more or less weight
to my neighbors, depending on whether I trust them more or less.
So I would like to develop networks that behave in a truly adaptive manner. The
network as a whole becomes an adaptive entity. The agents are adaptive. They are
able to learn continuously. The agents cooperate with each other locally, right?
Even the topology can change with time. The agents can also move; for example,
biological networks are a good example of that.
And on top of all of that, I would like this network to solve something meaningful, not just
do this for the sake of doing it, okay? But when you try to solve problems like this,
many interesting issues happen and many interesting observations come up, okay?
Why? Because you have many additional degrees of freedom now that you can control.
For example, topology. Who should be connected to whom? You know? You have
many options to connect people with each other, right? So this influences the behavior in a
certain way. So how can I exploit that degree of freedom to my advantage? How much
weight should I give to the information I am receiving from [inaudible] as opposed to the
information I'm receiving from Li? Maybe I trust one more than the other because one
is noisier and the other is less noisy, right?
How should I adapt those weights over time? Because over time one may become less
reliable than the other. So there are now many degrees of freedom that I can control, you
see? Some agents may fail, and yet you would like the network to continue to work: as
long as it is connected and there are paths between agents, things should continue to
work, even if some links fail, even if some agents fail and so forth.
>>: [inaudible].
>> Ali Sayed: Yes?
>>: In [inaudible] applications they are dealing with -- how realistic is it to assume that
information passing is just done by multiplying by a weight or --
>> Ali Sayed: This is a model, of course. This is a simplified model, right? Like many
other problems in engineering we assume a model. It just turns out that this model is
very useful.
For example if you wanted to solve distributed -- many problems can be formulated as
distributed optimization problems, and it turns out that the solution will involve a step
that involves aggregation of the form I'm going to be showing you here where you weigh
data over the links in that manner. Okay?
>>: So in the early example you gave [inaudible].
>> Ali Sayed: Exactly. Yes.
>>: Is that kind of problem appropriate to be modeled by this kind of [inaudible].
>> Ali Sayed: Of course. We haven't performed biological experiments to check that,
but there are studies, there are studies, and I give references in some of my papers by
groups whose research is exactly devoted to modeling, for example, how fish behave,
right, and they perform experiments on that. And they have models where they perform
aggregation models like this, and they match theory to practice, you know.
So there is some justification, even experimental, justifying that the kind of things that
we do here could be an appropriate model. I'm not saying this models everything, but in
some cases it is an appropriate model, okay?
Anyway, so there are many degrees of freedom that we would like to control. But
regardless of that, many issues come up that we are not used to from single-agent
signal processing. Some issues come up that are not a direct extension of the intuition
we have from single agent or centralized processing as you are going to -- you are
going to see.
So in order to appreciate that, let me first give you a brief overview, brief review of
classical results in stand-alone adaptation and learning. When you have a single agent
that's trying to learn and adapt something, let's see what's known there, and then let's
see when you try to extend these results to the multi-agent case what issues occur,
what issues come up. Okay? And then you will appreciate more, okay, the richness of
the distributed solution.
So I'm now giving you for the next few slides a very brief summary of what single agent
adaptation and learning is. So in single agent adaptation and learning we are interested
often in problems like this. I have a cost function J of w, where w is some
parameter vector I would like to estimate, like the one we were talking about. J is a cost
function, for example, a mean squared error cost function. But it could be some other
cost function, a logistic function or something much more general than that.
In most problems of interest, the risk function is often expressed as the expectation of
some loss function, okay? For example the mean squared error is the expected value
of some error squared. So the error squared is the Q function. And the expected value
of that is the risk function. Okay?
So I'm introducing both notations because we are going to use them. So how do
people solve a problem like this? Using a stochastic-gradient algorithm.
So for this particular talk, just to keep the technicality simple, I'm going to assume that
my cost function is convex. So you have a minimum and you would like to converge to
that minimum. So a popular solution for that problem is the steepest-descent solution.
What do you do? You start from an initial guess and you start iterating over it, moving
along the negative direction of the gradient vector of J, your risk function, right? You
evaluate the gradient vector at the current estimate, multiplied by a step-size mu of i.
That's a positive step-size parameter that changes with time.
If you keep repeating this, there are results that guarantee the following. Okay. So that
step-size is a decaying step-size, meaning the step-sizes add up to infinity but their
energy is finite -- for example, one over i: the sum diverges, so it goes down to zero but
not fast enough. That's what this condition is saying, okay?
Any step-size sequence like that, one that goes down to zero but not too fast, guarantees
the following conclusion: that if you repeat this recursion, WI will go to the
minimum that you are looking for, okay? These are classical results in stochastic
approximation theory. Okay?
You will get the behavior like that. It will go down to the minimum.
>>: But only for convex problem, right?
>> Ali Sayed: For this problem, yes. Yes. For convex problems where you have a
unique minimum. Okay.
Now, what happens in practice is that very rarely do you have the function J. So you
cannot compute the gradient of J. Why? Because J is the expectation of something.
And for you to be able to know J, you need to be able to know the statistical properties
of the data so that you can compute the expectation over that. I'm going to show you a
specific example.
So what people do, they replace the gradient of J by the gradient of Q. Q depends only
on the data. So the gradient of Q is an instantaneous approximation for the gradient of J.
And that's why now it's called a stochastic-gradient algorithm. Why? Because now you've
replaced the true gradient that you want by an approximation for it. So that introduces
an error. That error is called gradient noise.
So when you use the stochastic-gradient algorithm, you always introduce an error into the
algorithm, which I'm denoting by S, okay?
But still, under some reasonable assumptions on this gradient
noise -- for example, that it is zero mean on average; there are some reasonable
technical assumptions on it -- you can still show that convergence
occurs. That's very interesting: for decaying step-sizes, even if you replace the gradient
of J by the gradient of Q, you can still get convergence and solve your problem. Okay?
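To make the distinction concrete, here is a minimal sketch, in Python, of this stochastic-gradient recursion for a toy scalar quadratic risk. The data model, constants, and step-size schedule are illustrative assumptions, not the speaker's code:

```python
import numpy as np

# Minimal sketch: stochastic-gradient descent with a decaying step-size
# for the risk J(w) = E[(d - u*w)^2], scalar w. The synthetic data model
# (w_true, sigma_v) is an assumption for illustration.

rng = np.random.default_rng(0)
w_true, sigma_v = 0.8, 0.1

w = 0.0
for i in range(1, 10001):
    u = rng.standard_normal()
    d = u * w_true + sigma_v * rng.standard_normal()

    # Gradient of the loss Q(w; d, u) = (d - u*w)^2 replaces the true
    # gradient of J, which would require the moments of (d, u).
    grad_Q = -2.0 * u * (d - u * w)

    mu = 1.0 / (i + 10)   # decaying: adds up to infinity, finite energy
    w = w - mu * grad_Q   # stochastic-gradient update

print(w)  # approaches w_true (classical stochastic-approximation result)
```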
So let me show you an example so that you understand this better. Let's take a very
simple example, a linear model, different from your model, okay? Assume I have data D
which is the output of some FIR channel. U is the input, the betas are the coefficients.
So D is a combination of delayed versions of U plus noise. Right?
This is a channel estimation problem formulation. You know the D and you know the U,
so you know the input/output behavior. You would like to estimate the parameters, right, the betas.
Let's look at this. So that's the model. I'm going to collect my Us into a regression
vector. You have M of them. Just put them into a vector. Collect the unknowns into a
vector. So I can just use a more compact vector notation, because that sum becomes
an inner product. This is just notational convenience, right?
So I would like, given D and U, I would like to estimate W, right? That's the problem.
So one way to do it, given D and U, find the W that minimizes the mean squared error,
right? This is a classical problem in estimation theory, right?
So you see that J is the expected value of the mean square -- of the error, whereas Q is
the squared error, right? So you see to find the J, you need to know the moments of D
and U. If you expand this, you need to know the variance of D, the variance of U, the
cross-correlation between D and U, right?
Whereas for Q, if you just know the data and you have an estimate for W, you can find
it without knowing the statistical properties of the data, okay? So if you continue: risk on the
left, loss on the right, okay. So if you expand the J on the left, I need to know these
quantities here, the moments, RDU, RU. You need to know them, right?
So if you write down the steepest-descent algorithm, you move along the direction of
the gradient of J, the gradient of J will involve the moments, which generally you don't
know. Right?
If you instead replace the gradient of J by the gradient of Q, you move along the
direction that's determined by the data. So now you can apply this algorithm, right? So
this algorithm only depends on the data.
And if you are using a decaying step-size, and under reasonable conditions on the
gradient noise this will converge, right? So this is an example that illustrates these
points.
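As a hedged illustration of this channel-estimation recursion, here is a small LMS-type sketch; the channel length, noise level, and step-size schedule are made-up values:

```python
import numpy as np

# Sketch of the stochastic-gradient (LMS-type) channel estimator:
# d(i) = u_i^T w_o + v(i); the update follows the gradient of the
# instantaneous squared error with a decaying step-size.

rng = np.random.default_rng(1)
M = 5                                  # number of channel taps (assumed)
w_o = rng.standard_normal(M)           # unknown channel
sigma_v = 0.05

w = np.zeros(M)
for i in range(1, 5001):
    u = rng.standard_normal(M)         # regression vector
    d = u @ w_o + sigma_v * rng.standard_normal()
    e = d - u @ w                      # output estimation error
    mu = 0.5 / (i + 10)                # decaying step-size, as in the slides
    w = w + 2 * mu * u * e             # gradient of Q = (d - u^T w)^2

print(np.linalg.norm(w - w_o))         # small: w converges toward w_o
```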
>>: Can I ask a question --
>> Ali Sayed: Yes. Yes.
>>: To refresh my --
>> Ali Sayed: Yes. Yes.
>>: If you are in a case where you actually happen to know the statistics --
>> Ali Sayed: Yes.
>>: -- but just for fun you use the stochastic gradient when you could have used both, which
means that the stochastic gradient is likely to converge more slowly, right?
>> Ali Sayed: It will be bumpy here. It will converge, but it will be a bumpy ride.
>>: But the average rate of convergence is likely to be the same as the --
>> Ali Sayed: Yes, it will be similar, yes. The rate on average, okay, if you use the
right one. On average the rate of convergence --
>>: So it oscillates, but it doesn't go --
[brief talking over].
>> Ali Sayed: On average, the rate of convergence will be the same. But it will be a
bumpy ride until [inaudible].
>>: Okay.
>> Ali Sayed: Okay. Fine.
So this illustrates the idea. But now I want to mention one point which is important, you see. So
if you use steepest descent -- if you knew the gradient -- that's the first line. If you don't
know the gradient, that's the second line. The difference is an error, right?
But look at what's important, you know: even if in the stochastic-gradient case W converges
to WO, the gradient noise will never go down to zero. You will always have a term there
coming from the measurement noise. That's a persistent noise that will always be
present, okay? Even if the algorithm is able to converge.
So that component that never disappears -- I'm going to call its variance R sub-S.
That's the variance of that [inaudible]. So because the data are independent of the noise,
that's the variance of the noise times RU. Okay? Just remember this fact. Because I'll
show you why it's important. Okay?
But even though you are getting convergence, there is persistent gradient noise
influencing the algorithm, okay? Fine.
This is another example. Okay. This is closer to some of the things you are doing.
For example, I have data that belongs to one class or another class, plus or minus one,
and you would like to find the hyperplane that separates them, okay? So there are
many ways by which you can formulate this problem in a similar manner. For example,
by choosing the loss function to be let's say the logistic function, okay? This is just
another example to illustrate the idea in a different domain. You can write down the
stochastic-gradient algorithm for that function, okay? If you go back, right, gradient of
Q. Write down the recursion. You get this -- this algorithm, right, widely known.
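For this classification setting, a hedged sketch of the stochastic-gradient recursion on the logistic loss might look as follows; the data generator and the "true" hyperplane are assumptions for illustration:

```python
import numpy as np

# Sketch: stochastic gradient on the logistic loss
# Q(w; h, y) = ln(1 + exp(-y * h^T w)), labels y in {+1, -1}.

rng = np.random.default_rng(2)
M = 3
w_o = np.array([1.0, -2.0, 0.5])       # assumed separating direction

w = np.zeros(M)
for i in range(1, 20001):
    h = rng.standard_normal(M)
    y = 1.0 if h @ w_o > 0 else -1.0   # noiseless labels, for simplicity
    # gradient of the logistic loss at the current iterate
    grad_Q = -y * h / (1.0 + np.exp(y * (h @ w)))
    w = w - (1.0 / i) * grad_Q         # decaying step-size version

print(w / np.linalg.norm(w))           # roughly aligns with w_o's direction
```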
Now, you see that so far I have used a decaying step-size. The decaying step-size
allows W to converge. But what if you are dealing with a problem where this
hyperplane is always changing because the nature of the data is changing?
So you need an algorithm that is able to track changing models. That's why now I have
to shift from a decaying step-size to using a constant step-size. Because when you
use a decaying step-size, by the time it goes down to zero, you are not updating
anymore. And if your model changes, you are not able to track. Okay? So from now
on -- so by adopting a constant step-size you give the algorithm the
ability to adapt. If the model changes it will be able to track, to continue to track. Okay?
>>: May I ask a question?
>> Ali Sayed: Yes. Yes.
>>: Then in this case you might say that there's probably such a thing as a [inaudible]
step-size to adapt to that particular rate, right?
>> Ali Sayed: Yes. Yes. All these questions are relevant and they have been
addressed, okay?
>>: Okay.
>> Ali Sayed: Very good. Now, there are issues now. Let's continue. So now I remove
the decaying step-size. Let's replace it by a constant step-size. So for example if you
repeat this for the channel estimation problem I mentioned before, the only difference is
that the mu now is a constant mu. Right? So remember that even if this doesn't
happen, even if this guy were to converge to W, you still have a persistent gradient
noise. And because this term never dies out, that gradient noise will always be there
influencing performance.
So there is a price to pay, okay? Because in the decaying step-size case, the step-size
will cancel that term, so that gradient noise, even though it is there, doesn't influence
the convergence. But now you have a constant step-size.
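A hedged sketch of the tracking benefit: with a slowly drifting channel (a random-walk drift model assumed here purely for illustration), a constant step-size keeps adapting where a decaying one would freeze:

```python
import numpy as np

# Constant step-size LMS tracking a drifting channel w_o. The drift model
# and all constants are illustrative assumptions.

rng = np.random.default_rng(3)
M, mu, sigma_v = 5, 0.01, 0.05
w_o = rng.standard_normal(M)
w = np.zeros(M)

for i in range(20000):
    w_o = w_o + 1e-3 * rng.standard_normal(M)  # slow model drift
    u = rng.standard_normal(M)
    d = u @ w_o + sigma_v * rng.standard_normal()
    w = w + 2 * mu * u * (d - u @ w)           # constant step-size LMS

print(np.linalg.norm(w - w_o))  # stays small; the error floor is O(mu)
```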
>>: Right. Okay. So [inaudible] condition of that to be squared to be less than zero?
>> Ali Sayed: Square? Which is square?
>>: [inaudible] there's a condition for the step-size.
>> Ali Sayed: Yeah. No, those conditions are for decaying step-size. This is different
now. Right. So now you have a constant step-size, okay? So now adaptation, the
ability to -- you are learning. But you would like it to be able to adapt as well. But this
comes at a price. At what price, right? So let's see at what price. Okay?
So to analyze at what price, let's define the error between what you want and what you
have, right? In the decaying step-size case, that error goes to zero. In the constant
step-size case, the variance of the error will go to some value. Right? Because WI
will not go to W0. So we measure the size of the error by
measuring the mean squared deviation, okay, which is
the variance of the error. It will go to some value.
And also it will reach that at a certain convergence rate alpha. This is essentially giving
you an idea of how many iterations it takes for you to get there, okay?
So, for example, some results for this general case look like this, okay? So the
algorithm is up there. I'm repeating it. Constant step-size. This is -- the covariance of
the noise that's persistent. So even when you converge, this noise will be there. So call
it R sub-S. And this is the Hessian matrix. H is the Hessian matrix of your cost
function, okay? So these two parameters determine how much performance gain or
degradation you are going to get, okay?
So the results in the general case look like this. So how much MSD will I get in steady
state, okay? The error is not going to zero. But it will be proportional to the step-size.
So if you use very small step-sizes for all practical purposes this is a small error. So
you can ensure that I will be fluctuating very close to the optimal solution I'm looking for.
And you will be converging at a certain rate, which depends also on the step-size and
on the smallest eigenvalue of this matrix H, okay?
But the important conclusion from here is that the price you pay is a degradation in
the MSD only of the order of the step-size. If the step-size is small, that's good enough.
Of course, a smaller step-size means slower adaptation, because if mu is small, this is
closer to one -- you converge slower, you're adapting slower. Okay? Fine.
So now let's get closer to networks now. So this was my brief overview of classical
results. So now let's assume I have two nodes. Node one is getting data D and U. I'm
calling them D sub-1, U sub-1, because these are data at node one. And it's adapting.
Okay? So if you use the previous expression for the MSD in this case, you will find out
that the performance of this algorithm in steady state -- it will get close to the solution by
this much, okay?
This expression is exactly the previous expression, now specialized to the quadratic
case, okay? M is the size of the filter. And that's the variance of the noise.
If you have another node receiving data and doing the same thing, you will also get a
similar expression. But you can see that if the data at node two has more noise
variance, sigma V squared, than the data at node one, then node two will perform worse,
right? Even though both of them are trying to estimate the same channel, if the data
here is noisier than the data there, then you can see that when they solve the problem
on their own, one will perform worse than the other, right? This is only natural from the results.
Now, let's take the case where the nodes send their data to the fusion center, okay?
Send their data to the fusion center, the centralized processor. And that fusion center is
going to run a centralized stochastic-gradient algorithm of the same nature, right, an
LMS type algorithm of the same nature as the previous nodes. Because I want to
compare algorithms of the same class.
Of course, the central processor can do something much more involved. But then that
will not be a fair comparison. Okay?
So if the central processor uses an LMS algorithm what's different here? Because it
has data from both nodes, it can average their gradients to get a better approximation
of the gradient direction along which it should move. Fine. So what's the
performance of this algorithm?
So the performance of this algorithm, if you use the expression I derived before,
because this is an LMS type algorithm, you'll find out that it is a similar term but now
times the average of the noises. Okay?
So now when you have two nodes, one has noise at point one. The other has noise
level at point two. The average will be some value in between. So you can -- so you
can see that if you perform it in this manner, the performance will be better than one but
not necessarily better than -- than the other. Right?
And the question is can I perform this in a different way, in a distributed way, such that I
can take advantage of how they share the information and ensure that the performance
will be better relative to both, that when they share information they can perform better
than one and better than two. Okay?
You had a question?
>>: Do you assume that they both get the same amount of data?
>> Ali Sayed: Yes. The same -- at every time i the fusion center gets --
>>: So they have a single clock covering the --
>> Ali Sayed: Yes. At the same time i the nodes send the data to the fusion center and it
runs an LMS-type algorithm. And the performance will actually be the average of the
performances. And for that case it will be better than some and worse than others.
Okay?
So now let's take it to a distributed domain. I will first discuss the case of two nodes and
then I'll go to a network. Okay? So the first question is: what if I don't want to
use a centralized solution -- I want to use a distributed solution where, let's
say, the two nodes want to talk to each other, meaning they want to share some
information?
So, for example, node one sends its estimate to node two. Node two is going to scale it
and combine it with its own estimate. And also node two is going to send an estimate to
node one. For example, look at the first equation. Node one is going to perform the
update just like before. But it's going to call that an intermediate value of psi one. Not
its updated estimate, but an intermediate value of psi.
Node two is doing the same thing. It's going to call that an intermediate value of psi
two. Now, the nodes share the psis. So node one is going to take its own psi and the psi
from node two and combine them in a convex manner.
Node two is doing the same thing: it gets psi one from node one and combines it in a
convex manner with its own psi two. Right? That's the only additional thing I did.
Each one of them is still running an LMS-type update. But in addition they share --
okay, this is what I think the estimate is, this is what you think the estimate is -- they
share their local information to get a better estimate of what it should be, and then they
use this for the next iteration and continue from there.
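A hedged two-node sketch of this adapt-then-share scheme; the noise levels and the convex combination coefficients are illustrative choices:

```python
import numpy as np

# Two-node sketch: each node runs an LMS update to an intermediate psi,
# then the nodes exchange psis and combine them convexly. The combination
# coefficients a[l, k] (columns summing to one) are design parameters.

rng = np.random.default_rng(4)
M, mu = 4, 0.01
w_o = rng.standard_normal(M)
sigma = [0.05, 0.20]                   # node two is noisier (assumed)
a = np.array([[0.7, 0.3],
              [0.3, 0.7]])             # a[l, k]: weight node k gives node l

w = [np.zeros(M), np.zeros(M)]
for i in range(5000):
    psi = []
    for k in range(2):                 # adaptation step at each node
        u = rng.standard_normal(M)
        d = u @ w_o + sigma[k] * rng.standard_normal()
        psi.append(w[k] + 2 * mu * u * (d - u @ w[k]))
    for k in range(2):                 # combination step: share the psis
        w[k] = a[0, k] * psi[0] + a[1, k] * psi[1]

print([float(np.linalg.norm(w[k] - w_o)) for k in range(2)])
```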
>>: [inaudible] processor --
>> Ali Sayed: No, it's not doing this. This is -- this --
>>: No, no. Which part of the processing node does that average --
>> Ali Sayed: No. No. Each average is done at the node. You see, because they share,
node one does this and this, node two does this and this.
>>: Okay.
>> Ali Sayed: Right? So you have adaptation, aggregation, adaptation, aggregation.
Right? What performance can you expect from -- okay, from a system like this. Okay?
So questions that are useful to ask. Can the coefficients be chosen to ensure that the
MSD performance that each one of them is going to get is better than what they used to
get when they were doing it on their own? Right? The centralized solution only ensures
that you will do better than one. But can I do better than both?
And the answer to this is yes. Okay?
>>: I'm confused.
>> Ali Sayed: Yes?
>>: Why can the centralized one not do better than either --
>> Ali Sayed: It is using a stochastic algorithm, this stochastic
algorithm that I'm showing you here -- an LMS algorithm of a similar nature, simply
averaging the data. Now, you can ask this question -- I will mention this later, okay?
The reason why this can do better is because you have an
additional degree of freedom, which is these coefficients alpha and beta that I'm using
now, and that I'm going to design.
>>: [inaudible] noise levels are different --
>> Ali Sayed: Right. These alphas and betas will correct for that.
>>: [inaudible] are the same, this would be as good --
>> Ali Sayed: As good as centralized. Okay? Because here you are not using alpha
and beta. However, I can incorporate weighting into this solution. I can say
I'm going to weight the data from node one by a certain alpha, the weight from -- you
understand? Okay.
>>: So you have the question down the road to --
>> Ali Sayed: Down the road, yes.
[brief talking over].
>>: Beta and then you --
>> Ali Sayed: Right. I'm just getting there. I'm just moving through a simple example to
get to that point. Does that answer your question?
>>: I guess. So I mean you should be able to do -- I mean, you have twice as much --
>> Ali Sayed: Yeah, yeah, yeah. If you modify this -- you know, one of the
final conclusions there is that it will tell you that you should modify this if you want to match
the distributed, right? But then what I'm trying to say is that the distributed solution has
additional degrees of freedom that you can exploit to match the best you can do with a
stochastic-gradient algorithm here. So you are not losing -- you are not limiting yourself
by doing things in a distributed manner, okay? That's one of the conclusions. Okay?
Fine.
So can performance match? The answer will be yes. And this is another interesting
question. Okay? Node one can run LMS and estimate the channel. Node two can run
LMS and estimate your parameter vector. Does it always follow that if I put them to
work together like this, the combination will always be able to solve the same problem?
And this is not true. This is not true unless you do it right, okay? So, for example --
these are results I'm going to show you later. If you use consensus strategies, all right,
in comparison to the strategies I'm going to introduce here, which are based on diffusion
strategies, this can happen. The individual nodes can be stable. But when they work
together, the network becomes unstable. For reasons I'm going to display to you. So
you have to do it right. You cannot just combine them like this. Okay?
>>: But there is an additional [inaudible] that I just wonder whether this [inaudible] can
exploit, that is, when you combine, when you aggregate -- in other words, can you do it
asynchronously, the way you are doing it synchronously?
>> Ali Sayed: So far is synchronous, yes.
>>: Synchronous, does it give you additional.
>> Ali Sayed: It gives you additional -- I will not be talking about the asynchronous
network behavior here. But that adds another degree of freedom that you can exploit.
It's not discussed here. Okay? But that's a good point, also, okay?
Now, so let's go to the networks. You know, I just talked about two nodes. Now,
assume you have a collection of nodes, right, five, or ten, or a thousand. Now there are
many different ways by which these nodes can be connected to each other, right, not
just like the alpha and beta I showed you before. You can define many different
neighborhoods for each node, okay?
So if I give you a collection of nodes and many different agents and many different
topologies, which topology is better? Okay. That's a valid question, right? Does it
matter which topology I use? Will all of them converge to the same conclusion at the
same rate and give me the same performance? Okay.
These nodes are going to combine the data they share over the links by using certain
combination weights. Does it matter what combination weights I use? For example, if
I'm a node and I have three neighbors, I can give one-third weight to each one of them,
or I can give more weight to him and less weight to you. Does it matter how I combine
the data?
So these are valid questions that now arise at the network level, okay? So to answer
these questions, let me show you now we answer questions like this.
>>: May I ask a ->> Ali Sayed: Yes.
>>: -- question that in practice might be important.
>> Ali Sayed: Yes.
>>: I'm pretty sure you're excited about that, is in many cases there is a nontrivial cost
of sending the data between sensors.
>> Ali Sayed: Yes.
>>: So one question could be what's the minimum number of edges between those
agents that would give a certain level of [inaudible] performance.
>> Ali Sayed: That's a very good question and a very valid question. In this particular
presentation I'm assuming that they can share information at no cost, okay? In
some other work that we have done, we assume that there is a cost
involved for sharing information, right? In that case, some agents may decide to share,
may decide not to share because that cost may be attractive or not attractive. And that
involves other dynamics. It's a very good question, right? But it's not in this particular
talk, okay?
>>: Can you just look at it like it's a matrix of weights and some of them have to be
zero?
>> Ali Sayed: Here?
>>: Yes.
>> Ali Sayed: Yes. I'm coming -- I'm coming next, okay? You mean for the -- for
talking to each other?
>>: Yes. Right.
>> Ali Sayed: Yes. Okay. So I will get there. Okay?
So to answer these questions, I need to understand how the topology and the combination
policy influence performance. Right? So I need to perform that analysis.
Now, the challenge here is that -- think of it, you have many agents. They
interact with each other. So whatever happens here is going to influence other agents.
So they are coupled together, okay? So that's the challenge that you face here relative
to studying single-agent systems. Okay?
Nevertheless we can get some results. So let me show you the results now, okay? Of
course, the summary of the results. So I have a network. Let's now go to the network
level. I have a network and agents connected. Each agent has a cost function
associated with it. They call -- they could be the same costs I'm calling J sub-L if only
they have the cost function associated with agent L. It could be the mean squared error
cost. All of them could have the same cost, like machine learning applications, that's
common. All of them are trying to minimize the same cost.
But for generality, let's assume that they all have different costs. And I would like to
minimize an aggregate cost function across the network, subject to some constraints,
because some agents may know something about the solution that other agents may
not know, right? So you can solve a constrained problem like this, okay?
But for this talk -- a problem like this you can transform into an approximate
unconstrained problem that's good enough, that looks like this. I'm going to focus
on the unconstrained version here, okay?
>>: [inaudible].
>> Ali Sayed: Convex. Yes. Subject to convex constraints. Okay. So okay.
So let's just -- this is good enough for this presentation. I have an aggregate cost the
network would like to work together to optimize. Find the minimum. Okay. And each
cost, like I said, is the expected value of a loss function, okay? So this is one algorithm,
okay? We developed this algorithm years back, around 2006. And since then many
colleagues have studied it and contributed to the area as well. This is how the algorithm
works, you see.
If you look at the first line, each node K is going to perform an LMS-type update, a
stochastic-gradient approximation update. W sub-K at iteration i minus one is the
estimate at node K. That node is going to move along the negative direction of its loss
gradient to find an intermediate estimate.
And then in the second step node K is going to combine in a convex manner using
combination weights A. Those are the alphas and betas from before, but now I call
them A, over the neighborhood of node K. So if I have five neighbors, I'm going
to use combination weights A to combine the estimates of my neighbors.
So every node in the network now is applying a strategy like this. Of course, I'm not
showing you how we arrive at this strategy. There is a formal way of arriving at this
strategy. I'm just showing you the final result, just like I showed you for the two-node
case, okay?
But the nice thing is that you see the analogy extends, right? The analogy extends.
So the first step I call adaptation. The second step I call combination,
aggregation. So that's why we call this the adapt-then-combine (ATC) structure.
You first adapt, and then you combine. Because there is a variation of this where you
first combine and then adapt.
Now, I want to mention two observations which are important. Observe that in the
adaptation step every agent is starting from where it is, from its current state, its current
estimate, right. So it is the same W here and the loss is evaluated at the same W.
Okay? That's number one. And the step-sizes are constant, okay?
The second variation is where you combine first. So first I combine the estimates that
exist at my neighbors, and then I perform the adaptation from that intermediate state
that I found, right? So this is the other version of the algorithm where you combine,
then you adapt. Okay?
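To make the two-step structure concrete, here is a hedged sketch of ATC diffusion LMS over a small network; the ring topology, noise profile, and averaging weights are illustrative assumptions:

```python
import numpy as np

# Adapt-then-combine (ATC) diffusion over N agents. A is left-stochastic:
# each column k holds the weights node k applies to its neighbors' psis.

rng = np.random.default_rng(5)
N, M, mu = 6, 4, 0.01
w_o = rng.standard_normal(M)
sigma = rng.uniform(0.05, 0.3, size=N)          # heterogeneous noise levels

# ring topology with self-loops; averaging rule within each neighborhood
A = np.zeros((N, N))
for k in range(N):
    nbrs = [(k - 1) % N, k, (k + 1) % N]
    for l in nbrs:
        A[l, k] = 1.0 / len(nbrs)

W = np.zeros((N, M))
for i in range(5000):
    psi = np.zeros((N, M))
    for k in range(N):                          # step 1: adapt (local LMS)
        u = rng.standard_normal(M)
        d = u @ w_o + sigma[k] * rng.standard_normal()
        psi[k] = W[k] + 2 * mu * u * (d - u @ W[k])
    for k in range(N):                          # step 2: combine neighbors
        W[k] = sum(A[l, k] * psi[l] for l in range(N))

print(np.linalg.norm(W - w_o, axis=1))          # every node ends near w_o
```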
>>: Which one's better? I read the paper.
>> Ali Sayed: Yes, ATC has some advantages, okay? Because think about it. What
ATC is saying is: I'm node K, let me first think on my own, adapt based on my data.
And then everybody else is also doing its own adaptation and learning based on its data,
and then you combine.
So it's like an exam problem where students can solve the problem cooperatively:
first they think and then they share the information, right? Whereas here they first share
the information before thinking, and then they think, okay? You can show analytically
that ATC has some advantages in terms of convergence rate and mean
squared performance.
Now, the third form of the algorithm that I would like to show, which is also very popular
in the distributed optimization literature, is the consensus strategy. It is similar in form to
the second diffusion strategy I showed where you combine first and then adapt. There
it is. You combine first and then adapt.
However, there is one critical difference here. In consensus, when you adapt, you start
from the intermediate estimate, but the loss function you still evaluate at where you
were. Whereas that's not the case in diffusion. In diffusion, it's always symmetric, okay?
And it turns out that this asymmetry is what causes instability when you are trying to use
networks to perform adaptation and learning. We have shown in a paper in TSP
published in 2012 that the state of consensus networks can grow unbounded
because of that. Okay?
So that's why I'm going to focus on diffusion, okay? And that's for the case of constant
step-sizes. If you are using decaying step-sizes, this is not an issue. But if you are
using constant step-sizes, things accumulate over time, and then instability can occur.
Okay?
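The asymmetry can be seen in a schematic sketch of the two per-node updates (with grad denoting the local stochastic gradient; this is an assumed paraphrase of the slide, not the paper's exact notation):

```python
import numpy as np

def cta_diffusion_step(w, k, A, mu, grad):
    # combine-then-adapt diffusion: combine the neighbors' estimates,
    # then evaluate the gradient AT the combined point psi (symmetric)
    psi = sum(A[l, k] * w[l] for l in range(len(w)))
    return psi - mu * grad(psi)

def consensus_step(w, k, A, mu, grad):
    # consensus: same combined starting point psi, but the gradient is
    # still evaluated at the node's own previous iterate w[k] (asymmetric)
    psi = sum(A[l, k] * w[l] for l in range(len(w)))
    return psi - mu * grad(w[k])
```

With constant step-sizes, that mismatch between the point you start from and the point where the gradient is evaluated is what can accumulate and destabilize the consensus network.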
So let's continue with the diffusion solution. Fine. So I have a network now. I'm trying
to minimize an aggregate cost function. I have a class of algorithms. The diffusion
strategies that I can use to solve the problem. So how well does this network perform,
and how does the performance depend on how I connect the agents, how they
combine, and what weights they use? Okay?
So, now, I have many nodes. So for each node I'm going to define an error, W tilde sub-K,
i, which is WO minus WKI, right? That's the error at node K. I'm going to define the
variance for node K and the network variance as the average of those. Right? Just like
I did for the single-agent case. Now, for every node I have the performance and the
average performance, right?
So what are the values of these? Okay. So let me show you the results, okay? But
before, some notation. Remember that every node is combining. Node K has neighbors --
say 2, 6, 7, and L -- and is combining the data from these neighbors using coefficients A.
I'm going to collect these coefficients into a matrix A. So this matrix A, like Phil was
saying, is a matrix which is N by N, and the (L, K) entry has the weight that agent K is
using for data arriving from agent L. That's what it means. Okay?
>>: So in matrix [inaudible] are separate?
>> Ali Sayed: W is what you want to estimate. That's the parameter you want to
estimate. The As are the combination coefficients the agents are using to perform the
aggregation -- that's a free parameter that you design. You can choose the As differently, right?
Because you have N agents, you will have a matrix that's N by N. It has the
combination weights. Okay, if you look at any particular column, you'll have the
weights that that particular node is using to combine the information from its neighbors.
Okay?
Because these combination weights are convex combination weights, this matrix A
turns out to have a useful property: the entries add up to one on
each of its columns. Okay? It's a left-stochastic matrix.
And because I'm assuming that I have a connected network, right, with at least one
self-loop -- so at least one agent should trust its own data, so at least one nonzero
element on the diagonal of A exists, okay? AKK for some agent K should be positive.
Okay? This is called a primitive matrix. Okay?
For matrices like this, you have this interesting result. It follows from the
Perron-Frobenius theorem, which says this: A has a left eigenvector of all ones, from this
relation, with eigenvalue one. But there is also a right eigenvector, which I'm going to
call P. And I'm going to normalize it so that its entries add up to one. What the theorem
guarantees is that all other eigenvalues of A are strictly inside the unit circle. There is
only one eigenvalue at one, with eigenvector P. And all the entries of P are between
zero and one.
So they add up to one. And they are between zero and one. So the entries of this right
eigenvector P, they have the interpretation of probability measure. Right? They add up
to one. And all of them are between zero and one.
So P reflects the topology, right? You see A reflects the topology of the network and
how much weight you are putting on the links, right? So different As will have different
Ps. So P summarizes the topology for you. You are going to see that the results
depend on this P. Okay?
And then I can optimize over P. I can ask questions like what P and for what A should I
choose to optimize performance relative to this or relative to that. Okay?
For example, these are just some examples. Let's assume that the combination
rule that I'm using is the averaging rule: I'm just assigning to each of my neighbors one
over the number of neighbors. I have three neighbors; each one of them I'm assigning
weight one over the number of neighbors.
In this case you can find explicitly what P is. It's the degree of node K -- how many
neighbors you have -- normalized by the total sum of the sizes of the neighborhoods.
Okay?
So for particular cases of combination weights we know exactly what this P is, okay?
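As a hedged sketch, here is how one might extract P numerically and check the averaging-rule formula on a small assumed graph:

```python
import numpy as np

# Perron vector of a primitive left-stochastic A (columns sum to one):
# the right eigenvector at eigenvalue 1, normalized to sum to one.

def perron_vector(A):
    vals, vecs = np.linalg.eig(A)
    p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return p / p.sum()                 # entries in (0, 1), summing to one

# averaging rule on a small assumed graph (self-loops included):
# node k assigns weight 1/n_k to each of its n_k neighbors
nbrs = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2, 3], 3: [2, 3]}
N = len(nbrs)
A = np.zeros((N, N))
for k, lst in nbrs.items():
    for l in lst:
        A[l, k] = 1.0 / len(lst)

p = perron_vector(A)
degrees = np.array([len(nbrs[k]) for k in range(N)])
print(p)                               # [0.3, 0.2, 0.3, 0.2]
print(degrees / degrees.sum())         # matches p for the averaging rule
```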
Fine. This is just an example. Okay? So what are the main results? Now, let me show
you two results. Okay?
Result number one, which I find very interesting, okay, the result number one says the
following. Remember that I'm trying to minimize this cost function, right? So technically
from calculus if I'm trying to minimize this function, I should converge to a point W that
makes the sum of the gradients equal zero. But the network is not doing that. The
network will converge to a point that makes the weighted sum of the gradient equal zero
and the weights are the step-sizes and the Ps. Okay?
So the network is converging to a different solution than the solution of this. What does
that mean? Okay. For example, if all the step-sizes are equal, and if all the Ps are
equal -- for example, the Metropolis rule here, a certain way to construct the
weights, leads to values of P that are all equal to one over the number of nodes. So all
of them are equal.
>>: Ideally the whole strategy [inaudible] rather if you perform optimization --
>> Ali Sayed: Yes.
>>: Whatever form of the distortion error with respect to this [inaudible], you design the
A to make that minimum.
>> Ali Sayed: You can.
>>: [inaudible].
>> Ali Sayed: May not. Yes.
>>: May not?
>> Ali Sayed: I'm just giving you an example that if P happens to be constant and mu
happens to be constant, then you are going to the point where the sum of gradients
equals zero -- you are going to that solution.
But what this is saying is that actually the interaction over the network is adding a
degree of freedom. Because you can design A, therefore you can design P, and therefore
you can steer your network to converge to different points. Okay? By choosing different
Ps and different step-sizes you can make your network converge to different points.
These points actually have an interpretation. They are called Pareto optimality points.
So let me show you. Okay. Assume I have two quadratic functions. Let's take the
mean squared error case, two quadratic functions. Node K has this quadratic function,
node L has that quadratic function.
If node K solves the problem on its own, it is going to converge to its minimum. If node
L solves the problem on its own, it only minimizes its own cost; it's going to converge to its
minimum. The aggregate cost will look like the red curve. So if the nodes work
together, okay, we would like them to converge to that minimum.
So what's happening now, okay, what's happening now, when you put them to work
together over a network, it's converging actually to what's called the Pareto optimal
solution. What's the Pareto optimal solution? It will converge to a point which is not
necessarily that minimum, right? Different from that minimum. And what point is that?
So the agents, okay, will converge to this point. And this point has the following
interpretation, okay? Or actually it can converge to any point in between: all the points in
between here serve as Pareto optimal solutions, including this one.
And what do those points have in common? They have the following property. Assume
the blue node wants to move to the left because it wants to get closer to its
minimum. If it moves to the left, then the other node's cost will
become worse. And in the same way, if the other one tries to move to the right to get
closer to its minimum, the blue node's cost will get worse.
So you can see a Pareto optimal solution is such that if an agent takes a selfish step, it
will hurt the others in the network. So for cooperative networks, right, over cooperative
networks, this kind of solution discourages that kind of behavior. Okay? So you
converge to a Pareto optimal solution, and this property will hold. Okay?
So by picking -- by picking A and therefore P, you can make the network converge to
different solutions according to your needs. Okay?
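As a hedged, worked illustration (the quadratic costs here are assumed, not from the slides): take scalar costs $J_k(w) = (w-1)^2$ and $J_\ell(w) = (w+1)^2$ with equal step-sizes. The network's stationary point solves the weighted condition

$$p_k \nabla J_k(w) + p_\ell \nabla J_\ell(w) = 0 \;\Longrightarrow\; 2 p_k (w - 1) + 2 p_\ell (w + 1) = 0 \;\Longrightarrow\; w = p_k - p_\ell,$$

using $p_k + p_\ell = 1$. Sweeping $p_k$ from 0 to 1 traces out the whole interval $[-1, 1]$ between the two individual minimizers -- exactly the set of Pareto optimal points -- and the uniform choice $p_k = p_\ell = 1/2$ recovers the minimizer $w = 0$ of the aggregate cost.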
And here I'm just showing you special cases, for example when the step-sizes are
equal, when the Ps are uniform, then you converge to the solution of the original
problem. But then you can ask yourself this question. If I want to converge to a
particular weighted solution of the gradients, how do I choose the Ps, and, therefore,
how do I choose the A? Well, that's the expression.
If you give me the weights that you want to use, I can use them to figure out what
combination matrix I should use to make it converge to that particular point that I'm
interested in.
>>: But you haven't proved so far that this optimum, after you choose this A, has
lower [inaudible] than the centralized --
>> Ali Sayed: No, no, this is not here yet. This is just showing that if I want to
[inaudible] a particular point, how do I construct the A to go to that point. So this is
answering that question, right? Okay. So this is one way of doing that.
The second -- the question you are mentioning is a different question. I'll mention it
later. Okay.
So let's go to the second result. The second result: for every node I have gradient
noise. I defined it as S, the difference between the gradient of the loss and the gradient of
the risk. Right? Remember. So this is the variance of the persistent noise, and this is
the Hessian matrix. Okay?
So with every node in the network I associate these two parameters now, okay? H and R.
Okay? And I define this quantity, which is the trace of H inverse times R, okay? This is more
or less like a signal-to-noise ratio if you think about it, because that's the noise power and
this is essentially the signal power. Okay?
So this is the second result. Don't worry about the equations but the intuition behind the
equations. This result has three components. And it's useful for the following reason.
The first component is saying if you train a network using these diffusion strategies, the
MSD at any node will be the MSD of the network, meaning all of them essentially will
converge to the same performance level. Even though some agents may be noisier
than other agents, after they cooperate in steady state all of them are going to converge
essentially to the same performance level. Right?
And when you average, therefore, you will get the MSD of the network. That's result
number one. Result number two: what performance level can I then expect, and how
does the topology influence it? Okay.
You see that the performance level depends on a weighted combination of the Hessian
matrices and the weighted combination of the noise across the agents. Okay? What's
important in this expression is that it's telling you how the topology influences
performance because the topology's reflected in those Ps.
So now, if you give me different topologies, I can compare them, and I can say which
topology has better performance and which topology will converge faster or slower,
because I also know how the convergence rate depends on the topology. Okay?
So these expressions are useful because now they allow me to compare different
networks against each other. Okay?
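Here is a hedged sketch of how such a comparison might be coded. The exact slide formula is not reproduced in the transcript; the shape used below -- a weighted combination of Hessians against a weighted combination of noise covariances -- is assumed from the description above:

```python
import numpy as np

# Assumed form: MSD ~ (mu/2) * Tr( (sum_k p_k H_k)^(-1) (sum_k p_k^2 R_k) ).
# This specific expression is an illustrative assumption consistent with
# the talk's description, not a transcription of the slide.

def network_msd(mu, p, H_list, R_list):
    Hbar = sum(pk * Hk for pk, Hk in zip(p, H_list))      # weighted Hessians
    Rbar = sum(pk**2 * Rk for pk, Rk in zip(p, R_list))   # weighted noise
    return 0.5 * mu * np.trace(np.linalg.solve(Hbar, Rbar))

# comparing two topologies then means comparing their Perron vectors p
M = 3
H = [np.eye(M) for _ in range(4)]                   # toy Hessians (assumed)
R = [s**2 * np.eye(M) for s in (0.1, 0.2, 0.1, 0.3)]
print(network_msd(0.01, [0.3, 0.2, 0.3, 0.2], H, R))
print(network_msd(0.01, [0.25] * 4, H, R))          # uniform-p alternative
```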
So coming back to -- I'll get to your questions. Now, I can ask this question. I want to
optimize this over the Ps, right, over the Ps. It turns out that the optimal solution is a
left-stochastic matrix, meaning the columns have to add up to one -- not a doubly
stochastic matrix, which is very common in the distributed optimization literature.
Doubly stochastic means columns add up to one and rows add up to one.
And intuitively it makes sense, you know. Because the weight I give you doesn't
necessarily need to be the same as the weight you give me. Okay? Because I may
think you have better information relative to me but not the other way around. Okay?
So the optimal solution turns out to be a left-stochastic matrix. Okay? For example,
when all the nodes are using similar step-sizes, we have an expression, okay, for
the optimal A. Now, if you use this optimal A and plug it back
into the MSD expression, you'll find out that that MSD expression is better than what
you would get for the centralized solution.
Now, this might sound puzzling. It's puzzling for the following reason. This is using
more information than the centralized solution. This is using the As. Okay? The
centralized solution is not giving weights to the data; it's just combining the data.
If you go back and modify the centralized solution to give weights to the data, then, of
course, it will be the same thing. Okay? But this one is using optimal weights that
depend on information that the centralized solution is not using.
>>: So can you have any [inaudible] example of an application where A is [inaudible]?
[brief talking over].
>> Ali Sayed: So you have to have an A. A is how you weight the data, right? And
this is telling you now how you pick A. This is one way of picking it -- we call this the
Hastings rule. I did mention it to you.
>>: If you take A to be just the average of the different nodes --
>> Ali Sayed: Yes.
>>: Are you going to get a worse result than --
>> Ali Sayed: Yes, you will get a worse result than this.
>>: But how does it compare with the centralized solution?
>> Ali Sayed: You know, I haven't done that computation, but -- I haven't done that
computation.
>>: Just to make sure that I understand the essence of this statement, what you're
saying here is if I know that different agents get noisier or less noisy data and I know the
noise level of different agents, I can use that when combining this information --
>> Ali Sayed: Exactly. To get better performance.
>>: To get better performance?
>> Ali Sayed: Yes.
>>: Okay.
>> Ali Sayed: Now, then -- that's a very good point. Now, the next question: how do I
know these noises? Well, you have to estimate the noises; you have to estimate
those statistics.
>>: Okay. But here in the construction you assume that you know these numbers.
But this is how you constructed the combination matrix.
>> Ali Sayed: Right. Then in a future slide I'll show you quickly -- we have ways of
estimating [inaudible], and then this makes it fully adaptive, okay?
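A hedged sketch of a noise-aware combination rule of the kind being discussed; this specific inverse-variance form is an assumption for illustration, not necessarily the exact rule on the slide:

```python
import numpy as np

# Each node k weights neighbor l inversely to l's (estimated) noise
# variance, normalized over the neighborhood so columns of A sum to one.
# In a fully adaptive scheme, sigma2 would be replaced by running
# estimates, as the speaker notes.

def noise_aware_weights(nbrs, sigma2):
    N = len(sigma2)
    A = np.zeros((N, N))
    for k in range(N):
        inv = np.array([1.0 / sigma2[l] for l in nbrs[k]])
        for l, a_lk in zip(nbrs[k], inv / inv.sum()):
            A[l, k] = a_lk             # noisier neighbors get smaller weight
    return A

nbrs = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2]}
sigma2 = [0.01, 0.09, 0.04]            # node 0 is cleanest (assumed values)
print(noise_aware_weights(nbrs, sigma2))
```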
Now, one final point before I show you this application, okay? So I -- so we know that
the optimal is left stochastic and now I know how performance changes with topology.
And then this is another point.
Now, assume I start adding more agents to my network. Is it better to have more
people solve the problem or fewer people solve the problem? Or is there an optimal
choice?
If you look at the expression for the convergence rate, the more nodes you have solving
the problem, this sum becomes bigger, because everything here is positive or positive
definite. So you are subtracting something bigger from one. This becomes smaller.
So the convergence rate improves. Okay?
So if you have more people helping you, you converge faster. But if you look at the MSD
expression, if you have a larger N, more nodes helping you, this expression doesn't
necessarily become smaller. It can become smaller, it can become larger, it can stay
invariant.
So it may happen that you have more people helping you and you converge faster,
but at a worse MSD level, a worse performance, yes.
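In hedged form, the generic shape of the two expressions being contrasted here (small constant step-sizes, strongly convex costs; the symbols $p_k$ for the Perron weights and $H_k$ for the Hessians are illustrative notation, not a transcription of the slide) is

$$ \text{rate: } \alpha \;\approx\; 1 - \lambda_{\min}\!\Big(\sum_{k=1}^{N} p_k\,\mu_k\,H_k\Big), \qquad \text{MSD} \;=\; O(\mu). $$

Adding a node adds a positive(-definite) term inside the sum, so $\alpha$ moves further below one and convergence speeds up; the $O(\mu)$ MSD expression, however, picks up both the extra data and the extra gradient noise, so it need not improve.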
>>: But when you measure the rate, when you add more agents, actually you're
getting more data per time interval, right? Because if I --
>> Ali Sayed: Across the network. Every node is still getting data disseminated. Yes.
>>: Right. But at the network level I'm getting --
>> Ali Sayed: Right.
>>: So the comparison, is it a valid comparison now, that -- my network now takes in
more data per time unit, so the convergence rate is kind of [inaudible].
>> Ali Sayed: No. Convergence here is defined as how long it takes to reach the
steady state.
>>: But it's not in terms of information units? Right? Because when you add more
agents, you actually add more information units.
>> Ali Sayed: That's right.
>>: [inaudible] time.
>> Ali Sayed: That's right. That's -- it's reflected in what you're saying as N.
You know, N is the amount of information that you have for every iteration.
>>: But you can also look at it, you can take a different take on it and say: assume that I
can get, you know, a certain number of items per iteration.
>> Ali Sayed: Yes.
>>: And I can either have many agents, each one of them getting one unit [inaudible].
>> Ali Sayed: Yes.
>>: Or I can have fewer than that, each one of them taking, you know, say 10 units of
information but able to share only once during every iteration.
>> Ali Sayed: Yes.
>>: So now the amount of information coming to the network is fixed.
>> Ali Sayed: Yes.
>>: And you're just saying which one will converge faster in terms of --
>> Ali Sayed: The scenario you describe is a very interesting one. I like that. But that's
different from what I'm -- because in your scenario, the rate at which the data is being
processed is changing. They can receive more data or less data over the same period
of time.
Here the rate is fixed at every -- yeah.
>>: So the question is whether the information rate is fixed per agent or
per the network.
>> Ali Sayed: Per agent.
>>: Per agent?
>> Ali Sayed: Yes. Okay. Yes.
>>: [inaudible] quickly on what [inaudible] is saying, it could even be a system design
thing. If I understood what you were saying: if I want to throw agents in, do I throw simple
agents --
>> Ali Sayed: Yes. Exactly.
>>: -- that can only send me one number a bunch [inaudible] or do I throw slightly
smarter agents that can tell me --
>>: Should I invest in a huge computer --
>>: Exactly.
>>: Yeah.
>>: Or should I just --
>> Ali Sayed: It's a very interesting question. I understand. Very interesting question.
>>: But this also [inaudible] because the more agents you bring in, the more freedom
you have to adjust.
>> Ali Sayed: No, but agents bring noise, gradient noise, in with them.
>>: I see.
>> Ali Sayed: So depending on the particular topology that [inaudible] it's not trivial
because they are coupled, okay? And these expressions reveal that, okay? And we
have a theorem that proves that. Okay? I'm going to skip that. Okay.
The final point I want to mention is the point about -- okay. So this is the diffusion
algorithm -- one of them, okay? You combine and adapt. And this is what is
popular, the consensus update that people use. Okay. You combine, you adapt. However,
the update is asymmetric here. Because of that, when you use constant step-sizes this
leads to instability, okay? And I would like to mention to you why this happens, okay?
If you are using decaying step-sizes, that's not an issue, but for adaptation and learning
it is an issue, okay? So asymmetry is a source of instability. And not only that:
diffusion will converge faster and attain better MSD levels, okay?
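For reference, in the LMS case the two strategies have the following standard shapes (a hedged sketch; $a_{\ell k}$ are the combination weights, $\mu_k$ the step-size at node $k$, and $\{d_k(i), u_{k,i}\}$ its streaming data):

$$ \text{consensus: } w_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, w_{\ell,i-1} \;+\; \mu_k\, u_{k,i}^{*}\big(d_k(i) - u_{k,i}\, w_{k,i-1}\big), $$

$$ \text{diffusion (ATC): } \psi_{k,i} \;=\; w_{k,i-1} + \mu_k\, u_{k,i}^{*}\big(d_k(i) - u_{k,i}\, w_{k,i-1}\big), \qquad w_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \psi_{\ell,i}. $$

The asymmetry is visible in the consensus line: the combination term uses $w_{\ell,i-1}$ while the gradient term uses $w_{k,i-1}$, so two different iterates enter the same update; in diffusion, the combination is applied to the already-updated intermediate estimates.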
And the reason for that is this. It's very simple, you know. You can show that the
mean error of the network -- if you collect the errors across the agents into a big
vector -- evolves according to a first-order equation like this, where in the
noncooperative case the coefficient matrix is a diagonal matrix.
In the diffusion case it will be that matrix multiplied by the combination matrix. In the
consensus case it will be a subtraction. Now, a result in matrix theory shows that if the
diagonal matrix is stable, meaning that the noncooperative case is stable, then no matter
what left-stochastic matrix you choose, this product will always be stable. This is a
matrix theory result. You can prove it.
However, a subtraction like this is not guaranteed to be stable, you know. Think about it.
If you have a scalar of magnitude less than one and you scale it by any convex
combination of numbers of magnitude at most one, the product will always still have
magnitude less than one. But this is not true when you are combining them additively.
And diffusion always brings the state in multiplicatively, in this manner, and not
additively, in that manner, okay? So this is the conclusion, okay?
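A two-node numerical illustration of this matrix fact, in Python; the diagonal matrix D plays the role of the stable noncooperative coefficient matrix (think D = I - mu*H per node), and its entries are made-up values for the demo:

    import numpy as np

    # Noncooperative coefficient matrix: diagonal and stable (all |entries| < 1).
    D = np.diag([0.3, -0.95])

    # A left-stochastic combination matrix: columns sum to one.
    A = np.array([[0.2, 0.7],
                  [0.8, 0.3]])

    B_diffusion = A.T @ D                   # multiplicative coupling
    B_consensus = A.T - (np.eye(2) - D)     # additive coupling

    print(max(abs(np.linalg.eigvals(B_diffusion))))   # ~0.38: stable
    print(max(abs(np.linalg.eigvals(B_consensus))))   # ~2.02: unstable

Since the rows of A.T sum to one, the product A.T @ D always has spectral radius at most the largest |entry| of D, which is less than one, matching the scalar intuition above; no such bound protects the additive form, which is the instability mechanism just described.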
So I want to come back to a point you mentioned. Okay. I skipped some points just to
finish on time. Okay?
For example, coming back to the issue: so now I know how to choose the A. I have an
expression for it. But that depends on certain parameters. How do I estimate the
[inaudible]? So the fact that you can also adapt the combination weights over time is
useful. For example, this is an application. I have a network and I have an intruder. So
if I can adapt the weights that I assign to my neighbors, I can learn over time that he is a
good neighbor, you are a bad neighbor, and over time cut, reduce, the weight I put on
your links so that the ultimate effect is essentially disconnecting you from the network,
right?
So we have done that. For example -- I am not going through this, but if you look at this,
this is exactly the algorithm you have here. You see, this is the combination, okay, and
this is the adaptation. And in between I'm computing these As in an adaptive manner,
because I'm learning over time who to trust more, who to trust less. This is just a
two-line algorithm. There are ways to do it which I'm not explaining, okay?
So these As can also be adaptive. You see I index them by i here, and I have a way of
doing it. When you do that, you can get performance like this. You see -- if you look at
the simulation, the counter there is running. The intruder is there. You are going to see
that when this network converges it's going to cut that intruder from the network. The
thickness of a line is the size of the weight assigned to the neighbors, to the links that it
trusts. You know.
So you can see that over time when you also adapt the weights, you can get behavior
like this over the network, okay?
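A hedged Python sketch of an adaptive combiner of this flavor: each node keeps a smoothed disagreement score per neighbor and weights neighbors by its inverse, so a consistently deviating neighbor, the intruder, sees its weight decay toward zero. The smoothing factor nu and the inverse-score rule are illustrative choices, not necessarily the exact update from the slides:

    import numpy as np

    def adapt_weights(psi, w_prev, gamma2, nu=0.1):
        # psi    : (num_neighbors, M) intermediate estimates received from neighbors
        # w_prev : (M,) this node's previous estimate
        # gamma2 : (num_neighbors,) running disagreement scores, updated in place
        gamma2[:] = (1 - nu) * gamma2 + nu * np.sum((psi - w_prev) ** 2, axis=1)
        a = 1.0 / gamma2                  # trust is inversely proportional to disagreement
        return a / a.sum()                # nonnegative weights that add up to one

    # Hypothetical use inside the diffusion loop, for one node with 3 neighbors:
    M, gamma2, w = 4, np.ones(3), np.zeros(4)
    for i in range(200):
        psi = 0.1 * np.random.randn(3, M)  # stand-in for the neighbors' estimates
        psi[2] += 5.0                      # neighbor 2 behaves like an intruder
        a = adapt_weights(psi, w, gamma2)
        w = a @ psi                        # combination step

After a few iterations the weight a[2] on the intruder's link shrinks toward zero, which is the "disconnecting" effect visible in the simulation, with the weight sizes playing the role of the line thicknesses.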
Anyways, I'm going to -- to conclude, just -- yes?
>>: An intruder is like an outlier or --
>> Ali Sayed: Yes, it could be an outlier or it could be a malicious person who is
feeding you wrong information, you know.
>>: So it's somehow inconsistent --
>> Ali Sayed: Exactly. Inconsistent --
>>: Consistently bad.
>> Ali Sayed: Consistently bad. Because the --
>>: [inaudible].
>> Ali Sayed: Yeah. Defective. The network learns, you know, by looking -- the network
learns by comparing what you are giving to what it has. The nodes can compare what
you are giving me to what I have, for example. And over time they learn from that
information whether you are a good neighbor or not, and I assign you more weight or less
weight. That's essentially the idea.
>>: [inaudible].
>> Ali Sayed: Sorry?
>>: It's malicious in a sense that if it's bad, it's bad all the time?
>> Ali Sayed: Yes.
>>: Because otherwise I can be an agent that is very nice, I will be cooperative
until, you know, the weight you assign to me is going to be high, and then I'm
going to --
>> Ali Sayed: Right. But again, remember, this is adaptive. It will continue learning.
Right? So after some time it will block you again, right?
>>: You can do short-term then.
>> Ali Sayed: Yes. You can do short-term then, right. But not long-term.
>>: It's the fact that you trust me that creates the larger damage, right?
>> Ali Sayed: Yes. Yes. Yes. Okay. Anyway. So anyway, what I tried to explain,
right: I tried to explain some of the issues that come up in the design of networks -- how
topology influences the performance of networks, how you design the topology, right,
to get better mean-square-error performance, better convergence rates. Some of the
advantages of doing things in this way, you know. So I tried to clarify some of the
technical details and answer some of your interesting questions.
So I'm going to stop here just to stay on time.
>>: Okay.
>> Ali Sayed: Okay. And if you have any further questions, I'll be glad to answer them.
>>: You said something about some practical applications that the [inaudible].
>> Ali Sayed: Yes.
>>: That you have encountered in the --
>> Ali Sayed: Yeah. Of course. Distributed optimization is always the high-level
application. Right? Many problems can be formulated as aggregate
optimization problems like this, right; distributed optimization is one example. For
example, in a paper we wrote recently, you want to have secondary users in
a cognitive radio system work together to estimate the spectrum, right? So that they know
where the holes in the spectrum are. So they want to work together cooperatively to
estimate the same -- the same quantity. That's one example where algorithms like this
can be useful.
In some other work we use algorithms like this to model behavior by biological
networks, the examples I showed at [inaudible] to try to reproduce that kind of behavior,
right? So like the simulation I showed in the first -- in the second slide, right? Okay.
>>: So for our [inaudible] that computer data center for example, exactly how difficult
[inaudible] -- how do you design topology [inaudible] each other in order to compute
certain things?
>> Ali Sayed: So that they share information. How should -- who should talk to whom.
And how much weight, how much relevance, you should give over these links so that
the combination is as good as it can be. Right? These are the kinds of questions that
you can answer.
>>: Just out of curiosity, have you given some thought to the game-theoretic aspects
of it? So for example, since you assume that every agent has its own objective, is it
always good for the agent to cooperate? Could there be settings in which the
agent could actually benefit from acting maliciously or something like that towards --
>> Ali Sayed: No, it's a very good question. It comes back to the point [inaudible]
mentioned earlier. What I talked about today is in the context of cooperative networks
where I assume everybody is willing to work together, right, to solve a problem of
common interest. Now, the problem you're talking about relates to let's say malicious
users or selfish users where there might be a cost for them to share information, they
may not want to share that information, or they may be selfish, they only want to share
information if they are going to benefit from it, you know. In some more recent work we
are pursuing, studying, these problems, and we have recently published a few
conference papers on that topic, where you add an additional cost to the cost you have
here that takes into account how much it's going to cost me to share this information.
Right?
And then that will involve -- it really introduces different dynamics into the learning of the
agents, you know. It's an interesting problem. But I haven't discussed it here. But it's
an interesting problem to --
>>: In practice it could be related to -- you have a communication link and the cost of --
>> Ali Sayed: Cost of communication.
>>: That link could be related to what you just --
>> Ali Sayed: Exactly. Because you assign a cost to that, yes.
>>: Yes. [inaudible] in Maryland actually have something similar to -- but they do
this malicious, have a --
>> Ali Sayed: Yes, yes, yes.
>>: Agents. And I think there was a comment I heard that this -- a lot of -- this
cooperative network can be treated as a special case of this game-theoretic way of
looking at the problem [inaudible] do you remember --
>> Ali Sayed: No, this is --
>>: Game theory. How game theory could apply to this as a special case.
>> Ali Sayed: No. Regarding work on malicious agents, selfish agents, lots of work in
this area has been done by Professor van der Schaar, also at UCLA, and we are
interacting with her along these lines. However, here I talked about cooperative
networks, right? Okay. So everybody is working together, okay?
Now, game-theoretic formulations take other aspects into consideration, where now,
you know, the exchange may be beneficial to one and not beneficial to the
other. It adds different dynamics. So this is not a special case of that, this is a different
way of looking at the problem. Okay?
And also this is an adaptive way of looking at the problem. Where networks have to be
learning all the time. Because the step-sizes are constant, they will not stop learning,
right? Okay.
So these formulations can be pursued, but they will lead to different kinds of solutions,
yes. Okay.
>>: How does it relate to this social learning? That's another way of looking at the
network that even is --
>> Ali Sayed: Right. Exactly. You know, actually, we do use the terminology
social learning. When we refer to the aggregation step, that's an example of a social
learning step. Because that's where you are interacting with your neighbors to learn
from them. You are interacting with your small community, which allows you to learn
from them.
So that's one example of social learning, you know. In this particular problem -- another
thing that's very important: in this work I did not assume anywhere information about the
underlying PDFs of the data, the probability distributions of the data. I'm working with
data all the time, right, with instances -- like we were talking earlier today: you have the
data, I'm just running my algorithms over the data; what kind of performance can I
guarantee? Right. In many of the works you mentioned, people need some
prior information about the distributions of the data to be able to come up, right,
with algorithms that use that information. Okay?
Here it's purely adaptive. Just you give me data, I tell you how to process the data and
what kind of performance on average I can guarantee for that data. Right?
>>: I see. So do you think this type of theory may have some applications in social
network analysis, when people pass information --
>> Ali Sayed: We have --
>>: [inaudible].
>> Ali Sayed: We have --
>>: I pass it to my friends.
>> Ali Sayed: Yes. We have. We have a conference paper from two years ago where
we applied some of these ideas in that context as well. This can also be meaningful
there. You know? Yeah, you have a question?
>>: Just -- just -- you were talking about the distribution. So the whole discussion here
was under the assumption that the data is stochastic and that it -- the data comes from
a distribution. It's not -- it's not --
>> Ali Sayed: Yes. Yes. Because if you look at the MSD expressions, the MSD
expressions, if you -- I mean, if we go back just a few slides. Okay. You see, this R
is the variance of the noise. To evaluate that, you need -- it depends on the PDF of the
data. So what these results are saying is this.
You apply the algorithm to the data, right? You only have data. You don't have these
quantities. But this is telling you what kind of performance you can expect on average.
Even over data that you haven't seen. Right? This is the kind of performance you can
expect. Okay?
>>: You assume that the data comes from a stochastic process which is --
>> Ali Sayed: Yes.
>>: So that the expectation can be defined.
>> Ali Sayed: Can be defined, yes. Yes. Yes.
>>: [inaudible] of data which is --
>> Ali Sayed: Yes. Yes. Yes.
>>: The earlier slide you -- I think you just wrote the quantity H, the signal power, as a
Hessian.
>> Ali Sayed: No, H is the Hessian of the cost function at the solution.
>>: I see. That's why it's second-order.
>> Ali Sayed: Second-order.
>>: So what does this [inaudible] the square --
>> Ali Sayed: No, no, no, think about it like this. If this is a quadratic cost, this will be
the covariance of the data. So that's why --
>>: [inaudible].
>> Ali Sayed: [inaudible] signal power. Okay. Of course, in the general context it may
not mean signal power. I'm just using that as a terminology.
>>: I see. Okay.
>> Ali Sayed: Okay?
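To spell out the quadratic case just referenced (standard MSE notation, assumed here rather than copied from the slides): for the cost

$$ J_k(w) \;=\; \mathbb{E}\,\big|d_k(i) - u_{k,i}\,w\big|^2, \qquad H \;=\; \nabla_w^2 J_k(w) \;=\; 2\,R_u, \qquad R_u \;\triangleq\; \mathbb{E}\,u_{k,i}^{*}\,u_{k,i}, $$

so the Hessian coincides, up to the factor of two, with the covariance, i.e., the power, of the input signal; for non-quadratic costs, H is the Hessian evaluated at the minimizer and the "signal power" reading is only an analogy.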
>>: Yeah. So, to relate to what we discussed in the morning: so if this -- when you
actually now have a more complicated model, for example, nonlinear.
>> Ali Sayed: Yes, yes.
>>: A function there in terms of the combination, not only your linear [inaudible] function
there -- what kind of change would the theory need in order to handle that type
of more complex model?
>> Ali Sayed: Of course. It needs to change, like we were talking, right? I need to --
first of all, if the problem is too general, if it's too nonlinear, right, you can only solve it in an
approximate manner, right?
>>: I see.
>> Ali Sayed: But the problem that you have, it appears to me that through some
transformations it could perhaps be brought close, right, closer to the models that we have. You
know? But if it is totally nonlinear, then, of course.
>>: The problem, though -- in many types of problems [inaudible] design the
nonlinearity.
>> Ali Sayed: You can design it.
>>: Yeah. It doesn't have to be fixed in the [inaudible] I told you about?
>> Ali Sayed: Yes. Yes. Yes.
>>: We are pretty much free. But on the other hand, [inaudible] have to be reasonable.
>> Ali Sayed: One --
>>: The analysis shows a certain type of nonlinearity will convert the problem into
something --
>> Ali Sayed: Yes.
>>: [inaudible].
>> Ali Sayed: Nonlinearity is [inaudible] that's a useful property. It will go back and
forth.
>>: It also has to have some other type of properties.
>> Ali Sayed: Yes.
>>: So it cannot -- there are some limitations here. We don't know what they are. But on
the other hand, if any specific nonlinearity is given, we can roughly assess what is suitable
for other applications. So that -- okay. And do you know of anyone in this area
dealing with nonlinearity?
>> Ali Sayed: You know, a similar problem, no. The first time I saw your model was now,
in our discussions, okay?
>>: Okay.
>>: This is out of curiosity. You mentioned -- this is very cool. You mentioned that, I think, an
important part of the adaptation [inaudible].
>> Ali Sayed: Yes.
>>: Which is how much I trust --
>> Ali Sayed: Yes. Yes. Yes.
[brief talking over].
>>: All that. Would there also be an advantage -- depending on what noise levels I
perceive, the way the As may be -- to adjusting the mus and having different step-sizes for
different sensors?
>> Ali Sayed: Very good, very good question, yes. You can do that. Now, as you can see
here, I already assumed the step-sizes are different. They depend
on the node.
>>: Oh.
>> Ali Sayed: So technically they are different. But what you're saying is let's also
adapt these step-sizes.
>>: Right.
>> Ali Sayed: And, yeah, that's an additional level that would be interesting also.
In the single-agent case, when you are just talking about a single adaptive [inaudible],
people have studied that extensively. But now you have dynamics that are coupled, right?
That would introduce -- so that would be an interesting thing to do. We
haven't done it yet. But that's an interesting thing to do as well: adapt --
>>: The traditional adapt --
>> Ali Sayed: Yes, yes.
>>: Tons of [inaudible].
>> Ali Sayed: On adapting the weights, yes. Yes. But [inaudible] are going to have this
coupling, you know, yes, and that would be -- yes.
>>: But [inaudible] is mu as a function of the node, and that's kind of hard to tune, right?
You have to tune the step-size.
>> Ali Sayed: The challenge with mu is that it has an influence on
the stability of the algorithm. This mu has to be small enough. You can see
performance always depends on mu. So even if you adapt it, you have to be careful not
to compromise the stability of the network. So, okay, that's a constraint you have. Okay?
>>: How do you decide mu as a function of the nodes? I mean --
>> Ali Sayed: No, no. Every node is using a different -- could use -- they
could all use the same mu or they could use different mus, right? There is no restriction
that they should be using the same mu, right? So that's why the results are in terms of
mus that change over the nodes.
>>: And the kind of --
>> Ali Sayed: Sorry?
>>: Any kind of [inaudible] mu.
>> Ali Sayed: Any kind of?
>>: Guideline.
>> Ali Sayed: Yes. Okay. We have conditions to ensure the stability. And these
conditions are in terms of bounds on this mu. Okay? So if you follow these conditions,
they will tell you, choose any mu as long as you don't violate these bounds, that will be
good enough.
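Conditions of this type typically take the following hedged shape in the MSE setting (the exact constants depend on the data model and are not reproduced from the slides):

$$ 0 \;<\; \mu_k \;<\; \frac{2}{\lambda_{\max}(H_k)}, $$

where $H_k$ is the Hessian at node $k$ (the input covariance, up to scaling, in the quadratic case). Any step-size inside the bound keeps node $k$'s own recursion stable, and by the left-stochastic product result above, stability of every individual node then carries over to the diffusion network for any choice of combination matrix.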
>>: You want it to be large enough so that you learn faster?
>> Ali Sayed: No, there's a compromise.
>>: Compromise.
>> Ali Sayed: Large enough -- large enough, it goes faster, you are right. But then you
get more noise. So it's -- then people -- you can study this question: what is an optimal
value of mu for the trade-off, right? An optimal --
>>: [inaudible].
>> Ali Sayed: Exactly. Yes.
>>: And what does your intuition tell you? In this more [inaudible] setting with multiple
agents, do you feel that a good algorithm to adapt the local mus could actually lead to
better performance?
>> Ali Sayed: It can, because -- I'm not saying it's easy, okay? For example, at the --
anyway. Okay. Let me put it this way. The mus can be adapted. It is not easy
because of the stability issues over a network, okay? Because these problems
compound. Because of the interactions, okay?
So you have -- and the stability condition is a very strict condition, okay? So you have
to take that into account, okay? So that's why it's not a simple thing to do, like when you
are just dealing with a single adaptive node. Now you are dealing with many nodes. So
whatever you do here will influence the other nodes, okay?
It is doable. It will influence performance, because the expressions here are showing
how it influences performance, right? For example, you can design the mus in such a
way that -- so this is what people usually do in the single-agent case, and you can do it
here, okay? You can use a larger mu initially to converge faster and then start using a
smaller mu when you are closer to steady state so that you track better. Things like that
can be done here as well -- as long as you satisfy the stability conditions.
Okay?
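A gear-shifting schedule of the kind just described, as a hedged sketch; the switch test, the threshold, and both step-size values are illustrative, and each must respect the stability bound discussed earlier:

    def gear_shift_mu(error_energy, mu_fast=0.05, mu_slow=0.005, threshold=1e-3):
        # Use the larger step-size far from steady state to converge quickly,
        # then the smaller one near steady state to reduce misadjustment.
        return mu_fast if error_energy > threshold else mu_slow

Here error_energy would be a smoothed estimate of the instantaneous error power at the node.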
>>: No problem.
>> Li Deng: Any more questions? Okay. Thank you very much.
[applause].
>> Ali Sayed: Okay. Thank you.