>> Li Deng: Okay. Hello, everybody. It's my great pleasure to have Professor Ali Sayed come here to give us this talk. This talk will give more depth to one of the keynotes that he gave recently at CAS on online learning and bio-inspired signal processing. I'm not going to go through all of his long, outstanding research career, except to say that the impact of Ali's work has shown in some of our recent summer intern work from his student, doing some very exciting work and having a tremendous impact on the kind of work that we have been doing over here. So without further ado, I will pass the floor to Professor Sayed. >> Ali Sayed: Okay. Thank you very much, Li, I appreciate it. Thank you very much also for coming. That's very nice of you. Okay. So relative to [inaudible] I'll have a few more minutes, so I can be a little bit more technical. But, of course, I don't have enough time to go deep into the technical aspects of the work. So I'll focus on what the results are and the intuition behind the results, and try to motivate them. And if you are interested in more details, I'll be glad to give references. You can e-mail me and so forth. Okay? So this is a talk on the general topic of learning and adaptation over networks: when you have a collection of agents that work together, and you would like them to work together in such a way that performance improves, so that they benefit from cooperation. Because it's not always true that you benefit from cooperation. But we would like to do it in a way that the network performance gets enhanced. And we would like the network to acquire the ability to learn and adapt in real time. So how do you achieve these kinds of features? I will be giving you the technicalities behind that. But before I get into that, let me first define what we mean by distributed signal processing, because this talk is in this general area of distributed signal processing. Okay? Of course, many definitions are possible. This is just one of them. For the purposes of this talk, it is good enough. Okay. So distributed signal processing deals with the extraction of some global piece of information from interactions among agents that are distributed over space and that can only interact locally with each other. That's a very broad definition of what distributed signal processing is. Over here I'm just showing one particular example in this simulation. In this simulation, you have a collection of agents. Each one of these agents has a noisy measurement of the distance to a target, and also a noisy measurement of the direction towards the target. So in principle, each one of them could try on its own to get there. At the same time, they also have a noisy distance to some dangerous area and noisy ideas of the direction towards that area, and by cooperating with each other they would like to avoid one area and converge to the other. So this is an example of distributed cooperation among agents sharing information with each other so that they can get a better estimate of where they want to go and a better estimate of what to avoid. Okay? >> Li Deng: So the information in this example is the distance -- >> Ali Sayed: Distance and direction. But these are noisy. So through cooperation -- some of these agents may know more information than other agents, right?
So, for example, the danger could be a shark following these agents, and the one that's closest to the shark may have better information about that particular element than the others. And through sharing and diffusion of information through this network, they can behave in a better way and avoid that particular danger. So you can see from this example -- this is just an example to motivate the idea -- that in distributed processing you have a global objective: the agents would like to work together to achieve something that's common to them, whether it is reaching a target or avoiding a certain domain. The processing has to be local. These agents cannot collect all the information they have, send it somewhere else, and then wait for the result. It has to be local. And the agents in general are dispersed, so they can only communicate with their immediate neighbors, okay? So these are features of the distributed solution that we are interested in. Now, why the interest in distributed signal processing? There is lots of work going on in this area. And there are many good reasons. I list several of them here. For example, there is the power of cooperation to mine data, right? Nothing better than putting many intelligent agents to work together -- even if each one of them has limited capabilities, when you put them to work together, something comes up, right? That's the power of cooperation. In addition, often the data is available in dispersed locations, right? That happens nowadays more and more often. You have lots of data that you would like to access and process. And doing that by using a collection of agents, each one doing its part and working together, will allow for better performance. The alternative would be for these agents to send the information they have to a central processor for processing, which then communicates back with the agents. In many situations the agents may not be comfortable doing that, for privacy issues, secrecy issues and so forth. Okay? So these are just some examples of why performing things in a distributed manner would be useful. You have a question? Yes? >>: I want to make sure that I understand. So the distributed way, is it because you think you get better results by using distributed systems, or is it because you're restricted to using distributed systems because, for example, the data is distributed? >> Ali Sayed: Right. That's a very good question. Both reasons can be justified depending on the situation. In some situations the data is already distributed, and for reasons I listed here you would like to process it in that manner. Okay? For example, if you send the data all the time to a central processor and back, you end up with an architecture where if the central processor fails, everything fails, whereas the distributed solution would be much more robust. And the second question becomes: if I solve these problems in a distributed manner, can I attain or match the performance of a centralized solution, with the added robustness? And that's what I will be showing you later. >>: Just to make sure. >> Ali Sayed: Yes. >>: So still, at least in theory, if we put aside the robustness thing, the centralized solution, if it was possible, would be better than the distributed solution? >> Ali Sayed: Yes. Yes. >>: Right? >> Ali Sayed: Okay.
It would be better for the same class of algorithms -- the class of stochastic-gradient algorithms that I will be discussing here. If you limit them to using the same kind of tools, that's not necessarily a correct statement, as I'm going to show. The distributed solution can do as well. And actually the distributed solution adds an additional degree of freedom that you can exploit. I'm going to show that, okay? >> Li Deng: So, related to this -- this morning we had a discussion since we both [inaudible]. >> Ali Sayed: Yes. Yes. >> Li Deng: So to your first question, for training data the [inaudible] can never do better than the function [inaudible]. So our hope is that it may help with the generalization, which we don't have any theory on. >>: So the reason why I'm puzzled is that on a centralized solution you can always simulate a distributed solution. >> Ali Sayed: I know. But what I'm saying is, there are many situations in practice where the centralized solution is not an option. >>: Right. >> Ali Sayed: You would like to do it in a distributed manner. If you can afford to solve it in a centralized manner, of course, do that. But now the question becomes: if I don't want to do that, for many reasons, how close can I get to it, and what are the advantages of doing so? That is what this talk will show, right? Okay. These are all valid questions, but there are reasons why you want to do things this way or that way. Okay? So, motivated by these considerations that I just mentioned, this is what we would like to do. We would like to design adaptive networks which consist of a collection of agents linked together. These links can change over time. Agents can turn on and off over time. I can give more or less weight to my neighbors, depending on whether I trust them more or less. So I would like to develop networks that behave in a truly adaptive manner. The network as a whole becomes an adaptive entity. The agents are adaptive. They are able to learn continuously. The agents cooperate with each other locally, right? Even the topology can change with time. The agents can also move -- biological networks are a good example of that. And on top of all of that, I would like this network to solve something meaningful, not just do this for the sake of doing it, okay? But when you try to solve problems like this, many interesting issues happen and many interesting observations come up. Why? Because you have many additional degrees of freedom now that you can control. For example, topology. Who should be connected to whom? You have many options to connect people with each other, right? So this influences the behavior in a certain way. So how can I exploit that degree of freedom to my advantage? How much weight should I give to the information I am receiving from [inaudible] as opposed to the information I'm receiving from Li? Maybe I trust one more than the other because one is noisy and the other is less noisy, right? How should I adapt those weights over time? Because over time one may become less reliable than the other. So there are now many degrees of freedom that I can control, you see? Some agents may fail, and yet you would like the network to continue to work: as long as it is connected and there are paths between agents, things should continue to work, even if some links fail, even if some agents fail and so forth. >>: [inaudible]. >> Ali Sayed: Yes?
>>: In [inaudible] applications they are dealing with -- how realistic is it to assume that information passing is just multiplying by a weight, or -- >> Ali Sayed: This is a model, of course. This is a simplified model, right? Like many other problems in engineering, we assume a model. It just turns out that this model is very useful. For example, many problems can be formulated as distributed optimization problems, and it turns out that the solution will involve a step of aggregation of the form I'm going to be showing you here, where you weigh data over the links in that manner. Okay? >>: So in the early example you gave [inaudible]. >> Ali Sayed: Exactly. Yes. >>: Is that kind of problem appropriate to be modeled by this kind of [inaudible]. >> Ali Sayed: Of course, we haven't performed biological experiments to check that, but there are studies -- and I give references in some of my papers -- by groups whose research is exactly devoted to modeling, for example, how fish behave, and they perform experiments on that. And they have models where they use aggregation rules like this, and they match theory to practice. So there is some justification, even experimental, that the kinds of things we do here could be an appropriate model. I'm not saying this models everything, but in some cases it is an appropriate model, okay? Anyway, so there are many degrees of freedom that we would like to control. But regardless of that, many issues come up that we are not used to from single-agent signal processing. Some issues come up that are not a direct extension of the intuition we have from single-agent or centralized processing, as you are going to see. So in order to appreciate that, let me first give you a brief review of classical results in stand-alone adaptation and learning. When you have a single agent that's trying to learn and adapt something, let's see what's known there, and then let's see, when you try to extend these results to the multi-agent case, what issues come up. Okay? And then you will appreciate more the richness of the distributed solution. So for the next few slides I'm giving you a very brief summary of what single-agent adaptation and learning is. In single-agent adaptation and learning we are often interested in problems like this. I have a cost function J of w, where w is some parameter vector I would like to estimate, like the one we were talking about. J is a cost function -- for example, a mean-square-error cost function. But it could be some other cost function, a logistic function or something much more general than that. In most problems of interest, the risk function is expressed as the expectation of some loss function, okay? For example, the mean square error is the expected value of some squared error. So the squared error is the Q function, the loss. And the expected value of that is the risk function, J. I'm introducing both notations because we are going to use them. So how do people solve a problem like this? Using a stochastic-gradient algorithm. For this particular talk, just to keep the technicalities simple, I'm going to assume that my cost function is convex. So you have a minimum and you would like to converge to that minimum. So a popular solution for that problem is the steepest-descent solution. What do you do?
You start from an initial guess and you iterate, moving along the negative direction of the gradient vector of J, your risk function. You evaluate the gradient vector at the current estimate, multiplied by a step-size mu of i. That's a positive step-size parameter that changes with time. If you keep repeating this, there are results that guarantee the following. That step-size is a decaying step-size, meaning its sum adds up to infinity but its energy -- the sum of its squares -- is finite. For example, one over i: it goes down to zero, but not fast enough for its sum to converge. That's what this condition is saying, okay? Any step-size sequence like that, one that goes down to zero but not too fast, guarantees the following conclusion: if you repeat this recursion, W I will go to the minimum that you are looking for, okay? These are classical results in stochastic approximation theory. You will get behavior like that. It will go down to the minimum. >>: But only for convex problems, right? >> Ali Sayed: For this problem, yes. For convex problems where you have a unique minimum. Okay. Now, what happens in practice is that very rarely do you have the function J, so you cannot compute the gradient of J. Why? Because J is the expectation of something. And for you to be able to know J, you need to know the statistical properties of the data, so that you can compute the expectation. I'm going to show you a specific example. So what people do is replace the gradient of J by the gradient of Q. Q depends only on the data. So the gradient of Q is an instantaneous approximation for the gradient of J. And that's why it's now called a stochastic-gradient algorithm: because you've replaced the true gradient that you want by an approximation for it. So that introduces an error. That error is called gradient noise. So when you use the stochastic-gradient algorithm, you always introduce an error into the algorithm, which I'm denoting by S, okay? But still, under some reasonable assumptions on this gradient noise -- for example, that it is zero mean on average; there are some reasonable technical assumptions on it -- you can still show that convergence occurs. That's very interesting: for decaying step-sizes, even if you replace the gradient of J by the gradient of Q, you can still get convergence and solve your problem. Okay? So let me show you an example so that you understand this better. Let's take a very simple example, a linear model -- different from your model, okay? Assume I have data D which is the output of some FIR channel. U is the input, the betas are the coefficients. So D is a combination of delayed versions of U plus noise. Right? This is a channel estimation problem formulation. You know the D and you know the U, so you know the input/output behavior, and you would like to estimate the parameters, the betas. Let's look at this. So that's the model. I'm going to collect my Us into a regression vector. You have M of them; just put them into a vector. Collect the unknowns into a vector. So I can use a more compact vector notation, because that sum becomes an inner product. This is just notational convenience, right? So, given D and U, I would like to estimate W. That's the problem. So one way to do it: given D and U, find the W that minimizes the mean square error. This is a classical problem in estimation theory, right?
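For reference, the recursions just described can be written compactly in the talk's notation, with $w^o$ the sought minimizer and $x_i$ the data available at iteration $i$:

$$w_i = w_{i-1} - \mu(i)\,\nabla_w J(w_{i-1}) \quad \text{(steepest descent)},$$

$$w_i = w_{i-1} - \mu(i)\,\nabla_w Q(w_{i-1};\, x_i) \quad \text{(stochastic gradient)},$$

$$\sum_{i=0}^{\infty} \mu(i) = \infty, \qquad \sum_{i=0}^{\infty} \mu^2(i) < \infty \;\;\Longrightarrow\;\; w_i \rightarrow w^o,$$

and for the channel-estimation example, $d(i) = u_i^{\mathsf T} w^o + v(i)$ with risk $J(w) = \mathbb{E}\,|d(i) - u_i^{\mathsf T} w|^2$ and loss $Q(w;\, d(i), u_i) = |d(i) - u_i^{\mathsf T} w|^2$.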
So you see that J is the expected value of the squared error, whereas Q is the squared error itself, right? So to find J, you need to know the moments of D and U. If you expand this, you need to know the variance of D, the variance of U, and the cross-correlation between D and U. Whereas for Q, if you just know the data and you have an estimate for W, you can compute it without knowing the statistical properties of the data, okay? So if you continue: risk on the left, loss on the right, okay. If you expand the J on the left, I need to know these quantities here, the moments R D U and R U. You need to know them, right? So if you write down the steepest-descent algorithm, you move along the direction of the gradient of J, and the gradient of J will involve the moments, which generally you don't know. If you instead replace the gradient of J by the gradient of Q, you move along a direction that's determined by the data. So now you can apply this algorithm, right? This algorithm only depends on the data. And if you are using a decaying step-size, under reasonable conditions on the gradient noise this will converge, right? So this is an example that illustrates these points. >>: Can I ask a question -- >> Ali Sayed: Yes. Yes. >>: To refresh my -- >> Ali Sayed: Yes. Yes. >>: If you are in a case where you actually happen to know the statistics -- >> Ali Sayed: Yes. >>: -- but just for fun you use the stochastic gradient, and you could have used both, does that mean that the stochastic gradient is likely to converge more slowly? >> Ali Sayed: It will be bumpier here. It will be bumpy, but it will converge. Eventually both will reach -- >>: But the average rate of convergence is likely to be the same as the -- >> Ali Sayed: Yes, it will be similar. The rate on average, okay? If you use the right one, on average the rate of convergence -- >>: So it oscillates, but it doesn't go -- [brief talking over]. >> Ali Sayed: On average, the rate of convergence will be the same. But it will be a bumpy ride until [inaudible]. >>: Okay. >> Ali Sayed: Okay. Fine. So this illustrates -- but now I want to mention one point which is important. If you use steepest descent, if you knew the gradient, that's the first line. If you don't know the gradient, that's the second line. The difference is an error, right? But look what's useful here: even if in the stochastic-gradient case W converges to W O, the gradient noise will never go down to zero. You will always have a term there coming from the measurement noise. That's a persistent noise that will always be present, okay? Even if the algorithm is able to converge. So that component that never disappears, I'm going to call its covariance R sub-S. That's the variance of that [inaudible]. Because the data are independent from the noise, that's the variance of the noise times R U. Okay? Just remember this fact, because I'll show you later why it's important. Even though you are getting convergence, there is persistent gradient noise influencing the algorithm, okay? Fine.
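As a rough illustration of the channel-estimation example just described, here is a minimal Python sketch of the LMS-type stochastic-gradient recursion with a decaying step-size; the filter length, signal model, and step-size sequence are made-up choices for illustration, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10                                  # filter length (hypothetical)
w_o = rng.standard_normal(M)            # unknown channel taps to be estimated
sigma_v = 0.1                           # measurement-noise standard deviation

w = np.zeros(M)                         # initial guess
for i in range(1, 20001):
    u = rng.standard_normal(M)                      # regression vector u_i
    d = u @ w_o + sigma_v * rng.standard_normal()   # noisy output d(i) = u_i^T w_o + v(i)
    mu = 1.0 / i                                    # decaying step-size: sum diverges, energy finite
    w = w + mu * u * (d - u @ w)                    # move along -grad Q (factor 2 absorbed into mu)

print(np.sum((w_o - w) ** 2))           # squared deviation ||w_o - w||^2; small after convergence
```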
Okay. This is another example, closer to some of the things you are doing. For example, I have data that belongs to one class or another, plus or minus one, and you would like to find the hyperplane that separates them, okay? There are many ways by which you can formulate this problem in a similar manner -- for example, by choosing the loss function to be, let's say, the logistic function. This is just another example to illustrate the idea in a different domain. You can write down the stochastic-gradient algorithm for that function: if you go back, take the gradient of Q, and write down the recursion, you get this algorithm, right, which is widely known. Now, you see that so far I have used a decaying step-size. The decaying step-size will allow W to converge. But what if you are dealing with a problem where this hyperplane is always changing, because the nature of the data is changing? Then you need an algorithm that is able to track changing models. That's why now I have to shift from a decaying step-size to a constant step-size. Because when you use a decaying step-size, by the time it goes down to zero, you are not updating anymore, and if your model changes, you are not able to track. Okay? So from now on -- by using a constant step-size you give the algorithm the ability to adapt. If the model changes, it will be able to continue to track. Okay? >>: May I ask a question? >> Ali Sayed: Yes. Yes. >>: Then in this case you might say that there's probably such a thing as an [inaudible] step-size to adapt to that particular rate, right? >> Ali Sayed: Yes. Yes. All these questions are relevant and they have been addressed, okay? >>: Okay. >> Ali Sayed: Very good. Now, there are issues. Let's continue. So now I remove the decaying step-size and replace it by a constant step-size. For example, if you repeat this for the channel estimation problem I mentioned before, the only difference is that the mu now is a constant mu. Right? So remember that even if this guy were to converge to W O, you still have a persistent gradient noise. And because this term never dies out, that gradient noise will always be there, influencing performance. So there is a price to pay, okay? In the decaying step-size case, the decaying step-size cancels that term, so the gradient noise, even though it is there, doesn't influence the convergence. But now you have a constant step-size. >>: Right. Okay. So [inaudible] condition of that to be squared to be less than zero? >> Ali Sayed: Squared? Which square? >>: [inaudible] there's a condition for the step-size. >> Ali Sayed: Yeah. No, those conditions are for decaying step-sizes. This is different now. So now you have a constant step-size, okay? So now you have adaptation -- you are learning, but you would like to be able to adapt as well. And this comes at a price. At what price? Let's see. To analyze at what price, let's define the error between what you want and what you have, right? In the decaying step-size case, that error goes to zero. In the constant step-size case, the variance of the error will go to some value, because W I will not go to W O. So we measure the size of the error by the mean-square deviation, the MSD, which is the variance of the error. It will go to some value. And it will reach that value at a certain convergence rate alpha. This is essentially giving you an idea of how many iterations it takes for you to get there, okay?
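The tracking point from a moment ago lends itself to a quick experiment. Here is a hypothetical Python sketch (all values invented) in which the optimal model drifts slowly as a random walk; the constant step-size keeps adapting, while the decaying step-size stops updating and loses track:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n_iter = 10, 20000
sigma_v, drift = 0.1, 0.002             # measurement noise and model drift (made-up)

w_o = rng.standard_normal(M)
w_dec, w_const = np.zeros(M), np.zeros(M)
for i in range(1, n_iter + 1):
    w_o = w_o + drift * rng.standard_normal(M)      # the model itself keeps changing
    u = rng.standard_normal(M)
    d = u @ w_o + sigma_v * rng.standard_normal()
    w_dec   += (1.0 / i) * u * (d - u @ w_dec)      # decaying step-size: updates die out
    w_const += 0.02      * u * (d - u @ w_const)    # constant step-size: keeps adapting

print(np.sum((w_o - w_dec) ** 2), np.sum((w_o - w_const) ** 2))
# Typically the constant step-size tracks the drifting model far better.
```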
So, for example, some results for this general case look like this, okay? The algorithm is up there; I'm repeating it. Constant step-size. This is the covariance of the noise that's persistent -- even when you converge, this noise will be there -- so call it R sub-S. And H is the Hessian matrix of your cost function, okay? These two quantities determine how much performance gain or degradation you are going to get. So the results in the general case look like this. How much MSD will I get in steady state? The error is not going to zero, but it will be proportional to the step-size. So if you use very small step-sizes, for all practical purposes this is a small error. So you can ensure that you will be fluctuating very close to the optimal solution you are looking for. And you will be converging at a certain rate, which depends also on the step-size and on the smallest eigenvalue of this matrix H, okay? But the important conclusion from here is that the price you pay is a degradation in the MSD only of the order of the step-size. If the step-size is small, that's good enough. Of course, a smaller step-size means slower adaptation: if mu is small, you converge slower, you adapt slower. Okay? Fine. So now let's get closer to networks. This was my brief overview of classical results. Now let's assume I have two nodes. Node one is getting data D and U -- I'm calling them D sub-1, U sub-1, because these are the data at node one -- and it's adapting. Okay? If you use the previous expression for the MSD in this case, you will find that the performance of this algorithm in steady state -- how close it gets to the solution -- is this much, okay? This expression is exactly the previous expression, now specialized to the quadratic case. M is the size of the filter, and that's the variance of the noise. If you have another node receiving data and doing the same thing, you will get a similar expression. But you can see that if the data at node two has a larger noise variance, sigma V squared, than the data at node one, then node two will perform worse. Even though both of them are trying to estimate the same channel, if the data here is noisier than the data there, then if they solve the problem on their own, one will perform worse than the other, right? This follows naturally from the results. Now, let's take the case where the nodes send their data to a fusion center, the centralized processor. And that fusion center is going to run a centralized stochastic-gradient algorithm of the same nature -- an LMS-type algorithm of the same nature as the previous nodes -- because I want to compare algorithms of the same class. Of course, the central processor could do something much more involved, but then that would not be a fair comparison. Okay? So if the central processor uses an LMS algorithm, what's different here? Because it has data from both nodes, it can average their gradients to get a better approximation of the gradient direction along which it should move. Fine. So what's the performance of this algorithm? If you use the expression I derived before -- because this is an LMS-type algorithm -- you'll find that it is a similar term, but now times the average of the noises. Okay? So when you have two nodes, one has noise at level one, the other has noise at level two, and the average will be some value in between. So you can see that if you do it in this manner, the performance will be better than one node's but not necessarily better than the other's. Right?
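In equations, the comparison just described has the following form. The talk quotes the proportionalities; the factor of one-half below is the usual small-step-size LMS convention, so treat the constants as assumptions:

$$\mathrm{MSD}_k \;\approx\; \frac{\mu\, M\, \sigma_{v,k}^2}{2} \quad (\text{node } k \text{ on its own}), \qquad \mathrm{MSD}_{\mathrm{cent}} \;\approx\; \frac{\mu\, M}{2}\cdot\frac{\sigma_{v,1}^2 + \sigma_{v,2}^2}{2},$$

so the fusion center sits at the average of the two noise levels: better than the noisier node, but not necessarily better than the cleaner one.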
And the question is: can I do this in a different way, in a distributed way, such that I can take advantage of how they share information and ensure that the performance will be better relative to both -- that when they share information, they perform better than node one and better than node two? Okay? You had a question? >>: Do you assume that they both get the same amount of data? >> Ali Sayed: Yes. At every time I, the fusion center gets -- >>: So they have a single clock covering the -- >> Ali Sayed: Yes. At the same time I, the nodes send the data to the fusion center and it runs an LMS-type algorithm. And the performance will actually be the average of the performances. And for that case it will be better than some and worse than others. Okay? So now let's take it to a distributed domain. I will first discuss the case of two nodes and then I'll go to a network. Okay? So the first step is fine. But what if I don't want to use a centralized solution -- I want to use a distributed solution where, let's say, the two nodes talk to each other, meaning they share some information? So, for example, node one sends its estimate to node two. Node two is going to scale it and combine it with its own estimate. And also node two is going to send an estimate to node one. Look at the first equation. Node one is going to perform the update just like before, but it's going to call the result an intermediate value, psi one -- not its updated estimate, but an intermediate value psi. Node two is doing the same thing; it's going to call its result an intermediate value, psi two. Now the nodes share the psis. So node one keeps its own psi, gets the psi from node two, and combines them in a convex manner. Node two does the same thing: it gets psi one from node one and combines it in a convex manner with its own. Right? That's the only additional thing I did. Each one of them is still running an LMS-type update. But in addition they share -- okay, this is what I think the estimate is, this is what you think the estimate is -- they share their local information to get a better estimate of what it should be, and then they use this for the next iteration and continue from there. >>: [inaudible] processor -- >> Ali Sayed: No, it's not doing this. This is -- >>: No, no. Which part of the processing node does that average -- >> Ali Sayed: No. Each average is done at the node. You see, at node one, because they share -- node one does this and this, node two does this and this. >>: Okay. >> Ali Sayed: Right? So you have adaptation, aggregation, adaptation, aggregation. Right? What performance can you expect from a system like this? Okay? So here are questions that are useful to ask. Can the coefficients be chosen to ensure that the MSD performance each one of them gets is better than what they used to get when they were doing it on their own? The centralized solution only ensures that you will do better than one of them. But can I do better than both? And the answer to this is yes. Okay?
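Here is a small Python sketch of this two-node adapt-and-combine scheme. The noise levels and the convex weights alpha and beta are invented for illustration (they simply favor the cleaner node), not the optimized values the talk alludes to:

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu, n_iter = 10, 0.01, 5000
w_o = rng.standard_normal(M)
sigma_v = [0.1, 0.4]                 # node 2 is noisier (hypothetical values)
alpha, beta = 0.8, 0.8               # weight each node places on node 1's estimate

w = [np.zeros(M), np.zeros(M)]
for i in range(n_iter):
    # Adaptation: each node runs its own LMS-type update to an intermediate psi.
    psi = []
    for k in range(2):
        u = rng.standard_normal(M)
        d = u @ w_o + sigma_v[k] * rng.standard_normal()
        psi.append(w[k] + mu * u * (d - u @ w[k]))
    # Combination: each node takes a convex combination of the shared psis.
    w[0] = alpha * psi[0] + (1 - alpha) * psi[1]
    w[1] = beta  * psi[0] + (1 - beta)  * psi[1]

print([float(np.sum((w_o - wk) ** 2)) for wk in w])   # per-node squared deviations
```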
>>: I'm confused. >> Ali Sayed: Yes? >>: Why can the centralized one not do better than either -- >> Ali Sayed: It is using a stochastic algorithm, the stochastic algorithm that I'm showing you here, an LMS algorithm of a similar nature, simply averaging the data. Now, you can ask this question -- I will mention this later, okay? It cannot do better -- okay. The reason why this one can do better is because you have an additional degree of freedom, which is these coefficients alpha and beta that I'm using now, which I'm going to design. >>: [inaudible] noise levels are different -- >> Ali Sayed: Right. These alphas and betas will correct for that. >>: [inaudible] are the same this would be as good -- >> Ali Sayed: As good as centralized. Okay? Because here you are not using alpha and beta. However, I can incorporate weighting into this solution. I can say: I'm going to weight the data from node one by a certain alpha, the weight from -- you understand? Okay. >>: So you have the question down the road to -- >> Ali Sayed: Down the road, yes. [brief talking over]. >>: Beta and then you -- >> Ali Sayed: Right. I'm just getting there. I'm just moving through a simple example to get to that point. Does that answer your question? >>: I guess. So I mean, you should be able to do -- I mean, you have twice as much -- >> Ali Sayed: Yeah, yeah. If you modify this -- one of the final conclusions is that it will tell you how you should modify this if you want to match the distributed solution, right? But what I'm trying to say is that the distributed solution has additional degrees of freedom that you can exploit, and it can match the best you can do with a stochastic-gradient algorithm here. So you are not losing anything -- you are not limiting yourself by doing things in a distributed manner, okay? That's one of the conclusions. Okay? Fine. So, can the performance match? The answer will be yes. And here is another interesting question. Node one can run LMS and estimate the channel. Node two can run LMS and estimate your parameter vector. Does it always follow that if I put them to work together like this, the combination will always be able to solve the same problem? And this is not true. This is not true unless you do it right, okay? For example -- these are results I'm going to show you later -- if you use consensus strategies, in comparison to the strategies I'm going to introduce here, which are based on diffusion, this can happen: the individual nodes can be stable, but when they work together, the network becomes unstable, for reasons I'm going to display to you. So you have to do it right. You cannot just combine them any which way. Okay? >>: But there is an additional [inaudible] that I just wonder whether this [inaudible] can exploit. That is, when you combine, when you aggregate -- in other words, can you do it asynchronously, the way you are doing it synchronously? >> Ali Sayed: So far it is synchronous, yes. >>: Does asynchronous give you additional -- >> Ali Sayed: It gives you additional -- I will not be talking about asynchronous network behavior here. But that adds another degree of freedom that you can exploit. It's not discussed here, but that's a good point too, okay? Now, let's go to networks. I just talked about two nodes. Now assume you have a collection of nodes -- five, or ten, or a thousand. There are many different ways by which these nodes can be connected to each other, not just the alpha and beta I showed you before. You can define many different neighborhoods for each node, okay? So if I give you a collection of nodes and many different topologies, which topology is better? That's a valid question, right? Does it matter which topology I use? Will all of them converge to the same solution, at the same rate, and give me the same performance? Okay.
These nodes are going to combine the data they share over the links by using certain combination weights. Does it matter what combination weights I use? For example, if I'm a node and I have three neighbors, I can give one-third to each one of them, or I can give more weight to him and less weight to you. Does it matter how I combine the data? So these are valid questions that now arise at the network level, okay? So let me show you how we answer questions like this. >>: May I ask a -- >> Ali Sayed: Yes. >>: -- question that in practice might be important. >> Ali Sayed: Yes. >>: I'm pretty sure you've thought about that -- in many cases there is a nontrivial cost of sending data between sensors. >> Ali Sayed: Yes. >>: So one question could be: what's the minimum number of edges between those agents that would give a certain level of [inaudible] performance? >> Ali Sayed: That's a very good and very valid question. In this particular presentation I'm assuming that they can share information at no cost. In some other work that we have done, we assume that there is a cost involved in sharing information. In that case, some agents may decide to share, or may decide not to share, because that cost may be attractive or not attractive. And that involves other dynamics. It's a very good question, but it's not in this particular talk, okay? >>: Can you just look at it like it's a matrix of weights and some of them have to be zero? >> Ali Sayed: Here? >>: Yes. >> Ali Sayed: Yes. That's coming next, okay? You mean for talking to each other? >>: Yes. Right. >> Ali Sayed: Yes. Okay. So I will get there. Okay? So to answer these questions, I need to know how topology and combination policy influence performance, and I need to perform that analysis. Now, the challenge here is that you have many agents, and they interact with each other. Whatever happens here is going to influence other agents. So they are coupled together, okay? That's the challenge you face here relative to studying single-agent systems. Nevertheless, we can get some results. So let me show you the results now -- a summary of the results, of course. So I have a network. Let's go to the network level. I have a network and agents connected. Each agent has a cost function associated with it. I'm calling J sub-L the cost function associated with agent L. It could be the mean-square-error cost. All of them could have the same cost -- in machine learning applications that's common; all of them are trying to minimize the same cost. But for generality, let's assume that they all have different costs. And I would like to minimize an aggregate cost function across the network, subject to some constraints, because some agents may know something about the solution that other agents may not know, right? So you can pose a constrained problem like this, okay? But a problem like this you can transform into an approximate problem, good enough, that is unconstrained and looks like this. I'm going to focus on the unconstrained version here, okay? >>: [inaudible]. >> Ali Sayed: Convex. Yes. Subject to convex constraints. Okay. So this is good enough for this presentation. I have an aggregate cost the network would like to work together to optimize. Find the minimum. Okay.
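Written out in the talk's notation, the network objective just described takes the form below (the constrained version adds convex constraints on $w$, which the talk replaces by an approximate unconstrained problem of the same form):

$$\min_{w}\;\; J^{\mathrm{glob}}(w) \;=\; \sum_{\ell=1}^{N} J_\ell(w), \qquad J_\ell(w) \;=\; \mathbb{E}\, Q_\ell(w;\, x_\ell),$$

where $J_\ell$ is the risk at agent $\ell$ and $Q_\ell$ its loss evaluated on that agent's data $x_\ell$.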
And each cost, like I said, is the expected value of a loss function, okay? So this is one algorithm. We developed this algorithm years back, around 2006, and since then many colleagues have studied it and contributed to the area as well. This is how the algorithm works. If you look at the first line, each node K is going to perform an LMS-type update, a stochastic-gradient update. W K, I minus one is the estimate at node K at iteration I minus one. That node is going to move along the negative direction of its loss gradient to find an intermediate estimate. And then, in the second step, node K is going to combine, in a convex manner, using combination weights A -- those are the alphas and betas from before, but now I call them A -- over the neighborhood of node K. So if I have five neighbors, I'm going to use combination weights A to combine the estimates of my neighbors. So every node in the network is applying a strategy like this. Of course, I'm not showing you how we arrive at this strategy; there is a formal way of arriving at it. I'm just showing you the final result, just like I showed you for the two-node case, okay? But the nice thing is that the analogy extends, right? The first step I call adaptation. The second step I call combination, aggregation. That's why we call this the adapt-then-combine structure: you first adapt, and then you combine. There is a variation of this where you first combine and then adapt. Now, I want to mention two observations which are important. Observe that in the adaptation step every agent is starting from where it is, from its current state, its current estimate. So it is the same W here, and the loss is evaluated at the same W. Okay? That's number one. And the step-sizes are constant, okay? The second variation is where you combine first: first I combine the estimates that exist at my neighbors, and then I perform the adaptation from that intermediate state that I found, right? So this is the other version of the algorithm, where you combine, then you adapt. Okay? >>: Which one's better? I read the paper. >> Ali Sayed: Yes, ATC has some advantages, okay? Because think about it. What ATC is saying is: I'm node K, let me first think about the problem, adapt based on my data -- and everybody else is also doing its own adaptation and learning -- and then we combine. It's like an exam problem where students can solve the problem cooperatively: first they think, and then they share the information, right? Whereas here they first share the information, before thinking, and then they think, okay? You can actually show analytically that ATC has some advantages in terms of convergence rate and mean-square performance.
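Here is a compact Python sketch of this adapt-then-combine (ATC) diffusion strategy over a small network, using a quadratic (LMS-type) loss at every node. The topology, combination weights (the averaging rule), and noise levels are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, mu, n_iter = 8, 5, 0.01, 5000
w_o = rng.standard_normal(M)                    # common model all agents estimate
sigma_v = rng.uniform(0.1, 0.5, size=N)         # hypothetical per-node noise levels

# Ring topology with self-loops; A is left-stochastic (each column sums to one).
A = np.zeros((N, N))
for k in range(N):
    nbrs = [(k - 1) % N, k, (k + 1) % N]
    A[nbrs, k] = 1.0 / len(nbrs)                # averaging rule: 1/|N_k| per neighbor

W = np.zeros((N, M))                            # row k is node k's current estimate
for i in range(n_iter):
    # Step 1 (adapt): each node moves along its instantaneous negative gradient.
    Psi = np.empty_like(W)
    for k in range(N):
        u = rng.standard_normal(M)
        d = u @ w_o + sigma_v[k] * rng.standard_normal()
        Psi[k] = W[k] + mu * u * (d - u @ W[k])
    # Step 2 (combine): node k convexly mixes its neighbors' intermediate estimates,
    # w_k = sum_l a_{lk} psi_l, i.e. row k of A^T @ Psi.
    W = A.T @ Psi

print(np.mean(np.sum((W - w_o) ** 2, axis=1)))  # network MSD (average over nodes)
```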
Now, the third form of the algorithm that I would like to show, which is also very popular in the distributed optimization literature, is the consensus strategy. It is similar in form to the second diffusion strategy I showed, where you combine first and then adapt. There it is: you combine first and then adapt. However, there is one critical difference. In consensus, when you adapt, you start from the intermediate estimate, but the loss function you still evaluate at where you were. In diffusion that's not the case: diffusion is always symmetric, okay? And it turns out that this asymmetry is what causes instability when you are trying to use networks to perform adaptation and learning. We have shown in a paper published in TSP in 2012 that the state of consensus networks can grow unbounded because of that. Okay? That's for the case of constant step-sizes. If you are using decaying step-sizes, this is not an issue. But if you are using constant step-sizes, things accumulate over time, and then instability can occur. Okay? So let's continue with the diffusion solution. Fine. So I have a network, I'm trying to minimize an aggregate cost function, and I have a class of algorithms, the diffusion strategies, that I can use to solve the problem. So how well does this network perform, and how does the performance depend on how I connect the agents and on what combination weights they use? Okay? Now, I have many nodes. So for each node I'm going to define an error, W tilde K, I, which is W O minus W K, I. That's the error at node K. I'm going to define the variance for node K, and the network variance as the average of those, just like I did for the single-agent case. So for every node I have its performance, and I have the average performance. So what are the values of these? Let me show you the results. But first, some notation. Remember how every node combines. Node K has neighbors -- say 2, 6, 7 and L -- and it combines the data from these neighbors using coefficients A. I'm going to collect these coefficients into a matrix A. So this matrix A, like Phil was saying, is a matrix which is N by N, and the (L, K) entry is the weight that agent K is using to weigh data arriving from agent L. That's what it means. Okay? >>: So in matrix [inaudible] are separate? >> Ali Sayed: W is what you want to estimate. That's the parameter you want to estimate. The As are the combination coefficients the agents are using -- that's a free parameter that you design. You can choose the As differently, right? Because you have N agents, you will have a matrix that's N by N. It has the combination weights: if you look at any particular column, you'll have the weights that that particular node is using to combine the information from its neighbors. Okay? Because these combination weights are convex combination weights, this matrix A turns out to have a useful property: the entries on each of its columns add up to one. It's a left-stochastic matrix. And because I'm assuming that I have a connected network with at least one self-loop -- at least one agent should trust its own data, so at least one nonzero element on the diagonal of A exists; A K K for some agent K should be positive -- this is called a primitive matrix. Okay? For matrices like this, you have this interesting result, which follows from the Perron-Frobenius theorem. It says this: A has the all-ones vector as a left eigenvector, from this relation, with eigenvalue one. But there is also a right eigenvector, which I'm going to call P, and I'm going to normalize it so that its entries add up to one. What the theorem guarantees is that all other eigenvalues of A are strictly inside the unit circle. There is only one eigenvalue at one, with eigenvector P. And all the entries of P are between zero and one, and they add up to one. So the entries of this right eigenvector P have the interpretation of a probability measure, right? They add up to one, and all of them are between zero and one. So P reflects the topology, right?
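These Perron-Frobenius facts are easy to check numerically. Below is a quick Python verification with a made-up 3-agent left-stochastic, primitive combination matrix:

```python
import numpy as np

# Hypothetical combination matrix: columns sum to one (left-stochastic),
# self-loops on the diagonal, connected 3-node network (so A is primitive).
A = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])

vals, vecs = np.linalg.eig(A)
k = np.argmin(np.abs(vals - 1.0))            # locate the eigenvalue at one
p = np.real(vecs[:, k])
p = p / p.sum()                              # Perron vector, normalized to sum to one

print(np.allclose(np.ones(3) @ A, np.ones(3)))   # all-ones vector is a left eigenvector
print(p, np.allclose(A @ p, p))                  # A p = p; entries of p lie in (0, 1)
print(np.sort(np.abs(vals)))                     # all other eigenvalues strictly inside unit circle
```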
You see, A reflects the topology of the network and how much weight you are putting on the links, right? So different As will have different Ps. So P summarizes the topology for you. You are going to see that the results depend on this P. And then I can optimize over P. I can ask questions like: what P, and therefore what A, should I choose to optimize performance relative to this or relative to that? Here are just some examples. Let's assume that the combination weights I'm using follow the averaging rule: I'm assigning to each of my neighbors one over the number of neighbors. If I have three neighbors, each one of them gets weight one-third. In this case you can find explicitly what P is: its entry for node K is the degree of node K -- how many neighbors it has -- normalized by the sum of the sizes of all the neighborhoods. Okay? So for particular choices of combination weights we know exactly what this P is. Fine, this is just an example. Okay? So what are the main results? Let me show you two results. Result number one, which I find very interesting, says the following. Remember that I'm trying to minimize this aggregate cost function, right? So technically, from calculus, if I'm trying to minimize this function, I should converge to a point W that makes the sum of the gradients equal zero. But the network is not doing that. The network will converge to a point that makes a weighted sum of the gradients equal zero, where the weights are the step-sizes and the Ps. Okay? So the network is converging to a different solution than the solution of this. What does that mean? For example, if all the step-sizes are equal and all the Ps are equal -- the Metropolis rule, a certain way of constructing the weights, leads to values of P that are all equal to one over the number of nodes, so all of them are equal -- then you recover the original solution. >>: Ideally the whole strategy [inaudible] rather if you perform optimization -- >> Ali Sayed: Yes. >>: Whatever form of the distortion error with respect to this [inaudible] you decide that A to make that minimum. >> Ali Sayed: You can. >>: [inaudible]. >> Ali Sayed: Or may not. Yes. >>: May not? >> Ali Sayed: I'm just giving you an example: if P happens to be constant and mu happens to be constant, then you are going to the solution where the sum of gradients equals zero. But what this is saying is that the interaction over the network is actually adding a degree of freedom. Because you can design A, therefore you can design P, and therefore you can steer your network to converge to different points. By choosing different Ps and different step-sizes you can make your network converge to different points.
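In equations, result number one says the network's limit point $w^\star$ satisfies a weighted stationarity condition (talk's notation; the averaging-rule expression for $P$ is the one just quoted):

$$\sum_{k=1}^{N} \mu_k\, p_k\, \nabla_{w} J_k(w^{\star}) \;=\; 0, \qquad p_k \;=\; \frac{n_k}{\sum_{\ell=1}^{N} n_\ell} \;\;\text{(averaging rule, with } n_k = |\mathcal{N}_k| \text{ the degree of node } k\text{)},$$

and when the $\mu_k$ are uniform and $p_k = 1/N$ (for instance with Metropolis weights), this reduces to the ordinary condition $\sum_k \nabla_w J_k(w^\star) = 0$ for the aggregate cost.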
These points actually have an interpretation. They are called Pareto optimal points. So let me show you. Assume I have two quadratic functions -- let's take the mean-square-error case. Node K has this quadratic function, node L has that quadratic function. If node K solves the problem on its own, minimizing only its own cost, it is going to converge to its own minimum. If node L solves the problem on its own, it's going to converge to its own minimum. The aggregate cost will look like the red curve. So if the nodes work together, we would like them to converge to that minimum. Now, what's happening when you put them to work together over a network is that they converge to what's called a Pareto optimal solution. What's a Pareto optimal solution? The network will converge to a point which is not necessarily that minimum, right? Different from that minimum. And what point is that? The agents can converge to any point in between: all the points in between here serve as Pareto optimal solutions, including this one. And what do those points have in common? They have the following property. Assume the blue node wants to move to the left because it wants to get closer to its own minimum. If it moves to the left, then the other node's performance will become worse. And in the same way, if that node tries to move to the right to get closer to its own minimum, this node's performance will become worse. So you can see a Pareto optimal solution is such that if an agent takes a selfish step, it will hurt the others in the network. So over cooperative networks, this kind of solution discourages that kind of behavior. Okay? So you converge to a Pareto optimal solution, and this property will hold. So by picking A, and therefore P, you can make the network converge to different solutions according to your needs. And here I'm just showing you special cases: for example, when the step-sizes are equal and the Ps are uniform, you converge to the solution of the original problem. But then you can ask yourself this question: if I want to converge to a particular weighted solution of the gradients, how do I choose the Ps and, therefore, how do I choose the A? Well, that's the expression. If you give me the weights that you want to use, I can use them to figure out what combination matrix I should use to make the network converge to that particular point that I'm interested in. >>: But you haven't proved so far that this optimum, after you choose this A, has the lower [inaudible] than the centralized -- >> Ali Sayed: No, no, that is not here yet. This is just showing that if I want to converge to a particular point, how do I construct the A to go to that point. So this is answering that question, right? Okay. So this is one way of doing that. The question you are mentioning is a different question. I'll mention it later. Okay. So let's go to the second result. The second result: for every node I have gradient noise, which I defined as S, the difference between the gradient of the loss and the gradient of the risk. Remember? So this is the variance of the persistent noise, and this is the Hessian matrix. So with every node in the network I now associate these two quantities, H and R. Okay? And I define this quantity, which is the trace of H inverse R. This is more or less like a noise-to-signal ratio, if you think about it, because that's noise power and this is essentially signal power. Okay? So this is the second result. Don't worry about the equations; worry about the intuition behind them. This result has three components, and it's useful for the following reason. The first component is saying that if you train a network using these diffusion strategies, the MSD at any node will be the MSD of the network, meaning all of them will essentially converge to the same performance level. Even though some agents may be noisier than other agents, after they cooperate, in steady state all of them are going to converge essentially to the same performance level. Right? And when you average, therefore, you will get the MSD of the network.
That's result number one. Result number two: what performance level can I expect, then, and how does the topology influence it? You see that the performance level depends on a weighted combination of the Hessian matrices and a weighted combination of the noise across the agents. Okay? What's important in this expression is that it's telling you how the topology influences performance, because the topology is reflected in those Ps. So now, if you give me different topologies, I can compare them, and I can say which topology has better performance and which topology will converge faster or slower, because I also know how the convergence rate depends on the topology. So these expressions are useful because they allow me to compare different networks against each other. Okay? So coming back -- I'll get to your questions. Now I can ask this question: I want to optimize this over the Ps. It turns out that the optimal solution is a left-stochastic matrix, meaning the columns have to add up to one -- not a doubly stochastic matrix, which is very common in the distributed optimization literature. Doubly stochastic means columns add up to one and rows add up to one. And intuitively this makes sense, because the weight I give you doesn't necessarily need to be the same as the weight you give me. I may think you have better information relative to me, but not the other way around. Okay? So the optimal solution turns out to be a left-stochastic matrix. For example, when all the nodes are using similar step-sizes, we have an expression for the optimal A. Now, if you use this optimal A and plug it back into the MSD expression, you'll find that that MSD is better than what you would get for the centralized solution. Now, this might sound puzzling. It's puzzling for the following reason: this is using more information than the centralized solution. This is using the As. The centralized solution is not giving weights to the data; it's just combining the data. If you go back and modify the centralized solution to give weights to the data, then, of course, it will be the same thing. Okay? But this one is using optimal weights that depend on information that the centralized solution is not using. >>: So can you give an example of an application where A is [inaudible]. [brief talking over]. >> Ali Sayed: So you have to have an A. A is how you weigh the data, right? And this is telling you how to pick A. This is one way of picking it -- we call this the Hastings rule. I did mention it to you. >>: If you take A to be just the average over the different nodes -- >> Ali Sayed: Yes. >>: Are you going to get a worse result than -- >> Ali Sayed: Yes, you will get a worse result than this. >>: But how does it compare with the centralized solution? >> Ali Sayed: You know, I haven't done that computation. >>: Just to make sure that I understand the essence of this statement: what you're saying here is, if I know that different agents get noisier or less noisy data, and I know the noise levels of the different agents, I can use that when combining this information -- >> Ali Sayed: Exactly. To get better performance. >>: To get better performance? >> Ali Sayed: Yes. >>: Okay. >> Ali Sayed: Now -- that's a very good point. Now, the next question: how do I know these noise levels? Well, you have to estimate the noises; you have to estimate those statistics.
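To make the noise-aware weighting concrete, here is a hypothetical sketch of one such construction: each node weights a neighbor inversely to that neighbor's noise variance and then normalizes, which yields a left-stochastic (not doubly stochastic) A. This is an illustrative relative-variance-style rule with invented numbers, not necessarily the exact Hastings rule from the slides:

```python
import numpy as np

# Hypothetical 4-node line network with self-loops, and made-up noise variances.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
sigma2 = np.array([0.01, 0.04, 0.02, 0.09])

N = len(sigma2)
A = np.zeros((N, N))
for k, nbrs in neighbors.items():
    inv = 1.0 / sigma2[nbrs]          # trust cleaner neighbors more
    A[nbrs, k] = inv / inv.sum()      # normalize so column k sums to one

print(A)
print(A.sum(axis=0))                  # columns sum to one (left-stochastic)
print(A.sum(axis=1))                  # rows generally do not (not doubly stochastic)
```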
>>: Okay. But here in the construction you assume that you know these numbers. That is how you constructed the combination matrix. >> Ali Sayed: Right. In a future slide I'll show you quickly -- we have ways of estimating [inaudible], and then this makes it fully adaptive, okay? Now, one final point before I show you this application. So we know that the optimal A is left-stochastic, and now I know how performance changes with topology. And then there is another point. Assume I start adding more agents to my network. Is it better to have more people solve the problem, or fewer? Or is there an optimal choice? If you look at the expression for the convergence rate, the more nodes you have solving the problem, the bigger this sum becomes, because everything here is positive or positive definite. So you are subtracting something bigger from one, and this becomes smaller. So the convergence rate improves. If you have more people helping, you converge faster. But if you look at the MSD expression, if you have more nodes helping you, this expression doesn't necessarily become smaller. It can become smaller, it can become larger, it can stay invariant. So it may happen that with more people helping you, you converge faster, but to a worse MSD level, a worse performance, yes. >>: But when you measure the rate, when you add more agents, you're actually getting more data per time interval, right? Because if I -- >> Ali Sayed: Across the network. Every node is still getting its own data. Yes. >>: Right. But at the network level I'm getting -- >> Ali Sayed: Right. >>: So is the comparison a valid comparison now? My network now gets more data per time unit, so the convergence rate is kind of [inaudible]. >> Ali Sayed: No. Convergence here is defined as how long it takes to reach the steady state. >>: But it's not in terms of information units? Right? Because when you add more agents, you actually add more information units. >> Ali Sayed: That's right. >>: [inaudible] time. >> Ali Sayed: That's right. That's reflected in N. You know, N is the amount of information that you have for every iteration. >>: But you can also take a different take on it and say: assume that I can get, you know, a certain number of items per iteration. >> Ali Sayed: Yes. >>: And I can either have many agents, each one of them getting one unit [inaudible]. >> Ali Sayed: Yes. >>: Or I can have fewer than that, and each one of them takes, say, 10 units of information but can share only once during every iteration. >> Ali Sayed: Yes. >>: So now the amount of information coming into the network is fixed. >> Ali Sayed: Yes. >>: And you're just asking which one will converge faster in terms of -- >> Ali Sayed: The scenario you describe is a very interesting one. I like that. But that's different from what I'm doing, because in your scenario, the rate at which the data is being processed is changing: they can receive more data or less data over the same period of time. Here the rate is fixed. Yeah. >>: So the question is whether the information rate is fixed per agent or per network. >> Ali Sayed: Per agent. >>: Per agent? >> Ali Sayed: Yes. Okay. Yes.
>>: [inaudible] quickly on what [inaudible] is saying: it could even be a system design thing. If I understood, you were saying, if I want to throw agents in, do I throw in simple agents -- >> Ali Sayed: Yes. Exactly. >>: -- that can only send me one number a bunch [inaudible], or do I throw in slightly smarter agents that can tell me -- >>: Should I invest in a huge computer -- >>: Exactly. >>: Yeah. >>: Or should I just -- >> Ali Sayed: It's a very interesting question. I understand. Very interesting question. >>: But this also [inaudible], because the more agents you bring in, the more freedom you have to adjust. >> Ali Sayed: No, but agents bring noise, gradient noise, in with them. >>: I see. >> Ali Sayed: So depending on the particular topology that [inaudible], it's not trivial, because they are coupled, okay? And these expressions reveal that, okay? And we have a theorem that proves that. Okay? I'm going to skip that. Okay. The final point I want to mention is the point about -- okay. So this is the diffusion algorithm -- one of them. Okay? You combine and adapt. And this is consensus, which is what is popular, what people use. Okay. You combine, you adapt. However, the update is asymmetric here. Because of that, when you use constant step-sizes, this can lead to instability, okay? And I would like to mention to you why this happens, okay? If you are using decaying step-sizes, that's not an issue, but for adaptation and learning it is an issue, okay? So the asymmetry is a source of instability, and not only that: diffusion will converge faster and attain better MSD levels, okay? And the reason for that is this. It's very simple, you know. You can show that if you collect the errors across the agents into a big vector, the state of the network evolves according to a first-order equation like this, where in the noncooperative case the coefficient matrix is a diagonal matrix. In the diffusion case it will be that matrix multiplied by the combination matrix. In the consensus case it will involve a subtraction. Now, a result in matrix theory shows that if the diagonal matrix is stable, meaning that the noncooperative case is stable, then no matter what left-stochastic matrix you choose, the product will always be stable. This is a matrix theory result. You can prove it. However, a subtraction like this is not necessarily stable, you know. Think about it. If you have a scalar of magnitude less than one and you multiply it by any number of magnitude at most one, the product will still have magnitude less than one. But this is not true when you combine them through a subtraction. And diffusion always involves the state in the product manner, not in the other manner, okay? So this is the conclusion, okay? There is a small numerical check of this fact below. So I want to come back to a point you mentioned. Okay. I skipped some points just to finish on time. Okay? For example, coming back to this issue: so now I know how to choose the A. I have an expression for it. But that depends on certain parameters. How do I estimate the [inaudible]? So the fact that you can also adapt the combination weights over time is useful. For example, this is an application. I have a network and I have an intruder. So if I can adapt the weights that I assign to my neighbors, I can learn over time that he is a good neighbor, you are a bad neighbor, and over time reduce the weight I put on your links, so that the ultimate effect is essentially disconnecting you from the network, right?
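The matrix-theory fact invoked above can be checked numerically. A minimal sketch with illustrative scalar numbers: each agent's noncooperative error factor is stable on its own, the diffusion dynamics (a product with a left-stochastic matrix) remain stable, while the consensus dynamics (a difference) blow up for the same step-size.

```python
import numpy as np

# Two agents, scalar costs with curvature h = 1 and step-size mu = 1.5,
# so each noncooperative error factor is 1 - mu*h = -0.5 (stable: |.| < 1).
D = np.diag([-0.5, -0.5])

# A left-stochastic combination matrix (columns sum to one):
# here each agent listens only to the other.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])

diffusion_mat = A.T @ D                     # diffusion dynamics: a product
consensus_mat = A.T - np.diag([1.5, 1.5])   # consensus dynamics: a difference

rho = lambda M: max(abs(np.linalg.eigvals(M)))
print(rho(D))              # 0.5 -> noncooperative: stable
print(rho(diffusion_mat))  # 0.5 -> diffusion: stable for any left-stochastic A
print(rho(consensus_mat))  # 2.5 -> consensus: can be driven unstable
```

The product form inherits the stability of the individual agents no matter which left-stochastic matrix is used, whereas the difference form can push an eigenvalue outside the unit circle, which is the instability mechanism described above.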
So we have done that. For example -- I am not going through this, but if you look at this, this is exactly the algorithm you have here. You see, this is the combination, okay, and this is the adaptation. And in between I'm computing these As in an adaptive manner, because I'm learning over time whom to trust more and whom to trust less. This is just a two-line algorithm. There are ways to do it which I'm not explaining, okay? So these As can also be adaptive. So you see I call it A(i) here, and I have a way of doing it; a sketch of one such rule follows below. When you do that, you can get performance like this. You see -- if you look at the simulation, the counter there is running. The intruder is there. You are going to see that when this network converges, it's going to cut that intruder from the network. The thickness of a line is the size of the weight assigned to the neighbors, to the links that the node trusts. You know. So you can see that, over time, when you also adapt the weights, you can get behavior like this over the network, okay? Anyway, I'm going to -- to conclude, just -- yes? >>: Is an intruder like an outlier or -- >> Ali Sayed: Yes, it could be an outlier or it could be a malicious agent who is feeding you wrong information, you know. >>: So it's somehow inconsistent -- >> Ali Sayed: Exactly. Inconsistent -- >>: Consistently bad. >> Ali Sayed: Consistently bad. Because the -- >>: [inaudible]. >> Ali Sayed: Yeah. Defective. The network learns by comparing: the nodes can compare what you are giving me to what I have, for example. And over time they learn from that information whether you are a good neighbor or not, and I assign you more weight or less weight. That's essentially the idea. >>: [inaudible]. >> Ali Sayed: Sorry? >>: It's malicious in the sense that if it's bad, it's bad all the time? >> Ali Sayed: Yes. >>: Because otherwise I can be an agent which is -- I'll be very nice, I will be cooperative until -- you know, the weight you'll assign to me is going to be high, and then I'm going to -- >> Ali Sayed: Right. But again, remember, this is adaptive. It will continue learning. Right? So after some time it will block you again, right? >>: You can do short-term damage then. >> Ali Sayed: Yes, you can do short-term damage then, right. But not long-term. >>: It's the fact that you trust me that lets me create larger damage, right? >> Ali Sayed: Yes. Yes. Yes. Okay. Anyway. So anyway, what I tried to explain, right: I tried to explain some of the issues that come up in the design of networks -- how topology influences the performance of networks, how you design the topology, right, to get better mean-square-error performance and better convergence rates -- and some of the advantages of doing things in this way, you know. So I tried to clarify some of the technical details and answer some of your interesting questions. So I'm going to stop here just to stay on time. >>: Okay. >> Ali Sayed: Okay. And if you have any further questions, I'll be glad to answer them. >>: You said something about some practical applications that the [inaudible]. >> Ali Sayed: Yes. >>: That you have encountered in the -- >> Ali Sayed: Yeah. Of course. Distributed optimization is always the high-level application. Right? Many problems can be formulated as aggregate optimization problems like this, right; distributed optimization is one example.
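One simple possibility for the kind of adaptive combination rule described above, as a sketch rather than the exact rule from the references: each agent tracks a smoothed measure of how far each neighbor's intermediate estimate deviates from its own, and assigns weights inversely to that measure, so a consistently deviant link like the intruder's decays toward zero weight. The names and the smoothing constant are illustrative.

```python
import numpy as np

def adapt_combination_weights(gamma_k, psi, w_k_prev, neighbors, nu=0.05):
    """Sketch of adaptive combination weights for agent k.

    gamma_k[l] is a running average of the squared deviation between
    neighbor l's intermediate estimate psi[l] and agent k's previous
    estimate w_k_prev. A consistently deviant neighbor (an intruder)
    accumulates a large gamma_k[l] and its weight decays toward zero.
    Returns column k of a left-stochastic combination matrix.
    """
    for l in neighbors:
        dev = np.sum((psi[l] - w_k_prev) ** 2)
        gamma_k[l] = (1 - nu) * gamma_k[l] + nu * dev  # smoothed disagreement
    inv = np.array([1.0 / max(gamma_k[l], 1e-12) for l in neighbors])
    return dict(zip(neighbors, inv / inv.sum()))       # weights sum to one
```

Plugging these weights into the combine step of the diffusion recursion gives the two-line combine-and-adapt structure referred to above; the line thickness in the intruder simulation corresponds to these weights shrinking on untrusted links.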
For example, in a paper we wrote recently, you want secondary users in a cognitive radio system to work together to estimate the spectrum, right? So that they know where the holes in the spectrum are. So they want to work together cooperatively to estimate the same quantity. That's one example where algorithms like this can be useful. In some other work we use algorithms like this to model behavior by biological networks, the examples I showed at [inaudible], to try to reproduce that kind of behavior, right? Like the simulation I showed in the first -- in the second slide, right? Okay. >>: So for our [inaudible] that computer data center, for example, exactly how difficult [inaudible] -- how do you design the topology [inaudible] each other in order to compute certain things? >> Ali Sayed: So that they share information. Who should talk to whom, and how much weight -- relevance -- you should give over these links so that the combination is as good as it can be. Right? These are the kinds of questions that you can answer. >>: Just out of curiosity, have you given some thought to the game-theoretic aspects of it? So, for example, since you assume that every agent has its own objective, is it always good for the agent to cooperate? Could there be settings in which the agent could actually benefit from acting maliciously or something like that towards -- >> Ali Sayed: No, it's a very good question. It comes back to the point [inaudible] mentioned earlier. What I talked about today is in the context of cooperative networks, where I assume everybody is willing to work together, right, to solve a problem of common interest. Now, the problem you're talking about relates to, let's say, malicious users or selfish users, where there might be a cost for them to share information -- they may not want to share that information -- or they may be selfish: they only want to share information if they are going to benefit from it, you know. In some more recent work we are pursuing or studying these problems, and we have published a few recent conference papers on that topic, where you add an additional cost to the cost you have here that takes into account how much it's going to cost me to share this information. Right? And that will really introduce a different dynamics into the learning of the agents, you know. It's an interesting problem. But I haven't discussed it here. But it's an interesting problem to -- >>: In practice that could be related to -- you have a communication link and the cost of -- >> Ali Sayed: Cost of communication. >>: That link could be related to what you just -- >> Ali Sayed: Exactly. Because you assign a cost to that, yes. >>: Yes. [inaudible] in Maryland actually have something similar to -- but they do this malicious, have a -- >> Ali Sayed: Yes, yes, yes. >>: Agents. And I think there was a comment I heard that this -- that cooperative networks can be treated as a special case of this game-theoretic way of looking at the problem [inaudible]. Do you remember -- >> Ali Sayed: No, this is -- >>: Game theory. How game theory could apply to this as a special case. >> Ali Sayed: No. Regarding work on malicious agents, selfish agents: lots of work in this area has been done by Professor van der Schaar, also at UCLA, and we are interacting with her along these lines. However, here I talked about cooperative networks, right? Okay. So everyone is working together, okay?
Now, game-theoretic formulations take other aspects into consideration, where now, you know, the exchange may be beneficial to one and not beneficial to the other. It adds different dynamics. So this is not a special case of that; this is a different way of looking at the problem. Okay? And also this is an adaptive way of looking at the problem, where the networks have to be learning all the time. Because the step-sizes are constant, they will not stop learning, right? Okay. So those formulations can be pursued, but they will lead to different kinds of solutions, yes. Okay. >>: How does it relate to social learning? That's another way of looking at the network that even is -- >> Ali Sayed: Right. Exactly. You know, actually, we ourselves use the terminology social learning. When we refer to the aggregation step, that's an example of a social learning step. Because that's where you are interacting with your neighbors to learn from them. You are interacting with your small community, which allows you to learn from them. So that's one example of social learning, you know. In this particular problem -- another thing that's very important: in this work I did not assume anywhere information about the underlying PDFs of the data, the probability distributions of the data. I'm working with data all the time, right, with instances -- like you were talking earlier today: you have the data, I'm just running my algorithms over the data; what kind of performance can I guarantee? Right. In many of the works you mentioned, people need some prior information about the distributions of the data to be able to come up, right, with algorithms that use that information. Okay? Here it's purely adaptive. You just give me data, and I tell you how to process the data and what kind of performance on average I can guarantee for that data. Right? >>: I see. So do you think this type of theory may have some applications in social network analysis, when people pass information -- >> Ali Sayed: We have -- >>: [inaudible]. >> Ali Sayed: We have -- >>: I pass to my friends. >> Ali Sayed: Yes. We have. We have a conference paper from two years ago where we applied some of these ideas in that context as well. This can also be meaningful there. You know? Yeah, you have a question? >>: Just -- just -- you were talking about the distribution. So the whole discussion here was under the assumption that the data is stochastic and that it -- the data comes from a distribution. It's not -- it's not -- >> Ali Sayed: Yes. Yes. Because if you look at the MSD expressions -- if we go back just a few slides. Okay. You see, this R is the variance of the noise. To evaluate that, it depends on the PDF of the data. So what these results are saying is this. You apply the algorithm to the data, right? You only have data. You don't have these quantities. But this is telling you what kind of performance you can expect on average, even over data that you haven't seen. Right? This is the kind of performance you can expect. Okay? >>: You assume that the data comes from a stochastic process, which is -- >> Ali Sayed: Yes. >>: So that the expectation can be defined. >> Ali Sayed: Can be defined, yes. Yes. Yes. >>: [inaudible] of data which is -- >> Ali Sayed: Yes. Yes. Yes. >>: On the earlier slide you -- I think you just wrote the letter H, with the signal power as the Hessian. >> Ali Sayed: No, H is the Hessian of the cost function at the solution. >>: I see. That's why it's second-order.
>> Ali Sayed: Second-order. >>: So what does this [inaudible] the square -- >> Ali Sayed: No, no, no, think about it like this. If this is a quadratic cost, this will be the covariance of the data. So that's why -- >>: [inaudible]. >> Ali Sayed: [inaudible] signal power. Okay. Of course, in the general context it may not mean signal power. I'm just using that as terminology. >>: I see. Okay. >> Ali Sayed: Okay? >>: Yeah. So, to relate to what we discussed in the morning: when you now have a more complicated model, for example a nonlinear -- >> Ali Sayed: Yes, yes. >>: -- function there in terms of the combination, not only a linear [inaudible] function there, what kind of change would the theory need in order to handle that type of more complex -- >> Ali Sayed: Of course. It needs to change, like we were discussing, right? First of all, if the problem is too general, too nonlinear, right, you can only solve it in an approximate manner, right? >>: I see. >> Ali Sayed: But the problem that you have, it appears to me that through some transformations it could perhaps be brought closer, right, to the models that we have. You know? But if it is totally nonlinear, then, of course. >>: The problem -- in many types of problems [inaudible] you design the nonlinearity. >> Ali Sayed: You can design it. >>: Yeah. It doesn't have to be fixed in the [inaudible] I told you about? >> Ali Sayed: Yes. Yes. Yes. >>: We are pretty much free. But on the other hand, [inaudible] has to be reasonable. >> Ali Sayed: One -- >>: That analogy shows a certain type of nonlinearity will convert the problem into something -- >> Ali Sayed: Yes. >>: [inaudible]. >> Ali Sayed: The nonlinearity is [inaudible]; that's a useful property. It will go back and forth. >>: It also has to have some other types of properties. >> Ali Sayed: Yes. >>: So it cannot -- there are some limitations here. We don't know what they are. But on the other hand, if any specific nonlinearity is given, we can roughly assess whether it is suitable for other applications. So that -- okay. And do you know of anyone in this area dealing with nonlinearity? >> Ali Sayed: You know, for a similar problem, no. The first time I saw your model was now, in our discussions, okay? >>: Okay. >>: This is curiosity. You mentioned -- this is very cool. You mentioned that, I think, an important part of the adaptation [inaudible]. >> Ali Sayed: Yes. >>: Which is how much I trust -- >> Ali Sayed: Yes. Yes. Yes. [brief talking over]. >>: All that. Would there also be an advantage, depending on the noise levels or how I perceive the As, to adjust the mus and have different step-sizes for different sensors? >> Ali Sayed: Very good, very good question, yes. You can do that. Now, as you can see here, I have already assumed the step-sizes are different. They depend on the node. >>: Oh. >> Ali Sayed: So technically they are different. But what you're saying is, let's also adapt these step-sizes. >>: Right. >> Ali Sayed: And, yeah, that's an additional level that would be interesting also. In the single-agent case, when you are just talking about a single adaptive [inaudible], people have studied that greatly. But now you have dynamics, you have coupling, right? That would introduce -- so that would be an interesting thing to do. We haven't done it yet. But that's an interesting direction: also adapting -- >>: The traditional adaptive -- >> Ali Sayed: Yes, yes. >>: Tons of [inaudible]. >> Ali Sayed: On adapting the weights, yes. Yes.
But [inaudible] are going to have this coupling, you know, yes, and that would be -- yes. >>: But [inaudible] mu is now a function of the node, and that's kind of hard to tune, right? You have to tune the step-size. >> Ali Sayed: The challenge with mu is that it has an influence on the stability of the algorithm. This mu has to be small enough. You can see performance always depends on mu. So even if you adapt it, you have to be careful not to compromise the stability of the network. So, okay, that's a constraint you have. Okay? >>: How do you decide mu as a function of the node? I mean -- >> Ali Sayed: No, no. Every node is using a different -- could use -- they could all use the same mu or they could use different mus, right? There is no restriction that they should be using the same mu, right? So that's why the results are in terms of mus that change over the nodes. >>: And any kind of -- >> Ali Sayed: Sorry? >>: Any kind of [inaudible] mu. >> Ali Sayed: Any kind of? >>: Guideline. >>: Guideline. >> Ali Sayed: Yes. Okay. We have conditions to ensure stability. And these conditions are in terms of bounds on this mu. Okay? So if you follow these conditions, they will tell you: choose any mu, as long as you don't violate these bounds, and that will be good enough. >>: You want it to be large enough so that you learn faster? >> Ali Sayed: No, there's a compromise. >>: Compromise. >> Ali Sayed: If it's large, it goes faster, you are right. But then you get more noise. So it's -- then you can study this question: what is an optimal value of mu for the trade-off, right? An optimal -- >>: [inaudible]. >> Ali Sayed: Exactly. Yes. >>: And what does your intuition tell you? In this more [inaudible] setting with multiple agents, do you feel that a good algorithm to adapt the local mus could actually lead to better performance? >> Ali Sayed: It can, because -- I'm not saying it's easy, okay? For example, at the -- anyway. Okay. Let me put it this way. The mus can be adapted. It is not easy, because of the stability issues over a network, okay? Because these problems compound, because of the interactions, okay? So you have -- and the stability condition is a very strict condition, okay? So you have to take that into account, okay? So that's why it's not a simple thing to do, like when you are just dealing with a single adaptive node. Now you are dealing with many nodes, so whatever you do here will influence the other nodes, okay? It is doable. It will influence performance, because the expressions here are showing how it influences performance, right? For example, you can design the mus in such a way that -- this is what people usually do in the single-agent case, and you can do it here, okay? You can use a larger mu initially to converge faster, and then start using a smaller mu when you are closer to steady state so that you track better. Things like that can be done here as well, as long as you satisfy the stability conditions; there is a small sketch of such a schedule below. Okay? >>: No problem. >> Li Deng: Any more questions? Okay. Thank you very much. [applause]. >> Ali Sayed: Okay. Thank you.
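The "larger mu first, smaller mu near steady state" idea from the closing exchange can be sketched as follows. It assumes each node knows a stability bound mu_max (from conditions like the ones referred to in the talk); the switching time and the fractions are illustrative, and the only real constraint the sketch enforces is that the schedule never exceeds the bound.

```python
import numpy as np

def mu_schedule(i, mu_max, i_switch=500, fast_frac=0.5, slow_frac=0.05):
    """Illustrative per-node step-size schedule.

    Start at a sizable fraction of the stability bound for fast initial
    convergence, then switch to a much smaller fraction for a lower
    steady-state error. The step-size never exceeds mu_max, so the
    network's stability conditions are respected throughout.
    """
    frac = fast_frac if i < i_switch else slow_frac
    return frac * mu_max

# Example: three nodes with different stability bounds mu_max,k.
mu_max = np.array([0.2, 0.1, 0.05])
print(mu_schedule(10, mu_max))    # early: larger steps, faster learning
print(mu_schedule(1000, mu_max))  # later: smaller steps, lower steady-state MSD
```

As noted in the discussion, coupling over the network makes truly adaptive step-sizes harder than in the single-agent case; a fixed, bound-respecting schedule like this sidesteps that coupling at the cost of having to choose the switching point.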