>> Li Deng: Hello everybody. It is a pleasure to have Professor Guang-Bin Huang come back; this is the second time. A few months ago, when he [inaudible], [inaudible] invited him. He gave one talk, somewhat a preliminary to this one, but he is going to go over that again for the people who did not attend the first talk. So maybe I should talk about him a little and ask him to explain what "extreme" means here, since there happen to be all these little buzzwords around these days. I asked him whether "extreme" can be interpreted as "extremely simple," and he said okay. So think of this as an extremely simple learning technique, but the beauty of something simple is that you can use it as a little building block, like a Lego piece, to build a fancier and more effective learning machine. So with this, rather than going through his whole bio, I will simply say that Professor Guang-Bin Huang has some really interesting work. Last time he gave one version of this extremely simple learning machine, and today he has actually promised to give something even simpler than what he talked about last time, so I will give the floor to Professor Guang-Bin Huang. >> Guang-Bin Huang: Okay, thank you Li Deng for the kind invitation and the very encouraging introduction. It is my pleasure to come here again. Last time I met some friends; this time I see some new faces, and I also see some pioneers I have been hoping to meet for a long time, especially in your areas. Today I will also compare with work done by John Platt. Last time I came here by invitation, and during that talk I just gave an introduction to [inaudible] extreme learning machines. Then we kept up the discussion with [inaudible], who has some very great ideas on learning [inaudible]. We agreed that a major problem in learning is scalability, so today I will focus on the scalability part. Basically, since some of you attended the first talk, the first part of today will come from that first talk. >> Li Deng: Before you start, can you give us an official definition of "extreme"? >> Guang-Bin Huang: Ah [laughter]. Okay, going back to the definition: there is actually a story behind this. We came up with this idea, I mean this learning method, first, and we did not know which name would be good for the technique. [inaudible] we wanted to [inaudible] on neural networks. Then I sat down watching TV, and on the Discovery Channel I saw one program called extreme machines; in that program an extreme machine could be a ship, or a different kind of plane, with different kinds of capability. For the technique we developed, we found it very difficult to use just one word to cover all of the ideas. It is not just random; we thought about calling it "random" something, but there are too many kinds of randomness, so one word like that cannot describe the whole picture of the method. So we say it is very simple, but it can be very efficient, and sometimes we also see that the results are very good, accurate. Then we remembered that program on extreme machines, so we said, why not? If in mechanical engineering we have extreme machines, then why can't we have an extreme learning machine in the learning area?
But it is not meant to replace the others; it is just something quite different from the traditional learning methods. Basically it is a kind of tuning-free learning technique, okay? It is very simple, it is extremely efficient, and [inaudible] the day we saw the results they looked extremely good, so I said why not "extreme"; we actually borrowed the idea from the mechanical area of extreme machines. Okay, so we are talking about feedforward neural networks, general feedforward networks, and about function approximation and classification. I will also talk about SVM, because I consider SVM one of the solutions to generalized feedforward neural networks. Then we go to the learning theory we proposed. The learning theory is different from traditional learning theory, and corresponding to this theory we have an incremental ELM (I-ELM) algorithm. This method builds a network by adding nodes to the existing network one by one, node by node, and it actually turns out to be very efficient, as I will show you later on. We compared the results with [inaudible] neural networks, which are very, very popular, very famous ones. With I-ELM the network architecture can look big, so then we talk about the enhanced incremental ELM, which gives a more compact network. This is very important for applications which may have very large, so-called large data sets, where the resulting network can be very large, so using this method will reduce the network size. Then, last time we mentioned ELM, but a main question was whether it is similar to [inaudible] and also to the RBF network, so we will mention the differences through some comparisons. Then in the last part we will talk about online sequential ELM. In this case the data can be read chunk by chunk, or one by one, especially for applications which are very, very big, where you cannot load all the data at one time; a better way is to learn the data chunk by chunk, say 5,000 now, maybe 6,000 next time, maybe 10,000 coming, right? So after learning that data you discard the data, okay? Now we go to the first part, which we mentioned in the first talk, so we will give a quick review. There are two types of feedforward neural networks. One we call the sigmoid type network, where generally the output function is defined this way. You have an input layer, a hidden layer, and an output layer, and the input to each hidden node is the weighted sum of the inputs from the input layer. So we can write the output of the network in this way, right? It is the weighted sum of the outputs of the hidden layer. The other major one is called the RBF network. In an RBF network the output function of each hidden node is an RBF function. Generally most people write it in terms of the input to the node, but you can rewrite it so that we use the same formula for the network output: the weighted sum of the hidden node outputs, right? So in this way we consider both the sigmoid network and the RBF network as a kind of feedforward network with just a single hidden layer. For these types of feedforward networks, people have proved the universal approximation property: given any continuous target function f(x), as long as the number of hidden nodes is large enough, this kind of network can approximate it.
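A minimal sketch of the single hidden layer network output the speaker is describing, assuming sigmoid additive nodes and Gaussian RBF nodes as the two hidden node types (the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def sigmoid_node(a, b, x):
    # Additive hidden node: G(a, b, x) = g(a . x + b) with a sigmoid activation g.
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + b)))

def rbf_node(a, b, x):
    # RBF hidden node: G(a, b, x) = exp(-b * ||x - a||^2), center a, impact factor b > 0.
    return np.exp(-b * np.sum((x - a) ** 2))

def slfn_output(x, A, B, beta, node=sigmoid_node):
    # f_L(x) = sum_i beta_i * G(a_i, b_i, x): weighted sum of the hidden-layer outputs.
    h = np.array([node(a, b, x) for a, b in zip(A, B)])
    return h @ beta
```

Both node types plug into the same output formula, which is the unified view the talk builds on.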
So the output of this network can be as close to your target function as you like; it approximates your target function as closely as possible, and given any error, in theory we can reach that error, okay [inaudible]. This is the traditional approach: traditionally it is necessary to tune all of the parameters here, that means all of the parameters in this single hidden layer feedforward neural network, the input weights from the input layer to the hidden layer and also the hidden node parameters, right? For the sigmoid type it is the biases; for RBF nodes it is the centers of the RBF nodes; and the output weights are also here, right? For this case you have the β_i. From a learning point of view the target function is unknown, so people have to take some samples, get the training data (x_j, t_j), and we wish, ideally, to let the so-called output of the network be as close to your target as possible. So people tune all of the parameters; this is the traditional way. So a single hidden layer can approximate any continuous target function; how about classification [inaudible]? From a classification point of view, we have already proved that as long as this single hidden layer neural network can approximate any continuous target function, then these kinds of networks can differentiate any disjoint classification regions. It can be any region, no matter what the dimension is: as long as the network can approximate any continuous target function, then it can also separate any disjoint classification regions. That means as long as it can do universal approximation, in theory it can also be used for classification. From the learning method point of view, in the past people used BP to train the network, or used the RBF method, a least-squares-based RBF solution, for these kinds of [inaudible]; so we have these two kinds, BP and the least-squares-based RBF solution. But all of these methods have parameters [inaudible] to tune. Another very popular one is [inaudible], support vector machines; it is very, very popular, now used almost everywhere, especially in classification. Of course many people do not consider the SVM a kind of neural network solution, but if you read Vapnik's 1995 paper, in their work they consider it one solution to multi-hidden-layer feedforward networks. They consider that instead of tuning the so-called hidden layers, you just consider the last hidden layer, and the output of the last hidden layer can be considered a feature mapping φ(x_i) with regard to the input x_i. So the output of the last hidden layer is φ(x). Then they let this kind of network have this kind of optimization: maximize the classification margin and minimize the so-called training error, right? Then they get the solution for these kinds of networks, and generally the final decision function of SVM is in this form. So you consider this: the kernel here can actually be considered a hidden node, and the α_s t_s here can be considered the values of the output weights linking the hidden layer to the output layer. Of course how to compute K(x, x_s) comes from the kernel definition, and then they have to find the α_s; they have a different way to find them, okay?
So this SVM method can generally be considered as one solution to feedforward neural networks. Okay, that φ(x) is just the output function of the last hidden layer. In all of these methods all of the hidden parameters are tuned; based on conventional learning theory, all of the parameters must be tuned. So then we find that no matter whether it is SVM, the RBF type [inaudible], or the sigmoid type [inaudible], it can be considered a generalized single hidden layer feedforward neural network. You have a single hidden layer here; suppose you have L hidden nodes. Then this hidden layer of L hidden nodes can be considered a so-called mapping; the function, the objective of the hidden layer, is a kind of mapping. So the so-called first hidden node, the output of the first hidden node, is G(a_1, b_1, x), and the last hidden node is G(a_L, b_L, x), all right? So this is actually a vector; this vector, the output of the hidden layer, can be considered a feature mapping. Right, so it can be sigmoid type [inaudible] or RBF, or even another type of mapping function. So the idea, which we proved and mentioned before, is this: suppose for this type of so-called single hidden layer feedforward network there exists a tuning method, so you are tuning all of the parameters a_i, b_i, β_i, whatever. Suppose there exists such a tuning method for you to find the appropriate values for these parameters, right, to reach your target, to solve your problem. Then we say such tuning is not required. In theory that means: given a continuous target function f, okay, we have a continuous target function f, then as long as the so-called number of hidden nodes is large enough, all of the hidden nodes, all of the parameters a_i, b_i in the hidden layer, can be randomly generated, and you just need to calculate the output weights β_i. As long as the number of hidden nodes L is large enough, the output of the network can be as close to your target as possible; actually, with probability one the error can reach zero, right? So this is the proof given here. >>: So do you have anything about the [inaudible] of that [inaudible]? How fast is it? >> Guang-Bin Huang: We asked several mathematicians. We say this is an open problem. They tried to do it but it looks very difficult to handle, especially since it is also related to probability and statistics and also depends on the function and the target. So it is a very interesting topic, but so far it is very difficult to find a solution. So, how to prove it? I will just give the idea [inaudible]. The whole idea is: given a network, suppose you are adding to the network. We add a node to the existing network one by one, because this is related to the large data set application. Suppose you already have M−1 hidden nodes added to the network, so the network has M−1 hidden nodes. Because all of the hidden nodes were randomly generated, whenever you add a new hidden node you don't touch, you don't change, the values of the existing parameters in the existing network. That means for the existing M−1 hidden nodes the parameters are fixed, and the outputs related to those M−1 hidden nodes are all fixed. What do you need to do? You just need to randomly generate a new hidden node, say g_M, because g_M is actually the so-called output function of the Mth randomly generated hidden node, all right? For this randomly generated hidden node, it points either in this direction or in that direction; it doesn't matter.
So we just calculate the so-called optimal value for the output weight linking this newly added hidden node to the output node. The existing M−1 hidden nodes never change; the existing network never changes. We just need to calculate the so-called value for the newly added node, the output weight of the newly added node, β_M. How do you calculate it? Based on this formula. Because after you have added M−1 hidden nodes, your target function is here, right? The target function is here; okay, we are talking about this f, right? So the target is here, and from that target there is a residual error between the target and the output function of the existing network with M−1 hidden nodes. So we just do a projection to get this value β_M. This is the approximation we get with this value; that is just the optimal value for the output weight of the newly added node. So then you reach here. The next time you add another one; it could be in this direction or in that direction, right? So again, you add another one and get an approximation, an optimal value for the output weight of the newly added node. >>: [inaudible] what is added is outside of the [inaudible] so your direction is here, right? So [inaudible]. >> Guang-Bin Huang: Your direction is here, right? It is here. Then the [inaudible] can be [inaudible]. >>: [inaudible] that direction? >> Guang-Bin Huang: Yeah, that direction. >>: And then move away? >> Guang-Bin Huang: No, you will actually move in this direction; the value of the beta will bring it back. Then you go here again, and your projection still comes here, right? So it doesn't matter. Every time you add one, what happens? Your error becomes smaller and smaller, and we prove it in this way. We prove that the residual error sequence, okay, is convergent: for any subsequence of the error you will definitely find a limit, which means the original error sequence is convergent. We proved this first. Second, we proved that the norm of this residual error is also convergent. Finally we proved that the limit of the error norm is zero; it is quite complicated, okay? Generally it is a [inaudible] proof method. So basically with this method, you see, it is very simple to add a hidden node and compute the parameter of the network. Every time you just add one, right? That hidden node is randomly generated, and after you add one, you just need to calculate one parameter, β_i, for the newly added hidden node i, okay? You just calculate it in this way, and this calculation is usually very fast. So from here we get a so-called method for us to update the network, okay? Every time you give the target, the maximum number of hidden nodes you can add, and the expected error, right? If you have not reached your maximum number of hidden nodes and the residual error is bigger than what you expect, then we keep generating a new hidden node and adding it to the network, and then we calculate the value for the output weight of the newly added hidden node, that's all. Every time you just do this, okay? Then we compute the residual error for the newly built network. Yes? >>: It is not clear to me yet how we decide the first [inaudible]. >> Guang-Bin Huang: It is randomly generated. Even the first one is just randomly generated.
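A minimal sketch of the incremental procedure described here, for a scalar regression target, assuming sigmoid hidden nodes and uniform random parameters (the function names, ranges, and stopping criteria are illustrative, not from the talk):

```python
import numpy as np

def i_elm(X, T, max_nodes=200, target_error=0.01, rng=None):
    """Incremental ELM sketch: add one random hidden node at a time and fit only
    the new output weight by projecting the current residual onto that node's output."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    A, B, betas = [], [], []
    e = T.astype(float).copy()                  # residual error, starts at the targets
    for _ in range(max_nodes):
        a = rng.uniform(-1.0, 1.0, d)           # random input weights for the new node
        b = rng.uniform(-1.0, 1.0)              # random bias for the new node
        h = 1.0 / (1.0 + np.exp(-(X @ a + b)))  # new node's output on all training samples
        beta = (e @ h) / (h @ h)                # projection of the residual onto h
        e = e - beta * h                        # existing nodes are never retouched
        A.append(a); B.append(b); betas.append(beta)
        if np.sqrt(np.mean(e ** 2)) < target_error:
            break
    return np.array(A), np.array(B), np.array(betas)
```

Each step costs one pass over the data for the new node only, which is the point made in the discussion that follows.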
>>: Then your proof requires that you [inaudible]. [inaudible] in practice if you do something different [inaudible]… >> Guang-Bin Huang: No. Yeah, [inaudible]. >>: I just [inaudible] much faster times. How do you explain that? >> Guang-Bin Huang: Yeah. This is randomly generated. For your case it is very difficult to prove, but more interesting. This proof, after we give it, looks very simple, but we spent three or four years finding it, yeah. It was very, very complicated, thinking this way or that way, and finally we found a simple way to do it, yeah. I guess the proof for your case may be much more complicated because yours is not randomly generated. Of course [inaudible] the first time we thought about randomness, how to prove it with randomness, it was complicated, but finally we found it and the solution is simple. But in your case everything is not random [inaudible], so without a proof [inaudible] it could become [inaudible]. >>: So your [inaudible]. >> Guang-Bin Huang: Yeah, convergent to zero. >>: That would mean your [inaudible] machines [inaudible] traditional [inaudible]. >> Guang-Bin Huang: No, no. >>: [inaudible] different [inaudible]. >> Guang-Bin Huang: Yeah, different. Yeah, so for the efficiency comparison we would have to rely on simulation, yeah. So for this case you see we tested, of course, on small cases; we don't have a large data set, we only have small data sets, so we base it on the [inaudible] to see the whole convergence, all right? The training time usually just depends on the number of hidden nodes to be added, so there is a [inaudible] increase, right? So we compare, say, incremental ELM, where "incremental" here is definitely different from the others. Later on we have the so-called sequential methods, right? Sometimes people call so-called sequential learning incremental learning, meaning data coming in one by one or chunk by chunk. We call that one sequential learning, and for the automatic architecture updating here we use the word incremental. So for this I-ELM we applied the sigmoid activation function, the RBF function, and even the sine function [inaudible], okay? Then we added up to 200 hidden nodes and tested several cases. We compared with a lot of other very, very pioneering work [inaudible] in this area: the resource allocation network proposed by John Platt and also another one called the minimal resource allocation network, okay? So here is the testing accuracy comparison; from here you can see that usually ELM gives better accuracy, right? We also compared the [inaudible], which looks roughly of the same order or even better. You see the time, the training time: for I-ELM the training time here is, in this case, 0.22 seconds. If you look at the so-called resource allocation network, it usually runs much slower. I tried these; I ran all of them myself, and they have too many parameters, I guess 5 to 7 parameters. Of course after a while some of them are fixed, but there are still two or three more that have to be tuned depending on the application. That means for each of them we have tuning parameters. For I-ELM, in all cases so far, no parameters are tuned. And then you see for this case, right, you had to spend quite a long time, okay? In those days I did not know how to use remote desktop, so I had to run my old PC at home and check the results in the middle of the night, okay?
>>: [inaudible] ELM, did you increase it from one up to 200? >> Guang-Bin Huang: Yes. We added the [inaudible] from 1 to 200 and this is the plot. >>: [inaudible]. >> Guang-Bin Huang: No, this is different. >>: [inaudible] faster than this. >> Guang-Bin Huang: That is batch learning, which we mentioned before. That is a fixed [inaudible]. >>: Fixed one? >> Guang-Bin Huang: Yeah. [inaudible]. >>: [inaudible] this incremental? >> Guang-Bin Huang: That one usually is good, but then more time may have to be spent. >>: [inaudible] incremental [inaudible] could be more efficient because [inaudible] before you [inaudible] whether it is worthwhile to [inaudible]. >> Guang-Bin Huang: Yeah, but this [inaudible] is more efficient; every time we just calculate one, just one calculation. That is a very, very simple way. >>: But [inaudible] a few calculate [inaudible] >> Guang-Bin Huang: No, every time after we add one, we just remember that error. So next time we don't recalculate the so-called error for the existing network; we just remember the error, so next time we just use… >>: [inaudible] you have to sit there and watch every time you add one, whereas if you have the batch [inaudible] from the incremental, you just do it once. You should just do it once and save your time. >> Guang-Bin Huang: [inaudible] this is why I have the information [inaudible]. That is, if your data set is very huge, it is very difficult to calculate large matrices. But I guess this could provide a possible solution for you. Every time we just add one hidden node. That means even if you have 200 million data, that is 200 million calculations, one times 200 million; it is just one vector calculation, that's all. >>: I know but [inaudible] very slow [inaudible]. >> Guang-Bin Huang: No. [inaudible] no every [inaudible]. I suppose that the data… >>: [inaudible]. >> Guang-Bin Huang: Okay, the idea is, after you have loaded the data, what happens? You just need to remember the error. Right, so the data is still there. >>: [inaudible]. >> Guang-Bin Huang: Yes. Of course if the memory is limited, that is another case, yeah. But suppose the data is already loaded; I guess this would be very, very efficient. >>: But I think the advantage of doing this is that you can automatically determine where you want to stop [inaudible]. >>: Good, yeah. You are right. >>: So, in practice, we have to try some [inaudible], the more the better. You know when to stop because the error still [inaudible]. >> Guang-Bin Huang: Yeah… >>: So in a single problem, you might think that the wide error [inaudible]. >> Guang-Bin Huang: Yeah. Okay, this is actually a very interesting question that I forgot to mention; yes, it is very good. Of course, people can use cross validation to determine where to stop. You see, for this case it is very interesting: the hidden nodes are randomly generated here, and we only need to calculate the output weight for the newly added hidden node. So actually, you see, here the error is decreasing very, very smoothly. It is different from the others. We tried around 200, 300, 500, 1,000 hidden nodes, and the error always keeps decreasing, because in a single hidden layer feedforward neural network the hidden layer plays a major [inaudible], but the hidden nodes here are randomly generated.
They don't care; they are not determined based on the training data. If the hidden nodes are determined basically by the training data, overfitting can easily be generated [inaudible]. But this hidden layer treats training and testing the same; you don't want to give a benefit to anyone. So it usually is actually much better, much smoother, and it won't cause so-called [inaudible] or overfitting. Yes? >>: So this way of doing incremental [inaudible] is very much similar to [inaudible] where you are adding [inaudible] one by one and [inaudible] the same as what you actually referred to as [inaudible] a small [inaudible]. >> Guang-Bin Huang: Yeah, well, here we don't consider that. Here we are just trying to get the basic idea of how to [inaudible]; upon this, that can be incorporated. That idea could be used here, yeah, but currently we have not considered that. Here we are just using random… >>: [inaudible] difference there [inaudible]… >> Guang-Bin Huang: Here it looks like… >>: [inaudible] example [inaudible]. >> Guang-Bin Huang: So here the hidden node is randomly generated; in their case it is not. So then… >>: [inaudible] how you decide what each node is. [inaudible] random [inaudible]. >> Guang-Bin Huang: Yeah, but here our original idea is that we tried to prove… >>: [inaudible] idea after you go through your [inaudible] error [inaudible] no [inaudible] to the data. >> Guang-Bin Huang: Yeah, that is why I say, how to incorporate those ideas here, or combine the ideas of the two works, I guess a much better result could be obtained. But here it is just the basic idea. >>: [inaudible] samples make other kind of mistake [inaudible] in the place of [inaudible] data. >> Guang-Bin Huang: Yeah, [inaudible]. Okay. So we also compare this with BP and SVR. But about [inaudible]. We compare using I-ELM with 500 hidden nodes, okay, 500 hidden nodes; we compare with BP and SVR, so you can see the accuracy here. Usually I-ELM and SVR reach similar results, but the advantage of I-ELM is here, you see? It looks very, very fast, and actually no parameter needs tuning in between. >>: [inaudible] how large the [inaudible] problem? >> Guang-Bin Huang: Yeah, for this case 500 hidden nodes. >>: [inaudible] input layer [inaudible]. >> Guang-Bin Huang: The input layer depends on the dimension of each case. >>: Yeah, but about how large? >> Guang-Bin Huang: For these, usually quite small. >>: [inaudible] handful [inaudible]? >> Guang-Bin Huang: For the input? The input layer, that means the number of dimensions over… >>: [inaudible] >> Guang-Bin Huang: This here is very, very [inaudible] several tens of [inaudible]. This is a very, very simple case for [inaudible]. Yeah? >>: One would expect that as that gets quite large, far fewer of the randomly generated features will prove useful. >> Guang-Bin Huang: Yeah, good point. Actually last time I left out the information about so-called high input dimensions; later we have one such case. >>: [inaudible] significant number [inaudible]. >> Guang-Bin Huang: Okay, so we have a comparison here. So BP has this number of hidden nodes, ELM is 500, okay? BP is here, and then SVR's number of support vectors is here. >>: If you use the same number [inaudible] would you get a better result than I-ELM? >> Guang-Bin Huang: Okay, for that I will come back later. Here we are fixed at 500. Next we will have an enhanced one; this is basically just a proof method, a method generated from the proof [inaudible], okay?
So even if we come back here, even if we have these 500 nodes compared with BP, it is still faster. Compared with [inaudible], it is much faster. For the [inaudible] machine you have to tune these parameters, which is quite complicated. Right, for I-ELM… >>: [inaudible] so this is only compared for time, for accuracy… >> Guang-Bin Huang: Accuracy is here. >>: It makes sense to have the same number of [inaudible]. >> Guang-Bin Huang: Yeah. >>: [inaudible] the same number of [inaudible]? >> Guang-Bin Huang: No, it was different. The number of hidden nodes is, we keep adding until 500. >>: [inaudible] BP what is this [inaudible]? >> Guang-Bin Huang: BP is sigmoid, tuned [inaudible]. >>: [inaudible] no, no [inaudible] same number… >> Guang-Bin Huang: No, we can't, no. >>: [inaudible]. >> Guang-Bin Huang: Well, you can't say BP uses 10 hidden nodes and then I-ELM also uses 10 [inaudible]. Some use [inaudible]; some use [inaudible]. >>: But you need to compare with the same [inaudible] actually cost the same. >> Guang-Bin Huang: You are right. This is why we have to come back later; we have another one to enhance that. This is very, very basic. The objective of this method was to prove the universal approximation of the original extreme learning machine with a fixed [inaudible]. >>: So I see the problem is that random nodes [inaudible] are unlikely to have over 50 but you are not training the lower levels… >> Guang-Bin Huang: That's right. >>: So if you have the same number of hidden nodes for BP you very quickly [inaudible] [multiple speakers] [inaudible] >>: [inaudible] you have 400 inputs, you have 2,000 hidden nodes never [inaudible] >>: So maybe it's never [inaudible] very small. >>: No, no, I think you never tried to use [inaudible]. >> Guang-Bin Huang: Okay, I will come back to that later. Yeah, this is just [inaudible], okay? We also, you see, in neural networks a common problem is how to find a learning method for the [inaudible] network, that means a single hidden layer feedforward network with the so-called binary activation function in the hidden [inaudible]. Actually, [inaudible] solution before that, right? For binary, people usually had to use a kind of gradient method, for a single [inaudible] or multiple [inaudible]. In this case we just use ELM directly; that means the ELM hidden nodes just use the binary function. Then we compare this with BP. For BP, they tried to change the gain factor, to make the gain factor large, so that the so-called network [inaudible] can be approximated by a sigmoid function with a large gain factor. We compared these [inaudible] with ELM, actually I-ELM, adding the hidden [inaudible] nodes one by one, and compared with BP. You see here, we usually get a direct solution, and the direct solution is sometimes much better than this indirect solution, all right? >>: So you're still using a different number of [inaudible], right? >> Guang-Bin Huang: Yeah, yeah, different… >>: That is not a fair comparison. >>: The idea is that you need to tune that [inaudible] in such a way that it can best perform and I bet you we can [inaudible]. >>: [inaudible] >> Guang-Bin Huang: Ha ha, all right? Okay, so now we go back to your example: 500 nodes, [inaudible] right? Compared with the 10 or 50 used in the BP method. So then we say: every time we just add one hidden node [inaudible] and we don't go further, right? So why not, at every step, randomly generate several hidden nodes?
We see which one should, so to speak, be added to create the largest decrease in your residual error. So you add in this direction and then in this direction, okay, and we say this is the better direction. So after that we go in this direction, and the next time we again have several, right, and try to see which one is best, whether to go this way or that way, and go that way. This is the same method, right? For the same method we can prove universal approximation; the idea and the kind of error [inaudible] are the same. But the interesting thing is that the convergence rate is much, much faster, okay? It converges very quickly. >>: That is only based on the number of [inaudible] based on the [inaudible]. So each time we need to [inaudible] several [inaudible]. >> Guang-Bin Huang: [inaudible] I have a slide to show you to address your concern. Say for example you see the [inaudible] and this and that. So this line is the original I-ELM, created one by one; every time it just generates one. And this line is the new one [inaudible]: that means every time we generate maybe 10, and among the 10 we choose the best one. You see? The convergence rate is quite different. >>: [inaudible] the scale like 10 [inaudible]. 10 times, like 10 times, then go through it maybe 10 times more, do they converge? >> Guang-Bin Huang: Ah, I will come back to this question later. So you see I put these [inaudible] together; for this abalone case you see the original convergence curve is here and the new convergence is here. In order to get the same accuracy, say 0.09 is our target, the original needs maybe 330 hidden nodes, but now only 21 hidden nodes. Yes? >>: [inaudible] the small number of nodes is tuned to the data. Are you subject to overtraining them? Overtraining… >> Guang-Bin Huang: In this case? In this case the hidden nodes are still generated in the same way; every time I just do the same, so I still follow the so-called same series, okay? I only add one at a time, they are always added one by one, and the residual error is always so-called strictly decreasing, I mean really seriously. So the overfitting is not so obvious. >>: But did you verify that with cross validation? >> Guang-Bin Huang: For this case we did not; we just keep adding. We just wanted to show the curve. Of course, if you keep adding and adding, eventually I guess [inaudible]. For this case it is not so… >>: [inaudible] mantra of fixed [inaudible] >>: [inaudible] by selecting one from [inaudible] >>: Oh, I see [inaudible] >> Guang-Bin Huang: So while [inaudible] comparing these, I guess if you continue it will come. But the overfitting is not so apparent even for this case; we keep adding and it is still going, compared with the earlier one. Because the earlier one [inaudible] there is no need to see which one is better, just keep adding. So for the first one the overtraining is not so serious; for the second one, although not so obvious, compared with the first it is definitely there. >>: So this one [inaudible] [inaudible] 10 times the computation to choose [inaudible] 10 times. >> Guang-Bin Huang: No. You see, here we are just using 20 nodes; there you have to use 330 nodes. So which one is faster? This one is faster. >>: [inaudible] >> Guang-Bin Huang: For this you can. I will come back here. So here we say you have a different number of candidates. >>: So basically you are saying [inaudible] 330 [inaudible] actually [inaudible] 330 [inaudible].
>> Guang-Bin Huang: Yeah, so here in terms of time [inaudible], suppose you added 21 nodes; actually, you have already tried 210 nodes. >>: Select the number of candidates? >> Guang-Bin Huang: Yeah. So these are the results. If k equals 1, it means the original, basic incremental one. Then k equals 5 is this one, [inaudible] 10, [inaudible] 20. You go further and further, right? Sometimes they converge somewhere. >>: It looks like it will never converge for the one versus 20. >> Guang-Bin Huang: Yeah, for the 20, that means if you go to 30 it may still be decreasing in some cases… >>: That's why I told you [inaudible] keep doing that all the way down [inaudible] >> Guang-Bin Huang: Yeah, well, this is case dependent. >>: [inaudible] basically three factors [inaudible] accuracy [inaudible] and the number of nodes [inaudible] I'm so [inaudible] this basically just shows [inaudible] which is [inaudible]. >> Guang-Bin Huang: This is not time; this is testing accuracy, in mean square error. Yeah. Of course the reason why k equals 1 is here, right? Usually we suggest a [inaudible] test [inaudible] because it usually runs faster; it is easy for the user to determine, in your own case, what is the so-called basic number of candidates to be used. But generally, from the comparison here, around 20 is usually good for many cases. Of course for special cases you may need a different number of candidates. So we come back to this data again: we have the number of candidates k equal to 10 and 20, and we compare with the original one. You have [inaudible] here; you see that for k equal to 20 the accuracy here is usually better, here it is better than this; usually this improves. But of course for this curve, sometimes this is 0.1, sometimes 0.09, right, so the values here are roughly of the same order, so this is, yes? >>: [inaudible] accuracy values in [inaudible] are very, very low, and then in some different area they are regularly higher, so I mean what is it like [inaudible], but here I don't even see what [inaudible]. >> Guang-Bin Huang: Well, this is regression; this is for approximation. Yeah. So these, okay, so this had to… >>: [inaudible] so basically what you [inaudible] is that [inaudible] you are using only the zeroth-order information to adjust the values of the function, right? >> Guang-Bin Huang: Uh-huh. >>: If you use the first-order information [inaudible] you will get faster convergence [inaudible] found work [inaudible] much faster [inaudible]. >> Guang-Bin Huang: You are right. Here it is the basic idea; first-order gradient [inaudible] or other information incorporated into this method could be a lot of very interesting work. But so far we are just doing the basic thing, so whatever results we show are very basic results. >>: [inaudible] training times. That probably is not very important, the training time. So [inaudible] [multiple speakers] [inaudible]. >>: I understand. [inaudible] much smaller than the original [inaudible] >> Guang-Bin Huang: Uh-huh. >>: [inaudible] so much faster. But if you're using first-order information [inaudible] >> Guang-Bin Huang: [inaudible] and here [inaudible] because you get a smaller [inaudible] compared with the earlier one. So this is, okay, this is the testing standard deviation, okay? So you compare the so-called, okay, so this, a lot, okay.
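A minimal sketch of the enhanced step discussed above, where several random candidate nodes are drawn per step and only the one giving the largest residual decrease is kept (the candidate count k, the ranges, and the names are illustrative, building on the i_elm sketch shown earlier):

```python
import numpy as np

def ei_elm_step(X, e, k=10, rng=None):
    """One enhanced-incremental step: try k random sigmoid candidates against the
    current residual e and return the candidate that leaves the smallest residual."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    best = None
    for _ in range(k):
        a = rng.uniform(-1.0, 1.0, d)
        b = rng.uniform(-1.0, 1.0)
        h = 1.0 / (1.0 + np.exp(-(X @ a + b)))
        beta = (e @ h) / (h @ h)          # same projection rule as plain I-ELM
        new_e = e - beta * h
        if best is None or np.linalg.norm(new_e) < np.linalg.norm(best[3]):
            best = (a, b, beta, new_e)    # keep the candidate with the best decrease
    return best                            # (a, b, beta, updated residual)
```

Each kept node still costs one projection, so reaching a target error with far fewer nodes can more than pay for trying k candidates per step.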
So this is the so-called way of adding hidden nodes one by one. We haven't tried [inaudible] everything, right? So [inaudible] for a lot of large-scale data sets [inaudible] this could really be tried: every time just add one and watch the performance. So that is so-called adding hidden nodes one by one. Then, what is the relationship between this so-called incremental network and the extreme learning machine with a fixed network architecture? This is what I mentioned in the first talk. With that incremental proof [inaudible], the hidden layer of this [inaudible] architecture does not need to be tuned; it can be randomly generated. If it is randomly generated, as we mentioned last time, then the training becomes very simple, right? Given the training data (x_j, t_j), we wish the output of the [inaudible] output layer [inaudible], in the ideal case, right? So in this case, given the training data, we have these [inaudible] equations, right? L means the number of hidden nodes. We can write them in so-called matrix form, right, as Hβ = T; this we mentioned before. In order to solve the problem we just need to solve this equation to get β: we randomly generate the hidden nodes, [inaudible] the hidden layer output matrix, and then take the inverse and get β. This is the basic idea. But as we mentioned before, from the original solution [inaudible], right, in order to get a stable solution, in particular for this sort of inverse, we had better introduce the so-called regularization parameter. So we add it on the diagonal [inaudible]: add I/C here, okay? So instead of that [inaudible] inverse we use this regularized inverse. After we get this, then what happens? Basically for ELM, if the network architecture is fixed [inaudible]; [inaudible] we added the hidden nodes one by one, but if the hidden nodes are fixed, what is the best solution? It is the least-squares method, right? So we just use the least-squares method and get this solution, right? Each hidden node here is randomly generated, and for this case we add the regularization factor there. >>: In practice most of the time [inaudible] all of the [inaudible]. >> Guang-Bin Huang: So you are talking about something different. We want to say… >>: No, no, [inaudible] solution [inaudible]. >> Guang-Bin Huang: Your data is very large, so you have to use the first one, right? Or wait, maybe the second one. Yeah, the second, because here the matrix is N times L; for the second one you will find it becomes L times L, the size of the matrix is L times L. So then we say: if H is unknown, if the hidden layer output function is unknown to us, then we can use the kernel method, right? So we define the kernel: h(x_i)·h(x_j) becomes K(x_i, x_j). So now what is the difference between this and the [inaudible] least-squares method? All right, so this is one of the [inaudible] here; many people are interested in this. For the kernel method we have these. Actually, from the first formula we can get these: suppose h(x_i)·h(x_j) is turned into K(x_i, x_j), right? This is the kernel definition used in SVM and other kernel methods. With this method, you see, it is the same formula for regression, for the binary class case, for the multi-class case.
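A minimal sketch of the fixed-architecture ELM just described: a random hidden layer followed by regularized least squares for the output weights, β = (I/C + HᵀH)⁻¹HᵀT (the defaults for L and C and the sigmoid choice are illustrative):

```python
import numpy as np

def elm_train(X, T, L=500, C=1.0, rng=None):
    """Batch ELM sketch: random input weights/biases, then solve the regularized
    least-squares problem for the output weights in one shot."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    A = rng.uniform(-1.0, 1.0, (d, L))          # random input weights
    b = rng.uniform(-1.0, 1.0, L)               # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))      # n x L hidden-layer output matrix
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                              # regression values or class scores
```

For classification, T can be a one-hot target matrix, so the same solve covers binary and multi-class targets, which is the "same formula for all" point made above.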
So how about the least-square SVM? The least-square SVM is here. For the binary case they have this formula; for the proximal SVM they have this formula. Right, so compared with these, this looks much simpler in terms of implementation. And then from here you see that for the least-square SVM and for the proximal SVM there is also the so-called kernel matrix, with the terms [inaudible] t_1φ(x_1), …, t_Nφ(x_N). >>: So why you say [inaudible] only two elements, right? [inaudible] very large, large [inaudible] >> Guang-Bin Huang: No, this has to solve the same matrix. Here it is also the same N times N matrix; here it is also N times N. >>: [inaudible] still be the same, huh? >> Guang-Bin Huang: You see, the formula here looks more, what I say, unified, because it is all consistent. Then this is for the binary case. And then you see this kernel matrix: the definition of that kernel matrix depends on the output [inaudible] and also [inaudible], while for these the kernel matrix is irrelevant to the [inaudible]. So this is very different. >>: [inaudible] fast method to do this. >> Guang-Bin Huang: [inaudible] use it here. [inaudible] using SVM, okay? So this is for the least-square [inaudible] compared with SVM. So for the multi-class case, for the least-square SVM, multi-class is here. That means if you have so-called m classes, then you use m [inaudible] least-square SVMs, each for one class, and then you have this. But for ELM you just have one, okay? Yes? >>: [inaudible] this right-hand side should be the same? [inaudible]. If you write it in matrix form [inaudible] >> Guang-Bin Huang: Actually yesterday I saw that it is different. Here you have m of them, and each, see, would be N times N; there you have one [inaudible] N times N matrix. >>: But at the end you still have [inaudible]. >> Guang-Bin Huang: Yeah. >>: So in the end the solution is just doing the inverse? So the main [inaudible] the inverse. [inaudible] >> Guang-Bin Huang: Yeah, finally we just solve this one. You see from here [inaudible] they are similar; actually even here, and even here, they are similar. >>: [inaudible] the size of the [inaudible] which was larger. This one is just huge. >> Guang-Bin Huang: Yeah, if we talk about this one, the so-called binary case, I think the complexity is the same because of the [inaudible] and the N times N. If we talk about the multi-class case, what happens? It would be m times, because [inaudible] put them together, right? For each class they have to do a so-called N times N, right? [multiple speakers] [inaudible] >> Guang-Bin Huang: No, here suppose you have m classes; each class is N times N, right? Because here you have to compute m classifiers, the total is m times N squared, all right? But here, it is just one, right? >>: [inaudible] the same because you have [inaudible] so [inaudible] >> Guang-Bin Huang: No, because the matrix size here is N times N. These are fixed. >>: No. [inaudible] Look, [inaudible] is the same. The only difference would be [inaudible]. >> Guang-Bin Huang: No. Even here it is just N times N. This is fixed. >>: [inaudible] >> Guang-Bin Huang: So here it is always N times N; there it is m times N squared. Okay, so why, you see, are the formulas here and here different? Why are the final solutions of the least-square SVM and the proximal SVM different? That is the key point here. So-called [inaudible], they have the same constraints, but their target functions are different; for the proximal [inaudible] here. So they generate different solutions.
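A minimal sketch of the kernel form discussed here, assuming a Gaussian kernel and using one N×N solve for regression, binary, or multi-class targets alike (kernel choice, C and gamma are illustrative assumptions, not values from the talk):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel matrix K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def kernel_elm_train(X, T, C=1.0, gamma=1.0):
    """Kernel ELM sketch: when h(x) is not known explicitly, replace H H^T by the
    kernel matrix Omega and solve alpha = (I/C + Omega)^{-1} T in a single step."""
    Omega = rbf_kernel(X, X, gamma)                         # N x N kernel matrix
    alpha = np.linalg.solve(np.eye(len(X)) / C + Omega, T)
    return alpha

def kernel_elm_predict(Xnew, X, alpha, gamma=1.0):
    return rbf_kernel(Xnew, X, gamma) @ alpha               # f(x) = K(x, X) alpha
```

For m classes, T is simply an N×m target matrix, so there is one N×N system regardless of m; the one-classifier-per-class construction described for the least-square SVM would repeat its N×N work m times.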
And for [inaudible] ELM, the target function is the same as in the least-square SVM, but the constraint is different; the constraint of ELM is different from the rest. For ELM we don't have the b here, which is the bias of the output node; the [inaudible] have the [inaudible] in the output node. [inaudible] What is the reason? ELM comes from universal approximation first. We say the [inaudible] has to satisfy the universal approximation condition, so that means ideally [inaudible]. This is [inaudible]. So we do not have a b there. But the least-square SVM and the proximal SVM come from SVM, and you know SVM [inaudible] comes from classification. When we talk about classification first, in the first place we don't mention universal approximation, so ideally it should have a bias in the output, because [inaudible] here and here, and these can be parallel to each other, right? They are so-called [inaudible] and do not pass through the origin in the feature space. But the ELM solution comes from universal approximation first, so ideally in the feature space, that means in the so-called hidden layer feature space, the separating [inaudible] should pass through the origin. So we do not need to have b here. This is the main issue. And due to this b, the problem comes. You see, for the [inaudible] least-square SVM, that is a primal so-called optimization problem; you go to the dual problem, and from the [inaudible] problem we all have to satisfy the [inaudible] conditions. So there is a [inaudible] condition due to the b. Due to the b issue, we have to satisfy this; this is the least-square solution, we have to satisfy this condition. What is this condition? It means the parameters have to be obtained from this hyperplane. But for ELM there is no such constraint, no set of constraints there; that means ELM can get a solution in the entire space. No constraints. >>: All I am saying is that you can modify the [inaudible] SVM formulation, originally, saying that you constrain your hyperplane to pass through [inaudible] optimally [inaudible] minimizing [inaudible] [multiple speakers] [inaudible] >> Guang-Bin Huang: Polynomial [inaudible] can be used, that is true. It then actually becomes a different kernel, maybe better for different applications. So the issue is, so [inaudible] kind of use [inaudible] universal approximation, so then you use [inaudible] that condition there and then SVM becomes this; but then you can also use linear ones, and that means SVM appears to be a very big horse, a very perfect and beautiful horse that can be used here or there. And that is true, too: without SVM [inaudible] the world would be different; [inaudible] recognition, face recognition would be quite different. But the issue is, I say it is too good, the [inaudible], so it covers too many things. So we had to make it simpler. That means, okay, you have this and this; okay, I just use the universal [inaudible], that means the kernel has to satisfy universal approximation. And the [inaudible] least-squares here also follows SVM, and we say okay, we just [inaudible]; then everything becomes simplified. So [inaudible] this, right, and then the solution becomes very simplified, simple. So this is the target. Of course, then we say generally all of this [inaudible]. But after we remove those redundant, unnecessary conditions, then finally everything becomes unified.
So in this case we don't have the one-against-one problem or the one-against-all problem; everything becomes simple. For [inaudible] SVM you have the [inaudible], so [inaudible] regression; for this case it is the same formula for all, okay? So that is one part; I want to… >>: Can you go back to the previous slide? From what I understand, the reason you want to have these constraints is exactly the benefit of SVM, which gives you better [inaudible] because you can constrain your [inaudible] so you can get better [inaudible] >> Guang-Bin Huang: No. For this case this constraint does not come from the so-called [inaudible] minimum norm; it comes from b, because of the bias b there. When you [inaudible] to solve the so-called optimization problem, right, you have to take some kind of gradient with respect to b. In that case the condition on b is what generates this condition. Without the b, this condition would no longer be there. Okay, so this is the [inaudible] SVM issue. I don't recall this kind of, actually, because this comes from the original [inaudible] classification issue and so on, right? So generally for ELM no [inaudible] node has to be tuned; the hidden layer actually satisfies universal approximation, and we have to satisfy [inaudible], okay? Right, so this… >>: Talking about the [inaudible] condition: when you use this kernel method that you have, how does that proof carry over? It probably doesn't. [inaudible] the concept of the hidden nodes here [inaudible]. >> Guang-Bin Huang: No. This is why I say ELM does not work for all kernels; we have to say that. The kernels have to satisfy universal approximation. If the kernels satisfy universal approximation, then it works. >>: So how do you judge whether [inaudible]. >> Guang-Bin Huang: I see, this is the other issue. Now you say this kernel, that kernel, what then. By the way, we say no: for all cases it may be better to have a kernel that satisfies universal approximation, because if it does not satisfy universal approximation, that means in some application there is, say, some target function you cannot match, right? So it cannot be used everywhere. Of course, if it can be used here, good, but you may not be able to use it somewhere else. So this is… >>: [inaudible] kernel [inaudible] then how do you know if that is satisfied [inaudible] but not in terms of the [inaudible] proof again. >> Guang-Bin Huang: Yeah, so we tested it. We actually tried it here on the MNIST [inaudible]: the input is 28 by 28 and we use 10 classes, so okay, we build on the rest we already did before; we just add it here. For [inaudible] we refer to Hinton's paper, where he mentioned SVM can reach 98.6% testing accuracy; then his deep learning, deep learning method reached 98.8% accuracy using all the training data, the 60,000 training data. So we tested the ELM method; we ran ELM in [inaudible] and got an accuracy of 98.7% with 50,000 training data, because it uses more memory; due to the memory issue we [inaudible] could not use all of the 60,000 data. We just used 50,000 training data for this case and reached here, but if we added more training data [inaudible] maybe we could reach 98.8, okay? Of course [inaudible] has a lot of [inaudible] so [inaudible] learning, keep learning further. So I guess, combined with his work, maybe it can be even higher, right, [inaudible] expectation, okay?
So far we have talked about adding hidden nodes one by one, right? This, I say, could be used efficiently for applications [inaudible]. In a lot of cases you just cannot pull in all the data, so-called [inaudible]. So can we learn the data chunk by chunk? In this case data can come in one by one or chunk by chunk, right, and at any time only the newly arrived data is learned. After the data is learned, it should be discarded, because you don't know how much data will come; if every time new data comes in you have to retrain on your old data, that would be very, very time-consuming. So in this case, whenever new data comes in, you just learn the new data, only the new data [inaudible], okay? Then next, suppose the method does not have knowledge of how much data will come for this application. In this case how do we train the [inaudible]? We can use ELM, because of the simple formula; I have to say it is due to that simple formula. Then we can use the recursive least-squares method; that is a simple matrix method. Say you have a large matrix, right, and you want to do a [inaudible]; how about we [inaudible] the first chunk, then the next [inaudible], and then the next [inaudible]. Finally the [inaudible] solution is equivalent to the [inaudible] on the entire matrix. So this is a method for us to solve applications with large data. We compare with the [inaudible]; of course here it is simple, for a small data set. All right, we use a sigmoid [inaudible], BP, and some other [inaudible] which can learn the data chunk by chunk or one by one, okay? So this is the training time, right, and this is the testing accuracy comparison. Usually it is fast and good. >>: So again for testing [inaudible] smaller number of [inaudible] >> Guang-Bin Huang: Yeah. 13. >>: Smaller? >> Guang-Bin Huang: 13. >>: [inaudible] 25. Do you get better with [inaudible]? >> Guang-Bin Huang: No. This is what we have tried, to get the best. For the BP-based [inaudible] each [inaudible] has to be tuned very carefully, exactly; otherwise after a while you see quite a difference. For the [inaudible] randomly generated, here is one, there is one, so it is really very flat. Too few kind of works; too many, not good, right? >>: Sorry sir, so what is the size of the data set? >> Guang-Bin Huang: The size was very, very small, yeah, because we just wanted to see whether it worked, okay? Yeah. We also tested the multi-class case. Here we let the data come chunk by chunk. These [inaudible] mean that the size of each chunk is randomly generated between 10 and 30: sometimes 10, 15, 20 or 30. It is not just coming one by one; most of the existing [inaudible] learning methods learn the data one by one, okay, rather than chunk by chunk. Here is a DNA case, with a high input dimension. This is the accuracy we achieved. These are for the original one, without adding the regularization factor; with the regularization factor, as I mentioned last time, the accuracy was very, very high, okay, the same as SVM, right? Okay.
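A minimal sketch of the chunk-by-chunk scheme described here, using the standard recursive least-squares update so each chunk can be discarded after it is learned (initialization requirements, names, and defaults are illustrative assumptions):

```python
import numpy as np

def sigmoid_H(X, A, b):
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def os_elm_init(X0, T0, L, rng=None):
    """Online sequential ELM sketch, initialization on a first chunk.
    Assumes the first chunk has at least L samples so H0^T H0 is invertible."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X0.shape[1]
    A = rng.uniform(-1.0, 1.0, (d, L))      # random hidden parameters, fixed forever
    b = rng.uniform(-1.0, 1.0, L)
    H0 = sigmoid_H(X0, A, b)
    P = np.linalg.inv(H0.T @ H0)
    beta = P @ H0.T @ T0
    return A, b, P, beta

def os_elm_update(Xk, Tk, A, b, P, beta):
    """Update with a new chunk (Xk, Tk); the chunk can be discarded afterwards."""
    Hk = sigmoid_H(Xk, A, b)
    # Recursive least squares: P <- P - P Hk^T (I + Hk P Hk^T)^{-1} Hk P
    K = P @ Hk.T @ np.linalg.inv(np.eye(len(Xk)) + Hk @ P @ Hk.T)
    P = P - K @ Hk @ P
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
    return P, beta
```

After all chunks are processed, beta matches what the batch least-squares solution on the entire data would give (up to the regularization handling), which is the equivalence claimed in the talk.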
So now we come to the conclusions; most of them I have mentioned before. One issue is the kernel method: the kernel method I tested for [inaudible], the testing accuracy is 98.7, almost the same as the rest, but for a large data set, right, a large memory would be required. [inaudible] If you have several million data, how do we handle this? With several million data that matrix would be several million times several million, so it would be impossible for us to handle. So the key point is: can online sequential learning [inaudible]? We are all [inaudible] along this direction, but I [inaudible] it is open; it is very, very challenging, especially for SVM. For SVM, right, people have tried to have [inaudible] online sequential methods, but so far it is still very difficult to handle large data sets. So, can we do it? The second thing is: can we prove it? So-called, actually not always, but it usually provides better [inaudible] even though it is randomly generated, because it does not give a bias to the [inaudible] data or the test data. So usually we found that with the same kernel used, the same [inaudible] kernel used by [inaudible] researchers, ELM usually gives better generalization performance than SVM and the [inaudible] least-square SVM. And the reason [inaudible], that is the reason I mentioned. Can we rigorously prove it? So far that is just intuitively speaking. Okay, and then, ELM is always faster than the least-square SVM. All right, so this, I think, can also be proved, because from here we can see it: ELM solves this formula, while SVM and the least-square SVM solve this formula, right? So at least we don't have this part, we don't have this part, right? For the binary case [inaudible], right, this looks similar, and this is no slower than the least-square SVM, right? So that is the conclusion. Okay, yeah, thanks. [applause] >>: So take this proof, for the kernel case, of the convergence. [inaudible] you prove that you get it, [inaudible] the error reduced to zero. Now for the kernel case how do you do that? >> Guang-Bin Huang: Well, for the kernel case [inaudible], for the Gaussian function [inaudible]. Not all kernels; for the Gaussian function that was already proved before. It does not just come from my proof; yeah, my proof is for the random case, yeah. And then we say this method can be extended to the kernel method; it is just one step away. So we extend it from random nodes to the kernel machine, right? But not all kernels, just those kernels that satisfy the universal approximation condition. >> Li Deng: Any other questions? Okay, thank you. >> Guang-Bin Huang: Thank you very much.