>> Li Deng: Hello everybody. It is a pleasure to have Professor Guang-Bin Huang come back; this is the second time. A few months ago, when he [inaudible], we invited him and he gave one talk that was somewhat preliminary to this one. But he is going to go over that again for the people who did not attend the first talk. Before the talk I asked him to explain what "extreme" means here, since there happen to be these little buzzwords around these days. I asked him whether "extreme" can be interpreted as extremely simple, and he said okay. So think of this as an extremely simple learning technique, but the beauty of something simple is that you can use it as a little building block, like a Lego piece, to build a fancier and more effective learning machine. So with this, rather than going through his whole bio, I will simply say that Professor Guang-Bin Huang has some really interesting work. Last time he gave one version of this extremely simple learning machine, and today he actually promised to give something even simpler than what he talked about last time. So I will give the floor to Professor Guang-Bin Huang.
>> Guang-Bin Huang: Okay, thank you Li Deng for the kind invitation; that was a very encouraging introduction. It is my pleasure to come here again. Last time I met some friends; this time I see some new faces, and also I see a pioneer whom I have been expecting for a long time, especially in related areas. Today I will also compare with work done by John Platt. Last time I came here at [inaudible]'s invitation, and in that talk I just gave an introduction to the so-called extreme learning machine. Then we kept up the discussion with [inaudible], who has some very great ideas on learning [inaudible], and we agreed that a major problem in learning is scalability, so today I will focus on the scalability part.
So basically, here is what we will cover today. Because some of you attended the first talk, the first part will actually come from the first talk.
>> Li Deng: Before you start can you give us an official definition of extreme?
>> Guang-Bin Huang: Ah [laughter]. Okay, back to the definition. Actually there is a story behind this. We came up with this idea, I mean this learning method, first, and we didn't know which name would be good for this technique. [inaudible] we wanted to [inaudible] on neural networks. Then one evening I sat down watching TV, on the Discovery Channel, and I saw a program called Extreme Machines. In that program the extreme machines were a ship, different kinds of planes, different kinds of machines.
For the technique we developed, we found it very difficult to use just one word to convey the general idea, to cover all of the ideas. It is not just random; we also thought of names using "random" something, but there are too many kinds of random methods, so that does not give the full picture of the method. We say it is very simple, but it may be very efficient, and sometimes we also see that the results are very good, accurate. So then, seeing that the channel's program was called Extreme Machines, we said: why not? If in mechanical engineering we have extreme machines, then why can't we have an extreme learning machine in the learning area?
But it is not meant to replace the others; it is just something quite different from the traditional learning methods. Basically it is a kind of tuning-free learning technique, okay? It is very simple, it is extremely efficient, and on the day we saw the results they looked extremely good, so I said why not "extreme". So actually we borrowed the idea from the mechanical area's extreme machines.
Okay, so we are talking about feedforward neural networks. We will talk about general feedforward networks and about function approximation and classification. I will also talk about SVM, because I consider SVM one of the solutions to generalized feedforward neural networks. Then we go to the learning theory we proposed. The learning theory is different from traditional learning theory, and to each theorem there corresponds one ELM algorithm. So basically we have an incremental ELM (I-ELM) algorithm. This method builds a network by adding nodes to the existing network one by one, node by node, and it actually looks very efficient, as I will show you later. We compared the results with [inaudible] neural networks, which are very, very popular, very, very famous ones.
With I-ELM the network architecture can get big, so then we talk about the enhanced incremental ELM, with which we can get a more compact network. This is very important for the kind of application that may have very large data sets, where the network obtained can be very large; using this method will reduce the network size. Then, last time we mentioned ELM, and the main question was whether it is similar to [inaudible] and also the RBF network, so we will point out the differences through some comparisons. The last part is online sequential ELM. In this case the data can be read chunk by chunk or one by one, especially for those applications which are very, very big, where you cannot load all of the data at one time. So you have to learn the data chunk by chunk: say 5,000 now, maybe 6,000 coming next time, maybe 10,000 coming, right? After learning the data, you discard the data, okay?
So now we go to the first part, which we covered in the first talk, so we will give a quick review. There are two types of feedforward neural networks. One we call the sigmoid type of network; generally the output function is defined this way. You have an input layer, a hidden layer and an output layer. The input to each hidden node is the weighted sum of the inputs from the input layer. So we can write the output of the network in this way: it is the weighted sum of the outputs of the hidden layer. The other major one is called the RBF network. In the RBF network the output function of each hidden node is an RBF function. Generally people write it in terms of an input function, but if you rewrite it we can use the same formula for the output function: the weighted sum of the outputs of the hidden nodes, right? So in this way we consider both the sigmoid network and the RBF network as kinds of feedforward networks with a single hidden layer. For these types of feedforward networks people have proved the universal approximation property. That means: given any continuous target function f(x), as long as the number of hidden nodes is large enough, the output of this network can be as close to your target function as possible — it can approximate your target function as closely as you like; given any error, in theory we can reach that error, okay [inaudible].
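For concreteness, here is a minimal NumPy sketch (not from the talk; all names and parameter choices are illustrative) of the two hidden-node types and of the network output as a weighted sum of hidden-node outputs.

```python
import numpy as np

def sigmoid_hidden(X, A, b):
    """Additive (sigmoid-type) hidden nodes: h_i(x) = g(a_i . x + b_i)."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))               # shape (N, L)

def rbf_hidden(X, C, gamma):
    """RBF-type hidden nodes: h_i(x) = exp(-gamma * ||x - c_i||^2)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # squared distances, (N, L)
    return np.exp(-gamma * d2)

def slfn_output(H, beta):
    """Network output f(x) = sum_i beta_i h_i(x): weighted sum of hidden outputs."""
    return H @ beta

# toy usage: 10 samples, 3 inputs, 5 hidden nodes, 1 output
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
beta = rng.standard_normal((5, 1))
y = slfn_output(sigmoid_hidden(X, A, b), beta)                # (10, 1) network outputs
```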
This is what is traditionally thought necessary: traditionally one has to tune all of the parameters — all of the parameters in this single-hidden-layer feedforward neural network: the input weights from the input layer to the hidden layer and also the hidden layer parameters, right? For the sigmoid type these are the biases; for RBF they are the centers of the RBF nodes; and also the output weights here, the beta_i.
From the learning point of view the target function is unknown, so people take some samples to get the training data (x_j, t_j), and we wish — ideally — to let the so-called output of the network be as close to the targets as possible.
So then people tune all of the parameters; this is the traditional way. A single hidden layer can approximate any continuous target function — but how about classification? From a classification point of view we have already proved that as long as this single-hidden-layer neural network can approximate any continuous target function, then these kinds of networks can also separate any disjoint classification regions. They can be any regions, in any dimension: as long as the network can approximate any continuous target function, it can separate any disjoint classification regions. That means that as long as it has the universal approximation capability, in theory it can also be used for classification.
From the learning method point of view, in the past people used BP to train this network, or used RBF methods — least-squares-based RBF solutions — for these kinds of networks. So we have two kinds: BP and the least-squares-based RBF solution. But all of these methods have parameters that require tuning. Another very popular one is the support vector machine. SVM is very, very popular; it is used almost everywhere now, especially in classification. Of course many people do not consider SVM a kind of neural network solution, but read Vapnik's 1995 paper.
In that work they consider SVM as one solution to multi-hidden-layer feedforward networks. They consider that, instead of tuning the so-called hidden layers, one just looks at the last hidden layer, so the output of the last hidden layer can be considered a feature mapping phi(x_i) with regard to the input x_i. Then they set up this optimization for the network: maximize the classification margin and minimize the so-called training error, right? Then they get the solution for these kinds of networks, and generally the final decision function of the SVM is written in this way.
So if you look at it this way, each kernel up here can actually be considered a hidden node, and the alpha_s t_s can be considered the values of the output weights linking the hidden layer to the output layer. Of course, how to compute the kernel is given by the kernel definition, and they have a particular way to find the alpha_s. So this SVM method can generally be considered one solution to generalized feedforward neural networks, where phi(x) is just the output function of the last hidden layer.
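As a reminder of the formula being referred to (standard SVM notation, not reproduced from the slide), the final SVM decision function can be written as

$$ f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{s=1}^{N_s} \alpha_s\, t_s\, K(\mathbf{x}, \mathbf{x}_s) + b\Big), $$

so each kernel term $K(\mathbf{x}, \mathbf{x}_s)$ plays the role of a hidden node and $\alpha_s t_s$ plays the role of the output weight linking it to the output node.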
In all of these methods the hidden parameters are tuned; the conventional view of learning is that all of the parameters must be tuned. But then we find that no matter whether it is SVM, the RBF type or the sigmoid type, each can be considered a generalized single-hidden-layer feedforward network. You have a single hidden layer here; suppose you have L hidden nodes. The role of this hidden layer is a kind of mapping: the output of the first hidden node is G(a_1, b_1, x), and the output of the last hidden node is G(a_L, b_L, x), all right? This is actually a vector; the output of the hidden layer can be considered a feature mapping. It can be of sigmoid type or RBF type.
It could even be another type of mapping function. So the idea, which we proved and mentioned before, is this. For this type of so-called single-hidden-layer feedforward network, suppose there exists a tuning method for tuning all of the parameters a_i, b_i, beta_i, whatever — suppose such a tuning method exists that finds appropriate values for these parameters to reach your target, to solve your problem. Then we say such tuning is not required. In theory that means: given a continuous target function f, as long as the number of hidden nodes is large enough, all of the hidden node parameters a_i, b_i in the hidden layer can be randomly generated, and you only need to calculate the output weights beta_i. As long as the number of hidden nodes is large enough, the output of the network can be as close to your target as possible; actually, with probability one the error converges to zero, right? This is the proof given here.
>>: So do you have anything about the [inaudible] of that [inaudible]? How fast is it?
>> Guang-Bin Huang: We asked several mathematicians. We say this is an open problem. They tried to do it, but it looks very difficult to handle, especially since it also involves probability and statistics and depends on the activation function and the target function. It is a very interesting topic, but so far it is very difficult to find a solution. So how do we prove it? I will just give the idea of the proof. The whole idea is: given a network, suppose you are adding nodes to the network. We add nodes to the existing network one by one, because this is related to large data set applications. Suppose you already have M-1 hidden nodes added to the network, so the network has M-1 hidden nodes. Because all of the hidden nodes are randomly generated, whenever you add a new hidden node you don't touch, you don't change, the values of the existing parameters in the existing network.
That means the parameters of the existing M-1 hidden nodes are fixed; the outputs related to these M-1 hidden nodes are all fixed. What do you need to do? You just need to randomly generate a new hidden node, say g_M, where g_M is the so-called output function of the M-th randomly generated hidden node, all right? This randomly generated hidden node may point in this direction or that direction; it doesn't matter. We just calculate the so-called optimum value for the output weight linking this newly added hidden node to the output node. The existing M-1 hidden nodes never change; the existing network never changes. We just need to calculate the so-called value for the newly added node, the output weight beta_M of the newly added node. How do you calculate it? Based on this formula. After you have added M-1 hidden nodes, your target function f is here, right? The target is here. From that target there is a residual error, the difference between the target and the output function of the existing network with M-1 hidden nodes. We just do a projection to get this value, beta_M. This gives the optimum value for the output weight of the newly added node. So then you reach here. The next time you add another one; it could be in this direction or in that direction, right? Again you add another one and get an approximation, an optimum value for the output weight of the newly added node.
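The formula being described on the slide is, in the notation used in the I-ELM literature, the projection of the current residual onto the new hidden node's output: with residual $e_{M-1} = f - f_{M-1}$,

$$ \beta_M = \frac{\langle e_{M-1},\, g_M \rangle}{\lVert g_M \rVert^2}, \qquad f_M = f_{M-1} + \beta_M\, g_M . $$

(The symbols $f_M$, $g_M$ and $e_{M-1}$ follow the talk's description; the exact slide notation is not reproduced here.)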
>>: [inaudible] what is added is outside of the [inaudible] so your direction is here,
right? So [inaudible].
>> Guang-Bin Huang: Your direction is here, right? Is here. Then the [inaudible] can
be [inaudible].
>>: [inaudible] that direction?
>> Guang-Bin Huang: Yeah, that direction.
>>: And then move away?
>> Guang-Bin Huang: No, you will actually move in this direction when the value of beta comes back. Then you go here again, and your projection still comes here, right? So it doesn't matter. Every time you add a node, what happens? Your error will become smaller and smaller, and we prove it this way. We prove that the residual error sequence is convergent: for any subsequence of the error you can definitely find a limit for that subsequence, which means the original error sequence is convergent. We proved this first. Second, we prove that the norm of the residual error is also convergent. Finally we prove that the limit of that norm is zero; it is quite complicated, okay?
Generally it is a [inaudible] proof method. Basically, with this method you see it is very simple to add a hidden node and compute the parameters of the network. Every time you just add one. That hidden node is randomly generated, and after you add one, you just need to calculate one parameter, beta_i, for the newly added hidden node i, okay? You just calculate it in this way, and this calculation is usually very fast. So from here we can state a so-called algorithm for updating the network. You give the target error and the maximum number of hidden nodes you can add, right? If you have not reached your maximum number of hidden nodes and the residual error is still bigger than what you expect, then we keep generating a new hidden node and adding it to the network, and we calculate the value of the output weight of the newly added hidden node; that's all. Every time you just do this, okay? Then we compute the residual error for the newly built network. Yes?
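A minimal sketch of the incremental loop just described, assuming sigmoid hidden nodes and the residual-projection formula above (a simplified reconstruction, not the author's reference code; function and variable names are my own):

```python
import numpy as np

def i_elm(X, T, max_nodes=200, target_rmse=0.01, rng=None):
    """Incremental ELM: add random hidden nodes one by one, fitting only
    the output weight of the newly added node to the current residual."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, d = X.shape
    residual = T.astype(float).copy()                    # e_0 = target
    nodes, betas = [], []
    for _ in range(max_nodes):
        a, b = rng.uniform(-1, 1, d), rng.uniform(-1, 1) # random hidden node parameters
        g = 1.0 / (1.0 + np.exp(-(X @ a + b)))           # its output on the training data
        beta = (residual @ g) / (g @ g)                  # projection of residual onto g
        residual -= beta * g                             # update residual; old nodes untouched
        nodes.append((a, b)); betas.append(beta)
        if np.sqrt(np.mean(residual ** 2)) < target_rmse:
            break
    return nodes, np.array(betas)
```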
>>: It is not clear to me yet how you decide the first [inaudible].
>> Guang-Bin Huang: It is random generated. Even the first one, just random generated.
>>: Then your proof requires that you [inaudible]. [inaudible] and in practice if you do something different [inaudible]…
>> Guang-Bin Huang: No. Yeah, [inaudible].
>>: I just [inaudible] much faster times. How do you explain that?
>> Guang-Bin Huang: Yeah. This is randomly generated. For your case it is very difficult to prove, but more interesting. After we give this proof it looks very simple, but we spent three or four years finding this proof, yeah. It was very, very complicated, thinking this way or that way, and finally we found a simple way to do it, yeah. I guess the proof for your case may be much more complicated because yours is not randomly generated. Of course, the first time we thought about random — how to prove it with randomness — it was complicated, but finally we found it and the solution is simple. But in your case everything is not random [inaudible], so without a proof [inaudible] it could become [inaudible].
>>: So your [inaudible].
>> Guang-Bin Huang: Yeah, convergent to zero.
>>: That would mean your [inaudible] machines [inaudible] traditional [inaudible].
>> Guang-Bin Huang: No, no.
>>: [inaudible] different [inaudible].
>> Guang-Bin Huang: Yeah, different. Yeah, probably for efficiency a comparison would have to be based on simulation, yeah. So for this case, we tested — of course on small cases. We didn't have a large dataset, only small data sets, so we looked at the overall convergence, all right? The training time usually just depends on the number of hidden nodes to be added, so it increases [inaudible], right? We compare, say, incremental ELM, where "incremental" is definitely different from others. Later on there are a lot of so-called sequential methods, and sometimes people call so-called sequential learning incremental learning; but there the data come in one by one or chunk by chunk. We call those sequential learning, and for automatic architecture updating we use the word incremental. For this I-ELM we applied the sigmoid function, the RBF function and even the sine function [inaudible], okay? Then we added up to 200 hidden nodes and tested several cases. We compared with some very, very pioneering work in this area: the resource-allocating network proposed by John Platt, and also another one called the minimal resource allocation network, okay?
So here is the testing accuracy comparison. From here you can see that usually I-ELM gives better accuracy, right? We also compare the [inaudible], which looks roughly of the same order or even better. Now look at the training time: for I-ELM the training time here is, in this case, 0.22 seconds. Looking at the so-called resource allocation networks, they usually run much slower. I tried these; I ran all of them myself, and they have too many parameters — I guess 5 to 7 parameters to tune. Of course, after a while some of them can be fixed, but there are still two or three more that have to be tuned for each application. That means each of them has tuning parameters, while in our case no parameters are tuned. And then you see, for this case you have to spend quite a long time, okay? In those days we didn't know how to use remote desktop, so I had to run my old PC at home and check the result in the middle of the night, okay?
>>: [inaudible] ELM, did you increase it from one up to 200?
>> Guang-Bin Huang: Yes. We added the [inaudible] from 1 to 200, and this is the plot.
>>: [inaudible].
>> Guang-Bin Huang: Now this is different.
>>: [inaudible] faster than this.
>> Guang-Bin Huang: That is batch learning, we mentioned before. That is a fixed
[inaudible].
>>: Fixed one?
>> Guang-Bin Huang: Yeah. [inaudible].
>>: [inaudible] this incremental?
>> Guang-Bin Huang: That usually gives good information, but then more time would have to be spent.
>>: [inaudible] incremental [inaudible] could do more efficient because [inaudible]
before you [inaudible] whether it is worthwhile to [inaudible].
>> Guang-Bin Huang: Yeah, but this [inaudible] is more efficient: every time we just calculate one, just one calculation. That is a very, very simple way.
>>: But [inaudible] a few calculate [inaudible]
>> Guang-Bin Huang: No, every time after we add one, we just remember the error. So next time we don't recalculate the error for the existing network; we just remember the error, so next time we just use…
>>: [inaudible] you have to sit there and watch every time you add one, whereas if you
have the batch of the [inaudible] from the incremental, you just do it once. You should
just do it once and pick up your spare time.
>> Guang-Bin Huang: [inaudible] this is why I have this in mind. If your dataset is very huge, it is very difficult to compute a large matrix, but I guess this could provide a possible solution for you. Every time we just add one hidden node, so even if you have 200 million data points, that is 200 million calculations — one times 200 million. It is just one vector calculation, that's all.
>>: I know but [inaudible] very slow [inaudible].
>> Guang-Bin Huang: No. [inaudible] no every [inaudible]. I suppose that the data…
>>: [inaudible].
>> Guang-Bin Huang: Okay, the idea is after you have loaded the data what happens?
You just need to remember the error. Right so the data is still there.
>>: [inaudible].
>> Guang-Bin Huang: Yes. Suppose the data [inaudible]; of course if the memory is limited that is another case, yeah. But suppose the data are already loaded; I guess this would be very, very efficient.
>>: But I think the advantage of doing this is that you can automatically determine where you want to stop [inaudible].
>>: Good yeah. You are right.
>>: So in practice we have to try some [inaudible], the more the better. You know when to stop because the error still [inaudible].
>> Guang-Bin Huang: Yeah…
>>: So in a single problem, you might think that the wide error [inaudible].
>> Guang-Bin Huang: Yeah. Okay, this is actually a very interesting question that I forgot to mention. Yes, it is very good. Of course, people can use cross-validation to determine where to stop. But for this case it is very interesting: the hidden nodes are randomly generated here, and we only need to calculate the output weight for the newly added hidden node. So the error actually decreases very, very smoothly. It is different from the others. We tried 200, 300, 500, 1000 hidden nodes and it always keeps decreasing, because in a single-hidden-layer feedforward neural network the hidden layer plays the major role, but here the hidden nodes are randomly generated — they are not determined by the training data. When the hidden nodes are determined by the training data, overfitting can easily be generated [inaudible]. But this hidden layer treats training and testing the same; you don't give a benefit to either one. So it usually behaves much better: it is much smoother and it won't cause so-called overfitting. Yes?
>>: So this way of doing incremental [inaudible] is very similar to [inaudible] where you are adding [inaudible] one by one and [inaudible] the same as what you actually referred to as [inaudible] a small [inaudible].
>> Guang-Bin Huang: Yeah, well, here we don't consider that. Here we are just trying to get the basic idea of how to [inaudible]; that idea could be incorporated here, yeah, but currently we have not considered it. Here we are just using random…
>>: [inaudible] difference there [inaudible]…
>> Guang-Bin Huang: Here it looks like…
>>: [inaudible] example [inaudible].
>> Guang-Bin Huang: So here the hidden node is randomly generated. In their case it is not. So then…
>>: [inaudible] how you decide what each node is. [inaudible] random [inaudible].
>> Guang-Bin Huang: Yeah, so, but here our original idea is we tried to prove…
>>: [inaudible] idea after you go through your [inaudible] error [inaudible] no
[inaudible] to the data.
>> Guang-Bin Huang: Yeah, that is what I say: how to incorporate those ideas here, or combine the ideas of the two works — I guess a much better result could be obtained. But here it is just the basic idea.
>>: [inaudible] samples make other kind of mistake [inaudible] in the place of
[inaudible] data.
>> Guang-Bin Huang: Yeah, [inaudible]. Okay. So we also compare this with BP and SVR, but with [inaudible]. We compare using I-ELM with 500 hidden nodes, okay, 500 hidden nodes, against BP and SVR, and you can see the accuracy here. Usually I-ELM and SVR reach similar results. But the advantage of I-ELM is here, you see? It is very, very fast, and actually no parameters need tuning in between.
>>: [inaudible] how large the [inaudible] problem?
>> Guang-Bin Huang: Yeah, whole lot, for this case 500 hidden nodes.
>>: [inaudible] input layer [inaudible].
>> Guang-Bin Huang: The input layer depends on the input dimension of each case.
>>: Yeah, but about how large?
>> Guang-Bin Huang: Of this, usually uses a single case.
>>: [inaudible] handful [inaudible]?
>> Guang-Bin Huang: For input? Input layer, that means the number of dimensions
over….
>>: [inaudible]
>> Guang-Bin Huang: This here is very, very [inaudible] several tens of [inaudible].
This is a very, very simple case for [inaudible]. Yeah?
>>: One would expect that as that gets quite large that far fewer of the randomly
generated features will prove useful.
>> Guang-Bin Huang: Yeah, good point. Actually, last time I left out the case of so-called high input dimensions; later we have one.
>>: [inaudible] significant number [inaudible].
>> Guang-Bin Huang: Okay, so we have a comparison here. This is the number of hidden nodes for BP. I-ELM uses 500, okay? BP is here, and the number of support vectors for SVR is here.
>>: If you use the same number [inaudible] would you get a better result than I ELM?
>> Guang-Bin Huang: Okay, I will come back to that later. Here we are fixed at 500. Next we will have an enhanced one; this one basically just comes from the proof method [inaudible], okay? But even coming back here, even with 500 nodes, compared with BP it is still faster, and compared with [inaudible] it is much faster. For a [inaudible] machine you have to tune those parameters, which is quite complicated. Right, for I-ELM…
>>: [inaudible] so this is only compared for time, for accuracy…
>> Guang-Bin Huang: Accuracy is here.
>>: It makes sense to have the same number of [inaudible].
>> Guang-Bin Huang: Yeah.
>>: [inaudible] the same number of [inaudible]?
>> Guang-Bin Huang: No. It was different. The number of hidden nodes is, we keep
adding until 500.
>>: [inaudible] BP what is this [inaudible]?
>> Guang-Bin Huang: BP is sigmoid tuning [inaudible].
>>: [inaudible] no, no [inaudible] same number…
>> Guang-Bin Huang: No, we can't, no.
>>: [inaudible].
>> Guang-Bin Huang: Well, you can't. If BP uses 10 hidden nodes, should we then say I-ELM must use 10 [inaudible]? Some use [inaudible]. Some use [inaudible].
>>: But you need to compare with the same [inaudible] actually cost the same.
>> Guang-Bin Huang: You are right. This is why I will come back to it later; we have another one that enhances this. This is very, very basic. The objective of this method was to prove the universal approximation capability of the original extreme learning machine with a fixed [inaudible].
>>: So I see the problem is that random nodes [inaudible] are unlikely have over 50 but
you not trained on the lower levels…
>> Guang-Bin Huang: That's right.
>>: So if you have the same number of hidden modules for BP you very quickly
[inaudible]
[multiple speakers] [inaudible]
>>: [inaudible] you have 400 inputs, you have two key hidden nodes never [inaudible]
>>: So maybe it's never [inaudible] very small.
>>: No, no, I think you never tried to use [inaudible].
>> Guang-Bin Huang: Okay, I will come back to that later. Yeah, this is just [inaudible], okay? You see, in neural networks a common problem is how to find a learning method for the threshold network, that is, the single-hidden-layer feedforward network with a so-called binary activation function in the hidden nodes. Before, there was no direct solution for that, right? For the binary case, a gradient method usually had to be used, for single [inaudible] or multiple [inaudible]. In our case we just use ELM directly, which means the ELM hidden nodes just use the binary function. Then we compare with BP. For BP they change the gain factor to make the gain factor large, so that the so-called threshold network can be approximated by a sigmoid function with a large gain factor. We compare that with ELM — actually I-ELM, where we add the hidden nodes one by one — against BP. You see, here we usually get a direct solution, and the direct solution is sometimes much better than the indirect solution, all right?
>>: So you're still using a different number of [inaudible], right?
>> Guang-Bin Huang: Yeah, yeah, different…
>>: That is not a fair comparison.
>>: The idea that you need to tune that [inaudible] in such a way that it can best perform
and I bet you we can [inaudible].
>>: [inaudible]
>> Guang-Bin Huang: Ha ha, all right? Okay, so now we go back to your example: 500 nodes, right, compared to the 10 or 50 used in the BP method. In the basic method every time we just add one hidden node [inaudible] and don't go further, right? So why not do this: at every step we randomly generate several hidden nodes and see which one, when added, creates the largest decrease of the residual error. You could add in this direction or in that direction; okay, we say this is the better direction, so we go in this direction. Then next time we again have several candidates, right, and see which one is better, whether to go this way or that way, and go that way. This is the same method, right? For the same method we can prove universal approximation; the idea for the error is the same. But the interesting thing is that the convergence rate is much, much faster, okay? It converges very quickly.
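A sketch of the enhancement just described (often referred to as EI-ELM): at each step draw k candidate random nodes and keep only the one that reduces the residual the most. The assumptions (sigmoid nodes, uniform random parameters, names) are the same as in the earlier sketch and are mine, not the author's reference implementation.

```python
import numpy as np

def ei_elm(X, T, max_nodes=50, k=10, rng=None):
    """Enhanced I-ELM: among k random candidate nodes per step,
    add the one giving the largest residual decrease."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, d = X.shape
    residual = T.astype(float).copy()
    nodes, betas = [], []
    for _ in range(max_nodes):
        best = None
        for _ in range(k):
            a, b = rng.uniform(-1, 1, d), rng.uniform(-1, 1)
            g = 1.0 / (1.0 + np.exp(-(X @ a + b)))
            beta = (residual @ g) / (g @ g)
            err = np.linalg.norm(residual - beta * g)   # residual norm if this candidate is chosen
            if best is None or err < best[0]:
                best = (err, a, b, g, beta)
        _, a, b, g, beta = best
        residual -= beta * g
        nodes.append((a, b)); betas.append(beta)
    return nodes, np.array(betas)
```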
>>: That is only based on the number of [inaudible] based on the [inaudible]. So each
time we need to [inaudible] several [inaudible].
>> Guang-Bin Huang: [inaudible] I have a slide to show you to address your concern. Say, for example, you see [inaudible] this and that. One curve is the original I-ELM, adding nodes one by one — every time it just generates one. The other curve is the new one [inaudible]: every time we generate maybe 10 candidates and among the 10 we choose the best one. You see? The convergence rates are quite different.
>>: [inaudible] the scale like 10 [inaudible]. 10 times like 10 times, then go through it
maybe 10 times more, do they convert?
>> Guang-Bin Huang: Ah, I will come back to this question later. So you see, for this Abalone case, the original convergence curve is here and the new convergence curve is here. To get the same accuracy, say 0.09 as our target, the original needs maybe 330 hidden nodes; the new one, 21 hidden nodes. Yes?
>>: [inaudible] the small number of nodes is tuned to the data. Are you subject to overfitting them? Overfitting…
>> Guang-Bin Huang: In this case? In this case the nodes are still generated the same way; I still follow the same so-called procedure, adding them one by one, and the residual error is always strictly decreasing. So the overfitting is not so obvious.
>>: But did you verify that with cross validation?
>> Guang-Bin Huang: For this case we did not; we just keep adding. We just wanted to show the curve. Of course, if you keep adding and adding, eventually I guess [inaudible]. For this case it is not so…
>>: [inaudible] mantra of fixed [inaudible]
>>: [inaudible] by selecting one from [inaudible]
>>: Oh, I see [inaudible]
>> Guang-Bin Huang: So [inaudible] comparing these, I guess if you keep going it will eventually come. But the overfitting is not so apparent even in this case: we keep adding and it is still improving, compared with the earlier one, because the earlier one [inaudible] — there is no need to check which candidate is better, it just keeps adding. So for the first one the overfitting is not so serious; for the second one, although it is not so obvious, compared with the first it is definitely there.
>>: So this one [inaudible] [inaudible] 10 times of the computation to choose [inaudible]
10 times.
>> Guang-Bin Huang: No. You see here we are just using 20 nodes. Here you have to
use the 330 nodes. So which one is faster? This one is faster.
>>: [inaudible]
>> Guang-Bin Huang: For this you can. I will come back here. So here we say you
have a different number of candidates.
>>: So basically you are saying [inaudible] 330 [inaudible] actually [inaudible] 330
[inaudible].
>> Guang-Bin Huang: Yeah, so here you pay in time [inaudible]: suppose you added 21 nodes — actually you have already tried 210 nodes.
>>: Select the number of candidates?
>> Guang-Bin Huang: Yeah. So these are the results. If k equals 1 it means the original, basic incremental ELM; then k equals 5 is this one, [inaudible] 10, [inaudible] 20. You go further and further, right? Sometimes they converge somewhere.
>>: It looks like you will never converge for the one versus 20.
>> Guang-Bin Huang: Yeah, for the 20, that means that you will go 30 and still
decreasing maybe in some cases…
>>: That's why I told you [inaudible] keep doing that all the way down [inaudible]
>> Guang-Bin Huang: Yeah, well, this is case dependent.
>>: [inaudible] basically three factors [inaudible] accuracy [inaudible] and the number of
nodes [inaudible] I'm so [inaudible] this basically just shows [inaudible] which is
[inaudible].
>> Guang-Bin Huang: This, this is not time. This is the testing root-mean-square error. Yeah. Of course the curve for k equal to 1 is this one, right? Usually we suggest you test it [inaudible] because it runs fast, and it is easy for the user to determine, for your own case, the so-called appropriate number of candidates to be used. But generally, from the comparison we have here, usually around 20 is good for many cases. Of course for special cases you may have a different number of candidates to use. So we come back to this data again, with the number of candidates k equal to 10 and 20, and compare with the original one. You see that for k equal to 20 the accuracy here is usually better than this; usually it increases. Of course for this curve I am not sure there is a big difference, because sometimes this is 0.1 versus 0.09, right, and the values here are roughly of the same order. Yes?
>>: [inaudible] accuracy values in releasing very, very low and in the view right, and
then some different area, very there regularly higher so I mean what is it like [inaudible]
but here I don't even see what [inaudible].
>> Guang-Bin Huang: Well, this is a regression; this is for approximation. Yeah. So, okay, this had to…
>>: [inaudible] so basically why you [inaudible] is that the [inaudible] you are using
only the zeros or the information to adjust the values of the function, right?
>> Guang-Bin Huang: Uh-huh.
>>: If you use the first or the information [inaudible] you will get a faster convergence
[inaudible] found work [inaudible] much faster [inaudible].
>> Guang-Bin Huang: You are right. Here is just the basic idea. Incorporating first-order gradient [inaudible] or other information into this method could actually lead to a lot of very interesting work. But so far we are just doing the basic thing, so whatever results we show are very basic results.
>>: [inaudible] training times. That probably is not very important, the training time. So
[inaudible] [multiple speakers] [inaudible].
>>: I understand. [inaudible] much smaller than the original [inaudible]
>> Guang-Bin Huang: Uh-huh.
>>: [inaudible] so much faster. But if you're using first all of the information [inaudible]
>> Guang-Bin Huang: [inaudible] and here [inaudible] because you get a smaller [inaudible] compared with the earlier one. So this is, okay, this is the testing standard deviation, okay? So you compare the so-called, okay, a lot, okay. So this is the so-called way of adding hidden nodes one by one; we haven't tried [inaudible] everything, right. For large-scale data sets [inaudible] this could really be tried: every time just add one node and watch the performance. So that is adding the hidden nodes one by one. Then, what is the relationship between this so-called network and the extreme learning machine with a fixed network architecture? This is what I mentioned in the first talk. With that proof of the incremental ELM [inaudible], the hidden layer of this network does not need to be tuned; it can be randomly generated. And if it is randomly generated, as we mentioned last time, the training becomes very simple, right? Given the training data (x_j, t_j), we wish the output of the network to equal the targets — that is the ideal case, right? So given the training data we have these equations, where L is the number of hidden nodes. We can write them in so-called matrix format, H beta = T; this we mentioned before. In order to solve the problem we just need to solve this equation to get beta: we just randomly generate the hidden nodes, compute the hidden layer output matrix H, then take the inverse and get beta. This is the basic idea. But, as we mentioned before, from the original formulation, in order to get a stable solution — a stable solution for the inverse in particular — we had better introduce the so-called regularization parameter: add it on the diagonal, so add I over C here, okay?
So this makes the inverse a regularized inverse. After we get this, then what happens? Basically, for ELM with a fixed network architecture — before, we added the hidden nodes one by one, but now the hidden nodes are fixed — what is the best solution? It is the least-squares method, right? We just use the least-squares method and get this solution, right? H here is randomly generated, and for this case we added the regularization factor there.
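For the fixed-architecture case just described, here is a minimal NumPy sketch of the batch, regularized least-squares solution, assuming sigmoid hidden nodes and using C as the regularization constant (all names and defaults are illustrative, not from the slides):

```python
import numpy as np

def elm_train(X, T, L=100, C=1e3, rng=None):
    """Batch ELM: random hidden layer, then regularized least squares
    beta = (I/C + H^T H)^{-1} H^T T."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X.shape[1]
    A = rng.uniform(-1, 1, (L, d))                        # random input weights
    b = rng.uniform(-1, 1, L)                             # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))              # hidden layer output matrix, N x L
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def elm_predict(Xnew, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(Xnew @ A.T + b)))
    return H @ beta
```

The L-by-L form shown here is the one suited to the case where the number of hidden nodes L is much smaller than the number of samples N; the N-by-N form mentioned next in the talk swaps the roles of the two dimensions.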
>>: In practice most of the time [inaudible] all of the [inaudible].
>> Guang-Bin Huang: So you are talking about something different. We want to say…
>>: No, no, [inaudible] solution [inaudible].
>> Guang-Bin Huang: If your data set is very large then they have to use the first one, right? Or wait, maybe the second one — yeah, the second. Because here it is N times L, and for the second one you will find it becomes L times L; the size of the matrix is L times L. Then we say: if H is unknown, if the hidden layer output function is unknown to us, we can use the kernel method, right? So we define a kernel, which means h(x_i) · h(x_j) becomes K(x_i, x_j).
Now, what is the difference between this and the least-squares SVM method? This is something many people are interested in. For the kernel method we have these: actually from the first formula we get these by supposing h(x_i) · h(x_j) is replaced by K(x_i, x_j), right? This is the kernel definition used in SVM and other kernel methods. For this method, you see, it is the same formula for regression, for binary classes and for multiple classes. How about the least-squares SVM? The least-squares SVM is here; this is the least-squares SVM. For the binary case it has this formula, and the proximal SVM has this formula. Right, so compared with these, this looks much simpler in terms of implementation. And from here you see that for the least-squares SVM and the proximal SVM the kernel matrix is different: it is built from the labels, from t_1 phi(x_1), …, t_N phi(x_N).
>>: So why you say [inaudible] only two elements right? [inaudible] very large, large
[inaudible]
>> Guang-Bin Huang: No, this has to solve the same size of matrix. Here it is also the same matrix, N times N; here also N times N.
>>: [inaudible] still be the same, huh?
>> Guang-Bin Huang: You see, the formula here looks more, as I say, unified, because it is all consistent. Then this is for the binary case. And you see that there the definition of the kernel matrix depends on the output labels [inaudible], while for these the kernel matrix is irrelevant to the target labels. So this is very different.
>>: [inaudible] fast method to do this.
>> Guang-Bin Huang: [inaudible] use it here, [inaudible] using SVM, okay? So this is the least-squares [inaudible] compared with SVM. Now, for multiple classes: for the least-squares SVM, the multiclass case is here. That means if you have, say, m classes, then you use m so-called least-squares SVMs, each one for one class. But for ELM you just have one, okay? Yes?
>>: [inaudible] this right-hand side should be the same? [inaudible]. If you write it in a matrix form [inaudible]
>> Guang-Bin Huang: Actually, yesterday I checked that it is different. Here you have m of them, and each would be N times N; there you have one N times N matrix.
>>: But at the end you still have [inaudible].
>> Guang-Bin Huang: Yeah.
>>: So in the end the solution is just doing the inverse? So the main [inaudible] is the inverse. [inaudible]
>> Guang-Bin Huang: Yeah, finally we just solved this one. You see from here
[inaudible] they are similar actually even here, and even here they are similar.
>>: [inaudible] the size of the [inaudible] which was larger. This one just huge.
>> Guang-Bin Huang: Yeah, for this one, the so-called binary case, I think the complexity is the same, because both are N times N. For the multiclass case, what happens? There would be m of them, because [inaudible] you put them together, right? For each class they have to solve a so-called N times N problem, right?
[multiple speakers] [inaudible]
>> Guang-Bin Huang: No, suppose you have m classes; each class is N times N, right? Because here you have to compute m classifiers, the total time is m times N squared, all right? But here it is just one, right?
>>: [inaudible] the same because you have [inaudible] so [inaudible]
>> Guang-Bin Huang: No. Because the matrix size here is N times N; that is fixed.
>>: No. [inaudible] Look, [inaudible] is the same. The only difference would be [inaudible].
>> Guang-Bin Huang: No. Even here it is just N times N; this is fixed.
>>: [inaudible]
>> Guang-Bin Huang: So here it is always N times N; there it is m times N squared. Okay, so why are the formulas here and here different? Why are the final solutions of the least-squares SVM and the proximal SVM different? That is the key point here. They have the same constraints, but their objective functions are different — for the proximal SVM it is here — so they generate different solutions. And for ELM, the objective function is the same as the least-squares SVM's, but the conditions are different; the conditions of ELM are different from the rest. For ELM we don't have the b here, which is the bias of the output node; the least-squares SVM has the bias in the output node. What is the reason? ELM comes from universal approximation first: we say the hidden layer has to satisfy the universal approximation condition.
That means that ideally [inaudible] we do not have a b there. But the least-squares SVM and the proximal SVM come from SVM, and SVM, you know, comes from classification. When you start from classification, in the first place you do not mention universal approximation, so ideally there should be a bias in the output, because the separating hyperplanes here and here can be parallel to each other, right, and do not have to pass through the origin of the feature space. But the ELM solution starts from universal approximation, so ideally, in the feature space of the hidden layer, the separating hyperplane should pass through the origin, and we do not need to have b here. This is the main issue. Due to this b, the problem comes. You will see that for the least-squares SVM this is the primal optimization problem; then you get the dual problem, because we have to satisfy the optimality conditions. Due to the b there is a condition we have to satisfy — this is the least-squares solution, and we have to satisfy this condition. What is this condition? It means the parameters have to be obtained from this hyperplane here. But ELM has no such constraint, no such set of constraints, which means ELM gets its solution in the entire space — no constraints.
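For readers following the argument, the constraint in question is the standard one from SVM duality: differentiating the SVM or LS-SVM Lagrangian with respect to the output bias b gives

$$ \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{N} \alpha_i\, t_i = 0, $$

so the dual solution is forced onto this hyperplane in the space of the multipliers. Dropping b, as ELM does on the grounds that the separating boundary may pass through the origin of the random feature space, removes this equality constraint and lets the solution be searched in the whole space.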
>>: All I am saying is that if you can't modify the [inaudible] SVM formulation
originally saying that you constrain your hyperplane to pass through [inaudible]
optimally [inaudible] minimizing [inaudible]
[multiple speakers] [inaudible]
>> Guang-Bin Huang: Polynomial [inaudible] can be used, that is true. Different kernels may actually be better for different applications. The issue is that [inaudible] SVM does not assume universal approximation; if you use [inaudible] that condition there, then SVM becomes this, but then you can even use a linear kernel. That means SVM appears to be a very big horse, a very perfect and beautiful horse that you can use here or there. And that is true: without SVM the world would be different — character recognition, face recognition would be quite different. But the issue is, I say, that it is too general; it covers too many scenarios, so we had to make it simpler. That means, okay, you have this and this; I just use the universal approximation condition — the kernel has to satisfy universal approximation. Then the least squares here also follows the SVM framework and we say okay, we just [inaudible]. Then everything becomes simplified, so [inaudible] this, right, and the solution becomes very simple. That is the target. Of course, generally all of this [inaudible]. But after we remove those redundant, unnecessary conditions, finally everything becomes unified. In this case we don't have the one-against-one problem or the one-against-all problem; everything becomes simple. With [inaudible] SVM you have the [inaudible] for multiclass; for this case it is the same formula for all, okay? So that is one part; I want to…
>>: Can you go back to the previous slide? From what I understand the reason you want
to have these constraints is exactly the benefit of SVM which gives you better [inaudible]
because you can constrain your [inaudible] so you can get better [inaudible]
>> Guang-Bin Huang: No. For this case, this constraint does not come from the so-called minimum norm; it comes from b, because of the bias b there. When you use the Lagrangian to solve the so-called optimization problem, right, you have to take the gradient with respect to b, and the condition on b is what generates this constraint. Without the b, this condition would no longer be there. Okay, so this is the SVM issue; actually it comes from the original classification formulation and so on, right?
So generally, in ELM no hidden node needs to be tuned, and the hidden layer has to satisfy universal approximation, which we have to verify, okay? Right, so this…
>>: Talking about the universal approximation condition: when you use this kernel method that you have, how does that proof carry over? It probably doesn't. [inaudible] the concept of the hidden nodes here [inaudible].
>> Guang-Bin Huang: No. This is why I say ELM does not work for all kernels; we have to say that. The kernels have to satisfy universal approximation. If a kernel satisfies universal approximation, then it works.
>>: So how do you judge whether [inaudible].
>> Guang-Bin Huang: I see, this is the other issue: whether this kernel or that kernel qualifies. By the way, we say that in general it may be better to have a kernel that satisfies universal approximation, because if it does not, that means for some application, for some target function, you cannot match it, right, and you want it to be usable everywhere. Of course, if it can be used here, good; but you may not be able to use it in some other place. So this is…
>>: [inaudible] kernel [inaudible] then how do you know if that is certified [inaudible]
but not in terms of the [inaudible] proof again.
>> Guang-Bin Huang: Yeah, so we tested it. We actually ran it on the MNIST data [inaudible]: the input is 28 by 28 and there are 10 classes, and we built on what we had already made before. For comparison we refer to Hinton's paper: he mentioned that SVM can reach 98.6% testing accuracy, and his deep learning method reached 98.8% accuracy using all 60,000 training data. We ran the ELM method [inaudible] and got 98.7% accuracy with 50,000 training data. Because of a memory issue we couldn't use all of the 60,000 data; we just used 50,000 training data for this case and reached here, but if we added the rest of the training data we might reach 98.8%, okay? Of course [inaudible] has a lot of additional tools [inaudible], so I guess combined with his work it could, I think, go even higher, right? That is my expectation, okay?
So far we have talked about adding hidden nodes one by one, right? This, I say, could be used efficiently for the application [inaudible]; in a lot of cases you just pool the data and the so-called data set is fixed. So, can we learn the data chunk by chunk? In this case the data may come in one by one or chunk by chunk, right, so that at any time only the newly arrived data are learned. After the data are learned, they should be discarded, because you don't know how much data will come; if every time new data come in you have to retrain on your old data as well, that would be very, very time-consuming. So whenever new data come in, you just learn the new data, and then the new data [inaudible], okay? Also, the method is assumed to have no knowledge of how much data the application will deliver. In this case, how do we train the network? We can use ELM — due to the simple formula, I have to say, due to that simple formula — and then we can use a recursive method, based on recursive least squares. That is a simple matrix method: say you have a large matrix and you want to invert it; you process the first chunk first, then the next chunk, and then the next. Finally you get a solution equivalent to the batch solution on the entire matrix. This is a method for us to handle applications with larger data. So we compare with [inaudible]; of course here is a simple example on a small data set. All right, we use sigmoid [inaudible], BP and some other [inaudible] which can learn the data chunk by chunk or one by one, okay? This is the training time, right, and this is the testing accuracy comparison. Usually it is fast and good.
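A sketch of the chunk-by-chunk update just described, in the standard recursive least-squares form usually associated with online sequential ELM. The split into an initialization step and an update step, and all names, are assumptions of this reconstruction rather than the speaker's exact formulation.

```python
import numpy as np

def hidden(X, A, b):
    """Fixed random hidden layer shared by all chunks."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def os_elm_init(X0, T0, A, b):
    """Initial chunk: P = (H0^T H0)^{-1}, beta = P H0^T T0."""
    H0 = hidden(X0, A, b)
    P = np.linalg.inv(H0.T @ H0)
    beta = P @ H0.T @ T0
    return P, beta

def os_elm_update(P, beta, Xk, Tk, A, b):
    """Learn one new chunk and then discard it; old data are never revisited."""
    Hk = hidden(Xk, A, b)
    K = np.linalg.inv(np.eye(len(Xk)) + Hk @ P @ Hk.T)
    P = P - P @ Hk.T @ K @ Hk @ P                    # recursive update of (H^T H)^{-1}
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)        # correct beta using only the new chunk
    return P, beta
```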
>>: So again for testing [inaudible] smaller number of [inaudible]
>> Guang-Bin Huang: Yeah. 13.
>>: Smaller?
>> Guang-Bin Huang: 13.
>>: [inaudible] 25. Do you get better with [inaudible]?
>> Guang-Bin Huang: No. This is what we tried in order to get the best: for the BP-based [inaudible] each [inaudible] had to be tuned very carefully, exactly. After a while you see quite a difference. For the [inaudible] the hidden nodes are randomly generated — here is one, there is one — so the curve is really very flat. With too few it can still sort of work; with too many it is not good, right?
>>: Sorry sir, so what is the size of the data set?
>> Guang-Bin Huang: The size was very, very small, because we just wanted to see whether it worked, okay? Yeah. We also tested the multiclass case. Here we let the data come chunk by chunk, and these [inaudible] mean that the chunk size is randomly generated between 10 and 30 — sometimes 10, 15, 20 or 30 — so it is not just coming one by one. Most of the existing [inaudible] learning methods learn the data one by one, not chunk by chunk. Here is the DNA case, which has a high input dimension, and this is the accuracy we achieved. These numbers are from the original version, without adding the regularization factor. With the regularization factor, as I mentioned last time, the accuracy is very, very high, okay, the same as SVM, right? Okay.
So now we come to the conclusion; most of it I have mentioned before. The issue with the kernel method — the one I tested, which gives 98.7% accuracy, almost the same as the rest — is that for a large data set a large amount of memory is required. If you have several million data points, the kernel matrix would be several million times several million, which would be impossible for us to handle. So the key point is: can online sequential learning be applied to the kernel case? We are all working in this direction, but the problem is open. It is very, very challenging, especially for SVM: people have tried to build online sequential SVM methods, but so far it is still very difficult to handle larger data sets. So, can we do it? The second thing is: can we prove that ELM — not always, but usually — provides better generalization even though it is randomly generated? It does not give a bias toward the training data or the test data. Usually we found that with the same kernel used, ELM gives better generalization performance than SVM and the least-squares SVM, and the reason [inaudible] — that is the reason I mentioned. Can we rigorously prove it? So far that is just intuitively speaking. And then, ELM is always faster than the least-squares SVM. This, I think, can also be proved, because we can see it from here: ELM solves this formula, while SVM and the least-squares SVM solve this formula, right? At least we don't have this part, we don't have this part, right? For the binary case [inaudible] this looks similar, and it is no slower than the least-squares SVM, right? So that is the conclusion. Okay, yeah, thanks. [applause]
>>: So, about carrying over this proof to the kernel case, with the convergence: [inaudible] you proved that the [inaudible] error reduces to zero. Now for the kernel case, how do you do that?
>> Guang-Bin Huang: Well, for the kernel case, take the Gaussian function [inaudible]: not all kernels, but for the Gaussian function that was already proved before. It does not come just from my proof; my proof is for the random case, yeah. Then we say this method can be extended to the kernel method — it is just one step away — so we extend it from random nodes to the kernel machine, right? But not all kernels, just the kernels that satisfy the universal approximation condition.
>> Li Deng: Any other questions? Okay thank you.
>> Guang-Bin Huang: Thank you very much.