>> Zhengyou Zhang: Okay. So it's my pleasure to introduce Professor Guang-Bin
Huang from Nanyang Technological University of Singapore, NTU. So he is
talking about his long-standing research interest, the extreme learning machine.
He graduated from Northeastern University in China in applied mathematics and
has been with NTU for quite a few years. And he's associate editor of
Neurocomputing and also IEEE Transactions on Systems, Man and Cybernetics.
And he's organizing our workshop. You can advertise your conference later. So
please.
>> Guang-Bin Huang: Thanks, Dr. Zhang, for inviting me here. It's my honor to
give a talk to introduce the extreme learning machine.
This is actually -- the idea was initially initiated in -- actually 2003; that was
the first time we submitted papers. And then it became recognized by more and
more researchers recently. So we just had a workshop in Australia last December.
And we are going to have an international symposium on extreme learning machine
this coming December in China -- it's Dr. Zhang's hometown, or near his
hometown. So we're hoping you can join.
Okay. So what is extreme learning machine? Actually, this is a technical talk
about -- talking about the kind of learning method. Such a learning method is
different from the traditional learning method. So tuning is not required.
So if you wish to understand the extreme learning machine, I think it's better to go back and
review the traditional feedforward neural networks, including support vector
machines.
I assume many people consider support vector machines not to belong to neural
networks. But actually, in my opinion [inaudible], they have the same target, the same
architecture in some aspects, in some sense.
So after we review feedforward neural networks, we can go on to
introduce what is called the extreme learning machine. Actually, the extreme learning
machine is very, very simple. Okay. During my talk I also wish to give a
comparison between the ELM and the least-square SVM. So finally I wish to show the
linkage between the ELM and also the traditional SVM, so we know what is the
difference and what is the relationship. Okay.
So feedforward neural networks have several types of frameworks,
architectures. One of the popular ones is the multi-layer feedforward neural network.
But then, in theory and also in applications, people found a single hidden layer is
enough for us to handle all applications in theory. So that means, given any
application, we can design a single-hidden layer feedforward network.
This single-hidden layer feedforward network can be used to approximate any
continuous target function. It can be used to classify any disjoint regions.
Okay. So for single-hidden layer feedforward networks, we usually have
two types of popular architectures. The first one is the so-called sigmoid type of
feedforward network. That means the hidden layer here uses sigmoid type
nodes. Okay. Sometimes I call them additive hidden nodes. That means the input of
each hidden node is a weighted sum of the network input.
Okay. So of course for G here, usually people use a sigmoid type of function. But
you can also write the output of the hidden node as uppercase G(a_i, b_i, x).
x is the input of the network; a_i, b_i are the hidden parameters --
say, for hidden node i, a_i and b_i are the parameters of node i.
All right. So this is the sigmoid type of single-hidden layer feedforward network. Of course
another one that is very, very popular is the RBF network. In the RBF network, the hidden
node output function is an RBF function.
So if you rewrite it in this compact format, we actually have the same
so-called output function for the single-hidden layer feedforward network as for the
sigmoid type network. So here it is always uppercase G.
So these two types of networks have been very interesting in the past two to three decades.
Two groups of researchers wrote papers and worked on these two areas, and they
considered them separate. And they use different learning methods for these two
types of networks.
But generally, for both types of networks, we have this so-called theorem: given
any continuous target function f(x), the output function of this single-hidden
layer network can be made as close as desired to this continuous target function, for any given error epsilon.
And definitely, in theory, we can find such a network, so that the output of this
network can be as close as desired to this target function f(x).
Of course, in real applications, we do not know the target function f(x). All right.
So, as in signal processing, we have sampling; we can only sample discrete
samples. And we wish to learn from these discrete samples, the training samples. So we
wish to adjust the parameters of the hidden layer and also the weights between the
hidden layer and the output layer, and try to find some algorithm to learn these
parameters so as to make the output of the network approximate the target
function.
Okay. From the learning point of view -- we are given, say, N training
samples (x_i, t_i), and we wish the output of the network with
respect to the input x_j to equal the target output t_j. Of course, in most cases,
your output is not exactly the same as your target, so there is some error there.
Suppose the output of the network is o_j; then we wish to minimize this cost function.
In order to do this, many people actually spent time finding different
methods on how to tune the hidden layer parameters a_i, b_i and also the weights
between the hidden layer and the output layer, that is beta_i, which I call the output
weight.
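The cost function being minimized here is presumably the usual sum of squared errors over the N training samples,

    E = \sum_{j=1}^{N} \lVert o_j - t_j \rVert^2, \qquad o_j = \sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x_j),

with L hidden nodes.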
All right. So that is the situation for the learning. That is for approximation.
But how about classification? In my theory published in the year 2000, we say that as
long as this kind of single-hidden layer feedforward network can
approximate any target function, this network in theory can be used to
classify any disjoint regions. So this is the classification case. Yeah.
>>: So there's a very well-known result in neural networks that you probably
know which says that, in order for the [inaudible] to be valid, the number of units
has to be very, very large, and that if you go to infinity [inaudible] what's called a
Gaussian process.
>> Guang-Bin Huang: Yeah, you're right.
>>: So that gives [inaudible] indication of processing [inaudible].
>> Guang-Bin Huang: Ah. Okay. That is actually a very useful theory. That one
is actually used for our further development -- that is why we come to extreme
learning machine. That is a guide.
But an infinite number of hidden nodes is usually not required in real applications.
In theory, in order to approximate any target function -- say, with the error epsilon
reaching zero -- in that sense an infinite number of hidden nodes needs to be
given. But in real applications we do not need that. I will mention this later.
But that theory is very, very important. Okay. Without -- um-hmm.
>>: Also, recently many people observe that if you have many layers, actually you
can [inaudible] single layer system, even if you have very large number of hidden
units in a single layer system.
>> Guang-Bin Huang: You're right. So we have another paper -- actually, I
didn't mention it here -- where we prove, for instance -- never mind three hidden layers, just
take two hidden layers -- that compared with a one-hidden-layer architecture, a two-hidden-layer
network usually needs a much smaller number of hidden nodes than a
single hidden layer.
>>: [inaudible].
>> Guang-Bin Huang: Yeah. I proved it theoretically and also showed it in
simulations. So that means that, from the learning capability point of view, the multi-hidden-layer
network looks more powerful. But that is actually not shown here. Okay. That one
we can discuss later.
Okay. So to learn these kinds of networks, in the past two
decades most people used gradient-based learning methods. One of the most
popular ones is back-propagation -- of course, and its variants. So many variants:
people adjust some parameter and they generate another learning method.
Okay.
Another method, for the RBF network, is the so-called least-square method.
But this least-square method is somewhat different from ELM,
which I will introduce later.
A single impact factor is used in all hidden nodes. That means
all the hidden nodes use the same impact factor
[inaudible]. Okay. Sometimes it is called sigma. Right?
Okay. So what are the drawbacks of those gradient-based methods? Usually we
feel it is very difficult in research: different groups of researchers work on
different networks. Actually, intuitively speaking, they have similar
architectures, but usually RBF network researchers work on RBF, and feedforward
network people work on feedforward networks. So they consider them
different, and we actually sometimes waste resources.
And also, in all these networks, for users -- actually [inaudible] users -- there are
sometimes so many parameters for the user to tune manually. Right? It's case by case.
Okay. So it's sometimes inconvenient for non-expert users. Okay.
We also usually face overfitting issues. This is why, if too many
hidden nodes are used in the hidden layer, we are afraid of the
overfitting problem. Right?
And also, for RBF, a local minimum. Right? So you cannot get the optimum
solution; usually you get a local minimum solution, which is better for that local area but
not for the general case -- the entire application.
Of course, it is time-consuming -- time-consuming not only in learning [inaudible] but
also in human effort. Humans have to spend time to find the so-called
proper parameters, the user-specified parameters.
So we wish to overcome all these limitations and constraints of the original learning
methods.
Now, let's look at the support vector machine. Is there any relationship between the
support vector machine and the traditional feedforward network? Of course, when
SVM people talk about SVM, they never talk about neural networks. They say
they're separate. So this is why there is a story.
When I joined [inaudible] in 2004 -- before 2004, SVM papers seldom
appeared in neural network conferences. Then in 2004 the organizers -- at the top
we have a committee, and in the committee meeting they said, why are so
many SVM papers coming to a neural network conference this year? Okay. So people consider
these different.
But what I found is that they are actually very close to each other. Generally speaking,
they are the same, with the same network architecture. Let's look at the SVM.
So for SVM, of course, we talk a little bit about the optimization objective. SVM is to
minimize this formula, this objective function: minimize the norm of the output
weights plus the training error, subject to these [inaudible]
conditions. Right?
But looking at the final solution, the decision function of SVM is this. But what is
this? K here is the kernel of the SVM, and [inaudible] is a parameter we want to find.
Okay. But looking at this formula, this is actually exactly a single-hidden layer
feedforward network. What is the single hidden layer? The single hidden layer is formed by
the hidden nodes with these kernels. Right? So K(x, x_1), K(x, x_2), ..., K(x, x_i), ..., K(x, x_N). So
this is the hidden layer. Right? This is the hidden layer with this kernel.
So what is the output weight, then? The output weight is alpha_1 t_1, ..., alpha_i t_i, ..., alpha_N t_N.
That is the output weight. That is the beta [inaudible] in the feedforward
network. Okay. Yeah. Please.
>>: This is [inaudible].
>> Guang-Bin Huang: Okay. Yeah. This is not the objective of the SVM. But
finally it turns out to be in this formula. So I am speaking from the architecture point of
view: finally they have a similar architecture.
>>: [inaudible] was called new ways to train neural networks. So that was a
connection from the beginning [inaudible].
>> Guang-Bin Huang: Yeah. Yeah. So actually -- yeah. You are talking about
[inaudible] the paper published in 1995. So Vapnik -- what was the so-called
inspiration for [inaudible]? Vapnik said actually we have to go
back to the original neural network.
Okay. So [inaudible] actually in 1962 [inaudible] the feedforward network was studied
a lot in those years. In those days, people had no idea how to tune the
hidden layer. So then [inaudible] said, how about this: we consider a [inaudible]
layer [inaudible] hidden layer -- you could consider the last hidden layer. The last
hidden layer's output is the mapping function; look at the output of the mapping function.
So then, can we find some way to just determine the last hidden layer's
output function? That is the feature mapping function. Then you can [inaudible]. So then
Vapnik said, okay, the output function of the last hidden layer
can be considered as phi(x_i). But what is phi(x_i)? I have no idea. So then
Vapnik's paper said, how about letting phi(x) [inaudible].
So we have this kind of constraint, because this is very important for classification.
Under this constraint they finally get this. So although the last hidden layer
mapping function -- that is, the SVM feature mapping function phi(x) -- is [inaudible],
we can find the corresponding kernel. So that's how we come to this stage.
But I should -- I'm [inaudible]. I'm talking from a structural point of view: if
we go to this stage, it finally turns out to be in the same format. Of course, from here the
question is how to find the alpha_i t_i. How to find the alpha_i t_i is how to find these output
[inaudible]. So you can consider it a single-hidden layer feedforward network.
A single-hidden layer feedforward network also tries to find the parameters. So SVM and the
traditional BP network -- in this sense they are just different ways to find the parameters
of the single-hidden layer feedforward network, single-hidden layer feedforward
networks.
SVM finds them in the SVM way; BP finds them in the BP way; RBF finds them in the RBF
way. So this is why different people find the parameters in different ways. So then
the question is, can we unify them? Okay.
So this is one of my so-called research works. Okay. Actually, in order to
show the linkage between the SVM and the single-hidden layer feedforward
network, I think people had better read these two papers. Okay. Because these
two papers also gave me some ideas, inspired my ideas. And they built the linkage
between the ELM and SVM first. So then I found this linkage also.
Okay. So now let's go on to talk about the extreme learning
machine. Of course, the extreme learning machine originally started from neural
networks; we come from neural networks. So in those days we tried to -- we
started from the BP network.
So we found BP is inefficient. But what is the sense, what was the original
expectation for neural networks? We try to emulate human thinking. A human
can think of something and find a solution very quickly. So the question is, do we
have tuning in the human brain? Usually, to me, there is no tuning there in most cases,
right? We just find a solution directly, or we can learn very fast. Right? Either very
fast, or without tuning.
But go back to the original learning methods of neural networks. Whatever
learning machine you see, there is always tuning there. Right? You always have to find the parameters
there. Can we have some way to simplify the implementation, the
computation, involved in these methods?
So then we found -- first, from all this [inaudible], we see everything can be
simplified in this single-hidden layer feedforward network.
So all the hidden layer parameters are a_i, b_i. But the hidden node here may not be
a neuron. It can be a kernel; it can even be another subnetwork. But each hidden
node has parameters a_i, b_i. Those are the parameters.
[inaudible] Rather than writing each output function of the
hidden nodes in full detail, we just write them in this compact format. Okay. So
then the output function of this network is f(x) equals the summation of
beta_i G(a_i, b_i, x). G(a_i, b_i, x) is actually the output function of the i-th hidden
node. It can be an RBF kernel; it can be a so-called sigmoid type node or
something else. It can even be a non-differentiable [inaudible] node.
Okay. So then you see this hidden layer, the entire hidden layer -- what is the output
of the entire hidden layer? Suppose the entire hidden layer has L hidden nodes. Then what is
its output? It is h(x), the vector of these L elements -- the output functions of these L hidden nodes. Right? It's a vector.
h(x) here is the feature mapping, the hidden layer's so-called output mapping.
Right?
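In the standard ELM notation (reconstructed here, with L the number of hidden nodes),

    h(x) = [\, G(a_1, b_1, x), \ldots, G(a_L, b_L, x) \,], \qquad f(x) = \sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x) = h(x)\,\beta.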
Okay. So in this case, the question is that in the traditional methods, a_1, b_1, ..., a_L, b_L all
have to be tuned. This is why we have to have the
gradient-based methods or the SVM method. Because, you see, if the
parameters of the hidden layer have been tuned, the result you obtained in the earlier stage
for the output layer -- beta_i here -- may not be optimal. You have to adjust the
output layer again. And if you adjust the output layer, beta_1 to beta_L,
then the parameters in the hidden layer may not be optimal anymore. So
you have to adjust again. This is why you always have to adjust, right,
iteratively.
So then, luckily, in our theory we find that tuning in the hidden layer is actually not
required. All the parameters in the hidden layer can be randomly generated. This goes
back to your question: this assumes an infinite number of hidden nodes can be
used. Of course, this assumption we actually do not need.
Okay. Right? So this also goes back to the SVM. In SVM you go to the feature space,
and the dimension of the feature space is supposed to be very large -- how large?
It can even go to infinity. Right? So that is why they have the kernel
concept there.
So you could say the number of hidden nodes in the hidden
layer can be infinite, or as large as possible. Then we
have this conclusion: tuning is not required. All the parameters in the hidden
layer can be randomly generated.
This is based on an assumption. Suppose that, given an output function G of the
hidden layer, there exists a method, okay, which can be
used to tune and find the proper parameters a_i, b_i.
That means [inaudible] we have a tuning method -- a BP method, an RBF method, or an SVM method, whatever method. As
long as there exists such a tuning method which can be used to train our
single-hidden layer feedforward network, okay, then such tuning is not required. Such a
method is not required. The hidden layer can instead be randomly generated. That is given
in this paper. Okay.
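The result he is pointing to can be paraphrased roughly as follows (this is a restatement of the published ELM universal approximation theorem, not the slide's exact wording): if G(a, b, x) is a nonconstant piecewise continuous function, and single-hidden layer networks with such nodes can approximate any continuous target when the parameters are tuned, then with (a_i, b_i) drawn randomly from any continuous distribution,

    \lim_{L \to \infty} \Big\lVert f(x) - \sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x) \Big\rVert = 0

holds with probability one, for appropriately chosen output weights beta_i.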
So -- and also, another assumption is that G here should be nonlinear piecewise
continuous. It cannot be just any arbitrary function. Then -- yeah. Please.
>>: Can you explain the difference between the AI and BI?
>> Guang-Bin Huang: Okay. So these [inaudible] a_i, b_i -- sometimes they can be put
together as just one parameter. Okay? But a_i, b_i here are as in the traditional
neural network. You see the RBF network: you have the center and the impact factor. So for an
RBF hidden node, a_i is the center and b_i is the impact factor.
If you have a sigmoid function, a_i is the input weight and b_i is the bias of the hidden
node. So this [inaudible] comes from the traditional neural network; it is natural
for neural network researchers. Usually they have two parameters, so this is why
I put a_i, b_i.
But if you go to the frequency [domain], usually you have the position and the
frequency. Right? That is, the location and the frequency. Right? So usually you have
the two.
>>: What is it for SVMs?
>> Guang-Bin Huang: Oh. For SVMs, you talk about the kernel
[inaudible]. For the kernel, the so-called a_i is the input, that is, the samples,
the training samples. b_i is the so-called kernel parameter, that is, sigma. But,
of course, here we talk about random generation. So later on we can build the linkage
between SVM and this method. Okay? But of course here we'll talk about random
first.
Okay. So -- yeah.
>>: [inaudible] any application do we need to tune AI [inaudible]?
>> Guang-Bin Huang: I will address this issue later. Yeah. It is an important
question. I will come back. Yeah.
Okay. So since all the hidden node parameters need not be tuned -- actually, not tuned in the sense that, before you see the training data, you can
randomly generate the parameters first. Okay. Then, say, for this application, you
can use the already randomly generated nodes for that kind of application.
That means the parameters can be generated independently of the training data, of your
applications.
So since a_i, b_i -- all the hidden node parameters -- can be randomly
generated, we go here. Suppose you have N training samples. Given any
training sample x_j, we wish the output of this network
[inaudible] to equal your target t_j. Right? The left-hand side is the output of the network;
we wish this to equal your target. So we have L hidden nodes, so here
we have L [inaudible]. The only unknown parameters are the beta_i, because we have
no idea what beta_i is -- that is what we want to find.
The hidden layer can be randomly generated and t_j is your target, and we
have N training samples. So from here we know we have N equations
with L unknown parameters, right?
So then these N equations can be written in the compact format
H beta = T. H is what we call the hidden layer output matrix. Right? It's also the
so-called hidden layer mapping -- okay, feature mapping. Okay? So we have the
N training data. Right? So each row represents the output of the hidden layer with
respect to one training input x_i.
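Written out in the standard ELM form (reconstructed),

    H = \begin{bmatrix} G(a_1,b_1,x_1) & \cdots & G(a_L,b_L,x_1) \\ \vdots & \ddots & \vdots \\ G(a_1,b_1,x_N) & \cdots & G(a_L,b_L,x_N) \end{bmatrix}_{N \times L}, \qquad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix},

and the system to solve is H beta = T.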
Okay. So we wish to -- T is already given, and we want to find this parameter. So for
H beta = T, H is already known and T is already known. So then we can
find the parameter beta easily. Three steps, then.
The first step: randomly assign the hidden node parameters a_i, b_i. Then
calculate the hidden layer output matrix H. All right. Then just apply the
pseudo inverse to get beta. That's all.
So those are the three steps. Okay. So the source code can be found here.
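For concreteness, here is a minimal sketch of those three steps in Python (illustrative only -- it assumes sigmoid additive hidden nodes and is not the source code he refers to):

    import numpy as np

    def elm_train(X, T, n_hidden=1000, seed=0):
        """X: (N, d) inputs; T: (N, m) targets. Returns (A, b, beta)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Step 1: randomly assign the hidden node parameters (a_i, b_i).
        A = rng.standard_normal((d, n_hidden))
        b = rng.standard_normal(n_hidden)
        # Step 2: calculate the hidden layer output matrix H (sigmoid additive nodes).
        H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
        # Step 3: output weights beta via the Moore-Penrose pseudo inverse of H.
        beta = np.linalg.pinv(H) @ T
        return A, b, beta

    def elm_predict(X, A, b, beta):
        H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
        return H @ beta

    # Usage on the noisy SinC regression problem mentioned later in the talk.
    X = np.random.uniform(-10, 10, (5000, 1))
    T = np.sinc(X / np.pi) + np.random.normal(0, 0.1, X.shape)   # sin(x)/x plus noise
    A, b, beta = elm_train(X, T, n_hidden=100)
    print(np.mean((elm_predict(X, A, b, beta) - T) ** 2))        # training MSE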
For this, we find that simple math is enough in many cases. Of course, for
this single-hidden layer feedforward network, the hidden node, okay, need not be of
sigmoid type; it needn't be of RBF type; it can be something else. [inaudible] hidden nodes,
because previously [inaudible] could not be trained directly. So with this method,
[inaudible] can also be trained.
But that is the basic idea. So now, what relationship -- yeah.
>>: [inaudible] generalized linear models, where you take the input data, you map it
to some nonlinear thing, and then you take a linear combination, which looks
exactly like this. So how is it related? How is [inaudible]?
>> Guang-Bin Huang: What is a nonlinear --
>>: Generalized linear [inaudible].
>> Guang-Bin Huang: Uh-huh.
>>: Where you basically map the data to a nonlinear form and then take the linear
combination [inaudible]. So it looks very similar to this [inaudible] what the
difference is.
>> Guang-Bin Huang: Yeah. So many methods actually look similar, but the key
point is that here the hidden nodes are all randomly generated, [inaudible] other
methods. In those, the user has to either preset some parameters
or tune them in some way, and then get the hidden layer output; only then do they use
the least-square method. Even in the original RBF method, they have to do
something first, rather than going to randomly generated -- yeah.
>>: When you say that taking a random hidden layer is enough, and I don't
understand what enough means. I mean, does it mean that you get uniformly good
approximation across the input space? I mean, this can't be a
distribution-dependent thing. You can't get actual learning from this. This has to
be assuming some type of uniform distribution on your input space.
>> Guang-Bin Huang: Okay. It's easy enough to talk about here: say, given any target
function, any continuous target function f(x), we can say that for such a
single-hidden layer feedforward network -- it may not be a neural network, okay -- as
long as the number of hidden nodes is large enough, the output of this
single-hidden-layer network can be as close as desired to your target function. So given any
target function, any continuous target function,
we can make the output of the network as close as desired to your target. So this
is -- because, from the learning point of view, in theory we wish to have this
universal approximation property.
>>: I think what you want is to get -- to learn the specific distribution from that end, so --
>> Guang-Bin Huang: Oh.
>>: This is some kind of extreme L infinity type of -- or, I mean, you're talking
here about a uniform convergence, right, across the entire space, whereas we may
actually care about some little part of the space which is governed by the
distribution [inaudible]. I mean, that's what [inaudible].
>> Guang-Bin Huang: You're right. So talking about -- say you consider the entire -- so
here we talk about the error over the entire [inaudible] space. Of course, if we talk about
some so-called [inaudible] subspace, that could be different. So this
issue actually applies to many, many methods. Even with BP, they also
talk about the general case. Even SVM talks about it over a space: given this region, we wish
to learn and see how good we are in this entire region instead of specific so-called locations,
small areas. But that is a problem for general learning methods. Because
learning methods usually say: give us this space, give us this training data, any training
data; we wish to learn the overall performance on any training data.
Of course, somehow we are also interested in some particular subspace. But that is
a different matter then.
>>: I don't think you understood what I said. When we get a training set, you care
about the distribution from which that data set came. You don't care about
[inaudible]. [inaudible] specifically has an analysis, so the only learning and generalization
[inaudible] SVM talks about is generalizing with respect to the distribution of the
training data, not uniformly across the space.
>> Guang-Bin Huang: Okay. You're talking about distribution of the training
data. In which sense?
>>: In which sense -- you're assuming that your sample is [inaudible] from some
distribution. That's the distribution with respect to which you measure risk and with respect
to which you want to generalize. You don't really care about doing anything
uniformly across the entire space. In essence [inaudible] some infinite dimensional
thing, and all you really care about is the subspace spanned by your data and the
distribution over it [inaudible].
>> Guang-Bin Huang: Okay.
>>: Here this is -- this is kind of a -- you know --
>> Guang-Bin Huang: Okay. I will come back to your [inaudible] -- so why we find it
can be so-called -- even the least-square can then be simplified to this stage.
Okay. I will come back later. Yeah.
So those are the three steps. But [inaudible] let's talk about these three steps. So
usually, after we get beta -- beta can be, say, calculated through the pseudo
inverse. There are different ways to calculate the pseudo inverse; that is one
of the ways, the [inaudible] method. So we say the pseudo inverse -- okay, beta -- the so-called H can be computed in this way or in this way. Right?
But from the ridge regression theory point of view, in order to improve the
stability, improve the performance of the pseudo inverse, we usually wish to add
a so-called [inaudible] to the diagonal entries of this matrix. So then we -- here.
So after we added this, the issues become very interesting. If HX is known to us,
this is different from SVM. In SVM, HX usually is unknown. So if HX is known
to us, then we have this. All right.
If HX is unknown, we can also use that same format. We can use a kernel format.
So from the first formula. So H times H, then we get this, kernel format.
In this case, as long as K is known to us, even the feature mappings function HX is
unknown to us, we can get this formula. Okay. So this is another formula from
ELM, from the ELM we mentioned in this paper.
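The two forms being contrasted are, in the published regularized ELM formulation (reconstructed here, with C the regularization constant and Omega_ELM = H H^T the kernel matrix with entries K(x_i, x_j)):

    \beta = H^T \Big( \frac{I}{C} + H H^T \Big)^{-1} T, \qquad f(x) = h(x)\,\beta,

when h(x) is explicitly known, and

    f(x) = \big[ K(x, x_1), \ldots, K(x, x_N) \big] \Big( \frac{I}{C} + \Omega_{ELM} \Big)^{-1} T

when only the kernel K is known.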
So then what happens? You compare these two. This is the solution of the
ELM; this is the solution of the least-square SVM. For the least-square SVM, if you
remove the first row and first column of the least-square SVM
solution, this part is actually very similar to this part of the ELM
solution, right?
But then, of course, here we have the K times K; here it is Z times Z. So
they have this one. Okay. So roughly it looks similar. But this is actually a
simplified version of the least-square SVM.
Why this simplified version? [inaudible] in this paper, because we found there may
be some flaw in the least-square SVM and the SVM theories, which I will mention later.
If you look at the least-square SVM, we have the T there. So [inaudible] here
is actually a kernel matrix; Omega here is a kernel matrix. But [inaudible] the
kernel should be related to the feature mapping; it should not be related to the
target.
But in their feature mapping, the kernel mapping is actually related to the target, and
this is a problem. Your feature mapping should come from the input, instead of
having feedback from the output. But in their solution, they have the t_i -- the
target is inside the kernel mapping matrix. Okay. So this is a problem there.
But then from here we have simplified it by [inaudible] the first row and
first column, to make it simpler. So I will come back to this later --
why this one can actually be so-called constant; it may have
something which can be improved somewhere. Okay.
So now let's look at the classification, the so-called performance, of the ELM. Now
let's look at the [inaudible] problem. All right. So this is the output boundary
found by ELM. And we look at another case, right; this is also a
boundary from the so-called ELM method.
So now, going back to your other question: do we need an infinite number of hidden
nodes? Usually not. So we say, as long as the number of hidden nodes is large
enough, then it is okay. So in what sense large enough? You know, in all
our simulations -- even for this case, we only have four parameters, four training data,
1,000 hidden nodes.
Even here you have, say, maybe 80,000 data, 1,000 training data or 1,000
hidden nodes. Even for other cases where we somehow have many more training data,
okay, it is the case that we only need 1,000 hidden nodes.
So that means usually 1,000 is enough; 1,000 to 2,000 should be fine for ELM
usually. This is what we find -- yes?
>>: Is that an empirical observation, or [inaudible] theory that actually suggests that that's --
>> Guang-Bin Huang: Oh, no, we do not have a theory. This is just from
experimental simulation. And it was not only found by us, but also found by some other
teams from Europe and China. Yeah. So --
>>: [inaudible].
>> Guang-Bin Huang: Ah. Okay. For the moment, what we found is that it looks likely
independent of the dimensions. Sometimes -- later on there is the issue that some are
higher dimension, some are lower dimension. Some you even say were [inaudible].
Also okay. Yeah.
>>: Also, when you said 1,000, do you mean you need 1,000 to match the performance of a [inaudible]? What do you --
>> Guang-Bin Huang: We also compare with SVM, regular SVM, later.
>>: But what I'm asking is what performance do you try to get? I mean, I've had
two.
>> Guang-Bin Huang: No. We wish to compare with the traditional SVM in the same
setting. So even if you have dimension reduction, we compare in the
same way. And even if we do not have dimension reduction, we also compare the
performance with dimension reduction. We wish to see -- if [inaudible],
we want to see the classification rate on the testing data. Yeah.
>>: [inaudible].
>> Guang-Bin Huang: Yeah.
>>: With a style of method. So for about a thousand hidden nodes [inaudible].
>> Guang-Bin Huang: Sorry?
>>: If I was [inaudible] comparing the styling methods [inaudible] actual error rate
[inaudible] same error, you need a thousand hidden nodes with ELM.
>> Guang-Bin Huang: Yeah.
>>: So what's the [inaudible]? What's the largest problem you can apply this to?
>> Guang-Bin Huang: Okay. So this problem -- because, you see, ELM's
so-called scalability is actually very, very good. Because you see the function: the
calculation is very, very simple. You look at it here; this is actually unified for
[inaudible] multi-label classification also.
But this here is similar to SVM; it's just for binary. So for a very
large case, SVM and least-square SVM usually have so-called
[inaudible] kinds of methods, for both SVM and least-square SVM.
So usually they have to find a supercomputer for a very large case. But we
usually run it on a home computer or just the desktop in the lab.
So the largest one we ran [inaudible] has that many data, with each
so-called data instance having, say, 59 input [inaudible]. But for the rest, due to the
[inaudible] computing, it is somehow very difficult; they have to wait -- somehow wait for months to get the result. This is a problem.
But for us it is just several minutes to get a result. So somehow we [inaudible] we
also had to wait for all the rest.
>>: [inaudible] you reach probably up to however many --
>> Guang-Bin Huang: Yeah, the maximum. I will bet you -- I have one slide to show.
>>: One question: do you have [inaudible] variability when you try a different set of
random parameters?
>> Guang-Bin Huang: For the moment we have a --
>>: [inaudible].
>> Guang-Bin Huang: Yeah, yeah, yeah, that's right. We usually have to -- we will -- in
order to compare SVM and other methods, not only SVM but also BP --
for us, actually, you know, it is just the number of hidden nodes. That
one usually can just be given, and then we run for all. But for the rest of the methods, it is messy: you
have to tune this method, tune that method. Even to find one method -- even
given one method, they have to spend a lot -- so it costs a lot of time for them
to compute.
>>: Procedure-wise, [inaudible] when you actually use a random generator to get
these a, b, how many runs do you have to [inaudible]?
>> Guang-Bin Huang: Usually just one.
>>: But you get a different [inaudible].
>> Guang-Bin Huang: For the results here, we -- later I will show we can generate
them many times and we take the average. We can see the average. Yeah.
>>: I see. Do they differ a lot?
>> Guang-Bin Huang: I will show you that. There is some [inaudible]. You will see a very
interesting result.
Okay. Let's look at these several data cases. Actually, they are not very large, okay? So
let's say we have this case: the number of training data for letter
recognition is 13,000, and testing is 6,000, okay, and then the number of input
features and classes is here.
So this indicates whether the training data and the testing data are reshuffled
every time during our trials. 'No' means no shuffle, following the
simulations done by others: the training data fixed, the testing data fixed. Okay.
Now, let's look at this case. So we talk about -- let's look at the lower part first,
where SVM, least-square SVM, and ELM all use the Gaussian kernel. The Gaussian kernel here is
not randomly generated. You look at the earlier solution here: I said if h(x) is really known to you, you can use these two formulas. If h(x) is unknown,
similar to SVM -- but for us nothing needs to be tuned then, right -- we can use the
same format here, using the kernel format. This is different from least-square
SVM. We can also use a kernel here, because usually here [inaudible] we just use
one parameter.
Of course, for any method [inaudible] you definitely have to do something for this
one parameter.
So if we talk about the Gaussian kernel method, you see over here:
this is the testing rate for SVM. Okay. So this is the training time. This is the
least-square method. And you see the ELM.
For all the methods, for all the data, the ELM achieved better generalization, better testing
accuracy. Okay, you see here, all better here.
You see the time, with the same Gaussian kernel. This time [inaudible] is all shorter
than for SVM and least-square SVM -- well, least-square SVM is much faster than SVM.
Okay. So this is using the kernel format. Right?
But you see the standard deviation here is also zero, because the training
data is fixed. Right? The training data is fixed, right? So then these three methods
have the same -- almost the same standard deviation.
>>: [inaudible] Gaussian kernel likely choosing the [inaudible]?
>> Guang-Bin Huang: Yeah. No. Not this one. Here we want to
intentionally compare SVM, least-square SVM, and ELM when they use the same kernel,
so that we compare them in the same environment. Of course, you have to choose the
parameter gamma, the impact factor, for all these three methods. Right? So this is
one comparison.
But -- because for ELM here, we say the hidden layer output function is
unknown to us, but we can use the kernel format similar to the SVM, just
kernel learning.
But if h(x) is known to us, as in the RBF network, the sigmoid network, the
traditional neural networks, then we say we just use one method, sigmoid
hidden nodes.
So then all the parameters are randomly generated, and the number of hidden nodes
is set as 1,000. Then what happens? You look here. It's very interesting. You see
the classification rate here is also higher, almost better than all the rest.
And then, of course, because they're randomly generated, there is some [inaudible]
deviation here. Even though the training data is fixed, [inaudible] fixed, you still have
some kind of [deviation], because every time you randomly generate, you have
some kind of [variation] there.
But you see the deviation is not so high. All right. Yeah.
>>: [inaudible] styling methods gap.
>> Guang-Bin Huang: Support vectors for this -- actually, this one -- I see it can be found in the paper. I cannot really remember. But usually --
>>: [inaudible] runtime. It's going to be slower if you have a thousand versus ten.
>> Guang-Bin Huang: [inaudible] so I will come back to that later. So you see the training
time. The training time here is much, much shorter than the SVM and the least-square
SVM, even shorter than the ELM with Gaussian kernel. Because when
you go to the Gaussian kernel, usually we consider the feature space going to
infinity -- I mean, in theory. But here --
>>: -- comparing the sigmoid, even though [inaudible], do you have a general
statement as to which one is better in terms of performance versus --
>> Guang-Bin Huang: Okay. In terms of classification -- in terms of stability,
the Gaussian kernel is better, because the standard deviation is always zero here. Whenever
the training data is fixed, the result is definitely deterministic.
But in terms of training time, sigmoid -- the so-called
non-kernel based -- is better. Because with the non-kernel version we need not give too many hidden nodes; usually 1,000 is enough. And 1,000
being enough is very good for very large applications. [inaudible] If you go to a very large
application, the scalability of SVM and least-square SVM would be very hectic,
troublesome for us. But then ELM is better. Right? So this is --
>>: So I'm thinking what you are doing now there is to use the random [inaudible]
at the top level.
at the top level.
>> Guang-Bin Huang: Yeah, yeah, yeah.
>>: So why is that feature a good one?
>> Guang-Bin Huang: I've already explained to you --
>>: It's not intuitive.
>> Guang-Bin Huang: Yeah, it is not. Even in the beginning, when I thought of
this idea, I asked myself, can it work? I couldn't find any
natural phenomenon at that stage. But now I have found the natural phenomenon. So
I say this is a natural learning scheme. I will come back to it later. It's very interesting. I will
come back to it later.
>>: By the way, the way you use SVM, [inaudible] is automatically determined,
right? [inaudible] would be considered [inaudible] hidden units in your ELM. So do
you have an idea of the number of support vectors you get with the SVM?
>> Guang-Bin Huang: Okay. Usually for SVM, what I found is that it is in some way
related to the number of training data. For [inaudible] regression methods, usually half of
the training data will be used as the support vectors. Go to this page.
We just ran a difficult case, the SinC function. Usually that's a very important benchmark
problem to test SVM performance. Right? We have 5,000
training data and 5,000 testing data. To the training data we intentionally add some
noise; the testing data is noise-free.
So if we go to SVM -- of course here we are training support vector regression,
[inaudible] the regression case -- we have the 5,000 data, the training data, here.
Nearly half of them are used as support vectors. This is why, even in other cases, we
usually say the number of support vectors is very comparable to the number of
training data. So if we talk about a very large dataset, the number of
support vectors will be very huge. So then this goes to the runtime issue that
was raised by this gentleman, right?
So when you do testing, the testing time will be very huge for support vector machines and
least-square SVM. But with a small number of hidden nodes, ELM's can be very short.
Look at this case: instead of fixing them at 1,000, I just chose 20. You see the
testing error for this case: ELM is already smaller -- better than SVM, than
SVR. The training time here is actually 0.1 seconds. You see the SVR is nearly
20 minutes. So this is -- yeah, please.
>>: If you use, say, a thousand on ELM in this case, would it maintain the same
accuracy, or would it overtrain?
>> Guang-Bin Huang: Okay. These are different methods then, because we
have different versions [inaudible] under development.
Here, for this case -- because it's probably from the year 2006, quite a long
time ago -- for that case we did not include the constant [inaudible] diagonal entry
of H, right? We do not include this I over C here; we just use
the formula without I over C, this part.
For that version, without that diagonal entry, you tune
the number of hidden nodes.
So then, regarding the number of hidden nodes, if you go back to see this paper, it is different
from the back-propagation method. Back-propagation is very sensitive to the
number of hidden nodes. All right. It goes to the peak performance
easily, quickly, and then drops down quickly because overfitting is there.
[inaudible] ELM usually has a very wide range where the performance, the
generalization performance, is very stable -- a very large range [inaudible]. Of
course, in that sense, the number of hidden nodes required in ELM is higher than the
number of hidden nodes in BP, because every hidden node in BP has to be tuned carefully.
But for us they are randomly generated.
But why is it so stable over a very wide range [inaudible]? Because every node is
randomly generated. So if you add one more or one less, it matters more or less not at all.
So in that sense, intuitively, most [inaudible]. This is why we have a
[inaudible] -- the original intention of ELM should be so-called tolerance: we
should have some kind of redundancy, we should have some kind of error
tolerance, right? But for [inaudible] BP, that's not [inaudible] a feature. Yes, please.
>>: In that sense there's no real tuning parameters?
>> Guang-Bin Huang: Yeah. So in this case we're just using [inaudible].
Say ten hidden nodes is too little, too few, incapable of learning; 1,000 may be too many;
so just say 500. That usually reaches it. If you run simulations, we do not have
to carefully tune everything. Yeah. So this is why, even though in this case the
number of hidden nodes has to be given, it usually can be given very easily,
just using a [inaudible]. Okay. Of course, this is for the [inaudible] case; I won't repeat
any more of the comparisons with BP, with ELM and the support vector machine,
okay?
Now, look here at the [inaudible] parameters. BP has to tune the parameters here;
of course [inaudible] has to tune the number of learning [inaudible]. Then the
SVR has [inaudible] parameters. ELM, the traditional one, has to be given a
number of hidden nodes.
Of course, after we add the diagonal entry in the formula, the number of hidden nodes
can be fixed, such as at 1,000, without doing any tuning here.
>> Guang-Bin Huang: You are right. But C --
>>: Yeah, it does. Definitely.
>> Guang-Bin Huang: That means either you're tuning the number of hidden nodes here -- but of course this is not sensitive --
or you're tuning C. But C is also not so sensitive compared with SVM. I mean, you
have to do tuning, but it's not sensitive compared to SVM. Okay.
>>: So this slide seems to suggest you need to use a much larger number of
hidden units than the BP.
>> Guang-Bin Huang: You're right. I just mentioned that because the role of
each hidden node in BP has to be calculated carefully by BP, while ours is just
random. So sometimes there is random redundancy there. But redundancy is good; it helps us remain stable in
some cases.
>>: So my question is for large problems, for huge problems, will this cause any
problem?
>> Guang-Bin Huang: No. You will see here: not too many, even compared -- if
you compare with SVR, not too many.
But as I just mentioned, after you add the diagonal entry, the number of hidden nodes
can be fixed. Usually we just pick 1,000. Right? In almost all the test cases we
have so far, just [inaudible].
>>: [inaudible] so suppose you compare the same network [inaudible], the same
level of parameters, all that, and when you are doing that you fix the earlier
parameters by random numbers [inaudible], and then of course you are not going to
get as low an error as if I optimize both. If I fix whatever you have at
the top level, and I have the freedom to train the other one, I get a
better result in training. So [inaudible] about the training result [inaudible].
>> Guang-Bin Huang: Correct. Great question. This depends -- it
depends which method we are using to train your network. Say, if we are using BP,
then you're stuck in a local minimum. You cannot get to the global optimum.
>>: [inaudible] all I'm saying is that you try [inaudible] -- I think I would certainly do
that -- you use whatever method to get [inaudible] a good result, and then you fix
[inaudible], and then I fix those and I do back-propagation, BP, to train my a, b, so
that I won't be [inaudible] of what you randomly [inaudible]. I actually get
[inaudible].
>> Guang-Bin Huang: No, BP will not get a better result in most cases. It may get
a better result in some cases. What I am saying is that BP won't get a better result
in most cases in terms of accuracy, okay?
>>: No, I'm saying that given that I used the same parameters that you learned,
I [inaudible] to adjust the other parameters. I mean, [inaudible] I will just
use any [inaudible] I might get some lower error. Is that going to give you --
>> Guang-Bin Huang: No, definitely --
>>: The issue will be generalization [inaudible].
>> Guang-Bin Huang: No. Definitely BP will get lower generalization performance.
You ask why? Because BP has to tune all the hidden nodes -- even though
ELM's hidden nodes are randomly generated, okay, in most cases BP has to tune
all the parameters. Say we have 1,000 hidden nodes randomly generated. For ELM it
works. Then go to BP: overfitting. Because, say, you just have 100 data and you
use 1,000 hidden nodes; then this network definitely cannot properly learn from these
100 data, because the number of hidden nodes is larger than the number of training data.
It's impossible, right?
The training error can go to zero, but then the testing error can even
go to infinity in some sense [inaudible]; the overfitting is there.
But ELM is randomly generated. So in that sense -- this is one sense in which ELM is better than BP.
Of course there are [inaudible] other reasons. One reason is here: when we generate the
hidden nodes, we do not give preference to the training data. That means the
hidden nodes can be generated without seeing the training data. So they
treat the testing data and the training data by the same rule, give them the same priority
[inaudible].
But in BP, the hidden nodes are tuned based on the training data. In other words,
you give the priority, the advantage, to the training data, and you give a bias against the
testing data. So in this sense BP is actually over-focused on the training data. Right?
So it may not balance the training and the testing performance, the testing data. So
in this sense BP is not good. Okay. And it is similar for other methods. Okay?
Now, look at a larger case; this also appeared probably five or six years ago.
So we have -- actually it's a [inaudible] 600,000 data, with
100,000 for training. Actually, later on we also tested with [inaudible]. Using these
here, SVM spent 12 hours for one trial. That means, suppose we have already found
the best parameters, the optimal user-specified parameters. Then how many
hidden nodes -- how many support vectors?
So here you have 100,000 training data, and 1/3 of them are used as support
vectors. This goes back to that gentleman's point. Right. So that means for the runtime
case, the testing time will be very huge. So even when you have the trained network,
given an application, the [trained] SVM may respond very slowly compared to
ELM. ELM needs only 200 hidden nodes for this case.
Then each training run takes just 1.6 minutes. All right? So -- but look here: that is
roughly 12 hours for one trial. And to find the appropriate parameters, you may have
to run maybe 200 times to find the best parameters. So usually they have to spend
one month to get the SVM done completely. But for ELM, what did I say -- just several runs,
okay? So this one.
Now, go back to here: why is ELM better? So the key expectation of ELM is that the
hidden layer need not be tuned. Okay. It can be generated before we see the
training data. This is the first key point.
The second key point is that the hidden layer should satisfy the universal approximation
condition. As I just mentioned, even if h(x) is unknown, we assume h(x)
should satisfy this condition. If the hidden layer h(x) does not satisfy this condition, then if you
give me some application, you cannot approximate it -- you cannot solve that
application. Then what shall we use? Why do we need to use this method? Right?
Okay.
So then we want to minimize these two. Actually, this is similar to SVM. But this
comes from neural network theory, actually probably quite a long time ago.
Okay. This was proved by a so-called -- actually, a UC Berkeley professor. The UC
Berkeley professor says if the training error is already small, we should try
to minimize the norm of the weights of the whole network.
So then I combine these two: why not minimize the training error as well as
minimize the norm of the weights of the whole network? Right? So that is why we come here. In
order to address why ELM is better than SVM in most cases, we [inaudible] just
compare with least-square SVM. From that we can kind of see the reasons.
Okay. We go back to the original SVM. The original SVM actually talks about this
kind of optimization target subject to [inaudible] conditions. Right?
If we go back to ELM, we also use [inaudible] conditions. But we know h(x)
satisfies the universal approximation condition. So in ELM we have
this target function -- I mean, this is analyzed from the optimization aspect.
So then we also have this condition. Note here, this is different from SVM. SVM
has a b there, right? But in ELM there is no b, because h(x) satisfies universal
approximation: given any target in the feature space, the separating
boundary can be supposed to pass through the origin in the feature space.
So without b, we have this dual optimization problem for ELM, and we only
have this condition. That means [inaudible] to get a solution for ELM, we only
need to find the optimal parameters alpha_i within this cube, from 0 to C.
But look at [inaudible] this paper. Look at the SVM; here the problem comes.
Because phi(x) in SVM is unknown. Unknown -- that means, even
using a polynomial method, it is unknown; that means it may not [inaudible], may
not satisfy the universal approximation condition.
So this one has a b, because in the feature space, right, the separating so-called
boundary may not pass through the origin of the feature space. So this is why they
added a b there. After they add a b, then in the dual optimization
problem this condition will be generated. And if we generate this condition, then
SVM -- there's a problem.
In order to find the optimal alpha_i [inaudible] parameters, how do we find them? Not only
in this cube; they also have to find the parameters in this hyperplane, the hyperplane
defined by this condition. This is why you have the SMO method: to find the best
parameters along this hyperplane within this cube.
Look here. ELM finds the optimal solution for alpha_i in this cube. SVM finds the solution
in the hyperplane within this cube. So which one can achieve the better optimal
solution? This one should be better. Right? Because the optimal solution here is
definitely better than the solution there.
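Put concretely (a reconstruction of the dual constraints being contrasted, not a quote from the slides): without the bias b, the ELM-style dual only has the box constraints, while the standard SVM dual adds the equality constraint induced by b,

    \text{ELM: } 0 \le \alpha_i \le C, \qquad \text{SVM: } 0 \le \alpha_i \le C \ \text{ and } \ \sum_{i=1}^{N} \alpha_i t_i = 0,

so the SVM search is restricted to the hyperplane \sum_i \alpha_i t_i = 0 inside the cube [0, C]^N.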
So then why does this problem come? So this is why in my brain -- of course, this I just
added this morning, and it is not published yet, so I wish to bring this idea here for
discussion: is there any flaw in SVM theory?
Look at that. SVM. SVM is very great. Actually, I appreciate this
work. Without SVM, the computational intelligence research definitely may not
be so attractive, right?
Okay. But the problem is SVM always searches for the optimum solution in this
hyperplane within this cube, right, as I mentioned before. Right? So then the
problem: take two irrelevant applications -- they may have to search for the optimal parameters in the same hyperplane.
Say I give you two applications, (X1, T1) and (X2, T2), okay: application 1,
application 2. One talks about weather forecasting, one talks about human
face recognition -- quite different, irrelevant applications, but their targets may
be the same: whether the weather is good or not; whether this face is this man, yes or no.
Right? So the target is always 0 or 1, 0 or 1.
But according to the SVM condition, they have to search -- they have to follow the
so-called [inaudible] solution in the same plane. Even though these two applications
are different, their target class labels t_i may be the same. Right?
So that means the constraint "summation of alpha_i t_i equals 0" is the same, provided the t_i are the same.
Okay. Weather forecasting: 500 data. Face recognition: 500 data. So in weather
forecasting, half of the t_i are 1 and half are 0. In face recognition, possibly half of the targets are
1 and half are 0. Right?
So then the t_i are the same in these two applications. But the applications are totally
irrelevant; the inputs x_i have quite different dimensions, different distributions. But then
the solution area for the alpha_i is the same.
So this looks not so reasonable. Different applications -- why is the solution so-called
area the same? The solution [hyperplane] could be the same.
>>: But if the -- if you [inaudible] separate the two causes, why is it -- is it not
[inaudible]?
>> Guang-Bin Huang: This is the problem. Because they are different applications,
the input dimensions and distributions are totally different. So here they only talk about the
[inaudible]; there should be some parameter for the tuning. This should not only be
related to the target.
Ideally, you would want to relate it to the training data: you should relate it to the target
and also in some way to the input.
>>: [inaudible] training data because if you optimize for another [inaudible].
>> Guang-Bin Huang: You're right. So this is only related to the target in this sense.
In other words, it gives the preference to the target. This is what I want to say. It would be
good to relate it to the training data, but this only relates to the target at this stage. Either you --
>>: I mean, [inaudible] to the training data, because clearly this is just a constraint.
You have the optimization [inaudible].
>> Guang-Bin Huang: Ah. You're right. But if you look at [inaudible] in relation to the optimization goal here -- you see here, right, this is what is optimized. So from here, this part is only related to the target. Of course I cannot say this is totally wrong. What I found myself is that it gives the [inaudible]. It gives preference to the target. It looks like two different applications get linked: even though their input distributions and dimensions are totally different, due to the similarity in their targets it may match them, make them close to each other.
This is why I have some kind of reservation, some kind of hesitation, about this part. Right? It may not be right at all. This is why I put a question here: is this reasonable enough? Okay.
But then, really, this is [inaudible] for discussion only, okay? I haven't published this. It came to me this morning, okay, after I woke up.
So because SVM is too general about the feature mapping, phi(x) is unknown. In all the SVM literature, you cannot see whether phi(x) has to satisfy the universal approximation condition. It can be anything. Say polynomial; even x squared can be used sometimes. But as you can see, that only -- [inaudible] will go to infinity. The output function, the output of the network, can go to infinity or so, right? So that may not be reasonable.
But because the feature mapping is unknown -- [inaudible] unknown -- they have a bias in the output. Because the target is here and your feature mapping is unknown, there is some bias [inaudible] between your real target and your actual output. So we have a bias there. And because there is a bias there, I think this is a kind of contradiction. Of course this has to be verified further. This is just for internal discussion, okay? Yeah. We can discuss later.
Okay. So let's go back to this method. Even under this analysis, we go to this method for ELM. It also works in this sense, as we see here. We give some kind of classification testing case here.
So then we see the SVM method and the ELM method. Of course for this case, for ELM, we set the number of hidden nodes to 1,000 [inaudible] C has to be tuned there. It has to be given.
For SVM, of course, you have C and gamma. Of course [inaudible] we can also use a Gaussian kernel. But in this comparison, I didn't use the Gaussian kernel for ELM.
So you see here the testing rate is roughly always better than SVM. That means we are comparing against the same kind of method here [inaudible] compare with least-square SVM, because those use [inaudible] constraints. Because if you go back to the earlier slides, we compare with the same kind of method, the least-square method, right, of the metric space.
>>: So roughly how big is the data size in those tasks?
>> Guang-Bin Huang: Yeah. Training data. Yeah. This is just the [inaudible]. We are not sure about very large datasets. Even here for -- we see the high-dimensional gene selection case, and we had this result. Okay.
Of course that is for ELM [inaudible] where all the data are ready. But in real applications, especially in so-called [inaudible] applications, right, much of the data comes one by one or chunk by chunk. You do not know how much data you will have or when all the data will come. So this is the situation.
So how do we implement something online? This is why we have another method called online sequential ELM [inaudible] one slide here to introduce the idea.
So the data may come in one by one. Okay. Or they come in a chunk. The chunk size may vary [inaudible]: maybe just one, maybe 20 data come in at one [inaudible], and sometimes the next time 500 data come in. Right?
So we wish to have a [inaudible] which can just [inaudible] the arriving data. Okay. As soon as learning on this [inaudible] data has been done, that data should be discarded, removed from the system. Right?
So next time, when new data comes in, you should [inaudible] the new data together with the knowledge learned from the past data, from the history of the data. But after the data has been learned, it should be discarded. It's very similar to how we learn ourselves.
Say we go from kindergarten to the PhD. Okay. Every semester we have a new textbook. When we learn the new textbook, we go to the examination. This is a chunk of data. After we pass the examination, we discard the old textbook. We learn the new textbook, which means a new chunk of data. Right? But when we learn the new chunk of data, we learn the new data with the knowledge learned from the earlier textbooks. We do not need to go back to our old textbooks. Of course you can review the old text, but usually you just learn it once and then set it aside.
So this is a very natural way. In this sense, we can implement this online sequential learning based on the original batch-mode ELM. Okay. I'll just skip this part.
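For reference, here is a minimal sketch of this kind of online sequential update, assuming the recursive least-squares style formulation commonly described for OS-ELM; the slide details are not in the transcript, so the sigmoid hidden nodes, random ranges, and names below are illustrative assumptions only.

```python
import numpy as np

class OSELM:
    """Sketch of online sequential ELM-style learning.

    Hidden-layer parameters are generated randomly once and never tuned;
    only the output weights beta are updated as each chunk arrives, so
    past chunks can be discarded after they are learned.
    """

    def __init__(self, n_inputs, n_hidden, rng=None):
        rng = np.random.default_rng(rng)
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))  # random input weights
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)              # random biases

    def _hidden(self, X):
        # Sigmoid (additive) hidden nodes
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def init_batch(self, X0, T0):
        """Initialization phase with the first chunk (needs at least n_hidden samples here)."""
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)   # kept in memory instead of the raw data
        self.beta = self.P @ H0.T @ T0      # least-squares output weights

    def update(self, Xk, Tk):
        """Sequential phase: fold in a new chunk, after which the chunk can be discarded."""
        Hk = self._hidden(Xk)
        K = np.linalg.inv(np.eye(Hk.shape[0]) + Hk @ self.P @ Hk.T)
        self.P = self.P - self.P @ Hk.T @ K @ Hk @ self.P
        self.beta = self.beta + self.P @ Hk.T @ (Tk - Hk @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

A typical usage would call `init_batch` on the first chunk and then `update` on every later chunk, mirroring the "learn the textbook, pass the exam, discard it" analogy above.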
Online sequential learning is very [inaudible] for SVM and other methods. Usually many parameters have to be tuned manually, so it is very difficult to handle. Okay. Even the training times are very large. In many, many cases it is [inaudible] very troublesome.
Using ELM, this can be done very, very easily. Okay. So I'll just skip this part.
Okay. So now the summary -- or, sorry, [inaudible]. Usually it can be done without tuning. Of course, when we say without tuning, we mean you have one insensitive parameter, the number of hidden nodes or the so-called complexity [inaudible]. But compared with SVM and other methods, this is not so sensitive; just give one value, try it one or two times, and usually it's okay. Right?
It can be efficient for batch-mode learning or for sequential learning. One part, incremental learning, was not mentioned here. Actually, it's all very interesting. It comes from our ELM theory. Here ELM is different from other methods: every major ELM algorithm comes from theory. That means every algorithm comes with a proof in theory. Okay. And every major theoretical result in ELM generally leads to a learning method. This is different from other methods. Other methods usually come from an idea, without that kind of proof. Okay. Or sometimes you can prove something about neural networks, but there is no algorithm to support it. Say the number of hidden nodes has to go to infinity -- then how do you implement it? That's a problem, right?
So of course ELM works for different types of hidden nodes, also including kernels. If h(x) is not randomly generated -- if it is not, you can also use a kernel, the same as in the SVM method, just using a kernel. Right? So h(x) can be [inaudible].
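For reference, a minimal sketch of this kernel variant, assuming the commonly published closed form in which the kernel matrix plays the role of H H^T; the Gaussian kernel, the parameter C, and the names used here are illustrative assumptions rather than the exact setup on the slides.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_elm_fit(X_train, T_train, C=1.0, gamma=1.0):
    """Solve the regularized least-squares system in kernel form.

    alpha = (I / C + Omega)^{-1} T, with Omega_ij = K(x_i, x_j),
    so the explicit feature mapping h(x) never has to be written down.
    """
    Omega = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(np.eye(len(X_train)) / C + Omega, T_train)
    return alpha

def kernel_elm_predict(X_test, X_train, alpha, gamma=1.0):
    """Output is a kernel expansion over the training points."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```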
Okay. So [inaudible] learning is very fast. Usually SVM, or [inaudible] least-squares, may take, say, three hours or seven hours to complete one case. ELM takes even less than one second.
So in this sense it can be -- can we do something for real-time learning? Okay, and especially online sequential learning can also be used in real-time cases.
Then the next two slides I added this morning. I just mention: we talked about traditional SVM, okay, and even talked about least-square SVM. Traditional SVM searches for the best result on the hyperplane within the cube. If we compare with the same method, ELM searches for the best result in the entire cube.
So in this sense, ELM always tends to provide better generalization performance, because in the worst case its result matches the one generated from the hyperplane in that cube. Right?
And also --
>>: But the parameter space is different.
>> Guang-Bin Huang: SVM. SVM.
>>: No, no, SVM usually has a [inaudible].
>> Guang-Bin Huang: SVM. You look here, this actually --
>>: [inaudible] you have more [inaudible].
>> Guang-Bin Huang: You see here, you look at this, this is the optimization target of SVM. Right? So t_i, t_i, phi_i, okay, usually these are the same here.
>>: [inaudible].
>> Guang-Bin Huang: N here is the same. N is the number of training data. So say the training data are the same, and we use the same kernel, the same feature mapping phi in SVM, with h in ELM being the same one, say, for example. Then the condition for ELM is this one here. The condition for SVM -- you add one more. Right? So ELM is usually better. Because in the ELM case we require the feature mapping to satisfy universal approximation, even if it is unknown. It should.
[inaudible] you can't satisfy the universal approximation condition. That means there is some black hole there. Right? In some cases you can't --
>>: [inaudible] I haven't [inaudible] in your case, for the universal approximation to hold, do you -- I remember a long time ago when people [inaudible] universal approximation [inaudible] they had to assume that you have [inaudible].
>> Zhengyou Zhang: Well, maybe we can [inaudible].
>> Guang-Bin Huang: Yeah. Yeah. Okay. So then it is similar for least-square SVM, the same case, okay, the same case. So then, going back to your earlier question: why is random generation good? Okay. [inaudible] so in the beginning, five years ago, I had a new idea. And I always tried to think of some idea [inaudible].
I have already told this story to my students and also other researchers. Okay. In the beginning [inaudible]. What natural phenomenon is very similar to this case? Say in China, right? In China, [inaudible] a lake, right? A lake. We want to fill up the lake, okay, or a river. You see the bottom. The bottom is a curve. The bottom is a function.
Two types of people come to fill up the lake. One is an engineer. The other is a farmer. How does the engineer fill the lake? The engineer has to pump out all the water and measure the curve of the bottom to see what the size is, what the curve is [inaudible] the training data first. And then -- okay. So that's the training data. Then he does another study: how many stones should I use to fill up the lake? 1,000 stones, 10,000 stones, or 100 stones to fill the lake.
You move the stones to fill the lake. Of course VLM means go to the [inaudible]. Okay.
Each stone used by the engineer is like a hidden node. So the engineer has to measure, has to calculate, where to put each stone. This one looks too big; put it here and it may be overfitting. Too nice. Put it there -- so they spend a lot of time finding the right location. Right? And if 1,000 of his stones are not enough, can we [inaudible] where to put more? So this is the BP learning method.
Now the farmer comes. The farmer has no knowledge, right? He does not know how to do the calculation. Very simple. We have a large mountain, [inaudible] the mountain using a [inaudible] to blow up the whole mountain. That generates random-sized stones and rocks, big ones, small ones, even sand. So how does the farmer fill in that lake? Using a truck. Randomly -- he picks a random number of stones, of random sizes, pushes them into the lake, and drops them randomly here, here, here. What is that? That is random generation.
So when the farmer uses a truck to carry the stones, does he need to know the size, the training -- some training data from the lake? No need. Just randomly generated stones, put randomly here, here, here.
So which one is faster? Of course the farmer's method is much more efficient. And does the farmer care about how many stones need to be used? No need, as long as the least-square error -- as long as the error gets smaller. It's okay. Right?
So then what is beta, as used by the farmer? The engineer has to calculate the size, the height, of the stones. Well, the farmer can go there with something and just smash a stone if the stone [inaudible]. Suppose that beta indicates the height of a stone. Right? If beta equals 0, the height of the stone equals 0, which means nothing is there. Even if you put a stone there -- [inaudible] a hidden node put there -- you can smash the stone to make its height 0.
So definitely the farmer's method is more efficient. Am I right? And even for the farmer's method, does he care whether you have 1,000 stones or 1,001 hidden nodes? It doesn't matter. Even if you have more, it doesn't matter. I can smash a stone down and make its beta become 0. Right? Even if you have too many, it doesn't matter. They all become 0. So this is why ELM works.
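For reference, a minimal sketch of the "farmer" procedure in code: hidden-node parameters are thrown in at random (the random stones) and only beta, the output weights (the heights of the stones), is computed by regularized least squares. The sigmoid hidden nodes, the uniform random ranges, and the parameter names are illustrative assumptions.

```python
import numpy as np

def elm_train(X, T, n_hidden=1000, C=1.0, rng=None):
    """Random hidden nodes, analytically computed output weights."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))   # random input weights (the stones)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                    # hidden-layer output matrix
    # Regularized least squares for beta (the only quantity that is actually computed)
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

A superfluous hidden node simply ends up with a near-zero beta, which is the "smash the stone down to height 0" part of the analogy.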
>>: [inaudible] I appreciate the point you made earlier that when you randomize your lower-layer parameters, you don't need to have all the data. In that case you minimize some of the early position in terms of each simulation. And there's actually a similar kind of technique now [inaudible] mostly in the NIPS [inaudible] they actually talk about this [inaudible]. They have a similar philosophy of not using the target when initializing parameters. But they actually use the data for reconstruction to make sure that whatever [inaudible] you have, you get minimum error in [inaudible] the data, but [inaudible] entirely the output. And once they do that, [inaudible] to do better [inaudible].
>> Guang-Bin Huang: Yeah. This is possibly right. This is why --
>>: [inaudible] similar kind of --
>> Guang-Bin Huang: I think so. They may have some kind of common sense, so to speak, behind them. That means when we build the hidden layer, some parameters may not need to be related to your target.
>>: Well, but the whole power of that approach is that it allows you to build the [inaudible] layer by layer [inaudible]. So that comes back to some of the things [inaudible] whether you can actually use this not only for one single layer, but keep building up the layers using a similar kind of philosophy [inaudible].
>> Guang-Bin Huang: Yeah, yeah. Great idea. So here we talk about a hidden layer. We say the hidden layer is h(x). Sometimes we say h(x) is unknown, or h(x) is something else, not just one hidden node. It may be a larger network there. Maybe a lot of hidden layers there.
>>: [inaudible].
>> Guang-Bin Huang: That's right. We [inaudible] layer by layer [inaudible] in our theory we say h(x) may not be just a single computational hidden node. It can be any kind of computational hidden node. It can be a larger network, a larger layer. In theory this is already proved.
>>: You still try to see whether you build --
>> Guang-Bin Huang: Ah.
>>: -- from the lower [inaudible] better result. If you do that [inaudible].
>> Guang-Bin Huang: [inaudible]. But in this case, we haven't tried it yet. That is one of the good directions, though. In theory, we have something related. That means h(x) may not be just a so-called RBF hidden node or some kind of kernel. It may be some other computation [inaudible]; it can even be a system. So this is why I intentionally do not use [inaudible] here. I prefer to say a node, a computational node. It can be a system. It can be a single output function, whatever.
>> Zhengyou Zhang: Okay. Last question.
>>: [inaudible] online learning scheme, so would you have to like incrementally
sort of adjust the number of hidden nodes?
>> Guang-Bin Huang: Okay. So actually we have two types of learning methods for online settings. One is called online sequential learning. That is the [inaudible] case where the data come in one by one or chunk by chunk.
In this case the [inaudible] is fixed. So then whenever data come in, of course some parameters [inaudible] are updated, especially beta, the output weights.
Another technique we call incremental learning. Of course the definition may be different from other kinds of incremental learning. That means the data are fixed, or the data keep coming, but the [inaudible] can also be updated. So then both the parameters and the network [inaudible] can be updated. That is the other one. That is also very, very recent. We prove that in all those cases it is better than SVR, support vector machine for regression. That is actually already published in the so-called -- actually in the [inaudible]. That also comes from -- okay. Let me show here. That is published here.
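For reference, a minimal sketch of this kind of incremental construction, assuming the commonly described I-ELM-style scheme in which random hidden nodes are added one at a time and each new node's output weight is set in closed form from the current residual; the formulas and names below are illustrative assumptions rather than the exact published algorithm.

```python
import numpy as np

def incremental_elm(X, t, max_nodes=200, rng=None):
    """Add random hidden nodes one at a time (single-output regression sketch).

    Each new node's output weight is computed in closed form from the current
    residual error; existing nodes are never re-tuned.
    """
    rng = np.random.default_rng(rng)
    nodes, betas = [], []
    residual = t.astype(float).copy()            # start with the raw target as the residual
    for _ in range(max_nodes):
        w = rng.uniform(-1.0, 1.0, size=X.shape[1])
        b = rng.uniform(-1.0, 1.0)
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # output of the new random node
        beta = (residual @ h) / (h @ h)          # closed-form weight for this node only
        residual = residual - beta * h           # residual norm can only shrink
        nodes.append((w, b))
        betas.append(beta)
    return nodes, np.array(betas)
```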
So this is how the theory actually generated that algorithm. Okay. Actually the theory also comes from that area. Intuitively we can see that the algorithm works. Now, how to prove it? That is very, very difficult. We have to appreciate the peer review, the professional review from this journal and also the editor, the past editor in chief of this journal; it took three years to complete the review, seven reviews.
Of course, in the first version of the paper, we thought the proof was fine. But then in review, [inaudible] whether the result is right. So I worked with my student and tried to correct the proof based on [inaudible] a published paper. Actually the proof in that paper was also wrong, but we did not know that the proof in that paper was wrong. It took us a long time to find out, and then we tried to correct it [inaudible] also from the math department, and he said [inaudible] able to use the proof in that paper, so we tried to correct the proof in that paper. So then we tried to do it.
But just one week before we submitted this paper, we found out that, no, our correction was also not correct. So then we [inaudible], then we used the [inaudible] method to correct it. Finally it worked. So it's very interesting.
So finally, the algorithm generated from that theory is actually very, very efficient. I just took out these slides this morning; I found there were too many pages here. Yeah.
>> Zhengyou Zhang: Let's thank the --
>> Guang-Bin Huang: Thank you so much.
[applause]