
GENERALIZATION ISSUES IN MULTICLASS CLASSIFICATION: A NEW FRAMEWORK USING MIXTURE OF EXPERTS
S. MEENAKSHISUNDARAM, W.L. WOO, S.S. DLAY
School of Electrical, Electronics and Computer Engineering
University of Newcastle upon Tyne
UNITED KINGDOM
Abstract: - In this paper we introduce a new framework for expert systems used in real-time speech applications. It consists of a Mixture of Experts (MoE) trained for multiclass classification problems such as speech. We focus mainly on the generalization issues which are surprisingly ignored in established methods, and demonstrate how severe these can be when the framework is deployed as a system. We limit this paper to addressing these issues and presenting the MoE's capability to overcome them; the statistical perspective behind the training is briefly presented. A significant leap in performance is achieved, justified by a 10% improvement in word recognition rate over the best available frameworks and an impressive 18.082% improvement over the baseline HMM. Critically, the error rate is reduced by 10.61% over other connectionist models and by 23.29% over the baseline HMM method.
Key-Words: - Expert Systems, Hybrid Connectionist, Self-Organising Map, Mixture of Experts, Cross Entropy
1. Introduction
Mixture of Experts (MoE) networks are used extensively as expert systems because of their modular ability as classifiers and their learning capabilities. Most established frameworks in this area combine Mixture of Experts with statistical models such as HMMs to model real-time applications. In speech recognition the classifier's objective is to map the input sequences to one of the target classes [1][8]. HMMs have been used extensively, but their reliance on probabilistic assumptions about the input and their limitations with correlated input distributions, caused by the likelihood-maximization assumption, led Bourlard and Morgan to propose a hybrid paradigm of HMMs and Artificial Neural Networks (ANNs). It is based on the theory that, given satisfying regularity conditions and with each output unit of an ANN associated with a possible HMM state, the posterior probabilities for input patterns can be generated through training of the ANNs [2-4][7]. These probabilities are then converted into total scores using the Viterbi decoding algorithm, to be used as the local probabilities in the HMMs. This overcomes the HMM limitation of poor discrimination among models due to the Maximum Likelihood criterion. Popular hybrids include the Multilayered Perceptron (MLP) neural network, Radial Basis Functions (RBFs), the Time Delay Neural Network (TDNN) and the Recurrent Neural Network (RNN) [5-6][9][11]. In this series HMMs have been combined with Mixtures of Experts (MoE), and Hierarchical Mixtures of Experts [12-13] were also fused into the hybrid framework. However, these hybrids rely on heuristic training schemes and their generalization capabilities are poorer, as the learning models' results are given less importance. To summarize, there is a strong need for an architecture efficient enough to be trained for good approximation, with a training procedure whose error performance reaches the global minimum.
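To make the hybrid HMM/ANN scoring described above concrete, the sketch below converts network state posteriors into scaled likelihoods (dividing by the state priors, as is usual in hybrid systems) and runs Viterbi decoding on them. This is a minimal illustration only; the array names, shapes and the log-domain formulation are assumptions and not part of the proposed framework.

```python
import numpy as np

def viterbi_scaled_likelihoods(posteriors, priors, log_trans, log_init):
    """Viterbi decoding using ANN state posteriors converted to scaled likelihoods.

    posteriors : (T, Q) array of p(state | frame) from the network
    priors     : (Q,)  array of state priors p(state) estimated from training data
    log_trans  : (Q, Q) log transition probabilities, log_trans[i, j] = log p(j | i)
    log_init   : (Q,)  log initial state probabilities
    """
    eps = 1e-12
    # Scaled likelihoods: p(x | q) is proportional to p(q | x) / p(q)
    log_emis = np.log(posteriors + eps) - np.log(priors + eps)   # (T, Q)

    T, Q = log_emis.shape
    delta = log_init + log_emis[0]          # best log score ending in each state
    back = np.zeros((T, Q), dtype=int)      # back-pointers

    for t in range(1, T):
        scores = delta[:, None] + log_trans  # previous state -> current state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(Q)] + log_emis[t]

    # Trace back the best state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path, float(np.max(delta))
```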
2. Mixture of Experts Framework:
2.1 Architecture:
The input speech is modelled using an HMM and the vector set is fed to the MoE network. Here it is important to split the input space into subspaces, as this allows the modular experts to handle their input regions more appropriately. Thus we have fused in a Self-Organising Map (SOM) which categorises the input space and then clusters it into regions [14-15]. However, the SOM's advantages of partitioning and clustering are maximised only when the heuristic training scheme becomes more modular. The gating network's allocation and decision making is critical for MoE performance. The critical case occurs during the training stage when the gating weights assigned to the expert networks are not optimized for new data. This results in the network degenerating in generalization. To avoid this we propose a twin training loop where the network parameters are tuned to the training data set, and when a new set of data is given to the framework we compute the weights and fine-tune them to adjust to the variations in the input space.
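As a minimal sketch of one possible reading of this twin training loop, the control flow below fits the experts and gating weights on the training set, then re-estimates only the gating weights when previously unseen data arrives. The helper functions are hypothetical placeholders and not the paper's implementation; only the two-stage structure follows the description above.

```python
# Hedged sketch of the twin training loop: train on the training set,
# then fine-tune the gating weights for each new data set.
def twin_training_loop(train_data, train_labels, new_data_batches,
                       train_experts, fit_gating, refine_gating):
    # Loop 1: tune expert and gating parameters on the training data set.
    experts = train_experts(train_data, train_labels)
    gating = fit_gating(experts, train_data, train_labels)

    # Loop 2: for each new set of data, recompute and fine-tune the gating
    # weights so that the combination adapts to variations in the input space.
    for batch in new_data_batches:
        gating = refine_gating(experts, gating, batch)
    return experts, gating
```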
2.2 Training:
As for the architecture, the member networks of the CM, the RBFs, are two-layer feed-forward ANNs with an input layer $I$, a hidden layer $k$ made of basis functions and an output layer $O$. An RBF is characterized by the centre of its Gaussian $c_k$, the width of the activation surface $\sigma_k$ and, for training, the weighting factors. We represent these parameters by $\{c_k\}_{k=1}^{N}$, $\{\sigma_k\}_{k=1}^{N}$ and $\{w_k\}_{k=1}^{N}$. To find the centres, the k-means clustering algorithm is applied, where the set of data is divided into subgroups and centres are placed in regions containing significant data. Let $m_1$ denote the number of RBFs, determined through experimentation, and let $c_k(n)$ with $k$ running from 1 to $m_1$ denote the centres of the radial basis functions. At first, random distinct values are chosen for the initial centres $c_k(0)$. A sample vector $x$ drawn from the input space with a certain probability is given as input to the algorithm at iteration $n$. If we let $k(x)$ denote the index of the best-matching centre for the input vector $x$, it can be found at iteration $n$ using the minimum-distance Euclidean criterion $k(x) = \arg\min_k \|x(n) - c_k(n)\|$, $k = 1, 2, \ldots, m_1$, where $c_k(n)$ is the centre of the $k$th radial basis function at iteration $n$. The centres of the RBFs are then adjusted using the update rule
$$c_k(n+1) = \begin{cases} c_k(n) + \eta\,[x(n) - c_k(n)], & k = k(x) \\ c_k(n), & \text{otherwise} \end{cases}$$
where $\eta$ is a learning-rate parameter within the range $0 < \eta < 1$. Finally, the iteration $n$ is incremented by 1 and the procedure is continued until no noticeable changes are observed in the centres. This algorithm achieves a locally optimum solution within the feature space.

In the Least Mean Squares (LMS) algorithm used to calculate the weights $w_i$ updated during online training, the weights of the network are initialized to zero, $w_i = 0$ for $0 \le i \le N$. They are trained for $k = 1, 2, \ldots$ with the estimated output $d'(k) = w^T(k)\,x(k)$, where $w(k)$ is the estimated weight vector and $x(k)$ the network input. If we represent the error function as $e(k)$, it can be written as $e(k) = d(k) - d'(k)$, where $d(k)$ and $d'(k)$ are the desired class and the estimated output. The weights are adjusted according to the LMS algorithm as $w(k+1) = w(k) + 2\eta\,e(k)\,x(k)$, where $\eta$ is the learning-rate parameter. The training process is continued until steady-state conditions are reached.

The CM is built by combining the RBFs and a gating network to achieve a globally converging solution. The gating network applies feedback from the output to adjust the weights in subsequent stages. Let us consider such a system where the output is represented as $Z$. Representing the output of Fig. 2 in terms of the individual networks' outputs $y_i$, we can write
$$Z = \sum_{i=1}^{N} y_i \qquad (2)$$
where $N$ denotes the number of RBF networks used within the MoE. With the weighting scheme applied, equation (2) can be written as
$$Z = \sum_{i=1}^{N} g_i\, y_i \qquad (3)$$
The gating network parameters $g_i$ are chosen with respect to the data subjected to training and the previous state output, with $0 \le g_i \le 1$ such that $\sum_i g_i = 1$.

For performance, the error function $\varepsilon$ can be defined as the difference between the CM's output and the desired response during the classification problem. If we denote the desired response by $d$, the aim is to reduce the error function, which can be written as $\varepsilon = d - Z$. The architecture is optimal if the error function is kept at a minimum. Most of the existing architectures use the Mean Square Error (MSE) as the performance measure. MSE is suitable for regression problems.
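The centre-selection and LMS weight-update rules above can be illustrated with the following minimal sketch. It follows the equations just given, but the data matrix, learning rate, number of centres and common RBF width are arbitrary assumptions added to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))        # example feature vectors (e.g. 39-dim MFCCs)
d = rng.integers(0, 2, size=500)      # example desired outputs (illustrative only)

m1, eta = 8, 0.05                     # number of centres and learning rate (assumed)

# Competitive centre update: move only the best-matching centre towards x(n).
centers = X[rng.choice(len(X), m1, replace=False)].copy()
for n in range(len(X)):
    k_best = np.argmin(np.linalg.norm(X[n] - centers, axis=1))
    centers[k_best] += eta * (X[n] - centers[k_best])

# LMS update of the output weights: w(k+1) = w(k) + 2*eta*e(k)*x(k),
# with the Gaussian hidden activations used as the network input x(k).
sigma = 1.0                                            # assumed common RBF width
def hidden(x):
    return np.exp(-np.linalg.norm(x - centers, axis=1) ** 2 / (2 * sigma ** 2))

w = np.zeros(m1)                                       # w_i = 0 initialisation
for n in range(len(X)):
    phi = hidden(X[n])
    e = d[n] - w @ phi                                 # e(k) = d(k) - d'(k)
    w += 2 * eta * e * phi
```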
2.3. Cross Entropy (CE) Error Criterion:
However, for multiclass classification problems such as speech classes, it is appropriate to choose an error criterion known as the Cross Entropy (CE) function. From the MoE architecture of figure 2 we can write the error function $\varepsilon$ as
$$\varepsilon = -\sum_{n}\sum_{k=1}^{c}\left[ t_k^n \ln\frac{y_k^n}{t_k^n} + \left(1 - t_k^n\right)\ln\frac{1 - y_k^n}{1 - t_k^n}\right] \qquad (4)$$
The cost function to be minimized is then
$$J = E\!\left[\varepsilon^2\right] = E\!\left[\left(d - \sum_{i=1}^{N} g_i\, y_i\right)^{2}\right] \qquad (5)$$
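As a small worked illustration of equation (4), the function below evaluates the cross-entropy error for a batch of target/output pairs. The clipping constant and array shapes are assumptions added for numerical safety; with $y_k^n = t_k^n$ the error is zero, which is what makes it a convenient training criterion for 1-of-c speech class targets.

```python
import numpy as np

def cross_entropy_error(t, y, eps=1e-12):
    """Cross-entropy error of equation (4).

    t : (n_patterns, c) binary targets t_k^n
    y : (n_patterns, c) network outputs y_k^n in (0, 1)
    """
    y = np.clip(y, eps, 1.0 - eps)
    t = np.clip(t, eps, 1.0 - eps)   # keeps ln(y/t) finite; ~0 contribution when t is 0 or 1
    return -np.sum(t * np.log(y / t) + (1.0 - t) * np.log((1.0 - y) / (1.0 - t)))
```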
For example, in the MSE case the gating weights can be defined by the following equation
$$g_i = \frac{e^{\xi_i}}{\sum_{j=1}^{N} e^{\xi_j}} \qquad (6)$$
where $i = 1, 2, 3, \ldots, N$ and $\xi_k = a_k^T x = \sum_i a_{ki} x_i$. Here $N$ denotes the number of sub-spaces and $a_k$ is the $k$th state in the gating network. For our method the gating weights result from the SOM centroids, which are fine-tuned in the novel training stage. On differentiating the error function with respect to the three sets of parameters, the means, centres and weights, we obtain the optimum conditions for the improved generalization performance.
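The gating computation of equation (6) is a softmax over the linear scores $\xi_k = a_k^T x$. A minimal sketch, with parameter shapes assumed:

```python
import numpy as np

def gating_weights(a, x):
    """Softmax gating of equation (6): g_i = exp(xi_i) / sum_j exp(xi_j),
    with xi_k = a_k^T x.  a has shape (N, dim), x has shape (dim,)."""
    xi = a @ x
    xi = xi - np.max(xi)          # subtract the max for numerical stability
    g = np.exp(xi)
    return g / np.sum(g)          # g_i >= 0 and sum_i g_i = 1
```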
The optimal solution is obtained by the proper choice of the step size $\eta_k$. The Mean Square Error (MSE) is chosen as the training criterion, minimized using the above state equation, and every successive state is corrected using the weights. This results in an optimal solution for any input space using CMs. Supervised learning is followed for the training of the CM. The input space is analyzed for the dataset and RBF networks are allocated for training. The gating network parameters are initialized with equal weights for the $N$ nodes, and the CM output for the input data is initially computed with these equal weights. Denoting the desired class by $d$, the average error associated with the MoE at any time $t$ is given by $E[\varepsilon(t)] = E[d(t) - Z(t)]$. The gating parameters $\{a_k\}_{k=1}^{M}$ are adjusted using adaptive feedback for the minimization of the error cost function, $E[\varepsilon^2] \rightarrow 0$, and are driven towards the steady-state conditions. When the data from the HMM is introduced to the framework, the SOM partitions and clusters it and the initial values are fed to the MoE. The MoE then computes the output score. For a new set of data the weights are fine-tuned in such a way that the network contribution results in maximum performance.
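One way to realise this adaptive feedback is a stochastic-gradient step on the squared error $(d - Z)^2$ with respect to the gating parameters $a_k$. The sketch below assumes the softmax gating of equation (6); it is only one possible implementation, not the paper's exact procedure.

```python
import numpy as np

def update_gating(a, x, y, d, lr=0.01):
    """One gradient step on (d - Z)^2 w.r.t. the gating parameters a_k,
    assuming softmax gating g = softmax(a @ x) and Z = sum_i g_i * y_i.

    a : (N, dim) gating parameters,  x : (dim,) input,
    y : (N,) expert outputs,         d : scalar desired response.
    """
    xi = a @ x
    g = np.exp(xi - np.max(xi))
    g /= g.sum()
    Z = g @ y
    err = d - Z                                  # epsilon = d - Z
    # dZ/da_k = g_k * (y_k - Z) * x, so d(err^2)/da_k = -2 * err * g_k * (y_k - Z) * x
    grad = -2.0 * err * (g * (y - Z))[:, None] * x[None, :]
    return a - lr * grad, err
```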
3. Mixture of Experts – Generalization Performance Analysis:
In theory, the advantages of using a Mixture of Experts are due to its efficient usage of all the networks in a population. This makes it superior, as none of the networks is discarded, in contrast to other approaches where the best network is chosen out of many and training time is wasted. In brief, a mixture of expert networks yields a better generalized solution than the multi-network approach. The above argument can be explained using the following theory. Consider a network whose output is denoted by $y_k(x)$ and the desired response denoted by the regression function $h(x)$ which we are seeking to approximate. Then the error associated with $M$ such networks, with each single network contributing an error $\varepsilon_k(x)$, can be written as
$$E_{av} = \frac{1}{M}\sum_{k=1}^{M} E\!\left[\varepsilon_k^2\right]$$
If we consider an MoE involving $M$ networks whose outputs are averaged, the error associated with the MoE can be represented as
$$E_{com} = E\!\left[\left(\frac{1}{M}\sum_{k=1}^{M} y_k(x) - h(x)\right)^{2}\right] = E\!\left[\left(\frac{1}{M}\sum_{k=1}^{M}\varepsilon_k\right)^{2}\right]$$
If we assume the errors have zero mean and are uncorrelated, combining the above equations we can relate
$$E_{com} = \frac{1}{M^2}\sum_{k=1}^{M} E\!\left[\varepsilon_k^2\right] = \frac{1}{M} E_{av}$$
Using Cauchy's inequality in the form $\left(\sum_{k=1}^{M}\varepsilon_k\right)^2 \le M\sum_{k=1}^{M}\varepsilon_k^2$, the above equations confirm that $E_{com} \le E_{av}$ in general, so the MoE cannot contribute any increase in the expected error, yielding improved performance compared to the individual networks [10]. The architecture consists of an input layer, a hidden layer, an output layer and a gating network. The input space is divided into $M$ subspaces and each network focuses on a particular subspace, avoiding overlap across regions. For each
set of data in the particular subspace, the corresponding networks are trained for classification. RBF networks are chosen for their clustering and classifying properties, and a mixture of RBFs can handle the variability in the input data. This is fused into the hybrid HMM model, where the scores are calculated for every state of the HMM using the Viterbi decoding algorithm [2].
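The relation $E_{com} = E_{av}/M$ for zero-mean, uncorrelated errors is easy to check numerically. The simulation below uses synthetic Gaussian errors and is purely illustrative; the ensemble size and number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, trials = 10, 200000

# Zero-mean, uncorrelated errors for M individual networks.
eps = rng.normal(loc=0.0, scale=1.0, size=(trials, M))

E_av = np.mean(eps ** 2)                     # average expected error of the members
E_com = np.mean(np.mean(eps, axis=1) ** 2)   # expected error of the averaged committee

print(E_av, E_com, E_av / M)   # E_com is close to E_av / M and never exceeds E_av
```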
For the experiments the TIMIT speech database, consisting of 6300 utterances, 10 sentences spoken by each of 630 speakers from 8 dialect regions of the United States, is used. The front end has speech sampled at 8 kHz, split into Hamming-windowed frames of 20 ms duration with 10 ms overlap. They are analyzed through Mel filter banks, a DCT and the log energy spectrum, yielding 39 MFCC coefficients including the first- and second-order derivatives. These features are modeled individually with a baseline HMM having three to five states and using Gaussians for the emission probabilities. Training and decoding are then performed using the Viterbi algorithm. This configuration is then changed to a hybrid HMM-MLP model, replacing the Gaussians for the a posteriori emission probability calculations. The MLP used is feed-forward with two layers, with 117 input neurons (39 MFCC x 3), a hidden layer of 200 neurons and 64 output neurons. On analysis we observed that the hybrid HMM-MLP performs at 59% compared to 56% for the baseline.
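For reference, an MLP posterior estimator with the layer sizes quoted above (117-200-64) could be sketched as follows. Only the layer sizes come from the text; the framework (PyTorch), activation, loss and optimizer choices are assumptions.

```python
import torch
import torch.nn as nn

# 117 inputs (39 MFCC x 3), 200 hidden units, 64 outputs (one per HMM state class).
mlp = nn.Sequential(
    nn.Linear(117, 200),
    nn.Sigmoid(),                 # activation is an assumption; the paper does not state it
    nn.Linear(200, 64),
)

criterion = nn.CrossEntropyLoss()             # trains the outputs as class posteriors
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)

def train_step(features, state_labels):
    """features: (batch, 117) float tensor, state_labels: (batch,) long tensor."""
    optimizer.zero_grad()
    loss = criterion(mlp(features), state_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At decoding time, the softmax outputs divided by the state priors would give the scaled likelihoods used as emission scores, as in the sketch in the introduction.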
This improvement is due to the MLP's MLE approach approximating the emission probabilities better than the Gaussian counterpart. Alternatively, RBFs are chosen as the emission probability estimators with the MSE criterion, and a performance of 62.5% was achieved. A two-phase discriminative training for the RBF, where the MSE is first optimized using back-propagation and then the output scores are trained for Minimum Classification Error (MCE), yields a maximum performance of 63.8%.

As for the MoE, we have utilized the hybrid HMM and CM model. This configuration consists of RBFs with one hidden layer of 100 hidden units. The estimate of the a posteriori probabilities is the combined output score of the candidate RBF networks. For the classification, ordinary RBFs with an exponential activation function are used with MSE chosen as the criterion. The proportion of the weighting factors determines the individual RBF's role in approximating the MSE criterion. An iterative algorithm is applied to perform this, and the final output score of the CM is emitted, representing the a posteriori probability for each state of the HMM.

From the experiments we found the MSE cost function reaching as low as 0.013 in 8-12% fewer iterations than the other configurations. The RBF performs better as a member network as its approximation is quicker and it is easier to train. It is also observed that RBFs with one hidden layer are very effective for an MoE machine. Importantly, the requirement of a large number of networks to deal with the huge-dimensional hidden space is addressed by limiting the RBFs to their area of expertise. In this way the networks are task-managed and the hidden space is generalized using contributions from the neighbourhood networks. In the boundary regions the net output is a non-linear combination of the networks' outputs, resulting in smooth coverage of the hidden space.

The overall performance comparison is listed in Table 1. From the table it is evident that the CM results yield a 3% improvement over the RBF used as a single network to train the feature space with generalization. An important observation is that the two-phase RBF training for MSE and then MCE is avoided when RBFs are used within the MoE, which converges to the global minimum. The RBF on its own has poor ability to converge globally, even under Generalized Probabilistic Descent (GPD) [11]. This disadvantage of the RBF without GPD is also solved, with the results confirming the theory of the MoE's superiority over the individual networks.
4. Conclusion
In this paper a distinctive connectionist model for constructing an artificial intelligent system is presented. The benchmark tests of this AI system for speech recognition applications clearly indicate our proposed architecture's superior performance, with better speech recognition accuracy and a lower word error rate than the rest of the models developed so far. With its simple kernels and less strenuous training scheme, we have analyzed and achieved significant results, improving the word recognition rate by 10% over the best reported connectionist methods so far and by an impressive 18.082% over the non-connectionist HMM models. The error rate is reduced by 10.61% over connectionist models and by 23.29% when compared to the non-connectionist baseline HMM method. Finally, the theory of MoEs contributing fewer errors than any best individual network has been validated through our experimental results.
ACKNOWLEDGEMENT
This work has been funded by the Overseas Research Scholarship awarded by Universities UK. We would also like to thank the School of Electrical, Electronic and Computer Engineering, University of Newcastle upon Tyne for their financial support and encouragement of this academic research.
References:
[1] A. V. Rao and K. Rose, Deterministic Annealed Design of Hidden Markov Speech Recognizers, IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 2, pp. 111-125, Feb 2001.
[2] H. Bourlard and N. Morgan, Connectionist Speech Recognition, Kluwer Academic Publishers, Massachusetts, 1994.
[3] K. Choi and J. N. Hwang, Baum-Welch Hidden Markov Model Inversion for Reliable Audio-to-Visual Conversion, IEEE 3rd Workshop on Multimedia, pp. 175-180, 1999.
[4] S. Dupont, Missing Data Reconstruction for Robust Automatic Speech Recognition in the Framework of Hybrid HMM/ANN Systems, Proc. ICSLP'98, pp. 1439-1442, Sydney, Australia, 1998.
[5] Y. Gong, Speech Recognition in Noisy Environments: A Survey, Speech Communication, Vol. 12, No. 3, pp. 231-239, June 1995.
[6] A. Morris, A. Hagen and H. Bourlard, The Full Combination Sub-Bands Approach to Noise Robust HMM/ANN-Based ASR, Proc. of Eurospeech, Budapest, Hungary, pp. 599-602, 1999.
[7] N. Morgan and H. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, IEEE Signal Processing Magazine, pp. 25-42, May 1995.
[8] J. Picone, Continuous Speech Recognition Using Hidden Markov Models, IEEE ASSP Magazine, Vol. 7, No. 3, pp. 26-41, July 1990.
[9] S. Renals, N. Morgan, H. Bourlard, M. Cohen and H. Franco, Connectionist Probability Estimators in HMM Speech Recognition, IEEE Trans. Speech and Audio Processing, Vol. 2, No. 1, Part 2, pp. 161-174, 1994.
[10] S. Haykin, Neural Networks, Prentice Hall, 1999.
[11] W. Reichl and G. Ruske, A Hybrid RBF-HMM System for Continuous Speech Recognition, Proc. ICASSP, Vol. 5, pp. IV/3335-3338, IEEE, 1995.
[12] M. I. Jordan and R. A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, Vol. 6, pp. 181-214, 1994.
[13] M. I. Jordan and R. A. Jacobs, Modular and Hierarchical Learning Systems, in M. A. Arbib, ed., The Handbook of Brain Theory and Neural Networks, pp. 579-53, Cambridge, MA: MIT Press, 1995.
[14] T. Kohonen, The Self-Organizing Map, Proceedings of the IEEE, Vol. 78, No. 9, pp. 1464-1480, 1990.
[15] B. Tang, M. I. Heywood and M. Shepherd, Input Partitioning to Mixture of Experts, Proc. 2002 International Joint Conference on Neural Networks, pp. 227-232, Honolulu, Hawaii, May 2002.
Performance Analysis and Test Results
Mixture of Experts performance for the TIMIT database:
Total no. of words: 48974
Word Recognition Rate: 32557/48974 (66.48%)
Error Performance: 16417/48974 (33.52%)
a) Substitution: 11308/48974 (23.09%)
b) Insertion: 3008/48974 (6.14%)
c) Deletion: 2101/48974 (4.29%)
Recognition Configuration    RR*      Subs*    Del*     Ins*     ER*
Baseline HMM                 56.3     27.1     10.1     6.5      43.7
HMM + MLP                    59.09    26.24    8.91     5.76     40.91
HMM + RBF                    62.5     24.5     8.3      4.7      37.5
HMM + TDNN                   60.47    26.81    8.3      4.42     39.53
HMM + MoE                    66.48    23.09    6.14     4.29     33.52

Table 1: Comparative results of all recognition systems
* - In percentage
RR - Recognition Rate; Subs - Substitution Error; Del - Deletion Error; Ins - Insertion Error; ER - Error Rate
FIGURE 1: MoE framework diagram. [Block diagram: acoustic MFCC coefficients, HMM state model and SOM clustering feed the expert networks producing y1(n) ... y4(n); their gated sum gives the output y(n), which is compared with the desired response d to form the error e fed back to the gating network.]