GENERALIZATION ISSUES IN MULTICLASS CLASSIFICATION: A NEW FRAMEWORK USING MIXTURE OF EXPERTS
S. MEENAKSHISUNDARAM, W.L. WOO, S.S. DLAY
School of Electrical, Electronics and Computer Engineering
University of Newcastle upon Tyne
UNITED KINGDOM

Abstract: - In this paper we introduce a new framework for expert systems used in real-time speech applications. It consists of Mixture of Experts (MoE) networks trained for multiclass classification problems such as speech. We focus mainly on the generalization issues that are surprisingly ignored in established methods and demonstrate how severe these can be when the framework is deployed as a system. We limit this paper to addressing these issues and presenting the MoE's capability to overcome them; the statistical perspective behind the training is briefly presented. A significant leap in performance is achieved, justified by a 10% improvement in word recognition rate over the best available frameworks and an impressive 18.082% over the baseline HMM. Critically, the error rate is reduced by 10.61% over other connectionist models and by 23.29% over the baseline HMM method.

Key-Words: - Expert systems, Hybrid connectionist, Self-Organising Map, Mixture of Experts, Cross entropy

1. Introduction
Mixture of Experts (MoE) networks are used extensively as expert systems for their modular ability as classifiers and for their learning capabilities. Most of the established frameworks in this area use Mixture of Experts along with statistical models such as the HMM to model real-time applications. In speech recognition the classifier's objective is to map the input sequences to one of the target classes [1][8]. HMMs have been used extensively, but their reliance on probabilistic assumptions about the input and their limitations with correlated input distributions, which stem from the likelihood-maximization assumption, led Bourlard and Morgan to propose a hybrid paradigm of HMM and Artificial Neural Networks (ANN). It is based on the theory that, given satisfying regularity conditions and with each output unit of an ANN associated with a possible HMM state, the posterior probabilities for input patterns can be generated through training of the ANN [2-4][7]. These probabilities are then converted into total scores using the Viterbi decoding algorithm and used as the local probabilities in the HMM. This overcomes the HMM's poor discrimination among models caused by the Maximum Likelihood criterion. Popular hybrids include the Multilayered Perceptron (MLP), Radial Basis Function (RBF) networks, the Time Delay Neural Network (TDNN) and the Recurrent Neural Network (RNN) [5-6][9][11]. In this series the HMM has also been combined with the Mixture of Experts (MoE), and Hierarchical Mixtures of Experts [12-13] have been fused into the hybrid framework. However, these hybrids rely on a heuristic training scheme and their generalization capabilities are poorer because the results of the learning models are given less importance. To summarize, there is a strong need for an architecture efficient enough to be trained for good approximation, with a training procedure whose error performance reaches the global minimum.

2. Mixture of Experts Framework:
2.1 Architecture:
The input speech is modelled using an HMM and the vector set is fed to the MoE network. Here it is important to split the input space into subspaces, as this allows the modular experts to handle the input regions more appropriately.
Thus we have fused a Self-Organising Map (SOM), which categorises the input space and then clusters it into regions [14-15]. However, the SOM's advantages of partitioning and clustering are maximised only when the heuristic training scheme becomes more modular. The gating network's allocation and decision making are critical for MoE performance. The critical case occurs during the training stage when the gating weights assigned to the expert networks are not optimized for new data, and the network then degenerates in its generalization. To avoid this we propose a twin training loop in which the network parameters are first tuned to the training data set; when a new set of data is given to the framework we compute the weights again and fine-tune them to adjust to the variations in the input space.

As for the architecture, the member networks of the CM, the RBFs, are two-layer feed-forward ANNs with an input layer I, a hidden layer K made of basis functions and an output layer O. Each RBF is characterized by the centre of its Gaussian c_k, the width of its activation surface \sigma_k and the weighting factors used for training. We represent these parameters by \{c_k\}_{k=1}^{N}, \{\sigma_k\}_{k=1}^{N} and \{w_k\}_{k=1}^{N}.

To find the centres, the K-means clustering algorithm is applied, in which the data set is divided into subgroups and the centres are placed in regions containing significant data. Let m1 denote the number of RBFs, determined through experimentation, and let c_k(n), k = 1, ..., m1, denote the centres of the radial basis functions. At first, distinct random values are chosen for the initial centres c_k(0). A sample vector x(n) is drawn from the input space with a certain probability as the input to the algorithm at iteration n. If k(x) denotes the index of the best-matching centre for the input vector x, it can be found at iteration n using the minimum-distance Euclidean criterion

k(x) = \arg\min_{k} \| x(n) - c_k(n) \|,   k = 1, 2, ...

where c_k(n) is the centre of the kth radial basis function at iteration n. The centres of the RBFs are then adjusted using the update rule

c_k(n+1) = c_k(n) + \eta [x(n) - c_k(n)]   if k = k(x),
c_k(n+1) = c_k(n)   otherwise,

where \eta is a learning-rate parameter within the range 0 < \eta < 1. Finally, the iteration index n is incremented by 1 and the procedure is continued until no noticeable changes are observed in the centres. This algorithm achieves a locally optimal solution within the feature space.

The Least Mean Squares (LMS) algorithm is used to calculate the weights w_i, which are updated during online training; the weights are initialized to zero, w_i = 0 for 0 \le i \le N. A sample vector x_k is presented at step k = 1, 2, ... with d(k) = w_k^T x_k, where w_k are the estimated weights and x_k is the network input. If the error is denoted by e_k it can be written as e_k = d_k - d'_k, where d_k and d'_k are the desired class and the estimated output. The weights are adjusted according to the LMS rule

w(k+1) = w(k) + 2\eta e(k) x(k)

where \eta is the learning-rate parameter. The training process is continued until steady-state conditions are reached.
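The centre and weight updates described above can be collected into a short sketch. This is a minimal numpy illustration under our own naming; the dimensions, learning rates and the scalar target d are hypothetical, and it is not the authors' code.

```python
import numpy as np

def kmeans_update(centers, x, eta):
    """One competitive K-means step: move only the best-matching centre towards x(n)."""
    k_best = np.argmin(np.linalg.norm(centers - x, axis=1))   # minimum-distance Euclidean criterion
    centers = centers.copy()
    centers[k_best] += eta * (x - centers[k_best])            # c_k(n+1) = c_k(n) + eta*[x(n) - c_k(n)]
    return centers

def lms_update(w, x, d, eta):
    """One LMS step: e(k) = d(k) - w(k)^T x(k), then w(k+1) = w(k) + 2*eta*e(k)*x(k)."""
    e = d - w @ x
    return w + 2.0 * eta * e * x, e

# Toy usage with hypothetical dimensions: m1 = 4 centres over 39-dimensional vectors.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 39))
w = np.zeros(39)                      # weights initialized to zero, as in the text
for _ in range(100):
    x = rng.normal(size=39)
    centers = kmeans_update(centers, x, eta=0.1)
    w, e = lms_update(w, x, d=1.0, eta=0.01)
```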
The CM is built by combining the RBFs and a gating network to achieve a globally converging solution. The gating network applies feedback from the output to adjust the weights in subsequent stages. Let us consider such a system whose output is represented as Z. Representing the output of Fig. 2 in terms of the individual network outputs y_i we can write

Z = \sum_{i=1}^{N} y_i    (2)

where N denotes the number of RBF networks used within the MoE. With the weighting scheme applied, equation (2) can be written as

Z = \sum_{i=1}^{N} g_i y_i    (3)

The gating network parameters g_i are chosen with respect to the training data and the previous state output, subject to 0 \le g_i \le 1 and \sum_i g_i = 1.

For performance, the error function can be defined as the difference between the CM's output and the desired response in the classification problem. If the desired response is denoted by d, the aim is to reduce the error function, which can be written as d - Z. The architecture is optimal when this error function is kept at a minimum. Most of the existing architectures use the Mean Square Error (MSE) as the performance measure. MSE is suitable for regression problems; however, for multiclass classification problems such as speech classes it is appropriate to choose an error criterion known as the Cross Entropy (CE) function.

2.3 Cross Entropy (CE) Error Criterion:
From the MoE architecture of Fig. 2 we can write the error function as

E = -\sum_{n} \sum_{k=1}^{c} \left[ t_k^n \ln\frac{y_k^n}{t_k^n} + (1 - t_k^n) \ln\frac{1 - y_k^n}{1 - t_k^n} \right]    (4)

where t_k^n and y_k^n denote the target and the network output for class k on pattern n, and c is the number of classes. For the MSE criterion the cost function is

J = E\left[ \left( d - \sum_{i=1}^{N} g_i y_i \right)^2 \right]    (5)

and the gating weights can be defined by the following equation:

g_i = \frac{\exp(\xi_i)}{\sum_{j=1}^{N} \exp(\xi_j)}    (6)

where i = 1, 2, ..., N and \xi_k = a_k^T x = \sum_i a_{ki} x_i. Here N denotes the number of sub-spaces and a_k is the kth state in the gating network. For our method the gating weights result from the SOM centroids, which are fine-tuned in a novel training stage. By differentiating the error function with respect to the three parameter sets, the centres, widths and weights, we can obtain the optimum conditions for improved generalization performance. The optimal solution is obtained by a proper choice of the step size \eta_k. The Mean Square Error (MSE) is chosen as the training criterion, minimized using the above state equation, and every successive state is corrected using the weights. This results in an optimal solution for any input space using CMs.

Supervised learning is followed for the training of the CM. The input space is analysed for the dataset and RBF networks are allocated for training. The gating network parameters are initialized with equal weights for the N nodes, and the CM output for the input data is computed initially with these equal weights. Assuming the desired class is denoted by d, the average error associated with the MoE at any time t is given by E(t) = E[d(t) - Z(t)]. The gating parameters \{a_k\}_{k=1}^{M} are adjusted using adaptive feedback for the minimization of the error cost function, driving E^2 towards 0, and are moved towards steady-state conditions. When the data from the HMM are introduced to the framework, the SOM partitions and clusters them and the initial values are fed to the MoE. The MoE then computes the output score. For a new set of data the weights are fine-tuned in such a way that the network contributions yield maximum performance.
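The gating and error computations of equations (3), (4) and (6) can be illustrated with the following numpy sketch. It uses our own naming and toy values and is only a hedged illustration, not the authors' implementation.

```python
import numpy as np

def gate_weights(a, x):
    """Softmax gate of equation (6): g_i = exp(xi_i) / sum_j exp(xi_j), with xi_k = a_k^T x."""
    xi = a @ x                      # gating activations, one per sub-space
    xi = xi - xi.max()              # numerical stabilisation
    g = np.exp(xi)
    return g / g.sum()

def moe_output(g, y):
    """Gated combination of equation (3): Z = sum_i g_i * y_i."""
    return g @ y

def cross_entropy(y, t, eps=1e-12):
    """Cross-entropy criterion of equation (4), summed over classes (and patterns)."""
    y = np.clip(y, eps, 1 - eps)
    t = np.clip(t, eps, 1 - eps)
    return -np.sum(t * np.log(y / t) + (1 - t) * np.log((1 - y) / (1 - t)))

# Toy usage with N = 3 experts and a 39-dimensional input (hypothetical numbers).
rng = np.random.default_rng(0)
a, x = rng.normal(size=(3, 39)), rng.normal(size=39)
y = np.array([0.7, 0.2, 0.9])       # expert outputs y_i
g = gate_weights(a, x)              # satisfies 0 <= g_i <= 1 and sum_i g_i = 1
Z = moe_output(g, y)
ce = cross_entropy(y, t=np.array([1.0, 0.0, 1.0]))
```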
3. Mixture of Experts - Generalization Performance Analysis:
In theory the advantage of using a Mixture of Experts comes from its efficient usage of all the networks of a population. This makes it superior, since none of the networks is discarded, unlike approaches in which the best network is chosen out of many and the training time spent on the rest is wasted. In brief, a mixture of expert networks yields a better generalized solution than a multi-network selection approach. This argument can be explained with the following theory. Consider a network whose output is denoted by y_k(x), and let the desired response be the regression function h(x) that we seek to approximate. If each single network contributes an error \epsilon_k(x) = y_k(x) - h(x), the average error of M such networks acting individually can be written as

E_{av} = \frac{1}{M} \sum_{k=1}^{M} E[\epsilon_k^2(x)]

If we now consider a MoE involving the same M networks whose outputs are averaged, the error associated with the MoE can be represented as

E_{com} = E\left[ \left( \frac{1}{M} \sum_{k=1}^{M} y_k(x) - h(x) \right)^2 \right] = E\left[ \left( \frac{1}{M} \sum_{k=1}^{M} \epsilon_k(x) \right)^2 \right]

If we assume the errors have zero mean and are uncorrelated, combining the above equations gives E_{com} = \frac{1}{M^2} \sum_{k=1}^{M} E[\epsilon_k^2(x)] = \frac{1}{M} E_{av}. More generally, using Cauchy's inequality in the form \left( \sum_{k=1}^{M} \epsilon_k \right)^2 \le M \sum_{k=1}^{M} \epsilon_k^2, the above confirms that E_{com} \le E_{av}, so the MoE cannot contribute any increase in the expected error and yields improved performance compared to the individual networks [10].

The architecture consists of an input layer, a hidden layer, an output layer and a gating network. The input space is divided into M subspaces and each network focuses on a particular subspace, avoiding overlap across regions. For each set of data in a particular subspace the corresponding networks are trained for classification. RBF networks are chosen for their clustering and classifying properties, and a mixture of RBFs can handle the variability in the input data. This is fused into the hybrid HMM model, where the scores are calculated for every state of the HMM using the Viterbi decoding algorithm [2].

For the experiments the TIMIT speech database, consisting of 6300 utterances, 10 sentences spoken by each of 630 speakers from 8 dialect regions of the United States, is used. The front end uses speech sampled at 8 kHz, framed into Hamming-windowed frames of 20 ms duration with 10 ms overlap. The frames are analysed through Mel filter banks, the DCT and the log energy spectrum, yielding 39 MFCC coefficients including the first- and second-order derivatives. These features are first modelled with a baseline HMM having three to five states and using Gaussians for the emission probabilities. Training and decoding are then performed using the Viterbi algorithm. This configuration is then changed to a hybrid HMM-MLP model, replacing the Gaussians for the a posteriori emission probability calculations. The MLP used is feed-forward with two layers, giving 117 input neurons (39 MFCCs x 3 frames), a hidden layer of 200 neurons and 64 output neurons. On analysis we observed that the hybrid HMM-MLP performs at 59% compared to 56% for the baseline. This is due to the MLP approximating the emission probabilities better than its Gaussian counterpart. Alternatively, RBFs are chosen as the emission probability estimators with the MSE criterion, and a performance of 62.5% is achieved. A two-phase discriminative training of the RBF, in which the MSE is first optimized using back-propagation and the output scores are then trained for Minimum Classification Error (MCE), yields a maximum performance of 63.8%. For the MoE we have utilized the hybrid HMM and CM model. This configuration consists of RBFs with one hidden layer of 100 hidden units. The estimate of the a posteriori probabilities is the combined output score of the candidate RBF networks. For the classification, ordinary RBFs with exponential activation functions are used, with MSE chosen as the criterion. The proportion of the weighting factors determines each individual RBF's role in approximating the MSE criterion. An iterative algorithm is applied to perform this, and the final output score of the CM is emitted, representing the a posteriori probability for each state of the HMM. From the experiments we found the MSE cost function reaching as low as 0.013 in 8-12% fewer iterations than the other configurations. The RBF performs well as a member network because its approximation is quicker and it is easy to train. It is also observed that RBFs with one hidden layer are very effective for a MoE machine.
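For concreteness, a front end of the kind described above (20 ms Hamming-windowed frames with a 10 ms shift, 13 MFCCs plus first- and second-order derivatives giving 39 coefficients per frame) could be sketched as follows. The use of librosa and the file name are assumptions for illustration only; this is not the toolkit or code used in the paper.

```python
import numpy as np
import librosa

# Hypothetical 39-dimensional MFCC front end: 13 static coefficients plus deltas and delta-deltas.
signal, sr = librosa.load("utterance.wav", sr=8000)            # 8 kHz sampling, as stated in the text
frame_len = int(0.020 * sr)                                    # 20 ms analysis frames
hop_len = int(0.010 * sr)                                      # 10 ms frame shift (overlapping frames)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len,
                            window="hamming")                  # Mel filter bank, log energies, DCT
delta = librosa.feature.delta(mfcc, order=1)                   # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)                  # second-order derivatives
features = np.vstack([mfcc, delta, delta2])                    # (39, n_frames) observation vectors
```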
Importantly, the need for a large number of networks to deal with the huge-dimensional hidden space is addressed by limiting the RBFs to their areas of expertise. In this way the networks are task-managed and the hidden space is generalized using contributions from the neighbouring networks. In the boundary regions the net output is a non-linear combination of the network outputs, resulting in smooth coverage of the hidden space. The overall performance comparison is listed in Table 1. From the table it is evident that the CM results yield a 3% improvement over an RBF used as a single network to train the feature space with generalization. An important observation is that the two-phase RBF training, first for MSE and then for MCE, is avoided when the RBF is used within the MoE, which converges towards the global minimum. An RBF on its own has poor ability to converge globally even under Generalized Probabilistic Descent (GPD) [11]. The disadvantage of the RBF without GPD is thus also resolved, with the results confirming the theory of the MoE's superiority over the individual networks.
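The averaging argument of Section 3, E_com = E_av / M for zero-mean, uncorrelated expert errors, can be checked numerically with a toy Monte Carlo sketch. This is an illustration of the theory only and is unrelated to the TIMIT experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(1)
M, trials = 10, 200_000                    # M experts, number of Monte Carlo samples
eps = rng.normal(size=(trials, M))         # zero-mean, uncorrelated expert errors eps_k(x)

E_av = np.mean(eps ** 2)                   # (1/M) * sum_k E[eps_k^2]
E_com = np.mean(eps.mean(axis=1) ** 2)     # E[((1/M) * sum_k eps_k)^2]

print(E_av, E_com, E_av / M)               # E_com is close to E_av / M and never exceeds E_av
```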
4. Conclusion
In this paper a distinctive connectionist model for constructing an artificial intelligence system is presented. The benchmark tests of this AI system for speech recognition applications clearly indicate our proposed architecture's superior performance, with better speech recognition accuracy and a lower word error rate than the other models developed so far. With its simple kernels and less strenuous training scheme we have analyzed and achieved significant results, improving the word recognition rate by 10% over the best reported connectionist methods and by an impressive 18.082% over the non-connectionist HMM models. The error rate is reduced by 10.61% over connectionist models and by 23.29% when compared to the non-connectionist baseline HMM method. Finally, the theory that a MoE contributes fewer errors than the best individual network has been validated through our experimental results.

ACKNOWLEDGEMENT
This work has been funded by an Overseas Research Scholarship from Universities UK. We would also like to thank the School of Electrical, Electronic and Computer Engineering, University of Newcastle upon Tyne, for their financial support and encouragement of this academic research.

References:
[1] A. V. Rao and K. Rose, Deterministic Annealed Design of Hidden Markov Speech Recognizers, IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 2, pp. 111-125, Feb. 2001.
[2] H. Bourlard and N. Morgan, Connectionist Speech Recognition, Kluwer Academic Publishers, Massachusetts, 1994.
[3] K. Choi and J. N. Hwang, Baum-Welch Hidden Markov Model Inversion for Reliable Audio-to-Visual Conversion, IEEE 3rd Workshop on Multimedia, pp. 175-180, 1999.
[4] S. Dupont, Missing Data Reconstruction for Robust Automatic Speech Recognition in the Framework of Hybrid HMM/ANN Systems, Proc. ICSLP'98, pp. 1439-1442, Sydney, Australia, 1998.
[5] Y. Gong, Speech Recognition in Noisy Environments: A Survey, Speech Communication, Vol. 12, No. 3, pp. 231-239, June 1995.
[6] A. Morris, A. Hagen and H. Bourlard, The Full Combination Sub-Bands Approach to Noise Robust HMM/ANN-Based ASR, Proc. Eurospeech, Budapest, Hungary, pp. 599-602, 1999.
[7] N. Morgan and H. Bourlard, An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition, IEEE Signal Processing Magazine, pp. 25-42, May 1995.
[8] J. Picone, Continuous Speech Recognition Using Hidden Markov Models, IEEE ASSP Magazine, Vol. 7, No. 3, pp. 26-41, July 1990.
[9] S. Renals, N. Morgan, H. Bourlard, M. Cohen and H. Franco, Connectionist Probability Estimators in HMM Speech Recognition, IEEE Trans. on Speech and Audio Processing, Vol. 2, No. 1, Part 2, pp. 161-174, 1994.
[10] S. Haykin, Neural Networks, Prentice Hall, 1999.
[11] W. Reichl and G. Ruske, A Hybrid RBF-HMM System for Continuous Speech Recognition, Proc. ICASSP, Vol. 5, pp. 3335-3338, 1995.
[12] M. I. Jordan and R. A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, Vol. 6, pp. 181-214, 1994.
[13] M. I. Jordan and R. A. Jacobs, Modular and Hierarchical Learning Systems, in M. A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, Cambridge, MA: MIT Press, 1995.
[14] T. Kohonen, The Self-Organizing Map, Proceedings of the IEEE, Vol. 78, No. 9, pp. 1464-1480, 1990.
[15] B. Tang, M. I. Heywood and M. Shepherd, Input Partitioning to Mixture of Experts, Proc. 2002 International Joint Conference on Neural Networks, Honolulu, Hawaii, pp. 227-232, May 2002.

Performance Analysis and Test Results
Mixture of Experts performance for the TIMIT database:
Total no. of words: 48974
Word recognition rate: 32557/48974 (66.48%)
Error performance: 16417/48974 (33.52%)
a) Substitutions: 11308/48974 (23.09%)
b) Insertions: 3008/48974 (6.14%)
c) Deletions: 2101/48974 (4.29%)

Recognition Configuration   RR*     Subs*   Del*   Ins*   ER*
Baseline HMM                56.3    27.1    10.1   6.5    43.7
HMM + MLP                   59.09   26.24   8.91   5.76   40.91
HMM + RBF                   62.5    24.5    8.3    4.7    37.5
HMM + TDNN                  60.47   26.81   8.3    4.42   39.53
HMM + MoE                   66.48   23.09   6.14   4.29   33.52

Table 1: Comparative results of all recognition systems (* all figures in percent). RR = Recognition Rate, Subs = Substitution Error, Del = Deletion Error, Ins = Insertion Error, ER = Error Rate.

Figure 1: MoE framework block diagram. Acoustic MFCC coefficients are modelled by HMM state models and partitioned by SOM clustering; the expert outputs y1(n), ..., y4(n) are combined through the gating network into the output y(n), which is compared with the desired response d to form the error e fed back to the gating network.
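For completeness, the percentages reported above for the HMM + MoE system follow directly from the raw word counts; the short sketch below reproduces the arithmetic.

```python
# Recompute the HMM + MoE rates from the raw TIMIT word counts listed above.
total = 48974
correct, subs, ins, dele = 32557, 11308, 3008, 2101
print(f"recognition rate: {100 * correct / total:.2f}%")              # 66.48%
print(f"substitutions:    {100 * subs / total:.2f}%")                 # 23.09%
print(f"insertions:       {100 * ins / total:.2f}%")                  # 6.14%
print(f"deletions:        {100 * dele / total:.2f}%")                 # 4.29%
print(f"error rate:       {100 * (subs + ins + dele) / total:.2f}%")  # 33.52%
```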