SELF-ORGANIZING MAP AND MULTILAYER PERCEPTRON FOR MALAY SPEECH RECOGNITION

GOH KIA ENG

A thesis submitted in fulfilment of the requirements for the award of the degree of Master of Science (Computer Science)

Faculty of Computer Science and Information System
Universiti Teknologi Malaysia

AUGUST 2006

To my beloved mother and father

ACKNOWLEDGEMENT

First of all, I would like to thank my mother and father, who have supported me and given me much encouragement to complete this thesis. They have been so great, and I know there would be no way I could have such a wonderful life without their love and care. Thank you for always being there for me.

A special thank you to my supervisor, Prof. Madya Abdul Manan bin Ahmad, for all his guidance and time. Thanks so much for his advice, comments and suggestions on how to improve this research and how to produce a good thesis. He has been understanding and helpful in guiding me to complete this research.

I would also like to take this opportunity to thank all my friends. All the motivation, help and support are fully appreciated. Thank you for being there, listening to my complaints and lending me a helping hand whenever I was in trouble.

Last but not least, to those not mentioned above: your countless effort and support will always be remembered. All credit to everyone! Thank you very much.

ABSTRACT

Various studies have been carried out in the field of speech recognition using techniques such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM) and Artificial Neural Network (ANN) in order to obtain the most suitable model for a speech recognition system. Every model has its drawbacks and weaknesses. The Multilayer Perceptron (MLP) is a popular ANN for pattern recognition, especially speech recognition, because of its non-linearity, ability to learn, robustness and ability to generalize. However, the MLP has difficulty with temporal information because it requires input patterns of fixed length. With that in mind, this research focuses on a hybrid model/approach that combines the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) to overcome or reduce these drawbacks. A hybrid neural network model has been developed for speech recognition in the Malay language. In the proposed model, a 2-D SOM is used as a sequential mapping function that transforms the acoustic vector sequence of a speech signal into a binary matrix, thereby performing dimensionality reduction. The idea of the approach is to accumulate the winner nodes of an utterance into a binary matrix, where each winner node is scaled to the value "1" and all other nodes to "0". The resulting binary matrix represents the content of the utterance. An MLP is then used to classify each binary matrix into the word it corresponds to. The conventional model (MLP only) and the proposed model (SOM and MLP) were tested on digit recognition ("satu" to "sembilan") and word recognition (30 selected Malay words) to determine the recognition accuracy under different parameter values (cepstral order, dimension of SOM, hidden node number and learning rate). Both models were also tested using two types of classification: syllable classification and word classification. Finally, the conventional and proposed models were compared and discussed based on their recognition accuracy. The experimental results showed that the proposed model achieved higher accuracy.
ABSTRAK

Much research has been conducted in the field of speech recognition using various techniques such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Artificial Neural Networks (ANN) and so on. Nevertheless, each technique has its own weaknesses, which make systems less accurate. The Multilayer Perceptron (MLP) is a well-known neural network for speech recognition; however, it has weaknesses that degrade system performance. This research therefore focuses on the development of a hybrid model that combines two neural networks, the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP). A hybrid neural-network-based model has been developed for a Malay speech recognition system. In this model, a two-dimensional SOM is used as a sequential mapping function to transform the acoustic vector sequence of the speech signal into a binary matrix, with the aim of reducing the dimensionality of the speech vectors. The SOM stores the winner nodes of an utterance in matrix form, where each winner node is scaled to the value "1" and all other nodes to "0". This produces a binary matrix that represents the content of the utterance. The MLP then classifies each binary matrix into its respective class. Experiments were conducted on the traditional model (MLP) and the hybrid model (SOM and MLP) for digit recognition ("satu" to "sembilan") and two-syllable word recognition (30 selected words). The experiments aimed to obtain the recognition accuracy using different parameter values (cepstral dimension, SOM dimension, number of hidden nodes and learning rate). Both models were also tested using two classification techniques: classification by syllable and classification by word. A comparison and discussion were made based on their respective recognition accuracies. The results show that our model achieves higher accuracy.
TABLE OF CONTENTS

CHAPTER TITLE

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF SYMBOLS
LIST OF APPENDICES

1 INTRODUCTION
1.1 Introduction
1.2 Background of Study
1.3 Problem Statements
1.4 Aim of the Research
1.5 Objectives of the Research
1.6 Scopes of the Research
1.7 Justification
1.8 Thesis Outline

2 REVIEW OF SPEECH RECOGNITION AND NEURAL NETWORK
2.1 Fundamental of Speech Recognition
2.2 Linear Predictive Coding (LPC)
2.3 Speech Recognition Approaches
2.3.1 Dynamic Time Warping (DTW)
2.3.2 Hidden Markov Model (HMM)
2.3.3 Artificial Neural Network (ANN)
2.4 Comparison between Speech Recognition Approaches
2.5 Review of Artificial Neural Networks
2.5.1 Processing Units
2.5.2 Connections
2.5.3 Computation
2.5.4 Training
2.6 Types of Neural Networks
2.6.1 Supervised Learning
2.6.2 Semi-Supervised Learning
2.6.3 Unsupervised Learning
2.6.4 Hybrid Networks
2.7 Related Research
2.7.1 Phoneme/Subword Classification
2.7.2 Word Classification
2.7.3 Classification Using Hybrid Neural Network Approach
2.8 Summary

3 SPEECH DATASET DESIGN
3.1 Human Speech Production Mechanism
3.2 Malay Morphology
3.2.1 Primary Word
3.2.2 Derivative Word
3.2.3 Compound Word
3.2.4 Reduplicative Word
3.3 Malay Speech Dataset Design
3.3.1 Selection of Malay Speech Target Sounds
3.3.2 Acquisition of Malay Speech Dataset
3.4 Summary

4 FEATURE EXTRACTION AND CLASSIFICATION ALGORITHM
4.1 The Architecture of Speech Recognition System
4.2 Feature Extractor (FE)
4.2.1 Speech Sampling
4.2.2 Frame Blocking
4.2.3 Pre-emphasis
4.2.4 Windowing
4.2.5 Autocorrelation Analysis
4.2.6 LPC Analysis
4.2.7 Cepstrum Analysis
4.2.8 Endpoint Detection
4.2.9 Parameter Weighting
4.3 Self-Organizing Map (SOM)
4.3.1 SOM Architecture
4.3.2 Learning Algorithm
4.3.3 Dimensionality Reduction
4.4 Multilayer Perceptron (MLP)
4.4.1 MLP Architecture
4.4.2 Activation Function
4.4.3 Error-Backpropagation
4.4.4 Improving Error-Backpropagation
4.4.5 Implementation of Error-Backpropagation
4.5 Summary

5 SYSTEM DESIGN AND IMPLEMENTATION
5.1 Introduction
5.2 Implementation of Speech Processing
5.2.1 Feature Extraction using LPC
5.2.2 Endpoint Detection
5.3 Implementation of Self-Organizing Map
5.4 Implementation of Multilayer Perceptron
5.4.1 MLP Architecture for Digit Recognition
5.4.2 MLP Architecture for Word Recognition
5.4.3 Implementation of MLP
5.5 Experiment Setup

6 RESULTS AND DISCUSSION
6.1 Introduction
6.2 Testing of Digit Recognition
6.2.1 Testing Results for Conventional System
6.2.1.1 Experiment 1: Optimal Cepstral Order (CO)
6.2.1.2 Experiment 2: Optimal Hidden Node Number (HNN)
6.2.1.3 Experiment 3: Optimal Learning Rate (LR)
6.2.2 Results for Proposed System Testing
6.2.2.1 Experiment 1: Optimal Cepstral Order (CO)
6.2.2.2 Experiment 2: Optimal Dimension of SOM (DSOM)
6.2.2.3 Experiment 3: Optimal Hidden Node Number (HNN)
6.2.2.4 Experiment 4: Optimal Learning Rate (LR)
6.2.3 Discussion for Digit Recognition Testing
6.2.3.1 Comparison of Performance for DRCS and DRPS (CO)
6.2.3.2 Comparison of Performance for DRCS and DRPS (HNN)
6.2.3.3 Comparison of Performance for DRCS and DRPS (LR)
6.2.3.4 Discussion on Performance of DRPS according to DSOM
6.2.3.5 Summary for Digit Recognition Testing
6.3 Testing of Word Recognition
6.3.1 Results for Conventional System Testing (Syllable Classification)
6.3.1.1 Experiment 1: Optimal Cepstral Order (CO)
6.3.1.2 Experiment 2: Optimal Hidden Node Number (HNN)
6.3.1.3 Experiment 3: Optimal Learning Rate (LR)
6.3.2 Results for Conventional System Testing (Word Classification)
6.3.2.1 Experiment 1: Optimal Cepstral Order (CO)
6.3.2.2 Experiment 2: Optimal Hidden Node Number (HNN)
6.3.2.3 Experiment 3: Optimal Learning Rate (LR)
6.3.3 Results for Proposed System Testing (Syllable Classification)
6.3.3.1 Experiment 1: Optimal Cepstral Order (CO)
6.3.3.2 Experiment 2: Optimal Dimension of SOM (DSOM)
6.3.3.3 Experiment 3: Optimal Hidden Node Number (HNN)
6.3.3.4 Experiment 4: Optimal Learning Rate (LR)
6.3.4 Results for Proposed System Testing (Word Classification)
6.3.4.1 Experiment 1: Optimal Cepstral Order (CO)
6.3.4.2 Experiment 2: Optimal Dimension of SOM (DSOM)
6.3.4.3 Experiment 3: Optimal Hidden Node Number (HNN)
6.3.4.4 Experiment 4: Optimal Learning Rate (LR)
6.3.5 Discussion for Word Recognition Testing
6.3.5.1 Comparison of Performance for WRCS and WRPS according to CO
6.3.5.2 Comparison of Performance for WRCS and WRPS according to HNN
6.3.5.3 Comparison of Performance for WRCS and WRPS according to LR
6.3.5.4 Comparison of Performance of WRPS according to DSOM
6.3.5.5 Comparison of Performance for WRCS and WRPS according to Type of Classification
6.3.5.6 Summary of Discussion for Word Recognition
6.4 Summary

7 CONCLUSION AND SUGGESTION
7.1 Conclusion
7.2 Directions for Future Research

REFERENCES
PUBLICATIONS
Appendices A – V

LIST OF TABLES

TABLE NO. TITLE

1.1 Comparison of different speech recognition systems
2.1 The comparison between different speech recognition approaches
2.2 The performance comparison between different speech recognition approaches
3.1 Structure of words with one syllable
3.2 Structure of words with two syllables
3.3 Structure of words with three syllables or more
3.4 15 selected syllables used to form two-syllable words as target sounds
3.5 Two-syllable Malay words combined using 15 selected syllables
3.6 30 selected Malay two-syllable words as the speech target sounds
3.7 10 selected digit words as the speech target sounds for digit recognition
3.8 Specification of dataset for word recognition
3.9 Specification of dataset for digit recognition
5.1 The setting of the target values for MLP in digit recognition
5.2 The setting of the target values for MLP (syllable classification)
5.3 The setting of the target values for MLP (word classification)
6.1 Recognition accuracy for different CO for Experiment 1 (DRCS)
6.2 Recognition accuracy for different HNN for Experiment 2 (DRCS)
6.3 Recognition accuracy for different LR for Experiment 3 (DRCS)
6.4 Recognition accuracy for different CO for Experiment 1 (DRPS)
6.5 Recognition accuracy for different DSOM for Experiment 2 (DRPS)
6.6 Recognition accuracy for different HNN for Experiment 3 (DRPS)
6.7 Recognition accuracy for different LR for Experiment 4 (DRPS)
6.8 Comparison of performance for DRCS and DRPS according to CO
6.9 Comparison of performance for DRCS and DRPS according to HNN
6.10 Comparison of performance for DRCS and DRPS according to LR
6.11 The optimal parameters and the architecture for DRPS
6.12 Recognition accuracy for different CO for Experiment 1 (WRCS(S))
6.13 Recognition accuracy for different HNN for Experiment 2 (WRCS(S))
6.14 Recognition accuracy for different LR for Experiment 3 (WRCS(S))
6.15 Recognition accuracy for different CO for Experiment 1 (WRCS(W))
6.16 Recognition accuracy for different HNN for Experiment 2 (WRCS(W))
6.17 Recognition accuracy for different LR for Experiment 3 (WRCS(W))
6.18 Recognition accuracy for different CO for Experiment 1 (WRPS(S))
6.19 Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))
6.20 Recognition accuracy for different HNN for Experiment 3 (WRPS(S))
6.21 Recognition accuracy for different LR for Experiment 4 (WRPS(S))
6.22 Recognition accuracy for different CO for Experiment 1 (WRPS(W))
6.23 Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))
6.24 Recognition accuracy for different HNN for Experiment 3 (WRPS(W))
6.25 Recognition accuracy for different LR for Experiment 4 (WRPS(W))
6.26 Comparison of performance for WRCS(S) and WRPS(S) according to CO
6.27 Comparison of performance for WRCS(W) and WRPS(W) according to CO
6.28 Comparison of performance for WRCS(S) and WRPS(S) according to HNN
6.29 Comparison of performance for WRCS(W) and WRPS(W) according to HNN
6.30 Comparison of performance for WRCS(S) and WRPS(S) according to LR
6.31 Comparison of performance for WRCS(W) and WRPS(W) according to LR
6.32 Comparison of performance for WRPS according to DSOM
6.33 Results of testing for WRCS and WRPS according to type of classification
6.34 The optimal parameters and the architecture for WRPS(S)

LIST OF FIGURES

FIGURE NO. TITLE

1.1 Feature map with neurons (circles) labeled with the symbols of the phonemes to which they "learned" to give the best responses
1.2 The sequence of the responses obtained from the trained feature map when the Finnish word humppila was uttered
2.1 Basic model of speech recognition system
2.2 The current speech sample is predicted as a linear combination of the past p samples (n = total number of speech samples)
2.3 Dynamic Time Warping (DTW)
2.4 A basic architecture of Multilayer Perceptron (MLP)
2.5 A basic neuron processing unit
2.6 Neural network topologies: (a) Unstructured, (b) Layered, (c) Recurrent and (d) Modular
2.7 Perceptrons: (a) Single-layer Perceptron (b) Multilayer Perceptron
2.8 Decision regions formed by a 2-layer Perceptron using backpropagation training and vowel formant data
3.1 The vocal tract
3.2 Structure of one-syllable words "Ya" and "Stor"
3.3 Structure of two-syllable words "Guru" and "Jemput"
4.1 Proposed speech recognition model
4.2 Feature Extractor (FE) schematic diagram
4.3 Speech signal for the word kosong01.wav sampled at 16 kHz with a precision of 16 bits
4.4 Blocking of speech waveform into overlapping frames with N analysis frame length and M shifting length
4.5 Cepstral coefficients of BU.cep
4.6 SOM transforms feature vectors generated by speech processing into a binary matrix, which performs dimensionality reduction
4.7 The 2-D SOM architecture
4.8 Flow chart of SOM learning algorithm
4.9 Trained feature map after 1,250,000 iterations
4.10 Dimensionality reduction performed by SOM
4.11(a) The 12 x 12 mapping of binary matrix of /bu/ syllable
4.11(b) Binary matrix of /bu/ which is fed as input for MLP
4.12 A three-layer Multilayer Perceptron
4.13 The determination of hidden node number using Geometric Pyramid Rule (GPR)
4.14 Flow chart of error-backpropagation algorithm
5.1 The implementation of speech recognition system
5.2(a) The detected boundaries of sembilan04.wav using rms energy in Level 1 of initial endpoint detection
5.2(b) The detected boundaries of sembilan04.wav using zero-crossing rate in Level 2 of initial endpoint detection
5.2(c) The actual boundaries of sembilan04.wav using Euclidean distance of cepstrum in Level 3 of actual endpoint detection
5.3 The architecture of Self-Organizing Map (SOM)
5.4 MLP with 10 output nodes. The 10 output nodes correspond to 10 Malay digit words respectively
5.5 MLP with 15 output nodes. The 15 output nodes correspond to 15 Malay syllables respectively
5.6 MLP with 30 output nodes. The 30 output nodes correspond to 30 Malay two-syllable words respectively
5.7 System architecture for conventional model (single network)
5.8 System architecture for proposed model (hybrid network)
5.9 Training and testing of the digit recognition system
5.10 Training and testing of the word recognition system
6.1 Presentation and discussion of the results of the tests in table and graph form in stages
6.2 Recognition accuracy for different CO for Experiment 1 (DRCS)
6.3 Recognition accuracy for different HNN for Experiment 2 (DRCS)
6.4 Recognition accuracy for different LR for Experiment 3 (DRCS)
6.5 Recognition accuracy for different CO for Experiment 1 (DRPS)
6.6 Recognition accuracy for different DSOM for Experiment 2 (DRPS)
6.7 Recognition accuracy for different HNN for Experiment 3 (DRPS)
6.8 Recognition accuracy for different LR for Experiment 4 (DRPS)
6.9 Analysis of comparison of performance for DRCS and DRPS according to CO
6.10 Analysis of comparison of performance for DRCS and DRPS according to HNN
6.11 Analysis of comparison of performance for DRCS and DRPS according to LR
6.12 Recognition accuracy for different CO for Experiment 1 (WRCS(S))
6.13 Recognition accuracy for different HNN for Experiment 2 (WRCS(S))
6.14 Recognition accuracy for different LR for Experiment 3 (WRCS(S))
6.15 Recognition accuracy for different CO for Experiment 1 (WRCS(W))
6.16 Recognition accuracy for different HNN for Experiment 2 (WRCS(W))
6.17 Recognition accuracy for different LR for Experiment 3 (WRCS(W))
6.18 Recognition accuracy for different CO for Experiment 1 (WRPS(S))
6.19 Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))
6.20 Recognition accuracy for different HNN for Experiment 3 (WRPS(S))
6.21 Recognition accuracy for different LR for Experiment 4 (WRPS(S))
6.22 Recognition accuracy for different CO for Experiment 1 (WRPS(W))
6.23 Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))
6.24 Recognition accuracy for different HNN for Experiment 3 (WRPS(W))
6.25 Recognition accuracy for different LR for Experiment 4 (WRPS(W))
6.26 Comparison of performance for WRCS(S) and WRPS(S) according to CO
6.27 Comparison of performance for WRCS(W) and WRPS(W) according to CO
6.28 Comparison of performance for WRCS(S) and WRPS(S) according to HNN
6.29 Comparison of performance for WRCS(W) and WRPS(W) according to HNN
6.30 Comparison of performance for WRCS(S) and WRPS(S) according to LR
6.31 Comparison of performance for WRCS(W) and WRPS(W) according to LR
6.32 Comparison of performance for WRPS according to DSOM
6.33(a) Matrix mapping of the word "buku", where the arrows show the direction of the sequence of phonemes
6.33(b) Matrix mapping of the word "kubu", where the arrows show the direction of the sequence of phonemes
6.34 Analysis of comparison of performance for WRCS and WRPS according to syllable classification and word classification
LIST OF ABBREVIATIONS

AI - Artificial Intelligence
ANN - Artificial Neural Network
BMU - Best Matching Unit
BP - Back-Propagation
CO - Cepstral Order
CS - Conventional System
DR - Digit Recognition
DRCS - Digit Recognition Conventional System
DRPS - Digit Recognition Proposed System
DSOM - Dimension of Self-Organizing Map
DTW - Dynamic Time Warping
FE - Feature Extractor
GPR - Geometric Pyramid Rule
HMM - Hidden Markov Model
HNN - Hidden Node Number
KSOM - Kohonen Self-Organizing Map
LP - Linear Prediction
LPC - Linear Predictive Coding
LR - Learning Rate
LVQ - Learning Vector Quantization
MLP - Multilayer Perceptron
PARCOR - Partial Correlation
PC - Personal Computer
PS - Proposed System
SAMSOM - Structure Adaptive Multilayer Self-Organizing Map
SLP - Single-Layer Perceptron
SOM - Self-Organizing Map
TDNN - Time-Delay Neural Network
VQ - Vector Quantization
WPF - Winning Probability Function
WR - Word Recognition
WRCS - Word Recognition Conventional System
WRCS(S) - Word Recognition Conventional System using Syllable Classification
WRCS(W) - Word Recognition Conventional System using Word Classification
WRPS - Word Recognition Proposed System
WRPS(S) - Word Recognition Proposed System using Syllable Classification
WRPS(W) - Word Recognition Proposed System using Word Classification

LIST OF SYMBOLS

s - Speech sample
ŝ - Predicted speech sample
a - Predictor coefficient
e - Prediction error
E - Mean squared error (LPC); energy power (endpoint detection)
Z - Zero-crossing rate
T - Threshold (endpoint detection)
D - Weighted Euclidean distance
R - Autocorrelation function
w - Hamming window
p - The order of the LPC analysis
k - PARCOR coefficients
c - Cepstral coefficients
X - Input nodes
Y - Output nodes
H - Hidden nodes
M - Weights
B - Bias
σ - Width of lattice (SOM)
λ - Time constant (SOM)
α - Learning rate (SOM)
Θ - Influence of a node's distance from the BMU (SOM)
η - Learning rate (MLP)
δ - Error information term

LIST OF APPENDICES

APPENDIX TITLE

A Specification of test on optimal Cepstral Order for DRCS
B Specification of test on optimal Hidden Node Number for DRCS
C Specification of test on optimal Learning Rate for DRCS
D Specification of test on optimal Cepstral Order for DRPS
E Specification of test on optimal Dimension of SOM for DRPS
F Specification of test on optimal Hidden Node Number for DRPS
G Specification of test on optimal Learning Rate for DRPS
H Specification of test on optimal Cepstral Order for WRCS(S)
I Specification of test on optimal Hidden Node Number for WRCS(S)
J Specification of test on optimal Learning Rate for WRCS(S)
K Specification of test on optimal Cepstral Order for WRCS(W)
L Specification of test on optimal Hidden Node Number for WRCS(W)
M Specification of test on optimal Learning Rate for WRCS(W)
N Specification of test on optimal Cepstral Order for WRPS(S)
O Specification of test on optimal Dimension of SOM for WRPS(S)
P Specification of test on optimal Hidden Node Number for WRPS(S)
Q Specification of test on optimal Learning Rate for WRPS(S)
R Specification of test on optimal Cepstral Order for WRPS(W)
S Specification of test on optimal Dimension of SOM for WRPS(W)
T Specification of test on optimal Hidden Node Number for WRPS(W)
U Specification of test on optimal Learning Rate for WRPS(W)
V Convergence file (dua12.cep) showing the rms error in each epoch
CHAPTER 1

INTRODUCTION

1.1 Introduction

By 1990, many researchers had demonstrated the value of neural networks for important tasks such as phoneme recognition and spoken digit recognition. However, it was still unclear whether connectionist techniques would scale up to large speech recognition tasks. There is a large variety of speech recognition technologies, and it is important to understand the differences between them. Speech recognition systems can be classified according to the type of speech, the size of the vocabulary, the basic units and the degree of speaker independence. The position of a speech recognition system along these dimensions determines which algorithms can, or must, be used.

Speech recognition has been another proving ground for neural networks. Some researchers achieved good results in such basic tasks as voiced/unvoiced discrimination (Watrous, 1988), phoneme recognition (Waibel et al., 1989), and spoken digit recognition (Peeling and Moore, 1987). However, research into good neural network models for robust speech recognition still has wide potential for development.

Why does the speech recognition problem attract researchers? If an efficient speech recognizer were produced, a very natural human-machine interface would be obtained. By natural we mean something intuitive and easy for a person to use: a method that requires no special tools or machines, only the natural capabilities that every human possesses. Such a system could be used by any person who is able to speak and would allow an even broader use of machines, specifically computers.

1.2 Background of Study

Neural network classifiers have been compared with other pattern recognition classifiers and explored as an alternative to other speech recognition techniques. Lippmann (1989) proposed a static model which is employed as an input pattern of a Multilayer Perceptron (MLP) network. The conventional neural network (Pont et al., 1996; Ahkuputra et al., 1998; Choubassi et al., 2003) consists of a few basic layers (input, hidden and output) in a Multilayer Perceptron topology; a training algorithm such as backpropagation is then applied to develop the interconnection weights. This conventional model has also been used in a variety of pattern recognition and control applications that are not effectively handled by other AI paradigms.

However, there are some difficulties in using the MLP alone. The major difficulty is that increasing the number of connections not only increases the training time but also makes it more probable that training falls into a poor local minimum. It also necessitates more data for training. The Perceptron as well as the Multilayer Perceptron (MLP) usually needs input patterns of fixed length (Lippmann, 1989). This is why the MLP has difficulties when dealing with temporal information (essential speech information or features extracted during speech processing). Since the word has to be recognized as a whole, the word boundaries are often located automatically by an endpoint detector and the noise outside the boundaries is removed. The word patterns also have to be warped using some pre-defined paths in order to obtain fixed-length word patterns.

Since the early eighties, researchers have been applying neural networks to the speech recognition problem. One of the first attempts was Kohonen's electronic typewriter (Kohonen, 1992).
It uses the clustering and classification characteristics of the Self-Organizing Map (SOM) to obtain an ordered feature map from a sequence of feature vectors, as shown in Figure 1.1. The training was divided into two stages, where the first stage was used to obtain the SOM: speech feature vectors were fed into the SOM until it converged. The second training stage consisted of labeling the SOM; that is, each neuron of the feature map was assigned a phoneme label. Once the labeling process was completed, the training process ended. Then, unclassified speech was fed into the system, which translated it into a sequence of labels. Figure 1.2 shows the sequence of responses obtained from the trained feature map when the Finnish word humppila was uttered. In this way, the feature extractor plus the SOM behaved like a transducer, transforming a sequence of speech samples into a sequence of labels. The sequence of labels was then processed by an AI scheme (grammatical transformation rules) in order to obtain words from it.

Figure 1.1: Feature map with neurons (circles) labeled with the symbols of the phonemes to which they "learned" to give the best responses.

Figure 1.2: The sequence of the responses obtained from the trained feature map when the Finnish word humppila was uttered.

Using an unsupervised learning neural network such as the SOM seems wise. The SOM constructs a topology-preserving mapping from the high-dimensional space onto map units (neurons) in such a way that relative distances between data points are preserved. The SOM performs dimensionality reduction by producing a map, usually of 2 dimensions, which plots the similarities of the data by grouping similar data items together. Because of its ability to form an ordered feature map, the SOM is found to be suitable for dimensionality reduction of speech features. Forming a binary matrix to feed to the MLP makes training and classification simpler and better. Such a hybrid system consists of two neural-based models, a SOM and an MLP. The hybrid system mainly tries to overcome the problem of the temporal variation of utterances, where utterances of the same word by the same speaker may differ in duration and speech rate.

1.3 Problem Statements

Based on the background of study, the problem statements are:

i. Various approaches have been introduced for Malay speech recognition in order to produce an accurate and robust system. However, only a few approaches have achieved excellent performance for Malay speech recognition (Ting et al., 2001a, 2001b and 2001c; Md Sah Haji Salam et al., 2001). Thus, research in speech recognition for the Malay language still has wide potential for development.

ii. The Multilayer Perceptron (MLP) has difficulties when dealing with temporal information. Since the word has to be recognized as a whole, the word patterns have to be warped using some pre-defined paths in order to obtain fixed-length word patterns (Tebelskis, 1995; Gavat et al., 1998). Thus, an efficient model is needed to address this drawback.

iii. The Self-Organizing Map (SOM) is considered a suitable and effective approach for both clustering and dimensionality reduction. However, is the SOM an efficient neural network to apply in MLP-based speech recognition in order to reduce the dimensionality of the feature vectors?
1.4 Aim of the Research

The aim of the research is to investigate how a hybrid neural network can be applied in the speech recognition area, and to propose a hybrid model combining the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) for Malay speech recognition in order to achieve better performance than the conventional model (single network).

1.5 Objectives of the Research

i. Studying the effectiveness of various types of neural network models for speech recognition.

ii. Developing a hybrid model/approach by combining SOM and MLP in speech recognition for the Malay language.

iii. Developing a prototype of Malay speech recognition which contains three main components, namely speech processing, SOM and MLP.

iv. Conducting experiments to determine the optimal values of the parameters of the system (cepstral order, dimension of SOM, hidden node number, learning rate) in order to obtain the optimal performance.

v. Comparing the performance of the conventional model (single network) and the proposed model (SOM and MLP) based on recognition accuracy, to demonstrate the improvement achieved by the proposed model. The recognition accuracy is calculated as the percentage below:

$$\text{Recognition Accuracy (\%)} = \frac{\text{Total of Correctly Recognized Words}}{\text{Total of Sample Words}} \times 100$$

1.6 Scopes of the Research

The scope of the research clearly defines the specific field of the study. The discussion of the study and research is confined to this scope.

i. Two datasets are created: one for digit recognition and one for word recognition. The former consists of 10 Malay digits and the latter consists of 30 selected two-syllable Malay words. Speech samples are collected in a noise-free environment using a unidirectional microphone.

ii. The human speakers comprise 3 males and 3 females, aged between 18 and 25 years. The system supports speaker-independent capability.

iii. Linear Predictive Coding (LPC) is used as the feature extraction method to extract the speech features from the speech data. The LPC coefficients are determined using the autocorrelation method and are then converted to cepstral coefficients.

iv. The Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) are applied in the proposed system. The SOM acts as a feature extractor which converts the higher-dimensional feature vectors into a lower-dimensional binary vector; the MLP then takes the binary vectors as its input for training and classification (a minimal sketch of this pipeline is given below).
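To make the pipeline of scope item iv concrete, the following is a minimal numpy sketch of how an utterance's feature frames could be accumulated into a SOM binary matrix for the MLP. The map size, feature dimension, random stand-in data and variable names here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def utterance_to_binary_matrix(frames, som_weights):
    """Accumulate the SOM winner node of every feature frame into a
    binary matrix: winner nodes are set to 1, all other nodes stay 0."""
    rows, cols, dim = som_weights.shape   # e.g. a 12 x 12 map of dim-D codebooks
    matrix = np.zeros((rows, cols))
    for v in frames:                      # v: one cepstral feature vector
        # best matching unit = node whose weight vector is nearest to v
        d = np.linalg.norm(som_weights - v, axis=2)
        r, c = np.unravel_index(np.argmin(d), (rows, cols))
        matrix[r, c] = 1.0
    return matrix                         # flattened, this feeds the MLP

# hypothetical shapes: a trained 12 x 12 SOM over 12-D cepstral vectors
som = np.random.rand(12, 12, 12)
utterance = np.random.rand(40, 12)        # 40 frames of cepstral coefficients
mlp_input = utterance_to_binary_matrix(utterance, som).ravel()
```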
1.7 Justification

Researchers have worked on automatic speech recognition for several decades. In the eighties, speech recognition research was characterized by a shift in technology from template-based approaches (Hewett, 1989; Aradilla et al., 2005) to statistical approaches (Gold, 1988; Huang, 1992; Siva, 2000; Zbancioc and Costin, 2003) and connectionist approaches (Watrous, 1988; Hochberg et al., 1994). Instead of the Hidden Markov Model (HMM), the use of neural networks has become another idea in speech recognition. Anderson (1999) compared statistical and template-based approaches. Today's research focuses on a broader definition of speech recognition: it is concerned not only with recognizing the word content but also with prosody (Shih et al., 2001) and personal signature.

Despite all of the advances in the speech recognition area, the problem is far from being completely solved. A number of commercial products are currently on the market: products that recognize the speech of a person within the scope of a credit card phone system, command recognizers that permit voice control of different types of machines, "electronic typewriters" that can recognize continuous speech and manage vocabularies of several tens of thousands of words, and so on. However, although these applications may seem impressive, they are still computationally intensive, and to make their usage widespread, more efficient algorithms must be developed. Summing up, there is still room for a lot of improvement and research.

Currently there are many speech recognition applications released, whether as commercial or free software. The technology behind speech output has changed over time and the performance of speech recognition systems keeps increasing. Early systems used discrete speech; Dragon Dictate is the only discrete speech system still available commercially today. On the other hand, the main continuous speech systems currently available for the PC are Dragon NaturallySpeaking and IBM ViaVoice. Table 1.1 compares different speech recognition systems with the prototype to be built in this research. This comparison is important as it gives an insight into the current trend of speech recognition technology.

Table 1.1: Comparison of different speech recognition systems

Feature | Dragon Dictate | IBM ViaVoice | NaturallySpeaking 7 | Microsoft Office XP SR | Prototype To Be Built
Discrete Speech Recognition | √ | X | X | X | √
Continuous Speech Recognition | X | √ | √ | √ | X
Speaker Dependent | √ | √ | √ | √ | √
Speaker Independent | X | X | X | X | √
Speech-to-Text | √ | √ | √ | √ | √
Active Vocabulary Size (Words) | 30,000 – 60,000 | 22,000 – 64,000 | 300,000 | Finite | 30 – 100
Language | English | English | English | English | Malay

In this research, the speech recognition problem is transformed into a simplified binary matrix recognition problem. The binary matrices are generated and simplified while preserving most of the useful information by means of a SOM. Word recognition then turns into a problem of binary matrix recognition in a smaller-dimensional feature space, which performs dimensionality reduction. Besides, the comparison between the single-network recognizer and the hybrid-network recognizer conducted here sheds new light on future directions of research in the field. It is important to understand that the purpose of this work is not to develop a full-scale speech recognizer, but only to test the proposed hybrid model and explore its usefulness in providing more efficient solutions to speech recognition.

1.8 Thesis Outline

The first few chapters of this thesis provide some essential background and a summary of related work in speech recognition and neural networks.

Chapter 2 reviews the fields of speech recognition and neural networks, as well as their intersection, summarizing both past and present approaches to speech recognition using neural networks.

Chapter 3 introduces the speech dataset design.

Chapter 4 presents the algorithms of the proposed system: speech feature extraction (speech processing and SOM) and classification (MLP).

Chapter 5 presents the implementation of the proposed system: speech processing, Self-Organizing Map and Multilayer Perceptron. The essential parts of the source code are shown and explained in detail.

Chapter 6 presents the experimental tests on both systems: the conventional system and the proposed system. The tests are conducted using the digit dataset for digit recognition and the word dataset for word recognition.
For word recognition, two classification approaches are applied: syllable classification and word classification. The tests are conducted on a speaker-independent system with different values of the parameters in order to obtain the optimal performance according to the recognition accuracy. Discussion and comparison of the experimental results are also included in this chapter.

Chapter 7 presents the conclusions and future work of the thesis.

CHAPTER 2

REVIEW OF SPEECH RECOGNITION AND NEURAL NETWORK

2.1 Fundamental of Speech Recognition

This chapter describes the literature review of the basic concepts of a speech recognition system and its main components: speech processing, feature extraction and the classifier. Through the literature review, useful insights are gained to design the methodology of this research. Figure 2.1 shows the block diagram of a basic speech recognition system, which comprises speech processing, feature extraction and a classifier.

Figure 2.1: Basic model of speech recognition system

• Analog Speech Signal
Here the analog speech signal is converted to a discrete signal (digital speech data format). Examples of digital speech data formats are .wav, .snd, and .au.

• Speech Processing
The speech signal also contains unnecessary data, such as noise and non-speech, which need to be removed before feature extraction. The resulting speech signal is passed through an endpoint detector to determine the beginning and end of the speech data.

• Feature Extraction
To extract the characteristics of the processed speech signal, a feature extractor is used to extract the useful features from the speech data; the result is called the feature vector.

• Classifier/Recognizer
Here the recognition of the speech data is established. There are many kinds of classifiers with different techniques and advantages. The feature vector is the input of the classifier; its output is the recognized word.

2.2 Linear Predictive Coding (LPC)

Linear Predictive Coding (LPC) is one of the most popular speech feature extraction methods. It is basically the prediction of the present speech sample from a linear combination of the past speech samples (Rabiner, 1993). It is widely applied as a speech feature extraction algorithm because it is mathematically precise, simple to implement and fast to compute (Parsons, 1986; Picone, 1993; Rabiner, 1993). Besides, it provides a good model of the speech signal. It is often used for the feature extraction of isolated words, syllables, phonemes, consonants and even vowels of languages such as English and Japanese. Among the LPC parameters are the LPC coefficients, the reflection or PARCOR coefficients and the log area ratio coefficients.

The basic principle behind LPC is that the current speech sample can be predicted as a linear combination of the past speech samples, as shown in Figure 2.2.

Figure 2.2: The current speech sample is predicted as a linear combination of the past p samples (n = total number of speech samples)

$$\hat{s}(n) = a_1 s(n-1) + a_2 s(n-2) + a_3 s(n-3) + \dots + a_p s(n-p) \tag{2.1a}$$

$$\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k) \tag{2.1b}$$

where the $a_k$ are the coefficients, which are assumed to be constant over the speech analysis frame, and $p$ is the number of past speech samples. A speech analysis frame is defined as a short segment of the speech waveform to be examined or analyzed.
The prediction error is defined as the difference between the actual speech samples $s(n)$ and the predicted samples $\hat{s}(n)$:

$$e(n) = s(n) - \hat{s}(n) \tag{2.2a}$$

$$= s(n) - \sum_{k=1}^{p} a_k s(n-k) \tag{2.2b}$$

The purpose of LPC is to find these predictor coefficients $a_k$, which are said to match the speech waveform within an analysis frame (Parsons, 1986). The predictor coefficients have to be determined in a way that minimizes the mean squared prediction error over a small analysis frame. The mean squared prediction error is defined as

$$E = \sum_{n=0}^{M} e^2(n) \tag{2.3a}$$

$$= \sum_{n=0}^{M} \left[ s(n) - \sum_{k=1}^{p} a_k s(n-k) \right]^2 \tag{2.3b}$$

where $M$ is the analysis frame length. The minimization of the mean squared error is done by setting the partial derivatives of $E$ with respect to the $a_k$ simultaneously equal to zero:

$$\frac{\partial E}{\partial a_j} = 0, \quad j = 1, 2, 3, \dots, p \tag{2.4a}$$

$$\sum_{n=0}^{M} 2\left[ s(n) - \sum_{k=1}^{p} a_k s(n-k) \right] \left( -s(n-j) \right) = 0, \quad j = 1, 2, 3, \dots, p \tag{2.4b}$$

$$-2\sum_{n=0}^{M} s(n)s(n-j) + 2\sum_{n=0}^{M} s(n-j) \sum_{k=1}^{p} a_k s(n-k) = 0 \tag{2.4c}$$

Realizing that the autocorrelation function can be written in the forms

$$R(j) = \sum_{n=0}^{M} s(n)s(n-j) \tag{2.5}$$

$$R(k-j) = \sum_{n=0}^{M} s(n-j)s(n-k) \tag{2.6}$$

Equation (2.4c) reduces to a simple autocorrelation equation:

$$R(j) = \sum_{k=1}^{p} a_k R(k-j) \tag{2.7}$$

The predictor coefficients $a_k$ can be determined by solving Equation (2.7), using either the autocorrelation or the covariance method. The autocorrelation method is preferred over the covariance method because it is simpler and faster to compute (Rabiner, 1976 and 1993). In the autocorrelation method, we have to pre-determine a range for the parameter $n$, so that the speech segment $s(n)$ is set to zero outside the range $0 \le n \le M-1$. This is done simply by applying a window to the speech segment; the typical weighting window is the Hamming window. The purpose of windowing is to taper the signal near $n = 0$ and near $n = M-1$, so as to minimize the errors at the speech segment boundaries. Based on the windowed signal, the mean squared error becomes

$$E = \sum_{n=0}^{M-1+p} e^2(n) \tag{2.8}$$

and the autocorrelation function can be expressed as

$$R(k-j) = \sum_{n=0}^{M-1+p} s(n-j)s(n-k), \quad 1 \le j \le p, \; 1 \le k \le p \tag{2.9a}$$

or

$$R(k-j) = \sum_{n=0}^{M-1-(k-j)} s(n)s(n+k-j), \quad 1 \le j \le p, \; 1 \le k \le p \tag{2.9b}$$

Since the autocorrelation function is symmetric, i.e. $R(j) = R(-j)$ and $R(j-k) = R(k-j)$, the LPC equations can be rewritten as

$$\sum_{k=1}^{p} a_k R(|k-j|) = R(j), \quad 1 \le j \le p \tag{2.10}$$

Equation (2.10) can also be expressed in matrix form as

$$\begin{bmatrix} R(0) & R(1) & R(2) & \cdots & R(p-1) \\ R(1) & R(0) & R(1) & \cdots & R(p-2) \\ R(2) & R(1) & R(0) & \cdots & R(p-3) \\ \vdots & \vdots & \vdots & & \vdots \\ R(p-1) & R(p-2) & R(p-3) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(p) \end{bmatrix} \tag{2.11}$$

The coefficient matrix is a special matrix known as a Toeplitz matrix, in which all the elements along each diagonal are equal. It can be solved efficiently by an iterative method known as the Levinson-Durbin algorithm (Haykin, 2001), which yields the predictor coefficients; the algorithm exploits the fact that the matrix of coefficients is a symmetric Toeplitz matrix. A sketch of the procedure follows.
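For illustration only, here is a minimal numpy sketch of how the autocorrelation values of Equation (2.9) could be computed for a windowed frame and the Toeplitz system of Equation (2.11) solved by the Levinson-Durbin recursion. The frame length, LPC order and random stand-in frame are assumptions, not values taken from the thesis.

```python
import numpy as np

def autocorr(frame, p):
    """Autocorrelation values R(0)..R(p) of a windowed analysis frame
    (Equation 2.9)."""
    M = len(frame)
    return np.array([np.dot(frame[:M - k], frame[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the Toeplitz system of Equation (2.11) for the predictor
    coefficients a_1..a_p by the Levinson-Durbin recursion."""
    a = np.zeros(p + 1)              # a[1..p] hold the predictor coefficients
    err = r[0]                       # prediction error energy, initially R(0)
    for i in range(1, p + 1):
        # reflection (PARCOR) coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[i] = k
        for j in range(1, i):        # a_j <- a_j - k_i * a_{i-j}
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k           # error energy shrinks at each order
    return a[1:], err

# hypothetical usage: a Hamming-windowed frame of M samples, LPC order p
M, p = 240, 12
frame = np.random.randn(M) * np.hamming(M)   # stand-in for a real speech frame
coeffs, residual = levinson_durbin(autocorr(frame, p), p)
```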
2.3 Speech Recognition Approaches

Speech recognition deals with the recognition of specific individual speech sounds; it classifies speech sounds into their specific groups (e.g. /b, d, g/ as consonant phonemes). Speech recognition can be performed using Dynamic Time Warping (DTW), the Hidden Markov Model (HMM) or the Artificial Neural Network (ANN).

2.3.1 Dynamic Time Warping (DTW)

We now explain the Dynamic Time Warping algorithm, one of the oldest and most important algorithms in speech recognition (Itakura, 1975; Sakoe and Chiba, 1978). The simplest way to recognize an isolated word sample is to compare it against a number of stored word templates and determine which is the "best match", as shown in Figure 2.3. This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations (temporal variation). This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the word; in other words, the optimal alignment between a template and the speech sample may be nonlinear. DTW is able to achieve promising accuracy, higher than 95%, in digit recognition (Sakoe and Chiba, 1978; Ting et al., 2001a). DTW is only suitable for the recognition of small vocabularies because it is computationally intensive; it is not practicable in a real-time system when the vocabulary is large. In order to apply DTW to the word recognition problem, we need to know the beginning and ending points of the words; in noisy conditions this is not a trivial task.

Figure 2.3: Dynamic Time Warping (DTW): the time-warping path between a reference and a sample

The advantages of DTW are:
• Efficient hardware implementations exist.
• The training sequence is simple, since it just involves feature extraction for the words that need to be recognized.

The disadvantages of DTW are:
• It is not suitable for continuous speech recognition.
• It requires the computation of the beginning and ending points of the word.

A minimal sketch of the DTW alignment follows.
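As an illustration only, and not the thesis's implementation, here is a minimal numpy sketch of DTW between a stored template and an unknown sample, both given as sequences of feature vectors; the template dictionary in the usage function is hypothetical.

```python
import numpy as np

def dtw_distance(template, sample):
    """Cumulative distance of the best nonlinear time-warping path
    between two feature-vector sequences (one row per frame)."""
    n, m = len(template), len(sample)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - sample[j - 1])
            # extend the cheapest of the three allowed predecessor cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(sample, templates):
    """Pick the word whose template warps onto the sample most cheaply.
    `templates` is a hypothetical dict: word -> feature-frame array."""
    return min(templates, key=lambda w: dtw_distance(templates[w], sample))
```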
2.3.2 Hidden Markov Model (HMM)

Hidden Markov Models (HMM) (Rabiner, 1989) are essentially statistical models that assign the greatest likelihood or probability to the occurrence of the observed input pattern. An HMM is a doubly stochastic process with a hidden underlying process. It represents speech as a sequence of states, each representing a piece of the input signal; the states of the HMM correspond to phones, biphones or triphones. At each state there is a probability distribution over the possible observations, and a transition probability to the next state. The speech recognition process then boils down to finding the most probable path (a minimal sketch of this search follows at the end of this section). The training procedure for an HMM-based recognizer is more complex than for a DTW-based recognizer (Rabiner et al., 1989; Lee, 1988; Woodland et al., 1994; Huang, 1992).

The advantages of an HMM-based approach are:
• It is easy to incorporate other information, such as speech and language models.
• Continuous HMM is powerful for continuous speech recognition.

The disadvantages of an HMM-based approach are:
• The HMM probability density models (discrete, continuous and semi-continuous) have suboptimal modeling accuracy. Specifically, discrete density HMMs suffer from quantization errors, while continuous or semi-continuous density HMMs suffer from model mismatch.
• The Maximum Likelihood training criterion leads to poor discrimination between the acoustic models. Discrimination can be improved using the Maximum Mutual Information training criterion, but this is more complex and difficult to implement properly.
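For illustration only, and not part of the thesis's system, the following is a minimal numpy sketch of the Viterbi search for the most probable HMM state path, assuming log-domain transition, emission and initial probabilities.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most probable state path of an HMM for one observation sequence.
    log_A[i, j]: log P(state j | state i); log_pi[i]: log P(initial state i);
    log_B[t, i]: log P(observation at time t | state i)."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]                 # best score ending in each state
    back = np.zeros((T, N), dtype=int)        # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: come from i to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_B[t]
    path = [int(np.argmax(delta))]            # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```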
Approaches Performance/ Application DTW • Mostly used for isolated, digit and connected word recognition. • Small vocabulary size. • Training is simple. HMM • Mostly used for continuous word recognition. • Large vocabulary size. • Training is complex. ANN • Mostly used for isolated, connected and continuous word recognition. • Medium vocabulary size. • Training is time-consuming. 21 2.5 Review of Artificial Neural Networks Artificial Neural Network (ANN) is extremely a powerful computational device (Anderson and Rosenfeld, 1988; Fausett, 1994; Haykin, 1994). Their massive parallelism makes them very efficient. They can learn and generalize from training data. They are particularly fault-tolerant. Besides, they are also noise-tolerant. In principle, they can do anything that a symbolic or logic system can do. There are many forms of ANN. Most operate by passing neural activations through a network of connected neurons such as MLP, SOM and Hopfield network. One of the most powerful features of neural networks is their ability to learn and generalize from a set of training data. They adapt the weights of the connections between neurons so that the final output activations are correct. The goal of the network is to learn some association between input and output patterns. This learning process is achieved through the modification of the connection weights between units. In statistical terms, this is equivalent to interpreting the value of the connections between units as parameters to be estimated. The model of network specifies the learning algorithm to be used. In the section below we will briefly review the fundamentals of neural networks: 2.5.1 Processing Units A neural network contains a potentially huge number of simple processing units. All these units operate simultaneously, supporting massive parallelism. All computation in the system is performed by these units. At each moment in time, each unit simply computes a scalar function of its local inputs, and broadcasts the result to its neighboring units. A basic neuron processing unit is shown in Figure 2.5. The units in a network are typically divided into input units, which receive data from the environment; hidden units, which may internally transform the data representation; and/or output units, which represent decisions or control signals. 22 Input Output Input Input Input Input Neuron i Weight Neuron j Figure 2.5: A basic neuron processing unit. 2.5.2 Connections The units in a network are organized into a given topology by a set of connections or weights. Weights are usually one-directional (from input units towards output units), but they may be two-directional, especially when there is no distinction between input and output units. Weights can be changed as a result of training, but they tend to be changed slowly, because accumulated knowledge changes slowly. A network can be connected with any kind of topology. Common topologies include unstructured, layered, recurrent, and modular networks, as shown in Figure 2.6. Each kind of topology is best suited to a particular type of application. 2.5.3 Computation Computation always begins with presenting an input pattern to the network. Then, the activations of all of the remaining units are computed, either synchronously or asynchronously. In layered networks, it is called forward propagation, as it progresses from the input layer to the output layer. 
2.5.4 Training

Training a network means adapting its connections so that the network exhibits the desired computational behavior for all input patterns. The process usually involves modifying the weights, but sometimes it also involves modifying the actual topology of the network. In a sense, weight modification is more general than topology modification; however, topological changes can improve both generalization and the speed of learning.

In general, networks are nonlinear and multilayered, and their weights can be trained only by an iterative procedure, such as gradient descent on a global performance measure. This requires multiple passes of training over the entire training set; each pass is called an iteration or an epoch. Moreover, the weights must be modified very gently so as not to destroy all the previous learning. A small constant called the learning rate is used to control the magnitude of weight modifications. Finding a good value for the learning rate is very important: if the value is too small, learning takes forever, but if the value is too large, learning disrupts all the previous knowledge. Unfortunately, there is no analytical method for finding the optimal learning rate; it is usually optimized empirically by trying different values.

2.6 Types of Neural Networks

We now give an overview of some different types of networks, organized in terms of the learning procedures they use. There are three main classes of learning procedures. Most networks fall into one of these categories, but there are also various networks, such as hybrid networks, which straddle these categories.

2.6.1 Supervised Learning

Supervised learning means that a "teacher" provides output targets for each input pattern and corrects the network's errors explicitly. This paradigm can be applied to many types of networks, both feed-forward and recurrent in nature. Perceptrons (Rosenblatt, 1962) are the simplest type of feed-forward networks that use supervised learning. A perceptron comprises binary threshold units arranged into layers, as shown in Figure 2.7(a). An MLP may have any number of hidden layers, although a single hidden layer is sufficient for many applications, and additional hidden layers tend to make training slower. An MLP can also be architecturally constrained in various ways, for instance by limiting its connectivity to geometrically local areas, by limiting the values of the weights, or by tying different weights together.

The Multilayer Perceptron (MLP), as shown in Figure 2.7(b), can theoretically learn any function, but it is more complex to train. However, if an MLP uses a sigmoid function rather than a threshold function, then it becomes possible to use partial derivatives and the chain rule to derive the influence of any weight on any output activation, which in turn indicates how to modify that weight in order to reduce the network's error. This generalization of the Delta Rule is known as backpropagation; a minimal sketch of one such weight update is given below.

Figure 2.7: Perceptrons: (a) Single-layer Perceptron (b) Multilayer Perceptron
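For illustration under assumed shapes and a squared-error loss (not the thesis's exact formulation), here is a minimal numpy sketch of one backpropagation step for a two-layer sigmoid network; the parameter `lr` plays the role of the learning rate discussed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, W1, b1, W2, b2, lr=0.25):
    """One gradient-descent step of error backpropagation for a
    two-layer sigmoid network with squared-error loss."""
    h = sigmoid(W1 @ x + b1)                 # hidden activations
    y = sigmoid(W2 @ h + b2)                 # output activations
    # delta terms: error times the derivative of the sigmoid
    d_out = (y - target) * y * (1.0 - y)
    d_hid = (W2.T @ d_out) * h * (1.0 - h)   # errors propagated backwards
    # gentle updates, scaled by the learning rate
    W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
    W1 -= lr * np.outer(d_hid, x); b1 -= lr * d_hid
    return 0.5 * np.sum((y - target) ** 2)   # current pattern error
```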
Hopfield (1982) studied neural networks that implement a kind of content-addressable associative memory. He worked with unstructured networks of binary threshold units with symmetric connections, in which activations are updated asynchronously. This type of recurrent network is now called a Hopfield network.

2.6.2 Semi-Supervised Learning

In semi-supervised learning, an external teacher does not provide explicit targets for the network's outputs, but only evaluates the network's behavior as "good" or "bad". The nature of the environment may be either static or dynamic; for example, the definition of "good" behavior may be fixed or it may change over time. The problem of semi-supervised learning is reduced to the problem of supervised learning by setting the training targets to be either the actual outputs or their negations, depending on whether the network's behavior was judged "good" or "bad". The network is then trained using the Delta Rule, where the targets are compared against the network's mean outputs, and error is backpropagated through the network if necessary (Barto and Anandan, 1985).

2.6.3 Unsupervised Learning

In unsupervised learning, there is no teacher, and a network must detect regularities and similarities in the input data by itself. Such self-organizing networks can be used for compressing, clustering, quantizing, classifying, or mapping input data. This type of network is often called an encoder, especially when the inputs or outputs are binary vectors; we also say that such a network performs dimensionality reduction. One type of unsupervised network is based on competitive learning, in which one output unit is considered the "winner"; these are known as winner-take-all networks. The winning unit may be found by lateral inhibitory connections on the output units. Competitive learning is useful for clustering the data, in order to classify or quantize input patterns (Hertz et al., 1991).

Kohonen (1988b, 1995 and 2002) developed a competitive learning algorithm which performs feature mapping, called the Self-Organizing Map (SOM). The SOM is a neural network that acts like a transformer, mapping an m-dimensional input vector into an n-dimensional space while locally preserving the topology of the input data. This explains why a SOM is called a feature map: relevant features are extracted from the input space and presented in the output space in an ordered manner. It is always possible to reverse the mapping and restore the data to the original m-dimensional space with a bounded error; the bound on this error is determined by the architecture of the network and the number of neurons.

2.6.4 Hybrid Networks

Some networks combine supervised and unsupervised training in different layers (Keun-Rong and Wen-Tsuen, 1993; Fritzke, 1994). Most commonly, unsupervised training is applied at the lowest layer in order to cluster the data, and backpropagation is then applied at the higher layer to associate these clusters with the desired output patterns. The attraction of hybrid networks is that they reduce the multilayer backpropagation algorithm to the single-layer Delta Rule, considerably reducing training time. On the other hand, since such networks are trained as independent modules rather than as an integrated whole, they have somewhat lower accuracy than networks trained entirely with backpropagation.
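The winner-take-all selection of Section 2.6.3 reduces to finding the output unit whose weight vector is nearest the input. The following C sketch shows one way to locate such a winner by Euclidean distance; the names, the dimension constant and the array layout are illustrative assumptions.

#define DIM 12   /* assumed input dimension, for illustration only */

/* Winner-take-all selection: return the index of the output unit whose
   weight vector w[u] is closest (in squared Euclidean distance, which
   has the same minimizer as the distance itself) to the input x.
   Illustrative sketch; names are assumptions. */
int find_winner(const double x[], const double w[][DIM], int n_units)
{
    int best = 0;
    double bestDist = 1e30;                  /* large sentinel value */
    for (int u = 0; u < n_units; u++) {
        double d = 0.0;
        for (int i = 0; i < DIM; i++) {
            double diff = x[i] - w[u][i];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = u; }
    }
    return best;
}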
2.7 Related Research

Many early researchers tried to apply neural network approaches to the speech recognition problem, because speech recognition is a pattern recognition task and neural networks are good at pattern recognition. The earliest attempts involved highly simplified tasks, for example classifying speech segments as voiced/unvoiced or nasal/fricative/plosive. Success in these experiments encouraged researchers to move on to phoneme or subword classification. The same techniques also achieved some success at the level of word recognition, although it became clear that there were scaling problems when moving to the level of sentences or to larger vocabularies.

Basically, there are two approaches to speech classification using neural networks: static and dynamic. In static classification, all of the input speech is fed into the neural network at once, and the network then makes a single decision to classify the speech. By contrast, in dynamic classification, only a small window of the speech is fed into the network at a time; this window slides over the input speech while the network makes a series of local decisions, which then have to be integrated into a global decision at the final stage. Static classification works well for phoneme recognition, but it scales poorly to the level of words or sentences; dynamic classification scales better. The sections below review some research on the static approach for phoneme/subword classification and word classification.

2.7.1 Phoneme/Subword Classification

Huang and Lippmann (1988) performed a simple experiment to show that neural networks can form complex decision surfaces from speech data. They applied an MLP with only 2 inputs, 50 hidden nodes and 10 outputs to Peterson and Barney's (1952) collection of vowels produced by men, women and children, using the first two formants of the vowels as the input speech representation. After 50,000 iterations of training, the network produced the decision regions shown in Figure 2.8. These decision regions are nearly optimal, resembling the decision regions that would be drawn by hand, and they yield classification accuracy comparable to that of more conventional algorithms.

Figure 2.8: Decision regions formed by a 2-layer Perceptron using backpropagation training and vowel formant data.

Elman and Zipser (1987) trained a network to classify the vowels /a, i, u/ and the consonants /b, d, g/ as they occur in the utterances /ba, bi, bu/, /da, di, du/ and /ga, gi, gu/. Their network input consisted of 16 spectral coefficients over 20 frames and was fed into a hidden layer of between 2 and 6 units, leading to 3 outputs for either vowel or consonant classification. This network achieved an acceptable result, with error rates of 0.5% for vowels and 5.0% for consonants. An analysis of the hidden units showed that they tend to become feature detectors, discriminating between important classes of sounds such as consonants and vowels. The experimental results demonstrate that backpropagation learning can work well with complex and natural data.

Among the difficult tasks in classification is the so-called E-set, namely discriminating between the rhyming English letters "B, C, D, E, G, P, T, V, and Z". Burr (1988) applied a static network to this task with very good results. His network used an input window of 20 spectral frames, automatically extracted from the whole utterance using energy information.
These inputs led directly to 9 outputs representing the E-set letters. The network was trained and tested using 180 tokens from a single speaker, and its recognition accuracy was high, mostly above 99%.

Lee and Ching (1999) proposed a neural-network-based speech recognition system for isolated Cantonese syllables. The system consists of a tone recognizer and a base-syllable recognizer. The tone recognizer adopts the MLP architecture, in which each output neuron represents a particular tone. The syllable recognizer contains a large number of independently trained recurrent networks, each representing a designated Cantonese syllable. A speaker-dependent recognition system was built with the vocabulary growing from 40 syllables to 200 syllables. In the case of 200 syllables, three experiments were conducted on the proposed system, which achieved a top-1 accuracy of 81.8% and a top-3 accuracy of 95.2%.

2.7.2 Word Classification

Peeling and Moore (1987) applied the MLP to digit recognition with excellent results. They used a static input buffer of 60 frames (1.2 seconds) of spectral coefficients, long enough for the longest spoken word; shorter words were padded with zeros and positioned randomly in the 60-frame buffer. Evaluating a variety of MLP topologies, they obtained the best performance with a single hidden layer of 50 units. A comparison between the proposed MLP and an HMM recognizer gave error rates of 0.25% vs. 0.2% in speaker-dependent experiments, and 1.9% vs. 0.6% in multi-speaker experiments, on a 40-speaker database of digits. In addition, the MLP was five times faster than the HMM system.

Kammerer and Kupper (1990) applied a variety of networks to the TI 20-word database, finding that a Single-Layer Perceptron (SLP) outperformed both MLP and DTW template-based recognizers in many cases. They used a static input buffer of 16 frames, into which each word was linearly normalized, with 16 2-bit coefficients per frame. Error rates for the SLP vs. DTW were 0.4% vs. 0.7% in speaker-dependent experiments, and 2.7% vs. 2.5% in speaker-independent experiments.

Burr (1988) applied the MLP to the more difficult task of alphabet recognition. He used a static input buffer of 20 frames, into which each spoken letter was linearly normalized, with 8 spectral coefficients per frame. Training on three sets of the 26 spoken letters and testing on a fourth set, the MLP achieved an error rate of 15% in speaker-dependent experiments, matching the accuracy of a DTW template-based approach.

Kohonen (1988 and 1992) described a microprocessor-based real-time speech recognition system. It is able to produce orthographic transcriptions for arbitrary words or phrases uttered in Finnish or Japanese (Kohonen et al., 1988), and it can also be used as a large-vocabulary isolated word recognizer. The acoustic processor of the system, which transcribes speech into phonemes, is based on neural network principles: the so-called Phonotopic Maps constructed by a self-organizing process are employed. Co-articulation effects in the phonetic transcriptions are compensated by correcting errors at the acoustic processor output. Without applying any language model, the recognition result is correct up to 92% to 97% at the level of individual letters.

Ha-Jin Yu and Yung-Hwan Oh (1996) proposed a subword-based neural network model for continuous speech recognition. The system consists of three modules, each composed of simple neural networks.
The speech input is segmented into non-uniform units by the network; a non-uniform unit can model phoneme variations that spread over several phonemes and across word boundaries. The second module recognizes the segmented units. Each unit has a stationary part and a transition part, and the network is divided according to these two parts. The last module spots words by modeling the temporal representation. The results showed that the system can model such phoneme variations successfully. The recognizer was built using simple neural network structures: the input speech is segmented by the first module and classified by the second, while a third module detects words from the result of subword-unit recognition. The units are trained by the result of word detection, rather than by the result of unit recognition itself.

2.7.3 Classification Using Hybrid Neural Network Approach

Keun-Rong Hsieh and Wen-Tsuen Chen (1993) proposed a neural network architecture which combines unsupervised and supervised learning for pattern recognition. The network is a hierarchical self-organizing map, which is first trained by unsupervised learning. When the network fails to distinguish similar patterns, supervised learning is applied to teach the network to give different scaling factors to different features so as to discriminate between similar patterns. Simulation results showed that the proposed model obtained good generalization capability as well as sharp discrimination between similar patterns.

Salmela et al. (1996) proposed a neural network capable of speaker-independent recognition of isolated spoken numbers. The recognition system is a hybrid architecture of SOM and MLP: the SOM maps the feature vectors of a word into a constant-dimension matrix, which is then classified by the MLP. The decision borders of the SOM were fine-tuned with the Learning Vector Quantization (LVQ) algorithm, with which the hybrid achieved over 99% recognition on a test set of 1232 samples. The training convergence of the MLP was tested with two different initialization methods.

Kusumoputro (1999) proposed an adaptive recognition system based on the Kohonen Self-Organizing Map (KSOM). The goals of the research were to improve the recognition capability of the network and, at the same time, to minimize the time needed for learning the patterns. These goals were achieved by combining two types of learning: supervised and unsupervised. They developed a new kind of hybrid neural learning system, combining the unsupervised KSOM and supervised back-propagation learning rules, referred to as the hybrid adaptive SOM with winning probability function and supervised BP, or KSOM(WPF)-BP. This hybrid neural system can estimate the cluster distribution of given data and direct it into a predefined number of cluster neurons through a creation and deletion mechanism. The experimental results showed that the KSOM-BP with winning probability function has a higher recognition rate than the previous KSOM-BP, even when using a smaller number of cluster neurons.

Tabatabai et al. (1994) proposed a hybrid neural network consisting of a SOM and a Perceptron, for speaker-independent isolated word recognition. The novel idea in their system is the use of a SOM as the feature extractor, converting phonetic similarities of the speech frames into spatial adjacency in the map.
This property simplifies the classification task. The system performance was evaluated for the recognition of a limited number of Farsi words (the numbers "zero" to "ten"). The overall performance of their hybrid recognizer was 93.82%. The benefits of their system are speed and simplicity: it performs the recognition task in about one second, running on an IBM PC/AT 386 + 387 at 33 MHz.

2.8 Summary

In this chapter, we have reviewed some popular approaches to speech recognition systems and compared the differences between these approaches. Some fundamentals of neural networks were also reviewed in terms of topology and type of learning. Lastly, related research was surveyed in order to show the different approaches and classification schemes used and the results obtained.

CHAPTER 3

SPEECH DATASET DESIGN

3.1 Human Speech Production Mechanism

All human speech sounds begin as pressure generated by the lungs, which pushes air through the vocal tract. The vocal tract consists of the pharynx, the mouth or oral cavity, and the nasal cavity, as shown in Figure 3.1. The sound produced depends on the state of the vocal tract as the air is pushed through it; this state is determined by the position, shape and size of the various articulators such as the lips, jaw, tongue and velum. The human speech production mechanism involves the respiration of the lungs, which provides the energy source; the phonation of the vocal cords or folds, which act as the source of sounds; the resonation of the vocal tract, which resonates the sounds from the vocal folds; and the articulation mechanism of the oral cavity, which shapes the sounds from the vocal folds into various distinctive sounds. Speech sounds can be produced in a relatively open oral cavity or through a constriction in the oral cavity. Because speech sounds are produced continuously, they have to be chopped into small units called phones for analysis. Each phone is written in brackets [ ] to indicate that it is a type of sound. Speech sounds are classified into vowels and consonants (Deller et al., 1993).

Figure 3.1: The vocal tract.

3.2 Malay Morphology

Malay morphology is defined as the study of word structures in the Malay language (Lutfi Abas, 1971). The basic term in morphology is the morpheme, the smallest meaningful unit in a language; in other words, a morpheme is a combination of phonemes into a meaningful unit. A Malay word can be comprised of one or more morphemes. When we talk about Malay morphology, we cannot avoid discussing the process of word formation in the Malay language. Malay is one of the agglutinative languages of the world: it is a derivative language which allows the addition of affixes to a base word to form new words. In this, it differs from English, where the process involves changes in the phonemes according to their groups. The processes of word formation in the Malay language produce primary words, derivative words, compound words and reduplicative words.

3.2.1 Primary Word

A primary word is a word that does not take any affixes or reduplication. A primary word can be comprised of one or more syllables. A syllable consists of a vowel, or a vowel with one consonant, or a vowel with several consonants; the vowel can appear before or after the consonants. In the Malay language, there are only about 500 one-syllable primary words (Nik Safiah Karim et al., 1995).
Some primary words are taken from other languages such as English and Arabic. The structures of one-syllable words are shown in Table 3.1 and Figure 3.2, where C stands for consonant and V stands for vowel. Primary words with two syllables account for the majority of Malay primary words; their structures are shown in Table 3.2 and Figure 3.3. Primary words with three or more syllables exist in small numbers, and most of them are taken from other languages; their structures are shown in Table 3.3.

Table 3.1: Structure of words with one syllable.
Syllable Structure | Example of Word
CV    | Ya (yes)
VC    | Am (common)
CVC   | Sen (cent)
CCVC  | Stor (store)
CVCC  | Bank (bank)
CCCV  | Skru (screw)
CCCVC | Skrip (script)

Figure 3.2: Structure of the one-syllable words "Ya" (CV) and "Stor" (CCVC).

Table 3.2: Structure of words with two syllables.
Syllable Structure | Example of Word
V + CV    | Ibu (mother)
V + VC    | Air (water)
V + CVC   | Ikan (fish)
VC + CV   | Erti (meaning)
VC + CVC  | Empat (four)
CV + V    | Doa (pray)
CV + VC   | Diam (silent)
CV + CV   | Guru (teacher)
CV + CVC  | Telur (egg)
CVC + CV  | Lampu (lamp)
CVC + CVC | Jemput (invite)

Figure 3.3: Structure of the two-syllable words "Guru" (CV + CV) and "Jemput" (CVC + CVC).

Table 3.3: Structure of words with three syllables or more.
Syllable Structure | Example of Word
CV + V + CV        | Siapa (who)
CV + V + CVC       | Siasat (investigate)
V + CV + V         | Usia (age)
CV + CV + V        | Semua (all)
CV + CV + VC       | Haluan (direction)
CVC + CV + VC      | Berlian (diamond)
V + CV + CV        | Utara (north)
VC + CV + CV       | Isteri (wife)
CV + CV + CV       | Budaya (culture)
CVC + CVC + CV     | Sempurna (perfect)
CVC + CV + CVC     | Matlamat (aim)
CV + CV + VC + CV  | Keluarga (family)
CV + CVC + CV + CV | Peristiwa (event)
CV + CV + V + CVC  | Mesyuarat (meeting)
CV + CV + CV + CVC | Munasabah (reasonable)
CV + CV + CV + CV  | Serigala (wolf)
V + CV + CVC + CV + CV      | Universiti (university)
CV + CV + CV + CV + CV + CV | Maharajalela (king)

3.2.2 Derivative Word

Derivative words are words formed by adding affixes to the base of a word. The affixes can occur at the beginning (prefixes), within (infixes) or at the end (suffixes) of a word; they can also occur at the beginning and end at the same time, in which case they are called confixes. Examples of derivative words are "berjalan" (walking), "mempunyai" (having) and "pakaian" (clothes).

3.2.3 Compound Word

Compound words are words combined from two individual primary words which together carry a particular meaning. There are quite a lot of compound words in the Malay language, for example "alat tulis" (stationery), "jalan raya" (road), "kapal terbang" (aeroplane), "Profesor Madya" (associate professor), "hak milik" (ownership) and "pita suara" (vocal folds). Some Malay idioms are compound words, such as "kaki ayam" (bare feet), "buah hati" (sweetheart) and "berat tangan" (lazy).

3.2.4 Reduplicative Word

Reduplicative words, as the name suggests, are words that are reduplicated from primary words. There are three forms of reduplication in the Malay language: full, partial and rhythmic. Examples of reduplicative words are "mata-mata" (policeman) and "sama-sama" (you are welcome).

3.3 Malay Speech Dataset Design

Malay speech dataset design basically involves the proper selection of speech target sounds for speech recognition.
The Malay consonantal phonemes can be analyzed using descriptive analysis or distinctive feature analysis. Descriptive analysis provides an analysis using frequency, mean and factor analysis, while a distinctive feature is a feature that can signal a difference in meaning by changing its plus or minus value. Generally, the descriptive analysis is preferred over the distinctive feature analysis because it is easier to implement.

3.3.1 Selection of Malay Speech Target Sounds

From the study, there are about 500 primary words with one syllable, while primary words with three syllables or more exist only in small numbers, most of them borrowed from other languages. The majority of Malay words are primary words with two syllables. Among the Malay syllables, the CV and CVC structures are the most common. CV is preferred over CVC because it is easy to implement in the system and occurs in substantial numbers. Thus, the speech tokens selected are Malay syllables of CV structure, initialized with plosives and followed by vowels; these syllables can in turn be combined to form two-syllable Malay words.

In order to get a good distribution of consonants and vowels in the dataset, 15 Malay syllables are selected for forming two-syllable Malay words. These syllables are of the CV form, each initialized with a chosen consonant and followed by a chosen vowel. The 15 Malay one-syllable speech target sounds are shown in Table 3.4, and the most common two-syllable words combined from the selected syllables are shown in Table 3.5; the source is "Kamus Dewan" (Sheikh Othman Sheikh Salim et al., 1989). Among the two-syllable Malay words listed in Table 3.5, 30 are chosen as target sounds for the dataset. The selection of the words is based on a similar distribution of syllables between the words. The 30 selected Malay two-syllable speech target sounds are shown in Table 3.6.

Table 3.4: The 15 syllables selected for forming two-syllable target words.
1 Ba    6 Ki    11 Su
2 Bu    7 Ma    12 Si
3 Bi    8 Mu    13 Ta
4 Ka    9 Mi    14 Tu
5 Ku    10 Sa   15 Ti

Table 3.5: Two-syllable Malay words combined from the 15 selected syllables, grouped by initial consonant (with group totals).
B- (25): Baba, BaBi, BaKa, BaKu, BaKi, BaMa, BaSa, BaSi, BaSu, BaTa, BaTu, BuKa, BuKu, BuMi, BuSu, BuSi, BuTa, BuTu, BiBi, BiKa, BiKu, BiSa, BiSi, BiSu, BiTu
K- (18): KaKu, KaKi, KaBa, KaBu, KaMi, KaMu, KaSa, KaSi, KaTa, KaTi, KuKu, KuBu, KuMu, KuSa, KuSi, KuTu, KiTa, KiSi
M- (18): MaMa, MaMi, MaKa, MaKi, MaSa, MaSi, MaSu, MaTa, MaTi, MaTu, MuKa, MuKu, MuSa, MuSi, MuTu, MuTi, MiSi, MiSa
S- (22): SaSi, SaKa, SaKu, SaKi, SaMa, SaMu, SaMi, SaTa, SaTu, SaTi, SuSu, SuSa, SuKa, SuKu, SuMi, SiSa, SiSi, SiBu, SiKu, SiKi, SiTi, SiTu
T- (24): TaTa, TaTu, TaTi, TaBa, TaKa, TaKi, TaMu, TaMa, TaSu, TuTa, TuTi, TuBa, TuBu, TuBi, TuKu, TuKa, TuMi, TuSi, TiTi, TiBa, TiBi, TiKa, TiSu, TiSi

Table 3.6: The 30 selected Malay two-syllable words as speech target sounds.
1 Baki    11 Kubu   21 Susu
2 Bata    12 Kita   22 Suka
3 Buka    13 Masa   23 Sisi
4 Buku    14 Mata   24 Situ
5 Bumi    15 Mati   25 Tamu
6 Bisu    16 Muka   26 Taba
7 Bisa    17 Mutu   27 Tubu
8 Kamu    18 Misi   28 Tubi
9 Kaki    19 Sami   29 Tiba
10 Kuku   20 Satu   30 Titi

For digit recognition, 10 Malay digit words are chosen as the target speech sounds. These 10 digit words have different syllables, different combinations of consonants and vowels, and different numbers of syllables. The 10 Malay digit words selected as speech target sounds for digit recognition are shown in Table 3.7.
Table 3.7: The 10 selected digit words as speech target sounds for digit recognition.
Digit | Word     | Structure
0     | Kosong   | CV + CVCC
1     | Satu     | CV + CV
2     | Dua      | CV + V
3     | Tiga     | CV + CV
4     | Empat    | VC + CVC
5     | Lima     | CV + CV
6     | Enam     | V + CVC
7     | Tujuh    | CV + CVC
8     | Lapan    | CV + CVC
9     | Sembilan | CVC + CV + CVC

The purpose of the speech dataset for word recognition (Table 3.6) is to test the recognition performance of the system on Malay words composed of CV syllables; since the CV syllables can be combined to produce many Malay words, this may reduce the number of target units to be recognized. The speech dataset for digit recognition differs from that for word recognition: the word structures are not only CV + CV, as in word recognition, but include different structures such as CV + CVCC, CV + V, V + CVC and CVC + CV + CVC. The structures of the speech target words for digit recognition are given in Table 3.7.

3.3.2 Acquisition of Malay Speech Dataset

The speech dataset consists of a training dataset and a test dataset. The training dataset is used to train the neural networks (SOM and MLP) during the learning process; the test dataset is used after training to test the recognition performance of the networks in terms of recognition rate. The test dataset consists of data not used in the training of the SOM and MLP: 50% of the total dataset is randomly chosen as training data, and the remaining data are used as test data. It is not the purpose of this research to develop a full-scale speech recognizer with a large dataset, but to test new techniques by developing a prototype; considering this goal, all the approaches were tested on word recognition and digit recognition with a medium-sized dataset.

All the experiments reported for word recognition used a training set and a testing set contributed by six speakers (3 male and 3 female students from the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia). Each speaker contributed 300 utterances for the training set (10 repetitions x 30 words) and 300 utterances for the testing set (10 repetitions x 30 words); therefore, a total of 3600 utterances were used for training (1800 utterances) and testing (1800 utterances). For digit recognition, all the experiments used a training set and a testing set contributed by the same speakers: each speaker contributed a training set of 100 utterances (10 for each target word) and a testing set of 100 utterances (10 for each target word), giving a total of 1200 utterances for training and testing. Tables 3.8 and 3.9 show the specifications of the datasets for word recognition and digit recognition respectively.

Table 3.8: Specification of the dataset for word recognition.
                             | Training | Testing
Speakers                     | 6        | 6
Number of words              | 30       | 30
Samples per word per speaker | 10       | 10
Utterances                   | 1800     | 1800
Total                        | 3600

Table 3.9: Specification of the dataset for digit recognition.
                             | Training | Testing
Speakers                     | 6        | 6
Number of words              | 10       | 10
Samples per word per speaker | 10       | 10
Utterances                   | 600      | 600
Total                        | 1200

For speech recognition with a single neural network such as the MLP, a lot of data is needed for training, because the MLP needs more training data to learn well and achieve optimal performance. Generally, more data are needed for training than for testing, which becomes a drawback for the MLP when the amount of test data increases. However, the same number of data is used for training and testing in this experiment.
This is done to test the performance of the proposed algorithm, which applies the SOM in MLP-based speech recognition; the SOM is good and efficient at data optimization and clustering, which may mitigate the MLP drawback mentioned above.

According to Rabiner (1993), the speech signal can be classified into three states: silence, unvoiced and voiced. For voiced sounds, such as vowels, voiced stops and other quasi-periodic pulses, the frequency range of interest is 0 to 4 kHz; for unvoiced sounds, such as unvoiced stops and random noise, the frequency range of interest is 4 to 8 kHz.

A sound editor, Sound Forge 6.0, is used to record the voices of the speakers. The recorded voice samples are saved as .wav files, each named according to its speech sound followed by an index: for example, the speech sound "Buka" is saved as "Buka01.wav", "Buka02.wav", "Buka03.wav" and so on, and the digit sound "Satu" is saved as "Satu01.wav", "Satu02.wav", "Satu03.wav" and so on. The sampling rate is set at 16 kHz with 16-bit resolution. The recording was conducted in a normal laboratory environment with ambient noise of 75.40 dB, via a high-quality unidirectional microphone. According to the Nyquist theorem (Parsons, 1986), in order to capture a signal, the sampling rate must be at least twice the highest frequency of the signal; to capture the unvoiced sounds, a sampling rate of 16 kHz is sufficient, and this rate is therefore used to obtain the speech dataset.

3.4 Summary

In this chapter, some fundamental concepts of Malay morphology have been explained in order to clarify how phonemes combine into meaningful units. The whole dataset of this research consists of a dataset for digit recognition and a dataset for word recognition, both collected from 6 human speakers; the collected dataset is divided into 2 sets, used for network training and testing.

CHAPTER 4

FEATURE EXTRACTION AND CLASSIFICATION ALGORITHM

4.1 The Architecture of the Speech Recognition System

In this work, the proposed speech recognition model is divided into two stages, shown by the schematic diagram in Figure 4.1. The Feature Extractor (FE) block generates a sequence of feature vectors through the speech processing module, and the SOM then transforms the feature vectors into binary matrices before they proceed to the Recognizer. In the next stage, the Recognizer performs the binary matrix recognition.

Figure 4.1: Proposed speech recognition model: the speech signal passes through speech processing and the SOM, which together form the Feature Extractor (FE), and then through the MLP Recognizer to the output.

4.2 Feature Extractor (FE)

The objective of the FE block is to use a priori knowledge to transform an input in the signal space into an output in a feature space that satisfies some desired criteria (Rabiner, 1993). If many clusters in a high-dimensional space must be classified, the objective of the FE is to transform that space so that classification becomes easier; the FE block in a speech recognition system thus aims to reduce the complexity of the problem before the next stage starts to work with the data. Drawing on biological insights, neural networks such as the SOM (Kohonen, 1995) have been utilized in the design of speech recognition feature extractors. The proposed FE block is divided into two sub-blocks, as shown in Figure 4.2: the first block is based on speech processing techniques, and the second block uses a SOM for dimensionality reduction.
Figure 4.2: Feature Extractor (FE) schematic diagram, comprising sampling, frame blocking, pre-emphasis, Hamming windowing, endpoint detection, autocorrelation analysis, LPC analysis, cepstrum analysis and parameter weighting in the speech processing sub-block, followed by the Self-Organizing Map (SOM), which produces the binary matrix.

4.2.1 Speech Sampling

It might be argued that a higher sampling frequency, or more sampling precision, is needed to achieve higher recognition accuracy. However, if a normal digital phone, which samples speech at 8,000 Hz with 8-bit precision, is able to preserve most of the information carried by the signal (Kohonen, 1995), it does not seem necessary to increase the sampling rate beyond 11,025 Hz or the sampling precision beyond 8 bits; another reason behind such settings is that commercial speech recognizers typically use comparable parameter values and achieve impressive results. In this work, nevertheless, the incoming signal was sampled at 16 kHz with 16 bits of precision, as shown in Figure 4.3, because the speech to be recognized by the proposed system includes not only voiced speech but also unvoiced sounds such as /s/; a higher sampling frequency and precision were therefore chosen. The speech was recorded and sampled using an off-the-shelf, relatively inexpensive dynamic microphone and a standard PC sound card.

Figure 4.3: Speech signal for the word kosong01.wav, sampled at 16 kHz with a precision of 16 bits.

4.2.2 Frame Blocking

The speech signal is dynamic, or time-variant, in nature. According to Rabiner (1993), the speech signal can be assumed to be stationary when it is examined over a short period of time. In order to analyze the speech signal, it has to be blocked into frames of N samples, with adjacent frames separated by M samples, as shown in Figure 4.4; in other words, each frame is shifted by M samples from the adjacent frame. If the shift is small, the LPC spectra estimated from frame to frame will be very smooth; if there is no overlap between adjacent frames, part of the speech signal can be lost, and the correlation between the resulting LPC spectral estimates of adjacent frames will contain a noisy component. The value of N can range from 100 to 400 samples at a 16 kHz sampling rate.

Figure 4.4: Blocking of the speech waveform into overlapping frames with analysis frame length N and shift length M.
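As a worked example of these settings (the implementation in Chapter 5 uses N = 240 and M = 80; the 0.5 s utterance length here is assumed purely for illustration): at a 16 kHz sampling rate, N = 240 samples corresponds to a 15 ms analysis frame and M = 80 samples to a 5 ms shift, so an utterance of 0.5 s (8000 samples) is blocked into

\left\lfloor (8000 - 240)/80 \right\rfloor + 1 = 98

overlapping analysis frames.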
4.2.3 Pre-emphasis

In general, the digitized speech waveform has a high dynamic range and suffers from additive noise. For this reason, pre-emphasis is applied to spectrally flatten the signal in order to make it less susceptible to finite precision effects (such as overflow and underflow) later in the speech processing. The most widely used pre-emphasis is the fixed first-order system

H(z) = 1 - a z^{-1}, \quad 0.9 \le a \le 1.0   (4.1)

The most common value for a is 0.95 (Deller et al., 1993), so the pre-emphasis can be expressed as

\tilde{s}(n) = s(n) - 0.95\, s(n-1)   (4.2)

4.2.4 Windowing

Each frame is windowed to minimize the signal discontinuities at the beginning and end of the frame, that is, to taper the signal to zero at the frame boundaries. If we define the window as w(n), then the windowed signal is

\tilde{x}(n) = x(n)\, w(n), \quad 0 \le n \le N - 1   (4.3)

A typical window is the Hamming window, which has the form

w(n) = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N - 1   (4.4)

The analysis frame length N must be long enough that the tapering effects of the window do not seriously affect the result.

4.2.5 Autocorrelation Analysis

The windowed signal is then autocorrelated according to

R(m) = \sum_{n=0}^{N-1-m} \tilde{x}(n)\, \tilde{x}(n+m), \quad m = 0, 1, 2, \ldots, p   (4.5)

where p, the highest autocorrelation index, is the order of the LPC analysis. The selection of p depends primarily on the sampling rate: a speech spectrum can be represented as having an average density of two poles per kHz, so each kHz of sampling rate corresponds to one pole; in addition, a total of 3 to 4 poles are needed to adequately represent the source excitation spectrum and the radiation load (Rabiner, 1993). For a sampling rate of 16 kHz, the value of p can range between 16 and 20.

4.2.6 LPC Analysis

LPC analysis converts the autocorrelation coefficients R into the LPC parameters, here the LPC coefficients. The Levinson-Durbin recursive algorithm is used to perform the conversion from autocorrelation coefficients to LPC coefficients:

E^{(0)} = R(0)   (4.6)

k_i = \left[ R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) \right] / E^{(i-1)}, \quad 1 \le i \le p   (4.7)

a_i^{(i)} = k_i   (4.8)

a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i - 1   (4.9)

E^{(i)} = (1 - k_i^2)\, E^{(i-1)}   (4.10)

The set of equations (4.6)-(4.10) is solved recursively for i = 1, 2, ..., p, where p is the order of the LPC analysis. The k_i are the reflection or PARCOR coefficients, and the a_j are the LPC coefficients. The final solution for the LPC coefficients is given as

a_j = a_j^{(p)}, \quad 1 \le j \le p   (4.11)

4.2.7 Cepstrum Analysis

LPC cepstral coefficients can be derived directly from the LPC coefficients; Figure 4.5 shows the pattern of the cepstral coefficients for a speech sound. The conversion is done using the recursion

c_0 = a_0   (4.12)

c_m = a_m + \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \quad 1 \le m \le p   (4.13)

c_m = \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \quad m > p   (4.14)

Figure 4.5: Cepstral coefficients of BU.cep.

4.2.8 Endpoint Detection

The purpose of endpoint detection is to find the start and end points of the speech waveform (Rabiner and Sambur, 1975; Savoji, 1989; Yiying Zhang et al., 1997); the challenge is to locate the actual start and end points of the speech. A 3-level adaptive endpoint detection algorithm for isolated speech, based on time and frequency parameters, was developed in order to obtain the actual start and end points. The concept of Euclidean distance adopted in this algorithm can determine the segment boundaries between silence and voiced speech, as well as unvoiced speech. The algorithm consists of three basic modules: background noise estimation, initial endpoint detection, and actual endpoint detection. The initial endpoint detection is performed using root-mean-square energy and zero-crossing rate, and the actual endpoint detection is performed using the Euclidean distance between cepstral coefficients.
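To illustrate the first module of the 3-level algorithm, the following C sketch computes the frame-level root-mean-square energy and zero-crossing rate used for the initial endpoint decision. The function names and the threshold test are illustrative assumptions, not the exact implementation of Chapter 5.

#include <math.h>

/* Frame-level measurements used by the initial endpoint detection:
   root-mean-square (RMS) energy and zero-crossing rate (ZCR).
   Illustrative sketch; names and thresholds are assumptions. */
double frame_rms(const short s[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)s[i] * (double)s[i];
    return sqrt(sum / n);
}

int frame_zcr(const short s[], int n)
{
    int crossings = 0;
    for (int i = 1; i < n; i++)
        if ((s[i - 1] >= 0 && s[i] < 0) || (s[i - 1] < 0 && s[i] >= 0))
            crossings++;
    return crossings;
}

/* A frame is tentatively marked as speech if its energy exceeds an
   energy threshold, or if a high ZCR suggests unvoiced speech. */
int is_speech(double rms, int zcr, double rmsThr, int zcrThr)
{
    return (rms > rmsThr) || (zcr > zcrThr);
}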
4.2.9 Parameter Weighting

The lower-order cepstral coefficients are sensitive to the overall spectral slope, whereas the higher-order cepstral coefficients are sensitive to noise. In order to reduce these sensitivities, the cepstral coefficients are weighted with a standard technique. If we define the weighting window as w_m, then the general weighting is

\tilde{c}_m = w_m\, c_m, \quad 1 \le m \le p   (4.15)

w_m = 1 + \frac{p}{2} \sin\!\left( \frac{\pi m}{p} \right), \quad 1 \le m \le p   (4.16)

The weighted cepstral coefficients are then normalized to lie between -1 and +1. The normalization is expressed as

w_{\mathrm{normalized}} = 2 \left( \frac{w - w_{\min}}{w_{\max} - w_{\min}} \right) - 1   (4.17)
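A minimal C sketch of the weighting and normalization of Equations (4.15)-(4.17); the function and array names are illustrative assumptions:

#include <math.h>

#define PI 3.14159265358979

/* Apply the raised-sine weighting window of Equation (4.16) to the
   cepstral coefficients c[1..p], then rescale the weighted values
   into [-1, +1] as in Equation (4.17).
   Illustrative sketch; names are assumptions. */
void weight_and_normalize(double c[], int p)
{
    double wmin = 1e30, wmax = -1e30;
    for (int m = 1; m <= p; m++) {
        c[m] *= 1.0 + 0.5 * p * sin(PI * m / p);   /* Equation (4.16) */
        if (c[m] < wmin) wmin = c[m];
        if (c[m] > wmax) wmax = c[m];
    }
    for (int m = 1; m <= p; m++)                   /* Equation (4.17) */
        c[m] = 2.0 * (c[m] - wmin) / (wmax - wmin) - 1.0;
}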
4.3 Self-Organizing Map (SOM)

One of the objectives of the present work is to reduce the dimension of the feature vectors by using a SOM in the speech recognition problem. Figure 4.6 shows a diagram of the dimensionality reduction performed by the SOM: the dimensionality of the acoustic feature vectors (LP-cepstral coefficients) is reduced before they are fed into the Recognizer block. In this manner, the classification is highly simplified.

Figure 4.6: The SOM transforms the feature vectors (cepstral coefficients) generated by speech processing into a binary matrix, thereby performing dimensionality reduction.

Kohonen (1988b, 1995 and 2002) proposed a neural network architecture which automatically develops self-organization properties during an unsupervised learning process, namely the Self-Organizing Map (SOM). All the input vectors of the utterances are presented to the network sequentially in time, without specifying the desired output. After enough input vectors have been presented, the weight vectors from the input to the output nodes specify cluster or vector centers that sample the input space in such a way that the point density function of the vector centers tends to approximate the probability density function of the input vectors. In addition, the weight vectors become organized such that topologically close nodes are sensitive to inputs that are physically similar in Euclidean distance. Kohonen proposed an efficient learning algorithm for practical applications, and this learning algorithm is used in our proposed system.

Using the fact that the SOM is a Vector Quantization (VQ) scheme that preserves some of the topology of the original space (Villmann, 1999), the basic idea behind the approach proposed in this work is to use the output of a SOM, trained on the output of the speech processing block, to obtain a reduced feature representation (a binary matrix) that preserves some of the behavior of the original feature vectors. The problem is then reduced to finding the correct number of neurons (the dimension of the SOM): the optimal dimension has to be searched for, to ensure that the SOM has enough neurons to reduce the dimensionality of the feature vector while keeping enough information to achieve high recognition accuracy (Kangas et al., 1992; Salmela et al., 1996; Gavat et al., 1998).

4.3.1 SOM Architecture

The architecture of a SOM is shown in Figure 4.7. The SOM consists of only one real layer of neurons, arranged in a 2-D lattice, which can be represented graphically for visual display. This architecture implements a similarity measure using the Euclidean distance; for normalized vectors, this is equivalent to measuring the cosine of the angle between the input and weight vectors. Since the SOM algorithm uses the Euclidean metric to measure distances between data vectors, scaling of the variables is an important step, so the values of all input vectors were normalized: each input vector is normalized between -1 and +1 before it is fed into the network. Usually, the output of the network is given by the most active neuron, the winning neuron.

Figure 4.7: The 2-D SOM architecture: the LP-cepstral coefficients (p coefficients per frame, X_1 ... X_p) are connected through the weights m_ij to a 2-dimensional output space; the winner node forms the output, surrounded by its neighborhood set.

4.3.2 Learning Algorithm

The objective of the learning algorithm in SOM neural networks is the formation of a feature map which captures the essential characteristics of the p-dimensional input data and maps them onto the typically 2-D feature space. The learning algorithm embodies two essential aspects of map formation, namely competition and cooperation between the neurons of the output lattice.

Denote by M_{ij}(t) = \{ m_{ij}^1(t), m_{ij}^2(t), \ldots, m_{ij}^N(t) \} the weight vector of node (i, j) of the feature map at time instance t, where i, j = 1, ..., M are the horizontal and vertical indices of the square grid of output nodes and N is the dimension of the input vector. Denote the input vector at time t by X(t). The learning algorithm can be summarized as follows (Kohonen, 1995; Anderson, 1999; Yamada et al., 1999):

1. Initializing the weights. Prior to training, each node's weights must be initialized, typically to small standardized random values. The weights of the SOM in this research are initialized so that 0 < weight < 1.

2. Calculating the winner node, the Best Matching Unit (BMU). To determine the BMU, one method is to iterate through all the nodes and calculate the Euclidean distance between each node's weight vector and the current input vector; the node whose weight vector is closest to the input vector is tagged as the BMU. The Euclidean distance is given as

\mathrm{Dist} = \sqrt{ \sum_{i=0}^{n} \left( X_i(t) - M_{ij}(t) \right)^2 }   (4.18)

and the node with the minimum Euclidean distance to the input vector X(t) is selected:

\| X(t) - M_{i_c j_c}(t) \| = \min_{i,j} \| X(t) - M_{ij}(t) \|   (4.19)

3. Determining the Best Matching Unit's local neighborhood. In each iteration, after the BMU has been determined, the next step is to calculate which of the other nodes are within the BMU's neighborhood. The neighborhood is typically large near the commencement of training, and its area shrinks over time according to the exponential decay function

\sigma(t) = \sigma_0 \exp\!\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots   (4.20)

where \sigma_0 denotes the width of the lattice at time t = 0, \lambda denotes a time constant, and t is the current time-step. If a node is found to be within the neighborhood, its weight vector is adjusted as shown in the next step.

4. Adjusting the weights. Every node within the BMU's neighborhood, including the BMU (i_c, j_c) itself, has its weight vector adjusted according to

M_{ij}(t+1) = M_{ij}(t) + \alpha(t) \left( X(t) - M_{ij}(t) \right)
for i_c - N_c(t) \le i \le i_c + N_c(t) and j_c - N_c(t) \le j \le j_c + N_c(t), and
M_{ij}(t+1) = M_{ij}(t) for all other indices (i, j)   (4.21)

where t represents the time-step and \alpha is a small variable called the learning rate, which decreases with time. Basically, the new weight of a node equals its old weight plus a fraction \alpha of the difference between the old weight M and the input vector X. The decay of the learning rate is calculated at each iteration using

\alpha(t) = \alpha_0 \exp\!\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots   (4.22)

Ideally, the amount of learning should also fade over distance, similar to a Gaussian decay.
An adjustment is therefore made to the update of Equation (4.21):

M_{ij}(t+1) = M_{ij}(t) + \Theta(t)\, \alpha(t) \left( X(t) - M_{ij}(t) \right)   (4.23)

where \Theta represents the amount of influence that a node's distance from the BMU has on its learning, and is given by

\Theta(t) = \exp\!\left( -\frac{\mathrm{dist}^2}{2 \sigma^2(t)} \right), \quad t = 1, 2, 3, \ldots   (4.24)

where dist is the distance of the node from the BMU and \sigma is the width of the neighborhood function as calculated by Equation (4.20). Note that \Theta also decays over time.

5. Update the time, t = t + 1, present a new input vector, and go to Step 2.

6. Continue until \alpha(t) approaches a certain pre-defined value or t reaches the maximum number of iterations.

Figure 4.8 shows the flow of the SOM learning algorithm. The algorithm repeatedly presents all the frames until the termination condition is reached; the input vectors are the cepstral coefficients. After training, the test data are fed into the feature map to form binary matrices, which are used as input to the MLP for classification; the number of elements in a binary matrix determines the number of input nodes of the MLP.

Figure 4.8: Flow chart of the SOM learning algorithm: randomly initialize the weight vectors; obtain a training pattern and apply it to the input; determine the winner node with the minimum Euclidean distance to the input vector; update the weights of the nodes within the neighborhood set of the winner; decrease the gain \alpha(t) and the neighborhood set; update the time t = t + 1; if t >= max or \alpha(t) <= 0, save the trained map and stop, otherwise repeat.

After the learning process is completed, a trained feature map is formed. Figure 4.9 shows a trained feature map after 1,250,000 iterations. The neurons, shown as rectangles, are labeled with the symbols of the phonemes to which they learned to give the best responses: neurons labeled '?' are confused between phonemes, neurons labeled 'sil' respond to silence (no voice), and neurons labeled 'B', 'K', 'M', 'S', 'T', 'A', 'I' and 'U' respond to the corresponding phonemes /B/, /K/, /M/, /S/, /T/, /A/, /I/ and /U/. From the figure, we can see that the SOM classifies the phonemes in an orderly fashion, with neurons responding to similar phonemes lying near one another in the map. A suitable size or dimension of the map has to be determined during the learning process in order to provide enough feature space for the speech phonemes to be trained.

Figure 4.9: Trained feature map after 1,250,000 iterations.
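The following C sketch pulls Steps 2-4 together into a single training iteration. The map size, the decay constants and the global weight array are illustrative assumptions rather than the exact implementation used in this research.

#include <math.h>

#define MAPDIM 12   /* assumed 12 x 12 map, as in Figure 4.10 */
#define VECDIM 12   /* assumed cepstral order                 */

static double m[MAPDIM][MAPDIM][VECDIM];   /* weight vectors M_ij */

/* One SOM training iteration: find the BMU for input x, then pull
   every node towards x, scaled by the Gaussian factor of Eq. (4.24).
   Illustrative sketch; names and constants are assumptions. */
void som_iteration(const double x[VECDIM], double t, double sigma0,
                   double alpha0, double lambda)
{
    int ic = 0, jc = 0;
    double best = 1e30;

    /* Step 2: Best Matching Unit by (squared) Euclidean distance */
    for (int i = 0; i < MAPDIM; i++)
        for (int j = 0; j < MAPDIM; j++) {
            double d = 0.0;
            for (int k = 0; k < VECDIM; k++) {
                double diff = x[k] - m[i][j][k];
                d += diff * diff;
            }
            if (d < best) { best = d; ic = i; jc = j; }
        }

    /* Step 3: shrinking neighborhood width, Eq. (4.20),
       and learning-rate decay, Eq. (4.22) */
    double sigma = sigma0 * exp(-t / lambda);
    double alpha = alpha0 * exp(-t / lambda);

    /* Step 4: Gaussian-weighted update, Eqs. (4.23)-(4.24) */
    for (int i = 0; i < MAPDIM; i++)
        for (int j = 0; j < MAPDIM; j++) {
            double dist2 = (double)((i - ic) * (i - ic) +
                                    (j - jc) * (j - jc));
            double theta = exp(-dist2 / (2.0 * sigma * sigma));
            for (int k = 0; k < VECDIM; k++)
                m[i][j][k] += theta * alpha * (x[k] - m[i][j][k]);
        }
}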
4.3.3 Dimensionality Reduction

In the second stage of the FE, the SOM performs the dimensionality reduction shown in Figure 4.10. The SOM is used to transform the LPC cepstrum vectors into a trajectory in binary matrix form. The LPC cepstrum vectors are fed into the 2-dimensional feature map, and the node in the feature map with the closest weight vector gives the response; this node is called the winner node. The winner node is then scaled to the value "1" and all other nodes are scaled to the value "0". All the winner nodes in the feature map are accumulated into a binary matrix with the same dimensions as the feature map: if a node in the map has ever been a winner, the corresponding matrix element is unity. The SOM therefore serves as a sequential mapping function, transforming the acoustic vector sequence of a speech signal into a two-dimensional binary pattern. Figure 4.11(a) shows an example of a binary matrix in which a filled square denotes the value 1 and an empty square denotes the value 0. After mapping all the speech frames of a word, a vector made by cascading the columns of the matrix excites an MLP, which has been trained by the backpropagation algorithm for classifying the words of the vocabulary. Figure 4.11(b) shows the values of the binary matrix fed into the MLP for further processing.

As in the example of Figure 4.10, suppose 12 cepstral coefficients are generated for each feature vector during speech processing using LPC analysis. If 50 feature vectors (frames) are needed for a word, a total of 600 cepstral coefficients would have to be fed as input to the MLP network; and if the number of vectors or the order of the cepstral coefficients increases, the number of input nodes (the dimension of the input layer) of the MLP must also increase. By using the binary matrix produced by the SOM as input to the MLP network, this problem is overcome, because the dimension of the binary matrix is determined by the chosen size of the feature map, which stays fixed even when the order of the cepstral coefficients is increased.

Figure 4.10: Dimensionality reduction performed by the SOM: 50 feature vectors (600 LP-cepstral coefficients) are mapped through a 12 x 12 Self-Organizing Map into a binary matrix of 144 nodes.

Figure 4.11(a): The 12 x 12 binary-matrix mapping of the /bu/ syllable, with regions dominated by /b/ and /u/.

1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 1 0 0 1 0 0
0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0

Figure 4.11(b): Binary matrix of /bu/ which is fed as input to the MLP.
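A minimal C sketch of this accumulation step; find_bmu is an assumed helper performing the Step 2 search of Section 4.3.2, and the other names and constants are likewise illustrative assumptions.

/* Accumulate the winner nodes of all frames of one utterance into a
   binary matrix B with the same dimensions as the feature map, as in
   Figure 4.11.  Illustrative sketch; names are assumptions. */
void utterance_to_binary_matrix(double frames[][VECDIM], int n_frames,
                                int B[MAPDIM][MAPDIM])
{
    for (int i = 0; i < MAPDIM; i++)
        for (int j = 0; j < MAPDIM; j++)
            B[i][j] = 0;                    /* all nodes start at "0"  */

    for (int f = 0; f < n_frames; f++) {
        int ic, jc;
        find_bmu(frames[f], &ic, &jc);      /* assumed BMU helper      */
        B[ic][jc] = 1;                      /* winner node becomes "1" */
    }
    /* Cascading the columns of B gives the MAPDIM*MAPDIM input vector
       (144 nodes for a 12 x 12 map) that is fed to the MLP. */
}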
4.4 Multilayer Perceptron (MLP)

The output of the FE block is "blind": it does not care which word is represented by the trajectory, and only converts an incoming pressure wave into a trajectory in some feature space. It is the Recognizer block that discovers the relationships between the trajectories and the classes to which they belong, adding the notion of class of utterance to the system. In our hybrid system, a Multilayer Perceptron (MLP) is used as the classifier. The number of output nodes corresponds to the number of classes, with an appropriately chosen number of hidden nodes, and the number of input nodes is the same as the dimension of the input vector (the binary matrix from the SOM). The MLP is trained by the error-backpropagation algorithm.

4.4.1 MLP Architecture

A three-layer MLP with an input layer, a hidden layer and an output layer is shown in Figure 4.12 (Bourland and Wellekens, 1987; Md Sah Haji Salam et al., 2001; Ting et al., 2001; Ahad et al., 2002; Zbancioc and Costin, 2003). The neurons in the input, hidden and output layers are denoted by x_i, h_j and y_k respectively, where i = 1, 2, ..., I; j = 1, 2, ..., J; and k = 1, 2, ..., K. Here I, J and K are the numbers of neurons in the input, hidden and output layers respectively. The connection weights from the input layer to the hidden layer are denoted by w_{ij}; similarly, w_{jk} are the connection weights from the hidden layer to the output layer. The network is fully connected in the sense that every neuron in each layer is connected to every neuron in the adjacent forward layer by the connection weights.

Figure 4.12: A three-layer Multilayer Perceptron: the input layer X receives the binary matrix; the weights w_ij connect X to the hidden layer H, and the weights w_jk connect H to the output layer Y; the biases bh_j and by_k feed the hidden and output neurons.

The hidden layer and output layer may have biases, which act just like connection weights. These biases are connected to the hidden neurons and output neurons, and they can be fixed or adjustable according to the network model used; the typical value for a fixed bias is 1. The biases connected to the hidden layer and output layer are denoted by bh_j and by_k respectively.

4.4.2 Activation Function

The basic operation of a neuron involves summing its weighted input signals and applying an activation function; typically, a nonlinear activation function is used for all the neurons in the network. According to Fausett (1994), an activation function used for a back-propagation network should be continuous, differentiable and monotonically non-decreasing, and its derivative should be easy to compute for computational efficiency. The most typical activation function is the binary sigmoid function, which has a range between zero and one and is defined as

f(x) = \frac{1}{1 + e^{-x}}   (4.25)

4.4.3 Error Backpropagation (BP)

Error backpropagation (BP), or the Generalized Delta Rule (Lippmann, 1989), is the most widely used supervised training algorithm for neural networks, especially the MLP. Because of its importance, we discuss it in some detail in this section, beginning with a full derivation of the learning rule. Suppose we have a multilayered feed-forward network of nonlinear (typically sigmoid) units, and we want to find values for the weights that will enable the network to compute a desired function from input vectors to output vectors. Because the units compute nonlinear functions, we cannot solve for the weights analytically; instead, we use a gradient descent procedure on some global error function E.

In the back-propagation stage, the error is first calculated at the output layer; it is then propagated backwards to the hidden layer and finally to the input layer. Each output neuron y_k compares its calculated (actual) output with the corresponding target value to obtain the error information term \delta_k, which is then used to calculate the weight correction term \Delta w_{jk} and the bias correction term \Delta by_k. The \Delta w_{jk} is later used to update the connection weight w_{jk}, and similarly \Delta by_k is used to update the bias by_k; \eta is the learning rate of the network.

\delta_k = (t_k - y_k)\, f'(y\_input_k)   (4.26a)
\delta_k = (t_k - y_k)\, y_k (1 - y_k)   (4.26b)
\Delta w_{jk} = \eta\, \delta_k\, h_j   (4.27)
\Delta by_k = \eta\, \delta_k   (4.28)

The error information term \delta_k is sent back to the hidden layer as a delta input, and each hidden neuron sums its delta inputs to give

\delta\_input_j = \sum_{k=1}^{K} \delta_k\, w_{jk}   (4.29)

The error information term at the hidden layer, \delta_j, is determined by multiplying the derivative of the activation function by \delta\_input_j:

\delta_j = \delta\_input_j\, f'(h\_input_j)   (4.30a)
\delta_j = \delta\_input_j\, h_j (1 - h_j)   (4.30b)

The \delta_j is then used to calculate the weight correction term \Delta w_{ij} and the bias correction term \Delta bh_j, which are later used to update w_{ij} and bh_j respectively:

\Delta w_{ij} = \eta\, \delta_j\, x_i   (4.31)
\Delta bh_j = \eta\, \delta_j   (4.32)

The weights and biases are updated using the weight and bias correction terms respectively.
The adjustment makes use of the current weights and biases. At the output layer, each output neuron y_k updates its weights and biases according to

w_{jk}(\mathrm{new}) = w_{jk}(\mathrm{old}) + \Delta w_{jk}   (4.33)
by_k(\mathrm{new}) = by_k(\mathrm{old}) + \Delta by_k   (4.34)

Similarly, at the hidden layer, each hidden neuron updates its weights and biases based on

w_{ij}(\mathrm{new}) = w_{ij}(\mathrm{old}) + \Delta w_{ij}   (4.35)
bh_j(\mathrm{new}) = bh_j(\mathrm{old}) + \Delta bh_j   (4.36)

The adjustment can be improved by adding a momentum constant \alpha to the weight-updating formulas. The purpose of the momentum is to accelerate the convergence of the BP learning algorithm; it can be very useful when some training data are very different from the majority of the data. In BP with a momentum term, the weights and biases are updated not only with the current gradient but also with the previous gradient. The modified adjustments are as follows:

1. At the output layer:

w_{jk}(t+1) = w_{jk}(t) + \Delta w_{jk}(t+1),
where \Delta w_{jk}(t+1) = \eta\, \delta_k\, h_j + \alpha\, \Delta w_{jk}(t)   (4.37a, 4.37b)

by_k(t+1) = by_k(t) + \Delta by_k(t+1),
where \Delta by_k(t+1) = \eta\, \delta_k + \alpha\, \Delta by_k(t)   (4.38a, 4.38b)

2. At the hidden layer:

w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t+1),
where \Delta w_{ij}(t+1) = \eta\, \delta_j\, x_i + \alpha\, \Delta w_{ij}(t)   (4.39a, 4.39b)

bh_j(t+1) = bh_j(t) + \Delta bh_j(t+1),
where \Delta bh_j(t+1) = \eta\, \delta_j + \alpha\, \Delta bh_j(t)   (4.40a, 4.40b)
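The following C sketch implements one training step of Equations (4.26b)-(4.36) for a single pattern, without the momentum term. The layer sizes and array names are illustrative assumptions, not the configuration used in the experiments.

#include <math.h>

#define NI 144   /* input nodes: assumed 12 x 12 binary matrix */
#define NH 30    /* hidden nodes: assumed                      */
#define NO 10    /* output nodes: assumed one per class        */

static double wij[NI][NH], bh[NH];   /* input->hidden weights, biases  */
static double wjk[NH][NO], by[NO];   /* hidden->output weights, biases */

static double sigmoid(double v) { return 1.0 / (1.0 + exp(-v)); } /* Eq. (4.25) */

/* One error-backpropagation step for input x[] with target t[],
   using learning rate eta.  Illustrative sketch only. */
void bp_step(const double x[NI], const double t[NO], double eta)
{
    double h[NH], y[NO], dk[NO], dj[NH];

    /* forward pass through hidden and output layers */
    for (int j = 0; j < NH; j++) {
        double net = bh[j];
        for (int i = 0; i < NI; i++) net += x[i] * wij[i][j];
        h[j] = sigmoid(net);
    }
    for (int k = 0; k < NO; k++) {
        double net = by[k];
        for (int j = 0; j < NH; j++) net += h[j] * wjk[j][k];
        y[k] = sigmoid(net);
    }

    /* output error terms, Eq. (4.26b) */
    for (int k = 0; k < NO; k++)
        dk[k] = (t[k] - y[k]) * y[k] * (1.0 - y[k]);

    /* hidden error terms, Eqs. (4.29)-(4.30b) */
    for (int j = 0; j < NH; j++) {
        double din = 0.0;
        for (int k = 0; k < NO; k++) din += dk[k] * wjk[j][k];
        dj[j] = din * h[j] * (1.0 - h[j]);
    }

    /* output-layer updates, Eqs. (4.27)-(4.28), (4.33)-(4.34) */
    for (int k = 0; k < NO; k++) {
        for (int j = 0; j < NH; j++) wjk[j][k] += eta * dk[k] * h[j];
        by[k] += eta * dk[k];
    }
    /* hidden-layer updates, Eqs. (4.31)-(4.32), (4.35)-(4.36) */
    for (int j = 0; j < NH; j++) {
        for (int i = 0; i < NI; i++) wij[i][j] += eta * dj[j] * x[i];
        bh[j] += eta * dj[j];
    }
}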
The best number of hidden units depends in a complex way on many factors, including the number of training patterns, the numbers of input and output units, the amount of noise in the training data, the complexity of the function or classification to be learned, the type of hidden-unit activation function, and the training algorithm. Too few hidden units generally leave high training and generalization errors due to under-fitting. Too many hidden units give low training errors, but make training unnecessarily slow and result in poor generalization unless some other technique (such as regularization) is used to prevent over-fitting. A sensible strategy is to try a range of numbers of hidden units and see which works best.

One rough guideline for choosing the number of hidden neurons in many problems is the Geometric Pyramid Rule (GPR). It states that, for many practical networks, the number of neurons follows a pyramid shape, decreasing from the input layer toward the output layer, with the number of neurons in each layer following a geometric progression. For a three-layer network with $X$ input nodes and $Y$ output nodes, the hidden node number is

$$H = \sqrt{X \times Y}$$

The determination of the hidden node number using GPR is shown in Figure 4.13. Other investigators (Berke and Hajela, 1991) suggested that the number of nodes on the hidden layer should lie between the average and the sum of the numbers of nodes on the input and output layers.

Figure 4.13: The determination of the hidden node number using the Geometric Pyramid Rule (GPR), with $X$ the number of input nodes and $Y$ the number of output nodes; the worked example in the figure evaluates to $H = 6$.

A network as a whole usually learns most efficiently if all its neurons are learning at roughly the same speed, so different parts of the network may benefit from different learning rates. A number of factors affect the choice:

i. The later network layers (nearer the outputs) tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).
ii. The activations of units with many connections feeding into or out of them tend to change faster than those of units with fewer connections.
iii. The activations required for linear units differ from those required for sigmoid units.
iv. There is empirical evidence that it helps to use different learning rates for the thresholds or biases than for the real connection weights.

In practice, however, learning often proceeds faster by simply using the same rate for all weights and thresholds, rather than spending time working out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates (Maniezzo, 1994).

4.4.5 Implementation of Error-Backpropagation

The error-backpropagation learning algorithm is summarized in the flow chart shown in Figure 4.14.

Figure 4.14: Flow chart of the error-backpropagation algorithm: initialize the weights and biases; randomize the order of the training patterns; for each pattern, apply it to the input, calculate the hidden- and output-neuron activations, calculate the output and hidden error information and correction terms, and update the weights and biases in the output and hidden layers; repeat until the maximum epoch ($\geq Epoch_{max}$) or the minimum error ($\leq E_{min}$) criterion is met, then stop.

4.5 Summary

In summary, a hybrid neural network model has been proposed and explained in this chapter.
Basically, the proposed system consists of three main parts: speech processing, feature extraction with the SOM, and recognition performed by the MLP. The algorithms of the SOM and the MLP have also been presented step by step in this chapter. In this research, the SOM plays an important role in the dimensionality reduction of the feature vectors: it transforms higher-dimensional cepstral coefficients into lower-dimensional binary matrices. The error-backpropagation learning algorithm is used to train the MLP network.

CHAPTER 5

SYSTEM DESIGN AND IMPLEMENTATION

5.1 Introduction

This chapter discusses the implementation of the speech recognition system. The discussion covers speech processing (LPC feature extraction and endpoint detection), the Self-Organizing Map (SOM), the Multilayer Perceptron (MLP), and the training and testing of the system. The implementation of the speech recognition system is shown in Figure 5.1.

Figure 5.1: The implementation of the speech recognition system (speech processing, SOM, MLP, and training and testing).

5.2 Implementation of Speech Processing

In speech processing, speech features are extracted from the speech sounds. Speech processing involves two main phases: endpoint detection and speech feature extraction. In this research, Malay syllables, two-syllable words and digits are used as the target speech sounds, so an appropriate endpoint detection method is needed to determine the starting and ending points of a speech sound. A 3-level endpoint detection algorithm is proposed to improve on the performance of the traditional method. For speech feature extraction, LPC analysis is used to extract LPC-derived cepstral coefficients.

5.2.1 Feature Extraction using LPC

The purpose of speech feature extraction is to extract LPC-derived cepstral coefficients from the syllable sounds. These features serve as the inputs to the speech classification system. The following section discusses the implementation of the LPC analysis. The steps of the feature extraction are as follows:

Step 1: Block the speech samples into analysis frames. The analysis frame length is 240 sample points and the frame shift is 80 sample points. The number of analysis frames is determined after the endpoint detection.

Step 2: Preemphasize the signals of the syllable sound. The array Buftemp[] is the buffer that stores the preemphasized signals, and the array Buf[] is the buffer that stores the syllable sample points.

    Buftemp[0] = (double)Buf[0];        // no previous sample for the first point
    for (i = 1; i <= NN - Shift; i++)
        Buftemp[i] = (double)Buf[i] - 0.95 * (double)Buf[i - 1];

Step 3: Apply a Hamming window to every analysis frame. A Hamming window is generated before the windowing; the array Hamming[] stores the window values and Frame is the analysis frame length.

    for (i = 0; i < Frame; i++)         // 6.283 approximates 2*pi
        Hamming[i] = 0.54 - 0.46 * cos(i * 6.283 / (Frame - 1));

The array temp[] is the buffer that stores the Hamming-windowed signals.

    for (i = 0; i <= (NN - Shift) / Shift; i++)   // i indexes analysis frames
        for (j = 0; j <= Frame - 1; j++) {
            temp[j] = Buftemp[i * Shift + j];
            temp[j] *= Hamming[j];
        }

Step 4: Perform autocorrelation analysis. Each analysis frame is autocorrelated to give the autocorrelation coefficients. LPCOrder is the order of the autocorrelation analysis and the array Autocorr[] stores the autocorrelation coefficients.

    for (k = 0; k <= LPCOrder; k++) {
        aucorr = 0.0;
        for (l = 0; l <= Frame - 1 - k; l++)
            aucorr += temp[l] * temp[l + k];
        Autocorr[k] = aucorr;
    }

Step 5: Perform LPC analysis for each analysis frame.
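Step 5 implements the Levinson-Durbin recursion. For reference, its standard form is reconstructed here from the code that follows (this is a consistency note, not text from the thesis), using the thesis's symbols $E$ for the prediction error and $k_r$ for the reflection coefficients, with $R(\cdot)$ the autocorrelation coefficients and $a_s$ the LPC coefficients:

$$k_r = \frac{R(r) - \sum_{s=1}^{r-1} a_s^{(r-1)} R(r-s)}{E^{(r-1)}}, \qquad a_r^{(r)} = k_r,$$
$$a_s^{(r)} = a_s^{(r-1)} - k_r\, a_{r-s}^{(r-1)} \quad (1 \le s \le r-1), \qquad E^{(r)} = \left(1 - k_r^2\right) E^{(r-1)},$$

with the initial conditions $E^{(0)} = R(0)$ and $k_1 = R(1)/R(0)$.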
LPC coefficients are computed from the autocorrelation coefficients. The array LPCBuf[] stores the LPC coefficients and the array temp2[] stores them temporarily. E is the prediction error and kn[] holds the reflection coefficients.

    double ar, kn[LPCOrder + 1], E;
    LPCBuf[0] = 1;
    E = Autocorr[0];
    kn[1] = Autocorr[1] / Autocorr[0];
    LPCBuf[1] = kn[1];
    E = (1 - kn[1] * kn[1]) * E;        // update the prediction error after the first step
    for (r = 2; r <= LPCOrder; r++) {
        ar = 0;
        for (s = 1; s <= r - 1; s++)
            ar += LPCBuf[s] * Autocorr[r - s];
        kn[r] = (Autocorr[r] - ar) / E;
        LPCBuf[r] = kn[r];
        for (s = 1; s <= r - 1; s++)
            temp2[s] = LPCBuf[s] - kn[r] * LPCBuf[r - s];
        E = (1 - kn[r] * kn[r]) * E;
        for (s = 1; s <= r - 1; s++)
            LPCBuf[s] = temp2[s];
    }

Step 6: Convert the LPC coefficients to cepstral coefficients using the standard recursion $c_1 = a_1$, $c_n = a_n + \sum_{u=1}^{n-1} (u/n)\, c_u a_{n-u}$ for $1 < n \le p$, and $c_n = \sum_{u=n-p}^{n-1} (u/n)\, c_u a_{n-u}$ for $n > p$, where $p$ is the LPC order. CEPOrder is the order of the cepstral coefficients and the 2-D array CEPBuf[] stores them.

    int t, u;
    for (t = 1; t <= LPCOrder; t++)
        lpc[i][t] = LPCBuf[t];
    CEPBuf[i][1] = LPCBuf[1];
    for (t = 2; t <= LPCOrder; t++) {
        double sum = 0;
        for (u = 1; u <= t - 1; u++)
            sum += u * CEPBuf[i][u] * LPCBuf[t - u] / t;
        CEPBuf[i][t] = LPCBuf[t] + sum;     // add the partial sums once, after the loop
    }
    if (CEPOrder > LPCOrder) {
        for (t = LPCOrder + 1; t <= CEPOrder; t++) {
            double sum = 0;
            for (u = t - LPCOrder; u <= t - 1; u++)   // only terms with valid LPC index
                sum += u * CEPBuf[i][u] * LPCBuf[t - u] / t;
            CEPBuf[i][t] = sum;
        }
    }

Step 7: Perform endpoint detection to obtain the actual start and end points of the speech sound, using the energy power in terms of root mean square (rms), the zero-crossing rate and the Euclidean distance between the cepstral coefficients of adjacent analysis frames. This 3-level endpoint detection algorithm is discussed in more detail in the next section.

Step 8: Perform cepstral weighting and normalization for every analysis frame. A cepstral window is generated before the weighting; the array Cepwin[] stores the weighting window. After the weighting, every frame of cepstral coefficients is normalized between -1 and +1.

    for (i = 1; i <= CEPOrder; i++)     // generate the cepstral weighting window
        Cepwin[i] = 1.0 + CEPOrder * 0.5 * sin(i * 3.1416 / CEPOrder);

    for (t = 1; t <= CEPOrder; t++)     // perform cepstral weighting
        CEPBuf[i][t] = Cepwin[t] * CEPBuf[i][t];

    // Normalize the cepstral coefficients of each frame between -1 and +1
    temp3 = new double[CEPOrder + 1];
    double max = -5.0, min = 5.0;
    for (v = 1; v <= CEPOrder; v++)
        temp3[v] = CEPBuf[i][v];
    for (w = 1; w <= CEPOrder; w++) {
        if (temp3[w] < min) min = temp3[w];
        if (temp3[w] > max) max = temp3[w];
    }
    for (x = 1; x <= CEPOrder; x++)
        CEPBuf[i][x] = (((CEPBuf[i][x] - min) / (max - min)) * 2 - 1);

Step 9: Save all the cepstral coefficients into a "cep" file together with their target values. The files are named with numerical indices starting from zero up to the last ".wav" file in the training and testing sets. The cepstral coefficients of the training speech dataset are saved together in one cepstral file named according to their cepstral order.

    for (x = 0; x < TotalFrameNumber; x++) {
        for (y = 1; y <= CEPOrder; y++)
            fprintf(SaveFile, "%.4f\t", CEPBuf[x][y]);
        fprintf(SaveFile, "\n");
    }
    // the first 20 speech sounds form the training dataset
    if (index <= 20) {
        for (x = 0; x < TotalFrameNumber; x++)
            for (y = 1; y <= CEPOrder; y++)
                fprintf(TrainingCEP, "%.4f\n", CEPBuf[x][y]);
    }

Step 10: Repeat Steps 1 to 9 for every ".wav" file in the training set and test set.

5.2.2 Endpoint Detection

The purpose of the endpoint detection is to determine the actual start and end points of the speech sounds.
A 3-level adaptive endpoint detection algorithm for isolated speech, based on time- and frequency-domain parameters, has been developed to obtain the actual start and end points of the speech (Ahmad, 2004). The Euclidean distance measurement adopted in this algorithm can determine the segment boundaries between silence and voiced as well as unvoiced speech. The algorithm consists of three basic modules: background noise estimation, initial endpoint detection, and actual endpoint detection. The initial endpoint detection is performed using the energy power in terms of root mean square (rms) and the zero-crossing rate, while the actual endpoint detection is performed using the Euclidean distance between the cepstral coefficients of adjacent analysis frames.

(1) Background Noise Estimation

Before the background noise is estimated, the speech data are normalized with respect to their maximum value and then pre-emphasized with a first-order low-pass filter; the speech data are also smoothed by a Hamming window. The background noise estimate is used to set the threshold values in the following steps, and is computed from samples taken at the beginning and the end of the input signal. The rms energy power $E_n$ of frame $n$ is computed as

$$E_n = \left[ \frac{1}{W} \sum_{i=1}^{W} \tilde{S}_n^2[i] \right]^{1/2} \quad (5.1)$$

in which $i = 1, 2, \ldots, W$, $W$ is the length of a frame (we use $W = 240$), and $n = 1, 2, \ldots, N$ is the frame index with $N$ the total number of frames.

    for (Window = 0; Window <= NN - Shift; Window += Shift) {
        power = 0;
        for (j = 0; j < Frame; j++)
            power += (_int64)pow(Buf[j + Window], 2);
        power = power / Frame;
        power = sqrtl(power);
        power = power / 32768;          // normalize by the 16-bit sample range
        rmsenergy[i] = power;
        i++;                            // next frame
    }

The noise level at the front end of the signal, $E_f$, is estimated using the first 5 energy frames whose energy values are mutually consistent:

$$E_f = \frac{1}{5} \sum_{i=1}^{5} E_i \quad (5.2)$$

The noise level at the back end of the signal, $E_b$, is estimated in the same way, using the last 5 frames whose energy values are mutually consistent:

$$E_b = \frac{1}{5} \sum_{i=N-4}^{N} E_i \quad (5.3)$$

Finally, the background noise level of the input signal, $E_N$, is estimated from the noise levels at the front and back ends as:

$$E_N = \frac{E_f + E_b}{2} \quad (5.4)$$

    for (i = e_start; i < e_start + 5; i++)
        e1 = e1 + rmsenergy[i];
    e1 = e1 / 5.0;
    for (i = e_end; i > e_end - 5; i--)
        e2 = e2 + rmsenergy[i];
    e2 = e2 / 5.0;
    e_noise = (e1 + e2) / 2.0;

However, the rms energy of the background noise should lie within two limit thresholds; otherwise the speech signal is rejected as either too noisy or under-amplified. The zero crossing rate of the background noise is estimated in a similar way to the rms energy parameter.
The background noise estimation for the zero crossing rate (Equations 5.5 – 5.8) is as follows:

$$Z_n = \frac{1}{2W} \sum_{i=1}^{W-1} \left| \operatorname{sgn}\!\big(\tilde{S}_n[i+1]\big) - \operatorname{sgn}\!\big(\tilde{S}_n[i]\big) \right| \quad (5.5)$$

in which

$$\operatorname{sgn}(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ -1, & \text{if } x < 0 \end{cases}$$

    for (j = 0; j < NN - Shift; j += Shift) {
        ZCnumber = 0;
        for (i = 0; i < Frame - 1; i++) {
            if ((Buf[j + i] * Buf[j + i + 1]) < 0)  // sign change = zero crossing
                ZCnumber++;
        }
        zerocrossing[k] = (double)ZCnumber / (double)Frame;
        fprintf(ZCR, "%d %d - %lf\n", k, j, zerocrossing[k]);
        k++;
    }

$$Z_f = \frac{1}{5} \sum_{i=1}^{5} Z_i \quad (5.6)$$
$$Z_b = \frac{1}{5} \sum_{i=N-4}^{N} Z_i \quad (5.7)$$
$$Z_N = \frac{Z_f + Z_b}{2} \quad (5.8)$$

    for (i = z_start; i < z_start + 5; i++)
        f1 = f1 + zerocrossing[i];
    f1 = f1 / 5.0;
    for (i = z_end; i > z_end - 5; i--)
        f2 = f2 + zerocrossing[i];
    f2 = f2 / 5.0;
    zcr_noise = (f1 + f2) / 2.0;

As with the rms energy, the zero crossing rate of the background noise should lie within two limit thresholds; otherwise the speech signal is rejected as either too noisy or under-amplified.

(2) Level 1 and 2: Initial Endpoint Detection

The starting point of the first voiced segment of the input utterance and the ending point of the last one are located and used as reference points for the detection of the actual endpoints of the speech signal. The search begins from the frame with the highest rms energy and proceeds to the left one frame at a time. The first frame whose rms energy falls below an energy threshold $T_e$ is assumed to lie at the beginning of the first voiced segment. Thus, the starting point $P_{F1}$ of the front voiced segment is obtained by

$$P_{F1} = \arg\max_n \{E_n < T_e,\; n = m, m-1, \ldots, 0\} \quad (5.9)$$

in which $E_n$ is defined by Equation 5.1 and $m$ is the index of the frame with the highest rms energy. The threshold $T_e$ is derived experimentally from the background noise $E_N$ using the relation

$$T_e = C_e \times E_N \quad (5.10)$$

where $C_e$ is an experimentally derived constant. In the same way, the ending point $P_{B1}$ of the last voiced segment is obtained by searching the energy function backwards, from right to left:

$$P_{B1} = \arg\min_n \{E_n < T_e,\; n = m, m+1, \ldots, N\} \quad (5.11)$$

    for (Window = 0; Window < Total_Frame; Window++) {
        if (rmsenergy[Window] < rmse_threshold) Counter1 = 0;
        if (rmsenergy[Window] > rmse_threshold) Counter1++;
        if (Counter1 >= 3) {
            Start = (Window - 2) * Shift;
            break;
        }
    }
    for (Window = Start / 80; Window < Total_Frame; Window++) {
        if (rmsenergy[Window] > rmse_threshold) Counter2 = 0;
        if (rmsenergy[Window] < rmse_threshold) Counter2++;
        if (Counter2 >= 3) {
            End = (Window - 2) * Shift;
            break;
        }
    }

If Equations 5.9 and 5.11 cannot be satisfied, or if the distance between the points $P_{F1}$ and $P_{B1}$ is below a certain threshold, the algorithm reports the absence of speech in the input signal and the procedure terminates. Otherwise, the speech signal between these two reference points is assumed to be a voiced speech segment.

Next, the zero crossing parameter is used to relax the endpoints. The search begins from point $P_{F1}$ and proceeds backwards. The reference starting point $P_{F2}$ is obtained by

$$P_{F2} = \arg\max_n \{Z_n > T_{ZF},\; n = P_{F1}, P_{F1}-1, \ldots, 1\} \quad (5.12)$$

in which $Z_n$ is defined by Equation 5.5 and $T_{ZF}$ is the zero crossing threshold for the front end, defined by

$$T_{ZF} = C_{ZF} \times Z_N \quad (5.13)$$

where $C_{ZF}$ is obtained experimentally. In the same way, the reference ending point $P_{B2}$ is obtained by searching the zero crossing function forwards from $P_{B1}$:

$$P_{B2} = \arg\min_n \{Z_n > T_{ZB},\; n = P_{B1}, P_{B1}+1, \ldots, N\} \quad (5.14)$$

where $T_{ZB}$ is the zero crossing threshold for the back end, defined by

$$T_{ZB} = C_{ZB} \times Z_N \quad (5.15)$$

and $C_{ZB}$ is obtained experimentally.

    for (i = (Start / 80); i >= 0; i--) {
        if (zerocrossing[i] < z_front_threshold)
            counter1++;
        else
            counter1 = 0;
        if (counter1 >= 2) {
            zcr_start = (i + 2) * 80;
            break;
        }
    }
    for (i = (End / 80); i < Total_Frame; i++) {
        if (zerocrossing[i] < z_back_threshold)
            counter2++;
        else
            counter2 = 0;
        if (counter2 >= 2) {
            zcr_end = (i - 1) * 80;
            break;
        }
    }

Because the starting and ending phonemes of an isolated utterance have different characteristics, different zero crossing thresholds are used for determining the starting point and the ending point.

(3) Level 3: Actual Endpoint Detection

In this part, the implementation is based on the discrimination between the current frame and the last retained frame $j$, comparing this distance with a distance threshold. The simplest discrimination measure we used, which also proved successful, is the weighted Euclidean distance $D$. This method emphasizes the transient regions, which are the most relevant for speech recognition, and the boundary between voiced speech and silence can be determined by adopting its principle. The decision criterion then becomes: leave the current frame out if $D(i, j) < T_D$.

Since the changes in a speech signal are better embodied in the frequency domain, and cepstral coefficients can be compared by Euclidean distance, the cepstral coefficients are adopted to determine the actual endpoints. Let $D(i, j)$ be the Euclidean distance between the cepstral coefficient vectors of frames $i$ and $j$. If $D(i, j)$ is greater than the threshold $T_D$ during the search, a transition between silence and voiced or unvoiced speech is assumed to occur. To avoid sudden high-energy noise being mistaken for speech, three consecutive frames are checked. Searching forwards from frame $P_{F2}$ until frame $P_{B2}$, the actual starting point $P_{F3}$ is determined by

$$P_{F3} = \arg\min_n \{\, D(n, n+1) > T_D \;\text{and}\; D(n, n+2) > T_D \;\text{and}\; D(n, n+3) > T_D \,\} + 1 \quad (5.16)$$

in which $P_{F2} \le n \le P_{B2}$.

    for (i = zcr_start / 80; i < (zcr_end / 80) - 1; i++) {
        for (k = 1; k <= 3; k++) {
            d = 0.0;
            for (j = 1; j <= 20; j++)   // 20 = cepstral order used here
                d = d + pow((CEPBuf[i][j] - CEPBuf[i + k][j]), 2.0);
            if (d > euclidean_threshold)
                Counter++;
            else {
                Counter = 0;
                break;
            }
        }
        if (Counter >= 3) {
            actual_start = (i - 2) * 80;
            break;
        }
    }

Searching backwards from frame $P_{B2}$ until frame $P_{F2}$, the actual ending point $P_{B3}$ is determined by

$$P_{B3} = \arg\max_n \{\, D(n, n-1) > T_D \;\text{and}\; D(n, n-2) > T_D \;\text{and}\; D(n, n-3) > T_D \,\} - 1 \quad (5.17)$$

in which $P_{B2} \ge n \ge P_{F2}$. The final results, i.e. the actual endpoints of the proposed algorithm, are the starting point $P_{F3}$ and the ending point $P_{B3}$.

    for (i = zcr_end / 80; i > (zcr_start / 80) + 1; i--) {
        for (k = 1; k <= 3; k++) {
            d = 0.0;
            for (j = 1; j <= 20; j++)
                d = d + pow((CEPBuf[i][j] - CEPBuf[i - k][j]), 2.0);
            if (d > euclidean_threshold)
                Counter++;
            else {
                Counter = 0;
                break;
            }
        }
        if (Counter >= 3) {
            actual_end = (i + 2) * 80;
            break;
        }
    }

From Figure 5.2(a) – (c), we can see the effectiveness of the 3-level endpoint detection in obtaining the actual boundaries of the start and end points of a speech sample.
After the endpoint detection, the cepstral coefficients within the detected boundaries are weighted and saved into a ".cep" file for the training and testing of the SOM.

Figure 5.2(a): The detected boundaries of sembilan04.wav using rms energy in Level 1 of the initial endpoint detection.

Figure 5.2(b): The detected boundaries of sembilan04.wav using the zero crossing rate in Level 2 of the initial endpoint detection.

Figure 5.2(c): The actual boundaries of sembilan04.wav using the Euclidean distance of the cepstrum in Level 3 of the actual endpoint detection.

5.3 Implementation of Self-Organizing Map

A Self-Organizing Map (SOM) with shortcut learning by Kohonen is used to train the system and transform the input LPC-derived cepstral coefficients into a binary matrix. The variable-length cepstral coefficient sequences are converted into fixed-length binary matrices according to the dimension of the SOM used. A two-layer SOM network, consisting of an input layer and an output layer, is proposed for our system as shown in Figure 5.3. The number of input nodes equals the cepstral order used in the cepstral files, and the feature vectors are fed into the input nodes for training and testing.

Figure 5.3: The architecture of the Self-Organizing Map (SOM).

The implementation of the SOM involves training the network with the training set and testing it with the testing dataset. The implementation is shown in the steps below; only the essential parts are listed.

(1) Implementation of SOM Training

Step 1: Get the configuration of the network parameters, including the maximum number of training epochs, MaxEpoch, and the dimension of the feature vectors, VectorDimension.

    VectorDimension = VECTORDIMENSION;
    MaxEpoch = MAX_EPOCH;

Step 2: Read the input data from the speech input patterns (in the ".cep" file) and save them into the array inputArray[]. InputVector stores the total number of cepstral coefficient vectors for training.

    i = 0;
    while (!feof(inputfile)) {
        for (j = 0; j < VectorDimension; j++)
            fscanf(inputfile, "%lf", &inputArray[i][j]);
        i++;                            // advance once per vector, not per coefficient
    }
    InputVector = i;

Step 3: The input arrays are then normalized by the function NormalizeInput() in order to speed up convergence and make the network training more efficient.

    for (i = 0; i < InputVector; i++) {
        total = 0.0;
        for (j = 0; j < VectorDimension; j++)
            total += inputArray[i][j] * inputArray[i][j];
        for (j = 0; j < VectorDimension; j++)
            inputArray[i][j] = inputArray[i][j] / sqrt(total);   // unit-length normalization
    }

Step 4: Initialize the weights to random values in the range 0.0 to 1.0 and save them into the array weights[] according to the chosen dimension of the SOM (X and Y).

    for (i = 0; i < X; i++)
        for (j = 0; j < Y; j++)
            for (k = 0; k < VectorDimension; k++)
                weights[i][j][k] = (double)rand() / RAND_MAX;

Step 5: Train the network until it reaches the maximum epoch, which is set to 5000000. At each step an input vector is chosen at random from the input arrays. GetWinnerNode() determines the winner node, i.e. the node whose weight vector has the smallest distance to the input vector, and the weight vectors of the nodes lying within a neighborhood set of the winner node are updated. The learning constant and the neighborhood set both decrease monotonically with time according to Kohonen's algorithm (Kohonen, 1995). In our research, we chose 0.25 and DimX*DimY-1 as the initial values of the learning constant and the neighborhood size respectively.
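In equation form, the update applied inside the training loop that follows is the standard Kohonen rule with an exponentially decaying neighborhood and learning rate (a reconstruction consistent with the code, not quoted from the thesis; $\tau$ denotes the time constant, and $\sigma_0$, $\eta_0$ the initial neighborhood size and learning rate):

$$w(t+1) = w(t) + \eta(t)\, h(t)\, \big[x(t) - w(t)\big], \qquad h(t) = \exp\!\left(-\frac{d}{\sigma(t)/2}\right) \;\text{for}\; d < \frac{\sigma(t)}{2},$$
$$\sigma(t) = \frac{\sigma_0}{2}\, e^{-t/\tau}, \qquad \eta(t) = \eta_0\, e^{-t/\tau},$$

where $d$ is the grid distance between a map node and the winner node, and nodes outside the neighborhood ($d \geq \sigma(t)/2$) are left unchanged.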
This process is repeated until the maximum epoch is reached.

    while (Epoch < MaxEpoch) {
        VectorChosen = (int)(rand() % (InputVector));
        GetWinnerNode();
        for (i = 0; i < X; i++) {
            for (j = 0; j < Y; j++) {
                distance = ((winnerX - (i + 1)) * (winnerX - (i + 1)))
                         + ((winnerY - (j + 1)) * (winnerY - (j + 1)));
                distance = sqrt(distance);
                if (distance < NeighbourhoodSize / 2) {
                    influence = exp(-(distance) / (NeighbourhoodSize / 2));
                    for (k = 0; k < VectorDimension; k++)
                        weights[i][j][k] += (LearningRate * influence *
                            (inputArray[VectorChosen][k] - weights[i][j][k]));
                }
            }
        }
        // decay the neighborhood size and the learning rate once per epoch
        if (NeighbourhoodSize >= 1.0)
            NeighbourhoodSize = (InitNeighbourhoodSize / 2) * exp(-(double)Epoch / time_constant);
        if (LearningRate >= 0.0001)
            LearningRate = InitLearningRate * exp(-(double)Epoch / time_constant);
        Epoch++;
    }

The implementation of the function GetWinnerNode() is as follows:

    minDistance = 999;
    for (i = 0; i < X; i++) {
        for (j = 0; j < Y; j++) {
            Kohonen[i][j] = GetDistance(i, j);
            if (Kohonen[i][j] < minDistance) {
                minDistance = Kohonen[i][j];
                winnerX = i + 1;
                winnerY = j + 1;
            }
        }
    }
    winnernode = (winnerX - 1) * Y + winnerY;   // row-major index in 1..X*Y
    TrackWinner[Epoch] = winnernode;

The implementation of the function GetDistance() is as follows:

    double GetDistance(int x, int y) {
        int i;
        double distance = 0.0;
        for (i = 0; i < VectorDimension; i++)
            distance += ((inputArray[VectorChosen][i] - weights[x][y][i])
                       * (inputArray[VectorChosen][i] - weights[x][y][i]));
        return sqrt(distance);
    }

Step 6: Finally, the weight vectors are saved into the weight file (".wgt"), which is then used for testing.

    for (i = 0; i < X; i++)
        for (j = 0; j < Y; j++)
            for (k = 0; k < VectorDimension; k++)
                fprintf(WeightFile, "%lf ", weights[i][j][k]);

(2) Implementation of SOM Testing

Step 1: The process of SOM testing is similar to SOM training, but the dataset used for testing includes both the training set and the testing set. First, all the feature vectors of a speech sound file (".cep") are read and saved to the array inputArray[].

    while (!feof(inputfile)) {
        for (j = 0; j < VectorDimension; j++)
            fscanf(inputfile, "%lf", &inputArray[i][j]);
        i++;
    }

Step 2: The weight vectors trained in the training process are read from the weight file (".wgt") and saved to the array weights[].

    for (i = 0; i < X; i++)
        for (j = 0; j < Y; j++)
            for (k = 0; k < VectorDimension; k++)
                fscanf(weightfile, "%lf", &weights[i][j][k]);

Step 3: Find the winner node by calculating the Euclidean distance among the nodes. The functions GetWinnerNode() and GetDistance() are as shown in the implementation of SOM training. The index of the winner node is saved to the array TrackWinner[]. This process is repeated for all the feature vectors of the speech sound file.

    winnernode = (winnerX - 1) * Y + winnerY;   // row-major index in 1..X*Y
    TrackWinner[Epoch] = winnernode;

Step 4: A binary matrix, Matrix[], with the same dimension as the SOM is created and used to accumulate the winner nodes of the speech sound file. All values of the binary matrix are set to "0" except the winner nodes, which are set to "1". The binary matrix is then saved into a matrix file (".mtx") according to its target value. The implementation of the function CreateMatrix() is as follows:

    for (i = 0; i < X*Y; i++)
        Matrix[i] = 0;
    for (i = 0; i < Epoch; i++)
        Matrix[TrackWinner[i] - 1] = 1;
    for (i = 0; i < X*Y; i++)
        fprintf(matrix, "%d\n", Matrix[i]);

Step 5: Steps 1 – 4 are repeated for all the cepstral files in both the training dataset and the testing dataset.
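As a small sketch of the hand-off between the two networks (assuming a 12 x 12 SOM; the one-value-per-line file layout follows CreateMatrix() above, but the function and variable names here are illustrative, not from the thesis code):

    #include <stdio.h>

    #define DIM 144   /* 12 x 12 SOM, so 144 binary inputs (assumed) */

    /* Read one ".mtx" binary matrix back as the MLP input vector.
       Returns 0 on success, -1 on failure. */
    int load_matrix(const char *path, double input[DIM])
    {
        FILE *f = fopen(path, "r");
        int i, v;
        if (f == NULL)
            return -1;
        for (i = 0; i < DIM; i++) {
            if (fscanf(f, "%d", &v) != 1) { fclose(f); return -1; }
            input[i] = (double)v;   /* "1" = winner node fired, "0" = silent */
        }
        fclose(f);
        return 0;
    }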
The binary matrix files are then used as the MLP input for speech classification.

5.4 Implementation of Multilayer Perceptron

A three-layer MLP with error-backpropagation learning is used to train the system and classify the Malay target sounds by syllable and by word. Two types of classification are used in the speech recognition system: syllable classification and word classification. Only word classification is applied to digit recognition, whereas both syllable and word classification are applied to word recognition, and a comparison is made in terms of recognition accuracy. The next sections discuss the network architecture for each classification.

5.4.1 MLP Architecture for Digit Recognition

The dataset for digit recognition consists of 10 Malay digits. The MLP has 10 output nodes, corresponding to the 10 Malay digit words (/kosong/, /satu/, /dua/, /tiga/, /empat/, /lima/, /enam/, /tujuh/, /lapan/, /sembilan/). The architecture of the MLP is shown in Figure 5.4. A decimal number system is used to set the target values of the digits, as shown in Table 5.1: the value "0.9" indicates presence and the value "0.1" indicates absence.

Figure 5.4: MLP with 10 output nodes, corresponding to the 10 Malay digit words.

5.4.2 MLP Architecture for Word Recognition

The dataset for word recognition consists of 30 selected Malay two-syllable words. To classify the target words, both syllable and word classification are applied. For syllable classification, the MLP has 15 output nodes, corresponding to the 15 Malay syllables (/ba/, /bu/, /bi/, /ka/, /ku/, /ki/, /ma/, /mu/, /mi/, /sa/, /su/, /si/, /ta/, /tu/, /ti/). The architecture of the MLP for syllable classification is shown in Figure 5.5. A decimal number system is used to set the target values of the syllables, as shown in Table 5.2; the value "0.9" indicates presence and the value "0.1" indicates absence. Owing to the character of the nonlinear sigmoid function used in the MLP, values of 0.9 and 0.1 are a better choice than 1 and 0 (Haykin, 1994).

Table 5.1: The setting of the target values for the MLP in digit recognition. Each digit's target vector has the value 0.9 (presence) at its own output node and 0.1 (absence) at all other nodes:

  Number  Word      Node set to 0.9
  1       Kosong    1
  2       Satu      2
  3       Dua       3
  4       Tiga      4
  5       Empat     5
  6       Lima      6
  7       Enam      7
  8       Tujuh     8
  9       Lapan     9
  10      Sembilan  10

Figure 5.5: MLP with 15 output nodes, corresponding to the 15 Malay syllables.

For word classification, the MLP has 30 output nodes, corresponding to the 30 selected Malay words listed in Table 5.3. The architecture of the MLP for word classification is shown in Figure 5.6.
A decimal number system is used to set the target values of the Malay words. The setting of the target values for word classification is shown in Table 5.3; the value "0.9" indicates presence and the value "0.1" indicates absence.

Figure 5.6: MLP with 30 output nodes, corresponding to the 30 Malay two-syllable words.

Table 5.2: The setting of the target values for the MLP (syllable classification). Each syllable's target vector has the value 0.9 at its own output node and 0.1 at all other nodes:

  Number  Syllable  Node set to 0.9
  1       Ba        1
  2       Bu        2
  3       Bi        3
  4       Ka        4
  5       Ku        5
  6       Ki        6
  7       Ma        7
  8       Mu        8
  9       Mi        9
  10      Sa        10
  11      Su        11
  12      Si        12
  13      Ta        13
  14      Tu        14
  15      Ti        15

Table 5.3: The setting of the target values for the MLP (word classification). Each word's target vector has the value 0.9 at its own output node and 0.1 at all other nodes:

  Number  Word   |  Number  Word   |  Number  Word
  1   Baki       |  11  Kubu       |  21  Susu
  2   Bata       |  12  Kita       |  22  Suka
  3   Buka       |  13  Masa       |  23  Sisi
  4   Buku       |  14  Mata       |  24  Situ
  5   Bumi       |  15  Mati       |  25  Tamu
  6   Bisu       |  16  Muka       |  26  Taba
  7   Bisa       |  17  Mutu       |  27  Tubu
  8   Kamu       |  18  Misi       |  28  Tubi
  9   Kaki       |  19  Sami       |  29  Tiba
  10  Kuku       |  20  Satu       |  30  Titi

(1) Implementation of MLP Training

Step 1: Initialize the weights and biases to small random values, and initialize the target values.

    for (time_state = 0; time_state < NUMTIMESTATES; time_state++)
        for (from_layer = 0; from_layer < NUMOFLAYERS - 1; from_layer++)
            for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer]; from_neuron++)
                for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer + 1]; to_neuron++)
                    synapses[time_state][from_layer][from_neuron][to_neuron] =
                        (((float)rand() / RAND_MAX) - 0.5) * INITWEIGHTRANGE;

    // Initialize the target values (0.9 for the pattern's own class, 0.1 otherwise)
    for (i = 0; i < NUMTRAININGSETS; i++)
        for (j = 0; j < NUMOUTPUTNEURONS; j++) {
            training_outs[i][j] = 0.1;
            if (i == j)
                training_outs[i][j] = 0.9;
        }

Step 2: Train the network until one of the stopping conditions is reached: the minimum error or the maximum number of epochs. First, ForwardPass() propagates the input signals from the input layer through the hidden layer to the output nodes. Then BackwardPass() back-propagates the error and updates the weights and biases. After training, the final weights and biases are saved into a ".wgt" file. The learning rate of the network is decreased by 5% whenever the difference between the previous and the current error value stays smaller than a constant c (set to 0.0001 during training) for 25 consecutive checks.

The implementation of the function ForwardPass() is as follows:

    for (i = 1; i < NUMOFLAYERS; i++) {
        for (j = 0; j < neurons_per_layer[i]; j++) {
            temp = 0;
            for (k = 0; k < neurons_per_layer[i - 1]; k++)
                temp += neurons[i - 1][k] * synapses[CURRENTWEIGHTS][i - 1][k][j];
            neurons[i][j] = Sigmoid(temp);
        }
    }

The implementation of the function BackwardPass() is as follows:

    // Calculate errors in output neurons
    for (i = 0; i < NUMOUTPUTNEURONS; i++)
        errors[NUMOFLAYERS - 1][i] = neurons[NUMOFLAYERS - 1][i]
            * (1 - neurons[NUMOFLAYERS - 1][i])
            * (trainset[i] - neurons[NUMOFLAYERS - 1][i]);

    // Calculate new weights going into the output layer, then propagate
    // the error back layer by layer
    for (i = (NUMOFLAYERS - 2); i >= 0; i--)
        for (j = 0; j < neurons_per_layer[i]; j++) {
            for (k = 0; k < neurons_per_layer[i + 1]; k++)
                synapses[NEWWEIGHTS][i][j][k] = synapses[CURRENTWEIGHTS][i][j][k]
                    + ((learningrate * errors[i + 1][k] * neurons[i][j]) * (1 + MOMENTUM));
            temp = 0;
            for (k = 0; k < neurons_per_layer[i + 1]; k++)
                temp += errors[i + 1][k] * synapses[CURRENTWEIGHTS][i][j][k];
            errors[i][j] = neurons[i][j] * (1 - neurons[i][j]) * temp;
        }

    // Copy all the new weights into the current set of weights for the next
    // set of forward and backward passes
    for (i = 0; i < (NUMOFLAYERS - 1); i++)
        for (j = 0; j < neurons_per_layer[i]; j++)
            for (k = 0; k < neurons_per_layer[i + 1]; k++)
                synapses[CURRENTWEIGHTS][i][j][k] = synapses[NEWWEIGHTS][i][j][k];

    // Decrease the learning rate by 5% after 25 consecutive small error changes
    if (error_distance < c)
        error_count++;
    else
        error_count = 0;
    if ((error_count % 25 == 0) && (error_count != 0) && (learningrate > 0.075))
        learningrate = learningrate * 0.95;

The implementation of the function SaveWeights() is as follows:

    for (from_layer = 0; from_layer < NUMOFLAYERS - 1; from_layer++) {
        for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer]; from_neuron++) {
            for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer + 1]; to_neuron++)
                fprintf(outfile, "%f ", synapses[1][from_layer][from_neuron][to_neuron]);
            fprintf(outfile, "\n");
        }
        fprintf(outfile, "\n");
    }

(2) Implementation of MLP Testing

Step 1: Read the weight values from the weight file, which is
saved in ".wgt" after the training of the network. The weights are then stored into the array synapses[].

    for (from_layer = 0; from_layer < NUMOFLAYERS - 1; from_layer++)
        for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer]; from_neuron++)
            for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer + 1]; to_neuron++)
                fscanf(weights_file, "%f",
                       &synapses[CURRENTWEIGHTS][from_layer][from_neuron][to_neuron]);

Step 2: Read the test data from the speech input patterns (in the ".mtx" file) and save them to the array testing_ins[]. There are a total of 40 patterns per word: the first 20 are used for training and the remaining 20 for testing.

    for (i = 0; i < NUMINPUTNEURONS; i++)
        fscanf(infile, "%f\n", &testing_ins[i]);

Step 3: Apply the function ForwardPass() (the same function used in training) to propagate the test input signals to the output nodes by calculating the output values of the output nodes.

Step 4: Find the output node with the highest activation. After all the training and test sets are tested, compare the output results with the target results to determine the recognition accuracy. The testing results are saved to the test report.

    highest_val = 0;
    for (i = 0; i < neurons_per_layer[NUMOFLAYERS - 1]; i++)
        if (layers[NUMOFLAYERS - 1][i] > highest_val) {
            highest_val = layers[NUMOFLAYERS - 1][i];
            highest_ind = i;
        }
    for (i = 0; i < neurons_per_layer[NUMOFLAYERS - 1]; i++)
        if (i == highest_ind)
            fprintf(result, "recognized syllable : %s %.4lf",
                    syllable[highest_ind], layers[NUMOFLAYERS - 1][i]);

    if (x == highest_ind) {             // the winning node matches the target class
        cout << " ** Correct Result !! " << endl;
        countx[x]++;
    } else {
        cout << " Incorrect Result !! " << endl;
        fprintf(result, " *** Incorrect :< \n\n");
    }

    accuracyx[x] = (float)countx[x] / (NUMTESTINGTOKENS);
    accuracy = (float)count / (NUMTESTINGSETS * NUMTESTINGTOKENS);
    fprintf(result, "\n\ncorrect/total : %d/%d\n", count, NUMTESTINGSETS * NUMTESTINGTOKENS);
    fprintf(result, "\nrecognition accuracy : %.4f \n\n", accuracy);
    for (x = 0; x < NUMTESTINGSETS; x++)
        fprintf(result, "%s : %d/%d\t%.4f \n",
                syllable[x], countx[x], NUMTESTINGTOKENS, accuracyx[x]);

5.5 Experiment Setup

The performance of the speech recognition system is evaluated on digit recognition and word recognition. Both recognition tasks are tested using the conventional model (single network) and the proposed model (SOM and MLP), and the system performance is then compared in terms of accuracy. The digit recognition system is tested to evaluate the performance of the neural network in speech recognition for a small vocabulary, while word recognition is tested to evaluate its performance on a larger vocabulary using different approaches, namely the syllable approach and the word approach.

Experiments are conducted to find the parameter values that give the optimal performance of the system. For the conventional system, the parameters to be determined are the cepstral order (CO), the hidden node number (HNN) and the learning rate (LR). For our proposed system, they are the cepstral order (CO), the dimension of the SOM (DSOM), the hidden node number (HNN) and the learning rate (LR). The values determined are used in the rest of the tests. The proposed speech recognition system, which uses the SOM and the MLP, is then compared with the conventional system, which uses the MLP only. Figures 5.7 and 5.8 show the system architectures of the conventional and the proposed systems respectively.
Here we briefly describe the implementation of both the conventional and the proposed system. As shown in Figure 5.7, the conventional system consists of two main components: speech processing and the MLP. Speech processing acts as the feature extractor and the MLP as the classifier. First, the speech waveform is fed into speech processing, where LPC extracts the speech features (cepstral coefficients). The cepstral coefficients are then used to train the MLP before it can be used for recognition. The output of the MLP (the recognition accuracy) is printed to a result text file.

Figure 5.7: System architecture for the conventional model (single network): input speech from the speech database passes through speech processing (endpoint detection and LPC) into a cepstral coefficient database, which feeds the MLP; the output is the recognition accuracy in a text file.

As shown in Figure 5.8, the proposed system consists of three main components: speech processing, the SOM and the MLP. Speech processing acts as the feature extractor, the SOM performs the dimensionality reduction, and the MLP acts as the classifier. As in the conventional system, the speech waveform is processed by LPC to generate cepstral coefficients. The cepstral coefficients are fed into the SOM for training and are then transformed into binary matrices. The binary matrices are then used to train the MLP before recognition. The output of the MLP (the recognition accuracy) is printed to a result text file.

Figure 5.8: System architecture for the proposed model (hybrid network): input speech passes through speech processing (endpoint detection and LPC) into a cepstral coefficient database, through the SOM into a binary matrix database, and finally into the MLP; the output is the recognition accuracy in a text file.

Figure 5.9 shows the training and testing of the digit recognition system level by level, and Figure 5.10 shows the same for the word recognition system. The system setup for training and testing is shown level by level in the Appendices.

Figure 5.9: Training and testing of the digit recognition system: Digit Recognition (DR) on the Conventional System (CS), with parameters CO, HNN and LR, and on the Proposed System (PS), with parameters CO, DSOM, HNN and LR.

Figure 5.10: Training and testing of the word recognition system: Word Recognition (WR) on the Conventional System (CS) and the Proposed System (PS), each with syllable and word classification; the CS parameters are CO, HNN and LR, and the PS parameters are CO, DSOM, HNN and LR.

CHAPTER 6

RESULTS AND DISCUSSION

6.1 Introduction

In this chapter, the performance of both the conventional and the proposed system is evaluated. The results of the tests are presented and discussed in table and graph form in stages, as shown in Figure 6.1.

Figure 6.1: Presentation and discussion of the test results in stages: digit recognition (conventional: Experiments 1–3; proposed: Experiments 1–4) and word recognition (conventional and proposed, each with syllable and word classification; conventional: Experiments 1–3; proposed: Experiments 1–4), followed by comparison and discussion.
6.2 Testing of Digit Recognition

The results of the digit recognition tests are presented for different parameter values of the conventional system (DRCS – Digit Recognition Conventional System) and the proposed system (DRPS – Digit Recognition Proposed System). The best results on the test sets are taken as the optimal values of the parameters: the cepstral order (CO) for the LPC analysis, the dimension of the SOM (DSOM), the hidden node number (HNN) of the MLP, and the learning rate (LR) for MLP training.

6.2.1 Testing Results for Conventional System

6.2.1.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.1 and analyzed in graph form in Figure 6.2.

Table 6.1: Recognition accuracy for different CO for Experiment 1 (DRCS)

  CO    Train (%)   Test (%)
  12    93.00       85.00
  16    94.83       87.67
  20    93.33       89.00
  24    95.00       89.33

From Table 6.1 and Figure 6.2, we found that a higher cepstral order gives higher recognition accuracy. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

Figure 6.2: Recognition accuracy for different CO for Experiment 1 (DRCS).

6.2.1.2 Experiment 2: Optimal Hidden Node Number (HNN)

The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.2 and analyzed in graph form in Figure 6.3. The chosen HNN values are based on the Geometric Pyramid Rule (GPR) discussed in Section 4.4.4 of Chapter 4; the experiment tests three values, ¾GPR, GPR and 1¼GPR, where

$$HNN = \sqrt{X \times Y}$$

with $X$ the number of input nodes and $Y$ the number of output nodes.

Table 6.2: Recognition accuracy for different HNN for Experiment 2 (DRCS)

  HNN   Train (%)   Test (%)
  98    91.50       83.50
  130   95.00       89.33
  163   93.00       90.00

Figure 6.3: Recognition accuracy for different HNN for Experiment 2 (DRCS).

From Table 6.2 and Figure 6.3, we found that the HNN determined using GPR gives the higher recognition accuracy on the training set, while the higher HNN gives the better result on the testing set. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

6.2.1.3 Experiment 3: Optimal Learning Rate (LR)

The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.3 and analyzed in graph form in Figure 6.4.

Table 6.3: Recognition accuracy for different LR for Experiment 3 (DRCS)

  LR    Train (%)   Test (%)
  0.1   96.83       91.67
  0.2   95.00       89.33
  0.3   93.33       89.00
  0.4   95.00       85.50

Figure 6.4: Recognition accuracy for different LR for Experiment 3 (DRCS).

From Table 6.3 and Figure 6.4, we found that a smaller learning rate gives the better result for both the training and the testing set. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.
6.2.2 Testing Results for Proposed System

6.2.2.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.4 and analyzed in graph form in Figure 6.5.

Table 6.4: Recognition accuracy for different CO for Experiment 1 (DRPS)

  CO    Train (%)   Test (%)
  12    100.00      91.50
  16    100.00      92.00
  20    100.00      98.83
  24    100.00      99.83

From Table 6.4 and Figure 6.5, we found that the system obtains 100% accuracy on the training set for all the chosen cepstral orders from 12 to 24. On the testing set, a higher cepstral order gives the better results, reaching 99.83% accuracy. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

Figure 6.5: Recognition accuracy for different CO for Experiment 1 (DRPS).

6.2.2.2 Experiment 2: Optimal Dimension of SOM (DSOM)

The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.5 and analyzed in graph form in Figure 6.6.

Table 6.5: Recognition accuracy for different DSOM for Experiment 2 (DRPS)

  DSOM     Train (%)   Test (%)
  10 x 10  100.00      99.50
  12 x 12  100.00      99.83
  15 x 15  100.00      97.33
  20 x 20  100.00      98.00

Figure 6.6: Recognition accuracy for different DSOM for Experiment 2 (DRPS).

From Table 6.5 and Figure 6.6, we found that the system obtains 100% accuracy on the training set for all the chosen DSOM values, while a DSOM of 12 x 12 gives the highest accuracy on the testing set. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

6.2.2.3 Experiment 3: Optimal Hidden Node Number (HNN)

The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.6 and analyzed in graph form in Figure 6.7.

Table 6.6: Recognition accuracy for different HNN for Experiment 3 (DRPS)

  HNN   Train (%)   Test (%)
  28    100.00      97.33
  38    100.00      99.83
  48    100.00      99.50

Figure 6.7: Recognition accuracy for different HNN for Experiment 3 (DRPS).

From Table 6.6 and Figure 6.7, we found that the system obtains 100% accuracy on the training set for all the chosen HNN values, and that the HNN determined using GPR (38) gives the highest recognition accuracy on the testing set. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

6.2.2.4 Experiment 4: Optimal Learning Rate (LR)

The recognition accuracies for the training and testing sets in Experiment 4 are presented in Table 6.7 and analyzed in graph form in Figure 6.8.
Table 6.7: Recognition accuracy for different LR for Experiment 4 (DRPS)

  LR    Train (%)   Test (%)
  0.1   100.00      99.83
  0.2   100.00      99.83
  0.3   100.00      96.67
  0.4   100.00      97.50

Figure 6.8: Recognition accuracy for different LR for Experiment 4 (DRPS).

From Table 6.7 and Figure 6.8, we found that the system obtains 100% accuracy on the training set for all the chosen LR values, while the smaller learning rates give the better results on the testing set. The drop in accuracy on the testing data is expected, since the testing data differ from the training data and were not used to train the system.

6.2.3 Discussion for Digit Recognition Testing

6.2.3.1 Comparison of Performance for DRCS and DRPS according to CO

The comparison of performance for DRCS and DRPS according to CO is presented in Table 6.8 and analyzed in Figure 6.9. The comparison is based on the testing results.

Table 6.8: Comparison of performance for DRCS and DRPS according to CO

  CO    DRCS (%)   DRPS (%)   DRPS - DRCS
  12    85.00      91.50      +6.50
  16    87.67      92.00      +4.33
  20    89.00      98.83      +9.83
  24    89.33      99.83      +10.50

From Table 6.8 and Figure 6.9, several observations can be made:

a) DRCS and DRPS rank the cepstral orders in the same order from lowest to highest accuracy: 12, 16, 20 and 24.
b) For both DRCS and DRPS, the highest recognition accuracy is achieved at a cepstral order of 24.
c) DRPS achieves higher accuracy than DRCS on every test set.

Figure 6.9: Comparison of performance for DRCS and DRPS according to CO.

DRPS thus outperforms DRCS on every test set for every cepstral order. The comparison also shows that the system recognizes better with a higher cepstral order, because a higher cepstral order captures more detailed acoustic information. From these tests, we conclude that a cepstral order of 24 outperforms the other cepstral orders (12, 16 and 20).

6.2.3.2 Comparison of Performance for DRCS and DRPS according to HNN

The comparison of performance for DRCS and DRPS according to HNN is presented in Table 6.9 and analyzed in Figure 6.10. The comparison is based on the testing results.

Table 6.9: Comparison of performance for DRCS and DRPS according to HNN

  HNN      DRCS (%)   DRPS (%)   DRPS - DRCS
  ¾ GPR    83.50      97.33      +13.83
  GPR      89.33      99.83      +10.50
  1¼ GPR   90.00      99.50      +9.50

From Table 6.9 and Figure 6.10, several observations can be made:

a) For DRCS, the order of hidden node number from lowest to highest accuracy is ¾GPR, GPR and 1¼GPR.
b) For DRPS, the order of hidden node number from lowest to highest accuracy is ¾GPR, 1¼GPR and GPR.
c) The highest accuracy for DRPS is obtained with the HNN given by GPR.
d) DRPS outperforms DRCS on every test set.

Figure 6.10: Comparison of performance for DRCS and DRPS according to HNN.

DRPS thus outperforms DRCS on every test set for every hidden node number. From the DRCS testing results alone, we cannot conclude that the hidden node number given by GPR achieves the best performance; however, a hidden node number at or above the GPR value gives good accuracy (90% or above) and can be considered an optimal hidden node number.
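For concreteness, the three HNN settings compared above follow directly from the GPR formula of Section 4.4.4 applied to the proposed digit recognizer. Taking $X = 144$ SOM outputs and $Y = 10$ digit classes (an inference from the architecture reported in Table 6.11, not an explicit statement in the thesis):

$$GPR = \sqrt{144 \times 10} = \sqrt{1440} \approx 38, \qquad \tfrac{3}{4}\,GPR \approx 28, \qquad 1\tfrac{1}{4}\,GPR \approx 48,$$

which matches the hidden node numbers 28, 38 and 48 tested for DRPS in Table 6.6.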
6.2.3.3 Comparison of Performance for DRCS and DRPS according to LR

The comparison of performance for DRCS and DRPS according to LR is presented in Table 6.10 and analyzed in Figure 6.11. The comparison is based on the testing results, with the HNN set to the value that gave the highest accuracy in the previous experiment.

Table 6.10: Comparison of performance for DRCS and DRPS according to LR

  LR    DRCS (%)   DRPS (%)   DRPS - DRCS
  0.1   91.67      99.83      +8.16
  0.2   89.33      99.83      +10.50
  0.3   89.00      96.67      +7.67
  0.4   85.50      97.50      +12.00

From Table 6.10 and Figure 6.11, several observations can be made:

a) For DRCS, the order of learning rate from lowest to highest accuracy is 0.4, 0.3, 0.2 and 0.1.
b) For DRPS, the order of learning rate from lowest to highest accuracy is 0.3, 0.4, 0.2 and 0.1.
c) Both DRCS and DRPS achieve their highest accuracy with a learning rate of 0.1.
d) DRPS achieves higher accuracy than DRCS on every test set.

Figure 6.11: Comparison of performance for DRCS and DRPS according to LR.

DRPS thus outperforms DRCS on every test set for every learning rate. The MLP learns and generalizes better with a smaller learning rate, although a smaller learning rate may slow the convergence during learning because the weight modifications are smaller. From the testing results, a learning rate of 0.1 outperforms the other learning rates (0.2 – 0.4) and can be considered the optimal learning rate.

6.2.3.4 Discussion on Performance of DRPS according to DSOM

From Table 6.5 and Figure 6.6, several observations can be made:

a) The order of SOM dimension from lowest to highest accuracy is 15 x 15, 20 x 20, 10 x 10 and 12 x 12.
b) DRPS achieves almost 100% accuracy with a dimension of 12 x 12.

A SOM dimension in the range of 10 x 10 to 12 x 12 is thus sufficient for the acoustic mapping in digit recognition, where the vocabulary size is 10. The smaller the dimension used, the faster the training process; a dimension of 12 x 12 can therefore be considered the optimal SOM dimension.

6.2.3.5 Summary for Digit Recognition Testing

For digit recognition, the proposed system outperforms the conventional system on every test set, with an acceptable result and a clear improvement in recognition accuracy. The optimal parameters and the architecture of our proposed system, determined from the test results, are shown in Table 6.11.

The recognition accuracy may be higher in digit recognition because of the small number of target words to be recognized: there are only ten digits, and the features of the digits differ clearly from one another. Digit recognition may therefore achieve better performance than word recognition, which deals with a larger vocabulary and more similar target words.

Table 6.11: The optimal parameters and the architecture for DRPS

  Component          Parameter                                 Value
  Speech Processing  Analysis frame length / shifting length   240 / 80 (samples)
                     Cepstral order                            24
  SOM                Learning rate                             0.25
                     Dimension                                 12 x 12
  MLP                Input nodes                               144
                     Hidden nodes                              38
                     Output nodes                              10
                     Learning rate                             0.1 – 0.2
                     Momentum                                  0.8
                     Max epoch / error function                1000 / 0.01

6.3 Testing of Word Recognition

The results of the word recognition tests are presented for different parameter values of the conventional system and the proposed system.
6.3 Testing of Word Recognition

The results of the word recognition tests are presented according to the different parameter values used in the conventional system and the proposed system. The results are also presented according to the type of classification used, namely syllable classification and word classification.

6.3.1 Testing Results for Conventional System (Syllable Classification)

6.3.1.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracy for the training and testing sets in Experiment 1 is presented in Table 6.12 and analyzed in graph form in Figure 6.12.

Table 6.12: Recognition accuracy for different CO for Experiment 1 (WRCS(S))

  CO    Recognition accuracy (%)
        Train    Test
  12    90.67    76.83
  16    91.33    81.67
  20    94.00    78.00
  24    93.33    84.83

From Table 6.12 and Figure 6.12, we found that cepstral orders of 20 and 24 give the highest recognition accuracy in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

Figure 6.12: Recognition accuracy for different CO for Experiment 1 (WRCS(S))

6.3.1.2 Experiment 2: Optimal Hidden Node Number (HNN)

The recognition accuracy for the training and testing sets in Experiment 2 is presented in Table 6.13 and analyzed in graph form in Figure 6.13.

Table 6.13: Recognition accuracy for different HNN for Experiment 2 (WRCS(S))

  HNN    Recognition accuracy (%)
         Train    Test
  100    90.50    83.50
  134    93.33    84.83
  168    92.00    87.33

Figure 6.13: Recognition accuracy for different HNN for Experiment 2 (WRCS(S))

From Table 6.13 and Figure 6.13, we found that HNNs of 134 (GPR) and 168 (1¼ GPR) give the highest recognition accuracy in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.1.3 Experiment 3: Optimal Learning Rate (LR)

The recognition accuracy for the training and testing sets in Experiment 3 is presented in Table 6.14 and analyzed in graph form in Figure 6.14.

Table 6.14: Recognition accuracy for different LR for Experiment 3 (WRCS(S))

  LR     Recognition accuracy (%)
         Train    Test
  0.1    93.00    86.00
  0.2    93.33    84.83
  0.3    95.50    80.33
  0.4    95.00    78.67

Figure 6.14: Recognition accuracy for different LR for Experiment 3 (WRCS(S))

From Table 6.14 and Figure 6.14, we found that an LR of 0.3 gives the highest accuracy in the training set, whereas smaller LR values give better results in the testing set. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.2 Testing Results for Conventional System (Word Classification)

6.3.2.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracy for the training and testing sets in Experiment 1 is presented in Table 6.15 and analyzed in graph form in Figure 6.15.
Table 6.15: Recognition accuracy for different CO for Experiment 1 (WRCS(W))

  CO    Recognition accuracy (%)
        Train    Test
  12    87.00    71.33
  16    90.67    77.33
  20    90.00    75.67
  24    88.33    79.00

Figure 6.15: Recognition accuracy for different CO for Experiment 1 (WRCS(W))

From Table 6.15 and Figure 6.15, we found that cepstral orders of 16 and 24 give the highest recognition accuracy in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.2.2 Experiment 2: Optimal Hidden Node Number (HNN)

The recognition accuracy for the training and testing sets in Experiment 2 is presented in Table 6.16 and analyzed in graph form in Figure 6.16.

Table 6.16: Recognition accuracy for different HNN for Experiment 2 (WRCS(W))

  HNN    Recognition accuracy (%)
         Train    Test
  142    89.00    77.67
  190    88.33    79.00
  238    91.33    77.00

From Table 6.16 and Figure 6.16, we found that HNNs of 238 (1¼ GPR) and 190 (GPR) give the highest recognition accuracy in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

Figure 6.16: Recognition accuracy for different HNN for Experiment 2 (WRCS(W))

6.3.2.3 Experiment 3: Optimal Learning Rate (LR)

The recognition accuracy for the training and testing sets in Experiment 3 is presented in Table 6.17 and analyzed in graph form in Figure 6.17.

Table 6.17: Recognition accuracy for different LR for Experiment 3 (WRCS(W))

  LR     Recognition accuracy (%)
         Train    Test
  0.1    91.33    79.83
  0.2    88.33    79.00
  0.3    90.67    76.67
  0.4    88.50    77.33

Figure 6.17: Recognition accuracy for different LR for Experiment 3 (WRCS(W))

From Table 6.17 and Figure 6.17, we found that the results obtained are not fully consistent. However, a smaller LR gives higher accuracy in both sets. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.3 Testing Results for Proposed System (Syllable Classification)

6.3.3.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracy for the training and testing sets in Experiment 1 is presented in Table 6.18 and analyzed in graph form in Figure 6.18.

Table 6.18: Recognition accuracy for different CO for Experiment 1 (WRPS(S))

  CO    Recognition accuracy (%)
        Train    Test
  12    97.33    88.00
  16    97.50    90.67
  20    99.67    94.67
  24    99.00    95.83

Figure 6.18: Recognition accuracy for different CO for Experiment 1 (WRPS(S))

From Table 6.18 and Figure 6.18, we found that a higher cepstral order, especially 24, gives higher accuracy in both the training and testing sets. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.
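The cepstral order experiments above rest on the LPC-derived cepstral coefficients used as the front end (the LPC-derived cepstral scheme is discussed further in Chapter 7). As a rough sketch of how each 240-sample frame (15 ms at 16 kHz, shifted by 80 samples) could be turned into CO coefficients, the following assumes Hamming windowing and autocorrelation LPC analysis; the exact pre-processing in the thesis may differ.

```python
import numpy as np

def lpc_cepstrum(frame, order):
    """LPC-derived cepstral coefficients for one speech frame (a sketch):
    Hamming window, autocorrelation LPC via Levinson-Durbin, then the
    standard LPC-to-cepstrum recursion."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    a = np.zeros(order + 1)          # a[1..order]: predictor coefficients
    e = r[0]
    for i in range(1, order + 1):    # Levinson-Durbin recursion
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, e = new_a, (1.0 - k * k) * e
    # c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]

# 16 kHz speech, 240-sample frames shifted by 80 samples, cepstral order 24
fs, frame_len, shift, co = 16000, 240, 80, 24
signal = np.random.randn(fs)         # stand-in for a recorded utterance
frames = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, shift)]
features = np.array([lpc_cepstrum(f, co) for f in frames])
print(features.shape)                # (number of frames, 24)
```

In the conventional system these per-frame vectors are concatenated directly into the MLP input (e.g. 70 frames x 24 coefficients = 1680 input nodes for DRCS, or 50 x 24 = 1200 for word recognition, per the appendices), whereas the proposed system first maps them through the SOM.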
6.3.3.2 Experiment 2: Optimal Dimension of SOM (DSOM)

The recognition accuracy for the training and testing sets in Experiment 2 is presented in Table 6.19 and analyzed in graph form in Figure 6.19.

Table 6.19: Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))

  DSOM       Recognition accuracy (%)
             Train    Test
  10 x 10    97.33    91.33
  12 x 12    99.00    95.83
  15 x 15    98.50    96.67
  20 x 20    98.67    95.33

Figure 6.19: Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))

From Table 6.19 and Figure 6.19, we found that DSOMs of 12 x 12 and 15 x 15 give the best results in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.3.3 Experiment 3: Optimal Hidden Node Number (HNN)

The recognition accuracy for the training and testing sets in Experiment 3 is presented in Table 6.20 and analyzed in graph form in Figure 6.20.

Table 6.20: Recognition accuracy for different HNN for Experiment 3 (WRPS(S))

  HNN    Recognition accuracy (%)
         Train    Test
  44     98.83    94.33
  58     98.50    96.67
  72     98.33    96.00

From Table 6.20 and Figure 6.20, we found that the results are almost consistent in the training set, while an HNN of 58 (GPR) gives the best result in the testing set. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

Figure 6.20: Recognition accuracy for different HNN for Experiment 3 (WRPS(S))

6.3.3.4 Experiment 4: Optimal Learning Rate (LR)

The recognition accuracy for the training and testing sets in Experiment 4 is presented in Table 6.21 and analyzed in graph form in Figure 6.21.

Table 6.21: Recognition accuracy for different LR for Experiment 4 (WRPS(S))

  LR     Recognition accuracy (%)
         Train    Test
  0.1    98.00    96.67
  0.2    98.50    96.67
  0.3    97.33    96.67
  0.4    98.00    95.33

Figure 6.21: Recognition accuracy for different LR for Experiment 4 (WRPS(S))

From Table 6.21 and Figure 6.21, we found that the results are almost consistent for both the training and testing sets, except for the LR of 0.4 in the testing set. An LR of 0.1 - 0.2 gives good results in both sets. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.4 Testing Results for Proposed System (Word Classification)

6.3.4.1 Experiment 1: Optimal Cepstral Order (CO)

The recognition accuracy for the training and testing sets in Experiment 1 is presented in Table 6.22 and analyzed in graph form in Figure 6.22.
Table 6.22: Recognition accuracy for different CO for Experiment 1 (WRPS(W))

  CO    Recognition accuracy (%)
        Train    Test
  12    95.33    83.00
  16    96.50    84.83
  20    96.00    90.33
  24    97.33    91.00

Figure 6.22: Recognition accuracy for different CO for Experiment 1 (WRPS(W))

From Table 6.22 and Figure 6.22, we found that a higher cepstral order, especially 24, gives higher accuracy in both the training and testing sets. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.4.2 Experiment 2: Optimal Dimension of SOM (DSOM)

The recognition accuracy for the training and testing sets in Experiment 2 is presented in Table 6.23 and analyzed in graph form in Figure 6.23.

Table 6.23: Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))

  DSOM       Recognition accuracy (%)
             Train    Test
  10 x 10    97.33    87.83
  12 x 12    97.33    91.00
  15 x 15    98.50    90.33
  20 x 20    98.00    91.33

Figure 6.23: Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))

From Table 6.23 and Figure 6.23, we found that the results obtained are not consistent in either the training or the testing set. However, DSOMs of 15 x 15 and 20 x 20 give the best results in the training and testing sets respectively. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

6.3.4.3 Experiment 3: Optimal Hidden Node Number (HNN)

The recognition accuracy for the training and testing sets in Experiment 3 is presented in Table 6.24 and analyzed in graph form in Figure 6.24.

Table 6.24: Recognition accuracy for different HNN for Experiment 3 (WRPS(W))

  HNN    Recognition accuracy (%)
         Train    Test
  82     97.33    88.83
  110    98.00    91.33
  138    96.50    91.00

From Table 6.24 and Figure 6.24, we found that an HNN of 110 (GPR) achieves the highest accuracy in both the training and testing sets. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.

Figure 6.24: Recognition accuracy for different HNN for Experiment 3 (WRPS(W))

6.3.4.4 Experiment 4: Optimal Learning Rate (LR)

The recognition accuracy for the training and testing sets in Experiment 4 is presented in Table 6.25 and analyzed in graph form in Figure 6.25.

Table 6.25: Recognition accuracy for different LR for Experiment 4 (WRPS(W))

  LR     Recognition accuracy (%)
         Train    Test
  0.1    98.00    91.33
  0.2    98.00    91.33
  0.3    97.00    88.67
  0.4    98.50    88.00

Figure 6.25: Recognition accuracy for different LR for Experiment 4 (WRPS(W))

From Table 6.25 and Figure 6.25, we found that the results in the training set are almost consistent, while a smaller LR gives better results in the testing set. The drop in accuracy for the testing data is expected, as the testing data differs from the training data and is not used to train the system.
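Each LR experiment above varies a single constant inside an otherwise fixed backpropagation loop. The following minimal sketch (our illustration, not the thesis code) shows where the tested parameters enter, assuming batch gradient descent on a one-hidden-layer sigmoid network with mean-squared error; the momentum of 0.8 and the stopping rule (RMS error 0.01 or 1000 epochs) follow Tables 6.11 and 6.34, and biases are omitted for brevity.

```python
import numpy as np

def train_mlp(X, T, n_hidden, lr, momentum=0.8,
              max_epoch=1000, target_rms=0.01, seed=0):
    """Backpropagation with momentum for a one-hidden-layer sigmoid MLP.
    X: (samples, inputs) flattened binary matrices or cepstral vectors.
    T: (samples, outputs) one-of-N target vectors."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, T.shape[1]))
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    for epoch in range(1, max_epoch + 1):
        H = sig(X @ W1)                    # hidden activations
        Y = sig(H @ W2)                    # output activations
        E = T - Y
        rms = np.sqrt(np.mean(E ** 2))
        if rms <= target_rms:              # termination EF (rms) = 0.01
            break
        d_out = E * Y * (1.0 - Y)          # sigmoid-derivative deltas
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        dW2 = lr * H.T @ d_out + momentum * dW2   # learning-rate step
        dW1 = lr * X.T @ d_hid + momentum * dW1   # plus momentum term
        W2 += dW2
        W1 += dW1
    return W1, W2, epoch, rms

# e.g. DRPS digits: X of shape (samples, 144), T of shape (samples, 10),
# n_hidden = 38 and lr = 0.1, as in Table 6.11.
```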
6.3.5 Discussion for Word Recognition Testing

6.3.5.1 Comparison of Performance for WRCS and WRPS according to CO

The comparison of performance for WRCS and WRPS according to CO is presented in Tables 6.26 and 6.27 and analyzed in Figures 6.26 and 6.27. The comparison is based on the testing results.

Table 6.26: Comparison of performance for WRCS(S) and WRPS(S) according to CO.

  CO    Recognition accuracy (%)
        WRCS(S)    WRPS(S)    WRPS(S) - WRCS(S)
  12    76.83      88.00      +11.17
  16    81.67      90.67      +9.00
  20    78.00      94.67      +16.67
  24    84.83      95.83      +11.00

Figure 6.26: Comparison of performance for WRCS(S) and WRPS(S) according to CO.

Table 6.27: Comparison of performance for WRCS(W) and WRPS(W) according to CO.

  CO    Recognition accuracy (%)
        WRCS(W)    WRPS(W)    WRPS(W) - WRCS(W)
  12    71.33      83.00      +11.67
  16    77.33      84.83      +7.50
  20    75.67      90.33      +14.66
  24    79.00      91.00      +12.00

Figure 6.27: Comparison of performance for WRCS(W) and WRPS(W) according to CO.

From Table 6.26, Table 6.27, Figure 6.26 and Figure 6.27, several observations can be made:
a) In WRCS(S) and WRCS(W), the order of cepstral orders from lowest to highest accuracy is 12, 20, 16 and 24.
b) In WRPS(S) and WRPS(W), the order of cepstral orders from lowest to highest accuracy is 12, 16, 20 and 24.
c) The highest recognition accuracy for both WRCS and WRPS is achieved at a cepstral order of 24.
d) WRPS achieves higher accuracy than WRCS in every test set, for both syllable and word classification.

It is shown that WRPS outperforms WRCS in every test set according to cepstral order. We can see that a higher cepstral order gives better performance. However, a higher cepstral order produces a longer feature vector, which may lengthen the training time of the network. Performance-wise, a cepstral order of 24 outperforms the other orders (12, 16 and 20) and can be considered the optimal cepstral order.

6.3.5.2 Comparison of Performance for WRCS and WRPS according to HNN

The comparison of performance for WRCS and WRPS according to HNN is presented in Tables 6.28 and 6.29 and analyzed in Figures 6.28 and 6.29. The comparison is based on the testing results.

Table 6.28: Comparison of performance for WRCS(S) and WRPS(S) according to HNN.

  HNN       Recognition accuracy (%)
            WRCS(S)    WRPS(S)    WRPS(S) - WRCS(S)
  ¾ GPR     83.50      94.33      +10.83
  GPR       84.83      96.67      +11.84
  1¼ GPR    87.33      96.00      +8.67

Figure 6.28: Comparison of performance for WRCS(S) and WRPS(S) according to HNN.

Table 6.29: Comparison of performance for WRCS(W) and WRPS(W) according to HNN.

  HNN       Recognition accuracy (%)
            WRCS(W)    WRPS(W)    WRPS(W) - WRCS(W)
  ¾ GPR     77.67      88.83      +11.16
  GPR       79.00      91.33      +12.33
  1¼ GPR    77.00      91.00      +14.00

Figure 6.29: Comparison of performance for WRCS(W) and WRPS(W) according to HNN.

From Table 6.28, Table 6.29, Figure 6.28 and Figure 6.29, several observations can be made:
a) In WRCS(S), the order of hidden node numbers from lowest to highest accuracy is ¾ GPR, GPR and 1¼ GPR.
b) In WRCS(W), the order of hidden node numbers from lowest to highest accuracy is 1¼ GPR, ¾ GPR and GPR.
c) In WRPS(S) and WRPS(W), the order of hidden node numbers from lowest to highest accuracy is ¾ GPR, 1¼ GPR and GPR.
d) WRPS achieves higher accuracy than WRCS in every test set, for both syllable and word classification.

It is shown that WRPS outperforms WRCS in every test set according to hidden node number. From the WRPS testing results, we can see that the hidden node number given by GPR achieves the best performance. Thus, the GPR-based hidden node number can be considered the optimal hidden node number.

6.3.5.3 Comparison of Performance for WRCS and WRPS according to LR

The comparison of performance for WRCS and WRPS according to LR is presented in Tables 6.30 and 6.31 and analyzed in Figures 6.30 and 6.31. The comparison is based on the testing results.

Table 6.30: Comparison of performance for WRCS(S) and WRPS(S) according to LR.

  LR     Recognition accuracy (%)
         WRCS(S)    WRPS(S)    WRPS(S) - WRCS(S)
  0.1    86.00      96.67      +10.67
  0.2    87.33      96.67      +9.34
  0.3    80.33      96.67      +16.34
  0.4    78.67      95.33      +16.66

Figure 6.30: Comparison of performance for WRCS(S) and WRPS(S) according to LR.

Table 6.31: Comparison of performance for WRCS(W) and WRPS(W) according to LR.

  LR     Recognition accuracy (%)
         WRCS(W)    WRPS(W)    WRPS(W) - WRCS(W)
  0.1    79.83      91.33      +11.50
  0.2    79.00      91.33      +12.33
  0.3    76.67      88.67      +12.00
  0.4    77.33      88.00      +10.67

Figure 6.31: Comparison of performance for WRCS(W) and WRPS(W) according to LR.

From Tables 6.30 - 6.31 and Figures 6.30 - 6.31, several observations can be made:
a) In WRCS(W), the order of learning rates from lowest to highest accuracy is 0.3, 0.4, 0.2 and 0.1.
b) In WRCS(S), the order of learning rates from lowest to highest accuracy is 0.4, 0.3, 0.1 and 0.2.
c) In WRPS(S) and WRPS(W), the order of learning rates from lowest to highest accuracy is 0.4, 0.3, 0.2 and 0.1.
d) The accuracies remain constant from 0.1 to 0.3 in the WRPS(S) tests.
e) WRPS achieves higher accuracy than WRCS in every test set.

It is shown that WRPS outperforms WRCS in every test set according to learning rate. The MLP generalizes better with a smaller learning rate, from 0.1 to 0.3 for WRPS(S). Thus, a learning rate in the range 0.1 - 0.3 can be considered the optimal learning rate.

6.3.5.4 Comparison of Performance of WRPS according to DSOM

The comparison of performance for WRPS according to DSOM is presented in Table 6.32 and analyzed in Figure 6.32. The comparison is based on the testing results.

Table 6.32: Comparison of performance for WRPS according to DSOM.

  DSOM       Recognition accuracy (%)
             WRPS(S)    WRPS(W)    WRPS(S) - WRPS(W)
  10 x 10    91.33      87.83      +3.50
  12 x 12    95.83      91.00      +4.83
  15 x 15    96.67      90.33      +6.34
  20 x 20    95.33      91.33      +4.00

From Table 6.32 and Figure 6.32, several observations can be made:
a) In WRPS(S), the order of SOM dimensions from lowest to highest accuracy is 10 x 10, 20 x 20, 12 x 12 and 15 x 15.
b) In WRPS(W), the order of SOM dimensions from lowest to highest accuracy is 10 x 10, 15 x 15, 12 x 12 and 20 x 20.

Figure 6.32: Comparison of performance for WRPS according to DSOM

From the testing results, it is shown that dimensions of 15 x 15 and 20 x 20 are the most significant for WRPS(S) and WRPS(W) respectively.
It is reasonable that WRPS(W) needs more SOM nodes to guarantee an adequate feature map that can store all of the acoustic information of the speech. Besides, WRPS(W) has the limitation that it cannot distinguish words that share similar speech content but differ in ordering (e.g. the words /buku/ and /kubu/). Because a single matrix is used to accumulate the mapping results, the temporal information of the input acoustic vector sequence is lost and only the information about the acoustic content is retained. This may result in confusion among words having similar acoustic content but different phoneme ordering. An example is shown in Figures 6.33(a) and (b) for the two Malay words "buku" and "kubu". In the figures, a filled cell denotes the value 1 and an empty cell denotes the value 0. It can be seen that the two maps are very similar, except that the sequences of phonemes run in opposite directions.

Figure 6.33(a): Matrix mapping of the word "buku", where the arrows show the direction of the sequence of phonemes.

Figure 6.33(b): Matrix mapping of the word "kubu", where the arrows show the direction of the sequence of phonemes.

6.3.5.5 Comparison of Performance for WRCS and WRPS according to Type of Classification

For word recognition, two types of classification were implemented in both the conventional system and the proposed system: syllable classification and word classification. The two types differ in their architecture, which indirectly affects the performance of both systems in terms of recognition accuracy. The comparison of performance for WRCS and WRPS according to the type of classification is presented in Table 6.33 and analyzed in Figure 6.34. Only the best testing results (highest accuracy) are presented for each system.

Table 6.33: Results of testing for WRCS and WRPS according to type of classification

  System          Highest testing-set accuracy (%) per type of classification
                  Syllable (S)    Word (W)    (S) - (W)
  Conventional    87.33           79.83       +7.50
  Proposed        96.67           91.33       +5.34

Figure 6.34: Comparison of performance for WRCS and WRPS according to syllable classification and word classification.

6.3.5.6 Summary of Discussion for Word Recognition

For word recognition, the proposed system outperforms the conventional system in every test set. We also evaluated both systems with syllable and word classification. From the testing, our proposed system proved to be better, with an acceptable result and an improvement in recognition accuracy. Besides, the optimal parameters and architecture for our proposed system with syllable classification have been determined from the test results and are shown in Table 6.34. Syllable classification outperforms word classification in both the conventional and the proposed system, even though the recognition process in syllable classification is more complicated than in word classification. Based on the experiments, we found that the recognition accuracy in word recognition is lower than in digit recognition because of the larger number of target words to be recognized.

Table 6.34: The optimal parameters and the architecture for WRPS(S).
  Component           Parameter                                 Value
  Speech Processing   Analysis frame length / Shifting length   240 / 80 (samples)
                      Cepstral order                            24
  SOM                 Learning rate                             0.25
                      Dimension                                 15 x 15
  MLP                 Input node                                225
                      Hidden node                               58
                      Output node                               15
                      Learning rate                             0.1 - 0.3
                      Momentum                                  0.8
                      Max Epoch / Error Function                1000 / 0.01

6.4 Summary

From the tests in digit recognition and word recognition, we can see that the conventional system without the SOM only achieves recognition accuracies between 70% and 90%, while the proposed system applying SOM and MLP achieves an acceptable result with recognition accuracy above 90%. In short, the performance of the proposed system with both SOM and MLP is better than that of the conventional system. This is because the SOM in our proposed system performs dimensionality reduction on the feature vectors, which also simplifies the classification task of the MLP. Therefore, a network architecture combining SOM and MLP can be considered a new and efficient approach for improving speech recognition performance. In terms of the type of classification used, syllable classification performs better, achieving the highest recognition accuracy for both the conventional and the proposed system. This is because the scope of the vocabulary becomes smaller when the syllable is used as the classification unit.

CHAPTER 7

CONCLUSION AND SUGGESTION

7.1 Conclusion

This research has fulfilled its objectives and has contributed towards the development of a hybrid neural network model for Malay speech recognition. The proposed model combines two neural networks, namely the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP). The performance of the proposed model was evaluated through its recognition accuracy and compared with the conventional model, as shown in the previous chapter.

In the study of neural networks and their abilities in speech recognition, we found that the MLP is a powerful pattern recognition technique with a strong capacity for generalization. However, it may not be best used alone, as it has its limitations in pattern recognition. Therefore, we developed a hybrid model by combining the MLP with an unsupervised learning neural network, the SOM, in order to obtain optimal performance for the speech recognition system. It is interesting to note that although the SOM has been used in the speech recognition field for more than a decade, few works have used it to produce matrices; it has mostly been used to generate sequences of labels (Kohonen, 1992). Finally, the new approach developed for the neural network architecture in Malay speech recognition proved to be simple and very efficient. It considerably reduced the amount of computation needed to find the correct set of parameters.

In the experiments, it was shown that the performance of speech recognition using the proposed model is better than that using the conventional model. It was found that the SOM is an effective neural model and tool for dimensionality reduction as well as for speech recognition. Although none of the approaches proved to be the best for practical purposes at the present stage of development, they were good enough to show that translating speech from acoustic features into a binary matrix in a feature space works for dimensionality reduction, which may simplify the recognition task. Human speech is an inherently dynamic process that can nevertheless be usefully described as a binary matrix in a suitable feature space.
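As an illustration of this matrix description, the following minimal sketch shows how the winner nodes of an utterance are accumulated into a binary matrix, in the spirit of the mapping used throughout this work. The numpy implementation and the untrained random codebook are our assumptions for the example; the actual system trains the SOM on real cepstral vectors first.

```python
import numpy as np

def utterance_to_binary_matrix(frames, som_weights):
    """Map a sequence of acoustic vectors to a binary matrix.
    frames:      (n_frames, cepstral_order) cepstral vectors
    som_weights: (rows, cols, cepstral_order) trained SOM codebook
    Each frame's winner (best-matching node) is set to 1; all other
    cells stay 0, so the matrix records acoustic content, not order."""
    rows, cols, _ = som_weights.shape
    matrix = np.zeros((rows, cols))
    flat = som_weights.reshape(rows * cols, -1)
    for v in frames:
        winner = np.argmin(np.sum((flat - v) ** 2, axis=1))
        matrix[winner // cols, winner % cols] = 1.0
    return matrix

# Example with a random (untrained) 12 x 12 codebook of order-24 vectors
rng = np.random.default_rng(0)
som = rng.random((12, 12, 24))
utterance = rng.random((70, 24))      # 70 frames, as in digit recognition
m = utterance_to_binary_matrix(utterance, som)
print(m.shape, int(m.sum()))          # (12, 12) and the number of winners
```

Note that reversing the frame order of the utterance yields exactly the same matrix; this is the loss of temporal information behind the "buku"/"kubu" confusion discussed in Section 6.3.5.4.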
Moreover, the dimensionality reduction scheme proved to reduce the dimensionality while preserving some of the original topology of the map; for example, it preserved enough information to allow good recognition accuracy.

7.2 Directions for Future Research

Besides improving the feature extraction (FE) block and devising a more robust recognizer, the scope of the problem should be broadened to larger vocabularies, continuous speech and more speakers. From this perspective, the results presented in this thesis are only preliminary.

The binary matrices generated by the SOM may be very noisy, which complicates the recognition process. This can be a natural consequence of the speech signal, or an artifact caused by the feature extraction scheme. The latter would not be surprising, since the LPC-derived cepstral coefficient scheme is not very efficient at representing noise-like sounds such as the consonant /s/ (Zbancioc and Costin, 2003). It may be more appropriate to use feature extractors that do not lose essential information before the dimensionality reduction of the SOM is applied. For example, the output of a Fourier or wavelet transform (Deller et al., 1993; Tilo, 1994), both of which retain all the information needed to reconstruct the original signal, could be used directly as input to the SOM.

With respect to dimensionality reduction, as the vocabulary size grows, the feature space of the SOM will become crowded with binary matrices. It is important to study how this crowding effect affects recognition accuracy when binary matrices are used. It has been argued that a hierarchical (multilayer) approach to knowledge representation, information extraction and problem solving is the most suitable strategy in complex settings. Furthermore, the determination of the structure has far-reaching consequences: too small a structure implies an inability to generate a satisfactory representation of the problem, while too large a structure may over-learn the problem, resulting in poor generalization performance. Some enhancements can be made to the standard SOM model by using SAMSOM (Fritzke, 1994; Dittenbach et al., 2000), an overlapped and multilayer structure with a decision fusion mechanism for the SOM.

The SOM learning proposed in this thesis, however, may have poor quantization performance, since the learning does not necessarily lead to a global or local minimum of the average distortion over the whole training set, as usually defined in conventional vector quantization methods, and the learning procedure depends on the presentation order of the input data. The Learning Vector Quantization (LVQ) (Wang and Peng, 1999) and K-means clustering methods, on the other hand, can lead to a local minimum of the average distortion, but the resulting codewords have no structural properties. Therefore, after training by self-organization, the feature map can be retuned using LVQ or K-means in order to improve the quantization performance while preserving the self-organizing property of the feature map.
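A minimal sketch of such a retuning pass follows, assuming the supervised LVQ1 update rule (the variant of LVQ is not specified here) and a codebook whose nodes have already been assigned class labels; the function and its parameters are our illustration.

```python
import numpy as np

def lvq1_retune(codebook, node_labels, X, y, lr=0.05, epochs=10):
    """One possible LVQ1 fine-tuning of a trained SOM codebook.
    codebook:    (nodes, dim) SOM weight vectors, flattened
    node_labels: (nodes,) class label assigned to each node
    X, y:        labelled training vectors and their classes
    Winning nodes are pulled towards same-class inputs and pushed
    away from different-class inputs."""
    cb = codebook.copy()
    for _ in range(epochs):
        for v, cls in zip(X, y):
            w = np.argmin(np.sum((cb - v) ** 2, axis=1))  # best match
            step = lr if node_labels[w] == cls else -lr
            cb[w] += step * (v - cb[w])
    return cb
```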
Furthermore, the architecture of the MLP can be modified such that, instead of training one MLP to recognize a long and complex binary matrix, it may be easier to train a hierarchy of simpler MLPs, each designed to recognize a portion of the long and complex matrix (Jurgen, 1996; Siva, 2000).

The major drawback of the MLP is its long training time. A larger training set will improve the generalization of the network, but it also decreases the convergence rate drastically. As found in the experiments, the MLP generalizes well with a small learning rate, but a small learning rate makes the weight adjustments progress slowly. Thus, it is impractical, in terms of training time, to train the network with a large set of training patterns. In order to accelerate the learning of the network, an adaptive learning rate can be adopted. Its development is based on an analysis of the convergence of the conventional gradient descent method for the MLP. This efficient way of learning dynamically varies the learning rate according to the changes of the gradient values (Shin, 1992).
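As a rough sketch of this idea (our simplification; Shin (1992) derives the rule from a convergence analysis not reproduced here), the learning rate can be grown while the epoch error keeps falling and cut back when it rises:

```python
def adapt_learning_rate(lr, prev_error, error,
                        grow=1.05, shrink=0.7, lr_min=0.01, lr_max=0.5):
    """Heuristic adaptive learning rate for gradient-descent MLP training:
    accelerate while the epoch error decreases, back off when it increases.
    The factors and bounds here are illustrative assumptions."""
    if error < prev_error:
        return min(lr * grow, lr_max)     # making progress: speed up
    return max(lr * shrink, lr_min)       # overshoot: slow down

# Inside a training loop (see the earlier backpropagation sketch):
#     lr = adapt_learning_rate(lr, prev_rms, rms)
```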
In this research, we chose to combine the MLP with the SOM for the hybrid neural network because the MLP is a basic feedforward neural network whose structure is simpler than that of other models. However, the performance of the MLP is very dependent on the accuracy of the endpoint detection or segmentation of the speech signals. It is possible to use other ANN architectures that are less dependent on precise endpoint detection in time, such as the Time-Delay Neural Network (TDNN) (Waibel et al., 1989; Lang, 1989; Lang and Waibel, 1990; Zhong et al., 1999). The TDNN introduces delay neurons in the input and the first hidden layer of the network. Thus, the TDNN considers not only the current speech input features but also the history of these features. In this way, the TDNN is able to learn the dynamic properties of the sounds, which helps to deal with the co-articulation problem in speech recognition.

So far, all the approaches used in speech recognition require a significant number of examples for each class. In a thousands-of-words vocabulary problem, this would require the user of the system to utter hundreds of thousands of examples in order to train the system. New approaches must be developed such that the information acquired by one module can be used to train other modules; for example, approaches that use previously learned information to deduce the matrices corresponding to words that have not been uttered.

REFERENCES

Ahad, A., Fayyaz, A. and Mehmood, T. (2002). Speech Recognition Using Multilayer Perceptron. IEEE Proceedings of Students Conference (ISCON'02). 1: 103-109.
Ahkuputra, V., Jitapunkul, S., Wutiwiwatchai, C., Maneenoi, E. and Kasuriya, S. (1998). Comparison of Thai Speech Recognition Systems using Different Techniques. Proceedings of the IASTED International Conference - Signal and Image Processing '98. 517-520.
Ahmad, A. M., Eng, G. K., Shaharoun, A. M., Yeek, T. C. and Jarni, M. H. B. (2004). An Isolated Speech Endpoint Detector Using Multiple Speech Features. In Proc. IEEE Region 10 Conference (TENCON 2004). 2: 403-406.
Aicha, E. G., Brieuc, C. G. and Fabrice, R. (2004). Self Organizing Map and Symbolic Data. Journal of Symbolic Data Analysis. 2(1).
Aleksander, I. and Morton, H. (1990). An Introduction to Neural Computing. London: Chapman and Hall.
Anderson, J. and Rosenfeld, E. (1988). Neurocomputing: Foundations of Research. Cambridge: MIT Press.
Anderson, S. E. (1999). Speech Recognition Meets Bird Song: A Comparison of Statistics-based and Template-based Techniques. Journal of Acoust. Soc. Am.
Aradilla, G., Vepa, J. and Bourlard, H. (2005). Improving Speech Recognition Using a Data-Driven Approach. Proceedings of Interspeech.
Barto, A. and Anandan, P. (1985). Pattern Recognizing Stochastic Learning Automata. IEEE Transactions on Systems, Man, and Cybernetics. 15: 360-375.
Berke, L. and Hajela, P. (1991). Application of Neural Nets in Structural Optimisation. NATO/AGARD Advanced Study Institute. 23(1-2): 731-745.
Bourlard, H. and Wellekens, J. (1987). Multilayer Perceptron and Automatic Speech Recognition. IEEE Neural Networks.
Burr, D. (1988). Experiments on Neural Net Recognition of Spoken and Written Text. IEEE Trans. on Acoustics, Speech, and Signal Processing. 36: 1162-1168.
Choubassi, M. E., Khoury, H. E., Alagha, C. J., Skaf, J. and Al-Alaoui, M. (2003). Arabic Speech Recognition Using Recurrent Neural Networks. IEEE International Symposium on Signal Processing and Information Technology.
Delashmit, W. H. and Manry, M. T. (2005). Recent Developments in Multilayer Perceptron Neural Networks. Proceedings of the 7th Annual Memphis Area Engineering and Science Conference (MAESC).
Deller, J., Proakis, J. and Hansen, J. (1993). Discrete-Time Processing of Speech Signals. Macmillan Publishing Co.
Dittenbach, M., Merkl, D. and Rauber, A. (2000). Growing Hierarchical Self-Organizing Map. IEEE International Joint Conference on Neural Networks. 6: 15-19.
Elman, J. and Zipser, D. (1987). Learning the Hidden Structure of Speech. ICS Report 8701, Institute for Cognitive Science, University of California, San Diego, La Jolla, CA.
Fausett, L. (1994). Fundamentals of Neural Networks. New Jersey: Prentice-Hall, Inc.
Fritzke, B. (1994). Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning. Neural Networks. 7(9): 1441-1460.
Gavat, I., Valsan, Z. and Sabac, B. (1998). Combining Self-Organizing Map and Multilayer Perceptron in a Neural System for Improved Isolated Word Recognition. Communication98. 245-255.
Gold, B. (1988). A Neural Network for Isolated Word Recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing.
Grant, P. M. (1991). Speech Recognition Techniques. Electronic and Communication Engineering Journal. 37-48.
Hansen, L. K. and Salamon, P. (1990). Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence. 12: 993-1001.
Ha-Jin, Y. and Yung-Hwan, O. (1996). A Neural Network using Acoustic Sub-word Units for Continuous Speech Recognition. The Fourth International Conference on Spoken Language Processing (ICSLP96). 506-509.
Haykin, S. (1994). Neural Networks - A Comprehensive Foundation. New York: Macmillan College Publishing Company, Inc.
Haykin, S. (2001). Adaptive Filter Theory (4th Edition). New York: Prentice Hall, Inc.
Hertz, J., Krogh, A. and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley.
Hewett, A. J. (1989). Training and Speaker Adaptation in Template-Based Speech Recognition. Cambridge University: PhD Thesis.
Hochberg, M. M., Cook, G. D., Renals, S. J. and Robinson, A. J. (1994). Connectionist Model Combination for Large Vocabulary Speech Recognition. IEEE Neural Networks for Signal Processing IV. 269-278.
Hopfield, J. (1982). Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. National Academy of Sciences USA. 79: 2554-2558.
Huang, W. M. and Lippmann, R. (1988). Neural Net and Traditional Classifiers. In Neural Information Processing Systems. 387-396.
Huang, X. D. (1992). Phoneme Classification using Semicontinuous Hidden Markov Models. IEEE Trans. on Signal Processing. 40(5).
Itakura, F. (1975). Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing. 23(1): 67-72.
Jurgen, F. (1996). Modular Neural Networks for Speech Recognition. Interactive Systems Laboratories, Carnegie Mellon University (USA) and University of Karlsruhe (Germany): Diploma Thesis.
Kammerer, B. and Kupper, W. (1990). Experiments for Isolated Word Recognition with Single and Two-Layer Perceptrons. Neural Networks. 3: 693-706.
Kangas, J., Torkkola, K. and Kokkonen, M. (1992). Using SOMs as Feature Extractors for Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP92).
Keun-Rong, H. and Wen-Tsuen, C. (1993). A Neural Network Model which Combines Unsupervised and Supervised Learning. IEEE Transactions on Neural Networks. 4(2): 357-360.
Kohonen, T. (1988a). The Neural Phonetic Typewriter. IEEE Computer. 11-22.
Kohonen, T. (1988b). Self-Organization and Associative Memory. New York: Springer-Verlag.
Kohonen, T., Torkkola, K., Shozakai, M., Kangas, J. and Venta, O. (1988). Phonetic Typewriter for Finnish and Japanese. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP88). 607-610.
Kohonen, T. (1992). The Neural Phonetic Typewriter. In Artificial Neural Networks. Piscataway, NJ: IEEE Press. 42-51.
Kohonen, T. (1995). Self-Organizing Maps. Berlin, Heidelberg: Springer.
Kohonen, T. (2002). Self-Organizing Neural Networks - Recent Advances and Applications. Studies in Fuzziness and Soft Computing. 78: 1-12.
Kusumoputro, B. (1999). Development of Self-Organized Network with a Supervised Training in Artificial Odor Discrimination System. In Computational Intelligence for Modelling, Control & Automation. 57-62.
Lang, K. (1989). A Time-Delay Neural Network Architecture for Speech Recognition. Carnegie Mellon University: PhD Thesis.
Lang, K. J. and Waibel, A. H. (1990). A Time-Delay Neural Network Architecture for Isolated Word Recognition. Neural Networks. 3: 23-43.
Lee, K. F. (1988). Large Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System. Carnegie Mellon University: PhD Thesis.
Lee, T. and Ching, P. C. (1999). Cantonese Syllable Recognition Using Neural Networks. IEEE Transactions on Speech and Audio Processing. 7: 466-472.
Lippmann, R. (1989). Review of Neural Networks for Speech Recognition. Neural Computation. 1(1): 1-38.
Lutfi, A. (1971). Linguistik Deskriptif dan Nahu Bahasa Melayu. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Maniezzo, V. (1994). Genetic Evolution of the Topology and Weight Distribution of Neural Networks. IEEE Trans. on Neural Networks. 5(1): 39-53.
Mashor, M. Y. (1999). Some Properties of RBF Network with Applications to System Identification. International Journal of Computer and Engineering Management. 7(1): 34-56.
Masters, T. (1993). Practical Neural Network Recipes in C++. San Diego: Academic Press, Inc.
Md, S. H. S., Dzulkifli, M. and Sheikh, H. S. S. (2001). Neural Network Speaker-Dependent Isolated Malay Speech Recognition System: Handicrafted vs Genetic Algorithm. Proceedings of the International Symposium on Signal Processing and Its Application (ISSPA 2001). 2: 731-734.
Nik, S. K., Farid, M. O., Hashim, H. M. and Abdul, H. M. (1995). Tatabahasa Dewan. New edition. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Pablo, Z. (1998). Speech Recognition Using Neural Networks. University of Arizona: Master Thesis.
Pandya, A. S. and Macy (1996). Pattern Recognition with Neural Networks in C++. Florida: CRC Press.
Parsons, T. W. (1986). Voice and Speech Processing. New York: McGraw-Hill.
Peeling, S. and Moore, R. (1987). Experiments in Isolated Digit Recognition Using the Multi-Layer Perceptron. Technical Report 4073, Royal Signals and Radar Establishment, Malvern, Worcester, Great Britain.
Peeling, S. M. and Moore, R. K. (1988). Isolated Digit Recognition Experiments Using the Multi-Layer Perceptron. Speech Communication. 7: 403-409.
Peterson, G. E. and Barney, H. L. (1952). Control Methods Used in a Study of the Vowels. J. Acoust. Soc. Am. 24: 175-184.
Picone, J. (1993). Signal Modeling Techniques in Speech Recognition. IEEE Proceedings. 81(9): 1215-1247.
Pont, M. J., Keeton, P. I. J. and Palooran, P. (1996). Speech Recognition Using a Combination of Auditory Models and Conventional Neural Networks. In ABSP1996. 321-324.
Rabiner, L. R. (1976). Digital Processing of Speech Signals. Englewood Cliffs, New Jersey: Prentice-Hall.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE. 77(2).
Rabiner, L. R. and Sambur, M. R. (1975). An Algorithm for Determining the Endpoints of Isolated Utterances. The Bell System Technical Journal. 54(2): 297.
Rabiner, L. R., Wilpon, J. G. and Soong, F. K. (1989). High Performance Connected Digit Recognition Using Hidden Markov Models. IEEE Transactions on Acoustics, Speech and Signal Processing. 1214-1225.
Rabiner, L. R. and Juang, B. H. (1993). Fundamentals of Speech Recognition. Prentice Hall.
Rosenblatt, F. (1962). Principles of Neurodynamics. New York: Spartan.
Rumelhart, D., Hinton, G. and Williams, G. (1986). Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. M.I.T. Press.
Sakoe, H. and Chiba, S. (1978). Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing. 26(1): 43-49.
Salmela, P., Kuusisto, S. S., Saarinen, J., Laurila, K. and Haavisto, P. (1996). Isolated Spoken Number Recognition with Hybrid of Self-Organizing Map and Multilayer Perceptron. Proceedings of the International Conference on Neural Networks (ICNN'96). 3: 1912-1918.
Savoji (1989). A Robust Algorithm for Accurate Endpointing of Speech Signals. Speech Communication. 8: 45-60.
Sheikh, O. S. S., Mohammad, N. A. G. and Ibrahim, A. (1989). Kamus Dewan. 3rd ed. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Shih, C., Kochanski, G. P., Fosler-Lussier, E., Chan, M. and Yuan, J. (2001). Implications of Prosody Modeling for Prosody Recognition. ISCA Workshop on Prosody in Speech Recognition and Understanding.
Shin, W., Nobukazu, I., Mototaka, S., Hideo, M. and Yukio, Y. (1992). Method of Deciding ANNs Parameters for Pattern Recognition. International Joint Conference on Neural Networks. 4: 19-24.
Siva, J. Y. (2000). Recognition of Consonant-Vowel (CV) Utterances Using Modular Neural Network Models. Indian Institute of Technology, Madras: Master Thesis.
Tabatabai, V., Azimi, B., Zahir, A. S. B. and Lucas, C. (1994). Isolated Word Recognition Using a Hybrid Neural Network. Proc. of International Conference on Acoustics, Speech and Signal Processing.
Tebelskis, J. (1995). Speech Recognition Using Neural Networks. School of Computer Science, Carnegie Mellon University: PhD Dissertation.
Tilo, S. (1994). An Experimental Comparison of Different Feature Extraction and Classification Methods for Telephone Speech. IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA94). 93-96.
Ting, H. N., Jasmy, Y., Sheikh, H. S. S. and Cheah, E. L. (2001a). Malay Syllable Recognition Based on Multilayer Perceptron and Dynamic Time Warping. Proceedings of the 6th International Symposium on Signal Processing and Its Applications (ISSPA 2001). 2: 743-744.
Ting, H. N., Jasmy, Y. and Sheikh, H. S. S. (2001b). Malay Syllable Recognition Using Neural Networks. Proceedings of the IEEE Student Conference on Research and Development (SCOReD 2001). Paper 081.
Ting, H. N., Jasmy, Y. and Sheikh, H. S. S. (2001c). Classification of Malay Speech Sounds Based on Place of Articulation and Voicing Using Neural Networks. Proceedings of the IEEE Region 10 International Conference on Electrical and Electronic Technology (TENCON 2001). 1: 170-173.
Villmann, T. (1999). Topology Preservation in Self-Organizing Maps. In Kohonen Maps. Elsevier. 279-292.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. (1989). Phoneme Recognition Using Time-Delay Neural Networks. IEEE Trans. on Acoustics, Speech, and Signal Processing. 37(3).
Wang, J. H. and Peng, C. Y. (1999). Competitive Neural Network Scheme for Learning Vector Quantization. Electronics Letters. 35(9): 725-726.
Watrous, R. (1988). Speech Recognition Using Connectionist Networks. University of Pennsylvania: PhD Thesis.
Wessel, F. L. and Barnard, E. (1992). Avoiding False Local Minima by Proper Initialization of Connections. IEEE Trans. Neural Networks. 3: 899-905.
Woodland, P., Odell, J., Valtchev, V. and Young, S. (1994). Large Vocabulary Continuous Speech Recognition using HTK. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing.
Yamada, T., Hattori, M., Morisawa, M. and Ito, H. (1999). Sequential Learning for Associative Memory Using Kohonen Feature Map. International Joint Conference on Neural Networks (IJCNN'99). 3: 1920-1923.
Yiying, Z., Xiaoyan, Z. and Yu, H. (1997). A Robust and Fast Endpoint Detection Algorithm for Isolated Word Recognition. IEEE International Conference on Intelligent Processing Systems. 4(3): 1819-1822.
Zbancioc, M. and Costin, M. (2003). Using Neural Networks and LPCC to Improve Speech Recognition. International Symposium on Signals, Circuits and Systems (SCS 2003). 2: 445-448.
Zhong, L., Yuanyuan, S. and Runsheng, L. (1999). A Dynamic Neural Network for Syllable Recognition. International Joint Conference on Neural Networks (IJCNN'99). 5: 2997-3001.

PUBLICATIONS

Eng, G. K., Ahmad, A. M., Shaharoun, A. M., Yeek, T. C. and Jarni, M. H. B. (2004). An Isolated Speech Endpoint Detector Using Multiple Speech Features. In Proceedings of the IEEE Region 10 Conference (TENCON 2004). 2: 403-406.
Eng, G. K. and Ahmad, A. M. (2004). A 3-Level Endpoint Detection Algorithm for Isolated Speech using Time and Frequency-based Features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'04).
Eng, G. K. and Ahmad, A. M. (2005). Malay Syllable Speech Recognition using Hybrid Neural Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05).
Eng, G. K. and Ahmad, A. M. (2005). Malay Speech Recognition using Self-Organizing Map and Multilayer Perceptron. In Proceedings of the Postgraduate Annual Research Seminar (PARS'05), Universiti Teknologi Malaysia.
APPENDIX A
Specification of test on optimal CO for DRCS

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  Total frame number                70
  Input node (per CO)               840, 1120, 1400, 1680
  Hidden node (per CO)              92, 105, 118, 130
  Output node                       10
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX B
Specification of test on optimal HNN for DRCS

  HNN                               98, 130, 163
  Cepstral order                    24
  Total frame number                70
  Input node                        1680
  Output node                       10
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX C
Specification of test on optimal LR for DRCS

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  Total frame number                70
  Input node                        1680
  Hidden node                       130
  Output node                       10
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX D
Specification of test on optimal CO for DRPS

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  DSOM                              12 x 12
  Input node                        144
  Hidden node                       58
  Output node                       10
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX E
Specification of test on optimal DSOM for DRPS

  DSOM                              10 x 10, 12 x 12, 15 x 15, 20 x 20
  Cepstral order                    24
  Input node (per DSOM)             100, 144, 225, 400
  Hidden node (per DSOM)            32, 38, 48, 63
  Output node                       10
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX F
Specification of test on optimal HNN for DRPS

  HNN                               28, 38, 48
  Cepstral order                    24
  DSOM                              12 x 12
  Input node                        144
  Output node                       10
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX G
Specification of test on optimal LR for DRPS

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  DSOM                              12 x 12
  Input node                        144
  Hidden node                       38
  Output node                       10
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX H
Specification of test on optimal CO for WRCS(S)

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  Total frame number                50
  Input node (per CO)               600, 800, 1000, 1200
  Hidden node (per CO)              95, 110, 122, 134
  Output node                       15
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX I
Specification of test on optimal HNN for WRCS(S)

  HNN                               100, 134, 168
  Cepstral order                    24
  Total frame number                50
  Input node                        1200
  Output node                       15
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX J
Specification of test on optimal LR for WRCS(S)

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  Total frame number                50
  Input node                        1200
  Hidden node                       134
  Output node                       15
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX K
Specification of test on optimal CO for WRCS(W)

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  Total frame number                50
  Input node (per CO)               600, 800, 1000, 1200
  Hidden node (per CO)              134, 155, 173, 190
  Output node                       30
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX L
Specification of test on optimal HNN for WRCS(W)

  HNN                               142, 190, 238
  Cepstral order                    24
  Total frame number                50
  Input node                        1200
  Output node                       30
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX M
Specification of test on optimal LR for WRCS(W)

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  Total frame number                50
  Input node                        1200
  Hidden node                       190
  Output node                       30
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX N
Specification of test on optimal CO for WRPS(S)

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  DSOM                              12 x 12
  Input node                        144
  Hidden node                       46
  Output node                       15
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX O
Specification of test on optimal DSOM for WRPS(S)

  DSOM                              10 x 10, 12 x 12, 15 x 15, 20 x 20
  Cepstral order                    24
  Input node (per DSOM)             100, 144, 225, 400
  Hidden node (per DSOM)            38, 46, 58, 77
  Output node                       15
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX P
Specification of test on optimal HNN for WRPS(S)

  HNN                               44, 58, 72
  Cepstral order                    24
  DSOM                              15 x 15
  Input node                        225
  Output node                       15
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX Q
Specification of test on optimal LR for WRPS(S)

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  DSOM                              15 x 15
  Input node                        225
  Hidden node                       58
  Output node                       15
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX R
Specification of test on optimal CO for WRPS(W)

  CO                                12, 16, 20, 24
  Sampling rate (kHz)               16
  Analysis frame length (samples)   240
  Shifting length (samples)         80
  DSOM                              12 x 12
  Input node                        144
  Hidden node                       65
  Output node                       30
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX S
Specification of test on optimal DSOM for WRPS(W)

  DSOM                              10 x 10, 12 x 12, 15 x 15, 20 x 20
  Cepstral order                    24
  Input node (per DSOM)             100, 144, 225, 400
  Hidden node (per DSOM)            54, 68, 82, 110
  Output node                       30
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX T
Specification of test on optimal HNN for WRPS(W)

  HNN                               82, 110, 138
  Cepstral order                    24
  DSOM                              20 x 20
  Input node                        400
  Output node                       30
  Learning rate                     0.2
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX U
Specification of test on optimal LR for WRPS(W)

  LR                                0.1, 0.2, 0.3, 0.4
  Cepstral order                    24
  DSOM                              20 x 20
  Input node                        400
  Hidden node                       110
  Output node                       30
  Momentum                          0.8
  Termination EF (rms)              0.01
  Termination epoch                 1000

APPENDIX V
Convergence file (dua12.cep), which shows the rms error at each epoch.
  Epoch    RMS error
  1        0.251450
  2        0.244031
  3        0.227514
  4        0.185642
  5        0.140053
  10       0.059556
  20       0.046975
  50       0.034939
  100      0.024662
  150      0.017587
  200      0.014556
  250      0.012623
  300      0.011240
  350      0.010198
  362      0.009988

(The full file lists the rms error for every epoch from 1 to 362; only representative epochs are shown here. The error decreases monotonically, and training terminates at epoch 362, where the rms error of 0.009988 first falls below the termination error function of 0.01.)