SELF-ORGANIZING MAP AND MULTILAYER PERCEPTRON
FOR MALAY SPEECH RECOGNITION
GOH KIA ENG
A thesis submitted in fulfilment of the
requirements for the award of the degree of
Master of Science (Computer Science)
Faculty of Computer Science and Information System
Universiti Teknologi Malaysia
AUGUST 2006
To my beloved mother and father
ACKNOWLEDGEMENT
First of all, I would like to thank my mother and father, who have been supporting me and giving me lots of encouragement to complete this thesis. They have been so great, and I know there would be no way I could have such a wonderful life without their love and care. Thanks for always being there for me.

A special thanks to my supervisor, Prof. Madya Abdul Manan bin Ahmad, for all his guidance and time. Thanks so much for his advice, comments and suggestions on how to improve this research and how to produce a good thesis. He has been understanding and helpful throughout the completion of this research.

I would also like to take this opportunity to thank all my friends. All the motivation, help and support are fully appreciated. Thanks for being there, listening to my complaints and lending me a helping hand when I was in trouble.

Last but not least, for those not mentioned above, please know that your countless efforts and support will always be remembered. All credit to everyone! Thank you very much.
ABSTRACT
Many studies in the field of speech recognition have applied techniques such as Dynamic Time Warping (DTW), the Hidden Markov Model (HMM) and the Artificial Neural Network (ANN) in order to obtain the best and most suitable model for a speech recognition system. Every model has its drawbacks and weaknesses. The Multilayer Perceptron (MLP) is a popular ANN for pattern recognition, especially in speech recognition, because of its non-linearity, ability to learn, robustness and ability to generalize. However, the MLP has difficulties when dealing with temporal information, as it needs input patterns of fixed length. With that in mind, this research focuses on finding a hybrid model/approach which combines the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) to overcome or reduce these drawbacks. A hybrid neural network model has been developed for speech recognition in the Malay language. In the proposed model, a 2-D SOM is used as a sequential mapping function in order to transform the acoustic vector sequences of the speech signal into a binary matrix, which performs dimensionality reduction. The idea of the approach is to accumulate the winner nodes of an utterance into a binary matrix, where each winner node is scaled to the value "1" and all others to "0". As a result, a binary matrix is formed which represents the content of the utterance. Then, the MLP is used to classify each binary matrix into the word to which it corresponds. The conventional model (MLP only) and the proposed model (SOM and MLP) were tested for digit recognition ("satu" to "sembilan") and word recognition (30 selected Malay words) to find the recognition accuracy using different values of the parameters (cepstral order, dimension of SOM, hidden node number and learning rate). Both models were also tested using two types of classification: syllable classification and word classification. Finally, a comparison and discussion was made between the conventional and the proposed model based on their recognition accuracy. The experimental results showed that the proposed model achieved higher accuracy.
ABSTRAK

Much research has been carried out in the field of speech recognition using various techniques such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Artificial Neural Networks (ANN) and others. Nevertheless, every technique has its own weaknesses, which make the system less accurate. The Multilayer Perceptron (MLP) is a well-known neural network for speech recognition. However, the MLP has weaknesses that degrade the performance of the system. Therefore, this research focuses on the development of a hybrid model which combines two neural networks, namely the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP). A hybrid neural-network-based model has been developed for a speech recognition system in the Malay language. In this model, a two-dimensional SOM is used as a sequential mapping function to transform the acoustic vector sequences of the speech signal into a binary matrix, with the aim of reducing the dimensionality of the speech vectors. The SOM stores the winner nodes of an utterance in matrix form, where each winner node is scaled to the value "1" and all others to "0". This forms a binary matrix which represents the content of the utterance. The MLP then classifies each binary matrix into its respective class. Experiments were conducted on the conventional model (MLP) and the hybrid model (SOM and MLP) for digit recognition ("satu" to "sembilan") and two-syllable word recognition (30 selected words). The experiments aimed to obtain the recognition accuracy using different parameter values (cepstral order, dimension of SOM, number of hidden nodes and learning rate). Both models were also tested using two classification techniques: syllable classification and word classification. A comparison and discussion was made based on their respective recognition accuracies. The experimental results show that the proposed model achieved higher accuracy.
TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

  DECLARATION  ii
  DEDICATION  iii
  ACKNOWLEDGEMENT  iv
  ABSTRACT  v
  ABSTRAK  vi
  TABLE OF CONTENTS  vii
  LIST OF TABLES  xiv
  LIST OF FIGURES  xviii
  LIST OF ABBREVIATIONS  xxiii
  LIST OF SYMBOLS  xxv
  LIST OF APPENDICES  xxvi

1  INTRODUCTION
  1.1  Introduction  1
  1.2  Background of Study  2
  1.3  Problem Statements  4
  1.4  Aim of the Research  5
  1.5  Objectives of the Research  5
  1.6  Scopes of the Research  5
  1.7  Justification  6
  1.8  Thesis Outline  8

2  REVIEW OF SPEECH RECOGNITION AND NEURAL NETWORK
  2.1  Fundamental of Speech Recognition  10
  2.2  Linear Predictive Coding (LPC)  11
  2.3  Speech Recognition Approaches  16
    2.3.1  Dynamic Time Warping (DTW)  16
    2.3.2  Hidden Markov Model (HMM)  17
    2.3.3  Artificial Neural Network (ANN)  18
  2.4  Comparison between Speech Recognition Approaches  20
  2.5  Review of Artificial Neural Networks  21
    2.5.1  Processing Units  21
    2.5.2  Connections  22
    2.5.3  Computation  22
    2.5.4  Training  23
  2.6  Types of Neural Networks  24
    2.6.1  Supervised Learning  24
    2.6.2  Semi-Supervised Learning  25
    2.6.3  Unsupervised Learning  25
    2.6.4  Hybrid Networks  26
  2.7  Related Research  27
    2.7.1  Phoneme/Subword Classification  27
    2.7.2  Word Classification  29
    2.7.3  Classification Using Hybrid Neural Network Approach  31
  2.8  Summary  32

3  SPEECH DATASET DESIGN
  3.1  Human Speech Production Mechanism  33
  3.2  Malay Morphology  35
    3.2.1  Primary Word  35
    3.2.2  Derivative Word  38
    3.2.3  Compound Word  39
    3.2.4  Reduplicative Word  39
  3.3  Malay Speech Dataset Design  39
    3.3.1  Selection of Malay Speech Target Sounds  40
    3.3.2  Acquisition of Malay Speech Dataset  44
  3.4  Summary  46

4  FEATURE EXTRACTION AND CLASSIFICATION ALGORITHM
  4.1  The Architecture of Speech Recognition System  47
  4.2  Feature Extractor (FE)  48
    4.2.1  Speech Sampling  49
    4.2.2  Frame Blocking  50
    4.2.3  Pre-emphasis  51
    4.2.4  Windowing  51
    4.2.5  Autocorrelation Analysis  52
    4.2.6  LPC Analysis  52
    4.2.7  Cepstrum Analysis  53
    4.2.8  Endpoint Detection  54
    4.2.9  Parameter Weighting  55
  4.3  Self-Organizing Map (SOM)  55
    4.3.1  SOM Architecture  57
    4.3.2  Learning Algorithm  58
    4.3.3  Dimensionality Reduction  63
  4.4  Multilayer Perceptron (MLP)  65
    4.4.1  MLP Architecture  65
    4.4.2  Activation Function  66
    4.4.3  Error-Backpropagation  67
    4.4.4  Improving Error-Backpropagation  69
    4.4.5  Implementation of Error-Backpropagation  73
  4.5  Summary  74

5  SYSTEM DESIGN AND IMPLEMENTATION
  5.1  Introduction  75
  5.2  Implementation of Speech Processing  76
    5.2.1  Feature Extraction using LPC  76
    5.2.2  Endpoint Detection  80
  5.3  Implementation of Self-Organizing Map  91
  5.4  Implementation of Multilayer Perceptron  97
    5.4.1  MLP Architecture for Digit Recognition  97
    5.4.2  MLP Architecture for Word Recognition  98
    5.4.3  Implementation of MLP  99
  5.5  Experiment Setup  107

6  RESULTS AND DISCUSSION
  6.1  Introduction  109
  6.2  Testing of Digit Recognition  111
    6.2.1  Testing Results for Conventional System  111
      6.2.1.1  Experiment 1: Optimal Cepstral Order (CO)  111
      6.2.1.2  Experiment 2: Optimal Hidden Node Number (HNN)  112
      6.2.1.3  Experiment 3: Optimal Learning Rate (LR)  114
    6.2.2  Results for Proposed System Testing  115
      6.2.2.1  Experiment 1: Optimal Cepstral Order (CO)  115
      6.2.2.2  Experiment 2: Optimal Dimension of SOM (DSOM)  116
      6.2.2.3  Experiment 3: Optimal Hidden Node Number (HNN)  117
      6.2.2.4  Experiment 4: Optimal Learning Rate (LR)  119
    6.2.3  Discussion for Digit Recognition Testing  120
      6.2.3.1  Comparison of Performance for DRCS and DRPS (CO)  120
      6.2.3.2  Comparison of Performance for DRCS and DRPS (HNN)  121
      6.2.3.3  Comparison of Performance for DRCS and DRPS (LR)  123
      6.2.3.4  Discussion on Performance of DRPS according to DSOM  124
      6.2.3.5  Summary for Digit Recognition Testing  125
  6.3  Testing of Word Recognition  126
    6.3.1  Results for Conventional System Testing (Syllable Classification)  126
      6.3.1.1  Experiment 1: Optimal Cepstral Order (CO)  126
      6.3.1.2  Experiment 2: Optimal Hidden Node Number (HNN)  127
      6.3.1.3  Experiment 3: Optimal Learning Rate (LR)  128
    6.3.2  Results for Conventional System Testing (Word Classification)  130
      6.3.2.1  Experiment 1: Optimal Cepstral Order (CO)  130
      6.3.2.2  Experiment 2: Optimal Hidden Node Number (HNN)  131
      6.3.2.3  Experiment 3: Optimal Learning Rate (LR)  132
    6.3.3  Results for Proposed System Testing (Syllable Classification)  133
      6.3.3.1  Experiment 1: Optimal Cepstral Order (CO)  133
      6.3.3.2  Experiment 2: Optimal Dimension of SOM (DSOM)  135
      6.3.3.3  Experiment 3: Optimal Hidden Node Number (HNN)  136
      6.3.3.4  Experiment 4: Optimal Learning Rate (LR)  137
    6.3.4  Results for Proposed System Testing (Word Classification)  138
      6.3.4.1  Experiment 1: Optimal Cepstral Order (CO)  138
      6.3.4.2  Experiment 2: Optimal Dimension of SOM (DSOM)  140
      6.3.4.3  Experiment 3: Optimal Hidden Node Number (HNN)  141
      6.3.4.4  Experiment 4: Optimal Learning Rate (LR)  142
    6.3.5  Discussion for Word Recognition Testing  143
      6.3.5.1  Comparison of Performance for WRCS and WRPS according to CO  143
      6.3.5.2  Comparison of Performance for WRCS and WRPS according to HNN  146
      6.3.5.3  Comparison of Performance for WRCS and WRPS according to LR  149
      6.3.5.4  Comparison of Performance of WRPS according to DSOM  152
      6.3.5.5  Comparison of Performance for WRCS and WRPS according to Type of Classification  155
      6.3.5.6  Summary of Discussion for Word Recognition  156
  6.4  Summary  157

7  CONCLUSION AND SUGGESTION
  7.1  Conclusion  158
  7.2  Directions for Future Research  159

REFERENCES  162
PUBLICATIONS  170
Appendices A - V  171 - 192
LIST OF TABLES

TABLE NO.  TITLE  PAGE

1.1  Comparison of different speech recognition systems.  7
2.1  The comparison between different speech recognition approaches.  20
2.2  The performance comparison between different speech recognition approaches.  20
3.1  Structure of words with one syllable.  36
3.2  Structure of words with two syllables.  37
3.3  Structure of words with three syllables or more.  38
3.4  15 selected syllables in order to form two-syllable words as target sounds.  41
3.5  Two-syllable Malay words combined using 15 selected syllables.  42
3.6  30 selected Malay two-syllable words as the speech target sounds.  43
3.7  10 selected digit words as the speech target sounds for digit recognition.  44
3.8  Specification of dataset for word recognition.  45
3.9  Specification of dataset for digit recognition.  45
5.1  The setting of the target values for MLP in digit recognition.  98
5.2  The setting of the target values for MLP (syllable classification).  100
5.3  The setting of the target values for MLP (word classification).  101
6.1  Recognition accuracy for different CO for Experiment 1 (DRCS).  111
6.2  Recognition accuracy for different HNN for Experiment 2 (DRCS).  113
6.3  Recognition accuracy for different LR for Experiment 3 (DRCS).  114
6.4  Recognition accuracy for different CO for Experiment 1 (DRPS).  115
6.5  Recognition accuracy for different DSOM for Experiment 2 (DRPS).  116
6.6  Recognition accuracy for different HNN for Experiment 3 (DRPS).  118
6.7  Recognition accuracy for different LR for Experiment 4 (DRPS).  119
6.8  Comparison of performance for DRCS and DRPS according to CO.  120
6.9  Comparison of performance for DRCS and DRPS according to HNN.  122
6.10  Comparison of performance for DRCS and DRPS according to LR.  123
6.11  The optimal parameters and the architecture for DRPS.  125
6.12  Recognition accuracy for different CO for Experiment 1 (WRCS(S)).  126
6.13  Recognition accuracy for different HNN for Experiment 2 (WRCS(S)).  127
6.14  Recognition accuracy for different LR for Experiment 3 (WRCS(S)).  129
6.15  Recognition accuracy for different CO for Experiment 1 (WRCS(W)).  130
6.16  Recognition accuracy for different HNN for Experiment 2 (WRCS(W)).  131
6.17  Recognition accuracy for different LR for Experiment 3 (WRCS(W)).  132
6.18  Recognition accuracy for different CO for Experiment 1 (WRPS(S)).  134
6.19  Recognition accuracy for different DSOM for Experiment 2 (WRPS(S)).  135
6.20  Recognition accuracy for different HNN for Experiment 3 (WRPS(S)).  136
6.21  Recognition accuracy for different LR for Experiment 4 (WRPS(S)).  137
6.22  Recognition accuracy for different CO for Experiment 1 (WRPS(W)).  139
6.23  Recognition accuracy for different DSOM for Experiment 2 (WRPS(W)).  140
6.24  Recognition accuracy for different HNN for Experiment 3 (WRPS(W)).  141
6.25  Recognition accuracy for different LR for Experiment 4 (WRPS(W)).  142
6.26  Comparison of performance for WRCS(S) and WRPS(S) according to CO.  144
6.27  Comparison of performance for WRCS(W) and WRPS(W) according to CO.  145
6.28  Comparison of performance for WRCS(S) and WRPS(S) according to HNN.  146
6.29  Comparison of performance for WRCS(W) and WRPS(W) according to HNN.  147
6.30  Comparison of performance for WRCS(S) and WRPS(S) according to LR.  149
6.31  Comparison of performance for WRCS(W) and WRPS(W) according to LR.  150
6.32  Comparison of performance for WRPS according to DSOM.  152
6.33  Results of testing for WRCS and WRPS according to type of classification.  155
6.34  The optimal parameters and the architecture for WRPS(S).  156
LIST OF FIGURES

FIGURE NO.  TITLE  PAGE

1.1  Feature map with neurons (circles) labeled with the symbols of the phonemes to which they "learned" to give the best responses.  3
1.2  The sequence of the responses obtained from the trained feature map when the Finnish word humppila was uttered.  3
2.1  Basic model of speech recognition system.  11
2.2  The current speech sample is predicted as a linear combination of past p samples (n = total number of speech samples).  12
2.3  Dynamic Time Warping (DTW).  17
2.4  A basic architecture of Multilayer Perceptron (MLP).  19
2.5  A basic neuron processing unit.  22
2.6  Neural network topologies: (a) Unstructured, (b) Layered, (c) Recurrent and (d) Modular.  23
2.7  Perceptrons: (a) Single-layer Perceptron (b) Multilayer Perceptron.  25
2.8  Decision regions formed by a 2-layer Perceptron using backpropagation training and vowel formant data.  28
3.1  The vocal tract.  34
3.2  Structure of one-syllable words "Ya" and "Stor".  36
3.3  Structure of two-syllable words "Guru" and "Jemput".  37
4.1  Proposed speech recognition model.  47
4.2  Feature Extractor (FE) schematic diagram.  48
4.3  Speech signal for the word kosong01.wav sampled at 16 kHz with a precision of 16 bits.  49
4.4  Blocking of speech waveform into overlapping frames with N analysis frame length and M shifting length.  50
4.5  Cepstral coefficient of BU.cep.  54
4.6  SOM transforms feature vectors generated by speech processing into a binary matrix, which performs dimensionality reduction.  56
4.7  The 2-D SOM architecture.  57
4.8  Flow chart of SOM learning algorithm.  61
4.9  Trained feature map after 1,250,000 iterations.  62
4.10  Dimensionality reduction performed by SOM.  63
4.11(a)  The 12 x 12 mapping of binary matrix of /bu/ syllable.  64
4.11(b)  Binary matrix of /bu/ which is fed as input for MLP.  64
4.12  A three-layer Multilayer Perceptron.  66
4.13  The determination of hidden node number using Geometric Pyramid Rule (GPR).  71
4.14  Flow chart of error-backpropagation algorithm.  73
5.1  The implementation of speech recognition system.  75
5.2(a)  The detected boundaries of sembilan04.wav using rms energy in Level 1 of Initial endpoint detection.  90
5.2(b)  The detected boundaries of sembilan04.wav using zero crossing rate in Level 2 of Initial endpoint detection.  90
5.2(c)  The actual boundaries of sembilan04.wav using Euclidean distance of cepstrum in Level 3 of Actual endpoint detection.  90
5.3  The architecture of Self-Organizing Map (SOM).  91
5.4  MLP with 10 output nodes. The 10 output nodes correspond to 10 Malay digit words respectively.  97
5.5  MLP with 15 output nodes. The 15 output nodes correspond to 15 Malay syllables respectively.  99
5.6  MLP with 30 output nodes. The 30 output nodes correspond to 30 Malay two-syllable words respectively.  99
5.7  System architecture for conventional model (single network).  108
5.8  System architecture for proposed model (hybrid network).  108
5.9  Training and testing of the digit recognition system.  109
5.10  Training and testing of the word recognition system.  109
6.1  Presentation and discussion of the results of the tests in table and graph form in stages.  110
6.2  Recognition accuracy for different CO for Experiment 1 (DRCS).  112
6.3  Recognition accuracy for different HNN for Experiment 2 (DRCS).  113
6.4  Recognition accuracy for different LR for Experiment 3 (DRCS).  114
6.5  Recognition accuracy for different CO for Experiment 1 (DRPS).  116
6.6  Recognition accuracy for different DSOM for Experiment 2 (DRPS).  117
6.7  Recognition accuracy for different HNN for Experiment 3 (DRPS).  118
6.8  Recognition accuracy for different LR for Experiment 4 (DRPS).  119
6.9  Analysis of comparison of performance for DRCS and DRPS according to CO.  121
6.10  Analysis of comparison of performance for DRCS and DRPS according to HNN.  122
6.11  Analysis of comparison of performance for DRCS and DRPS according to LR.  124
6.12  Recognition accuracy for different CO for Experiment 1 (WRCS(S)).  127
6.13  Recognition accuracy for different HNN for Experiment 2 (WRCS(S)).  128
6.14  Recognition accuracy for different LR for Experiment 3 (WRCS(S)).  129
6.15  Recognition accuracy for different CO for Experiment 1 (WRCS(W)).  130
6.16  Recognition accuracy for different HNN for Experiment 2 (WRCS(W)).  132
6.17  Recognition accuracy for different LR for Experiment 3 (WRCS(W)).  133
6.18  Recognition accuracy for different CO for Experiment 1 (WRPS(S)).  134
6.19  Recognition accuracy for different DSOM for Experiment 2 (WRPS(S)).  135
6.20  Recognition accuracy for different HNN for Experiment 3 (WRPS(S)).  137
6.21  Recognition accuracy for different LR for Experiment 4 (WRPS(S)).  138
6.22  Recognition accuracy for different CO for Experiment 1 (WRPS(W)).  139
6.23  Recognition accuracy for different DSOM for Experiment 2 (WRPS(W)).  140
6.24  Recognition accuracy for different HNN for Experiment 3 (WRPS(W)).  142
6.25  Recognition accuracy for different LR for Experiment 4 (WRPS(W)).  143
6.26  Comparison of performance for WRCS(S) and WRPS(S) according to CO.  144
6.27  Comparison of performance for WRCS(W) and WRPS(W) according to CO.  145
6.28  Comparison of performance for WRCS(S) and WRPS(S) according to HNN.  147
6.29  Comparison of performance for WRCS(W) and WRPS(W) according to HNN.  148
6.30  Comparison of performance for WRCS(S) and WRPS(S) according to LR.  150
6.31  Comparison of performance for WRCS(W) and WRPS(W) according to LR.  151
6.32  Comparison of performance for WRPS according to DSOM.  153
6.33(a)  Matrix mapping of "buku" word where the arrows show the direction of the sequence of phonemes.  154
6.33(b)  Matrix mapping of "kubu" word where the arrows show the direction of the sequence of phonemes.  154
6.34  Analysis of comparison of performance for WRCS and WRPS according to syllable classification and word classification.  155
LIST OF ABBREVIATIONS

AI - Artificial Intelligence
ANN - Artificial Neural Network
BMU - Best Matching Unit
BP - Back-Propagation
CO - Cepstral Order
CS - Conventional System
DR - Digit Recognition
DRCS - Digit Recognition Conventional System
DRPS - Digit Recognition Proposed System
DSOM - Dimension of Self-Organizing Map
DTW - Dynamic Time Warping
FE - Feature Extractor
GPR - Geometric Pyramid Rule
HMM - Hidden Markov Model
HNN - Hidden Node Number
KSOM - Kohonen Self-Organization Network
LP - Linear Prediction
LPC - Linear Predictive Coding
LR - Learning Rate
LVQ - Learning Vector Quantization
MLP - Multilayer Perceptron
PARCOR - Partial-Correlation
PC - Personal Computer
PS - Proposed System
SAMSOM - Structure Adaptive Multilayer Self-Organizing Map
SLP - Single-layer Perceptron
SOM - Self-Organizing Map
TDNN - Time-Delay Neural Network
VQ - Vector Quantization
WPF - Winning Probability Function
WR - Word Recognition
WRCS - Word Recognition Conventional System
WRCS(S) - Word Recognition Conventional System using Syllable Classification
WRCS(W) - Word Recognition Conventional System using Word Classification
WRPS - Word Recognition Proposed System
WRPS(S) - Word Recognition Proposed System using Syllable Classification
WRPS(W) - Word Recognition Proposed System using Word Classification
LIST OF SYMBOLS

s - Speech sample
ŝ - Predicted speech sample
a - Predictor coefficient
e - Prediction error
E - Mean squared error (LPC)
E - Energy power (endpoint detection)
Z - Zero-crossing
T - Threshold (endpoint detection)
D - Weighted Euclidean distance
R - Autocorrelation function
w - Hamming window
p - The order of the LPC analysis
k - PARCOR coefficients
c - Cepstral coefficients
X - Input nodes
Y - Output nodes
H - Hidden nodes
M - Weights
B - Bias
σ - Width of lattice (SOM)
λ - Time constant (SOM)
α - Learning rate (SOM)
Θ - The amount of influence of a node's distance from the BMU (SOM)
η - Learning rate (MLP)
δ - Error information term
LIST OF APPENDICES

APPENDIX  TITLE  PAGE

A  Specification of test on optimal Cepstral Order for DRCS  171
B  Specification of test on optimal Hidden Node Number for DRCS  172
C  Specification of test on optimal Learning Rate for DRCS  173
D  Specification of test on optimal Cepstral Order for DRPS  174
E  Specification of test on optimal Dimension of SOM for DRPS  175
F  Specification of test on optimal Hidden Node Number for DRPS  176
G  Specification of test on optimal Learning Rate for DRPS  177
H  Specification of test on optimal Cepstral Order for WRCS(S)  178
I  Specification of test on optimal Hidden Node Number for WRCS(S)  179
J  Specification of test on optimal Learning Rate for WRCS(S)  180
K  Specification of test on optimal Cepstral Order for WRCS(W)  181
L  Specification of test on optimal Hidden Node Number for WRCS(W)  182
M  Specification of test on optimal Learning Rate for WRCS(W)  183
N  Specification of test on optimal Cepstral Order for WRPS(S)  184
O  Specification of test on optimal Dimension of SOM for WRPS(S)  185
P  Specification of test on optimal Hidden Node Number for WRPS(S)  186
Q  Specification of test on optimal Learning Rate for WRPS(S)  187
R  Specification of test on optimal Cepstral Order for WRPS(W)  188
S  Specification of test on optimal Dimension of SOM for WRPS(W)  189
T  Specification of test on optimal Hidden Node Number for WRPS(W)  190
U  Specification of test on optimal Learning Rate for WRPS(W)  191
V  Convergences file (dua12.cep) which shows the rms error in each epoch  192
CHAPTER 1
INTRODUCTION
1.1 Introduction
By 1990, many researchers had demonstrated the value of neural networks for important tasks such as phoneme recognition and spoken digit recognition. However, it was still unclear whether connectionist techniques would scale up to large speech recognition tasks. Speech recognition technologies vary widely, and it is important to understand the differences between them. Speech recognition systems can be classified according to the type of speech, the size of the vocabulary, the basic units and the degree of speaker independence. The position of a speech recognition system along these dimensions determines which algorithms can or must be used. Speech recognition has been another proving ground for neural networks. Some researchers achieved good results in such basic tasks as voiced/unvoiced discrimination (Watrous, 1988), phoneme recognition (Waibel et al., 1989), and spoken digit recognition (Peeling and Moore, 1987). However, research into finding a good neural network model for robust speech recognition still has wide potential for development.
Why does the speech recognition problem attract researchers? If an efficient speech recognizer were produced, a very natural human-machine interface would be obtained. By natural we mean something intuitive and easy for a person to use: a method that does not require special tools or machines, only the natural capabilities that every human possesses. Such a system could be used by any person who is able to speak, and it would allow an even broader use of machines, specifically computers.
1.2 Background of Study
Neural network classifiers have been compared with other pattern recognition classifiers and explored as an alternative to other speech recognition techniques. Lippmann (1989) proposed a static model in which the input pattern is presented to a Multilayer Perceptron (MLP) network. The conventional neural network (Pont et al., 1996; Ahkuputra et al., 1998; Choubassi et al., 2003) consists of a few basic layers (input, hidden and output) in a Multilayer Perceptron type of topology. A training algorithm such as backpropagation is then applied to develop the interconnection weights. This conventional model or system has also been used in a variety of pattern recognition and control applications that are not effectively handled by other AI paradigms.

However, there are some difficulties in using the MLP alone. The major difficulty is that increasing the number of connections not only increases the training time but also makes the network more likely to fall into poor local minima; it also requires more data for training. The Perceptron, as well as the Multilayer Perceptron (MLP), usually needs input patterns of fixed length (Lippmann, 1989). This is why the MLP has difficulties when dealing with temporal information (the essential speech information or features extracted during speech processing). Since the word has to be recognized as a whole, the word boundaries are often located automatically by an endpoint detector and the noise outside the boundaries is removed. The word patterns also have to be warped using some pre-defined paths in order to obtain fixed-length word patterns.
Since the early eighties, researchers have applied neural networks to the speech recognition problem. One of the first attempts was Kohonen's electronic typewriter (Kohonen, 1992). It used the clustering and classification characteristics of the Self-Organizing Map (SOM) to obtain an ordered feature map from a sequence of feature vectors, as shown in Figure 1.1. The training was divided into two stages. The first stage was used to obtain the SOM: speech feature vectors were fed into the SOM until it converged. The second stage consisted of labeling the SOM; that is, each neuron of the feature map was assigned a phoneme label. Once the labeling process was completed, the training process ended. Then, unclassified speech was fed into the system, which translated it into a sequence of labels. Figure 1.2 shows the sequence of responses obtained from the trained feature map when the Finnish word humppila was uttered. In this way, the feature extractor plus the SOM behaved like a transducer, transforming a sequence of speech samples into a sequence of labels. The sequence of labels was then processed by some AI scheme (grammatical transformation rules) in order to obtain words from it.
Figure 1.1: Feature map with neurons (circles) labeled with the symbols of the phonemes to which they "learned" to give the best responses.
Figure 1.2: The sequence of the responses obtained from the trained feature map
when the Finnish word humppila was uttered.
Using an unsupervised learning neural network such as the SOM therefore seems wise. The SOM constructs a topology-preserving mapping from the high-dimensional space onto map units (neurons) in such a way that relative distances between data points are preserved. The SOM performs dimensionality reduction by producing a map, usually of two dimensions, which plots the similarities of the data by grouping similar data items together. Because of its ability to form an ordered feature map, the SOM is found to be suitable for dimensionality reduction of speech features. Forming a binary matrix to feed to the MLP makes the training and classification simpler and better. Such a hybrid system consists of two neural-based models, a SOM and an MLP. The hybrid system mainly tries to overcome the problem of the temporal variation of utterances, where utterances of the same word by the same speaker may differ in duration and speech rate.
1.3 Problem Statements

According to the background of the study, the problem statements are as follows:

i. Various approaches have been introduced for Malay speech recognition in order to produce an accurate and robust system. However, only a few approaches have achieved excellent performance for Malay speech recognition (Ting et al., 2001a, 2001b and 2001c; Md Sah Haji Salam et al., 2001). Thus, research in speech recognition for the Malay language still has wide potential for development.

ii. The Multilayer Perceptron (MLP) has difficulties when dealing with temporal information. Since the word has to be recognized as a whole, the word patterns have to be warped using some pre-defined paths in order to obtain fixed-length word patterns (Tebelskis, 1995; Gavat et al., 1998). Thus, an efficient model is needed to overcome this drawback.

iii. The Self-Organizing Map (SOM) is considered a suitable and effective approach for both clustering and dimensionality reduction. However, is the SOM an efficient neural network to be applied in MLP-based speech recognition in order to reduce the dimensionality of the feature vector?
1.4 Aim of the Research

The aim of the research is to investigate how a hybrid neural network can be applied in the speech recognition area, and to propose a hybrid model combining the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) for Malay speech recognition in order to achieve better performance than the conventional model (single network).
1.5 Objectives of the Research

i. Studying the effectiveness of various types of neural network models in terms of speech recognition.

ii. Developing a hybrid model/approach by combining SOM and MLP in speech recognition for the Malay language.

iii. Developing a prototype of Malay speech recognition which contains three main components, namely speech processing, SOM and MLP.

iv. Conducting experiments to determine the optimal values for the parameters of the system (cepstral order, dimension of SOM, hidden node number, learning rate) in order to obtain the optimal performance.

v. Comparing the performance of the conventional model (single network) and the proposed model (SOM and MLP) based on their recognition accuracy, to demonstrate the improvement achieved by the proposed model. The recognition accuracy is calculated as the percentage below:

$$\text{Recognition Accuracy (\%)} = \frac{\text{Total of Correctly Recognized Words}}{\text{Total of Sample Words}} \times 100$$
1.6 Scopes of the Research

The scope of the research clearly defines the specific field of the study. The discussion of the study and research is confined to this scope.
i. Two datasets were created: one for digit recognition and another for word recognition. The former consists of 10 Malay digits and the latter consists of 30 selected two-syllable Malay words. Speech samples were collected in a noise-free environment using a unidirectional microphone.

ii. The human speakers comprise 3 males and 3 females, aged between 18 and 25 years. The system supports speaker-independent capability.

iii. Linear Predictive Coding (LPC) is used as the feature extraction method to extract the speech features from the speech data. The LPC coefficients are determined using the autocorrelation method, and the extracted LPC coefficients are then converted to cepstral coefficients.

iv. The Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP) are applied in the proposed system. The SOM acts as a feature extractor which converts the higher-dimensional feature vectors into lower-dimensional binary vectors. The MLP then takes the binary vectors as its input for training and classification.
1.7 Justification
Researchers have worked on automatic speech recognition for several decades. In the eighties, speech recognition research was characterized by a shift in technology from template-based approaches (Hewett, 1989; Aradilla et al., 2005) to statistical approaches (Gold, 1988; Huang, 1992; Siva, 2000; Zbancioc and Costin, 2003) and connectionist approaches (Watrous, 1988; Hochberg et al., 1994). Instead of the Hidden Markov Model (HMM), the use of neural networks has become another idea in speech recognition problems. Anderson (1999) made a comparison between statistical and template-based approaches. Today's research focuses on a broader definition of speech recognition: it is concerned not only with recognizing the word content but also with prosody (Shih et al., 2001) and personal signature.
Despite all the advances in the speech recognition area, the problem is far from completely solved. A number of commercial products are currently on the market: products that recognize the speech of a person within the scope of a credit card phone system, command recognizers that permit voice control of different types of machines, "electronic typewriters" that can recognize continuous speech and manage vocabularies of several tens of thousands of words, and so on. However, although these applications may seem impressive, they are still computationally intensive, and in order to make their usage widespread, more efficient algorithms must be developed. In short, there is still room for much improvement and research.

Currently, many speech recognition applications have been released, whether as commercial or free software. The technology behind speech output has changed over time, and the performance of speech recognition systems keeps improving. Early systems used discrete speech; Dragon Dictate is the only discrete speech system still available commercially today. On the other hand, the main continuous speech systems currently available for the PC are Dragon Naturally Speaking and IBM ViaVoice. Table 1.1 compares different speech recognition systems with the prototype to be built in this research. This comparison is important as it gives an insight into the current trend of speech recognition technology.
Table 1.1: Comparison of different speech recognition systems

Feature                          Dragon Dictate   IBM ViaVoice    Naturally Speaking 7   Microsoft Office XP SR   Prototype To Be Built
Discrete Speech Recognition      √                X               X                      X                        √
Continuous Speech Recognition    X                √               √                      √                        X
Speaker Dependent                √                √               √                      √                        √
Speaker Independent              X                X               X                      X                        √
Speech-to-Text                   √                √               √                      √                        √
Active Vocabulary Size (Words)   30,000 – 60,000  22,000 – 64,000 300,000                Finite                   30 – 100
Language                         English          English         English                English                  Malay
In this research, the speech recognition problem is transformed into a simplified binary matrix recognition problem. The binary matrices are generated and simplified by means of a SOM while preserving most of the useful information. Word recognition then becomes a problem of binary matrix recognition in a lower-dimensional feature space, which amounts to dimensionality reduction. Besides, the comparison between the single-network recognizer and the hybrid-network recognizer conducted here sheds new light on future directions of research in the field. It is important to understand that the purpose of this work is not to develop a full-scale speech recognizer, but only to test the proposed hybrid model and explore its usefulness in providing more efficient solutions in speech recognition.
1.8 Thesis Outline
The first few chapters of this thesis provide some essential background and a
summary of related work in speech recognition and neural networks.
Chapter 2 reviews the fields of speech recognition and neural networks, as well as the intersection of these two fields, summarizing both past and present approaches to speech recognition using neural networks.
Chapter 3 introduces the speech dataset design.
Chapter 4 presents the algorithms of the proposed system: speech feature
extraction (Speech processing and SOM) and classification (MLP).
Chapter 5 presents the implementation of the proposed system: Speech
processing, Self-Organizing Map and Multilayer Perceptron. The essential parts of
the source code are shown and explained in detail.
Chapter 6 presents the experimental tests on both systems: the conventional system and the proposed system. The tests are conducted using the digit dataset for digit recognition and the word dataset for word recognition. For word recognition, two classification approaches are applied, namely syllable classification and word classification. The tests are conducted on a speaker-independent system with different values of the parameters in order to obtain the optimal performance according to the recognition accuracy. Discussion and comparison of the experimental results are also included in this chapter.
Chapter 7 presents the conclusions and future work of the thesis.
CHAPTER 2
REVIEW OF SPEECH RECOGNITION AND
NEURAL NETWORK
2.1 Fundamental of Speech Recognition
This chapter reviews the basic concepts of a speech recognition system and its main components: speech processing, feature extraction and the classifier. Through the literature review, useful insights are gained for designing the methodology of this research. Figure 2.1 shows the block diagram of a basic speech recognition system, which comprises speech processing, feature extraction and a classifier:
2.4
Analog
2.5Speech
Signals
Speech
Processing
Feature
Extraction
Discrete
Signals
Classifier/
Recognizer
Feature
Vector
2.6
2.7
2.8
Speech
Information
Database
Figure 2.1: Basic model of speech recognition system
Recognized
word
• Analog Speech Signal: Here the analog speech signal is converted to a discrete signal (digital speech data format). Examples of digital speech data formats are .wav, .snd and .au.

• Speech Processing: The speech signal also contains unnecessary data, such as noise and non-speech segments, which need to be removed before feature extraction. The resulting speech signal is passed through an endpoint detector to determine the beginning and the end of the speech data.

• Feature Extraction: To extract the characteristics of the processed speech signal, a feature extractor is used to extract the useful features from the speech data; the result is called the feature vector.

• Classifier/Recognizer: Here the recognition of the speech data is established. There are many kinds of classifier, with different techniques and advantages. The feature vector is the input to the classifier, and its output is the recognized word.
2.2 Linear Predictive Coding (LPC)
Linear Predictive Coding (LPC) is one of the most popular speech feature extraction methods. It is basically the prediction of the present speech sample from a linear combination of past speech samples (Rabiner, 1993).

It is widely applied as a speech feature extraction algorithm because it is mathematically precise, simple to implement and fast to compute (Parsons, 1986; Picone, 1993; Rabiner, 1993). Besides, it provides a good model of the speech signal. It is often used for the feature extraction of isolated words, syllables, phonemes, consonants and even vowels of languages such as English and Japanese. Among the LPC parameters are the LPC coefficients, the reflection or PARCOR coefficients and the log area ratio coefficients.

The basic principle behind LPC is that the current speech sample can be predicted as a linear combination of the past speech samples, as shown in Figure 2.2.
Figure 2.2: The current speech sample s(n) is predicted as a linear combination of the past p samples s(n-1), s(n-2), ..., s(n-p). (n = total number of speech samples)
$$\hat{s}(n) = a_1 s(n-1) + a_2 s(n-2) + a_3 s(n-3) + \dots + a_p s(n-p) \tag{2.1a}$$

$$\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k) \tag{2.1b}$$

where $a_k$ are the coefficients, which are assumed to be constant over the speech analysis frame, and $p$ is the number of past speech samples. A speech analysis frame is defined as a short segment of the speech waveform to be examined or analyzed. The prediction error is defined as the difference between the actual speech sample $s(n)$ and the predicted sample $\hat{s}(n)$:

$$e(n) = s(n) - \hat{s}(n) \tag{2.2a}$$

$$e(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \tag{2.2b}$$
The purpose of LPC is to find the predictor coefficients $a_k$ which best match the speech waveform within an analysis frame (Parsons, 1986). The predictor coefficients have to be determined in a way that minimizes the mean squared prediction error over a small analysis frame. The mean squared prediction error is defined as

$$E = \sum_{n=0}^{M} e^2(n) \tag{2.3a}$$

$$E = \sum_{n=0}^{M} \left[ s(n) - \sum_{k=1}^{p} a_k s(n-k) \right]^2 \tag{2.3b}$$
where $M$ is the analysis frame length. The minimization of the mean squared error is done by setting the partial derivatives of $E$ with respect to $a_k$ simultaneously equal to zero:

$$\frac{\partial E}{\partial a_k} = 0, \quad k = 1, 2, 3, \dots, p \tag{2.4a}$$

$$\sum_{n=0}^{M} 2 \left[ s(n) - \sum_{k=1}^{p} a_k s(n-k) \right] \bigl(-s(n-j)\bigr) = 0, \quad j = 1, 2, 3, \dots, p \tag{2.4b}$$

$$-2 \sum_{n=0}^{M} s(n)\,s(n-j) + 2 \sum_{n=0}^{M} s(n-j) \sum_{k=1}^{p} a_k s(n-k) = 0 \tag{2.4c}$$
Recognizing that the autocorrelation function can be written in the following forms:

$$R(j) = \sum_{n=0}^{M} s(n)\,s(n-j) \tag{2.5}$$

$$R(k-j) = \sum_{n=0}^{M} s(n-j)\,s(n-k) \tag{2.6}$$

Equation (2.4c) can then be reduced to a simple autocorrelation form, as shown in Equation (2.7):

$$R(j) = \sum_{k=1}^{p} a_k R(k-j) \tag{2.7}$$

The predictor coefficients $a_k$ can be determined by solving Equation (2.7). This can be done using either the autocorrelation or the covariance method. The autocorrelation method is preferred over the covariance method because the former is simpler and faster to compute (Rabiner, 1976 and 1993).
In the autocorrelation method, we have to pre-determine a range for the parameter $n$, so that the speech segment $s(n)$ is set to zero outside that range. The range is expressed as $0 \le n \le M-1$. This is simply done by applying a window to the speech segment; the typical weighting window is the Hamming window. The purpose of windowing is to taper the signal near $n = 0$ and near $n = M-1$, so as to minimize the errors at the speech segment boundaries. Based on the windowed signal, the mean squared error becomes

$$E = \sum_{n=0}^{M-1+p} e^2(n) \tag{2.8}$$
and the autocorrelation function can correspondingly be expressed as

$$R(k-j) = \sum_{n=0}^{M-1+p} s(n-j)\,s(n-k), \quad 1 \le j \le p,\ 1 \le k \le p \tag{2.9a}$$

or

$$R(k-j) = \sum_{n=0}^{M-1-(k-j)} s(n)\,s(n+k-j), \quad 1 \le j \le p,\ 1 \le k \le p \tag{2.9b}$$

Since the autocorrelation function is symmetric, that is, $R(j) = R(-j)$ and $R(j-k) = R(k-j)$, the LPC equations can be rewritten as

$$\sum_{k=1}^{p} R(|k-j|)\,a_k = R(j), \quad 1 \le j \le p \tag{2.10}$$
Equation (2.10) can also be expressed in matrix form as

$$\begin{bmatrix}
R(0) & R(1) & R(2) & \cdots & R(p-1) \\
R(1) & R(0) & R(1) & \cdots & R(p-2) \\
R(2) & R(1) & R(0) & \cdots & R(p-3) \\
\vdots & \vdots & \vdots & & \vdots \\
R(p-1) & R(p-2) & R(p-3) & \cdots & R(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(p) \end{bmatrix} \tag{2.11}$$

The matrix is a special matrix known as a Toeplitz matrix, in which all the elements along each diagonal are identical. This symmetric system can be solved efficiently by an iterative method known as the Levinson-Durbin algorithm (Haykin, 2001), which yields the predictor coefficients.
2.3 Speech Recognition Approaches
Speech recognition deals with the recognition of specific individual speech sounds, while speech classification groups the speech sounds according to their specific classes of sounds (e.g. /b, d, g/ as consonant phonemes). Speech recognition can be performed using Dynamic Time Warping (DTW), the Hidden Markov Model (HMM) or the Artificial Neural Network (ANN).
2.3.1 Dynamic Time Warping (DTW)
We now describe the Dynamic Time Warping algorithm, one of the oldest and most important algorithms in speech recognition (Itakura, 1975; Sakoe and Chiba, 1978). The simplest way to recognize an isolated word sample is to compare it against a number of stored word templates and determine which is the "best match", as shown in Figure 2.3. This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations (temporal variation). This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the word; in other words, the optimal alignment between a template and the speech sample may be nonlinear.

DTW is able to achieve promising accuracy, higher than 95%, in digit recognition (Sakoe and Chiba, 1978; Ting et al., 2001a). DTW is only suitable for the recognition of small vocabularies because it is computationally intensive; it is not practicable in a real-time system when the vocabulary is large. In order to apply DTW to the word recognition problem, we need to know the beginning and ending points of the words, which in noisy conditions is not a trivial task.
Figure 2.3: Dynamic Time Warping (DTW); the time-warping path aligns the sample with the reference template.
The advantages of DTW are:
• Efficient hardware implementations exist.
• The training sequence is simple, since it just involves feature extraction for the words that need to be recognized.

The disadvantages of DTW are:
• It is not suitable for continuous speech recognition.
• It requires the computation of the beginning and ending points of the word.
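For concreteness, a minimal sketch of the DTW alignment follows (illustrative Python with Euclidean local distances; practical systems add slope constraints and path weighting on top of this basic recurrence):

```python
import numpy as np

def dtw_distance(template, sample):
    """Cumulative cost of the best nonlinear alignment between a stored
    word template and an unknown sample (both sequences of feature
    vectors). A smaller cost means a better match."""
    T, S = len(template), len(sample)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            cost = np.linalg.norm(template[i - 1] - sample[j - 1])
            # local continuity: diagonal match, insertion or deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, S]

# Recognition picks the template with the smallest warped distance, e.g.:
# best = min(templates, key=lambda w: dtw_distance(templates[w], sample))
```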
2.3.2 Hidden Markov Model (HMM)
Hidden Markov Models (HMM) (Rabiner, 1989) are essentially statistical models that assign the greatest likelihood or probability to the occurrence of the observed input pattern. An HMM is a doubly stochastic process with a hidden underlying process. It represents speech by a sequence of states, each representing a piece of the input signal; the states of the HMM correspond to phones, biphones or triphones. At each state there is a probability distribution over the possible output symbols, and a transition probability to the next state. The speech recognition process then boils down to finding the most probable path. The training procedure for an HMM-based recognizer is more complex than for a DTW-based recognizer (Rabiner et al., 1989; Lee, 1988; Woodland et al., 1994; Huang, 1992).

The advantages of an HMM-based approach are:
• It is easy to incorporate other information, such as speech and language models.
• Continuous HMMs are powerful for continuous speech recognition.

The disadvantages of an HMM-based approach are:
• The HMM probability density models (discrete, continuous and semi-continuous) have suboptimal modeling accuracy. Specifically, discrete density HMMs suffer from quantization errors, while continuous or semi-continuous density HMMs suffer from model mismatch.
• The Maximum Likelihood training criterion leads to poor discrimination between the acoustic models. Discrimination can be improved using the Maximum Mutual Information training criterion, but this is more complex and difficult to implement properly.
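To illustrate what "finding the most probable path" means, the following toy Python sketch implements the standard Viterbi algorithm over log-probabilities. It is an illustration only: a real recognizer combines this decoding with acoustic, lexical and language models, and the training itself is a separate procedure.

```python
import numpy as np

def viterbi(obs_logprob, log_trans, log_init):
    """obs_logprob[t, s] = log P(observation at frame t | state s),
    log_trans[s, s2] = log transition probability from state s to s2,
    log_init[s] = log prior of starting in state s.
    Returns the most probable state path."""
    T, S = obs_logprob.shape
    delta = log_init + obs_logprob[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)         # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # S x S candidate scores
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + obs_logprob[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):              # trace the path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```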
2.3.3 Artificial Neural Network (ANN)
The Artificial Neural Network (ANN) (Aleksander and Morton, 1990; Hansen and Salamon, 1990; Fausett, 1994) is an emerging technology in speech recognition and classification. An ANN is basically an information-processing system that has certain performance characteristics in common with biological neural networks; it processes information in a parallel, distributed manner. Although its major drawback is the long training time, it is still widely applied in speech recognition systems because it offers many advantages, such as non-linearity, the ability to adapt or learn, robustness and the ability to generalize (Lippmann, 1989; Tebelskis, 1995; Pablo, 1998).

The Multilayer Perceptron (MLP) (Bourland and Wellekens, 1987; Haykin, 1994; Ahad et al., 2002) is one of the most popular neural network architectures. A basic architecture of the MLP is shown in Figure 2.4. It uses supervised learning, adapting its weights in response to the teacher values of the training patterns. Its back-propagation (BP) learning propagates the errors at the output layer back to the hidden and input layers in order to adjust the weights (Rumelhart et al., 1986; Pandya and Macy, 1996). It is a universal function approximator which can solve problems efficiently. Besides, its fast execution speed makes it practical for real-time processing. It has been used to perform recognition of speech sounds at the phoneme, syllable and even isolated word level (Peeling and Moore, 1987 and 1988; Gold, 1988; Kammerer and Kupper, 1990; Jurgen, 1996; Lee and Ching, 1999; Siva, 2000; Ting et al., 2001c).
Figure 2.4: A basic architecture of the Multilayer Perceptron (MLP): an input layer, a hidden layer and an output layer of neurons connected by weights.
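A minimal sketch of the forward computation of such a network (assuming sigmoid units throughout; the names and shapes are illustrative) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_hid, b_hid, W_out, b_out):
    """Forward pass of a three-layer MLP: each layer computes a weighted
    sum of its inputs plus a bias, followed by a sigmoid activation."""
    h = sigmoid(W_hid @ x + b_hid)   # hidden-layer activations
    y = sigmoid(W_out @ h + b_out)   # output-layer activations
    return h, y
```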
2.4 Comparison between Speech Recognition Approaches
Table 2.1 and Table 2.2 show the comparison between different speech recognition approaches based on the literature reviews (Rabiner, 1993; Grant, 1991).
Table 2.1: The comparison between different speech recognition approaches.

Speech Sampling (ALL approaches)
  Relevant variables/data structures: Analog Speech Signal
  Input: Analog Speech Signal
  Output: Digital Speech Samples

Feature Extraction
  DTW - data structure: Statistical Features (LPC coefficients); input: Digital Speech Samples; output: Acoustic Sequence Templates
  HMM - data structure: Subword Features (phonemes); input: Digital Speech Samples; output: Subword Features (phonemes)
  ANN - data structure: Statistical Features (LPC coefficients); input: Digital Speech Samples; output: Statistical Features (LPC coefficients)

Training and Testing
  DTW - data structure: Reference Model Database; input: Acoustic Sequence Templates; output: Comparison Score
  HMM - data structure: Markov Chain; input: Subword Features (phonemes); output: Comparison Score
  ANN - data structure: Neural Network with Weights; input: Statistical Features (LPC coefficients); output: Positive/Negative Output
Table 2.2: The performance comparison between different speech recognition approaches.

DTW
• Mostly used for isolated, digit and connected word recognition.
• Small vocabulary size.
• Training is simple.

HMM
• Mostly used for continuous word recognition.
• Large vocabulary size.
• Training is complex.

ANN
• Mostly used for isolated, connected and continuous word recognition.
• Medium vocabulary size.
• Training is time-consuming.
2.5 Review of Artificial Neural Networks
The Artificial Neural Network (ANN) is an extremely powerful computational device (Anderson and Rosenfeld, 1988; Fausett, 1994; Haykin, 1994). Its massive parallelism makes it very efficient. ANNs can learn and generalize from training data, and they are particularly fault-tolerant as well as noise-tolerant. In principle, they can do anything that a symbolic or logic system can do. There are many forms of ANN; most operate by passing neural activations through a network of connected neurons, such as the MLP, the SOM and the Hopfield network. One of the most powerful features of neural networks is their ability to learn and generalize from a set of training data: they adapt the weights of the connections between neurons so that the final output activations are correct.

The goal of the network is to learn some association between input and output patterns. This learning is achieved through the modification of the connection weights between units. In statistical terms, this is equivalent to interpreting the value of the connections between units as parameters to be estimated. The model of the network specifies the learning algorithm to be used. The sections below briefly review the fundamentals of neural networks.
2.5.1 Processing Units
A neural network contains a potentially huge number of simple processing
units. All these units operate simultaneously, supporting massive parallelism. All
computation in the system is performed by these units. At each moment in time,
each unit simply computes a scalar function of its local inputs, and broadcasts the
result to its neighboring units. A basic neuron processing unit is shown in Figure
2.5. The units in a network are typically divided into input units, which receive data
from the environment; hidden units, which may internally transform the data
representation; and/or output units, which represent decisions or control signals.
Figure 2.5: A basic neuron processing unit, with weighted connections carrying inputs between neuron i and neuron j.
2.5.2 Connections
The units in a network are organized into a given topology by a set of connections or weights. Weights are usually one-directional (from input units towards output units), but they may be two-directional, especially when there is no distinction between input and output units. Weights can be changed as a result of training, but they tend to change slowly, because accumulated knowledge changes slowly. A network can be connected in any kind of topology. Common topologies include unstructured, layered, recurrent and modular networks, as shown in Figure 2.6. Each kind of topology is best suited to a particular type of application.
2.5.3 Computation
Computation always begins by presenting an input pattern to the network. Then the activations of all the remaining units are computed, either synchronously or asynchronously. In layered networks this is called forward propagation, as it progresses from the input layer to the output layer. In feed-forward networks the activations stabilize as soon as the computations reach the output layer, but in recurrent networks the activations may never stabilize.
Figure 2.6: Neural network topologies: (a) Unstructured, (b) Layered, (c) Recurrent and (d) Modular.
2.5.4 Training
Training a network means adapting its connections so that the network exhibits the desired computational behavior for all input patterns. The process usually involves modifying the weights, but sometimes it also involves modifying the actual topology of the network. In a sense, weight modification is more general than topology modification; however, topological changes can improve both generalization and the speed of learning. In general, networks are nonlinear and multilayered, and their weights can be trained only by an iterative procedure, such as gradient descent on a global performance measure. This requires multiple passes of training over the entire training set; each pass is called an iteration or an epoch.

Moreover, the weights must be modified very gently so as not to destroy all the previous learning. A small constant called the learning rate is used to control the magnitude of weight modifications. Finding a good value for the learning rate is very important: if the value is too small, learning takes forever, but if the value is too large, learning disrupts all the previous knowledge. Unfortunately, there is no analytical method for finding the optimal learning rate; it is usually optimized empirically by trying different values.
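To illustrate the role of the learning rate, here is a minimal, self-contained sketch of one gradient-descent weight update for a sigmoid MLP with a squared-error loss (illustrative only; it follows the standard error-backpropagation derivation rather than any particular implementation in this thesis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W_hid, b_hid, W_out, b_out, eta=0.1):
    """One weight update with learning rate eta on a single (x, target) pair."""
    # forward propagation
    h = sigmoid(W_hid @ x + b_hid)
    y = sigmoid(W_out @ h + b_out)
    # error information terms (deltas) for sigmoid units
    delta_out = (y - target) * y * (1.0 - y)
    delta_hid = (W_out.T @ delta_out) * h * (1.0 - h)
    # gentle modification: too large an eta disrupts the previous learning,
    # too small an eta makes learning take forever
    W_out -= eta * np.outer(delta_out, h)
    b_out -= eta * delta_out
    W_hid -= eta * np.outer(delta_hid, x)
    b_hid -= eta * delta_hid
    return 0.5 * float(np.sum((y - target) ** 2))   # error before the update
```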
2.6 Types of Neural Networks
We now give an overview of some different types of networks, organized in terms of the learning procedures they use. There are three main classes of learning procedures. Most networks fall into one of these categories, but there are also networks, such as hybrid networks, which straddle these categories.
2.6.1 Supervised Learning
Supervised learning means that a “teacher” provides output targets for each
input pattern, and corrects the network’s errors explicitly. This paradigm can be
applied to many types of networks, both feed-forward and recurrent in nature.
Perceptrons (Rosenblatt, 1962) are the simplest type of feed-forward networks that use supervised learning. A perceptron is comprised of binary threshold units arranged into layers, as shown in Figure 2.7(a).

A Multilayer Perceptron (MLP), as shown in Figure 2.7(b), may have any number of hidden layers, although a single hidden layer is sufficient for many applications, and additional hidden layers tend to make training slower. An MLP can also be architecturally constrained in various ways, for instance by limiting its connectivity to geometrically local areas, by limiting the values of the weights, or by tying different weights together. An MLP can theoretically learn any function, but it is more complex to train. However, if an MLP uses a sigmoid function rather than a threshold function, it becomes possible to use partial derivatives and the chain rule to derive the influence of any weight on any output activation, which in turn indicates how to modify that weight in order to reduce the network's error. This generalization of the Delta Rule is known as backpropagation.
Figure 2.7: Perceptron: (a) Single-layer Perceptron (b) Multilayer Perceptron.
Hopfield (1982) studied neural networks that implement a kind of content-addressable associative memory. He worked with unstructured networks of binary
threshold units with symmetric connections, in which activations are updated
asynchronously. This type of recurrent network is now called a Hopfield network.
2.6.2 Semi-Supervised Learning
In semi-supervised learning, an external teacher does not provide explicit
targets for the network’s outputs, but only evaluates the network’s behavior as
“good” or “bad”. The environment may be either static or dynamic; for example, the definition of “good” behavior may be fixed or may change over time.
The problem of semi-supervised learning is reduced to the problem of supervised
learning, by setting the training targets to be either the actual outputs or their
negations, depending on whether the network’s behavior was judged “good” or
“bad”. The network is then trained using the Delta Rule, where the targets are
compared against the network’s mean outputs, and error is backpropagated through
the network if necessary (Barto and Anandan, 1985).
2.6.3 Unsupervised Learning
In unsupervised learning, there is no teacher, and a network must detect
regularities and similarities in the input data by itself. Such self-organizing networks
can be used for compressing, clustering, quantizing, classifying, or mapping input
data. This type of network is often called an encoder, especially when the inputs or
outputs are binary vectors. We also say that this network performs dimensionality
reduction.
One type of unsupervised network is based on competitive learning, in which a single output unit is considered the "winner"; such networks are known as winner-take-all networks. The winning unit may be found by lateral
inhibitory connections on the output units. Competitive learning is useful for
clustering the data, in order to classify or quantize input patterns (Hertz et al., 1991).
Kohonen (1988b, 1995 and 2002) developed a competitive learning algorithm
which performs feature mapping called Self-Organizing Map (SOM). SOM is a
neural network that acts like a transformer which maps an m-dimensional input
vector into n-dimensional space while locally preserving the topology of the input
data. This explains why a SOM is called a feature map: relevant
features are extracted from the input space and presented in the output space in an
ordered manner. It is always possible to reverse the mapping and restore the original
set of data to the original m-dimensional space with a bounded error. The bound on
this error is determined by the architecture of the network and the number of
neurons.
2.6.4 Hybrid Networks
Some networks combine supervised and unsupervised training in different layers (Hsieh and Chen, 1993; Fritzke, 1994). Most commonly, unsupervised training is applied at the lowest layer in order to cluster the data, and then backpropagation is applied at the higher layer to associate these clusters with the desired output patterns. The attraction of hybrid networks is that they reduce the multilayer backpropagation algorithm to the single-layer Delta Rule, considerably reducing training time. On the other hand, since such networks are trained in terms of independent modules rather than as an integrated whole, they have somewhat less accuracy than networks trained entirely with backpropagation.
2.7 Related Research
Many early researchers tried to apply neural network approaches to the speech recognition problem, because speech recognition is a pattern recognition task and neural networks are good at pattern recognition. The earliest attempts involved highly simplified tasks, for example, classifying speech segments as voiced/unvoiced, or nasal/fricative/plosive. Success in these experiments encouraged researchers to move on to phoneme or subword classification. The same techniques also achieved some success at the level of word recognition, although it became clear that there were scaling problems when moving to the level of sentences or to larger vocabularies.
Basically, there are two approaches to speech classification using neural networks: static and dynamic. In static classification, all of the input speech is fed into the neural network at once, and the network then makes a single decision to classify the speech. By contrast, in dynamic classification, only a small window of the speech is fed into the network, and this window slides over the input speech while the network makes a series of local decisions. These local decisions then have to be integrated into a global decision at a final stage. Static classification works well for phoneme recognition but scales poorly to the level of words or sentences, whereas dynamic classification scales better. The sections below review research on the static approach for phoneme/subword classification and word classification.
2.7.1 Phoneme/Subword Classification
Huang and Lippmann (1988) performed a simple experiment to show that neural networks can form complex decision surfaces from speech data. They applied an MLP with only 2 inputs, 50 hidden nodes, and 10 outputs to Peterson and Barney's (1952) collection of vowels produced by men, women and children, using the first two formants of the vowels as the input speech representation. After 50,000 iterations of training, the network produced the decision regions shown in Figure 2.8. These decision regions are nearly optimal, resembling the decision regions that would be drawn by hand, and they yield classification accuracy comparable to that of more conventional algorithms.
Figure 2.8: Decision regions formed by a 2-layer Perceptron using backpropagation
training and vowel formant data.
Elman and Zipser (1987) trained a network to classify the vowels /a, i, u/ and
the consonants /b, d, g/ as they occur in the utterances /ba, bi, bu/, /da, di, du/ and
/ga, gi, gu/. Their network input consisted of 16 spectral coefficients over 20 frames
and was fed into a hidden layer with between 2 and 6 units, leading to 3 outputs for
either vowel or consonant classification. This network achieved an acceptable result
with error rates of 0.5% for vowels and 5.0% for consonants. An analysis of the
hidden units showed that they tend to be feature detectors, discriminating between
important classes of sounds, such as consonants and vowels. The experimental results demonstrate that backpropagation learning can cope well with complex and natural data.
Among the difficult tasks in classification is the so-called E-set, that is, discriminating between the rhyming English letters "B, C, D, E, G, P, T, V, and Z". Burr (1988) applied a static network to this task, with very good results. His network used an input window of 20 spectral frames, automatically extracted from the whole utterance using energy information. These inputs led directly to 9 outputs representing the E-set letters. The network was trained and tested using 180 tokens from a single speaker. Its recognition accuracy was high, mostly above 99%.
Lee and Ching (1999) proposed a neural-network-based speech recognition system for isolated Cantonese syllables. The system consists of a tone recognizer and a base syllable recognizer. The tone recognizer adopts an MLP architecture in which each output neuron represents a particular tone. The syllable recognizer contains a large number of independently trained recurrent networks, each representing a designated Cantonese syllable. A speaker-dependent recognition system was built with the vocabulary growing from 40 syllables to 200 syllables. In the case of 200 syllables, three experiments were conducted on the proposed system, which achieved a top-1 accuracy (correct syllable ranked first) of 81.8% and a top-3 accuracy (correct syllable among the three best candidates) of 95.2%.
2.7.2 Word Classification
Peeling and Moore (1987) applied MLP to digit recognition with excellent results. They used a static input buffer of 60 frames (1.2 seconds) of spectral coefficients, long enough for the longest spoken word; shorter words were padded with zeros and positioned randomly in the 60-frame buffer. Evaluating a variety of MLP topologies, they obtained the best performance with a single hidden layer of 50 units. Compared with an HMM on a 40-speaker database of digits, error rates were 0.25% vs. 0.2% in speaker-dependent experiments and 1.9% vs. 0.6% in multi-speaker experiments. In addition, the MLP was five times faster than the HMM system.
Kammerer and Kupper (1990) applied a variety of networks to the TI 20-word database, finding that a Single-Layer Perceptron (SLP) outperformed both MLP and DTW template-based recognizers in many cases. They used a static input buffer of 16 frames, into which each word was linearly normalized, with 16 2-bit coefficients per frame. Error rates for the SLP vs. DTW were 0.4% vs. 0.7% in speaker-dependent experiments, and 2.7% vs. 2.5% in speaker-independent experiments.
Burr (1988) applied MLP to the more difficult task of alphabet recognition.
He used a static input buffer of 20 frames, into which each spoken letter was linearly
normalized, with 8 spectral coefficients per frame. Training on three sets of the 26
spoken letters and testing on a fourth set, an MLP achieved an error rate of 15% in
speaker-dependent experiments, matching the accuracy of a DTW template-based
approach.
Kohonen (1988 and 1992) described a microprocessor-based real-time speech recognition system. It is able to produce orthographic transcriptions for arbitrary words or phrases uttered in Finnish or Japanese (Kohonen et al., 1988), and it can also be used as a large-vocabulary isolated word recognizer. The acoustic processor of the system, which transcribes speech into phonemes, is based on neural network principles: so-called Phonotopic Maps constructed by a self-organizing process are employed. Co-articulation effects in the phonetic transcriptions are compensated by correcting errors at the acoustic processor output. Without applying any language model, recognition accuracy is between 92% and 97% for individual letters.
Ha-Jin Yu and Yung-Hwan Oh (1996) proposed a subword-based neural network model for continuous speech recognition. The recognizer is built from simple neural network structures organized into three modules. The first module segments the input speech into non-uniform units; such units can model phoneme variations that spread over several phonemes and across word boundaries. The second module recognizes the segmented units; each unit has a stationary part and a transition part, and the network is divided according to these two parts. The last module spots words by modeling temporal representation, and the units are trained by the result of word detection rather than by the result of unit recognition itself. The results showed that the system can model such phoneme variations successfully.
2.7.3 Classification Using Hybrid Neural Network Approach
Keun-Rong Hsieh and Wen-Tsuen Chen (1993) proposed a neural network
architecture which combines unsupervised and supervised learning for pattern
recognition. The network is a hierarchical self-organization map, which is trained by
unsupervised learning first. When the network fails to recognize similar patterns,
supervised learning is applied to teach the network to give different scaling factors
for different features so as to discriminate similar patterns. Simulation results
showed that their proposed model obtained good generalization capability as well as
sharp discrimination between similar patterns.
Salmela et al. (1996) proposed a neural network capable of recognizing isolated spoken numbers speaker-independently. The recognition system is a hybrid architecture of SOM and MLP. The SOM maps the feature vectors of a word into a constant-dimension matrix, which is then classified by the MLP. The decision borders of the SOM were fine-tuned with the Learning Vector Quantization (LVQ) algorithm, with which the hybrid achieved over 99% recognition on 1232 test samples. The training convergence of the MLP was tested with two different initialization methods.
Kusumoputro (1999) proposed an adaptive recognition system based on the Kohonen Self-Organizing Map (KSOM). The goals of the research were to improve the recognition capability of the network and at the same time minimize the time needed for learning the patterns. These goals were achieved by combining two types of learning: supervised and unsupervised. The author developed a new kind of hybrid neural learning system, combining an unsupervised KSOM with supervised back-propagation learning rules, referred to as hybrid adaptive SOM with winning probability function and supervised BP, or KSOM(WPF)-BP. This hybrid system can estimate the cluster distribution of given data and direct it into a predefined number of cluster neurons through a creation and deletion mechanism. Experimental results showed that the KSOM-BP with winning probability function has a higher recognition rate than the previous KSOM-BP, even when using a smaller number of cluster neurons.
Tabatabai et al. (1994) proposed a hybrid neural network consisting of a SOM and a Perceptron for speaker-independent isolated word recognition. The novel idea in their system is the use of a SOM as the feature extractor, converting phonetic similarities of the speech frames into spatial adjacency in the map; this property simplifies the classification task. The system performance was evaluated on a limited number of Farsi words (the numbers "zero" to "ten"). The overall performance of their hybrid recognizer was 93.82%. The benefits of their system are speed and simplicity: it performs the recognition task in about one second, running on an IBM PC/AT 386 + 387 at 33 MHz.
2.8 Summary
In this chapter, we have reviewed some popular approaches to speech recognition systems and compared the differences between these approaches. Some fundamentals of neural networks were also reviewed in terms of topology and type of learning. Finally, related research was surveyed in the last section to show the different approaches and classification schemes used and the results obtained.
CHAPTER 3
SPEECH DATASET DESIGN
3.1 Human Speech Production Mechanism
All human speech sounds begin as pressure generated by the lungs, which pushes air through the vocal tract. The vocal tract consists of the pharynx, the mouth or oral cavity, and the nasal cavity, as shown in Figure 3.1. The sound produced depends on the state of the vocal tract as the air is pushed through it. The state of the vocal tract is determined by the position, shape and size of various articulators such as the lips, jaw, tongue and velum.
The human speech production mechanism involves the respiration of lungs
which provides the energy source, the phonation of vocal cords or folds which act as
source of sounds, the resonation of vocal tract which resonates the sounds from the
vocal folds and the articulation mechanism at the oral cavity which manipulates the
sounds from the vocal folds into various distinctive sounds.
Speech sounds can be produced in a relatively open oral cavity or through a constriction in the oral cavity, and they are produced as a continuous stream. As a result, the speech sounds have to be segmented into small units called phones for analysis. Each phone is written in brackets [ ] to indicate that it is a type of sound. Speech sounds are classified into vowels and consonants (Deller et al., 1993).
Figure 3.1: The vocal tract
3.2 Malay Morphology
Malay morphology is defined as the study of word structures in the Malay language (Lutfi Abas, 1971). A central term in morphology is the morpheme, the smallest meaningful unit in a language. In other words, a morpheme is a combination of phonemes into a meaningful unit. A Malay word can be comprised of one or more morphemes. When we talk about Malay morphology, we cannot avoid discussing the process of word formation in the Malay language.
Malay is one of the agglutinative languages of the world. It is a derivative language, allowing the addition of affixes to a base word to form new words. In this it differs from English, where word formation involves changes in the phonemes according to their groups. The processes of word formation in the Malay language take the forms of primary words, derivative words, compound words and reduplicative words.
3.2.1 Primary Word
A primary word is a word that does not take any affixes or reduplication. A primary word can be comprised of one or more syllables. A syllable consists of a vowel alone, or a vowel with one or more consonants; the vowel can appear before or after the consonants.
In the Malay language, there are only about 500 primary words with one syllable (Nik Safiah Karim et al., 1995), some of which are taken from other languages such as English and Arabic. The structures of such syllables are shown in Table 3.1 and Figure 3.2, where C stands for consonant and V stands for vowel. Primary words with two syllables account for the majority of Malay primary words; their structures are shown in Table 3.2 and Figure 3.3. Primary words with three or more syllables exist only in small numbers, and most of them are taken from other languages. Their structures are shown in Table 3.3 and Figure 3.4.
Table 3.1: Structure of words with one syllable.

Syllable Structure    Example of Word
CV                    Ya (yes)
VC                    Am (common)
CVC                   Sen (cent)
CCVC                  Stor (store)
CVCC                  Bank (bank)
CCCV                  Skru (screw)
CCCVC                 Skrip (script)
Figure 3.2: Structure of the one-syllable words "Ya" (CV) and "Stor" (CCVC); C = consonant, V = vowel.
Table 3.2: Structure of words with two syllables.

Syllable Structure    Example of Words
V + CV                Ibu (mother)
V + VC                Air (water)
V + CVC               Ikan (fish)
VC + CV               Erti (meaning)
VC + CVC              Empat (four)
CV + V                Doa (pray)
CV + VC               Diam (silent)
CV + CV               Guru (teacher)
CV + CVC              Telur (egg)
CVC + CV              Lampu (lamp)
CVC + CVC             Jemput (invite)
Figure 3.3: Structure of the two-syllable words "Guru" (CV + CV) and "Jemput" (CVC + CVC); C = consonant, V = vowel.
Table 3.3: Structure of words with three syllables or more.

Syllable Structure             Example of Words
CV + V + CV                    Siapa (who)
CV + V + CVC                   Siasat (investigate)
V + CV + V                     Usia (age)
CV + CV + V                    Semua (all)
CV + CV + VC                   Haluan (direction)
CVC + CV + VC                  Berlian (diamond)
V + CV + CV                    Utara (north)
VC + CV + CV                   Isteri (wife)
CV + CV + CV                   Budaya (culture)
CVC + CVC + CV                 Sempurna (perfect)
CVC + CV + CVC                 Matlamat (aim)
CV + CV + VC + CV              Keluarga (family)
CV + CVC + CV + CV             Peristiwa (event)
CV + CV + V + CVC              Mesyuarat (meeting)
CV + CV + CV + CVC             Munasabah (reasonable)
CV + CV + CV + CV              Serigala (wolf)
V + CV + CVC + CV + CV         Universiti (university)
CV + CV + CV + CV + CV + CV    Maharajalela (king)
3.2.2 Derivative Word
Derivative words are words formed by adding affixes to a base word. The affixes can occur at the beginning (prefixes), within (infixes) or at the end (suffixes) of words; they can also occur at the beginning and end at the same time, in which case they are called confixes. Examples of derivative words are "berjalan" (walking), "mempunyai" (having), "pakaian" (clothes) and so on.
3.2.3 Compound Word
Compound words are words combined from two individual primary words which together carry a particular meaning. There are many compound words in the Malay language; examples are "alat tulis" (stationery), "jalan raya" (road), "kapal terbang" (aeroplane), "Profesor Madya" (associate professor), "hak milik" (ownership), "pita suara" (vocal folds) and so on. Some Malay idioms come from compound words, such as "kaki ayam" (bare feet), "buah hati" (gift), "berat tangan" (lazy) and so on.
3.2.4 Reduplicative Word
Reduplicative words, as the name suggests, are words reduplicated from primary words. There are three forms of reduplication in the Malay language: full, partial and rhythmic. Examples of reduplicative words are "mata-mata" (policeman), "sama-sama" (welcomed) and so on.
3.3 Malay Speech Dataset Design
Malay speech dataset design basically involves the proper selection of speech target sounds for speech recognition. The Malay consonantal phonemes can be analyzed using descriptive analysis or distinctive feature analysis. Descriptive analysis provides an analysis using frequency, mean and factor analysis; a distinctive feature is a feature able to signal a difference in meaning by changing its plus or minus value. Generally, descriptive analysis is preferred over distinctive feature analysis because it is easier to implement.
3.3.1 Selection of Malay Speech Target Sounds
From the study, there are about 500 primary words with one syllable, and primary words with three or more syllables exist only in small numbers, most of them taken from other languages. The majority of Malay words are primary words with two syllables.
Among the Malay syllables, the structures CV and CVC are the most common. CV is preferred over CVC because it is easier to implement in the system and its number is quite substantial. Thus, the speech tokens selected are Malay syllables of CV structure, initialized with plosives and followed by vowels. These syllables can then be combined to form two-syllable Malay words.
In order to obtain a good distribution of consonants and vowels in the dataset, 15 Malay syllables were selected to form two-syllable Malay words. The syllables are of CV form, each initialized with a chosen consonant and followed by a chosen vowel. The 15 Malay one-syllable speech target sounds are shown in Table 3.4, and the most common two-syllable words formed from the selected syllables are shown in Table 3.5. The source is "Kamus Dewan" (Sheikh Othman Sheikh Salim et al., 1989).
Among the two-syllable Malay words listed in Table 3.5, 30 were chosen as target sounds for the dataset, selected on the basis of a similar distribution of syllables between the words. The 30 selected Malay two-syllable speech target sounds are shown in Table 3.6.
Table 3.4: 15 selected syllables used to form two-syllable words as target sounds.

No.   Malay Syllable
1     Ba
2     Bu
3     Bi
4     Ka
5     Ku
6     Ki
7     Ma
8     Mu
9     Mi
10    Sa
11    Su
12    Si
13    Ta
14    Tu
15    Ti
Table 3.5: Two-syllable Malay words combined using the 15 selected syllables.

No.    B-       K-       M-       S-       T-
1      Baba     KaKu     MaMa     SaSi     TaTa
2      BaBi     KaKi     MaMi     SaKa     TaTu
3      BaKa     KaBa     MaKa     SaKu     TaTi
4      BaKu     KaBu     MaKi     SaKi     TaBa
5      BaKi     KaMi     MaSa     SaMa     TaKa
6      BaMa     KaMu     MaSi     SaMu     TaKi
7      BaSa     KaSa     MaSu     SaMi     TaMu
8      BaSi     KaSi     MaTa     SaTa     TaMa
9      BaSu     KaTa     MaTi     SaTu     TaSu
10     BaTa     KaTi     MaTu     SaTi     TuTa
11     BaTu     KuKu     MuKa     SuSu     TuTi
12     BuKa     KuBu     MuKu     SuSa     TuBa
13     BuKu     KuMu     MuSa     SuKa     TuBu
14     BuMi     KuSa     MuSi     SuKu     TuBi
15     BuSu     KuSi     MuTu     SuMi     TuKu
16     BuSi     KuTu     MuTi     SiSa     TuKa
17     BuTa     KiTa     MiSi     SiSi     TuMi
18     Butu     KiSi     MiSa     SiBu     TuSi
19     BiBi     -        -        SiKu     TiTi
20     BiKa     -        -        SiKi     TiBa
21     BiKu     -        -        SiTi     TiBi
22     BiSa     -        -        SiTu     TiKa
23     BiSi     -        -        -        TiSu
24     BiSu     -        -        -        TiSi
25     BiTu     -        -        -        -
Total  25       18       18       22       24
Table 3.6: 30 selected Malay two-syllable words as the speech target sounds.

No.  Word    No.  Word    No.  Word
1    Baki    11   Kubu    21   Susu
2    Bata    12   Kita    22   Suka
3    Buka    13   Masa    23   Sisi
4    Buku    14   Mata    24   Situ
5    Bumi    15   Mati    25   Tamu
6    Bisu    16   Muka    26   Taba
7    Bisa    17   Mutu    27   Tubu
8    Kamu    18   Misi    28   Tubi
9    Kaki    19   Sami    29   Tiba
10   Kuku    20   Satu    30   Titi
For digit recognition, 10 Malay digit words were chosen as the target speech sounds. These 10 digit words contain different syllables, different combinations of consonants and vowels, and different numbers of syllables. The 10 Malay digit words used as speech target sounds for digit recognition are shown in Table 3.7.
Table 3.7: 10 selected digit words as the speech target sounds for digit recognition.

No.  Digit     Structure         No.  Digit     Structure
0    Kosong    CV + CVCC         5    Lima      CV + CV
1    Satu      CV + CV           6    Enam      V + CVC
2    Dua       CV + V            7    Tujuh     CV + CVC
3    Tiga      CV + CV           8    Lapan     CV + CVC
4    Empat     VC + CVC          9    Sembilan  CVC + CV + CVC
The purpose of the speech dataset for word recognition in Table 3.6 is to test the recognition performance of the system on Malay words built from combinations of CV-syllables. Because the CV-syllables can be combined to produce many Malay words, this design keeps the number of distinct target units small. The speech dataset for digit recognition is different from that for word recognition: the word structure is not only CV + CV, as in word recognition, but includes different structures such as CV + CVCC, CV + V, V + CVC and CVC + CV + CVC. The structures of the speech target words for digit recognition are given in Table 3.7.
3.3.2 Acquisition of Malay Speech Dataset
The speech dataset consists of a training dataset and a test dataset. The training dataset is used to train the neural networks (SOM and MLP) during the learning process. The test dataset is used after training to measure the recognition performance of the neural networks in terms of recognition rate; during testing, only data not used in training the SOM and MLP are presented. 50% of the total dataset is randomly chosen as training data, and the remaining data are used as test data.
It is not the purpose of this research to develop a full-scale speech recognizer with a large dataset, but to test new techniques by developing a prototype. With this goal in mind, all the approaches were tested on word recognition and digit recognition with a medium-sized dataset. All the experiments reported for word recognition used a training set and a testing set contributed by six speakers (3 male and 3 female students from the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia). Each speaker contributed 300 utterances for the training set (10 repetitions x 30 words) and 300 utterances for the testing set (10 repetitions x 30 words). Therefore, a total of 3600 utterances were used for training (1800 utterances) and testing (1800 utterances).
For digit recognition, all the experiments reported used a training set and a testing set contributed by the same speakers. Each speaker contributed a training set of 100 utterances (10 for each target word) and a testing set of 100 utterances (10 for each target word), giving a total of 1200 utterances for training and testing. Tables 3.8 and 3.9 show the specifications of the datasets for word recognition and digit recognition respectively.
Table 3.8: Specification of dataset for word recognition.

Dataset    Speaker  Number of Words  Samples per Word  Total
Training   6        30               10                1800
Testing    6        30               10                1800
Overall                                                3600
Table 3.9: Specification of dataset for digit recognition.

Dataset    Speaker  Number of Words  Samples per Word  Total
Training   6        10               10                600
Testing    6        10               10                600
Overall                                                1200
Speech recognition with a single neural network such as an MLP requires a lot of training data, because an MLP needs more training data to learn well and achieve optimal performance. Generally, more data are needed for training than for testing, which becomes a drawback for the MLP as the amount of test data increases. However, the same number of data is used for training and testing in this experiment, in order to test the performance of the proposed algorithm that applies a SOM to MLP-based speech recognition: the SOM is good and efficient at data optimization and clustering, which may mitigate the MLP drawback mentioned above.
According to Rabiner (1993), the speech signal can be classified into three states: silence, unvoiced and voiced. For voiced sounds such as vowels, voiced stops or other quasi-periodic pulses, the frequency of interest is in the region of 0 – 4 kHz. For unvoiced sounds such as unvoiced stops or random noise, the frequency of interest is in the region of 4 – 8 kHz.
A sound editor, Sound Forge 6.0, was used to record the speakers' voices. The recorded voice samples are saved as .wav files, each named according to its speech sound followed by an index. For example, the speech sound "Buka" is saved as "Buka01.wav", "Buka02.wav", "Buka03.wav" and so on; similarly, the digit sound "Satu" is saved as "Satu01.wav", "Satu02.wav", "Satu03.wav" and so on. The sampling rate was set at 16 kHz with 16-bit resolution. The recording was conducted in a normal laboratory environment with ambient noise of 75.40 dB, via a high-quality unidirectional microphone.
According to the Nyquist theorem (Parsons, 1986), in order to capture a signal, the sampling rate must be at least twice the highest frequency in the signal. To capture the unvoiced sounds, whose frequencies of interest extend to 8 kHz, a sampling rate of 16 kHz is sufficient; thus a 16 kHz sampling rate was used to obtain the speech dataset.
3.4 Summary
In this chapter, some fundamental concepts of Malay morphology have been explained in order to clarify how phonemes combine into meaningful units. The dataset of this research consists of a dataset for digit recognition and a dataset for word recognition, both collected from 6 human speakers. The collected dataset is divided into 2 sets, used for network training and testing respectively.
CHAPTER 4
FEATURE EXTRACTION AND CLASSIFICATION ALGORITHM
4.1 The Architecture of Speech Recognition System
In this work, the proposed speech recognition model is divided into two stages, shown by the schematic diagram in Figure 4.1. The Feature Extractor (FE) block generates a sequence of feature vectors through the speech processing module, and a SOM then transforms the feature vectors into binary matrices before they proceed to the Recognizer. In the second stage, the Recognizer performs the binary matrix recognition.
[Speech Signal -> Speech Processing -> SOM (Feature Extractor) -> MLP (Recognizer) -> Output]

Figure 4.1: Proposed speech recognition model.
4.2 Feature Extractor (FE)
The objective of the FE block is to use a priori knowledge to transform an input in the signal space into an output in a feature space that satisfies some desired criteria (Rabiner, 1993). If many clusters in a high-dimensional space must be classified, the objective of FE is to transform that space so that classification becomes easier. The FE block in speech recognition thus aims at reducing the complexity of the problem before the next stage starts to work with the data. Motivated by biological findings, neural networks such as the SOM (Kohonen, 1995) have been combined with and utilized in the design of speech recognition feature extractors. The proposed FE block is divided into two sub-blocks, as shown in Figure 4.2: the first is based on speech processing techniques, and the second uses a SOM for dimensionality reduction.
[Speech Signal -> Sampling -> Frame Blocking -> Pre-emphasis -> Hamming Windowing -> Autocorrelation Analysis -> LPC Analysis -> Cepstrum Analysis -> Endpoint Detection -> Parameter Weighting -> Self-Organizing Map (SOM) -> Binary Matrix]

Figure 4.2: Feature Extractor (FE) schematic diagram.
4.2.1 Speech Sampling
It might be argued that a higher sampling frequency, or more sampling
precision, is needed to achieve higher recognition accuracy. However, if a normal
digital phone, which samples speech at 8,000 Hertz with an 8 bit precision, is able to
preserve most of the information carried by the signal (Kohonen, 1995), it does not
seem necessary to increase the sampling rate beyond 11,025 Hertz or the sampling precision beyond 8 bits. Another reason behind these settings is
that commercial speech recognizers typically use comparable parameter values and
achieve impressive results.
In this work, the incoming signal was sampled at 16 kHz with 16 bits of precision, as shown in Figure 4.3. This is because the speech to be recognized by the proposed system includes not only voiced speech but also unvoiced sounds such as /s/; therefore a higher sampling frequency and greater sampling precision were chosen. The speech was recorded and sampled using an off-the-shelf, relatively inexpensive dynamic microphone and a standard PC sound card.
Figure 4.3: Speech signal for the word kosong01.wav, sampled at 16 kHz with a
precision of 16 bits.
4.2.2 Frame Blocking
The speech signal is dynamic, or time-variant, in nature. According to Rabiner (1993), the speech signal can be assumed stationary when it is examined over a short period of time. In order to analyze the speech signal, it has to be blocked into frames of N samples, with adjacent frames separated by M samples, as shown in Figure 4.4; in other words, each frame is shifted by M samples from the adjacent frame. If the shift is small, the LPC spectra estimated from frame to frame will be very smooth. If there is no overlap between adjacent frames, parts of the speech signal will be lost entirely, and the resulting LPC spectral estimates of adjacent frames will contain a noisy component. The value of N can range from 100 to 400 samples at a 16 kHz sampling rate.
Figure 4.4: Blocking of the speech waveform into overlapping frames with analysis frame length N and shift length M.
4.2.3 Pre-emphasis
In general, the digitized speech waveform has a high dynamic range and suffers from additive noise. For this reason, pre-emphasis is applied to spectrally flatten the signal in order to make it less susceptible to finite precision effects (such as overflow and underflow) later in the speech processing. The most widely used pre-emphasis is the fixed first-order system shown in Equation (4.1).

H(z) = 1 - a z^{-1}, \quad 0.9 \le a \le 1.0    (4.1)
The most common value for a is 0.95 (Deller et al., 1993). The pre-emphasized signal can be expressed as Equation (4.2).

\tilde{s}(n) = s(n) - 0.95\, s(n-1)    (4.2)
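As an illustration, Equation (4.2) amounts to a one-line filter; the sketch below assumes the signal is a NumPy array and is not taken from the thesis implementation:

    import numpy as np

    def pre_emphasize(s, a=0.95):
        # s~(n) = s(n) - a*s(n-1), with s~(0) = s(0), per Equation (4.2).
        return np.append(s[0], s[1:] - a * s[:-1])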
4.2.4 Windowing
Each frame is windowed to minimize the signal discontinuities at its beginning and end, that is, to taper the signal toward zero at the frame boundaries. If we define the window as w(n), then the windowed signal is

\tilde{x}(n) = x(n)\, w(n), \quad 0 \le n \le N-1    (4.3)

A typical window is the Hamming window, which has the form

w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1    (4.4)
The value of the analysis frame length N must be long enough that the tapering effect of the window does not seriously affect the result.
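A minimal sketch of Equations (4.3) and (4.4) applied to a matrix of frames (illustrative, not the thesis code):

    import numpy as np

    def apply_hamming(frames):
        # Multiply each frame of length N by the Hamming window of Eq. (4.4).
        N = frames.shape[1]
        n = np.arange(N)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
        return frames * w  # Eq. (4.3): x~(n) = x(n) w(n)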
4.2.5 Autocorrelation Analysis
The windowed signal is then autocorrelated according to the equation

R(m) = \sum_{n=0}^{N-1-m} \tilde{x}(n)\, \tilde{x}(n+m), \quad m = 0, 1, 2, \ldots, p    (4.5)

where the highest autocorrelation lag p is the order of the LPC analysis. The selection of p depends primarily on the sampling rate. A speech spectrum can be represented as having an average density of two poles per kHz, so each kHz of sampling rate corresponds to one pole. In addition, a total of 3 – 4 poles are needed to adequately represent the source excitation spectrum and the radiation load (Rabiner, 1993). For a sampling rate of 16 kHz, the value of p can range between 16 and 20.
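For illustration, Equation (4.5) can be computed directly as below (a sketch; p = 16 is an assumed value within the stated 16-20 range):

    import numpy as np

    def autocorrelate(frame, p=16):
        # R(m) = sum_{n=0}^{N-1-m} x~(n) x~(n+m) for m = 0..p, per Eq. (4.5).
        N = len(frame)
        return np.array([np.dot(frame[: N - m], frame[m:]) for m in range(p + 1)])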
4.2.6 LPC Analysis
LPC analysis converts the autocorrelation coefficients R into the LPC parameters, which here are the LPC coefficients. The Levinson-Durbin recursive algorithm is used to perform the conversion from autocorrelation coefficients into LPC coefficients.
E_0 = R(0)    (4.6)

k_i = \left[ R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) \right] / E_{i-1}, \quad 1 \le i \le p    (4.7)

a_i^{(i)} = k_i    (4.8)

a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1    (4.9)

E_i = (1 - k_i^2)\, E_{i-1}    (4.10)
The set of equations (4.6) – (4.10) is solved recursively for i = 1, 2, ..., p, where p is the order of the LPC analysis. The k_i are the reflection or PARCOR coefficients, and the a_j are the LPC coefficients. The final solution for the LPC coefficients is given as

a_j = a_j^{(p)}, \quad 1 \le j \le p    (4.11)
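The recursion of Equations (4.6)-(4.11) translates directly into code; the sketch below is illustrative and assumes R holds R(0)..R(p):

    import numpy as np

    def levinson_durbin(R, p):
        # Solve for the LPC coefficients a_1..a_p following Eqs. (4.6)-(4.11).
        a = np.zeros(p + 1)
        E = R[0]                                     # Eq. (4.6)
        for i in range(1, p + 1):
            # Reflection (PARCOR) coefficient, Eq. (4.7)
            k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
            a_prev = a.copy()
            a[i] = k                                 # Eq. (4.8)
            for j in range(1, i):                    # Eq. (4.9)
                a[j] = a_prev[j] - k * a_prev[i - j]
            E = (1.0 - k * k) * E                    # Eq. (4.10)
        return a[1:]                                 # Eq. (4.11): a_j = a_j^(p)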
4.2.7 Cepstrum Analysis
LPC cepstral coefficients can be derived directly from the LPC coefficients. Figure 4.5 shows the pattern of the cepstral coefficients for a speech sound. The conversion is done using the following recursion:

c_0 = a_0    (4.12)

c_m = a_m + \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \quad 1 \le m \le p    (4.13)

c_m = \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \quad m > p    (4.14)
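A sketch of the recursion of Equations (4.13)-(4.14), assuming lpc holds a_1..a_p (illustrative only):

    import numpy as np

    def lpc_to_cepstrum(lpc, Q=None):
        # Convert LPC coefficients a_1..a_p into cepstral coefficients c_1..c_Q.
        p = len(lpc)
        Q = p if Q is None else Q
        a = np.concatenate(([0.0], lpc))   # 1-indexed view: a[1..p]
        c = np.zeros(Q + 1)
        for m in range(1, Q + 1):
            # Sum only over k with 1 <= m - k <= p so that a_{m-k} is defined.
            acc = sum((k / m) * c[k] * a[m - k] for k in range(max(1, m - p), m))
            c[m] = acc + (a[m] if m <= p else 0.0)   # Eq. (4.13) / Eq. (4.14)
        return c[1:]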
Figure 4.5: Cepstral coefficients of BU.cep
4.2.8 Endpoint Detection

The purpose of endpoint detection is to find the start and end points of the speech waveform (Rabiner and Sambur, 1975; Savoji, 1989; Yiying Zhang et al., 1997); the challenge is to locate the actual start and end points. A 3-level adaptive endpoint detection algorithm for isolated speech, based on time and frequency parameters, was developed to obtain the actual start and end points of the speech. The concept of Euclidean distance adopted in this algorithm can determine the segment boundaries between silence and voiced speech as well as unvoiced speech. The algorithm consists of three basic modules: background noise estimation, initial endpoint detection, and actual endpoint detection. The initial endpoint detection is performed using root-mean-square energy and zero-crossing rate, and the actual endpoint detection is performed using Euclidean distance within cepstral coefficients.
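As a rough illustration of the initial endpoint detection stage only (not the thesis's full 3-level algorithm; the thresholds and the assumption that the first frames are silent are assumptions for this sketch):

    import numpy as np

    def initial_endpoints(frames, energy_factor=3.0, zcr_factor=2.0, noise_frames=10):
        # Estimate noise statistics from the first (assumed silent) frames,
        # then mark frames as speech by RMS energy or zero-crossing rate.
        rms = np.sqrt(np.mean(frames ** 2, axis=1))
        zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
        speech = (rms > energy_factor * rms[:noise_frames].mean()) | \
                 (zcr > zcr_factor * zcr[:noise_frames].mean())
        idx = np.where(speech)[0]
        return (idx[0], idx[-1]) if len(idx) else (0, len(frames) - 1)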
4.2.9 Parameter Weighting
The lower-order cepstral coefficients are sensitive to the overall spectral slope, whereas the higher-order cepstral coefficients are sensitive to noise. To reduce these sensitivities, the cepstral coefficients are weighted with a standard technique. If we define the weighting window as w_m, then the general weighting is

\tilde{c}_m = w_m c_m, \quad 1 \le m \le p    (4.15)

w_m = 1 + \frac{p}{2} \sin\left( \frac{\pi m}{p} \right), \quad 1 \le m \le p    (4.16)
The weighted cepstral coefficients are then normalized to lie between -1 and +1. The normalization is expressed as

w_{normalized} = 2 \left( \frac{w - w_{min}}{w_{max} - w_{min}} \right) - 1    (4.17)
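A compact sketch combining Equations (4.15)-(4.17) (illustrative only):

    import numpy as np

    def weight_and_normalize(c):
        # Raised-sine weighting of c_1..c_p (Eqs. 4.15-4.16), then rescale
        # the weighted coefficients to [-1, +1] (Eq. 4.17).
        p = len(c)
        m = np.arange(1, p + 1)
        w = (1.0 + (p / 2.0) * np.sin(np.pi * m / p)) * c
        return 2.0 * (w - w.min()) / (w.max() - w.min()) - 1.0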
4.3 Self-Organizing Map (SOM)
One of the objectives of the present work is to reduce the dimension of the feature vectors by using a SOM in the speech recognition problem. Figure 4.6 shows a diagram of this dimensionality reduction: the dimensionality of the acoustic feature vectors (LP-cepstral coefficients) is reduced before they are fed into the recognizer block. In this manner, the classification is greatly simplified.
[cepstral coefficients -> Self-Organizing Map (SOM) -> binary matrix]

Figure 4.6: The SOM transforms the feature vectors generated by speech processing into a binary matrix, performing dimensionality reduction.
Kohonen (1988b, 1995 and 2002) proposed a neural network architecture which automatically develops self-organization properties during an unsupervised learning process, namely the Self-Organizing Map (SOM). All the input vectors of the utterances are presented to the network sequentially in time without specifying the desired output. After enough input vectors have been presented, the weight vectors from input to output nodes specify cluster or vector centers that sample the input space such that the point density function of the vector centers tends to approximate the probability density function of the input vectors. In addition, the weight vectors become organized such that topologically close nodes are sensitive to inputs that are physically similar in Euclidean distance. Kohonen has proposed an efficient learning algorithm for practical applications, and we used this learning algorithm in our proposed system.
Using the fact that the SOM is a Vector Quantization (VQ) scheme that preserves some of the topology of the original space (Villmann, 1999), the basic idea behind the approach proposed in this work is to use the output of a SOM, trained on the output of the speech processing block, to obtain a reduced feature representation (the binary matrix) that preserves some of the behavior of the original feature vectors. The problem is thus reduced to finding the correct number of neurons (the dimension of the SOM). The optimal dimension of the SOM has to be searched for, so that the SOM has enough neurons to reduce the dimensionality of the feature vector while keeping enough information to achieve high recognition accuracy (Kangas et al., 1992; Salmela et al., 1996; Gavat et al., 1998).
4.3.1 SOM Architecture
The architecture of a SOM is shown in Figure 4.7. The SOM consists of only one real layer of neurons, arranged in a 2-D lattice; a 2-D SOM can be represented graphically for visual display. This architecture implements a similarity measure using Euclidean distance; for normalized vectors this is equivalent to measuring the cosine of the angle between the input and weight vectors. Since the SOM algorithm uses the Euclidean metric to measure distances between data vectors, scaling of the variables was deemed an important step, and the value of every input vector was normalized between -1 and +1 before being fed into the network. Usually, the output of the network is given by the most active neuron, the winning neuron.
[Input: LP-cepstral coefficients X_1 ... X_p (p coefficients per frame); weights m_ij; output space: 2-D SOM lattice with winner node (output) and its neighborhood set]

Figure 4.7: The 2-D SOM architecture.
4.3.2 Learning Algorithm
The objective of the learning algorithm in SOM neural networks is the formation of a feature map which captures the essential characteristics of the p-dimensional input data and maps them onto a typically 2-D feature space. The learning algorithm captures two essential aspects of map formation, namely competition and cooperation between the neurons of the output lattice.

Denote by M_{ij}(t) = \{ m_{ij}^1(t), m_{ij}^2(t), \ldots, m_{ij}^N(t) \} the weight vector of node (i, j) of the feature map at time instance t, where i, j = 1, ..., M are the horizontal and vertical indices of the square grid of output nodes and N is the dimension of the input vector. Denoting the input vector at time t as X(t), the learning algorithm can be summarized as follows (Kohonen, 1995; Anderson, 1999; Yamada et al., 1999):
1. Initializing the weights
Prior to training, each node's weights must be initialized. Typically these will
be set to small standardized random values. The weights in the SOM in this
research are initialized so that 0 < weight < 1.
2. Calculating the winner node - Best Matching Unit (BMU)
To determine the BMU, one method is to iterate through all the nodes and calculate the Euclidean distance between each node's weight vector and the current input vector. The node with the weight vector closest to the input vector is tagged as the BMU. The Euclidean distance is given as:

Dist = \sqrt{ \sum_{i=0}^{n} \left( X_i(t) - M_{ij}(t) \right)^2 }    (4.18)

The node with minimum Euclidean distance to the input vector X(t) is selected:

\| X(t) - M_{i_c j_c}(t) \| = \min_{i,j} \{ \| X(t) - M_{ij}(t) \| \}    (4.19)
3. Determining the Best Matching Unit's local neighborhood
For each iteration, after the BMU has been determined, the next step is to calculate which of the other nodes are within the BMU's neighborhood by computing the neighborhood radius. Figure 4.7 shows an example of the size of a typical neighborhood close to the commencement of training. The area of the neighborhood shrinks over time using the exponential decay function:

\sigma(t) = \sigma_0 \exp\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots    (4.20)

where \sigma_0 denotes the width of the lattice at time t = 0, \lambda denotes a time constant, and t is the current time-step. If a node is found to be within the neighborhood, its weight vector is adjusted as shown in the next step.
4. Adjusting the weights
Every node within the BMU's neighborhood, including the BMU (i_c, j_c) itself, has its weight vector adjusted according to the following equations:

M_{ij}(t+1) = M_{ij}(t) + \alpha(t) \left( X(t) - M_{ij}(t) \right)
for i_c - N_c(t) \le i \le i_c + N_c(t) and j_c - N_c(t) \le j \le j_c + N_c(t)    (4.21a)

M_{ij}(t+1) = M_{ij}(t) for all other indices (i, j)    (4.21b)
where t represents the time-step and \alpha is a small variable called the learning rate, which decreases with time. Basically, the new adjusted weight for the node is equal to the old weight plus a fraction \alpha of the difference between the old weight M and the input vector X. The decay of the learning rate is calculated at each iteration using the following equation:

\alpha(t) = \alpha_0 \exp\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots    (4.22)
Ideally, the amount of learning should fade over distance, similar to a Gaussian decay. Therefore, an adjustment is made to Equation (4.21a), as shown below:

M_{ij}(t+1) = M_{ij}(t) + \Theta(t)\, \alpha(t) \left( X(t) - M_{ij}(t) \right)    (4.23)

where \Theta represents the amount of influence a node's distance from the BMU has on its learning:

\Theta(t) = \exp\left( -\frac{dist^2}{2\sigma^2(t)} \right), \quad t = 1, 2, 3, \ldots    (4.24)

where dist is the distance of a node from the BMU and \sigma is the width of the neighborhood function as calculated by Equation (4.20). \Theta also decays over time.
5. Update time t = t + 1, present a new input vector and go to Step 2.
6. Continue until \alpha(t) approaches a certain pre-defined value or t reaches the maximum number of iterations. (A code sketch of one training iteration follows this list.)
Figure 4.8 shows the flow of the SOM learning algorithm. The learning algorithm repeatedly presents all the frames until the termination condition is reached; the input vectors are the cepstral coefficients. After training, the testing data are fed into the feature map to form binary matrices, which are used as input to the MLP for classification. The number of elements in a binary matrix determines the number of input nodes in the MLP.
[Randomly initialize weight vectors -> obtain a training pattern and apply it to the input -> determine the winner node with minimum Euclidean distance to the input vector -> update the weights of the nodes within the neighborhood set of the winner node -> decrease the gain \alpha(t) and the neighborhood set -> update time t = t + 1 -> if t >= max or \alpha(t) <= 0, save the trained map and stop; otherwise repeat]

Figure 4.8: Flow chart of the SOM learning algorithm.
After the learning process is completed, a trained feature map is formed. Figure 4.9 shows a trained feature map after 1,250,000 iterations. The neurons, shown as rectangles, are labeled with the symbols of the phonemes to which they learned to give the best responses. Neurons labeled with a '?' symbol are neurons with confused phonemes, and neurons labeled 'sil' correspond to silence (no voice). The neurons labeled with the letters 'B', 'K', 'M', 'S', 'T', 'A', 'I' and 'U' correspond to the phonemes /B/, /K/, /M/, /S/, /T/, /A/, /I/ and /U/ respectively. From the figure we can see that the SOM classifies the phonemes in an ordered fashion, with neurons for similar phonemes located near one another in the map. A suitable map size or dimension has to be determined during the learning process in order to provide enough feature space for the speech phonemes to be trained.
Figure 4.9: Trained feature map after 1,250,000 iterations.
4.3.3 Dimensionality Reduction
In the second stage of the FE, the SOM performs the dimensionality reduction shown in Figure 4.10. The SOM is used to transform the LPC cepstrum vectors into a trajectory in binary matrix form. The LPC cepstrum vectors are fed into a 2-dimensional feature map; the node in the feature map with the closest weight vector gives the response and is called the winner node. The winner node is scaled to the value "1" and all other nodes to the value "0". All the winner nodes in the feature map are accumulated into a binary matrix with the same dimensions as the feature map: if a node in the map has ever been a winner, the corresponding matrix element is unity. The SOM therefore serves as a sequential mapping function, transforming the acoustic vector sequences of a speech signal into a two-dimensional binary pattern. Figure 4.11(a) shows an example of a binary matrix, where ● denotes the value 1 and ○ denotes the value 0. After mapping all the speech frames of a word, a vector made by cascading the columns of the matrix excites an MLP, which has been trained by the backpropagation algorithm to classify the words of the vocabulary. Figure 4.11(b) shows the values of the binary matrix fed into the MLP for further processing. For the example shown in Figure 4.9, 12 cepstral coefficients are generated for each feature vector during speech processing using LPC analysis. If 50 feature vectors, or frames, are needed for a word, a total of 600 cepstral coefficients would be fed as input to the MLP network; and if the number of vectors or the order of the cepstral coefficients increased, the number of input nodes (the dimension of the input layer) of the MLP would also increase. By using the binary matrix produced by the SOM as the input to the MLP network, this problem is overcome, because the dimension of the binary matrix depends only on the chosen size of the feature map, which remains fixed even when the order of the cepstral coefficients increases.
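The accumulation of winner nodes into a binary matrix can be sketched as follows (illustrative; it reuses the winner-node search over a trained map, as in the SOM sketch above):

    import numpy as np

    def utterance_to_binary_matrix(cepstral_frames, weights):
        # Map each cepstral frame to its winner node on the trained SOM and
        # set the corresponding element of the binary matrix to 1.
        M = weights.shape[0]
        B = np.zeros((M, M))
        for x in cepstral_frames:
            d2 = ((weights - x) ** 2).sum(axis=2)
            ic, jc = np.unravel_index(np.argmin(d2), d2.shape)
            B[ic, jc] = 1.0
        return B.flatten(order="F")   # cascade the columns as the MLP input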
[50 feature vectors (600 LP-cepstral coefficients) -> Self-Organizing Map (12 x 12) -> binary matrix (144 nodes)]

Figure 4.10: Dimensionality reduction performed by the SOM.
Figure 4.11(a): The 12 x 12 mapping of the binary matrix of the /bu/ syllable (trajectory running from the /b/ region to the /u/ region of the map).
1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 1 0 0 1 0 0
0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0
Figure 4.11(b): Binary matrix of /bu/ which is fed as input to MLP.
4.4 Multilayer Perceptron (MLP)
The output of the FE block is "blind": it does not know which word is being represented by a given trajectory. The FE block only converts an incoming pressure wave into a trajectory in some feature space. It is the Recognizer block that discovers the relationships between the trajectories and the classes to which they belong, adding the notion of the class of an utterance to the system.

In our hybrid system, we used a Multilayer Perceptron (MLP) as the classifier. The number of output nodes corresponds to the number of classes, with an empirically optimized number of hidden nodes, and the number of input nodes equals the dimension of the input vector (the binary matrix from the SOM). The Multilayer Perceptron is trained by the error-backpropagation algorithm.
4.4.1 MLP Architecture
A three-layer MLP with an input layer, a hidden layer and an output layer is shown in Figure 4.12 (Bourland and Wellekens, 1987; Md Sah Haji Salam et al., 2001; Ting et al., 2001; Ahad et al., 2002; Zbancioc and Costin, 2003). The neurons in the input, hidden and output layers are denoted by x_i, h_j and y_k respectively, where i = 1, 2, ..., I; j = 1, 2, ..., J; and k = 1, 2, ..., K; here I, J and K are the numbers of neurons in the input, hidden and output layers. Connection weights from the input layer to the hidden layer are denoted by w_{ij}; similarly, w_{jk} are the connection weights from the hidden layer to the output layer. The network is fully connected in the sense that every neuron in each layer of the network is connected to every neuron in the adjacent forward layer by the connection weights.
[Input layer X (x_i, fed with the binary matrix) -> weights w_{ij} -> hidden layer H (h_j, bias bh_j) -> weights w_{jk} -> output layer Y (y_k, bias by_k)]

Figure 4.12: A three-layer Multilayer Perceptron.
The hidden layer and output layer may have biases, which act just like connection weights. These biases are connected to the hidden and output neurons and can be fixed or adjustable according to the network model utilized; the typical value for a fixed bias is 1. The biases connected to the hidden layer and output layer are denoted bh_j and by_k respectively.
4.4.2 Activation Function
The basic operation of a neuron involves summing its weighted input signals and applying an activation function. Typically, a nonlinear activation function is used for all the neurons in the network. According to Fausett (1994), an activation function used for a backpropagation network should be continuous, differentiable and monotonically non-decreasing, and its derivative should be easy to compute for computational efficiency. The most typical activation function is the binary sigmoid function, which has a range between zero and one and is defined as

f(x) = \frac{1}{1 + e^{-x}}    (4.25)
4.4.3 Error Backpropagation (BP)
Error backpropagation (BP) or the Generalized Delta Rule (Lippmann, 1989)
is the most widely used supervised training algorithm for neural networks especially
MLP. Because of its importance, we will discuss it in some detail in this section.
We begin with a full derivation of the learning rule.
Suppose we have a multilayered feedforward network of nonlinear (typically
sigmoid) units. We want to find values for the weights that will enable the network
to compute a desired function from input vectors to output vectors. Because the units
compute nonlinear functions, we cannot solve for the weights analytically; so we will
instead use a gradient descent procedure on some global error function E.
In the backpropagation stage, the error is first calculated at the output layer, then propagated back to the hidden layer and finally to the input layer. Each output neuron y_k compares its calculated (actual) output with the corresponding target value to obtain the error information term \delta_k. The \delta_k is then used to calculate the weight correction term \Delta w_{jk} and the bias correction term \Delta by_k, which are later used to update the connection weight w_{jk} and the bias by_k respectively. \eta is the learning rate of the network.
\delta_k = (t_k - y_k)\, f'(y\_input_k)    (4.26a)

\delta_k = (t_k - y_k)\, y_k (1 - y_k)    (4.26b)

\Delta w_{jk} = \eta \delta_k h_j    (4.27)

\Delta by_k = \eta \delta_k    (4.28)
The error information term, δ k is sent to the hidden layer as a delta input.
Each hidden neuron sums its delta inputs to give
\delta\_input_j = \sum_{k=1}^{K} \delta_k w_{jk}    (4.29)

The error information term at the hidden layer, \delta_j, is determined by multiplying the derivative of the activation function with \delta\_input_j:

\delta_j = \delta\_input_j\, f'(h\_input_j)    (4.30a)

\delta_j = \delta\_input_j\, h_j (1 - h_j)    (4.30b)
The \delta_j is then used to calculate the weight correction term \Delta w_{ij} and the bias correction term \Delta bh_j, which will later be used to update w_{ij} and bh_j respectively.

\Delta w_{ij} = \eta \delta_j x_i    (4.31)

\Delta bh_j = \eta \delta_j    (4.32)
(4.32)
The weights and biases are updated using the weight and bias correction
terms respectively. The adjustment makes use of the current weights and biases. At
output layer, each output neuron ( y k ) updates its weights and biases according to
w jk (new) = w jk (old ) + ∆w jk
(4.33)
by k (new) = by k (old ) + ∆by k
(4.34)
Similarly at hidden layer, each hidden neuron updates its weights and biases
based on
wij (new) = wij (old ) + ∆wij
(4.35)
bh j (new) = bh j (old ) + ∆bh j
(4.36)
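To make the flow of Equations (4.26) – (4.36) concrete, the following is a minimal C sketch of one back-propagation step for a single training pattern (without momentum). The sizes NI, NJ, NK, the array names and the helper name are ours, chosen for illustration; a forward pass is assumed to have already filled x, h and y:

#define NI 144   /* input nodes (e.g. a 12 x 12 binary matrix) */
#define NJ 38    /* hidden nodes */
#define NK 10    /* output nodes */

double x[NI], h[NJ], y[NK], t[NK];   /* activations and target values */
double w_ij[NI][NJ], w_jk[NJ][NK];   /* connection weights */
double bh[NJ], by[NK];               /* biases */

void backprop_step(double eta) {
    double delta_k[NK], delta_j[NJ];
    int i, j, k;

    /* Output-layer error terms (4.26b). */
    for (k = 0; k < NK; k++)
        delta_k[k] = (t[k] - y[k]) * y[k] * (1.0 - y[k]);

    /* Hidden-layer error terms (4.29),(4.30b); computed before the
       weights w_jk are changed, so the old weights are used. */
    for (j = 0; j < NJ; j++) {
        double delta_in = 0.0;
        for (k = 0; k < NK; k++)
            delta_in += delta_k[k] * w_jk[j][k];
        delta_j[j] = delta_in * h[j] * (1.0 - h[j]);
    }

    /* Output-layer updates (4.27),(4.28) applied as in (4.33),(4.34). */
    for (k = 0; k < NK; k++) {
        for (j = 0; j < NJ; j++)
            w_jk[j][k] += eta * delta_k[k] * h[j];
        by[k] += eta * delta_k[k];
    }

    /* Hidden-layer updates (4.31),(4.32) applied as in (4.35),(4.36). */
    for (j = 0; j < NJ; j++) {
        for (i = 0; i < NI; i++)
            w_ij[i][j] += eta * delta_j[j] * x[i];
        bh[j] += eta * delta_j[j];
    }
}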
The adjustment process can be improved if a momentum constant, α, is added to the weight updating formulas. The purpose of the momentum is to accelerate the convergence of the BP learning algorithm. It can be very useful when some training data are very different from the majority of the data. In BP with a momentum term, the weights and biases are updated not only with the current gradient but also with the previous gradient. The modifications of the adjustment are as follows:

1. At the output layer

w_jk(t+1) = w_jk(t) + Δw_jk(t+1),        (4.37a)
where Δw_jk(t+1) = η δ_k h_j + α Δw_jk(t)        (4.37b)

by_k(t+1) = by_k(t) + Δby_k(t+1),        (4.38a)
where Δby_k(t+1) = η δ_k + α Δby_k(t)        (4.38b)

2. At the hidden layer

w_ij(t+1) = w_ij(t) + Δw_ij(t+1),        (4.39a)
where Δw_ij(t+1) = η δ_j x_i + α Δw_ij(t)        (4.39b)

bh_j(t+1) = bh_j(t) + Δbh_j(t+1),        (4.40a)
where Δbh_j(t+1) = η δ_j + α Δbh_j(t)        (4.40b)
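A minimal sketch of the momentum variant, reusing the names from the back-propagation sketch above (dw_prev_jk is a hypothetical array holding the previous Δw_jk, assumed initialised to zero):

double dw_prev_jk[NJ][NK];   /* previous weight correction terms */
double alpha = 0.9;          /* momentum constant, typical value */

/* Output-layer update with momentum (4.37a,b); the hidden layer is
   handled analogously per (4.39),(4.40). */
int j, k;
for (k = 0; k < NK; k++)
    for (j = 0; j < NJ; j++) {
        double dw = eta * delta_k[k] * h[j] + alpha * dw_prev_jk[j][k];
        w_jk[j][k] += dw;
        dw_prev_jk[j][k] = dw;   /* remember for the next step */
    }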
4.4.4 Improving Error-Backpropagation
There are some methods that can be adopted to enhance and improve Error-Backpropagation. First, the weights and biases must be initialized uniformly at small random values. The typical range is between -0.5 and +0.5 (Bourland, 1987).
The choice of initial weights and biases is important because it will affect the
training of network toward a global minimum. Improper initialization will lead the
network to reach a local minimum. Besides, long training times and the suboptimal
results of MLP may be due to the lack of a proper initialization. Construction of
proper initialization of adaptive parameters should enable finding close to optimal
solutions for real-world problems (Wessel and Barnard, 1992).
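For instance, a one-line C sketch of such an initialization (the variable name is ours; the range mirrors the one cited above):

/* Uniform random initial weight in [-0.5, +0.5]. */
double w = ((double)rand() / RAND_MAX) - 0.5;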
Second, the target values must be chosen within the range of the sigmoid activation function. If a binary sigmoid activation function is used, the network can never actually reach the extreme values of 0 and 1. If the network is trained to achieve these levels, the weights will be driven to such large values that numerical instability will result. Moreover, the derivative of the sigmoid function approaches zero at the extreme values, which results in slow learning (Masters, 1993; Pandya et al., 1996).
Third, the learning rate and momentum must be selected properly. The learning rate should be selected as large as possible without causing the learning to oscillate, while a small learning rate will guarantee a true gradient descent. The choice of learning rate depends on the learning problem and also on the network architecture. The typical values for learning rate and momentum are 0.1 and 0.9 respectively.
Finally, the number of hidden-layer neurons must be chosen appropriately. Too few neurons increase the possibility of the network being trapped in a local minimum during training. An excessive number of hidden neurons not only increases the training time but may also cause over-fitting or over-learning. A suggested approach to finding the optimal number of hidden neurons is to start with a small number and then increase it gradually (Shin Watanabe et al., 1992).
The best number of hidden units depends in a complex way on many factors, including the number of training patterns, the numbers of input and output units, the amount of noise in the training data, the complexity of the function or classification to be learned, the type of hidden unit activation function, and the training algorithm. Too few hidden units will generally leave high training and generalization errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow and will result in poor generalization unless some other technique (such as regularization) is used to prevent over-fitting. A sensible strategy is to try a range of numbers of hidden units and see which works best.
One rough guideline for choosing the number of hidden neurons in many problems is the Geometric Pyramid Rule (GPR). It states that, for many practical networks, the number of neurons follows a pyramid shape, decreasing from the input toward the output, with the number of neurons in each layer following a geometric progression. The determination of the hidden node number using GPR is shown in Figure 4.13. Other investigators (Berke and Hajela, 1991) suggested that the number of nodes in the hidden layer should be between the average and the sum of the nodes in the input and output layers.
Figure 4.13: The determination of the hidden node number using the Geometric Pyramid Rule (GPR): for an MLP with X input nodes and Y output nodes, the hidden node number is H = √(X × Y).
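As a worked example (our illustration): with X = 144 input nodes (a 12 × 12 binary matrix) and Y = 10 output nodes, GPR gives H = √(144 × 10) = √1440 ≈ 38 hidden nodes, which is the GPR value used for the proposed digit recognition system in Chapter 6.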
A network as a whole will usually learn most efficiently if all its neurons are
learning at roughly the same speed. So, maybe different parts of the network should
have different learning rates. There are a number of factors that may affect the
choices:
i. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).
ii. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.
iii. Activations required for linear units will be different from those required for sigmoid units.
iv. There is empirical evidence that it helps to have different learning rates for the thresholds or biases compared with the real connection weights.
In practice, learning often proceeds faster by simply using the same rate for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates (Maniezzo, 1994).
4.4.5 Implementation of Error-Backpropagation
The Error-Backpropagation learning algorithm is summarized in a flow chart
shown in Figure 4.14.
Figure 4.14: Flow chart of the error-backpropagation algorithm. In outline: initialize the weights and biases; randomize the order of the training patterns; for each pattern, apply the input and target value, calculate the hidden and output neuron outputs, calculate the output and hidden error information terms with their weight and bias correction terms, and update the weights and biases in the output and hidden layers; after the last pattern of an epoch, stop if the error is at most E_min or the maximum number of epochs has been reached, otherwise repeat.
4.5 Summary
In summary, a hybrid neural network model has been proposed and explained in this chapter. Basically, the proposed system consists of three main parts: speech processing, a feature extractor based on SOM, and a recognizer performed by MLP. The algorithms of SOM and MLP have also been shown step by step in this chapter. In this research, SOM plays an important role in the dimensionality reduction of the feature vectors: it transforms higher-dimensional cepstral coefficients into lower-dimensional binary matrices. The error-backpropagation learning algorithm is used in the training of the MLP network.
CHAPTER 5
SYSTEM DESIGN AND IMPLEMENTATION
5.1 Introduction
This chapter discusses the implementation of the speech recognition system. The discussion covers speech processing (LPC feature extraction and endpoint detection), the Self-Organizing Map (SOM), the Multilayer Perceptron (MLP), and the training and testing of the system. The implementation of the speech recognition system is shown in Figure 5.1.
Figure 5.1: The implementation of the speech recognition system, consisting of Speech Processing, SOM, MLP, and Training and Testing.
5.2 Implementation of Speech Processing
In speech processing, speech features are extracted from the speech sounds. Speech processing involves two main phases: endpoint detection and speech feature extraction. In this research, Malay syllables, two-syllable words and digits are used as the target speech sounds. Thus, appropriate endpoint detection is used to determine the starting and ending points of a speech sound. A 3-level endpoint detection algorithm is proposed to improve the performance of the traditional method. For speech feature extraction, LPC analysis is used to extract LPC-derived cepstral coefficients.
5.2.1 Feature Extraction using LPC
The purpose of the speech feature extraction is to extract LPC-derived
cepstral coefficients from the syllable sounds. The features will serve as the inputs to
the speech classification system. The following section will discuss the
implementation of LPC analysis. The steps of the feature extraction are as follows:
Step 1: Blocking the total frame length into analysis frames and Hamming
windowing every analysis frame. The analysis frame length is 240 sample
points; the shifting of the analysis frame is 80 sample points. The number of
analysis frame will be determined after the endpoint detection.
Step 2: Pre-emphasize the signals of the syllable sound. Array Buftemp[] is the buffer for the pre-emphasized signals and array Buf[] is the buffer holding the syllable sample points.

Buftemp[0] = (double)Buf[0];   /* no previous sample for the first point */
for (i = 1; i <= NN - Shift; i++)
    Buftemp[i] = (double)Buf[i] - 0.95 * (double)Buf[i - 1];
Step 3: Hamming-window every analysis frame. A Hamming window is generated before the windowing; array Hamming[] stores the window values and Frame is the analysis frame length.

for (i = 0; i < Frame; i++)
    Hamming[i] = 0.54 - 0.46 * cos(i * 6.283 / (Frame - 1));

Array temp[] is the buffer for the Hamming-windowed signals of one frame (here i indexes the analysis frame, not the sample point):

for (i = 0; i <= (NN - Frame) / Shift; i++)
    for (j = 0; j <= Frame - 1; j++) {
        temp[j] = Buftemp[i * Shift + j];
        temp[j] *= Hamming[j];
    }
Step 4: Perform autocorrelation analysis. Each analysis frame is autocorrelated to
give autocorrelation coefficients. LPCOrder is the order of the
autocorrelation analysis. Array Autocorr[] is used to store the autocorrelation
coefficients.
for (k = 0; k <= LPCOrder; k++) {
    aucorr = 0.0;
    for (l = 0; l <= Frame - 1 - k; l++)
        aucorr += temp[l] * temp[l + k];
    Autocorr[k] = aucorr;
}
Step 5: Perform LPC analysis for each analysis frame. The LPC coefficients are computed from the autocorrelation coefficients using the Levinson-Durbin recursion. Array LPCBuf[] stores the LPC coefficients and array temp2[] stores them temporarily. E is the prediction error and kn[] holds the reflection coefficients.

double ar, kn[LPCOrder + 1], E;
LPCBuf[0] = 1;
E = Autocorr[0];
kn[1] = Autocorr[1] / Autocorr[0];
LPCBuf[1] = kn[1];
E = (1 - kn[1] * kn[1]) * E;   /* update the prediction error for order 1 */
for (r = 2; r <= LPCOrder; r++) {
    ar = 0;
    for (s = 1; s <= r - 1; s++)
        ar += LPCBuf[s] * Autocorr[r - s];
    kn[r] = (Autocorr[r] - ar) / E;
    LPCBuf[r] = kn[r];
    for (s = 1; s <= r - 1; s++)
        temp2[s] = LPCBuf[s] - kn[r] * LPCBuf[r - s];
    for (s = 1; s <= r - 1; s++)
        LPCBuf[s] = temp2[s];
    E = (1 - kn[r] * kn[r]) * E;
}
Step 6: Convert the LPC coefficients to cepstral coefficients. CEPOrder is the order
of the cepstral coefficients. A double array CEPBuf[] is used to store the
cepstral coefficients.
int t, u;
for (t = 1; t <= LPCOrder; t++)
    lpc[i][t] = LPCBuf[t];
CEPBuf[i][1] = LPCBuf[1];
for (t = 2; t <= LPCOrder; t++) {
    double sum = 0;
    for (u = 1; u <= t - 1; u++)
        sum += u * CEPBuf[i][u] * LPCBuf[t - u] / t;
    CEPBuf[i][t] = LPCBuf[t] + sum;   /* recursion up to the LPC order */
}
if (CEPOrder > LPCOrder) {
    for (t = LPCOrder + 1; t <= CEPOrder; t++) {
        double sum = 0;
        for (u = t - LPCOrder; u <= t - 1; u++)   /* LPCBuf[t-u] stays in range */
            sum += u * CEPBuf[i][u] * LPCBuf[t - u] / t;
        CEPBuf[i][t] = sum;
    }
}
Step 7: Perform endpoint detection to obtain the actual start and end points of the speech sound, using energy-power in terms of root mean square (rms), zero-crossing rate, and the Euclidean distance of the cepstral coefficients between analysis frames. This 3-level endpoint detection algorithm is discussed in more detail in the next section.
Step 8: Perform cepstral weighting and normalization for every analysis frame. A
cepstral window is generated before the cepstral weighting. Array Cepwin[]
is used to store the weighting window. After the weighting, every frame of
cepstral coefficients is normalized between -1 and +1.
for (v = 1; v <= CEPOrder; v++)   // Generate the cepstral weighting window
    Cepwin[v] = 1.0 + CEPOrder * 0.5 * sin(v * 3.1416 / CEPOrder);
for (t = 1; t <= CEPOrder; t++)   // Perform cepstral weighting
    CEPBuf[i][t] = Cepwin[t] * CEPBuf[i][t];
// Normalize cepstral coefficients between -1 and +1
double max = -5.0, min = 5.0;
for (w = 1; w <= CEPOrder; w++) {
    if (CEPBuf[i][w] < min) min = CEPBuf[i][w];
    if (CEPBuf[i][w] > max) max = CEPBuf[i][w];
}
for (x = 1; x <= CEPOrder; x++)
    CEPBuf[i][x] = ((CEPBuf[i][x] - min) / (max - min)) * 2 - 1;
Step 9: Save all the cepstral coefficients into a “cep” file together with their target value. The files are numbered starting from zero up to the last “.wav” file in the training set and testing set. The cepstral coefficients of the training speech dataset are also saved together in one cepstral file named according to their cepstral order.
for (x = 0; x < TotalFrameNumber; x++) {
for (y = 1; y <= CEPOrder; y++)
fprintf(SaveFile, "%.4f\t", CEPBuf[x][y]);
fprintf(SaveFile, "\n"); }
// first 20 speech sounds are for training dataset
if (index <= 20) {
for (x = 0; x < TotalFrameNumber; x++)
for (y = 1; y <= CEPOrder; y++)
fprintf(TrainingCEP, "%.4f\n", CEPBuf[x][y]);
}
Step 10: Repeat steps 1 to 9 for every “.wav” file in the training set and test set.
5.2.2 Endpoint Detection
The purpose of the endpoint detection is to determine the actual start and end
point of the speech sounds. A 3-level adaptive endpoint detection algorithm for
isolated speech based on time and frequency parameters has been developed in order
to obtain the actual start and end points of the speech (Ahmad, 2004). The concept of
Euclidean distance measurement adopted in this algorithm can determine the
segment boundaries between silence and voiced speech as well as unvoiced speech.
The algorithm consists of three basic modules: background noise estimation, initial endpoint detection, and actual endpoint detection. The initial endpoint detection is performed using energy-power in terms of root mean square (rms) and zero-crossing rate, and the actual endpoint detection is performed using the Euclidean distance of the cepstral coefficients between analysis frames.
(1) Background Noise Estimation
Before estimating the background noise, the speech data is normalized with respect to its maximum value and then pre-emphasized with a first-order pre-emphasis filter. The speech data is also smoothed by a Hamming window. The background noise estimate is used to decide the threshold values in the following steps; it is computed from samples taken at the beginning and the end of the input signal. The rms energy power E_n of frame n is computed as (Equation 5.1):

E_n = [ (1/W) Σ_{i=1}^{W} S̃_n[i]² ]^(1/2)        (5.1)

in which i = 1, 2, …, W; W is the length of a frame (we use W = 240) and n = 1, 2, …, N is the frame number (N = total number of frames).
i = 0;
for (Window = 0; Window <= NN - Shift; Window += Shift) {
    power = 0;
    for (j = 0; j < Frame; j++)
        power += (_int64)pow(Buf[j + Window], 2);
    power = power / Frame;
    power = sqrtl(power);
    power = power / 32768;       /* normalize by the 16-bit full scale */
    rmsenergy[i++] = power;
}
The noise level at the front end of the signal, E_f, is estimated using the first 5 energy frames, provided the energy values in these 5 frames are consistent with each other.
E_f = (1/5) Σ_{i=1}^{5} E_i        (5.2)
The noise level at the back end of the signal, E_b, is estimated in the same way, using the last 5 frames whose energy values are consistent with each other.

E_b = (1/5) Σ_{i=N−4}^{N} E_i        (5.3)
Finally, the background noise level of the input signal, E_N, is estimated from the noise levels at the front and back ends as follows:

E_N = (E_f + E_b) / 2        (5.4)
for (i = e_start; i < e_start + 5; i++)
e1 = e1 + rmsenergy[i];
e1 = e1 / 5.0;
for (i = e_end; i > e_end - 5; i--)
e2 = e2 + rmsenergy[i];
e2 = e2 / 5.0;
e_noise = (e1 + e2) / 2.0;
However, the rms energy of background noise obtained should lie within two
limit thresholds; otherwise the speech signal is not acceptable as being either too
noisy or under-amplified.
Another parameter, the zero crossing rate of the background noise, is estimated in a similar way to the rms energy. Equations 5.5 – 5.8 give the estimation of the background noise zero crossing rate:
Z_n = (1/2) Σ_{i=1}^{W−1} | sgn(S̃_n[i+1]) − sgn(S̃_n[i]) |        (5.5)

in which

sgn(x) = 1 if x ≥ 0, and −1 if x < 0
for (j = 0; j < NN - Shift; j += Shift) {
    ZCnumber = 0;
    for (i = 0; i < Frame - 1; i++) {
        if ((Buf[j + i] * Buf[j + i + 1]) < 0)
            ZCnumber++;
    }
    zerocrossing[k] = (double)ZCnumber / (double)Frame;
    fprintf(ZCR, "%d %d - %lf\n", k, j, zerocrossing[k]);
    k++;
}
Z_f = (1/5) Σ_{i=1}^{5} Z_i        (5.6)

Z_b = (1/5) Σ_{i=N−4}^{N} Z_i        (5.7)

Z_N = (Z_f + Z_b) / 2        (5.8)

for (i = z_start; i < z_start + 5; i++)
    f1 = f1 + zerocrossing[i];
f1 = f1 / 5.0;
for (i = z_end; i > z_end - 5; i--)
    f2 = f2 + zerocrossing[i];
f2 = f2 / 5.0;
zcr_noise = (f1 + f2) / 2.0;
However, the zero crossing of background noise obtained should lie within
two limit thresholds; otherwise the speech signal is not acceptable as being either too
noisy or under-amplified.
(2) Level 1 and 2: Initial Endpoint Detection
The starting point of the first voiced segment of the input utterance and the ending point of the last one are located to serve as reference points for the detection of the actual endpoints of the speech signal. This part begins by searching the rms energy function leftwards, frame by frame, from the frame with the highest rms energy. The first frame whose rms energy falls below an energy threshold T_e is assumed to lie at the beginning of the first voiced segment. Thus, the starting point P_F1 of the front voiced segment is obtained by
P_F1 = arg max_n { E_n < T_e,  n = m, m−1, …, 0 }        (5.9)
in which E_n is defined by Equation 5.1 and m is the index of the frame with the highest rms energy. T_e is derived experimentally from the background noise E_N using the relation

T_e = C_e × E_N        (5.10)

in which C_e is an experimentally derived constant.
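As an illustrative sketch (the constant's value is our assumption, not from the thesis), Equation 5.10 maps directly onto the rmse_threshold used in the code below:

double Ce = 2.0;                        /* hypothetical, tuned experimentally */
double rmse_threshold = Ce * e_noise;   /* Te = Ce * EN (Equation 5.10) */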
In the same way, the ending point P_B1 of the last voiced segment is obtained by searching the energy function backwards from the right:

P_B1 = arg min_n { E_n < T_e,  n = m, m+1, …, N }        (5.11)
for (Window = 0; Window < Total_Frame; Window++) {
if (rmsenergy[Window] < rmse_threshold)
Counter1=0;
if (rmsenergy[Window] > rmse_threshold)
Counter1++;
if (Counter1 >= 3) {
Start = (Window - 2) * Shift;
break;
}
}
for (Window = Start / 80; Window < Total_Frame; Window++) {
if (rmsenergy[Window] > rmse_threshold)
Counter2=0;
if (rmsenergy[Window] < rmse_threshold)
Counter2++;
if (Counter2 >= 3) {
End = (Window - 2) * Shift;
break;
}
}
If Equation 5.9 and Equation 5.11 cannot be satisfied, or if the distance between the points P_F1 and P_B1 is below a certain threshold, the algorithm declares an absence of speech in the input signal and the procedure is terminated. Otherwise, the speech signal between these two reference points is assumed to be a voiced speech segment. Next, we utilize the zero crossing parameter to relax the endpoints. This begins by searching the zero crossing function backwards from point P_F1. The reference starting point P_F2 is obtained by
P_F2 = arg max_n { Z_n > T_ZF,  n = P_F1, P_F1 − 1, …, 1 }        (5.12)

in which Z_n is defined by Equation 5.5 and T_ZF is the zero crossing threshold for the front end, defined by

T_ZF = C_ZF × Z_N        (5.13)

in which C_ZF is obtained experimentally.
In the same way, the reference ending point P_B2 is obtained by searching the zero crossing function forwards from P_B1:

P_B2 = arg min_n { Z_n > T_ZB,  n = P_B1, P_B1 + 1, …, N }        (5.14)

where T_ZB is the zero crossing threshold for the back end, defined by

T_ZB = C_ZB × Z_N        (5.15)

in which C_ZB is obtained experimentally.
for (i = (Start / 80); i >= 0; i--) {
    if (zerocrossing[i] < z_front_threshold)
        counter1++;
    else
        counter1 = 0;
    if (counter1 >= 2) {
        zcr_start = (i + 2) * 80;
        break;
    }
}
for (i = (End / 80); i < Total_Frame; i++) {
    if (zerocrossing[i] < z_back_threshold)
        counter2++;
    else
        counter2 = 0;
    if (counter2 >= 2) {
        zcr_end = (i - 1) * 80;
        break;
    }
}
Due to the different characteristics of the starting and ending phonemes of an isolated utterance, different zero crossing thresholds are utilized for determining the starting point and ending point respectively.
(3) Level 3: Actual Endpoint Detection
In this part, the implementation is based on a discrimination measure between the current frame i and the last retained frame j, compared against a distance threshold. The simplest discrimination measure we used, which also proved successful, is the weighted Euclidean distance D. This method emphasizes the transient regions, which are the most relevant for speech recognition, and the boundary between voiced speech and silence can be determined by adopting its principle. The decision criterion then becomes: leave the current frame out if D(i, j) < T_D.

Since changes in the speech signal are better embodied in the frequency domain, and cepstral coefficients can be compared by Euclidean distance, the cepstral coefficients are adopted to determine the actual endpoints. Let D(i, j) be the Euclidean distance between the cepstral coefficient vectors of frames i and j. If D(i, j) is greater than the threshold T_D during the search, a transition between voiced/unvoiced speech and silence is assumed to occur. In order to avoid sudden high-energy noise, three consecutive frames are checked.
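As a small sketch (our helper function, mirroring the loops in the code below; note that, as in that code, the squared distance is compared against the threshold without taking the square root):

/* Squared Euclidean distance between the cepstral vectors of frames
   a and b, using a cepstral order of 20 as in the code below. */
double D(int a, int b) {
    double d = 0.0;
    int j;
    for (j = 1; j <= 20; j++)
        d += (CEPBuf[a][j] - CEPBuf[b][j]) * (CEPBuf[a][j] - CEPBuf[b][j]);
    return d;
}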
Searching forwards from frame P_F2 until frame P_B2, the actual starting point P_F3 is determined by

P_F3 = arg min_n { D(n, n+1) > T_D  &&  D(n, n+2) > T_D  &&  D(n, n+3) > T_D } + 1        (5.16)

in which P_F2 ≤ n ≤ P_B2.

for (i = zcr_start / 80; i < (zcr_end / 80) - 1; i++) {
    for (k = 1; k <= 3; k++) {
        d = 0.0;
        for (j = 1; j <= 20; j++)
            d = d + pow((CEPBuf[i][j] - CEPBuf[i + k][j]), 2.0);
        if (d > euclidean_threshold)
            Counter++;
        else {
            Counter = 0;
            break;
        }
    }
    if (Counter >= 3) {
        actual_start = (i - 2) * 80;
        break;
    }
}
Searching backwards from frame P_B2 until frame P_F2, the actual ending point P_B3 is determined by

P_B3 = arg max_n { D(n, n−1) > T_D  &&  D(n, n−2) > T_D  &&  D(n, n−3) > T_D } − 1        (5.17)

in which P_B2 ≥ n ≥ P_F2.
The final results of the proposed algorithm, i.e. the actual endpoints, are the starting point P_F3 and the ending point P_B3.
for (i = zcr_end / 80; i > (zcr_start / 80) + 1; i--) {
    for (k = 1; k <= 3; k++) {
        d = 0.0;
        for (j = 1; j <= 20; j++)
            d = d + pow((CEPBuf[i][j] - CEPBuf[i - k][j]), 2.0);
        if (d > euclidean_threshold)
            Counter++;
        else {
            Counter = 0;
            break;
        }
    }
    if (Counter >= 3) {
        actual_end = (i + 2) * 80;
        break;
    }
}
Figures 5.2(a) – (c) show the effectiveness of the 3-level endpoint detection in obtaining the actual start and end boundaries of a speech sample. After endpoint detection, the cepstral coefficients within the determined boundaries are weighted and saved into a “.cep” file for the training and testing of the SOM.
Figure 5.2(a): The detected boundaries of sembilan04.wav using rms energy in Level
1 of Initial endpoint detection.
Figure 5.2(b): The detected boundaries of sembilan04.wav using zero crossing rate in
Level 2 of Initial endpoint detection.
Figure 5.2(c): The actual boundaries of sembilan04.wav using Euclidean distance of
cepstrum in Level 3 of Actual endpoint detection.
5.3 Implementation of Self-Organizing Map
A Self-Organizing Map (SOM) with Kohonen's shortcut learning is used to train the system and transform the input LPC-derived cepstral coefficients into a binary matrix. The variable-length cepstral coefficient sequences are converted into fixed-length binary matrices whose size is given by the dimension of the SOM.

A two-layer SOM network is proposed in our system, as shown in Figure 5.3. The SOM consists of an input layer and an output layer. The number of input nodes equals the cepstral order used in the cepstral files. The feature vectors are fed into the input nodes for training and testing.
Figure 5.3: The architecture of Self-Organizing Map (SOM)
The implementation of SOM involves the training of the network with
training sets and also the testing of the network with testing dataset. The
implementation for SOM is shown in steps as follows. Only essential parts of the
implementation of the SOM will be shown.
(1) Implementation of SOM Training
Step 1: Get the configuration of the network parameters including the maximum
epoch for training, MaxEpoch and the dimension of the feature vectors,
VectorDimension.
VectorDimension = VECTORDIMENSION;
MaxEpoch = MAX_EPOCH;
Step 2: Read the input data from the speech input patterns (in “.cep” files) and save it into the array inputArray[]. InputVector stores the total number of cepstral coefficient vectors for training.
i = 0;
while (!feof(inputfile)) {
    for (j = 0; j < VectorDimension; j++)
        fscanf(inputfile, "%lf", &inputArray[i][j]);
    i++;   /* advance to the next feature vector */
}
InputVector = i;
Step 3: The input arrays are then normalized by the function NormalizeInput() in order to speed up convergence and make the network training more efficient.
for (i = 0; i < InputVector; i++) {
total = 0.0;
for (j = 0; j < VectorDimension; j++)
total += inputArray[i][j] * inputArray[i][j];
for (j = 0; j < VectorDimension; j++)
inputArray[i][j] = inputArray[i][j] / sqrt(total); }
Step 4: Initialize the weights to random values in the range 0.0 to 1.0 and save them into the array weights[] according to the chosen dimension of the SOM (X and Y).
for (i = 0; i < X; i++)
for (j = 0; j < Y; j++)
for (k = 0; k < VectorDimension; k++)
weights[i][j][k] = (double)rand() / RAND_MAX;
Step 5: Train the network until it reaches the maximum epoch (set to 5000000). First, an input vector is chosen randomly from the input arrays for learning. GetWinnerNode() determines the winner node, i.e. the node with the smallest distance to the input vector among all the nodes. The weight vectors of the nodes that lie within the nearest neighbourhood set of the winner node are updated. The learning constant and the neighbourhood set both decrease monotonically with time according to T. Kohonen's algorithm (Kohonen, 1995). In our research, we chose 0.25 and DimX*DimY−1 as the initial values for the learning constant and neighbourhood size respectively. This process is repeated until the maximum epoch is reached.
while (Epoch < MaxEpoch) {
    VectorChosen = (int)(rand() % InputVector);
    GetWinnerNode();
    for (i = 0; i < X; i++) {
        for (j = 0; j < Y; j++) {
            distance = ((winnerX - (i + 1)) * (winnerX - (i + 1))) +
                       ((winnerY - (j + 1)) * (winnerY - (j + 1)));
            distance = sqrt(distance);
            if (distance < NeighbourhoodSize / 2) {
                influence = exp(-(distance) / (NeighbourhoodSize / 2));
                for (k = 0; k < VectorDimension; k++)
                    weights[i][j][k] += (LearningRate * influence *
                        (inputArray[VectorChosen][k] - weights[i][j][k]));
            }
        }
    }
    /* Decay the neighbourhood size and learning rate over time */
    if (NeighbourhoodSize >= 1.0)
        NeighbourhoodSize = (InitNeighbourhoodSize / 2) *
                            exp(-(double)Epoch / time_constant);
    if (LearningRate >= 0.0001)
        LearningRate = InitLearningRate *
                       exp(-(double)Epoch / time_constant);
    Epoch++;
}
The implementation of function GetWinnerNode() is as follows:
minDistance = 999;
for (i = 0; i < X; i++) {
    for (j = 0; j < Y; j++) {
        Kohonen[i][j] = GetDistance(i, j);
        if (Kohonen[i][j] < minDistance) {
            minDistance = Kohonen[i][j];
            winnerX = i + 1;
            winnerY = j + 1;
        }
    }
}
winnernode = (winnerX - 1) * Y + winnerY;   /* index in the range 1 .. X*Y */
TrackWinner[Epoch] = winnernode;
The implementation of function GetDistance() is as follows:
double GetDistance(int x, int y){
int i;
double distance = 0.0;
for (i = 0; i < VectorDimension; i++)
distance += ((inputArray[VectorChosen][i] - weights[x][y][i]) *
(inputArray[VectorChosen][i] - weights[x][y][i]));
return sqrt(distance);
}
Step 6: Finally, the weight vectors are saved into a “.wgt” weight file. The weight file is then used for testing.

for (i = 0; i < X; i++)
    for (j = 0; j < Y; j++)
        for (k = 0; k < VectorDimension; k++)
            fprintf(WeightFile, "%lf ", weights[i][j][k]);
(2) Implementation of SOM Testing
Step 1: The process of SOM testing is similar to SOM training, but the dataset used for testing includes both the training set and the testing set. First, all of the feature vectors of a speech sound file are read from its “.cep” file and saved to the array inputArray[].
i = 0;
while (!feof(inputfile)) {
    for (j = 0; j < VectorDimension; j++)
        fscanf(inputfile, "%lf", &inputArray[i][j]);
    i++;
}
Step 2: The weight vectors are read from the “.wgt” weight file produced by the training process and saved to the array weights[].
for (i = 0; i < X; i++)
for (j = 0; j < Y; j++)
for (k = 0; k < VectorDimension; k++)
fscanf(weightfile, "%lf", &weights[i][j][k]);
Step 3: Find the winner node by calculating the Euclidean distance among the nodes. The functions GetWinnerNode() and GetDistance() were shown in the implementation of SOM training. The index of the winner node is saved to the array TrackWinner[]. The process is repeated for all the feature vectors of a speech sound file.
winnernode = (winnerX - 1) * Y + winnerY;   /* index in the range 1 .. X*Y */
TrackWinner[Epoch] = winnernode;
Step 4: A binary matrix, Matrix[], with the same dimension as the SOM is created. The binary matrix accumulates the winner nodes of a speech sound file: all values of the binary matrix are set to “0” except the winner nodes, which are set to “1”. The binary matrix is then saved in a “.mtx” matrix file according to its target value. The implementation of the function CreateMatrix() is as follows.
for (i = 0; i < X*Y; i++)
Matrix[i] = 0;
for (i = 0; i < Epoch; i++)
Matrix[TrackWinner[i] - 1] = 1;
for (i = 0; i < X*Y; i++)
fprintf(matrix, "%d\n", Matrix[i]);
Step 5: Steps 1 – 4 are repeated for all the cepstral files of both the training dataset and the testing dataset. The binary matrix files are then used as the MLP input for speech classification.
5.4 Implementation of Multilayer Perceptron
A three-layer MLP with Error-Backpropagation learning is used to train the system and classify the Malay target sounds by syllable and by word. In order to classify the Malay target sounds, two types of classification are used in the speech recognition system: syllable classification and word classification. Only word classification is applied to digit recognition, while both syllable and word classification are applied to word recognition, and a comparison is made in terms of recognition accuracy. The next sections discuss the network architecture for each classification.
5.4.1 MLP Architecture for Digit Recognition
The dataset for digit recognition consists of 10 Malay digits. The MLP has 10 output nodes, which correspond to the 10 Malay digits (/kosong/, /satu/, /dua/, /tiga/, /empat/, /lima/, /enam/, /tujuh/, /lapan/, /sembilan/). The architecture of the MLP is shown in Figure 5.4. A decimal numbering is used to set the target values of the digits; the settings are shown in Table 5.1. The value “0.9” indicates the status of presence while the value “0.1” indicates the status of absence.
Figure 5.4: MLP with 10 output nodes. The 10 output nodes correspond to the 10 Malay digit words respectively.
5.4.2 MLP Architecture for Word Recognition
The dataset for word recognition consists of 30 selected Malay two-syllable words. In order to classify the target words, both syllable and word classification are applied in word recognition.

For syllable classification, the MLP has 15 output nodes, which correspond to the 15 Malay syllables (/ba/, /bu/, /bi/, /ka/, /ku/, /ki/, /ma/, /mu/, /mi/, /sa/, /su/, /si/, /ta/, /tu/, /ti/). The architecture of the MLP for syllable classification is shown in Figure 5.5. A decimal numbering is used to set the target values of the syllables; the settings are shown in Table 5.2. The value “0.9” indicates the status of presence while the value “0.1” indicates the status of absence. Due to the character of the non-linear sigmoid function used in the MLP, the better choice is to use the values 0.9 and 0.1 instead of 1 and 0 (Haykin, 1994).
Table 5.1: The setting of the target values for MLP in digit recognition.

           Node: 10   9    8    7    6    5    4    3    2    1    Decimal Number
Kosong          0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9        1
Satu            0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1        2
Dua             0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1        3
Tiga            0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1        4
Empat           0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1        5
Lima            0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1        6
Enam            0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1        7
Tujuh           0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1        8
Lapan           0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1        9
Sembilan        0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       10

Value “0.9” = status of presence; value “0.1” = status of absence.
Figure 5.5: MLP with 15 output nodes. The 15 output nodes correspond to the 15 Malay syllables respectively.
For word classification, the MLP has 30 output nodes, which correspond to the 30 selected Malay words listed in Table 5.3. The architecture of the MLP for word classification is shown in Figure 5.6. A decimal numbering is used to set the target values of the Malay words; the settings are also shown in Table 5.3. The value “0.9” indicates the status of presence while the value “0.1” indicates the status of absence.
Figure 5.6: MLP with 30 output nodes. The 30 output nodes correspond to the 30 Malay two-syllable words respectively.
Table 5.2: The setting of the target values for MLP (syllable classification).

     Node: 15   14   13   12   11   10    9    8    7    6    5    4    3    2    1    Decimal Number
Ba        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9        1
Bu        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1        2
Bi        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1        3
Ka        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1        4
Ku        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1        5
Ki        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1        6
Ma        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1        7
Mu        0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1        8
Mi        0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1        9
Sa        0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       10
Su        0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       11
Si        0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       12
Ta        0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       13
Tu        0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       14
Ti        0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       15

Value “0.9” = status of presence; value “0.1” = status of absence.
Table 5.3: The setting of the target values for MLP (word classification). Each word's target vector contains the value 0.9 at the node matching its decimal number and 0.1 at all other nodes. Nodes 29 – 11 are elided below; for words 11 – 29 the 0.9 entry falls within the elided columns.

     Node: 30   …   10    9    8    7    6    5    4    3    2    1    Decimal Number
Baki      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9        1
Bata      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1        2
Buka      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1        3
Buku      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1        4
Bumi      0.1   …  0.1  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1        5
Bisu      0.1   …  0.1  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1        6
Bisa      0.1   …  0.1  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1        7
Kamu      0.1   …  0.1  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1        8
Kaki      0.1   …  0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1        9
Kuku      0.1   …  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       10
Kubu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       11
Kita      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       12
Masa      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       13
Mata      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       14
Mati      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       15
Muka      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       16
Mutu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       17
Misi      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       18
Sami      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       19
Satu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       20
Susu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       21
Suka      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       22
Sisi      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       23
Situ      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       24
Tamu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       25
Taba      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       26
Tubu      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       27
Tubi      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       28
Tiba      0.1   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       29
Titi      0.9   …  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1       30

Value “0.9” = status of presence; value “0.1” = status of absence.
(1) Implementation of MLP Training
Step 1: Initialize the weights and biases to small random values. Initialize the target values.
for (time_state = 0; time_state < NUMTIMESTATES; time_state++)
    for (from_layer = 0; from_layer < NUMOFLAYERS - 1; from_layer++)
        for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer];
             from_neuron++)
            for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer + 1];
                 to_neuron++)
                synapses[time_state][from_layer][from_neuron][to_neuron] =
                    (((float)rand() / RAND_MAX) - 0.5) * INITWEIGHTRANGE;

for (i = 0; i < NUMTRAININGSETS; i++)      // Initialize the target values
    for (j = 0; j < NUMOUTPUTNEURONS; j++) {
        training_outs[i][j] = 0.1;
        if (i == j) training_outs[i][j] = 0.9;
    }
Step 2: Train the network until one of the stopping conditions is reached: the minimum error or the maximum number of epochs. First, ForwardPass() propagates the input signals to the output nodes, from the input layer to the hidden layer and from the hidden layer to the output layer. Then, BackwardPass() back-propagates the error and updates the weights and biases. The final weights and biases are saved into a “.wgt” file after training. The learning rate of the network is decreased by 5% whenever the difference between the previous and current error values stays smaller than c for 25 consecutive checks, where c is a constant set to 0.0001 during the training process.
The implementation of function ForwardPass() is as follows:
for (i = 1; i < NUMOFLAYERS; i++) {
    for (j = 0; j < neurons_per_layer[i]; j++) {
        temp = 0;
        for (k = 0; k < neurons_per_layer[i-1]; k++)
            temp += neurons[i-1][k] * synapses[CURRENTWEIGHTS][i-1][k][j];
        neurons[i][j] = Sigmoid(temp);
    }
}
The implementation of function BackwardPass() is as follows:
// Calculate errors in the output neurons
for (i = 0; i < NUMOUTPUTNEURONS; i++)
    errors[NUMOFLAYERS-1][i] = neurons[NUMOFLAYERS-1][i] *
        (1 - neurons[NUMOFLAYERS-1][i]) *
        (trainset[i] - neurons[NUMOFLAYERS-1][i]);

// Calculate the new weights, layer by layer from the output backwards
for (i = (NUMOFLAYERS - 2); i >= 0; i--)
    for (j = 0; j < neurons_per_layer[i]; j++) {
        for (k = 0; k < neurons_per_layer[i+1]; k++)
            synapses[NEWWEIGHTS][i][j][k] =
                synapses[CURRENTWEIGHTS][i][j][k] +
                ((learningrate * errors[i+1][k] * neurons[i][j]) *
                 (1 + MOMENTUM));
        temp = 0;
        for (k = 0; k < neurons_per_layer[i+1]; k++)
            temp += errors[i+1][k] * synapses[CURRENTWEIGHTS][i][j][k];
        errors[i][j] = neurons[i][j] * (1 - neurons[i][j]) * temp;
    }

// Copy all the new weights into the current set of weights for the
// next pair of forward and backward passes
for (i = 0; i < (NUMOFLAYERS-1); i++)
    for (j = 0; j < neurons_per_layer[i]; j++)
        for (k = 0; k < neurons_per_layer[i+1]; k++)
            synapses[CURRENTWEIGHTS][i][j][k] =
                synapses[NEWWEIGHTS][i][j][k];

// Decrease the learning rate by 5% after 25 consecutive small error changes
if (error_distance < c)
    error_count++;
else
    error_count = 0;
if ((error_count % 25 == 0) && (error_count != 0) && (learningrate > 0.075))
    learningrate = learningrate * 0.95;
The implementation of function SaveWeights() is as follows:
for (from_layer = 0; from_layer < NUMOFLAYERS-1; from_layer++){
for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer];
from_neuron++)
{
for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer+1];
to_neuron++)
fprintf(outfile, "%f ",
synapses[1][from_layer][from_neuron][to_neuron]);
fprintf(outfile, "\n");
}
fprintf(outfile, "\n");
}
(2) Implementation of MLP Testing
Step 1: Read the weight values from the “.wgt” weight file saved after the training of the network. The weights are then stored into the array synapses[].
for (from_layer = 0; from_layer < NUMOFLAYERS-1; from_layer++)
for (from_neuron = 0; from_neuron < neurons_per_layer[from_layer];
from_neuron++)
for (to_neuron = 0; to_neuron < neurons_per_layer[from_layer+1];
to_neuron++)
fscanf(weights_file, "%f",
&synapses[CURRENTWEIGHTS][from_layer][from_neuron][to_neuron]);
Step 2: Read the test data from the speech input patterns (in “.mtx” files) and save it to the array testing_ins[]. There are a total of 40 patterns per word; the first 20 are used for training and the remaining 20 for testing.
for (i = 0; i < NUMINPUTNEURONS; i++)
fscanf(infile, "%f\n", &testing_ins[i]);
Step 3: Use the function ForwardPass() (the same function used in training) to propagate the test input signals to the output nodes by calculating the output values of the output nodes.
Step 4: Find the output node with the highest activation. Compare the output result with the target result to determine the recognition accuracy after the entire training set and test set have been tested. The testing result is saved to the test report.
highest_val = 0;
for (i = 0; i < neurons_per_layer[NUMOFLAYERS - 1]; i++)
    if (layers[NUMOFLAYERS - 1][i] > highest_val) {
        highest_val = layers[NUMOFLAYERS - 1][i];
        highest_ind = i;
    }
fprintf(result, "recognized syllable : %s %.4lf", syllable[highest_ind],
        layers[NUMOFLAYERS - 1][highest_ind]);
if (x == highest_ind) {
    cout << " ** Correct Result !! " << endl;
    countx[x]++;
}
else {
    cout << " Incorrect Result !! " << endl;
    fprintf(result, " *** Incorrect :< \n\n");
}
accuracyx[x] = (float)countx[x] / (NUMTESTINGTOKENS);
accuracy = (float)count / (NUMTESTINGSETS * NUMTESTINGTOKENS);
fprintf(result, "\n\ncorrect/total : %d/%d\n", count,
        NUMTESTINGSETS * NUMTESTINGTOKENS);
fprintf(result, "\nrecognition accuracy : %.4f \n\n", accuracy);
for (x = 0; x < NUMTESTINGSETS; x++)
    fprintf(result, "%s : %d/%d\t%.4f \n", syllable[x], countx[x],
            NUMTESTINGTOKENS, accuracyx[x]);
5.5 Experiment Setup
The performance test of the speech recognition system is conducted on digit
recognition and word recognition. Both of the recognition systems are tested using
the conventional model (single network) and the proposed model (SOM and MLP).
Comparison of system performance is then made according to their accuracy.
Digit recognition system is tested in order to evaluate the performance of
neural network in speech recognition for small vocabulary size. Word recognition is
tested in order to evaluate the performance of neural network in speech recognition
for larger vocabulary by using different approaches such as syllable approach and
word approach.
Experiments are conducted to find the optimal parameters of the system in
order to obtain optimal performance of the system. For conventional system, the
optimal values of parameters to be determined are cepstral order (CO), hidden node
number (HNN) and learning rate (LR). For our proposed system, the optimal values
of parameters to be determined are cepstral order (CO), Dimension of SOM (DSOM),
hidden node number (HNN) and learning rate (LR). The values determined will be
used in the rest of the tests. The proposed speech recognition system which uses
SOM and MLP is then compared with the conventional system which uses MLP only.
Figures 5.7 and 5.8 show the system architectures for the conventional system and the proposed system respectively. Here we briefly describe the implementation of both systems.
As shown in Figure 5.7, the conventional system consists of two main components: speech processing and the MLP. Speech processing acts as the feature extractor and the MLP as the classifier. First, the waveform speech is fed into speech processing, where LPC extracts the speech features (cepstral coefficients). The cepstral coefficients are then used to train the MLP before it can be used for recognition. The output (recognition accuracy) of the MLP is printed to a result text file.
Figure 5.7: System architecture for the conventional model (single network): waveform speech from the speech database is passed through speech processing (endpoint detection and LPC); the resulting cepstral coefficients are stored in a cepstral coefficient database and fed to the MLP, whose output (recognition accuracy) is written to a text file.
Figure 5.8: System architecture for the proposed model (hybrid network): waveform speech from the speech database is passed through speech processing (endpoint detection and LPC); the cepstral coefficients are fed to the SOM, which produces binary matrices stored in a binary matrix database; the binary matrices are fed to the MLP, whose output (recognition accuracy) is written to a text file.
As shown in Figure 5.8, the proposed system consists of three main components: speech processing, SOM and MLP. Speech processing acts as the feature extractor, the SOM performs dimensionality reduction, and the MLP acts as the classifier. As in the conventional system, speech processing applies LPC to the waveform speech to generate cepstral coefficients. The cepstral coefficients are fed into the SOM for training and are then transformed into binary matrices. The binary matrices are then used to train the MLP before recognition. The output (recognition accuracy) of the MLP is printed to a result text file.
Figure 5.9 shows the training and testing of the digit recognition system level
by level. Figure 5.10 shows the training and testing of the word recognition system
level by level. The system setup of training and testing is shown in Appendices level
by level.
Figure 5.9: Training and testing of the digit recognition system: the conventional system (CS) is tested over cepstral order (CO), hidden node number (HNN) and learning rate (LR); the proposed system (PS) is tested over cepstral order (CO), dimension of SOM (DSOM), hidden node number (HNN) and learning rate (LR).
Figure 5.10: Training and testing of the word recognition system: both the conventional system (CS) and the proposed system (PS) are tested with syllable classification and word classification, over the same parameters as in Figure 5.9 (CO, HNN and LR for CS; CO, DSOM, HNN and LR for PS).
CHAPTER 6
RESULTS AND DISCUSSION
6.1 Introduction
In this chapter, the performance of both the conventional and the proposed
system are evaluated. The results of the tests performed are presented and discussed
in table and graph form in stages as shown in Figure 6.1.
Figure 6.1: Presentation and discussion of the test results in table and graph form, in stages: digit recognition (conventional: Experiments 1, 2 and 3; proposed: Experiments 1, 2, 3 and 4) and word recognition (conventional and proposed, each with syllable and word classification), followed by comparison and discussion.
6.2 Testing of Digit Recognition
The results of the digit recognition tests are presented according to the different parameter values in the conventional system (DRCS – Digit Recognition Conventional System) and the proposed system (DRPS – Digit Recognition Proposed System). The best results on the test sets are taken as the optimal values of the parameters (cepstral order (CO) for LPC analysis, dimension of SOM (DSOM), hidden node number (HNN) for the MLP, and learning rate (LR) for MLP training).
6.2.1 Testing Results for Conventional System
6.2.1.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracy for training and testing set in Experiment 1 are
presented in Table 6.1 and its analysis is presented in graph form shown in Figure 6.2.
Table 6.1: Recognition accuracy for different CO for Experiment 1 (DRCS)

CO    Train (%)   Test (%)
12      93.00       85.00
16      94.83       87.67
20      93.33       89.00
24      95.00       89.33
From Table 6.1 and Figure 6.2, we found that a higher cepstral order gives higher recognition accuracy. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
Figure 6.2: Recognition accuracy for different CO for Experiment 1 (DRCS)
6.2.1.2 Experiment 2: Optimal Hidden Node Number (HNN)
The recognition accuracy for training and testing set in Experiment 2 are
presented in Table 6.2 and its analysis is presented in graph form shown in Figure 6.3.
The chosen HNN values are based on the Geometric Pyramid Rule (GPR), which was discussed in Section 4.4.4 of Chapter 4. The experiment is run with 3 chosen HNN values: ¾GPR, GPR and 1¼GPR. The HNN is calculated as

Hidden Node Number (HNN) = √(X × Y)

where X is the number of input nodes and Y is the number of output nodes.
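As a worked example (our arithmetic): GPR here yields 130 hidden nodes, so ¾GPR ≈ 98 and 1¼GPR ≈ 163, which are the three HNN values tested in Table 6.2.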
Table 6.2: Recognition accuracy for different HNN for Experiment 2 (DRCS)

HNN    Train (%)   Test (%)
 98      91.50       83.50
130      95.00       89.33
163      93.00       90.00
Figure 6.3: Recognition accuracy for different HNN for Experiment 2 (DRCS)
From Table 6.2 and Figure 6.3, we found that the HNN determined using GPR gives the highest recognition accuracy on the training set, whereas the higher HNN gives the better result on the testing set. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
6.2.1.3 Experiment 3: Optimal Learning Rate (LR)
The recognition accuracy for training and testing set in Experiment 3 are
presented in Table 6.3 and its analysis is presented in graph form shown in Figure 6.4.
Table 6.3: Recognition accuracy for different LR for Experiment 3 (DRCS)

LR     Train (%)   Test (%)
0.1      96.83       91.67
0.2      95.00       89.33
0.3      93.33       89.00
0.4      95.00       85.50
Figure 6.4: Recognition accuracy for different LR for Experiment 3 (DRCS)
From Table 6.3 and Figure 6.4, we found that a smaller learning rate gives the better result for both the training and testing sets. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
6.2.2 Testing Results for Proposed System
6.2.2.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracy for training and testing set in Experiment 1 are
presented in Table 6.4 and its analysis is presented in graph form shown in Figure 6.5.
Table 6.4: Recognition accuracy for different CO for Experiment 1 (DRPS)

CO    Train (%)   Test (%)
12     100.00       91.50
16     100.00       92.00
20     100.00       98.83
24     100.00       99.83
From Table 6.4 and Figure 6.5, we found that the system obtains 100% accuracy on the training set for every chosen cepstral order from 12 to 24. On the testing set, a higher cepstral order gives better results, up to 99.83% accuracy. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
Figure 6.5: Recognition accuracy for different CO for Experiment 1 (DRPS)
6.2.2.2 Experiment 2: Optimal Dimension of SOM (DSOM)
The recognition accuracy for training and testing set in Experiment 2 are
presented in Table 6.5 and its analysis is presented in graph form shown in Figure 6.6.
Table 6.5: Recognition accuracy for different DSOM for Experiment 2 (DRPS)

DSOM      Train (%)   Test (%)
10 x 10    100.00       99.50
12 x 12    100.00       99.83
15 x 15    100.00       97.33
20 x 20    100.00       98.00
Figure 6.6: Recognition accuracy for different DSOM for Experiment 2 (DRPS)
From Table 6.5 and Figure 6.6, we found that the system obtains 100% accuracy on the training set for every chosen DSOM, while a DSOM of 12 x 12 gives the highest accuracy on the testing set. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
6.2.2.3 Experiment 3: Optimal Hidden Node Number (HNN)
The recognition accuracy for training and testing set in Experiment 3 are
presented in Table 6.6 and its analysis is presented in graph form shown in Figure 6.7.
Table 6.6: Recognition accuracy for different HNN for Experiment 3 (DRPS)

HNN    Train (%)   Test (%)
28      100.00       97.33
38      100.00       99.83
48      100.00       99.50
Figure 6.7: Recognition accuracy for different HNN for Experiment 3 (DRPS)
From Table 6.6 and Figure 6.7, we found that the system obtains 100% accuracy on the training set for every chosen HNN, and the HNN determined using GPR gives the highest recognition accuracy on the testing set. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
6.2.2.4 Experiment 4: Optimal Learning Rate (LR)
The recognition accuracy for training and testing set in Experiment 4 are
presented in Table 6.7 and its analysis is presented in graph form shown in Figure 6.8.
Table 6.7: Recognition accuracy for different LR for Experiment 4 (DRPS)

LR     Train (%)   Test (%)
0.1     100.00       99.83
0.2     100.00       99.83
0.3     100.00       96.67
0.4     100.00       97.50
Figure 6.8: Recognition accuracy for different LR for Experiment 4 (DRPS)
From Table 6.7 and Figure 6.8, we found that the system obtains 100% accuracy on the training set for every chosen LR, while a smaller learning rate gives the better result on the testing set. The lower accuracy percentages on the testing data are expected, as the testing data differs from the training data and is not used to train the system.
6.2.3 Discussion for Digit Recognition Testing
6.2.3.1 Comparison of Performance for DRCS and DRPS according to CO
The comparison of performance for DRCS and DRPS according to CO is
presented in Table 6.8 and analyzed in Figure 6.9. The comparison of performance
is presented according to testing results.
Table 6.8: Comparison of performance for DRCS and DRPS according to CO

CO    DRCS (%)   DRPS (%)   DRPS − DRCS
12      85.00      91.50        +6.50
16      87.67      92.00        +4.33
20      89.00      98.83        +9.83
24      89.33      99.83       +10.50
From Table 6.8 and Figure 6.9, some observations can be made:
a) DRCS and DRPS share the same ordering of cepstral orders from lowest to highest accuracy: 12, 16, 20 and 24.
b) For both DRCS and DRPS, the highest recognition accuracy is obtained with a cepstral order of 24.
c) DRPS has higher accuracy than DRCS on every test set.
Figure 6.9: Comparison of performance for DRCS and DRPS according to CO.
DRPS outperforms DRCS on every test set according to cepstral order. From the comparison, we can see that the system recognizes better with a higher cepstral order: the higher the cepstral order, the more detailed the acoustic information obtained. From the tests, we can conclude that a cepstral order of 24 outperforms the other cepstral orders (12, 16 and 20).
6.2.3.2 Comparison of Performance for DRCS and DRPS according to HNN
The comparison of performance for DRCS and DRPS according to HNN is
presented in Table 6.9 and analyzed in Figure 6.10. The comparison of performance
is presented according to testing results.
122
Table 6.9: Comparison of performance for DRCS and DRPS according to HNN

HNN       DRCS (%)   DRPS (%)   DRPS − DRCS
¾ GPR       83.50      97.33       +13.83
GPR         89.33      99.83       +10.50
1¼ GPR      90.00      99.50        +9.50
From Table 6.9 and Figure 6.10, several observations can be made:
a) For DRCS, the order of hidden node number from lowest to highest accuracy is ¾ GPR, GPR and 1¼ GPR.
b) For DRPS, the order of hidden node number from lowest to highest accuracy is ¾ GPR, 1¼ GPR and GPR.
c) The highest accuracy for DRPS is obtained with the HNN given by GPR.
d) DRPS outperforms DRCS on every test set.
[Line graph: DRCS and DRPS testing accuracy (%) against hidden node number (¾ GPR, GPR, 1¼ GPR).]
Figure 6.10: Comparison of performance for DRCS and DRPS according to HNN.
It is shown that DRPS outperforms DRCS in every test set according to hidden node number. From the DRCS testing results, we cannot conclude that the hidden node number given by GPR achieves the best performance. However, a hidden node number at GPR or above gives good accuracy (close to 90% or higher) and can be considered a near-optimal hidden node number.
6.2.3.3 Comparison of Performance for DRCS and DRPS according to LR
The comparison of performance for DRCS and DRPS according to LR is presented in Table 6.10 and analyzed in Figure 6.11. The comparison is based on the testing-set results. The HNN used is the one with the highest accuracy in the previous experiment.

Table 6.10: Comparison of performance for DRCS and DRPS according to LR

    LR      DRCS (%)    DRPS (%)    DRPS – DRCS
    0.1     91.67       99.83       +8.16
    0.2     89.33       99.83       +10.50
    0.3     89.00       96.67       +7.67
    0.4     85.50       97.50       +12.00
From Table 6.10 and Figure 6.11, several observations can be made:
a) For DRCS, the order of learning rate from lowest to highest accuracy is 0.4, 0.3, 0.2 and 0.1.
b) For DRPS, the order of learning rate from lowest to highest accuracy is 0.3, 0.4, 0.2 and 0.1.
c) Both DRCS and DRPS achieve their highest accuracy with a learning rate of 0.1.
d) DRPS achieves higher accuracy than DRCS on every test set.
[Line graph: DRCS and DRPS testing accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.11: Comparison of performance for DRCS and DRPS according to LR.
It is shown that DRPS outperforms DRCS in every test set according to learning rate. The MLP learns and generalizes better with a smaller learning rate. However, a smaller learning rate may slow convergence during learning, because the modification of the weights at each step is smaller. From the testing results, a learning rate of 0.1 outperforms the other learning rates (0.2 – 0.4) and can be considered the optimal learning rate.
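A minimal sketch of the backpropagation weight update with momentum (momentum 0.8, as in Table 6.11) makes the trade-off explicit: the learning rate scales every weight modification, so a small rate means small steps and slower convergence.

    import numpy as np

    def update(w, grad, velocity, lr, momentum=0.8):
        # One gradient-descent step: the learning rate scales the
        # weight modification; momentum reuses part of the last step.
        velocity = momentum * velocity - lr * grad
        return w + velocity, velocity

    g = np.array([1.0, -0.5, 0.25])
    for lr in (0.1, 0.4):
        w, _ = update(np.zeros(3), g, np.zeros(3), lr)
        print(lr, w)   # the 0.1 step is four times smaller than the 0.4 step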
6.2.3.4 Discussion on Performance of DRPS according to DSOM
From Table 6.5 and Figure 6.6, several observations can be made:
a) The order of SOM dimension from lowest to highest accuracy is 15 x 15, 20 x 20, 10 x 10 and 12 x 12.
b) DRPS achieves almost 100% accuracy with a dimension of 12 x 12.
It is shown that a SOM dimension in the range of 10 x 10 to 12 x 12 is sufficient for the acoustic mapping of the SOM in digit recognition, where the vocabulary size is 10. The smaller the dimension used, the faster the training process. Thus, a dimension of 12 x 12 can be considered the optimal dimension for the SOM.
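As a sketch of the mapping being sized here — assuming a trained 12 x 12 SOM codebook (the array name som_weights below is hypothetical) — every utterance is reduced to a fixed 144-element binary input for the MLP by accumulating winner nodes:

    import numpy as np

    def utterance_to_binary_matrix(frames, som_weights, dim=12):
        # Accumulate the SOM winner node of every frame into a
        # dim x dim binary matrix: winner cells become 1, others 0.
        m = np.zeros((dim, dim), dtype=np.uint8)
        for f in frames:
            winner = int(np.argmin(np.linalg.norm(som_weights - f, axis=1)))
            m[winner // dim, winner % dim] = 1
        return m   # flattened, this is the fixed 144-node MLP input

    frames = np.random.randn(70, 24)        # 70 frames x CO = 24
    som_weights = np.random.randn(144, 24)  # hypothetical trained codebook
    print(utterance_to_binary_matrix(frames, som_weights).shape)  # (12, 12)

Note the fixed output size: the matrix has 144 entries no matter how many frames the utterance contains, which is what lets the MLP use a fixed input layer.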
6.2.3.5 Summary for Digit Recognition Testing
For digit recognition, the proposed system outperforms the conventional system in every test set, with an acceptable result and a clear improvement in recognition accuracy. The optimal parameters and architecture for the proposed system, determined from the test results, are shown in Table 6.11. The recognition accuracy is likely higher in digit recognition because of the smaller number of target words: only ten digits need to be recognized, and the features of the digits differ clearly from one another. Digit recognition can therefore be expected to perform better than word recognition, which deals with a larger vocabulary and more similar target words.
Table 6.11: The optimal parameters and the architecture for DRPS

    Component            Parameter                           Value
    Speech Processing    Analysis frame length / shifting    240 / 80 (samples)
                         Cepstral order                      24
    SOM                  Learning rate                       0.25
                         Dimension                           12 x 12
    MLP                  Input node                          144
                         Hidden node                         38
                         Output node                         10
                         Learning rate                       0.1 – 0.2
                         Momentum                            0.8
                         Max epoch / error function          1000 / 0.01
6.3 Testing of Word Recognition
The results of the word recognition tests are presented according to the different parameter values in the conventional system and the proposed system. The results are also presented according to the type of classification used: syllable classification and word classification.
6.3.1  Testing Results for Conventional System (Syllable Classification)
6.3.1.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.12 and analyzed graphically in Figure 6.12.

Table 6.12: Recognition accuracy for different CO for Experiment 1 (WRCS(S))

    CO      Train (%)    Test (%)
    12      90.67        76.83
    16      91.33        81.67
    20      94.00        78.00
    24      93.33        84.83
From Table 6.12 and Figure 6.12, we found that cepstral orders of 20 and 24 give the highest recognition accuracy on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
[Line graph: train and test recognition accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.12: Recognition accuracy for different CO for Experiment 1 (WRCS(S))
6.3.1.2 Experiment 2: Optimal Hidden Node Number (HNN)
The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.13 and analyzed graphically in Figure 6.13.

Table 6.13: Recognition accuracy for different HNN for Experiment 2 (WRCS(S))

    HNN     Train (%)    Test (%)
    100     90.50        83.50
    134     93.33        84.83
    168     92.00        87.33
[Line graph: train and test recognition accuracy (%) against hidden node number (100, 134, 168).]
Figure 6.13: Recognition accuracy for different HNN for Experiment 2 (WRCS(S))
From Table 6.13 and Figure 6.13, we found that HNN of 134 (GPR) and 168 (1¼ GPR) give the highest recognition accuracy on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.1.3 Experiment 3: Optimal Learning Rate (LR)
The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.14 and analyzed graphically in Figure 6.14.

Table 6.14: Recognition accuracy for different LR for Experiment 3 (WRCS(S))

    LR      Train (%)    Test (%)
    0.1     93.00        86.00
    0.2     93.33        84.83
    0.3     95.50        80.33
    0.4     95.00        78.67
[Line graph: train and test recognition accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.14: Recognition accuracy for different LR for Experiment 3 (WRCS(S))
From Table 6.14 and Figure 6.14, we found that an LR of 0.3 gives the highest accuracy on the training set, while a smaller LR gives better results on the testing set. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.2  Testing Results for Conventional System (Word Classification)
6.3.2.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.15 and analyzed graphically in Figure 6.15.

Table 6.15: Recognition accuracy for different CO for Experiment 1 (WRCS(W))

    CO      Train (%)    Test (%)
    12      87.00        71.33
    16      90.67        77.33
    20      90.00        75.67
    24      88.33        79.00
[Line graph: train and test recognition accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.15: Recognition accuracy for different CO for Experiment 1 (WRCS(W))
From Table 6.15 and Figure 6.15, we found that cepstral orders of 16 and 24 give the highest recognition accuracy on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.2.2 Experiment 2: Optimal Hidden Node Number (HNN)
The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.16 and analyzed graphically in Figure 6.16.

Table 6.16: Recognition accuracy for different HNN for Experiment 2 (WRCS(W))

    HNN     Train (%)    Test (%)
    142     89.00        77.67
    190     88.33        79.00
    238     91.33        77.00
From Table 6.16 and Figure 6.16, we found that HNN of 238 (1¼ GPR) and 190 (GPR) give the highest recognition accuracy on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
[Line graph: train and test recognition accuracy (%) against hidden node number (142, 190, 238).]
Figure 6.16: Recognition accuracy for different HNN for Experiment 2 (WRCS(W))
6.3.2.3 Experiment 3: Optimal Learning Rate (LR)
The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.17 and analyzed graphically in Figure 6.17.

Table 6.17: Recognition accuracy for different LR for Experiment 3 (WRCS(W))

    LR      Train (%)    Test (%)
    0.1     91.33        79.83
    0.2     88.33        79.00
    0.3     90.67        76.67
    0.4     88.50        77.33
[Line graph: train and test recognition accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.17: Recognition accuracy for different LR for Experiment 3 (WRCS(W))
From Table 6.17 and Figure 6.17, we found that the results obtained are not fully consistent; however, a smaller LR gives higher accuracy in both sets. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.3  Testing Results for Proposed System (Syllable Classification)
6.3.3.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.18 and analyzed graphically in Figure 6.18.

Table 6.18: Recognition accuracy for different CO for Experiment 1 (WRPS(S))

    CO      Train (%)    Test (%)
    12      97.33        88.00
    16      97.50        90.67
    20      99.67        94.67
    24      99.00        95.83
[Line graph: train and test recognition accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.18: Recognition accuracy for different CO for Experiment 1 (WRPS(S))
From Table 6.18 and Figure 6.18, we found that the higher cepstral orders (20 and 24) give higher accuracy on both the training and testing sets, with 24 best on the testing set. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.3.2 Experiment 2: Optimal Dimension of SOM (DSOM)
The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.19 and analyzed graphically in Figure 6.19.

Table 6.19: Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))

    DSOM       Train (%)    Test (%)
    10 x 10    97.33        91.33
    12 x 12    99.00        95.83
    15 x 15    98.50        96.67
    20 x 20    98.67        95.33
[Line graph: train and test recognition accuracy (%) against SOM dimension (10 x 10 – 20 x 20).]
Figure 6.19: Recognition accuracy for different DSOM for Experiment 2 (WRPS(S))
From Table 6.19 and Figure 6.19, we found that DSOM of 12 x 12 and 15 x 15 give the best results on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.3.3 Experiment 3: Optimal Hidden Node Number (HNN)
The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.20 and analyzed graphically in Figure 6.20.

Table 6.20: Recognition accuracy for different HNN for Experiment 3 (WRPS(S))

    HNN     Train (%)    Test (%)
    44      98.83        94.33
    58      98.50        96.67
    72      98.33        96.00
From Table 6.20 and Figure 6.20, we found that the results are almost consistent on the training set, while an HNN of 58 (GPR) gives the best result on the testing set. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
[Line graph: train and test recognition accuracy (%) against hidden node number (44, 58, 72).]
Figure 6.20: Recognition accuracy for different HNN for Experiment 3 (WRPS(S))
6.3.3.4 Experiment 4: Optimal Learning Rate (LR)
The recognition accuracies for the training and testing sets in Experiment 4 are presented in Table 6.21 and analyzed graphically in Figure 6.21.

Table 6.21: Recognition accuracy for different LR for Experiment 4 (WRPS(S))

    LR      Train (%)    Test (%)
    0.1     98.00        96.67
    0.2     98.50        96.67
    0.3     97.33        96.67
    0.4     98.00        95.33
[Line graph: train and test recognition accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.21: Recognition accuracy for different LR for Experiment 4 (WRPS(S))
From Table 6.21 and Figure 6.21, we found that the results are almost consistent across both sets; only an LR of 0.4 gives a noticeably lower testing accuracy. An LR in the range 0.1 – 0.2 gives good results in both sets. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.4  Testing Results for Proposed System (Word Classification)
6.3.4.1 Experiment 1: Optimal Cepstral Order (CO)
The recognition accuracies for the training and testing sets in Experiment 1 are presented in Table 6.22 and analyzed graphically in Figure 6.22.

Table 6.22: Recognition accuracy for different CO for Experiment 1 (WRPS(W))

    CO      Train (%)    Test (%)
    12      95.33        83.00
    16      96.50        84.83
    20      96.00        90.33
    24      97.33        91.00
[Line graph: train and test recognition accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.22: Recognition accuracy for different CO for Experiment 1 (WRPS(W))
From Table 6.22 and Figure 6.22, we found that a higher cepstral order, especially 24, gives higher accuracy on both the training and testing sets. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.4.2 Experiment 2: Optimal Dimension of SOM (DSOM)
The recognition accuracies for the training and testing sets in Experiment 2 are presented in Table 6.23 and analyzed graphically in Figure 6.23.

Table 6.23: Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))

    DSOM       Train (%)    Test (%)
    10 x 10    97.33        87.83
    12 x 12    97.33        91.00
    15 x 15    98.50        90.33
    20 x 20    98.00        91.33
[Line graph: train and test recognition accuracy (%) against SOM dimension (10 x 10 – 20 x 20).]
Figure 6.23: Recognition accuracy for different DSOM for Experiment 2 (WRPS(W))
From Table 6.23 and Figure 6.23, we found that the results obtained are not fully consistent in either set; however, DSOM of 15 x 15 and 20 x 20 give the best results on the training and testing sets respectively. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.4.3 Experiment 3: Optimal Hidden Node Number (HNN)
The recognition accuracies for the training and testing sets in Experiment 3 are presented in Table 6.24 and analyzed graphically in Figure 6.24.

Table 6.24: Recognition accuracy for different HNN for Experiment 3 (WRPS(W))

    HNN     Train (%)    Test (%)
    82      97.33        88.83
    110     98.00        91.33
    138     96.50        91.00
From Table 6.24 and Figure 6.24, we found that an HNN of 110 (GPR) achieves the highest accuracy on both the training and testing sets. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
[Line graph: train and test recognition accuracy (%) against hidden node number (82, 110, 138).]
Figure 6.24: Recognition accuracy for different HNN for Experiment 3 (WRPS(W))
6.3.4.4 Experiment 4: Optimal Learning Rate (LR)
The recognition accuracies for the training and testing sets in Experiment 4 are presented in Table 6.25 and analyzed graphically in Figure 6.25.

Table 6.25: Recognition accuracy for different LR for Experiment 4 (WRPS(W))

    LR      Train (%)    Test (%)
    0.1     98.00        91.33
    0.2     98.00        91.33
    0.3     97.00        88.67
    0.4     98.50        88.00
[Line graph: train and test recognition accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.25: Recognition accuracy for different LR for Experiment 4 (WRPS(W))
From Table 6.25 and Figure 6.25, we found that the results obtained on the training set are almost consistent, while a smaller LR gives better results on the testing set. The decrease in accuracy on the testing data is expected, as the testing data differs from the training data and is not used to train the system.
6.3.5  Discussion for Word Recognition Testing
6.3.5.1 Comparison of Performance for WRCS and WRPS according to CO
The comparison of performance for WRCS and WRPS according to CO is presented in Tables 6.26 and 6.27 and analyzed in Figures 6.26 and 6.27. The comparison is based on the testing-set results.

Table 6.26: Comparison of performance for WRCS(S) and WRPS(S) according to CO

    CO      WRCS(S) (%)    WRPS(S) (%)    WRPS(S) – WRCS(S)
    12      76.83          88.00          +11.17
    16      81.67          90.67          +9.00
    20      78.00          94.67          +16.67
    24      84.83          95.83          +11.00
[Line graph: WRCS(S) and WRPS(S) testing accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.26: Comparison of performance for WRCS(S) and WRPS(S) according to CO.
Table 6.27: Comparison of performance for WRCS(W) and WRPS(W) according to CO

    CO      WRCS(W) (%)    WRPS(W) (%)    WRPS(W) – WRCS(W)
    12      71.33          83.00          +11.67
    16      77.33          84.83          +7.50
    20      75.67          90.33          +14.66
    24      79.00          91.00          +12.00
[Line graph: WRCS(W) and WRPS(W) testing accuracy (%) against cepstral order (12, 16, 20, 24).]
Figure 6.27: Comparison of performance for WRCS(W) and WRPS(W) according to CO.
From Table 6.26, Table 6.27, Figure 6.26 and Figure 6.27, several observations can be made:
a) In WRCS(S) and WRCS(W), the order of cepstral order from lowest to highest accuracy is 12, 20, 16 and 24.
b) In WRPS(S) and WRPS(W), the order of cepstral order from lowest to highest accuracy is 12, 16, 20 and 24.
c) The highest recognition accuracy for both WRCS and WRPS is achieved at a cepstral order of 24.
d) WRPS achieves higher accuracy than WRCS on every test set, with both syllable and word classification.
It is shown that WRPS outperforms WRCS in every test set according to cepstral order. A higher cepstral order gives better performance; however, it also produces a longer feature vector, which may lengthen the network training time. Performance-wise, a cepstral order of 24 can be considered optimal among the tested orders (12, 16, 20 and 24).
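The growth of the feature vector is easy to quantify: in the conventional system the MLP input is the concatenation of all frames, so its length is frames x CO. The counts below match the input-node values in Appendix H:

    FRAMES = 50                    # frames per word utterance
    for co in (12, 16, 20, 24):
        print(co, FRAMES * co)     # 600, 800, 1000, 1200 input nodes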
6.3.5.2 Comparison of Performance for WRCS and WRPS according to HNN
The comparison of performance for WRCS and WRPS according to HNN is presented in Tables 6.28 – 6.29 and analyzed in Figures 6.28 – 6.29. The comparison is based on the testing-set results.

Table 6.28: Comparison of performance for WRCS(S) and WRPS(S) according to HNN

    HNN       WRCS(S) (%)    WRPS(S) (%)    WRPS(S) – WRCS(S)
    ¾ GPR     83.50          94.33          +10.83
    GPR       84.83          96.67          +11.84
    1¼ GPR    87.33          96.00          +8.67
[Line graph: WRCS(S) and WRPS(S) testing accuracy (%) against hidden node number (¾ GPR, GPR, 1¼ GPR).]
Figure 6.28: Comparison of performance for WRCS(S) and WRPS(S) according to HNN.
Table 6.29: Comparison of performance for WRCS(W) and WRPS(W) according to HNN

    HNN       WRCS(W) (%)    WRPS(W) (%)    WRPS(W) – WRCS(W)
    ¾ GPR     77.67          88.83          +11.16
    GPR       79.00          91.33          +12.33
    1¼ GPR    77.00          91.00          +14.00
[Line graph: WRCS(W) and WRPS(W) testing accuracy (%) against hidden node number (¾ GPR, GPR, 1¼ GPR).]
Figure 6.29: Comparison of performance for WRCS(W) and WRPS(W) according to HNN.
From Table 6.28, Table 6.29, Figure 6.28 and Figure 6.29, several observations can be made:
a) In WRCS(S), the order of hidden node number from lowest to highest accuracy is ¾ GPR, GPR and 1¼ GPR.
b) In WRCS(W), the order of hidden node number from lowest to highest accuracy is 1¼ GPR, ¾ GPR and GPR.
c) In WRPS(S) and WRPS(W), the order of hidden node number from lowest to highest accuracy is ¾ GPR, 1¼ GPR and GPR.
d) WRPS achieves higher accuracy than WRCS on every test set, with both the syllable and word approaches.
It is shown that WRPS outperforms WRCS in every test set according to hidden node number. From the WRPS testing results, the hidden node number given by GPR achieves the best performance. Thus, the hidden node number according to GPR can be considered the optimal hidden node number.
6.3.5.3 Comparison of Performance for WRCS and WRPS according to LR
The comparison of performance for WRCS and WRPS according to LR is presented in Tables 6.30 – 6.31 and analyzed in Figures 6.30 – 6.31. The comparison is based on the testing-set results.

Table 6.30: Comparison of performance for WRCS(S) and WRPS(S) according to LR

    LR      WRCS(S) (%)    WRPS(S) (%)    WRPS(S) – WRCS(S)
    0.1     86.00          96.67          +10.67
    0.2     87.33          96.67          +9.34
    0.3     80.33          96.67          +16.34
    0.4     78.67          95.33          +16.66
[Line graph: WRCS(S) and WRPS(S) testing accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.30: Comparison of performance for WRCS(S) and WRPS(S) according to LR.
Table 6.31: Comparison of performance for WRCS(W) and WRPS(W) according to LR

    LR      WRCS(W) (%)    WRPS(W) (%)    WRPS(W) – WRCS(W)
    0.1     79.83          91.33          +11.50
    0.2     79.00          91.33          +12.33
    0.3     76.67          88.67          +12.00
    0.4     77.33          88.00          +10.67
[Line graph: WRCS(W) and WRPS(W) testing accuracy (%) against learning rate (0.1 – 0.4).]
Figure 6.31: Comparison of performance for WRCS(W) and WRPS(W) according to LR.
From Tables 6.30 – 6.31 and Figures 6.30 – 6.31, several observations can be made:
a) In WRCS(W), the order of learning rate from lowest to highest accuracy is 0.3, 0.4, 0.2 and 0.1.
b) In WRCS(S), the order of learning rate from lowest to highest accuracy is 0.4, 0.3, 0.1 and 0.2.
c) In WRPS(S) and WRPS(W), the order of learning rate from lowest to highest accuracy is 0.4, 0.3, 0.2 and 0.1.
d) The accuracies remain constant from 0.1 – 0.3 in the WRPS(S) tests.
e) WRPS achieves higher accuracy than WRCS on every test set.
It is shown that WRPS outperforms WRCS in every test set according to learning rate. The MLP generalizes better with a smaller learning rate, from 0.1 to 0.3 for WRPS(S). Thus, a learning rate in the range 0.1 – 0.3 can be considered the optimal learning rate.
6.3.5.4 Comparison of Performance of WRPS according to DSOM
The comparison of performance for WRPS according to DSOM is presented in Table 6.32 and analyzed in Figure 6.32. The comparison is based on the testing-set results.

Table 6.32: Comparison of performance for WRPS according to DSOM

    DSOM       WRPS(S) (%)    WRPS(W) (%)    WRPS(S) – WRPS(W)
    10 x 10    91.33          87.83          +3.50
    12 x 12    95.83          91.00          +4.83
    15 x 15    96.67          90.33          +6.34
    20 x 20    95.33          91.33          +4.00
From Table 6.32 and Figure 6.32, several observations can be made:
a) In WRPS(S), the order of SOM dimension from lowest to highest accuracy is 10 x 10, 20 x 20, 12 x 12 and 15 x 15.
b) In WRPS(W), the order of SOM dimension from lowest to highest accuracy is 10 x 10, 15 x 15, 12 x 12 and 20 x 20.
[Line graph: WRPS(S) and WRPS(W) testing accuracy (%) against SOM dimension (10 x 10 – 20 x 20).]
Figure 6.32: Comparison of performance for WRPS according to DSOM
From the testing results, it is shown that SOM dimensions of 15 x 15 and 20 x 20 are the most suitable for WRPS(S) and WRPS(W) respectively. It is reasonable that WRPS(W) needs more SOM nodes to guarantee an adequate feature map capable of storing all of the acoustic information of the speech.
Besides, WRPS(W) has a limitation: it cannot distinguish words with similar speech content but different phoneme ordering (e.g. the words /buku/ and /kubu/). Because a single matrix accumulates the mapping results, the temporal information of the input acoustic vector sequence is lost and only the acoustic content is retained. This may cause confusion among words having similar acoustic content but different phoneme ordering.
An example is shown in Figures 6.33(a) and (b) for the two Malay words “buku” and “kubu”. In the figures, ● denotes the value 1 and ○ denotes the value 0. It can be seen that the two maps are almost identical; only the sequence of phonemes runs in opposite directions. A short code sketch after Figure 6.33 makes this order-blindness concrete.
[Map sketch: winner cells for the phones /b/, /u/ and /k/, with numbered arrows 1 – 3 tracing the phoneme sequence /b/ → /u/ → /k/ → /u/.]
Figure 6.33(a): Matrix mapping of the word “buku”, where the arrows show the direction of the phoneme sequence.

[Map sketch: the same winner cells, with numbered arrows 1 – 3 tracing the reverse sequence /k/ → /u/ → /b/ → /u/.]
Figure 6.33(b): Matrix mapping of the word “kubu”, where the arrows show the direction of the phoneme sequence.
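As flagged above, a toy sketch of the confusion, assuming hypothetical winner cells for the phones /b/, /u/ and /k/ on a 12 x 12 map:

    import numpy as np

    winners = {"b": (10, 2), "u": (5, 6), "k": (1, 9)}  # hypothetical cells

    def binary_matrix(phones, dim=12):
        # Order-blind accumulation: only which cells fired is kept.
        m = np.zeros((dim, dim), dtype=np.uint8)
        for p in phones:
            m[winners[p]] = 1
        return m

    print(np.array_equal(binary_matrix("buku"), binary_matrix("kubu")))
    # True: the two words produce identical binary matrices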
6.3.5.5 Comparison of Performance for WRCS and WRPS according to Type of Classification

For word recognition, two types of classification were implemented in both the conventional system and the proposed system: syllable and word classification. The two types differ in their architecture, which indirectly affects the performance of both systems in terms of recognition accuracy. The comparison of performance for WRCS and WRPS according to the type of classification is presented in Table 6.33 and analyzed in Figure 6.34. Only the best testing-set result (highest accuracy) is presented for each system.
Table 6.33: Results of testing for WRCS and WRPS according to type of classification

                    Highest testing accuracy (%)
    System          Syllable (S)    Word (W)    (S) – (W)
    Conventional    87.33           79.83       +7.50
    Proposed        96.67           91.33       +5.34
[Chart: highest testing accuracy (%) of syllable classification and word classification for the conventional and proposed systems.]
Figure 6.34: Comparison of performance for WRCS and WRPS according to syllable classification and word classification.
6.3.5.6 Summary of Discussion for Word Recognition
For word recognition, the proposed system outperforms the conventional system in every test set. We also evaluated both systems with syllable and word classification. From the tests, the proposed system proved better, with an acceptable result and a clear improvement in recognition accuracy. The optimal parameters and architecture for the proposed system with syllable classification, determined from the test results, are shown in Table 6.34. Syllable classification outperforms word classification in both the conventional and the proposed system, even though the recognition process with syllable classification is more complicated. Based on the experiments, we found that the recognition accuracy in word recognition is lower than in digit recognition because of the larger number of target words to be recognized.
Table 6.34: The optimal parameters and the architecture for WRPS(S)

    Component            Parameter                           Value
    Speech Processing    Analysis frame length / shifting    240 / 80 (samples)
                         Cepstral order                      24
    SOM                  Learning rate                       0.25
                         Dimension                           15 x 15
    MLP                  Input node                          225
                         Hidden node                         58
                         Output node                         15
                         Learning rate                       0.1 – 0.3
                         Momentum                            0.8
                         Max epoch / error function          1000 / 0.01
6.4 Summary
From the tests in digit recognition and word recognition, we can see that the conventional system without SOM only achieves a recognition accuracy between 70% and 90%, while the proposed system, which applies SOM and MLP, achieves an acceptable result with a recognition accuracy above 90%. In short, the performance of the proposed system with both SOM and MLP is better than that of the conventional system. This is because the SOM in the proposed system performs dimensionality reduction of the feature vectors, which also simplifies the classification task of the MLP. Therefore, a network architecture combining SOM and MLP can be considered a new and efficient approach to improving speech recognition performance. In terms of the type of classification used, syllable classification performs better, giving the highest recognition accuracy for both the conventional and the proposed system. This is because the effective vocabulary becomes smaller when the syllable is used as the classification unit.
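The dimensionality reduction behind this gain is simple to state: for word recognition with syllable classification, the MLP input shrinks from all concatenated frames to a single binary map (values per Appendices H and Q):

    conventional = 50 * 24    # 50 frames x CO 24 = 1200 MLP input nodes
    proposed     = 15 * 15    # 15 x 15 binary SOM matrix = 225 input nodes
    print(conventional, proposed)   # 1200 225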
CHAPTER 7
CONCLUSION AND SUGGESTION
7.1  Conclusion
This research has fulfilled its objectives and has contributed towards the development of a hybrid neural network model for Malay speech recognition. The proposed model combines two neural networks, namely the Self-Organizing Map (SOM) and the Multilayer Perceptron (MLP). The performance of the proposed model was evaluated through its recognition accuracy and compared with that of the conventional model, as shown in the previous chapter.
In the study of neural networks and their abilities in speech recognition, we found that the MLP is a powerful pattern recognition technique, characterized by its ability to generalize. However, it may not be best used alone, as it has limitations in pattern recognition. Therefore, we developed a hybrid model by combining the MLP with an unsupervised neural network, the SOM, in order to obtain optimal performance for the speech recognition system.
It is interesting to note that although the SOM has been used in the speech recognition field for more than a decade, few studies have used it to produce matrices; it has mostly been used to generate sequences of labels (Kohonen, 1992). Finally, the new approach developed for the neural network architecture in Malay speech recognition proved to be simple and very efficient. It considerably reduced the amount of calculation needed to find the correct set of parameters.
In the experiments, it was shown that the recognition performance of the proposed model is better than that of the conventional model. The SOM was found to be a suitable neural model and tool for dimensionality reduction as well as for speech recognition. Although none of the approaches proved ready for practical purposes at the present stage of development, they were good enough to show that translating speech from acoustic features into a binary matrix in a feature space works for dimensionality reduction, which may simplify the recognition task. Human speech is an inherently dynamic process that can be adequately described as a binary matrix in a suitable feature space. Moreover, the dimensionality reduction scheme proved to reduce the dimensionality while preserving some of the original topology of the matrix; for example, it preserved enough information to allow good recognition accuracy.
7.2  Directions for Future Research
Besides improving the feature extraction (FE) block and devising a more robust recognizer, the scope of the problem should be broadened to larger vocabularies, continuous speech and more speakers. From this perspective, the results presented in this thesis are only preliminary.
The binary matrices generated by the SOM can be very noisy, which complicates the recognition process. This can be a natural consequence of the speech signal, or an artifact caused by the feature extraction scheme. If the latter is the case, it would not be surprising, since the LPC-derived cepstral coefficient scheme is not very efficient at representing noise-like sounds such as the consonant /s/ (Zbancioc and Costin, 2003). It may be more appropriate to use feature extractors that do not lose the essential information before the SOM's dimensionality reduction scheme is applied. For example, the output of a Fourier or wavelet transform (Deller et al., 1993; Tilo, 1994), both of which retain all the information needed to reconstruct the original signal, could be used directly as input to the SOM.
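A minimal sketch of the suggested alternative, assuming a plain short-time Fourier magnitude spectrum per 240-sample frame (a wavelet front end would be analogous):

    import numpy as np

    def fft_magnitude(frame):
        # Magnitude spectrum of one windowed analysis frame; unlike an
        # LPC fit, it makes no all-pole assumption, so noise-like
        # sounds such as /s/ are represented directly.
        return np.abs(np.fft.rfft(frame * np.hamming(len(frame))))

    frame = np.random.randn(240)         # placeholder 240-sample frame
    print(fft_magnitude(frame).shape)    # (121,) -> candidate SOM input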
With respect to the dimensionality reduction, as the vocabulary grows, the SOM feature space will start to crowd with binary matrices. It is important to study how this crowding effect influences the recognition accuracy when binary matrices are used. It has been argued that a hierarchical (multilayer) approach to knowledge representation, information extraction and problem solving is the most suitable strategy in complex settings. Furthermore, the determination of structure has far-reaching consequences: too small a structure implies an inability to generate a satisfactory representation of the problem, while too large a structure may over-learn the problem, resulting in poor generalization performance. Some enhancements can be made to the standard SOM model by using SAMSOM (Fritzke, 1994; Dittenbach et al., 2000), an overlapped, multilayer structure with a decision fusion mechanism for the SOM.
The SOM learning proposed in this thesis, however, may have poor quantization performance, since the learning does not necessarily lead to a global or local minimum of the average distortion over the whole training set, as usually defined in conventional vector quantization methods, and the learning procedure depends on the presentation order of the input data. Learning Vector Quantization (LVQ) (Wang and Peng, 1999) and the K-means clustering method, on the other hand, can reach a local minimum of the average distortion, but the resulting codewords have no structural properties. Therefore, after training by self-organization, the feature map can be retuned using LVQ or K-means in order to improve the quantization performance while preserving the self-organizing property of the feature map.
Furthermore, the architecture of the MLP can be modified: instead of training one MLP to recognize a long and complex binary matrix, it may be easier to train a hierarchy of simpler MLPs, each designed to recognize a portion of that matrix (Jurgen, 1996; Siva, 2000).
The major drawback of the MLP is its long training time. Introducing a larger training set improves the generalization of the network, but it also decreases the convergence rate drastically. As found in the experiments, the MLP generalizes well with a small learning rate, but a small learning rate causes the weight adjustment to progress slowly. Thus, it is impractical, in terms of training time, to train the network with a large set of training patterns. In order to accelerate the learning of the network, an adaptive learning rate can be adopted. Its development is based on an analysis of the convergence of the conventional gradient descent method for the MLP. This efficient way of learning dynamically varies the learning rate according to changes in the gradient values (Shin, 1992).
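One common adaptive-rate heuristic, sketched below, captures the idea (this is an illustration only, not the gradient-based scheme of Shin (1992)):

    def adapt_lr(lr, prev_error, error, grow=1.05, shrink=0.7):
        # Increase the rate while the training error keeps falling;
        # cut it back sharply when the error rises.
        return lr * grow if error < prev_error else lr * shrink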
In this research, we chose to combine the MLP with the SOM in a hybrid neural network because the MLP is a basic feedforward neural network whose structure is simpler than that of other models. However, the performance of the MLP depends heavily on the accuracy of the endpoint detection or segmentation of the speech signal. It is possible to use other ANN architectures that are less dependent on precise endpoint detection in time, such as the Time-Delay Neural Network (TDNN) (Waibel et al., 1989; Lang, 1989; Lang and Waibel, 1990; Zhong et al., 1999). The TDNN introduces delay neurons in the input and first hidden layers of the network, so it considers not only the current speech input features but also their history. In this way, the TDNN can learn the dynamic properties of the sounds, which helps it deal with the co-articulation problem in speech recognition.
So far, all the approaches used in speech recognition require a significant number of examples per class. For a vocabulary of thousands of words, this would require the user of the system to utter hundreds of thousands of examples in order to train it. New approaches must be developed so that the information acquired by one module can be used to train other modules — for example, using previously learned information to deduce the matrices that correspond to non-uttered words.
REFERENCES
Ahad, A., Fayyaz, A. and Mehmood, T. (2002). Speech Recognition Using Multilayer
Perceptron. IEEE Proceedings of Students Conference (ISCON’02). 1: 103-109.
Ahkuputra, V., Jitapunkul, S., Wutiwiwatchai, C., Maneenoi, E. and Kasuriya, S. (1998).
Comparison of Thai Speech Recognition Systems using Different Techniques.
Proceedings of the IASTED International Conference - Signal and Image
Processing ’98. 517-520
Ahmad, A. M., Eng, G. K., Shaharoun, A. M., Yeek, T. C. and Jarni, M. H. B. (2004). An
Isolated Speech Endpoint Detector Using Multiple Speech Features. In Proc.
IEEE Region 10 Conference (B). 2: 403-406.
Aicha, E.G., Brieuc, C. G. and Fabrice, R. (2004). Self Organizing Map and Symbolic
Data. Journal of Symbolic Data Analysis. 2(1).
Aleksander, I. and Morton, H. (1990). An Introduction to Neural Computing. London:
Chapman and Hall.
Anderson, J. and Rosenfeld, E. (1988). Neurocomputing: Foundations of Research.
Cambridge: MIT Press.
Anderson, S. E. (1999). Speech Recognition Meets Bird Song: A Comparison of
Statistics-based and Template-based Techniques. Journal of Acoust. Soc. Am.
Aradilla, G., Vepa, J. and Bourlard, H. (2005). Improving Speech Recognition Using a
Data-Driven Approach. Proceedings of Interspeech.
Barto, A. and Anandan, P. (1985). Pattern Recognizing Stochastic Learning Automata.
IEEE Transactions on Systems, Man, and Cybernetics. 15: 360-375.
Berke, L. and Hajela, P. (1991). Application of Neural Nets in Structural Optimisation.
NATO/AGARD Advanced Study Institute. 23(1-2): 731-745.
Bourland, H. and Wellekens, J. (1987). Multilayer Perceptron and Automatic Speech
Recognition. IEEE Neural Networks.
Burr, D. (1988). Experiments on Neural Net Recognition of Spoken and Written Text. In
IEEE Trans. on Acoustics, Speech, and Signal Processing. 36: 1162-1168.
Choubassi, M. E., Khoury, H. E., Alagha, C. J., Skaf, J. and Al-Alaoui, M. (2003). Arabic
Speech Recognition Using Recurrent Neural Networks. In IEEE International
Symposium on Signal Processing and Information Technology.
Delashmit, W. H. and Manry, M. T. (2005). Recent Developments in Multilayer
Perceptron Neural Networks. Proceedings of the 7th annual Memphis Area
Engineering and Science Conference (MAESC).
Deller, J., Proakis, J. and Hansen, J. (1993). Discrete-Time Processing of Speech Signals.
McMillan Publishing Co.
Dittenbach, M., Merkl, D. and Rauber, A. (2000). Growing Hierarchical Self-Organizing
Map. IEEE International Joint Conference on Neural Network. 6: 15-19.
Elman, J. and Zipser, D. (1987). Learning the Hidden Structure of Speech. ICS Report
8701, Institute for Cognitive Science, University of California, San Diego, La
Jolla, CA.
Fausett, L. (1994). Fundamentals of Neural Networks. New Jersey: Prentice-Hall, Inc.
Fritzke, B. (1994). Growing Cell Structures - A Self-Organizing Network for
Unsupervised and Supervised Learning. Neural Networks. 7(9): 1441-1460.
Gavat, I., Valsan, Z. and Sabac, B. (1998). Combining Self-Organizing Map and
Multilayer Perceptron in a Neural System for Improved Isolated Word
Recognition. Communication98. 245-255.
Gold, B. (1988). A Neural Network for Isolated Word Recognition. In Proc. IEEE
International Conference on Acoustics, Speech, and Signal Processing.
Grant, P. M. (1991). Speech Recognition Techniques. Electronic and Communication
Engineering Journal. 37-48.
Hansen, L. K. and Salamon, P. (1990). Neural Network Ensembles. IEEE Transactions
on Pattern Analysis and Machine Intelligence. 12: 993-1001.
Ha-Jin, Y. and Yung-Hwan, O. (1996). A Neural Network using Acoustic Sub-word
units for Continuous Speech Recognition. The Fourth International Conference
on Spoken Language Processing, ICSLP96. 506-509.
Haykin, S. (1994). Neural Networks – A Comprehensive Foundation. New York:
Macmillan College Publishing Company, Inc.
Haykin, S. (2001). Adaptive Filter Theory (4th Edition). New York: Prentice Hall, Inc.
Hertz, J., Krogh, A. and Palmer, R. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley.
Hewett, A. J. (1989). Training and Speaker Adaptation in Template-Based Speech
Recognition. Cambridge University: PhD Thesis.
Hochberg, M. M., Cook, G. D., Renals, S. J. and Robinson, A. J. (1994). Connectionist
Model Combination for Large Vocabulary Speech Recognition. IEEE Neural
Networks for Signal Processing IV. 269-278.
Hopfield, J. (1982). Neural Networks and Physical Systems with Emergent Collective
Computational Abilities. Proc. National Academy of Sciences USA, 79: 2554-2558.
Huang, W. M. and Lippmann, R. (1988). Neural Net and Traditional Classifiers. In
Neural Information Processing Systems. 387-396.
Huang, X. D. (1992). Phoneme Classification using Semicontinuous Hidden Markov
Models. IEEE Trans. on Signal Processing, 40(5).
Itakura, F. (1975). Minimum Prediction Residual Principle Applied to Speech
Recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 23(1): 67-72.
Jurgen, F. (1996). Modular Neural Networks for Speech Recognition. Interactive
Systems Laboratories. Carnegie Mellon University (USA) and University of
Karlsruhe (German): Diploma Thesis.
Kammerer, B. and Kupper, W. (1990). Experiments for Isolated Word Recognition with
Single and Two-Layer Perceptrons. Neural Networks. 3: 693-706.
Keun-Rong, H. and Wen-Tsuen, C. (1993). A Neural Network Model which Combines
Unsupervised and Supervised Learning. IEEE Transactions on Neural Networks.
4: 357- 360.
Kangas, J., Torkkola, K. and Kokkonen, M. (1992). Using SOMs as Feature Extractors
for Speech Recognition. IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP92).
Kohonen, T. (1988a). The Neural Phonetic Typewriter. IEEE Computer. 11-22.
Kohonen, T. (1988b). Self-Organization and Associative Memory. New York: Spring
Verlag.
Kohonen, T., Torkkola, K., Shozakai, M., Kangas, J. and Venta, O. (1988). Phonetic
Typewriter for Finnish and Japanese. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP88). 607-610.
Kohonen, T. (1992). The Neural Phonetic Typewriter. In Artificial Neural Networks,
IEEE Press, Piscataway, NJ. 42-51.
Kohonen, T. (1995). Self-Organizing Maps. Springer, Berlin, Heidelberg.
Kohonen, T. (2002). Self-Organizing Neural Networks - Recent Advances and
Applications. Studies in Fuzziness and Soft Computing. 78: 1-12.
Kusumoputro, B. (1999). Development of Self-Organized Network with a Supervised
Training in Artificial Odor discrimination System. In Computational Intelligence
for Modelling, Control & Automation. 57-62.
Lang, K. (1989). A Time-Delay Neural Network Architecture for Speech Recognition.
Carnegie Mellon University: PhD. Thesis.
Lang, K. J. and Waibel, A. H. (1990). A Time-Delay Neural Network Architecture for
Isolated Word Recognition. Neural networks. 3: 23-43.
Lee, K. F. (1988). Large Vocabulary Speaker-Independent Continuous Speech
Recognition. The SPHINX System. Carnegie Mellon University: PhD. Thesis.
Lee, T. and Ching, P. C. (1999). Cantonese Syllable Recognition Using Neural Networks.
IEEE Transactions on Speech and Audio Processing. 7: 466-472.
Lippmann, R. (1989). Review of Neural Networks for Speech Recognition. Neural
Computation. 1(1): 1-38.
Lutfi, A. (1971). Linguistik Deskriptif dan Nahu Bahasa Melayu. Kuala Lumpur: Dewan
Bahasa dan Pustaka.
Maniezzo, V. (1994). Genetic Evolution of the Topology and Weight Distribution of
Neural Networks. IEEE Trans. on Neural Networks. 5(1): 39-53.
Mashor, M.Y. (1999). Some Properties of RBF Network with Applications to System
Identification. International Journal of Computer and Engineering Management.
7(1): 34-56.
Masters, T. (1993). Practical Neural Network Recipes in C++. San Diego: Academic
Press, Inc.
Md, S. H. S., Dzulkifli, M. and Sheikh, H. S. S. (2001). Neural Network Speaker-Dependent
Isolated Malay Speech Recognition System: Handicrafted vs Genetic
Algorithm. Proceedings of the International Symposium on Signal Processing and
Its Application (ISSPA 2001). 2: 731-734.
Nik, S. K., Farid, M. O., Hashim, H. M. and Abdul, H. M. (1995). Tatabahasa Dewan.
New edition. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Pablo, Z. (1998). Speech Recognition Using Neural Networks. University of Arizona:
Master Thesis.
Pandya, A. S. and Macy (1996). Pattern Recognition with Neural Network in C++. CRC
Press Florida.
Parsons, T. W. (1986). Voice and Speech Processing. New York: McGraw-Hill.
Peeling, S. and Moore, R. (1987). Experiments in Isolated Digit Recognition Using the
Multi-Layer Perceptron. Technical Report 4073, Royal Speech and Radar
Establishment, Malvern, Worcester, Great Britain.
Peeling, S. M. and Moore, R. K. (1988). Isolated Digit Recognition Experiments Using
the Multi-Layer Perceptron. Speech Communication. 7: 403-409.
Peterson, G. E. and Barney, H. L. (1952). Control Methods Used in a Study of the
Vowels. J. Acoust. Soc. Am. 24: 175-184.
Picone, J. (1993). Signal Modeling Techniques in Speech Recognition. IEEE
Proceedings. 81(9): 1215-1247.
Pont, M. J., Keeton, P. I. J. and Palooran, P. (1996). Speech Recognition Using a
Combination of Auditory Models and Conventional Neural Networks. In ABSP-1996. 321-324.
Rabiner, L. R. (1976) Digital Processing of Speech Signals. New Jersey: Prentice-Hall,
Englewood Cliffs.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition. Proceedings of the IEEE, 77(2).
Rabiner, L. R. and Sambur, M. R. (1975). An Algorithm for Determining the Endpoints
of Isolated Utterances. The Bell System Technical Journal. 54(2): 297.
Rabiner, L. R., Wilpon, J. G. and Soong, F. K. (1989). High Performance Connected
Digit Recognition Using Hidden Markov Models. IEEE Transaction on
Acoustics, Speech and Signal Processings. 1214-1225.
Rabiner, L. R. and Juang, B. H. (1993). Fundamentals of Speech Recognition. Prentice
Hall.
Rosenblatt, F. (1962). Principles of Neurodynamics. New York: Spartan.
Rumelhart, D., Hinton, G. and Williams, G. (1986). Learning Internal Representations
by Error Propagation. Parallel Distributed Processing: Explorations in the
Micostructure of Cognition. M.I.T. Press.
Sakoe, H. and Chiba, S. (1978). Dynamic Programming Algorithm Optimization for
Spoken Word Recognition. IEEE Trans. on Acoustics, Speech, and Signal
Processing, 26(1): 43-49.
Salmela, P., Kuusisto, S. S., Saarinen, J., Laurila, K. and Haavisto, P. (1996). Isolated
Spoken Number Recognition with Hybrid of Self-Organizing Map and Multilayer
Perceptron. Proceedings of the International Conference on Neural Networks
(ICNN’96). 3: 1912-1918.
Savoji (1989). A Robust Algorithm for Accurate Endpointing of Speech Signals. Speech
Communication. 8: 45-60.
Sheikh, O. S. S., Mohammad, N. A. G. and Ibrahim, A. (1989). Kamus Dewan. 3rd ed.
Kuala Lumpur: Dewan Bahasa dan Pustaka.
Shih, C., Kochanski, G. P., Fosler-Lussier, E., Chan, M. and Yuan, J. (2001).
Implications of Prosody Modeling for Prosody Recognition. ISCA Workshop on
Prosody in Speech Recognition and Understanding.
Shin, W., Nobukazu, I., Mototaka, S., Hideo, M. and Yukio, Y. (1992). Method of
Deciding ANNs Parameters for Pattern Recognition. International Joint
Conference on Neural Networks. 4:19-24.
Siva, J. Y. (2000). Recognition of Consonant-Vowel (CV) Utterances Using Modular
Neural Network Models. Indian Institute of Technology, Madras: Master Thesis.
Tabatabai, V., Azimi, B., Zahir, A. S. B. and Lucas, C. (1994). Isolated Word
Recognition Using a Hybrid Neural Network. Proc. of International Conference
Acoustics, Speech and Signal Processing.
Tebelskis, J. (1995). Speech Recognition Using Neural Networks. School of Computer
Science, Carnegie Mellon University: PhD. Dissertation.
Tilo Schurer (1994). An Experimental Comparison of Different Feature Extraction and
Classification Methods for Telephone Speech. IEEE Workshop on Interactive
Voice Technology for Telecommunications Applications (IVTTA94). 93-96.
Ting, H. N., Jasmy, Y., Sheikh, H. S. S. and Cheah, E. L. (2001a). Malay Syllable
Recognition Based on Multilayer Perceptron and Dynamic Time Warping.
Proceedings of the 6th International Symposium on Signal Processing and Its
Applications (ISSPA 2001). 2: 743-744.
Ting, H. N., Jasmy, Y. and Sheikh, H. S. S. (2001b). Malay Syllable Recognition Using
Neural Networks. Proceedings of the IEEE Student Conference of Research and
Development (SCOReD 2001). Paper 081.
Ting, H. N., Jasmy Y. and Sheikh, H. S. S. (2001c). Classification of Malay Speech
Sounds Based on Place of Articulation and Voicing Using Neural Networks.
Proceedings of the IEEE Region 10 International Conference on Electrical and
Electronic Technology (TENCON 2001). 1: 170-173.
Villmann, T. (1999). Topology Preservation in Self-Organizing Maps. Kohonen Maps,
Elsevier. 279-292.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. (1989). Phoneme
Recognition Using Time-Delay Neural Networks. IEEE Trans. on Acoustics,
Speech, and Signal Processing. 37(3).
Wang, J. H. and Peng, C. Y. (1999). Competitive Neural Network Scheme for Learning
Vector Quantization. Electronics Letters. 35(9): 725-726.
Watrous, R. (1988). Speech Recognition using Connectionist Networks. University of
Pennsylvania: PhD. Thesis.
Wessel, F.L. and Barnard, E. (1992). Avoiding False Local Minima by Proper
Initialization of Connections. IEEE Trans. Neural Networks. 3: 899-905
Woodland, P., Odell, J., Valtchev, V. and Young, S. (1994). Large Vocabulary
Continuous Speech Recognition using HTK. In Proc. IEEE International
Conference on Acoustics, Speech, and Signal Processing.
Yamada, T., Hattori, M., Morisawa, M. and Ito, H. (1999). Sequential Learning for
Associative Memory Using Kohonen Feature Map. International Joint
Conference on Neural Networks (IJCNN’99). 3: 1920-1923.
Yiying Z., Xiaoyan, Z. and Yu, H. (1997). A Robust and Fast Endpoint Detection
Algorithm for Isolated Word Recognition. IEEE International Conference on
Intelligent Processing Systems. 4(3): 1819-1822.
Zbancioc, M. and Costin, M. (2003). Using Neural Networks and LPCC to Improve
Speech Recognition. International Symposium on Signals, Circuits and Systems
(SCS 2003). 2: 445-448.
Zhong, L., Yuanyuan, S. and Runsheng, L. (1999). A Dynamic Neural Network for
Syllable Recognition. International Joint Conference on Neural Networks
(IJCNN’99). 5: 2997-3001.
PUBLICATIONS
Eng, G. K., Ahmad, A. M., Shaharoun, A. M., Yeek, T. C. and Jarni, M. H. B. (2004).
An Isolated Speech Endpoint Detector Using Multiple Speech Features. In IEEE
Proceedings of TENCON 2004 IEEE Region 10 Conference. 2: 403 – 406.
Eng, G. K. and Ahmad, A. M. (2004). A 3-Level Endpoint Detection Algorithm for
Isolated Speech using Time and Frequency-based Feature. In Proceedings of
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP04).
Eng, G. K. and Ahmad, A. M. (2005). Malay Syllable Speech Recognition using Hybrid
Neural Network. In Proceedings of IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP05).
Eng, G. K. and Ahmad, A. M. (2005). Malay Speech Recognition using Self-Organizing
Map and Multilayer Perceptron. In Proceedings of Postgraduate Annual Research
Seminar (PARS’05), Universiti Teknologi Malaysia.
APPENDIX A
Specification of test on optimal CO for DRCS
    Item                               CO: 12 / 16 / 20 / 24
    Sampling rate (kHz)                16 (all)
    Analysis frame length (samples)    240 (all)
    Shifting (samples)                 80 (all)
    Total frame number                 70 (all)
    Input node                         840 / 1120 / 1400 / 1680
    Hidden node                        92 / 105 / 118 / 130
    Output node                        10 (all)
    Learning rate                      0.2 (all)
    Momentum                           0.8 (all)
    Termination EF (rms)               0.01 (all)
    Termination epoch                  1000 (all)
APPENDIX B
Specification of test on optimal HNN for DRCS
    Item                    HNN: 98 / 130 / 163
    Cepstral order          24 (all)
    Total frame number      70 (all)
    Input node              1680 (all)
    Output node             10 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX C
Specification of test on optimal LR for DRCS
    Item                    LR: 0.1 / 0.2 / 0.3 / 0.4
    Cepstral order          24 (all)
    Total frame number      70 (all)
    Input node              1680 (all)
    Hidden node             130 (all)
    Output node             10 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX D
Specification of test on optimal CO for DRPS
    Item                               CO: 12 / 16 / 20 / 24
    Sampling rate (kHz)                16 (all)
    Analysis frame length (samples)    240 (all)
    Shifting (samples)                 80 (all)
    DSOM                               12 x 12 (all)
    Input node                         144 (all)
    Hidden node                        58 (all)
    Output node                        10 (all)
    Learning rate                      0.2 (all)
    Momentum                           0.8 (all)
    Termination EF (rms)               0.01 (all)
    Termination epoch                  1000 (all)
APPENDIX E
Specification of test on optimal DSOM for DRPS
    Item                    DSOM: 10 x 10 / 12 x 12 / 15 x 15 / 20 x 20
    Cepstral order          24 (all)
    Input node              100 / 144 / 225 / 400
    Hidden node             32 / 38 / 48 / 63
    Output node             10 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX F
Specification of test on optimal HNN for DRPS
    Item                    HNN: 28 / 38 / 48
    Cepstral order          24 (all)
    DSOM                    12 x 12 (all)
    Input node              144 (all)
    Output node             10 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX G
Specification of test on optimal LR for DRPS
    Item                    LR: 0.1 / 0.2 / 0.3 / 0.4
    Cepstral order          24 (all)
    DSOM                    12 x 12 (all)
    Input node              144 (all)
    Hidden node             38 (all)
    Output node             10 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX H
Specification of test on optimal CO for WRCS(S)
    Item                               CO: 12 / 16 / 20 / 24
    Sampling rate (kHz)                16 (all)
    Analysis frame length (samples)    240 (all)
    Shifting (samples)                 80 (all)
    Total frame number                 50 (all)
    Input node                         600 / 800 / 1000 / 1200
    Hidden node                        95 / 110 / 122 / 134
    Output node                        15 (all)
    Learning rate                      0.2 (all)
    Momentum                           0.8 (all)
    Termination EF (rms)               0.01 (all)
    Termination epoch                  1000 (all)
APPENDIX I
Specification of test on optimal HNN for WRCS(S)
    Item                    HNN: 100 / 134 / 168
    Cepstral order          24 (all)
    Total frame number      50 (all)
    Input node              1200 (all)
    Output node             15 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX J
Specification of test on optimal LR for WRCS(S)
    Item                    LR: 0.1 / 0.2 / 0.3 / 0.4
    Cepstral order          24 (all)
    Total frame number      50 (all)
    Input node              1200 (all)
    Hidden node             134 (all)
    Output node             15 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX K
Specification of test on optimal CO for WRCS(W)
    Item                               CO: 12 / 16 / 20 / 24
    Sampling rate (kHz)                16 (all)
    Analysis frame length (samples)    240 (all)
    Shifting (samples)                 80 (all)
    Total frame number                 50 (all)
    Input node                         600 / 800 / 1000 / 1200
    Hidden node                        134 / 155 / 173 / 190
    Output node                        30 (all)
    Learning rate                      0.2 (all)
    Momentum                           0.8 (all)
    Termination EF (rms)               0.01 (all)
    Termination epoch                  1000 (all)
APPENDIX L
Specification of test on optimal HNN for WRCS(W)
    Item                    HNN: 142 / 190 / 238
    Cepstral order          24 (all)
    Total frame number      50 (all)
    Input node              1200 (all)
    Output node             30 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX M
Specification of test on optimal LR for WRCS(W)
    Item                    LR: 0.1 / 0.2 / 0.3 / 0.4
    Cepstral order          24 (all)
    Total frame number      50 (all)
    Input node              1200 (all)
    Hidden node             190 (all)
    Output node             30 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX N
Specification of test on optimal CO for WRPS(S)
    Item                               CO: 12 / 16 / 20 / 24
    Sampling rate (kHz)                16 (all)
    Analysis frame length (samples)    240 (all)
    Shifting (samples)                 80 (all)
    DSOM                               12 x 12 (all)
    Input node                         144 (all)
    Hidden node                        46 (all)
    Output node                        15 (all)
    Learning rate                      0.2 (all)
    Momentum                           0.8 (all)
    Termination EF (rms)               0.01 (all)
    Termination epoch                  1000 (all)
APPENDIX O
Specification of test on optimal DSOM for WRPS(S)
    Item                    DSOM: 10 x 10 / 12 x 12 / 15 x 15 / 20 x 20
    Cepstral order          24 (all)
    Input node              100 / 144 / 225 / 400
    Hidden node             38 / 46 / 58 / 77
    Output node             15 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX P
Specification of test on optimal HNN for WRPS(S)
    Item                    HNN: 44 / 58 / 72
    Cepstral order          24 (all)
    DSOM                    15 x 15 (all)
    Input node              225 (all)
    Output node             15 (all)
    Learning rate           0.2 (all)
    Momentum                0.8 (all)
    Termination EF (rms)    0.01 (all)
    Termination epoch       1000 (all)
APPENDIX Q

Specification of test on optimal LR for WRPS(S)

Item                          Setting
LR (values tested)            0.1, 0.2, 0.3, 0.4
Cepstral order                24
DSOM                          15 x 15
Input Node                    225
Hidden Node                   58
Output Node                   15
Momentum                      0.8
Termination EF (rms)          0.01
Termination epoch             1000
APPENDIX R

Specification of test on optimal CO for WRPS(W)

Item                          Setting
CO (values tested)            12, 16, 20, 24
Sampling rate (kHz)           16
Analysis Frame Length (ms)    240
Shifting (ms)                 80
DSOM                          12 x 12
Input Node                    144
Hidden Node                   65
Output Node                   30
Learning rate                 0.2
Momentum                      0.8
Termination EF (rms)          0.01
Termination epoch             1000
APPENDIX S

Specification of test on optimal DSOM for WRPS(W)

Item                          Setting
DSOM (values tested)          10x10, 12x12, 15x15, 20x20
Cepstral Order                24
Input Node (per DSOM)         100, 144, 225, 400
Hidden Node (per DSOM)        54, 68, 82, 110
Output Node                   30
Learning rate                 0.2
Momentum                      0.8
Termination EF (rms)          0.01
Termination epoch             1000
APPENDIX T

Specification of test on optimal HNN for WRPS(W)

Item                          Setting
HNN (values tested)           82, 110, 138
Cepstral order                24
DSOM                          20 x 20
Input Node                    400
Output Node                   30
Learning rate                 0.2
Momentum                      0.8
Termination EF (rms)          0.01
Termination epoch             1000
APPENDIX U

Specification of test on optimal LR for WRPS(W)

Item                          Setting
LR (values tested)            0.1, 0.2, 0.3, 0.4
Cepstral order                24
DSOM                          20 x 20
Input Node                    400
Hidden Node                   110
Output Node                   30
Momentum                      0.8
Termination EF (rms)          0.01
Termination epoch             1000
APPENDIX V

Convergence file (dua12.cep) showing the rms error at each epoch. The run terminates at epoch 362, where the rms error (0.009988) first falls below the 0.01 termination criterion, well before the 1000-epoch limit.
Epoch  rms        Epoch  rms        Epoch  rms        Epoch  rms        Epoch  rms
1  0.251450    74  0.029268    147  0.017836    220  0.013691    293  0.011406
2  0.244031    75  0.029079    148  0.017751    221  0.013652    294  0.011382
3  0.227514    76  0.028891    149  0.017668    222  0.013614    295  0.011358
4  0.185642    77  0.028705    150  0.017587    223  0.013576    296  0.011334
5  0.140053    78  0.028520    151  0.017506    224  0.013538    297  0.011311
6  0.108097    79  0.028337    152  0.017427    225  0.013501    298  0.011287
7  0.085516    80  0.028155    153  0.017350    226  0.013463    299  0.011264
8  0.071240    81  0.027973    154  0.017273    227  0.013426    300  0.011240
9  0.063772    82  0.027792    155  0.017198    228  0.013389    301  0.011217
10  0.059556    83  0.027613    156  0.017123    229  0.013352    302  0.011194
11  0.056840    84  0.027433    157  0.017050    230  0.013315    303  0.011171
12  0.054887    85  0.027255    158  0.016977    231  0.013279    304  0.011148
13  0.053365    86  0.027078    159  0.016905    232  0.013243    305  0.011125
14  0.052110    87  0.026901    160  0.016834    233  0.013206    306  0.011102
15  0.051031    88  0.026725    161  0.016764    234  0.013171    307  0.011080
16  0.050078    89  0.026549    162  0.016695    235  0.013135    308  0.011057
17  0.049214    90  0.026375    163  0.016626    236  0.013099    309  0.011034
18  0.048418    91  0.026202    164  0.016558    237  0.013064    310  0.011012
19  0.047675    92  0.026029    165  0.016491    238  0.013029    311  0.010989
20  0.046975    93  0.025857    166  0.016425    239  0.012994    312  0.010967
21  0.046311    94  0.025685    167  0.016359    240  0.012959    313  0.010945
22  0.045682    95  0.025514    168  0.016283    241  0.012924    314  0.010923
23  0.045087    96  0.025344    169  0.016221    242  0.012890    315  0.010901
24  0.044522    97  0.025173    170  0.016160    243  0.012849    316  0.010879
25  0.043988    98  0.025003    171  0.016100    244  0.012815    317  0.010857
26  0.043482    99  0.024832    172  0.016040    245  0.012783    318  0.010830
27  0.043000   100  0.024662    173  0.015981    246  0.012750    319  0.010808
28  0.042540   101  0.024491    174  0.015922    247  0.012718    320  0.010787
29  0.042099   102  0.024320    175  0.015864    248  0.012687    321  0.010767
30  0.041674   103  0.024148    176  0.015807    249  0.012655    322  0.010746
31  0.041263   104  0.023977    177  0.015749    250  0.012623    323  0.010726
32  0.040864   105  0.023806    178  0.015693    251  0.012592    324  0.010705
33  0.040476   106  0.023636    179  0.015637    252  0.012560    325  0.010685
34  0.040097   107  0.023467    180  0.015581    253  0.012529    326  0.010665
35  0.039726   108  0.023299    181  0.015525    254  0.012498    327  0.010645
36  0.039363   109  0.023134    182  0.015471    255  0.012467    328  0.010624
37  0.039007   110  0.022970    183  0.015416    256  0.012437    329  0.010604
38  0.038656   111  0.022809    184  0.015362    257  0.012406    330  0.010584
39  0.038313   112  0.022651    185  0.015308    258  0.012375    331  0.010564
40  0.037975   113  0.022495    186  0.015255    259  0.012345    332  0.010545
41  0.037643   114  0.022341    187  0.015202    260  0.012315    333  0.010525
42  0.037318   115  0.022189    188  0.015150    261  0.012285    334  0.010505
43  0.036999   116  0.022039    189  0.015098    262  0.012255    335  0.010485
44  0.036687   117  0.021889    190  0.015046    263  0.012225    336  0.010466
45  0.036381   118  0.021741    191  0.014995    264  0.012195    337  0.010446
46  0.036082   119  0.021592    192  0.014944    265  0.012166    338  0.010427
47  0.035789   120  0.021443    193  0.014885    266  0.012136    339  0.010407
48  0.035501   121  0.021293    194  0.014836    267  0.012107    340  0.010388
49  0.035218   122  0.021141    195  0.014788    268  0.012072    341  0.010368
50  0.034939   123  0.020988    196  0.014741    269  0.012043    342  0.010349
51  0.034665   124  0.020833    197  0.014694    270  0.012016    343  0.010325
52  0.034395   125  0.020675    198  0.014648    271  0.011988    344  0.010306
53  0.034127   126  0.020516    199  0.014602    272  0.011961    345  0.010288
54  0.033862   127  0.020355    200  0.014556    273  0.011934    346  0.010270
55  0.033600   128  0.020193    201  0.014510    274  0.011906    347  0.010252
56  0.033340   129  0.020030    202  0.014465    275  0.011880    348  0.010234
57  0.033083   130  0.019868    203  0.014420    276  0.011853    349  0.010216
58  0.032827   131  0.019708    204  0.014375    277  0.011826    350  0.010198
59  0.032575   132  0.019551    205  0.014331    278  0.011799    351  0.010181
60  0.032324   133  0.019398    206  0.014287    279  0.011773    352  0.010163
61  0.032077   134  0.019251    207  0.014243    280  0.011746    353  0.010145
62  0.031833   135  0.019109    208  0.014199    281  0.011720    354  0.010127
63  0.031594   136  0.018975    209  0.014156    282  0.011694    355  0.010110
64  0.031359   137  0.018847    210  0.014113    283  0.011668    356  0.010092
65  0.031128   138  0.018725    211  0.014070    284  0.011642    357  0.010075
66  0.030904   139  0.018610    212  0.014027    285  0.011616    358  0.010057
67  0.030684   140  0.018500    213  0.013985    286  0.011590    359  0.010040
68  0.030469   141  0.018395    214  0.013943    287  0.011564    360  0.010022
69  0.030259   142  0.018294    215  0.013901    288  0.011538    361  0.010005
70  0.030054   143  0.018197    216  0.013860    289  0.011513    362  0.009988
71  0.029853   144  0.018103    217  0.013818    290  0.011487
72  0.029655   145  0.018012    218  0.013769    291  0.011462
73  0.029460   146  0.017923    219  0.013729    292  0.011437
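If the pairs above are stored as plain alternating epoch/rms tokens (the layout of the original .cep convergence dump is not shown here, so both the filename and format below are assumptions), a few lines of Python recover the curve and confirm the stopping condition:

    # Hypothetical reader for an alternating "epoch rms" token file; the
    # filename and layout are assumed, not taken from the thesis.
    tokens = open("dua12_convergence.txt").read().split()
    epochs = [int(t) for t in tokens[0::2]]
    rms = [float(t) for t in tokens[1::2]]
    print(epochs[-1], rms[-1])   # expect: 362 0.009988, just under the 0.01 target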