
Multi-Lingual Speech Recognition Door Security Project Proposal

Acknowledgements
First, I would like to express my gratitude to Tunku Abdul Rahman University College for
giving me the opportunity to undertake a final year project through which I could develop and
strengthen my engineering skills in a field that interests me.
I would also like to thank my supervisor, Dr. Hong Kai Sze, for his useful guidance
and advice throughout my final year project. Despite his busy schedule, Dr. Hong arranged
weekly meetings and guided me whenever I ran into issues in the project. His knowledge and
experience in this field, together with his encouragement and patience, were crucial to the
completion of the project.
Finally, I would like to thank my family and friends for their support and encouragement
while I was completing this project, especially when I felt demotivated and discouraged. My
friends also shared useful tips from completing their own projects in past years on how to manage
time wisely while keeping up with my studies.
Door security systems have evolved in recent years, since the door plays an important role in
home security as the point through which people enter the house. This paper therefore proposes
a door security system using multi-lingual speech recognition techniques for the Chinese, Tamil and
English languages. MATLAB is used as the development platform for
this project. Mel-Frequency Cepstral Coefficients (MFCC) are used as the feature extraction
technique, a Hidden Markov Model (HMM) is used for training and matching, and the
GoldWave application is used to record the audio signals.
Table of Contents
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Problem Statement
1.3 Research Objectives
Chapter 2 Literature Review
2.1 Speech Recognition
2.1.1 Speech recognition by humans and machine
2.1.2 A critical review and analysis on techniques of speech recognition
2.2 Security System
2.2.1 Biometric Voice Recognition in Security System
2.2.2 Design and Implementation of Voice Recognition System (VRS) Security System using Biometric Technology
2.2.3 IoT System Design using Voice Recognition Technique for Door Access Control System
2.2.4 Keypad and Voice Recognition as Implementation of a Two-Level Security Door Access
2.2.5 Summary
2.3 Feature Matching and Feature Extraction
2.3.1 Voice Command Recognition System based on MFCC and DTW
2.3.2 Text-Independent Speaker Identification through Feature Fusion and Deep Neural Network
2.3.3 Speech Recognition and Verification Using MFCC & VQ
2.3.4 LPC and LPCC Method of Feature Extraction in Speech Recognition System
2.3.5 Hidden Markov Model (HMM) on Speech Recognition
2.3.6 Combination of VQ and Linde-Buzo-Gray (LBG) Algorithm Speech Recognition
2.3.7 Performance of Different Classifier for Speech Recognition System
2.3.8 Summary on Feature Extraction and Matching
2.4 Limitations of Speech Recognition
2.4.1 Speech Recognition in Noisy Environment
2.4.2 Problems Encountered by using Speech Recognition System
2.4.3 Summary in Limitations on Speech Recognition System
Chapter 3 Methodology
3.1 Flow Chart of Implementation Process
3.2 Software used for implementation
MATLAB
GoldWave
3.3 Gantt Chart
Chapter 4 Conclusion
References
Appendices
List of Figures
Figure 1-1 Stages of speech recognition system (Chavan & Sable 2013)
Figure 2-1 Speech corpora with different vocabulary size
Figure 2-2 Characteristics of speech corpora
Figure 2-3 General architecture of speech recognition system
Figure 2-4 Speech recognition security system
Figure 2-5 Process of MFCC
Figure 2-6 Accuracy result for 1 individual
Figure 2-7 Accuracy result for 10 individuals
Figure 2-8 Results of voice recognition security system
Figure 2-9 Results of voice recognition security system
Figure 2-10 Initial (up) and final (down) speech signal after MFCC
Figure 2-11 Comparison of same speech signal
Figure 2-12 Comparison of different speech signal
Figure 2-13 Proposed methodology for speaker identification system
Figure 2-14 Efficiencies of different machine learning algorithms in identifying gender
Figure 2-15 Visual image of VQ for 2 speakers inside a codebook
Figure 2-16 Block diagram for LPC method
Figure 2-17 Block diagram for LPCC method
Figure 2-18 5 state Markov chain
Figure 2-19 Flowchart of LBG algorithm
Figure 2-20 Experimental results for WER and accuracy
Figure 2-21 Accuracy of three classifiers
Figure 2-22 Block diagram of CNN (Passricha & Aggarwal 2018)
Figure 2-23 Simulation results on different NN-HMM models (Santos et al. 2015)
Figure 3-1 Flow Chart of Implementation Process of Speech Recognition Security System
List of Tables
Table 2-1 Summary of Feature Extraction and Matching
Table 3-1 Gantt Chart for Semester 1 and 2
Chapter 1 Introduction
Research Background
This research studies the need for a speech recognition security system
that supports the languages commonly used in Malaysia (English, Chinese, Tamil and Malay).
Housing areas are normally equipped with a security system, whether it is the traditional
key and lock or a more advanced method such as keypads and thumbprints. A security system
serves to deny unauthorized entry and so protect personal property or contents from
damage or loss (Biswas 2nd & Mynuddin 3rd n.d.). Home security has been improving year
by year with the recent introduction of digital doors that use passwords or thumbprints. However,
problems remain. A digital door that relies on physical touch,
words or numbers to unlock it can be hacked or guessed, whereas a key-and-lock
door can be broken down by brute force or defeated by duplicating keys or identity cards. There is also a
possibility that an individual does not lock the door properly because there is not enough time to double-check
the locks before leaving. Even employing human security guards is not safe enough,
because humans cannot perform their task efficiently all the time and are bound
to make mistakes that let intruders in. Therefore, this project focuses on implementing a
biometric security system that relies on the individual’s “body” as an entry ticket, as biometric
characteristics such as voice and eye scan are unique and can be used for identifying specific
individuals.
Speech is considered one of the biometric characteristics that are easily detectable by
computers, as the human voice contains a lot of information that can be extracted. When people
speak, they produce a certain pitch, frequency, tone and rhythm that is unique to each person. Different
genders also produce different ranges of pitch, as the average male has a lower voice
than the average female. Differences in where people live also produce different accents and
different ways of pronouncing a given word. This research aims to overcome these issues and
correctly recognize what the user is trying to say, in order to identify whether a certain key phrase
or word has been spoken by the user.
For a speech recognition system, four stages are responsible for processing a voice signal,
namely feature extraction, feature training, feature matching and finally decision logic.
MATLAB is used to code this system. Feature extraction of the user's speech
is done by applying Mel-Frequency Cepstral Coefficients (MFCC), which convert the
acoustic signal into a series of acoustic feature vectors. In addition, a convolutional neural network
(CNN) is used to train the system's database to recognize words, so that the speech signal can be
compared against it to identify the correct user (Najkar et al. 2010).
Figure 1-1 Stages of speech recognition system (Chavan & Sable 2013)
Problem Statement
Voice recognition is inconsistent because of differences in accent, pronunciation and
speaking speed. In addition, supporting multiple languages makes feature extraction from the voice more difficult.
Noise may also be present while the system is detecting the voice and extracting features, which
affects the final results.
This project focuses on a door security system based on recognizing the speech spoken to
it. The purpose of this project is to improve home security by
letting the house owner or his/her family open the door of the house using a multi-lingual speech
recognition system instead of keys or a padlock.
This project aims to design and evaluate a multi-language speech recognition system
that can be used by all Malaysians. The project uses algorithms such as Deep Learning and
Linear Predictive Coding to improve the reliability of the system by constantly updating the
vocabulary database, and to extract features effectively while the voice
is being detected by the security system.
Research Objectives
This project focuses on securing the doors of the house owner's room or house by having the owner
say a word or phrase to the control system in any language they prefer
(English/Chinese/Tamil/Malay/etc). The purpose of this project is to enhance the security of the
house using speech recognition instead of the traditional way of opening the house with a key
and padlock. Keys and padlocks can now be easily replicated and broken, hence
the home theft rate in Malaysia is rising year by year. The speech recognition control system
addresses this, as it recognizes only the voices of the house owner and his/her family, by extracting
features of their voices and making them recognizable to the control system. The development software
used for this project is MATLAB, the Mel-Frequency Cepstral Coefficient (MFCC) is used to extract
features from the speech spoken by the house owner and his/her family, and a Hidden Markov Model is
used for pattern matching.
By implementing this project, the following objectives are to be achieved:
1. To research and develop a speech recognition door security system that can recognize
English, Chinese, Tamil and Malay speech.
2. To record speech from various speakers to be used as the training template.
3. To test the accuracy of the speech recognition door security system using a testing template.
4. To evaluate the new system and propose new ideas to improve the system.
Chapter 2 Literature Review
In this chapter, several research papers are reviewed and analyzed to discuss their work on
speech recognition security systems that support multiple languages. These papers serve as references
for the successful implementation of the project.
Speech Recognition
Speech recognition by humans and machine
Figure 2-1 Speech corpora with different vocabulary size
Figure 2-2 Characteristics of speech corpora
In this paper, speech corpora with different vocabulary sizes are used to test the ability of
speech recognition systems, as shown in Figure 2-1. Each corpus is used to recognize the words in
sentences that talkers are prompted to read. The corpora differ in vocabulary range,
number of talkers, total duration of the recordings,
and the material read by the talkers; detailed characteristics of the six speech corpora
are shown in Figure 2-2. The error rates of machine recognizers on these corpora are compared with
those of humans and found to be higher than human error rates in every circumstance, including
spontaneous speech and noisy environments (Lippmann 1997).
A critical review and analysis on techniques of speech recognition
Figure 2-3 General architecture of speech recognition system
This paper reviews the speech recognition techniques used in 50 articles published from 2000 to 2015.
The feature extraction and classification techniques used in the 50 articles are listed in Figure 2-3.
There are also some common challenges in building a speech recognition system. One
is the speaker's emotion, which changes the audio signal; another is that
the choice of technique combination also affects the accuracy of speech recognition.
For instance, the MFCC algorithm for feature extraction has difficulty recognizing speech signals in
noisy environments, so a noise-adaptive classifier must be combined with the MFCC algorithm
to increase the accuracy of the system (Haridas et al. 2018).
Security System
Biometric Voice Recognition in Security System
Figure 2-4 Speech recognition security system
Figure 2-5 Process of MFCC
In this paper, the speech recognition security system shown in Figure 2-4 is built using the
MATLAB (SIMULINK) application. The system converts the input speech into an energy feature,
which is saved as the reference template. This is done by applying the MFCC algorithm, as
shown in Figure 2-5. The input speech feature is then compared with the model to produce logic
‘1’ or ‘0’ depending on whether the speech matches. Other methods are also discussed, namely Vector
Quantization (VQ) and the Gaussian Mixture Model (GMM). According to the results of the paper,
the specific user’s voice is successfully recognized and other users’ voices are rejected. Figures 2-6
and 2-7 show the results obtained by testing the system’s accuracy with 1 individual and
with 10 different people; the results highlighted in red are cases where the voice
recognition system was unable to provide the correct output. The incorrect outputs are attributed to
inconsistent energy output by the speaker, caused by how softly or loudly the speaker
speaks. Figure 2-7 also indicates that gender and age differences may affect the
accuracy of the system. Despite that, the accuracy of this system is 75% (Shah et al. 2014).
Figure 2-6 Accuracy result for 1 individual
Figure 2-7 Accuracy result for 10 individuals
Design and Implementation of Voice Recognition System (VRS) Security
System using Biometric Technology
This paper discusses the rise in popularity of biometric technology in recent years, from
fingerprints and handwriting to more recent methods including face scan, iris/eye scan, hand print
and voice print. It also lists various applications of voice biometrics, including eliminating cell
phone fraud through cell phone security access control, eliminating PIN fraud for ATM manufacturers
and reducing theft and carjacking for automobile manufacturers. A voice-controlled security system
is implemented in this research paper using MATLAB function blocks from SIMULINK to
develop verification algorithms that authenticate a person’s identity by their
voice pattern. The security system produces logic ‘1’ if the voice matches, while a mismatch
produces logic ‘0’. In addition, the door is controlled by a microcontroller circuit to test
the reliability of the voice-controlled security system. Figure 2-8 shows the results for 4 different
subjects of different genders, each speaking a simple word so that the system can extract his or
her voice pattern. The system compares the nonparametric power spectrum estimates of the reference
user and the speaker, and calculates the standard deviation of the difference between user and speaker
relative to the user and to the speaker respectively. The logic value can be generated if both of the
resulting standard deviations are below 15% (Rashid et al. 2008).
Figure 2-8 Results of voice recognition security system
IoT System Design using Voice Recognition Technique for Door Access Control System
This study introduces an Internet of Things (IoT) system design for a door access control
system using a voice recognition technique. The system admits only authorized users, by checking
that they correctly speak the random words given, and an alert is sent through Telegram when someone
successfully enters. As a result, the system is suitable for institutions or organizations that need to
secure critical rooms or buildings against prohibited users. The design process follows a waterfall
model methodology. The system consists of hardware and software components such as the Arduino
IDE to implement code for the Arduino UNO, a Python IDE as programming software, Telegram
Flutter as messaging application, a solenoid door lock Wifi module as the electrical-mechanical locking
mechanism and jumper wires as wiring for the pins. Five processes are also used to train the module
for better authorization of the user’s voice: random word generation, fuzzy matching,
voice matching detection and long-time spectral deviation, MFCC, and Gaussian Mixture Model
(GMM) (Zaini et al. 2021).
Keypad and Voice Recognition as Implementation of a Two-Level Security
Door Access
This paper presents a two-level security system. The first level uses a matrix keypad interfaced with
a microcontroller for security validation and to monitor and control the execution of desired tasks
within the keypad and voice recognition (KVR) system, while the second level uses a voice recognition
integrated circuit. A tristate buffer is also employed to logically
isolate the buses of the digital signal processing (DSP) chips from those of the microcontroller.
Tests were conducted in environments with and without noise, and the recommended distance between
the user and the microphone was identified. The results show that with an electret-type
condenser microphone, the required distance is 1.0 cm to 16 cm, while under noisy conditions it is 1.0
cm to 6.0 cm. The tested results vary depending on the sensitivity of the microphone and the
environmental conditions at the time. In conclusion, the paper shows that integrating the keypad
and voice recognition design can help to optimize the security level and help control unwanted
intrusion into buildings. Figure 2-9 shows the results of 3 different users of different genders
speaking the words ‘open’ and ‘close’ for the system to execute the corresponding command.
User 1 (male) has the highest accuracy, 100%, in having the correct voice pattern identified
and the correct command executed, followed by user 3 with an accuracy of 75% and lastly user 2
(female) with an accuracy of 62.5%. The reason user 1 has the highest accuracy is that user 1
took part in training the system over 4 trial runs; the results suggest that the gender of the user who
trains the system will have better accuracy in voice pattern identification
(Yilwatda et al. 2017).
Figure 2-9 Results of voice recognition security system
Summary
In conclusion, MATLAB is a popular choice for implementing voice-controlled
security systems because of the availability of function blocks for better simulation of the system.
The reviewed papers also show that gender plays a significant role in voice recognition,
since voice patterns differ between genders, while a noisy environment and differing energy
output among speakers also affect the outcome of voice recognition.
Feature Matching and Feature Extraction
Voice Command Recognition System based on MFCC and DTW
In this paper, MATLAB is used to develop the voice recognition system,
with MFCC for feature extraction and Dynamic Time Warping (DTW) for feature matching.
The voice recognition system is separated into two modules to show the output signal after feature
extraction and after matching. The upper plot of Figure 2-10 shows the audio signal obtained from the
speaker. Noise is then removed by removing any part of the signal that does not meet the minimum and
maximum thresholds; the resulting signal is called an utterance and is divided into frames, which are
passed through a discrete filter and a Hamming window. After that, the frequency-domain signal is
passed through a mel filter bank and finally converted back to the time domain by a Discrete Cosine
Transform, as shown in the lower plot of Figure 2-10. Figure 2-11 and Figure 2-12 show the results
when the two speech signals compared are the same and different respectively. The comparison is based
on the DTW algorithm, which measures the similarity between two time series that vary in time or
speed. The results show that the
cost value is 0 for the same speech signal and 107.8 for the different speech signal (Bala et al. 2010).
Figure 2-10 Initial (up) and final (down) speech signal after MFCC
Figure 2-11 Comparison of same speech signal
Figure 2-12 Comparison of different speech signal
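The DTW comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of Bala et al.: it assumes one scalar feature per frame and the absolute difference as the local distance, whereas the paper compares MFCC vectors. The function name `dtw_cost` and the three sequences are hypothetical.

```python
def dtw_cost(a, b):
    """Return the accumulated DTW cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two frames
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Made-up feature tracks standing in for per-frame features.
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
same_word_slower = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]
other_word = [2.0, 0.0, 2.0, 0.0, 2.0]

print(dtw_cost(reference, reference))         # identical signals
print(dtw_cost(reference, same_word_slower))  # same shape, spoken more slowly
print(dtw_cost(reference, other_word))        # different shape
```

Identical sequences give a cost of 0, which is consistent with the cost reported for the same speech signal, and the slower copy also aligns at zero cost because DTW absorbs differences in speaking speed.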
Text-Independent Speaker Identification through Feature Fusion and Deep
Neural Network
This paper proposes a speech recognition system that uses a combination of MFCC and
time-based features, known as MFCCT; the system is shown in Figure 2-13. This algorithm
addresses the problem that short-time features such as MFCC do not work well on
complex speech datasets. The features extracted by MFCCT are then fed into a deep neural network
(DNN) classifier to perform speaker identification (SID) and identify the gender of the user. The DNN
is shown to be well suited to MFCCT by also feeding the MFCCT features into other machine learning
algorithms, namely Naïve Bayes, random forest and k-nearest neighbor. Figure 2-14 shows the efficiency
obtained by each machine learning algorithm, with DNN obtaining the highest efficiency, ranging
from 83.5% to 92.9%. The speech signals are obtained from the LibriSpeech database,
which contains speech mostly from speakers in the United States. It contains many sub-corpora
collected over a long period and includes speech from many US male and female speakers (Jahangir
et al. 2020).
Figure 2-13 Proposed methodology for speaker identification system
Figure 2-14 Efficiencies of different machine learning algorithms in identifying gender
Speech Recognition and Verification Using MFCC & VQ
In this paper, the speech recognition system uses MFCC for feature extraction and VQ for
feature modeling. Since MFCC was discussed for the previous paper, VQ is emphasized here. So that
the system can estimate the probability distributions of the computed feature vectors, the VQ algorithm
quantizes the extracted features into a smaller number of template vectors, since a small set of
vectors can represent the characteristics of the whole feature set. The vectors are mapped
into finite regions of the feature space called clusters. A codebook is generated for
a specific speaker by clustering that speaker's training vectors; each cluster is centered on a code
word, also known as a centroid. The VQ distortion is the distance between a vector sample and the
closest centroid. An input speech signal is compared by determining its VQ distortion, measured as the
Euclidean distance, and the codebook that gives the smallest value is considered to match
the speaker. Figure 2-15 shows a visual image of VQ for 2 speakers inside a codebook.
This paper also uses the k-means algorithm to cluster the training vectors into feature vectors,
as shown in Figure 2-10. The algorithm produces k
clusters by initializing k centroids and grouping the vectors around the centroids; the
process is repeated until the centroids no longer change. The purpose of this process is to
minimize the intra-cluster variance, V, which ensures that similar training vectors fall inside the
same cluster while the clusters stay as far apart as possible (Patel & Prasad 2013).
Figure 2-15 Visual image of VQ for 2 speakers inside a codebook
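The codebook training and matching steps above can be illustrated with a short sketch. It assumes plain k-means with randomly chosen initial centroids and made-up two-dimensional vectors in place of MFCC features; the function names (`train_codebook`, `vq_distortion`, `identify`) are hypothetical and not taken from Patel & Prasad.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_codebook(vectors, k, max_iter=50, seed=0):
    """Cluster training vectors into k code words with plain k-means."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]  # initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: euclidean(v, codebook[c]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = [mean_vector(c) if c else codebook[i]
                   for i, c in enumerate(clusters)]
        if updated == codebook:  # centroids stopped moving
            break
        codebook = updated
    return codebook


def vq_distortion(vectors, codebook):
    """Average distance from each vector to its closest code word."""
    return sum(min(euclidean(v, c) for c in codebook)
               for v in vectors) / len(vectors)


def identify(vectors, codebooks):
    """Return the speaker whose codebook gives the smallest VQ distortion."""
    return min(codebooks,
               key=lambda name: vq_distortion(vectors, codebooks[name]))


# Made-up 2-D "feature vectors" for two speakers (real MFCC vectors are longer).
speaker_a = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]]
speaker_b = [[5.0, 5.1], [5.2, 4.9], [6.0, 6.1], [6.2, 5.9]]
codebooks = {
    "A": train_codebook(speaker_a, k=2),
    "B": train_codebook(speaker_b, k=2),
}
test_utterance = [[0.1, 0.1], [1.1, 1.0]]
print(identify(test_utterance, codebooks))
```

Averaging the per-vector distances is one common way to turn the distortion into a single score per speaker; other aggregations would work equally well for this illustration.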
LPC and LPCC Method of Feature Extraction in Speech Recognition System
This research paper examines two feature extraction methods, Linear Predictive
Cepstral Coefficients (LPCC) and Linear Predictive Coding (LPC).
The methodology of each method is reviewed and their merits and demerits are discussed thoroughly.
Figure 2-16 and Figure 2-17 show block diagrams of the LPC and LPCC methods respectively,
illustrating how the feature extraction works by representing speech signals with a finite number
of signal measures.
Figure 2-16 Block diagram for LPC method
Figure 2-17 Block diagram for LPCC method
In short, the main process of both LPC and LPCC contains four steps. The first is
pre-emphasis, which filters the signal, normally with a coefficient between 0.9 and 1, with the
purpose of flattening the signal spectrum. This is followed by framing, which divides the speech signal
into frames with an overlap of 10 ms between adjacent frames to ensure stationarity between frames.
Windowing then multiplies each frame by a Hamming window in order to
minimize edge effects. Lastly, the LPC coefficients are computed by applying autocorrelation to the
windowed frames, from which the LPCC can also be obtained. The LPCC method follows the same steps
as the LPC method; the only difference is that cepstral coefficients are computed from the LPC
parameters to give the LPCC features. The predictor coefficient vector must be found in
order to calculate the cepstral coefficients. In short, feature extraction for both methods is the same,
except that LPCC requires a conversion from LPC. As for the merits,
LPC estimates speech parameters precisely, characterizes speech traits
well and is computationally efficient, while LPCC provides better reliability and robustness. As for
the demerits, the LPC method cannot accurately capture unvoiced sounds such as “th” and
nasalized sounds such as “m” and “n”, while for the LPCC method, the performance is
degraded if an insufficient order is used. In both cases, the performance is greatly affected when the
environment contains noise (Gupta & Gupta 2016).
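The four steps and the LPC-to-LPCC conversion can be sketched as below. This is an illustration under stated assumptions rather than the procedure of Gupta & Gupta: the 8 kHz sampling rate, the 25 ms frame length, the order of 8, the Levinson-Durbin solver and the synthetic test signal are all choices made here for the example, and the function names are hypothetical.

```python
import math
import random


def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]; alpha is typically between 0.9 and 1."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def split_frames(x, size, overlap):
    """Cut x into frames of `size` samples sharing `overlap` samples."""
    step = size - overlap
    return [x[s:s + size] for s in range(0, len(x) - size + 1, step)]


def hamming(frame):
    """Multiply a frame by a Hamming window to reduce edge effects."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]


def autocorrelation(frame, order):
    """r[k] = sum_n frame[n] * frame[n + k] for k = 0..order."""
    return [sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..p] such that x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k  # remaining prediction error energy
    return a[1:]


def lpc_to_lpcc(a, n_ceps):
    """Convert predictor coefficients to cepstral coefficients c[1..n_ceps]."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]


# Synthetic "speech": two sinusoids plus a little noise, 8 kHz sampling assumed.
rng = random.Random(0)
fs = 8000
signal = [math.sin(2 * math.pi * 440 * n / fs)
          + 0.5 * math.sin(2 * math.pi * 1200 * n / fs)
          + 0.05 * rng.gauss(0.0, 1.0) for n in range(800)]

# 200 samples = 25 ms frames, 80 samples = 10 ms overlap at 8 kHz.
frames = split_frames(pre_emphasis(signal), size=200, overlap=80)
lpc = levinson_durbin(autocorrelation(hamming(frames[0]), 8), 8)
lpcc = lpc_to_lpcc(lpc, 12)
print(len(frames), len(lpc), len(lpcc))
```

The conversion step shows why LPCC needs the predictor coefficients first: each cepstral coefficient is built recursively from the LPC values and the cepstral coefficients already computed.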
Hidden Markov Model (HMM) on Speech Recognition
HMM is an algorithm commonly used in large vocabulary continuous speech recognition
(LVCSR) systems. As speech is recognized over time, the state changes at each time step,
with the possibility of returning to the same state. Time is indexed as
t = 1, 2, 3, ..., and the actual state at time t is denoted q_t. The state transition probability
depends only on the current state and its predecessor state.
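In the notation of Rabiner (1989), a first-order Markov chain with N states S_1, ..., S_N is usually written as follows. The symbols S_i, a_ij and N are assumed from that tutorial rather than taken from this proposal, so this is a reference form and not necessarily the exact equation intended here:

```latex
\[
\begin{aligned}
P[q_t = S_j \mid q_{t-1} = S_i,\ q_{t-2} = S_k,\ \ldots] &= P[q_t = S_j \mid q_{t-1} = S_i] \\
a_{ij} &= P[q_t = S_j \mid q_{t-1} = S_i], \qquad 1 \le i, j \le N \\
a_{ij} &\ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1
\end{aligned}
\]
```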
Figure 2-18 shows a 5-state Markov chain, with the probabilities of
moving from one state to another or remaining in the same state (Rabiner 1989).
Figure 2-18 5 state Markov chain
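As a small illustration of how transition probabilities combine along a path, the sketch below multiplies them for a given state sequence. The 5-state matrix is hypothetical (each state either stays or advances to the next one) and is not the chain drawn in Figure 2-18; the function name `sequence_probability` is also an assumption.

```python
# Hypothetical 5-state transition matrix: A[i][j] is the probability of
# moving from state i to state j. Each state either stays where it is or
# advances to the next state, and the last state absorbs.
A = [
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]


def sequence_probability(states, transition, start_state=0):
    """Probability of a state sequence, given that it begins in start_state."""
    if states[0] != start_state:
        return 0.0
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]  # one transition per time step
    return p


# Stay in state 0 once, then move forward, lingering in state 2 once.
path = [0, 0, 1, 2, 2, 3, 4]
print(sequence_probability(path, A))
```

Self-transitions such as 0 to 0 let a state last for several time steps, which is how a chain of this kind can absorb differences in how long a sound is held.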
2.3.6 Combination of VQ and Linde-Buzo-Gray (LBG) Algorithm Speech Recognition
This paper proposes a speech recognition system using a combination of VQ and the Linde-Buzo-Gray
(LBG) algorithm. VQ works as discussed for the previous paper, so the LBG
algorithm is the focus here. Figure 2-19 shows the flowchart of the LBG algorithm. The
process clusters a set of L training vectors into a set of M codebook vectors.
To start the algorithm, a one-vector codebook is designed.
It is then split to double the size of the codebook. The vectors are clustered by running a
nearest-neighbor search for each training vector to find its nearest code
word, and then the centroids are updated. Finally, the distortion is computed by summing
the distances to the code words found by the nearest-neighbor search. The splitting and clustering
repeat until the average distance falls below the threshold value and the M vectors are obtained, at
which point the process stops and an M-vector codebook is produced. The experimental simulation in
MATLAB is shown in Figure 2-20. The low accuracy in some of the tests is due to the researchers
creating a hostile environment in which the system cannot extract the signal accurately (Dua & Kamra 2019).
Figure 2-19 Flowchart of LBG algorithm
Figure 2-20 Experimental results for WER and accuracy
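The splitting procedure above can be sketched as follows. The additive split offset `eps`, the relative stopping tolerance `tol` and the requirement that M be a power of two are assumptions made for this illustration; they are not parameters reported by Dua & Kamra, and the function name `lbg` is hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg(training, m, eps=0.01, tol=1e-3):
    """Grow a codebook from 1 to m code words (m assumed a power of two)."""
    codebook = [centroid(training)]            # one-vector codebook
    while len(codebook) < m:
        # Split every code word into two slightly perturbed copies.
        codebook = [[x + s for x in c] for c in codebook for s in (eps, -eps)]
        previous = float("inf")
        while True:
            cells = [[] for _ in codebook]
            total = 0.0
            for v in training:                 # nearest-neighbour search
                i = min(range(len(codebook)),
                        key=lambda j: squared_distance(v, codebook[j]))
                cells[i].append(v)
                total += squared_distance(v, codebook[i])
            average = total / len(training)    # average distortion
            # Update centroids; an empty cell keeps its old code word.
            codebook = [centroid(c) if c else codebook[i]
                        for i, c in enumerate(cells)]
            # Stop refining once the distortion no longer improves much.
            if previous - average <= tol * max(average, 1e-12):
                break
            previous = average
    return codebook


# Two well-separated groups of made-up 2-D training vectors.
training = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
print(sorted(lbg(training, 2)))
```

Starting from a single centroid and doubling avoids having to guess good initial code words, which is the main practical difference from starting k-means at random points.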
Performance of Different Classifier for Speech Recognition System
In this paper, three different classifiers are used to perform feature matching for Malayalam digits:
Artificial Neural Network (ANN), Naïve Bayes (NB) and Support Vector Machine (SVM). The
features of the Malayalam digits are extracted by the Discrete Wavelet Transform (DWT) algorithm,
in which the signal passes through a low-pass filter to obtain the approximation coefficients and
a high-pass filter to obtain the detail coefficients. For speech signals, the approximation
coefficients carry more useful characteristics than the detail coefficients. The signals are
sub-sampled by 2 at each level until the desired signal, as decided by the MATLAB algorithm, is
obtained. Speech recognition is then performed on the extracted features by classifiers that build a
training model from the datasets and predict the class of each test set.
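To make the filter-and-subsample structure of the DWT concrete, the sketch below uses the Haar wavelet, the simplest possible filter pair. The choice of Haar, the function names and the sample values are assumptions made for illustration; the reviewed paper may use a different wavelet family.

```python
import math


def haar_dwt(signal):
    """One decomposition level of the Haar DWT on an even-length sequence."""
    s = 1 / math.sqrt(2)
    # Low-pass branch followed by down-sampling by 2: approximation coefficients.
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High-pass branch followed by down-sampling by 2: detail coefficients.
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def multilevel_dwt(signal, levels):
    """Repeatedly decompose the approximation branch, as in a DWT filter bank."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details


x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = multilevel_dwt(x, levels=2)
print(approx, details)
```

Each level halves the length of the approximation branch, which is why the decomposition is described as sub-sampling by 2.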
The ANN classifier is a well-known data processing model that consists of a number of basic
processing units known as neurons. The name comes from the neuron cells of the human brain,
which learn, adapt, recognize faults and more. Many ANN methods are available for classification;
this paper uses the multilayer perceptron (MLP), which has an input layer of n inputs, one or more
hidden layers, and an output layer. The input layer receives the extracted features, while the
output layer is where the MLP produces the predicted class. The hidden layers follow the back
propagation learning algorithm: the features distributed by the input layer are passed to the first
hidden layer, and each subsequent hidden layer receives the output of the previous layer as its
input. The error back propagation algorithm corrects the errors in the backward direction. The
network eventually establishes the input-output relationship through the adjusted weights.
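The forward pass and the backward error correction can be sketched for a very small MLP. The network size (2 inputs, 2 hidden neurons, 1 output), the sigmoid activation, the squared-error loss, the learning rate and the starting weights are all assumptions chosen for illustration; they are not the configuration used in the reviewed paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Return hidden activations and the network output for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def backprop_step(x, target, w1, b1, w2, b2, lr=0.5):
    """One gradient-descent update on the squared error 0.5 * (output - target)^2."""
    hidden, output = forward(x, w1, b1, w2, b2)
    # Error signal at the output neuron.
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error backwards to each hidden neuron (using the old weights).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    new_w2 = [w - lr * delta_out * h for w, h in zip(w2, hidden)]
    new_b2 = b2 - lr * delta_out
    new_w1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(w1, delta_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta_hidden)]
    return new_w1, new_b1, new_w2, new_b2


def loss(x, target, params):
    return 0.5 * (forward(x, *params)[1] - target) ** 2


x, target = [1.0, 0.0], 1.0
params = ([[0.1, 0.2], [0.3, 0.4]], [0.0, 0.0], [0.5, -0.5], 0.0)
loss_before = loss(x, target, params)
params = backprop_step(x, target, *params)
loss_after = loss(x, target, params)
print(loss_before, loss_after)
```

After one update the error on the training example is smaller than before, which is the adjustment of weights that the paragraph describes.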
SVM classifies speech by constructing hyperplanes in a multidimensional space that separate
the different class labels, based on statistical learning theory. There are two common SVM
strategies for classifying multi-class data, namely One-against-One and One-against-All.
One-against-One is chosen for this experiment, with binary SVMs used to classify each class by
grouping the data by class. Finally, the NB classifier is also used for classification. It is based
on the Naïve Bayesian algorithm, which calculates the probability of the model from a small given
dataset by applying the formula below.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In this equation, P(A|B) is the posterior probability, defined as the probability of
hypothesis A given the observed event B; P(B|A) is the likelihood of observing B when
hypothesis A is true; and P(A) and P(B) are the prior probability and marginal probability
respectively. This classifier requires only a small training set to be effective. Figure 2-21
shows the results for the three classifiers. ANN is chosen since it obtains the highest accuracy
of the three methods, 89% (Suuny et al. 2013).
Figure 2-21 Accuracy of three classifiers
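A short worked example may help to show how the Bayes formula turns priors and likelihoods into a class decision. The class names, prior values and per-feature likelihoods below are made up for illustration, and the function name naive_bayes_posterior is hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods):
    """priors: {class: P(A)}; likelihoods: {class: [P(B_i | A), ...]}."""
    joint = {}
    for label, prior in priors.items():
        p = prior
        for value in likelihoods[label]:
            p *= value  # "naive" step: features are treated as independent
        joint[label] = p
    evidence = sum(joint.values())  # P(B), the marginal probability
    return {label: p / evidence for label, p in joint.items()}


# Two hypothetical digit classes with equal priors and two observed features.
priors = {"digit_1": 0.5, "digit_2": 0.5}
likelihoods = {"digit_1": [0.8, 0.5], "digit_2": [0.2, 0.5]}

posterior = naive_bayes_posterior(priors, likelihoods)
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

The predicted class is simply the one with the largest posterior probability P(A|B).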
Summary on Feature Extraction and Matching
Table 2-1 Summary of Feature Extraction and Matching
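[Table 2-1 contents were not recoverable from the source. Surviving fragments: column header
"Matching & Testing Method"; entry "DIM Method"; reported accuracy ranges 83.5% - 92.9%,
70% - 85% and 84.8% - 89.1%; sources (Vyas 2013), (Matsui & Furui 1994), (Maseri & Mamat 2019),
(Bala et al. 2010), (Patel & Prasad 2013), (Dua & Kamra 2019) and one "et al. 2020" citation.]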
Table 2-1 summarizes the literature reviewed on speech recognition security systems. To date,
researchers have used a variety of feature extraction, feature matching and testing methods. In
summary, MFCC for feature extraction and HMM for feature matching and testing are used more
often than the other methods.
Limitations of Speech Recognition
Speech Recognition in Noisy Environment
This paper conducts experiments on human speech recognition in noisy environments. The
effects of human language modeling are largely mitigated in the experiments. The MFCC
algorithm eliminates phase during processing. When a speech signal with the phase eliminated is
played to human listeners, the recognition error rate increases only slightly. The results show that
humans are not sensitive to phase: the recognition rate falls from 100% to 91.5%, which is caused
by the phase elimination of the MFCC technique. The results also show the effect of the resolution
reduction caused by the subsequent cepstral filtering. As shown in the decision tree, a noisy
environment affects the recognition rate of the speech recognition system the most, followed by
difficult speakers combined with native listeners. The results also show up to a fourfold increase
in digit error rate when difficult speakers are encountered, which is associated with the reduction
in spectral resolution. As the paper shows, redundancy exists in speech recognition systems.
Moreover, if the speech recognition system is capable enough and there is little interference, the
traditional cepstral-filtered features, which remove unnecessary information, may be sufficient to
carry the information needed for accurate speech recognition (Peters et al. 1999).
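The phase elimination mentioned above comes from keeping only the magnitude of the spectrum. The sketch below illustrates this with a plain discrete Fourier transform: a tone and a delayed copy of it have different phases but identical magnitude spectra, so a magnitude-based front end cannot tell them apart. The signal length, tone frequency and delay are arbitrary illustrative values, and the full MFCC pipeline (mel filter bank, logarithm, DCT) is deliberately left out.

```python
import cmath
import math


def dft(signal):
    """Direct O(n^2) discrete Fourier transform, adequate for a short example."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


n = 16
tone = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
shifted = tone[3:] + tone[:3]  # same tone, circularly delayed: only the phase changes

spectrum_a = dft(tone)
spectrum_b = dft(shifted)

mag_a = [abs(c) for c in spectrum_a]
mag_b = [abs(c) for c in spectrum_b]
max_mag_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))

# Phase of the bin that holds the tone, before and after the delay.
phase_a = cmath.phase(spectrum_a[2])
phase_b = cmath.phase(spectrum_b[2])
print(max_mag_diff, phase_a, phase_b)
```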
Problems Encountered by using Speech Recognition System
This paper discusses how the use of a speech recognition system may cause voice problems
for its users. The symptoms that users might face include a sore throat, difficulty in producing
sound and, in the worst case, loss of voice. In addition, the user is required to produce a voice
with constant pitch, inflection and volume, which might cause muscle fatigue in the throat because
the vocal tract is kept in a fixed position while speaking, and might eventually affect the
laryngeal musculature. The paper emphasizes that more studies are needed to decide whether
speech recognition systems cause problems for the user’s voice and throat. Nevertheless, users
are advised to warm up and cool down the voice (Kambeyanda et al. 1997).
Summary in Limitations on Speech Recognition System
With advances in technology, the problems encountered by speech recognition systems
could be addressed by introducing deep learning into the system. Instead of using an HMM alone
for feature matching and training, a hybrid CNN-HMM can be used. It uses a Convolutional Neural
Network (CNN) as the pattern recognizer, passing the input through three kinds of layer, namely
convolution, pooling and non-linearity, as shown in Figure 2-22, to learn and extract the useful
information from the speech features. The CNN processes each input speech utterance by
generating all HMM state probabilities for each frame. A Viterbi decoder is then used to obtain
the sequence of labels corresponding to the input utterance (Abdel-Hamid et al. 2012).
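The decoding step can be sketched as follows. This is a simplified illustration: the per-frame scores stand in for the state probabilities that a CNN would output, the two state names, the start and transition probabilities and all numeric values are hypothetical, and practical hybrid systems usually work with log-probabilities and scaled likelihoods instead of raw products.

```python
def viterbi(frame_scores, start, trans):
    """Most likely state path.

    frame_scores: one dict per frame mapping state -> emission score
                  (standing in for the per-frame scores a CNN would output).
    start: dict state -> initial probability.
    trans: dict state -> dict next_state -> transition probability.
    """
    states = list(start)
    best = {s: start[s] * frame_scores[0][s] for s in states}
    backpointers = []
    for scores in frame_scores[1:]:
        pointers, current = {}, {}
        for s in states:
            # Best predecessor for state s at this frame.
            prev = max(states, key=lambda p: best[p] * trans[p][s])
            pointers[s] = prev
            current[s] = best[prev] * trans[prev][s] * scores[s]
        backpointers.append(pointers)
        best = current
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
frame_scores = [
    {"s1": 0.5, "s2": 0.1},
    {"s1": 0.4, "s2": 0.3},
    {"s1": 0.1, "s2": 0.6},
]

path, prob = viterbi(frame_scores, start, trans)
print(path, prob)
```

The returned path is the label sequence for the utterance, one state per frame.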
The CNN is used for feature matching because it recognizes words better in noisy
environments, owing to three properties that distinguish it from other neural networks, namely
locality, weight sharing and pooling. Locality means that each unit in the convolutional layer
receives features representing only a limited bandwidth of the whole speech spectrum. MFCC is
therefore modified so that it does not pass through the DCT-based decorrelation transform, since
the speech inputs must be represented on a frequency scale that can be divided into a number of
local bands. Locality also increases the robustness against non-white noise, because the useful
features are computed by local filters in the relatively cleaner parts of the spectrum and are
therefore less affected by the noise (Abdel-Hamid et al. 2014), whereas pooling and weight sharing
increase the performance of speech recognition. The article by Santos et al., titled “Speech
Recognition in Noisy Environments with Convolutional Neural Networks”, suggests that a speech
recognition system using the CNN-HMM model has the lowest equal error rate among the models
compared when exposed to different kinds of noise, as shown in Figure 2-23.
Figure 2-22 Block diagram of CNN (Passricha & Aggarwal 2018)
Figure 2-23 Simulation results on different NN-HMM models (Santos et al. 2015)
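Locality, weight sharing and pooling can be illustrated on a single frame of filter-bank features. In the sketch below, one small kernel is slid along the frequency axis (locality and weight sharing), a rectifier provides the non-linearity, and max pooling summarizes neighbouring outputs. The band energies, the kernel values, the ReLU choice and the pooling size are all illustrative assumptions, not the configuration of the cited papers.

```python
def conv1d_valid(bands, kernel):
    """Slide one shared kernel along the frequency axis (weight sharing)."""
    k = len(kernel)
    return [sum(bands[i + j] * kernel[j] for j in range(k))
            for i in range(len(bands) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Hypothetical log mel filter-bank energies for one frame (8 frequency bands).
frame = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0, 2.0, 5.0]
kernel = [-1.0, 0.0, 1.0]  # illustrative local filter spanning 3 neighbouring bands

feature_map = relu(conv1d_valid(frame, kernel))
pooled = max_pool(feature_map, size=2)
print(feature_map, pooled)
```

Because every output depends on only three neighbouring bands, noise confined to one part of the spectrum disturbs only the outputs computed from those bands.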
Chapter 3 Methodology
Flow Chart of Implementation Process
Figure 3-1 Flow Chart of Implementation Process of Speech Recognition Security System
Figure 3-1 displays the flow chart of the implementation process of the speech recognition
security system. First, research papers on topics related to the system, such as feature extraction
and feature matching, are reviewed. After that, the speech signals for the testing datasets are
recorded using the GoldWave application. The code and the Graphical User Interface (GUI) are
then implemented in MATLAB for testing the system. Troubleshooting is carried out if faults are
found in the code.
Software used for implementation
The speech recognition system is simulated using the MATLAB software. MATLAB
is a numerical computing environment developed by MathWorks that uses a fourth-generation
programming language to compute technical algorithms. The software is used to program,
compute and visualize algorithms. Common functions of MATLAB include matrix manipulation,
plotting of functions and data, and more sophisticated functions such as the fast Fourier
transform, Bessel functions, matrix inverse and matrix eigenvalues. It also provides a Graphical
User Interface (GUI) to showcase the algorithm. Another reason MATLAB is used instead of a
programming language such as C++ is that it offers a toolbox that helps in simulating the speech
recognition system, known as VOICEBOX, written by Mike Brookes. This speech processing
toolbox contains many speech processing functions that are useful for simulating a speech
recognition system (Pan 2014).
To perform testing and training for the speech recognition system, speech signals must be
recorded from different users saying the same word or phrase, to ensure a high recognition rate
for this speech security system. The speech signals are recorded using the GoldWave software. It
provides various functions, including recording and editing speech signals, analyzing audio with
real-time visuals, applying audio effects, filtering and noise reduction, and saving in various
formats such as .wav files, .snd files and many more.
Gantt Chart
Table 3-1 below shows the schedule for software implementation, proposal writing and thesis
writing over 2 semesters.
Table 3-1 Gantt Chart for Semester 1 and 2
Semester 1
Confirmation on supervisor
Choose FYP Title
Submission of Proposal
Research on Project
Software Implementation
Proposal Submission
Logbook and Risk Assessment
Proposal Writing
Meeting with Supervisor
Proposal Defend
Semester 2
Troubleshoot on Software
Result Analysing and Improve-
Software Implementation
Thesis Writing
Thesis Submission
Logbook Submission
Thesis Defend
Chapter 4 Conclusion
In conclusion, a speech recognition door security system is developed. Feature extraction
(MFCC algorithm) converts the speech signal into a sequence of acoustic speech vectors, and
feature matching and training (CNN) trains the database of the system in order to raise the
accuracy of detecting the correct user. The whole speech recognition process is simulated using
the MATLAB application, while the speech signals are recorded using the GoldWave application.
References
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G. & Yu, D. (2014), ‘Convolutional
neural networks for speech recognition’, IEEE/ACM Transactions on audio, speech, and
language processing 22(10), 1533–1545.
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H. & Penn, G. (2012), Applying convolutional neural
networks concepts to hybrid nn-hmm model for speech recognition, in ‘2012 IEEE international
conference on Acoustics, speech and signal processing (ICASSP)’, IEEE, pp. 4277–4280.
Bala, A., Kumar, A. & Birla, N. (2010), ‘Voice command recognition system based on mfcc and
dtw’, International Journal of Engineering Science and Technology 2(12), 7335–7342.
Biswas 2nd, P. & Mynuddin 3rd, M. (n.d.), ‘Design and implementation of smart home security
Chavan, R. S. & Sable, G. S. (2013), ‘An overview of speech recognition using hmm’, International
Journal of Computer Science and Mobile Computing 2(6), 233–238.
Dua, S. & Kamra, A. (2019), Development of speech recognition system: Using combination of
vector quantization and linde-buzo-gray algorithm.
Gupta, H. & Gupta, D. (2016), Lpc and lpcc method of feature extraction in speech recognition
system, in ‘2016 6th International Conference-Cloud System and Big Data Engineering
(Confluence)’, IEEE, pp. 498–502.
Haridas, A. V., Marimuthu, R. & Sivakumar, V. G. (2018), ‘A critical review and analysis on
techniques of speech recognition: The road ahead’, International Journal of Knowledge-Based
and Intelligent Engineering Systems 22(1), 39–57.
Jahangir, R., Teh, Y. W., Memon, N. A., Mujtaba, G., Zareei, M., Ishtiaq, U., Akhtar, M. Z. &
Ali, I. (2020), ‘Text-independent speaker identification through feature fusion and deep neural
network’, IEEE Access 8, 32187–32202.
Kambeyanda, D., Singer, L. & Cronk, S. (1997), ‘Potential problems associated with use of speech
recognition products’, Assistive Technology 9(2), 95–101.
Lippmann, R. P. (1997), ‘Speech recognition by machines and humans’, Speech communication
22(1), 1–15.
Maseri, M. & Mamat, M. (2019), Malay language speech recognition for preschool children using
hidden markov model (hmm) system training, in ‘Computational Science and Technology’,
Springer, pp. 205–214.
Matsui, T. & Furui, S. (1994), ‘Comparison of text-independent speaker recognition methods
using vq-distortion and discrete/continuous hmm’s’, IEEE Transactions on speech and audio
processing 2(3), 456–459.
Najkar, N., Razzazi, F. & Sameti, H. (2010), ‘A novel approach to hmm-based speech recognition
systems using particle swarm optimization’, Mathematical and Computer Modelling 52(11-12), 1910–1920.
Pan, L. (2014), ‘Research and simulation on speech recognition by matlab’.
Passricha, V. & Aggarwal, R. K. (2018), Convolutional neural networks for raw speech recognition,
Patel, K. & Prasad, R. (2013), ‘Speech recognition and verification using mfcc & vq’, Int. J.
Emerg. Sci. Eng.(IJESE) 1(7), 137–140.
Peters, S. D., Stubley, P. & Valin, J.-M. (1999), On the limits of speech recognition in noise, in
‘1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.
ICASSP99 (Cat. No. 99CH36258)’, Vol. 1, IEEE, pp. 365–368.
Rabiner, L. R. (1989), ‘A tutorial on hidden markov models and selected applications in speech
recognition’, Proceedings of the IEEE 77(2), 257–286.
Rashid, R. A., Mahalin, N. H., Sarijari, M. A. & Aziz, A. A. A. (2008), Security system using
biometric technology: Design and implementation of voice recognition system (vrs), in ‘2008
international conference on computer and communication engineering’, IEEE, pp. 898–902.
Santos, R. M., Matos, L. N., Macedo, H. T. & Montalvão, J. (2015), Speech recognition in noisy
environments with convolutional neural networks, in ‘2015 Brazilian Conference on Intelligent
Systems (BRACIS)’, pp. 175–179.
Shah, H. N. M., Ab Rashid, M. Z., Abdollah, M. F., Kamarudin, M. N., Lin, C. K. & Kamis,
Z. (2014), ‘Biometric voice recognition in security system’, Indian journal of Science and
Technology 7(2), 104.
Suuny, S., Peter, S. D. & Jacob, K. P. (2013), ‘Performance of different classifiers in speech
recognition’, Int. J. Res. Eng. Technol 2(4), 590–597.
Vyas, M. (2013), ‘A gaussian mixture model based speech recognition system using matlab’,
Signal & Image Processing 4(4), 109.
Yilwatda, M. M., Enokela, J. A. & Goshwe, N. Y. (2017), ‘Implementation of a two-level security
door access using keypad and voice recognition’, International Journal of Security and Its
Applications 11(4), 45–58.
Zaini, N. M. S. M., Zakaria, N. A., Roslan, I., Nahar, H., Ghazali, K. W. M. & Harum, N. (2021),
‘Iot system design using voice recognition technique for door access control system’, Manuscript
Editor 2021, 145.