PROJECT PROPOSAL OF

DESIGN AND DEVELOPMENT OF DOOR SECURITY SYSTEM USING MULTI-LINGUAL SPEECH RECOGNITION TECHNIQUES FOR MALAY, TAMIL AND CHINESE LANGUAGE

BY
NG YIK HENG

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
FACULTY OF ENGINEERING AND TECHNOLOGY
TUNKU ABDUL RAHMAN UNIVERSITY COLLEGE
KUALA LUMPUR
2022/23

Acknowledgements

Firstly, I would like to express my gratitude to Tunku Abdul Rahman University College for providing the opportunity to undertake a final year project, through which I could develop a project and enhance my engineering skills in a field that interests me.

I would also like to thank my supervisor, Dr. Hong Kai Sze, for providing useful guidance and advice throughout my final year project. Despite his busy schedule, Dr. Hong arranged weekly meetings and gave me guidance whenever I met issues in the project. His knowledge and experience in this field, together with his words of encouragement and his patience, were crucial for the completion of the project.

Finally, I would like to thank my family and friends for their support and encouragement while I was completing this project, especially when I felt demotivated and discouraged. My friends also shared useful tips from completing their own projects in past years, which helped me arrange my time wisely while also taking care of my studies.

Abstract

Door security systems have evolved in the past years, since the door plays an important role in home security: people have to enter the house through the door. Therefore, this paper proposes a door security system using multi-lingual speech recognition techniques for the Chinese, Tamil and English languages. In this project, MATLAB is used as the development platform. Mel-Frequency Cepstral Coefficients (MFCC) are used as the feature extraction technique, a Hidden Markov Model (HMM) is used for the training and matching of the system, and GoldWave is used for recording the audio signals.
Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Problem Statement
1.3 Research Objectives
Chapter 2 Literature Review
2.1 Speech Recognition
2.1.1 Speech recognition by humans and machine
2.1.2 A critical review and analysis on techniques of speech recognition
2.2 Security System
2.2.1 Biometric Voice Recognition in Security System
2.2.2 Design and Implementation of Voice Recognition System (VRS) Security System using Biometric Technology
2.2.3 IoT System Design using Voice Recognition Technique for Door Access Control System
2.2.4 Keypad and Voice Recognition as Implementation of a Two-Level Security Door Access
2.2.5 Summary
2.3 Feature Matching and Feature Extraction
2.3.1 Voice Command Recognition System based on MFCC and DTW
2.3.2 Text-Independent Speaker Identification through Feature Fusion and Deep Neural Network
2.3.3 Speech Recognition and Verification Using MFCC & VQ
2.3.4 LPC and LPCC Method of Feature Extraction in Speech Recognition System
2.3.5 Hidden Markov Model (HMM) on Speech Recognition
2.3.6 Combination of VQ and Linde-Buzo-Gray (LBG) Algorithm Speech Recognition
2.3.7 Performance of Different Classifier for Speech Recognition System
2.3.8 Summary on Feature Extraction and Matching
2.4 Limitations of Speech Recognition
2.4.1 Speech Recognition in Noisy Environment
2.4.2 Problems Encountered by using Speech Recognition System
2.4.3 Summary in Limitations on Speech Recognition System
Chapter 3 Methodology
3.1 Flow Chart of Implementation Process
3.2 Software used for implementation
3.2.0.1 MATLAB
3.2.0.2 GoldWave
3.3 Gantt Chart
Chapter 4 Conclusion
References
Appendices

List of Figures

Figure 1-1 Stages of speech recognition system (Chavan & Sable 2013)
Figure 2-1 Speech corpora with different vocabulary size
Figure 2-2 Characteristics of speech corpora
Figure 2-3 General architecture of speech recognition system
Figure 2-4 Speech recognition security system
Figure 2-5 Process of MFCC
Figure 2-6 Accuracy result for 1 individual
Figure 2-7 Accuracy result for 10 individuals
Figure 2-8 Results of voice recognition security system
Figure 2-9 Results of voice recognition security system
Figure 2-10 Initial (up) and final (down) speech signal after MFCC
Figure 2-11 Comparison of same speech signal
Figure 2-12 Comparison of different speech signal
Figure 2-13 Proposed methodology for speaker identification system
Figure 2-14 Efficiencies for identify genders for different machine learning algorithm
Figure 2-15 Visual image of VQ for 2 speakers inside a codebook
Figure 2-16 Block diagram for LPC method
Figure 2-17 Block diagram for LPCC method
Figure 2-18 5 state Markov chain
Figure 2-19 Flowchart of LBG algorithm
Figure 2-20 Experimental results for WER and accuracy
Figure 2-21 Accuracy of three classifiers
Figure 2-22 Block diagram of CNN (Passricha & Aggarwal 2018)
Figure 2-23 Simulation results on different NN-HMM models (Santos et al. 2015)
Figure 3-1 Flow Chart of Implementation Process of Speech Recognition Security System

List of Tables

Table 2-1 Summary of Feature Extraction and Matching
Table 3-1 Gantt Chart for Semester 1 and 2
Chapter 1 Introduction

1.1 Research Background

This research studies the need for a speech recognition security system that supports the languages commonly used in Malaysia (English, Chinese, Tamil and Malay). Housing areas are normally equipped with a security system, whether it is the traditional key and lock or a more advanced method such as keypads and thumbprints. A security system serves to deny unauthorized entry and to protect personal property or contents from damage or loss (Biswas & Mynuddin n.d.). Home security has been improving year by year with the recent introduction of digital doors that use passwords or thumbprints. However, problems remain. Digital doors, whose security relies on physical touch, words or numbers to unlock the door, are easy to hack or guess, whereas key-and-lock doors can be defeated by brute force or by duplicating keys or identity cards. There is also the possibility that an individual does not lock the door properly because there is insufficient time to double check the locks before leaving. Even employing human security guards is not safe enough, because humans cannot execute their tasks efficiently all the time and are bound to make mistakes that let intruders in.

Therefore, this project focuses on implementing a biometric security system that relies on the individual’s “body” as an entry ticket, as biometric characteristics such as voice and eye scans are unique and can be used to identify specific individuals. Speech is considered one of the biometric characteristics that are easily detectable by computers, as the human voice contains a lot of information that can be extracted. When people speak, they produce a certain pitch, frequency, tone and rhythm that is unique to each person. Different genders also produce different ranges of pitch, as the average male has a lower voice than the average female.
Differences in where people live also produce different accents and different ways of pronouncing a given word. This research aims to overcome these issues and correctly recognize what the user is trying to say, in order to identify whether a certain key phrase or word has been spoken by the user.

In a speech recognition system, four stages are responsible for processing a voice signal, namely feature extraction, feature training, feature matching and finally decision logic. MATLAB is used for the coding of this system. Feature extraction of the user's speech is done by applying Mel-Frequency Cepstral Coefficients (MFCC), which convert the acoustic signal into a series of acoustic feature vectors. Besides, a convolutional neural network (CNN) is used to train the database of the system to recognize words, so that the speech signal can be compared to identify the correct user (Najkar et al. 2010).

Figure 1-1 Stages of speech recognition system (Chavan & Sable 2013)

1.2 Problem Statement

The recognition of voice is inconsistent due to differences in accent, pronunciation and speaking speed. Besides that, supporting multiple languages increases the difficulty of extracting voice features. Noise may also occur while the system is detecting the voice and performing extraction, which would affect the final results.

This project focuses on a door security system based on recognizing the speech spoken to it. The purpose of this project is to improve home security by letting the house owner or his/her family open the door of the house using a multi-lingual speech recognition system instead of keys or padlocks. This project aims to design and evaluate a speech recognition system in multiple languages so that it can be used by all Malaysians.
This project uses algorithms such as Deep Learning and Linear Predictive Coding to give the system better reliability, by constantly updating the vocabulary database and by extracting features effectively while the voice is being detected by the security system.

1.3 Research Objectives

This project focuses on securing the doors of the house owner's room or house by having a word or phrase spoken to the control system in any language the user prefers (English, Chinese, Tamil, Malay, etc.). The purpose of this project is to enhance the security of the house using speech recognition instead of the traditional way of opening the house with a key and padlock. Keys and padlocks can now be easily replicated and broken, hence the home theft rate in Malaysia is rising year by year. The speech recognition control system addresses this, as it only recognizes the voices of the house owner and his/her family by extracting their voices and making them recognizable by the control system. The development software used for this project is MATLAB, while Mel-Frequency Cepstral Coefficients (MFCC) are used to extract the speech spoken by the house owner and his/her family, and a Hidden Markov Model is used for pattern comparison. By implementing this project, the objectives below are hoped to be achieved:

1. To research and develop a speech recognition door security system that can recognize English, Chinese, Tamil and Malay speech.
2. To record speech from various speakers to be used as the training template.
3. To test the accuracy of the speech recognition door security system using a testing template.
4. To evaluate the new system and propose new ideas to improve the system.

Chapter 2 Literature Review

In this chapter, several research papers are read and analyzed in order to discuss their work on speech recognition security systems that support multiple languages. These research papers act as references for successful project implementation.
2.1 Speech Recognition

2.1.1 Speech recognition by humans and machine

Figure 2-1 Speech corpora with different vocabulary size

Figure 2-2 Characteristics of speech corpora

In this paper, speech corpora with different vocabulary sizes are used to test the ability of speech recognition systems, as shown in Figure 2-1. All corpora aim at recognizing all the words in sentences that users are prompted to read. The corpora differ in vocabulary range, in the number of talkers, in the total duration of the recorded readings, and in what the talkers read; the detailed characteristics of each of the six speech corpora are shown in Figure 2-2. The error rates of machine recognizers and humans on these corpora are also compared, and the machine error rates are found to be higher than those of humans in every circumstance, including spontaneous speech and noisy environments (Lippmann 1997).

2.1.2 A critical review and analysis on techniques of speech recognition

Figure 2-3 General architecture of speech recognition system

This paper reviews the speech recognition techniques used in 50 articles published from 2000 to 2015. The feature extraction and classification techniques used in the 50 articles are listed in Figure 2-3. The paper also identifies some common challenges faced while building a speech recognition system. These include the emotion of the speaker, since the audio signal changes when the emotion is different. The particular combination of techniques also affects the accuracy of speech recognition. For instance, the MFCC algorithm for feature extraction has difficulty recognizing speech signals in noisy environments, therefore a noise-adaptive classifier must be combined with MFCC to increase the accuracy of the system (Haridas et al. 2018).
2.2 Security System

2.2.1 Biometric Voice Recognition in Security System

Figure 2-4 Speech recognition security system

Figure 2-5 Process of MFCC

In this paper, the speech recognition security system shown in Figure 2-4 is built using MATLAB (SIMULINK). The system converts the input speech into energy features, which are then saved as the reference template. This process is done by applying the MFCC algorithm as shown in Figure 2-5. The features of the input speech are then compared to the model to produce logic ‘1’ or ‘0’ depending on whether the speech matches. Other methods are also discussed, namely Vector Quantization (VQ) and Gaussian Mixture Model (GMM). According to the results of the paper, the specific user’s voice is successfully recognized and other users’ voices are rejected. Figure 2-6 and Figure 2-7 show the results obtained by testing the system’s accuracy with 1 individual and with 10 different people respectively; the results highlighted in red indicate cases where the voice recognition system was unable to provide the correct output. The reason for the incorrect outputs is the inconsistent energy output of the speaker, caused by how softly or loudly the speaker speaks. Figure 2-7 also indicates that gender and differences in age may affect the accuracy of the system. Despite that, the accuracy of this system is 75% (Shah et al. 2014).

Figure 2-6 Accuracy result for 1 individual

Figure 2-7 Accuracy result for 10 individuals

2.2.2 Design and Implementation of Voice Recognition System (VRS) Security System using Biometric Technology

This paper discusses the spike in popularity of biometric technology in recent years, from fingerprints and handwriting to more recent methods which include face scan, iris/eye scan, hand print and voice print.
It also lists various applications of voice biometrics, which include eliminating cell phone fraud through cell phone security access control, eliminating PIN fraud for ATM manufacturers, and reducing theft and carjacking for automobile manufacturers. In this research paper, a voice controlled security system is implemented using MATLAB function blocks from SIMULINK to develop verification algorithms that can authenticate a person’s identity by their voice pattern. The security system produces logic ‘1’ if the voice matches, while a mismatch produces logic ‘0’. Besides that, the door is controlled by a microcontroller circuit to test the reliability of the voice controlled security system. Figure 2-8 shows the results for 4 different test subjects of different genders, each speaking a simple word so that the system could extract his or her voice pattern. The system compares the non-parametric power spectrum estimates of the reference user and the speaker, and calculates the standard deviation of the difference between them relative to the user and to the speaker respectively. A match (logic ‘1’) is generated only if both standard deviations are below 15% (Rashid et al. 2008).

Figure 2-8 Results of voice recognition security system

2.2.3 IoT System Design using Voice Recognition Technique for Door Access Control System

This study introduces an Internet of Things (IoT) system design for a door access control system using a voice recognition technique. The system admits only authorized users, who must correctly speak the random words they are given, and an alert is sent through Telegram whenever someone successfully enters. As a result, this system is suitable for institutions or organizations that want to secure critical rooms or buildings against unauthorized users. The design process is guided by a waterfall model methodology.
The system consists of hardware and software components such as the Arduino IDE to implement code for the Arduino UNO, the Python IDE as programming software, Telegram Flutter as the messaging application, a solenoid door lock Wifi module as the electro-mechanical locking mechanism and jumper wires as wiring for the pins. Five processes are also used to train the module for better authorization of the user’s voice, which are random word generation, fuzzy matching, voice matching detection and long-time spectral deviation, MFCC and Gaussian Mixture Model (GMM) (Zaini et al. 2021).

2.2.4 Keypad and Voice Recognition as Implementation of a Two-Level Security Door Access

This paper presents a two-level security system. The first level uses a matrix keypad interfaced with a microcontroller for security validation and to monitor and control the execution of desired tasks within the keypad and voice recognition (KVR) system, while the second level uses a voice recognition integrated circuit. A tristate buffer is also employed to logically isolate the buses of the digital signal processing (DSP) chips from those of the microcontroller. Tests were conducted in environments with and without noise, and the recommended distance between the user and the microphone was identified. The results show that with an electret condenser microphone, the required distance is 1.0 cm to 16 cm without noise and 1.0 cm to 6.0 cm under noisy conditions. The tested results vary depending on the sensitivity of the microphone and the environmental conditions at the time. In conclusion, the paper shows that integrating the keypad and voice recognition design can help to optimize the security level and help control unwanted intrusion into buildings.
Figure 2-9 shows the results of 3 users of different genders speaking the words ‘open’ and ‘close’ for the system to execute the corresponding command. User 1 (male) has the highest accuracy of 100% in having the correct voice pattern identified and the correct command executed, followed by user 3 with an accuracy of 75% and lastly user 2 (female) with an accuracy of 62.5%. The reason that user 1 has the highest accuracy is that user 1 participated in training the system over 4 trial runs. The results suggest that the system identifies voice patterns more accurately for users of the same gender as the user who trained it (Yilwatda et al. 2017).

Figure 2-9 Results of voice recognition security system

2.2.5 Summary

In conclusion, MATLAB is a popular choice for implementing voice controlled security systems due to the availability of function blocks that make the system easier to simulate. The reviewed papers also indicate that gender plays a significant role in voice recognition, because voice patterns differ between genders, while a noisy environment and differences in energy output among speakers also affect the outcome of voice recognition.

2.3 Feature Matching and Feature Extraction

2.3.1 Voice Command Recognition System based on MFCC and DTW

In this paper, MATLAB is used to develop the voice recognition system, with MFCC as feature extraction and Dynamic Time Warping (DTW) as feature matching. The voice recognition system is separated into two modules to show the output signal after feature extraction and after matching. The upper panel of Figure 2-10 shows the audio signal obtained from the speaker. Noise is then removed by discarding any signal that does not fall within the minimum and maximum thresholds. The resulting signal is called an utterance, which is then divided into frames and passed through a discrete filter and a Hamming window.
After that, the frequency-domain representation of each frame is passed through a mel filter bank and finally converted back into the time domain by performing a Discrete Cosine Transform, as shown in the lower panel of Figure 2-10. Figure 2-11 and Figure 2-12 show the results when the two speech signals compared are the same and different respectively. The comparison is based on the DTW algorithm, which measures the similarity between two time series that vary in time or speed. The results show that the cost value is 0 for the same speech signal and 107.8 for different speech signals (Bala et al. 2010).

Figure 2-10 Initial (up) and final (down) speech signal after MFCC

Figure 2-11 Comparison of same speech signal

Figure 2-12 Comparison of different speech signal

2.3.2 Text-Independent Speaker Identification through Feature Fusion and Deep Neural Network

This paper proposes a speaker identification system that uses a combination of MFCC and time-based features, known as MFCCT; the system is shown in Figure 2-13. This approach addresses the problem that short-time features such as MFCC do not work well on complex speech datasets. The features extracted by MFCCT are fed into a deep neural network (DNN) classifier to identify the speaker identity (SID) and gender of the user. The DNN is shown to be well suited to MFCCT by also feeding the MFCCT features into other machine learning algorithms, namely Naïve Bayes, random forest and k-nearest neighbor, for comparison. Figure 2-14 shows the efficiencies obtained for each machine learning algorithm, with DNN obtaining the highest efficiency, ranging from 83.5% to 92.9%. The speech signals are obtained from the LibriSpeech database, in which the majority of the speech comes from speakers in the United States. It contains many sub-corpora built up over a long time, and includes speech signals from many US male and female speakers (Jahangir et al. 2020).
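The DTW comparison reviewed in Section 2.3.1 can be illustrated with a minimal sketch. It is not the implementation of Bala et al. (2010): the function name `dtw_cost`, the use of scalar samples in place of MFCC vectors, and the absolute difference as the local distance are all assumptions made only to keep the example short.

```python
from math import inf

def dtw_cost(a, b):
    """Cumulative DTW alignment cost between two 1-D sequences (illustrative)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: absolute difference of two scalar samples.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

reference = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # same shape, spoken more slowly
different = [2.0, 0.0, 2.0, 0.0, 2.0]

cost_same = dtw_cost(reference, list(reference))
cost_stretched = dtw_cost(reference, stretched)
cost_different = dtw_cost(reference, different)
```

In this toy case the identical and the time-stretched sequences both align at zero cost, while the differently shaped sequence gives a positive cost, which mirrors the zero versus non-zero cost values reported in the paper.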
Figure 2-13 Proposed methodology for speaker identification system

Figure 2-14 Efficiencies for identify genders for different machine learning algorithm

2.3.3 Speech Recognition and Verification Using MFCC & VQ

In this paper, a speech recognition system is built using MFCC for feature extraction and VQ for feature modeling. Since MFCC was discussed for the previous paper, VQ is emphasized here. In order for the system to estimate the probability distributions of the computed feature vectors, the VQ algorithm quantizes the extracted features into a smaller number of template vectors, since a small set of vectors can represent the characteristics of the whole feature set. The vectors are mapped into finite regions of the feature space, each of which is called a cluster. A codebook is generated for a specific speaker from that speaker's training samples by clustering the vectors, with each cluster centered on a code word, also known as a centroid. The VQ distortion is the distance between a vector and its closest centroid. The input speech signal is then compared against each codebook by determining its VQ distortion, measured as a Euclidean distance, and the speaker whose codebook gives the smallest value is considered to be the match. Figure 2-15 shows a visual image of VQ for 2 speakers inside a codebook. This paper also uses the k-means algorithm to cluster the training vectors into feature vectors, as shown in Figure 2-10. The algorithm produces k clusters by initializing k centroids, assigning each vector to its nearest centroid and then recomputing each centroid from the vectors assigned to it. The process is repeated until the centroids no longer change. The purpose of this process is to minimize the intra-cluster variance, V, which ensures that similar training vectors are inside the same cluster while the clusters are kept as far apart as possible (Patel & Prasad 2013).
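The codebook training and matching described in Section 2.3.3 can be sketched as follows. This is a hedged illustration, not the code of Patel & Prasad (2013): the helper names (`nearest`, `train_codebook`, `vq_distortion`), the two-dimensional toy vectors standing in for MFCC features, and the choice of initializing the centroids from the first k training vectors are all assumptions.

```python
from math import dist

def nearest(vec, codebook):
    """Index of the code word (centroid) closest to vec in Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))

def train_codebook(vectors, k, max_iter=100):
    """Plain k-means: return k centroids for one speaker's training vectors."""
    # Assumed initialization: the first k training vectors (k-means is
    # sensitive to this choice; real systems often use LBG splitting instead).
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        new_codebook = [
            [sum(col) / len(c) for col in zip(*c)] if c else codebook[i]
            for i, c in enumerate(clusters)
        ]
        if new_codebook == codebook:  # centroids stopped moving
            break
        codebook = new_codebook
    return codebook

def vq_distortion(vectors, codebook):
    """Average distance from each vector to its nearest code word."""
    return sum(dist(v, codebook[nearest(v, codebook)]) for v in vectors) / len(vectors)

# Hypothetical 2-D "feature vectors" for two enrolled speakers.
training = {
    "A": [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)],
    "B": [(9.0, 1.0), (1.0, 9.0), (9.2, 1.2), (0.8, 9.2)],
}
codebooks = {name: train_codebook(vecs, k=2) for name, vecs in training.items()}

test_utterance = [(1.1, 1.0), (5.0, 4.9)]
# The speaker whose codebook yields the smallest distortion is taken as the match.
identified = min(codebooks, key=lambda name: vq_distortion(test_utterance, codebooks[name]))
```

With these toy values the test utterance lies close to the code words of speaker A, so `identified` is `"A"`. In a real system each vector would be an MFCC frame and the codebook would be much larger.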
Figure 2-15 Visual image of VQ for 2 speakers inside a codebook

2.3.4 LPC and LPCC Method of Feature Extraction in Speech Recognition System

In this research paper, two feature extraction methods, Linear Predictive Cepstral Coefficient (LPCC) and Linear Predictive Coding (LPC), are examined and studied. The methodology of each method is reviewed and their merits and demerits are discussed thoroughly. Figure 2-16 and Figure 2-17 show block diagrams for the LPC and LPCC methods respectively, to show how the feature extraction works, representing speech signals by a finite number of measures of the signal.

Figure 2-16 Block diagram for LPC method

Figure 2-17 Block diagram for LPCC method

In short, the main process of both LPC and LPCC contains four steps. First, pre-emphasis filters the signal, normally with a coefficient between 0.9 and 1, with the purpose of flattening the signal. Second, framing divides the speech signal into frames with an overlap of 10 ms between two adjacent frames to ensure stationarity between frames. Third, windowing multiplies each frame by a Hamming window in order to minimize edge effects. Lastly, the LPC coefficients are computed by applying autocorrelation to the windowed frames, from which the LPCC can also be obtained. The LPCC method follows exactly the same steps as the LPC method; the only difference is that cepstral coefficients are then computed from the LPC parameters to obtain the LPCC features. The predictor coefficient vector must be found in order to calculate the cepstral coefficients. In short, feature extraction is the same for both methods, the small difference being that LPCC requires a conversion of the LPC coefficients. As for merits, LPC estimates speech parameters precisely, characterizes speech traits well and is computationally efficient, while LPCC provides better reliability and robustness.
As for demerits, the LPC method is unable to capture unvoiced sounds such as “th” and nasalized sounds such as “m” and “n” accurately, while for the LPCC method, the performance is degraded if an insufficient order is used. For both methods, performance is greatly affected when the environment contains noise (Gupta & Gupta 2016).

2.3.5 Hidden Markov Model (HMM) on Speech Recognition

HMM is a model commonly used in large vocabulary continuous speech recognition (LVCSR) systems. As speech is recognized over time, the model moves from state to state, and at each change there is also a possibility of returning to the same state. Time is indexed as t = 1, 2, 3, ..., and the actual state at time t is denoted as q_t. The state transition probability, which depends only on the current state and its predecessor, is given by the equation below.

a_ij = P(q_t = S_j | q_(t-1) = S_i), 1 ≤ i, j ≤ N

Here S_i and S_j are two of the N states of the model. Figure 2-18 shows a 5-state Markov chain with the probabilities of moving from one state to another or remaining in the same state (Rabiner 1989).

Figure 2-18 5 state Markov chain

2.3.6 Combination of VQ and Linde-Buzo-Gray (LBG) Algorithm Speech Recognition

This paper proposes a speech recognition system using a combination of VQ and the Linde-Buzo-Gray (LBG) algorithm. VQ works in the same way as discussed in Section 2.3.3, therefore the LBG algorithm is the focus here. Figure 2-19 shows the flowchart of the LBG algorithm. This process clusters a set of L training vectors into a set of M codebook vectors. To start the algorithm, a one-vector codebook is designed. The codebook is then split to double its size. Clustering is then done by executing a nearest-neighbor search for each training vector in order to find its nearest code word, and then the centroids are updated.
Finally, the distortion is computed by summing the distances between the training vectors and the code words found by the nearest-neighbor search. Once the average distance falls below the threshold value and the codebook has reached M vectors, the process stops and an M-vector codebook is produced. The experimental simulation in MATLAB is shown in Figure 2-20. The low accuracy in some of the tests is due to the researchers creating a hostile environment in which the system could not extract the signal accurately (Dua & Kamra 2019).

Figure 2-19 Flowchart of LBG algorithm

Figure 2-20 Experimental results for WER and accuracy

2.3.7 Performance of Different Classifier for Speech Recognition System

In this paper, three different classifiers are used to perform feature matching for Malayalam digits, namely Artificial Neural Network (ANN), Naïve Bayes (NB) and Support Vector Machine (SVM). The features of the Malayalam digits are extracted by the Discrete Wavelet Transform (DWT) algorithm, in which the signal passes through a low pass filter to obtain the approximation coefficients and a high pass filter to obtain the detail coefficients. Characteristics of the detail coefficients are more useful than those of the approximation coefficients for speech signals. The signals are sub-sampled by 2 repeatedly until the desired signal, as decided by the MATLAB algorithm, is obtained. Speech recognition is then performed on the extracted features by the classifiers, each of which creates a model from the training data to predict the class of each item in the test set.

The ANN classifier is a well-known data processing model that includes a number of basic processing units known as neurons. The word neuron comes from the neuron cells of the human brain, whose functions include learning, adapting and recognizing faults. Many ANN methods are available as classifiers; this paper uses a multilayer perceptron (MLP), which has an input layer of n inputs, one or more hidden layers, and an output layer.
The input layer receives the extracted features, while the output layer is where the MLP produces its classification. The hidden layers are trained with the back-propagation learning algorithm: the features distributed by the input layer are passed to the first hidden layer, and each subsequent hidden layer receives the output of the previous layer as its input. The error back-propagation algorithm corrects the errors in the backward direction, and the network eventually establishes the input-output relationship through the adjusted weights.

SVM classifies speech by constructing hyperplanes in a multidimensional space that separate the different class labels, based on statistical learning theory. There are two common SVM strategies for classifying data with more than two classes, namely One-against-One and One-against-All. One-against-One is chosen for this experiment, in which one binary SVM is trained for each pair of classes.

Finally, the NB classifier is also used for classification. It is based on the Naïve Bayesian algorithm, which calculates the probability of each class from a small dataset by applying the formula below.

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

In this equation, $P(A \mid B)$ is the posterior probability, that is, the probability of hypothesis A given the observed event B; $P(B \mid A)$ is the likelihood, that is, the probability of observing B given that hypothesis A is true; and $P(A)$ and $P(B)$ are the prior probability and the marginal probability respectively. This classifier requires only a small amount of training data to be effective. Figure 2-21 shows the results of the three classifiers. ANN is chosen since it obtains the highest accuracy, 89%, among the three methods (Suuny et al. 2013).
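The Bayes formula used by the NB classifier can be checked with a small numerical example. The sketch below is in Python for illustration only, and every number in it is hypothetical: A stands for “the spoken digit belongs to a given class” and B for “a particular feature value is observed”. None of the values come from Suuny et al. (2013).

```python
def posterior(likelihood, prior, marginal):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Hypothetical probabilities chosen only to make the arithmetic easy to follow.
p_a = 0.1                # prior P(A): the class occurs in 10% of utterances
p_b_given_a = 0.8        # likelihood P(B|A)
p_b_given_not_a = 0.2    # P(B|not A), needed to obtain the marginal P(B)

# Marginal probability from the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior probability of the class after observing the feature.
p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(f"P(B) = {p_b:.2f}, P(A|B) = {p_a_given_b:.4f}")
```

Observing the feature raises the probability of the class from the prior value to the posterior value. An NB classifier repeats this calculation for every class and picks the class with the largest posterior.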
Figure 2-21 Accuracy of three classifiers

2.3.8 Summary on Feature Extraction and Matching

Table 2-1 Summary of Feature Extraction and Matching

Author | Feature Matching & Testing Method | Feature Extraction Method | Additional Method | Accuracy
(Bala et al. 2010) | DTW | MFCC | - | -
(Jahangir et al. 2020) | DNN | MFCCT | - | 83.5% - 92.9%
(Patel & Prasad 2013) | VQ | MFCC | - | 87%
(Vyas 2013) | GMM | MFCC | - | 70% - 85%
(Matsui & Furui 1994) | Continuous HMM | LPC | DIM Method | 84.8% - 89.1%
(Maseri & Mamat 2019) | HMM | MFCC | - | 76% - 84.86%
(Dua & Kamra 2019) | VQ-LBG | MFCC | - | 64.3% - 92.6%

Table 2-1 summarizes the literature on speech recognition security systems reviewed above. To date, researchers have used different feature extraction, feature matching and testing methods. In summary, MFCC for feature extraction and HMM for feature matching and testing are used more often than other methods.

2.4 Limitations of Speech Recognition

2.4.1 Speech Recognition in Noisy Environment

This paper conducts experiments on human speech recognition in noisy environments. The effects of human language modelling are largely mitigated in the experiments. The MFCC algorithm eliminates the phase of the signal during processing. When humans listen to a speech signal whose phase has been eliminated, the recognition error rate increases slightly. The experimental results show that humans are not very sensitive to phase: the recognition rate falls from 100% to 91.5%, which is caused by the phase elimination of the MFCC technique. The results also show a reduction in resolution due to the subsequent cepstral filtering. As shown in the decision tree in the paper, a noisy environment affects the recognition rate of a speech recognition system the most, followed by difficult speakers combined with native listeners.
The results also show an increase of as much as four times in the digit error rate when difficult speakers are encountered, which is also linked to the reduction in spectral resolution. As shown in the paper, redundancy exists in speech recognition systems. Moreover, if the speech recognition system is capable enough and little noise interferes, the traditional cepstral-filtered features, which remove unnecessary information, may be sufficient to carry the information needed for accurate speech recognition (Peters et al. 1999).

2.4.2 Problems Encountered by using Speech Recognition System

This paper discusses the possibility that speech recognition systems cause voice problems for their users. The symptoms that users might face include sore throat, difficulty producing sound and, in the worst case, loss of voice. In addition, users are required to speak with constant pitch, inflection and volume, which might cause muscle fatigue in the throat because the vocal tract is held in a fixed position while speaking, and might eventually strain the laryngeal musculature. Although the paper emphasizes that more studies are needed to determine whether speech recognition systems cause problems for the user’s voice and throat, users are still advised to warm up and cool down their voice (Kambeyanda et al. 1997).

2.4.3 Summary in Limitations on Speech Recognition System

With the advance of technology, the problems encountered by speech recognition systems could be addressed by applying deep learning to the system. Instead of using HMM alone for feature matching and training, a hybrid CNN-HMM can be used, in which a Convolutional Neural Network (CNN) acts as the pattern recognizer. The speech features pass through three types of layer, namely convolution, pooling and non-linearity, as shown in Figure 2-22, so that the network learns and extracts the useful information from the speech features.
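The three layer types named above can be illustrated on a single frame of speech features. The sketch below is in Python for illustration only. The band energies, the three-tap kernel, the use of ReLU as the non-linearity, max pooling with a window of 2, and the order in which the operations are applied are all assumptions made for this example, not details of the cited CNN-HMM systems.

```python
def convolve(bands, kernel):
    """Slide one shared kernel along the frequency axis (valid positions only)."""
    k = len(kernel)
    return [sum(bands[i + j] * kernel[j] for j in range(k))
            for i in range(len(bands) - k + 1)]

def relu(values):
    """Non-linearity: negative activations are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Keep the largest activation in each non-overlapping window."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

# Invented filter-bank energies for one frame (eight frequency bands).
bands = [0.1, 0.5, 0.9, 0.4, 0.2, 0.7, 0.3, 0.0]
# Hypothetical kernel weights; in a trained CNN these would be learned.
kernel = [1.0, 0.0, -1.0]

conv_out = convolve(bands, kernel)   # 6 local responses, each from 3 bands
activated = relu(conv_out)           # non-linearity
features = max_pool(activated, 2)    # 3 pooled features
```

Each convolution output depends on only three neighbouring bands, and the same kernel is reused at every position. Pooling then keeps one value per pair of neighbouring outputs, so small shifts along the frequency axis have less effect on the result.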
Lastly, the CNN processes each input speech utterance by generating the HMM state probabilities for each frame. A Viterbi decoder is then used to obtain the sequence of labels corresponding to the input utterance (Abdel-Hamid et al. 2012).

CNN is used for feature matching because it recognizes words better in noisy environments, owing to three properties that distinguish CNN from other neural networks: locality, weight sharing and pooling. Locality in the convolutional layer means that each unit receives features representing a limited bandwidth of the whole speech spectrum. For this reason, the MFCC is modified so that it does not pass through the DCT-based decorrelation transform, since the speech input must be represented on a frequency scale that can be divided into a number of local bands. Locality also increases the robustness against non-white noise, because useful features can be computed locally by the local filters in the relatively cleaner parts of the spectrum (Abdel-Hamid et al. 2014), whereas pooling and weight sharing improve the performance of speech recognition. The article by Santos et al. (2015), titled “Speech Recognition in Noisy Environments with Convolutional Neural Networks”, suggests that the CNN-HMM model has the lowest equal error rate among the compared models when exposed to different kinds of noise, as shown in Figure 2-23.

Figure 2-22 Block diagram of CNN (Passricha & Aggarwal 2018)

Figure 2-23 Simulation results on different NN-HMM models (Santos et al. 2015)

Chapter 3 Methodology

3.1 Flow Chart of Implementation Process

Figure 3-1 Flow Chart of Implementation Process of Speech Recognition Security System

Figure 3-1 displays the flow chart of the implementation process of the speech recognition security system. First, research papers on topics related to the system, such as feature extraction and feature matching, are reviewed.
After that, the speech signals for the testing datasets are recorded using the GoldWave application. The code and the Graphical User Interface (GUI) are then implemented in MATLAB for testing the system. Troubleshooting is carried out if faults are found in the code.

3.2 Software used for implementation

3.2.0.1 MATLAB

The speech recognition system is simulated using the MATLAB software. MATLAB is a numerical computing environment developed by MathWorks that provides a fourth-generation programming language for technical computing. The software is used to program, compute and visualize algorithms. Common functions of MATLAB include matrix manipulation, plotting of functions and data, and more sophisticated functions such as the fast Fourier transform, Bessel functions, matrix inverse and matrix eigenvalues. It also provides a Graphical User Interface (GUI) to showcase the algorithm. Another reason MATLAB is used instead of a programming language such as C++ is that a toolbox is available to help simulate the speech recognition system, namely VOICEBOX, written by Mike Brookes. This speech processing toolbox contains many speech processing functions that are useful for simulating a speech recognition system (Pan 2014).

3.2.0.2 GoldWave

In order to test and train the speech recognition system, speech signals must be recorded from different users saying the same word or phrase, to ensure a high recognition rate for the speech security system. The speech signals are recorded using the GoldWave software. It provides various functions, including recording and editing speech signals, analysing audio with real-time visuals, applying audio effects, filtering and reducing noise in the speech signal, and saving in various formats such as .wav files, .snd files and many more.

3.3 Gantt Chart

Table 3-1 below shows the process of implementing software, proposal writing and thesis writing for 2 semesters.
Table 3-1 Gantt Chart for Semester 1 and 2

Semester 1 (Weeks 1 to 14): Confirmation on Supervisor; Choose FYP Title; Submission of Proposal; Research on Project; Software Implementation; Proposal Submission; Logbook and Risk Assessment Submission; Proposal Writing; Meeting with Supervisor; Proposal Defend.

Semester 2 (Weeks 1 to 14): Software Implementation; Troubleshoot on Software; Result Analysing and Improvement; Thesis Writing; Thesis Submission; Logbook Submission; Thesis Defend.

Chapter 4 Conclusion

In conclusion, a speech recognition door security system is developed. Feature extraction (the MFCC algorithm) converts the speech signal into a sequence of acoustic speech vectors, while feature matching and training (CNN) is used to train the database of the system in order to raise the accuracy of detecting the correct user. The whole speech recognition process is simulated in MATLAB, while the speech signals are recorded using GoldWave.

References

Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G. & Yu, D. (2014), ‘Convolutional neural networks for speech recognition’, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10), 1533–1545.

Abdel-Hamid, O., Mohamed, A.-r., Jiang, H. & Penn, G. (2012), Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in ‘2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)’, IEEE, pp. 4277–4280.

Bala, A., Kumar, A. & Birla, N. (2010), ‘Voice command recognition system based on MFCC and DTW’, International Journal of Engineering Science and Technology 2(12), 7335–7342.

Biswas 2nd, P. & Mynuddin 3rd, M. (n.d.), ‘Design and implementation of smart home security system’.

Chavan, R. S. & Sable, G. S. (2013), ‘An overview of speech recognition using HMM’, International Journal of Computer Science and Mobile Computing 2(6), 233–238.

Dua, S. & Kamra, A. (2019), Development of speech recognition system: Using combination of vector quantization and Linde-Buzo-Gray algorithm.

Gupta, H. & Gupta, D. (2016), LPC and LPCC method of feature extraction in speech recognition system, in ‘2016 6th International Conference-Cloud System and Big Data Engineering (Confluence)’, IEEE, pp. 498–502.

Haridas, A. V., Marimuthu, R. & Sivakumar, V. G. (2018), ‘A critical review and analysis on techniques of speech recognition: The road ahead’, International Journal of Knowledge-Based and Intelligent Engineering Systems 22(1), 39–57.

Jahangir, R., Teh, Y. W., Memon, N. A., Mujtaba, G., Zareei, M., Ishtiaq, U., Akhtar, M. Z. & Ali, I. (2020), ‘Text-independent speaker identification through feature fusion and deep neural network’, IEEE Access 8, 32187–32202.

Kambeyanda, D., Singer, L. & Cronk, S. (1997), ‘Potential problems associated with use of speech recognition products’, Assistive Technology 9(2), 95–101.

Lippmann, R. P. (1997), ‘Speech recognition by machines and humans’, Speech Communication 22(1), 1–15.

Maseri, M. & Mamat, M. (2019), Malay language speech recognition for preschool children using hidden Markov model (HMM) system training, in ‘Computational Science and Technology’, Springer, pp. 205–214.

Matsui, T. & Furui, S. (1994), ‘Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMM’s’, IEEE Transactions on Speech and Audio Processing 2(3), 456–459.

Najkar, N., Razzazi, F. & Sameti, H. (2010), ‘A novel approach to HMM-based speech recognition systems using particle swarm optimization’, Mathematical and Computer Modelling 52(11–12), 1910–1920.

Pan, L. (2014), ‘Research and simulation on speech recognition by MATLAB’.
Passricha, V. & Aggarwal, R. K. (2018), Convolutional neural networks for raw speech recognition, IntechOpen.

Patel, K. & Prasad, R. (2013), ‘Speech recognition and verification using MFCC & VQ’, Int. J. Emerg. Sci. Eng. (IJESE) 1(7), 137–140.

Peters, S. D., Stubley, P. & Valin, J.-M. (1999), On the limits of speech recognition in noise, in ‘1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258)’, Vol. 1, IEEE, pp. 365–368.

Rabiner, L. R. (1989), ‘A tutorial on hidden Markov models and selected applications in speech recognition’, Proceedings of the IEEE 77(2), 257–286.

Rashid, R. A., Mahalin, N. H., Sarijari, M. A. & Aziz, A. A. A. (2008), Security system using biometric technology: Design and implementation of voice recognition system (VRS), in ‘2008 International Conference on Computer and Communication Engineering’, IEEE, pp. 898–902.

Santos, R. M., Matos, L. N., Macedo, H. T. & Montalvão, J. (2015), Speech recognition in noisy environments with convolutional neural networks, in ‘2015 Brazilian Conference on Intelligent Systems (BRACIS)’, pp. 175–179.

Shah, H. N. M., Ab Rashid, M. Z., Abdollah, M. F., Kamarudin, M. N., Lin, C. K. & Kamis, Z. (2014), ‘Biometric voice recognition in security system’, Indian Journal of Science and Technology 7(2), 104.

Suuny, S., Peter, S. D. & Jacob, K. P. (2013), ‘Performance of different classifiers in speech recognition’, Int. J. Res. Eng. Technol 2(4), 590–597.

Vyas, M. (2013), ‘A Gaussian mixture model based speech recognition system using MATLAB’, Signal & Image Processing 4(4), 109.

Yilwatda, M. M., Enokela, J. A. & Goshwe, N. Y. (2017), ‘Implementation of a two-level security door access using keypad and voice recognition’, International Journal of Security and Its Applications 11(4), 45–58.

Zaini, N. M. S. M., Zakaria, N. A., Roslan, I., Nahar, H., Ghazali, K. W. M. & Harum, N. (2021), ‘IoT system design using voice recognition technique for door access control system’, Manuscript Editor 2021, 145.

Appendices