TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
1.1 SPEECH RECOGNITION VARIABLES
1.1.1 VOCABULARY SIZE
1.1.2 SPEAKER DEPENDENCY
1.1.3 CONTINUITY OF SPEECH
1.1.4 BACKGROUND NOISE
1.2 PROBLEM STATEMENT
1.3 CONTRIBUTIONS
1.4 SOFTWARE DEVELOPMENT LIFE CYCLE MODEL
2. SPEECH DATASET (DATABASE)
2.1 PREVIOUS WORK
2.2 DATA COLLECTION
2.3 DATASET STATISTICS
2.4 SUPPORT TOOLS
3. HIDDEN MARKOV MODEL (HMM)
3.1 DISCRETE MARKOV PROCESS
3.2 EXTENSION TO HIDDEN MARKOV MODELS
3.3 ELEMENTS OF HMM
3.4 THE THREE BASIC PROBLEMS OF HMM
PROBLEM 1 AND ITS SOLUTION
PROBLEM 2 AND ITS SOLUTION
PROBLEM 3 AND ITS SOLUTION
3.5 TYPES OF HMM
3.6 OPTIMIZATION CRITERIA
4. HIDDEN MARKOV MODEL TOOLKIT (HTK)
4.1 SIMPLE HTK EXAMPLE
4.2 DIFFICULTIES IN HTK
5. MODELING AND LABELING TECHNIQUES
5.1 EXAMPLE ON WORD-LEVEL LABELING USING HTK
6. SPEECH RECOGNITION EXPERIMENTS WITH DIFFERENT LABELING SCHEMES
6.1 LABELING ON WORD LEVEL
6.2 PHONETIC-BASED LABELING
6.3 LITERAL-BASED LABELING (OLD VERSION)
6.4 LITERAL-BASED LABELING (NEW VERSION)
6.5 COMPARISON BETWEEN THE FOUR METHODS
DISCUSSION
7. RESULTS OF WORD-LEVEL LABELING
8. RESULTS OF LITERAL-BASED LABELING
9. RESULTS OF PHONETIC-BASED LABELING
10. CONCLUSION
10.1 FUTURE WORK
REFERENCES

LIST OF TABLES

Table 2.1: The dataset parts
Table 2.2: The names used in the SR system and the number of different samples for each name
Table 4.1: Brief description of some HTK tools
Table 4.2: Files needed to build an HTK example
Table 4.3: Files generated from the HTK example
Table 5.1: Results of comparing word-level and letter-level labeling
Table 6.1: The classes included in the small project and their sizes
Table 6.2: Results on the multi-speaker dataset using word-level labeling
Table 6.3: Results on the single-speaker dataset using word-level labeling
Table 6.4: Set of rules used in phonetic-based labeling
Table 6.5: Results on the multi-speaker dataset using phonetic-based labeling
Table 6.6: Results on the single-speaker dataset using phonetic-based labeling
Table 6.7: Set of rules used in literal-based labeling (old version)
Table 6.8: Results on the multi-speaker dataset using literal-based labeling (old version)
Table 6.9: Results on the single-speaker dataset using literal-based labeling (old version)
Table 6.10: Set of rules used in literal-based labeling (new version)
Table 6.11: Results on the multi-speaker dataset using literal-based labeling (new version)
Table 6.12: Results on the single-speaker dataset using literal-based labeling (new version)
Table 6.13: Comparison between different labeling techniques on the multi-speaker dataset
Table 6.14: Comparison between different labeling techniques on the single-speaker dataset
Table 7.1: Results of the word-level labeling experiment
Table 8.1: Results of the literal-based labeling experiment
Table 8.2: Confusion matrix of the literal model on the training set
Table 8.3: Confusion matrix of the literal model on the testing set
Table 9.1: Results of the phonetic-based labeling experiment using monophones
Table 9.2: Confusion matrix of the phoneme model on the training set using monophones
Table 9.3: Confusion matrix of the phoneme model on the testing set using monophones
Table 9.4: Results of the phonetic-based labeling experiment using triphones
Table 9.5: Confusion matrix of the phoneme model on the testing set using triphones
Table 9.6: Comparison between monophone and triphone results

TABLE OF FIGURES

Figure 3.1: A three-state HMM
Figure 3.2: Ergodic HMM
Figure 3.3: Left-to-right HMM

Abstract

Speech recognition has been of interest to scientists for more than four decades. Arabic speech recognition has attracted Arab researchers as well, but performance has remained limited, largely because no ready-made speech dataset has been available on which to build their systems. In this project, we build a speech recognition system using the HTK toolkit. We also build a speaker-independent speech dataset that is used to train and validate our system; this dataset can also be useful for other speech recognition research. Finally, we run our recognizer interactively so that it can be used in information retrieval applications.

1. Introduction

Speech recognition has been of interest to scientists for more than four decades because of its wide range of applications. Speech recognition means recognizing the speech of a talker and transforming it into the language of the receiver [1]. In computing, automatic speech recognition means transforming human speech into text or into a command to the computer. It can be used in a very wide range of applications, such as information retrieval or applications that simplify human-computer interaction.

Speech recognition is a complex process, although we do not notice this because it comes so naturally to humans. It is a large problem with many variables that can degrade recognition performance if they are ignored. Speech varies from one language to another, from one accent to another, and from one voice to another; it even varies for the same person from one occasion to another. In addition, background noise and the electrical characteristics of the telephone or other electrical equipment can affect recognition [1].

The performance of automatic speech recognition has improved over the past few years.
This improvement is driven by the strong interest of scientists in the field, because speech recognition is the simplest and fastest way for humans and computers to communicate. It saves users time and effort, which (besides accuracy) is the main goal of computing [1]. Applications of automatic speech recognition include telephone-operated services, airline reservation systems, spoken interfaces to intelligent robots, learning systems, dictation systems, systems that help people with special needs, and many others [5].

1.1 Speech Recognition Variables [1]

Automatic speech recognition involves many variables that affect its performance and increase the difficulty of building such a system. Some of these variables are described below.

1.1.1 Vocabulary Size

The vocabulary size is the number of words that the system is expected to recognize. As the vocabulary size increases, performance decreases.

1.1.2 Speaker Dependency

Speech recognition systems can be divided into two types based on this variable:

Speaker-dependent SR system: recognizes the speech of one speaker only, or of a set of speakers known to the system in advance.

Speaker-independent SR system: recognizes the speech of any speaker. This is the more difficult type, and it requires a large speech database for training.

1.1.3 Continuity of Speech

Speech recognition systems can also be divided into three types based on this variable:

Isolated word recognition: recognizes only a single word at a time.

Discontinuous speech recognition: recognizes full sentences whose words are separated by silences.

Continuous speech recognition: recognizes natural human speech. This is the most difficult type of SR system.

1.1.4 Background Noise

As mentioned above, background noise can affect the recognition process. For this reason, some systems impose conditions on the environment, such as speaking in a quiet room, turning off the air conditioners, or using a specific type of microphone.

An ideal speech recognition system would recognize human speech with good accuracy for an unlimited vocabulary, independently of the speaker, for continuous speech, and in any environment.

1.2 Problem Statement

Our goal in this project is to create a voice-directed phone directory system. We will build a speech recognition system that recognizes spoken Arabic names from any talker and then gives the user the phone number of the person asked for. Because we build the recognition core itself, the system can later be extended to serve any application: all that is needed is a new interface between the application and the recognition system. Our recognizer will be an isolated-word, speaker-independent speech recognition system.
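To make the separation between the recognition core and the application concrete, the sketch below shows one possible shape for such an interface. This is only an illustration: the names ArabicNameRecognizer and PhoneDirectory are hypothetical, and in our project the recognition side is realized with HTK rather than hand-written code.

```python
# Illustrative sketch only: the recognizer core and the directory application
# are kept separate, so the same recognizer could later back other applications.
# ArabicNameRecognizer and PhoneDirectory are hypothetical names, not part of HTK.

class ArabicNameRecognizer:
    """Stand-in for the HTK-based recognizer: audio in, best-matching name out."""
    def recognize(self, audio) -> str:
        raise NotImplementedError("replace with a call to the HTK-based recognizer")

class PhoneDirectory:
    """The application layer: maps a recognized name to a phone number."""
    def __init__(self, entries):
        self.entries = entries  # e.g. {"Ahmad": "555-0101"}

    def lookup(self, recognizer, audio):
        name = recognizer.recognize(audio)          # recognition core
        return self.entries.get(name, "not found")  # application-specific logic
```

Swapping the phone directory for another application would only require replacing the PhoneDirectory part; the recognition core stays unchanged.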
1.3 Contributions

What is new in this project is that it recognizes speech that comes over the telephone. Such speech arrives with noise and some loss of data because of the characteristics of the telephone line and its limited bandwidth. Many existing applications provide dictation for the Arabic language, but so far there is no information retrieval system for Arabic based on automatic speech recognition.

1.4 Software Development Life Cycle Model

In this project, we follow the development life cycle below to achieve our goal:

Speech dataset design and development: we design and build a large speaker-independent speech dataset that is used to train our recognition system. The dataset may also be used in the future to train other recognition systems.

Learning the Hidden Markov Model (HMM) and its implementation toolkit (HTK): we learn the concept of the HMM, the statistical model used to build the recognition models (a small illustrative sketch follows at the end of this section). This model is implemented in the Hidden Markov Model Toolkit (HTK), and we learn how to use this toolkit to build a recognition system.

Modeling: we build the recognition models and analyze the different modeling techniques available, with the advantages and disadvantages of each.

Implementation: we implement our recognition system using HTK.

Testing: we run our recognizer, perform all the tests we need on it, and make sure that it works well.

We can cycle back from any step to a previous step or steps to modify the system if errors are found.
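As a first taste of the statistical model mentioned in the HMM learning step above, the sketch below computes the probability of a discrete observation sequence with the standard forward algorithm (the evaluation problem treated in Chapter 3). The two-state model and its numbers are a toy example chosen only for illustration; they are not parameters of any model trained in this project, and HTK performs such computations internally.

```python
import numpy as np

# Toy discrete HMM: 2 hidden states, 3 observation symbols.
# These parameters are illustrative only, not a trained model.
A  = np.array([[0.7, 0.3],        # state transition probabilities a_ij
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],   # observation probabilities b_j(k)
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])         # initial state distribution

def forward_probability(obs):
    """Return P(O | model) for an observation sequence via the forward algorithm."""
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction over time
    return alpha.sum()                    # termination

print(forward_probability([0, 2, 1]))     # probability of observing symbols 0, 2, 1
```

In a recognizer, one such model is trained per word (or per sub-word unit), and the model that assigns the highest probability to an utterance determines the recognized word.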