ROBUST PHONEME CLASSIFICATION USING RADON TRANSFORM OF AUDITORY NEUROGRAMS

MD. SHARIFUL ALAM

DEPARTMENT OF BIOMEDICAL ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF MALAYA
KUALA LUMPUR

2016

ROBUST PHONEME CLASSIFICATION USING RADON TRANSFORM OF AUDITORY NEUROGRAMS

MD. SHARIFUL ALAM

DISSERTATION SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING SCIENCE

DEPARTMENT OF BIOMEDICAL ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF MALAYA
KUALA LUMPUR

2016

UNIVERSITY OF MALAYA
ORIGINAL LITERARY WORK DECLARATION

Name of Candidate: Md. Shariful Alam (I.C/Passport No: BH0110703)
Registration/Matric No: KGA140010
Name of Degree: Master of Engineering Science
Title of Dissertation: ROBUST PHONEME CLASSIFICATION USING RADON TRANSFORM OF AUDITORY NEUROGRAMS
Field of Study: Signal processing

I do solemnly and sincerely declare that:
(1) I am the sole author/writer of this Work;
(2) This Work is original;
(3) Any use of any work in which copyright exists was done by way of fair dealing and for permitted purposes and any excerpt or extract from, or reference to or reproduction of any copyright work has been disclosed expressly and sufficiently and the title of the Work and its authorship have been acknowledged in this Work;
(4) I do not have any actual knowledge nor do I ought reasonably to know that the making of this work constitutes an infringement of any copyright work;
(5) I hereby assign all and every rights in the copyright to this Work to the University of Malaya ("UM"), who henceforth shall be owner of the copyright in this Work and that any reproduction or use in any form or by any means whatsoever is prohibited without the written consent of UM having been first had and obtained;
(6) I am fully aware that if in the course of making this Work I have infringed any copyright whether intentionally or otherwise, I may be subject to legal action or any other action as may be determined by UM.

Candidate's Signature                Date:
Subscribed and solemnly declared before,
Witness's Signature                  Date:
Name:
Designation:

ABSTRACT

The use of speech recognition technology has increased considerably in the last three decades. In the real world, the performance of well-trained speech recognizers is usually degraded by different types of noise and distortion, such as background noise, reverberation and telephone channels. In particular, speech signals are extremely difficult to recognize due to the interference created by reverberation and the limited bandwidth of transmission channels.
The accuracy of traditional speech recognition systems in noisy environments is much lower than the recognition accuracy of an average human listener, so the robustness of speech recognition systems must be addressed for practical applications. Although many successful techniques have been developed for dealing with clean signals and with noise, particularly uncorrelated noise with simple spectral characteristics (e.g., white noise), the problem of sound reverberation and channel distortions has remained essentially unsolved. This problem hampers the wider use of acoustic interfaces in many applications. Unlike traditional methods in which features are extracted from the properties of the acoustic signal, this study proposes a phoneme classification technique using neural responses from a physiologically-based computational model of the auditory periphery. The 2-D neurograms were constructed from the simulated responses of the auditory-nerve fibers to speech phonemes. The features of the neurograms were extracted using the Radon transform and used to train a support vector machine classifier. Classification performances were evaluated for phonemes extracted from the TIMIT and HTIMIT databases. Experiments were performed in mismatched train/test conditions, where the test data consisted of speech corrupted by a variety of real-world additive noises at different signal-to-noise ratios (SNRs), convolutive distortions introduced by different room impulse response functions, and multiple telephone-channel speech recordings with different frequency characteristics. Performances of the proposed method were compared to those of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), and Frequency Domain Linear Prediction (FDLP)-based phoneme classifiers. Based on simulation results, the proposed method outperformed most of the traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions. This approach is accurate, easy to implement, and can be used without any knowledge about the type of distortion in the signal, i.e., it can handle any type of noise. Using hybrid (support vector machine/hidden Markov model) classifiers, the proposed method could be extended to develop an automatic speech recognition system.

ABSTRAK

Semenjak 3 dekad yang lalu, penggunaan teknologi pengecaman pertuturan semakin meningkat dengan mendadak. Dalam situasi praktikal, prestasi teknologi pengecaman pertuturan selalunya jatuh disebabkan pelbagai jenis bunyi bising seperti bunyi latar belakang, bunyi gema dan herotan telefon. Khususnya isyarat ucapan dengan bunyi latar belakang atau herotan saluran amat sukar untuk dikecam disebabkan gangguan tersebut. Ketepatan sistem pengecaman pertuturan tradisional dalam keadaan bising adalah amat rendah berbanding prestasi pengecaman seorang manusia sederhana. Oleh itu, kemantapan sistem pengecaman pertuturan hendaklah diselidiki bagi aplikasi praktikal. Walaupun pelbagai teknik telah berjaya diusahakan bagi mengendalikan isyarat dalam keadaan bersih dan bising (khususnya bunyi bising dengan ciri-ciri spektral yang asas), cabaran gangguan bunyi gema serta gangguan saluran masih belum diselesaikan. Masalah ini menjadi suatu halangan bagi penggunaan antara muka akustik dalam ramai aplikasi.
Berlainan dengan kaedah tradisional di mana ciri-ciri diambil dari isyarat akustik, kajian ini mencadangkan kaedah klasifikasi fonem yang berdasarkan tindak balas neural daripada model pengiraan fisiologi sistem periferi pendengaran. Neurogram 2-D telah diperolehi daripada simulasi tindak balas saraf auditori terhadap fonem pertuturan. Ciri-ciri neurogram telah diperolehi menggunakan transformasi Radon dan diguna untuk melatih sistem pengelasan mesin vektor sokongan. Prestasi pengelasan bagi sistem tersebut telah dinilai menggunakan fonem-fonem berbeza dari pangkalan data TIMIT dan HTIMIT. Eksperimen telah dijalankan dalam keadaan latihan/ujian 'mismatch' di mana data bagi proses ujian ditambah dengan herotan yang berbeza pada nisbah isyarat-hingar (SNR) berlainan, herotan kusut serta saluran telefon yang berlainan. Prestasi kaedah yang dicadangkan telah dibandingkan dengan kaedah pengelasan Mel-Frequency Cepstral Coefficients (MFCC), Frequency Domain Linear Prediction (FDLP) dan Gammatone Frequency Cepstral Coefficients (GFCC). Berdasarkan keputusan simulasi, kaedah yang dicadangkan telah menunjukkan prestasi yang lebih cemerlang berbanding kaedah tradisional berdasarkan ciri akustik dalam keadaan bising dan senyap. Kaedah ini tepat, senang diguna dan boleh diguna tanpa maklumat mengenai jenis herotan isyarat. Dengan menggunakan pengelas hibrid (mesin vektor sokongan/model terselindung Markov), kaedah ini boleh diguna bagi mengusahakan sistem pengecaman pertuturan automatik.

ACKNOWLEDGEMENTS

In the name of Allah, the most Merciful and Beneficent. First and foremost, praise is to ALLAH, the Almighty, the greatest of all, on whom ultimately we depend for sustenance and guidance. I would like to thank Almighty Allah for giving me the opportunity, determination and strength to do my research. His continuous grace and mercy was with me throughout my life and ever more during the tenure of my research.

This research, conducted at the Auditory Neuroscience (AN) lab, would not have been possible without the direct and indirect help I received from so many others, including teachers, family and friends. I would like to gratefully and sincerely thank my principal supervisor, Dr. Muhammad Shamsul Arefeen Zilany, for his guidance, understanding, patience, and most importantly, his friendship during my graduate studies. I also want to thank my co-supervisor, Dr. Mohd Yazed Bin Ahmad, for his continuous support throughout the entire project. I want to express my thanks for the financial support I received from the HIR projects of UM. My heartfelt gratitude also goes to Dr. Wissam A. Jassim, a research fellow in the Department of Electrical Engineering, who dutifully and patiently taught me the hands-on work of developing the software and shared ideas for solving the problems that arose in completing this thesis. Thanks also to my research-mates for sharing their thoughts to improve the research outcome.

I owe everything to my parents and siblings, who encouraged and helped me at every stage of my personal and academic life and longed to see this achievement come true. I dedicate this work to my wife and son (Saad Abdullah).

TABLE OF CONTENTS

Abstract ............................................................................................................................ iv
Abstrak ............................................................................................................................. vi
Acknowledgements ....................................................................................................... viii
List of Figures ................................................................................................................ xiii
List of Tables .................................................................................................................. xvi
List of Symbols and Abbreviations .............................................................................. xviii

CHAPTER 1: INTRODUCTION .................................................................................. 1
1.1 Phoneme classification ............................................................................................ 2
1.2 Automatic speech recognition ................................................................................. 4
1.3 Problem Statement ................................................................................................... 5
1.4 Motivation ................................................................................................................ 8
1.4.1 Comparisons between Humans and Machines ........................................... 8
1.4.2 Neural-response-based feature ................................................................... 9
1.5 Objectives of this study ......................................................................................... 10
1.6 Scope of the study .................................................................................................. 11
1.7 Organization of Thesis ........................................................................................... 13

CHAPTER 2: LITERATURE REVIEW .................................................................... 14
2.1 Introduction ............................................................................................................ 14
2.2 Research background ............................................................................................. 14
2.3 Existing metrics ..................................................................................................... 16
2.3.1 Mel-frequency Cepstral Coefficient (MFCC) .......................................... 16
2.3.2 Gammatone Frequency Cepstral Coefficient (GFCC) ............................. 17
2.3.3 Frequency Domain Linear Prediction (FDLP) ......................................... 17
2.4 Structure and Function of the Auditory System .................................................... 18
2.4.1 Outer Ear .................................................................................................. 19
2.4.2 Middle Ear (ME) ...................................................................................... 19
2.4.3 Inner Ear ................................................................................................... 20
2.4.4 Basilar Membrane Responses .................................................................. 20
2.4.5 Auditory Nerve ......................................................................................... 21
2.5 Brief history of Auditory Nerve (AN) Modeling .................................................. 22
2.5.1 Description of the computational model of AN ....................................... 25
2.5.1.1 C1 Filter .................................................................................... 26
2.5.1.2 Feed-forward control path (including OHC) ............................ 26
2.5.1.3 C2 Filter .................................................................................... 27
2.5.1.4 The Inner hair cell (IHC) .......................................................... 28
2.5.2 Envelope (ENV) and Temporal Fine Structure (TFS) neurogram ........... 31
2.6 Support Vector Machines ...................................................................................... 31
2.6.1 Linear Classifiers ...................................................................................... 32
2.6.2 Non-linear Classifiers ............................................................................... 33
2.6.3 Kernels ...................................................................................................... 33
2.6.4 Multi-class SVMs ..................................................................................... 34
2.7 Radon Transform ................................................................................................... 34
2.7.1 Theoretical foundation ............................................................................. 34
2.7.2 How Radon transform works ................................................................... 35
2.7.3 Current applications ................................................................................. 36

CHAPTER 3: METHODOLOGY ............................................................................... 37
3.1 System overview .................................................................................................... 37
3.2 Datasets .................................................................................................................. 37
3.2.1 TIMIT database ........................................................................................ 38
3.2.2 HTIMIT corpus ........................................................................................ 40
3.3 AN model and neurogram ..................................................................................... 41
3.4 Feature extraction using Radon transform ............................................................. 44
3.5 SVM classifier ....................................................................................................... 45
3.6 Environmental Distortions ..................................................................................... 47
3.6.1 Existing Strategies to Handle Environment Distortions .......................... 51
3.7 Generation of noise ................................................................................................ 52
3.7.1 Speech with additive noise ....................................................................... 52
3.7.2 Reverberant Speech .................................................................................. 52
3.7.3 Telephone speech ..................................................................................... 54
3.8 Similarity measure ................................................................................................. 55
3.9 Procedure ............................................................................................................... 56
3.9.1 Feature extraction using MFCC, GFCC and FDLP for classification ..... 58

CHAPTER 4: RESULTS .............................................................................................. 60
4.1 Introduction ............................................................................................................
60 4.2 Overview of result ................................................................................................. 60 4.3 Classification accuracies (%) for phonemes in quiet environment ....................... 63 4.4 Performance for signal with additive noise ........................................................... 64 4.5 Performance for reverberant speech ...................................................................... 68 4.6 Signals distorted by noise due to telephone channel ............................................. 69 CHAPTER 5: DISCUSSIONS ..................................................................................... 71 5.1 Introduction............................................................................................................ 71 5.2 Broad class accuracy.............................................................................................. 71 5.3 Comparison of results from previous studies ........................................................ 73 5.4 Effect of the number of Radon angles on classification results............................. 75 xi 5.5 Effect of SPL on classification results ................................................................... 76 5.6 Effect of window length on classification results .................................................. 77 5.7 Effect of number of CFs on classification results .................................................. 78 5.8 Robustness property of the proposed system ........................................................ 79 CHAPTER 6: CONCLUSIONS & FUTURE WORKS ............................................ 84 6.1 Conclusions ........................................................................................................... 84 6.2 Limitations and future work .................................................................................. 85 LIST OF PUBLICATIONS AND CONFERENCE PROCEEDINGS..................... 87 REFERENCES ……………………………………………………………………...88 APPENDIX A – CONFUSION MATRICES............................................................ 100 xii LIST OF FIGURES Figure 1.1: Overview of phoneme classification .............................................................. 2 Figure 1.2: Difference between phoneme classification and phoneme recognition. The small box encloses the task of phoneme classifier, and the big box encloses the task of a phoneme recogniser. After the phonemes have been classified in (b), the dynamic programming method (c) finds the most likely sequence of phonemes (d). ..................... 3 Figure 1.3: Architecture of an ASR system [adapted from(Wang, 2015)] ....................... 4 Figure 2.1: Illustration of block diagram for MFCC derivation. .................................... 16 Figure 2.2: Illustration of methodology to extract GFCC feature ................................... 17 Figure 2.3: Deriving sub-band temporal envelopes from speech signal using FDLP. ... 18 Figure 2.4: Illustration of the structure of the auditory system showing outer, middle and inner ear (Reproduced from Encyclopaedia Britannica, Inc. 1997)................................ 19 Figure 2.5: Motions of BM at different frequencies (Reproduced from Encyclopaedia Britannica, Inc. 1997) ...................................................................................................... 21 Figure 2.6: Model of one local peripheral section. It includes outer/ME, BM, and IHC– AN synapse models. 
(Robert & Eriksson, 1999) ............................................................ 23 Figure 2.7: The model of the auditory peripheral system developed by Bruce et al. (Bruce et al., 2003), modified from Zhang et al. (Zhang et al., 2001) ............................ 24 Figure 2.8: Schematic diagram of the auditory-periphery. The model consists of ME filter, a feed-forward control path, two signal path such as C1 and C2, the inner hair cell xiii (IHC), outer hair cell (OHC) followed by the synapse model with spike generator. (Zilany & Bruce, 2006). .................................................................................................. 25 Figure 2.9: (A) Schematic diagram of the model of the auditory periphery (B) IHC-AN synapse model: exponential adaptation followed by parallel PLA models (slow and fast).................................................................................................................................. 29 Figure 2.10 (a) a separating hyperplane. (b) The hyperplane that maximizes the margin of separability .................................................................................................................. 32 Figure 3.1: Block diagram of the proposed phoneme classifier...................................... 37 Figure 3.2: Time-frequency representations of speech signals. (A) a typical speech waveform (to produce spectrogram and neurogram of that signal), (B) the corresponding spectrogram responses, and (C) the respective neurogram responses. ........................... 42 Figure 3.3: Geometry of the DRT ................................................................................... 44 Figure 3.4: This figure shows how Radon transforms work. (a) a binary image (b) Radon Transform at 0 Degree (c) Radon Transform at 45 Degree. ................................ 45 Figure 3.5: Types of environmental noise which can affect speech signals (Wang, 2015) ......................................................................................................................................... 48 Figure 3.6: Impact of environment distortions on clean speech signals in various domain: (a) clean speech signal (b) speech corrupted with background noise (c) speech degraded by reverberation (d) telephone speech signal. (e)-(h) Spectrum of the respective signals shown in (a)–(d). (i)-(l) neurogram of the respective signals shown in (a)–(d).(m)-(p) Radon coefficient of the respective signals shown in (a)–(d). ............... 50 xiv Figure 3.7: Comparison of clean and reverberant speech signals for phoneme /aa/: (a) clean speech, (b) signal corrupted by reverberation (c)-(d) Spectrogram of the respective signals shown in (a)–(b) and (e)-(f) Radon coefficient of the respective signals shown in (a)-(b). ............................................................................................................................. 53 Figure 3.8: A simplified environment model where background noise and channel distortion dominate. ........................................................................................................ 55 Figure 3.9: Neurogram -based feature extraction for the proposed method: (a) a typical phoneme waveform (/aa/), (b) speech corrupted with SSN (10 dB) (c) speech corrupted with SSN (0 dB). (d)-(f) neurogram of the respective signals shown in (a)–(c). (g)-(h) Radon coefficient of the respective signals shown in (a)–(b). (j) Radon coefficient of the respective signals shown in (a)-(c). 
................................................................................. 57 Figure 4.1: Broad phoneme classification accuracies (%) for different features in various noise types at different SNRs values. Clean condition is denoted as Q. ......................... 67 Figure 5.1: Example of radon coefficient representations: (a) stop /p/ (b) fricative /s/ (c) nasal /m/ (d) vowel /aa/. .................................................................................................. 72 Figure 5.2: Example of radon coefficient representations for stop: (a) /p/ (b)/t/ (c)/k/ and (d) /b/ ........................................................................................................................ 72 Figure 5.3: The correlation coefficient (a) MFCC features extracted from the phoneme in quiet (solid line) and at an SNR of 0 dB (dotted line) condition. The correlation coefficient between the two vectors was 0.76. (b) FDLP features under clean and noisy conditions. Correlation coefficient between the two cases was 0.72. (d) Neurogram responses of the phoneme under clean and noisy conditions. The Correlation coefficient between the two vectors was 0.85. .................................................................................. 80 xv LIST OF TABLES Table 1.1: Human versus machine speech recognition performance (Halberstadt, 1998). ........................................................................................................................................... 8 Table 1.2: Human and machine recognition results. All percentages are word error rates. Best results is indicated in bold......................................................................................... 9 Table 3.1: Mapping from 61 classes to 39 classes, as proposed by Lee and Hon (Lee & Hon, 1989). ..................................................................................................................... 39 Table 3.2: Broad classes of phones proposed by Reynolds and Antoniou, (T. J. Reynolds & Antoniou, 2003). ......................................................................................... 40 Table 3.3: Number of token in phonetic subclasses for train and test sets ..................... 40 Table 4.1: Classification accuracies (%) of individual and broad class phonemes for different feature extraction techniques on clean speech, speech with additive noise (average performance of six noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs), reverberant speech (average performance for eight room impulse response functions), and telephone speech (average performance for nine channel conditions). The best performance for each condition is indicated in bold. ...................................................... 61 Table 4.2: Confusion matrices for segment classification in clean condition................. 62 Table 4.3: Classification accuracies (%) of broad phonetic classes in clean condition. . 64 Table 4.4: Individual phoneme classification accuracies (%) for different feature extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold. ....................................... 65 xvi Table 4.5: Individual phoneme classification accuracies (%) for different feature extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold. ....................................... 
66
Table 4.6: Classification accuracies (%) in eight different reverberation test sets. The best performance for each condition is indicated in bold. The last column shows the average value, indicated as "Avg". ........................................................................ 68
Table 4.7: Classification accuracies (%) for signals distorted by nine different telephone channels. The best performance is indicated in bold. The last column shows the average value, indicated as "Avg". ........................................................................ 70
Table 5.1: Correlation measure in the acoustic and Radon domains for different phonemes ........................................................................ 73
Table 5.2: Phoneme classification accuracies (%) on the TIMIT core test set (24 speakers) and complete test set (168 speakers) in quiet condition for individual phones (denoted as Single) and broad classes (denoted as Broad). Here, RPS is the abbreviation of reconstructed phase space. ........................................................................ 74
Table 5.3: Phoneme classification accuracies (%) as a function of the number of Radon angles ........................................................................ 76
Table 5.4: Effects of SPL on classification accuracy (%). ........................................................................ 77
Table 5.5: Effect of window size on classification performance (%). ........................................................................ 78
Table 5.6: Effect of the number of CFs on classification performance (%). ........................................................................ 79
Table 5.7: Correlation measure for different phonemes and their corresponding noisy (SSN) phonemes (clean-10 dB and clean-0 dB) in different domains. The average correlation measure (Avg) of seven phonemes is indicated in bold (last row). ........................................................................ 82

LIST OF SYMBOLS AND ABBREVIATIONS

ABWE : Artificial bandwidth extension
AF : Articulatory features
AMR : Adaptive multi-rate
AN : Auditory nerve
ANN : Artificial neural networks
AP : Action potential
AR : Auto-regressive
ASR : Automatic speech recognition
BF : Best frequency
BM : Basilar membrane
CF : Characteristic frequency
C-SVC : C-support vector classification
DCT : Discrete cosine transform
DRT : Discrete Radon transform
DTW : Dynamic time warping
FDLP : Frequency domain linear prediction
FTC : Frequency tuning curve
GF : Gammatone feature
GFCC : Gammatone frequency cepstral coefficients
GMM : Gaussian mixture model
HMM : Hidden Markov models
HTIMIT : Handset TIMIT
IHC : Inner hair cells
LPC : Linear predictive coding
LVASR : Large vocabulary ASR
LVCSR : Large vocabulary continuous speech recognition
OHC : Outer hair cells
OVO : One versus one
OVR : One versus rest
PLP : Perceptual linear prediction
PSTH : Post-stimulus time histogram
RASTA : Relative spectra
RBF : Radial basis function
RIR : Room impulse response
SRM : Structural risk minimization
SVM : Support vector machines

CHAPTER 1: INTRODUCTION

Automatic speech recognition (ASR) has been extensively studied in the past several decades. Driven by both commercial and military interests, ASR technology has been developed and investigated on a great variety of tasks of increasing scale and difficulty. As speech recognition technology moves out of laboratories and is widely applied in more and more practical scenarios, many challenging technical problems emerge. Environmental noise is a major factor which contributes to the diversity of speech.
As speech services are provided on various devices, ranging from telephones and desktop computers to tablets and game consoles, speech signals also exhibit large variations caused by differences in channel characteristics. Speech signals captured in enclosed environments by distant microphones are usually subject to reverberation. Compared with background noise and channel distortions, reverberant noise is highly dynamic and strongly correlated with the original clean speech signal.

Speech recognition based on phonemes is very attractive, since it is inherently free from vocabulary limitations. The performance of Large Vocabulary ASR (LVASR) systems depends on the quality of the phone recognizer, which is why research teams continue developing phone recognizers in order to enhance their performance as much as possible. The classification of phonemes can be seen as one of the basic units in a speech recognition system. This thesis focuses on the development of a robust phoneme classification technique that works well in diverse conditions.

Figure 1.1: Overview of phoneme classification

1.1 Phoneme classification

Phoneme classification is the task of determining the phonetic identity of a (typically short) speech utterance based on features extracted from the speech. The individual sounds used to create speech are called phonemes. The task of this thesis is to create a learning machine that can classify phonemes (sequences of acoustic observations) both in quiet and under noisy environments. To explain how this learning machine works, we can consider a speech signal for an ordinary English sentence, labeled (a) in Fig. 1.1. The signal is split into elementary speech units [(b) in Fig. 1.1] and fed into the learning machine (c). The task of this learning machine is to classify each of these unknown utterances into one of 39 targets, representing the phonemes of the English language. The idea of classifying phonemes is widely used in both isolated and continuous speech recognition. The predictions produced by the learning machine can be passed on to a statistical model to find the most likely sequence of phonemes that constructs a meaningful sentence. Hence, accurate phoneme classification is important for a successful ASR system.

Figure 1.2: Difference between phoneme classification and phoneme recognition. The small box encloses the task of a phoneme classifier, and the big box encloses the task of a phoneme recogniser. After the phonemes have been classified in (b), the dynamic programming method (c) finds the most likely sequence of phonemes (d).

In this thesis, the chosen learning machine for the task is the support vector machine (SVM). The task of phoneme classification is typically done in the context of phoneme recognition. As the name implies, phoneme recognition consists of identifying the individual phonemes a sentence is composed of; a dynamic programming method is then required to transform the phoneme classifications into phoneme predictions. This difference is shown in Fig. 1.2. We have chosen not to perform complete phoneme recognition, because we would need to include a dynamic programming method. This may seem fairly straightforward, but doing it properly would involve introducing large areas of speech recognition, such as decoding techniques and the use of language models, which would detract from the focus of the thesis. Phone recognition has a wide range of applications: in addition to typical LVASR systems, it can be found in applications related to language recognition, keyword detection and speaker identification, as well as in music identification and translation (Lopes & Perdigao, 2011).
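To make the classification setting concrete, the following minimal sketch (not code from this thesis) trains a multi-class SVM to map one fixed-length feature vector per phoneme segment onto the 39 target classes. The random arrays are placeholders for real features; in the proposed system, each row of X would be the Radon-transform feature vector extracted from a phoneme's neurogram.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_classes = 500, 100, 60, 39

# Placeholder feature vectors and phoneme labels (0..38), one per segment.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, n_classes, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, n_classes, size=n_test)

# Multi-class SVM with an RBF kernel; scikit-learn handles the
# one-versus-one decomposition internally.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print("classification accuracy: %.1f%%" % (100 * acc))
```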
Figure 1.3: Architecture of an ASR system [adapted from (Wang, 2015)]

1.2 Automatic speech recognition

This section introduces speech recognition systems. The aim of an ASR system is to produce the most likely word sequence given an incoming speech signal. Figure 1.3 shows the architecture of an ASR system and its main components. In the first stage of speech recognition, input speech signals are processed by a front-end to provide a stream of acoustic feature vectors, or observations. These observations should be compact and carry sufficient information for recognition in the later stage. This process is usually known as front-end processing or feature extraction. In the second stage, the extracted observation sequence is fed into a decoder to recognise the most likely word sequence. Three main knowledge sources, namely the lexicon, the language model and the acoustic model, are used in this stage. The lexicon, also known as the dictionary, is usually used in large vocabulary continuous speech recognition (LVCSR) systems to map sub-word units to the words used in the language model. The language model represents prior knowledge about the syntactic and semantic structure of word sequences. The acoustic model represents the acoustic knowledge of how an observation sequence can be mapped to a sequence of sub-word units. In this thesis, we consider phone classification, which allows a good evaluation of the quality of the acoustic modeling, since it measures the performance of the recognizer without the use of any kind of grammar (Reynolds & Antoniou, 2003).
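Although the thesis does not write it out, the decoding stage sketched above is conventionally formulated as a maximum a posteriori search over word sequences W given the observation sequence O produced by the front end:

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \; p(O \mid W)\, P(W)

where the acoustic model supplies the likelihood p(O | W) (with the lexicon expanding W into sub-word units) and the language model supplies the prior P(W).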
1.3 Problem Statement

State-of-the-art algorithms for ASR systems suffer from poorer performance when compared to the ability of human listeners to detect, analyze, and segregate dynamic acoustic stimuli, especially in complex and noisy environments (Lippmann, 1997; G. A. Miller & Nicely, 1955; Sroka & Braida, 2005). The performance of ASR systems can be improved by using additional levels of language and context modeling, provided that the input sequence of elementary speech units is sufficiently accurate (Yousafzai, Ager, Cvetković, & Sollich, 2008). To achieve robust recognition of continuous speech, both sophisticated language-context modeling and accurate prediction of isolated phonemes are required. Indeed, most of the inherent robustness of human speech recognition occurs before and independently of context and language processing (G. A. Miller, Heise, & Lichten, 1951; G. A. Miller & Nicely, 1955). For phoneme recognition, the human auditory system's accuracy is already above chance level at a signal-to-noise ratio (SNR) of -18 dB (G. A. Miller & Nicely, 1955). Also, several studies have demonstrated the superior performance of human speech recognition compared to machine performance both in quiet and under noisy conditions (Allen, 1994; Meyer, Wächter, Brand, & Kollmeier, 2007), and thus the ultimate challenge for an ASR system is to achieve recognition performance that is close to that of the human auditory system. In this thesis, we consider front-end features for phoneme classification, because accurate classification of isolated phonetic units is very important for achieving robust recognition of continuous speech.

Most existing ASR systems use perceptual linear prediction (PLP), relative spectra (RASTA) or cepstral features, normally some variant of Mel-frequency cepstral coefficients (MFCCs), as their front-end. Due to the nonlinear processing involved in the feature extraction, even a moderate level of distortion may cause significant departures from feature distributions learned on clean data, making these distributions inadequate for recognition in the presence of environmental distortions such as additive noise (Yousafzai, Sollich, Cvetković, & Yu, 2011). Some attempts have been made to utilize Gammatone frequency cepstral coefficients (GFCC) in ASR (Shao, Srinivasan, & Wang, 2007), but the improvement was not significant. In past years, efforts have also been made to design robust ASR systems motivated by articulatory and auditory processing (Holmberg, Gelbart, & Hemmert, 2006; Jankowski Jr, Vo, & Lippmann, 1995; Jeon & Juang, 2007). However, these models did not include most of the nonlinearities observed at the level of the auditory periphery and thus were not physiologically accurate. As a result, the performance of ASR systems based on these features is far below human performance in adverse conditions (Lippmann, 1997; Sroka & Braida, 2005).

Until recently, the problem of recognizing reverberant signals and signals distorted by telephone channels has remained unsolved due to the nature of reverberant speech and the limited bandwidth of telephone channels (Nakatani, Kellermann, Naylor, Miyoshi, & Juang, 2010; Pulakka & Alku, 2011). Reverberation is a form of distortion quite distinct from both additive noise and spectral shaping: unlike additive noise, reverberation creates interference that is correlated with the speech signal. Most telephone systems in use today transmit only a narrow audio bandwidth limited to the traditional telephone band of 0.3–3.4 kHz, or to only a slightly wider bandwidth. Natural speech contains frequencies far beyond this range, and consequently the naturalness, quality, and intelligibility of telephone speech are degraded by the narrow audio bandwidth (Pulakka & Alku, 2011).

In order to solve the reverberation problem, a number of de-reverberation algorithms have been proposed based on cepstral filtering, inverse filtering, temporal envelope (ENV) filtering, excitation source information and spectral processing. The main limitation of these techniques is that the acoustic impulse responses or talker locations must be known or blindly estimated for successful de-reverberation, which is known to be a difficult task (Krishnamoorthy & Prasanna, 2009). These problems preclude their use in real-time applications (Nakatani, Yoshioka, Kinoshita, Miyoshi, & Juang, 2009).
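As a concrete illustration of two of the distortions discussed above, the sketch below mixes noise into a signal at a chosen SNR and band-limits a signal to the traditional 0.3–3.4 kHz telephone band. It is illustrative only, not the processing used in this thesis; the sampling rate and the random placeholders for the speech and noise waveforms are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture speech + noise has the requested SNR in dB."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def telephone_band(speech, fs=16000):
    """Approximate a narrow-band telephone channel with a 300-3400 Hz band-pass filter."""
    sos = butter(4, [300.0, 3400.0], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, speech)

fs = 16000
speech = np.random.default_rng(1).normal(size=fs)  # placeholder for a speech segment
noise = np.random.default_rng(2).normal(size=fs)   # placeholder for an additive noise recording
noisy = add_noise_at_snr(speech, noise, snr_db=5)  # e.g. 5 dB SNR test condition
narrowband = telephone_band(speech, fs)            # telephone-bandwidth version of the signal
```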
In past decades, many pitch determination algorithms have been proposed to improve the recognition accuracy of telephone speech. An important consideration for pitch determination algorithms is their performance on telephone speech, where the fundamental frequency is always weak or even missing, which makes pitch determination even more difficult (L. Chang, Xu, Tang, & Cui, 2012). A significant improvement in speech quality can be achieved by using wideband (WB) codecs for speech transmission; for example, the adaptive multi-rate (AMR)-WB speech codec transmits an audio bandwidth of 50–7000 Hz. However, wideband telephony is possible only if both the terminal devices and the network support AMR-WB.

Most of the above-mentioned methods provide good performance (but still less than the average performance of a human listener) for a specific condition, i.e., quiet environment, additive noise, reverberant speech, or channel distortion, whereas speech signals are usually affected by multiple acoustic factors simultaneously. This makes them difficult to use in real environments. In this study, we propose a method that can handle any type of noise and thus can be used in real environments.

Table 1.1: Human versus machine speech recognition performance (Halberstadt, 1998).
Corpus                        | Description                          | Vocabulary size | Machine error (%) | Human error (%)
TI Digits                     | Read digits                          | 10              | 0.72              | 0.009
Alphabet Letters              | Read alphabetic letters              | 26              | 5                 | 1.6
Resource Management           | Read sentences (word-pair grammar)   | 1,000           | 3.6               | 0.1
Resource Management           | Read sentences (null grammar)        | 1,000           | 17                | 2
Wall Street Journal           | Read sentences                       | 5,000           | 7.2               | 0.9
North American Business News  | Read sentences                       | Unlimited       | 6.6               | 0.4
Switchboard                   | Spontaneous telephone conversations  | 2,000-Unlimited | 43                | 4

1.4 Motivation

Two sources of motivation contributed to the conception of the ideas for the experiments in this thesis.

1.4.1 Comparisons between Humans and Machines

Different studies have compared the performance of humans and ASR systems (Lippmann, 1997; G. A. Miller & Nicely, 1955; Sroka & Braida, 2005). Table 1.1 shows machine and human performance for different corpora. Kingsbury et al. showed the difference between humans and machines for reverberant speech (Kingsbury & Morgan, 1997). Machine recognition tests were run using a hybrid hidden Markov model/multilayer perceptron (HMM/MLP) recognizer. Four front ends were tested: PLP, log-RASTA-PLP, J-RASTA-PLP, and an experimental RASTA-like front end called the modulation spectrogram (Mod. Spec.). Their results are shown in Table 1.2.

Table 1.2: Human and machine recognition results. All percentages are word error rates. The best result in each condition is indicated in bold.
Feature set | Clean error (%) | Reverberant error (%)
PLP         | 17.8            | 71.5
Log-RASTA   | 16.4            | 74.4
J-RASTA     | 16.9            | 78.9
Mod. Spec.  | 31.7            | 66.0
Humans      | -               | 6.1

The aforementioned results imply that human performance is significantly better than that of machines: machines cannot efficiently extract the low-level phonetic information from the speech signal. These results motivated us to work on front-end features for accurate recognition of phonemes and the closely related problem of classification.

1.4.2 Neural-response-based feature

The accuracy of human speech recognition motivates the application of information processing strategies found in the human auditory system to ASR (Hermansky, 1998; Kollmeier, 2003). In general, the auditory system is nonlinear. Incorporating nonlinear properties of the auditory system in the design of a phoneme classifier might therefore improve the performance of the recognition system.
The current study proposes an approach to classify phonemes based on simulated neural responses from a physiologically-accurate model of the auditory system (Zilany, Bruce, Nelson, & Carney, 2009). This approach is expected to improve the robustness of the phoneme classification system. It was motivated by the fact that neural responses are robust against noise due to the phase-locking property of the neuron, i.e., the neurons fire preferentially at a certain phase of the input stimulus (M. I. Miller, Barta, & Sachs, 1987), even when noise is added to the acoustic signal. In addition, the auditory-nerve (AN) model also captures most of the nonlinear properties observed at the peripheral level of the auditory system, such as nonlinear tuning, compression, two-tone suppression, and adaptation in the inner-hair-cell-AN synapse, as well as some other nonlinearities observed only at high sound pressure levels (SPLs) (M. I. Miller et al., 1987; Robles & Ruggero, 2001; Zilany & Bruce, 2006).

1.5 Objectives of this study

The robustness of a recognition system is heavily influenced by its ability to handle the presence of background noise and to cope with distortions due to convolution and transmission channels. State-of-the-art algorithms for ASR systems exhibit good performance in quiet environments but suffer from poorer performance when compared to the ability of human listeners in noisy environments (Lippmann, 1997; G. A. Miller & Nicely, 1955; Sroka & Braida, 2005). Our ultimate goal is to develop an ASR system that can handle distortions caused by various acoustic factors, including speaker differences, channel distortions, and environmental noise. Developing methods for phoneme recognition and the closely related problem of classification is a major step towards achieving this goal. Hence, this thesis only considers the prediction of phonemes, since the classification of phonemes can be seen as one of the basic units in a speech recognition system. The specific objectives of the study are to:
The best sequence of phones found by the Viterbi path is compared to the reference (the TIMIT manual labels for the same utterance) using a dynamic programming algorithm, usually the Levenshtein distance, which takes into account phone hits, substitutions, deletions and insertions. This thesis will focus on the phone classification task only. A number of benchmarking databases have been constructed in recognition purpose, for example, the DARPA resource management (RM) database, TIDIGITS - connected digits, Alpha- Numeric (AN) 4-100 words vocabulary, Texas Instrument and 11 Massachusetts Institute of Technology (TIMIT), WSJ5K - 5,000 words vocabulary. In this thesis, most widely used TIMIT database has been used that includes the dynamic behavior of the speech signal, source of variability, e.g., intra-speaker (same speaker), inter-speaker (cross speaker), and linguistic (speaking style). TIMIT is totally and manually annotated at the phone level. Research in the speech recognition area has been underway for a couple of decades, and a great deal of progress has been made in reducing the error on speech recognition. Most of the approach shows high speech recognition accuracy under controlled conditions (for example, in quiet environment or under specific noise), but human auditory system can recognize speech without prior information about noise types. It is very difficult to develop a feature that can handle all types of noise. Our proposed feature can be used in quiet environment and under background noise, room reverberation and channel variations. Most of the existing phoneme recognition systems are based on the features extracted from the acoustic signal in the time and/or frequency domain. PLP, RASTA, MFCCs (Hermansky, 1990; Hermansky & Morgan, 1994; Zheng, Zhang, & Song, 2001) are some examples of the preferred traditional features for the ASR systems. Some attempts have also been made to utilize GFCC in ASR, for instance (Qi, Wang, Jiang, & Liu, 2013; Schluter, Bezrukov, Wagner, & Ney, 2007; Shao, Jin, Wang, & Srinivasan, 2009). Recently, Ganapathy et al. (Ganapathy, Thomas, & Hermansky, 2010) proposed a feature extraction technique for phoneme recognition based on deriving modulation frequency components from the speech signal. This FDLP feature is an efficient technique for robust ASR. In this study, the classification results of the proposed neuralresponse-based method were compared to the performances of the traditional acoustic- 12 property-based speech recognition methods using features such as MFCCs, GFCCs and FDLPs. 1.7 Organization of Thesis This thesis proposes a new technique for phoneme classification. Chapter two provides background information for the experimental work. We explain the anatomy and physiology of the peripheral auditory system, model of the AN, the SVM classifier employed in this thesis, Radon transform and some existing metrics. In Chapter three, the procedure of the proposed method has been discussed. Chapter four presents the experiments evaluating the various feature extraction techniques in the task of TIMIT phonetic classification. We show the results for both in quiet and under noisy conditions for different types of phonemes extracted from the TIMIT database. Chapter five explains the reason behind the robustness of proposed feature. It also shows the effects of different parameters on classification accuracy. Finally, the general conclusion and some future direction are provided in Chapter six. 
1.7 Organization of Thesis

This thesis proposes a new technique for phoneme classification. Chapter two provides background information for the experimental work: we explain the anatomy and physiology of the peripheral auditory system, the model of the AN, the SVM classifier employed in this thesis, the Radon transform and some existing metrics. Chapter three discusses the procedure of the proposed method. Chapter four presents experiments evaluating the various feature extraction techniques on the task of TIMIT phonetic classification; we show results both in quiet and under noisy conditions for different types of phonemes extracted from the TIMIT database. Chapter five explains the reasons behind the robustness of the proposed feature and shows the effects of different parameters on classification accuracy. Finally, general conclusions and some future directions are provided in Chapter six.

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction

This chapter reviews related work on phoneme classification. After that, brief descriptions of some acoustic-property-based feature extraction techniques and of the structure of the auditory system are given. This chapter also describes the SVM classifier, the model of the AN and the Radon transform technique used in this study.

2.2 Research background

After several years of intense research by a large number of research teams, error rates in speech recognition have improved considerably, but human performance is still significantly better than that of machines. In 1994, Robinson (Robinson, 1994) reported a phoneme classification accuracy of 70.4% using a recurrent neural network (RNN). In 1996, Chen and Jamieson reported a phoneme classification accuracy of 74.2% using the same type of classifier (Chen & Jamieson, 1996). Rathinavelu et al. used hidden Markov models (HMMs) for speech classification and obtained an accuracy of 68.60% (Chengalvarayan & Deng, 1997); this approach becomes problematic when dealing with the variabilities of natural conversational speech. In 2003, a broad classification accuracy of 84.1% using an MLP was reported by Reynolds et al. (T. J. Reynolds & Antoniou, 2003). Clarkson et al. created a multi-class SVM system to classify phonemes; their reported result of 77.6% is extremely encouraging and shows the potential of SVMs in speech recognition. They also used a Gaussian mixture model (GMM) for classification and obtained an accuracy of 73.7%. Similarly, Johnson et al. showed results for an MFCC-based feature using HMMs (Johnson et al., 2005), reporting a single-phone accuracy of 54.86% for the complete test set; they also reported an accuracy of 35.06% for their proposed feature.

A hybrid SVM/HMM system was developed by Ganapathiraju et al. (Ganapathiraju, Hamaker, & Picone, 2000). This approach has provided improvements in recognition performance over HMM baselines on both small- and large-vocabulary recognition tasks, even though the SVM classifiers were constructed solely from cepstral representations (Ganapathiraju, Hamaker, & Picone, 2004; Krüger, Schafföner, Katz, Andelic, & Wendemuth, 2005). The above papers all show the potential of using SVMs in speech recognition and motivated us to use an SVM as the classifier. Phoneme classification has also been studied by a large number of researchers (Clarkson & Moreno, 1999; Halberstadt & Glass, 1997; Layton & Gales, 2006; McCourt, Harte, & Vaseghi, 2000; Rifkin, Schutte, Saad, Bouvrie, & Glass, 2007; Sha & Saul, 2006) for the purpose of testing different methods and representations.

In recent years, many articulatory- and auditory-based processing methods have also been proposed to address the problem of phonetic variations in a number of frame-based, segment-based, and acoustic-landmark systems. For example, articulatory features (AFs) derived from phonological rules have outperformed the acoustic HMM baseline in a series of phonetic-level recognition tasks (Kirchhoff, Fink, & Sagerer, 2002). Similarly, experimental studies of the mammalian peripheral and central auditory organs have also introduced many perceptual processing methods. For example, several auditory models have been constructed to simulate human hearing, e.g., the ensemble interval histogram, the lateral inhibitory network, and Meddis' inner hair-cell model (Jankowski Jr et al., 1995; Jeon & Juang, 2007). Holmberg et al.
incorporated synaptic adaptation into their feature extraction method and found that the performance of the system improved substantially (Holmberg et al., 2006). Similarly, Strope and Alwan (1997) used a model of temporal masking (Strope & Alwan, 1997), and Perdigao et al. (1998) employed a physiologically-based inner ear model for developing a robust ASR system (Perdigao & Sá, 1998).

2.3 Existing metrics

The performance of the proposed method was evaluated on the complete test set of the TIMIT database and compared to the results from three standard acoustic-property-based methods. In this section, we describe these features: MFCC, GFCC and FDLP.

2.3.1 Mel-frequency Cepstral Coefficient (MFCC)

MFCCs (Davis & Mermelstein, 1980) are widely used features in ASR systems. Figure 2.1 illustrates the derivation of the MFCC feature from a typical acoustic waveform. Initially, an FFT is applied to each frame to obtain its spectral representation; normally a 512-point FFT is applied and 256 spectral points are retained, without considering phase information. To obtain a more compact representation, only around 30 smoothed spectral values per frame are kept, and the frequency axis is warped logarithmically to the Mel (or Bark) scale, which makes the representation perceptually more meaningful. In the MFCC derivation, a triangular filter-bank whose filters are spaced uniformly on the Mel scale is used. The final step is to convert the log filter-bank outputs to cepstral coefficients using the discrete cosine transform (DCT).

Figure 2.1: Illustration of the block diagram for MFCC derivation.

2.3.2 Gammatone Frequency Cepstral Coefficient (GFCC)

GFCC is an acoustic cepstral feature derived from the Gammatone feature (GF) using a Gammatone filter-bank. To extract GFCC, the audio signal is first decomposed using a 128-channel Gammatone filter-bank whose center frequencies are quasi-logarithmically spaced from 50 Hz to 8 kHz (or half of the sampling frequency of the input signal), which models human cochlear filtering. Using a 10-ms frame rate, the filter-bank outputs are down-sampled to 100 Hz in the time dimension. A cubic-root operation is used to compress the magnitudes of the down-sampled filter-bank outputs. A matrix of "cepstral" coefficients is obtained by multiplying the DCT matrix with each GF vector. Of the 128 channels, the first 23 or 30 GFCCs are normally used for speaker identification or speech recognition, since they retain most of the information in the GF owing to the energy-compaction property of the DCT (Shao, Srinivasan, & Wang, 2007). The size of the GFCC feature is therefore m x 23 or m x 30, where m is the number of frames. Figure 2.2 illustrates the GFCC derivation procedure.

Figure 2.2: Illustration of the methodology to extract the GFCC feature.
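A minimal NumPy/SciPy sketch of the MFCC pipeline in Fig. 2.1 is given below; the frame length, FFT size, number of filters and number of cepstral coefficients are illustrative assumptions rather than the exact settings used in this thesis. The GFCC pipeline of Fig. 2.2 has the same overall shape, with the Mel filter-bank replaced by a Gammatone filter-bank and the logarithm replaced by cubic-root compression.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with center frequencies spaced uniformly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)   # rising slope
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)  # falling slope
    return fb

def mfcc(frame, sr, n_fft=512, n_filters=30, n_ceps=13):
    """One windowed frame -> FFT -> Mel filter-bank -> log -> DCT (Fig. 2.1)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    mel_energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    return dct(np.log(mel_energies + 1e-10), type=2, norm="ortho")[:n_ceps]

# Example: one 25-ms frame (400 samples at 16 kHz) of a placeholder signal.
sr = 16000
frame = np.random.default_rng(0).normal(size=400)
print(mfcc(frame, sr).shape)  # (13,)
```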
Then, linear prediction is performed on the DCT representation to obtain a parametric model of the temporal envelope. The block schematic for extraction of sub-band temporal envelopes from speech signal is shown in Fig. 2.3. 2.4 Structure and Function of the Auditory System Peripheral auditory pathway connects our sensory organs (ear) with the auditory parts of the nervous system to interpret the received information. The peripheral auditory system converts the pressure waveform into a mechanical vibration and finally a change in the membrane potential of hair cells. This transform of energy from mechanical motion to electrical energy is called transduction. The change in the membrane potential triggers the action potentials (AP’s) from AN fibers. The AP used to conveys information in the nervous system. Anatomically, human auditory system consists of peripheral auditory system and central auditory nervous system. Periphery auditory system can be further subdivided into three major parts: outer, middle and inner ear. The human auditory system is shown in Fig. 2.4. 18 Figure 2.4: Illustration of the structure of the auditory system showing outer, middle and inner ear (Reproduced from Encyclopaedia Britannica, Inc. 1997) 2.4.1 Outer Ear The outer ear composed of two components: one is the most visible portion of the ear called pinna and another is the long and narrow canal that joined to ear drum called the ear canal. The word ear may be used to refer the pinna. Ear drum also known as tympanic membrane that act as boundary between the outer ear and (ME). The sounds arrive at the outer ear and passes through canal into the ear drum. Ear drum vibrates when the sound wave hits the ear drum. The primary functions of the outer ear are to protect the middle and inner ears from external bodies and amplify the sounds with high-frequency. It also provides the primary cue to determine the source of sound. 2.4.2 Middle Ear (ME) The ME is separated from external ear through the tympanic membrane. It creates a link between outer ear and fluid-filled inner air. This link is made through three 19 bones: malleus (hammer), incus (anvil) and stapes (stirrup), which are known as ossicles. The malleus is connected with stapes through incus. The stapes is the smallest bone that is connected with the oval window. The three bones are connected in such a way that movement of the tympanic membrane causes movement of the stapes. The ossicles can amplify the sound waves by almost thirty times. 2.4.3 Inner Ear The structure of the inner ear is very complex that located within dense potion of the skull. The inner ear, also known as labyrinth due to its complexity, can be divided into two major sections: a bony outer casing and osseous labyrinth. The osseous labyrinth consists of semicircular canals, the vestibule, and the cochlea. The coil, snail shaped cochlea is the most important part of the inner ear that contain sensory organs for hearing. The largest and smallest turn of the inner ear is known as basal turn and apical turn respectively. Oval window and round window also part of the cochlea. The winding channel can be divided into three sections: scala media, scala tympani and scala vestibuli, where each section filled with fluid. Scala media move pressure wave in response to the vibration as a consequence of oval window and ossicles. Basilar membrane (BM) is a stiff structural element separate that seperate the scala media and the scala tympani. 
Based on stimulus frequency the displacement pattern of BM also changed. Cochlea acts like a frequency analyzer of sound wave. 2.4.4 Basilar Membrane Responses The motion of the BM normally defined as travelling wave. Along the length of BM the parameter at a given point determine the character frequency (CF). CF is the frequency at which BM is most sensitive to sound vibrations. At the apex the BM is widest (0.08– 0.16 mm) and narrowest (0.08–0.16 mm) part is located at the base. Through ear drum 20 Figure 2.5: Motions of BM at different frequencies (Reproduced from Encyclopaedia Britannica, Inc. 1997) and oval window a sound pressure waveform pass to the cochlea and creates pressure wave along the BM. There are different CF along the position of BM. The base of the cochlea is more tuned to higher frequency and it decrease along apex. When sound pressure passes through BM both low and high frequency regions are excited and cause an overlap of frequency detection. Based on the low frequency tone (below 5 kHz), the resulting nerve spike (AP) are synchronized through phase-locking process. The AN model successfully captured this strategy. At different CF the motion of the BM is shown in Fig. 2.5. 2.4.5 Auditory Nerve AN fibers generate electrical potentials that do not vary with amplitude when activated. If the nerve fibers fire, they always reach 100% amplitude. The APs are very short-lived events, normally take 1 to 2 ms to rise in maximum amplitude and then return to resting state. Due to this behavior they are normally called as spikes. These spikes are used to decode the information in the auditory portions of the central nervous system. 21 2.5 Brief history of Auditory Nerve (AN) Modeling AN modeling is an effective tool to understand the mechanical and physiological processes in the human auditory periphery. To develop a computational model several efforts (Bruce, Sachs, & Young, 2003; Tan & Carney, 2003; Zhang, Heinz, Bruce, & Carney, 2001) had been made that integrate data and theories from a wide range of research in the cochlea. The main focus of these modeling efforts was the nonlinearities in the cochlea such as compression, two-tone suppression and shift in the best frequencies etc. The processing of simple and complex sounds in the peripheral auditory system can also be studying through these models. These models can also be used as front ends in many research areas such as speech recognition in noisy environment (Ghitza, 1988; Tchorz & Kollmeier, 1999) , computational modeling of auditory analysis (Brown & Cooke, 1994) , modeling of neural circuits in the auditory brain-stem (Hewitt & Meddis, 1993),design of speech processors for cochlear implants (Wilson et al., 2005) and design of hearing-aid amplification schemes (Bondy, Becker, Bruce, Trainor, & Haykin, 2004; Bruce, 2004). In 1960, Flanagan and his colleague developed a computational model that can emulate the responses of the mechanical displacement of the BM filter for a known stimulus. The model considered the cochlea as a linear and passive filter and took account the properties of the cochlea responses by using simple stimulus like clicks, single tones, or pair of tones (R. L. Miller, Schilling, Franck, & Young, 1997; Sellick, Patuzzi, & Johnstone, 1982; Wong, Miller, Calhoun, Sachs, & Young, 1998) However, in 1987, Deng and Geisler reported significant nonlinearities in the responses of AN fibers due to the speech sound which was attributed as “synchrony capture”. 
They described the main property of discovered nonlinearities as “synchrony capture” which means that the responses produced by one formant in the speech syllable is more synchronous to itself than what linear methods predicted from the fibers threshold frequency tuning curve (FTC). These models are relatively simple but did not 22 Figure 2.6: Model of one local peripheral section. It includes outer/ME, BM, and IHC– AN synapse models. (Robert & Eriksson, 1999) consider the nonlinearities of BM (Narayan, Temchin, Recio, & Ruggero, 1998). Their effort led to proposing a composite model which incorporated either a linear BM stage or a nonlinear one (Deng & Geisler, 1987). In 1999, a model was proposed by Robert and Eriksson which could able to produce different effects of human auditory system seen in the electrophysiological recordings related to tones, two-tones, and tone-noise combinations. The model was able to generate significant nonlinear behaviors such as compression, two tone suppression, and the shift in rate-intensity functions when noise is added to the signal (Robert & Eriksson, 1999). The model is shown in Fig. 2.6. The model introduced by Zhang et al. addressed some important phenomena of human AN such as temporal response properties of AN fibers and the asymmetry in suppression growth above and below characteristics frequency. The model focused more on nonlinear tuning properties like the compressive 23 Figure 2.7: The model of the auditory peripheral system developed by Bruce et al. (Bruce et al., 2003), modified from Zhang et al. (Zhang et al., 2001) changes in gain and bandwidth as a function of stimulus level, the associated changes in the phase-locked responses, and two-tone suppression (Zhang et al., 2001). Bruce et al. expanded the foresaid model of the auditory periphery to assess the effects of acoustic trauma on AN responses. The model incorporated the responses of the outer hair cells (OHCs), inner hair cells (IHCs) to increase the accuracy in predicting responses to speech sounds. Their study was limited to low and moderate level responses (Bruce et al., 2003). The schematic diagram of their model is shown in Fig. 2.7. 24 Figure 2.8: Schematic diagram of the auditory-periphery. The model consists of ME filter, a feed-forward control path, two signal path such as C1 and C2, the inner hair cell (IHC), outer hair cell (OHC) followed by the synapse model with spike generator. (Zilany & Bruce, 2006). 2.5.1 Description of the computational model of AN In this study, we used AN model developed by Zilany et al. that will be described in this section (Zilany & Bruce, 2006). In 2006, the model of the auditory periphery developed by Bruce et al.(2003) was improved to simulate the more realistic responses of the AN fibers in response to simple and complex stimuli for a range of characteristics frequencies (Bruce et al., 2003). Figure 2.8 shows the model of the auditory periphery developed by Zilany et al. (Zilany & Bruce, 2006). The model introduced two modes of BM that includes the inner and OHC resembling the physiological BM function in two filter components C1 and C2. The C1 filter was designed to address low and intermediate-level responses where C2 was introduced as a mode of excitation to the IHC and produce high-level effects of the cochlea. Meanwhile, C2 corresponds to IHC which filters high level response and then followed by C2 transduction function to produce high-level effects and transition region. 
This feature in the Zilany-Bruce model 25 causes it to be more effective on a wider dynamic range of characteristics frequency of the BM compared to previous AN models (Zilany & Bruce, 2006, 2007; Zilany et al., 2009). 2.5.1.1 C1 Filter: The C1 filter is a type of linear chirping filter that is used to produce the frequency glides and BF shifts observed in the auditory periphery. The C1 filter was designed to address low and intermediate-level responses of the cochlea. The output of the C1 filter closely resembles the primary mode of vibration of the BM and act as the excitatory input of the IHC. The filter order has a great impact on the sharpness of the tuning curve. The filter remains sharply tuned if the filter order is too high even for high-SPL or in the situation of OHC impairment. To enable the FTC of the filter more realistic in both normal and impairment conditions, the filter order was used 10 instead of 20 (Tan & Carney, 2003). 2.5.1.2 Feed forward control path (including OHC): The feed forward control path regulates the gain and bandwidth of the of the BM filter to reflect the several level-dependent properties of the cochlea by using the output of the C1 filter. The nonlinearity of the AN model that represents an active cochlea is modelled in this control path. There are three main stages involves in this path, which are: i) Gammatone filter: The control-path filter is a time varying filter that is the product of a gamma distribution and sinusoidal tone with a central frequency and bandwidth broader than these of C1 filter. The broader bandwidth of the control-path filter produces two-tone rate suppression in the model output. 26 ii) Boltzmann function: This symmetric non-linear function was introduced to control dynamic range and time course compression in the model. iii) A nonlinear function that converts the low pass filter output to a time-varying time constant for the C1 filter. Any impairment of the OHC is controlled by the COHC and the output is used to control the nonlinearity of the cochlea as well. On the other hand, the nonlinearity of cochlea is controlled inside the feed forward control path based on different type of stimulus of the SPLs. At low stimulus SPLs, the control path response with maximum output. Thus the tuning curve becomes sharp with maximum gain and the filter behaves linearly. In response to the moderate levels of stimulus, the control signal deviates substantially from the maximum and varies dynamically from maximum to minimum. The tuning curve of the corresponding filter becomes broader and the gain of the filter is reduced. At very high stimulus SPLs the control path stimulus saturates with minimum output level and the filter is again becomes effectively linear with broad tuning and low gain. 2.5.1.3 C2 Filter: The C2 filter, parallel to C1 the filter is a wideband filter with its broadest possible tuning (i.e. at 40 kHz). The filter has been introduced based on the Kiang‟s two-factor cancellation hypothesis, in which the level of stimuli will affect the C2‟s transduction function followed after C2 filter‟s output. The hypothesis states that „the interaction between the two paths produces effects such as the C1/C2 transition and peak splitting in the period histogram (Zilany & Bruce, 2006). To be consistent with the behavioral studies (Wong et al., 1998), C2 filter has been chosen to be identical to the broadest possible C1 filter. 
Thus, the C2 filter has been implemented by replacing the poles and 27 zeroes in the complex plane at a position for the C1 filter with complete OHCs impairment. According to the Liberman, (Liberman & Dodds, 1984) C2 responses remain unchanged where C1 responses are significantly attenuated in acoustically traumatized cats. Also, C1 responses can be suppressed using crossed olivocochlear bundle stimulus, where C2 responses become unaltered (Gifford & Guinan Jr, 1983). To include this phenomenon in the model, C2 filter is made as linear and static. The transduction function gives the output based on SPLs that affect the C1/C2 interactions. a) At low SPLs, its output is significantly lower than the output of the corresponding C1 response. b) At high SPLs, the output dominates and the C1 and C2 outputs are out of phase. c) At medium SPLs, C1 and C2 outputs are approximately equal and tend to cancel each other. Furthermore, the C2 response is not subject to rectification, unlike the C1 response (at high levels) such that the peak splitting phenomenon also results from the C1/C2 interaction. Poor frequency selectivity of AN fiber is caused by too many frequency components consists in a speech stimuli. This is overcome by increasing the order of C2 up to 10th order, which compensate the order of C1 filter. 2.5.1.4 The Inner hair cell (IHC): The IHC is modelled by a low pass filter that functions to convert the mechanical energy produced by the BM to electrical energy that stimulates the neurotransmitter to be released in the IHC-AN synapse. Two types of IHCs, tallest and shorter IHC sterocilia generate the C1 and C2 responses, respectively and were controlled by both C1 and C2 transduction functions. C1 transduction function uses the output of the C1 filter and is related to high-CF model fibers to produce the direct current (DC) 28 components of the electrical output. Meanwhile, the C2 transduction function uses the C2 filter output that is first transformed to increase towards 90-100 SPL at low and moderate-CF level. Finally, the C1 and C2 transduction function outputs, Vihc,C1 and Vihc,C2 are summed and resulted to the overall potential of Vihc output after passing through the IHC low pass filter. The output of the AN model simulates multidimensional pulse signals from each channel that is obtained by means of its statistical characteristics of the pulse signals called the peristimulus time histogram (PSTH). In 2009, the model was further improved by introducing power law adaptation with the previous model (Zilany et al., 2009). Figure 2.9 shows the AN model developed by Zilany and colleagues in 2006 but with the additional rate-adaptation model which is the Figure 2.9: (A) Schematic diagram of the model of the auditory periphery (B) IHC-AN synapse model: exponential adaptation followed by parallel PLA models (slow and fast). IHC-AN Power- Law Synapse Model (PLA) (Zilany et al., 2009). In the model, powerlaw adaptation (PLA) is used with the exponential adapting components with rapid and short-term constants. 29 The two parallel paths in PLA provide slowly and rapidly adapting responses where exponentially adapting components are responsible for shaping onset responses. These responses further made the AN output to improve the AN response after stimuli offset, in which the person could still hear a persistent or lingering effect after the stimuli has past and also to adapt to a stimuli with increasing or decreasing amplitude. 
The adapting PLA in the synapse model significantly increases the synchronization of the output to pure tones, and therefore, the adapted cut-off frequency is matched with the maximum synchronized output of the AN fiber for pure tones as a frequency function. The PLA model simulates repetitive of the stimulus output of the synapse into a single IHC output instead of only simulating a single stimulus of the synapse by the previous model. Because of the discharge generator has quite a relatively long lifetime emission dynamics and can be extended from one stimulus to the next, a series of the same output synapses were formed through a combination of repetitive stimulus and silences between each stimuli. Moreover, the model synaptic PLA also has memory that exceeds the repetition duration of a single stimulus (Zilany et al., 2009). The AN model introduced by Zilany et al. is capable to simulate realistic responses of normal and impaired AN fiber in cats across a wide range of characteristics frequencies. The model successfully capture the most of non-linearities observed at the level of the AN. The model simulate the response of AN fibers to account for high level AN responses. This was accomplished by suggesting that inner hair cell should be subjected to two modes of BM excitation, instead of only one mode. Two parallel filters named C1 (component 1) and C2 (component 2) generate these two modes. Each of these two excitation modes has their own transduction function in a way that the C1/C2 interaction occurs within the inner hair cell. The transduction function was chosen in such a way that at low and moderate (SPLs), C1 filter output dominate the overall response from the IHC output, whereas the high level responses were dominated by C2 30 responses. This property of Zilany-Bruce model makes it more effective on wider dynamic range of SPLs compared to those of previous AN models (Zilany & Bruce, 2006). 2.5.2 Envelope (ENV) and Temporal Fine Structure (TFS) neurogram Based on temporal resolution (bin width of 10 μs and 100 μs, which will subsequently be referred to as TFS and ENV, respectively), two types of neurogram were constructed from the output of the model for the auditory periphery. In this study, we used only ENV neurogram. 2.6 Support Vector Machines In this thesis, the Support Vector Machine (SVMs) formulation for phoneme classification was used. In classification, SVMs are binary classifiers. They are used to build a decision boundary by mapping data from the original input space to a higher dimensional feature space, where the data can be separated using a linear hyper plane. Instead of choosing a hyperplane that only minimizes the training error, SVMs choose the hyperplane that maximizes the margin of separability between the two sets of data points. This selection can be viewed as an implementation of the Structural Risk Minimization (SRM) principle, which seeks to minimize the upper bound on the generalization error (Vapnik, 2013). Typically, an increase in generalization error is expected when constructing a decision boundary in higher-dimensional space. 31 Figure 2.10 (a) a separating hyperplane. (b) The hyperplane that maximizes the margin of separability [adapted from (Salomon,2001) ] 2.6.1 Linear Classifiers Consider the binary classification problem of an arrangement of data points as shown in Fig. 2.10 (a). We denote the “square” samples with targets yi = +1 as positive examples, belonging to the set S+. 
Similarly, we define the "round" samples with targets $y_i = -1$ as negative examples, belonging to the set $S^-$. One mapping that can separate $S^+$ and $S^-$ is

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) \qquad (2.1)$$

where $\mathbf{w}$ is a weight vector and $b$ is the offset from the origin. Given such a mapping, the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines the decision boundary between $S^+$ and $S^-$. The two data sets are said to be linearly separable by the hyperplane if a pair $\{\mathbf{w}, b\}$ can be chosen such that the mapping in Eq. (2.1) is perfect. This is the case in Fig. 2.10(a), where the "round" and "square" samples are clearly separable. Among all separating hyperplanes, the SVM classifier finds the one that maximizes the margin between the two sets, as shown in Fig. 2.10(b).

Classification. An instance $\mathbf{x}$ is classified by determining on which side of the decision boundary it falls. To do this, we compute

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b\right) \qquad (2.2)$$

where $\alpha_i$ is the Lagrange multiplier corresponding to the $i$-th training sample $\mathbf{x}_i$, and assign $\mathbf{x}$ to one of the target labels +1 or -1, representing the positive and negative examples. The parameters $\alpha$ and $b$ are optimized during training.

2.6.2 Non-linear Classifiers

The previous section described how the linear SVM handles misclassified examples. Another extension is needed before SVMs can effectively handle real-world data: the modeling of non-linear decision surfaces. A method to handle non-linear decision surfaces with SVMs was proposed in (Systems & Weiss, 2006); the idea is to map the input data into some higher-dimensional space in which the data become linearly separable.

2.6.3 Kernels

To handle non-linear decision boundaries, kernel functions are used that implicitly map data points into a higher-dimensional space in which the decision surfaces are again linear. Two commonly used kernels are the radial basis function (RBF) kernel

$$K_r(\mathbf{x}_i, \mathbf{x}) = e^{-\tau \lVert \mathbf{x}_i - \mathbf{x} \rVert^{2}}$$

and the polynomial kernel

$$K_p(\mathbf{x}_i, \mathbf{x}) = \left(1 + (\mathbf{x}_i \cdot \mathbf{x})\right)^{\Theta},$$

where $\Theta$ is the integer polynomial order and $\tau$ is the width factor. These parameters are tuned to the particular classification problem. In the clean condition, the classification performance of the standard polynomial and RBF kernels is almost identical; hence, the RBF kernel was used in all experiments.

2.6.4 Multi-class SVMs

Numerous schemes have been proposed in the literature to solve the multi-class problem using SVMs (Friedman, 1996; Mayoraz & Alpaydin, 1999). The SVM is fundamentally a binary classifier that compares one set of samples against another. Two classification modes are commonly used to extend it to multiple classes: one-versus-one (OVO) and one-versus-rest (OVR). The advantage of the OVO scheme is that it takes less time, because the individual problems to be solved are smaller. In the OVR scheme, each test instance is compared against the target class and all remaining classes at the same time. In practice, heuristic methods such as the OVO and OVR approaches are used more often than other multi-class SVM implementations. Several software packages are available that efficiently solve the binary SVM problem (Cortes & Vapnik, 1995).

2.7 Radon Transform

A phoneme classification technique was proposed in this study based on the application of the Radon transform to simulated neural responses from a model of the auditory periphery. The Radon transform computes projections of an image matrix along specified directions.

2.7.1 Theoretical foundation

In 1917, Radon, an Austrian mathematician, published an analytic solution to the problem of reconstructing an object from multiple projections.
William Oldendorf first reported the application of mathematical image-reconstruction techniques for radiographic medical imaging in 1961, and Nobel laureate Godfrey N. Hounsfield developed the first clinical computerized tomography around a decade later. Radon posited that it was possible to recover a function on the plane from its integrals over all the lines in the plane, and that functions could have real or complex values. Therefore, it 34 would be feasible to use rotational scanning to obtain projections of a 2D object, which could then be used to reconstruct an image. Radon’s theorem states, “The value of a 2D function at a random point is uniquely obtained by the integrals along the lines of all the directions passing that point.” (Rajput, Som, & Kar, 2016) 2.7.2 How Radon transform works Detecting the entire image is a challenging task. Radon transform avoids the need for global image detection by reducing recognition to the problem of detecting peak image parameters, which enables the definition of prominent features. After establishing thresholds for feature extraction that can be plugged into the Radon transform, it is possible to extract the prominent features from the preprocessed image, effectively recovering global image parameters. Several existing algorithms, such as edge-detection filters, are suitable for estimating image parameters after which linear regression can be applied to connect the individual pixels. However, these algorithms are less suitable when the image has intersecting lines, the noise level is high, or filters are difficult to stabilize. Radon transform can overcome these problems. Formula for Radon transform is present in section 3.4. A projection of a two-dimensional function f(x,y) is a set of line integrals. The radon function computes the line integrals from multiple sources along parallel paths, or beams, in a certain direction. The beams are spaced 1 pixel unit apart. To represent an image, the radon function takes multiple, parallel-beam projections of the image from different angles by rotating the source around the center of the image. By exploiting the move-out or curvature of signal of interest, Leastsquares and High-resolution Radon transform methods can effectively eliminate random or correlated noise, enhance signal clarity (Gu & Sacchi, 2009). 35 2.7.3 Current applications In more modern terms, 2D Radon transform represents an image as a line integral, which is the sum of the image’s pixel intensities, and shows the relationship between the 2D object and its projections. Radon transform is a fundamental tool in a wide range of disciplines, including radar, geophysical, and medical imaging. It also can be used in science and industry to evaluate the properties of a material, component, or system without causing damage. (Rajput et al., 2016). 36 CHAPTER 3: METHODOLOGY 3.1 System overview This chapter describes the procedure of the proposed neural-response-based phoneme classification system. The block diagram of the proposed method is shown in Fig. 3.1. The 2-D neurograms were constructed from the simulated responses of the auditorynerve fibers to speech phonemes. The features of the neurograms were extracted using the Radon transform and used to train the classification system using a support vector machine classifier. In the testing phase, the same procedure was followed for extracting feature from the unknown (speech) signal. The class of the test phoneme was identified using the approximated function obtained from SVM training stage. 
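To make the training/testing flow of Fig. 3.1 concrete, the following is a minimal Python sketch of the pipeline, not the author's implementation: `simulate_neurogram` stands in for the externally provided Zilany-Bruce AN model plus the ENV framing, scikit-image's `radon` and scikit-learn's `SVC` replace the MATLAB `radon` function and LIBSVM used in the thesis, and the parameter values (10 angles, 35 points per projection, C = 12, gamma = 0.05) are those reported later in Sections 3.5 and 3.9.

```python
# Sketch of the proposed pipeline (Fig. 3.1); `simulate_neurogram` is a hypothetical
# wrapper around the Zilany-Bruce AN model and is NOT reproduced here.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from skimage.transform import radon

ANGLES = np.arange(0, 181, 20)      # 10 projection angles: 0..180 deg in 20-deg steps
POINTS_PER_ANGLE = 35               # each projection resized to 35 points -> 350 features

def radon_features(neurogram):
    """Concatenate resized Radon projections of a 2-D neurogram into one feature vector."""
    sino = radon(neurogram, theta=ANGLES, circle=False)     # shape: (n_bins, n_angles)
    feats = []
    for k in range(sino.shape[1]):
        proj = sino[:, k]
        # resample each projection to a fixed length by linear interpolation
        x_old = np.linspace(0.0, 1.0, proj.size)
        x_new = np.linspace(0.0, 1.0, POINTS_PER_ANGLE)
        feats.append(np.interp(x_new, x_old, proj))
    return np.concatenate(feats)    # length 350, independent of phoneme duration

def train_classifier(waveforms, labels, simulate_neurogram):
    X = np.vstack([radon_features(simulate_neurogram(w)) for w in waveforms])
    scaler = StandardScaler().fit(X)                        # zero mean, unit variance
    clf = SVC(kernel="rbf", C=12, gamma=0.05, decision_function_shape="ovo")
    clf.fit(scaler.transform(X), labels)
    return scaler, clf

def classify(waveform, scaler, clf, simulate_neurogram):
    x = radon_features(simulate_neurogram(waveform)).reshape(1, -1)
    return clf.predict(scaler.transform(x))[0]
```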
3.2 Datasets

To evaluate the proposed feature in a quiet environment, with additive noise and with reverberant speech, experiments were performed on the complete test set of the TIMIT database.

Figure 3.1: Block diagram of the proposed phoneme classifier (training: training phoneme → AN model → neurogram → Radon transform → SVM training → SVM model; testing: testing phoneme → AN model → neurogram → Radon transform → feature matching → classification matrix).

The HTIMIT corpus was used to collect speech distorted by different telephone channels.

3.2.1 TIMIT database

TIMIT is a unique corpus that has been used for numerous studies over the last 15 years. It is ideal for isolated phoneme classification experiments because it contains expertly labeled phonetic transcriptions and segmentations performed by linguists, which most other corpora do not possess. The database can also be used for continuous recognition. Experiments were performed on the complete test set of the TIMIT database (Garofolo & Consortium, 1993). There are two "sa" (dialect) sentences in the TIMIT database spoken by all speakers, which may result in artificially high classification scores (Lee & Hon, 1989). To avoid any unfair bias, experiments were performed on the "si" (diverse) and "sx" (compact) sentences. The training set consists of 3696 utterances from 462 speakers (140225 tokens), whereas the test set consists of 1344 utterances from 168 speakers (50754 tokens). The glottal stop /q/ was removed from the class labels, and the 61 TIMIT phoneme labels were collapsed into 39 labels following the standard practice given in (Lee & Hon, 1989). Table 3.1 describes this folding process and the resulting 39-phone set. Knowing that phone confusions occur mostly within similar phones, some researchers have proposed broad phone classes (Halberstadt, 1998; T. J. Reynolds & Antoniou, 2003; Scanlon, Ellis, & Reilly, 2007). We report the classification performance using the broad phone-class approach proposed by Reynolds et al. (T. J. Reynolds & Antoniou, 2003).

Table 3.1: Mapping from 61 classes to 39 classes, as proposed by Lee and Hon (Lee & Hon, 1989).

 1  iy               2  ih ix
 3  eh               4  ae
 5  ax ah ax-h       6  uw ux
 7  uh               8  ao aa
 9  ey              10  ay
11  oy              12  aw
13  ow              14  er axr
15  l el            16  r
17  w               18  y
19  m em            20  n en nx
21  ng eng          22  v
23  f               24  dh
25  th              26  z
27  s               28  zh sh
29  jh              30  ch
31  b               32  p
33  d               34  dx
35  t               36  g
37  k               38  hh hv
39  bcl pcl dcl tcl gcl kcl q epi pau h#

In 2003, Reynolds and Antoniou (T. J. Reynolds & Antoniou, 2003) divided the 39-phone set into 7 broad classes, namely Plosives (Plo), Fricatives (Fri), Nasals (Nas), Semi-vowels (Svow), Vowels (Vow), Diphthongs (Dip) and Closures (Clo) (Table 3.2). Table 3.3 indicates the number of tokens in each of the data sets.

Table 3.2: Broad classes of phones proposed by Reynolds and Antoniou (T. J. Reynolds & Antoniou, 2003).

Phone class    #TIMIT labels   TIMIT labels
Plosives       8               b d g p t k jh ch
Fricatives     8               f z th dh s sh v hh
Nasals         3               n m ng
Semi-vowels    5               l er r y w
Vowels         8               aa ae ah uh iy ih eh uh uw
Diphthongs     5               ay oy ey aw ow
Closures       2               sil dx

Table 3.3: Number of tokens in phonetic subclasses for the train and test sets.

Phone class    462-speaker train   168-speaker complete test
Plosives       17967               6261
Fricatives     20314               7278
Nasals         12312               4434
Semi-vowels    17406               7021
Vowels         34544               12457
Diphthongs     6890                2431
Closures       30792               10872
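Because the 61-to-39 folding of Table 3.1 and the broad classes of Table 3.2 are used throughout the experiments, a small Python sketch of the label mapping is given below. The `FOLD_39` dictionary lists the merges implied by Table 3.1 (the canonical symbol for each merged group is taken from Table 3.2); it is an illustration, not the author's code.

```python
# Label folding for the TIMIT experiments (illustrative, assumed helper).
FOLD_39 = {
    "ix": "ih", "ax": "ah", "ax-h": "ah", "ux": "uw", "ao": "aa",
    "axr": "er", "el": "l", "em": "m", "en": "n", "nx": "n",
    "eng": "ng", "zh": "sh", "hv": "hh",
    # all closure/silence symbols collapse into a single "sil" class;
    # the glottal stop /q/ was removed from the labels (see text)
    "bcl": "sil", "pcl": "sil", "dcl": "sil", "tcl": "sil",
    "gcl": "sil", "kcl": "sil", "epi": "sil", "pau": "sil", "h#": "sil",
}

# Broad phone classes following Table 3.2 (Reynolds & Antoniou, 2003).
BROAD_7 = {
    "Plosives":    ["b", "d", "g", "p", "t", "k", "jh", "ch"],
    "Fricatives":  ["f", "z", "th", "dh", "s", "sh", "v", "hh"],
    "Nasals":      ["n", "m", "ng"],
    "Semi-vowels": ["l", "er", "r", "y", "w"],
    "Vowels":      ["aa", "ae", "ah", "uh", "iy", "ih", "eh", "uw"],
    "Diphthongs":  ["ay", "oy", "ey", "aw", "ow"],
    "Closures":    ["sil", "dx"],
}

def fold(label):
    """Map a raw TIMIT label to its 39-class label (identity if not merged)."""
    return FOLD_39.get(label, label)
```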
3.2.2 HTIMIT corpus

To evaluate the effectiveness of the proposed approach on channel distortions, we used the well-known handset TIMIT (HTIMIT) corpus. This corpus is a re-recording of a subset of the TIMIT corpus through different telephone handsets. It was created for the study of telephone-transducer effects on speech while minimizing confounding factors. The corpus was produced by playing 10 TIMIT sentences from each of 192 male and 192 female speakers through a stereo loudspeaker into different transducers positioned directly in front of the loudspeaker, and digitizing the output of the transducers at an 8 kHz sampling rate with 16-bit resolution. In this study we resampled the data from 8 kHz to 16 kHz. The set of utterances was played back and recorded through nine other handsets, comprising four carbon-button handsets (cb1, cb2, cb3 and cb4), four electret handsets (el1, el2, el3 and el4), and one portable cordless phone (pt1). In this study, all experiments were performed on 1431 utterances from 359 speakers (54748 tokens).

3.3 AN model and neurogram

The AN model developed by Zilany et al. (Zilany & Bruce, 2006) is a useful tool for understanding the underlying mechanical and physiological processes in the human auditory periphery. The schematic block diagram of this AN model is shown in Fig. 3.1 of (Zilany & Bruce, 2006). The model represents the way simple and complex sounds are encoded in the auditory periphery (Carney & Yin, 1988). The sound stimulus is resampled to 100 kHz and presented to the model as the instantaneous pressure waveform at the middle ear (ME). To replicate the human ME filter response, a fifth-order digital filter was used (Zilany & Bruce, 2006); to maintain stability, this fifth-order filter was implemented as a cascade of two second-order sections and one first-order section. Normally, 32 CFs ranging from 150 Hz to 8 kHz, logarithmically spaced, are simulated to produce the neurogram. Each CF is treated as a single AN fiber and behaves like a band-pass filter with an asymmetric filter shape. In the present study, the neural response at each CF was simulated for a single repetition of each stimulus. A single nerve fiber does not fire on every cycle of the stimulus, but when spikes do occur, they occur at roughly the same phase of the waveform on successive cycles, a behaviour known as phase locking. Phase locking varies somewhat across species, but its upper frequency boundary lies at about 4-5 kHz (Palmer & Russell, 1986); Heinz et al. observed weak phase locking up to 10 kHz (Heinz, Colburn, & Carney, 2001). For this reason, an upper CF limit of 8 kHz is commonly used in the AN model. To recall, the acoustic signal is adapted to this model by up-sampling to 100 kHz. Three different types of AN-fiber responses (Liberman, 1978), with low, medium and high spontaneous rates, were simulated to maintain consistency with the physiology of the auditory system. According to the spontaneous-rate distribution, the synapse responses of the AN model for low, medium and high spontaneous-rate fibers are weighted by 0.2, 0.2 and 0.6, respectively. The details of the AN model were described in the literature review chapter.

Figure 3.2: Time-frequency representations of speech signals. (A) A typical speech waveform (used to produce the spectrogram and neurogram of that signal), (B) the corresponding spectrogram, and (C) the corresponding neurogram.
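As a rough illustration of how a neurogram could be assembled from the simulated fiber responses just described (32 log-spaced CFs, spontaneous-rate weights 0.2/0.2/0.6, 100 kHz model rate), the sketch below uses a hypothetical `an_psth` wrapper around the Zilany-Bruce model and anticipates the ENV binning/framing parameters detailed in the next paragraphs; it is an assumption-laden outline, not the author's implementation.

```python
# Sketch of ENV-neurogram construction; `an_psth(stimulus, fs, cf, sr_type)` is a
# hypothetical wrapper returning a PSTH (one value per stimulus sample) for one fiber.
import numpy as np

FS_MODEL = 100_000                                    # model sampling rate (Hz)
CFS = np.logspace(np.log10(150), np.log10(8000), 32)  # 32 CFs, 150 Hz - 8 kHz, log-spaced
SR_WEIGHTS = {"low": 0.2, "medium": 0.2, "high": 0.6} # spontaneous-rate weighting

def neurogram(stimulus_100k, an_psth, bin_width=100e-6, frame_len=128):
    """One row per CF: responses binned at `bin_width` and averaged over
    Hamming-windowed frames with 50% overlap (ENV settings from the text)."""
    window = np.hamming(frame_len)   # symmetric Hamming; the thesis uses the periodic flag
    hop = frame_len // 2
    rows = []
    for cf in CFS:
        # weighted sum of low/medium/high spontaneous-rate fiber responses
        resp = sum(w * an_psth(stimulus_100k, FS_MODEL, cf, sr)
                   for sr, w in SR_WEIGHTS.items())
        # re-bin the PSTH to the ENV bin width (100 us -> 10 samples per bin)
        spb = int(round(bin_width * FS_MODEL))
        n_bins = len(resp) // spb
        binned = resp[:n_bins * spb].reshape(n_bins, spb).sum(axis=1)
        # average over 128-point windowed frames with 50% overlap
        frames = [np.mean(window * binned[i:i + frame_len])
                  for i in range(0, n_bins - frame_len + 1, hop)]
        rows.append(frames)
    return np.asarray(rows)          # shape: (32, n_frames)
```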
A neurogram is analogous to a spectrogram: it provides a pictorial representation of neural responses in the time-frequency domain. Figure 3.2 shows the difference between a neurogram and a spectrogram. The basic difference is that the spectrogram is the FFT-based frequency spectrum of the acoustic signal as a function of time, whereas the neurogram is the collection of 2-D neural responses, i.e., the set of responses at the corresponding CFs. The neurogram is obtained by averaging the neural response at each CF. The ENV neurogram was derived from the neural responses using a 100 µs bin width and a 128-point Hamming window (50% overlap between adjacent frames, with the periodic flag, which is useful for spectral analysis (Oppenheim, Schafer, & Buck, 1989)); the average value of each frame was then computed. The spike-synchronization frequency captured by the ENV neurogram therefore extends up to about 160 Hz [1/(100 × 10⁻⁶ × 128 × 0.5)]. To calculate the TFS neurogram, a 10 µs bin width and a 32-point Hamming window (50% overlap between adjacent frames, periodic flag) were used for each frame; this combination extends the synchronization frequency range up to about 6.25 kHz [1/(10 × 10⁻⁶ × 32 × 0.5)]. This implies that the ENV neurogram has lower resolution and contains less information than the TFS neurogram. However, since the ENV neurogram summarizes the whole neural response and is smaller than the TFS neurogram, less time is required for training and testing the phoneme classifier; this is why the ENV neurogram was used in the proposed method. Overall, the AN model captures the nonlinearities of the mammalian peripheral auditory system more accurately than existing auditory models and acoustic cepstral-based models, and it is also useful for robust phoneme, speech and speaker identification. In this study, neurograms were constructed by simulating the responses of the AN fibers to phonemes from the TIMIT database; the responses of 32 AN fibers logarithmically spaced from 150 to 8000 Hz (150 to 4000 Hz for HTIMIT) were simulated.

Figure 3.3: Geometry of the DRT.

3.4 Feature extraction using Radon transform

The multiple parallel-beam projections of an image f(x, y) taken from different angles are referred to as the discrete Radon transform (DRT). The projections are obtained by rotating the source around the center of the image (Bracewell & Imaging, 1995). In general, the Radon transform $R_\theta(x')$ of an image is the line integral of $f$ parallel to the $y'$-axis,

$$R_\theta(x') = \int_{-\infty}^{\infty} f\big(x'\cos\theta - y'\sin\theta,\; x'\sin\theta + y'\cos\theta\big)\,dy' \qquad (3.1)$$

where

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}.$$

Figure 3.3 illustrates the geometry of the Radon transform. There are two distinct forms of the transform: the source can either be a single point (not shown) or an array of sources. The Radon transform is a mapping from the Cartesian rectangular coordinates (x, y) to a distance and an angle (ρ, θ), also known as polar coordinates. Figure 3.4(a) shows a binary image (of size 100 by 100); panels (b) and (c) show the Radon transforms of this image at 0° and 45°, respectively.

Figure 3.4: How the Radon transform works: (a) a binary image, (b) its Radon transform at 0 degrees, and (c) its Radon transform at 45 degrees.
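As a small numerical check of Eq. (3.1) and Fig. 3.4, the sketch below computes the projections of a 100 × 100 binary image at 0° and 45°, using scikit-image's `radon` as a stand-in for the MATLAB `radon` function referred to in the text.

```python
# Radon projections of a simple binary image at 0 and 45 degrees (cf. Fig. 3.4).
import numpy as np
from skimage.transform import radon

image = np.zeros((100, 100))
image[40:60, 40:60] = 1.0                                  # a 20x20 binary square

proj_0  = radon(image, theta=[0.0],  circle=False)[:, 0]   # projection at 0 degrees
proj_45 = radon(image, theta=[45.0], circle=False)[:, 0]   # projection at 45 degrees

# The 0-degree projection of the square is approximately a 20-sample-wide rectangle of
# height 20, while the 45-degree projection is triangular; the total "mass" is preserved.
print(proj_0.sum(), proj_45.sum(), image.sum())            # all approximately 400
```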
3.5 SVM classifier

The SVM is a popular supervised learning method for classification and regression (Cortes & Vapnik, 1995; Vapnik, 2013). The SVM algorithm constructs a set of hyperplanes in a high-dimensional space; each hyperplane is defined by the set of points whose dot product with a fixed vector in that space is constant. In this study, LIBSVM (C.-C. Chang & Lin, 2011), a MATLAB library for SVMs, was used to train the proposed features (Radon projection coefficients) to predict the labels of the phoneme classes. C-support vector classification (C-SVC) with the RBF kernel mapping was employed, and the C parameter of the SVC and the parameter of the RBF kernel function (gamma) were selected using a cross-validation algorithm. The mathematical formulation of the C-SVC optimization problem and its corresponding decision function, as given in (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; C.-C. Chang & Lin, 2011), is summarized below. Given training vectors $\mathbf{x}_i \in R^n$, $i = 1, \dots, l$, in two classes, and an indicator vector $\mathbf{y} \in R^l$ such that $y_i \in \{1, -1\}$, C-SVC solves the following primal optimization problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\zeta}} \;\; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{l}\zeta_i \qquad (3.2)$$

subject to $y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$, $i = 1, \dots, l$, where $\phi(\mathbf{x}_i)$ maps $\mathbf{x}_i$ into a higher-dimensional space and $C > 0$ is the regularization parameter. Due to the possibly high dimensionality of the vector variable $\mathbf{w}$, one usually solves the following dual problem:

$$\min_{\boldsymbol{\alpha}} \;\; \frac{1}{2}\boldsymbol{\alpha}^{T}Q\boldsymbol{\alpha} - \mathbf{e}^{T}\boldsymbol{\alpha} \qquad (3.3)$$

subject to $\mathbf{y}^{T}\boldsymbol{\alpha} = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, l$, where $\mathbf{e} = [1, \dots, 1]^{T}$ is the vector of all ones, $Q$ is an $l \times l$ positive semidefinite matrix with $Q_{ij} \equiv y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, and $K(\mathbf{x}_i, \mathbf{x}_j) \equiv \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)$ is the kernel function. After problem (3.3) is solved, the primal-dual relationship implies that the optimal $\mathbf{w}$ satisfies

$$\mathbf{w} = \sum_{i=1}^{l} y_i \alpha_i \phi(\mathbf{x}_i) \qquad (3.4)$$

and the decision function is

$$\operatorname{sgn}\left(\mathbf{w}^{T}\phi(\mathbf{x}) + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b\right).$$

The quantities $y_i\alpha_i$ for all $i$, $b$, the label names, the support vectors, and other information such as the kernel parameters are stored in the model for prediction.

SVM parameter selection

The SVM is fundamentally a binary classifier that compares one set of samples against another. Two classification modes are available: one-versus-one (OVO) and one-versus-rest (OVR). The advantage of the OVO scheme is that it takes less time, because the individual problems to be solved are smaller; in the OVR scheme, each test instance is compared against the target class and all remaining classes at the same time. The performance using the OVO and OVR schemes was 69.21% and 68.70%, respectively; in this study, all performance was evaluated using the OVO scheme. Four types of kernel function are commonly used with SVMs; the default kernel (RBF) was used in the presented phoneme classification. Two parameters, C and γ, are associated with the RBF kernel. The penalty parameter C trades off the misclassification of training data against the smoothness of the decision surface: a small value of C yields a smoother decision surface, whereas a large value allows more training samples to be selected as support vectors (Soong, Rosenberg, Juang, & Rabiner, 1987). Too small a value of gamma (γ) over-constrains the model so that it cannot capture the complexity of the data. This implies that the classification performance of an SVM with the RBF kernel depends strongly on a proper selection of C and γ. The best values of C and gamma were obtained with a cross-validation algorithm: for C = 1 the individual phone accuracy was 65.94%, and the accuracy increased with C up to C = 12; increasing C beyond 12 left the accuracy almost unchanged while making the computation more expensive. The value of γ was chosen in a similar way.
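A minimal sketch of such a cross-validation search over C and γ is shown below (a scikit-learn stand-in for the LIBSVM grid-search tool; the grid values are illustrative, not the author's exact search range).

```python
# Cross-validation grid search for the RBF-SVM parameters C and gamma (illustrative grid).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_svm_parameters(X_train, y_train):
    """5-fold cross-validation grid search over (C, gamma) for an RBF-kernel SVM."""
    param_grid = {
        "C":     [1, 4, 8, 12, 16],
        "gamma": [0.01, 0.05, 0.1, 0.5],
    }
    search = GridSearchCV(
        SVC(kernel="rbf", decision_function_shape="ovo"),
        param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_
```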
In this study, the value used for C and G was 12 and 0.05 respectively, considering computational cost and accuracy under both quiet and noisy condition. 3.6 Environmental Distortions The first step in most environmental robustness techniques is to specify how speech signals are altered by the acoustic environment. Though the relationship between the environments corrupted signals and the original, noise free, signals is often complicated; it is possible to relate them approximately using a model of the acoustic environment 47 Reverberation Additive transmission noise Microphone channel distortion Lombard effect Transmission channel distortion ASR systems background noise Figure 3.5: Types of environmental noise which can affect speech signals (Wang, 2015) (Acero, 1990; Hansen, 1996). Figure 3.5 demonstrates one such model, in which several major types of environment noise are depicted. First, the production of speech can be influenced by the ambient background noise. This is the Lombard effect (Junqua & Anglade, 1990): as the level of ambient noise increases, speakers tend to hyperarticulate: vowels are emphasised while consonants become distorted. It is reported that recognition performance with such stressed speech can degrade substantially (Bou-Ghazale & Hansen, 2000). There have been a few attempts to tackle the Lombard effect as well as stressed, emotional speech. However, in this thesis, these effects are not addressed. We will consider other environmental distortions. Figure 3.5 show, a major source of environmental distortions is the ambient noise presented in the background. The background noise can be emitted near the speaker or the microphone. It is considered as additive to the original speech signals in the time domain and can be viewed as statistically independent of the original speech. Additive background noise can be stationary (e.g., noise generated by a ventilation fan), semi-stationary (e.g., noise caused by interfering speakers or music), or abrupt (e.g., noise caused by incoming vehicles). Additive background noise occurs in all speech recognition applications. Therefore, it has been intensively studied in the past several decades (Acero, 1990; M. J. Gales, 2011; M. J. F. Gales, 1995). If the speaker uses a distant microphone in an enclosed space, the speech signals are subject to another type of environment distortion – reverberation. As shown in Fig. 3.5 48 reverberation is usually caused by the reflection of speech waveform on flat surfaces, e.g., a wall or other objects in a room. These reflections result in several delayed and attenuated copies of the original signals, and these copies are also captured by the recording microphone. In theory, the reverberant speech signals can be described as convolving the speech signals with the room impulse response (RIR), resulting in a multiplicative distortion in the spectral domain. Compared with the additive background noises, which are usually statistically independent of the original clean signals, the reverberant noises are correlated with the original signals and the reverberant distortions are different from the distortions caused by additive background noise. Reverberation has a strong detrimental effect on the recognition performance of an ASR systems: for instance, without any reverberant robustness technique, the recognition accuracy of a ASR system can easily drop to 60% or even more in a moderate reverberant environment (length of RIR ∼ 200ms) (Yoshioka et al., 2012). 
Therefore, reverberant robustness techniques are an essential component in many applications. Figure 3.5 also shows that speech signals can be transmitted by a series of transducers, including microphones, and some communication channels. Differences in characteristics of these transducers add another level of distortions, i.e., the channel distortion. 49 Figure 3.6: Impact of environment distortions on clean speech signals in various domain: (a) clean speech signal (b) speech corrupted with background noise (c) speech degraded by reverberation (d) telephone speech signal. (e)-(h) Spectrum of the respective signals shown in (a)–(d). (i)-(l) neurogram of the respective signals shown in (a)–(d).(m)-(p) Radon coefficient of the respective signals shown in (a)–(d). Compared with reverberant distortions, the channel distortions are usually caused by linearly filtering the incoming signals with the microphone impulse response, which is shorter than the length of analysis window. Figure 3.6 shows the impact of environment distortions on clean speech signals in various domains. Figure 3.6 suggests that due to environmental distortions there is a high impact on acoustic domain and spectral domain but in Radon domain the effect is very low. For simplicity, in this Figure we do not show the full length of reverberant speech. It is shown in figure 3.7. 50 3.6.1 Existing Strategies to Handle Environment Distortions The impact of environment distortions on clean speech signals have been described in figure 3.6. Most-widely used environmental robustness techniques can be grouped into three broad categories (Wang, 2015): i) Inherently robust front-end or robust signal processing: in which the front-end is designed to extract feature vectors x(t) which are insensitive to the differences between clean speech signals x(t ) and noise corrupted signals y(t ), or the clean speech signals x(t) can be re-constructed given corrupted signals y(t ) ; ii) Feature compensation, in which the corrupted feature vector y(t) is compensated such that it closely resembles the clean speech vector x(t). This will be referred to as the feature-based approach; iii) Model compensation, in which the clean-trained acoustic model Mx is adapted to My that it better matches the environment corrupted feature vectors in the target condition. This will be referred to as the model-based approach In this study, we proposed an approach to handle this environmental distortion through a combination of two successful schemes. Our proposed approach will be discussed in section 3.9. 51 3.7 Generation of noise We evaluate our proposed feature with different types of noise. The procedure to generate these noise is presented below. 3.7.1 Speech with additive noise Among the three main noise types, additive noise is the most common source of distortions. It can occur in almost any environment. Additive noise is the sound from the background, captured by the microphone, and linearly mixed with the target speech signals. It is statistically independent and additive to the original speech signals. Additive noise can be stationary (e.g., aircraft/train noise and noise caused by ventilation fan), slowly changing (e.g., noise caused by background music), or transient (e.g., in car noise caused by traffic or door slamming). 
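The experiments described below add such noises to clean phonemes at specified SNRs; a minimal sketch of SNR-controlled mixing (assuming NumPy arrays at the same sampling rate; not the author's original script) is:

```python
# Mix a noise recording with clean speech at a target SNR (in dB).
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    if len(noise) < len(speech):
        # tile the noise recording if it is shorter than the phoneme
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: corrupt a phoneme at the SNRs used in the experiments (-5 to 25 dB, 5-dB steps)
# noisy_versions = {snr: add_noise_at_snr(speech, babble, snr) for snr in range(-5, 30, 5)}
```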
When speech signal is corrupted by additive noise, the signal that reaches the microphone can be written as a[t] =s[t] +n[t], (3.5) where a[t] is the discrete representation of the input signal, s[t] represents the clean speech signal which is corrupted by noise n[t]. In this thesis we consider six types of additive noise which are: speech shape noise (SSN), babble, exhibition, train, car, and restaurant noise. In this study, additive noise (Hopkins & Moore, 2009) with different SNRs ranging from -5 to 25 dB in steps of 5 dB were added to the clean phoneme signals to evaluate the performance of the proposed methods. 3.7.2 Reverberant Speech When the length of impulse response is significantly longer, the nature of its impact on the speech recognition systems is very different. 52 Figure 3.7: Comparison of clean and reverberant speech signals for phoneme /aa/: (a) clean speech, (b) signal corrupted by reverberation (c)-(d) Spectrogram of the respective signals shown in (a)–(b) and (e)-(f) Radon coefficient of the respective signals shown in (a)-(b). In the time domain, the reverberation process is mathematically described as the convolution of clean speech sequence s(t) with an unknown RIR h(t) r (t) =s (t)*h (t) (3.6) where s (t), h (t), and r (t) denote the original speech signal, the RIR’s, and the reverberant speech, respectively. The length of a RIR is usually measured by the so called reverberation time, T60, which is the time needed for the power of reflections of a direct sound to decay by 60dB. The T60 value is usually significant longer than the analysis window. For example, T60 normally ranges from 200ms to 600ms in a small office room, 400ms to 800ms in a living room, while in a lecture room or concert hall environment, it can range from 1s to 2s or even longer. For phoneme classification experiments with reverberant speech, the clean TIMIT test data were convolved with a set of eight different room responses collected from various sources (Gelbart & Morgan, 53 2001) with reverberation time (T60) ranging from almost 300 to 400 ms. To make convoluted noise we used ‘conv’ function and keep the length same for both clean phoneme and reverberant phoneme using ‘same’ option in the function. The use of eight different room responses results in eight reverberant test sets consisting of 50754 tokens each. Figure 3.7 Shows the impact of a reverberation noise (T60=350ms) on waveform, spectrogram and Radon coefficient. This is because the convolution in the time domain is transferred to multiplication in the frequency domain and addition in the cepstral domain. From figure it is clear that due to reverberation there is a significant change in the spectral domain but slightly change in the Radon domain. 3.7.3 Telephone speech If the Lombard effect and the additive transmission noise in the environment model depicted in Fig. 3.5 are ignored, and the speaker does not talk in an enclosed acoustic space, the background noise and the channel (both the microphone and transmission channel) distortion become the main distortions. Figure 3.8 depicts this simplified environment model We can write channel distortion as: y(t)=s(t)* h(t) + n(t), (3.7) where s(t) and y(t) are the time-domain clean speech signal and noisy speech signal respectively. h(t) is the impulse response of the microphone and transmission network, and n(t) is the channel filtered version of the ambient background noise. 
For experiments in telephone channel, we collected speech data from nine telephone sets in 54 Figure 3.8: A simplified environment model where background noise and channel distortion dominate. the HTIMIT database (Reynolds, 1997). For each of these telephone channels 54748 test utterances are used. In all the experiments, the system is trained only on the training set of TIMIT database, representing clean speech without the distortions introduced by the additive or convolutive noise but tested on the clean TIMIT test set as well as the noisy versions of the test set in additive, reverberant, and telephone channel conditions (mismatched train and test conditions). 3.8 Similarity measure Cross-correlation is a measure of similarity of two signal or images and can be represented as a sliding dot product or inner-product. Correlation can be applied in pattern recognition, single particle analysis, and neurophysiology. In this study crosscorrelation is used to compare two signal in different domain. The similarity between two matrices or vectors can be found through correlation coefficient measure by following way: 𝑟= ∑𝑚 ∑𝑛(𝐴𝑚𝑛 −𝐴̅)(𝐵𝑚𝑛 −𝐵̅) √(∑𝑚 ∑𝑛(𝐴 𝑚𝑛 −𝐴̅ )2 ) (∑𝑚 ∑𝑛(𝐵𝑚𝑛 −𝐵̅)2 ) (3.8) Where r is correlation coefficient, 𝐴̅ and 𝐵̅ are the mean values of matrices or vectors A and B, respectively. 55 3.9 Procedure The block diagram of the proposed neural-response-based phoneme classification method is shown in Fig. 3.1. Each phoneme signal was up-sampled to 100 kHz which was required by the AN model in order to ensure stability of the digital filters implemented for faithful replication of frequency responses of different stages (e.g., ME) in the peripheral auditory system (Zilany et al., 2009). The SPL of all phonemes was set to 70 dB which represents the preferred listening level for a monaural listening situation. Because the AN model used in this study is nonlinear, the neural representation would be different at different sound levels. In response to a speech signal, the model simulates the discharge timings (spike train sequence) of AN fiber for a given characteristic frequency. Therefore, a 2-D representation, referred to as a neurogram, was constructed by simulating the responses of AN fibers over a wide range of CFs spanning the dynamic range of hearing. In the SVM training phase, the Radon projection coefficients were calculated from the phoneme neurogram using ten (10) rotation angles ranging from 0° to 180° in steps of 20°. The vector of each Radon projection was resized to 35 points and then combined together for all angles to form a (1 × 350) feature vector. Thus the total number of features for each phoneme was 350 irrespective of the duration of the phoneme in the time domain. A mapping function was used subsequently to normalize the mean and standard deviation of the feature vector to 0 and 1, respectively. All normalized data from each phoneme were combined together to form an input array for SVM training. The corresponding label vector of phoneme classes was also constructed. In testing phase, the Radon projection coefficients using the same ten rotation angles were calculated from the test (unknown) phoneme neurogram. The label (class) of the 56 Figure 3.9: Neurogram -based feature extraction for the proposed method: (a) a typical phoneme waveform (/aa/), (b) speech corrupted with SSN (10 dB) (c) speech corrupted with SSN (0 dB). (d)-(f) neurogram of the respective signals shown in (a)–(c). (g)-(h) Radon coefficient of the respective signals shown in (a)–(b). 
(j) Radon coefficient of the respective signals shown in (a)-(c). test phoneme was identified using the approximated function obtained from SVM training stage. The train and test data (for additive noise and room reverberation) were choosen from the TIMIT database, whether data for testing channel distortion was taken from the HTIMIT database. Figure 3.9 shows the example features extracted by applying the Radon transform on the neurogram. Figure 3.9 (a) shows the waveform of a typical phoneme (/aa/) taken from the TIMIT database, and the same phoneme under noisy conditions with SNRs of 10 and 0 dB is shown in the panels (b) and (c), respectively. Figure 3.9 (i) shows the Radon projection coefficients of the neurogram for 3 angles for the phoneme signal in quiet (solid line) and at SNRs of 10 dB (dashed 57 line) and 0 dB (dotted line). Note that in the proposed method, we employed 10 angles, but for clarity, we used only 3 angles of projection. This figure also shows how Radon projection coefficients changes with SNRs. The classification results of the proposed method were compared to the performances of three traditional acoustic-property-based features such as MFCC, GFCC and FDLP. 3.9.1 Feature extraction using MFCC, GFCC and FDLP for classification The classification results of the proposed method were compared to the performances of three traditional acoustic-property-based features such as MFCC, GFCC and FDLP. MFCC is a short-time cepstral representation of a speech which is widely used as a feature in the field of speech processing applications. In this study, the RASTAMAT toolbox (Ellis, 2005) was used to extract MFCC features from each phoneme signal. A Hanning window of length 25 ms (with an overlap of 60% among adjacent frames) was used for dividing the speech signal into frames. The log-energy based 39 MFCC coefficients were then computed for each frame. This set of coefficients consists of three groups: Ceps (Mel-frequency cepstral coefficients), Del (derivatives of ceps) and Ddel (derivatives of del) with13 features for each group (note that in the proposed method, the total number of features for each phoneme was fixed to 350). In this study, FDLP has been derived following code from Ganapathy et al. (Ganapathy et al., 2010), which is available on their official website. Like MFCC, same window type were used in computing the corresponding 39 FDLP features for each phoneme frame but the size of overlapped frames was 10 ms (40% overlapping). The Gammatone filter cepstral coefficient (GFCC) is an auditory-based feature used in phoneme classification. GFCC features can be computed by taking the (DCT) of the output of Gammatone filter as proposed by Shao et. al. (Shao et al., 2007). According to the physiological observation, the Gammatone filter-bank resembles more to the cochlear filter-bankn (Patterson, Nimmo-Smith, Holdsworth, & Rice, 1987). In this study, the procedure provided by 58 Shao et al. (Shao et al., 2007) was used to compute GFCC coefficients for phoneme. A fourth-order 128-channels Gammatone filter-bank with the center frequencies from 50 Hz to 8 kHz (half of the sampling frequency) was used to extract GFCC features. Instead of log operation which is commonly used in MFCC calculation, the cubic root was applied to extract GFCC features. 23 dimensional GFCC features were used in the present study to simulate the result. The details of the method can be found in (Shao et al., 2007). 
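A minimal sketch of the 39-dimensional MFCC baseline described above (13 cepstra plus deltas and delta-deltas, 25 ms window, 60% overlap) is given below; the author used the RASTAMAT toolbox in MATLAB, so librosa is used here only as a stand-in and the exact coefficient values will differ.

```python
# 13 MFCCs plus first and second derivatives (39 features per frame), librosa stand-in.
import numpy as np
import librosa

def mfcc_39(signal, fs=16000):
    """13 MFCCs with a 25-ms window and 60% overlap, plus delta and delta-delta."""
    win = int(0.025 * fs)                     # 25 ms analysis window
    hop = int(0.4 * win)                      # 60% overlap between adjacent frames
    ceps = librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop,
                                window="hann")
    # width=3 keeps the derivative filter valid even for very short phonemes
    delta = librosa.feature.delta(ceps, width=3)
    ddelta = librosa.feature.delta(ceps, width=3, order=2)
    return np.vstack([ceps, delta, ddelta])   # shape: (39, n_frames)
```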
3.9.1 Feature extraction using MFCC, GFCC and FDLP for classification

The classification results of the proposed method were compared to the performances of three traditional acoustic-property-based features: MFCC, GFCC and FDLP. MFCC is a short-time cepstral representation of speech that is widely used as a feature in speech processing applications. In this study, the RASTAMAT toolbox (Ellis, 2005) was used to extract MFCC features from each phoneme signal. A Hanning window of length 25 ms (with an overlap of 60% between adjacent frames) was used to divide the speech signal into frames. The log-energy-based 39 MFCC coefficients were then computed for each frame. This set of coefficients consists of three groups, Ceps (Mel-frequency cepstral coefficients), Del (derivatives of Ceps) and Ddel (derivatives of Del), with 13 features in each group (note that in the proposed method, the total number of features for each phoneme was fixed at 350). The FDLP features were derived using the code from Ganapathy et al. (2010), which is available on their official website. The same window type as for MFCC was used to compute the corresponding 39 FDLP features for each phoneme frame, but the overlap between adjacent frames was 10 ms (40% overlap). The Gammatone frequency cepstral coefficient (GFCC) is an auditory-based feature used in phoneme classification. GFCC features are computed by taking the discrete cosine transform (DCT) of the output of a Gammatone filter-bank, as proposed by Shao et al. (2007). According to physiological observations, the Gammatone filter-bank closely resembles the cochlear filter-bank (Patterson, Nimmo-Smith, Holdsworth, & Rice, 1987). In this study, the procedure provided by Shao et al. (2007) was used to compute the GFCC coefficients for each phoneme. A fourth-order, 128-channel Gammatone filter-bank with center frequencies from 50 Hz to 8 kHz (half of the sampling frequency) was used to extract the GFCC features. Instead of the log operation commonly used in the MFCC calculation, a cubic root was applied. 23-dimensional GFCC features were used in the present study. The details of the method can be found in Shao et al. (2007).

The SVM training and testing procedures described above were employed to determine the identity (label) of an unknown phoneme. In all the experiments, the system was trained only on the training set of the TIMIT database, representing clean speech without the distortions introduced by additive or convolutive noise, but was tested on the clean TIMIT test set as well as on noisy versions of the test set.

CHAPTER 4: RESULTS

4.1 Introduction

This chapter summarizes the results on phoneme classification under diverse environmental distortions. We describe a series of experiments to evaluate the classification accuracy obtained using the proposed method and to contrast it with the accuracy obtained using MFCC-, GFCC- and FDLP-based methods. Experiments were performed in mismatched train/test conditions where the test data are corrupted with various environmental distortions. The features extracted from the original (clean) phoneme samples of the TIMIT train subset were used to train the SVM models. In the testing stage, additive noise at a particular SNR was added to the test phoneme signal from the TIMIT test subset, and the proposed features were then extracted. The same procedure was followed for reverberant speech, and telephone speech was collected from the HTIMIT corpus. In this study, we present results for individual phone accuracy (denoted as single) and broad phone class accuracy (denoted as broad), following other researchers (Halberstadt & Glass, 1997; T. J. Reynolds & Antoniou, 2003; Scanlon et al., 2007). Broad-class confusion matrices were produced by first computing the full phone-against-phone confusion matrices and then adding all the entries within each broad-class block to give one number. Confusion matrices showing the performance of the different features in the clean condition and under noisy conditions are presented in Appendix A.

4.2 Overview of results

The average classification performance of the various feature extraction techniques on clean speech, speech with additive noise, reverberant speech, and telephone channel speech is presented in Table 4.1. Our proposed approach outperforms all other baseline features in the clean and additive-noise conditions. For classification of reverberant and telephone speech, the baseline GFCC- and MFCC-based feature extraction techniques provide the best performance, respectively, but suffer performance degradation in other conditions. However, for reverberant and telephone speech, the performance of the proposed feature is comparable with the GFCC- and MFCC-based features, respectively.

Table 4.1: Classification accuracies (%) of individual and broad class phonemes for different feature extraction techniques on clean speech, speech with additive noise (average performance of seven noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs), reverberant speech (average performance for eight room impulse response functions), and telephone speech (average performance for nine channel conditions). The best performance for each condition is indicated in bold.

Condition                      MFCC (single/broad)   GFCC (single/broad)   FDLP (single/broad)   Proposed feature (single/broad)
Clean                          60.16 / 76.68         49.94 / 64.21         50.57 / 67.65         69.19 / 84.21
Speech with additive noise     42.94 / 53.79         39.00 / 47.45         37.42 / 47.72         50.29 / 60.99
Reverberant speech             27.06 / 34.06         37.41 / 48.38         19.26 / 23.25         34.85 / 47.77
Telephone speech               42.53 / 57.93         15.86 / 29.92         28.19 / 43.40         41.75 / 60.15
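Section 4.1 stated that the broad-class confusion matrices were obtained by summing all entries of the full phone-against-phone confusion matrix that fall within each broad-class block. A minimal sketch of that aggregation is shown below; the phone-to-group mapping is only a partial, hypothetical stand-in for the broad classes of Reynolds & Antoniou (2003) used in this thesis.

```python
import numpy as np

# Hypothetical, partial grouping used only for illustration.
BROAD_CLASSES = {
    'PLO': ['p', 't', 'k', 'b', 'd', 'g'],
    'FRI': ['f', 'v', 's', 'z', 'sh', 'th', 'dh', 'hh'],
    # ... remaining groups (NAS, SVW, VOW, DIP, CLO) omitted for brevity
}

def broad_confusion(full_confusion, phone_labels, groups=BROAD_CLASSES):
    """Collapse a phone-by-phone confusion matrix (rows = true phones,
    columns = hypothesised phones) into broad-class blocks by summing all
    entries inside each (row-group, column-group) block."""
    full_confusion = np.asarray(full_confusion)
    names = list(groups)
    idx = {g: [phone_labels.index(p) for p in ps if p in phone_labels]
           for g, ps in groups.items()}
    B = np.zeros((len(names), len(names)), dtype=full_confusion.dtype)
    for i, gi in enumerate(names):
        for j, gj in enumerate(names):
            B[i, j] = full_confusion[np.ix_(idx[gi], idx[gj])].sum()
    return B, names
```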
Table 4.2: Confusion matrices for segment classification in clean condition.

MFCC
        PLO   FRI   NAS   SVW   VOW   DIP   CLO
PLO    4447   707    47   109   223     4   724
FRI     559  5213   178   118   243     7   960
NAS      71    91  3234   168   338    18   514
SVW      97    61   206  5145  1106   207   199
VOW     137   101   327  1114  9791   745   242
DIP       1     3    14   184   656  1552    21
CLO     369   224   402   129   207     2  9539

FDLP
        PLO   FRI   NAS   SVW   VOW   DIP   CLO
PLO    3672   721   181   188   305    29  1165
FRI     590  4942   160   142   199    11  1234
NAS     122   130  2520   373   562    19   708
SVW     138   208   601  4457  1079   101   437
VOW     246   361   886  1511  8386   473   594
DIP      13    21    48   177   685  1391    96
CLO     393   494   455   181   352    27  8970

GFCC
        PLO   FRI   NAS   SVW   VOW   DIP   CLO
PLO     294   873   101   107   347     3  4536
FRI      12  3777   134    87   226     2  3040
NAS       5    11  3064   125   440     4   785
SVW       6     9   192  4928  1621    72   193
VOW       5    34   216  1084 10327   466   325
DIP       0     0     3   299  1679   438    12
CLO      12   295   311   122   370     3  9759

Proposed feature
        PLO   FRI   NAS   SVW   VOW   DIP   CLO
PLO    5514   395    28    31    36     0   257
FRI     565  5903   138    72    90     1   509
NAS      46    75  3507   135   342     4   325
SVW      51    69   194  5640   908    92    67
VOW      45    47   274   901 10567   513   110
DIP       0     0     6   155   664  1604     2
CLO     195   201   297    81    92     0 10006

4.3 Classification accuracies (%) for phonemes in quiet environment

The overall classification accuracies for individual and broad-class phonemes under the clean condition are shown in Table 4.1. The proposed method achieved accuracies of 69.19% and 84.21% for individual and broad phoneme classification, respectively, in the quiet environment. The classification scores using the MFCC-, GFCC- and FDLP-based features are 60.16%, 49.94% and 50.57%, respectively, for individual phones, and the corresponding broad phone class accuracies are 76.68%, 64.21% and 67.65%. The proposed method thus resulted in higher classification accuracy than the MFCC-, FDLP- and GFCC-based methods.

The segment classification confusion matrices for the clean condition are shown in Table 4.2. It is evident that closures (CLO) were in general more confused with the other groups for all types of features. Some of the plosives (PLO) and fricatives (FRI) were confused with other groups, but most of the confusions were observed within these two groups. Similarly, nasals (NAS), semivowels (SVW), vowels (VOW) and diphthongs (DIP) were confused more among these groups than with the other groups for all four methods. However, the proposed method outperformed the three other traditional methods in terms of accuracy.

Table 4.3 shows the accuracies of the broad classes using the different features in the clean condition. Under the clean condition, the performance of the proposed method for all classes was better than the results for all other features. The performance of the GFCC-based method was worse for plosives than the results from the other methods. In general, closures were identified accurately in quiet by all methods (proposed: 92.03%, MFCC: 87.73%, GFCC: 80.39%, and FDLP: 82.50%).

Table 4.3: Classification accuracies (%) of broad phonetic classes in clean condition.

Class   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
PLO     71.0       4.69       58.64      88.06
FRI     71.62      51.89      67.90      81.10
NAS     72.93      69.10      56.83      79.09
SVW     73.28      70.18      63.48      80.33
VOW     78.59      82.90      67.31      84.82
DIP     63.84      18.01      57.21      65.98
CLO     87.73      80.39      82.50      92.03
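The broad-class accuracies in Table 4.3 follow directly from the confusion matrices in Table 4.2: each diagonal entry divided by its row sum. A small illustrative sketch:

```python
import numpy as np

def per_class_accuracy(confusion):
    """Class-wise accuracy (%) from a confusion matrix whose rows are the
    true classes and whose columns are the hypothesised classes."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.diag(confusion) / confusion.sum(axis=1)

# For the proposed-feature matrix of Table 4.2 this gives, e.g.,
# roughly 88.1% for PLO (5514 / 6261) and 92.0% for CLO (10006 / 10872).
```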
4.4 Performance for signals with additive noise

This section presents the classification accuracies (%) of the different feature extraction techniques for six noise types at SNRs ranging from -5 to 25 dB in steps of 5 dB. Table 4.4 shows the individual phone accuracies (%) for SSN, babble and exhibition noise, and Table 4.5 presents the individual phone accuracies (%) for restaurant, train and car noise. In the experiments reported in Table 4.4, the GFCC feature provides the best performance for SSN at -5 dB SNR. However, for almost all noise types and SNRs, the proposed feature provides improvements over all the baseline features. The broad classification accuracies in the clean condition and under the different background noises are shown in Fig. 4.1. Again, the proposed method resulted in higher classification accuracy than the MFCC-, GFCC- and FDLP-based methods.

Table 4.4: Individual phoneme classification accuracies (%) for different feature extraction techniques for three noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold.

SSN
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         23.35      25.69      22.41      20.23
0          31.72      31.96      27.39      33.9
5          41.78      40.9       34.09      46.36
10         50.5       46.72      40.98      57.25
15         55.82      48.71      46.29      64.2
20         58.58      49.47      48.65      67.3
25         59.45      49.66      50.04      68.56

Babble
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         21.9       20.39      23.01      23.15
0          29.28      29.88      27.64      34.96
5          38.97      38.84      33.52      46.33
10         47.21      44.29      39.65      55.72
15         53.78      47.41      44.74      62.71
20         57.39      49.04      47.89      66.69
25         58.99      49.57      49.51      68.35

Exhibition
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         19.1       19.17      22.37      23.57
0          25.84      25.62      26.63      32.82
5          34.79      32.52      32.08      41.83
10         43.42      39.45      38.38      50.52
15         50.47      44.43      43.55      57.53
20         55.29      47.28      46.73      63.04
25         57.96      48.82      48.68      66.7

Table 4.5: Individual phoneme classification accuracies (%) for different feature extraction techniques for three noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold.

Restaurant
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         22.65      21.51      23.43      24.27
0          29.91      30.95      28.11      36.15
5          39.42      39.46      33.89      47.37
10         47.77      44.28      39.84      56.14
15         53.83      47.30      44.9       62.72
20         57.3       48.98      47.95      66.72
25         59.17      49.58      49.54      68.19

Train
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         19.54      17.13      17.13      23.44
0          26.53      23.50      23.50      32.21
5          35.41      31.27      31.27      42.18
10         44.53      39.40      39.40      51.07
15         51.28      44.25      44.25      57.95
20         55.54      47.00      47.00      62.95
25         57.91      48.63      48.63      66.41

Car
SNR (dB)   MFCC (%)   GFCC (%)   FDLP (%)   Proposed feature (%)
-5         17.81      22.20      22.37      47.38
0          24.83      29.20      26.63      31.16
5          34.23      36.65      32.08      41.78
10         44.32      42.54      38.38      51.61
15         51.49      46.45      43.55      59.05
20         56.24      48.47      46.73      64.35
25         58.34      49.52      48.68      67.43

Figure 4.1: Broad phoneme classification accuracies (%) for different features in various noise types at different SNR values. The clean condition is denoted as Q.

4.5 Performance for reverberant speech

In this section, the effectiveness of the proposed approach for robust phoneme classification of reverberant speech is investigated. Phoneme classification in reverberant environments is a challenging problem due to the highly dynamic nature of reverberated speech. Reverberation can be described as a convolution of the clean speech signal with a room impulse response (RIR). The length of an RIR is usually characterized by the reverberation time, T60, defined as the time needed for the power of the reflections of a direct sound to decay by 60 dB below its original level.
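As a rough illustration of how a reverberant test signal can be generated and how T60 relates to the RIR, the sketch below convolves clean speech with an RIR and estimates T60 from the Schroeder (backward-integrated) energy decay curve. This is not the thesis code; the thesis used pre-measured room impulse responses, and the estimator shown here (a linear fit between -5 and -25 dB, extrapolated to a 60-dB decay) is only one common approximation.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    """Simulate reverberant speech by convolving the clean signal with a
    room impulse response; the output is trimmed to the original length."""
    return fftconvolve(clean, rir)[:len(clean)]

def estimate_t60(rir, fs):
    """Rough T60 estimate from the Schroeder energy decay curve (EDC)."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]              # backward integration
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)     # decay in dB
    t = np.arange(len(rir)) / float(fs)
    fit = (edc_db <= -5.0) & (edc_db >= -25.0)         # fitting range
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)      # dB per second (negative)
    return -60.0 / slope                               # time for a 60-dB decay
```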
Table 4.6: Classification accuracies (%) for the eight different reverberation test sets. The best performance for each condition is indicated in bold. The last row shows the average value, indicated as "Avg."

T60 (ms)   MFCC (single/broad)   GFCC (single/broad)   FDLP (single/broad)   Proposed feature (single/broad)
344        25.28 / 30.34         36.19 / 46.34         19.81 / 23.86         34.92 / 47.04
350        25.67 / 30.84         36.88 / 47.59         19.46 / 23.72         35.43 / 48.43
352        25.35 / 30.62         36.64 / 47.43         19.31 / 23.79         35.33 / 48.6
359        25.39 / 30.43         38.31 / 49.59         19.51 / 23.42         34.91 / 47.66
360        25.32 / 30.39         38.47 / 50.07         19.11 / 23.04         34.84 / 47.77
364        24.67 / 29.05         36.57 / 47.26         19.08 / 22.72         33.49 / 45.55
395        32.5  / 45.63         37.1  / 48.11         18.77 / 22.83         34.06 / 47.43
404        32.36 / 45.18         39.19 / 50.65         19.04 / 22.67         35.87 / 49.68
Avg.       27.07 / 34.06         37.41 / 48.38         19.26 / 23.25         34.85 / 47.77

Table 4.6 shows the results for reverberant speech. In general, the GFCC-based system provides the best performance (average accuracies of 37.41% and 48.38% for individual and broad classes, respectively) across all reverberant test sets. The average results for the same test sets using the proposed feature are 34.85% and 47.77%, respectively; the performance of the proposed feature is therefore comparable to that of the GFCC-based system for reverberant speech. However, the MFCC- and FDLP-based features are substantially less accurate for reverberant speech.

4.6 Signals distorted by telephone channel noise

Telephone speech recognition is extremely difficult due to the limited bandwidth of the transmission channels. As shown in Figure 3.6, there is almost no energy above 4 kHz in telephone speech. To evaluate the effectiveness of the proposed approach for signals distorted by telephone channels, we used the well-known HTIMIT database (D. A. Reynolds, 1997). Classification accuracies (%) of the different feature extraction techniques for the different handsets are presented in Table 4.7. For some handsets (cb2, cb4, el2 and el4), MFCC shows the best performance, whereas for other handsets (cb1, cb3, el3 and pt1), the proposed feature shows the best performance. The average classification accuracies of the proposed method over the nine telephone channels were 41.75% and 60.16% for individual phones and broad phone classes, respectively, whereas the corresponding accuracies of the MFCC-based method were 42.53% and 57.94%. In general, the performances of the MFCC and the proposed feature are comparable for telephone speech, but GFCC and FDLP are substantially less accurate. For all features, the classification accuracy is very low for signals distorted by handsets cb3 and cb4 because these handsets had particularly poor sound characteristics (D. A. Reynolds, 1997).

Table 4.7: Classification accuracies (%) for signals distorted by nine different telephone channels. The best performance is indicated in bold. The last row shows the average value, indicated as "Avg."

Channel   MFCC (single/broad)   FDLP (single/broad)   GFCC (single/broad)   Proposed feature (single/broad)
cb1       50.26 / 64.09         29.09 / 44.49         17.26 / 30.09         50.84 / 68.67
cb2       52.67 / 66.65         30.84 / 47.04         19.80 / 32.73         47.21 / 66.73
cb3       27.17 / 44.26         22.42 / 36.73         13.41 / 27.34         32.76 / 48.56
cb4       33.01 / 48.58         24.29 / 38.98         16.91 / 31.37         32.89 / 48.47
el1       51.69 / 67.45         30.73 / 46.43         16.28 / 29.97         49.64 / 69.62
el2       45.66 / 60.95         28.50 / 43.80         11.20 / 24.99         40.08 / 60.64
el3       45.97 / 59.82         28.95 / 44.08         16.48 / 30.11         48.35 / 67.42
el4       45.45 / 58.43         29.91 / 45.02         19.15 / 35.10         35.49 / 51.90
pt1       30.88 / 51.22         28.95 / 44.08         12.21 / 27.64         38.49 / 59.39
Avg.      42.53 / 57.94         28.19 / 43.41         15.86 / 29.93         41.75 / 60.16

CHAPTER 5: DISCUSSIONS

5.1 Introduction

This chapter discusses the effect of different parameters on classification accuracy, the robustness of the proposed feature, and the comparison of performance among the different methods.
The important finding of this study is that the proposed feature resulted in consistent performance across different types of noise. In contrast, all of the baseline feature-based systems (the MFCC, GFCC and FDLP coefficients) produced quite different results for the different types of noise. For example, the MFCC-based feature achieved good performance under channel distortions but suffered severely under reverberant conditions. Similarly, classification using the GFCC-based feature is less accurate in the clean condition and under channel distortions, but exhibits more robust behavior in reverberant conditions.

5.2 Broad class accuracy

Based on the simulation results, the proposed system outperformed all baseline feature-based systems in the clean condition and achieved comparable performance under noisy conditions. The simulation results also show that phonemes classified by the proposed method are confused mostly within groups rather than among different groups. Table 5.1 shows the correlation coefficients between the waveforms of two different phonemes in both the time and Radon domains. We consider two phonemes from the stops (/p/, /t/) and one phoneme from the vowels (/aa/), all having the same length (106 ms).

Figure 5.1: Examples of Radon coefficient representations: (a) stop /p/, (b) fricative /s/, (c) nasal /m/, (d) vowel /aa/.

Figure 5.2: Examples of Radon coefficient representations for stops: (a) /p/, (b) /t/, (c) /k/ and (d) /b/.

In the time and Radon domains, the correlation coefficient between /p/ and /t/ is 0.04 and 0.98, respectively. Similarly, for /p/ and /aa/, the correlation coefficient is -0.5 and 0.94 in the time and Radon domains, respectively. From Table 5.1, it can be seen that phonemes of the same group are more correlated with each other than with phonemes of other groups. Figure 5.1 shows the Radon coefficient representations of four phonemes from four different groups, and Figure 5.2 shows four phonemes from the stops group. From these figures, it is clear that the Radon representation of a phoneme is closer to that of a phoneme from the same group than to the representations of phonemes from different groups. This observation may explain the good broad-class classification accuracy of the proposed method.

Table 5.1: Correlation measure in the acoustic and Radon domains for different phonemes.

Phoneme pair    Similarity index: Acoustic    Similarity index: Radon (from neurogram)
/p/ and /t/     0.04                          0.98
/p/ and /aa/    -0.05                         0.94

The following sections describe the impact of different parameters on the accuracy and robustness of the proposed method compared with the alternative methods.

5.3 Comparison with results from previous studies

In Table 5.2, the results of some recent experiments on the TIMIT classification task in the quiet condition are gathered and compared with the results reported in this thesis. We also show the results obtained using MFCC. Unfortunately, despite the use of a standard database, it is still difficult to compare results because of differences in the selection of training and test data, and also because of differences in the selection of phone groups. In order to include more variation of phonemes, the proposed method was tested using the complete test set of the TIMIT database (50754 tokens from 168 speakers), which we believe is the most reasonable choice for comparisons of phonetic classification methods. Clarkson and Moreno created a multi-class SVM system to classify phonemes (Clarkson & Moreno, 1999). They also used a GMM for classification and achieved an accuracy of 73.7%.
Their reported result of 77.6% using the core test set (24 speakers) is extremely encouraging and shows the potential of SVMs in speech recognition. In 2003, Reynolds and Antoniou used a modular MLP architecture for speech recognition (T. J. Reynolds & Antoniou, 2003). Their broad-class accuracy on the core test set was 84.1%, and in this thesis we used the broad phone groups defined by them. The most directly comparable results in the quiet environment appear to be those of Halberstadt & Glass (1997) and Johnson et al. (2005). Halberstadt and Glass used heterogeneous acoustic measurements for phoneme classification; their broad-class accuracy on the complete set (118 speakers) was 79.0%, whereas our accuracy is 84.21% (168 speakers). Similarly, Johnson et al. reported results for an MFCC-based feature using an HMM on the complete set, with a single-phone accuracy of 54.86%, whereas we achieved 69.06% using the proposed feature. They also reported an accuracy of 35.06% for their proposed RPS feature. For reference, we obtained 60.16% using the MFCC-based feature with the SVM classifier.

Table 5.2: Phoneme classification accuracies (%) on the TIMIT core test set (24 speakers) and complete test set (168 speakers) in quiet condition for individual phones (denoted as single) and broad classes (denoted as broad). Here, RPS is the abbreviation of reconstructed phase space.

Author                              Method                      Test set    Single (%)   Broad (%)
(Clarkson & Moreno, 1999)           GMM (MFCC)                  Core        73.7         -
                                    SVM (MFCC)                  Core        77.6         -
(T. J. Reynolds & Antoniou, 2003)   Modular MLP architecture    Core        -            84.1
(Halberstadt & Glass, 1997)         Heterogeneous framework     Complete    -            79.0
(Johnson et al., 2005)              HMM (RPS)                   Complete    35.06        -
                                    HMM (MFCC)                  Complete    54.86        -
Proposed method                     SVM (MFCC)                  Complete    60.16        76.68
                                    SVM (proposed)              Complete    69.06        84.21

Ganapathy et al. reported phoneme recognition accuracies of individual phonemes on the complete test set for the FDLP-based feature, both in quiet and under different noise types. The accuracies of their feature for clean speech, speech with additive noise (average performance of four noise types at 0, 5, 10, 15, and 20 dB SNRs), reverberant speech (average performance for nine room impulse response functions), and telephone speech (average performance for nine channel conditions) were 62.1%, 43.9%, 33.6% and 55.5%, respectively. Although they used the complete TIMIT test set and nine telephone channels from HTIMIT for evaluation, it is still difficult to compare our results with theirs directly, because their reported results are for phone recognition whereas ours are for phone classification. The corresponding accuracies of the proposed feature for clean speech, speech with additive noise (average performance of six noise types at 0, 5, 10, 15, and 20 dB SNRs), reverberant speech (average performance for eight room impulse response functions), and telephone speech (average performance for nine channel conditions) are 69.06%, 51.48%, 34.85% and 41.75%, respectively.

5.4 Effect of the number of Radon angles on classification results

Table 5.3 presents the classification accuracy of the proposed method as a function of the number of Radon angles. With an increasing number of angles, the classification accuracy increased substantially up to 10 angles, both in quiet and under noisy conditions. The number of Radon angles used in this study was therefore set to ten based on the phoneme classification accuracy in quiet and under noisy conditions, even though the performance with 13 angles is slightly higher than that with 10 angles. The performance of the classifier saturates when more than 13 angles are used.
Table 5.3: Phoneme classification accuracies (%) as a function of the number of Radon angles (between 0° and 180°) used to encode the phoneme information.

No. of angles   Clean (single/broad)   Additive noise (single/broad)   Reverberant speech (single/broad)   Telephone speech (single/broad)
1               34.01 / 50.94          23.02 / 36.28                   28.99 / 44.48                       19.55 / 24.16
3               61.64 / 78.18          29.52 / 48.51                   44.71 / 64.32                       36.57 / 48.51
5               66.93 / 82.54          32.38 / 52.92                   48.29 / 68.26                       35.57 / 47.57
7               68.28 / 83.42          33.41 / 54.18                   49.43 / 69.33                       35.15 / 47.2
10              69.19 / 84.24          33.9  / 54.65                   49.64 / 69.62                       34.92 / 47.04
13              69.44 / 84.54          33.87 / 54.9                    49.79 / 69.65                       34.84 / 47.09

5.5 Effect of SPL on classification results

Substantial research has shown that in a quiet environment, presenting speech at levels higher than the conversational speech level (~65-70 dB SPL) results in decreased speech understanding (Fletcher & Galt, 1950; French & Steinberg, 1947; Molis & Summers, 2003; Pollack & Pickett, 1958; Studebaker, Sherbecoe, McDaniel, & Gwaltney, 1999). Under noisy conditions, speech recognition is affected by several factors, including the presence or absence of background noise, the SNR, and the frequencies being presented (French & Steinberg, 1947; Molis & Summers, 2003; Pollack & Pickett, 1958; Studebaker et al., 1999). To quantify the effect of SPL on the proposed feature, the classification accuracies were estimated for the different noise types at 50, 70 and 90 dB SPL, as shown in Table 5.4. The results show that the classification accuracy decreased when the SPL increased from 70 dB to 90 dB SPL in the quiet condition, which is consistent with the previous studies. Interestingly, the accuracy increased from 69.19% at 70 dB SPL to 71.23% at 50 dB SPL in the quiet condition. This could be related to the saturation of AN fiber responses at higher levels (the dynamic range is ~30-40 dB above threshold). For reverberant speech, however, the performance at 50 dB SPL is not satisfactory compared with the results at the higher presentation levels. In this thesis, the performance of the proposed and existing methods is evaluated using speech presented at 70 dB SPL. The MFCC- and FDLP-based classification systems are unaffected by variations in the sound presentation level, and the effect of SPL on the GFCC-based system is negligible; thus, only the proposed feature-based system captures the effect of SPL on recognition.

Table 5.4: Effects of SPL on classification accuracy (%).

SPL (dB)   Clean (single/broad)   Additive noise (single/broad)   Reverberant speech (single/broad)   Telephone speech (single/broad)
90         65.4  / 80.88          35.29 / 53.53                   35.76 / 48                          42.41 / 59.74
70         69.19 / 84.24          33.9  / 54.65                   34.92 / 47.04                       49.64 / 69.62
50         71.23 / 85.6           37.34 / 59.08                   31.27 / 40.76                       49.94 / 70.09
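Because the AN model is level dependent, the stimuli must be presented at a calibrated level. A minimal sketch of scaling a waveform to a target dB SPL is given below, under the assumption (not stated explicitly in this section) that the model input is expressed in pascals with the usual 20 µPa reference pressure.

```python
import numpy as np

def set_spl(x, target_db_spl=70.0, p_ref=20e-6):
    """Scale a waveform so that its RMS value corresponds to the target
    sound pressure level (dB SPL re 20 micropascals)."""
    rms = np.sqrt(np.mean(np.square(x)))
    target_rms = p_ref * 10.0 ** (target_db_spl / 20.0)
    return x * (target_rms / rms)
```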
5.6 Effect of window length on classification results

The neurogram resolution for the proposed method was chosen based on physiological data (Zilany & Bruce, 2006). To study the effect of neurogram resolution on phoneme classification, three different window sizes (16, 8 and 4 ms) were considered for smoothing, as illustrated in Table 5.5. It was observed that in the quiet condition, the classifier performance was relatively high when a window length of 4 ms was used, but this window resulted in poorer performance under noisy conditions. The performance of the proposed method with a 16-ms window was satisfactory under noisy conditions but slightly lower in quiet. Thus, considering the performance in both clean and noisy conditions, the window length was set to 8 ms in this study.

Table 5.5: Effect of window size on classification performance (%).

Window (ms)   Clean (single/broad)   Additive noise (single/broad)   Reverberant speech (single/broad)   Telephone speech (single/broad)
16            67.27 / 82.46          37.28 / 56.75                   36.6  / 50.77                       36.65 / 49.37
8             69.19 / 84.24          33.9  / 54.65                   49.64 / 69.62                       34.92 / 47.04
4             69.39 / 84.75          32.65 / 53.24                   49.35 / 70.19                       33.19 / 44.82

5.7 Effect of the number of CFs on classification results

The frequency at which a given neuron responds at the lowest sound intensity is called its characteristic frequency (CF); it is also known as the best frequency (BF). In this study, we investigated the effect of the number of CFs on the performance of the proposed system using the three simulation conditions shown in Table 5.6. The results suggest that the performance of the proposed system with 12 CFs, in quiet and noisy conditions, is not satisfactory, but that the responses of 32 fibers (out of approximately 30000 AN fibers) are sufficient to classify phonemes confidently. In the proposed system, we employed 32 CFs, since the computational cost also increases with the number of CFs.

Table 5.6: Effect of the number of CFs on classification performance (%).

No. of CFs   Clean (single/broad)   Additive noise (single/broad)   Reverberant speech (single/broad)   Telephone speech (single/broad)
12           63.31 / 80.44          33.83 / 53.52                   48.9  / 69.86                       31.87 / 44.41
32           69.19 / 84.24          33.9  / 54.65                   49.64 / 69.62                       34.92 / 47.04
40           69.21 / 84.1           33.6  / 54.44                   49.64 / 69.71                       35.11 / 47.31
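The thesis does not list the exact CF values used; in auditory-model studies CFs are commonly placed on a logarithmic axis across the speech-relevant range, so a hypothetical selection of the 32 CFs might look like the following (the 250 Hz to 8 kHz bounds are an assumption, not taken from the thesis).

```python
import numpy as np

def characteristic_frequencies(n_cf=32, cf_lo=250.0, cf_hi=8000.0):
    """Log-spaced characteristic frequencies (Hz) for the simulated AN fibers."""
    return np.logspace(np.log10(cf_lo), np.log10(cf_hi), n_cf)
```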
5.8 Robustness property of the proposed system

In this section, we investigate and discuss the robustness of the proposed system. The similarity between two matrices or vectors is quantified by the correlation coefficient of Eq. (3.8). First, we investigate the robustness of the neurogram (neural responses) compared with the baseline features. We consider a typical phoneme (/aa/) of length of almost 65 ms (not shown) as the input to the AN model to generate the corresponding neurogram responses [Fig. 5.3(d)]. The corresponding MFCC, GFCC and FDLP coefficients are also shown for each frame in Fig. 5.3(a), (b) and (c), respectively. The size of the generated neurogram image was 32 by 17, where the neural responses were simulated for 32 CFs. All columns of the neurogram array were concatenated to form a 1-D vector (of size 1 × 544), which is shown in Fig. 5.3(d). A similar procedure was followed for the other feature extraction techniques. The responses (features) of the phoneme in quiet are shown by the solid lines, and the responses for the same phoneme distorted by SSN at an SNR of 0 dB are shown by the dotted lines in each of the corresponding plots (a, b, c and d). The correlation coefficient between these two vectors was computed using Eq. (3.8) and was found to be ~0.85 for the neurogram feature, 0.76 for MFCC, 0.78 for GFCC, and 0.72 for the FDLP coefficients.

Figure 5.3: Features of the phoneme /aa/ in quiet (solid line) and at an SNR of 0 dB (dotted line): (a) MFCC features; the correlation coefficient between the two vectors was 0.76. (b) GFCC features; the correlation coefficient was 0.78. (c) FDLP features; the correlation coefficient was 0.72. (d) Neurogram responses; the correlation coefficient was 0.85.

Based on the similarity index represented by the correlation coefficient, it can be concluded that the proposed neural-response-based neurogram features were more robust than the traditional acoustic-property-based features, meaning that the representation of the phoneme in the AN responses was less distorted by noise.

Next, we investigate the robustness of the Radon-based neurogram feature. Figure 3.6 shows the effect of different noises on the neurogram and on the corresponding Radon coefficients (the Radon coefficients were computed by applying the Radon transform to the neurogram). From the figure, it is evident that the Radon representations are more robust. To make this clearer, we measured the correlation between the clean and noisy phonemes in different domains, as presented in Table 5.7. Seven phonemes from the seven different groups were chosen for this experiment, and SSN was added at SNRs of 0 and 10 dB to generate two noisy versions of each phoneme. The correlation coefficient between the clean and noisy phonemes (10 dB and 0 dB) was then measured in the different domains. Finally, the mean and standard deviation of the correlation coefficient were computed over 100 instances of the same phoneme. Table 5.7 reports only the mean values, because the standard deviations are very low: for the time, spectral, neurogram and Radon domains, the average standard deviations of the correlation coefficient between clean and 10 dB are 0, 0.03, 0.04 and 0, respectively, and between clean and 0 dB are 0.03, 0.10, 0.05 and 0.01, respectively. Table 5.7 suggests that the Radon representations of the phonemes are more robust than the representations in any of the other domains.

Table 5.7: Correlation measure between different phonemes and their corresponding noisy (SSN) versions (clean-10 dB and clean-0 dB) in different domains. The average correlation measure (Avg) over the seven phonemes is indicated in bold (last row).

Phoneme   Time (10 dB / 0 dB)   Spectral (10 dB / 0 dB)   Neurogram (10 dB / 0 dB)   Radon (10 dB / 0 dB)
/p/       .95 / .70             .94 / .68                 .97 / .94                  .99 / .99
/s/       .95 / .70             .96 / .28                 .91 / .86                  .99 / .98
/m/       .95 / .71             .99 / .95                 .96 / .92                  .99 / .99
/l/       .95 / .70             .99 / .91                 .92 / .86                  .99 / .98
/aa/      .95 / .70             .99 / .80                 .89 / .81                  .99 / .98
/ow/      .95 / .70             .99 / .91                 .86 / .77                  .99 / .98
/bcl/     .95 / .70             .99 / .88                 .93 / .89                  .99 / .99
Avg       .95 / .70             .98 / .77                 .92 / .86                  .99 / .98

In the present study, the neurogram features are derived directly from the responses of a model of the auditory periphery to speech signals. The proposed method differs from all previous auditory-model-based methods in that a more complete and physiologically-based model of the auditory periphery is employed (Zilany, Bruce, & Carney, 2014). The auditory-nerve (AN) model developed by Zilany et al. has been extensively validated against a wide range of physiological recordings from the mammalian peripheral auditory system. The model successfully replicates almost all of the nonlinear phenomena observed at different levels of the auditory periphery (e.g., in the cochlea, the inner hair cell (IHC), the IHC-AN synapse, and the AN fibers). These phenomena include nonlinear tuning, compression, two-tone suppression, level-dependent rate and phase responses, the shift in best frequency with level, adaptation, and several high-level effects (Zilany & Bruce, 2006; Zilany et al., 2014; Zilany et al., 2009). The model responses have been tested for both simple (e.g., tone) and complex (e.g., speech) stimuli over a wide range of frequencies and intensities spanning the dynamic range of hearing.
The robustness of the proposed neural-response-based system could lie in the underlying physiological mechanisms observed at the level of the auditory periphery. Since the AN model used in this study is nonlinear (i.e., it incorporates most of these nonlinear phenomena), it would be difficult to tease apart the contribution of each individual nonlinear mechanism to the classification performance. However, it is worth shedding some light on the possible mechanisms underlying the classification task, especially under noisy conditions. An AN fiber tends to fire at a particular phase of a stimulating low-frequency tone, meaning that it tends to produce spikes at integer multiples of the period of that tone. It has been reported that the magnitude of phase locking declines with frequency and that the limit of phase locking varies somewhat across species, but the upper frequency boundary lies at ~4-5 kHz (Palmer & Russell, 1986).

CHAPTER 6: CONCLUSIONS & FUTURE WORKS

6.1 Conclusions

In this study, we proposed a neural-response-based method for robust phoneme classification that works well both in quiet and in noisy environments. The proposed feature successfully captured the important distinguishing information about phonemes, making the system relatively robust against different types of degradation of the input acoustic signal. The neurogram was extracted from the responses of a physiologically-based model of the auditory periphery, and the features of each neurogram were extracted using the well-known Radon transform. The performance of the proposed method was evaluated in quiet and under noisy conditions such as additive noise, room reverberation and telephone channel noise, and was also compared with the classification accuracies of several existing methods. Based on the simulation results, the proposed method outperformed most of the traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions. The robustness of the proposed neural-response-based system could lie in the underlying physiological mechanisms observed at the level of the auditory periphery. The main findings of this study can be summarized as follows:

1. In the quiet environment and for speech with additive noise, the proposed feature outperformed all other baseline feature-based systems.

2. For reverberant speech, the GFCC feature achieved very good performance but suffered severely in other conditions, especially in quiet. The classification accuracy achieved by the proposed method was comparable to the results using the GFCC feature for reverberant speech.

3. The MFCC feature was less accurate under noisy conditions, especially for reverberant speech, but exhibited more robust behavior for telephone speech. The neural-response-based proposed feature also resulted in comparable performance (like MFCC) for channel distortions.

4. The results showed that the classification accuracy decreased when the SPL increased from 70 dB to 90 dB SPL in the quiet condition, which is consistent with published results from previous behavioral studies.

6.2 Limitations and future work

In this study, the responses of AN fibers at 32 CFs were simulated for the phoneme signals to construct the neurograms, which incurs high computational complexity. Although the AN model incorporates most of the nonlinearities observed at the peripheral level of the auditory system, the effect of the individual nonlinear mechanisms on phoneme recognition was not explored in this study.
In addition, although the classification performance under additive noise and channel distortions was satisfactory, the proposed system resulted in relatively poorer classification performance under reverberant conditions.

There are a number of possible directions in which the research presented in this thesis could be further explored. Some of these directions are summarized in the following. The proposed approach could be used in a hybrid phone-based architecture that integrates SVMs with HMMs for continuous speech recognition (Ganapathiraju et al., 2004; Krüger et al., 2005) and is expected to improve the recognition performance over HMM baselines. Since the AN model employed in this study is a computational model, it would allow the effects of the individual nonlinear phenomena on classification accuracy to be investigated; the outcome would have important implications for designing signal-processing strategies for hearing aids and cochlear implants. Because the computational model can also simulate the responses of impaired AN fibers, the present work could be extended to predict classification performance for people with hearing loss under diverse conditions.

LIST OF PUBLICATIONS AND CONFERENCE PROCEEDINGS

Alam, M. S., Jassim, W. A., & Zilany, M. S. (2017). Radon transform of auditory neurograms: a robust feature set for phoneme classification. IET Signal Processing, 12(3), 260-268.

Alam, M. S., Zilany, M. S., Jassim, W. A., & Ahmad, M. Y. (2017). Phoneme classification using the auditory neurogram. IEEE Access, 5, 633-642.

Alam, M. S., Jassim, W., & Zilany, M. S. (2014, December). Neural response based phoneme classification under noisy condition. In Intelligent Signal Processing and Communication Systems (ISPACS), 2014 International Symposium on (pp. 175-179). IEEE.

REFERENCES

Acero, A. (1990). Acoustical and environmental robustness in automatic speech recognition. Carnegie Mellon University, Pittsburgh.

Allen, J. B. (1994). How do humans process and recognize speech? Speech and Audio Processing, IEEE Transactions on, 2(4), 567-577.

Bondy, J., Becker, S., Bruce, I., Trainor, L., & Haykin, S. (2004). A novel signal-processing strategy for hearing-aid design: neurocompensation. Signal Processing, 84(7), 1239-1253.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Paper presented at the Proceedings of the fifth annual workshop on Computational learning theory.

Bou-Ghazale, S. E., & Hansen, J. H. (2000). A comparative study of traditional and newly proposed features for recognition of speech under stress. Speech and Audio Processing, IEEE Transactions on, 8(4), 429-442.

Bracewell, R. N. (1995). Two-dimensional imaging. Prentice-Hall Signal Processing Series. Englewood Cliffs, NJ: Prentice-Hall.

Brown, G. J., & Cooke, M. (1994). Computational auditory scene analysis. Computer Speech & Language, 8(4), 297-336.

Bruce, I. C. (2004). Physiological assessment of contrast-enhancing frequency shaping and multiband compression in hearing aids. Physiological Measurement, 25(4), 945.

Bruce, I. C., Sachs, M. B., & Young, E. D. (2003). An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses. The Journal of the Acoustical Society of America, 113(1), 369-388.

Carney, L. H., & Yin, T. (1988). Temporal coding of resonances by low-frequency auditory nerve fibers: single-fiber responses and a population model. Journal of Neurophysiology, 60(5), 1653-1677.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27. Chang, L., Xu, J., Tang, K., & Cui, H. (2012). A new robust pitch determination algorithm for telephone speech. Paper presented at the Information Theory and its Applications (ISITA), 2012 International Symposium on. Chen, R., & Jamieson, L. (1996). Experiments on the implementation of recurrent neural networks for speech phone recognition. Paper presented at the Signals, Systems and Computers, 1996. Conference Record of the Thirtieth Asilomar Conference on. Chengalvarayan, R., & Deng, L. (1997). HMM-based speech recognition using statedependent, discriminatively derived transforms on Mel-warped DFT features. Speech and Audio Processing, IEEE Transactions on, 5(3), 243-256. Clarkson, P., & Moreno, P. J. (1999). On the use of support vector machines for phonetic classification. Paper presented at the Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297. Davis, S. B., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, 28(4), 357-366. Deng, L., & Geisler, C. D. (1987). A composite auditory model for processing speech sounds. The Journal of the Acoustical Society of America, 82(6), 2001-2012. Ellis, D. P. (2005). {PLP} and {RASTA}(and {MFCC}, and inversion) in {M} atlab. Fletcher, H., & Galt, R. H. (1950). The perception of speech and its relation to telephony. The Journal of the Acoustical Society of America, 22(2), 89-151. French, N., & Steinberg, J. (1947). Factors governing the intelligibility of speech sounds. The Journal of the Acoustical Society of America, 19(1), 90-119. 89 Friedman, J. (1996). Another approach to polychotomous classifcation. Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep. Gales, M. J. (2011). Model-based approaches to handling uncertainty Robust Speech Recognition of Uncertain or Missing Data (pp. 101-125): Springer. Gales, M. J. F. (1995). Model-ased techniques for noise ro ust speech recognition. Ganapathiraju, A., Hamaker, J., & Picone, J. (2000). Hybrid SVM/HMM architectures for speech recognition. Paper presented at the INTERSPEECH. Ganapathiraju, A., Hamaker, J. E., & Picone, J. (2004). Applications of support vector machines to speech recognition. Signal Processing, IEEE Transactions on, 52(8), 2348-2355. Ganapathy, S., Thomas, S., & Hermansky, H. (2009). Modulation frequency features for phoneme recognition in noisy speech. The Journal of the Acoustical Society of America, 125(1), EL8-EL12. Ganapathy, S., Thomas, S., & Hermansky, H. (2010). Temporal envelope compensation for robust phoneme recognition using modulation spectrum. The Journal of the Acoustical Society of America, 128(6), 3769-3780. Garofolo, J. S., & Consortium, L. D. (1993). TIMIT: acoustic-phonetic continuous speech corpus: Linguistic Data Consortium. Gelbart, D., & Morgan, N. (2001). Evaluating long-term spectral subtraction for reverberant ASR. Paper presented at the Automatic Speech Recognition and Understanding, 2001. ASRU'01. IEEE Workshop on. Ghitza, O. (1988). Temporal non-place information in the auditory-nerve firing patterns as a front-end for speech recognition in a noisy environment. Journal of phonetics, 16, 109-123. Gifford, M. L., & Guinan Jr, J. J. (1983). 
Effects of crossed‐olivocochlear‐bundle stimulation on cat auditory nerve fiber responses to tones. The Journal of the Acoustical Society of America, 74(1), 115-123. 90 Gu, Y. J., & Sacchi, M. (2009). Radon transform methods and their applications in mapping mantle reflectivity structure. Surveys in geophysics, 30(4-5), 327-354. Halberstadt, A. K. (1998). Heterogeneous acoustic measurements and multiple classifiers for speech recognition. Massachusetts Institute of Technology. Halberstadt, A. K., & Glass, J. R. (1997). Heterogeneous acoustic measurements for phonetic classification 1. Paper presented at the EUROSPEECH. Hansen, J. H. (1996). Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication, 20(1), 151-173. Heinz, M. G., Colburn, H. S., & Carney, L. H. (2001). Evaluating auditory performance limits: I. One-parameter discrimination using a computational model for the auditory nerve. Neural Computation, 13(10), 2273-2316. Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4), 1738-1752. Hermansky, H. (1998). Should recognizers have ears? Speech Communication, 25(1), 327. Hermansky, H., & Morgan, N. (1994). RASTA processing of speech. Speech and Audio Processing, IEEE Transactions on, 2(4), 578-589. Hewitt, M. J., & Meddis, R. (1993). Regularity of cochlear nucleus stellate cells: a computational modeling study. The Journal of the Acoustical Society of America, 93(6), 3390-3399. Holmberg, M., Gelbart, D., & Hemmert, W. (2006). Automatic speech recognition with an adaptation model motivated by auditory processing. Audio, Speech, and Language Processing, IEEE Transactions on, 14(1), 43-49. Hopkins, K., & Moore, B. C. (2009). The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise. The Journal of the Acoustical Society of America, 125(1), 442-446. 91 Jankowski Jr, C. R., Vo, H.-D. H., & Lippmann, R. P. (1995). A comparison of signal processing front ends for automatic word recognition. Speech and Audio Processing, IEEE Transactions on, 3(4), 286-293. Jeon, W., & Juang, B.-H. (2007). Speech analysis in a model of the central auditory system. Audio, Speech, and Language Processing, IEEE Transactions on, 15(6), 1802-1817. Johnson, M. T., Povinelli, R. J., Lindgren, A. C., Ye, J., Liu, X., & Indrebo, K. M. (2005). Time-domain isolated phoneme classification using reconstructed phase spaces. Speech and Audio Processing, IEEE Transactions on, 13(4), 458-466. Junqua, J.-C., & Anglade, Y. (1990). Acoustic and perceptual studies of Lombard speech: Application to isolated-words automatic speech recognition. Paper presented at the Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on. Kingsbury, B. E., & Morgan, N. (1997). Recognizing reverberant speech with RASTAPLP. Paper presented at the Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. Kirchhoff, K., Fink, G. A., & Sagerer, G. (2002). Combining acoustic and articulatory feature information for robust speech recognition. Speech Communication, 37(3), 303-319. Kollmeier, B. (2003). Auditory Principles in Speech Processing-Do Computers Need Silicon Ears? Paper presented at the Eighth European Conference on Speech Communication and Technology. Krishnamoorthy, P., & Prasanna, S. M. (2009). Reverberant speech enhancement by temporal and spectral processing. 
Audio, Speech, and Language Processing, IEEE Transactions on, 17(2), 253-266. Krüger, S. E., Schafföner, M., Katz, M., Andelic, E., & Wendemuth, A. (2005). Speech recognition with support vector machines in a hybrid system. Paper presented at the INTERSPEECH. 92 Layton, M., & Gales, M. J. (2006). Augmented statistical models for speech recognition. Paper presented at the Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. Lee, K.-F., & Hon, H.-W. (1989). Speaker-independent phone recognition using hidden Markov models. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(11), 1641-1648. Liberman, M. C. (1978). Auditory‐nerve response from cats raised in a low‐noise chamber. The Journal of the Acoustical Society of America, 63(2), 442-455. Liberman, M. C., & Dodds, L. W. (1984). Single-neuron labeling and chronic cochlear pathology. III. Stereocilia damage and alterations of threshold tuning curves. Hearing research, 16(1), 55-74. Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22(1), 1-15. Lopes, C., & Perdigao, F. (2011). Phone recognition on the TIMIT database. Speech Technologies/Book, 1, 285-302. Mayoraz, E., & Alpaydin, E. (1999). Support vector machines for multi-class classification Engineering Applications of Bio-Inspired Artificial Neural Networks (pp. 833-842): Springer. McCourt, P., Harte, N., & Vaseghi, S. (2000). Discriminative multi-resolution sub-band and segmental phonetic model combination. Electronics Letters, 36(3), 270-271. Meyer, B. T., Wächter, M., Brand, T., & Kollmeier, B. (2007). Phoneme confusions in human and automatic speech recognition. Paper presented at the INTERSPEECH. Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of experimental psychology, 41(5), 329. 93 Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27(2), 338-352. Miller, M. I., Barta, P. E., & Sachs, M. B. (1987). Strategies for the representation of a tone in background noise in the temporal aspects of the discharge patterns of auditory‐nerve fibers. The Journal of the Acoustical Society of America, 81(3), 665-679. Miller, R. L., Schilling, J. R., Franck, K. R., & Young, E. D. (1997). Effects of acoustic trauma on the representation of the vowel/ε/in cat auditory nerve fibers. The Journal of the Acoustical Society of America, 101(6), 3602-3616. Molis, M. R., & Summers, V. (2003). Effects of high presentation levels on recognition of low-and high-frequency speech. Acoustics Research Letters Online, 4(4), 124-128. Nakatani, T., Kellermann, W., Naylor, P., Miyoshi, M., & Juang, B. H. (2010). Introduction to the special issue on processing reverberant speech: methodologies and applications. IEEE Transactions on Audio, Speech, and Language Processing, 7(18), 1673-1675. Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B.-H. (2009). Realtime speech enhancement in noisy reverberant multi-talker environments based on a location-independent room acoustics model. Paper presented at the Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. Narayan, S. S., Temchin, A. N., Recio, A., & Ruggero, M. A. (1998). Frequency tuning of basilar membrane and auditory nerve fibers in the same cochleae. Science, 282(5395), 1882-1884. Oppenheim, A. 
V., Schafer, R. W., & Buck, J. R. (1989). Discrete-time signal processing (Vol. 2): Prentice-hall Englewood Cliffs. Palmer, A., & Russell, I. (1986). Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hearing research, 24(1), 1-15. 94 Patterson, R., Nimmo-Smith, I., Holdsworth, J., & Rice, P. (1987). An efficient auditory filterbank based on the gammatone function. Paper presented at the a meeting of the IOC Speech Group on Auditory Modelling at RSRE. Perdigao, F., & Sá, L. (1998). Auditory models as front-ends for speech recognition. Proc. NATO ASI on Computational Hearing, 179-184. Pollack, I., & Pickett, J. (1958). Masking of speech by noise at high sound levels. The Journal of the Acoustical Society of America, 30(2), 127-130. Pulakka, H., & Alku, P. (2011). Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum. Audio, Speech, and Language Processing, IEEE Transactions on, 19(7), 21702183. Qi, J., Wang, D., Jiang, Y., & Liu, R. (2013). Auditory features based on gammatone filters for robust speech recognition. Paper presented at the Circuits and Systems (ISCAS), 2013 IEEE International Symposium on. Rajput, H., Som, T., & Kar, S. (2016). Using Radon Transform to Recognize Skewed Images of Vehicular License Plates. Computer, 49(1), 59-65. Reynolds, D. A. (1997). HTIMIT and LLHDB: speech corpora for the study of handset transducer effects. Paper presented at the Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. Reynolds, T. J., & Antoniou, C. A. (2003). Experiments in speech recognition using a modular MLP architecture for acoustic modelling. Information sciences, 156(1), 39-54. Rifkin, R., Schutte, K., Saad, M., Bouvrie, J., & Glass, J. (2007). Noise robust phonetic classificationwith linear regularized least squares and second-order features. Paper presented at the Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. Robert, A., & Eriksson, J. L. (1999). A composite model of the auditory periphery for simulating responses to complex sounds. The Journal of the Acoustical Society of America, 106(4), 1852-1864. 95 Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. Neural Networks, IEEE Transactions on, 5(2), 298-305. Robles, L., & Ruggero, M. A. (2001). Mechanics of the mammalian cochlea. Physiological reviews, 81(3), 1305-1352. Scanlon, P., Ellis, D. P., & Reilly, R. B. (2007). Using broad phonetic group experts for improved speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 15(3), 803-812. Schluter, R., Bezrukov, L., Wagner, H., & Ney, H. (2007). Gammatone features and feature combination for large vocabulary speech recognition. Paper presented at the Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. Sellick, P., Patuzzi, R., & Johnstone, B. (1982). Measurement of basilar membrane motion in the guinea pig using the Mössbauer technique. The Journal of the Acoustical Society of America, 72(1), 131-141. Sha, F., & Saul, L. K. (2006). Large margin Gaussian mixture modeling for phonetic classification and recognition. Paper presented at the Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. Shao, Y., Jin, Z., Wang, D., & Srinivasan, S. (2009). An auditory-based feature for robust speech recognition. 
Paper presented at the Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. Shao, Y., Srinivasan, S., & Wang, D. (2007). Incorporating auditory feature uncertainties in robust speaker identification. Paper presented at the Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. Smith, N., & Niranjan, M. (2001). Data-dependent kernels in SVM classification of speech patterns. 96 Soong, F. K., Rosenberg, A. E., Juang, B.-H., & Rabiner, L. R. (1987). Report: A vector quantization approach to speaker recognition. AT&T technical journal, 66(2), 14-26. Sroka, J. J., & Braida, L. D. (2005). Human and machine consonant recognition. Speech Communication, 45(4), 401-423. Strope, B., & Alwan, A. (1997). A model of dynamic auditory perception and its application to robust word recognition. Speech and Audio Processing, IEEE Transactions on, 5(5), 451-464. Studebaker, G. A., Sherbecoe, R. L., McDaniel, D. M., & Gwaltney, C. A. (1999). Monosyllabic word recognition at higher-than-normal speech and noise levels. The Journal of the Acoustical Society of America, 105(4), 2431-2444. Systems, C. o. N. I. P., & Weiss, Y. (2006). Advances in neural information processing systems: MIT Press. Tan, Q., & Carney, L. H. (2003). A phenomenological model for the responses of auditory-nerve fibers. II. Nonlinear tuning with a frequency glide. The Journal of the Acoustical Society of America, 114(4), 2007-2020. Tchorz, J., & Kollmeier, B. (1999). A model of auditory perception as front end for automatic speech recognition. The Journal of the Acoustical Society of America, 106(4), 2040-2050. Vapnik, V. (2013). The nature of statistical learning theory: Springer Science & Business Media. Wang, Y. (2015). Model-based Approaches to Robust Speech Recognition in Diverse Environments. Wilson, B. S., Schatzer, R., Lopez-Poveda, E. A., Sun, X., Lawson, D. T., & Wolford, R. D. (2005). Two new directions in speech processor design for cochlear implants. Ear and hearing, 26(4), 73S-81S. 97 Wong, J. C., Miller, R. L., Calhoun, B. M., Sachs, M. B., & Young, E. D. (1998). Effects of high sound levels on responses to the vowel/ε/in cat auditory nerve. Hearing research, 123(1), 61-77. Yoshioka, T., Sehr, A., Delcroix, M., Kinoshita, K., Maas, R., Nakatani, T., & Kellermann, W. (2012). Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. Signal Processing Magazine, IEEE, 29(6), 114-126. Yousafzai, J., Ager, M., Cvetković, Z., & Sollich, P. (2008). Discriminative and generative machine learning approaches towards robust phoneme classification. Paper presented at the Information Theory and Applications Workshop, 2008. Yousafzai, J., Sollich, P., Cvetković, Z., & Yu, B. (2011). Combined features and kernel design for noise robust phoneme classification using support vector machines. Audio, Speech, and Language Processing, IEEE Transactions on, 19(5), 13961407. Zhang, X., Heinz, M. G., Bruce, I. C., & Carney, L. H. (2001). A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. The Journal of the Acoustical Society of America, 109(2), 648-670. Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582-589. Zilany, M. S., & Bruce, I. C. (2006). 
Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. The Journal of the Acoustical Society of America, 120(3), 1446-1466. Zilany, M. S., & Bruce, I. C. (2007). Representation of the vowel/ε/in normal and impaired auditory nerve fibers: model predictions of responses in cats. The Journal of the Acoustical Society of America, 122(1), 402-417. Zilany, M. S., Bruce, I. C., & Carney, L. H. (2014). Updated parameters and expanded simulation options for a model of the auditory periphery. The Journal of the Acoustical Society of America, 135(1), 283-286. 98 Zilany, M. S., Bruce, I. C., Nelson, P. C., & Carney, L. H. (2009). A phenomenological model of the synapse between the inner hair cell and auditory nerve: long-term adaptation with power-law dynamics. The Journal of the Acoustical Society of America, 126(5), 2390-2412. 99 APPENDIX A – CONFUSION MATRICES A confusion matrix is formed by creating rows and columns for each class in the set. The rows represent the expert labels or true classes of the testing exemplars, while the columns correspond to the classifier system’s output or it’s hypothesized class label. The confusions are then tabulated and inserted into the correct position in the matrix. For example, consider the Table 1. This table represents the confusion matrix for four phonemes namely: /p/, /t/, /k/, /b/. The phoneme ‘/p/’ was classified correctly twelve times, and classified as ‘/t/’ five times, ‘/k/’ three times, and ‘/b/’ never. Correct classifications are displayed in main diagonal. Off-diagonal represent the errors. Confusion matrices for different features in diverse environmental distortions are presented in table 1 to table 16. We show the confusion matrices for clean condition, additive noise (SSN at 0dB), reverberant speech (RIR 344 ms) and telephone speech (channel cb 1). The main diagonal of the matrix is indicated in bold. Table 1: Example of a confusion matrix /p/ /t/ /k/ /b/ /p/ 12 5 3 0 /t/ 5 10 0 5 /k/ 3 1 14 2 /b/ 1 2 4 13 100 Table 1: Confusion matrix showing classification performance of the proposed method in clean condition. The overall accuracy is 69.19 %. Table 2: Confusion matrix showing classification performance of the MFCC-based feature at clean condition. The overall accuracy is 60.12 %. 101 Table 3: Confusion matrix showing classification performance of the GFCC-based feature at clean condition. The overall accuracy is 49.93 %. Table 4: Confusion matrix showing classification performance of the FDLP-based feature at clean condition. The overall accuracy is 50.57 %. 102 Table5.Confusion matrix of the proposed method at an SNR of 0 dB (under speechshaped noise). The overall accuracy is 33.90 %. Table 6. Confusion matrix of the MFCC-based feature at an SNR of 0 dB (under speech-shaped noise). The overall accuracy is 31.71 %. 103 Table 7. Confusion matrix of the GFCC-based feature at an SNR of 0 dB (under speech-shaped noise). The overall accuracy is 31.95 %. Table 8. Confusion matrix of the FDLP-based feature at an SNR of 0 dB (under speech-shaped noise). The overall accuracy is 27.39 %. 104 Table 9. Confusion matrix of the proposed method under reverberant speech (RIR is 344 ms). The overall accuracy is 34.91 %. Table 10. Confusion matrix of the MFCC-based feature under reverberant speech (RIR is 344 ms). The overall accuracy is 25.28 %. 105 Table 11. Confusion matrix of the GFCC-based feature under reverberant speech (RIR is 344 ms). The overall accuracy is 36.18 %. Table 12. 
Confusion matrix of the FDLP-based feature under reverberant speech (RIR is 344 ms). The overall accuracy is 19.81%.

Table 13. Confusion matrix of the proposed method under telephone speech (channel cb1). The overall accuracy is 50.84%.

Table 14. Confusion matrix of the MFCC-based feature under telephone speech (channel cb1). The overall accuracy is 50.25%.

Table 15. Confusion matrix of the GFCC-based feature under telephone speech (channel cb1). The overall accuracy is 17.26%.

Table 16. Confusion matrix of the FDLP-based feature under telephone speech (channel cb1). The overall accuracy is 28.08%.