ROBUST PHONEME CLASSIFICATION USING RADON
TRANSFORM OF AUDITORY NEUROGRAMS
MD. SHARIFUL ALAM
DEPARTMENT OF BIOMEDICAL ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF MALAYA
KUALA LUMPUR
2016
ROBUST PHONEME CLASSIFICATION USING
RADON TRANSFORM OF AUDITORY NEUROGRAMS
MD. SHARIFUL ALAM
DISSERTATION SUBMITTED IN FULFILMENT OF
THE REQUIREMENTS FOR THE DEGREE OF MASTER
OF ENGINEERING SCIENCE
DEPARTMENT OF BIOMEDICAL ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF MALAYA
KUALA LUMPUR
2016
UNIVERSITY OF MALAYA
ORIGINAL LITERARY WORK DECLARATION
Name of Candidate: Md. Shariful Alam
(I.C/Passport No: BH0110703)
Registration/Matric No: KGA140010
Name of Degree: Master of Engineering Science
Title of Dissertation:
ROBUST PHONEME CLASSIFICATION USING RADON TRANSFORM OF
AUDITORY NEUROGRAMS
Field of Study: Signal processing.
I do solemnly and sincerely declare that:
(1) I am the sole author/writer of this Work;
(2) This Work is original;
(3) Any use of any work in which copyright exists was done by way of fair
dealing and for permitted purposes and any excerpt or extract from, or
reference to or reproduction of any copyright work has been disclosed
expressly and sufficiently and the title of the Work and its authorship have
been acknowledged in this Work;
(4) I do not have any actual knowledge nor do I ought reasonably to know that
the making of this work constitutes an infringement of any copyright work;
(5) I hereby assign all and every rights in the copyright to this Work to the
University of Malaya (“UM”), who henceforth shall be owner of the
copyright in this Work and that any reproduction or use in any form or by any
means whatsoever is prohibited without the written consent of UM having
been first had and obtained;
(6) I am fully aware that if in the course of making this Work I have infringed
any copyright whether intentionally or otherwise, I may be subject to legal
action or any other action as may be determined by UM.
Candidate’s Signature
Date:
Subscribed and solemnly declared before,
Witness’s Signature
Date:
Name:
Designation:
ABSTRACT
The use of speech recognition technology has increased considerably in the last three
decades. In the real world, the performance of well-trained speech recognizers is usually
degraded by different types of noise and distortions such as background noise,
reverberation and telephone channels. In particular, speech signals are extremely difficult
to recognize due to the interference created by reverberation and the limited bandwidth of
transmission channels. The accuracy of traditional speech recognition systems in noisy
environments is much lower than the recognition accuracy of an average human being.
Robustness of speech recognition systems must be addressed for practical applications.
Although many successful techniques have been developed for dealing with clean signal
and noise, particularly uncorrelated noise with simple spectral characteristics (e.g.,
white noise), the problem of sound reverberation and channel distortions has remained
essentially unsolved. This problem hampers the wider use of acoustic interfaces for
many applications.
Unlike traditional methods in which features are extracted from the properties of the
acoustic signal, this study proposes a phoneme classification technique using neural
responses from a physiologically-based computational model of the auditory periphery.
The 2-D neurograms were constructed from the simulated responses of the auditory-nerve fibers to speech phonemes. The features of the neurograms were extracted using
the Radon transform and used to train the classification system using a support vector
machine classifier. Classification performances were evaluated for phonemes extracted
from the TIMIT and HTIMIT databases. Experiments were performed in mismatched
train/test conditions, where the test data in these experiments consist of speech corrupted
by a variety of real-world additive noises at different signal-to-noise ratios (SNRs),
convolutive distortions introduced by different room impulse response functions, and
multiple telephone channel speech recordings with different frequency characteristics.
Performances of the proposed method were compared to those of Mel-Frequency
Cepstral Coefficients (MFCC), Gamma-tone Frequency Cepstral Coefficients (GFCC),
and Frequency Domain Linear Prediction (FDLP)-based phoneme classifiers. Based on
simulation results, the proposed method outperformed most of the traditional acoustic-property-based phoneme classification methods both in quiet and under noisy
conditions. This approach is accurate, easy to implement, and can be used without any
knowledge about the type of distortion in the signal, i.e., it can handle any type of noise.
Using hybrid (support vector machine/hidden Markov model) classifiers, the proposed
method could be extended to develop an automatic speech recognition system.
ABSTRAK
Semenjak 3 dekad yang lalu, penggunaan teknologi pengecaman pertuturan semakin
meningkat dengan mendadak. Dalam situasi praktikal, prestasi teknologi pengecaman
pertuturan selalunya jatuh disebabkan pelbagai jenis bunyi bising seperti bunyi latar
belakang, bunyi gema dan herotan telefon. Khususnya isyarat ucapan dengan bunyi latar
belakang atau herotan saluran amat sukar untuk dikecam disebabkan gangguan tersebut.
Ketepatan sistem pengecaman pertuturan tradisional dalam keadaan bising adalah amat
rendah berbanding prestasi pengecaman seorang manusia sederhana. Oleh itu,
kemantapan sistem pengecaman pertuturan hendaklah diselidiki bagi aplikasi praktikal.
Walaupun pelbagai teknik telah berjaya diusahakan bagi mengendalikan isyarat dalam
keadaan bersih dan bising (khususnya bunyi bising dengan ciri-ciri spektral yang asas),
cabaran gangguan bunyi gema serta gangguan saluran masih belum diselesaikan. Masalah
ini menjadi suatu halangan bagi penggunaan antara muka akustik dalam ramai aplikasi.
Berlainan dengan kaedah tradisional di mana ciri-ciri diambil dari isyarat akustik, kajian
ini mencadangkan kaedah klasifikasi fonem yang berdasarkan tindak balas neural
daripada model pengiraan fisiologi sistem periferi pendengaran. Neurogram 2-D telah
diperolehi daripada simulasi tindak balas saraf auditori terhadap fonem pertuturan. Ciri-ciri neurogram telah diperolehi menggunakan transformasi Radon dan diguna untuk
melatih sistem pengelasan mesin vektor sokongan. Prestasi pengelasan bagi sistem
tersebut telah dinilai menggunakan fonem-fonem berbeza dari pangkalan data TIMIT
dan HTIMIT. Eksperimen telah dijalankan dalam keadaan latihan/ujian ‘mismatch’ di
mana data bagi proses ujian ditambah dengan herotan yang berbeza pada nisbah isyarat-hingar (SNR) berlainan, herotan kusut serta saluran telefon yang berlainan. Prestasi
kaedah yang dicadangkan telah dibandingkan dengan kaedah pengelasan Mel-Frequency Cepstral Coefficients (MFCC), Frequency Domain Linear Prediction (FDLP)
dan gammatone frequency cepstral coefficients (GFCC). Berdasarkan keputusan
simulasi, kaedah yang dicadangkan telah menunjukkan prestasi yang lebih cemerlang
berbanding kaedah tradisional berdasarkan ciri akustik dalam keadaan bising dan
senyap. Kaedah ini tepat, senang diguna dan boleh diguna tanpa maklumat mengenai
jenis herotan isyarat. Dengan menggunakan pengelas hibrid (mesin vektor sokongan,
model terselindung Markov), kaedah ini boleh diguna bagi mengusahakan sistem
pengecaman pertuturan automatik.
ACKNOWLEDGEMENTS
In the name of Allah the most Merciful and Beneficent
First and Foremost praise is to ALLAH, the Almighty, the greatest of all, on whom
ultimately we depend for sustenance and guidance. I would like to thank Almighty
Allah for giving me opportunity, determination and strength to do my research. His
continuous grace and mercy was with me throughout my life and ever more during the
tenure of my research.
This research, conducted at the Auditory Neuroscience (AN) lab, would not have been
possible without the direct and indirect aid I received from so many others, including
teachers, family and friends. I would like to gratefully and sincerely thank my principal
supervisor, Dr. Muhammad Shamsul Arefeen Zilany, for his guidance, understanding,
patience, and most importantly, his friendship during my graduate studies. I also want to
thank my co-supervisor Dr. Mohd Yazed Bin AHMAD for his continuous support
throughout the entire project. I want to express my thanks for the financial support I
received from the HIR projects of UM.
My heartfelt gratitude also goes to Dr. Wissam A. Jassim, a research fellow in the
Department of Electrical Engineering, who dutifully and patiently taught me the hands-on
aspects of developing the software and shared ideas on solving the problems that arose
in completing this thesis. Thanks also to my research-mates for sharing
their thoughts to improve the research outcome.
I owe everything to my parents and siblings who encouraged and helped me at every
stage of my personal and academic life and longed to see this achievement come true. I
dedicate this work to my wife and son (Saad Abdullah).
TABLE OF CONTENTS

Abstract
Abstrak
Acknowledgements
List of Figures
List of Tables
List of Symbols and Abbreviations

CHAPTER 1: INTRODUCTION
1.1 Phoneme classification
1.2 Automatic speech recognition
1.3 Problem Statement
1.4 Motivation
1.4.1 Comparisons between Humans and Machines
1.4.2 Neural-response-based feature
1.5 Objectives of this study
1.6 Scope of the study
1.7 Organization of Thesis

CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
2.2 Research background
2.3 Existing metrics
2.3.1 Mel-frequency Cepstral Coefficient (MFCC)
2.3.2 Gammatone Frequency Cepstral Coefficient (GFCC)
2.3.3 Frequency Domain Linear Prediction (FDLP)
2.4 Structure and Function of the Auditory System
2.4.1 Outer Ear
2.4.2 Middle Ear (ME)
2.4.3 Inner Ear
2.4.4 Basilar Membrane Responses
2.4.5 Auditory Nerve
2.5 Brief history of Auditory Nerve (AN) Modeling
2.5.1 Description of the computational model of AN
2.5.1.1 C1 Filter
2.5.1.2 Feed forward control path (including OHC)
2.5.1.3 C2 Filter
2.5.1.4 The Inner hair cell (IHC)
2.5.2 Envelope (ENV) and Temporal Fine Structure (TFS) neurogram
2.6 Support Vector Machines
2.6.1 Linear Classifiers
2.6.2 Non-linear Classifiers
2.6.3 Kernels
2.6.4 Multi-class SVMs
2.7 Radon Transform
2.7.1 Theoretical foundation
2.7.2 How Radon transform works
2.7.3 Current applications

CHAPTER 3: METHODOLOGY
3.1 System overview
3.2 Datasets
3.2.1 TIMIT database
3.2.2 HTIMIT corpus
3.3 AN model and neurogram
3.4 Feature extraction using Radon transform
3.5 SVM classifier
3.6 Environmental Distortions
3.6.1 Existing Strategies to Handle Environment Distortions
3.7 Generation of noise
3.7.1 Speech with additive noise
3.7.2 Reverberant Speech
3.7.3 Telephone speech
3.8 Similarity measure
3.9 Procedure
3.9.1 Feature extraction using MFCC, GFCC and FDLP for classification

CHAPTER 4: RESULTS
4.1 Introduction
4.2 Overview of result
4.3 Classification accuracies (%) for phonemes in quiet environment
4.4 Performance for signal with additive noise
4.5 Performance for reverberant speech
4.6 Signals distorted by noise due to telephone channel

CHAPTER 5: DISCUSSIONS
5.1 Introduction
5.2 Broad class accuracy
5.3 Comparison of results from previous studies
5.4 Effect of the number of Radon angles on classification results
5.5 Effect of SPL on classification results
5.6 Effect of window length on classification results
5.7 Effect of number of CFs on classification results
5.8 Robustness property of the proposed system

CHAPTER 6: CONCLUSIONS & FUTURE WORKS
6.1 Conclusions
6.2 Limitations and future work

LIST OF PUBLICATIONS AND CONFERENCE PROCEEDINGS
REFERENCES
APPENDIX A – CONFUSION MATRICES
LIST OF FIGURES

Figure 1.1: Overview of phoneme classification
Figure 1.2: Difference between phoneme classification and phoneme recognition. The small box encloses the task of phoneme classifier, and the big box encloses the task of a phoneme recogniser. After the phonemes have been classified in (b), the dynamic programming method (c) finds the most likely sequence of phonemes (d).
Figure 1.3: Architecture of an ASR system [adapted from (Wang, 2015)]
Figure 2.1: Illustration of block diagram for MFCC derivation.
Figure 2.2: Illustration of methodology to extract GFCC feature
Figure 2.3: Deriving sub-band temporal envelopes from speech signal using FDLP.
Figure 2.4: Illustration of the structure of the auditory system showing outer, middle and inner ear (Reproduced from Encyclopaedia Britannica, Inc. 1997)
Figure 2.5: Motions of BM at different frequencies (Reproduced from Encyclopaedia Britannica, Inc. 1997)
Figure 2.6: Model of one local peripheral section. It includes outer/ME, BM, and IHC–AN synapse models. (Robert & Eriksson, 1999)
Figure 2.7: The model of the auditory peripheral system developed by Bruce et al. (Bruce et al., 2003), modified from Zhang et al. (Zhang et al., 2001)
Figure 2.8: Schematic diagram of the auditory periphery. The model consists of a ME filter, a feed-forward control path, two signal paths (C1 and C2), the inner hair cell (IHC) and outer hair cell (OHC), followed by the synapse model with spike generator. (Zilany & Bruce, 2006)
Figure 2.9: (A) Schematic diagram of the model of the auditory periphery (B) IHC-AN synapse model: exponential adaptation followed by parallel PLA models (slow and fast).
Figure 2.10: (a) a separating hyperplane. (b) The hyperplane that maximizes the margin of separability
Figure 3.1: Block diagram of the proposed phoneme classifier.
Figure 3.2: Time-frequency representations of speech signals. (A) a typical speech waveform (to produce spectrogram and neurogram of that signal), (B) the corresponding spectrogram responses, and (C) the respective neurogram responses.
Figure 3.3: Geometry of the DRT
Figure 3.4: This figure shows how Radon transforms work. (a) a binary image (b) Radon Transform at 0 Degree (c) Radon Transform at 45 Degree.
Figure 3.5: Types of environmental noise which can affect speech signals (Wang, 2015)
Figure 3.6: Impact of environment distortions on clean speech signals in various domains: (a) clean speech signal (b) speech corrupted with background noise (c) speech degraded by reverberation (d) telephone speech signal. (e)-(h) Spectrum of the respective signals shown in (a)–(d). (i)-(l) Neurogram of the respective signals shown in (a)–(d). (m)-(p) Radon coefficient of the respective signals shown in (a)–(d).
Figure 3.7: Comparison of clean and reverberant speech signals for phoneme /aa/: (a) clean speech, (b) signal corrupted by reverberation (c)-(d) Spectrogram of the respective signals shown in (a)–(b) and (e)-(f) Radon coefficient of the respective signals shown in (a)-(b).
Figure 3.8: A simplified environment model where background noise and channel distortion dominate.
Figure 3.9: Neurogram-based feature extraction for the proposed method: (a) a typical phoneme waveform (/aa/), (b) speech corrupted with SSN (10 dB) (c) speech corrupted with SSN (0 dB). (d)-(f) Neurogram of the respective signals shown in (a)–(c). (g)-(h) Radon coefficient of the respective signals shown in (a)–(b). (j) Radon coefficient of the respective signals shown in (a)-(c).
Figure 4.1: Broad phoneme classification accuracies (%) for different features in various noise types at different SNR values. Clean condition is denoted as Q.
Figure 5.1: Example of Radon coefficient representations: (a) stop /p/ (b) fricative /s/ (c) nasal /m/ (d) vowel /aa/.
Figure 5.2: Example of Radon coefficient representations for stops: (a) /p/ (b) /t/ (c) /k/ and (d) /b/
Figure 5.3: The correlation coefficient (a) MFCC features extracted from the phoneme in quiet (solid line) and at an SNR of 0 dB (dotted line) condition. The correlation coefficient between the two vectors was 0.76. (b) FDLP features under clean and noisy conditions. The correlation coefficient between the two cases was 0.72. (d) Neurogram responses of the phoneme under clean and noisy conditions. The correlation coefficient between the two vectors was 0.85.
LIST OF TABLES

Table 1.1: Human versus machine speech recognition performance (Halberstadt, 1998).
Table 1.2: Human and machine recognition results. All percentages are word error rates. The best result is indicated in bold.
Table 3.1: Mapping from 61 classes to 39 classes, as proposed by Lee and Hon (Lee & Hon, 1989).
Table 3.2: Broad classes of phones proposed by Reynolds and Antoniou (T. J. Reynolds & Antoniou, 2003).
Table 3.3: Number of tokens in phonetic subclasses for train and test sets
Table 4.1: Classification accuracies (%) of individual and broad class phonemes for different feature extraction techniques on clean speech, speech with additive noise (average performance of six noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs), reverberant speech (average performance for eight room impulse response functions), and telephone speech (average performance for nine channel conditions). The best performance for each condition is indicated in bold.
Table 4.2: Confusion matrices for segment classification in clean condition.
Table 4.3: Classification accuracies (%) of broad phonetic classes in clean condition.
Table 4.4: Individual phoneme classification accuracies (%) for different feature extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold.
Table 4.5: Individual phoneme classification accuracies (%) for different feature extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs. The best performance for each condition is indicated in bold.
Table 4.6: Classification accuracies (%) in eight different reverberation test sets. The best performance for each condition is indicated in bold. The last column shows the average value, indicated as "Avg".
Table 4.7: Classification accuracies (%) for signals distorted by nine different telephone channels. The best performance is indicated in bold. The last column shows the average value, indicated as "Avg".
Table 5.1: Correlation measure in acoustic and Radon domains for different phonemes
Table 5.2: Phoneme classification accuracies (%) on the TIMIT core test set (24 speakers) and complete test set (168 speakers) in quiet condition for individual phones (denoted as Single) and broad classes (denoted as Broad). Here, RPS is the abbreviation of reconstructed phase space.
Table 5.3: Phoneme classification accuracies (%) as a function of the number of Radon angles.
Table 5.4: Effects of SPL on classification accuracy (%).
Table 5.5: Effect of window size on classification performance (%).
Table 5.6: Effect of number of CFs on classification performance (%).
Table 5.7: Correlation measure for different phonemes and their corresponding noisy (SSN) phonemes (clean-10dB and clean-0dB) in different domains. Average correlation measure (Avg) of seven phonemes is indicated in bold (last row).
LIST OF SYMBOLS AND ABBREVIATIONS

ABWE : Artificial bandwidth extension
AF : Articulatory features
AMR : Adaptive multi-rate
AN : Auditory nerve
ANN : Artificial neural networks
AP : Action potential
AR : Auto-regressive
ASR : Automatic speech recognition
BF : Best frequency
BM : Basilar membrane
CF : Characteristic frequency
C-SVC : C-support vector classification
DCT : Discrete cosine transform
DRT : Discrete Radon transform
DTW : Dynamic time warping
FDLP : Frequency domain linear prediction
FTC : Frequency tuning curve
GF : Gammatone feature
GFCC : Gammatone frequency cepstral coefficients
GMM : Gaussian mixture model
HMM : Hidden Markov models
HTIMIT : Handset TIMIT
IHC : Inner hair cells
LPC : Linear predictive coding
LVASR : Large vocabulary ASR
LVCSR : Large vocabulary continuous speech recognition
OHC : Outer hair cells
OVO : One versus one
OVR : One versus rest
PLP : Perceptual linear prediction
PSTH : Post stimulus time histogram
RASTA : Relative spectra
RBF : Radial basis function
RIR : Room impulse response
SRM : Structural risk minimization
SVM : Support vector machines
CHAPTER 1: INTRODUCTION
Automatic speech recognition (ASR) has been extensively studied in the past several
decades. Driven by both commercial and military interests, ASR technology has been
developed and investigated on a great variety of tasks with increasing scales and
difficulty. As speech recognition technology moves out of laboratories and is widely
applied in more and more practical scenarios, many challenging technical problems
emerge. Environmental noise is a major factor which contributes to the diversity of
speech. As speech services are provided on various devices, ranging from telephones,
desktop computers to tablets and game consoles, speech signals also exhibit large
variations caused by channel characteristic differences. Speech signals captured in
enclosed environments by distant microphones are usually subject to reverberation.
Compared with background noise and channel distortions, reverberant noise is highly
dynamic and strongly correlated with the original clean speech signals. Speech
recognition based on phonemes is very attractive, since it is inherently free from
vocabulary limitations. Large Vocabulary ASR (LVASR) systems’ performance
depends on the quality of the phone recognizer. That is why research teams continue
developing phone recognizers, in order to enhance their performance as much as
possible. Phoneme classification can be seen as one of the basic building blocks of a
speech recognition system. This thesis will focus on the development of a robust
phoneme classification technique that works well in diverse conditions.
[Figure 1.1 content: speech signal (a) → unknown utterances (b) → learning machine (c) → predicted class (d): ../p/, /t/, /k/,...]
Figure 1.1: Overview of phoneme classification
1.1 Phoneme classification
Phoneme classification is the task of determining the phonetic identity of a speech
utterance (typically short) based on the extracted features from speech. The individual
sounds used to create speech are called phonemes. The task of this thesis is to create a
learning machine that can classify phonemes (sequences of acoustic observations) both in
quiet and under noisy environments. To explain how this learning machine works, we
can consider a speech signal for an ordinary English sentence, labeled (a) in Fig. 1.1.
The signal is split up into elementary speech units [(b) in Fig. 1.1] and fed
into the learning machine (c). The task of this learning machine is to classify each of
these unknown utterances to one of the 39 targets, representing the phonemes in the
English language. The idea of classifying phonemes is widely used in both isolated and
continuous speech recognition. The predictions found by the learning machine can be
passed on to a statistical model to find the most likely sequence of phonemes that
construct a meaningful sentence. Hence, accurate phoneme classification is important for a successful ASR system.
[Figure 1.2 content: (a) phoneme classification by the learning machine → (b) ../p/, /t/, /k/,... → (c) dynamic programming method → (d) ../p/, /k/,...]
Figure 1.2: Difference between phoneme classification and phoneme recognition. The
small box encloses the task of phoneme classifier, and the big box encloses the task of a
phoneme recogniser. After the phonemes have been classified in (b), the dynamic
programming method (c) finds the most likely sequence of phonemes (d).
In this thesis, the chosen learning machine for the task is support vector machines
(SVMs). The task of phoneme classification is typically done in the context of phoneme
recognition. As the name implies, phoneme recognition consists of identifying the
individual phonemes a sentence is composed of and then a dynamic programming
method is required to transform the phoneme classifications into phoneme predictions.
This difference is shown in Fig. 1.2. We have chosen not to perform complete phoneme
recognition, because we would need to include a dynamic programming method. This
may seem fairly straightforward, but to do it properly, it would involve introducing
large areas of speech recognition such as decoding techniques and the use of language
models. This would detract from the focus of the thesis. Phone recognition has a wide
range of applications. In addition to typical LVASR systems, it can be found in
applications related to language recognition, keyword detection, speaker identification,
and applications for music identification and translation (Lopes & Perdigao, 2011).
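To make the setup concrete, the sketch below shows one way such a multi-class SVM phoneme classifier could be assembled with scikit-learn once per-segment feature vectors are available. It is a minimal illustration, not the exact configuration used in this thesis; the variable names (X_train, y_train, X_test, y_test), the RBF kernel and the parameter values are all assumptions.

```python
# Minimal sketch: a 39-class phoneme classifier built on precomputed
# per-segment feature vectors (MFCC, GFCC, FDLP, or the proposed
# Radon-of-neurogram features). Assumed inputs:
#   X_train, X_test : (n_segments, n_features) arrays
#   y_train, y_test : arrays of the 39 folded TIMIT phoneme labels
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_phoneme_classifier(X_train, y_train):
    # RBF-kernel SVM; scikit-learn extends it to the multi-class
    # problem with a one-versus-one decomposition by default.
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel='rbf', C=1.0, gamma='scale'))
    clf.fit(X_train, y_train)
    return clf

def classification_accuracy(clf, X_test, y_test):
    # Percentage of correctly classified phoneme segments.
    return 100.0 * np.mean(clf.predict(X_test) == y_test)
```

The one-versus-one strategy used here is only one possible multi-class extension of the binary SVM; the alternatives are discussed later, in Section 2.6.4.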
Figure 1.3: Architecture of an ASR system [adapted from (Wang, 2015)]
1.2 Automatic speech recognition
This section introduces speech recognition systems. The aim of an ASR system is to
produce the most likely word sequence given an incoming speech signal. Figure 1.3
shows the architecture of an ASR system and its main components. In the first stage of
speech recognition, input speech signals are processed by a front-end to provide a
stream of acoustic feature vectors or observations. These observations should be
compact and carry sufficient information for recognition in the later stage. This process
is usually known as front-end processing or feature extraction. In the second stage, the
extracted observation sequence is fed into a decoder to recognise the most likely
word sequence. Three main knowledge sources, namely the lexicon, language models
and acoustic models, are used in this stage. The lexicon, also known as the dictionary, is
usually used in large vocabulary continuous speech recognition (LVCSR) systems to
map sub-word units to words used in the language model. The language model
represents the prior knowledge about the syntactic and semantic information of word
sequences. The acoustic model represents the acoustic knowledge of how an
observation sequence can be mapped to a sequence of sub-word units. In this thesis, we
consider phone classification, which allows a good evaluation of the quality of the acoustic
modeling, since it computes the performance of the recognizer without the use of any
kind of grammar (Reynolds & Antoniou, 2003).
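The decoder's task can be summarized by the standard ASR decoding rule (a textbook formulation; the thesis does not state it explicitly): given the observation sequence O, the decoder searches for the word sequence W that maximizes the posterior probability,

\[
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \; p(O \mid W)\, P(W),
\]

where the acoustic model (together with the lexicon, which maps words to sub-word units) supplies p(O | W) and the language model supplies the prior P(W).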
1.3 Problem Statement
State-of-the-art algorithms for ASR systems suffer from poorer performance when
compared to the ability of human listeners to detect, analyze, and segregate the dynamic
acoustic stimuli, especially in complex and noisy environments (Lippmann, 1997;
G. A. Miller & Nicely, 1955; Sroka & Braida, 2005). Performance of ASR systems can
be improved by using additional levels of language and context modeling, provided that
the input sequence of elementary speech units is sufficiently accurate (Yousafzai, Ager,
Cvetković, & Sollich, 2008). To achieve a robust recognition of continuous speech, both
sophisticated language-context modeling and accurate predictions of isolated phonemes
are required. Indeed, most of the inherent robustness of human speech recognition
occurs before and independently of context and language processing (G. A. Miller,
Heise, & Lichten, 1951; G. A. Miller & Nicely, 1955). For phoneme recognition, the human
auditory system's accuracy is already above chance level at a signal-to-noise ratio
(SNR) of -18 dB (G. A. Miller & Nicely, 1955). Also, several studies have
demonstrated the superior performance of human speech recognition compared to
machine performance both in quiet and under noisy conditions (Allen, 1994; Meyer,
Wächter, Brand, & Kollmeier, 2007), and thus the ultimate challenge for an ASR is to
achieve recognition performance that is close to that of the human auditory
system. In this thesis, we consider front-end features for phoneme classification because
accurate classification of isolated phonetic units is very important for achieving robust
recognition of continuous speech.
Most of the existing ASR systems use perceptual linear prediction (PLP), Relative
spectra (RASTA) or Cepstral features, normally some variant of Mel-frequency cepstral
coefficients (MFCCs) as their front-end. Due to nonlinear processing involved in the
feature extraction, even a moderate level of distortion may cause significant departures
from feature distributions learned on clean data, making these distributions inadequate
for recognition in the presence of environmental distortions such as additive noise
(Yousafzai, Sollich, Cvetković, & Yu, 2011). Some attempts have been made to utilize
Gammatone frequency coefficient cepstra (GFCC) in ASR (Shao, Srinivasan, & Wang,
2007), but the improvement was not significant. Over the past years, efforts have also
been made to design a robust ASR system motivated by articulatory and auditory
processing (Holmberg, Gelbart, & Hemmert, 2006; Jankowski Jr, Vo, & Lippmann,
1995; Jeon & Juang, 2007). However, these models did not include most of the
nonlinearities observed at the level of the auditory periphery and thus were not
physiologically-accurate. As a result, the performance of ASR systems based on these
features is far below human performance in adverse conditions (Lippmann,
1997; Sroka & Braida, 2005).
Until recently, the problem of recognizing reverberant signals and signals distorted by
telephone channels has remained unsolved due to the nature of reverberant speech and
the limited bandwidth of telephone channels (Nakatani, Kellermann, Naylor, Miyoshi, & Juang,
2010; Pulakka & Alku, 2011). Reverberation is a form of distortion quite distinct from
both additive noise and spectral shaping. Unlike additive noise, reverberation creates
interference that is correlated with the speech signal. Most of the telephone systems in
use today transmit only a narrow audio bandwidth limited to the traditional telephone
band of 0.3–3.4 kHz or to only a slightly wider bandwidth. Natural speech contains
frequencies far beyond this range and consequently, the naturalness, quality, and
intelligibility of telephone speech are degraded by the narrow audio bandwidth (Pulakka
& Alku, 2011).
In order to solve the reverberation problem, a number of de-reverberation algorithms
have been proposed based on cepstral filtering, inverse filtering, temporal ENV
filtering, excitation source information and spectral processing. The main limitation of
these techniques is that the acoustic impulse responses or talker locations must be
known or blindly estimated for successful de-reverberation. This is known to be a
difficult task (Krishnamoorthy & Prasanna, 2009). These problems preclude their use for
real-time applications (Nakatani, Yoshioka, Kinoshita, Miyoshi, & Juang, 2009). In the
past decades, many pitch determination algorithms have been proposed to improve the
recognition accuracy of telephone speech. An important consideration for pitch
determination algorithms is the performance in telephone speech, where the
fundamental frequency is always weak or even missing, which makes pitch
determination even more difficult (L. Chang, Xu, Tang, & Cui, 2012). A significant
improvement in speech quality can be achieved by using wideband (WB) codecs for
speech transmission. For example, the adaptive multi-rate (AMR)-WB speech codec
transmits the audio bandwidth of 50–7000 Hz. But wideband telephony is possible only
if both terminal devices and the network support AMR-WB.
Most of the above-mentioned methods provide good performance (though still lower than
the average performance of a human being) for a specific condition, i.e., quiet environment,
additive noise, reverberant speech, or channel distortion, but speech signals are
usually affected by multiple acoustic factors simultaneously. This makes it difficult to use
these methods in real environments. In this study, we propose a method that can handle any type
of noise and thus can be used in real environments.
Table 1.1: Human versus machine speech recognition performance (Halberstadt, 1998).
Corpus                         Description                            Vocabulary Size   Machine Error (%)   Human Error (%)
TI Digits                      Read Digits                            10                0.72                0.009
Alphabet Letters               Read Alphabetic Letters                26                5                   1.6
Resource Management            Read Sentences (Word-pair Grammar)     1,000             3.6                 0.1
Resource Management            Read Sentences (Null Grammar)          1,000             17                  2
Wall Street Journal            Read Sentences                         5,000             7.2                 0.9
North American Business News   Read Sentences                         Unlimited         6.6                 0.4
Switchboard                    Spontaneous Telephone Conversations    2,000-Unlimited   43                  4

1.4 Motivation
Two sources of motivation contributed to the conception of the ideas for the
experiments in this thesis.
1.4.1 Comparisons between Humans and Machines
Several studies have compared the performance of humans and ASR systems
(Lippmann, 1997; G. A. Miller & Nicely, 1955; Sroka & Braida, 2005). Table 1.1
shows the performance of machines and humans for different corpora. Kingsbury et al.
showed the difference between humans and machines for reverberant speech
(Kingsbury & Morgan, 1997). Machine recognition tests were run using a hybrid hidden
Markov model/multilayer perceptron (HMM/MLP) recognizer. Four front ends were
tested: PLP, log-RASTA-PLP, J-RASTA-PLP, and an experimental RASTA-like front end called the modulation spectrogram (Mod. Spec.). Their results are shown in Table 1.2.

Table 1.2: Human and machine recognition results. All percentages are word error rates. The best result is indicated in bold.

Experiment   Feature set   Condition   Error (%)
Baseline     PLP           Clean       17.8
             PLP           Reverb      71.5
             Log-RASTA     Clean       16.4
             Log-RASTA     Reverb      74.4
             J-RASTA       Clean       16.9
             J-RASTA       Reverb      78.9
             Mod. Spec.    Clean       31.7
             Mod. Spec.    Reverb      66.0
Humans                     Reverb      6.1
The aforementioned results imply that humans perform significantly better than machines.
Machines cannot efficiently extract the low-level phonetic information from the speech
signal. These results motivated us to work on front-end features for accurate recognition
of phonemes and the closely related problem of classification.
1.4.2 Neural-response-based feature
The accuracy of human speech recognition motivates the application of information
processing strategies found in the human auditory system to ASR (Hermansky, 1998;
Kollmeier, 2003). In general, the auditory system is nonlinear. Incorporating nonlinear
properties from the auditory system in the design of a phoneme classifier might improve
the performance of the recognition system. The current study proposes an approach to
classify phonemes based on the simulated neural responses from a physiologically-accurate model of the auditory system (Zilany, Bruce, Nelson, & Carney, 2009). This
approach is expected to improve the robustness of the phoneme classification system. It
was motivated by the fact that neural responses are robust against noise due to the
phase-locking property of the neuron, i.e., the neurons fire preferentially at a certain
phase of the input stimulus (M. I. Miller, Barta, & Sachs, 1987), even when noise is
added to the acoustic signal. In addition to this, the auditory-nerve (AN) model also
captures most of the nonlinear properties observed at the peripheral level of the auditory
system such as nonlinear tuning, compression, two-tone suppression, and adaptation in
the inner-hair-cell-AN synapse as well as some other nonlinearities observed only at
high sound pressure levels (SPLs) (M. I. Miller et al., 1987; Robles & Ruggero, 2001;
Zilany & Bruce, 2006).
1.5 Objectives of this study
The robustness of a recognition system is heavily influenced by the ability to handle the
presence of background noise and to cope with distortions due to convolution and
transmission channels. State-of-the-art algorithms for ASR systems exhibit good
performance in quiet environments but suffer from poorer performance when compared
to the ability of human listeners in noisy environments (Lippmann, 1997; G. A. Miller
& Nicely, 1955; Sroka & Braida, 2005). Our ultimate goal is to develop an ASR system
that can handle distortions caused by various acoustic factors, including speaker
differences, channel distortions, and environmental noises. Developing methods for
phoneme recognition and the closely related problem of classification is a major step
towards achieving this goal. Hence, this thesis only considers the prediction of
phonemes, since phoneme classification can be seen as one of the basic building blocks of a
speech recognition system. The specific objectives of the study are to:
• develop a robust phoneme classifier based on neural responses and to evaluate its performance in quiet and under noisy conditions;
• compare the recognition accuracy of the proposed method with the results from some existing popular metrics, such as MFCC, GFCC and FDLP; and
• examine the effects of different parameters on the performance of the proposed method.
1.6 Scope of the study
Phone recognition from TIMIT has more than two decades of intense research behind it,
and its performance has naturally improved with time. There is a full array of systems,
but with regard to evaluation, they concentrate on three domains: phone segmentation,
phone classification and phone recognition (Lopes & Perdigao, 2011). Phone
segmentation is a process of finding the boundaries of a sequence of known phones in a
spoken utterance. Phonetic classification takes the correctly segmented signal, but with
unknown labels for the segments. The problem is to correctly identify the phones in
those segments. Phone recognition is a more complex task. The speech given to the
recognizer corresponds to the whole utterance. The phone models plus a Viterbi
decoding find the best sequence of labels for the input utterance. In this case, a grammar
can be used. The best sequence of phones found by the Viterbi path is compared to the
reference (the TIMIT manual labels for the same utterance) using a dynamic
programming algorithm, usually the Levenshtein distance, which takes into account
phone hits, substitutions, deletions and insertions. This thesis will focus on the phone
classification task only.
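As an illustration of the scoring metric mentioned above, the short sketch below computes a Levenshtein (edit distance) alignment between a reference phone sequence and a recognized one, counting substitutions, deletions and insertions. It is only a demonstration of the metric; the phone sequences and the derived phone error rate are hypothetical examples, not taken from the TIMIT experiments in this thesis.

```python
# Minimal Levenshtein distance between two phone sequences.
def levenshtein(ref, hyp):
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # hit or substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # match / substitution
    return dp[-1][-1]

# Hypothetical example: phone error rate for one utterance.
ref = ['sil', 'p', 'aa', 't', 'sil']
hyp = ['sil', 'p', 'aa', 'k', 'sil', 'sil']
per = 100.0 * levenshtein(ref, hyp) / len(ref)   # 2 errors / 5 phones = 40%
```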
A number of benchmarking databases have been constructed for recognition purposes, for
example, the DARPA resource management (RM) database, TIDIGITS (connected
digits), Alpha-Numeric (AN4) with a 100-word vocabulary, the Texas Instruments and
Massachusetts Institute of Technology (TIMIT) database, and WSJ5K with a 5,000-word vocabulary. In
this thesis, the most widely used TIMIT database has been used; it includes the dynamic
behavior of the speech signal and sources of variability, e.g., intra-speaker (same speaker),
inter-speaker (cross speaker), and linguistic (speaking style). TIMIT is fully and
manually annotated at the phone level.
Research in the speech recognition area has been underway for a couple of decades, and
a great deal of progress has been made in reducing speech recognition errors.
Most approaches show high speech recognition accuracy under controlled
conditions (for example, in a quiet environment or under a specific noise type), but the human
auditory system can recognize speech without prior information about noise types. It is
very difficult to develop a feature that can handle all types of noise. Our proposed
feature can be used in a quiet environment and under background noise, room
reverberation and channel variations.
Most of the existing phoneme recognition systems are based on the features extracted
from the acoustic signal in the time and/or frequency domain. PLP, RASTA, MFCCs
(Hermansky, 1990; Hermansky & Morgan, 1994; Zheng, Zhang, & Song, 2001) are
some examples of the preferred traditional features for the ASR systems. Some attempts
have also been made to utilize GFCC in ASR, for instance (Qi, Wang, Jiang, & Liu,
2013; Schluter, Bezrukov, Wagner, & Ney, 2007; Shao, Jin, Wang, & Srinivasan,
2009). Recently, Ganapathy et al. (Ganapathy, Thomas, & Hermansky, 2010) proposed
a feature extraction technique for phoneme recognition based on deriving modulation
frequency components from the speech signal. This FDLP feature is an efficient
technique for robust ASR. In this study, the classification results of the proposed neural-response-based method were compared to the performances of the traditional acoustic-property-based speech recognition methods using features such as MFCCs, GFCCs and
FDLPs.
1.7 Organization of Thesis
This thesis proposes a new technique for phoneme classification. Chapter two provides
background information for the experimental work. We explain the anatomy and
physiology of the peripheral auditory system, the model of the AN, the SVM classifier
employed in this thesis, the Radon transform and some existing metrics. In Chapter three,
the procedure of the proposed method has been discussed. Chapter four presents the
experiments evaluating the various feature extraction techniques in the task of TIMIT
phonetic classification. We show results both in quiet and under noisy conditions
for different types of phonemes extracted from the TIMIT database. Chapter five
explains the reasons behind the robustness of the proposed feature. It also shows the effects
of different parameters on classification accuracy. Finally, the general conclusions and
some future directions are provided in Chapter six.
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
This chapter reviews some related work on phoneme classification. After that, a brief
description of some acoustic-property-based feature extraction techniques and the structure
of the auditory system is given. This chapter also describes the SVM classifier, the model
of the AN, and the Radon transform technique used in this study.
2.2 Research background
After several years of intense research by a large number of research teams, error rates
in speech recognition have improved considerably, but the performance of humans is still
significantly better than that of machines. In 1994, Robinson (Robinson, 1994)
reported a phoneme classification accuracy of 70.4% using a recurrent neural network
(RNN). In 1996, Chen and Jamieson reported a phoneme classification accuracy of 74.2%
using the same type of classifier (Chen & Jamieson, 1996). Rathinavelu et al. used hidden
Markov models (HMMs) for speech classification and achieved an accuracy of 68.60%
(Chengalvarayan & Deng, 1997). This approach becomes problematic when dealing
with the variabilities of natural conversational speech. In 2003, a broad classification
accuracy of 84.1% using an MLP was reported by Reynolds et al. (T. J. Reynolds &
Antoniou, 2003). Clarkson et al. created a multi-class SVM system to classify
phonemes. Their reported result of 77.6% is extremely encouraging and shows the
potential of SVMs in speech recognition. They also used a Gaussian Mixture Model
(GMM) for classification and achieved an accuracy of 73.7%. Similarly, Johnson et al. showed
results for an MFCC-based feature using HMMs (Johnson et al., 2005). They reported
a single-phone accuracy of 54.86% for the complete test set. They also reported an accuracy
of 35.06% for their proposed feature.
A hybrid SVM/HMM system was developed by Ganapathiraju et al. (Ganapathiraju,
Hamaker, & Picone, 2000). This approach has provided improvements in recognition
performance over HMM baselines on both small- and large-vocabulary recognition
tasks, even though SVM classifiers were constructed solely from the cepstral
representations (Ganapathiraju, Hamaker, & Picone, 2004; Krüger, Schafföner, Katz,
Andelic, & Wendemuth, 2005). The above papers all show the potential of using SVMs
in speech recognition and motivated us to use SVM as a classifier. Phoneme
classification has also been studied by a large number of researchers (Clarkson &
Moreno, 1999; Halberstadt & Glass, 1997; Layton & Gales, 2006; McCourt, Harte, &
Vaseghi, 2000; Rifkin, Schutte, Saad, Bouvrie, & Glass, 2007; Sha & Saul, 2006) for
the purpose of testing different methods and representations.
In recent years, many articulatory- and auditory-based processing methods have also
been proposed to address the problem of phonetic variations in a number of frame-based, segment-based, and acoustic landmark systems. For example, articulatory
features (AFs) derived from phonological rules have outperformed the acoustic HMM
baseline in a series of phonetic level recognition tasks (Kirchhoff, Fink, & Sagerer,
2002). Similarly, experimental studies of the mammalian peripheral and central auditory
organs have also introduced many perceptual processing methods. For example, several
auditory models have been constructed to simulate human hearing, e.g., the ensemble
interval histogram, the lateral inhibitory network, and Meddis’ inner hair-cell model
(Jankowski Jr et al., 1995; Jeon & Juang, 2007). Holmberg et al. incorporated a synaptic
adaptation into their feature extraction method and found that the performance of the
system improved substantially (Holmberg et al., 2006). Similarly, Strope and Alwan
(1997) used a model of temporal masking (Strope & Alwan, 1997) and Perdigao et al.
(1998) employed a physiologically-based inner ear model for developing a robust ASR
system (Perdigao & Sá, 1998).
2.3 Existing metrics
The performance was evaluated on the complete test set of the TIMIT database and
compared to the results from three standard acoustic-property-based methods. In this
section, we describe these features: MFCC, GFCC and FDLP.
2.3.1 Mel-frequency Cepstral Coefficient (MFCC)
MFCCs (Davis & Mermelstein, 1980) are a widely used feature in ASR systems. Figure
2.1 represents the derivation of the MFCC feature from a typical acoustic sound waveform.
Initially, an FFT is applied to each frame to obtain complex spectral features. Normally, a
512-point FFT is applied to derive 256 complex spectral points, and the phase information
is discarded. For a more efficient representation, only about 30 smoothed spectral values
per frame are retained and warped logarithmically onto the Mel or Bark scale, which makes
the spectrum perceptually more meaningful.
[Figure 2.1 content: Speech Signal → Fast Fourier Transform → Spectrum → Mel Scale Filtering → Mel Frequency Spectrum → Log(.) → Discrete Cosine Transform → Cepstral Coefficients → Derivatives → Feature Vector (MFCC)]
Figure 2.1: Illustration of block diagram for MFCC derivation.
[Figure 2.2 content: Input signal → Windowing → GT filter → Cubic root operation → DCT → GFCC]
Figure 2.2: Illustration of methodology to extract GFCC feature
In the MFCC derivation, a triangular filter-bank spaced linearly on the Mel scale is
used. The final step converts the filter-bank outputs to cepstral coefficients using the
discrete cosine transform (DCT).
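As a rough illustration of this pipeline (FFT, Mel-scale filtering, logarithm, DCT and
temporal derivatives), the following Python sketch uses the librosa library; the file
name, frame settings and coefficient counts are illustrative assumptions rather than the
exact configuration used in this thesis.

# Minimal MFCC sketch using librosa (illustrative parameters only).
import numpy as np
import librosa

y, sr = librosa.load("phoneme.wav", sr=16000)      # hypothetical phoneme file

# 13 cepstral coefficients per frame: FFT -> Mel filter-bank -> log -> DCT
ceps = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160, n_mels=30)

# First- and second-order temporal derivatives (Del and Ddel),
# assuming the phoneme is long enough to contain several frames
delta = librosa.feature.delta(ceps)
ddelta = librosa.feature.delta(ceps, order=2)

features = np.vstack([ceps, delta, ddelta])        # 39 coefficients per frame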
2.3.2
Gammatone Frequency Cepstral Coefficient (GFCC)
GFCC is an acoustic cepstral-based feature derived from the Gammatone feature (GF) using
a Gammatone filter-bank. To extract GFCC, the audio signal is first decomposed by a
128-channel Gammatone filter-bank whose centre frequencies are quasi-logarithmically
spaced from 50 Hz to 8 kHz (or half of the sampling frequency of the input signal),
modelling human cochlear filtering. The filter-bank outputs are then down-sampled to
100 Hz in the time dimension using a 10 ms frame rate, and a cubic-root operation is
used to compress the magnitudes of the down-sampled outputs. A matrix of "cepstral"
coefficients is obtained by multiplying the DCT matrix with the GF vectors. Among the
128 channel-based coefficients, the first 30 (or 23) are normally retained for speaker
identification or speech recognition, since they hold most of the GF information due to
the energy-compaction property of the DCT (Shao, Srinivasan, & Wang, 2007). Thus the
size of the GFCC feature is m × 23 or m × 30, where m is the number of frames.
Figure 2.2 illustrates the GFCC derivation procedure.
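The following Python sketch outlines the GFCC computation described above (Gammatone
filter-bank, 10 ms framing, cubic-root compression, DCT). The geometric spacing of the
centre frequencies is only an approximation of the quasi-logarithmic (ERB-style)
spacing, and the helper is a simplified stand-in for the procedure of Shao et al., not a
faithful reimplementation.

import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fft import dct

def gfcc(x, fs=16000, n_channels=128, n_coeff=30):
    # quasi-logarithmically spaced centre frequencies (50 Hz up to just below fs/2)
    cfs = np.geomspace(50.0, fs / 2 - 1.0, n_channels)
    frame_len = int(0.010 * fs)                    # 10 ms frames -> 100 Hz frame rate
    n_frames = len(x) // frame_len

    gf = np.zeros((n_frames, n_channels))
    for ch, cf in enumerate(cfs):
        b, a = gammatone(cf, 'iir', fs=fs)         # 4th-order Gammatone filter
        ych = lfilter(b, a, x)
        frames = np.abs(ych[:n_frames * frame_len]).reshape(n_frames, frame_len)
        gf[:, ch] = frames.mean(axis=1)            # down-sample to the frame rate

    gf = np.cbrt(gf)                               # cubic-root compression
    return dct(gf, type=2, norm='ortho', axis=1)[:, :n_coeff]   # m x 30 GFCC matrix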
2.3.3
Frequency Domain Linear Prediction (FDLP)
Ganapathy et al. (Ganapathy, Thomas, & Hermansky, 2009) proposed a feature based on
deriving modulation-frequency components from the speech signal.
[Figure 2.3 block diagram: Speech signal → DCT → Critical-band windowing → FDLP →
Sub-band envelopes]
Figure 2.3: Deriving sub-band temporal envelopes from speech signal using FDLP.
The FDLP technique is implemented in two parts: first, the discrete cosine transform
(DCT) is applied to long segments of speech to obtain a real-valued spectral
representation of the signal; then, linear prediction is performed on the DCT
representation to obtain a parametric model of the temporal envelope. The block
schematic for the extraction of sub-band temporal envelopes from the speech signal is
shown in Fig. 2.3.
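A much-simplified sketch of these two steps is given below: a DCT over a long speech
segment, followed by linear prediction on uniform sub-bands of the DCT coefficients
(the actual FDLP implementation uses critical-band windows). The number of bands, LPC
order and envelope length are assumptions for illustration only.

import numpy as np
import librosa
from scipy.fft import dct
from scipy.signal import freqz

def fdlp_subband_envelopes(segment, n_bands=8, lpc_order=40, n_points=200):
    c = dct(segment, type=2, norm='ortho')         # real-valued spectral representation
    edges = np.linspace(0, len(c), n_bands + 1, dtype=int)
    envelopes = []
    for k in range(n_bands):
        band = c[edges[k]:edges[k + 1]]
        a = librosa.lpc(band, order=lpc_order)     # linear prediction on the DCT coefficients
        _, h = freqz(1.0, a, worN=n_points)        # all-pole model of the sub-band
        envelopes.append(np.abs(h) ** 2)           # approximates the temporal envelope
    return np.array(envelopes)                     # (n_bands, n_points)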
2.4
Structure and Function of the Auditory System
The peripheral auditory pathway connects the sensory organ (the ear) with the auditory
parts of the nervous system, which interpret the received information. The peripheral
auditory system converts the pressure waveform into a mechanical vibration and, finally,
into a change in the membrane potential of the hair cells. This conversion of energy
from mechanical motion to electrical energy is called transduction. The change in
membrane potential triggers action potentials (APs) in the AN fibers, and these APs
convey the information through the nervous system. Anatomically, the human auditory
system consists of the peripheral auditory system and the central auditory nervous
system. The peripheral auditory system can be further subdivided into three major parts:
the outer, middle and inner ear. The human auditory system is shown in Fig. 2.4.
Figure 2.4: Illustration of the structure of the auditory system showing outer, middle and
inner ear (Reproduced from Encyclopaedia Britannica, Inc. 1997)
2.4.1
Outer Ear
The outer ear is composed of two components: the most visible portion of the ear, called
the pinna, and the long, narrow canal joined to the ear drum, called the ear canal. The
word "ear" is often used to refer to the pinna. The ear drum, also known as the tympanic
membrane, acts as the boundary between the outer ear and the middle ear (ME). Sounds
arrive at the outer ear and pass through the canal to the ear drum, which vibrates when
the sound wave hits it. The primary functions of the outer ear are to protect the middle
and inner ears from foreign bodies and to amplify high-frequency sounds. It also
provides the primary cue for determining the source of a sound.
2.4.2
Middle Ear (ME)
The ME is separated from the external ear by the tympanic membrane. It creates a link
between the outer ear and the fluid-filled inner ear. This link is made through three
bones: the malleus (hammer), incus (anvil) and stapes (stirrup), which are known as the
ossicles. The malleus is connected to the stapes through the incus, and the stapes, the
smallest bone, is connected to the oval window. The three bones are connected in such a
way that movement of the tympanic membrane causes movement of the stapes. The ossicles
can amplify the sound waves by almost thirty times.
2.4.3
Inner Ear
The inner ear is a very complex structure located within a dense portion of the skull.
Also known as the labyrinth because of its complexity, it can be divided into two major
sections: a bony outer casing (the osseous labyrinth) and the membranous labyrinth
within it. The osseous labyrinth consists of the semicircular canals, the vestibule, and
the cochlea. The coiled, snail-shaped cochlea is the most important part of the inner
ear for hearing, as it contains the sensory organ of hearing. The largest and smallest
turns of the cochlea are known as the basal turn and the apical turn, respectively. The
oval window and round window are also part of the cochlea. The winding channel can be
divided into three sections, the scala media, scala tympani and scala vestibuli, each of
which is filled with fluid. The scala media carries a pressure wave in response to the
vibration transmitted by the oval window and ossicles. The basilar membrane (BM) is a
stiff structural element that separates the scala media from the scala tympani, and its
displacement pattern changes with the stimulus frequency. The cochlea thus acts as a
frequency analyzer of the sound wave.
2.4.4
Basilar Membrane Responses
The motion of the BM is normally described as a travelling wave. The mechanical
properties at a given point along the length of the BM determine its characteristic
frequency (CF), the frequency at which that point of the BM is most sensitive to sound
vibrations. The BM is widest at the apex, and its narrowest part (0.08–0.16 mm) is
located at the base. A sound pressure waveform passes through the ear drum and oval
window to the cochlea and creates a pressure wave along the BM. The CF varies with
position along the BM: the base of the cochlea is tuned to higher frequencies, and the
CF decreases towards the apex. When a sound pressure wave travels along the BM, both
low- and high-frequency regions are excited, causing an overlap in frequency detection.
For low-frequency tones (below 5 kHz), the resulting nerve spikes (APs) are synchronized
to the stimulus through the phase-locking process, a behaviour that the AN model
successfully captures. The motion of the BM at different CFs is shown in Fig. 2.5.
Figure 2.5: Motions of the BM at different frequencies (Reproduced from Encyclopaedia
Britannica, Inc. 1997)
2.4.5
Auditory Nerve
AN fibers generate electrical potentials whose amplitude does not vary with stimulus
level: when a nerve fiber fires, the potential always reaches its full amplitude. APs
are very short-lived events, normally taking 1 to 2 ms to rise to maximum amplitude and
return to the resting state; because of this behaviour they are commonly called spikes.
These spikes are used to decode information in the auditory portions of the central
nervous system.
2.5
Brief history of Auditory Nerve (AN) Modeling
AN modeling is an effective tool to understand the mechanical and physiological
processes in the human auditory periphery. Several efforts (Bruce, Sachs, & Young, 2003;
Tan & Carney, 2003; Zhang, Heinz, Bruce, & Carney, 2001) have been made to develop
computational models that integrate data and theories from a wide range of cochlear
research. The main focus of these modeling efforts was the nonlinearities of the
cochlea, such as compression, two-tone suppression and shifts in the best frequency. The
processing of simple and complex sounds in the peripheral auditory system can also be
studied through these models. These models can also be used as front ends in many
research areas, such as speech recognition in noisy environments (Ghitza, 1988; Tchorz &
Kollmeier, 1999), computational modeling of auditory analysis (Brown & Cooke, 1994),
modeling of neural circuits in the auditory brain-stem (Hewitt & Meddis, 1993), design
of speech processors for cochlear implants (Wilson et al., 2005) and design of
hearing-aid amplification schemes (Bondy, Becker, Bruce, Trainor, & Haykin, 2004; Bruce,
2004). In 1960, Flanagan and his colleague developed
a computational model that could emulate the mechanical displacement responses of the BM
for a known stimulus. The model considered the cochlea as a linear, passive filter and
accounted for the properties of cochlear responses to simple stimuli such as clicks,
single tones, or pairs of tones (R. L. Miller, Schilling, Franck, & Young, 1997;
Sellick, Patuzzi, & Johnstone, 1982; Wong, Miller, Calhoun, Sachs, & Young, 1998).
However, in 1987, Deng and Geisler reported significant nonlinearities in the responses
of AN fibers to speech sounds. They described the main property of the discovered
nonlinearities as "synchrony capture", meaning that the response produced by one formant
of a speech syllable is more synchronous to that formant than linear methods would
predict from the fiber's
threshold frequency tuning curve (FTC). These models were relatively simple and did not
consider the nonlinearities of the BM (Narayan, Temchin, Recio, & Ruggero, 1998). Their
effort led to a composite model that incorporated either a linear BM stage or a
nonlinear one (Deng & Geisler, 1987).
Figure 2.6: Model of one local peripheral section, including the outer/ME, BM, and
IHC–AN synapse models (Robert & Eriksson, 1999).
In 1999, Robert and Eriksson proposed a model that was able to reproduce many effects of
the human auditory system seen in electrophysiological recordings for tones, two-tone
combinations, and tone–noise combinations. The model was able to generate significant
nonlinear behaviours such as compression, two-tone suppression, and the shift in
rate-intensity functions when noise is added to the signal (Robert & Eriksson, 1999).
The model is shown in Fig. 2.6. The model introduced by Zhang et al. addressed some
important phenomena of the human AN, such as the temporal response properties of AN
fibers and the asymmetry in suppression growth above and below the characteristic
frequency. The model focused mainly on nonlinear tuning properties such as the
compressive changes in gain and bandwidth as a function of stimulus level, the
associated changes in the phase-locked responses, and two-tone suppression (Zhang et
al., 2001). Bruce et al. extended this model of the auditory periphery to assess the
effects of acoustic trauma on AN responses. The model incorporated the responses of the
outer hair cells (OHCs) and inner hair cells (IHCs) to increase the accuracy of
predicted responses to speech sounds, although their study was limited to low- and
moderate-level responses (Bruce et al., 2003). The schematic diagram of their model is
shown in Fig. 2.7.
Figure 2.7: The model of the auditory peripheral system developed by Bruce et al.
(Bruce et al., 2003), modified from Zhang et al. (Zhang et al., 2001).
Figure 2.8: Schematic diagram of the auditory-periphery model. The model consists of a
ME filter, a feed-forward control path, two signal paths (C1 and C2), the inner hair
cell (IHC) and outer hair cell (OHC) stages, followed by the synapse model and spike
generator (Zilany & Bruce, 2006).
2.5.1
Description of the computational model of AN
In this study, we used the AN model developed by Zilany et al. (Zilany & Bruce, 2006),
which is described in this section. In 2006, the model of the auditory periphery
developed by Bruce et al. (2003) was improved to simulate more realistic responses of AN
fibers to simple and complex stimuli over a range of characteristic frequencies. Figure
2.8 shows the model of the auditory periphery developed by Zilany et al. (Zilany &
Bruce, 2006). The model introduces two modes of BM excitation to the IHC, represented by
two filter components, C1 and C2. The C1 filter was designed to address low- and
intermediate-level responses, whereas C2 was introduced as a second mode of excitation
to the IHC; its output is passed through a C2 transduction function to produce the
high-level effects of the cochlea and the transition region between the two modes. This
feature of the Zilany–Bruce model makes it effective over a wider dynamic range at each
characteristic frequency of the BM compared with previous AN models (Zilany & Bruce,
2006, 2007; Zilany et al., 2009).
2.5.1.1 C1 Filter:
The C1 filter is a linear chirping filter that is used to produce the frequency glides
and BF shifts observed in the auditory periphery. It was designed to address the low-
and intermediate-level responses of the cochlea; its output closely resembles the
primary mode of vibration of the BM and acts as the excitatory input to the IHC. The
filter order has a great impact on the sharpness of the tuning curve: if the order is
too high, the filter remains sharply tuned even at high SPLs or with OHC impairment. To
make the FTC of the filter more realistic in both normal and impaired conditions, a
filter order of 10 was used instead of 20 (Tan & Carney, 2003).
2.5.1.2 Feed forward control path (including OHC):
The feed-forward control path regulates the gain and bandwidth of the BM filter to
reflect several level-dependent properties of the cochlea, using the output of the C1
filter. The nonlinearity of the AN model that represents an active cochlea is modelled
in this control path. Three main stages are involved in this path:
i) Gammatone filter:
The control-path filter is a time-varying filter whose impulse response is the product
of a gamma distribution and a sinusoidal tone, with a centre frequency and a bandwidth
broader than those of the C1 filter. The broader bandwidth of the control-path filter
produces two-tone rate suppression in the model output.
ii) Boltzmann function:
This symmetric nonlinear function was introduced to control the dynamic range and the
time course of compression in the model.
iii) A nonlinear function that converts the low-pass filter output into a time-varying
time constant for the C1 filter. Any impairment of the OHCs is controlled by the
parameter C_OHC, and its output is also used to control the nonlinearity of the cochlea.
The nonlinearity of the cochlea is thus controlled inside the feed-forward control path
according to the stimulus SPL. At low stimulus SPLs, the control path responds with its
maximum output, so the tuning curve is sharp, the gain is maximal, and the filter
behaves linearly. At moderate stimulus levels, the control signal deviates substantially
from its maximum and varies dynamically between maximum and minimum; the tuning curve of
the corresponding filter becomes broader and the gain of the filter is reduced. At very
high stimulus SPLs, the control-path output saturates at its minimum level and the
filter again becomes effectively linear, with broad tuning and low gain.
2.5.1.3 C2 Filter:
The C2 filter, which operates in parallel with the C1 filter, is a wideband filter with
the broadest possible tuning (i.e., at 40 kHz). It was introduced based on Kiang's
two-factor cancellation hypothesis, in which the stimulus level determines the
contribution of the C2 transduction function applied to the C2 filter output. According
to this hypothesis, the interaction between the two paths produces effects such as the
C1/C2 transition and peak splitting in the period histogram (Zilany & Bruce, 2006). To
be consistent with the experimental studies (Wong et al., 1998), the C2 filter was
chosen to be identical to the broadest possible C1 filter; it was therefore implemented
by placing the poles and zeroes in the complex plane at the positions corresponding to a
C1 filter with complete OHC impairment. According to Liberman and Dodds (1984), C2
responses remain unchanged in acoustically traumatized cats, whereas C1 responses are
significantly attenuated. Likewise, C1 responses can be suppressed by stimulating the
crossed olivocochlear bundle while C2 responses remain unaltered (Gifford & Guinan Jr,
1983). To include this phenomenon in the model, the C2 filter is made linear and static.
The C2 transduction function shapes the output according to the SPL, which determines
the C1/C2 interaction:
a) At low SPLs, its output is significantly lower than that of the corresponding C1
response.
b) At high SPLs, its output dominates, and the C1 and C2 outputs are out of phase.
c) At medium SPLs, the C1 and C2 outputs are approximately equal and tend to cancel each
other.
Furthermore, the C2 response, unlike the C1 response at high levels, is not subject to
rectification, so the peak-splitting phenomenon also results from the C1/C2 interaction.
Because a speech stimulus contains many frequency components, the frequency selectivity
of the modelled AN fiber could otherwise be degraded; this is overcome by increasing the
order of the C2 filter to 10, matching the order of the C1 filter.
2.5.1.4 The Inner hair cell (IHC):
The IHC is modelled by a transduction stage followed by a low-pass filter, which
converts the mechanical energy produced by the BM into the electrical potential that
drives neurotransmitter release at the IHC–AN synapse. The tallest and shorter IHC
stereocilia generate the C1 and C2 responses, respectively, and are controlled by the C1
and C2 transduction functions. The C1 transduction function uses the output of the C1
filter and, for high-CF model fibers, produces the direct-current (DC) component of the
electrical output. Meanwhile, the C2 transduction function uses the C2 filter output,
which is first transformed so that its contribution increases towards 90–100 dB SPL for
low- and moderate-CF fibers. Finally, the C1 and C2 transduction outputs, Vihc,C1 and
Vihc,C2, are summed and passed through the IHC low-pass filter to give the overall IHC
potential, Vihc. The output of the AN model is the simulated discharge activity of each
channel, summarized by the statistical characteristics of the spike trains in the form
of a peristimulus time histogram (PSTH).
In 2009, the model was further improved by adding power-law adaptation to the previous
model (Zilany et al., 2009). Figure 2.9 shows the AN model developed by Zilany and
colleagues in 2006 together with the additional rate-adaptation stage, the IHC–AN
power-law synapse model (PLA) (Zilany et al., 2009). In this model, power-law adaptation
is combined with exponentially adapting components having rapid and short-term time
constants.
Figure 2.9: (A) Schematic diagram of the model of the auditory periphery. (B) IHC–AN
synapse model: exponential adaptation followed by parallel PLA paths (slow and fast).
The two parallel paths in the PLA provide slowly and rapidly adapting responses, while
the exponentially adapting components are responsible for shaping the onset response.
These additions improve the AN response after stimulus offset, in which a persistent or
lingering response can still be observed after the stimulus has passed, and they allow
the model to adapt to stimuli with increasing or decreasing amplitude. The PLA in the
synapse model significantly increases the synchronization of the output to pure tones,
so the adapted cut-off frequency matches the maximum synchronized output of the AN fiber
for pure tones as a function of frequency. The PLA model also simulates the synapse
output for repeated presentations of a stimulus rather than for a single presentation
only, as in the previous model. Because the discharge generator has relatively
long-lived dynamics that can extend from one stimulus to the next, a series of synapse
outputs is formed through a combination of repeated stimuli and the silences between
them. Moreover, the synaptic PLA has memory that exceeds the repetition duration of a
single stimulus (Zilany et al., 2009).
The AN model introduced by Zilany et al. is capable of simulating realistic responses of
normal and impaired AN fibers in cats across a wide range of characteristic frequencies.
The model successfully captures most of the nonlinearities observed at the level of the
AN and simulates the responses of AN fibers at high stimulus levels. This was
accomplished by assuming that the inner hair cell is subjected to two modes of BM
excitation instead of only one. Two parallel filters, C1 (component 1) and C2
(component 2), generate these two modes, and each excitation mode has its own
transduction function, so that the C1/C2 interaction occurs within the inner hair cell.
The transduction functions were chosen such that at low and moderate SPLs the C1 filter
output dominates the overall IHC output, whereas the high-level responses are dominated
by the C2 path. This property of the Zilany–Bruce model makes it effective over a wider
dynamic range of SPLs than previous AN models (Zilany & Bruce, 2006).
2.5.2
Envelope (ENV) and Temporal Fine Structure (TFS) neurogram
Based on the temporal resolution (bin widths of 10 μs and 100 μs, subsequently referred
to as TFS and ENV, respectively), two types of neurogram were constructed from the
output of the model of the auditory periphery. In this study, only the ENV neurogram was
used.
2.6
Support Vector Machines
In this thesis, the support vector machine (SVM) formulation was used for phoneme
classification. SVMs are binary classifiers: they build a decision boundary by mapping
data from the original input space to a higher-dimensional feature space in which the
data can be separated by a linear hyperplane. Instead of choosing a hyperplane that only
minimizes the training error, SVMs choose the hyperplane that maximizes the margin of
separability between the two sets of data points. This selection can be viewed as an
implementation of the Structural Risk Minimization (SRM) principle, which seeks to
minimize an upper bound on the generalization error (Vapnik, 2013). Typically, an
increase in generalization error would otherwise be expected when constructing a
decision boundary in a higher-dimensional space.
Figure 2.10: (a) A separating hyperplane; (b) the hyperplane that maximizes the margin
of separability [adapted from (Salomon, 2001)].
2.6.1
Linear Classifiers
Consider the binary classification problem for the arrangement of data points shown in
Fig. 2.10(a). We denote the "square" samples with targets yi = +1 as positive examples,
belonging to the set S+. Similarly, we define the "round" samples with yi = −1 as
negative examples, belonging to S−. One mapping that can separate S+ and S− is:

    f(x) = sign(w · x + b)                                                    (2.1)

where w is a weight vector and b the offset from the origin. Given such a mapping, the
hyperplane w · x + b = 0 defines the decision boundary between S+ and S−. The two data
sets are said to be linearly separable by the hyperplane if a pair {w, b} can be chosen
such that the mapping in Eq. (2.1) is perfect. This is the case in Fig. 2.10(a), where
the "round" and "square" samples are clearly separable. The SVM classifier finds the
unique hyperplane that maximizes the margin between the two sets, as shown in Fig.
2.10(b).
Classification
An instance x is classified by determining on which side of the decision boundary it
falls. To do this, we compute

    f(x) = sign(w · x + b) = sign( Σ_{i=1..n} αi yi (xi · x) + b )            (2.2)

where αi is the Lagrange multiplier corresponding to the i-th training sample xi, and
then assign the instance to one of the target labels +1 or −1, representing the positive
and negative examples. The parameters α and b are optimized during training.
2.6.2
Non-linear Classifiers
The previous section described the linear SVM. A further extension is needed before SVMs
can effectively handle real-world data: the modelling of non-linear decision surfaces. A
method to handle non-linear decision surfaces with SVMs is described in (Systems &
Weiss, 2006): the idea is to map the input data to some higher-dimensional space in
which the data become linearly separable.
2.6.3
Kernels
To handle non-linear decision boundaries, kernel functions are used that implicitly map
data points to some higher-dimensional space where the decision surfaces are again
linear. Two commonly used kernels are the radial basis function (RBF) kernel,
Kr(xi, x) = exp(−τ ||xi − x||²), and the polynomial kernel, Kp(xi, x) = (1 + (xi · x))^Θ,
where Θ is the integer polynomial order and τ is the width factor. These parameters are
tuned to the particular classification problem. Since the classification performance in
the clean condition was almost identical for the standard polynomial and RBF kernels,
the RBF kernel was used in all experiments.
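For concreteness, the two kernels and the kernel form of the decision function in
Eq. (2.2) can be written as follows; the parameter values and inputs are arbitrary
assumptions, not those of the thesis experiments.

import numpy as np

def rbf_kernel(xi, x, tau=0.05):
    return np.exp(-tau * np.sum((xi - x) ** 2))

def poly_kernel(xi, x, theta=3):
    return (1.0 + np.dot(xi, x)) ** theta

def decide(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * kernel(sv, x) for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b)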
2.6.4
Multi-class SVMs
In the literature, numerous schemes have been proposed to solve multi-class problems
using SVMs (Friedman, 1996; Mayoraz & Alpaydin, 1999). The SVM is fundamentally a binary
classifier that separates one set of samples from another. Two classification modes are
commonly used to extend it to multiple classes: one-versus-one (OVO) and one-versus-rest
(OVR). The advantage of the OVO scheme is that it takes less time, since the individual
problems to be solved are smaller. In the OVR scheme, each test instance is compared
with the target class against all remaining classes at the same time. In practice,
heuristic methods such as the OVO and OVR approaches are used more often than other
multi-class SVM implementations. Several software packages are available that
efficiently solve the binary SVM problem (Cortes & Vapnik, 1995).
2.7
Radon Transform
In this study, a phoneme classification technique was proposed based on applying the
Radon transform to simulated neural responses from a model of the auditory periphery.
The Radon transform computes projections of an image matrix along specified directions.
2.7.1
Theoretical foundation
In 1917, Radon, an Austrian mathematician, published an analytic solution to the problem
of reconstructing an object from multiple projections. William Oldendorf first reported
the application of mathematical image-reconstruction techniques to radiographic medical
imaging in 1961, and Nobel laureate Godfrey N. Hounsfield developed the first clinical
computerized tomography scanner around a decade later. Radon showed that it is possible
to recover a function on the plane, taking real or complex values, from its integrals
over all lines in the plane. It is therefore feasible to use rotational scanning to
obtain projections of a 2D object, which can then be used to reconstruct an image.
Radon's theorem states that "the value of a 2D function at an arbitrary point is
uniquely obtained by the integrals along the lines of all directions passing through
that point" (Rajput, Som, & Kar, 2016).
2.7.2
How Radon transform works
Detecting the entire image is a challenging task. Radon transform avoids the need for
global image detection by reducing recognition to the problem of detecting peak image
parameters, which enables the definition of prominent features. After establishing
thresholds for feature extraction that can be plugged into the Radon transform, it is
possible to extract the prominent features from the preprocessed image, effectively
recovering global image parameters. Several existing algorithms, such as edge-detection
filters, are suitable for estimating image parameters after which linear regression can be
applied to connect the individual pixels. However, these algorithms are less suitable
when the image has intersecting lines, the noise level is high, or filters are difficult to
stabilize. The Radon transform can overcome these problems; its formula is presented in
Section 3.4. A projection of a two-dimensional function f(x, y) is a set of line
integrals. The Radon function computes the line integrals from multiple sources along
parallel paths, or beams, in a certain direction, with the beams spaced one pixel unit
apart. To represent an image, the Radon function takes multiple parallel-beam
projections of the image from different angles by rotating the source around the centre
of the image. By exploiting the move-out or curvature of the signal of interest,
least-squares and high-resolution Radon transform methods can effectively eliminate
random or correlated noise and enhance signal clarity (Gu & Sacchi, 2009).
2.7.3
Current applications
In more modern terms, the 2D Radon transform represents an image as a set of line
integrals, each of which is a sum of the image's pixel intensities, and shows the
relationship between a 2D object and its projections. The Radon transform is a
fundamental tool in a wide range of disciplines, including radar, geophysical, and
medical imaging. It can also be used in science and industry to evaluate the properties
of a material, component, or system without causing damage (Rajput et al., 2016).
CHAPTER 3: METHODOLOGY
3.1
System overview
This chapter describes the procedure of the proposed neural-response-based phoneme
classification system. The block diagram of the proposed method is shown in Fig. 3.1.
The 2-D neurograms were constructed from the simulated responses of the auditory-nerve
fibers to speech phonemes. The features of the neurograms were extracted using the Radon
transform and used to train the classification system with a support vector machine
classifier. In the testing phase, the same procedure was followed to extract features
from the unknown (test) speech signal, and the class of the test phoneme was identified
using the approximated function obtained from the SVM training stage.
3.2
Datasets
To evaluate the proposed feature in a quiet environment, with additive noise, and with
reverberant speech, experiments were performed on the complete test set of the TIMIT
database.
[Figure 3.1 block diagram. Training: Training phoneme → AN Model → Neurogram → Radon
Transform → SVM Training → SVM model. Testing: Testing phoneme → AN Model → Neurogram →
Radon Transform → Feature Matching (against the SVM model) → Classification Matrix.]
Figure 3.1: Block diagram of the proposed phoneme classifier
The HTIMIT corpus was used to obtain speech distorted by different telephone channels.
3.2.1
TIMIT database
TIMIT is a unique corpus that has been used for numerous studies over the last 15 years.
It is ideal for isolated phoneme classification experiments because it contains expert
phonetic transcriptions and segmentations performed by linguists, which most other
corpora do not possess; the database can also be used for continuous recognition.
Experiments were performed on the complete test set of the TIMIT database (Garofolo &
Consortium, 1993). The two "sa" (dialect) sentences in TIMIT are spoken by all speakers
and may result in artificially high classification scores (Lee & Hon, 1989); to avoid
any unfair bias, experiments were performed on the "si" (diverse) and "sx" (compact)
sentences only. The training data set consists of 3696 utterances from 462 speakers
(140225 tokens), whereas the testing set consists of 1344 utterances from 168 speakers
(50754 tokens). The glottal stop /q/ was removed from the class labels, and the 61 TIMIT
phoneme labels were collapsed into 39 labels following the standard practice given in
(Lee & Hon, 1989). Table 3.1 describes this folding process and the resulting 39-phone
set. Knowing that phone confusions occur mostly within similar phones, some researchers
have proposed broad phone classes (Halberstadt, 1998; T. J. Reynolds & Antoniou, 2003;
Scanlon, Ellis, & Reilly, 2007). We also report the classification performance using the
broad phone-class grouping proposed by Reynolds and Antoniou (T. J. Reynolds & Antoniou,
2003).
Table 3.1: Mapping from 61 classes to 39 classes, as proposed by Lee and Hon (Lee &
Hon, 1989).
 1  iy            20  n en nx
 2  ih ix         21  ng eng
 3  eh            22  v
 4  ae            23  f
 5  ax ah ax-h    24  dh
 6  uw ux         25  th
 7  uh            26  z
 8  ao aa         27  s
 9  ey            28  zh sh
10  ay            29  jh
11  oy            30  ch
12  aw            31  b
13  ow            32  p
14  er axr        33  d
15  l el          34  dx
16  r             35  t
17  w             36  g
18  y             37  k
19  m em          38  hh hv
                  39  bcl pcl dcl tcl gcl kcl q epi pau h#
In 2003, Reynolds and Antoniou (T. J. Reynolds & Antoniou, 2003) divided the 39-phone
set into 7 broad classes, namely Plosives (Plo), Fricatives (Fri), Nasals (Nas),
Semi-vowels (Svow), Vowels (Vow), Diphthongs (Dip) and Closures (Clo) (Table 3.2).
Table 3.3 indicates the number of tokens in each of the data sets.
Table 3.2: Broad classes of phones proposed by Reynolds and Antoniou, (T. J.
Reynolds & Antoniou, 2003).
Phone Class     #TIMIT labels   TIMIT labels
Plosives        8               b d g p t k jh ch
Fricatives      8               f z th dh s sh v hh
Nasals          3               n m ng
Semi-vowels     5               l er r y w
Vowels          8               aa ae ah uh iy ih eh uh uw
Diphthongs      5               ay oy ey aw ow
Closures        2               sil dx
Table 3.3: Number of tokens in each phonetic subclass for the train and test sets
Phone           462-speaker train   168-speaker complete test
Plosives        17967               6261
Fricatives      20314               7278
Nasals          12312               4434
Semi-vowels     17406               7021
Vowels          34544               12457
Diphthongs      6890                2431
Closures        30792               10872
3.2.2
HTIMIT corpus
To evaluate the effectiveness of the proposed approach under channel distortion, we used
the well-known handset TIMIT (HTIMIT) corpus. This corpus is a re-recording of a subset
of the TIMIT corpus through different telephone handsets, and it was created for the
study of telephone-transducer effects on speech while minimizing confounding factors. It
was produced by playing 10 TIMIT sentences from each of 192 male and 192 female speakers
through a stereo loudspeaker into different transducers positioned directly in front of
the loudspeaker, and digitizing the output of the transducers on a sunspace A/D at an
8 kHz sampling rate with 16-bit resolution. In this study, we resampled the data from
8 kHz to 16 kHz. The set of utterances was played back and recorded through nine other
handsets, including four carbon-button handsets (cb1, cb2, cb3 and cb4), four electret
handsets (el1, el2, el3 and el4), and one portable cordless phone (pt1). In this study,
all experiments were performed on 1431 utterances from 359 speakers (54748 tokens).
3.3
AN model and neurogram
The AN model developed by Zilany et al. (Zilany & Bruce, 2006) is a useful tool for
understanding the underlying mechanical and physiological processes in the human
auditory periphery. The schematic block diagram of this AN model is shown in Fig. 2.8
(Zilany & Bruce, 2006). The model represents the encoding of simple and complex sounds
in the auditory periphery (Carney & Yin, 1988). The sound stimulus is resampled at
100 kHz, and the middle-ear (ME) stage converts it into the instantaneous pressure
waveform delivered to the cochlea. To replicate the human ME filter response, a
fifth-order digital filter was used (Zilany & Bruce, 2006); to maintain stability, this
fifth-order filter was implemented as a cascade of two second-order filters and one
first-order filter. Normally, 32 CFs logarithmically spaced from 150 Hz to 8 kHz are
simulated to produce a neurogram. Each CF corresponds to a single AN fiber and behaves
like a band-pass filter with an asymmetric filter shape. In the present study, the
neural response at each CF was simulated for a single repetition of each stimulus. A
single nerve fiber does not fire on every cycle of the stimulus, but when spikes do
occur, they occur at roughly the same phase of the waveform on successive cycles; this
behaviour is called phase locking. Phase locking varies somewhat across species, but its
upper frequency boundary lies at about 4–5 kHz (Palmer & Russell, 1986), although Heinz
et al. observed weak phase locking up to 10 kHz (Heinz, Colburn, & Carney, 2001). For
this reason, 8 kHz is commonly used as the upper limit of the CF range in the AN model.
To recall, the acoustic signal is adapted to this model by up-sampling to 100 kHz. Three
types of AN fiber response (low, medium, and high spontaneous rate) (Liberman, 1978)
were simulated to maintain consistency with the physiology of the auditory system;
according to the spontaneous-rate distribution, the AN model synapse responses for the
low, medium, and high spontaneous-rate fibers are weighted by 0.2, 0.2, and 0.6,
respectively. Details of the AN model were given in the literature review (Section 2.5).
Figure 3.2: Time-frequency representations of speech signals. (A) a typical speech
waveform (to produce spectrogram and neurogram of that signal), (B) the corresponding
spectrogram responses, and (C) the respective neurogram responses.
A neurogram is analogous to a spectrogram: it provides a pictorial representation of the
neural responses in the time–frequency domain, and Figure 3.2 shows the difference
between the two. The basic difference is that the spectrogram is the FFT-based frequency
spectrum of the acoustic signal as a function of time, whereas the neurogram is the
collection of 2-D neural responses, i.e., the responses at each CF. The neurogram is
obtained by averaging the neural responses for each CF. The ENV neurogram was derived
from the neural responses using a 100 µs bin width and a 128-point Hamming window (50%
overlap between adjacent frames) with the periodic flag, which is useful for spectral
analysis (Oppenheim, Schafer, & Buck, 1989); the average value of each frame was then
computed. The spike-synchronization frequency captured by the ENV neurogram therefore
extends up to ~160 Hz [1/(100×10⁻⁶×128×0.5)]. To calculate the TFS neurogram, a 10 µs
bin (time window) size with a 32-point Hamming window (50% overlap of adjacent frames,
periodic flag) was used for each frame. The combination of a 32-point Hamming window
with 10 µs binning extends the synchronization frequency range up to ~6.25 kHz
[1/(10×10⁻⁶×32×0.5)]. This implies that the ENV neurogram has lower resolution and
contains less information than the TFS neurogram. Since the ENV neurogram carries
averaged information about the whole neural response and is smaller than the TFS
neurogram, less time is required for phoneme classification training and testing; for
this reason, the ENV neurogram has been used in the proposed method. Overall, the AN
model can capture the nonlinearity of the mammalian peripheral auditory system more
accurately than existing auditory models and acoustic cepstral-based models, and it is
also useful for robust phoneme, speech and speaker identification. In this study,
neurograms were constructed by simulating the responses of AN fibers to phonemes from
the TIMIT database; the responses of 32 AN fibers logarithmically spaced from 150 to
8000 Hz (150–4000 Hz for HTIMIT) were simulated.
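The framing described above can be sketched in Python as follows, assuming that `psth`
is an (number of CFs × number of bins) array of AN-model spike counts already binned at
100 µs; the AN model itself is not reproduced here.

import numpy as np
from scipy.signal.windows import hamming

def env_neurogram(psth, win_len=128, overlap=0.5):
    win = hamming(win_len, sym=False)              # "periodic" Hamming window
    hop = int(win_len * (1 - overlap))             # 50% overlap between adjacent frames
    n_cf, n_bins = psth.shape
    n_frames = 1 + (n_bins - win_len) // hop
    out = np.zeros((n_cf, n_frames))
    for f in range(n_frames):
        seg = psth[:, f * hop:f * hop + win_len] * win
        out[:, f] = seg.mean(axis=1)               # average windowed response per frame and CF
    return out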
Figure 3.3: Geometry of the DRT
3.4
Feature extraction using Radon transform
The multiple, parallel-beam projections of the image f(x, y) from different angles are
referred to as the DRT. The projections are obtained by rotating the source around the
centre of the image (Bracewell & Imaging, 1995). In general, the Radon transform
R_θ(x′) of an image is the line integral of f parallel to the y′-axis:

    R_θ(x′) = ∫_{−∞}^{∞} f(x′ cos θ − y′ sin θ, x′ sin θ + y′ cos θ) dy′      (3.1)

where

    [x′]   [ cos θ   sin θ ] [x]
    [y′] = [−sin θ   cos θ ] [y]
Figure 3.3 illustrates the geometry of the Radon transform. There are two distinct forms
of the Radon transform: the source can either be a single point (not shown) or an array
of sources. The Radon transform is a mapping from the Cartesian rectangular coordinates
(x, y) to a distance and an angle (ρ, θ), also known as polar coordinates. Figure 3.4(a)
shows a binary image (size 100 × 100); panels (b) and (c) show the Radon transform of
this image at 0° and 45°, respectively.
Figure 3.4: Illustration of how the Radon transform works: (a) a binary image, (b) its
Radon transform at 0°, and (c) its Radon transform at 45°.
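Projections of the kind shown in Fig. 3.4 can be reproduced with the radon function of
scikit-image; the binary test image below is generated in place of the one used in the
figure.

import numpy as np
from skimage.transform import radon

image = np.zeros((100, 100))
image[40:60, 45:55] = 1.0                  # simple binary object inside a 100 x 100 image

proj_0 = radon(image, theta=[0.0])         # line integrals (one column) at 0 degrees
proj_45 = radon(image, theta=[45.0])       # line integrals at 45 degrees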
3.5
SVM classifier
The SVM is a popular supervised learning method for classification and regression
(Cortes & Vapnik, 1995; Vapnik, 2013). The SVM algorithm constructs a set of hyperplanes
in a high-dimensional space, where each hyperplane is defined by the set of points
having a constant dot product with a vector in that space. In this study, LIBSVM
(C.-C. Chang & Lin, 2011), a Matlab library for SVMs, was used to train the proposed
features (the Radon projection coefficients) to predict the class labels of phonemes.
C-support vector classification (C-SVC) with the radial basis function (RBF) kernel was
employed, and the C parameter of the SVC and the RBF kernel parameter (gamma) were
selected using cross-validation. The mathematical formulation of the C-SVC optimization
problem and its corresponding decision function (Cortes & Vapnik, 1995; C.-C. Chang &
Lin, 2011) is given below.
Given training vectors xi ∈ Rⁿ, i = 1, …, l, in two classes, and an indicator vector
y ∈ Rˡ such that yi ∈ {1, −1}, C-SVC (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik,
1995) solves the following primal optimization problem:

    min_{w, b, ζ}   (1/2) wᵀw + C Σ_{i=1..l} ζi                               (3.2)
    subject to      yi (wᵀ φ(xi) + b) ≥ 1 − ζi,   ζi ≥ 0,   i = 1, …, l,

where φ(xi) maps xi into a higher-dimensional space and C > 0 is the regularization
parameter. Because of the possibly high dimensionality of the vector variable w, one
usually solves the following dual problem instead:

    min_α   (1/2) αᵀQα − eᵀα                                                  (3.3)
    subject to   yᵀα = 0,   0 ≤ αi ≤ C,   i = 1, …, l,

where e = [1, …, 1]ᵀ is the vector of all ones, Q is an l × l positive semidefinite
matrix with Qij ≡ yi yj K(xi, xj), and K(xi, xj) ≡ φ(xi)ᵀ φ(xj) is the kernel function.
After the dual problem (3.3) is solved, the primal–dual relationship gives the optimal
w as

    w = Σ_{i=1..l} yi αi φ(xi)                                                (3.4)

and the decision function is

    sgn(wᵀ φ(x) + b) = sgn( Σ_{i=1..l} yi αi K(xi, x) + b ).

The quantities yi αi for all i, b, the label names, the support vectors, and other
information such as the kernel parameters are stored in the model for prediction.
SVM parameter selection
The SVM is fundamentally a binary classifier; two schemes are commonly used to extend it
to multiple classes: one-versus-one (OVO) and one-versus-rest (OVR). The advantage of
the OVO scheme is that it takes less time, since the individual problems to be solved
are smaller. In the OVR scheme, each test instance is compared with the target class
against all remaining classes at the same time. The classification accuracies obtained
with the OVO and OVR schemes were 69.21% and 68.70%, respectively; in this study, all
performance was therefore evaluated using the OVO scheme. Four types of kernel function
are commonly used with SVMs, and the default kernel (RBF) was used in the presented
phoneme classification.
Two parameters, C and γ, are associated with the RBF kernel. The penalty parameter C
trades off misclassification of the training data against the smoothness of the decision
surface: a small value of C yields a smoother decision surface, while a large value
allows more training samples to be selected as support vectors (Soong, Rosenberg, Juang,
& Rabiner, 1987). Too small a value of gamma (γ) over-constrains the model so that it
cannot capture the complexity of the data. The classification performance of an SVM with
an RBF kernel therefore depends strongly on a proper selection of C and γ, and the best
values were obtained by cross-validation. For C = 1 the individual accuracy was 65.94%,
and the accuracy increased with C up to C = 12; increasing C above 12 left the accuracy
almost unchanged while making the computation more expensive. The value of γ was chosen
similarly. In this study, the values used for C and γ were 12 and 0.05, respectively,
considering both the computational cost and the accuracy under quiet and noisy
conditions.
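The same configuration can be sketched with scikit-learn instead of the LIBSVM/MATLAB
interface used in this work: C-SVC with an RBF kernel, one-versus-one multi-class
handling, and C and γ selected by cross-validated grid search. The feature matrix and
labels below are placeholders for the 350-dimensional Radon features and their phoneme
classes.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(390, 350)               # placeholder 350-dimensional Radon features
y_train = np.repeat(np.arange(39), 10)           # placeholder labels: 10 samples per class

grid = GridSearchCV(SVC(kernel='rbf', decision_function_shape='ovo'),
                    param_grid={'C': [1, 4, 8, 12, 16], 'gamma': [0.01, 0.05, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)                       # cross-validated selection of C and gamma

clf = SVC(kernel='rbf', C=12, gamma=0.05,        # values reported in this study
          decision_function_shape='ovo')
clf.fit(X_train, y_train)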
3.6
Environmental Distortions
The first step in most environmental robustness techniques is to specify how speech
signals are altered by the acoustic environment. Though the relationship between the
environment-corrupted signals and the original, noise-free signals is often complicated,
it is possible to relate them approximately using a model of the acoustic environment
(Acero, 1990; Hansen, 1996).
Figure 3.5: Types of environmental distortion that can affect a speech signal before it
reaches an ASR system: background noise, the Lombard effect, reverberation, additive
transmission noise, and microphone and transmission-channel distortion (Wang, 2015).
Figure 3.5 demonstrates one such model, in which several
major types of environment noise are depicted. First, the production of speech can be
influenced by the ambient background noise. This is the Lombard effect (Junqua &
Anglade, 1990): as the level of ambient noise increases, speakers tend to
hyperarticulate: vowels are emphasised while consonants become distorted. It is
reported that recognition performance with such stressed speech can degrade
substantially (Bou-Ghazale & Hansen, 2000). There have been a few attempts to tackle the
Lombard effect as well as stressed and emotional speech; however, these effects are not
addressed in this thesis, which considers the other environmental distortions. As Figure
3.5 shows, a major source of environmental distortion is the ambient noise present in
the background. Background noise can be emitted near the speaker or near the microphone;
it is additive to the original speech signal in the time domain and can be viewed as
statistically independent of the original speech.
can be stationary (e.g., noise generated by a ventilation fan), semi-stationary (e.g., noise
caused by interfering speakers or music), or abrupt (e.g., noise caused by incoming
vehicles). Additive background noise occurs in all speech recognition applications.
Therefore, it has been intensively studied in the past several decades (Acero, 1990; M. J.
Gales, 2011; M. J. F. Gales, 1995).
If the speaker uses a distant microphone in an enclosed space, the speech signals are
subject to another type of environment distortion – reverberation. As shown in Fig. 3.5
reverberation is usually caused by the reflection of speech waveform on flat surfaces,
e.g., a wall or other objects in a room. These reflections result in several delayed and
attenuated copies of the original signals, and these copies are also captured by the
recording microphone. In theory, the reverberant speech signals can be described as
convolving the speech signals with the room impulse response (RIR), resulting in a
multiplicative distortion in the spectral domain. Compared with the additive background
noises, which are usually statistically independent of the original clean signals, the
reverberant noises are correlated with the original signals and the reverberant distortions
are different from the distortions caused by additive background noise. Reverberation
has a strong detrimental effect on the recognition performance of an ASR systems: for
instance, without any reverberant robustness technique, the recognition accuracy of a
ASR system can easily drop to 60% or even more in a moderate reverberant
environment (length of RIR ∼ 200ms) (Yoshioka et al., 2012). Therefore, reverberant
robustness techniques are an essential component in many applications.
Figure 3.5 also shows that speech signals can be transmitted by a series of transducers,
including microphones and communication channels. Differences in the characteristics of
these transducers add another level of distortion, i.e., channel distortion.
Figure 3.6: Impact of environmental distortions on clean speech signals in various
domains: (a) clean speech signal, (b) speech corrupted with background noise, (c) speech
degraded by reverberation, (d) telephone speech signal; (e)–(h) spectra of the
respective signals shown in (a)–(d); (i)–(l) neurograms of the respective signals shown
in (a)–(d); (m)–(p) Radon coefficients of the respective signals shown in (a)–(d).
Compared with reverberant distortions, channel distortions are usually caused by
linearly filtering the incoming signal with the microphone impulse response, which is
shorter than the analysis window. Figure 3.6 shows the impact of environmental
distortions on clean speech signals in various domains: the distortions have a strong
impact in the acoustic and spectral domains, but their effect in the Radon domain is
very small. For simplicity, the full length of the reverberant speech is not shown in
this figure; it is shown in Figure 3.7.
3.6.1
Existing Strategies to Handle Environment Distortions
The impact of environmental distortions on clean speech signals has been illustrated in
Figure 3.6. The most widely used environmental-robustness techniques can be grouped into
three broad categories (Wang, 2015):
i) Inherently robust front-end or robust signal processing, in which the front-end is
designed to extract feature vectors that are insensitive to the differences between the
clean speech signal x(t) and the noise-corrupted signal y(t), or in which the clean
speech signal x(t) can be reconstructed from the corrupted signal y(t);
ii) Feature compensation, in which the corrupted feature vector y(t) is compensated so
that it closely resembles the clean speech vector x(t); this is referred to as the
feature-based approach;
iii) Model compensation, in which the clean-trained acoustic model Mx is adapted to My
so that it better matches the environment-corrupted feature vectors in the target
condition; this is referred to as the model-based approach.
In this study, we propose an approach that handles these environmental distortions
through a combination of two successful schemes; the proposed approach is discussed in
Section 3.9.
3.7
Generation of noise
We evaluated the proposed feature with different types of noise. The procedures used to
generate these noises are presented below.
3.7.1
Speech with additive noise
Among the three main noise types, additive noise is the most common source of
distortions. It can occur in almost any environment. Additive noise is the sound from
the background, captured by the microphone, and linearly mixed with the target speech
signals. It is statistically independent and additive to the original speech signals.
Additive noise can be stationary (e.g., aircraft/train noise or noise from a ventilation
fan), slowly changing (e.g., background music), or transient (e.g., in-car noise caused
by traffic or a door slamming). When the speech signal is corrupted
by additive noise, the signal that reaches the microphone can be written as
a[t] =s[t] +n[t],
(3.5)
where a[t] is the discrete representation of the noisy input signal and s[t] represents
the clean speech signal corrupted by the noise n[t]. In this thesis, six types of
additive noise are considered: speech-shaped noise (SSN), babble, exhibition, train,
car, and restaurant noise. Additive noise (Hopkins & Moore, 2009) at SNRs ranging from
−5 to 25 dB in steps of 5 dB was added to the clean phoneme signals to evaluate the
performance of the proposed methods.
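A minimal sketch of mixing noise with a clean phoneme at a prescribed SNR, following
Eq. (3.5), is given below; white noise is used here only as a stand-in for the recorded
noise types (SSN, babble, etc.) used in the experiments.

import numpy as np

def add_noise(s, snr_db, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    n = rng.standard_normal(len(s))                       # placeholder noise signal
    p_s = np.mean(s ** 2)
    p_n = np.mean(n ** 2)
    n *= np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))       # scale noise to the target SNR
    return s + n

# e.g. the -5 to 25 dB conditions in 5-dB steps:
# noisy = {snr: add_noise(clean_phoneme, snr) for snr in range(-5, 30, 5)}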
3.7.2
Reverberant Speech
When the length of the impulse response is significantly longer than the analysis
window, the nature of its impact on speech recognition systems is very different.
Figure 3.7: Comparison of clean and reverberant speech signals for phoneme /aa/: (a)
clean speech, (b) signal corrupted by reverberation (c)-(d) Spectrogram of the respective
signals shown in (a)–(b) and (e)-(f) Radon coefficient of the respective signals shown in
(a)-(b).
In the time domain, the reverberation process is mathematically described as the
convolution of the clean speech sequence s(t) with an unknown RIR h(t):

    r(t) = s(t) * h(t)                                                        (3.6)

where s(t), h(t), and r(t) denote the original speech signal, the RIR, and the
reverberant speech, respectively. The length of an RIR is usually measured by the
so-called reverberation time, T60, which is the time needed for the power of the
reflections of a direct sound to decay by 60 dB. The T60 value is usually significantly
longer than the analysis window; for example, T60 normally ranges from 200 ms to 600 ms
in a small office room and from 400 ms to 800 ms in a living room, while in a lecture
room or concert hall it can range from 1 s to 2 s or even longer. For the phoneme
classification experiments with reverberant speech, the clean TIMIT test data were
convolved with a set of eight different room responses collected from various sources
(Gelbart & Morgan, 2001), with reverberation times (T60) ranging from approximately 300
to 400 ms. The convolution was performed with the 'conv' function using the 'same'
option, so that the reverberant phoneme has the same length as the clean phoneme. The
use of eight different room responses results in eight reverberant test sets consisting
of 50754 tokens each.
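The length-preserving convolution of Eq. (3.6) can be sketched as follows; the
exponentially decaying impulse response is only a crude placeholder for the eight
measured room responses used in the experiments.

import numpy as np
from scipy.signal import fftconvolve

def reverberate(s, rir):
    # 'same' keeps the reverberant phoneme the same length as the clean phoneme
    return fftconvolve(s, rir, mode='same')

fs = 16000
t = np.arange(int(0.35 * fs)) / fs                         # ~350 ms decaying tail
rir = np.random.default_rng(0).standard_normal(len(t)) * np.exp(-t / 0.05)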
Figure 3.7 shows the impact of reverberation (T60 = 350 ms) on the waveform, the
spectrogram and the Radon coefficients. Because convolution in the time domain
corresponds to multiplication in the frequency domain and addition in the cepstral
domain, reverberation produces a significant change in the spectral domain; in the Radon
domain, however, the change is only slight.
3.7.3
Telephone speech
If the Lombard effect and the additive transmission noise in the environment model
depicted in Fig. 3.5 are ignored, and the speaker does not talk in an enclosed acoustic
space, the background noise and the channel (both the microphone and transmission
channel) distortion become the main distortions. Figure 3.8 depicts this simplified
environment model.
We can write channel distortion as:
y(t)=s(t)* h(t) + n(t),
(3.7)
where s(t) and y(t) are the time-domain clean speech signal and noisy speech signal
respectively. h(t) is the impulse response of the microphone and transmission network,
and n(t) is the channel filtered version of the ambient background noise. For
experiments in the telephone-channel condition, speech data were collected from nine
telephone handsets in the HTIMIT database (Reynolds, 1997); for each of these telephone
channels, 54748 test tokens are used. In all the experiments, the system is trained only
on the training set of the TIMIT database, representing clean speech without the
distortions introduced by additive or convolutive noise, but is tested on the clean
TIMIT test set as well as on the noisy versions of the test set in the additive,
reverberant, and telephone-channel conditions (mismatched train and test conditions).
Figure 3.8: A simplified environment model in which background noise and channel
distortion dominate.
3.8
Similarity measure
Cross-correlation is a measure of the similarity of two signals or images and can be
computed as a sliding dot product or inner product. Correlation is used in pattern
recognition, single-particle analysis, and neurophysiology. In this study, the
correlation coefficient is used to compare two signals in different domains. The
similarity between two matrices or vectors A and B can be measured by the correlation
coefficient

    r = Σm Σn (Amn − Ā)(Bmn − B̄) / sqrt[ (Σm Σn (Amn − Ā)²) (Σm Σn (Bmn − B̄)²) ]   (3.8)

where r is the correlation coefficient and Ā and B̄ are the mean values of A and B,
respectively.
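Equation (3.8) can be implemented directly for two equally sized matrices (for example
two neurograms or two Radon-coefficient arrays); it is equivalent to the Pearson
correlation of the flattened arrays.

import numpy as np

def corr2(A, B):
    A = A - A.mean()
    B = B - B.mean()
    return (A * B).sum() / np.sqrt((A ** 2).sum() * (B ** 2).sum())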
3.9
Procedure
The block diagram of the proposed neural-response-based phoneme classification
method is shown in Fig. 3.1. Each phoneme signal was up-sampled to 100 kHz which
was required by the AN model in order to ensure stability of the digital filters
implemented for faithful replication of frequency responses of different stages (e.g.,
ME) in the peripheral auditory system (Zilany et al., 2009). The SPL of all phonemes
was set to 70 dB which represents the preferred listening level for a monaural listening
situation. Because the AN model used in this study is nonlinear, the neural
representation would be different at different sound levels. In response to a speech
signal, the model simulates the discharge timings (spike train sequence) of AN fiber for
a given characteristic frequency. Therefore, a 2-D representation, referred to as a
neurogram, was constructed by simulating the responses of AN fibers over a wide range
of CFs spanning the dynamic range of hearing.
In the SVM training phase, the Radon projection coefficients were calculated from the
phoneme neurogram using ten (10) rotation angles ranging from 0° to 180° in steps of
20°. The vector of each Radon projection was resized to 35 points and then combined
together for all angles to form a (1 × 350) feature vector. Thus the total number of
features for each phoneme was 350 irrespective of the duration of the phoneme in the
time domain. A mapping function was used subsequently to normalize the mean and
standard deviation of the feature vector to 0 and 1, respectively. All normalized data
from each phoneme were combined together to form an input array for SVM training.
The corresponding label vector of phoneme classes was also constructed.
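A sketch of this feature construction is given below, assuming `neurogram` holds the 2-D
ENV neurogram of one phoneme; the resampling of each projection to 35 points uses simple
linear interpolation, which is an assumption about a detail the text does not specify.

import numpy as np
from skimage.transform import radon

def radon_feature(neurogram, n_points=35):
    angles = np.arange(0, 181, 20)                           # ten projection angles, 0-180 deg
    sinogram = radon(neurogram, theta=angles, circle=False)  # one column of coefficients per angle
    feats = []
    for k in range(sinogram.shape[1]):
        proj = sinogram[:, k]
        feats.append(np.interp(np.linspace(0, len(proj) - 1, n_points),
                               np.arange(len(proj)), proj))  # resample projection to 35 points
    v = np.concatenate(feats)                                # 10 x 35 = 350 features
    return (v - v.mean()) / v.std()                          # zero mean, unit standard deviation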
In the testing phase, the Radon projection coefficients using the same ten rotation angles
were calculated from the test (unknown) phoneme neurogram. The label (class) of the
test phoneme was identified using the approximated function obtained from the SVM
training stage. The training and test data (for additive noise and room reverberation) were
chosen from the TIMIT database, whereas the data for testing channel distortion were taken
from the HTIMIT database.

Figure 3.9: Neurogram-based feature extraction for the proposed method: (a) a typical
phoneme waveform (/aa/), (b) speech corrupted with SSN (10 dB), (c) speech corrupted
with SSN (0 dB), (d)-(f) neurograms of the respective signals shown in (a)-(c), (g)-(h)
Radon coefficients of the respective signals shown in (a)-(b), and (i) Radon coefficients of
the signals shown in (a)-(c).

Figure 3.9 shows the example features extracted by
applying the Radon transform on the neurogram. Figure 3.9 (a) shows the waveform of
a typical phoneme (/aa/) taken from the TIMIT database, and the same phoneme under
noisy conditions with SNRs of 10 and 0 dB are shown in panels (b) and (c),
respectively. Figure 3.9 (i) shows the Radon projection coefficients of the neurogram
for 3 angles for the phoneme signal in quiet (solid line) and at SNRs of 10 dB (dashed
line) and 0 dB (dotted line). Note that in the proposed method, we employed 10 angles,
but for clarity, only 3 projection angles are shown here. This figure also illustrates how the
Radon projection coefficients change with SNR.
3.9.1 Feature extraction using MFCC, GFCC and FDLP for classification
The classification results of the proposed method were compared to the performances of
three traditional acoustic-property-based features, namely MFCC, GFCC and FDLP.
MFCC is a short-time cepstral representation of speech that is widely used as a
feature in speech processing applications. In this study, the RASTAMAT
toolbox (Ellis, 2005) was used to extract MFCC features from each phoneme signal. A
Hanning window of length 25 ms (with an overlap of 60% between adjacent frames) was
used to divide the speech signal into frames. The log-energy-based 39 MFCC
coefficients were then computed for each frame. This set of coefficients consists of
three groups: Ceps (Mel-frequency cepstral coefficients), Del (derivatives of Ceps) and
Ddel (derivatives of Del), with 13 features for each group (note that in the proposed
method, the total number of features for each phoneme was fixed to 350). In this study,
the FDLP features were derived using the code from Ganapathy et al. (2010),
which is available on their official website. As with MFCC, the same window type was used
to compute the corresponding 39 FDLP features for each phoneme frame, but the overlap
between adjacent frames was 10 ms (40% overlap). The Gammatone filter cepstral
coefficient (GFCC) is an auditory-based feature used in phoneme classification. GFCC
features can be computed by taking the discrete cosine transform (DCT) of the output of a
Gammatone filter-bank, as proposed by Shao et al. (2007). According to physiological
observations, the Gammatone filter-bank closely resembles the cochlear filter-bank (Patterson,
Nimmo-Smith, Holdsworth, & Rice, 1987). In this study, the procedure provided by
Shao et al. (2007) was used to compute the GFCC coefficients for each phoneme. A
fourth-order, 128-channel Gammatone filter-bank with center frequencies from 50
Hz to 8 kHz (half of the sampling frequency) was used to extract the GFCC features.
Instead of the log operation commonly used in MFCC computation, a cubic-root compression
was applied to extract the GFCC features. 23-dimensional GFCC features were used in the
present study. The details of the method can be found in (Shao et
al., 2007). The SVM training and testing procedure described above was employed to
determine the identity (label) of an unknown phoneme. In all the experiments, the
system is trained only on the training set of the TIMIT database, representing clean speech
without the distortions introduced by additive or convolutive noise, but tested on the
clean TIMIT test set as well as the noisy versions of the test set.
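For reference, a compact Python approximation of the 39-dimensional MFCC baseline (13 cepstra plus deltas and delta-deltas from 25-ms Hann windows with 60% overlap) together with an SVM classification step is sketched below. It uses librosa and scikit-learn rather than the MATLAB tools used in this study, and the frame-averaging used to obtain one vector per phoneme is an illustrative simplification.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc39(sig, sr):
    """13 MFCCs + deltas + delta-deltas per frame (25-ms Hann window, 60% overlap)."""
    win = int(0.025 * sr)                   # 25 ms analysis window
    hop = int(0.010 * sr)                   # 10 ms hop = 60% overlap of a 25 ms window
    c = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13,
                             n_fft=win, win_length=win, hop_length=hop,
                             window="hann")
    d = librosa.feature.delta(c)            # first-order derivatives (Del)
    dd = librosa.feature.delta(c, order=2)  # second-order derivatives (Ddel)
    return np.vstack([c, d, dd])            # shape: (39, n_frames)

def phoneme_feature(sig, sr):
    """Average the frame-wise coefficients into one fixed-length vector per phoneme
    (a simplification; the thesis does not necessarily use this pooling)."""
    return mfcc39(sig, sr).mean(axis=1)

# Toy training/testing example with synthetic signals in place of TIMIT phonemes.
rng = np.random.default_rng(4)
sr = 16000
X = np.stack([phoneme_feature(rng.standard_normal(1600), sr) for _ in range(20)])
y = np.array([0] * 10 + [1] * 10)           # placeholder phoneme labels
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```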
CHAPTER 4: RESULTS
4.1 Introduction
This chapter summarizes the results on phoneme classification under diverse
environmental distortions. We describe a series of experiments
to evaluate the
classification accuracy obtained using the proposed methods and to contrast this with
the accuracy obtained using MFCC-, GFCC- and FDLP- based methods. Experiments
were performed in mismatched train/test conditions where the test data are corrupted
with various environmental distortions. The features extracted from original (clean)
phoneme samples of TIMIT train subset were used to train the SVM models. In the
testing stage, additive noise with a particular SNR was added to the test phoneme signal
from the TIMIT test subset, and the proposed features were then extracted. The same
procedure was followed for reverberant speech. Telephone speech was collected
from the HTIMIT corpus. In this study, we present results for individual phone accuracy
(denoted as single) and broad phone class accuracy (denoted as broad), following other
researchers (Halberstadt & Glass, 1997; T. J. Reynolds & Antoniou, 2003; Scanlon et
al., 2007). Broad-class confusion matrices were produced by first computing the
full phone-against-phone confusion matrices and then adding all the entries within each
broad-class block to give a single number. Confusion matrices showing the performance of
the different features in clean and noisy conditions are presented in Appendix A.
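The broad-class aggregation described above can be expressed in a few lines of Python; the phone-to-class mapping shown here is only a small illustrative subset, not the full grouping used in the thesis.

```python
import numpy as np

# Illustrative (partial) mapping from individual phones to broad classes.
PHONES = ["p", "t", "s", "z", "m", "n", "aa", "iy"]
BROAD = {"p": "PLO", "t": "PLO", "s": "FRI", "z": "FRI",
         "m": "NAS", "n": "NAS", "aa": "VOW", "iy": "VOW"}
CLASSES = ["PLO", "FRI", "NAS", "VOW"]

def broad_confusion(full_cm):
    """Sum all entries of a phone-against-phone confusion matrix within each broad-class block."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    cm = np.zeros((len(CLASSES), len(CLASSES)), dtype=int)
    for i, pi in enumerate(PHONES):
        for j, pj in enumerate(PHONES):
            cm[idx[BROAD[pi]], idx[BROAD[pj]]] += full_cm[i, j]
    return cm

# Example with a random full confusion matrix in place of the real one.
rng = np.random.default_rng(5)
full = rng.integers(0, 50, size=(len(PHONES), len(PHONES)))
print(broad_confusion(full))
```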
4.2 Overview of results
The average classification performance for the various feature extraction techniques on
clean speech, speech with additive noise, reverberant speech, and telephone channel
speech is presented in Table 4.1.

Table 4.1: Classification accuracies (%) of individual and broad class phonemes for
different feature extraction techniques on clean speech, speech with additive noise
(average performance of six noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs),
reverberant speech (average performance for eight room impulse response functions),
and telephone speech (average performance for nine channel conditions). The best
performance for each condition is indicated in bold.

Condition                     MFCC (%)         GFCC (%)         FDLP (%)         Proposed feature (%)
                              single  broad    single  broad    single  broad    single  broad
Clean                         60.16   76.68    49.94   64.21    50.57   67.65    69.19   84.21
Speech with additive noise    42.94   53.79    39.00   47.45    37.42   47.72    50.29   60.99
Reverberant speech            27.06   34.06    37.41   48.38    19.26   23.25    34.85   47.77
Telephone speech              42.53   57.93    15.86   29.92    28.19   43.40    41.75   60.15

Our proposed approach outperforms all other baseline features in clean and additive noise
conditions. For classification in reverberant and telephone speech, the baseline GFCC- and
MFCC-based feature extraction techniques provide the best performance, respectively, but
suffer performance degradation in other conditions. However, for reverberant and telephone
speech, the performance of the proposed feature is comparable with that of the GFCC- and
MFCC-based features, respectively.
Table 4.2: Confusion matrices for segment classification in clean condition.

MFCC
        PLO    FRI    NAS    SVW    VOW    DIP    CLO
PLO    4447    707     47    109    223      4    724
FRI     559   5213    178    118    243      7    960
NAS      71     91   3234    168    338     18    514
SVW      97     61    206   5145   1106    207    199
VOW     137    101    327   1114   9791    745    242
DIP       1      3     14    184    656   1552     21
CLO     369    224    402    129    207      2   9539

FDLP
        PLO    FRI    NAS    SVW    VOW    DIP    CLO
PLO    3672    721    181    188    305     29   1165
FRI     590   4942    160    142    199     11   1234
NAS     122    130   2520    373    562     19    708
SVW     138    208    601   4457   1079    101    437
VOW     246    361    886   1511   8386    473    594
DIP      13     21     48    177    685   1391     96
CLO     393    494    455    181    352     27   8970

GFCC
        PLO    FRI    NAS    SVW    VOW    DIP    CLO
PLO     294    873    101    107    347      3   4536
FRI      12   3777    134     87    226      2   3040
NAS       5     11   3064    125    440      4    785
SVW       6      9    192   4928   1621     72    193
VOW       5     34    216   1084  10327    466    325
DIP       0      0      3    299   1679    438     12
CLO      12    295    311    122    370      3   9759

Proposed feature
        PLO    FRI    NAS    SVW    VOW    DIP    CLO
PLO    5514    395     28     31     36      0    257
FRI     565   5903    138     72     90      1    509
NAS      46     75   3507    135    342      4    325
SVW      51     69    194   5640    908     92     67
VOW      45     47    274    901  10567    513    110
DIP       0      0      6    155    664   1604      2
CLO     195    201    297     81     92      0  10006
4.3 Classification accuracies (%) for phonemes in quiet environment
The overall classification accuracies for individual and broad class phonemes under
clean condition are shown in Table 4.1. Our proposed method achieved accuracies of
69.19% and 84.21% for individual and broad phoneme classification, respectively, in the quiet
environment. The classification scores using the MFCC-, GFCC- and FDLP-based features are
60.16%, 49.94%, and 50.57%, respectively, for individual phones, and the corresponding
broad phone class accuracies are 76.68%, 64.21% and 67.65%. The proposed method resulted in
higher classification accuracy than the MFCC-, FDLP- and GFCC-based methods.
The segment classification confusion matrices are shown in Table 4.2 for clean
condition. It is obvious that closures (CLO) were in general more confused with others
for all types of features. Some of the plosives (PLO) and fricatives (FRI) were confused
with other groups, but most of the confusions were observed within these two groups.
Similarly, nasals (NAS), semivowels (SVW), vowels (VOW) and diphthongs (DIP)
were confused more among themselves than with other groups for all four methods.
However, the proposed method outperformed the three other traditional methods in
terms of accuracy.
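The per-class accuracies reported in Table 4.3 follow directly from the confusion matrices in Table 4.2: each diagonal entry is divided by its row sum. A short sketch, using the MFCC closure (CLO) row of Table 4.2 as an example, is given below.

```python
import numpy as np

def class_accuracies(cm):
    """Per-class accuracy (%): diagonal entry divided by the corresponding row sum."""
    cm = np.asarray(cm, dtype=float)
    return 100.0 * np.diag(cm) / cm.sum(axis=1)

# CLO row of the MFCC confusion matrix in Table 4.2.
clo_row = np.array([369, 224, 402, 129, 207, 2, 9539])
print(round(100.0 * clo_row[-1] / clo_row.sum(), 2))   # ~87.74, cf. 87.73% in Table 4.3
```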
Table 4.3 shows the accuracies of broad classes using different features in quiet
condition. Under clean condition, the performance of the proposed method for all
classes was better than the results for all other features. The performance of the GFCC-based
method was worse for plosives than that of the other methods. In general,
closures were identified more accurately in quiet by all methods (proposed:
92.03%, MFCC: 87.73%, GFCC: 80.39%, and FDLP: 82.50%).
Table 4.3: Classification accuracies (%) of broad phonetic classes in clean condition.

Class    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
PLO        71.00        4.69       58.64       88.06
FRI        71.62       51.89       67.90       81.10
NAS        72.93       69.10       56.83       79.09
SVW        73.28       70.18       63.48       80.33
VOW        78.59       82.90       67.31       84.82
DIP        63.84       18.01       57.21       65.98
CLO        87.73       80.39       82.50       92.03

4.4 Performance for signals with additive noise
This section presents the classification accuracies (%) for the different feature extraction
techniques for six noise types with SNRs ranging from -5 to 25 dB in
steps of 5 dB. Table 4.4 shows the individual phone accuracies (%) for SSN, babble and
exhibition noise. Similarly, individual phone accuracies (%) for restaurant, train and
car noise are presented in Table 4.5. In the experiment reported in Table 4.4, for
SSN at -5 dB SNR, the GFCC feature provides the best performance. However, for
almost all noise types and SNRs, the proposed feature provides improvements over
all baseline features. The broad classification accuracies in the clean condition and under
different background noises are shown in Fig. 4.1. Again, the proposed method resulted in
higher classification accuracy than the MFCC-, GFCC- and FDLP-based methods.
Table 4.4: Individual phoneme classification accuracies (%) for different feature
extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs.
The best performance for each condition is indicated in bold.

SSN
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          23.35       25.69       22.41       20.23
   0          31.72       31.96       27.39       33.90
   5          41.78       40.90       34.09       46.36
  10          50.50       46.72       40.98       57.25
  15          55.82       48.71       46.29       64.20
  20          58.58       49.47       48.65       67.30
  25          59.45       49.66       50.04       68.56

Babble
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          21.90       20.39       23.01       23.15
   0          29.28       29.88       27.64       34.96
   5          38.97       38.84       33.52       46.33
  10          47.21       44.29       39.65       55.72
  15          53.78       47.41       44.74       62.71
  20          57.39       49.04       47.89       66.69
  25          58.99       49.57       49.51       68.35

Exhibition
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          19.10       19.17       22.37       23.57
   0          25.84       25.62       26.63       32.82
   5          34.79       32.52       32.08       41.83
  10          43.42       39.45       38.38       50.52
  15          50.47       44.43       43.55       57.53
  20          55.29       47.28       46.73       63.04
  25          57.96       48.82       48.68       66.70
Table 4.5: Individual phoneme classification accuracies (%) for different feature
extraction techniques for 3 different noise types at -5, 0, 5, 10, 15, 20 and 25 dB SNRs.
The best performance for each condition is indicated in bold.

Restaurant
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          22.65       21.51       23.43       24.27
   0          29.91       30.95       28.11       36.15
   5          39.42       39.46       33.89       47.37
  10          47.77       44.28       39.84       56.14
  15          53.83       47.30       44.90       62.72
  20          57.30       48.98       47.95       66.72
  25          59.17       49.58       49.54       68.19

Train
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          19.54       17.13       17.13       23.44
   0          26.53       23.50       23.50       32.21
   5          35.41       31.27       31.27       42.18
  10          44.53       39.40       39.40       51.07
  15          51.28       44.25       44.25       57.95
  20          55.54       47.00       47.00       62.95
  25          57.91       48.63       48.63       66.41

Car
SNR (dB)    MFCC (%)    GFCC (%)    FDLP (%)    Proposed feature (%)
  -5          17.81       22.20       22.37       47.38
   0          24.83       29.20       26.63       31.16
   5          34.23       36.65       32.08       41.78
  10          44.32       42.54       38.38       51.61
  15          51.49       46.45       43.55       59.05
  20          56.24       48.47       46.73       64.35
  25          58.34       49.52       48.68       67.43
Figure 4.1: Broad phoneme classification accuracies (%) for different features in various
noise types at different SNRs values. Clean condition is denoted as Q.
4.5 Performance for reverberant speech
In this section, the effectiveness of the proposed approaches to robust phoneme
classification in reverberant speech will be investigated. Phoneme classification in
reverberant environments is a challenging problem due to the highly dynamic nature of
reverberated speech. Reverberation can be described as a convolution of the clean
speech signal with an RIR. The length of an RIR is usually characterized by the reverberation
time, T60, which is the time needed for the power of the reflections of a direct sound to
decay by 60 dB. It can be defined as follows:

Reverberation time, T60 = time for the sound level to drop 60 dB below its original level
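Reverberant test signals are obtained by convolving clean speech with an RIR, and T60 can be estimated from the decay of the RIR energy. The sketch below is a simplified illustration (not the procedure used to label the test sets in Table 4.6); it applies Schroeder backward integration and extrapolates the -5 to -25 dB portion of the decay to 60 dB.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    """Reverberant speech as the convolution of clean speech with a room impulse response."""
    return fftconvolve(clean, rir, mode="full")[: len(clean)]

def estimate_t60(rir, fs):
    """Rough T60 estimate via Schroeder backward integration of the RIR energy."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]           # energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)    # normalized decay in dB
    t = np.arange(len(rir)) / fs
    # Fit the -5 dB to -25 dB portion and extrapolate the slope to a 60 dB drop.
    mask = (edc_db <= -5) & (edc_db >= -25)
    slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope

# Example with a synthetic exponentially decaying RIR (placeholder for a measured one).
fs = 16000
rng = np.random.default_rng(6)
rir = rng.standard_normal(fs) * np.exp(-6.9 * np.arange(fs) / (0.35 * fs))
print(round(estimate_t60(rir, fs), 3), "s")   # should come out roughly 0.35 s for this decay
```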
Table 4.6: Classification accuracies (%) for eight different reverberant test sets. The best
performance for each condition is indicated in bold. The last row shows the average
value, indicated as "Avg".

T60 (ms)    MFCC (%)         GFCC (%)         FDLP (%)         Proposed feature (%)
            single  broad    single  broad    single  broad    single  broad
344         25.28   30.34    36.19   46.34    19.81   23.86    34.92   47.04
350         25.67   30.84    36.88   47.59    19.46   23.72    35.43   48.43
352         25.35   30.62    36.64   47.43    19.31   23.79    35.33   48.60
359         25.39   30.43    38.31   49.59    19.51   23.42    34.91   47.66
360         25.32   30.39    38.47   50.07    19.11   23.04    34.84   47.77
364         24.67   29.05    36.57   47.26    19.08   22.72    33.49   45.55
395         32.50   45.63    37.10   48.11    18.77   22.83    34.06   47.43
404         32.36   45.18    39.19   50.65    19.04   22.67    35.87   49.68
Avg.        27.07   34.06    37.41   48.38    19.26   23.25    34.85   47.77
Table 4.6 shows the results for reverberant speech. In general, the GFCC-based system
provides the best performance (average accuracies of 37.41% and 48.38% for
individual and broad class, respectively) across all reverberant test sets. The average results
for the same test sets using our proposed feature are 34.85% and 47.77%, respectively.
The performance of our proposed feature is thus comparable to that of the
GFCC-based system for reverberant speech. However, MFCC- and FDLP-based
features are substantially less accurate for reverberant speech.
4.6 Signals distorted by noise due to telephone channel
Telephone speech recognition is extremely difficult due to the limited bandwidth of
transmission channels. From Figure 3.6, we can see that telephone speech contains no
energy above 4 kHz. To evaluate the effectiveness of the proposed approach for signals
distorted by telephone channels, we used the well-known HTIMIT database (D. A.
Reynolds, 1997). Classification accuracies (%) with different feature extraction
techniques for different handsets are presented in Table 4.7. We can see that for some
handsets (cb2, cb4, el2 and el4), MFCC shows the best performance, and for other
handsets (cb1, cb3, el3 and pt1), the proposed feature shows the best performance. The
average classification accuracies of the proposed method for nine telephone channels
were 41.75% and 60.16% for individual phones and broad phone classes, respectively,
whereas for the same condition, the performance of the MFCC-based method was
42.53% and 57.94%, respectively. In general, the performances of the MFCC and the
proposed feature are comparable for telephone speech, but GFCC and FDLP are
substantially less accurate. For all features, the classification accuracy is very low for
signals distorted by handsets cb3 and cb4, because these handsets had particularly poor sound
characteristics (D. A. Reynolds, 1997).
Table 4.7: Classification accuracies (%) for signals distorted by nine different telephone
channels. The best performance is indicated in bold. The last row shows the average
value, indicated as "Avg".

Channel    MFCC (%)         FDLP (%)         GFCC (%)         Proposed feature (%)
           single  broad    single  broad    single  broad    single  broad
cb1        50.26   64.09    29.09   44.49    17.26   30.09    50.84   68.67
cb2        52.67   66.65    30.84   47.04    19.80   32.73    47.21   66.73
cb3        27.17   44.26    22.42   36.73    13.41   27.34    32.76   48.56
cb4        33.01   48.58    24.29   38.98    16.91   31.37    32.89   48.47
el1        51.69   67.45    30.73   46.43    16.28   29.97    49.64   69.62
el2        45.66   60.95    28.50   43.80    11.20   24.99    40.08   60.64
el3        45.97   59.82    28.95   44.08    16.48   30.11    48.35   67.42
el4        45.45   58.43    29.91   45.02    19.15   35.10    35.49   51.90
pt1        30.88   51.22    28.95   44.08    12.21   27.64    38.49   59.39
Avg.       42.53   57.94    28.19   43.41    15.86   29.93    41.75   60.16
CHAPTER 5: DISCUSSIONS
5.1 Introduction
This chapter mainly discusses the effect of different parameters on classification
accuracy, robustness issues, and a comparison of the performances of the different
methods. The important finding of the study is that the proposed feature resulted in
consistent performance across different types of noise. On the other hand, all the baseline
feature-based systems (such as the MFCC, GFCC and FDLP coefficients) produced
quite different results for different types of noise. For example, the MFCC-based feature
achieved good performance under channel distortions, but suffers severely under
reverberant conditions. Similarly, classification using the GFCC-based feature is less
accurate in the clean condition and under channel distortions, but exhibits more robust
behavior in reverberant conditions.
5.2 Broad class accuracy
Based on the simulation results, the proposed system outperformed all baseline feature-based
systems in the clean condition and resulted in comparable performance under noisy
conditions. According to the simulation results, it is obvious that phonemes classified by
the proposed method are confused more within groups than across
different groups. Table 5.1 shows the correlation coefficients between the waveforms of
two different phonemes in both the time and Radon domains. We consider two phonemes
from the stops group (/p/, /t/) and one phoneme from the vowel group (/aa/), all having the
same length (106 ms).
Figure 5.1: Examples of Radon coefficient representations: (a) stop /p/, (b) fricative /s/,
(c) nasal /m/, and (d) vowel /aa/.
Figure 5.2: Examples of Radon coefficient representations for stops: (a) /p/, (b) /t/, (c) /k/,
and (d) /b/.
In the time and Radon domains, the correlation coefficient between /p/ and /t/ is 0.04 and 0.98,
respectively. Similarly, for /p/ and /aa/, the correlation coefficient is -0.05 and 0.94 in the time
and Radon domains, respectively. From Table 5.1, we can see that phonemes of the
same group are more correlated with each other than with phonemes of other groups. Figure 5.1
shows the Radon coefficient representations of four phonemes from four different groups.
Figure 5.2 shows four phonemes from the stops group. From these figures, it is clear that the
Radon representation of a phoneme is closer to that of a phoneme from the same group than to
the representations of phonemes from different groups. This observation may explain the
good broad-class classification accuracy of the proposed method.
Table 5.1: Correlation measure in the acoustic and Radon domains for different phoneme pairs.

                       Similarity index
Phoneme pair      Acoustic     Radon (from neurogram)
/p/ and /t/         0.04            0.98
/p/ and /aa/       -0.05            0.94
The following sections describe the impact of different parameters on the accuracy and
robustness of the proposed method compared to the alternative methods.
5.3 Comparison of results from previous studies
In Table 5.2, results of some recent experiments on the TIMIT classification task in the
quiet condition are gathered and compared with the results reported in this thesis. We
also show results obtained using MFCC. Unfortunately, despite the use of a standard
database, it is still difficult to compare results because of differences in the selection of
training and test data, and also because of differences in the selection of phone groups. In
order to include more variations of phonemes, the proposed method was tested using the
complete test set of the TIMIT database (50754 tokens from 168 speakers), which we believe
is the most reasonable choice for comparing phonetic classification methods.
Clarkson et al. created a multi-class SVM system to classify phonemes (Clarkson &
Moreno, 1999). They also used GMM for classification and achieved an accuracy of
73.7%. Their reported result of 77.6%, using the core set (24 speakers), is extremely
encouraging and shows the potential of SVMs in speech recognition. In 2003, Reynolds
et al. used a modular MLP architecture for speech recognition (T. J. Reynolds &
Antoniou, 2003).
Table 5.2: Phoneme classification accuracies (%) on the TIMIT core test set (24
speakers) and complete test set (168 speakers) in quiet condition for individual phones
(denoted as single) and broad classes (denoted as broad). Here, RPS is the abbreviation of
reconstructed phase space.

Author                               Method                        Test set    Single (%)      Broad (%)
(Clarkson & Moreno, 1999)            GMM (MFCC) / SVM (MFCC)       Core        73.7 / 77.6     -
(T. J. Reynolds & Antoniou, 2003)    Modular MLP architecture      Core        -               84.1
(Halberstadt & Glass, 1997)          Heterogeneous framework       Complete    -               79.0
(Johnson et al., 2005)               HMM (RPS) / HMM (MFCC)        Complete    35.06 / 54.86   -
Proposed method                      SVM (MFCC) / SVM (proposed)   Complete    60.16 / 69.06   76.68 / 84.21
Their broad-class accuracy using the core test set was 84.1%. In this thesis, we used the
broad-class phone grouping defined by them. The most directly comparable results in the quiet
environment appear to be those of (Halberstadt & Glass, 1997; Johnson et al., 2005).
Halberstadt et al. used heterogeneous acoustic measurements for phoneme
classification. Their broad-class accuracy using the complete set (118 speakers) was
79.0%, whereas our accuracy is 84.21% (168 speakers). Similarly, Johnson et al.
reported results for an MFCC-based feature using HMM on the complete set; their
single phone accuracy was 54.86%, whereas we achieved 69.06% using the
proposed feature. They also reported an accuracy of 35.06% for their proposed RPS
feature. We also obtained 60.16% using the MFCC-based feature with the SVM
classifier.
Ganapathy et al. reported phoneme recognition accuracies on the complete test set of
individual phonemes for the FDLP-based feature, both in quiet and under different noise
types. The accuracy of their feature for clean speech, speech with additive noise
(average performance of four noise types at 0, 5, 10, 15, and 20 dB SNRs), reverberant
speech (average performance for nine room impulse response functions), and telephone
speech (average performance for nine channel conditions) was 62.1%, 43.9%, 33.6%
and 55.5%, respectively. Although they used the complete test set from TIMIT and nine
telephone channels from HTIMIT for evaluation, it is still difficult to compare our results
with theirs directly, as their reported results were for phone recognition whereas our results
are for phone classification. The accuracy of the proposed feature for clean speech, speech
with additive noise (average performance of six noise types at 0, 5, 10, 15, and 20 dB
SNRs), reverberant speech (average performance for eight room impulse response
functions), and telephone speech (average performance for nine channel conditions) is
69.06%, 51.48%, 34.85% and 41.75%, respectively.
5.4 Effect of the number of Radon angles on classification results
Table 5.3 presents the classification accuracy of the proposed method as a function of
the number of Radon angles. As the number of angles increased, the classification
accuracy improved substantially up to 10 angles, both in quiet and under noisy
conditions. The number of Radon angles used in this study was therefore chosen as ten,
based on the phoneme classification accuracy both in quiet and under noisy conditions,
even though the performance with 13 angles is slightly higher than that with 10. The
performance of the classifier becomes saturated for more than 13 angles.
Table 5.3: Phoneme classification accuracies (%) as a function of the number of Radon
angles (between 0° and 180°) used to encode phoneme information.

No. of    Clean (%)        Additive noise (%)    Reverberant speech (%)    Telephone speech (%)
angles    single  broad    single  broad         single  broad             single  broad
1         34.01   50.94    23.02   36.28         28.99   44.48             19.55   24.16
3         61.64   78.18    29.52   48.51         44.71   64.32             36.57   48.51
5         66.93   82.54    32.38   52.92         48.29   68.26             35.57   47.57
7         68.28   83.42    33.41   54.18         49.43   69.33             35.15   47.20
10        69.19   84.24    33.90   54.65         49.64   69.62             34.92   47.04
13        69.44   84.54    33.87   54.90         49.79   69.65             34.84   47.09

5.5 Effect of SPL on classification results
Substantial research has shown that in a quiet environment, presenting speech at levels higher
than the conversational speech level (~65-70 dB SPL) results in decreased speech
understanding (Fletcher & Galt, 1950; French & Steinberg, 1947; Molis & Summers,
2003; Pollack & Pickett, 1958; Studebaker, Sherbecoe, McDaniel, & Gwaltney, 1999).
Under noisy conditions, speech recognition is affected by several factors, including
the presence or absence of a background noise, the SNR, and the frequencies being
presented (French & Steinberg, 1947; Molis & Summers, 2003; Pollack &
Pickett, 1958; Studebaker et al., 1999). In order to quantify the effect of SPL on the
proposed feature, the classification accuracies were estimated with different noise types
at 50, 70 and 90 dB SPL, as shown in Table 5.4.

Table 5.4: Effects of SPL on classification accuracy (%).

SPL (dB)    Clean (%)        Additive noise (%)    Reverberant speech (%)    Telephone speech (%)
            single  broad    single  broad         single  broad             single  broad
90          65.40   80.88    35.29   53.53         35.76   48.00             42.41   59.74
70          69.19   84.24    33.90   54.65         34.92   47.04             49.64   69.62
50          71.23   85.60    37.34   59.08         31.27   40.76             49.94   70.09

The results show that the classification accuracy decreased when the SPL increased from
70 dB to 90 dB SPL in the quiet condition, which is consistent with previous studies. It is
interesting that the accuracy increased from 69.19% at 70 dB SPL to 71.23% at 50 dB SPL
in the quiet condition. This could be related to the
saturation of AN fiber responses at higher levels (the dynamic range is ~30-40 dB from
threshold). But for reverberant speech, the performance is not satisfactory at 50 dB SPL
compared to the results at higher sound presentation levels. In this thesis, we evaluate
the performance of the proposed and existing methods using speech presented at 70 dB
SPL. The MFCC- and FDLP-based classification systems are unaffected by variations in
sound presentation level, and the effect of SPL on the GFCC-based system is negligible.
Thus, only the proposed feature-based system captures the effect of SPL on
recognition.
5.6 Effect of window length on classification results
The neurogram resolution for the proposed method was chosen based on
physiologically-based data (Zilany & Bruce, 2006). To study the effect of neurogram
resolution on phoneme classification, three different window sizes (16, 8 and 4 ms)
were considered for smoothing, as illustrated in Table 5.5. It was observed that in the quiet
condition, the classifier performance was relatively high when a window length of 4 ms
was used, but this resulted in poorer performance under noisy conditions.
Table 5.5: Effect of window size on classification performance (%).

Window (ms)    Clean (%)        Additive noise (%)    Reverberant speech (%)    Telephone speech (%)
               single  broad    single  broad         single  broad             single  broad
16             67.27   82.46    37.28   56.75         36.60   50.77             36.65   49.37
8              69.19   84.24    33.90   54.65         49.64   69.62             34.92   47.04
4              69.39   84.75    32.65   53.24         49.35   70.19             33.19   44.82
The performance of the proposed method with a 16-ms window was satisfactory for noisy
conditions but exhibited slightly lower results in quiet. Thus, considering the performance
in clean and noisy conditions, the window length was set to 8 ms for this study.
5.7 Effect of number of CFs on classification results
The frequency at which a given neuron responds to the lowest sound intensity is called its
characteristic frequency (CF). It is also known as the best frequency (BF). In this study, we
investigated the effect of the number of CFs on the performance of the proposed system
using the three different simulation conditions shown in Table 5.6. The results suggest that
the performance of the proposed system with 12 CFs is not satisfactory in quiet and noisy
conditions, whereas a system based on the responses of 32 neurons (out of 30,000 neurons)
is enough to classify phonemes confidently. In the proposed system, we employed 32 CFs,
since the computational cost increases with the number of CFs.
Table 5.6: Effect of number of CFs on classification performance (%).

No. of CFs    Clean (%)        Additive noise (%)    Reverberant speech (%)    Telephone speech (%)
              single  broad    single  broad         single  broad             single  broad
12            63.31   80.44    33.83   53.52         48.90   69.86             31.87   44.41
32            69.19   84.24    33.90   54.65         49.64   69.62             34.92   47.04
40            69.21   84.10    33.60   54.44         49.64   69.71             35.11   47.31

5.8 Robustness property of the proposed system
In this section, we will investigate and discuss the robustness of the proposed system.
The similarity between two matrices or vectors can be quantified by the correlation
coefficient of Eq. (3.8). First, we will investigate the robustness of the
neurogram (neural responses) compared to the other baseline features. We consider a
typical phoneme (/aa/) of length almost 65 ms (not shown) as an input to the AN model
to generate the corresponding neurogram responses [Fig. 5.3 (d)]. The corresponding
MFCC, GFCC and FDLP coefficients are also shown for each frame in Fig. 5.3 (a), (b)
and (c), respectively. The size of the generated neurogram image was 32 by 17, where
the neural responses were simulated for 32 CFs. All columns of the neurogram array were
concatenated to form a 1-D vector (of size 1×544), which is shown in
Fig. 5.3 (d). A similar procedure was followed for the other feature extraction techniques. The
responses (features) of the phoneme in quiet are shown by solid lines, and the
responses for the same phoneme distorted by SSN at 0 dB SNR are shown by
dotted lines in the corresponding plots (a, b, c, and d). The correlation coefficient
between the clean and noisy vectors was computed using Eq. (3.8), and it was found to
be ~0.85 for the neurogram feature, 0.76 for MFCC, 0.78 for GFCC, and 0.72 for the
FDLP coefficients.
Figure 5.3: Features extracted from the phoneme in quiet (solid lines) and at an SNR of
0 dB (dotted lines): (a) MFCC features, with a correlation coefficient of 0.76 between the
two vectors; (b) GFCC features, with a correlation coefficient of 0.78; (c) FDLP features,
with a correlation coefficient of 0.72; and (d) neurogram responses, with a correlation
coefficient of 0.85.

Based on the similarity index represented by the correlation coefficient,
it can be concluded that the proposed neural-response-based neurogram features were
more robust than the traditional acoustic-property-based features, meaning that
the representation of the phoneme in the AN responses was less distorted by noise.
Now, we will investigate the robustness of the Radon features derived from the neurogram.
Figure 3.9 shows the effect of different noises on the neurogram and the corresponding Radon
coefficients (the Radon coefficients were computed by applying the Radon transform to the
neurogram). From this figure, it is obvious that the Radon representations are more robust. To
make this clearer, we measured the correlation between the clean and noisy phonemes in
different domains,
as presented in Table 5.7. Seven phonemes from seven different groups were chosen for
the experiment, and SSN was added at SNRs of 0 and 10 dB to generate two noisy
versions of each phoneme. We then measured the correlation coefficient between the clean
and noisy phonemes (10 dB and 0 dB) in different domains. Finally, we computed the mean
and standard deviation of the correlation coefficient over 100 instances of the same phoneme.
In Table 5.7, we report the mean values only, because the standard deviations are very low:
for the time, spectral, neurogram, and Radon domains, the average standard deviations of the
correlation coefficient between clean and 10 dB are 0, 0.03, 0.04 and 0, respectively,
and between clean and 0 dB are 0.03, 0.10, 0.05 and 0.01,
respectively. Table 5.7 suggests that the Radon representations of phonemes are more
robust than the representations in any other domain.
Table 5.7: Correlation measure for different phonemes and their corresponding noisy
(SSN) phonemes (clean-10 dB and clean-0 dB) in different domains. The average
correlation measure (Avg) over the seven phonemes is indicated in bold (last row).

            Time            Spectral        Neurogram       Radon
           10 dB   0 dB    10 dB   0 dB    10 dB   0 dB    10 dB   0 dB
/p/         .95     .70     .94     .68     .97     .94     .99     .99
/s/         .95     .70     .96     .28     .91     .86     .99     .98
/m/         .95     .71     .99     .95     .96     .92     .99     .99
/l/         .95     .70     .99     .91     .92     .86     .99     .98
/aa/        .95     .70     .99     .80     .89     .81     .99     .98
/ow/        .95     .70     .99     .91     .86     .77     .99     .98
/bcl/       .95     .70     .99     .88     .93     .89     .99     .99
Avg         .95     .70     .98     .77     .92     .86     .99     .98

In the present study, the neurogram features are derived directly from the responses of a
model of the auditory periphery to speech signals. The proposed method differs from all
previous auditory-model-based methods in that a more complete and physiologically-based
model of the auditory periphery is employed in this study (Zilany, Bruce, &
Carney, 2014). The auditory-nerve (AN) model developed by Zilany et al. has been
extensively validated
against a wide range of physiological recordings from the mammalian peripheral
auditory system. The model can successfully replicate almost all of the nonlinear
phenomena observed at different levels of the auditory periphery (e.g., in the cochlea,
the inner hair cell (IHC), the IHC-AN synapse, and the AN fibers). These phenomena include
nonlinear tuning, compression, two-tone suppression, level-dependent rate and phase
responses, shift in the best frequency with level, adaptation, and several high-level
effects (Zilany & Bruce, 2006; Zilany et al., 2014; Zilany et al., 2009). The model
responses have been tested for both simple (e.g., tone) and complex stimuli (e.g.,
speech) with a wide range of frequency and intensity spanning the dynamic range of
hearing.
The robustness of the proposed neural-response-based system could lie in the
underlying physiological mechanisms observed at the level of the auditory periphery.
Since the AN model used in this study is nonlinear (i.e., it incorporates most of these
nonlinear phenomena), it would be difficult to tease apart the contribution of each individual
nonlinear mechanism to the classification performance. However, it would not be
unwise to shed some light on the possible mechanisms contributing to the classification task,
especially under noisy conditions. An AN fiber tends to fire at a particular phase of a
stimulating low-frequency tone, meaning that it tends to produce spikes at integer multiples
of the period of that tone. It has been reported that the magnitude of phase-locking declines
with frequency and that the limit of phase locking varies somewhat across species, but the
upper frequency boundary lies at ~4-5 kHz (Palmer & Russell, 1986).
CHAPTER 6: CONCLUSIONS & FUTURE WORKS
6.1 Conclusions
In this study, we proposed a neural-response-based method for a robust phoneme
classification system which works well both in quiet and under noisy environments.
The proposed feature successfully captured the important distinguishing information
about phonemes to make the system relatively robust against different types of
degradation of the input acoustic signals. The neurogram was extracted from the
responses of a physiologically-based model of the auditory periphery. The features from
each neurogram were extracted using the well-known Radon transform. The performance of
the proposed method was evaluated in quiet and under noisy conditions such as additive
noise, room reverberation, and telephone channel noise, and was also compared to the
classification accuracy of several existing methods. Based on the simulation results, the
proposed method outperformed most of the traditional acoustic-property-based
phoneme classification methods both in quiet and under noisy conditions. The
robustness of the proposed neural-response-based system could lie in the underlying
physiological mechanisms observed at the level of the auditory periphery. The main
findings of this study can be summarized as follows:
1. In the quiet environment and for speech with additive noise, the proposed feature
outperformed all other baseline feature-based systems.
2. For reverberant speech, the GFCC feature achieved very good performance but
suffered severely in other conditions, especially in quiet. The classification accuracy
achieved by the proposed method was comparable to the results using the GFCC
feature for reverberant speech.
3. The MFCC feature was less accurate in noisy conditions, especially for reverberant
speech, but exhibited more robust behavior for telephone speech. The neural-response-based
proposed feature also resulted in comparable performance (like MFCC) for channel
distortions.
4. The results showed that the classification accuracy decreased when the SPL
increased from 70 dB to 90 dB SPL in the quiet condition, which is consistent with
the published results from previous behavioral studies.
6.2 Limitations and future work
In this study, the responses of AN fibers at 32 CFs were simulated in response to phoneme
signals to construct neurograms, which incurs high computational complexity. Although
the AN model incorporates most of the nonlinearities observed at the peripheral level of
the auditory system, the effect of each individual nonlinear mechanism on phoneme
recognition was not explored in this study. In addition, although the classification
performance under additive noise and channel distortions was satisfactory, the proposed
system resulted in relatively poorer classification performance under reverberant
conditions. There are a number of possible directions to be further explored along the
line of research presented in this thesis. Some of these directions are summarized
in the following:

• The proposed approach could be used in a hybrid phone-based architecture that
integrates SVMs with HMMs for continuous speech recognition (Ganapathiraju
et al., 2004; Krüger et al., 2005) and is expected to improve the recognition
performance over HMM baselines.

• Since the AN model employed in this study is a computational model, it would
allow investigation of the effects of individual nonlinear phenomena on classification
accuracy. The outcome will have important implications for designing signal-processing
strategies for hearing aids and cochlear implants. Because the
computational model allows simulation of the responses of impaired AN fibers, the
present work could be extended to predict the classification performance for
people with hearing loss under diverse conditions.
LIST OF PUBLICATIONS AND CONFERENCE PROCEEDINGS
Alam MS, Jassim WA, Zilany MS. “Radon transform of auditory neurograms: a robust
feature set for phoneme classification”. IET Signal Processing. 2017 Oct 5;12(3):260-268.
Alam MS, Zilany MS, Jassim WA, Ahmad MY. “Phoneme classification using the
auditory neurogram”. IEEE Access. 2017 Jan 16;5:633-42.
Alam, M. S., Jassim, W., & Zilany, M. S. (2014, December). “Neural response based
phoneme classification under noisy condition”. In Intelligent Signal Processing and
Communication Systems (ISPACS), 2014 International Symposium on (pp. 175-179).
IEEE.
REFERENCES
Acero, A. (1990). Acoustical and environmental robustness in automatic speech
recognition. Carnegie Mellon University Pittsburgh.
Allen, J. B. (1994). How do humans process and recognize speech? Speech and Audio
Processing, IEEE Transactions on, 2(4), 567-577.
Bondy, J., Becker, S., Bruce, I., Trainor, L., & Haykin, S. (2004). A novel signal-processing strategy for hearing-aid design: neurocompensation. Signal
Processing, 84(7), 1239-1253.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. Paper presented at the Proceedings of the fifth annual
workshop on Computational learning theory.
Bou-Ghazale, S. E., & Hansen, J. H. (2000). A comparative study of traditional and
newly proposed features for recognition of speech under stress. Speech and
Audio Processing, IEEE Transactions on, 8(4), 429-442.
Bracewell, R. N. (1995). Two-dimensional imaging. Prentice-Hall Signal Processing Series.
Englewood Cliffs, NJ: Prentice-Hall.
Brown, G. J., & Cooke, M. (1994). Computational auditory scene analysis. Computer
Speech & Language, 8(4), 297-336.
Bruce, I. C. (2004). Physiological assessment of contrast-enhancing frequency shaping
and multiband compression in hearing aids. Physiological measurement, 25(4),
945.
Bruce, I. C., Sachs, M. B., & Young, E. D. (2003). An auditory-periphery model of the
effects of acoustic trauma on auditory nerve responses. The Journal of the
Acoustical Society of America, 113(1), 369-388.
Carney, L. H., & Yin, T. (1988). Temporal coding of resonances by low-frequency
auditory nerve fibers: single-fiber responses and a population model. Journal of
Neurophysiology, 60(5), 1653-1677.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chang, L., Xu, J., Tang, K., & Cui, H. (2012). A new robust pitch determination
algorithm for telephone speech. Paper presented at the Information Theory and
its Applications (ISITA), 2012 International Symposium on.
Chen, R., & Jamieson, L. (1996). Experiments on the implementation of recurrent
neural networks for speech phone recognition. Paper presented at the Signals,
Systems and Computers, 1996. Conference Record of the Thirtieth Asilomar
Conference on.
Chengalvarayan, R., & Deng, L. (1997). HMM-based speech recognition using state-dependent, discriminatively derived transforms on Mel-warped DFT features.
Speech and Audio Processing, IEEE Transactions on, 5(3), 243-256.
Clarkson, P., & Moreno, P. J. (1999). On the use of support vector machines for
phonetic classification. Paper presented at the Acoustics, Speech, and Signal
Processing, 1999. Proceedings., 1999 IEEE International Conference on.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3),
273-297.
Davis, S. B., & Mermelstein, P. (1980). Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. Acoustics,
Speech and Signal Processing, IEEE Transactions on, 28(4), 357-366.
Deng, L., & Geisler, C. D. (1987). A composite auditory model for processing speech
sounds. The Journal of the Acoustical Society of America, 82(6), 2001-2012.
Ellis, D. P. (2005). PLP and RASTA (and MFCC, and inversion) in Matlab.
Fletcher, H., & Galt, R. H. (1950). The perception of speech and its relation to
telephony. The Journal of the Acoustical Society of America, 22(2), 89-151.
French, N., & Steinberg, J. (1947). Factors governing the intelligibility of speech
sounds. The Journal of the Acoustical Society of America, 19(1), 90-119.
Friedman, J. (1996). Another approach to polychotomous classification. Dept. Statist.,
Stanford Univ., Stanford, CA, USA, Tech. Rep.
Gales, M. J. (2011). Model-based approaches to handling uncertainty Robust Speech
Recognition of Uncertain or Missing Data (pp. 101-125): Springer.
Gales, M. J. F. (1995). Model-based techniques for noise robust speech recognition.
Ganapathiraju, A., Hamaker, J., & Picone, J. (2000). Hybrid SVM/HMM architectures
for speech recognition. Paper presented at the INTERSPEECH.
Ganapathiraju, A., Hamaker, J. E., & Picone, J. (2004). Applications of support vector
machines to speech recognition. Signal Processing, IEEE Transactions on,
52(8), 2348-2355.
Ganapathy, S., Thomas, S., & Hermansky, H. (2009). Modulation frequency features for
phoneme recognition in noisy speech. The Journal of the Acoustical Society of
America, 125(1), EL8-EL12.
Ganapathy, S., Thomas, S., & Hermansky, H. (2010). Temporal envelope compensation
for robust phoneme recognition using modulation spectrum. The Journal of the
Acoustical Society of America, 128(6), 3769-3780.
Garofolo, J. S., & Consortium, L. D. (1993). TIMIT: acoustic-phonetic continuous
speech corpus: Linguistic Data Consortium.
Gelbart, D., & Morgan, N. (2001). Evaluating long-term spectral subtraction for
reverberant ASR. Paper presented at the Automatic Speech Recognition and
Understanding, 2001. ASRU'01. IEEE Workshop on.
Ghitza, O. (1988). Temporal non-place information in the auditory-nerve firing patterns
as a front-end for speech recognition in a noisy environment. Journal of
phonetics, 16, 109-123.
Gifford, M. L., & Guinan Jr, J. J. (1983). Effects of crossed‐olivocochlear‐bundle
stimulation on cat auditory nerve fiber responses to tones. The Journal of the
Acoustical Society of America, 74(1), 115-123.
Gu, Y. J., & Sacchi, M. (2009). Radon transform methods and their applications in
mapping mantle reflectivity structure. Surveys in geophysics, 30(4-5), 327-354.
Halberstadt, A. K. (1998). Heterogeneous acoustic measurements and multiple
classifiers for speech recognition. Massachusetts Institute of Technology.
Halberstadt, A. K., & Glass, J. R. (1997). Heterogeneous acoustic measurements for
phonetic classification 1. Paper presented at the EUROSPEECH.
Hansen, J. H. (1996). Analysis and compensation of speech under stress and noise for
environmental robustness in speech recognition. Speech Communication, 20(1),
151-173.
Heinz, M. G., Colburn, H. S., & Carney, L. H. (2001). Evaluating auditory performance
limits: I. One-parameter discrimination using a computational model for the
auditory nerve. Neural Computation, 13(10), 2273-2316.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The
Journal of the Acoustical Society of America, 87(4), 1738-1752.
Hermansky, H. (1998). Should recognizers have ears? Speech Communication, 25(1), 3-27.
Hermansky, H., & Morgan, N. (1994). RASTA processing of speech. Speech and Audio
Processing, IEEE Transactions on, 2(4), 578-589.
Hewitt, M. J., & Meddis, R. (1993). Regularity of cochlear nucleus stellate cells: a
computational modeling study. The Journal of the Acoustical Society of
America, 93(6), 3390-3399.
Holmberg, M., Gelbart, D., & Hemmert, W. (2006). Automatic speech recognition with
an adaptation model motivated by auditory processing. Audio, Speech, and
Language Processing, IEEE Transactions on, 14(1), 43-49.
Hopkins, K., & Moore, B. C. (2009). The contribution of temporal fine structure to the
intelligibility of speech in steady and modulated noise. The Journal of the
Acoustical Society of America, 125(1), 442-446.
Jankowski Jr, C. R., Vo, H.-D. H., & Lippmann, R. P. (1995). A comparison of signal
processing front ends for automatic word recognition. Speech and Audio
Processing, IEEE Transactions on, 3(4), 286-293.
Jeon, W., & Juang, B.-H. (2007). Speech analysis in a model of the central auditory
system. Audio, Speech, and Language Processing, IEEE Transactions on, 15(6),
1802-1817.
Johnson, M. T., Povinelli, R. J., Lindgren, A. C., Ye, J., Liu, X., & Indrebo, K. M.
(2005). Time-domain isolated phoneme classification using reconstructed phase
spaces. Speech and Audio Processing, IEEE Transactions on, 13(4), 458-466.
Junqua, J.-C., & Anglade, Y. (1990). Acoustic and perceptual studies of Lombard
speech: Application to isolated-words automatic speech recognition. Paper
presented at the Acoustics, Speech, and Signal Processing, 1990. ICASSP-90.,
1990 International Conference on.
Kingsbury, B. E., & Morgan, N. (1997). Recognizing reverberant speech with RASTA-PLP. Paper presented at the Acoustics, Speech, and Signal Processing, 1997.
ICASSP-97., 1997 IEEE International Conference on.
Kirchhoff, K., Fink, G. A., & Sagerer, G. (2002). Combining acoustic and articulatory
feature information for robust speech recognition. Speech Communication,
37(3), 303-319.
Kollmeier, B. (2003). Auditory Principles in Speech Processing-Do Computers Need
Silicon Ears? Paper presented at the Eighth European Conference on Speech
Communication and Technology.
Krishnamoorthy, P., & Prasanna, S. M. (2009). Reverberant speech enhancement by
temporal and spectral processing. Audio, Speech, and Language Processing,
IEEE Transactions on, 17(2), 253-266.
Krüger, S. E., Schafföner, M., Katz, M., Andelic, E., & Wendemuth, A. (2005). Speech
recognition with support vector machines in a hybrid system. Paper presented at
the INTERSPEECH.
Layton, M., & Gales, M. J. (2006). Augmented statistical models for speech
recognition. Paper presented at the Acoustics, Speech and Signal Processing,
2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on.
Lee, K.-F., & Hon, H.-W. (1989). Speaker-independent phone recognition using hidden
Markov models. Acoustics, Speech and Signal Processing, IEEE Transactions
on, 37(11), 1641-1648.
Liberman, M. C. (1978). Auditory‐nerve response from cats raised in a low‐noise
chamber. The Journal of the Acoustical Society of America, 63(2), 442-455.
Liberman, M. C., & Dodds, L. W. (1984). Single-neuron labeling and chronic cochlear
pathology. III. Stereocilia damage and alterations of threshold tuning curves.
Hearing research, 16(1), 55-74.
Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech
Communication, 22(1), 1-15.
Lopes, C., & Perdigao, F. (2011). Phone recognition on the TIMIT database. Speech
Technologies/Book, 1, 285-302.
Mayoraz, E., & Alpaydin, E. (1999). Support vector machines for multi-class
classification Engineering Applications of Bio-Inspired Artificial Neural
Networks (pp. 833-842): Springer.
McCourt, P., Harte, N., & Vaseghi, S. (2000). Discriminative multi-resolution sub-band
and segmental phonetic model combination. Electronics Letters, 36(3), 270-271.
Meyer, B. T., Wächter, M., Brand, T., & Kollmeier, B. (2007). Phoneme confusions in
human and automatic speech recognition. Paper presented at the
INTERSPEECH.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a
function of the context of the test materials. Journal of experimental psychology,
41(5), 329.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some
English consonants. The Journal of the Acoustical Society of America, 27(2),
338-352.
Miller, M. I., Barta, P. E., & Sachs, M. B. (1987). Strategies for the representation of a
tone in background noise in the temporal aspects of the discharge patterns of
auditory‐nerve fibers. The Journal of the Acoustical Society of America, 81(3),
665-679.
Miller, R. L., Schilling, J. R., Franck, K. R., & Young, E. D. (1997). Effects of acoustic
trauma on the representation of the vowel/ε/in cat auditory nerve fibers. The
Journal of the Acoustical Society of America, 101(6), 3602-3616.
Molis, M. R., & Summers, V. (2003). Effects of high presentation levels on recognition
of low-and high-frequency speech. Acoustics Research Letters Online, 4(4),
124-128.
Nakatani, T., Kellermann, W., Naylor, P., Miyoshi, M., & Juang, B. H. (2010).
Introduction to the special issue on processing reverberant speech:
methodologies and applications. IEEE Transactions on Audio, Speech, and
Language Processing, 7(18), 1673-1675.
Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B.-H. (2009). Real-time speech enhancement in noisy reverberant multi-talker environments based
on a location-independent room acoustics model. Paper presented at the
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE
International Conference on.
Narayan, S. S., Temchin, A. N., Recio, A., & Ruggero, M. A. (1998). Frequency tuning
of basilar membrane and auditory nerve fibers in the same cochleae. Science,
282(5395), 1882-1884.
Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1989). Discrete-time signal
processing (Vol. 2): Prentice-hall Englewood Cliffs.
Palmer, A., & Russell, I. (1986). Phase-locking in the cochlear nerve of the guinea-pig
and its relation to the receptor potential of inner hair-cells. Hearing research,
24(1), 1-15.
Patterson, R., Nimmo-Smith, I., Holdsworth, J., & Rice, P. (1987). An efficient auditory
filterbank based on the gammatone function. Paper presented at the a meeting of
the IOC Speech Group on Auditory Modelling at RSRE.
Perdigao, F., & Sá, L. (1998). Auditory models as front-ends for speech recognition.
Proc. NATO ASI on Computational Hearing, 179-184.
Pollack, I., & Pickett, J. (1958). Masking of speech by noise at high sound levels. The
Journal of the Acoustical Society of America, 30(2), 127-130.
Pulakka, H., & Alku, P. (2011). Bandwidth extension of telephone speech using a
neural network and a filter bank implementation for highband mel spectrum.
Audio, Speech, and Language Processing, IEEE Transactions on, 19(7), 2170-2183.
Qi, J., Wang, D., Jiang, Y., & Liu, R. (2013). Auditory features based on gammatone
filters for robust speech recognition. Paper presented at the Circuits and Systems
(ISCAS), 2013 IEEE International Symposium on.
Rajput, H., Som, T., & Kar, S. (2016). Using Radon Transform to Recognize Skewed
Images of Vehicular License Plates. Computer, 49(1), 59-65.
Reynolds, D. A. (1997). HTIMIT and LLHDB: speech corpora for the study of handset
transducer effects. Paper presented at the Acoustics, Speech, and Signal
Processing, 1997. ICASSP-97., 1997 IEEE International Conference on.
Reynolds, T. J., & Antoniou, C. A. (2003). Experiments in speech recognition using a
modular MLP architecture for acoustic modelling. Information sciences, 156(1),
39-54.
Rifkin, R., Schutte, K., Saad, M., Bouvrie, J., & Glass, J. (2007). Noise robust phonetic
classification with linear regularized least squares and second-order features.
Paper presented at the Acoustics, Speech and Signal Processing, 2007. ICASSP
2007. IEEE International Conference on.
Robert, A., & Eriksson, J. L. (1999). A composite model of the auditory periphery for
simulating responses to complex sounds. The Journal of the Acoustical Society
of America, 106(4), 1852-1864.
Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation.
Neural Networks, IEEE Transactions on, 5(2), 298-305.
Robles, L., & Ruggero, M. A. (2001). Mechanics of the mammalian cochlea.
Physiological reviews, 81(3), 1305-1352.
Scanlon, P., Ellis, D. P., & Reilly, R. B. (2007). Using broad phonetic group experts for
improved speech recognition. Audio, Speech, and Language Processing, IEEE
Transactions on, 15(3), 803-812.
Schluter, R., Bezrukov, L., Wagner, H., & Ney, H. (2007). Gammatone features and
feature combination for large vocabulary speech recognition. Paper presented at
the Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE
International Conference on.
Sellick, P., Patuzzi, R., & Johnstone, B. (1982). Measurement of basilar membrane
motion in the guinea pig using the Mössbauer technique. The Journal of the
Acoustical Society of America, 72(1), 131-141.
Sha, F., & Saul, L. K. (2006). Large margin Gaussian mixture modeling for phonetic
classification and recognition. Paper presented at the Acoustics, Speech and
Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International
Conference on.
Shao, Y., Jin, Z., Wang, D., & Srinivasan, S. (2009). An auditory-based feature for
robust speech recognition. Paper presented at the Acoustics, Speech and Signal
Processing, 2009. ICASSP 2009. IEEE International Conference on.
Shao, Y., Srinivasan, S., & Wang, D. (2007). Incorporating auditory feature
uncertainties in robust speaker identification. Paper presented at the Acoustics,
Speech and Signal Processing, 2007. ICASSP 2007. IEEE International
Conference on.
Smith, N., & Niranjan, M. (2001). Data-dependent kernels in SVM classification of
speech patterns.
Soong, F. K., Rosenberg, A. E., Juang, B.-H., & Rabiner, L. R. (1987). Report: A vector
quantization approach to speaker recognition. AT&T technical journal, 66(2),
14-26.
Sroka, J. J., & Braida, L. D. (2005). Human and machine consonant recognition. Speech
Communication, 45(4), 401-423.
Strope, B., & Alwan, A. (1997). A model of dynamic auditory perception and its
application to robust word recognition. Speech and Audio Processing, IEEE
Transactions on, 5(5), 451-464.
Studebaker, G. A., Sherbecoe, R. L., McDaniel, D. M., & Gwaltney, C. A. (1999).
Monosyllabic word recognition at higher-than-normal speech and noise levels.
The Journal of the Acoustical Society of America, 105(4), 2431-2444.
Weiss, Y. (Ed.). (2006). Advances in neural information processing systems: MIT Press.
Tan, Q., & Carney, L. H. (2003). A phenomenological model for the responses of
auditory-nerve fibers. II. Nonlinear tuning with a frequency glide. The Journal
of the Acoustical Society of America, 114(4), 2007-2020.
Tchorz, J., & Kollmeier, B. (1999). A model of auditory perception as front end for
automatic speech recognition. The Journal of the Acoustical Society of America,
106(4), 2040-2050.
Vapnik, V. (2013). The nature of statistical learning theory: Springer Science &
Business Media.
Wang, Y. (2015). Model-based Approaches to Robust Speech Recognition in Diverse
Environments.
Wilson, B. S., Schatzer, R., Lopez-Poveda, E. A., Sun, X., Lawson, D. T., & Wolford,
R. D. (2005). Two new directions in speech processor design for cochlear
implants. Ear and hearing, 26(4), 73S-81S.
Wong, J. C., Miller, R. L., Calhoun, B. M., Sachs, M. B., & Young, E. D. (1998).
Effects of high sound levels on responses to the vowel /ε/ in cat auditory nerve.
Hearing research, 123(1), 61-77.
Yoshioka, T., Sehr, A., Delcroix, M., Kinoshita, K., Maas, R., Nakatani, T., &
Kellermann, W. (2012). Making machines understand us in reverberant rooms:
robustness against reverberation for automatic speech recognition. Signal
Processing Magazine, IEEE, 29(6), 114-126.
Yousafzai, J., Ager, M., Cvetković, Z., & Sollich, P. (2008). Discriminative and
generative machine learning approaches towards robust phoneme classification.
Paper presented at the Information Theory and Applications Workshop, 2008.
Yousafzai, J., Sollich, P., Cvetković, Z., & Yu, B. (2011). Combined features and kernel
design for noise robust phoneme classification using support vector machines.
Audio, Speech, and Language Processing, IEEE Transactions on, 19(5), 1396-1407.
Zhang, X., Heinz, M. G., Bruce, I. C., & Carney, L. H. (2001). A phenomenological
model for the responses of auditory-nerve fibers: I. Nonlinear tuning with
compression and suppression. The Journal of the Acoustical Society of America,
109(2), 648-670.
Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of
MFCC. Journal of Computer Science and Technology, 16(6), 582-589.
Zilany, M. S., & Bruce, I. C. (2006). Modeling auditory-nerve responses for high sound
pressure levels in the normal and impaired auditory periphery. The Journal of
the Acoustical Society of America, 120(3), 1446-1466.
Zilany, M. S., & Bruce, I. C. (2007). Representation of the vowel /ε/ in normal and
impaired auditory nerve fibers: model predictions of responses in cats. The
Journal of the Acoustical Society of America, 122(1), 402-417.
Zilany, M. S., Bruce, I. C., & Carney, L. H. (2014). Updated parameters and expanded
simulation options for a model of the auditory periphery. The Journal of the
Acoustical Society of America, 135(1), 283-286.
Zilany, M. S., Bruce, I. C., Nelson, P. C., & Carney, L. H. (2009). A phenomenological
model of the synapse between the inner hair cell and auditory nerve: long-term
adaptation with power-law dynamics. The Journal of the Acoustical Society of
America, 126(5), 2390-2412.
APPENDIX A – CONFUSION MATRICES
A confusion matrix is formed by creating one row and one column for each class in the set.
The rows represent the expert labels, i.e., the true classes of the testing exemplars, while the
columns correspond to the classifier's output, i.e., its hypothesized class labels. Each
classification decision is then tallied and entered into the corresponding cell of the matrix.
For example, consider the example matrix below, which shows a confusion matrix for four
phonemes, namely /p/, /t/, /k/, and /b/. The phoneme /p/ was classified correctly twelve
times, classified as /t/ five times, as /k/ three times, and as /b/ never. Correct
classifications appear on the main diagonal, whereas the off-diagonal entries represent errors.
Confusion matrices for the different features under diverse environmental distortions are
presented in Tables 1 to 16. They cover the clean condition, additive noise (speech-shaped
noise at an SNR of 0 dB), reverberant speech (RIR of 344 ms), and telephone speech
(channel cb1). The main diagonal of each matrix is indicated in bold.
Example of a confusion matrix for four phonemes (rows: true class; columns: hypothesized class)

        /p/   /t/   /k/   /b/
 /p/     12     5     3     0
 /t/      5    10     0     5
 /k/      3     1    14     2
 /b/      1     2     4    13
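To make the tabulation and the reported overall accuracies concrete, the short Python sketch
below (illustrative only, not code from the original experiments; the confusion_matrix helper
and the phoneme list are hypothetical names) tallies (true, hypothesized) label pairs into a
matrix and computes the overall accuracy as the sum of the main diagonal divided by the total
number of test exemplars. For the example matrix above this gives
(12 + 10 + 14 + 13) / 80 = 61.25 %.

import numpy as np

# Class order shared by the rows (true class) and columns (hypothesized class).
phonemes = ['/p/', '/t/', '/k/', '/b/']

def confusion_matrix(true_labels, predicted_labels, n_classes):
    # Tally each (true, predicted) index pair into the corresponding cell.
    mat = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        mat[t, p] += 1
    return mat

# The example confusion matrix shown above.
conf_mat = np.array([[12,  5,  3,  0],   # true /p/
                     [ 5, 10,  0,  5],   # true /t/
                     [ 3,  1, 14,  2],   # true /k/
                     [ 1,  2,  4, 13]])  # true /b/

# Overall accuracy = correct decisions (main diagonal) / total test exemplars.
overall_accuracy = np.trace(conf_mat) / conf_mat.sum()
print('Overall accuracy: %.2f %%' % (100 * overall_accuracy))  # 61.25 %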
Table 1. Confusion matrix of the proposed method in the clean condition. The overall accuracy is 69.19 %.
Table 2. Confusion matrix of the MFCC-based feature in the clean condition. The overall accuracy is 60.12 %.
Table 3. Confusion matrix of the GFCC-based feature in the clean condition. The overall accuracy is 49.93 %.
Table 4. Confusion matrix of the FDLP-based feature in the clean condition. The overall accuracy is 50.57 %.
Table 5. Confusion matrix of the proposed method at an SNR of 0 dB (under speech-shaped noise). The overall accuracy is 33.90 %.
Table 6. Confusion matrix of the MFCC-based feature at an SNR of 0 dB (under
speech-shaped noise). The overall accuracy is 31.71 %.
Table 7. Confusion matrix of the GFCC-based feature at an SNR of 0 dB (under
speech-shaped noise). The overall accuracy is 31.95 %.
Table 8. Confusion matrix of the FDLP-based feature at an SNR of 0 dB (under
speech-shaped noise). The overall accuracy is 27.39 %.
Table 9. Confusion matrix of the proposed method under reverberant speech (RIR of 344 ms). The overall accuracy is 34.91 %.
Table 10. Confusion matrix of the MFCC-based feature under reverberant speech (RIR of 344 ms). The overall accuracy is 25.28 %.
Table 11. Confusion matrix of the GFCC-based feature under reverberant speech (RIR of 344 ms). The overall accuracy is 36.18 %.
Table 12. Confusion matrix of the FDLP-based feature under reverberant speech (RIR of 344 ms). The overall accuracy is 19.81 %.
Table 13. Confusion matrix of the proposed method under telephone speech
(channel cb1). The overall accuracy is 50.84 %.
Table 14. Confusion matrix of the MFCC-based feature under telephone speech
(channel cb1). The overall accuracy is 50.25 %.
Table 15. Confusion matrix of the GFCC-based feature under telephone speech
(channel cb1). The overall accuracy is 17.26 %.
Table 16. Confusion matrix of the FDLP-based feature under telephone speech
(channel cb1). The overall accuracy is 28.08 %.