VOICE PRINT ANALYSIS FOR SPEAKER RECOGNITION

SIM UNIVERSITY
SCHOOL OF SCIENCE AND TECHNOLOGY
VOICE PRINT ANALYSIS FOR SPEAKER RECOGNITION
STUDENT: THANG WEE KEONG (M0606042)
SUPERVISOR: DR YUAN ZHONG XUAN
PROJECT CODE: JAN09/BEHE/57
A project report submitted to SIM University
in partial fulfilment of the requirements for the degree of
Bachelor of Engineering of Electronics
NOV 2009
Acknowledgement
I would like to express my heartfelt gratitude to my project supervisor, Dr Yuan Zhong
Xuan, for his patient guidance, invaluable advice and supervision. I would also like to thank
him for being so accommodating in adjusting his busy schedule to meet me. Dr Yuan has
selflessly imparted his expertise and years of experience to me throughout the whole
duration of the Final Year Project. Without his help, I would not have been able to complete
my project.
I would also like to extend my gratitude to all my UniSIM lecturers who have taught me
throughout my university days for the knowledge and guidance imparted to me. Without
them, the educational experience in UniSIM would not have been so enjoyable and
enriching.
I would also like to express warm appreciation to my colleagues and friends who have been
so accommodating and helpful in my pursuit of academic excellence.
Last but not least, I would like to extend my heartfelt thanks to my family and my partner for
their unconditional love, accommodation and constant moral support throughout my
academic years.
Abstract
A person's voice contains various parameters that convey information such as emotion,
gender, attitude, health and identity. This thesis addresses speaker recognition, which deals
with identifying a person based on the unique voiceprint present in their speech data.
Pre-processing of the speech signal is performed before voice feature extraction to ensure
that the extracted voice features contain accurate information conveying the identity of the
speaker. Voice feature extraction methods such as Linear Predictive Coding (LPC), Linear
Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC)
are analysed and evaluated for their suitability for use in speaker recognition tasks. A new
method that combines LPCC and MFCC (LPCC+MFCC) using fused output scores is
proposed and evaluated together with the other voice feature extraction methods. The
speaker models for all the methods are computed using the Vector Quantization Linde, Buzo
and Gray (VQ-LBG) method. For the LPCC+MFCC method, individual modelling and
comparison are performed for LPCC and MFCC, and the similarity scores of the two
methods are then combined for the identification decision. The results show that this method
is better than, or at least comparable to, the traditional methods such as LPCC and MFCC,
without incurring high computation costs that would compromise the performance of the
speaker recognition tasks.
Contents
Acknowledgement ................................................................................................................. i
Abstract .................................................................................................................................ii
List of Figures ....................................................................................................................... v
List of Tables ......................................................................................................................vii
1. Introduction ................................................................................................................... 1
1.1 Development of speaker recognition systems ........................................................... 2
1.2 Project Objectives ...................................................................................................... 3
1.3 Project Scope ............................................................................................................. 3
1.4 Summary of Report.................................................................................................... 4
2. Literature review ........................................................................................................... 5
2.1 Concepts of speaker recognition ................................................................................ 6
2.2 Identification and Verification tasks .......................................................................... 6
2.2.1 Text dependent and Text independent Task .............................................................. 6
2.2.2 Open and closed-set Identification ............................................................................ 7
2.3 Phases of Speaker Identification ................................................................................ 7
2.4 Pre-processing techniques ......................................................................................... 8
2.4.1 Analogue-To-Digital (A/D) ....................................................................................... 8
2.4.2 End-Point Detection................................................................................................... 9
2.4.3 Pre-Emphasis ........................................................................................................... 10
2.4.4 Speech analysis technique or framing ..................................................................... 10
2.5 Voice features extraction ........................................................................................ 11
2.5.1 Linear Predictive Coding (LPC) .............................................................................. 11
2.5.2 Linear Predictive Cepstral Coefficients ................................................................... 13
2.5.3 Mel-Frequency Cepstral Coefficients (MFCC) ....................................................... 14
2.6 Summary of feature extraction techniques .............................................................. 16
2.7 Speaker Modeling .................................................................................................... 17
2.7.1 Template Matching .................................................................................................. 17
2.7.1.2 Dynamic Time Warping (DTW) .......................................................................... 17
2.7.2 Vector Quantization source modelling .................................................................... 18
2.7.3 Stochastic Models .................................................................................................... 21
2.7.3.1 Hidden Markov Model ......................................................................................... 21
2.8 Neural Networks ...................................................................................................... 22
2.9 Summary .................................................................................................................. 23
3. Project Plan.................................................................................................................. 25
3.1 Gantt Chart............................................................................................................... 28
4. Development of the speaker recognition system ......................................................... 30
4.1 DC Offset Removal ................................................................................................. 31
4.2 End-point detection .................................................................................................. 31
4.3 Pre-Emphasis ........................................................................................................... 32
4.4 Framing/Windowing ................................................................................................ 34
4.5 Feature extraction .................................................................................................... 34
4.5.1 Linear Prediction Coefficients (LPC) ...................................................................... 35
4.5.2 Linear Prediction Cepstral Coefficients (LPCC) ..................................................... 35
4.5.3 Mel-Frequency Cepstral Coefficients (MFCC) ....................................................... 37
4.6 Speaker modelling ................................................................................................... 38
5. Experimental Setup ..................................................................................................... 39
6. Results ......................................................................................................................... 40
6.1 Linear Predictive Coefficients ................................................................................. 40
6.1.2 Conclusion of LPC .................................................................................................. 42
6.2 Linear Predictive Cepstral Coefficients ................................................................... 43
6.2.1 Conclusion of LPCC ................................................................................................ 45
6.3 Mel-Frequency Cepstral Coefficients ...................................................................... 46
6.3.1 Conclusion of MFCC ............................................................................................... 48
6.4 LPCC+MFCC .......................................................................................................... 50
6.4.2 Comparison of LPCC+MFCC vs Other Methods ................................................... 51
7. Conclusion ................................................................................................................... 53
7.1 Recommendations for further study ........................................................................ 54
8. Critical review and reflections .................................................................................... 55
Appendix .............................................................................................................................60
A.1 Identification rates using codebook size of 32.......................................................... 60
A.2 Identification rates using codebook size of 64.......................................................... 60
A.3 Identification rates using codebook size of 128........................................................ 61
A.4 Identification rates using MFCC+LPCC (Codebook size 32)................................... 61
A.5 FYP.fig ..................................................................................................................... 62
A.6 Feature_selection.fig ................................................................................................ 62
A.7 User_Identified.fig ................................................................................................... 63
A.8 Voice_recording_FYP.fig ........................................................................................ 63
List of Figures
2.1 Speaker recognition processing tree...........................................................................6
2.2 Block diagram of automatic speaker identification system........................................8
2.3 Block diagram of the pre-processing stages...............................................................8
2.4 Speech Analysis Filter...............................................................................................12
2.5 Speech Synthesis Filter..............................................................................................12
2.6 Block diagram of Linear Predictive Cepstral Coefficient..........................................13
2.7 Mel Scale plot............................................................................................................14
2.8 Block diagram of Mel-Frequency Cepstral Coefficient.............................................15
2.9 Block diagram of speaker decision............................................................................17
2.10 2-dimensional VQ with 32 regions............................................................................18
2.11 K-means clusters........................................................................................................19
2.12 Steps for LBG algorithm............................................................................................20
2.13 Probabilistic parameters of a hidden Markov model.................................................21
4.1 Flowchart of prototype speaker recognition system...................................................30
4.2 Flowchart of end-point detection for prototype speaker recognition system.............31
4.3 Speech data before and after End-Point Detection....................................................32
4.4 Frequency response and z-plane plot of the pre-emphasis filter with α = 0.95.........33
4.5 Speech data before and after End-Point Detection....................................................33
4.6 Power spectrum before and after End-Point Detection..............................................33
4.7 Speech frame before and after Hamming window....................................................34
4.8 Flowchart of computation of LPCC...........................................................................36
4.9 Flowchart of computation of MFCC..........................................................................37
6.1 Results of LPC for codebook size of 32.....................................................................40
6.2 Results of LPC for codebook size of 64.....................................................................41
6.3 Results of LPC for codebook size of 128...................................................................41
6.4 Comparison of LPC using different codebook sizes..................................................42
6.5 Results of LPCC for codebook size of 32..................................................................43
6.6 Results of LPCC for codebook size of 64..................................................................44
6.7 Results of LPCC for codebook size of 128................................................................44
6.8 Comparison of LPCC using different codebook sizes..............................................45
6.9 Results of MFCC for codebook size of 32.................................................................46
6.10 Results of MFCC for codebook size of 64................................................................47
6.11 Results of MFCC for codebook size of 128...............................................................47
6.12 Comparison of MFCC using different codebook sizes..............................................48
6.13 Block diagram of proposed system (LPCC+MFCC).................................................50
6.14 Overview of recognition rates using different codebook sizes..................................51
List of Tables
2.1 Comparison of features extraction in terms of filtering techniques..................................16
2.2 Comparison of criteria of feature extraction techniques...................................................16
2.3 Comparison of different feature extraction and modelling techniques.............................23
3.1 Details of project plan.......................................................................................................29
5.1 Voiceprint test parameters……………………………………………………………....39
1. Introduction
In everyday life, there is a need for controlled access to certain information or places for
security purposes. A typical secure identification system requires a person to use a
cardkey (something that the user has) or to enter a PIN (something that the user knows) in
order to gain access to the system. However, these two methods have shortcomings, as the
access token used can be stolen, lost, misused or forgotten.
The desire for a more secure identification system, whereby the physical human self is the
key to access the system, led to the development of biometric recognition systems. Biometric
recognition systems make use of features that are unique to each individual and that are not
duplicable or transferable. Biometric features have two types of characteristics. Behavioural
characteristics, such as voice and signature, are the result of body part movements; in the case
of voice, the signal reflects the physical properties of the voice production organs, and the
articulatory process and the subsequent speech produced are never exactly identical even
when the same person utters the same words. Physiological characteristics refer to the actual
physical properties of a person, such as fingerprint, iris and hand geometry measurements.
Some of the possible applications of biometric recognition systems include user-interface
customisation and access control, such as airport check-in, building access control, telephone
banking or remote credit card purchases.
Speech technology offers many possibilities for personal identification that are natural and
non-intrusive. Besides that, speech technology offers the capability to verify the identity of a
person remotely over long distances by using a normal telephone. A conversation between
people contains a lot of information besides just the communication of ideas; speech also
conveys information such as the gender, emotion, attitude, health and identity of a speaker.
This thesis deals with speaker recognition, which refers to the task of recognising people by
their voices.
1.1 Development of speaker recognition systems
The first type of speaker recognition system, in the 1960s, used spectrograms of voices, also
known as voiceprint analysis. The spectrogram is the acoustic spectrum of the voice,
analogous to a fingerprint. However, this type of analysis could not fulfil the goal of automatic
recognition, as human interpretation was needed. In the 1980s, various methods were
proposed to extract features from voice for speaker recognition, representing features in the
time domain, the frequency domain or both. Acoustic features of speech differ amongst
individuals. These acoustic features include both learned behavioural features (e.g. pitch,
accent) and anatomy (e.g. the shape of the vocal tract and mouth) [10]. The most commonly
extracted features are Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficients
(LPCC) and Mel-Frequency Cepstral Coefficients (MFCC), which belong to short-time
analysis and provide information on the vocal tract [10].
Different modelling techniques were also developed to model the voiceprint extracted from
speech. Various concepts were introduced, such as pattern matching (Dynamic Time
Warping), which performs direct template matching between the training and testing subjects.
However, direct template matching is time consuming when the number of feature vectors
increases. Clustering is a method of reducing the number of feature vectors by using a
codebook to represent the centres of the feature vectors (Vector Quantization). The LBG
(Linde, Buzo and Gray) algorithm [25] and the k-means algorithm are among the most well
known algorithms for Vector Quantization (VQ). Other methods proposed for speaker
modelling include neural networks and stochastic models that use probability distributions,
such as the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM).
Although development in the field of speech technology is moving rapidly, a few inherent
problems remain to be solved. The reliability of speaker recognition drops drastically when a
huge user database is used or when the system is used in a noisy environment. Nevertheless,
the field is moving fast, and it may be possible to refine and improve the robustness of the
existing techniques to solve some of these issues.
1.2 Project Objectives
The principal objectives of this thesis are:
1. To study the concepts of speaker recognition and understand its uses in identification and
verification systems.
2. To conduct research on different types of voiceprint in the field of speaker recognition
and understand the details of the feature extraction methods.
3. To evaluate the recognition capability of different voice features and parameters to find
out the method that is suitable for Automatic Speaker Recognition (ASR) systems in
terms of reliability and computational efficiency.
1.3 Project Scope
Although a lot of work has been done in the field of speaker recognition, there are many
practical issues to be resolved before it can be implemented in the real world. The scope of
this thesis is to give a general overview of the available techniques and to analyse the
reliability of the various voiceprint features for use in ASR.
In this project, an open-set, text-independent speaker identification system prototype will be
developed to carry out the evaluation mentioned above.
1.4 Summary of Report

• Chapter 1: Introduction
The background of speaker recognition systems is discussed. The overall project
objectives and scope for this thesis are defined in this chapter.

• Chapter 2: Literature review
An overview of the techniques used in speaker recognition systems will be discussed.

• Chapter 3: Project plan
The Gantt chart showing the details of the project tasks will be discussed to show how
the project is planned and executed.

• Chapter 4: Development of the speaker recognition system
The details of the techniques used in this project to build the prototype speaker
recognition system will be discussed in this chapter.

• Chapter 5: Experiment setup
The details of the experiment setup to evaluate the different voice feature extraction
methods will be shown.

• Chapter 6: Results
The results of the different voice feature extraction methods will be shown and evaluated
in this chapter.

• Chapter 7: Conclusion
This chapter details the summary of the work accomplished and suggestions for further
research.

• Chapter 8: Critical review and reflections
The difficulties, successes and personal lessons learnt in this project are summarized in
this chapter.
2. Literature review
The fascination with employing speech for many purposes in daily life has driven engineers
and scientists to conduct a vast amount of research and development in this field. Automatic
speaker recognition (ASR) aims to build a machine that can identify a person by recognising
voice characteristics or features that are unique to each person.
The performance of modern recognition systems has improved significantly due to the
various improvements in the algorithms and techniques involved in this field. At this moment,
ASR is still a subject of great interest to researchers and engineers worldwide, and the
efficiency of ASR systems is still improving.
This chapter aims to highlight some of the important techniques, algorithms and research
relevant to this report. Various typical pre-processing, feature extraction and speaker
modelling techniques will be covered. An overview of the advantages and typical applications
of these techniques and algorithms in speaker recognition systems will be provided. Lastly, a
comparison of speaker recognition systems using the algorithms and techniques explained in
this report will be presented at the end of the chapter.
2.1 Concepts of speaker recognition
The typical classification of automatic speaker recognition is divided into two tasks: Speaker
Identification (SI) and Speaker Verification (SV). Figure 2.1 shows the taxonomy of speech
technologies. Speaker recognition is one of the three sub-classes of speech technology which
is further subdivided into SI and SV tasks.
Figure 2.1 Speaker recognition processing tree (speaker recognition is divided into speaker
identification and speaker verification; each is further divided into text-dependent and
text-independent tasks, and into open-set and closed-set tasks)
2.2 Identification and Verification tasks
Speaker recognition generally involves two main applications: speaker identification and
speaker verification. Speaker identification (SI), or 1:N matching, is the process of finding
the identity of an unknown speaker by matching the voice against the voice of registered
speakers in the database. The system will then return the best matching speaker as the
recognition decision. Speaker verification (SV), or 1:1 matching, is the process of verifying
the claimed identity of the unknown speaker by comparing the voice of the unknown speaker
against the voice of the claimed speaker in the database. The similarities between the speaker
and the speaker template in the database will determine the recognition decision.
2.2.1 Text dependent and Text independent Task
Speaker identification tasks can be further divided into text-dependent and text-independent
tasks. Text-dependent systems require the user to utter words that have been enrolled,
whereas text-independent systems do not require the user to speak specific words in order to
perform recognition tasks; the system models the voice feature characteristics of the unknown
speaker and performs recognition on that basis.
In general, text-dependent systems are more accurate, as both the content and the voice
feature characteristics are compared. However, a serious flaw exists in both types of system:
an intruder can access the system by using a pre-recorded voice of a registered speaker in the
database. A method known as text-prompted speaker verification, which randomly selects
passwords for the user to utter, is used to counter this problem.
2.2.2 Open and closed-set Identification
Speaker identification can be further classified into open and closed set recognition. In open
set recognition, the system is able to suggest that the voice from the unknown speaker does
not match any speaker in the registered database. In closed set recognition, the voice will
come only from the specified set of known speakers and the system is forced to make a
decision based on the best matching speaker in the registered database.
2.3 Phases of Speaker Identification
An automatic speaker recognition system identifies the person speaking based on a database
of known speakers in the system [1]. Figure 2.2 shows the overview of an automatic speaker
recognition system. In the training or enrolment phase, a new speaker with a known identity is
enrolled in the database of the system. In the identification phase, voice features from the
unknown speaker are extracted and modelled. The speaker model is then compared with the
speaker models from the enrolment phase to determine the identity of the speaker. Both the
enrolment and identification phases use the same modelling algorithms.
Figure 2.2 Block diagram of automatic speaker identification system (speech input → feature
extraction; in training mode the features are used for speaker modelling and stored in the
database of known speakers; in recognition mode the features are pattern-matched against the
database to produce a decision)
2.4 Pre-processing techniques
Pre-processing is a critical process performed on the speech input in order to develop a robust
and efficient speaker recognition system [2]. It is mainly performed in a few stages, as shown
in the figure below.
Figure 2.3 Block diagram of the pre-processing stages (analogue speech signal → A/D →
end-point detection → pre-emphasis → feature extraction)
2.4.1 Analogue-To-Digital (A/D)
The first stage is the Analogue-to-Digital (A/D) conversion where the analogue speech
signal is converted into a digital signal. In local speaker identification systems, voices are
normally recorded using microphones with sampling frequencies ranging from 8 kHz to
20 kHz [3]. However, the process of A/D conversion using microphones introduces an
unwanted DC offset or constant component that may cause errors in the speaker modelling.
It can be removed using two different methods as mentioned by Marwan [4]. The first
method involves performing a fast Fourier transform on the digitized speech, removing the
first frequency component of the transform and finally, performing an inverse Fourier
transform. The second method involves subtraction of the average mean of the signal from
the original signal to remove the DC offset.
2.4.2 End-Point Detection
The second stage is the removal of silent segment from the captured speech signal, otherwise
known as end-point detection. The two main reasons for doing this is firstly, most of the
speaker specific information or features reside in the voiced segment of the speech signal.
Secondly, removal of the silent segment reduces unnecessary computation which improves
the efficiency of the ASR. The two most widely used end-point detection methods in use are
the short-time energy based method (STE) and the zero-crossing method (ZCR). STE uses
the fact that silence segment of speech signal has very low short time energy. Average
energy of the signal is computed and segments of the speech signal with energy lower than
the threshold set are removed. The reliability of the end point detection depends very much
on the threshold chosen. Changes to the threshold value might be required under different
ambient/noise conditions [2]. ZCR refer to the rate that the amplitude of the sound wave
changes sign. It uses the theory that silent segments of signal have a higher ZCR than the
voice segment. The typical silence segment has a ZCR of 50, and a typical voiced segment
has a ZCR of about 12 [5]. A study to combine STE and ZCR for end-point detection by
Mark Greenwood in 1999 shows an average accuracy of 65% [6]. End-point detection is a
field widely researched upon and various techniques have been proposed that can achieve
better performance than the conventional STE and ZCR [5; 7]. However, the STE and ZCR
still remains the most widely used method for speaker recognition system due to their
simplicity and ease of implementation.
2.4.3 Pre-Emphasis
The third stage is to perform pre-emphasis by passing the signal through a high-pass filter.
The purpose of pre-emphasis is to offset the attenuation due to the physiological
characteristics of the speech production system and to enhance the higher frequencies,
improving the efficiency of the analysis since most of the speaker-specific information lies
within the higher frequencies. A study by Li Liu [8] shows that pre-emphasis does improve
the performance of the ASR. Another advantage is that pre-emphasis does not require
complex computation, so the computation time of the ASR is not increased by much [9].
2.4.4 Speech analysis technique or framing
Speech data contains information that represents speaker identity. The physical structure of
the vocal tract, the excitation source and behavioural traits are unique to each person.
Selection of a proper frame size and overlap for analysis is crucial in order to extract relevant
features that represent speaker identity [10].
Segmental analysis uses frame sizes and overlaps ranging from 10-30 ms, which captures
information from the vocal tract. The quasi-stationary nature of such frames makes them
suitable for practical analysis and processing of vocal tract information.
Sub-segmental analysis uses small frame sizes and overlaps ranging from 3-5 ms, which is
more suitable for capturing characteristics of the excitation source, which varies relatively
quickly compared to vocal tract information.
Supra-segmental analysis uses large frame sizes and overlaps ranging from 100-300 ms to
capture behavioural traits such as word duration, speaking rate, accent and intonation, which
vary much more slowly than vocal tract information.
2.5 Voice features extraction
Voice feature extraction, otherwise known as front-end processing, is performed in both
recognition and training modes. Feature extraction converts the digital speech signal into sets
of numerical descriptors called feature vectors that contain key characteristics of the speaker.
2.5.1 Linear Predictive Coding (LPC)
Linear predictive coding (LPC) is one of the earliest standardized coders. LPC has been
proven to be efficient for the representation of the speech signal in mathematical form, and it
is a useful tool for feature extraction as the vocal tract can be accurately modelled and
analysed. Studies have shown that the current speech sample is highly correlated with the
immediately preceding samples [11]. LPC coefficients are generated from a linear
combination of past speech samples, using the autocorrelation or the autocovariance method
and minimizing the sum of squared differences between the predicted and actual speech
samples.
The predicted speech sample is

\hat{s}(n) = \sum_{i=1}^{M} a_i \, s(n-i)

where \hat{s}(n) is the predicted sample based on the summation of past samples, a_i are the
linear prediction coefficients, M is the number of coefficients and n is the sample index.
The error between the actual sample and the prediction can then be expressed as

e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{i=1}^{M} a_i \, s(n-i)

The speech sample can then be accurately reconstructed by using the LP coefficients a_i and
the residual error e(n). In the z-domain the analysis filter can be represented by

A(z) = 1 - \sum_{i=1}^{M} a_i z^{-i}, \qquad E(z) = S(z) A(z)

The figure below shows the analysis filter.
Figure 2.4 Speech Analysis Filter
The transfer function H(z) of the synthesis filter can be expressed as an all-pole function,
where G represents the gain of the system:

H(z) = \frac{G}{1 - \sum_{i=1}^{M} a_i z^{-i}}

The figure below shows the speech synthesis filter.
Figure 2.5 Speech Synthesis Filter
Schroeder [12] mentioned that the LPC model can adequately model most speech sounds by
passing an excitation pulse through a time-varying all-pole filter using the LP coefficients.
S. Kwong [13] considers LPC a method that provides a good estimate of the vocal tract
spectral envelope. Gupta [14] mentioned that LPC is important in speech analysis because of
the accuracy and speed with which it can be derived. The feature vectors are calculated by
LPC over each frame; the number of coefficients used to represent a frame typically ranges
from 10 to 20, depending on the speech sample, the application and the number of poles in
the model.
However, LPC also has disadvantages. Firstly, LPC approximates speech linearly at all
frequencies, which is inconsistent with human hearing perception. Secondly, LPC is very
susceptible to background noise, which may cause errors in the speaker modeling.
2.5.2 Linear Predictive Cepstral Coefficients
Linear predictive cepstral coefficients (LPCC) combine the benefits of LPC and cepstral
analysis and improve the accuracy of the features obtained for speaker recognition. LPCC is
equivalent to the smoothed envelope of the log spectrum of the speech, which allows the
extraction of speaker-specific features. The block diagram of LPCC extraction is shown in the
figure below.
Figure 2.6 Block diagram of Linear Predictive Cepstral Coefficient extraction (speech input →
A/D → pre-emphasis → framing/windowing → linear predictive coefficients → linear
predictive cepstral coefficients)
The LPC coefficients are transformed into cepstral coefficients using the following recursive
formula:

c_i = a_i + \sum_{k=1}^{i-1} \frac{k}{i} \, c_k \, a_{i-k}, \qquad 1 < i \le p, \qquad c_1 = a_1

where c_i and a_i are the ith-order cepstrum coefficient and linear predictor coefficient,
respectively, and p is the LPC order. Atal [15] studied various parameters derived from LPC
and found the cepstrum to be the most effective parameter for speaker recognition. Eddie
Wong [16] mentioned that LPCC is more robust and reliable than LPC. However, LPCC also
performs poorly in noisy environments.
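As an illustration only, a minimal Matlab sketch of this recursion is given below; the function
name and the assumption that the input vector holds just the predictor coefficients a_1..a_p
(without the leading 1 returned by Matlab's lpc function) are illustrative choices, not part of
the cited work.

    % Convert LP coefficients a(1..p) into LP cepstral coefficients c(1..p)
    % using the recursion c_i = a_i + sum_{k=1}^{i-1} (k/i) c_k a_(i-k).
    function c = lpc2cepstrum(a)
        p = length(a);
        c = zeros(1, p);
        c(1) = a(1);
        for i = 2:p
            s = 0;
            for k = 1:i-1
                s = s + (k/i) * c(k) * a(i-k);
            end
            c(i) = a(i) + s;
        end
    end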
2.5.3 Mel-Frequency Cepstral Coefficients (MFCC)
Mel-frequency cepstral coefficients are among the most prevalent and popular methods used
in the field of voice feature extraction. The difference between MFCC and conventional
cepstral analysis is that MFCC maps frequency components using a Mel scale modelled on
the human ear's perception of sound instead of a linear scale [17]. The Mel-frequency
cepstrum represents the short-term power spectrum of a sound using a linear cosine transform
of the log power spectrum on a Mel scale. The formula for the Mel scale is
mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

Figure 2.7 Mel Scale plot (linear frequency in Hz versus Mel frequency in mels)
Vergin [18] mentioned that MFCC as frequency domain parameters are much more
consistent and accurate than time domain features. Vergin [18] listed the steps leading to
extraction of MFCCs: Fast Fourier Transform, filtering and cosine transform of the log
energy vector. According to Vergin [19], MFCCs can be obtained by the mapping of an
acoustic frequency to a perceptual frequency scale called the Mel scale. MFCCs are
computed by taking the windowed frame of the speech signal, putting it through a Fast
Fourier Transform (FFT) to obtain certain parameters and finally undergoing Mel-scale
warping to retrieve feature vectors that represent useful logarithmically compressed
amplitude and simplified frequency information [20]. Seddik [21] mentioned that MFCCs are
computed by applying the discrete cosine transform to the log of the Mel-filter bank outputs;
the results are features that describe the spectral shape of the signal. Rashidul [17] describes
the main steps for the extraction of MFCC, shown in the figure below: pre-emphasis, framing,
windowing, fast Fourier transform (FFT), Mel-frequency warping, filter bank, logarithm and
discrete cosine transform (DCT).
Figure 2.8 Block diagram of Mel-Frequency Cepstral Coefficient extraction (speech input →
A/D → pre-emphasis → framing/windowing → Fourier transform → Mel-frequency warping
→ logarithm → discrete cosine transform → Mel-Frequency Cepstral Coefficients)
The main advantage of MFCC is the robustness towards noise and spectral estimation errors
under various conditions [22]. A. Reynolds did a study on the comparison of different
features and found that the MFCC provides better performance than other features [23].
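To make this processing chain concrete, a simplified Matlab sketch of MFCC extraction for a
single windowed frame is shown below. The FFT length, the number of filters and the number
of retained coefficients are illustrative values, the triangular Mel filter bank is constructed
directly from the Mel-scale mapping, and the dct call assumes the Signal Processing Toolbox
is available; none of these choices are taken from the cited papers.

    % Simplified MFCC extraction for one windowed speech frame (illustrative only).
    % frame : column vector of windowed speech samples
    % fs    : sampling frequency in Hz
    function mfcc = simple_mfcc(frame, fs)
        nfft  = 512;             % FFT length
        nfilt = 24;              % number of triangular Mel filters
        ncoef = 12;              % number of cepstral coefficients kept

        % Power spectrum of the frame
        pspec = abs(fft(frame, nfft)).^2;
        pspec = pspec(1:nfft/2+1);

        % Mel filter bank: filter centres equally spaced on the Mel scale
        mel   = @(f) 2595*log10(1 + f/700);        % Hz -> Mel
        imel  = @(m) 700*(10.^(m/2595) - 1);       % Mel -> Hz
        edges = imel(linspace(0, mel(fs/2), nfilt+2));   % filter edge frequencies
        bins  = floor((nfft+1) * edges / fs) + 1;        % FFT bin indices of the edges
        fbank = zeros(nfilt, nfft/2+1);
        for m = 1:nfilt
            fbank(m, bins(m):bins(m+1))   = linspace(0, 1, bins(m+1)-bins(m)+1);
            fbank(m, bins(m+1):bins(m+2)) = linspace(1, 0, bins(m+2)-bins(m+1)+1);
        end

        % Log filter bank energies followed by the discrete cosine transform
        fbe  = log(fbank * pspec + eps);
        mfcc = dct(fbe);
        mfcc = mfcc(2:ncoef+1);   % discard the 0th coefficient
    end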
2.6 Summary of feature extraction techniques
A summary of the feature extraction techniques is compiled in Table 2.1, which compares the
techniques mentioned in this report in terms of filtering, relevant variables, inputs and
corresponding outputs.
Process: Feature Extraction

Technique                                    Type of Filter    Relevant variables/Data structure                         Output
Linear Predictive Coding (LPC)               All-pole filter   Statistical features: Linear Predictive Coefficients      Linear Predictive Coefficients
Linear Predictive Cepstral Coefficients      All-pole filter   Statistical features: Linear Predictive Cepstral          Linear Predictive Cepstral Coefficients
(LPCC)                                                         Coefficients (LPCC)
Mel-Frequency Cepstral Coefficients (MFCC)   Mel filter bank   Statistical features: Mel-Frequency Cepstral              Mel-Frequency Cepstral Coefficients
                                                               Coefficients (MFCC)

Table 2.1 Comparison of feature extraction techniques in terms of filtering
Criteria                    LPC                              LPCC                             MFCC
Main Task                   Features extracted by            Features extracted by            Features extracted based on the
                            analysing past speech samples    combining LPC with               frequency domain, using a Mel
                                                             spectral analysis                scale that represents human hearing
Speaker Dependence          Highly speaker dependent         Highly speaker dependent         Moderately speaker dependent
Robustness                  Poor                             Poor                             Good
Motivation/Representation   Speech production motivated      Speech production motivated      Perceptually motivated
                            representation                   representation                   representation
Filter Bank                 All-pole filters                 All-pole filters                 Triangular Mel filters
Typical Applications        Speech compression               Speaker and speech recognition   Speaker and speech recognition

Table 2.2 Comparison of criteria of feature extraction techniques
2.7 Speaker Modeling
The objective of modeling techniques is to generate patterns or speaker models for feature
matching. Speaker models contain enhanced speaker-specific information at a compressed
rate [10]. In the training or enrolment mode, speaker models are built using the specific voice
features extracted from the current speaker. In the recognition mode, the stored speaker
models are compared with the model of the current speaker for identification or verification
purposes [24].
Figure 2.9 Block diagram of speaker decision (the front-end processed input is scored against
the target/speaker model and an impostor/background model; after score normalisation, the
claim is accepted if Λ > θ and rejected if Λ < θ)
Three main types of modeling techniques are covered here, namely template matching,
stochastic modeling and neural networks.
2.7.1 Template Matching
In template matching, the speaker model may simply contain the feature template of the
frame of speech. A matching score will be computed by calculating the distance of the input
feature template and the model templates in the system database to determine the identity of
the speaker.
2.7.1.2 Dynamic Time Warping (DTW)
Dynamic time warping is one of the most popular and widely used template-based methods
for text-dependent speaker recognition systems. DTW is a technique that uses dynamic
programming to process text-dependent input feature vectors and remove the effect of
speech-rate variability between speakers. The matching score is computed by comparing the feature
vectors frame by frame with the speaker model in the database and is used to identify the
speaker [1].
2.7.2 Vector Quantization source modelling
Vector Quantization is an efficient way that compresses large training vectors by using
codebooks. Codebooks contain the numerical representation of features that are speaker
specific. The speaker specific codebook is generated in the training phase by clustering the
feature vectors of each speaker (as shown in figure.). In the recognition stage, input
utterances are vector quantised and the VQ distortion that is calculated over the entire
utterance is used to determine the identity of the speaker.
Figure 2.10 2-dimensional VQ with 32 regions. Retrieved on September 2009 from
www.data-compression.com/vq.html
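As a sketch of how the VQ distortion can drive the identification decision, the Matlab
fragment below computes the average distortion of the test feature vectors against each
enrolled speaker's codebook and selects the speaker with the smallest distortion; the variable
names and data layout (one feature vector per row) are assumptions made for illustration.

    % features  : N x d matrix of feature vectors from the unknown speaker
    % codebooks : cell array, codebooks{k} is the M x d codebook of speaker k
    function [bestSpeaker, bestDist] = vq_identify(features, codebooks)
        nSpeakers = numel(codebooks);
        avgDist   = zeros(1, nSpeakers);
        for k = 1:nSpeakers
            cb    = codebooks{k};
            total = 0;
            for n = 1:size(features, 1)
                d2 = sum((cb - repmat(features(n, :), size(cb, 1), 1)).^2, 2);
                total = total + min(d2);        % nearest-codeword distortion
            end
            avgDist(k) = total / size(features, 1);
        end
        [bestDist, bestSpeaker] = min(avgDist); % smallest average distortion wins
    end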
There are many types of codebook generation algorithm but the most well known and widely
applied are the K-means algorithm [25; 26] and the Linde, Buzo and Gray (LBG) algorithm
[27].
The steps of the K-means algorithm are:
1. Partition the vectors, based on their attributes, into k clusters with initial centroids.
2. Assign each feature vector to the centroid nearest to it.
3. Recalculate the position of the k centroids as the mean of the feature vectors assigned to
each centroid.
4. Repeat steps 2 and 3 until the positions of the centroids no longer change.
Figure 2.11 K-means clusters. Retrieved on September 2009 from
http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html
The advantages of K-means clustering lie in its simplicity and ease of computation. However,
it does not guarantee that the resulting partitioning of the feature space is optimal.
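For illustration, and assuming Matlab's Statistics Toolbox is available, a speaker codebook
can be obtained from a matrix of feature vectors with a single call to kmeans; the variable
names and the codebook size of 32 are illustrative.

    % features : N x d matrix of feature vectors (one row per frame)
    % codebook : 32 x d matrix of cluster centroids forming the speaker codebook
    [idx, codebook] = kmeans(features, 32);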
The steps of the Linde, Buzo and Gray (LBG) algorithm are:
1. Determine the number of codewords N, i.e. the size of the codebook.
2. Select a random codeword as the initial centroid.
3. Split each centroid into two codewords.
4. Compute the Euclidean distances to cluster the vectors around the codewords.
5. Compute the new set of codewords.
6. Compute the distortion using the Euclidean distance.
7. Repeat steps 4 to 6 until the codewords do not change, or the change in the codewords is
small.
8. Repeat step 3 until the desired number of codewords N is attained.
Figure 2.12 Steps for LBG algorithm. Retrieved on September 2009 from
https://engineering.purdue.edu/people/mireille.boutin.1/ECE301kiwi/LBGAlgorithm
The advantage of LBG lies in the generation of accurate codebooks with minimum distortion
when a good quality initial codebook is used for LBG. However, due to the complexity, the
computation cost is high [28].
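A compact Matlab sketch of the LBG procedure outlined above is given below; the splitting
perturbation and the stopping threshold are illustrative values rather than values taken from
[27] or from the project code.

    % LBG codebook training (illustrative sketch).
    % features : N x d matrix of training feature vectors
    % ncode    : desired codebook size (a power of two)
    function codebook = lbg(features, ncode)
        epsSplit = 0.01;                      % splitting perturbation
        tol      = 1e-3;                      % relative distortion threshold
        codebook = mean(features, 1);         % steps 1-2: single initial codeword

        while size(codebook, 1) < ncode
            % step 3: split every codeword into two
            codebook = [codebook*(1+epsSplit); codebook*(1-epsSplit)];
            prevDist = inf;
            while true
                % step 4: assign each vector to its nearest codeword
                nearest = zeros(size(features, 1), 1);
                dist    = 0;
                for n = 1:size(features, 1)
                    d2 = sum((codebook - repmat(features(n,:), size(codebook,1), 1)).^2, 2);
                    [dmin, nearest(n)] = min(d2);
                    dist = dist + dmin;       % step 6: accumulate distortion
                end
                % step 5: recompute codewords as the mean of their clusters
                % (codewords with no assigned vectors are left unchanged)
                for m = 1:size(codebook, 1)
                    if any(nearest == m)
                        codebook(m, :) = mean(features(nearest == m, :), 1);
                    end
                end
                % step 7: stop when the distortion no longer changes much
                if abs(prevDist - dist) / dist < tol
                    break;
                end
                prevDist = dist;
            end
        end
    end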
2.7.3 Stochastic Models
In stochastic modelling methods, pattern matching is formulated by measuring the likelihood
that the feature vectors of the input match the speaker model in the database [24]. One of the
most popular methods is the Hidden Markov Model (HMM).
2.7.3.1 Hidden Markov Model
Magdi [29] mentioned that HMMs are extremely useful for modeling sequentially changing
behaviour, as in speech applications.
Figure 2.13 Probabilistic parameters of a hidden Markov model
(example)
x — states
y — possible observations
a — state transition probabilities
b — output probabilities
Retrieved on September 2009 from
http://www.answers.com/topic/hidden-markov-model
Rabiner [30] defines the HMM as "a doubly embedded stochastic process that is hidden but
can be observed by using another set of stochastic process that produces the sequence of
observations." This means that the states of the HMM are not directly visible, but the output,
which depends on the state, is visible. Each state has a probability distribution over the
possible outputs; as such, the sequence of outputs generated by the HMM gives information
about the sequence of states. The advantage of the HMM technique lies in its ability to
capture voiced/unvoiced information in the states, which reduces its reliance on good
voiced/unvoiced segmentation techniques. Information such as intonation and accent can also
be captured in the states [31].
However, Rabiner [30] identifies three basic problems that must be solved for HMMs to be
useful in practical applications: the evaluation problem, the decoding problem and the
learning problem. Evaluation refers to how well a given model matches a given observation
sequence. Decoding refers to the attempt to uncover the hidden part of the model to find the
correct state sequence. Finally, learning refers to the attempt to optimise the model parameters
from the observed training data in order to create a good model.
2.8 Neural Networks
The ability of neural networks to recognise patterns of different classes makes them suitable
for speaker recognition. A typical neural network has three main components: the input layer,
the hidden layer (which can be one or more layers) and the output layer [10]. Each of the
layers contains interconnected processing units that represent neurons. During the training
phase, the weights of the neurons are adjusted using a training algorithm that attempts to
minimize the sum of squared differences between the desired and actual values of the output
neurons. The weights are adjusted over several training iterations until the desired sum of
squared differences is attained [32]. To put it simply, neural networks are used to model the
pattern between inputs and outputs.
2.9 Summary
Table 2.3 below shows the comparison between different feature extraction and modelling
techniques.
No.  System                                               Reference   Features     Dimension   Modelling        Recognition rate
                                                          Source      Applied                  techniques
                                                                                               applied
1.   Speaker Identification using Mel Frequency           [17]        MFCC         20          VQ               57% to 100%
     Cepstral Coefficients
2.   A continuous HMM text-independent speaker            [33]        MFCC         19          HMM              59.17% to 98.85%
     recognition system based on vowel spotting
3.   Automatic Speaker Identification Using               [27]        LPC          9           VQ-LBG           67% to 98%
     Vector Quantization
4.   Text-dependent Speaker Identification Using          [35]        LPCC         10          Neural Network   68.89% to 95.56%
     Neural Network On Distinctive Thai Tone Marks
5.   A Vector Quantization Approach to Speaker            [34]        MFCC         12          VQ               65% to 98%
     Recognition
6.   Speaker Identification using Cepstral Based          [36]        LPCC, MFCC   10-20       HMM              75% to 83.93%,
     Features and Discrete Hidden Markov Model                                                                  83.93% to 87.5%
7.   Speaker-Independent Phone Recognition Using          [37]        LPC          12          HMM              64.07% to 73.80%
     Hidden Markov Models

Table 2.3 Comparison of different feature extraction and modelling techniques
The selection of LPC, LPCC and MFCC for feature extraction and VQ-LBG for modelling
was based on their proven high level of performance in ASR, the comprehensive list of
reference materials readily available, their ease of implementation, and their popularity in the
field of speaker recognition.
In addition, the combination of the above-mentioned feature extraction methods with
VQ-LBG also shows relatively good results in various papers, as shown in Table 2.3. This
report aims to examine and validate the various feature extraction methods mentioned above
for their suitability in ASR. Achieving a recognition rate of 85% or above for the recognition
system is taken to indicate the usefulness of the feature extraction and modelling methods
mentioned above.
This chapter has presented an overview of the most widely known concepts in the field of
ASR. The steps to design an ASR begin with pre-processing, feature extraction, modelling
and finally, comparison using distance methods.
The LPC, LPCC and MFCC are the most popular feature extraction methods while HMM,
Neural Networks and VQ are the most popular modelling techniques widely employed in
most modern day ASR.
3. Project Plan
The aim of the project plan is to list the tasks and the timeframe for completing them in a
systematic manner, to ensure the successful completion of the project. The main tasks are
listed below:
Stage 1 (Background information research).
This stage focuses on the background information such as the origin, terminology, concepts,
uses and limitations of ASR. Sources include IEEE journals, Internet and books that explain
the basic terminology and concepts used in ASR. Verification using IEEE journals, books
and various other readings was done to ensure the soundness of information collected.
Stage 2 (Preparation of initial report)
Preparation of the initial report was conducted simultaneously with the background research.
This stage took three weeks to complete.
Stage 3 (Research on algorithms and voice print features used in speaker recognition
systems)
Important technical details of the project are subdivided into pre-processing, feature
extraction, modelling techniques and comparison methods for clearer focus during research.
Selection of techniques and algorithm for the implementation of the speaker recognition
system will be done in this stage.
Stage 4 (Learning of Matlab programming and development of prototype software for
speaker recognition system)
At this stage, the aim is to be proficient in Matlab programming so as to develop an ASR
using the Graphical User Interface (GUI) platform in Matlab.
It is divided into three subsections:
1. Familiarisation with Matlab functions – Learning the basic functions, operators and
programming method used in Matlab.
2. Learning Matlab GUI functions – Learn and understand the Matlab GUI platform,
functions and method of building a GUI Interface.
3. Build GUI Interface – Construct the basic GUI interface for the ASR.
Stage 5 (Software simulation and evaluation of different algorithms and voice print
features using the prototype software developed)
At this stage, the various stages involved in building an ASR will be implemented and
Matlab codes written will be tested.
Programming modules are divided into six subsections:
1. Pre-processing
2. LPC
3. LPCC
4. MFCC
5. Modelling techniques
6. Comparison techniques
Phase 6 (Project review):
At this stage, the testing and evaluation of voice features will be done and the results will be
compiled for analysis.
The three main subsections are
1. Compilation of results – Testing results will be compiled and tabulated.
2. Analysis of results – Results will be evaluated to determine the effectiveness of the features
for speaker recognition.
3. Assessment of difficulties encountered - Difficulties encountered during the project are
noted down and possible solutions will be suggested in the final report.
Phase 7 (Preparation of final report – Thesis):
This stage is the preparation of the final report. It is subdivided into different sections in
order to provide a clearer objective so as to complete the report writing in time. A review
will be conducted to finalise and make necessary amendments.
Phase 8 (Preparation of oral presentation):
The final task will be the preparation of the oral project presentation. The details for
the poster will be finalised and sent for printing, and preparation for the presentation
will be done.
3.1 Gantt Chart
Voice print analysis for speaker recognition system

Task Description                                                          Start Date   Duration (days)   End Date
1. Background information research                                        26-Jan-09    64                30-Mar-09
1.1 Study relevant articles and papers                                    26-Jan-09    12                6-Feb-09
1.2 Overview of Speaker Recognition Systems                               7-Feb-09     8                 14-Feb-09
1.3 Techniques for feature extractions                                    15-Feb-09    44                30-Mar-09
1.4 Modelling techniques for ASR                                          15-Feb-09    44                30-Mar-09
2. Preparation of initial report (TMA01)                                  2-Feb-09     26                27-Feb-09
3. Research on techniques used for ASR                                    28-Feb-09    70                8-May-09
3.1 Review on feature/voiceprint extraction techniques                    28-Feb-09    29                28-Mar-09
3.2 Review of modelling techniques for database                           29-Mar-09    21                18-Apr-09
3.3 Review of comparison techniques for speaker recognition               19-Apr-09    20                8-May-09
4. Learning of Matlab programming and development of prototype
   software for speaker recognition system                                9-May-09     36                13-Jun-09
4.1 Familarising of Matlab functions and commands                         9-May-09     8                 16-May-09
4.2 Matlab GUI functions                                                  17-May-09    14                30-May-09
4.3 Building GUI for ASR                                                  31-May-09    14                13-Jun-09
5. Software simulation and evaluation of different algorithms and
   voice print features using the prototype software developed            14-Jun-09    58                10-Aug-09
5.1 Implementing Pre-processing in Matlab                                 14-Jun-09    14                27-Jun-09
5.2 Implementing LPC in Matlab                                            28-Jun-09    7                 4-Jul-09
5.3 Implementing LPCC in Matlab                                           5-Jul-09     7                 11-Jul-09
5.4 Implementing MFCC in Matlab                                           12-Jul-09    7                 18-Jul-09
5.5 Implementing VQ-LBG in Matlab                                         19-Jul-09    7                 25-Jul-09
5.6 Implementing comparison technique in Matlab                           26-Jul-09    16                10-Aug-09
6. Project review                                                         14-Jun-09    121               11-Oct-09
6.1 Compilation of results                                                14-Jun-09    57                9-Aug-09
6.2 Analysis of results                                                   9-Aug-09     56                3-Oct-09
6.3 Assessment of difficulties encountered                                4-Oct-09     8                 11-Oct-09
7. Preparation of final report                                            14-Jun-09    121               11-Oct-09
7.1 Writing skeleton of final report                                      14-Jun-09    14                27-Jun-09
7.2 Writing literature review of report                                   28-Jun-09    35                1-Aug-09
7.3 Writing introduction of report                                        2-Aug-09     13                14-Aug-09
7.4 Writing main body of report                                           15-Aug-09    25                8-Sep-09
7.5 Writing conclusion and further study                                  9-Sep-09     22                30-Sep-09
7.6 Finalising and amendments to final report                             1-Oct-09     12                11-Oct-09
8. Preparation of oral presentation                                       12-Oct-09    40                22-Nov-09

Resources: IEEE ASP, Internet, Library resources; Personal Computer & Matlab

Table 3.1 Details of project plan
4. Development of the speaker recognition system
The main purpose of the prototype system is to compare the recognition rates in order to
determine the suitability of the different types of features for use in a speaker recognition
system.
The development of the prototype speaker recognition system will be done using Matlab.
This chapter will describe in detail the techniques used for the pre-processing and voice
feature extraction stages. The figure below shows the flowchart of the program development
for feature extraction.
Figure 4.1 Flowchart of the prototype speaker recognition system (select and open the wav
file of the speaker → perform end-point detection → perform pre-emphasis →
framing/windowing → perform feature extraction (LPC, LPCC or MFCC) → create the
speaker model using the LBG algorithm)
4.1 DC Offset Removal
The average of the signal will be computed and subtracted from the original signal.
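In Matlab this amounts to a single mean subtraction; the variable name speech is illustrative.

    speech = speech - mean(speech);   % remove the DC offset by subtracting the signal mean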
4.2 End-point detection
End-point detection refers to the removal of the silent portions of the speech data. STE is the
method implemented for this process in this project. The speech signal is divided into 0.5 ms
frames, and the energy of each frame is compared with the average energy of the speech
signal. Frames with energy below the threshold are discarded, and the retained frames are
combined to form the final speech data for further speech processing.
Figure 4.2 Flowchart of end-point detection for the prototype speaker recognition system
(speech data with DC offset removed → calculate the average energy of the speech signal →
split the signal into 0.5 ms frames → calculate the energy of each frame and compare it with
the average energy: frames with energy below the average are discarded, frames with energy
equal to or above the average are retained → combine the retained frames to recover the
speech signal)
The average energy used as the silence threshold is computed from the short-time energies of the frames:
E_avg = (1/M) * sum_{m=1}^{M} E_m,   where E_m = sum_n x_m(n)^2
with E_m the energy of frame m and M the total number of frames.
Figure 4.3 Speech data before and after End-Point Detection
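A minimal Matlab sketch of this STE end-point detection, assuming x is the DC-removed signal and fs the sampling frequency (variable names are illustrative):
frameLen = round(0.0005 * fs);                     % 0.5 ms frames, as described above
nFrames  = floor(length(x) / frameLen);
frames   = reshape(x(1:nFrames*frameLen), frameLen, nFrames);
frameE   = sum(frames.^2, 1);                      % short-time energy of each frame
avgE     = mean(frameE);                           % average energy used as the threshold
keep     = frameE >= avgE;                         % frames below the threshold are discarded
speech   = frames(:, keep);
speech   = speech(:);                              % combine retained frames into one signal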
4.3 Pre-Emphasis
Pre-emphasis is a technique used to enhance the high frequencies of the speech signal. There are two important reasons for doing this:
1. To enhance the speaker-specific information in the higher frequencies of speech.
2. To negate the effect of the energy decrease at higher frequencies, so that the whole spectrum of the speech signal can be analysed properly.
The figures below show the speech signal before and after pre-emphasis. In this project, pre-emphasis is implemented as a first-order Finite Impulse Response (FIR) filter of the form H(z) = 1 - α·z^(-1).
Generally, α is selected between 0.9 and 0.95.
Figure 4.4 Frequency response and z-plane plot of the pre-emphasis filter with α = 0.95
Figure 4.5 Speech data before and after Pre-Emphasis
Figure 4.6 Power spectrum before and after Pre-Emphasis
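A one-line Matlab sketch of this filter, assuming 'speech' is the end-point-detected signal from the previous step:
a = 0.95;                              % pre-emphasis coefficient, as in Figure 4.4
preEmph = filter([1 -a], 1, speech);   % first-order FIR filter H(z) = 1 - a*z^(-1)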
4.4 Framing/Windowing
Speech signals are quasi-stationary when examined over small time intervals of 15ms to 30ms. Framing divides the speech signal into smaller segments to make it suitable for practical analysis and processing of the vocal tract information. Overlapping is required to fully capture the speaker-specific features in the speech data.
Windowing is performed on the framed signal to smooth the abrupt discontinuities at the end points of the frames. In this project, the speech signal is divided into fixed frames of 20ms with an overlap of 50%, and a Hamming window is used to suppress the abrupt and undesirable frequency components in the speech frames.
Figure 4.7 Speech frame before and after Hamming window
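A minimal sketch of the framing and windowing step in Matlab (20ms frames, 50% overlap, Hamming window; variable names are illustrative and follow on from the pre-emphasis sketch above):
frameLen = round(0.020 * fs);                     % 20 ms frame length in samples
step     = round(frameLen / 2);                   % 50% overlap
win      = hamming(frameLen);                     % Hamming window
nFrames  = floor((length(preEmph) - frameLen) / step) + 1;
framed   = zeros(frameLen, nFrames);
for k = 1:nFrames
    idx = (k-1)*step + (1:frameLen);
    framed(:, k) = preEmph(idx) .* win;           % windowed speech frame
end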
4.5 Feature extraction
Evaluating the different types of features extracted from voice to determine their suitability for ASR is the main focus of this project. The most popular and widely known features currently used for ASR are Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). As such, this project uses these three features to evaluate their suitability for implementation in ASR.
4.5.1 Linear Prediction Coefficients (LPC)
In this project, the LPC coefficients are retrieved by passing the speech frames into the LPC
function in Matlab. Basically the Matlab function uses the autocorrelation method of
autoregressive (AR) modelling to find the filter coefficients.
LPC computes the least-squares solution to Xa = b, where X is the data matrix formed from the windowed frame x, a is the vector of prediction coefficients, b is the target vector and m is the length of x. Solving the least-squares problem using the normal equations
X^H X a = X^H b
leads to the Yule-Walker equations, where r = [ r(1) r(2) r(3) ... r(p+1) ] are autocorrelation estimates for x computed using xcorr.
The Yule-Walker equations are solved using the Levinson-Durbin algorithm. In this project,
the orders of LPC being used are the 8th, 12th, 16th and 20th.
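As a minimal sketch (illustrative variable names, not the project's actual code), the per-frame LPC extraction can be written as:
p = 16;                                   % LPC order (8, 12, 16 or 20 in this project)
lpcCoeff = zeros(p, size(framed, 2));
for k = 1:size(framed, 2)
    a = lpc(framed(:, k), p);             % autocorrelation method, Levinson-Durbin
    lpcCoeff(:, k) = a(2:end).';          % drop the leading 1 of the error filter
end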
4.5.2 Linear Prediction Cepstral Coefficients (LPCC)
LPCC is a technique that combines LP and cepstral analysis by taking the inverse Fourier
transform of the log magnitude of the LPC spectrum for improved accuracy and robustness
of the voice features extracted.
[Flowchart: for each frame of the windowed speech signal, extract the Linear Prediction Coefficients using the lpc function in Matlab, then convert them into Linear Prediction Cepstral Coefficients using the recursive formula; repeat until the end of the speech frames, at which point the computation of the LPCC is complete.]
Figure 4.8 Flowchart of computation of LPCC
For this project, the recursive formula used for the calculation of the LPCC is the standard LPC-to-cepstrum recursion:
c(1) = a(1)
c(n) = a(n) + sum_{k=1}^{n-1} (k/n) * c(k) * a(n-k),   for 1 < n <= p
where a(n) are the LPC coefficients of the frame and p is the prediction order.
In this project, the orders of LPCC being used are the 8th, 12th, 16th and 20th.
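A direct Matlab sketch of this recursion for one frame, assuming 'a' holds the p predictor coefficients obtained in section 4.5.1 (illustrative only):
c = zeros(p, 1);
c(1) = a(1);
for n = 2:p
    acc = 0;
    for k = 1:n-1
        acc = acc + (k/n) * c(k) * a(n-k);   % weighted sum of earlier cepstral terms
    end
    c(n) = a(n) + acc;                       % n-th cepstral coefficient
end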
4.5.3 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC uses a bank of filters to warp the frequency spectrum onto the mel scale, which approximates how the human ear perceives sound. The mel-scale filters are spaced linearly at low frequencies and logarithmically at high frequencies to imitate human hearing. For this project, the mel filterbank is adapted from Voicebox: Speech Processing Toolbox for MATLAB by Mike Brookes. The MFCC are extracted by passing the frames of the windowed speech signal into the mfcc.m function written for this project.
[Flowchart: for each frame of the windowed speech signal, compute the Fast Fourier Transform, perform mel-frequency warping by applying the filterbank function melbank.m from Voicebox, take the logarithm and apply the Discrete Cosine Transform to obtain the Mel-Frequency Cepstral Coefficients; repeat until the end of the speech frames, at which point the computation of the MFCC is complete.]
Figure 4.9 Flowchart of computation of MFCC
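A minimal sketch of this pipeline in Matlab. It assumes Voicebox is on the path and uses its melbankm filterbank function (the report refers to the filterbank file as melbank.m); the FFT length and number of filters shown here are assumptions, not values taken from the report:
nfft  = 512;                                        % FFT length (assumed)
nFilt = 24;                                         % number of mel filters (assumed)
nCoef = 16;                                         % MFCC order used in this project
m = melbankm(nFilt, nfft, fs);                      % mel filterbank from Voicebox
mfccCoeff = zeros(nCoef, size(framed, 2));
for k = 1:size(framed, 2)
    spec = abs(fft(framed(:, k), nfft)).^2;         % power spectrum of the frame
    melE = m * spec(1:1+floor(nfft/2));             % mel-warped filterbank energies
    c    = dct(log(melE));                          % logarithm followed by DCT
    mfccCoeff(:, k) = c(1:nCoef);                   % keep the first nCoef coefficients
end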
4.6 Speaker modelling
A speaker model is a compact representation of the speaker's voice features. In this project, the speaker models are generated using the Vector Quantization LBG (VQ-LBG) method: the speaker feature coefficients are passed into the function to generate codebooks of sizes 32, 64 and 128. The rationale for choosing the VQ-LBG method is its ease of implementation and its performance, which is comparable to other speaker modelling methods.
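A compact sketch of the LBG codebook training, assuming 'feat' is a dim-by-N matrix of feature vectors for one speaker and M is the desired codebook size (32, 64 or 128); the splitting factor and the fixed number of refinement passes are illustrative choices, not the project's exact settings:
function cb = lbgCodebook(feat, M)
% Generate an M-entry VQ codebook from the feature vectors in feat.
epsSplit = 0.01;                                    % centroid splitting perturbation
cb = mean(feat, 2);                                 % start with a single centroid
while size(cb, 2) < M
    cb = [cb*(1+epsSplit), cb*(1-epsSplit)];        % split every centroid in two
    for iter = 1:10                                 % a few k-means style refinements
        d = zeros(size(feat, 2), size(cb, 2));
        for j = 1:size(cb, 2)
            diff = bsxfun(@minus, feat, cb(:, j));
            d(:, j) = sum(diff.^2, 1).';            % squared Euclidean distances
        end
        [~, idx] = min(d, [], 2);                   % nearest-centroid assignment
        for j = 1:size(cb, 2)
            if any(idx == j)
                cb(:, j) = mean(feat(:, idx == j), 2);  % update centroid
            end
        end
    end
end
end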
This project uses the squared Euclidean distance as the speaker similarity measure. The only difference between the squared Euclidean distance and the normal Euclidean distance is that the square root is not taken. The formula used in the function is:
d(x, y) = (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2
or, in vector form,
d_i(x_i, c) = (x_i - c)^T (x_i - c)
where x_i is a feature vector and c the codeword being compared.
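As a sketch of how this measure is used for identification (assumed function and variable names; the report does not give the exact code), the test vectors are quantized against each stored codebook and the average distortion is taken as the matching score:
function [bestName, scores] = identifySpeaker(testFeat, codebooks, names)
% testFeat: dim-by-N test vectors; codebooks: cell array of trained codebooks.
nSpk = numel(codebooks);
scores = zeros(nSpk, 1);
for s = 1:nSpk
    cb = codebooks{s};
    total = 0;
    for i = 1:size(testFeat, 2)
        diff = bsxfun(@minus, cb, testFeat(:, i));
        total = total + min(sum(diff.^2, 1));       % squared distance to nearest codeword
    end
    scores(s) = total / size(testFeat, 2);          % average distortion = matching score
end
[~, bestIdx] = min(scores);                         % lowest score is the best match
bestName = names{bestIdx};
end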
5. Experimental Setup
This chapter discusses the setup used to analyse the various techniques and algorithms and to determine their suitability for speaker recognition systems.
Speaker voice data
The speaker voices were recorded in a quiet room using the Windows audio recorder software and a SonicGear microphone. The sampling rate for the recording was set at 16 kHz. Speakers were asked to utter the single-digit sequence from '0' to '9' eighteen times. Fifteen of the samples were combined into a single file for training the speaker model, and the remaining three samples were used to test the recognition capability of the different voiceprints in the prototype ASR developed.
Language: English
Speakers: 15 (6 males, 9 females)
Speech type: Single digits from '0' to '9'
Recording condition: Relatively clean
Sampling frequency: 16 kHz
Training speech duration: Approximately 90 seconds
Evaluation speech duration: Approximately 5 seconds
Table 5.1 Voiceprint test parameters
The speech pre-processing, feature extraction and speaker modelling used are as mentioned
in chapter 4.
Prototype development
The prototype ASR will be developed using Matlab GUI.
6. Results
6.1 Linear Predictive Coefficients
The first method evaluated is the LPC-derived voice features. LPC is seldom used on its own for speaker recognition in modern-day ASR, but in this project it serves as a baseline for comparison with the other methods. The results are shown in figures 6.1, 6.2 and 6.3.
Figure 6.1 Results of LPC for codebook size of 32
From figure 6.1, varying the order of LPC has an effect on the recognition rate. It is observed that LPC using 8 coefficients has a better recognition rate than the higher-order LPC for a codebook of size 32.
The results, however, are not unexpected. The two most significant factors that affect the recognition results are the quality of the speech signal and the size of the codebook. Increasing the codebook size and the number of LPC coefficients increases the effect of noise on the signal, as the features then carry more information in which noise can be present.
Figure 6.2 Results of LPC for codebook size of 64
The results obtained for LPC using a codebook size of 64 in figure 6.2 are very similar to those using a codebook size of 32. The recognition rate decreases from 66.67% to around 40%-53.33% as the number of coefficients used increases. Figure 6.3 also shows similar recognition rates for a codebook of size 128 compared with codebooks of size 64 and 32.
Figure 6.3 Results of LPC for codebook size of 128
6.1.2 Conclusion of LPC
Figure 6.4 Comparison of LPC using different codebook sizes
The results show that the recognition rate of features extracted using LPC ranges from 40% to 66.67%. LPC using fewer coefficients generally has better recognition rates than LPC using more coefficients. It is observed that the LPC recognition rates are most consistent when a codebook of size 32 is employed.
The possible reasons for the poor recognition rate are, firstly, insufficient speaker-specific information in LPC and, secondly, susceptibility to noise within the signal, which causes inaccuracy in the extracted features.
The findings show that LPC does not perform well enough to be used for a secure ASR.
6.2 Linear Predictive Cepstral Coefficients
The second method evaluated is the LPCC-derived voice features. LPCC is computed from LPC and is one of the most popular features used for speaker recognition in modern-day ASR. The results are shown in figures 6.5, 6.6 and 6.7.
Figure 6.5 Results of LPCC for codebook size of 32
From figure 6.5, varying the order of LPCC has an effect. It is observed that the recognition rate increases with the order of LPCC for a codebook of size 32: it rises from 73.33% (LPCC8) to 93.33% (LPCC12 and LPCC16) and drops to 86.67% (LPCC20). This finding shows that the recognition rate does not always increase with the order of the LPCC; in fact, LPCC experiences a drop in recognition rate when higher-order coefficients are used. This tallies with the study by Reynolds [23], where the LPCC recognition rate averages 90% and drops when higher-order LPCC is used.
Figure 6.6 Results of LPCC for codebook size of 64
The results obtained for LPCC using a codebook size of 64 in figure 6.6 are very similar to those using a codebook size of 32. The recognition rate increases from 73.33% and levels off at 93.33%. The finding shows that increasing the order of LPCC only helps for low-order coefficients; increasing the order further has no additional effect.
Figure 6.7 Results of LPCC for codebook size of 128
Figure 6.7 also shows similar recognition for a codebook of size 128. The results are similar to LPCC using codebooks of size 32 and 64, and the recognition rate peaks at the 16th order of LPCC.
6.2.1 Conclusion of LPCC
Figure 6.8 Comparison of LPCC using different codebook sizes
The results show that the recognition rate of features extracted using LPCC ranges from 73.33% to 100%. LPCC using fewer coefficients has the worst recognition, and the recognition rate peaks at the 16th order of the LPCC for all codebook sizes.
The findings show that the order of LPCC used affects the recognition rate; however, increasing the order beyond 16 increases the computation time and causes the recognition rate to drop.
The findings are expected, as LPCC is less efficient when higher-order coefficients are used: superfluous information or noise is included in the speaker model, resulting in lower recognition rates. The results also show LPCC to be more robust and accurate than LPC.
6.3 Mel-Frequency Cepstral Coefficients
The third method evaluated is the MFCC-derived voice features. MFCC are coefficients that represent sound based on human perception, and they are also among the most popular features used for speaker recognition in modern-day ASR. MFCC are derived by taking the Fourier transform of the signal, warping it onto the mel scale using a mel filterbank, and finally performing the Discrete Cosine Transform on the logarithm of the mel filterbank outputs. The results are shown in figures 6.9, 6.10 and 6.11.
Figure 6.9 Results of MFCC for codebook size of 32
From figure 6.9, varying the order of the MFCC does not seem to have much effect on the recognition rate. The recognition rate increases from 80% (MFCC8) and stays at 93.33% (MFCC12, MFCC16 and MFCC20). The results for MFCC using a codebook size of 32 show that MFCC performs better than LPCC and LPC when smaller codebooks are used. This might be because MFCC is more immune to the noise that affects LPC and LPCC.
Figure 6.10 Results of MFCC for codebook size of 64
The results in figure 6.10 show a recognition rate of 93.33% across all the orders of MFCC used. Varying the order of the MFCC neither increases nor decreases the recognition rate.
Figure 6.11 Results of MFCC for codebook size of 128
Figure 6.11 also shows similar recognition for a codebook of size 128. The results are similar to MFCC using a codebook of size 32, with the recognition rate peaking at 93.33%.
6.3.1 Conclusion of MFCC
Figure 6.12 Comparison of MFCC using different codebook sizes
The results show the recognition rate of MFCC ranging from 80% to 93.33%. MFCC using fewer coefficients (MFCC8) has the worst recognition, and the recognition rate peaks at the 12th order of the MFCC for all codebook sizes.
The findings show that the order of MFCC used affects the recognition rate. However, increasing the order beyond 12 brings no further benefit to the recognition rate of the system.
It is also observed that MFCC using a codebook size of 64 is the most consistent. The codebook size determines the number of feature vectors stored for comparison, and the findings show that a codebook size of 64 gives the best results for speaker recognition.
Overall, the MFCC recognition rate is better than that of LPC and LPCC. It did not achieve 100% for speaker recognition, but the recognition rates are more consistent than those of LPC and LPCC. The worst recognition rate was 80% for MFCC8 and the best was 93.33% for MFCC12, MFCC16 and MFCC20.
The findings are consistent with MFCC being known to be more robust to noise and spectral estimation errors when higher-order coefficients are used: the recognition rate is maintained for orders above 12.
6.4 LPCC+MFCC
Based on the above results for the different voice features, this method aims to combine the two features LPCC and MFCC to achieve a better recognition rate by considering supplementary information sources. This is accomplished using output fusion, which models the individual feature sets separately and combines them at the output to give the overall matching score. The figure below shows the structure of the proposed system.
[Block diagram: the speech signal is compared with the stored speaker models from LPCC16 (codebook size 32) and, in parallel, with the speaker models from MFCC16 (codebook size 32); the two output matching scores are weighted (score 1 and score 2) and combined, and the decision logic returns the user name and distance.]
Figure 6.13 Block diagram of proposed system (LPCC+MFCC)
The speech signal of the unknown speaker is processed individually using LPCC (16th order, codebook size 32) and MFCC (16th order, codebook size 32) and compared with the corresponding codebooks in the known-speaker database. The choice of codebook size and order is based on the following reasons:
1. The size of the codebook determines the complexity of the computation, and based on the results achieved, a codebook size of 32 managed to achieve 93.33% for both LPCC16 and MFCC16.
2. The extra computational time required to implement such a system is negligible compared with the other methods when running on a typical home PC with a Core 2 Duo processor.
The corresponding matching scores, which indicate the degree of similarity between the users, are generated and combined. The weighting allocated is equal, because the individual results show that LPCC16 and MFCC16 have equal recognition rates. The combined score is calculated by:
S(i, j) = 0.5 * d_LPCC(i, j) + 0.5 * d_MFCC(i, j)
where i is the current unknown speaker and j indexes the known speakers residing in the database. The user with the lowest combined score (highest degree of similarity) is returned as the identity of the unknown speaker.
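A short Matlab sketch of this equal-weight output fusion, assuming scoreLPCC and scoreMFCC are the per-speaker average distortions returned by the two subsystems (as in the identification sketch of section 4.6); variable names and weights shown here are illustrative:
w1 = 0.5;  w2 = 0.5;                          % equal weighting, as described above
combined = w1*scoreLPCC + w2*scoreMFCC;       % combined matching score per known speaker
[bestScore, bestIdx] = min(combined);         % lowest combined score = highest similarity
identifiedUser = names{bestIdx};              % identity returned by the decision logic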
6.4.2 Comparison of LPCC+MFCC vs Other Methods
Figure 6.14 Overview of recognition rates using different codebook sizes
Looking at figure 6.14, LPCC+MFCC managed to achieve a 100% recognition rate. It achieved a better recognition rate than the individual LPCC16 and MFCC16 using a codebook size of 32, and the extra computational time required is negligible on modern desktop PCs.
Overall, MFCC performed slightly more consistently than LPCC: varying the order and codebook size of MFCC does not cause the recognition rate to fluctuate as much as it does for LPCC. LPC performed the worst of all the methods tested, which is not surprising, as LPC does not contain enough speaker-specific features and is much more prone to noise than LPCC and MFCC.
7. Conclusion
This thesis has presented voiceprint analysis for speaker recognition. Various pre-processing stages prior to feature extraction were studied and implemented for the prototype ASR. The prototype was developed to analyse and evaluate voice feature extraction methods such as LPC, LPCC and MFCC for their suitability in ASR. In addition, a new method (LPCC16+MFCC16) was proposed to enhance the recognition rate of ASR using output fusion.
The results obtained show that LPCC and MFCC perform relatively well in speaker recognition tasks. LPCC of order 16 with a codebook size of 128 achieved the best recognition rate of 100%. However, a codebook of size 128 requires considerably more computation, which affects the performance of the system, and LPCC also performs poorly when an insufficient order is used. MFCC is more consistent than LPCC in the recognition task, as it is less susceptible to noise and is modelled after the human perception of sound.
An evaluation of the fusion method using LPCC16 and MFCC16 achieved 100% accuracy on a group of 15 speakers. The result indicates that, by using multiple feature sets, it is possible to achieve a high recognition rate with smaller codebooks.
7.1 Recommendations for further study
Usage of other sources of information: combining short-time spectral analysis (10-30ms) with behavioural characteristics such as accent, word duration and speaking speed for speaker identification.
Robustness of LPCC+MFCC: the reliability of speaker recognition systems drops drastically in noisy environments, so an evaluation of the robustness of LPCC+MFCC could be conducted.
Evaluate LPCC+MFCC using telephone speech: the effectiveness of the method could be further evaluated by using telephone speech.
8. Critical review and reflections
“The road of life twists and turns and no two directions are ever the same. Yet our lessons come from the journey, not the destination.” This quote by American novelist Don Williams, Jr, quite rightly sums up my feelings about the journey taken for my Final Year Project.
From the selection of a thesis topic that seemed simple, to the realisation of the daunting tasks that had to be completed to finish the subject, the lessons I gained were not just academic knowledge but also life lessons such as time management, discipline and perseverance.
The main tasks of the project were divided into smaller sections to make the daunting task more manageable. The first phase involved long periods in libraries and on the Internet researching topics related to speaker recognition. Much time and effort went into gaining a better understanding of the field of speaker recognition, of which I had no prior knowledge. The many consultation sessions with my project supervisor, Dr Yuan, to clarify various concepts and theories of speaker recognition technology were invaluable and critical in laying my foundational knowledge in this field.
The second phase involved putting the knowledge gained from the research into practical implementation. The prototype ASR was developed using Matlab to evaluate the different voiceprints. Many issues and challenges were encountered during this phase. My lack of practical knowledge of Matlab was crippling at times, hindering the overall progress of the project, and much time was spent searching the Internet and Matlab help forums for help implementing certain functions. The sense of euphoria when a Matlab
function that I had coded finally worked the way I wanted, after all the hours of testing and debugging, made me realise that programming can be fun and exciting too.
The third phase involved the actual testing and evaluation of the different voiceprints. The initial ASR had a recognition rate of 70%, which was a far cry from the 90% indicated in various research papers. Dr Yuan pointed out that my speaker models were being built with insufficient speaker voice samples and needed to be changed in order to achieve better recognition rates. After rectifying the problem, the ASR managed to achieve results close to the published standards.
After the evaluation of the different voiceprints, the idea of merging two feature extraction methods (LPCC+MFCC) for use in ASR arose in one of the sessions with Dr Yuan. As always, the idea is simple but the implementation is hard: the comparison distances of the LPCC and MFCC differ, so setting the weighting for LPCC+MFCC required analysis of the results for the different orders of the LPCC and MFCC. Extra effort was needed to attain the goal.
All in all, the FYP to me is not just about attaining the project objectives. The most important thing is the self-actualisation process: from a seemingly uphill task to climbing the “big” mountain step by step until the actual completion of the project, the process is about much more than just knowledge. Looking back, I am glad that I embarked on this journey, which has given me the chance to hone analytical, problem-solving and critical-thinking skills that can only benefit me in my life and future career.
9. References
1. Reynolds, D.A. "An overview of automatic speaker recognition technology," Acoustics, Speech, and Signal Processing, 2002. Proceedings (ICASSP '02), IEEE International Conference on, vol. 4, pp. IV-4072-IV-4075, 2002.
2. Keerio, A., Mitra, B.K., Birch, P., Young, R. and Chatwin, C. "On Preprocessing of Speech Signals," International Journal of Signal Processing, vol. 5, no. 3, 2009, p. 216.
3. Campbell, J.P. "Speaker Recognition," Technical report, Department of Defence, Fort Meade, 1999.
4. Al-Akaidi, M. "Introduction to speech processing," in Fractal Speech Processing, Cambridge University Press, Cambridge, UK, 2004.
5. Saha, G., Chakroborty, S. and Senapati, S. "A new Silence Removal and End Point Detection Algorithm for Speech and Speaker Recognition Applications," in Proc. Eleventh National Conference on Communications (NCC), IIT Kharagpur, India, January 28-30, 2005.
6. Greenwood, M. and Kinghorn, A. "SUVing: Automatic silence/unvoiced/voiced classification of speech," Department of Computer Science, The University of Sheffield, 1999.
7. Long, H.-N. and Zhang, C.-G. "An improved method for robust speech endpoint detection," Machine Learning and Cybernetics, 2009 International Conference on, vol. 4, pp. 2067-2071, 12-15 July 2009.
8. Liu, L., He, J. and Palm, G. "Signal modeling for speaker identification," Acoustics, Speech, and Signal Processing, 1996 (ICASSP-96), Conference Proceedings, 1996 IEEE International Conference on, vol. 2, pp. 665-668, 7-10 May 1996.
9. Picone, J.W. "Signal modeling techniques in speech recognition," Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, Sep 1993.
10. Jayanna, H.S. and Mahadeva Prasanna, S.R. "Analysis, Feature Extraction, Modeling and Testing Techniques for Speaker Recognition," IETE Technical Review, vol. 26, pp. 181-190, 2009.
11. Schroeder, M. "Linear prediction, entropy and signal analysis," ASSP Magazine, IEEE, vol. 1, no. 3, pp. 3-11, Jul 1984.
12. Schroeder, M. "Linear predictive coding of speech: Review and current directions," Communications Magazine, IEEE, vol. 23, no. 8, pp. 54-61, Aug 1985.
13. Kwong, S. and Nui, P.T. "Design and implementation of a parametric speech coder," Consumer Electronics, IEEE Transactions on, vol. 44, no. 1, pp. 163-169, Feb 1998.
14. Gupta, V., Bryan, J. and Gowdy, J. "A speaker-independent speech-recognition system based on linear prediction," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 26, no. 1, pp. 27-33, Feb 1978.
15. Atal, B. "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am., vol. 55, no. 6, pp. 1304-1312.
16. Wong, E. and Sridharan, S. "Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification," Intelligent Multimedia, Video and Speech Processing, 2001, Proceedings of 2001 International Symposium on, pp. 95-98, 2001.
17. Hasan, M.R., Jamil, M., Rabbani, M.G. and Rahman, M.S. "Speaker Identification using Mel Frequency cepstral coefficients," 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), 28-30 December 2004, Dhaka, Bangladesh.
18. Vergin, R. "An algorithm for robust signal modelling in speech recognition," Acoustics, Speech and Signal Processing, 1998, Proceedings of the 1998 IEEE International Conference on, vol. 2, pp. 969-972, 12-15 May 1998.
19. Vergin, R., O'Shaughnessy, D. and Farhat, A. "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," Speech and Audio Processing, IEEE Transactions on, vol. 7, no. 5, pp. 525-532, Sep 1999.
20. Buchanan, C.R. "Informatics Research Proposal - Modeling the Semantics of Sound," 2005.
21. Seddik, H., Rahmouni, A. and Sayadi, M. "Text independent speaker recognition using the Mel frequency cepstral coefficients and a neural network classifier," Communications and Signal Processing, 2004, First International Symposium on, pp. 631-634, 2004.
22. Molau, S., et al. "Computing Mel-frequency cepstral coefficients on the power spectrum," Acoustics, Speech, and Signal Processing, 2001, Proceedings (ICASSP '01), 2001 IEEE International Conference on, vol. 1, pp. 73-76.
23. Reynolds, D.A. "Experimental evaluation of features for robust speaker identification," Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 4, pp. 639-643, Oct 1994.
24. Zhonghua, F. and Rongchun, Z. "An overview of modeling technology of speaker recognition," Neural Networks and Signal Processing, 2003, Proceedings of the 2003 International Conference on, vol. 2, pp. 887-891, 14-17 Dec. 2003.
25. Linde, Y., Buzo, A. and Gray, R.M. "An algorithm for vector quantizer design," IEEE Trans. Communications, vol. COM-28, no. 1, pp. 84-96, Jan. 1980.
26. Gray, R. "Vector quantization," IEEE Acoust., Speech, Signal Process. Mag., vol. 1, pp. 4-29, Apr. 1984.
27. Soong, F.K., Rosenberg, A.E., Rabiner, L.R. and Juang, B.H. "A vector quantization approach to speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 10, Detroit, Michigan, Apr. 1985, pp. 387-390.
28. Sookpotharom, S., Reruang, S., Airphaiboon, S. and Sangworasil, M. "Medical Image Compression Using Vector Quantization and Fuzzy C-Means." [Online] http://www.kmitl.ac.th/biolab/member/sutath/final_paper_iscit02.pdf.
29. Mohamed, M.A. and Gader, P. "Generalized hidden Markov models. I. Theoretical frameworks," Fuzzy Systems, IEEE Transactions on, vol. 8, no. 1, pp. 67-81, Feb 2000.
30. Rabiner, L.R. "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.
31. Jain, A. and Harris, J. "Speaker Identification using MFCC and HMM based techniques," University of Florida, April 25, 2004.
32. Pawar, R.V., Kajave, P.P. and Mali, S.N. "Speaker Identification using Neural Networks," World Academy of Science, Engineering and Technology, December 2005.
33. Fakotakis, N., Georgila, K. and Tsopanoglou, A. "A continuous HMM text-independent speaker recognition system based on vowel spotting," in EUROSPEECH '97, 1997, vol. 5, pp. 2247-2250.
34. Bansal, P., Dev, A. and Jain, S.B. "Automatic Speaker Identification Using Vector Quantization," Asian Journal of Information Technology, vol. 6, no. 9, pp. 938-942, ISSN 1682-3915, Medwell Journals, 2007. (Amity School of Engineering and Technology, 580 Delhi Palam Vihar Road.)
35. Wutiwiwatchai, C., Sae-tang, S. and Tanprasert, C. "Text-dependent Speaker Identification Using Neural Network on Distinctive Thai Tone Marks," Proceedings of the International Joint Conference on Neural Networks, July 1999.
36. Biswas, S., Ahmad, S. and Islam Mollat, M.K. "Speaker Identification Using Cepstral Based Features and Discrete Hidden Markov Model," Information and Communication Technology, 2007 (ICICT '07), International Conference on, pp. 303-306, 7-9 March 2007.
37. Lee, K.-F. and Hon, H.-W. "Speaker-independent phone recognition using hidden Markov models," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 11, pp. 1641-1648, Nov 1989.
Appendix
A.1 Identification rates using codebook size of 32
A.2 Identification rates using codebook size of 64
A.3 Identification rates using codebook size of 128
A.4 Identification rates using MFCC+LPCC (Codebook size 32)
Figure A.3 FYP.fig
Figure A.4 Feature_selection.fig
Figure A.5 User_Identified.fig
Figure A.6 Voice_record_FYP.fig