Project Report

Technion – Israel Institute of Technology
Department of Electrical Engineering
Signal & Image Processing Lab
Final Report
Subject: Implementation of Voice Activity Detector
Students: Dan Waxman, Ran Margolin (I.D. 040496382, 015345598)
Instructor: Yevgeni Litvin
Semester: Winter 2008
Handed: December 2008
Contents

1.  Abstract
2.  Preface
3.  The Chosen Algorithm
4.  A Block Diagram of the Algorithm Implemented
    4.1.  The Algorithm's State Machine Diagram
5.  MATLAB Implementation
6.  C Code Implementation
7.  From Floating Point to Fixed Point
8.  Un-optimized Partial DSP Implementation
9.  Algorithm Performance and Evaluation
10. Results and Conclusions
11. Appendix A
1. Abstract
The objective of this project was to implement a Voice Activity Detection (VAD) algorithm for the teleconference switch manufactured by Polycom. The speech presence decision is used by the teleconference management logic implemented inside the switch (e.g. dominant speaker emphasis).
In the first stage of the project we studied the subject of voice analysis and found an algorithm suitable for the requested application.
In the second stage, we implemented the chosen algorithm in MATLAB; this stage included testing the initial version of the algorithm with various inputs.
In the third stage, we tested different methods to improve the algorithm's performance.
In the fourth stage we converted the MATLAB vector calculations to loops (C-style), in order to better simulate real use of the algorithm.
Finally, after debugging the algorithm in MATLAB, we converted the MATLAB code into C. Since most of the MATLAB code was already in C style (meaning no special MATLAB functions were used other than FFT), this stage did not present any substantial difficulty and was fairly straightforward.
In the next stage we converted all MATLAB and C code from floating point arithmetic (Fl.P.) into fixed point (F.P.). The transition required some changes in the code (mainly in C); for example, altering some expressions in order to avoid division.
Finally, we validated that both the MATLAB and C implementations produce the same results.
2. Preface
The VAD's main objective (as its name suggests) is to decide whether speech is present in a given audio segment.
As follows from the above definition, the input of the algorithm is an audio stream, and its output is a binary vector indicating the presence of speech in it. The input will very likely contain background noises, so the system must be robust to them.
The classification process is based on a spectral analysis of the noise in an audio segment. The presence of speech in a given frame of a sub-band can be determined by the ratio between the local energy of the noisy speech and its minimum within a specified time window (pseudo SNR). The ratio is compared to a certain threshold value, where a smaller ratio indicates absence of speech. Subsequently, a temporal smoothing is carried out to reduce fluctuations between speech and non-speech segments, thereby exploiting the strong correlation of speech presence in neighboring frames. The resultant noise estimate is computationally efficient, robust with respect to the input Signal to Noise Ratio (SNR) and type of additive noise, and characterized by the ability to quickly follow abrupt changes in the noise spectrum.
When attempting to make a binary decision regarding a frame of audio samples, we must first distinguish between the two general types of speech: "voiced" and "unvoiced". Unvoiced speech consists of consonants, i.e. short non-vocal sounds, which can easily be mistaken for background noise when not in the context of a speech flow. Examples of such sounds are produced when one pronounces letters such as 'p', 's', 't', etc. These sounds are created without the use of the vocal cords, which makes their characteristics similar to those of background noises. Therefore, such sounds are difficult to distinguish from non-speech related noises.
The second general type of speech is voiced speech. This type consists of vowels such as 'a', 'e', 'o', etc. These sounds are created by the excitation of the vocal cords, and can easily be set apart from regular environmental noises by the human ear. When inspecting the frequency content of such sounds, we can easily distinguish the two types of audio segments, as shown in figure 1:
Figure 1: Spectrogram of an 'o' sound followed by a 'ch' sound
The noise segment's energy is spread across a wide spectral range with no particular pattern, while the speech segment's energy has distinguishable peaks that can be seen in the analysis.
Therefore, among our goals was to ensure the VAD's ability to recognize unvoiced phonemes as well as voiced phonemes as speech. Our requirement was to deal only with quasi-stationary noises (i.e. noise whose statistics may change slowly); therefore a sudden change in energy in enough frequency bands (explained in chapter 4) proved to be a good indicator of the presence of speech.
3. Chosen Algorithm
The implemented VAD algorithm is based on ideas from an article by Prof. Israel Cohen, member IEEE, and Baruch Berdugo [1].
The article introduces a method for noise estimation, obtained by averaging past spectral power values and using a smoothing parameter that is adjusted by the signal presence probability in sub-bands. The presence of speech in sub-bands is determined by the ratio between the local energy of the noisy speech and its minimum within a specified time window (pseudo SNR).
Refer to [1] for a description of the recursive process.
The reasons for choosing the aforementioned article as a source for our VAD algorithm are:
• Intended for a real-time environment.
• Previous practical usage in the industry (the Speex codec).
• Previous usage in a similar project, done in the SIPL lab in 2004.
4. A Block Diagram of the Algorithm
The following diagram depicts the implemented algorithm, which is based on article [1]. A more detailed description of the diagram's stages is given later in this chapter.
[Block diagram of the implemented algorithm. Blocks: receive an N-sample frame (N−M overlap) → FFT → smoothing in time and frequency domains based on past values → finding a local minimum (based on past values) over at least L and at most 2L frames → estimating the "SNR" by dividing each sub-band by its local minimum → I = ("SNR" > δ) ? 1 : 0, indicating possible speech presence in a sub-band → calculating the probability of speech in each sub-band by smoothing I according to past values → decision on speech/non-speech for the first M samples according to the process described next.]

1. The first stage of the process is the acquisition of a single input frame (N samples).
2. The second stage performs an FFT of the input frame.
3. The third stage is smoothing in time and frequency. The smoothing is performed in order to improve the power spectrum estimation. After consulting with the SIPL staff, we were advised to inspect a certain range of frequencies that is known to include all the human voice frequencies: [30, 3000] Hz. The result of this stage will be noted henceforth as S.
4. The fourth stage is finding the local minima in the frame. A sample-wise comparison between the local energy and the minimum value of the previous frame yields the minimum value for the current frame; the minimum search is restarted after every L frames have been read (see Appendix A for a description of parameter L and an example calculation of the local minimum).
5. The fifth stage is estimation of the pseudo-SNR. We define the pseudo-SNR to be the ratio between the estimated signal strength and the noise estimate obtained in the previous stage.
6. The sixth stage is estimating speech presence. We do so by comparing the pseudo-SNR obtained in the previous stage to a fixed threshold δ.
7. The seventh stage reduces the variance of the speech presence estimator by smoothing it. Smoothing of the estimated indicator is valid due to the continuity assumption on the speech frames: there is a high probability of speech near adjacent speech frames. The result of this stage will be noted henceforth as p.
8. The eighth stage marks the current frame as either speech or non-speech, according to current and previous statistics (implemented as a state machine).
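To make stages 5–7 concrete, the following is a minimal C sketch of the per-bin computation for a single frame, using the formulas listed in Appendix A. The function and variable names are ours (illustrative only); the delivered code organizes these steps differently.

/* Illustrative sketch of stages 5-7 for one frame (not the delivered code). */
void speech_presence_update(const float S[],    /* smoothed PSD (stage 3)     */
                            const float Smin[], /* local minima (stage 4)     */
                            float p[],          /* smoothed indicator, in/out */
                            int frame_size, float delta, float alpha_p)
{
    int k;
    for (k = 0; k < frame_size; k++)
    {
        float sr = S[k] / Smin[k];         /* stage 5: pseudo-SNR per bin     */
        int   I  = (sr > delta) ? 1 : 0;   /* stage 6: compare to threshold   */
        /* stage 7: recursive smoothing of the indicator into a probability   */
        p[k] = alpha_p * p[k] + (1.0f - alpha_p) * (float)I;
    }
}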
4.1 Deciding on speech/no speech transitions
The speech presence/absence transition decision was implemented using a state machine.
The decision is based on the speech continuity principle: the probability of speech in the current frame depends on the distance between the current frame and the last frame in which speech was detected (i.e. a frame adjacent to speech frames has a higher probability of speech than a frame adjacent to non-speech frames).
The state machine implements this idea by using each state to represent the probability (referred to as confidence) of the presence of speech in the current frame (based on the distance from, and the amount of, previously detected speech frames). The higher the probability is, the more the detection constraints are relaxed.
State Machine Diagram (reproduced here in text form):

Per-state actions:
• Quiet: VadSpeech = 0; the confidence counters are reset (OnsetConfidence = 0 or SteadyConfidence = 0, depending on the previous state) and CoolDown = 0.
• Onset: OnsetConfidence++, VadSpeech = 1.
• Steady: SteadyConfidence++, CoolDown = 0, VadSpeech = 1.
• CoolDown: CoolDown++, VadSpeech = 1.

Transitions:
• Quiet → Onset: Pavg > StartTalk & Total_Energy > EngT (else stay in Quiet).
• Onset → Onset: Pavg > StartTalk & OnsetConfidence < Conf.
• Onset → Steady: Pavg > StartTalk & OnsetConfidence ≥ Conf (else return to Quiet).
• Steady → Steady: Pmax > ContinueTalk.
• Steady → CoolDown: Pmax ≤ ContinueTalk & SteadyConfidence ≥ ConfS.
• Steady → Quiet: Pmax ≤ ContinueTalk & SteadyConfidence < ConfS.
• CoolDown → Steady: Pavg > StartTalk.
• CoolDown → Quiet: Pavg ≤ StartTalk & CoolDown > MinCoolD + min(SteadyConfidence, MaxCoolD) (else stay in CoolDown).
Mathematical notations used:
• p_i – probability of the presence of speech in frequency bin i (calculated in stage 7).
• Pavg = (1/M) · Σ_{i=0}^{M-1} p_i – where M is the number of frequency bins used.
• Pmax = max{ p_i | 0 ≤ i ≤ M−1 } – where M is the number of frequency bins used.
• S_i – smoothed FFT coefficient (calculated in stage 3).
• Total_Energy = Σ_{i=0}^{M-1} S_i^2 – where M is the number of frequency bins used.
• SteadyConfidence – counter: number of consecutive frames while in the Steady state.
• CoolDown – counter: number of consecutive frames while in the CoolDown state.
• A priori thresholds:
  o StartTalk
  o EngT
  o OnsetConfidence
  o Conf
  o MinCoolD
  o MaxCoolD

Note:
1. Refer to the diagram above for the transition criteria between states.
2. All decisions/calculations discussed below are performed per frame.

States:
• Quiet:
As stated before, according to the speech continuity principle, the probability of speech adjacent to non-speech frames is low; therefore we use a high threshold to discern between speech absence and presence.
The criterion used here combines a threshold on the average probability of speech over the relevant frequencies and a very low threshold on the energy.
If the speech criterion is met, a transition to the Onset state is made and the frame is marked as speech.
If the criterion is not met, the frame is marked as non-speech.
• Onset:
In the Onset state we are not yet sure of speech presence. The high thresholds are kept (unchanged from the Quiet state) in order to minimize false alarms. After a couple of frames that have met the criterion of speech presence (i.e. high confidence of the presence of speech), the frame is marked as speech, the thresholds are relaxed and a transition to the Steady state is made.
If the criterion of speech presence is not met, we return to the Quiet state and mark the frame as non-speech.
• Steady:
In contrast to the Quiet and Onset states, according to the speech continuity principle, frames adjacent to speech frames have a high probability of being speech frames. Therefore, in this state we relax the criterion for speech detection: instead of using the average of the speech probabilities ( Avg(P) ) we use the maximum probability ( Max(P) ).
As long as we are not yet confident of steady speech, a frame that does not meet the speech criterion is marked as non-speech and a transition to the Quiet state is made.
After a couple of frames that have met the speech criterion (i.e. we are confident of steady/continuous speech), we continue to relax the threshold and add a cool-down mechanism: if the speech criterion is now not met, a transition to the CoolDown state is made and the frame is still marked as speech.
• CoolDown:
This state prevents fluctuations of the detection during continuous speech (i.e. it prevents a transition to the Quiet state when a speaker pauses for a short period in the middle of continuous speech).
While in this state, we count the number of frames that do not meet the speech presence criterion, while still announcing the frames as containing speech.
The counter limit is set according to the length of the preceding run of consecutive speech frames.
If the speech criterion is met before a certain maximum tolerance count is reached, we return to the Steady state. If not, a transition to the Quiet state is made and the frame is marked as non-speech.
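For reference, the four states can be represented by a simple enumeration (the Mode type referenced in the C structures of chapter 6); the enumerator names below follow the case labels used in the pseudo-code of chapter 9, and are given here only as a sketch.

/* The four states of the decision state machine (sketch of the Mode type). */
typedef enum
{
    QUIET,     /* no speech detected; strict thresholds            */
    ONSET,     /* possible start of speech; strict thresholds      */
    STEADY,    /* confident speech; relaxed threshold (Pmax based) */
    COOLDOWN   /* short pauses still reported as speech            */
} Mode;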
5. MATLAB Implementation
We first implemented the algorithm in MATLAB due to the ease of rapid prototyping possible in that environment.
Our first implementation of the VAD algorithm took advantage of the matrix arithmetic natively supported by MATLAB.
We then tested different methods to actually deduce the presence of speech from the information available to us (e.g. various thresholds, deciding which frequency bins are relevant, whether to use PMax or PAvg, separation into logical states (quiet, onset, etc.), etc.).
In order to be able to state that one approach proves better than another, we implemented a benchmark to test our different methods (see Algorithm Performance and Evaluation for the benchmark implementation).
After we were pleased with our results (over 90% accurate detection), we converted all the matrix arithmetic to loops in order to make the transition to the C language easier.
We then introduced fixed point arithmetic (everywhere except the FFT implementation) to the entire algorithm, and finally we had a MATLAB implementation that could easily be ported to C. This is the MATLAB code version supplied as part of the project deliverables.
The output of these calculations is a matrix P containing speech presence probabilities for each time-frequency bin: all the calculations explained in chapter 4 are carried out (using matrix arithmetic) and result in a matrix of speech probabilities P, where each column represents a frame (p) and each row represents a frequency bin.
The decision whether to mark a frame as speech or non-speech is performed by a state machine. On each iteration we input the frame's probability vector p (one column), and according to the current state and the current input, a speech/non-speech decision is appended to an output vector.
6. C Code Implementation
In the C code implementation we had to deal with the real-time execution requirements. Afterwards we had to revise the code in order to adapt it to a fixed point environment.
The algorithm's state is kept in a previous_state structure. It consists of the previous frame's analysis results as well as the state variables needed for the state machine.
// header of the Vad function; the return value is the sVad struct defined below
sVad Vad(int ProbableSpeechStart,
         int ProbableSpeechEnd,
         NoisySpeechFrame,
         vad_config,
         previous_state,
         NumOfFramesRead);
// vad struct
struct sVad
{
    Probability_Vector;
    SpeechPresence_Vector;
    P_Vector;
    S_Vector;
    Stmp_Vector;
    Smin_Vector;
};
In order to incorporate the usage of past values (i.e. from the previous frame), we created a struct whose objective is to hold the vectors of the previous frame's analysis. By keeping the values stated above, the algorithm's state is retained between calls.
// previous state struct
struct previous_state
{
    PreviousSmin[FrameSize];
    PreviousStmp[FrameSize];
    PreviousS[FrameSize];
    PreviousP[FrameSize];
    int Started;
    int SteadyConfidence;
    int OnsetConfidence;
    int CoolDown;
    Mode next_mode;
};
For the sake of code clarity, the configuration parameters of the VAD are kept in a separate structure.
// configuration parameters struct
struct vad_config
{
    // sampling parameters
    int FrameStep;
    FP_SAMPLE Ts;
    int SpeechStartFreq;
    int SpeechEndFreq;

    // parameters of minima search algorithm
    int DeltaParam;
    int Lparam;
    int WindowLength;
    FP_SAMPLE AlphaS;
    FP_SAMPLE AlphaP;

    // parameters of decision algorithm
    FP_SAMPLE ContinueTalk;
    FP_SAMPLE StartTalk;
    int Conf;
    int ConfS;
    int Started;
    int OnsetConfidence;
    int SteadyConfidence;
    int CoolDown;
    int MinCoolD;
    int MaxCoolD;
    FP_SAMPLE EngT;
};
Two functions are used to initialize the algorithm's state and configuration structures:
void vad_default_config()
void state_struct_initialize(previous_state)
The main loop of the algorithm invokes the Vad function for every input frame.
//*******************************************************************//
//******************* initializing vad main loop ********************//
//*******************************************************************//
for (i = 0; i < number_of_samples_including_padded_zeros_without_last_frame; i += FrameStep)
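For illustration, a possible body for this loop is sketched below. Only the Vad call and the structures shown above come from the delivered code; the buffer and helper names (samples, decisions, frame_is_speech) are assumptions made for the example.

{
    /* analyse one frame; the previous_state structure is read and updated across calls */
    sVad result = Vad(ProbableSpeechStart, ProbableSpeechEnd,
                      &samples[i],        /* current noisy frame     */
                      &config,            /* vad_config instance     */
                      &state,             /* previous_state instance */
                      num_frames_read);

    /* accumulate the binary speech/non-speech decision for this frame;
       frame_is_speech() is a hypothetical helper that extracts it from sVad */
    decisions[num_frames_read] = frame_is_speech(&result);
    num_frames_read++;
}
/* the accumulated decisions are written to the output file after the loop */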
Amongst the functions that were implemented to realize the algorithm we find:

• Time/Frequency smoothing functions
void TimeSmooth(IN Sf, IN PreviousS,
                IN AlphaS, IN NumOfFramesRead,
                IN FrameSize, OUT S /*S(k)*/);
void FreqSmooth(IN SAMPLE* puiY, IN FrameSize,
                IN WindowLength, OUT Sf);

• Local minima search
void LocalMin(IN S /*S(k,l)*/, IN PreviousS /*S(k,l-1)*/,
              IN PreviousStmp /*Stmp(k,l-1)*/, IN PreviousSmin /*Smin(k,l-1)*/,
              IN Lparam /*L*/, IN FrameSize,
              IN NumOfFramesRead /*l*/, OUT Sr /*Sr(k,l)*/,
              OUT Smin /*Smin(k,l)*/, OUT Stmp /*Stmp(k,l)*/);

• DSP radix-2 FFT (imported)
void DSP_radix2 (SAMPLE x, SAMPLE nx, double w);

• Energy computation of the DFT result
void Energy_Compute (Energy_Y, LengthY, Y);
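The energy computation itself is simply the squared magnitude of each DFT bin. A minimal sketch is given below; the argument types and the interleaved real/imaginary layout of Y are assumptions, since the delivered prototype does not show them.

/* Sketch of Energy_Compute: per-bin energy |Y(k)|^2 of the DFT result.
   Types and the real/imaginary interleaving of Y are assumed.          */
void Energy_Compute(float *Energy_Y, int LengthY, const float *Y)
{
    int k;
    for (k = 0; k < LengthY; k++)
    {
        float re = Y[2 * k];
        float im = Y[2 * k + 1];
        Energy_Y[k] = re * re + im * im;
    }
}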
The C implementation gets its input data from a file. Samples are read from a Windows .wav file (PCM format); the last frame read from the file is zero padded if the file does not contain a multiple of 128 samples. The output of the algorithm is a binary decision for every frame. The results are accumulated and written to the output file at the end of the program's execution.
The C code was first written for Microsoft's Visual Studio 2005. Later we were asked to port the implementation to Microsoft's Visual Studio 6 (e.g. in Visual Studio 6, the scope of variables declared inside a "for" statement is different than in standard ANSI C++).
In the following chapter we review the changes that were made to the code in order to adapt it to a fixed point environment.
7. Adapting the Algorithm to Fixed Point Arithmetic
The project requirements included a fixed point implementation of the VAD algorithm. This section surveys the changes that were made to the original algorithm to fulfill this requirement.
The Q1.30 fixed point representation was chosen for the majority of the variables (it allows a dynamic range of [-2, 2 - 2^-30], i.e. approximately [-2, 1.999...]).
The translation from floating point to Q1.30 is done by multiplying the floating point value by 2^30, followed by rounding to the closest integer. Using this method we map the dynamic range [-1, 1] onto the integer range [-2^30, 2^30].
The issues that have to be addressed when using the Q1.30 representation are:
• possible overflow when adding or subtracting two numbers;
• the result of a multiplication is a 64-bit integer.
To resolve these issues, multiplication is done by keeping only the most significant 31 bits (plus one sign bit), and overflow is dealt with by hand-coded integer saturation.
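These two issues can be handled with a pair of small helper routines, sketched below in portable C. This is a minimal example of the technique described above (a 32x32 to 64-bit multiply that keeps the upper bits, and a saturating addition); it is not the exact code used in the project.

#include <stdint.h>

typedef int32_t FP_SAMPLE;   /* Q1.30: 1 sign bit, 1 integer bit, 30 fraction bits */

/* Q1.30 multiplication: the 64-bit product is shifted right by 30 bits,
   keeping only the most significant part of the result in Q1.30 again.  */
static FP_SAMPLE fp_mul(FP_SAMPLE a, FP_SAMPLE b)
{
    return (FP_SAMPLE)(((int64_t)a * (int64_t)b) >> 30);
}

/* Saturating addition: clamp to the 32-bit range instead of wrapping. */
static FP_SAMPLE fp_add_sat(FP_SAMPLE a, FP_SAMPLE b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (FP_SAMPLE)sum;
}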
In addition, in order to optimize the code to better suit the DSP, divisions were replaced by multiplications (i.e. instead of dividing on the left hand side we multiply on the right hand side, and thus avoid the division).
An example of this can be seen when comparing the calculated "SNR" to a predefined threshold:

Originally:   pSNR = S / Smin,  I = (pSNR > δ) ? 1 : 0
Replaced by:  I = (S > δ · Smin) ? 1 : 0
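In fixed point, the replaced comparison reduces to a single multiplication, for example (using the fp_mul helper sketched above; illustrative only):

/* speech-presence indicator without division: compare S against delta*Smin */
I = (S > fp_mul(delta, Smin)) ? 1 : 0;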
8. Draft DSP Implementation
In order to implement our algorithm on a real DSP platform (Texas
Instruments TMS320C64x series), we had to introduce some changes to
the pure C implementation.
The major changes were due to the transition from C++ functions to C functions. Due to the different memory model of the DSP, some changes to memory allocation were implemented. Examples of these optimizations:
1. Reading the data frame by frame instead of reading all the data and breaking it into frames.
2. Instead of keeping the entire pre-calculated DFT matrix in memory, each DFT matrix coefficient is calculated "on the fly" (a sketch of this is given below).
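A sketch of the second optimization follows: a single DFT coefficient W_N^(k·n) = exp(-j·2π·k·n/N) is computed when needed instead of being read from a pre-computed N×N matrix. The function name and interface are ours; the project code may compute the coefficients differently.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Compute one DFT twiddle factor W_N^(k*n) on the fly (illustrative sketch),
   saving the memory that a full pre-computed DFT matrix would occupy.        */
static void dft_coeff(int k, int n, int N, double *re, double *im)
{
    double angle = -2.0 * M_PI * (double)(k * n) / (double)N;
    *re = cos(angle);
    *im = sin(angle);
}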
A major problem we encountered was reading from files in the DSP environment. Even after much trial and error we were still unable to read more than one frame from the wav file. As a result, the submitted code is capable of correctly handling only a single frame read from a PCM file.
9. Algorithm Performance and Evaluation
The algorithm performance evaluation methodology was inspired by the TIA/EIA-136-250 standard.
The following flow chart presents the process of cataloguing the audio clip's frames:
[Flow chart: the noisy speech clip is divided into Noisy Speech Frames and Noise Frames. Noisy Speech Frames above −15 dB are legal and are further categorized as Onset, Steady or Offset; the remaining speech frames are Illegal Speech. Noise Frames are categorized as Quiet.]
The VAD marks each frame as speech/non-speech, while the reference clip is divided into two categories: noisy speech and noise.
The tagging of a frame as legal or illegal is done using thresholds; we used the following:
• Noisy Speech Frames are legal if the energy calculated over the frame is larger than −15 dB.
• All Noise Frames are marked as legal.
Taking only the legal frames into account, the statistics are divided into 4 categories of interest:
• Onset – the 7 frames at the beginning of a noisy speech area.
• Offset – the 7 frames at the end of a noisy speech area.
• Steady – the frames between Onset and Offset.
• Quiet – all the legal frames not in the noisy speech area.
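As an illustration of this categorization, the following C sketch labels a single frame given the boundaries of a noisy speech area and the result of the −15 dB legality test; the names and the exact interface are ours, not those of the benchmark code.

typedef enum { CAT_QUIET, CAT_ONSET, CAT_STEADY, CAT_OFFSET, CAT_ILLEGAL } FrameCategory;

/* Categorize frame i, where the noisy speech area spans frames
   [speech_start, speech_end] and 'legal' is the -15 dB energy test
   (noise frames are always legal). Illustrative sketch only.       */
FrameCategory categorize_frame(int i, int speech_start, int speech_end, int legal)
{
    if (i < speech_start || i > speech_end)
        return CAT_QUIET;      /* legal noise frame outside the speech area */
    if (!legal)
        return CAT_ILLEGAL;    /* speech frame below the -15 dB threshold   */
    if (i < speech_start + 7)
        return CAT_ONSET;      /* first 7 frames of the speech area         */
    if (i > speech_end - 7)
        return CAT_OFFSET;     /* last 7 frames of the speech area          */
    return CAT_STEADY;         /* frames between Onset and Offset           */
}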
A more detailed description of the meaning of these categories can be found in the chapter describing the algorithm's state machine diagram.
The categorized frames are then collected to produce a statistical mapping of the clip. The implementation of the collection process is given here in pseudo-code extracted from the relevant C code; examples of results are presented in the Results and Conclusions chapter.
The following code details the decision per frame. Certain variables were defined in order to help with the implementation:
• StartTalk, ContinueTalk – thresholds defined to help with the count of 7 frames at the beginning of a noisy speech area.
• SpeechPresence – marks a frame as a speech frame.
• Cooldown – before deciding to mark a frame as non-speech, we have to go over a certain threshold. This threshold helps us minimize errors in the decision that there is no more speech in the following frames.
• OnsetConfidence/SteadyConfidence – thresholds that help with the decision of entering an onset area, or keeping a steady area.
• Pavg – average probability that helps with the decision of the next frame's mark.
switch (previous_state->next_mode) // state determined while processing the previous frame
{
case QUIET:
if ((Pavg > StartTalk) && (TotalEnergy > EngT))
/******** Quiet -> onset Speech ********/
{
SpeechPresence = 1; // frame marked as speech
Probability = Pavg;
OnsetConfidence += 1; // increase overall confidence
previous_state->next_mode = ONSET;
break;
}
else
/******** Quiet -> Quiet ********/
{
SpeechPresence = 0; // mark current frame as non-speech
OnsetConfidence = 0; // Initializing Confidence counter
Cooldown = 0;
// initialize steady cooldown
previous_state->next_mode = QUIET;
break;
}
case ONSET:
if (Pavg > StartTalk)
/******** Onset -> Onset speech ********/
{
SpeechPresence = 1;
// frame marked as speech
Probability = Pavg;
OnsetConfidence += 1;
// increase overall confidence
if (OnsetConfidence >= Conf)
// case of enough voiced frames to be confident of steady speech
{
previous_state->next_mode = STEADY;
break;
}
else
{
previous_state->next_mode = ONSET;
break;
}
}
else
/******** Onset -> Quiet ********/
{
SpeechPresence = 0; // mark current frame as non-speech
OnsetConfidence = 0; // Initializing Confidence counter
Probability = Pavg;
previous_state->next_mode = QUIET;
break;
}
case STEADY:
if (MaxP > ContinueTalk) // If max probability is above threshold
{
SpeechPresence = 1;
// mark current frame as speech
Probability = MaxP;
SteadyConfidence += 1;// increase Steady confidence
Cooldown = 0;
// initialize steady cooldown
previous_state->next_mode = STEADY;
break;
}
else
/******** Steady Speech -> Quiet ********/
{
if (SteadyConfidence < ConfS)
{ // mark as non-speech if not confident in steady speech
SpeechPresence = 0; // mark current frame as non-speech
SteadyConfidence = 0;
Probability = -MaxP;
Cooldown = 0;
previous_state->next_mode = QUIET;
break;
}
else
{/******** Steady Speech -> CoolDown (still speech) ********/
SpeechPresence = 1; // frame marked as speech until cooldown timeout
Probability = -MaxP;
Cooldown += 1; // increasing cooldown counter for timeout
previous_state->next_mode = COOLDOWN;
break;
}
}
case COOLDOWN:
if (Pavg > StartTalk) // if the average probability is above the StartTalk threshold
{
SpeechPresence = 1; // frame marked as speech
Probability = MaxP;
SteadyConfidence += 1;
Cooldown = 0;
previous_state->next_mode = STEADY;
break;
}
else /******** Cooldown -> Quiet ********/
{
if (Cooldown > (MinCoolD + ((SteadyConfidence < MaxCoolD) ? SteadyConfidence : MaxCoolD)))
{
SpeechPresence = 0; // mark current frame as non-speech
Probability = MaxP;
SteadyConfidence = 0;
Cooldown = 0;
previous_state->next_mode = QUIET;
break;
}
else
{
SpeechPresence = 1; // frame marked as speech
Probability = MaxP;
Cooldown += 1;
previous_state->next_mode = COOLDOWN;
break;
}
}
}
10. Results and Conclusions
We tuned the various parameters of our algorithm in two stages:
1. First, we tried a "brute force" approach by running numerous tests using our test bench and scanning the entire range of VAD parameters for 7 types of noises:
Environment setting and additional description:
a. Conference room with air conditioning (PC videoconference system).
b. Laboratory containing various computers and video equipment (the room is also used for videoconferencing).
c. Office with two PCs and air conditioning; the microphone was approximately one foot from a workstation monitor (typical for desktop videotelephony).
d. Office with two PCs and air conditioning; harmonic pickup from the workstation monitor, with the microphone approximately three inches from the monitor. The signal exhibits a ~76 Hz period and is impulse-like.
e. Pickup of 60 Hz mains hum from an unshielded wire; the signal exhibits a 60 Hz fundamental.
f. Noisy lab environment, with the microphone close to a PC fan to pick up harmonics of the air flow; the microphone was out of the direct path of the fan air flow.
g. Cafeteria setting; contains a mix of male and female speakers conversing in English. Also present, but to a lesser extent, is the noise of a vending machine refrigeration unit.
This approach proved unproductive: the final parameters did not prove to be a local maximum when tested with slight parameter fluctuations. This is due to the fact that the results were derived by averaging the best results over different sound clips, and the averaged parameters are usually non-optimal for any individual clip.
2. Second, we took the parameters obtained in the previous stage and then, based on informal listening tests, tuned some of them manually to better comply with the project requirements (e.g. low tolerance to speech offset errors).
Most of the tuning was done using stationary noises, but following Polycom's request we also made an attempt to handle keyboard noises. A couple of methods were tried in order to correctly handle keyboard keystroke noises (e.g. zero crossing, autocorrelation, etc.), but due to the real-time constraints we did not find a suitable solution.
Final VAD results for two types of sounds (success rate vs. SNR):
These tests were created using two of the Pictel library noises. Each speech/noise mixture was created by combining noise with speech taken from the TIMIT library at a certain SNR level. Using a priori information as to the actual presence of speech (the TIMIT library provides information as to which syllable is uttered at which time), the success rate (y-axis) is shown as a function of the SNR (x-axis):
1. Recorded in a room with air conditioning and a PC (used as a videoconference system).
2. Recorded in a laboratory with various computers and video equipment (room used for videoconferencing).
Appendix A – Algorithm Variables and Calculations
The following parameters can be used to tune the VAD algorithm. Each parameter is listed with the name used in the MATLAB implementation and the name used in the C implementation ("MATLAB identifier"/"C identifier"):
params.Frame_Size/uiFrameSize – sets the analysis frame size (i.e. the number of FFT samples). Increasing the value of this parameter results in more robust speech detection, but coarser time resolution and a higher computational load.
params.Frame_Step/uiFrameStep – defines the overlap between consecutive frames (frame overlap). A smaller frame step (larger overlap) gives a finer time resolution of the speech/non-speech decisions, at the cost of processing more frames per second and hence a higher computational load.
params.L/uiLparam – controls the length of the local minima search window. In order to calculate Sr (i.e. the SNR for each frequency), the noise variance estimate is taken to be the local minimum of the PSD. The local minimum is calculated over each region of L frames:

// S(k,l) is the energy at time bin l for frequency k
if (l mod L != 0)   // l not divisible by L
    Smin(k,l) = min( Smin(k,l-1), S(k,l) )
    Stmp(k,l) = min( Stmp(k,l-1), S(k,l) )
else
    Smin(k,l) = min( Stmp(k,l-1), S(k,l) )
    Stmp(k,l) = S(k,l)
Example of the local minima calculation:
[Figure: Blue – signal; Red – local minimum; Green – markers every L frames (0, L, 2L, 3L, ...).]
As can be seen at the 3L marker, the minimum is taken as the minimum of the current set of L frames, thus the local minimum is never sustained for longer than 2L frames.
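A per-frequency C sketch of this update rule follows (floating point, before the fixed point conversion); the delivered LocalMin function shown in chapter 6 has a richer interface, so this is illustrative only.

/* One update step of the local-minimum search for a single frequency bin,
   following the pseudo-code above. l is the frame index, L the window length. */
void local_min_update(float S, float *Smin, float *Stmp, int l, int L)
{
    if (l % L != 0)               /* inside the current window of L frames */
    {
        if (S < *Smin) *Smin = S;
        if (S < *Stmp) *Stmp = S;
    }
    else                          /* window boundary: restart the search   */
    {
        *Smin = (S < *Stmp) ? S : *Stmp;
        *Stmp = S;
    }
}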
params.w/uiWindowLength – sets the window length for the frequency smoothing.
b = hamming(2w + 1)   // creating a (2w + 1)-sized Hamming window
where the frequency smoothing is:
// Y(k,l) is the FFT of the frame at time bin l at frequency k
Sf(k,l) = sum over i = -w..w of b(i) * |Y(k - i, l)|^2
params.as & params.ap/uiAlphaS & uiAlphaP – time smoothing weights for the PSD estimator smoothing in time (as) and for the speech presence indicator smoothing (ap):
Time smoothing:      S(k,l) = as * S(k,l-1) + (1 - as) * Sf(k,l)
Indicator smoothing: p(k,l) = ap * p(k,l-1) + (1 - ap) * I(k,l)
// I(k,l) is the indicator of whether Sr(k,l) > delta
params.delta/uiDeltaParam – defines the "SNR" threshold for the speech presence decision (for each frequency, in each frame). The indicator is set according to this comparison:
I(k,l) = (Sr(k,l) > delta) ? 1 : 0
[params.speech_start_freq, params.speech_end_freq]/[uiSpeechStartFreq, uiSpeechEndFreq] – set the start and end of the relevant frequency range (i.e. only probabilities in this frequency range are used by the algorithm). Too low and too high frequencies do not contain reliable data regarding speech presence.
The decision algorithm parameters:
params.starttalk/uiStartTalk – threshold for the average probability (PAvg), used in the Quiet, Onset and CoolDown states.
params.continuetalk/uiContinueTalk – threshold for the maximum probability (PMax) (over the relevant frequencies), used in the Steady state.
params.Conf/Conf – defines the minimum number of frames that must fulfill the speech presence criterion before moving from the Onset to the Steady state.
params.ConfS/ConfS – counter threshold (with respect to the number of frames marked as speech) for confidence of steady speech, used only in the Steady state.
params.MaxCoolD/MaxCoolD – maximum threshold for the number of frames allowed for cool-down (frames still marked as speech while speech is not detected), used in the CoolDown state only.
params.MinCoolD/MinCoolD – minimum threshold for the number of frames allowed for cool-down (frames still marked as speech while speech is not detected), used in the CoolDown state only.
See the state machine for clarification:
Pavg – averaged speech probability in the current frame (over the relevant frequencies): Pavg = mean(p)
Pmax – maximum speech probability in the current frame (over the relevant frequencies): Pmax = max(p)
References
[1] Israel Cohen and Baruch Berdugo, "Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement", IEEE Signal Processing Letters, Vol. 9, No. 1, pp. 12-15, January 2002.