Technion – Israel Institute of Technology
Department of Electrical Engineering
Signal & Image Processing Lab

Final Report
Subject: Implementation of a Voice Activity Detector
Students: Dan Waxman (040496382), Ran Margolin (015345598)
Instructor: Yevgeni Litvin
Semester: Winter 2008
Handed: December 2008

Contents
1. Abstract
2. Preface
3. The Chosen Algorithm
4. A Block Diagram of the Algorithm Implemented
4.1. The Algorithm's State Machine Diagram
5. MATLAB Implementation
6. C Code Implementation
7. From Floating Point to Fixed Point
8. Un-optimized Partial DSP Implementation
9. Algorithm Performance and Evaluation
10. Results and Conclusions
11. Appendix A

1. Abstract

The objective of this project was to implement a Voice Activity Detection (VAD) algorithm for the teleconference switch manufactured by Polycom. The speech presence decision is used by the teleconference management logic implemented inside the switch (e.g. dominant speaker emphasis).

In the first stage of the project we studied the subject of voice analysis and found an algorithm suitable for the requested application. In the second stage we implemented the chosen algorithm in MATLAB (this stage included testing the initial version of the algorithm with various inputs). In the third stage we tested different methods of improving the algorithm's performance. In the fourth stage we converted the MATLAB vector calculations to loops (C style), in order to better simulate real use of the algorithm. Finally, after debugging the algorithm in MATLAB, we converted the MATLAB code into C. Since most of the MATLAB code was already written in a C-like style (no special MATLAB functions were used other than the FFT), this stage did not present any substantial difficulty and was quite straightforward.

In the next stage we converted all of the MATLAB and C code from floating point arithmetic to fixed point arithmetic. The transition required some changes in the code (mainly in C), for example altering some expressions in order to avoid division. Finally, we validated that the MATLAB and C implementations produce the same results.

2. Preface

The main objective of a VAD (as suggested by its name) is to decide whether speech is present in a given audio segment. The input of the algorithm is therefore an audio stream, and its output is a binary vector indicating the presence of speech in it. The input will very likely contain background noises, so the system must be robust to them.

The classification process is based on a spectral analysis of the noise in an audio segment. The presence of speech in a given frame of a sub-band can be determined by the ratio between the local energy of the noisy speech and its minimum within a specified time window (pseudo SNR). The ratio is compared to a threshold value, where a smaller ratio indicates absence of speech. Subsequently, temporal smoothing is carried out to reduce fluctuations between speech and non-speech segments, thereby exploiting the strong correlation of speech presence between neighboring frames. The resulting noise estimate is computationally efficient, robust with respect to the input Signal to Noise Ratio (SNR) and the type of additive noise, and able to follow abrupt changes in the noise spectrum quickly.

When attempting to make a binary decision regarding a frame of audio samples, we must first distinguish between the two general types of speech: "voiced" and "unvoiced" speech.
Unvoiced speech consists of consonants, i.e. short non-vocal sounds, which can easily be classified as background noise when not heard in the context of speech flow. Examples of such sounds occur when one pronounces letters such as 'p', 's', 't', etc. These sounds are created without the use of the vocal cords, which makes their characteristics similar to those of background noises. Therefore, such sounds are difficult to distinguish from non-speech noises.

The second general type of speech is voiced speech. This type consists of vowels such as 'a', 'e', 'o', etc. These sounds are created by the excitation of the vocal cords and can easily be set apart by the human ear from ordinary environmental noises. When inspecting the frequency analysis of such sounds, the two types of audio segments are easily distinguished, as shown in figure 1.

Figure 1: Spectrogram of an 'o' sound followed by a 'ch' sound

The noise segment's energy is spread across a wide spectral range with no particular pattern, while the speech segment's energy has distinguishable peaks that can be seen in the spectral analysis. Among our goals, therefore, was to ensure the VAD's ability to recognize unvoiced phonemes as well as voiced phonemes as speech. Our requirement was to deal only with quasi-stationary noises (i.e. noises whose statistics may change, but only slowly); therefore a sudden change of energy in enough frequency bands (explained in chapter 4) proved to be a good indicator of the presence of speech.

3. The Chosen Algorithm

The implemented VAD algorithm is based on the ideas in the article by Prof. Israel Cohen, member IEEE, and Baruch Berdugo [1]. The article introduces a method for noise estimation that averages past spectral power values using a smoothing parameter which is adjusted by the signal presence probability in sub-bands. The presence of speech in a sub-band is determined by the ratio between the local energy of the noisy speech and its minimum within a specified time window (pseudo SNR). Refer to [1] for a description of the recursive process.

The reasons for choosing the aforementioned article as the source for our VAD algorithm are:
- It is intended for a real-time environment.
- It has seen practical use in industry (the Speex codec).
- It was used in a similar project carried out in the SIPL lab in 2004.

4. A Block Diagram of the Algorithm

The following diagram depicts the implemented algorithm, which is based on the article [1]. A more detailed description of the diagram's stages is given later in the current chapter.

[Block diagram: an N-sample frame is received (N-M samples overlap) -> FFT -> smoothing in the time and frequency domains based on past values -> local minima search (based on past values) over at least L and at most 2L frames -> "SNR" estimation by dividing each sub-band by its local minimum -> per-sub-band indicator I = (SNR > threshold) ? 1 : 0, indicating possible speech presence in the sub-band -> speech probability per sub-band, obtained by smoothing I according to past values -> decision on speech/non-speech for the first M samples, according to the process described next.]

1. The first stage of the process is the acquisition of a single input frame (N samples).
2. The second stage performs an FFT of the input frame.
3. The third stage is smoothing in time and frequency. The smoothing is performed in order to improve the power spectrum estimation. After consulting with the SIPL staff, we were advised to inspect the range of frequencies known to include all of the human voice frequencies: [30, 3000] Hz. The results of this stage will be denoted henceforth as S (a floating-point sketch of this stage is given after this list).
4. The fourth stage finds the local minima in the frame. A sample-wise comparison between the local energy and the minimum value of the previous frame yields the minimum value for the current frame; every L frames the search is restarted, so the minimum is tracked over at least L and at most 2L frames (see Appendix A for a description of the parameter L and an example calculation of the local minimum).
5. The fifth stage estimates the pseudo-SNR. We define the pseudo-SNR as the ratio between the estimated signal strength and the noise estimate obtained in the previous stage.
6. The sixth stage estimates speech presence. We do so by comparing the pseudo-SNR obtained in the previous stage to a fixed threshold.
7. The seventh stage reduces the variance of the speech presence estimator by smoothing it. Smoothing the estimated indicator is valid due to the continuity assumption on speech frames: there is a high probability of speech near adjacent speech frames. The results of this stage will be denoted henceforth as p.
8. The eighth stage marks the current frame as either speech or non-speech, according to the current and previous statistics (implemented as a state machine).
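To make stage 3 concrete, the following is a minimal floating-point sketch of the frequency and time smoothing, written directly from the formulas listed in Appendix A. The identifiers (make_window, smooth_frame, energyY, etc.), the window half-width W and the frame size are illustrative assumptions and are not the project's own names.

/* Sketch of stage 3: frequency smoothing with a (2W+1)-point Hamming window,
 * followed by first-order recursive smoothing in time. */
#include <math.h>

#define FRAME_SIZE 128          /* assumed analysis frame size */
#define W          1            /* assumed half-width of the smoothing window */

static void make_window(double b[2 * W + 1])
{
    const double PI = 3.14159265358979;
    int i, n = 2 * W + 1;
    for (i = 0; i < n; i++)                        /* b = hamming(2w+1) */
        b[i] = 0.54 - 0.46 * cos(2.0 * PI * i / (n - 1));
}

/* energyY[k] = |Y(k,l)|^2 of the current frame, prevS[k] = S(k,l-1),
 * S[k] = S(k,l) is the output of this stage. */
static void smooth_frame(const double energyY[FRAME_SIZE],
                         const double prevS[FRAME_SIZE],
                         double alphaS, double S[FRAME_SIZE])
{
    double b[2 * W + 1], Sf;
    int k, i, kk;
    make_window(b);
    for (k = 0; k < FRAME_SIZE; k++) {
        Sf = 0.0;                                  /* Sf(k,l) = sum_i b(i)*|Y(k-i,l)|^2 */
        for (i = -W; i <= W; i++) {
            kk = k - i;
            if (kk >= 0 && kk < FRAME_SIZE)        /* skip bins outside the frame */
                Sf += b[i + W] * energyY[kk];
        }
        S[k] = alphaS * prevS[k] + (1.0 - alphaS) * Sf;   /* S(k,l) = as*S(k,l-1)+(1-as)*Sf(k,l) */
    }
}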
4.1 Deciding on Speech/Non-Speech Transitions

The decision on speech presence/absence transitions is implemented as a state machine. The decision is based on the speech continuity principle: the probability of speech in the current frame depends on the distance between the current frame and the last frame in which speech was detected (i.e. a frame adjacent to speech frames has a higher probability of containing speech than a frame adjacent to non-speech frames). The state machine implements this idea by letting each state represent the probability (referred to as confidence) of the presence of speech in the current frame, based on the distance from, and the number of, previously detected speech frames. The higher the confidence, the more the detection constraints are relaxed.

[State machine diagram: four states, Quiet, Onset, Steady and CoolDown. Quiet -> Onset when P_avg > StartTalk and Total_Energy > EngT (OnsetConfidence++, VadSpeech = 1); otherwise Quiet remains Quiet (VadSpeech = 0, counters reset). Onset -> Onset while P_avg > StartTalk and OnsetConfidence < Conf; Onset -> Steady when P_avg > StartTalk and OnsetConfidence >= Conf; Onset -> Quiet otherwise. Steady remains Steady while P_max > ContinueTalk (SteadyConfidence++, VadSpeech = 1); Steady -> CoolDown when P_max <= ContinueTalk and SteadyConfidence >= ConfS (CoolDown++, still VadSpeech = 1); Steady -> Quiet when P_max <= ContinueTalk and SteadyConfidence < ConfS. CoolDown -> Steady when P_avg > StartTalk; CoolDown -> Quiet when P_avg <= StartTalk and CoolDown > MinCoolD + min(SteadyConfidence, MaxCoolD); otherwise CoolDown remains CoolDown (CoolDown++, VadSpeech = 1).]

Mathematical notations used:

- $p_i$ – the probability of the presence of speech in frequency bin $i$ (calculated in stage 7).
- $P_{avg} = \frac{1}{M}\sum_{i=0}^{M-1} p_i$, where $M$ is the number of frequency bins used.
- $P_{max} = \max\{p_i \mid 0 \le i \le M-1\}$, where $M$ is the number of frequency bins used.
- $S_i$ – the smoothed FFT coefficient (calculated in stage 3).
- $Total\_Energy = \sum_{i=0}^{M-1} S_i^2$, where $M$ is the number of frequency bins used.
- OnsetConfidence – counter: number of consecutive frames that met the speech criterion in the Quiet/Onset states.
- SteadyConfidence – counter: number of consecutive frames spent in the Steady state.
- CoolDown – counter: number of consecutive frames spent in the CoolDown state.

A priori thresholds:
o StartTalk
o ContinueTalk
o EngT
o Conf
o ConfS
o MinCoolD
o MaxCoolD

Notes:
1. Refer to the diagram above for the transition criteria between states.
2. All decisions/calculations discussed below are performed per frame.
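As a small illustration of the notation above, the per-frame statistics P_avg, P_max and Total_Energy can be computed in a single pass over the relevant frequency bins. This is a floating-point sketch with illustrative identifiers; it is not the project code.

/* p[] holds the smoothed speech presence probabilities of the current frame
 * and S[] the smoothed spectrum, both restricted to the M relevant bins. */
static void frame_statistics(const double p[], const double S[], int M,
                             double *Pavg, double *Pmax, double *TotalEnergy)
{
    int i;
    *Pavg = 0.0;
    *Pmax = 0.0;
    *TotalEnergy = 0.0;
    for (i = 0; i < M; i++) {
        *Pavg += p[i];                     /* running sum of probabilities      */
        if (p[i] > *Pmax)
            *Pmax = p[i];                  /* maximum probability in the frame  */
        *TotalEnergy += S[i] * S[i];       /* Total_Energy = sum of S_i^2       */
    }
    *Pavg /= (double)M;                    /* P_avg = (1/M) * sum of p_i        */
}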
States:

Quiet: As stated before, according to the speech continuity principle the probability of speech adjacent to non-speech frames is low, therefore we use a high threshold to discern between speech absence and presence. The threshold used here is applied to the average of the speech probabilities over the relevant frequencies, together with a very low threshold on the energy. If the speech criterion is met, a transition to the Onset state is made and the frame is marked as speech. If the criterion is not met, the frame is marked as non-speech.

Onset: In the Onset state we are still unsure of speech presence. The high thresholds are kept (unchanged from the Quiet state) in order to minimize false alarms. After a couple of frames that meet the speech presence criterion (i.e. high confidence in the presence of speech), the frame is marked as speech, the thresholds are relaxed and a transition to the Steady state is made. If the speech presence criterion is not met, we return to the Quiet state and mark the frame as non-speech.

Steady: In contrast to the Quiet and Onset states, according to the speech continuity principle frames adjacent to speech frames have a high probability of being speech frames. Therefore, in this state we relax the criterion for speech detection: instead of using the average of the speech probabilities (P_avg) we use the maximum probability (P_max). If the speech criterion is not met while we are not yet confident of steady speech, the frame is marked as non-speech and a transition to the Quiet state is made. After a couple of frames that have met the speech criterion (i.e. we are confident of steady/continuous speech), we continue to relax the threshold and add a cool-down mechanism: if the speech criterion is then not met, a transition to the CoolDown state is made and the frame is still marked as speech.

CoolDown: This state prevents fluctuations of the detection during continuous speech (i.e. it prevents a transition to the Quiet state when a speaker pauses for a short period within continuous speech). While in this state we count the number of frames that do not meet the speech presence criterion, while still announcing the frames as containing speech. The allowed count is set according to the length of the preceding run of consecutive speech frames. If the speech criterion is met before a certain maximum tolerance count is reached, we return to the Steady state. If not, a transition to the Quiet state is made and the frame is marked as non-speech.

5. MATLAB Implementation

We first implemented the algorithm in MATLAB, due to the ease of rapid prototyping in that environment. Our first implementation of the VAD algorithm took advantage of the matrix arithmetic natively supported by MATLAB. We then tested different methods of deducing the presence of speech from the information available to us (e.g. various thresholds, which frequency bins are relevant, whether to use P_max or P_avg, separation into logical states (Quiet, Onset, etc.), and so on). In order to be able to state that one approach is better than another, we implemented a benchmark to test the different methods (see the Algorithm Performance and Evaluation chapter for the benchmark implementation). After we were satisfied with our results (over 90% detection accuracy), we converted all the matrix arithmetic to loops in order to make the transition to the C language easier. We then introduced fixed point arithmetic into the entire algorithm (all but the FFT implementation), and finally we had a MATLAB implementation that could easily be ported to C. This is the MATLAB code version supplied as part of the project deliverables.
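To illustrate the vector-to-loop conversion mentioned above, the per-bin speech indicator and its smoothing (stages 5 to 7) can be written as an explicit C-style loop. The sketch below is illustrative only; it uses the division-free comparison that is later adopted for the fixed point port (chapter 7).

/* S[] and Smin[] are the smoothed spectrum and its local minimum for the
 * current frame, prevP[] is p(k,l-1); p[] receives the smoothed indicator. */
static void indicator_and_smooth(const double S[], const double Smin[],
                                 const double prevP[], double delta,
                                 double alphaP, int FrameSize, double p[])
{
    int k, I;
    for (k = 0; k < FrameSize; k++) {
        /* I(k,l) = (S(k,l)/Smin(k,l) > delta) ? 1 : 0, written without division */
        I = (S[k] > delta * Smin[k]) ? 1 : 0;
        /* p(k,l) = ap*p(k,l-1) + (1-ap)*I(k,l) */
        p[k] = alphaP * prevP[k] + (1.0 - alphaP) * (double)I;
    }
}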
The output of the per-frame analysis is a matrix P containing the speech presence probabilities for each time-frequency bin. All the calculations explained in chapter 4 are carried out using matrix arithmetic, resulting in the probability matrix P, where each column represents a frame (p) and each row represents a frequency bin. The decision whether to mark a frame as speech or non-speech is performed by the state machine: on each iteration the frame's probability vector p (one column) is fed in, and according to the current state and the current input, a speech/non-speech decision is written to the output vector.

6. C Code Implementation

In the C implementation we had to deal with the real-time execution requirements. Afterwards we had to revise the code in order to adapt it to the fixed point environment.

The algorithm's state is kept in a previous_state structure. It consists of the previous frame's analysis results as well as the state variables needed for the state machine.

// header of the Vad function; the return value is the sVad struct defined below
sVad Vad(int ProbableSpeechStart, int ProbableSpeechEnd, NoisySpeechFrame,
         vad_config, previous_state, NumOfFramesRead);

// VAD analysis results for the current frame
// (element types shown as FP_SAMPLE / int for illustration)
struct sVad
{
    FP_SAMPLE Probability_Vector[FrameSize];
    int       SpeechPresence_Vector[FrameSize];
    FP_SAMPLE P_Vector[FrameSize];
    FP_SAMPLE S_Vector[FrameSize];
    FP_SAMPLE Stmp_Vector[FrameSize];
    FP_SAMPLE Smin_Vector[FrameSize];
};

In order to incorporate the use of past values (i.e. values from the previous frame), we created a struct whose purpose is to hold the vectors of the previous frame's analysis. By keeping the values listed below, the algorithm's state is retained between calls.

// previous state struct
struct previous_state
{
    FP_SAMPLE PreviousSmin[FrameSize];
    FP_SAMPLE PreviousStmp[FrameSize];
    FP_SAMPLE PreviousS[FrameSize];
    FP_SAMPLE PreviousP[FrameSize];
    int Started;
    int SteadyConfidence;
    int OnsetConfidence;
    int CoolDown;
    Mode next_mode;        // current state of the decision state machine
};

For the sake of code clarity, the configuration parameters of the VAD are kept in a separate structure.

// configuration parameters struct
struct vad_config
{
    // sampling parameters
    int FrameStep;
    FP_SAMPLE Ts;
    int SpeechStartFreq;
    int SpeechEndFreq;
    // parameters of the minima search algorithm
    int DeltaParam;
    int Lparam;
    int WindowLength;
    FP_SAMPLE AlphaS;
    FP_SAMPLE AlphaP;
    // parameters of the decision algorithm
    FP_SAMPLE ContinueTalk;
    FP_SAMPLE StartTalk;
    int Conf;
    int ConfS;
    int Started;
    int OnsetConfidence;
    int SteadyConfidence;
    int CoolDown;
    int MinCoolD;
    int MaxCoolD;
    FP_SAMPLE EngT;
};

Two functions are used to initialize the algorithm's state and configuration structures:

void vad_default_config();
void state_struct_initialize(previous_state);

The main loop of the algorithm invokes the Vad function for every input frame.

//*******************************************************************//
//******************* initializing vad main loop ********************//
//*******************************************************************//
for (i = 0; i < number_of_samples_including_padded_zeros_without_last_frame; i += FrameStep)

Among the functions implemented to realize the algorithm are:

Time/frequency smoothing:
void TimeSmooth(IN Sf, IN PreviousS, IN AlphaS, IN NumOfFramesRead,
                IN FrameSize, OUT S /* S(k) */);
void FreqSmooth(IN SAMPLE* puiY, IN FrameSize, IN WindowLength, OUT Sf);

Local minima search:
void LocalMin(IN S /* S(k,l) */, IN PreviousS /* S(k,l-1) */,
              IN PreviousStmp /* Stmp(k,l-1) */, IN PreviousSmin /* Smin(k,l-1) */,
              IN Lparam /* L */, IN FrameSize, IN NumOfFramesRead /* l */,
              OUT Sr /* Sr(k,l) */, OUT Smin /* Smin(k,l) */, OUT Stmp /* Stmp(k,l) */);

DSP radix-2 FFT (imported):
void DSP_radix2(SAMPLE x, SAMPLE nx, double w);

Energy computation of the DFT result:
void Energy_Compute(Energy_Y, LengthY, Y);
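For reference, the minimum-tracking recurrence implemented inside LocalMin (the formulas are listed in Appendix A) can be sketched as follows. Types are simplified to double and the Sr output is omitted; the actual routine uses the project's sample types and the parameter order shown above.

/* S[] is the smoothed spectrum of frame l; prevSmin[]/prevStmp[] carry the
 * minima state of frame l-1; Smin[]/Stmp[] receive the state of frame l. */
static void local_min_sketch(const double S[], const double prevSmin[],
                             const double prevStmp[], int Lparam, int FrameSize,
                             int l /* frame index */, double Smin[], double Stmp[])
{
    int k;
    for (k = 0; k < FrameSize; k++) {
        if (l % Lparam != 0) {
            /* inside an L-frame window: keep tracking both running minima */
            Smin[k] = (prevSmin[k] < S[k]) ? prevSmin[k] : S[k];
            Stmp[k] = (prevStmp[k] < S[k]) ? prevStmp[k] : S[k];
        } else {
            /* window boundary: restart the temporary minimum, so the reported
             * minimum is never older than 2L frames */
            Smin[k] = (prevStmp[k] < S[k]) ? prevStmp[k] : S[k];
            Stmp[k] = S[k];
        }
    }
}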
The C implementation gets its input data from a file. First, samples are acquired from a Windows .wav file (PCM format). The last frame read from the file is zero padded if the file does not contain a whole multiple of 128 samples. The output of the algorithm is a binary decision for every frame. The results are accumulated and written to the output file at the end of the program's execution.

The C code was first written for Microsoft Visual Studio 2005. Later we were asked to port the implementation to Microsoft Visual Studio 6 (e.g. in Visual Studio 6, the scope of variables declared inside a "for" statement is different than in standard ANSI C++).

In the following chapter we review the changes that were made to the code in order to adapt it to the fixed point environment.

7. Adapting the Algorithm to Fixed Point Arithmetic

The project requirements included a fixed point implementation of the VAD algorithm. This section surveys the changes that were made to the original algorithm in order to fulfill this requirement.

The Q1.30 fixed point representation was chosen for the majority of the variables; it allows a dynamic range of $[-2,\ 2-2^{-30}] \approx [-2,\ 2)$. The translation from floating point to Q1.30 is done by multiplying the floating point value by $2^{30}$, followed by rounding to the nearest integer. Using this method, the floating point range $[-1, 1]$ is mapped to the integer range $[-2^{30}, 2^{30}]$.

The issues that have to be addressed when using the Q1.30 representation are:
- possible overflow when adding or subtracting two numbers;
- the result of a multiplication is a 64 bit integer.

To resolve these issues, multiplication is implemented by keeping only the most significant 31 bits of the product (plus the sign bit), and overflow is handled by hand-coded integer saturation.

In addition, in order to optimize the code to better suit the DSP, divisions were replaced by multiplications (i.e. instead of dividing the left-hand side we multiply the right-hand side, and thus avoid the division). An example of this can be seen where the calculated "SNR" is compared to a predefined threshold.

Originally: $I = \left(\frac{S}{S_{min}} > \delta\right)\ ?\ 1 : 0$

Replaced by: $I = \left(S > \delta \cdot S_{min}\right)\ ?\ 1 : 0$
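A minimal sketch of the Q1.30 helpers described above is given below: conversion from floating point, saturating addition, and multiplication through a 64-bit intermediate. The names and the exact shift convention are illustrative assumptions; the project code uses its own FP_SAMPLE type and hand-coded saturation.

#include <stdint.h>
#include <math.h>

typedef int32_t q30_t;                 /* Q1.30: sign bit, 1 integer bit, 30 fraction bits */

#define Q30_ONE (1 << 30)              /* representation of 1.0 */

static q30_t q30_from_double(double x) /* assumes x lies inside the representable range */
{
    return (q30_t)floor(x * (double)Q30_ONE + 0.5);   /* scale by 2^30 and round */
}

static q30_t q30_add_sat(q30_t a, q30_t b)            /* addition with saturation */
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q30_t)s;
}

static q30_t q30_mul(q30_t a, q30_t b)                /* 64-bit product, keep the upper bits */
{
    int64_t p = (int64_t)a * (int64_t)b;              /* Q2.60 intermediate */
    return (q30_t)(p >> 30);                          /* back to Q1.30 */
}

With helpers of this kind, the pseudo-SNR test keeps the multiplicative form shown above (S compared against delta times Smin), so no fixed point division is needed.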
8. Draft DSP Implementation

In order to run our algorithm on a real DSP platform (the Texas Instruments TMS320C64x series), we had to introduce some changes to the pure C implementation. The major changes were due to the transition from C++ functions to C functions. Due to the different memory model of the DSP, some changes to the memory allocation were implemented as well. Examples of these changes:
1. Reading the data frame by frame instead of reading all of the data and then breaking it into frames.
2. Instead of keeping the entire pre-calculated DFT matrix in memory, each DFT matrix coefficient is calculated "on the fly".

A major problem we encountered was reading from files in the DSP environment. Even after much trial and error we were unable to read more than one frame from the .wav file. As a result, the submitted code is capable of correctly handling only a single frame read from the PCM file.

9. Algorithm Performance and Evaluation

The algorithm performance evaluation methodology was inspired by the TIA/EIA-136-250 standard. The following flow chart presents the cataloguing process of an audio clip's frames:

[Flow chart: a noisy speech clip is divided into noisy speech frames and noise frames; noisy speech frames whose level is not above -15 dB are marked illegal; the remaining (legal) frames are categorized as Quiet, Onset, Steady or Offset.]

The VAD marks each frame as speech or non-speech, thereby dividing the clip into two categories: noisy speech and noise. The tagging of a frame as legal or illegal is done using thresholds; we used the following: a noisy speech frame is marked as legal if the level calculated over the frame is larger than -15 dB, and all noise frames are marked as legal.

Taking only the legal frames into account, the statistics are divided into four categories of interest:
- Onset – the 7 frames at the beginning of a noisy speech area.
- Offset – the 7 frames at the end of a noisy speech area.
- Steady – the frames between Onset and Offset.
- Quiet – all the legal frames outside the noisy speech areas.

A more detailed description of the meaning of these categories can be found in the chapter describing the algorithm's state machine diagram; a sketch of the categorization is given below.
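As a rough illustration only (this is not the project's benchmark code), the assignment of legal frames to the four categories could look as follows, assuming a per-frame ground-truth mask ref[] that marks the noisy speech areas and assuming that illegal frames have already been removed.

enum frame_cat { CAT_QUIET, CAT_ONSET, CAT_STEADY, CAT_OFFSET };

#define EDGE_FRAMES 7                     /* onset/offset length, per the definition above */

/* ref[i] = 1 if frame i lies inside a noisy speech area, 0 otherwise */
static void categorize(const int ref[], int n, enum frame_cat cat[])
{
    int i, j, start, end;
    for (i = 0; i < n; i++)
        cat[i] = CAT_QUIET;               /* default: legal frame outside speech areas */
    for (i = 0; i < n; i++) {
        if (!ref[i])
            continue;
        start = i;                        /* find one contiguous noisy speech area */
        while (i < n && ref[i])
            i++;
        end = i;                          /* the area occupies frames [start, end) */
        for (j = start; j < end; j++) {
            if (j < start + EDGE_FRAMES)
                cat[j] = CAT_ONSET;       /* first 7 frames of the area */
            else if (j >= end - EDGE_FRAMES)
                cat[j] = CAT_OFFSET;      /* last 7 frames of the area  */
            else
                cat[j] = CAT_STEADY;      /* everything in between      */
        }
    }
}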
The categorized frames are then collected into a statistical mapping of the clip. The implementation of the collection process is given here in pseudo-code extracted from the relevant C code. Examples of results are presented in the Results and Conclusions chapter.

The following code details the decision per frame. Certain variables were defined in order to help with the implementation:
- StartTalk, ContinueTalk – the probability thresholds used for the speech presence decision (see Appendix A).
- SpeechPresence – marks a frame as a speech frame.
- Cooldown – before deciding to mark a frame as non-speech, a certain count has to be exceeded; this threshold helps minimize errors in the decision that there is no more speech in the following frames.
- OnsetConfidence/SteadyConfidence – counters compared against thresholds that control the decision to enter an onset area or to keep a steady area.
- Pavg – the average probability, which helps with the decision on the next frame's mark.

switch (current frame)
{
case QUIET:
    if ((Pavg > StartTalk) && (TotalEnergy > EngT))
    {   /******** Quiet -> Onset speech ********/
        SpeechPresence = 1;                  // frame marked as speech
        Probability = Pavg;
        OnsetConfidence += 1;                // increase overall confidence
        previous_state->next_mode = ONSET;
        break;
    }
    else
    {   /******** Quiet -> Quiet ********/
        SpeechPresence = 0;                  // mark current frame as non-speech
        OnsetConfidence = 0;                 // initialize confidence counter
        Cooldown = 0;                        // initialize steady cooldown
        previous_state->next_mode = QUIET;
        break;
    }

case ONSET:
    if (Pavg > StartTalk)
    {   /******** Onset -> Onset speech ********/
        SpeechPresence = 1;                  // frame marked as speech
        Probability = Pavg;
        OnsetConfidence += 1;                // increase overall confidence
        if (OnsetConfidence >= Conf)         // enough voiced frames to be confident of steady speech
        {
            previous_state->next_mode = STEADY;
            break;
        }
        else
        {
            previous_state->next_mode = ONSET;
            break;
        }
    }
    else
    {   /******** Onset -> Quiet ********/
        SpeechPresence = 0;                  // mark current frame as non-speech
        OnsetConfidence = 0;                 // initialize confidence counter
        Probability = Pavg;
        previous_state->next_mode = QUIET;
        break;
    }

case STEADY:
    if (MaxP > ContinueTalk)                 // if the maximum probability is above the threshold
    {   /******** Steady -> Steady ********/
        SpeechPresence = 1;                  // mark current frame as speech
        Probability = MaxP;
        SteadyConfidence += 1;               // increase steady confidence
        Cooldown = 0;                        // initialize steady cooldown
        previous_state->next_mode = STEADY;
        break;
    }
    else
    {
        if (SteadyConfidence < ConfS)
        {   /******** Steady speech -> Quiet ********/
            SpeechPresence = 0;              // mark as non-speech: not confident in steady speech
            SteadyConfidence = 0;
            Probability = -MaxP;
            Cooldown = 0;
            previous_state->next_mode = QUIET;
            break;
        }
        else
        {   /******** Steady speech -> CoolDown (still speech) ********/
            SpeechPresence = 1;              // frame marked as speech until cooldown timeout
            Probability = -MaxP;
            Cooldown += 1;                   // increase cooldown counter towards timeout
            previous_state->next_mode = COOLDOWN;
            break;
        }
    }

case COOLDOWN:
    if (Pavg > StartTalk)                    // if the average probability is above the threshold
    {   /******** CoolDown -> Steady ********/
        SpeechPresence = 1;                  // frame marked as speech
        Probability = MaxP;
        SteadyConfidence += 1;
        Cooldown = 0;
        previous_state->next_mode = STEADY;
        break;
    }
    else
    {
        if (Cooldown > (MinCoolD + ((SteadyConfidence < MaxCoolD) ? SteadyConfidence : MaxCoolD)))
        {   /******** CoolDown -> Quiet ********/
            SpeechPresence = 0;              // mark current frame as non-speech
            Probability = MaxP;
            SteadyConfidence = 0;
            Cooldown = 0;
            previous_state->next_mode = QUIET;
            break;
        }
        else
        {   /******** CoolDown -> CoolDown (still speech) ********/
            SpeechPresence = 1;              // frame marked as speech
            Probability = MaxP;
            Cooldown += 1;
            previous_state->next_mode = COOLDOWN;
            break;
        }
    }
}

10. Results and Conclusions

We tuned the various parameters of our algorithm in two stages.

1. First, we tried a "brute force" approach: we ran numerous tests using our test bench, scanning the entire range of VAD parameters over 7 types of noises (noise database name. Did we get it from Polycom?):

a. Conference room with air conditioning; PC videoconference system.
b. Laboratory containing various computers and video equipment; the room is also used for videoconferencing.
c. Office with two PCs and air conditioning; the microphone was approximately one foot from a workstation monitor (typical for desktop videotelephony).
d. Office with two PCs and air conditioning; harmonic pickup from the workstation monitor, with the microphone approximately three inches from the monitor. The signal exhibits a ~76 Hz period and is impulse-like.
e. Pickup of 60 Hz mains hum from an unshielded wire; the signal exhibits a 60 Hz fundamental.
f. Noisy lab environment, with the microphone close to a PC fan so as to pick up harmonics of the air flow; the microphone was out of the direct path of the fan air flow.
g. Cafeteria setting; contains a mix of male and female speakers conversing in English. Also present, but to a lesser extent, is the noise of a vending machine refrigeration unit.

This approach proved unproductive: the final parameters did not turn out to be a local maximum when tested with slight parameter fluctuations. This is because the results were derived by averaging the best results over different sound clips, and the averaged parameters are usually non-optimal for any single clip.

2. Second, we took the parameters obtained in the previous stage and, based on informal listening tests, manually tuned some of them to better comply with the project requirements (e.g. low tolerance to speech offset errors).

Most of the tuning was done using stationary noises, but following Polycom's request we also made an attempt to handle keyboard noises. A couple of methods were tried in order to handle keyboard keystroke noises correctly (e.g. zero crossing, autocorrelation, etc.), but due to the real-time constraints we did not find a suitable solution.

VAD final results for two types of noise (success rate vs. SNR): these tests were created using two of the Pictel library noises. Each speech/noise mixture was created by combining noise with speech taken from the TIMIT library at a certain SNR level. Using a priori information about the actual presence of speech (the TIMIT library provides information about which syllable is uttered at which time), the success rate (y-axis) is shown as a function of the SNR (x-axis) for:

1. A recording made in a room with air conditioning and a PC (used as a videoconference system).
2. A recording made in a laboratory with various computers and video equipment (room used for videoconferencing).
Appendix A – Algorithm Variables and Calculations

The following parameters can be used to tune the VAD algorithm. For each parameter, both the MATLAB identifier and the C identifier are given ("MATLAB identifier"/"C identifier"):

params.Frame_Size/uiFrameSize – Sets the analysis frame size (i.e. the number of FFT samples). Increasing the value of this parameter results in more robust speech detection, but coarser time resolution and a higher computational load.

params.Frame_Step/uiFrameStep – Defines the step between consecutive frames, and therefore the overlap between them. A smaller step (larger overlap) gives finer time resolution at the cost of processing more frames per second.

params.L/uiLparam – Controls the length of the local minima search window. In order to calculate Sr (i.e. the "SNR" for each frequency), the noise estimate is taken to be the local minimum of the PSD. The local minimum is calculated over regions of L frames:

// S(k,l) is the smoothed energy at time bin l and frequency bin k
if (l mod L != 0)                          // l not divisible by L
    Smin(k,l) = min( Smin(k,l-1), S(k,l) )
    Stmp(k,l) = min( Stmp(k,l-1), S(k,l) )
else
    Smin(k,l) = min( Stmp(k,l-1), S(k,l) )
    Stmp(k,l) = S(k,l)

Example of the local minimum calculation:

[Figure: blue – the signal; red – the local minimum; green markers – every L frames (0, L, 2L, 3L, ...).]

As can be seen at the 3L marker, the minimum is taken as the minimum over the current set of L frames, thus the local minimum is never sustained for longer than 2L frames.

params.w/uiWindowLength – Sets the window length for the frequency smoothing:

b = hamming(2w+1)    // a (2w+1)-sized Hamming window

where the frequency smoothing is

// Y(k,l) is the FFT of the frame at time bin l and frequency bin k
$S_f(k,l) = \sum_{i=-w}^{w} b(i)\,|Y(k-i,l)|^2$

params.as & params.ap/uiAlphaS & uiAlphaP – Time smoothing weights for the PSD estimate (as) and for the speech presence indicator (ap):

Time smoothing: $S(k,l) = a_s\,S(k,l-1) + (1-a_s)\,S_f(k,l)$

Indicator smoothing: $p(k,l) = a_p\,p(k,l-1) + (1-a_p)\,I(k,l)$, where I(k,l) is the indicator defined under params.delta below.

params.delta/uiDeltaParam – Defines the "SNR" threshold $\delta$ for the speech presence decision (for each frequency, in each frame). The indicator is set according to this comparison:

$I(k,l) = \left(S_r(k,l) > \delta\right)\ ?\ 1 : 0$

[params.speech_start_freq, params.speech_end_freq]/[uiSpeechStartFreq, uiSpeechEndFreq] – Define the frequency range used by the algorithm (i.e. only probabilities in this range are used). Frequencies that are too low or too high do not contain reliable data regarding speech presence.

The decision algorithm:

params.starttalk/uiStartTalk – Threshold for the average probability (P_avg), used in the Quiet, Onset and CoolDown states.

params.continuetalk/uiContinueTalk – Threshold for the maximum probability (P_max) over the relevant frequencies, used in the Steady state.

params.Conf/Conf – Minimum number of frames that must fulfill the speech presence criterion before moving from the Onset to the Steady state.

params.ConfS/ConfS – Counter threshold (with respect to the number of frames marked as speech) for confidence in steady speech, used only in the Steady state.

params.MaxCoolD/MaxCoolD – Maximum number of frames allowed for cool-down (frames still marked as speech while speech is not detected), used in the CoolDown state only.
params.MinCoolD/MinCoolD – Minimum number of frames allowed for cool-down (frames still marked as speech while speech is not detected), used in the CoolDown state only.

See the state machine for clarification:

$P_{avg}$ – averaged speech probability in the current frame (over the relevant frequencies): $P_{avg} = \mathrm{mean}(p)$.

$P_{max}$ – maximum speech probability in the current frame (over the relevant frequencies): $P_{max} = \max(p)$.

Reference

[1] Israel Cohen and Baruch Berdugo, "Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement", IEEE Signal Processing Letters, Vol. 9, No. 1, pp. 12-15, January 2002.