Speech Processing (Module CS5241)
Lecture 2 – Speech Signal Processing

SIM Khe Chai
School of Computing, National University of Singapore, January 2010

[Figure: a speech waveform and its spectrogram (frequency against time)]

Speech Waveforms

[Figure: a sampled speech waveform]

Speech sounds are produced by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener's ears or to a microphone through the air, where speech and other sounds take the form of radiating waves of variation in air pressure. These waves are known as longitudinal waves, in which the vibrations are along (parallel to) the direction of travel, as opposed to transverse waves, in which the vibrations are perpendicular to the direction of travel.

Digital Speech Waveforms

When we speak into a microphone, these changes in pressure are converted to proportional variations in electrical voltage. Computers equipped with the proper hardware can convert the analog voltage variations into digital sound waveforms by a process called analog-to-digital conversion (ADC), which involves:

• Sampling – taking pressure value readings at equal time intervals from the continuously varying speech signal. For example, a speech waveform sampled 16000 times per second has a sampling frequency of 16 kHz (kilohertz). A higher sampling rate yields better sound quality.
• Quantization – representing the sampled waveform amplitudes as discrete values (rounded to the nearest value expressible in a given number of bits). For example, 8 bits and 16 bits can represent a total of 256 and 65536 possible quantization levels respectively.

Simple Waveform Processing

Simple waveform processing techniques:

• Block/frame processing
• Short-time energy
• Zero-crossing rate
• A simple end-pointer
• Autocorrelation function
• Correlation with a sinusoid

Block processing

• A speech waveform consists of a long sequence of sampled values.
• It is useful to break the long sequence into blocks/frames which are quasi-stationary.
• The frame size is a compromise between:
  – having sufficient sample points for accurate measurement/analysis
  – ensuring that the quasi-stationary assumption is valid
• Frame shift: the number of samples (or seconds) between the starts of successive frames.
  – Use overlapped frames to better represent the signal dynamics.

[Figure: a waveform divided into overlapping analysis frames]

Short-time energy

Short-time energy is defined as the sum of squares of the samples in a frame:

    E = \sum_{i=0}^{N-1} s_i^2    (1)

    float Energy(float s[]) {
        // sum of squared sample values in one frame
        float sum = 0;
        for (int i = 0; i < s.length; i++)
            sum += s[i] * s[i];
        return sum;
    }

[Figure: a speech waveform (top) and its frame-by-frame short-time energy (bottom)]
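The block-processing idea and the short-time energy measure can be combined as follows. This is a minimal sketch, not part of the original slides: it splits a waveform into overlapping frames and computes the energy of each frame using the Energy routine above; the function name FrameEnergies and the parameters frameSize and frameShift are illustrative assumptions.

    // Sketch: frame-by-frame short-time energy via block processing.
    // frameSize and frameShift are in samples, e.g. 400 and 160 for 25 ms / 10 ms frames at 16 kHz.
    float[] FrameEnergies(float s[], int frameSize, int frameShift) {
        if (s.length < frameSize) return new float[0];       // too short for even one frame
        int numFrames = (s.length - frameSize) / frameShift + 1;
        float[] energies = new float[numFrames];
        float[] frame = new float[frameSize];
        for (int f = 0; f < numFrames; f++) {
            // copy the f-th frame, starting at f * frameShift, then measure its energy
            System.arraycopy(s, f * frameShift, frame, 0, frameSize);
            energies[f] = Energy(frame);
        }
        return energies;
    }

The same frame loop can be reused for the zero-crossing rate and the other per-frame measures that follow.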
Zero-crossing Rate

Zero-crossing rate (ZCR) is defined as the number of times the zero axis is crossed per frame. ZCR is large for unvoiced speech.

    int ZCR(float s[]) {
        // count the sign changes (zero crossings) within one frame
        int count = 0;
        for (int i = 1; i < s.length; i++)
            if (s[i-1] * s[i] <= 0) count++;
        return count;
    }

[Figure: a speech waveform (top) and its frame-by-frame zero-crossing rate (bottom)]

End-point detector

Accurate end-point detection is difficult. A simple end-point detector for isolated words can be constructed from energy and ZCR values:

1. Measure the background energy and ZCR (E_b, Z_b).
2. Start from the centre of the waveform.
3. Moving outwards, mark the points N_s, N_e for which E < E_b + \alpha.
4. Continuing outwards from N_s, N_e: if Z > Z_b + \beta for 3 frames, update N_s, N_e and repeat step 3.

Auto-correlation Function

Auto-correlation emphasises periodicity. Definition:

    r_k = \sum_{i=0}^{N-k-1} s_i s_{i+k}    (2)

    float AutoCorr(float s[], int k) {
        // correlation of the frame with itself shifted by k samples
        float sum = 0;
        for (int i = 0; i < s.length - k; i++)
            sum += s[i] * s[i+k];
        return sum;
    }

[Figure: one frame of voiced speech (top) and its auto-correlation function (bottom)]

Pitch (Fundamental Frequency) detector

Simple pitch detection based on the auto-correlation function:

1. Compute r_0 to r_{max}.
2. r_0 is the energy.
3. Find the peak r_p in the range r_0 to r_{max}.
4. If r_p > 0.3 r_0, then the speech is voiced with period pT.
5. Otherwise the speech is unvoiced.

Spectral Analysis

Any periodic signal can be expressed as the summation of a fundamental frequency \omega and its harmonics:

    s_n = s(nT) = \sum_{p=0}^{N-1} A_p \cos(\omega_p nT + \phi_p)    (3)

where A_p and \phi_p are known as the amplitude and phase of the p-th harmonic. These quantities can be found by correlating s(nT) with \cos(\omega_p nT) and \sin(\omega_p nT):

    A_p = \sqrt{c(\omega_p)^2 + s(\omega_p)^2}  and  \phi_p = \tan^{-1}\left[\frac{s(\omega_p)}{c(\omega_p)}\right]    (4)

    c(\omega_p) = \sum_{n=0}^{N-1} s(nT)\cos(\omega_p nT)  and  s(\omega_p) = \sum_{n=0}^{N-1} s(nT)\sin(\omega_p nT)    (5)

Discrete Fourier Transform (DFT)

The process of obtaining the amplitudes (A_p) and phases (\phi_p) is known as the Discrete Fourier Transform (DFT). DFT implementation:

    void DFT(float s[], float amp[], float phase[]) {
        int N = s.length;
        for (int p = 0; p < N; p++) {
            // correlate the signal with cos and sin at the p-th harmonic frequency
            float csum = 0, ssum = 0;
            for (int n = 0; n < N; n++) {
                double arg = 2 * Math.PI * n * p / N;
                csum += s[n] * Math.cos(arg);
                ssum += s[n] * Math.sin(arg);
            }
            amp[p]   = (float) Math.sqrt(csum * csum + ssum * ssum);   // A_p, eq. (4)
            phase[p] = (float) Math.atan(ssum / csum);                 // phi_p, eq. (4)
        }
    }

Complex Formulation of DFT

A complex number has a real part and an imaginary part:

    z = Ae^{j\theta} = A\cos(\theta) + jA\sin(\theta)    (6)

where j is a complex multiplier with j^2 = -1. Therefore, the DFT can be expressed in the following complex form:

    S_p = \sum_{n=0}^{N-1} s(nT) e^{-j\frac{2\pi np}{N}},    p = 0, 1, \ldots, N-1    (7)

S_p is a complex value given by the amplitude and phase of the p-th harmonic component. The DFT is a linear transform of the signal. The inverse DFT is given by

    s(nT) = \frac{1}{N} \sum_{p=0}^{N-1} S_p e^{j\frac{2\pi np}{N}},    n = 0, 1, \ldots, N-1    (8)
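Equations (7) and (8) can be computed without a complex-number library by keeping separate real and imaginary arrays. The following is a minimal sketch, not part of the original slides; the routine names ComplexDFT and InverseDFT and the re/im array convention are illustrative assumptions.

    // Sketch of the complex DFT, eq. (7): S_p = sum_n s[n] e^{-j 2 pi n p / N}.
    // re[p] and im[p] receive the real and imaginary parts of S_p.
    void ComplexDFT(float s[], float re[], float im[]) {
        int N = s.length;
        for (int p = 0; p < N; p++) {
            double sumRe = 0, sumIm = 0;
            for (int n = 0; n < N; n++) {
                double arg = 2 * Math.PI * n * p / N;
                sumRe += s[n] * Math.cos(arg);    // real part: +cos term
                sumIm -= s[n] * Math.sin(arg);    // imaginary part: -sin term
            }
            re[p] = (float) sumRe;
            im[p] = (float) sumIm;
        }
    }

    // Sketch of the inverse DFT, eq. (8): s[n] = (1/N) sum_p S_p e^{+j 2 pi n p / N}.
    // Only the real part of the result is kept, since the original signal is real-valued.
    void InverseDFT(float re[], float im[], float s[]) {
        int N = re.length;
        for (int n = 0; n < N; n++) {
            double sum = 0;
            for (int p = 0; p < N; p++) {
                double arg = 2 * Math.PI * n * p / N;
                sum += re[p] * Math.cos(arg) - im[p] * Math.sin(arg);
            }
            s[n] = (float) (sum / N);
        }
    }

Note that re[p] equals csum and im[p] equals -ssum in the DFT routine above, so the magnitude sqrt(re[p]^2 + im[p]^2) matches amp[p].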
DFT of Discrete Signals

Given a sequence of N discrete samples, performing the DFT results in N complex values S_p. If the sampling frequency is f_s = 1/T, then:

• The frequency resolution is f_s/N Hz.
• The magnitude of S_p is symmetric about f_s/2 Hz.
• The phase of S_p is anti-symmetric about f_s/2 Hz.
• The output can therefore be effectively represented as N real values!

Time vs Frequency Resolution

Assuming that the signal characteristics remain the same over progressively longer intervals, as the length of the analysis window (N) increases:

• there is a reduced ability to respond to sudden changes in the signal: poor time resolution;
• the spacing between the spectral components (\frac{1}{NT}) decreases, so the signal frequencies can be determined more accurately: improved frequency resolution.

Zero-padding: adding trailing zeros to the signal to increase N:

• yields more frequency points – but does not really increase the frequency resolution, since no information is added!

Implicit Periodicity of the DFT

• The DFT evaluates the spectrum at N evenly spaced discrete frequencies.
• Only periodic signals have discrete spectra.
• Therefore, the DFT assumes periodicity outside the analysis frame, with period equal to N.
• This assumption may give rise to boundary discontinuities (edge effects).
• These may distort the high-frequency components.

[Figure: the periodic extension assumed by the DFT for two analysis frames and the resulting DFT magnitudes, showing spectral smearing when the frame boundary introduces a discontinuity]

Windowing

[Figure: Hamming and Hanning window shapes]

In signal processing, a window function (also known as a tapering function) is a function that is zero-valued outside of some chosen interval. Windowing is applied to the signal to reduce the edge effects. Commonly used window functions include:

• Hamming window: w(n) = 0.53836 - 0.46164 \cos\left(\frac{2\pi n}{N-1}\right)
• Hanning window: w(n) = 0.5\left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right)

[Figure: implicit periodicity of the DFT with windowing – the same frames after windowing, showing reduced boundary discontinuities and reduced spectral smearing]

Fast Fourier Transform (FFT)

Direct computation of the DFT has a complexity of order N^2. The Fast Fourier Transform (FFT) is a fast algorithm for computing the DFT based on a divide-and-conquer approach. A radix-2 FFT algorithm requires N = 2^k where k is an integer; zero padding can be applied to meet this requirement.

    S_p = \sum_{n=0}^{N-1} s(nT) e^{-j\frac{2\pi np}{N}}
        = \sum_{n=0}^{N/2-1} \left[ s(2nT) e^{-j\frac{2\pi np}{N/2}} + s((2n+1)T) e^{-j\left(\frac{2\pi np}{N/2} + \frac{2\pi p}{N}\right)} \right]
        = S_p^{even} + e^{-j\frac{2\pi p}{N}} S_p^{odd}    (9)

This decomposition can be applied recursively since N = 2^k, which reduces the complexity to order N \log N.
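The decomposition in equation (9) maps directly onto a recursive implementation. The following is a minimal sketch, not the lecture's implementation, assuming the frame length is a power of two (zero-padded if necessary) and using separate arrays for the real and imaginary parts of the signal and spectrum.

    // Sketch of a recursive radix-2 FFT. On input, re[]/im[] hold the (complex) signal;
    // on output they hold the spectrum S_p. Assumes re.length is a power of two.
    void FFT(float re[], float im[]) {
        int N = re.length;
        if (N == 1) return;                              // a single sample is its own DFT

        // split into even- and odd-indexed subsequences
        float[] evenRe = new float[N/2], evenIm = new float[N/2];
        float[] oddRe  = new float[N/2], oddIm  = new float[N/2];
        for (int n = 0; n < N/2; n++) {
            evenRe[n] = re[2*n];    evenIm[n] = im[2*n];
            oddRe[n]  = re[2*n+1];  oddIm[n]  = im[2*n+1];
        }
        FFT(evenRe, evenIm);                             // S_p^even
        FFT(oddRe, oddIm);                               // S_p^odd

        // combine: S_p = S_p^even + e^{-j 2 pi p / N} S_p^odd, as in eq. (9)
        for (int p = 0; p < N/2; p++) {
            double wRe =  Math.cos(2 * Math.PI * p / N); // twiddle factor, real part
            double wIm = -Math.sin(2 * Math.PI * p / N); // twiddle factor, imaginary part
            double tRe = wRe * oddRe[p] - wIm * oddIm[p];
            double tIm = wRe * oddIm[p] + wIm * oddRe[p];
            re[p]       = (float)(evenRe[p] + tRe);  im[p]       = (float)(evenIm[p] + tIm);
            re[p + N/2] = (float)(evenRe[p] - tRe);  im[p + N/2] = (float)(evenIm[p] - tIm);
        }
    }

For a real speech frame, im[] is initialised to zeros before the call; the amplitude and phase of each harmonic then follow from re[p] and im[p] exactly as in the DFT routine above.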
Pre-emphasis

Pre-emphasis increases, within a band of frequencies, the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies, in order to improve the overall signal-to-noise ratio. Pre-emphasis is given by x'_n = x_n - \alpha x_{n-1}, where \alpha is the pre-emphasis factor (typically 0.97). The boundary condition x_{-1} = x_0 is assumed.

    void PreEmphasis(float s[], float alpha) {
        // in-place pre-emphasis, processed from the end so that s[i-1] is still the original sample
        for (int i = s.length - 1; i > 0; i--)
            s[i] -= alpha * s[i-1];
        s[0] *= (1 - alpha);        // boundary condition x_{-1} = x_0
    }

Spectrum Analysis – With & Without Pre-emphasis

[Figure: a speech frame and its spectrum, without pre-emphasis (left) and with pre-emphasis (right)]

Spectrogram – Short Time Fourier Transform

[Figure: a speech waveform (top) with spectrograms computed using a 256-point window (middle) and a 64-point window (bottom)]

A spectrogram is a two-dimensional visual representation of the Short Time Fourier Transform (STFT) of a time signal, and is an example of block processing. There is a trade-off between time and frequency resolution, which can be adjusted by varying the analysis window length and the amount of overlap between successive windows. In the figure, the middle panel uses a 256-point window (better frequency resolution) and the bottom panel a 64-point window (better time resolution), both with 50% overlap. A sketch that combines framing, windowing, pre-emphasis and the DFT into an STFT is given after the recap.

Recap

• Digital speech waveforms:
  – Discretisation: finite time points
  – Quantisation: finite values
• Simple waveform processing algorithms
• Spectral analysis:
  – Discrete Fourier Transform (DFT)
  – Windowing
  – Pre-emphasis
  – Fast Fourier Transform (FFT)
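As referenced on the spectrogram slide, the following is a minimal sketch, not part of the original slides, of the STFT that underlies a spectrogram: the signal is pre-emphasised, cut into overlapping frames, each frame is multiplied by a Hamming window, and the DFT amplitudes of each frame form one row of the output. The routine name STFT is illustrative; PreEmphasis and DFT are the routines defined earlier, and frameSize/frameShift are assumed parameters (e.g. 256 and 128 samples for a 256-point window with 50% overlap).

    // Sketch of a magnitude STFT (spectrogram), reusing PreEmphasis and DFT from above.
    float[][] STFT(float s[], int frameSize, int frameShift, float alpha) {
        PreEmphasis(s, alpha);                                   // boost the higher frequencies
        int numFrames = (s.length - frameSize) / frameShift + 1;
        float[][] spectrogram = new float[numFrames][];
        for (int f = 0; f < numFrames; f++) {
            float[] frame = new float[frameSize];
            for (int n = 0; n < frameSize; n++) {
                // apply a Hamming window while copying the frame
                double w = 0.53836 - 0.46164 * Math.cos(2 * Math.PI * n / (frameSize - 1));
                frame[n] = (float)(w * s[f * frameShift + n]);
            }
            float[] amp = new float[frameSize], phase = new float[frameSize];
            DFT(frame, amp, phase);                              // spectrum of this frame
            spectrogram[f] = amp;
        }
        return spectrogram;
    }

Because of the symmetry of the DFT magnitude about f_s/2, only the first frameSize/2 + 1 amplitude values of each frame are needed for display.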