Pitch-synchronous overlap add (TD-PSOLA)
• Purpose: modify the pitch or timing of a signal
• PSOLA is a time domain algorithm
• Pseudo code
  1. Find the pitch points of the signal
  2. Apply a Hanning window centered on each pitch point and extending to the previous and next pitch points
  3. Add the windowed waves back together
• To slow down speech, duplicate frames
• To speed up speech, remove frames
• Hanning windowing preserves the signal energy
• Undetectable if the epochs are accurately found. Why? We are not altering the vocal filter, only changing the spacing of the signal

TD-PSOLA Illustrations (figures: pitch modification by windowing and adding; duration modification by inserting or removing frames)

TD-PSOLA Pitch Points (Epochs)
• TD-PSOLA requires an exact marking of the pitch points in a time domain signal
• Pitch mark
  – Marking any point within a pitch period is acceptable as long as the algorithm marks the same point in every period
  – The most common marking point is the instant of glottal closure, which appears as a rapid descent in the time domain signal
• Create an array of sample numbers that comprise the analysis epoch sequence P = {p1, p2, …, pn}
• Estimate the pitch period at pk from the spacing of its neighboring epochs: (pk+1 − pk−1) / 2

TD-PSOLA Evaluation
• Advantages
  – As a time domain algorithm, it is unlikely that any other approach will be more efficient (O(N))
  – Listeners cannot perceive signal alterations of up to 50%
• Disadvantages
  – Epoch marking must be exact
  – Only pitch and timing changes are possible; the vocal filter itself is not modified

Time Domain Pitch Detection
• Auto correlation
  – Correlate a window of speech with a previous window
  – Find the best match
  – Issue: too many false peaks
• Peak and center clipping
  – An algorithm to reduce false peaks
  – Clip the top/bottom of the signal
  – Center the remainder around 0
• Other alternatives
  – Researchers have proposed many other pitch detection algorithms
  – There is much debate as to which is the best

Auto Correlation
1. Auto correlation: r(k) = (1/M) ∑n=0,M−1 x(n) x(n−k), where x(n−k) = 0 if n−k < 0
   Find the k that maximizes the sum
2. Difference function: d(k) = (1/M) ∑n=1,M−1 |x(n) − x(n−k)|, where x(n−k) = 0 if n−k < 0
   Find the k that minimizes the sum
3. Considerations
   a. The difference approach is faster
   b. Both can produce false positives
   c. The YIN algorithm combines both techniques (see the sketch below)

Harmonic Product Spectrum
Pseudo code
  Divide the signal into frames (20-30 ms long)
  Perform an FFT on each frame
  Down sample the FFT by factors of 2, 3, and 4 (taking every 2nd, 3rd, and 4th value)
  Add the FFT and the down sampled spectrums together
  The pitch harmonics will line up (the spectrum will "spike" at the pitch value)
  Find the spike: pitch = fsample / fftSize × index

Frequency Spectrum (illustration)
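The following is a minimal Python/NumPy sketch of the two time-domain searches defined in the Auto Correlation section above. The frame is assumed to be a NumPy array of mono samples; the search range (50-500 Hz), the function names, and the default parameters are illustrative choices, not part of the original formulas.

    import numpy as np

    def autocorr_pitch(frame, fs, fmin=50.0, fmax=500.0):
        # r(k) = (1/M) * sum_n x[n] * x[n-k]; pick the lag k that maximizes r(k)
        M = len(frame)
        kmin, kmax = int(fs / fmax), min(int(fs / fmin), M - 1)
        best_k, best_r = kmin, -np.inf
        for k in range(kmin, kmax + 1):
            r = np.dot(frame[k:], frame[:M - k]) / M
            if r > best_r:
                best_k, best_r = k, r
        return fs / best_k

    def difference_pitch(frame, fs, fmin=50.0, fmax=500.0):
        # d(k) = (1/M) * sum_n |x[n] - x[n-k]|; pick the lag k that minimizes d(k)
        M = len(frame)
        kmin, kmax = int(fs / fmax), min(int(fs / fmin), M - 1)
        best_k, best_d = kmin, np.inf
        for k in range(kmin, kmax + 1):
            d = np.sum(np.abs(frame[k:] - frame[:M - k])) / M
            if d < best_d:
                best_k, best_d = k, d
        return fs / best_k

In practice the frame is often peak/center clipped first, or the YIN cumulative normalization is applied, to reduce the false peaks mentioned above.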
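Below is a sketch of the Harmonic Product Spectrum pseudo code, following the additive variant described above (the classic formulation multiplies the downsampled spectra instead of adding them). It assumes NumPy, a single 20-30 ms frame as a NumPy array, and an illustrative choice of 4 harmonics.

    import numpy as np

    def hps_pitch(frame, fs, n_harmonics=4):
        # Window the frame and take its magnitude spectrum
        windowed = frame * np.hanning(len(frame))
        spectrum = np.abs(np.fft.rfft(windowed))
        # Add the spectrum to versions of itself downsampled by 2, 3, 4
        combined = spectrum.copy()
        for h in range(2, n_harmonics + 1):
            down = spectrum[::h]                      # take every h-th value
            combined[:len(down)] += down
        # The harmonics line up, producing a spike at the pitch bin
        index = int(np.argmax(combined[1:])) + 1      # skip the DC bin
        return fs / len(frame) * index                # pitch = fsample / fftSize * index

Because the estimate is quantized to an FFT bin (fsample / fftSize), zero-padding the frame before the FFT is a common way to refine it.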
Background Noise
• Definition: an unwanted sound, or an unwanted perturbation of a wanted signal
• Examples
  – Clicks from microphone synchronization
  – Ambient noise level: background noise
  – Roadway noise
  – Machinery
  – Additional speakers
  – Background activities: TV, radio, dog barks, etc.
• Classifications
  – Stationary: does not change with time (e.g. a fan)
  – Non-stationary: changes with time (e.g. a door closing, TV)

Noise Spectrums (power measured relative to frequency f)
• White noise: constant over the range of f
• Pink noise: decreases by 3 dB per octave; perceived as equal across f
• Brown(ian) noise: decreases proportional to 1/f²
• Red noise: decreases with f (either pink or brown)
• Blue noise: increases proportional to f
• Violet noise: increases proportional to f²
• Gray noise: shaped according to a psycho-acoustical curve
• Orange noise: bands of zero power around musical notes
• Green noise: the noise of the world; pink with a bump near 500 Hz
• Black noise: zero everywhere except for spikes, falling as 1/f^β with β > 2
• Colored noise: any noise that is not white
• Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
• Signal Processing Information Base: http://spib.rice.edu/spib.html

Applications
• ASR: prevent significant degradation in noisy environments
  – Goal: minimize recognition degradation when noise is present
• Sound editing and archival
  – Improve the intelligibility of audio recordings
  – Goals: eliminate perceptible noise; recover audio from wax recordings
• Mobile telephony
  – Transmission of audio in high noise environments
  – Goal: reduce transmission requirements
• Comparing audio signals
  – A variety of digital signal processing applications
  – Goal: normalize audio signals for ease of comparison

Signal to Noise Ratio (SNR)
• Definition: the power ratio between a signal and the noise that interferes with it
• Standard equation in decibels:
  SNRdB = 10 log10(Asignal/Anoise)² = 20 log10(Asignal/Anoise)
• For digitized speech, per frame:
  SNRf = 10 log10(∑n=0,N−1 sf(n)² / ∑n=0,N−1 nf(n)²)
  – sf is an array holding the signal samples of a frame
  – nf is an array of the noise samples
• Note: if sf(n) = nf(n) for every n, SNRf = 0 dB

Stationary Noise Suppression
• Requirements
  – Maximize the amount of noise removed
  – Minimize signal distortion
  – An efficient algorithm with low big-Oh complexity
• Problems
  – There is a tradeoff between removing noise and distorting the signal
  – More noise removal tends to distort the signal more
• Popular approaches
  – Time domain: moving average filter (distorts the frequency domain)
  – Frequency domain: spectral subtraction
  – Time domain: Wiener filter (using LPC)

Autoregression Noise Removal
• Definition: an autoregressive process is one where a value can be determined by a linear combination of previous values
• Formula: Xt = c + ∑i=1,P ai Xt-i + nt
  – c is a constant, nt is the noise, and the summation is the pure signal
• This is none other than linear prediction; the noise is the residual
• Applying the LPC filter to the signal separates the noise from the signal (Wiener filter)

Spectral Subtraction
• Assumption: the noisy signal is yt = st + nt, where st is the clean signal and nt is additive noise
• Pseudo code
  Perform an FFT on each windowed frame
  IF speech is not present
    Update the estimate of the noise spectrum: σ nt + (1 − σ) nt-1, 0 ≤ σ ≤ 1
  ELSE
    Subtract the estimated noise spectrum
  Perform an inverse FFT
• Reference: S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, Apr. 1979.

Implementation Issues
1. Question: How do we estimate the noise?
   Answer: Use the frequency distribution during times when no voice is present.
2. Question: How do we know when voice is present?
   Answer: Use a Voice Activity Detection (VAD) algorithm.
3. Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals?
   Answer: Human hearing largely ignores phase differences.
4. Question: Is the noise independent of the signal?
   Answer: We assume the noise is additive and does not interact with the signal.
5. Question: Are noise distributions really stationary?
   Answer: We assume yes.
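The sketch below ties the spectral subtraction pseudo code and the implementation answers above together: magnitude-only subtraction, the noisy phase reused for resynthesis, and a running-average noise estimate updated only on non-speech frames. NumPy is assumed; the speech_flags input (a per-frame VAD decision), sigma, and the spectral floor are illustrative parameters, not part of the original description.

    import numpy as np

    def spectral_subtraction(frames, speech_flags, sigma=0.1, floor=0.0):
        # frames: windowed NumPy frames; speech_flags[i] is False for noise-only frames
        noise_est = None
        cleaned = []
        for frame, is_speech in zip(frames, speech_flags):
            spectrum = np.fft.rfft(frame)
            magnitude, phase = np.abs(spectrum), np.angle(spectrum)
            if not is_speech:
                # Update the noise estimate: sigma*|Y_t| + (1 - sigma)*previous estimate
                noise_est = magnitude if noise_est is None else sigma * magnitude + (1 - sigma) * noise_est
            elif noise_est is not None:
                # Subtract the estimated noise spectrum; clamp bins that would go negative
                magnitude = np.maximum(magnitude - noise_est, floor * magnitude)
            # Resynthesize with the noisy phase (human hearing largely ignores phase)
            cleaned.append(np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame)))
        return cleaned

A real implementation would overlap-add the returned frames. Setting floor to zero reproduces the behavior that causes musical noise (see below); keeping a small positive floor is one of the common enhancements.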
Phase Distortions
• Problem: we do not know how much of the phase in an FFT comes from the noise and how much from the speech
• Assumption: the algorithm assumes both have the same phase (that of the noisy signal)
• Result: when the SNR approaches 0 dB, the audio takes on a hoarse sounding voice
• Why? The phase assumption means the expected noise magnitude is calculated incorrectly
• Conclusion: there is a limit to the utility of spectral subtraction when the SNR is close to zero

Evaluation
• Advantage: easy to understand and implement
• Disadvantages
  – The noise estimate is not exact
    • When too high, portions of speech are lost
    • When too low, some noise remains
    • When the noise estimate in a frequency bin exceeds the magnitude of the noisy signal, the subtraction goes negative, which causes musical tone artifacts
  – Non-linear or interacting noise
    • Negligible with large SNR values
    • Significant impact when the SNR is small

Musical Noise
• Definition: random isolated tone bursts spread across the frequency spectrum
• Why? Most implementations set a frequency bin's magnitude to zero if noise reduction would cause it to become negative
(Figure: green dashes: noisy signal; solid line: noise estimate; black dots: projected clean signal)

Spectral Subtraction Enhancements
• Eliminate negative frequency magnitudes
• Reduce the noise estimate by some factor
  – Vary the noise estimate factor in different frequency bands
  – Use a larger factor in regions outside the human speech range
• Apply psycho-acoustical methods
  – Only attempt to remove perceived noise, not all noise
  – Human hearing masks sounds of adjacent frequencies
  – A loud sound masks other sounds even after it ceases
• Adaptive noise estimation: Nt(f) = λ Gt(f) + (1 − λ) Nt-1(f), where Gt(f) is the spectrum of the current noise-only frame

Threshold of Hearing (illustration)

Masking Acoustical Effects
• Characteristic Frequency (CF): the frequency that causes the maximum response at a given point on the cochlea's basilar membrane
• Neurons exhibit a maximum response for about 20 ms and then decrease to a steady state shortly after the stimulus is removed
• Masking effects can be simultaneous or temporal
  – Simultaneous: one signal drowns out another
  – Temporal: one signal masks the ones that follow
  – Forward: the masking effect persists after the masker is removed (5 ms–150 ms)
  – Backward: a weak signal is masked by a strong signal that follows it (about 5 ms)

Voice Activity Detector (VAD)
• Many VAD algorithms exist
• Possible approaches to consider
  – Energy above the background noise
  – Low zero crossing rate
  – Determine whether pitch is present
  – Low fractal dimension compared to pure noise
  – Low LPC residual
• General principle: it is better to misclassify noise as speech than to misclassify speech as noise
• Standardized algorithms exist for telephone/cell phone environments

Possible VAD Algorithm
Note: the energy and zero-crossing statistics of the noise are estimated from the initial ¼ second
boolean vad(double[] frame)  // returns true if speech is present
  IF frame energy < low noise threshold (in standard deviation units)
    RETURN false
  IF frame energy > high noise threshold
    RETURN true
  FOR each forward frame
    IF the forward frame energy < low noise threshold
      RETURN false
    IF the forward frame energy > high noise threshold
      COUNT the frames in the previous ¼ second having a large zero-crossing rate
      IF count > zero-crossing threshold (in standard deviation units)
         AND this frame's index > the first frame whose zero-crossing rate exceeds the threshold
        RETURN true
  RETURN false
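Finally, a simplified, self-contained sketch in the spirit of the VAD pseudo code above: noise statistics are estimated from the first ¼ second (assumed speech-free), and borderline frames fall back on the zero-crossing rate. The thresholds in standard-deviation units, the omission of the forward-frame look-ahead, and all names are illustrative simplifications, not the exact algorithm.

    import numpy as np

    def frame_energy(frame):
        return float(np.sum(np.asarray(frame, dtype=np.float64) ** 2))

    def zero_crossing_rate(frame):
        # Approximate fraction of samples at which the sign changes
        signs = np.sign(np.asarray(frame, dtype=np.float64))
        return float(np.mean(np.abs(np.diff(signs))) / 2.0)

    def simple_vad(frames, fs, frame_len, low_k=2.0, high_k=4.0, zcr_k=2.0):
        # Noise statistics come from the first 1/4 second, assumed to be speech-free
        n_noise = max(1, int(0.25 * fs / frame_len))
        energies = np.array([frame_energy(f) for f in frames])
        zcrs = np.array([zero_crossing_rate(f) for f in frames])
        e_mu, e_sd = energies[:n_noise].mean(), energies[:n_noise].std() + 1e-12
        z_mu, z_sd = zcrs[:n_noise].mean(), zcrs[:n_noise].std() + 1e-12
        decisions = []
        for e, z in zip(energies, zcrs):
            if e > e_mu + high_k * e_sd:              # well above the noise floor: speech
                decisions.append(True)
            elif e < e_mu + low_k * e_sd:             # within the noise floor: not speech
                decisions.append(False)
            else:                                     # borderline energy: check zero crossings
                decisions.append(bool(z > z_mu + zcr_k * z_sd))
        return decisions

Keeping the low threshold small follows the general principle above: it is better to misclassify noise as speech than to clip speech.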