Unit Selection, Synthesis

Pitch-synchronous overlap add (TD-PSOLA)
Purpose: Modify pitch or timing of a signal
• PSOLA is a time domain algorithm
• Pseudo code
1. Find the pitch points of the signal
2. Apply Hanning window centered on the pitch points and
extending to the next and previous pitch point
3. Add waves back
• To slow down speech, duplicate frames
• To speed up, remove frames
• Hanning windowing preserves signal energy
• Nearly undetectable if epochs are accurately found. Why?
We are not altering the vocal tract filter, only the spacing of the signal
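The three pseudo-code steps above can be sketched in Python. This is a simplified time-scale modifier only: the frame-selection rule, buffer sizing, and use of the original pitch spacing are illustrative assumptions, not a production TD-PSOLA.

```python
import numpy as np

def psola_time_stretch(signal, epochs, rate):
    """Sketch of TD-PSOLA time-scale modification.

    epochs : sample indices of pitch marks (e.g. glottal closure instants)
    rate   : > 1 slows speech down (duplicates pitch cycles), < 1 speeds it up
    """
    # 1. Extract one Hanning-windowed frame per epoch, extending to the
    #    previous and next pitch marks (two pitch periods wide).
    frames = []
    for k in range(1, len(epochs) - 1):
        lo, mid, hi = epochs[k - 1], epochs[k], epochs[k + 1]
        seg = np.asarray(signal[lo:hi], dtype=float)
        frames.append((mid - lo, seg * np.hanning(len(seg))))

    # 2. Choose which analysis frame feeds each synthesis epoch.
    #    Duplicating indices slows speech; skipping them speeds it up.
    n_out = int(round(len(frames) * rate))
    picks = [min(int(i / rate), len(frames) - 1) for i in range(n_out)]

    # 3. Overlap-add the chosen frames at synthesis pitch marks spaced by
    #    the original local pitch period (so pitch is unchanged here).
    periods = np.diff(epochs)
    out = np.zeros(int(len(signal) * rate) + len(signal))
    pos = 0
    for i in picks:
        center, frame = frames[i]
        start = pos - center
        if start >= 0:
            out[start:start + len(frame)] += frame
        pos += periods[min(i, len(periods) - 1)]
    return out[:pos]
```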
TD-PSOLA Illustrations
Pitch (window and add)
Duration (insert or remove)
TD-PSOLA Pitch Points (Epochs)
• TD-PSOLA requires an exact marking of pitch points
in a time domain signal
• Pitch mark
– Marking any part within a pitch period is okay as long as
the algorithm marks the same point for every frame
– The most common marking point is the instant of glottal
closure, which appears as a rapid descent in the time-domain signal
• Create an array of the sample numbers that comprise
an analysis epoch sequence P = {p1, p2, …, pn}
• Estimate the local pitch period as (pk+1 – pk−1)/2,
the average of the two adjacent epoch spacings
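Working from the epoch sequence above, the local pitch period and fundamental frequency fall out of simple differencing. The epoch values and the 16 kHz sample rate below are illustrative assumptions.

```python
import numpy as np

# Hypothetical analysis epoch sequence P = {p1, ..., pn}, in sample indices.
epochs = np.array([0, 95, 192, 291, 388])

# Local pitch period in samples between consecutive pitch marks.
periods = np.diff(epochs)        # distances p(k+1) - p(k)

# Convert period to fundamental frequency, assuming a 16 kHz sample rate.
fs = 16000
f0 = fs / periods                # Hz per inter-epoch interval
```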
TD-PSOLA Evaluation
• Advantages
– As a time domain algorithm, it is unlikely that any other
approach will be more efficient (O(N))
– Listeners cannot perceive signal alterations of up to 50%
• Disadvantages
– Epoch marking must be exact
– Only pitch and timing changes are possible; the spectral
envelope cannot be modified
Time Domain Pitch Detection
• Auto Correlation
– Correlate a window of speech
with a previous window
– Find the best match
– Issue: too many false peaks
• Peak and center clipping
– Algorithm to reduce false peaks
– Clip the top/bottom of a signal
– Center the remainder around 0
• Other alternatives
– Researchers propose many other
pitch detection algorithms
– There is much debate as to
which is the best
Auto Correlation
1. Auto Correlation
r(k) = 1/M ∑n=0,M−1 xn xn−k ; xn−k = 0 if n−k < 0
Find the k that maximizes the sum
2. Difference Function
d(k) = 1/M ∑n=0,M−1 |xn − xn−k| ; xn−k = 0 if n−k < 0
Find the k that minimizes the sum
3. Considerations
a. Difference approach is faster
b. Both can get false positives
c. The YIN algorithm combines both techniques
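The autocorrelation formula above can be sketched directly. The frequency bounds (fmin/fmax) are an assumed mitigation for false peaks, restricting the lag search to plausible pitch periods; they are not part of the formula itself.

```python
import numpy as np

def detect_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch detector over a bounded lag range.

    fmin/fmax restrict the search so spurious peaks at very small
    or very large lags are never considered.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    kmin = int(fs / fmax)                    # smallest lag to test
    kmax = int(fs / fmin)                    # largest lag to test
    M = len(frame)
    # r(k) = (1/M) * sum_n x[n] * x[n-k], with x[n-k] = 0 for n-k < 0
    scores = [np.dot(frame[k:], frame[:M - k]) / M
              for k in range(kmin, kmax + 1)]
    best = kmin + int(np.argmax(scores))     # lag with the best match
    return fs / best                         # lag -> frequency in Hz
```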
Harmonic Product Spectrum
Pseudo Code
Divide signal into frames (20-30 ms long)
Perform FFT
Down sample the FFT by factors of 2, 3, 4
(taking every 2nd, 3rd, 4th value)
Multiply the FFT and the downsampled spectra together
(or, equivalently, add their log magnitudes)
The pitch harmonics will line up
(The spectrum will “spike” at the pitch value)
Find the spike: return fsample / fftSize * index
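The pseudo code above translates almost line for line into Python. The number of harmonics and the search cutoff below are illustrative assumptions.

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=4):
    """Harmonic Product Spectrum pitch estimate for one frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    product = spectrum.copy()
    # Downsample by 2, 3, 4, ... and multiply: the harmonics of the
    # fundamental all line up at the fundamental's bin.
    for h in range(2, n_harmonics + 1):
        down = spectrum[::h]
        product[:len(down)] *= down
    # Skip the DC bin, search only bins every harmonic can reach,
    # then convert the spiking bin index to Hz: fs / fftSize * index.
    index = 1 + int(np.argmax(product[1:len(spectrum) // n_harmonics]))
    return fs / len(frame) * index
```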
Frequency Spectrum
Background Noise
• Definition: an unwanted sound or an unwanted
perturbation to a wanted signal
• Examples:
– Clicks from microphone synchronization
– Ambient noise level: background noise
– Roadway noise
– Machinery
– Additional speakers
– Background activities: TV, radio, dog barks, etc.
• Classifications
– Stationary: doesn’t change with time (e.g. a fan)
– Non-stationary: changes with time (e.g. a door closing, TV)
Noise Spectrums
Power measured relative to frequency f
• White Noise: constant over the range of f
• Pink Noise: decreases by 3 dB per octave; perceived as equal across f
• Brown(ian): decreases proportional to 1/f² (6 dB per octave)
• Red: decreases with f (either pink or brown)
• Blue: increases proportional to f
• Violet: increases proportional to f²
• Gray: proportional to a psycho-acoustical curve
• Orange: bands of zero power around musical notes
• Green: noise of the world; pink, with a bump near 500 Hz
• Black: zero everywhere except spikes proportional to 1/f^β where β > 2
• Colored: any noise that is not white
Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html
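The spectral slopes above can be checked empirically. As an illustrative assumption, Brownian noise is generated here by cumulatively summing white noise (the discrete random walk), and the band edges are arbitrary bin ranges.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 16

white = rng.standard_normal(n)     # flat power over f
brown = np.cumsum(white)           # power falls roughly as 1/f^2

def band_power(x, lo, hi):
    """Mean spectral power of x over FFT bins [lo, hi)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    return spec[lo:hi].mean()

# White noise: low and high bands carry comparable power.
w_ratio = band_power(white, 10, 1000) / band_power(white, 10000, 11000)
# Brownian noise: the low band dwarfs the high band.
b_ratio = band_power(brown, 10, 1000) / band_power(brown, 10000, 11000)
```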
Applications
• ASR: Prevent significant degradation in noisy environments
– Goal: Minimize recognition degradation with noise present
• Sound Editing and Archival:
– Improve intelligibility of audio recordings
– Goals: Eliminate perceptible noise; recover audio from wax recordings
• Mobile Telephony:
– Transmission of audio in high noise environments
– Goal: Reduce transmission requirements
• Comparing audio signals
– A variety of digital signal processing applications
– Goal: Normalize audio signals for ease of comparison
Signal to Noise Ratio (SNR)
• Definition: Power ratio between a signal and noise
that interferes.
• Standard Equation in decibels:
SNRdb = 10 log10(ASignal/ANoise)² = 20 log10(ASignal/ANoise)
• For digitized speech
SNRf = P(signal)/P(noise) in dB = 10 log10(∑n=0,N−1 sf(n)² / ∑n=0,N−1 nf(n)²)
– sf is an array holding samples from a frame
– nf is an array of noise samples
• Note: if sf(n) = nf(n) for every n, SNRf = 0 dB
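The frame-SNR formula above is a one-liner in practice; a minimal sketch:

```python
import numpy as np

def snr_db(signal_frame, noise_frame):
    """Frame SNR in dB: 10 * log10( sum s^2 / sum n^2 )."""
    p_signal = np.sum(np.asarray(signal_frame, dtype=float) ** 2)
    p_noise = np.sum(np.asarray(noise_frame, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)
```

When both frames hold identical samples the ratio is 1 and the result is 0 dB, matching the note above.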
Stationary Noise Suppression
• Requirements
– Maximize the amount of noise removed
– Minimize signal distortion
– Efficient algorithm with low big-Oh complexity
• Problems
– Tradeoff between removing noise and distorting the signal
– More noise removal tends to distort the signal
• Popular approaches
– Time domain: Moving average filter (distorts frequency domain)
– Frequency domain: Spectral Subtraction
– Time domain: Wiener filter (using LPC)
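As a sketch of the first approach above, a centered moving-average filter is only a few lines. The window width is an assumed parameter; note that this smoothing is exactly the frequency-domain distortion the slide warns about, since averaging low-passes the signal.

```python
import numpy as np

def moving_average(x, width=5):
    """Centered moving-average filter: a simple time-domain smoother."""
    kernel = np.ones(width) / width
    # 'same' keeps the output the same length as the input;
    # the edges average over fewer real samples (zero padding).
    return np.convolve(x, kernel, mode="same")
```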
Autoregressive Noise Removal
• Definition: An autoregressive process is one
where a value can be determined by a linear
combination of previous values
• Formula: Xt = c + ∑i=1,P ai Xt−i + nt
c is a constant, nt is the noise, and the summation is the pure signal
• This is none other than linear prediction; the noise
is the residual.
• Applying the LPC filter to the signal separates
noise from signal (Wiener Filter)
Spectral Subtraction
Assumption: Noisy signal: yt = st + nt
st is the clean signal and nt is additive noise
Perform FFT on all windowed frames
IF speech is not present
Update the noise spectrum estimate:
Nt = σ|Yt| + (1−σ)Nt−1, where 0 ≤ σ ≤ 1
ELSE subtract the estimated noise spectrum from |Yt|
Perform an inverse FFT
S. F. Boll, “Suppression of acoustic noise in speech using
spectral subtraction," IEEE Trans. Acoustics, Speech, Signal
Processing, vol. ASSP-27, Apr. 1979.
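A minimal single-frame sketch of the subtraction step, assuming the noise magnitude estimate is already available and reusing the noisy phase (the assumption discussed under Implementation Issues). The floor parameter is an illustrative addition: with floor = 0, bins driven negative are zeroed, which produces the musical-noise artifacts covered below.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag, floor=0.0):
    """One frame of magnitude spectral subtraction.

    noise_mag : estimated noise magnitude spectrum, e.g. averaged
                over frames where a VAD reports no speech
    floor     : fraction of the noisy magnitude kept when the
                subtraction would go negative
    """
    spectrum = np.fft.rfft(noisy_frame)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)               # keep the noisy phase
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase),
                        n=len(noisy_frame))
```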
Implementation Issues
1. Question: How do we estimate the noise?
Answer: Use the frequency distribution during times when
no voice is present
2. Question: How do we know when voice is present?
Answer: Use Voice Activity Detection algorithms (VAD)
3. Question: Even if we know the noise amplitudes, what about
phase differences between the clean and noisy signals?
Answer: Human hearing largely ignores phase differences
4. Question: Is the noise independent of the signal?
Answer: We assume the noise is additive (linear) and does not
interact with the signal.
5. Question: Are noise distributions really stationary?
Answer: We assume yes.
Phase Distortions
• Problem: We don’t know how much of the phase in
an FFT is from noise and from speech.
• Assumption: The algorithm assumes the phase of
both are the same (that of the noisy signal).
• Result: When SNR approaches 0 dB, the audio has a
hoarse-sounding voice.
• Why? The phase assumption means that the
expected noise magnitude is incorrectly calculated.
• Conclusion: There is a limit to spectral subtraction
utility when SNR is close to zero
Evaluation
• Advantage: Easy to understand and implement
• Disadvantages
– The noise estimate is not exact
• When too high, speech portions will be lost
• When too low, some noise remains
• When the noise estimate exceeds the noisy signal’s
magnitude in a frequency bin, a negative magnitude
results, causing musical tone artifacts
– Non-linear or interacting noise
• Negligible with large SNR values
• Significant impact when SNR is small
Musical noise
Definition: Random isolated tone bursts across the frequency spectrum.
Why? Most implementations set frequency bin magnitudes to
zero if noise reduction would cause them to become negative
(Figure: green dashes = noisy signal; solid line = noise estimate;
black dots = projected clean signal)
Spectral Subtraction Enhancements
• Eliminate negative frequencies
• Reduce the noise estimates by some factor
o Vary the noise estimate factor in different frequency bands
o Larger in regions outside of human speech range
• Apply psycho-acoustical methods
o Only attempt to remove perceived noise, not all noise
o Human hearing masks sounds of adjacent frequencies
o A loud sound masks sounds even after it ceases
• Adaptive noise estimation: Nt(f) = λF Gt(f) + (1−λF) Nt−1(f)
Threshold of Hearing
Masking
Acoustical Effects
• Characteristic Frequency (CF): The frequency that causes
maximum response at a point of the Cochlea Basilar Membrane
• Neurons exhibit a maximum response for 20 ms, then
decrease to a steady state shortly after the stimulus is removed
• Masking effects can be simultaneous or temporal
– Simultaneous: one signal drowns out another
– Temporal: One signal masks the ones that follow
– Forward: masking is still effective after the masker is removed (5 ms–150 ms)
– Backward: a weak signal is masked by a strong one that follows it (~5 ms)
Voice Activity Detector (VAD)
• Many VAD algorithms exist
• Possible approaches to consider
– Energy above background noise
– Low zero-crossing rate
– Determine if pitch is present
– Low fractal dimension compared to pure noise
– Low LPC residual
• General principle: It is better to misclassify noise as speech
than to misclassify speech as noise
• Standard algorithms: telephone/cell phone environments
Possible VAD algorithm
Note: energy and zero-crossing statistics of the noise are estimated from the initial ¼ second
boolean vad(double[] frame) // returns true if speech is present
IF frame energy < low noise threshold (in standard-deviation units) RETURN false
IF frame energy > high noise threshold RETURN true
FOR each forward frame
IF forward frame energy < low noise threshold RETURN false
IF forward frame energy > high noise threshold
FOR the previous ¼ second of frames
COUNT the previous frames having a large zero-crossing rate
IF count > zero-crossing threshold (in standard-deviation units)
AND this frame’s index > the first frame whose rate exceeded the threshold
RETURN true
RETURN false
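The energy-threshold core of a VAD can be sketched as follows. This is a deliberately reduced version: it uses only the energy test (no zero-crossing or look-ahead logic), and the threshold constant k is an assumed parameter in standard-deviation units.

```python
import numpy as np

def simple_vad(frames, noise_frames, k=3.0):
    """Energy-threshold VAD sketch: one boolean flag per frame.

    noise_frames : frames known to contain only noise (e.g. the
                   initial quarter second) used to set the threshold
    k            : threshold in standard-deviation units
    """
    energies = np.array([np.mean(np.square(f)) for f in frames])
    noise_e = np.array([np.mean(np.square(f)) for f in noise_frames])
    threshold = noise_e.mean() + k * noise_e.std()
    # Err toward speech: only frames clearly at the noise floor
    # are labeled non-speech.
    return energies > threshold
```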