Fundamental Frequency Modelling and Estimation

Speech Processing
Tom Bäckström
Aalto University
October 2015
F0 – Basics
- The fundamental frequency of speech signals is generated by the quasi-periodic oscillations of the vocal folds, caused by airflow from the lungs.
  - The fundamental frequency refers to the rate of these oscillations and is thus a measure of the physical phenomenon.
  - The pitch of a speech signal refers to the perceived frequency, that is, what a human listener hears.
- In speech, the fundamental frequencies are roughly in the range F0 ∈ 80 Hz ... 400 Hz.
- Perception of pitch is a complex topic.
  - For example, you can remove the fundamental frequency of a harmonic signal, but a human listener will automatically deduce the fundamental from the upper harmonics, such that the perceived pitch is the now missing fundamental frequency.
  - That is, we perceive something which is not there.
F0 – Basics
[Figure: Physiological generation of F0. (a) Displacement (left, right) vs. time. (b) Magnitude (dB) vs. frequency.]
F0 – Basics
[Figure: Voiced phone. Top: amplitude vs. time (samples), with the pitch period length T marked. Middle: spectrum with comb-structure of harmonic signal, magnitude (dB) vs. frequency (Hz), 0 to 8000 Hz. Bottom: low-frequency part of the spectrum, 0 to 3500 Hz, with the harmonic peaks labelled F0, 2F0, ..., 20F0.]
F0 – Basics
- Let the length of the pitch period be T (seconds).
  - The signal repeats itself after every T seconds, so it also repeats itself at multiples of T, that is, at kT with k = 1, 2, 3, ....
  - The fundamental frequency is then F0 = 1/T (Hertz).
- The harmonic frequencies are multiples of F0, that is, kF0 with k = 1, 2, 3, ....
- If fs is the sampling frequency, then the pitch period length is K = fs T samples.
- The fundamental frequency is then F0 = fs / K.
- If we assume that F0 ∈ 80 Hz ... 400 Hz, then T ∈ 2.5 ms ... 12.5 ms, and with fs = 16 kHz, K ∈ 40 ... 200 samples.
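The conversion between an F0 range and a range of pitch period lengths in samples can be checked with a few lines of code. This is a minimal sketch; the function name `pitch_lag_range` is a hypothetical helper, not something from the slides.

```python
def pitch_lag_range(f0_min_hz, f0_max_hz, fs_hz):
    """Return (k_min, k_max), the pitch period lengths in samples, K = fs / F0."""
    k_min = round(fs_hz / f0_max_hz)  # the highest F0 gives the shortest period
    k_max = round(fs_hz / f0_min_hz)  # the lowest F0 gives the longest period
    return k_min, k_max


k_min, k_max = pitch_lag_range(80.0, 400.0, 16000.0)
print(k_min, k_max)  # 40 200
```

Note that the mapping is inverted: the upper F0 limit determines the lower lag limit.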
F0 – Time
[Figure: Voiced phone, amplitude vs. time (samples), 0 to 500 samples.]
- Periodicity means that the signal repeats itself after a time T.
- The signal should then satisfy $x(t) = x(t-T)$.
- We can then try to determine such a T that $e(T) = |x(t) - x(t-T)|^2$ is minimized.
- The error measure can be simplified:
  $e(T) = |x(t)|^2 + |x(t-T)|^2 - 2x(t)x(t-T) = -2x(t)x(t-T) + \text{constant}$,
  where the two energy terms are treated as constant with respect to T.
- In other words, we search for the maximum of the correlation $x(t)x(t-T)$ at a distance T to find the period length.
F0 – Time
- It is often useful to normalize the correlation such that it lies in the interval c(T) ∈ [−1, +1]:
  $c(T) = \frac{x_t^\top x_{t-T}}{\|x_t\| \, \|x_{t-T}\|}$.
- Algorithm
  1. Choose a segment $x_t = [x(t), x(t+1), \dots, x(t+N-1)]^\top$.
  2. For each lag T ∈ [40, 200] (in samples, at fs = 16 kHz), determine the correlation $c(T) = \frac{x_t^\top x_{t-T}}{\|x_t\| \, \|x_{t-T}\|}$.
  3. Find the T for which c(T) is maximized.
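The algorithm above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the author's implementation: the function names, the segment length `n`, and the synthetic test signal (a harmonic tone with a period of exactly 120 samples) are all choices made here for the example.

```python
import numpy as np


def normalized_correlation(x, t, n, lag):
    """c(T) between the segment starting at t and the segment starting at t - lag."""
    a = x[t:t + n]
    b = x[t - lag:t - lag + n]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def estimate_lag(x, t, n, lag_min=40, lag_max=200):
    """Return the lag in [lag_min, lag_max] that maximizes c(T), and the maximum."""
    lags = np.arange(lag_min, lag_max + 1)
    c = np.array([normalized_correlation(x, t, n, int(lag)) for lag in lags])
    best = int(np.argmax(c))
    return int(lags[best]), float(c[best])


# Synthetic signal (assumption): three harmonics with a period of 120 samples,
# which at fs = 16 kHz corresponds to F0 = 16000 / 120 Hz.
k = np.arange(1000)
x = sum(np.sin(2 * np.pi * h * k / 120) / h for h in (1, 2, 3))
lag, c_max = estimate_lag(x, t=400, n=320)
print(lag, c_max)
```

The period of 120 samples was chosen so that only one multiple of the period lies inside the search range. With a shorter period, the lag 2T would give an equally high correlation, and the plain maximum could land on either peak.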
F0 – Time
[Figure: Top: signal and delayed windows x_{t-57}, x_{t-97}, x_{t-137}, x_{t-177}, amplitude vs. time (samples). Bottom: normalized correlation c(T) vs. lag (samples), with the search limits T=40 and T=200 and the maximum Tmax=95 marked.]
F0 – Time
- We found the pitch period at Tmax = 95 samples.
- Observe that at 2Tmax there was a big peak as well.
  - Correlation analysis can often find a multiple of the true period T.
  - We need separate safeguards to check if the maximum corresponds to a multiple of the true period.
  - Usually we would check whether 2Tmax or 3Tmax would also be plausible, and use some other information (such as the pitch of the previous frame) to choose which one is best.
- More peaks can be seen at ½Tmax and other fractions.
  - It is often difficult to decide which is the right one.
F0 – Time
Summary
- The fundamental pitch period can be estimated by correlation analysis in the time domain.
- A frequently appearing problem is that if there is a correlation at a distance T, then there will also be a correlation at a distance 2T (as well as ½T).
- This approach is often used by, for example, speech codecs (your mobile phone does this all the time).
F0 – Spectrum
- The spectrum of a harmonic signal features a comb-structure, which is easy to locate.
- Identifying the first peak of the comb-structure gives the fundamental frequency.
- A problem similar to that in the time domain frequently appears if we measure F0 in the spectrum.
  - Often, the highest peak is a multiple of F0, that is, the highest peak is 2F0 or 3F0 etc.
  - We then need to check whether ½Fmax or ⅓Fmax could be better estimates of F0.
[Figure: Top: voiced phone, amplitude vs. time (samples). Bottom: spectrum, envelope and formants (Fk), magnitude (dB) vs. frequency (Hz), 0 to 8000 Hz, with the formants F1 to F5 marked on the envelope.]
F0 – Spectrum
[Figures: seven phonation examples, one per slide. Each shows the time signal (amplitude vs. time in samples), its spectrum (magnitude in dB vs. frequency, 0 to 8000 Hz), and the low-frequency part of the spectrum (magnitude in dB vs. frequency, 0 to 800 Hz).]
F0 – Spectrum
- Typical basic algorithm:
  1. Find the highest peak in the interval fˆ ∈ 80 Hz ... 450 Hz as the first fundamental frequency estimate.
  2. Check if this is an integer multiple of the fundamental; if there is a peak at fˆ/2, fˆ/3 or fˆ/4, then use that as the fundamental frequency estimate.
- Finding an integer multiple of the fundamental instead of the fundamental itself is a typical problem.
  - Since the harmonic at 2F0 is one octave above the fundamental (and 4F0 two octaves), this error is known as an octave error; the name is used loosely for any integer multiple.
  - When analyzing the F0 in subsequent windows, an octave error in one frame can cause a jump in the F0 estimate. Such errors are known as octave jumps.
  - Perceptually (in, for example, speech coding), octave jumps are very easily perceivable, but octave errors less so.
- Many methods try to avoid octave jumps even when it means making octave errors more often.
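The two-step basic algorithm above could be sketched as follows. This is only an illustration: the Hann window, the one-bin search tolerance and the relative threshold `rel_threshold` that decides whether "there is a peak" at a sub-multiple are assumptions made here, not values given in the slides. The test frame is synthetic, with a deliberately dominant second harmonic to provoke an octave error.

```python
import numpy as np


def spectral_f0(frame, fs, f_min=80.0, f_max=450.0, rel_threshold=0.3):
    """Highest spectral peak in [f_min, f_max], corrected for integer multiples."""
    n = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Step 1: highest peak in the search interval.
    band = np.flatnonzero((freqs >= f_min) & (freqs <= f_max))
    i_max = band[np.argmax(mag[band])]
    f_hat = freqs[i_max]

    # Step 2: prefer the lowest of f_hat/4, f_hat/3, f_hat/2 that also has a peak.
    for divisor in (4, 3, 2):
        f_sub = f_hat / divisor
        if f_sub < f_min:
            continue
        i_sub = int(round(f_sub * n / fs))
        local = mag[i_sub - 1:i_sub + 2]  # tolerate an offset of one bin
        if local.max() >= rel_threshold * mag[i_max]:
            return float(freqs[i_sub - 1 + int(np.argmax(local))])
    return float(f_hat)


# Synthetic frame (assumption): F0 = 150 Hz with a dominant second harmonic.
fs, n = 16000, 2048
k = np.arange(n)
frame = (0.5 * np.sin(2 * np.pi * 150 * k / fs)
         + 1.0 * np.sin(2 * np.pi * 300 * k / fs)
         + 0.3 * np.sin(2 * np.pi * 450 * k / fs))
f0 = spectral_f0(frame, fs)
print(f0)
```

The estimate is quantized to the frequency-bin spacing fs / n. A fixed threshold like this is exactly the kind of heuristic that makes such algorithms fragile on real speech.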
F0 – Spectrum
- The octave error problem comes from the fact that the first few harmonics can have a very high energy.
  - Especially when the first formant is low (such as in /u/), the envelope has a peak near the first few harmonics.
  - It is often difficult to determine which lower-frequency peaks are part of the comb-structure and which are noise.
- Another typical problem is that it is difficult to determine whether the frame is voiced or not (unvoiced or non-speech).
  - How do we determine whether the spectrum has a comb-structure or not?
  - We could find some peaks even when the signal is pure noise.
  - How do we decide if a peak is part of a comb-structure (harmonic signal) or noise? Difficult.
F0 – Spectrum
- How do we determine where the peak is? What if F0 does not fall exactly on a frequency bin?
  - Often we would need to interpolate in the vicinity of the peak.
  - We can, for example, fit a second-order polynomial to the peak and find the maximum of the polynomial.
- If the pitch changes rapidly within the analysis window, the comb-structure becomes smeared.
  - If the first harmonic moves from F0 to F0 + ∆f, then the kth harmonic moves from kF0 to k(F0 + ∆f).
  - The movement of the kth harmonic is thus k∆f.
  - The higher harmonics move a lot, whereby we can see only the first few harmonics.
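Fitting a second-order polynomial to a peak can be done in closed form from the peak sample and its two neighbours. The sketch below uses the standard three-point parabola formula; the function name `parabolic_peak` is a hypothetical helper, and the test data are an exact parabola chosen so the result can be verified by hand.

```python
import numpy as np


def parabolic_peak(y, i):
    """Refine a peak at integer index i using a parabola through y[i-1], y[i], y[i+1].

    Returns (fractional peak index, interpolated peak value).
    """
    a, b, c = float(y[i - 1]), float(y[i]), float(y[i + 1])
    denom = a - 2.0 * b + c
    if denom == 0.0:  # the three points lie on a line; no refinement possible
        return float(i), b
    offset = 0.5 * (a - c) / denom
    return i + offset, b - 0.25 * (a - c) * offset


# Samples of a parabola whose true maximum is at index 3.3 with value 5.
k = np.arange(7)
y = 5.0 - (k - 3.3) ** 2
i = int(np.argmax(y))
position, value = parabolic_peak(y, i)
print(position, value)
```

For a spectral peak, the fractional index converts to Hertz as position · fs / N, where N is the FFT length. The interpolation is commonly applied to the log-magnitude spectrum, where peaks are closer to parabolic.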
[Figure: five time signals (amplitude vs. time) whose pitch period changes within the window from T=50 to 50, 51, 52, 53 and 54 samples, each shown with its spectrum (magnitude in dB vs. frequency).]
F0 – Spectrum
Summary
- The fundamental frequency is visible in the spectrum as a comb-structure.
- It is easy to develop an algorithm which estimates the fundamental frequency in the frequency domain.
  - Choose the first big peak.
- It is not easy to develop a robust algorithm which estimates the fundamental frequency in the frequency domain.
  - Octave errors and jumps are a problem.
  - Changing frequency is a problem.
F0 – Cepstrum
- Recall that the cepstrum was defined as $|\mathcal{F}\{\log(|\mathcal{F}\{x_n\}|)\}|$ and that it can be used for F0 estimation.
- If the signal has a comb-structure, then the cepstrum has a peak whose location corresponds to the pitch period T.
- We can simply search for the largest peak in the range F0 ∈ 80 Hz ... 400 Hz, which corresponds to T ∈ 2.5 ms ... 12.5 ms, or 40 ... 200 samples at fs = 16 kHz.
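A cepstral peak search following the definition above might look like this. It is a sketch under assumptions: the Hann window, the small floor added before the logarithm, and the synthetic test frame (an impulse train with a period of 100 samples, passed through a simple decaying filter, plus weak noise) are choices made for this example only.

```python
import numpy as np


def cepstral_pitch_lag(frame, lag_min=40, lag_max=200):
    """Location (in samples) of the largest cepstral peak in [lag_min, lag_max]."""
    n = len(frame)
    spectrum = np.fft.fft(frame * np.hanning(n))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    cepstrum = np.abs(np.fft.fft(log_mag))      # |F{log|F{x}|}|, unnormalized
    lags = np.arange(lag_min, lag_max + 1)
    return int(lags[np.argmax(cepstrum[lags])])


# Synthetic voiced-like frame (assumption): pulses every 100 samples.
rng = np.random.default_rng(0)
n, period = 1024, 100
excitation = np.zeros(n)
excitation[::period] = 1.0
impulse_response = 0.9 ** np.arange(64)
frame = np.convolve(excitation, impulse_response)[:n]
frame = frame + 1e-3 * rng.standard_normal(n)

lag = cepstral_pitch_lag(frame)
f0 = 16000 / lag  # F0 in Hz, assuming fs = 16 kHz
print(lag, f0)
```

The two FFTs and the logarithm over the whole spectrum are visible directly in the code; only the location of the peak matters here, so the scaling of the transforms is left unnormalized.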
[Figure: Top: voiced phone, amplitude vs. time (samples). Middle: spectrum, magnitude (dB) vs. frequency (Hz), 0 to 8000 Hz. Bottom: cepstrum and multiples of the pitch period T, magnitude vs. quefrency (samples), with peaks marked at T, 2T and 3T.]
F0 – Cepstrum
[Figures: eight phonation examples, one per slide. Each shows the time signal (amplitude vs. time in samples), its spectrum (magnitude in dB vs. frequency, 0 to 8000 Hz), and its cepstrum (magnitude vs. quefrency in samples, 0 to 500).]
Cepstrum
Summary
- The cepstrum can also be used for F0 estimation.
- Its axis (quefrency) has units of time, so the main peak corresponds to the pitch lag.
- The cepstrum is more robust to octave jumps than the other methods presented.
- The cepstrum is, however, sensitive to background noise.
- The complexity is slightly higher than that of the other methods, because we need two FFTs and the logarithm of the whole spectrum.
Autocorrelation
- We have already seen that the fundamental frequency is visible in the autocorrelation as a peak at the distance of the pitch period length.
[Figure: Top: voiced phone s_k, amplitude vs. time (samples). Bottom: autocovariance c_k (×10^-3) vs. delay k, −100 to 100.]
- We can therefore search for a peak in a suitable region.
- A fundamental frequency F0 ∈ 80 Hz ... 400 Hz corresponds to autocorrelation lags of 40 ... 200 samples when the sampling frequency is fs = 16 kHz.
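The peak search in the autocorrelation can be sketched as follows. The function name, the mean removal and the synthetic test signal (three harmonics with a period of 80 samples) are assumptions for illustration.

```python
import numpy as np


def autocorr_pitch_lag(frame, lag_min=40, lag_max=200):
    """Lag (in samples) of the largest autocorrelation value in [lag_min, lag_max]."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # mode="full" returns lags -(N-1) ... (N-1); index N-1 is lag zero.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lags = np.arange(lag_min, lag_max + 1)
    return int(lags[np.argmax(r[lags])])


# Synthetic signal (assumption): period of 80 samples, F0 = 200 Hz at fs = 16 kHz.
fs = 16000
k = np.arange(640)
x = sum(np.sin(2 * np.pi * h * k / 80) / h for h in (1, 2, 3))
lag = autocorr_pitch_lag(x)
print(lag, fs / lag)
```

Because fewer samples overlap at longer lags, this unnormalized autocorrelation decays with the lag, which favours the peak at T over the one at 2T in this example.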
Autocorrelation
Summary
- The autocorrelation can thus be used for F0 estimation in a similar manner as the normalized correlation.
  - The mathematical differences are actually very small.
  - The autocorrelation is, though, a bit more robust to noise (it usually uses a longer window), but on the other hand it assumes that the pitch is stable over the whole window (less robust to changes in pitch).
Peak picking algorithms
- The presented algorithms for F0 estimation belong to the class of methods known as peak picking algorithms.
- The main problems of peak picking algorithms are:
  1. Choosing the right peak, that is, avoiding octave jumps. This often leads to heuristic algorithms.
  2. The samples do not always coincide with the actual peak, so we need interpolation to approximate the true pitch. If the peak is very narrow (e.g. a single sample), then interpolation is difficult.
  3. The estimate relies on very few samples, whereby the methods are sensitive to noise. A single noisy sample (a single high noise peak) can corrupt the estimate completely.
F0 tracking
- The absolute pitch is often less important than the changes in pitch.
  - For example, emphasis in a sentence is in many languages indicated with a higher F0.
  - It is therefore useful to track the F0 over time.
  - Conversely, by looking at pitch tracks we can easily spot octave jumps and other errors.
- We can apply post-processing to clean up the pitch estimate to obtain smooth pitch contours.
  - It is physically very difficult to change pitch abruptly, whereby it is sensible to require continuous, smooth pitch contours.
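One simple form of such post-processing is a running median over the frame-wise F0 estimates, which removes isolated octave jumps. The slides do not name a specific method, so the median filter, the window width of five frames and the example track below are assumptions chosen for illustration.

```python
import numpy as np


def median_smooth(f0_track, width=5):
    """Running median with an odd window; the edges are padded by repetition."""
    half = width // 2
    padded = np.pad(np.asarray(f0_track, dtype=float), half, mode="edge")
    return np.array([np.median(padded[i:i + width])
                     for i in range(len(f0_track))])


# Hypothetical F0 track in Hz with one upward and one downward octave jump.
track = [120, 122, 121, 242, 123, 124, 62, 125, 126]
smoothed = median_smooth(track)
print(smoothed.tolist())
# [120.0, 121.0, 122.0, 123.0, 123.0, 124.0, 124.0, 125.0, 126.0]
```

A median filter only removes jumps shorter than half its window. A run of several consecutive octave errors passes through unchanged, which is one reason why practical trackers often use more elaborate methods.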
F0 tracking
[Figure: F0 tracking example over about 1.2 s. Top: speech signal, amplitude vs. time. Middle: cepstral maximum lag and the resulting F0 estimate, with curves labelled Lagmax, 2Lagmax and 0.5Lagmax. Bottom: cepstral maximum relative amplitude Cmax/C0 vs. time (s).]
F0 estimation summary
- The fundamental frequency describes a basic property of speech, whereby its estimation is perceptually important.
- F0 is visible and can be estimated in many different domains:
  - Correlation analysis in the time domain and autocorrelations show peaks at the distance of the pitch lag and its multiples.
  - Magnitude spectra show a comb-structure with a spacing equal to the fundamental frequency.
  - Cepstra show peaks at the distance of the pitch lag and, weakly, also at its multiples.