Fundamental Frequency Modelling and Estimation
Speech Processing
Tom Bäckström, Aalto University, October 2015

F0 – Basics

- The fundamental frequency of a speech signal is generated by the quasi-periodic oscillation of the vocal folds, driven by airflow from the lungs. The fundamental frequency refers to the rate of oscillation and is thus a measure of the physical phenomenon. The pitch of a speech signal refers to the perceived frequency, that is, what a human listener hears.
- In speech, the fundamental frequencies are roughly in the range $F_0 \in [80, 400]$ Hz.
- Perception of pitch is a complex topic.
  - For example, you can remove the fundamental frequency of a harmonic signal, but a human listener will automatically deduce the fundamental from the upper harmonics, such that the perceived pitch is the now missing fundamental frequency. That is, we perceive something which is not there.

[Figure: Physiological generation of F0. Panel (a): displacement (left, right) versus time. Panel (b): magnitude (dB) versus frequency.]

[Figure: Voiced phone. Top: amplitude versus time (samples), with the pitch period length T marked. Middle: spectrum with the comb-structure of a harmonic signal, magnitude (dB) versus frequency (Hz), 0 to 8000 Hz. Bottom: low-frequency part of the spectrum, 0 to 3500 Hz, with the harmonics F0, 2F0, ..., 20F0 marked.]

- Let the length of the pitch period be $T$ (seconds).
  - The signal repeats itself after every $T$ seconds, so it also repeats itself at multiples of $T$, that is, at $kT$ with $k = 1, 2, 3, \ldots$.
- The fundamental frequency is then $F_0 = 1/T$ (Hertz).
- The harmonic frequencies are multiples of $F_0$, that is, $kF_0$ with $k = 1, 2, 3, \ldots$.
- If $f_s$ is the sampling frequency, then the pitch period length is $K = f_s T$ samples.
- The fundamental frequency is then $F_0 = f_s / K$.
- If we assume that $F_0 \in [80, 400]$ Hz, then $T \in [2.5, 12.5]$ ms and, with $f_s = 16$ kHz, $K \in [40, 200]$ samples.

F0 – Time

[Figure: Voiced phone, amplitude versus time (samples).]

- Periodicity means that the signal repeats itself after a time $T$.
- The signal should then satisfy $x(t) = x(t - T)$.
- We can then try to determine the $T$ for which $e(T) = |x(t) - x(t - T)|^2$ is minimized.
- The error measure can be simplified, treating the energy terms as approximately constant:
  $e(T) = |x(t)|^2 + |x(t - T)|^2 - 2x(t)x(t - T) = -2x(t)x(t - T) + \text{constant}$.
- In other words, to find the period length we search for the lag $T$ that maximizes the correlation $x(t)x(t - T)$.

- It is often useful to normalize the correlation such that it lies in the interval $c(T) \in [-1, +1]$:
  $c(T) = \frac{x_t^\top x_{t-T}}{\|x_t\| \, \|x_{t-T}\|}$.
- Algorithm
  1. Choose a segment $x_t = [x(t), x(t+1), \ldots, x(t+N-1)]^\top$.
  2. For each $T \in [40, 200]$, determine the correlation $c(T) = \frac{x_t^\top x_{t-T}}{\|x_t\| \, \|x_{t-T}\|}$.
  3. Find the $T$ for which $c(T)$ is maximized.

[Figure: Signal and delayed windows $x_{t-57}$, $x_{t-97}$, $x_{t-137}$, $x_{t-177}$, amplitude versus time (samples); and the normalized correlation $c(T)$ versus lag (samples), with $T = 40$, $T_{\max} = 95$ and $T = 200$ marked.]

- We found the pitch period at $T_{\max} = 95$ samples.
- Observe that there was a large peak at $2T_{\max}$ as well.
  - Correlation analysis can often find a multiple of the true period $T$.
  - We need separate safeguards to check whether the maximum corresponds to a multiple of the true period.
  - Usually we would check whether $2T_{\max}$ or $3T_{\max}$ would also be plausible, and use some other information (such as the pitch of the previous frame) to choose which one is best.
- More peaks can be seen at $\frac{1}{2}T_{\max}$ and other fractions.
  - It is often difficult to decide which is the right one.

F0 – Time Summary

- The fundamental pitch period can be estimated by a correlation analysis in the time domain.
- A frequently appearing problem is that if there is a correlation at a distance $T$, then there will also be a correlation at a distance $2T$ (as well as $\frac{1}{2}T$).
- This approach is often used in, for example, speech codecs (your mobile phone does this all the time).

F0 – Spectrum

- The spectrum of a harmonic signal features a comb-structure, which is easy to locate.
- Identifying the first peak of the comb-structure gives the fundamental frequency.
- A problem similar to that in the time domain frequently appears if we measure $F_0$ in the spectrum.
  - Often, the highest peak is a multiple of $F_0$, that is, the highest peak is $2F_0$ or $3F_0$ etc.
  - We then need to check whether $\frac{1}{2}F_{\max}$ or $\frac{1}{3}F_{\max}$ could be better estimates of $F_0$.

[Figure: Voiced phone, amplitude versus time (samples); and its spectrum, envelope and formants F1 to F5, magnitude (dB) versus frequency (Hz), 0 to 8000 Hz.]

[Figures: seven phonation examples, each showing amplitude versus time (samples), the spectrum as magnitude (dB) versus frequency (Hz) from 0 to 8000 Hz, and the low-frequency part of the spectrum from 0 to 800 Hz.]

F0 – Spectrum

- Typical basic algorithm:
  1. Find the highest peak in the interval $\hat{f} \in [80, 450]$ Hz as the first fundamental frequency estimate.
  2. Check whether this is an integer multiple of the fundamental: if there is a peak at $\hat{f}/2$, $\hat{f}/3$ or $\hat{f}/4$, then use that as the fundamental frequency estimate.
- Finding the integer-multiple peak is a typical problem.
  - Since the second harmonic $2F_0$ lies one octave above the fundamental, this error is known as an octave error; the term is used here for any integer multiple.
  - When analyzing the $F_0$ in subsequent windows, an octave error in one frame can cause a jump in the $F_0$ estimate. Such errors are known as octave jumps.
  - Perceptually (for example in speech coding), octave jumps are very easily perceived, but octave errors less so.
    - Many methods try to avoid octave jumps even when it means making octave errors more often.
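The two steps of the basic algorithm above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a robust estimator: the function names, the Hann window, the 1 Hz search grid and the acceptance rule `rel_threshold` (a sub-multiple counts as a peak if its magnitude is at least half that of the highest peak) are choices made for this example and do not come from the slides. The spectrum is evaluated by a direct DFT sum on the search grid to keep the example free of dependencies; an FFT would normally be used instead.

```python
import cmath
import math


def harmonic_signal(f0, amplitudes, fs, n_samples):
    """Synthetic test signal: sum of sinusoids at f0, 2*f0, 3*f0, ..."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0 * n / fs)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]


def magnitude_at(x, freq_hz, fs):
    """Magnitude of the discrete-time Fourier transform of x at freq_hz."""
    w = -2j * math.pi * freq_hz / fs
    return abs(sum(v * cmath.exp(w * n) for n, v in enumerate(x)))


def estimate_f0_spectrum(x, fs, fmin=80.0, fmax=450.0, step=1.0,
                         rel_threshold=0.5):
    """Highest peak in [fmin, fmax], followed by a check of f/2, f/3, f/4.

    rel_threshold is an assumed heuristic, not a value from the slides.
    """
    # Hann window to limit leakage between neighbouring harmonics.
    size = len(x)
    xw = [v * (0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1)))
          for n, v in enumerate(x)]

    # Step 1: highest spectral peak on the search grid.
    n_points = int(round((fmax - fmin) / step)) + 1
    freqs = [fmin + i * step for i in range(n_points)]
    mags = [magnitude_at(xw, f, fs) for f in freqs]
    best = max(range(n_points), key=mags.__getitem__)
    f_hat, m_hat = freqs[best], mags[best]

    # Step 2: prefer the lowest sub-multiple that still carries a clear peak.
    for divisor in (4, 3, 2):
        candidate = f_hat / divisor
        if (candidate >= fmin
                and magnitude_at(xw, candidate, fs) >= rel_threshold * m_hat):
            return candidate
    return f_hat


if __name__ == "__main__":
    fs = 16000
    # 120 Hz fundamental whose second harmonic is the strongest component.
    x = harmonic_signal(120.0, [0.6, 1.0, 0.3, 0.2], fs, 800)
    print(estimate_f0_spectrum(x, fs))
```

In the demo signal the second harmonic is the strongest component, so step 1 alone would pick the peak near 240 Hz; the sub-multiple check in step 2 is what brings the estimate back to the fundamental. The threshold trades octave errors against false sub-harmonic detections and would need tuning on real speech.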

F0 – Spectrum

- The octave error problem comes from the fact that the first few harmonics can have very high energy.
  - Especially when the first formant is low (such as in /u/), the envelope has a peak near the first few harmonics.
  - It is often difficult to determine which lower-frequency peaks are part of the comb-structure and which are noise.
- Another typical problem is that it is difficult to determine whether the frame is voiced or not (unvoiced or non-speech).
  - How do we determine whether the spectrum has a comb-structure or not?
  - We could find some peaks even when the signal is pure noise.
  - How do we decide whether a peak is part of a comb-structure (harmonic signal) or noise? Difficult.

F0 – Spectrum

- How do we determine where the peak is? What if $F_0$ does not fall exactly on a frequency bin?
  - Often we would need to interpolate in the vicinity of the peak.
  - We can, for example, fit a second-order polynomial to the peak and find the maximum of the polynomial.
- If the pitch changes rapidly within the analysis window, the comb-structure becomes smeared.
  - If the first harmonic moves from $F_0$ to $F_0 + \Delta f$, then the $k$th harmonic moves from $kF_0$ to $k(F_0 + \Delta f)$.
  - The movement of the $k$th harmonic is thus $k\Delta f$.
  - The higher harmonics move a lot, whereby only the first few harmonics remain visible.

[Figure: five time signals whose period changes within the window from T = 50 to 50, 51, 52, 53 and 54 (amplitude versus time), each shown with its spectrum (magnitude (dB) versus frequency).]

F0 – Spectrum Summary

- The fundamental frequency is visible in the spectrum as a comb-structure.
- It is easy to develop an algorithm which estimates the fundamental frequency in the frequency domain.
  - Choose the first big peak.
- It is not easy to develop a robust algorithm which estimates the fundamental frequency in the frequency domain.
  - Octave errors and jumps are a problem.
  - Changing frequency is a problem.

F0 – Cepstrum

- Recall that the cepstrum was defined as $|\mathcal{F}\{\log(|\mathcal{F}\{x_n\}|)\}|$ and that it can be used for $F_0$ estimation.
- If the signal has a comb-structure, then the cepstrum has a peak whose location corresponds to the pitch period $T$.
- We can simply search for the largest peak in the range $F_0 \in [80, 400]$ Hz, which corresponds to $T \in [2.5, 12.5]$ ms, that is, 40 ... 200 samples at $f_s = 16$ kHz.

[Figure: Voiced phone, amplitude versus time (samples); its spectrum, magnitude (dB) versus frequency (Hz); and its cepstrum, magnitude versus quefrency, with the pitch period T and its multiples 2T and 3T marked.]

[Figures: eight phonation examples, each showing amplitude versus time (samples), the spectrum as magnitude (dB) versus frequency (Hz) from 0 to 8000 Hz, and the cepstrum as magnitude versus quefrency (samples).]

Cepstrum Summary

- The cepstrum can also be used for $F_0$ estimation.
- It is a time-domain representation (its axis, the quefrency, is measured in samples), so the main peak corresponds to the pitch lag.
- The cepstrum is more robust to octave jumps than the other methods presented.
- The cepstrum is, however, sensitive to background noise.
- The complexity is slightly higher than that of the other methods, because we need two FFTs and the logarithm of the whole spectrum.

Autocorrelation

- We have already seen that the fundamental frequency is visible in the autocorrelation as a peak at the distance of the pitch period length.

[Figure: Voiced phone $s_k$, amplitude versus time (samples); and its autocovariance $c_k$ (scale ×10^-3) versus delay $k$.]

- We can therefore search for a peak in a suitable region.
- A fundamental frequency $F_0 \in [80, 400]$ Hz corresponds to autocorrelation lags of 40 ... 200 samples when the sampling frequency is $f_s = 16$ kHz.

Autocorrelation Summary

- The autocorrelation can thus be used for $F_0$ estimation in a similar manner as the normalized correlation.
  - The mathematical differences are actually very small.
  - The autocorrelation is, however, a bit more robust to noise (it usually uses a longer window), but on the other hand it assumes that the pitch is stable over the whole window (less robust to changes in pitch).

Peak picking algorithms

- The presented algorithms for $F_0$ estimation belong to the class of methods known as peak picking algorithms.
- The main problems of peak picking algorithms are:
  1. Choosing the right peak, that is, avoiding octave jumps. This often leads to heuristic algorithms.
  2. The samples do not always coincide with the actual peak, so we need interpolation to approximate the true pitch. If the peak is very narrow (e.g. a single sample), then interpolation is difficult.
  3. The estimate relies on very few samples, whereby the methods are sensitive to noise. A single noisy sample (a single high noise peak) can corrupt the estimate completely.

F0 tracking

- The absolute pitch is often less important than the changes in pitch.
  - For example, emphasis in a sentence is in many languages indicated with a higher $F_0$.
  - It is therefore useful to track the $F_0$ over time.
- Conversely, by looking at pitch tracks we can easily spot octave jumps and other errors.
  - We can apply post-processing to clean up the pitch estimate to obtain smooth pitch contours.
  - It is physically very difficult to change pitch abruptly, whereby it is sensible to require continuous, smooth pitch contours.
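One possible form of the post-processing mentioned above is sketched below in Python. The slides do not prescribe a method, so everything here is an assumption made for illustration: the function names, the convention that a value of 0 marks an unvoiced frame, the candidate factors 1/2, 1 and 2, and the three-frame median. The first function replaces each frame by whichever octave-shifted candidate lies closest, on a logarithmic scale, to the previous corrected frame; the second removes isolated outliers with a running median.

```python
import math
import statistics


def correct_octave_jumps(f0_track, factors=(0.5, 1.0, 2.0)):
    """Pick, per frame, the candidate f0*k closest to the previous frame.

    Frames with f0 <= 0 are treated as unvoiced (assumed convention) and
    reset the reference, so the next voiced frame is accepted as it is.
    """
    cleaned = []
    previous = None
    for f0 in f0_track:
        if f0 <= 0:
            cleaned.append(0.0)
            previous = None
            continue
        if previous is None:
            best = f0
        else:
            best = min((f0 * k for k in factors),
                       key=lambda c: abs(math.log(c / previous)))
        cleaned.append(best)
        previous = best
    return cleaned


def median_smooth(f0_track, width=3):
    """Running median over the voiced frames inside a centred window."""
    half = width // 2
    smoothed = []
    for i, f0 in enumerate(f0_track):
        if f0 <= 0:
            smoothed.append(0.0)
            continue
        window = [v for v in f0_track[max(0, i - half):i + half + 1] if v > 0]
        smoothed.append(statistics.median(window))
    return smoothed


if __name__ == "__main__":
    raw = [100, 102, 204, 101, 50.5, 103]  # two octave jumps
    print(median_smooth(correct_octave_jumps(raw)))
```

Note that this kind of correction favours continuity over correctness: if the first voiced frame of a segment carries an octave error, the error is propagated to the following frames rather than removed, which matches the trade-off between octave jumps and octave errors described earlier.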

[Figure: F0 tracking. Top: speech signal, amplitude versus time (s). Middle: F0 estimate from the cepstral maximum lag, showing $\text{Lag}_{\max}$, $2\,\text{Lag}_{\max}$ and $0.5\,\text{Lag}_{\max}$. Bottom: cepstral maximum relative amplitude $C_{\max}/C_0$ versus time (s).]

F0 estimation summary

- The fundamental frequency describes a basic property of speech, whereby its estimation is perceptually important.
- $F_0$ is visible and can be estimated in many different domains:
  - Correlation analysis in the time domain and autocorrelations show peaks at the distance of the pitch lag and its multiples.
  - Magnitude spectra show a comb-structure whose spacing is the fundamental frequency.
  - Cepstra show peaks at the distance of the pitch lag and, weakly, also at its multiples.
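As a concrete companion to the first domain in the summary, the time-domain search with the normalized correlation $c(T) = x_t^\top x_{t-T} / (\|x_t\| \, \|x_{t-T}\|)$ can be sketched in Python as below. The function names, the synthetic test signal and the default lag range of 40 to 200 samples (taken from the $f_s = 16$ kHz case) are assumptions of this sketch, and no safeguard against multiples of the true period is included.

```python
import math


def periodic_signal(period, amplitudes, n_samples):
    """Synthetic signal with an exact period of `period` samples."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * n / period)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]


def normalized_correlation(x, t, lag, n):
    """c(lag) between x[t:t+n] and the same segment delayed by `lag`."""
    current = x[t:t + n]
    delayed = x[t - lag:t - lag + n]
    numerator = sum(p * q for p, q in zip(current, delayed))
    energy = sum(p * p for p in current) * sum(q * q for q in delayed)
    return numerator / math.sqrt(energy) if energy > 0 else 0.0


def estimate_pitch_lag(x, t, n, lag_min=40, lag_max=200):
    """Return (lag, c(lag)) for the lag that maximizes c in the range."""
    if t < lag_max or t + n > len(x):
        raise ValueError("need lag_max samples before t and n samples after")
    scores = {lag: normalized_correlation(x, t, lag, n)
              for lag in range(lag_min, lag_max + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    fs = 16000
    x = periodic_signal(120, [1.0, 0.5, 0.3, 0.2], 600)
    lag, score = estimate_pitch_lag(x, t=200, n=360)
    print(lag, fs / lag, score)
```

The demo period of 120 samples is chosen so that its double, 240 samples, lies outside the search range. For a shorter period, both $T$ and $2T$ would fall inside the range with nearly identical correlation, which is exactly the ambiguity the slides warn about; a practical estimator would add a rule for choosing between them, for example by using the lag of the previous frame.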