digital processing of speech and image signals

Lecture DIGITAL PROCESSING OF SPEECH AND IMAGE SIGNALS RWTH Aachen, WS 2006/7 Prof. Dr.-Ing. H. Ney, Dr.rer.nat. R. Schlüter Lehrstuhl für Informatik 6 RWTH Aachen 1. System Theory and Fourier Transform 2. Discrete Time Systems 3. Spectral Analysis 4. Fourier Transform and Image Processing 5. LPC Analysis 6. Wavelets 7. Coding 8. Image Segmentation and Contour-Finding Completions: L. Welling, A. Eiden; April 1997 Completions: J. Dahmen, F. Hilger, S. Koepke; Mai 2000 Completions: F. Hilger, D. Keysers; Juli 2001 Translation: M. Popović, R. Schlüter; April 2003 Corrections: D. Stein; October 2006 Literature: • A. V. Oppenheim, R. W. Schafer: Discrete Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989. • A. Papoulis: Signal Analysis, McGraw-Hill, New York, NY, 1977. • A. Papoulis: The Fourier Integral and its Applications, McGraw-Hill Classic Textbook Reissue Series, McGraw-Hill, New York, NY, 1987. • W. K. Pratt: Digital Image Processing, Wiley & Sons Inc, New York, NY, 1991. Further reading: • T. K. Moon, W. C. Stirling: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2000. • J. R. Deller, J. G. Proakis, J. H. L. Hansen: Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, New York, NY, 1993. • W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery: Numerical Recipes in C, Cambridge Univ. Press, Cambridge, 1992. • L. Rabiner, B. H. Juang: Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993. • T. Lehmann, W. Oberschelp, E. Pelikan, R. Repges: Bildverarbeitung für die Medizin, Springer Verlag, Berlin, 1997. • L. Berg: Lineare Gleichungssysteme mit Bandstruktur, VEB Deutscher Verlag der Wissenschaften, Berlin, 1986. Contents 1 System Theory and Fourier Transform 1.1 Introduction . . . . . . . . . . . . . . . 1.2 Linear time-invariant Systems . . . . . 1.3 Fourier Transform . . . . . . . . . . . . 1.4 Properties of the Fourier Transform . . 1.5 Parseval Theorem . . . . . . . . . . . . 1.6 Autocorrelation Function . . . . . . . . 1.7 Existence of the Fourier Transform . . 1.8 δ-Function . . . . . . . . . . . . . . . . 1.9 Motivation for Fourier Series . . . . . . 1.10 Time Duration and Band Width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Discrete Time Systems 2.1 Motivation and Goal . . . . . . . . . . . . . . . . . . . . . 2.2 Digital Simulation using Discrete Time Systems . . . . . . 2.3 Examples of Discrete Time Systems . . . . . . . . . . . . . 2.4 Sampling Theorem (Nyquist Theorem) and Reconstruction 2.5 Logarithmic Scale and dB . . . . . . . . . . . . . . . . . . 2.6 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Fourier Transform and z–Transform . . . . . . . . . . . . . 2.8 System Representation and Examples . . . . . . . . . . . . 2.9 Discrete Time Signal Fourier Transform Theorem . . . . . 2.10 Discrete Fourier Transform: DFT . . . . . . . . . . . . . . 2.11 DFT as Matrix Operation . . . . . . . . . . . . . . . . . . 2.12 From Continuous Fourier Transform to Matrix Representation of Discrete Fourier Transform . . . . . . . . . . . . . . 2.13 Frequency Resolution and Zero Padding . . . . . . . . . . 2.14 Finite Convolution . . . . . . . . . . . . . . . . . . . . . . 2.15 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . 2.16 FFT Implementation . . . . . . . . . . . . . . . . . . . . . i 1 2 11 16 25 33 34 35 36 41 45 51 52 53 56 61 70 72 74 78 88 90 98 102 104 105 108 118 2.17 Cyclic Matrices and Fourier Transform . . . . . . . . . . . 124 3 Spectral analysis 131 3.1 Features for Speech Recognition . . . . . . . . . . . . . . . 132 3.2 Short Time Analysis and Windowing . . . . . . . . . . . . 135 3.3 Autocorrelation Function and Power Spectral Density . . . 159 3.4 Spectrograms . . . . . . . . . . . . . . . . . . . . . . . . . 165 3.5 Filter Bank Analysis . . . . . . . . . . . . . . . . . . . . . 168 3.6 Mel-frequency scale . . . . . . . . . . . . . . . . . . . . . . 171 3.7 Cepstrum . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 3.8 Statistical Interpretation of the Cepstrum Transformation 183 3.9 Energy in acoustic Vector . . . . . . . . . . . . . . . . . . 185 4 Fourier Transform and Image Processing 4.1 Spatial Frequencies and Fourier Transform for Images 4.2 Discrete Fourier Transform for Images . . . . . . . . 4.3 Fourier Transform in Computer Tomography . . . . . 4.4 Fourier Transform and RST Invariance . . . . . . . . 5 LPC Analysis 5.1 Principle of LPC Analysis . . . . . . . . . 5.2 LPC: Covariance Method . . . . . . . . . . 5.3 LPC: Autocorrelation Method . . . . . . . 5.4 LPC: Interpretation in Frequency Domain 5.5 LPC: Generative Model . . . . . . . . . . 5.6 LPC: Alternative Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 188 196 197 199 . . . . . . 207 208 212 213 216 221 223 6 Outlook: Wavelet Transform 225 6.1 Motivation: from Fourier to Wavelet Transform . . . . . . 226 6.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 6.3 Discrete Wavelet Transform . . . . . . . . . . . . . . . . . 229 7 Coding (appendix available as separate document) 233 8 Image Segmentation and Contour-Finding 237 ii List of Figures 1.1 Oscillograms of three time functions composed as sum of 20 partial oscillations. a) φn = 0, b) φn = π2 , c) φn statistical. 3 Amplitude spectrum of a time function composed as sum of 20 partial tones. . . . . . . . . . . . . . . . . . . . . . . . . 3 from left to right: original photo, low-pass and high-pass filtered version . . . . . . . . . . . . . . . . . . . . . . . . 3 Phase manipulation for portion of a speech signal (vowel ’o’) sampled at 8kHz, 25ms analysis window (200 samples), 512 point FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Phase manipulation for portion of a speech signal (consonant ’n’) sampled at 8kHz, 25ms analysis window (200 samples), 512 point FFT . . . . . . . . . . . . . . . . . . . . . . . . 5 1.6 Phase manipulation for a Heaviside–function (step–function) 6 1.7 Schematic representation of the physiological mechanism of speech production . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 Digital photo . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.2 Gradient image . . . . . . . . . . . . . . . . . . . . . . . . 58 2.3 Several real cases of Laplace Operator subtraction from original image. a) Original image b) Original image minus Laplace Operator (negative values are set to 0 and values above the grey scale are set to the highest grade of grey) . 60 Ideal reconstruction of a band-limited signal (from Oppenheim, Schafer) a) original signal b) sampled signal c) reconstructed signal 64 1.2 1.3 1.4 1.5 2.4 iii 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 2.17 Sampling of band-limited signal with different sampling rates: b) sampling rate higher than Nyquist rate - exact reconstruction possible c) sampling rate equal to Nyquist rate - exact reconstruction possible d) sampling rate smaller than Nyquist rate - aliasing - exact reconstruction not possible . . . . . . . . . . . . . . . . . . 65 Amplitude spectrum of the voiceless phoneme “s” from the word “ist” . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Logarithmic amplitude spectrum of the phoneme “s” . . . 71 Amplitude spectrum of the voiced phoneme “ae” from the word “Äh” . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Logarithmic amplitude spectrum of the phoneme “ae” . . . 71 Amplitude spectrum of a speech pause . . . . . . . . . . . 71 Logarithmic amplitude spectrum of a speech pause . . . . 71 Hanning window . . . . . . . . . . . . . . . . . . . . . . . 103 Example of a linear convolution of two finite length signals: a) two signals; b) signal x[n-k] for different values of n: i) n < 0, no overlap with h[k], therefore convolution y[n] = 0 ii) n between 0 and Nh + Nx − 2, convolution 6= 0 iii) n > Nh + Nx − 2, no overlap with h[k], convolution y[n] =0 c) resulting convolution y[n]. . . . . . . . . . . . . . . . . . 106 Flow diagram for decomposition of one N -DFT to two N/2– DFTs with N = 8 . . . . . . . . . . . . . . . . . . . . . . . 110 Flow diagram of an 8–point–FFT using Butterfly operations. 111 Flow diagram of an 8–point–FFT using Butterfly operations. 120 Input and output arrays of an FFT. a) The input array contains N (N is power of 2) complex input values in one real array of the length 2N . with alternating real and imaginary parts. b) The output array contains complex Fourier spectrum at N frequency values. Again alternating real and imaginary parts. The array begins with the zero-frequency and then goes up to the highest frequency followed with values for the negative frequencies. . . . . . . . . . . . . . 122 iv 3.1 3.2 3.3 3.4 3.5 Example for the application of the Discrete Fourier Transform (DFT). . . . . . . . . . . . . . . . . . . . . . . . . . . 138 a) signal v[n]; b) DFT-spectrum V [k]; c) Fourier spectrum V (ejω ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 a) signal v[n]; b) DFT-spectrum V [k]; c) Fourier spectrum V (ejω ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 a) DFT of length N = 64; b) DFT of length N = 128; c) Fourier spectrum V (ejω ). . . . . . . . . . . . . . . . . . . . 151 Influence of the window function: above: speech signal (vowel “a”); central: 512 point FFT using rectangle window; below: 512 point FFT using Hamming window . . . . . . . . . . . . . . . . . . . . . . . . . 158 3.6 Fourier Transform of a voiced speech segment: a) signal progression, b) high resolution Fourier Transform, c) low resolution Fourier Transform with short Hamming window (50 sampled values), d) low resolution Fourier Transform using autocorrelation function (19 coefficients), e) low resolution Fourier Transform using autocorrelation function (13 coefficients) . . . . . . . . . . . . . . . . . . . . . . . . 162 3.7 Signal progression and autocorrelation function of voiced (left) and unvoiced (right) speech segment . . . . . . . . . 163 Temporal progression of speech signal and four autocorrelation coefficients . . . . . . . . . . . . . . . . . . . . . . . . 164 3.8 3.9 a) wide-band spectrogram: short time window, high time resolution (vertical lines), no frequency resolution; for voiced signals provides information on formant structure b) narrowband spectrogram: long time window, no time resolution, high frequency resolution (horizontal lines); for voiced signals provides information on fundamental frequency (pitch) 166 3.10 Wide-band and narrow-band spectrogram and speech amplitude for the sentence “Every salt breeze comes from the sea”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 3.11 Above: logarithmized power spectrum of a spoken vowel (schematic). Below: corresponding cepstrum (inverse Fourier–transform of the logarithmized power spectrum). . . . . . . . . . . . 177 v 3.12 Cepstral smoothing: speech signal (vowel “a”), windowed speech signal (Hamming window), spectrum obtained from the whole cepstrum (blue) and smoothed spectrum obtained from the first 13 cepstral coefficients (red). . . . . . . . . . 3.13 Homomorph analysis of a speech segment: signal progression, homomorph smoothed spectrum using 13 and 19 cepstral coefficients . . . . . . . . . . . . . . . . . . . . . . . . 179 4.1 4.2 4.3 4.4 4.5 4.6 TV–image (analog) . . . . . . Digitized TV–image . . . . . . Amplitude spectrum of Figure Low-pass filtered . . . . . . . High-pass filtered . . . . . . . High-pass enhancement . . . . 193 193 193 193 194 194 5.1 LPC–analysis of one speech segment a) signal progression, b) prediction error (K=12), c) LPC– spectrum with K=12 coefficients, d) spectrum of the prediction error (K=12), e) LPC–spectrum with K=18 coefficients 219 LPC–Spectra for different prediction orders K . . . . . . . 220 5.2 . . . . 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 List of Tables 2.1 2.2 Fourier transform pairs . . . . . . . . . . . . . . . . . . . . Fourier transform Theorems . . . . . . . . . . . . . . . . . 87 88 Chapter 1 System Theory and Fourier Transform • Overview: 1.1 Introduction 1.2 Linear time-invariant Systems 1.3 Fourier Transform 1.4 Properties of Fourier Transform 1.5 Parseval Theorem 1.6 Autocorrelation Function 1.7 Existence of the Fourier Transform 1.8 δ–Function 1.9 Fourier Series 1.10 Duration and Band Width Digital Processing of Speech and Image Signals 1 WS 2006/2007 1.1 Introduction What distinguishes the Fourier Transform (FT) from other transformations? 1. Mathematical property of linear time-invariant systems: • FT decomposes the time signal into “eigenfunctions” “eigenfunctions” keep their form by passing the linear time-invariant system Ax=λx • Magnitude of FT: shift invariant 2. Physical observation: • Human ear produces sort of FT, essentially only magnitude of FT (strictly speaking: short-time FT) • Example: Time functions with different evolution can sound equally. The human ear either senses sense phase differences of partial tones of the complete sound of stationary processes very weakly, or does not sense them at all. Fourier transform in speech processing: • Calculation of the spectral components of speech • Basic method for obtaining observations (features) for speech recognition Digital Processing of Speech and Image Signals 2 WS 2006/2007 φ=0 0 0 φ = π/2 0 0 random φ 0 0 Figure 1.2: Amplitude spectrum of a time function composed as sum of 20 partial tones. 0 Figure 1.1: Oscillograms of three time functions composed as sum of 20 partial oscillations. a) φn = 0, b) φn = π2 , c) φn statistical. Figure 1.3: from left to right: original photo, low-pass and high-pass filtered version Digital Processing of Speech and Image Signals 3 WS 2006/2007 amplitude spectrum 0 original signal 0 0 inverse FT for phase φ(f ) = 0 0 0 Inverse FT for phase φ(f ) = π 2 0 0 Inverse FT for random phase φ(f ) 0 0 Figure 1.4: Phase manipulation for portion of a speech signal (vowel ’o’) sampled at 8kHz, 25ms analysis window (200 samples), 512 point FFT Digital Processing of Speech and Image Signals 4 WS 2006/2007 amplitude spectrum 0 original signal 0 0 inverse FT for phase φ(f ) = 0 0 0 inverse FT for phase φ(f ) = π 2 0 0 inverse FT for random phase φ(f ) 0 0 Figure 1.5: Phase manipulation for portion of a speech signal (consonant ’n’) sampled at 8kHz, 25ms analysis window (200 samples), 512 point FFT Digital Processing of Speech and Image Signals 5 WS 2006/2007 amplitude spectrum 0 original signal 0 inverse FT for phase φ(f ) = 0 0 0 inverse FT for phase φ(f ) = π 2 0 0 inverse FT for random phase φ(f ) 0 0 Figure 1.6: Phase manipulation for a Heaviside–function (step–function) Digital Processing of Speech and Image Signals 6 WS 2006/2007 Why Fourier? Roughly: • Production, description and algorithmic operations on signals (functions or measurement curves over the time axis) can be described very well in Fourier domain (frequency domain). Deeper reason: • Production, description and algorithmic operations on signals are largely based on linear time-invariant (LTI) operations. • Fourier Transform: simple representation of LTI-operations (later: convolution theorem) Why continuous? • Real world is continuous • Computer (digital = time discrete = sampled) model of the real world Digital Processing of Speech and Image Signals 7 WS 2006/2007 a) A glottal pulses vocal tract filter T t b) speech [a:] radiation from lips and nose T t |E(f)| [dB] [a:] |V(f)| [dB] |A(f)| [dB] |S(f)| [dB] [a:] 1/T 0 * 4 = * 0 4 0 4 0 NOSE OUTPUT NASAL CAVITY VELUM PHARYNX CAVITY VOCAL CORDS LARYNX TUBE MOUTH CAVITY TONGUE HUMP MOUTH OUTPUT TRACHEA AND BRONCHI LUNG VOLUME MUSCLE FORCE Figure 1.7: Schematic representation of the physiological mechanism of speech production Digital Processing of Speech and Image Signals 8 WS 2006/2007 4 signal (speech, image) feature extraction (signal analysis) feature vector (pattern vector) (pattern) comparison reference data (vectors, features) decision Examples: • Spoken language • Written numbers (letters) • Cell recognition (red blood cells) Digital Processing of Speech and Image Signals 9 WS 2006/2007 Examples of applications of Fourier Transform: • Electrical switchgears • Recognition and coding – Speech and general acoustic signals – Image signals • Time series analysis: – Astronomical measurement curves – Stock-market course – ... • Computer tomography • Solving differential equations • Description of image production in optical systems Digital Processing of Speech and Image Signals 10 WS 2006/2007 1.2 Linear time-invariant Systems Example: – speech production – electrical systems h(t) input signal x(t) output signal y(t) S symbolic: {t → y(t)} = S {t → x(t)} simplified: y(t) = S {x(t)} • Note: the complete time domain of the function is important, not individual positions in time t. more exact: y = S {x} LTI–System: (LTI = Linear Time-Invariant) • Linear: Additive: S {x1 + x2 } = S {x1 } + S {x2 } Homogeneous: S {α x} = αS {x} , α ∈ IR • Time-invariant: {t → y(t − t0 )} = S {t → x(t − t0 )} , Digital Processing of Speech and Image Signals 11 t0 ∈ IR WS 2006/2007 Mathematical theorem: • Linearity and time invariance result in convolution representation • Output signal y(t) of LTI system S with input signal x(t): y(t) = = Z∞ −∞ Z∞ −∞ x(t − τ ) h(τ ) dτ x(τ ) h(t − τ ) dτ = x(t) ∗ h(t) • h: impulse response of the system S e (t) ∆τ x (t) 1/∆τ ∆τ t τi t • system response h∆τ (t) to excitation e∆τ (t): h∆τ (t) = S {e∆τ (t)} • signal x(t) is represented as sum of amplitude weighted and time shifted elementary functions e∆τ (t): " # X x(t) = lim x(τi ) e∆τ (t − τi ) ∆τ ∆τ →0 i Digital Processing of Speech and Image Signals 12 WS 2006/2007 Hence the following holds for the output signal y(t): y(t) = S {x(t)} = S = ( lim ∆τ →0 " lim ∆τ →0 S X (i additivity: = lim ∆τ →0 " X x(τi ) e∆τ (t − τi ) ∆τ X i ) x(τi ) e∆τ (t − τi ) ∆τ S { x(τi ) e∆τ (t − τi ) ∆τ } i )# # homogeneity (for x(τi ) and ∆τ ): " # X = lim x(τi ) S { e∆τ (t − τi ) } ∆τ ∆τ →0 i time invariance: = lim ∆τ →0 " X i x(τi ) h∆τ (t − τi ) ∆τ # limiting case ∆τ → 0 : X ∆τ τi h∆τ (t) −→ −→ −→ −→ Z dτ τ h(t) result: y(t) = Z∞ −∞ h(t): x(τ ) h(t − τ ) dτ = x(t) ∗ h(t) impulse response of the system Digital Processing of Speech and Image Signals 13 WS 2006/2007 Examples of LTI-operations: • Oscillatory systems (electrical or mechanical) with external excitation: h(τ ) x(t) −→ y(t) = Z −→ y(t) h(t − τ ) x(τ ) dτ y ′′ (t) + 2αy ′ (t) + β 2 y(t) = x(t) α, β: parameters depending on the oscillatory system • More general electrical engineering systems: high-pass, low-pass, band-pass • Sliding average value: x(t) −→ S −→ y(t) := x(t) +T Z /2 1 x(t) = T x(t + τ ) dτ −T /2 • Differentiator: x(t) −→ −→ y(t) := x′ (t) S • Comb filter: ”hypothesized” period T x(t) −→ S −→ y(t) := x(t) − x(t − T ) • In general: linear differential equations with coefficients ck and dl P P ck y (k) (t) = dl x(l) (t) k l [ + further constraints ] Digital Processing of Speech and Image Signals 14 WS 2006/2007 Example of a non-linear system: system: y(t) = x2 (t) x(t) = A cos(βt) A2 (1 + cos(2βt)) =⇒ y(t) = A cos (βt) = 2 2 2 frequency doubling Digital Processing of Speech and Image Signals 15 WS 2006/2007 1.3 Fourier Transform Sinusoidal oscillation: x(t) = A sin ( ω t + ϕ ) amplitude A phase / null phase ϕ angular frequency ω = 2 π f j 2 = −1, j∈C Im 1 sin α α cos α 1 complex representation: Re ej α = cos α + j sin α, ejα + e−jα cos α = 2 and α ∈ IR ejα − e−jα sin α = 2j dimension: DIM(ω) DIM(t) = 1 DIM(ω) = 1 1 = = [Hz] DIM(t) [sec] Digital Processing of Speech and Image Signals 16 WS 2006/2007 LTI-System y(t) = Z∞ −∞ x(t − τ )h(τ )dτ = x(t) ∗ h(t) • Determine the following specific input signal: x(t) = A ej(ωt+ϕ) • For this input signal the output signal becomes: y(t) = Z∞ A ej(ω(t−τ )+ϕ) h(τ )dτ −∞ = A ej(ωt+ϕ) Z∞ h(τ )e−jωτ dτ −∞ = | {z } H(ω) = F {h(τ )} x(t) · H(ω) • Definition of the Fourier transform: Z∞ h(τ )e−jωτ dτ = F {h(τ )} = F {τ → h(τ )} H(ω) = −∞ (→ decomposition into e−jωτ ) • H(ω) is called transfer function of the system Remark about x(t) = A ej(ωt+ϕ) : • The shape of the input signal x(t), i.e. its frequency ω (“eigenfunction”) remains invariant • Amplitude (intensity) and phase (time shift) are depending on H(ω) (“eigenvalue”) (→ analogy to the problem of eigenvalues in linear algebra) Digital Processing of Speech and Image Signals 17 WS 2006/2007 Remarks • FT is complex: H(ω) = Re {H(ω)} + j Im {H(ω)} = |H(ω)| ejΦ(ω) • Amplitude (spectrum): q Re {H(ω)}2 + Im {H(ω)}2 |H(ω)| = • Phase (spectrum):  Im {H(ω)}   arctan   Re {H(ω)}         Im {H(ω)}   arctan + π   Re {H(ω)}    Φ(ω) = π    2           π   −     2 Digital Processing of Speech and Image Signals 18 Re {H(ω)} > 0 Re {H(ω)} < 0 Re {H(ω)} = 0, Im {H(ω)} > 0 Re {H(ω)} = 0, Im {H(ω)} < 0 WS 2006/2007 Examples of Fourier transforms: 1. Rectangle function t h(t) = rect( ) = T H(ω) = Z∞ 1, 0, |t| ≤ T /2 |t| > T /2 T −jωt h(t)e dt = −∞ = Z2 −jωt e − T2 i 1 h −jω T jω T2 2 e −e dt = −jω ωT ) T sin( 2 ωT 2 sin( ) = ωT ω 2 2 (here: Im {H(ω)} = 0) h(t) t H(ω) ω Digital Processing of Speech and Image Signals 19 WS 2006/2007 2. Double-sided exponential h(t) = e−α|t| H(ω) = = Z∞ −∞ Z∞ with α > 0 h(t)e−jωt dt e−(α+jω)t dt + 0 = = = = Z∞ e−(α−jω)t dt 0 ∞ e−(α−jω)t e + −(α + jω) −(α − jω) 0 1 1 0+0− − −(α + jω) −(α − jω) α − jω + α + jω α2 + ω 2 2α α2 + ω 2 −(α+jω)t • Imaginary part equals 0 • Infinite spectrum • No zeros H(ω ) h(t) ω t • If h(t) is symmetric (i.e. h(t) = h(−t)), imaginary parts drop away and the real part is sufficient Digital Processing of Speech and Image Signals 20 WS 2006/2007 3. Damped oscillations h(t) = e−α|t| cos(βt) with α > 0 H(ω) = = Z∞ −∞ Z∞ h(t)e−jωt dt e−(α+jω)t cos(βt)dt + 0 = Z∞ = e−(α−jω)t cos(βt)dt 0 e−(α+jω)t ejβt + e−jβt dt + 2 0 = Z∞ Z∞ e−(α−jω)t ejβt + e−jβt dt 2 0 ... (elementary calculation) α α + α2 + (ω − β)2 α2 + (ω + β)2 • Limiting case: H(ω)|ω=±β = 1 α + α α2 + (2β)2 =⇒ tends towards ∞ or −∞ if α tends towards 0 H(ω ) h(t) t Digital Processing of Speech and Image Signals 21 | | −β β ω WS 2006/2007 4. Modulated rectangle function (“truncated cosine”) cos(β t), |t| ≤ T /2 h(t) = 0, |t| > T /2 H(ω) = Z∞ h(t)e−jωt dt −∞ T = Z2 cos(β t)e−jωt dt − T2 = ... =  (elementary calculation) T sin (ω − β) T 2   T 2 (ω − β) 2  T sin (ω + β) 2   +  T (ω + β) 2 h(t) h(t) | | t t H(ω) H(ω) ω Digital Processing of Speech and Image Signals 22 | | −β β ω WS 2006/2007 Fourier Transform pairs (u = ω/2π) Rectangle function Sinc function 1 1 -1/2 sin(πu) πu 1/2 Squared sinc function Triangle function 1 1 -1/2 1/2 Exponential function 2α α2+(2πu)2 e-α|x| Gaussian function e -αx 2 π - πu e α α 2 Unit impulse δ(x) 1 Digital Processing of Speech and Image Signals 23 WS 2006/2007 Inverse Fourier–transform Z∞ H(ω) = h(t)e−jωt dt −∞ assumption: with: h̃(t) = 1 2π H(ω) = Z∞ = = = H(ω)ejωt dω −∞ h(τ )e−jωτ dτ −∞ inserting H(ω) in h̃(t): h̃(t) = Z∞ 1 2π lim Ω,T →∞  ZΩ −Ω 1 lim lim 2π Ω→∞ T →∞ 1 π lim lim Ω→∞ T →∞ lim Ω→∞ 1 π Z∞ −∞ = h(t)  ZT  h(τ ) ejω(t−τ ) dτ  dω −T ZT ZΩ ejω(t−τ ) dω h(τ ) dτ −T −Ω ZT −T sin (Ω(t − τ )) h(τ ) dτ t−τ sin (Ω(t − τ )) h(τ ) dτ t−τ due to: 1 lim Ω→∞ π Z∞ −∞ (→ sin(Ωt) h(t) dt = h(0) t −∞ formal expression: h(t) = Z∞  1 2π | Z∞ −∞  ejω(t−τ ) dω  h(τ ) dτ {z = δ(t − τ ) } distribution theory, see there for stronger proof) Digital Processing of Speech and Image Signals 24 WS 2006/2007 1.4 Properties of the Fourier Transform Symmetry H(ω) = Z∞ h(t) e−jωt dt = F {h(t)} 1 2π Z∞ −∞ h(t) = −∞ H(ω) ejωt dω = F −1 {H(ω)} F 2 {h(t)} = F {H(ω)} = 2πh(−t) F −1 F {h(t)} = F −1 {H(ω)} = h(t) • Time domain and frequency domain are correlated symmetrically. • Properties of FT are valid in both domains, especially the convolution theorem (see later). Digital Processing of Speech and Image Signals 25 WS 2006/2007 Theorems for the Fourier transform H(ω) = Z∞ e−jωt h(t) dt −∞ consider the equation: H(ω) = F {h(t)} more exact: {ω → H(ω)} = F {t → h(t)} 1. Linearity: integral operator is linear 2. Inverse scaling, similarity principle: Z∞ h(αt) e−jωt dt = −∞ F {h(αt)} = 1 |α| Z∞ ω h(τ ) e−j α τ dτ −∞ 1 ω H( ), |α| α α ∈ IR\{0} Note: Absolute value, because integral boundaries are swapped for α < 0. 3. Shift: h(t − t0 ) Z∞ −∞ h(t − t0 ) e−jωt dt = e−jωt0 = e−jωt0 Z∞ −∞ Z∞ h(t − t0 ) e−jω(t−t0 ) dt h(τ ) e−jωτ dτ −∞ Digital Processing of Speech and Image Signals 26 WS 2006/2007 =⇒ F {h(t − t0 )} = e−jωt0 H(ω) t0 ∈ IR with H(ω) = F {h(t)} important: | F {h(t − t0 )} | = | F {h(t)} | , because |e−jωt0 | = |e−ju | = | cos u − j sin u| p cos2 u + sin2 u = = 1 4. Symmetry and antisymmetry: h(t) = h(−t) results in h(t) = −h(−t) 5. Complex conjugation: Z∞ Im{H(ω)} = 0 results in Re{H(ω)} = 0 suppose that h(t) is a complex function h(t) e−jωt dt Z∞ = −∞ −∞ Z∞ = h(t) ejωt dt h(t) ejωt dt = H(−ω) −∞ F {h(t)} = H(−ω) = F {h(t)} Special case: h(t) is real, so h(t) = h(t) =⇒ H(ω) = H(−ω) =⇒ | H(ω) | = | H(−ω) | = | H(−ω) | Digital Processing of Speech and Image Signals 27 WS 2006/2007 6. Differentiation: dh dt  ∂  1 ∂t 2π = 1 2π = Z∞ Z∞ −∞  H(ω) ejωt dω  H(ω) jω ejωt dω −∞ F{ dh(t) } = jω F {h(t)} dt Interpretation: differentiation = enhancement of high frequencies (due to the multiplication with ω) 7. Integration: F{ Zt h(τ )dτ } = −∞ Proof: 1 F {h(t)} jω similar to differentiation or inversion 8. Modulation principle: F {h(t) cos(ω0 t)} = Z∞ −∞ h(t) cos(ω0 t) e−jωt dt  Z∞  Z∞ 1  h(t) ejω0 t e−jωt dt + h(t) e−jω0 t e−jωt dt  2 −∞  −∞  ∞ Z Z∞ 1  h(t) e−j(ω−ω0 )t dt + h(t) e−j(ω+ω0 )t dt  = 2 = −∞ = −∞ 1 [ H(ω − ω0 ) + H(ω + ω0 ) ] 2 and similarly F { h(t) sin(ω0 t) } = 1 [ H(ω − ω0 ) − H(ω + ω0 ) ] 2j Digital Processing of Speech and Image Signals 28 WS 2006/2007 y(t) x(t) h(t), H(ω) Y(ω) X(ω) Convolution theorem • Convolution in time domain corresponds to multiplication in frequency domain Z∞ Time domain: y(t) = x(t) ∗ h(t) = x(t − τ ) h(τ ) dτ −∞ Frequency domain: Y (ω) = = = Z∞ −∞ Z∞ −∞ Z∞  e−jωt  Z∞  h(τ ) x(t − τ ) dτ  dt  −∞  Z∞ h(τ )  x(t − τ ) e−jωt dt  dτ −∞ h(τ ) X(ω) e−jωτ dτ −∞ = X(ω) Z∞ (shifting) h(τ ) e−jωτ dτ −∞ = X(ω) H(ω) Digital Processing of Speech and Image Signals 29 WS 2006/2007 • Likewise, multiplication in time domain corresponds to convolution in 1 ): frequency domain (note the factor 2π Time domain: y(t) = a(t) · b(t) Frequency domain: Y (ω) = = Z∞ −∞ Z∞ −∞ = = 1 2π 1 2π a(t) · b(t) e−jωt dt 1 a(t) 2π Z∞ −∞ Z∞ −∞ = Z∞ B(ω̃)ej ω̃t e−jωt dω̃ dt −∞ B(ω̃) Z∞ a(t)e−j(ω−ω̃)t dt dω̃ −∞ A(ω − ω̃) · B(ω̃)dω̃ 1 A(ω) ∗ B(ω) 2π • Motivation for the Fourier transform: FT gives the “simplest” representation of the system operation, because every LTI-System can be interpreted as convolution of the input signal x(t) and the impulse response of the system h(t). Convolution can be then efficiently calculated using FT and convolution theorem. • Mathematical: eigenfunctions Digital Processing of Speech and Image Signals 30 WS 2006/2007 Example: Oscillator with excitation Oscillator x(t) −→ −→ y(t) y ′′ (t) + 2α y ′ (t) + β 2 y(t) = x(t) Z+∞ 1 x(t) = X(ω)ejωt dω 2π y(t) = y ′ (t) = y ′′ (t) = 1 2π 1 2π 1 2π −∞ Z+∞ Y (ω)ejωt dω −∞ Z+∞ Y (ω)jω ejωt dω −∞ Z+∞ Y (ω)[−ω 2 ] ejωt dω −∞ Z+∞ Z+∞ [−ω 2 + 2αjω + β 2 ]Y (ω)ejωt dω = X(ω)ejωt dω Z+∞ −∞ −∞ −∞ [−ω 2 + 2αjω + β 2 ] Y (ω) − X(ω) ejωt dω = 0 | {z } ∀t =0 In this way we obtain the transfer function of an oscillator: H(ω) = Y (ω) 1 = X(ω) −ω 2 + 2αjω + β 2 Digital Processing of Speech and Image Signals 31 WS 2006/2007 1 h(t) = 2π Z+∞ H(ω)ejωt dω −∞ (can be given explicitly) Z+∞ x(t) h(t − τ )dτ y(t) = −∞ Note: y(t) does not contain the component which corresponds to the homogeneous differential equation of the oscillator. x(t) Convolution with h(t) Inverse Fourier Transform Fourier Transform X(ω) y(t) Multiplication with H(ω) = F{h(t)} Digital Processing of Speech and Image Signals 32 Y(ω) WS 2006/2007 1.5 Parseval Theorem Convolution theorem: F −1 {H(ω) X(ω)} = (⋆) 1 2π Z∞ H(ω) X(ω) ejωτ dω −∞ Z∞ −∞ h(t) x(τ − t) dt = (h ∗ x) (τ ) We make two special assumptions: i) x(−t) := h(t), then: X(ω) = H(ω) ii) τ = 0 Inserting in (⋆) results in: 1 2π Z∞ −∞ 1 2π H(ω)H(ω) dω Z∞ −∞ Z∞ = |H(ω)|2 dω = −∞ Z∞ −∞ h(t)h(t) dt |h(t)|2 dt = E • Energy E in time domain = Energy E in frequency domain 1 1 ; aid: use normalization factor √ for both (up to the factor 2π 2π directions of Fourier Transform) • Physical aspect: energy conservation • Mathematical aspect: unitary (orthogonal) representation in vector space • |H(ω)|2 is called power spectral density. Digital Processing of Speech and Image Signals 33 WS 2006/2007 1.6 Autocorrelation Function Autocorrelation function • Autocorrelation function of time continuous signal or function h(t) is defined as: R(t) = Z∞ h(τ ) h(t + τ )dτ −∞ • The following equation is valid: R(t) = h(t) ∗ h(−t) → • Fourier transform gives: which results in R(t) = R(−t) (”Wiener-Khinchin Theorem”) F {R(t)} = H(ω) H(ω) = |H(ω)|2 • Thus: Fourier transform connects autocorrelation function R(t) and power spectral density |H(ω)|2 |H(ω)|2 = Z∞ R(t) e−jωt dt = Z∞ R(t) cos(ωt) dt −∞ −∞ • Remark: autocorrelation is a special case of the cross correlation between signals x(τ ) and h(t) Ch,x = Z∞ h(τ ) x(t + τ )dτ −∞ Digital Processing of Speech and Image Signals 34 WS 2006/2007 1.7 Existence of the Fourier Transform Conditions for h(t) for the existence of the Fourier transform H(ω) = Z∞ 1 h(t) = 2π e−jωt h(t) dt , −∞ Z∞ ejωt H(ω) dω −∞ When are those equations valid? Sufficient conditions: 1. h(t) is absolutely integrable: Z∞ −∞ |h(t)|dt < ∞ 2. h(t) has finite number of jumps, minima and maxima in each interval of IR 3. h(t) has no infinite jumps More general conditions are possible (but rather complex set of conditions): • Generalized functions, distributions, definition as functional • Example: δ-function: Z∞ δ(t) h(t) dt = h(0) for all functions h −∞ Digital Processing of Speech and Image Signals 35 WS 2006/2007 Impulse response: y(t) = Z∞ −∞ h(t − τ )δ(τ ) dτ = h(t) ∗ δ(t) = h(t) Consequence: h(t) ≡ 1 ⇒ Z∞ δ(t) dt = 1 −∞ • A function like δ(t) does not “exist”. But it is possible to define the functional for each function t → h(t): [t → h(t)] −→ δ̃(h) := h(0) 1.8 δ-Function Starting point: definition of the δ-function as a boundary case of a function δǫ (t): lim Z+∞ ǫ→0 −∞ f (t) δǫ (t) dt = f (0) (1.1) • Possible realizations of δǫ (t)   1 t ∈ [−ǫ, +ǫ] 2ǫ a) δǫ (t) =  0 otherwise b) δǫ (t) = 1 ǫ π ǫ2 + t2 Digital Processing of Speech and Image Signals 36 WS 2006/2007 c) δǫ (t) = 1 sin (t/ǫ) π t d) δǫ (t) = √ 1 2πǫ2 t2 − e 2ǫ2 • During inversion of the Fourier transform we have “formally” obtained: δ(t) = 1 2π Z+∞ ejωt dω = lim Ω→∞ 1 sin (Ωt) π t (1.2) −∞ Fourier transform F {δ(t)}: F {δ(t)} = Z+∞ e−jωt δ(t) dt −∞ due to (1.1) the following holds: F {δ(t)} = ejωt |t=0 = 1 • Another derivation using (1.2): δ(t) = = 1 2π 1 2π Z+∞ −∞ Z+∞ ejωt F {δ(t)} dω ejωt dω general according to (1.2) −∞ Comparison results in: F {δ(t)} = 1 Digital Processing of Speech and Image Signals 37 WS 2006/2007 From this we obtain the following equations: From symmetry property: F {1} = 2 π δ(ω) From shifting theorem: F {ejω0 t 1} = 2 π δ(ω − ω0 ) cos (ω0 t) = = 1 jω0 t e + e−jω0 t 2   Z+∞ Z+∞ 1  δ(ω + ω0 ) ejωt dω  δ(ω − ω0 ) ejωt dω + 2 = π −∞ Z+∞ 1 2π −∞ −∞ [ δ(ω − ω0 ) + δ(ω + ω0 ) ] ejωt dω F { cos (ω0 t) } = π [ δ(ω − ω0 ) + δ(ω + ω0 ) ] Note: another derivation: consider “damped oscillations” 1 −α|t| e cos (ω0 t) 2π in the limit α → 0 . Digital Processing of Speech and Image Signals 38 WS 2006/2007 Comb function • define “comb function” (pulse train, sequence of δ-impulses): +∞ X x(t) = n=−∞ δ(t − nT ) • Fourier transform of comb function: X(ω) = Z+∞ x(t) e−jωt dt −∞ = = = Z+∞ X +∞ δ(t − nT ) e−jωt dt X δ(t − nT ) e−jωt dt −∞ n=−∞ +∞ +∞ Z n=−∞−∞ +∞ X −jωnT e n=−∞ = = ... (see Papoulis 1962, p. 44) +∞ 2π X 2π δ(ω − n ) T n=−∞ T • in words: δ-impulse sequence with period T in time domain produces δ-impulse sequence with period T1 in frequency domain (i.e. 2π T in ω-frequency domain) comb function is transformed to comb function Digital Processing of Speech and Image Signals 39 WS 2006/2007 Comb function Σ n=- -6T 2π T δ(t-nT) -3T -T T 3T Σ n=- -6π -4π -2π T T T 6T δ(ω-n2π/T) 2π 4π 6π T T T 1(δ(ω-ω )+δ(ω+ω )) 0 0 2 cos(ω0t) −ω0 ω0 1 j(−δ(ω-ω )+δ(ω+ω )) 0 0 2 sin(ω0t) ω0 −ω0 Digital Processing of Speech and Image Signals 40 WS 2006/2007 1.9 Motivation for Fourier Series x: IR −→ IR t −→ x(t) Consider a periodical function x with period T : x(t) = x(t + T ) then also x(t) = x(t + kT ) for each t ∈ IR for k ∈ Z Examples: • Constant function: x0 (t) = A0 • Harmonic oscillator: x1 (t) = A1 cos ( 2π t + ϕ1 ) , T A1 > 0 • All higher harmonic: xn (t) = An cos (n 2π t + ϕn ) , T An > 0 therefore x(t) = ∞ X An cos (n ω0 t + ϕn ) with ω0 = n=0 is periodical with period T = 2π , T An ≥ 0 2π ω0 • Another notation: x(t) = ∞ X Bn e−j n ω0 t where Bn is a complex number n=−∞ Digital Processing of Speech and Image Signals 41 WS 2006/2007 Line spectrum representation Real measured signal has always a ”widespread” spectrum. Reasons: • Strictly periodical signal (almost) never exists – Period can fluctuate – ”Wave form” within one period can fluctuate – Only a finite section of the signal is analyzed (”window function”) • Only a strictly periodical signal has a sharp line spectrum Remarks: • Fourier series are actually not strictly related to periodical functions: a finite interval of IR is sufficient (the signal is then interpreted as infinitely prolonged). • By transition from the finite interval to the complete real axis the Fourier series becomes Fourier integral. Digital Processing of Speech and Image Signals 42 WS 2006/2007 Calculation of Fourier coefficients: • Consider a periodical function x(t) with period T = 2π ω0 • approach: x(t) = +∞ X an ej n ω0 t a∈C n=−∞ • multiplication with e−j m ω0 t where m ∈ IN and integration over one period result in: +T Z /2 x(t) e−j m ω0 t dt = +∞ X ej (n−m) ω0 t dt an n=−∞ −T /2 +T Z /2 −T /2 • Due to “orthogonality” holds: +T Z /2 j (n−m) ω0 t e dt = −T /2 T 0 if n = m if n = 6 m • Then: ZT /2 x(t) e−j m ω0 t dt = am T −T /2 • Result: an = 1 T +T Z /2 x(t) e−j n ω0 t dt −T /2 = 1 T +T Z /2 x(t) cos (n ω0 t) dt − j −T /2 1 T +T Z /2 x(t) sin (n ω0 t) dt −T /2 Digital Processing of Speech and Image Signals 43 WS 2006/2007 Spectrum of a periodical function • If x(t) is periodical with the period T = x(t) = +∞ X 2π ω0 an ej n ω0 t , n=−∞ , then an ∈ C • The Fourier transform X(ω) is: X(ω) = F {x(t)} +∞ X an = n=−∞ +∞ X = 2π F {ej n ω0 t } | {z } = 2πδ(ω − nω0 ) n=−∞ an δ(ω − nω0 ) • Note: This derivation is formal, because the Fourier integral does not exist in the “usual sense”; strict derivation within the scope of distribution theory. • In words: a periodic function with the period T has a Fourier transform in the form of a line spectrum with the distance ω0 = 2π T between the components. Digital Processing of Speech and Image Signals 44 WS 2006/2007 1.10 Time Duration and Band Width 1. Similarity principle: F {h(αt)} = 1 ω H( ) |α| α h(t) H( ω ) ω t 0<α<1: _ H( _ 1 ω α α) h(α t) ω t time duration T band width B T · B = const. • High resolution in the time domain results in low resolution in the frequency domain and vice versa Digital Processing of Speech and Image Signals 45 WS 2006/2007 2. Special case: h(t) with Im {H(ω)} = 0 ( h(t) symmetrical ) and Re {H(ω)} ≥ 0 h(t) has maximum for t = 0: 1 h(t) = 2π Z∞ −∞ 1 H(ω) cos(ωt) dω ≤ 2π Z∞ H(ω) dω = h(0) −∞ define: T B = = 1 h(0) Z∞ −∞ Z∞ 1 H(0) h(t) dt H(ω) dω −∞ from T = H(0) h(0) and B = 2 π h(0) H(0) follows T · B Digital Processing of Speech and Image Signals 46 = 2π WS 2006/2007 3. In general: normalized impulse h(t) ∈ IR with Z∞ h2 (t) dt = 1, h(t) ∈ IR −∞ T2 B2 Z∞ := −∞ Z∞ := −∞ h2 (t) t2 dt | H(ω) |2 ω 2 dω • Results in “uncertainty relation”:  = 2π T · B ≥ r Z∞ −∞  2 [h′ (t)] dt π 2 • Proof: Cauchy-Schwarz inequality | xT y | ≤ ||x|| · ||y|| ∞ 2 Z Z∞ Z∞ 2 2 [ t h(t)] dt [ h′ (t) ] dt [ t h(t) ] h′ (t) dt ≤ −∞ | {z } {z }−∞ {z } | |−∞ 2 2 1 B =T = 4 2π From: partial integration Z Z ′ u (t) v(t) dt = u(t) v(t) − Z u(t) v ′ (t) dt 1 t dt = h(t)2 t − [ h(t) h′ (t) ] |{z} {z } | 2 v(t) u′ (t) Z∞ −∞ Z [ h(t) h′ (t) ] t dt = 0 − Digital Processing of Speech and Image Signals 47 1 2 h (t) 1 dt 2 1 2 WS 2006/2007 Equality sign is valid for linear dependency: h′ (t) = −λ t h(t) dh = −λ t dt h 1 log(h) = − λ t2 + const., 2 =⇒ Optimum T · B = r π 2 λ > 0 for Gauss’ impulse 1 2 λ − λt h(t) = √ e 2 2π Variance: σ 2 = 1 λ Quantum Physics: similar statement about position and impulse of a particle Digital Processing of Speech and Image Signals 48 WS 2006/2007 4. Finite positive signal ≥0 0≤t≤T g(t) =0 t < 0 or T < t g(t) 0 T t The following is valid for the amplitude spectrum |G(ω)|: +∞ Z −jωt g(t) e dt |G(ω)| = ≤ = −∞ +∞ Z |g(t)| |e−jωt |dt −∞ Z+∞ −∞ |g(t)| dt because g(t) ≥ 0 = G(0) Define the band width ωB as: |G(ωB )|2 = G2 (0) 2 and |G(ωB )|2 ≤ |G(ω)|2 for |ω| < ωB Then: T ωB ≥ Digital Processing of Speech and Image Signals 49 π 2 WS 2006/2007 Proof: The following inequalities are valid: (a − b)2 a +b ≥ 2 | sin α| + | cos α| ≥ 1 2 2 ∀ a, b ∈ IR ∀ α ∈ IR For the Fourier-Transform of g(t) holds: Re{G(ω)} = ZT g(t) cos ωt dt 0 Im{G(ω)} = − ZT g(t) sin ωt dt 0 π holds: cos ωt ≥ 0, sin ωt ≥ 0 2 and therefore: cos ωt + sin ωt = | cos ωt| + | sin ωt| ≥ 1 For 0 ≤ ω t ≤ Re{G(ω)} − Im{G(ω)} = ZT g(t) [cos ωt + sin ωt] dt ≥ ZT g(t) 1 dt 0 0 = G(0) |G(ω)|2 = Re2 {G(ω)} + Im2 {G(ω)} [Re{G(ω)} − Im{G(ω)}]2 ≥ 2 1 2 G (0) ≡ |G(ωB )|2 ≥ 2 Digital Processing of Speech and Image Signals 50 WS 2006/2007 Chapter 2 Discrete Time Systems • Overview: 2.1 Motivation and Goal 2.2 Digital Simulation using Discrete Time Systems 2.3 Examples of Discrete Time Systems 2.4 Sampling Theorem and Reconstruction 2.5 Logarithmic Scale and dB 2.6 Quantization 2.7 Fourier Transform and z–Transform 2.8 System Representation and Examples 2.9 Discrete Time Signal Fourier Transform Theorems 2.10 Discrete Fourier Transform (DFT) 2.11 DFT as Matrix Operation 2.12 From continuous FT to Matrix Representation of DFT 2.13 Frequency Resolution and Zero Padding 2.14 Finite Convolution 2.15 Fast Fourier Transform (FFT) 2.16 FFT Implementation Digital Processing of Speech and Image Signals 51 WS 2006/2007 2.1 Motivation and Goal If we want to process a continuous time signal x(t) with a computer, we have to sample it at discrete equidistant time points tn = n · TS where TS is called sampling period. Terminology: “time discrete” is often called “digital”, where this adjective often (but not always) denotes the amplitude quantization, i.e. the quantization of the value x(n · TS ). Advantages of digital processing in comparison to analog components: • independent of analog components and technical difficulties with respect to their realization; • in principle arbitrary high accuracy; • also non-linear methods are possible, in principle even every mathematical method. Digital Processing of Speech and Image Signals 52 WS 2006/2007 2.2 Digital Simulation using Discrete Time Systems Task definition: • Given: Analog system with input signal x(t) and output signal y(t); Sampling with sampling period TS • Wanted: Discrete System with input signal x[n] and output signal y[n], such that x[n] = x(nTS ) results in y[n] = y(nTS ) • For which signals is such a digital simulation possible? • The sampling theorem gives (most of) the answer. Digital Processing of Speech and Image Signals 53 WS 2006/2007 LTI System (analog to continuous time case): • Linearity: Homogeneity: S {α x[n]} = α S {x[n]} Additivity: S {x1 [n] + x2 [n]} = S {x1 [n]} + S {x2 [n]} • Shift invariance: S {x[n − n0 ]} = y[n − n0 ], Digital Processing of Speech and Image Signals 54 n0 whole number WS 2006/2007 Representation of an LTI System as discrete convolution: Unit impulse: δ[n] = 1, 0, n = 0 n 6= 0 The signal x[n] is represented with amplitude weighted and time shifted unit impulses δ[n]. The system reacts on δ[n] with h[n]: h[n] = S {δ[n]} Input signal: ∞ X x[n] = k=−∞ x[k] δ[n − k] Output signal: y[n] = S ( ∞ X k=−∞ x[k] δ[n − k] ) Additivity = ∞ X S { x[k] δ[n − k] } ∞ X x[k] S { δ[n − k] } k=−∞ Homogeneity = k=−∞ Time invariance = ∞ X k=−∞ x[k] h[n − k] Input signal x[n] and output signal y[n] of a discrete time LTI system are linked through discrete convolution. h[n] is called impulse response like in continuous time case. Digital Processing of Speech and Image Signals 55 WS 2006/2007 2.3 Examples of Discrete Time Systems • Difference calculation: y[n] = x[n] − x[n − n0 ] • “1-2-1”-averaging: y[n] = 0.5 · x[n − 1] + x[n] + 0.5 · x[n + 1] • sliding window averaging (“smoothing”) M X 1 y[n] = x[n − k] 2M + 1 k=−M • weighted averaging: instead of constant weight h[n] = 1 2M + 1 arbitrary weights can be used: y[n] = M X k=−M h[k] · x[n − k] Note: the only difference from general case is finite length of the convolution kernel h[n]. • First order difference equation: (recursive averaging, averaging with memory) y[n] − α y[n − 1] = x[n] • (Digital) resonator (second order difference equation) y[n] − α y[n − 1] − β y[n − 2] = x[n] • Image processing: Gradient calculation and image enhancement (Roberts Operator, Laplace Operator) Digital Processing of Speech and Image Signals 56 WS 2006/2007 • Roberts Cross Operator gray values x[i, j] j+1 j i i+1 2 |∇x[i, j]|2 = (x[i, j] − x[i + 1, j + 1])2 + (x[i, j + 1] − x[i + 1, j])2 Note: non-linear operation simplified: |∇x[i, j]| = |x[i, j] − x[i + 1, j + 1]| + |x[i, j + 1] − x[i + 1, j]| Digital Processing of Speech and Image Signals 57 WS 2006/2007 Figure 2.1: Digital photo Figure 2.2: Gradient image Digital Processing of Speech and Image Signals 58 WS 2006/2007 • Laplace Operator ≡ discrete approximation of the second derivation ∇2 x[i, j] = △2i x[i, j] + △2j x[i, j] = x[i + 1, j] − 2x[i, j] + x[i − 1, j] + x[i, j + 1] − 2x[i, j] + x[i, j − 1] = x[i + 1, j] + x[i − 1, j] + x[i, j + 1] + x[i, j − 1] − 4x[i, j] 1 -2 1 j+1 1 1 -4 1 -2 1 j-1 1 i-1 i j i+1 1 Image enhancement: y[i, j] = x[i, j] − ∇2 x[i, j] = h[i, j] ∗ x[i, j] Digital Processing of Speech and Image Signals 59 WS 2006/2007 Figure 2.3: Several real cases of Laplace Operator subtraction from original image. a) Original image b) Original image minus Laplace Operator (negative values are set to 0 and values above the grey scale are set to the highest grade of grey) Digital Processing of Speech and Image Signals 60 WS 2006/2007 2.4 Sampling Theorem (Nyquist Theorem) and Reconstruction The following will be analyzed and derived respectively: How should we choose the sampling period TS , if we want to represent a continuous signal x(t) with its sample values x(nTS ) so that the signal x(t) can be exactly reconstructed from its sample values? • Fourier transform of the continuous time signal x(t): Z∞ X(ω) = F { x(t) } = x(t) e−jωt dt −∞ 1 x(t) = F −1 { X(ω) } = 2π Z∞ X(ω) ejωt dω (2.1) −∞ • Signal x(t) has limited bandwidth with upper limit ωB , which means: X(ω) = 0 for all |ω| ≥ ωB Note: X(ωB ) = 0 • X(ω) in domain −ωB < ω < ωB can be represented as Fourier Series: ∞ X ω an exp(−jnπ ) X(ω) = (2.2) ω B n=−∞ • The coefficients an are given by: ZωB ω 1 X(ω) exp(jnπ ) dω an = 2ωB ωB (2.3) −ωB • Comparison of the equations (2.1) and (2.3) shows that the coefficients an are given by the values of the inverse Fourier transform of x(t) at points nπ tn = (2.4) ωB The band limitation of X(ω) has to be considered for the integration limits in (2.1). Result: π nπ (2.5) an = x( ) · ωB ωB Digital Processing of Speech and Image Signals 61 WS 2006/2007 • Inserting Eq. (2.5) into Eq. (2.2) and then in Eq. (2.1) results in: 1 x(t) = 2π ZωB −ωB ∞ nπ ω π X x( ) exp(−jnπ ) exp(jωt) dω ωB n=−∞ ωB ωB • After swapping summation and integration and subsequent integration: x(t) = ∞ X x( n=−∞ nπ ) ωB nπ )) ωB nπ ) ωB (t − ωB sin(ωB (t − • Reconstruction of the signal x(t) from sample values is possible if nπ equidistant sample values x( ) = x(n · Ts ) have the distance TS ωB π (2.6) TS = ωB • The sampling period TS corresponds to the sampling frequency ΩS : ΩS = 2π TS • Equation (2.6) shows that if the sampling frequency is ΩS := 2 ωB the original signal x(t) can be reconstructed exactly. • In the Fourier series representation of X(ω) in equation (2.2), the period 2 · ωB has been supposed. ωB is the highest frequency component of the signal x(t). Digital Processing of Speech and Image Signals 62 WS 2006/2007 • Since X(ω) is equal to zero for |ω| ≥ ωB , the period 2 · ωB can be substituted with every period 2 · ω eB where ω eB ≥ ωB . The previous derivation is also valid for this ω eB . • When ω eB = then: x(t) = ∞ X x(n TS ) n=−∞ π TS sin(π (t − n TS )/TS ) π (t − n TS )/TS (reconstruction formula) = 1 (l’Hôpital’s rule) Note: limt→0 sin(t) t • The condition ω eB ≥ ωB results in: TS ≤ π ωB (2.7) for the sampling period TS and in: ΩS ≥ 2 · ωB (2.8) for the sampling frequency ΩS . • The equations (2.7) and (2.8) are denoted as sampling theorem. The sampling frequency has to be at least twice as high as the upper limit frequency of the signal ωB where X(ω) = 0 for |ω| ≥ ωB . If and only if this condition is satisfied, an exact reconstruction (without approximation) of a continuous signal x(t) from its sample values x(nTS ) is possible. • Note: The sampling frequency ΩS = 2 · ωB is also called Nyquist frequency. Digital Processing of Speech and Image Signals 63 WS 2006/2007 a) x(t) t b) xs(t) t T c) xr(t) T Figure 2.4: Ideal reconstruction of a band-limited signal (from Oppenheim, Schafer) a) original signal b) sampled signal c) reconstructed signal Digital Processing of Speech and Image Signals 64 WS 2006/2007 X(ω) a) −ωΒ ωΒ ω XS1(ω) , ΩS > 2ωΒ b) ... ... −ωΒ -ΩS ΩS ωΒ ω XS2(ω) , ΩS = 2ωΒ (Nyquist rate) c) ... ... -ΩS −ωΒ ωΒ ΩS ω XS3(ω) , ΩS < 2ωΒ (aliasing) d) ... ... ωΒ ΩS Β −Ω−ω S ω Figure 2.5: Sampling of band-limited signal with different sampling rates: b) sampling rate higher than Nyquist rate - exact reconstruction possible c) sampling rate equal to Nyquist rate - exact reconstruction possible d) sampling rate smaller than Nyquist rate - aliasing - exact reconstruction not possible Digital Processing of Speech and Image Signals 65 WS 2006/2007 Another proof using delta- and comb-function: Sampling of the continuous signal x(t) with ΩS = 2π TS • Band limitation: X(ω) = 0 for |ω| ≥ ωB (always possible: analog to low-pass with T (ω) = 0 for |ω| ≥ ωB ) • Sampling procedure Multiplication of a function with a comb-function in time domain xs (t) = Ts x(t) · +∞ X n=−∞ δ(t − nTs ) results in a convolution with a comb-function in frequency domain: +∞ 1 2πn 2π X Xs (ω) = Ts · X(ω) ∗ δ ω − ω̃ 2π Ts n=−∞ Ts Z+∞ +∞ X 2πn dω̃ δ ω − ω̃ X(ω̃) = T s n=−∞ −∞ +∞ X 2π X ω−n = Ts n=−∞ =⇒ sampled signal has periodical Fourier spectrum (Analogy to Fourier series: periodical signal has line spectrum, i.e. discrete spectrum) No overlap if: ωB ≤ ΩS − ωB 2ωB ≤ ΩS Digital Processing of Speech and Image Signals 66 WS 2006/2007 • In so-called digital simulation, the signal x(t) is represented by its sampled values x(n · TS ) measured at equidistant time points with distance TS . With a proper sampling period TS an exact reconstruction of the signal x(t) from the sampled values x(n · TS ) is possible. • If it is possible to exactly reconstruct the signal x(t) from the sampled values x(n·TS ), then it is possible to perform a discrete time processing of the sampled values x(n · TS ) on a computer, which is equivalent to the continuous time processing of the signal x(t) (digital simulation). • Continuous time processing: y(t) = Z∞ −∞ x(τ ) h(t − τ ) dτ • Discrete time processing: – Sampling period TS – x[n] := x(nTS ) y(nTS ) = y[n] = ∞ X k=−∞ ∞ X k=−∞ x(kTS ) h(nTS − kTS ) TS , h̃[n] = h(nTS ) x[k] h̃[n − k] • As a result of the convolution theorem (convolution in time domain corresponds to multiplication in frequency domain), the band limited input signal gives an also band limited output signal which is exactly determined by its sampled values. Digital Processing of Speech and Image Signals 67 WS 2006/2007 Important: • In the domain |ω| < ΩS /2 the Fourier transform of a continuous time signal x(t) is identical with the Fourier–transform of the corresponding sampled discrete time signal x(nTS ): Z∞ X(ω) = x(t) exp(−jωt) dt −∞ for |ω| ≤ ΩS /2 is identical to ∞ X x(nTS ) exp(−jωTS n) TS · XS (ω) = TS · = TS · n=−∞ ∞ X x(nTS ) exp(−j n=−∞ 2πω n) ΩS • Inverse Fourier transform of discrete time signal: x(nTS ) = 1 ΩS Ω ZS /2 XS (ω) exp(jωTS n) dω −ΩS /2 • One period: ΩS ΩS ≤ ω ≤ 2 2 2πω ≤ π −π ≤ ΩS − • The Fourier transform of a discrete time signal is periodic in ω with the period 2 π/TS = ΩS . • The Fourier transform of a discrete time signal is continuous in ω. Digital Processing of Speech and Image Signals 68 WS 2006/2007 Frequency normalization • Define the normalized frequency ωN : ωN : = 2π • Definition: ω ΩS (ω now denotes a normalized frequency) – Fourier transform of discrete time signal x[n]: +∞ X jω X(e ) = x[n] exp(−jωn) n=−∞ Note the notation X(ejω ). – Inverse Fourier transform of discrete time signal x[n]: 1 x[n] = 2π Zπ X(ejω ) exp(jωn) dω −π Digital Processing of Speech and Image Signals 69 WS 2006/2007 2.5 Logarithmic Scale and dB Why? – large dynamic range for the amplitude values of a signal x(t) = A cos βt A := amplitude (pressure, velocity, inclination, current, voltage, ... ) “linear” variable A0 := reference amplitude predefined value for calibration dB := “decibel” A[dB] ≡ 20 · lg A , A0 A2 = 10 · lg 2 , A0 lg ≡ log10 A2 = quadratic variable = energy, intensity because of 210 = 1024 ∼ = 103 : 1 bit more = ˆ factor 2 for amplitude = ˆ 6 dB = factor 4 for intensity 3 dB = ˆ factor 2 for intensity Digital Processing of Speech and Image Signals 70 WS 2006/2007 Phonem: s Phonem: s 7 2 6 1.5 1 5 0.5 A log A 4 3 0 -0.5 2 -1 1 0 -1.5 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 -2 8000 Figure 2.6: Amplitude spectrum of the voiceless phoneme “s” from the word “ist” 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 8000 Figure 2.7: Logarithmic amplitude spectrum of the phoneme “s” Phonem: ae Phonem: ae 12 2.5 2 10 1.5 1 log A A 8 6 0.5 0 4 -0.5 2 0 -1 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 -1.5 8000 Figure 2.8: Amplitude spectrum of the voiced phoneme “ae” from the word “Äh” 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 8000 Figure 2.9: Logarithmic amplitude spectrum of the phoneme “ae” Pause Pause 1 0 0.9 -0.5 0.8 0.7 -1 log A A 0.6 0.5 -1.5 0.4 -2 0.3 0.2 -2.5 0.1 0 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 -3 8000 Figure 2.10: Amplitude spectrum of a speech pause 0 1000 2000 3000 4000 f / Hz 5000 6000 7000 8000 Figure 2.11: Logarithmic amplitude spectrum of a speech pause Digital Processing of Speech and Image Signals 71 WS 2006/2007 2.6 Quantization • Uniform quantization -X MAX XMAX • Quantisation: x̂ = Q(x) • B bits correspond to 2B quantisation levels • Boundaries: x0 , x1 , . . . , xk , . . . , xK where K = 2B • Width △ of one quantisation level using uniform quantisation: △ = 2 · XM AX 2B • Quantisation error: σe2 Z+∞ Zxk K X = (x − x̂)2 p(x) dx = (x − x̂k )2 p(x) dx k=1 x k−1 −∞ – for uniform quantisation: a) b) xk − xk−1 = △ = const(k) x̂k = 12 (xk−1 + xk ) – uniform distribution with p(x) = const(x) results in: σe2 = X △2 k 2 1 △2 XM AX · = = 12 K 12 3 · 22B Digital Processing of Speech and Image Signals 72 WS 2006/2007 • signal-to-noise ratio in dB (general definition): σx2 SN R[dB] := 10 lg 2 σn σx2 = power of the signal x σn2 = power of the noise n SN R = signal-to-noise ratio • signal-to-quantisation noise ratio (special case): σx2 SN R[dB] := 10 lg 2 σe σe2 = power of the noise caused by quantisation errors • uniform quantisation using B bits: SN R[dB] = 6.02 B + 4.77 − 20 lg XM AX σx • if signal amplitude has Gaussian distribution, only 0.064% of samples have amplitude greater than 4σx : SN R[dB] = 6.02 B − 7.2 Digital Processing of Speech and Image Signals 73 for XM AX = 4σx WS 2006/2007 2.7 Fourier Transform and z–Transform Transfer function and Fourier transform Eigenfunctions of discrete linear time invariant systems (analog to time continuous case): x[n] = ej ω n −∞ < n < ∞ (ω is dimensionless here) Proof: x[n] = ej ω n ∞ X y[n] = h[k] ej ω (n−k) k=−∞ ∞ X jωn = e h[k] e−j ω k k=−∞ Define: H(ej ω ) = ∞ P h[k] e−j ω k k=−∞ Remark: The Fourier transform of a discrete time signal is already introduced as Fourier series during the derivation of sampling theorem and reconstruction formula (equation (2.2)). Result: y[n] = ej ω n H(ej ω ) Digital Processing of Speech and Image Signals 74 WS 2006/2007 z–transform: • Fourier transform of a discrete time signal: x[n] +∞ X jω X(e ) = x[n] e−jωn n=−∞ – periodic in ω – ω is normalized frequency, thence: −π < ω ≤ π – X is evaluated on the unit circle (ejω ) • Generalization: X is evaluated for any complex values z. • That results in z–transform: +∞ X X(z) = x[n] z −n n=−∞ • Reasons for z–transform 1. analytically simpler, function theory methods are applicable 2. better handling of convergence problem: – convergence of finite signal, i.e. x[n] = 0 for each n > N0 – convergence of infinite signal depends on z • Inverse z–transform: I 1 x[n] = 2πj formally: z = ejω X(z) z n−1 dz dz = jzdω x[n] = 1 2π Z2π X(ejω ) ejωn dω 0 Digital Processing of Speech and Image Signals 75 WS 2006/2007 Example of Fourier transform and z–transform: • “Truncated geometric series” n a x[n] = 0 • z–transform N −1 X X(z) = n a z −n 0≤n≤N −1 otherwise = n=0 1 = z N −1 N −1 X −1 n (a z ) n=0 N N z −a z−a 1 − (a z −1 )N = 1 − a z −1 • Fourier transform z–transform results in Fourier transformation using substitution z = ejω jω X(e ) = 1 − aN e−jωN 1 − a e−jω special case for a = 1 (discrete time rectangle): ωN sin ω(N − 1) 2 ω = exp −j 2 sin 2 Digital Processing of Speech and Image Signals 76 WS 2006/2007 Proof for the z–transform inversion • Statement: 1 x[k] = 2πj I X(z) z k−1 dz • Cauchy integration rule I 1 1 k=1 z −k dz = 0 k 6= 1 2πj I I X 1 1 x[n] z −n+k−1 dz X(z) z k−1 dz = 2πj 2πj n I X 1 x[n] = z −n+k−1 dz 2πj n | {z } 6= 0 only for n = k = x[k] • Fourier: z = ejω =⇒ dz = j ejω dω Then: x[n] = 1 2πj Z+π X(ejω ) (ejω )n−1 j ejω dω −π Integration path is unit circle because of ejω = 1 2π Z+π X(ejω ) ejωn dω −π Digital Processing of Speech and Image Signals 77 WS 2006/2007 2.8 System Representation and Examples Example 1: Difference calculation • Difference equation y[n] = x[n] − x[n − n0 ], • Fourier transform gives: ∞ X −jωn y[n] e n=−∞ = ∞ X n0 integral number −jωn x[n] e n=−∞ Y (ejω ) = X(ejω ) − jω ∞ X n=−∞ −jωn0 = X(e ) − e • Then follows: − ∞ X n=−∞ x[n − n0 ] e−jωn x[n] e−jωn e−jωn0 X(ejω ) Y (ejω ) H(e ) = X(ejω ) = 1 − e−jωn0 jω |H(ejω )|2 = (1 − cos(ωn0 ))2 + sin2 (ωn0 ) = 1 − 2cos(ωn0 ) + cos2 (ωn0 ) + sin2 (ωn0 ) = 2 (1 − cos(ωn0 )) |H(eiω )|2 5 4 3 2 1 0 ω Digital Processing of Speech and Image Signals 78 π n0 WS 2006/2007 Example 2: First order difference equation x[n] y[n] + α Delay y[n-1] x[n] + α y[n − 1] = y[n] y[n] − α y[n − 1] = x[n] ⇐⇒ Method 1: Estimation of transfer function H(ejω ) from impulse response h[n]: • From the Eq. above with y[n] = h[n] and x[n] = δ[n] follows: h[n] = δ[n] + α h[n − 1] = δ[n] + α δ[n − 1] + α2 δ[n − 2] + · · · n α , n≥0 = 0, otherwise • Fourier spectrum/transfer function H(ejω ) jω H(e ) = = = +∞ X h[k] e−jωk k=−∞ +∞ X αk e−jωk k=0 +∞ X α e−jω k=0 = 1 1 − α e−jω Digital Processing of Speech and Image Signals 79 k for |α| < 1 WS 2006/2007 Method 2: Estimation of transfer function H(ejω ) using Fourier transform of difference equation: • Difference equation: y[n] − α y[n − 1] = x[n] • Fourier–transform: Y (ejω ) − α e−jω Y (ejω ) = X(ejω ) • Result: H(ejω ) = = Digital Processing of Speech and Image Signals 80 Y (ejω ) X(ejω ) 1 1 − α e−jω WS 2006/2007 Example 3: Linear difference equations (with constant coefficients) • Difference equation: y[n] = I X i=0 b[i] x[n − i] − • z-transform: Y (z) = X(z) I X b[i]z −i i=0 • Result: H(z) = = Y (z) X(z) I P J X a[j]z −j j=1 b[i] z −i 1+ j=1 = j=1 a[j] y[n − j] − Y (z) i=0 J P +∞ X J X a[j] z −j h[n] z −n n=−∞ Using the definition of H(z) we can optain the impulse response as a function of the coefficients of the difference equation in the above term. Remark: If we factorise denominator and numerator polynoms into linear factors, we can obtain a zero-pole-representation of a discrete time LTI system: ΠI (z − vi ) H(i) = Ji=1 Πj=1 (z − wj ) with zeros vi ∈ C and poles wj ∈ C. Digital Processing of Speech and Image Signals 81 WS 2006/2007 • in general: h[n] has infinite number of non-zero values =⇒ IIR–filter: Infinite Impulse Response • but if: a[j] ≡ 0 ∀j h[n] identical to zero outside of a finite interval h[n] = b[n] 0 n = 0, . . . , I otherwise =⇒ FIR–filter: Finite Impulse Response Digital Processing of Speech and Image Signals 82 WS 2006/2007 Example 4: Impulse response as “truncated geometric series” h[n] = H(z) = an 0 M X 0≤n≤M otherwise n a z −n n=0 a ∈ IR 1 − aM +1 z −(M +1) = 1 − a z −1 system operation: y[n] = ∞ X k=−∞ = M X k=0 h[k] x[n − k] ak x[n − k] or as difference equation (“recursively”) y[n] − a y[n − 1] = x[n] − aM +1 x[n − M − 1] Digital Processing of Speech and Image Signals 83 WS 2006/2007 For this example we consider the zero-pole-representation: 1 − ( az )−(M +1) H(z) = 1 − ( az )−1 Zeros: Pole: zk z0 α>0 2πk k = 0, 1, . . . , M = a · ej M +1 = a (cancelled by zero z0 = a) Im M=11 a Re Digital Processing of Speech and Image Signals 84 WS 2006/2007 Example 5: Fibonacci numbers Difference equation: n≥0 h[n + 2] = h[n + 1] + h[n] h[0] = h[1] = 1 h[n] = 0 n<0 H(z) = ∞ X h[n]z −n n=−∞ = 1 + z = 1 + z −1 −1 + + ∞ X n=0 ∞ X h[n + 2]z −(n+2) h[n + 1]z −(n+2) ∞ X + n=0 = 1 + z −1 + z −1 = 1 + z −1 (1 + | h[n]z −(n+2) n=0 ∞ X n=0 ∞ X h[n + 1]z h[n]z −n ) + z −2 {z H(z) = 1 + z −1 H(z) + z −2 H(z) = H(z) = = + z −2 ∞ X h[n]z −n n=0 n=1 H(z)(1 − z −1 − z −2 ) −(n+1) } ∞ X h[n]z −n |n=0 {z H(z) } 1 1 1 − z −1 − z −2 1 a b √ − 1 − bz −1 5 1 − az −1 √ √ 1− 5 1+ 5 and b = where a = 2 2 Digital Processing of Speech and Image Signals 85 WS 2006/2007 For a and b the following holds: ∞ ∞ X X a n 1 = = an z −n a 1 − (z ) z n=0 n=0 That results in: ∞ X 1 √ an+1 − bn+1 z −n H(z) = 5 n=0 ∞ X ! = h[n] z −n n=0 h[n] = 1 √ 5 an+1 − bn+1 Digital Processing of Speech and Image Signals 86 WS 2006/2007 Table 2.1: Fourier transform pairs signal Fourier–transform 1. δ[n] 1 2. δ[n − n0 ] e−jωn0 3. 1 4. ∞ X (−∞ < n < ∞) an u[n] 2πδ(ω + 2πk) k=−∞ 1 1 − ae−jω (|a| < 1) ∞ X 1 + πδ(ω + 2πk) 1 − e−jω k=−∞ 5. u[n] 6. (n + 1)an u[n] 7. rn sin ωp (n + 1) u[n] sin ωp 8. sin ωc n πn ( 1, |ω| < ωc , X(e ) = 0, ωc < |ω| ≤ π 9. ( 1, 0 ≤ n ≤ M x[n] = 0, otherwise sin[ω(M + 1)/2] −jωM/2 e sin(ω/2) 10. ejω0 n (|a| < 1) (|r| < 1) 1 (1 − ae−jω )2 1 1 − 2r cos ωp e−jω + r2 e−j2ω jω ∞ X k=−∞ 11. cos(ω0 n + φ) π 2πδ(ω − ω0 + 2πk) ∞ X [ejφ δ(ω − ω0 + 2πk) + e−jφ δ(ω − ω0 + 2πk)] k=−∞ u[n] = 1, n ≥ 0 0, n < 0 Digital Processing of Speech and Image Signals 87 WS 2006/2007 2.9 Discrete Time Signal Fourier Transform Theorem Basically there is no difference between FT theorem for the continuous time and the discrete time case because summation has the same properties as integration. Only differentiation and difference calculation are not completely analog, because it is not possible to form a derivative in the discrete time case. Table 2.2: Fourier transform Theorems signal x[n], y[n] Fourier–transform X(ejω ), Y (ejω ) 1. ax[n] + by[n] aX(ejω ) + bY (ejω ) 2. x[n − nd ], nd is integral number e−jωnd X(ejω ) 3. ejω0 n x[n] X(ej(ω−ω0 ) ) 4. x[−n] X(e−jω ) X(ejω ) if x[n] is real 5. nx[n] j 6. x[n] ∗ y[n] X(ejω )Y (ejω ) x[n]y[n] 1 2π 7. 8. Parseval theorem 9. 1 |x[n]| = 2π n=−∞ ∞ X 2 Z π X(ejΘ )Y (ej(ω−Θ) )dΘ −π (1 − e−jω )X(ejω ) |1 − e−jω |2 = 2(1 − cos ω) x[n] − x[n − 1] ∞ X dX(ejω ) dω Z 1 10. x[n]y[n] = 2π n=−∞ π −π Z |X(ejω )|2 dω π X(ejω )Y (ejω )dω −π Digital Processing of Speech and Image Signals 88 WS 2006/2007 Example 1 corresponding to Theorem 5: +∞ X jω X(e ) = x[k] e−jωk k=−∞ k=−∞ +∞ X = j k=−∞ +∞ X d X(ejω ) = dω x[k] e−jωk k=−∞ +∞ X = ⇐⇒ +∞ X d dω d X(ejω ) = dω d dω ! x[k] e−jωk x[k] (−jk) e−jωk k x[k] e−jωk k=−∞ F {n · x[n]} = j d F {x[n]} dω Example 2 corresponding to Theorem 8: +∞ X F {x[n] − x[n − 1]} = k=−∞ +∞ X = k=−∞ jω =⇒ |F {x[n] − x[n − 1]} |2 −jωk x[k] e − x[k] e−jωk − = X(e ) 1 − e−jω +∞ X k=−∞ +∞ X x[k − 1] e−jωk x[k] e−jωk e−jω k=−∞ = |F {x[n]} |2 |1 − e−jω |2 = |F {x[n]} |2 2 · (1 − cos(ω)) Digital Processing of Speech and Image Signals 89 WS 2006/2007 2.10 Discrete Fourier Transform: DFT The Fourier transform for discrete time signals and systems has been explained on the previous pages. For discrete time signals with finite length there is also another Fourier representation called “Discrete Fourier Transform” (DFT). The DFT plays a central role in digital signal processing. Decisive reasons: • fast algorithms exist for DFT calculation (Fast Fourier Transform, FFT). • discrete frequencies ωk can be better represented in the computer than continuous frequencies ω. Digital Processing of Speech and Image Signals 90 WS 2006/2007 Assume a discrete time signal x[n] with finite length (see also chapter 3.2 on page 135): x[n] = x[n] 0 0≤n≤N −1 otherwise Note: For a continuous time signal it is impossible in the strict sense to be band-limited and time-limited (truncation effect = ˆ Windowing). • The discrete time signal Fourier transform for x[n] is: jω X(e ) = N −1 X x[n] exp(−jωn) n=0 • ω is a continuous variable. The period is 2π. Frequency discretisation is made by sampling along the frequency axis. • The Fourier transform X(ejω ) is evaluated at ωk = 2π k N where k = 0, 1, . . . , N − 1 • Define: X[k] : = X(ejω )|ω = ωk Im C N=8 Re Digital Processing of Speech and Image Signals 91 WS 2006/2007 • Discrete Fourier Transform (DFT): X[k] = N −1 X x[n] exp(−j n=0 2π k n), N k = 0, 1, . . . , N − 1 • Inverse DFT: N −1 1 X 2π x[n] = k n), X[k] exp(j N N k=0 n = 0, 1, . . . , N − 1 • Remark: This equation can be proven by inserting the equation for X[k] in the equation for x[n] and using the orthogonality: N −1 1 X 2π 1 exp j kn = 0 N n=0 N k = m N, otherwise m is integral number • Note: Consider the “analogy” between inverse DFT (above) and inverse Fourier transform of discrete time signal: x[n] = 1 2π Z2π X(ejω ) ejωn dω 0 Under the given conditions the integral is equal to the sum (without approximation!). Digital Processing of Speech and Image Signals 92 WS 2006/2007 Remarks: • DFT coefficients X[k] are not an approximation of the discrete time signal Fourier transform X(ejω ). On the contrary: X[k] = X(ejω )|ω = ωk • Number of the coefficients X[k] depends on the signal length N . A finer sampling of the discrete time signal Fourier transform is possible by appending zeros to the signal x[n] (Zero–Padding). x[n] N-1 Digital Processing of Speech and Image Signals 93 n WS 2006/2007 Interpretation of Fourier coefficients • Fourier transform X(ejω ) of the time discrete signal x[n] jω |X(e )| −π π ω • Evaluation at N discrete sampling points 2π k N ωk = yields the DFT coefficients X[k]. At first k lies in the domain k = − N N + 1, . . . , 0, . . . , . 2 2 |X(e −π -N/2+1 -1 0 1 Digital Processing of Speech and Image Signals 94 2 jω )|, |X[k]| π ω N/2 k WS 2006/2007 • Because of the periodicity of X(ejω ) the coefficients X[k] can also be obtained by shifting the sampling points with negative frequency into the positive frequency domain (by one period). Then k = 0, . . . , N/2, . . . , N − 1. X[k] = N −1 X x[n] exp(−j n=0 |X(e jω 2π k n) N )|, |X[k]| −π π 0 1 ω k N-1 2 • Interpretation of coefficients for general signal x[n]: k=0 N −1 1≤k≤ 2 N k= 2 N +1≤k ≤N −1 2 ←→ f =0 ←→ 0<f < ←→ ± ←→ Digital Processing of Speech and Image Signals 95 fS 2 fS 2 fS − <f <0 2 WS 2006/2007 Symmetric relations by real signals: • For DFT coefficients X[k] of a real signal x[n] the following holds: X[k] = X[N − k] Re(X[k]) = Re(X[N − k]) Im(X[k]) = −Im(X[N − k]) • For the amplitude spectrum | X[k] | the following holds: | X[k] |2 = Re2 {X[k]} + Im2 {X[k]} = | X[N − k] |2 Digital Processing of Speech and Image Signals 96 WS 2006/2007 Realization of DFT: /* /* /* /* PI = 3.14159265358979 x: input signal N: length of input signal Xre, Xim: real and imaginary part of DFT coefficients void dft (int N, float x[], float Xre[], float Xim[]) { int n, k; float SumRe, SumIm; for (k=0; k<=N-1; k++) { SumRe = 0.0; SumIm = 0.0; for (n=0; n<=N-1; n++) { SumRe += x[n]*cos(2*PI*k*n/N); SumIm -= x[n]*sin(2*PI*k*n/N); } Xre[k] = SumRe; Xim[k] = SumIm; } } Remark: • discrete realization 2πj 2πj • Reduction of “Fourier powers” e− N ·kn to e− N ·l (l = 0, 1, . . . , N − 1) is possible, because they are periodical (on the unit circle). Digital Processing of Speech and Image Signals 97 WS 2006/2007 */ */ */ */ 2.11 DFT as Matrix Operation • Notation with unit roots X[k] N −1 X = x[n] exp (− n=0 N −1 X = 2πj k n) N x[n] WNkn n=0 where WN := exp (− 2πj ) N N=12 W 0N =1 W 1N W 3 N W 2N • Periodicity of WN unit root: 2π k) = (WN )k N 2π := exp (−j ) N exp (−j ωk ) = exp (−j WN Digital Processing of Speech and Image Signals 98 WS 2006/2007 Note: 1. WNr = WNr mod N 2. WNkN = (WNN )k = 1k = 1 3. WN2 = [exp (− k∈Z 2πj 2πj 2 )] = exp (− 2) N N = exp (− 2πj ) = WN/2 N/2 N/2 = exp (− 2πj N ) = exp (−πj) = −1 N 2 r+N/2 = WN 4. WN 5. WN N/2 N even WNr = −WNr Digital Processing of Speech and Image Signals 99 WS 2006/2007 DFT as matrix multiplication X[k] = = = N −1 X n=0 N −1 X n=0 N −1 X n=0 x[n] exp (− 2πj k n) N WNkn x[n] {WN }kn x[n] with the matrix {WN } and the matrix elements: {WN }kn := WNkn Inversion: N −1 2πj 1 X k n) X[k] exp ( x[n] = N N = = 1 N 1 N k=0 N −1 X (WN−1 )kn X[k] k=0 N −1 X k=0 {WN−1 }kn X[k] Therefore for the matrix {WN }−1 holds: {WN }−1 := Digital Processing of Speech and Image Signals 100 1 {WN−1 } N WS 2006/2007 DFT matrix operation: properties • DFT: invertible linear mapping N complex signal values ↔ N complex Fourier components N real signal values ↔ N complex Fourier components 2 (due to symmetry) in words: DFT causes no “information loss” in the signal. • Parseval theorem for DFT general Fourier: N −1 X n=0 1 = 2π |x[n]|2 Z+π |X(ejω )|2 dω −π special DFT: (recalculate for yourself!) N −1 X n=0 2 |x[n]| N −1 1 X = |X[k]|2 N k=0 in words: 1 , the DFT is a norm conserving (= ˆ energy N conserving) transformation (mathematical terminology: “unitary”). Disregarding the factor Digital Processing of Speech and Image Signals 101 WS 2006/2007 2.12 From Continuous Fourier Transform to Matrix Representation of Discrete Fourier Transform Assumption: band-limited signal x(t) Fourier transform of the continuous time signal x(t): X(ω) = F {x(t)} = Z ∞ x(t) e−jωt dt (2.9) −∞ For the exact reconstruction (without approximation) of the continuous time signal from sampled values, the samples x[n] = x(n · Ts ) must have the distance of at most Ts = π ωB (sampling theorem). This results in the Fourier transform of the discrete time signal x[n]: jω X(e ) = ∞ X x[n] e−jωn (2.10) −∞ where ω is frequency “normalised on Ts′′ Functions (2.9) and (2.10) agree in interval [−ΩS /2, +ΩS /2] = [−ωB , +ωB ]. | X (ω)| −Ω S ωB −ω B Digital Processing of Speech and Image Signals 102 ΩS ω WS 2006/2007 1 0.8 0.6 0.4 0.2 0 0 N-1 Figure 2.12: Hanning window The signal x[n] is further decomposed by applying a window function w[n] (windowing): ( ... n = 0, . . . , N − 1 w[n] = 0 otherwise Windowed signal y[n]: y[n] = w[n] · x[n] can be analyzed using Fourier transform or DFT. jωk Y (e ) = N −1 X y[n] e−jωk n=0 DFT: ωk = Y [k] = 2πk N N −1 X where k = 0, . . . , N − 1 2π y[n] e− N kn n=0 Matrix representation:    Y [0] ..    .        Y [k]  =     ..    . Y [K − 1] K=N .. . 2πj e− N ·n·k .. . Digital Processing of Speech and Image Signals 103        y[0] .. . y[n] .. . y[N − 1]        WS 2006/2007 2.13 Frequency Resolution and Zero Padding Task: signal x[n] with finite length N is given. Wanted: Fourier transform X(ejωk ) at ωk = 2π k, K where k = 0, 1, . . . , K − 1 and K > N Inserting the definitions: jωk X(e ) = = N −1 X n=0 K−1 X x[n] exp (− 2πj k n) K x̃[n] exp (− 2πj k n) K n=0 where x̃[n] = x[n] 0 n = 0, . . . , N − 1 n = N, . . . , K − 1 i.e. Zero Padding (appending zeros). Matrix representation:    X[0] ..    .                =              ..    . X[K − 1]  x[0]  ..  .     ..  .    x[N − 1]   0  ..  . 0 WKnk               n=0 n=N −1 n=N n=K −1 Note: “Zero Padding” does not introduce any additional information into the signal. This is only a trick so that DFT and particularly FFT (Fast Fourier Transform) can be performed with a higher frequency resolution . Digital Processing of Speech and Image Signals 104 WS 2006/2007 2.14 Finite Convolution Input signal and convolution kernel have finite duration Consider “finite” convolution: • Impulse response: h[n] ≡ 0 for n 6∈ {0, 1, 2, . . . , Nh − 1} • Input signal: x[n] ≡ 0 for n 6∈ {0, 1, 2, . . . , Nx − 1} • Output signal: y[n] = = ∞ X k=−∞ N h −1 X k=0 h[k] x[n − k] h[k] x[n − k] h[k] N-1 h k x[-k] n=0 -(N-1) x • Altogether: k 0 Nx + Nh − 1 positions with “overlap” • Therefore only Nx + Nh − 1 values of output signal can be different from zero:  n > N x + Nh − 2 0 y[n] = ... n = 0, 1, . . . , Nx + Nh − 2  0 n<0 Digital Processing of Speech and Image Signals 105 WS 2006/2007 a) h[k] 1 0 Nh-1=12 k x[k] 1 0.8 0.6 0.4 0.2 0 b) Nx-1=4 k x[n-k] , n=-1 i) -Nx -1 0 Nh-1 k x[n-k] ,n=m (m>0 & m<Nh+Nx-2) ii) m-NX+1 0 m Nh-1 k x[n-k] , n=Nh+Nx-1 iii) 0 c) Nh+Nx-1 Nh-1 k y[n] 2.8 3 3 2.4 2 1.8 1.2 1 0.6 0.2 0 Nx-1 Nh-1 Nh+Nx-2 n Figure 2.13: Example of a linear convolution of two finite length signals: a) two signals; b) signal x[n-k] for different values of n: i) n < 0, no overlap with h[k], therefore convolution y[n] = 0 ii) n between 0 and Nh + Nx − 2, convolution 6= 0 iii) n > Nh + Nx − 2, no overlap with h[k], convolution y[n] = 0 c) resulting convolution y[n]. Digital Processing of Speech and Image Signals 106 WS 2006/2007 Finite convolution using DFT Convolution theorem: y[n] = ∞ X k=−∞ h[k] x[n − k] Fourier: Y (ejω ) = H(ejω ) X(ejω ), 0 ≤ ω ≤ 2π Also valid for sample frequencies: ωk := 2π k, N k = 0, . . . , N − 1 for any N Notation: Y [k] = H[k] X[k] • Question: How to choose the length N of the DFT ? – Reminder: different “lengths” – x[n]: Nx non-zero values – h[n]: Nh non-zero values – y[n]: Ny = Nx + Nh − 1 non-zero values • Answer: – The convolution theorem is certainly correct for any N > 0. – If we want to calculate the output signal completely from Y [k], we have to know Y [k] for at least N = Nx + Nh − 1 frequency values k = 0, 1, . . . , N − 1. – In words: for the DFT length N must be valid: N ≥ Nx + Nh − 1 Method: Zero Padding, i.e. appending zeros. Note: The FFT will be introduced on the next pages. A comparison of costs for realization of the finite convolution by DFT and FFT can be found at the end of paragraph 2.15. Digital Processing of Speech and Image Signals 107 WS 2006/2007 2.15 Fast Fourier Transform (FFT) Principle of FFT: Calculation of the DFT can be done by successive decomposition into smaller DFT calculations. In this way, the number of elementary operations (multiplications and additions) is dramatically reduced: FFT: N2 → N = 1024 : N ld N 2 2·N 2 · 1024 = = 200 ld N 10 operations factor of velocity gain The matrix is decomposed into a product of sparse matrices, therefore N with lot of prime factors is convenient (not necessarily only powers of two). Terminology for different variants of FFT: • in time ↔ in frequency • in place: yes/no • radix 2 ↔ radix 4 • decomposition to prime factors instead of N = 2n History 1965 Cooley and Tukey 1942 Danielson and Lanczos 1905 Runge 1805 Gauss Digital Processing of Speech and Image Signals 108 WS 2006/2007 Algorithms which are based on a decomposition of the signal x[n] are called “decimation–in–time algorithms”. The case N = 2ν is considered in the following. X[k] = = N −1 X n=0 N −1 X x[n] exp(−j 2π k n) N x[n] WNnk where k = 0, 1, . . . , N − 1 where WNnk = exp(−j n=0 2π k n) N • Decomposition of the sum over n into the sums over even and odd n: N/2−1 N/2−1 X[k] = X x[2r] WN2rk X + r=0 r=0 N/2−1 = X (2r+1)k x[2r + 1] WN x[2r] (WN2 )rk + N/2−1 WNk r=0 X x[2r + 1] (WN2 )rk r=0 • Because of WN2 = exp(−2j 2π 2π ) = exp(−j ) = WN/2 N N/2 for k = 0, . . . , N − 1 holds: N/2−1 X[k] = X N/2−1 x[2r] rk WN/2 + r=0 = G[k] + WNk X x[2r + 1] (WN/2 )rk r=0 WNk H[k] • Each of the two sums corresponds to the DFT with the length N/2. The first sum is a N/2–DFT of the even indexed signal values x[n], the second sum is a N/2–DFT of the odd indexed values. • The DFT of the length N can be obtained by getting the two N/2– DFT’s together, with the factor WNk . Digital Processing of Speech and Image Signals 109 WS 2006/2007 Complexity: The complexity O(N 2 ) of one-dimensional FT can be reduced by adequate resorting values from two FTs with length N2 and complexity O(2 · ( N2 )2 ) = N2 2 . By successive application of this resorting the complexity can be reduced to O(N log N ). The case N = 23 = 8 is considered in the following. • X[4] can be obtained from H[4] and G[4] according to previous equation. • Because of the DFT–length N 2 = 4: H[4] = H[0] and G[4] = G[0] And then: X[4] = G[0] + WN4 H[0] The values X[5], X[6] and X[7] can be obtained analogously. Flow diagram for decomposition of one N -DFT into two N/2–DFTs: x[n] X[k] G[0] x[0] X[0] G[1] x[2] x[4] 0 N W X[1] N/2-point G[2] DFT 1 N W X[2] G[3] 2 N W x[6] X[3] W3N x[1] X[4] H[0] x[3] x[5] 4 N W X[5] N/2-point H[1] 5 N W DFT X[6] H[2] 6 N W x[7] X[7] H[3] 7 N W Figure 2.14: Flow diagram for decomposition of one N -DFT to two N/2–DFTs with N =8 Digital Processing of Speech and Image Signals 110 WS 2006/2007 • Further analogous decomposition, until only DFT’s with the length N = 2 remain (so called Butterfly Operation) • Resulting flow diagram of the FFT: x[n] X[k] X[0] x[0] W0N x[4] -1 X[1] 0 N W x[2] W W x[6] X[2] -1 2 N 0 N -1 -1 X[3] W0N x[1] 1 N 0 N W W x[5] -1 2 N 0 N W W x[3] W2N W0N x[7] -1 -1 -1 W3N X[4] -1 X[5] -1 X[6] -1 X[7] -1 Figure 2.15: Flow diagram of an 8–point–FFT using Butterfly operations. Digital Processing of Speech and Image Signals 111 WS 2006/2007 Complexity reduction • Number of complex multiplications in FFT is N/2 · ld N . • Comparison: Direct application of the DFT definition needs N 2 complex multiplications. • Example: N = 1024 = 210 N2 ≈ 200 N/2 · ld N Complexity reduction by factor 200 FFT with the base 2 is not minimal according to number of additions, FFT with the base 4 can be better. Digital Processing of Speech and Image Signals 112 WS 2006/2007 Matrix representation of the FFT principle • The complex Fourier matrix can be decomposed into the product of r = ld N matrices, each of them having only two non-zero elements in each column. • The following graph shows the decomposition of the Fourier matrix in the case of inverse transformation. • w corresponds to WN−1 X = |wnk |X = T3 · T2 · T1 · TS · x This is how the decomposition into r + 1 = 4 matrices looks like (w4 = −1, w8 = 1): 1 1 1 1 nk |w | = 1 1 1 1 1 w w2 w3 w4 w5 w6 w7 1 w2 w4 w6 w8 w10 w12 w14 1 w3 w6 w9 w12 w15 w18 w21 1 w4 w8 w12 w16 w20 w24 w28 1 w5 w10 w15 w20 w25 w30 w35 1 w6 w12 w18 w24 w30 w36 w42 1 1 1 1 w7 1 w w2 w14 1 w2 w4 w21 1 w3 w6 w28 = 1 w4 1 w35 1 w5 w2 w42 1 w6 w4 w49 1 w7 w6 Digital Processing of Speech and Image Signals 113 1 w3 w6 w w4 w7 w2 w5 1 w4 1 w4 1 w4 1 w4 1 w5 w2 w7 w4 w w6 w3 1 w6 w4 w2 1 w6 w4 w2 1 w7 w6 w5 w4 w3 w2 w WS 2006/2007 Signal flow diagram The calculation operations which correspond to the matrix representation of FFT can be showed in a signal flow diagram. T3 1000 1 0 0 0 0100 0 w 0 0 0 0 1 0 0 0 w2 0 0 0 0 1 0 0 0 w3 1 0 0 0 −1 0 0 0 0 1 0 0 0 −w 0 0 0 0 1 0 0 0 −w2 0 0 0 0 1 0 0 0 −w3 T2 10 1 0 0 1 0 w2 1 0 -1 0 0 1 0 −w2 00 0 0 00 0 0 00 0 0 00 0 0 00 0 0 00 0 0 00 0 0 00 0 0 10 1 0 0 1 0 w2 1 0 −1 0 0 1 0 −w2 TS T1 T1 1100 1 -1 0 0 0011 0 0 1 -1 0000 0000 0000 0000 TS 0 0 0 0 1000 0 0 0 0 0000 0 0 0 0 0010 0 0 0 0 0000 1 1 0 0 0100 1 -1 0 0 0 0 0 0 0 0 1 1 0001 0 0 1 -1 0 0 0 0 T2 0000 1000 0000 0010 0000 0100 0000 0001 T3 x[0] x[1] X[0] X[1] -1 x[2] ω2 x[3] -1 X[2] -1 X[3] -1 x[4] x[5] ω -1 x[6] ω2 x[7] -1 Digital Processing of Speech and Image Signals 114 ω2 -1 ω3 -1 −1 X[4] −1 X[5] −1 X[6] −1 X[7] WS 2006/2007 • Matrices T1 , T2 and T3 contain exactly two non-zero elements in each row. • Non-zero elements are realizing the Butterfly Operation. • Matrix T1 : step width of the Butterfly Operation is 1 Matrix T2 : step width of the Butterfly Operation is 2 Matrix T3 : step width of the Butterfly Operation is 4 • step widths can be found: – in signal flow diagram – distance between the non-zero elements in T1 , T2 and T3 Digital Processing of Speech and Image Signals 115 WS 2006/2007 Butterfly Operation • Signal flow diagram and matrix representation of the FFT are based on the following basic operation: Xm[p] Xm-1[p] WrN Xm-1[q] Xm[q] -1 • For two input values Xm−1 [p] and Xm−1 [q] this operation produces two output values Xm [p] and Xm [q]. The output values are thereby a linear combination of the input values. • Because of the flow graph, the operation is called “Butterfly Operation”. Xm [p] = Xm−1 [p] + WNr Xm−1 [q] Xm [q] = Xm−1 [p] − WNr Xm−1 [q] Xm [p] Xm [q] = 1 WNr 1 − WNr Digital Processing of Speech and Image Signals 116 Xm−1 [p] Xm−1 [q] WS 2006/2007 Bit Reversal • The matrix representation of the FFT uses a sorting matrix, i.e. the signal which is to be transformed is at first resorted. • Example for N = 8: n Binary representation Reversed n’ 0 000 000 0 1 001 100 4 2 010 010 2 3 011 110 6 4 100 001 1 5 101 101 5 6 110 011 3 7 111 111 7 • Bit Reversal is a necessary part of the FFT–Algorithm. Bit Reversal for N = 23 Digital Processing of Speech and Image Signals 117 WS 2006/2007 2.16 FFT Implementation • Fortran version C adapted from: Oppenheim, Schafer p. 608 C SUBROUTINE FFT DecimationInTime (X, ld n) ********************************** C **************************************************************************** PARAMETER PI = 3.14159265358979 PARAMETER N max = 2048 COMPLEX COMPLEX COMPLEX COMPLEX INTEGER X(N max) ! array for input AND output Temp ! temporary storage W uni ! root of unity W pow ! powers of W uni N, ld N, ip, iq, iqbeq, j, k, i exp, istp N = 2**ld n IF (N.GT.N max) STOP C BIT Reversed Sorting ********************************************************* j = 1 DO i = 1, N-1 IF (i.LT.j) THEN ! swap X(j) and X(i) Temp = X(j) X(j) = X(i) X(i) = Temp ENDIF k = N/2 DO WHILE (k.LT.j) j = j - k k = k / 2 ENDDO j = j + k ENDDO C End of Bit Reversed Sorting ************************************************** C FFT Butterfly Operations ***************************************************** DO i=1, ld N i exp = 2**i ! exponent istp = i exp/2 ! stepsize W pow = (1.0,0.0) W uni = CMPLX (COS (PI/FLOAT(istp)), -SIN(PI/FLOAT(istp))) DO ipbeg = 1, istp DO ip = ipbeg, N, i exp iq = ip + istp Temp = X(iq) * W pow X(iq) = X(iq) - Temp X(ip) = X(iq) + Temp ENDDO W pow = W pow * W uni ENDDO ENDDO C End of FFT Butterfly Operations ********************************************** RETURN END Digital Processing of Speech and Image Signals 118 WS 2006/2007 Explanations about Fortran Program Two program parts: 1. Bit Reversal 2. Butterfly Operations • 3 loops with variables i, ipbeg, ip are controlling the execution of the Butterfly operations • outer loop i: i specifies the level of the FFT • With exception of the first level, Butterfly operations are “nested”. Therefore two loops are used for the Butterfly operations within one level. • middle loop, ipbeg: ipbeg: goes over the “nested” Butterfly operations i=1: ipbeg=1 i=2: ipbeg=1,2 i=3: ipbeg=1,2,3,4 iqbeg: specifies the sequence of starting points for inner loop • inner loop, ip: ip: specifies the first element of the Butterfly operation istp: step width of the Butterfly operation iq=ip+istp: specifies the second element for Butterfly operation inner loop is “started” once per “nesting” Digital Processing of Speech and Image Signals 119 WS 2006/2007 x[0] X[0] 0 N W x[4] -1 X[1] 0 N W x[2] 0 N W x[6] X[2] -1 2 N W -1 -1 X[3] 0 N W x[1] 0 N 1 N W x[5] W -1 W0N W2N x[3] W0N x[7] -1 W2N -1 -1 W3N -1 -1 -1 -1 X[4] X[5] X[6] X[7] Figure 2.16: Flow diagram of an 8–point–FFT using Butterfly operations. Digital Processing of Speech and Image Signals 120 WS 2006/2007 • C version (from Numerical Recipes in C) #include <math.h> #define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr void four(float data[], unsigned long nn, int isign) Replaces data[1..2*nn] by its discrete Fourier transform, if isign is input as 1; or replaces data[1..2*nn] by nn times its inverse discrete Fourier transform if insign is input as -1. data is a complex array of lenght nn or, equivalently, a real array of lenght 2*nn. nn MUST be an whole number power of 2 (this is not checked for!). { unsigned long n, mmax, m, j, istep, i; double wtemp, wr, wpr, wpi, wi, theta; Double prec. for the trigonometric recurrences. float tempr, tempi; n=nn << 1; j=1; for (i=1; i<n; i+=2) { This is the bit-reversal section of the routine. if (j > i) { SWAP (data[j], data[i]); Exchange the two complex numbers. SWAP (data[j+1], data[i+1]); } m=n >> 1; while (m >= 2 && j > m) { j -= m; m >>= 1; } j += m; } Here begins the Danielson-Lanczos section of the routine. mmax=2; while (n > mmax) { Outer loop executed log2 nn times istep=mmax << 1; theta=isign∗(6.28318530717959/mmax); Initialise the trigonometric recurrence. wtemp=sin(0.5*theta): wpr = -2.0*wtemp*wtemp; wpi=sin(theta); wr=1.0; wi=0.0; for (m=1;m <mmax;m+=2) { Here are the two nested loops. for (i=m;i<=n;i+=istep) { j=i+mmax; This is the Danielson-Lanczos formula: tempr=wr*data[j]-wi*data[j+1]; tempi=wr*data[j+1]+wi*data[j]; data[j]=data[i]-tempr; data[j+1]=data[i+1]-tempi; data[i] += tempr; data[i+1] += tempi; } wr=(wtemp=wr)*wpr-wi*wpi+wr; Trigonometric recurrence wi=wi*wpr+wtemp*wpi+wi; } mmax=istep } } Digital Processing of Speech and Image Signals 121 WS 2006/2007 • Input and output arrays (C version) a) 1 real 2 imag 3 real 4 imag 2N-3 real } } b) 1 real 2 imag 3 real 4 imag N-1 real N imag N+1 real t=0 t=∆ } } N+2 imag N+3 real N+4 imag 2N-1 real } } } } } f=0 f= 1 N∆ f = N/2 - 1 N∆ f= 1 2∆ f= N/2 - 1 N∆ f= 1 N∆ t = (N-2)∆ 2N-2 imag 2N-1 real t = (N-1)∆ 2N imag 2N imag } Figure 2.17: Input and output arrays of an FFT. a) The input array contains N (N is power of 2) complex input values in one real array of the length 2N . with alternating real and imaginary parts. b) The output array contains complex Fourier spectrum at N frequency values. Again alternating real and imaginary parts. The array begins with the zero-frequency and then goes up to the highest frequency followed with values for the negative frequencies. Digital Processing of Speech and Image Signals 122 WS 2006/2007 Finite convolution: complexity by the application of FFT Estimation of number of necessary multiplications for a convolution of x[n] and h[n] x[n]: Nx non-zero values h[n]: Nh non-zero values Realisation direct implementation DFT FFT transformation (Nx + Nh )2 Nx · Nh Nx +Nh 2 log2 (Nx + Nh ) multiplication in frequency domain Nx + Nh Nx + Nh inverse transformation (Nx + Nh )2 Digital Processing of Speech and Image Signals 123 Nx +Nh 2 log2 (Nx + Nh ) WS 2006/2007 2.17 Cyclic Matrices and Fourier Transform The Fourier transform plays a significant role for so-called cyclic matrices (cf. chapter 3.8): 0       H =       n h0 h1 h2 hN −1 . . . . . . hN −2 . . . . . . ... .. . N −1 ... ... ... ... ... ... ... ... ... ... ... ... h1 so: hN −1 hN −1 hN −2 .. . h2 h1 h0  0     m      N −1 Hmn = h(n−m)modN with so-called kernel vector (h0 , h1 . . . , hN −1 )T and hn ∈ C, mostly hn ∈ IR. Remark: using cyclic matrices it is possible to define cyclic i.e. periodic convolutions and to build a cyclic variant of the system theory. The eigenvectors of a cyclic matrix can be obtained from the columns of DFT matrix: 0 0 m N −1             n N −1 w0·0 · · · wn·0 · · · w(N −1)·0 .. .. . . wn·1 .. .. . . .. .. . . ··· ··· .. .. . . .. .. ··· . ··· . .. .. . wn·(N −1) . Digital Processing of Speech and Image Signals 124        where w = e− 2πj N      WS 2006/2007 The eigenvalues λk of a cyclic matrix with the kernel vector (h0 , h1 . . . , hN −1 )T are: λk = N −1 X 2πj hn e− N ·k·n n=0 where k = 0, 1 . . . , N − 1 The representation is oriented on L. Berg: Linear equation systems with band structure (page 52 ff). The special case of a cyclic matrix when the kernel vector h is symmetric and real is especially interesting for many applications: hn = hN −n and hn ∈ IR That means that the matrix is real, symmetric and cyclic: 0 0 N −1             h0 h1 h2 h1 . . . . . . . . . ... ... ... ... ... .. ... . h2 h3 h1 h2 Digital Processing of Speech and Image Signals 125 ... ... ... ... ... N −1  h2 h1 h2    h3  . . . ...     ...  ... h  1  h1 h0 WS 2006/2007 For such a matrix is valid: a) the eigenvalues λk are: λk = N −1 X n=0 2πkn hn cos N b) The eigenvectors can be obtained from the columns of the “discrete cos–matrix”: n     .. .     .. .         . ..           cos 2πnm )  m N     ..     .         . ..     .. . Application: diagonalisation of covariance matrices, e.g. for coding of image and speech signals. Digital Processing of Speech and Image Signals 126 WS 2006/2007 Proof: We will prove that      vk =      cos 2πk·0 N 2πk·1 cos N .. . .. . .. . 2πk·(N −1) cos N           are eigenvectors of the given matrix with eigenvalues λk . For a symmetric cyclic matrix is valid: Hmn := h(n−m)modN where hn = hN −n One row results in (for odd N ): X n N −1 Hmn · vkn 2πkm = h0 · cos + N 2 X l=1 hl · 2πk(m − l) 2πk(m + l) cos + cos N N For even N only one term for l = (N − 1)/2 can go into the sum. According to addition theorem: cos(x + y) + cos(x − y) = 2 · cos x · cos y With x = 2πkm/N and y = 2πkl/N follows: X n N −1 Hmn · vkn 2πkm + 2 = h0 · cos N N −1 2 2 X l=1 hl · 2πkm 2πkl · cos cos N N 2πkm 2πkl i · cos = h0 + 2 hl cos N | {zN } l=1 | {z } vkm h X λk = λk · vkm Digital Processing of Speech and Image Signals 127 WS 2006/2007 Excursion: Toeplitz matrices We consider only quadratic real matrices: H ∈ IRN ×N with Hmn ∈ IR a) H is a (general) Toeplitz matrix , if: Hmn = hn−m i.e.          H =          h0 h−1 h−2 .. . h1 ... h2 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... h2−N h1−N h2−N h−2  hN −2 hN −1   hN −2    ..  .   ...    ...  h2    ... h1   h−1 h0 b) For a symmetric Toeplitz matrix then holds: i.e.          H =          Hmn = h|n−m| h0 h1 h2 .. . h1 ... ... ... h2 . . . hN −2 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... hN −2 hN −1 hN −2 h2 h1 Digital Processing of Speech and Image Signals 128  hN −1   hN −2    ..  .       h2    h1   h0 WS 2006/2007 c) A cyclic matrix can be obtained by special choice: Hmn = h(n−m)modN          H =          h0 hN −1 hN −2 .. . h1 h2 . . . hN −2 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... h2 h1 h2 hN −1  hN −1   hN −2    ..  .       h2    h1   h0 Also valid: each cyclic matrix is a Toeplitz matrix. d) A symmetric cyclic matrix can be obtained by the following choice of the kernel vectors h: hn = hN −n for n = 0, . . . , N For example, we obtain for N = 8:          H =         h0 h1 h2 h3 h4 h3 h2 h1   h1 h0 h1 h2 h3 h4 h3 h2    h2 h1 h0 h1 h2 h3 h4 h3   h3 h2 h1 h0 h1 h2 h3 h4    h4 h3 h2 h1 h0 h1 h2 h3    h3 h4 h3 h2 h1 h0 h1 h2   h2 h3 h4 h3 h2 h1 h0 h1   h1 h2 h3 h4 h3 h2 h1 h0 Digital Processing of Speech and Image Signals 129 WS 2006/2007 Chapter 3 Spectral analysis • Overview: 3.1 Features for Speech Recognition 3.2 Short Time Analysis and Windowing 3.3 Autocorrelation function and Power Spectral Density 3.4 Spectrograms 3.5 Filter Bank Analysis 3.6 Mel–Scale 3.7 Cepstrum - Cepstrum Calculation from Filter Bank Output - Mel–Cepstrum according to Davis and Mermelstein 3.8 Statistical Interpretation of Cepstrum Transformation 3.9 Energy in acoustic Vector Digital Processing of Speech and Image Signals 131 WS 2006/2007 3.1 Features for Speech Recognition Architecture of an automatic speech recognition system speech signal short-time analysis each 10 ms (using FFT) sequence of acoustic vectors reference model for each word in the vocabulary pattern comparison decision Digital Processing of Speech and Image Signals 132 WS 2006/2007 Short time analysis: • window length 10–40ms • sampling period 10–20ms • in case of sampling rate of 10kHz: – Window: 100–400 samples – sampling period (frame shift): 100–200 samples Recommended windows: • Hamming • Kaiser • Blackman Model parameters: • Energy, intensity (“loudness”) • Fundamental frequency (“height”) • Spectral parameters (“colour”, “smoother” amplitude spectrum) Digital Processing of Speech and Image Signals 133 WS 2006/2007 Goal: • Ideally: Real features for the recognition • In practice: Data reduction, i.e. compact description of the speech signal (amplitude spectrum) Side effect: • Method also enables coding of speech signals using lowest possible number of bits Key words: • Fourier transform: wide band/narrow band, autocorrelation function • Filter bank • Cepstrum • Linear Predictive Coding (LPC) analysis • Fundamental frequency analysis Digital Processing of Speech and Image Signals 134 WS 2006/2007 3.2 Short Time Analysis and Windowing • The DFT is defined for signals with finite duration. • Speech signal s[n]: quasi stationary, i.e. properties do not change within 20-50 ms. • Window function w[n]: Decomposition of the original signal s[n] into (overlapping) segments using a window function w[n]: x[n] = s[n] · w[n] where for example w[n] = 1, 0, |n| ≤ N/2 otherwise • The windowed signal x[n] is analyzed with a Fourier Transform or DFT. • The multiplication of the original signal s[n] with the window function w[n] in the time domain corresponds to the convolution of the spectra of two signals S(ejω ) and window function W (ejω ) in the frequency domain: 1 X(ejω ) = 2π Zπ S(ejθ ) W (ej(ω−θ) ) dθ −π • This convolution performs a (spectral) smearing in the frequency domain (leakage). Digital Processing of Speech and Image Signals 135 WS 2006/2007 Window function: Impulse response: 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 0 0.5 fs 0 0.5 fs 0 0.5 fs 0 0.5 fs Rectangle 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 Triangle 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 Hanning 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 Hamming Digital Processing of Speech and Image Signals 136 WS 2006/2007 Window function: Impulse response: 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 0 0.5 fs 0 0.5 fs 0 0.5 fs Nuttall 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 Gauss 0 1 -10 0.8 -20 dB 0.6 0.4 -30 -40 0.2 -50 0 0 -60 -0.5 fs N-1 Chebyshev Digital Processing of Speech and Image Signals 137 WS 2006/2007 Fourier Transform of a continuous time signal SC(Ω) -Ω0 Ω0 0 Frequency graph of anti-aliasing low-pass filter Ω H(Ω) 1 -π T 0 Ω π T Ω XC(Ω) Fourier Transform of filtered signal -π T π T -Ω0 0 Ω0 Fourier Transform of sampled signal X(ejω) ω −2π −π −ω0 0 Fourier Transform of window function π ω0=ΩT 2π W(ejω) ω −2π -π −π 2π 2π n Fourier Transform of windowed signal and sampled values of continuous spectrum obtained using DFT −2π π 0 V(ejω), V[k] 0 π ω 2π Figure 3.1: Example for the application of the Discrete Fourier Transform (DFT). Digital Processing of Speech and Image Signals 138 WS 2006/2007 Properties of short-time DFT–analysis Important effects: • Picket Fence If not enough sampled values of continuous spectrum are available, spectral sampling can yield delusive results. This problem can be reduced using Zero Padding (inter-space between the coefficients S[k] becomes smaller, i.e. frequency resolution becomes better) • Leakage: Spreading of the line spectrum Because the window function is limited in time, a spreaded spectrum is measured instead of the spectrum of the original signal unlimited in time. That means, the line spectrum even becomes spreaded for pure sinusoidal signals. Digital Processing of Speech and Image Signals 139 WS 2006/2007 3 examples of DFT analysis • we observe a continuous time signal x(t) composed of two sinusoids: x(t) = A0 cos(Ω0 t) + A1 cos(Ω1 t) −∞<t<∞ • sampling according to sampling theorem (with negligible quantization errors) • discrete time signal x[n]: x[n] = A0 cos(ω0 n) + A1 cos(ω1 n) −∞<n<∞ where ω0 = Ω0 TS and ω1 = Ω1 TS • with the window function w[n]: v[n] = A0 w[n] cos(ω0 n) + A1 w[n] cos(ω1 n) Intermediate calculations: v[n] = + also modulation principle A0 A0 w[n] exp(j ω0 n) + w[n] exp(−j ω0 n) 2 2 A1 A1 w[n] exp(j ω1 n) + w[n] exp(−j ω1 n) 2 2 • Fourier Transform of the windowed signal: V (ejω ) = + A0 A0 W (ej(ω−ω0 ) ) + W (ej(ω+ω0 ) ) 2 2 A1 A 1 W (ej(ω−ω1 ) ) + W (ej(ω+ω1 ) ) 2 2 Digital Processing of Speech and Image Signals 140 WS 2006/2007 • Assume: 4π 2π · 10kHz, Ω1 = · 10kHz 14 15 1/TS = 10kHz, rectangle window with N = 64, A0 = 1, A1 = 0.75 Ω0 = • The windowed signal v[n] for the discrete time signal x(n) is therefore:  4π 2π   cos( n) + 0.75 cos( n) : 0 ≤ n ≤ 63 14 15 v[n] = :   0 : otherwise v[n] 2 1 63 0 n -1 Digital Processing of Speech and Image Signals 141 WS 2006/2007 • Fourier Transform W (ejω ) of the rectangle window function 64 −π Example 1: π 0 Leakage Effect Variation of ω0 and ω1 resp. Ω0 and Ω1 Difference between frequencies ω0 and ω1 is reduced gradually Case 1a: Ω0 = 2π 4 10 Hz, 6 Ω1 = 2π 4 10 Hz 3 ω0 = Ω0 TS = 2π 4 2π 10 Hz 10−4 s = 6 6 ω1 = Ω1 TS = 2π 2π 4 10 Hz 10−4 s = 3 3 Digital Processing of Speech and Image Signals 142 WS 2006/2007 Case 1a (continued): ω0 = 2π 6 ω1 = 2π 3 V(ω) 32 −π Case 1b: 2π 3 ω0 = 2π 6 2π 14 2π 6 0 ω1 = 2π 3 π ω 4π 15 V(ω) 32 −π 4π 15 2π 14 0 Digital Processing of Speech and Image Signals 143 2π 14 4π 15 π ω WS 2006/2007 Case 1c: ω0 = 2π 14 ω1 = 2π 12 V(ω) 30 -π Case 1d: 2π 2π 12 14 0 ω0 = 2π 14 ω1 = π ω 4π 25 V(ω) 40 −π 0 Digital Processing of Speech and Image Signals 144 π WS 2006/2007 Example 2: Picket Fence Effect DFT gives sampled values of the spectrum of the windowed signal. Spectral sampling can yield delusive results. Case 2a: • Windowed signal v[n]: ( 2π 4π cos( n) + 0.75 cos( n) : 0 ≤ n ≤ 63 v[n] = 14 15 0 : otherwise • DFT of the length N = 64 without Zero Padding Digital Processing of Speech and Image Signals 145 WS 2006/2007 a) v[n] 2 1 63 0 n -1 b) V(k) 30 0 c) 63 k 2π ω V(ω) 32 0 π Figure 3.2: a) signal v[n]; b) DFT-spectrum V [k]; c) Fourier spectrum V (ejω ). Digital Processing of Speech and Image Signals 146 WS 2006/2007 Case 2b: • In contrast to case 2a, the frequencies of sinusoids are changed only slightly. • Windowed signal v[n]: ( 2π 2π cos( n) + 0.75 cos( n) : 0 ≤ n ≤ 63 v[n] = 16 8 0 : otherwise • DFT of the length N = 64 without Zero Padding Digital Processing of Speech and Image Signals 147 WS 2006/2007 a) v(n) 1 0 63 n -1 b) V(k) 30 0 c) k 63 V(ω) 32 0 π 2π ω Figure 3.3: a) signal v[n]; b) DFT-spectrum V [k]; c) Fourier spectrum V (ejω ). Digital Processing of Speech and Image Signals 148 WS 2006/2007 Analysis of Example 2: • The manifestation of the DFT can be put down to the spectral sampling. Although in Case 2b the windowed signal v[n] contains a significant number of frequencies beyond ω0 and ω1 , they do not show in the DFT spectrum of length N = 64. • Using a rectangle window, the DFT of the sinusoidal signal gives sharp spectral lines, if the period N of the transformation is a whole multiple of the signal period and no Zero Padding is applied. • Explanation for the case of a complex exponential function: – Assume the signal x[n]: 1 2π n) exp(j N n0 x[n] = – Then: X[k] = δ(k − N ) n0 – For the DFT of rectangle window holds: W [k] = sin(πk) sin(πk/N ) – Convolution theorem for windowed signal v[n] gives: N sin π(k − ) n0 V [k] = X[k] ∗ W [k] = N sin π(k − )/N n0 – In case of N ∈ IN n0 only the DFT coefficient k = Digital Processing of Speech and Image Signals 149 N is non-zero. n0 WS 2006/2007 Example 2 (continued) • Assume signal v[n] of Case 2b: ( 2π 2π cos( n) + 0.75 cos( n) : 0 ≤ n ≤ 63 v[n] = 16 8 0 : otherwise • In contrast to Case 2b, a DFT with length N = 128 is applied (Zero Padding). • Result: Using finer sampling, existing additional frequency components emerge. Digital Processing of Speech and Image Signals 150 WS 2006/2007 a) V(k) 30 0 b) k 63 V(k) 32 0 c) 127 k V(ω) 32 0 π 2π ω Figure 3.4: a) DFT of length N = 64; b) DFT of length N = 128; c) Fourier spectrum V (ejω ). Digital Processing of Speech and Image Signals 151 WS 2006/2007 Example 3: Explanation of following illustrations: • Assume: signal of Example 2, Case 2a. • Window: Kaiser window is applied instead of rectangle window. • First: window length L = 64 and DFT length N = 64. • Then: window length L and DFT length N are halved. • Afterwards: for the case L = 32, the DFT length N is gradually increased up to N = 1024 (Zero Padding). • Finally: DFT spectrum for the case N = 1024 and L = 64. The Kaiser window is defined as:  1/2 2    I0 β 1 − [(n − α) /α] wK [n] = : 0≤n≤L−1  I0 (β)   0 : otherwise In this example: β = 0.8 and α= L−1 2 The windowed signal v[n]: v[n] = wK [n] cos( 4π 2π n) + 0.75 wK [n] cos( n) 14 15 Digital Processing of Speech and Image Signals 152 WS 2006/2007 Example 3: (continued) DFT length N = 64, window length L = 64 • Windowed signal v(n) 1 0 63 n -1 • DFT spectrum V(k) 30 0 63 Digital Processing of Speech and Image Signals 153 k WS 2006/2007 Example 3: (continued) DFT length N = 32, window length L = 32 (N and L halved) • Windowed signal v(n) 1 0 31 n • DFT spectrum V(k) 8 0 31 Digital Processing of Speech and Image Signals 154 k WS 2006/2007 Example 3: (continued) Effect of changing DFT length N at constant window length L = 32 (Zero Padding) • DFT length N = 32, window length L = 32 V(k) 8 0 31 k 63 k • DFT length N = 64, window length L = 32 V(k) 8 0 Digital Processing of Speech and Image Signals 155 WS 2006/2007 Example 3: (continued) • DFT length N = 128, window length L = 32 V(k) 8 0 127 k 1024 k • DFT length N = 1024, window length L = 32 V(k) 8 0 Digital Processing of Speech and Image Signals 156 WS 2006/2007 Example 3: (continued) Increasing the window length (L) • DFT length N = 1024, window length L = 64 V(k) 16 0 1024 Digital Processing of Speech and Image Signals 157 k WS 2006/2007 Example 4: Window function influence on the spectrum speech signal phoneme "a" 0 0 amplitude spectrum - rectangle window - 0 amplitude spectrum - Hamming window - 0 Figure 3.5: Influence of the window function: above: speech signal (vowel “a”); central: 512 point FFT using rectangle window; below: 512 point FFT using Hamming window Digital Processing of Speech and Image Signals 158 WS 2006/2007 3.3 Autocorrelation Function and Power Spectral Density Definition of Autocorrelation Function (ACF) analog to the continuous time case: R[k] : = ∞ X x[n] x[n + k] n=−∞ For a signal x[n] assume (e.g. after some suitable windowing): x[n] 0≤n≤N −1 x[n] = 0 otherwise In this case the ACF gives: R[k] = NX −1−k x[n] x[n + k] because x[n] = 0 for n < 0 and n ≥ N n=0 “triangular effect” number of terms in R[k] -N N k Cross correlation: Rxy [k] = ∞ X x[n] · y[n − k] ∞ X x[n] · y[k − n] n=−∞ In contrast to convolution: Oxy [k] = n=−∞ Digital Processing of Speech and Image Signals 159 WS 2006/2007 Properties of ACF: 1. R[k] = R[−k] 2. R[k] ≤ R[0] for each k ∈ IN (R[0]: energy, intensity) 3. If x[n] −→ R[k], then α x[n] −→ α2 R[k] 4. Intensity spectrum is the Fourier Transform of the ACF: | X(ejω ) |2 = X(ejω ) · X(ejω ) ∞ ∞ X X = x[k] exp(−jωk) · x[l] exp(jωl) = = = = k=−∞ ∞ X l=−∞ ∞ X k=−∞ l=−∞ ∞ ∞ X X x[k] x[l] exp(−jωk) exp(jωl) x[k + l] x[l] exp(−jωk) exp(−jωl) exp(jωl) k=−∞ l=−∞ ∞ ∞ X X k=−∞ ∞ X x[k + l] x[l] l=−∞ ! exp(−jωk) R[k] exp(−jωk) k=−∞ Note: The phase spectrum is removed. Digital Processing of Speech and Image Signals 160 WS 2006/2007 5. Because of the symmetry R[k] = R[−k] the DFT becomes the cosine transform: jω 2 | X(e ) | = = ∞ X R[k] exp(−jωk) k=−∞ N −1 X k=−(N −1) = R[0] + R[k] exp(−jωk) N −1 X R[k] (exp(−jωk) + exp(jωk)) k=1 N −1 X = R[0] + 2 · R[k] cos(ωk) because R[k] = R[−k] k=1 6. The intensity spectrum | X(ejω ) |2 is a polynom of cos(ω) with grade N − 1. Reason: Moivre formula: k k cosk−4 (ω) sin4 (ω) − . . . cosk−2 (ω) sin2 (ω) + cos(ωk) = cosk (ω) − 4 2 Digital Processing of Speech and Image Signals 161 WS 2006/2007 Example 1: Spectral analysis using ACF speech signal phoneme "a" 0 0 amplitude spectrum - Hamming window - amplitude spectrum - short hamming window - 0 0 amplitude spectrum - 19 ACF-coefficients - amplitude spectrum - 13 ACF-coefficients - 0 0 Figure 3.6: Fourier Transform of a voiced speech segment: a) signal progression, b) high resolution Fourier Transform, c) low resolution Fourier Transform with short Hamming window (50 sampled values), d) low resolution Fourier Transform using autocorrelation function (19 coefficients), e) low resolution Fourier Transform using autocorrelation function (13 coefficients) Digital Processing of Speech and Image Signals 162 WS 2006/2007 Example 2: ACF of voiced and unvoiced speech segments speech signal phoneme "s" speech signal phoneme "a" 0 0 0 0 autocorrelation - rectangle window - autocorrelation - rectangle window - 0 0 0 0 autocorrelation - Hamming window - autocorrelation - Hamming window - 0 0 0 0 Figure 3.7: Signal progression and autocorrelation function of voiced (left) and unvoiced (right) speech segment Digital Processing of Speech and Image Signals 163 WS 2006/2007 Example 3: Temporal progression of autocorrelation coefficients speech signal - digit sequence 0861909 0 0 ACF - coefficient for index 3 ACF - coefficient for index 0 (energy) 0 0 0 0 0 ACF - coefficient for index 6 ACF - coefficient for index 9 0 0 Figure 3.8: Temporal progression of speech signal and four autocorrelation coefficients Digital Processing of Speech and Image Signals 164 WS 2006/2007 3.4 Spectrograms Using DFT • Wide-band: in frequency domain: – short time window – “interaction” in the “synchronization” between time window and “pitch impulses” – vertical lines – no resolution of spectral fine structure • Narrow-band: in frequency domain: – long time window – good resolution of the spectral fine structure Digital Processing of Speech and Image Signals 165 WS 2006/2007 Example 1: speech spectrograms Figure 3.9: a) wide-band spectrogram: short time window, high time resolution (vertical lines), no frequency resolution; for voiced signals provides information on formant structure b) narrow-band spectrogram: long time window, no time resolution, high frequency resolution (horizontal lines); for voiced signals provides information on fundamental frequency (pitch) Digital Processing of Speech and Image Signals 166 WS 2006/2007 Example 2: speech spectrograms Figure 3.10: Wide-band and narrow-band spectrogram and speech amplitude for the sentence “Every salt breeze comes from the sea”. Digital Processing of Speech and Image Signals 167 WS 2006/2007 3.5 Filter Bank Analysis • History: Decomposition of the signal using a “bank” of band-pass filters and energy calculation in each frequency band transfer function f • Today digitally: – Digital filters: yk [n] = ∞ X m=−∞ hk [n − m] x[m] , k = 1, . . . , K – FIR: Finite Impulse Response IIR: Infinite Impulse Response (recursive filters) – DFT (FFT) + further processing • DFT/FFT Method: – Window function – Appending zeros for desired “resolution” (zero padding) – FFT – “Energy” calculation: |X(ejω )|, |X(ejω )|2 , log |X(ejω )| – Weighted averaging for each channel and frequency band respectively Digital Processing of Speech and Image Signals 168 WS 2006/2007 DFT/FFT filter bank: transfer function f transfer function f Digital Processing of Speech and Image Signals 169 WS 2006/2007 • Averaging: summation should be as smooth as possible over all channels Form: rectangle, triangle, trapeze, etc. • Choosing the central frequencies fk : – constant: ∆fk = const. for all k e.g. 20 channels with ∆f = 200Hz for 0 − 4 kHz – constant relative band width: ∆fk = const. for all k fk – frequency groups of the ear (total number 24): f < 500Hz : f ≥ 500Hz : ∆f = 100 ∆f = 20% f – adjusted to vowels or sounds Digital Processing of Speech and Image Signals 170 WS 2006/2007 3.6 Mel-frequency scale The frequency resolution of the human ear is decreasing on the higher frequencies. This empirical dependency results in the definition of the Mel scale, which is approximately calculated as (from: Hidden Markov Toolkit, Cambridge University Engineering Departement, S.J.Young): f ) fMEL = 2595 log10 (1 + 700Hz f MEL 2700 7000 f / Hz Compression of the high frequencies f fMEL A filter bank with constant band-widths can be used on the Mel scale: f MEL Digital Processing of Speech and Image Signals 171 WS 2006/2007 Table: MEL Scale: f /Hz fMEL 65 100 136 200 213 300 298 400 391 500 492 600 603 700 724 800 856 900 1000 1000 1158 1100 1330 1200 1519 1300 1724 1400 1949 1500 2195 1600 2464 1700 2757 1800 3078 1900 3429 2000 3812 2100 4230 2200 4688 2300 5187 2400 5734 2500 6331 2600 6984 2700 Digital Processing of Speech and Image Signals 172 WS 2006/2007 3.7 Cepstrum The Cepstrum is the Fourier series expansion of the logarithm of the spectrum. Comparison: autocorrelation function is a Fourier series of the normal spectrum. We consider: y[n] = ∞ X k=−∞ h[n − k] x[k] Goal: Separating the kernel h[n] from the input signal x[n]. This problem is also called inversion or deconvolution. • Convolution theorem: Y (ejω ) = H(ejω ) X(ejω ) • Logarithm (complex): log Y (ejω ) = log H(ejω ) + log X(ejω ) • Inverse Fourier Transform: F −1 log Y (ejω ) = F −1 log H(ejω ) + F −1 log X(ejω ) Digital Processing of Speech and Image Signals 173 WS 2006/2007 • Another notation: ŷ[n] = x̂[n] + ĥ[n] using the definition of the cepstrum for x[n] (analogous for ŷ[n] and ĥ[n]) x̂[n] = F −1 log X(ejω ) Zπ 1 exp(jωn) log X(ejω ) dω = 2π −π # " Zπ X 1 x[m] exp(−jωm) dω = exp(jωn) log 2π m −π = C {x[n]} • Note: – Cepstrum = artificial word derived from “spectrum” – Cepstrum is located in time domain Digital Processing of Speech and Image Signals 174 WS 2006/2007 Through the cepstrum transformation x[n] −→ x̂[n] = C {x[n]} the convolution comes down to a simple addition. In the cepstrum domain, a linear operation L (time invariance is not necessary) on ŷ[n] is performed separately on ĥ[n] and x̂[n]: y[n] = ∞ X k=−∞ h[n − k] x[k] ŷ[n] = ĥ[n] + x̂[n] o n L {ŷ[n]} = L ĥ[n] + L {x̂[n]} With the definition GL for the concatenation of the cepstrum, the operation L, and the inverse cepstrum GL := C −1 ◦ L ◦ C we obtain GL {h[n] ∗ x[n]} = GL {h[n]} ∗ GL {x[n]} . Such a transformation GL acts on h[n] and x[n] separately, and is called: homomorph (structure preserving) Digital Processing of Speech and Image Signals 175 WS 2006/2007 Complex cepstrum: 1 x̂[n] = 2π Zπ exp(jωn) logX(ejω ) dω −π Note: complex logarithm Simple cepstrum (real cepstrum): 1 x̃[n] = 2π Z2π exp(jωn) log|X(ejω )| dω 0 • Cepstrum: Fourier coefficients of the logarithmized power spectral density • ACF: Fourier coefficients of Fourier series of the power spectral density Setting cepstral coefficients x̂[n] to zero for high n results in smoothing of the power spectral density. Implementation: Fourier Transform via N –FFT (N = 512, 1024, 2048) (But: discretisation error): 2π 2π N −1 j k 1 X j kn x̂[n] := log |X(e N )| e N N k=0 Digital Processing of Speech and Image Signals 176 WS 2006/2007 Example 1: Real cepstrum Fine structure of power spectral density with the period 1/T results in a single peak in the cepstrum at time T . log|F(ω)|2 0 1 T frequency F-1(log|F(w)|2) T 0 time Figure 3.11: Above: logarithmized power spectrum of a spoken vowel (schematic). Below: corresponding cepstrum (inverse Fourier–transform of the logarithmized power spectrum). Digital Processing of Speech and Image Signals 177 WS 2006/2007 Example 2: Smoothing speech signal phoneme "a" 0 0 windowed phoneme "a" - Hamming window - 0 0 spectrum from cepstrum whole cepstrum first 13 coefficients 0 Figure 3.12: Cepstral smoothing: speech signal (vowel “a”), windowed speech signal (Hamming window), spectrum obtained from the whole cepstrum (blue) and smoothed spectrum obtained from the first 13 cepstral coefficients (red). Digital Processing of Speech and Image Signals 178 WS 2006/2007 Example 3: Smoothing with different numbers of cepstral coefficients speech signal phoneme "a" 0 0 spectrum from cepstrum whole cepstrum first 13 coefficients 0 spectrum from cepstrum whole cepstrum first 19 coefficients 0 spectrum from cepstrum whole cepstrum first 19 coefficients first 13 coefficients 0 Figure 3.13: Homomorph analysis of a speech segment: signal progression, homomorph smoothed spectrum using 13 and 19 cepstral coefficients Digital Processing of Speech and Image Signals 179 WS 2006/2007 Cepstrum calculation using Filter Bank Output • Filter bank outputs A[k] for k = 1, . . . , K Note: k = 0 is missing. • We complete the outputs symmetrically: A -K+1 A -1 A A A A 0 1 2 K • Symmetry A−k+1 = Ak for all k = 1, . . . , K. Digital Processing of Speech and Image Signals 180 WS 2006/2007 Inverse DFT a[n] of the symmetric sequence A−K+1 , . . . , AK : 1 a[n] = 2K K X k=−K+1 2πj nk Ak exp 2K K 2πj 2πj 1 X nk + exp n(−k + 1) Ak exp = 2K 2K 2K k=1 K 2πj 2πj 1 X 2πj = exp 0.5 n(k − 0.5) + exp − n(k − 0.5) Ak exp 2K 2K 2K 2K k=1 K πn 1 X 2πj 0.5 (k − 0.5) Ak cos = exp 2K K K k=1 The phase term exp around k = 0.5. 2πj 2K 0.5 depends on the position of the symmetry axis Cepstrum is defined as: a[n] = K πn 1 X (k − 0.5) Ak cos K K k=1 Digital Processing of Speech and Image Signals 181 WS 2006/2007 Mel Cepstrum according to Davis and Mermelstein f = 100 MEL k=1 f f = 300 MEL MEL k=K k=3 Filter bank: • overlapping band-pass filters triangular shape, • all channels have equal band width, and filter positioning is equidistant on a Mel scale. Calculation of the filter bank outputs: • magnitude of DFT coefficients, • for each channel summation of the magnitudes according to triangular weight function, • for each channel logarithm of the sum. Thus the filter outputs A[k] with k = 1, . . . , K are obtained. Using the filter bank outputs, the cepstrum is calculated using a cosine transform. (see previous description) Digital Processing of Speech and Image Signals 182 WS 2006/2007 3.8 Statistical Interpretation of the Cepstrum Transformation We consider the filter bank outputs log|Xk |. log |X k| 0 s p N/2 k Assumption: The correlation between the outputs s and p, i.e. the element Csp of the covariance matrix does not depend directly on s or p, but only on their difference. Because the spectrum is periodical there is no distance greater than N : Csp = c(s−p)modN It is further assumed that the correlation is locally symmetric: Cs,s+n = Cs,s−n Then: c(s−s−n)modN = c(s−s+n)modN c(−n)modN = c(+n)modN With 0 ≤ n ≤ N follows: cn = cN −n i.e. we have a symmetric cyclic matrix with the kernel vector c. Digital Processing of Speech and Image Signals 183 WS 2006/2007 Example: the covariance matrix for N  c0 c1 c2 c3 c c c c  1 0 1 2 c c c c  2 1 0 1  c c c c C =  3 2 1 0  c4 c3 c2 c1   c3 c4 c3 c2   c2 c3 c4 c3 c1 c2 c3 c4 = 8: c4 c3 c2 c1 c0 c1 c2 c3 c3 c4 c3 c2 c1 c0 c1 c2 c2 c3 c4 c3 c2 c1 c0 c1 c1 c2 c3 c4 c3 c2 c1 c0              Such a covariance matrix will be diagonalised using the cosine transform (or Fourier Transform, which results in the cosine transform due to the symmetry) (see excursion in chapter 2.17). Digital Processing of Speech and Image Signals 184 WS 2006/2007 3.9 Energy in acoustic Vector The energy is usually added as zeroth (or first) component to the acoustic vector. For the logarithmic energy we have: log E = 1 2π Zπ log|X(ejω )|2 dω −π For the (short time) spectrum or cepstrum it approximately holds: K 1 X log E ≈ log|Xk |2 K k=1 Spectra are usually normalized with log E: logYk2 = log|Xk |2 − log E such that: K X k=1 logYk2 ≡ 0 The cepstral coefficient x̂[0] is the logarithmized energy. Digital Processing of Speech and Image Signals 185 WS 2006/2007 186 Chapter 4 Fourier Transform and Image Processing • Overview: 4.1 Spatial Frequencies and Fourier Transform for Images 4.2 Discrete Fourier Transform for Images 4.3 Fourier Transform in Computer Tomography 4.4 Fourier Transform and RST Invariance Digital Processing of Speech and Image Signals 187 WS 2006/2007 4.1 Spatial Frequencies and Fourier Transform for Images A grey-valued image g(x, y) can be interpreted as: g : IR2 → [0, ∞[ (x, y) → g(x, y) Space coordinates (x, y) are at first considered as continuous. Discretization and DFT will be analyzed later. Convention: g(x, y) ≡ 0 outside of the image. The Fourier–transform G(fx , fy ) of the image g(x, y) is defined as: G(fx , fy ) = F {(x, y) → g(x, y)} Z∞ Z∞ g(x, y) e−2πj(fx x+fy y) dxdy = −∞ −∞ The arguments fx and fy are called spatial frequencies. Digital Processing of Speech and Image Signals 188 WS 2006/2007 The two-dimensional Fourier–transform can be obtained by using two onedimensional Fourier–transforms. We consider one “image row” with a constant value of y: x → g(x, y) Corresponding Fourier–transform Gy (fx ) : Gy (fx ) = F {x → g(x, y)} Z∞ g(x, y) e−2πjfx x dx = −∞ Then we compute the Fourier–transform of the function: y → Gy (fx ) and obtain: F {y → Gy (fx )} = = Z+∞ Gy (fx ) e−2πjfy y dy −∞ Z+∞ Z+∞ g(x, y) e−2πj(fx x+fy y) dxdy −∞ −∞ In this way we get the result: G(fx , fy ) = F {y → F {x → g(x, y)}} For the inverse FT we have (as can be expected): g(x, y) = F −1 {(fx , fy ) → G(fx , fy )} Z+∞ Z+∞ G(fx , fy ) e2πj(fx x+fy y) dfx dfy = −∞ −∞ Digital Processing of Speech and Image Signals 189 WS 2006/2007 We would like to interpret the two-dimensional FT visually. For this purpose, we consider the exponential factor in the FT and require the following condition: ! e2πj(fx x+fy y) = 1 ⇒ 2πj(fx x + fy y) = 2πn ⇒ y = − for n ∈ IN fx n x + fy fy y 1/fy ∆L θ 1/fx x 1 ∆L = q fx2 + fy2 spatial period Digital Processing of Speech and Image Signals 190 WS 2006/2007 Special case: |G(fx , fy )| has a large value only at “one” point (u, v) = (fx , fy ) in the spatial frequency plane fy |G(fx,fy)| v -u u fx -v y x Digital Processing of Speech and Image Signals 191 WS 2006/2007 Since G(fx , fy ) = G(−fx , −fy ) for a real image g(x, y), we have two dominant frequency pairs in the Fourier–transform integral: |G(u, v)| · [e2πj(ux+vy) + e−2πj(ux+vy) ] = 2|G(u, v)| · cos 2π(ux + vy) This function describes a black-white cosine wave pattern with (fx , fy ) = (u, v) Where is the value of G(fx , fy ) large ? ideally: points (u, v) and (−u, −v) represent cosine–variant of the grey values really: straight line through (u, v) and (−u, −v) represents abrupt changes of the grey values Digital Processing of Speech and Image Signals 192 WS 2006/2007 Figure 4.1: TV–image (analog) Figure 4.2: Digitized TV–image Figure 4.3: Amplitude spectrum of Figure 4.2 Figure 4.4: Low-pass filtered Digital Processing of Speech and Image Signals 193 WS 2006/2007 Figure 4.5: High-pass filtered Figure 4.6: High-pass enhancement Explanation for figures 4.1–4.6 (from Duda & Hart 1973, pp. 310–312): Figure 4.1: TV–image (analog) Figure 4.2: digitized TV–image - 120×120 pixels - grey values from 0 (black) to 15 (white) Figure 4.3: Fourier–transform of the image from Figure 4.2 (amplitude spectrum) log|G(fx , fy )|: black = ˆ high amplitude note: 1. strong components along the axes = ˆ vertical and horizontal image edges 2. concentration around (fx , fy ) = (0, 0) = ˆ regions with constant grey values Figure 4.4: Low-pass filter: H(fx , fy ) = [cos(πfx ) · cos(πfy )]16 0≦H≦1 Figure 4.5: High pass filter: H(fx , fy ) = 1.5 − [cos(πfx ) · cos(πfy )]4 0.5 ≦ H ≦ 1.5 Figure 4.6: High pass enhancement: H(fx , fy ) = 2.0 − [cos(πfx ) · cos(πfy )]4 1.0 ≦ H ≦ 2.0 Digital Processing of Speech and Image Signals 194 WS 2006/2007 Following general rules for G(fx , fy ) ensue: Edges in the image g(x, y): • An image edge produces “strong” spatial frequency components along one straight line in the spatial frequency plane which is orthogonal to the edge. • The “sharper” the edge is, the “longer” is the corresponding line in the spatial frequency domain. Regions with constant grey values: • Regions with constant grey values increase the values of |G(fx , fy )| around the origin (fx , fy ) = (0, 0). (fx , fy ) = (0, 0) is called “DC component” (average grey value, DC=direct current). Digital Processing of Speech and Image Signals 195 WS 2006/2007 4.2 Discrete Fourier Transform for Images The analog image g(x, y) is discretized (“sampled”) along both axes. We obtain the discrete image: g[j, k] := g(j · ∆x, k · ∆y) where j, k = 0, 1, . . . , N − 1 √ Change in notation: i = −1 instead of j G(e 2πi N ·u ,e 2πi N ·v ) = −1 N −1 N X X 2πi g[j, k]e− N (uj + vk) where u, v = 0, 1, . . . , N − 1 j=0 k=0 Discretization is written as: −1 N −1 N X X G[u, v] = 2πi g[j, k] e− N (uj + vk) j=0 k=0 N −1 X = j=0 − 2πi N uj e · N −1 X − 2πi N vk g[j, k] e k=0 ! Interpretation: Fourier–transform of the image is first performed row by row, then column by column. 2πi Using usual definition of the “Fourier matrix” W (i.e. Wvk = (e− N )vk ), we obtain the matrix representation of Fourier–transform. Using the notation: g ∈ IRN xN W ∈ CN xN G ∈ CN xN we obtain G = g = [W g W ] 1 · [W −1 G W −1 ] 2 N Note: In the corresponding definition of the Fourier–transform, instead of the factors 1 and 1/N 2 we have the factors 1/N and 1/N or 1/N 2 and 1. Digital Processing of Speech and Image Signals 196 WS 2006/2007 4.3 Fourier Transform in Computer Tomography b y x y=ax+b, a const. We consider a projection of the image g(x, y) along the straight line: y = ax + b We produce a set of straight lines by keeping a constant and varying b. Projection: ga (b) = Fourier–transform: Z g(x, ax + b) dx Z ga (b) e−2πjfb b db Z Z = g(x, ax + b) e−2πjfb b db dx Ga (fb ) = We substitute b = y − ax and obtain: Z Z Ga (fb ) = g(x, y) e−2πj(yfb −xafb ) dydx = G(−afb , fb ) = Fourier–transform G(fx , fy ) of g(x, y) along 1 the spatial frequency straight line (fx , fy ) with fy = − fx a Digital Processing of Speech and Image Signals 197 WS 2006/2007 Remarks: a. Straight line in spatial frequency domain: (fx , fy ) = (−afb , fb ) is orthogonal to y = ax + b: 1 y = ax + b => in Fourier–transform fy = − fx a The angle between these straight lines is a right angle because 1 a· − a = −1. In general: y1 (x) = m1 x + b1 y2 (x) = m2 x + b2 y1 (x) ⊥ y2 (x) ⇔ m1 · m2 = −1 b. The value Ga (fb ) is independent of the “offset” b and depends only on the orientation a of the straight line. Therefore, if we calculate the projection for many different inclinations a and apply the one-dimensional FT, we obtain the two-dimensional FT of the image g(x, y). Digital Processing of Speech and Image Signals 198 WS 2006/2007 4.4 Fourier Transform and RST Invariance We will investigate invariance of the Fourier–transform to R : Rotation S : Scaling T : Translation We will use vector notation for the two-dimensional Fourier–transform: x coordinates: z = ∈ IR2 y image grey values: g(z) = g(x, y) ∈ IR+ spatial frequency: f fx = fy ∈ IR2 We ignore the discretization. Digital Processing of Speech and Image Signals 199 WS 2006/2007 Translation z → z + z0 x0 with translation vector z0 = ∈ IR2 y0 Image : g(z) → g̃(z) := g(z + z0 ) FT : G̃(f ) = exp (i[fx x0 + fy y0 ]) · G(f ) Rotation z → Dϑ z with rotation matrix Dϑ = cos ϑ sin ϑ − sin ϑ cos ϑ y ϑ x Digital Processing of Speech and Image Signals 200 WS 2006/2007 Scaling z → Image : g(z) → FT : G̃(f ) = α·z with scaling factor α > 0 g̃(z) = g(α · z) f 1 · G α2 α basically: similarity principle (S.??) for one-dimensional FT transferred to two dimensions Invertible linear mapping z Image : g(z) FT : → A·z where A ∈ IR2×2 invertable → G̃(f ) = = Proof: g̃(z) = g(Az) ... 1 · G̃((A−1 )T f ) det(A) transformation of two-dimensional integration variables Digital Processing of Speech and Image Signals 201 WS 2006/2007 We apply two basic rules to obtain the RST-invariance: 1. Invariance to translation (=T) can be obtained by using the square of the absolute value g(z) → G(f ) → |G(f )|2 2. To obtain RS-invariance we transfer to polar coordinates in the spatial frequency domain. We write in complex notation: fz ≡ fx + i fy = r eiφ = exp (ln r + iφ) ∈ C2 Complex logarithm: fz′ := ln fz = ln r + iφ We already know: a) rotation by angle φ0 in spatial domain = ˆ rotation by angle φ0 in spatial frequency domain b) scaling with factor α in spatial domain 1 1 2 = ˆ scaling with factor respectively in spatial and α α frequency domain fz′ scaling and −→ rotation Digital Processing of Speech and Image Signals 202 f˜z′ WS 2006/2007 f˜z′ = ln z̃ = ln r̃ + iφ̃ r = ln + i(φ − φ0 ) α = ln r + i φ − ln α − i φ0 = fz′ − ln α − i φ0 = translation with the shift vector (− ln α − i φ) ∈ C2 in logarithmic polar coordinates of the spatial frequency plane Digital Processing of Speech and Image Signals 203 WS 2006/2007 RST-invariant features can therefor be obtained as follows: g(x, y) image: (x, y) ∈ IR2 first Fourier–transform G(fx , fy ) = F {g(x, y)} mit (fx , fy ) ∈ IR2 squared absolute value |G(fx , fy )|2 logarithmic polar coordinates: (ln r, φ) |G(ln r, φ)|2 second Fourier–transform F (|G(ln r, φ)|2 ) squared absolute value |F (|G(ln r, φ)|2 )|2 Next page: Analysis of the RST-invariant features. Original grey valued images are identical up to a 90◦ –rotation. Digital Processing of Speech and Image Signals 204 WS 2006/2007 y 6 - x original image fy 6 - f x |FFT| φ 6 - log–polar ln r f˜y 6 - f˜ x |FFT| Digital Processing of Speech and Image Signals 205 WS 2006/2007 Warning: a) Invariant observations are not necessarily good for classification. b) Observations that are calculated using the two-dimensional Fourier– transform are not complete, i.e. the original image cannot be reconstructed completely. Digital Processing of Speech and Image Signals 206 WS 2006/2007 Chapter 5 LPC Analysis • Overview: 5.1 Principle of LPC Analysis 5.2 LPC: Covariance Method 5.3 LPC: Autocorrelation Method 5.4 LPC: Interpretation in Frequency Domain 5.5 LPC: Generative Model 5.6 LPC: Alternative Representations The acronym LPC stands for Linear Predictive Coefficients / Coding and is utilized in signal processing and frequency analysis, as well as in signal coding. Digital Processing of Speech and Image Signals 207 WS 2006/2007 5.1 Principle of LPC Analysis n-2 n time We consider a discrete time signal x[n], possibly multiplied with a window function. The goal of an LPC analysis is to predict each signal value x[n] by its preceding values x[n − 1], x[n − 2], ..., x[n − K]. We distinguish: x[n] : x̂[n] : signal value predicted value We assume the predicted value x̂[n] to be a linear combination of the preceding values of x[n]: x̂[n] := K X k=1 αk x[n − k] with at first unknown coefficients αk , k = 1, ..., K, which are called LPC–coefficients or prediction coefficients. The value K is called prediction order, e.g. K = 8, . . . , 10 at a sampling frequency of 4 kHz (about 2 coefficients per kHz). Digital Processing of Speech and Image Signals 208 WS 2006/2007 Outlook Starting point: “coding” in time domain (goal: bit reduction) ↓ Parseval Theorem parametric model for power spectrum of Fourier–transform (more exact: rough structure of power spectrum for speech signal) LPC analysis applications: • speech coding (ADPCM = adaptive differential pulse code modulation) • signal processing: parametric modelling with autoregressive or all-pole models (order K) • time curves: resonance and oscillator curves, sun spots, stock-market course, ... • image coding • also: interpretation as Maximum Entropy Approach Digital Processing of Speech and Image Signals 209 WS 2006/2007 The coefficients αk are unknown at first. To estimate these, we define the prediction error for each point n in time: e[n] := x[n] − x̂[n] K X = x[n] − αk x[n − k] k=1 For a reliable set of LPC–coefficients we calculate the squared error criterion E as sum of the squared prediction errors e[n]: X e2 (n) E = n = X n ! = " x[n] − K X k=1 αk x[n − k] #2 minimum with respect to α1 , . . . , αk , . . . , αK ∂ Taking the derivative ∂α for l = 1, . . . , K results in: l P P ! x[n] − αk x[n − k] x[n − l] = 0 n P k k P P αk x[n − k]x[n − l] = x[n − l]x[n] n n Here, the summation limits are not specified on purpose. If the squared error criterion E is considered as a function of LPC–coefficients, the following properties ensue: • E is quadratic in α1 , . . . , αk , . . . , αK ; it is guaranteed to be nonnegative and it has a single well-defined minimum. • The optimal LPC–coefficients are invariant to linear scaling of the signal values x[n]. Digital Processing of Speech and Image Signals 210 WS 2006/2007 Minimization of the squared error criterion with respect to the LPC– coefficients results either from taking the derivative or from the “quadratic complement” (recalculate for yourself!). The linear equation system for the LPC–coefficients αk ensues: l = 1, . . . , K : K P k=1 αk · P n x[n − k] x[n − l] = X n x[n − l] x[n] with still unspecified summation limits over n. We consider two methods for the choice of summation limits: 1. covariance method 2. autocorrelation method Warning: terminology is not consistent. Digital Processing of Speech and Image Signals 211 WS 2006/2007 5.2 LPC: Covariance Method known values predicted value 0 n N-1 Covariance Method • No window function is applied, such that we obtain the following summation limits: X 2 e (n) = n N −1 X e2 (n) n=0 i.e. we also use signal values x[n] with n < 0 for prediction. • The resulting equation system for LPC–coefficients: l = 1, . . . , K : K X αk Φ(l, k) = Φ(l, 0) k=1 with the definition: Φ(l, k) := N −1 X n=0 x[n − l] x[n − k] For the above terms hold: – they describe a kind of cross correlation between two “signals” – they are similar to a covariance matrix Computational complexity for solving the equation system: O(K 3 ) + O(N K) – autocorrelation method has more favorable complexity: O(K 2 ) – but: calculation of auto/cross-correlation function dominates In contrast to covariance method, autocorrelation method offers an interpretation in the frequency domain and therefore is often preferred. Digital Processing of Speech and Image Signals 212 WS 2006/2007 5.3 LPC: Autocorrelation Method window function 0 n N-1 We consider the signal after multiplication with a convenient window function, usually Hamming window: In principle, the summation limits now are X 2 e [n] = n=+∞ X e2 [n] . n=−∞ n Since, due to windowing the signal x[n] is identical to zero outside the window function, i.e. x[n] ≡ 0 for n < 0 or N − 1 < n we obtain the following for the prediction error e[n]: e[n] ≡ 0 for n < 0 or N − 1 + K < n Therefore, the total error E becomes: E = NX +K−1 e2 [n] n=0 The prediction error e[n] can become “large” on the window function boundaries: - Beginning: prediction from ”zeros” - End: prediction of ”zeros” Digital Processing of Speech and Image Signals 213 WS 2006/2007 Inserting the summation limits: X x[n − k] x[n − l] = R(|l − k|), n where R(|l − k|) = R(|l|) = NX −1−l n=0 X n X n x[n] x[n − l] = R(|l|) x[n − k] x[n − l] x[n] x[n − l] = NX −1−l n=0 x[n] x[n − l] In this way we obtain the following equation system for the LPC–coefficients αk : l = 1, ..., K : K X k=1 αk R(|l − k|) = R(l) or in matrix form:               R(0) R(1) R(1) .. . R(0) .. . R(K − 1) R(K − 2)     α1 R(1)    α2  ... R(K − 2)    R(2)           ..  .          =    .. . . ...   ..   ..   .             R(1)        . . . R(1) R(0) αK R(K) ... R(K − 1) Digital Processing of Speech and Image Signals 214 WS 2006/2007 Note that this equation system is completely determined by the autocorrelation coefficients R(0), ..., R(k), ..., R(K). Hence, the autocorrelation coefficients will “only” be converted to obtain the LPC–coefficients α1 , ..., αk , ..., αK . The matrix of this equation system has the following properties: - Toeplitz structure (follows from time invariance) - solution: Durbin–algorithm with complexity O(K 2 ) Digital Processing of Speech and Image Signals 215 WS 2006/2007 5.4 LPC: Interpretation in Frequency Domain The LPC autocorrelation method allows prediction error conversion from time domain into frequency domain using Parseval theorem so that LPC analysis can be interpreted as adaptation of parametric model spectrum to the observed signal spectrum. We start with the prediction error e[n]: e[n] = x[n] − K X k=1 αk x[n − k] and apply the z–transform to this equation. The z–transform is restricted to the unit circle. z = ejω ∈C For the z-transforms E(z) and X(z) we obtain: " # K X E(z) = X(z) · 1 − αk z −k k=1 The total error Etot for the squared error criterion becomes: Etot = = = = NX +K−1 e2 [n] n=0 Z+π 1 2π 1 2π 1 2π |E(ejω )|2 dω −π Z+π (Parseval Theorem) 2 K X −jωk αk e · |X(ejω )|2 dω 1 − −π Z+π −π k=1 P (ejω )2 · |X(ejω )|2 dω Digital Processing of Speech and Image Signals 216 WS 2006/2007 with the so-called predictor polynom: jω P (e ) := 1 − K X αk e−jωk k=1 Squared absolute value of the predictor polynom 2 K X −jωk P (ejω )2 = 1 − αk e k=1 = ... = K X k=1 Bk · cos(ωk) (with suitable coefficients Bk resulting from the predictor coefficients) is a polynom with respect to cos(ω), which can be obtained via application of trigonometric transformations. The predictor polynom tries to “compensate” for |X(ejω )|2 – especially at maxima – and to generate a “white” spectrum for the prediction error e[n]. The complex predictor polynom P (z) with z ∈ C has exactly K zeros in the complex plane and therefore can be factorised into linear factors: K Y P (z) = (z − zk ) k=1 Digital Processing of Speech and Image Signals 217 WS 2006/2007 Observations: • These zeros are complex conjugated pairs because αk ∈ IR. jω 2 • The zeros can cause ”minima” of P (e ) . The minima of |P (ejω )|2 approximately correspond to the maxima of the smoothed spectrum |X(ejω )|2 , because for minimization of the error integral it is first of all necessary to “compensate” for the maxima of the signal spectrum. The LPC analysis could therefore be used to describe of the speech signal formant structure. |P(e iω )|2 |X(e iω )|2 ω Digital Processing of Speech and Image Signals 218 WS 2006/2007 windowed phoneme "a" - Hamming window - prediction error - 12 LPC-coefficients - 0 0 0 0 LPC-spectrum - 12 coefficients - spectrum of prediction error (12 LPC-coefficients) 0 0 LPC-spectrum - 18 coefficients - 0 Figure 5.1: LPC–analysis of one speech segment a) signal progression, b) prediction error (K=12), c) LPC–spectrum with K=12 coefficients, d) spectrum of the prediction error (K=12), e) LPC–spectrum with K=18 coefficients Digital Processing of Speech and Image Signals 219 WS 2006/2007 windowed phoneme "a" - Hamming window - amplitude spectrum - Hamming window - 0 0 0 LPC-spectrum - 4 coefficients - LPC-spectrum - 8 coefficients - 0 0 LPC-spectrum - 12 coefficients - LPC-spectrum - 16 coefficients - 0 0 LPC-spectrum - 18 coefficients - LPC-spectrum - 20 coefficients - 0 0 Figure 5.2: LPC–Spectra for different prediction orders K Digital Processing of Speech and Image Signals 220 WS 2006/2007 5.5 LPC: Generative Model e(n) x(n) recursive filter αk For the prediction error e[n] and its z–transform holds: e[n] = x[n] − K X αk x[n − k] k=1 K X E(z) = X(z) − αk X(z) z −k k=1 = X(z) · [1 − K X αk z −k ] k=1 If we consider prediction error as input signal, we can also interpret the LPC–theorem as generative model which generates an output signal x[n] from an adequate “input signal” e[n]: x[n] = e[n] + K X k=1 αk x[n − k] . For the signal spectrum X(z) holds: X(z) = E(z) K P αk z −k 1− k=1 This model is called autoregressive model. The excitation has to be chosen such that E(z) is “white”, i.e. it does not have fine structure due to the fundamental frequency (”pitch–frequency”). In other words: E(z) = G = const. (”gain”) Digital Processing of Speech and Image Signals 221 WS 2006/2007 Special case: E[n] = G · δ[n] Then for LPC model spectrum X(z) holds: X(z) = G K P αk z −k 1− k=1 This spectrum is often interpreted as LPC model spectrum X(z) of observed signal. It is reasonable to set (without explanation): # " K K X X R(k) G2 = R(0) − αk R(k) = R(0) 1 − αk R(0) k=1 k=1 This LPC model spectrum does not have any zeros, it has only poles, and therefore is also called all–pole model. Remarks: • stability problems by solving the equation system (←− truncation error in autocorrelation) • way out: preemphasis through difference calculation • absolute rule for choice of order K: 1 formant needs 2 LPC–coefficients 1 formant per kHz + excitation pulse shape + radiation: 2 LPC–coefficients =⇒ rule of thumb: bandwidth 4 kHz 5 kHz 6 kHz Digital Processing of Speech and Image Signals 222 K = 10 K = 12 K = 14 WS 2006/2007 5.6 LPC: Alternative Representations • so far: G gain αk LPC–coefficients • impulse response of generative model • impulse response of squared absolute value of ”predictor polynom” • cepstrum • poles / zeros of synthesis model / ”predictor polynom” =⇒ formants / bandwidths problem: noise susceptible • PARCOR–coefficients: partial correlation • Area–coefficients: cross-section surfaces Ak • reflexion coefficients ∼ PARCOR; tube model A1 A2 A3 Glottis A4 A5 Lips Digital Processing of Speech and Image Signals 223 WS 2006/2007 Chapter 6 Outlook: Wavelet Transform • Overview: 6.1 Motivation: from Fourier to Wavelet Transform 6.2 Definition 6.3 Discrete Wavelet Transform Digital Processing of Speech and Image Signals 225 WS 2006/2007 6.1 Motivation: from Fourier to Wavelet Transform Fourier transform uses infinitely extended basis functions e−jωt and therefore does not have any time resolution, i.e. there is no information about the localization along the time axis. Therefore, a window function often is used t → w(t), complex in general This function has finite support such that it is possible to investigate a segment of a function of interest. We define a short–time Fourier transform Fb (w) of time signal t → f (t) at position b ∈ IR in time: Fb (w) := Z+∞ −∞ f (t)w(t − b)e−jωt dt where w(t) denotes the complex conjugated value of w(t). The wavelet–transform can be derived from this equation in two steps: a) we ignore the basis function e−jωt and we define the window function as the new basis function. b) in addition to the localization parameter b, we also introduce a scaling parameter a > 0. We consider the family of window functions Wab (t): t−b wab (t) := w a Digital Processing of Speech and Image Signals 226 WS 2006/2007 6.2 Definition The following notation is usually used for the Wavelet–transform: t−b 1 ψab (t) := √ · ψ a a which is the so-called Mother-Wavelet t → ψ(t). Like the window function for the short–time Fourier transform the MotherWavelet should be “localized” as much as possible. Example: Mexican-Hat Function: 1 2 ψ(t) = (1 − t2 )e− 2 t ψ (t) t Digital Processing of Speech and Image Signals 227 WS 2006/2007 The wavelet–transform of f (t) with respect to ψ(t) is defined as: 1 Fψ (a, b) = a Z+∞ −∞ t−b f (t) · ψ dt a with scaling parameter a > 0 and localization parameter b ∈ IR. For the inverse transformation holds: Z+∞ Z+∞ t−b 1 1 f (t) = · · da db Fψ (a, b) · ψ Cψ a2 a −∞ 0 with Cψ := Z∞ 0 |ψ(ω)|2 dω < ∞ ω Proof (principle only): The proof uses the (generalized) Parseval Theorem: Fψ (a, b) 1 = √ a 1 = √ a with Z+∞ −∞ Z+∞ −∞ 1 ψab (t) = √ · ψ a f (t) · ψ ab (t) dt F (ω) · ψab (ω) dω t−b a using further conversions. Digital Processing of Speech and Image Signals 228 WS 2006/2007 6.3 Discrete Wavelet Transform For the scaling parameters a > 0 we choose: a = α0m where α0 > 1 and m ∈ Z. The values of m determine the width of wavelet ψab (t). In order to adjust the localization parameter properly, we define: b = n b0 am 0 where b0 > 0 and n ∈ Z. Thus we constrain the Wavelet transform to discrete values: Fψ (a, b) → Fψ (n, m) The choice of the function ψ(t) is still open. It is useful to choose ψ(t) such that the function system {ψmn |m, n ∈ Z, m > 0} with −m 2 ψmn (t) := a0 · ψ(a−m 0 t − nb0 ) represents an orthonormal basis for functions t → f (t) ∈ L2 (IR). Note: The scalar product < f (t), g(t) > of two functions f (t) and g(t) is defined as: Z < f (t), g(t) > = f (t) · g(t) dt Digital Processing of Speech and Image Signals 229 WS 2006/2007 In this way we obtain the following representation for the discrete Wavelet– transform: Fψ (m, n) = Z+∞ −∞ −m f (t)a0 2 ψ(a−m 0 t − nb0 ) dt = < f (t), ψmn (t) > Due to the orthogonality it is possible to convert the integral of the inverse Wavelet–transform into an infinite series: f (t) = = 1 XX Fψ (m, n)ψmn (t) Cψ m n 1 XX −m Fψ (m, n) a0 2 ψ(a−m 0 t − nb0 ) Cψ m n Digital Processing of Speech and Image Signals 230 WS 2006/2007 Example: Haar –function and Haar –basis special choice: a0 = 2 b0 = 1 The Haar –function is defined as:  0 ≦ t ≦ 12  1 ψ(t) = −1 12 ≦ t < 1   0 otherwise This defines the Haar –basis ψ(t) | m, n ∈ Z, m > 0 : 1 ψmn (t) = √ · ψ 2m 2m t − n It is easy to see that for increasing m a increasingly finer resolution is obtained and that n determines localization in time. Digital Processing of Speech and Image Signals 231 WS 2006/2007 Digital Processing of Speech and Image Signals 232 WS 2006/2007 Chapter 7 Coding The following types of coding are distinguished: • source coding (data compression) goal: transmission (storage) using as few bits as possible without or with few errors • channel coding goal: preferably faultless data transmission (storage) e.g. error-recognizing and error-correcting codes • simultaneous source and channel coding goal: simultaneous optimization The following data types are distinguished: • discrete alphabet • continuous signal (audio, video, . . . ) Source coding • lossless coding (compression) usually discrete sources, e.g. text compression • lossy coding usually continuous signals notation: rate - distortion theory distortion, error bit rate 233 Digital Processing of Speech and Image Signals 234 WS 2006/2007 Three effects can be utilized for signal coding: a) statistical redundancy and correlation: samples are not independent. b) perceptive properties of the receiver (ear and eye): some fine structures in the signal are irrelevant to the receiver c) signal distortion: coded signal differs from the original signal without significant quality deterioration. signal Q T C transmission reconstructed signal T -1 Q -1 C -1 T: transformation, e.g. DCT Q: quantization, e.g. vector quantization C: mapping of bit representation Digital Processing of Speech and Image Signals 235 WS 2006/2007 References: • Ze-Nian Li: CMPT 365 Multimedia Systems. Simon Fraser University, British Columbia, Canada, fall 1999, Version Jan.2000; http://www.cs.sfu.ca/CourseCentral/365/li/index_prev.html. • Peter Noll: MPEG Digital Audio Coding. IEEE Signal Processing Magazine, pp.59-81, Sep. 1997. • Thomas Sikora: MPEG Digital Video-Coding Standards. IEEE Signal Processing Magazine, pp.82-100, Sep. 1997. • A. Ortega, K. Ramchandran: Rate-Distortion Methods for Image and Video Compression. IEEE Signal Processing Magazine, pp.2350, Nov. 1998. • G. J. Sullivan, Th. Wiegand: Rate-Distortion Optimization for Video Compression. IEEE Signal Processing Magazine, pp.74-90, Nov. 1998. Digital Processing of Speech and Image Signals 236 WS 2006/2007 Chapter 8 Image Segmentation and Contour-Finding The lecture notes for this chapter are available as a separate document. 237 238

digital processing of speech and image signals

Related documents

Products

Support

digital processing of speech and image signals

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib