Speech Processing
(Module CS5241)
Lecture 2 – Speech signal processing
[Figure: a speech waveform and its spectrogram (frequency vs. time)]
SIM Khe Chai
School of Computing, National University of Singapore, January 2010
Speech Waveforms
[Figure: a speech waveform, amplitude vs. sample index]
Speech sounds are produced by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener’s ears or to a microphone through the air, where speech and other sounds take the form of radiating waves of variation in air pressure. These waves are known as longitudinal waves, in which the vibrations are parallel to the direction of travel (as opposed to transverse waves, in which the vibrations are perpendicular to the direction of travel).
Digital Speech Waveforms
When we speak into a microphone, these changes in pressure are converted into proportional variations in electrical voltage. Computers equipped with the proper hardware can convert the analog voltage variations into digital sound waveforms by a process called analog-to-digital conversion (ADC), which involves:
• Sampling – taking pressure-value readings at equal time intervals from the continuously varying speech signal. For example, a speech waveform sampled 16000 times per second has a sampling frequency of 16 kHz (kilohertz). A higher sampling rate yields better sound quality.
• Quantization – representing the sampled waveform amplitudes as discrete values (rounded to the nearest value expressible in a given number of bits). For example, 8 bits and 16 bits can represent 256 and 65536 possible quantization levels respectively (a minimal sketch follows below).
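As a rough illustration of the quantization step (not taken from the slides), the sketch below maps a sample value in the range [−1, 1) to a signed 16-bit level; the function name and the clipping behaviour are my own choices, written in the same bare-function style as the later snippets:

short Quantize16(float sample) {
    // clip to the representable range, then round to the nearest of 65536 levels
    float clipped = Math.max(-1.0f, Math.min(sample, 1.0f - 1.0f / 32768.0f));
    return (short) Math.round(clipped * 32768.0f);
}

For example, Quantize16(0.5f) returns 16384 and Quantize16(-0.25f) returns -8192.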
Simple Waveform Processing
Simple waveform processing techniques:
• Block/frame processing
• Short-time energy
• Zero-crossing rate
• A simple end-pointer
• Autocorrelation function
• Correlation with sinusoid
Block processing
• A speech waveform consists of a long sequence of sampled values
• Useful to break the long sequence into blocks/frames which are quasi-stationary
• Frame-size is a compromise between:
– having sufficient sample points for accurate measurement/analysis
– ensuring that the quasi-stationary assumption is valid
• Frame shift: the number of samples (or seconds) between the starts of successive frames
– use overlapping frames to better capture the signal dynamics (a framing sketch follows below)
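The framing step itself is not spelled out in the slides; the sketch below is one minimal way to cut a waveform into overlapping frames, with the frame size and shift given in samples (the trailing partial frame is simply dropped):

float[][] MakeFrames(float s[], int frameSize, int frameShift) {
    // number of complete frames that fit into the signal
    int numFrames = (s.length < frameSize) ? 0 : 1 + (s.length - frameSize) / frameShift;
    float[][] frames = new float[numFrames][frameSize];
    for (int f = 0; f < numFrames; f++)
        for (int i = 0; i < frameSize; i++)
            frames[f][i] = s[f * frameShift + i];   // copy the samples for frame f
    return frames;
}

For a 16 kHz waveform, a typical choice would be a 25 ms frame (400 samples) with a 10 ms shift (160 samples).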
Short-time energy
Short-time energy is defined as the sum of squares of the samples in a frame:
E = Σ_{i=0}^{N−1} s_i^2    (1)
float Energy(float s[]) {
    // sum of squared sample values in the frame
    float sum = 0;
    for (int i = 0; i < s.length; i++)
        sum += s[i] * s[i];
    return sum;
}
Short-time Energy
[Figure: a speech waveform (top) and its frame-wise short-time energy (bottom)]
Zero-crossing Rate
Zero-crossing rate (ZCR) is defined as the number of times the zero axis is crossed per frame. ZCR tends to be large for unvoiced speech.
int ZCR(float s[]) {
    // count sign changes between consecutive samples
    int count = 0;
    for (int i = 1; i < s.length; i++)
        if (s[i-1] * s[i] <= 0)
            count++;
    return count;
}
Zero-crossing Rate
[Figure: a speech waveform (top) and its frame-wise zero-crossing rate (bottom)]
End-point detector
Accurate end-point detection is difficult. A simple end-point detector for isolated words can be constructed from energy and ZCR values (a sketch follows below):
1. Measure the background energy and ZCR (E_b, Z_b)
2. Start from the centre of the waveform
3. Moving outwards, mark the points N_s, N_e at which E < E_b + α
4. Continuing outwards from N_s, N_e, if Z > Z_b + β for 3 frames, update N_s, N_e and repeat step 3
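The slides leave the frame-level bookkeeping open, so the following is only one possible reading of the recipe. It assumes per-frame energy and ZCR arrays E[] and Z[] (names of my own choosing) and returns the start and end frame indices:

int[] EndPoint(float E[], int Z[], float Eb, float Zb, float alpha, float beta) {
    int centre = E.length / 2;                   // step 2: start from the centre
    int Ns = centre, Ne = centre;
    boolean changed = true;
    while (changed) {
        changed = false;
        // step 3: move outwards while the frame energy stays above the background threshold
        while (Ns > 0 && E[Ns - 1] >= Eb + alpha) { Ns--; changed = true; }
        while (Ne < E.length - 1 && E[Ne + 1] >= Eb + alpha) { Ne++; changed = true; }
        // step 4: if the ZCR stays high for 3 frames beyond an end-point, push it out
        // and let the energy test (step 3) run again on the next pass
        if (Ns >= 3 && Z[Ns-1] > Zb + beta && Z[Ns-2] > Zb + beta && Z[Ns-3] > Zb + beta) {
            Ns -= 3; changed = true;
        }
        if (Ne + 3 < E.length && Z[Ne+1] > Zb + beta && Z[Ne+2] > Zb + beta && Z[Ne+3] > Zb + beta) {
            Ne += 3; changed = true;
        }
    }
    return new int[] { Ns, Ne };                 // start and end frame indices
}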
Auto-correlation Function
Auto-correlation emphasises periodicity. Definition:
r_k = Σ_{i=0}^{N−k−1} s_i s_{i+k}    (2)
float AutoCorr(float s[], int k) {
    // correlation of the frame with itself shifted by k samples
    float sum = 0;
    for (int i = 0; i < s.length - k; i++)
        sum += s[i] * s[i+k];
    return sum;
}
Auto-correlation Function
[Figure: a speech frame (top) and its auto-correlation function (bottom)]
Pitch (Fundamental Frequency) detector
Simple pitch detection based on the auto-correlation function (a sketch follows below):
1. Compute r_0 to r_max
2. r_0 is the energy
3. Find the peak r_p in the range r_0 to r_max (excluding r_0 itself)
4. If r_p > 0.3 r_0, the speech is voiced with pitch period p·T
5. Otherwise the speech is unvoiced
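A minimal sketch of this recipe, reusing the AutoCorr routine from the previous slide; the maximum lag maxLag and the sampling period T are supplied by the caller (maxLag must be smaller than the frame length):

float PitchPeriod(float s[], int maxLag, float T) {
    float r0 = AutoCorr(s, 0);                  // r_0 is the frame energy
    int bestLag = 1;
    float bestVal = AutoCorr(s, 1);
    for (int k = 2; k <= maxLag; k++) {         // find the peak r_p for p = 1..maxLag
        float rk = AutoCorr(s, k);
        if (rk > bestVal) { bestVal = rk; bestLag = k; }
    }
    if (bestVal > 0.3f * r0)                    // voiced if the peak is strong enough
        return bestLag * T;                     // pitch period p*T in seconds
    return 0;                                   // unvoiced
}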
Spectral Analysis
Any periodic signal can be expressed as a sum of sinusoids at a fundamental frequency ω and its harmonics:
s_n = s(nT) = Σ_{p=0}^{N−1} A_p cos(ω_p nT + φ_p)    (3)
where A_p and φ_p are known as the amplitude and phase of the p-th harmonic (ω_p = pω). These quantities can be found by correlating s(nT) with cos(ω_p nT) and sin(ω_p nT):
A_p = √( c(ω_p)^2 + s(ω_p)^2 )   and   φ_p = tan^{−1}[ s(ω_p) / c(ω_p) ]    (4)

c(ω_p) = Σ_{n=0}^{N−1} s(nT) cos(ω_p nT)   and   s(ω_p) = Σ_{n=0}^{N−1} s(nT) sin(ω_p nT)    (5)
Discrete Fourier Transform (DFT)
The process of obtaining the amplitudes (A_p) and phases (φ_p) is known as the Discrete Fourier Transform (DFT). DFT implementation:
void DFT(float s[], float amp[], float phase[]) {
    int N = s.length;
    for (int p = 0; p < N; p++) {
        float csum = 0, ssum = 0;
        for (int n = 0; n < N; n++) {            // correlate with cos and sin
            double arg = 2 * Math.PI * n * p / N;
            csum += s[n] * Math.cos(arg);
            ssum += s[n] * Math.sin(arg);
        }
        amp[p] = (float) Math.sqrt(csum * csum + ssum * ssum);
        phase[p] = (float) Math.atan2(ssum, csum);   // atan2 handles the csum = 0 case
    }
}
Complex Formulation of DFT
A complex number has a real part and an imaginary part.
z = A e^{jθ} = A cos(θ) + j A sin(θ)    (6)
j is the imaginary unit, where j² = −1. Therefore, the DFT can be expressed in the following complex form:
S_p = Σ_{n=0}^{N−1} s(nT) e^{−j 2πnp/N},   p = 0, 1, . . . , N − 1    (7)
S_p is a complex value whose magnitude and angle give the amplitude and phase of the p-th harmonic component. The DFT is a linear transform of the signal. The inverse DFT is given by
s(nT) = (1/N) Σ_{p=0}^{N−1} S_p e^{j 2πnp/N},   n = 0, 1, . . . , N − 1    (8)
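For completeness, here is a sketch of the inverse DFT of equation (8) in the same style as the DFT routine on the previous slide; it assumes S_p is supplied as separate real and imaginary parts re[p] and im[p] (my own naming) and writes the reconstructed samples into s[]:

void InverseDFT(float re[], float im[], float s[]) {
    int N = re.length;
    for (int n = 0; n < N; n++) {
        double sum = 0;
        for (int p = 0; p < N; p++) {
            double arg = 2 * Math.PI * n * p / N;
            // real part of S_p * e^{+j arg}; for a real signal the imaginary parts cancel
            sum += re[p] * Math.cos(arg) - im[p] * Math.sin(arg);
        }
        s[n] = (float) (sum / N);
    }
}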
DFT of Discrete Signals
Given a sequence of N discrete samples, performing the DFT results in N complex values S_p. If the sampling frequency is f_s = 1/T, then:
• The frequency resolution is f_s/N Hz
• The magnitude of S_p is symmetric about f_s/2 Hz
• The phase of S_p is anti-symmetric about f_s/2 Hz
• The output can therefore be represented effectively by N real values!
Time vs Frequency Resolution
Assuming that the signal characteristics remain the same over progressively longer intervals, as the length of the analysis window (N) increases:
• the ability to respond to sudden changes in the signal is reduced: poorer time resolution
• the spacing between the spectral components (1/(NT)) decreases, so the signal frequencies can be determined more accurately: improved frequency resolution
Zero-padding: adding trailing zeros to the signal to increase N (a sketch follows below):
• yields more frequency points, but does not really increase the frequency resolution, since no new information is added!
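A minimal zero-padding sketch in the same style as the other snippets; the target length M is left to the caller (padding to the next power of two is a common choice, and is what the radix-2 FFT on a later slide requires):

float[] ZeroPad(float s[], int M) {
    // assumes M >= s.length; Java arrays are zero-initialised
    float[] padded = new float[M];
    for (int i = 0; i < s.length; i++)
        padded[i] = s[i];
    return padded;
}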
Implicit Periodicity of the DFT
• The DFT evaluates the spectrum at N evenly spaced discrete frequencies
• Only periodic signals have discrete spectra
• Therefore, the DFT assumes periodicity outside the analysis frame, with period equal to N
• This assumption may give rise to boundary discontinuities (edge effects)
• These may distort the high-frequency components
Implicit Periodicity of the DFT
[Figure: example signal segments (top) and their DFT magnitude spectra (bottom), illustrating the edge effects of the implicit periodicity]
Windowing
[Figure: shapes of two tapering window functions]
In signal processing, a window function (also known as a tapering function) is a function that is zero-valued outside of some chosen interval. Windowing is applied to the signal to reduce the edge effects. Commonly used window functions include:
• Hamming window: w(n) = 0.53836 − 0.46164 cos(2πn/(N−1))
• Hanning window: w(n) = 0.5 (1 − cos(2πn/(N−1)))
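A minimal sketch of applying a Hamming window to a frame in place, using the coefficients quoted above:

void ApplyHamming(float s[]) {
    int N = s.length;
    for (int n = 0; n < N; n++)                 // taper the frame towards its ends
        s[n] *= 0.53836 - 0.46164 * Math.cos(2 * Math.PI * n / (N - 1));
}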
Implicit Periodicity of the DFT with Windowing
[Figure: windowed signal segments (top) and their DFT magnitude spectra (bottom); windowing reduces the edge effects]
Fast Fourier Transform (FFT)
Direct computation of the DFT has a complexity of order N². The Fast Fourier Transform (FFT) is a fast algorithm for computing the DFT using a divide-and-conquer approach. A radix-2 FFT algorithm requires N = 2^k, where k is an integer; zero-padding can be applied to meet this requirement.
S_p = Σ_{n=0}^{N−1} s(nT) e^{−j 2πnp/N}
    = Σ_{n=0}^{N/2−1} [ s(2nT) e^{−j 2πnp/(N/2)} + s((2n+1)T) e^{−j (2πnp/(N/2) + 2πp/N)} ]
    = S_p^{even} + e^{−j 2πp/N} S_p^{odd}    (9)
This can be applied recursively since N = 2^k, reducing the complexity to order N log N. (A recursive sketch follows below.)
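A compact recursive radix-2 FFT sketch following equation (9); the real and imaginary parts of the signal are kept in separate arrays (re[] and im[], my own naming), N must be a power of two, and no attempt is made at an efficient in-place implementation:

void FFT(float re[], float im[]) {
    // Overwrites re[] and im[] with the real and imaginary parts of S_p.
    int N = re.length;
    if (N == 1) return;                          // a single sample is its own DFT
    float[] evenRe = new float[N/2], evenIm = new float[N/2];
    float[] oddRe  = new float[N/2], oddIm  = new float[N/2];
    for (int n = 0; n < N/2; n++) {              // split into even- and odd-indexed samples
        evenRe[n] = re[2*n];   evenIm[n] = im[2*n];
        oddRe[n]  = re[2*n+1]; oddIm[n]  = im[2*n+1];
    }
    FFT(evenRe, evenIm);                         // S_p^{even}
    FFT(oddRe, oddIm);                           // S_p^{odd}
    for (int p = 0; p < N/2; p++) {
        double arg = -2 * Math.PI * p / N;       // twiddle factor e^{-j 2*pi*p/N}
        double wr = Math.cos(arg), wi = Math.sin(arg);
        double tr = wr * oddRe[p] - wi * oddIm[p];
        double ti = wr * oddIm[p] + wi * oddRe[p];
        re[p]       = (float) (evenRe[p] + tr);  // S_p       = S_p^{even} + w * S_p^{odd}
        im[p]       = (float) (evenIm[p] + ti);
        re[p + N/2] = (float) (evenRe[p] - tr);  // S_{p+N/2} = S_p^{even} - w * S_p^{odd}
        im[p + N/2] = (float) (evenIm[p] - ti);
    }
}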
Pre-emphasis
• Pre-emphasis – to increase, within a band of frequencies, the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies, in order to improve the overall signal-to-noise ratio. Pre-emphasis is given by x_n ← x_n − α x_{n−1}, where α is the pre-emphasis factor (typically 0.97). The boundary condition x_{−1} = x_0 is assumed.
void PreEmphasis(float s[], float alpha) {
    // filter backwards so that each sample uses the original previous sample
    for (int i = s.length - 1; i > 0; i--)
        s[i] -= alpha * s[i-1];
    s[0] *= (1 - alpha);    // boundary condition: s[-1] = s[0]
}
Spectrum Analysis – With & Without Pre-emphasis
[Figure: speech frames (top) and their spectra (bottom), with and without pre-emphasis]
Spectrogram – Short Time Fourier Transform
[Figure: a speech waveform (top) and two spectrograms of it (frequency vs. time); middle: 256-point window, bottom: 64-point window]
A spectrogram is a two-dimensional visual representation of the Short Time Fourier Transform (STFT) of a time signal, an example of block processing. There is a trade-off between time and frequency resolution, which can be adjusted by varying the analysis window length and the amount of overlap between successive windows. MIDDLE: 256-point window (better frequency resolution); BOTTOM: 64-point window (better time resolution); 50% overlap in both cases. A sketch of the computation follows below.
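Pulling the earlier sketches together, the following computes a magnitude spectrogram using the MakeFrames, ApplyHamming and DFT routines from the previous slides; the use of a Hamming window and of plain magnitudes (rather than log-magnitudes) are choices of my own:

float[][] Spectrogram(float s[], int frameSize, int frameShift) {
    float[][] frames = MakeFrames(s, frameSize, frameShift);
    float[][] spec = new float[frames.length][frameSize];
    for (int f = 0; f < frames.length; f++) {
        ApplyHamming(frames[f]);                 // taper the frame to reduce edge effects
        float[] phase = new float[frameSize];    // phases are computed but not kept
        DFT(frames[f], spec[f], phase);          // frame magnitudes go into spec[f]
    }
    return spec;                                 // one row of |S_p| values per frame
}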
Recap
• Digital speech waveforms:
– Discretisation: finite time points
– Quantisation: finite values
• Simple waveform processing algorithms
• Spectral Analysis
– Discrete Fourier Transform (DFT)
– Windowing
– Pre-emphasis
– Fast Fourier Transform (FFT)