Time-Scale Modification of Speech Signals

Bill Floyd
ECE 5525 – Digital Speech Processing
December 14, 2004
Objectives
• Introduction
• Background Theory
• Methods
• Examples
• Matlab Code
• Short Time Fourier Transform
• Short Time Fourier Transform Magnitude
• Speech Samples
• Conclusion
• Questions
• References
Slide 2 of 49
Introduction
• Goal
  - Speed up or slow down a speech signal while maintaining the approximate pitch
• Applications
  - Changing voice mail playback speed
  - Court stenographers: playing back proceedings more quickly
  - Sound effects
  - Etc.
Introduction
• Option 1 - Change the sample rate
  - Modifying the sample rate changes the speed, but the pitch changes as well
    - Increased sample rate = higher pitch (chipmunk sound)
    - Decreased sample rate = lower pitch (drawn-out echo sound)
• Option 2 - Decimate or interpolate the signal
  - Changing the number of samples gives the same result as modifying the sample rate
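A minimal sketch of why Options 1 and 2 both shift the pitch. The 200 Hz test tone and the 10,000 samples/sec rate are illustrative assumptions, not values taken from this slide. Keeping every second sample and playing the result at the unchanged sample rate halves the duration but doubles the tone's frequency:

```matlab
fs = 10000;                      % assumed sample rate (samples/sec)
f0 = 200;                        % assumed test-tone frequency (Hz)
n  = 0:fs-1;                     % one second of sample indices
x  = sin(2*pi*f0*n/fs);          % original tone
y  = x(1:2:end);                 % crude 2:1 decimation: keep every second sample

% Locate the strongest FFT bin of each signal, assuming playback at fs
X = abs(fft(x));
Y = abs(fft(y));
[~, kx] = max(X(1:numel(x)/2));
[~, ky] = max(Y(1:numel(y)/2));
fx = (kx - 1) * fs / numel(x);   % dominant frequency of the original
fy = (ky - 1) * fs / numel(y);   % dominant frequency after decimation

fprintf('%d samples at %g Hz -> %d samples at %g Hz\n', numel(x), fx, numel(y), fy);
```

A real decimator would low-pass filter first; the filter is left out here because the test tone sits far below the new Nyquist frequency.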
Introduction
• Option 3 - Use more complex methods
  - These change the speed of the signal while preserving the pitch
    - Short Time Fourier Transform
    - Short Time Fourier Transform Magnitude
    - Sinusoidal Synthesis
    - Linear Prediction Synthesis
Terminology
[Figure: Window Representation. Overlapping windows on a 0 to 700 sample axis, annotated with the Frame Rate and the Window Size]
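A minimal sketch of the two terms, assuming the values used in the Matlab code slides (window size of 200 samples, frame rate of 100 samples) and a hypothetical 700-sample signal. Here the frame rate is the spacing between successive window start points, and the window size is the length of each window:

```matlab
window_size   = 200;             % samples per window
frame_rate    = 100;             % samples between successive window starts
signal_length = 700;             % hypothetical signal length (samples)

% Start and end sample of every window that fits entirely inside the signal
num_frames   = floor((signal_length - window_size) / frame_rate) + 1;
frame_starts = (0:num_frames-1) * frame_rate + 1;
frame_ends   = frame_starts + window_size - 1;
overlap      = window_size - frame_rate;   % samples shared by neighboring windows

disp([frame_starts; frame_ends]);
```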
Theory
• Short Time Fourier Transform Methods
  - Chapter 7 in our text (Discrete-Time Speech Signal Processing)
  - Refer to the in-class notes for the mathematical theory of operation
  - I will pick up from where Dr. Kepuska stopped in his notes
Short Time Fourier Transform
• Short Time Fourier Transform
  - Also called the Fairbanks method
  - Extract successive short-time segments, keeping some and discarding the ones that follow
[Block diagram: Signal -> STFT -> Decimate Samples -> IFFT -> OLA -> Output]
Short Time Fourier Transform
• Frame rate factor L
  - In the frequency domain, after taking the STFT, you get X(nL, ω)
  - Form a new signal by Y(nL, ω) = X(snL, ω), where s = compression factor
• Take the inverse Fourier transform
• Use the overlap-and-add (OLA) method to form the new signal
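The steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the 200 Hz test tone, the sample rate, and the variable names are all hypothetical, and the triangular window is written out by formula so that no toolbox is required. With a window size of 200 and a frame rate of 100, the triangular windows sum to one when overlapped, so no extra scaling is applied.

```matlab
fs = 10000;                          % assumed sample rate (samples/sec)
x  = sin(2*pi*200*(0:1999)/fs);      % assumed test signal: 200 Hz tone, 2000 samples
N  = 200;                            % window size (samples)
L  = 100;                            % frame rate: samples between window starts
s  = 2;                              % compression factor (assumed integer)
Nfft = 1024;                         % FFT length

% Triangular window written out explicitly
k = 1:N;
w = 1 - abs(2*k - N - 1) / N;

% Analysis: X(nL, w) for every full frame
num_frames = floor((numel(x) - N) / L) + 1;
X = zeros(Nfft, num_frames);
for n = 1:num_frames
    seg = x((n-1)*L + (1:N)) .* w;
    X(:, n) = fft(seg.', Nfft);
end

% Modification: Y(nL, w) = X(snL, w), i.e. keep every s-th frame
Y = X(:, 1:s:end);

% Synthesis: inverse FFT of each kept frame, then overlap-add at spacing L
y = zeros(1, (size(Y, 2) - 1) * L + N);
for n = 1:size(Y, 2)
    seg = real(ifft(Y(:, n)));
    idx = (n-1)*L + (1:N);
    y(idx) = y(idx) + seg(1:N).';
end
```

The test tone has a period of 50 samples, which divides the frame rate, so the kept frames happen to line up and the interior of y reproduces the tone. Real speech will generally not be this cooperative.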
Short Time Fourier Transform
[Figure: top panel, frames of X(nL, ω); bottom panel, frames of Y(nL, ω) = X(2nL, ω); horizontal axes 0 to 800 samples]
Short Time Fourier Transform
[Figure: Window Representation. Panels: Original Windowed Sequence (0 to 700 samples); New Sequence (0 to 600 samples)]
Short Time Fourier Transform
• Problems
  - Pitch synchronization
    - It is highly likely that the pitch periods will not line up properly
Short Time Fourier Transform Magnitude
• Short Time Fourier Transform Magnitude
  - Problems with the STFT method relate directly to the linear phase component of the STFT
    - Time shift = phase change
  - An alternate approach is to use only the magnitude portion of the STFT: the Short Time Fourier Transform Magnitude (STFTM)
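The "time shift = phase change" point is the shift property of the Fourier transform. As a reminder in standard notation (the symbol n_0 for the shift is introduced here only for illustration):

```latex
x[n - n_0] \;\longleftrightarrow\; e^{-j\omega n_0}\, X(\omega),
\qquad
\left| e^{-j\omega n_0}\, X(\omega) \right| = \left| X(\omega) \right|
```

A shift in time changes only the linear phase term and leaves the magnitude untouched, which is the motivation for working with the magnitude alone.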
Short Time Fourier Transform Magnitude
• Compression
  - With the Fairbanks method, time slices were discarded
  - Now we can just compress the time slices
  - Form a new signal by |Y(nM, ω)| = |X(nL, ω)|, where
    - M = new (compressed) frame rate = L / speed
    - e.g., for speeding up by two, M = L/2
Short Time Fourier Transform Magnitude
• Compression
  - Take the inverse Fourier transform
  - Use the overlap-and-add (OLA) method to form the new signal
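A minimal sketch of the compression step |Y(nM, ω)| = |X(nL, ω)| followed by the inverse transform and overlap-add. It mirrors the simplified zero-phase inversion and the frame rate / W0 scaling used in the Matlab code slides and should be read as an illustration only. The test tone and all variable names are assumptions.

```matlab
x  = sin(2*pi*200*(0:1999)/10000);   % assumed test signal: 200 Hz tone at 10 kHz
N  = 200;                            % window size (samples)
L  = 100;                            % analysis frame rate (samples)
speed = 2;                           % speed-up factor
M  = L / speed;                      % synthesis frame spacing, M = L / speed
Nfft = 1024;                         % FFT length

k = 1:N;
w = 1 - abs(2*k - N - 1) / N;        % triangular window written out explicitly
scale = L / sum(w);                  % frame rate / W0

num_frames = floor((numel(x) - N) / L) + 1;
y = zeros(1, (num_frames - 1) * M + N);
for n = 1:num_frames
    seg = x((n-1)*L + (1:N)) .* w;   % frame taken at spacing L
    mag = abs(fft(seg.', Nfft));     % |X(nL, w)|: the phase is discarded
    zp  = real(ifft(mag));           % zero-phase time segment
    idx = (n-1)*M + (1:N);           % same frame placed at spacing M
    y(idx) = y(idx) + zp(1:N).' * scale;
end

fprintf('input: %d samples, output: %d samples\n', numel(x), numel(y));
```

Every frame is kept, and only the spacing between frames changes from L to M, so the output is roughly 1/speed times as long as the input.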
Short Time Fourier Transform Magnitude
[Figure: top panel, frames of X(nL, ω); bottom panel, frames of Y(nM, ω) = X(nL, ω) with M = L/2; horizontal axes 0 to 800 samples]
Short Time Fourier Transform Magnitude
[Figure: Window Representation. Panels: Original Windowed Sequence (0 to 700 samples); New Sequence (-50 to 450 samples)]
Other Methods
• Sinusoidal Synthesis (Chapter 9)
  - Time-warp the sinewave frequency tracks and the amplitude functions
  - This technique has been successful not only with speech but also with music, biological, and mechanical signals
  - Problems
    - Does not maintain the original phase relations
    - Suffers from reverberance
Other Methods
• Linear Prediction Synthesis
  - Use homomorphic and linear prediction results to modify the time base
  - The book briefly mentions that this is possible, but I ran out of time before I could investigate the process further
Other Methods
• New Techniques
  - An Internet search showed several methods that try to improve on what is currently available
• Software
  - Various software programs will change the speed for you
  - Adobe Audition is one of the most all-encompassing right now
Matlab Code
-Prepare the Workspace
%%%%%%%%%%%%%%%%
% Prepare Workspace
%%%%%%%%%%%%%%%%
close all;
clear all;
window_size_1 = 200;   % window size in samples
frame_rate_1 = 100;    % frame rate: samples between successive windows
% Speed factor (2 = play back twice as fast)
speed = 2;
Matlab Code
-Load the Speech Signal
%%%%%%%%%%%%%%%%
% Load Data File
%%%%%%%%%%%%%%%%
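% Read the file name as a string ('s' stops input from evaluating it)
filename = input('Please enter the file name to be used. ', 's');
% wavread is the original call; newer MATLAB releases replace it with audioread
[sample_data,sample_rate,nbits] = wavread(filename);
num_samples = max(size(sample_data));
loop_time = floor(num_samples/frame_rate_1);
% Zero-pad past the last sample so that the final window is full
sample_data(num_samples+1:(loop_time+1)*frame_rate_1) = 0;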
Matlab Code
-Develop the Window
%%%%%%%%%%%%%%%%
% Create Windows
%%%%%%%%%%%%%%%%
% File sampled at 10,000 samples/sec
% Window of 200 samples = 20 ms; frame rate of 100 samples = 10 ms
% (the variable keeps its original name, triangle_30ms)
triangle_30ms = triang(window_size_1);
%triangle_30ms = hamming(window_size_1);
W0 = sum(triangle_30ms);   % window sum, used to scale the overlap-add
Matlab Code
-Window the Entire Speech Signal
%%%%%%%%%%%%%%%%
% Window the speech
%%%%%%%%%%%%%%%%
% Each window spans two frames (window_size_1 = 2*frame_rate_1)
for i = 0:loop_time-1
    window_data(:,i+1) = sample_data((frame_rate_1*i)+1:((i+2)*frame_rate_1)) ...
        .* triangle_30ms;
end
Matlab Code
-Perform the Fast Fourier Transform
%%%%%%%%%%%%%%%%
% Create FFT
%%%%%%%%%%%%%%%%
% Zero-pad each windowed frame to 1024 points before the FFT
for i = 1:loop_time
    window_data_fft(:,i) = fft(window_data(:,i),1024);
end
Matlab Code
-Recreate the Modified Signal
%%%%%%%%%%%%%%%%
% Recreate Original Signal
%%%%%%%%%%%%%%%%
% Initialize the recreated signals
reconstructed_signal(1:(loop_time+1)*frame_rate_1) = 0;
real_reconstructed_signal(1:(loop_time+1)*frame_rate_1) = 0;
modified_reconstructed_signal(1:(loop_time+3)*(frame_rate_1/speed)) = 0;
modified_reconstructed_signal_compressed(1:(loop_time+3)*(frame_rate_1/speed)) = 0;
Matlab Code
-Recreate the Modified Signal
% Perform the ifft
for i = 1:loop_time
    recreated_data_ifft(:,i) = ifft(window_data_fft(:,i),1024);
    % Magnitude-only (zero-phase) version used for the modified signals
    real_recreated_data_ifft(:,i) = ifft(abs(window_data_fft(:,i)),1024);
    % Keep the first window_size_1 samples and scale by frame rate / window sum
    truncated_recreated_data_ifft(:,i) = ...
        recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
    real_truncated_recreated_data_ifft(:,i) = ...
        real_recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
end
Matlab Code
-Recreate the Modified Signal
% Get back to the original signal by overlap-adding every frame
for i = 0:loop_time-1
    idx = (frame_rate_1*i)+1:((i+2)*frame_rate_1);
    reconstructed_signal(idx) = reconstructed_signal(idx) + ...
        truncated_recreated_data_ifft(:,i+1)';
    real_reconstructed_signal(idx) = real_reconstructed_signal(idx) + ...
        real_truncated_recreated_data_ifft(:,i+1)';
end
Matlab Code
-Recreate the Modified Signal
% Get a modified signal by deleting certain parts (STFT)
% Keeps every speed-th frame; the indexing assumes speed is an integer >= 1
for i = 0:(loop_time-1)/speed
    idx = (frame_rate_1*i)+1:((i+2)*frame_rate_1);
    modified_reconstructed_signal(idx) = modified_reconstructed_signal(idx) + ...
        real_truncated_recreated_data_ifft(:,i*speed+1)';
end
Matlab Code
-Recreate the Modified Signal
% Initialize the compressed sequence (STFTM)
modified_reconstructed_signal_compressed(1:frame_rate_1+frame_rate_1/speed+1) = ...
    truncated_recreated_data_ifft(frame_rate_1-frame_rate_1/speed:window_size_1,1)';
% Get a modified signal by compressing: frames are spaced frame_rate_1/speed apart
for i = 0:(loop_time-2)
    idx = (frame_rate_1/speed*i)+1:(frame_rate_1/speed*i)+window_size_1;
    modified_reconstructed_signal_compressed(idx) = ...
        modified_reconstructed_signal_compressed(idx) + ...
        real_truncated_recreated_data_ifft(:,i+2)';
end
Matlab Code
-Plot Results
%%%%%%%%%%%%%%%%
% Plot Results
%%%%%%%%%%%%%%%%
figure; subplot(211)
plot(sample_data)
title('Original Speech'); v1=axis;
hold on; subplot(212)
plot(real(modified_reconstructed_signal))
title(['STFT Synthesis w/ Speed = ',num2str(speed),'X']); v2=axis;
% Put both plots on the axes of the longer signal
if speed > 1
    subplot(211); axis(v1)
    subplot(212); axis(v1)
else
    subplot(211); axis(v2)
    subplot(212); axis(v2)
end
Matlab Code
-Write Sound Files
%%%%%%%%%%%%%%%%
% Write sound files
%%%%%%%%%%%%%%%%
% wavwrite is the original call; newer MATLAB releases replace it with audiowrite
wavwrite(modified_reconstructed_signal,sample_rate,nbits,'C:\Classes\ECE_5525\tea party fairbanks 2x.wav')
Examples
Baseline Samples
• Original File
• Sample Rate 2X
• Sample Rate .5X
• STFT Sound file
• STFTM Sound file
Examples
STFT: Speed 0.5X
[Figure: top panel, Original Speech; bottom panel, STFT Synthesis w/ Speed = 0.5X; horizontal axes in sample index (x 10^4)]
[Sound file]
Examples
STFT: Speed 2X
[Figure: top panel, Original Speech; bottom panel, STFT Synthesis w/ Speed = 2X; horizontal axes in sample index (x 10^4)]
[Sound file]
Examples
STFT: Speed 4X
[Figure: top panel, Original Speech; bottom panel, STFT Synthesis w/ Speed = 4X; horizontal axes in sample index (x 10^4)]
[Sound file]
Examples
STFTM: Speed 0.5X
[Figure: top panel, Original Speech; bottom panel, STFTM Synthesis w/ Speed = 0.5X; horizontal axes in sample index (x 10^4)]
[Sound file]
Examples
STFTM: Speed 2X
[Figure: top panel, Original Speech; bottom panel, STFTM Synthesis w/ Speed = 2X; horizontal axes in sample index (x 10^4)]
[Sound file]
Examples
STFTM: Speed 4X
[Figure: top panel, Original Speech; bottom panel, STFTM Synthesis w/ Speed = 4X; horizontal axes in sample index (x 10^4)]
[Sound file]
More Results
• Change in window size
  - If the window size becomes too small, a change in pitch will occur
  - The window needs to be 2 to 3 pitch periods long
  - I generally used 20 to 30 ms windows
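A quick worked example of the window-size rule, assuming the 10,000 samples/sec rate noted in the code slides and a hypothetical 100 Hz pitch (the pitch value is an illustrative assumption):

```matlab
fs = 10000;                               % sample rate (samples/sec)
pitch_hz = 100;                           % assumed pitch (Hz)
pitch_period = fs / pitch_hz;             % samples per pitch period
win_samples = [2 3] * pitch_period;       % 2 to 3 pitch periods, in samples
win_ms = 1000 * win_samples / fs;         % the same range in milliseconds
fprintf('Window: %d to %d samples (%g to %g ms)\n', win_samples, win_ms);
```

For this pitch the rule gives 200 to 300 samples, which is the 20 to 30 ms range quoted above. A higher-pitched voice would allow a shorter window.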
More Results
• Change in frame rate
  - If the frame rate (the spacing between successive windows) becomes too small, too many samples overlap to get an intelligible signal
[Figure: heavily overlapping windows on a -50 to 450 sample axis]
More Results
• Change in window (filter) type
  - Tried a Hamming window: not much perceptual difference
  - Accounting for the window gain W0 (the sum of the window samples) becomes important here, because Frame Rate / W0 is no longer equal to one
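A minimal sketch of the scale factor Frame Rate / W0 for the two window types, assuming the window size of 200 and frame rate of 100 from the code slides. The windows are written out by formula (an assumption made so that no toolbox is needed) and are intended to match the shapes of triang and hamming:

```matlab
N = 200;                                           % window size (samples)
L = 100;                                           % frame rate (samples)
k = (1:N).';
tri = 1 - abs(2*k - N - 1) / N;                    % triangular window, even N
ham = 0.54 - 0.46 * cos(2*pi*(k - 1) / (N - 1));   % symmetric Hamming window

W0_tri = sum(tri);                                 % window sums
W0_ham = sum(ham);
scale_tri = L / W0_tri;                            % Frame Rate / W0
scale_ham = L / W0_ham;
fprintf('triangular: W0 = %.2f, L/W0 = %.4f\n', W0_tri, scale_tri);
fprintf('Hamming:    W0 = %.2f, L/W0 = %.4f\n', W0_ham, scale_ham);
```

With the triangular window the factor is one, so the scaling has no effect. With the Hamming window it drops to about 0.93, so leaving it out changes the output level.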
Conclusion
• Optimum area
  - Frame rate is one half of the window size
  - Window size needs to be 2 to 3 pitch periods long
• It is easy to change the time scale and still maintain the original pitch, although the result does not always sound natural
Conclusion
• Further investigation
  - What to do when you want to slow down by more than half (speed below 0.5X)
    - Using the STFTM means there will be gaps between the sequences
[Figure: windowed sequences separated by gaps, on a 0 to 1000 sample axis]
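A short worked example of where the gaps come from, assuming the window size of 200 and frame rate of 100 from the code slides and the relation M = L / speed. The list of speeds is an illustrative choice. Once M exceeds the window size, successive segments no longer touch:

```matlab
N = 200;                          % window size (samples)
L = 100;                          % analysis frame rate (samples)
speeds = [2 1 0.5 0.25];          % assumed speed factors to try
M = L ./ speeds;                  % synthesis frame spacing, M = L / speed
gap = max(M - N, 0);              % uncovered samples between successive segments
for i = 1:numel(speeds)
    fprintf('speed %.2fX: M = %g samples, gap = %g samples\n', ...
        speeds(i), M(i), gap(i));
end
```

At 0.5X the segments just meet end to end (M equals the window size). Anything slower leaves uncovered samples between them.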
Conclusion
• Further investigation
  - What to do when you want to slow down by more than half (speed below 0.5X)
    - Could replicate windowed segments
[Figure: replicated windowed segments filling a 0 to 1000 sample axis]
Conclusion
• Further investigation
  - Use the other methods to compare quality
    - Implement Sinusoidal Synthesis
    - Implement Linear Predictive Synthesis using linear prediction and homomorphic methods
  - Work on synchronizing pitch periods
    - Shift samples so that the peaks line up
    - Scott and Gerber: Synchronized Overlap and Add (SOLA)
      - Cross-correlate two segments to find the peak
      - Use the peaks to line up the segments
    - Align the window at the same relative location within a pitch period
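The cross-correlation idea in the list above can be sketched as follows. This illustrates the alignment search only and is not the published procedure. The two sinusoidal segments, the pitch period of 80 samples, and the 25-sample offset are all hypothetical.

```matlab
period = 80;                                % assumed pitch period (samples)
n = 0:399;
a = sin(2*pi*n/period);                     % hypothetical tail of the output so far
b = sin(2*pi*(n + 25)/period);              % hypothetical next segment, 25 samples out of step

overlap   = 200;                            % samples compared at each trial shift
max_shift = 60;                             % search range for the alignment shift
score = zeros(1, max_shift + 1);
for shift = 0:max_shift
    seg_a = a(1:overlap);
    seg_b = b(shift + (1:overlap));
    % Normalized cross-correlation at this trial shift
    score(shift + 1) = sum(seg_a .* seg_b) / sqrt(sum(seg_a.^2) * sum(seg_b.^2));
end
[~, best] = max(score);
best_shift = best - 1;                      % shift that lines the pitch periods up
fprintf('Best alignment shift: %d samples\n', best_shift);
```

The next segment would then be overlap-added starting at best_shift, so that its pitch periods coincide with those already in the output.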
Questions
• Are there any questions?
References
• Quatieri, Thomas F. Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2002.
• Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice Hall, Upper Saddle River, NJ, 1978.
• Oppenheim, A.V. and Schafer, R.W. Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.
• Scott, R. and Gerber, S. "Pitch Synchronous Time-Compression of Speech," Proc. Conf. Speech Communications Processing, pp. 63-85, April 1972.
References
• Fairbanks, G., Everitt, W.L., and Jaeger, R.P. "Method for Time or Frequency Compression-Expansion of Speech," IEEE Transactions on Audio and Electroacoustics, vol. AU-2, pp. 7-12, Jan. 1954.