Time-Scale Modification of Speech Signals
Bill Floyd
ECE 5525 – Digital Speech Processing
December 14, 2004

Objectives
  - Introduction
  - Background Theory
  - Methods
  - Examples
  - Matlab Code
  - Short Time Fourier Transform
  - Short Time Fourier Transform Magnitude
  - Speech Samples
  - Conclusion
  - Questions
  - References
Slide 2 of 49

Introduction
  Goal
    - To either speed up or slow down a speech signal while maintaining the approximate pitch
  Applications
    - Change voice mail playback speed
    - Court stenographers: play proceedings back more quickly
    - Sound effects
    - Etc.

Introduction
  Option 1: Change the sample rate
    - If you modify the sample rate, you can change the speed, but the pitch is also changed
    - Increase sample rate = higher pitch (chipmunk sound)
    - Decrease sample rate = lower pitch (drawn-out echo sound)
  Option 2: Decimate or interpolate the signal
    - If you change the number of samples, the result is the same as modifying the sample rate

Introduction
  Option 3: Use more complex methods
    - These change the speed of the signal while preserving the pitch
    - Short Time Fourier Transform
    - Short Time Fourier Transform Magnitude
    - Sinusoidal Synthesis
    - Linear Prediction Synthesis

Terminology
  - Frame Rate: the spacing L, in samples, between the starts of successive windows
  - Window Size: the length of each window, in samples
  [Figure: "Window Representation" plot with the frame rate and window size marked]

Theory
  Short Time Fourier Transform Methods
    - Chapter 7 in our text (Discrete-Time Speech Signal Processing)
    - Refer to the in-class notes for the mathematical theory of operation
    - I will pick up from where Dr. Kepuska stopped in his notes

Short Time Fourier Transform
  - Also called the Fairbanks method
  - Extract successive short-time segments and then discard the following ones
  [Block diagram: Signal -> STFT -> Decimate Samples -> IFFT -> OLA -> Output]

Short Time Fourier Transform
  - Frame rate factor L
  - In the frequency domain, after taking the STFT, you get X(nL, ω)
  - Form a new signal by Y(nL, ω) = X(snL, ω), where s = compression factor
  - Take the Inverse Fourier Transform
  - Use the Overlap and Add method to form the new signal

Short Time Fourier Transform
  [Figure: window positions for X(nL, ω) (top) and Y(nL, ω) = X(2nL, ω) (bottom)]

Short Time Fourier Transform
  [Figure: "Window Representation" showing the original windowed sequence and the new sequence]

Short Time Fourier Transform
  Problems
    - Pitch synchronization
    - It is highly likely that the pitch periods will not line up properly

Short Time Fourier Transform Magnitude
  - Problems with the STFT method relate directly to the linear phase component of the STFT
  - Time shift = phase change
  - An alternate approach is to use only the magnitude portion of the STFT: the Short Time Fourier Transform Magnitude (STFTM)

Short Time Fourier Transform Magnitude
  Compression
    - With the Fairbanks method, time slices were discarded
    - Now we can just compress the time slices
    - Form a new signal by |Y(nM, ω)| = |X(nL, ω)|, where M = compression factor = L / speed (the new frame spacing), i.e.
      for speeding up by two, M = L/2

Short Time Fourier Transform Magnitude
  Compression
    - Take the Inverse Fourier Transform
    - Use the Overlap and Add method to form the new signal

Short Time Fourier Transform Magnitude
  [Figure: window positions for X(nL, ω) (top) and Y(nM, ω) = X(nL, ω) with M = L/2 (bottom)]

Short Time Fourier Transform Magnitude
  [Figure: "Window Representation" showing the original windowed sequence and the new sequence]

Other Methods
  Sinusoidal Synthesis (Chapter 9)
    - Time-warp the sinewave frequency tracks and the amplitude function
    - This technique has been successful not only with speech but also with music, biological, and mechanical signals
  Problems
    - Does not maintain the original phase relations
    - Suffers from reverberance

Other Methods
  Linear Prediction Synthesis
    - Use homomorphic and linear prediction results to modify the time base
    - The book briefly mentions that this is possible, but I ran out of time before I could investigate this process further

Other Methods
  New Techniques
    - An Internet search showed several methods that try to improve on what is out there now
  Software
    - Different software programs will change the speed for you
    - Adobe Audition is one of the most all-encompassing right now

Matlab Code - Prepare the Workspace
    %%%%%%%%%%%%%%%%
    % Prepare Workspace
    %%%%%%%%%%%%%%%%
    close all; clear all;
    window_size_1 = 200;   % window length in samples
    frame_rate_1 = 100;    % spacing between windows in samples
    % Speed-up factor (2 = twice as fast, 0.5 = half speed)
    speed = 2;

Matlab Code - Load the Speech Signal
    %%%%%%%%%%%%%%%%
    % Load Data File
    %%%%%%%%%%%%%%%%
    % 's' returns the typed text as a string instead of evaluating it
    filename = input('Please enter the file name to be used. ', 's');
    % wavread/wavwrite were removed in later MATLAB releases
    % (audioread/audiowrite replace them)
    [sample_data,sample_rate,nbits] = wavread(filename);   % assumes a mono file
    loop_time = floor(max(size(sample_data))/frame_rate_1);
    % Zero-pad so the last window is complete (start after the final sample)
    sample_data(end+1:(loop_time+1)*frame_rate_1) = 0;

Matlab Code - Develop the Window
    %%%%%%%%%%%%%%%%
    % Create Windows
    %%%%%%%%%%%%%%%%
    % File sampled at 10,000 samples/sec
    % window_size_1 = 200 samples is a 20 ms window;
    % frame_rate_1 = 100 samples is a 10 ms step
    triangle_30ms = triang(window_size_1);
    %triangle_30ms = hamming(window_size_1);
    W0 = sum(triangle_30ms);   % window sum, used to scale the overlap-add

Matlab Code - Window the Entire Speech Signal
    %%%%%%%%%%%%%%%%
    % Window the speech
    %%%%%%%%%%%%%%%%
    % Each window spans two frame steps (window_size_1 = 2*frame_rate_1)
    for i = 0:loop_time-1
        window_data(:,i+1) = sample_data((frame_rate_1*i)+1:((i+2)*frame_rate_1)).*triangle_30ms;
    end

Matlab Code - Perform the Fast Fourier Transform
    %%%%%%%%%%%%%%%%
    % Create FFT
    %%%%%%%%%%%%%%%%
    for i = 1:loop_time
        window_data_fft(:,i) = fft(window_data(:,i),1024);
    end

Matlab Code - Recreate the Modified Signal
    %%%%%%%%%%%%%%%%
    % Recreate Original Signal
    %%%%%%%%%%%%%%%%
    % Initialize the recreated signals
    hop = frame_rate_1/speed;   % output spacing between windows; must be an integer
    reconstructed_signal = zeros(1,(loop_time+1)*frame_rate_1);
    real_reconstructed_signal = zeros(1,(loop_time+1)*frame_rate_1);
    % Sized to the last sample that each synthesis loop writes
    modified_reconstructed_signal = zeros(1,(floor((loop_time-1)/speed)+2)*frame_rate_1);
    modified_reconstructed_signal_compressed = zeros(1,hop*(loop_time-2)+window_size_1);

Matlab Code - Recreate the Modified Signal
    % Perform the ifft
    for i = 1:loop_time
        % Full STFT frame (magnitude and phase)
        recreated_data_ifft(:,i) = ifft(window_data_fft(:,i),1024);
        % Magnitude-only frame (phase discarded)
        real_recreated_data_ifft(:,i) = ifft(abs(window_data_fft(:,i)),1024);
        % Keep the first window_size_1 samples and scale by frame rate / window sum
        truncated_recreated_data_ifft(:,i) = recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
        real_truncated_recreated_data_ifft(:,i) = real_recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
    end

Matlab Code - Recreate the Modified Signal
    % Get back to the original signal
    for i = 0:loop_time-1
        idx = (frame_rate_1*i)+1:((i+2)*frame_rate_1);
        reconstructed_signal(idx) = reconstructed_signal(idx) + truncated_recreated_data_ifft(:,i+1).';
        real_reconstructed_signal(idx) = real_reconstructed_signal(idx) + real_truncated_recreated_data_ifft(:,i+1).';
    end

Matlab Code - Recreate the Modified Signal
    % Get a modified signal by deleting certain parts (STFT)
    % floor keeps the frame index an integer when speed < 1 (frames then repeat)
    for i = 0:floor((loop_time-1)/speed)
        idx = (frame_rate_1*i)+1:((i+2)*frame_rate_1);
        modified_reconstructed_signal(idx) = modified_reconstructed_signal(idx) + real_truncated_recreated_data_ifft(:,floor(i*speed)+1).';
    end

Matlab Code - Recreate the Modified Signal
    % Initialize the compressed sequence (STFTM) with the tail of the first frame
    % (assumes speed > 1 so that the start index stays positive)
    modified_reconstructed_signal_compressed(1:frame_rate_1+hop+1) = truncated_recreated_data_ifft(frame_rate_1-hop:window_size_1,1).';
    % Get a modified signal by compressing
    for i = 0:(loop_time-2)
        idx = (hop*i)+1:(hop*i)+window_size_1;
        modified_reconstructed_signal_compressed(idx) = modified_reconstructed_signal_compressed(idx) + real_truncated_recreated_data_ifft(:,i+2).';
    end

Matlab Code - Plot Results
    %%%%%%%%%%%%%%%%
    % Plot Results
    %%%%%%%%%%%%%%%%
    figure;
    subplot(211)
    plot(sample_data)
    title('Original Speech');
    v1 = axis;
    hold on;
    subplot(212)
    plot(real(modified_reconstructed_signal))
    title(['STFT Synthesis w/ Speed = ',num2str(speed),'X']);
    v2 = axis;
    % Put both plots on the axes of the longer signal
    if speed > 1
        subplot(211); axis(v1)
        subplot(212); axis(v1)
    else
        subplot(211); axis(v2)
        subplot(212); axis(v2)
    end

Matlab Code - Write Sound Files
    %%%%%%%%%%%%%%%%
    % Write sound files
    %%%%%%%%%%%%%%%%
    wavwrite(real(modified_reconstructed_signal),sample_rate,nbits,'C:\Classes\ECE_5525\tea party fairbanks 2x.wav')

Examples
  Baseline Samples
    - Sample Rate 2X
    - STFT [sound file]
    - Sample Rate .5X
    - STFTM [sound file]
    - Original File

Examples
  STFT, Speed 0.5X
    [Figure: "Original Speech" (top) and "STFT Synthesis w/ Speed = 0.5X" (bottom) waveforms, with a linked sound file]

Examples
  STFT, Speed 2X
    [Figure: "Original Speech" (top) and "STFT Synthesis w/ Speed = 2X" (bottom) waveforms, with a linked sound file]

Examples
  STFT, Speed 4X
    [Figure: "Original Speech" (top) and "STFT Synthesis w/ Speed = 4X" (bottom) waveforms, with a linked sound file]

Examples
  STFTM, Speed 0.5X
    [Figure: "Original Speech" (top) and "STFTM Synthesis w/ Speed = 0.5X" (bottom) waveforms, with a linked sound file]

Examples
  STFTM, Speed 2X
    [Figure: "Original Speech" (top) and "STFTM Synthesis w/ Speed = 2X" (bottom) waveforms, with a linked sound file]

Examples
  STFTM, Speed 4X
    [Figure: "Original Speech" (top) and "STFTM Synthesis w/ Speed = 4X" (bottom) waveforms, with a linked sound file]

More Results
  Change in window size
    - If the window size becomes too small, a change in pitch will occur
    - The window needs to be 2 to 3 pitch periods long
    - I generally used 20 to 30 ms windows

More Results
  Change in frame rate
    - If the frame rate (the spacing between windows) decreases too much, too many samples overlap to get an intelligible signal
    [Figure: overlapping windows at a reduced frame rate]

More Results
  Change filter type
    - Tried a Hamming window: not much perceptual difference
    - Scaling by the window sum W0 becomes important here
    - Frame Rate / W0 is not equal to one for the Hamming window

Conclusion
  Optimum area
    - Frame rate is one half of the window size
    - Window size needs to be 2 to 3 pitch periods long
  It is possible to easily change the time scale and still maintain the original pitch, although the result is not always natural sounding

Conclusion
  Further investigation
    - What to do when you want
      to slow down to less than half speed
    - Using the STFTM means there will be gaps between the sequences
    [Figure: windowed segments with gaps between them]

Conclusion
  Further investigation
    - What to do when you want to slow down to less than half speed
    - Could replicate windowed segments
    [Figure: replicated windowed segments]

Conclusion
  Further investigation
    - Use the other methods to determine quality
        - Implement Sinusoidal Synthesis
        - Implement Linear Predictive Synthesis using linear prediction and homomorphic methods
    - Work on synchronizing pitch periods
        - Shift samples so that the peaks line up
        - Scott and Gerber, Synchronized Overlap and Add (SOLA)
            - Cross-correlation of two samples to find the peak
            - Use the peaks to line up samples
        - Align the window at the same relative location within a pitch period

Questions
  Are there any questions?

References
  - Quatieri, Thomas F. Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2002.
  - Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice Hall, Upper Saddle River, NJ, 1978.
  - Oppenheim, A.V. and Schafer, R.W. Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.
  - Scott, R. and Gerber, S. "Pitch Synchronous Time-Compression of Speech," Proc. Conf. Speech Communications Processing, pp. 63-85, April 1972.
  - Fairbanks, G., Everitt, W.L., and Jaeger, R.P. "Method for Time or Frequency Compression-Expansion of Speech," IEEE Transactions on Audio and Electroacoustics, vol. AU-2, pp. 7-12, Jan. 1954.
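
Sketch: lining up pitch periods with cross-correlation

The pitch-synchronization idea listed under further investigation (cross-correlate two segments to find the peak, then use the peak to line them up) could look roughly like the function below. This is a minimal sketch and not the code used for the results in these slides. The function name sola_sketch and its parameters win, hop_out, and max_shift are hypothetical. As a simplification, the shift is applied to where each frame is taken from the input rather than to where it is placed in the output, and the input is assumed to be mono and at least win + max_shift samples long. It uses only base MATLAB functions.

```matlab
function y = sola_sketch(x, speed, win, hop_out, max_shift)
% SOLA_SKETCH  Illustrative overlap-add time-scale change with
% cross-correlation alignment. Hypothetical helper, not the deck's code.
%   x          mono signal (vector), at least win + max_shift samples long
%   speed      > 1 speeds up, < 1 slows down
%   win        frame length in samples
%   hop_out    output spacing between frames in samples (must be < win)
%   max_shift  largest input offset searched, in samples
assert(win > hop_out, 'win must exceed hop_out so frames overlap');
x = x(:).';                                % force a row vector
hop_in = round(hop_out * speed);           % input spacing between frames
ov = win - hop_out;                        % overlap between neighbouring frames
n_frames = floor((length(x) - win - max_shift) / hop_in) + 1;
fade = (0:ov-1) / ov;                      % linear cross-fade ramp
y = x(1:win);                              % first frame is copied unchanged
for m = 1:n_frames-1
    start_in = m * hop_in + 1;             % nominal frame start in x
    tail = y(end-ov+1:end);                % output samples the new frame overlaps
    best_k = 0;
    best_score = -Inf;
    for k = 0:max_shift                    % try each candidate offset
        seg = x(start_in+k : start_in+k+ov-1);
        % Normalized cross-correlation between candidate and output tail
        score = sum(seg .* tail) / (sqrt(sum(seg.^2) * sum(tail.^2)) + eps);
        if score > best_score
            best_score = score;
            best_k = k;
        end
    end
    frame = x(start_in+best_k : start_in+best_k+win-1);
    % Cross-fade over the overlap, then append the rest of the frame
    y(end-ov+1:end) = (1 - fade) .* tail + fade .* frame(1:ov);
    y = [y, frame(ov+1:end)];
end
end
```

With the settings used earlier in the deck (200-sample windows, 100-sample frame spacing, speed = 2), a call such as y = sola_sketch(sample_data, 2, 200, 100, 50) returns a signal about half as long as the input. The search range of 50 samples is an assumed value; it should cover at least one pitch period at the file's sample rate.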