Time-Scale Modification of Music Signals

S. Grofit and Y. Lavner
Tel-Hai Academic College

1. Introduction

Time-scale modification (TSM) of audio signals is the process of changing the rate of an audio signal, such as speech or music, while leaving other parameters (pitch, timbre) unchanged. It is a subject of major theoretical and practical interest. In this study, a new algorithm for time-scale modification of music signals is presented.

Existing TSM techniques are used in many applications, for example in recording studios, for synchronization between different sounds, and for synchronization between the audio and the video components. To meet the requirement of high-quality time scaling of speech signals, a number of algorithms have been proposed in the past decade. Unfortunately, applying these algorithms to music signals does not yield satisfactory results.

The proposed algorithm is related to the PSOLA-like algorithms, which are based on the similarity of the short-time Fourier transform between the original and the time-scaled signal. The basic assumption of these algorithms is that the spectral characteristics of the signal are constant over short durations, and that the signal is quasi-periodic in the time domain. In PSOLA, the signal is divided into short overlapping frames, which are used to construct the time-scaled synthesis signal while maintaining the original spectral parameters and their related locations.

It is well known that the information contained in the temporal envelope of audio signals is important for perceptual quality. This information is not preserved when the overlap-and-add technique is applied to the original music signal, which causes reverberation and degrades the quality of the time-scaled music, especially at edges such as attacks and decays. An attempt to prevent the degradation by leaving these edges untouched improves the quality in some cases.
The short-time energy function used for detecting these edges is not sensitive enough in many situations, for example in very fast music or when many musical instruments play together. Therefore, in the presented algorithm, the problematic sections are detected using a Mel-scaled filter bank with a time-variant threshold function. This enables detection of the important edges, which are left intact, while the steady-state sections are time-scaled. In addition, the normalized correlation function, which is calculated in the PSOLA part of the algorithm, is used to detect frames whose frequency content is dissimilar beyond a given threshold.

2. PSOLA-Like Algorithms

Most non-parametric algorithms for time-scale modification of audio signals are based on minimizing the distance between the short-time Fourier transforms of the original and the time-scaled signals, in corresponding neighborhoods, according to the mapping function $\tau(m)$ [1]:

$$D\left[X_w(\tau(m),\omega),\, Y_w(m,\omega)\right] = \sum_{m} \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| X_w(\tau(m),\omega) - Y_w(m,\omega) \right|^2 d\omega \tag{1}$$

where $X_w(m,\omega)$ and $Y_w(m,\omega)$ are the short-time Fourier transforms (STFT) of $x(m)$ and $y(m)$, respectively, defined as:

$$X_w(m,\omega) = \sum_{n=-\infty}^{\infty} x(m+n)\, w(n)\, e^{-j\omega n} \tag{2}$$

and $w(m)$ is a window function of finite duration, symmetric and of low-pass type [2], such as a Hamming window. The overlap-and-add (OLA) equation minimizes the distance if there exist $S_1$, $S_2$ such that $S_1 k = \tau(S_2 k)$ for all $k$:

$$y(m) = \frac{\sum_{k} x(m + S_1 k - S_2 k)\, w^2(m - S_2 k)}{\sum_{k} w^2(m - S_2 k)} \tag{3}$$

Unfortunately, the OLA equation destroys the original phase relationships and does not maintain the quasi-periodic structure of the original signal. The WSOLA [3] algorithm avoids these discontinuities by allowing a local adjustment $\Delta_k$ in the selection of analysis frames in the input signal:

$$y(m) = \frac{\sum_{k} x(m + S_1 k - S_2 k + \Delta_k)\, w^2(m - S_2 k)}{\sum_{k} w^2(m - S_2 k)} \tag{4}$$

choosing $w^2(m)$ so that the support of the window is $2 S_2$ and weighting it such that $\sum_{k} w^2(m - S_2 k) = 1$. If $v(m)$ is the weighted window, the WSOLA equation becomes:

$$y(m) = \sum_{k} x(m + S_1 k + \Delta_k - S_2 k)\, v(m - S_2 k) \tag{5}$$
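As a rough illustration of equation (5), the following Python sketch builds the output by adding windowed input frames at hop S2 while reading them at hop S1, with each frame shifted by an offset found through a normalized-correlation search over the overlap region. The function names, the sin² window (one choice for which copies of v shifted by S2 sum to one), the tie-breaking order, and the boundary handling are assumptions made for the example and are not taken from the paper.

```python
import math

def normalized_correlation(a, b):
    """Normalized cross-correlation of two equal-length segments, in [-1, 1]."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def wsola(x, s1, s2, delta_max):
    """Time-scale x by roughly s2 / s1 (illustrative sketch of equation (5))."""
    n_win = 2 * s2  # window support is 2 * S2
    # Assumed window: v(n) = sin^2, so v(n) + v(n + S2) = 1 in the overlap.
    v = [math.sin(math.pi * (n + 0.5) / n_win) ** 2 for n in range(n_win)]
    # Try small shifts first so that ties keep the smallest |shift|.
    shifts = sorted(range(-delta_max, delta_max + 1), key=abs)
    y = []
    prev = None  # start index (in x) of the previously selected frame
    k = 0
    while s1 * k + n_win <= len(x):
        start = s1 * k  # default: no adjustment
        if prev is not None:
            # Second half of the previous frame: what the new frame overlaps.
            target = x[prev + s2:prev + n_win]
            best = -2.0
            for d in shifts:
                cand = s1 * k + d
                if cand < 0 or cand + n_win > len(x):
                    continue  # candidate frame would leave the signal
                c = normalized_correlation(target, x[cand:cand + s2])
                if c > best:
                    best, start = c, cand
        out = s2 * k  # frame k is placed at S2 * k in the output
        y.extend([0.0] * (out + n_win - len(y)))
        for n in range(n_win):
            y[out + n] += v[n] * x[start + n]
        prev = start
        k += 1
    return y
```

With `s1` equal to `s2`, the zero shift reproduces the natural continuation of the previous frame, so the interior of the output matches the input. A ratio `s2 / s1` above one stretches the signal, and a ratio below one compresses it. The first and last half-frames are covered by a single window only, so they are attenuated in this sketch.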
In the study presented here, $\Delta_k$ was selected according to the maximal normalized correlation between overlapping windowed frames:

$$\Delta_k = \arg\max_{-\Delta_{\max} \le \delta \le \Delta_{\max}} c_n(k,\delta), \qquad c_n(k,\delta) = \frac{\sum_{m} x(m + S_1 (k-1) + \Delta_{k-1} + S_2)\; x(m + S_1 k + \delta)}{\sqrt{\sum_{m} x^2(m + S_1 (k-1) + \Delta_{k-1} + S_2)\; \sum_{m} x^2(m + S_1 k + \delta)}} \tag{6}$$

where the sums run over the samples of the overlap region of the two frames, and $\Delta_{\max} \le S_1 / 2$ in order to prevent time-reversal in the input signal. For synchronization between adjacent pitch periods, $\Delta_{\max}$ has to be at least half the maximal pitch period.

The algorithm was implemented so that, at each step index $k$, the sum of two overlapping half frames is added to the constructed time-scaled output signal. If the decision is not to overlap the frames, $J$ samples from the input signal are copied to the output signal, and the procedure restarts from the next frame.

Figure 1: Technique for avoiding overlap of a non-stationary section. Note that $n_0 J$ may be much larger than the non-stationary section.

3. Spectral non-stationarity detection and untouched copy decision

Copying sections from the input signal to the output signal without applying the OLA operation modifies the mapping function. Defining long sections as spectrally significant could create undesirable distortion effects. Therefore, local thresholds have to be used, adaptive both to the input signal and to the required mapping function.

3.1 Spectral distortion measure based on human auditory perception

This study proposes a technique for detecting events of significant spectral change [4], such as “attacks” and “decays” of musical instruments, based on characteristics of the human hearing system. Time-scaling of these sections can degrade the music signal by stretching (scaling factor > 1) or eliminating (scaling factor < 1) the important sections, causing reverberation and distortion. Thus, the regular PSOLA-like algorithms do not operate properly on music signals.
For this purpose, the Mel-Frequency Cepstrum (MFC) coefficients are computed in successive analysis frames of the input signal. Let $MFCC(nJ, l)$ be coefficient $l$ in a frame centered around $nJ$, where $J$ is the distance (in samples) between adjacent frames. The measure of spectral non-stationarity in the section $[nJ, (n+1)J]$ is defined as

$$CD(nJ) = \sum_{l=0}^{N_L - 1} \left[ MFCC(nJ, l) - MFCC((n+1)J, l) \right]^2 \tag{7}$$

where $N_L$ is the number of MFC coefficients.

3.2 Threshold function for the spectral non-stationarity measure

The duration of an untouched section from the input signal depends on the length of the common support of overlapping half-frames. The average of this length is:

$$N_{ol} = \begin{cases} 2 S_2 - S_1, & S_1 > S_2 \\ S_1, & S_1 \le S_2 \end{cases} \tag{8}$$

The threshold values are set so that only a desired percentage of the original signal is copied without modification; therefore they are based on the following function:

$$CD_{ol}(nJ) = \max_{0 \le i \le N_{ol}/J} CD((n+i)J) \tag{9}$$

The final local threshold value $T_{CDF}(nJ)$ is set according to two threshold values: a global value $T_{CDG}$ and a local value $T_{CDL}(nJ)$. The local threshold is chosen to be the $(1 - P_{CDL})$ percentile of $CD_{ol}(nJ)$ in a neighborhood of $N_{lt}$ samples:

$$T_{CDL}(nJ) = P\left[ CD_{ol}((n+i)J),\; 1 - P_{CDL} \right], \qquad -\frac{N_{lt}}{2J} \le i \le \frac{N_{lt}}{2J} \tag{10}$$

The local threshold is intended to select the most important sections for unmodified copying in a local interval, thus avoiding long unmodified sections, which may drastically change the mapping function and produce rate distortions.

The global threshold $T_{CDG} = P\left[ CD_{ol}(nJ),\; 1 - P_{CDG} \right]$ is aimed at preventing the copying of unnecessary sections in spectrally stationary signals. Consequently:

$$T_{CDF}(nJ) = \max\left[ T_{CDL}(nJ),\; T_{CDG} \right] \tag{11}$$

where $P_{CDG}$ and $P_{CDL}$ are the global and local percentile parameters, respectively.

Figure 2: A) Blue line: original signal; red line: untouched sections marked. B) Mel-scale spectral coefficients over time. C) Blue line: $CD(nJ)$; green line: $T_{CDL}(nJ)$; red line: $T_{CDG}$. D) Energy distance [dB].
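The measure and thresholds of Sections 3.1 and 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the MFCC vectors are taken as already computed by some front end, the percentile uses a simple nearest-rank convention, `span` stands for $N_{ol}/J$ and `half_width` for $N_{lt}/(2J)$ in frames, and all function names are hypothetical.

```python
import math

def cepstral_distance(mfcc):
    """Squared distance between adjacent MFCC frames (the measure CD)."""
    return [sum((a - b) ** 2 for a, b in zip(mfcc[n], mfcc[n + 1]))
            for n in range(len(mfcc) - 1)]

def running_max(cd, span):
    """Maximum of CD over frames n .. n + span (the function CD_ol)."""
    return [max(cd[n:n + span + 1]) for n in range(len(cd))]

def percentile(values, q):
    """Nearest-rank percentile for 0 <= q <= 1 (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def final_threshold(cd_ol, p_local, p_global, half_width):
    """Per-frame maximum of the local and the global percentile thresholds."""
    t_global = percentile(cd_ol, 1.0 - p_global)
    result = []
    for n in range(len(cd_ol)):
        # Local neighborhood, truncated at the signal boundaries.
        neighborhood = cd_ol[max(0, n - half_width):n + half_width + 1]
        t_local = percentile(neighborhood, 1.0 - p_local)
        result.append(max(t_local, t_global))
    return result

# Toy input: five identical frames followed by five different ones.
mfcc = [[0, 0]] * 5 + [[3, 4]] * 5
cd_ol = running_max(cepstral_distance(mfcc), span=1)
thresholds = final_threshold(cd_ol, p_local=0.125, p_global=0.5, half_width=2)
```

In the toy input, the only spectral change lies between the fifth and sixth frames, so `cd_ol` is non-zero only around that transition, and the local threshold rises in its neighborhood while staying at the global value elsewhere.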
3.3 Normalized correlation: motivation and thresholds

A careful look at WSOLA reveals a built-in mechanism for flagging the overlap of non-stationary segments. For each step index $k$, $\Delta_k$ is selected according to the maximal normalized correlation between overlapping windowed frames, so the quality of an output frame can be characterized by the normalized correlation achieved. This measure requires no additional computation, and it is based on the relation between the input and the constructed output signal. On the other hand, it does not take the properties of the human ear into account, and it tends to ignore high frequencies when low frequencies are present. Although the normalized correlation has a fixed range, $c_n(k,\delta) \in [-1, 1]$, an adaptive threshold function was found necessary for controlling the copy rate and avoiding tempo distortions. A normalized correlation function is created by running WSOLA. The local threshold function $T_{NCL}(m)$ and the global threshold $T_{NCG}$ are calculated using the percentile method described above.

4. Correcting the mapping function to compensate for unscaled intervals

A method for time-scale modification that locates spectrally significant events was presented above. The method necessarily modifies the required mapping function. For example, assume a constant mapping function with a scaling factor of 1.5, and suppose that 10% of the signal is selected for untouched replication. The output signal will then have a scaling ratio of $1.5 \cdot 0.9 + 0.1 = 1.45$. Unfortunately, the total duration of the untouched sections cannot be determined accurately in advance, and hence a constant mapping function that provides the required mapping cannot be evaluated. The mapping function is realized in the ratio between $S_1$ and $S_2$, so that changing either $S_1$ or $S_2$ modifies the mapping function. Here we chose to change $S_1$, which avoids modifying the window function.
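The arithmetic of the example in this section can be checked with a short Python sketch. Both function names are illustrative and not part of the original method. The second function simply inverts the first, showing what ratio the time-scaled sections would need if the copied fraction were known in advance, which the text notes is not the case in practice.

```python
def effective_ratio(requested, copied_fraction):
    """Overall output/input ratio when a fraction of the input is copied 1:1."""
    return requested * (1.0 - copied_fraction) + copied_fraction

def steady_ratio_needed(target, copied_fraction):
    """Ratio required in the scaled sections so the overall ratio hits target."""
    return (target - copied_fraction) / (1.0 - copied_fraction)

overall = effective_ratio(1.5, 0.10)          # about 1.45, as in the text
corrected = steady_ratio_needed(1.5, 0.10)    # about 1.556
```

Since the copied fraction is only known after detection has run over the signal, a per-frame correction of $S_1$ is used instead of a single corrected constant.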
In each frame, $S_1$ was calculated so that the difference between the desired and the actual scaling is compensated within $N_{fix}$ samples, ignoring $\Delta_k$:

12) S1 S 2 X N fix offset Yoffset N fix

where $X_{offset}$ and $Y_{offset}$ are the indices of the former frame in the input and output signals, respectively.

Figure 3: Blue line: original signal. Red line: the mapping function over time. Requested constant mapping function: 1.5. Unscaled ratio (of input length): 9.03%.

5. Accurate Time-Scale Modification

The WSOLA algorithm as presented in [2] does not guarantee accurate time-scaling, which is an important requirement in some applications. Given the limited scope of the present paper, only a brief and general outline of a technique for accurate time-scaling that replicates events containing spectral non-stationarities is presented here. The technique, which is a variation of WSOLA, enables time-scaling of a given input signal of length $X_{Len}$ to an output signal of length $Y_{Len}$, with an accuracy of up to a few samples.

Suppose $M$ sections have been chosen for untouched replication, according to the method described in Sections 3.1 and 3.2. Denote these sections, as an ascending sequence of non-overlapping sections, by 1 , 1 , 2 , 2 ,..., M , M , where i and i are the left and right delimiters, respectively. The preferred locations of these sections in the output signal are a weighted average according to their respective locations in the input signal with respect to the left and right delimiters:

i , i i , i 13) i i X Len i i i X Len i i i i i X Len i i i i i i i

unless i 1 or i X Len . In this case, the untouched sections are replicated to the left or right end of the current output signal. In frames that are not chosen for replication, time-scaling is performed as described above.

The technique is not problem-free, since it increases the length of the replicated sections, but it does meet the time-scaling requirement.

6. Conclusion

In this study, a method for time-scale modification of music signals was presented. The method is based on the detection of spectral non-stationarities in music signals, events that are perceptually significant. The remaining stationary sections are time-scaled using an improved version of the WSOLA algorithm. Informal listening tests indicated that the proposed algorithm produces better results than other algorithms such as SOLA, WSOLA, and EDSOLA. A drawback of the algorithm is its high computational complexity.

References:

[1] Griffin D.W. and Lim J.S., “Signal Estimation from Modified Short-Time Fourier Transform.” IEEE Trans. on ASSP 32(2), April (1984), 236-243.
[2] Moulines E. and Laroche J., “Non-Parametric techniques for pitch-scale and time-scale modification of speech.” Speech Communication 16 (1995), 175-205.
[3] Verhelst W. and Roelands M., “An Overlap-Add Technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech.” ICASSP-93, (1993), 554-557.
[4] Kapilow D., Stylianou Y., and Schroeter J., “Detection of non-stationarities in speech signals and its application in time-scaling.” Eurospeech 99, (1999).

This study was partly supported by a Guastella Fellowship of the Sacta-Rashi Foundation, and the JAFI project to promote higher education in the Eastern Galilee.