Time-Scale Modification of Music Signals

S. Grofit and Y. Lavner
Tel-Hai Academic College
1. Introduction
Time-Scale Modification (TSM) of audio signals is the process of modifying the rate of audio signals such as speech or
music, while maintaining other parameters (pitch, timbre) unchanged. It is a subject of major theoretical and practical
interest. In this study, a new algorithm for time-scale modification of music signals is presented. Present techniques for
TSM of audio signals are used in many applications, for example, in recording studios, for synchronization between
different sounds, and between the audio and the video components. To meet the requirement of high-quality time
scaling of speech signals, a number of algorithms have been proposed in the past decade. Unfortunately, applying these
algorithms to music signals does not yield satisfactory results. The proposed algorithm is related to the PSOLA-like
algorithms, which are based on the similarity of the short-time Fourier transform between the original and the time-scaled signal. The basic assumption of these algorithms is that the spectral characteristics of the signal are constant for
short durations, and that the signal is quasi-periodic in the time domain. The signal in PSOLA is divided into short-time
overlapping frames, which are used for constructing the time-scaled synthesis signal, while maintaining the original
spectral parameters and their relative locations.
It is well known that the information contained in the temporal envelope of audio signals is important for perceptual
quality. This information is not preserved when applying the overlap-and-add technique to the original music signal,
causing reverberations and degrading the quality of the time-scaled music, especially in edges like attacks and decays.
An attempt to prevent the degradation by maintaining these edges untouched improves the quality in some cases. The
short-time energy function used for detecting these edges is not sensitive enough for many situations, for example in
very fast music, or when there are many musical instruments playing together. Therefore, in the presented algorithm, the
problematic sections are detected using a Mel-scaled filter-bank with a time variant threshold function. This enables
detection of the important edges, leaving these sections intact, while time-scaling the steady-state sections. In addition,
the normalized correlation function, which is calculated in the PSOLA part of the algorithm, is used to detect frames
whose frequency content is dissimilar above a given threshold.
2. PSOLA-Like Algorithms
Most non-parametric algorithms for time-scale modification of audio signals are based on minimizing the distance
between the short-time Fourier transforms of the original and the time-scaled signals, in corresponding neighborhoods,
according to the mapping function $\tau(m)$ [1]:
1) $$D\left[X_w(m,\omega),\, Y_w(\tau(m),\omega)\right] = \sum_{m=-\infty}^{\infty} \int_{-\pi}^{\pi} \left| X_w(m,\omega) - Y_w(\tau(m),\omega) \right|^2 d\omega$$
where $X_w(m,\omega)$ and $Y_w(\tau(m),\omega)$ are the short-time Fourier transforms (STFT) of $x(m)$ and $y(m)$, respectively, defined
as:
2) $$X_w(m,\omega) = \sum_{n=-\infty}^{\infty} x(m+n)\, w(n)\, e^{-j\omega n}$$
and $w(n)$ is a window function of finite duration, of symmetric and low-pass type [2], such as a Hamming window.
The overlap-and-add (OLA) equation minimizes the distance if there exist $S_1$, $S_2$ such that $\tau(S_1 \cdot k) = S_2 \cdot k$ for all $k$:
3) $$y(m) = \frac{\sum_k x(m + S_1 k - S_2 k)\, w^2(m - S_2 k)}{\sum_k w^2(m - S_2 k)}$$
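The OLA rule of equation 3 can be illustrated with a minimal sketch: frame k is read from the input at $S_1 \cdot k$ and written to the output at $S_2 \cdot k$, and the result is divided by the accumulated window. The function name `ola_time_scale`, the use of NumPy, and the periodic Hann window standing in for $w^2$ are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def ola_time_scale(x, s1, s2):
    """Plain OLA sketch: frame k is read at s1*k and written at s2*k."""
    x = np.asarray(x, dtype=float)
    n_win = 2 * s2                        # window support of 2*S2
    # Periodic Hann window, used here in the role of w^2 (an assumption).
    w2 = np.hanning(n_win + 1)[:-1]
    n_frames = (len(x) - n_win) // s1 + 1
    y = np.zeros((n_frames - 1) * s2 + n_win)
    norm = np.zeros_like(y)
    for k in range(n_frames):
        y[k * s2 : k * s2 + n_win] += x[k * s1 : k * s1 + n_win] * w2
        norm[k * s2 : k * s2 + n_win] += w2
    nonzero = norm > 1e-12                # skip samples no window covers
    y[nonzero] /= norm[nonzero]
    return y

# Example: stretch a constant signal with S2/S1 = 1.5.
y = ola_time_scale(np.ones(1024), s1=64, s2=96)
```

No alignment search is performed here, so on a periodic input the overlapping frames are generally out of phase.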
Unfortunately, the OLA equation destroys the original phase relationships and does not maintain the quasi-periodic
structure of the original signal. The WSOLA [3] algorithm avoids the resulting discontinuities by
allowing a local adjustment $\Delta_k$ in the selection of analysis frames in the input signal:
4) $$y(m) = \frac{\sum_k x(m + S_1 k - S_2 k + \Delta_k)\, w^2(m - S_2 k)}{\sum_k w^2(m - S_2 k)},$$
choosing $w^2(m)$ so that the support of the window is $2 \cdot S_2$ and weighting it such that $\sum_k w^2(m - S_2 k) = 1$ for all $m$.
If $v(m)$ is the weighted window, the WSOLA equation will be:
5) $$y(m) = \sum_k x(m + S_1 k - S_2 k + \Delta_k)\, v(m - S_2 k).$$
In the study presented here, $\Delta_k$ was selected according to the maximal normalized correlation between overlapping
windowed frames:
6) $$\Delta_k = \arg\max_{-\Delta_{max} \le \Delta \le \Delta_{max}} c_n(k,\Delta), \qquad c_n(k,\Delta) = \frac{\sum_{m=1}^{S_2} x(m + S_1 (k-1) + S_2 + \Delta_{k-1})\, x(m + S_1 k + \Delta)}{\sqrt{\sum_{m=1}^{S_2} x^2(m + S_1 (k-1) + S_2 + \Delta_{k-1}) \sum_{m=1}^{S_2} x^2(m + S_1 k + \Delta)}}$$
where $\Delta_{max} \le S_1 / 2$ in order to prevent time-reversal in the input signal, and, for synchronization between adjacent pitch
periods, $\Delta_{max}$ has to be at least half the maximal pitch period. The algorithm was implemented so that for each step index $k$, the
sum of two overlapping half frames is added to the constructed time-scaled output signal. If the decision is not to overlap the
frames, J samples from the input signal will be copied to the output signal, and the procedure will restart from the next
frame.
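[Figure 1: diagram of analysis frame positions expressed in terms of $S_1$ and $S_2$, with a copied section of length $n_0 \cdot J$]
Figure 1: Technique for avoiding non-stationary section overlap. Notice that $n_0 \cdot J$ may be much larger than the non-stationary section.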
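The search for $\Delta_k$ by maximal normalized correlation (equation 6) can be sketched as follows. This is an illustrative reading of the formula, not the authors' implementation: the function name `best_offset`, the 0-based indexing (the equation sums from $m = 1$), and the handling of out-of-range candidates are all assumptions.

```python
import numpy as np

def best_offset(x, k, s1, s2, prev_delta, delta_max):
    """Return (delta_k, c_n) maximizing the normalized correlation."""
    x = np.asarray(x, dtype=float)
    # Natural continuation of the previously selected frame k-1.
    start = s1 * (k - 1) + s2 + prev_delta
    ref = x[start : start + s2]
    best_delta, best_c = 0, -np.inf
    for delta in range(-delta_max, delta_max + 1):
        lo = s1 * k + delta
        if lo < 0 or lo + s2 > len(x):
            continue                      # candidate falls outside the signal
        cand = x[lo : lo + s2]
        denom = np.sqrt(np.dot(ref, ref) * np.dot(cand, cand))
        c = np.dot(ref, cand) / denom if denom > 0 else 0.0
        if c > best_c:
            best_delta, best_c = delta, c
    return best_delta, best_c

# Example: a sinusoid with a period of 50 samples.
x = np.sin(2 * np.pi * np.arange(2000) / 50.0)
delta_k, c_n = best_offset(x, k=3, s1=100, s2=130, prev_delta=0, delta_max=40)
```

For a purely periodic input the selected offset aligns the candidate frame with the reference up to a whole number of periods, so the correlation is close to 1. The returned correlation value is the quantity that section 3.3 reuses as a quality indicator.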
3. Spectral non-stationarity detection and untouched copy decision
Copying sections from the input signal to the output signal without applying the OLA operation would modify the mapping
function. Defining long sections as spectrally significant could create undesirable distortion effects. Therefore, local
thresholds have to be used, adaptive both to the input signal and the required mapping function.
3.1 Spectral distortion measure based on human auditory perception
This study proposes a technique for detecting events of significant spectral change [4], such as “attacks” and “decays”
of musical instruments, based on characteristics of the human hearing system. Time-scaling of these sections can
deteriorate the music signal by stretching (scaling factor > 1) or eliminating (scaling factor < 1) the important sections,
causing reverberations and distortions. Thus, the regular PSOLA-like algorithms do not operate properly on music
signals. To detect these events, the Mel-Frequency Cepstrum (MFC) coefficients are computed in successive analysis frames of
the input signal. Let $MFCC(n \cdot J, l)$ be coefficient $l$ in a frame centered around $n \cdot J$, where $J$ is the distance (in
samples) between adjacent frames. The measure for spectral non-stationarity in the section $[n \cdot J, (n+1) \cdot J]$ is defined as
7) $$CD(n \cdot J) = \sum_{l=0}^{N_L - 1} \left[ MFCC(n \cdot J, l) - MFCC((n+1) \cdot J, l) \right]^2$$
where $N_L$ is the number of MFC coefficients.
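Given a matrix of MFC coefficients, the non-stationarity measure of equation 7 reduces to a squared difference between adjacent frames. The sketch below assumes the coefficients have already been computed by some front end (not shown) and are stored as `mfcc[n, l]`; the function name `spectral_distance` is hypothetical.

```python
import numpy as np

def spectral_distance(mfcc):
    """CD(n*J) for n = 0..N-2, where mfcc[n, l] is coefficient l of frame n."""
    mfcc = np.asarray(mfcc, dtype=float)
    diff = mfcc[:-1] - mfcc[1:]           # frame n minus frame n+1
    return np.sum(diff ** 2, axis=1)      # sum over the N_L coefficients

# Two steady frames, an abrupt change, then two steady frames again.
cd = spectral_distance([[0, 0], [0, 0], [3, 4], [3, 4]])
```

In this toy input only the transition between the second and third frames yields a non-zero value (3² + 4² = 25).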
3.2 Threshold function for the spectral non-stationarity measure
The duration of an untouched section from the input signal depends on the length of the common support of overlapping
half-frames. The average of this length will be:
8) $$N_{ol} = \begin{cases} 2 \cdot S_2 - S_1, & S_1 \le S_2 \\ S_1, & S_1 > S_2 \end{cases}$$
The threshold values are set so that only a desired percentage of the original signal is copied without modification;
therefore they are based on the following function:
9) $$CD_{ol}(n \cdot J) = \max\left\{ CD((n+i) \cdot J) : 0 \le i \le \frac{N_{ol}}{J} \right\}$$
The final local threshold value $T_{CDF}(n \cdot J)$ will be set according to two threshold values: a global $T_{CDG}$ and a local
$T_{CDL}(n \cdot J)$ value.
The local threshold value is chosen to be the value at the $(1 - P_{CDL})$ percentile of $CD_{ol}(n \cdot J)$ in a neighborhood of
$N_{lt}$ samples:
10) $$T_{CDL}(n \cdot J) = P\left( CD_{ol}((n+i) \cdot J),\ 1 - P_{CDL} \right), \quad -\frac{N_{lt}}{2 \cdot J} \le i \le \frac{N_{lt}}{2 \cdot J}$$
The local threshold is intended to select the most important sections for unmodified copying in a local interval, thus
avoiding long unmodified sections which may drastically change the mapping function and produce rate distortions.
The global threshold $T_{CDG} = P\left( CD_{ol}(n \cdot J),\ 1 - P_{CDG} \right)$ is aimed at preventing the copying of unnecessary sections in spectrally
stationary signals. Consequently:
11) $$T_{CDF}(n \cdot J) = \max\left\{ T_{CDL}(n \cdot J),\ T_{CDG} \right\}$$
where $P_{CDG} \ge P_{CDL}$.
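The threshold construction of equations 9 to 11 can be sketched as a running maximum followed by local and global percentiles. Several details here are assumptions rather than statements from the paper: the function name `copy_thresholds`, the truncation of windows at the signal edges, the integer division used for the index ranges, and the use of `numpy.percentile` with its default linear interpolation as the percentile operator $P$.

```python
import numpy as np

def copy_thresholds(cd, j, n_ol, n_lt, p_cdl, p_cdg):
    """Return (CD_ol, T_CDF) for a sequence of CD values spaced J samples apart."""
    cd = np.asarray(cd, dtype=float)
    n = len(cd)
    span = n_ol // j                      # look-ahead of N_ol / J frames
    cd_ol = np.array([cd[i : i + span + 1].max() for i in range(n)])
    half = n_lt // (2 * j)                # neighborhood of N_lt samples
    t_cdl = np.array([
        np.percentile(cd_ol[max(0, i - half) : i + half + 1], 100 * (1 - p_cdl))
        for i in range(n)
    ])
    t_cdg = np.percentile(cd_ol, 100 * (1 - p_cdg))   # one value for the whole signal
    return cd_ol, np.maximum(t_cdl, t_cdg)

# Example: a single spectral change in an otherwise stationary sequence.
cd = np.zeros(20)
cd[10] = 5.0
cd_ol, t_cdf = copy_thresholds(cd, j=1, n_ol=2, n_lt=8, p_cdl=0.1, p_cdg=0.2)
```

Taking the maximum of the two thresholds means the global value only takes effect where the local neighborhood is uniformly quiet, which matches the stated purpose of the global threshold.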
Figure 2: A) Blue line - original signal. Red line - untouched sections marked.
B) Mel-scale spectral coefficients over time.
C) Blue line - $CD(n \cdot J)$. Green line - $T_{CDL}(n \cdot J)$. Red line - $T_{CDG}$.
D) Energy distance [dB].
3.3 Normalized correlation: motivation and thresholds
A close look at WSOLA reveals a built-in mechanism for flagging the overlap of non-stationary segments. For each
step index $k$, $\Delta_k$ is selected according to the maximal normalized correlation between overlapping windowed frames.
The quality of an output frame can therefore be characterized by the normalized correlation achieved. This measure requires no
additional computation, and is based on the relation between the input and the constructed output signal. On the other
hand, it does not take the properties of human hearing into account, and it tends to ignore high frequencies when low frequencies are present.
Although the normalized correlation has a constant range, $c_n(k, \Delta) \in [-1, 1]$, an adaptive threshold function was found necessary for
controlling the copy rate and avoiding tempo distortions. A normalized correlation function is created by running WSOLA.
The local threshold function $T_{NCL}(m)$ and the global threshold $T_{NCG}$ are calculated using the percentile method described above.
4. Correcting the mapping-function to compensate for unscaled intervals
A method for time-scale modification that locates spectrally significant events was presented above. The method
necessarily modifies the required mapping function. For example, assume a constant mapping function with $\alpha = 1.5$, and
suppose that 10% of the signal is selected for untouched replication. The output signal will then follow an effective scaling
ratio of $1.5 \cdot 0.9 + 0.1 = 1.45$. Unfortunately, the total duration of the untouched sections cannot be accurately
determined in advance, and hence a constant mapping function that provides the required mapping cannot be evaluated. The
mapping function is realized in the ratio between $S_1$ and $S_2$, so that changing either $S_1$ or $S_2$ will modify the mapping
function. Here we chose to change $S_1$ and to avoid modifying the window function. In each frame $S_1$ was calculated so
that the difference between the desired and the actual scaling is compensated within $N_{fix}$ samples, ignoring $\Delta_k$:
12) $$S_1 = S_2 \cdot \frac{N_{fix}}{(\alpha \cdot X_{offset} - Y_{offset}) + \alpha \cdot N_{fix}}$$
where $X_{offset}$ and $Y_{offset}$ are the indices of the former frame in the input and output signals, respectively.
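The arithmetic of this section can be checked with a short sketch. It assumes that equation 12 reads $S_1 = S_2 \cdot N_{fix} / ((\alpha \cdot X_{offset} - Y_{offset}) + \alpha \cdot N_{fix})$, with the requested scaling factor written as `alpha`; the function names `effective_ratio` and `corrected_s1` are hypothetical.

```python
def effective_ratio(alpha, copied_fraction):
    """Scaling ratio actually obtained when part of the input is copied unscaled."""
    return alpha * (1.0 - copied_fraction) + copied_fraction

def corrected_s1(s2, alpha, x_offset, y_offset, n_fix):
    """Analysis step S1 that removes the accumulated scaling error within n_fix samples."""
    return s2 * n_fix / ((alpha * x_offset - y_offset) + alpha * n_fix)

# The worked example from the text: 10% copied at alpha = 1.5.
print(effective_ratio(1.5, 0.1))

# On target (y_offset == alpha * x_offset): S1 reduces to S2 / alpha.
print(corrected_s1(300, 1.5, 1000, 1500, 2000))

# Output lagging behind the target: S1 shrinks, so the signal is stretched more.
print(corrected_s1(300, 1.5, 1000, 1400, 2000))
```

When the output is exactly on target the correction term vanishes and the ratio $S_2 / S_1$ equals the requested factor; a lagging output lowers $S_1$ so that subsequent frames advance more slowly through the input.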
Figure 3: Blue line - original signal. Red line - the mapping function over time.
Requested constant mapping function: $\alpha = 1.5$. Unscaled ratio (of input length): 9.03%
5. Accurate Time-Scale Modification
The WSOLA algorithm as presented in [3] does not guarantee accurate time-scaling, which is an important requirement
in some applications. Given the scope of the present paper, only a brief and general outline of a technique for accurate
time-scaling that replicates events containing spectral non-stationarities is presented here. The technique, which is a
variation of WSOLA, enables time-scaling of a given input signal of length $X_{Len}$ to an output signal of length $Y_{Len}$, with
an accuracy of up to a few samples. Suppose $M$ sections have been chosen for untouched replication, according to the
method described in sections 3.1 and 3.2. Let us denote these sections as an ascending sequence of non-overlapping
sections $[a_1, b_1], [a_2, b_2], \ldots, [a_M, b_M]$, where $a_i$ and $b_i$ are the left and right delimiters, respectively. The preferred
locations of these sections in the output signal will be a weighted average according to their respective locations in the
input signal with respect to the left and right delimiters:
 i , i    i ,  i 
13)

 i     i  


X
Len
  i   i     i
X Len   i   i 
    i     i   i   
X Len

i

  i   i  
 i   i   i   i 
unless the left delimiter equals 1 ($a_i = 1$) or the right delimiter equals $X_{Len}$ ($b_i = X_{Len}$). In these cases, the untouched sections will be replicated to the left or right ends of the current
output signal. In frames that are not chosen for replication, time-scaling will be performed as described above.
The technique is not problem-free, since it increases the length of the replication sections, but it does meet the time-scaling requirement.
6. Conclusion
In this study a method for time-scale modification of music signals is presented. The method is based on detection of
spectral non-stationarities in music signals, events that are perceptually significant. The remaining stationary sections are
time-scaled using an improved version of the WSOLA algorithm. Informal listening tests indicated that the proposed
algorithm produces better results than other algorithms such as SOLA, WSOLA, and EDSOLA.
A drawback of the algorithm is its high computational complexity.
References:
[1] Griffin D.W. and Lim J.S., “Signal Estimation from Modified Short-Time Fourier Transform.” IEEE Trans. on
ASSP 32(2) (April 1984), 236-243.
[2] Moulines E. and Laroche J., “Non-Parametric techniques for pitch-scale and time-scale modification of speech.”
Speech Communication 16 (1995), 175-205.
[3] Verhelst W. and Roelands M., “An Overlap-Add Technique based on waveform similarity (WSOLA) for high
quality time-scale modification of speech.” ICASSP-93, (1993), 554-557.
[4] Kapilow D., Stylianou Y., and Schroeter J., “Detection of non-stationarities in speech signals and its application
in time-scaling.” Eurospeech 99, (1999).
This study was partly supported by a Guastella Fellowship of the Sacta-Rashi Foundation, and the JAFI project to
promote higher education in the Eastern Galilee.