Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999

MAXIMIZATION OF THE SUBJECTIVE LOUDNESS OF SPEECH WITH CONSTRAINED AMPLITUDE

Jarno Seppänen, Sami Kananoja, Jari Yli-Hietanen, Konsta Koppinen and Jari Sjöberg

Tampere University of Technology, Signal Processing Laboratory, P.O. Box 553, FIN-33101 Tampere, Finland
Nokia Research Center, Speech and Audio Systems Laboratory, P.O. Box 100, FIN-33721 Tampere, Finland
{jarno.seppanen, sami.kananoja, jari.yli-hietanen, konsta.koppinen}@cs.tut.fi, jari.sjoberg@research.nokia.com

ABSTRACT

We introduce an adaptive algorithm that constrains the amplitude of speech signals while trying to maintain their subjective loudness without producing disturbing artifacts. The algorithm can be applied to compensate for the clipping distortion of amplifiers in speech reproduction devices. It analyzes the speech signal on multiple frequency bands and applies an internal audibility law in order to make inaudible changes to the signal. An example of the audibility law, presented in the form of a matrix and associated with a specific speech reproduction device, is described. The band-pass signals are processed with a waveshaper to accomplish soft-clipping and to constrain the amplitude of the processed signal. When speech signals were processed with the proposed algorithm, their computational loudness was found to diminish only slightly (approximately 6 sones), while the signal amplitude could be reduced by as much as 15 dB.

1. INTRODUCTION

The compensation of nonlinearities inherent in amplifier–speaker systems is becoming practical due to the performance of modern digital signal processors. In this paper we introduce an algorithm for the enhancement of speech via compensation of the most notable nonlinearity of amplifier–speaker systems: the clipping distortion of the amplifier. Since clipping distortion is not an invertible process, the signal amplitude must be constrained before the signal enters the amplifier. However, if the incoming speech signal were merely scaled to satisfy this constraint, the resulting speech would sound softer than the unscaled, clipped speech. This is undesirable especially in noisy conditions, where louder, slightly colored speech is often preferable to a quieter, uncolored signal.
The applications of the algorithm are speech reproduction systems in which greater loudness of speech is preferred over perfect reproduction. Research has been conducted on multi-band loudness processing in hearing aids [1], [2], [3], but less has been done in connection with the nonlinearities of amplifier–speaker systems. The nonlinear speech enhancement algorithms for noisy environments by Gülzow et al. [4] and by Parsons [7] are quite similar to the algorithm presented in this paper. The method proposed here differs from the referred methods in its attempt to preserve as much power as possible in the bands.

2. METHOD

The aim of the proposed algorithm is to lower the amplitude of an incoming speech signal without introducing disturbing artifacts, while maintaining the subjective loudness of the speech. The algorithm works by separating the speech signal into individual frequency bands and soft-clipping these; the soft-clipping is applied in proportion to the amplitudes of the band signals. The band locations and the soft-clipping amplitude limits are chosen based on empirical data. The algorithm can be broken down into 10 sequential processing steps:

1. Time-domain signal segmentation into frames overlapping by 50%.
2. Frequency-domain division of the signal into 12 adjacent bands.
3. Scaling of the band-pass signals by a fixed amount.
4. Band-pass signal amplitude estimation.
5. Selection of a combination of bands and an amplitude limit for soft-clipping.
6. Smoothing of the individual bands' amplitude limits with respect to time.
7. Constraining the amplitudes of the band-pass signals by soft-clipping.
8. Signal resynthesis from the processed band-pass signals.
9. Hanning-windowed overlap-add of the processed frames.
10. Soft-clipping of the composite signal.

2.1. Algorithm implementation

The individual processing steps are described in detail in the following sections.

2.1.1. Time-domain segmentation

The input signal is first segmented into short frames, typically 20–50 ms in duration and overlapping by 50%. Given an input signal x[n], n \in \mathbb{N}, the frames are defined by

    w_f[n] = x\left[\frac{fN}{2} + n\right], \quad n \in [0, N-1],    (1)

where f \in \mathbb{N} is the frame number, N (even) is the length of a frame, and w_f[n] denotes the signal in frame f.

2.1.2. Frequency-domain division

Each signal frame is divided into 12 non-overlapping frequency-domain bands. The division is carried out using several band-pass linear-phase FIR filters in parallel [6, p. 21]:

    v_b^f[n] = \sum_{k=-\infty}^{\infty} w_f[k]\, h_b[n-k],    (2)

where v_b^f[n] is the band-pass signal and h_b[n] is the impulse response of the analysis filter on band b \in [1, 12]. No windowing is done, and the filter state is passed from frame to frame. Figure 1 illustrates the frequency-domain analysis and synthesis operations. Decimation is not performed, in order to preserve the relationship of the band-pass signal amplitudes to the overall amplitude at all times. Figure 2 illustrates the locations of the bands by showing the magnitude responses of the band-pass analysis filters H_b(z).

[Figure 1: Frequency-domain analysis and synthesis: the frame signal w_f(n) is filtered in parallel by the analysis filters H_1(z), ..., H_12(z); the band signals v_1^f(n), ..., v_12^f(n) are processed and summed into u_f(n).]

[Figure 2: Analysis filter magnitude responses, magnitude in dB versus frequency from 0 Hz to 4000 Hz.]

2.1.3. Non-adaptive scaling

Based on the frequency response of the target system, the band-pass signals may be attenuated by a fixed amount prior to the adaptive processing. In our implementation the lowest band, from 0 Hz to 200 Hz, is cleared and the other bands are left unchanged. This is done because of the lack of low-frequency response in the prototype device.

2.1.4. Amplitude estimation

The amplitudes of the band-pass signals are estimated by taking the maximum absolute values of the band-pass signals:

    a_b^f = \max_n \left| v_b^f[n] \right|, \quad b \in [1, 12],    (3)

where a^f denotes the vector of amplitudes of the band-pass signals in frame f.

2.1.5. Band selection

A number of adjacent bands¹ is selected for soft-clipping; the nonlinear processing is applied only to the selected bands. The selection is based on the band-pass signal amplitude estimates a^f and a matrix G of allowable attenuation coefficients for each combination of adjacent bands (figure 4). For example, the element G_{37} = 0.45 (−7 dB) implies that the three bands starting from the 7th band can be attenuated by this factor without the attenuation being disturbing. G and a^f are used to build another matrix B, which contains an estimate of the amplitude of the processed and resynthesized signal for each band combination. For a given band combination, this estimate is calculated by summing the amplitude estimates of the bands in question, scaling this total amplitude according to the G matrix, and adding the amplitude estimates of the rest of the bands:

    B_{mn} = G_{mn} \sum_{b \in [n,\, n+m-1]} a_b^f + \sum_{b \notin [n,\, n+m-1]} a_b^f, \quad m \in [1, 5],\ n \in [1, 12].    (4)

The band combination actually used in soft-clipping is then selected from the B matrix according to the following rule: choose the element that provides sufficient attenuation with the minimum number of bands². The reasoning behind this rule is to keep the bandwidth of the modifications as small as possible. The selected combination of bands is described with two variables, i \in [1, 5] and j \in [1, 12]. These are indices into both G and B, and they directly determine the range of selected bands b_s \in [j, j+i-1] and the attenuation coefficient G_{ij} \in [0, 1). A code sketch of this selection procedure is given below.

¹ Using a set of adjacent fixed-width bands corresponds to using a single variable-width band; this is done for simplicity and to limit the number of dimensions in the G matrix.
² If no such element exists, \min_{ij} B_{ij}, i.e. the element providing the most attenuation, is chosen.
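To make the selection rule concrete, the following minimal NumPy sketch builds B per eq. (4) and applies the rule, including the fallback of footnote 2. The function name, the use of infinity to mark unusable combinations, and the tie-break among equally wide sufficient candidates (we keep the one with the least excess attenuation) are our assumptions, not details given in the paper.

```python
import numpy as np

def select_bands(G, a, limit):
    """Sketch of section 2.1.5: build B per eq. (4), then choose (i, j).

    G     -- 5x12 array of allowable attenuation coefficients G[m-1, n-1]
    a     -- length-12 vector of band amplitude estimates a_b^f
    limit -- amplitude the resynthesized frame should stay below

    Returns (i, j), 1-based as in the paper.
    """
    total = a.sum()
    B = np.full(G.shape, np.inf)                # inf marks unusable combinations
    for m in range(1, G.shape[0] + 1):          # number of adjacent bands
        for n in range(1, G.shape[1] - m + 2):  # first band of the run
            run = a[n - 1:n + m - 1].sum()      # amplitude of the selected bands
            # eq. (4): attenuated run plus the untouched remaining bands
            B[m - 1, n - 1] = G[m - 1, n - 1] * run + (total - run)
    for m in range(1, G.shape[0] + 1):          # fewest bands first
        cols = np.flatnonzero(B[m - 1] <= limit)
        if cols.size:
            # Assumed tie-break: least excess attenuation among candidates.
            return m, int(cols[np.argmax(B[m - 1, cols])]) + 1
    # Footnote 2: nothing suffices, take the most attenuating element.
    m, n = np.unravel_index(np.argmin(B), B.shape)
    return int(m) + 1, int(n) + 1
```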
2.1.6. Smoothing of soft-clipping coefficients

After the band combination and the attenuation coefficient for soft-clipping have been selected, the non-smoothed coefficient vector \hat{g}^f is as follows:

    \hat{g}_b^f = \begin{cases} G_{ij}, & \text{if } b \in [j, j+i-1] \\ 1, & \text{otherwise.} \end{cases}    (5)

The individual band coefficients are smoothed between successive signal frames with a first-order IIR filter:

    g_b^f = (1 - \alpha)\, \hat{g}_b^f + \alpha\, g_b^{f-1},    (6)

where

    \alpha = \begin{cases} \alpha_1, & \text{if } \hat{g}_b^f < g_b^{f-1} \\ \alpha_2, & \text{if } \hat{g}_b^f > g_b^{f-1} \\ 1, & \text{if } \hat{g}_b^f = g_b^{f-1}. \end{cases}    (7)

The coefficients \alpha_1 \in [0, 1) and \alpha_2 \in [0, 1) determine the amount of smoothing for the attack and release states, respectively.

2.1.7. Band signal soft-clipping

After the band combination has been chosen, the selected signals are processed with a waveshaper to accomplish soft-clipping. The amplitude limit vector l^f is calculated by multiplying the smoothed soft-clipping attenuation coefficients g^f element by element with the amplitude estimates of the segmented band-pass signals:

    l_b^f = g_b^f\, a_b^f.    (8)

The signals in each frame and each band are soft-clipped individually; the individual samples in a band and a frame are soft-clipped using the same waveshaping function:

    \hat{v}_b^f[n] = s_L\!\left( v_b^f[n] \right), \quad L = l_b^f.    (9)

The soft-clipping waveshaper s_L(x) with hard amplitude limit L \in (0, 1] and soft amplitude limit L_s \in [0, L) is as follows:

    s_L(x) = \begin{cases} k \arctan\frac{|x| - L_s}{k} + L_s, & \text{if } x > L_s \\ x, & \text{if } |x| \le L_s \\ -k \arctan\frac{|x| - L_s}{k} - L_s, & \text{if } x < -L_s, \end{cases}    (10)

with k defined as

    k = \frac{2(L - L_s)}{\pi},    (11)

so that s_L(x) approaches ±L as x \to ±\infty. The waveshaping function s_1(x) is illustrated in figure 3 with different values of the soft amplitude limit L_s. The ratio L_s / L is predetermined by the user of the algorithm.

[Figure 3: The waveshaping function s_1(x) used for soft-clipping, with L_s \in \{0.3, 0.5, 0.7\}.]
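Equations (6)–(11) translate almost line for line into code. The sketch below uses our own function and parameter names, and the attack and release constants are placeholders; in the equal-coefficient case of eq. (7) any smoothing constant gives the same result, so no special case is needed.

```python
import numpy as np

def smooth_coeffs(g_hat, g_prev, alpha1=0.1, alpha2=0.9):
    """Eqs. (6)-(7): first-order IIR smoothing of the band coefficients.
    alpha1 acts on attack (coefficient falling), alpha2 on release.
    When g_hat == g_prev the output equals both, so the branch is moot."""
    alpha = np.where(g_hat < g_prev, alpha1, alpha2)
    return (1.0 - alpha) * g_hat + alpha * g_prev

def soft_clip(x, L, Ls_ratio=0.5):
    """Eqs. (10)-(11): identity below the soft limit Ls, then an arctan
    segment that saturates smoothly toward the hard limit L.
    Requires Ls < L, i.e. Ls_ratio < 1."""
    x = np.asarray(x, dtype=float)
    if L <= 0.0:                       # a zero limit silences the band
        return np.zeros_like(x)
    Ls = Ls_ratio * L                  # the ratio Ls/L is fixed by the user
    k = 2.0 * (L - Ls) / np.pi         # eq. (11): the asymptote is exactly L
    y = x.copy()
    over = np.abs(x) > Ls
    y[over] = np.sign(x[over]) * (k * np.arctan((np.abs(x[over]) - Ls) / k) + Ls)
    return y

def clip_band(v, a, g, Ls_ratio=0.5):
    """Eqs. (8)-(9): per-band amplitude limit l = g * a, then waveshape."""
    return soft_clip(v, g * a, Ls_ratio)
```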
2.1.8. Frequency-domain resynthesis

The processed band-pass signals \hat{v}_b^f[n] are combined by summation:

    u_f[n] = \sum_{b=1}^{12} \hat{v}_b^f[n].    (12)

2.1.9. Time-domain overlap-add

The output signal y[n] is reconstructed from the processed frames u_f[n] by windowing the overlapping frames with the length-N Hanning window \omega[n], n \in [0, N-1] [6, p. 447], and adding the windowed frames:

    y[n] = \omega[j]\, u_{f-1}[j] + \omega[k]\, u_f[k], \quad j = n - \frac{(f-1)N}{2}, \quad k = n - \frac{fN}{2}, \quad f = \left\lfloor \frac{2n}{N} \right\rfloor.    (13)
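Steps 1 and 9 amount to ordinary rectangular framing at a 50% hop followed by Hanning-windowed overlap-add at synthesis; because the Hanning window sums to approximately unity at 50% overlap, unprocessed frames pass through nearly unchanged. A self-contained sketch of eqs. (1) and (13), with the per-frame processing left out and our own function names:

```python
import numpy as np

def segment(x, N):
    """Eq. (1): rectangular length-N frames at a 50% hop,
    w_f[n] = x[fN/2 + n] (no analysis windowing, per section 2.1.2)."""
    hop = N // 2
    return [x[f * hop:f * hop + N] for f in range((len(x) - N) // hop + 1)]

def overlap_add(processed, N, out_len):
    """Eq. (13): Hanning-windowed overlap-add of the processed frames."""
    w = np.hanning(N)
    hop = N // 2
    y = np.zeros(out_len)
    for f, u in enumerate(processed):
        y[f * hop:f * hop + N] += w * u
    return y

# Round trip without processing: y approximates x (up to window edges).
x = np.random.randn(8000)
y = overlap_add(segment(x, 320), 320, len(x))
```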
2.1.10. Wide-band soft-clipping

Finally, the wide-band signal is soft-clipped to the amplitude limit using the waveshaper described in section 2.1.7. This is done to ensure that the output signal stays within the amplitude constraint, since the bandwise soft-clipping may not always be able to produce a sufficiently small amplitude given the options in the G matrix.

2.2. Listening test

The band locations are based on data collected from small-scale listening tests. The aim of the tests was to measure the maximum allowable attenuations that could be applied to speech at different frequencies and bandwidths. The attenuation was considered allowable if the resulting speech did not sound disturbingly colored. The sound was reproduced by a specific device during the tests.

The band locations were chosen so that an individual band under 1000 Hz can be filtered out totally without a disturbing effect on speech signals; this enables the proposed algorithm to remove a single band entirely if desired. The lowest band was selected to contain the frequencies from 0 Hz to 200 Hz in order to gather together the frequencies the prototype device cannot reproduce. The bandwidth used is therefore 200 Hz–4000 Hz; for comparison, the traditional telephony band is 300 Hz–3400 Hz. The G matrix (figure 4) used in the calculation of the bandwise amplitude limits was constructed according to the listening tests with the prototype device.

Figure 4: The G matrix converted to decibels: allowable attenuations [dB] versus frequency and bandwidth.

    Number of bands
      5 |  −5   −8   −8   −6   −6   −5   −5   −2
      4 |  −7   −9   −9   −7   −7   −6   −6   −4   −4
      3 |  −7  −10  −13   −7   −7   −6   −7   −6   −6   −4
      2 | −10  −12  −13  −13   −9   −6   −8   −9   −8   −6   −4
      1 |  −∞   −∞   −∞   −∞   −∞   −∞   −∞   −∞   −∞   −8   −8   −6
        +-----------------------------------------------------------
           1    2    3    4    5    6    7    8    9   10   11   12
                                Initial band

3. PERFORMANCE

Figure 5 illustrates the results of a computational performance evaluation of the proposed algorithm. We processed an 18.1-second male and a 19.5-second female speech signal with the proposed algorithm; for comparison, we also scaled the original speech signals to meet the amplitude constraint. The loudness of the processed and the scaled signals is shown in figure 5.

[Figure 5: Performance of the proposed algorithm in terms of preserving subjective loudness: subjective loudness [sone] versus amplitude reduction [dB] for the processed and scaled male and female speech signals.]

It may turn out that the bandwise soft-clipping does not result in sufficiently large attenuation, since the G matrix only specifies options for non-disturbing processing. In such cases the soft-clipping of the wide-band composite signal ensures that the output signal is constrained. This happened in over 50% of the processed frames in our performance evaluation tests when the amplitude reduction was over 20 dB. Therefore, if amplitude reductions greater than 20 dB were used in our prototype system, the proposed algorithm would increasingly resemble a plain time-domain waveshaper as the amplitude constraint was hardened.

Based on this and the computational loudness values in figure 5, we assert that the algorithm should be used to constrain the amplitude of speech signals to a level at most 15 dB below the amplitude of the original signal. In that range the loudness of the speech signals was observed to soften by approximately 6 sones under the proposed algorithm, whereas the scaled speech signals softened by 15 sones.

On the other hand, it may also turn out that the soft-clipping choice according to B results in excess attenuation, since the options specified in the G matrix are on the boundary of being disturbing. It is worth noting that the values of the G matrix and the loudness values are strictly specific to the prototype device used in our experiments.

The band-pass signal amplitude estimates a^f, when defined as maximum absolute values, are worst-case estimates in the sense that the sum of the estimates for a given frame is always greater than or equal to the actual amplitude of the wide-band signal. As a result, the band selection mechanism is likely to select a stronger attenuation than necessary.

4. CONCLUSIONS

An adaptive multi-band soft-clipping algorithm for speech processing was presented. The adaptivity is based on empirical psychoacoustic data. The algorithm suppresses the amplitude of a speech signal while trying to preserve its loudness as much as possible. The speech is distorted during the processing, but the artifacts are generally not disturbing. The loudness was observed to diminish only slightly during the processing, provided that the signal amplitude was reduced by no more than 15 dB.

5. REFERENCES

[1] Allen, J.B., "Recruitment Compensation as a Hearing Aid Signal Processing Strategy." Proc. IEEE ISCAS '98, Vol. 6, pp. 565–568, 1998.
[2] Chabries, D.M., Anderson, D.V., Stockham, T.G., Jr. and Christiansen, R.W., "Application of a Human Auditory Model to Loudness Perception and Hearing Compensation." Proc. IEEE ICASSP '95, Vol. 5, pp. 3527–3530, 1995.
[3] Fröhlich, T. and Dillier, N., "DSP Implementation of a Multiband Loudness Correction Hearing Aid." Proc. IEEE Int. Conf. Eng. Med. Biol. Soc. '91, Vol. 13, pp. 1889–1890, 1991.
[4] Gülzow, T., Engelsberg, A. and Heute, U., "Comparison of a Discrete Wavelet Transformation and a Nonuniform Polyphase Filterbank Applied to Spectral-Subtraction Speech Enhancement." Signal Processing, Vol. 64, pp. 5–19, 1998.
[5] Moore, B.C.J., Glasberg, B.R. and Baer, T., "A Model for the Prediction of Thresholds, Loudness, and Partial Loudness." J. Audio Eng. Soc., Vol. 45, No. 4, pp. 224–240, 1997.
[6] Oppenheim, A.V. and Schafer, R.W., "Discrete-Time Signal Processing." Prentice-Hall, 1989. ISBN 0-13-216771-9.
[7] Parsons, T.W., "Voice and Speech Processing." McGraw-Hill, pp. 348–350, 1987. ISBN 0-07-048541-0.