
Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999
MAXIMIZATION OF THE SUBJECTIVE LOUDNESS OF SPEECH
WITH CONSTRAINED AMPLITUDE
Jarno Seppänen, Sami Kananoja, Jari Yli-Hietanen, Konsta Koppinen and Jari Sjöberg
Tampere University of Technology
Signal Processing Laboratory
P.O.Box 553, FIN-33101 Tampere, Finland
Nokia Research Center
Speech and Audio Systems Laboratory
P.O.Box 100, FIN-33721 Tampere, Finland
{jarno.seppanen, sami.kananoja, jari.yli-hietanen,
konsta.koppinen}@cs.tut.fi and jari.sjoberg@research.nokia.com
ABSTRACT
We introduce an adaptive algorithm for constraining the amplitude of speech signals while attempting to maintain their subjective loudness and to avoid producing disturbing artifacts. The algorithm can be applied to compensate for the clipping distortion of amplifiers in speech reproduction devices.

The algorithm analyzes the speech signal on multiple frequency bands and applies an internal audibility law in order to make inaudible changes to the signal. An example of the audibility law, presented in the form of a matrix and associated with a specific speech reproduction device, is described.

The band-pass signals are processed with a waveshaper to accomplish soft-clipping and to constrain the amplitude of the processed signal.

When processed with the proposed algorithm, the computational loudness of speech signals was found to diminish only slightly (approximately 6 sones), while at the same time the signal amplitude could be reduced by as much as 15 dB.
1. INTRODUCTION

The compensation of nonlinearities inherent in amplifier–speaker systems is becoming practical due to the performance of modern digital signal processors. In this paper we introduce an algorithm for the enhancement of speech via compensation of the most notable nonlinearity of amplifier–speaker systems, the clipping distortion of the amplifier.

Since clipping distortion is a non-invertible process, the signal amplitude must be constrained before the signal enters the amplifier. However, if the incoming speech signal were merely scaled to satisfy this requirement, the resulting speech would sound softer than the unscaled, clipped speech. This is undesirable especially in noisy conditions, where louder, slightly colored speech is often preferable to a quieter, uncolored signal. The intended applications of the algorithm are speech reproduction systems in which greater loudness of speech is preferred over perfect reproduction.

A variety of research has been conducted on multi-band loudness processing in hearing aids [1], [2], [3], but less has been done in connection with the nonlinearities of amplifier–speaker systems. The nonlinear speech enhancement algorithms for noisy environments by Gülzow et al. [4] and in Parsons [7] are quite similar to the algorithm presented in this paper. The method proposed here differs from those methods in its attempt to preserve as much power as possible in the bands.

2. METHOD

The aim of the proposed algorithm is to lower the amplitude of an incoming speech signal without introducing disturbing artifacts, maintaining the subjective loudness of the speech while doing so.

The algorithm works by separating the speech signal into individual frequency bands and soft-clipping these. The soft-clipping is done in proportion to the amplitudes of the band signals. The band locations and the soft-clipping amplitude limits are chosen based on empirical data.

The algorithm can be broken down into 10 sequential processing steps as follows:

1. Time-domain segmentation of the signal into frames overlapping by 50%.
2. Frequency-domain division of the signal into 12 adjacent bands.
3. Scaling of the band-pass signals by a fixed amount.
4. Band-pass signal amplitude estimation.
5. Selection of a combination of bands and an amplitude limit for soft-clipping.
6. Smoothing of the individual bands' amplitude limits with respect to time.
7. Constraining the amplitudes of the band-pass signals by soft-clipping.
8. Signal resynthesis from the processed band-pass signals.
9. Hanning-windowed overlap-add of the processed frames.
10. Soft-clipping of the composite signal.
2.1. Algorithm implementation
The individual processing steps are described in detail in the following sections.
2.1.1. Time-domain segmentation
The input signal is first segmented into short frames. The frames
are typically 20–50 ms in duration and overlap by 50%.
Given an input signal $x[n]$, $n \in \mathbb{N}$, the frames are defined by

$w_f[n] = x\left[\frac{fN}{2} + n\right], \quad n \in [0, N-1], \quad (1)$

where $f \in \mathbb{N}$ is the frame number, $N$ (even) is the length of a frame and $w_f[n]$ denotes the signal in frame $f$.
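As a concrete illustration, here is a minimal numpy sketch of the segmentation in eq. (1). The frame length N = 256 is an assumption chosen for illustration; the paper only specifies 20–50 ms frames with 50% overlap.

    import numpy as np

    def frames(x, N=256):
        """Yield the frames w_f[n] = x[f*N/2 + n], n in [0, N-1]."""
        hop = N // 2                      # 50% overlap
        for f in range((len(x) - N) // hop + 1):
            yield x[f * hop : f * hop + N]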
[Figure 1 (diagram): Frequency-domain analysis and synthesis. The frame signal $w_f[n]$ is fed in parallel through the band-pass filters $H_1(z), H_2(z), \ldots, H_{12}(z)$; the resulting band signals $v_1^f[n], \ldots, v_{12}^f[n]$ are processed and then summed into the output $u_f[n]$.]

[Figure 2 (plot): Analysis filter magnitude responses; axes: frequency [Hz] from 0 to 4000, magnitude [dB].]

2.1.2. Frequency-domain division

Each signal frame is divided into 12 non-overlapping frequency-domain bands. The division is carried out by using several band-pass linear-phase FIR filters in parallel [6, p. 21]:

$v_b^f[n] = \sum_{k=-\infty}^{\infty} w_f[k]\, h_b[n-k], \quad (2)$

where $v_b^f[n]$ is the band-pass signal and $h_b[n]$ is the impulse response of the analysis filter on band $b \in [1, 12]$. No windowing is done, and the filter state is passed from frame to frame. Figure 1 illustrates the frequency-domain analysis and synthesis operations. Decimation is not done, in order to preserve the relationship of the band-pass signal amplitudes to the overall amplitude at all times.

Figure 2 illustrates the locations of the bands. The figure shows the magnitude responses of the band-pass analysis filters $H_b(z)$.
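The analysis stage can be sketched as follows. This is not the authors' implementation: the band edges, the filter length and the 8 kHz sampling rate are assumed values, since the paper specifies only the 12-band layout over 0–4000 Hz with a 0–200 Hz lowest band.

    import numpy as np
    from scipy.signal import firwin, lfilter

    FS = 8000                  # assumed sampling rate [Hz]
    EDGES = [1, 200, 400, 600, 800, 1000, 1400, 1800,
             2200, 2600, 3000, 3500, 3999]   # assumed band edges [Hz]
    NTAPS = 129                # linear-phase FIR length (assumed)

    # One band-pass linear-phase FIR filter h_b per band.
    filters = [firwin(NTAPS, [lo, hi], pass_zero=False, fs=FS)
               for lo, hi in zip(EDGES[:-1], EDGES[1:])]

    def analyze(frame, states):
        """Split one frame into 12 band signals, eq. (2); the filter
        state is carried across frames (no windowing, no decimation)."""
        bands, new_states = [], []
        for h, zi in zip(filters, states):
            v, zf = lfilter(h, [1.0], frame, zi=zi)
            bands.append(v)
            new_states.append(zf)
        return np.array(bands), new_states

    states = [np.zeros(NTAPS - 1) for _ in filters]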
2.1.3. Non-adaptive scaling

Based on the frequency response of the target system, the band-pass signals may be attenuated by a fixed amount prior to the adaptive processing. In our implementation, the lowest band, from 0 Hz to 200 Hz, is cleared and the other bands are left unchanged. This is done due to the lack of low-frequency response in the prototype device.
2.1.4. Amplitude estimation

The amplitudes of the band-pass signals are estimated by taking the maximum absolute values of the band-pass signals:

$a_b^f = \max_n \left| v_b^f[n] \right|, \quad b \in [1, 12], \quad (3)$

where $a^f$ denotes the vector of amplitudes of the band-pass signals in frame $f$.
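In code, eq. (3) is a one-line reduction over the band signals; `bands` below refers to the 12-row array produced by the analysis sketch above.

    import numpy as np

    def band_amplitudes(bands):
        """a_b^f = max_n |v_b^f[n]| for each band b."""
        return np.max(np.abs(bands), axis=1)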
2.1.5. Band selection

A number of adjacent bands¹ is selected for soft-clipping. The nonlinear processing is performed only on the selected bands. The selection is based on the band-pass signal amplitude estimates $a^f$ and a matrix $G$ of allowable attenuation coefficients for each combination of adjacent bands (figure 4). For example, the $G_{37}$ element implies that the three bands starting from the 7th band can be attenuated by $G_{37} = 0.45$ (−7 dB) without the attenuation being disturbing.

$G$ and $a^f$ are used to build another matrix $B$, which contains estimates of the amplitude of the processed and resynthesized signal for each band combination. For a band combination, this estimate is calculated by summing the amplitude estimates of the bands in question, scaling this total amplitude according to the $G$ matrix, and summing the result with the amplitude estimates of the rest of the bands:

$B_{mn} = G_{mn} \sum_{b \in [n, n+m-1]} a_b^f + \sum_{b \notin [n, n+m-1]} a_b^f, \quad m \in [1, 5], \; n \in [1, 12]. \quad (4)$

The actual band combination used in soft-clipping is then selected based on the $B$ matrix according to the following rule: choose the element that provides sufficient attenuation with the minimum number of bands². The reasoning behind this rule is to try to lessen the bandwidth of the modifications; a sketch of the rule is given after the footnotes below.

The selected combination of bands is described with two variables: $i \in [1, 5]$ and $j \in [1, 12]$. These are indices into both $G$ and $B$, and they directly determine the range of selected bands $b_s \in [j, j+i-1]$ and the attenuation coefficient $G_{ij} \in [0, 1)$.

¹ Using a set of adjacent fixed-width bands corresponds to using a single variable-width band; this is done for simplicity and to limit the number of dimensions in the $G$ matrix.
² If no such element exists, the minimum of $B$, i.e. the element providing the most attenuation, is chosen.
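The selection rule can be read as a greedy scan, sketched here under stated assumptions: $G$ is a 5 × 12 array (invalid combinations set to 1.0, i.e. no attenuation allowed), `a` holds the 12 amplitude estimates of eq. (3), and `limit` is the target wide-band amplitude. The scan order among equally narrow combinations is an assumption, as the paper does not specify a tie-break.

    import numpy as np

    def select_bands(G, a, limit):
        """Return (i, j), i.e. the number of bands and the initial band:
        the fewest adjacent bands whose allowed attenuation suffices per
        eq. (4); otherwise the most attenuating element (footnote 2)."""
        best = None                          # overall minimum of B
        for m in range(1, 6):                # number of adjacent bands
            for n in range(1, 14 - m):       # initial band (1-based)
                sel = slice(n - 1, n - 1 + m)
                B_mn = (G[m - 1, n - 1] * a[sel].sum()
                        + a.sum() - a[sel].sum())
                if best is None or B_mn < best[0]:
                    best = (B_mn, m, n)
                if B_mn <= limit:            # sufficient, fewest bands
                    return m, n
        return best[1], best[2]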
2.1.6. Smoothing of soft-clipping coefficients

After selecting the band combination and the attenuation coefficient for soft-clipping, the non-smoothed coefficient vector $\hat{g}^f$ is as follows:

$\hat{g}_b^f = \begin{cases} G_{ij}, & \text{if } b \in [j, j+i-1] \\ 1, & \text{otherwise.} \end{cases} \quad (5)$
The individual band coefficients are smoothed between successive signal frames with a first-order IIR filter:

$g_b^f = (1 - \lambda)\, \hat{g}_b^f + \lambda\, g_b^{f-1}, \quad (6)$

where

$\lambda = \begin{cases} \lambda_1, & \text{if } \hat{g}_b^f < g_b^{f-1} \\ \lambda_2, & \text{if } \hat{g}_b^f > g_b^{f-1} \\ 0, & \text{if } \hat{g}_b^f = g_b^{f-1}. \end{cases} \quad (7)$

The $\lambda_1 \in [0, 1)$ and $\lambda_2 \in [0, 1)$ coefficients are used to determine the smoothing amount for the attack and release states, respectively.
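A compact sketch of eqs. (5)–(7) follows; the λ values are illustrative, since the paper does not report the coefficients used in the prototype.

    import numpy as np

    def smooth_coeffs(g_prev, G_ij, i, j, lam1=0.2, lam2=0.9):
        """One frame of attack/release smoothing of the per-band
        soft-clipping coefficient vector."""
        g_hat = np.ones(12)
        g_hat[j - 1 : j - 1 + i] = G_ij            # eq. (5), 1-based j
        lam = np.where(g_hat < g_prev, lam1,       # attack
              np.where(g_hat > g_prev, lam2, 0.0)) # release / unchanged
        return (1.0 - lam) * g_hat + lam * g_prev  # eqs. (6)-(7)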
2.1.7. Band signal soft-clipping

After having chosen the band combination for processing, the selected signals are processed with a waveshaper to accomplish soft-clipping.

The amplitude limit vector $l^f$ is calculated by multiplying the smoothed soft-clipping attenuation coefficients $g^f$ with the amplitude values of the segmented band-pass signals one by one:

$l_b^f = g_b^f\, a_b^f. \quad (8)$

The signals in each frame and each band are soft-clipped individually. The individual samples in a band and a frame are soft-clipped using the same waveshaping function:

$\hat{v}_b^f[n] = s_L\!\left(v_b^f[n]\right), \quad L = l_b^f. \quad (9)$

The soft-clipping waveshaper function $s_L(x)$ with hard amplitude limit $L \in (0, \infty)$ and soft amplitude limit $L_s \in [0, L)$ is as follows:

$s_L(x) = \begin{cases} -k \arctan\frac{|x| - L_s}{k} - L_s, & \text{if } x < -L_s \\ x, & \text{if } |x| \le L_s \\ k \arctan\frac{|x| - L_s}{k} + L_s, & \text{if } x > L_s, \end{cases} \quad (10)$

with $k$ defined as

$k = \frac{2(L - L_s)}{\pi}. \quad (11)$

The waveshaping function $s_1(x)$ is illustrated in figure 3 with different values of the soft amplitude limit $L_s$. The ratio $L_s / L$ is predetermined by the user of the algorithm.
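Eqs. (10)–(11) translate directly into a vectorized waveshaper. The $L_s/L$ ratio of 0.5 below is an arbitrary example of the user-chosen constant.

    import numpy as np

    def soft_clip(x, L, ratio=0.5):
        """Waveshaper s_L(x) of eq. (10) with soft limit Ls = ratio*L."""
        Ls = ratio * L
        k = 2.0 * (L - Ls) / np.pi            # eq. (11)
        shaped = k * np.arctan((np.abs(x) - Ls) / k) + Ls
        # np.sign covers both saturating branches of eq. (10).
        return np.where(np.abs(x) <= Ls, x, np.sign(x) * shaped)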
[Figure 3 (plot): The waveshaping function $s_1(x)$ used for soft-clipping, with $L_s = \{0.3, 0.5, 0.7\}$.]

2.1.8. Frequency-domain resynthesis

The processed band-pass signals $\hat{v}_b^f[n]$ are combined with a summation,

$u_f[n] = \sum_{b=1}^{12} \hat{v}_b^f[n]. \quad (12)$
2.1.9. Time-domain overlap-add

The output signal $y[n]$ is reconstructed from the processed frames $u_f[n]$ by windowing the overlapping frames with the length-$N$ Hanning window function [6, p. 447] $\omega[n]$, $n \in [0, N-1]$, and adding the windowed frames:

$y[n] = \omega[j]\, u_{f-1}[j] + \omega[k]\, u_f[k], \quad f = \left\lfloor \frac{2n}{N} \right\rfloor, \; j = n - \frac{(f-1)N}{2}, \; k = n - \frac{fN}{2}. \quad (13)$
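A sketch of the resynthesis and overlap-add of eqs. (12)–(13); it assumes `frames` already holds the band-summed signals $u_f[n]$ from eq. (12).

    import numpy as np

    def overlap_add(frames, N):
        """Reconstruct y[n] from the processed frames u_f[n]."""
        w = np.hanning(N)                 # length-N Hanning window
        hop = N // 2                      # 50% overlap
        y = np.zeros(hop * (len(frames) - 1) + N)
        for f, u in enumerate(frames):
            y[f * hop : f * hop + N] += w * u
        return y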
2.1.10. Wide-band soft-clipping

The wide-band signal is soft-clipped to the amplitude limit using the waveshaper described in section 2.1.7. This is done to ensure that the output signal stays within the amplitude constraint, since the bandwise soft-clipping may not always be able to provide a sufficiently small amplitude due to the $G$ matrix.

2.2. Listening test

The band locations are based on data collected from small-scale listening tests. The aim of the tests was to measure the maximum allowable attenuations which could be applied to speech at different frequencies and bandwidths. The attenuation was considered allowable if the resulting speech did not sound disturbingly colored. The sound was reproduced by a specific device during the tests.

The band locations were chosen so that an individual band under 1000 Hz can be filtered out totally without a disturbing effect on speech signals. This enables the proposed algorithm to filter out a single band if desired.

The lowest band was selected to contain the frequencies from 0 Hz to 200 Hz in order to group together the frequencies the prototype device cannot reproduce. The bandwidth used is therefore 200 Hz–4000 Hz; for comparison, the traditional telephony band is 300 Hz–3400 Hz.

The $G$ matrix (figure 4) used in the calculation of the bandwise amplitude limits has been constructed according to the listening tests with the prototype device.
3. PERFORMANCE
Figure 5 illustrates the results of a computational performance evaluation of the proposed algorithm. We processed an 18.1-second male and a 19.5-second female speech signal with the proposed algorithm. For comparison, we also scaled the original speech signals to meet the amplitude constraint. The loudness of the processed and the scaled signals is shown in figure 5.
Figure 4: The $G$ matrix converted to decibels: allowable attenuations versus frequency and bandwidth (rows: number of bands $m$; columns: initial band $n = 1 \ldots 12$):

m = 5:  −5  −8  −8  −6  −6  −5  −5  −2
m = 4:  −7  −9  −9  −7  −7  −6  −6  −4  −4
m = 3:  −7 −10 −13  −7  −7  −6  −7  −6  −6  −4
m = 2: −10 −12 −13 −13  −9  −6  −8  −9  −8  −6  −4
m = 1:  −∞  −∞  −∞  −∞  −∞  −∞  −∞  −∞  −∞  −8  −8  −6

[Figure 5 (plot): Performance of the proposed algorithm in terms of preserving subjective loudness; axes: amplitude reduction [dB] from 0 to 30, subjective loudness [sone] from 30 to 70; curves for the processed and the merely scaled versions of the male and female speech signals.]
It may turn out that the bandwise soft-clipping does not result in a sufficiently large attenuation, since the $G$ matrix only specifies options for non-disturbing processing. In such cases the soft-clipping of the wide-band composite signal ensures that the output signal is constrained. This happened in over 50% of the processed frames in our performance evaluation tests when the amplitude reduction was over 20 dB. Therefore, if amplitude reductions greater than 20 dB were used in our prototype system, the proposed algorithm would increasingly resemble a time-domain waveshaper as the amplitude constraint was tightened.
Based on this and the computational loudness values in figure 5, we assert that the algorithm should be used to constrain the amplitude of speech signals to a level 15 dB below the amplitude of the original signal. In this case the loudness of the speech signals was observed to soften by approximately 6 sones when processed with the proposed algorithm, while the scaled speech signals were softened by 15 sones.
On the other hand, it may also turn out that the soft-clipping choice according to $G$ results in excess attenuation, since the options specified in the $G$ matrix are on the boundary of being disturbing.

It is worth noting that the values of the $G$ matrix and the loudness values are strictly specific to the prototype device used in our experiments.

The band-pass signal amplitude estimates $a^f$, when defined as the maximum absolute values, are worst-case estimates in the sense that the sum of the estimates for a given frame will always be greater than or equal to the actual amplitude of the wide-band signal. As a result, the band selection mechanism will likely select a stronger attenuation than necessary.
4. CONCLUSIONS
An adaptive multi-band soft-clipping algorithm for speech processing was presented. The adaptivity is based on empirical psychoacoustical data. The algorithm suppresses the amplitude of a speech
signal while trying to preserve its loudness as much as possible.
The speech is distorted during the processing, but the artifacts are
generally not disturbing.
The loudness was observed to diminish only slightly during the processing, provided that the signal amplitude was reduced by less than 15 dB.
5. REFERENCES
[1] Allen, J.B., "Recruitment Compensation as a Hearing Aid Signal Processing Strategy." Proc. IEEE ISCAS '98, Vol. 6, pp. 565–568, 1998.
[2] Chabries, D.M., Anderson, D.V., Stockham, T.G., Jr. and Christiansen, R.W., "Application of a Human Auditory Model to Loudness Perception and Hearing Compensation." Proc. IEEE ICASSP '95, Vol. 5, pp. 3527–3530, 1995.
[3] Fröhlich, T. and Dillier, N., "DSP-Implementation of a Multiband Loudness Correction Hearing Aid." Proc. IEEE Int. Conf. Eng. Med. Bio. Soc. '91, Vol. 13, pp. 1889–1890, 1991.
[4] Gülzow, T., Engelsberg, A. and Heute, U., "Comparison of a Discrete Wavelet Transformation and a Nonuniform Polyphase Filterbank Applied to Spectral-Subtraction Speech Enhancement." Signal Processing, Vol. 64, pp. 5–19, 1998.
[5] Moore, B.C.J., Glasberg, B.R. and Baer, T., "A Model for the Prediction of Thresholds, Loudness, and Partial Loudness." J. Audio Eng. Soc., Vol. 45, No. 4, pp. 224–240, 1997.
[6] Oppenheim, A.V. and Schafer, R.W., "Discrete-Time Signal Processing." Prentice-Hall, Inc., ISBN 0-13-216771-9, 1989.
[7] Parsons, T.W., "Voice and Speech Processing." McGraw-Hill, Inc., ISBN 0-07-048541-0, pp. 348–350, 1987.