NEW SONORITIES FOR EARLY JAZZ RECORDINGS USING SOUND
SOURCE SEPARATION AND AUTOMATIC MIXING TOOLS
Daniel Matz
University of Applied Sciences
Düsseldorf, Germany
daniel.matz@hotmail.de
Estefanía Cano
Fraunhofer IDMT
Ilmenau, Germany
cano@idmt.fhg.de
ABSTRACT
In this paper, a framework for automatic mixing of early
jazz recordings is presented. In particular, we propose
the use of sound source separation techniques as a preprocessing step of the mixing process. In addition to an initial solo and accompaniment separation step, the proposed
mixing framework is composed of six processing blocks:
harmonic-percussive separation (HPS), cross-adaptive multi-track scaling (CAMTS), cross-adaptive equalizer
(CAEQ), cross-adaptive dynamic spectral panning
(CADSP), automatic excitation (AE), and time-frequency
selective panning (TFSP). The effects of the different processing steps on the final quality of the mix are evaluated
through a listening test procedure. The results show that
the desired quality improvements in terms of sound balance, transparency, stereo impression, timbre, and overall
impression can be achieved with the proposed framework.
1. INTRODUCTION
When early jazz recordings are analyzed from a modern
audio engineering perspective, clear stylistic differences
can be identified with respect to modern recording techniques. These characteristics mainly reflect the technological and stylistic differences between the two eras. For
example, solo instruments such as the saxophone or the
trumpet often completely dominate the audio mix in early
jazz recordings. At the same time, the rhythm section, i.e., double bass, piano, drums, and percussion, often takes a secondary role, recorded or mixed at much lower intensity and often perceived as unclear and undifferentiated.
Additionally, from today’s perspective, early jazz recordings often present an unusual stereo image. Instrument
groups are sometimes assigned to extreme stereo positions
which can cause the solo instrument to be panned to the
left and the accompaniment band panned to the right. As a
consequence, the energy distribution over the stereo width
is unbalanced and is often perceived today as irritating and disturbing, especially when listened to through headphones.
© Daniel Matz, Estefanía Cano, Jakob Abeßer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Matz, Estefanía Cano, Jakob Abeßer. "New sonorities for early jazz recordings using sound source separation and automatic mixing tools", 16th International Society for Music Information Retrieval Conference, 2015.
Jakob Abeßer
Fraunhofer IDMT
Ilmenau, Germany
abr@idmt.fhg.de
Several initiatives have arisen that attempt to give such
early recordings a more modern sonority. Remastering and
Automatic Mixing (AM) techniques offer various methods for a sonic redesign of such recordings. However,
given that the original individual stems of the instruments
in the recordings are usually not available, these techniques
can only achieve minor modifications to the sound characteristics of mono and stereo mixtures. In-depth remixing
usually requires the original multi-track recordings to be
available. For this purpose, sound source separation methods developed in the Music Information Retrieval (MIR)
community can be useful tools to retrieve individual instruments from a given mix.
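One separation step named in the abstract, harmonic-percussive separation (HPS), is commonly implemented by median filtering the magnitude spectrogram. The paper does not specify its exact HPS algorithm; the following is a minimal sketch of the median-filtering approach, with a made-up toy spectrogram:

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_masks(S, kernel=17):
    """Median-filtering harmonic-percussive separation sketch.

    S: magnitude spectrogram, shape (freq_bins, time_frames).
    Harmonic energy forms horizontal ridges (smooth in time);
    percussive energy forms vertical ridges (smooth in frequency).
    Returns soft masks (H_mask, P_mask) that sum to 1 where S has energy.
    """
    harm = median_filter(S, size=(1, kernel))   # smooth along time
    perc = median_filter(S, size=(kernel, 1))   # smooth along frequency
    total = harm + perc + 1e-12                 # avoid division by zero
    return harm / total, perc / total

# Toy spectrogram: a sustained tone (horizontal line) plus a drum hit (vertical line)
S = np.zeros((64, 64))
S[20, :] = 1.0   # harmonic component
S[:, 40] = 1.0   # percussive component
Hm, Pm = hpss_masks(S)
```

Multiplying the mixture spectrogram by each mask and inverting would yield the harmonic (solo-friendly) and percussive (rhythm-section) layers.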
2. GOALS
The main goal of this study is to identify suitable signal
processing methods to modify the above-mentioned characteristics in a selection of early jazz recordings. These
methods are combined in a fully automatic mixing framework. In particular, we focus on modifying the audio mix
in terms of transparency, stereo impression, frequency response, and acoustic balance in order to improve the overall perception of sound and the quality of the mix with respect to the original recording.
Our main approach for remixing is to modify the characteristics of the backing track to make it more present in the mix. We also aim at improving the acoustical and spatial balance of the audio mix. The solo signal is balanced
with respect to its loudness and spectral energy to minimize spectral masking as well as to improve its position in
the stereo image.
3. RELATED WORK
In the field of automatic mixing, several approaches have
been presented in the literature. In [1], a method is proposed to automatically adjust gain and equalizer parameters for multi-track recordings using a least-squares optimization. In [12], the idea of modifying the magnitude spectrogram of a signal towards a target spectrogram, called target mixing, is presented. Other approaches for automatic mixing of multi-track recordings have incorporated
machine learning algorithms to perform the mixing process [16, 17].
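As a toy illustration of a least-squares formulation in the spirit of [1] (this is not the cited authors' actual method; the band-energy matrix and target vector below are made up), per-track gains can be solved so that the summed band energies of the mix approach a target:

```python
import numpy as np

# Hypothetical setup: M tracks, B frequency bands.
# A[b, m] = band-b energy contributed by track m at unit gain.
# t[b]    = desired mix energy in band b.
rng = np.random.default_rng(0)
M, B = 4, 8
A = rng.uniform(0.1, 1.0, size=(B, M))
t = np.full(B, 2.0)

# Least-squares gains: minimize ||A g - t||^2 over the per-track gain vector g.
g, *_ = np.linalg.lstsq(A, t, rcond=None)

residual = np.linalg.norm(A @ g - t)
```

The overdetermined system (B > M) has no exact solution in general, so the least-squares gains simply get the mix as close as possible to the target band energies.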
Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015
[Figure 4.4: Block diagram of the feature extraction of the implemented cross-adaptive equalizer. Each input signal x_m(n) is split by a filter bank with K bands; each band signal h_{k,m}(n) passes through a multi-band adaptive gate that compares RMS(h_{k,m}(n)) against an SPL-calibrated reference RMS(he_k(n)). The gated band signals hg_{k,m}(n) are weighted by coefficients w(SP(n)) from a fixed-SPL measurement and accumulated in a table of loudness values forming a probability mass function, whose mode yields the output: the most probable loudness per band, fv_{k,m}(n).]

Cross-adaptive feature processing consists of mapping the perceptual loudness of each spectral band to its amplitude level so that, by manipulating its gain per band, we can achieve a desired target loudness level. We aim to achieve an average loudness value l(n); therefore, we must decrease the loudness of the equalization bands for signals above this average and increase the band gain for signals below this average. This results in a system in which we have a multiplier cv_{k,m}(n) per band, such that we scale the input bands x_{k,m}(n) in order to achieve a desired average loudness l(n).

Within the cross-adaptive feature processing block, the most probable loudness fv_{k,m}(n) of each frequency band is mapped to a gain factor cv_{k,m}(n) of the corresponding EQ band of each track. In this way, the gain of a specific band can be adjusted so that the band ultimately attains the intended loudness. Following the cross-adaptive processing of the CAMTS, a mean target loudness l(n) is first determined by averaging over all fv_{k,m}(n) across all K frequency bands and M signals:

    l(n) = (1/M) sum_{m=0}^{M-1} (1/K) sum_{k=0}^{K-1} fv_{k,m}(n)

The function of the desired system can be approximated by Hl_{k,m}(n) = l(n) / (cv_{k,m}(n) x_{k,m}(n)), so that the control vector cv_{k,m}(n) is given by

    cv_{k,m}(n) = l(n) / (Hl_{k,m}(n) x_{k,m}(n))

where cv_{k,m}(n) is the control gain variable per band used to achieve the target average loudness l(n), and Hl_{k,m}(n) is the function of the desired system.
this multi-band loudness feature extraction method is shown in Figure 13.11.
Cross-adaptive feature processing
Cross-adaptive Merkmalsverarbeitung
H lk,m (n)xk,m (n)
K 1
f vk,mused
(n)/K
/M the target average loudness
(4.11)
where cvk,m (n) is the controll(n)
gain=variable per band
to achieve
m=0
k=0
l(n) and H lk,m (n) is the function of the desired system. The feature extraction block has a function
M 1
cvk,m (n) =
Cross-adaptive feature processing consists of mapping the perceptual
loudness of
each spectral
Innerhalb
des
CAFPB
geht
es
nun
um
die
Zuordnung
der häufigsten
Lautheit
f vk,mband
(n)
to its amplitude level so that, by manipulating its gain per band level, we can achieve a desired
jedes Frequenzbandes zu einem Verstärkungsfaktor cvk,m (n) des korrespondierenden
target loudness level. We aim to achieve an average loudness value l(n), therefore we must decrease
EQ-Bandes
Tracks. Hierdurch
kannabove
die Verstärkung
eines
spezifischen
the loudness offür
thealle
equalisation
bands for signals
this average and
increase
the band Bandes
gain for
(n) per
signals belowwerden,
this average.
Thisdieses
resultsletztendlich
in a system indie
which
we have a multiplier
beeinflusst
sodass
beabsichtigte
Lautheitcva
zuk,mbesitzt.
band, such that we scale the input bands xk,m (n) in order to achieve a desired average loudness l(n).
Dies
erfolgtofintheAnlehnung
anapproximated
die cross-adaptive
Verarbeitung
wobei
The function
system can be
by H lk,m (n)
= l(n)/(cvk,mdes
(n)xCAMTS,
k,m (n)), where the
control vector
cvk,m (n)eine
is given
by Ziel-Lautheit l(n) mit Hilfe der Mittelwertbildung
zunächst
ebenfalls
mittlere
über alle f vk,m (n) für alle K Frequenzbänderl(n)
und M Signale bestimmt wird:
Bachelorarbeit Daniel Matz
0
1
Probability mass function
0
Output most probable loudness per band: f vk,m (n)
,
(13.18)
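The band-gain rule above can be sketched numerically. This is a minimal illustration, not the implementation of the referenced equalizer: defining the control variable cv as the ratio of band loudness to band amplitude is an assumption, and all names are hypothetical.

```python
import numpy as np

def target_loudness(fv):
    """Mean target loudness l(n), averaging the per-band loudness
    features fv[k, m] over all K bands and M signals (Eq. 13.18)."""
    return fv.mean()

def band_gains(fv, x_bands, eps=1e-12):
    """Per-band equalizer gains H[k, m] = l / (cv[k, m] * x[k, m]).
    Defining cv as band loudness over band amplitude is an assumption;
    it yields H = l / fv: loud bands are attenuated, quiet ones boosted."""
    l = target_loudness(fv)
    cv = fv / (np.abs(x_bands) + eps)        # assumed control variable
    return l / (cv * np.abs(x_bands) + eps)  # desired system function
```

With two bands at loudness 2 and 4, the target is 3, so the quieter band is boosted and the louder one attenuated.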
In [14] and [19], several cross-adaptive signal processing methods for automatic mixing such as source enhancer, panner, fader, equalizer, and polarity and time offset correction are proposed. These modules can be combined into a full mixing application. In [4], the authors propose a knowledge-engineered autonomous mixing system and propose to include expert knowledge within an automatic mixing system. The included audio effects are automatically controlled based on extracted low-level and high-level features such as musical genre, instrumentation, and the type of sound sources. The authors evaluated the system using short four-bar audio signals with vocals, bass, guitar, keyboard, and other instruments.

Harmonic-percussive source separation was used as a pre-processing step for manual remixing in [6], in particular to adjust the sound source levels of the signals. To the authors' best knowledge, a framework for automatic remixing that suits the requirements discussed in section 2 has not been proposed so far.

4. PROPOSED METHOD

For our mixing framework, we propose the use of sound source separation techniques as a pre-processing step of the mixing process. For this purpose, we first isolate the solo instrument from the audio mix by applying an algorithm for pitch-informed solo and accompaniment separation [2]. The two separated signals, i.e., the solo and the residual/backing signal, are the starting point for the automatic remixing process. Additionally, based on the requirements discussed in section 2, our proposed framework comprises six subcomponents:

1. Harmonic-percussive separation (HPS)
2. Cross-adaptive multi-track scaling (CAMTS)
3. Cross-adaptive equalization (CAEQ)
4. Cross-adaptive dynamic spectral panning (CADSP)
5. Automatic excitation (AE)
6. Time-frequency selective panning (TFSP)

Figure 1 presents a block diagram of the proposed framework. There are three main signal pathways A, B, and C. If the CADSP is activated, pathway A is chosen. If CADSP is not activated, pathway B or C is chosen depending on whether the harmonic-percussive separation (HPS) is used. All signal paths output a stereo mix. In the following sections, the individual subcomponents are first described, followed by a description of the three proposed signal pathways.

[Figure 1: Signal flow-chart of the developed automatic remixing framework. Three pathways A, B, and C are selected by the switches "CADSP? (true or false)", "HPS? (true or false)", and "Stereo? (true or false)"; the stages Downmix, HPS, CAEQ, CAMTS, Filters, CADSP, AE, TFSP, Stereo Split, Mixdown, and Normalize are applied to the separated Solo and Backing signals and summed to the final Stereo Remix.]

4.1 Solo and Backing track Separation

The algorithm as proposed in [2] automatically extracts pitch sequences of the solo instrument and uses them as prior information in the separation scheme. In order to obtain more accurate spectral estimates of the solo instrument, the algorithm creates tone objects from the pitch sequences and performs separation on a tone-by-tone basis. Tone segmentation allows more accurate modeling of the temporal evolution of the spectral parameters of the solo instrument. The algorithm performs an iterative search in the magnitude spectrogram in order to find the exact frequency locations of the different partials of the tone. A smoothness constraint is enforced on the temporal envelopes of each partial. In order to reduce interference from other sources caused by overlapping of spectral components in the time-frequency representation, a common amplitude modulation is required for the temporal envelopes of the partials. Additionally, a post-processing stage based on median filtering is used to reduce the interference from percussive instruments in the solo estimation.

4.2 Harmonic-percussive Separation (HPS)

We use the algorithm for harmonic-percussive separation proposed in [6]. The algorithm is based on median filtering of the magnitude spectrogram to split the original audio signal into its horizontal (harmonic sources) and vertical (percussive sources) elements. In an automatic mixing context, these components can be understood as separate subgroups which can be processed individually and finally remixed.

Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015
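The median-filtering split described in section 4.2 can be sketched as follows. This is a rough illustration of the idea in [6] on a magnitude spectrogram, not the exact implementation: the kernel lengths and the hard binary masks are assumptions.

```python
import numpy as np
from scipy.signal import medfilt2d

def harmonic_percussive_masks(S, h_len=17, p_len=17):
    """Split a magnitude spectrogram S (freq x time) by median filtering:
    smoothing along time enhances horizontal (harmonic) structures,
    smoothing along frequency enhances vertical (percussive) ones.
    Kernel lengths and the binary masking are assumptions."""
    H = medfilt2d(S, kernel_size=(1, h_len))  # median along time
    P = medfilt2d(S, kernel_size=(p_len, 1))  # median along frequency
    harmonic_mask = (H >= P).astype(float)
    return harmonic_mask, 1.0 - harmonic_mask
```

A sustained tone (a horizontal line in the spectrogram) ends up in the harmonic mask, a transient (a vertical line) in the percussive mask.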
4.3 Cross-adaptive Multi-track Scaling (CAMTS)
The method proposed in [19], which is commonly referred to as automatic fader control, is used for automatic scaling
of the sound sources. The algorithm is used to automatically modify the amplification of separate sound sources.
A psychoacoustic model based on the EBU R-128 standard [9] is used to compute the loudness of each track using a histogram-based approach. All tracks are individually amplified to be perceived as equally loud.
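The equal-loudness scaling of section 4.3 can be sketched as follows. Plain RMS stands in for the histogram-based EBU R-128 loudness model, so this is only an illustration of the idea, not the model used in the paper.

```python
import numpy as np

def equal_loudness_gains(tracks, eps=1e-12):
    """Gain per track so that all tracks reach the average loudness.
    Plain RMS replaces the histogram-based EBU R-128 loudness model
    used in the paper, so this is only an illustration."""
    rms = np.array([np.sqrt(np.mean(t ** 2)) for t in tracks])
    return rms.mean() / (rms + eps)
```

A track twice as loud as another is attenuated while the quieter one is boosted, so both meet at the average level.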
4.4 Cross-adaptive Equalizer (CAEQ)
We use the cross-adaptive equalizing algorithm proposed
in [19] to obtain a spectrally balanced mixture. The main
approach is to modify the spectral envelopes of the audio signals and to minimize the spectral masking between
the solo signal and the backing track. The algorithm is
a multi-band extension of the CAMTS algorithm discussed in section 4.3. The spectral characteristics of the
separated signals are modified by enhancing or attenuating pre-defined frequency bands depending on the signal’s
perceived loudness with respect to the overall loudness. In
contrast to the CAMTS algorithm, the loudness model proposed in [19] is used since it outperformed the loudness
model based on EBU R-128 during informal testing. In
particular, the mix results based on EBU R-128 showed too
strong of an emphasis on treble frequencies while lacking
energy in the lower frequency range. We use a 10-band octave equalizer with second-order biquad IIR filters following [19] and frequency bands uniformly distributed over
the audible frequency range. Standard frequency values
based on [8] are used to adjust the center frequencies of the
peak filter as well as the cutoff frequencies of the shelving
filters.
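A sketch of such a filter bank, using SciPy's `iirpeak` to design one biquad per octave band. The paper uses shelving filters at the band edges and the filter design of [19], so treat the uniform peaking sections and the Q value here as assumptions.

```python
from scipy.signal import iirpeak

# Nominal octave-band centre frequencies in Hz (standard values);
# the paper additionally uses shelving filters at the band edges.
CENTERS = [31.5, 63, 125, 250, 500, 1000, 2000, 4000, 8000, 16000]

def octave_eq_bank(fs=44100.0, Q=1.414):
    """One second-order (biquad) peaking section per octave band;
    the Q value and the all-peaking design are assumptions."""
    return [iirpeak(f0, Q, fs=fs) for f0 in CENTERS]
```

Each entry is a (b, a) pair of three coefficients, i.e., a biquad.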
4.5 Cross-adaptive Dynamic Spectral Panning
(CADSP)
Dynamic spectral panning is a technique that allows the
creation of a stereophonic impression in a given monophonic multi-track recording. We use the algorithm proposed in [15] to create a spatialization effect given multi-track signals.
The method dynamically assigns time-frequency bins of the original tracks to azimuth
positions. The assignment reduces masking due to shared
azimuth positions between multiple sound sources. This
improves the overall transparency of an audio mix. In the
cases where the original audio mix is a stereo track, it is
first down-mixed to mono and then up-mixed to a new
stereo image using the CADSP algorithm.
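The bin-wise spatialization idea can be sketched as follows. The actual method of [15] assigns azimuths dynamically to reduce masking; this illustration simply pans each time-frequency bin to the azimuth of its dominant track, so the dominance rule and the constant-power panning law are assumptions.

```python
import numpy as np

def pan_bins(spectrograms, azimuths):
    """Render mono-track STFT magnitudes to stereo: each time-frequency
    bin is assigned the azimuth of its dominant track and panned with
    constant-power gains. Azimuths lie in [-1, 1] (full left..full right).
    The dominance rule and panning law are assumptions."""
    S = np.stack(spectrograms)               # (tracks, freq, time)
    dominant = np.argmax(np.abs(S), axis=0)  # dominant track per bin
    az = np.asarray(azimuths)[dominant]      # azimuth per bin
    theta = (az + 1.0) * np.pi / 4.0         # map [-1, 1] -> [0, pi/2]
    mix = S.sum(axis=0)
    return np.cos(theta) * mix, np.sin(theta) * mix  # (left, right)
```

Bins dominated by a hard-left source (azimuth -1) end up entirely in the left channel.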
4.6 Automatic Exciter (AE)
The excitation algorithm improves the assertiveness of the
backing track. The digital signal processing methods are
implemented following the APHEX Aural Exciter described in [18]. The audibility of the mixed signal is enhanced
by adding harmonic distortions in the upper frequency range.
These distortions create additional harmonic signal components which improve the presence, clarity, and brightness of the audio signal.
The automation of the excitation step is implemented following a target mixing approach. Based on [5], the mixing parameters are iteratively adjusted to a target energy
ratio. The target energy ratio is computed from the relationship between the energy of the high-pass filtered signal
and the energy of the target signal. In the side chain, an asymmetric soft-clipping characteristic (a harmonic generator block) with an adaptive threshold is used. This allows level-independent distortion as well as the preservation of the signal dynamics [5].
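The excitation chain can be sketched as follows. This is not the APHEX design of [18] or the target-ratio automation of [5], but a rough illustration of the principle: the filter order, the thresholds, and the tanh clipper are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def excite(x, fs, cutoff=4000.0, pos=0.8, neg=0.5, amount=0.2):
    """Exciter sketch: high-pass side chain, asymmetric tanh soft
    clipper as harmonic generator, distorted signal mixed back in.
    Filter order, thresholds, and mix amount are assumptions."""
    b, a = butter(2, cutoff / (fs / 2.0), btype="high")
    side = lfilter(b, a, x)
    clipped = np.where(side >= 0.0,
                       pos * np.tanh(side / pos),   # positive threshold
                       neg * np.tanh(side / neg))   # harder negative clip
    return x + amount * clipped
```

With `amount=0` the signal passes through unchanged; increasing it blends in the generated upper harmonics.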
4.7 Time-frequency selective Panning (TFSP)
Time-frequency selective panning improves the stereo image as well as the overall spatial impression of an audio
mix. In our framework, the method for time-frequency selective panning presented in [3] was used. The azimuth positions of the sound sources are modified using a non-linear
warping function. The stereo image is widened while the
initial arrangement of the sound sources, as well as the
sound quality of the original source is maintained. Within
the proposed automatic remixing framework, the TFSP algorithm can be interpreted as an extension of the CADPS
algorithm. The panning algorithm is only applied to the
residual signal (see section 4.8.1). We set the aperture parameter ρ to a fixed value based on initial informal testing.
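A sketch of such a warping: the actual warping function of [3] and the role of the aperture parameter differ in detail, so the square-root law below is only an assumed illustration of widening while preserving the source ordering.

```python
import numpy as np

def widen_azimuth(az, rho=0.5):
    """Non-linear warping of azimuths in [-1, 1]: positions are pushed
    outwards while their ordering is preserved. rho in [0, 1] plays the
    role of the aperture parameter; the square-root law is an assumption."""
    az = np.asarray(az, dtype=float)
    return (1.0 - rho) * az + rho * np.sign(az) * np.sqrt(np.abs(az))
```

A source at azimuth 0.25 moves outwards (to 0.5 for rho = 1), while the centre stays fixed and the extremes stay at ±1.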
4.8 Processing Pathways
4.8.1 Signal path A (Main Path)
The main processing path includes all system components.
Stereo files must be down-mixed to mono first due to constraints of the cross-adaptive dynamic spectral panning
(CADSP) algorithm as detailed in section 4.5. All sound
sources, which are initially distributed in the stereo panorama, are first centered to the mono channel and later redistributed over the stereo panorama again based on the
harmonic-percussive sound separation [6]. This up-mixing step, which can involve a modification of the stereo arrangement, is only possible in this signal path.
The cross-adaptive equalization (CAEQ) and multi-track
scaling (CAMTS) are the first processing steps in all three
pathways. After applying the dynamic spectral panning
(CADSP) to the percussive and harmonic signal components, all stereophonic signals are summed up to a backing
track with a more homogeneous distribution of the sound
sources. The backing track can now be processed with the
automatic excitation (AE) and the time-frequency selective
panning (TFSP) algorithms. The solo signal is split into
stereo channels in the Stereo Split stage and scaled such
that the overall gain remains constant. In the final mixdown step, the backing track is mixed with the solo track
by adjusting the individual amplification factors as given
by the CAMTS stage. If the cross-adaptive equalization
(CAEQ) was performed, the spectral envelope of the backing track is perceivably modified due to the minimization
of the spectral masking. The stereo sum signal is finally
normalized.
4.8.2 Signal path B

Signal path B resembles signal path A; however, the equalization (CAEQ) and scaling (CAMTS) steps offer more ways to modify parameters due to the prior harmonic-percussive separation stage.

4.8.3 Signal path C

In signal path C, no harmonic-percussive separation is performed. The equalization (CAEQ) and scaling (CAMTS) are applied to both the backing and the solo track. However, the automatic excitation is only applied to the backing track since we particularly want to enhance the presence, clarity, and brightness of the backing track. As shown in Figure 1, the time-frequency selective panning (TFSP) can only be applied to the backing track if it is a stereo signal. For monaural signals, the signal is split to the stereo channels (Stereo Split) and scaled such that the overall gain remains constant. Similar to signal path B, the signals are finally mixed down and normalized.
5. EVALUATION
5.1 Experimental Design
To evaluate the proposed framework, a listening test procedure was conducted following the guidelines of the Multi
Stimulus Test with Hidden Reference and Anchor
(MUSHRA) described in the ITU-R BS.1534-2 recommendation [11], and modifying them to fit the characteristics of
this study. The main difference of our test with respect to
the original MUSHRA is that a reference signal, which in
our case would be an ideal mix of the original recording,
is not available. Moreover, the notion of an ideal mix is
ill-posed in the automatic remixing context.
The listening test was conducted in a quiet room and
all signals were played using open headphones (AKG K
701). A total of 19 participants conducted the listening
test. The participants included audio signal processing experts, professional audio engineers, music students (jazz,
classical music), musicologists, as well as amateur musicians and regular music consumers. The average age of
the participants was 30.7 years. Further demographic
information such as gender, hearing impairments, listening test experience, and educational background were also
collected. A summary of the demographic information is
presented in table 1.
The listening test was divided into five evaluation tasks,
each focusing on a different subjective quality parameter.
The following parameters were selected based on the ITU-R BS.1248-1 recommendation [10] and were adapted to our requirements: (QP1) Sound Balance, (QP2) Transparency, (QP3) Stereo/Spatial Impression, (QP4) Timbre, and
(QP5) Overall Impression. In each evaluation task, a training phase was first conducted to allow the participants to
familiarize themselves with the test material and to adjust
the playback level to a comfortable value.
Gender                             M: 16    F: 3
Hearing impairment?                Yes: 0   No: 19
Listening test experience?         Yes: 9   No: 10
Educational background in music?   Yes: 11  No: 8
Expert in audio engineering?       Yes: 15  No: 4

Table 1: Demographics of the listening test participants
Following the training phase, an evaluation phase was
conducted for each task. Five audio tracks as described in
Table 2 with ten mixtures each were rated by the participants. The five tracks used in this study are part of the
Jazzomat Database 1 . Among the presented mixtures, the
original signal, eight mixes created with different configurations of the proposed framework, and an anchor signal
(rhythm section reduced by 6 dB, the sum signal low-pass
filtered at 3.5 kHz) were used. Table 3 gives an overview
of all the remix configurations.
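The anchor construction described above can be sketched as follows. Only the 6 dB attenuation and the 3.5 kHz cutoff come from the text; the 4th-order Butterworth filter and the summing details are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def make_anchor(solo, backing, fs, atten_db=6.0, cutoff=3500.0):
    """Anchor stimulus: rhythm section (backing) attenuated by
    atten_db, summed with the solo, and the sum low-pass filtered
    at cutoff Hz. The 4th-order Butterworth filter is an assumption."""
    g = 10.0 ** (-atten_db / 20.0)
    b, a = butter(4, cutoff / (fs / 2.0), btype="low")
    return lfilter(b, a, solo + g * backing)
```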
Title               Soloist (Instrument)   Style     Year
Body and Soul       Chu Berry (ts)         Swing     1938
Tenor Madness       Sonny Rollins (ts)     Hardbop   1956
Crazy Rhythm        J.J. Johnson (tb)      Bebop     1957
Bye Bye Blackbird   Ben Webster (ts)       Swing     1959
Adam's Apple        Wayne Shorter (ts)     Postbop   1966

Table 2: Dataset description
Mix        HPS   CAEQ   CAMTS   CADSP   AE   TFSP
1          off   on     off     off     on   off
2          off   off    on      off     on   off
3          off   on     on      off     on   off
4          on    on     off     off     on   off
5          on    off    on      off     on   off
6          on    off    off     on      on   on
7          on    on     on      on      on   on
8 (mono)   on    on     on      off     on   off

Table 3: Configurations of the eight remixes used in the listening test
The automatic excitation (AE) component is active in all the mixes. The panning (TFSP) algorithm is only activated in conjunction with the cross-adaptive dynamic spectral panning (CADSP). This way, a further stereo expansion of critical stereo recordings with an unbalanced stereo panorama is avoided. Mixture 8 was added to investigate the influence of the stereo effects (CADSP and TFSP) on the input signals of pathway B, which are mixed down to mono in the pre-processing step.
1 A description of the Jazzomat Database is available at: http://jazzomat.hfm-weimar.de/dbformat/dbcontent.html
5.2 Results

5.2.1 General

[Figure 2: Listening test results for the five evaluated parameters. Ratings for Sound Balance, Transparency, Stereo/Spatial Impression, Timbre, and Overall Impression are shown per soloist (Chu Berry, Sonny Rollins, J.J. Johnson, Ben Webster, Wayne Shorter) and over all tracks, on a scale from Bad to Excellent, for the Original, the Anchor, and Mixes 1-8.]
Figure 2 shows the results of the listening test for the five
acoustic quality parameters. The figure legend summarizes
all the system configurations that were evaluated. It is evident from the plot that the anchor stimulus was always
correctly identified. Results also suggest that the use of
harmonic-percussive separation does not bring perceptual
quality gains for HPS+CAEQ (mix 4) when compared to
the CAEQ (mix 3). Unexpectedly, results even got worse for the parameters timbre and overall impression. Similarly, the combined settings in HPS+CAMTS (mix 5) do
not show an improvement in the ratings when compared to
CAMTS (mix 2).
To facilitate the analysis of results, Table 4 lists the percentage improvement obtained for each of the five quality parameters (QP), subject to the presence or absence of the
individual framework components compared to the original signal. Mixes 4 and 5, which include the harmonicpercussive separation, were not listed due to the reasons
previously described. The five mixtures listed in the table
are further analyzed in the following sections.
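The percentage improvement reported in Table 4 can be computed from mean ratings as sketched below; the paper does not specify the exact normalization, so this formula is an assumption.

```python
import numpy as np

def percent_improvement(mix_ratings, original_ratings):
    """Relative improvement of a remix over the original recording,
    computed from mean listener ratings of one quality parameter.
    The normalization by the original rating is an assumption."""
    m, o = np.mean(mix_ratings), np.mean(original_ratings)
    return 100.0 * (m - o) / o
```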
                         QP1    QP2    QP3    QP4    QP5
Mix 1 (CAEQ)             18 %   17 %    4 %    -      -
Mix 2 (CAMTS)            15 %   19 %   10 %   16 %    9 %
Mix 3 (CAEQ+CAMTS)       10 %   12 %    6 %    3 %    4 %
Mix 6 (HPS+CADSP+TFSP)    9 %   16 %   18 %    -      8 %
Mix 7 (All components)   29 %   24 %   43 %    -      6 %

Table 4: Percentage improvement of the remixed signal compared to the original audio recording, subject to the presence (or absence) of the individual framework components, shown for each of the five perceptual quality parameters ("-": no improvement observed).
5.2.2 Mix 1 (CAEQ)

Mix 1 does not include a prior separation of the residual component and outperforms the original mix for most of the quality parameters. The highest improvements were 18 % for sound balance and 17 % for transparency. However, no improvement was observed for timbre and overall impression.
5.2.3 Mix 2 (CAMTS)
Despite the absence of the harmonic-percussive separation
step, mix 2 showed improvements for transparency (19 %),
sound balance (15%), and overall impression (9 %). The
reason for the improvement in timbre by 16% is not entirely clear in this case; however, a possible explanation
is that the increased loudness of the rhythm section led to
more balanced dynamic levels and a clearer perception of
the instrument and overall timbres.
5.2.4 Mix 3 (CAEQ+CAMTS)
The combination of the CAEQ and CAMTS components
showed inferior results compared to the exclusive application of either component. However, the ratings are still
slightly higher than the ratings of the original audio file.
5.2.5 Mix 6 and Mix 7
Both mixtures 6 and 7 outperformed the original audio file.
The highest ratings were achieved with mixture 7, which was created with the full processing chain. In particular,
the improvements compared to the original audio file were
29 % for sound balance, 24 % for transparency, as well as
43 % for stereo and spatial impression. The small improvements with respect to the overall impression are likely due
to the individual aesthetic preferences of the listening test
participants.
Additionally, to analyze the influence of the stereo effects on the input signals of pathway B (which are initially downmixed to mono), Table 5 presents the percentage improvement obtained with mix 7 (all components active) in comparison to mix 8 (mono).
QP 1   QP 2   QP 3   QP 4   QP 5
39 %   16 %   33 %   12 %   18 %

Table 5: Percentage improvement of the five quality parameters for the additional usage of the stereo effects (CADSP+TFSP) in mix 7 compared with the non-processed monophonic input signal in the same framework setting of mix 8 (HPS, CAEQ, CAMTS, AE).
As can be observed in the table, the use of the CADSP
and TFSP modules improved the ratings for all five quality
parameters. The improvement was statistically significant
for sound balance (39 %) and stereo/spatial impression (33 %).
6. CONCLUSIONS
In this paper, we proposed a prototype implementation of
an automatic remixing framework for tonal optimization of
early jazz recordings. The main focus was on improving
the balance between the solo instrument and the rhythm
section. The framework consists of six components which
include different processing steps to modify the loudness,
frequency response, timbre, and stereophonic perception
of the separated sound sources. We compared different
configurations of the framework and evaluated the improvement of the transparency of the backing track as well as
the acoustic balance, stereophonic homogeneity, and overall quality perception. The evaluation was performed with
a MUSHRA-like listening test based on the ratings given
by 19 participants.
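The overall structure of such a chain can be sketched as follows. The stage names follow the paper's components, but the processing inside each stage is a deliberately trivial stand-in (the actual algorithms are cited in the text); `hps`, `camts`, and `mixdown` are hypothetical function names introduced only for this sketch.

```python
import numpy as np

def hps(tracks):
    """Placeholder HPS stage: splits the backing track into two streams.
    (A real implementation would separate harmonic/percussive content.)"""
    backing = tracks.pop("backing")
    tracks["harmonic"] = 0.5 * backing
    tracks["percussive"] = 0.5 * backing
    return tracks

def camts(tracks):
    """Toy cross-adaptive multi-track scaling: each track's gain depends
    on the loudness of ALL tracks; here every track's RMS is scaled to
    the mean RMS across tracks."""
    rms = {k: np.sqrt(np.mean(v ** 2)) + 1e-12 for k, v in tracks.items()}
    target = np.mean(list(rms.values()))
    return {k: v * (target / rms[k]) for k, v in tracks.items()}

def mixdown(tracks):
    """Final summation of all processed tracks."""
    return sum(tracks.values())

# One second of toy audio: a sine "solo" and a noise "backing" track.
rng = np.random.default_rng(1)
tracks = {"solo": np.sin(2 * np.pi * 440 * np.arange(44100) / 44100),
          "backing": 0.05 * rng.standard_normal(44100)}
processed = camts(hps(tracks))
out = mixdown(processed)
```

The cross-adaptive character is the essential point: the gain applied to one track is a function of the features of all tracks, which is what distinguishes CAMTS-style scaling from independent per-track normalization.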
The usage of automatic equalization (CAEQ) and multitrack scaling (CAMTS) showed clear improvements in the
quality parameter ratings, whereas the combination of both
led to smaller improvements than the independent application of each approach. The improvement based on
harmonic-percussive separation (HPS) within the automatic
mixing framework is not easy to assess. The usage of HPS
in conjunction with CAEQ and CAMTS did not improve
the ratings. On the other hand, HPS is a basic requirement
for the application of CADSP on the backing track of mix
7, and therefore contributes to its consistently high ratings.
HPS is irrelevant for the automatic excitation (AE) step,
since it is applied to the full residual track.
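Since HPS plays this enabling role for CADSP, a brief sketch may be useful. The median-filtering approach of FitzGerald [6] smooths the magnitude spectrogram along time to estimate harmonic content and along frequency to estimate percussive content, then derives soft masks. The following minimal NumPy version is an illustrative reimplementation under assumed parameters (the window sizes and the Wiener-style masking are common choices, not values taken from the paper):

```python
import numpy as np

def median_filter_1d(x, k, axis):
    """Median-filter x along one axis with an odd window k, edge-padded
    so the output has the same shape as the input."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (k // 2, k // 2)
    xp = np.pad(x, pad, mode="edge")
    # Requires NumPy >= 1.20 for sliding_window_view.
    win = np.lib.stride_tricks.sliding_window_view(xp, k, axis=axis)
    return np.median(win, axis=-1)

def hps_masks(S, kh=17, kp=17):
    """Harmonic/percussive soft masks from a magnitude spectrogram S
    (frequency x time), following the median-filtering idea of [6]:
    smoothing across time emphasizes sustained (harmonic) ridges,
    smoothing across frequency emphasizes transient (percussive) ones."""
    H = median_filter_1d(S, kh, axis=1)   # median along time
    P = median_filter_1d(S, kp, axis=0)   # median along frequency
    eps = 1e-12
    Mh = H ** 2 / (H ** 2 + P ** 2 + eps)  # Wiener-style soft masks
    Mp = P ** 2 / (H ** 2 + P ** 2 + eps)
    return Mh, Mp

# Demo: a sustained tone (horizontal ridge) plus a click (vertical ridge).
S = np.full((64, 100), 0.01)
S[10, :] += 1.0   # tonal component at one frequency bin
S[:, 50] += 1.0   # transient component at one time frame
Mh, Mp = hps_masks(S)
```

Multiplying the masks with the complex spectrogram and inverting the STFT would yield the two streams that CADSP then pans independently.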
Particularly with mix 7 (all components), the initially
targeted improvements in sound balance, stereo and spatial
impression, and transparency with respect to the original
audio recording were achieved.
In future work, the most relevant processing modules
must be further investigated and improved with respect to
the aforementioned quality parameters. Modules that
showed no or only minor improvement must be replaced,
and alternative algorithms must be evaluated for the given
tasks. Promising candidates are a mastering equalizer [7] and dynamic range compression [13]. The additional
use of semantic information about music genre and instrumentation, as discussed in
Section 3, seems to be another fruitful approach.
Finally, the integration of audio restoration methods such
as denoising will likely help to remove unwanted background noise and spurious signals from the main signal to
be processed.
7. REFERENCES
[1] Daniele Barchiesi and Joshua D. Reiss. Automatic target mixing using least-square optimization of gains and
equalization settings. In Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), Como, Italy, 2009.
[2] Estefanía Cano, Gerald Schuller, and Christian
Dittmar. Pitch-informed solo and accompaniment separation: towards its use in music education applications. EURASIP Journal on Advances in Signal Processing, 23:1–19, 2014.
[3] Maximo Cobos and Jose J. Lopez. Interactive enhancement of stereo recordings using time-frequency selective panning. In Proceedings of the 40th AES International Conference on Spatial Audio, Tokyo, Japan,
2010.
[4] Brecht De Man and Joshua D. Reiss. A semantic approach to autonomous mixing. Journal of the Art of
Record Production, 8, 2013.
[5] Brecht De Man and Joshua D. Reiss. Adaptive control
of amplitude distortion effects. In Proceedings of the
53rd AES International Conference on Semantic Audio, London, UK, 2014.
[6] Derry FitzGerald. Harmonic/percussive separation using median filtering. In Proceedings of the 13th International Conference on Digital Audio Effects (DAFx),
Graz, Austria, 2010.
[7] Marcel Hilsamer and Stefan Herzog. A statistical approach to automated offline dynamic processing in
the audio mastering process. In Proceedings of the
17th International Conference on Digital Audio Effects
(DAFx-14), Erlangen, Germany, 2014.
[8] ISO International Organization for Standardization.
Acoustics - Preferred frequencies, August 1997.
[9] ITU Radiocommunication Bureau. Algorithms to measure audio programme loudness and true-peak audio
level (Rec. ITU-R BS.1770-3), August 2012.
[10] ITU Radiocommunication Bureau. General methods
for the subjective assessment of sound quality (Rec. ITU-R BS.1284-1), December 2003.
[11] ITU Radiocommunication Bureau. Method for the subjective assessment of intermediate quality level of audio systems (Rec. ITU-R BS.1534-2), June 2014.
[12] Zheng Ma, Joshua D. Reiss, and Dawn A. A. Black.
Implementation of an intelligent equalization tool using Yule-Walker for music mixing and mastering. In
Proceedings of the 134th AES Convention, Rome,
Italy, 2013.
[13] Stylianos-Ioannis Mimilakis, Konstantinos Drossos,
Andreas Floros, and Dionysios Katerelos. Automated
tonal balance enhancement for audio mastering applications. In Proceedings of the 134th AES Convention,
Rome, Italy, 2013.
[14] Enrique Perez Gonzalez. Advanced Automatic Mixing
Tools for Music. PhD thesis, Queen Mary University of
London, London, UK, September 2010.
[15] Pedro D. Pestana and Joshua D. Reiss. A cross-adaptive dynamic spectral panning technique. In Proceedings of the 17th International Conference on Digital Audio Effects (DAFx-14), Erlangen, Germany,
2014.
[16] Jeffrey Scott and Youngmoo E. Kim. Analysis of acoustic features for automated multi-track mixing. In
Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami,
USA, 2011.
[17] Jeffrey Scott, Matthew Prockup, Erik M. Schmidt, and
Youngmoo E. Kim. Automatic multi-track mixing using linear dynamical systems. In Proceedings of the
8th Sound and Music Computing Conference (SMC),
Padova, Italy, 2011.
[18] Priyanka Shekar and Julius O. Smith III. Modeling
the harmonic exciter. In Proceedings of the 135th AES
Convention, New York, USA, 2013.
[19] U. Zölzer. DAFX: Digital Audio Effects. John Wiley &
Sons Ltd., second edition, 2011.